Protein Group readers

[1]:
%reload_ext autoreload
%autoreload 2
[2]:
# Helper packages
import io
from copy import copy
from typing import Literal, Optional

import anndata as ad
import numpy as np
import pandas as pd

# alphabase
from alphabase.pg_reader import pg_reader_provider
from alphabase.tools.data_downloader import DataShareDownloader

Background

The alphabase.pg_reader module provides a unifying interface to read protein group (PG) tables from different search engines and file formats. It is designed to be easy to use, and to provide a consistent output format in the form of pandas.DataFrames, regardless of the input file format.

Introduction to protein group matrices

Protein group matrices are the primary output for protein-level quantification in proteomics workflows. After search engines identify peptide spectrum matches (PSMs, see PSM-reader tutorial), they aggregate peptide-level evidence to infer protein-level abundances. These protein group tables represent a structured matrix that maps protein groups (features) to samples (observations), with estimated intensity values as entries.

A minimal protein group table could look something like this:

proteins    sample_1    sample_2    sample_3
P12345      1000.5      892.3       1150.7
Q67890      2500.1      2780.9      2340.2

💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into protein groups as features.

In this example, protein P12345 has quantified intensities of 1000.5, 892.3, and 1150.7 in samples 1, 2, and 3 respectively.
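As a minimal sketch, the same toy table can be written as a pandas.DataFrame with protein groups as the index and samples as columns. The variable names are illustrative and not part of alphabase.

```python
import pandas as pd

# Toy values copied from the minimal example table
pg_table = pd.DataFrame(
    {
        "sample_1": [1000.5, 2500.1],
        "sample_2": [892.3, 2780.9],
        "sample_3": [1150.7, 2340.2],
    },
    index=pd.Index(["P12345", "Q67890"], name="proteins"),
)

# Intensities of protein group P12345 across all samples
p12345 = pg_table.loc["P12345"]
print(p12345.to_dict())
```

Selecting a row by its protein identifier returns the intensities for that protein group in every sample.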

Search engine outputs

In reality, protein group tables are significantly more complex than this, as they contain additional feature-level information about the proteins (e.g., gene names, descriptions, alternative quantification methods), and the quantification (e.g., different intensity types like raw, LFQ quantification, iBAQ). This additional information can be valuable for downstream analyses, but also makes protein group tables a lot more difficult to work with, as the exact names and formats may differ between search engines, versions, and file formats.

Unifying properties

alphabase aligns the column names to a unified vocabulary, facilitating cross-engine comparisons. We can categorize protein group tables into several common types:

Type 1 — Minimal: A basic features × samples matrix. Only intensity values are stored, with sample names as columns and protein groups as the index. Example: AlphaDIA.

Type 2 — Multiple Intensity Fields: A wide matrix where each sample may appear multiple times with different quantification types (e.g., SampleA_LFQ, SampleB_raw). Example: AlphaPept.

Type 3 — Feature Metadata: A features × samples matrix with one intensity value per sample, plus additional feature-level metadata columns (e.g., gene names, descriptions). Example: DIA-NN.

Type 4 — Combined: A composite structure including both multiple intensity fields (Type 2) and feature-level metadata (Type 3). Examples: Spectronaut, MZTab, MaxQuant.
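To make the table types more concrete, the following sketch builds a hypothetical Type 4 table with plain pandas and reduces it to the features × samples layout described above. All column names and values are made up for illustration, and this is not how alphabase implements its readers.

```python
import pandas as pd

# Hypothetical "Type 4" table: metadata columns plus two intensity types per sample
combined = pd.DataFrame(
    {
        "protein_id": ["P12345", "Q67890"],
        "gene": ["GENE1", "GENE2"],
        "SampleA_raw": [1000.5, 2500.1],
        "SampleA_LFQ": [980.0, 2610.4],
        "SampleB_raw": [892.3, 2780.9],
        "SampleB_LFQ": [901.7, 2755.0],
    }
)

# Move the feature-level metadata into the index ...
features = combined.set_index(["protein_id", "gene"])

# ... and keep a single intensity type, selected by a regular expression
lfq = features.filter(regex=r"_LFQ$")
print(lfq)
```

The result holds only measurements in its values, while the feature metadata lives in the index.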

Code | Read and parse protein group tables

The alphabase pg_reader module lets users parse protein group reports from the most common search engines into a dataframe with a single line of code via its alphabase.pg_reader.pg_reader_provider factory.

All readers return a standardized pandas DataFrame with:

  • Features as index: Protein identifiers and metadata in the pandas.DataFrame.index

  • Samples as columns: Sample/run identifiers as column index

  • Intensity values: Protein quantification data as pandas.DataFrame.values

The readers select the desired quantification type by matching regular expression patterns against the column names of the output table, and they map the desired metadata columns to standardized names.

The unified alphabase format enables seamless comparison and analysis across different search engines, facilitating:

  • Method comparison studies

  • Data integration workflows

  • Standardized downstream analysis pipelines

Available readers

alphabase.pg_reader.pg_reader_provider has registered reader classes for the most common proteomics search engines. A list of implemented readers can be accessed via its reader_dict property:

[3]:
all_registered_readers = pg_reader_provider.reader_dict.keys()

# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)
Registered readers in alphabase:
        - alphadia
        - alphapept
        - diann
        - fragpipe
        - maxquant
        - mztab
        - spectronaut

Interact with the reader provider

[ ]:
def get_pg_matrix_example(output_dir: Optional[str] = None, search_engine: Literal["alphadia", "alphapept", "spectronaut"] = "alphadia") -> str:
    """Get example data for the tutorial.

    The function downloads example data and stores it
    in `output_dir` or, if none is given, in the system's temporary directory.

    Parameters
    ----------
    output_dir
        Output directory. If `None`, the system's temporary directory is used
    search_engine
        Search engine for which an example protein group table is downloaded

    Returns
    -------
    File location
    """
    EXAMPLE_URLS = {
        "alphadia": "https://datashare.biochem.mpg.de/s/4AtCZassaUzRR8K",
        "alphapept": "https://datashare.biochem.mpg.de/s/6G6KHJqwcRPQiOO",
        "spectronaut": "https://datashare.biochem.mpg.de/s/2u7U03wvmQDVT4y",
    }

    if search_engine not in EXAMPLE_URLS:
        raise KeyError(f"{search_engine} not found, select one of {', '.join(EXAMPLE_URLS.keys())}")

    if output_dir is None:
        # `tempfile.tempdir` can be None until `gettempdir()` has been called
        from tempfile import gettempdir

        output_dir = gettempdir()

    downloader = DataShareDownloader(url=EXAMPLE_URLS[search_engine], output_dir=output_dir)

    return downloader.download()

Example 1 - AlphaDIA

We demonstrate how to interact with protein group tables via alphabase based on a minimal example output of the AlphaDIA search engine.

First, let’s get some minimal example data for the AlphaDIA output. The example data represents a DIA run of 6 HeLa samples on the Orbitrap Astral.

You can see that the output data contains the feature names in the column pg and the computed protein group intensities per sample in the remaining columns.

[5]:
alphadia_example_path = get_pg_matrix_example(search_engine="alphadia")

# Parse with pandas for visualization purposes
pd.read_csv(alphadia_example_path, sep="\t")
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia1.10.4__pg_matrix.tsv already exists (0.8597145080566406 MB)
[5]:
pg 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01
0 A0A024RBG1 5.597816e+05 6.285112e+05 0.000000e+00 3.153867e+05 2.753702e+05 4.505648e+05
1 A0A024RBG1;Q9NZJ9 1.331061e+06 1.400360e+06 1.551987e+06 1.606095e+06 1.464152e+06 1.397026e+06
2 A0A075B759;A0A075B767;P62937 2.024742e+08 8.552202e+06 1.837425e+08 1.674874e+08 1.768245e+08 1.595220e+08
3 A0A096LP01 6.355092e+05 4.589410e+05 4.184495e+05 4.032932e+05 2.317467e+05 2.731363e+05
4 A0A096LP49 1.777069e+05 1.387537e+05 2.513601e+05 1.296699e+05 1.276095e+05 1.623200e+05
... ... ... ... ... ... ... ...
9359 Q9Y6X3 3.898963e+05 4.353048e+05 4.150456e+05 5.069992e+05 4.195746e+05 3.675962e+05
9360 Q9Y6X6 1.869312e+05 0.000000e+00 0.000000e+00 2.304623e+05 2.421623e+05 0.000000e+00
9361 Q9Y6X9 3.362758e+06 3.395221e+06 3.541975e+06 2.704210e+06 3.141519e+06 2.995787e+06
9362 Q9Y6Y0 5.924220e+06 6.183842e+06 6.190598e+06 6.025724e+06 5.920595e+06 6.754984e+06
9363 Q9Y6Y8 1.416146e+07 1.424916e+07 1.342342e+07 1.345135e+07 1.406395e+07 1.349913e+07

9364 rows × 7 columns

Then use the pg_reader_provider.get_reader method to get the AlphaDIA protein group reader. Use the import_file method to read the file, which is directly returned as a pandas.DataFrame.

Note how the dataframe values only contain the actual measurements and how the pg column was mapped to the standardized name uniprot_ids.

[6]:
alphadia_reader = pg_reader_provider.get_reader('alphadia')

# Import the file or a bytestream
alphadia_report = alphadia_reader.import_file(alphadia_example_path)

# Display the result
alphadia_report
[6]:
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01
uniprot_ids
A0A024RBG1 5.597816e+05 6.285112e+05 0.000000e+00 3.153867e+05 2.753702e+05 4.505648e+05
A0A024RBG1;Q9NZJ9 1.331061e+06 1.400360e+06 1.551987e+06 1.606095e+06 1.464152e+06 1.397026e+06
A0A075B759;A0A075B767;P62937 2.024742e+08 8.552202e+06 1.837425e+08 1.674874e+08 1.768245e+08 1.595220e+08
A0A096LP01 6.355092e+05 4.589410e+05 4.184495e+05 4.032932e+05 2.317467e+05 2.731363e+05
A0A096LP49 1.777069e+05 1.387537e+05 2.513601e+05 1.296699e+05 1.276095e+05 1.623200e+05
... ... ... ... ... ... ...
Q9Y6X3 3.898963e+05 4.353048e+05 4.150456e+05 5.069992e+05 4.195746e+05 3.675962e+05
Q9Y6X6 1.869312e+05 0.000000e+00 0.000000e+00 2.304623e+05 2.421623e+05 0.000000e+00
Q9Y6X9 3.362758e+06 3.395221e+06 3.541975e+06 2.704210e+06 3.141519e+06 2.995787e+06
Q9Y6Y0 5.924220e+06 6.183842e+06 6.190598e+06 6.025724e+06 5.920595e+06 6.754984e+06
Q9Y6Y8 1.416146e+07 1.424916e+07 1.342342e+07 1.345135e+07 1.406395e+07 1.349913e+07

9364 rows × 6 columns
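Because the result is an ordinary pandas.DataFrame, standard pandas operations apply directly. The sketch below uses a small hand-made frame that mimics the output above, with three protein groups and two samples under shortened, hypothetical sample names. Counting non-zero intensities as "quantified" is an assumption made for illustration.

```python
import pandas as pd

# Toy frame mimicking the reader output (sample names shortened for readability)
report = pd.DataFrame(
    {
        "before_01": [0.0, 1.551987e06, 1.837425e08],
        "after_01": [4.505648e05, 1.397026e06, 1.595220e08],
    },
    index=pd.Index(
        ["A0A024RBG1", "A0A024RBG1;Q9NZJ9", "A0A075B759;A0A075B767;P62937"],
        name="uniprot_ids",
    ),
)

# Number of proteins in each group (group members are separated by ";")
group_size = report.index.str.count(";") + 1

# Number of samples with a non-zero intensity per protein group
n_quantified = (report > 0).sum(axis=1)

print(list(group_size))
print(n_quantified.to_dict())
```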

Example 2 - AlphaPept with different quantification methods

AlphaPept is a DDA search engine that returns multiple quantification methods (raw intensities, LFQ) in its protein group report. We can use the reader to extract these different types of measurements by specifying the measurement_regex parameter.

AlphaPept reports can be in either .hdf or .tsv format. The pg_readers support all common data formats: text-based formats like .tsv and .csv, and binary formats like .parquet and .hdf (the latter via the extra alphabase[hdf] dependency).

[7]:
# Download example AlphaPept data with multiple quantification types
alphapept_example_path = get_pg_matrix_example(search_engine="alphapept")
pd.read_csv(alphapept_example_path)
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphapept0.5.3__pg_matrix_csv.csv already exists (0.33005523681640625 MB)
[7]:
Unnamed: 0 A_LFQ B_LFQ A B
0 sp|P36578|RL4_HUMAN 4.669329e+08 4.844083e+08 4.452735e+08 5.060678e+08
1 sp|Q9P258|RCC2_HUMAN 4.074842e+08 4.138132e+08 4.177856e+08 4.035118e+08
2 sp|O60518|RNBP6_HUMAN 4.960386e+06 2.022553e+06 1.295621e+06 5.687318e+06
3 sp|P55036|PSMD4_HUMAN 1.157420e+08 1.123571e+08 1.130880e+08 1.150112e+08
4 sp|A1X283|SPD2B_HUMAN 1.247112e+07 1.180582e+07 1.380177e+07 1.047516e+07
... ... ... ... ... ...
3776 sp|Q14966|ZN638_HUMAN NaN 1.139844e+06 NaN 1.139844e+06
3777 sp|P84095|RHOG_HUMAN NaN 9.466796e+05 NaN 9.466796e+05
3778 sp|Q99766|ATP5S_HUMAN NaN 3.577785e+05 NaN 3.577785e+05
3779 sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN NaN 9.237994e+05 NaN 9.237994e+05
3780 sp|P51946|CCNH_HUMAN NaN 9.278844e+05 NaN 9.278844e+05

3781 rows × 5 columns

Default - raw intensities

Let’s first use the default option, which imports raw intensities. The reader automatically extracts only the raw intensity columns and parses the UniProt-style identifiers in the index (e.g., sp|P36578|RL4_HUMAN) into separate index levels (proteins, uniprot_ids, ensembl_ids, source_db, is_decoy).

[8]:
# Default: raw intensities
alphapept_reader_default = pg_reader_provider.get_reader('alphapept')
alphapept_reader_default.import_file(alphapept_example_path)
[8]:
A B
proteins uniprot_ids ensembl_ids source_db is_decoy
RL4_HUMAN P36578 na sp False 445273477.0318756 506067774.6891948
RCC2_HUMAN Q9P258 na sp False 417785611.6324583 403511752.8857417
RNBP6_HUMAN O60518 na sp False 1295621.2466679448 5687318.493374016
PSMD4_HUMAN P55036 na sp False 113087994.44403341 115011156.7335174
SPD2B_HUMAN A1X283 na sp False 13801771.733223092 10475164.42857083
... ... ... ... ... ... ...
ZN638_HUMAN Q14966 na sp False 1139843.6453892316
RHOG_HUMAN P84095 na sp False 946679.6466570131
ATP5S_HUMAN Q99766 na sp False 357778.52002529387
TIM23_HUMAN;TI23B_HUMAN O14925;Q5SRD1 na;na sp;sp False 923799.3856913601
CCNH_HUMAN P51946 na sp False 927884.4020782198

3781 rows × 2 columns

LFQ runs

We can extract the LFQ intensities by selecting the pre-configured regular expression via measurement_regex="lfq":

[9]:
# LFQ intensities
alphapept_reader_lfq = pg_reader_provider.get_reader('alphapept', measurement_regex="lfq")
alphapept_reader_lfq.import_file(alphapept_example_path)
[9]:
A_LFQ B_LFQ
proteins uniprot_ids ensembl_ids source_db is_decoy
RL4_HUMAN P36578 na sp False 466932936.27537036 484408315.44570005
RCC2_HUMAN Q9P258 na sp False 407484183.9302226 413813180.5879775
RNBP6_HUMAN O60518 na sp False 4960386.374516514 2022553.3655254466
PSMD4_HUMAN P55036 na sp False 115742020.94987468 112357130.22767611
SPD2B_HUMAN A1X283 na sp False 12471120.728621317 11805815.433172602
... ... ... ... ... ... ...
ZN638_HUMAN Q14966 na sp False 1139843.6453892316
RHOG_HUMAN P84095 na sp False 946679.6466570131
ATP5S_HUMAN Q99766 na sp False 357778.52002529387
TIM23_HUMAN;TI23B_HUMAN O14925;Q5SRD1 na;na sp;sp False 923799.3856913601
CCNH_HUMAN P51946 na sp False 927884.4020782198

3781 rows × 2 columns

Explore all pre-configured patterns

You can also pass a custom pattern as a valid regular expression. To check all pre-configured regular expressions, use the get_preconfigured_regex method:

[10]:
alphapept_reader_default.get_preconfigured_regex()
[10]:
{'raw': '^.*(?<!_LFQ)$', 'lfq': '_LFQ$'}
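To see what these two patterns select, they can be applied to the intensity column names of the AlphaPept example with Python's re module. This sketch assumes that patterns are matched against column names in the manner of re.search. How the reader applies them internally is not shown in this tutorial.

```python
import re

# Patterns as returned by get_preconfigured_regex()
patterns = {"raw": r"^.*(?<!_LFQ)$", "lfq": r"_LFQ$"}

# Intensity columns of the AlphaPept example table
columns = ["A_LFQ", "B_LFQ", "A", "B"]

# For each pattern, collect the column names it matches
selected = {
    name: [col for col in columns if re.search(pattern, col)]
    for name, pattern in patterns.items()
}
print(selected)
# {'raw': ['A', 'B'], 'lfq': ['A_LFQ', 'B_LFQ']}
```

The negative lookbehind in the raw pattern rejects every name that ends in _LFQ, so the two patterns split the intensity columns into disjoint sets.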

Example 3 - Spectronaut reports

Next, we explore how to map non-standard columns to the unified vocabulary, using a Spectronaut report as an example. Spectronaut allows users to flexibly export custom feature-level metadata. alphabase allows users to extract this metadata by adding new columns to the column mapping.

[11]:
spectronaut_example_path = get_pg_matrix_example(search_engine="spectronaut")

# Parse with pandas for visualization purposes
pd.read_csv(spectronaut_example_path, sep="\t")
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv does not yet exist
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv successfully downloaded (27.531264305114746 MB)
[11]:
PG.Genes PG.Organisms PG.ProteinNames PTM.CollapseKey PTM.FlankingRegion PTM.ModificationTitle PTM.Multiplicity PTM.ProteinId PTM.SiteAA PTM.SiteLocation ... [27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity [28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity [29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity [30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity [31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity [32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity [33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity [34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity [35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity [36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
0 TRBV19;TRB Homo sapiens TVB19_HUMAN;TRBR1_HUMAN A0A075B6N1_S86_M3 IAEGYSVSREKKESF Phospho (STY) 3 A0A075B6N1 S 86 ... 69968.8359375 103632.6015625 90488.9296875 113429.859375 96970.2734375 61069.171875 99673.2734375 109199.875 112307.4765625 112374.84375
1 TRBV19;TRB Homo sapiens TVB19_HUMAN;TRBR1_HUMAN A0A075B6N1_S84_M3 GDIAEGYSVSREKKE Phospho (STY) 3 A0A075B6N1 S 84 ... 69968.8359375 103632.6015625 90488.9296875 113429.859375 96970.2734375 61069.171875 99673.2734375 109199.875 112307.4765625 112374.84375
2 TRBV19;TRB Homo sapiens TVB19_HUMAN;TRBR1_HUMAN A0A075B6N1_Y83_M3 KGDIAEGYSVSREKK Phospho (STY) 3 A0A075B6N1 Y 83 ... 69968.8359375 103632.6015625 90488.9296875 113429.859375 96970.2734375 61069.171875 99673.2734375 109199.875 112307.4765625 112374.84375
3 TRBV19;TRB Homo sapiens TVB19_HUMAN;TRBR1_HUMAN P0DSE2_S86_M3 IAEGYSVSREKKESF Phospho (STY) 3 P0DSE2 S 86 ... 69968.8359375 103632.6015625 90488.9296875 113429.859375 96970.2734375 61069.171875 99673.2734375 109199.875 112307.4765625 112374.84375
4 TRBV19;TRB Homo sapiens TVB19_HUMAN;TRBR1_HUMAN P0DSE2_S84_M3 GDIAEGYSVSREKKE Phospho (STY) 3 P0DSE2 S 84 ... 69968.8359375 103632.6015625 90488.9296875 113429.859375 96970.2734375 61069.171875 99673.2734375 109199.875 112307.4765625 112374.84375
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54858 MORC2 Homo sapiens MORC2_HUMAN Q9Y6X9_S739_M2 ATPSRKRSVAVSDEE Phospho (STY) 2 Q9Y6X9 S 739 ... 23552.466796875 22144.580078125 20846.8515625 24248.41796875 22490.0546875 22095.990234375 25553.849609375 22250.546875 14592.869140625 19265.998046875
54859 MORC2 Homo sapiens MORC2_HUMAN Q9Y6X9-2_S681_M2 RKRSVAVSDEEEVEE Phospho (STY) 2 Q9Y6X9-2 S 681 ... 23552.466796875 22144.580078125 20846.8515625 24248.41796875 22490.0546875 22095.990234375 25553.849609375 22250.546875 14592.869140625 19265.998046875
54860 MORC2 Homo sapiens MORC2_HUMAN Q9Y6X9-2_S677_M2 ATPSRKRSVAVSDEE Phospho (STY) 2 Q9Y6X9-2 S 677 ... 23552.466796875 22144.580078125 20846.8515625 24248.41796875 22490.0546875 22095.990234375 25553.849609375 22250.546875 14592.869140625 19265.998046875
54861 IVNS1ABP Homo sapiens NS1BP_HUMAN Q9Y6Y0_M341_M1 SKSLSFEMQQDELIE Oxidation (M) 1 Q9Y6Y0 M 341 ... Filtered 17287.40625 Filtered 15751.861328125 14749.724609375 12410.79296875 14130.1396484375 Filtered 13198.474609375 13553.0908203125
54862 IVNS1ABP Homo sapiens NS1BP_HUMAN Q9Y6Y0_S338_M1 PKLSKSLSFEMQQDE Phospho (STY) 1 Q9Y6Y0 S 338 ... Filtered 17287.40625 Filtered 15751.861328125 14749.724609375 12410.79296875 14130.1396484375 7562.62060546875 13198.474609375 13553.0908203125

54863 rows × 46 columns

By default, the reader extracts the protein names and gene names as index levels (proteins, genes) and the per-sample quantities as values:

[12]:
# Default reader without custom column mapping
reader = pg_reader_provider.get_reader('spectronaut')
reader.import_file(spectronaut_example_path)
[12]:
[1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity [2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity [3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity [4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity [5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity [6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity [7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity [8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity [9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity [10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity ... [27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity [28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity [29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity [30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity [31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity [32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity [33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity [34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity [35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity [36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
proteins genes
TVB19_HUMAN;TRBR1_HUMAN TRBV19;TRB NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
TRBV19;TRB NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
TRBV19;TRB NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
TRBV19;TRB NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
TRBV19;TRB NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
MORC2_HUMAN MORC2 NaN NaN 6817.745605 NaN NaN NaN 18010.679688 12501.521484 17377.408203 13730.358398 ... 23552.466797 22144.580078 20846.851562 24248.417969 22490.054688 22095.990234 25553.849609 22250.546875 14592.869141 19265.998047
MORC2 NaN NaN 6817.745605 NaN NaN NaN 18010.679688 12501.521484 17377.408203 13730.358398 ... 23552.466797 22144.580078 20846.851562 24248.417969 22490.054688 22095.990234 25553.849609 22250.546875 14592.869141 19265.998047
MORC2 NaN NaN 6817.745605 NaN NaN NaN 18010.679688 12501.521484 17377.408203 13730.358398 ... 23552.466797 22144.580078 20846.851562 24248.417969 22490.054688 22095.990234 25553.849609 22250.546875 14592.869141 19265.998047
NS1BP_HUMAN IVNS1ABP NaN NaN 38411.285156 NaN NaN NaN 10104.601562 12773.764648 10412.311523 11411.670898 ... NaN 17287.406250 NaN 15751.861328 14749.724609 12410.792969 14130.139648 NaN 13198.474609 13553.090820
IVNS1ABP NaN NaN 38411.285156 NaN NaN NaN 10104.601562 18788.167969 10412.311523 17367.800781 ... NaN 17287.406250 NaN 15751.861328 14749.724609 12410.792969 14130.139648 7562.620605 13198.474609 13553.090820

54863 rows × 36 columns

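Let’s say that we are also interested in the modified amino acid at each PTM site. We can extract this information as well by using the add_column_mapping method, which here maps the Spectronaut column PTM.SiteAA to the new index level ptm_site_amino_acid: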

[13]:
# Add custom column mapping for the modified amino acid at the PTM site
reader.add_column_mapping({"ptm_site_amino_acid": "PTM.SiteAA"})
reader.import_file(spectronaut_example_path)
[13]:
[1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity [2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity [3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity [4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity [5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity [6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity [7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity [8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity [9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity [10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity ... [27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity [28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity [29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity [30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity [31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity [32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity [33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity [34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity [35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity [36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
proteins genes ptm_site_amino_acid
TVB19_HUMAN;TRBR1_HUMAN TRBV19;TRB S NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
S NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
Y NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
S NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
S NaN NaN NaN NaN NaN NaN 89374.656250 NaN 90181.578125 96197.070312 ... 69968.835938 103632.601562 90488.929688 113429.859375 96970.273438 61069.171875 99673.273438 109199.875000 112307.476562 112374.843750
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
MORC2_HUMAN MORC2 S NaN NaN 6817.745605 NaN NaN NaN 18010.679688 12501.521484 17377.408203 13730.358398 ... 23552.466797 22144.580078 20846.851562 24248.417969 22490.054688 22095.990234 25553.849609 22250.546875 14592.869141 19265.998047
S NaN NaN 6817.745605 NaN NaN NaN 18010.679688 12501.521484 17377.408203 13730.358398 ... 23552.466797 22144.580078 20846.851562 24248.417969 22490.054688 22095.990234 25553.849609 22250.546875 14592.869141 19265.998047
S NaN NaN 6817.745605 NaN NaN NaN 18010.679688 12501.521484 17377.408203 13730.358398 ... 23552.466797 22144.580078 20846.851562 24248.417969 22490.054688 22095.990234 25553.849609 22250.546875 14592.869141 19265.998047
NS1BP_HUMAN IVNS1ABP M NaN NaN 38411.285156 NaN NaN NaN 10104.601562 12773.764648 10412.311523 11411.670898 ... NaN 17287.406250 NaN 15751.861328 14749.724609 12410.792969 14130.139648 NaN 13198.474609 13553.090820
S NaN NaN 38411.285156 NaN NaN NaN 10104.601562 18788.167969 10412.311523 17367.800781 ... NaN 17287.406250 NaN 15751.861328 14749.724609 12410.792969 14130.139648 7562.620605 13198.474609 13553.090820

54863 rows × 36 columns
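Once the extra metadata is part of the index, it can be used for filtering and summarizing with standard pandas MultiIndex operations. The sketch below uses a hand-made frame with the same index levels as the output above, reduced to three rows and two samples under shortened, hypothetical column names.

```python
import pandas as pd

# Toy frame with the same index levels as the reader output
index = pd.MultiIndex.from_tuples(
    [
        ("MORC2_HUMAN", "MORC2", "S"),
        ("NS1BP_HUMAN", "IVNS1ABP", "M"),
        ("NS1BP_HUMAN", "IVNS1ABP", "S"),
    ],
    names=["proteins", "genes", "ptm_site_amino_acid"],
)
report = pd.DataFrame(
    {
        "Y50_05": [14592.869141, 13198.474609, 13198.474609],
        "Y50_06": [19265.998047, 13553.090820, 13553.090820],
    },
    index=index,
)

# Residue type of every site, taken from the custom index level
residues = report.index.get_level_values("ptm_site_amino_acid")

# Keep only rows whose modified residue is a serine
serine_sites = report[residues == "S"]

# Count sites per residue type
site_counts = residues.value_counts()
print(site_counts.to_dict())
```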

scverse compatibility

The standardized format also allows users to easily convert the protein group tables to widely used -omics formats like anndata.AnnData.

[14]:
def create_anndata_from_pg_matrix(file_path: str, search_engine: str, **kwargs) -> ad.AnnData:
    """Get anndata object from PG matrix."""

    reader = pg_reader_provider.get_reader(search_engine, **kwargs)
    df = reader.import_file(file_path)
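    # AnnData stores observations (samples) as rows and variables (protein groups)
    # as columns, so the features x samples matrix is transposed
    return ad.AnnData(
        X=df.values.T,
        var=df.index.to_frame(),
        obs=df.columns.to_frame(name="sample_id"),
    )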
[15]:
adata = create_anndata_from_pg_matrix(
    alphadia_example_path, search_engine="alphadia"
)

adata
[15]:
AnnData object with n_obs × n_vars = 6 × 9364
    obs: 'sample_id'
    var: 'uniprot_ids'
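The orientation change performed in create_anndata_from_pg_matrix can be checked with pandas alone. The sketch below uses a toy features × samples frame and derives the three pieces that are passed to ad.AnnData, without constructing the AnnData object itself.

```python
import pandas as pd

# Toy features x samples frame in the standardized reader layout
pg = pd.DataFrame(
    {
        "sample_1": [1000.5, 2500.1],
        "sample_2": [892.3, 2780.9],
        "sample_3": [1150.7, 2340.2],
    },
    index=pd.Index(["P12345", "Q67890"], name="uniprot_ids"),
)

X = pg.values.T                              # samples x features
var = pg.index.to_frame()                    # one row per protein group
obs = pg.columns.to_frame(name="sample_id")  # one row per sample

print(X.shape, len(obs), len(var))
```

The number of rows in X matches obs and the number of columns matches var, which is the shape relationship reported as n_obs × n_vars in the output above.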

Conclusion

The alphabase protein group reader module provides:

  • Unified interface for reading protein group tables from multiple search engines

  • Standardized output format that facilitates cross-engine comparisons and downstream analyses

  • Flexible quantification options to extract different measurement types (raw, LFQ, iBAQ)

  • Extensible architecture that supports custom column mappings and new search engines

This standardization enables researchers to focus on biological insights rather than data format complexities.