Protein Group readers

[1]:

%reload_ext autoreload
%autoreload 2

[2]:

# Helper packages
import io
from copy import copy
from typing import Literal, Optional

import anndata as ad
import numpy as np
import pandas as pd

# alphabase
from alphabase.pg_reader import pg_reader_provider
from alphabase.tools.data_downloader import DataShareDownloader

Background

The alphabase.pg_reader module provides a unified interface for reading protein group (PG) tables from different search engines and file formats. It is designed to be easy to use and to return a consistent output format, a pandas.DataFrame, regardless of the input file format.

Introduction to protein group matrices

Protein group matrices are the primary output for protein-level quantification in proteomics workflows. After search engines identify peptide-spectrum matches (PSMs, see PSM-reader tutorial), they aggregate peptide-level evidence to infer protein-level abundances. A protein group table is a structured matrix that maps protein groups (features) to samples (observations), with estimated intensity values as entries.

A minimal protein group table could look something like this:

proteins	sample_1	sample_2	sample_3
P12345	1000.5	892.3	1150.7
Q67890	2500.1	2780.9	2340.2

💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into protein groups as features.

In this example, protein P12345 has quantified intensities of 1000.5, 892.3, and 1150.7 in samples 1, 2, and 3 respectively.
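As a small illustration of this layout, the minimal table above can be loaded into a pandas.DataFrame with the protein groups as the index and the samples as columns. This is only a sketch using pandas; the variable names `minimal_tsv` and `pg_table` are made up for this example and are not part of alphabase.

```python
import io

import pandas as pd

# The minimal protein group table from above, written out as tab-separated text
minimal_tsv = (
    "proteins\tsample_1\tsample_2\tsample_3\n"
    "P12345\t1000.5\t892.3\t1150.7\n"
    "Q67890\t2500.1\t2780.9\t2340.2\n"
)

# Features (protein groups) become the index, samples stay as columns
pg_table = pd.read_csv(io.StringIO(minimal_tsv), sep="\t", index_col="proteins")

# Intensity of protein P12345 in sample 2
intensity = pg_table.loc["P12345", "sample_2"]
print(intensity)
```

Looking up a single value by protein group and sample name is the basic access pattern that the standardized reader output supports as well.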

Search engine outputs

In practice, protein group tables are considerably more complex than this. They contain additional feature-level information about the proteins (e.g., gene names, descriptions) and several quantification types (e.g., raw intensities, LFQ, iBAQ). This additional information can be valuable for downstream analyses, but it also makes protein group tables harder to work with, as the exact column names and formats may differ between search engines, versions, and file formats.

Unifying properties

alphabase aligns the column names to a unified vocabulary, facilitating cross-engine comparisons. We can categorize protein group tables into several common types:

Type 1 — Minimal: A basic features × samples matrix. Only intensity values are stored, with sample names as columns and protein groups as the index. Example: AlphaDIA.

Type 2 — Multiple Intensity Fields: A wide matrix where each sample may appear multiple times with different quantification types (e.g., SampleA_LFQ, SampleB_raw). Example: AlphaPept.

Type 3 — Feature Metadata: A features × samples matrix with one intensity value per sample, plus additional feature-level metadata columns (e.g., gene names, descriptions). Example: DIA-NN.

Type 4 — Combined: A composite structure including both multiple intensity fields (Type 2) and feature-level metadata (Type 3). Examples: Spectronaut, MZTab, MaxQuant.

Code | Read and parse protein group tables

The alphabase pg_reader module enables users to parse proteomics protein group reports to a dataframe for most common search engines with a single line of code via its alphabase.pg_reader.pg_reader_provider factory.

All readers return a standardized pandas DataFrame with:

Features as index: Protein identifiers and metadata in the index (pandas.DataFrame.index)
Samples as columns: Sample/run identifiers in the columns (pandas.DataFrame.columns)
Intensity values: Protein quantification data as pandas.DataFrame.values

The readers support different quantification methods by matching regular expression patterns against the column names of the report, and they map the desired metadata columns to standardized names.
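To make the idea concrete, the two steps (renaming metadata columns and selecting intensity columns by regular expression) can be imitated with plain pandas. This is a hedged sketch and not the alphabase implementation: the report, its column names, and the variables `column_mapping` and `measurement_regex` are hypothetical and only mirror the terminology used in this tutorial.

```python
import pandas as pd

# Hypothetical report with metadata columns and two intensity types per sample
report = pd.DataFrame(
    {
        "Protein IDs": ["P12345", "Q67890"],
        "Gene names": ["GENE1", "GENE2"],
        "Intensity A": [1000.5, 2500.1],
        "Intensity B": [892.3, 2780.9],
        "LFQ intensity A": [980.0, 2450.0],
        "LFQ intensity B": [910.0, 2700.0],
    }
)

# Standardized name -> column name in the report (assumed names)
column_mapping = {"uniprot_ids": "Protein IDs", "genes": "Gene names"}

# Pattern that selects the desired intensity columns (assumed pattern)
measurement_regex = r"^LFQ intensity "

# pandas.DataFrame.rename expects old name -> new name, so invert the mapping
rename_map = {source: standard for standard, source in column_mapping.items()}

standardized = (
    report.rename(columns=rename_map)
    .set_index(list(column_mapping))  # feature metadata moves into the index
    .filter(regex=measurement_regex)  # keep only the matching intensity columns
)
print(standardized)
```

The result has the feature metadata in a two-level index and only the LFQ columns as values, which is the shape described in the list above.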

The unified alphabase format enables seamless comparison and analysis across different search engines, facilitating:

Method comparison studies
Data integration workflows
Standardized downstream analysis pipelines

Available readers

alphabase.pg_reader.pg_reader_provider has registered reader classes for the most common proteomics search engines. A list of implemented readers can be accessed via its reader_dict property:

[3]:

all_registered_readers = pg_reader_provider.reader_dict.keys()

# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)

Registered readers in alphabase:
        - alphadia
        - alphapept
        - diann
        - fragpipe
        - maxquant
        - mztab
        - spectronaut

Interact with the reader provider

[ ]:

def get_pg_matrix_example(output_dir: Optional[str] = None, search_engine: Literal["alphadia", "alphapept", "spectronaut"] = "alphadia") -> str:
    """Download example data for the tutorial.

    The function downloads an example protein group report and stores it
    in `output_dir` or, if none is given, in the system's temporary directory.

    Parameters
    ----------
    output_dir
        Output directory. If `None`, the system's temporary directory is used.
    search_engine
        Search engine for which the example report is downloaded.

    Returns
    -------
    File location
    """
    EXAMPLE_URLS = {
        "alphadia": "https://datashare.biochem.mpg.de/s/4AtCZassaUzRR8K",
        "alphapept": "https://datashare.biochem.mpg.de/s/6G6KHJqwcRPQiOO",
        "spectronaut": "https://datashare.biochem.mpg.de/s/2u7U03wvmQDVT4y",
    }

    if search_engine not in EXAMPLE_URLS:
        raise KeyError(f"{search_engine} not found, select one of {', '.join(EXAMPLE_URLS.keys())}")

    if output_dir is None:
        # `tempfile.tempdir` can still be `None`; `gettempdir()` always resolves the directory
        from tempfile import gettempdir

        output_dir = gettempdir()

    downloader = DataShareDownloader(url=EXAMPLE_URLS[search_engine], output_dir=output_dir)

    return downloader.download()

Example 1 - AlphaDIA

We demonstrate how to interact with protein group tables via alphabase based on a minimal example output of the AlphaDIA search engine.

First, let’s get some minimal example data for the AlphaDIA output. The example data represents a DIA run of 6 HeLa samples on the Orbitrap Astral.

You can see that the output data contains the feature names in the column pg and the computed protein group intensities per sample in the remaining columns.

[5]:

alphadia_example_path = get_pg_matrix_example(search_engine="alphadia")

# Parse with pandas for visualization purposes
pd.read_csv(alphadia_example_path, sep="\t")

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia1.10.4__pg_matrix.tsv already exists (0.8597145080566406 MB)

[5]:

	pg	20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03	20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02	20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01	20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03	20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02	20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01
0	A0A024RBG1	5.597816e+05	6.285112e+05	0.000000e+00	3.153867e+05	2.753702e+05	4.505648e+05
1	A0A024RBG1;Q9NZJ9	1.331061e+06	1.400360e+06	1.551987e+06	1.606095e+06	1.464152e+06	1.397026e+06
2	A0A075B759;A0A075B767;P62937	2.024742e+08	8.552202e+06	1.837425e+08	1.674874e+08	1.768245e+08	1.595220e+08
3	A0A096LP01	6.355092e+05	4.589410e+05	4.184495e+05	4.032932e+05	2.317467e+05	2.731363e+05
4	A0A096LP49	1.777069e+05	1.387537e+05	2.513601e+05	1.296699e+05	1.276095e+05	1.623200e+05
...	...	...	...	...	...	...	...
9359	Q9Y6X3	3.898963e+05	4.353048e+05	4.150456e+05	5.069992e+05	4.195746e+05	3.675962e+05
9360	Q9Y6X6	1.869312e+05	0.000000e+00	0.000000e+00	2.304623e+05	2.421623e+05	0.000000e+00
9361	Q9Y6X9	3.362758e+06	3.395221e+06	3.541975e+06	2.704210e+06	3.141519e+06	2.995787e+06
9362	Q9Y6Y0	5.924220e+06	6.183842e+06	6.190598e+06	6.025724e+06	5.920595e+06	6.754984e+06
9363	Q9Y6Y8	1.416146e+07	1.424916e+07	1.342342e+07	1.345135e+07	1.406395e+07	1.349913e+07

9364 rows × 7 columns

Then use the pg_reader_provider.get_reader method to get the AlphaDIA protein group reader. Its import_file method reads the file and returns the result directly as a pandas.DataFrame.

Note how the dataframe values only contain the actual measurements and how the pg column was mapped to the standardized name uniprot_ids.

[6]:

alphadia_reader = pg_reader_provider.get_reader('alphadia')

# Import the file or a bytestream
alphadia_report = alphadia_reader.import_file(alphadia_example_path)

# Display the result
alphadia_report

[6]:

	20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03	20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02	20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01	20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03	20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02	20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01
uniprot_ids
A0A024RBG1	5.597816e+05	6.285112e+05	0.000000e+00	3.153867e+05	2.753702e+05	4.505648e+05
A0A024RBG1;Q9NZJ9	1.331061e+06	1.400360e+06	1.551987e+06	1.606095e+06	1.464152e+06	1.397026e+06
A0A075B759;A0A075B767;P62937	2.024742e+08	8.552202e+06	1.837425e+08	1.674874e+08	1.768245e+08	1.595220e+08
A0A096LP01	6.355092e+05	4.589410e+05	4.184495e+05	4.032932e+05	2.317467e+05	2.731363e+05
A0A096LP49	1.777069e+05	1.387537e+05	2.513601e+05	1.296699e+05	1.276095e+05	1.623200e+05
...	...	...	...	...	...	...
Q9Y6X3	3.898963e+05	4.353048e+05	4.150456e+05	5.069992e+05	4.195746e+05	3.675962e+05
Q9Y6X6	1.869312e+05	0.000000e+00	0.000000e+00	2.304623e+05	2.421623e+05	0.000000e+00
Q9Y6X9	3.362758e+06	3.395221e+06	3.541975e+06	2.704210e+06	3.141519e+06	2.995787e+06
Q9Y6Y0	5.924220e+06	6.183842e+06	6.190598e+06	6.025724e+06	5.920595e+06	6.754984e+06
Q9Y6Y8	1.416146e+07	1.424916e+07	1.342342e+07	1.345135e+07	1.406395e+07	1.349913e+07

9364 rows × 6 columns
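Some entries in the uniprot_ids index contain several accessions separated by semicolons, i.e. protein groups with more than one member. A minimal sketch for splitting such an identifier into its members is shown below; the helper `split_protein_group` is a hypothetical name for this example and not an alphabase function.

```python
def split_protein_group(protein_group: str, sep: str = ";") -> list:
    """Split a protein group identifier into its member accessions."""
    return protein_group.split(sep)


# Protein group taken from the table above
members = split_protein_group("A0A075B759;A0A075B767;P62937")
print(members)
print(len(members))
```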

Example 2 - AlphaPept with different quantification methods

AlphaPept is a DDA search engine that reports multiple quantification types (raw intensities, LFQ) in its protein group report. We can use the reader to extract these different types of measurements by specifying the measurement_regex parameter.

AlphaPept reports can be stored in either .hdf or text-based format. The protein group readers support all common data formats out of the box: text-based formats such as .tsv and .csv, and binary formats such as .parquet and .hdf (the latter via the extra alphabase[hdf] dependency).

[7]:

# Download example AlphaPept data with multiple quantification types
alphapept_example_path = get_pg_matrix_example(search_engine="alphapept")
pd.read_csv(alphapept_example_path)

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphapept0.5.3__pg_matrix_csv.csv already exists (0.33005523681640625 MB)

[7]:

	Unnamed: 0	A_LFQ	B_LFQ	A	B
0	sp|P36578|RL4_HUMAN	4.669329e+08	4.844083e+08	4.452735e+08	5.060678e+08
1	sp|Q9P258|RCC2_HUMAN	4.074842e+08	4.138132e+08	4.177856e+08	4.035118e+08
2	sp|O60518|RNBP6_HUMAN	4.960386e+06	2.022553e+06	1.295621e+06	5.687318e+06
3	sp|P55036|PSMD4_HUMAN	1.157420e+08	1.123571e+08	1.130880e+08	1.150112e+08
4	sp|A1X283|SPD2B_HUMAN	1.247112e+07	1.180582e+07	1.380177e+07	1.047516e+07
...	...	...	...	...	...
3776	sp|Q14966|ZN638_HUMAN	NaN	1.139844e+06	NaN	1.139844e+06
3777	sp|P84095|RHOG_HUMAN	NaN	9.466796e+05	NaN	9.466796e+05
3778	sp|Q99766|ATP5S_HUMAN	NaN	3.577785e+05	NaN	3.577785e+05
3779	sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN	NaN	9.237994e+05	NaN	9.237994e+05
3780	sp|P51946|CCNH_HUMAN	NaN	9.278844e+05	NaN	9.278844e+05

3781 rows × 5 columns

Default - raw intensities

Let’s first use the default option, which imports raw intensities. You can see that the reader automatically extracts only the raw intensity columns and that it parses the UniProt-style header of each protein group into separate, more streamlined index levels.

[8]:

# Default: raw intensities
alphapept_reader_default = pg_reader_provider.get_reader('alphapept')
alphapept_reader_default.import_file(alphapept_example_path)

[8]:

					A	B
proteins	uniprot_ids	ensembl_ids	source_db	is_decoy
RL4_HUMAN	P36578	na	sp	False	445273477.0318756	506067774.6891948
RCC2_HUMAN	Q9P258	na	sp	False	417785611.6324583	403511752.8857417
RNBP6_HUMAN	O60518	na	sp	False	1295621.2466679448	5687318.493374016
PSMD4_HUMAN	P55036	na	sp	False	113087994.44403341	115011156.7335174
SPD2B_HUMAN	A1X283	na	sp	False	13801771.733223092	10475164.42857083
...	...	...	...	...	...	...
ZN638_HUMAN	Q14966	na	sp	False		1139843.6453892316
RHOG_HUMAN	P84095	na	sp	False		946679.6466570131
ATP5S_HUMAN	Q99766	na	sp	False		357778.52002529387
TIM23_HUMAN;TI23B_HUMAN	O14925;Q5SRD1	na;na	sp;sp	False		923799.3856913601
CCNH_HUMAN	P51946	na	sp	False		927884.4020782198

3781 rows × 2 columns
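The effect of this header parsing can be reproduced with a few lines of plain Python. The sketch below is not the alphabase implementation; it assumes that each identifier has the three-part form `source_db|accession|name` and that group members are separated by commas, as in the example file. The function name `parse_header_group` is hypothetical.

```python
def parse_header_group(header_group: str) -> dict:
    """Split a comma-separated group of `db|accession|name` identifiers."""
    source_dbs, accessions, names = [], [], []
    for header in header_group.split(","):
        # Assumes exactly three pipe-separated fields per identifier
        source_db, accession, name = header.split("|")
        source_dbs.append(source_db)
        accessions.append(accession)
        names.append(name)

    # Join the members with semicolons, as in the reader output above
    return {
        "proteins": ";".join(names),
        "uniprot_ids": ";".join(accessions),
        "source_db": ";".join(source_dbs),
    }


parsed = parse_header_group("sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN")
print(parsed)
```

For the two-member group from the example file, the sketch yields the same proteins, uniprot_ids, and source_db values as the corresponding row of the reader output.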

LFQ runs

We can extract the LFQ intensities by selecting the corresponding pre-defined regular expression via the measurement_regex parameter:

[9]:

# LFQ intensities
alphapept_reader_lfq = pg_reader_provider.get_reader('alphapept', measurement_regex="lfq")
alphapept_reader_lfq.import_file(alphapept_example_path)

[9]:

					A_LFQ	B_LFQ
proteins	uniprot_ids	ensembl_ids	source_db	is_decoy
RL4_HUMAN	P36578	na	sp	False	466932936.27537036	484408315.44570005
RCC2_HUMAN	Q9P258	na	sp	False	407484183.9302226	413813180.5879775
RNBP6_HUMAN	O60518	na	sp	False	4960386.374516514	2022553.3655254466
PSMD4_HUMAN	P55036	na	sp	False	115742020.94987468	112357130.22767611
SPD2B_HUMAN	A1X283	na	sp	False	12471120.728621317	11805815.433172602
...	...	...	...	...	...	...
ZN638_HUMAN	Q14966	na	sp	False		1139843.6453892316
RHOG_HUMAN	P84095	na	sp	False		946679.6466570131
ATP5S_HUMAN	Q99766	na	sp	False		357778.52002529387
TIM23_HUMAN;TI23B_HUMAN	O14925;Q5SRD1	na;na	sp;sp	False		923799.3856913601
CCNH_HUMAN	P51946	na	sp	False		927884.4020782198

3781 rows × 2 columns

Explore all pre-configured patterns

You can also pass a custom pattern in the form of any valid regular expression. The get_preconfigured_regex method lists all pre-configured regular expressions:

[10]:

alphapept_reader_default.get_preconfigured_regex()

[10]:

{'raw': '^.*(?<!_LFQ)$', 'lfq': '_LFQ$'}
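To see which columns of the AlphaPept example each pattern selects, the two expressions can be tried on the column names with Python's re module. This sketch assumes that the patterns are applied to the column names with search semantics (`re.search`); how alphabase applies them internally is not shown in this tutorial.

```python
import re

# Patterns as returned by get_preconfigured_regex above
preconfigured = {"raw": "^.*(?<!_LFQ)$", "lfq": "_LFQ$"}

# Intensity columns of the AlphaPept example report
columns = ["A_LFQ", "B_LFQ", "A", "B"]

selected_by_pattern = {
    name: [column for column in columns if re.search(pattern, column)]
    for name, pattern in preconfigured.items()
}
print(selected_by_pattern)
```

The negative lookbehind `(?<!_LFQ)` in the raw pattern rejects every name that ends in `_LFQ`, so the two patterns split the columns into two disjoint sets.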

Example 3 - Spectronaut reports

Next, we explore how users can map non-standard columns to a unified vocabulary, based on a Spectronaut PG report. Spectronaut allows users to flexibly export custom feature-level metadata. alphabase allows users to extract this metadata by adding new columns to the streamlined column mapping.

[11]:

spectronaut_example_path = get_pg_matrix_example(search_engine="spectronaut")

# Parse with pandas for visualization purposes
pd.read_csv(spectronaut_example_path, sep="\t")

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv does not yet exist
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv successfully downloaded (27.531264305114746 MB)

[11]:

	PG.Genes	PG.Organisms	PG.ProteinNames	PTM.CollapseKey	PTM.FlankingRegion	PTM.ModificationTitle	PTM.Multiplicity	PTM.ProteinId	PTM.SiteAA	PTM.SiteLocation	...	[27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity	[28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity	[29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity	[30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity	[31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity	[32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity	[33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity	[34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity	[35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity	[36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
0	TRBV19;TRB	Homo sapiens	TVB19_HUMAN;TRBR1_HUMAN	A0A075B6N1_S86_M3	IAEGYSVSREKKESF	Phospho (STY)	3	A0A075B6N1	S	86	...	69968.8359375	103632.6015625	90488.9296875	113429.859375	96970.2734375	61069.171875	99673.2734375	109199.875	112307.4765625	112374.84375
1	TRBV19;TRB	Homo sapiens	TVB19_HUMAN;TRBR1_HUMAN	A0A075B6N1_S84_M3	GDIAEGYSVSREKKE	Phospho (STY)	3	A0A075B6N1	S	84	...	69968.8359375	103632.6015625	90488.9296875	113429.859375	96970.2734375	61069.171875	99673.2734375	109199.875	112307.4765625	112374.84375
2	TRBV19;TRB	Homo sapiens	TVB19_HUMAN;TRBR1_HUMAN	A0A075B6N1_Y83_M3	KGDIAEGYSVSREKK	Phospho (STY)	3	A0A075B6N1	Y	83	...	69968.8359375	103632.6015625	90488.9296875	113429.859375	96970.2734375	61069.171875	99673.2734375	109199.875	112307.4765625	112374.84375
3	TRBV19;TRB	Homo sapiens	TVB19_HUMAN;TRBR1_HUMAN	P0DSE2_S86_M3	IAEGYSVSREKKESF	Phospho (STY)	3	P0DSE2	S	86	...	69968.8359375	103632.6015625	90488.9296875	113429.859375	96970.2734375	61069.171875	99673.2734375	109199.875	112307.4765625	112374.84375
4	TRBV19;TRB	Homo sapiens	TVB19_HUMAN;TRBR1_HUMAN	P0DSE2_S84_M3	GDIAEGYSVSREKKE	Phospho (STY)	3	P0DSE2	S	84	...	69968.8359375	103632.6015625	90488.9296875	113429.859375	96970.2734375	61069.171875	99673.2734375	109199.875	112307.4765625	112374.84375
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54858	MORC2	Homo sapiens	MORC2_HUMAN	Q9Y6X9_S739_M2	ATPSRKRSVAVSDEE	Phospho (STY)	2	Q9Y6X9	S	739	...	23552.466796875	22144.580078125	20846.8515625	24248.41796875	22490.0546875	22095.990234375	25553.849609375	22250.546875	14592.869140625	19265.998046875
54859	MORC2	Homo sapiens	MORC2_HUMAN	Q9Y6X9-2_S681_M2	RKRSVAVSDEEEVEE	Phospho (STY)	2	Q9Y6X9-2	S	681	...	23552.466796875	22144.580078125	20846.8515625	24248.41796875	22490.0546875	22095.990234375	25553.849609375	22250.546875	14592.869140625	19265.998046875
54860	MORC2	Homo sapiens	MORC2_HUMAN	Q9Y6X9-2_S677_M2	ATPSRKRSVAVSDEE	Phospho (STY)	2	Q9Y6X9-2	S	677	...	23552.466796875	22144.580078125	20846.8515625	24248.41796875	22490.0546875	22095.990234375	25553.849609375	22250.546875	14592.869140625	19265.998046875
54861	IVNS1ABP	Homo sapiens	NS1BP_HUMAN	Q9Y6Y0_M341_M1	SKSLSFEMQQDELIE	Oxidation (M)	1	Q9Y6Y0	M	341	...	Filtered	17287.40625	Filtered	15751.861328125	14749.724609375	12410.79296875	14130.1396484375	Filtered	13198.474609375	13553.0908203125
54862	IVNS1ABP	Homo sapiens	NS1BP_HUMAN	Q9Y6Y0_S338_M1	PKLSKSLSFEMQQDE	Phospho (STY)	1	Q9Y6Y0	S	338	...	Filtered	17287.40625	Filtered	15751.861328125	14749.724609375	12410.79296875	14130.1396484375	7562.62060546875	13198.474609375	13553.0908203125

54863 rows × 46 columns

By default, the reader extracts the protein names and gene names as the index and the per-sample quantity columns as values:

[12]:

# Default reader without custom column mapping
reader = pg_reader_provider.get_reader('spectronaut')
reader.import_file(spectronaut_example_path)

[12]:

		[1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity	[2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity	[3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity	[4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity	[5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity	[6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity	[7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity	[8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity	[9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity	[10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity	...	[27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity	[28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity	[29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity	[30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity	[31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity	[32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity	[33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity	[34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity	[35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity	[36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
proteins	genes
TVB19_HUMAN;TRBR1_HUMAN	TRBV19;TRB	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
	TRBV19;TRB	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
	TRBV19;TRB	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
	TRBV19;TRB	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
	TRBV19;TRB	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
MORC2_HUMAN	MORC2	NaN	NaN	6817.745605	NaN	NaN	NaN	18010.679688	12501.521484	17377.408203	13730.358398	...	23552.466797	22144.580078	20846.851562	24248.417969	22490.054688	22095.990234	25553.849609	22250.546875	14592.869141	19265.998047
	MORC2	NaN	NaN	6817.745605	NaN	NaN	NaN	18010.679688	12501.521484	17377.408203	13730.358398	...	23552.466797	22144.580078	20846.851562	24248.417969	22490.054688	22095.990234	25553.849609	22250.546875	14592.869141	19265.998047
	MORC2	NaN	NaN	6817.745605	NaN	NaN	NaN	18010.679688	12501.521484	17377.408203	13730.358398	...	23552.466797	22144.580078	20846.851562	24248.417969	22490.054688	22095.990234	25553.849609	22250.546875	14592.869141	19265.998047
NS1BP_HUMAN	IVNS1ABP	NaN	NaN	38411.285156	NaN	NaN	NaN	10104.601562	12773.764648	10412.311523	11411.670898	...	NaN	17287.406250	NaN	15751.861328	14749.724609	12410.792969	14130.139648	NaN	13198.474609	13553.090820
NS1BP_HUMAN	IVNS1ABP	NaN	NaN	38411.285156	NaN	NaN	NaN	10104.601562	18788.167969	10412.311523	17367.800781	...	NaN	17287.406250	NaN	15751.861328	14749.724609	12410.792969	14130.139648	7562.620605	13198.474609	13553.090820

54863 rows × 36 columns

Let’s say that we are also interested in the amino acid at the PTM site. We can extract this information as well by using the add_column_mapping method:

[13]:

# Add custom column mapping for the amino acid at the PTM site
reader.add_column_mapping({"ptm_site_amino_acid": "PTM.SiteAA"})
reader.import_file(spectronaut_example_path)

[13]:

			[1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity	[2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity	[3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity	[4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity	[5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity	[6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity	[7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity	[8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity	[9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity	[10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity	...	[27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity	[28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity	[29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity	[30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity	[31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity	[32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity	[33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity	[34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity	[35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity	[36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity
proteins	genes	ptm_site_amino_acid
TVB19_HUMAN;TRBR1_HUMAN	TRBV19;TRB	S	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
		S	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
		Y	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
		S	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
		S	NaN	NaN	NaN	NaN	NaN	NaN	89374.656250	NaN	90181.578125	96197.070312	...	69968.835938	103632.601562	90488.929688	113429.859375	96970.273438	61069.171875	99673.273438	109199.875000	112307.476562	112374.843750
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
MORC2_HUMAN	MORC2	S	NaN	NaN	6817.745605	NaN	NaN	NaN	18010.679688	12501.521484	17377.408203	13730.358398	...	23552.466797	22144.580078	20846.851562	24248.417969	22490.054688	22095.990234	25553.849609	22250.546875	14592.869141	19265.998047
		S	NaN	NaN	6817.745605	NaN	NaN	NaN	18010.679688	12501.521484	17377.408203	13730.358398	...	23552.466797	22144.580078	20846.851562	24248.417969	22490.054688	22095.990234	25553.849609	22250.546875	14592.869141	19265.998047
		S	NaN	NaN	6817.745605	NaN	NaN	NaN	18010.679688	12501.521484	17377.408203	13730.358398	...	23552.466797	22144.580078	20846.851562	24248.417969	22490.054688	22095.990234	25553.849609	22250.546875	14592.869141	19265.998047
NS1BP_HUMAN	IVNS1ABP	M	NaN	NaN	38411.285156	NaN	NaN	NaN	10104.601562	12773.764648	10412.311523	11411.670898	...	NaN	17287.406250	NaN	15751.861328	14749.724609	12410.792969	14130.139648	NaN	13198.474609	13553.090820
NS1BP_HUMAN	IVNS1ABP	S	NaN	NaN	38411.285156	NaN	NaN	NaN	10104.601562	18788.167969	10412.311523	17367.800781	...	NaN	17287.406250	NaN	15751.861328	14749.724609	12410.792969	14130.139648	7562.620605	13198.474609	13553.090820

54863 rows × 36 columns

scverse compatibility

The standardized format also allows users to easily convert the protein group tables to widely used -omics formats like anndata.AnnData.

[14]:

def create_anndata_from_pg_matrix(file_path: str, search_engine: str, **kwargs) -> ad.AnnData:
    """Get anndata object from PG matrix."""

    reader = pg_reader_provider.get_reader(search_engine, **kwargs)
    df = reader.import_file(file_path)
    return ad.AnnData(
        X=df.values.T,
        var=df.index.to_frame(),
        obs=df.columns.to_frame(name="sample_id"),
    )

[15]:

adata = create_anndata_from_pg_matrix(
    alphadia_example_path, search_engine="alphadia"
)

adata

[15]:

AnnData object with n_obs × n_vars = 6 × 9364
    obs: 'sample_id'
    var: 'uniprot_ids'

Conclusion

The alphabase protein group reader module provides:

Unified interface for reading protein group tables from multiple search engines
Standardized output format that facilitates cross-engine comparisons and downstream analyses
Flexible quantification options to extract different measurement types (raw, LFQ, iBAQ)
Extensible architecture that supports custom column mappings and new search engines

This standardization enables researchers to focus on biological insights rather than data format complexities.