Protein Group readers
[1]:
%reload_ext autoreload
%autoreload 2
[2]:
# Helper packages
import io
from copy import copy
from typing import Literal, Optional
import anndata as ad
import numpy as np
import pandas as pd
# alphabase
from alphabase.pg_reader import pg_reader_provider
from alphabase.tools.data_downloader import DataShareDownloader
Background
The alphabase.pg_reader module provides a unifying interface to read protein group (PG) tables from different search engines and file formats. It is designed to be easy to use, and to provide a consistent output format in the form of pandas.DataFrames, regardless of the input file format.
Introduction to protein group matrices
Protein group matrices are the primary output for protein-level quantification in proteomics workflows. After search engines identify peptide spectrum matches (PSMs, see PSM-reader tutorial), they aggregate peptide-level evidence to infer protein-level abundances. These protein group tables represent a structured matrix that maps protein groups (features) to samples (observations), with estimated intensity values as entries.
A minimal protein group table could look something like this:
| proteins | sample_1 | sample_2 | sample_3 |
|---|---|---|---|
| P12345 | 1000.5 | 892.3 | 1150.7 |
| Q67890 | 2500.1 | 2780.9 | 2340.2 |
💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into protein groups as features.
In this example, protein P12345 has quantified intensities of 1000.5, 892.3, and 1150.7 in samples 1, 2, and 3 respectively.
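As a rough sketch, the minimal table above can be written as a pandas.DataFrame with protein groups as the index and samples as columns. The variable names `pg_table` and `intensity` are chosen for illustration only.

```python
import pandas as pd

# Toy protein group matrix: protein groups (features) x samples (observations)
pg_table = pd.DataFrame(
    {
        "sample_1": [1000.5, 2500.1],
        "sample_2": [892.3, 2780.9],
        "sample_3": [1150.7, 2340.2],
    },
    index=pd.Index(["P12345", "Q67890"], name="proteins"),
)

# Look up the intensity of protein group P12345 in sample_2
intensity = pg_table.loc["P12345", "sample_2"]
print(intensity)
```

Keeping the identifiers in the index means that the values of the dataframe contain nothing but intensities.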
Search engine outputs
In practice, protein group tables are considerably more complex than this. They contain additional feature-level information about the proteins (e.g., gene names, descriptions) and several quantification types (e.g., raw, LFQ, or iBAQ intensities). This additional information can be valuable for downstream analyses, but it also makes protein group tables much harder to work with, because the exact names and formats differ between search engines, versions, and file formats.
Unifying properties
alphabase aligns the column names to a unified vocabulary, facilitating cross-engine comparisons. We can categorize protein group tables into several common types:
Type 1 — Minimal: A basic features × samples matrix. Only intensity values are stored, with sample names as columns and protein groups as the index. Example: AlphaDIA.
Type 2 — Multiple Intensity Fields: A wide matrix where each sample may appear multiple times with different quantification types (e.g., SampleA_LFQ, SampleB_raw). Example: AlphaPept.
Type 3 — Feature Metadata: A features × samples matrix with one intensity value per sample, plus additional feature-level metadata columns (e.g., gene names, descriptions). Example: DIA-NN.
Type 4 — Combined: A composite structure including both multiple intensity fields (Type 2) and feature-level metadata (Type 3). Examples: Spectronaut, MZTab, MaxQuant.
Code | Read and parse protein group tables
The alphabase pg_reader module lets users parse protein group reports from the most common search engines into a dataframe with a single line of code via its alphabase.pg_reader.pg_reader_provider factory.
All readers return a standardized pandas DataFrame with:
Features as index: Protein identifiers and metadata in the pandas.DataFrame.index
Samples as columns: Sample/run identifiers as column index
Intensity values: Protein quantification data as pandas.DataFrame.values
The readers select the requested quantification type by matching regular expression patterns against the column names of the output table, and they map the desired metadata columns to standardized names.
The unified alphabase format enables seamless comparison and analysis across different search engines, facilitating:
Method comparison studies
Data integration workflows
Standardized downstream analysis pipelines
Available readers
alphabase.pg_reader.pg_reader_provider has registered reader classes for the most common proteomics search engines. A list of implemented readers can be accessed via its reader_dict property:
[3]:
all_registered_readers = pg_reader_provider.reader_dict.keys()
# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)
Registered readers in alphabase:
- alphadia
- alphapept
- diann
- fragpipe
- maxquant
- mztab
- spectronaut
Interact with the reader provider
[ ]:
def get_pg_matrix_example(
    output_dir: Optional[str] = None,
    search_engine: Literal["alphadia", "alphapept", "spectronaut"] = "alphadia",
) -> str:
    """Get example data for the tutorial.

    The function downloads example data and stores it in `output_dir` or,
    if no directory is given, in the system's temporary directory.

    Parameters
    ----------
    output_dir
        Output directory. If `None`, the system's temporary directory is used.
    search_engine
        Search engine whose example report is downloaded.

    Returns
    -------
    File location
    """
    EXAMPLE_URLS = {
        "alphadia": "https://datashare.biochem.mpg.de/s/4AtCZassaUzRR8K",
        "alphapept": "https://datashare.biochem.mpg.de/s/6G6KHJqwcRPQiOO",
        "spectronaut": "https://datashare.biochem.mpg.de/s/2u7U03wvmQDVT4y",
    }
    if search_engine not in EXAMPLE_URLS:
        raise KeyError(f"{search_engine} not found, select one of {', '.join(EXAMPLE_URLS.keys())}")
    if output_dir is None:
        # `tempfile.tempdir` can still be `None`; `gettempdir()` always resolves a path
        from tempfile import gettempdir
        output_dir = gettempdir()
    downloader = DataShareDownloader(url=EXAMPLE_URLS[search_engine], output_dir=output_dir)
    return downloader.download()
Example 1 - AlphaDIA
We demonstrate how to interact with protein group tables via alphabase based on a minimal example output of the AlphaDIA search engine.
First, let’s get some minimal example data for the AlphaDIA output. The example data represents a DIA run of 6 HeLa samples on the Orbitrap Astral.
You can see that the output data contains the feature names in the column pg and the computed protein group intensities per sample in the remaining columns.
[5]:
alphadia_example_path = get_pg_matrix_example(search_engine="alphadia")
# Parse with pandas for visualization purposes
pd.read_csv(alphadia_example_path, sep="\t")
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia1.10.4__pg_matrix.tsv already exists (0.8597145080566406 MB)
[5]:
| | pg | 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03 | 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02 | 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01 | 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03 | 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02 | 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01 |
|---|---|---|---|---|---|---|---|
| 0 | A0A024RBG1 | 5.597816e+05 | 6.285112e+05 | 0.000000e+00 | 3.153867e+05 | 2.753702e+05 | 4.505648e+05 |
| 1 | A0A024RBG1;Q9NZJ9 | 1.331061e+06 | 1.400360e+06 | 1.551987e+06 | 1.606095e+06 | 1.464152e+06 | 1.397026e+06 |
| 2 | A0A075B759;A0A075B767;P62937 | 2.024742e+08 | 8.552202e+06 | 1.837425e+08 | 1.674874e+08 | 1.768245e+08 | 1.595220e+08 |
| 3 | A0A096LP01 | 6.355092e+05 | 4.589410e+05 | 4.184495e+05 | 4.032932e+05 | 2.317467e+05 | 2.731363e+05 |
| 4 | A0A096LP49 | 1.777069e+05 | 1.387537e+05 | 2.513601e+05 | 1.296699e+05 | 1.276095e+05 | 1.623200e+05 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9359 | Q9Y6X3 | 3.898963e+05 | 4.353048e+05 | 4.150456e+05 | 5.069992e+05 | 4.195746e+05 | 3.675962e+05 |
| 9360 | Q9Y6X6 | 1.869312e+05 | 0.000000e+00 | 0.000000e+00 | 2.304623e+05 | 2.421623e+05 | 0.000000e+00 |
| 9361 | Q9Y6X9 | 3.362758e+06 | 3.395221e+06 | 3.541975e+06 | 2.704210e+06 | 3.141519e+06 | 2.995787e+06 |
| 9362 | Q9Y6Y0 | 5.924220e+06 | 6.183842e+06 | 6.190598e+06 | 6.025724e+06 | 5.920595e+06 | 6.754984e+06 |
| 9363 | Q9Y6Y8 | 1.416146e+07 | 1.424916e+07 | 1.342342e+07 | 1.345135e+07 | 1.406395e+07 | 1.349913e+07 |
9364 rows × 7 columns
Then use the pg_reader_provider.get_reader method to get the AlphaDIA protein group reader. Use its import_file method to read the file; the result is returned directly as a pandas.DataFrame.
Note how the dataframe values only contain the actual measurements and how the pg column was mapped to the standardized name uniprot_ids.
[6]:
alphadia_reader = pg_reader_provider.get_reader('alphadia')
# Import the file or a bytestream
alphadia_report = alphadia_reader.import_file(alphadia_example_path)
# Display the result
alphadia_report
[6]:
| uniprot_ids | 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03 | 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02 | 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01 | 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03 | 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02 | 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01 |
|---|---|---|---|---|---|---|
| A0A024RBG1 | 5.597816e+05 | 6.285112e+05 | 0.000000e+00 | 3.153867e+05 | 2.753702e+05 | 4.505648e+05 |
| A0A024RBG1;Q9NZJ9 | 1.331061e+06 | 1.400360e+06 | 1.551987e+06 | 1.606095e+06 | 1.464152e+06 | 1.397026e+06 |
| A0A075B759;A0A075B767;P62937 | 2.024742e+08 | 8.552202e+06 | 1.837425e+08 | 1.674874e+08 | 1.768245e+08 | 1.595220e+08 |
| A0A096LP01 | 6.355092e+05 | 4.589410e+05 | 4.184495e+05 | 4.032932e+05 | 2.317467e+05 | 2.731363e+05 |
| A0A096LP49 | 1.777069e+05 | 1.387537e+05 | 2.513601e+05 | 1.296699e+05 | 1.276095e+05 | 1.623200e+05 |
| ... | ... | ... | ... | ... | ... | ... |
| Q9Y6X3 | 3.898963e+05 | 4.353048e+05 | 4.150456e+05 | 5.069992e+05 | 4.195746e+05 | 3.675962e+05 |
| Q9Y6X6 | 1.869312e+05 | 0.000000e+00 | 0.000000e+00 | 2.304623e+05 | 2.421623e+05 | 0.000000e+00 |
| Q9Y6X9 | 3.362758e+06 | 3.395221e+06 | 3.541975e+06 | 2.704210e+06 | 3.141519e+06 | 2.995787e+06 |
| Q9Y6Y0 | 5.924220e+06 | 6.183842e+06 | 6.190598e+06 | 6.025724e+06 | 5.920595e+06 | 6.754984e+06 |
| Q9Y6Y8 | 1.416146e+07 | 1.424916e+07 | 1.342342e+07 | 1.345135e+07 | 1.406395e+07 | 1.349913e+07 |
9364 rows × 6 columns
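For a Type 1 table like this one, the visible effect of the reader can be approximated in plain pandas. The following is only a sketch on a two-row stand-in table with hypothetical sample names (`run_01`, `run_02`); it illustrates the renaming and indexing step and is not how alphabase implements it.

```python
import io

import pandas as pd

# Two-row stand-in for an AlphaDIA-style pg matrix (tab-separated, hypothetical sample names)
toy_tsv = (
    "pg\trun_01\trun_02\n"
    "A0A024RBG1\t559781.6\t628511.2\n"
    "A0A024RBG1;Q9NZJ9\t1331061.0\t1400360.0\n"
)

toy_report = (
    pd.read_csv(io.StringIO(toy_tsv), sep="\t")
    .rename(columns={"pg": "uniprot_ids"})  # map the raw column name to the unified name
    .set_index("uniprot_ids")  # move the feature identifiers into the index
)
print(toy_report.shape)
```

After this step, only the sample columns remain as values, which mirrors the drop from 7 to 6 columns in the output above.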
Example 2 - AlphaPept with different quantification methods
AlphaPept is a DDA search engine that reports multiple quantification types (raw intensities, LFQ) in its protein group report. We can use the reader to extract these different types of measurements by specifying the measurement_regex parameter.
AlphaPept reports can be in either .hdf or .tsv format. The protein group readers support all common data formats out of the box: text-based formats such as .tsv and .csv, and binary formats such as .hdf (via the extra alphabase[hdf] dependency) and .parquet.
[7]:
# Download example AlphaPept data with multiple quantification types
alphapept_example_path = get_pg_matrix_example(search_engine="alphapept")
pd.read_csv(alphapept_example_path)
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphapept0.5.3__pg_matrix_csv.csv already exists (0.33005523681640625 MB)
[7]:
| | Unnamed: 0 | A_LFQ | B_LFQ | A | B |
|---|---|---|---|---|---|
| 0 | sp|P36578|RL4_HUMAN | 4.669329e+08 | 4.844083e+08 | 4.452735e+08 | 5.060678e+08 |
| 1 | sp|Q9P258|RCC2_HUMAN | 4.074842e+08 | 4.138132e+08 | 4.177856e+08 | 4.035118e+08 |
| 2 | sp|O60518|RNBP6_HUMAN | 4.960386e+06 | 2.022553e+06 | 1.295621e+06 | 5.687318e+06 |
| 3 | sp|P55036|PSMD4_HUMAN | 1.157420e+08 | 1.123571e+08 | 1.130880e+08 | 1.150112e+08 |
| 4 | sp|A1X283|SPD2B_HUMAN | 1.247112e+07 | 1.180582e+07 | 1.380177e+07 | 1.047516e+07 |
| ... | ... | ... | ... | ... | ... |
| 3776 | sp|Q14966|ZN638_HUMAN | NaN | 1.139844e+06 | NaN | 1.139844e+06 |
| 3777 | sp|P84095|RHOG_HUMAN | NaN | 9.466796e+05 | NaN | 9.466796e+05 |
| 3778 | sp|Q99766|ATP5S_HUMAN | NaN | 3.577785e+05 | NaN | 3.577785e+05 |
| 3779 | sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN | NaN | 9.237994e+05 | NaN | 9.237994e+05 |
| 3780 | sp|P51946|CCNH_HUMAN | NaN | 9.278844e+05 | NaN | 9.278844e+05 |
3781 rows × 5 columns
Default - raw intensities
Let’s first use the default option, which imports raw intensities. You can see that the reader automatically extracts only the raw intensity columns and that it parses the UniProt-style headers into separate index levels (proteins, uniprot_ids, ensembl_ids, source_db, is_decoy).
[8]:
# Default: raw intensities
alphapept_reader_default = pg_reader_provider.get_reader('alphapept')
alphapept_reader_default.import_file(alphapept_example_path)
[8]:
| proteins | uniprot_ids | ensembl_ids | source_db | is_decoy | A | B |
|---|---|---|---|---|---|---|
| RL4_HUMAN | P36578 | na | sp | False | 445273477.0318756 | 506067774.6891948 |
| RCC2_HUMAN | Q9P258 | na | sp | False | 417785611.6324583 | 403511752.8857417 |
| RNBP6_HUMAN | O60518 | na | sp | False | 1295621.2466679448 | 5687318.493374016 |
| PSMD4_HUMAN | P55036 | na | sp | False | 113087994.44403341 | 115011156.7335174 |
| SPD2B_HUMAN | A1X283 | na | sp | False | 13801771.733223092 | 10475164.42857083 |
| ... | ... | ... | ... | ... | ... | ... |
| ZN638_HUMAN | Q14966 | na | sp | False | | 1139843.6453892316 |
| RHOG_HUMAN | P84095 | na | sp | False | | 946679.6466570131 |
| ATP5S_HUMAN | Q99766 | na | sp | False | | 357778.52002529387 |
| TIM23_HUMAN;TI23B_HUMAN | O14925;Q5SRD1 | na;na | sp;sp | False | | 923799.3856913601 |
| CCNH_HUMAN | P51946 | na | sp | False | | 927884.4020782198 |
3781 rows × 2 columns
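The header parsing shown above can be approximated with plain string operations. The following is only a sketch, assuming that each member of a protein group follows the `db|accession|entry_name` pattern and that members are separated by commas, as in the example file. The function name `parse_fasta_style_group` is hypothetical, and this is not alphabase's implementation.

```python
def parse_fasta_style_group(group: str) -> dict:
    """Split a comma-separated protein group of 'db|accession|entry_name' members."""
    source_dbs, accessions, entry_names = [], [], []
    for member in group.split(","):
        # Each member is assumed to consist of exactly three pipe-separated fields
        source_db, accession, entry_name = member.split("|")
        source_dbs.append(source_db)
        accessions.append(accession)
        entry_names.append(entry_name)
    # Join the members of a group with semicolons, as in the reader output above
    return {
        "proteins": ";".join(entry_names),
        "uniprot_ids": ";".join(accessions),
        "source_db": ";".join(source_dbs),
    }


parsed = parse_fasta_style_group("sp|O14925|TIM23_HUMAN,sp|Q5SRD1|TI23B_HUMAN")
print(parsed)
```

A header that does not have exactly three fields raises a ValueError in this sketch; a production parser would need to handle such cases explicitly.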
LFQ runs
We can extract the LFQ intensities instead by selecting the pre-configured regular expression with measurement_regex="lfq":
[9]:
# LFQ intensities
alphapept_reader_lfq = pg_reader_provider.get_reader('alphapept', measurement_regex="lfq")
alphapept_reader_lfq.import_file(alphapept_example_path)
[9]:
| proteins | uniprot_ids | ensembl_ids | source_db | is_decoy | A_LFQ | B_LFQ |
|---|---|---|---|---|---|---|
| RL4_HUMAN | P36578 | na | sp | False | 466932936.27537036 | 484408315.44570005 |
| RCC2_HUMAN | Q9P258 | na | sp | False | 407484183.9302226 | 413813180.5879775 |
| RNBP6_HUMAN | O60518 | na | sp | False | 4960386.374516514 | 2022553.3655254466 |
| PSMD4_HUMAN | P55036 | na | sp | False | 115742020.94987468 | 112357130.22767611 |
| SPD2B_HUMAN | A1X283 | na | sp | False | 12471120.728621317 | 11805815.433172602 |
| ... | ... | ... | ... | ... | ... | ... |
| ZN638_HUMAN | Q14966 | na | sp | False | | 1139843.6453892316 |
| RHOG_HUMAN | P84095 | na | sp | False | | 946679.6466570131 |
| ATP5S_HUMAN | Q99766 | na | sp | False | | 357778.52002529387 |
| TIM23_HUMAN;TI23B_HUMAN | O14925;Q5SRD1 | na;na | sp;sp | False | | 923799.3856913601 |
| CCNH_HUMAN | P51946 | na | sp | False | | 927884.4020782198 |
3781 rows × 2 columns
Explore all pre-configured patterns
You can also pass a custom pattern, which can be any valid regular expression. The get_preconfigured_regex method lists all pre-configured regular expressions of a reader:
[10]:
alphapept_reader_default.get_preconfigured_regex()
[10]:
{'raw': '^.*(?<!_LFQ)$', 'lfq': '_LFQ$'}
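To see how these two patterns separate the AlphaPept columns, they can be applied to the sample column names with Python's re module. This is a sketch that assumes the patterns are searched against each column name; how alphabase applies them internally is not shown in this tutorial.

```python
import re

# Patterns as returned by get_preconfigured_regex above
preconfigured = {"raw": r"^.*(?<!_LFQ)$", "lfq": r"_LFQ$"}

# Sample columns of the AlphaPept example report
columns = ["A_LFQ", "B_LFQ", "A", "B"]

# For each pattern, keep the columns it matches
selected = {
    name: [col for col in columns if re.search(pattern, col)]
    for name, pattern in preconfigured.items()
}
print(selected)
```

The negative lookbehind `(?<!_LFQ)` in the raw pattern rejects every name that ends in `_LFQ`, so the two patterns select disjoint sets of columns.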
Example 3 - Spectronaut reports
Next, we explore how users can map non-standard columns to a unified vocabulary, based on a Spectronaut PG report. Spectronaut allows users to flexibly export custom feature-level metadata. alphabase allows users to extract this metadata by adding new columns to the streamlined column mapping.
[11]:
spectronaut_example_path = get_pg_matrix_example(search_engine="spectronaut")
# Parse with pandas for visualization purposes
pd.read_csv(spectronaut_example_path, sep="\t")
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv does not yet exist
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/olsen-wide-format-site-report.tsv successfully downloaded (27.531264305114746 MB)
[11]:
| PG.Genes | PG.Organisms | PG.ProteinNames | PTM.CollapseKey | PTM.FlankingRegion | PTM.ModificationTitle | PTM.Multiplicity | PTM.ProteinId | PTM.SiteAA | PTM.SiteLocation | ... | [27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity | [28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity | [29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity | [30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity | [31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity | [32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity | [33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity | [34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity | [35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity | [36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TRBV19;TRB | Homo sapiens | TVB19_HUMAN;TRBR1_HUMAN | A0A075B6N1_S86_M3 | IAEGYSVSREKKESF | Phospho (STY) | 3 | A0A075B6N1 | S | 86 | ... | 69968.8359375 | 103632.6015625 | 90488.9296875 | 113429.859375 | 96970.2734375 | 61069.171875 | 99673.2734375 | 109199.875 | 112307.4765625 | 112374.84375 |
| 1 | TRBV19;TRB | Homo sapiens | TVB19_HUMAN;TRBR1_HUMAN | A0A075B6N1_S84_M3 | GDIAEGYSVSREKKE | Phospho (STY) | 3 | A0A075B6N1 | S | 84 | ... | 69968.8359375 | 103632.6015625 | 90488.9296875 | 113429.859375 | 96970.2734375 | 61069.171875 | 99673.2734375 | 109199.875 | 112307.4765625 | 112374.84375 |
| 2 | TRBV19;TRB | Homo sapiens | TVB19_HUMAN;TRBR1_HUMAN | A0A075B6N1_Y83_M3 | KGDIAEGYSVSREKK | Phospho (STY) | 3 | A0A075B6N1 | Y | 83 | ... | 69968.8359375 | 103632.6015625 | 90488.9296875 | 113429.859375 | 96970.2734375 | 61069.171875 | 99673.2734375 | 109199.875 | 112307.4765625 | 112374.84375 |
| 3 | TRBV19;TRB | Homo sapiens | TVB19_HUMAN;TRBR1_HUMAN | P0DSE2_S86_M3 | IAEGYSVSREKKESF | Phospho (STY) | 3 | P0DSE2 | S | 86 | ... | 69968.8359375 | 103632.6015625 | 90488.9296875 | 113429.859375 | 96970.2734375 | 61069.171875 | 99673.2734375 | 109199.875 | 112307.4765625 | 112374.84375 |
| 4 | TRBV19;TRB | Homo sapiens | TVB19_HUMAN;TRBR1_HUMAN | P0DSE2_S84_M3 | GDIAEGYSVSREKKE | Phospho (STY) | 3 | P0DSE2 | S | 84 | ... | 69968.8359375 | 103632.6015625 | 90488.9296875 | 113429.859375 | 96970.2734375 | 61069.171875 | 99673.2734375 | 109199.875 | 112307.4765625 | 112374.84375 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54858 | MORC2 | Homo sapiens | MORC2_HUMAN | Q9Y6X9_S739_M2 | ATPSRKRSVAVSDEE | Phospho (STY) | 2 | Q9Y6X9 | S | 739 | ... | 23552.466796875 | 22144.580078125 | 20846.8515625 | 24248.41796875 | 22490.0546875 | 22095.990234375 | 25553.849609375 | 22250.546875 | 14592.869140625 | 19265.998046875 |
| 54859 | MORC2 | Homo sapiens | MORC2_HUMAN | Q9Y6X9-2_S681_M2 | RKRSVAVSDEEEVEE | Phospho (STY) | 2 | Q9Y6X9-2 | S | 681 | ... | 23552.466796875 | 22144.580078125 | 20846.8515625 | 24248.41796875 | 22490.0546875 | 22095.990234375 | 25553.849609375 | 22250.546875 | 14592.869140625 | 19265.998046875 |
| 54860 | MORC2 | Homo sapiens | MORC2_HUMAN | Q9Y6X9-2_S677_M2 | ATPSRKRSVAVSDEE | Phospho (STY) | 2 | Q9Y6X9-2 | S | 677 | ... | 23552.466796875 | 22144.580078125 | 20846.8515625 | 24248.41796875 | 22490.0546875 | 22095.990234375 | 25553.849609375 | 22250.546875 | 14592.869140625 | 19265.998046875 |
| 54861 | IVNS1ABP | Homo sapiens | NS1BP_HUMAN | Q9Y6Y0_M341_M1 | SKSLSFEMQQDELIE | Oxidation (M) | 1 | Q9Y6Y0 | M | 341 | ... | Filtered | 17287.40625 | Filtered | 15751.861328125 | 14749.724609375 | 12410.79296875 | 14130.1396484375 | Filtered | 13198.474609375 | 13553.0908203125 |
| 54862 | IVNS1ABP | Homo sapiens | NS1BP_HUMAN | Q9Y6Y0_S338_M1 | PKLSKSLSFEMQQDE | Phospho (STY) | 1 | Q9Y6Y0 | S | 338 | ... | Filtered | 17287.40625 | Filtered | 15751.861328125 | 14749.724609375 | 12410.79296875 | 14130.1396484375 | 7562.62060546875 | 13198.474609375 | 13553.0908203125 |
54863 rows × 46 columns
By default, the reader extracts the protein names and gene names as feature-level metadata:
[12]:
# Default reader without a custom column mapping
reader = pg_reader_provider.get_reader('spectronaut')
reader.import_file(spectronaut_example_path)
[12]:
| [1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity | [2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity | [3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity | [4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity | [5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity | [6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity | [7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity | [8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity | [9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity | [10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity | ... | [27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity | [28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity | [29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity | [30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity | [31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity | [32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity | [33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity | [34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity | [35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity | [36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| proteins | genes | |||||||||||||||||||||
| TVB19_HUMAN;TRBR1_HUMAN | TRBV19;TRB | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 |
| TRBV19;TRB | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | |
| TRBV19;TRB | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | |
| TRBV19;TRB | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | |
| TRBV19;TRB | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| MORC2_HUMAN | MORC2 | NaN | NaN | 6817.745605 | NaN | NaN | NaN | 18010.679688 | 12501.521484 | 17377.408203 | 13730.358398 | ... | 23552.466797 | 22144.580078 | 20846.851562 | 24248.417969 | 22490.054688 | 22095.990234 | 25553.849609 | 22250.546875 | 14592.869141 | 19265.998047 |
| MORC2 | NaN | NaN | 6817.745605 | NaN | NaN | NaN | 18010.679688 | 12501.521484 | 17377.408203 | 13730.358398 | ... | 23552.466797 | 22144.580078 | 20846.851562 | 24248.417969 | 22490.054688 | 22095.990234 | 25553.849609 | 22250.546875 | 14592.869141 | 19265.998047 | |
| MORC2 | NaN | NaN | 6817.745605 | NaN | NaN | NaN | 18010.679688 | 12501.521484 | 17377.408203 | 13730.358398 | ... | 23552.466797 | 22144.580078 | 20846.851562 | 24248.417969 | 22490.054688 | 22095.990234 | 25553.849609 | 22250.546875 | 14592.869141 | 19265.998047 | |
| NS1BP_HUMAN | IVNS1ABP | NaN | NaN | 38411.285156 | NaN | NaN | NaN | 10104.601562 | 12773.764648 | 10412.311523 | 11411.670898 | ... | NaN | 17287.406250 | NaN | 15751.861328 | 14749.724609 | 12410.792969 | 14130.139648 | NaN | 13198.474609 | 13553.090820 |
| IVNS1ABP | NaN | NaN | 38411.285156 | NaN | NaN | NaN | 10104.601562 | 18788.167969 | 10412.311523 | 17367.800781 | ... | NaN | 17287.406250 | NaN | 15751.861328 | 14749.724609 | 12410.792969 | 14130.139648 | 7562.620605 | 13198.474609 | 13553.090820 |
54863 rows × 36 columns
Let’s say that we are also interested in the amino acid at the PTM site. We can extract this information as well by using the add_column_mapping method:
[13]:
# Add a custom column mapping for the amino acid at the PTM site
reader.add_column_mapping({"ptm_site_amino_acid": "PTM.SiteAA"})
reader.import_file(spectronaut_example_path)
[13]:
| [1] 20180815_QE3_nLC3_AH_DIA_Honly_ind_01.raw.PTM.Quantity | [2] 20180815_QE3_nLC3_AH_DIA_Honly_ind_02.raw.PTM.Quantity | [3] 20180815_QE3_nLC3_AH_DIA_Honly_ind_03.raw.PTM.Quantity | [4] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_01.raw.PTM.Quantity | [5] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_02.raw.PTM.Quantity | [6] 20180815_QE3_nLC3_AH_DIA_Yonly_ind_03.raw.PTM.Quantity | [7] 20180816_QE3_nLC3_AH_DIA_H100_Y100_01.raw.PTM.Quantity | [8] 20180816_QE3_nLC3_AH_DIA_H100_Y100_02.raw.PTM.Quantity | [9] 20180816_QE3_nLC3_AH_DIA_H100_Y100_03.raw.PTM.Quantity | [10] 20180816_QE3_nLC3_AH_DIA_H100_Y100_04.raw.PTM.Quantity | ... | [27] 20180816_QE3_nLC3_AH_DIA_H100_Y25_03.raw.PTM.Quantity | [28] 20180816_QE3_nLC3_AH_DIA_H100_Y25_04.raw.PTM.Quantity | [29] 20180816_QE3_nLC3_AH_DIA_H100_Y25_05.raw.PTM.Quantity | [30] 20180816_QE3_nLC3_AH_DIA_H100_Y25_06.raw.PTM.Quantity | [31] 20180816_QE3_nLC3_AH_DIA_H100_Y50_01.raw.PTM.Quantity | [32] 20180816_QE3_nLC3_AH_DIA_H100_Y50_02.raw.PTM.Quantity | [33] 20180816_QE3_nLC3_AH_DIA_H100_Y50_03.raw.PTM.Quantity | [34] 20180816_QE3_nLC3_AH_DIA_H100_Y50_04.raw.PTM.Quantity | [35] 20180816_QE3_nLC3_AH_DIA_H100_Y50_05.raw.PTM.Quantity | [36] 20180816_QE3_nLC3_AH_DIA_H100_Y50_06.raw.PTM.Quantity | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| proteins | genes | ptm_site_amino_acid | |||||||||||||||||||||
| TVB19_HUMAN;TRBR1_HUMAN | TRBV19;TRB | S | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 |
| S | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | ||
| Y | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | ||
| S | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | ||
| S | NaN | NaN | NaN | NaN | NaN | NaN | 89374.656250 | NaN | 90181.578125 | 96197.070312 | ... | 69968.835938 | 103632.601562 | 90488.929688 | 113429.859375 | 96970.273438 | 61069.171875 | 99673.273438 | 109199.875000 | 112307.476562 | 112374.843750 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| MORC2_HUMAN | MORC2 | S | NaN | NaN | 6817.745605 | NaN | NaN | NaN | 18010.679688 | 12501.521484 | 17377.408203 | 13730.358398 | ... | 23552.466797 | 22144.580078 | 20846.851562 | 24248.417969 | 22490.054688 | 22095.990234 | 25553.849609 | 22250.546875 | 14592.869141 | 19265.998047 |
| S | NaN | NaN | 6817.745605 | NaN | NaN | NaN | 18010.679688 | 12501.521484 | 17377.408203 | 13730.358398 | ... | 23552.466797 | 22144.580078 | 20846.851562 | 24248.417969 | 22490.054688 | 22095.990234 | 25553.849609 | 22250.546875 | 14592.869141 | 19265.998047 | ||
| S | NaN | NaN | 6817.745605 | NaN | NaN | NaN | 18010.679688 | 12501.521484 | 17377.408203 | 13730.358398 | ... | 23552.466797 | 22144.580078 | 20846.851562 | 24248.417969 | 22490.054688 | 22095.990234 | 25553.849609 | 22250.546875 | 14592.869141 | 19265.998047 | ||
| NS1BP_HUMAN | IVNS1ABP | M | NaN | NaN | 38411.285156 | NaN | NaN | NaN | 10104.601562 | 12773.764648 | 10412.311523 | 11411.670898 | ... | NaN | 17287.406250 | NaN | 15751.861328 | 14749.724609 | 12410.792969 | 14130.139648 | NaN | 13198.474609 | 13553.090820 |
| S | NaN | NaN | 38411.285156 | NaN | NaN | NaN | 10104.601562 | 18788.167969 | 10412.311523 | 17367.800781 | ... | NaN | 17287.406250 | NaN | 15751.861328 | 14749.724609 | 12410.792969 | 14130.139648 | 7562.620605 | 13198.474609 | 13553.090820 |
54863 rows × 36 columns
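Because the metadata ends up as levels of a pandas.MultiIndex, it can be queried with standard pandas tools. The sketch below uses a small toy dataframe with hypothetical sample names (`run_01`, `run_02`) that mimics the index layout of the output above.

```python
import pandas as pd

# Toy index with the same level names as the reader output above
index = pd.MultiIndex.from_tuples(
    [
        ("MORC2_HUMAN", "MORC2", "S"),
        ("NS1BP_HUMAN", "IVNS1ABP", "M"),
        ("NS1BP_HUMAN", "IVNS1ABP", "S"),
    ],
    names=["proteins", "genes", "ptm_site_amino_acid"],
)
toy = pd.DataFrame(
    {"run_01": [23552.5, float("nan"), float("nan")], "run_02": [22144.6, 17287.4, 17287.4]},
    index=index,
)

# Keep only rows whose PTM site is a serine
serine_sites = toy[toy.index.get_level_values("ptm_site_amino_acid") == "S"]

# Turn the index levels into regular columns, e.g. before exporting
flat = toy.reset_index()
print(len(serine_sites), list(flat.columns))
```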
scVerse compatibility
The standardized format also allows users to easily convert the protein group tables to widely used -omics formats like anndata.AnnData.
[14]:
def create_anndata_from_pg_matrix(file_path: str, search_engine: str, **kwargs) -> ad.AnnData:
    """Get anndata object from PG matrix."""
    reader = pg_reader_provider.get_reader(search_engine, **kwargs)
    df = reader.import_file(file_path)
    # Transpose so that samples become observations and protein groups become variables
    return ad.AnnData(
        X=df.values.T,
        var=df.index.to_frame(),
        obs=df.columns.to_frame(name="sample_id"),
    )
[15]:
adata = create_anndata_from_pg_matrix(
    alphadia_example_path, search_engine="alphadia"
)
adata
[15]:
AnnData object with n_obs × n_vars = 6 × 9364
obs: 'sample_id'
var: 'uniprot_ids'
Conclusion
The alphabase protein group reader module provides:
Unified interface for reading protein group tables from multiple search engines
Standardized output format that facilitates cross-engine comparisons and downstream analyses
Flexible quantification options to extract different measurement types (raw, LFQ, iBAQ)
Extensible architecture that supports custom column mappings and new search engines
This standardization enables researchers to focus on biological insights rather than data format complexities.