PSM readers

[1]:
%reload_ext autoreload
%autoreload 2
[2]:
# Helper packages
import io
from copy import copy
import numpy as np
import pandas as pd

# alphabase
from alphabase.psm_reader import psm_reader_provider, psm_reader_yaml

Background

The alphabase.psm_reader module provides a unifying interface to read PSM tables from different search engines and file formats. It is designed to be easy to use, and to provide a consistent output format in the form of pandas.DataFrames, regardless of the input file format.

Introduction to peptide spectrum matches (PSMs)

Peptide spectrum matches (PSMs) are the primary output of proteomics search engines. In a PSM table, each row typically represents a single peptide-spectrum-match, i.e. a peptide sequence that the proteomics search engine identified to be compatible with an observed mass spectrum in a given sample. PSM tables contain information about both 1) the peptide sequence, 2) the spectrum, as well as 3) the score assigned to the PSM by the search engine.

A minimal PSM table could look something like this:

sample_id

peptide

confidence_score

scan_id

1

PEPTIDE

0.99

1234

In this example, the search engine identified the peptide PEPTIDE as a match to the spectrum 1234 sample with ID 1, and assigned a confidence score of 0.99 to this match.

Search engine outputs

In reality, PSM tables are significantly more complex than this, as they contain additional information on both the spectrum (e.g. the sample run, detector type, precursor intensities, …), the peptide (e.g. the protein it belongs to, the modifications it carries), and the peptide search (quality control measures). This additional information can be extremely useful for downstream analyses, but also makes PSM tables more difficult to work with, as the exact names may differ between search engines, versions, and file formats.

Unifying properties

Alphabase aligns the column names to a unified vocabulary, as defined in the alphabase.psm_reader.psm_reader_yaml mapping. We can explore this standardization, that facilitates cross-engine comparisons. Note that some search engines use version-dependent names for the same property (indicated by lists) and alphabase can deal with this ambiguity. To support both Bruker and Thermo data, we did not use Scan Number in the output dataframe but spec_idx (starts with 0). spec_idx = scan_num - 1 in thermo data.

[3]:
# Generate a dataframe that maps the report-specific column names to the unified column names based on alphabase mapping
psm_column_mapping = (
    pd.DataFrame.from_dict({
        mapping.get("reader_type", "psm"): mapping.get("column_mapping", {}) for k, mapping in psm_reader_yaml.items() if k != "modification_mappings"
    })
    .T
    .rename_axis(columns="alphabase (unified name)")
    )

# Order by importance (number of search engines with corresponding column)
columns_ordered = psm_column_mapping.isna().mean(axis=0).sort_values(ascending=True)
psm_column_mapping = (
    psm_column_mapping.loc[:, columns_ordered.index]
    .sort_index(axis=0)
    .dropna(how="all", axis=1)
)

# Compute summary
summary = (
    psm_column_mapping
    .agg([lambda x: x.isna().mean()])
)



## Visualize the mapping of PSM columns to alphabase unified columns
# Stylize with pandas CSS class
headers = {
    'selector': 'th',#'th:not(.index_name)',
    'props': 'background-color: #18456d; color: white;'
}
cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; font-weight:bold; position: sticky'
}

summary_row = {"selector": "tbody tr:last-child", "props": [("background-color", "#efefef"), ("font-weight", "bold")]}

caption = {
    "selector": "caption",
    "props": "caption-side: top; font-style: italic; font-size: 12pt; text-align:left; margin-bottom: 10pt;"
}

# Visualize
psm_column_mapping_stylized = (
    psm_column_mapping
        .style

        .concat(
            summary
                .style
                .relabel_index(["Missing"])
                .bar(color='#cccccc', vmin=0, vmax=1)
        )
        .set_caption("Mapping of PSM columns to alphabase unified columns")
        .set_table_styles(
            [headers, cell_hover, index_names, summary_row, caption]
        )

)

psm_column_mapping_stylized
[3]:
Mapping of PSM columns to alphabase unified columns
alphabase (unified name) raw_name charge rt mobility proteins sequence score uniprot_ids genes ccs fdr scan_num decoy precursor_mz query_id modified_sequence intensity rt_stop rt_start peptide_fdr mods fragment_intensity fragment_mz fragment_type fragment_charge fragment_series fragment_loss_type scannr spec_idx protein_fdr
alphadia run charge rt_observed mobility proteins sequence score uniprot_ids genes ccs fdr nan nan nan nan nan intensity rt_stop rt_start nan mods nan nan nan nan nan nan nan nan nan
alphapept raw_name charge rt mobility nan nan score nan nan nan q_value scan_no decoy mz query_idx nan nan nan nan nan nan nan nan nan nan nan nan nan raw_idx nan
diann Run Precursor.Charge RT ['IM', 'IonMobility'] Protein.Names Stripped.Sequence CScore Protein.Ids Genes CCS Q.Value MS2.Scan nan nan nan nan nan RT.Stop RT.Start nan nan nan nan nan nan nan nan nan nan nan
library_reader_base ReferenceRun PrecursorCharge ['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime'] ['Mobility', 'IonMobility', 'PrecursorIonMobility'] ['ProteinId', 'ProteinID', 'ProteinName', 'Protein Name'] ['PeptideSequence', 'StrippedPeptide'] nan ['UniProtIds', 'UniProtID', 'UniprotId'] ['GeneName', 'Genes', 'Gene'] CCS nan nan nan PrecursorMz nan ['ModifiedPeptideSequence', 'ModifiedPeptide'] nan nan nan nan nan ['LibraryIntensity', 'RelativeIntensity', 'RelativeFragmentIntensity', 'RelativeFragmentIonIntensity'] ['ProductMz'] ['FragmentType', 'FragmentIonType', 'ProductType', 'ProductIonType'] ['FragmentCharge', 'FragmentIonCharge', 'ProductCharge', 'ProductIonCharge'] ['FragmentSeriesNumber', 'FragmentNumber'] ['FragmentLossType', 'FragmentIonLossType', 'ProductLossType', 'ProductIonLossType'] nan nan nan
maxquant Raw file Charge Retention time ['Mobility', 'IonMobility', 'K0', '1/K0'] Proteins Sequence Score nan ['Gene Names', 'Gene names'] CCS nan ['Scan number', 'MS/MS scan number', 'MS/MS Scan Number', 'Scan index'] Reverse m/z nan nan Intensity nan nan nan nan nan nan nan nan nan nan nan nan nan
msfragger_pepxml raw_name assumed_charge retention_time_sec ion_mobility protein peptide expect nan nan nan nan start_scan nan nan spectrum nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
pfind raw_name Charge RT nan Proteins Sequence Final_Score Proteins nan nan Q-value Scan_No ['Target/Decoy', 'Targe/Decoy'] nan File_Name nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
sage filename charge rt mobility proteins stripped_peptide sage_discriminant_score nan nan nan spectrum_q nan is_decoy nan nan peptide nan nan nan peptide_q nan nan nan nan nan nan nan scannr nan protein_q
spectronaut ReferenceRun PrecursorCharge ['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime'] ['Mobility', 'IonMobility', 'PrecursorIonMobility'] ['Protein Name', 'ProteinId', 'ProteinID', 'ProteinName', 'ProteinGroup', 'ProteinGroups'] ['StrippedPeptide', 'PeptideSequence'] nan ['UniProtIds', 'UniProtID', 'UniprotId'] ['Genes', 'Gene', 'GeneName', 'GeneNames'] CCS nan nan nan PrecursorMz nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
spectronaut_report R.FileName charge ['EG.ApexRT', 'EG.MeanApexRT'] ['FG.ApexIonMobility'] ['PG.ProteinNames', 'PG.ProteinGroups'] nan nan PG.UniProtIds PG.Genes nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
Missing 0.000000 0.000000 0.000000 0.100000 0.100000 0.200000 0.300000 0.400000 0.400000 0.500000 0.500000 0.500000 0.600000 0.600000 0.700000 0.800000 0.800000 0.800000 0.800000 0.900000 0.900000 0.900000 0.900000 0.900000 0.900000 0.900000 0.900000 0.900000 0.900000 0.900000

Unifying peptide modifications

Alphabase further unifies representations of peptide modifications between the different search engines to the community-driven unimod format.

E.g. the MaxQuant-internal representations of phosphorylated serines are mapped to the unimod representation:

alphabase/UniMod

MaxQuant

Phospho@S

S(Phospho (S)), S(Phospho (ST)), S(Phospho (STY)), S(Phospho (STYDH)), S(ph), pS

See alphabase.psm_reader.psm_reader_yaml["modification_mappings"] for all mappings as parsed dictionaries and alphabase.constants.const_files.psm_reader_yaml for the underlying file.

Code | Read and parse PSM tables

The alphabase psm_reader module enables users to parse proteomics PSM reports to a dataframe for most common search engines with a single line of code via its alphabase.psm_reader.psm_reader_provider factory.

Available readers

alphabase.psm_reader.psm_reader_provider has registered some basic reader classes. A list of implemented readers can be accessed via its reader_dict property:

[4]:
all_registered_readers = psm_reader_provider.reader_dict.keys()

# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)
Registered readers in alphabase:
        - alphadia
        - alphadia_parquet
        - alphapept
        - diann
        - maxquant
        - msfragger
        - msfragger_pepxml
        - msfragger_psm_tsv
        - openswath
        - pfind
        - pfind3
        - sage_parquet
        - sage_tsv
        - speclib_tsv
        - spectronaut
        - spectronaut_report
        - swath

Interact with the reader provider

Example 1 - MaxQuant

We demonstrate how to interact with PSM tables via alphabase based on a minimal example output of the MaxQuant search engine.

First, let’s provide some minimal input, which is the header of a real MaxQuant report

[5]:
maxquant_example = io.StringIO(
'''Raw file Scan number     Scan index      Sequence        Length  Missed cleavages        Modifications   Modified sequence       Oxidation (M) Probabilities     Oxidation (M) Score diffs       Acetyl (Protein_N-term) Oxidation (M)   Proteins        Charge  Fragmentation   Mass analyzer   Type    Scan event number       Isotope index   m/z     Mass    Mass error [ppm]        Mass error [Da] Simple mass error [ppm] Retention time  PEP     Score   Delta score     Score diff      Localization prob       Combinatorics   PIF     Fraction of total spectrum      Base peak fraction      Precursor full scan number      Precursor Intensity     Precursor apex fraction Precursor apex offset   Precursor apex offset time      Matches Intensities     Mass deviations [Da]    Mass deviations [ppm]   Masses  Number of matches       Intensity coverage      Peak coverage   Neutral loss level      ETD identification type Reverse All scores      All sequences   All modified sequences  Reporter PIF    Reporter fraction       id      Protein group IDs       Peptide ID      Mod. peptide ID Evidence ID     Oxidation (M) site IDs
20190402_QX1_SeVW_MA_HeLa_500ng_LC11        81358   73979   AAAAAAAAAPAAAATAPTTAATTAATAAQ   29      0       Unmodified      _(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation (M))PAAAATAPTTAATTAATAAQ_                 0       0       sp|P37108|SRP14_HUMAN   3       HCD     FTMS    MULTI-MSMS      13      1       790.07495       2367.203        0.35311 0.00027898      -0.061634807    70.261  0.012774        41.423  36.666  NaN     NaN     1       0       0       0       81345   10653955        0.0338597821787898      -11     0.139877319335938       y1;y2;y3;y4;y11;y1-NH3;y2-NH3;a2;b2;b3;b4;b5;b6;b7;b8;b9;b11;b12;b6(2+);b8(2+);b13(2+);b18(2+)  2000000;2000000;300000;400000;200000;1000000;400000;300000;600000;1000000;2000000;3000000;3000000;3000000;3000000;2000000;600000;500000;1000000;2000000;300000;200000   5.2861228709844E-06;-6.86980268369553E-05;-0.00238178789771837;0.000624715964988809;-0.0145624692099773;-0.000143471782706683;-0.000609501446461991;-0.000524972720768346;0.00010190530804266;5.8620815195809E-05;0.000229901232955854;-0.000108750048696038;-0.000229593152369034;0.00183148682538103;0.00276641182404092;0.000193118923334623;0.00200988580445483;0.000102216846016745;5.86208151389656E-05;0.000229901232955854;-0.00104559184393338;0.00525030008475369     0.0359413365445091;-0.314964433555295;-8.23711898839045;1.60102421155213;-14.8975999917227;-1.10320467763838;-3.03102462870716;-4.56152475051625;0.712219104095465;0.273777366204575;0.806231096969562;-0.305312183824154;-0.537399178230218;3.67572664689217;4.85930954169285;0.301587577451224;2.48616190909398;0.116225745519871;0.273777365939099;0.806231096969562;-2.19774169175011;7.53961026980589      147.076413378177;218.113601150127;289.153028027798;390.197699998035;977.50437775671;130.050013034583;201.087592852046;115.087114392821;143.081402136892;214.118559209185;285.155501716567;356.192954155649;427.230188786552;498.265241494374;569.301420357176;640.341107437877;808.429168310795;879.468189767554;214.118559209185;285.155501716567;475.757386711244;696.362265007215    22      0.262893575628735       0.0826446280991736      None    Unknown         41.4230894199432;4.75668724862449;3.9515580701967       AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQILQGK;PVTLWITVTHMQADEVSVWR        _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVSVTQILQGK_;_PVTLWITVTHMQADEVSVWR_                  0       1443    0       0       0
20190402_QX1_SeVW_MA_HeLa_500ng_LC11        81391   74010   AAAAAAAAAAPAAAATAPTTAATTAATAAQ  29      0       Unmodified      _AAAAAAAAAPAAAATAPTTAATTAATAAQ_                 0       0       sp|P37108|SRP14_HUMAN   2       HCD     FTMS    MULTI-MSMS      14      0       1184.6088       2367.203        0.037108        4.3959E-05      1.7026696       70.287  7.1474E-09      118.21  100.52  NaN     NaN     1       0       0       0       81377   9347701 0.166790347889974       -10     0.12664794921875        y1;y2;y3;y4;y5;y9;y12;y13;y14;y20;y13-H2O;y20-H2O;y1-NH3;y20-NH3;b3;b4;b5;b6;b7;b8;b9;b11;b12;b13;b14;b15;b16;b19;b15-H2O;b16-H2O       500000;600000;200000;400000;200000;100000;200000;1000000;200000;300000;200000;100000;100000;70000;300000;900000;2000000;3000000;5000000;8000000;6000000;600000;800000;600000;200000;300000;200000;300000;300000;1000000 -0.000194444760495571;0.000149986878682284;0.000774202587820128;-0.0002445094036716;0.000374520568641401;-0.00694293246522193;-0.0109837291331587;-0.0037745820627606;-0.000945546471939451;0.00152326440706929;0.00506054832726477;0.00996886361417637;6.25847393393997E-05;-0.024881067836759;-3.11821549132674E-05;-0.000183099230639527;0.000161332473453513;0.000265434980121881;0.000747070697229901;0.000975534518261156;0.00101513939785036;0.00651913000274362;0.0058584595163893;0.00579536744021425;0.00131097834105276;-0.0131378531671089;0.00472955218901916;-0.00161006322559842;-0.00201443239325272;0.0227149399370319 -1.32206444236914;0.687655553213019;2.6775131607882;-0.626628140021726;0.811995006209331;-8.6203492854282;-10.1838066275079;-3.21078702288986;-0.758483069159249;0.881072738747222;4.37168212373889;5.82682888353564;0.481236695337485;-14.5343986203644;-0.145630261806375;-0.642102166533079;0.452935954800214;0.621293379181583;1.49934012872483;1.71355878380837;1.58531240493271;8.06399202403175;6.6614096214532;6.09718023739784;1.28333378040908;-11.7030234519348;3.96235146626144;-1.07856912288932;-1.82370619437775;19.3220953109188        147.07661310906;218.113382465221;289.149872037312;390.198569223404;461.235063981231;805.411965958065;1078.54847749073;1175.59403219566;1246.62831694787;1728.87474561429;1157.57463237897;1710.85573532879;130.049806978061;1711.87460084504;214.118649012155;285.155914717031;356.192684073126;427.22969375842;498.266325910503;569.303211234482;640.340285417402;808.424659066597;879.462433524883;950.49961040476;1021.54120858166;1122.60333588727;1193.62258226971;1492.77704268533;1104.58164778019;1175.59403219566      30      0.474003002083763       0.167630057803468       None    Unknown         118.209976573419;17.6937689289157;17.2534171481793      AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSVLYLK;VGSSVPSKASELVVMGDHDAARR      _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxidation (M))QSEQLQSVLYLK_;_VGSSVPSKASELVVMGDHDAARR_                 1       1443    0       0       1
20190402_QX1_SeVW_MA_HeLa_500ng_LC11        107307  98306   AAAAAAAGDSDSWDADAFSVEDPVRK      26      1       Acetyl (Protein_N-term) _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_                   1       0       sp|O75822|EIF3J_HUMAN   3       HCD     FTMS    MULTI-MSMS      10      2       879.06841       2634.1834       -0.93926        -0.00082567     -3.2012471      90.978  2.1945E-12      148.95  141.24  NaN     NaN     1       0       0       0       107297  10193939        0.267970762043589       -8      0.10211181640625        y1;y2;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14;y15;y17;y18;y19;y20;y21;y23;y21-H2O;y1-NH3;y19-NH3;y14(2+);y16(2+);y22(2+);a2;b2;b3;b4;b5;b6;b7     300000;200000;3000000;600000;1000000;500000;2000000;1000000;1000000;1000000;90000;1000000;400000;900000;1000000;400000;3000000;2000000;1000000;400000;100000;200000;200000;80000;100000;200000;200000;2000000;5000000;5000000;5000000;2000000;300000    1.34859050149316E-07;-6.05140996867704E-06;2.27812602133781E-05;0.00128986659160546;-0.00934536073077652;0.000941953783126337;-0.00160424237344614;-0.00239257341399934;-0.00111053968612396;-0.00331340710044969;0.00330702864630439;0.000963683996815234;0.00596290290945944;-0.00662057038289277;-0.0117122701335575;0.00777853472800416;0.0021841542961738;0.000144322111736983;-0.00087403893667215;0.0197121595674616;-0.021204007680808;-0.000308954599830713;-0.026636719419912;-0.0137790992353075;0.00596067266928912;-0.0077053835773313;9.11402199221811E-06;-0.000142539300128419;-0.000251999832926231;1.90791054137662E-05;-0.00236430185879044;-9.54583337602344E-05;-0.000556959493223985      0.000916705048437201;-0.0199575598103408;0.0456231928690862;2.09952637717462;-12.5708704058425;1.11808305811426;-1.72590731777249;-2.22239181008062;-0.967696370445928;-2.62418809422166;2.47964286628144;0.665205752892023;3.64753748704453;-3.84510115530963;-6.08782672045773;3.81508105974837;1.04209904973991;0.0666012719936656;-0.390545453668809;8.28224925531311;-9.55133250134922;-2.37499239179248;-12.8127653858411;-16.846761946123;6.48662354975264;-6.67117082062383;0.0580151981289049;-0.770098855873447;-0.983876895688683;0.0583162347158579;-5.93738717724506;-0.203431522818505;-1.03087538746314  147.112804035741;303.21392125011;499.33507018564;614.360746132308;743.413974455831;842.472101057517;929.506675663573;1076.57587791081;1147.61170966489;1262.6408555643;1333.67134891635;1448.700635293;1634.77494902759;1721.81956091078;1923.88362405243;2038.89107627957;2095.9181343836;2166.95728800359;2237.99542015244;2380.04906152953;2220.00518543488;130.0865640237;2078.92040615582;817.907873297785;918.917619246831;1155.02717356753;157.097144992378;185.0922112678;256.129434516133;327.166277224995;398.205774393759;469.240619338034;540.278194626993  33      0.574496146107112       0.14410480349345        None    Unknown         148.951235201399;7.71201258444522;7.36039532447559      AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDGER;HTLTSFWNFKAGCEEKCYSNR  _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_;_PSRQESELM(Oxidation (M))WQWVDQRSDGER_;_HTLTSFWNFKAGCEEKCYSNR_                    2       625     1       1       2       '''
)

# Parse with pandas for visualization purposes
pd.read_csv(copy(maxquant_example), sep="\t")
[5]:
Raw file Scan number Scan index Sequence Length Missed cleavages Modifications Modified sequence Oxidation (M) Probabilities Oxidation (M) Score diffs ... All sequences All modified sequences Reporter PIF Reporter fraction id Protein group IDs Peptide ID Mod. peptide ID Evidence ID Oxidation (M) site IDs
0 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81358 73979 AAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 Unmodified _(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation ... NaN NaN ... AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQ... _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVS... NaN NaN 0 1443 0 0 0 NaN
1 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81391 74010 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 Unmodified _AAAAAAAAAPAAAATAPTTAATTAATAAQ_ NaN NaN ... AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSV... _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxid... NaN NaN 1 1443 0 0 1 NaN
2 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 107307 98306 AAAAAAAGDSDSWDADAFSVEDPVRK 26 1 Acetyl (Protein_N-term) _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... NaN NaN ... AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDG... _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... NaN NaN 2 625 1 1 2 NaN

3 rows × 61 columns

Then use the psm_reader_provider.get_reader method to get the maxquant-report reader. Use the import_file method to read the file, which is directly returned as a pandas DataFrame.

[6]:
maxquant_reader = psm_reader_provider.get_reader('maxquant')

# Import the file or a bytestream
maxquant_report = maxquant_reader.import_file(maxquant_example)

# The parsed PSM is also stored in the reader class as `psm_df` attribute
# maxquant_report = maxquant_reader.psm_df

maxquant_report
/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/alphabase/psm_reader/psm_reader.py:318: UserWarning: Unknown modifications: {'_(Acetyl (Protein_N-term))'}. Precursors with unknown modifications will be removed.
  warnings.warn(
[6]:
sequence charge rt scan_num raw_name precursor_mz score proteins decoy spec_idx mods mod_sites nAA rt_norm
0 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 2 70.287 81391 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 1184.6088 118.21 sp|P37108|SRP14_HUMAN 0 81390 30 0.772571

Example 2 - Set custom arguments

One can also customize the reader by setting specific arguments. For example, one can set more stringent fdr filters (default: \(fdr=0.01\)). We showcase this on the example of a DIANN PSM report table.

[7]:
diann_tsv_example = io.StringIO(r'''File.Name       Run     Protein.Group   Protein.Ids     Protein.Names   Genes   PG.Quantity     PG.Normalised   PG.MaxLFQ       Genes.Quantity  Genes.Normalised        Genes.MaxLFQ    Genes.MaxLFQ.Unique     Modified.Sequence       Stripped.Sequence       Precursor.Id    Precursor.Charge        Q.Value Global.Q.Value  Protein.Q.Value PG.Q.Value      Global.PG.Q.Value       GG.Q.Value      Translated.Q.Value      Proteotypic     Precursor.Quantity      Precursor.Normalised    Precursor.Translated    Quantity.Quality        RT      RT.Start        RT.Stop iRT     Predicted.RT    Predicted.iRT   Lib.Q.Value     Ms1.Profile.Corr        Ms1.Area        Evidence        Spectrum.Similarity     Mass.Evidence   CScore  Decoy.Evidence  Decoy.CScore    Fragment.Quant.Raw      Fragment.Quant.Corrected        Fragment.Correlations   MS2.Scan        IM      iIM     Predicted.IM    Predicted.iIM
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636.d  20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636       Q9UH36  Q9UH36          SRRD    3296.49 3428.89 3428.89 3296.49 3428.89 3428.89 3428.89 (UniMod:1)AAAAAAALESWQAAAPR     AAAAAAALESWQAAAPR       (UniMod:1)AAAAAAALESWQAAAPR2    2       3.99074e-05     1.96448e-05     0.000159821     0.000159821     0.000146135     0.000161212     0       1       3296.49 3428.89 3296.49 0.852479        19.9208 19.8731 19.9685 123.9   19.8266 128.292 0       0.960106        5308.05 1.96902 0.683134        0.362287        0.999997        1.23691 3.43242e-05     1212.01;2178.03;1390.01;1020.01;714.008;778.008;        1212.01;1351.73;887.591;432.92;216.728;732.751; 0.956668;0.757581;0.670497;0.592489;0.47072;0.855203;   30053   1.19708 1.19328 1.19453 1.19469
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642.d  20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642       Q9UH36  Q9UH36          SRRD    2365    2334.05 2334.05 2365    2334.05 2334.05 2334.05 (UniMod:1)AAAAAAALESWQAAAPR     AAAAAAALESWQAAAPR       (UniMod:1)AAAAAAALESWQAAAPR2    2       0.000184434     1.96448e-05     0.000596659     0.000596659     0.000146135     0.000604961     0       1       2365    2334.05 2365    0.922581        19.905  19.8573 19.9527 123.9   19.782  128.535 0       0.940191        4594.04 1.31068 0.758988        0       0.995505        0.28633 2.12584e-06     1209.02;1210.02;1414.02;1051.01;236.003;130.002;        1209.02;1109.89;732.154;735.384;0;46.0967;      0.919244;0.937624;0.436748;0.639369;0.296736;0.647924;  30029   1.195   1.19328 1.19381 1.19339
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648.d  20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648       Q9UH36  Q9UH36          SRRD    1664.51 1635.46 1635.47 1664.51 1635.46 1635.47 1635.47 (UniMod:1)AAAAAAALESWQAAAPR     AAAAAAALESWQAAAPR       (UniMod:1)AAAAAAALESWQAAAPR2    2       0.000185123     1.96448e-05     0.000307409     0.000307409     0.000146135     0.000311332     0       1       1664.51 1635.46 1664.51 0.811147        19.8893 19.8416 19.937  123.9   19.7567 128.896 0       0.458773        6614.06 1.7503  0.491071        0.00111683      0.997286        1.92753 2.80543e-05     744.01;1708.02;1630.02;1475.02;0;533.006;       322.907;808.594;577.15;536.033;0;533.006;       0.760181;0.764072;0.542005;0.415779;0;0.913438; 30005   1.19409 1.19328 1.19323 1.19308
''')

pd.read_csv(copy(diann_tsv_example), sep="\t")
[7]:
File.Name Run Protein.Group Protein.Ids Protein.Names Genes PG.Quantity PG.Normalised PG.MaxLFQ Genes.Quantity ... Decoy.Evidence Decoy.CScore Fragment.Quant.Raw Fragment.Quant.Corrected Fragment.Correlations MS2.Scan IM iIM Predicted.IM Predicted.iIM
0 F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 Q9UH36 NaN SRRD 3296.49 3428.89 3428.89 3296.49 ... 1.23691 0.000034 1212.01;2178.03;1390.01;1020.01;714.008;778.008; 1212.01;1351.73;887.591;432.92;216.728;732.751; 0.956668;0.757581;0.670497;0.592489;0.47072;0.... 30053 1.19708 1.19328 1.19453 1.19469
1 F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 Q9UH36 NaN SRRD 2365.00 2334.05 2334.05 2365.00 ... 0.28633 0.000002 1209.02;1210.02;1414.02;1051.01;236.003;130.002; 1209.02;1109.89;732.154;735.384;0;46.0967; 0.919244;0.937624;0.436748;0.639369;0.296736;0... 30029 1.19500 1.19328 1.19381 1.19339
2 F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 Q9UH36 NaN SRRD 1664.51 1635.46 1635.47 1664.51 ... 1.92753 0.000028 744.01;1708.02;1630.02;1475.02;0;533.006; 322.907;808.594;577.15;536.033;0;533.006; 0.760181;0.764072;0.542005;0.415779;0;0.913438; 30005 1.19409 1.19328 1.19323 1.19308

3 rows × 52 columns

By passing the more stringent fdr filter (\(fdr_{\text{stringent}} = 10^{-4}\)) in the second function call, two precursors with an fdr of \(\sim0.0002\) are removed from the resulting table

[8]:
# Read PSM reports with one liners
diann_psm_standard = psm_reader_provider.get_reader('diann').import_file(copy(diann_tsv_example))
diann_psm_custom_fdr = psm_reader_provider.get_reader('diann', fdr=1e-4).import_file(copy(diann_tsv_example))

print("Number of observations (Standard filter):", len(diann_psm_standard))
print("Number of observations (Stringent filter):", len(diann_psm_custom_fdr))
Number of observations (Standard filter): 3
Number of observations (Stringent filter): 1

Conclusion

Overall, this tutorial

  • Explained how alphabase maps different search engine outputs to a unified format

  • Provides examples on how to read PSM tables from different search engines

  • Gives an overview over the available and implemented readers