PSM readers¶
[1]:
%reload_ext autoreload
%autoreload 2
[2]:
# Helper packages
import io
from copy import copy
import numpy as np
import pandas as pd
# alphabase
from alphabase.psm_reader import psm_reader_provider, psm_reader_yaml
Background¶
The alphabase.psm_reader module provides a unifying interface to read PSM tables from different search engines and file formats. It is designed to be easy to use, and to provide a consistent output format in the form of pandas.DataFrames, regardless of the input file format.
Introduction to peptide spectrum matches (PSMs)¶
Peptide spectrum matches (PSMs) are the primary output of proteomics search engines. In a PSM table, each row typically represents a single peptide-spectrum-match, i.e. a peptide sequence that the proteomics search engine identified to be compatible with an observed mass spectrum in a given sample. PSM tables contain information about both 1) the peptide sequence, 2) the spectrum, as well as 3) the score assigned to the PSM by the search engine.
A minimal PSM table could look something like this:
sample_id |
peptide |
confidence_score |
scan_id |
|---|---|---|---|
1 |
PEPTIDE |
0.99 |
1234 |
In this example, the search engine identified the peptide PEPTIDE as a match to the spectrum 1234 sample with ID 1, and assigned a confidence score of 0.99 to this match.
Search engine outputs¶
In reality, PSM tables are significantly more complex than this, as they contain additional information on both the spectrum (e.g. the sample run, detector type, precursor intensities, …), the peptide (e.g. the protein it belongs to, the modifications it carries), and the peptide search (quality control measures). This additional information can be extremely useful for downstream analyses, but also makes PSM tables more difficult to work with, as the exact names may differ between search engines, versions, and file formats.
Unifying properties¶
Alphabase aligns the column names to a unified vocabulary, as defined in the alphabase.psm_reader.psm_reader_yaml mapping. We can explore this standardization, that facilitates cross-engine comparisons. Note that some search engines use version-dependent names for the same property (indicated by lists) and alphabase can deal with this ambiguity. To support both Bruker and Thermo data, we did not use Scan Number in the output dataframe but spec_idx (starts with 0).
spec_idx = scan_num - 1 in thermo data.
[3]:
# Generate a dataframe that maps the report-specific column names to the unified column names based on alphabase mapping
psm_column_mapping = (
pd.DataFrame.from_dict({
mapping.get("reader_type", "psm"): mapping.get("column_mapping", {}) for k, mapping in psm_reader_yaml.items() if k != "modification_mappings"
})
.T
.rename_axis(columns="alphabase (unified name)")
)
# Order by importance (number of search engines with corresponding column)
columns_ordered = psm_column_mapping.isna().mean(axis=0).sort_values(ascending=True)
psm_column_mapping = (
psm_column_mapping.loc[:, columns_ordered.index]
.sort_index(axis=0)
.dropna(how="all", axis=1)
)
# Compute summary
summary = (
psm_column_mapping
.agg([lambda x: x.isna().mean()])
)
## Visualize the mapping of PSM columns to alphabase unified columns
# Stylize with pandas CSS class
headers = {
'selector': 'th',#'th:not(.index_name)',
'props': 'background-color: #18456d; color: white;'
}
cell_hover = { # for row hover use <tr> instead of <td>
'selector': 'td:hover',
'props': [('background-color', '#ffffb3')]
}
index_names = {
'selector': '.index_name',
'props': 'font-style: italic; font-weight:bold; position: sticky'
}
summary_row = {"selector": "tbody tr:last-child", "props": [("background-color", "#efefef"), ("font-weight", "bold")]}
caption = {
"selector": "caption",
"props": "caption-side: top; font-style: italic; font-size: 12pt; text-align:left; margin-bottom: 10pt;"
}
# Visualize
psm_column_mapping_stylized = (
psm_column_mapping
.style
.concat(
summary
.style
.relabel_index(["Missing"])
.bar(color='#cccccc', vmin=0, vmax=1)
)
.set_caption("Mapping of PSM columns to alphabase unified columns")
.set_table_styles(
[headers, cell_hover, index_names, summary_row, caption]
)
)
psm_column_mapping_stylized
[3]:
| alphabase (unified name) | raw_name | charge | rt | mobility | proteins | sequence | score | uniprot_ids | genes | ccs | fdr | scan_num | decoy | precursor_mz | query_id | modified_sequence | intensity | rt_stop | rt_start | peptide_fdr | mods | fragment_intensity | fragment_mz | fragment_type | fragment_charge | fragment_series | fragment_loss_type | scannr | spec_idx | protein_fdr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alphadia | run | charge | rt_observed | mobility | proteins | sequence | score | uniprot_ids | genes | ccs | fdr | nan | nan | nan | nan | nan | intensity | rt_stop | rt_start | nan | mods | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| alphapept | raw_name | charge | rt | mobility | nan | nan | score | nan | nan | nan | q_value | scan_no | decoy | mz | query_idx | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | raw_idx | nan |
| diann | Run | Precursor.Charge | RT | ['IM', 'IonMobility'] | Protein.Names | Stripped.Sequence | CScore | Protein.Ids | Genes | CCS | Q.Value | MS2.Scan | nan | nan | nan | nan | nan | RT.Stop | RT.Start | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| library_reader_base | ReferenceRun | PrecursorCharge | ['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime'] | ['Mobility', 'IonMobility', 'PrecursorIonMobility'] | ['ProteinId', 'ProteinID', 'ProteinName', 'Protein Name'] | ['PeptideSequence', 'StrippedPeptide'] | nan | ['UniProtIds', 'UniProtID', 'UniprotId'] | ['GeneName', 'Genes', 'Gene'] | CCS | nan | nan | nan | PrecursorMz | nan | ['ModifiedPeptideSequence', 'ModifiedPeptide'] | nan | nan | nan | nan | nan | ['LibraryIntensity', 'RelativeIntensity', 'RelativeFragmentIntensity', 'RelativeFragmentIonIntensity'] | ['ProductMz'] | ['FragmentType', 'FragmentIonType', 'ProductType', 'ProductIonType'] | ['FragmentCharge', 'FragmentIonCharge', 'ProductCharge', 'ProductIonCharge'] | ['FragmentSeriesNumber', 'FragmentNumber'] | ['FragmentLossType', 'FragmentIonLossType', 'ProductLossType', 'ProductIonLossType'] | nan | nan | nan |
| maxquant | Raw file | Charge | Retention time | ['Mobility', 'IonMobility', 'K0', '1/K0'] | Proteins | Sequence | Score | nan | ['Gene Names', 'Gene names'] | CCS | nan | ['Scan number', 'MS/MS scan number', 'MS/MS Scan Number', 'Scan index'] | Reverse | m/z | nan | nan | Intensity | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| msfragger_pepxml | raw_name | assumed_charge | retention_time_sec | ion_mobility | protein | peptide | expect | nan | nan | nan | nan | start_scan | nan | nan | spectrum | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| pfind | raw_name | Charge | RT | nan | Proteins | Sequence | Final_Score | Proteins | nan | nan | Q-value | Scan_No | ['Target/Decoy', 'Targe/Decoy'] | nan | File_Name | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| sage | filename | charge | rt | mobility | proteins | stripped_peptide | sage_discriminant_score | nan | nan | nan | spectrum_q | nan | is_decoy | nan | nan | peptide | nan | nan | nan | peptide_q | nan | nan | nan | nan | nan | nan | nan | scannr | nan | protein_q |
| spectronaut | ReferenceRun | PrecursorCharge | ['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime'] | ['Mobility', 'IonMobility', 'PrecursorIonMobility'] | ['Protein Name', 'ProteinId', 'ProteinID', 'ProteinName', 'ProteinGroup', 'ProteinGroups'] | ['StrippedPeptide', 'PeptideSequence'] | nan | ['UniProtIds', 'UniProtID', 'UniprotId'] | ['Genes', 'Gene', 'GeneName', 'GeneNames'] | CCS | nan | nan | nan | PrecursorMz | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| spectronaut_report | R.FileName | charge | ['EG.ApexRT', 'EG.MeanApexRT'] | ['FG.ApexIonMobility'] | ['PG.ProteinNames', 'PG.ProteinGroups'] | nan | nan | PG.UniProtIds | PG.Genes | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| Missing | 0.000000 | 0.000000 | 0.000000 | 0.100000 | 0.100000 | 0.200000 | 0.300000 | 0.400000 | 0.400000 | 0.500000 | 0.500000 | 0.500000 | 0.600000 | 0.600000 | 0.700000 | 0.800000 | 0.800000 | 0.800000 | 0.800000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 |
Unifying peptide modifications¶
Alphabase further unifies representations of peptide modifications between the different search engines to the community-driven unimod format.
E.g. the MaxQuant-internal representations of phosphorylated serines are mapped to the unimod representation:
alphabase/UniMod |
MaxQuant |
|---|---|
Phospho@S |
S(Phospho (S)), S(Phospho (ST)), S(Phospho (STY)), S(Phospho (STYDH)), S(ph), pS |
See alphabase.psm_reader.psm_reader_yaml["modification_mappings"] for all mappings as parsed dictionaries and alphabase.constants.const_files.psm_reader_yaml for the underlying file.
Code | Read and parse PSM tables¶
The alphabase psm_reader module enables users to parse proteomics PSM reports to a dataframe for most common search engines with a single line of code via its alphabase.psm_reader.psm_reader_provider factory.
Available readers¶
alphabase.psm_reader.psm_reader_provider has registered some basic reader classes. A list of implemented readers can be accessed via its reader_dict property:
[4]:
all_registered_readers = psm_reader_provider.reader_dict.keys()
# Display all registered readers
sep = "\n\t- "
print("Registered readers in alphabase:", sep.join(sorted(all_registered_readers)), sep=sep)
Registered readers in alphabase:
- alphadia
- alphadia_parquet
- alphapept
- diann
- maxquant
- msfragger
- msfragger_pepxml
- msfragger_psm_tsv
- openswath
- pfind
- pfind3
- sage_parquet
- sage_tsv
- speclib_tsv
- spectronaut
- spectronaut_report
- swath
Interact with the reader provider¶
Example 1 - MaxQuant¶
We demonstrate how to interact with PSM tables via alphabase based on a minimal example output of the MaxQuant search engine.
First, let’s provide some minimal input, which is the header of a real MaxQuant report
[5]:
maxquant_example = io.StringIO(
'''Raw file Scan number Scan index Sequence Length Missed cleavages Modifications Modified sequence Oxidation (M) Probabilities Oxidation (M) Score diffs Acetyl (Protein_N-term) Oxidation (M) Proteins Charge Fragmentation Mass analyzer Type Scan event number Isotope index m/z Mass Mass error [ppm] Mass error [Da] Simple mass error [ppm] Retention time PEP Score Delta score Score diff Localization prob Combinatorics PIF Fraction of total spectrum Base peak fraction Precursor full scan number Precursor Intensity Precursor apex fraction Precursor apex offset Precursor apex offset time Matches Intensities Mass deviations [Da] Mass deviations [ppm] Masses Number of matches Intensity coverage Peak coverage Neutral loss level ETD identification type Reverse All scores All sequences All modified sequences Reporter PIF Reporter fraction id Protein group IDs Peptide ID Mod. peptide ID Evidence ID Oxidation (M) site IDs
20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81358 73979 AAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 Unmodified _(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation (M))PAAAATAPTTAATTAATAAQ_ 0 0 sp|P37108|SRP14_HUMAN 3 HCD FTMS MULTI-MSMS 13 1 790.07495 2367.203 0.35311 0.00027898 -0.061634807 70.261 0.012774 41.423 36.666 NaN NaN 1 0 0 0 81345 10653955 0.0338597821787898 -11 0.139877319335938 y1;y2;y3;y4;y11;y1-NH3;y2-NH3;a2;b2;b3;b4;b5;b6;b7;b8;b9;b11;b12;b6(2+);b8(2+);b13(2+);b18(2+) 2000000;2000000;300000;400000;200000;1000000;400000;300000;600000;1000000;2000000;3000000;3000000;3000000;3000000;2000000;600000;500000;1000000;2000000;300000;200000 5.2861228709844E-06;-6.86980268369553E-05;-0.00238178789771837;0.000624715964988809;-0.0145624692099773;-0.000143471782706683;-0.000609501446461991;-0.000524972720768346;0.00010190530804266;5.8620815195809E-05;0.000229901232955854;-0.000108750048696038;-0.000229593152369034;0.00183148682538103;0.00276641182404092;0.000193118923334623;0.00200988580445483;0.000102216846016745;5.86208151389656E-05;0.000229901232955854;-0.00104559184393338;0.00525030008475369 0.0359413365445091;-0.314964433555295;-8.23711898839045;1.60102421155213;-14.8975999917227;-1.10320467763838;-3.03102462870716;-4.56152475051625;0.712219104095465;0.273777366204575;0.806231096969562;-0.305312183824154;-0.537399178230218;3.67572664689217;4.85930954169285;0.301587577451224;2.48616190909398;0.116225745519871;0.273777365939099;0.806231096969562;-2.19774169175011;7.53961026980589 147.076413378177;218.113601150127;289.153028027798;390.197699998035;977.50437775671;130.050013034583;201.087592852046;115.087114392821;143.081402136892;214.118559209185;285.155501716567;356.192954155649;427.230188786552;498.265241494374;569.301420357176;640.341107437877;808.429168310795;879.468189767554;214.118559209185;285.155501716567;475.757386711244;696.362265007215 22 0.262893575628735 0.0826446280991736 None Unknown 41.4230894199432;4.75668724862449;3.9515580701967 AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQILQGK;PVTLWITVTHMQADEVSVWR _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVSVTQILQGK_;_PVTLWITVTHMQADEVSVWR_ 0 1443 0 0 0
20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81391 74010 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 Unmodified _AAAAAAAAAPAAAATAPTTAATTAATAAQ_ 0 0 sp|P37108|SRP14_HUMAN 2 HCD FTMS MULTI-MSMS 14 0 1184.6088 2367.203 0.037108 4.3959E-05 1.7026696 70.287 7.1474E-09 118.21 100.52 NaN NaN 1 0 0 0 81377 9347701 0.166790347889974 -10 0.12664794921875 y1;y2;y3;y4;y5;y9;y12;y13;y14;y20;y13-H2O;y20-H2O;y1-NH3;y20-NH3;b3;b4;b5;b6;b7;b8;b9;b11;b12;b13;b14;b15;b16;b19;b15-H2O;b16-H2O 500000;600000;200000;400000;200000;100000;200000;1000000;200000;300000;200000;100000;100000;70000;300000;900000;2000000;3000000;5000000;8000000;6000000;600000;800000;600000;200000;300000;200000;300000;300000;1000000 -0.000194444760495571;0.000149986878682284;0.000774202587820128;-0.0002445094036716;0.000374520568641401;-0.00694293246522193;-0.0109837291331587;-0.0037745820627606;-0.000945546471939451;0.00152326440706929;0.00506054832726477;0.00996886361417637;6.25847393393997E-05;-0.024881067836759;-3.11821549132674E-05;-0.000183099230639527;0.000161332473453513;0.000265434980121881;0.000747070697229901;0.000975534518261156;0.00101513939785036;0.00651913000274362;0.0058584595163893;0.00579536744021425;0.00131097834105276;-0.0131378531671089;0.00472955218901916;-0.00161006322559842;-0.00201443239325272;0.0227149399370319 -1.32206444236914;0.687655553213019;2.6775131607882;-0.626628140021726;0.811995006209331;-8.6203492854282;-10.1838066275079;-3.21078702288986;-0.758483069159249;0.881072738747222;4.37168212373889;5.82682888353564;0.481236695337485;-14.5343986203644;-0.145630261806375;-0.642102166533079;0.452935954800214;0.621293379181583;1.49934012872483;1.71355878380837;1.58531240493271;8.06399202403175;6.6614096214532;6.09718023739784;1.28333378040908;-11.7030234519348;3.96235146626144;-1.07856912288932;-1.82370619437775;19.3220953109188 147.07661310906;218.113382465221;289.149872037312;390.198569223404;461.235063981231;805.411965958065;1078.54847749073;1175.59403219566;1246.62831694787;1728.87474561429;1157.57463237897;1710.85573532879;130.049806978061;1711.87460084504;214.118649012155;285.155914717031;356.192684073126;427.22969375842;498.266325910503;569.303211234482;640.340285417402;808.424659066597;879.462433524883;950.49961040476;1021.54120858166;1122.60333588727;1193.62258226971;1492.77704268533;1104.58164778019;1175.59403219566 30 0.474003002083763 0.167630057803468 None Unknown 118.209976573419;17.6937689289157;17.2534171481793 AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSVLYLK;VGSSVPSKASELVVMGDHDAARR _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxidation (M))QSEQLQSVLYLK_;_VGSSVPSKASELVVMGDHDAARR_ 1 1443 0 0 1
20190402_QX1_SeVW_MA_HeLa_500ng_LC11 107307 98306 AAAAAAAGDSDSWDADAFSVEDPVRK 26 1 Acetyl (Protein_N-term) _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_ 1 0 sp|O75822|EIF3J_HUMAN 3 HCD FTMS MULTI-MSMS 10 2 879.06841 2634.1834 -0.93926 -0.00082567 -3.2012471 90.978 2.1945E-12 148.95 141.24 NaN NaN 1 0 0 0 107297 10193939 0.267970762043589 -8 0.10211181640625 y1;y2;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14;y15;y17;y18;y19;y20;y21;y23;y21-H2O;y1-NH3;y19-NH3;y14(2+);y16(2+);y22(2+);a2;b2;b3;b4;b5;b6;b7 300000;200000;3000000;600000;1000000;500000;2000000;1000000;1000000;1000000;90000;1000000;400000;900000;1000000;400000;3000000;2000000;1000000;400000;100000;200000;200000;80000;100000;200000;200000;2000000;5000000;5000000;5000000;2000000;300000 1.34859050149316E-07;-6.05140996867704E-06;2.27812602133781E-05;0.00128986659160546;-0.00934536073077652;0.000941953783126337;-0.00160424237344614;-0.00239257341399934;-0.00111053968612396;-0.00331340710044969;0.00330702864630439;0.000963683996815234;0.00596290290945944;-0.00662057038289277;-0.0117122701335575;0.00777853472800416;0.0021841542961738;0.000144322111736983;-0.00087403893667215;0.0197121595674616;-0.021204007680808;-0.000308954599830713;-0.026636719419912;-0.0137790992353075;0.00596067266928912;-0.0077053835773313;9.11402199221811E-06;-0.000142539300128419;-0.000251999832926231;1.90791054137662E-05;-0.00236430185879044;-9.54583337602344E-05;-0.000556959493223985 0.000916705048437201;-0.0199575598103408;0.0456231928690862;2.09952637717462;-12.5708704058425;1.11808305811426;-1.72590731777249;-2.22239181008062;-0.967696370445928;-2.62418809422166;2.47964286628144;0.665205752892023;3.64753748704453;-3.84510115530963;-6.08782672045773;3.81508105974837;1.04209904973991;0.0666012719936656;-0.390545453668809;8.28224925531311;-9.55133250134922;-2.37499239179248;-12.8127653858411;-16.846761946123;6.48662354975264;-6.67117082062383;0.0580151981289049;-0.770098855873447;-0.983876895688683;0.0583162347158579;-5.93738717724506;-0.203431522818505;-1.03087538746314 147.112804035741;303.21392125011;499.33507018564;614.360746132308;743.413974455831;842.472101057517;929.506675663573;1076.57587791081;1147.61170966489;1262.6408555643;1333.67134891635;1448.700635293;1634.77494902759;1721.81956091078;1923.88362405243;2038.89107627957;2095.9181343836;2166.95728800359;2237.99542015244;2380.04906152953;2220.00518543488;130.0865640237;2078.92040615582;817.907873297785;918.917619246831;1155.02717356753;157.097144992378;185.0922112678;256.129434516133;327.166277224995;398.205774393759;469.240619338034;540.278194626993 33 0.574496146107112 0.14410480349345 None Unknown 148.951235201399;7.71201258444522;7.36039532447559 AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDGER;HTLTSFWNFKAGCEEKCYSNR _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_;_PSRQESELM(Oxidation (M))WQWVDQRSDGER_;_HTLTSFWNFKAGCEEKCYSNR_ 2 625 1 1 2 '''
)
# Parse with pandas for visualization purposes
pd.read_csv(copy(maxquant_example), sep="\t")
[5]:
| Raw file | Scan number | Scan index | Sequence | Length | Missed cleavages | Modifications | Modified sequence | Oxidation (M) Probabilities | Oxidation (M) Score diffs | ... | All sequences | All modified sequences | Reporter PIF | Reporter fraction | id | Protein group IDs | Peptide ID | Mod. peptide ID | Evidence ID | Oxidation (M) site IDs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 | 81358 | 73979 | AAAAAAAAAPAAAATAPTTAATTAATAAQ | 29 | 0 | Unmodified | _(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation ... | NaN | NaN | ... | AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQ... | _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVS... | NaN | NaN | 0 | 1443 | 0 | 0 | 0 | NaN |
| 1 | 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 | 81391 | 74010 | AAAAAAAAAAPAAAATAPTTAATTAATAAQ | 29 | 0 | Unmodified | _AAAAAAAAAPAAAATAPTTAATTAATAAQ_ | NaN | NaN | ... | AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSV... | _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxid... | NaN | NaN | 1 | 1443 | 0 | 0 | 1 | NaN |
| 2 | 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 | 107307 | 98306 | AAAAAAAGDSDSWDADAFSVEDPVRK | 26 | 1 | Acetyl (Protein_N-term) | _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... | NaN | NaN | ... | AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDG... | _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... | NaN | NaN | 2 | 625 | 1 | 1 | 2 | NaN |
3 rows × 61 columns
Then use the psm_reader_provider.get_reader method to get the maxquant-report reader. Use the import_file method to read the file, which is directly returned as a pandas DataFrame.
[6]:
maxquant_reader = psm_reader_provider.get_reader('maxquant')
# Import the file or a bytestream
maxquant_report = maxquant_reader.import_file(maxquant_example)
# The parsed PSM is also stored in the reader class as `psm_df` attribute
# maxquant_report = maxquant_reader.psm_df
maxquant_report
/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/alphabase/psm_reader/psm_reader.py:318: UserWarning: Unknown modifications: {'_(Acetyl (Protein_N-term))'}. Precursors with unknown modifications will be removed.
warnings.warn(
[6]:
| sequence | charge | rt | scan_num | raw_name | precursor_mz | score | proteins | decoy | spec_idx | mods | mod_sites | nAA | rt_norm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAAAAAAAAAPAAAATAPTTAATTAATAAQ | 2 | 70.287 | 81391 | 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 | 1184.6088 | 118.21 | sp|P37108|SRP14_HUMAN | 0 | 81390 | 30 | 0.772571 |
Example 2 - Set custom arguments¶
One can also customize the reader by setting specific arguments. For example, one can set more stringent fdr filters (default: \(fdr=0.01\)). We showcase this on the example of a DIANN PSM report table.
[7]:
diann_tsv_example = io.StringIO(r'''File.Name Run Protein.Group Protein.Ids Protein.Names Genes PG.Quantity PG.Normalised PG.MaxLFQ Genes.Quantity Genes.Normalised Genes.MaxLFQ Genes.MaxLFQ.Unique Modified.Sequence Stripped.Sequence Precursor.Id Precursor.Charge Q.Value Global.Q.Value Protein.Q.Value PG.Q.Value Global.PG.Q.Value GG.Q.Value Translated.Q.Value Proteotypic Precursor.Quantity Precursor.Normalised Precursor.Translated Quantity.Quality RT RT.Start RT.Stop iRT Predicted.RT Predicted.iRT Lib.Q.Value Ms1.Profile.Corr Ms1.Area Evidence Spectrum.Similarity Mass.Evidence CScore Decoy.Evidence Decoy.CScore Fragment.Quant.Raw Fragment.Quant.Corrected Fragment.Correlations MS2.Scan IM iIM Predicted.IM Predicted.iIM
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636.d 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636 Q9UH36 Q9UH36 SRRD 3296.49 3428.89 3428.89 3296.49 3428.89 3428.89 3428.89 (UniMod:1)AAAAAAALESWQAAAPR AAAAAAALESWQAAAPR (UniMod:1)AAAAAAALESWQAAAPR2 2 3.99074e-05 1.96448e-05 0.000159821 0.000159821 0.000146135 0.000161212 0 1 3296.49 3428.89 3296.49 0.852479 19.9208 19.8731 19.9685 123.9 19.8266 128.292 0 0.960106 5308.05 1.96902 0.683134 0.362287 0.999997 1.23691 3.43242e-05 1212.01;2178.03;1390.01;1020.01;714.008;778.008; 1212.01;1351.73;887.591;432.92;216.728;732.751; 0.956668;0.757581;0.670497;0.592489;0.47072;0.855203; 30053 1.19708 1.19328 1.19453 1.19469
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642.d 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642 Q9UH36 Q9UH36 SRRD 2365 2334.05 2334.05 2365 2334.05 2334.05 2334.05 (UniMod:1)AAAAAAALESWQAAAPR AAAAAAALESWQAAAPR (UniMod:1)AAAAAAALESWQAAAPR2 2 0.000184434 1.96448e-05 0.000596659 0.000596659 0.000146135 0.000604961 0 1 2365 2334.05 2365 0.922581 19.905 19.8573 19.9527 123.9 19.782 128.535 0 0.940191 4594.04 1.31068 0.758988 0 0.995505 0.28633 2.12584e-06 1209.02;1210.02;1414.02;1051.01;236.003;130.002; 1209.02;1109.89;732.154;735.384;0;46.0967; 0.919244;0.937624;0.436748;0.639369;0.296736;0.647924; 30029 1.195 1.19328 1.19381 1.19339
F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648.d 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648 Q9UH36 Q9UH36 SRRD 1664.51 1635.46 1635.47 1664.51 1635.46 1635.47 1635.47 (UniMod:1)AAAAAAALESWQAAAPR AAAAAAALESWQAAAPR (UniMod:1)AAAAAAALESWQAAAPR2 2 0.000185123 1.96448e-05 0.000307409 0.000307409 0.000146135 0.000311332 0 1 1664.51 1635.46 1664.51 0.811147 19.8893 19.8416 19.937 123.9 19.7567 128.896 0 0.458773 6614.06 1.7503 0.491071 0.00111683 0.997286 1.92753 2.80543e-05 744.01;1708.02;1630.02;1475.02;0;533.006; 322.907;808.594;577.15;536.033;0;533.006; 0.760181;0.764072;0.542005;0.415779;0;0.913438; 30005 1.19409 1.19328 1.19323 1.19308
''')
pd.read_csv(copy(diann_tsv_example), sep="\t")
[7]:
| File.Name | Run | Protein.Group | Protein.Ids | Protein.Names | Genes | PG.Quantity | PG.Normalised | PG.MaxLFQ | Genes.Quantity | ... | Decoy.Evidence | Decoy.CScore | Fragment.Quant.Raw | Fragment.Quant.Corrected | Fragment.Correlations | MS2.Scan | IM | iIM | Predicted.IM | Predicted.iIM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... | 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... | Q9UH36 | Q9UH36 | NaN | SRRD | 3296.49 | 3428.89 | 3428.89 | 3296.49 | ... | 1.23691 | 0.000034 | 1212.01;2178.03;1390.01;1020.01;714.008;778.008; | 1212.01;1351.73;887.591;432.92;216.728;732.751; | 0.956668;0.757581;0.670497;0.592489;0.47072;0.... | 30053 | 1.19708 | 1.19328 | 1.19453 | 1.19469 |
| 1 | F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... | 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... | Q9UH36 | Q9UH36 | NaN | SRRD | 2365.00 | 2334.05 | 2334.05 | 2365.00 | ... | 0.28633 | 0.000002 | 1209.02;1210.02;1414.02;1051.01;236.003;130.002; | 1209.02;1109.89;732.154;735.384;0;46.0967; | 0.919244;0.937624;0.436748;0.639369;0.296736;0... | 30029 | 1.19500 | 1.19328 | 1.19381 | 1.19339 |
| 2 | F:\XXX\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... | 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... | Q9UH36 | Q9UH36 | NaN | SRRD | 1664.51 | 1635.46 | 1635.47 | 1664.51 | ... | 1.92753 | 0.000028 | 744.01;1708.02;1630.02;1475.02;0;533.006; | 322.907;808.594;577.15;536.033;0;533.006; | 0.760181;0.764072;0.542005;0.415779;0;0.913438; | 30005 | 1.19409 | 1.19328 | 1.19323 | 1.19308 |
3 rows × 52 columns
By passing the more stringent fdr filter (\(fdr_{\text{stringent}} = 10^{-4}\)) in the second function call, two precursors with an fdr of \(\sim0.0002\) are removed from the resulting table
[8]:
# Read PSM reports with one liners
diann_psm_standard = psm_reader_provider.get_reader('diann').import_file(copy(diann_tsv_example))
diann_psm_custom_fdr = psm_reader_provider.get_reader('diann', fdr=1e-4).import_file(copy(diann_tsv_example))
print("Number of observations (Standard filter):", len(diann_psm_standard))
print("Number of observations (Stringent filter):", len(diann_psm_custom_fdr))
Number of observations (Standard filter): 3
Number of observations (Stringent filter): 1
Conclusion¶
Overall, this tutorial
Explained how
alphabasemaps different search engine outputs to a unified formatProvides examples on how to read PSM tables from different search engines
Gives an overview over the available and implemented readers