{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PSM readers" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Helper packages\n", "import io\n", "from copy import copy\n", "import numpy as np\n", "import pandas as pd \n", "\n", "# alphabase\n", "from alphabase.psm_reader import psm_reader_provider, psm_reader_yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background \n", "\n", "The `alphabase.psm_reader` module provides a unified interface to read PSM tables from different search engines and file formats. It is designed to be easy to use and to provide a consistent output format in the form of `pandas.DataFrame`s, regardless of the input file format.\n", "\n", "### Introduction to peptide spectrum matches (PSMs)\n", "\n", "Peptide spectrum matches (PSMs) are the primary output of proteomics search engines. In a PSM table, each row typically represents a single peptide-spectrum match, i.e. a peptide sequence that the proteomics search engine identified to be compatible with an observed mass spectrum in a given sample. PSM tables contain information about 1) the *peptide sequence*, 2) the *spectrum*, and 3) the *score* assigned to the PSM by the search engine. \n", "\n", "A minimal PSM table could look something like this:\n", "\n", "| sample_id | peptide | confidence_score | scan_id |\n", "|-------------|---------|-------| -------|\n", "| 1 | PEPTIDE | 0.99 | 1234 |\n", "\n", "In this example, the search engine identified the peptide **PEPTIDE** as a match to spectrum `1234` in the sample with ID `1`, and assigned a confidence score of `0.99` to this match. 
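
As a rough illustration, the toy table above could be held in a `pandas.DataFrame`. This is only a sketch: the column names are the illustrative ones from the table, not alphabase's unified names, and the variable names are made up for this example.

```python
import pandas as pd

# Illustrative column names taken from the toy table above
# (these are NOT alphabase's unified column names)
minimal_psm_table = pd.DataFrame(
    {
        'sample_id': [1],
        'peptide': ['PEPTIDE'],
        'confidence_score': [0.99],
        'scan_id': [1234],
    }
)

# Each row is one peptide-spectrum match; pick the highest-scoring one
best_match = minimal_psm_table.loc[minimal_psm_table['confidence_score'].idxmax()]
print(best_match['peptide'], best_match['scan_id'])
```
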
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search engine outputs\n", "\n", "In reality, PSM tables are significantly more complex than this, as they contain additional information on both the spectrum (e.g. the sample run, detector type, precursor intensities, ...), the peptide (e.g. the protein it belongs to, the modifications it carries), and the peptide search (quality control measures). This additional information can be extremely useful for downstream analyses, but also makes PSM tables more difficult to work with, as the exact names may differ between search engines, versions, and file formats. \n", "\n", "#### Unifying properties \n", "\n", "Alphabase aligns the column names to a unified vocabulary, as defined in the `alphabase.psm_reader.psm_reader_yaml` mapping. We can explore this standardization, that facilitates cross-engine comparisons. Note that some search engines use version-dependent names for the same property (indicated by lists) and alphabase can deal with this ambiguity. To support both Bruker and Thermo data, we did not use `Scan Number` in the output dataframe but `spec_idx` (starts with 0). `spec_idx = scan_num - 1` in thermo data. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "skip-execution" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", "
Mapping of PSM columns to alphabase unified columns

| alphabase (unified name) | raw_name | charge | rt | mobility | proteins | sequence | score | uniprot_ids | genes | ccs | fdr | scan_num | decoy | precursor_mz | query_id | modified_sequence | intensity | rt_stop | rt_start | peptide_fdr | mods | fragment_intensity | fragment_mz | fragment_type | fragment_charge | fragment_series | fragment_loss_type | scannr | spec_idx | protein_fdr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alphadia | run | charge | rt_observed | mobility | proteins | sequence | score | uniprot_ids | genes | ccs | fdr | nan | nan | nan | nan | nan | intensity | rt_stop | rt_start | nan | mods | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| alphapept | raw_name | charge | rt | mobility | nan | nan | score | nan | nan | nan | q_value | scan_no | decoy | mz | query_idx | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | raw_idx | nan |
| diann | Run | Precursor.Charge | RT | ['IM', 'IonMobility'] | Protein.Names | Stripped.Sequence | CScore | Protein.Ids | Genes | CCS | Q.Value | MS2.Scan | nan | nan | nan | nan | nan | RT.Stop | RT.Start | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| library_reader_base | ReferenceRun | PrecursorCharge | ['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime'] | ['Mobility', 'IonMobility', 'PrecursorIonMobility'] | ['ProteinId', 'ProteinID', 'ProteinName', 'Protein Name'] | ['PeptideSequence', 'StrippedPeptide'] | nan | ['UniProtIds', 'UniProtID', 'UniprotId'] | ['GeneName', 'Genes', 'Gene'] | CCS | nan | nan | nan | PrecursorMz | nan | ['ModifiedPeptideSequence', 'ModifiedPeptide'] | nan | nan | nan | nan | nan | ['LibraryIntensity', 'RelativeIntensity', 'RelativeFragmentIntensity', 'RelativeFragmentIonIntensity'] | ['ProductMz'] | ['FragmentType', 'FragmentIonType', 'ProductType', 'ProductIonType'] | ['FragmentCharge', 'FragmentIonCharge', 'ProductCharge', 'ProductIonCharge'] | ['FragmentSeriesNumber', 'FragmentNumber'] | ['FragmentLossType', 'FragmentIonLossType', 'ProductLossType', 'ProductIonLossType'] | nan | nan | nan |
| maxquant | Raw file | Charge | Retention time | ['Mobility', 'IonMobility', 'K0', '1/K0'] | Proteins | Sequence | Score | nan | ['Gene Names', 'Gene names'] | CCS | nan | ['Scan number', 'MS/MS scan number', 'MS/MS Scan Number', 'Scan index'] | Reverse | m/z | nan | nan | Intensity | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| msfragger_pepxml | raw_name | assumed_charge | retention_time_sec | ion_mobility | protein | peptide | expect | nan | nan | nan | nan | start_scan | nan | nan | spectrum | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| pfind | raw_name | Charge | RT | nan | Proteins | Sequence | Final_Score | Proteins | nan | nan | Q-value | Scan_No | ['Target/Decoy', 'Targe/Decoy'] | nan | File_Name | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| sage | filename | charge | rt | mobility | proteins | stripped_peptide | sage_discriminant_score | nan | nan | nan | spectrum_q | nan | is_decoy | nan | nan | peptide | nan | nan | nan | peptide_q | nan | nan | nan | nan | nan | nan | nan | scannr | nan | protein_q |
| spectronaut | ReferenceRun | PrecursorCharge | ['RT', 'iRT', 'Tr_recalibrated', 'RetentionTime', 'NormalizedRetentionTime'] | ['Mobility', 'IonMobility', 'PrecursorIonMobility'] | ['Protein Name', 'ProteinId', 'ProteinID', 'ProteinName', 'ProteinGroup', 'ProteinGroups'] | ['StrippedPeptide', 'PeptideSequence'] | nan | ['UniProtIds', 'UniProtID', 'UniprotId'] | ['Genes', 'Gene', 'GeneName', 'GeneNames'] | CCS | nan | nan | nan | PrecursorMz | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| spectronaut_report | R.FileName | charge | ['EG.ApexRT', 'EG.MeanApexRT'] | ['FG.ApexIonMobility'] | ['PG.ProteinNames', 'PG.ProteinGroups'] | nan | nan | PG.UniProtIds | PG.Genes | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| Missing | 0.000000 | 0.000000 | 0.000000 | 0.100000 | 0.100000 | 0.200000 | 0.300000 | 0.400000 | 0.400000 | 0.500000 | 0.500000 | 0.500000 | 0.600000 | 0.600000 | 0.700000 | 0.800000 | 0.800000 | 0.800000 | 0.800000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 | 0.900000 |
\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate a dataframe that maps the report-specific column names to the unified column names based on alphabase mapping\n", "psm_column_mapping = (\n", " pd.DataFrame.from_dict({\n", " mapping.get(\"reader_type\", \"psm\"): mapping.get(\"column_mapping\", {}) for k, mapping in psm_reader_yaml.items() if k != \"modification_mappings\"\n", " })\n", " .T\n", " .rename_axis(columns=\"alphabase (unified name)\")\n", " )\n", "\n", "# Order columns by coverage (fraction of readers lacking the column, ascending)\n", "columns_ordered = psm_column_mapping.isna().mean(axis=0).sort_values(ascending=True)\n", "psm_column_mapping = (\n", " psm_column_mapping.loc[:, columns_ordered.index]\n", " .sort_index(axis=0)\n", " .dropna(how=\"all\", axis=1)\n", ")\n", "\n", "# Compute summary: fraction of readers that lack each unified column\n", "summary = (\n", " psm_column_mapping\n", " .agg([lambda x: x.isna().mean()])\n", ")\n", "\n", "\n", "\n", "## Visualize the mapping of PSM columns to alphabase unified columns\n", "# Stylize with pandas CSS class \n", "headers = {\n", " 'selector': 'th',#'th:not(.index_name)',\n", " 'props': 'background-color: #18456d; color: white;'\n", "}\n", "cell_hover = { # for row hover use <tr> instead of <td>\n", " 'selector': 'td:hover',\n", " 'props': [('background-color', '#ffffb3')]\n", "}\n", "index_names = {\n", " 'selector': '.index_name',\n", " 'props': 'font-style: italic; font-weight:bold; position: sticky'\n", "}\n", "\n", "summary_row = {\"selector\": \"tbody tr:last-child\", \"props\": [(\"background-color\", \"#efefef\"), (\"font-weight\", \"bold\")]}\n", "\n", "caption = {\n", " \"selector\": \"caption\",\n", " \"props\": \"caption-side: top; font-style: italic; font-size: 12pt; text-align:left; margin-bottom: 10pt;\"\n", "}\n", "\n", "# Visualize\n", "psm_column_mapping_stylized = (\n", " psm_column_mapping\n", " .style\n", "\n", " .concat(\n", " summary\n", " .style\n", " 
.relabel_index([\"Missing\"])\n", " .bar(color='#cccccc', vmin=0, vmax=1)\n", " )\n", " .set_caption(\"Mapping of PSM columns to alphabase unified columns\")\n", " .set_table_styles(\n", " [headers, cell_hover, index_names, summary_row, caption]\n", " )\n", "\n", ")\n", "\n", "psm_column_mapping_stylized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Unifying peptide modifications \n", "\n", "Alphabase further unifies the representations of peptide modifications used by the different search engines, mapping them to the community-driven UniMod format.\n", "\n", "For example, the MaxQuant-internal representations of phosphorylated serines are mapped to the UniMod representation:\n", "\n", "| alphabase/UniMod | MaxQuant |\n", "|------------------|----------|\n", "| Phospho@S | S(Phospho (S)), S(Phospho (ST)), S(Phospho (STY)), S(Phospho (STYDH)), S(ph), pS |\n", "\n", "See `alphabase.psm_reader.psm_reader_yaml[\"modification_mappings\"]` for all mappings as parsed dictionaries and `alphabase.constants.const_files.psm_reader_yaml` for the underlying file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code | Read and parse PSM tables\n", "\n", "The alphabase `psm_reader` module enables users to parse PSM reports from the most common search engines into a dataframe with a single line of code via its `alphabase.psm_reader.psm_reader_provider` factory. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available readers \n", "\n", "`alphabase.psm_reader.psm_reader_provider` has registered some basic reader classes. 
A list of implemented readers can be accessed via its `reader_dict` property: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Registered readers in alphabase:\n", "\t- alphadia\n", "\t- alphadia_parquet\n", "\t- alphapept\n", "\t- diann\n", "\t- maxquant\n", "\t- msfragger\n", "\t- msfragger_pepxml\n", "\t- msfragger_psm_tsv\n", "\t- openswath\n", "\t- pfind\n", "\t- pfind3\n", "\t- sage_parquet\n", "\t- sage_tsv\n", "\t- speclib_tsv\n", "\t- spectronaut\n", "\t- spectronaut_report\n", "\t- swath\n" ] } ], "source": [ "all_registered_readers = psm_reader_provider.reader_dict.keys()\n", "\n", "# Display all registered readers\n", "sep = \"\\n\\t- \"\n", "print(\"Registered readers in alphabase:\", sep.join(sorted(all_registered_readers)), sep=sep)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interact with the reader provider" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 1 - MaxQuant\n", "\n", "We demonstrate how to interact with PSM tables via alphabase based on a minimal example output of the MaxQuant search engine. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's provide some minimal input, which is the header of a real MaxQuant report" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Raw file Scan number Scan index \\\n", "0 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81358 73979 \n", "1 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81391 74010 \n", "2 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 107307 98306 \n", "\n", " Sequence Length Missed cleavages \\\n", "0 AAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 \n", "1 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 \n", "2 AAAAAAAGDSDSWDADAFSVEDPVRK 26 1 \n", "\n", " Modifications Modified sequence \\\n", "0 Unmodified _(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation ... \n", "1 Unmodified _AAAAAAAAAPAAAATAPTTAATTAATAAQ_ \n", "2 Acetyl (Protein_N-term) _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... \n", "\n", " Oxidation (M) Probabilities Oxidation (M) Score diffs ... \\\n", "0 NaN NaN ... \n", "1 NaN NaN ... \n", "2 NaN NaN ... \n", "\n", " All sequences \\\n", "0 AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQ... \n", "1 AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSV... \n", "2 AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDG... \n", "\n", " All modified sequences Reporter PIF \\\n", "0 _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVS... NaN \n", "1 _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxid... NaN \n", "2 _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... NaN \n", "\n", " Reporter fraction id Protein group IDs Peptide ID Mod. 
peptide ID \\\n", "0 NaN 0 1443 0 0 \n", "1 NaN 1 1443 0 0 \n", "2 NaN 2 625 1 1 \n", "\n", " Evidence ID Oxidation (M) site IDs \n", "0 0 NaN \n", "1 1 NaN \n", "2 2 NaN \n", "\n", "[3 rows x 61 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "maxquant_example = io.StringIO(\n", "'''Raw file\tScan number\tScan index\tSequence\tLength\tMissed cleavages\tModifications\tModified sequence\tOxidation (M) Probabilities\tOxidation (M) Score diffs\tAcetyl (Protein_N-term)\tOxidation (M)\tProteins\tCharge\tFragmentation\tMass analyzer\tType\tScan event number\tIsotope index\tm/z\tMass\tMass error [ppm]\tMass error [Da]\tSimple mass error [ppm]\tRetention time\tPEP\tScore\tDelta score\tScore diff\tLocalization prob\tCombinatorics\tPIF\tFraction of total spectrum\tBase peak fraction\tPrecursor full scan number\tPrecursor Intensity\tPrecursor apex fraction\tPrecursor apex offset\tPrecursor apex offset time\tMatches\tIntensities\tMass deviations [Da]\tMass deviations [ppm]\tMasses\tNumber of matches\tIntensity coverage\tPeak coverage\tNeutral loss level\tETD identification type\tReverse\tAll scores\tAll sequences\tAll modified sequences\tReporter PIF\tReporter fraction\tid\tProtein group IDs\tPeptide ID\tMod. 
peptide ID\tEvidence ID\tOxidation (M) site IDs\n", "20190402_QX1_SeVW_MA_HeLa_500ng_LC11\t81358\t73979\tAAAAAAAAAPAAAATAPTTAATTAATAAQ\t29\t0\tUnmodified\t_(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation (M))PAAAATAPTTAATTAATAAQ_\t\t\t0\t0\tsp|P37108|SRP14_HUMAN\t3\tHCD\tFTMS\tMULTI-MSMS\t13\t1\t790.07495\t2367.203\t0.35311\t0.00027898\t-0.061634807\t70.261\t0.012774\t41.423\t36.666\tNaN\tNaN\t1\t0\t0\t0\t81345\t10653955\t0.0338597821787898\t-11\t0.139877319335938\ty1;y2;y3;y4;y11;y1-NH3;y2-NH3;a2;b2;b3;b4;b5;b6;b7;b8;b9;b11;b12;b6(2+);b8(2+);b13(2+);b18(2+)\t2000000;2000000;300000;400000;200000;1000000;400000;300000;600000;1000000;2000000;3000000;3000000;3000000;3000000;2000000;600000;500000;1000000;2000000;300000;200000\t5.2861228709844E-06;-6.86980268369553E-05;-0.00238178789771837;0.000624715964988809;-0.0145624692099773;-0.000143471782706683;-0.000609501446461991;-0.000524972720768346;0.00010190530804266;5.8620815195809E-05;0.000229901232955854;-0.000108750048696038;-0.000229593152369034;0.00183148682538103;0.00276641182404092;0.000193118923334623;0.00200988580445483;0.000102216846016745;5.86208151389656E-05;0.000229901232955854;-0.00104559184393338;0.00525030008475369\t0.0359413365445091;-0.314964433555295;-8.23711898839045;1.60102421155213;-14.8975999917227;-1.10320467763838;-3.03102462870716;-4.56152475051625;0.712219104095465;0.273777366204575;0.806231096969562;-0.305312183824154;-0.537399178230218;3.67572664689217;4.85930954169285;0.301587577451224;2.48616190909398;0.116225745519871;0.273777365939099;0.806231096969562;-2.19774169175011;7.53961026980589\t147.076413378177;218.113601150127;289.153028027798;390.197699998035;977.50437775671;130.050013034583;201.087592852046;115.087114392821;143.081402136892;214.118559209185;285.155501716567;356.192954155649;427.230188786552;498.265241494374;569.301420357176;640.341107437877;808.429168310795;879.468189767554;214.118559209185;285.155501716567;475.757386711244;696.362265007215\t22\t0.262893575628735\t0.08264462809917
36\tNone\tUnknown\t\t41.4230894199432;4.75668724862449;3.9515580701967\tAAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQILQGK;PVTLWITVTHMQADEVSVWR\t_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVSVTQILQGK_;_PVTLWITVTHMQADEVSVWR_\t\t\t0\t1443\t0\t0\t0\t\n", "20190402_QX1_SeVW_MA_HeLa_500ng_LC11\t81391\t74010\tAAAAAAAAAAPAAAATAPTTAATTAATAAQ\t29\t0\tUnmodified\t_AAAAAAAAAPAAAATAPTTAATTAATAAQ_\t\t\t0\t0\tsp|P37108|SRP14_HUMAN\t2\tHCD\tFTMS\tMULTI-MSMS\t14\t0\t1184.6088\t2367.203\t0.037108\t4.3959E-05\t1.7026696\t70.287\t7.1474E-09\t118.21\t100.52\tNaN\tNaN\t1\t0\t0\t0\t81377\t9347701\t0.166790347889974\t-10\t0.12664794921875\ty1;y2;y3;y4;y5;y9;y12;y13;y14;y20;y13-H2O;y20-H2O;y1-NH3;y20-NH3;b3;b4;b5;b6;b7;b8;b9;b11;b12;b13;b14;b15;b16;b19;b15-H2O;b16-H2O\t500000;600000;200000;400000;200000;100000;200000;1000000;200000;300000;200000;100000;100000;70000;300000;900000;2000000;3000000;5000000;8000000;6000000;600000;800000;600000;200000;300000;200000;300000;300000;1000000\t-0.000194444760495571;0.000149986878682284;0.000774202587820128;-0.0002445094036716;0.000374520568641401;-0.00694293246522193;-0.0109837291331587;-0.0037745820627606;-0.000945546471939451;0.00152326440706929;0.00506054832726477;0.00996886361417637;6.25847393393997E-05;-0.024881067836759;-3.11821549132674E-05;-0.000183099230639527;0.000161332473453513;0.000265434980121881;0.000747070697229901;0.000975534518261156;0.00101513939785036;0.00651913000274362;0.0058584595163893;0.00579536744021425;0.00131097834105276;-0.0131378531671089;0.00472955218901916;-0.00161006322559842;-0.00201443239325272;0.0227149399370319\t-1.32206444236914;0.687655553213019;2.6775131607882;-0.626628140021726;0.811995006209331;-8.6203492854282;-10.1838066275079;-3.21078702288986;-0.758483069159249;0.881072738747222;4.37168212373889;5.82682888353564;0.481236695337485;-14.5343986203644;-0.145630261806375;-0.642102166533079;0.452935954800214;0.621293379181583;1.49934012872483;1.71355878380837;1.58531240493271;8.06399202403175;6.6614096214532;6.
09718023739784;1.28333378040908;-11.7030234519348;3.96235146626144;-1.07856912288932;-1.82370619437775;19.3220953109188\t147.07661310906;218.113382465221;289.149872037312;390.198569223404;461.235063981231;805.411965958065;1078.54847749073;1175.59403219566;1246.62831694787;1728.87474561429;1157.57463237897;1710.85573532879;130.049806978061;1711.87460084504;214.118649012155;285.155914717031;356.192684073126;427.22969375842;498.266325910503;569.303211234482;640.340285417402;808.424659066597;879.462433524883;950.49961040476;1021.54120858166;1122.60333588727;1193.62258226971;1492.77704268533;1104.58164778019;1175.59403219566\t30\t0.474003002083763\t0.167630057803468\tNone\tUnknown\t\t118.209976573419;17.6937689289157;17.2534171481793\tAAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSVLYLK;VGSSVPSKASELVVMGDHDAARR\t_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxidation (M))QSEQLQSVLYLK_;_VGSSVPSKASELVVMGDHDAARR_\t\t\t1\t1443\t0\t0\t1\t\n", "20190402_QX1_SeVW_MA_HeLa_500ng_LC11\t107307\t98306\tAAAAAAAGDSDSWDADAFSVEDPVRK\t26\t1\tAcetyl (Protein_N-term)\t_(Acetyl 
(Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_\t\t\t1\t0\tsp|O75822|EIF3J_HUMAN\t3\tHCD\tFTMS\tMULTI-MSMS\t10\t2\t879.06841\t2634.1834\t-0.93926\t-0.00082567\t-3.2012471\t90.978\t2.1945E-12\t148.95\t141.24\tNaN\tNaN\t1\t0\t0\t0\t107297\t10193939\t0.267970762043589\t-8\t0.10211181640625\ty1;y2;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14;y15;y17;y18;y19;y20;y21;y23;y21-H2O;y1-NH3;y19-NH3;y14(2+);y16(2+);y22(2+);a2;b2;b3;b4;b5;b6;b7\t300000;200000;3000000;600000;1000000;500000;2000000;1000000;1000000;1000000;90000;1000000;400000;900000;1000000;400000;3000000;2000000;1000000;400000;100000;200000;200000;80000;100000;200000;200000;2000000;5000000;5000000;5000000;2000000;300000\t1.34859050149316E-07;-6.05140996867704E-06;2.27812602133781E-05;0.00128986659160546;-0.00934536073077652;0.000941953783126337;-0.00160424237344614;-0.00239257341399934;-0.00111053968612396;-0.00331340710044969;0.00330702864630439;0.000963683996815234;0.00596290290945944;-0.00662057038289277;-0.0117122701335575;0.00777853472800416;0.0021841542961738;0.000144322111736983;-0.00087403893667215;0.0197121595674616;-0.021204007680808;-0.000308954599830713;-0.026636719419912;-0.0137790992353075;0.00596067266928912;-0.0077053835773313;9.11402199221811E-06;-0.000142539300128419;-0.000251999832926231;1.90791054137662E-05;-0.00236430185879044;-9.54583337602344E-05;-0.000556959493223985\t0.000916705048437201;-0.0199575598103408;0.0456231928690862;2.09952637717462;-12.5708704058425;1.11808305811426;-1.72590731777249;-2.22239181008062;-0.967696370445928;-2.62418809422166;2.47964286628144;0.665205752892023;3.64753748704453;-3.84510115530963;-6.08782672045773;3.81508105974837;1.04209904973991;0.0666012719936656;-0.390545453668809;8.28224925531311;-9.55133250134922;-2.37499239179248;-12.8127653858411;-16.846761946123;6.48662354975264;-6.67117082062383;0.0580151981289049;-0.770098855873447;-0.983876895688683;0.0583162347158579;-5.93738717724506;-0.203431522818505;-1.03087538746314\t147.112804035741;303.21392125011;499.335070
18564;614.360746132308;743.413974455831;842.472101057517;929.506675663573;1076.57587791081;1147.61170966489;1262.6408555643;1333.67134891635;1448.700635293;1634.77494902759;1721.81956091078;1923.88362405243;2038.89107627957;2095.9181343836;2166.95728800359;2237.99542015244;2380.04906152953;2220.00518543488;130.0865640237;2078.92040615582;817.907873297785;918.917619246831;1155.02717356753;157.097144992378;185.0922112678;256.129434516133;327.166277224995;398.205774393759;469.240619338034;540.278194626993\t33\t0.574496146107112\t0.14410480349345\tNone\tUnknown\t\t148.951235201399;7.71201258444522;7.36039532447559\tAAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDGER;HTLTSFWNFKAGCEEKCYSNR\t_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_;_PSRQESELM(Oxidation (M))WQWVDQRSDGER_;_HTLTSFWNFKAGCEEKCYSNR_\t\t\t2\t625\t1\t1\t2\t'''\n", ")\n", "\n", "# Parse with pandas for visualization purposes\n", "pd.read_csv(copy(maxquant_example), sep=\"\\t\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then use the `psm_reader_provider.get_reader` method to get the maxquant-report reader. Use the `import_file` method to read the file, which is directly returned as a pandas DataFrame. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/alphabase/psm_reader/psm_reader.py:318: UserWarning: Unknown modifications: {'_(Acetyl (Protein_N-term))'}. Precursors with unknown modifications will be removed.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence charge rt scan_num \\\n", "0 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 2 70.287 81391 \n", "\n", " raw_name precursor_mz score \\\n", "0 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 1184.6088 118.21 \n", "\n", " proteins decoy spec_idx mods mod_sites nAA rt_norm \n", "0 sp|P37108|SRP14_HUMAN 0 81390 30 0.772571 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "maxquant_reader = psm_reader_provider.get_reader('maxquant')\n", "\n", "# Import the file or a bytestream\n", "maxquant_report = maxquant_reader.import_file(maxquant_example)\n", "\n", "# The parsed PSM is also stored in the reader class as `psm_df` attribute\n", "# maxquant_report = maxquant_reader.psm_df\n", "\n", "maxquant_report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 2 - Set custom arguments \n", "\n", "One can also customize the reader by setting specific arguments. For example, one can set more stringent `fdr` filters (default: $fdr=0.01$). We showcase this on the example of a DIANN PSM report table." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " File.Name \\\n", "0 F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... \n", "1 F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... \n", "2 F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... \n", "\n", " Run Protein.Group \\\n", "0 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 \n", "1 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 \n", "2 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 \n", "\n", " Protein.Ids Protein.Names Genes PG.Quantity PG.Normalised PG.MaxLFQ \\\n", "0 Q9UH36 NaN SRRD 3296.49 3428.89 3428.89 \n", "1 Q9UH36 NaN SRRD 2365.00 2334.05 2334.05 \n", "2 Q9UH36 NaN SRRD 1664.51 1635.46 1635.47 \n", "\n", " Genes.Quantity ... Decoy.Evidence Decoy.CScore \\\n", "0 3296.49 ... 1.23691 0.000034 \n", "1 2365.00 ... 0.28633 0.000002 \n", "2 1664.51 ... 1.92753 0.000028 \n", "\n", " Fragment.Quant.Raw \\\n", "0 1212.01;2178.03;1390.01;1020.01;714.008;778.008; \n", "1 1209.02;1210.02;1414.02;1051.01;236.003;130.002; \n", "2 744.01;1708.02;1630.02;1475.02;0;533.006; \n", "\n", " Fragment.Quant.Corrected \\\n", "0 1212.01;1351.73;887.591;432.92;216.728;732.751; \n", "1 1209.02;1109.89;732.154;735.384;0;46.0967; \n", "2 322.907;808.594;577.15;536.033;0;533.006; \n", "\n", " Fragment.Correlations MS2.Scan IM \\\n", "0 0.956668;0.757581;0.670497;0.592489;0.47072;0.... 30053 1.19708 \n", "1 0.919244;0.937624;0.436748;0.639369;0.296736;0... 
30029 1.19500 \n", "2 0.760181;0.764072;0.542005;0.415779;0;0.913438; 30005 1.19409 \n", "\n", " iIM Predicted.IM Predicted.iIM \n", "0 1.19328 1.19453 1.19469 \n", "1 1.19328 1.19381 1.19339 \n", "2 1.19328 1.19323 1.19308 \n", "\n", "[3 rows x 52 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diann_tsv_example = io.StringIO(r'''File.Name\tRun\tProtein.Group\tProtein.Ids\tProtein.Names\tGenes\tPG.Quantity\tPG.Normalised\tPG.MaxLFQ\tGenes.Quantity\tGenes.Normalised\tGenes.MaxLFQ\tGenes.MaxLFQ.Unique\tModified.Sequence\tStripped.Sequence\tPrecursor.Id\tPrecursor.Charge\tQ.Value\tGlobal.Q.Value\tProtein.Q.Value\tPG.Q.Value\tGlobal.PG.Q.Value\tGG.Q.Value\tTranslated.Q.Value\tProteotypic\tPrecursor.Quantity\tPrecursor.Normalised\tPrecursor.Translated\tQuantity.Quality\tRT\tRT.Start\tRT.Stop\tiRT\tPredicted.RT\tPredicted.iRT\tLib.Q.Value\tMs1.Profile.Corr\tMs1.Area\tEvidence\tSpectrum.Similarity\tMass.Evidence\tCScore\tDecoy.Evidence\tDecoy.CScore\tFragment.Quant.Raw\tFragment.Quant.Corrected\tFragment.Correlations\tMS2.Scan\tIM\tiIM\tPredicted.IM\tPredicted.iIM\n", "F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636.d\t20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636\tQ9UH36\tQ9UH36\t\tSRRD\t3296.49\t3428.89\t3428.89\t3296.49\t3428.89\t3428.89\t3428.89\t(UniMod:1)AAAAAAALESWQAAAPR\tAAAAAAALESWQAAAPR\t(UniMod:1)AAAAAAALESWQAAAPR2\t2\t3.99074e-05\t1.96448e-05\t0.000159821\t0.000159821\t0.000146135\t0.000161212\t0\t1\t3296.49\t3428.89\t3296.49\t0.852479\t19.9208\t19.8731\t19.9685\t123.9\t19.8266\t128.292\t0\t0.960106\t5308.05\t1.96902\t0.683134\t0.362287\t0.999997\t1.23691\t3.43242e-05\t1212.01;2178.03;1390.01;1020.01;714.008;778.008;\t1212.01;1351.73;887.591;432.92;216.728;732.751;\t0.956668;0.757581;0.670497;0.592489;0.47072;0.855203;\t30053\t1.19708\t1.19328\t1.19453\t1.19469\n", 
"F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642.d\t20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642\tQ9UH36\tQ9UH36\t\tSRRD\t2365\t2334.05\t2334.05\t2365\t2334.05\t2334.05\t2334.05\t(UniMod:1)AAAAAAALESWQAAAPR\tAAAAAAALESWQAAAPR\t(UniMod:1)AAAAAAALESWQAAAPR2\t2\t0.000184434\t1.96448e-05\t0.000596659\t0.000596659\t0.000146135\t0.000604961\t0\t1\t2365\t2334.05\t2365\t0.922581\t19.905\t19.8573\t19.9527\t123.9\t19.782\t128.535\t0\t0.940191\t4594.04\t1.31068\t0.758988\t0\t0.995505\t0.28633\t2.12584e-06\t1209.02;1210.02;1414.02;1051.01;236.003;130.002;\t1209.02;1109.89;732.154;735.384;0;46.0967;\t0.919244;0.937624;0.436748;0.639369;0.296736;0.647924;\t30029\t1.195\t1.19328\t1.19381\t1.19339\n", "F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648.d\t20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648\tQ9UH36\tQ9UH36\t\tSRRD\t1664.51\t1635.46\t1635.47\t1664.51\t1635.46\t1635.47\t1635.47\t(UniMod:1)AAAAAAALESWQAAAPR\tAAAAAAALESWQAAAPR\t(UniMod:1)AAAAAAALESWQAAAPR2\t2\t0.000185123\t1.96448e-05\t0.000307409\t0.000307409\t0.000146135\t0.000311332\t0\t1\t1664.51\t1635.46\t1664.51\t0.811147\t19.8893\t19.8416\t19.937\t123.9\t19.7567\t128.896\t0\t0.458773\t6614.06\t1.7503\t0.491071\t0.00111683\t0.997286\t1.92753\t2.80543e-05\t744.01;1708.02;1630.02;1475.02;0;533.006;\t322.907;808.594;577.15;536.033;0;533.006;\t0.760181;0.764072;0.542005;0.415779;0;0.913438;\t30005\t1.19409\t1.19328\t1.19323\t1.19308\n", "''')\n", "\n", "pd.read_csv(copy(diann_tsv_example), sep=\"\\t\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing a more stringent `fdr` threshold ($fdr_{\\text{stringent}} = 10^{-4}$) in the second function call removes the two precursors with a `Q.Value` of $\\sim0.0002$ from the resulting table, leaving only the precursor with a `Q.Value` of $\\sim4 \\times 10^{-5}$." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of
observations (Standard filter): 3\n", "Number of observations (Stringent filter): 1\n" ] } ], "source": [ "# Read the same DIA-NN report twice: once with the reader's default FDR filter, once with a stricter threshold\n", "diann_psm_standard = psm_reader_provider.get_reader('diann').import_file(copy(diann_tsv_example))\n", "diann_psm_custom_fdr = psm_reader_provider.get_reader('diann', fdr=1e-4).import_file(copy(diann_tsv_example))\n", "\n", "print(\"Number of observations (Standard filter):\", len(diann_psm_standard))\n", "print(\"Number of observations (Stringent filter):\", len(diann_psm_custom_fdr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Overall, this tutorial\n", "\n", "- Explains how `alphabase` maps different search engine outputs to a unified format\n", "- Provides examples of how to read PSM tables from different search engines\n", "- Gives an overview of the available readers" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 4 }