instead of \n",
" 'selector': 'td:hover',\n",
" 'props': [('background-color', '#ffffb3')]\n",
"}\n",
"index_names = {\n",
" 'selector': '.index_name',\n",
" 'props': 'font-style: italic; font-weight:bold; position: sticky'\n",
"}\n",
"\n",
"summary_row = {\"selector\": \"tbody tr:last-child\", \"props\": [(\"background-color\", \"#efefef\"), (\"font-weight\", \"bold\")]}\n",
"\n",
"caption = {\n",
" \"selector\": \"caption\",\n",
" \"props\": \"caption-side: top; font-style: italic; font-size: 12pt; text-align:left; margin-bottom: 10pt;\"\n",
"}\n",
"\n",
"# Visualize\n",
"psm_column_mapping_stylized = (\n",
" psm_column_mapping\n",
" .style\n",
"\n",
" .concat(\n",
" summary\n",
" .style\n",
" .relabel_index([\"Missing\"])\n",
" .bar(color='#cccccc', vmin=0, vmax=1)\n",
" )\n",
" .set_caption(\"Mapping of PSM columns to alphabase unified columns\")\n",
" .set_table_styles(\n",
" [headers, cell_hover, index_names, summary_row, caption]\n",
" )\n",
"\n",
")\n",
"\n",
"psm_column_mapping_stylized"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Unifying peptide modifications \n",
"\n",
"Alphabase also unifies the representation of peptide modifications across search engines by mapping them to the community-driven UniMod format.\n",
"\n",
"For example, the MaxQuant-internal representations of a phosphorylated serine are all mapped to a single UniMod-based representation:\n",
"\n",
"| alphabase/UniMod | MaxQuant |\n",
"|------------------|----------|\n",
"| Phospho@S | S(Phospho (S)), S(Phospho (ST)), S(Phospho (STY)), S(Phospho (STYDH)), S(ph), pS |\n",
"\n",
"See `alphabase.psm_reader.psm_reader_yaml[\"modification_mappings\"]` for all mappings as parsed dictionaries and `alphabase.constants.const_files.psm_reader_yaml` for the underlying file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code | Read and parse PSM tables\n",
"\n",
"The alphabase `psm_reader` module parses PSM reports from the most common proteomics search engines into a pandas DataFrame with a single line of code, via its `alphabase.psm_reader.psm_reader_provider` factory."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Available readers \n",
"\n",
"`alphabase.psm_reader.psm_reader_provider` comes with a set of registered reader classes. The names of all implemented readers are the keys of its `reader_dict` property:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Registered readers in alphabase:\n",
"\t- alphadia\n",
"\t- alphadia_parquet\n",
"\t- alphapept\n",
"\t- diann\n",
"\t- maxquant\n",
"\t- msfragger\n",
"\t- msfragger_pepxml\n",
"\t- msfragger_psm_tsv\n",
"\t- openswath\n",
"\t- pfind\n",
"\t- pfind3\n",
"\t- sage_parquet\n",
"\t- sage_tsv\n",
"\t- speclib_tsv\n",
"\t- spectronaut\n",
"\t- spectronaut_report\n",
"\t- swath\n"
]
}
],
"source": [
"all_registered_readers = psm_reader_provider.reader_dict.keys()\n",
"\n",
"# Display all registered readers\n",
"sep = \"\\n\\t- \"\n",
"print(\"Registered readers in alphabase:\", sep.join(sorted(all_registered_readers)), sep=sep)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interact with the reader provider"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example 1 - MaxQuant\n",
"\n",
"We demonstrate how to read PSM tables with alphabase, using a minimal example output of the MaxQuant search engine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's provide some minimal input: the first three rows of a real MaxQuant report."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Raw file Scan number Scan index \\\n",
"0 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81358 73979 \n",
"1 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 81391 74010 \n",
"2 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 107307 98306 \n",
"\n",
" Sequence Length Missed cleavages \\\n",
"0 AAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 \n",
"1 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 29 0 \n",
"2 AAAAAAAGDSDSWDADAFSVEDPVRK 26 1 \n",
"\n",
" Modifications Modified sequence \\\n",
"0 Unmodified _(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation ... \n",
"1 Unmodified _AAAAAAAAAPAAAATAPTTAATTAATAAQ_ \n",
"2 Acetyl (Protein_N-term) _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... \n",
"\n",
" Oxidation (M) Probabilities Oxidation (M) Score diffs ... \\\n",
"0 NaN NaN ... \n",
"1 NaN NaN ... \n",
"2 NaN NaN ... \n",
"\n",
" All sequences \\\n",
"0 AAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQ... \n",
"1 AAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSV... \n",
"2 AAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDG... \n",
"\n",
" All modified sequences Reporter PIF \\\n",
"0 _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVS... NaN \n",
"1 _AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxid... NaN \n",
"2 _(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSV... NaN \n",
"\n",
" Reporter fraction id Protein group IDs Peptide ID Mod. peptide ID \\\n",
"0 NaN 0 1443 0 0 \n",
"1 NaN 1 1443 0 0 \n",
"2 NaN 2 625 1 1 \n",
"\n",
" Evidence ID Oxidation (M) site IDs \n",
"0 0 NaN \n",
"1 1 NaN \n",
"2 2 NaN \n",
"\n",
"[3 rows x 61 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maxquant_example = io.StringIO(\n",
"'''Raw file\tScan number\tScan index\tSequence\tLength\tMissed cleavages\tModifications\tModified sequence\tOxidation (M) Probabilities\tOxidation (M) Score diffs\tAcetyl (Protein_N-term)\tOxidation (M)\tProteins\tCharge\tFragmentation\tMass analyzer\tType\tScan event number\tIsotope index\tm/z\tMass\tMass error [ppm]\tMass error [Da]\tSimple mass error [ppm]\tRetention time\tPEP\tScore\tDelta score\tScore diff\tLocalization prob\tCombinatorics\tPIF\tFraction of total spectrum\tBase peak fraction\tPrecursor full scan number\tPrecursor Intensity\tPrecursor apex fraction\tPrecursor apex offset\tPrecursor apex offset time\tMatches\tIntensities\tMass deviations [Da]\tMass deviations [ppm]\tMasses\tNumber of matches\tIntensity coverage\tPeak coverage\tNeutral loss level\tETD identification type\tReverse\tAll scores\tAll sequences\tAll modified sequences\tReporter PIF\tReporter fraction\tid\tProtein group IDs\tPeptide ID\tMod. peptide ID\tEvidence ID\tOxidation (M) site IDs\n",
"20190402_QX1_SeVW_MA_HeLa_500ng_LC11\t81358\t73979\tAAAAAAAAAPAAAATAPTTAATTAATAAQ\t29\t0\tUnmodified\t_(Acetyl (Protein_N-term))AAAAAAAAM(Oxidation (M))PAAAATAPTTAATTAATAAQ_\t\t\t0\t0\tsp|P37108|SRP14_HUMAN\t3\tHCD\tFTMS\tMULTI-MSMS\t13\t1\t790.07495\t2367.203\t0.35311\t0.00027898\t-0.061634807\t70.261\t0.012774\t41.423\t36.666\tNaN\tNaN\t1\t0\t0\t0\t81345\t10653955\t0.0338597821787898\t-11\t0.139877319335938\ty1;y2;y3;y4;y11;y1-NH3;y2-NH3;a2;b2;b3;b4;b5;b6;b7;b8;b9;b11;b12;b6(2+);b8(2+);b13(2+);b18(2+)\t2000000;2000000;300000;400000;200000;1000000;400000;300000;600000;1000000;2000000;3000000;3000000;3000000;3000000;2000000;600000;500000;1000000;2000000;300000;200000\t5.2861228709844E-06;-6.86980268369553E-05;-0.00238178789771837;0.000624715964988809;-0.0145624692099773;-0.000143471782706683;-0.000609501446461991;-0.000524972720768346;0.00010190530804266;5.8620815195809E-05;0.000229901232955854;-0.000108750048696038;-0.000229593152369034;0.00183148682538103;0.00276641182404092;0.000193118923334623;0.00200988580445483;0.000102216846016745;5.86208151389656E-05;0.000229901232955854;-0.00104559184393338;0.00525030008475369\t0.0359413365445091;-0.314964433555295;-8.23711898839045;1.60102421155213;-14.8975999917227;-1.10320467763838;-3.03102462870716;-4.56152475051625;0.712219104095465;0.273777366204575;0.806231096969562;-0.305312183824154;-0.537399178230218;3.67572664689217;4.85930954169285;0.301587577451224;2.48616190909398;0.116225745519871;0.273777365939099;0.806231096969562;-2.19774169175011;7.53961026980589\t147.076413378177;218.113601150127;289.153028027798;390.197699998035;977.50437775671;130.050013034583;201.087592852046;115.087114392821;143.081402136892;214.118559209185;285.155501716567;356.192954155649;427.230188786552;498.265241494374;569.301420357176;640.341107437877;808.429168310795;879.468189767554;214.118559209185;285.155501716567;475.757386711244;696.362265007215\t22\t0.262893575628735\t0.0826446280991736\tNone\tUnknown\t\t41.4230894199432;4.756687248624
49;3.9515580701967\tAAAAAAAAAPAAAATAPTTAATTAATAAQ;FHRGPPDKDDMVSVTQILQGK;PVTLWITVTHMQADEVSVWR\t_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_FHRGPPDKDDMVSVTQILQGK_;_PVTLWITVTHMQADEVSVWR_\t\t\t0\t1443\t0\t0\t0\t\n",
"20190402_QX1_SeVW_MA_HeLa_500ng_LC11\t81391\t74010\tAAAAAAAAAAPAAAATAPTTAATTAATAAQ\t29\t0\tUnmodified\t_AAAAAAAAAPAAAATAPTTAATTAATAAQ_\t\t\t0\t0\tsp|P37108|SRP14_HUMAN\t2\tHCD\tFTMS\tMULTI-MSMS\t14\t0\t1184.6088\t2367.203\t0.037108\t4.3959E-05\t1.7026696\t70.287\t7.1474E-09\t118.21\t100.52\tNaN\tNaN\t1\t0\t0\t0\t81377\t9347701\t0.166790347889974\t-10\t0.12664794921875\ty1;y2;y3;y4;y5;y9;y12;y13;y14;y20;y13-H2O;y20-H2O;y1-NH3;y20-NH3;b3;b4;b5;b6;b7;b8;b9;b11;b12;b13;b14;b15;b16;b19;b15-H2O;b16-H2O\t500000;600000;200000;400000;200000;100000;200000;1000000;200000;300000;200000;100000;100000;70000;300000;900000;2000000;3000000;5000000;8000000;6000000;600000;800000;600000;200000;300000;200000;300000;300000;1000000\t-0.000194444760495571;0.000149986878682284;0.000774202587820128;-0.0002445094036716;0.000374520568641401;-0.00694293246522193;-0.0109837291331587;-0.0037745820627606;-0.000945546471939451;0.00152326440706929;0.00506054832726477;0.00996886361417637;6.25847393393997E-05;-0.024881067836759;-3.11821549132674E-05;-0.000183099230639527;0.000161332473453513;0.000265434980121881;0.000747070697229901;0.000975534518261156;0.00101513939785036;0.00651913000274362;0.0058584595163893;0.00579536744021425;0.00131097834105276;-0.0131378531671089;0.00472955218901916;-0.00161006322559842;-0.00201443239325272;0.0227149399370319\t-1.32206444236914;0.687655553213019;2.6775131607882;-0.626628140021726;0.811995006209331;-8.6203492854282;-10.1838066275079;-3.21078702288986;-0.758483069159249;0.881072738747222;4.37168212373889;5.82682888353564;0.481236695337485;-14.5343986203644;-0.145630261806375;-0.642102166533079;0.452935954800214;0.621293379181583;1.49934012872483;1.71355878380837;1.58531240493271;8.06399202403175;6.6614096214532;6.09718023739784;1.28333378040908;-11.7030234519348;3.96235146626144;-1.07856912288932;-1.82370619437775;19.3220953109188\t147.07661310906;218.113382465221;289.149872037312;390.198569223404;461.235063981231;805.411965958065;1078.54847749073;1175.594032195
66;1246.62831694787;1728.87474561429;1157.57463237897;1710.85573532879;130.049806978061;1711.87460084504;214.118649012155;285.155914717031;356.192684073126;427.22969375842;498.266325910503;569.303211234482;640.340285417402;808.424659066597;879.462433524883;950.49961040476;1021.54120858166;1122.60333588727;1193.62258226971;1492.77704268533;1104.58164778019;1175.59403219566\t30\t0.474003002083763\t0.167630057803468\tNone\tUnknown\t\t118.209976573419;17.6937689289157;17.2534171481793\tAAAAAAAAAPAAAATAPTTAATTAATAAQ;SELKQEAMQSEQLQSVLYLK;VGSSVPSKASELVVMGDHDAARR\t_AAAAAAAAAPAAAATAPTTAATTAATAAQ_;_SELKQEAM(Oxidation (M))QSEQLQSVLYLK_;_VGSSVPSKASELVVMGDHDAARR_\t\t\t1\t1443\t0\t0\t1\t\n",
"20190402_QX1_SeVW_MA_HeLa_500ng_LC11\t107307\t98306\tAAAAAAAGDSDSWDADAFSVEDPVRK\t26\t1\tAcetyl (Protein_N-term)\t_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_\t\t\t1\t0\tsp|O75822|EIF3J_HUMAN\t3\tHCD\tFTMS\tMULTI-MSMS\t10\t2\t879.06841\t2634.1834\t-0.93926\t-0.00082567\t-3.2012471\t90.978\t2.1945E-12\t148.95\t141.24\tNaN\tNaN\t1\t0\t0\t0\t107297\t10193939\t0.267970762043589\t-8\t0.10211181640625\ty1;y2;y4;y5;y6;y7;y8;y9;y10;y11;y12;y13;y14;y15;y17;y18;y19;y20;y21;y23;y21-H2O;y1-NH3;y19-NH3;y14(2+);y16(2+);y22(2+);a2;b2;b3;b4;b5;b6;b7\t300000;200000;3000000;600000;1000000;500000;2000000;1000000;1000000;1000000;90000;1000000;400000;900000;1000000;400000;3000000;2000000;1000000;400000;100000;200000;200000;80000;100000;200000;200000;2000000;5000000;5000000;5000000;2000000;300000\t1.34859050149316E-07;-6.05140996867704E-06;2.27812602133781E-05;0.00128986659160546;-0.00934536073077652;0.000941953783126337;-0.00160424237344614;-0.00239257341399934;-0.00111053968612396;-0.00331340710044969;0.00330702864630439;0.000963683996815234;0.00596290290945944;-0.00662057038289277;-0.0117122701335575;0.00777853472800416;0.0021841542961738;0.000144322111736983;-0.00087403893667215;0.0197121595674616;-0.021204007680808;-0.000308954599830713;-0.026636719419912;-0.0137790992353075;0.00596067266928912;-0.0077053835773313;9.11402199221811E-06;-0.000142539300128419;-0.000251999832926231;1.90791054137662E-05;-0.00236430185879044;-9.54583337602344E-05;-0.000556959493223985\t0.000916705048437201;-0.0199575598103408;0.0456231928690862;2.09952637717462;-12.5708704058425;1.11808305811426;-1.72590731777249;-2.22239181008062;-0.967696370445928;-2.62418809422166;2.47964286628144;0.665205752892023;3.64753748704453;-3.84510115530963;-6.08782672045773;3.81508105974837;1.04209904973991;0.0666012719936656;-0.390545453668809;8.28224925531311;-9.55133250134922;-2.37499239179248;-12.8127653858411;-16.846761946123;6.48662354975264;-6.67117082062383;0.0580151981289049;-0.770098855873447;-0.98387689568
8683;0.0583162347158579;-5.93738717724506;-0.203431522818505;-1.03087538746314\t147.112804035741;303.21392125011;499.33507018564;614.360746132308;743.413974455831;842.472101057517;929.506675663573;1076.57587791081;1147.61170966489;1262.6408555643;1333.67134891635;1448.700635293;1634.77494902759;1721.81956091078;1923.88362405243;2038.89107627957;2095.9181343836;2166.95728800359;2237.99542015244;2380.04906152953;2220.00518543488;130.0865640237;2078.92040615582;817.907873297785;918.917619246831;1155.02717356753;157.097144992378;185.0922112678;256.129434516133;327.166277224995;398.205774393759;469.240619338034;540.278194626993\t33\t0.574496146107112\t0.14410480349345\tNone\tUnknown\t\t148.951235201399;7.71201258444522;7.36039532447559\tAAAAAAAGDSDSWDADAFSVEDPVRK;PSRQESELMWQWVDQRSDGER;HTLTSFWNFKAGCEEKCYSNR\t_(Acetyl (Protein_N-term))AAAAAAAGDSDSWDADAFSVEDPVRK_;_PSRQESELM(Oxidation (M))WQWVDQRSDGER_;_HTLTSFWNFKAGCEEKCYSNR_\t\t\t2\t625\t1\t1\t2\t'''\n",
")\n",
"\n",
"# Parse with pandas for visualization purposes\n",
"pd.read_csv(copy(maxquant_example), sep=\"\\t\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
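"Then use the `psm_reader_provider.get_reader` method to get the MaxQuant report reader. Its `import_file` method reads the file and directly returns the parsed table as a pandas DataFrame. Note that precursors with modifications unknown to the reader are removed (see the warning), which is why only one of the three input rows remains in the result."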
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/alphabase/psm_reader/psm_reader.py:318: UserWarning: Unknown modifications: {'_(Acetyl (Protein_N-term))'}. Precursors with unknown modifications will be removed.\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
" sequence charge rt scan_num \\\n",
"0 AAAAAAAAAAPAAAATAPTTAATTAATAAQ 2 70.287 81391 \n",
"\n",
" raw_name precursor_mz score \\\n",
"0 20190402_QX1_SeVW_MA_HeLa_500ng_LC11 1184.6088 118.21 \n",
"\n",
" proteins decoy spec_idx mods mod_sites nAA rt_norm \n",
"0 sp|P37108|SRP14_HUMAN 0 81390 30 0.772571 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maxquant_reader = psm_reader_provider.get_reader('maxquant')\n",
"\n",
"# Import from a file or a file-like object (here: a StringIO buffer)\n",
"maxquant_report = maxquant_reader.import_file(maxquant_example)\n",
"\n",
"# The parsed PSM table is also stored on the reader instance as the `psm_df` attribute\n",
"# maxquant_report = maxquant_reader.psm_df\n",
"\n",
"maxquant_report"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example 2 - Set custom arguments \n",
"\n",
"The reader can also be customized by passing specific arguments. For example, one can set a more stringent `fdr` filter (default: $fdr=0.01$). We showcase this with a DIA-NN PSM report table."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" File.Name \\\n",
"0 F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... \n",
"1 F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... \n",
"2 F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_... \n",
"\n",
" Run Protein.Group \\\n",
"0 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 \n",
"1 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 \n",
"2 20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_sp... Q9UH36 \n",
"\n",
" Protein.Ids Protein.Names Genes PG.Quantity PG.Normalised PG.MaxLFQ \\\n",
"0 Q9UH36 NaN SRRD 3296.49 3428.89 3428.89 \n",
"1 Q9UH36 NaN SRRD 2365.00 2334.05 2334.05 \n",
"2 Q9UH36 NaN SRRD 1664.51 1635.46 1635.47 \n",
"\n",
" Genes.Quantity ... Decoy.Evidence Decoy.CScore \\\n",
"0 3296.49 ... 1.23691 0.000034 \n",
"1 2365.00 ... 0.28633 0.000002 \n",
"2 1664.51 ... 1.92753 0.000028 \n",
"\n",
" Fragment.Quant.Raw \\\n",
"0 1212.01;2178.03;1390.01;1020.01;714.008;778.008; \n",
"1 1209.02;1210.02;1414.02;1051.01;236.003;130.002; \n",
"2 744.01;1708.02;1630.02;1475.02;0;533.006; \n",
"\n",
" Fragment.Quant.Corrected \\\n",
"0 1212.01;1351.73;887.591;432.92;216.728;732.751; \n",
"1 1209.02;1109.89;732.154;735.384;0;46.0967; \n",
"2 322.907;808.594;577.15;536.033;0;533.006; \n",
"\n",
" Fragment.Correlations MS2.Scan IM \\\n",
"0 0.956668;0.757581;0.670497;0.592489;0.47072;0.... 30053 1.19708 \n",
"1 0.919244;0.937624;0.436748;0.639369;0.296736;0... 30029 1.19500 \n",
"2 0.760181;0.764072;0.542005;0.415779;0;0.913438; 30005 1.19409 \n",
"\n",
" iIM Predicted.IM Predicted.iIM \n",
"0 1.19328 1.19453 1.19469 \n",
"1 1.19328 1.19381 1.19339 \n",
"2 1.19328 1.19323 1.19308 \n",
"\n",
"[3 rows x 52 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diann_tsv_example = io.StringIO(r'''File.Name\tRun\tProtein.Group\tProtein.Ids\tProtein.Names\tGenes\tPG.Quantity\tPG.Normalised\tPG.MaxLFQ\tGenes.Quantity\tGenes.Normalised\tGenes.MaxLFQ\tGenes.MaxLFQ.Unique\tModified.Sequence\tStripped.Sequence\tPrecursor.Id\tPrecursor.Charge\tQ.Value\tGlobal.Q.Value\tProtein.Q.Value\tPG.Q.Value\tGlobal.PG.Q.Value\tGG.Q.Value\tTranslated.Q.Value\tProteotypic\tPrecursor.Quantity\tPrecursor.Normalised\tPrecursor.Translated\tQuantity.Quality\tRT\tRT.Start\tRT.Stop\tiRT\tPredicted.RT\tPredicted.iRT\tLib.Q.Value\tMs1.Profile.Corr\tMs1.Area\tEvidence\tSpectrum.Similarity\tMass.Evidence\tCScore\tDecoy.Evidence\tDecoy.CScore\tFragment.Quant.Raw\tFragment.Quant.Corrected\tFragment.Correlations\tMS2.Scan\tIM\tiIM\tPredicted.IM\tPredicted.iIM\n",
"F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636.d\t20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A2_1_22636\tQ9UH36\tQ9UH36\t\tSRRD\t3296.49\t3428.89\t3428.89\t3296.49\t3428.89\t3428.89\t3428.89\t(UniMod:1)AAAAAAALESWQAAAPR\tAAAAAAALESWQAAAPR\t(UniMod:1)AAAAAAALESWQAAAPR2\t2\t3.99074e-05\t1.96448e-05\t0.000159821\t0.000159821\t0.000146135\t0.000161212\t0\t1\t3296.49\t3428.89\t3296.49\t0.852479\t19.9208\t19.8731\t19.9685\t123.9\t19.8266\t128.292\t0\t0.960106\t5308.05\t1.96902\t0.683134\t0.362287\t0.999997\t1.23691\t3.43242e-05\t1212.01;2178.03;1390.01;1020.01;714.008;778.008;\t1212.01;1351.73;887.591;432.92;216.728;732.751;\t0.956668;0.757581;0.670497;0.592489;0.47072;0.855203;\t30053\t1.19708\t1.19328\t1.19453\t1.19469\n",
"F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642.d\t20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-A8_1_22642\tQ9UH36\tQ9UH36\t\tSRRD\t2365\t2334.05\t2334.05\t2365\t2334.05\t2334.05\t2334.05\t(UniMod:1)AAAAAAALESWQAAAPR\tAAAAAAALESWQAAAPR\t(UniMod:1)AAAAAAALESWQAAAPR2\t2\t0.000184434\t1.96448e-05\t0.000596659\t0.000596659\t0.000146135\t0.000604961\t0\t1\t2365\t2334.05\t2365\t0.922581\t19.905\t19.8573\t19.9527\t123.9\t19.782\t128.535\t0\t0.940191\t4594.04\t1.31068\t0.758988\t0\t0.995505\t0.28633\t2.12584e-06\t1209.02;1210.02;1414.02;1051.01;236.003;130.002;\t1209.02;1109.89;732.154;735.384;0;46.0967;\t0.919244;0.937624;0.436748;0.639369;0.296736;0.647924;\t30029\t1.195\t1.19328\t1.19381\t1.19339\n",
"F:\\XXX\\20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648.d\t20201218_tims03_Evo03_PS_SA_HeLa_200ng_high_speed_21min_8cm_S2-B2_1_22648\tQ9UH36\tQ9UH36\t\tSRRD\t1664.51\t1635.46\t1635.47\t1664.51\t1635.46\t1635.47\t1635.47\t(UniMod:1)AAAAAAALESWQAAAPR\tAAAAAAALESWQAAAPR\t(UniMod:1)AAAAAAALESWQAAAPR2\t2\t0.000185123\t1.96448e-05\t0.000307409\t0.000307409\t0.000146135\t0.000311332\t0\t1\t1664.51\t1635.46\t1664.51\t0.811147\t19.8893\t19.8416\t19.937\t123.9\t19.7567\t128.896\t0\t0.458773\t6614.06\t1.7503\t0.491071\t0.00111683\t0.997286\t1.92753\t2.80543e-05\t744.01;1708.02;1630.02;1475.02;0;533.006;\t322.907;808.594;577.15;536.033;0;533.006;\t0.760181;0.764072;0.542005;0.415779;0;0.913438;\t30005\t1.19409\t1.19328\t1.19323\t1.19308\n",
"''')\n",
"\n",
"pd.read_csv(copy(diann_tsv_example), sep=\"\\t\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing the more stringent `fdr` filter ($fdr_{\\text{stringent}} = 10^{-4}$) in the second function call removes the two precursors with an FDR (`Q.Value`) of $\\sim0.0002$ from the resulting table."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of observations (Standard filter): 3\n",
"Number of observations (Stringent filter): 1\n"
]
}
],
"source": [
"# Read PSM reports with one-liners\n",
"diann_psm_standard = psm_reader_provider.get_reader('diann').import_file(copy(diann_tsv_example))\n",
"diann_psm_custom_fdr = psm_reader_provider.get_reader('diann', fdr=1e-4).import_file(copy(diann_tsv_example))\n",
"\n",
"print(\"Number of observations (Standard filter):\", len(diann_psm_standard))\n",
"print(\"Number of observations (Stringent filter):\", len(diann_psm_custom_fdr))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"Overall, this tutorial\n",
"\n",
"- Explains how `alphabase` maps different search engine outputs to a unified format\n",
"- Provides examples of how to read PSM tables from different search engines\n",
"- Gives an overview of the available readers"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}