{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: Basic Definitions and Settings\n", "\n", "Measuring m/z values is the fundamental function of MS technologies, so calculating mass values for a peptide and its fragments is the most essential part of MS-based computational tools. AlphaBase calculates all mass values from atoms: the masses of amino acids and modifications are calculated from their atom compositions. The masses of peptides or precursors, as well as their fragments, are then calculated from the amino acid sequences with or without modifications (see figure below).\n", "\n", "Calculating masses from atoms makes it much easier to switch between unlabeled and heavy-labeled peptides, as we did in Stellar MS, where 15N-labeled peptides serve as the reference for targeted proteomics (https://www.biorxiv.org/content/10.1101/2024.06.02.597029v2.full).\n", "\n", "The other advantage of starting from atoms is that AlphaBase can calculate isotope distributions of peptides based on a pre-defined isotope distribution list of atoms (e.g., the NIST atom table at https://physics.nist.gov/cgi-bin/Compositions/stand_alone.pl). The isotope information has been applied in our alphaDIA search engine to boost the identification of DIA-MS data (https://www.biorxiv.org/content/10.1101/2024.05.28.596182v1)."
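, "\n", "\n", "As a worked example of this bottom-up calculation, consider the alanine residue, whose atom composition is C(3)H(5)N(1)O(1). Taking the mono-isotopic masses of C, H, N and O from the NIST table (12.0, 1.00782503223, 14.00307400443 and 15.99491461957), the residue mass is:\n", "\n", "$$\n", "m_{\\mathrm{A}} = 3 \\times 12.0 + 5 \\times 1.00782503223 + 14.00307400443 + 15.99491461957 = 71.03711378515\n", "$$\n", "\n", "This agrees with the value 71.03711379 that AlphaBase returns for `A` in the batch examples later in this tutorial."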
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![atom-to-peptides.png](atom-to-peptides.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Atoms/Elements\n", "\n", "The masses of all amino acids and modifications are calculated from their atom compositions.\n", "\n", "The atom information is defined in https://github.com/MannLabs/alphabase/blob/main/alphabase/constants/const_files/nist_element.yaml, which is parsed from NIST (see https://github.com/MannLabs/alphabase/blob/main/scripts/nist_chem_to_yaml.ipynb).\n", "\n", "After adding some heavy isotopes, including 13C, 15N, 2H, and 18O, we obtain 109 kinds of atoms:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:22.699057Z", "start_time": "2025-01-30T16:49:22.690604Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:42.987776Z", "iopub.status.busy": "2026-01-05T22:43:42.987610Z", "iopub.status.idle": "2026-01-05T22:43:45.610885Z", "shell.execute_reply": "2026-01-05T22:43:45.610625Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " abundance \\\n", "13C [0.01, 0.99] \n", "14N [0.996337, 0.003663] \n", "15N [0.01, 0.99] \n", "18O [0.005, 0.005, 0.99] \n", "2H [0.01, 0.99] \n", ".. ... \n", "Xe [0.000952, 0.00089, 0.019102, 0.264006, 0.0407... \n", "Y [1.0] \n", "Yb [0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0.... \n", "Zn [0.4917, 0.2773, 0.0404, 0.1845, 0.0061] \n", "Zr [0.5145, 0.1122, 0.1715, 0.1738, 0.028] \n", "\n", " mass \n", "13C [12.0, 13.00335483507] \n", "14N [14.00307400443, 15.00010889888] \n", "15N [14.00307400443, 15.00010889888] \n", "18O [15.99491461957, 16.9991317565, 17.99915961286] \n", "2H [1.00782503223, 2.01410177812] \n", ".. ... \n", "Xe [123.905892, 125.9042983, 127.903531, 128.9047... \n", "Y [88.9058403] \n", "Yb [167.9338896, 169.9347664, 170.9363302, 171.93... \n", "Zn [63.92914201, 65.92603381, 66.92712775, 67.924... \n", "Zr [89.9046977, 90.9056396, 91.9050347, 93.906310... \n", "\n", "[109 rows x 2 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from alphabase.constants.atom import CHEM_INFO_DICT\n", "pd.DataFrame().from_dict(CHEM_INFO_DICT, orient='index')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Their mono-isotopic masses are stored in the `CHEM_MONO_MASS` dict:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:23.563685Z", "start_time": "2025-01-30T16:49:23.559129Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.625017Z", "iopub.status.busy": "2026-01-05T22:43:45.624885Z", "iopub.status.idle": "2026-01-05T22:43:45.628025Z", "shell.execute_reply": "2026-01-05T22:43:45.627787Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0\n", "13C 13.003355\n", "14N 14.003074\n", "15N 15.000109\n", "18O 17.999160\n", "2H 2.014102\n", ".. ...\n", "Xe 131.904155\n", "Y 88.905840\n", "Yb 173.938866\n", "Zn 63.929142\n", "Zr 89.904698\n", "\n", "[109 rows x 1 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.atom import CHEM_MONO_MASS\n", "pd.DataFrame().from_dict(CHEM_MONO_MASS, orient='index')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These atom masses are used to calculate the masses of amino acids, modifications, and then subsequent masses of peptides and fragments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Commonly used molecular masses" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:24.557595Z", "start_time": "2025-01-30T16:49:24.555151Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.629265Z", "iopub.status.busy": "2026-01-05T22:43:45.629192Z", "iopub.status.idle": "2026-01-05T22:43:45.631231Z", "shell.execute_reply": "2026-01-05T22:43:45.631023Z" } }, "outputs": [ { "data": { "text/plain": [ "(1.007276467, 1.0033, 17.02654910112, 18.01056468403)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.atom import (\n", " MASS_PROTON, MASS_ISOTOPE, MASS_NH3, MASS_H2O\n", ")\n", "MASS_PROTON, MASS_ISOTOPE, MASS_NH3, MASS_H2O" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Amino Acids" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:25.418105Z", "start_time": "2025-01-30T16:49:25.413661Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.632415Z", "iopub.status.busy": "2026-01-05T22:43:45.632353Z", "iopub.status.idle": "2026-01-05T22:43:45.642932Z", "shell.execute_reply": "2026-01-05T22:43:45.642744Z" } }, "outputs": [ { 
"data": { "text/html": [ "
" ], "text/plain": [ " aa formula \\\n", "65 A C(3)H(5)N(1)O(1)S(0) \n", "66 B C(1000000) \n", "67 C C(3)H(5)N(1)O(1)S(1) \n", "68 D C(4)H(5)N(1)O(3)S(0) \n", "69 E C(5)H(7)N(1)O(3)S(0) \n", "70 F C(9)H(9)N(1)O(1)S(0) \n", "71 G C(2)H(3)N(1)O(1)S(0) \n", "72 H C(6)H(7)N(3)O(1)S(0) \n", "73 I C(6)H(11)N(1)O(1)S(0) \n", "74 J C(6)H(11)N(1)O(1)S(0) \n", "75 K C(6)H(12)N(2)O(1)S(0) \n", "76 L C(6)H(11)N(1)O(1)S(0) \n", "77 M C(5)H(9)N(1)O(1)S(1) \n", "78 N C(4)H(6)N(2)O(2)S(0) \n", "79 O C(12)H(19)N(3)O(2) \n", "80 P C(5)H(7)N(1)O(1)S(0) \n", "81 Q C(5)H(8)N(2)O(2)S(0) \n", "82 R C(6)H(12)N(4)O(1)S(0) \n", "83 S C(3)H(5)N(1)O(2)S(0) \n", "84 T C(4)H(7)N(1)O(2)S(0) \n", "85 U C(3)H(5)N(1)O(1)Se(1) \n", "86 V C(5)H(9)N(1)O(1)S(0) \n", "87 W C(11)H(10)N(2)O(1)S(0) \n", "88 X C(1000000) \n", "89 Y C(9)H(9)N(1)O(2)S(0) \n", "90 Z C(1000000) \n", "\n", " smiles mass \n", "65 N([Fl])([Fl])[C@@]([H])(C)C(=O)[Ts] 7.103711e+01 \n", "66 NaN 1.200000e+07 \n", "67 N([Fl])([Fl])[C@@]([H])(CS)C(=O)[Ts] 1.030092e+02 \n", "68 N([Fl])([Fl])[C@@]([H])(CC(=O)O)C(=O)[Ts] 1.150269e+02 \n", "69 N([Fl])([Fl])[C@@]([H])(CCC(=O)O)C(=O)[Ts] 1.290426e+02 \n", "70 N([Fl])([Fl])[C@@]([H])(Cc1ccccc1)C(=O)[Ts] 1.470684e+02 \n", "71 N([Fl])([Fl])CC(=O)[Ts] 5.702146e+01 \n", "72 N([Fl])([Fl])[C@@]([H])(CC1=CN=C-N1)C(=O)[Ts] 1.370589e+02 \n", "73 N([Fl])([Fl])[C@@]([H])([C@]([H])(CC)C)C(=O)[Ts] 1.130841e+02 \n", "74 NaN 1.130841e+02 \n", "75 N([Fl])([Fl])[C@@]([H])(CCCCN)C(=O)[Ts] 1.280950e+02 \n", "76 N([Fl])([Fl])[C@@]([H])(CC(C)C)C(=O)[Ts] 1.130841e+02 \n", "77 N([Fl])([Fl])[C@@]([H])(CCSC)C(=O)[Ts] 1.310405e+02 \n", "78 N([Fl])([Fl])[C@@]([H])(CC(=O)N)C(=O)[Ts] 1.140429e+02 \n", "79 C[C@@H]1CC=N[C@H]1C(=O)NCCCC[C@@H](C(=O)[Ts])N... 
2.371477e+02 \n", "80 N1([Fl])[C@@]([H])(CCC1)C(=O)[Ts] 9.705276e+01 \n", "81 N([Fl])([Fl])[C@@]([H])(CCC(=O)N)C(=O)[Ts] 1.280586e+02 \n", "82 N([Fl])([Fl])[C@@]([H])(CCCNC(=N)N)C(=O)[Ts] 1.561011e+02 \n", "83 N([Fl])([Fl])[C@@]([H])(CO)C(=O)[Ts] 8.703203e+01 \n", "84 N([Fl])([Fl])[C@@]([H])([C@]([H])(O)C)C(=O)[Ts] 1.010477e+02 \n", "85 N([Fl])([Fl])[C@@]([H])(C[Se][H])C(=O)[Ts] 1.509536e+02 \n", "86 N([Fl])([Fl])[C@@]([H])(C(C)C)C(=O)[Ts] 9.906841e+01 \n", "87 N([Fl])([Fl])[C@@]([H])(CC(=CN2)C1=C2C=CC=C1)C... 1.860793e+02 \n", "88 NaN 1.200000e+07 \n", "89 N([Fl])([Fl])[C@@]([H])(Cc1ccc(O)cc1)C(=O)[Ts] 1.630633e+02 \n", "90 NaN 1.200000e+07 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.aa import AA_DF\n", "AA_DF.loc[ord('A'):ord('Z')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In `AA_DF`, amino acids are indexed by their ASCII codes (128 characters), so 65 == ord('A'), ..., 90 == ord('Z'). Because NumPy stores unicode strings as 4-byte code points, a string array can be quickly converted to these int32 values using `np.ndarray.view()`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:26.227920Z", "start_time": "2025-01-30T16:49:26.225581Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.644189Z", "iopub.status.busy": "2026-01-05T22:43:45.644126Z", "iopub.status.idle": "2026-01-05T22:43:45.646215Z", "shell.execute_reply": "2026-01-05T22:43:45.646008Z" } }, "outputs": [ { "data": { "text/plain": [ "array([65, 66, 67, 88, 89, 90], dtype=int32)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "np.array(['ABCXYZ']).view(np.int32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Users do not need to handle this encoding themselves, as AlphaBase provides easy-to-use functions to get residue masses from sequences."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculate AA masses in batch" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:27.796494Z", "start_time": "2025-01-30T16:49:27.793162Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.647393Z", "iopub.status.busy": "2026-01-05T22:43:45.647331Z", "iopub.status.idle": "2026-01-05T22:43:45.649319Z", "shell.execute_reply": "2026-01-05T22:43:45.649133Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[131.04048509, 71.03711379, 103.00918496, 115.02694302,\n", " 129.04259309, 147.06841391, 57.02146372],\n", " [131.04048509, 71.03711379, 128.09496302, 115.02694302,\n", " 129.04259309, 147.06841391, 57.02146372],\n", " [131.04048509, 71.03711379, 128.09496302, 115.02694302,\n", " 129.04259309, 147.06841391, 156.10111102]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.aa import calc_AA_masses_for_same_len_seqs\n", "calc_AA_masses_for_same_len_seqs(\n", " [\n", " 'MACDEFG', 'MAKDEFG', 'MAKDEFR'\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:28.328268Z", "start_time": "2025-01-30T16:49:28.325621Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.650418Z", "iopub.status.busy": "2026-01-05T22:43:45.650345Z", "iopub.status.idle": "2026-01-05T22:43:45.652264Z", "shell.execute_reply": "2026-01-05T22:43:45.652092Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[1.31040485e+02, 1.00000000e+08, 1.00000000e+08, 1.00000000e+08,\n", " 1.00000000e+08, 1.00000000e+08, 1.00000000e+08],\n", " [1.31040485e+02, 7.10371138e+01, 1.28094963e+02, 1.00000000e+08,\n", " 1.00000000e+08, 1.00000000e+08, 1.00000000e+08],\n", " [1.31040485e+02, 7.10371138e+01, 1.28094963e+02, 1.15026943e+02,\n", " 1.29042593e+02, 1.47068414e+02, 1.56101111e+02]])" ] }, "execution_count": 7, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.aa import calc_AA_masses_for_var_len_seqs\n", "calc_AA_masses_for_var_len_seqs(\n", " [\n", " 'M', 'MAK', 'MAKDEFR'\n", " ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifications\n", "\n", "In AlphaBase, we use `mod_name@aa` to represent a modification, where `mod_name` comes from UniMod. We also use `mod_name@Protein_N-term`, `mod_name@Any_N-term` and `mod_name@Any_C-term` for terminal modifications, following the UniMod terminal naming scheme.\n", "\n", "The default modification TSV is stored in https://github.com/MannLabs/alphabase/blob/main/alphabase/constants/const_files/modification.tsv, which is loaded when AlphaBase starts up.\n", "Users can add more modifications to the TSV file (only the `mod_name` and `composition` columns are required), e.g. by using the https://github.com/MannLabs/alphabase/blob/main/scripts/unimod_to_tsv.ipynb notebook." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:29.682938Z", "start_time": "2025-01-30T16:49:29.674414Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.653275Z", "iopub.status.busy": "2026-01-05T22:43:45.653216Z", "iopub.status.idle": "2026-01-05T22:43:45.685059Z", "shell.execute_reply": "2026-01-05T22:43:45.684819Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mod_name unimod_mass \\\n", "mod_name \n", "Acetyl@T Acetyl@T 42.010565 \n", "Acetyl@Protein_N-term Acetyl@Protein_N-term 42.010565 \n", "Acetyl@S Acetyl@S 42.010565 \n", "Acetyl@C Acetyl@C 42.010565 \n", "Acetyl@Any_N-term Acetyl@Any_N-term 42.010565 \n", "... ... ... \n", "Lactyl@Any_N-term Lactyl@Any_N-term 72.021129 \n", "Lactyl@Protein_N-term Lactyl@Protein_N-term 72.021129 \n", "YnLactyl@K YnLactyl@K 239.126991 \n", "YnLactyl@Any_N-term YnLactyl@Any_N-term 239.126991 \n", "YnLactyl@Protein_N-term YnLactyl@Protein_N-term 239.126991 \n", "\n", " unimod_avge_mass composition unimod_modloss \\\n", "mod_name \n", "Acetyl@T 42.0367 H(2)C(2)O(1) 0.0 \n", "Acetyl@Protein_N-term 42.0367 H(2)C(2)O(1) 0.0 \n", "Acetyl@S 42.0367 H(2)C(2)O(1) 0.0 \n", "Acetyl@C 42.0367 H(2)C(2)O(1) 0.0 \n", "Acetyl@Any_N-term 42.0367 H(2)C(2)O(1) 0.0 \n", "... ... ... ... \n", "Lactyl@Any_N-term 72.0627 H(4)C(3)O(2) 0.0 \n", "Lactyl@Protein_N-term 72.0627 H(4)C(3)O(2) 0.0 \n", "YnLactyl@K 239.2941 H(17)C(11)N(3)O(3) 0.0 \n", "YnLactyl@Any_N-term 239.2941 H(17)C(11)N(3)O(3) 0.0 \n", "YnLactyl@Protein_N-term 239.2941 H(17)C(11)N(3)O(3) 0.0 \n", "\n", " modloss_composition classification unimod_id \\\n", "mod_name \n", "Acetyl@T Post-translational 1 \n", "Acetyl@Protein_N-term Post-translational 1 \n", "Acetyl@S Post-translational 1 \n", "Acetyl@C Post-translational 1 \n", "Acetyl@Any_N-term Multiple 1 \n", "... ... ... ... \n", "Lactyl@Any_N-term Post-translational 0 \n", "Lactyl@Protein_N-term Post-translational 0 \n", "YnLactyl@K Post-translational 0 \n", "YnLactyl@Any_N-term Post-translational 0 \n", "YnLactyl@Protein_N-term Post-translational 0 \n", "\n", " smiles \\\n", "mod_name \n", "Acetyl@T \n", "Acetyl@Protein_N-term CC(=O)[Ts] \n", "Acetyl@S \n", "Acetyl@C \n", "Acetyl@Any_N-term CC(=O)[Ts] \n", "... ... 
\n", "Lactyl@Any_N-term C[C@@H](O)C(=O)[Ts] \n", "Lactyl@Protein_N-term C[C@@H](O)C(=O)[Ts] \n", "YnLactyl@K OCCCCCCN1C=C(C[C@@H](O)C(=O)NCCCC[C@H](N([Fl])... \n", "YnLactyl@Any_N-term OCCCCCCN1C=C(C[C@@H](O)C(=O)[Ts])N=N1 \n", "YnLactyl@Protein_N-term OCCCCCCN1C=C(C[C@@H](O)C(=O)[Ts])N=N1 \n", "\n", " modloss_importance mass modloss_original \\\n", "mod_name \n", "Acetyl@T 0.0 42.010565 0.0 \n", "Acetyl@Protein_N-term 0.0 42.010565 0.0 \n", "Acetyl@S 0.0 42.010565 0.0 \n", "Acetyl@C 0.0 42.010565 0.0 \n", "Acetyl@Any_N-term 0.0 42.010565 0.0 \n", "... ... ... ... \n", "Lactyl@Any_N-term 0.0 72.021129 0.0 \n", "Lactyl@Protein_N-term 0.0 72.021129 0.0 \n", "YnLactyl@K 0.0 239.126991 0.0 \n", "YnLactyl@Any_N-term 0.0 239.126991 0.0 \n", "YnLactyl@Protein_N-term 0.0 239.126991 0.0 \n", "\n", " modloss \n", "mod_name \n", "Acetyl@T 0.0 \n", "Acetyl@Protein_N-term 0.0 \n", "Acetyl@S 0.0 \n", "Acetyl@C 0.0 \n", "Acetyl@Any_N-term 0.0 \n", "... ... \n", "Lactyl@Any_N-term 0.0 \n", "Lactyl@Protein_N-term 0.0 \n", "YnLactyl@K 0.0 \n", "YnLactyl@Any_N-term 0.0 \n", "YnLactyl@Protein_N-term 0.0 \n", "\n", "[2852 rows x 13 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.modification import MOD_DF\n", "MOD_DF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Modification sites\n", "\n", "In AlphaBase, the modification sites 0 and -1 represent the N-term and C-term, respectively. Modifications on residues use sites 1 to n, where n is the peptide length."
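, "\n", "\n", "The example outputs in this tutorial suggest how a site $s$ maps to the 0-based position $i(s)$ in the per-residue modification mass array of a peptide with $n$ residues. This is a reading of those outputs rather than a statement of the implementation, and the C-term row is an assumption:\n", "\n", "$$\n", "i(s) = \\begin{cases} 0 & s = 0 \\ \\text{(N-term, added to the first residue)} \\\\\n", "s - 1 & 1 \\le s \\le n \\\\\n", "n - 1 & s = -1 \\ \\text{(C-term, assumed to be added to the last residue)} \\end{cases}\n", "$$\n", "\n", "For instance, with `mod_sites = [0, 3]` on a 7-residue peptide, the two modification masses land at positions 0 and 2 of the array."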
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:30.400073Z", "start_time": "2025-01-30T16:49:30.397484Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.686247Z", "iopub.status.busy": "2026-01-05T22:43:45.686182Z", "iopub.status.idle": "2026-01-05T22:43:45.688281Z", "shell.execute_reply": "2026-01-05T22:43:45.688102Z" } }, "outputs": [ { "data": { "text/plain": [ "array([42.01056468, 0. , 57.02146372, 0. , 0. ,\n", " 0. , 0. ])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.modification import calc_modification_mass\n", "\n", "# example: add two modifications and print the array of mass modifications\n", "sequence = 'MACDEFG'\n", "mod_names = ['Acetyl@Any_N-term', 'Carbamidomethyl@C']\n", "mod_sites = [0, 3] # 0 for N-term, 3 for the third amino acid\n", "calc_modification_mass(\n", " nAA=len(sequence),\n", " mod_names=mod_names,\n", " mod_sites=mod_sites\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:31.049003Z", "start_time": "2025-01-30T16:49:31.045754Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.689321Z", "iopub.status.busy": "2026-01-05T22:43:45.689260Z", "iopub.status.idle": "2026-01-05T22:43:45.691232Z", "shell.execute_reply": "2026-01-05T22:43:45.691050Z" } }, "outputs": [ { "data": { "text/plain": [ "array([58.0054793, 0. , 0. , 0. , 0. ,\n", " 0. , 0. 
])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# example: add two modifications and print the array of mass modifications\n", "sequence = 'MAKDEFG'\n", "mod_names = ['Acetyl@Any_N-term', 'Oxidation@M']\n", "mod_sites = [0, 1] # 0 for N-term, 1 for the first amino acid\n", "calc_modification_mass(\n", " nAA=len(sequence),\n", " mod_names=mod_names,\n", " mod_sites=mod_sites\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple modifications at a single site are supported. In the following example, `K3` carries both `GG@K` and `Dimethyl@K`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:32.133084Z", "start_time": "2025-01-30T16:49:32.130276Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.692247Z", "iopub.status.busy": "2026-01-05T22:43:45.692192Z", "iopub.status.idle": "2026-01-05T22:43:45.694110Z", "shell.execute_reply": "2026-01-05T22:43:45.693915Z" } }, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 0. , 142.07422757, 0. ,\n", " 0. , 0. , 0. 
])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequence = 'MAKDEFR'\n", "mod_names = ['GG@K', 'Dimethyl@K']\n", "mod_sites = [3, 3]\n", "calc_modification_mass(\n", " nAA=len(sequence),\n", " mod_names=mod_names,\n", " mod_sites=mod_sites\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculate modification masses in batch" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:33.217088Z", "start_time": "2025-01-30T16:49:33.213576Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.695235Z", "iopub.status.busy": "2026-01-05T22:43:45.695173Z", "iopub.status.idle": "2026-01-05T22:43:45.697144Z", "shell.execute_reply": "2026-01-05T22:43:45.696950Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 42.01056468, 0. , 57.02146372, 0. ,\n", " 0. , 0. , 0. ],\n", " [ 58.0054793 , 0. , 0. , 0. ,\n", " 0. , 0. , 0. ],\n", " [ 0. , 0. , 142.07422757, 0. ,\n", " 0. , 0. , 0. 
]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.modification import calc_mod_masses_for_same_len_seqs\n", "calc_mod_masses_for_same_len_seqs(\n", " nAA=7,\n", " mod_names_list=[\n", " ['Acetyl@Any_N-term', 'Carbamidomethyl@C'],\n", " ['Acetyl@Any_N-term', 'Oxidation@M'],\n", " ['GG@K', 'Dimethyl@K'],\n", " ],\n", " mod_sites_list=[\n", " [0, 3],\n", " [0, 1],\n", " [3, 3],\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mass calculation functionalities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculate AA and modification masses in batch" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:34.937876Z", "start_time": "2025-01-30T16:49:34.933525Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.698281Z", "iopub.status.busy": "2026-01-05T22:43:45.698218Z", "iopub.status.idle": "2026-01-05T22:43:45.700560Z", "shell.execute_reply": "2026-01-05T22:43:45.700307Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[173.05104977, 71.03711379, 160.03064868, 115.02694302,\n", " 129.04259309, 147.06841391, 57.02146372],\n", " [189.04596439, 71.03711379, 128.09496302, 115.02694302,\n", " 129.04259309, 147.06841391, 57.02146372],\n", " [131.04048509, 71.03711379, 270.16919059, 115.02694302,\n", " 129.04259309, 147.06841391, 156.10111102]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.aa import calc_AA_masses_for_same_len_seqs\n", "from alphabase.constants.modification import calc_mod_masses_for_same_len_seqs\n", "mod_masses = calc_mod_masses_for_same_len_seqs(\n", " nAA=7,\n", " mod_names_list=[\n", " ['Acetyl@Any_N-term', 'Carbamidomethyl@C'],\n", " ['Acetyl@Any_N-term', 'Oxidation@M'],\n", " ['GG@K', 'Dimethyl@K'],\n", " ],\n", " mod_sites_list=[\n", " [0, 3],\n", " [0, 1],\n", " [3, 3],\n", " 
]\n", ")\n", "aa_masses = calc_AA_masses_for_same_len_seqs(\n", " [\n", " 'MACDEFG', 'MAKDEFG', 'MAKDEFR'\n", " ]\n", ")\n", "mod_masses+aa_masses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use `np.cumsum` to get b-ion neutral masses" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:35.985899Z", "start_time": "2025-01-30T16:49:35.982829Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.701672Z", "iopub.status.busy": "2026-01-05T22:43:45.701615Z", "iopub.status.idle": "2026-01-05T22:43:45.703469Z", "shell.execute_reply": "2026-01-05T22:43:45.703288Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 173.05104977, 244.08816356, 404.11881224, 519.14575526,\n", " 648.18834835, 795.25676227, 852.27822599],\n", " [ 189.04596439, 260.08307818, 388.17804119, 503.20498422,\n", " 632.24757731, 779.31599122, 836.33745494],\n", " [ 131.04048509, 202.07759887, 472.24678946, 587.27373248,\n", " 716.31632557, 863.38473949, 1019.48585051]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.cumsum(aa_masses+mod_masses, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mass functionalities in `mass_calc`\n", "\n", "The functions for calculating peptide and fragment neutral masses are implemented in `alphabase.peptide.mass_calc`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:36.899617Z", "start_time": "2025-01-30T16:49:36.895322Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.704536Z", "iopub.status.busy": "2026-01-05T22:43:45.704480Z", "iopub.status.idle": "2026-01-05T22:43:45.706726Z", "shell.execute_reply": "2026-01-05T22:43:45.706543Z" } }, "outputs": [ { "data": { "text/plain": [ "array([ 870.28879067, 854.34801962, 1037.49641519])" ] }, "execution_count": 15, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "from alphabase.peptide.mass_calc import calc_peptide_masses_for_same_len_seqs\n", "\n", "peptide_masses = calc_peptide_masses_for_same_len_seqs(\n", " ['MACDEFG', 'MAKDEFG', 'MAKDEFR'],\n", " mod_list=[\n", " 'Acetyl@Any_N-term;Carbamidomethyl@C',\n", " 'Acetyl@Any_N-term;Oxidation@M',\n", " 'GG@K;Dimethyl@K',\n", " ],\n", ")\n", "peptide_masses" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:37.414885Z", "start_time": "2025-01-30T16:49:37.411633Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.707675Z", "iopub.status.busy": "2026-01-05T22:43:45.707622Z", "iopub.status.idle": "2026-01-05T22:43:45.709730Z", "shell.execute_reply": "2026-01-05T22:43:45.709543Z" } }, "outputs": [ { "data": { "text/plain": [ "array([ 870.28879067, 854.34801962, 1037.49641519])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.peptide.mass_calc import calc_b_y_and_peptide_masses_for_same_len_seqs\n", "b_masses, y_masses, peptide_masses = calc_b_y_and_peptide_masses_for_same_len_seqs(\n", " ['MACDEFG', 'MAKDEFG', 'MAKDEFR'],\n", " mod_list=[\n", " ['Acetyl@Any_N-term', 'Carbamidomethyl@C'],\n", " ['Acetyl@Any_N-term', 'Oxidation@M'],\n", " ['GG@K', 'Dimethyl@K'],\n", " ],\n", " site_list=[\n", " [0, 3],\n", " [0, 1],\n", " [3, 3],\n", " ],\n", ")\n", "peptide_masses" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:38.022288Z", "start_time": "2025-01-30T16:49:38.019932Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.710723Z", "iopub.status.busy": "2026-01-05T22:43:45.710668Z", "iopub.status.idle": "2026-01-05T22:43:45.712373Z", "shell.execute_reply": "2026-01-05T22:43:45.712188Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[173.05104977, 244.08816356, 404.11881224, 519.14575526,\n", " 648.18834835, 795.25676227],\n", " 
[189.04596439, 260.08307818, 388.17804119, 503.20498422,\n", " 632.24757731, 779.31599122],\n", " [131.04048509, 202.07759887, 472.24678946, 587.27373248,\n", " 716.31632557, 863.38473949]])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b_masses" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:38.389402Z", "start_time": "2025-01-30T16:49:38.387298Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.713386Z", "iopub.status.busy": "2026-01-05T22:43:45.713332Z", "iopub.status.idle": "2026-01-05T22:43:45.714922Z", "shell.execute_reply": "2026-01-05T22:43:45.714737Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[697.2377409 , 626.20062711, 466.16997843, 351.14303541,\n", " 222.10044232, 75.0320284 ],\n", " [665.30205523, 594.26494145, 466.16997843, 351.14303541,\n", " 222.10044232, 75.0320284 ],\n", " [906.45593011, 835.41881632, 565.24962574, 450.22268271,\n", " 321.18008962, 174.11167571]])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_masses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Isotope distribution\n", "\n", "`alphabase.constants.isotope.IsotopeDistribution` will calculate the isotope distribution and the mono-isotopic idx in the distribution for a given atom composition. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For an atom, the mono-isotopic idx (`mono_idx`) points to the highest abundance isotope, so the value is `round(mass of highest isotope - mass of first isotope)`." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:39.845574Z", "start_time": "2025-01-30T16:49:39.833715Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.715971Z", "iopub.status.busy": "2026-01-05T22:43:45.715901Z", "iopub.status.idle": "2026-01-05T22:43:45.721522Z", "shell.execute_reply": "2026-01-05T22:43:45.721334Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>abundance</th>\n", "      <th>mass</th>\n", "      <th>mono_idx</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>13C</th>\n", "      <td>[0.01, 0.99]</td>\n", "      <td>[12.0, 13.00335483507]</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>14N</th>\n", "      <td>[0.996337, 0.003663]</td>\n", "      <td>[14.00307400443, 15.00010889888]</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>15N</th>\n", "      <td>[0.01, 0.99]</td>\n", "      <td>[14.00307400443, 15.00010889888]</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>18O</th>\n", "      <td>[0.005, 0.005, 0.99]</td>\n", "      <td>[15.99491461957, 16.9991317565, 17.99915961286]</td>\n", "      <td>2</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2H</th>\n", "      <td>[0.01, 0.99]</td>\n", "      <td>[1.00782503223, 2.01410177812]</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>...</th>\n", "      <td>...</td>\n", "      <td>...</td>\n", "      <td>...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>Xe</th>\n", "      <td>[0.000952, 0.00089, 0.019102, 0.264006, 0.0407...</td>\n", "      <td>[123.905892, 125.9042983, 127.903531, 128.9047...</td>\n", "      <td>8</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>Y</th>\n", "      <td>[1.0]</td>\n", "      <td>[88.9058403]</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>Yb</th>\n", "      <td>[0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0....</td>\n", "      <td>[167.9338896, 169.9347664, 170.9363302, 171.93...</td>\n", "      <td>6</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>Zn</th>\n", "      <td>[0.4917, 0.2773, 0.0404, 0.1845, 0.0061]</td>\n", "      <td>[63.92914201, 65.92603381, 66.92712775, 67.924...</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>Zr</th>\n", "      <td>[0.5145, 0.1122, 0.1715, 0.1738, 0.028]</td>\n", "      <td>[89.9046977, 90.9056396, 91.9050347, 93.906310...</td>\n", "      <td>0</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "<p>109 rows × 3 columns</p>\n", "</div>
" ], "text/plain": [ " abundance \\\n", "13C [0.01, 0.99] \n", "14N [0.996337, 0.003663] \n", "15N [0.01, 0.99] \n", "18O [0.005, 0.005, 0.99] \n", "2H [0.01, 0.99] \n", ".. ... \n", "Xe [0.000952, 0.00089, 0.019102, 0.264006, 0.0407... \n", "Y [1.0] \n", "Yb [0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0.... \n", "Zn [0.4917, 0.2773, 0.0404, 0.1845, 0.0061] \n", "Zr [0.5145, 0.1122, 0.1715, 0.1738, 0.028] \n", "\n", " mass mono_idx \n", "13C [12.0, 13.00335483507] 1 \n", "14N [14.00307400443, 15.00010889888] 0 \n", "15N [14.00307400443, 15.00010889888] 1 \n", "18O [15.99491461957, 16.9991317565, 17.99915961286] 2 \n", "2H [1.00782503223, 2.01410177812] 1 \n", ".. ... ... \n", "Xe [123.905892, 125.9042983, 127.903531, 128.9047... 8 \n", "Y [88.9058403] 0 \n", "Yb [167.9338896, 169.9347664, 170.9363302, 171.93... 6 \n", "Zn [63.92914201, 65.92603381, 66.92712775, 67.924... 0 \n", "Zr [89.9046977, 90.9056396, 91.9050347, 93.906310... 0 \n", "\n", "[109 rows x 3 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from alphabase.constants.atom import CHEM_INFO_DICT\n", "atom_df = pd.DataFrame().from_dict(CHEM_INFO_DICT, orient='index')\n", "def get_mono(masses_abundances):\n", " masses, abundances = masses_abundances\n", " return round(masses[np.argmax(abundances)]-masses[0])\n", "atom_df['mono_idx'] = atom_df[['mass','abundance']].apply(\n", " get_mono, axis=1\n", ")\n", "atom_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`mono_idx` of an atom composition refers to the sum of the `mono_idx` of all atoms. In AlphaBase, `alphabase.constants.isotope.IsotopeDistribution` calculate both isotope abundance and `mono_idx`. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, `Fe`'s `mono_idx` is 2 (mass from 53.94 to 55.93), " ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:42.838219Z", "start_time": "2025-01-30T16:49:42.833554Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.722650Z", "iopub.status.busy": "2026-01-05T22:43:45.722593Z", "iopub.status.idle": "2026-01-05T22:43:45.724635Z", "shell.execute_reply": "2026-01-05T22:43:45.724439Z" } }, "outputs": [ { "data": { "text/plain": [ "abundance [0.05845, 0.91754, 0.02119, 0.00282]\n", "mass [53.93960899, 55.93493633, 56.93539284, 57.933...\n", "mono_idx 2\n", "Name: Fe, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "atom_df.loc['Fe']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `C(1)Fe(1)`'s `mono_idx` is also 2:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:43.824372Z", "start_time": "2025-01-30T16:49:43.261666Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:45.725796Z", "iopub.status.busy": "2026-01-05T22:43:45.725738Z", "iopub.status.idle": "2026-01-05T22:43:46.211927Z", "shell.execute_reply": "2026-01-05T22:43:46.211682Z" } }, "outputs": [ { "data": { "text/plain": [ "(array([5.78245850e-02, 6.25415000e-04, 9.07722322e-01, 3.07809450e-02,\n", " 3.01655900e-03, 3.01740000e-05, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00]),\n", " 2)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.isotope import IsotopeDistribution, parse_formula\n", "iso = IsotopeDistribution()\n", "iso.calc_formula_distribution(\n", " [('C',1),('Fe',1)]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `13C(1)Fe(1)`'s `mono_idx` should be 3:" ] }, { "cell_type": "code", 
"execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:43.893066Z", "start_time": "2025-01-30T16:49:43.890803Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:46.213218Z", "iopub.status.busy": "2026-01-05T22:43:46.213144Z", "iopub.status.idle": "2026-01-05T22:43:46.215077Z", "shell.execute_reply": "2026-01-05T22:43:46.214896Z" } }, "outputs": [ { "data": { "text/plain": [ "(array([5.845000e-04, 5.786550e-02, 9.175400e-03, 9.085765e-01,\n", " 2.100630e-02, 2.791800e-03, 0.000000e+00, 0.000000e+00,\n", " 0.000000e+00, 0.000000e+00]),\n", " 3)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iso.calc_formula_distribution(\n", " [('13C',1),('Fe',1)]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `mono_idx` of unlabeled atom compositions is always 0, no matter how big the compositions are. This means `mono` isotope is not necessary to be the `highest` isotope peak, especially when the composition get larger. Here are three examples from small composition to large ones, we can see that the highest peaks move from 0 to 2." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:44.477521Z", "start_time": "2025-01-30T16:49:44.466197Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:46.216208Z", "iopub.status.busy": "2026-01-05T22:43:46.216146Z", "iopub.status.idle": "2026-01-05T22:43:46.227408Z", "shell.execute_reply": "2026-01-05T22:43:46.227215Z" } }, "outputs": [ { "data": { "text/plain": [ "('mono=0, highest=0',\n", " array([5.53058051e-01, 3.06480210e-01, 1.06031073e-01, 2.73885413e-02,\n", " 5.79597328e-03, 1.05055134e-03, 1.67897345e-04, 2.41173838e-05,\n", " 3.15729577e-06, 3.80635657e-07]))" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.constants.isotope import IsotopeDistribution, parse_formula\n", "iso = IsotopeDistribution()\n", "\n", "formula = 'C(50)H(50)O(20)Na(1)'\n", "formula = parse_formula(formula)\n", "dist, mono = iso.calc_formula_distribution(formula)\n", "f\"mono={mono}, highest={dist.argmax()}\", dist" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:45.165377Z", "start_time": "2025-01-30T16:49:45.162386Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:46.228519Z", "iopub.status.busy": "2026-01-05T22:43:46.228449Z", "iopub.status.idle": "2026-01-05T22:43:46.230468Z", "shell.execute_reply": "2026-01-05T22:43:46.230278Z" } }, "outputs": [ { "data": { "text/plain": [ "('mono=0, highest=1',\n", " array([3.21124792e-01, 3.53459703e-01, 2.05844502e-01, 8.38383715e-02,\n", " 2.66913129e-02, 7.04911613e-03, 1.60206285e-03, 3.21190201e-04,\n", " 5.78218885e-05, 9.47198919e-06]))" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula = 'C(100)H(100)O(20)Na(1)'\n", "formula = parse_formula(formula)\n", "dist, mono = iso.calc_formula_distribution(formula)\n", "f\"mono={mono}, highest={dist.argmax()}\", dist" ] }, { 
"cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2025-01-30T16:49:45.672851Z", "start_time": "2025-01-30T16:49:45.670264Z" }, "execution": { "iopub.execute_input": "2026-01-05T22:43:46.231464Z", "iopub.status.busy": "2026-01-05T22:43:46.231406Z", "iopub.status.idle": "2026-01-05T22:43:46.233227Z", "shell.execute_reply": "2026-01-05T22:43:46.233050Z" } }, "outputs": [ { "data": { "text/plain": [ "('mono=0, highest=2',\n", " array([0.10312113, 0.22700935, 0.25713731, 0.19936063, 0.11878142,\n", " 0.05791123, 0.02402947, 0.00871637, 0.00281814, 0.00082412]))" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula = 'C(200)H(200)O(40)Na(1)'\n", "formula = parse_formula(formula)\n", "dist, mono = iso.calc_formula_distribution(formula)\n", "f\"mono={mono}, highest={dist.argmax()}\", dist" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.3 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" }, "vscode": { "interpreter": { "hash": "8a3b27e141e49c996c9b863f8707e97aabd49c4a7e8445b9b783b34e4a21a9b2" } } }, "nbformat": 4, "nbformat_minor": 2 }