Tutorial for Dev: Basic Definitions#

This notebook introduces the low-level functionalities of AlphaBase to developers.

[1]:
%reload_ext autoreload
%autoreload 2

Atoms/Elements#

The masses of all amino acids and modifications are calculated from their atom compositions.

The atom information is defined in https://github.com/MannLabs/alphabase/blob/main/alphabase/constants/const_files/nist_element.yaml, which is parsed from NIST data; see https://github.com/MannLabs/alphabase/blob/main/nbs/nist_chem_to_yaml.ipynb.

After adding some heavy isotopes, including 13C, 15N, 2H, and 18O, we obtain 109 kinds of atoms:

[2]:
import pandas as pd
from alphabase.constants.element import CHEM_INFO_DICT
pd.DataFrame.from_dict(CHEM_INFO_DICT, orient='index')
[2]:
abundance mass
13C [0.01, 0.99] [12.0, 13.00335483507]
14N [0.996337, 0.003663] [14.00307400443, 15.00010889888]
15N [0.01, 0.99] [14.00307400443, 15.00010889888]
18O [0.005, 0.005, 0.99] [15.99491461957, 16.9991317565, 17.99915961286]
2H [0.01, 0.99] [1.00782503223, 2.01410177812]
... ... ...
Xe [0.000952, 0.00089, 0.019102, 0.264006, 0.0407... [123.905892, 125.9042983, 127.903531, 128.9047...
Y [1.0] [88.9058403]
Yb [0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0.... [167.9338896, 169.9347664, 170.9363302, 171.93...
Zn [0.4917, 0.2773, 0.0404, 0.1845, 0.0061] [63.92914201, 65.92603381, 66.92712775, 67.924...
Zr [0.5145, 0.1122, 0.1715, 0.1738, 0.028] [89.9046977, 90.9056396, 91.9050347, 93.906310...

109 rows × 2 columns

Their mono-isotopic masses are stored in the dict CHEM_MONO_MASS:

[3]:
from alphabase.constants.element import CHEM_MONO_MASS
pd.DataFrame.from_dict(CHEM_MONO_MASS, orient='index')
[3]:
0
13C 13.003355
14N 14.003074
15N 15.000109
18O 17.999160
2H 2.014102
... ...
Xe 131.904155
Y 88.905840
Yb 173.938866
Zn 63.929142
Zr 89.904698

109 rows × 1 columns

These atom masses are used to calculate the masses of amino acids, modifications, and then subsequent masses of peptides and fragments.
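As a rough illustration of how a residue mass follows from its atom composition, the sketch below sums mono-isotopic atom masses over a formula string in the C(2)H(3)N(1)O(1) style used by AlphaBase. The helper names `parse_formula_sketch` and `formula_mass` are hypothetical and this is not AlphaBase's own parser. The C, H, N and O masses are the lightest-isotope entries of the mass lists shown above; the sulfur mass is the NIST value for 32S and is an assumption in the sense that it does not appear in the truncated output.

```python
import re

# Mono-isotopic masses. C, H, N, O are taken from the mass lists shown above;
# S is the NIST mass of 32S (assumed here, not visible in the truncated output).
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
    "S": 31.9720711744,
}


def parse_formula_sketch(formula):
    """Split 'C(2)H(3)N(1)O(1)' into [('C', 2), ('H', 3), ('N', 1), ('O', 1)].

    Hypothetical helper for illustration only. An optional leading number
    allows heavy-isotope names such as '13C'.
    """
    pairs = re.findall(r"(\d*[A-Z][a-z]*)\((\d+)\)", formula)
    return [(element, int(count)) for element, count in pairs]


def formula_mass(formula):
    """Sum mono-isotopic atom masses over all (element, count) pairs."""
    return sum(MONO_MASS[element] * count
               for element, count in parse_formula_sketch(formula))


glycine = formula_mass("C(2)H(3)N(1)O(1)S(0)")   # residue G
cysteine = formula_mass("C(3)H(5)N(1)O(1)S(1)")  # residue C
print(round(glycine, 8), round(cysteine, 8))
```

The two results agree with the masses that AA_DF lists for G and C to the displayed precision.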

Commonly used molecular masses#

[4]:
from alphabase.constants.element import (
    MASS_PROTON, MASS_ISOTOPE, MASS_NH3, MASS_H2O
)
MASS_PROTON, MASS_ISOTOPE, MASS_NH3, MASS_H2O
[4]:
(1.007276467, 1.0033, 17.02654910112, 18.01056468403)
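MASS_H2O and MASS_NH3 can be cross-checked against the atom masses listed earlier. The short sketch below recomputes both from the mono-isotopic masses of H, N and O; the variable names are illustrative only.

```python
# Mono-isotopic atom masses, as listed in the element table earlier.
H = 1.00782503223
N = 14.00307400443
O = 15.99491461957

mass_h2o = 2 * H + O   # water: H(2)O(1)
mass_nh3 = N + 3 * H   # ammonia: N(1)H(3)

print(round(mass_h2o, 8), round(mass_nh3, 8))
```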

Amino Acids#

[5]:
from alphabase.constants.aa import AA_DF
AA_DF.loc[ord('A'):ord('Z')]
[5]:
aa formula mass
65 A C(3)H(5)N(1)O(1)S(0) 7.103711e+01
66 B C(1000000) 1.200000e+07
67 C C(3)H(5)N(1)O(1)S(1) 1.030092e+02
68 D C(4)H(5)N(1)O(3)S(0) 1.150269e+02
69 E C(5)H(7)N(1)O(3)S(0) 1.290426e+02
70 F C(9)H(9)N(1)O(1)S(0) 1.470684e+02
71 G C(2)H(3)N(1)O(1)S(0) 5.702146e+01
72 H C(6)H(7)N(3)O(1)S(0) 1.370589e+02
73 I C(6)H(11)N(1)O(1)S(0) 1.130841e+02
74 J C(6)H(11)N(1)O(1)S(0) 1.130841e+02
75 K C(6)H(12)N(2)O(1)S(0) 1.280950e+02
76 L C(6)H(11)N(1)O(1)S(0) 1.130841e+02
77 M C(5)H(9)N(1)O(1)S(1) 1.310405e+02
78 N C(4)H(6)N(2)O(2)S(0) 1.140429e+02
79 O C(12)H(19)N(3)O(2) 2.371477e+02
80 P C(5)H(7)N(1)O(1)S(0) 9.705276e+01
81 Q C(5)H(8)N(2)O(2)S(0) 1.280586e+02
82 R C(6)H(12)N(4)O(1)S(0) 1.561011e+02
83 S C(3)H(5)N(1)O(2)S(0) 8.703203e+01
84 T C(4)H(7)N(1)O(2)S(0) 1.010477e+02
85 U C(3)H(5)N(1)O(1)Se(1) 1.509536e+02
86 V C(5)H(9)N(1)O(1)S(0) 9.906841e+01
87 W C(11)H(10)N(2)O(1)S(0) 1.860793e+02
88 X C(1000000) 1.200000e+07
89 Y C(9)H(9)N(1)O(2)S(0) 1.630633e+02
90 Z C(1000000) 1.200000e+07

From AA_DF, we can see that amino acids are indexed by their ASCII codes (128 characters): 65 == ord('A'), …, 90 == ord('Z'). Unicode strings can be converted quickly to ASCII int32 values using NumPy:

[6]:
import numpy as np

np.array(['ABCXYZ']).view(np.int32)
[6]:
array([65, 66, 67, 88, 89, 90], dtype=int32)

Users do not need to know this, however, because AlphaBase provides easy-to-use functions for getting residue masses from sequences.

Calculate AA masses in batch#

[7]:
from alphabase.constants.aa import calc_AA_masses_for_same_len_seqs
calc_AA_masses_for_same_len_seqs(
    [
        'MACDEFG', 'MAKDEFG', 'MAKDEFR'
    ]
)
[7]:
array([[131.04048509,  71.03711379, 103.00918496, 115.02694302,
        129.04259309, 147.06841391,  57.02146372],
       [131.04048509,  71.03711379, 128.09496302, 115.02694302,
        129.04259309, 147.06841391,  57.02146372],
       [131.04048509,  71.03711379, 128.09496302, 115.02694302,
        129.04259309, 147.06841391, 156.10111102]])
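The reason for indexing amino acids by ASCII code is that a residue mass becomes a plain table lookup. The following is a minimal pure-Python sketch of that idea, not AlphaBase's implementation: `ASCII_MASS_TABLE` and `residue_masses` are hypothetical names, and the masses are copied from the output above.

```python
# Residue masses copied from the output above.
RESIDUE_MASS = {
    "M": 131.04048509, "A": 71.03711379, "C": 103.00918496,
    "D": 115.02694302, "E": 129.04259309, "F": 147.06841391,
    "G": 57.02146372, "K": 128.09496302, "R": 156.10111102,
}

# One slot per ASCII code, so a mass is found by indexing with ord(residue).
ASCII_MASS_TABLE = [0.0] * 128
for aa, mass in RESIDUE_MASS.items():
    ASCII_MASS_TABLE[ord(aa)] = mass


def residue_masses(sequence):
    """Return the residue mass of every amino acid in `sequence`."""
    # Iterating over bytes yields integers, i.e. the ASCII codes.
    return [ASCII_MASS_TABLE[code] for code in sequence.encode("ascii")]


codes = list("ABCXYZ".encode("ascii"))
masses = [residue_masses(seq) for seq in ("MACDEFG", "MAKDEFG", "MAKDEFR")]
print(codes)
print(masses[0])
```

With NumPy, the same lookup can be done for many sequences at once by indexing a mass array with an array of codes, which is what makes the batch functions efficient for sequences of equal length.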

Modifications#

In AlphaBase, we use mod_name@aa to represent a modification, where mod_name comes from UniMod. We also use mod_name@Protein N-term, mod_name@Any N-term and mod_name@Any C-term for terminal modifications, which follow the UniMod terminal naming scheme.

The default modification TSV is stored in alphabase/constants/const_files/modification.tsv. Users can add more modifications to the TSV file (only the mod_name and composition columns are required). Please see https://github.com/MannLabs/alphabase/blob/main/alphabase/constants/const_files/modification.tsv.
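To make the naming convention concrete, here is a small sketch that splits a modification name into its UniMod name and its site. Both helpers are hypothetical and only illustrate the string format; AlphaBase itself looks modifications up by their full name.

```python
# Terminal site labels named in the text above.
TERMINAL_SITES = {"Protein N-term", "Any N-term", "Any C-term"}


def split_mod_name(mod):
    """Split 'mod_name@site' at the last '@' into (mod_name, site)."""
    name, _, site = mod.rpartition("@")
    return name, site


def is_terminal(mod):
    """True if the site part is one of the terminal labels."""
    return split_mod_name(mod)[1] in TERMINAL_SITES


print(split_mod_name("Acetyl@Protein N-term"))
print(split_mod_name("Carbamidomethyl@C"), is_terminal("Carbamidomethyl@C"))
```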

[8]:
from alphabase.constants.modification import MOD_DF
MOD_DF
[8]:
mod_name unimod_mass unimod_avge_mass composition unimod_modloss modloss_composition classification unimod_id modloss_importance mass modloss_original modloss
mod_name
Acetyl@T Acetyl@T 42.010565 42.0367 H(2)C(2)O(1) 0.0 Post-translational 1 0.0 42.010565 0.0 0.0
Acetyl@Protein N-term Acetyl@Protein N-term 42.010565 42.0367 H(2)C(2)O(1) 0.0 Post-translational 1 0.0 42.010565 0.0 0.0
Acetyl@S Acetyl@S 42.010565 42.0367 H(2)C(2)O(1) 0.0 Post-translational 1 0.0 42.010565 0.0 0.0
Acetyl@C Acetyl@C 42.010565 42.0367 H(2)C(2)O(1) 0.0 Post-translational 1 0.0 42.010565 0.0 0.0
Acetyl@Any N-term Acetyl@Any N-term 42.010565 42.0367 H(2)C(2)O(1) 0.0 Multiple 1 0.0 42.010565 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
TMTpro_zero@K TMTpro_zero@K 295.189592 295.3773 H(25)C(15)N(3)O(3) 0.0 Chemical derivative 2017 0.0 295.189592 0.0 0.0
TMTpro_zero@T TMTpro_zero@T 295.189592 295.3773 H(25)C(15)N(3)O(3) 0.0 Chemical derivative 2017 0.0 295.189592 0.0 0.0
Andro-H2O@C Andro-H2O@C 332.198760 332.4339 H(28)C(20)O(4) 0.0 Chemical derivative 2025 0.0 332.198759 0.0 0.0
His+O(2)@H His+O(2)@H 169.048741 169.1381 H(7)C(6)N(3)O(3) 0.0 Post-translational 2027 0.0 169.048741 0.0 0.0
GlyGly@K GlyGly@K 114.042927 114.1026 H(6)C(4)N(2)O(2) 0.0 Post-translational 121 1000000.0 114.042927 0.0 0.0

2685 rows × 12 columns

Modification sites#

In AlphaBase, we use 0 and -1 to represent the modification sites of the N-term and C-term, respectively. For other modification sites, we use 1 to n, where n is the number of amino acids in the peptide.

[9]:
from alphabase.constants.modification import calc_modification_mass
sequence = 'MACDEFG'
mod_names = ['Acetyl@Any N-term', 'Carbamidomethyl@C']
mod_sites = [0,3]
calc_modification_mass(
    nAA=len(sequence),
    mod_names=mod_names,
    mod_sites=mod_sites
)
[9]:
array([42.01056468,  0.        , 57.02146372,  0.        ,  0.        ,
        0.        ,  0.        ])

When both the N-term (site 0) and the first amino acid (site 1) are modified, their modification masses are summed into the first position:

[10]:
sequence = 'MAKDEFG'
mod_names = ['Acetyl@Any N-term', 'Oxidation@M']
mod_sites = [0,1]
calc_modification_mass(
    nAA=len(sequence),
    mod_names=mod_names,
    mod_sites=mod_sites
)
[10]:
array([58.0054793,  0.       ,  0.       ,  0.       ,  0.       ,
        0.       ,  0.       ])

Multiple modifications at a single site are supported. In the following example, K3 carries both GlyGly@K and Dimethyl@K:

[11]:
sequence = 'MAKDEFR'
mod_names = ['GlyGly@K', 'Dimethyl@K']
mod_sites = [3,3]
calc_modification_mass(
    nAA=len(sequence),
    mod_names=mod_names,
    mod_sites=mod_sites
)
[11]:
array([  0.        ,   0.        , 142.07422757,   0.        ,
         0.        ,   0.        ,   0.        ])
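The three outputs above are consistent with a simple placement rule: site 0 and site 1 both map to the first array position, site i maps to position i-1, and masses landing on the same position are added. The sketch below implements that rule in plain Python. It is an illustration under assumptions, not the AlphaBase source: `place_mod_masses` is a hypothetical name, the mapping of site -1 to the last position is inferred from the site convention rather than shown above, and the Oxidation, GlyGly and Dimethyl masses are computed from their compositions (O(1), H(6)C(4)N(2)O(2), H(4)C(2)).

```python
# Mono-isotopic modification masses. Acetyl and Carbamidomethyl are read off
# the first output above; the others are computed from their compositions.
MOD_MASS = {
    "Acetyl@Any N-term": 42.01056468,
    "Carbamidomethyl@C": 57.02146372,
    "Oxidation@M": 15.99491462,
    "GlyGly@K": 114.04292744,
    "Dimethyl@K": 28.03130013,
}


def place_mod_masses(n_aa, mod_names, mod_sites):
    """Return a per-residue list of summed modification masses."""
    masses = [0.0] * n_aa
    for name, site in zip(mod_names, mod_sites):
        if site == 0:        # N-term: counted on the first residue
            idx = 0
        elif site == -1:     # C-term: assumed to count on the last residue
            idx = n_aa - 1
        else:                # 1-based residue site
            idx = site - 1
        masses[idx] += MOD_MASS[name]
    return masses


ex1 = place_mod_masses(7, ["Acetyl@Any N-term", "Carbamidomethyl@C"], [0, 3])
ex2 = place_mod_masses(7, ["Acetyl@Any N-term", "Oxidation@M"], [0, 1])
ex3 = place_mod_masses(7, ["GlyGly@K", "Dimethyl@K"], [3, 3])
print(ex1, ex2, ex3, sep="\n")
```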

Calculate modification masses in batch#

[12]:
from alphabase.constants.modification import calc_mod_masses_for_same_len_seqs
calc_mod_masses_for_same_len_seqs(
    nAA=7,
    mod_names_list=[
        ['Acetyl@Any N-term', 'Carbamidomethyl@C'],
        ['Acetyl@Any N-term', 'Oxidation@M'],
        ['GlyGly@K', 'Dimethyl@K'],
    ],
    mod_sites_list=[
        [0, 3],
        [0, 1],
        [3, 3],
    ]
)
[12]:
array([[ 42.01056468,   0.        ,  57.02146372,   0.        ,
          0.        ,   0.        ,   0.        ],
       [ 58.0054793 ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        , 142.07422757,   0.        ,
          0.        ,   0.        ,   0.        ]])

Mass calculation functionalities#

Calculate AA and modification masses in batch#

[13]:
from alphabase.constants.aa import calc_AA_masses_for_same_len_seqs
from alphabase.constants.modification import calc_mod_masses_for_same_len_seqs
mod_masses = calc_mod_masses_for_same_len_seqs(
    nAA=7,
    mod_names_list=[
        ['Acetyl@Any N-term', 'Carbamidomethyl@C'],
        ['Acetyl@Any N-term', 'Oxidation@M'],
        ['GlyGly@K', 'Dimethyl@K'],
    ],
    mod_sites_list=[
        [0, 3],
        [0, 1],
        [3, 3],
    ]
)
aa_masses = calc_AA_masses_for_same_len_seqs(
    [
        'MACDEFG', 'MAKDEFG', 'MAKDEFR'
    ]
)
mod_masses+aa_masses
[13]:
array([[173.05104977,  71.03711379, 160.03064868, 115.02694302,
        129.04259309, 147.06841391,  57.02146372],
       [189.04596439,  71.03711379, 128.09496302, 115.02694302,
        129.04259309, 147.06841391,  57.02146372],
       [131.04048509,  71.03711379, 270.16919059, 115.02694302,
        129.04259309, 147.06841391, 156.10111102]])

np.cumsum to get b-ion neutral masses#

[14]:
import numpy as np
np.cumsum(aa_masses+mod_masses, axis=1)
[14]:
array([[ 173.05104977,  244.08816356,  404.11881224,  519.14575526,
         648.18834835,  795.25676227,  852.27822599],
       [ 189.04596439,  260.08307818,  388.17804119,  503.20498422,
         632.24757731,  779.31599122,  836.33745494],
       [ 131.04048509,  202.07759887,  472.24678946,  587.27373248,
         716.31632557,  863.38473949, 1019.48585051]])

Mass functionalities in ‘mass_calc’#

The functionalities for peptide and fragment neutral masses have been implemented in alphabase.peptide.mass_calc:

[15]:
from alphabase.peptide.mass_calc import calc_peptide_masses_for_same_len_seqs

peptide_masses = calc_peptide_masses_for_same_len_seqs(
    ['MACDEFG', 'MAKDEFG', 'MAKDEFR'],
    mod_list=[
        'Acetyl@Any N-term;Carbamidomethyl@C',
        'Acetyl@Any N-term;Oxidation@M',
        'GlyGly@K;Dimethyl@K',
    ],
)
peptide_masses
[15]:
array([ 870.28879067,  854.34801962, 1037.49641519])
[16]:
from alphabase.peptide.mass_calc import calc_b_y_and_peptide_masses_for_same_len_seqs
b_masses, y_masses, peptide_masses = calc_b_y_and_peptide_masses_for_same_len_seqs(
    ['MACDEFG', 'MAKDEFG', 'MAKDEFR'],
    mod_list=[
        ['Acetyl@Any N-term', 'Carbamidomethyl@C'],
        ['Acetyl@Any N-term', 'Oxidation@M'],
        ['GlyGly@K', 'Dimethyl@K'],
    ],
    site_list=[
        [0, 3],
        [0, 1],
        [3, 3],
    ],
)
peptide_masses
[16]:
array([ 870.28879067,  854.34801962, 1037.49641519])
[17]:
b_masses
[17]:
array([[173.05104977, 244.08816356, 404.11881224, 519.14575526,
        648.18834835, 795.25676227],
       [189.04596439, 260.08307818, 388.17804119, 503.20498422,
        632.24757731, 779.31599122],
       [131.04048509, 202.07759887, 472.24678946, 587.27373248,
        716.31632557, 863.38473949]])
[18]:
y_masses
[18]:
array([[697.2377409 , 626.20062711, 466.16997843, 351.14303541,
        222.10044232,  75.0320284 ],
       [665.30205523, 594.26494145, 466.16997843, 351.14303541,
        222.10044232,  75.0320284 ],
       [906.45593011, 835.41881632, 565.24962574, 450.22268271,
        321.18008962, 174.11167571]])
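The numbers above fit together in a simple way, shown here for the first peptide only. The b-ion neutral masses are the running sums of the residue-plus-modification masses (excluding the full-length sum), the peptide mass is the full sum plus one water, and each y-ion mass equals the peptide mass minus the matching b-ion mass. This sketch restates the relations visible in the outputs and is not the code inside mass_calc.

```python
from itertools import accumulate

MASS_H2O = 18.01056468403  # value shown for MASS_H2O earlier

# Residue + modification masses of the first peptide
# (MACDEFG with Acetyl@Any N-term and Carbamidomethyl@C), from the output of mod_masses+aa_masses.
residue_plus_mod = [
    173.05104977, 71.03711379, 160.03064868, 115.02694302,
    129.04259309, 147.06841391, 57.02146372,
]

prefix_sums = list(accumulate(residue_plus_mod))

peptide_mass = prefix_sums[-1] + MASS_H2O           # all residues + water
b_masses = prefix_sums[:-1]                         # b1 .. b(n-1)
y_masses = [peptide_mass - b for b in b_masses]     # complementary y-ions

print(round(peptide_mass, 6))
print([round(m, 6) for m in b_masses])
print([round(m, 6) for m in y_masses])
```

A peptide of n residues yields n-1 b/y pairs, which is why b_masses and y_masses have six columns for these seven-residue peptides.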

Isotope distribution#

alphabase.constants.isotope.IsotopeDistribution calculates the isotope distribution of a given atom composition, together with the index of the mono-isotopic peak (mono_idx) in that distribution.

What is the mono-isotopic index (mono_idx)? For a single atom, mono_idx points to the most abundant isotope, so its value is round(mass of the most abundant isotope - mass of the lightest isotope).

[19]:
import pandas as pd
from alphabase.constants.element import CHEM_INFO_DICT
atom_df = pd.DataFrame.from_dict(CHEM_INFO_DICT, orient='index')
def get_mono(masses_abundances):
    masses, abundances = masses_abundances
    return round(masses[np.argmax(abundances)]-masses[0])
atom_df['mono_idx'] = atom_df[['mass','abundance']].apply(
    get_mono, axis=1
)
atom_df
[19]:
abundance mass mono_idx
13C [0.01, 0.99] [12.0, 13.00335483507] 1
14N [0.996337, 0.003663] [14.00307400443, 15.00010889888] 0
15N [0.01, 0.99] [14.00307400443, 15.00010889888] 1
18O [0.005, 0.005, 0.99] [15.99491461957, 16.9991317565, 17.99915961286] 2
2H [0.01, 0.99] [1.00782503223, 2.01410177812] 1
... ... ... ...
Xe [0.000952, 0.00089, 0.019102, 0.264006, 0.0407... [123.905892, 125.9042983, 127.903531, 128.9047... 8
Y [1.0] [88.9058403] 0
Yb [0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0.... [167.9338896, 169.9347664, 170.9363302, 171.93... 6
Zn [0.4917, 0.2773, 0.0404, 0.1845, 0.0061] [63.92914201, 65.92603381, 66.92712775, 67.924... 0
Zr [0.5145, 0.1122, 0.1715, 0.1738, 0.028] [89.9046977, 90.9056396, 91.9050347, 93.906310... 0

109 rows × 3 columns

The mono_idx of an atom composition is the sum of the mono_idx values of all its atoms. In AlphaBase, alphabase.constants.isotope.IsotopeDistribution calculates both the isotope abundances and mono_idx.

For example, Fe’s mono_idx is 2:

[20]:
atom_df.loc['Fe']
[20]:
abundance                 [0.05845, 0.91754, 0.02119, 0.00282]
mass         [53.93960899, 55.93493633, 56.93539284, 57.933...
mono_idx                                                     2
Name: Fe, dtype: object

So C(1)Fe(1)’s mono_idx is also 2:

[21]:
from alphabase.constants.isotope import IsotopeDistribution, parse_formula
iso = IsotopeDistribution()
iso.calc_formula_distribution(
    [('C',1),('Fe',1)]
)
[21]:
(array([5.78245850e-02, 6.25415000e-04, 9.07722322e-01, 3.07809450e-02,
        3.01655900e-03, 3.01740000e-05, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00]),
 2)

But 13C(1)Fe(1)’s mono_idx should be 3:

[22]:
iso.calc_formula_distribution(
    [('13C',1),('Fe',1)]
)
[22]:
(array([5.845000e-04, 5.786550e-02, 9.175400e-03, 9.085765e-01,
        2.100630e-02, 2.791800e-03, 0.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00]),
 3)
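Both results above can be reproduced by convolving per-atom abundance vectors laid out on an integer grid of nominal-mass offsets, with mono_idx obtained by adding up the per-atom indices. The sketch below is a plain-Python illustration of that idea and not the IsotopeDistribution implementation. The Fe and 13C abundances come from the tables above; the natural carbon abundances (0.9893, 0.0107) are the NIST values and are assumed here because they are not visible in the truncated output.

```python
def convolve(a, b):
    """Discrete convolution of two abundance vectors."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def atom_mono_idx(dist):
    """Grid position of the most abundant isotope of a single atom."""
    return dist.index(max(dist))


# Abundances indexed by nominal-mass offset from the lightest isotope.
C = [0.9893, 0.0107]                             # 12C, 13C (NIST, assumed)
C13 = [0.01, 0.99]                               # heavy-labelled carbon
FE = [0.05845, 0.0, 0.91754, 0.02119, 0.00282]   # 54Fe, (no 55Fe), 56, 57, 58

dist_c_fe = convolve(C, FE)
dist_13c_fe = convolve(C13, FE)

mono_c_fe = atom_mono_idx(C) + atom_mono_idx(FE)       # 0 + 2
mono_13c_fe = atom_mono_idx(C13) + atom_mono_idx(FE)   # 1 + 2

print(mono_c_fe, [round(p, 9) for p in dist_c_fe])
print(mono_13c_fe, [round(p, 9) for p in dist_13c_fe])
```

The gap at offset 1 in the Fe vector is what places 56Fe at index 2. For a formula with many atoms, the same convolution is applied repeatedly and the result is truncated to a fixed number of peaks, which is consistent with the length-10 arrays returned above.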

The mono_idx of most atom compositions is 0, no matter how large the composition is.

[23]:
from alphabase.constants.isotope import IsotopeDistribution, parse_formula
iso = IsotopeDistribution()

formula = 'C(100)H(100)O(50)Na(1)'
formula = parse_formula(formula)
formula
[23]:
[('C', 100), ('H', 100), ('O', 50), ('Na', 1)]

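Note that the mono-isotopic peak is not necessarily the most abundant peak. In the result below, mono_idx is 0 while the most abundant peak is at index 1: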

[24]:
dist, mono = iso.calc_formula_distribution(formula)
mono, dist.argmax(), dist
[24]:
(0,
 1,
 array([2.98521241e-01, 3.31991573e-01, 2.13532938e-01, 1.00604878e-01,
        3.82856126e-02, 1.23872292e-02, 3.51773755e-03, 8.95830236e-04,
        2.07763024e-04, 4.43944472e-05]))

All these low-level functionalities have been integrated into DataFrame functionalities; see tutorial_dev_dataframes.ipynb or Tutorial for Dev: Peptide and Fragment DataFrames.