Tutorial for Dev: Basic Definitions#
This notebook introduces the low-level functionality of AlphaBase for developers.
[1]:
%reload_ext autoreload
%autoreload 2
Atoms/Elements#
The masses of all amino acids and modifications are calculated from their atom compositions.
The atom information is defined in https://github.com/MannLabs/alphabase/blob/main/alphabase/constants/const_files/nist_element.yaml, which is parsed from NIST data; see https://github.com/MannLabs/alphabase/blob/main/nbs/nist_chem_to_yaml.ipynb.
After adding some heavy isotopes, including 13C, 15N, 2H, and 18O, we obtain 109 kinds of atoms:
[2]:
import pandas as pd
from alphabase.constants.element import CHEM_INFO_DICT
pd.DataFrame.from_dict(CHEM_INFO_DICT, orient='index')
[2]:
 | abundance | mass |
---|---|---|
13C | [0.01, 0.99] | [12.0, 13.00335483507] |
14N | [0.996337, 0.003663] | [14.00307400443, 15.00010889888] |
15N | [0.01, 0.99] | [14.00307400443, 15.00010889888] |
18O | [0.005, 0.005, 0.99] | [15.99491461957, 16.9991317565, 17.99915961286] |
2H | [0.01, 0.99] | [1.00782503223, 2.01410177812] |
... | ... | ... |
Xe | [0.000952, 0.00089, 0.019102, 0.264006, 0.0407... | [123.905892, 125.9042983, 127.903531, 128.9047... |
Y | [1.0] | [88.9058403] |
Yb | [0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0.... | [167.9338896, 169.9347664, 170.9363302, 171.93... |
Zn | [0.4917, 0.2773, 0.0404, 0.1845, 0.0061] | [63.92914201, 65.92603381, 66.92712775, 67.924... |
Zr | [0.5145, 0.1122, 0.1715, 0.1738, 0.028] | [89.9046977, 90.9056396, 91.9050347, 93.906310... |
109 rows × 2 columns
Their monoisotopic masses are stored in CHEM_MONO_MASS (a dict):
[3]:
from alphabase.constants.element import CHEM_MONO_MASS
pd.DataFrame.from_dict(CHEM_MONO_MASS, orient='index')
[3]:
 | 0 |
---|---|
13C | 13.003355 |
14N | 14.003074 |
15N | 15.000109 |
18O | 17.999160 |
2H | 2.014102 |
... | ... |
Xe | 131.904155 |
Y | 88.905840 |
Yb | 173.938866 |
Zn | 63.929142 |
Zr | 89.904698 |
109 rows × 1 columns
These atom masses are used to calculate the masses of amino acids, modifications, and then subsequent masses of peptides and fragments.
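To make the link between atom masses and residue masses concrete, here is a minimal sketch that sums monoisotopic masses over a formula string in the `C(3)H(5)N(1)O(1)S(0)` style. The helper `formula_mass` and the small `MONO_MASS` dict are illustrative names, not part of AlphaBase; the masses are the NIST values for the most abundant isotopes.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes (NIST values).
# This small dict is a stand-in for CHEM_MONO_MASS, for illustration only.
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
    "S": 31.9720711744,
}

def formula_mass(formula):
    """Sum monoisotopic masses for a formula such as 'C(3)H(5)N(1)O(1)S(0)'."""
    total = 0.0
    # Each term looks like Element(count); the element may start with digits (e.g. 13C).
    for element, count in re.findall(r"([A-Za-z0-9]+)\((\d+)\)", formula):
        total += MONO_MASS[element] * int(count)
    return total

# Alanine residue: 3*C + 5*H + 1*N + 1*O
alanine = formula_mass("C(3)H(5)N(1)O(1)S(0)")
print(round(alanine, 6))  # 71.037114
```

The result agrees with the mass listed for `A` in the amino acid table further down.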
Commonly used molecular masses#
[4]:
from alphabase.constants.element import (
    MASS_PROTON, MASS_ISOTOPE, MASS_NH3, MASS_H2O
)
MASS_PROTON, MASS_ISOTOPE, MASS_NH3, MASS_H2O
[4]:
(1.007276467, 1.0033, 17.02654910112, 18.01056468403)
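The water and ammonia constants can be reproduced from the atom masses shown earlier, and the proton mass is what turns a neutral mass into an m/z value. The sketch below is only an illustration: the local constants are copied from the outputs above, and the helper `mz` is a hypothetical name, not an AlphaBase function.

```python
# Atom masses copied from the element table above (most abundant isotopes).
MASS_H = 1.00782503223
MASS_N = 14.00307400443
MASS_O = 15.99491461957
MASS_PROTON = 1.007276467  # value printed by the cell above

# Small molecules are plain sums of atom masses.
mass_h2o = 2 * MASS_H + MASS_O   # matches MASS_H2O above
mass_nh3 = MASS_N + 3 * MASS_H   # matches MASS_NH3 above

def mz(neutral_mass, charge):
    """Standard m/z of a positively charged ion carrying `charge` protons."""
    return (neutral_mass + charge * MASS_PROTON) / charge

print(round(mass_h2o, 8), round(mass_nh3, 8))
```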
Amino Acids#
[5]:
from alphabase.constants.aa import AA_DF
AA_DF.loc[ord('A'):ord('Z')]
[5]:
 | aa | formula | mass |
---|---|---|---|
65 | A | C(3)H(5)N(1)O(1)S(0) | 7.103711e+01 |
66 | B | C(1000000) | 1.200000e+07 |
67 | C | C(3)H(5)N(1)O(1)S(1) | 1.030092e+02 |
68 | D | C(4)H(5)N(1)O(3)S(0) | 1.150269e+02 |
69 | E | C(5)H(7)N(1)O(3)S(0) | 1.290426e+02 |
70 | F | C(9)H(9)N(1)O(1)S(0) | 1.470684e+02 |
71 | G | C(2)H(3)N(1)O(1)S(0) | 5.702146e+01 |
72 | H | C(6)H(7)N(3)O(1)S(0) | 1.370589e+02 |
73 | I | C(6)H(11)N(1)O(1)S(0) | 1.130841e+02 |
74 | J | C(6)H(11)N(1)O(1)S(0) | 1.130841e+02 |
75 | K | C(6)H(12)N(2)O(1)S(0) | 1.280950e+02 |
76 | L | C(6)H(11)N(1)O(1)S(0) | 1.130841e+02 |
77 | M | C(5)H(9)N(1)O(1)S(1) | 1.310405e+02 |
78 | N | C(4)H(6)N(2)O(2)S(0) | 1.140429e+02 |
79 | O | C(12)H(19)N(3)O(2) | 2.371477e+02 |
80 | P | C(5)H(7)N(1)O(1)S(0) | 9.705276e+01 |
81 | Q | C(5)H(8)N(2)O(2)S(0) | 1.280586e+02 |
82 | R | C(6)H(12)N(4)O(1)S(0) | 1.561011e+02 |
83 | S | C(3)H(5)N(1)O(2)S(0) | 8.703203e+01 |
84 | T | C(4)H(7)N(1)O(2)S(0) | 1.010477e+02 |
85 | U | C(3)H(5)N(1)O(1)Se(1) | 1.509536e+02 |
86 | V | C(5)H(9)N(1)O(1)S(0) | 9.906841e+01 |
87 | W | C(11)H(10)N(2)O(1)S(0) | 1.860793e+02 |
88 | X | C(1000000) | 1.200000e+07 |
89 | Y | C(9)H(9)N(1)O(2)S(0) | 1.630633e+02 |
90 | Z | C(1000000) | 1.200000e+07 |
From AA_DF, we can see that amino acids are indexed by their ASCII codes (128 characters): 65 == ord('A'), …, 90 == ord('Z'). Unicode strings can be converted quickly to ASCII int32 values using NumPy:
[6]:
import numpy as np
np.array(['ABCXYZ']).view(np.int32)
[6]:
array([65, 66, 67, 88, 89, 90], dtype=int32)
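The reason this view works is that NumPy stores unicode strings as fixed-width 4-byte code points, so reinterpreting the buffer as int32 yields one integer per character (on a little-endian machine). The same idea can be sketched with the standard library alone; this is an illustration of the mechanism, not how AlphaBase does it.

```python
import struct

seq = "ABCXYZ"

# UTF-32 (little-endian, no byte-order mark) uses exactly 4 bytes per character,
# which mirrors the in-memory layout of a NumPy unicode array.
raw = seq.encode("utf-32-le")

# Reinterpret the bytes as little-endian 32-bit integers, one per character.
codes = struct.unpack(f"<{len(seq)}i", raw)
print(codes)  # (65, 66, 67, 88, 89, 90)

# For ASCII letters the result equals ord() of each character.
assert list(codes) == [ord(c) for c in seq]
```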
Users do not need to know these details, however, because AlphaBase provides easy-to-use functions for getting residue masses from sequences.
Calculate AA masses in batch#
[7]:
from alphabase.constants.aa import calc_AA_masses_for_same_len_seqs
calc_AA_masses_for_same_len_seqs(
    [
        'MACDEFG', 'MAKDEFG', 'MAKDEFR'
    ]
)
[7]:
array([[131.04048509, 71.03711379, 103.00918496, 115.02694302,
129.04259309, 147.06841391, 57.02146372],
[131.04048509, 71.03711379, 128.09496302, 115.02694302,
129.04259309, 147.06841391, 57.02146372],
[131.04048509, 71.03711379, 128.09496302, 115.02694302,
129.04259309, 147.06841391, 156.10111102]])
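Conceptually, a batch lookup like this can be pictured as indexing a 128-entry mass table by ASCII code. The following pure-Python sketch illustrates that idea under stated assumptions: the `table` and `residue_masses` names are hypothetical, the masses are copied from the output above, and AlphaBase itself may implement the lookup differently (for example with NumPy indexing).

```python
# Residue masses copied from the output above (only the letters used here).
RESIDUE_MASS = {
    "M": 131.04048509, "A": 71.03711379, "C": 103.00918496,
    "D": 115.02694302, "E": 129.04259309, "F": 147.06841391,
    "G": 57.02146372, "K": 128.09496302, "R": 156.10111102,
}

# A 128-entry table indexed by ASCII code, analogous to indexing AA_DF by ord(aa).
table = [0.0] * 128
for aa, mass in RESIDUE_MASS.items():
    table[ord(aa)] = mass

def residue_masses(sequences):
    """Return one row of residue masses per sequence; all sequences must share a length."""
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("all sequences must have the same length")
    return [[table[ord(aa)] for aa in seq] for seq in sequences]

rows = residue_masses(["MACDEFG", "MAKDEFG", "MAKDEFR"])
print(rows[0])
```

Requiring equal lengths is what lets the result be a rectangular 2D array.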
Modifications#
In AlphaBase, we use mod_name@aa to represent a modification, where mod_name is the UniMod name. We also use mod_name@Protein N-term, mod_name@Any N-term, and mod_name@Any C-term for terminal modifications, following the UniMod terminal naming schema.
The default modification TSV is stored in alphabase/constants/const_files/modification.tsv. Users can add more modifications to the TSV file (only the mod_name and composition columns are required). Please see https://github.com/MannLabs/alphabase/blob/main/alphabase/constants/const_files/modification.tsv.
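As a small illustration of the naming scheme, a modification string can be split into its UniMod name and its site specifier at the last `@`. The helpers `split_mod` and `is_terminal` below are hypothetical and exist only for this sketch; they are not AlphaBase functions.

```python
def split_mod(mod):
    """Split 'mod_name@site' at the last '@' into (mod_name, site)."""
    name, _, site = mod.rpartition("@")
    return name, site

def is_terminal(site):
    """True for terminal site specifiers such as 'Any N-term' or 'Protein N-term'."""
    return site.endswith("N-term") or site.endswith("C-term")

for mod in ["Acetyl@Protein N-term", "Carbamidomethyl@C", "His+O(2)@H"]:
    name, site = split_mod(mod)
    print(name, "|", site, "|", is_terminal(site))
```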
[8]:
from alphabase.constants.modification import MOD_DF
MOD_DF
[8]:
mod_name (index) | mod_name | unimod_mass | unimod_avge_mass | composition | unimod_modloss | modloss_composition | classification | unimod_id | modloss_importance | mass | modloss_original | modloss |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Acetyl@T | Acetyl@T | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1 | 0.0 | 42.010565 | 0.0 | 0.0 |
Acetyl@Protein N-term | Acetyl@Protein N-term | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1 | 0.0 | 42.010565 | 0.0 | 0.0 |
Acetyl@S | Acetyl@S | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1 | 0.0 | 42.010565 | 0.0 | 0.0 |
Acetyl@C | Acetyl@C | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1 | 0.0 | 42.010565 | 0.0 | 0.0 |
Acetyl@Any N-term | Acetyl@Any N-term | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Multiple | 1 | 0.0 | 42.010565 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
TMTpro_zero@K | TMTpro_zero@K | 295.189592 | 295.3773 | H(25)C(15)N(3)O(3) | 0.0 | | Chemical derivative | 2017 | 0.0 | 295.189592 | 0.0 | 0.0 |
TMTpro_zero@T | TMTpro_zero@T | 295.189592 | 295.3773 | H(25)C(15)N(3)O(3) | 0.0 | | Chemical derivative | 2017 | 0.0 | 295.189592 | 0.0 | 0.0 |
Andro-H2O@C | Andro-H2O@C | 332.198760 | 332.4339 | H(28)C(20)O(4) | 0.0 | | Chemical derivative | 2025 | 0.0 | 332.198759 | 0.0 | 0.0 |
His+O(2)@H | His+O(2)@H | 169.048741 | 169.1381 | H(7)C(6)N(3)O(3) | 0.0 | | Post-translational | 2027 | 0.0 | 169.048741 | 0.0 | 0.0 |
GlyGly@K | GlyGly@K | 114.042927 | 114.1026 | H(6)C(4)N(2)O(2) | 0.0 | | Post-translational | 121 | 1000000.0 | 114.042927 | 0.0 | 0.0 |
2685 rows × 12 columns
Modification sites#
In AlphaBase, we use 0 and -1 to represent the N-terminal and C-terminal modification sites, respectively. Modifications on residues use sites 1 to n, where n is the peptide length.
[9]:
from alphabase.constants.modification import calc_modification_mass
sequence = 'MACDEFG'
mod_names = ['Acetyl@Any N-term', 'Carbamidomethyl@C']
mod_sites = [0,3]
calc_modification_mass(
    nAA=len(sequence),
    mod_names=mod_names,
    mod_sites=mod_sites
)
[9]:
array([42.01056468, 0. , 57.02146372, 0. , 0. ,
0. , 0. ])
A modification on the N-terminus (site 0) and a modification on the first amino acid (site 1) are both added to the first position of the array:
[10]:
sequence = 'MAKDEFG'
mod_names = ['Acetyl@Any N-term', 'Oxidation@M']
mod_sites = [0,1]
calc_modification_mass(
    nAA=len(sequence),
    mod_names=mod_names,
    mod_sites=mod_sites
)
[10]:
array([58.0054793, 0. , 0. , 0. , 0. ,
0. , 0. ])
Multiple modifications at a single site are supported. In the following example, K3 carries both GlyGly@K and Dimethyl@K:
[11]:
sequence = 'MAKDEFR'
mod_names = ['GlyGly@K', 'Dimethyl@K']
mod_sites = [3,3]
calc_modification_mass(
    nAA=len(sequence),
    mod_names=mod_names,
    mod_sites=mod_sites
)
[11]:
array([ 0. , 0. , 142.07422757, 0. ,
0. , 0. , 0. ])
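The site convention in the three examples above can be summarised in a few lines of plain Python. This is a sketch, not AlphaBase's implementation: `place_mod_masses` and the `MOD_MASS` dict are hypothetical, the masses are taken from (or implied by) the outputs above, and mapping site -1 to the last residue is an assumption made by analogy with how site 0 maps to the first residue.

```python
# Modification masses taken from, or implied by, the outputs above.
MOD_MASS = {
    "Acetyl@Any N-term": 42.01056468,
    "Carbamidomethyl@C": 57.02146372,
    "Oxidation@M": 15.99491462,
    "GlyGly@K": 114.04292744,
    "Dimethyl@K": 28.03130013,
}

def place_mod_masses(n_aa, mod_names, mod_sites):
    """Accumulate modification masses into a per-residue list of length n_aa."""
    masses = [0.0] * n_aa
    for name, site in zip(mod_names, mod_sites):
        if site == 0:
            idx = 0            # N-term: added to the first residue
        elif site == -1:
            idx = n_aa - 1     # C-term: assumed to be added to the last residue
        else:
            idx = site - 1     # sites 1..n map to positions 0..n-1
        masses[idx] += MOD_MASS[name]  # += lets several mods share one position
    return masses

print(place_mod_masses(7, ["Acetyl@Any N-term", "Oxidation@M"], [0, 1]))
print(place_mod_masses(7, ["GlyGly@K", "Dimethyl@K"], [3, 3]))
```

Accumulating with `+=` is what makes both the N-term-plus-first-residue case and the two-modifications-on-one-site case work without special handling.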
Calculate modification masses in batch#
[12]:
from alphabase.constants.modification import calc_mod_masses_for_same_len_seqs
calc_mod_masses_for_same_len_seqs(
    nAA=7,
    mod_names_list=[
        ['Acetyl@Any N-term', 'Carbamidomethyl@C'],
        ['Acetyl@Any N-term', 'Oxidation@M'],
        ['GlyGly@K', 'Dimethyl@K'],
    ],
    mod_sites_list=[
        [0, 3],
        [0, 1],
        [3, 3],
    ]
)
[12]:
array([[ 42.01056468, 0. , 57.02146372, 0. ,
0. , 0. , 0. ],
[ 58.0054793 , 0. , 0. , 0. ,
0. , 0. , 0. ],
[ 0. , 0. , 142.07422757, 0. ,
0. , 0. , 0. ]])
Mass calculation functionalities#
Calculate AA and modification masses in batch#
[13]:
from alphabase.constants.aa import calc_AA_masses_for_same_len_seqs
from alphabase.constants.modification import calc_mod_masses_for_same_len_seqs
mod_masses = calc_mod_masses_for_same_len_seqs(
    nAA=7,
    mod_names_list=[
        ['Acetyl@Any N-term', 'Carbamidomethyl@C'],
        ['Acetyl@Any N-term', 'Oxidation@M'],
        ['GlyGly@K', 'Dimethyl@K'],
    ],
    mod_sites_list=[
        [0, 3],
        [0, 1],
        [3, 3],
    ]
)
aa_masses = calc_AA_masses_for_same_len_seqs(
    [
        'MACDEFG', 'MAKDEFG', 'MAKDEFR'
    ]
)
mod_masses+aa_masses
[13]:
array([[173.05104977, 71.03711379, 160.03064868, 115.02694302,
129.04259309, 147.06841391, 57.02146372],
[189.04596439, 71.03711379, 128.09496302, 115.02694302,
129.04259309, 147.06841391, 57.02146372],
[131.04048509, 71.03711379, 270.16919059, 115.02694302,
129.04259309, 147.06841391, 156.10111102]])
np.cumsum to get b-ion neutral masses#
[14]:
import numpy as np
np.cumsum(aa_masses+mod_masses, axis=1)
[14]:
array([[ 173.05104977, 244.08816356, 404.11881224, 519.14575526,
648.18834835, 795.25676227, 852.27822599],
[ 189.04596439, 260.08307818, 388.17804119, 503.20498422,
632.24757731, 779.31599122, 836.33745494],
[ 131.04048509, 202.07759887, 472.24678946, 587.27373248,
716.31632557, 863.38473949, 1019.48585051]])
Mass functionalities in ‘mass_calc’#
The functions for calculating peptide and fragment neutral masses are implemented in alphabase.peptide.mass_calc:
[15]:
from alphabase.peptide.mass_calc import calc_peptide_masses_for_same_len_seqs
peptide_masses = calc_peptide_masses_for_same_len_seqs(
    ['MACDEFG', 'MAKDEFG', 'MAKDEFR'],
    mod_list=[
        'Acetyl@Any N-term;Carbamidomethyl@C',
        'Acetyl@Any N-term;Oxidation@M',
        'GlyGly@K;Dimethyl@K',
    ],
)
peptide_masses
[15]:
array([ 870.28879067, 854.34801962, 1037.49641519])
[16]:
from alphabase.peptide.mass_calc import calc_b_y_and_peptide_masses_for_same_len_seqs
b_masses, y_masses, peptide_masses = calc_b_y_and_peptide_masses_for_same_len_seqs(
    ['MACDEFG', 'MAKDEFG', 'MAKDEFR'],
    mod_list=[
        ['Acetyl@Any N-term', 'Carbamidomethyl@C'],
        ['Acetyl@Any N-term', 'Oxidation@M'],
        ['GlyGly@K', 'Dimethyl@K'],
    ],
    site_list=[
        [0, 3],
        [0, 1],
        [3, 3],
    ],
)
peptide_masses
[16]:
array([ 870.28879067, 854.34801962, 1037.49641519])
[17]:
b_masses
[17]:
array([[173.05104977, 244.08816356, 404.11881224, 519.14575526,
648.18834835, 795.25676227],
[189.04596439, 260.08307818, 388.17804119, 503.20498422,
632.24757731, 779.31599122],
[131.04048509, 202.07759887, 472.24678946, 587.27373248,
716.31632557, 863.38473949]])
[18]:
y_masses
[18]:
array([[697.2377409 , 626.20062711, 466.16997843, 351.14303541,
222.10044232, 75.0320284 ],
[665.30205523, 594.26494145, 466.16997843, 351.14303541,
222.10044232, 75.0320284 ],
[906.45593011, 835.41881632, 565.24962574, 450.22268271,
321.18008962, 174.11167571]])
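The numbers printed above are related in a simple way: the b masses are prefix sums of the modified residue masses, the peptide mass is the full sum plus one water, and each y mass is the peptide mass minus the matching b mass. The sketch below reproduces the first peptide with the standard library only. It illustrates that relationship using values copied from the outputs above and is not a description of how `mass_calc` is implemented internally.

```python
from itertools import accumulate

MASS_H2O = 18.01056468403  # value printed earlier in this notebook

# Modified residue masses of the first peptide (row 0 of mod_masses + aa_masses).
residues = [
    173.05104977, 71.03711379, 160.03064868, 115.02694302,
    129.04259309, 147.06841391, 57.02146372,
]

prefix = list(accumulate(residues))        # running sums from the N-terminus
peptide_mass = prefix[-1] + MASS_H2O       # neutral peptide = all residues + H2O
b_masses = prefix[:-1]                     # the full-length prefix is not a fragment
y_masses = [peptide_mass - b for b in b_masses]  # complementary C-terminal fragments

print(round(peptide_mass, 6))
print([round(y, 6) for y in y_masses])
```

A peptide of length n therefore yields n - 1 b/y pairs, which is why the arrays above have six columns for seven residues.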
Isotope distribution#
alphabase.constants.isotope.IsotopeDistribution calculates the isotope distribution for a given atom composition, together with the index of the monoisotopic peak in that distribution.
What is the monoisotopic index (mono_idx)? For a single atom, mono_idx points to the most abundant isotope, so its value is round(mass of the most abundant isotope - mass of the first isotope).
[19]:
import numpy as np
import pandas as pd
from alphabase.constants.element import CHEM_INFO_DICT
atom_df = pd.DataFrame.from_dict(CHEM_INFO_DICT, orient='index')
def get_mono(masses_abundances):
    # Offset (in nominal mass units) of the most abundant isotope from the first listed isotope
    masses, abundances = masses_abundances
    return round(masses[np.argmax(abundances)]-masses[0])
atom_df['mono_idx'] = atom_df[['mass','abundance']].apply(
    get_mono, axis=1
)
atom_df
[19]:
 | abundance | mass | mono_idx |
---|---|---|---|
13C | [0.01, 0.99] | [12.0, 13.00335483507] | 1 |
14N | [0.996337, 0.003663] | [14.00307400443, 15.00010889888] | 0 |
15N | [0.01, 0.99] | [14.00307400443, 15.00010889888] | 1 |
18O | [0.005, 0.005, 0.99] | [15.99491461957, 16.9991317565, 17.99915961286] | 2 |
2H | [0.01, 0.99] | [1.00782503223, 2.01410177812] | 1 |
... | ... | ... | ... |
Xe | [0.000952, 0.00089, 0.019102, 0.264006, 0.0407... | [123.905892, 125.9042983, 127.903531, 128.9047... | 8 |
Y | [1.0] | [88.9058403] | 0 |
Yb | [0.00123, 0.02982, 0.1409, 0.2168, 0.16103, 0.... | [167.9338896, 169.9347664, 170.9363302, 171.93... | 6 |
Zn | [0.4917, 0.2773, 0.0404, 0.1845, 0.0061] | [63.92914201, 65.92603381, 66.92712775, 67.924... | 0 |
Zr | [0.5145, 0.1122, 0.1715, 0.1738, 0.028] | [89.9046977, 90.9056396, 91.9050347, 93.906310... | 0 |
109 rows × 3 columns
The mono_idx of an atom composition is the sum of the mono_idx values of all its atoms. In AlphaBase, alphabase.constants.isotope.IsotopeDistribution calculates both the isotope abundances and mono_idx.
For example, the mono_idx of Fe is 2:
[20]:
atom_df.loc['Fe']
[20]:
abundance [0.05845, 0.91754, 0.02119, 0.00282]
mass [53.93960899, 55.93493633, 56.93539284, 57.933...
mono_idx 2
Name: Fe, dtype: object
So the mono_idx of C(1)Fe(1) is also 2:
[21]:
from alphabase.constants.isotope import IsotopeDistribution, parse_formula
iso = IsotopeDistribution()
iso.calc_formula_distribution(
    [('C',1),('Fe',1)]
)
[21]:
(array([5.78245850e-02, 6.25415000e-04, 9.07722322e-01, 3.07809450e-02,
3.01655900e-03, 3.01740000e-05, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00]),
2)
But the mono_idx of 13C(1)Fe(1) is 3:
[22]:
iso.calc_formula_distribution(
    [('13C',1),('Fe',1)]
)
[22]:
(array([5.845000e-04, 5.786550e-02, 9.175400e-03, 9.085765e-01,
2.100630e-02, 2.791800e-03, 0.000000e+00, 0.000000e+00,
0.000000e+00, 0.000000e+00]),
3)
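Both results can be checked by hand with a plain convolution of per-atom isotope patterns placed on an integer mass grid. The sketch below is a generic illustration of that technique, not AlphaBase's implementation. The helper names are hypothetical, isotopes are identified by nominal mass number, the Fe abundances are copied from the table above, and the natural carbon abundances (0.9893, 0.0107) are an assumption that is consistent with the printed distribution.

```python
def on_integer_grid(mass_numbers, abundances):
    """Place isotope abundances on a grid indexed by offset from the first isotope."""
    grid = [0.0] * (mass_numbers[-1] - mass_numbers[0] + 1)
    for m, a in zip(mass_numbers, abundances):
        grid[m - mass_numbers[0]] += a
    return grid

def convolve(p, q):
    """Discrete convolution: distribution of the sum of two independent offsets."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def mono_idx(mass_numbers, abundances):
    """Offset of the most abundant isotope from the first isotope."""
    top = max(range(len(abundances)), key=abundances.__getitem__)
    return mass_numbers[top] - mass_numbers[0]

# (nominal mass numbers, abundances)
carbon = ([12, 13], [0.9893, 0.0107])                          # assumed natural carbon
iron = ([54, 56, 57, 58], [0.05845, 0.91754, 0.02119, 0.00282])  # from the table above

dist = convolve(on_integer_grid(*carbon), on_integer_grid(*iron))
mono = mono_idx(*carbon) + mono_idx(*iron)  # mono_idx values add up across atoms
print(mono, [round(x, 9) for x in dist])
```

Swapping in `([12, 13], [0.01, 0.99])` for carbon shifts the carbon mono_idx to 1 and gives the 13C(1)Fe(1) case.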
The mono_idx of most atom compositions is 0, no matter how large the compositions are, because the most abundant isotope of common elements such as C, H, N, and O is also the lightest one.
[23]:
from alphabase.constants.isotope import IsotopeDistribution, parse_formula
iso = IsotopeDistribution()
formula = 'C(100)H(100)O(50)Na(1)'
formula = parse_formula(formula)
formula
[23]:
[('C', 100), ('H', 100), ('O', 50), ('Na', 1)]
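Note that the monoisotopic peak is not necessarily the highest peak in the distribution. Here mono_idx is 0, but the most abundant peak is at index 1: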
[24]:
dist, mono = iso.calc_formula_distribution(formula)
mono, dist.argmax(), dist
[24]:
(0,
1,
array([2.98521241e-01, 3.31991573e-01, 2.13532938e-01, 1.00604878e-01,
3.82856126e-02, 1.23872292e-02, 3.51773755e-03, 8.95830236e-04,
2.07763024e-04, 4.43944472e-05]))
All these low-level functionalities have been integrated into the DataFrame functionalities; see tutorial_dev_dataframes.ipynb, or Tutorial for Dev: Peptide and Fragment DataFrames.