Tutorial for Dev: Peptide and Fragment DataFrames#
This notebook introduces the peptide and fragment DataFrame functionalities of AlphaBase to developers.
Peptide DataFrame#
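A peptide dataframe must contain four columns: sequence for the amino acid sequence (str), mods for modification names (str), mod_sites for modification sites (str), and charge for the precursor charge state (int). Multiple modifications on one peptide are joined with ';', with mods and mod_sites in matching order, as in the third row of the example.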
We can easily build a peptide dataframe:
[1]:
import pandas as pd
df = pd.DataFrame({
'sequence': ['ACDEFHIK', 'APDEFMNIK', 'SWDEFMNTIRAAAAKDDDDR'],
'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
'mod_sites': ['2', '', '1;6'],
'charge': [1,2,3],
})
df
[1]:
 | sequence | mods | mod_sites | charge |
---|---|---|---|---|
0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 |
1 | APDEFMNIK | | | 2 |
2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 |
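To make the mods and mod_sites encoding concrete, here is a minimal sketch in plain Python that pairs each modification name with its site and checks that the residue named after '@' is found at that position. It is a reading of the encoding inferred from the example rows, not AlphaBase code: parse_mods is a hypothetical helper, and it assumes positive, 1-based residue sites only (terminal modifications are not handled).

```python
def parse_mods(sequence, mods, mod_sites):
    """Pair each modification name with its 1-based site and check the residue.

    Hypothetical helper for illustration; not part of AlphaBase.
    """
    if not mods:
        return []
    names = mods.split(";")
    sites = [int(s) for s in mod_sites.split(";")]
    if len(names) != len(sites):
        raise ValueError("mods and mod_sites must have the same number of entries")
    pairs = []
    for name, site in zip(names, sites):
        residue = name.split("@")[1]
        # Assumption: only positive, 1-based residue sites are handled here.
        if not 1 <= site <= len(sequence) or sequence[site - 1] != residue:
            raise ValueError(f"{name} does not match site {site} of {sequence}")
        pairs.append((name, site))
    return pairs


# The three peptides from the dataframe above.
peptides = [
    ("ACDEFHIK", "Carbamidomethyl@C", "2"),
    ("APDEFMNIK", "", ""),
    ("SWDEFMNTIRAAAAKDDDDR", "Phospho@S;Oxidation@M", "1;6"),
]
parsed = [parse_mods(*p) for p in peptides]
print(parsed)
```

All three example rows pass this check, which suggests that sites count residues from 1 at the N-terminus.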
Calculate precursor_mz and isotopes from peptide dataframe#
alphabase.peptide.precursor.update_precursor_mz() calculates the precursor_mz for each peptide and adds it to the dataframe, together with the peptide length nAA.
[2]:
from alphabase.peptide.precursor import update_precursor_mz
update_precursor_mz(df)
[2]:
 | sequence | mods | mod_sites | charge | nAA | precursor_mz |
---|---|---|---|---|---|---|
0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 | 8 | 1019.461492 |
1 | APDEFMNIK | | | 2 | 9 | 532.757692 |
2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 | 20 | 808.337166 |
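The precursor_mz values can be reproduced by hand, which is a useful sanity check. The sketch below is not the AlphaBase implementation: it assumes the standard formula (sum of monoisotopic residue masses + H2O + modification masses) / charge + proton mass, and hard-codes commonly tabulated masses for only the residues and the one modification needed by the first two peptides.

```python
# Monoisotopic residue masses (Da); only the residues needed here.
RESIDUE_MASS = {
    "A": 71.037114, "C": 103.009185, "D": 115.026943, "E": 129.042593,
    "F": 147.068414, "H": 137.058912, "I": 113.084064, "K": 128.094963,
    "M": 131.040485, "N": 114.042927, "P": 97.052764,
}
MOD_MASS = {"Carbamidomethyl@C": 57.021464}
MASS_H2O = 18.010565
MASS_PROTON = 1.007276


def precursor_mz(sequence, mods, charge):
    """m/z of the protonated peptide (illustrative helper, not AlphaBase code)."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + MASS_H2O
    neutral += sum(MOD_MASS[m] for m in mods.split(";") if m)
    return neutral / charge + MASS_PROTON


mz_0 = precursor_mz("ACDEFHIK", "Carbamidomethyl@C", 1)
mz_1 = precursor_mz("APDEFMNIK", "", 2)
print(round(mz_0, 4), round(mz_1, 4))
```

Both values agree with the precursor_mz column above to within rounding of the tabulated masses.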
alphabase.peptide.precursor.calc_precursor_isotope() calculates the precursor isotope information for each peptide. It adds the i_* columns (isotope peak intensities) and mono_isotope_idx to the dataframe.
[3]:
from alphabase.peptide.precursor import calc_precursor_isotope
calc_precursor_isotope(df)
[3]:
 | sequence | mods | mod_sites | charge | nAA | precursor_mz | i_0 | i_1 | i_2 | i_3 | i_4 | i_5 | mono_isotope_idx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 | 8 | 1019.461492 | 0.544890 | 0.294208 | 0.116900 | 0.034340 | 0.008077 | 0.001584 | 0 |
1 | APDEFMNIK | | | 2 | 9 | 532.757692 | 0.527839 | 0.300826 | 0.123018 | 0.037359 | 0.009104 | 0.001854 | 0 |
2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 | 20 | 808.337166 | 0.271028 | 0.323775 | 0.225641 | 0.115441 | 0.047553 | 0.016561 | 0 |
Computing isotope patterns for millions of peptides is very time-consuming, so AlphaBase also provides calc_precursor_isotope_mp, which uses multiprocessing.
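To illustrate what the i_* columns represent, and why the computation is costly, the following sketch builds an isotope pattern by repeatedly convolving per-element isotope abundances. It is a textbook approach written for this tutorial, not the algorithm AlphaBase uses. The abundance values are approximate natural abundances, the elemental formula for unmodified APDEFMNIK (C47 H73 N11 O15 S) was worked out by hand, and normalizing the six kept peaks to sum to 1 is an assumption based on the rows in the table above.

```python
# Approximate natural isotope abundances, indexed by nominal mass shift.
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def convolve(a, b, n_peaks):
    """Convolve two distributions, keeping only the first n_peaks entries."""
    out = [0.0] * n_peaks
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < n_peaks:
                out[i + j] += x * y
    return out


def isotope_pattern(formula, n_peaks=6):
    """Isotope pattern of an elemental formula given as {element: count}."""
    dist = [1.0] + [0.0] * (n_peaks - 1)
    for element, count in formula.items():
        for _ in range(count):  # one convolution per atom
            dist = convolve(dist, ISOTOPES[element], n_peaks)
    total = sum(dist)
    return [p / total for p in dist]  # kept peaks are normalized to sum to 1


# Unmodified APDEFMNIK: residues plus one water, charge carriers ignored.
pattern = isotope_pattern({"C": 47, "H": 73, "N": 11, "O": 15, "S": 1})
print([round(p, 3) for p in pattern])
```

The first two peaks come out close to the i_0 and i_1 values of the second row; small differences are expected because the abundance table here is only approximate.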
Fragment DataFrame#
alphabase.peptide.fragment.create_fragment_mz_dataframe() is the only function we need to calculate the fragment_mz dataframe. It has two key parameters:
precursor_df (pd.DataFrame): the peptide or precursor dataframe.
charged_frag_types (list of str): the charged fragment types to include as columns of the fragment dataframe. The schema is Type[_LossType]_z[n], where:
Type can be b, y, c, or z (the example below also uses a and x).
_LossType is optional and can be _modloss, _H2O, or _NH3.
z[n] is the charge state. If the precursor charge is less than n, the corresponding mz is set to zero.
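The naming schema can be read mechanically. The sketch below uses a regular expression to split a column name into fragment type, optional loss type, and charge, and shows the zeroing rule for fragment charges above the precursor charge. Both functions are hypothetical helpers written for this tutorial, not AlphaBase APIs, and the accepted type letters (a, b, c, x, y, z) are taken from the text and the example cell.

```python
import re

# Type[_LossType]_z[n], e.g. 'b_z1', 'y_H2O_z1', 'b_modloss_z2'.
FRAG_TYPE_PATTERN = re.compile(
    r"^(?P<type>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$"
)


def parse_charged_frag_type(name):
    """Return (type, loss or None, charge) for a charged fragment type name."""
    match = FRAG_TYPE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid charged fragment type: {name!r}")
    return match["type"], match["loss"], int(match["charge"])


def is_zeroed(frag_charge, precursor_charge):
    """True when the fragment m/z would be set to zero for this precursor."""
    return precursor_charge < frag_charge


parsed_types = [
    parse_charged_frag_type(name)
    for name in ["b_z1", "b_z2", "y_H2O_z1", "z_z1", "b_modloss_z1"]
]
print(parsed_types)
```

For example, b_z2 has charge 2, so it is zeroed for a precursor of charge 1 and kept for precursors of charge 2 or higher.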
[4]:
from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
df,
charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','z_z1']
)
frag_mz_df
[4]:
 | a_z1 | b_z1 | c_z1 | b_z2 | x_z1 | y_z1 | y_H2O_z1 | z_z1 |
---|---|---|---|---|---|---|---|---|
0 | 44.049477 | 72.044388 | 89.070938 | 0.000000 | 974.403625 | 948.424377 | 930.413818 | 932.405640 |
1 | 204.080124 | 232.075043 | 249.101593 | 0.000000 | 814.372986 | 788.393738 | 770.383179 | 772.375000 |
2 | 319.107056 | 347.101990 | 364.128540 | 0.000000 | 699.346069 | 673.366760 | 655.356201 | 657.348083 |
3 | 448.149658 | 476.144562 | 493.171112 | 0.000000 | 570.303467 | 544.324219 | 526.313660 | 528.305481 |
4 | 595.218079 | 623.213013 | 640.239563 | 0.000000 | 423.235046 | 397.255768 | 379.245209 | 381.237061 |
5 | 732.276978 | 760.271912 | 777.298462 | 0.000000 | 286.176147 | 260.196869 | 242.186310 | 244.178146 |
6 | 845.361023 | 873.355957 | 890.382507 | 0.000000 | 173.092072 | 147.112808 | 129.102234 | 131.094086 |
7 | 44.049477 | 72.044388 | 89.070938 | 36.525833 | 1019.450256 | 993.471008 | 975.460449 | 977.452271 |
8 | 141.102234 | 169.097153 | 186.123703 | 85.052216 | 922.397522 | 896.418213 | 878.407654 | 880.399536 |
9 | 256.129181 | 284.124084 | 301.150635 | 142.565689 | 807.370544 | 781.391296 | 763.380737 | 765.372559 |
10 | 385.171783 | 413.166687 | 430.193237 | 207.086990 | 678.327942 | 652.348694 | 634.338135 | 636.329956 |
11 | 532.240173 | 560.235107 | 577.261658 | 280.621185 | 531.259521 | 505.280273 | 487.269714 | 489.261566 |
12 | 663.280701 | 691.275574 | 708.302124 | 346.141418 | 400.219055 | 374.239807 | 356.229218 | 358.221069 |
13 | 777.323608 | 805.318542 | 822.345093 | 403.162903 | 286.176147 | 260.196869 | 242.186310 | 244.178146 |
14 | 890.407654 | 918.402588 | 935.429138 | 459.704926 | 173.092072 | 147.112808 | 129.102234 | 131.094086 |
15 | 140.010727 | 168.005630 | 185.032181 | 84.506454 | 2281.977783 | 2255.998535 | 2237.988037 | 2239.979980 |
16 | 326.090027 | 354.084961 | 371.111511 | 177.546112 | 2095.898438 | 2069.919189 | 2051.908691 | 2053.900635 |
17 | 441.116974 | 469.111877 | 486.138428 | 235.059586 | 1980.871582 | 1954.892334 | 1936.881714 | 1938.873657 |
18 | 570.159546 | 598.154480 | 615.181030 | 299.580872 | 1851.828979 | 1825.849731 | 1807.839111 | 1809.831055 |
19 | 717.227966 | 745.222900 | 762.249451 | 373.115082 | 1704.760620 | 1678.781372 | 1660.770752 | 1662.762573 |
20 | 864.263367 | 892.258301 | 909.284851 | 446.632782 | 1557.725220 | 1531.745972 | 1513.735352 | 1515.727173 |
21 | 978.306335 | 1006.301208 | 1023.327759 | 503.654266 | 1443.682251 | 1417.703003 | 1399.692383 | 1401.684326 |
22 | 1079.354004 | 1107.348877 | 1124.375488 | 554.178101 | 1342.634521 | 1316.655273 | 1298.644775 | 1300.636597 |
23 | 1192.438110 | 1220.432983 | 1237.459473 | 610.720093 | 1229.550537 | 1203.571289 | 1185.560669 | 1187.552490 |
24 | 1348.539185 | 1376.534058 | 1393.560669 | 688.770691 | 1073.449463 | 1047.470093 | 1029.459595 | 1031.451416 |
25 | 1419.576294 | 1447.571167 | 1464.597778 | 724.289246 | 1002.412292 | 976.433044 | 958.422485 | 960.414307 |
26 | 1490.613403 | 1518.608276 | 1535.634888 | 759.807800 | 931.375183 | 905.395935 | 887.385376 | 889.377197 |
27 | 1561.650513 | 1589.645386 | 1606.671997 | 795.326355 | 860.338074 | 834.358826 | 816.348267 | 818.340088 |
28 | 1632.687622 | 1660.682495 | 1677.709106 | 830.844910 | 789.300964 | 763.321716 | 745.311096 | 747.302979 |
29 | 1760.782593 | 1788.777466 | 1805.804077 | 894.892395 | 661.205994 | 635.226746 | 617.216187 | 619.208008 |
30 | 1875.809570 | 1903.804443 | 1920.830933 | 952.405884 | 546.179016 | 520.199768 | 502.189209 | 504.181061 |
31 | 1990.836426 | 2018.831421 | 2035.857910 | 1009.919312 | 431.152100 | 405.172852 | 387.162262 | 389.154114 |
32 | 2105.863525 | 2133.858398 | 2150.884766 | 1067.432861 | 316.125153 | 290.145905 | 272.135345 | 274.127167 |
33 | 2220.890381 | 2248.885254 | 2265.911865 | 1124.946289 | 201.098221 | 175.118958 | 157.108383 | 159.100235 |
After create_fragment_mz_dataframe() is called, two columns, frag_start_idx and frag_stop_idx, are appended to the peptide dataframe. For each peptide, they give the row range of its fragments in the fragment dataframe, from frag_start_idx (inclusive) to frag_stop_idx (exclusive).
[5]:
df[[
'sequence','mods','mod_sites','charge','nAA',
'precursor_mz','frag_start_idx','frag_stop_idx'
]]
[5]:
 | sequence | mods | mod_sites | charge | nAA | precursor_mz | frag_start_idx | frag_stop_idx |
---|---|---|---|---|---|---|---|---|
0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 | 8 | 1019.461492 | 0 | 7 |
1 | APDEFMNIK | | | 2 | 9 | 532.757692 | 7 | 15 |
2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 | 20 | 808.337166 | 15 | 34 |
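The index values in this table follow a simple pattern: a peptide with nAA residues has nAA - 1 cleavage positions, so it occupies nAA - 1 rows, and the ranges are laid end to end. The sketch below reproduces the two columns from nAA alone. It describes the layout seen in this example and is an assumption about how the indices are assigned in general, not the AlphaBase code.

```python
from itertools import accumulate

n_aa = [8, 9, 20]  # the nAA column above

# One fragment row per cleavage position between adjacent residues.
n_frags = [n - 1 for n in n_aa]

# Stop indices are the running total; each start is the previous stop.
frag_stop_idx = list(accumulate(n_frags))
frag_start_idx = [0] + frag_stop_idx[:-1]

print(frag_start_idx, frag_stop_idx)
```

Because each range is half-open, a slice such as iloc[start:stop] returns exactly the rows of one peptide.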
[6]:
start, stop = df[['frag_start_idx', 'frag_stop_idx']].values[0]  # first peptide
frag_mz_df.iloc[start:stop]
[6]:
 | a_z1 | b_z1 | c_z1 | b_z2 | x_z1 | y_z1 | y_H2O_z1 | z_z1 |
---|---|---|---|---|---|---|---|---|
0 | 44.049477 | 72.044388 | 89.070938 | 0.0 | 974.403625 | 948.424377 | 930.413818 | 932.405640 |
1 | 204.080124 | 232.075043 | 249.101593 | 0.0 | 814.372986 | 788.393738 | 770.383179 | 772.375000 |
2 | 319.107056 | 347.101990 | 364.128540 | 0.0 | 699.346069 | 673.366760 | 655.356201 | 657.348083 |
3 | 448.149658 | 476.144562 | 493.171112 | 0.0 | 570.303467 | 544.324219 | 526.313660 | 528.305481 |
4 | 595.218079 | 623.213013 | 640.239563 | 0.0 | 423.235046 | 397.255768 | 379.245209 | 381.237061 |
5 | 732.276978 | 760.271912 | 777.298462 | 0.0 | 286.176147 | 260.196869 | 242.186310 | 244.178146 |
6 | 845.361023 | 873.355957 | 890.382507 | 0.0 | 173.092072 | 147.112808 | 129.102234 | 131.094086 |
Note that all N-term (a/b/c) fragment mz values are in ascending order, e.g. from b[1] to b[n-1]; and all C-term (x/y/z) fragments are in descending order, e.g. from y[n-1] to y[1].
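The row ordering described above can be seen in a small hand calculation of the b and y series. This is a sketch under the usual definitions (b ion: prefix residue masses plus protons; y ion: suffix residue masses plus water plus protons), not the AlphaBase implementation. The residue masses are commonly tabulated monoisotopic values and b_y_ions is a hypothetical helper.

```python
# Monoisotopic residue masses (Da); only the residues of APDEFMNIK.
RESIDUE_MASS = {
    "A": 71.037114, "D": 115.026943, "E": 129.042593, "F": 147.068414,
    "I": 113.084064, "K": 128.094963, "M": 131.040485, "N": 114.042927,
    "P": 97.052764,
}
MASS_H2O = 18.010565
MASS_PROTON = 1.007276


def b_y_ions(sequence, charge=1):
    """Return (b_mz, y_mz), one entry per cleavage position.

    Row i holds b[i+1] and y[n-1-i], so b ascends and y descends.
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses) + MASS_H2O  # neutral peptide mass
    b_mz, y_mz = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_mz.append((prefix + charge * MASS_PROTON) / charge)
        y_mz.append((total - prefix + charge * MASS_PROTON) / charge)
    return b_mz, y_mz


b_mz, y_mz = b_y_ions("APDEFMNIK")
b2_mz, _ = b_y_ions("APDEFMNIK", charge=2)
print(round(b_mz[0], 4), round(y_mz[0], 4), round(b2_mz[0], 4))
```

The results match the b_z1, y_z1, and b_z2 columns of rows 7 to 14 in the fragment dataframe, up to the precision of the values shown there.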
All dataframe functionalities use low-level APIs of AlphaBase; see tutorial_dev_basic_definations.ipynb or Tutorial for Dev: Basic Definations.
Spectral library functionalities provide higher-level APIs that encapsulate these dataframe functionalities; see tutorial_dev_spectral_libraries.ipynb or Tutorial for Dev: Spectral Libraries.