Tutorial for Dev: Peptide and Fragment DataFrames

This notebook introduces the peptide and fragment DataFrame functionality for developers.

Peptide DataFrame

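A peptide dataframe must contain four columns: sequence, the amino acid sequence (str); mods, the modification names (str); mod_sites, the modification sites (str); and charge, the precursor charge state (int). As the example below shows, multiple modifications on one peptide are joined with ';' in both mods and mod_sites, in matching order, and an unmodified peptide uses empty strings for both.

A peptide dataframe can be built directly with pandas: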

[1]:
import pandas as pd

df = pd.DataFrame({
    'sequence': ['ACDEFHIK', 'APDEFMNIK', 'SWDEFMNTIRAAAAKDDDDR'],
    'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
    'mod_sites': ['2', '', '1;6'],
    'charge': [1,2,3],
})
df
[1]:
               sequence                   mods  mod_sites  charge
0              ACDEFHIK      Carbamidomethyl@C          2       1
1             APDEFMNIK                                         2
2  SWDEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M        1;6       3
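
Because mods and mod_sites are parallel ';'-separated strings, it can help to check that they agree before passing a dataframe on. The sketch below is a hypothetical helper, not part of AlphaBase. It assumes that sites are 1-based residue positions, which is what the example rows suggest, and it does not handle terminal modifications.

```python
def check_peptide_row(sequence, mods, mod_sites):
    """Return a list of problems found in one peptide row (empty if none).

    Hypothetical helper for illustration; not an AlphaBase function.
    """
    mod_list = mods.split(';') if mods else []
    site_list = mod_sites.split(';') if mod_sites else []
    if len(mod_list) != len(site_list):
        return [f"{len(mod_list)} mods but {len(site_list)} sites"]

    problems = []
    for mod, site in zip(mod_list, site_list):
        # 'Carbamidomethyl@C' -> the residue after '@' is 'C'
        _, _, residue = mod.partition('@')
        pos = int(site)
        # Assumption: sites are 1-based residue positions, as in the example rows.
        if not 1 <= pos <= len(sequence):
            problems.append(f"{mod}: site {pos} is outside the sequence")
        elif sequence[pos - 1] != residue:
            problems.append(
                f"{mod}: residue {pos} is {sequence[pos - 1]}, not {residue}"
            )
    return problems


rows = [
    ('ACDEFHIK', 'Carbamidomethyl@C', '2'),
    ('APDEFMNIK', '', ''),
    ('SWDEFMNTIRAAAAKDDDDR', 'Phospho@S;Oxidation@M', '1;6'),
]
for sequence, mods, mod_sites in rows:
    print(sequence, check_peptide_row(sequence, mods, mod_sites))
```

All three tutorial rows pass this check; moving the Carbamidomethyl site from 2 to 3 would be reported, because residue 3 of ACDEFHIK is D.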

Calculate precursor_mz and isotopes from peptide dataframe

alphabase.peptide.precursor.update_precursor_mz() calculates precursor_mz for each peptide. As the output shows, it also adds an nAA column holding the number of amino acids in each sequence.

[2]:
from alphabase.peptide.precursor import update_precursor_mz

update_precursor_mz(df)
[2]:
               sequence                   mods  mod_sites  charge  nAA  precursor_mz
0              ACDEFHIK      Carbamidomethyl@C          2       1    8   1019.461492
1             APDEFMNIK                                         2    9    532.757692
2  SWDEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M        1;6       3   20    808.337166
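
To make the precursor_mz values less opaque, here is a minimal sketch of the standard m/z formula, (M + z * m_proton) / z, where M is the neutral monoisotopic peptide mass (residues plus one water plus modifications). It is an independent illustration, not AlphaBase's implementation. The mass tables are hand-entered monoisotopic values covering only the first two example peptides, and the function name is hypothetical.

```python
# Monoisotopic residue masses in Da (only the residues needed here).
RESIDUE_MASS = {
    'A': 71.037114, 'C': 103.009185, 'D': 115.026943, 'E': 129.042593,
    'F': 147.068414, 'H': 137.058912, 'I': 113.084064, 'K': 128.094963,
    'M': 131.040485, 'N': 114.042927, 'P': 97.052764,
}
# Monoisotopic mass shift of the one modification used below.
MOD_MASS = {'Carbamidomethyl@C': 57.021464}
MASS_H2O = 18.010565
MASS_PROTON = 1.007276


def precursor_mz(sequence, mods, charge):
    """m/z of a peptide: (neutral mass + charge * proton mass) / charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + MASS_H2O
    neutral += sum(MOD_MASS[mod] for mod in mods.split(';') if mod)
    return (neutral + charge * MASS_PROTON) / charge


print(precursor_mz('ACDEFHIK', 'Carbamidomethyl@C', 1))
print(precursor_mz('APDEFMNIK', '', 2))
```

With these masses, the two results agree with the precursor_mz column above to within rounding of the input masses.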

alphabase.peptide.precursor.calc_precursor_isotope() calculates the precursor isotope information for peptides. It adds i_* columns (i_0 to i_5 in the output below) and a mono_isotope_idx column to the dataframe.

[3]:
from alphabase.peptide.precursor import calc_precursor_isotope

calc_precursor_isotope(df)
[3]:
               sequence                   mods  mod_sites  charge  nAA  precursor_mz       i_0       i_1       i_2       i_3       i_4       i_5  mono_isotope_idx
0              ACDEFHIK      Carbamidomethyl@C          2       1    8   1019.461492  0.544890  0.294208  0.116900  0.034340  0.008077  0.001584                 0
1             APDEFMNIK                                         2    9    532.757692  0.527839  0.300826  0.123018  0.037359  0.009104  0.001854                 0
2  SWDEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M        1;6       3   20    808.337166  0.271028  0.323775  0.225641  0.115441  0.047553  0.016561                 0
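
One way to read the i_* columns is as relative intensities of consecutive isotope peaks. That reading is an interpretation of the output rather than something stated here, but a quick check on the values copied from the table is consistent with it: each row sums to roughly 1, and for the longest peptide the most intense peak is no longer the first one.

```python
# i_0 .. i_5 copied from the output above, keyed by sequence.
isotopes = {
    'ACDEFHIK': [0.544890, 0.294208, 0.116900, 0.034340, 0.008077, 0.001584],
    'APDEFMNIK': [0.527839, 0.300826, 0.123018, 0.037359, 0.009104, 0.001854],
    'SWDEFMNTIRAAAAKDDDDR': [
        0.271028, 0.323775, 0.225641, 0.115441, 0.047553, 0.016561,
    ],
}

row_sums = {}
apex_idx = {}
for seq, intensities in isotopes.items():
    row_sums[seq] = sum(intensities)
    # Index of the most intense isotope peak in this row.
    apex_idx[seq] = max(range(len(intensities)), key=intensities.__getitem__)
    print(f"{seq}: sum={row_sums[seq]:.4f}, most intense peak=i_{apex_idx[seq]}")
```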

Computing isotope patterns for millions of peptides is time-consuming, so we also provide calc_precursor_isotope_mp, which uses multiprocessing.

Fragment DataFrame

alphabase.peptide.fragment.create_fragment_mz_dataframe() is the only function needed to calculate the fragment m/z dataframe. It has two key parameters:

  • precursor_df (pd.DataFrame): the peptide or precursor dataframe.

  • charged_frag_types (list of str): the charged fragment types to include as columns of the fragment dataframe. The schema is Type[_LossType]_z[n], where

    • Type can be a, b, c, x, y, or z (the example below uses all six).

    • _LossType is optional and can be _modloss, _H2O, or _NH3.

    • z[n] is the charge state, e.g. z1 or z2. If the precursor charge is less than n, the corresponding m/z is set to zero.
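
The Type[_LossType]_z[n] schema above can be made concrete with a small parser. This is a sketch for illustration only: the function name and regular expression are hypothetical, not AlphaBase code. It accepts a and x as types because the tutorial's own call passes a_z1 and x_z1.

```python
import re

# Type, optional loss type, then the charge state after "_z".
FRAG_TYPE_RE = re.compile(
    r'^(?P<type>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$'
)


def parse_charged_frag_type(name):
    """Split e.g. 'y_H2O_z1' into ('y', 'H2O', 1); loss is None if absent."""
    match = FRAG_TYPE_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid charged fragment type: {name!r}")
    return match['type'], match['loss'], int(match['charge'])


for name in ['b_z1', 'b_z2', 'y_H2O_z1', 'b_modloss_z1']:
    print(name, '->', parse_charged_frag_type(name))
```

Validating names this way before building a long charged_frag_types list catches typos such as 'y_H20_z1' early.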

[4]:
from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
    df,
    charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','z_z1']
)
frag_mz_df
[4]:
a_z1 b_z1 c_z1 b_z2 x_z1 y_z1 y_H2O_z1 z_z1
0 44.049477 72.044388 89.070938 0.000000 974.403625 948.424377 930.413818 932.405640
1 204.080124 232.075043 249.101593 0.000000 814.372986 788.393738 770.383179 772.375000
2 319.107056 347.101990 364.128540 0.000000 699.346069 673.366760 655.356201 657.348083
3 448.149658 476.144562 493.171112 0.000000 570.303467 544.324219 526.313660 528.305481
4 595.218079 623.213013 640.239563 0.000000 423.235046 397.255768 379.245209 381.237061
5 732.276978 760.271912 777.298462 0.000000 286.176147 260.196869 242.186310 244.178146
6 845.361023 873.355957 890.382507 0.000000 173.092072 147.112808 129.102234 131.094086
7 44.049477 72.044388 89.070938 36.525833 1019.450256 993.471008 975.460449 977.452271
8 141.102234 169.097153 186.123703 85.052216 922.397522 896.418213 878.407654 880.399536
9 256.129181 284.124084 301.150635 142.565689 807.370544 781.391296 763.380737 765.372559
10 385.171783 413.166687 430.193237 207.086990 678.327942 652.348694 634.338135 636.329956
11 532.240173 560.235107 577.261658 280.621185 531.259521 505.280273 487.269714 489.261566
12 663.280701 691.275574 708.302124 346.141418 400.219055 374.239807 356.229218 358.221069
13 777.323608 805.318542 822.345093 403.162903 286.176147 260.196869 242.186310 244.178146
14 890.407654 918.402588 935.429138 459.704926 173.092072 147.112808 129.102234 131.094086
15 140.010727 168.005630 185.032181 84.506454 2281.977783 2255.998535 2237.988037 2239.979980
16 326.090027 354.084961 371.111511 177.546112 2095.898438 2069.919189 2051.908691 2053.900635
17 441.116974 469.111877 486.138428 235.059586 1980.871582 1954.892334 1936.881714 1938.873657
18 570.159546 598.154480 615.181030 299.580872 1851.828979 1825.849731 1807.839111 1809.831055
19 717.227966 745.222900 762.249451 373.115082 1704.760620 1678.781372 1660.770752 1662.762573
20 864.263367 892.258301 909.284851 446.632782 1557.725220 1531.745972 1513.735352 1515.727173
21 978.306335 1006.301208 1023.327759 503.654266 1443.682251 1417.703003 1399.692383 1401.684326
22 1079.354004 1107.348877 1124.375488 554.178101 1342.634521 1316.655273 1298.644775 1300.636597
23 1192.438110 1220.432983 1237.459473 610.720093 1229.550537 1203.571289 1185.560669 1187.552490
24 1348.539185 1376.534058 1393.560669 688.770691 1073.449463 1047.470093 1029.459595 1031.451416
25 1419.576294 1447.571167 1464.597778 724.289246 1002.412292 976.433044 958.422485 960.414307
26 1490.613403 1518.608276 1535.634888 759.807800 931.375183 905.395935 887.385376 889.377197
27 1561.650513 1589.645386 1606.671997 795.326355 860.338074 834.358826 816.348267 818.340088
28 1632.687622 1660.682495 1677.709106 830.844910 789.300964 763.321716 745.311096 747.302979
29 1760.782593 1788.777466 1805.804077 894.892395 661.205994 635.226746 617.216187 619.208008
30 1875.809570 1903.804443 1920.830933 952.405884 546.179016 520.199768 502.189209 504.181061
31 1990.836426 2018.831421 2035.857910 1009.919312 431.152100 405.172852 387.162262 389.154114
32 2105.863525 2133.858398 2150.884766 1067.432861 316.125153 290.145905 272.135345 274.127167
33 2220.890381 2248.885254 2265.911865 1124.946289 201.098221 175.118958 157.108383 159.100235

After create_fragment_mz_dataframe(), two columns, frag_start_idx and frag_stop_idx, will be appended to the peptide dataframe. These two values give the row range of each peptide's fragments in the fragment dataframe: rows frag_start_idx up to, but not including, frag_stop_idx.

[5]:
df[[
    'sequence','mods','mod_sites','charge','nAA',
    'precursor_mz','frag_start_idx','frag_stop_idx'
]]
[5]:
               sequence                   mods  mod_sites  charge  nAA  precursor_mz  frag_start_idx  frag_stop_idx
0              ACDEFHIK      Carbamidomethyl@C          2       1    8   1019.461492               0              7
1             APDEFMNIK                                         2    9    532.757692               7             15
2  SWDEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M        1;6       3   20    808.337166              15             34
[6]:
start, stop = df[['frag_start_idx', 'frag_stop_idx']].values[0]  # first peptide
frag_mz_df.iloc[start:stop]
[6]:
a_z1 b_z1 c_z1 b_z2 x_z1 y_z1 y_H2O_z1 z_z1
0 44.049477 72.044388 89.070938 0.0 974.403625 948.424377 930.413818 932.405640
1 204.080124 232.075043 249.101593 0.0 814.372986 788.393738 770.383179 772.375000
2 319.107056 347.101990 364.128540 0.0 699.346069 673.366760 655.356201 657.348083
3 448.149658 476.144562 493.171112 0.0 570.303467 544.324219 526.313660 528.305481
4 595.218079 623.213013 640.239563 0.0 423.235046 397.255768 379.245209 381.237061
5 732.276978 760.271912 777.298462 0.0 286.176147 260.196869 242.186310 244.178146
6 845.361023 873.355957 890.382507 0.0 173.092072 147.112808 129.102234 131.094086
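
The index values in this example follow a simple pattern: a peptide with nAA residues occupies nAA - 1 consecutive rows, and the ranges are stacked in peptide order. The sketch below reproduces the three start/stop pairs from the nAA column alone. It describes the layout seen in this output and is not a guarantee about how AlphaBase assigns indices in general.

```python
from itertools import accumulate

nAA = [8, 9, 20]  # from the peptide dataframe above

# A peptide with n residues has n - 1 cleavage positions, hence n - 1 rows.
n_frag_rows = [n - 1 for n in nAA]

# Stop indices are the running totals; each start is the previous stop.
stops = list(accumulate(n_frag_rows))
starts = [0] + stops[:-1]

for start, stop in zip(starts, stops):
    print(start, stop)
```

Because the stop index is exclusive, frag_mz_df.iloc[start:stop] returns exactly the rows of one peptide, as in the cell above.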

Note that all N-term (a/b/c) fragment mz values are in ascending order, e.g. from b[1] to b[n-1]; and all C-term (x/y/z) fragments are in descending order, e.g. from y[n-1] to y[1].
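
The ordering is easier to see with a worked example. The sketch below computes singly charged b and y ions for the first peptide from prefix sums of residue masses, listed in the same row order as the fragment dataframe. It is an independent illustration using hand-entered monoisotopic masses and a hypothetical function name, not AlphaBase's implementation.

```python
# Monoisotopic residue masses in Da (only the residues of ACDEFHIK).
RESIDUE_MASS = {
    'A': 71.037114, 'C': 103.009185, 'D': 115.026943, 'E': 129.042593,
    'F': 147.068414, 'H': 137.058912, 'I': 113.084064, 'K': 128.094963,
}
MASS_H2O = 18.010565
MASS_PROTON = 1.007276


def by_series(sequence, mod_mass_by_site):
    """Singly charged b and y m/z values, one pair per cleavage position.

    mod_mass_by_site maps a 1-based residue position to a mass shift
    (assumed convention, matching the example rows).
    """
    masses = [
        RESIDUE_MASS[aa] + mod_mass_by_site.get(pos, 0.0)
        for pos, aa in enumerate(sequence, start=1)
    ]
    total = sum(masses)
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_ions.append(prefix + MASS_PROTON)                     # b[1] .. b[n-1]
        y_ions.append(total - prefix + MASS_H2O + MASS_PROTON)  # y[n-1] .. y[1]
    return b_ions, y_ions


# Carbamidomethyl (+57.021464) on residue 2 of ACDEFHIK.
b_ions, y_ions = by_series('ACDEFHIK', {2: 57.021464})
for b, y in zip(b_ions, y_ions):
    print(f"{b:.4f}  {y:.4f}")
```

Each row pairs the N-terminal and C-terminal fragments from the same cleavage position, so b grows down the rows while y shrinks. The values match the b_z1 and y_z1 columns shown earlier to within rounding of the input masses.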

All dataframe functionalities use the low-level APIs of AlphaBase; see tutorial_dev_basic_definations.ipynb or Tutorial for Dev: Basic Definations.

The spectral library functionalities provide higher-level APIs that encapsulate these dataframe functionalities; see tutorial_dev_spectral_libraries.ipynb or Tutorial for Dev: Spectral Libraries.