Tutorial: Peptide and Fragment DataFrames

AlphaBase uses pandas dataframes, a tabular data structure, to represent peptides and fragments. Dataframes are easy for humans to read and efficient for machines to read and write. See tutorial_basic_definitions.ipynb for an introduction to basic concepts and tutorial_spectral_libraries.ipynb for an introduction to spectral libraries.

[Figure: peptide-fragment-dataframe]

Peptide DataFrame

The peptide dataframe must contain four columns:

  • sequence for amino acid sequence (str);

  • mods for modification names (str, separated by ;);

  • mod_sites for modification sites (str, separated by ;);

  • charge for precursor charge states (int).

Other columns, such as precursor_mz, can be added to the dataframe as needed; AlphaBase provides functions to calculate, for example, the precursor_mz and isotope columns.

[13]:
import pandas as pd
import numpy as np

peptide_df = pd.DataFrame({
    'sequence': ['ACDEFHIK', 'APDEFMNIK', 'WDSEFMNTIRAAAAKDDDDR'],
    'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
    'mod_sites': ['2', '', '3;6'],
    'charge': [1,2,3],
})
peptide_df
[13]:
   sequence              mods                   mod_sites  charge
0  ACDEFHIK              Carbamidomethyl@C      2          1
1  APDEFMNIK                                               2
2  WDSEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M  3;6        3
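
To make the relationship between the mods and mod_sites columns concrete, here is a minimal sketch that pairs each modification name with its site. The helper name parse_modifications is hypothetical and not part of AlphaBase; the sketch only assumes that both strings are separated by ; and have the same number of entries, as in the example above.

```python
def parse_modifications(mods: str, mod_sites: str) -> list:
    """Pair each modification name with its site (hypothetical helper).

    Both inputs are ';'-separated strings, as in the peptide dataframe.
    """
    names = mods.split(";") if mods else []
    sites = [int(s) for s in mod_sites.split(";")] if mod_sites else []
    if len(names) != len(sites):
        raise ValueError(
            f"{len(names)} modification(s) but {len(sites)} site(s)"
        )
    return list(zip(names, sites))


# Third peptide of the example dataframe
sequence = "WDSEFMNTIRAAAAKDDDDR"
pairs = parse_modifications("Phospho@S;Oxidation@M", "3;6")
for name, site in pairs:
    # In this example the sites are 1-based positions in the sequence
    print(name, site, sequence[site - 1])
```

In the example peptides, each site points at the residue named after the @ sign (S at position 3 and M at position 6), and an unmodified peptide simply has two empty strings.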

Fragment DataFrame

The fragments are also organized in a dataframe structure. The column names of the dataframe represent the fragment type, using the schema Type[_LossType]_zn, where:

  • Type can be b, y, c, z, a, or x;

  • _LossType is optional and can be _modloss, _H2O, or _NH3;

  • n is the charge state, for example 1.
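
The naming schema above can be checked mechanically. The following sketch parses a column name into its three parts with a regular expression; parse_frag_column and FragType are hypothetical names used for illustration only and are not AlphaBase functions.

```python
import re
from typing import NamedTuple, Optional

# Type[_LossType]_zn, restricted to the types and losses listed above
FRAG_COLUMN = re.compile(
    r"^(?P<ion>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$"
)


class FragType(NamedTuple):
    ion: str
    loss: Optional[str]  # None when the column has no loss type
    charge: int


def parse_frag_column(name: str) -> FragType:
    """Split a fragment column name into ion type, loss type, and charge."""
    match = FRAG_COLUMN.match(name)
    if match is None:
        raise ValueError(f"not a valid fragment column name: {name!r}")
    return FragType(match["ion"], match["loss"], int(match["charge"]))


for column in ["b_z1", "b_z2", "y_H2O_z1", "y_modloss_z1"]:
    print(column, parse_frag_column(column))
```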

Here are some examples:

[14]:
from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
    peptide_df,
    charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','y_modloss_z1','z_z1']
)
frag_mz_df
[14]:
a_z1 b_z1 c_z1 b_z2 x_z1 y_z1 y_H2O_z1 y_modloss_z1 z_z1
0 44.049477 72.044388 89.070938 0.000000 974.403625 948.424377 930.413818 0.000000 932.405640
1 204.080124 232.075043 249.101593 0.000000 814.372986 788.393738 770.383179 0.000000 772.375000
2 319.107056 347.101990 364.128540 0.000000 699.346069 673.366760 655.356201 0.000000 657.348083
3 448.149658 476.144562 493.171112 0.000000 570.303467 544.324219 526.313660 0.000000 528.305481
4 595.218079 623.213013 640.239563 0.000000 423.235046 397.255768 379.245209 0.000000 381.237061
5 732.276978 760.271912 777.298462 0.000000 286.176147 260.196869 242.186310 0.000000 244.178146
6 845.361023 873.355957 890.382507 0.000000 173.092072 147.112808 129.102234 0.000000 131.094086
7 44.049477 72.044388 89.070938 36.525833 1019.450256 993.471008 975.460449 0.000000 977.452271
8 141.102234 169.097153 186.123703 85.052216 922.397522 896.418213 878.407654 0.000000 880.399536
9 256.129181 284.124084 301.150635 142.565689 807.370544 781.391296 763.380737 0.000000 765.372559
10 385.171783 413.166687 430.193237 207.086990 678.327942 652.348694 634.338135 0.000000 636.329956
11 532.240173 560.235107 577.261658 280.621185 531.259521 505.280273 487.269714 0.000000 489.261566
12 663.280701 691.275574 708.302124 346.141418 400.219055 374.239807 356.229218 0.000000 358.221069
13 777.323608 805.318542 822.345093 403.162903 286.176147 260.196869 242.186310 0.000000 244.178146
14 890.407654 918.402588 935.429138 459.704926 173.092072 147.112808 129.102234 0.000000 131.094086
15 159.091675 187.086594 204.113144 94.046936 2262.896973 2236.917725 2218.906982 2138.940674 2220.898926
16 274.118622 302.113525 319.140076 151.560410 2147.869873 2121.890625 2103.880127 2023.913818 2105.872070
17 441.116974 469.111877 486.138428 235.059586 1980.871582 1954.892334 1936.881714 0.000000 1938.873657
18 570.159546 598.154480 615.181030 299.580872 1851.828979 1825.849731 1807.839111 0.000000 1809.831055
19 717.227966 745.222900 762.249451 373.115082 1704.760620 1678.781372 1660.770752 0.000000 1662.762573
20 864.263367 892.258301 909.284851 446.632782 1557.725220 1531.745972 1513.735352 0.000000 1515.727173
21 978.306335 1006.301208 1023.327759 503.654266 1443.682251 1417.703003 1399.692383 0.000000 1401.684326
22 1079.354004 1107.348877 1124.375488 554.178101 1342.634521 1316.655273 1298.644775 0.000000 1300.636597
23 1192.438110 1220.432983 1237.459473 610.720093 1229.550537 1203.571289 1185.560669 0.000000 1187.552490
24 1348.539185 1376.534058 1393.560669 688.770691 1073.449463 1047.470093 1029.459595 0.000000 1031.451416
25 1419.576294 1447.571167 1464.597778 724.289246 1002.412292 976.433044 958.422485 0.000000 960.414307
26 1490.613403 1518.608276 1535.634888 759.807800 931.375183 905.395935 887.385376 0.000000 889.377197
27 1561.650513 1589.645386 1606.671997 795.326355 860.338074 834.358826 816.348267 0.000000 818.340088
28 1632.687622 1660.682495 1677.709106 830.844910 789.300964 763.321716 745.311096 0.000000 747.302979
29 1760.782593 1788.777466 1805.804077 894.892395 661.205994 635.226746 617.216187 0.000000 619.208008
30 1875.809570 1903.804443 1920.830933 952.405884 546.179016 520.199768 502.189209 0.000000 504.181061
31 1990.836426 2018.831421 2035.857910 1009.919312 431.152100 405.172852 387.162262 0.000000 389.154114
32 2105.863525 2133.858398 2150.884766 1067.432861 316.125153 290.145905 272.135345 0.000000 274.127167
33 2220.890381 2248.885254 2265.911865 1124.946289 201.098221 175.118958 157.108383 0.000000 159.100235
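
The columns in the table above are related by simple mass arithmetic. As a sanity check, the sketch below recomputes one b_z2 value from the corresponding b_z1 value and one y_H2O_z1 value from y_z1. The constants are the standard proton mass and the monoisotopic mass of water; the function name mz_at_charge is hypothetical, and this is not how AlphaBase computes the table internally.

```python
PROTON_MASS = 1.00727646688  # Da
H2O_MASS = 18.0105647        # monoisotopic mass of water, Da


def mz_at_charge(mz_z1: float, charge: int) -> float:
    """Convert the m/z of a singly charged ion to another charge state."""
    neutral_mass = mz_z1 - PROTON_MASS
    return (neutral_mass + charge * PROTON_MASS) / charge


# Row 7 of the table: b_z1 = 72.044388, b_z2 = 36.525833
print(round(mz_at_charge(72.044388, 2), 4))  # 36.5258

# Row 0 of the table: y_z1 = 948.424377, y_H2O_z1 = 930.413818
print(round(948.424377 - H2O_MASS, 4))       # 930.4138
```

The small differences in the last digits come from the limited floating-point precision of the values printed in the table.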

Note that all N-terminal (a/b/c) fragment m/z values are in ascending order, e.g. from b[1] to b[n-1], and all C-terminal (x/y/z) fragment m/z values are in descending order, e.g. from y[n-1] to y[1].
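
The ordering described above means that each row of a peptide corresponds to one cleavage position. A short sketch of the resulting ion numbers per row, assuming a peptide of n amino acids with n-1 rows (the function name ion_numbers is hypothetical):

```python
def ion_numbers(n_aa: int) -> list:
    """Return (row position, N-term ion number, C-term ion number) per row."""
    return [(pos, pos + 1, n_aa - 1 - pos) for pos in range(n_aa - 1)]


# First peptide, ACDEFHIK, with 8 amino acids and therefore 7 rows
for pos, n_term, c_term in ion_numbers(8):
    print(f"row {pos}: b{n_term} / y{c_term}")
```

The first row of ACDEFHIK therefore holds b1 together with y7, and the last row holds b7 together with y1.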

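The fragment dataframe is connected to the peptide (precursor) dataframe by the frag_start_idx and frag_stop_idx columns of the peptide dataframe. These two values locate all fragments of a peptide in the fragment dataframe, as shown in the figure. After the call to create_fragment_mz_dataframe above, peptide_df contains these two columns together with nAA, the number of amino acids in each peptide.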

[15]:
peptide_df
[15]:
   sequence              mods                   mod_sites  charge  nAA  frag_start_idx  frag_stop_idx
0  ACDEFHIK              Carbamidomethyl@C      2          1       8    0               7
1  APDEFMNIK                                               2       9    7               15
2  WDSEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M  3;6        3       20   15              34
[16]:
selected_peptide_index = -1  # last peptide
start = peptide_df['frag_start_idx'].values[selected_peptide_index]
stop = peptide_df['frag_stop_idx'].values[selected_peptide_index]
# frag_stop_idx is exclusive, so start:stop selects exactly this peptide's rows
frag_mz_df.iloc[start:stop]
[16]:
a_z1 b_z1 c_z1 b_z2 x_z1 y_z1 y_H2O_z1 y_modloss_z1 z_z1
15 159.091675 187.086594 204.113144 94.046936 2262.896973 2236.917725 2218.906982 2138.940674 2220.898926
16 274.118622 302.113525 319.140076 151.560410 2147.869873 2121.890625 2103.880127 2023.913818 2105.872070
17 441.116974 469.111877 486.138428 235.059586 1980.871582 1954.892334 1936.881714 0.000000 1938.873657
18 570.159546 598.154480 615.181030 299.580872 1851.828979 1825.849731 1807.839111 0.000000 1809.831055
19 717.227966 745.222900 762.249451 373.115082 1704.760620 1678.781372 1660.770752 0.000000 1662.762573
20 864.263367 892.258301 909.284851 446.632782 1557.725220 1531.745972 1513.735352 0.000000 1515.727173
21 978.306335 1006.301208 1023.327759 503.654266 1443.682251 1417.703003 1399.692383 0.000000 1401.684326
22 1079.354004 1107.348877 1124.375488 554.178101 1342.634521 1316.655273 1298.644775 0.000000 1300.636597
23 1192.438110 1220.432983 1237.459473 610.720093 1229.550537 1203.571289 1185.560669 0.000000 1187.552490
24 1348.539185 1376.534058 1393.560669 688.770691 1073.449463 1047.470093 1029.459595 0.000000 1031.451416
25 1419.576294 1447.571167 1464.597778 724.289246 1002.412292 976.433044 958.422485 0.000000 960.414307
26 1490.613403 1518.608276 1535.634888 759.807800 931.375183 905.395935 887.385376 0.000000 889.377197
27 1561.650513 1589.645386 1606.671997 795.326355 860.338074 834.358826 816.348267 0.000000 818.340088
28 1632.687622 1660.682495 1677.709106 830.844910 789.300964 763.321716 745.311096 0.000000 747.302979
29 1760.782593 1788.777466 1805.804077 894.892395 661.205994 635.226746 617.216187 0.000000 619.208008
30 1875.809570 1903.804443 1920.830933 952.405884 546.179016 520.199768 502.189209 0.000000 504.181061
31 1990.836426 2018.831421 2035.857910 1009.919312 431.152100 405.172852 387.162262 0.000000 389.154114
32 2105.863525 2133.858398 2150.884766 1067.432861 316.125153 290.145905 272.135345 0.000000 274.127167
33 2220.890381 2248.885254 2265.911865 1124.946289 201.098221 175.118958 157.108383 0.000000 159.100235
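
The start and stop values in the peptide dataframe above follow directly from the peptide lengths: a peptide with nAA amino acids occupies nAA-1 consecutive rows. The sketch below reproduces the index ranges from the nAA column alone, assuming fragments are stored contiguously in peptide order as in this example. The function name fragment_index_ranges is hypothetical.

```python
from itertools import accumulate


def fragment_index_ranges(n_aa_list: list) -> list:
    """Compute (start, stop) row ranges, with stop exclusive, per peptide."""
    rows_per_peptide = [n_aa - 1 for n_aa in n_aa_list]
    stops = list(accumulate(rows_per_peptide))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


# nAA column of the example peptide dataframe
print(fragment_index_ranges([8, 9, 20]))  # [(0, 7), (7, 15), (15, 34)]
```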

Working with several fragment dataframes (e.g., m/z and intensity dataframes) can be inconvenient, especially when operating on subsets of the dataframes. AlphaBase therefore also provides a flattened fragment dataframe structure that stores all fragment information in a single table.

[17]:
from alphabase.peptide.fragment import flatten_fragments

# Placeholder intensities (all zeros) with the same shape and columns as frag_mz_df
dummy_frag_intensity_df = pd.DataFrame(
    np.zeros_like(frag_mz_df.values),
    columns=frag_mz_df.columns
)

precursor_df, flat_frag_df = flatten_fragments(
    precursor_df=peptide_df,
    fragment_mz_df=frag_mz_df,
    fragment_intensity_df=dummy_frag_intensity_df
)
[18]:
precursor_df
[18]:
   sequence              mods                   mod_sites  charge  nAA  frag_start_idx  frag_stop_idx  flat_frag_start_idx  flat_frag_stop_idx
0  ACDEFHIK              Carbamidomethyl@C      2          1       8    0               7              0                    49
1  APDEFMNIK                                               2       9    7               15             49                   113
2  WDSEFMNTIRAAAAKDDDDR  Phospho@S;Oxidation@M  3;6        3       20   15              34             113                  267
[19]:
flat_frag_df
[19]:
mz intensity type loss_type charge number position
0 44.049477 0.0 97 0 1 1 0
1 72.044388 0.0 98 0 1 1 0
2 89.070938 0.0 99 0 1 1 0
3 974.403625 0.0 120 0 1 7 0
4 948.424377 0.0 121 0 1 7 0
... ... ... ... ... ... ... ...
262 1124.946289 0.0 98 0 2 19 18
263 201.098221 0.0 120 0 1 1 18
264 175.118958 0.0 121 0 1 1 18
265 157.108383 0.0 121 18 1 1 18
266 159.100235 0.0 122 0 1 1 18

267 rows × 7 columns

The flattened fragment dataframe contains the columns mz, intensity, type, loss_type, charge, number, and position; other columns can be added as needed. All columns are numeric so that they can be processed efficiently with NumPy and Numba. For instance, type is the ASCII code of the ion type: a=97, b=98, c=99, x=120, y=121, and z=122. Losses are also encoded as numbers: water loss becomes 18 and phospho loss becomes 98.
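
The numeric encoding can be reversed with plain Python. The sketch below turns the type, loss_type, and charge values of a flat row back into a column name of the wide fragment dataframe. The helper flat_row_to_column_name is hypothetical, and the mapping of code 98 to the _modloss column is an assumption based on the phospho example in this tutorial.

```python
# Loss codes mentioned in the text; 98 -> "_modloss" is an assumption
# based on the phospho loss in this tutorial's example.
LOSS_SUFFIX = {0: "", 18: "_H2O", 98: "_modloss"}


def flat_row_to_column_name(type_code: int, loss_type: int, charge: int) -> str:
    """Rebuild a wide-format column name from the numeric flat columns."""
    ion = chr(type_code)  # ASCII code back to 'a', 'b', 'c', 'x', 'y', or 'z'
    return f"{ion}{LOSS_SUFFIX[loss_type]}_z{charge}"


# Rows 262 and 265 of the flat dataframe shown above
print(flat_row_to_column_name(98, 0, 2))    # b_z2
print(flat_row_to_column_name(121, 18, 1))  # y_H2O_z1
```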

Similar to frag_start_idx and frag_stop_idx, the flat_frag_start_idx and flat_frag_stop_idx columns link the precursor dataframe to the flattened fragment dataframe.
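
To illustrate how the flat index ranges arise, here is a simplified sketch of flattening a wide m/z table into one record per fragment. It is not AlphaBase's flatten_fragments implementation. It assumes that zero entries mark absent fragments and are dropped, which is consistent with the row counts in the output above (for the first peptide, 7 rows times 7 non-zero columns gives 49 flat rows). The function name flatten_wide and the toy values are hypothetical.

```python
def flatten_wide(columns: list, wide_rows: list, ranges: list):
    """Flatten a wide fragment table into records plus flat index ranges.

    columns:   column names of the wide table
    wide_rows: list of rows, each a list of m/z values (0.0 = no fragment)
    ranges:    (start, stop) row ranges per precursor, stop exclusive
    """
    flat, flat_ranges = [], []
    for start, stop in ranges:
        flat_start = len(flat)
        for position, row in enumerate(wide_rows[start:stop]):
            for name, mz in zip(columns, row):
                if mz > 0:  # assumption: zero marks an absent fragment
                    flat.append((name, position, mz))
        flat_ranges.append((flat_start, len(flat)))
    return flat, flat_ranges


columns = ["b_z1", "b_z2", "y_z1"]
wide_rows = [
    [72.04, 0.0, 260.20],   # precursor 0, position 0
    [232.08, 0.0, 147.11],  # precursor 0, position 1
    [72.04, 36.53, 147.11], # precursor 1, position 0
]
flat, flat_ranges = flatten_wide(columns, wide_rows, [(0, 2), (2, 3)])
print(flat_ranges)
```

Each precursor then finds its fragments with flat[start:stop], in the same way that frag_start_idx and frag_stop_idx are used for the wide table.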
