Tutorial: Peptide and Fragment DataFrames

AlphaBase uses pandas DataFrames, a tabular data structure, to represent peptides and fragments. DataFrames are easy for humans to read and efficient for machines to read and write. See tutorial_basic_definitions.ipynb for an introduction to basic concepts and tutorial_spectral_libraries.ipynb for an introduction to spectral libraries.

[Figure: peptide-fragment-dataframe]

Peptide DataFrame

The peptide dataframe must contain four columns:

sequence for amino acid sequence (str);
mods for modification names (str, separated by ;);
mod_sites for modification sites (str, separated by ;);
charge for precursor charge states (int).

Other columns, such as precursor_mz, can be added to the dataframe as needed; AlphaBase provides functions to calculate, for example, the precursor_mz and isotope columns.
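As a minimal sketch of how the mods and mod_sites strings line up, the helper below checks for the four required columns and splits the two ;-separated strings into (name, site) pairs. The function pair_mods_with_sites and the variable names are hypothetical and not part of AlphaBase; only pandas is assumed.

```python
import pandas as pd

REQUIRED_COLUMNS = ("sequence", "mods", "mod_sites", "charge")

def pair_mods_with_sites(df: pd.DataFrame) -> list:
    """Hypothetical helper: one list of (mod_name, site) tuples per peptide row."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    pairs = []
    for mods, sites in zip(df["mods"], df["mod_sites"]):
        # An empty string means the peptide carries no modification.
        names = mods.split(";") if mods else []
        positions = [int(site) for site in sites.split(";")] if sites else []
        if len(names) != len(positions):
            raise ValueError("mods and mod_sites must have the same number of entries")
        pairs.append(list(zip(names, positions)))
    return pairs

example_df = pd.DataFrame({
    "sequence": ["ACDEFHIK", "APDEFMNIK", "WDSEFMNTIRAAAAKDDDDR"],
    "mods": ["Carbamidomethyl@C", "", "Phospho@S;Oxidation@M"],
    "mod_sites": ["2", "", "3;6"],
    "charge": [1, 2, 3],
})
mod_pairs = pair_mods_with_sites(example_df)
```

Each entry of mod_pairs corresponds to one row of the dataframe, so the i-th modification name is always read together with the i-th site.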

[13]:

import pandas as pd
import numpy as np

peptide_df = pd.DataFrame({
    'sequence': ['ACDEFHIK', 'APDEFMNIK', 'WDSEFMNTIRAAAAKDDDDR'],
    'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
    'mod_sites': ['2', '', '3;6'],
    'charge': [1,2,3],
})
peptide_df

[13]:

	sequence	mods	mod_sites	charge
0	ACDEFHIK	Carbamidomethyl@C	2	1
1	APDEFMNIK			2
2	WDSEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	3;6	3

Fragment DataFrame

The fragments are also organized in a dataframe structure. The column names of the dataframe represent the fragment type, using the schema Type[_LossType]_zn, where:

Type is one of b, y, c, z, a, or x;
_LossType is optional and can be _modloss, _H2O, or _NH3;
n is the charge state, for example 1.
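To make the naming schema concrete, here is a small sketch that splits a column name into its three parts with a regular expression. The pattern and the function parse_frag_column are illustrative assumptions written for this tutorial, not an AlphaBase API.

```python
import re

# Type, optional loss type, and charge, following Type[_LossType]_zn.
FRAG_COLUMN_PATTERN = re.compile(
    r"^(?P<type>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$"
)

def parse_frag_column(name: str) -> dict:
    """Hypothetical helper: split a fragment column name into its parts."""
    match = FRAG_COLUMN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid fragment column name: {name!r}")
    return {
        "type": match["type"],
        "loss": match["loss"],  # None when the column has no loss type
        "charge": int(match["charge"]),
    }

parsed = [parse_frag_column(col) for col in ("b_z1", "y_H2O_z1", "y_modloss_z2")]
```

With these inputs, parsed[1] describes a y ion with a water loss at charge 1, and parsed[0] has no loss type.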

Here are some examples:

[14]:

from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
    peptide_df,
    charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','y_modloss_z1','z_z1']
)
frag_mz_df

[14]:

	a_z1	b_z1	c_z1	b_z2	x_z1	y_z1	y_H2O_z1	y_modloss_z1	z_z1
0	44.049477	72.044388	89.070938	0.000000	974.403625	948.424377	930.413818	0.000000	932.405640
1	204.080124	232.075043	249.101593	0.000000	814.372986	788.393738	770.383179	0.000000	772.375000
2	319.107056	347.101990	364.128540	0.000000	699.346069	673.366760	655.356201	0.000000	657.348083
3	448.149658	476.144562	493.171112	0.000000	570.303467	544.324219	526.313660	0.000000	528.305481
4	595.218079	623.213013	640.239563	0.000000	423.235046	397.255768	379.245209	0.000000	381.237061
5	732.276978	760.271912	777.298462	0.000000	286.176147	260.196869	242.186310	0.000000	244.178146
6	845.361023	873.355957	890.382507	0.000000	173.092072	147.112808	129.102234	0.000000	131.094086
7	44.049477	72.044388	89.070938	36.525833	1019.450256	993.471008	975.460449	0.000000	977.452271
8	141.102234	169.097153	186.123703	85.052216	922.397522	896.418213	878.407654	0.000000	880.399536
9	256.129181	284.124084	301.150635	142.565689	807.370544	781.391296	763.380737	0.000000	765.372559
10	385.171783	413.166687	430.193237	207.086990	678.327942	652.348694	634.338135	0.000000	636.329956
11	532.240173	560.235107	577.261658	280.621185	531.259521	505.280273	487.269714	0.000000	489.261566
12	663.280701	691.275574	708.302124	346.141418	400.219055	374.239807	356.229218	0.000000	358.221069
13	777.323608	805.318542	822.345093	403.162903	286.176147	260.196869	242.186310	0.000000	244.178146
14	890.407654	918.402588	935.429138	459.704926	173.092072	147.112808	129.102234	0.000000	131.094086
15	159.091675	187.086594	204.113144	94.046936	2262.896973	2236.917725	2218.906982	2138.940674	2220.898926
16	274.118622	302.113525	319.140076	151.560410	2147.869873	2121.890625	2103.880127	2023.913818	2105.872070
17	441.116974	469.111877	486.138428	235.059586	1980.871582	1954.892334	1936.881714	0.000000	1938.873657
18	570.159546	598.154480	615.181030	299.580872	1851.828979	1825.849731	1807.839111	0.000000	1809.831055
19	717.227966	745.222900	762.249451	373.115082	1704.760620	1678.781372	1660.770752	0.000000	1662.762573
20	864.263367	892.258301	909.284851	446.632782	1557.725220	1531.745972	1513.735352	0.000000	1515.727173
21	978.306335	1006.301208	1023.327759	503.654266	1443.682251	1417.703003	1399.692383	0.000000	1401.684326
22	1079.354004	1107.348877	1124.375488	554.178101	1342.634521	1316.655273	1298.644775	0.000000	1300.636597
23	1192.438110	1220.432983	1237.459473	610.720093	1229.550537	1203.571289	1185.560669	0.000000	1187.552490
24	1348.539185	1376.534058	1393.560669	688.770691	1073.449463	1047.470093	1029.459595	0.000000	1031.451416
25	1419.576294	1447.571167	1464.597778	724.289246	1002.412292	976.433044	958.422485	0.000000	960.414307
26	1490.613403	1518.608276	1535.634888	759.807800	931.375183	905.395935	887.385376	0.000000	889.377197
27	1561.650513	1589.645386	1606.671997	795.326355	860.338074	834.358826	816.348267	0.000000	818.340088
28	1632.687622	1660.682495	1677.709106	830.844910	789.300964	763.321716	745.311096	0.000000	747.302979
29	1760.782593	1788.777466	1805.804077	894.892395	661.205994	635.226746	617.216187	0.000000	619.208008
30	1875.809570	1903.804443	1920.830933	952.405884	546.179016	520.199768	502.189209	0.000000	504.181061
31	1990.836426	2018.831421	2035.857910	1009.919312	431.152100	405.172852	387.162262	0.000000	389.154114
32	2105.863525	2133.858398	2150.884766	1067.432861	316.125153	290.145905	272.135345	0.000000	274.127167
33	2220.890381	2248.885254	2265.911865	1124.946289	201.098221	175.118958	157.108383	0.000000	159.100235

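Note that, within each peptide, all N-terminal (a/b/c) fragment m/z values are in ascending order, from b[1] to b[n-1], and all C-terminal (x/y/z) fragment m/z values are in descending order, from y[n-1] to y[1]. An m/z of 0 marks a fragment that is not generated, such as b_z2 for the singly charged first peptide.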
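The row ordering can be sketched with plain numpy: each row of a peptide's block corresponds to one cleavage position, so the b ion and y ion numbers in the same row add up to the peptide length n. The function ion_numbers is a hypothetical illustration, not an AlphaBase function.

```python
import numpy as np

def ion_numbers(n_aa: int):
    """Hypothetical helper: b and y ion numbers for the n_aa - 1 fragment rows."""
    position = np.arange(n_aa - 1)   # row offset within the peptide's block
    b_numbers = position + 1         # b[1] ... b[n-1], ascending
    y_numbers = n_aa - 1 - position  # y[n-1] ... y[1], descending
    return b_numbers, y_numbers

# ACDEFHIK has 8 residues, hence 7 fragment rows.
b_numbers, y_numbers = ion_numbers(8)
```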

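The fragment dataframe is connected to the peptide (precursor) dataframe by the frag_start_idx and frag_stop_idx columns of the peptide dataframe; the create_fragment_mz_dataframe call above added them to peptide_df, together with nAA. For each peptide, these two values delimit the rows that hold its fragments in the fragment dataframe (frag_stop_idx is exclusive), as shown in the figure.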

[15]:

peptide_df

[15]:

	sequence	mods	mod_sites	charge	nAA	frag_start_idx	frag_stop_idx
0	ACDEFHIK	Carbamidomethyl@C	2	1	8	0	7
1	APDEFMNIK			2	9	7	15
2	WDSEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	3;6	3	20	15	34
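As a quick sanity check on these index columns, the sketch below copies the values from the table above into a small dataframe (precursor_index_df is a name introduced here for illustration) and confirms that each peptide owns nAA - 1 fragment rows and that the blocks are contiguous.

```python
import pandas as pd

# Values copied from the peptide dataframe shown above.
precursor_index_df = pd.DataFrame({
    "nAA": [8, 9, 20],
    "frag_start_idx": [0, 7, 15],
    "frag_stop_idx": [7, 15, 34],
})

# Number of fragment rows per peptide (the stop index is exclusive).
rows_per_peptide = (
    precursor_index_df["frag_stop_idx"] - precursor_index_df["frag_start_idx"]
)

# Each block starts where the previous one stops.
contiguous = (
    precursor_index_df["frag_start_idx"].values[1:]
    == precursor_index_df["frag_stop_idx"].values[:-1]
).all()
```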

[16]:

selected_peptide_index = -1 # last peptide
start = peptide_df['frag_start_idx'].values[selected_peptide_index]
stop = peptide_df['frag_stop_idx'].values[selected_peptide_index]
frag_mz_df.iloc[start:stop]

[16]:

	a_z1	b_z1	c_z1	b_z2	x_z1	y_z1	y_H2O_z1	y_modloss_z1	z_z1
15	159.091675	187.086594	204.113144	94.046936	2262.896973	2236.917725	2218.906982	2138.940674	2220.898926
16	274.118622	302.113525	319.140076	151.560410	2147.869873	2121.890625	2103.880127	2023.913818	2105.872070
17	441.116974	469.111877	486.138428	235.059586	1980.871582	1954.892334	1936.881714	0.000000	1938.873657
18	570.159546	598.154480	615.181030	299.580872	1851.828979	1825.849731	1807.839111	0.000000	1809.831055
19	717.227966	745.222900	762.249451	373.115082	1704.760620	1678.781372	1660.770752	0.000000	1662.762573
20	864.263367	892.258301	909.284851	446.632782	1557.725220	1531.745972	1513.735352	0.000000	1515.727173
21	978.306335	1006.301208	1023.327759	503.654266	1443.682251	1417.703003	1399.692383	0.000000	1401.684326
22	1079.354004	1107.348877	1124.375488	554.178101	1342.634521	1316.655273	1298.644775	0.000000	1300.636597
23	1192.438110	1220.432983	1237.459473	610.720093	1229.550537	1203.571289	1185.560669	0.000000	1187.552490
24	1348.539185	1376.534058	1393.560669	688.770691	1073.449463	1047.470093	1029.459595	0.000000	1031.451416
25	1419.576294	1447.571167	1464.597778	724.289246	1002.412292	976.433044	958.422485	0.000000	960.414307
26	1490.613403	1518.608276	1535.634888	759.807800	931.375183	905.395935	887.385376	0.000000	889.377197
27	1561.650513	1589.645386	1606.671997	795.326355	860.338074	834.358826	816.348267	0.000000	818.340088
28	1632.687622	1660.682495	1677.709106	830.844910	789.300964	763.321716	745.311096	0.000000	747.302979
29	1760.782593	1788.777466	1805.804077	894.892395	661.205994	635.226746	617.216187	0.000000	619.208008
30	1875.809570	1903.804443	1920.830933	952.405884	546.179016	520.199768	502.189209	0.000000	504.181061
31	1990.836426	2018.831421	2035.857910	1009.919312	431.152100	405.172852	387.162262	0.000000	389.154114
32	2105.863525	2133.858398	2150.884766	1067.432861	316.125153	290.145905	272.135345	0.000000	274.127167
33	2220.890381	2248.885254	2265.911865	1124.946289	201.098221	175.118958	157.108383	0.000000	159.100235

Working with several fragment dataframes (e.g., m/z and intensity dataframes) can be inconvenient, especially when operating on subsets of the dataframes. AlphaBase therefore also provides a flattened fragment dataframe structure that stores all fragment information.

[17]:

from alphabase.peptide.fragment import flatten_fragments

dummy_frag_intensity_df = pd.DataFrame(
    np.zeros_like(frag_mz_df.values),
    columns=frag_mz_df.columns
)

precursor_df, flat_frag_df = flatten_fragments(
    precursor_df=peptide_df,
    fragment_mz_df=frag_mz_df,
    fragment_intensity_df=dummy_frag_intensity_df
)

[18]:

precursor_df

[18]:

	sequence	mods	mod_sites	charge	nAA	frag_start_idx	frag_stop_idx	flat_frag_start_idx	flat_frag_stop_idx
0	ACDEFHIK	Carbamidomethyl@C	2	1	8	0	7	0	49
1	APDEFMNIK			2	9	7	15	49	113
2	WDSEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	3;6	3	20	15	34	113	267

[19]:

flat_frag_df

[19]:

	mz	intensity	type	loss_type	charge	number	position
0	44.049477	0.0	97	0	1	1	0
1	72.044388	0.0	98	0	1	1	0
2	89.070938	0.0	99	0	1	1	0
3	974.403625	0.0	120	0	1	7	0
4	948.424377	0.0	121	0	1	7	0
...	...	...	...	...	...	...	...
262	1124.946289	0.0	98	0	2	19	18
263	201.098221	0.0	120	0	1	1	18
264	175.118958	0.0	121	0	1	1	18
265	157.108383	0.0	121	18	1	1	18
266	159.100235	0.0	122	0	1	1	18

267 rows × 7 columns

The flattened fragment dataframe contains the mz, intensity, type, loss_type, charge, number, and position columns; other columns can be added as needed. All columns are numeric so that they can be processed efficiently with numpy and numba. For instance, type is the ASCII code of the ion type: a=97, b=98, c=99, x=120, y=121, and z=122. Losses are also encoded as numbers: water loss becomes 18 and phospho loss becomes 98.
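Because type holds ASCII codes, Python's built-in chr() and ord() convert between codes and ion letters. The sketch below uses a small hand-built dataframe (flat_example_df, with a few rows copied from the output above) to decode the type column and to select y ions with a water loss; it does not call AlphaBase.

```python
import pandas as pd

# A few rows copied from the flattened fragment dataframe shown above.
flat_example_df = pd.DataFrame({
    "mz": [44.049477, 72.044388, 974.403625, 157.108383],
    "type": [97, 98, 120, 121],
    "loss_type": [0, 0, 0, 18],
    "charge": [1, 1, 1, 1],
    "number": [1, 1, 7, 1],
})

# chr() reverses the ASCII encoding of the ion type.
flat_example_df["type_letter"] = flat_example_df["type"].map(chr)

# Numeric columns make filtering a simple boolean mask.
is_y = flat_example_df["type"] == ord("y")
is_water_loss = flat_example_df["loss_type"] == 18
y_water_loss = flat_example_df[is_y & is_water_loss]
```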

Similar to frag_start_idx and frag_stop_idx, the flat_frag_start_idx and flat_frag_stop_idx columns keep the connection between the precursor dataframe and the flattened fragment dataframe.
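The output above suggests that flattening keeps only non-zero m/z entries: the first peptide has 7 fragment rows with 7 non-zero columns each, which matches its flat_frag_stop_idx of 49. The following numpy sketch illustrates that idea on a toy matrix. It is an assumption-based illustration, not AlphaBase's implementation, and flatten_nonzero is a hypothetical name.

```python
import numpy as np

def flatten_nonzero(frag_mz, frag_start, frag_stop):
    """Hypothetical sketch: drop zero entries and recompute per-peptide ranges."""
    nonzero = frag_mz != 0
    # Boolean indexing returns the kept values in row-major order.
    flat_mz = frag_mz[nonzero]
    # Count the kept entries inside each peptide's block of rows.
    counts = np.array(
        [nonzero[start:stop].sum() for start, stop in zip(frag_start, frag_stop)]
    )
    flat_stop = np.cumsum(counts)
    flat_start = flat_stop - counts
    return flat_mz, flat_start, flat_stop

# Toy data: columns stand for b_z1, b_z2, y_z1; peptide 0 has no b_z2 values.
toy_mz = np.array([
    [72.0, 0.0, 948.4],   # peptide 0, row 0
    [232.1, 0.0, 788.4],  # peptide 0, row 1
    [72.0, 36.5, 993.5],  # peptide 1, row 0
])
flat_mz, flat_start, flat_stop = flatten_nonzero(
    toy_mz, np.array([0, 2]), np.array([2, 3])
)

# Fragments of the second peptide in the flattened array.
second_peptide_mz = flat_mz[flat_start[1]:flat_stop[1]]
```

The start and stop arrays play the same role as flat_frag_start_idx and flat_frag_stop_idx: slicing the flattened array with them recovers one peptide's fragments.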
