Tutorial for Dev: Peptide and Fragment DataFrames

This notebook introduces the peptide and fragment DataFrame functionalities for developers.

Peptide DataFrame

A peptide dataframe must contain four columns: sequence for the amino acid sequence (str), mods for modification names (str), mod_sites for modification sites (str), and charge for the precursor charge state (int).

We can easily build a peptide dataframe:

[1]:

import pandas as pd

df = pd.DataFrame({
    'sequence': ['ACDEFHIK', 'APDEFMNIK', 'SWDEFMNTIRAAAAKDDDDR'],
    'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
    'mod_sites': ['2', '', '1;6'],
    'charge': [1,2,3],
})
df

[1]:

	sequence	mods	mod_sites	charge
0	ACDEFHIK	Carbamidomethyl@C	2	1
1	APDEFMNIK			2
2	SWDEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	1;6	3
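The mods and mod_sites strings pair up element by element. The sketch below makes that pairing explicit under the conventions visible in the example above: entries are separated by `;`, names look like `Name@Residue`, and a site is a 1-based position in the sequence. `pair_mods` is a hypothetical helper written for illustration, not part of AlphaBase, and it does not handle terminal modifications.

```python
def pair_mods(sequence, mods, mod_sites):
    """Pair each modification name with its site and check the residue.

    Hypothetical helper (not an AlphaBase function). Assumes ';'-separated
    entries, 'Name@Residue' names, and 1-based sites within the sequence.
    """
    if not mods:
        if mod_sites:
            raise ValueError("mod_sites given without mods")
        return []
    names = mods.split(";")
    sites = [int(site) for site in mod_sites.split(";")]
    if len(names) != len(sites):
        raise ValueError("mods and mod_sites differ in length")
    for name, site in zip(names, sites):
        residue = name.split("@")[1]
        if not 1 <= site <= len(sequence):
            raise ValueError(f"site {site} is outside the sequence")
        if sequence[site - 1] != residue:
            raise ValueError(f"{name} does not match the residue at site {site}")
    return list(zip(names, sites))


pairs = pair_mods("SWDEFMNTIRAAAAKDDDDR", "Phospho@S;Oxidation@M", "1;6")
print(pairs)
```

A check like this can catch rows where the two columns have drifted out of sync before they reach the m/z calculations.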

Calculate precursor_mz and isotopes from peptide dataframe

alphabase.peptide.precursor.update_precursor_mz() calculates precursor_mz for each peptide. As the output below shows, it also adds an nAA column holding the sequence length.

[2]:

from alphabase.peptide.precursor import update_precursor_mz

update_precursor_mz(df)

[2]:

	sequence	mods	mod_sites	charge	nAA	precursor_mz
0	ACDEFHIK	Carbamidomethyl@C	2	1	8	1019.461492
1	APDEFMNIK			2	9	532.757692
2	SWDEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	1;6	3	20	808.337166
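To see where these numbers come from, the following sketch recomputes them by hand: sum the monoisotopic residue masses, add one water and any modification masses, divide by the charge, and add one proton mass. It uses standard monoisotopic masses and does not call AlphaBase; `precursor_mz` and the two lookup tables are illustrative names, and only the three modifications from this notebook are included.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
# Monoisotopic mass shifts of the modifications used in this notebook.
MOD_MASS = {
    "Carbamidomethyl@C": 57.021464,
    "Oxidation@M": 15.994915,
    "Phospho@S": 79.966331,
}
H2O = 18.010565
PROTON = 1.007276


def precursor_mz(sequence, mods, charge):
    """Illustrative recomputation of precursor m/z (not the AlphaBase code)."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + H2O
    if mods:
        mass += sum(MOD_MASS[name] for name in mods.split(";"))
    return mass / charge + PROTON


for seq, mods, charge in [
    ("ACDEFHIK", "Carbamidomethyl@C", 1),
    ("APDEFMNIK", "", 2),
    ("SWDEFMNTIRAAAAKDDDDR", "Phospho@S;Oxidation@M", 3),
]:
    print(seq, round(precursor_mz(seq, mods, charge), 4))
```

The results should agree with the precursor_mz column above to within rounding of the mass constants.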

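alphabase.peptide.precursor.calc_precursor_isotope() calculates the precursor isotope distribution for each peptide. It adds i_* columns (i_0 to i_5 in the output below) to the peptide dataframe, holding the relative intensities of the isotope peaks.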

[3]:

from alphabase.peptide.precursor import calc_precursor_isotope

calc_precursor_isotope(df)

[3]:

	sequence	mods	mod_sites	charge	nAA	precursor_mz	i_0	i_1	i_2	i_3	i_4	i_5
0	ACDEFHIK	Carbamidomethyl@C	2	1	8	1019.461492	0.544890	0.294208	0.116900	0.034340	0.008077	0.001584
1	APDEFMNIK			2	9	532.757692	0.527839	0.300826	0.123018	0.037359	0.009104	0.001854
2	SWDEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	1;6	3	20	808.337166	0.271028	0.323775	0.225641	0.115441	0.047553	0.016561
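A quick way to read this output is to check two things per row: the total of the six i_* values and which isotope peak is the most intense. The sketch below does this with the values copied from the table above, so it runs without AlphaBase; the variable names are illustrative.

```python
# i_0..i_5 values copied from the output table above.
isotopes = {
    "ACDEFHIK": [0.544890, 0.294208, 0.116900, 0.034340, 0.008077, 0.001584],
    "APDEFMNIK": [0.527839, 0.300826, 0.123018, 0.037359, 0.009104, 0.001854],
    "SWDEFMNTIRAAAAKDDDDR": [0.271028, 0.323775, 0.225641, 0.115441, 0.047553, 0.016561],
}

summary = {}
for sequence, intensities in isotopes.items():
    total = sum(intensities)
    # Index of the most intense isotope peak (0 = monoisotopic).
    apex = max(range(len(intensities)), key=intensities.__getitem__)
    summary[sequence] = (round(total, 3), apex)
    print(f"{sequence}: sum={total:.3f}, most intense peak=i_{apex}")
```

In this example each row sums to about 1, and for the 20-residue peptide the i_1 peak is more intense than the monoisotopic i_0 peak, which is typical for larger peptides.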

Computing isotope patterns for millions of peptides is very time-consuming, so AlphaBase also provides calc_precursor_isotope_mp, which uses multiprocessing.

Fragment DataFrame

alphabase.peptide.fragment.create_fragment_mz_dataframe() is the only function needed to calculate the fragment m/z dataframe. It has two key parameters:

precursor_df (pd.DataFrame): the peptide or precursor dataframe.
charged_frag_types (list of str): the charged fragment types to include as columns of the fragment dataframe. The schema is Type[_LossType]_z[n], where
- Type is the ion type, e.g. a, b, c, x, y or z.
- _LossType is an optional neutral loss: _modloss, _H2O or _NH3.
- z[n] is the fragment charge state n. If the precursor charge is less than n, the corresponding m/z values are set to zero.
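The naming schema can be made concrete with a small parser. This is a minimal sketch, assuming the ion types used in this notebook (a, b, c, x, y, z) and the three loss types listed above; `parse_charged_frag_type` is a hypothetical helper, not an AlphaBase function.

```python
import re

# Type[_LossType]_z[n], e.g. 'b_z1' or 'y_H2O_z1'.
_FRAG_TYPE = re.compile(
    r"^(?P<ion>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$"
)


def parse_charged_frag_type(name):
    """Split a charged fragment type into (ion type, loss type or None, charge)."""
    match = _FRAG_TYPE.match(name)
    if match is None:
        raise ValueError(f"not a valid charged fragment type: {name!r}")
    return match["ion"], match["loss"], int(match["charge"])


for name in ["b_z1", "b_z2", "y_H2O_z1", "y_modloss_z2"]:
    print(name, parse_charged_frag_type(name))
```

Validating the column names up front gives a clearer error than a failure deep inside the fragment calculation.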

[4]:

from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
    df,
    charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','z_z1']
)
frag_mz_df

[4]:

	a_z1	b_z1	c_z1	b_z2	x_z1	y_z1	y_H2O_z1	z_z1
0	44.049477	72.044388	89.070938	0.000000	974.403625	948.424377	930.413818	932.405640
1	204.080124	232.075043	249.101593	0.000000	814.372986	788.393738	770.383179	772.375000
2	319.107056	347.101990	364.128540	0.000000	699.346069	673.366760	655.356201	657.348083
3	448.149658	476.144562	493.171112	0.000000	570.303467	544.324219	526.313660	528.305481
4	595.218079	623.213013	640.239563	0.000000	423.235046	397.255768	379.245209	381.237061
5	732.276978	760.271912	777.298462	0.000000	286.176147	260.196869	242.186310	244.178146
6	845.361023	873.355957	890.382507	0.000000	173.092072	147.112808	129.102234	131.094086
7	44.049477	72.044388	89.070938	36.525833	1019.450256	993.471008	975.460449	977.452271
8	141.102234	169.097153	186.123703	85.052216	922.397522	896.418213	878.407654	880.399536
9	256.129181	284.124084	301.150635	142.565689	807.370544	781.391296	763.380737	765.372559
10	385.171783	413.166687	430.193237	207.086990	678.327942	652.348694	634.338135	636.329956
11	532.240173	560.235107	577.261658	280.621185	531.259521	505.280273	487.269714	489.261566
12	663.280701	691.275574	708.302124	346.141418	400.219055	374.239807	356.229218	358.221069
13	777.323608	805.318542	822.345093	403.162903	286.176147	260.196869	242.186310	244.178146
14	890.407654	918.402588	935.429138	459.704926	173.092072	147.112808	129.102234	131.094086
15	140.010727	168.005630	185.032181	84.506454	2281.977783	2255.998535	2237.988037	2239.979980
16	326.090027	354.084961	371.111511	177.546112	2095.898438	2069.919189	2051.908691	2053.900635
17	441.116974	469.111877	486.138428	235.059586	1980.871582	1954.892334	1936.881714	1938.873657
18	570.159546	598.154480	615.181030	299.580872	1851.828979	1825.849731	1807.839111	1809.831055
19	717.227966	745.222900	762.249451	373.115082	1704.760620	1678.781372	1660.770752	1662.762573
20	864.263367	892.258301	909.284851	446.632782	1557.725220	1531.745972	1513.735352	1515.727173
21	978.306335	1006.301208	1023.327759	503.654266	1443.682251	1417.703003	1399.692383	1401.684326
22	1079.354004	1107.348877	1124.375488	554.178101	1342.634521	1316.655273	1298.644775	1300.636597
23	1192.438110	1220.432983	1237.459473	610.720093	1229.550537	1203.571289	1185.560669	1187.552490
24	1348.539185	1376.534058	1393.560669	688.770691	1073.449463	1047.470093	1029.459595	1031.451416
25	1419.576294	1447.571167	1464.597778	724.289246	1002.412292	976.433044	958.422485	960.414307
26	1490.613403	1518.608276	1535.634888	759.807800	931.375183	905.395935	887.385376	889.377197
27	1561.650513	1589.645386	1606.671997	795.326355	860.338074	834.358826	816.348267	818.340088
28	1632.687622	1660.682495	1677.709106	830.844910	789.300964	763.321716	745.311096	747.302979
29	1760.782593	1788.777466	1805.804077	894.892395	661.205994	635.226746	617.216187	619.208008
30	1875.809570	1903.804443	1920.830933	952.405884	546.179016	520.199768	502.189209	504.181061
31	1990.836426	2018.831421	2035.857910	1009.919312	431.152100	405.172852	387.162262	389.154114
32	2105.863525	2133.858398	2150.884766	1067.432861	316.125153	290.145905	272.135345	274.127167
33	2220.890381	2248.885254	2265.911865	1124.946289	201.098221	175.118958	157.108383	159.100235

After create_fragment_mz_dataframe() is called, two columns, frag_start_idx and frag_stop_idx, are appended to the peptide dataframe. For each peptide, they give the start (inclusive) and stop (exclusive) row positions of its fragments in the fragment dataframe.

[5]:

df[[
    'sequence','mods','mod_sites','charge','nAA',
    'precursor_mz','frag_start_idx','frag_stop_idx'
]]

[5]:

	sequence	mods	mod_sites	charge	nAA	precursor_mz	frag_start_idx	frag_stop_idx
0	ACDEFHIK	Carbamidomethyl@C	2	1	8	1019.461492	0	7
1	APDEFMNIK			2	9	532.757692	7	15
2	SWDEFMNTIRAAAAKDDDDR	Phospho@S;Oxidation@M	1;6	3	20	808.337166	15	34
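In this output each peptide occupies nAA - 1 consecutive rows of the fragment dataframe, one per cleavage position, so the index columns are running totals of those counts. The sketch below reproduces the values above from the nAA column alone; `fragment_ranges` is an illustrative name and this is not how AlphaBase is necessarily implemented.

```python
from itertools import accumulate


def fragment_ranges(n_aa_list):
    """Return (start, stop) row ranges, assuming nAA - 1 fragment rows per peptide."""
    counts = [n_aa - 1 for n_aa in n_aa_list]
    stops = list(accumulate(counts))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


print(fragment_ranges([8, 9, 20]))
```

Because stop is exclusive, the stop of one peptide equals the start of the next, and the last stop equals the number of rows in the fragment dataframe.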

[6]:

start,stop = df[['frag_start_idx','frag_stop_idx']].values[0] #first peptide
frag_mz_df.iloc[start:stop]

[6]:

	a_z1	b_z1	c_z1	b_z2	x_z1	y_z1	y_H2O_z1	z_z1
0	44.049477	72.044388	89.070938	0.000000	974.403625	948.424377	930.413818	932.405640
1	204.080124	232.075043	249.101593	0.000000	814.372986	788.393738	770.383179	772.375000
2	319.107056	347.101990	364.128540	0.000000	699.346069	673.366760	655.356201	657.348083
3	448.149658	476.144562	493.171112	0.000000	570.303467	544.324219	526.313660	528.305481
4	595.218079	623.213013	640.239563	0.000000	423.235046	397.255768	379.245209	381.237061
5	732.276978	760.271912	777.298462	0.000000	286.176147	260.196869	242.186310	244.178146
6	845.361023	873.355957	890.382507	0.000000	173.092072	147.112808	129.102234	131.094086

Note that all N-terminal (a/b/c) fragment m/z values are in ascending order, e.g. from b[1] to b[n-1], and all C-terminal (x/y/z) fragment m/z values are in descending order, e.g. from y[n-1] to y[1].
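The row layout can be illustrated by recomputing the singly charged b and y ions by hand. In the sketch below, row i pairs the b ion of the first i + 1 residues with the y ion of the remaining residues, which yields the ascending and descending orders described above. It uses standard monoisotopic masses and does not call AlphaBase; `by_series` and its `mod_mass_by_site` argument (1-based site to mass shift) are illustrative assumptions.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
H2O = 18.010565
PROTON = 1.007276


def by_series(sequence, mod_mass_by_site=None):
    """Return one (b_z1, y_z1) pair per cleavage position, in row order."""
    mod_mass_by_site = mod_mass_by_site or {}
    masses = [
        RESIDUE_MASS[aa] + mod_mass_by_site.get(site, 0.0)
        for site, aa in enumerate(sequence, start=1)
    ]
    total = sum(masses)
    rows = []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_mz = prefix + PROTON                # N-terminal part plus a proton
        y_mz = total - prefix + H2O + PROTON  # C-terminal part plus water and a proton
        rows.append((b_mz, y_mz))
    return rows


# First peptide: ACDEFHIK with Carbamidomethyl (+57.021464) on site 2.
rows = by_series("ACDEFHIK", {2: 57.021464})
for b_mz, y_mz in rows:
    print(f"{b_mz:.4f}\t{y_mz:.4f}")
```

The printed pairs should match the b_z1 and y_z1 columns of the first peptide to within rounding of the mass constants.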

All dataframe functionalities use the low-level APIs of AlphaBase; see tutorial_dev_basic_definations.ipynb or Tutorial for Dev: Basic Definations.

Spectral library functionalities provide higher-level APIs that encapsulate these dataframe functionalities; see tutorial_dev_spectral_libraries.ipynb or Tutorial for Dev: Spectral Libraries.