SpecLibFasta usage

[1]:

%reload_ext autoreload
%autoreload 2

[2]:

from alphabase.protein.fasta import SpecLibFasta

Proteins from a dict (or loaded from fasta files)

[3]:

prot1 = 'MABCDESTKAFGHIJKLMNOPQRAFGHIJK'
prot2 = 'AFGHIJKLMNOPQR'
protein_dict = {
    'xx': {
        'protein_id': 'xx',
        'gene_name': '',
        'sequence': prot1
    },
    'yy': {
        'protein_id': 'yy',
        'gene_name': 'gene',
        'sequence': prot2
    }
}

alphabase.protein.fasta.SpecLibFasta.get_peptides_from_protein_dict digests a protein dict into a peptide dataframe.

alphabase.protein.fasta.SpecLibFasta.get_peptides_from_fasta digests a fasta file or a list of fasta files into a peptide dataframe.
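For readers who start from fasta text rather than a dict, here is a minimal standard-library sketch of how such text could be turned into the same dict layout used above. parse_fasta_text is a hypothetical helper, not part of alphabase, and the header parsing assumes UniProt-style headers (>sp|ID|ENTRY ... GN=GENE); other header formats would need different rules.

```python
import re


def parse_fasta_text(text):
    """Hypothetical helper: parse fasta text into the protein_dict layout."""
    proteins = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            header = line[1:]
            tokens = header.split()
            first_token = tokens[0] if tokens else ''
            # Assumption: UniProt-style "sp|ID|ENTRY"; otherwise use the first token.
            parts = first_token.split('|')
            protein_id = parts[1] if len(parts) >= 3 else first_token
            gene = re.search(r'GN=(\S+)', header)
            current = {
                'protein_id': protein_id,
                'gene_name': gene.group(1) if gene else '',
                'sequence': '',
            }
            proteins[protein_id] = current
        elif current is not None:
            # Sequence lines may be wrapped, so concatenate them.
            current['sequence'] += line
    return proteins


example = """>sp|xx|XX_TEST first protein
MABCDESTKAFGHIJK
LMNOPQRAFGHIJK
>sp|yy|YY_TEST second protein GN=gene
AFGHIJKLMNOPQR
"""
protein_dict_from_fasta = parse_fasta_text(example)
```

The resulting dict has the same keys (protein_id, gene_name, sequence) as protein_dict above, so it could be passed to get_peptides_from_protein_dict in the same way.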

[4]:

fasta_lib = SpecLibFasta(
    ['b_z1','y_z1'], I_to_L=False, decoy='pseudo_reverse',
    var_mods=['Acetyl@Protein N-term', 'Oxidation@M'],
    fix_mods=['Carbamidomethyl@C'],
)
# fasta_lib.get_peptides_from_fasta(fasta_files)
fasta_lib.get_peptides_from_protein_dict(protein_dict)
fasta_lib.precursor_df

[4]:

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	nAA
0	AFGHIJK	0;1	0	True	True	7
1	LMNOPQR	0;1	0	False	True	7
2	ABCDESTK	0	0	True	False	8
3	MABCDESTK	0	0	True	False	9
4	AFGHIJKLMNOPQR	0;1	1	True	True	14
5	LMNOPQRAFGHIJK	0	1	False	True	14
6	ABCDESTKAFGHIJK	0	1	True	False	15
7	MABCDESTKAFGHIJK	0	1	True	False	16
8	AFGHIJKLMNOPQRAFGHIJK	0	2	False	True	21
9	ABCDESTKAFGHIJKLMNOPQR	0	2	True	False	22
10	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	23
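To make the table above easier to read, the following pure-Python sketch reproduces the same set of peptide sequences. It is not alphabase's implementation. It assumes cleavage after every K or R, at most 2 missed cleavages, a length window of 7 to 35 residues, and optional removal of a leading M from peptides at the protein N-terminus; these settings were chosen because they match the output shown here, not because they are confirmed defaults.

```python
def cleave(sequence):
    """Split a protein after every K or R (assumed trypsin-like rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in 'KR':
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    return pieces


def digest(sequence, max_missed=2, min_len=7, max_len=35):
    """Return the set of peptides with up to max_missed missed cleavages."""
    pieces = cleave(sequence)
    peptides = set()
    for i in range(len(pieces)):
        for n_missed in range(max_missed + 1):
            if i + n_missed >= len(pieces):
                break
            pep = ''.join(pieces[i:i + n_missed + 1])
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
            # Assumption: a leading M at the protein N-terminus may be removed.
            if i == 0 and pep.startswith('M') and min_len <= len(pep) - 1 <= max_len:
                peptides.add(pep[1:])
    return peptides


prot1 = 'MABCDESTKAFGHIJKLMNOPQRAFGHIJK'
prot2 = 'AFGHIJKLMNOPQR'
all_peptides = digest(prot1) | digest(prot2)
```

Under these assumptions all_peptides contains the 11 sequences listed in the table. Peptides shared by both proteins, such as AFGHIJK, appear once, which is why the dataframe records them with protein_idxes 0;1.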

[5]:

fasta_lib.protein_df

[5]:

	protein_id	gene_name	sequence
0	xx		MABCDESTKAFGHIJKLMNOPQRAFGHIJK
1	yy	gene	AFGHIJKLMNOPQR

We can also append the protein and gene names to precursor_df.

[6]:

fasta_lib.append_protein_name()
fasta_lib.precursor_df

[6]:

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	nAA	proteins	genes
0	AFGHIJK	0;1	0	True	True	7	xx;yy	gene
1	LMNOPQR	0;1	0	False	True	7	xx;yy	gene
2	ABCDESTK	0	0	True	False	8	xx
3	MABCDESTK	0	0	True	False	9	xx
4	AFGHIJKLMNOPQR	0;1	1	True	True	14	xx;yy	gene
5	LMNOPQRAFGHIJK	0	1	False	True	14	xx
6	ABCDESTKAFGHIJK	0	1	True	False	15	xx
7	MABCDESTKAFGHIJK	0	1	True	False	16	xx
8	AFGHIJKLMNOPQRAFGHIJK	0	2	False	True	21	xx
9	ABCDESTKAFGHIJKLMNOPQR	0	2	True	False	22	xx
10	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	23	xx
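The proteins and genes columns can be understood as a lookup of each index in protein_idxes against protein_df. The sketch below illustrates that mapping in plain Python. idxes_to_names is a hypothetical helper, and skipping empty names is an assumption based on the genes column above, where xx has no gene name.

```python
def idxes_to_names(protein_idxes, names):
    """Map a ';'-joined index string to ';'-joined names, skipping empty names."""
    picked = [names[int(i)] for i in protein_idxes.split(';')]
    return ';'.join(name for name in picked if name)


protein_ids = ['xx', 'yy']   # protein_df['protein_id']
gene_names = ['', 'gene']    # protein_df['gene_name']

proteins = idxes_to_names('0;1', protein_ids)
genes = idxes_to_names('0;1', gene_names)
```

For the shared peptide AFGHIJK this yields xx;yy for proteins and gene for genes, as in the first row of the table.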

If we have our own precursor_df loaded by psm_readers, we can directly assign it to fasta_lib.

fasta_lib._precursor_df = precursor_df

This way, the SpecLibFasta functionality remains available for this precursor_df.

Add modifications, including both var_mods (Acetyl@Protein N-term and Oxidation@M, see the initialization of fasta_lib) and fix_mods (Carbamidomethyl@C), to the precursor_df.

[7]:

fasta_lib.add_modifications()
fasta_lib.precursor_df[['sequence','mods','mod_sites']]

[7]:

	sequence	mods	mod_sites
0	AFGHIJK
1	AFGHIJK	Acetyl@Protein N-term	0
2	LMNOPQR	Oxidation@M	2
3	LMNOPQR
4	ABCDESTK	Carbamidomethyl@C	3
5	ABCDESTK	Acetyl@Protein N-term;Carbamidomethyl@C	0;3
6	MABCDESTK	Oxidation@M;Carbamidomethyl@C	1;4
7	MABCDESTK	Carbamidomethyl@C	4
8	MABCDESTK	Acetyl@Protein N-term;Oxidation@M;Carbamidomet...	0;1;4
9	MABCDESTK	Acetyl@Protein N-term;Carbamidomethyl@C	0;4
10	AFGHIJKLMNOPQR	Oxidation@M	9
11	AFGHIJKLMNOPQR
12	AFGHIJKLMNOPQR	Acetyl@Protein N-term;Oxidation@M	0;9
13	AFGHIJKLMNOPQR	Acetyl@Protein N-term	0
14	LMNOPQRAFGHIJK	Oxidation@M	2
15	LMNOPQRAFGHIJK
16	ABCDESTKAFGHIJK	Carbamidomethyl@C	3
17	ABCDESTKAFGHIJK	Acetyl@Protein N-term;Carbamidomethyl@C	0;3
18	MABCDESTKAFGHIJK	Oxidation@M;Carbamidomethyl@C	1;4
19	MABCDESTKAFGHIJK	Carbamidomethyl@C	4
20	MABCDESTKAFGHIJK	Acetyl@Protein N-term;Oxidation@M;Carbamidomet...	0;1;4
21	MABCDESTKAFGHIJK	Acetyl@Protein N-term;Carbamidomethyl@C	0;4
22	AFGHIJKLMNOPQRAFGHIJK	Oxidation@M	9
23	AFGHIJKLMNOPQRAFGHIJK
24	ABCDESTKAFGHIJKLMNOPQR	Oxidation@M;Carbamidomethyl@C	17;3
25	ABCDESTKAFGHIJKLMNOPQR	Carbamidomethyl@C	3
26	ABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Oxidation@M;Carbamidomet...	0;17;3
27	ABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C	0;3
28	MABCDESTKAFGHIJKLMNOPQR	Oxidation@M;Carbamidomethyl@C	1;4
29	MABCDESTKAFGHIJKLMNOPQR	Oxidation@M;Carbamidomethyl@C	18;4
30	MABCDESTKAFGHIJKLMNOPQR	Oxidation@M;Oxidation@M;Carbamidomethyl@C	1;18;4
31	MABCDESTKAFGHIJKLMNOPQR	Carbamidomethyl@C	4
32	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Oxidation@M;Carbamidomet...	0;1;4
33	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Oxidation@M;Carbamidomet...	0;18;4
34	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...	0;1;18;4
35	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C	0;4
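The table above follows a simple pattern: the fixed modification is always present, while each variable modification is either present or absent. The sketch below enumerates those combinations for one peptide. It is an illustration only: peptidoforms is a hypothetical helper, the three modifications are hard-coded, and no limit on the number of variable modifications is applied. Sites follow the convention visible in the table, with 0 for the N-terminus and 1-based residue positions.

```python
from itertools import combinations


def peptidoforms(sequence, is_prot_nterm):
    """Enumerate (mods, mod_sites) strings for one peptide."""
    fixed = [('Carbamidomethyl@C', i) for i, aa in enumerate(sequence, 1) if aa == 'C']
    ox_sites = [i for i, aa in enumerate(sequence, 1) if aa == 'M']
    # Acetylation is only an option for peptides at the protein N-terminus.
    nterm_options = [[]]
    if is_prot_nterm:
        nterm_options.append([('Acetyl@Protein N-term', 0)])

    forms = []
    for nterm in nterm_options:
        for k in range(len(ox_sites) + 1):
            for combo in combinations(ox_sites, k):
                mods = nterm + [('Oxidation@M', s) for s in combo] + fixed
                forms.append((
                    ';'.join(name for name, _ in mods),
                    ';'.join(str(site) for _, site in mods),
                ))
    return forms


short_forms = peptidoforms('MABCDESTK', is_prot_nterm=True)
long_forms = peptidoforms('MABCDESTKAFGHIJKLMNOPQR', is_prot_nterm=True)
```

MABCDESTK yields 4 peptidoforms and MABCDESTKAFGHIJKLMNOPQR yields 8, matching the row counts for these sequences in the table. The row order within each peptide may differ from alphabase's output.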

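alphabase.protein.fasta.SpecLibFasta.add_additional_modifications is designed specifically for modifications such as Phospho, which may generate thousands of peptidoforms for a peptide with multiple candidate sites. The cell below calls the module-level function append_special_modifications, which bounds the number of these modifications per peptide (min_mod_num, max_mod_num) and the number of peptidoforms per peptide (max_peptidoform_num).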

[8]:

from alphabase.protein.fasta import append_special_modifications
fasta_lib._precursor_df = append_special_modifications(
    fasta_lib.precursor_df, ['Phospho@S','Phospho@T'],
    min_mod_num=0, max_mod_num=1, max_peptidoform_num=100
)
fasta_lib.precursor_df

[8]:

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	mods	mod_sites	nAA	proteins	genes
0	AFGHIJK	0;1	0	True	True			7	xx;yy	gene
1	AFGHIJK	0;1	0	True	True	Acetyl@Protein N-term	0	7	xx;yy	gene
2	LMNOPQR	0;1	0	False	True	Oxidation@M	2	7	xx;yy	gene
3	LMNOPQR	0;1	0	False	True			7	xx;yy	gene
4	ABCDESTK	0	0	True	False	Carbamidomethyl@C;Phospho@S	3;6	8	xx
...	...	...	...	...	...	...	...	...	...	...
79	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...	0;1;18;4;8	23	xx
80	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...	0;1;18;4	23	xx
81	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Carbamidomethyl@C;Phospho@S	0;4;7	23	xx
82	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Carbamidomethyl@C;Phospho@T	0;4;8	23	xx
83	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Carbamidomethyl@C	0;4	23	xx

84 rows × 10 columns
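The reason for the caps is combinatorial: a peptide with n candidate sites and up to k modifications has C(n,0) + C(n,1) + ... + C(n,k) site combinations. The sketch below shows one way such a capped enumeration could work. special_mod_combos is a hypothetical helper that reuses the parameter names from the call above; it simply keeps the first combinations it generates, and alphabase may select peptidoforms differently when the cap is reached.

```python
from itertools import combinations, islice


def special_mod_combos(sequence, mods=('Phospho@S', 'Phospho@T'),
                       min_mod_num=0, max_mod_num=1, max_peptidoform_num=100):
    """List site combinations for the given mods, capped at max_peptidoform_num."""
    targets = {mod.split('@')[1]: mod for mod in mods}
    # Candidate sites as (modification name, 1-based residue position).
    sites = [(targets[aa], i) for i, aa in enumerate(sequence, 1) if aa in targets]

    def generate():
        for k in range(min_mod_num, max_mod_num + 1):
            yield from combinations(sites, k)

    return list(islice(generate(), max_peptidoform_num))


combos = special_mod_combos('ABCDESTK')
capped = special_mod_combos('STSTSTST', max_mod_num=3, max_peptidoform_num=10)
```

For ABCDESTK this gives three combinations: unmodified, Phospho@S at site 6, and Phospho@T at site 7. That factor of three for every S/T-containing peptidoform, and one for the others, is consistent with the growth from 36 to 84 rows shown above.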

Flexible method to add peptide labeling

[9]:

fasta_lib.add_peptide_labeling({
    '': [], # not labelled for reference
    '0': ['Dimethyl@Any N-term','Dimethyl@K'],
    '8': ['Dimethyl:2H(6)13C(2)@Any N-term','Dimethyl:2H(6)13C(2)@K'],
})
fasta_lib.precursor_df

[9]:

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	mods	mod_sites	nAA	proteins	genes	labeling_channel
0	AFGHIJK	0;1	0	True	True			7	xx;yy	gene
1	AFGHIJK	0;1	0	True	True	Acetyl@Protein N-term	0	7	xx;yy	gene
2	LMNOPQR	0;1	0	False	True	Oxidation@M	2	7	xx;yy	gene
3	LMNOPQR	0;1	0	False	True			7	xx;yy	gene
4	ABCDESTK	0	0	True	False	Carbamidomethyl@C;Phospho@S	3;6	8	xx
...	...	...	...	...	...	...	...	...	...	...	...
247	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...	0;1;18;4;8;9;16	23	xx		8
248	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...	0;1;18;4;9;16	23	xx		8
249	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...	0;4;7;9;16	23	xx		8
250	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...	0;4;8;9;16	23	xx		8
251	MABCDESTKAFGHIJKLMNOPQR	0	2	True	False	Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...	0;4;9;16	23	xx		8

252 rows × 11 columns
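Each labeling channel produces one copy of every peptidoform, which is why the 84 rows become 252 with three channels. The sketch below illustrates how the label modifications of one channel could be placed on a peptidoform. add_labels is a hypothetical helper, and skipping sites that already carry a modification is an assumption inferred from the last row above, where the acetylated N-terminus keeps site 0 and only K9 and K16 gain a Dimethyl label.

```python
def add_labels(sequence, mods, label_mods):
    """Append label mods as (name, site) pairs to an existing list of mods."""
    occupied = {site for _, site in mods}
    labeled = list(mods)
    for label in label_mods:
        target = label.split('@')[1]
        if target == 'Any N-term':
            # Site 0 denotes the peptide N-terminus.
            if 0 not in occupied:
                labeled.append((label, 0))
        else:
            for i, aa in enumerate(sequence, 1):
                if aa == target and i not in occupied:
                    labeled.append((label, i))
    return labeled


channel_0 = ['Dimethyl@Any N-term', 'Dimethyl@K']
acetylated = add_labels(
    'MABCDESTKAFGHIJKLMNOPQR',
    [('Acetyl@Protein N-term', 0), ('Carbamidomethyl@C', 4)],
    channel_0,
)
unmodified = add_labels('AFGHIJK', [], channel_0)
```

The empty channel '' in the call above corresponds to label_mods=[], which leaves the peptidoform unchanged and serves as the unlabelled reference.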

[10]:

fasta_lib.add_charge()
fasta_lib.precursor_df[['sequence','mods','mod_sites','charge']]

[10]:

	sequence	mods	mod_sites	charge
0	AFGHIJK			2
1	AFGHIJK			3
2	AFGHIJK			4
3	AFGHIJK	Acetyl@Protein N-term	0	2
4	AFGHIJK	Acetyl@Protein N-term	0	3
...	...	...	...	...
751	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...	0;4;8;9;16	3
752	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...	0;4;8;9;16	4
753	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...	0;4;9;16	2
754	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...	0;4;9;16	3
755	MABCDESTKAFGHIJKLMNOPQR	Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...	0;4;9;16	4

756 rows × 4 columns

Append precursor mz and isotope information

[11]:

fasta_lib.calc_precursor_mz()
fasta_lib.calc_precursor_isotope()
fasta_lib.precursor_df[['precursor_mz']+[col for col in fasta_lib.precursor_df.columns if col.startswith('i_')]]

/Users/wenfengzeng/workspace/alphabase/alphabase/peptide/precursor.py:613: RuntimeWarning: invalid value encountered in divide
  precursor_dist /= np.sum(precursor_dist, axis=1, keepdims=True)

[11]:

	precursor_mz	i_0	i_1	i_2	i_3	i_4	i_5
0	3.932371e+02	0.625822	0.285918	0.072883	0.013411	0.001966	0.0
1	2.624938e+02	0.625822	0.285918	0.072883	0.013411	0.001966	0.0
2	1.971222e+02	0.625822	0.285918	0.072883	0.013411	0.001966	0.0
3	4.142423e+02	0.610921	0.292699	0.078690	0.015312	0.002378	0.0
4	2.764973e+02	0.610921	0.292699	0.078690	0.015312	0.002378	0.0
...	...	...	...	...	...	...	...
751	4.000960e+06	NaN	NaN	NaN	NaN	NaN	NaN
752	3.000720e+06	NaN	NaN	NaN	NaN	NaN	NaN
753	6.001400e+06	NaN	NaN	NaN	NaN	NaN	NaN
754	4.000934e+06	NaN	NaN	NaN	NaN	NaN	NaN
755	3.000700e+06	NaN	NaN	NaN	NaN	NaN	NaN

756 rows × 7 columns
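The first three rows belong to the same peptidoform at charges 2, 3 and 4, so they share one neutral mass M and differ only through m/z = (M + z * m_p) / z, where m_p is the proton mass. The sketch below checks this relation using the charge-2 value from the table. The proton mass of 1.007276 Da is the standard value; the helper names are hypothetical.

```python
PROTON_MASS = 1.007276  # Da


def neutral_mass(mz, charge):
    """Recover the neutral mass from an m/z value and its charge."""
    return charge * (mz - PROTON_MASS)


def precursor_mz(mass, charge):
    """Compute m/z of a protonated precursor from its neutral mass."""
    return mass / charge + PROTON_MASS


# Charge-2 m/z of unmodified AFGHIJK, taken from row 0 of the table.
mass = neutral_mass(393.2371, 2)
mz_by_charge = {z: precursor_mz(mass, z) for z in (2, 3, 4)}
```

The values for charges 3 and 4 agree with rows 1 and 2 of the table to within rounding.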

Use alphabase.spectral_library.base.SpecLibBase.calc_fragment_mz_df to calculate the fragment m/z dataframe.

[12]:

fasta_lib.calc_fragment_mz_df()
fasta_lib.fragment_mz_df

[12]:

	b_z1	y_z1
0	7.204439e+01	714.429749
1	2.191128e+02	567.361328
2	2.761343e+02	510.339844
3	4.131932e+02	373.280945
4	5.262772e+02	260.196869
...	...	...
11911	1.200205e+07	751.420959
11912	1.200216e+07	637.377991
11913	1.200240e+07	400.230286
11914	1.200250e+07	303.177521
11915	1.200262e+07	175.118958

11916 rows × 2 columns

calc_fragment_mz_df() also generates the pointers frag_start_idx and frag_stop_idx in the precursor_df, which locate the fragments of each precursor in the fragment dataframe.

[13]:

fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']]

[13]:

	frag_start_idx	frag_stop_idx
0	0	6
1	6	12
2	12	18
3	18	24
4	24	30
...	...	...
751	11806	11828
752	11828	11850
753	11850	11872
754	11872	11894
755	11894	11916

756 rows × 2 columns
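The pointers follow from the peptide lengths: a peptide with nAA residues has nAA - 1 fragmentation positions, and the rows of all precursors are stored back to back. The sketch below derives start and stop indices from a list of lengths. frag_pointers is a hypothetical helper for illustration, not the alphabase routine.

```python
from itertools import accumulate


def frag_pointers(n_aas):
    """Return (starts, stops) for fragment rows stored contiguously."""
    stops = list(accumulate(n - 1 for n in n_aas))
    starts = [0] + stops[:-1]
    return starts, stops


# Three precursors of the 7-residue peptide AFGHIJK, then one 23-residue peptide.
starts, stops = frag_pointers([7, 7, 7, 23])
```

Each 7-residue precursor occupies 6 rows (0 to 6, 6 to 12, 12 to 18), and a 23-residue precursor occupies 22 rows, as in the last rows of the table. The stop index is exclusive, so the rows can be sliced with iloc[start:stop].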

Note that all fragment ions are stored from the peptide's N-terminus to its C-terminus, so the b-ions are in ascending order (from b1 to bn-1) and the y-ions are in descending order (from yn-1 to y1).

[14]:

start, end = fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']].values[1]
fasta_lib.fragment_mz_df.iloc[start:end,:]

[14]:

	b_z1	y_z1
6	72.044388	714.429749
7	219.112808	567.361328
8	276.134277	510.339844
9	413.193176	373.280945
10	526.277222	260.196869
11	639.361328	147.112808
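The values in this slice can be reproduced by hand, which also makes the row ordering concrete. The sketch below computes singly charged b- and y-ions for AFGHIJK, one row per fragmentation position from the N-terminus to the C-terminus. The residue masses are standard monoisotopic values; giving J the leucine/isoleucine mass is an assumption that happens to reproduce the numbers shown. by_ions is a hypothetical helper.

```python
RESIDUE_MASS = {
    'A': 71.037114, 'F': 147.068414, 'G': 57.021464, 'H': 137.058912,
    'I': 113.084064, 'K': 128.094963,
    'J': 113.084064,  # assumption: J treated like I/L
}
WATER_MASS = 18.010565
PROTON_MASS = 1.007276


def by_ions(sequence):
    """Return (b, y) m/z lists at charge 1, ordered N-terminus to C-terminus."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses)
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        # Row i pairs b_(i+1) with the complementary y-ion of the same cleavage.
        b_ions.append(prefix + PROTON_MASS)
        y_ions.append(total - prefix + WATER_MASS + PROTON_MASS)
    return b_ions, y_ions


b_ions, y_ions = by_ions('AFGHIJK')
```

The first row pairs b1 (about 72.044) with y6 (about 714.430) and the last row pairs b6 with y1 (about 147.113), so each row describes the two fragments produced by one cleavage position.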

Save protein_df, precursor_df, fragment_mz_df, and fragment_intensity_df into an HDF file.

[15]:

# fasta_lib.save_hdf('path/to/hdf_file.hdf')
