SpecLibFasta usage

[1]:
%reload_ext autoreload
%autoreload 2
[2]:
from alphabase.protein.fasta import SpecLibFasta

Proteins from a dict (or loaded from fasta files)

[3]:
prot1 = 'MABCDESTKAFGHIJKLMNOPQRAFGHIJK'
prot2 = 'AFGHIJKLMNOPQR'
protein_dict = {
    'xx': {
        'protein_id': 'xx',
        'gene_name': '',
        'sequence': prot1
    },
    'yy': {
        'protein_id': 'yy',
        'gene_name': 'gene',
        'sequence': prot2
    }
}

alphabase.protein.fasta.SpecLibFasta.get_peptides_from_protein_dict will digest a protein dict into a peptide dataframe.

alphabase.protein.fasta.SpecLibFasta.get_peptides_from_fasta will digest a fasta file or a fasta list into a peptide dataframe.
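
To make the digestion step concrete, here is a minimal sketch of in-silico tryptic digestion in plain Python. It is not alphabase's implementation: the function name digest and the simplified rule (cleave after every K or R, allow up to two missed cleavages, no special handling of the N-terminal methionine) are assumptions chosen for illustration.

```python
def digest(sequence, max_missed_cleavages=2):
    """Cleave after K or R, then join adjacent pieces for missed cleavages.

    Hypothetical helper for illustration only, not part of alphabase.
    Returns a list of (peptide, number_of_missed_cleavages) tuples.
    """
    # Split the protein after every K or R.
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal remainder, if any

    # Join neighbouring pieces to model missed cleavages.
    peptides = []
    for n_missed in range(max_missed_cleavages + 1):
        for first in range(len(pieces) - n_missed):
            peptide = "".join(pieces[first:first + n_missed + 1])
            peptides.append((peptide, n_missed))
    return peptides


peptides = digest("AFGHIJKLMNOPQR")  # prot2 from the cell above
```

For prot2 this yields AFGHIJK and LMNOPQR with no missed cleavage and AFGHIJKLMNOPQR with one missed cleavage.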

[4]:
fasta_lib = SpecLibFasta(
    ['b_z1','y_z1'], I_to_L=False, decoy='pseudo_reverse',
    var_mods=['Acetyl@Protein N-term', 'Oxidation@M'],
    fix_mods=['Carbamidomethyl@C'],
)
# fasta_lib.get_peptides_from_fasta(fasta_files)
fasta_lib.get_peptides_from_protein_dict(protein_dict)
fasta_lib.precursor_df
[4]:
    sequence                 protein_idxes  miss_cleavage  is_prot_nterm  is_prot_cterm  mods  mod_sites  nAA
 0  AFGHIJK                  0;1            0              True           True                            7
 1  LMNOPQR                  0;1            0              False          True                            7
 2  ABCDESTK                 0              0              True           False                           8
 3  MABCDESTK                0              0              True           False                           9
 4  AFGHIJKLMNOPQR           0;1            1              True           True                            14
 5  LMNOPQRAFGHIJK           0              1              False          True                            14
 6  ABCDESTKAFGHIJK          0              1              True           False                           15
 7  MABCDESTKAFGHIJK         0              1              True           False                           16
 8  AFGHIJKLMNOPQRAFGHIJK    0              2              False          True                            21
 9  ABCDESTKAFGHIJKLMNOPQR   0              2              True           False                           22
10  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False                           23
[5]:
fasta_lib.protein_df
[5]:
   protein_id  gene_name  sequence
0  xx                     MABCDESTKAFGHIJKLMNOPQRAFGHIJK
1  yy          gene       AFGHIJKLMNOPQR

We can also append the protein and gene names to precursor_df.

[6]:
fasta_lib.append_protein_name()
fasta_lib.precursor_df
[6]:
    sequence                 protein_idxes  miss_cleavage  is_prot_nterm  is_prot_cterm  mods  mod_sites  nAA  proteins  genes
 0  AFGHIJK                  0;1            0              True           True                            7    xx;yy     gene
 1  LMNOPQR                  0;1            0              False          True                            7    xx;yy     gene
 2  ABCDESTK                 0              0              True           False                           8    xx
 3  MABCDESTK                0              0              True           False                           9    xx
 4  AFGHIJKLMNOPQR           0;1            1              True           True                            14   xx;yy     gene
 5  LMNOPQRAFGHIJK           0              1              False          True                            14   xx
 6  ABCDESTKAFGHIJK          0              1              True           False                           15   xx
 7  MABCDESTKAFGHIJK         0              1              True           False                           16   xx
 8  AFGHIJKLMNOPQRAFGHIJK    0              2              False          True                            21   xx
 9  ABCDESTKAFGHIJKLMNOPQR   0              2              True           False                           22   xx
10  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False                           23   xx
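
The proteins and genes columns can be understood as a lookup from protein_idxes into the rows of protein_df. The sketch below reproduces that lookup in plain Python. The helper name join_by_index is hypothetical, and dropping empty names is an inference from the output above (the peptides shared by xx and yy show only "gene"), not a documented rule.

```python
# Row order of protein_df in this notebook.
protein_ids = ["xx", "yy"]
gene_names = ["", "gene"]


def join_by_index(protein_idxes, names):
    """Map a ';'-separated index string to ';'-joined, non-empty names.

    Hypothetical helper for illustration only, not part of alphabase.
    """
    picked = [names[int(i)] for i in protein_idxes.split(";")]
    return ";".join(name for name in picked if name)


proteins = join_by_index("0;1", protein_ids)
genes = join_by_index("0;1", gene_names)
```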

If we have our own precursor_df loaded by psm_readers, we can directly assign it to fasta_lib.

fasta_lib._precursor_df = precursor_df

Thus, we can still use SpecLibFasta functionalities for this precursor_df.

Add both the var_mods (Acetyl@Protein N-term and Oxidation@M, set when fasta_lib was initialized) and the fix_mods (Carbamidomethyl@C) to precursor_df.

[7]:
fasta_lib.add_modifications()
fasta_lib.precursor_df[['sequence','mods','mod_sites']]
[7]:
    sequence                 mods                                               mod_sites
 0  AFGHIJK
 1  AFGHIJK                  Acetyl@Protein N-term                              0
 2  LMNOPQR                  Oxidation@M                                        2
 3  LMNOPQR
 4  ABCDESTK                 Carbamidomethyl@C                                  3
 5  ABCDESTK                 Acetyl@Protein N-term;Carbamidomethyl@C            0;3
 6  MABCDESTK                Oxidation@M;Carbamidomethyl@C                      1;4
 7  MABCDESTK                Carbamidomethyl@C                                  4
 8  MABCDESTK                Acetyl@Protein N-term;Oxidation@M;Carbamidomet...  0;1;4
 9  MABCDESTK                Acetyl@Protein N-term;Carbamidomethyl@C            0;4
10  AFGHIJKLMNOPQR           Oxidation@M                                        9
11  AFGHIJKLMNOPQR
12  AFGHIJKLMNOPQR           Acetyl@Protein N-term;Oxidation@M                  0;9
13  AFGHIJKLMNOPQR           Acetyl@Protein N-term                              0
14  LMNOPQRAFGHIJK           Oxidation@M                                        2
15  LMNOPQRAFGHIJK
16  ABCDESTKAFGHIJK          Carbamidomethyl@C                                  3
17  ABCDESTKAFGHIJK          Acetyl@Protein N-term;Carbamidomethyl@C            0;3
18  MABCDESTKAFGHIJK         Oxidation@M;Carbamidomethyl@C                      1;4
19  MABCDESTKAFGHIJK         Carbamidomethyl@C                                  4
20  MABCDESTKAFGHIJK         Acetyl@Protein N-term;Oxidation@M;Carbamidomet...  0;1;4
21  MABCDESTKAFGHIJK         Acetyl@Protein N-term;Carbamidomethyl@C            0;4
22  AFGHIJKLMNOPQRAFGHIJK    Oxidation@M                                        9
23  AFGHIJKLMNOPQRAFGHIJK
24  ABCDESTKAFGHIJKLMNOPQR   Oxidation@M;Carbamidomethyl@C                      17;3
25  ABCDESTKAFGHIJKLMNOPQR   Carbamidomethyl@C                                  3
26  ABCDESTKAFGHIJKLMNOPQR   Acetyl@Protein N-term;Oxidation@M;Carbamidomet...  0;17;3
27  ABCDESTKAFGHIJKLMNOPQR   Acetyl@Protein N-term;Carbamidomethyl@C            0;3
28  MABCDESTKAFGHIJKLMNOPQR  Oxidation@M;Carbamidomethyl@C                      1;4
29  MABCDESTKAFGHIJKLMNOPQR  Oxidation@M;Carbamidomethyl@C                      18;4
30  MABCDESTKAFGHIJKLMNOPQR  Oxidation@M;Oxidation@M;Carbamidomethyl@C          1;18;4
31  MABCDESTKAFGHIJKLMNOPQR  Carbamidomethyl@C                                  4
32  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Oxidation@M;Carbamidomet...  0;1;4
33  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Oxidation@M;Carbamidomet...  0;18;4
34  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...  0;1;18;4
35  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Carbamidomethyl@C            0;4
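
In the output above, the fixed modification appears on every C, while each variable modification produces one row per subset of its candidate sites. A minimal sketch of that subset enumeration follows. The function name enumerate_var_mod_sites is hypothetical, and any cap on the number of variable modifications per peptide is deliberately left out. Sites are 1-based residue positions, matching mod_sites (0 is reserved for the N-terminus).

```python
from itertools import combinations


def enumerate_var_mod_sites(sequence, residue):
    """Return every subset of 1-based positions of `residue`, smallest first.

    Hypothetical helper for illustration only, not part of alphabase.
    """
    sites = [i + 1 for i, aa in enumerate(sequence) if aa == residue]
    subsets = []
    for k in range(len(sites) + 1):
        subsets.extend(combinations(sites, k))
    return subsets


# Oxidation@M candidates on the longest peptide of the table above.
oxidation_sites = enumerate_var_mod_sites("MABCDESTKAFGHIJKLMNOPQR", "M")
```

The four subsets (none, site 1, site 18, both) combined with the two N-terminal states (acetylated or not) account for the eight rows 28 to 35.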

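alphabase.protein.fasta.SpecLibFasta.add_additional_modifications is designed specifically for Phospho, which can generate thousands of peptidoforms for a peptide with multiple candidate phospho sites. The cell below calls append_special_modifications directly and limits the output with max_mod_num and max_peptidoform_num.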

[8]:
from alphabase.protein.fasta import append_special_modifications
fasta_lib._precursor_df = append_special_modifications(
    fasta_lib.precursor_df, ['Phospho@S','Phospho@T'],
    min_mod_num=0, max_mod_num=1, max_peptidoform_num=100
)
fasta_lib.precursor_df
[8]:
    sequence                 protein_idxes  miss_cleavage  is_prot_nterm  is_prot_cterm  mods                                               mod_sites   nAA  proteins  genes
 0  AFGHIJK                  0;1            0              True           True                                                                          7    xx;yy     gene
 1  AFGHIJK                  0;1            0              True           True           Acetyl@Protein N-term                              0           7    xx;yy     gene
 2  LMNOPQR                  0;1            0              False          True           Oxidation@M                                        2           7    xx;yy     gene
 3  LMNOPQR                  0;1            0              False          True                                                                          7    xx;yy     gene
 4  ABCDESTK                 0              0              True           False          Carbamidomethyl@C;Phospho@S                        3;6         8    xx
..  ...                      ...            ...            ...            ...            ...                                                ...         ...  ...       ...
79  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...  0;1;18;4;8  23   xx
80  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...  0;1;18;4    23   xx
81  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Carbamidomethyl@C;Phospho@S  0;4;7       23   xx
82  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Carbamidomethyl@C;Phospho@T  0;4;8       23   xx
83  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Carbamidomethyl@C            0;4         23   xx

84 rows × 10 columns
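
The parameters min_mod_num, max_mod_num and max_peptidoform_num bound how many site combinations are generated per peptide. The sketch below shows one way such a bounded enumeration could work. It is an assumption-laden illustration, not alphabase's algorithm: the function name special_mod_site_sets is hypothetical, and which combinations survive when the cap is hit (here simply the first ones generated) is a choice made for this example.

```python
from itertools import combinations, islice


def special_mod_site_sets(sequence, residues, min_mod_num=0, max_mod_num=1,
                          max_peptidoform_num=100):
    """Enumerate site combinations for a special modification, with limits.

    Hypothetical helper for illustration only, not part of alphabase.
    Sites are 1-based residue positions.
    """
    sites = [i + 1 for i, aa in enumerate(sequence) if aa in residues]

    def generate():
        # Fewest modifications first, up to max_mod_num per peptide.
        for k in range(min_mod_num, min(max_mod_num, len(sites)) + 1):
            yield from combinations(sites, k)

    # Stop as soon as max_peptidoform_num combinations have been produced.
    return list(islice(generate(), max_peptidoform_num))


# Phospho@S / Phospho@T candidates with at most one phospho site per peptide.
site_sets = special_mod_site_sets("MABCDESTK", "ST")
```

With max_mod_num=1, a peptide containing one S and one T gains three variants (unmodified, Phospho@S, Phospho@T), which is consistent with the table growing from 36 to 84 rows: the 24 rows whose sequence contains S and T are tripled and the other 12 are unchanged.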

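add_peptide_labeling is a flexible way to add peptide labeling: each key of the dict names a labeling channel and maps to the list of label modifications applied in that channel.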

[9]:
fasta_lib.add_peptide_labeling({
    '': [], # not labelled for reference
    '0': ['Dimethyl@Any N-term','Dimethyl@K'],
    '8': ['Dimethyl:2H(6)13C(2)@Any N-term','Dimethyl:2H(6)13C(2)@K'],
})
fasta_lib.precursor_df
[9]:
     sequence                 protein_idxes  miss_cleavage  is_prot_nterm  is_prot_cterm  mods                                               mod_sites        nAA  proteins  genes  labeling_channel
  0  AFGHIJK                  0;1            0              True           True                                                                               7    xx;yy     gene
  1  AFGHIJK                  0;1            0              True           True           Acetyl@Protein N-term                              0                7    xx;yy     gene
  2  LMNOPQR                  0;1            0              False          True           Oxidation@M                                        2                7    xx;yy     gene
  3  LMNOPQR                  0;1            0              False          True                                                                               7    xx;yy     gene
  4  ABCDESTK                 0              0              True           False          Carbamidomethyl@C;Phospho@S                        3;6              8    xx
...  ...                      ...            ...            ...            ...            ...                                                ...              ...  ...       ...    ...
247  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...  0;1;18;4;8;9;16  23   xx               8
248  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Oxidation@M;Oxidation@M;...  0;1;18;4;9;16    23   xx               8
249  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...  0;4;7;9;16       23   xx               8
250  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...  0;4;8;9;16       23   xx               8
251  MABCDESTKAFGHIJKLMNOPQR  0              2              True           False          Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...  0;4;9;16         23   xx               8

252 rows × 11 columns
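
Each of the three channels produces one copy of the 84 peptidoforms, giving 252 rows. Within a labeled channel, the label sites in the last rows above (9 and 16 on a peptide whose N-terminus is already acetylated) suggest that the label goes on every K and on the N-terminus only when it is still free. The sketch below encodes that reading. Both the helper name label_sites and the N-terminal rule are inferences from the displayed output, not documented alphabase behaviour.

```python
def label_sites(sequence, occupied_sites, residue="K"):
    """Return the sites a label would occupy for one labeled channel.

    Hypothetical helper for illustration only, not part of alphabase.
    Site 0 is the peptide N-terminus; residues use 1-based positions.
    """
    # Assumption: the N-terminal label is skipped if site 0 is already modified.
    sites = [] if 0 in occupied_sites else [0]
    sites += [i + 1 for i, aa in enumerate(sequence) if aa == residue]
    return sites


# Acetyl@Protein N-term (site 0) and Carbamidomethyl@C (site 4) are present.
sites = label_sites("MABCDESTKAFGHIJKLMNOPQR", occupied_sites=[0, 4])
```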

[10]:
fasta_lib.add_charge()
fasta_lib.precursor_df[['sequence','mods','mod_sites','charge']]
[10]:
     sequence                 mods                                               mod_sites   charge
  0  AFGHIJK                                                                                 2
  1  AFGHIJK                                                                                 3
  2  AFGHIJK                                                                                 4
  3  AFGHIJK                  Acetyl@Protein N-term                              0           2
  4  AFGHIJK                  Acetyl@Protein N-term                              0           3
...  ...                      ...                                                ...         ...
751  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...  0;4;8;9;16  3
752  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Carbamidomethyl@C;Phosph...  0;4;8;9;16  4
753  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...  0;4;9;16    2
754  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...  0;4;9;16    3
755  MABCDESTKAFGHIJKLMNOPQR  Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth...  0;4;9;16    4

756 rows × 4 columns

Append precursor mz and isotope information

[11]:
fasta_lib.calc_precursor_mz()
fasta_lib.calc_precursor_isotope()
fasta_lib.precursor_df[['precursor_mz']+[col for col in fasta_lib.precursor_df.columns if col.startswith('i_')]]
/Users/wenfengzeng/workspace/alphabase/alphabase/peptide/precursor.py:613: RuntimeWarning: invalid value encountered in divide
  precursor_dist /= np.sum(precursor_dist, axis=1, keepdims=True)
[11]:
precursor_mz i_0 i_1 i_2 i_3 i_4 i_5
0 3.932371e+02 0.625822 0.285918 0.072883 0.013411 0.001966 0.0
1 2.624938e+02 0.625822 0.285918 0.072883 0.013411 0.001966 0.0
2 1.971222e+02 0.625822 0.285918 0.072883 0.013411 0.001966 0.0
3 4.142423e+02 0.610921 0.292699 0.078690 0.015312 0.002378 0.0
4 2.764973e+02 0.610921 0.292699 0.078690 0.015312 0.002378 0.0
... ... ... ... ... ... ... ...
751 4.000960e+06 NaN NaN NaN NaN NaN NaN
752 3.000720e+06 NaN NaN NaN NaN NaN NaN
753 6.001400e+06 NaN NaN NaN NaN NaN NaN
754 4.000934e+06 NaN NaN NaN NaN NaN NaN
755 3.000700e+06 NaN NaN NaN NaN NaN NaN

756 rows × 7 columns
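
The first three rows above are the same peptidoform (unmodified AFGHIJK) at charges 2, 3 and 4, so their precursor_mz values are linked by the standard relation m/z = M / z + m_proton, where M is the neutral monoisotopic mass. The sketch below checks this with plain arithmetic. The function names are hypothetical and the proton mass is the usual literature value, not something read from alphabase.

```python
PROTON_MASS = 1.007276466  # Da, standard value (assumed, not taken from alphabase)


def precursor_mz(neutral_mass, charge):
    """m/z of a peptide with `charge` added protons."""
    return neutral_mass / charge + PROTON_MASS


def neutral_mass_from_mz(mz, charge):
    """Invert precursor_mz to recover the neutral mass."""
    return (mz - PROTON_MASS) * charge


# Start from row 0 (charge 2) and predict the values for charges 3 and 4.
mass = neutral_mass_from_mz(393.2371, 2)
mz_by_charge = {z: precursor_mz(mass, z) for z in (2, 3, 4)}
```

The predicted values for charges 3 and 4 agree with rows 1 and 2 (262.4938 and 197.1222) to within rounding of the displayed digits.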

Use alphabase.spectral_library.base.SpecLibBase.calc_fragment_mz_df to calculate the fragment m/z dataframe.

[12]:
fasta_lib.calc_fragment_mz_df()
fasta_lib.fragment_mz_df
[12]:
b_z1 y_z1
0 7.204439e+01 714.429749
1 2.191128e+02 567.361328
2 2.761343e+02 510.339844
3 4.131932e+02 373.280945
4 5.262772e+02 260.196869
... ... ...
11911 1.200205e+07 751.420959
11912 1.200216e+07 637.377991
11913 1.200240e+07 400.230286
11914 1.200250e+07 303.177521
11915 1.200262e+07 175.118958

11916 rows × 2 columns

calc_fragment_mz_df() also generates the pointers frag_start_idx and frag_stop_idx in precursor_df, which locate the fragments of each precursor in fragment_mz_df.

[13]:
fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']]
[13]:
frag_start_idx frag_stop_idx
0 0 6
1 6 12
2 12 18
3 18 24
4 24 30
... ... ...
751 11806 11828
752 11828 11850
753 11850 11872
754 11872 11894
755 11894 11916

756 rows × 2 columns
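
In the output above, a precursor of length nAA owns nAA - 1 consecutive rows of fragment_mz_df (6 rows for the 7-residue peptides, 22 rows for the 23-residue ones), and each block starts where the previous one stops. A minimal sketch of how such pointers can be derived from peptide lengths is shown below. The helper name fragment_pointers is hypothetical, and this illustrates the layout only, not alphabase's code.

```python
from itertools import accumulate


def fragment_pointers(n_aa_list):
    """Return (start, stop) row ranges, one per precursor, in a flat table.

    Hypothetical helper for illustration only, not part of alphabase.
    A peptide with nAA residues has nAA - 1 fragmentation positions.
    """
    counts = [n_aa - 1 for n_aa in n_aa_list]
    stops = list(accumulate(counts))   # running total of fragment rows
    starts = [0] + stops[:-1]          # each block starts at the previous stop
    return list(zip(starts, stops))


# The first three precursors are AFGHIJK (nAA = 7) at charges 2, 3 and 4.
pointers = fragment_pointers([7, 7, 7])
```

Because stop is exclusive, the pair can be passed directly to a slice such as iloc[start:stop].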

Note that all fragment ions are stored from the peptide's N-terminus to its C-terminus, so the b-ions are in ascending order (from b1 to bn) and the y-ions are in descending order (from yn to y1).

[14]:
start, end = fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']].values[1]
fasta_lib.fragment_mz_df.iloc[start:end,:]
[14]:
b_z1 y_z1
6 72.044388 714.429749
7 219.112808 567.361328
8 276.134277 510.339844
9 413.193176 373.280945
10 526.277222 260.196869
11 639.361328 147.112808
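
The ordering can be reproduced by walking the peptide from N-terminus to C-terminus: at each cleavage position the b-ion is the prefix mass plus a proton, and the y-ion is the remaining suffix mass plus water plus a proton. The sketch below does this for AFGHIJK with standard monoisotopic residue masses. The function name by_ions is hypothetical, and the mass assigned to J (the same as I/L) is an assumption inferred from the values in the table above.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "A": 71.037114, "F": 147.068414, "G": 57.021464, "H": 137.058912,
    "I": 113.084064, "K": 128.094963,
    "J": 113.084064,  # assumption: J treated like I/L, as the table implies
}
WATER = 18.010565
PROTON = 1.007276


def by_ions(sequence):
    """Singly charged b and y ions, both listed from N-term to C-term.

    Hypothetical helper for illustration only, not part of alphabase.
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses)
    b_ions, y_ions, prefix = [], [], 0.0
    for mass in masses[:-1]:          # one entry per cleavage position
        prefix += mass
        b_ions.append(prefix + PROTON)
        y_ions.append(total - prefix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = by_ions("AFGHIJK")
```

Row k of the result pairs b(k+1) with y(n-k-1), which is why the b column increases while the y column decreases.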

Save protein_df, precursor_df, fragment_mz_df, and fragment_intensity_df to an HDF file.

[15]:
# fasta_lib.save_hdf('path/to/hdf_file.hdf')