SpecLibFasta usage
[1]:
%reload_ext autoreload
%autoreload 2
[2]:
from alphabase.protein.fasta import SpecLibFasta
Proteins from a dict (or loaded from fasta files)
[3]:
prot1 = 'MABCDESTKAFGHIJKLMNOPQRAFGHIJK'
prot2 = 'AFGHIJKLMNOPQR'
protein_dict = {
'xx': {
'protein_id': 'xx',
'gene_name': '',
'sequence': prot1
},
'yy': {
'protein_id': 'yy',
'gene_name': 'gene',
'sequence': prot2
}
}
alphabase.protein.fasta.SpecLibFasta.get_peptides_from_protein_dict will digest a protein dict into a peptide dataframe.
alphabase.protein.fasta.SpecLibFasta.get_peptides_from_fasta will digest a fasta file or a fasta list into a peptide dataframe.
[4]:
fasta_lib = SpecLibFasta(
['b_z1','y_z1'], I_to_L=False, decoy='pseudo_reverse',
var_mods=['Acetyl@Protein_N-term', 'Oxidation@M'],
fix_mods=['Carbamidomethyl@C'],
)
# fasta_lib.get_peptides_from_fasta(fasta_files)
fasta_lib.get_peptides_from_protein_dict(protein_dict)
fasta_lib.precursor_df
[4]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA |
|---|---|---|---|---|---|---|---|---|
| 0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 |
| 1 | LMNOPQR | 0;1 | 0 | False | True | | | 7 |
| 2 | ABCDESTK | 0 | 0 | True | False | | | 8 |
| 3 | MABCDESTK | 0 | 0 | True | False | | | 9 |
| 4 | AFGHIJKLMNOPQR | 0;1 | 1 | True | True | | | 14 |
| 5 | LMNOPQRAFGHIJK | 0 | 1 | False | True | | | 14 |
| 6 | ABCDESTKAFGHIJK | 0 | 1 | True | False | | | 15 |
| 7 | MABCDESTKAFGHIJK | 0 | 1 | True | False | | | 16 |
| 8 | AFGHIJKLMNOPQRAFGHIJK | 0 | 2 | False | True | | | 21 |
| 9 | ABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 22 |
| 10 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 23 |
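The sequences in this table can be reproduced with a few lines of plain Python. This is only a sketch, not alphabase's implementation: it assumes trypsin-like cleavage after K or R, at most 2 missed cleavages, and an optional loss of the protein N-terminal methionine, all inferred from the output above. The helper name `digest` is hypothetical.

```python
import re

prot1 = 'MABCDESTKAFGHIJKLMNOPQRAFGHIJK'
prot2 = 'AFGHIJKLMNOPQR'

def digest(sequence, max_missed=2):
    """Hypothetical helper: cleave after K/R and join up to max_missed+1 pieces."""
    # Each piece ends with K or R, except possibly the last one.
    pieces = re.findall(r'[^KR]*[KR]|[^KR]+', sequence)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            pep = ''.join(pieces[i:j + 1])
            peptides.add(pep)
            # Assumption: a protein N-terminal Met may also be removed.
            if i == 0 and pep.startswith('M'):
                peptides.add(pep[1:])
    return peptides

# Merge both proteins and order by length, then alphabetically.
peptides = sorted(digest(prot1) | digest(prot2), key=lambda s: (len(s), s))
for pep in peptides:
    print(len(pep), pep)
```

Shared peptides such as AFGHIJK appear only once here, which is why the real table needs the `protein_idxes` column to record every parent protein.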
[5]:
fasta_lib.protein_df
[5]:
| | protein_id | gene_name | sequence |
|---|---|---|---|
| 0 | xx | | MABCDESTKAFGHIJKLMNOPQRAFGHIJK |
| 1 | yy | gene | AFGHIJKLMNOPQR |
We can also append the protein names to precursor_df
[6]:
fasta_lib.append_protein_name()
fasta_lib.precursor_df
[6]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | proteins | genes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 | xx;yy | gene |
| 1 | LMNOPQR | 0;1 | 0 | False | True | | | 7 | xx;yy | gene |
| 2 | ABCDESTK | 0 | 0 | True | False | | | 8 | xx | |
| 3 | MABCDESTK | 0 | 0 | True | False | | | 9 | xx | |
| 4 | AFGHIJKLMNOPQR | 0;1 | 1 | True | True | | | 14 | xx;yy | gene |
| 5 | LMNOPQRAFGHIJK | 0 | 1 | False | True | | | 14 | xx | |
| 6 | ABCDESTKAFGHIJK | 0 | 1 | True | False | | | 15 | xx | |
| 7 | MABCDESTKAFGHIJK | 0 | 1 | True | False | | | 16 | xx | |
| 8 | AFGHIJKLMNOPQRAFGHIJK | 0 | 2 | False | True | | | 21 | xx | |
| 9 | ABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 22 | xx | |
| 10 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 23 | xx | |
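The `proteins` and `genes` columns are lookups of `protein_idxes` into `protein_df`. A minimal sketch of that mapping follows; the helper name `idxes_to_names` is hypothetical, and skipping empty names is an assumption chosen to match the `genes` column above, where protein `xx` has no gene name.

```python
protein_ids = ['xx', 'yy']   # protein_df['protein_id']
gene_names = ['', 'gene']    # protein_df['gene_name']

def idxes_to_names(protein_idxes, names):
    """Map a ';'-joined index string to a ';'-joined name string."""
    if not protein_idxes:
        return ''
    picked = [names[int(i)] for i in protein_idxes.split(';')]
    # Assumption: empty names are dropped instead of leaving a dangling ';'.
    return ';'.join(name for name in picked if name)

proteins = idxes_to_names('0;1', protein_ids)
genes = idxes_to_names('0;1', gene_names)
print(proteins, genes)
```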
If we have our own precursor_df loaded by psm_readers, we can directly assign it to fasta_lib.
fasta_lib._precursor_df = precursor_df
Thus, we can still use SpecLibFasta functionalities for this precursor_df.
Add modifications, including both var_mods (Acetyl@Protein_N-term, Oxidation@M; see the initialization of fasta_lib) and fix_mods (Carbamidomethyl@C), to the precursor_df.
[7]:
fasta_lib.add_modifications()
fasta_lib.precursor_df[['sequence','mods','mod_sites']]
[7]:
| | sequence | mods | mod_sites |
|---|---|---|---|
| 0 | AFGHIJK | | |
| 1 | AFGHIJK | Acetyl@Protein_N-term | 0 |
| 2 | LMNOPQR | Oxidation@M | 2 |
| 3 | LMNOPQR | | |
| 4 | ABCDESTK | Carbamidomethyl@C | 3 |
| 5 | ABCDESTK | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;3 |
| 6 | MABCDESTK | Oxidation@M;Carbamidomethyl@C | 1;4 |
| 7 | MABCDESTK | Carbamidomethyl@C | 4 |
| 8 | MABCDESTK | Acetyl@Protein_N-term;Oxidation@M;Carbamidomet... | 0;1;4 |
| 9 | MABCDESTK | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;4 |
| 10 | AFGHIJKLMNOPQR | Oxidation@M | 9 |
| 11 | AFGHIJKLMNOPQR | | |
| 12 | AFGHIJKLMNOPQR | Acetyl@Protein_N-term;Oxidation@M | 0;9 |
| 13 | AFGHIJKLMNOPQR | Acetyl@Protein_N-term | 0 |
| 14 | LMNOPQRAFGHIJK | Oxidation@M | 2 |
| 15 | LMNOPQRAFGHIJK | | |
| 16 | ABCDESTKAFGHIJK | Carbamidomethyl@C | 3 |
| 17 | ABCDESTKAFGHIJK | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;3 |
| 18 | MABCDESTKAFGHIJK | Oxidation@M;Carbamidomethyl@C | 1;4 |
| 19 | MABCDESTKAFGHIJK | Carbamidomethyl@C | 4 |
| 20 | MABCDESTKAFGHIJK | Acetyl@Protein_N-term;Oxidation@M;Carbamidomet... | 0;1;4 |
| 21 | MABCDESTKAFGHIJK | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;4 |
| 22 | AFGHIJKLMNOPQRAFGHIJK | Oxidation@M | 9 |
| 23 | AFGHIJKLMNOPQRAFGHIJK | | |
| 24 | ABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Carbamidomethyl@C | 17;3 |
| 25 | ABCDESTKAFGHIJKLMNOPQR | Carbamidomethyl@C | 3 |
| 26 | ABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Oxidation@M;Carbamidomet... | 0;17;3 |
| 27 | ABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;3 |
| 28 | MABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Carbamidomethyl@C | 1;4 |
| 29 | MABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Carbamidomethyl@C | 18;4 |
| 30 | MABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Oxidation@M;Carbamidomethyl@C | 1;18;4 |
| 31 | MABCDESTKAFGHIJKLMNOPQR | Carbamidomethyl@C | 4 |
| 32 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Oxidation@M;Carbamidomet... | 0;1;4 |
| 33 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Oxidation@M;Carbamidomet... | 0;18;4 |
| 34 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4 |
| 35 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;4 |
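The row expansion above follows a simple rule: fixed modifications are always applied, and every subset of the possible variable modification sites yields one peptidoform. The sketch below illustrates that rule for the three modifications used in this notebook; it is not alphabase's code, the helper name `peptidoforms` is hypothetical, and sites follow the convention visible in the table (0 for the N-terminus, 1-based residue positions otherwise).

```python
from itertools import combinations

def peptidoforms(sequence, is_prot_nterm):
    """Hypothetical helper: enumerate (mods, mod_sites) string pairs."""
    # Fixed modification: every C is carbamidomethylated.
    fixed = [('Carbamidomethyl@C', i + 1)
             for i, aa in enumerate(sequence) if aa == 'C']
    # Variable modifications: each candidate site may be modified or not.
    variable = [('Oxidation@M', i + 1)
                for i, aa in enumerate(sequence) if aa == 'M']
    if is_prot_nterm:
        variable.insert(0, ('Acetyl@Protein_N-term', 0))
    forms = []
    for k in range(len(variable) + 1):
        for combo in combinations(variable, k):
            mods = list(combo) + fixed
            forms.append((';'.join(name for name, _ in mods),
                          ';'.join(str(site) for _, site in mods)))
    return forms

short_forms = peptidoforms('MABCDESTK', is_prot_nterm=True)
long_forms = peptidoforms('MABCDESTKAFGHIJKLMNOPQR', is_prot_nterm=True)
for mods, sites in short_forms:
    print(mods, sites)
```

With two variable sites MABCDESTK yields 4 peptidoforms, and with three variable sites the 23-residue peptide yields 8, matching the row counts in the table (the row order may differ).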
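alphabase.protein.fasta.SpecLibFasta.add_additional_modifications is specially designed for modifications such as Phospho, where a peptide with multiple phospho sites may generate thousands of peptidoforms. The cell below calls the module-level function append_special_modifications with min_mod_num=0, max_mod_num=1 and max_peptidoform_num=100.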
[8]:
from alphabase.protein.fasta import append_special_modifications
fasta_lib._precursor_df = append_special_modifications(
fasta_lib.precursor_df, ['Phospho@S','Phospho@T'],
min_mod_num=0, max_mod_num=1, max_peptidoform_num=100
)
fasta_lib.precursor_df
[8]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | proteins | genes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 | xx;yy | gene |
| 1 | AFGHIJK | 0;1 | 0 | True | True | Acetyl@Protein_N-term | 0 | 7 | xx;yy | gene |
| 2 | LMNOPQR | 0;1 | 0 | False | True | Oxidation@M | 2 | 7 | xx;yy | gene |
| 3 | LMNOPQR | 0;1 | 0 | False | True | | | 7 | xx;yy | gene |
| 4 | ABCDESTK | 0 | 0 | True | False | Carbamidomethyl@C;Phospho@S | 3;6 | 8 | xx | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4;8 | 23 | xx | |
| 80 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4 | 23 | xx | |
| 81 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Carbamidomethyl@C;Phospho@S | 0;4;7 | 23 | xx | |
| 82 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Carbamidomethyl@C;Phospho@T | 0;4;8 | 23 | xx | |
| 83 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Carbamidomethyl@C | 0;4 | 23 | xx | |
84 rows × 10 columns
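The growth from 36 to 84 rows can be checked by hand. The sketch below is an illustration rather than alphabase's implementation: the helper `special_mod_forms` is hypothetical, and it assumes that between `min_mod_num` and `max_mod_num` Phospho groups are placed on S/T residues, with the number of combinations per peptide capped at `max_peptidoform_num`.

```python
from itertools import combinations

def special_mod_forms(sequence, min_mod_num=0, max_mod_num=1,
                      max_peptidoform_num=100):
    """Hypothetical helper: enumerate Phospho placements on S and T."""
    sites = [(f'Phospho@{aa}', i + 1)
             for i, aa in enumerate(sequence) if aa in 'ST']
    forms = []
    for k in range(min_mod_num, min(max_mod_num, len(sites)) + 1):
        for combo in combinations(sites, k):
            forms.append(combo)
            if len(forms) >= max_peptidoform_num:
                return forms  # assumed cap on peptidoforms per peptide
    return forms

# Peptidoforms per sequence before Phospho is added (36 rows in total).
rows_before = {
    'AFGHIJK': 2, 'LMNOPQR': 2, 'ABCDESTK': 2, 'MABCDESTK': 4,
    'AFGHIJKLMNOPQR': 4, 'LMNOPQRAFGHIJK': 2, 'ABCDESTKAFGHIJK': 2,
    'MABCDESTKAFGHIJK': 4, 'AFGHIJKLMNOPQRAFGHIJK': 2,
    'ABCDESTKAFGHIJKLMNOPQR': 4, 'MABCDESTKAFGHIJKLMNOPQR': 8,
}
rows_after = sum(n * len(special_mod_forms(seq))
                 for seq, n in rows_before.items())
print(special_mod_forms('ABCDESTK'))
print(rows_after)
```

Peptides containing one S and one T triple in number (unmodified, Phospho@S, Phospho@T), while peptides without S or T are unchanged.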
Flexible method to add peptide labeling
[9]:
fasta_lib.add_peptide_labeling({
'': [], # not labelled for reference
'0': ['Dimethyl@Any_N-term','Dimethyl@K'],
'8': ['Dimethyl:2H(6)13C(2)@Any_N-term','Dimethyl:2H(6)13C(2)@K'],
})
fasta_lib.precursor_df
[9]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | proteins | genes | labeling_channel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 | xx;yy | gene | |
| 1 | AFGHIJK | 0;1 | 0 | True | True | Acetyl@Protein_N-term | 0 | 7 | xx;yy | gene | |
| 2 | LMNOPQR | 0;1 | 0 | False | True | Oxidation@M | 2 | 7 | xx;yy | gene | |
| 3 | LMNOPQR | 0;1 | 0 | False | True | | | 7 | xx;yy | gene | |
| 4 | ABCDESTK | 0 | 0 | True | False | Carbamidomethyl@C;Phospho@S | 3;6 | 8 | xx | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4;8;9;16 | 23 | xx | | 8 |
| 248 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4;9;16 | 23 | xx | | 8 |
| 249 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Carbamidomethyl@C;Phosph... | 0;4;7;9;16 | 23 | xx | | 8 |
| 250 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Carbamidomethyl@C;Phosph... | 0;4;8;9;16 | 23 | xx | | 8 |
| 251 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein_N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 23 | xx | | 8 |
252 rows × 11 columns
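Each of the 84 peptidoforms is copied once per channel (84 × 3 = 252), and the channel's labeling modifications are appended to `mods` and `mod_sites`. The sketch below illustrates this for a single peptidoform. It is not alphabase's implementation: the helper `label_peptide` is hypothetical, and skipping the N-terminal label when site 0 is already occupied is an assumption inferred from the last rows above, where acetylated peptides only gain labels at K (sites 9 and 16). The order of the appended labels is also an assumption.

```python
def label_peptide(sequence, mods, mod_sites, label_mods):
    """Hypothetical helper: append labeling mods to one peptidoform."""
    mods, mod_sites = list(mods), list(mod_sites)
    for label in label_mods:
        _, target = label.rsplit('@', 1)
        if target == 'Any_N-term':
            # Assumption: an already modified N-terminus is not labeled.
            if 0 not in mod_sites:
                mods.append(label)
                mod_sites.append(0)
        else:
            # Label every residue that matches the target amino acid.
            for i, aa in enumerate(sequence):
                if aa == target:
                    mods.append(label)
                    mod_sites.append(i + 1)
    return mods, mod_sites

heavy = ['Dimethyl:2H(6)13C(2)@Any_N-term', 'Dimethyl:2H(6)13C(2)@K']
light = ['Dimethyl@Any_N-term', 'Dimethyl@K']

acetylated = label_peptide(
    'MABCDESTKAFGHIJKLMNOPQR',
    ['Acetyl@Protein_N-term', 'Carbamidomethyl@C'], [0, 4], heavy)
unmodified = label_peptide('AFGHIJK', [], [], light)
print(acetylated[1], unmodified)
```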
[10]:
fasta_lib.add_charge()
fasta_lib.precursor_df[['sequence','mods','mod_sites','charge']]
[10]:
| | sequence | mods | mod_sites | charge |
|---|---|---|---|---|
| 0 | AFGHIJK | | | 2 |
| 1 | AFGHIJK | | | 3 |
| 2 | AFGHIJK | | | 4 |
| 3 | AFGHIJK | Acetyl@Protein_N-term | 0 | 2 |
| 4 | AFGHIJK | Acetyl@Protein_N-term | 0 | 3 |
| ... | ... | ... | ... | ... |
| 751 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C;Phosph... | 0;4;8;9;16 | 3 |
| 752 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C;Phosph... | 0;4;8;9;16 | 4 |
| 753 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 2 |
| 754 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 3 |
| 755 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein_N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 4 |
756 rows × 4 columns
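In the output above, `add_charge()` repeats every peptidoform once per charge state. A minimal sketch of that expansion, assuming the charge range 2 to 4 seen in the table (the actual range is configurable in alphabase and is not shown in this notebook):

```python
from itertools import product

# Two peptidoforms as (sequence, mods, mod_sites) tuples, taken from the table.
peptidoforms = [
    ('AFGHIJK', '', ''),
    ('AFGHIJK', 'Acetyl@Protein_N-term', '0'),
]
charges = range(2, 5)  # assumption: charge states 2, 3 and 4

# One row per (peptidoform, charge) pair, peptidoform-major order.
rows = [pep + (charge,) for pep, charge in product(peptidoforms, charges)]
for row in rows:
    print(row)

# The same factor explains the table size: 252 peptidoforms x 3 charges.
total_rows = 252 * len(charges)
```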
Append precursor mz and isotope information
[11]:
fasta_lib.calc_precursor_mz()
fasta_lib.calc_precursor_isotope()
fasta_lib.precursor_df[['precursor_mz']+[col for col in fasta_lib.precursor_df.columns if col.startswith('i_')]]
/Users/wenfengzeng/workspace/alphabase/alphabase/peptide/precursor.py:613: RuntimeWarning: invalid value encountered in divide
precursor_dist /= np.sum(precursor_dist, axis=1, keepdims=True)
[11]:
| | precursor_mz | i_0 | i_1 | i_2 | i_3 | i_4 | i_5 |
|---|---|---|---|---|---|---|---|
| 0 | 3.932371e+02 | 0.625822 | 0.285918 | 0.072883 | 0.013411 | 0.001966 | 0.0 |
| 1 | 2.624938e+02 | 0.625822 | 0.285918 | 0.072883 | 0.013411 | 0.001966 | 0.0 |
| 2 | 1.971222e+02 | 0.625822 | 0.285918 | 0.072883 | 0.013411 | 0.001966 | 0.0 |
| 3 | 4.142423e+02 | 0.610921 | 0.292699 | 0.078690 | 0.015312 | 0.002378 | 0.0 |
| 4 | 2.764973e+02 | 0.610921 | 0.292699 | 0.078690 | 0.015312 | 0.002378 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 751 | 4.000960e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
| 752 | 3.000720e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
| 753 | 6.001400e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
| 754 | 4.000934e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
| 755 | 3.000700e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
756 rows × 7 columns
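The first three rows are the same peptidoform (unmodified AFGHIJK) at charges 2, 3 and 4, so their m/z values are linked by the standard relation m/z = M/z + m_proton. The sketch below derives the neutral mass from the charge-2 value in the table and recomputes the other two; the function names are illustrative only.

```python
PROTON_MASS = 1.007276466  # Da

def mz_from_mass(neutral_mass, charge):
    """m/z of a precursor carrying `charge` protons."""
    return neutral_mass / charge + PROTON_MASS

def mass_from_mz(mz, charge):
    """Invert the relation to recover the neutral mass."""
    return (mz - PROTON_MASS) * charge

# Row 0 of the table: precursor_mz = 393.2371 at charge 2.
neutral_mass = mass_from_mz(393.2371, 2)
mz_z3 = mz_from_mass(neutral_mass, 3)  # compare with row 1: 262.4938
mz_z4 = mz_from_mass(neutral_mass, 4)  # compare with row 2: 197.1222
print(round(neutral_mass, 4), round(mz_z3, 4), round(mz_z4, 4))
```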
Use alphabase.spectral_library.base.SpecLibBase.calc_fragment_mz_df to calculate the fragment m/z dataframe.
[12]:
fasta_lib.calc_fragment_mz_df()
fasta_lib.fragment_mz_df
[12]:
| | b_z1 | y_z1 |
|---|---|---|
| 0 | 7.204439e+01 | 714.429749 |
| 1 | 2.191128e+02 | 567.361328 |
| 2 | 2.761343e+02 | 510.339844 |
| 3 | 4.131932e+02 | 373.280945 |
| 4 | 5.262772e+02 | 260.196869 |
| ... | ... | ... |
| 11911 | 1.200205e+07 | 751.420959 |
| 11912 | 1.200216e+07 | 637.377991 |
| 11913 | 1.200240e+07 | 400.230286 |
| 11914 | 1.200250e+07 | 303.177521 |
| 11915 | 1.200262e+07 | 175.118958 |
11916 rows × 2 columns
calc_fragment_mz_df() also generates the pointers frag_start_idx and frag_stop_idx in the precursor_df, which locate the fragments of each precursor in fragment_mz_df.
[13]:
fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']]
[13]:
| | frag_start_idx | frag_stop_idx |
|---|---|---|
| 0 | 0 | 6 |
| 1 | 6 | 12 |
| 2 | 12 | 18 |
| 3 | 18 | 24 |
| 4 | 24 | 30 |
| ... | ... | ... |
| 751 | 11806 | 11828 |
| 752 | 11828 | 11850 |
| 753 | 11850 | 11872 |
| 754 | 11872 | 11894 |
| 755 | 11894 | 11916 |
756 rows × 2 columns
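These pointers follow directly from the peptide lengths: a peptide with nAA residues has nAA - 1 fragmentation positions, and the fragment rows of consecutive precursors are stored back to back. A small sketch of how such start/stop indices can be derived (plain Python, not alphabase's code):

```python
from itertools import accumulate

nAA = [7, 7, 7, 7, 7]                 # lengths of the first five precursors
n_frags = [n - 1 for n in nAA]        # one fragment row per peptide bond

frag_stop_idx = list(accumulate(n_frags))      # running total of rows
frag_start_idx = [0] + frag_stop_idx[:-1]      # each block starts where the last ended

for start, stop in zip(frag_start_idx, frag_stop_idx):
    print(start, stop)
```

The same rule explains the tail of the table: the last precursors have 23 residues, so each of their blocks spans 22 rows (for example 11894 to 11916).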
Note that all fragment ions are stored from the peptide's N-terminus to its C-terminus, so the b-ions are in ascending order (from b1 to bn) and the y-ions are in descending order (from yn to y1).
[14]:
start, end = fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']].values[1]
fasta_lib.fragment_mz_df.iloc[start:end,:]
[14]:
| | b_z1 | y_z1 |
|---|---|---|
| 6 | 72.044388 | 714.429749 |
| 7 | 219.112808 | 567.361328 |
| 8 | 276.134277 | 510.339844 |
| 9 | 413.193176 | 373.280945 |
| 10 | 526.277222 | 260.196869 |
| 11 | 639.361328 | 147.112808 |
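The values in this block can be recomputed from monoisotopic residue masses, which also makes the storage order explicit: row i holds b(i+1) and y(n-i-1). The sketch below is an independent illustration, not alphabase's code. It assumes that J is given the same mass as I/L, which is what the b5 to b6 step in the table implies.

```python
from itertools import accumulate

# Monoisotopic residue masses in Da; J = I/L is an assumption (see above).
RESIDUE_MASS = {
    'A': 71.037114, 'F': 147.068414, 'G': 57.021464, 'H': 137.058912,
    'I': 113.084064, 'J': 113.084064, 'K': 128.094963,
}
PROTON_MASS = 1.007276
WATER_MASS = 18.010565

def by_ions(sequence):
    """Singly charged b/y m/z pairs, ordered from N-terminus to C-terminus."""
    prefix = list(accumulate(RESIDUE_MASS[aa] for aa in sequence))
    total = prefix[-1]
    rows = []
    for prefix_mass in prefix[:-1]:   # one row per peptide bond
        b = prefix_mass + PROTON_MASS
        y = total - prefix_mass + WATER_MASS + PROTON_MASS
        rows.append((b, y))
    return rows

rows = by_ions('AFGHIJK')
for i, (b, y) in enumerate(rows):
    print(f'b{i + 1}={b:.4f}  y{len(rows) - i}={y:.4f}')
```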
Save protein_df, precursor_df, fragment_mz_df, and fragment_intensity_df into an HDF file.
[15]:
# fasta_lib.save_hdf('path/to/hdf_file.hdf')