SpecLibFasta usage
[1]:
%reload_ext autoreload
%autoreload 2
[2]:
from alphabase.protein.fasta import SpecLibFasta
Proteins from a dict (or loaded from fasta files)
[3]:
prot1 = 'MABCDESTKAFGHIJKLMNOPQRAFGHIJK'
prot2 = 'AFGHIJKLMNOPQR'
protein_dict = {
'xx': {
'protein_id': 'xx',
'gene_name': '',
'sequence': prot1
},
'yy': {
'protein_id': 'yy',
'gene_name': 'gene',
'sequence': prot2
}
}
alphabase.protein.fasta.SpecLibFasta.get_peptides_from_protein_dict digests a protein dict into a peptide dataframe.
alphabase.protein.fasta.SpecLibFasta.get_peptides_from_fasta digests a fasta file or a list of fasta files into a peptide dataframe.
[4]:
fasta_lib = SpecLibFasta(
['b_z1','y_z1'], I_to_L=False, decoy='pseudo_reverse',
var_mods=['Acetyl@Protein N-term', 'Oxidation@M'],
fix_mods=['Carbamidomethyl@C'],
)
# fasta_lib.get_peptides_from_fasta(fasta_files)
fasta_lib.get_peptides_from_protein_dict(protein_dict)
fasta_lib.precursor_df
[4]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA |
---|---|---|---|---|---|---|---|---|
0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 |
1 | LMNOPQR | 0;1 | 0 | False | True | | | 7 |
2 | ABCDESTK | 0 | 0 | True | False | | | 8 |
3 | MABCDESTK | 0 | 0 | True | False | | | 9 |
4 | AFGHIJKLMNOPQR | 0;1 | 1 | True | True | | | 14 |
5 | LMNOPQRAFGHIJK | 0 | 1 | False | True | | | 14 |
6 | ABCDESTKAFGHIJK | 0 | 1 | True | False | | | 15 |
7 | MABCDESTKAFGHIJK | 0 | 1 | True | False | | | 16 |
8 | AFGHIJKLMNOPQRAFGHIJK | 0 | 2 | False | True | | | 21 |
9 | ABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 22 |
10 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 23 |
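To make the digestion step more concrete, here is a minimal pure-Python sketch of cleaving after K or R with up to two missed cleavages. It is not the alphabase implementation: the function name `digest`, the cleavage rule, and the length limits (`min_len=7`, `max_len=35`) are assumptions chosen to resemble the table above, and the sketch does not produce the variants without the N-terminal methionine (for example ABCDESTK).

```python
import re

def digest(sequence, max_missed_cleavages=2, min_len=7, max_len=35):
    """Cleave after every K or R and join up to `max_missed_cleavages` neighbours.

    Hypothetical helper for illustration; limits and cleavage rule are assumptions.
    """
    # Fully cleaved pieces: each ends with K/R, except possibly the last one.
    pieces = re.findall(r".*?[KR]|.+$", sequence)
    peptides = []
    for start in range(len(pieces)):
        for missed in range(max_missed_cleavages + 1):
            stop = start + missed + 1
            if stop > len(pieces):
                break
            peptide = "".join(pieces[start:stop])
            if min_len <= len(peptide) <= max_len:
                peptides.append((peptide, missed))
    return peptides

for peptide, missed in digest("AFGHIJKLMNOPQR"):
    print(peptide, missed)
```

Each tuple holds the peptide sequence and its number of missed cleavages, which corresponds to the sequence and miss_cleavage columns of precursor_df.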
[5]:
fasta_lib.protein_df
[5]:
| | protein_id | gene_name | sequence |
---|---|---|---|
0 | xx | | MABCDESTKAFGHIJKLMNOPQRAFGHIJK |
1 | yy | gene | AFGHIJKLMNOPQR |
We can also append the protein and gene names to precursor_df.
[6]:
fasta_lib.append_protein_name()
fasta_lib.precursor_df
[6]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | proteins | genes |
---|---|---|---|---|---|---|---|---|---|---|
0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 | xx;yy | gene |
1 | LMNOPQR | 0;1 | 0 | False | True | | | 7 | xx;yy | gene |
2 | ABCDESTK | 0 | 0 | True | False | | | 8 | xx | |
3 | MABCDESTK | 0 | 0 | True | False | | | 9 | xx | |
4 | AFGHIJKLMNOPQR | 0;1 | 1 | True | True | | | 14 | xx;yy | gene |
5 | LMNOPQRAFGHIJK | 0 | 1 | False | True | | | 14 | xx | |
6 | ABCDESTKAFGHIJK | 0 | 1 | True | False | | | 15 | xx | |
7 | MABCDESTKAFGHIJK | 0 | 1 | True | False | | | 16 | xx | |
8 | AFGHIJKLMNOPQRAFGHIJK | 0 | 2 | False | True | | | 21 | xx | |
9 | ABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 22 | xx | |
10 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | | | 23 | xx | |
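The proteins and genes columns can be read as a lookup of protein_idxes into protein_df. The sketch below shows that mapping in plain Python. The helper `names_for` is hypothetical, and dropping empty names is only inferred from the output above, where the genes entry for `0;1` is `gene` rather than `;gene`.

```python
def names_for(protein_idxes, names):
    """Map a ';'-joined index string such as '0;1' to ';'-joined non-empty names."""
    picked = [names[int(i)] for i in protein_idxes.split(";")]
    # Assumption inferred from the table: empty names are skipped.
    return ";".join(name for name in picked if name)

protein_ids = ["xx", "yy"]   # protein_id column of protein_df
gene_names = ["", "gene"]    # gene_name column of protein_df

print(names_for("0;1", protein_ids))  # xx;yy
print(names_for("0;1", gene_names))   # gene
```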
If we already have a precursor_df, for example one loaded by psm_readers, we can assign it directly to fasta_lib:
fasta_lib._precursor_df = precursor_df
The SpecLibFasta functionalities shown below then work on this precursor_df as well.
Add modifications, including both var_mods (Acetyl@Protein N-term, Oxidation@M; see the initialization of fasta_lib) and fix_mods (Carbamidomethyl@C), to precursor_df.
[7]:
fasta_lib.add_modifications()
fasta_lib.precursor_df[['sequence','mods','mod_sites']]
[7]:
| | sequence | mods | mod_sites |
---|---|---|---|
0 | AFGHIJK | | |
1 | AFGHIJK | Acetyl@Protein N-term | 0 |
2 | LMNOPQR | Oxidation@M | 2 |
3 | LMNOPQR | | |
4 | ABCDESTK | Carbamidomethyl@C | 3 |
5 | ABCDESTK | Acetyl@Protein N-term;Carbamidomethyl@C | 0;3 |
6 | MABCDESTK | Oxidation@M;Carbamidomethyl@C | 1;4 |
7 | MABCDESTK | Carbamidomethyl@C | 4 |
8 | MABCDESTK | Acetyl@Protein N-term;Oxidation@M;Carbamidomet... | 0;1;4 |
9 | MABCDESTK | Acetyl@Protein N-term;Carbamidomethyl@C | 0;4 |
10 | AFGHIJKLMNOPQR | Oxidation@M | 9 |
11 | AFGHIJKLMNOPQR | | |
12 | AFGHIJKLMNOPQR | Acetyl@Protein N-term;Oxidation@M | 0;9 |
13 | AFGHIJKLMNOPQR | Acetyl@Protein N-term | 0 |
14 | LMNOPQRAFGHIJK | Oxidation@M | 2 |
15 | LMNOPQRAFGHIJK | | |
16 | ABCDESTKAFGHIJK | Carbamidomethyl@C | 3 |
17 | ABCDESTKAFGHIJK | Acetyl@Protein N-term;Carbamidomethyl@C | 0;3 |
18 | MABCDESTKAFGHIJK | Oxidation@M;Carbamidomethyl@C | 1;4 |
19 | MABCDESTKAFGHIJK | Carbamidomethyl@C | 4 |
20 | MABCDESTKAFGHIJK | Acetyl@Protein N-term;Oxidation@M;Carbamidomet... | 0;1;4 |
21 | MABCDESTKAFGHIJK | Acetyl@Protein N-term;Carbamidomethyl@C | 0;4 |
22 | AFGHIJKLMNOPQRAFGHIJK | Oxidation@M | 9 |
23 | AFGHIJKLMNOPQRAFGHIJK | | |
24 | ABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Carbamidomethyl@C | 17;3 |
25 | ABCDESTKAFGHIJKLMNOPQR | Carbamidomethyl@C | 3 |
26 | ABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Oxidation@M;Carbamidomet... | 0;17;3 |
27 | ABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C | 0;3 |
28 | MABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Carbamidomethyl@C | 1;4 |
29 | MABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Carbamidomethyl@C | 18;4 |
30 | MABCDESTKAFGHIJKLMNOPQR | Oxidation@M;Oxidation@M;Carbamidomethyl@C | 1;18;4 |
31 | MABCDESTKAFGHIJKLMNOPQR | Carbamidomethyl@C | 4 |
32 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Oxidation@M;Carbamidomet... | 0;1;4 |
33 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Oxidation@M;Carbamidomet... | 0;18;4 |
34 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4 |
35 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C | 0;4 |
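The way peptidoforms multiply in the table above can be illustrated with a short enumeration: fixed modifications are always applied, while every subset of the variable sites produces one peptidoform. This is a sketch under stated assumptions, not the alphabase code. The function `enumerate_mods` and the `max_var` limit are hypothetical, mod_sites are taken as 1-based residue positions as in the table, and the Acetyl@Protein N-term variants are left out for brevity.

```python
from itertools import combinations

def enumerate_mods(sequence, var_mods=("Oxidation@M",),
                   fix_mods=("Carbamidomethyl@C",), max_var=2):
    """Return (mods, mod_sites) strings for every variable-mod combination."""
    def sites_for(mods):
        # 1-based positions of residues targeted by any of the given mods.
        return [(mod, i + 1) for i, aa in enumerate(sequence)
                for mod in mods if mod.split("@")[1] == aa]

    fixed = sites_for(fix_mods)
    candidates = sites_for(var_mods)
    forms = []
    for n in range(min(max_var, len(candidates)) + 1):
        for chosen in combinations(candidates, n):
            entries = list(chosen) + fixed  # variable mods first, then fixed
            forms.append((";".join(m for m, _ in entries),
                          ";".join(str(s) for _, s in entries)))
    return forms

for mods, mod_sites in enumerate_mods("MABCDESTKAFGHIJKLMNOPQR"):
    print(mods, "|", mod_sites)
```

For MABCDESTKAFGHIJKLMNOPQR this yields four combinations, which correspond to the four non-acetylated entries of that sequence in the table.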
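alphabase.protein.fasta.SpecLibFasta.add_additional_modifications is designed specifically for modifications such as Phospho, which may generate thousands of peptidoforms for a peptide with multiple phospho sites. The cell below calls the module-level function append_special_modifications and limits the output with min_mod_num, max_mod_num, and max_peptidoform_num.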
[8]:
from alphabase.protein.fasta import append_special_modifications
fasta_lib._precursor_df = append_special_modifications(
fasta_lib.precursor_df, ['Phospho@S','Phospho@T'],
min_mod_num=0, max_mod_num=1, max_peptidoform_num=100
)
fasta_lib.precursor_df
[8]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | proteins | genes |
---|---|---|---|---|---|---|---|---|---|---|
0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 | xx;yy | gene |
1 | AFGHIJK | 0;1 | 0 | True | True | Acetyl@Protein N-term | 0 | 7 | xx;yy | gene |
2 | LMNOPQR | 0;1 | 0 | False | True | Oxidation@M | 2 | 7 | xx;yy | gene |
3 | LMNOPQR | 0;1 | 0 | False | True | | | 7 | xx;yy | gene |
4 | ABCDESTK | 0 | 0 | True | False | Carbamidomethyl@C;Phospho@S | 3;6 | 8 | xx | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
79 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4;8 | 23 | xx | |
80 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4 | 23 | xx | |
81 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Carbamidomethyl@C;Phospho@S | 0;4;7 | 23 | xx | |
82 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Carbamidomethyl@C;Phospho@T | 0;4;8 | 23 | xx | |
83 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Carbamidomethyl@C | 0;4 | 23 | xx |
84 rows × 10 columns
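The reason for the limits becomes clear when the site combinations are enumerated: a peptide with 10 S/T sites and no cap has 2^10 = 1024 possible phospho patterns. The sketch below is only an illustration. The function `special_mod_sites` is hypothetical; it borrows the parameter names from the call above but makes no claim about how alphabase orders or truncates the combinations.

```python
from itertools import combinations, islice

def special_mod_sites(sequence, targets=("S", "T"),
                      min_mod_num=0, max_mod_num=1, max_peptidoform_num=100):
    """Enumerate site combinations for a special mod, capped in size and count."""
    # 1-based positions of residues that can carry the modification.
    sites = [i + 1 for i, aa in enumerate(sequence) if aa in targets]

    def generate():
        for n in range(min_mod_num, min(max_mod_num, len(sites)) + 1):
            yield from combinations(sites, n)

    # Stop early once the cap on peptidoforms is reached.
    return list(islice(generate(), max_peptidoform_num))

print(special_mod_sites("ABCDESTK"))                 # [(), (6,), (7,)]
print(special_mod_sites("ABCDESTK", max_mod_num=2))  # [(), (6,), (7,), (6, 7)]
```

With min_mod_num=0 and max_mod_num=1, every peptidoform containing S and T becomes three, which is consistent with the growth from 36 to 84 rows: 24 of the 36 peptidoforms contain S and T (24 × 3 = 72) and the remaining 12 are unchanged.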
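add_peptide_labeling is a flexible method for adding peptide labeling: each key of the dict names a labeling channel and maps to the list of labeling modifications for that channel.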
[9]:
fasta_lib.add_peptide_labeling({
'': [], # not labelled for reference
'0': ['Dimethyl@Any N-term','Dimethyl@K'],
'8': ['Dimethyl:2H(6)13C(2)@Any N-term','Dimethyl:2H(6)13C(2)@K'],
})
fasta_lib.precursor_df
[9]:
| | sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | proteins | genes | labeling_channel |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFGHIJK | 0;1 | 0 | True | True | | | 7 | xx;yy | gene | |
1 | AFGHIJK | 0;1 | 0 | True | True | Acetyl@Protein N-term | 0 | 7 | xx;yy | gene | |
2 | LMNOPQR | 0;1 | 0 | False | True | Oxidation@M | 2 | 7 | xx;yy | gene | |
3 | LMNOPQR | 0;1 | 0 | False | True | | | 7 | xx;yy | gene | |
4 | ABCDESTK | 0 | 0 | True | False | Carbamidomethyl@C;Phospho@S | 3;6 | 8 | xx | | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
247 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4;8;9;16 | 23 | xx | | 8 |
248 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Oxidation@M;Oxidation@M;... | 0;1;18;4;9;16 | 23 | xx | | 8 |
249 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Carbamidomethyl@C;Phosph... | 0;4;7;9;16 | 23 | xx | | 8 |
250 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Carbamidomethyl@C;Phosph... | 0;4;8;9;16 | 23 | xx | | 8 |
251 | MABCDESTKAFGHIJKLMNOPQR | 0 | 2 | True | False | Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 23 | xx | | 8 |
252 rows × 11 columns
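The three channels triple the table (84 × 3 = 252 rows), and each labeled copy gains labeling modifications in mods and mod_sites. The sketch below imitates that for a single peptidoform. It is an illustration only: `label_peptide` is a hypothetical helper, and the rule that an already modified N-terminus (site 0) does not receive the N-terminal label is inferred from the printed mod_sites, such as `0;4;9;16` in row 251, not taken from the library source.

```python
def label_peptide(sequence, mods, mod_sites, nterm_label, k_label):
    """Append labeling mods to ';'-joined mods / mod_sites strings."""
    mod_list = mods.split(";") if mods else []
    site_list = [int(s) for s in mod_sites.split(";")] if mod_sites else []
    # Assumption: label the N-terminus only if site 0 is still free.
    if 0 not in site_list:
        mod_list.append(nterm_label)
        site_list.append(0)
    # Label every lysine, using 1-based residue positions.
    for position, aa in enumerate(sequence, start=1):
        if aa == "K":
            mod_list.append(k_label)
            site_list.append(position)
    return ";".join(mod_list), ";".join(str(s) for s in site_list)

mods, sites = label_peptide(
    "MABCDESTKAFGHIJKLMNOPQR",
    "Acetyl@Protein N-term;Carbamidomethyl@C", "0;4",
    "Dimethyl:2H(6)13C(2)@Any N-term", "Dimethyl:2H(6)13C(2)@K",
)
print(sites)  # 0;4;9;16
```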
[10]:
fasta_lib.add_charge()
fasta_lib.precursor_df[['sequence','mods','mod_sites','charge']]
[10]:
| | sequence | mods | mod_sites | charge |
---|---|---|---|---|
0 | AFGHIJK | | | 2 |
1 | AFGHIJK | | | 3 |
2 | AFGHIJK | | | 4 |
3 | AFGHIJK | Acetyl@Protein N-term | 0 | 2 |
4 | AFGHIJK | Acetyl@Protein N-term | 0 | 3 |
... | ... | ... | ... | ... |
751 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C;Phosph... | 0;4;8;9;16 | 3 |
752 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C;Phosph... | 0;4;8;9;16 | 4 |
753 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 2 |
754 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 3 |
755 | MABCDESTKAFGHIJKLMNOPQR | Acetyl@Protein N-term;Carbamidomethyl@C;Dimeth... | 0;4;9;16 | 4 |
756 rows × 4 columns
Append precursor mz and isotope information
[11]:
fasta_lib.calc_precursor_mz()
fasta_lib.calc_precursor_isotope()
fasta_lib.precursor_df[['precursor_mz']+[col for col in fasta_lib.precursor_df.columns if col.startswith('i_')]]
/Users/wenfengzeng/workspace/alphabase/alphabase/peptide/precursor.py:613: RuntimeWarning: invalid value encountered in divide
precursor_dist /= np.sum(precursor_dist, axis=1, keepdims=True)
[11]:
| | precursor_mz | i_0 | i_1 | i_2 | i_3 | i_4 | i_5 |
---|---|---|---|---|---|---|---|
0 | 3.932371e+02 | 0.625822 | 0.285918 | 0.072883 | 0.013411 | 0.001966 | 0.0 |
1 | 2.624938e+02 | 0.625822 | 0.285918 | 0.072883 | 0.013411 | 0.001966 | 0.0 |
2 | 1.971222e+02 | 0.625822 | 0.285918 | 0.072883 | 0.013411 | 0.001966 | 0.0 |
3 | 4.142423e+02 | 0.610921 | 0.292699 | 0.078690 | 0.015312 | 0.002378 | 0.0 |
4 | 2.764973e+02 | 0.610921 | 0.292699 | 0.078690 | 0.015312 | 0.002378 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
751 | 4.000960e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
752 | 3.000720e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
753 | 6.001400e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
754 | 4.000934e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
755 | 3.000700e+06 | NaN | NaN | NaN | NaN | NaN | NaN |
756 rows × 7 columns
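The first three rows are the same peptide (AFGHIJK, unmodified) at charges 2, 3, and 4, so they illustrate the standard relation m/z = (M + z · m_proton) / z. The sketch below reproduces those values. The neutral mass 784.4596 is back-calculated from row 0 rather than taken from alphabase, and `precursor_mz` is a stand-alone helper written for this example, not the library function.

```python
PROTON_MASS = 1.007276  # Da

def precursor_mz(neutral_mass, charge):
    """m/z of a peptide with the given neutral monoisotopic mass and charge."""
    return (neutral_mass + charge * PROTON_MASS) / charge

neutral_mass = 784.4596  # back-calculated from row 0 (AFGHIJK, charge 2)
for charge in (2, 3, 4):
    print(charge, round(precursor_mz(neutral_mass, charge), 4))
# 2 393.2371
# 3 262.4938
# 4 197.1222
```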
Use alphabase.spectral_library.base.SpecLibBase.calc_fragment_mz_df to calculate the fragment m/z dataframe.
[12]:
fasta_lib.calc_fragment_mz_df()
fasta_lib.fragment_mz_df
[12]:
| | b_z1 | y_z1 |
---|---|---|
0 | 7.204439e+01 | 714.429749 |
1 | 2.191128e+02 | 567.361328 |
2 | 2.761343e+02 | 510.339844 |
3 | 4.131932e+02 | 373.280945 |
4 | 5.262772e+02 | 260.196869 |
... | ... | ... |
11911 | 1.200205e+07 | 751.420959 |
11912 | 1.200216e+07 | 637.377991 |
11913 | 1.200240e+07 | 400.230286 |
11914 | 1.200250e+07 | 303.177521 |
11915 | 1.200262e+07 | 175.118958 |
11916 rows × 2 columns
calc_fragment_mz_df() also generates the pointers frag_start_idx and frag_stop_idx in precursor_df, which locate the fragments of each precursor.
[13]:
fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']]
[13]:
| | frag_start_idx | frag_stop_idx |
---|---|---|
0 | 0 | 6 |
1 | 6 | 12 |
2 | 12 | 18 |
3 | 18 | 24 |
4 | 24 | 30 |
... | ... | ... |
751 | 11806 | 11828 |
752 | 11828 | 11850 |
753 | 11850 | 11872 |
754 | 11872 | 11894 |
755 | 11894 | 11916 |
756 rows × 2 columns
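The pointer values follow a simple pattern: a peptide of length nAA occupies nAA - 1 consecutive rows of the fragment dataframe, so the stop indices are a running sum of nAA - 1. A minimal sketch of that bookkeeping, with `fragment_pointers` as a hypothetical helper rather than an alphabase function:

```python
from itertools import accumulate

def fragment_pointers(peptide_lengths):
    """Return (start, stop) row ranges, one per precursor, stop exclusive."""
    counts = [n - 1 for n in peptide_lengths]  # nAA - 1 fragment rows each
    stops = list(accumulate(counts))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))

# Three precursors of length 7, like the first rows of precursor_df.
print(fragment_pointers([7, 7, 7]))  # [(0, 6), (6, 12), (12, 18)]
```

The same arithmetic accounts for the last row of the table: a 23-residue peptide spans 22 rows, from 11894 to 11916.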
Note that all fragment ions are stored from the peptide’s N-terminus to its C-terminus, so the b-ions are in ascending order (from b1 to b(n-1)) and the y-ions are in descending order (from y(n-1) to y1), where n is the peptide length.
[14]:
start, end = fasta_lib.precursor_df[['frag_start_idx','frag_stop_idx']].values[1]
fasta_lib.fragment_mz_df.iloc[start:end,:]
[14]:
| | b_z1 | y_z1 |
---|---|---|
6 | 72.044388 | 714.429749 |
7 | 219.112808 | 567.361328 |
8 | 276.134277 | 510.339844 |
9 | 413.193176 | 373.280945 |
10 | 526.277222 | 260.196869 |
11 | 639.361328 | 147.112808 |
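The row layout can be checked by computing singly charged b- and y-ions by hand for AFGHIJK. The sketch below uses standard monoisotopic residue masses. Treating the non-standard letter J as 113.084064 (the Leu/Ile mass) is an assumption that happens to agree with the printed values, and `by_ions` is an illustrative helper, not the alphabase routine.

```python
# Monoisotopic residue masses in Da; the value for 'J' is an assumption.
RESIDUE_MASS = {
    "A": 71.037114, "F": 147.068414, "G": 57.021464, "H": 137.058912,
    "I": 113.084064, "J": 113.084064, "K": 128.094963,
}
PROTON = 1.007276
WATER = 18.010565

def by_ions(sequence):
    """Singly charged b/y m/z values, one row per cleavage site from N- to C-term."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses)
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_ions.append(prefix + PROTON)                   # b1, b2, ...
        y_ions.append(total - prefix + WATER + PROTON)   # y(n-1), ..., y1
    return b_ions, y_ions

b_ions, y_ions = by_ions("AFGHIJK")
for b, y in zip(b_ions, y_ions):
    print(round(b, 4), round(y, 4))
```

Row i pairs b(i+1) with y(n-1-i), so the two ions in a row come from the same cleavage site.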
Save protein_df, precursor_df, fragment_mz_df, and fragment_intensity_df to an HDF file.
[15]:
# fasta_lib.save_hdf('path/to/hdf_file.hdf')