alphabase.protein.fasta
SpecLibFasta provides the highest-level APIs, built on all other functionality in AlphaBase.
See the examples in the library_from_fasta notebook.
Classes:
Digest |
SpecLibFasta | This is the main entry of AlphaBase when generating spectral libraries from fasta files.
Functions:
add_single_peptide_labeling |
annotate_precursor_df | Annotate a list of peptides with genes and proteins by using an ahocorasick automaton.
append_special_modifications | Append special (not N/C-term) variable modifications to the existing modifications of each sequence in df.
cleave_sequence_with_cut_pos | Cleave a sequence with cut positions (cut_pos).
concat_proteins | Concatenate all protein sequences into a single sequence, separated by sep ($ by default).
create_labeling_peptide_df |
get_candidate_sites | Get candidate modification sites.
get_fix_mods | Generate fixed modifications for the sequence.
get_var_mod_sites | Get all combinations of variable modification sites.
get_var_mods | Generate all modification combinations and associated sites for the sequence.
get_var_mods_per_sites | Used when the var mod list contains only one mod on each AA, for example: Mod1@A, Mod2@D ...
get_var_mods_per_sites_multi_mods_on_aa | Used only when the var mod list contains more than one mod on the same AA, for example: Mod1@A, Mod2@A ...
get_var_mods_per_sites_single_mod_on_aa | Used when the var mod list contains only one mod on each AA, for example: Mod1@A, Mod2@D ...
read_fasta_file | Read a FASTA file line by line.
Data:
protease_dict | Pre-built protease dict with regular expressions.
- class alphabase.protein.fasta.Digest(protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 6, peptide_length_max: int = 45)
Bases:
object
Methods:
__init__([protease, max_missed_cleavages, ...]) | Digest a protein sequence
cleave_sequence(sequence) | Cleave a sequence.
get_cut_positions(sequence) |
- __init__(protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 6, peptide_length_max: int = 45)
Digest a protein sequence
- Parameters:
protease (str, optional) – Protease name, either a pre-defined name in protease_dict or a regular expression. By default ‘trypsin’
max_missed_cleavages (int, optional) – Maximal number of missed cleavage sites, by default 2
peptide_length_min (int, optional) – Minimal cleaved peptide length, by default 6
peptide_length_max (int, optional) – Maximal cleaved peptide length, by default 45
- cleave_sequence(sequence: str) → tuple
Cleave a sequence.
- Parameters:
sequence (str) – The given (protein) sequence.
- Returns:
list[str]: cleaved peptide sequences with missed cleavages
list[int]: number of missed cleavages of each peptide
list[bool]: whether each peptide is at the protein N-term
list[bool]: whether each peptide is at the protein C-term
- Return type:
tuple[list]
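To illustrate how a protease regular expression can be turned into cut positions, here is a minimal sketch in plain Python. It is not the library's implementation: the function name cut_positions_sketch is hypothetical, and it assumes that a cut happens directly after each regex match and that the protein start and end always count as cut positions. The two patterns are copied from protease_dict.

```python
import re

# Two patterns copied from protease_dict on this page.
TRYPSIN = "([KR])"
TRYPSIN_NOT_P = "([KR](?=[^P]))"


def cut_positions_sketch(sequence: str, protease_regex: str) -> list:
    """Return sorted cut positions as indices into `sequence`.

    Assumption (not taken from the library source): the cut is placed
    directly after each regex match, and 0 and len(sequence) are always
    included so that terminal peptides can be formed.
    """
    positions = [0]
    for match in re.finditer(protease_regex, sequence):
        end = match.end()
        # Skip cuts that coincide with the protein start or end.
        if 0 < end < len(sequence):
            positions.append(end)
    positions.append(len(sequence))
    return positions


if __name__ == "__main__":
    print(cut_positions_sketch("MKPLRAAK", TRYPSIN))
    print(cut_positions_sketch("MKPLRAAK", TRYPSIN_NOT_P))
```

With the lookahead in TRYPSIN_NOT_P, the K that is followed by P in this toy sequence no longer produces a cut, which is the only difference between the two patterns.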
- class alphabase.protein.fasta.SpecLibFasta(charged_frag_types: list = ['b_z1', 'b_z2', 'y_z1', 'y_z2'], *, protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 7, peptide_length_max: int = 35, precursor_charge_min: int = 2, precursor_charge_max: int = 4, precursor_mz_min: float = 400.0, precursor_mz_max: float = 2000.0, var_mods: list = ['Acetyl@Protein_N-term', 'Oxidation@M'], min_var_mod_num: int = 0, max_var_mod_num: int = 2, fix_mods: list = ['Carbamidomethyl@C'], labeling_channels: dict = None, special_mods: list = [], min_special_mod_num: int = 0, max_special_mod_num: int = 1, special_mods_cannot_modify_pep_n_term: bool = False, special_mods_cannot_modify_pep_c_term: bool = False, decoy: str = None, include_contaminants: bool = False, I_to_L: bool = False)
Bases:
SpecLibBase
This is the main entry point of AlphaBase for generating spectral libraries from fasta files. It includes functionality to:
Load protein sequences
Digest protein sequences
Append decoy peptides
Add fixed, variable or labeling modifications to the peptide sequences
Add charge states
Save libraries into hdf file
- max_peptidoform_num
For some modifications such as Phospho, there may be thousands of peptidoforms generated for some peptides, so we use this attribute to control the overall number of peptidoforms of a peptide.
- Type:
int, 100 by default
- protein_df
Protein dataframe with columns ‘protein_id’, ‘sequence’, ‘description’, ‘gene_name’, etc.
- Type:
pd.DataFrame
Methods:
__init__([charged_frag_types, protease, ...]) |
add_charge() | Add charge states
add_modifications() | Add fixed and variable modifications to all peptide sequences in self.precursor_df
add_mods_for_one_seq(sequence, ...) | Add fixed and variable modifications to a sequence
add_peptide_labeling([labeling_channel_dict]) | Add labeling onto the peptides of self._precursor_df in place
add_special_modifications() | Add externally defined variable modifications to all peptide sequences in self._precursor_df.
get_peptides_from_fasta(fasta_file) | Load peptide sequences from fasta files.
get_peptides_from_fasta_list(fasta_files) | Load peptide sequences from fasta file list
get_peptides_from_protein_df(protein_df) |
get_peptides_from_protein_dict(protein_dict) | Cleave the protein sequences in protein_dict.
import_and_process_fasta(fasta_files) | Import and process a fasta file or a list of fasta files.
import_and_process_peptide_sequences(...[, ...]) | Import and process peptide sequences instead of proteins.
import_and_process_protein_df(protein_df) | Import and process the protein_df.
import_and_process_protein_dict(protein_dict) | Import and process the protein_dict.
load_hdf(hdf_file[, load_mod_seq]) | Load contents from hdf file.
process_from_naked_peptide_seqs() | The peptide processing step which is called by import_and_process_... methods.
save_hdf(hdf_file) | Save the contents into hdf file.
- __init__(charged_frag_types: list = ['b_z1', 'b_z2', 'y_z1', 'y_z2'], *, protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 7, peptide_length_max: int = 35, precursor_charge_min: int = 2, precursor_charge_max: int = 4, precursor_mz_min: float = 400.0, precursor_mz_max: float = 2000.0, var_mods: list = ['Acetyl@Protein_N-term', 'Oxidation@M'], min_var_mod_num: int = 0, max_var_mod_num: int = 2, fix_mods: list = ['Carbamidomethyl@C'], labeling_channels: dict = None, special_mods: list = [], min_special_mod_num: int = 0, max_special_mod_num: int = 1, special_mods_cannot_modify_pep_n_term: bool = False, special_mods_cannot_modify_pep_c_term: bool = False, decoy: str = None, include_contaminants: bool = False, I_to_L: bool = False)
- Parameters:
charged_frag_types (list, optional) – Fragment types with charge, by default [‘b_z1’, ‘b_z2’, ‘y_z1’, ‘y_z2’]
protease (str, optional) – Either a pre-defined protease name in protease_dict or a regular expression. By default ‘trypsin’
max_missed_cleavages (int, optional) – Maximal missed cleavages, by default 2
peptide_length_min (int, optional) – Minimal cleaved peptide length, by default 7
peptide_length_max (int, optional) – Maximal cleaved peptide length, by default 35
precursor_charge_min (int, optional) – Minimal precursor charge, by default 2
precursor_charge_max (int, optional) – Maximal precursor charge, by default 4
precursor_mz_min (float, optional) – Minimal precursor m/z, by default 400.0
precursor_mz_max (float, optional) – Maximal precursor m/z, by default 2000.0
var_mods (list, optional) – List of variable modifications, by default [‘Acetyl@Protein_N-term’, ‘Oxidation@M’]
min_var_mod_num (int, optional) – Minimal number of variable modifications on a peptide sequence, by default 0
max_var_mod_num (int, optional) – Maximal number of variable modifications on a peptide sequence, by default 2
fix_mods (list, optional) – List of fixed modifications, by default [‘Carbamidomethyl@C’]
labeling_channels (dict, optional) – Add isotope labeling with different channels, see add_peptide_labeling(). Defaults to None
special_mods (list, optional) – Modifications with a special occurrence per peptide. This is useful for modifications like Phospho, which may greatly inflate the number of candidate modified peptides. The number of special_mods per peptide is controlled by min_special_mod_num and max_special_mod_num. Defaults to [].
min_special_mod_num (int, optional) – Control the min number of special_mods per peptide, by default 0.
max_special_mod_num (int, optional) – Control the max number of special_mods per peptide, by default 1.
special_mods_cannot_modify_pep_c_term (bool, optional) – Some modifications cannot modify the peptide C-term. This is useful for GlyGly@K: if the C-term K carries GlyGly, it cannot be cleaved/digested. Defaults to False.
special_mods_cannot_modify_pep_n_term (bool, optional) – Similar to special_mods_cannot_modify_pep_c_term, but at the N-term. Defaults to False.
decoy (str, optional) – Decoy type (see alphabase.spectral_library.base.append_decoy_sequence()), by default None:
protein_reverse: Reverse on target protein sequences
pseudo_reverse: Pseudo-reverse on target peptide sequences
diann: DiaNN-like decoy
None: no decoy
include_contaminants (bool, optional) – Whether to include contaminants.fasta, by default False
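As a rough illustration of how the precursor_charge_min/precursor_charge_max and precursor_mz_min/precursor_mz_max settings interact, the sketch below computes the m/z of an unmodified peptide for each charge state and keeps those inside the window. It is a plain-Python sketch, not AlphaBase code: the function name is hypothetical, the residue masses are standard monoisotopic values, and treating the m/z window as an inclusive filter is an assumption.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def precursor_mzs_sketch(sequence, charge_min=2, charge_max=4,
                         mz_min=400.0, mz_max=2000.0):
    """Map each kept charge state to the m/z of the unmodified peptide.

    Assumption: precursors outside [mz_min, mz_max] are simply dropped.
    """
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    kept = {}
    for charge in range(charge_min, charge_max + 1):
        mz = (neutral_mass + charge * PROTON) / charge
        if mz_min <= mz <= mz_max:
            kept[charge] = mz
    return kept


if __name__ == "__main__":
    print(precursor_mzs_sketch("PEPTIDEK"))
```

For a short peptide such as PEPTIDEK only the 2+ precursor falls inside the default 400 to 2000 window, so a narrow window can remove most charge states of short peptides.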
- add_modifications()
Add fixed and variable modifications to all peptide sequences in self.precursor_df
- add_mods_for_one_seq(sequence: str, is_prot_nterm, is_prot_cterm) → tuple
Add fixed and variable modifications to a sequence
- Parameters:
sequence (str) – Peptide sequence
is_prot_nterm (bool) – Whether the peptide is at the protein N-term
is_prot_cterm (bool) – Whether the peptide is at the protein C-term
- Returns:
list[str]: list of modification names
list[str]: list of modification sites
- Return type:
tuple
- add_peptide_labeling(labeling_channel_dict: dict = None)
Add labeling onto the peptides of self._precursor_df in place.
- Parameters:
labeling_channel_dict (dict, optional) – For example:
`
{
    -1: [], # not labeled
    0: ['Dimethyl@Any_N-term','Dimethyl@K'],
    4: ['Dimethyl:2H(4)@Any_N-term','Dimethyl:2H(4)@K'],
    8: ['Dimethyl:2H(6)13C(2)@Any_N-term','Dimethyl:2H(6)13C(2)@K'],
}
`
The key could be an int (highly recommended, and possibly required in the future) or a str, and the value must be a list of modification names (str) in alphabase format. It is set to self.labeling_channels if None. Defaults to None
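The following sketch shows one way a labeling channel dict of this shape could be expanded into one row per channel for a single peptide. It is a plain-Python illustration rather than the library's code: the function names and the ‘labeling_channel’ key are hypothetical, and it assumes that modification names and sites are stored as ‘;’-joined strings with site 0 for the N-term, -1 for the C-term and 1..n for residues.

```python
def label_peptide_sketch(sequence, label_mods):
    """Return (mods, mod_sites) strings for one channel's label mods."""
    mods, sites = [], []
    for mod in label_mods:
        # 'Dimethyl:2H(4)@K' -> target 'K'; the target follows the last '@'.
        target = mod.rsplit("@", 1)[1]
        if target == "Any_N-term":
            mods.append(mod)
            sites.append(0)
        elif target == "Any_C-term":
            mods.append(mod)
            sites.append(-1)
        else:
            # Residue sites are 1-based.
            for pos, aa in enumerate(sequence, start=1):
                if aa == target:
                    mods.append(mod)
                    sites.append(pos)
    return ";".join(mods), ";".join(str(s) for s in sites)


def expand_channels_sketch(sequence, labeling_channel_dict):
    """Create one row (dict) per labeling channel for a peptide."""
    rows = []
    for channel, label_mods in labeling_channel_dict.items():
        mods, mod_sites = label_peptide_sketch(sequence, label_mods)
        rows.append({
            "sequence": sequence,
            "mods": mods,
            "mod_sites": mod_sites,
            "labeling_channel": channel,  # hypothetical column name
        })
    return rows


if __name__ == "__main__":
    channels = {-1: [], 0: ["Dimethyl@Any_N-term", "Dimethyl@K"]}
    for row in expand_channels_sketch("AKDEK", channels):
        print(row)
```

The unlabeled channel (-1) yields empty mods and mod_sites strings, while channel 0 labels the N-term and both lysines.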
- add_special_modifications()
Add externally defined variable modifications to all peptide sequences in self._precursor_df. See append_special_modifications() for details.
- get_peptides_from_fasta(fasta_file: str | list)
Load peptide sequences from fasta files.
- Parameters:
fasta_file (Union[str,list]) – Could be a fasta file (str) or a list of fasta files (list[str])
- get_peptides_from_fasta_list(fasta_files: list)
Load peptide sequences from a list of fasta files.
- Parameters:
fasta_files (list) – fasta file list
- get_peptides_from_peptide_sequence_list(pep_seq_list: list, protein_list: list = None)
- get_peptides_from_protein_dict(protein_dict: dict)
Cleave the protein sequences in protein_dict.
- Parameters:
protein_dict (dict) – Format:
`
{
    'prot_id1': {'protein_id': 'prot_id1', 'sequence': string, 'gene_name': string, 'description': string},
    'prot_id2': {...},
    ...
}
`
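To make the protein_dict format concrete, here is a minimal sketch that builds such a dict from FASTA text. It is a stand-alone illustration, not the library's reader: the function name is hypothetical, the header parsing assumes UniProt-style headers (‘db|ACCESSION|ENTRY ... GN=GENE’) with a fallback to the first word, and the accession in the example is made up.

```python
import re


def protein_dict_from_fasta_text_sketch(fasta_text):
    """Parse FASTA text into {protein_id: {...}} in the format shown above."""
    protein_dict = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            # Assumed UniProt-style header; otherwise use the first word.
            if len(parts) >= 3:
                protein_id = parts[1]
            else:
                protein_id = header.split()[0]
            gene = re.search(r"GN=(\S+)", header)
            current = {
                "protein_id": protein_id,
                "sequence": "",
                "gene_name": gene.group(1) if gene else "",
                "description": header,
            }
            protein_dict[protein_id] = current
        elif current is not None:
            # Sequence lines may be wrapped; concatenate them.
            current["sequence"] += line
    return protein_dict


if __name__ == "__main__":
    text = (
        ">sp|P00001|TEST_HUMAN Example protein GN=TST1 PE=1\n"
        "MKPLR\nAAK\n"
        ">plain_id some text\n"
        "ACDEFGK\n"
    )
    for key, value in protein_dict_from_fasta_text_sketch(text).items():
        print(key, value)
```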
- import_and_process_fasta(fasta_files: list)
Import and process a fasta file or a list of fasta files. It includes two steps:
1. Digest and get peptide sequences, it uses self.get_peptides_from_…()
2. Process the peptides, including adding modifications, it uses process_from_naked_peptide_seqs().
- Parameters:
fasta_files (list) – A fasta file or a list of fasta files
- import_and_process_peptide_sequences(pep_seq_list: list, protein_list: list = None)
Import and process peptide sequences instead of proteins. The processing step is in process_from_naked_peptide_seqs().
- Parameters:
pep_seq_list (list) – Peptide sequence list
protein_list (list, optional) – Protein id list which maps to pep_seq_list one-to-one, by default None
- import_and_process_protein_df(protein_df: DataFrame)
Import and process the protein_df. The processing step is in process_from_naked_peptide_seqs().
` protein_dict = load_all_proteins(fasta_files) `
- Parameters:
protein_df (pd.DataFrame) – DataFrame with columns ‘protein_id’, ‘sequence’, ‘gene_name’, ‘description’
- import_and_process_protein_dict(protein_dict: dict)
Import and process the protein_dict. The processing step is in process_from_naked_peptide_seqs().
` protein_dict = load_all_proteins(fasta_files) `
- Parameters:
protein_dict (dict) – Format: { ‘prot_id1’: {‘protein_id’: ‘prot_id1’, ‘sequence’: string, ‘gene_name’: string, ‘description’: string}, ‘prot_id2’: {…}, … }
- load_hdf(hdf_file: str, load_mod_seq: bool = False)
Load contents from hdf file:
self.precursor_df <- library/precursor_df
self.precursor_df <- library/mod_seq_df if load_mod_seq is True
self.protein_df <- library/protein_df
self.fragment_mz_df <- library/fragment_mz_df
self.fragment_intensity_df <- library/fragment_intensity_df
- Parameters:
hdf_file (str) – hdf file path
load_mod_seq (bool, optional) – After the library is generated with hash values (int64) for sequences (str) and modifications (str), we don’t need sequence information for searching. So we can skip loading sequences to make the loading much faster. By default False
- process_from_naked_peptide_seqs()
The peptide processing step which is called by import_and_process_… methods.
- save_hdf(hdf_file: str)
Save the contents into hdf file (attribute -> hdf_file):
self.precursor_df -> library/precursor_df
self.protein_df -> library/protein_df
self.fragment_mz_df -> library/fragment_mz_df
self.fragment_intensity_df -> library/fragment_intensity_df
- Parameters:
hdf_file (str) – The hdf file path
- alphabase.protein.fasta.add_single_peptide_labeling(seq: str, mods: str, mod_sites: str, label_aas: str, label_mod_dict: dict, nterm_label_mod: str, cterm_label_mod: str)
- alphabase.protein.fasta.annotate_precursor_df(precursor_df: DataFrame, protein_df: DataFrame)
Annotate a list of peptides with genes and proteins by using an ahocorasick automaton.
- Parameters:
precursor_df (pd.DataFrame) – A dataframe containing a sequence column.
protein_df (pd.DataFrame) – Protein dataframe containing a sequence column.
- Returns:
Updated precursor_df with genes, proteins and cardinality columns.
- Return type:
pd.DataFrame
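The annotation idea can be sketched without an automaton. The example below swaps the ahocorasick automaton for a naive substring search, which gives the same matches on small inputs but scales much worse. The function name is hypothetical, and two things are assumptions: that matches are joined with ‘;’, and that cardinality means the number of matched proteins.

```python
def annotate_peptides_sketch(peptides, proteins):
    """Annotate peptides with matching proteins and genes.

    `proteins` is a list of dicts with 'protein_id', 'gene_name' and
    'sequence' keys. Naive substring search stands in for the automaton.
    """
    rows = []
    for pep in peptides:
        hits = [p for p in proteins if pep in p["sequence"]]
        rows.append({
            "sequence": pep,
            "proteins": ";".join(p["protein_id"] for p in hits),
            "genes": ";".join(p["gene_name"] for p in hits),
            # Assumption: cardinality counts the matched proteins.
            "cardinality": len(hits),
        })
    return rows


if __name__ == "__main__":
    proteins = [
        {"protein_id": "P1", "gene_name": "G1", "sequence": "MKPLRAAK"},
        {"protein_id": "P2", "gene_name": "G2", "sequence": "AAKPLRDE"},
    ]
    for row in annotate_peptides_sketch(["PLR", "MKPLR", "WWW"], proteins):
        print(row)
```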
- alphabase.protein.fasta.append_special_modifications(df: DataFrame, var_mods: list = ['Phospho@S', 'Phospho@T', 'Phospho@Y'], min_mod_num: int = 0, max_mod_num: int = 1, max_peptidoform_num: int = 100, cannot_modify_pep_nterm_aa: bool = False, cannot_modify_pep_cterm_aa: bool = False) → DataFrame
Append special (not N/C-term) variable modifications to the existing modifications of each sequence in df.
- Parameters:
df (pd.DataFrame) – Precursor dataframe
var_mods (list, optional) – Considered variable modification list. Defaults to [‘Phospho@S’, ‘Phospho@T’, ‘Phospho@Y’].
min_mod_num (int, optional) – Minimal modification number for each sequence of the var_mods. Defaults to 0.
max_mod_num (int, optional) – Maximal modification number for each sequence of the var_mods. Defaults to 1.
max_peptidoform_num (int, optional) – One sequence is only allowed to expand into at most max_peptidoform_num modified peptides. Defaults to 100.
cannot_modify_pep_nterm_aa (bool, optional) – Similar to cannot_modify_pep_cterm_aa, but for the peptide N-term AA. By default False
cannot_modify_pep_cterm_aa (bool, optional) – If the candidate AA is at the peptide C-term, the modification cannot be placed on it. For example, with GlyGly@K and the peptide ACDKEFGK: if the C-term K carried GlyGly, trypsin could not cleave after it, so the modified peptide ACDKEFGK(GlyGly) is not generated. By default False
- Returns:
The precursor_df with new modifications added.
- Return type:
pd.DataFrame
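The GlyGly@K case can be worked through with a small sketch for a single peptide. This is not the library's DataFrame-based implementation: the function name is hypothetical, it handles one special modification on one peptide, it assumes ‘;’-joined mods and mod_sites strings with 1-based residue sites, and it enforces max_peptidoform_num by simple truncation, which is an assumption.

```python
from itertools import combinations


def append_special_mods_sketch(sequence, mods, mod_sites,
                               special_mod="GlyGly@K",
                               min_mod_num=0, max_mod_num=1,
                               max_peptidoform_num=100,
                               cannot_modify_pep_cterm_aa=False):
    """Return a list of (mods, mod_sites) peptidoforms for one peptide."""
    target_aa = special_mod.rsplit("@", 1)[1]
    sites = [i for i, aa in enumerate(sequence, start=1) if aa == target_aa]
    if cannot_modify_pep_cterm_aa:
        # Exclude the last residue of the peptide as a candidate site.
        sites = [s for s in sites if s != len(sequence)]

    old_mods = [m for m in mods.split(";") if m]
    old_sites = [s for s in mod_sites.split(";") if s]
    forms = []
    for n in range(min_mod_num, max_mod_num + 1):
        for combo in combinations(sites, n):
            if len(forms) >= max_peptidoform_num:
                return forms  # cap reached (truncation is an assumption)
            new_mods = old_mods + [special_mod] * n
            new_sites = old_sites + [str(s) for s in combo]
            forms.append((";".join(new_mods), ";".join(new_sites)))
    return forms


if __name__ == "__main__":
    for form in append_special_mods_sketch(
        "ACDKEFGK", "Carbamidomethyl@C", "2",
        cannot_modify_pep_cterm_aa=True,
    ):
        print(form)
```

With cannot_modify_pep_cterm_aa=True only the internal K at site 4 receives GlyGly; with the flag off, the C-term K at site 8 becomes a candidate as well.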
- alphabase.protein.fasta.cleave_sequence_with_cut_pos(sequence: str, cut_pos: ndarray, n_missed_cleavages: int = 2, pep_length_min: int = 6, pep_length_max: int = 45) → tuple
Cleave a sequence with cut positions (cut_pos). Peptides are filtered by minimum and maximum length.
- Parameters:
sequence (str) – Protein sequence
cut_pos (np.ndarray) – Cut positions determined by a given protease.
n_missed_cleavages (int) – The maximal number of missed cleavages.
pep_length_min (int) – Min peptide length.
pep_length_max (int) – Max peptide length.
- Returns:
List[str]. Cleaved peptide sequences with missed cleavages.
List[int]. Number of missed cleavages of each peptide.
List[bool]. Whether each peptide is at the protein N-term.
List[bool]. Whether each peptide is at the protein C-term.
- Return type:
tuple
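A minimal sketch of cleaving with precomputed cut positions is given below. It uses plain lists instead of np.ndarray, the function name is hypothetical, and it assumes that cut_pos is sorted and includes both 0 and len(sequence). The library function may differ in details such as the ordering of the output.

```python
def cleave_with_cut_pos_sketch(sequence, cut_pos, n_missed_cleavages=2,
                               pep_length_min=6, pep_length_max=45):
    """Return (peptides, missed_cleavages, is_nterm, is_cterm) lists."""
    peptides, missed, is_nterm, is_cterm = [], [], [], []
    n_cuts = len(cut_pos)
    for i in range(n_cuts - 1):
        start = cut_pos[i]
        # Joining (k + 1) consecutive fragments skips k cleavage sites.
        for k in range(n_missed_cleavages + 1):
            j = i + 1 + k
            if j >= n_cuts:
                break
            stop = cut_pos[j]
            length = stop - start
            if length > pep_length_max:
                break  # longer joins from this start only get longer
            if length < pep_length_min:
                continue
            peptides.append(sequence[start:stop])
            missed.append(k)
            is_nterm.append(start == 0)
            is_cterm.append(stop == len(sequence))
    return peptides, missed, is_nterm, is_cterm


if __name__ == "__main__":
    result = cleave_with_cut_pos_sketch(
        "MKPLRAAK", [0, 2, 5, 8],
        n_missed_cleavages=1, pep_length_min=3,
    )
    for column in result:
        print(column)
```

In this toy run the fragment MK is dropped for being shorter than pep_length_min, but it still contributes to the longer peptide MKPLR with one missed cleavage.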
- alphabase.protein.fasta.concat_proteins(protein_dict: dict, sep='$') → str
Concatenate all protein sequences into a single sequence, separated by sep ($ by default).
- Parameters:
protein_dict (dict) – protein_dict by read_fasta_file()
- Returns:
Concatenated sequence separated by sep.
- Return type:
str
- alphabase.protein.fasta.create_labeling_peptide_df(peptide_df: DataFrame, labels: list, inplace: bool = False)
- alphabase.protein.fasta.get_candidate_sites(sequence: str, target_mod_aas: str) → list
Get candidate modification sites.
- Parameters:
sequence (str) – Peptide sequence
target_mod_aas (str) – AAs that may have modifications
- Returns:
Candidate mod sites in alphabase format (0: N-term, -1: C-term, 1-n: others)
- Return type:
list
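The site numbering can be illustrated with a short sketch. It only covers residue sites (1 to n), since this function takes amino acids as targets; the function name is hypothetical and this is not the library's code.

```python
def candidate_sites_sketch(sequence, target_mod_aas):
    """Return 1-based positions of residues that may carry a modification.

    Sites 0 (N-term) and -1 (C-term) are reserved for terminal
    modifications in the alphabase format and are not produced here.
    """
    return [
        pos for pos, aa in enumerate(sequence, start=1)
        if aa in target_mod_aas
    ]


if __name__ == "__main__":
    print(candidate_sites_sketch("AMCDM", "M"))
    print(candidate_sites_sketch("ASTYK", "STY"))
```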
- alphabase.protein.fasta.get_fix_mods(sequence: str, fix_mod_aas: str, fix_mod_dict: dict) → tuple
Generate fixed modifications for the sequence.
- alphabase.protein.fasta.get_var_mod_sites(sequence: str, target_mod_aas: str, min_var_mod: int, max_var_mod: int, max_combs: int) → list
Get all combinations of variable modification sites.
- Parameters:
sequence (str) – Peptide sequence
target_mod_aas (str) – AAs that may have modifications
min_var_mod (int) – Min number of mods in a sequence
max_var_mod (int) – Max number of mods in a sequence
max_combs (int) – Max number of combinations for a sequence
- Returns:
List of combinations (tuple) of modification sites
- Return type:
list
- alphabase.protein.fasta.get_var_mods(sequence: str, var_mod_aas: str, mod_dict: dict, min_var_mod: int, max_var_mod: int, max_combs: int) → tuple
Generate all modification combinations and associated sites for the sequence.
- alphabase.protein.fasta.get_var_mods_per_sites(sequence: str, mod_sites: tuple, var_mod_dict: dict) → list
Used when the var mod list contains only one mod on each AA, for example: Mod1@A, Mod2@D …
- alphabase.protein.fasta.get_var_mods_per_sites_multi_mods_on_aa(sequence: str, mod_sites: tuple, var_mod_dict: dict) → list
Used only when the var mod list contains more than one mod on the same AA, for example: Mod1@A, Mod2@A …
- alphabase.protein.fasta.get_var_mods_per_sites_single_mod_on_aa(sequence: str, mod_sites: tuple, var_mod_dict: dict) → list
Used when the var mod list contains only one mod on each AA, for example: Mod1@A, Mod2@D …
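The difference between the single-mod and multi-mod cases can be sketched as follows. These are hypothetical stand-ins, not the library functions: they assume 1-based residue sites, a var_mod_dict that maps an AA to one modification name (single case) or to a list of names (multi case), and a return value of ‘;’-joined modification strings.

```python
from itertools import product


def mods_per_sites_single_sketch(sequence, mod_sites, var_mod_dict):
    """One mod per AA: each chosen site has exactly one possible mod."""
    return [";".join(var_mod_dict[sequence[site - 1]] for site in mod_sites)]


def mods_per_sites_multi_sketch(sequence, mod_sites, var_mod_dict):
    """Several mods per AA: take the Cartesian product over the sites."""
    choices = [var_mod_dict[sequence[site - 1]] for site in mod_sites]
    return [";".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    single = {"M": "Oxidation@M", "S": "Phospho@S"}
    multi = {"M": ["Oxidation@M", "Dioxidation@M"], "S": ["Phospho@S"]}
    print(mods_per_sites_single_sketch("AMSK", (2, 3), single))
    print(mods_per_sites_multi_sketch("AMSK", (2, 3), multi))
```

The single-mod case yields exactly one modification string per site combination, whereas the multi-mod case multiplies the number of strings by the number of alternatives at each site, which is why a separate code path for it can be worthwhile.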
- alphabase.protein.fasta.protease_dict = {'arg-c': 'R', 'asp-n': '\\w(?=D)', 'bnps-skatole': 'W', 'caspase 1': '(?<=[FWYL]\\w[HAT])D(?=[^PEDQKR])', 'caspase 10': '(?<=IEA)D', 'caspase 2': '(?<=DVA)D(?=[^PEDQKR])', 'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])', 'caspase 4': '(?<=LEV)D(?=[^PEDQKR])', 'caspase 5': '(?<=[LW]EH)D', 'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])', 'caspase 7': '(?<=DEV)D(?=[^PEDQKR])', 'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])', 'caspase 9': '(?<=LEH)D', 'chymotrypsin': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))', 'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))', 'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))', 'clostripain': 'R', 'cnbr': 'M', 'enterokinase': '(?<=[DE]{3})K', 'factor xa': '(?<=[AFGILTVM][DE]G)R', 'formic acid': 'D', 'glu-c': 'E', 'glutamyl endopeptidase': 'E', 'granzyme b': '(?<=IEP)D', 'hydroxylamine': 'N(?=G)', 'iodosobenzoic acid': 'W', 'lys-c': 'K', 'lys-n': '\\w(?=K)', 'no-cleave': '_', 'non-specific': '()', 'ntcb': '\\w(?=C)', 'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\w[^P]))', 'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\w[^P]))', 'proline endopeptidase': '(?<=[HKR])P(?=[^P])', 'proteinase k': '[AEFILTVWY]', 'staphylococcal peptidase i': '(?<=[^E])E', 'thermolysin': '[^DE](?=[AFILMV])', 'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))', 'trypsin': '([KR])', 'trypsin/p': '([KR])', 'trypsin_exception': '((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))', 'trypsin_full': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))', 'trypsin_not_p': '([KR](?=[^P]))'}#
Pre-built protease dict with regular expression.