alphabase.protein.fasta

SpecLibFasta provides the highest-level API in AlphaBase; it builds on the other functionality of the package.

See examples in the library_from_fasta notebook.

Classes:

`Digest`([protease, max_missed_cleavages, ...])
`SpecLibFasta`([charged_frag_types, protease, ...])	Main entry point of AlphaBase for generating spectral libraries from fasta files.

Functions:

`add_single_peptide_labeling`(seq, mods, ...)
`annotate_precursor_df`(precursor_df, protein_df)	Annotate a list of peptides with genes and proteins by using an ahocorasick automaton.
`append_special_modifications`(df[, var_mods, ...])	Append special (not N/C-term) variable modifications to the existing modifications of each sequence in df.
`cleave_sequence_with_cut_pos`(sequence, cut_pos)	Cleave a sequence with cut positions (cut_pos).
`concat_proteins`(protein_dict[, sep])	Concatenate all protein sequences into a single sequence, separated by sep ($ by default).
`create_labeling_peptide_df`(peptide_df, labels)
`get_candidate_sites`(sequence, target_mod_aas)	Get candidate modification sites
`get_fix_mods`(sequence, fix_mod_aas, fix_mod_dict)	Generate fixed modifications for the sequence
`get_uniprot_gene_name`(description)
`get_var_mod_sites`(sequence, target_mod_aas, ...)	Get all combinations of variable modification sites
`get_var_mods`(sequence, var_mod_aas, ...)	Generate all modification combinations and associated sites for the sequence.
`get_var_mods_per_sites`(sequence, mod_sites, ...)	Used when the var mod list contains only one mod per AA, for example: Mod1@A, Mod2@D ...
`get_var_mods_per_sites_multi_mods_on_aa`(...)	Used only when the var mod list contains more than one mod on the same AA, for example: Mod1@A, Mod2@A ...
`get_var_mods_per_sites_single_mod_on_aa`(...)	Used when the var mod list contains only one mod per AA, for example: Mod1@A, Mod2@D ...
`load_all_proteins`(fasta_file_list)
`load_fasta_list_as_protein_df`(fasta_list)
`parse_labels`(labels)
`parse_term_mod`(term_mod_name)
`protein_idxes_to_names`(protein_idxes, ...)
`read_fasta_file`([fasta_filename])	Read a FASTA file line by line

Data:

protease_dict

Pre-built protease dict with regular expression.

class alphabase.protein.fasta.Digest(protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 6, peptide_length_max: int = 45)

Bases: object

Methods:

`__init__`([protease, max_missed_cleavages, ...])	Digest a protein sequence
`cleave_sequence`(sequence)	Cleave a sequence.
`get_cut_positions`(sequence)

__init__(protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 6, peptide_length_max: int = 45)

Digest a protein sequence

Parameters:

protease (str, optional) – Protease name: either a pre-defined name from protease_dict or a regular expression. By default ‘trypsin’
max_missed_cleavages (int, optional) – Maximal number of missed cleavage sites. By default 2
peptide_length_min (int, optional) – Minimal cleaved peptide length, by default 6
peptide_length_max (int, optional) – Maximal cleaved peptide length, by default 45

cleave_sequence(sequence: str) → tuple

Cleave a sequence.

Parameters:

sequence (str) – the given (protein) sequence.

Returns:

list[str]: cleaved peptide sequences with missed cleavages
list[int]: number of missed cleavages of each peptide
list[bool]: whether each peptide is at the protein N-term
list[bool]: whether each peptide is at the protein C-term

Return type:

tuple[list]

get_cut_positions(sequence)
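get_cut_positions() carries no description here. As a rough illustration of how regex-defined proteases can be turned into cut positions, here is a minimal standalone sketch. It is not the library's implementation: the helper name `cut_positions`, the small `PROTEASE_REGEX` table (two patterns copied from protease_dict) and the rule "cut directly after the start of each match" are assumptions made for this example.

```python
import re

# Two patterns copied from protease_dict in this module.
PROTEASE_REGEX = {
    "trypsin": "([KR])",
    "lys-c": "K",
}


def cut_positions(sequence: str, protease: str = "trypsin") -> list:
    """Return sorted cut positions, always including both sequence ends.

    Assumption: the protease cuts directly after the start of each match.
    A name missing from PROTEASE_REGEX is treated as a regular expression.
    """
    pattern = re.compile(PROTEASE_REGEX.get(protease, protease))
    positions = {0, len(sequence)}
    positions.update(m.start() + 1 for m in pattern.finditer(sequence))
    return sorted(positions)


cuts = cut_positions("MKTAYIAKQRQISFVK")
```

Slicing the sequence between consecutive positions then gives the fully cleaved peptides.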

class alphabase.protein.fasta.SpecLibFasta(charged_frag_types: list = ['b_z1', 'b_z2', 'y_z1', 'y_z2'], *, protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 7, peptide_length_max: int = 35, precursor_charge_min: int = 2, precursor_charge_max: int = 4, precursor_mz_min: float = 400.0, precursor_mz_max: float = 2000.0, var_mods: list = ['Acetyl@Protein_N-term', 'Oxidation@M'], min_var_mod_num: int = 0, max_var_mod_num: int = 2, fix_mods: list = ['Carbamidomethyl@C'], labeling_channels: dict = None, special_mods: list = [], min_special_mod_num: int = 0, max_special_mod_num: int = 1, special_mods_cannot_modify_pep_n_term: bool = False, special_mods_cannot_modify_pep_c_term: bool = False, decoy: str = None, include_contaminants: bool = False, I_to_L: bool = False)

Bases: SpecLibBase

This is the main entry point of AlphaBase for generating spectral libraries from fasta files. It includes functionality to:

Load protein sequences
Digest protein sequences
Append decoy peptides
Add fixed, variable or labeling modifications to the peptide sequences
Add charge states
Save libraries into hdf file

max_peptidoform_num#

For some modifications such as Phospho, there may be thousands of peptidoforms generated for some peptides, so we use this attribute to control the overall number of peptidoforms of a peptide.

Type: int, 100 by default

protein_df#

Protein dataframe with columns ‘protein_id’, ‘sequence’, ‘description’, ‘gene_name’, etc.

Type: pd.DataFrame

Methods:

`__init__`([charged_frag_types, protease, ...])	Set the digestion, modification, charge and decoy options described under Parameters.
`add_charge`()	Add charge states
`add_modifications`()	Add fixed and variable modifications to all peptide sequences in self.precursor_df
`add_mods_for_one_seq`(sequence, ...)	Add fixed and variable modifications to a sequence
`add_peptide_labeling`([labeling_channel_dict])	Add labeling onto peptides in place in self._precursor_df
`add_special_modifications`()	Add externally defined variable modifications to all peptide sequences in self._precursor_df.
`append_protein_name`()
`get_peptides_from_fasta`(fasta_file)	Load peptide sequences from fasta files.
`get_peptides_from_fasta_list`(fasta_files)	Load peptide sequences from fasta file list
`get_peptides_from_peptide_sequence_list`(...)
`get_peptides_from_protein_df`(protein_df)
`get_peptides_from_protein_dict`(protein_dict)	Cleave the protein sequences in protein_dict.
`import_and_process_fasta`(fasta_files)	Import and process a fasta file or a list of fasta files.
`import_and_process_peptide_sequences`(...[, ...])	Import and process peptide sequences instead of proteins.
`import_and_process_protein_df`(protein_df)	Import and process the protein_df.
`import_and_process_protein_dict`(protein_dict)	Import and process the protein_dict.
`load_hdf`(hdf_file[, load_mod_seq])	Load contents from hdf file: - self.precursor_df <- library/precursor_df - self.precursor_df <- library/mod_seq_df if load_mod_seq is True - self.protein_df <- library/protein_df - self.fragment_mz_df <- library/fragment_mz_df - self.fragment_intensity_df <- library/fragment_intensity_df
`process_from_naked_peptide_seqs`()	The peptide processing step which is called by import_and_process_... methods.
`save_hdf`(hdf_file)	Save the contents into hdf file (attribute -> hdf_file): - self.precursor_df -> library/precursor_df - self.protein_df -> library/protein_df - self.fragment_mz_df -> library/fragment_mz_df - self.fragment_intensity_df -> library/fragment_intensity_df

__init__(charged_frag_types: list = ['b_z1', 'b_z2', 'y_z1', 'y_z2'], *, protease: str = 'trypsin', max_missed_cleavages: int = 2, peptide_length_min: int = 7, peptide_length_max: int = 35, precursor_charge_min: int = 2, precursor_charge_max: int = 4, precursor_mz_min: float = 400.0, precursor_mz_max: float = 2000.0, var_mods: list = ['Acetyl@Protein_N-term', 'Oxidation@M'], min_var_mod_num: int = 0, max_var_mod_num: int = 2, fix_mods: list = ['Carbamidomethyl@C'], labeling_channels: dict = None, special_mods: list = [], min_special_mod_num: int = 0, max_special_mod_num: int = 1, special_mods_cannot_modify_pep_n_term: bool = False, special_mods_cannot_modify_pep_c_term: bool = False, decoy: str = None, include_contaminants: bool = False, I_to_L: bool = False)

Parameters:

charged_frag_types (list, optional) – Fragment types with charge, by default [‘b_z1’, ‘b_z2’, ‘y_z1’, ‘y_z2’]
protease (str, optional) – Could be pre-defined protease name defined in protease_dict, or a regular expression. By default ‘trypsin’
max_missed_cleavages (int, optional) – Maximal missed cleavages, by default 2
peptide_length_min (int, optional) – Minimal cleaved peptide length, by default 7
peptide_length_max (int, optional) – Maximal cleaved peptide length, by default 35
precursor_charge_min (int, optional) – Minimal precursor charge, by default 2
precursor_charge_max (int, optional) – Maximal precursor charge, by default 4
precursor_mz_min (float, optional) – Minimal precursor mz, by default 400.0
precursor_mz_max (float, optional) – Maximal precursor mz, by default 2000.0
var_mods (list, optional) – list of variable modifications, by default [‘Acetyl@Protein_N-term’, ‘Oxidation@M’]
min_var_mod_num (int, optional) – Minimal number of variable modifications on a peptide sequence, by default 0
max_var_mod_num (int, optional) – Maximal number of variable modifications on a peptide sequence, by default 2
fix_mods (list, optional) – list of fixed modifications, by default [‘Carbamidomethyl@C’]
labeling_channels (dict, optional) – Add isotope labeling with different channels, see add_peptide_labeling(). Defaults to None
special_mods (list, optional) – Modifications with a special occurrence limit per peptide. This is useful for modifications such as Phospho, which can greatly inflate the number of candidate modified peptides. The number of special_mods per peptide is controlled by min_special_mod_num and max_special_mod_num. Defaults to [].
min_special_mod_num (int, optional) – Control the min number of special_mods per peptide, by default 0.
max_special_mod_num (int, optional) – Control the max number of special_mods per peptide, by default 1.
special_mods_cannot_modify_pep_c_term (bool, optional) – Some modifications cannot modify the peptide C-term. This is useful for GlyGly@K: if the C-terminal K carries GlyGly, that site cannot be cleaved/digested. Defaults to False.
special_mods_cannot_modify_pep_n_term (bool, optional) – Similar to special_mods_cannot_modify_pep_c_term, but at N-term. Defaults to False.
decoy (str, optional) –
Decoy type (see alphabase.spectral_library.base.append_decoy_sequence())
- protein_reverse: Reverse on target protein sequences
- pseudo_reverse: Pseudo-reverse on target peptide sequences
- diann: DiaNN-like decoy
- None: no decoy
by default None
include_contaminants (bool, optional) – Whether to include contaminants.fasta, by default False

add_charge(): Add charge states

add_modifications(): Add fixed and variable modifications to all peptide sequences in self.precursor_df

add_mods_for_one_seq(sequence: str, is_prot_nterm, is_prot_cterm) → tuple

Add fixed and variable modifications to a sequence

Parameters:

sequence (str) – Peptide sequence
is_prot_nterm (bool) – whether the peptide is at the protein N-term
is_prot_cterm (bool) – whether the peptide is at the protein C-term

Returns:

list[str]: list of modification names
list[str]: list of modification sites

Return type:

tuple

add_peptide_labeling(labeling_channel_dict: dict = None)

Add labeling onto peptides in place in self._precursor_df

Parameters: labeling_channel_dict (dict, optional) – For example: ` { -1: [], # not labeled 0: ['Dimethyl@Any_N-term','Dimethyl@K'], 4: ['Dimethyl:2H(4)@Any_N-term','Dimethyl:2H(4)@K'], 8: ['Dimethyl:2H(6)13C(2)@Any_N-term','Dimethyl:2H(6)13C(2)@K'], } `. Keys may be int (strongly recommended; this may become mandatory in the future) or str, and each value must be a list of modification names (str) in alphabase format. If None, self.labeling_channels is used. Defaults to None
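Written out as a Python literal, the labeling_channel_dict example reads more easily. The sketch below reproduces it and adds a small shape check; `check_labeling_channels` is a hypothetical helper written for this example, not an AlphaBase function, and the `name@site` test is only a loose stand-in for "alphabase format".

```python
# The channel mapping from the parameter description, one channel per line.
labeling_channel_dict = {
    -1: [],  # not labeled
    0: ["Dimethyl@Any_N-term", "Dimethyl@K"],
    4: ["Dimethyl:2H(4)@Any_N-term", "Dimethyl:2H(4)@K"],
    8: ["Dimethyl:2H(6)13C(2)@Any_N-term", "Dimethyl:2H(6)13C(2)@K"],
}


def check_labeling_channels(channels: dict) -> None:
    """Raise if the mapping does not have the documented shape.

    Hypothetical helper: keys must be int or str, values must be lists
    of modification names that contain an '@' separator.
    """
    for channel, mods in channels.items():
        if not isinstance(channel, (int, str)):
            raise TypeError(f"channel key {channel!r} must be int or str")
        if not isinstance(mods, list):
            raise TypeError(f"channel {channel!r} must map to a list")
        for mod in mods:
            if not isinstance(mod, str) or "@" not in mod:
                raise ValueError(f"{mod!r} is not a 'name@site' string")


check_labeling_channels(labeling_channel_dict)
```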

add_special_modifications(): Add externally defined variable modifications to all peptide sequences in self._precursor_df. See append_special_modifications() for details.

append_protein_name()

get_peptides_from_fasta(fasta_file: str | list)

Load peptide sequences from fasta files.

Parameters: fasta_file (Union[str,list]) – Could be a fasta file (str) or a list of fasta files (list[str])

get_peptides_from_fasta_list(fasta_files: list)

Load peptide sequences from fasta file list

Parameters: fasta_files (list) – fasta file list

get_peptides_from_peptide_sequence_list(pep_seq_list: list, protein_list: list = None)

get_peptides_from_protein_df(protein_df: DataFrame)

get_peptides_from_protein_dict(protein_dict: dict)

Cleave the protein sequences in protein_dict.

Parameters: protein_dict (dict) – Format: ` { 'prot_id1': {'protein_id': 'prot_id1', 'sequence': string, 'gene_name': string, 'description': string}, 'prot_id2': {...}, ... } `

import_and_process_fasta(fasta_files: list)

Import and process a fasta file or a list of fasta files. It includes two steps:

1. Digest the proteins to get peptide sequences, using self.get_peptides_from_…().

2. Process the peptides, including adding modifications, using process_from_naked_peptide_seqs().

Parameters: fasta_files (list) – A fasta file or a list of fasta files
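Putting the constructor and import_and_process_fasta() together, a typical call sequence might look like the sketch below. It uses only names and defaults documented on this page; `LIBRARY_SETTINGS` and `build_library` are illustrative names introduced here, and the import is placed inside the function so the settings can be inspected without AlphaBase installed.

```python
# Keyword arguments copied from the documented SpecLibFasta signature.
LIBRARY_SETTINGS = {
    "protease": "trypsin",
    "max_missed_cleavages": 2,
    "peptide_length_min": 7,
    "peptide_length_max": 35,
    "precursor_charge_min": 2,
    "precursor_charge_max": 4,
    "var_mods": ["Acetyl@Protein_N-term", "Oxidation@M"],
    "fix_mods": ["Carbamidomethyl@C"],
    "decoy": None,
}


def build_library(fasta_files, charged_frag_types=("b_z1", "b_z2", "y_z1", "y_z2")):
    """Digest the given fasta files and return the processed library object."""
    # Imported lazily; requires the alphabase package at call time.
    from alphabase.protein.fasta import SpecLibFasta

    lib = SpecLibFasta(list(charged_frag_types), **LIBRARY_SETTINGS)
    # Digestion followed by modification and charge assignment.
    lib.import_and_process_fasta(list(fasta_files))
    return lib
```

A call such as `lib = build_library(["proteome.fasta"])` (the path is a placeholder) would leave the generated precursors in `lib.precursor_df`, and `lib.save_hdf(...)` could then write the library to disk.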

import_and_process_peptide_sequences(pep_seq_list: list, protein_list: list = None)

Import and process peptide sequences instead of proteins. The processing step is in process_from_naked_peptide_seqs().

Parameters:

pep_seq_list (list) – Peptide sequence list
protein_list (list, optional) – Protein id list which maps to pep_seq_list one-by-one, by default None

import_and_process_protein_df(protein_df: DataFrame)

Import and process the protein_df. The processing step is in process_from_naked_peptide_seqs(). ` protein_df = load_fasta_list_as_protein_df(fasta_files) `

Parameters: protein_df (pd.DataFrame) – DataFrame with columns ‘protein_id’, ‘sequence’, ‘gene_name’, ‘description’

import_and_process_protein_dict(protein_dict: dict)

Import and process the protein_dict. The processing step is in process_from_naked_peptide_seqs(). ` protein_dict = load_all_proteins(fasta_files) `

Parameters: protein_dict (dict) – Format: { ‘prot_id1’: {‘protein_id’: ‘prot_id1’, ‘sequence’: string, ‘gene_name’: string, ‘description’: string}, ‘prot_id2’: {…}, … }

load_hdf(hdf_file: str, load_mod_seq: bool = False)

Load contents from hdf file:

- self.precursor_df <- library/precursor_df
- self.precursor_df <- library/mod_seq_df if load_mod_seq is True
- self.protein_df <- library/protein_df
- self.fragment_mz_df <- library/fragment_mz_df
- self.fragment_intensity_df <- library/fragment_intensity_df

Parameters:

hdf_file (str) – hdf file path
load_mod_seq (bool, optional) – After library is generated with hash values (int64) for sequences (str) and modifications (str), we don’t need sequence information for searching. So we can skip loading sequences to make the loading much faster. By default False

process_from_naked_peptide_seqs(): The peptide processing step which is called by import_and_process_… methods.

save_hdf(hdf_file: str)

Save the contents into hdf file (attribute -> hdf_file):

- self.precursor_df -> library/precursor_df
- self.protein_df -> library/protein_df
- self.fragment_mz_df -> library/fragment_mz_df
- self.fragment_intensity_df -> library/fragment_intensity_df

Parameters: hdf_file (str) – The hdf file path

alphabase.protein.fasta.add_single_peptide_labeling(seq: str, mods: str, mod_sites: str, label_aas: str, label_mod_dict: dict, nterm_label_mod: str, cterm_label_mod: str)

alphabase.protein.fasta.annotate_precursor_df(precursor_df: DataFrame, protein_df: DataFrame)

Annotate a list of peptides with genes and proteins by using an ahocorasick automaton.

Parameters:

precursor_df (pd.DataFrame) – A dataframe containing a sequence column.
protein_df (pd.DataFrame) – protein dataframe containing sequence column.

Returns:

updated precursor_df with genes, proteins and cardinality columns.

Return type:

pd.DataFrame
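To make the annotation step concrete, here is a minimal sketch of what mapping peptides to proteins means. It deliberately swaps the Aho-Corasick automaton for plain substring search, which gives the same matches on small inputs but scales far worse. The function name `annotate_peptides`, the dictionary-based input and output, and the semicolon-joined protein list are assumptions made for this example, not the library's data layout.

```python
def annotate_peptides(peptides, proteins):
    """Map each peptide to the proteins whose sequence contains it.

    peptides: iterable of peptide sequences.
    proteins: dict mapping protein id -> protein sequence.
    Returns a dict: peptide -> {"proteins": str, "cardinality": int}.
    """
    annotation = {}
    for peptide in peptides:
        # Naive O(peptides x proteins) scan; an automaton does this in one pass.
        hits = [pid for pid, seq in proteins.items() if peptide in seq]
        annotation[peptide] = {
            "proteins": ";".join(hits),
            "cardinality": len(hits),
        }
    return annotation


proteins = {"P1": "MKTAYIAK", "P2": "QQTAYIAKLL"}
result = annotate_peptides(["TAYIAK", "MK", "WWW"], proteins)
```

Here cardinality is taken to be the number of proteins a peptide maps to, so a value of 1 marks a peptide unique to one protein.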

alphabase.protein.fasta.append_special_modifications(df: DataFrame, var_mods: list = ['Phospho@S', 'Phospho@T', 'Phospho@Y'], min_mod_num: int = 0, max_mod_num: int = 1, max_peptidoform_num: int = 100, cannot_modify_pep_nterm_aa: bool = False, cannot_modify_pep_cterm_aa: bool = False) → DataFrame

Append special (not N/C-term) variable modifications to the existing modifications of each sequence in df.

Parameters:

df (pd.DataFrame) – Precursor dataframe
var_mods (list, optional) – Considered variable modification list. Defaults to [‘Phospho@S’, ‘Phospho@T’, ‘Phospho@Y’].
min_mod_num (int, optional) – Minimal modification number for each sequence of the var_mods. Defaults to 0.
max_mod_num (int, optional) – Maximal modification number for each sequence of the var_mods. Defaults to 1.
max_peptidoform_num (int, optional) – One sequence is only allowed to explode to max_peptidoform_num number of modified peptides. Defaults to 100.
cannot_modify_pep_nterm_aa (bool, optional) – Similar to cannot_modify_pep_cterm_aa, but for the peptide N-term. By default False
cannot_modify_pep_cterm_aa (bool, optional) – If the target AA is at the peptide C-term, the modification cannot be placed on it. For example, with GlyGly@K and the peptide ACDKEFGK: if GlyGly were on the C-terminal K, trypsin could not cleave after that K, so the modified peptide ACDKEFGK(GlyGly) does not occur. By default False

Returns:

The precursor_df with new modification added.

Return type:

pd.DataFrame
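The effect of the two cannot_modify_pep_*term_aa flags can be illustrated on the ACDKEFGK example from the parameter description. This is a standalone sketch, not the library's code: `special_mod_sites` is a hypothetical helper, and sites are 1-based residue positions, following the site convention described for get_candidate_sites().

```python
def special_mod_sites(
    sequence: str,
    target_aas: str,
    cannot_modify_pep_nterm_aa: bool = False,
    cannot_modify_pep_cterm_aa: bool = False,
) -> list:
    """Return 1-based positions of residues that may carry the modification."""
    sites = [i + 1 for i, aa in enumerate(sequence) if aa in target_aas]
    if cannot_modify_pep_nterm_aa:
        # Drop the first residue of the peptide.
        sites = [s for s in sites if s != 1]
    if cannot_modify_pep_cterm_aa:
        # Drop the last residue of the peptide.
        sites = [s for s in sites if s != len(sequence)]
    return sites


all_sites = special_mod_sites("ACDKEFGK", "K")
inner_sites = special_mod_sites("ACDKEFGK", "K", cannot_modify_pep_cterm_aa=True)
```

With the C-term flag set, only the internal K at position 4 remains a candidate for GlyGly.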

alphabase.protein.fasta.cleave_sequence_with_cut_pos(sequence: str, cut_pos: ndarray, n_missed_cleavages: int = 2, pep_length_min: int = 6, pep_length_max: int = 45) → tuple

Cleave a sequence with cut positions (cut_pos). Peptides are filtered by minimum and maximum length.

Parameters:

sequence (str) – protein sequence
cut_pos (np.ndarray) – cut positions determined by a given protease.
n_missed_cleavages (int) – the maximal number of missed cleavages.
pep_length_min (int) – min peptide length.
pep_length_max (int) – max peptide length.

Returns:

List[str]. Cleaved peptide sequences with missed cleavages.

List[int]. Number of missed cleavages of each peptide.

List[bool]. Whether each peptide is an N-term peptide

List[bool]. Whether each peptide is a C-term peptide

Return type:

tuple
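The cleavage logic described above (cut positions, a missed-cleavage limit and a length filter) can be sketched in pure Python as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: it takes a plain list instead of an np.ndarray, and it assumes cut_pos is sorted and includes 0 and len(sequence).

```python
def cleave_with_cut_positions(
    sequence, cut_pos, n_missed_cleavages=2, pep_length_min=6, pep_length_max=45
):
    """Return (peptides, missed_cleavages, is_nterm, is_cterm) as four lists."""
    peptides, missed, is_nterm, is_cterm = [], [], [], []
    for i, start in enumerate(cut_pos[:-1]):
        # j is the index of the closing cut; j - i - 1 cuts are skipped.
        last_j = min(i + n_missed_cleavages + 2, len(cut_pos))
        for j in range(i + 1, last_j):
            end = cut_pos[j]
            length = end - start
            if length > pep_length_max:
                break  # longer spans from this start only get longer
            if length < pep_length_min:
                continue
            peptides.append(sequence[start:end])
            missed.append(j - i - 1)
            is_nterm.append(start == 0)
            is_cterm.append(end == len(sequence))
    return peptides, missed, is_nterm, is_cterm


peps, miss, nterm, cterm = cleave_with_cut_positions(
    "AAAKCCCKDDD", [0, 4, 8, 11], n_missed_cleavages=1, pep_length_min=3
)
```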

alphabase.protein.fasta.concat_proteins(protein_dict: dict, sep='$') → str

Concatenate all protein sequences into a single sequence, separated by sep ($ by default).

Parameters: protein_dict (dict) – protein_dict by read_fasta_file()
Returns: concatenated sequence separated by sep.
Return type: str

alphabase.protein.fasta.create_labeling_peptide_df(peptide_df: DataFrame, labels: list, inplace: bool = False)

alphabase.protein.fasta.get_candidate_sites(sequence: str, target_mod_aas: str) → list

Get candidate modification sites

Parameters:

sequence (str) – peptide sequence
target_mod_aas (str) – AAs that may have modifications

Returns:

candidate mod sites in alphabase format (0: N-term, -1: C-term, 1-n: others)

Return type:

list

alphabase.protein.fasta.get_fix_mods(sequence: str, fix_mod_aas: str, fix_mod_dict: dict) → tuple: Generate fixed modifications for the sequence

alphabase.protein.fasta.get_uniprot_gene_name(description: str)

alphabase.protein.fasta.get_var_mod_sites(sequence: str, target_mod_aas: str, min_var_mod: int, max_var_mod: int, max_combs: int) → list

Get all combinations of variable modification sites

Parameters:

sequence (str) – peptide sequence
target_mod_aas (str) – AAs that may have modifications
min_var_mod (int) – min number of mods in a sequence
max_var_mod (int) – max number of mods in a sequence
max_combs (int) – max number of combinations for a sequence

Returns:

list of combinations (tuple) of modification sites

Return type:

list
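Enumerating site combinations is easy to picture with itertools. The sketch below is a standalone illustration, not the library's implementation. It assumes 1-based residue sites, caps the result at max_combs, and starts at one modification, leaving the unmodified peptide to the caller. The names `candidate_sites` and `var_mod_site_combinations` are hypothetical.

```python
from itertools import combinations, islice


def candidate_sites(sequence: str, target_mod_aas: str) -> list:
    """1-based positions of residues that may be modified."""
    return [i + 1 for i, aa in enumerate(sequence) if aa in target_mod_aas]


def var_mod_site_combinations(
    sequence: str, target_mod_aas: str, min_var_mod: int, max_var_mod: int, max_combs: int
) -> list:
    """List site combinations (tuples) with min..max modifications each."""
    sites = candidate_sites(sequence, target_mod_aas)
    combos = []
    # Assumption: the empty combination (no modification) is not emitted.
    for n_mods in range(max(1, min_var_mod), max_var_mod + 1):
        remaining = max_combs - len(combos)
        if remaining <= 0:
            break
        combos.extend(islice(combinations(sites, n_mods), remaining))
    return combos


combos = var_mod_site_combinations("AMCMK", "M", 0, 2, 100)
```

The max_combs cap matters for residues that are common in a peptide, where the number of combinations grows quickly with max_var_mod.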

alphabase.protein.fasta.get_var_mods(sequence: str, var_mod_aas: str, mod_dict: dict, min_var_mod: int, max_var_mod: int, max_combs: int) → tuple: Generate all modification combinations and associated sites for the sequence.

alphabase.protein.fasta.get_var_mods_per_sites(sequence: str, mod_sites: tuple, var_mod_dict: dict) → list: Used when the var mod list contains only one mod per AA, for example: Mod1@A, Mod2@D …

alphabase.protein.fasta.get_var_mods_per_sites_multi_mods_on_aa(sequence: str, mod_sites: tuple, var_mod_dict: dict) → list: Used only when the var mod list contains more than one mod on the same AA, for example: Mod1@A, Mod2@A …

alphabase.protein.fasta.get_var_mods_per_sites_single_mod_on_aa(sequence: str, mod_sites: tuple, var_mod_dict: dict) → list: Used when the var mod list contains only one mod per AA, for example: Mod1@A, Mod2@D …

alphabase.protein.fasta.load_all_proteins(fasta_file_list: list)

alphabase.protein.fasta.load_fasta_list_as_protein_df(fasta_list: list)

alphabase.protein.fasta.parse_labels(labels: list)

alphabase.protein.fasta.parse_term_mod(term_mod_name: str)

alphabase.protein.fasta.protease_dict = {'arg-c': 'R', 'asp-n': '\\w(?=D)', 'bnps-skatole': 'W', 'caspase 1': '(?<=[FWYL]\\w[HAT])D(?=[^PEDQKR])', 'caspase 10': '(?<=IEA)D', 'caspase 2': '(?<=DVA)D(?=[^PEDQKR])', 'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])', 'caspase 4': '(?<=LEV)D(?=[^PEDQKR])', 'caspase 5': '(?<=[LW]EH)D', 'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])', 'caspase 7': '(?<=DEV)D(?=[^PEDQKR])', 'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])', 'caspase 9': '(?<=LEH)D', 'chymotrypsin': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))', 'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))', 'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))', 'clostripain': 'R', 'cnbr': 'M', 'enterokinase': '(?<=[DE]{3})K', 'factor xa': '(?<=[AFGILTVM][DE]G)R', 'formic acid': 'D', 'glu-c': 'E', 'glutamyl endopeptidase': 'E', 'granzyme b': '(?<=IEP)D', 'hydroxylamine': 'N(?=G)', 'iodosobenzoic acid': 'W', 'lys-c': 'K', 'lys-n': '\\w(?=K)', 'no-cleave': '_', 'non-specific': '()', 'ntcb': '\\w(?=C)', 'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\w[^P]))', 'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\w[^P]))', 'proline endopeptidase': '(?<=[HKR])P(?=[^P])', 'proteinase k': '[AEFILTVWY]', 'staphylococcal peptidase i': '(?<=[^E])E', 'thermolysin': '[^DE](?=[AFILMV])', 'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))', 'trypsin': '([KR])', 'trypsin/p': '([KR])', 'trypsin_exception': '((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))', 'trypsin_full': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))', 'trypsin_not_p': '([KR](?=[^P]))'}#: Pre-built protease dict with regular expression.
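The lookahead and lookbehind groups in these patterns encode cleavage exceptions. As a small illustration, the sketch below compares two entries copied from the dict above on a sequence containing a K-P bond; the helper `matched_residues` and the test sequence are made up for this example.

```python
import re

# Patterns copied verbatim from protease_dict.
PATTERNS = {
    "trypsin": "([KR])",
    "trypsin_not_p": "([KR](?=[^P]))",
}


def matched_residues(sequence: str, protease: str) -> list:
    """Return (0-based index, residue) for every cleavage-site match."""
    return [
        (m.start(), m.group())
        for m in re.finditer(PATTERNS[protease], sequence)
    ]


plain = matched_residues("AAKPCCRDD", "trypsin")
no_kp = matched_residues("AAKPCCRDD", "trypsin_not_p")
```

The `(?=[^P])` lookahead rejects the K at index 2 because it is followed by P, so only the R at index 6 remains a cleavage site under 'trypsin_not_p'.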

alphabase.protein.fasta.protein_idxes_to_names(protein_idxes: str, protein_names: list)

alphabase.protein.fasta.read_fasta_file(fasta_filename: str = '')

Read a FASTA file line by line

Parameters: fasta_filename (str) – path to the FASTA file.
Yields: dict – protein information, {protein_id:str, full_name:str, gene_name:str, description:str, sequence:str}
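A generator that yields one record per protein can be sketched with the standard library alone. The version below is an illustration of the record shape listed above, not the library's parser: `iter_fasta` and `_make_record` are hypothetical names, and the header rules (protein_id taken from the second `|`-separated field, gene name from a `GN=` tag, description set to the whole header) are assumptions based on UniProt-style headers.

```python
import io
import re


def _make_record(header: str, chunks: list) -> dict:
    """Build one protein record from a header line and its sequence lines."""
    full_name = header.split(" ", 1)[0]
    parts = full_name.split("|")
    gene = re.search(r"GN=(\S+)", header)
    return {
        "protein_id": parts[1] if len(parts) >= 2 else full_name,
        "full_name": full_name,
        "gene_name": gene.group(1) if gene else "",
        "description": header,
        "sequence": "".join(chunks),
    }


def iter_fasta(handle):
    """Yield one dict per FASTA record from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


text = (
    ">sp|P12345|TEST_HUMAN Test protein OS=Homo sapiens GN=TST1 PE=1\n"
    "MKTAYIAK\nQRQISFVK\n"
    ">plain_id no gene\nACDE\n"
)
records = list(iter_fasta(io.StringIO(text)))
```

Keying such records by protein_id gives a dictionary of the shape that get_peptides_from_protein_dict() describes.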