alphabase.peptide.fragment
Functions:
add_new_frag_type | Add a new fragment type to frag_type_representation_dict. |
calc_fragment_cardinality | Calculate the cardinality for a given fragment across a group of precursors. |
calc_fragment_count | Calculates the number of fragments for each precursor. |
calc_fragment_mz_values_for_same_nAA | |
compress_fragment_indices | Recalculates fragment indices to remove unused fragments. |
concat_precursor_fragment_dataframes | Concatenates multiple precursor_df and fragment_df dataframes while keeping the fragment index positions stored in each precursor_df correct. |
create_fragment_mz_dataframe | Generate fragment mass dataframe for the precursor_df. |
create_fragment_mz_dataframe_by_sort_precursor | Sort nAA in precursor_df for faster fragment mz dataframe creation. |
fill_in_indices | Fill in indices, max indices and excluded indices for each peptide. |
filter_fragment_number | Filters the number of fragments for each precursor. |
flatten_fragments | Converts the tabular fragment format consisting of the fragment_mz_df and the fragment_intensity_df into a linear fragment format. |
get_charged_frag_types | Combine fragment types and charge states. |
get_sliced_fragment_dataframe | Get the sliced fragment_df from frag_start_end_list=[(start,end),(start,end),...]. |
init_fragment_by_precursor_dataframe | Init zero fragment dataframe for the precursor_df. |
init_fragment_dataframe_from_other | Init zero fragment dataframe from the reference_fragment_df (same rows and same columns). |
init_zero_fragment_dataframe | Initialize a zero dataframe based on peptide length (nAA) array (peplen_array) and charged_frag_types (column number). |
join_left | Joins all values in the left array to the values in the right array. |
mask_fragments_for_charge_greater_than_precursor_charge | Mask the fragment dataframe when the fragment charge is larger than the precursor charge. |
parse_charged_frag_type | Opposite of get_charged_frag_types. |
parse_fragment | Parse fragments to get fragment numbers, fragment positions and the indices excluded for not being in the top k in a single pass, which is faster than doing each operation individually and makes the most of operations that run in parallel. |
remove_unused_fragments | Removes unused fragments of removed precursors and reannotates the frag_start_col and frag_stop_col. |
update_sliced_fragment_dataframe | Set the values of the slices frag_start_end_list=[(start,end),(start,end),...] of fragment_df. |
Data:
frag_mass_from_ref_ion_dict | Masses parsed from frag_type_representation_dict. |
frag_type_representation_dict | Represent fragment ion types from b/y ions. |
- alphabase.peptide.fragment.add_new_frag_type(frag_type: str, representation: str)
Add a new fragment type to frag_type_representation_dict and update frag_mass_from_ref_ion_dict.
- Parameters:
frag_type (str) – New fragment type
representation (str) – The representation, in the same format as the entries of frag_type_representation_dict
- alphabase.peptide.fragment.calc_fragment_cardinality(precursor_df, fragment_mz_df, group_column='elution_group_idx', split_target_decoy=True)
Calculate the cardinality for a given fragment across a group of precursors. The cardinality is the number of precursors that have a given fragment at a given position.
All precursors within a group are expected to have the same number of fragments.
- Parameters:
precursor_df (pd.DataFrame) – The precursor dataframe.
fragment_mz_df (pd.DataFrame) – The fragment mz dataframe.
group_column (str) – The column to group the precursors by. An integer column is expected.
split_target_decoy (bool) – If True, the cardinality is calculated for the target and decoy precursors separately.
- alphabase.peptide.fragment.calc_fragment_count(precursor_df: DataFrame, fragment_intensity_df: DataFrame)
Calculates the number of fragments for each precursor.
- Parameters:
precursor_df (pd.DataFrame) – precursor dataframe which contains the frag_start_idx and frag_stop_idx columns
fragment_intensity_df (pd.DataFrame) – fragment intensity dataframe which contains the fragment intensities
- Returns:
array with the number of fragments for each precursor
- Return type:
numpy.ndarray
- alphabase.peptide.fragment.calc_fragment_mz_values_for_same_nAA(df_group: DataFrame, nAA: int, charged_frag_types: list)
- alphabase.peptide.fragment.compress_fragment_indices(frag_idx)
Recalculates fragment indices to remove unused fragments. Can be used to compress a fragment library. Expects fragment indices to be sorted in increasing order. It should run in O(N) time, with N being the number of fragment rows.
Input:
>>> frag_idx = [[6, 10], [12, 14], [20, 22]]
Output:
>>> frag_idx = [[0, 4], [4, 6], [6, 8]]
>>> fragment_pointer = [6,7,8,9,12,13,20,21]
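The index compression shown in the example can be sketched with plain NumPy. This is an illustrative sketch only, not the alphabase implementation; the helper name `compress_indices_sketch` is hypothetical, and it assumes `frag_idx` holds non-overlapping `[start, stop)` pairs in increasing order.

```python
import numpy as np


def compress_indices_sketch(frag_idx):
    """Hypothetical helper, not the alphabase implementation."""
    frag_idx = np.asarray(frag_idx, dtype=np.int64)
    # Number of fragment rows each precursor points to.
    lengths = frag_idx[:, 1] - frag_idx[:, 0]
    # New contiguous [start, stop) ranges from a running sum of the lengths.
    new_stop = np.cumsum(lengths)
    new_start = new_stop - lengths
    compressed = np.column_stack([new_start, new_stop])
    # Rows of the original fragment table that are still in use.
    fragment_pointer = np.concatenate(
        [np.arange(start, stop) for start, stop in frag_idx]
    )
    return compressed, fragment_pointer


compressed, fragment_pointer = compress_indices_sketch(
    [[6, 10], [12, 14], [20, 22]]
)
print(compressed.tolist())        # [[0, 4], [4, 6], [6, 8]]
print(fragment_pointer.tolist())  # [6, 7, 8, 9, 12, 13, 20, 21]
```

Selecting the rows listed in `fragment_pointer` from the old fragment table yields a table that the compressed indices address directly.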
- alphabase.peptide.fragment.concat_precursor_fragment_dataframes(precursor_df_list: List[DataFrame], fragment_df_list: List[DataFrame], *other_fragment_df_lists) -> Tuple[DataFrame, ...]
Each fragment_df is indexed by its precursor_df, so the indexed positions stored in the precursor_dfs change when multiple fragment_df dataframes are concatenated. This function keeps the indexed positions of precursor_df correct when concatenating multiple fragment_df dataframes.
- Parameters:
precursor_df_list (List[pd.DataFrame]) – precursor dataframe list to concatenate
fragment_df_list (List[pd.DataFrame]) – fragment dataframe list to concatenate
other_fragment_df_lists – arbitrary other fragment dataframe lists to concatenate, e.g. fragment_mass_df, fragment_inten_df, …
- Returns:
concatenated precursor_df, fragment_df, other_fragment_dfs …
- Return type:
Tuple[pd.DataFrame,…]
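Why the index positions need adjusting can be illustrated with a small pandas sketch. It is not the library code: the helper `concat_with_offsets` is hypothetical, and it assumes each precursor dataframe carries `frag_start_idx` and `frag_stop_idx` columns that point into its own fragment dataframe.

```python
import pandas as pd


def concat_with_offsets(precursor_dfs, fragment_dfs):
    """Hypothetical helper: shift fragment pointers before concatenating."""
    shifted = []
    offset = 0
    for prec, frag in zip(precursor_dfs, fragment_dfs):
        prec = prec.copy()
        # Rows of this fragment table will start at `offset` after concat.
        prec["frag_start_idx"] += offset
        prec["frag_stop_idx"] += offset
        shifted.append(prec)
        offset += len(frag)
    return (
        pd.concat(shifted, ignore_index=True),
        pd.concat(fragment_dfs, ignore_index=True),
    )


prec_a = pd.DataFrame({"frag_start_idx": [0, 2], "frag_stop_idx": [2, 5]})
frag_a = pd.DataFrame({"b_z1": [1.0] * 5})
prec_b = pd.DataFrame({"frag_start_idx": [0], "frag_stop_idx": [3]})
frag_b = pd.DataFrame({"b_z1": [2.0] * 3})

prec, frag = concat_with_offsets([prec_a, prec_b], [frag_a, frag_b])
print(prec["frag_start_idx"].tolist())  # [0, 2, 5]
print(prec["frag_stop_idx"].tolist())   # [2, 5, 8]
```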
- alphabase.peptide.fragment.create_fragment_mz_dataframe(precursor_df: DataFrame, charged_frag_types: List, *, reference_fragment_df: DataFrame = None, inplace_in_reference: bool = False, batch_size: int = 500000, dtype: numpy.dtype = numpy.float32) -> DataFrame
Generate fragment mass dataframe for the precursor_df. If the reference_fragment_df is provided and precursor_df contains frag_start_idx, it will generate the mz dataframe based on the reference. Otherwise it generates the mz dataframe from scratch.
- Parameters:
precursor_df (pd.DataFrame) – precursors to generate fragment masses for. If precursor_df contains the 'frag_start_idx' column, reference_fragment_df must be provided
charged_frag_types (List) – e.g. ['b_z1','b_z2','y_z1','y_z2','b_modloss_z1','y_H2O_z1', …]
reference_fragment_df (pd.DataFrame) – keyword-only. Generate fragment_mz_df based on this reference, as precursor_df.frag_start_idx and precursor_df.frag_stop_idx point to the indices in reference_fragment_df. Defaults to None
inplace_in_reference (bool) – keyword-only. Change values in place in the reference_fragment_df. Defaults to False
batch_size (int) – Number of peptides for each batch, to save RAM. Defaults to 500000.
- Returns:
fragment_mz_df with given charged_frag_types
- Return type:
pd.DataFrame
- alphabase.peptide.fragment.create_fragment_mz_dataframe_by_sort_precursor(precursor_df: DataFrame, charged_frag_types: List, batch_size: int = 500000, dtype: numpy.dtype = numpy.float32) -> DataFrame
Sort precursor_df by nAA for faster fragment mz dataframe creation.
After sorting, the fragment mz values are contiguous in memory, so setting values in pandas is faster.
Note that this function changes the order and index of precursor_df.
- Parameters:
precursor_df (pd.DataFrame) – precursor dataframe
charged_frag_types (List) – fragment types list
batch_size (int, optional) – Calculate fragment mz values in batch. Defaults to 500000.
- alphabase.peptide.fragment.fill_in_indices(frag_start_idxes: ndarray, frag_stop_idxes: ndarray, indices: ndarray, max_indices: ndarray, excluded_indices: ndarray, top_k: int, flattened_intensity: ndarray, number_of_fragment_types: int, max_frag_per_peptide: int = 300) -> None
Fill in indices, max indices and excluded indices for each peptide.
indices: index of fragment per peptide (from 0 to max_index-1)
max_indices: max index of fragments per peptide (number of fragments per peptide)
excluded_indices: not top k excluded indices per peptide
- Parameters:
frag_start_idxes (np.ndarray) – start indices of fragments for each peptide
frag_stop_idxes (np.ndarray) – stop indices of fragments for each peptide
indices (np.ndarray) – index of fragment per peptide (from 0 to max_index-1); filled in by this function
max_indices (np.ndarray) – max index of fragments per peptide (number of fragments per peptide); filled in by this function
excluded_indices (np.ndarray) – not top k excluded indices per peptide; filled in by this function
top_k (int) – top k highest peaks to keep
flattened_intensity (np.ndarray) – Flattened fragment intensities
number_of_fragment_types (int) – number of types of fragments (e.g. b,y,b_modloss,y_modloss, …) equals to the number of columns in fragment mz dataframe
max_frag_per_peptide (int, optional) – maximum number of fragments per peptide, Defaults to 300
- alphabase.peptide.fragment.filter_fragment_number(precursor_df: DataFrame, fragment_intensity_df: DataFrame, n_fragments_allowed_column_name: str = 'n_fragments_allowed', n_allowed: int = 999)
Filters the number of fragments for each precursor.
- Parameters:
precursor_df (pd.DataFrame) – precursor dataframe which contains the frag_start_idx and frag_stop_idx columns
fragment_intensity_df (pd.DataFrame) – fragment intensity dataframe which contains the fragment intensities
n_fragments_allowed_column_name (str, default = 'n_fragments_allowed') – column name in precursor_df which contains the number of allowed fragments
n_allowed (int, default = 999) – number of fragments which should be allowed
- Return type:
None
- alphabase.peptide.fragment.flatten_fragments(precursor_df: DataFrame, fragment_mz_df: DataFrame, fragment_intensity_df: DataFrame, min_fragment_intensity: float = -1, keep_top_k_fragments: int = 1000, custom_columns: list = ['type', 'number', 'position', 'charge', 'loss_type'], custom_df: Dict[str, DataFrame] = {}) -> Tuple[DataFrame, DataFrame]
Converts the tabular fragment format consisting of the fragment_mz_df and the fragment_intensity_df into a linear fragment format. The linear fragment format only retains fragments above a given intensity threshold with mz > 0. It consists of the columns mz, intensity, type, number, position, charge and loss_type, where each column refers to:
mz: PEAK_MZ_DTYPE, fragment mz value
intensity: PEAK_INTENSITY_DTYPE, fragment intensity value
type: uint8, ASCII code of the ion type. Lowercase letters are for regular scoring ions used during search: (97=a, 98=b, 99=c, 120=x, 121=y, 122=z). Lowercase ASCII codes minus 64 are used for ions that are only quantified and not scored: (33=a, 34=b, 35=c, 56=x, 57=y, 58=z). By default all ions are scored and quantified. It is left to the user or search engine to decide which ions to use.
number: uint32, fragment series number
position: uint32, fragment position in sequence (from left to right, starts with 0)
charge: uint8, fragment charge
loss_type: int16, fragment loss type, 0=noloss, 17=NH3, 18=H2O, 98=H3PO4 (phos), …
The fragment pointers frag_start_idx and frag_stop_idx will be reannotated to the new fragment format.
The ASCII code type can be converted into a byte string with frag_df.type.values.view('S1').
- Parameters:
precursor_df (pd.DataFrame) – input precursor dataframe which contains the frag_start_idx and frag_stop_idx columns
fragment_mz_df (pd.DataFrame) – input fragment mz dataframe of shape (N, T) which contains N * T fragment mzs. Fragments with mz==0 will be excluded.
fragment_intensity_df (pd.DataFrame) – input fragment intensity dataframe of shape (N, T) which contains N * T fragment intensities. Could be empty (len==0) to exclude intensity values.
min_fragment_intensity (float, optional) – minimum intensity which should be retained. Defaults to -1
custom_columns (list, optional) – 'mz' and 'intensity' columns are required. Others could be customized. Defaults to ['type','number','position','charge','loss_type']
custom_df (Dict[str, pd.DataFrame], optional) – Append custom columns by providing additional dataframes of the same shape as fragment_mz_df and fragment_intensity_df. Defaults to {}.
- Returns:
pd.DataFrame – precursor dataframe with added flat_frag_start_idx and flat_frag_stop_idx columns
pd.DataFrame – fragment dataframe with the columns mz, intensity, type, number, position, charge and loss_type, where each column refers to:
mz: PEAK_MZ_DTYPE, fragment mz value
intensity: PEAK_INTENSITY_DTYPE, fragment intensity value
type: uint8, ASCII code of the ion type. Lowercase letters are for regular scoring ions used during search: (97=a, 98=b, 99=c, 120=x, 121=y, 122=z). Lowercase ASCII codes minus 64 are used for ions that are only quantified and not scored: (33=a, 34=b, 35=c, 56=x, 57=y, 58=z). By default all ions are scored and quantified. It is left to the user or search engine to decide which ions to use.
number: uint32, fragment series number
position: uint32, fragment position in sequence (from left to right, starts with 0)
charge: uint8, fragment charge
loss_type: int16, fragment loss type, 0=noloss, 17=NH3, 18=H2O, 98=H3PO4 (phos), …
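The conversion from the tabular to the linear format, and the ASCII encoding of the type column, can be illustrated with a toy sketch. It is not the alphabase implementation: the data and the way type codes are derived from column names are assumptions made for illustration, and only the mz, intensity and type columns are built.

```python
import numpy as np
import pandas as pd

# Toy tabular format: one row per fragment position, one column per type.
mz_df = pd.DataFrame({"b_z1": [100.0, 200.0, 0.0], "y_z1": [300.0, 250.0, 150.0]})
inten_df = pd.DataFrame({"b_z1": [0.5, 0.0, 0.0], "y_z1": [1.0, 0.2, 0.0]})
min_fragment_intensity = -1.0

# Flatten row by row so both arrays stay aligned.
mz = mz_df.to_numpy().ravel()
intensity = inten_df.to_numpy().ravel()

# Assumption for this sketch: the ion type is the first letter of the column.
column_codes = np.array([ord(name[0]) for name in mz_df.columns], dtype=np.uint8)
type_codes = np.tile(column_codes, len(mz_df))

# Keep fragments with mz > 0 and an intensity above the threshold.
keep = (mz > 0) & (intensity > min_fragment_intensity)
flat = pd.DataFrame(
    {"mz": mz[keep], "intensity": intensity[keep], "type": type_codes[keep]}
)

# uint8 ASCII codes viewed as single-byte strings.
type_letters = flat["type"].to_numpy().view("S1")
print(type_letters.tolist())  # [b'b', b'y', b'b', b'y', b'y']

# Quantified-only codes are the lowercase codes minus 64; adding 64 restores them.
quant_only = np.array([34, 57], dtype=np.uint8)
restored = (quant_only + 64).view("S1")
print(restored.tolist())  # [b'b', b'y']
```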
- alphabase.peptide.fragment.frag_mass_from_ref_ion_dict = {'a': {'add_mass': -27.99491461957, 'ref_ion': 'b'}, 'b_H2O': {'add_mass': -18.01056468403, 'ref_ion': 'b'}, 'b_NH3': {'add_mass': -17.02654910112, 'ref_ion': 'b'}, 'c': {'add_mass': 17.02654910112, 'ref_ion': 'b'}, 'c_lossH': {'add_mass': 16.01872406889, 'ref_ion': 'b'}, 'x': {'add_mass': 25.97926455511, 'ref_ion': 'y'}, 'y_H2O': {'add_mass': -18.01056468403, 'ref_ion': 'y'}, 'y_NH3': {'add_mass': -17.02654910112, 'ref_ion': 'y'}, 'z': {'add_mass': -16.01872406889, 'ref_ion': 'y'}, 'z_addH': {'add_mass': -15.01089903666, 'ref_ion': 'y'}}
Masses parsed from frag_type_representation_dict.
- alphabase.peptide.fragment.frag_type_representation_dict = {'a': 'b+C(-1)O(-1)', 'b_H2O': 'b+H(-2)O(-1)', 'b_NH3': 'b+N(-1)H(-3)', 'c': 'b+N(1)H(3)', 'c_lossH': 'b+N(1)H(2)', 'x': 'y+C(1)O(1)H(-2)', 'y_H2O': 'y+H(-2)O(-1)', 'y_NH3': 'y+N(-1)H(-3)', 'z': 'y+N(-1)H(-2)', 'z_addH': 'y+N(-1)H(-1)'}
Represent fragment ion types from b/y ions. Modification neutral losses (i.e. modloss) are not here as they have variable atoms added to b/y ions.
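How a representation string maps to an add_mass value can be sketched as follows. The parser `parse_representation` is a hypothetical helper, not the library's own parsing code, and the monoisotopic element masses are supplied here as an assumption; they reproduce the listed add_mass values.

```python
import re

# Monoisotopic element masses (assumed here for the sketch).
ELEMENT_MASS = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
}


def parse_representation(representation):
    """Hypothetical parser for strings such as 'b+C(-1)O(-1)'."""
    ref_ion, formula = representation.split("+", 1)
    # Each term looks like Element(count); counts may be negative.
    terms = re.findall(r"([A-Z][a-z]?)\((-?\d+)\)", formula)
    add_mass = sum(ELEMENT_MASS[element] * int(count) for element, count in terms)
    return {"ref_ion": ref_ion, "add_mass": add_mass}


a_ion = parse_representation("b+C(-1)O(-1)")
x_ion = parse_representation("y+C(1)O(1)H(-2)")
print(a_ion["ref_ion"], round(a_ion["add_mass"], 6))  # b -27.994915
print(x_ion["ref_ion"], round(x_ion["add_mass"], 6))  # y 25.979265
```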
- alphabase.peptide.fragment.get_charged_frag_types(frag_types: List[str], max_frag_charge: int = 2) -> List[str]
Combine fragment types and charge states.
- Parameters:
frag_types (List[str]) – e.g. [‘b’,’y’,’b_modloss’,’y_modloss’]
max_frag_charge (int) – max fragment charge. (default: 2)
- Returns:
charged fragment types
- Return type:
List[str]
Examples
>>> frag_types=['b','y','b_modloss','y_modloss']
>>> get_charged_frag_types(frag_types, 2)
['b_z1','b_z2','y_z1','y_z2','b_modloss_z1','b_modloss_z2','y_modloss_z1','y_modloss_z2']
- alphabase.peptide.fragment.get_sliced_fragment_dataframe(fragment_df: DataFrame, frag_start_end_list: List | ndarray, charged_frag_types: List = None) -> DataFrame
Get the sliced fragment_df from frag_start_end_list=[(start,end),(start,end),…].
- Parameters:
fragment_df (pd.DataFrame) – fragment dataframe to get values
frag_start_end_list (Union[List, np.ndarray]) – list of (start, end) tuples, e.g. [(start,end),(start,end),…], or an np.ndarray
charged_frag_types (List[str]) – e.g. ['b_z1','b_z2','y_z1','y_z2']. If None, all columns will be considered
- Returns:
sliced fragment_df. If charged_frag_types is None, return fragment_df with all columns
- Return type:
pd.DataFrame
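Slicing by a list of (start, end) pairs can be sketched in pandas as below. The helper `slice_fragments_sketch` is hypothetical and only illustrates the idea; whether the library resets the index of the result is not stated in the text above, so the sketch keeps the original row labels.

```python
import numpy as np
import pandas as pd


def slice_fragments_sketch(fragment_df, frag_start_end_list, charged_frag_types=None):
    """Hypothetical helper: gather rows from several [start, end) ranges."""
    row_idx = np.concatenate(
        [np.arange(start, end) for start, end in frag_start_end_list]
    )
    sliced = fragment_df.iloc[row_idx]
    if charged_frag_types is not None:
        # Restrict to the requested fragment type columns.
        sliced = sliced[charged_frag_types]
    return sliced


fragment_df = pd.DataFrame(
    {"b_z1": np.arange(6.0), "y_z1": np.arange(6.0) * 10}
)
all_cols = slice_fragments_sketch(fragment_df, [(0, 2), (4, 6)])
y_only = slice_fragments_sketch(fragment_df, [(0, 2), (4, 6)], ["y_z1"])
print(all_cols["b_z1"].tolist())  # [0.0, 1.0, 4.0, 5.0]
print(y_only["y_z1"].tolist())    # [0.0, 10.0, 40.0, 50.0]
```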
- alphabase.peptide.fragment.init_fragment_by_precursor_dataframe(precursor_df, charged_frag_types: List[str], *, reference_fragment_df: DataFrame = None, dtype: numpy.dtype = numpy.float32, inplace_in_reference: bool = False)
Init zero fragment dataframe for the precursor_df. If the reference_fragment_df is provided, the result dataframe's length will be the same as reference_fragment_df. Otherwise it generates the dataframe from scratch.
- Parameters:
precursor_df (pd.DataFrame) – precursors to generate fragment masses for. If precursor_df contains the 'frag_start_idx' column, it is better to provide reference_fragment_df, as precursor_df.frag_start_idx and precursor_df.frag_stop_idx point to the indices in reference_fragment_df
charged_frag_types (List) – e.g. ['b_z1','b_z2','y_z1','y_z2','b_modloss_z1','y_H2O_z1', …]
reference_fragment_df (pd.DataFrame) – init zero fragment_mz_df based on this reference. If None, fragment_mz_df will be initialized by alphabase.peptide.fragment.init_zero_fragment_dataframe(). Defaults to None.
dtype (np.dtype) – dtype of fragment mz values. Defaults to PEAK_MZ_DTYPE.
inplace_in_reference (bool, optional) – whether to calculate the fragment mz in place in the reference_fragment_df. Defaults to False.
- Returns:
zero fragment_df with given charged_frag_types columns
- Return type:
pd.DataFrame
- alphabase.peptide.fragment.init_fragment_dataframe_from_other(reference_fragment_df: DataFrame, dtype=numpy.float32)
Init zero fragment dataframe from the reference_fragment_df (same rows and same columns)
- alphabase.peptide.fragment.init_zero_fragment_dataframe(peplen_array: ndarray, charged_frag_types: List[str], dtype=numpy.float32) -> Tuple[DataFrame, ndarray, ndarray]
Initialize a zero dataframe based on the peptide length (nAA) array (peplen_array) and charged_frag_types (column number). The row number of the returned dataframe is np.sum(peplen_array-1).
- Parameters:
peplen_array (np.ndarray) – peptide lengths for the fragment dataframe
charged_frag_types (List[str]) – [‘b_z1’,’b_z2’,’y_z1’,’y_z2’,’b_modloss_z1’,’y_H2O_z1’…]
- Returns:
pd.DataFrame, fragment_df with zero values
np.ndarray (int64), the start indices point to the fragment_df for each peptide
np.ndarray (int64), the end indices point to the fragment_df for each peptide
- Return type:
tuple
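The row count np.sum(peplen_array-1) and the start and end indices follow from each peptide of length nAA contributing nAA-1 fragment rows. A minimal sketch under that reading, with the hypothetical helper `init_zero_sketch` standing in for the library function:

```python
import numpy as np
import pandas as pd


def init_zero_sketch(peplen_array, charged_frag_types, dtype=np.float32):
    """Hypothetical helper, not the alphabase implementation."""
    # A peptide with nAA residues has nAA - 1 fragmentation positions.
    n_frags = np.asarray(peplen_array, dtype=np.int64) - 1
    stop_idx = np.cumsum(n_frags)
    start_idx = stop_idx - n_frags
    fragment_df = pd.DataFrame(
        np.zeros((int(n_frags.sum()), len(charged_frag_types)), dtype=dtype),
        columns=charged_frag_types,
    )
    return fragment_df, start_idx, stop_idx


fragment_df, start_idx, stop_idx = init_zero_sketch(
    np.array([7, 10]), ["b_z1", "y_z1"]
)
print(fragment_df.shape)     # (15, 2)
print(start_idx.tolist())    # [0, 6]
print(stop_idx.tolist())     # [6, 15]
```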
- alphabase.peptide.fragment.join_left(left: ndarray, right: ndarray)
Joins all values in the left array to the values in the right array. The index of the matching element in the right array is returned. If the value is not found, -1 is returned. If the element appears more than once, the last appearance is used.
- Parameters:
left (numpy.ndarray) – left array which should be matched
right (numpy.ndarray) – right array which should be matched to
- Returns:
array with the length of the left array whose entries are indices into the right array; -1 is returned for values that could not be found in the right array
- Return type:
numpy.ndarray, dtype = int64
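The join semantics described above (index into the right array, -1 when missing, last appearance wins) can be reproduced with a dictionary lookup. This is a plain-Python sketch with the hypothetical name `join_left_sketch`; it makes no claim about how the library implements the join.

```python
import numpy as np


def join_left_sketch(left, right):
    """Hypothetical helper mirroring the documented join behaviour."""
    # Later positions overwrite earlier ones, so the last appearance is kept.
    last_position = {value: i for i, value in enumerate(right.tolist())}
    return np.array(
        [last_position.get(value, -1) for value in left.tolist()],
        dtype=np.int64,
    )


left = np.array([3, 1, 7])
right = np.array([1, 3, 3, 5])
joined = join_left_sketch(left, right)
print(joined.tolist())  # [2, 0, -1]
```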
- alphabase.peptide.fragment.mask_fragments_for_charge_greater_than_precursor_charge(fragment_df: DataFrame, precursor_charge_array: ndarray, nAA_array: ndarray, *, candidate_fragment_charges: list = [2, 3, 4])
Mask the fragment dataframe when the fragment charge is larger than the precursor charge
- alphabase.peptide.fragment.parse_charged_frag_type(charged_frag_type: str) -> Tuple[str, int]
Opposite of get_charged_frag_types.
- Parameters:
charged_frag_type (str) – e.g. ‘y_z1’, ‘b_modloss_z1’
- Returns:
str. Fragment type, e.g. ‘b’,’y’
int. Charge state
- Return type:
tuple
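The relationship between combining and parsing charged fragment types can be sketched as a round trip. Both helpers are hypothetical stand-ins written for illustration, assuming the `<frag_type>_z<charge>` naming shown in the get_charged_frag_types example.

```python
def charged_types_sketch(frag_types, max_frag_charge=2):
    """Hypothetical stand-in: append _z<charge> for each charge state."""
    return [
        f"{frag_type}_z{charge}"
        for frag_type in frag_types
        for charge in range(1, max_frag_charge + 1)
    ]


def parse_charged_sketch(charged_frag_type):
    """Hypothetical stand-in: split off the trailing _z<charge> suffix."""
    frag_type, charge = charged_frag_type.rsplit("_z", 1)
    return frag_type, int(charge)


charged = charged_types_sketch(["b", "y_modloss"], 2)
parsed = [parse_charged_sketch(name) for name in charged]
print(charged)  # ['b_z1', 'b_z2', 'y_modloss_z1', 'y_modloss_z2']
print(parsed)   # [('b', 1), ('b', 2), ('y_modloss', 1), ('y_modloss', 2)]
```

Splitting from the right keeps fragment types that themselves contain underscores, such as b_modloss, intact.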
- alphabase.peptide.fragment.parse_fragment(frag_directions: ndarray, frag_start_idxes: ndarray, frag_stop_idxes: ndarray, top_k: int, intensities: ndarray, number_of_fragment_types: int) -> Tuple[ndarray, ndarray, ndarray]
Parse fragments to get fragment numbers, fragment positions and not top k excluded indices in a single pass. This is faster than doing each operation individually and makes the most of the operations that are done in parallel.
- Parameters:
frag_directions (np.ndarray) – directions of fragments for each peptide
frag_start_idxes (np.ndarray) – start indices of fragments for each peptide
frag_stop_idxes (np.ndarray) – stop indices of fragments for each peptide
top_k (int) – top k highest peaks to keep
intensities (np.ndarray) – Flattened fragment intensities
number_of_fragment_types (int) – number of types of fragments (e.g. b,y,b_modloss,y_modloss, …) equals to the number of columns in fragment mz dataframe
- Returns:
Tuple of fragment numbers, fragment positions and not top k excluded indices
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
- alphabase.peptide.fragment.remove_unused_fragments(precursor_df: DataFrame, fragment_df_list: Tuple[DataFrame, ...], frag_start_col: str = 'frag_start_idx', frag_stop_col: str = 'frag_stop_idx') -> Tuple[DataFrame, Tuple[DataFrame, ...]]
Removes unused fragments of removed precursors and reannotates the frag_start_col and frag_stop_col.
- Parameters:
precursor_df (pd.DataFrame) – Precursor dataframe which contains frag_start_idx and frag_stop_idx columns
fragment_df_list (List[pd.DataFrame]) – A list of fragment dataframes which should be compressed by removing unused fragments. Multiple fragment dataframes can be provided, and all of them will be sliced in the same way. This allows slicing both the fragment_mz_df and the fragment_intensity_df. At least one fragment dataframe needs to be provided.
frag_start_col (str, optional) – Fragment start idx column in precursor_df, such as “frag_start_idx” and “peak_start_idx”. Defaults to “frag_start_idx”.
frag_stop_col (str, optional) – Fragment stop idx column in precursor_df, such as “frag_stop_idx” and “peak_stop_idx”. Defaults to “frag_stop_idx”.
- Returns:
returns the reindexed precursor DataFrame and the sliced fragment DataFrames
- Return type:
pd.DataFrame, List[pd.DataFrame]
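The effect on a filtered precursor table can be illustrated with a small pandas sketch. The helper `remove_unused_sketch` is hypothetical and handles a single fragment dataframe only; it is meant to show the reannotation of the start and stop columns, not to reproduce the library code.

```python
import numpy as np
import pandas as pd


def remove_unused_sketch(precursor_df, fragment_df):
    """Hypothetical helper: drop fragment rows no precursor points to."""
    starts = precursor_df["frag_start_idx"].to_numpy()
    stops = precursor_df["frag_stop_idx"].to_numpy()
    # Fragment rows that are still referenced, in precursor order.
    keep_rows = np.concatenate(
        [np.arange(start, stop) for start, stop in zip(starts, stops)]
    )
    lengths = stops - starts
    new_stops = np.cumsum(lengths)
    reindexed = precursor_df.copy()
    reindexed["frag_start_idx"] = new_stops - lengths
    reindexed["frag_stop_idx"] = new_stops
    return reindexed, fragment_df.iloc[keep_rows].reset_index(drop=True)


fragment_df = pd.DataFrame({"b_z1": np.arange(10.0)})
# Precursors left after filtering; rows 0-1, 5-6 and 9 are no longer used.
precursor_df = pd.DataFrame({"frag_start_idx": [2, 7], "frag_stop_idx": [5, 9]})

new_precursor_df, new_fragment_df = remove_unused_sketch(precursor_df, fragment_df)
print(new_precursor_df["frag_start_idx"].tolist())  # [0, 3]
print(new_precursor_df["frag_stop_idx"].tolist())   # [3, 5]
print(new_fragment_df["b_z1"].tolist())             # [2.0, 3.0, 4.0, 7.0, 8.0]
```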
- alphabase.peptide.fragment.update_sliced_fragment_dataframe(fragment_df: DataFrame, fragment_df_vals: ndarray, values: ndarray, frag_start_end_list: List[Tuple[int, int]], charged_frag_types: List[str] = None)
Set the values of the slices frag_start_end_list=[(start,end),(start,end),…] of fragment_df.
- Parameters:
fragment_df (pd.DataFrame) – fragment dataframe to set the values
fragment_df_vals (np.ndarray) – The fragment_df.to_numpy(copy=True), to prevent readonly assignment.
values (np.ndarray) – values to set
frag_start_end_list (List[Tuple[int,int]]) – e.g. [(start,end),(start,end),…]
charged_frag_types (List[str], optional) – e.g. [‘b_z1’,’b_z2’,’y_z1’,’y_z2’]. If None, the columns of values should be the same as fragment_df’s columns. It is much faster if charged_frag_types is None as we use numpy slicing, otherwise we use pd.loc (much slower). Defaults to None.