alphabase.peptide.fragment

Functions:

add_new_frag_type(frag_type, representation)

Add a new fragment type to frag_type_representation_dict and update frag_mass_from_ref_ion_dict.

calc_fragment_cardinality(precursor_df, ...)

Calculate the cardinality for a given fragment across a group of precursors.

calc_fragment_count(precursor_df, ...)

Calculates the number of fragments for each precursor.

calc_fragment_mz_values_for_same_nAA(...)

compress_fragment_indices(frag_idx)

Recalculates fragment indices to remove unused fragments.

concat_precursor_fragment_dataframes(...)

Since fragment_df is indexed by precursor_df, concatenating multiple fragment_df dataframes shifts the positions that each precursor_df points to. This function keeps the indexed positions in precursor_df correct when concatenating multiple fragment_df dataframes.

create_fragment_mz_dataframe(precursor_df, ...)

Generate fragment mass dataframe for the precursor_df.

create_fragment_mz_dataframe_by_sort_precursor(...)

Sort nAA in precursor_df for faster fragment mz dataframe creation.

fill_in_indices(frag_start_idxes, ...[, ...])

Fill in indices, max indices and excluded indices for each peptide.

filter_fragment_number(precursor_df, ...[, ...])

Filters the number of fragments for each precursor.

flatten_fragments(precursor_df, ...[, ...])

Converts the tabular fragment format consisting of the fragment_mz_df and the fragment_intensity_df into a linear fragment format.

get_charged_frag_types(frag_types[, ...])

Combine fragment types and charge states.

get_sliced_fragment_dataframe(fragment_df, ...)

Get the sliced fragment_df from frag_start_end_list=[(start,end),(start,end),...].

init_fragment_by_precursor_dataframe(...[, ...])

Init zero fragment dataframe for the precursor_df.

init_fragment_dataframe_from_other(...[, dtype])

Init zero fragment dataframe from the reference_fragment_df (same rows and same columns)

init_zero_fragment_dataframe(peplen_array, ...)

Initialize a zero dataframe based on the peptide length (nAA) array (peplen_array) and charged_frag_types (column number).

join_left(left, right)

Joins all values in the left array to the values in the right array.

mask_fragments_for_charge_greater_than_precursor_charge(...)

Mask the fragment dataframe when the fragment charge is larger than the precursor charge

parse_all_frag_type_representation()

parse_charged_frag_type(charged_frag_type)

Opposite of get_charged_frag_types.

parse_fragment(frag_directions, ...)

Parse fragments to get fragment numbers, fragment positions and the indices excluded by the top-k filter in a single pass. This is faster than doing each operation individually and makes the most of the operations that are done in parallel.

remove_unused_fragments(precursor_df, ...[, ...])

Removes unused fragments of removed precursors, reannotates the frag_start_col and frag_stop_col

update_sliced_fragment_dataframe(...[, ...])

Set the values of the slices frag_start_end_list=[(start,end),(start,end),...] of fragment_df.

Data:

frag_mass_from_ref_ion_dict

Masses parsed from frag_type_representation_dict.

frag_type_representation_dict

Represent fragment ion types from b/y ions.

alphabase.peptide.fragment.add_new_frag_type(frag_type: str, representation: str)

Add a new fragment type to frag_type_representation_dict and update frag_mass_from_ref_ion_dict.

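Parameters:
  • frag_type (str) – name of the new fragment type

  • representation (str) – representation of the new fragment type relative to a b/y ion, in the same format as the values of frag_type_representation_dict, e.g. 'b+N(1)H(3)'
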
alphabase.peptide.fragment.calc_fragment_cardinality(precursor_df, fragment_mz_df, group_column='elution_group_idx', split_target_decoy=True)

Calculate the cardinality for a given fragment across a group of precursors. The cardinality is the number of precursors that have a given fragment at a given position.

All precursors within a group are expected to have the same number of fragments.

Parameters:
  • precursor_df (pd.DataFrame) – The precursor dataframe.

  • fragment_mz_df (pd.DataFrame) – The fragment mz dataframe.

  • group_column (str) – The column to group the precursors by. An integer column is expected.

  • split_target_decoy (bool) – If True, the cardinality is calculated for the target and decoy precursors separately.

alphabase.peptide.fragment.calc_fragment_count(precursor_df: DataFrame, fragment_intensity_df: DataFrame)

Calculates the number of fragments for each precursor.

Parameters:
  • precursor_df (pd.DataFrame) – precursor dataframe which contains the frag_start_idx and frag_stop_idx columns

  • fragment_intensity_df (pd.DataFrame) – fragment intensity dataframe which contains the fragment intensities

Returns:

array with the number of fragments for each precursor

Return type:

numpy.ndarray

alphabase.peptide.fragment.calc_fragment_mz_values_for_same_nAA(df_group: DataFrame, nAA: int, charged_frag_types: list)

alphabase.peptide.fragment.compress_fragment_indices(frag_idx)

Recalculates fragment indices to remove unused fragments. Can be used to compress a fragment library. Expects the fragment indices to be ordered by increasing values. The runtime should be O(N), with N being the number of fragment rows.

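>>> # input: start/stop indices into the uncompressed fragment rows
>>> frag_idx = [[6,  10],
...             [12, 14],
...             [20, 22]]
>>> # corresponding compressed start/stop indices
>>> frag_idx = [[0, 4],
...             [4, 6],
...             [6, 8]]
>>> # original fragment rows that are kept, in order
>>> fragment_pointer = [6,7,8,9,12,13,20,21]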
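
The example above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not the library's implementation: the function name `compress_fragment_indices_sketch` is hypothetical, it works on lists rather than arrays, and it assumes each pair is a half-open `[start, stop)` range ordered by increasing start.

```python
def compress_fragment_indices_sketch(frag_idx):
    """Return (new_frag_idx, fragment_pointer) for ordered [start, stop) pairs."""
    new_frag_idx = []
    fragment_pointer = []
    offset = 0  # number of fragment rows kept so far
    for start, stop in frag_idx:
        length = stop - start
        # the kept rows of this precursor now sit directly after the previous ones
        new_frag_idx.append([offset, offset + length])
        # remember which original rows have to be copied
        fragment_pointer.extend(range(start, stop))
        offset += length
    return new_frag_idx, fragment_pointer


new_idx, pointer = compress_fragment_indices_sketch([[6, 10], [12, 14], [20, 22]])
print(new_idx)   # [[0, 4], [4, 6], [6, 8]]
print(pointer)   # [6, 7, 8, 9, 12, 13, 20, 21]
```

Each input row is visited once, which is consistent with the O(N) runtime stated above.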

alphabase.peptide.fragment.concat_precursor_fragment_dataframes(precursor_df_list: List[DataFrame], fragment_df_list: List[DataFrame], *other_fragment_df_lists) → Tuple[DataFrame, ...]

Since fragment_df is indexed by precursor_df, concatenating multiple fragment_df dataframes shifts the positions that each precursor_df points to. This function keeps the indexed positions in precursor_df correct when concatenating multiple fragment_df dataframes.

Parameters:
  • precursor_df_list (List[pd.DataFrame]) – precursor dataframe list to concatenate

  • fragment_df_list (List[pd.DataFrame]) – fragment dataframe list to concatenate

  • other_fragment_df_lists – arbitrary other fragment dataframe lists to concatenate, e.g. fragment_mass_df, fragment_inten_df, …

Returns:

concatenated precursor_df, fragment_df, other_fragment_dfs …

Return type:

Tuple[pd.DataFrame,…]
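
The index bookkeeping described above can be illustrated without pandas. The sketch below is an assumption about the general idea, not the library's code: the helper `concat_with_offsets` is hypothetical, precursor tables are reduced to lists of `(start, stop)` pairs, and fragment tables to plain lists of rows.

```python
def concat_with_offsets(precursor_ranges_list, fragment_rows_list):
    """Concatenate fragment rows and shift each precursor's (start, stop) pair."""
    all_ranges, all_rows = [], []
    for ranges, rows in zip(precursor_ranges_list, fragment_rows_list):
        offset = len(all_rows)  # rows already placed before this table
        all_ranges.extend((start + offset, stop + offset) for start, stop in ranges)
        all_rows.extend(rows)
    return all_ranges, all_rows


ranges_a = [(0, 2), (2, 5)]
rows_a = ["a0", "a1", "a2", "a3", "a4"]
ranges_b = [(0, 3)]
rows_b = ["b0", "b1", "b2"]

merged_ranges, merged_rows = concat_with_offsets([ranges_a, ranges_b], [rows_a, rows_b])
print(merged_ranges)     # [(0, 2), (2, 5), (5, 8)]
print(merged_rows[5:8])  # ['b0', 'b1', 'b2']
```

Without the offset, the precursor from the second table would still point at rows 0 to 3, which now belong to the first table.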

alphabase.peptide.fragment.create_fragment_mz_dataframe(precursor_df: DataFrame, charged_frag_types: List, *, reference_fragment_df: DataFrame = None, inplace_in_reference: bool = False, batch_size: int = 500000, dtype: numpy.dtype = <class 'numpy.float32'>) → DataFrame

Generate fragment mass dataframe for the precursor_df. If the reference_fragment_df is provided and precursor_df contains frag_start_idx, it will generate the mz dataframe based on the reference. Otherwise it generates the mz dataframe from scratch.

Parameters:
  • precursor_df (pd.DataFrame) – precursors to generate fragment masses, if precursor_df contains the ‘frag_start_idx’ column, reference_fragment_df must be provided

  • charged_frag_types (List) – ['b_z1','b_z2','y_z1','y_z2','b_modloss_z1','y_H2O_z1', …]

  • reference_fragment_df (pd.DataFrame) – kwargs only. Generate fragment_mz_df based on this reference, as precursor_df.frag_start_idx and precursor_df.frag_stop_idx point to the indices in reference_fragment_df. Defaults to None

  • inplace_in_reference (bool) – kwargs only. Change values in place in the reference_fragment_df. Defaults to False

  • batch_size (int) – Number of peptides for each batch, to save RAM.

Returns:

fragment_mz_df with given charged_frag_types

Return type:

pd.DataFrame

alphabase.peptide.fragment.create_fragment_mz_dataframe_by_sort_precursor(precursor_df: DataFrame, charged_frag_types: List, batch_size: int = 500000, dtype: numpy.dtype = <class 'numpy.float32'>) → DataFrame

Sort nAA in precursor_df for faster fragment mz dataframe creation.

After sorting, the fragment mz values are contiguous in memory, which makes setting values in pandas faster.

Note that this function will change the order and index of precursor_df.

Parameters:
  • precursor_df (pd.DataFrame) – precursor dataframe

  • charged_frag_types (List) – fragment types list

  • batch_size (int, optional) – Calculate fragment mz values in batch. Defaults to 500000.

alphabase.peptide.fragment.fill_in_indices(frag_start_idxes: ndarray, frag_stop_idxes: ndarray, indices: ndarray, max_indices: ndarray, excluded_indices: ndarray, top_k: int, flattened_intensity: ndarray, number_of_fragment_types: int, max_frag_per_peptide: int = 300) → None

Fill in indices, max indices and excluded indices for each peptide.

  • indices: index of fragment per peptide (from 0 to max_index-1)

  • max_indices: max index of fragments per peptide (number of fragments per peptide)

  • excluded_indices: indices per peptide that are excluded because they are not among the top k

Parameters:
  • frag_start_idxes (np.ndarray) – start indices of fragments for each peptide

  • frag_stop_idxes (np.ndarray) – stop indices of fragments for each peptide

  • indices (np.ndarray) – index of fragment per peptide (from 0 to max_index-1). It will be filled in this function.

  • max_indices (np.ndarray) – max index of fragments per peptide (number of fragments per peptide). It will be filled in this function.

  • excluded_indices (np.ndarray) – indices per peptide that are excluded because they are not among the top k. It will be filled in this function.

  • top_k (int) – top k highest peaks to keep

  • flattened_intensity (np.ndarray) – Flattened fragment intensities

  • number_of_fragment_types (int) – number of types of fragments (e.g. b,y,b_modloss,y_modloss, …), equal to the number of columns in the fragment mz dataframe

  • max_frag_per_peptide (int, optional) – maximum number of fragments per peptide, Defaults to 300

alphabase.peptide.fragment.filter_fragment_number(precursor_df: DataFrame, fragment_intensity_df: DataFrame, n_fragments_allowed_column_name: str = 'n_fragments_allowed', n_allowed: int = 999)

Filters the number of fragments for each precursor.

Parameters:
  • precursor_df (pd.DataFrame) – precursor dataframe which contains the frag_start_idx and frag_stop_idx columns

  • fragment_intensity_df (pd.DataFrame) – fragment intensity dataframe which contains the fragment intensities

  • n_fragments_allowed_column_name (str, default = 'n_fragments_allowed') – column name in precursor_df which contains the number of allowed fragments

  • n_allowed (int, default = 999) – number of fragments which should be allowed

Return type:

None

alphabase.peptide.fragment.flatten_fragments(precursor_df: DataFrame, fragment_mz_df: DataFrame, fragment_intensity_df: DataFrame, min_fragment_intensity: float = -1, keep_top_k_fragments: int = 1000, custom_columns: list = ['type', 'number', 'position', 'charge', 'loss_type'], custom_df: Dict[str, DataFrame] = {}) → Tuple[DataFrame, DataFrame]

Converts the tabular fragment format consisting of the fragment_mz_df and the fragment_intensity_df into a linear fragment format. The linear fragment format will only retain fragments above a given intensity threshold with mz > 0. It consists of the columns mz, intensity, type, number, position, charge and loss_type, where each column refers to:

  • mz: PEAK_MZ_DTYPE, fragment mz value

  • intensity: PEAK_INTENSITY_DTYPE, fragment intensity value

  • type: uint8, ASCII code of the ion type. Lowercase letters are used for regular scoring ions used during search: (97=a, 98=b, 99=c, 120=x, 121=y, 122=z).

    Lowercase codes minus 64 are used for ions that are only quantified and not scored: (33=a, 34=b, 35=c, 56=x, 57=y, 58=z). By default all ions are scored and quantified. It is left to the user or search engine to decide which ions to use.

  • number: uint32, fragment series number

  • position: uint32, fragment position in sequence (from left to right, starts with 0)

  • charge: uint8, fragment charge

  • loss_type: int16, fragment loss type, 0=noloss, 17=NH3, 18=H2O, 98=H3PO4 (phos), …

The fragment pointers frag_start_idx and frag_stop_idx will be reannotated to the new fragment format.

For the ASCII-coded type column, the codes can be converted into byte strings with frag_df.type.values.view('S1').
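
The type encoding described above can be decoded with plain Python. This is a minimal sketch based only on the code tables listed in this section; the helper `decode_ion_type` is hypothetical and not part of alphabase.

```python
def decode_ion_type(code):
    """Return (ion_letter, is_scored) for a uint8 ion type code."""
    if code >= 97:
        # plain lowercase ASCII letter: scored and quantified
        return chr(code), True
    # lowercase code minus 64: quantified only, not scored
    return chr(code + 64), False


print(decode_ion_type(98))   # ('b', True)
print(decode_ion_type(57))   # ('y', False)

# A run of plain codes maps to letters like any ASCII byte string
print(bytes([97, 98, 121]).decode("ascii"))  # aby
```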

Parameters:
  • precursor_df (pd.DataFrame) – input precursor dataframe which contains the frag_start_idx and frag_stop_idx columns

  • fragment_mz_df (pd.DataFrame) – input fragment mz dataframe of shape (N, T) which contains N * T fragment mzs. Fragments with mz==0 will be excluded.

  • fragment_intensity_df (pd.DataFrame) – input fragment intensity dataframe of shape (N, T) which contains N * T fragment intensities. Could be empty (len==0) to exclude intensity values.

  • min_fragment_intensity (float, optional) – minimum intensity which should be retained. Defaults to -1

  • custom_columns (list, optional) – ‘mz’ and ‘intensity’ columns are required. Others could be customized. Defaults to [‘type’,’number’,’position’,’charge’,’loss_type’]

  • custom_df (Dict[str, pd.DataFrame], optional) – Append custom columns by providing additional dataframes of the same shape as fragment_mz_df and fragment_intensity_df. Defaults to {}.

Returns:

  • pd.DataFrame – precursor dataframe with added flat_frag_start_idx and flat_frag_stop_idx columns

  • pd.DataFrame – fragment dataframe with the columns mz, intensity, type, number, position, charge and loss_type, where each column refers to:

    • mz: PEAK_MZ_DTYPE, fragment mz value

    • intensity: PEAK_INTENSITY_DTYPE, fragment intensity value

    • type: uint8, ASCII code of the ion type. Lowercase letters are used for regular scoring ions used during search: (97=a, 98=b, 99=c, 120=x, 121=y, 122=z).

      Lowercase codes minus 64 are used for ions that are only quantified and not scored: (33=a, 34=b, 35=c, 56=x, 57=y, 58=z). By default all ions are scored and quantified. It is left to the user or search engine to decide which ions to use.

    • number: uint32, fragment series number

    • position: uint32, fragment position in sequence (from left to right, starts with 0)

    • charge: uint8, fragment charge

    • loss_type: int16, fragment loss type, 0=noloss, 17=NH3, 18=H2O, 98=H3PO4 (phos), …
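
To make the tabular-to-linear conversion concrete, here is a simplified sketch in plain Python. It is not the library's implementation: the function `flatten_sketch` is hypothetical, it only carries mz and intensity (no type, number, position, charge or loss_type), it ignores keep_top_k_fragments, and it assumes that "above the threshold" means a strict `>` comparison.

```python
def flatten_sketch(frag_ranges, mz_rows, intensity_rows, min_intensity=-1.0):
    """Flatten (N, T) tables into flat lists and re-annotate the per-precursor ranges."""
    flat_mz, flat_intensity, flat_ranges = [], [], []
    for start, stop in frag_ranges:
        flat_start = len(flat_mz)
        for row in range(start, stop):
            for mz, intensity in zip(mz_rows[row], intensity_rows[row]):
                # keep only real fragments (mz > 0) above the intensity threshold
                if mz > 0 and intensity > min_intensity:
                    flat_mz.append(mz)
                    flat_intensity.append(intensity)
        # new pointers into the flat lists for this precursor
        flat_ranges.append((flat_start, len(flat_mz)))
    return flat_mz, flat_intensity, flat_ranges


frag_ranges = [(0, 2), (2, 3)]                       # two precursors
mz_rows = [[100.0, 0.0], [200.0, 300.0], [150.0, 250.0]]
intensity_rows = [[1.0, 0.0], [0.5, 0.0], [0.0, 2.0]]

flat_mz, flat_intensity, flat_ranges = flatten_sketch(
    frag_ranges, mz_rows, intensity_rows, min_intensity=0.1
)
print(flat_mz)      # [100.0, 200.0, 250.0]
print(flat_ranges)  # [(0, 2), (2, 3)]
```

The returned ranges play the role of the flat_frag_start_idx and flat_frag_stop_idx columns mentioned in the Returns section.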

alphabase.peptide.fragment.frag_mass_from_ref_ion_dict = {'a': {'add_mass': -27.99491461957, 'ref_ion': 'b'}, 'b_H2O': {'add_mass': -18.01056468403, 'ref_ion': 'b'}, 'b_NH3': {'add_mass': -17.02654910112, 'ref_ion': 'b'}, 'c': {'add_mass': 17.02654910112, 'ref_ion': 'b'}, 'c_lossH': {'add_mass': 16.01872406889, 'ref_ion': 'b'}, 'x': {'add_mass': 25.97926455511, 'ref_ion': 'y'}, 'y_H2O': {'add_mass': -18.01056468403, 'ref_ion': 'y'}, 'y_NH3': {'add_mass': -17.02654910112, 'ref_ion': 'y'}, 'z': {'add_mass': -16.01872406889, 'ref_ion': 'y'}, 'z_addH': {'add_mass': -15.01089903666, 'ref_ion': 'y'}}

Masses parsed from frag_type_representation_dict.

alphabase.peptide.fragment.frag_type_representation_dict = {'a': 'b+C(-1)O(-1)', 'b_H2O': 'b+H(-2)O(-1)', 'b_NH3': 'b+N(-1)H(-3)', 'c': 'b+N(1)H(3)', 'c_lossH': 'b+N(1)H(2)', 'x': 'y+C(1)O(1)H(-2)', 'y_H2O': 'y+H(-2)O(-1)', 'y_NH3': 'y+N(-1)H(-3)', 'z': 'y+N(-1)H(-2)', 'z_addH': 'y+N(-1)H(-1)'}

Represent fragment ion types from b/y ions. Modification neutral losses (i.e. modloss) are not here as they have variable atoms added to b/y ions.
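
The relation between the two dictionaries can be checked by hand. The sketch below parses a representation string and sums monoisotopic element masses; the helper `parse_representation` and the `MONO_MASS` table are assumptions made for illustration and are not the library's own parser or mass table.

```python
import re

# Monoisotopic masses (Da) of the elements that occur in the representations above
MONO_MASS = {"C": 12.0, "H": 1.00782503223, "N": 14.00307400443, "O": 15.99491461957}


def parse_representation(representation):
    """Split 'ref_ion+Formula' and sum the masses of the added or removed atoms."""
    ref_ion, formula = representation.split("+", 1)
    add_mass = sum(
        MONO_MASS[element] * int(count)
        for element, count in re.findall(r"([A-Z][a-z]?)\((-?\d+)\)", formula)
    )
    return {"ref_ion": ref_ion, "add_mass": add_mass}


a_ion = parse_representation("b+C(-1)O(-1)")
x_ion = parse_representation("y+C(1)O(1)H(-2)")
print(a_ion["ref_ion"], round(a_ion["add_mass"], 6))  # b -27.994915
print(x_ion["ref_ion"], round(x_ion["add_mass"], 6))  # y 25.979265
```

Both results agree with the 'a' and 'x' entries of frag_mass_from_ref_ion_dict to the printed precision.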

alphabase.peptide.fragment.get_charged_frag_types(frag_types: List[str], max_frag_charge: int = 2) → List[str]

Combine fragment types and charge states.

Parameters:
  • frag_types (List[str]) – e.g. [‘b’,’y’,’b_modloss’,’y_modloss’]

  • max_frag_charge (int) – max fragment charge. (default: 2)

Returns:

charged fragment types

Return type:

List[str]

Examples

>>> frag_types=['b','y','b_modloss','y_modloss']
>>> get_charged_frag_types(frag_types, 2)
['b_z1','b_z2','y_z1','y_z2','b_modloss_z1','b_modloss_z2','y_modloss_z1','y_modloss_z2']

alphabase.peptide.fragment.get_sliced_fragment_dataframe(fragment_df: DataFrame, frag_start_end_list: List | ndarray, charged_frag_types: List = None) → DataFrame

Get the sliced fragment_df from frag_start_end_list=[(start,end),(start,end),…].

Parameters:
  • fragment_df (pd.DataFrame) – fragment dataframe to get values

  • frag_start_end_list (Union) – List[Tuple[int,int]], e.g. [(start,end),(start,end),…] or np.ndarray

  • charged_frag_types (List[str]) – e.g. [‘b_z1’,’b_z2’,’y_z1’,’y_z2’]. if None, all columns will be considered

Returns:

sliced fragment_df. If charged_frag_types is None, return fragment_df with all columns

Return type:

pd.DataFrame
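
The slicing behaviour can be pictured with a small stand-in that uses lists of dicts instead of a dataframe. This is only a sketch: `slice_fragment_rows` is a hypothetical helper, and it assumes that `end` is exclusive, as in ordinary Python slicing.

```python
def slice_fragment_rows(fragment_rows, frag_start_end_list, charged_frag_types=None):
    """Collect the rows of every (start, end) slice, optionally restricted to some columns."""
    sliced = []
    for start, end in frag_start_end_list:
        for row in fragment_rows[start:end]:
            if charged_frag_types is None:
                sliced.append(dict(row))  # keep all columns
            else:
                sliced.append({name: row[name] for name in charged_frag_types})
    return sliced


fragment_rows = [{"b_z1": i, "y_z1": 10 * i} for i in range(6)]
subset = slice_fragment_rows(fragment_rows, [(0, 2), (4, 5)], ["y_z1"])
print(subset)  # [{'y_z1': 0}, {'y_z1': 10}, {'y_z1': 40}]
```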

alphabase.peptide.fragment.init_fragment_by_precursor_dataframe(precursor_df, charged_frag_types: List[str], *, reference_fragment_df: DataFrame = None, dtype: numpy.dtype = <class 'numpy.float32'>, inplace_in_reference: bool = False)

Init zero fragment dataframe for the precursor_df. If the reference_fragment_df is provided, the result dataframe’s length will be the same as reference_fragment_df. Otherwise it generates the dataframe from scratch.

Parameters:
  • precursor_df (pd.DataFrame) – precursors to generate fragment masses. If precursor_df contains the 'frag_start_idx' column, it is better to provide reference_fragment_df, as precursor_df.frag_start_idx and precursor_df.frag_stop_idx point to the indices in reference_fragment_df

  • charged_frag_types (List) – [‘b_z1’,’b_z2’,’y_z1’,’y_z2’,’b_modloss_z1’,’y_H2O_z1’…]

  • reference_fragment_df (pd.DataFrame) – init zero fragment_mz_df based on this reference. If None, fragment_mz_df will be initialized by alphabase.peptide.fragment.init_zero_fragment_dataframe(). Defaults to None.

  • dtype (np.dtype) – dtype of fragment mz values, Defaults to PEAK_MZ_DTYPE.

  • inplace_in_reference (bool, optional) – whether to calculate the fragment mz values in place in the reference_fragment_df (default: False)

Returns:

zero fragment_df with given charged_frag_types columns

Return type:

pd.DataFrame

alphabase.peptide.fragment.init_fragment_dataframe_from_other(reference_fragment_df: DataFrame, dtype=<class 'numpy.float32'>)

Init zero fragment dataframe from the reference_fragment_df (same rows and same columns)

alphabase.peptide.fragment.init_zero_fragment_dataframe(peplen_array: ndarray, charged_frag_types: List[str], dtype=<class 'numpy.float32'>) → Tuple[DataFrame, ndarray, ndarray]

Initialize a zero dataframe based on the peptide length (nAA) array (peplen_array) and charged_frag_types (column number). The row number of the returned dataframe is np.sum(peplen_array-1).

Parameters:
  • peplen_array (np.ndarray) – peptide lengths for the fragment dataframe

  • charged_frag_types (List[str]) – [‘b_z1’,’b_z2’,’y_z1’,’y_z2’,’b_modloss_z1’,’y_H2O_z1’…]

Returns:

pd.DataFrame, fragment_df with zero values

np.ndarray (int64), the start indices point to the fragment_df for each peptide

np.ndarray (int64), the end indices point to the fragment_df for each peptide

Return type:

tuple
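
The row count np.sum(peplen_array-1) follows from each peptide of length nAA contributing nAA-1 fragment rows. The sketch below derives start and end indices from that rule with cumulative sums; the helper `fragment_ranges_from_lengths` is hypothetical, and the assumption that the blocks are laid out back to back in input order is ours.

```python
from itertools import accumulate


def fragment_ranges_from_lengths(peplen_array):
    """Return (starts, stops) of each peptide's block of nAA - 1 fragment rows."""
    rows_per_peptide = [nAA - 1 for nAA in peplen_array]
    stops = list(accumulate(rows_per_peptide))
    starts = [stop - n for stop, n in zip(stops, rows_per_peptide)]
    return starts, stops


starts, stops = fragment_ranges_from_lengths([7, 10, 8])
print(starts)  # [0, 6, 15]
print(stops)   # [6, 15, 22]
```

The last stop index, 22, equals the total number of rows, (7-1) + (10-1) + (8-1).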

alphabase.peptide.fragment.join_left(left: ndarray, right: ndarray)

Joins all values in the left array to the values in the right array. The index of the matching element in the right array is returned. If the value wasn’t found, -1 is returned. If the element appears more than once, the last appearance is used.

Parameters:
  • left (numpy.ndarray) – left array which should be matched

  • right (numpy.ndarray) – right array which should be matched to

Returns:

array with the length of the left array, whose entries are indices pointing into the right array. -1 is returned for values that could not be found in the right array

Return type:

numpy.ndarray, dtype = int64
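
The matching rule described above (index into the right array, -1 when missing, last appearance wins) can be written down with a dictionary. This is a plain-Python sketch of the semantics, not the library's implementation, and `join_left_sketch` is a hypothetical name.

```python
def join_left_sketch(left, right):
    """For each value in left, return its index in right, or -1 if it is absent."""
    last_index = {}
    for i, value in enumerate(right):
        last_index[value] = i  # later occurrences overwrite earlier ones
    return [last_index.get(value, -1) for value in left]


print(join_left_sketch([5, 3, 9], [3, 5, 3, 7]))  # [1, 2, -1]
```

The value 3 occurs at positions 0 and 2 of the right array, so position 2 is reported.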

alphabase.peptide.fragment.mask_fragments_for_charge_greater_than_precursor_charge(fragment_df: DataFrame, precursor_charge_array: ndarray, nAA_array: ndarray, *, candidate_fragment_charges: list = [2, 3, 4])

Mask the fragment dataframe when the fragment charge is larger than the precursor charge

alphabase.peptide.fragment.parse_all_frag_type_representation()

alphabase.peptide.fragment.parse_charged_frag_type(charged_frag_type: str) → Tuple[str, int]

Opposite of get_charged_frag_types.

Parameters:

charged_frag_type (str) – e.g. ‘y_z1’, ‘b_modloss_z1’

Returns:

str. Fragment type, e.g. ‘b’,’y’

int. Charge state

Return type:

tuple
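
The naming convention shared by get_charged_frag_types and parse_charged_frag_type is a fragment type followed by `_z` and the charge. The round trip below is a sketch of that convention only: both helper names are hypothetical, and treating everything before the final `_z` as the fragment type (so 'b_modloss_z1' gives 'b_modloss') is our assumption.

```python
def combine_charged_frag_types(frag_types, max_frag_charge=2):
    """Append _z1 ... _z<max_frag_charge> to every fragment type."""
    return [
        f"{frag_type}_z{charge}"
        for frag_type in frag_types
        for charge in range(1, max_frag_charge + 1)
    ]


def split_charged_frag_type(charged_frag_type):
    """Split at the last '_z' into (fragment type, charge)."""
    frag_type, charge = charged_frag_type.rsplit("_z", 1)
    return frag_type, int(charge)


charged = combine_charged_frag_types(["b", "y_modloss"], 2)
print(charged)  # ['b_z1', 'b_z2', 'y_modloss_z1', 'y_modloss_z2']
print(split_charged_frag_type("b_modloss_z1"))  # ('b_modloss', 1)
```

Splitting from the right keeps fragment types that themselves start with 'z', such as 'z_z1', intact.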

alphabase.peptide.fragment.parse_fragment(frag_directions: ndarray, frag_start_idxes: ndarray, frag_stop_idxes: ndarray, top_k: int, intensities: ndarray, number_of_fragment_types: int) → Tuple[ndarray, ndarray, ndarray]

Parse fragments to get fragment numbers, fragment positions and the indices excluded by the top-k filter in a single pass. This is faster than doing each operation individually and makes the most of the operations that are done in parallel.

Parameters:
  • frag_directions (np.ndarray) – directions of fragments for each peptide

  • frag_start_idxes (np.ndarray) – start indices of fragments for each peptide

  • frag_stop_idxes (np.ndarray) – stop indices of fragments for each peptide

  • top_k (int) – top k highest peaks to keep

  • intensities (np.ndarray) – Flattened fragment intensities

  • number_of_fragment_types (int) – number of types of fragments (e.g. b,y,b_modloss,y_modloss, …), equal to the number of columns in the fragment mz dataframe

Returns:

Tuple of fragment numbers, fragment positions and the indices excluded by the top-k filter

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

alphabase.peptide.fragment.remove_unused_fragments(precursor_df: DataFrame, fragment_df_list: Tuple[DataFrame, ...], frag_start_col: str = 'frag_start_idx', frag_stop_col: str = 'frag_stop_idx') → Tuple[DataFrame, Tuple[DataFrame, ...]]

Removes the unused fragments of removed precursors and reannotates the frag_start_col and frag_stop_col columns.

Parameters:
  • precursor_df (pd.DataFrame) – Precursor dataframe which contains frag_start_idx and frag_stop_idx columns

  • fragment_df_list (List[pd.DataFrame]) – A list of fragment dataframes which should be compressed by removing unused fragments. Multiple fragment dataframes can be provided, which will all be sliced in the same way. This allows slicing both the fragment_mz_df and the fragment_intensity_df. At least one fragment dataframe needs to be provided.

  • frag_start_col (str, optional) – Fragment start idx column in precursor_df, such as “frag_start_idx” and “peak_start_idx”. Defaults to “frag_start_idx”.

  • frag_stop_col (str, optional) – Fragment stop idx column in precursor_df, such as “frag_stop_idx” and “peak_stop_idx”. Defaults to “frag_stop_idx”.

Returns:

returns the reindexed precursor DataFrame and the sliced fragment DataFrames

Return type:

pd.DataFrame, List[pd.DataFrame]

alphabase.peptide.fragment.update_sliced_fragment_dataframe(fragment_df: DataFrame, fragment_df_vals: ndarray, values: ndarray, frag_start_end_list: List[Tuple[int, int]], charged_frag_types: List[str] = None)

Set the values of the slices frag_start_end_list=[(start,end),(start,end),…] of fragment_df.

Parameters:
  • fragment_df (pd.DataFrame) – fragment dataframe to set the values

  • fragment_df_vals (np.ndarray) – The fragment_df.to_numpy(copy=True), to prevent readonly assignment.

  • values (np.ndarray) – values to set

  • frag_start_end_list (List[Tuple[int,int]]) – e.g. [(start,end),(start,end),…]

  • charged_frag_types (List[str], optional) – e.g. [‘b_z1’,’b_z2’,’y_z1’,’y_z2’]. If None, the columns of values should be the same as fragment_df’s columns. It is much faster if charged_frag_types is None as we use numpy slicing, otherwise we use pd.loc (much slower). Defaults to None.