Developer Tutorial: Proteomics meets the scverse¶

This tutorial showcases how alphabase can be used to interface proteomics search engine outputs and the scverse, a python-centric software ecosystem that implements various tools for the analysis of (single-cell) omics data.

The central data structure to represent and store omics data in the scverse is the anndata object, an optimized data container for the representation of high-dimensional experimental data.

alphabases core functions can be used by developers to easily create anndata objects from search engine outputs, more specifically peptide-spectrum-matches (see the tutorial on PSM readers). anndata can then interface with various software tools for downstream tasks, of which many might also be applicable to proteomics data. In this use case, alphabase acts as an adapter, that provides an interface between many search engines upstream and the various software tools downstream

Load libraries¶

Note that you’ll have to have anndata and mudata installed in your software environment, either by installing it next to your alphabase installation

pip install alphabase anndata mudata

or by installing

pip install "alphabase[docs]"

[1]:

import warnings
from typing import Literal, Optional

import anndata as ad
import mudata as md
import numpy as np
import pandas as pd

from alphabase.psm_reader import psm_reader_provider
from alphabase.psm_reader.keys import PsmDfCols
from alphabase.tools.data_downloader import DataShareDownloader

/Users/lucas-diedrich/Documents/Projects/alphaX/alphabase/alphabase/alphabase/tools/data_downloader.py:4: DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
  import cgi
/Users/lucas-diedrich/Documents/Projects/alphaX/alphabase/alphabase/alphabase/tools/data_downloader.py:20: ImportWarning: Dependency 'progressbar' not installed. Download progress will not be displayed.
  warnings.warn(

Implement anndata reader¶

Here, we implement an anndata reader that can read the output of multiple search engines into the standardized anndata object

Walk through¶

Load file into unified structure¶

The first step is to load the PSM table into a standardized pandas.DataFrame. As described in the PSM reader tutorial, alphabase abstracts away search engine-specific formats and harmonizes column names into a consistent schema. This is accomplished by selecting the appropriate reader for your search engine:

reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)

Then, load the file:

psm_report = reader.load(file_path)

The resulting dataframe will include standardized column names such as:

intensity – protein intensity (or via custom mapping, the precursor intensity)
proteins – feature ID
raw_name – sample/run identifier

✅ Tip: You don’t need to worry about the original column names—alphabase maps them automatically based on the search engine.

Note that alphabase standardizes column names from different search engines so that we only need to define a simple dictionary lookup that maps the feature-level specific columns (sample id, feature id, intensity) to support 3 different feature levels in our reader function across all supported search engines:

FEATURE_LEVEL_MAPPING = {
    "proteins": {"sample_id": PsmDfCols.RAW_NAME, "feature_id": PsmDfCols.PROTEINS, "intensity": "PsmDfCols.INTENSITY"},
    "peptides": ...,
    "precursors": ...
}

Pivot to wide format¶

PSM tables are typically in long format—each row corresponds to a single precursor-spectrum match. To create an AnnData object, we need a matrix with samples as rows and features (e.g., proteins or peptides) as columns.

We use pandas.pivot_table to reshape the data:

pivot_psm_report = pd.pivot_table(
    psm_report,
    index=PsmDfCols.RAW_NAME,
    columns=PsmDfCols.PROTEINS,
    values=PsmDfCols.INTENSITY,
    aggfunc="first",
    fill_value=np.nan,
)

💡 Protein-level columns in PSM tables can be redundant—many precursors will point to the same protein and the PSM report stores the same intensity values for all redundant protein groups. We use aggfunc="first" to select the first observed intensity for a given (sample, feature) pair

💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into protein groups as features.

💡 Here, we fill missing values with NaN, following standard practice in proteomics.

Build anndata object¶

Now that we have a wide-format matrix of intensities, creating the AnnData object is straightforward:

adata = ad.AnnData(
    X=pivot_psm_report.values,
    obs=pivot_psm_report.index.to_frame(index=False),
    var=pivot_psm_report.columns.to_frame(index=False)
)

In the implementation below, we slightly modify this logic as we need to account for the customization

Implementation¶

The complete implementation might look a bit like this:

[2]:

FEATURE_LEVEL_MAPPING = {
    "protein":
        {
            "sample_id": PsmDfCols.RAW_NAME,
            "feature_id": PsmDfCols.PROTEINS,
            "intensity": PsmDfCols.INTENSITY,
        },
    "peptide":
        {
            "sample_id": PsmDfCols.RAW_NAME,
            "feature_id": PsmDfCols.SEQUENCE,
            "intensity": PsmDfCols.PEPTIDE_INTENSITY,
        },
    "precursor":
        {
            "sample_id": PsmDfCols.RAW_NAME,
            "feature_id": PsmDfCols.PRECURSOR_ID,
            "intensity": PsmDfCols.PRECURSOR_INTENSITY,
        }
}

def read_psm(
    file_path: str,
    reader_type: str,
    feature_level: Literal["protein", "peptide", "precursor"] = "protein",
    *,
    intensity_column: Optional[str] = None,
    feature_column: Optional[str] = None,
    sample_column: Optional[str] = None,
    custom_column_mapping: Optional[dict[str, str]] = None,
    **reader_kwargs,
) -> ad.AnnData:
    """Convert a PSM table to an anndata object that stores the feature intensities

    Parameters
    ----------
    file_path
        Path to file
    reader_type
        Type of search engine output. Must be one of the implemented readers
        (see: `alphabase.psm_reader.psm_reader_provider.reader_dict.keys()`)
    intensity_column
        Name of the column storing intensity data. Default to `intensity` key `psm_reader.yaml`
    feature_column
        Name of the column storing proteins ids. Defaults to `proteins` key in `psm_reader.yaml`
    sample_column
        Name of the column storing raw (or run) name. Defaults to `raw_name` key in `psm_reader.yaml`
    **reader_kwargs
        Passed to :meth:`alphabase.psm_reader.psm_reader_provider.get_reader`

    Returns
    -------
    :class:`anndata.AnnData`

        with
            - .X: Intensities as specified in `intensity_column`
            - .obs: Empty `.obs` dataframe with sample name (run) as index
            - .var: `.var` dataframe with search engine-specific metadata on features as values

    Raises
    ------
    warning
        For redundant features
    """
    # Get correct reader
    reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)

    if custom_column_mapping:
        reader.add_column_mapping(custom_column_mapping)

    # Read file
    psm_report = reader.load(file_path)

    sample_column = FEATURE_LEVEL_MAPPING.get(feature_level).get("sample_id") if sample_column is None else sample_column
    feature_column = FEATURE_LEVEL_MAPPING.get(feature_level).get("feature_id")  if feature_column is None else feature_column
    intensity_column = FEATURE_LEVEL_MAPPING.get(feature_level).get("intensity") if intensity_column is None else intensity_column

    # Pivot from long to wide format
    # The psm report is oriented in a long format, while the count table has a wide format
    # Thus, we pivot the psm report so that it has the shape (samples x features)
    pivot_psm_report = pd.pivot_table(
        psm_report,
        index=sample_column,
        columns=feature_column,
        values=intensity_column,
        aggfunc="first",
        fill_value=np.nan,
    )

    obs = pd.DataFrame(index=pivot_psm_report.index)
    var = pd.DataFrame(index=pivot_psm_report.columns)

    # Use custom names instead of streamlined names for custom columns to prevent incorrect
    # naming (e.g. "protein" for precursor indices for custom features)
    # obs = obs.rename_axis(index=sample_column) if sample_column is not None else obs
    # var = var.rename_axis(index=feature_column) if feature_column is not None else var

    # Assemble to anndata
    return ad.AnnData(X=pivot_psm_report.values, obs=obs, var=var)

Usage - Interact with anndata reader¶

We can explore the behaviour of this reader with some sample data, in this case an DiaNN PSM report. The PSM report contains data from 5 HeLa cell digests that was generated from a label-free DIA run on an Orbitrap Astral.

[3]:

def get_alphadia_example(output_dir: Optional[str] = None) -> str:
    """Get example data for the tutorial

    The function downloads an example alphadia v2.0.0 PSM report (.parquet) and stores it
    in `output_dir`, or, alternatively in a temporary directory

    Parameter
    ---------
    output_dir
        Output directory. If `None`, creates a temporary directory

    Returns
    -------
    File location

    Notes
    -----
    File size (parquet): ca. 50 MB

    References
    ----------
    > Wallmann, G., Skowronek, P., Brennsteiner, V. et al. AlphaDIA enables DIA transfer learning for feature-free proteomics.
    Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02791-w

    """
    EXAMPLE_URL = "https://datashare.biochem.mpg.de/s/PjjGzy5KpREzntj"

    if output_dir is None:
        from tempfile import tempdir

        output_dir = tempdir

    downloader = DataShareDownloader(url=EXAMPLE_URL, output_dir=output_dir)

    return downloader.download()

[4]:

# Download example data
alphadia_path = get_alphadia_example()

/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia_hela_precursors.parquet already exists (54.18055438995361 MB)

Per default, the function is able to read protein-level information from the PSM report. We see that alphadia identified 9880 distinct protein groups in the data

[5]:

protein_groups = read_psm(
    alphadia_path,
    reader_type="alphadia_parquet",
    feature_level="protein"
)
protein_groups

[5]:

AnnData object with n_obs × n_vars = 6 × 9880

If we, instead of using the default settings, pass the custom column that identifies precursors and precursor intensities, we can also read in precursor-level data into an anndata.AnnData object. Note that this requires users to specify the (search engine-specific) column names for precursor identifies (=features) and their intensity values. In the given dataset, alphadia v2.0.0 identified 117615 distinct precursors

[6]:

precursors = read_psm(
    alphadia_path,
    reader_type="alphadia_parquet",
    feature_level="precursor"
)
precursors

/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)

[6]:

AnnData object with n_obs × n_vars = 6 × 117615

Usage - Downstream applications¶

We can use the anndata objects for downstream computations. Here, for example, we define a small function that computes the data completeness per sample based on the standardized anndata structure. In contrast to working with PSM tables directly, this function is applicable to data derived from any PSM report, as the data structure has been streamlined

[7]:

def compute_missing_value_proportion(adata: ad.AnnData, sample_column: Optional[str] = None) -> pd.DataFrame:
    """Compute proportion of missing values per sample"""
    index = adata.obs_names if sample_column is None else adata.obs[sample_column]

    return pd.DataFrame(
        np.isnan(adata.X).sum(axis=1) / adata.n_vars,
        index=index, columns=["proportion_missing_values"]
    )

[8]:

compute_missing_value_proportion(protein_groups)

[8]:

	proportion_missing_values
raw_name
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01	0.008603
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02	0.008704
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03	0.009615
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01	0.008907
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02	0.012753
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03	0.009312

[9]:

compute_missing_value_proportion(precursors)

[9]:

	proportion_missing_values
raw_name
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01	0.139344
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02	0.137168
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03	0.138265
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01	0.140764
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02	0.162105
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03	0.135178

Bind precursor and protein data in a single container¶

We can use the multi-modal extension of anndata, called mudata to store protein and precursor data in a single container

[10]:

mdata = md.MuData(
    {"proteins": protein_groups, "precursors": precursors}
)

mdata

/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/mudata/_core/mudata.py:1598: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
  self._update_attr("var", axis=0, join_common=join_common)
/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/mudata/_core/mudata.py:1461: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
  self._update_attr("obs", axis=1, join_common=join_common)

[10]:

MuData object with n_obs × n_vars = 6 × 127495
  2 modalities
    proteins:       6 x 9880
    precursors:     6 x 117615

Conclusion¶

In this tutorial, we demonstrated how alphabase can serve as a bridge between proteomics search engine outputs and the anndata structure—an efficient, standardized format widely adopted in the scverse ecosystem. By converting PSM reports into AnnData objects, developers can seamlessly integrate proteomics data with a growing suite of tools for analysis, visualization, and downstream modeling.

We implemented a very similar function in our scverse-compatible analysis package alphapepttools