Developer Tutorial: Proteomics meets the scverse¶
This tutorial showcases how alphabase can be used to interface proteomics search engine outputs and the scverse, a python-centric software ecosystem that implements various tools for the analysis of (single-cell) omics data.
The central data structure to represent and store omics data in the scverse is the anndata object, an optimized data container for the representation of high-dimensional experimental data.
alphabases core functions can be used by developers to easily create anndata objects from search engine outputs, more specifically peptide-spectrum-matches (see the tutorial on PSM readers). anndata can then interface with various software tools for downstream tasks, of which many might also be applicable to proteomics data. In this use case, alphabase acts as an adapter, that provides an interface between many search engines upstream and the
various software tools downstream
Load libraries¶
Note that you’ll have to have anndata and mudata installed in your software environment, either by installing it next to your alphabase installation
pip install alphabase anndata mudata
or by installing
pip install "alphabase[docs]"
[1]:
import warnings
from typing import Literal, Optional
import anndata as ad
import mudata as md
import numpy as np
import pandas as pd
from alphabase.psm_reader import psm_reader_provider
from alphabase.psm_reader.keys import PsmDfCols
from alphabase.tools.data_downloader import DataShareDownloader
/Users/lucas-diedrich/Documents/Projects/alphaX/alphabase/alphabase/alphabase/tools/data_downloader.py:4: DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
import cgi
/Users/lucas-diedrich/Documents/Projects/alphaX/alphabase/alphabase/alphabase/tools/data_downloader.py:20: ImportWarning: Dependency 'progressbar' not installed. Download progress will not be displayed.
warnings.warn(
Implement anndata reader¶
Here, we implement an anndata reader that can read the output of multiple search engines into the standardized anndata object
Walk through¶
Load file into unified structure¶
The first step is to load the PSM table into a standardized pandas.DataFrame. As described in the PSM reader tutorial, alphabase abstracts away search engine-specific formats and harmonizes column names into a consistent schema. This is accomplished by selecting the appropriate reader for your search engine:
reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)
Then, load the file:
psm_report = reader.load(file_path)
The resulting dataframe will include standardized column names such as:
intensity – protein intensity (or via custom mapping, the precursor intensity)
proteins – feature ID
raw_name – sample/run identifier
✅ Tip: You don’t need to worry about the original column names—alphabase maps them automatically based on the search engine.
Note that alphabase standardizes column names from different search engines so that we only need to define a simple dictionary lookup that maps the feature-level specific columns (sample id, feature id, intensity) to support 3 different feature levels in our reader function across all supported search engines:
FEATURE_LEVEL_MAPPING = {
"proteins": {"sample_id": PsmDfCols.RAW_NAME, "feature_id": PsmDfCols.PROTEINS, "intensity": "PsmDfCols.INTENSITY"},
"peptides": ...,
"precursors": ...
}
Pivot to wide format¶
PSM tables are typically in long format—each row corresponds to a single precursor-spectrum match. To create an AnnData object, we need a matrix with samples as rows and features (e.g., proteins or peptides) as columns.
We use pandas.pivot_table to reshape the data:
pivot_psm_report = pd.pivot_table(
psm_report,
index=PsmDfCols.RAW_NAME,
columns=PsmDfCols.PROTEINS,
values=PsmDfCols.INTENSITY,
aggfunc="first",
fill_value=np.nan,
)
💡 Protein-level columns in PSM tables can be redundant—many precursors will point to the same protein and the PSM report stores the same intensity values for all redundant protein groups. We use aggfunc="first" to select the first observed intensity for a given (sample, feature) pair
💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into protein groups as features.
💡 Here, we fill missing values with NaN, following standard practice in proteomics.
Build anndata object¶
Now that we have a wide-format matrix of intensities, creating the AnnData object is straightforward:
adata = ad.AnnData(
X=pivot_psm_report.values,
obs=pivot_psm_report.index.to_frame(index=False),
var=pivot_psm_report.columns.to_frame(index=False)
)
In the implementation below, we slightly modify this logic as we need to account for the customization
Implementation¶
The complete implementation might look a bit like this:
[2]:
FEATURE_LEVEL_MAPPING = {
"protein":
{
"sample_id": PsmDfCols.RAW_NAME,
"feature_id": PsmDfCols.PROTEINS,
"intensity": PsmDfCols.INTENSITY,
},
"peptide":
{
"sample_id": PsmDfCols.RAW_NAME,
"feature_id": PsmDfCols.SEQUENCE,
"intensity": PsmDfCols.PEPTIDE_INTENSITY,
},
"precursor":
{
"sample_id": PsmDfCols.RAW_NAME,
"feature_id": PsmDfCols.PRECURSOR_ID,
"intensity": PsmDfCols.PRECURSOR_INTENSITY,
}
}
def read_psm(
file_path: str,
reader_type: str,
feature_level: Literal["protein", "peptide", "precursor"] = "protein",
*,
intensity_column: Optional[str] = None,
feature_column: Optional[str] = None,
sample_column: Optional[str] = None,
custom_column_mapping: Optional[dict[str, str]] = None,
**reader_kwargs,
) -> ad.AnnData:
"""Convert a PSM table to an anndata object that stores the feature intensities
Parameters
----------
file_path
Path to file
reader_type
Type of search engine output. Must be one of the implemented readers
(see: `alphabase.psm_reader.psm_reader_provider.reader_dict.keys()`)
intensity_column
Name of the column storing intensity data. Default to `intensity` key `psm_reader.yaml`
feature_column
Name of the column storing proteins ids. Defaults to `proteins` key in `psm_reader.yaml`
sample_column
Name of the column storing raw (or run) name. Defaults to `raw_name` key in `psm_reader.yaml`
**reader_kwargs
Passed to :meth:`alphabase.psm_reader.psm_reader_provider.get_reader`
Returns
-------
:class:`anndata.AnnData`
with
- .X: Intensities as specified in `intensity_column`
- .obs: Empty `.obs` dataframe with sample name (run) as index
- .var: `.var` dataframe with search engine-specific metadata on features as values
Raises
------
warning
For redundant features
"""
# Get correct reader
reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)
if custom_column_mapping:
reader.add_column_mapping(custom_column_mapping)
# Read file
psm_report = reader.load(file_path)
sample_column = FEATURE_LEVEL_MAPPING.get(feature_level).get("sample_id") if sample_column is None else sample_column
feature_column = FEATURE_LEVEL_MAPPING.get(feature_level).get("feature_id") if feature_column is None else feature_column
intensity_column = FEATURE_LEVEL_MAPPING.get(feature_level).get("intensity") if intensity_column is None else intensity_column
# Pivot from long to wide format
# The psm report is oriented in a long format, while the count table has a wide format
# Thus, we pivot the psm report so that it has the shape (samples x features)
pivot_psm_report = pd.pivot_table(
psm_report,
index=sample_column,
columns=feature_column,
values=intensity_column,
aggfunc="first",
fill_value=np.nan,
)
obs = pd.DataFrame(index=pivot_psm_report.index)
var = pd.DataFrame(index=pivot_psm_report.columns)
# Use custom names instead of streamlined names for custom columns to prevent incorrect
# naming (e.g. "protein" for precursor indices for custom features)
# obs = obs.rename_axis(index=sample_column) if sample_column is not None else obs
# var = var.rename_axis(index=feature_column) if feature_column is not None else var
# Assemble to anndata
return ad.AnnData(X=pivot_psm_report.values, obs=obs, var=var)
Usage - Interact with anndata reader¶
We can explore the behaviour of this reader with some sample data, in this case an DiaNN PSM report. The PSM report contains data from 5 HeLa cell digests that was generated from a label-free DIA run on an Orbitrap Astral.
[3]:
def get_alphadia_example(output_dir: Optional[str] = None) -> str:
"""Get example data for the tutorial
The function downloads an example alphadia v2.0.0 PSM report (.parquet) and stores it
in `output_dir`, or, alternatively in a temporary directory
Parameter
---------
output_dir
Output directory. If `None`, creates a temporary directory
Returns
-------
File location
Notes
-----
File size (parquet): ca. 50 MB
References
----------
> Wallmann, G., Skowronek, P., Brennsteiner, V. et al. AlphaDIA enables DIA transfer learning for feature-free proteomics.
Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02791-w
"""
EXAMPLE_URL = "https://datashare.biochem.mpg.de/s/PjjGzy5KpREzntj"
if output_dir is None:
from tempfile import tempdir
output_dir = tempdir
downloader = DataShareDownloader(url=EXAMPLE_URL, output_dir=output_dir)
return downloader.download()
[4]:
# Download example data
alphadia_path = get_alphadia_example()
/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia_hela_precursors.parquet already exists (54.18055438995361 MB)
Per default, the function is able to read protein-level information from the PSM report. We see that alphadia identified 9880 distinct protein groups in the data
[5]:
protein_groups = read_psm(
alphadia_path,
reader_type="alphadia_parquet",
feature_level="protein"
)
protein_groups
[5]:
AnnData object with n_obs × n_vars = 6 × 9880
If we, instead of using the default settings, pass the custom column that identifies precursors and precursor intensities, we can also read in precursor-level data into an anndata.AnnData object. Note that this requires users to specify the (search engine-specific) column names for precursor identifies (=features) and their intensity values. In the given dataset, alphadia v2.0.0 identified 117615 distinct precursors
[6]:
precursors = read_psm(
alphadia_path,
reader_type="alphadia_parquet",
feature_level="precursor"
)
precursors
/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
return dispatch(args[0].__class__)(*args, **kw)
[6]:
AnnData object with n_obs × n_vars = 6 × 117615
Usage - Downstream applications¶
We can use the anndata objects for downstream computations. Here, for example, we define a small function that computes the data completeness per sample based on the standardized anndata structure. In contrast to working with PSM tables directly, this function is applicable to data derived from any PSM report, as the data structure has been streamlined
[7]:
def compute_missing_value_proportion(adata: ad.AnnData, sample_column: Optional[str] = None) -> pd.DataFrame:
"""Compute proportion of missing values per sample"""
index = adata.obs_names if sample_column is None else adata.obs[sample_column]
return pd.DataFrame(
np.isnan(adata.X).sum(axis=1) / adata.n_vars,
index=index, columns=["proportion_missing_values"]
)
[8]:
compute_missing_value_proportion(protein_groups)
[8]:
| proportion_missing_values | |
|---|---|
| raw_name | |
| 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01 | 0.008603 |
| 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02 | 0.008704 |
| 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03 | 0.009615 |
| 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01 | 0.008907 |
| 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02 | 0.012753 |
| 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03 | 0.009312 |
[9]:
compute_missing_value_proportion(precursors)
[9]:
| proportion_missing_values | |
|---|---|
| raw_name | |
| 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_01 | 0.139344 |
| 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_02 | 0.137168 |
| 20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_03 | 0.138265 |
| 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_01 | 0.140764 |
| 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_02 | 0.162105 |
| 20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_03 | 0.135178 |
Bind precursor and protein data in a single container¶
We can use the multi-modal extension of anndata, called mudata to store protein and precursor data in a single container
[10]:
mdata = md.MuData(
{"proteins": protein_groups, "precursors": precursors}
)
mdata
/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/mudata/_core/mudata.py:1598: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
self._update_attr("var", axis=0, join_common=join_common)
/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/mudata/_core/mudata.py:1461: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
self._update_attr("obs", axis=1, join_common=join_common)
[10]:
MuData object with n_obs × n_vars = 6 × 127495
2 modalities
proteins: 6 x 9880
precursors: 6 x 117615Conclusion¶
In this tutorial, we demonstrated how alphabase can serve as a bridge between proteomics search engine outputs and the anndata structure—an efficient, standardized format widely adopted in the scverse ecosystem. By converting PSM reports into AnnData objects, developers can seamlessly integrate proteomics data with a growing suite of tools for analysis, visualization, and downstream modeling.
We implemented a very similar function in our scverse-compatible analysis package alphapepttools