{ "cells": [ { "cell_type": "markdown", "id": "64042a96", "metadata": {}, "source": [ "# Developer Tutorial: Proteomics meets the scverse\n", "\n", "This tutorial showcases how `alphabase` can be used to interface proteomics search engine outputs and the [scverse](https://scverse.org), a python-centric software ecosystem that implements various tools for the analysis of (single-cell) omics data. \n", "\n", "The central data structure to represent and store omics data in the scverse is the [anndata object](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html), an optimized data container for the representation of high-dimensional experimental data. \n", "\n", "`alphabase`s core functions can be used by developers to easily create `anndata` objects from search engine outputs, more specifically peptide-spectrum-matches (see the tutorial on [PSM readers](../nbs/psm_readers.ipynb)). `anndata` can then interface with various software tools for downstream tasks, of which many might also be applicable to proteomics data. In this use case, **alphabase acts as an adapter**, that provides an interface between many search engines upstream and the various software tools downstream" ] }, { "cell_type": "markdown", "id": "2c61ba56", "metadata": {}, "source": [ "## Load libraries\n", "\n", "Note that you'll have to have `anndata` and `mudata` installed in your software environment, either by installing it next to your `alphabase` installation \n", "\n", "```shell\n", "pip install alphabase anndata mudata\n", "```\n", "\n", "or by installing\n", "\n", "```shell\n", "pip install \"alphabase[docs]\"\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "30e73c86", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/lucas-diedrich/Documents/Projects/alphaX/alphabase/alphabase/alphabase/tools/data_downloader.py:4: DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13\n", " import cgi\n", "/Users/lucas-diedrich/Documents/Projects/alphaX/alphabase/alphabase/alphabase/tools/data_downloader.py:20: ImportWarning: Dependency 'progressbar' not installed. Download progress will not be displayed.\n", " warnings.warn(\n" ] } ], "source": [ "import warnings\n", "from typing import Literal, Optional\n", "\n", "import anndata as ad\n", "import mudata as md\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from alphabase.psm_reader import psm_reader_provider\n", "from alphabase.psm_reader.keys import PsmDfCols\n", "from alphabase.tools.data_downloader import DataShareDownloader" ] }, { "cell_type": "markdown", "id": "8973fbc9", "metadata": {}, "source": [ "## Implement anndata reader\n", "\n", "Here, we implement an anndata reader that can read the output of multiple search engines into the standardized anndata object" ] }, { "cell_type": "markdown", "id": "fab47a55", "metadata": {}, "source": [ "### Walk through \n", "\n", "\n", "#### Load file into unified structure\n", "\n", "The first step is to load the PSM table into a standardized `pandas.DataFrame`. As described in the [PSM reader tutorial](../nbs/psm_readers.ipynb), alphabase abstracts away search engine-specific formats and harmonizes column names into a consistent schema.\n", "This is accomplished by selecting the appropriate reader for your search engine:\n", "\n", "```\n", "reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)\n", "```\n", "\n", "Then, load the file:\n", "\n", "```python\n", "psm_report = reader.load(file_path)\n", "```\n", "\n", "The resulting dataframe will include standardized column names such as:\n", "\n", "- **intensity** – protein intensity (or via custom mapping, the precursor intensity)\n", "- **proteins** – feature ID\n", "- **raw_name** – sample/run identifier\n", "\n", "**✅ Tip**: You don’t need to worry about the original column names—alphabase maps them automatically based on the search engine.\n", "\n", "Note that `alphabase` standardizes column names from different search engines so that we only need to define a simple dictionary lookup that maps the feature-level specific columns (sample id, feature id, intensity) to support 3 different feature levels in our reader function across all supported search engines:\n", "\n", "```\n", "FEATURE_LEVEL_MAPPING = {\n", " \"proteins\": {\"sample_id\": PsmDfCols.RAW_NAME, \"feature_id\": PsmDfCols.PROTEINS, \"intensity\": \"PsmDfCols.INTENSITY\"}, \n", " \"peptides\": ..., \n", " \"precursors\": ...\n", "}\n", "```\n", "\n", "#### Pivot to wide format \n", "\n", "PSM tables are typically in long format—each row corresponds to a single precursor-spectrum match. To create an AnnData object, we need a matrix with samples as rows and features (e.g., proteins or peptides) as columns.\n", "\n", "We use `pandas.pivot_table` to reshape the data:\n", "\n", "```\n", " pivot_psm_report = pd.pivot_table(\n", " psm_report,\n", " index=PsmDfCols.RAW_NAME,\n", " columns=PsmDfCols.PROTEINS,\n", " values=PsmDfCols.INTENSITY,\n", " aggfunc=\"first\",\n", " fill_value=np.nan,\n", " )\n", "```\n", "\n", "💡 Protein-level columns in PSM tables can be redundant—many precursors will point to the same protein and the PSM report stores the same intensity values for all redundant protein groups. We use `aggfunc=\"first\"` to select the first observed intensity for a given `(sample, feature)` pair\n", "\n", "💡 Since some identified peptide sequences can match multiple proteins (such as isoforms or homologues), proteomics search engines typically handle this ambiguity by grouping these proteins into *protein groups* as features.\n", "\n", "💡 Here, we fill missing values with `NaN`, following standard practice in proteomics.\n", "\n", "\n", "#### Build anndata object\n", "Now that we have a wide-format matrix of intensities, creating the AnnData object is straightforward:\n", "\n", "```python\n", "adata = ad.AnnData(\n", " X=pivot_psm_report.values,\n", " obs=pivot_psm_report.index.to_frame(index=False),\n", " var=pivot_psm_report.columns.to_frame(index=False)\n", ")\n", "```\n", "In the implementation below, we slightly modify this logic as we need to account for the customization" ] }, { "cell_type": "markdown", "id": "6aa4cb37", "metadata": {}, "source": [ "### Implementation\n", "\n", "The complete implementation might look a bit like this:" ] }, { "cell_type": "code", "execution_count": 2, "id": "c955d8fe", "metadata": {}, "outputs": [], "source": [ "FEATURE_LEVEL_MAPPING = {\n", " \"protein\": \n", " {\n", " \"sample_id\": PsmDfCols.RAW_NAME,\n", " \"feature_id\": PsmDfCols.PROTEINS,\n", " \"intensity\": PsmDfCols.INTENSITY,\n", " },\n", " \"peptide\": \n", " {\n", " \"sample_id\": PsmDfCols.RAW_NAME,\n", " \"feature_id\": PsmDfCols.SEQUENCE,\n", " \"intensity\": PsmDfCols.PEPTIDE_INTENSITY,\n", " },\n", " \"precursor\": \n", " {\n", " \"sample_id\": PsmDfCols.RAW_NAME,\n", " \"feature_id\": PsmDfCols.PRECURSOR_ID,\n", " \"intensity\": PsmDfCols.PRECURSOR_INTENSITY,\n", " }\n", "}\n", "\n", "def read_psm(\n", " file_path: str,\n", " reader_type: str,\n", " feature_level: Literal[\"protein\", \"peptide\", \"precursor\"] = \"protein\",\n", " *,\n", " intensity_column: Optional[str] = None,\n", " feature_column: Optional[str] = None,\n", " sample_column: Optional[str] = None,\n", " custom_column_mapping: Optional[dict[str, str]] = None,\n", " **reader_kwargs,\n", ") -> ad.AnnData:\n", " \"\"\"Convert a PSM table to an anndata object that stores the feature intensities\n", "\n", " Parameters\n", " ----------\n", " file_path\n", " Path to file\n", " reader_type\n", " Type of search engine output. Must be one of the implemented readers\n", " (see: `alphabase.psm_reader.psm_reader_provider.reader_dict.keys()`)\n", " intensity_column\n", " Name of the column storing intensity data. Default to `intensity` key `psm_reader.yaml`\n", " feature_column\n", " Name of the column storing proteins ids. Defaults to `proteins` key in `psm_reader.yaml`\n", " sample_column\n", " Name of the column storing raw (or run) name. Defaults to `raw_name` key in `psm_reader.yaml`\n", " **reader_kwargs\n", " Passed to :meth:`alphabase.psm_reader.psm_reader_provider.get_reader`\n", "\n", " Returns\n", " -------\n", " :class:`anndata.AnnData`\n", "\n", " with\n", " - .X: Intensities as specified in `intensity_column`\n", " - .obs: Empty `.obs` dataframe with sample name (run) as index\n", " - .var: `.var` dataframe with search engine-specific metadata on features as values\n", "\n", " Raises\n", " ------\n", " warning\n", " For redundant features\n", " \"\"\"\n", " # Get correct reader\n", " reader = psm_reader_provider.get_reader(reader_type, **reader_kwargs)\n", "\n", " if custom_column_mapping:\n", " reader.add_column_mapping(custom_column_mapping)\n", "\n", " # Read file\n", " psm_report = reader.load(file_path)\n", "\n", " sample_column = FEATURE_LEVEL_MAPPING.get(feature_level).get(\"sample_id\") if sample_column is None else sample_column\n", " feature_column = FEATURE_LEVEL_MAPPING.get(feature_level).get(\"feature_id\") if feature_column is None else feature_column\n", " intensity_column = FEATURE_LEVEL_MAPPING.get(feature_level).get(\"intensity\") if intensity_column is None else intensity_column\n", "\n", " # Pivot from long to wide format\n", " # The psm report is oriented in a long format, while the count table has a wide format\n", " # Thus, we pivot the psm report so that it has the shape (samples x features)\n", " pivot_psm_report = pd.pivot_table(\n", " psm_report,\n", " index=sample_column,\n", " columns=feature_column,\n", " values=intensity_column,\n", " aggfunc=\"first\",\n", " fill_value=np.nan,\n", " )\n", "\n", " obs = pd.DataFrame(index=pivot_psm_report.index)\n", " var = pd.DataFrame(index=pivot_psm_report.columns)\n", "\n", " # Use custom names instead of streamlined names for custom columns to prevent incorrect\n", " # naming (e.g. \"protein\" for precursor indices for custom features)\n", " # obs = obs.rename_axis(index=sample_column) if sample_column is not None else obs\n", " # var = var.rename_axis(index=feature_column) if feature_column is not None else var\n", "\n", " # Assemble to anndata\n", " return ad.AnnData(X=pivot_psm_report.values, obs=obs, var=var)" ] }, { "cell_type": "markdown", "id": "cbe24c2e", "metadata": {}, "source": [ "### Usage - Interact with anndata reader\n", "\n", "We can explore the behaviour of this reader with some sample data, in this case an `DiaNN` PSM report. The PSM report contains data from 5 HeLa cell digests that was generated from a label-free DIA run on an Orbitrap Astral. " ] }, { "cell_type": "code", "execution_count": 3, "id": "8fdfa3bf", "metadata": {}, "outputs": [], "source": [ "def get_alphadia_example(output_dir: Optional[str] = None) -> str:\n", " \"\"\"Get example data for the tutorial\n", "\n", " The function downloads an example alphadia v2.0.0 PSM report (.parquet) and stores it\n", " in `output_dir`, or, alternatively in a temporary directory\n", "\n", " Parameter\n", " ---------\n", " output_dir\n", " Output directory. If `None`, creates a temporary directory\n", "\n", " Returns\n", " -------\n", " File location\n", "\n", " Notes\n", " -----\n", " File size (parquet): ca. 50 MB\n", "\n", " References\n", " ----------\n", " > Wallmann, G., Skowronek, P., Brennsteiner, V. et al. AlphaDIA enables DIA transfer learning for feature-free proteomics. \n", " Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02791-w\n", "\n", " \"\"\"\n", " EXAMPLE_URL = \"https://datashare.biochem.mpg.de/s/PjjGzy5KpREzntj\"\n", "\n", " if output_dir is None:\n", " from tempfile import tempdir\n", "\n", " output_dir = tempdir\n", "\n", " downloader = DataShareDownloader(url=EXAMPLE_URL, output_dir=output_dir)\n", "\n", " return downloader.download()" ] }, { "cell_type": "code", "execution_count": 4, "id": "a58b796b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/var/folders/py/838_q5nd6594y27wbrpkhl3h0000gn/T/alphadia_hela_precursors.parquet already exists (54.18055438995361 MB)\n" ] } ], "source": [ "# Download example data\n", "alphadia_path = get_alphadia_example()" ] }, { "cell_type": "markdown", "id": "e4b450e7", "metadata": {}, "source": [ "Per default, the function is able to read protein-level information from the PSM report. We see that `alphadia` identified 9880 distinct protein groups in the data" ] }, { "cell_type": "code", "execution_count": 5, "id": "ae56a05f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 6 × 9880" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "protein_groups = read_psm(\n", " alphadia_path,\n", " reader_type=\"alphadia_parquet\",\n", " feature_level=\"protein\"\n", ")\n", "protein_groups" ] }, { "cell_type": "markdown", "id": "aa0b53d8", "metadata": {}, "source": [ "If we, instead of using the default settings, pass the custom column that identifies precursors and precursor intensities, we can also read in precursor-level data into an `anndata.AnnData` object. Note that this requires users to specify the (search engine-specific) column names for precursor identifies (=features) and their intensity values. In the given dataset, `alphadia v2.0.0` identified `117615` distinct precursors" ] }, { "cell_type": "code", "execution_count": 6, "id": "e7c32472", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.\n", " return dispatch(args[0].__class__)(*args, **kw)\n" ] }, { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 6 × 117615" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "precursors = read_psm(\n", " alphadia_path,\n", " reader_type=\"alphadia_parquet\",\n", " feature_level=\"precursor\"\n", ")\n", "precursors" ] }, { "cell_type": "markdown", "id": "3fb1acfa", "metadata": {}, "source": [ "## Usage - Downstream applications" ] }, { "cell_type": "markdown", "id": "5059a64f", "metadata": {}, "source": [ "We can use the anndata objects for downstream computations. Here, for example, we define a small function that computes the data completeness per sample based on the standardized `anndata` structure. In contrast to working with PSM tables directly, this function is applicable to data derived from any PSM report, as the data structure has been streamlined" ] }, { "cell_type": "code", "execution_count": 7, "id": "3531d1e9", "metadata": {}, "outputs": [], "source": [ "def compute_missing_value_proportion(adata: ad.AnnData, sample_column: Optional[str] = None) -> pd.DataFrame:\n", " \"\"\"Compute proportion of missing values per sample\"\"\"\n", " index = adata.obs_names if sample_column is None else adata.obs[sample_column]\n", " \n", " return pd.DataFrame(\n", " np.isnan(adata.X).sum(axis=1) / adata.n_vars,\n", " index=index, columns=[\"proportion_missing_values\"]\n", " )" ] }, { "cell_type": "code", "execution_count": 8, "id": "c1354ca7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
proportion_missing_values
raw_name
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_010.008603
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_020.008704
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_030.009615
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_010.008907
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_020.012753
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_030.009312
\n", "
" ], "text/plain": [ " proportion_missing_values\n", "raw_name \n", "20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.008603\n", "20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.008704\n", "20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.009615\n", "20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.008907\n", "20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.012753\n", "20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.009312" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compute_missing_value_proportion(protein_groups)" ] }, { "cell_type": "code", "execution_count": 9, "id": "16784572", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
proportion_missing_values
raw_name
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_010.139344
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_020.137168
20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_after_030.138265
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_010.140764
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_020.162105
20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min_F-40_iO_before_030.135178
\n", "
" ], "text/plain": [ " proportion_missing_values\n", "raw_name \n", "20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.139344\n", "20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.137168\n", "20231023_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.138265\n", "20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.140764\n", "20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.162105\n", "20231024_OA3_TiHe_ADIAMA_HeLa_200ng_Evo01_21min... 0.135178" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compute_missing_value_proportion(precursors)" ] }, { "cell_type": "markdown", "id": "a1ed8e46", "metadata": {}, "source": [ "## Bind precursor and protein data in a single container\n", "\n", "We can use the multi-modal extension of `anndata`, called `mudata` to store protein and precursor data in a single container" ] }, { "cell_type": "code", "execution_count": 10, "id": "70ec043f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/mudata/_core/mudata.py:1598: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.\n", " self._update_attr(\"var\", axis=0, join_common=join_common)\n", "/Users/lucas-diedrich/mamba/envs/alphabase/lib/python3.12/site-packages/mudata/_core/mudata.py:1461: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.\n", " self._update_attr(\"obs\", axis=1, join_common=join_common)\n" ] }, { "data": { "text/html": [ "
MuData object with n_obs × n_vars = 6 × 127495\n",
       "  2 modalities\n",
       "    proteins:\t6 x 9880\n",
       "    precursors:\t6 x 117615
" ], "text/plain": [ "MuData object with n_obs × n_vars = 6 × 127495\n", " 2 modalities\n", " proteins:\t6 x 9880\n", " precursors:\t6 x 117615" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdata = md.MuData(\n", " {\"proteins\": protein_groups, \"precursors\": precursors}\n", ")\n", "\n", "mdata" ] }, { "cell_type": "markdown", "id": "beba56ec", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this tutorial, we demonstrated how `alphabase` can serve as a bridge between proteomics search engine outputs and the `anndata` structure—an efficient, standardized format widely adopted in the scverse ecosystem. By converting [PSM reports](../nbs/psm_readers.ipynb) into AnnData objects, developers can seamlessly integrate proteomics data with a growing suite of tools for analysis, visualization, and downstream modeling.\n", "\n", "We implemented a very similar function in our scverse-compatible analysis package [alphapepttools](https://github.com/MannLabs/alphapepttools)" ] } ], "metadata": { "kernelspec": { "display_name": "alphabase", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }