TopoMetry systematically learns and evaluates the latent geometry of single-cell data

  1. David Sidarta-Oliveira (corresponding author)
  2. Ana I Domingos
  3. Licio A Velloso (corresponding author)
  1. University of Oxford, United Kingdom
  2. University of Campinas, Brazil
5 figures and 2 additional files

Figures

Figure 1 with 1 supplement
A framework for single-cell geometric analysis.

(a) Schematic overview of the TopoMetry algorithm. From an input single-cell dataset (e.g. normalized and scaled scRNAseq data), TopoMetry builds a kNN graph, which is used to learn manifold-adaptive similarities with a decay-adaptive kernel suitable for constructing Laplacian-type and diffusion operators. After estimation of intrinsic dimensionality, these operators are decomposed into a spectral scaffold with up to hundreds of components that jointly explain all the underlying geometry of the dataset. The spectral scaffolds are used to learn refined Laplacian-type and diffusion operators of the scaffolds themselves, encoding ‘the geometry of the geometry’. The scaffolds and operators constitute key TopoMetry outputs and can be utilized for downstream tasks, such as clustering, visualization, imputation, evaluation, and diagnostics, in a geometry-aware manner. TopoMetry utilities include (b) estimation of local intrinsic dimensionality, (c) filtering of categorical signals, and (d) imputation and denoising. Crucially, TopoMetry introduces the visualization of manifold diagnostics (e) for single-cell data, in which distortions induced by 2-D embeddings can be identified and investigated from a local, global, and contraction/expansion perspective.
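The pipeline in panel (a) (kNN graph, adaptive kernel, diffusion operator, eigendecomposition) can be illustrated with a small NumPy sketch. This is not TopoMetry's implementation or API. The function name `diffusion_scaffold`, the brute-force neighbor search, and the choice of the k-th neighbor distance as the adaptive bandwidth are all assumptions made for illustration, in the spirit of a generic diffusion-map construction.

```python
import numpy as np

def diffusion_scaffold(X, k=10, n_components=3):
    """Toy spectral scaffold (hypothetical helper, not the TopoMetry API)."""
    n = X.shape[0]
    # Brute-force pairwise squared distances; only suitable for small n.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    np.fill_diagonal(d2, 0.0)
    d = np.sqrt(d2)
    # k nearest neighbors of each point, skipping the point itself (column 0).
    order = np.argsort(d, axis=1)[:, 1:k + 1]
    # Assumed adaptive bandwidth: distance to the k-th nearest neighbor.
    sigma = d[np.arange(n), order[:, -1]]
    # Gaussian-like kernel evaluated only on kNN edges.
    rows = np.repeat(np.arange(n), k)
    cols = order.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / (sigma[rows] * sigma[cols]))
    W = np.maximum(W, W.T)  # symmetrize the graph
    # Symmetric normalization S = D^-1/2 W D^-1/2 shares eigenvalues with
    # the row-stochastic diffusion operator P = D^-1 W.
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))
    evals, evecs = np.linalg.eigh(S)
    idx = np.argsort(evals)[::-1]
    evals, evecs = evals[idx], evecs[:, idx]
    # Map back to right eigenvectors of P and drop the trivial first one.
    psi = evecs / np.sqrt(deg)[:, None]
    return evals[1:n_components + 1], psi[:, 1:n_components + 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
evals, psi = diffusion_scaffold(X, k=10, n_components=3)
print(evals.shape, psi.shape)
```

In this sketch the returned eigenvectors play the role of scaffold components; real single-cell data would need an approximate nearest-neighbor search and sparse matrices rather than dense arrays.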

Figure 1—figure supplement 1
Schematic overview of the current standard and the proposed framework for single-cell analysis.
Figure 2 with 1 supplement
Geometry preservation benchmark.

(a) Schematic representation of the benchmark workflow, in which a corpus composed of 68 scRNA-seq datasets was collected, preprocessed, and analyzed with (i) the current PCA→UMAP standard, (ii) standalone UMAP (graph from high-dimensional gene expression space), (iii) scVI (a popular tool for variational inference), and (iv) TopoMetry. (b) Violin plots representing geometry-preservation metrics for lower-dimensional latent spaces learned with PCA and scVI, compared to TopoMetry’s spectral scaffolds. TopoMetry’s scaffolds achieved systematically higher scores across all metrics. (c) Violin plots representing geometry-preservation metrics for 2-D visualizations obtained with the evaluated methods. Except for PaCMAP on TopoMetry’s multiscale spectral scaffold, the geometry-aware visualizations achieved systematically higher scores. Visualizations based on scVI and PCA latent space presented the lowest scores.
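The caption does not name the geometry-preservation metrics used in the benchmark. As a stand-in, one common way to score how well a low-dimensional representation keeps local structure is the average overlap between each point's k nearest neighbors in the original space and in the embedding. The sketch below assumes that metric; `knn_indices` and `knn_preservation` are hypothetical helper names.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row (brute force)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # never count a point as its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def knn_preservation(X_high, X_low, k=10):
    """Mean fraction of shared k nearest neighbors, between 0 and 1."""
    hi = knn_indices(X_high, k)
    lo = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(hi, lo)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
# Keeping only the first two coordinates is a deliberately lossy "embedding".
print(knn_preservation(X, X[:, :2], k=10))
```

A score of 1 means every neighborhood is preserved exactly; scores near k/n are what a random embedding would achieve.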

Figure 2—figure supplement 1
Systematic evaluation of PCA-explained variance across single-cell datasets.

(a) PCA scores in geometry-preservation metrics correlate with its total explained variance across datasets. (b) PCA total explained variance across scRNA-seq datasets depends on the selection of highly variable genes (HVGs), decreases with larger numbers of genes, and is low at default settings. (c) ‘Scree’ plot of singular values, showing a stable ‘elbow point’ at 30–50 PCs (higher with larger numbers of HVGs). (d) Cumulative percentage of explained variance across datasets and number of PCs, highlighting that the poor performance of PCA in scRNA-seq data cannot be attributed to an insufficient number of PCs. (e) PCA performance improves as the number of cells increases, particularly when considering larger numbers of HVGs.
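The cumulative explained variance shown in panel (d) can be computed directly from the singular values of the centered data matrix. The following is a minimal sketch on synthetic data, not the authors' benchmark code; the function name `cumulative_explained_variance` is an assumption.

```python
import numpy as np

def cumulative_explained_variance(X, n_pcs):
    """Cumulative fraction of total variance captured by the first n_pcs PCs."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    ratio = s**2 / np.sum(s**2)              # per-component variance fraction
    return np.cumsum(ratio[:n_pcs])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # stand-in for a cells-by-genes matrix
print(cumulative_explained_variance(X, n_pcs=10))
```

Because the fractions are normalized by the total variance of all features passed in, the result depends on how many genes are retained, which is the dependence on HVG selection described in panel (b).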

Figure 3 with 1 supplement
Inferring cellular lineages with TopoMetry.

(a) TopoMAP and (b) PCA→UMAP visualizations of the Pancreas dataset showing cellular developmental trajectories in the murine pancreas, colored by original cell type annotations. (c) TopoMAP visualization, colored by inferred scores of different phases of the cell cycle, and (d) the predicted phase for each cell with RNA velocity overlay. Note how RNA velocity trajectories largely agree with the identified cell cycle structure and the represented geometry. (e) PCA→UMAP visualization of the Mouse Organogenesis Cell Atlas (MOCA), comprising ~1.3 million cells collected during murine embryo development, colored by refined subtrajectories annotation. (f–g) TopoMAP visualizations of MOCA, colored by original annotations on refined subtrajectories (f), and TopoMetry’s clustering results (g). Note how the TopoMAP embedding successfully separates main and refined trajectories and adds enhanced detail and resolution on the diversity of subpopulations arising during development.

Figure 3—figure supplement 1
The current PCA→UMAP standard fails to represent cell cycle geometry.

(a–b) PCA→UMAP visualizations of the murine pancreas development dataset, colored by inferred scores of different phases of the cell cycle (a), and the predicted cell cycle phase for each cell with RNA velocity overlay (b). The scores, predictions, and RNA velocity are highly suggestive of underlying cell cycle geometry, which is ignored and distorted by the PCA→UMAP visualization. (c) TopoMAP visualizations of the same dataset, colored by the first six components of the spectral scaffold. Note how each component encodes a different aspect of the underlying geometry.

Figure 4 with 3 supplements
TopoMetry unveils unexpected transcriptional diversity of T cells.

Analysis of the pbmc68k dataset, comprising approximately 68,000 peripheral blood mononuclear cells from a healthy donor (10x Genomics). (a) TopoMAP visualization colored by TopoMetry’s clustering results. Main cell types are well separated, and T cells present an unexpectedly high diversity, with approximately a hundred clusters identifying T cell subpopulations. (b) Standard PCA→UMAP visualizations colored by clustering results obtained with the PCA-derived graph (left) and the kNN graph from the high-dimensional gene expression space (right), presenting the same global separation of main cell types but disagreeing on T cells. (c) Matrixplot of the top 3 marker genes found for PCA-based clusters, highlighting the presence of non-specific markers for T cells. (d) Matrixplot of the top 3 marker genes found for TopoMetry clusters, presenting highly specific marker expression. (e) Standalone UMAP visualizations of the same data, colored by PCA-based (left) and kNN-based clustering results (right). Note how the standalone approach detects part, but not all, of the T cell clusters identified with TopoMetry. (f) Contraction/expansion diagnostics of PCA→UMAP (left) and TopoMAP (right) visualizations. Note how the PCA→UMAP approach expands most regions of the cell identity manifold, while TopoMAP contracts the region inhabited by CD4+ T lymphocytes when projecting TopoMetry’s refined graphs to a 2-D space.

Figure 4—figure supplement 1
TopoMetry’s visualizations consistently detect T cell diversity.

(a) Additional TopoMetry visualizations of the pbmc68k dataset (from left to right: MAP on the multiscale scaffold’s operator, PaCMAP on the fixed-time or multiscale scaffold’s operator). (b) TopoMAP visualizations colored by clustering results obtained from the standard PCA-based workflow (left) or with a standalone kNN graph (right). Note how the clustering results from the standalone kNN graph partly agree with TopoMetry’s results and succeed in detecting some of the T cell clusters identified by TopoMetry. (c–d) TopoMAP visualizations colored by Scrublet’s doublet score (c) and by predicted doublets (d), showing that the additional T cell clusters identified by TopoMetry cannot be attributed to doublets.

Figure 4—figure supplement 2
The spectral scaffold of PBMCs.

Panel of the first 40 components of the spectral scaffold learned from the pbmc68k dataset by TopoMetry. Note how the information encoded by the first few components is associated with global class separation and long-range relationships. The encoded information becomes increasingly detailed as the number of components increases. TopoMetry estimates intrinsic dimensionalities and uses an ad hoc eigengap to automatically identify the ideal number of scaffold components to be constructed.
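The caption mentions an ad hoc eigengap without giving the exact rule. A textbook version of the idea keeps the components that precede the largest drop between consecutive eigenvalues. The sketch below shows that generic heuristic only; it is an assumption, not TopoMetry's actual criterion, and `eigengap_cutoff` is a hypothetical name.

```python
import numpy as np

def eigengap_cutoff(eigenvalues):
    """Number of components to keep: those before the largest eigenvalue gap."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    gaps = vals[:-1] - vals[1:]       # drop between consecutive eigenvalues
    return int(np.argmax(gaps)) + 1   # components up to and including the gap

# Three large eigenvalues followed by a sharp drop.
print(eigengap_cutoff([1.0, 0.95, 0.9, 0.3, 0.25, 0.2]))  # → 3
```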

Figure 4—figure supplement 3
Transcriptional diversity of T cells across diseases.

A set of three panels, each containing visualization and clustering results for a dataset of peripheral blood mononuclear cells (PBMCs) using (i) the PCA→UMAP workflow (50 PCs retained), (ii) standalone kNN graph and UMAP (‘on data’), and (iii) TopoMetry, in addition to the marker gene dotplots for the clusters found by (i) and (iii). Across all datasets, the PCA-based results suggest little T cell diversity and present non-specific marker gene signatures, standalone results identify some additional T cell populations, and TopoMetry results detect the full range of T cell clusters with highly specific marker genes. (a) Panel for the Lupus dataset, comprising PBMCs from healthy donors and systemic lupus erythematosus patients. (b) Panel for the Dengue dataset, consisting of PBMCs collected from a dengue fever patient. (c) Panel for the multiple sclerosis dataset, comprising PBMCs and mononuclear cells from cerebrospinal fluid (CSF) samples from healthy donors and multiple sclerosis patients.

Figure 5 with 3 supplements
TopoMetry detects T cell clonal expansion dynamics from gene expression.

Analysis of the ECCITE-TCR dataset, comprising circulating CD8+ T lymphocytes collected from human donors in baseline conditions and after SARS-CoV-2 vaccination or infection. (a) TopoMetry’s default visualizations, colored by TopoMetry’s clustering results. Note how projections derived from the fixed-time scaffold better preserve the local structure of the dataset, while projections derived from the multiscale scaffold better preserve long-range relationships and overall global structure. Despite minor differences, all visualizations correctly represent the cell cycle geometry of proliferating lymphocytes and the transcriptional diversity of central (TCM) and effector memory (TEM) lymphocytes. (b) Matrix plot of the top 3 marker genes found for TopoMetry clusters, presenting highly specific marker expression. (c–f) TopoMAP visualizations colored by original cell type annotations (c), clonal expansion information (d), predicted phase of the cell cycle (e), and contraction/expansion diagnostics (f). Note how the small clusters of TCM and TEM correspond to smaller clonotypes (ranging from small to medium), how the identified cell cycle geometry agrees with cell cycle predictions, and how the former are contracted while the latter are expanded in the 2-D visualization.

Figure 5—figure supplement 1
Additional comparisons on the ECCITE-TCR dataset.

2-D visualizations of the ECCITE-TCR dataset using (a) PCA→UMAP, (b) standalone UMAP (‘pure UMAP’), and (c) a weighted nearest-neighbors (WNN) graph built using both RNA and TCR information, all colored by TopoMetry’s clustering results. Note how the paired TCR–RNA approach succeeds in detecting some of the additional clusters of CD8+ TEM and TCM cells identified by TopoMetry. (d) Matrix plot of the top 3 marker genes found for PCA-based clustering results, with non-specific markers highlighted. (e–f) Paired WNN visualizations colored by original cell type annotations (e) and clonal expansion (f). (g) Same visualizations as in (a–c), colored by predicted cell cycle phase. Note how all visualizations fail to successfully represent the cell cycle geometry.

Figure 5—figure supplement 2
Geometrical properties of clonal expansion dynamics.

(a) Panel of TopoMAP visualizations of the ECCITE-TCR dataset, colored by the first 15 components of the spectral scaffold. Note how the first few components describe the global geometry (e.g. the lineage inference from proliferating cells to mature TEM and TCM and antigen-specific hyperexpanded T cells), while the following components progressively add local information to the scaffold. (b) TopoMAP visualizations, colored by estimated local intrinsic dimensionality (I.D.) using different methods (FSA, MLE) and choices of k-neighbors, with homogeneous estimates across different regions of the manifold. (c) TopoMAP visualizations, colored by the distribution density of clone sizes across the embedding.
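Panel (b) refers to maximum-likelihood (MLE) estimates of local intrinsic dimensionality. A minimal sketch of the Levina-Bickel MLE estimator is given below, using brute-force neighbor distances on synthetic data. It illustrates the general technique only and makes no claim about how TopoMetry implements it; `local_id_mle` is a hypothetical name, and the FSA estimator is not shown.

```python
import numpy as np

def local_id_mle(X, k=10):
    """Levina-Bickel MLE of intrinsic dimensionality at every point."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)                  # exclude self-distances
    knn_d = np.sqrt(np.sort(d2, axis=1)[:, :k])   # distances to k neighbors
    # log(T_k / T_j) for j = 1..k-1, where T_j is the j-th neighbor distance.
    log_ratio = np.log(knn_d[:, -1][:, None] / knn_d[:, :-1])
    return (k - 1) / log_ratio.sum(axis=1)

# A 2-D plane embedded isometrically in 10-D should give estimates near 2.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(500, 2))
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # orthonormal 10x2 basis
X = plane @ Q.T
print(local_id_mle(X, k=20).mean())
```

Per-point estimates are noisy, which is one reason to compare several values of k, as the panel does.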

Figure 5—figure supplement 3
T cell clusters identified exclusively by TopoMetry are associated with specific clonotypes.

Analysis of the T cell compartment of the Tissue Immune Cell Atlas (TICA) dataset, for which both RNA-seq and VDJ-seq data are available. (a–f) TopoMAP visualization of TICA’s T cell compartment, colored by (a) TopoMetry’s clusters; (b) original cell type annotations; (c) origin species of epitopes recognized by each cell’s TCR based on amino acid sequence, with a TopoMetry cluster corresponding to a clonotype cluster that specifically recognizes SARS-CoV-2 antigens; (d) the largest 30 clonotypes detected by TCR amino acid sequence, highlighting that these clonotypes correspond to TopoMetry’s encoded geometry; (e) clone size, highlighting the largest clones and their agreement with the proposed geometry-aware representations; and (f) clonal expansion, showing TopoMetry’s clusters that correspond to hyperexpanded clones. (g–i) Repertoire overlap analysis quantifying the overlap of TCR sequences between clusters across clustering results, considering (g) the original cell type annotations, (h) the results from the standalone kNN graph, and (i) TopoMetry’s clustering results. Note how the overlap can be mitigated by avoiding the use of PCA-based clusters (as in g) and using a standalone graph approach instead (as in h), and how it is almost entirely abolished by the geometry-aware analysis (as in i).
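The caption does not state which overlap measure underlies panels (g–i). As an illustration of the idea, the sketch below computes a pairwise Jaccard index between the sets of clonotypes found in each cluster. The Jaccard choice, the function name `repertoire_overlap`, and the toy labels are assumptions.

```python
from itertools import combinations

def repertoire_overlap(cluster_labels, clonotypes):
    """Jaccard overlap of clonotype sets for every pair of clusters."""
    repertoires = {}
    for label, clone in zip(cluster_labels, clonotypes):
        repertoires.setdefault(label, set()).add(clone)
    overlap = {}
    for a, b in combinations(sorted(repertoires), 2):
        shared = repertoires[a] & repertoires[b]
        union = repertoires[a] | repertoires[b]
        overlap[(a, b)] = len(shared) / len(union)
    return overlap

# Toy example: clusters A and B share clonotype c2, cluster C shares none.
labels = ["A", "A", "B", "B", "C"]
clones = ["c1", "c2", "c2", "c3", "c4"]
print(repertoire_overlap(labels, clones))
```

Under this measure, a clustering whose clusters align with clonotypes yields values near zero for most pairs, which is the pattern the caption describes for panel (i).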

David Sidarta-Oliveira, Ana I Domingos, Licio A Velloso (2026)
TopoMetry systematically learns and evaluates the latent geometry of single-cell data
eLife 13:RP100361.
https://doi.org/10.7554/eLife.100361.3