TopOMetry systematically learns and evaluates the latent dimensions of single-cell atlases

Davi Sidarta-Oliveira; Ana Domingos; Licio A Velloso

doi:10.7554/eLife.100361.1

Introduction

Analyzing high-dimensional data (i.e., data in which the number of variables per observation exceeds hundreds or thousands) is a challenge shared among fields as diverse as signal processing, economics, and structural chemistry - and, more recently, the life sciences, particularly with the development of single-cell experiments. Single-cell assays have inaugurated a new era in the unbiased molecular sampling of the cellular diversity in biological systems, from cell lines to whole organisms, leading to an ongoing revolution across the life sciences^1,2. In such high-dimensional data (hd-data), information regarding cellular phenotype (e.g., gene expression, chromatin accessibility, etc.) is individually gathered for every cell, resulting in matrices containing up to millions of cells and dozens of thousands of variables for each. This creates challenges in data analysis, collectively known as the curse of dimensionality, and requires sophisticated statistical and geometric methods to allow meaningful exploration and understanding of the latent features underlying the data. These are broadly referred to as dimensionality reduction techniques and include matrix decomposition methods, autoencoders, and graph-layout optimization algorithms.

Many dimensionality reduction approaches have been developed and recently reviewed elsewhere^3–5; here, we will briefly describe the strategies behind these algorithms and outline their current use standards in sc-data. Briefly, there are two main ways to perform dimensionality reduction (DR): i) decomposing matrices; and ii) optimizing graph layouts. Matrix decomposition DR methods embed data into an arbitrary number of latent dimensions and include the classic Principal Component Analysis (PCA)^6,7, a linear method that involves the eigendecomposition of the covariance matrix. More recent non-linear dimensionality reduction (NLDR) methods are based on manifold-learning techniques and assume that the input data lie on or near a low-dimensional manifold, which is almost always the case in real-world datasets. In particular, spectral methods (i.e., based on the decomposition of Laplacian-type operators), such as Laplacian Eigenmaps (LE)⁸ and Diffusion Maps (DM)^9,10, have arisen as robust non-linear alternatives, in which the eigendecomposition of Laplacian-type operators^8,9,11–13 yield orthonormal components that encode the latent graph structure of the manifold. The other broad class of DR techniques consists of graph-layout optimization methods. Their approach involves learning a similarity graph from samples, then minimizing a loss function to optimize the layout of this graph in 2 or 3-D for visualization. Graph-layout optimization (GLO) methods include the popular t-Stochastic Neighborhood Embedding (t-SNE)¹⁴, Uniform Manifold Approximation and Projection (UMAP)^15,16, and Pairwise Manifold Approximation and Projection (PaCMAP)¹⁷, Minimum Distortion Embedding (MDE)¹⁸. Approaches based on autoencoders¹⁹ are a recent exception to these two categories.

All these methods attempt to find adequate low-dimensional mappings to high-dimensional data, which are crucial in single-cell analyses. Theoretically, finding latent signals representative of cellular identity is intrinsically related to defining a finite number of cell lineages and types within a given dataset^20,21. Ideally, in such mappings, moving through continuous topologies in the form of connected populations often correspond to differentiation trajectories or stimulus-response processes^21,22, while the space between discrete topologies should indicate the correspondence to different main cell types. The connection between this concept and an empirical view of the Waddington epigenetic landscape is direct and has been recently explored^2,20,21,23. Such mappings are indispensable for downstream tasks in sc-data analysis²⁴, such as clustering, RNA velocity projection^25,26, batch-correction^27,28, trajectory inference²⁹, and pseudotime estimation^21,30. Current practice involves using the first few principal components (a hyperparameter commonly set by the user) to compute neighborhood graphs for downstream tasks, such as clustering and graph-layout optimization (GLO) algorithm (i.g., t-SNE or UMAP) for visualization. The GLO procedure is rarely performed on the full set of genes/features (or highly-variable features); instead, the default approach is to use it with either the top principal components or other latent dimensions learned by an autoencoder as with scvi-tools³¹. Despite their popularity, there is little evidence to support these currently held standards.

Defining which dimensionality reductions are the best, from both a theoretical and practical perspectives, has remained an active challenge when analyzing sc-data, as generating and evaluating the reliability of these mappings is an extremely challenging task. No known ground truth exists, and real-world data are immensely more complex than any toy datasets that could be generated to investigate the suitability of DR techniques. Some attempts to compare different DR methods have been made^5,32,33; however, they often rely on metrics associated with prior clustering results (e.g., the Adjusted Rand Index), thus holding no guarantees that a good-performing method preserves the underlying geometry of the data. Qualitatively assessing the representation of known biological ground truths as visualization landmarks is also challenging, as only a few are known (e.g., the cell cycle as a closed loop); and even when present, they are prone to promote unconscious confirmatory biases and cannot be extended to the entire representation. The current gold standard of experimentally validating some hypotheses drawn from the learned representations cannot be considered a validation of the whole representation per se either. Moreover, each algorithm’s own assumptions (often implicit and by construction) regarding data’s statistical and geometrical properties are also disturbingly challenging to assess in practice. For example, the universally used PCA holds the strong assumption that data is uniformly distributed and lies approximately in a series of hyperplanes (the principal components) by definition, which generates misleading results when faced with non-uniform or non-linear structure by missing local information and introducing linear distortions^{6,14,15,34–37}. The popular UMAP¹⁵, whereas more extensive by considering that the data is dispersed over or close to a locally connected Riemannian manifold, assumes this distribution is uniform across a manifold, with a locally constant metric. As such, representations of data with non-uniform data sampling could be filled with artifacts without any warning to users. As most downstream analysis and biological conclusions drawn from sc-data rely on the lower-dimensional mappings learned by DR, it is crucial for them to be formally guaranteed to represent latent biology. These should be independent of the data distribution or the underlying geometry, which should be considered unknown even for extensively investigated systems. Regardless, this is not the case when using the current standard approach (i.e., using UMAP or t-SNE on the top Principal Components); in fact, its assumptions likely do not hold true for all datasets, in which case artifactual distortions can be introduced in the learned representations. Any inferences made from these representations are critical and will impact downstream analyses; thus raising concerns about the credibility of the vast literature in which this practice was adopted.

Despite the importance of minimizing assumptions regarding the latent geometry underlying high-dimensional data, an approach that fully yields the power of topological Laplacian-type operators is yet to be proposed. These operators have been widely and successfully used in many applications, from computer vision to natural language processing, but not in single-cell biology. They are particularly advantageous because they approximate the Laplace-Beltrami Operator (LBO) at the limit of large data (i.e., when the number of samples is very large, as in sc-data). We consider the extremely loose assumption that data lies on a Riemannian manifold. It is a fundamental result in mathematics that the LBO encodes all the local geometric information of a Riemannian manifold with an unknown shape^8,9,11,36,38. The top eigenfunctions of this operator can then be used to approximate the original data⁹ so that each eigenfunction carries a fundamental harmonic of the data^8,9,13. In addition, because current theoretical biology points to an underlying Waddington landscape within the sampled biological system²⁰, these can be naturally modeled as a Riemannian manifold. We refer to such theoretical manifolds as phenotypic manifolds. We reasoned that recovering such manifolds by approximating the LBO would thus render optimal sc-data representations while holding no prior assumptions regarding the underlying data geometry. Here, we show how combining LBO eigenfunctions with existing projection (GLO) techniques yields representations of high-dimensional data that preserve more original information than those learned with current standards (PCA-based) or with GLO alone. This strategy makes the Riemannian metric³⁶ a natural metric for evaluating the distortion of these visualizations, a most critical feature that the field currently lacks. We present TopOMetry, a python toolkit to combinatorially obtain numerous LBO approximations using different computational approaches and systematically evaluate them from a quantitative and qualitative perspective. We show this strategy is remarkably superior to the currently held PCA-based approach or to using stand-alone GLO implementations when analyzing single-cell data.

Results

A framework to systematically learn and evaluate manifolds in single-cell data

To achieve the task of approximating the LBO to recover the phenotypic manifolds underlying sc-data, we developed TopOMetry (Topologically-Optimized geoMetry), a toolkit that combines and enhances state-of-the-art techniques to approximate the LBO through weighted graph Laplacian-type operators. TopOMetry efficiently obtains the eigenspectra of these operators and uses them to approximate the LBO once again. Users can then visualize its graph representation with any existing graph-layout optimization (GLO) algorithms. This modular approach allows combinatorially evaluation of different options of kernels, decomposition, post-processing, and GLO algorithms. It eliminates the need to choose a single approach a priori and allows users to decide which of the possible representations better describe each dataset objectively.

TopOMetry takes as input a data matrix of variables per observation (e.g., genes per cell; Figure 1A). It then computes a k-nearest-neighbors graph and uses it to build a kernel matrix with state-of-the-art similarity functions (the default adaptive bandwidth estimation is related to the local intrinsic dimensionality). TopOMetry includes a new family of manifold-adaptive kernels closely related to the intrinsic dimensionalities of the underlying manifold³⁹. These kernels are then used to learn Laplacian-type operators, which asymptotically approximate the LBO (Figure 1B). Next, the eigenspectra of the learned Laplacian-type operators are obtained, allowing the estimation of the data global intrinsic dimensionality (Figure 1C). The resulting orthonormal eigenbasis is related to a Fourier basis and encodes all latent signals present in the original data^8,9,13. Such eigenbasis can then be used to compute a new weighted kernel (referred to as graph kernel) to obtain a second LBO approximation encoded as a similarity graph (the associated topological graph; Figure 1D), which can then be visualized with existing GLO algorithms (Figure 1E). These graphs can also be used for downstream tasks such as clustering, signal filtering, pseudotime estimation and others (Figure 1F). In TopOMetry, dozens of visualizations obtained from the graphs or directly from the eigenbasis can be obtained with a single line of code. These representations’ preservation of local⁴⁰ and global structure is then systematically evaluated (Figure 1G), alleviating practitioners’ choice of a single similarity-learning or GLO method. The top-performing projections can then be qualitatively evaluated using the Riemannian metric to assess the distortion induced by the mapping (Figure 1G).

Schematic overview of single-cell data analysis with TopOMetry.
(A) Single-cell experiments are preprocessed into high-dimensional matrices of cells and measured variables (e.g., gene expression, protein content, chromatin accessibility) and require dimensional reduction tools for their analysis. Most tools assume unknown aspects of the underlying geometry (e.g., linearity) or distribution (e.g., uniformity). Instead, relying on the Laplace-Beltrami Operator (LBO) and its eigenfunctions allows learning such geometry assuming only the manifold hypothesis. (B) Approximating the LBO involves constructing a k-nearest-neighbors (kNN) graph using adaptive affinity estimation, which can be done through several kernels so to make the affinity graph insensitive to neighborhood densities. The Laplacian-type operators (particularly the anisotropic diffusion operator) are approximations of the LBO. The resulting matrices can be used for several tasks, such as imputation of missing data, signal filtering and interpolation, and graph sparsification and coarsening. (C) The eigendecompositions of the LBO approximations yield eigenvectors and eigenvalues that are weighted or multiscaled to form a new orthogonal eigenbasis. The eigenvalues can be used to estimate an eigengap or spectral gap to estimate the intrinsic dimensionality of the data. The intrinsic dimensionality can also be estimated using neighborhood-based methods (e.g., FSA and MLE) or *ad-hoc* inspection of eigenvectors for discriminative potential. (D) A second neighborhood graph is learned from the eigenbasis, and its Laplacian-type operators are used to obtain a new LBO approximation, rendering ‘topological graphs’. (E) From a spectral initialization obtained from the topological graph, any graph layout optimization (GLO) can be used on the topological graph or the eigenbasis for visualization. (F) Downstream tasks such as clustering, RNA velocity estimation, and imputation can be performed with the learned topological graphs and layouts. (G) The learned eigenbases and visualizations can be evaluated regarding the preservation of global and local structure, and the Riemannian metric can be used to visualize distortions in layouts to aid biological interpretation.

The outlined approach has numerous advantages in comparison to the current standards. First, it holds an extremely loose assumption regarding the underlying data geometry compared to PCA or UMAP (the data lies in or close to a Riemannian manifold). Second, such an assumption is mathematically solid and intrinsically related to the core biological goal in sc-data analysis (i.e., to recover the Waddington landscape or an extended version thereof). Third, it involves the estimation of the manifold local and global dimensionality, which alleviates the similarity-learning process and eliminates the need to guess a number of latent dimensions to keep for downstream analysis (as needed when using PCA or autoencoders). Another major advantage is using the Riemannian metric³⁶ as a “distortion ruler,” allowing qualitative assessment of the distortion present in two-dimensional mappings and the robust estimation of the global and local dimensionality of the underlying manifold. Finally, TopOMetry allows users to obtain and evaluate dozens of possible representations with a single command, eliminating the need to choose a pre-defined method (e.g., UMAP on PCA) and allowing users to pick which performs best for each case. This is the toolkit’s main advantage and represents a fundamental change of paradigm in sc-data analysis: instead of a priori electing a single method to be used at all times across all datasets, it allows users to discover which particular model works best for representing each individual dataset before other downstream tasks. Computationally, TopOMetry is easy-to-use and centered around a single python class, the TopOGraph object, which orchestrates the analyses (Supplementary Figure 1).

Representations learned with TopOMetry preserve local and global structures across single-cell datasets

We first performed a qualitative sanity test on TopOMetry by running it on synthetic toy data (Supplementary Figure 2A). The eigenfunctions of the Laplacian-type operators used in the toolkit pass negative control checks with uniformly distributed noise and square and also positive control checks with normally distributed classes. This is best visualized by coloring sample points by their first eigencomponent or principal component (Supplementary Figure 2B). The representation of uniformly distributed classes as points or straight lines by the first two non-trivial eigenfunctions is related by their encoding of manifold structure (Supplementary Figure 2B), which further highlights the power of this approach to discover the true geometry of the data regardless of the sampling distribution. We also reinforce that this approach can unfold curved manifolds, such as the classic ‘S’-shape and the swiss-roll examples (Supplementary Figure 2B-C), and stress the inadequacy of trying to use PCA to represent non-linear geometries (Supplementary Figure 2B-C). For an additional test on high-dimensional toy data with ground-truth labels, we obtained embeddings from the MNIST handwritten digits dataset with TopOMetry and existing methods (Supplementary Figure 2D).

To determine TopOMetry’s performance in real-world sc-data, we curated a diverse set of 20 single-cell datasets from different biological systems with varying numbers of cells and variables per cell across diseases, species, and technologies (Table 1). The datasets were then processed following current standards in scRNA-seq analysis²⁴: i, library size normalization; ii, logarithmic transformation; iii, selection of highly variable genes (HVG), and iv) scaling (Z-score normalization, i.e., mean centering and variance clipping). For each dataset, we computed the first 100 principal components and obtained two-dimensional (2-D) projections with three existing DR methods: i, t-SNE¹⁴; ii, UMAP¹⁵; iii, PaCMAP¹⁷. Projections were computed using either the principal components (the current standard) or the full data matrices. We then used TopOMetry to obtain 2-D projections using similar graph-layout optimization (GLO) methods. Each visualization was then scored regarding the preservation of the global and local structure. The Mean Reconstruction Error (MRE) was used for quantifying the preservation of the global structure as recently proposed⁴¹, with PCA being a reference by construction (Figure 3A). It has been shown that the preservation of global structure is mainly driven by the initialization given to GLO algorithms⁴², thus making preservation of local structure a more important factor to consider when choosing between options. The trustworthiness score⁴⁰ was used to assess the preservation of local structure (Figure 3B). In addition, we also propose a score based on the correlation of geodesic distances in the high- and low-dimensional spaces (Figure 3C), drawing from previous approaches^33,43; however, its high computational cost makes its application impractical for most real-world datasets.

Public single-cell RNA-seq datasets that were used in this study benchmark.

Across all datasets, the eigenbases (Figure 2D) and 2-D representations (Figure 2E) that yielded the highest local scores (LS) for all datasets were generated with TopOMetry. Global scores (GS) were as high as PCA for TopOMetry’s eigenbases (Supplementary Figure 3A) and discretely lower for projections when using TopOMetry compared to GLO on data or top principal components on some datasets, with a maximum difference of 0.06 for the AgeingMouseBrain dataset (Supplementary Figure 3B), further suggesting that the preservation of local structure is a more robust metric to discriminate against different visualizations. We found that PCA yielded lower LS for all datasets (Figure 2D), as expected from a global and linear method. Strikingly, we also found that projections derived from using GLO methods on the top principal components had lower LS when compared to using GLO alone (Figure 2E), further suggesting that this practice could be inadequate despite being currently held as a default approach. The opposite happens when using TopOMetry’s Laplacian-type eigenbases to build the neighborhood graphs projected with PaCMAP or the diffusion potential from these eigenbases with MAP, as these approaches achieve higher LS than using GLO alone (Figure 2E). Within TopOMetry, the variation of LS differed mostly across eigenmap methods, with LE scoring lower than DM and msDM, and little to no variation across different kernel methods (Figure 2E). These results show the superiority of TopOMetry over the standard workflow or GLO methods alone, and highlight the value of critically evaluating methods before electing one for the downstream analysis approach.

TopOMetry eigenbases and projections preserve more original structure than the current standards.
(A) Schematic overview of the assessment of global structure in TopOMetry - the global score is calculated by taking the exponential of the difference between an embedding MRE and PCA’s MRE, divided by PCA’s MRE. (B) Schematic overview of the assessment of local structure in TopOMetry with the trustworthiness score - the score penalizes embeddings in which cells that are neighbors in the low-but not in the high-dimensional space. (C) Schematic overview of the assessment of distances preservation in manifolds - the pairwise geodesic distances in the high- and low-dimensional spaces are computed, and Spearman R correlation between the rank of neighbors for each cell is obtained as a score. (D) Annotated heatmap of local scores for TopOMetry’s eigenbases and PCA for 20 single-cell datasets (higher is better). (E) Annotated heatmap of local scores for projections learned using expression data, the first 100 principal components, and TopOMetry’s eigenbases or topological graphs for 20 single-cell datasets (higher is better).

Estimating intrinsic dimensionalities in toy and single-cell data with TopOMetry.
(A) Eigenspectrum of the eigenvalues for each diffusion component learned for the MNIST dataset of handwritten digits with TopOMetry (left) and the absolute value of their first derivatives (right). (B) Distribution of MNIST samples (individual images) across different classes (digits) across diffusion components (top) and principal components (bottom). (C) Histograms of intrinsic dimensionality estimates for each sample in the MNIST dataset, with varying numbers of k-nearest-neighbors, with the FSA (top) and MLE (bottom) methods. (D) Heatmap of Spearman R correlation between FSA and MLE estimates of intrinsic dimensionalities of MNIST images for varying numbers of k-nearest-neighbors. (E) topoMAP projections of a subset of the MNIST handwritten dataset, colored by classes (numbers). The 100 images with the highest (left) or lowest (right) estimates of intrinsic dimensionality are colored black in the projections and shown on the top of each projection. (F) Eigenspectrum of the eigenvalues for each diffusion component learned for the *PBMC3k* dataset of peripheral blood mononuclear cells (10X Genomics) with TopOMetry (left) and the absolute value of their first derivatives (right). (G) Histograms of intrinsic dimensionality estimates for each cell in the *PBMC3k* dataset, with varying numbers of k-nearest-neighbors, with the FSA (top) and MLE (bottom) methods. (H) Heatmap of Spearman R correlation between FSA and MLE estimates of intrinsic dimensionalities of *PBMC3k* cells for varying numbers of k-nearest-neighbors. (I) topoMAP projections of the *PBMC3k* dataset, colored by the estimates obtained with FSA (top) and MLE (bottom) with 100 nearest-neighbors. (J) topoMAP projection of the *PBMC3k* dataset, colored by annotated cell types. (K) Violin plots of estimates of intrinsic dimensionalities of *PBMC3k* cells with FSA (left) and MLE (right) with 100 nearest-neighbors.

Estimating and visualizing intrinsic dimensionalities with TopOMetry

Intrinsic dimensionality (i.d.) can be loosely defined as the minimum number of parameters needed to describe a multiparameter (high-dimensional) system accurately⁴⁴. Many definitions of i.d. have been proposed, including local and global i.d.^39,45, and its accurate estimation remains an active field of research. Estimating the global i.d. of a given dataset is intrinsically related to estimating the dimensionality of its underlying manifold⁴⁶, which is crucial when performing dimensional reduction (i.e., to estimate how many components to use in matrix decomposition). Furthermore, local i.d. is used in many signal processing applications as a separate variable to describe the data and provide meaningful information regarding the manifold structure. Noticeably, the similarity kernels employed in TopOMetry’s are related to the FSA (Farahmand-Szepesvári-Audibert) method, a recently adapted estimator of intrinsic dimensionality³⁹; as they harness the ratio between the distances to median nearest-neighbors (i.e., k/2) and to k-nearest-neighbors as a measure of local sampling density. To our knowledge, no tool has ever been proposed to evaluate the i.d. of sc-data specifically, despite its centrality for learning accurate representations. Likewise, we know of no other similarity kernels straightforwardly related to local i.d. estimates.

TopOMetry uses three existing strategies to estimate the i.d. of the data’s underlying manifold: i) analyzing the eigenspectrum of its Laplacian-type operators; ii) inspecting the eigenfunctions to evaluate the separation of known classes; and iii) using two independent state-of-the-art algorithms for i.d. estimation, FSA³⁹, and MLE⁴⁷. The two first strategies estimate the global i.d. and the latter estimates the local i.d. and weights them to estimate the global i.d. We first demonstrate these strategies using the MNIST dataset, comprising images of handwritten digits. In this dataset, TopOMetry found an eigengap at 19 dimensions using the eigenvalues’ first derivatives (Figure 3A). The corresponding eigenfunctions separate the ground-truth classes well, in contrast to the principal components (Figure 3B). Local i.d. estimates with both FSA and MLE presented nearly identical distributions across a range of possible kNN (Figure 3C), and the global estimates obtained with these algorithms are similar to the global i.d. predicted by the eigengap (FSA: 17-18, MLE: 12-14). Furthermore, the estimates obtained with these algorithms are highly correlated (Figure 3D), reinforcing their reliability. To interrogate whether i.d. estimates translated to actual insights regarding the data, we used a subsample of the dataset and visualized the digits with the highest and lowest i.d. Interestingly, the 100 digits with the highest local i.d. were predominantly 3’s, 5’s, and 8’s written with varying calligraphy, indicating an association between local i.d. and intra-cluster heterogeneity (Figure 3E, left). Conversely, the 100 digits with the lowest local i.d. were predominantly 2’s, 6’s, and 1’s written with consistent calligraphy, further suggesting this association (Figure 3E, right).

We next examined the PBMC 3k dataset following the same approach to test whether these results translated to real-world sc-data. This dataset comprises ∼2,700 peripheral blood mononuclear cells (PBMC) from a healthy human donor. Once again, the global estimate obtained with the eigengap method (72) (Figure 3F) was reasonably similar to those obtained with MLE (122-59) and FSA (122-74) (Figure 3G). Similar to the MNIST dataset, the estimates of local i.d. obtained with MLE and FSA were also highly correlated (Figure 3H) but strikingly varied across different regions of the represented manifold (Figure 3I), which is inhabited by discrete cell types (Figure 3J, K). In this dataset, we found that megakaryocytes, natural killer (NK) T cells, and dendritic cells (DC) presented the highest local i.d. However, different from the MNIST dataset, in which the images can be directly visualized, the biological implications of higher local i.d. for these cell types in the PBMC 3k dataset remain unclear, requiring further investigative strategies. Nevertheless, estimating the local i.d. during exploratory sc-data analysis mains remarkably advantageous, as it: i, allows discovering that distinct cell types inhabiting discrete regions of the phenotypic manifold present strikingly different dimensionalities; ii, indicates that these cell types will present different distortions in representations learned with dimensional reduction; and iii, estimates the number of latent dimensions needed to fully describe the data, which should be between the average and the upper limit of the local i.d. distribution. Defining such a number is a particularly critical issue in sc-data analysis, as most existing approaches try to denoise the data into an arbitrary number of latent dimensions prior to downstream tasks. Within TopOMetry, such estimates are used to guide the number of eigencomponents users should compute for each specific dataset. We expect that estimating the i.d. of sc-data will greatly assist practitioners during its analysis, allowing more accurate representations and providing additional insights regarding its geometry across various methods and tasks.

TopOMetry allows precise manifold learning of small and large-scale datasets

We next aimed to explore how TopOMetry can be useful in inferring cellular development trajectories. For a first demonstration, we obtained scRNAseq data from the developing murine pancreas⁴⁸ popularly used in RNA velocity estimation (Figure 3A) and analyzed it with TopOMetry’s overall top-performing model (Figure 3B, topoMAP for brevity). The representation learned with this model yielded finer-grained lineages when compared to the default PCA-based UMAP. When scoring cells by their predicted cell cycle phase, our model effectively mapped mitotic cells to a closed-loop structure (Figure 3C), which provides a better representation of the cell cycle than the homogenous cloud found with the default approach (Figure 3D). A detailed inspection revealed that the default approach places cells with high S and G2/M scores in an intermediate position of the differentiation lineage apart from the remaining cells undergoing the cell cycle (Figure 3D, marker). Conversely, the same cells are placed close to other cycling cells in the representation learned with TopOMetry (Figure 3C, marker), providing an optimal representation of the global structure of the underlying manifold. These results and the fact that the cells are indeed cycling are confirmed by visualizing RNA velocity vectors on the embeddings (Figure 4E-F). The superiority of our approach over the current standard is further reinforced by the fact that the RNA velocity vector field based on the default approach suggests that epsilon cells give rise to beta cells (Figure 3E). Lineage tracing studies have shown that this is the exception rather than the norm⁴⁹: instead, delta cells represent a transient state that later gives rise to alpha cells, which is faithfully represented by the RNA velocity vector field based on the TopOMetry model (Figure 3F). Finally, it is also shown that the latent dimensions learned by TopOMetry (the multiscaled diffusion components) are associated with particular regions of the underlying manifold (Figure 4G) corresponding to specific cellular states. This effectively reconstructs the manifold’s topology one region at a time (i.e., one identity per component). These components can also be used for downstream tasks, such as inferring changes in gene signatures along several developmental axes.

Inferring cellular lineages in small and large datasets with TopOMetry.
(A) UMAP and (B) topoMAP projections of the *Pancreas* dataset showing cellular developmental trajectories in the murine pancreas, colored by annotated cell type. (C) topoMAP and (D) UMAP projections of the same data, colored by inferred scores of different cell-cycle phases. The same projections are also colored by the predicted cell-cycle phase of each cell in (E) and (F), and by the first five diffusion components in (G). Arrows indicate a population of cells that are classified as mitotic but were misplaced in the middle of the differentiation trajectory by the default PCA-based UMAP projection. (H) PCA and (I) PCA-based UMAP projections of the *Mouse Organogenesis Cell Atlas* (MOCA) showing cellular developmental trajectories of whole mouse embryos, colored by annotated subtrajectories from the original study. (J) PCA-based UMAP and (K) topoMAP projections of the MOCA dataset, colored by development stage, annotated trajectories and subtrajectories, and TopOMetry clustering results using the standard Leiden community-detection algorithm. Note how the topoMAP projection and the clustering results from TopOMetry uncover dozens of neuronal subtrajectories that match the developmental stage from which cells were sampled.

Evaluating distortions in two-dimensional projections with the Riemannian metric.
The Riemannian metric can be used to estimate distortions in two-dimensional visualizations and can be represented by ellipses. If no distortion is present in any preferential direction, the ellipses will have zero eccentricities and correspond to circles. If distortion is present in a preferential direction, the ellipse will be aligned in that direction, and its eccentricity indicates the degree of distortion. (A) Synthetic data comprised of three normally distributed clusters with varying degrees of variance (leftmost), and visualization of the Riemannian metric on two-dimensional visualizations of the data obtained with PCA, UMAP, DM (within TopOMetry), and topoMAP (TopOMetry MAP on DM diffusion potential). (B) Visualization of the Riemannian metric on two-dimensional visualizations of the *PBMC3k* dataset obtained with PCA, UMAP, DM (within TopOMetry), and topoMAP (TopOMetry MAP on DM diffusion potential). (C) topoMAP projections of the *PBMC3k* dataset, colored by the eccentricities of ellipses representing the Riemannian metric across the aforementioned visualizations for each cell. (D) Violin plots of the eccentricities of ellipses representing the Riemannian metric on the aforementioned visualizations of the *PBMC3k* dataset, for each cell type annotation.

Next, we show how TopOMetry can be used for lineage inference in organism-wide cell atlases to learn detailed cellular hierarchies. For this, we harnessed 1.3 million high-quality pre-filtered single-cell transcriptomes from the mouse embryo during organogenesis⁵⁰. Standard processing was performed, and data was then represented with PCA (Figure 3H) or UMAP on the top 300 PCs (Figure 3I and 3J - annotations are from the original study). TopOMetry analysis was then performed using its default model (Figure 3K, topoMAP for brevity), and the Leiden community detection algorithm was employed on its diffusion potential graph. Both approaches successfully mapped major cell types as distinct regions of the manifold without apparent connections between them. However, TopOMetry revealed considerably finer-grained cellular lineages (Figure 3K) than those represented with UMAP on PCA (Figure 3I). In total, we found approximately 380 subpopulations, while only 56 were originally described (and labeled as subtrajectories). The majority of these populations were neuronal populations, which is consistent with the expected diversity of the central system. Some of the populations found when clustering with TopOMetry are also suggested by the UMAP on PCA (Figure 3I) representation, albeit with a less clear trajectory. Hence, it is shown that TopOMetry excels at faithfully representing vast datasets comprising hundreds of coexisting lineages, being ideal for representing whole-organism data and performing lineage inference for its intrinsic relation with the Waddington epigenetic landscape.

TopOMetry augments representations with the Riemannian metric to visualize distortions

A central issue when learning low-dimensional representations from sc-data is the introduction of distortions. Distortion is inevitably introduced when a high-dimensional manifold is embedded into a lower-dimensional space. As an illustrative example, consider the Earth and the planar projections used for making two-dimensional maps (e.g., the Mercator projection): in these maps, there is always some distortion, which can involve distances between points or shapes, angles, and areas within the projection. However, while Earth’s true geometry can be physically visualized from space, the epigenetic manifolds²⁰ of phenotypic identities from which sc-data is sampled are unknown and cannot be visually assessed, which makes it extremely challenging to qualitatively evaluate how the low-dimensional mapping distorted them. The theoretical model of cellular diversity as Riemannian manifolds proposed here allows a direct connection with a unique method for evaluating these distortions: expressing the Riemannian metric of the original manifold³⁶ in any given 2-D mapping. The Riemannian metric (RM) is expressed in 2-D coordinates so that it makes the represented distances and angles approximately equal to their original manifold counterparts. In practice, the RM can be visualized as ellipses whose axes indicate the magnitude and directions in which the manifold was distorted. Visually, these ellipses will have eccentricity equal to one if the distortion is uniform in all directions, rendering a circle with a radius approaching zero in the special case of no distortion (i.e., the representation of a given point fully preserves its geometric properties on the manifold). Such an approach allows users to consider how different regions of the manifold were distorted by the representation and to qualitatively compare them for hypothesis generation, effectively working as a “distortion ruler.”

To illustrate these concepts, we used a toy dataset with three randomly distributed clusters with distinct variances (Figure 4A) and learned their representations with PCA, DM (within TopOMetry), UMAP, and a TopOMetry model (MAP on the diffusion potential of a diffusion map eigenbasis). The RM was then estimated for each embedding and visualized using ellipses. These ellipses present high eccentricity when using the first two principal components, indicating distortion towards the periphery of the embedding and reasonably high eccentricities across all three clusters when using UMAP. In contrast, when using the first two diffusion components learned with TopOMetry, the distortion was found to be negligible for one of the clusters and to happen along with the components themselves for the other two. This was expected, as each cluster is encoded by a separate diffusion component in this scenario. When using the TopOMetry MAP on DM potential model, we found a distortion pattern similar to that observed within UMAP.

Next, to demonstrate its usability in sc-data analysis, we estimated the RM for representations of the classic PBMC 3k dataset learned with PCA, UMAP, UMAP on PCA, and a TopOMetry model (Figure 4B). As represented by the ellipses, each representation differently distorts specific regions of the manifold. To allow for a more uniform comparison, we color-coded the eccentricity of ellipses corresponding to each cell across methods and visualized them within the TopOMetry layout (Figure 4C) and with violin plots (Figure 4D). As observed, the populations associated with the greatest distortion were reasonably consistent across representations; CD4 T cells, CD14+ monocytes, B lymphocytes, and NK cells. These findings suggest that all representations will preferentially misrepresent some cell types, often at different fashions when comparing different methods and that this could be related to the intrinsic geometry of the underlying manifold. Albeit illustrative, this example is limited by the small number of cells available. As CD4+ T cells were more favorably represented within the TopOMetry model and are known for their extreme diversity, we asked whether analyzing a larger amount of samples (cells) could harbor additional insights into their biology.

TopOMetry reveals novel transcriptionally defined T cell populations

To further investigate CD4+ T cells and distortions of the underlying manifold, we harnessed a public dataset comprising approximately 68,000 PBMCs from a healthy donor (10X Genomics, PBMC68K dataset). Representations were obtained with UMAP (using either the top 50 or 300 PCs or the highly-variable genes) and TopOMetry models. Cells were then clustered by the Leiden community detection algorithm using weighted neighborhood graphs learned from; i, the top 50 PCs; ii, the top 300 PCs; and iii, TopOMetry’s diffusion potential. We then used CellTypist⁵¹ to predict major cell types and subtypes. Results obtained using 50 PCs were essentially identical to previous reports⁵² (Figure 6A), rendering 24 clusters, 16 being T cells, which have been classically described with poor marker genes across different single-cell studies^51,53–58 (Figure 6B). However, these findings were inconsistent with what we observed when using 300 PCs, in particular for CD4+ T cells (Supplementary Figures 4A-B), further suggesting that these putative clusters could be linear distortions. A similar discrepancy was found when using stand-alone UMAP, which suggested the existance of additional T cell clusters (Supplementary Figure 4C). Strikingly, the results obtained with TopOMetry suggest the existence of 123 clusters in this dataset (Figure 6C), 116 of them being of T cells (Figure 6C, Supplementary Figure 4D), and each presenting specific marker genes (Figure 6D), in sharp contrast to the classic PCA-based approach. The agreement between clustering results obtained with TopOMetry models or with the HVG data and the disparity of these results in confrontation with those found using the PCA-based workflow was quantitatively confirmed using the adjusted mutual information (AMI) score⁵⁹ (Supplementary Figure 4E).

Uncovering the transcriptional diversity of T cells with TopOMetry.
(A) PCA-based UMAP projections of the *PBMC68k* dataset, comprising approximately 68,000 peripheral blood mononuclear cells from a healthy donor (10X Genomics), colored by clustering results obtained with the standard PCA-based approach (left) or with the diffusion potential of TopOMetry msDM eigenbasis (right). The main cell types and the populations for which the two approaches yield similar clustering results are indicated in the figure. Note that the two approaches only reach disagreeing results for T cells. (B) Dotplot of the top marker gene for each cluster found with the standard PCA-based approach, ranked by logistic regression. Note how clusters corresponding to T cells present poor marker genes. (C) topoMAP projection of the *PBMC68k* dataset, colored by clustering results obtained with the standard PCA-based approach (left) or with the diffusion potential of TopOMetry msDM eigenbasis (right). Note that the clusters of T cells found with the former are randomly spread across the dozens of clusters found with the latter. (D) Dotplot of the top marker gene for each cluster found when clustering with TopOMetry, ranked by logistic regression with identical parameters. Note how clusters corresponding to T cells present highly-specific marker genes. (E) Boxplots of the number of T cell clusters (left) found with different clustering strategies (standard PCA-based, using expression data and with TopOMetry) and their mean size (right), across four datasets of peripheral blood mononuclear cells from human donors: *PBMC68k, Dengue, Lupus*, and *MS_CSF*. (F) PCA-based UMAP and (G) topoMAP projections of lymphoid cells from the *Tissue Immune Cell Atlas* (TICA) dataset, comprised of lymphoid cells across different organs and donors, colored by annotations from the original study (left) and clustering results obtained with TopOMetry (right). Note that the two clustering results reach disagreeing results mostly for CD4+ cell clusters. (H) PCA-based UMAP and (I) topoMAP projections of the TICA dataset, colored by clonal expansion thresholds. (J) PCA-based UMAP and (K) topoMAP projections of the TICA dataset, colored by the 30 clonotypes with the highest amount of cells. The arrows indicate the precise match of these clonotypes to the manifold structure uncovered by TopOMetry. (L) PCA-based UMAP and (M) topoMAP projections of the TICA dataset, colored by the epitopes recognized by each T cell. The arrows indicate populations recognizing specific epitopes that precisely match the manifold structure uncovered by TopOMetry. (N) Heatmap of T cell repertoire overlap between the clusters found in the original study, (O) using expression data and (P) TopOMetry. Note how the original clusters present high repertoire overlap, which is less evident in clusters found using expression data and nearly absent from clusters found with TopOMetry.

Regarding the identity of the additional clusters, the majority of the discrepant ones were annotated by CellTypist as T central memory (TCM) cells (Supplementary Figure 4F), although additional clusters corresponding to NK (natural killer) and MAIT (mucosal-associated invariant T) cells were also identified. T regulatory cells were the only T cell subpopulation found across all approaches (Supplementary Figures 4G-I), in contrast to four other putative subpopulations we identified (marked by expression of MDN1, C11orf68, IL11RA and IL21R), which were randomly scattered in the PCA-based UMAP embedding, marginally detectable using stand-alone UMAP and clearly identified using TopOMetry (Supplementary Figures 4G-I) .

To elucidate whether these findings were reproducible across different sets of data and not restricted to this particular example, we harnessed three additional independent scRNAseq datasets of human PBMCs, including healthy controls and patients suffering from dengue virus (DenV, Dengue dataset) infection, multiple sclerosis (MS, MS_CSF dataset) and systemic lupus erythematosus (SLE, Lupus dataset). The analyses performed on the PBMC68K dataset were then applied for these three datasets, with identical parameters. We found strikingly consistent results across all three datasets, in which clustering and UMAP results obtained with 50 PCs were inconsistent with those obtained using the original data (HVGs) or TopOMetry (Supplementary Figures 5A-C). Likewise, specific marker genes were found for the clusters identified with TopOMetry, in contrast to the poorly-defined signatures observed for PCA-derived clusters (Supplementary Figures 5D-F). In average, 18 clusters of T cells were found using 50PC, in contrast to a mean of 73 identified using the HVGs and of 137 identified using TopOMetry (Figure 6E). The average number of cells in these clusters was expressively smaller using HVGs (608) and TopOMetry (320) in comparison to using 50 PCs (2489). In addition, the adjusted mutual information score for different clustering strategies followed a similar pattern to that observed with the PBMC68K dataset, in which clustering results using the expression data are more similar to TopOMetry than to the standard PCA-based workflow (Supplementary Figure 5G). Unexpectadly, different TopOMetry models presented a high AMI score, demonstrating their sucessfull convergence towards the LBO to describe the underlying phenotypic manifold.

To explore whether or not the identified T cell subpopulations were functionally distinct from a T cell receptor perspective, we harnessed an additional dataset - the T cell compartment of the Tissue Immune Cell Atlas⁵¹, comprising scRNA-seq and scVDJ-seq from T cells obtained from multiple human donors and tissues, including blood, that were jointly analyzed using the standard PCA-based workflow (Figure 6F). When analyzing this data with TopOMetry, we found a mismatch between the clusters identified with it and the clusters described in the publication, particularly around CD4+ cells (Figure 6G, Supplementary Figure 6A and 6B), in a similar fashion to the previous datasets. T cell clonotypes were identified based on amino-acid sequence similarity using ScirPy⁶⁰, and were found to match the clusters found with TopOMetry (Supplementary Figure 6C) better than the originally described clusters (Supplementary Figure 6D). When analyzing TCR clonal expansion, we found that subpopulations with larger expansion corresponded to NK and CD8+ cells, as expected and previously described⁵¹ (Figure 6H). However, populations with modest expansion corresponded to the CD4+ T cell clusters identified with TopOMetry (Figure 6H), suggesting these clusters correspond to specific clonal identities. To further explore this hypothesis, the 30 clonotypes with the most cells were visualized on the original UMAP embedding (Figure 6J), on which they are sparsely and randomly distributed, and on the TopOMetry embedding (topoMAP, Figure 6K), on which they are remarkably localized alongside the clusters found with TopOMetry. Consistent with these findings, the clonotype modularity (how densely connected the mRNA-derived neighborhood graph underlying the cells in a given clonotype is) was found to be higher when using TopOMetry in comparison to the standard approach (Suppplementary Figure 6E and 6F). Next, we cross-referenced the TCR structure to VDJdb, a public database of antigen epitopes^60,61, and identified two transcriptionally defined clusters which present TCR that bind to SARS-CoV-2 and EBV epitopes (Figures 6L and 6M); these populations are well-defined in the TopOMetry analyses, but are unindentifiable in the original UMAP. These results suggest that the clusters of T cells identified by TopOMetry correspond to specific clonal populations. This hypothesis is confirmed by analysis of TCR repertoire overlap across clustering results - while the clusters defined in the publication present a high repertoire overlap (Figure 6N), those identified using HVGs are expressively more well-separated (Figure 6O), and those obtained with TopOMetry present no overlap at all (Figure 6P). Indeed, we found that cells belonging to specific clonotypes defined by amino-acid sequence similarity also expressed the transcripts for the TCR chains identified by VDJ-seq (Supplementary Figures 6G-J). These results confirm that TopOMetry is able to identify clonal populations from mRNA expression alone, and strongly suggest that the T cell clusters found in the other analyzed datasets (for which VDJ data is not available) also correspond to specific clones.

Throghout the analyzed datasets containing T cells, it was noticeable that the major cell populations (CD4+ and CD8+ T, NK and B cells, monocytes and megakaryocytes) were similarly represented by the default workflow using 50 or 300 PCs, stand-alone UMAP and clustering on HVG and TopOMetry. Two main reasons could underlie such discrepancy on the underlying manifold of different cell types: i) high intrinsic dimensionalities (i.d.) and ii) highly non-linear geometry at a local level. To test the former, we used the FSA and MLE i.d. estimation methods, and found that T cells did not present particularly high i.d. in any of the tested datasets (Supplementary Figure 7A). To test the latter, we computed PCA with up to 300 PCs and calculated the total explained covariance, and found that 50 and 300 PCs explained an average of only 20.09% and 41.89% of the total covariance for these datasets when using the default parameters in scanpy⁵², with an ad hoc ‘elbow point’ around 30 PCs (Supplementary Figures 7B-F), at which only 17.7% of covariance was explained, in average. Remarkably, no HVG selection approach yielded orthogonal bases that explained more than 55% of variance at 300 PCs. Such low values of explained covariance are widely known as a hallmark of highly non-linear systems in the broader machine-learning community, albeit apparently unnoticed within the single-cell niche. Taken together, these findings and the lack of specific marker genes for clusters learned from a PCA-based neighborhood graph strongly suggest that the phenotypic manifolds underlying T cell identity are highly non-linear and poorly approximated by hyperplanes of maximum covariance, and point to the introduction of linear distortion by PCA preprocessing. Accordingly, we found that the distortion was significantly higher in the PCA-based UMAP representations compared to the topoMAP representation across all datasets in T cells (p < 10^-90, two-sided Wilcoxon test), as represented by ellipse eccentricities when projecting the Riemannian metric (Supplementary Figure G). Thus, it is clear that the CD4+ T cell clusters found with the standard workflow are PCA-derived artifacts rather than real biological subpopulations, and that the clusters found using HVGs or TopOMetry better correspond to the actual biological subpopulations.

Discussion

Finding low-dimensional spaces encoding the phenotypic manifolds underlying single-cell data is a fundamental task in single-cell analysis and a prerequisite for the performance of most existing algorithms used in downstream analyses (e.g., clustering, marker genes, RNA velocity). However, how to properly find and evaluate such spaces has remained elusive. Here, we draw from recent advances in spectral graph theory and high-dimensional data analysis to show that such spaces are best modeled as Riemannian manifolds in a comprehensive theoretical framework, and present TopOMetry, a robust, user-friendly and efficient toolkit that allows users to fully explore such manifolds in single-cell atlases with minimal assumptions about the data (i.e., the manifold hypothesis⁶²). A first-of-its-kind, the toolkit addresses numerous gaps in high-dimensional and single-cell data analysis, providing means to estimate intrinsic dimensionalities, learning consistent topological representations¹³ and evaluating subspaces from a global and local perspective. The ability of learning and evaluating dozens of representations with a single line of code empowers single-cell analysis, as one cannot be sure whether a particular computational algorithm will perform well in any given set of data: extending the conclusions obtained from even dozens of datasets to the vast and exponentially growing corpora of single-cell data would be a dangerous assumption. Instead of relying on the presumed supremacy of any given algorithm or assuming data structures, users are endowed with the capacity of testing different approaches that pursue the same unbiased mathematical goal (LBO approximation) and decide which to adopt based on quantitative and qualitative evaluations. Remarkably, it allows assessing distortion in visualizations for the first time by harnessing the Riemannian metric, addressing current criticism⁶³ on the biological interpretation of embeddings from single-cell data. Its modular design, extensive documentation and its interoperability with the larger python ecosystem for single-cell analyses ought to make its adoption easy and seamless to the majority of users. Similar to any other tool, it presents limitations, the major one being that, in its current form, it does not allow online or inverse learning, although that may be bypassed by the use of recently developed geometric regularized autoencoders⁶⁴. Another limitation is the high computational cost of evaluating subspaces, a shared feature of existing tools^33,43, as it involves operating on dense matrices pairwise distances on memory. We expect that our results will inspire the development of numerous other approaches harnessing the power of Laplacian-type operators and consistent manifold learning for the analysis of sc-data.

The power of the proposed theoretical framework and toolkit are demonstrated throughout the presented examples, identifying developmental trajectories in the murine pancreas that better match existing knowledge, representing the cell cycle as a closed loop, and dissecting numerous neuronal lineages throughout embryonic development. In particular, the identification of approximately 130 novel subpopulations of T cells across datasets sheds unprecedented light on T cell biology, alongside the match between these populations and TCR clones. Diversity is a known hallmark of T cells, but such diversity has remained obscured in scRNA-seq data, leading to a common belief that T cells are not transcriptionally heterogeneous. The results shown herein suggest the opposite, forming a new model in which different clones of T cells are transcriptionally distinct and that clonality can be inferred from transcriptional information alone. Moreover, our findings point to the widespread use of PCA as a universal preprocessing step without any consideration for possible non-linear geometries as the reason previous studies failed to detect them, and these results warrant the community to further assess the extent to which published results have been corrupted by PCA, and to carefully reconsider its use in single-cell analyses. However, this methodological study falls short of further experimental investigation of these subpopulations. Thus, the mechanisms and functions underlying such diversity are yet unclear, and future studies targeting specific T cell subtypes are warranted to further elucidate these populations.

Methods

Public data acquisition and availability

Public single-cell RNA-seq datasets (Table 1) were obtained from various sources and processed following the standard data analysis. The used sources and processing parameters are available in Table 2 alongside additional information for each dataset.

Code availability

TopOMetry is freely accessible as a python library. The source code is available at https://github.com/davisidarta/topometry, and can be easily installed through the python package index (PyPI) at https://pypi.org/project/topometry/. The library is extensively documented at https://topometry.readthedocs.io. Code used to generate the exposed examples is available at https://github.com/davisidarta/topometry-manuscript. Code used for the benchmark shown in Figure 2 and Supplementary Figure 3 is available at https://github.com/davisidarta/topometry-dr-benchmark. Any additional information can be requested from the authors and will be promptly made available.

Standard data analysis

Count matrices were used to create AnnData objects within scanpy⁵², which were filtered to recommended thresholds. Data were then processed with the following workflow: i) data were library-size normalized to a factor of 10,000 and variance stabilized by log normalization; ii) highly variable genes (HVGs) were identified and kept for downstream analysis-selection parameters were adjusted so that each dataset had between 1,500 and 4,000 HVGs; iii) HVG-data was then centered and scaled (i.e., standardization or z-score transformation); iv) dimensionality reduced to 50 or 300 dimensions with PCA; v) neighborhoods were computed using UMAP fuzzy simplicial sets using the cosine metric distance; vi) the neighborhood graph was projected with UMAP and t-SNE; and vii) the neighborhood graph was clustered with the Leiden community detection algorithm, with the resolution parameter set to 2. These steps correspond to the workflow applied in most manuscripts in which single-cell data is analyzed, being the norm in the field and regarded as the current gold-standard⁶⁵. Thus, we refer to these as the standard workflow throughout the manuscript. Unless otherwise stated, this was performed with default parameters within the SCANPY toolkit for single-cell analysis.

When stated that stand-alone UMAP and clustering was performed on HVGs, we refer to the scaled data matrix of cells vs. HVGs. In these cases, all analyses were still performed within scanpy, but setting the parameters use_rep=’X’ and n_pcs=0 in the scanpy.pp.neighbors() function, which causes the neighborhood graph to be computed on the scaled data instead of top principal components. UMAP projections and Leiden clustering results were then obtained normally using scanpy built-in functions.

Cluster marker genes were found with logistic regression⁶⁶ by using scanpy’s in all demonstrated examples, which calls scikit-learn⁶⁷ LogisticRegression estimator with default parameters (L2 regularization). The number of maximum iterations was set to 1000 for all examples and clustering results.

Topologically Optimized geoMetry

Overview

Fundamentally, TopOMetry learns representations using the discrete approximation of the Laplace-Beltrami Operator (LBO; △f), formally defined as the divergence of the gradient:

The LBO is a powerful descriptor of data geometry under the manifold assumption, as on a Riemannian manifold (a smooth manifold M with a Riemannian metric g defined at every point p ∈ M), the LBO entirely determines the Riemannian metric g - and therefore, all topological properties of the manifold. This remarkable feature was previously explored by methods like Laplacian Eigenmaps (LE) and Diffusion Maps (DM)^9,10,68, in which Laplacian-type operators obtained from affinity matrices approximate the LBO. The eigendecomposition of these operators is then used to find latent subspaces that completely preserve the geometry of the underlying manifold. This approach has been shown to be particularly robust when using affinity matrices learned with variable bandwidth kernels⁶⁸, continuous-k-nearest-neighbors (CkNN)¹³ or fuzzy simplicial sets^15,69.

TopOMetry combines and expands these previous approaches into a comprehensive framework that approximates the LBO iteratively to learn faithful low-dimensional representations of high-dimensional data. First, a k-nearest-neighbors (kNN) graph is obtained from the standardized data (the base kNN graph), and then used to estimate affinity matrices using different kernels (the base kernel). Laplacian-type operators are obtained from these matrices (the ‘geometric’ Laplacian with DM, also referred to as the diffusion operator, and the ‘random-walk’ Laplacian with LE), and their eigenvectors are used to form a new orthogonal basis in which the latent geometry of the original data is preserved. The learned orthogonal bases (the eigenbases) can be used for numerous downstream tasks, such as computing specialized neighborhood graphs for graph layout optimization (GLO) algorithms such as t-SNE and PaCMAP. It has been previously shown that multiple eigenvectors can be needed to encode data heterogeneity^43,70,71, each encoding a different aspect of data structure, i.e., distinct cellular trajectories, and dozens or hundreds of diffusion components could be needed to describe large-scale, whole-organism single-cell data. Thus, these orthogonal bases in which all latent geometry is encoded are still high-dimensional, and as such, can be described through a second round of LBO approximation. In this step, TopOMetry learns a new kNN graph from these bases, from which new affinity matrices and Laplacian-type operators can be computed - these operators encode the latent geometry on a higher level, and can be used for other downstream tasks such as clustering and GLO. The diffusion potential is used by default for clustering (e.g., with the Leiden community-detection algorithm) and GLO (e.g., with MAP, a simplified version of UMAP), and its eigendecomposition is used as initialization for GLO algorithms. As a result, TopOMetry harnesses LBO approximations at each step of dimensional reduction, offering a pipeline that is completely unsupervised and dependent uniquely on the intrinsic geometry of the underlying manifold, while also allowing the flexibility of testing multiple options of kernels and types of Laplacian operators.

Computational implementation

TopOMetry was designed to be easy to use, and as computationally efficient as possible. Popular approximate nearest-neighbor libraries (ANN, e.g., NMSlib, hnswlib, and annoy) are seamlessly integrated into the toolkit. Nearly all graph operations are performed on sparse matrices and are highly scalable with multithreading. Each step of the workflow involves the use of a modular class based on scikit-learn⁶⁷ transformers, thus allowing users to construct custom-made pipelines from them. The main classes are estimators of neighborhood graphs, affinity kernels (and associated graph operators), eigendecompositions and graph layout optimizations. The analysis is centered on the TopOGraph class, which orchestrates these classes into a streamlined workflow (Supplementary Figure 1). The modular nature of the software allows it to be easily integrated into diverse machine-learning and sc-data analysis workflows. It also includes additional classes and functions for evaluating lower-dimensional spaces, estimating intrinsic dimensionalities, and visualizing results. Some of these functions harness the TopOGraph class into high-level APIs that allow users to compute, evaluate and inspect the results of dozens of possible layouts from a single dataset using a single line of code, or to do so using the standard AnnData object⁵². Extensive documentation and tutorials are available at https://topometry.readthedocs.io/en/latest/.

Construction of k-nearest-neighbor graphs

One of the most fundamental aspect of machine-learning algorithms dealing with high dimensional data is the curse of dimensionality: in higher-dimensional spaces (HDS), the capacity of distance metrics to discriminate dissimilarity decays, i.e. as dimensionality increases, the Euclidean distance between the most similar samples tends to approach the distance between the most dissimilar one⁴². Thus, the task of accurately and efficiently computing distances between samples in a large HDS is a challenge by itself, and several algorithms and tools have been developed to address it⁷². The most commonly used metric in single-cell data analysis is the Euclidean distance, followed by the cosine and the manhattan distances - yet, in principle, any metric distance could be applied. Recent benchmarks^73,74 have achieved initial insight on the suitability of different metrics to estimate distances between single-cells, pointing towards correlation metrics as a favorable choice, although these were limited to only some single-cell RNA sequencing datasets in PCA-based workflows, and further studies are needed. This guided the choice of the cosine distance metric as a default in TopOMetry and in all presented analyses, as it trivially corresponds to the Pearson correlation distance when using standardized data (which is the current standard in single-cell analysis).

TopOMetry can employ scikit-learn and various ANN libraries to construct kNN graphs. Guided by a recent benchmark between distance estimation algorithms⁷⁵, we defined the Hierarchical Navigable Small Worlds (HNSW)^76,77 as a default. HNSW has been shown to achieve fast and near-optimal results while being useful for computing a variety of distance metrics (e.g. euclidean, cosine, manhattan). Similarly to other non-linear algorithms, the first step of the diffusion basis is to find a k-nearest-neighbors graph (kNN) graph. As previous studies have shown that using only highly variable features (HVF) for computations increases the power of single-cell analysis^65,78, we estimated distances from the cell by HVF matrix for the shown experiments. As a reference, ∼1,500 to ∼4,000 genes are generally considered a reasonable range of HVFs in single-cell RNA sequencing data analysis.

Affinity estimation

Variable bandwidth kernels

The implemented kernels draws from recent advances in kernel operators and expands them to include two other possible kernels. Instead of the classic Gaussian kernel with a fixed ε bandwidth, several studies have described that variable bandwidth kernels^13,68,70,79 act as manifold-adaptive estimators, rendering more robust results. These variable bandwidth kernels were adapted into TopOMetry as a bandwidth-adaptive kernel^68,68,71,79 in which the distance between two samples (cells) is normalized by each samples’ distance to its median k-nearest-neighbor (i.e. the k/2 - nearest-neighbor). As an extension of this concept, we introduce a novel kernel that includes neighborhood expansion procedure, in which the original kNN graph is expanded to additional neighbors using a data-derived estimate on neighborhood sparseness. In addition, we introduce a novel adaptively decaying kernel, in which the rate of similarity learning is adaptively adjusted by each cell neighborhood density instead of by a fixed hyperparameter as otherwise proposed by PHATE’s ‘alpha-decaying kernel’⁴³.

Guided by previous approaches, we first construct a k-nearest-neighbors graph for a given dataset, with N samples (cells) and M measures (genes), using a given distance metric as a local similarity measure. Let ℳ⊂ℝⁿ be a Riemannian manifold with unknown sampling distribution q (not necessarily uniformly distributed) in ambient dimension n. For a given sample x_i ∈ X_{n ×m} ⊂ ℳ (the manifold hypothesis - data is assumed to be contained in the proximities of the manifold), consider nbrs (x) to be the ordered set of its k-nearest-neighbors under some distance metric

Given that the vector of sample observations x_i, a scaling factor σ_i corresponding to the distance of sample i to its median nearest-neighboring sample(s) in the high-dimensional space is defined:

Drawing from previous work on adaptive bandwidth diffusion kernels, the following kernel function could be initially used to convert distances into affinities:

It is of notice that the inverse of the scaling factor σ_i is employed in the manifold-adaptive FSA method of intrinsic dimensionality estimation³⁹.

To minimize the impact of the choice of the number of k neighbors hyperparameter on the learned graph structure, we harness the potential of the scaling factor σ_i to carry information regarding the neighborhood density of x_i. For this, a new indirect, continuous metric named pseudomedian(Ω) is defined for each sample x_i as the interpolation of all σ_i into the new interval:

Ω_i represents a simple, indirect, and yet intrinsic reference from the distribution of nbrs (x_i) at a given fixed k, thus providing information on the neighborhood density of each individual sample and about the discrete graph-building process. A sample with a sparser neighborhood will have a larger value of Ω than one with a dense neighborhood, indicating the relative need for search expansion in the sparser region. This estimate is implemented in the proposed diffusion process as a way to obtain enriched information about x_i neighborhood density, and also as a measure to alleviate the impact of a fixed k hyperparameter, as the graph search can be extended as a function of maximum Ω to help avoid insufficient sampling at sparser neighborhoods:

It is of note that PHATE⁴³ introduced a kernel with variable bandwidth εequipped with a novel alpha-decay, in which α is a hyperparameter that controls the kernel decay rate. This was a first attempt to handle cases in which a sparsely sampled neighborhood can present with distant neighbors that are weighted with a high affinity. Note that α = 2 leads to the classic Gaussian kernel, and ε is a fixed bandwidth

We harness the concept of the alpha-decaying kernel to implement an adaptively-decaying kernel, in which the rate of decay is related to the neighborhood sampling density distribution

In summary, TopOMetry includes the fixed-bandwidth kernel, adapts the bandwidth-adaptive kernel, and introduces the bandwidth-adaptive kernel with neighborhood expansion, the bandwidth-adaptive adaptively-decaying kernel, and the bandwidth-adaptive adaptively-decaying with neighborhood expansion. By default, the bandwidth-adaptive kernel is used, due to its simplicity and relation to the FSA i.d. estimator.

Continuous k-nearest neighbors

Continuous k-nearest neighbors (CkNN) is a recently introduced⁸⁰ method that constructs a single unweighted graph from which the underlying manifold’s structure (i.e. topological information) can be retrieved, given any compact Riemannian manifold. In its introductory manuscript, it was also shown that CkNN is the only unweighted graph construction strategy for which the graph Laplacian converges to the Laplace-Beltrami Operator (LBO). Conceptually, CkNN points to the superiority of consistent homology over persistent homology. However, differently from variable-bandwidth kernels or fuzzy simplicial sets, this is achieved through an unweighted graph rather than through a weighted graph.

CkNN’s goal is to create an unweighted undirected graph from a given set of samples with distances given by a metric d. This translates to deciding whether to connect samples with edges. The two traditional approaches for constructing such a graph are a) fixed ϵ-balls (i.e. connect samples i, j if d(i, j) < ϵ); and b) k-Nearest Neighbors (kNN), in which a fixed integer k given as a user defined hyperparameter is used to define edges (i.e. connect samples i, j if d(i, j) ≤ d(i,i _k) or if d(i, j) ≤ d(j, j_k); i_k and j_k being i and j k-th nearest neighbor, respectively). As shown in CkNN’s manuscript and in its excellent performance within TopOMetry, a less discrete and rather continuous version of kNN circumvents several limitations imposed by these traditional strategies.

CkNN connects samples i, j with edges if . Here, δ is a unitless scale parameter allowed to continuously vary. Given a finite set of data, there is only a finite set of possible values for δ, which is set within TopOMetry as a user-provided hyperparameter. There are two main advantages to this approach: first, it allows one to interpret the CkNN graph construction as a kernel method, enabling interpreting the graph Laplacian in terms of δ; second, it allows fixing the k hyperparameter, enabling the use of d(i, i_k) as a local density estimate, similarly to the approximation of geodesics with fuzzy simplicial sets and diffusion harmonics. In practice, this captures the natural multiscale topology of data, wielding smooth geometries where data is sparsely sampled and fine-grained details where data is densely sampled.

Fuzzy simplicial sets

The application of 1-skeleton fuzzy simplicial sets as topological metrics on TopOMetry draws entirely from previous work, particularly from UMAP¹⁵ theory and the work of David Spivak⁶⁹. Our innovation is represented by the use of these metrics to construct new orthogonal bases that capture latent topology, and subsequently to construct topological graphs prior to layout optimization. When using this method to learn metrics from some data X_{n × m}, it is assumed Xis uniformly distributed on a locally connected ℳ Riemannian manifold with a locally constant metric g. UMAP originally addresses the uniformity assumption by creating a custom distance for each sample by normalizing distances with respect to their k-th nearest-neighbor to effectively approximate geodesics. To merge each of the discrete metric spaces generated per sample into a global informational space, the metric spaces are converted into fuzzy simplicial sets. A consensus metric can then be obtained by taking a fuzzy union across all metric spaces encoded as fuzzy simplicial sets, rendering a single fuzzy simplicial set which is granted to capture relevant topological metrics on ℳ. The Laplace-Beltrami Operator acting on ℳ can then be approximated from the normalized laplacian of this fuzzy graph representation.

The aforementioned approach is grounded on solid category theory and no major modifications to the fuzzy simplicial set algorithms defined and implemented in UMAP were made, aside from improving its computational performance with approximate-nearest-neighbors with HNSW. For brevity, we direct the reader to UMAP’s¹⁵ and David Spivak’s⁶⁹ manuscripts for in-depth descriptions of the involved category theory. A summarized description follows.

Simplicial sets are generalizations of simplicial complexes that provide a combinatorial approach to the study of topological spaces. Let Δbe a category with the finite order sets [n] = {1, …, n} as objects, and morphisms given by order-preserving maps. Δ^op denotes a category with the same objects as Δ and morphisms equivalent to those of Δ with reversed direction. Then, a simplicial set is defined as a functor from Δ^op to Sets (the category of sets). Given a simplicial set X : Δ^op → Sets, we refer to the set X([n]) as X_n, and to the elements of the set as the n-simplices of X. In the category of simplicial sets, there’s a direct correspondence between the n-simplices of Xand morphisms of the standard simplices Δⁿ (Δⁿ→ X). For each x ∈ X_n there’s a corresponding morphism x: Δ _n → X, and therefore . The standard covariant functor |·| : Δ → Top(the category of topological spaces) that sends [n]to the standard n-simplex |Δⁿ| ⊂ ℝⁿ⁺¹ with standard subspace topology is defined as:

And because X : Δ^op → Sets is a simplicial set, the realization of X (|X|) is , which associates the topological space with the simplicial set. On the other hand, given a topological space Y, the associated simplicial set S(Y)(the singular set of Y) is defined asS(Y) : [n] → hom_Top (|Δ ⁿ|, Y). The theoretical contributions of David Spivak and UMAP’s authors involve exploiting the fact that the realization functor and singular set functors form an adjunction, translating between topological spaces and simplicial sets - for this, a categorical presentation of fuzzy sets was outlined.

A fuzzy set can be defined as a set for which membership is not binary - rather, it is a fuzzy property represented by continuous values in the unit interval l (with l = (0, 1] ⊆ℝ, and with topology given by intervals of the form [0, a) for a ∈ (0, 1]). This allows one to define presheafs: a presheaf P on I is a functor from Δ^opto Sets . A fuzzy set is a presheaf on I such that all maps P(a ≤ b) are injections. P([0, a))is the set of all elements with at least a membership strength. With this concept in mind, the category Fuzz of fuzzy sets is defined as the full subcategory of sheaves on I spanned by fuzzy sets. Consequently, defining fuzzy simplicial sets (FSS) corresponds to considering presheaves of Δto be valued in Fuzz instead of Sets . The objects of the category of FSS sFuzz are thus functors from Δ^opto Fuzz, and its morphisms are given by natural transformations. Conversely, FSS can be viewed as sheafs over the product topology Δ × I, allowing one to apply them to a broad category of extended-pseudo-metric spaces. This will be useful as the initial geodesic normalization step involves creating a unique metric space for each sample.

Extended-pseudo-metric-spaces (X, d) are defined as sets Xand maps d :X×X→ ℝ_≥0 ∪ {∞}, being objects of the category EPMet, which has non-expansive maps as morphisms. A further subcategory of finite extended-pseudo-metric-spaces is defined as FinEPMet. Previously⁶⁹, the realization and singular set functors were extended to FSS by the construction of natural extension functors, Real and Sing, between the categories sFuzzand EPMet. Realis defined in terms of standard fuzzy simplices , similarly to the the classic realization functor

A morphism exists only if a ≤ b, and is determined by a Δ morphism σ: [n] → [m]. The action of Real on such a morphism is then given by the non-expansive map

This map is then extended to a more general simplicial set X by defining . Once again, similarly to the classical case, the right adjoint functor Singis then defined for a given extended-pseudo-metric space Yin terms of its action on the category of product topology Δ × I:

As only finite metric spaces are of interest in this case, the associated FSS Fin – sFuzz is considered and the correspondent FinRealand FinSingare used. Formally, the functor FinReal : Fin − sFuzz → FinEPMet is defined by:

,where d_a (x_i, x _j) is equivalent to− log(a) if i ≠ j and to 0 otherwise, allowing to be defined. Analogously,

FinSing: FinEPMet → Fin − sFuzz can be defined by:

By construction, we have that the functors FinReal : Fin − sFuzz → FinEPMet and FinSing : FinEPMet → Fin − sFuzzform an adjunction in which FinRealis the left adjoint andFinSingis the right adjoint.After the trouble of defining fuzzy simplicial sets and their functors, one can merge the rather incompatible metric spaces created during geodesic approximation to preserve topological information from each metric by first translating each to a fuzzy simplicial set with the FinSingfunctor, and then taking a union across all FSS, resulting in a single FSS that is ought to capture the underlying structure of the manifold ℳ. Given a dataset X = {X₁,…,X_n} ∈ ℝⁿ and its associated family of extended-pseudo-metric spaces {(X, d_i)}_i=1…N with common carrier set X, it follows that d_i (X_i,X_k) =d_ℳ (X_i, X_k) − ρif i = jor i = k, and that d_i (X_i, X_k) = ∞if otherwise. ρis the distance between X_i and its nearest neighbor, and d_ℳ is the approximate geodesic distance on the manifold ℳ. Then, the fuzzy topological representation of X corresponds to the union: .

Learning a latent basis from Laplacian-type operators

TopOMetry employs eigendecomposition to learn latent bases from Laplacian-type operators that approximate the LBO. Diffusion Maps (and its multiscale version) result from the eigendecomposition of the diffusion operator. The construction of a diffusion process on a graph from a given kernel function is a classical topic in spectral graph theory. Given a symmetric positive semi-definite affinity matrix W, we define the degree matrix D as a diagonal matrix composed of the sum of columns of . The unnormalized (or ‘combinatorial’) graph Laplacian L is given by: L = D − W. Then the normalized (or ‘symmetrized normalized’) graph Laplacian L_sym is given by: L_sym = D ^−1/2 LD ^−1/2. The random-walk Laplacian L_rw is given by: L_rw = D⁻¹ L = 1 − D⁻¹ W. The diffusion operator P can be simply defined as P = D_⁻¹ W, but can also be parametrized as P _(α) by the α anisotropy value, which controls how much the data distribution is allowed to bias the obtained results, so that only the underlying geometry is considered and the LBO is approximated when α = 1. For that, we first obtain an α-normalized W_(α) : W_(α) = D ^−α W D ^−α and form a new degree matrix , which is used to form the anisotropic diffusion operator

These operators can then be decomposition using the the generalized eigenvector problem. In classic Laplacian Eigenmaps (LE), the system Lf = λD f is solved to obtain a number of m eigenvectors, and these are used to map the data into a m-dimensional embedding . By default TopOMetry uses the random-walk Laplacian L_rw, and eigendecomposes it by L_rwf = λf. In LE, the eigenvectors with smallest eigenvalues are used, the first one being dropped as it is trivial. In Diffusion Maps (DM), the eigenvectors with the highest eigenvalues are obtained by solving P _(α)f = λf.

Most approaches using DM rely on powering the diffusion operator P _(α) by a given number of t ‘diffusion steps’^9,10,43, which consist of an additional hyperparameter. Instead, we harnessed an adaptation consisting of multiscaling, which avoids setting a fixed number of t diffusion steps and renders weighted diffusion distances across all possible timescales^71,81.

Given a diffusion operator P _(α) with a defined number of t steps and a given estimate M of the intrinsic dimensionality (number of non-trivial eigenvalues) of spectral decomposition: ℳ, one can achieve the spectral decomposition:

In which encodes the diffusion distance D D between samples x_i and x_j, along the diffusion component ϕ:

We next multiscale distances to account for all possible t scales, rendering multiscale diffusion distances msD D :

Which can be reformulated as:

The eigenvalues of the resulting decompositions are useful to perform an empiric estimation of the data intrinsic dimensionality during computations, as the numerical values hit the limits of floating-point precision, which is encoded either by a sign-flip or by the decomposition algorithm outputting null outputs for the eigenvalues that pass this limit. We found that this is a useful way to set the key hyperparameter n_eigs, which defines a fixed number of latent dimensions to be computed, and recommend putative users to increase this parameter up to a point where such a spectral gap (or an eigengap) can be observed in scree plots. As shown in Figure 3, the identified spectral gap often corresponds to external estimates of intrinsic dimensionality.

Graph-layout optimization

Graph layout optimization (GLO) are techniques that aim to optimize a lower-dimensional representation (usually in two or three dimensions, for visualization) that preserves most of some properties from the original data by minimizing a given cost function. Differences between these methods are mostly ough to the different methods to learn similarities from distances and to the number and design of cost functions⁸². The seminal t-SNE method, for instance, employs t-Student scores to estimate similarities between samples, and minimizes the Kullback–Leibler divergence between the high-dimensional and lower-dimensional representations. UMAP, which followed t-SNE as the de facto gold standard for visualization, learns similarities with fuzzy simplicial sets (FSS), and then minimizes the cross-entropy between the high and lower-dimensional representations. PaCMAP has shed additional light on how these methods work by analyzing the asymptotics of cost functions and empirically testing and defining the characteristics of efficiently designed cost functions; it also introduced the use of more than one cost function during the layout optimization process (i.e. initial iterations follow one cost function, intermediate steps another cost function, final steps a third one). MDE generalizes these concepts by situating layout optimization methods as special cases of convex optimization. With the exception of MAP, TopOMetry harnesses pre-existing implementations of layout algorithms in the python programming language, such as t-SNE, TriMAP, PaCMAP and MDE. A detailed description of the main techniques currently included in the framework and used in the benchmark follows:

t-Distributed Stochastic Neighborhood Embedding (t-SNE)

t-SNE¹⁴ is an improved version of SNE⁸³ (Stochastic Neighborhood Embedding) that offers significantly superior performance when applied to real-world data. SNE first converts high-dimensional distances into conditional probabilities that represent similarities: the similarity between samples x_i and x_j corresponds to the probability p _i|j that x_j would have x_i as neighbor if neighbors were chosen according to the probability density under a Gaussian centered at x _j.

A similar conditional probability q_i|j can be computed for the points in the low-dimensional embedding. Traditionally, the gaussian employed in q_i|j is set to , and these similarities are defined by:

SNE aims to find a low-dimensional representation that minimizes the difference between these two probability distributions by considering the sum of all Kullback-Leibler (KL) divergences between the high- and low-dimensional representation of each data point. Considering P _i as the conditional probability distribution of samples in their ambient high-dimensional space and Q_i its low-dimensional counterpart, the sum of KL divergences for all samples is then minimized through gradient descent, with a cost function C defined by:

The gradient of of this cost function has the form:

And its update update with a momentum term (to avoid local minima) is given by:

t-SNE’s superiority to SNE is due to two differences: first, the implementation of a symmetrized version of the SNE cost function; second, the use of heavy-tailed distributions to address the crowding problem. Significant advances have also been made regarding the optimization of the t-SNE cost function to avoid local minima and increase performance. To symmetrize the cost function, instead of minimizing the sum of all KL divergences between high-dimensional and low-dimensional probability distributions, a single KL divergence between the joint probabilities distributions P (high-dimensional) and Q (low-dimensional) is used:

In t-SNE, a Student t-distribution with one degree of freedom (the same as a Cauchy distribution) is used as a heavy-tailed distribution for the low-dimensional mapping. This defines the joint probabilities q_{i j}:

The gradient of the KL divergence between P and Q (computed using the above equation) is given by:

With an update with momentum identical to that of SNE.

There are numerous modifications and adaptations of the t-SNE algorithm (e.g. heavy tailing, density preservation, perplexity combination), and these exceed the scope of this manuscript. The version of t-SNE included in TopOMetry relies on the MulticoreTSNE^84,85 package.

Uniform Manifold Approximation and Projection (UMAP) and Manifold Approximation and Projection (MAP)

The similarity metric when using the ‘fuzzy’ algorithm in TopOMetry corresponds to this 1-skeleton of the fuzzy simplicial sets, that is, a fuzzy graph (i.e. a fuzzy set of edges). In essence, UMAP’s algorithm consists of only a couple more steps to learn a low-dimensional representation from X by constructing fuzzy topological representations of X and its putative low-dimensional representation Y. First, the fuzzy set of edges is represented by a reference set A and a membership strength function µ : A → [0, 1]. Using a presheaf representation P, it’s possible to define A = ⋃ P_a∈(0,1] ([0, a)), and µ(x) = sup{a ∈ (0, 1] | x ∈P([0, a))}.Then, to generate visualizations, the graph layout is optimized by minimizing the cross-entropy C between the high-dimensional and low-dimensional fuzzy sets (A, µ) and (A, v)in terms of the reference set A:

The MAP approach included in TopOMetry corresponds to a near-identical implementation of UMAP’s optimization procedure, but using arbitrary graphs that are directly used for layout optimization (i.e. the manifold has already been approximated), thus only assuming that the cross-entropy optimization finds global near-optima solutions.

Pairwise Controlled Manifold Approximation and Projection (PaCMAP)

PaCMAP holds significant similarities with UMAP and TriMAP, but introduces its own unique insights. The core idea of PaCMAP is establishing robust loss functions that take into account not only a sample’s local neighborhood, but how that neighborhood relates to closely located neighborhoods and with far away points. First, the neighborhood affinity graph from data is built in a manner identical to TriMAP triplet weighting procedure, in which distances between samples are scaled by the distance between the target sample and its 4th to 6th nearest neighbor:

Next, pairs of samples are selected as follows:

Near pairs - a sample’s k-nearest-neighbors, based on the scaled distance.
Mid-near pairs - for each sample i, 6 other samples are randomly selected, and the second-closest to i is paired to it in a mid-near pair. The number of mid-near pairs to be sampled is k times a hyperparameter (MN_ratio, default 0.5).
Further pairs - uniformly sample non-neighbors of i. The number of further pairs to be sampled is k times a hyperparameter (FP_ratio, default 2).

Given an initialization (PCA in default PaCMAP, LE in TopOMetry), the embedding is then optimized in three different stages, each with different weights for near-pairs, mid-near pairs and further pairs. Consider a sample i, its neighbor j, its mid-near pair k and its further pair l. Given the low-dimensional distance between samples a and b, PaCMAP minimizes the loss:

Where w_nb is a weight given to i neighbors, w_mn a weight given to its mid-near pairs and w_fp a weight given to its further pairs. The optimization is performed in three stages, each with a number of iterations t that can be controlled by the user as hyperparameters. In the first stage (i.e. first 100 t iterations), w_nb = 2,

, and w_fp = 1. In the second stage, w_nb = 3,

w _mn = 3, and w_fp = 1. In the third and last stage, w_nb = 1, w_mn = 0, and w_fp = 1.

The uniqueness and novelty of PaCMAP lies in the fact that local and global structure are balanced throughout the optimization process by the choice of sampling neighbors, mid-near pairs and further pairs simultaneously. When employed in TopOMetry, PaCMAP uses a denoised topological orthogonal basis instead of the raw data; similarly to TriMAP, feeding it the downstream topological graph may be counterproductive due to its peculiar neighborhood sampling process.

Estimation of intrinsic dimensionalities

Intrinsic dimensionalities were estimated using the Maximum-Likelihood Estimator (MLE)⁴⁷ and the modified FSA³⁹ algorithms. A summary of these methods follow:

Maximum Likelihood Estimator (MLE)

Consider X i.i.d. observations representing an embedding of a lower-dimensional sample X_i = g(Y_i), where Y_i are sampled from an equally lower-dimensional and unknown density function f, g being a continuous and smooth mapping. Given a point x, it is assumed that f(x)is constant in a small sphere S_x (R) of radius R centered on x, and the observations are treated as a homogeneous Poisson process in S _x (R). Substituting a number of k-nearest-neighbors instead of a radius R, one can find that the estimate for the intrinsic lower-dimensionality m_k (x) around x is:

in which T_k(x) represents the distance between x and its k-nearest-neighbor. The global estimate is obtained by averaging all local i.d. Estimates.

Manifold adaptive dimensionality estimation (FSA)

The FSA (Farahmand, Szepesvári & Audibert) method^39,86 is extremely simple: it uses two neighborhoods around a data point to estimate its local intrinsic dimensionality δ(x). In particular, it uses the ratio between the distance R from x to its k and k/2 nearest neighbors:

Note that the ratio corresponds to the scaling factor σ used in the bandwidth-adaptive kernel estimator employed in TopOMetry. The median of local estimates³⁹ is then used to represent the global intrinsic dimensionality of the dataset.

Estimating and representing distortions with the Riemannian metric

We draw from the seminal work of Perrault-Joncas and Meila³⁶, which first used the Riemannian metric to visualize distortions in the data and learn locally isometric embeddings by correcting projections, using fixed bandwidth kernels. In that regard, our contribution is restricted to extending their work using kNN graphs and variable bandwidth kernels and to applying these principles to single-cell data.

The Riemannian metric g is a symmetric positive definite tensor field that defines an inner product <, >_g on the tangent space T_p M for every p ∈ M, M being a Riemannian manifold. The Laplace-Beltrami operator (LBO) △f = ∇ · ∇f can be expressed by means of g:

Because △f is approximated by the random-walk graph Laplacian L_rw and the anisotropic diffusion operator P _α, one can then use it to estimate the inverse of the Riemannian metric g(p)⁻¹ using an arbitrary coordinate chart or embedding:

with i, j = 1,…, d. Computing the inverse of g is then straightforward.

The method can then be extended to work with any embedding f of the manifold M. Consider d the manifold intrinsic dimension and s the dimension of the embedding space. For for any embedded point f_p, there will be a corresponding s × s matrix h_p with rank d defining a scalar product and with null space orthogonal to the tangent space can be defined so that (f(M), h) is isometric with (M, g), corresponding to the embedding Riemannian metric.

The embedding metric h_p is given by the pseudoinverse of its dual, , where:

In practice, using the graph Laplacian L to approximate △, we have:

can then be eigendecomposed, with eigenvalues ordered from largest to smallest, and the embedding metric h is given by the pseudoinverse of . For each point, the ellipsoid representation can then be obtained using its eigendecomposition. When using two dimensions, the ellipse orientation is given by the arctangent of the ratio between the eigenvectors, and the size of the major and minor axes are given by for the largest and smallest λ eigenvalues, respectively, κ being a scaling factor.

Quantitative evaluations

Global score

To evaluate preservation of global structure by embeddings and layouts, we elected the PCA resemblance score. It was first introduced in the TriMAP manuscript⁴¹ and roughly captures how much a given embedding Y performs compared to PCA when it comes to preserving the global structure of the original data X. By design, PCA has the lowest possible Minimum Reconstruction Error (MRE), defined as:

Where || · ||_F denotes the Frobenius norm. A global score (GS) is then defined by normalizing a given MRE by PCA’s MRE (which by construction is 1):

Interestingly, as exposed in the results, the global score remains rather stable and close to 1.0 across all embedding and layout algorithms that were tested, given a PCA or spectral initialization. We empirically show that this score is approximately the same using either PCA or a spectral layout (LE) and implement functions to perform such tests within TopOMetry.

Trustworthiness

For evaluating the preservation of local structure from an isomorphic perspective, we harnessed the well-known trustworthiness⁴⁰ metric available on the scikit-learn machine-learning toolkit. Given a small number of k-nearest-neighbors, this metric is defined as:

In which for each sample i, are its k-nearest-neighbors in the low-dimensional space, and j is its r(i, j)-th nearest-neighbor in the original high-dimensional space. Essentially, this penalizes samples (cells) that are nearest neighbors in the low-dimensional space, but not in the original space, in proportion to their original rank. A score of 0 implies complete loss of local geometry, and a score of 1 implies complete preservation of local geometry, at the rank of k.

Geodesic correlation

For evaluating the preservation of local structure from an isometric perspective, we built upon previous approaches that explored geodesic distances and the Spearman correlation between original high-dimensional and downstream low-dimensional distances^43,87. This task is computationally very intensive, and thus was not performed by default in the benchmark. kNN graphs are built from both the original high-dimensional data matrix and the representations to be scored. Next, the geodesic distances (i.e. the weighted shortest-path pairwise distances) are obtained for each of these graphs, and their distributions are then scored according to the Spearman correlation (R). Cells can then be ranked accordingly to their neighborhood indices - let n be the number of cells and d_i be the difference between ranks for cell i in the original high-dimensional space and in the latent low-dimensional space. Then, we can calculate the Spearman correlation as:

The Spearman rank-order correlation coefficient (R) is a nonparametric measure of association between two sets of data, and superior to the Pearson correlation in this case because it assumes only monotonicity, not linearity. Intuitively, this score tells whether increasing distances in the low-dimensional embedding is related to increasing distances in the high-dimensional data: a score of 0 implies no association at all, and a score of 1.0 implies perfect association.

Benchmark

For the benchmark presented in Figure 2 and Supplementary Figure 3, we used the global score and the trustworthiness score. 200 principal components or eigencomponents were used for each non-biological dataset, and 600 principal components or eigencomponents were used for each single-cell dataset. For all datasets, the cosine metric distance was used with a value of 30 nearest-neighbors. All other hyperparameters used in t-SNE, UMAP, PaCMAP and TopOMetry were set to defaults. We restricted the tested GLO techniques to t-SNE, UMAP and PaCMAP due to practical limitations in calculating results for additional methods (e.g., including TriMAP and MDE or the plethora of available options would lead to an unreasonable increase in runtime).

RNA velocity analysis and lineage inference

For RNA velocity analysis of the murine pancreas development dataset⁴⁸, we used the python toolkit scvelo²⁶. Analysis was performed with standard parameters to reproduce scvelo documentation. TopOMetry analysis was performed using the msDM eigenbasis (obtained with the bandwidth_adaptive kernel) and the MAP projection method was used on the learned diffusion potential from this eigenbasis with standard parameters. RNA velocity was then recomputed using the msDM representation to compute neighborhood graphs using the scvelo.pp.moments function, with all other hyperparameters being identical to the standard analysis.

For the MOCA dataset, TopOMetry msDM eigenbasis was used to decompose the data into 800 eigenvectors (796 positive). Clustering with the Leiden community detection algorithm and projection with MAP were then carried out using the diffusion potential learned from this eigenbasis, using default parameters.

Analysis of T cell diversity and clonality

The datasets used to explore T cell diversity were cell-type annotated using CellTypist, using the graph connectivities learned from the scaled HVG data matrix for overclustering and majority voting⁵¹. The Immune_All_High model was used to annotate main cell types, and the Immune_All_Low to annotate subtypes. For quantifying the amount of detected T cell clusters, a threshold of 0.8 prediction probability was used.

All datasets were analyzed following the standard workflow within scanpy, using 50 or 300 principal components and with the Leiden clustering algorithm set to resolution=2. All other hyperparameters were set to defaults. Marker genes were found using logistic regression⁶⁶ with max_iters=1000 and all other hyperparameters set to defaults. TopOMetry analysis was performed using the msDM eigenbasis with the bandwidth-adaptive kernel to calculate a total of 500 eigencomponents for all datasets, from which the diffusion potential was used to compute MAP projections. All other TopOMetry hyperparameters were set to defaults.

For the analysis of the TICA dataset⁵¹, an AnnData object containing scRNA-seq and VDJ information for the T cell compartment was obtained from https://www.tissueimmunecellatlas.org/ (GEX+TCRab data). We then reanalyzed this data with scirpy⁶⁰ following the steps exactly as detailed in its documentation without any change to parameters other than to use TopOMetry latent representations and clustering results. TopOMetry analysis was performed with the same parameters used for all other PBMC datasets.

Computational Environment

The benchmark presented on Figure 2 and Supplementary Figure 3 was computed on a virtual machine with Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz x 128 threads, with a total of 320GB of ERCC DDR4 SDRAM with the 64-bit Ubuntu 18.04.6 LTS distribution of the Linux operating system, provided by the Institute of Biology from the University of Campinas. All other analyses were performed on a six-core Lenovo ThinkPad P52 Intel® Xeon(R) E-2176M CPU @ 2.70GHz

× 12 threads, with a total of 128GB of ERCC DDR4 SDRAM with 64-bit 21.10 Pop!OS distribution of the Linux operating system. Package versions are listed in Methods Table 1.

Supporting information

Supplementary Material

Funding

DS-O was supported by the grant #2020/04074-2, São Paulo Research Foundation (FAPESP), and LAV was supported by the grant #2013/07607-8, São Paulo Research Foundation (FAPESP).

Acknowledgements

We thank Gustavo B. Bellini, Valcir V. Vargas and Marcos Akira at the Institute of Biology from the University of Campinas for their most kind support with computational infrastructure. We thank Leland McInness, Dmitry Kobak and Akshay Agrawal for their valuable comments regarding some of the ideas presented in the manuscript. We thank Bruno Loyola Barbosa for his kind assistance in testing the early implementations of TopOMetry. We thank Ebru Erbay and Helder Nakaya for their generous feedback during the development of TopOMetry.

Author Contributions

DS-O, LAV and AD jointly conceived the idea of harnessing phenotypic topology to visually represent biological hallmarks from single-cell data. DS-O designed and built TopOMetry and conceived and performed all bioinformatic analyses. All authors wrote the manuscript and agreed to the final submitted version.

Competing interests

The authors have no competing interests.

Supplementary figures and legends

Significance of findings

Strength of evidence

Abstract

Introduction

Results

A framework to systematically learn and evaluate manifolds in single-cell data

Schematic overview of single-cell data analysis with TopOMetry.

Representations learned with TopOMetry preserve local and global structures across single-cell datasets

Public single-cell RNA-seq datasets that were used in this study benchmark.

TopOMetry eigenbases and projections preserve more original structure than the current standards.

Estimating intrinsic dimensionalities in toy and single-cell data with TopOMetry.

Estimating and visualizing intrinsic dimensionalities with TopOMetry

TopOMetry allows precise manifold learning of small and large-scale datasets

Inferring cellular lineages in small and large datasets with TopOMetry.

Evaluating distortions in two-dimensional projections with the Riemannian metric.

TopOMetry augments representations with the Riemannian metric to visualize distortions

TopOMetry reveals novel transcriptionally defined T cell populations

Uncovering the transcriptional diversity of T cells with TopOMetry.

Discussion

Methods

Public data acquisition and availability

Code availability

Standard data analysis

Topologically Optimized geoMetry

Overview

Computational implementation

Construction of k-nearest-neighbor graphs

Affinity estimation

Variable bandwidth kernels

Continuous k-nearest neighbors

Fuzzy simplicial sets

Learning a latent basis from Laplacian-type operators

Graph-layout optimization

t-Distributed Stochastic Neighborhood Embedding (t-SNE)

Uniform Manifold Approximation and Projection (UMAP) and Manifold Approximation and Projection (MAP)

Pairwise Controlled Manifold Approximation and Projection (PaCMAP)

Estimation of intrinsic dimensionalities

Maximum Likelihood Estimator (MLE)

Manifold adaptive dimensionality estimation (FSA)

Estimating and representing distortions with the Riemannian metric

Quantitative evaluations

Global score

Trustworthiness

Geodesic correlation

Benchmark

RNA velocity analysis and lineage inference

Analysis of T cell diversity and clonality

Computational Environment

Supporting information

Funding

Acknowledgements

Author Contributions

Competing interests

Supplementary figures and legends

Schematic overview of standard and TopOMetry workflows for single-cell analysis.

Qualitative evaluation of TopOMetry and existing methods on synthetic and toy data.

Preservation of global structure across eigenbases and projections from single-cell datasets.

Transcriptional diversity of T cells in a healthy human donor.

Transcriptional diversity of T cells across diseases.

Transcriptional diversity of T cell clonotypes.

PCA introduces linear distortion into the manifold region inhabited by T cells.

References

Article and author information

Author information

Davi Sidarta-Oliveira

Ana Domingos

Licio A Velloso

Version history

Cite all versions

Copyright

Metrics