Schematic overview of single-cell data analysis with TopOMetry.

(A) Single-cell experiments are preprocessed into high-dimensional matrices of cells and measured variables (e.g., gene expression, protein content, chromatin accessibility) and require dimensionality reduction tools for their analysis. Most tools make assumptions about unknown aspects of the underlying geometry (e.g., linearity) or distribution (e.g., uniformity). Instead, relying on the Laplace-Beltrami Operator (LBO) and its eigenfunctions allows learning such geometry assuming only the manifold hypothesis. (B) Approximating the LBO involves constructing a k-nearest-neighbors (kNN) graph using adaptive affinity estimation, which can be done through several kernels so as to make the affinity graph insensitive to neighborhood densities. The Laplacian-type operators (particularly the anisotropic diffusion operator) are approximations of the LBO. The resulting matrices can be used for several tasks, such as imputation of missing data, signal filtering and interpolation, and graph sparsification and coarsening. (C) The eigendecomposition of the LBO approximations yields eigenvectors and eigenvalues that are weighted or multiscaled to form a new orthogonal eigenbasis. The eigenvalues can be used to locate an eigengap (spectral gap), which provides an estimate of the intrinsic dimensionality of the data. The intrinsic dimensionality can also be estimated using neighborhood-based methods (e.g., FSA and MLE) or ad hoc inspection of eigenvectors for discriminative potential. (D) A second neighborhood graph is learned from the eigenbasis, and its Laplacian-type operators are used to obtain a new LBO approximation, rendering ‘topological graphs’. (E) From a spectral initialization obtained from the topological graph, any graph layout optimization (GLO) can be used on the topological graph or the eigenbasis for visualization. (F) Downstream tasks such as clustering, RNA velocity estimation, and imputation can be performed with the learned topological graphs and layouts. (G) The learned eigenbases and visualizations can be evaluated regarding the preservation of global and local structure, and the Riemannian metric can be used to visualize distortions in layouts to aid biological interpretation.
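Steps (B) and (C) can be illustrated with a minimal NumPy sketch. It is not TopOMetry's implementation: the function names `diffusion_eigenpairs` and `eigengap`, the choice of the distance to the k-th neighbor as adaptive bandwidth, and the dense distance matrix are all assumptions made for illustration. The sketch builds a symmetrized kNN affinity graph with an adaptive Gaussian kernel, applies the anisotropic (density-removing) normalization, and eigendecomposes the resulting diffusion operator.

```python
import numpy as np

def diffusion_eigenpairs(X, k=10, n_components=5):
    """Illustrative sketch: kNN graph -> adaptive kernel -> anisotropic diffusion operator -> eigenpairs."""
    # Dense pairwise distances; acceptable only for small illustrative inputs.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Assumed adaptive bandwidth: distance to the k-th nearest neighbor
    # (column 0 of the sorted row is the point itself).
    sigma = np.sort(D, axis=1)[:, k]
    # Symmetrized kNN mask, then an adaptive Gaussian kernel on the retained edges.
    knn = D <= sigma[:, None]
    K = np.where(knn | knn.T, np.exp(-D**2 / (sigma[:, None] * sigma[None, :])), 0.0)
    # Anisotropic normalization: divide out the kernel density estimate q.
    q = K.sum(axis=1)
    K_aniso = K / (q[:, None] * q[None, :])
    # The Markov matrix P = diag(d)^-1 K_aniso shares its eigenvalues with the
    # symmetric matrix S below, which is safer to eigendecompose.
    d = K_aniso.sum(axis=1)
    S = K_aniso / np.sqrt(d[:, None] * d[None, :])
    evals, evecs = np.linalg.eigh(S)
    top = np.argsort(evals)[::-1][:n_components]
    # Rescaling by d^-1/2 converts eigenvectors of S into right eigenvectors of P.
    return evals[top], evecs[:, top] / np.sqrt(d)[:, None]

def eigengap(evals):
    """Number of eigenvalues that precede the largest drop in a descending spectrum."""
    return int(np.argmax(np.abs(np.diff(evals)))) + 1

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
evals_demo, basis_demo = diffusion_eigenpairs(X_demo, k=10, n_components=8)
print(evals_demo, eigengap(evals_demo))
```

The leading eigenvalue is always 1 and its eigenvector is constant within each connected component of the graph, so in practice it is usually discarded before the remaining components are weighted or multiscaled.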

Public single-cell RNA-seq datasets used in this study's benchmark.

TopOMetry eigenbases and projections preserve more original structure than the current standards.

(A) Schematic overview of the assessment of global structure in TopOMetry: the global score is calculated by taking the exponential of the difference between an embedding's MRE and PCA’s MRE, divided by PCA’s MRE. (B) Schematic overview of the assessment of local structure in TopOMetry with the trustworthiness score: the score penalizes embeddings in which cells are neighbors in the low- but not in the high-dimensional space. (C) Schematic overview of the assessment of distance preservation in manifolds: the pairwise geodesic distances in the high- and low-dimensional spaces are computed, and the Spearman correlation (R) between the neighbor ranks of each cell in the two spaces is taken as the score. (D) Annotated heatmap of local scores for TopOMetry’s eigenbases and PCA for 20 single-cell datasets (higher is better). (E) Annotated heatmap of local scores for projections learned using expression data, the first 100 principal components, and TopOMetry’s eigenbases or topological graphs for 20 single-cell datasets (higher is better).
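The scores in (A) and (B) can be sketched as follows. Several points are assumptions rather than statements from the legend: MRE is taken to be the minimum reconstruction error of the best linear map from the embedding back to the centered data; the exponent is given a negative sign so that PCA scores exactly 1 and every other embedding scores at most 1; and trustworthiness uses the standard rank-based definition with Euclidean neighbors. The function names are hypothetical, and this is not the TopOMetry code.

```python
import numpy as np

def min_reconstruction_error(X, Y):
    """Squared error of the best linear map from embedding Y back to centered data X (assumed MRE)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    A, *_ = np.linalg.lstsq(Yc, Xc, rcond=None)
    return float(np.sum((Xc - Yc @ A) ** 2))

def pca_embedding(X, n_components):
    """PCA scores computed through an SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def global_score(X, Y):
    """exp(-(MRE_Y - MRE_PCA) / MRE_PCA); the negative sign is an assumption.

    Undefined when PCA reconstructs X exactly (MRE_PCA = 0)."""
    mre = min_reconstruction_error(X, Y)
    mre_pca = min_reconstruction_error(X, pca_embedding(X, Y.shape[1]))
    return float(np.exp(-(mre - mre_pca) / mre_pca))

def _neighbor_order(Z):
    """Row i lists all other points sorted by distance to i; i itself is pushed last."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)

def trustworthiness(X, Y, k=5):
    """Standard trustworthiness (valid for k < n / 2): penalizes points that are among
    the k nearest neighbors in Y but not in X, by how far away they rank in X."""
    n = X.shape[0]
    order_X, order_Y = _neighbor_order(X), _neighbor_order(Y)
    # rank_X[i, j] = 1-based rank of j among the neighbors of i in the original space.
    rank_X = np.empty((n, n), dtype=int)
    rank_X[np.arange(n)[:, None], order_X] = np.arange(1, n + 1)[None, :]
    penalty = 0
    for i in range(n):
        ranks = rank_X[i, order_Y[i, :k]]
        penalty += int(np.sum(np.maximum(ranks - k, 0)))
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty
```

Both functions use dense distance matrices and are meant for small inputs only; a practical implementation would rely on approximate nearest-neighbor search.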

Estimating intrinsic dimensionalities in toy and single-cell data with TopOMetry.

(A) Eigenspectrum of the eigenvalues for each diffusion component learned for the MNIST dataset of handwritten digits with TopOMetry (left) and the absolute value of their first derivatives (right). (B) Distribution of MNIST samples (individual images) across different classes (digits) across diffusion components (top) and principal components (bottom). (C) Histograms of intrinsic dimensionality estimates for each sample in the MNIST dataset, with varying numbers of k-nearest-neighbors, with the FSA (top) and MLE (bottom) methods. (D) Heatmap of Spearman R correlation between FSA and MLE estimates of intrinsic dimensionalities of MNIST images for varying numbers of k-nearest-neighbors. (E) topoMAP projections of a subset of the MNIST handwritten dataset, colored by classes (numbers). The 100 images with the highest (left) or lowest (right) estimates of intrinsic dimensionality are colored black in the projections and shown on the top of each projection. (F) Eigenspectrum of the eigenvalues for each diffusion component learned for the PBMC3k dataset of peripheral blood mononuclear cells (10X Genomics) with TopOMetry (left) and the absolute value of their first derivatives (right). (G) Histograms of intrinsic dimensionality estimates for each cell in the PBMC3k dataset, with varying numbers of k-nearest-neighbors, with the FSA (top) and MLE (bottom) methods. (H) Heatmap of Spearman R correlation between FSA and MLE estimates of intrinsic dimensionalities of PBMC3k cells for varying numbers of k-nearest-neighbors. (I) topoMAP projections of the PBMC3k dataset, colored by the estimates obtained with FSA (top) and MLE (bottom) with 100 nearest-neighbors. (J) topoMAP projection of the PBMC3k dataset, colored by annotated cell types. (K) Violin plots of estimates of intrinsic dimensionalities of PBMC3k cells with FSA (left) and MLE (right) with 100 nearest-neighbors.
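The per-sample estimates in (C) and (G) can be illustrated with a short sketch. It assumes that MLE refers to the Levina-Bickel maximum-likelihood estimator, which is built from the ratios between the distance to the k-th neighbor and the distances to the closer neighbors. The function name `mle_intrinsic_dim` is hypothetical, and the dense distance matrix limits the sketch to small inputs.

```python
import numpy as np

def mle_intrinsic_dim(X, k=20):
    """Per-point Levina-Bickel estimate: (k - 1) / sum_{j<k} log(T_k / T_j),
    where T_j is the distance from the point to its j-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Sorted distances to the 1st..k-th neighbors (column 0, the point itself, is dropped).
    T = np.sort(D, axis=1)[:, 1:k + 1]
    log_ratio = np.log(T[:, -1][:, None] / T[:, :-1])
    return (k - 1) / log_ratio.sum(axis=1)

# A flat two-dimensional sheet embedded in five dimensions.
rng = np.random.default_rng(0)
sheet = np.hstack([rng.uniform(size=(400, 2)), np.zeros((400, 3))])
estimates = mle_intrinsic_dim(sheet, k=20)
print(np.median(estimates))
```

Repeating the call over a range of `k` values and histogramming the outputs gives plots of the kind described in the legend.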

Inferring cellular lineages in small and large datasets with TopOMetry.

(A) UMAP and (B) topoMAP projections of the Pancreas dataset showing cellular developmental trajectories in the murine pancreas, colored by annotated cell type. (C) topoMAP and (D) UMAP projections of the same data, colored by inferred scores of different cell-cycle phases. The same projections are also colored by the predicted cell-cycle phase of each cell in (E) and (F), and by the first five diffusion components in (G). Arrows indicate a population of cells that are classified as mitotic but were misplaced in the middle of the differentiation trajectory by the default PCA-based UMAP projection. (H) PCA and (I) PCA-based UMAP projections of the Mouse Organogenesis Cell Atlas (MOCA) showing cellular developmental trajectories of whole mouse embryos, colored by annotated subtrajectories from the original study. (J) PCA-based UMAP and (K) topoMAP projections of the MOCA dataset, colored by development stage, annotated trajectories and subtrajectories, and TopOMetry clustering results using the standard Leiden community-detection algorithm. Note how the topoMAP projection and the clustering results from TopOMetry uncover dozens of neuronal subtrajectories that match the developmental stage from which cells were sampled.

Evaluating distortions in two-dimensional projections with the Riemannian metric.

The Riemannian metric can be used to estimate distortions in two-dimensional visualizations and can be represented by ellipses. If no distortion is present in any preferential direction, the ellipses will have zero eccentricity and correspond to circles. If distortion is present in a preferential direction, the ellipse will be aligned in that direction, and its eccentricity indicates the degree of distortion. (A) Synthetic data comprising three normally distributed clusters with varying degrees of variance (leftmost), and visualization of the Riemannian metric on two-dimensional visualizations of the data obtained with PCA, UMAP, DM (within TopOMetry), and topoMAP (TopOMetry MAP on DM diffusion potential). (B) Visualization of the Riemannian metric on two-dimensional visualizations of the PBMC3k dataset obtained with PCA, UMAP, DM (within TopOMetry), and topoMAP (TopOMetry MAP on DM diffusion potential). (C) topoMAP projections of the PBMC3k dataset, colored by the eccentricities of ellipses representing the Riemannian metric across the aforementioned visualizations for each cell. (D) Violin plots of the eccentricities of ellipses representing the Riemannian metric on the aforementioned visualizations of the PBMC3k dataset, for each cell type annotation.
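The relationship between a local metric and its ellipse can be sketched as below. The sketch assumes that each cell carries a 2x2 symmetric positive-definite matrix H and that the ellipse's squared semi-axes are proportional to the eigenvalues of H. This convention and the function name `ellipse_from_metric` are illustrative assumptions; estimating H from the data is not covered here.

```python
import numpy as np

def ellipse_from_metric(H):
    """Eccentricity and major-axis direction of the ellipse associated with a
    2x2 symmetric positive-definite matrix H (assumed convention: squared
    semi-axes proportional to the eigenvalues of H)."""
    evals, evecs = np.linalg.eigh(H)          # eigenvalues in ascending order
    ratio = np.clip(evals[0] / evals[1], 0.0, 1.0)
    eccentricity = float(np.sqrt(1.0 - ratio))
    major_axis = evecs[:, 1]                  # unit vector along the direction of largest stretch
    return eccentricity, major_axis

# No preferential direction: the ellipse is a circle with eccentricity 0.
print(ellipse_from_metric(np.eye(2))[0])
# Stretching along the first axis gives an elongated ellipse aligned with it.
print(ellipse_from_metric(np.diag([4.0, 1.0])))
```

Under this convention the eccentricity lies between 0 (a circle) and 1 (a fully degenerate ellipse), which makes it comparable across cells and projections.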

Uncovering the transcriptional diversity of T cells with TopOMetry.

(A) PCA-based UMAP projections of the PBMC68k dataset, comprising approximately 68,000 peripheral blood mononuclear cells from a healthy donor (10X Genomics), colored by clustering results obtained with the standard PCA-based approach (left) or with the diffusion potential of TopOMetry msDM eigenbasis (right). The main cell types and the populations for which the two approaches yield similar clustering results are indicated in the figure. Note that the two approaches disagree only for T cells. (B) Dotplot of the top marker gene for each cluster found with the standard PCA-based approach, ranked by logistic regression. Note how clusters corresponding to T cells present poor marker genes. (C) topoMAP projection of the PBMC68k dataset, colored by clustering results obtained with the standard PCA-based approach (left) or with the diffusion potential of TopOMetry msDM eigenbasis (right). Note that the clusters of T cells found with the former are randomly spread across the dozens of clusters found with the latter. (D) Dotplot of the top marker gene for each cluster found when clustering with TopOMetry, ranked by logistic regression with identical parameters. Note how clusters corresponding to T cells present highly specific marker genes. (E) Boxplots of the number of T cell clusters (left) found with different clustering strategies (standard PCA-based, using expression data and with TopOMetry) and their mean size (right), across four datasets of peripheral blood mononuclear cells from human donors: PBMC68k, Dengue, Lupus, and MS_CSF. (F) PCA-based UMAP and (G) topoMAP projections of lymphoid cells from the Tissue Immune Cell Atlas (TICA) dataset, comprising lymphoid cells across different organs and donors, colored by annotations from the original study (left) and clustering results obtained with TopOMetry (right). Note that the two clustering results disagree mostly for CD4+ cell clusters. (H) PCA-based UMAP and (I) topoMAP projections of the TICA dataset, colored by clonal expansion thresholds. (J) PCA-based UMAP and (K) topoMAP projections of the TICA dataset, colored by the 30 clonotypes with the highest number of cells. The arrows indicate the precise match of these clonotypes to the manifold structure uncovered by TopOMetry. (L) PCA-based UMAP and (M) topoMAP projections of the TICA dataset, colored by the epitopes recognized by each T cell. The arrows indicate populations recognizing specific epitopes that precisely match the manifold structure uncovered by TopOMetry. (N) Heatmap of T cell repertoire overlap between the clusters found in the original study, (O) using expression data and (P) TopOMetry. Note how the original clusters present high repertoire overlap, which is less evident in clusters found using expression data and nearly absent from clusters found with TopOMetry.

Schematic overview of standard and TopOMetry workflows for single-cell analysis.

(A) Schematic overview of the current standard workflow for single-cell analysis, starting from a high-dimensional matrix generated from raw sequencing data, which undergoes quality control (QC), cell filtering, and library-size normalization. Highly variable genes are selected and used to compute PCA, from which the top principal components are used to find k-nearest-neighbors (kNN) graphs and affinity matrices that are used for clustering and projection into two-dimensional visualizations. (B) Schematic overview of the TopOMetry workflow, which consists of the same default processing steps, after which kNN graphs and specialized kernels are used to learn Laplacian-type operators that consist of topological affinity matrices. The eigendecomposition of such an operator yields eigenbases that preserve the latent topological information from the original high-dimensional manifold. The eigenbasis can then be used to learn new topological graphs using the same kernels and operators, which are used for clustering or learning two-dimensional projections for visualization through graph-layout optimization techniques. The learned projections are then evaluated for the preservation of global and local structure and the amount of introduced distortion. (C) Schematic overview of the computational implementation of TopOMetry, which is centered around the TopOGraph class.

Qualitative evaluation of TopOMetry and existing methods on synthetic and toy data.

(A) Two-dimensional projections obtained with PCA, kernel PCA, and TopOMetry eigenbases for eight synthetic datasets: a circle inside a larger circle, two half moons, Gaussian distributions with equal variances, Gaussian distributions with different variances, the ‘S’ shape, the ‘swiss-roll’ shape, a uniformly sampled square and Gaussian noise. Sample points are colored by their original classes for the first four datasets and by their distribution for the fifth and sixth. Note how the eigenbases used in TopOMetry do not induce artificial clustering structure in the uniform square and the Gaussian noise, and successfully unfold the ‘S’ and ‘swiss-roll’ shapes. (B) The same projections, with samples colored by the value of the first component of the eigenbasis. Note how the first component of PCA and Kernel PCA fails to discriminate discrete submanifolds in the first two datasets and to unfold the ‘swiss-roll’ shape. (C) Three-dimensional visualization of the ‘S’ and ‘swiss-roll’ shapes, colored by the first component of each method. Note how the first component of PCA and Kernel PCA fails to discover the notoriously non-linear geometry. (D) Two-dimensional projections of the MNIST handwritten-digits dataset obtained with PCA, the eigenbases used in TopOMetry, a MAP projection of the diffusion potential of these eigenbases, UMAP on the first 100 principal components, UMAP on the data, t-SNE, PaCMAP and TriMAP. Samples (images) are colored by the digit they correspond to.

Preservation of global structure across eigenbases and projections from single-cell datasets.

(A) Global score for TopOMetry eigenbases for each of the 20 datasets used in the benchmark, with varying kernels. (B) Global score for projections obtained with t-SNE, UMAP, and PaCMAP with or without PCA preprocessing and with TopOMetry.

Transcriptional diversity of T cells in a healthy human donor.

(A) UMAP projections of the PBMC68k dataset obtained using the top 50 principal components (PCs), colored by clustering results obtained from the top 50 PCs, top 300 PCs, gene expression matrix, TopOMetry diffusion potential of the msDM eigenbasis and cell types predicted with CellTypist. (B) UMAP projections of the same data using the top 300 PCs. Note how clusters obtained with 50 PCs do not correspond to the structure uncovered using 300 PCs. (C) UMAP projections of the same data using the gene expression matrix of highly variable genes. (D) topoMAP (MAP of the diffusion potential of the msDM eigenbasis) projection of the same data. Arrows in (C) and (D) highlight agreeing clustering results when using the gene expression matrix and TopOMetry, which can be detected on the periphery of the UMAP embedding of T cells. (E) Heatmap of adjusted mutual information (AMI) scores between the clustering results obtained with each pair of approaches, including additional TopOMetry eigenbases. The AMI score indicates the agreement between two clustering results when no ground truth is known. Note how the results obtained with the high-dimensional expression data agree more with those obtained with TopOMetry than with those obtained with PCA. (F) topoMAP embedding of the same data, colored by cell subtypes predicted with CellTypist. Note how the majority of novel cell types uncovered correspond to CD4+ T and NK cells. (G) UMAP projection using the top 50 PCs, colored by gene expression of the canonical markers CD3E, CD4, CD8A, and FOXP3, and of marker genes for T cell subpopulations found with TopOMetry: MDN1, C11orf68, IL11RA, and IL21R. (H) UMAP projection using the expression data, colored as in (G), and with red circles highlighting the densely localized expression of marker genes of novel T cell subpopulations. (I) topoMAP projection of the same data, likewise colored, with red circles highlighting the densely localized expression of marker genes of novel T cell subpopulations.

Transcriptional diversity of T cells across diseases.

(A) Panel of projections of the Lupus dataset of peripheral blood mononuclear cells (PBMCs) from healthy donors and Systemic Lupus Erythematosus patients, each obtained with UMAP on the first 50 principal components (PCs), UMAP on the expression data and TopOMetry, and colored by clustering results obtained with the first 50 PCs, expression data and TopOMetry. Each row corresponds to a clustering method and each column corresponds to a projection method. (B) Panel of projections of the Dengue dataset of PBMCs from a dengue fever patient, obtained and colored as in (A). (C) Panel of projections of the MS_CSF dataset of mononuclear cells from the peripheral blood and cerebrospinal fluid (CSF) from healthy donors and multiple sclerosis patients, obtained and colored as in (A) and (B). Note how the findings observed in the PBMC68k dataset are replicated in a nearly identical fashion for all three datasets: clustering and projecting from expression data yield results that disagree with those obtained with the standard PCA-based approach and agree with those obtained with TopOMetry. The diffusion potential of the msDM eigenbasis was used for topoMAP projection and Leiden clustering in all datasets. (D) Dotplot of the top marker genes for clusters found with 50 PCs and with TopOMetry on the Lupus dataset, with three genes shown per cluster for the former and one for the latter. (E) Same as in (D), but for the Dengue dataset. (F) Same as in (D) and (E), but for the MS_CSF dataset. Note how the finding that clusters of T cells found with the standard PCA-based approach present poor marker genes is replicated for all datasets, in sharp contrast to the highly specific marker genes presented by the clusters found with TopOMetry. (G) Heatmaps of adjusted mutual information (AMI) scores indicating agreement between different clustering strategies for the Lupus, Dengue and MS_CSF datasets, including additional TopOMetry eigenbases. Clustering results obtained with expression data disagree with those obtained using PCA and agree with those obtained using TopOMetry.

Transcriptional diversity of T cell clonotypes.

(A) PCA-based UMAP projection of T cells from the TICA dataset, as in the original study, colored by expression of CD4 and CD8A. (B) topoMAP (MAP of the diffusion potential of the msDM eigenbasis) of the TICA dataset, colored by expression of CD4 and CD8A. (C) Clonotype network for T cells with a detectable TCR in this dataset, based on amino acid sequence similarity. Each dot represents an individual clonotype, and its size represents the number of cells belonging to that clonotype. Dots are colored by the clusters from the original study. Note how different clones are grouped together under the same cluster assignment. (D) Same as in (C), but with dots colored by the clusters found with TopOMetry. Note how most clonotypes belong to a single cluster assignment, without overlap. (E) PCA-based UMAP and (F) topoMAP projections of the same data, colored by clonotype modularity. Clonotype modularity was calculated using a neighborhood graph learned from the top 50 principal components (PCs) or the diffusion potential graph. Note how modularities are higher with the latter, indicating a better connection between the manifold learned from RNA-seq and the clonotypes learned from VDJ-seq. (G-J) Violin plots of comparative gene expression between cells belonging to a clone and the rest, detected akin to marker genes using the Wilcoxon test, showing gene expression of the receptor chains that define clonotypes identified by VDJ-seq.

PCA introduces linear distortion into the manifold region inhabited by T cells.

(A) Violin plots of intrinsic dimensionality (i.d.) estimates for the PBMC68k, Lupus, Dengue, and MS_CSF datasets, grouped by cell types predicted with CellTypist. Note how T cells do not present particularly high i.d. estimates. (B-F) Eigenspectrum (left) and cumulative explained variance (right) of PCA for the PBMC68k, Lupus, Dengue, MS_CSF, and TICA datasets. An ad hoc ‘elbow point’ is found for all datasets around 30 principal components (PCs), at which PCA explains less than 15% of the data covariance. Such a feature is a known hallmark of non-linear data. (G) Boxplot of eccentricities of the ellipses representing the Riemannian metric to evaluate distortions, obtained from T cells from the PBMC68k, Lupus, Dengue, and MS_CSF datasets. For all datasets, the eccentricities (an indirect measure of induced distortion) were significantly lower in the topoMAP projection when compared to UMAP on the first 50 PCs (p < 10⁻⁹⁰, two-sided Wilcoxon test).