Mapping single-cell atlases throughout Metazoa unravels cell type evolution

  1. Alexander J Tarashansky
  2. Jacob M Musser
  3. Margarita Khariton
  4. Pengyang Li
  5. Detlev Arendt
  6. Stephen R Quake
  7. Bo Wang  Is a corresponding author
  1. Department of Bioengineering, Stanford University, United States
  2. European Molecular Biology Laboratory, Developmental Biology Unit, Germany
  3. Centre for Organismal Studies, University of Heidelberg, Germany
  4. Department of Applied Physics, Stanford University, United States
  5. Chan Zuckerberg Biohub, United States
  6. Department of Developmental Biology, Stanford University School of Medicine, United States
7 figures and 8 additional files


Figure 1 with 1 supplement
SAMap addresses challenges in mapping cell atlases of distantly related species.

(A) Schematic showing the phylogenetic relationships among seven species analyzed. (B) Challenges in mapping single-cell transcriptomes. Gene duplications cause large numbers of homologs per gene, determined by reciprocal BLAST (cut-off: E-value <10−6), and frequent gene losses and the acquisition of new genes result in large fractions of transcriptomes lacking homology, which limits the amount of information comparable across species. (C) SAMap workflow. Homologous gene pairs initially weighted by protein sequence similarity are used to align the manifolds, low dimensional representations of the cell atlases. Gene-gene correlations calculated from the aligned manifolds are used to update the edge weights in the bipartite graph, which are then used to improve manifold alignment. (D) Mutual nearest neighborhoods improve the detection of cross-species mutual nearest neighbors by connecting cells that target one other’s within-species neighborhoods. (E) Convergence of SAMap is evaluated by the root mean square error (RMSE) of the alignment scores between mapped clusters in adjacent iterations for all 21 pairwise comparisons of the seven species.

Figure 1—figure supplement 1
Scalability of SAMap.

The runtime (A) and memory usage (B) of all mappings performed in this study are plotted versus the total number of cells from both datasets. For this study, SAMap was run on a standard desktop computer running Ubuntu 18.04, with an 8-core i7 Intel processor and 64 Gb of RAM.

Figure 2 with 1 supplement
SAMap successfully maps D. rerio and X. tropicalis atlases.

(A) UMAP projection of the combined zebrafish (yellow) and Xenopus (blue) manifolds, with example cell types circled. (B) Sankey plot summarizing the cell type mappings. Edges with alignment score <0.1 are omitted. Edges that connect developmentally distinct secretory cell types are highlighted in black, with connections across germ layers highlighted in red. (C) Heatmaps of alignment scores between developmental time points for ionocyte, forebrain/midbrain, placodal, and neural crest lineages. X-axis: Xenopus. Y-axis: zebrafish. (D) Expressions of orthologous gene pairs linked by SAMap are overlaid on the combined UMAP projection. Expressing cells are color-coded by species, with those connected across species colored cyan. Cells with no expression are shown in gray. The mapped secretory cell types are highlighted with circles. (E) SAMap alignment scores compared to those of benchmarking methods using one-to-one vertebrate orthologs as input. Each dot represents a cell type pair supported by ontogeny annotations.

Figure 2—figure supplement 1
Existing methods failed to map D. rerio and X. tropicalis atlases.

(A) UMAP projections of the integration results from SAMap using the full homology graph, compared to LIGER, BBKNN, Scanorama, Seurat, Harmony, and SAMap using 1–1 orthologs. For fair comparisons, all methods were run on the D. rerio and X. torpicalis atlases subsampled to approximately 15,000 cells to satisfy the computational constraints of Seurat and LIGER. (B) Histograms of alignment scores between individual cells.

Figure 3 with 1 supplement
SAMap reveals prevalent paralog substitutions in frog and zebrafish.

(A) Expression of orthologous (top) and paralogous (bottom) gene pairs overlaid on the combined UMAP projection. Expressing cells are color-coded by species, with those that are connected across species colored cyan. Cells with no expression are shown in gray. Paralogs are ordered by the evolutionary time when they are inferred to have duplicated. (B) Paralog substitution scores of all cell types. The substitution score counts the number of substituting paralogs that are differentially expressed in a particular cell type while normalizing for the number of differentially expressed genes in a cell type and the number of paralogs of a gene (see Materials and methods). (C) The percentage of paralogs from each phylogenetic age that were substituted for orthologs in frog or zebrafish lineages.

Figure 3—figure supplement 1
Paralog substitution analysis yields similar results using the SAMap manifold constructed from one-to-one orthologs.

Comparison of substitution rates for different paralog ages (A) and cell type substitution scores (B) calculated from the original frog-zebrafish manifold versus the manifold generated using only one-to-one orthologs. (C) Histogram showing the distribution of correlation differences for paralog substitutions specific to the original (teal) and one-to-one ortholog based analyses (orange), along with those identified in both mappings (blue). Note that the majority of substitution events, especially in the large correlation difference regime, are present in both mapped manifolds.

Figure 4 with 2 supplements
SAMap transfers cell type information from a well-annotated organism (planarian S. mediterranea) to its less-studied cousin (schistosome S. mansoni) and identifies parallel stem cell compartments.

(A) UMAP projection of the combined manifolds. Tissue type annotations are adopted from the S. mediterranea atlas (Fincher et al., 2018). The schistosome atlas was collected from juvenile worms, which we found to contain neoblasts with an abundance comparable to that of planarian neoblasts (Li et al., 2021). (B) Overlapping expressions of selected tissue-specific TFs with expressing cell types circled. (C) UMAP projection of the aligned manifolds showing planarian and schistosome stem cells, with homologous subpopulations circled. Planarian neoblast data is from Zeng et al., 2018, and cNeoblasts correspond to the Nb2 population, which are pluripotent cells that can rescue neoblast-depleted planarians in transplantation experiments. (D) Distributions of conserved TF expressions in each neoblast subpopulation. Expression values are k-nearest-neighbor averaged and standardized, with negative values set to zero. Blue: planarian; yellow: schistosome.

Figure 4—figure supplement 1
SAMap-linked gene pairs that are enriched in cell type pairs between S. mediterranea and S. mansoni.

(A) Rows: linked cell types. Schistosome cell types correspond to Leiden clusters. Columns: genes linked by SAMap with overlapping eukaryotic eggNOG orthology groups. We calculate the average standardized expression of each gene in an orthology group for its corresponding cell type in a particular pair and report the highest expression. A selected set of orthology groups corresponding to transcriptional regulators are labeled. (B) Fluorescence in situ hybridization shows the co-expression of wnt11 (Smp_156540) and a panel of muscle markers (collagen, troponin, myosin and tropomyosin) in S. mansoni juveniles. The body wall muscles are expected to be located close to the parasite surface (dashed outline). The images are maximum intensity projections constructed from ~10 confocal slices with optimal axial spacing recommended by the Zen software collected on a Zeiss LSM 800 confocal microscope using a 40× (N.A. = 1.1, working distance = 0.62 mm) water-immersion objective (LD C-Apochromat Corr M27). (C) Whole mount in situ hybridization images showing that the expression of wnt11 and frizzled (Smp_174350) are concentrated in the parasite tail (arrows) with decreasing gradients extending anteriorly. In planarian muscles, Wnt genes provide the positional cues for setting up the body plan during regeneration (Scimone et al., 2017; Reddien, 2018). The presence of an anterior-posterior expression gradient of wnt11 and frizzled in muscles of schistosome juveniles suggests that they may have similar functional roles in patterning during development.

Figure 4—figure supplement 2
Schistosome muscle progenitors express canonical muscle markers.

UMAP projections of schistosome stem cells with gene expressions overlaid. μ and μ’ cells are circled. Colormap: expression in units of log2(D+1). For visualization, expression was smoothed via nearest-neighbor averaging using SAM. Note that myod1 and cabp are expressed in both presumptive muscle progenitor populations, whereas all other markers are enriched in μ’ cells. All genes displayed are also expressed in fully differentiated muscle tissues.

Figure 5 with 2 supplements
Mapping evolutionarily distant species identifies densely connected cell type groups.

(A) Schematic illustrating edge (left) and node (right) transitivities, defined as the fraction of triads (set of three connected nodes) in closed triangles. (B) The percentage of cell type pairs that are topologically equivalent to the green edge in each illustrated motif. (C) Network graphs showing highly connected cell type families. Each node represents a cell type, color-coded by species (detailed annotations are provided in Supplementary file 7). Mapped cell types are connected with an edge. (D) Boxplot showing the median and interquartile ranges of node transitivities for highly connected cell type groups. For all box plots, the whiskers denote the maximum and minimum observations. The average node transitivity per group is compared to a bootstrapped null transitivity distribution, generated by repeatedly sampling subsets of nodes in the cell type graph and calculating their transitivities. **p<5×10−5, ***p<5×10−7. (E) Boxplot showing the median and interquartile ranges of the number of enriched gene pairs in highly connected cell type groups. All cell type connections in these groups have at least 40 enriched gene pairs (dashed line).

Figure 5—figure supplement 1
Number of enriched gene pairs are mostly independent of edge transitivity.

(A) Box plot showing the median and interquartile ranges of the number of enriched gene pairs in cell type mappings from all 21 pairwise mappings between the seven species. The whiskers denote the maximum and minimum observations. Of cell type mappings, 87% have greater than 40 enriched gene pairs (dashed line). Species acronyms are the same as in Figure 1A. (B) Top left: The edge transitivity is plotted against the number of enriched gene pairs for all cell type pairs in the connectivity graph. Dashed line: the linear best fit, with the Pearson correlation coefficient reported at the top. Top right: magnified view of the mapped cell type pairs supported by small numbers of gene pairs (<40) to show that those edges have low transitivity scores (<0.4). The sublots below show the number of enriched gene pairs and edge transitivity for individual species pairs.

Figure 5—figure supplement 2
Alignment scores are mostly independent of edge transitivity.

Top left: alignment scores and edge transitivity for all cell type pairs in the connectivity graph including the seven species. Dashed line: the linear best fit, with the Pearson correlation coefficient reported at the top. Alignment scores and edge transitivity for individual species pairs are shown in the remaining subplots.

Figure 6 with 1 supplement
SAMap identifies muscle and stem cell transcriptional signatures conserved across species.

(A) Enrichment of KOG functional annotations calculated for genes shared in contractile cell types. For each species, genes enriched in individual contractile cell types are combined. (B) Expression and enrichment of conserved muscle genes in contractile cell types. Color: mean standardized expression. Symbol size: the fraction of cells each gene is expressed in per cell type. Homologs are grouped based on overlapping eukaryotic eggNOG orthology groups. If multiple genes from a species are contained within an orthology group, the gene with highest standardized expression is shown. Genes in blue: core transcriptional program of bilaterian muscles; red: transcriptional regulators conserved throughout Metazoa. (C) Enrichment of KOG functional annotations for genes shared by stem cell types. (D) Top: boxplot showing the median and interquartile ranges of the mean standardized expressions of stem cell-enriched genes in multipotent stem cells (MSCs), lineage-committed stem cells (LSCs), and differentiated cells (DCs). MSCs include sponge archaeocytes (Musser et al., 2019), hydra interstitial stem cells (Siebert et al., 2019), planarian neoblasts cluster 0 defined in Fincher et al., 2018, schistosome ε-cells (Tarashansky et al., 2019). LSCs include sponge transition cells, hydra ecto- and endo-epithelial stem cells; planarian piwicells that cluster with differentiated tissues, and schistosome tissue-specific progenitors. Bottom: dot plot showing the mean standardized expressions of selected transcriptional regulators. The transcript IDs corresponding to each gene are listed in Supplementary file 6.

Figure 6—figure supplement 1
Phylogenetic reconstruction of animal contractile cell transcriptional regulators.

Trees depict Csrp/Crip (A) and Fox group I (B) gene families. Genes labeled red are enriched in at least one contractile gene pair identified via SAMap. Support values indicate bootstrap support from 1000 nonparametric (Csrp) or ultrafast (Fox) bootstrap replicates. Besides these two transcriptional regulators, contractile cells in all seven species were found to be also enriched for transcription factors from the C2H2 Zinc Finger, Lim Homeobox, and Paired Homeobox families, though in different cell types we found enrichment of a number of distinct orthologs. Whether this reflects an ancestral role for these transcription factor families in regulating contractility or their independent evolution will require additional taxonomic sampling and broader coverage of muscle cell diversity to resolve.

Author response image 1

Additional files

Supplementary file 1

Cell atlas metadata and cell annotations.

Metadata include the number of cells, number of transcripts in the transcriptome, median number of transcripts detected per cell, the reference transcriptome used in this study, database through which the transcriptomes are provided, technology used for constructing the cell atlases, atlas data accessions, processing notes, and references. Leiden clusters and cell type annotations are reported for cells in each atlas. The Zebrafish and Xenopus tables include both the original cell type annotations and those used in this study. D. rerio, X. tropicalis, and mouse annotations include developmental stages.
Supplementary file 2

Cell type annotations for the zebrafish-Xenopus mapping.

Correspondence between the cell type annotations provided in the original study (Briggs et al., 2018; Wagner et al., 2018) and corresponding annotations used in this study is provided for both D. rerio and X. tropicalis atlases.
Supplementary file 3

Mapping of zebrafish-Xenopus atlases with individual cell types removed.

The two highest-scoring partners are reported for each cell type in the original mapping and the mapping after its homolog was removed. The new mappings are categorized as being present in the original analysis, not present in the original analysis but connecting developmentally related cells, or neither.
Supplementary file 4

Identified paralogs with greater expression similarity than orthologs in the zebrafish-Xenopus mapping.

Each row contains a pair of vertebrate-orthologous genes and a corresponding pair of eukaryotic paralogs with higher correlation in expression compared to the orthologs, the expression correlations for ortholog and paralog pairs, the difference between their correlations, the paralogs’ last common ancestor, and the cell types in which the genes are enriched. Highlighted rows are shown in Figure 3A.
Supplementary file 5

Genes enriched in contractile cell types and invertebrate stem cells highlighted in Figure 6.

The IDs of the genes enriched in the contractile and invertebrate stem cell types are provided along with the IDs of the eggNOG orthology groups to which they belong. In cases where multiple genes from a species belonging to the same orthology group are enriched, the most differentially expressed gene is shown. The descriptions in the stem cell table are orthology annotations associated with the Spongilla genes provided in the original study (Musser et al., 2019).
Supplementary file 6

Transcript IDs corresponding to the multipotent stem cell enriched genes shown in Figure 6D.
Supplementary file 7

Cell types in the cell type families shown in Figure 5C.

For the schistosome cell types, we annotate two neural clusters, both of which express the neural marker complexin (Li et al., 2021). One of the clusters expresses the antigen SmKK7, so we label the clusters ‘Neural’ and ‘Neural_KK7’, respectively. The ‘Muscle’ population contains non-stem cells expressing troponin. The ‘Tegument_prog’ and ‘Tegument’ populations consist of cells expressing tegument progenitor and differentiated marker genes, respectively, as reported in a previous study (Wendt et al., 2018).
Transparent reporting form

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Alexander J Tarashansky
  2. Jacob M Musser
  3. Margarita Khariton
  4. Pengyang Li
  5. Detlev Arendt
  6. Stephen R Quake
  7. Bo Wang
Mapping single-cell atlases throughout Metazoa unravels cell type evolution
eLife 10:e66747.