Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability
Figures
Generation of primate embryoid bodies.
(A) Overview of the embryoid body (EB) differentiation workflow of the four primate species human (Homo sapiens), orangutan (Pongo abelii), cynomolgus (Macaca fascicularis), and rhesus (Macaca mulatta), including their phylogenetic relationship. Scale bar represents 500 µm. (B) Immunofluorescence staining of day 16 EBs using -fetoprotein (AFP), -III-tubulin, and -smooth muscle actin (-SMA). Scale bar represents 100 µm. (C) Schematic overview of the sampling and processing steps prior to 10 x scRNA-seq. (D) UMAP representation of the whole scRNA-seq dataset, integrated across all four species with Harmony. Single cells are colored by the expression of known marker genes for the three germ layers and undifferentiated cells. (E) UMAP representation, colored by assigned germ layers, split by species. Created with BioRender.com.
Comparison of embryoid body (EB) differentiation protocols using flow cytometry.
(A) Antibody combination to analyze induced pluripotent stem cells (iPSCs) and cells of the three primary germ layers in a single sample. (B) Flow cytometry gating overview using human EBs at day 7 of differentiation. 1. Gating of cell population. 2. Gating of single cell population. 3. Gating of live cell population. 4-6. Gating of cells belonging to pluripotent or germ layer populations based on the antibody combination shown in S1A. (C) Phase contrast images of orangutan EBs on day 6 of differentiation in four different culture conditions. Scale bar represents 250 μm. (D) Bar plot of pluripotency and germ layer proportions of day 7 EBs from human, orangutan, cynomolgus, and rhesus in the four different culture conditions. Created with BioRender.com.
Total number of recovered cells.
(A) Barplot of cell numbers per species and experimental batch and 10x lane.(B) Barplot of cell numbers per species and day of differentiation. (C) Barplot of cell numbers per clone.(D) Barplot of cell numbers per clone and day of differentiation.
Reference-based cell type classification.
(A) UMAP presentations colored by labels from a classification with a reference dataset of day 21 human embryoid bodies (Rhodes et al., 2022). (B) Single cell clusters in integrated data from all four species. (C) Stacked bar plot of the proportions of predicted labels across clusters obtained in the integrated dataset.
Assignment of orthologous cell types across species.
(A) Schematic overview of the pipeline to match clusters between species and assign orthologous cell types. (B) Sankey plot visualizing the intermediate steps of the cell type assignment pipeline. Each line represents a cell which are colored by their species of origin on the left and by their current cell type assignment during the annotation procedure on the right. An initial set of 118 high-resolution clusters (HRCs), 25–35 per species, was combined into 26 orthologous cell type clusters (OCCs). Similar cell type clusters were merged, and after further manual refinement, provided the basis for final orthologous cell type assignments. (C) Fraction of annotated cell types per species. (D) UMAPs for each species colored by cell type. (E) To validate our cell type assignments, we selected three marker genes per cell type that exhibit a similar expression pattern across all four species and have been reported to be specific for this cell type in both human and mouse (Appendix 1—table 1). The heatmap depicts the fraction of cells of a cell type in which the respective gene was detected for cell types present in at least three species.
Replicability of cell types across species measured by reciprocal classification.
(A) Heatmap illustrating ‘all vs all’ similarities of cell types from all four species. For each cell type pair, the similarity represents the average classification fraction obtained through reciprocal classification between each species pair. (B) Average classification fractions for cell types that are shared among each species pair. AP : astrocyte progenitor, CFib: cardiac fibroblasts, CEndo: cardiac endothelial cells, CPC: cardiac progenitor cells, EEC: early epithelial cells, EE: early ectoderm, EC: epithelial cells, Fib: fibroblasts, GPC: granule precursor cells, Hepa: hepatocytes, NCI: neural crest I, NCII: neural crest II, Neu: neurons, SMC: smooth muscle cells.
Replicability of cell types across species measured with MetaNeighbor.
(A) Heatmap illustrating ‘all vs all’ similarities of cell types from all four species. For each cell type pair, the similarity represents area under the receiver operator characteristic curve (AUROC) scores obtained with Meta Neighbor Crow et al., 2018 in unsupervised mode. (B) AUROC scores for cell types that are shared among each species pair. AP: astrocyte progenitor, CFib: cardiac fibroblasts, CEndo: cardiac endothelial cells, CPC: cardiac progenitor cells, EEC: early epithelial cells, EE: early ectoderm, EC: epithelial cells, Fib: fibroblasts, GPC: granule precursor cells, Hepa: hepatocytes, NCI: neural crest I, NCII: neural crest II, Neu: neurons, SMC: smooth muscle cells.
Cell type annotation.
(A) Bar plot of cell type fraction per species and clone.(B) Bar plot of cell type fractions per experimental batch and 10x lane. (C) Bar plot of cell type fractions per day of differentiation across different species.
Pseudotime analysis of ectoderm differentiation trajectories.
(A-D) PHATE embeddings of induced pluripotent stem cells (iPSCs) and ectodermal cells from human (A), orangutan (B), cynomolgus (C), and rhesus macaque (D). (E-F) Distribution of pseudotime values across major ectodermal lineages for each species. Box plots illustrate shifts in pseudotime distributions between day 8 (d8) and day 16 (d16) of differentiation.
Pseudotime analysis of mesoderm differentiation trajectories.
(A-D) PHATE embeddings of induced pluripotent stem cells (iPSCs) and mesodermal cells from human (A), orangutan (B), cynomolgus (C), and rhesus macaque (D). (E-F) Distribution of pseudotime values across major mesodermal lineages for each species. Box plots illustrate shifts in pseudotime distributions between day 8 (d8) and day 16 (d16) of differentiation.
Pseudotime analysis of endoderm differentiation trajectories.
(A-D) PHATE embeddings of induced pluripotent stem cells (iPSCs) and endodermal cells from human (A), orangutan (B), cynomolgus (C), and rhesus macaque (D). (E-F) Distribution of pseudotime values across major endodermal lineages for each species. Box plots illustrate shifts in pseudotime distributions between day 8 (d8) and day 16 (d16) of differentiation.
Effect of cell type specificity on expression conservation.
(A) UMAP visualizations depicting expression patterns of selected example genes: SOX10 (conserved cell type-specific expression in neural crest cells), ESRG (species-specific and cell type-specific expression in human iPSCs), and RPL22 (conserved, broad expression). (B) For each gene, expression was summarized per species and cell type as the expression fraction and binarized into ‘not expressed’/’expressed’ (black frame) based on cell type-specific thresholds. The same example genes as in (A) are shown here. iPSCs: induced pluripotent stem cells, EE: early ectoderm, NC: neural crest, SMC: smooth muscle cells, CFib: cardiac fibroblasts, EC: epithelial cells, Hepa: hepatocytes. (C) Boxplot of expression conservation of genes according to the number of different cell types in which a gene is expressed in humans (cell type specificity). (D) Boxplot of the fraction of coding sequence sites that were found to evolve under constraint based on a 43 primate phylogeny (Sullivan et al., 2023), stratified by human cell type specificity.
Characteristics of genes with different levels of cell type-specific expression.
(A) Stacked bar plot of the number of genes per cell type specificity level for different species. (B) Boxplot of expression conservation of genes with different levels of cell type specificity in orangutan, cynomolgus, and rhesus. (C) Boxplot of gene-level constraint based on primate phastCons scores Sullivan et al., 2023 for protein-coding genes. (D) Boxplot of mean expression per cell type for genes with different levels of cell type specificity. (E) Boxplot of mean expression per cell type for a subset of 236 genes per cell type specificity and species that were sampled to have a similar distribution of mean expression. (F) Boxplot of expression conservation of the same subsampled gene sets as in (E).
Evaluation of marker gene conservation.
(A) UpSet plot illustrating the overlap between species for the top 100 marker genes per cell type. (B) Heatmap showing the expression fractions of marker genes: on the left, markers shared among all species, and on the right, markers unique to the human ranking. For each cell type, one representative gene is labeled and further detailed in Figure 4—figure supplement 1. iPSCs: induced pluripotent stem cells, EE: early ectoderm, NC: neural crest, SMC: smooth muscle cells, CFib: cardiac fibroblasts, EC: epithelial cells, Hepa: hepatocytes. (C) Rank-biased overlap (RBO) analysis comparing the concordance of gene rankings per cell type for lncRNAs, protein-coding genes, and transcription factors. (D) Average F1-score for a k-nearest neighbor (kNN)-classifier trained in the human clone 29B5 to predict cell type identity based on the expression of 1–30 marker genes. Each line represents the performance in a different clone, with shaded areas indicating 95% bootstrap confidence intervals.
Expression patterns of shared and human-specific marker genes.
(A) UMAP representation per species filtered for the seven cell types that are present in all four species. (B) UMAP representations colored by the log-normalized expression of seven representative marker genes that are shared among the top 100 marker genes per cell type in all four species. (C) UMAP representations colored by the log-normalized expression of seven representative marker genes that are only present in the human top 100 marker gene ranking per cell type.
k-nearest neighbor (kNN) classification performance per cell type.
F1-score per cell type for a kNN classifier trained in the human clone 29B5 to predict cell type identity based on the expression of 1-30 protein-coding marker genes. Each line represents the performance in a different clone, colored by species identity.
k-nearest neighbor (kNN) classification performance for transcription factors and protein coding marker genes.
(A) Average F1-score for a kNN-classifier trained in the human clone 29B5 to predict cell type identity in the other clones. The classifier is trained on the expression of the top 1-30 protein-coding markers (solid lines) or transcription factor markers (dashed lines). (B) Comparison of the maximum average F1 score between transcription factors and protein coding markers for the classifications depicted in (A).
Pseudotime analysis for a differentiation trajectory towards neurons.
Single cells were first aggregated into metacells per species using SEACells (Persad et al. 2023). Pluripotent and ectoderm metacells were then integrated across all four species using Harmony and a combined pseudotime was inferred with Slingshot (Street et al. 2018), specifying iPSCs as the starting cluster. Here, lineage 3 is shown, illustrating a differentiation towards neurons. (A) PHATE embedding colored by pseudotime (Moon et al. 2019). (B) PHATE embedding colored by celltype. (C) Pseudotime distribution across the sampling timepoints (day 8 and day 16) in different species.
UMAP visualization for the Harmony-integrated dataset across all four species for the seven shared cell types, colored by cell type identity (A) and species (B).
(A) Evaluation of species mixing per cell type in the Harmony-integrated dataset, quantified by the fraction of cells with an adjusted cell-specific mixing score (cms) above 0. 05. (B) Summary of rank-biased overlap (RBO) scores per cell type to assess concordance of marker gene rankings for all species pairs.
(A) UMI-counts/ gene for the same cynomolgus macaque iPSC samples. On the x-axis the gtf file from Ensembl Macaca_fascicularis_6.0.111 was used to count and on the y-axis we used our filtered Liftoff annotation that transferred the human gene models from Gencode v32. (B) The # of DE-genes between human and cynomolgus iPSCs detected with DESeq2. In Liftoff, we counted human samples using Gencode v32 and compared it to the Liftoff annotation of the same human gene models to macFas6. In Ensembl, we use Gencode v32 for the human and Ensembl Macaca_fascicularis_6.0.111 for the Macaque. For both comparisons we subset the genes to only contain one-to-one orthologs as annotated in biomart. Up and down regulation is relative to human expression. (C) Read counts for one example gene SALL4. Here, we used in addition to the Liftoff and Ensembl annotation also transcripts derived from Nanopore cDNA sequencing of cynomolgus iPSCs. (D) Gene models for SALL4 in the space of MacFas6 and a coverage for iPSC-Prime-seq bulk RNA-sequencing.
Tables
Cell lines.
List of cell lines used for embryoid body (EB) differentiation.
| ID | Species | Sex | Publication |
|---|---|---|---|
| 29B5 | Homo sapiens | Male | Geuder et al., 2021 |
| 63Ab2.2 | Homo sapiens | Female | Geuder et al., 2021 |
| 69A1 | Pongo abelii | Male | Geuder et al., 2021 |
| 68A20 | Pongo abelii | Male | Geuder et al., 2021 |
| 82A3 | Macaca fascicularis | Female | Edenhofer et al., 2024 |
| 56B1 | Macaca fascicularis | Female | Edenhofer et al., 2024 |
| 56A1 | Macaca fascicularis | Female | |
| 87B1 | Macaca mulatta | Male | Jocher et al., 2024 |
| 83D1 | Macaca mulatta | Male | Jocher et al., 2024 |
| 83Ab1.1 | Macaca mulatta | Male | Jocher et al., 2024 |
Marker genes.
Literature review for marker genes used in human and mouse / rodents to determine a specific cell type.