Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability

eLife Assessment

The authors make an important contribution to comparative functional genomics by developing a semi-automated computational pipeline that integrates classification and marker-based cluster annotation to identify orthologous cell types. Using a single-cell RNA-seq dataset of induced pluripotent stem cells and derived embryoid bodies from four primate species (humans, orangutans, cynomolgus macaques, and rhesus macaques), the authors provide convincing evidence that cell type-specific marker genes are substantially less transferable across species than broadly expressed genes, with transferability declining as phylogenetic distance increases. This study establishes a key framework and reference dataset for comparative single-cell analyses and encourages more rigorous evaluation of marker gene transferability across species.

https://doi.org/10.7554/eLife.105398.3.sa0

Significance of the findings:

Important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence:

Convincing: Appropriate and validated methodology in line with current state-of-the-art

Abstract

The identification of cell types remains a major challenge. Even after a decade of single-cell RNA sequencing (scRNA-seq), reasonable cell type annotations almost always include manual non-automated steps. The identification of orthologous cell types across species complicates matters even more, but at the same time strengthens the confidence in the assignment. Here, we generate and analyze a dataset consisting of embryoid bodies (EBs) derived from induced pluripotent stem cells (iPSCs) of four primate species: humans, orangutans, cynomolgus, and rhesus macaques. This kind of data includes a continuum of developmental cell types, multiple batch effects (i.e. species and individuals) and uneven cell type compositions and hence poses many challenges. We developed a semi-automated computational pipeline combining classification and marker-based cluster annotation to identify orthologous cell types across primates. This approach enabled the investigation of cross-species conservation of gene expression. Consistent with previous studies, our data confirm that broadly expressed genes are more conserved than cell type-specific genes, raising the question of how conserved, inherently cell type-specific, marker genes are. Our analyses reveal that human marker genes are less effective in macaques and vice versa, highlighting the limited transferability of markers across species. Overall, our study advances the identification of orthologous cell types across species, provides a well-curated cell type reference for future in vitro studies and informs the transferability of marker genes across species.

Introduction

Cell types are a central concept in biology but, like other concepts such as species, are difficult to identify in practice. Theoretically, one would consider all stable, irreversible states on a directed developmental trajectory as cell types. In practice, we are limited by our experimental possibilities. Historically, cell type definitions hinged on observations of cell morphology in a tissue context, which was later combined with immunofluorescence analyses of marker genes (Bakken et al., 2017). A lot of the functional knowledge that we have about cell types today is based on such visual and marker-based cell type definitions. With single-cell sequencing, our capabilities to characterize and identify new cell types have radically changed (The Tabula Muris Consortium et al., 2018; Regev et al., 2017). Clustering cells by their expression profiles enables a more systematic and higher-resolution identification of groups of cells that are then interpreted as cell types. However, distinguishing them from cell states or technical artifacts is not straightforward. A key criterion for defining a true cell type is its reproducibility across experiments, individuals, or even species.

Hence, identifying the same, i.e., orthologous, cell types across individuals and species is crucial. There are three principal strategies to match cell types from scRNA-seq data. (1) One is to integrate all cells prior to performing a cell type assignment on a shared embedding (Song et al., 2023). (2) The second approach is to consider cell types from one species as the reference and transfer these annotations to the other species using classification methods (Liu et al., 2023). (3) The third strategy is to assign clusters and match them across species, which has the advantage of not requiring data integration of multiple species or an annotated reference (Castro-Mondragon et al., 2022; Bakken et al., 2021; Suresh et al., 2023).

Furthermore, established marker genes are still heavily used to validate and interpret clusters identified by scRNA-seq data (Zhang et al., 2019b; Guo and Li, 2021; Ianevski et al., 2022). Together with newly identified transcriptomic markers for human and mouse, they are collected in databases (Franzén et al., 2019; Zhang et al., 2019a) and provide the basis for follow-up studies using spatial transcriptomics and/or immunofluorescence approaches. However, previous studies have shown that the same cell types may be defined by different marker genes in different species (Hodge et al., 2019; Bakken et al., 2021). For example, Krienen et al., 2020 found that only a modest fraction of interneuron subtype-specific genes overlapped between primates and even less between primate and rodent species.

To better understand how gene expression in general and the expression of marker genes in particular evolve across closely related species, we used induced pluripotent stem cells (iPSCs) and their derived cell types from humans and non-human primates (NHP). One fairly straightforward way to obtain diverse cell types from iPSCs is to generate embryoid bodies (EBs). EBs are the simplest type of iPSC-derived organoids; they contain a dynamic mix of cell types from all three germ layers and result from spontaneous differentiation upon withdrawal of key pluripotency factors (Brickman and Serup, 2017; Itskovitz-Eldor et al., 2000; Rhodes et al., 2022; Guo et al., 2019; Han et al., 2018).

EBs and brain organoids from humans and chimpanzees have, for example, been used to infer human-specific gene regulation in brain organoids (Kanton et al., 2019) or to investigate mechanisms of gene expression evolution (Barr et al., 2023).

Here, we explore to what extent levels of cell type specificity of marker genes are conserved in primates. We generated scRNA-seq data of 8- and 16-day-old EBs from human, orangutan (Pongo abelii), cynomolgus (Macaca fascicularis), and rhesus macaque (Macaca mulatta) iPSCs. Using these data, we established an analysis pipeline to identify and assign orthologous cell types. With this annotation, we provide a well-curated cell type reference for in vitro studies of early primate development. Moreover, it allowed us to assess the cell type specificity and expression conservation of genes across species. We find that even though the cell type-specificity of a marker gene remains similar across species, its discriminatory power still decreases with phylogenetic distance.

Results

Generation of embryoid bodies from iPSCs of different primate species

We generated EBs from iPSCs across multiple primate species: two human iPSC clones (from two individuals), two orangutan clones (from one individual), three cynomolgus clones (from two individuals), and three rhesus clones (from one individual) (Geuder et al., 2021; Jocher et al., 2024; Edenhofer et al., 2024). To optimize conditions for generating a sufficient number of cells from all three germ layers across these four species, we tested combinations of two culturing media (‘EB-medium’ and ‘DFK20,’ see Methods) and two EB-differentiation conditions (‘single-cell seeding’ and ‘clump seeding,’ see Methods). After 7 days of differentiation, germ layer composition was analyzed by flow cytometry (Figure 1—figure supplement 1A, B and C). Among the four tested protocols, culture in DFK20 medium with clump seeding resulted in the most balanced representation of all germ layers, yielding a substantial number of cells from each layer across all species (Figure 1—figure supplement 1D).

Under these conditions, we established an EB formation protocol based on 8 days of floating culture in dishes, followed by 8 days of attached culture (Figure 1A). This results in the formation of cells from all three germ layers, as confirmed by immunofluorescence staining for AFP (endoderm), β-III-tubulin (ectoderm) and α-SMA (mesoderm) (Figure 1B). To generate scRNA-seq data, we dissociated 8- or 16-day-old EBs into single cells and pooled cells from all four species to minimize batch effects (Figure 1C). We performed the experiment in three independent replicates, generating a total of four lanes and six lanes of 10x Genomics scRNA-seq at day 8 and day 16, respectively (Figure 1—figure supplement 2A). This resulted in a dataset comprising over 85,000 cells after filtering and doublet removal, distributed fairly equally over time points, species, and clones (Figure 1—figure supplement 2B–D).

Figure 1. Generation of primate embryoid bodies.

(A) Overview of the embryoid body (EB) differentiation workflow of the four primate species human (*Homo sapiens*), orangutan (*Pongo abelii*), cynomolgus (*Macaca fascicularis*), and rhesus (*Macaca mulatta*), including their phylogenetic relationship. Scale bar represents 500 µm. (B) Immunofluorescence staining of day 16 EBs using α-fetoprotein (AFP), β-III-tubulin, and α-smooth muscle actin (α-SMA). Scale bar represents 100 µm. (C) Schematic overview of the sampling and processing steps prior to 10x scRNA-seq. (D) UMAP representation of the whole scRNA-seq dataset, integrated across all four species with Harmony. Single cells are colored by the expression of known marker genes for the three germ layers and undifferentiated cells. (E) UMAP representation, colored by assigned germ layers, split by species. Created with BioRender.com.

In agreement with the immunofluorescence staining, we detected well-established marker genes of pluripotent cells and of all three germ layers (Ludwig et al., 2023) in the scRNA-seq data: SOX2, SOX10, and STMN4 expression was used to label ectodermal cells, APOA1 and EPCAM for endodermal cells, COL1A1 and ACTA2 (α-SMA) for mesodermal cells, and POU5F1 and NANOG for pluripotent cells (Figure 1D). Expression of these marker genes corresponded well with a classification based on a published scRNA-seq dataset from 21-day-old human EBs (Rhodes et al., 2022). This initial, rough germ layer assignment shows that our differentiation protocol generates EBs with the expected germ layers and cell type diversity from all four species (Figure 1E, Figure 1—figure supplement 3A).

Assignment of orthologous cell types

Many integration methods encounter difficulties when they are applied to data from multiple species and uneven cell type compositions (Song et al., 2023). Indeed, when comparing clusters derived from an integrated embedding across all species (Hie et al., 2019; Korsunsky et al., 2019) to the aforementioned preliminary cell type assignments, we observed signs of overfitting. For instance, a cluster predominantly containing cells classified as neurons in humans, cynomolgus, and rhesus macaques consisted mainly of early ectoderm and mesoderm cells in orangutans (Figure 1—figure supplement 3B and C). To address this issue, we developed an approach that assigns orthologous cell types without a common embedding space in an interactive shiny app (https://shiny.bio.lmu.de/Cross_Species_CellType/; Figure 2A and B):

Figure 2. Assignment of orthologous cell types across species.

(A) Schematic overview of the pipeline to match clusters between species and assign orthologous cell types. (B) Sankey plot visualizing the intermediate steps of the cell type assignment pipeline. Each line represents a cell, colored by its species of origin on the left and by its current cell type assignment during the annotation procedure on the right. An initial set of 118 high-resolution clusters (HRCs), 25–35 per species, was combined into 26 orthologous cell type clusters (OCCs). Similar cell type clusters were merged, and after further manual refinement, provided the basis for final orthologous cell type assignments. (C) Fraction of annotated cell types per species. (D) UMAPs for each species colored by cell type. (E) To validate our cell type assignments, we selected three marker genes per cell type that exhibit a similar expression pattern across all four species and have been reported to be specific for this cell type in both human and mouse (Appendix 1—table 1). The heatmap depicts the fraction of cells of a cell type in which the respective gene was detected for cell types present in at least three species.

First, we assign cells to clusters separately for each species. To avoid losing rare cell types, we aim to obtain at least twice as many high-resolution clusters (HRCs) per species as expected cell types. We then use the HRCs of one species as a reference to classify the cells of the other species using SingleR (Aran et al., 2019). These pair-wise comparisons are done reciprocally for each species and, via a cross-validation approach, also within each species (see Methods). For each comparison, we average the two values for the fraction of cells annotated as the other HRC. For example, a perfect ‘reciprocal best-hit’ between HRC-A in human and HRC-B in rhesus would have all cells of HRC-B assigned to HRC-A when using the human as a reference and reciprocally all cells in HRC-A assigned to HRC-B when using the rhesus as a reference. Next, we use the resulting distance matrix as input for hierarchical clustering to find orthologous clusters across species and merge similar clusters within species. Here, the user can choose and adjust the final cell type cluster number. This allows us to identify orthologous cell type clusters (OCCs) across all four species, while retaining species-specific clusters when no matching cluster is identified.
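
The reciprocal scoring and clustering step can be illustrated with a minimal Python sketch. It is not the pipeline's implementation: the toy fractions, the cluster names, the conversion of the averaged score to a distance as 1 minus the score, and the choice of average linkage are all assumptions made for illustration.

```python
from itertools import combinations


def reciprocal_similarity(frac, a, b):
    """Average the two directed classification fractions between clusters a and b."""
    return 0.5 * (frac.get((a, b), 0.0) + frac.get((b, a), 0.0))


def average_linkage(names, dist, n_groups):
    """Naive agglomerative clustering (average linkage) down to n_groups groups."""
    groups = [[n] for n in names]
    while len(groups) > n_groups:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            d = sum(dist[frozenset((x, y))] for x in groups[i] for y in groups[j])
            d /= len(groups[i]) * len(groups[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return [sorted(g) for g in groups]


# Hypothetical toy input: frac[(query, reference)] is the fraction of cells of
# the query cluster that a classifier assigned to the reference cluster.
frac = {
    ("human_A", "rhesus_B"): 0.9, ("rhesus_B", "human_A"): 0.8,
    ("human_A", "rhesus_C"): 0.1, ("rhesus_C", "human_A"): 0.0,
    ("human_D", "rhesus_C"): 0.7, ("rhesus_C", "human_D"): 0.9,
    ("human_D", "rhesus_B"): 0.0, ("rhesus_B", "human_D"): 0.2,
    ("human_A", "human_D"): 0.05, ("human_D", "human_A"): 0.05,
    ("rhesus_B", "rhesus_C"): 0.1, ("rhesus_C", "rhesus_B"): 0.1,
}
names = ["human_A", "human_D", "rhesus_B", "rhesus_C"]

# Assumption: distance = 1 - averaged reciprocal fraction.
dist = {frozenset((a, b)): 1.0 - reciprocal_similarity(frac, a, b)
        for a, b in combinations(names, 2)}

occs = average_linkage(names, dist, n_groups=2)
print(occs)
```

In this toy case, the two reciprocal best hits end up in separate groups, mirroring how matched HRCs would be combined into OCCs; `n_groups` plays the role of the user-chosen final cluster number.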

In the last steps, OCCs are manually further refined by merging neighboring OCCs with similar marker gene and transcriptome profiles (see Methods). To avoid bias, we first identify marker genes independently for each species solely based on scRNA-seq expression data (Hao et al., 2021). We then intersect those lists to identify the top-ranking marker genes with consistently good specificity across all species. The final set of conserved marker genes then serves us to derive cell type labels by searching the literature as well as databases of known marker genes (Figure 2E). If the marker-gene-based cell type assignment reveals cluster inconsistencies, they can be marked for further splitting. This feature is of particular importance for rare cell types. For example, we separated a cluster of early progenitor cells into iPSCs, cardiac progenitors, and early epithelial cells.

Suresh et al., 2023 devised a conceptually similar approach to ours to identify orthologous cell types across species. The main difference is that they used scores from MetaNeighbor (Crow et al., 2018), whereas we use SingleR to measure distances between HRCs. In essence, however, both scores are based on rank correlations, and hence it may not be surprising that both scoring systems yield consistent cluster groupings that show high replicability across species. Nevertheless, using our SingleR-based scores to compare OCCs across species may yield more clearly defined correspondences compared to MetaNeighbor scores (Figure 2—figure supplements 1 and 2).

Overall, we are confident that our approach yields meaningful orthologous cell type assignments, without requiring a prior annotation per species or a reference dataset. Moreover, the necessary fine-tuning of the cell type clusters by the expert user is facilitated by an interactive app.

Many cell types are shared between day 8 and day 16 EBs

Using the strategy described in the previous section, we detected a total of 15 reproducible cell types from the three germ layers, all of which were detected in at least three cell lines in three independent replicates. Among these, we identified four cell types that represent the latest time points along ectodermal developmental lineages (astrocyte progenitor, granule precursor, neurons, neural crest II), four that represent the latest time points along mesodermal lineages (fibroblasts, smooth muscle cells, cardiac endothelial cells, cardiac fibroblasts), and two that represent the latest detected time points along endodermal lineages (epithelial cells, hepatocytes). Many of these cell types were present at both sampling times (Figure 2—figure supplement 3C). The most notable exception is that orangutan EBs lost the majority of ectodermal cells at the later time point. Aside from this technical deviation—likely caused by the additional handling step (see previous chapter)—some more differentiated cell types only appear at day 16 at appreciable frequencies. This is most pronounced for smooth muscle cells in all species, but also holds for neuron-like cells in humans. Overall, this leads to an increase in the observed cell type diversity over time.

To further evaluate differences between the two sampling time points, we performed pseudotime analyses (Street et al., 2018) on the experiments integrated per species and germ layer, defining iPSCs as the origin and the differentiated cell types listed above as the endpoints of the developmental trajectories (Figure 2—figure supplements 4–6). As expected, day 16 cells generally occupy later positions along the trajectories than day 8 cells, yet the distributions overlap: iPSCs and precursor states, such as early ectoderm, are still detectable, albeit at lower frequency, in the day 16 EBs. Still, the few states that are confined to one of the two time points improve cross-species comparability when both are considered jointly. Integrating day 8 and day 16 increased the overlap in detected cell types between species; for example, human neural cells were only observed at day 16, whereas they were already present at day 8 in macaques. We therefore used the combined data from both time points for downstream analyses.

Overall, 9 of the 15 cell types were detected in at least 3 species, and 7 cell types were reproducibly detected in all four species (Figure 2C and D; Figure 2—figure supplement 3). These 7 cell types consisted of iPSCs, two ectodermal cell types (early ectoderm and neural crest), two cell types of mesodermal origin (smooth muscle cells and cardiac fibroblasts), and two endodermal cell types (epithelial cells and hepatocytes) (Figure 2C and E). They are used for the analysis of pleiotropy and marker genes in the remainder of this manuscript.

Cell type-specific genes have less conserved expression levels

Based on the premise that it is not necessarily the expression level, but rather the expression breadth that determines expression conservation (Duret and Mouchiroud, 2000), we developed a method to call a gene ‘expressed’ or not that considers the expression variance across the cells of one type, which we then used to score cell type-specificity and expression conservation (Figure 3B; see Methods).

Figure 3. Effect of cell type specificity on expression conservation.

(A) UMAP visualizations depicting expression patterns of selected example genes: *SOX10* (conserved cell type-specific expression in neural crest cells), *ESRG* (species-specific and cell type-specific expression in human iPSCs), and *RPL22* (conserved, broad expression). (B) For each gene, expression was summarized per species and cell type as the expression fraction and binarized into ‘not expressed’/‘expressed’ (black frame) based on cell type-specific thresholds. The same example genes as in (A) are shown here. iPSCs: induced pluripotent stem cells, EE: early ectoderm, NC: neural crest, SMC: smooth muscle cells, CFib: cardiac fibroblasts, EC: epithelial cells, Hepa: hepatocytes. (C) Boxplot of expression conservation of genes according to the number of different cell types in which a gene is expressed in humans (cell type specificity). (D) Boxplot of the fraction of coding sequence sites that were found to evolve under constraint based on a 43-primate phylogeny (Sullivan et al., 2023), stratified by human cell type specificity.

For example, we find that the neural crest marker SOX10 (Mollaaghababa and Pavan, 2003) is cell type-specific and conserved, the lncRNA ESRG is iPSC- and human-specific; in contrast, RPL22, a gene that encodes a protein of the large ribosomal subunit, is broadly expressed and conserved (Figure 3A). Overall, we find on average ∼15% of genes to be cell type-specific, i.e., our score determined them to be expressed in only one cell type, while ∼40% of genes were found to be broadly expressed in all seven cell types (Figure 3—figure supplement 1A).
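
As a rough illustration of how binarized expression calls translate into a cell type-specificity score, the following Python sketch counts the cell types in which a gene passes a per-cell-type threshold. The expression fractions and the threshold values are invented for illustration, and the way thresholds are derived in the actual method (see Methods) is not reproduced here.

```python
def cell_type_specificity(fractions, thresholds):
    """Count the cell types in which a gene is called 'expressed'.

    fractions:  dict cell type -> fraction of cells in which the gene is detected
    thresholds: dict cell type -> hypothetical cell type-specific cutoff
    """
    return sum(fractions[ct] >= thresholds[ct] for ct in thresholds)


# Hypothetical thresholds and expression fractions for three cell types.
thresholds = {"iPSCs": 0.10, "NC": 0.10, "SMC": 0.10}
genes = {
    "SOX10": {"iPSCs": 0.01, "NC": 0.80, "SMC": 0.02},  # cell type-specific
    "RPL22": {"iPSCs": 0.95, "NC": 0.90, "SMC": 0.97},  # broadly expressed
}

specificity = {g: cell_type_specificity(f, thresholds) for g, f in genes.items()}
print(specificity)
```

A score of 1 corresponds to a cell type-specific gene in the sense used above, whereas a score equal to the number of cell types corresponds to broad expression.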

Additionally, we obtained a measure of expression conservation, which quantifies the consistency of the cell type expression score across species. We found that broadly expressed genes present in all cell types exhibited high expression conservation, whereas cell type-specific genes tended to be more species-specific (Figure 3C; Figure 3—figure supplement 1B).

Unsurprisingly, broadly expressed genes also showed higher average expression levels (Kliesmete et al., 2024; Figure 3—figure supplement 1D). To ensure that the observed relationship between expression breadth and conservation in our data is not solely due to expression level differences, we sub-sampled genes from all cell type-specificity levels for comparable mean expression. This did not change the pattern: even broadly expressed genes with a low mean expression level are highly conserved across species (Figure 3—figure supplement 1E and F). Moreover, the coding sequences of broadly expressed genes show higher levels of constraint than those of more cell type-specific genes, thus supporting the notion that the higher conservation of the expression pattern that we observed here is due to evolutionarily stable functional constraints on this set of genes (Figure 3D; Figure 3—figure supplement 1C).

Marker gene conservation

Building on our previous observation that cell type-specific genes are less conserved across species, we investigated the conservation and transferability of marker genes, which are, by definition, cell type-specific, in greater detail. To this end, we call marker genes for all cell types and species, using a combination of differential expression analysis and a quantile rank-score based test for differential distribution detection (Ling et al., 2021). Additionally, we define a good marker gene as one that is upregulated and expressed in a higher fraction of cells compared to the rest. To prioritize marker genes, we rank them based on the difference in the detection fraction: the proportion of cells of a given type in which a gene is detected compared to its detection rate in all other cells.
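
The ranking criterion described above, the difference in detection fraction, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the count values are invented, a gene counts as detected when its count is above zero, and the differential expression and differential distribution tests that precede the ranking are omitted.

```python
def detection_fraction(counts):
    """Fraction of cells in which the gene has at least one count."""
    return sum(1 for c in counts if c > 0) / len(counts)


def rank_markers(expr, labels, cell_type):
    """Rank genes by detection-fraction difference for one cell type.

    expr:   dict gene -> list of counts per cell (same cell order as labels)
    labels: list of cell type labels, one per cell
    """
    scores = {}
    for gene, counts in expr.items():
        inside = [c for c, lab in zip(counts, labels) if lab == cell_type]
        outside = [c for c, lab in zip(counts, labels) if lab != cell_type]
        scores[gene] = detection_fraction(inside) - detection_fraction(outside)
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical toy counts for six cells (three neural crest, three smooth muscle).
labels = ["NC", "NC", "NC", "SMC", "SMC", "SMC"]
expr = {
    "SOX10": [3, 1, 2, 0, 0, 0],  # detected only in NC cells
    "ACTA2": [0, 0, 1, 4, 2, 5],  # detected mostly in SMC cells
    "RPL22": [5, 6, 4, 5, 7, 6],  # detected in every cell
}

ranking, scores = rank_markers(expr, labels, "NC")
print(ranking)
```

Because the score depends only on presence or absence, a broadly detected gene such as RPL22 in this toy example scores zero regardless of its expression level.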

We found a low overlap of top marker genes among species, with a median of 15 of the top 100-ranked marker genes per cell type shared across all four species, while a larger proportion of markers was unique to individual species (Figure 4A). Notably, these species-specific markers often exhibited cell type-specific expression in only one species, with reduced or non-specific expression in others (Figure 4B; Figure 4—figure supplement 1).

Figure 4. Evaluation of marker gene conservation.

(A) UpSet plot illustrating the overlap between species for the top 100 marker genes per cell type. (B) Heatmap showing the expression fractions of marker genes: on the left, markers shared among all species, and on the right, markers unique to the human ranking. For each cell type, one representative gene is labeled and further detailed in Figure 4—figure supplement 1. iPSCs: induced pluripotent stem cells, EE: early ectoderm, NC: neural crest, SMC: smooth muscle cells, CFib: cardiac fibroblasts, EC: epithelial cells, Hepa: hepatocytes. (C) Rank-biased overlap (RBO) analysis comparing the concordance of gene rankings per cell type for lncRNAs, protein-coding genes, and transcription factors. (D) Average F1-score for a k-nearest neighbor (kNN)-classifier trained in the human clone 29B5 to predict cell type identity based on the expression of 1–30 marker genes. Each line represents the performance in a different clone, with shaded areas indicating 95% bootstrap confidence intervals.

Given the special role of transcriptional regulators for the definition of a cell type (Arendt et al., 2016) and the differences in conservation between protein-coding and non-coding RNAs (Johnsson et al., 2014), we analyzed the comparability of marker genes of different types. To this end, we assessed the concordance of the top 100 marker genes across species for protein-coding genes, lncRNAs, transcription factors (TFs), or all genes using rank-biased overlap (RBO) scores (Webber et al., 2010). We find that marker genes that are TFs have the highest concordance between species and that the two macaque species, which are also phylogenetically most similar, are also most similar in their ranked marker gene lists. In contrast, lncRNA markers show the lowest overlap between species. In fact, their cross-species conservation is so low that they also significantly reduce the performance if they are included together with protein-coding markers (Figure 4C).
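
For readers unfamiliar with rank-biased overlap, a minimal Python sketch of its truncated form is given below. The persistence parameter `p`, the gene lists, and the use of the plain truncated sum (rather than an extrapolated variant) are assumptions for illustration; the text does not specify which variant or parameter value was used.

```python
def rbo_truncated(s, t, p=0.9):
    """Truncated rank-biased overlap of two ranked lists without duplicates.

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the fraction
    of shared items among the top d entries of both lists and k is the length
    of the shorter list.
    """
    k = min(len(s), len(t))
    total = 0.0
    for d in range(1, k + 1):
        agreement = len(set(s[:d]) & set(t[:d])) / d
        total += p ** (d - 1) * agreement
    return (1 - p) * total


# Hypothetical ranked marker lists for the same cell type in two species.
species_1 = ["G1", "G2", "G3"]
species_2 = ["G2", "G1", "G3"]

same = rbo_truncated(species_1, species_1, p=0.5)
swapped = rbo_truncated(species_1, species_2, p=0.5)
disjoint = rbo_truncated(species_1, ["G7", "G8", "G9"], p=0.5)
print(same, swapped, disjoint)
```

Agreement near the top of the lists is weighted most strongly. Note that the truncated sum for identical lists of length k equals 1 − p^k rather than 1, which is why extrapolated or normalized variants are often preferred for short lists.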

To properly evaluate the performance of marker genes, it is essential to consider their ability to differentiate between cell types. This discriminatory power ultimately determines how well marker genes perform in cell type classification within and across species. To this end, we trained a k-nearest neighbors (kNN) classifier on varying numbers of marker genes per cell type in one human clone (29B5) and evaluated prediction performance using the average F1 score across cell types (Figure 4—figure supplement 2). Again, we analyzed markers from a set of all protein-coding genes and TFs only and found that even though TFs appear to be more conserved across species, they do not discriminate cell types as well as the top protein-coding markers (Figure 4—figure supplement 3). Using only protein-coding marker genes determined in 29B5 to classify the other human clone, we achieve good discriminatory power (F1 score > 0.9) with only 11 marker genes per cell type. In contrast, the classification performance for clones from the other species was substantially lower, failing to reach the performance levels observed in human clones even when using up to 30 marker genes (Figure 4D).
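
The evaluation logic, training a kNN classifier on marker expression in one clone and scoring predictions in other clones by the average F1 score, can be sketched as follows. The expression values, the two-marker feature space, the Euclidean distance, and `k=3` are hypothetical choices for illustration and do not reflect the settings used in the study.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training cells."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def macro_f1(truth, pred):
    """Average the per-cell-type F1 scores."""
    f1s = []
    for label in sorted(set(truth)):
        tp = sum(t == label and p == label for t, p in zip(truth, pred))
        fp = sum(t != label and p == label for t, p in zip(truth, pred))
        fn = sum(t == label and p != label for t, p in zip(truth, pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)


# Hypothetical expression of two marker genes (one per cell type) in the training clone.
train_x = [(5, 0), (4, 1), (6, 0), (0, 5), (1, 4), (0, 6)]
train_y = ["NC", "NC", "NC", "SMC", "SMC", "SMC"]

# Hypothetical test cells: a clone of the same species and a clone of another
# species in which the first marker has partly lost its specificity.
truth = ["NC", "NC", "SMC", "SMC"]
same_species = [(5, 1), (6, 1), (0, 4), (1, 6)]
other_species = [(1, 3), (4, 1), (0, 5), (1, 5)]

f1_same = macro_f1(truth, [knn_predict(train_x, train_y, q) for q in same_species])
f1_other = macro_f1(truth, [knn_predict(train_x, train_y, q) for q in other_species])
print(f1_same, f1_other)
```

In this toy setting, the classifier separates cell types perfectly within the species but misclassifies a cell from the other species whose marker expression has shifted, which is the kind of drop in discriminatory power that the average F1 score is meant to capture.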

In summary, we find that lncRNA marker genes have low transferability between species, while protein-coding markers do reasonably well. However, the predictive value of marker genes decreases with increasing phylogenetic distance, requiring longer marker gene lists to achieve accurate cell type classification for more distantly related species.

Discussion

An essential criterion for a true cell type is reproducibility across experiments, individuals, or even species. This raises the question of how to reliably identify reproducible cell types across species. When cell types are annotated separately for each species, their reproducibility can be evaluated based on transcriptomic similarity (Crow et al., 2018; Wang et al., 2021). If integration-based methods are used to accomplish this task (Barr et al., 2023; Bakken et al., 2021), reproducibility not only depends on the similarity of the expression profiles but also on cell type composition. Integration works best when the cell type compositions are as similar as possible across experiments. This, however, is not the case for organoids, which often have highly heterogeneous cell type compositions (He et al., 2023), and our EB data are no exception. Moreover, integration methods struggle with large and variable batch effects, which are expected due to the varying phylogenetic distances across species (Song et al., 2023). In contrast, classification methods, such as SingleR (Aran et al., 2019), rely mainly on the similarity to a reference profile, which makes them less vulnerable to cell type composition and batch effects. Hence, in our pipeline to identify orthologous cell types, we mainly rely on classification. We start with an unsupervised approach in that we identify cell clusters and then ensure reproducibility as well as comparability using a supervised approach with reciprocal classification of clusters across all species pairs.

Defining cell types in a developmental dataset is particularly challenging, and we do not believe that there is one perfect solution that would fit all cell types and samples. Therefore, we rely on an interactive approach that we implemented in a shiny app (https://shiny.bio.lmu.de/Cross_Species_CellType/) to facilitate the flexible choice of parameters for cluster matching, merging and inspection by visualizing marker genes. Suresh et al., 2023 employed a similar approach that also requires several manual parameter choices, which makes a formal comparison difficult. Generally, both methods seem to agree well on the orthology assignments of cell type clusters (Figure 2—figure supplements 1 and 2). MetaNeighbor, as used by Suresh et al., 2023, provides a more quantitative and potentially more sensitive framework for assessing cross-species cell type relationships. However, this higher sensitivity may also make it more affected by data with a lower signal-to-noise ratio, such as our developmental time series.

Hence, the carefully annotated dataset presented here can serve as a valuable resource for future research. Non-human primate iPSCs are central to many studies focusing on evolutionary comparisons, and the pool of iPSC lines for these purposes is expected to grow, incorporating more species and individuals. In this context, the transcriptomic data we generated offer a reference dataset that can be used to verify the pluripotency and differentiation potential of non-human primate iPSC lines by examining gene expression during EB formation.

The set of shared cell types between all four primate species allowed us to evaluate the conservation and transferability of marker genes between species. To begin with, marker genes are by definition cell type-specific, and also with this dataset, we can show that they are less conserved than broadly expressed genes. Expression breadth can be interpreted as a sign of pleiotropy and hence higher functional constraint (Hastings, 1996; Duret and Mouchiroud, 2000). Conversely, we expect cell type-specific marker genes to be among the least conserved genes. Indeed, we and others find that the overlap of marker genes across species is limited (Hodge et al., 2019; Krienen et al., 2020; Bakken et al., 2021; Feng et al., 2022). Moreover, conservation varies significantly across gene biotypes. On the one hand, lncRNAs, which are often highly cell type-specific, exhibit lower cross-species conservation. Their low sequence conservation further complicates their utility for comparative studies (Johnsson et al., 2014). On the other hand, TFs, which have been proposed as central elements of a Core Regulatory Complex (CoRC) that defines cell type identity (Arendt et al., 2016), are among the most conserved markers across species. However, the power to distinguish cell types based solely on the expression of TF markers remains lower than when markers are selected from the broader set of all protein-coding genes (Figure 4—figure supplement 3). Even though within species, a handful of marker genes can achieve remarkable accuracy, their discriminatory power remains lower for other species. Thus, whole transcriptome profiles offer a more comprehensive approach to cross-species cell type classification for single-cell data.

That said, marker genes remain fundamental to most current cell type annotations. Moreover, marker genes will continue to be used to match cell types across modalities, for example, to validate cell type properties in experiments based on immunofluorescence of individual markers or on gene panels as used in spatial transcriptomics (Benito-Kwiecinski et al., 2021; Gulati et al., 2025). To this end, we have refined the ranking of marker genes beyond differential expression analysis to focus on consistent differences in detection rate. Markers identified in this way are expected to translate better into protein-based validations than markers defined based on expression levels, due to the discrepancy between mRNA and protein expression (Pascal et al., 2008). Furthermore, the presence-absence signal is more robust against cross-species fluctuations in gene expression than measures based on expression level differences.

In conclusion, we present a robust reference dataset for early primate development alongside tools to identify and evaluate orthologous cell types. Our findings emphasize the need for caution when transferring marker genes for cell type annotation and characterization in cross-species studies.

Materials and methods

Cell lines

We used 10 iPSC lines that were all generated in-house and have already been published (Table 1). Absence of Sendai virus was confirmed by RT-PCR, and all lines are mycoplasma-free. Cell lines were authenticated using SNP panels that were established using RNA-seq data (Jocher et al., 2024).

Table 1

Cell lines.

List of cell lines used for embryoid body (EB) differentiation.

ID	Species	Sex	Publication
29B5	Homo sapiens	Male	Geuder et al., 2021
63Ab2.2	Homo sapiens	Female	Geuder et al., 2021
69A1	Pongo abelii	Male	Geuder et al., 2021
68A20	Pongo abelii	Male	Geuder et al., 2021
82A3	Macaca fascicularis	Female	Edenhofer et al., 2024
56B1	Macaca fascicularis	Female	Edenhofer et al., 2024
56A1	Macaca fascicularis	Female
87B1	Macaca mulatta	Male	Jocher et al., 2024
83D1	Macaca mulatta	Male	Jocher et al., 2024
83Ab1.1	Macaca mulatta	Male	Jocher et al., 2024

EB differentiation method comparison

We initially compare four EB differentiation protocols, which combine two differentiation media (DFK20 and EB-medium) with two differentiation methods (dish and 96-well).

For single-cell differentiation in 96-well plates, primate iPSCs from one 80% confluent 6-well are washed with DPBS and incubated with Accumax (Sigma-Aldrich, SCR006) for 7 min at 37 °C. Afterwards, iPSCs are dissociated to single cells, the enzymatic reaction is stopped by adding DPBS, and cells are counted and pelleted at 300×g for 5 min. Single cells are resuspended in either EB-medium, consisting of StemFit Basic02 (Nippon Genetics, 3821.00) w/o bFGF, or DFK20, both supplemented with 10 µM Y-27632 (Biozol, ESI-ST10019). The DFK20 medium consists of DMEM/F12 (Fisher Scientific, 15373541) with 20% KSR (Thermo Fisher Scientific, 10828–028), 1% MEM non-essential amino acids (Thermo Fisher Scientific, 11140–035), 1% Glutamax (Thermo Fisher Scientific, 35050038), 100 U/mL Penicillin, 100 µg/mL Streptomycin (Thermo Fisher Scientific, 15140122), and 0.1 mM 2-Mercaptoethanol (Thermo Fisher Scientific, M3148). Afterwards, 9000 cells in 150 µl medium are seeded per well of a Nunclon Sphera 96-well plate (Fisher Scientific, 15396123) and cultured at 37 °C and 5% CO₂. A medium change with the corresponding EB differentiation medium w/o ROCK inhibitor is performed every other day during the whole protocol. EBs are collected from the 96-well plate and subjected to flow cytometry after 7 days of differentiation.

For clump differentiation in culture dishes, primate iPSCs from one 80% confluent 12-well are washed with DPBS and incubated with 0.5 mM EDTA (Carl Roth, CN06.3) for 3–5 min at RT. The EDTA is removed, StemFit (Nippon Genetics, 3821.00) supplemented with 10 µM Y-27632 (Biozol, ESI-ST10019) is added, and cells are dissociated to clumps of varying sizes. Subsequently, the clumps are transferred to sterile bacterial dishes with vents and cultured at 37 °C and 5% CO₂. After 24 hr, the medium is exchanged for either EB-medium or DFK20 supplemented with 10 µM Y-27632 for an additional 24 hr, before changing the medium to EB-medium or DFK20 without Y-27632. A medium change is performed every other day during the protocol from day 4 on. EBs are collected from the dishes and subjected to flow cytometry after 7 days of differentiation.

Flow cytometry

Flow cytometry is performed on day 7 of the differentiation protocol. For this, 1/10 of the EBs are collected, washed with DPBS, incubated with Accumax (Sigma-Aldrich, SCR006) for 10 min at 37 °C and dissociated to single cells. After washing, cells are incubated with the Viability Dye eFluor 780 (Thermo Fisher Scientific, 65-0865-18) diluted 1/1000 in PBS for 30 min at 4 °C in the dark. The live/dead stain is quenched by the addition of Cell Staining Buffer (CSB) consisting of DPBS with 0.5% BSA (Sigma-Aldrich, A3059), 0.01% NaN₃ (Sigma-Aldrich, S2002), and 2 mM EDTA (Carl Roth, CN06.3). Subsequently, cells are pelleted and incubated for 1 hr at 4 °C in the dark with a mixture of the following antibodies, each diluted 1/200 in CSB: anti-TRA-1–60-AF488 (STEMCELL Technologies, 60064AD.1), anti-CXCR4-PE (BioLegend, 306505), anti-NCAM1-PE/Cy7 (BioLegend, 318317), and anti-PDGFRα-APC (BioLegend, 323511). After centrifugation, cells are resuspended in PBS containing 0.5% BSA, 0.01% NaN₃, and 1 µg/ml DNase I (STEMCELL Technologies, 07469), filtered through a strainer and analyzed using the BD FACS Canto Flow Cytometry System. Flow cytometry data are analyzed using FlowJo (V10.8.2).

In-vitro embryoid body differentiation

Two human, two orangutan, three cynomolgus, and three rhesus iPSC lines are used for EB differentiation. The human and orangutan iPSCs were reprogrammed from urinary cells, while cynomolgus and rhesus iPSCs were reprogrammed from fibroblasts. All cell lines were characterized and validated previously and tested negative for mycoplasma and SeV reprogramming vector integration (Geuder et al., 2021; Jocher et al., 2024; Edenhofer et al., 2024).

For embryoid body formation prior to 10x scRNA-seq, the EB differentiation protocol using DFK20 medium in culture dishes is performed in duplicate for each clone. After 8 days of floating culture in dishes, EBs from both replicates are pooled and seeded into 6-wells coated with 0.2% gelatin (Sigma-Aldrich, G1890) for another 8 days of attached culture with medium changes every other day. In total, three replicates of EB formation are performed on different days, and each replicate includes cell lines from all four primate species.

scRNA-seq library generation and sequencing

EBs are sampled on day 8 and day 16 of the protocol. For dissociation, floating EBs are collected, while attached EBs are kept in their wells, washed with DPBS, and incubated with Accumax (Sigma-Aldrich, SCR006) for 10–20 min at 37 °C. Afterwards, EBs are pipetted up and down with a p1000 pipette until they are completely dissociated. The enzymatic reaction is stopped by adding DFK20 medium, and cells are pelleted at 300×g for 5 min and resuspended in 1 mL DPBS. If cell clumps are observed, the suspension is filtered through a 40 µm strainer before the cells are counted with a Countess II automated cell counter (Thermo Fisher Scientific, C10228). Equal cell numbers from each cell line are pooled, washed with DPBS +0.04% BSA and resuspended in DPBS +0.04% BSA, aiming for a final concentration of 800–1000 cells/µL. scRNA-seq libraries are generated using the 10x Genomics Chromium Next GEM Single Cell 3’ Kit V3.1 workflow in three replicates. Each time, evenly pooled single cells from the different cell lines are loaded on 2–6 lanes of a 10x chip, targeting 16,000 cells per lane. Libraries are sequenced on an Illumina NextSeq1000/1500 with a 100-cycle kit and the following sequencing setup: read 1 (28 bases), read 2 (10 bases), read 3 (10 bases), and read 4 (90 bases).

Alignment of scRNA-seq data

Reads are processed with Cell Ranger version 7.0.0. We map all reads to four reference genomes: Homo sapiens GRCh38 (GENCODE release 32), Pongo abelii Susie_PABv2/ponAbe3, Macaca fascicularis macFas6, and Macaca mulatta rheMac10. The orangutan, cynomolgus macaque, and rhesus macaque GTF files are created by transferring the hg38 annotation to the corresponding primate genomes via the tool Liftoff (Shumate and Salzberg, 2021), followed by removal of transcripts with partial mapping (<50%), low sequence identity (<50%), or excessive length (>100 bp difference and >2 length ratio) for all species.
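The transcript filter applied after the annotation transfer can be illustrated with a short Python sketch. This is not the code used in this study: the record fields, the function name `keep_transcript`, and the reading of the length ratio as lifted length divided by source length are assumptions made for illustration. Only the three cutoffs are taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class LiftedTranscript:
    """Hypothetical summary of one transcript after annotation transfer."""
    transcript_id: str
    coverage: float           # fraction of the source transcript that mapped (0 to 1)
    sequence_identity: float  # fraction of identical bases in the mapped part (0 to 1)
    source_length: int        # length in the human annotation (bp), assumed > 0
    lifted_length: int        # length in the target genome (bp)


def keep_transcript(t, min_coverage=0.5, min_identity=0.5,
                    max_length_diff=100, max_length_ratio=2.0):
    """Return True if the transcript passes all three filters."""
    if t.coverage < min_coverage:
        return False  # partial mapping
    if t.sequence_identity < min_identity:
        return False  # low sequence identity
    length_diff = t.lifted_length - t.source_length
    length_ratio = t.lifted_length / t.source_length
    # Excessive length requires BOTH a large absolute and a large relative increase.
    if length_diff > max_length_diff and length_ratio > max_length_ratio:
        return False
    return True
```

Under these assumptions, a short transcript that grows from 40 bp to 100 bp is kept (ratio above 2, but only 60 bp longer), whereas one that grows from 1000 bp to 2500 bp is removed.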

Species and individual demultiplexing

Since we pool cells from multiple species on each 10x lane, we use cellsnp-lite (Huang and Huang, 2021) version 1.2.0 and vireo (Huang et al., 2019) version 0.5.7 to assign single cells to their respective species. Initially, we obtain a list of 51,000 informative variants (referred to as the ‘species vcf file’) from a bulk RNA-seq experiment involving samples from Homo sapiens, Pongo abelii and Macaca fascicularis, mapped to the GRCh38 reference genome. To identify these informative variants, we run cellsnp-lite in mode 2b for whole-chromosome pileup and filter for high-coverage homozygous variants.

For the demultiplexing of species in the scRNA-seq data, we employ a two-step strategy:

Initial species assignment: Using the Cell Ranger output aligned to GRCh38, we genotype each single cell with cellsnp-lite providing the species vcf file as candidate SNPs and setting a minimum UMI count filter of 10. Subsequently, we assign single cells to human, orangutan, or macaque identity with vireo using again the species vcf file as the donor file.
Distinguishing macaque species: To differentiate between the two macaque species, Macaca fascicularis and Macaca mulatta, we use the Cell Ranger output aligned to rheMac10. After genotyping with cellsnp-lite, we demultiplex with vireo, specifying the number of donors to two, without providing a donor vcf file in this case. We assign the donor, for which the majority of distinguishing variants agreed with the rheMac10 reference alleles, to Macaca mulatta, and the other donor to Macaca fascicularis.
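The rule used in the second step to name the two unlabelled macaque donors can be sketched as follows. This is a simplified illustration rather than the code used in this study: the function name, the genotype encoding (number of non-reference alleles per variant, 0 meaning homozygous for the rheMac10 reference allele), and the donor labels in the usage example are assumptions, and missing genotypes are not handled.

```python
def assign_macaque_species(genotypes):
    """Name two donors by their agreement with the rheMac10 reference.

    genotypes: dict mapping two donor labels to equally long lists of
    alternative-allele counts (0, 1 or 2), one entry per variant.
    """
    donors = sorted(genotypes)
    if len(donors) != 2:
        raise ValueError("exactly two donors are expected")
    first, second = donors
    g_first, g_second = genotypes[first], genotypes[second]
    if len(g_first) != len(g_second):
        raise ValueError("donors must be genotyped at the same variants")

    reference_matches = {first: 0, second: 0}
    for a, b in zip(g_first, g_second):
        if a == b:
            continue  # variant does not distinguish the two donors
        if a == 0:
            reference_matches[first] += 1
        if b == 0:
            reference_matches[second] += 1

    if reference_matches[first] == reference_matches[second]:
        raise ValueError("donors cannot be distinguished")
    mulatta = max(reference_matches, key=reference_matches.get)
    fascicularis = second if mulatta == first else first
    return {mulatta: "Macaca mulatta", fascicularis: "Macaca fascicularis"}
```

For example, `assign_macaque_species({"donor0": [0, 0, 0, 2, 1], "donor1": [2, 2, 0, 2, 0]})` labels `donor0` as Macaca mulatta, because it carries the reference allele at two of the three distinguishing variants.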

To distinguish different human individuals pooled in the same experiment, we genotype single cells with cellsnp-lite using a candidate vcf file of 7.4 million common variants from the 1000 Genomes Project, demultiplex with vireo specifying two donors, and assign donors to individuals based on the intersection with variants from bulk RNA-seq data of the same individuals. To distinguish between different cynomolgus individuals, we use a reference vcf with informative variants obtained from bulk RNA-seq data to genotype single cells and demultiplex the individuals.

Processing of scRNA-seq data

We remove background RNA with CellBender version 0.2.0 (Fleming et al., 2023) at a false positive rate (FPR) of 0.01. After quality control, we retain cells with more than 1000 detected genes and a mitochondrial fraction below 8%. We remove cross-species doublets based on the vireo assignments and intra-species doublets using scDblFinder version 1.6.0 (Germain et al., 2021), specifying the expected doublet rate based on the cross-species doublet fraction. For each species, we normalize the counts with scran version 1.28.2 (Lun et al., 2016) and integrate data from different experiments with Scanorama (Hie et al., 2019). UMAP dimensionality reductions are created with Seurat version 4.3.0 on the first 30 components of the Scanorama-corrected embedding per species.
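The two cell-level quality thresholds stated above can be written as a simple predicate. The sketch below is only an illustration of those cutoffs; the function and argument names are hypothetical, and it does not cover background removal, doublet detection, or normalization.

```python
def passes_qc(n_detected_genes, mito_umis, total_umis,
              min_genes=1000, max_mito_fraction=0.08):
    """Keep cells with more than `min_genes` detected genes and a
    mitochondrial UMI fraction strictly below `max_mito_fraction`."""
    if total_umis <= 0:
        return False  # empty barcodes cannot pass
    mito_fraction = mito_umis / total_umis
    return n_detected_genes > min_genes and mito_fraction < max_mito_fraction
```

Both comparisons are strict, following the wording "more than 1000 detected genes" and "below 8%".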

Besides the separate processing per species, we also create an integrated dataset of all four species together using Harmony version 0.1.1 (Korsunsky et al., 2019). We identify clusters on the first 20 Harmony-integrated PCs with Seurat at a resolution of 0.1, resulting in a number of clusters similar to the broad cell types described in a human EB dataset (Rhodes et al., 2022; Figure 1D and E).

Reference-based classification

To get an initial cell type annotation, we download a reference dataset of day 21 human EBs (Rhodes et al., 2022). We normalize the count matrix with scran and intersect the genes between reference and our scRNA-seq dataset. Next, we train a SingleR version 2.0.0 (Aran et al., 2019) classifier for the broad cell type classes defined in Figure 1G of the original publication (Rhodes et al., 2022) using trainSingleR with pseudo-bulk aggregation. Cell type labels are transferred to cells of each species with classifySingleR.

Orthologous cell type annotation

To annotate orthologous cell types, we first perform high-resolution clustering of the scRNA-seq data for each species separately. For this, we take the first 20 components of the Scanorama-corrected embedding as input to perform clustering in Seurat with FindNeighbors and FindClusters at a resolution of 2 to obtain the initial high-resolution clusters (HRCs).

Next, we score the similarity of all HRCs with an approach based on reciprocal classification. For each species, we train a SingleR classifier on all HRCs of a species. We then classify the cells of all other species with classifySingleR. In this way, we can calculate the similarity of each HRC in the target species to each HRC in the reference species as the fraction of cells of the target HRC classified as the reference HRC. To also obtain similarity scores between HRCs within a species, we split the data of each species into a reference set with 80% of cells and a test set with 20% of cells. Analogous to the cross-species classification scheme, we transfer HRC labels from the reference set to the test set and score the overlap of target and reference HRCs.
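The similarity score described above is the fraction of cells of a target HRC that the classifier assigns to a given reference HRC. A minimal Python sketch of this bookkeeping step is given below; it assumes that the classification itself has already been done, and the function name and the label format are hypothetical.

```python
from collections import Counter, defaultdict


def classification_similarity(target_clusters, predicted_reference_clusters):
    """Fraction of cells of each target cluster assigned to each reference cluster.

    Both arguments are equally long sequences with one entry per cell:
    the cell's own cluster and the reference cluster predicted for it.
    Returns a dict keyed by (target_cluster, reference_cluster).
    """
    if len(target_clusters) != len(predicted_reference_clusters):
        raise ValueError("one prediction per cell is required")

    counts = defaultdict(Counter)
    for target, predicted in zip(target_clusters, predicted_reference_clusters):
        counts[target][predicted] += 1

    similarity = {}
    for target, predicted_counts in counts.items():
        n_cells = sum(predicted_counts.values())
        for reference, n in predicted_counts.items():
            similarity[(target, reference)] = n / n_cells
    return similarity
```

Pairs that never co-occur are absent from the result and can be read as a similarity of 0, for example with `similarity.get(pair, 0.0)`.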

In the next step, we combine HRCs based on pairwise similarity scores. We average the bidirectional similarity scores for each HRC pair and construct a distance matrix with all HRCs. Subsequently, based on hierarchical clustering (hclust, average method), we define 26 initial orthologous cell type clusters (OCCs) based on the visual inspection of the distance matrix. In this way, we merge similar HRCs within species and match HRCs across species to obtain a set of OCCs.
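The merging step can be illustrated with a small, dependency-free Python sketch. Two points are assumptions rather than statements from the text: the distance is taken as one minus the averaged bidirectional similarity, and the number of groups is passed in as a parameter, whereas the 26 OCCs above were chosen by visual inspection. The function names and cluster labels are hypothetical.

```python
def bidirectional_distance(similarity, clusters):
    """Distance = 1 - mean of the two directional similarity scores."""
    dist = {}
    for a in clusters:
        for b in clusters:
            if a == b:
                dist[(a, b)] = 0.0
            else:
                forward = similarity.get((a, b), 0.0)
                backward = similarity.get((b, a), 0.0)
                dist[(a, b)] = 1.0 - (forward + backward) / 2
    return dist


def average_linkage(dist, clusters, n_groups):
    """Agglomerative clustering with average linkage until n_groups remain."""
    groups = [[c] for c in clusters]

    def group_distance(g1, g2):
        # Average linkage: mean of all pairwise distances between members.
        return sum(dist[(a, b)] for a in g1 for b in g2) / (len(g1) * len(g2))

    while len(groups) > n_groups:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = group_distance(groups[i], groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]  # merge the closest pair
        del groups[j]
    return groups
```

With four toy clusters in which a human and a macaque HRC classify each other well, the two cross-species pairs are merged first, which mirrors how HRCs are matched across species.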

OCCs with very similar expression and marker profiles can be further merged. To this end, we create pseudobulk profiles for each OCC and calculate Spearman’s ρ for all pairwise comparisons within a species ($s$) based on the 2000 most variable genes. We perform hierarchical clustering on $1 - \bar{ρ}_{s}$ and merge orthologous clusters at a cut height of 0.1, which was determined interactively by also inspecting the similarity of the top marker genes as found by Seurat’s FindMarkers. In the shiny app, we provide a list of OCC markers for each species separately, as well as the intersection of conserved markers. Based on those marker combinations, the user can then assign the cell types. If the marker gene distribution as visualized in UMAPs reveals overmerged OCCs, the user can split them interactively. Specifically, we separate merged OCC4 into iPSCs, cardiac progenitor cells and early epithelial cells for the final assignment. We assign merged OCC5 as neural crest I, but re-annotate a subcluster present only in cynomolgus and rhesus macaques as fibroblasts. Similarly, we re-annotate a subcluster of merged OCC12 (granule precursor cells) as astrocyte progenitors in cynomolgus and rhesus macaques. Finally, we exclude OCCs with fewer than 800 cells that are only present in 1 or 2 species.
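The distance used for merging OCCs, one minus the species-averaged Spearman correlation, can be sketched in plain Python as below. This is an illustration under simplifying assumptions, not the implementation used in this study: pseudobulk profiles are passed as plain lists over the same genes, the helper names are hypothetical, and the pairwise test in `should_merge` matches the hierarchical clustering cut only when exactly two clusters are compared.

```python
def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks.
    Assumes neither profile is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_rho_distance(profile_pairs):
    """1 - mean Spearman rho over species; one (profile_a, profile_b) pair per species."""
    rhos = [spearman_rho(a, b) for a, b in profile_pairs]
    return 1.0 - sum(rhos) / len(rhos)


def should_merge(profile_pairs, cut_height=0.1):
    """True if two clusters are closer than the cut height."""
    return mean_rho_distance(profile_pairs) < cut_height
```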

We assess the correspondence of the final cell type assignments across species with two approaches. For the scores shown in Figure 2—figure supplement 1, we apply the same reciprocal classification approach as described above, providing cell type labels instead of HRCs as initial clusters. For the scores shown in Figure 2—figure supplement 2, we use the function MetaNeighborUS of MetaNeighbor version 1.18.0 to compare cell type labels across species.

Pseudotime analysis

Pseudotime trajectories were inferred separately for ectodermal, mesodermal, and endodermal lineages in each species using slingshot (version 2.12.0) (Street et al., 2018). For each germ layer, cells were filtered to include iPSCs and cell types belonging to the respective germ layer. The analysis was based on Scanorama-integrated PCA embeddings (Hie et al., 2019), with iPSCs defined as the starting cluster and germ layer-specific differentiated cell types as endpoints (ectoderm: astrocyte progenitors, granule precursor cells, neurons, and neural crest cells; mesoderm: fibroblasts, smooth muscle cells, cardiac endothelial cells, and cardiac fibroblasts; endoderm: epithelial cells and hepatocytes). If neural_crest_II was absent, neural_crest_I was used as an alternative endpoint. PHATE embeddings (phateR, version 1.0.7; Moon et al., 2019) were computed from the Scanorama PCA space to visualize the inferred lineages in two dimensions.

Presence-absence scoring of expression

To determine when to define a gene as expressed in a certain cell type, we derive a lower limit of gene detection per cell type and species while accounting for noise and differences in power to detect expression. We first filter the count matrices for each clone, keeping only genes with at least 1% nonzero counts and cells within three median absolute deviations for the number of UMIs and the number of genes with counts >0 per cell type and species. These filtered matrices are then downsampled so that we keep the same number of cells in each species (n=18,800), while keeping the original cell type proportions. Next, per species, we estimate the following distributional characteristics per gene ($i$) across cell types ($j$): (1) the fraction of nonzero counts ($f_{ij}$), and (2) the mean ($μ_{ij} \pm \mathrm{s.e.}(μ_{ij})$) and dispersion ($θ_{i}$) of the negative binomial distribution using glmGamPoi v1.10.2 (Ahlmann-Eltze and Huber, 2021). In the next step, we define a putative expression status per gene per cell type. (1) Genes are detectable if their log mean expression $\log(μ_{ij})$ is above the fifth quantile of the $\log(μ)$ value distribution across all genes per cell type. (2) Genes are reliably estimable if the ratio $\log\left(\frac{\mathrm{s.e.}(μ_{ij})}{μ_{ij}}\right)$ is below the 90th quantile of the $\log\left(\frac{\mathrm{s.e.}(μ)}{μ}\right)$ value distribution. Only when both conditions are met is the expression status set to 1, otherwise 0. A binomial logistic regression model using Firth’s bias reduction method as implemented in the R package logistf (version 1.26.0) is then applied to derive the minimal gene detection needed to call a gene expressed, i.e., for the probability $p = P(Y=1)$, we solve $\log\left(\frac{p}{1 - p}\right) = a + b \cdot f_{ij}$ for $f_{ij}$. To ensure consistency between species, we set the detection threshold for each cell type to the maximum threshold among all species.
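Solving the logistic model for the detection fraction gives $f_{ij} = \left(\log\left(\frac{p}{1 - p}\right) - a\right) / b$. The Python sketch below illustrates this inversion and the subsequent maximum across species. It is not the code used in this study: the function names are hypothetical, the coefficients in the usage example are invented, and the default probability of 0.5 is a placeholder, since the cutoff probability is not stated in the text above.

```python
import math


def detection_threshold(intercept, slope, probability=0.5):
    """Solve log(p / (1 - p)) = intercept + slope * f for f.

    `probability` is an assumed placeholder, not a value from the study.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    if slope == 0:
        raise ValueError("slope must be non-zero to solve for f")
    logit = math.log(probability / (1.0 - probability))
    return (logit - intercept) / slope


def consensus_threshold(threshold_by_species):
    """Use the most stringent (largest) threshold among all species."""
    return max(threshold_by_species.values())
```

With invented coefficients a = -2 and b = 40, a gene would be called expressed once it is detected in at least 5% of the cells of that cell type, and the largest such value across the four species would be applied to all of them.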

Cell type specificity and expression conservation scores

To assess cell type specificity and expression conservation of genes across species, we first determine in which cell types a gene is expressed in a species, using the thresholds defined in the previous section. Thus, we determine cell type specificity as the number of cell types in which a gene was found to be expressed. Here, this score can be maximally 7, i.e., the gene is detected in all cell types that were found in all four species.

To evaluate expression conservation, we develop a phylogenetically weighted conservation score for each gene, reflecting the number of species in which the gene is expressed, weighted by the scaled phylogenetic distance as estimated in Bininda-Emonds et al., 2007. For each gene, we calculate the expression conservation score as follows:

$$\text{Expression conservation} = \frac{1}{N_{ct}} \sum_{ct} \sum_{b \in \text{detected}} bl$$

where $N_{ct}$ is the number of cell types in which the gene is detected. We then simply sum the scaled branch lengths $bl$ across all cell types ($ct$) and branches ($b$) on which we infer the gene to be expressed. Because we only have four species, we only have one internal branch, for which we infer expression if at least one great ape and one macaque species show expression in that cell type. The score ranges from 0.075 (detected only in cynomolgus or rhesus macaque) to 1 (detected in the same cell types in all four species).
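The score can be illustrated with a short Python sketch. The branch lengths used here are hypothetical placeholders chosen only so that they sum to 1 and reproduce the stated minimum of 0.075 for a macaque terminal branch; the actual scaled lengths derived from Bininda-Emonds et al., 2007 are not given in the text. The function name and the input format are likewise assumptions.

```python
GREAT_APES = {"human", "orangutan"}
MACAQUES = {"cynomolgus", "rhesus"}

# Hypothetical scaled branch lengths (sum to 1); only 0.075 is implied by the text.
BRANCH_LENGTHS = {
    "human": 0.2,
    "orangutan": 0.2,
    "cynomolgus": 0.075,
    "rhesus": 0.075,
    "internal": 0.45,
}


def expression_conservation(detected, branch_lengths=BRANCH_LENGTHS):
    """Phylogenetically weighted conservation score of one gene.

    detected: dict mapping each cell type to the set of species in which
    the gene is called expressed in that cell type.
    """
    expressed = {ct: species for ct, species in detected.items() if species}
    if not expressed:
        return 0.0  # gene not detected anywhere
    total = 0.0
    for species in expressed.values():
        # Terminal branches of all species with expression in this cell type.
        total += sum(branch_lengths[s] for s in species)
        # Internal branch: needs at least one great ape and one macaque.
        if species & GREAT_APES and species & MACAQUES:
            total += branch_lengths["internal"]
    return total / len(expressed)
```

Under these placeholder lengths, a gene detected only in rhesus neurons scores 0.075, a gene detected in the same cell types in all four species scores 1, and a gene detected in human and rhesus in a single cell type scores 0.2 + 0.075 + 0.45 = 0.725.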

Furthermore, we extract measures of sequence conservation for protein-coding genes from Supplementary Data S14 of Sullivan et al., 2023. Here, we use the fraction of CDS bases with primate phastCons ≥0.96 as a gene-based measure of constraint.

Marker gene detection

We filter the count matrices for each clone to retain only genes with nonzero counts in at least one of the 7 cell types that were detected in all species. We then downsample these filtered matrices to equalize the number of cells across species, leaving us with ∼11,600 cells per species. Furthermore, to mitigate differences in statistical power due to varying numbers of cells per cell type, we perform testing on cell types with a minimum of 10 and a maximum of 250 cells for each pairwise comparison of ‘self’ versus ‘other.’ The maximum of 250 cells ensures that the cell type composition of the ‘other’ is comparable across species. We identify marker genes using the adjusted p-values ($p_{adj} < 0.1$) determined by ZIQ-Rank (Ling et al., 2021) and use Seurat FindMarkers with logistic regression to identify the cell types for which the gene is a marker. Furthermore, the marker gene needs to be above the cell type’s detection threshold (see above) and needs to be up-regulated in the cell type for which it is a marker (log fold change >0.01). Finally, a marker gene must be detected in a larger proportion of cells of the cell type for which it is a marker than in other cell types ($p_{j} - \bar{p}_{other} = Δ > 0.01$). The difference in detection proportion Δ is also used to sort the lists of marker genes, deeming the genes with the largest Δ the best marker genes. To also gauge within-species variation in marker gene detection, we conduct the same analysis across clones instead of species. To compare the cross-species reproducibility of different types of marker genes, i.e., protein-coding genes, lncRNAs and transcriptional regulators, we compare the ranked lists of marker genes across species. To this end, we perform a concordance analysis using rank-biased overlap (RBO; Webber et al., 2010) on the top 100 marker genes (rbo R package version 0.0.1).
For this part, a list of transcription factors was created by selecting genes with at least one annotated motif in the motif databases JASPAR 2022 vertebrate core (Castro-Mondragon et al., 2022), JASPAR 2022 vertebrate unvalidated (Castro-Mondragon et al., 2022) and IMAGE (Madsen et al., 2018). Annotations for protein-coding and lncRNA genes were extracted from the Ensembl GTF file provided with the human Cell Ranger reference dataset (GRCh38-2020-A). To assess the predictive performance of marker genes, we conduct a kNN classification (FNN R package version 1.1.4.1). We train a kNN classifier (k=3) on the log-normalized counts of the top 1–30 human markers per cell type in the human clone 29B5. We then predict the cell type identity in all clones and summarize classification performance per cell type with F1-scores, as well as the average F1-score across all seven cell types.
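The ranking of marker genes by the difference in detection proportion Δ can be sketched as follows. This is a simplified illustration: it assumes that $\bar{p}_{other}$ is the unweighted mean detection fraction over the other cell types, it applies only the Δ > 0.01 criterion and none of the other filters (adjusted p-value, detection threshold, log fold change), and the function name and the detection fractions in the usage example are invented.

```python
def rank_markers_by_detection(detection, cell_type, min_delta=0.01):
    """Rank candidate markers of `cell_type` by delta = p_self - mean(p_other).

    detection: dict mapping gene -> dict mapping cell type -> fraction of
    cells with nonzero counts. Returns (gene, delta) tuples, best first.
    """
    ranked = []
    for gene, fractions in detection.items():
        others = [f for ct, f in fractions.items() if ct != cell_type]
        if cell_type not in fractions or not others:
            continue  # delta is undefined for this gene
        delta = fractions[cell_type] - sum(others) / len(others)
        if delta > min_delta:
            ranked.append((gene, delta))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

For example, with toy detection fractions for TTR (0.7 in hepatocytes, 0 elsewhere) and APOA1 (0.6 in hepatocytes, 0.1 elsewhere), TTR ranks first with Δ = 0.7, followed by APOA1 with Δ = 0.5, while a ubiquitously detected gene is excluded because its Δ is 0.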

Appendix 1

Appendix 1—table 1

Marker genes.

Literature review for marker genes used in human and mouse / rodents to determine a specific cell type.

Cell type	Marker gene	Used in human	Used in mouse
iPSCs	POU5F1	Nguyen et al., 2018	Loh et al., 2006
iPSCs	NANOG	Nguyen et al., 2018	Apostolou et al., 2013
iPSCs	L1TD1	Närvä et al., 2012	Närvä et al., 2012
early ectoderm	SOX2	Graham et al., 2003	Lodato et al., 2013
early ectoderm	HES5	Ziller et al., 2015	Harada et al., 2021
early ectoderm	RFX4	Ziller et al., 2015	Kawase et al., 2014
granule precursor cells	NFIA	Tan et al., 2023	Fraser et al., 2020
granule precursor cells	ZIC1	Aruga et al., 1998	Schüller et al., 2006
granule precursor cells	ZIC4	Aruga et al., 1998	Blank et al., 2011
neural crest	SOX10	Mollaaghababa and Pavan, 2003	Mollaaghababa and Pavan, 2003; Kim et al., 2003
neural crest	FOXD3	Tseng et al., 2016	Dottori et al., 2001
neural crest	S100B	Hackland et al., 2017	Murphy et al., 1991
neurons	STMN2	Klim et al., 2019	Guerra San Juan et al., 2022; Ware et al., 2016
neurons	TAGLN3 (NP25)	Mori et al., 2004	Ware et al., 2016
neurons	DCX	Gleeson et al., 1999	Gleeson et al., 1999
smooth muscle cells	COL8A1	Rojas et al., 2024	Muhl et al., 2022
smooth muscle cells	ACTG2	Hashmi et al., 2020	Muhl et al., 2022
smooth muscle cells	ACTA2	Rojas et al., 2024	Muhl et al., 2022
cardiac fibroblasts	TNNT2	Mononen et al., 2020	Tachampa and Wongtawan, 2020
cardiac fibroblasts	DCN	Floy et al., 2021	Ko et al., 2022
cardiac fibroblasts	HAND2	Mononen et al., 2020	Furtado et al., 2014
epithelial cells	CDH1	Oikawa et al., 2018	Bondow et al., 2012
epithelial cells	EPCAM	Martowicz et al., 2016	Huang et al., 2018
epithelial cells	CLDN7	Farkas et al., 2015	Xing et al., 2020
hepatocytes	TTR	Banas et al., 2007	Lavon and Benvenisty, 2005
hepatocytes	APOA1	Krueger et al., 2013	De Giorgi et al., 2021
hepatocytes	APOA2	Krueger et al., 2013	Peng et al., 2018

Data availability

Code for analysis and figures is available on GitHub (https://github.com/Hellmann-Lab/EB-analyses; copy archived at Janssen, 2024), and accompanying files are deposited in Zenodo (https://doi.org/10.5281/zenodo.14198850). All sequencing files were deposited in GEO (GSE280441).

The following data sets were generated

1. Jocher J
2. Janssen P
3. Vieth B
4. Edenhofer FC
5. Dietl T
6. Térmeg A
7. Geuder J
8. Enard W
9. Hellmann I
(2024) NCBI Gene Expression Omnibus
ID GSE280441. Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE280441
1. Janssen P
(2024) Zenodo
Identification and comparison of orthologous cell types from primate embryoid bodies shows limits of marker gene transferability.

https://doi.org/10.5281/zenodo.14198849

References

1. Ahlmann-Eltze C
2. Huber W
(2021) glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data
Bioinformatics 36:5701–5702.

https://doi.org/10.1093/bioinformatics/btaa1009
1. Apostolou E
2. Ferrari F
3. Walsh RM
4. Bar-Nur O
5. Stadtfeld M
6. Cheloufi S
7. Stuart HT
8. Polo JM
9. Ohsumi TK
10. Borowsky ML
11. Kharchenko PV
12. Park PJ
13. Hochedlinger K
(2013) Genome-wide chromatin interactions of the Nanog locus in pluripotency, differentiation, and reprogramming
Cell Stem Cell 12:699–712.

https://doi.org/10.1016/j.stem.2013.04.013
1. Aran D
2. Looney AP
3. Liu L
4. Wu E
5. Fong V
6. Hsu A
7. Chak S
8. Naikawadi RP
9. Wolters PJ
10. Abate AR
11. Butte AJ
12. Bhattacharya M
(2019) Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage
Nature Immunology 20:163–172.

https://doi.org/10.1038/s41590-018-0276-y
1. Arendt D
2. Musser JM
3. Baker CVH
4. Bergman A
5. Cepko C
6. Erwin DH
7. Pavlicev M
8. Schlosser G
9. Widder S
10. Laubichler MD
11. Wagner GP
(2016) The origin and evolution of cell types
Nature Reviews. Genetics 17:744–757.

https://doi.org/10.1038/nrg.2016.127
1. Aruga J
2. Minowa O
3. Yaginuma H
4. Kuno J
5. Nagai T
6. Noda T
7. Mikoshiba K
(1998) Mouse Zic1 is involved in cerebellar development
The Journal of Neuroscience 18:284–293.

https://doi.org/10.1523/JNEUROSCI.18-01-00284.1998
1. Bakken T
2. Cowell L
3. Aevermann BD
4. Novotny M
5. Hodge R
6. Miller JA
7. Lee A
8. Chang I
9. McCorrison J
10. Pulendran B
11. Qian Y
12. Schork NJ
13. Lasken RS
14. Lein ES
15. Scheuermann RH
(2017) Cell type discovery and representation in the era of high-content single cell phenotyping
BMC Bioinformatics 18:559.

https://doi.org/10.1186/s12859-017-1977-1
1. Bakken TE
2. Jorstad NL
3. Hu Q
4. Lake BB
5. Tian W
6. Kalmbach BE
7. Crow M
8. Hodge RD
9. Krienen FM
10. Sorensen SA
11. Eggermont J
12. Yao Z
13. Aevermann BD
14. Aldridge AI
15. Bartlett A
16. Bertagnolli D
17. Casper T
18. Castanon RG
19. Crichton K
20. Daigle TL
21. Dalley R
22. Dee N
23. Dembrow N
24. Diep D
25. Ding S-L
26. Dong W
27. Fang R
28. Fischer S
29. Goldman M
30. Goldy J
31. Graybuck LT
32. Herb BR
33. Hou X
34. Kancherla J
35. Kroll M
36. Lathia K
37. van Lew B
38. Li YE
39. Liu CS
40. Liu H
41. Lucero JD
42. Mahurkar A
43. McMillen D
44. Miller JA
45. Moussa M
46. Nery JR
47. Nicovich PR
48. Niu S-Y
49. Orvis J
50. Osteen JK
51. Owen S
52. Palmer CR
53. Pham T
54. Plongthongkum N
55. Poirion O
56. Reed NM
57. Rimorin C
58. Rivkin A
59. Romanow WJ
60. Sedeño-Cortés AE
61. Siletti K
62. Somasundaram S
63. Sulc J
64. Tieu M
65. Torkelson A
66. Tung H
67. Wang X
68. Xie F
69. Yanny AM
70. Zhang R
71. Ament SA
72. Behrens MM
73. Bravo HC
74. Chun J
75. Dobin A
76. Gillis J
77. Hertzano R
78. Hof PR
79. Höllt T
80. Horwitz GD
81. Keene CD
82. Kharchenko PV
83. Ko AL
84. Lelieveldt BP
85. Luo C
86. Mukamel EA
87. Pinto-Duarte A
88. Preissl S
89. Regev A
90. Ren B
91. Scheuermann RH
92. Smith K
93. Spain WJ
94. White OR
95. Koch C
96. Hawrylycz M
97. Tasic B
98. Macosko EZ
99. McCarroll SA
100. Ting JT
101. Zeng H
102. Zhang K
103. Feng G
104. Ecker JR
105. Linnarsson S
106. Lein ES
(2021) Comparative cellular analysis of motor cortex in human, marmoset and mouse
Nature 598:111–119.

https://doi.org/10.1038/s41586-021-03465-8
1. Banas A
2. Teratani T
3. Yamamoto Y
4. Tokuhara M
5. Takeshita F
6. Quinn G
7. Okochi H
8. Ochiya T
(2007) Adipose tissue-derived mesenchymal stem cells as a source of human hepatocytes
Hepatology 46:219–228.

https://doi.org/10.1002/hep.21704
(2023) The relationship between regulatory changes in cis and trans and the evolution of gene expression in humans and chimpanzees
Genome Biology 24:207.

https://doi.org/10.1186/s13059-023-03019-3
(2021) An early cell shape transition drives evolutionary expansion of the human forebrain
Cell 184:2084–2102.

https://doi.org/10.1016/j.cell.2021.02.050
1. Bininda-Emonds ORP
2. Cardillo M
3. Jones KE
4. MacPhee RDE
5. Beck RMD
6. Grenyer R
7. Price SA
8. Vos RA
9. Gittleman JL
10. Purvis A
(2007) The delayed rise of present-day mammals
Nature 446:507–512.

https://doi.org/10.1038/nature05634
(2011) Multiple developmental programs are altered by loss of Zic1 and Zic4 to cause Dandy-Walker malformation cerebellar pathogenesis
Development 138:1207–1216.

https://doi.org/10.1242/dev.054114
1. Bondow BJ
2. Faber ML
3. Wojta KJ
4. Walker EM
5. Battle MA
(2012) E-cadherin is required for intestinal morphogenesis in the mouse
Developmental Biology 371:1–12.

https://doi.org/10.1016/j.ydbio.2012.06.005
1. Brickman JM
2. Serup P
(2017) Properties of embryoid bodies
Wiley Interdisciplinary Reviews. Developmental Biology 6:259.

https://doi.org/10.1002/wdev.259
1. Castro-Mondragon JA
2. Riudavets-Puig R
3. Rauluseviciute I
4. Lemma RB
5. Turchi L
6. Blanc-Mathieu R
7. Lucas J
8. Boddie P
9. Khan A
10. Manosalva Pérez N
11. Fornes O
12. Leung TY
13. Aguirre A
14. Hammal F
15. Schmelter D
16. Baranasic D
17. Ballester B
18. Sandelin A
19. Lenhard B
20. Vandepoele K
21. Wasserman WW
22. Parcy F
23. Mathelier A
(2022) JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles
Nucleic Acids Research 50:D165–D173.

https://doi.org/10.1093/nar/gkab1113
- PubMed
- Google Scholar
1. Crow M
2. Paul A
3. Ballouz S
4. Huang ZJ
5. Gillis J
(2018) Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor
Nature Communications 9:884.

https://doi.org/10.1038/s41467-018-03282-0
- PubMed
- Google Scholar
1. De Giorgi M
2. Li A
3. Hurley A
4. Barzi M
5. Doerfler AM
6. Cherayil NA
7. Smith HE
8. Brown JD
9. Lin CY
10. Bissig K-D
11. Bao G
12. Lagor WR
(2021) Targeting the Apoa1 locus for liver-directed gene therapy
Molecular Therapy. Methods & Clinical Development 21:656–669.

https://doi.org/10.1016/j.omtm.2021.04.011
- PubMed
- Google Scholar
(2001) The winged-helix transcription factor Foxd3 suppresses interneuron differentiation and promotes neural crest cell fate
Development 128:4127–4138.

https://doi.org/10.1242/dev.128.21.4127
- PubMed
- Google Scholar
1. Duret L
2. Mouchiroud D
(2000) Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate
Molecular Biology and Evolution 17:68–74.

https://doi.org/10.1093/oxfordjournals.molbev.a026239
- PubMed
- Google Scholar
1. Edenhofer FC
2. Térmeg A
3. Ohnuki M
4. Jocher J
5. Kliesmete Z
6. Briem E
7. Hellmann I
8. Enard W
(2024) Generation and characterization of inducible KRAB-dCas9 iPSCs from primates for cross-species CRISPRi
iScience 27:110090.

https://doi.org/10.1016/j.isci.2024.110090
- PubMed
- Google Scholar
1. Farkas AE
2. Hilgarth RS
3. Capaldo CT
4. Gerner-Smidt C
5. Powell DR
6. Vertino PM
7. Koval M
8. Parkos CA
9. Nusrat A
(2015) HNF4α regulates claudin-7 protein expression during intestinal epithelial differentiation
The American Journal of Pathology 185:2206–2218.

https://doi.org/10.1016/j.ajpath.2015.04.023
- PubMed
- Google Scholar
1. Feng M
2. Swevers L
3. Sun J
(2022) Hemocyte clusters defined by scRNA-Seq in Bombyx mori: In Silico analysis of predicted marker genes and implications for potential functional roles
Frontiers in Immunology 13:852702.

https://doi.org/10.3389/fimmu.2022.852702
- PubMed
- Google Scholar
1. Fleming SJ
2. Chaffin MD
3. Arduini A
4. Akkad AD
5. Banks E
6. Marioni JC
7. Philippakis AA
8. Ellinor PT
9. Babadi M
(2023) Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender
Nature Methods 20:1323–1335.

https://doi.org/10.1038/s41592-023-01943-7
- PubMed
- Google Scholar
1. Floy ME
2. Givens SE
3. Matthys OB
4. Mateyka TD
5. Kerr CM
6. Steinberg AB
7. Silva AC
8. Zhang J
9. Mei Y
10. Ogle BM
11. McDevitt TC
12. Kamp TJ
13. Palecek SP
(2021) Developmental lineage of human pluripotent stem cell-derived cardiac fibroblasts affects their functional phenotype
FASEB Journal 35:e21799.

https://doi.org/10.1096/fj.202100523R
- PubMed
- Google Scholar
(2019) PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data
Database 2019:baz046.

https://doi.org/10.1093/database/baz046
- PubMed
- Google Scholar
1. Fraser J
2. Essebier A
3. Brown AS
4. Davila RA
5. Harkins D
6. Zalucki O
7. Shapiro LP
8. Penzes P
9. Wainwright BJ
10. Scott MP
11. Gronostajski RM
12. Bodén M
13. Piper M
14. Harvey TJ
(2020) Common regulatory targets of NFIA, NFIX and NFIB during postnatal cerebellar development
Cerebellum 19:89–101.

https://doi.org/10.1007/s12311-019-01089-3
- PubMed
- Google Scholar
1. Furtado MB
2. Costa MW
3. Pranoto EA
4. Salimova E
5. Pinto AR
6. Lam NT
7. Park A
8. Snider P
9. Chandran A
10. Harvey RP
11. Boyd R
12. Conway SJ
13. Pearson J
14. Kaye DM
15. Rosenthal NA
(2014) Cardiogenic genes expressed in cardiac fibroblasts contribute to heart development and repair
Circulation Research 114:1422–1434.

https://doi.org/10.1161/CIRCRESAHA.114.302530
- PubMed
- Google Scholar
(2021) Doublet identification in single-cell sequencing data using scDblFinder
F1000Research 10:979.

https://doi.org/10.12688/f1000research.73600.2
- PubMed
- Google Scholar
1. Geuder J
2. Wange LE
3. Janjic A
4. Radmer J
5. Janssen P
6. Bagnoli JW
7. Müller S
8. Kaul A
9. Ohnuki M
10. Enard W
(2021) A non-invasive method to generate induced pluripotent stem cells from primate urine
Scientific Reports 11:3516.

https://doi.org/10.1038/s41598-021-82883-0
- PubMed
- Google Scholar
(1999) Doublecortin is a microtubule-associated protein and is expressed widely by migrating neurons
Neuron 23:257–271.

https://doi.org/10.1016/s0896-6273(00)80778-3
- PubMed
- Google Scholar
1. Graham V
2. Khudyakov J
3. Ellis P
4. Pevny L
(2003) SOX2 functions to maintain neural progenitor identity
Neuron 39:749–765.

https://doi.org/10.1016/s0896-6273(03)00497-5
- PubMed
- Google Scholar
1. Guerra San Juan I
2. Nash LA
3. Smith KS
4. Leyton-Jaimes MF
5. Qian M
6. Klim JR
7. Limone F
8. Dorr AB
9. Couto A
10. Pintacuda G
11. Joseph BJ
12. Whisenant DE
13. Noble C
14. Melnik V
15. Potter D
16. Holmes A
17. Burberry A
18. Verhage M
19. Eggan K
(2022) Loss of mouse Stmn2 function causes motor neuropathy
Neuron 110:1671–1688.

https://doi.org/10.1016/j.neuron.2022.02.011
- PubMed
- Google Scholar
1. Gulati GS
2. D’Silva JP
3. Liu Y
4. Wang L
5. Newman AM
(2025) Profiling cell identity and tissue architecture with single-cell and spatial transcriptomics
Nature Reviews. Molecular Cell Biology 26:11–31.

https://doi.org/10.1038/s41580-024-00768-2
- PubMed
- Google Scholar
1. Guo H
2. Tian L
3. Zhang JZ
4. Kitani T
5. Paik DT
6. Lee WH
7. Wu JC
(2019) Single-cell RNA sequencing of human embryonic stem cell differentiation delineates adverse effects of nicotine on embryonic development
Stem Cell Reports 12:772–786.

https://doi.org/10.1016/j.stemcr.2019.01.022
- PubMed
- Google Scholar
1. Guo H
2. Li J
(2021) scSorter: assigning cells to known cell types according to marker genes
Genome Biology 22:69.

https://doi.org/10.1186/s13059-021-02281-7
- PubMed
- Google Scholar
(2017) Top-down inhibition of BMP signaling enables robust induction of hPSCs into neural crest in fully defined, xeno-free conditions
Stem Cell Reports 9:1043–1052.

https://doi.org/10.1016/j.stemcr.2017.08.008
- PubMed
- Google Scholar
1. Han X
2. Chen H
3. Huang D
4. Chen H
5. Fei L
6. Cheng C
7. Huang H
8. Yuan GC
9. Guo G
(2018) Mapping human pluripotent stem cell differentiation pathways using high throughput single-cell RNA-sequencing
Genome Biology 19:47.

https://doi.org/10.1186/s13059-018-1426-0
- PubMed
- Google Scholar
1. Hao Y
2. Hao S
3. Andersen-Nissen E
4. Mauck WM III
5. Zheng S
6. Butler A
7. Lee MJ
8. Wilk AJ
9. Darby C
10. Zager M
11. Hoffman P
12. Stoeckius M
13. Papalexi E
14. Mimitou EP
15. Jain J
16. Srivastava A
17. Stuart T
18. Fleming LM
19. Yeung B
20. Rogers AJ
21. McElrath JM
22. Blish CA
23. Gottardo R
24. Smibert P
25. Satija R
(2021) Integrated analysis of multimodal single-cell data
Cell 184:3573–3587.

https://doi.org/10.1016/j.cell.2021.04.048
- PubMed
- Google Scholar
1. Harada Y
2. Yamada M
3. Imayoshi I
4. Kageyama R
5. Suzuki Y
6. Kuniya T
7. Furutachi S
8. Kawaguchi D
9. Gotoh Y
(2021) Cell cycle arrest determines adult neural stem cell ontogeny by an embryonic Notch-nonoscillatory Hey1 module
Nature Communications 12:6562.

https://doi.org/10.1038/s41467-021-26605-0
- PubMed
- Google Scholar
(2020) Pseudo-obstruction-inducing ACTG2R257C alters actin organization and function
JCI Insight 5:140604.

https://doi.org/10.1172/jci.insight.140604
- PubMed
- Google Scholar
1. Hastings KE
(1996) Strong evolutionary conservation of broadly expressed protein isoforms in the troponin I gene family and other vertebrate gene families
Journal of Molecular Evolution 42:631–640.

https://doi.org/10.1007/BF02338796
- PubMed
- Google Scholar
Preprint
1. He Z
2. Dony L
3. Fleck JS
4. Szałata A
5. Li KX
6. Slišković I
7. Lin HC
8. Santel M
9. Atamian A
10. Quadrato G
11. Sun J
12. Paşca SP
13. Camp JG
14. Theis F
15. Treutlein B
(2023) An integrated transcriptomic cell atlas of human neural organoids
bioRxiv.

https://doi.org/10.1101/2023.10.05.561097
- Google Scholar
1. Hie B
2. Bryson BD
3. Berger B
(2019) Efficient integration of heterogeneous single-cell transcriptomes using Scanorama
Nature Biotechnology 37:685–691.

https://doi.org/10.1038/s41587-019-0113-3
- PubMed
- Google Scholar
1. Hodge RD
2. Bakken TE
3. Miller JA
4. Smith KA
5. Barkan ER
6. Graybuck LT
7. Close JL
8. Long B
9. Johansen N
10. Penn O
11. Yao Z
12. Eggermont J
13. Höllt T
14. Levi BP
15. Shehata SI
16. Aevermann B
17. Beller A
18. Bertagnolli D
19. Brouner K
20. Casper T
21. Cobbs C
22. Dalley R
23. Dee N
24. Ding S-L
25. Ellenbogen RG
26. Fong O
27. Garren E
28. Goldy J
29. Gwinn RP
30. Hirschstein D
31. Keene CD
32. Keshk M
33. Ko AL
34. Lathia K
35. Mahfouz A
36. Maltzer Z
37. McGraw M
38. Nguyen TN
39. Nyhus J
40. Ojemann JG
41. Oldre A
42. Parry S
43. Reynolds S
44. Rimorin C
45. Shapovalova NV
46. Somasundaram S
47. Szafer A
48. Thomsen ER
49. Tieu M
50. Quon G
51. Scheuermann RH
52. Yuste R
53. Sunkin SM
54. Lelieveldt B
55. Feng D
56. Ng L
57. Bernard A
58. Hawrylycz M
59. Phillips JW
60. Tasic B
61. Zeng H
62. Jones AR
63. Koch C
64. Lein ES
(2019) Conserved cell types with divergent features in human versus mouse cortex
Nature 573:61–68.

https://doi.org/10.1038/s41586-019-1506-7
- PubMed
- Google Scholar
1. Huang L
2. Yang Y
3. Yang F
4. Liu S
5. Zhu Z
6. Lei Z
7. Guo J
(2018) Functions of EpCAM in physiological processes and diseases (Review)
International Journal of Molecular Medicine 42:1771–1785.

https://doi.org/10.3892/ijmm.2018.3764
- PubMed
- Google Scholar
(2019) Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference
Genome Biology 20:273.

https://doi.org/10.1186/s13059-019-1865-2
- PubMed
- Google Scholar
1. Huang X
2. Huang Y
(2021) Cellsnp-lite: an efficient tool for genotyping single cells
Bioinformatics 37:4569–4571.

https://doi.org/10.1093/bioinformatics/btab358
- PubMed
- Google Scholar
(2022) Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data
Nature Communications 13:1246.

https://doi.org/10.1038/s41467-022-28803-w
- PubMed
- Google Scholar
1. Itskovitz-Eldor J
2. Schuldiner M
3. Karsenti D
4. Eden A
5. Yanuka O
6. Amit M
7. Soreq H
8. Benvenisty N
(2000) Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers
Molecular Medicine 6:88–95.

https://doi.org/10.1007/BF03401776
- PubMed
- Google Scholar
Software
1. Janssen P
(2024) Primate embryoid body analysis, version swh:1:rev:f6bd4b033bf4b167d6d5370543661c19e2a17e3d
Software Heritage.

https://archive.softwareheritage.org/swh:1:dir:fb950161f8c0871e77430f2f7f526b842bfbcc41;origin=https://github.com/Hellmann-Lab/EB-analyses;visit=swh:1:snp:52f6a9fb7a710af5cfa432c2f9e588c4b16e365b;anchor=swh:1:rev:f6bd4b033bf4b167d6d5370543661c19e2a17e3d
1. Jocher J
2. Edenhofer FC
3. Janssen P
4. Müller S
5. Lopez-Parra DC
6. Geuder J
7. Enard W
(2024) Generation and characterization of three fibroblast-derived Rhesus Macaque induced pluripotent stem cell lines
Stem Cell Research 74:103277.

https://doi.org/10.1016/j.scr.2023.103277
- PubMed
- Google Scholar
(2014) Evolutionary conservation of long non-coding RNAs; sequence, structure, function
Biochimica et Biophysica Acta 1840:1063–1071.

https://doi.org/10.1016/j.bbagen.2013.10.035
- PubMed
- Google Scholar
1. Kanton S
2. Boyle MJ
3. He Z
4. Santel M
5. Weigert A
6. Sanchís-Calleja F
7. Guijarro P
8. Sidow L
9. Fleck JS
10. Han D
11. Qian Z
12. Heide M
13. Huttner WB
14. Khaitovich P
15. Pääbo S
16. Treutlein B
17. Camp JG
(2019) Organoid single-cell genomic atlas uncovers human-specific features of brain development
Nature 574:418–422.

https://doi.org/10.1038/s41586-019-1654-9
- PubMed
- Google Scholar
1. Kawase S
2. Kuwako K
3. Imai T
4. Renault-Mihara F
5. Yaguchi K
6. Itohara S
7. Okano H
(2014) Regulatory factor X transcription factors control Musashi1 transcription in mouse neural stem/progenitor cells
Stem Cells and Development 23:2250–2261.

https://doi.org/10.1089/scd.2014.0219
- PubMed
- Google Scholar
1. Kim J
2. Lo L
3. Dormand E
4. Anderson DJ
(2003) SOX10 maintains multipotency and inhibits neuronal differentiation of neural crest stem cells
Neuron 38:17–31.

https://doi.org/10.1016/s0896-6273(03)00163-6
- PubMed
- Google Scholar
1. Kliesmete Z
2. Orchard P
3. Lee VYK
4. Geuder J
5. Krauß SM
6. Ohnuki M
7. Jocher J
8. Vieth B
9. Enard W
10. Hellmann I
(2024) Evidence for compensatory evolution within pleiotropic regulatory elements
Genome Research 34:1528–1539.

https://doi.org/10.1101/gr.279001.124
- PubMed
- Google Scholar
1. Klim JR
2. Williams LA
3. Limone F
4. Guerra San Juan I
5. Davis-Dusenbery BN
6. Mordes DA
7. Burberry A
8. Steinbaugh MJ
9. Gamage KK
10. Kirchner R
11. Moccia R
12. Cassel SH
13. Chen K
14. Wainger BJ
15. Woolf CJ
16. Eggan K
(2019) ALS-implicated protein TDP-43 sustains levels of STMN2, a mediator of motor neuron growth and repair
Nature Neuroscience 22:167–179.

https://doi.org/10.1038/s41593-018-0300-4
- PubMed
- Google Scholar
1. Ko T
2. Nomura S
3. Yamada S
4. Fujita K
5. Fujita T
6. Satoh M
7. Oka C
8. Katoh M
9. Ito M
10. Katagiri M
11. Sassa T
12. Zhang B
13. Hatsuse S
14. Yamada T
15. Harada M
16. Toko H
17. Amiya E
18. Hatano M
19. Kinoshita O
20. Nawata K
21. Abe H
22. Ushiku T
23. Ono M
24. Ikeuchi M
25. Morita H
26. Aburatani H
27. Komuro I
(2022) Cardiac fibroblasts regulate the development of heart failure via Htra3-TGF-β-IGFBP7 axis
Nature Communications 13:3275.

https://doi.org/10.1038/s41467-022-30630-y
- PubMed
- Google Scholar
1. Korsunsky I
2. Millard N
3. Fan J
4. Slowikowski K
5. Zhang F
6. Wei K
7. Baglaenko Y
8. Brenner M
9. Loh PR
10. Raychaudhuri S
(2019) Fast, sensitive and accurate integration of single-cell data with Harmony
Nature Methods 16:1289–1296.

https://doi.org/10.1038/s41592-019-0619-0
- PubMed
- Google Scholar
1. Krienen FM
2. Goldman M
3. Zhang Q
4. C. H. del Rosario R
5. Florio M
6. Machold R
7. Saunders A
8. Levandowski K
9. Zaniewski H
10. Schuman B
11. Wu C
12. Lutservitz A
13. Mullally CD
14. Reed N
15. Bien E
16. Bortolin L
17. Fernandez-Otero M
18. Lin JD
19. Wysoker A
20. Nemesh J
21. Kulp D
22. Burns M
23. Tkachev V
24. Smith R
25. Walsh CA
26. Dimidschstein J
27. Rudy B
28. S. Kean L
29. Berretta S
30. Fishell G
31. Feng G
32. McCarroll SA
(2020) Innovations present in the primate interneuron repertoire
Nature 586:262–269.

https://doi.org/10.1038/s41586-020-2781-z
- Google Scholar
1. Krueger WH
2. Tanasijevic B
3. Barber V
4. Flamier A
5. Gu X
6. Manautou J
7. Rasmussen TP
(2013) Cholesterol-secreting and statin-responsive hepatocytes from human ES and iPS cells to model hepatic involvement in cardiovascular health
PLOS ONE 8:e67296.

https://doi.org/10.1371/journal.pone.0067296
- PubMed
- Google Scholar
1. Lavon N
2. Benvenisty N
(2005) Study of hepatocyte differentiation using embryonic stem cells
Journal of Cellular Biochemistry 96:1193–1202.

https://doi.org/10.1002/jcb.20590
- PubMed
- Google Scholar
1. Ling W
2. Zhang W
3. Cheng B
4. Wei Y
(2021) Zero-inflated quantile rank-score based test (ziqrank) with application to scrna-seq differential gene expression analysis
The Annals of Applied Statistics 15:1673–1696.

https://doi.org/10.1214/21-aoas1442
- PubMed
- Google Scholar
1. Liu X
2. Shen Q
3. Zhang S
(2023) Cross-species cell-type assignment from single-cell RNA-seq data by a heterogeneous graph neural network
Genome Research 33:96–111.

https://doi.org/10.1101/gr.276868.122
- PubMed
- Google Scholar
1. Lodato MA
2. Ng CW
3. Wamstad JA
4. Cheng AW
5. Thai KK
6. Fraenkel E
7. Jaenisch R
8. Boyer LA
(2013) SOX2 co-occupies distal enhancer elements with distinct POU factors in ESCs and NPCs to specify cell state
PLOS Genetics 9:e1003288.

https://doi.org/10.1371/journal.pgen.1003288
- PubMed
- Google Scholar
1. Loh Y-H
2. Wu Q
3. Chew J-L
4. Vega VB
5. Zhang W
6. Chen X
7. Bourque G
8. George J
9. Leong B
10. Liu J
11. Wong K-Y
12. Sung KW
13. Lee CWH
14. Zhao X-D
15. Chiu K-P
16. Lipovich L
17. Kuznetsov VA
18. Robson P
19. Stanton LW
20. Wei C-L
21. Ruan Y
22. Lim B
23. Ng H-H
(2006) The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells
Nature Genetics 38:431–440.

https://doi.org/10.1038/ng1760
- PubMed
- Google Scholar
1. Ludwig TE
2. Andrews PW
3. Barbaric I
4. Benvenisty N
5. Bhattacharyya A
6. Crook JM
7. Daheron LM
8. Draper JS
9. Healy LE
10. Huch M
11. Inamdar MS
12. Jensen KB
13. Kurtz A
14. Lancaster MA
15. Liberali P
16. Lutolf MP
17. Mummery CL
18. Pera MF
19. Sato Y
20. Shimasaki N
21. Smith AG
22. Song J
23. Spits C
24. Stacey G
25. Wells CA
26. Zhao T
27. Mosher JT
(2023) ISSCR standards for the use of human stem cells in basic research
Stem Cell Reports 18:1744–1752.

https://doi.org/10.1016/j.stemcr.2023.08.003
- PubMed
- Google Scholar
(2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts
Genome Biology 17:1–14.

https://doi.org/10.1186/s13059-016-0947-7
- Google Scholar
(2018) Integrated analysis of motif activity and gene expression changes of transcription factors
Genome Research 28:243–255.

https://doi.org/10.1101/gr.227231.117
- PubMed
- Google Scholar
(2016) The role of EpCAM in physiology and pathology of the epithelium
Histology and Histopathology 31:349–355.

https://doi.org/10.14670/HH-11-678
- PubMed
- Google Scholar
1. Mollaaghababa R
2. Pavan WJ
(2003) The importance of having your SOX on: role of SOX10 in the development of neural crest-derived melanocytes and glia
Oncogene 22:3024–3034.

https://doi.org/10.1038/sj.onc.1206442
- PubMed
- Google Scholar
1. Mononen MM
2. Leung CY
3. Xu J
4. Chien KR
(2020) Trajectory mapping of human embryonic stem cell cardiogenesis reveals lineage branch points and an ISL1 progenitor-derived cardiac fibroblast lineage
Stem Cells 38:1267–1278.

https://doi.org/10.1002/stem.3236
- PubMed
- Google Scholar
1. Moon KR
2. van Dijk D
3. Wang Z
4. Gigante S
5. Burkhardt DB
6. Chen WS
7. Yim K
8. van den Elzen A
9. Hirn MJ
10. Coifman RR
11. Ivanova NB
12. Wolf G
13. Krishnaswamy S
(2019) Visualizing structure and transitions in high-dimensional biological data
Nature Biotechnology 37:1482–1492.

https://doi.org/10.1038/s41587-019-0336-3
- PubMed
- Google Scholar
1. Mori K
2. Muto Y
3. Kokuzawa J
4. Yoshioka T
5. Yoshimura S
6. Iwama T
7. Okano Y
8. Sakai N
(2004) Neuronal protein NP25 interacts with F-actin
Neuroscience Research 48:439–446.

https://doi.org/10.1016/j.neures.2003.12.012
- PubMed
- Google Scholar
1. Muhl L
2. Mocci G
3. Pietilä R
4. Liu J
5. He L
6. Genové G
7. Leptidis S
8. Gustafsson S
9. Buyandelger B
10. Raschperger E
11. Hansson EM
12. Björkegren JLM
13. Vanlandewijck M
14. Lendahl U
15. Betsholtz C
(2022) A single-cell transcriptomic inventory of murine smooth muscle cells
Developmental Cell 57:2426–2443.

https://doi.org/10.1016/j.devcel.2022.09.015
- PubMed
- Google Scholar
1. Murphy M
2. Bernard O
3. Reid K
4. Bartlett PF
(1991) Cell lines derived from mouse neural crest are representative of cells at various stages of differentiation
Journal of Neurobiology 22:522–535.

https://doi.org/10.1002/neu.480220508
- PubMed
- Google Scholar
1. Närvä E
2. Rahkonen N
3. Emani MR
4. Lund R
5. Pursiheimo JP
6. Nästi J
7. Autio R
8. Rasool O
9. Denessiouk K
10. Lähdesmäki H
11. Rao A
12. Lahesmaa R
(2012) RNA-binding protein L1TD1 interacts with LIN28 via RNA and is required for human embryonic stem cell self-renewal and cancer cell proliferation
STEM CELLS 30:452–460.

https://doi.org/10.1002/stem.1013
- PubMed
- Google Scholar
1. Nguyen QH
2. Lukowski SW
3. Chiu HS
4. Senabouth A
5. Bruxner TJC
6. Christ AN
7. Palpant NJ
8. Powell JE
(2018) Single-cell RNA-seq of human induced pluripotent stem cells reveals cellular heterogeneity and cell state transitions between subpopulations
Genome Research 28:1053–1066.

https://doi.org/10.1101/gr.223925.117
- PubMed
- Google Scholar
1. Oikawa T
2. Otsuka Y
3. Onodera Y
4. Horikawa M
5. Handa H
6. Hashimoto S
7. Suzuki Y
8. Sabe H
(2018) Necessity of p53-binding to the CDH1 locus for its expression defines two epithelial cell types differing in their integrity
Scientific Reports 8:1595.

https://doi.org/10.1038/s41598-018-20043-7
- PubMed
- Google Scholar
1. Pascal LE
2. True LD
3. Campbell DS
4. Deutsch EW
5. Risk M
6. Coleman IM
7. Eichner LJ
8. Nelson PS
9. Liu AY
(2008) Correlation of mRNA and protein levels: cell type-specific gene expression of cluster designation antigens in the prostate
BMC Genomics 9:246.

https://doi.org/10.1186/1471-2164-9-246
- PubMed
- Google Scholar
1. Peng WC
2. Logan CY
3. Fish M
4. Anbarchian T
5. Aguisanda F
6. Álvarez-Varela A
7. Wu P
8. Jin Y
9. Zhu J
10. Li B
11. Grompe M
12. Wang B
13. Nusse R
(2018) Inflammatory cytokine TNFα promotes the long-term expansion of primary hepatocytes in 3D culture
Cell 175:1607–1619.

https://doi.org/10.1016/j.cell.2018.11.012
- PubMed
- Google Scholar
1. Regev A
2. Teichmann SA
3. Lander ES
4. Amit I
5. Benoist C
6. Birney E
7. Bodenmiller B
8. Campbell P
9. Carninci P
10. Clatworthy M
11. Clevers H
12. Deplancke B
13. Dunham I
14. Eberwine J
15. Eils R
16. Enard W
17. Farmer A
18. Fugger L
19. Göttgens B
20. Hacohen N
21. Haniffa M
22. Hemberg M
23. Kim S
24. Klenerman P
25. Kriegstein A
26. Lein E
27. Linnarsson S
28. Lundberg E
29. Lundeberg J
30. Majumder P
31. Marioni JC
32. Merad M
33. Mhlanga M
34. Nawijn M
35. Netea M
36. Nolan G
37. Pe’er D
38. Phillipakis A
39. Ponting CP
40. Quake S
41. Reik W
42. Rozenblatt-Rosen O
43. Sanes J
44. Satija R
45. Schumacher TN
46. Shalek A
47. Shapiro E
48. Sharma P
49. Shin JW
50. Stegle O
51. Stratton M
52. Stubbington MJT
53. Theis FJ
54. Uhlen M
55. van Oudenaarden A
56. Wagner A
57. Watt F
58. Weissman J
59. Wold B
60. Xavier R
61. Yosef N
62. Human Cell Atlas Meeting Participants
(2017) The Human Cell Atlas
eLife 6:27041.

https://doi.org/10.7554/eLife.27041
- Google Scholar
1. Rhodes K
2. Barr KA
3. Popp JM
4. Strober BJ
5. Battle A
6. Gilad Y
(2022) Human embryoid bodies as a novel system for genomic studies of functionally diverse cell types
eLife 11:e71361.

https://doi.org/10.7554/eLife.71361
- Google Scholar
(2024) Single-cell analyses offer insights into the different remodeling programs of arteries and veins
Cells 13:793.

https://doi.org/10.3390/cells13100793
- PubMed
- Google Scholar
1. Schüller U
2. Kho AT
3. Zhao Q
4. Ma Q
5. Rowitch DH
(2006) Cerebellar “transcriptome” reveals cell-type and stage-specific expression during postnatal development and tumorigenesis
Molecular and Cellular Neurosciences 33:247–259.

https://doi.org/10.1016/j.mcn.2006.07.010
- PubMed
- Google Scholar
1. Shumate A
2. Salzberg SL
(2021) Liftoff: accurate mapping of gene annotations
Bioinformatics 37:1639–1643.

https://doi.org/10.1093/bioinformatics/btaa1016
- PubMed
- Google Scholar
(2023) Benchmarking strategies for cross-species integration of single-cell RNA sequencing data
Nature Communications 14:6495.

https://doi.org/10.1038/s41467-023-41855-w
- Google Scholar
1. Street K
2. Risso D
3. Fletcher RB
4. Das D
5. Ngai J
6. Yosef N
7. Purdom E
8. Dudoit S
(2018) Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
BMC Genomics 19:477.

https://doi.org/10.1186/s12864-018-4772-0
- PubMed
- Google Scholar
1. Sullivan PF
2. Meadows JRS
3. Gazal S
4. Phan BN
5. Li X
6. Genereux DP
7. Dong MX
8. Bianchi M
9. Andrews G
10. Sakthikumar S
11. Nordin J
12. Roy A
13. Christmas MJ
14. Marinescu VD
15. Wang C
16. Wallerman O
17. Xue J
18. Yao S
19. Sun Q
20. Szatkiewicz J
21. Wen J
22. Huckins LM
23. Lawler A
24. Keough KC
25. Zheng Z
26. Zeng J
27. Wray NR
28. Li Y
29. Johnson J
30. Chen J
31. Zoonomia Consortium§
32. Paten B
33. Reilly SK
34. Hughes GM
35. Weng Z
36. Pollard KS
37. Pfenning AR
38. Forsberg-Nilsson K
39. Karlsson EK
40. Lindblad-Toh K
(2023) Leveraging base-pair mammalian constraint to understand genetic variation and human disease
Science 380:eabn2937.

https://doi.org/10.1126/science.abn2937
- PubMed
- Google Scholar
1. Suresh H
2. Crow M
3. Jorstad N
4. Hodge R
5. Lein E
6. Dobin A
7. Bakken T
8. Gillis J
(2023) Comparative single-cell transcriptomic analysis of primate brains highlights human-specific regulatory evolution
Nature Ecology & Evolution 7:1930–1943.

https://doi.org/10.1038/s41559-023-02186-7
- PubMed
- Google Scholar
1. Tachampa K
2. Wongtawan T
(2020) Unique patterns of cardiogenic and fibrotic gene expression in rat cardiac fibroblasts
Veterinary World 13:1697–1708.

https://doi.org/10.14202/vetworld.2020.1697-1708
- PubMed
- Google Scholar
Preprint
1. Tan L
2. Shi J
3. Moghadami S
4. Wright CP
5. Parasar B
6. Seo Y
7. Vallejo K
8. Cobos I
9. Duncan L
10. Chen R
11. Deisseroth K
(2023) Cerebellar granule cells develop non-neuronal 3D genome architecture over the lifespan
bioRxiv.

https://doi.org/10.1101/2023.02.25.530020
- Google Scholar
(2018) Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris
Nature 562:367–372.

https://doi.org/10.1038/s41586-018-0590-4
- Google Scholar
1. Tseng TC
2. Hsieh FY
3. Dai NT
4. Hsu SH
(2016) Substrate-mediated reprogramming of human fibroblasts into neural crest stem-like cells and their applications in neural repair
Biomaterials 102:148–161.

https://doi.org/10.1016/j.biomaterials.2016.06.020
- PubMed
- Google Scholar
1. Wang J
2. Sun H
3. Jiang M
4. Li J
5. Zhang P
6. Chen H
7. Mei Y
8. Fei L
9. Lai S
10. Han X
11. Song X
12. Xu S
13. Chen M
14. Ouyang H
15. Zhang D
16. Yuan G-C
17. Guo G
(2021) Tracing cell-type evolution by cross-species comparison of cell atlases
Cell Reports 34:108803.

https://doi.org/10.1016/j.celrep.2021.108803
- Google Scholar
1. Ware M
2. Hamdi-Rozé H
3. Le Friec J
4. David V
5. Dupé V
(2016) Regulation of downstream neuronal genes by proneural transcription factors during initial neurogenesis in the vertebrate brain
Neural Development 11:22.

https://doi.org/10.1186/s13064-016-0077-7
- PubMed
- Google Scholar
(2010) A similarity measure for indefinite rankings
ACM Transactions on Information Systems 28:1–38.

https://doi.org/10.1145/1852102.1852106
- Google Scholar
1. Xing T
2. Benderman LJ
3. Sabu S
4. Parker J
5. Yang J
6. Lu Q
7. Ding L
8. Chen YH
(2020) Tight junction protein claudin-7 is essential for intestinal epithelial stem cell self-renewal and differentiation
Cellular and Molecular Gastroenterology and Hepatology 9:641–659.

https://doi.org/10.1016/j.jcmgh.2019.12.005
- PubMed
- Google Scholar
1. Zhang X
2. Lan Y
3. Xu J
4. Quan F
5. Zhao E
6. Deng C
7. Luo T
8. Xu L
9. Liao G
10. Yan M
11. Ping Y
12. Li F
13. Shi A
14. Bai J
15. Zhao T
16. Li X
17. Xiao Y
(2019a) CellMarker: a manually curated resource of cell markers in human and mouse
Nucleic Acids Research 47:D721–D728.

https://doi.org/10.1093/nar/gky900
- PubMed
- Google Scholar
1. Zhang Z
2. Luo D
3. Zhong X
4. Choi JH
5. Ma Y
6. Wang S
7. Mahrt E
8. Guo W
9. Stawiski EW
10. Modrusan Z
11. Seshagiri S
12. Kapur P
13. Hon GC
14. Brugarolas J
15. Wang T
(2019b) SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples
Genes 10:531.

https://doi.org/10.3390/genes10070531
- Google Scholar
1. Ziller MJ
2. Edri R
3. Yaffe Y
4. Donaghey J
5. Pop R
6. Mallard W
7. Issner R
8. Gifford CA
9. Goren A
10. Xing J
11. Gu H
12. Cacchiarelli D
13. Tsankov AM
14. Epstein C
15. Rinn JL
16. Mikkelsen TS
17. Kohlbacher O
18. Gnirke A
19. Bernstein BE
20. Elkabetz Y
21. Meissner A
(2015) Dissecting neural differentiation regulatory networks through epigenetic footprinting
Nature 518:355–359.

https://doi.org/10.1038/nature13990
- Google Scholar

Article and author information

Author details

Jessica Jocher

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Project administration

Contributed equally with
Philipp Janssen

Competing interests
No competing interests declared

Philipp Janssen

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing

Contributed equally with
Jessica Jocher

Competing interests
No competing interests declared

ORCID iD: 0000-0002-3167-7503

Beate Vieth

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Conceptualization, Software, Formal analysis, Supervision, Investigation, Writing – original draft

Competing interests
No competing interests declared

Fiona C Edenhofer

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Data curation, Methodology

Competing interests
No competing interests declared

ORCID iD: 0000-0001-6983-2938

Tamina Dietl

Helmholtz Zentrum München, Deutsches Forschungszentrum für Gesundheit und Umwelt, Munich, Germany

Contribution
Formal analysis, Methodology

Competing interests
No competing interests declared

ORCID iD: 0009-0000-4126-2603

Anita Térmeg

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Methodology

Competing interests
No competing interests declared

ORCID iD: 0009-0005-8872-9086

Paulina Spurk

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Methodology

Competing interests
No competing interests declared

ORCID iD: 0000-0001-8682-370X

Johanna Geuder

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Methodology

Competing interests
No competing interests declared

Wolfgang Enard

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Conceptualization, Supervision, Funding acquisition, Project administration

Contributed equally with
Ines Hellmann

For correspondence
enard@bio.lmu.de

Competing interests
No competing interests declared

ORCID iD: 0000-0002-4056-0550

Ines Hellmann

Anthropology and Human Genomics, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Germany

Contribution
Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review and editing

Contributed equally with
Wolfgang Enard

For correspondence
hellmann@bio.lmu.de

Competing interests
No competing interests declared

ORCID iD: 0000-0003-0588-1313

Funding

Deutsche Forschungsgemeinschaft (458247426)

Wolfgang Enard
Ines Hellmann

Deutsche Forschungsgemeinschaft (458888224)

Wolfgang Enard
Ines Hellmann

Deutsche Forschungsgemeinschaft (407541155)

Ines Hellmann

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank all members of the Enard/Hellmann group for valuable input and discussions. We are grateful to Stefanie Färberböck for her expert technical assistance and help with cell culture. We acknowledge the Core Facility Flow Cytometry at the Biomedical Center, Ludwig-Maximilians-Universität München, for providing equipment and services. We thank Dr. Stefan Krebs and the staff of LAFUGA and the NGS Competence Center Tübingen (NCCT) for sequencing services. This work was supported by the Deutsche Forschungsgemeinschaft (DFG): PJ and JJ, as well as the majority of the project costs, were funded by a grant to IH and WE (458247426). BV was funded by the grant to IH (407541155) and FE by a grant to WE (458888224).

Version history

Sent for peer review: December 13, 2024
Preprint posted: March 18, 2025
Reviewed Preprint version 1: March 24, 2025
Reviewed Preprint version 2: March 6, 2026
Version of Record published: April 8, 2026

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.105398. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.