Discovering sparse transcription factor codes for cell states and state transitions during development
Abstract
Computational analysis of gene expression to determine both the sequence of lineage choices made by multipotent cells and to identify the genes influencing these decisions is challenging. Here we discover a pattern in the expression levels of a sparse subset of genes among cell types in B and Tcell developmental lineages that correlates with developmental topologies. We develop a statistical framework using this pattern to simultaneously infer lineage transitions and the genes that determine these relationships. We use this technique to reconstruct the early hematopoietic and intestinal developmental trees. We extend this framework to analyze singlecell RNAseq data from early human cortical development, inferring a neocorticalhindbrain split in early progenitor cells and the key genes that could control this lineage decision. Our work allows us to simultaneously infer both the identity and lineage of cell types as well as a small set of key genes whose expression patterns reflect these relationships.
https://doi.org/10.7554/eLife.20488.001Introduction
During development, pluripotent cells make a series of lineage decisions to give rise to the different cell types of the body. These lineage decisions are controlled by intracellular molecular networks that include transcription factors and signaling molecules. There are two fundamental challenges associated with understanding the differentiation of individual cells. The first is to identify lineage relationships: how cells and their progeny move from pluripotent through intermediate to terminally differentiated cell states. The second is to identify the key molecular drivers that allow cells to make fate decisions along their developmental trajectory.
Reconstructing cell lineages has traditionally involved prospectively tracking cells and their progeny using a variety of imaging or genetic tools (Buckingham and Meilhac, 2011; Frumkin et al., 2008; Orkin and Zon, 2008; Sulston et al., 1983). Recent progress in singlecell sequencing techniques (Grün et al., 2015; Jaitin et al., 2014; Macosko et al., 2015; Patel et al., 2014; Paul et al., 2015; Treutlein et al., 2014; Zeisel et al., 2015) allows for a complementary view of the transcriptional states of individual cells during the course of development, providing static snapshots of the dynamics of the underlying molecular network. But inferring lineage relationships and the dynamics of the underlying molecular networks has proved difficult using transcriptional data alone, in part because of the high dimensional nature of these data. Overcoming this challenge would be particularly useful for understanding the development of human organs such as the brain where traditional lineage tracing experiments are more difficult.
High dimensional data analysis techniques such as PCA, ICA, or tSNE, are useful at reducing dimensionality. However, since the resulting axes represent linear combinations of a large number of features (for example, expression levels of each gene), interpreting the analysis or making experimental predictions is sometimes challenging. Meanwhile, traditional statistical methods such as linear multivariate regression have limited applicability for detecting patterns in high dimensional data (Advani and Ganguli, 2016; Donoho and Tanner, 2009). The challenges inherent in highdimensional data analysis such as identifying discriminatory features are further exacerbated as the fraction of relevant features decreases (Donoho and Tanner, 2009). Computational techniques currently in use to cluster single cell data or to infer relationships among cells are built on these approaches, thereby assuming that all highvariance genes are equally relevant for pattern detection (Marco et al., 2014; Satija et al., 2015; Trapnell et al., 2014). In contrast, decades of work in developmental biology have revealed that combinations of a few transcription factors can be sufficient to experimentally perturb cell fate and developmental decisions (Gilbert, 2014; Graf and Enver, 2009; Takahashi and Yamanaka, 2006) suggesting that the expression patterns of a few genes may be most relevant for making computational inferences. Therefore, there is a need to detect patterns involving a small fraction of all genes. Unfortunately, except in the case of wellstudied lineage decisions, we do not know the identity of this fraction.
In statistics, techniques relying on L1 regularization have been successful in contexts where the number of informative variables is known to be small but whose identities are unknown, both for regression problems (Baraniuk, 2007; Candès et al., 2006; Tibshirani, 1996; Wainwright, 2009) and for clustering (Witten and Tibshirani, 2010). Inspired by these successes in statistics, our aim here is to discover generalizable sparse patterns in gene expression data during development (if they exist), and to exploit these patterns to computationally infer the dynamics of cell state transitions from highdimensional transcriptional data obtained during the course of development.
In this manuscript we analyze expression patterns among cell types with known lineage relationships in late hematopoiesis and discover a pattern in a sparse subset of genes that correlates with these relationships. We develop a Bayesian framework based on this gene expression pattern to simultaneously infer lineage transitions and the key genes that drive them. We apply this method to reconstruct the lineage tree among a different set of cell types in early hematopoietic development, and in this process identify many known drivers of early hematopoiesis, including Gata1, Cebpa and Ebf1. We further extend our method to analyze singlecell gene expression data, using genes exhibiting the discovered pattern to cluster cells from early brain development and to infer lineage relationships between these clusters. Our analysis reveals a split from early progenitors to putative neocortex and mid/hindbrain cell types, as evidenced by the mutually exclusive expression of regionspecific genes such as FOXG1, LHX1, and POU3F2 (BRN2). This prediction was validated experimentally in a separate work (Yao et al., 2017). We finally discuss the advantages of using sparse patterns for making inferences and for modeling the underlying gene regulatory networks.
Results
Discovering sparse patterns correlated with lineage transitions
In order to identify gene expression patterns that are robustly predictive of lineage relationships, we analyzed gene expression data from 41 cell types during B and T cell development that have an experimentally established developmental lineage (Figure 1A, Heng et al., 2008). We searched for sparse patterns of gene expression amongst groups of three cell types from this collection; subsets of three are the minimal set in which measures of relative similarity can be used infer relative lineage relationships.

Figure 1—source data 1
 https://doi.org/10.7554/eLife.20488.003

Figure 1—source data 2
 https://doi.org/10.7554/eLife.20488.004
We identified 150 triplets of cell types with experimentally verified lineage relationships from B and T cell development (Heng et al., 2008) (Figure 1—source data 1). Three such triplets are shown in Figure 1A. These triplets constituted both cell fate decisions (for example, cell type A gives rise to cell type B and C) and lineage progressions (cell type B gives rise to cell type A which then gives rise to cell type C). For each triplet, we noted which cell type was the progenitor or intermediate cell type (‘root’ cell type A) and which cell types were not (‘leaf’ cell types B and C). Note that even in the case in which cell type B gives rise to cell type A which gives rise to cell type C, we will refer to A as the ‘root’ and B and C as ‘leaves’ for that particular triplet. We analyzed transcription factor gene expression data for these triplets from the Immunological Genome Consortium (Heng et al., 2008) since transcription factors are the ultimate drivers of cell fate decisions.
Surprisingly, we found that specific expression patterns involving only single genes correlated well with lineage relationships between three cell types. Genes that are not differentially expressed within a triplet of cell types convey no information about relationships between the cell types. Therefore, we select genes with expression variability among the three cell types. The expression pattern of such genes can belong to one of only two possible patterns: (Figure 1B): the clear minimum pattern: the gene has a clear minimum level in one of the three cell types, with its distribution of gene expression levels being wellseparated from the other two (p<0.005 in both two sample ttests between the minimum cell type and the other two cell types; Figure 1—figure supplement 1A); the clear maximum pattern: the gene has a clear maximum level in one of the cell types (p<0.005 in both sample ttests; Figure 1—figure supplement 1B). Note that both patterns can be satisfied simultaneously if the distribution of expression levels in the three cell types are all wellseparated with a clear maximum and minimum. We tested if either of the two patterns correlated with the lineage topologies between the three cell types with known lineage relationships (Figure 1C, Figure 1—figure supplement 1C).
Triplets of cell types can be separated into categories based on how many genes exhibit the aforementioned patterns. 56% of the triplets contained more than 10 genes exhibiting the clear minimum pattern, and in 100% of these triplets the majority of genes with expression fitting this pattern reached their minimum expression in one of the leaves (Figure 1C) and never in the root of the triplet. The frequency with which the pattern correctly indicated the lineage relationship increased with the number of genes within a triplet exhibiting the pattern, thus suggesting a confidence measure. The genes showing the clear minimum pattern fell into two distinct groups, corresponding to whether the minimum expression level was in one or the other of the two leaves (Figure 1—figure supplement 1A; Figure 1—figure supplement 2C). Thus the expression pattern of the total set of clear minimum genes correlated with the topology. Since genes showing the clear minimum pattern correlated with lineage relationships (Figure 1C) between cell states both in the case of branches and linear sequences of cell state transitions, we refer to them as transition genes.
We further verified that the clear minimum pattern could be observed (a) in the set of all 25,194 genes (Figure 1—figure supplement 2A), (b) using FDRadjusted pvalues (Benjamini and Hochberg, 1995) (Figure 1—figure supplement 2B), (c) in triplets of different lengths along the lineage tree (Figure 1—figure supplement 2D), (d) in triplets both containing only internal nodes and including terminal nodes (Figure 1—figure supplement 2E), (e) and in both lineage progression and cell fate decision triplets (Figure 1—figure supplement 2F).
The clear maximum pattern was a poorer indicator of lineage relationships (Figure 1—figure supplement 1C). 83% of the triplets had more than 10 genes exhibiting this pattern, but 10% of those showed the majority of genes with expression fitting this pattern reaching their maximum in the root while the others did so in the leaves. Crucially, the integrity of the relationship between the clear maximum pattern and lineage topology was not correlated with the number of genes exhibiting the pattern (Figure 1—figure supplement 1C). While the clear maximum pattern did not correlate with lineage relationships, genes exhibiting this pattern identify individual cell types, and therefore we will refer to them as marker genes.
There are many examples of genes known to be functionally important for lineage decisions whose expression patterns fit the clear minimum pattern. In the case of lateral inhibition commonly used during development, progenitor cells express genes together (for example, Notch and Delta) which are differentially expressed in the differentiated states (only Notch or only Delta) (Perrimon et al., 2012) reaching the minimum expression level in one of the leaves. The same pattern is also seen in multiple examples of lineage decisions often involving mutual inhibition, where key genes expressed in the progenitor are differentially regulated in the progeny (Graf and Enver, 2009; Qi et al., 2013; Thomson et al., 2011; Zhang et al., 1999). In each of these cases, key genes reach minimal expression levels in one of the leaves of the triplets.
The observation that genes exhibiting the clear minimum pattern are correlated with the lineage topology of a triplet of cell types further revealed that (i) only this fraction of transcription factors can be useful for inferring lineage relationships, and (ii) the identity of this fraction depends on which group of three cell types were analyzed. As the subset of genes that are informative varies based on the triplet of cell types being considered, establishing lineage relationship between all cell types at once as opposed to three at a time could be challenging. Indeed, our attempt to reconstruct the lineage relationships between all cell types using methods based on PCA failed (Figure 1D).
We further evaluated the clear minimum pattern in 100 triplets in which there was no clear relation between the cell types (Figure 1—source data 1). We found that while there were a substantial number of genes exhibiting a clear minimum in one of the three cell types in unrelated triplets, their minima were evenly distributed amongst the cell types. To quantify that the minima were evenly distributed, we counted the fraction of genes f_{i} which reached a clear minimum in cell type i = A, B, C (for A, B, and C unrelated cell types), and for each triplet, we computed the entropy $S={\sum}_{i=A,B,C}{f}_{i}\mathrm{log}\left({f}_{i}\right)$. We compared the distribution of the entropy and the number of genes showing a minimum in any triplet for unrelated and related triplets (Figure 1—figure supplement 3A). The unrelated triplets have higher entropy and typically more genes with a minimum level. This suggested that unrelated triplets show a distinct pattern from the related triplets.
Using patterns to infer lineages
Together, these observations suggested a strategy for inferring the lineage topology between three cell types: each gene showing the clear minimum pattern with a minimum expression level in a particular cell type increases to the probability that this cell type is not the root of the topology (Figure 2A). We next developed a statistical machinery to systematically detect this pattern in gene expression data and to use the resulting sparse subset of genes to infer lineage relationships between three cell types at a time. We then used the inferred relationships between all sets of three cell types as constraints to determine the full developmental lineage tree.

Figure 2—source data 1
 https://doi.org/10.7554/eLife.20488.009

Figure 2—source data 2
 https://doi.org/10.7554/eLife.20488.010

Figure 2—source data 3
 https://doi.org/10.7554/eLife.20488.011
The classifications of genes as transition or marker genes in Figure 1 were based on p<0.005 in a twosample ttest. To implement such classifications probabilistically without arbitrary cutoffs we developed a statistical framework to infer the lineage relationships between each set of three cell types A, B, and C and find the key sets of transition genes (those genes that show the clear minimum pattern), given gene expression data $\left\{{g}_{i}^{A,B,C}\right\}$ in those cell types. We determined the probability of any possible topological relationship between the cell types $T=\{A,B,C,\varnothing \}$ referring to cell type A, B, or C being the root of the triplet, or $\varnothing $which corresponds to the case where the data does not suggest any lineage relationship between the three cell types because either no significant pattern could be detected, or multiple genes exhibiting the minimum pattern suggested conflicting topologies. Rather than an absolute classification of genes as showing a pattern or not, we calculated the probability $p\left({\alpha}_{i}=1\text{}\text{}\text{}{g}_{i}^{A,B,C}\right)$ of each gene $i$ being a marker gene (denoted by ${\alpha}_{i}=1$), i.e. gene $i$ showing the clear maximum pattern, and the probability $p\left({\beta}_{i}=1\text{}\text{}{g}_{i}^{A,B,C}\right)$ of it being a transition gene denoted by ${\beta}_{i}=1$, i.e. gene $i$ showing the clear minimum pattern (Materials and methods).
We derived the probability of a given topology $T$ given the expression data as (Materials and methods):
where, ${\mathcal{\mathcal{O}}}_{i}=p\left({\beta}_{i}=1\text{}\text{}{g}_{i}^{A,B,C}\right)/p\left({\beta}_{i}=0\text{}\text{}{g}_{i}^{A,B,C}\right)$ is the odds of gene $i$ being a transition gene and thus having a unique minimum. The term $p\left({\mu}_{T}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}{g}_{i}^{A,B,C}\right)$ is the probability that the mean ${\mu}_{T}^{i}$ of the distribution of the expression levels of gene i in the root cell type $T$ is less than the mean in the other two cell types. The odds implicitly contains the only free parameter in our analysis, the prior odds $\frac{p\left({\beta}_{i}=1\right)}{p\left({\beta}_{i}=0\right)}$, which defines the number of genes we expect to show the clear minimum pattern a priori, and functions as a sparsity parameter for the inference. Qualitatively, in the above equation, every gene casts a vote $p\left({\mu}_{T}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}{g}_{i}^{A,B,C}\right)$ against the cell type $T$ in which its mean expression is minimal being the root. Further, this vote is weighted by the odds $\mathcal{\mathcal{O}}}_{i$ of gene $i$ being a transition gene. Thus, genes with a clearer minimum pattern get larger votes in determining which cell type is not the root. In practice, these quantities are computed numerically (Materials and methods).
We note further that if a substantial number of genes cast votes against each of the cell types, then the probability of the null topology $\varnothing $ increases. We computed the probability of obtaining the null topology among the 150 related triplets and 100 unrelated triplets from our training set. The distribution of the probability of obtaining the null topology was considerably different between the related triplets and the unrelated triplets, with an AUC of 0.96 (Figure 1—figure supplement 3B–C).
Application to hematopoietic gene expression data
We used our statistical framework to recreate the lineage of early hematopoietic differentiation. We considered 11 early hematopoietic progenitors from the ImmGen Consortium microarray data set (Heng et al., 2008) (Figure 2—source data 1). These cell types and their associated relationships were not included in the data set used earlier to study the correlations of the two patterns and lineage topologies. Several features of the early hematopoietic lineage tree are debated (Adolfsson et al., 2005; Iwasaki and Akashi, 2007) (Figure 2—figure supplement 2A). Given only the gene expression data for these different subpopulations of cells, we determined the lineage relationships and the key factors associated with each lineage decision. We calculated the probabilities of topology and marker and transition genes for the $\left(\begin{array}{c}11\\ 3\end{array}\right)=165$ possible triplets of cell types using our statistical framework (Figure 2—source data 2). To illustrate our method, we first described the analysis of the expression data from two such triplets of cell types: CMP/ST/MPP and MEP/GMP/FrBC (Figure 2B–G). We then assembled the triplets to form an undirected lineage tree (Figure 3; Video 1).

Figure 3—source data 1
 https://doi.org/10.7554/eLife.20488.015
Following Equation 1, each gene votes against the topology whose central node has the minimum expression of that gene among the three cell types, and this vote is weighted by the odds that the gene is a transition gene. To illustrate this for the triplet of cell types CMP, ST and MPP, we plotted the topology each gene voted most against, i.e. the topology $T$ for which $p\left({\mu}_{T}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}{g}_{i}^{CMP,ST,MPP}\right)$ is the maximum, versus the odds $\mathcal{\mathcal{O}}}_{i$ of that gene being a transition gene (Figure 2B).
We find two groups of genes that are much more likely to be transition genes than any of the other genes, with values of ${O}_{i}\sim {10}^{2}$ compared to ${10}^{0}$ at most for other genes (Figure 2B, regular and bold fonts). These two groups of genes have a large value for either $p\left({\mu}_{CMP}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}{g}_{i}^{CMP,ST,MPP}\right)$ or $p\left({\mu}_{MPP}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}{g}_{i}^{CMP,ST,MPP}\right)$ and thus vote against $T=\mathit{C}\mathit{M}\mathit{P}$ (cell type CMP is the intermediate) or against $T=\mathit{M}\mathit{P}\mathit{P}$ (cell type MPP is the intermediate). Together these genes that have a high odds of being transition genes appear to most support topology $T=\mathit{S}\mathit{T}\equiv CMP\mathit{S}\mathit{T}MPP$.
In fact, the intuition in Figure 2B is borne out in the calculation of $p\left(T\text{}\text{}\left\{{g}_{i}^{CMP,ST,MPP}\right\}\right)$. Using Equation 1 above and assuming a sparsity parameter of 0.05, we calculate that there is an 84% chance that the topology is $\mathit{S}\mathit{T}$ (Figure 2C; Figure 2—figure supplement 2B). Although gene Irf8 (Figure 2B, italic font) is strongly downregulated in ST and is expressed at higher levels in CMP and MPP (Figure 2C), we note that its signal is overwhelmed by the large number of genes downregulated in either CMP or MPP, illustrating the statistical nature of the framework.
For each triplet, we evaluated each gene’s probability of being a transition or marker gene (Figure 2—source data 3). Figure 2D shows the names and associated probabilities of the 12 genes most likely to be transition genes for the triplet $\mathrm{C}\mathrm{M}\mathrm{P}\mathbf{S}\mathbf{T}\mathrm{M}\mathrm{P}\mathrm{P}$. The transition genes fall into two groups, corresponding to the two groups in Figure 2B. One group, which includes genes Gata1, Dach1, and Gata2, has higher expression in CMP than in MPP; the other group, which includes Hlf, Tsc22d1, and Hoxa9, has higher expression in MPP. Although the values of the probabilities of the genes being transition genes vary with the value of the sparsity parameter, the relative order of different genes does not change. The genes identified include many genes previously identified as being important for lineage specification (Crispino, 2005; Gazit et al., 2013; Miyawaki et al., 2015). The transition genes we discovered thus not only have gene expression patterns that reflect the lineage decision but also include functionally important genes.
In addition to the transition genes, we identified marker genes ($p\left({\alpha}_{i}=1\text{}\text{}\left\{{g}_{i}\right\},T\right)0.8$) present only in ST (including Mpl and Rai14, consistent with [Solar et al., 1998]) and then symmetrically downregulated in both CMP and MPP (Figure 2—figure supplement 1C). Marker genes for CMP include Srf, Zeb2, Rbpj and Irf8 (consistent with [Goossens et al., 2011; Kurotaki et al., 2013; Ragu et al., 2010; RobertMoreno et al., 2005; Tamura et al., 2000]); marker genes for MPP include Satb1, consistent with (Satoh et al., 2013). Although these genes were not used to determine the topology, they are good markers for cell types ST, CMP and MPP.
Plotting the cell types using the mean expression levels of the two transition gene class captures the fork in the gene expression space associated with the cellfate decision (Figure 2E). In contrast with the PCA analysis of the cell types (Figure 2—figure supplement 1E), in which MPP appears to be an intermediate between the hematopoietic stem cell types (LT and ST) and CMP, the projection of the cell types onto the transitiongene subspace shows that ST splits into CMP and MPP.
In contrast to the case of the triplet of cell types CMP/ST/MPP, for triplet MEP/GMP/FrBC, the distributions of genes supporting different topologies are similar (Figure 2F). Thus the most likely topology calculated using Equation 1 is the null hypothesis (99%), which is that transition genes, if they exist, do not have patterns that depend on the cellular topology (Figure 2G, Figure 2—figure supplement 1D), in which case there is insufficient evidence to classify the triplet according to a particular nonnull topology. The distribution of the maximal probability of nonnull topologies in the different triplets is heavily concentrated near 1, allowing for clear separation between null and nonnull triplets (Figure 2—figure supplement 1E). Null topologies were identified by the algorithm for triplets with cells that are from three terminal nodes (for example, the triplet MEP/GMP/FrBC) or from triplets that contain very distantly related triplets (for example, LT/CLP/ETP) (Figure 2—source data 2).
A lineage tree for early hematopoiesis
We next reconstructed the early hematopoietic lineage tree and identified transition genes involved in the different cell state transitions using all the nonnull triplet relationships as constraints on the lineage relationships between all cell types. Out of the 165 possible triplets of hematopoietic progenitors, 144 showed one single nonnull topology with probability greater than 0.6 over a range of the prior odds from 10^{−6} to 10^{2}. We next determined an undirected graph that recapitulates all of the individual triplet topologies (note that we are only inferring triplet topologies and are not inferring directionality). For example, although triplet CMP/LT/MPP has topology CMP – LT – MPP (Figure 3—figure supplement 1A), we could determine that LT cannot be the direct progenitor of CMP or MPP, because ST is an intermediate between LT and both cell types (Figure 3—figure supplement 1B–C). We could thus 'prune' this triplet when inferring the full graph (Figure 3—figure supplement 1D–E). A visualization of the pruning process is shown in Video 1, where successive triplets are added to the graph, creating new edges and pruning others, leading to the final undirected tree. In practice, though, the pruning process was performed on all triplets simultaneously, not in succession.
The ST/CMP/MPP triplet (Figure 2B–E) immediately distinguishes between two competing models regarding the hierarchy of early hematopoietic progenitors. According to the traditional picture (Iwasaki and Akashi, 2007), MPP is the progenitor of CMP, and ST is the progenitor of MPP – therefore MPP should be an intermediate between ST and CMP and the topology of triplet ST/MPP/CMP should be ST – MPP – CMP (Figure 2—figure supplement 1A, left). According to a model suggested by Adolfsson and colleagues (Adolfsson et al., 2005), ST splits into CMP and MPP (Figure 2—figure supplement 1A, right), and the topology should be CMP – ST – MPP. We identify both CMP – ST – MPP and CMP – LT – MPP as the correct topologies, lending support to the Adolfsson model. The Adolfsson and traditional models differ in the topology of 9 triplets. The inferred expected topologies of 8 out of these nine triplets support the Adolfsson model, which led to the identification of the final tree (Figure 2—figure supplement 2).
The lineage tree that we determined is consistent with the Adolfsson model and contains three additional lineage decisions (Figure 3A–B). First, CMP gives rise to MEP (megakaryocyte/erythroid progenitor) and GMP (granulocyte/macrophage progenitor). Second, MPP gives rise to MLP (multilineage progenitor), which then splits into the GMP and CLP (common lymphoid progenitor) cell types. In the final lineage decision, CLP gives rise to the ETP (preT) and FrA (preproB) cell types.
For each triplet of cell types along the tree, we identified transition and marker classes of genes. Among the 14 triplets that contained only adjacent cell types, we identified on average 24 marker genes per cell type and 25 transition genes (probability threshold of 0.8, prior odds $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)\text{}=0.05$). Many genes we discovered as belonging with high probability to the transition and marker classes of genes at each lineage decision are known in the literature to be functionally important genes, including classic hematopoietic regulators such as Cebpa, Sfpi1, Gata1, Satb1, Irf8 and Ebf1 (see full tables with references in Figure 3C and Figure 3—source data 1). Additionally, the genes identified include many genes successfully used in hematopoietic reprogramming experiments, including Gata2 and Pbx1 (Figure 3C). Together these observations suggest that the sparse subspace of transition and marker genes identified by our framework not only allows for accurate reconstruction of the lineage hierarchy but also constitutes a set of candidates for relevant biological functions.
As further validation of the inference method, we compared it to the method proposed by Grun et al. on a singlecell intestinal development data set (Grün et al., 2015, 2016). We inferred lineages between each cell type based on their cluster identifications, excluding clusters with fewer than 10 cells, and constructed an undirected lineage tree by taking triplets with probability > 0.6 and applying the pruning rule (Figure 3—figure supplement 2A). The only disagreement between the two methods is the progression from crypt base columnar cells (C_{2}) to different populations of Goblet cells (C_{4} and C_{8}). Grun et al. hypothesize a C_{2} – C_{8} – C_{4} progression, while we infer the triplet C_{8} – C_{2} – C_{4} with p>0.99, suggesting that the progenitor C_{2} gives rise to both differentiated Goblet subpopulations. Both lineage trees are supported by the literature (van der Flier and Clevers, 2009). The high probability transition genes included many factors well known for their roles in tissue homeostasis and development (Figure 3—figure supplement 2B), notably Klf4 (Yu et al., 2012), Atoh1 (VanDussen and Samuelson, 2010), Spdef (Noah et al., 2010), Foxa1/Foxa2 (Ye and Kaestner, 2009), and Tcf3 (Merrill et al., 2001).
Inferred lineage tree for human excitatory neuronal progenitors from in vitro singlecell data over 80 days of differentiation
The ease with which singlecell transcriptomic data can be generated (Grün et al., 2015; Jaitin et al., 2014; Macosko et al., 2015; Patel et al., 2014; Paul et al., 2015; Treutlein et al., 2014; Zeisel et al., 2015) presents an opportunity to understand the dynamics of the underlying networks that lead individual cells to their final fate. We studied the differentiation of stem cells both into germ layer progenitors (Jang et al., 2017) and into cortical neurons. To study the latter, we analyzed singlecell gene expression data from 2217 cells from an in vitro differentiation protocol for early human neuronal development (Yao et al., 2017). Briefly, human embryonic stem cells (hESCs) were subjected to a SMAD inhibitionbased cortical induction phase, a progenitor expansion phase, and a neural differentiation phase. Single cells were sorted at 12, 26, 54, and 80 days into differentiation, and their gene expression was profiled using the SMARTSeq2 technique (Picelli et al., 2013). In the initial clustering of the singlecell data, dimensionality reduction by PCA (into 15 abovenoise components) followed by tSNE (Van Der Maaten and Hinton, 2008; Satija et al., 2015) showed separation by day and SOX2 expression (Figure 4A). However, the number of predicted clusters varied depending on the perplexity parameter (Figure 4—figure supplement 1A–B). In addition, no clear lineage or distance relationship among the putative types is immediately apparent from this clustering. Analysis of this data with other recent methods such as Monocle and Monocle2 (Trapnell et al., 2014), TSCAN (Ji and Ji, 2016), and StemID (Grün et al., 2016) did not clearly reconstruct lineage or infer key genes regulating transitions (Figure 4A – Bottom, Figure 4—figure supplement 2). Monocle2 (Figure 4—figure supplement 2A) produces a tree with complex branching, but SOX2+ progenitors and DCX+ differentiated neurons do not clearly separate.

Figure 4—source data 1
 https://doi.org/10.7554/eLife.20488.020

Figure 4—source data 2
 https://doi.org/10.7554/eLife.20488.021

Figure 4—source data 3
 https://doi.org/10.7554/eLife.20488.022

Figure 4—source data 4
 https://doi.org/10.7554/eLife.20488.023

Figure 4—source data 5
 https://doi.org/10.7554/eLife.20488.024
Analyzing data from singlecell profiling presents an additional challenge relative to data from populationlevel profiling because cell types are not previously known and must be inferred from the data. Computationally, it is necessary define a measure of similarity in the gene expression profiles of individual cells so as to be able to cluster them and define cell states. Here again, it is necessary to identify the correct gene subspace to use for clustering. Clustering and determining lineage have typically been performed sequentially and treated as independent problems (Satija et al., 2015; Trapnell, 2015; Trapnell et al., 2014). However, we found that it is informative to solve both these problems simultaneously.
In our framework, the relevant feature set for clustering is the set of marker and transition transcription factors from the triplets with nonnull topologies. But determination of these sets of genes and of the transition topologies depends itself on knowledge of the cluster identities. Following previous work on sparse clustering (Witten and Tibshirani, 2010), we simultaneously determined optimal clusters, lineage topology and sets of transition and marker genes by iteratively selecting transition and marker genes and clustering the data using this set of features (Figure 4B).
In order to utilize information about developmental distances in clustering, we iteratively maximized a joint distribution, $P\left(T,{\left\{C\right\}}^{m}\left\{{g}_{i}\right\}\right)$, over the developmental tree, $T,$ and the set of clusters of cells, $\left\{C\right\}}^{m}=\left\{{c}_{1}^{m},{c}_{2}^{m},\dots ,{c}_{K}^{m}\right\$ (Figure 4B). Starting with a clustering ${\left\{C\right\}}^{m},$ we first inferred the set of genes $\left\{{\alpha}_{i}=1\right\}$ and $\left\{{\beta}_{i}=1\right\}$ which were identified in high probability triplets, $p\left(T\left\{{g}_{i}\right\},\text{}{\left\{C\right\}}^{m}\right)0.6$, and we then reclustered in this new subspace to obtain the clusters for the next iteration ${\left\{C\right\}}^{m+1}$.
Initial analysis based on the gap statistic (Tibshirani et al., 2001) suggested that the singlecell gene expression profiles clustered into 20 clusters of cell types. We chose a seed ${\left\{C\right\}}^{0}$ for the iterated clustering procedure by intentionally overclustering the data into 40 clusters using spectral Kmedoids. By being overly discriminative in our initial clustering, we ensured that all genes with differential regulation would be classified as either marker genes or transition genes, and would be preserved in later clustering iterations. We iterated the clusteringinference procedure until the dimension of the reclustering subspace changed by less than 10% of the total transcription factor space. In this resulting subspace of 469 genes, we finally clustered the cells into the final configuration $\left\{C\right\}}^{f}=\left\{{c}_{0},\text{}{c}_{1},\dots ,{c}_{7}\right\$ using Kmedoids (Figure 4C, Figure 4—source data 1) where the number of clusters K = 8 was chosen based on the gap statistic (Tibshirani et al., 2001).
We inferred 45 high probability triplets ($p\left(T\text{}\text{}\left\{{g}_{i}\right\}\right)0.6$) between the final cell clusters (Figure 4—source data 2). Four such triplets are shown in Figure 4D, plotted in axes defined by transition genes for each triplet (Figure 4—source data 3). Starting with progenitor cell states (SOX2+, Figure 4—figure supplement 1C), we manually appended cell clusters to the tree according to their time information and in agreement with inferred topological restrictions. The first transition involves the production of day 12 neuronal cell type ${c}_{1}$ from day 12 progenitor ${c}_{0}$, which is observed in the triplets ${c}_{1}{c}_{0}{c}_{2}$ and ${c}_{1}{c}_{0}{c}_{3}$. The $c}_{1}{c}_{0}{c}_{2$ triplet ($p\left(T={c}_{0}\text{}\text{}\left\{{g}_{i}^{{c}_{0},{c}_{1},{c}_{2},}\right\}\right)=\text{}0.99$, Figure 4D – top left) is mediated by 47 transition genes between ${c}_{0}$ and ${c}_{1}$ and 87 between ${c}_{0}$ and ${c}_{2}$ ($p\left({\beta}_{i}=1\text{}\text{}\left\{{g}_{i}\right\},T\right)0.8$). Transition genes expressed highly in the ${c}_{1}{c}_{0}$ branch include CEBPG, POU3F1, POU3F2, NR2F1, NR2F2, ARX, LIN28A, TOX3, ZBTB20, PROX1, and SOX15 which have been previously implicated in proliferation of forebrain progenitors (Au et al., 2013; Borello et al., 2014; Cimadamore et al., 2013; Dominguez et al., 2013; Yang et al., 2015) and the migratory behaviors of ganglionic eminences (Kanatani et al., 2008; Kessaris et al., 2014; Lodato et al., 2014; Olivetti and Noebels, 2012; Reinchisi et al., 2012). The ${c}_{0}{c}_{2}$ transition genes include DMRTA1, HES1, HES5, FOXG1, PAX6, HMGA2, SOX2, SOX3, SOX9, SOX6, SP8, OTX2, TGIF, ID4, SOX3, TCF7L2, and TCFL1 which are known to be expressed in forebrain progenitors of the developing neocortex, thalamus, and hypothalamus (Abraham et al., 2013; Pozniak et al., 2010; Shimojo et al., 2011; Tzeng and de Vellis, 1998; Wang et al., 2006), and are known to establish dorsal forebrain regional identity; (Azim et al., 2009; BaniYaghoub et al., 2006; Borello et al., 2014; GastonMassuet et al., 2016; Hagey et al., 2014; Hutton and Pevny, 2011; Johansson et al., 2013; Kikkawa et al., 2013; Kishi et al., 2012; Manuel et al., 2011; Miyoshi and Fishell, 2012; Ohtsuka et al., 2001; Ross et al., 2003; Shen and Walsh, 2005; Sur and Rubenstein, 2005; Yang et al., 2015; Zembrzycki et al., 2007).
The progenitor cell types form the triplet ${c}_{2}{c}_{0}{c}_{3}\text{}(p\left(T={c}_{0}\left\{{g}_{i}^{{c}_{0},{c}_{2},{c}_{3}}\right\}\right)=\text{}0.96$, Figure 4D – top right). The ${c}_{2}{c}_{0}$ branch is mediated by 39 transition genes including LHX2, FEZF2, FOXG1, HMGA1, SP8, OTX1, SOX11, GLI3, SIX3, and ETV5 which suggest that ${c}_{2}$ is comprised of cortical progenitors (Appolloni et al., 2008; Greig et al., 2013; Kishi et al., 2012; Manuel et al., 2011; Raciti et al., 2013). The ${c}_{0}{c}_{3}$ branch is mediated by 55 transition genes including GTF2I, HIF1A, ID1, ID3, PROX1, POU3F2, SALL1, SOX21 TCF12, TRPS1 and ZHX2, which are associated with mesencephalon and metencephalon regional development, as well as having known involvement with midbrain/hindbrain organizer identity (Buck et al., 2001; Enkhmandakh et al., 2009; Inoue et al., 2012; Jaegle et al., 2003; Kunath et al., 2002; Lavado and Oliver, 2007; Milosevic et al., 2007; Ohba et al., 2004; Uittenbogaard and Chiaramello, 2002; Yao et al., 2017). We additionally inferred the triplet ${c}_{2}{c}_{0}{c}_{3}\text{}(p\left(T={c}_{0}\text{}\text{}\left\{{g}_{i}^{{c}_{0},{c}_{1},{c}_{3}}\right\}\right)\text{}0.99$, Figure 4D – bottom left), which suggests a three way split from early progenitor ${c}_{0}$ into early differentiated neuron ${c}_{1}$, and progenitors ${c}_{2}$ and ${c}_{3}$.
The continuation of the ${c}_{2}$ branch is inferred through triplet ${c}_{0}{c}_{2}{c}_{4}$ ($p\left(T={c}_{2}\text{}\text{}\left\{{g}_{i}^{{c}_{0},{c}_{2},{c}_{4}}\right\}\right)=\text{}0.97$). The ${c}_{2}{c}_{4}$ branch includes transition genes BCL11A, EMX2, FOXP2 and RORB, which are known to be associated with the neocortex and neuronal identity (Cánovas et al., 2015; Ebisu et al., 2016; Greig et al., 2016; Jabaudon et al., 2012; Wiegreffe et al., 2015; Woodworth et al., 2016; Zembrzycki et al., 2007).The triplet ${c}_{0}{c}_{3}{c}_{5}$ ($p\left(T={c}_{3}\text{}\text{}\left\{{g}_{i}^{{c}_{0},{c}_{3},{c}_{5}}\right\}\right)\text{}0.99$) meanwhile is characterized by transition genes in the ${c}_{3}{c}_{5}$ branch including ASCL1, FOXP2, PAX3, POU3F4, ZIC1, ZIC4, HOXB2, and EN2, which have been shown to regulate fate acquisition in the midbrain/hindbrain (Agoston et al., 2012; Ang, 2006; Di Bonito et al., 2013; Elsen et al., 2008; Hegarty et al., 2013; Miller et al., 2011; Tan et al., 2014).
The ${c}_{5}$ cell cluster differentiates into two distinct clusters of postmitotic neurons  ${c}_{6}$ and ${c}_{7}$ ($p\left(T={c}_{5}\text{}\text{}\left\{{g}_{i}^{{c}_{5},{c}_{6},{c}_{7}}\right\}\right)\text{}0.99$, Figure 4D – bottom right). The ${c}_{5}{c}_{6}$ branch is inferred from transition genes including PAX3, CRX, SOX11, EBF2, FOXP4, ASCL1, FOXO3, and SIX3 which have strong expression in developing dopaminergic and gabaergic neurons of the midbrain (Agoston et al., 2012; Erickson et al., 2010; Pino et al., 2014; Yang et al., 2015; Yin et al., 2009; Zhang et al., 2002). Transition genes in the ${c}_{5}{c}_{7}$ branch include ARGFX, DUXA, HES1, NFIB, PPARA, SOX2, SOX7, and SOX9, which are known to be associated with the medulla oblongata region of the hindbrain (Fawcett and Klymkowsky, 2004; Kameda et al., 2011; Kumbasar et al., 2009; Madissoon et al., 2016; Matsui et al., 2000; Stolt et al., 2003).
In addition to interpreting these individual transition genes defining the major branch splits, we correlated the expression over all predicted transition and marker genes in neuronal clusters to in vivo developmental human data (Miller et al., 2014). The in vivo data comprise a representative range of microarray data sampled from different parts of the developing brain at postconception week 15, including forebrain proliferative regions, midbrain, and hindbrain. The differences in data acquisition methods (RNAseq vs. microarray, singlecell vs heterogeneous populations) resulted in relatively low correlations overall, but there are clear associations between individual clusters and specific brain regions (Figure 4E). Specifically, ${c}_{1}$, maps to the ganglionic eminences, suggesting an interneuron identity, whereas ${c}_{5},{c}_{6}$ and ${c}_{7}$ show better mapping to mid and hindbrain structures, and ${c}_{4}$ appears to be more closely related to neocortex. Overall, this global comparison, combined with the identification of genes with known regional expression, suggests that the inferred clusters from the in vitro data capture the diversity of differentiation into the early stages of the major neuronal lineages (Figure 4F). These lineage predictions based on our analysis techniques were verified experimentally using viral barcoding in a separate work (Yao et al., 2017).
To estimate the sparsity of the underlying network and to find a minimal subset of genes through which lineage could be inferred, we replicated the analysis while only considering a limited set of genes per triplet. We assembled a collection of 20 triplets with maximal leaftoleaf distance of 4 nodes, and nonnull inferred topology. For each triplet, we ranked genes based on their odds of being a transition gene, ${\mathcal{\mathcal{O}}}_{i},$ agnostic of the true topology of the triplet. We then replicated the inference process using only the N genes with the greatest odds (Figure 4—figure supplement 1E). We found that with as few as 4 genes per triplet, the correct lineage topology could be inferred for all of the triplets. Genes with greatest odds comprise a restricted subset of genes for further experimental investigation. Further, these findings suggest that the dynamics of expression of just four specific genes are sufficient to monitor a particular lineage decision in single cells.
Discussion
Finding an informative subspace of variables for data analysis is a general problem in machine learning, both for regression and clustering (Tibshirani, 1996; Witten and Tibshirani, 2010); the innovation in this paper is to use a statistical pattern learned from known biology to inform this subspace search. The approach we take here is complementary to methods that project expression variability onto coordinates of PCA, ICA or tSNE maps (Marco et al., 2014; Satija et al., 2015; Trapnell et al., 2014), which are combinations of all variables. Searching for sparse representation of the dynamics has the advantage of providing interpretability and experimental direction (McGibbon and Pande, 2017). Following the dynamics of this small set of highprobability transition genes via fluorescent tagging could allow for the tracking of lineage decisions of individual cells in real time. Further, these genes provide a list of candidates for drivers of fate decisions, and hence a set of experimental hypotheses.
Not all genes give us equal information about the dynamics of differentiation. We discovered that genes showing the clear minimum pattern are most predictive of the sequence of lineage transitions during development. Although our pattern discovery and subsequent lineage reconstruction does not assume any functional role for the clear minimum pattern, we note that this pattern is shown by genes known to be regulators of development during hematopoiesis. The same pattern observed in many differentiating systems (Graf and Enver, 2009; Qi et al., 2013; Thomson et al., 2011; Zhang et al., 1999), and is consistent with mutual inhibition. Mutual inhibition, in turn, is hypothesized to play an important role in maintaining multistable systems and in mediating transitions between different stable states of multistable systems (Ferrell, 2012).
Discovery of sparse representations of the cell states and variability between them demonstrates the efficacy of low dimensional descriptions of the system. Understanding the dynamics and transitions of complex physical systems composed of a large number of variables has been driven by the discovery of low dimensional order parameters (Anderson, 1978; Landau and Lifshitz, 1951). As opposed to measuring and modeling states as high dimensional objects in their native representation, order parameters provide low dimensional descriptions of the states and dynamics, which has proven crucial in developing both qualitative and quantitative models. Finding small subsets of genes which captures the lineage transitions in cells analogously provides a low dimensional subspace that captures the dynamics in genetic networks and can be useful for modeling (Jang et al., 2017). An accompanying paper allows us to exploit this idea to extract mathematical models for the underlying molecular circuits from single cell gene expression data obtained during germ layer differentiation (Jang et al., 2017).
Materials and methods
In vitro neuronal differentiation
Request a detailed protocolSinglecell transcriptomic data from the in vitro neural differentiation procedure was obtained as described in Yao et al. (2017) (Supplemental information):
hESCs were dissociated with Accutase and plated on Matrigelcoated 24well plates at 2.5 × 105 cells/cm2 in DMEM/F12 (#11330–032), 1 × N2, 1 × B27 without vitamin A, 2 mM Glutamax, 100 μM nonessential amino acids, 0.5 mg/mL BSA, 1X PenStrep, and 100 μM 2mercaptoethanol (referred to as basal medium; all from Thermo Fisher, Waltham, MA) with 20 ng/mL FGF2 (Thermo Fisher) and 2 μM thiazovivin. Cortical induction was initiated by changing to the basal medium with 5 μM SB431542 (StemRD, Burlingame, CA), 50 nM LDN193189 (Reagents Direct, Encinitas, CA) and 1 μM cyclopamine (Stemgent, Lexington, MA) (referred to as NIM). NIM was changed daily for 11 days. On day 12, cells were dissociated and seeded on Matrigelcoated 24well plates at 5 × 105/cm2 in basal medium with 20 ng/mL FGF2 and 2 μM thiazovivin. Progenitor expansion was initiated on D13 by changing to serumfree human neural stem cell culture medium (NSCM, #A10509–01 from Thermo Fisher) containing 20 ng/mL FGF2 and 20 ng/mL EGF. NSCM was changed daily for 6 days. Cultures were passaged once more on D19 with Accutase and replated at 5 × 105 cells/cm2. On D26, cells were dissociated with Accutase and seeded on 24well plates sequentially coated with polyDlysine (Millipore, Billerica, MA) and laminin (Thermo Fisher) at 1 × 105 cells/cm2 in basal medium supplemented with 20 ng/mL FGF2 and 2 μM thiazovivin. On D27, medium was changed to a 1:1 mixture of DMEM/F12 and Neurobasal medium (#21103–049) supplemented with 100 μM cAMP (SigmaAldrich, St Louis, MO), 10 ng/mL BDNF (R and D Systems, Minneapolis, MN), 10 ng/mL GDNF (R and D Systems) and 10 ng/mL NT3 (R and D Systems) (referred to as ND). Cells were maintained in ND medium for four weeks until day 54 with half medium change every other day. Quality of differentiations was routinely assessed by immunostaining at D12 (PAX6 and DCX), at D26 (LHX2, SOX2, EOMES, POU3F2, and TBR1), and at D54 (MAP2 costained with TBR1, CTIP2, SATB2). In addition flow cytometry at D26 (EOMES, SOX2 and PAX6) was performed. Typically, EOMES at day 26 proved the most valuable quality control metric (~10% of cells by both flow cytometry and immunostaining) and predicted failure at D54. Specifically, when EOMES was low (<1% of cells) differentiations failed and were typically dominated by POU3F2+ cell types and/or nonneural ‘other’ cell types. These failed differentiations were eliminated from further analysis, typically ~20% of experiments (5 of 19 experiments in 2016). 50 bp pairedend SmartSeq2 libraries were prepared from these cells as previously described (Picelli et al., 2013) and mapped as described in Thomsen et al., 2016 and Yao et al. (2017). As each cell was profiled independently (without pooling before amplification, as in methods such as CelSeq, STRT, or DropSeq), we did not observe the batch effects present in poolingbased methods. Although cells were profiled in plates (batches), there was no significant platerelated variation  this Fixed singlecell transcriptomic characteriza is most likely due to amplification being carried out separately in each well of the plate, thereby precluding crosstalk among barcodes, or platerelated differences in amplification. This is clear from the mixing of cells from different plates in the clustering. A complete census is provided (Figure 4—source data 4).
Gene expression data
Request a detailed protocolHematopoietic gene expression data were downloaded from the Immunological Genome Project (Heng et al., 2008); GEO GSE15907) and log2 transformed. We restricted the genes considered to 1459 transcription factors. Brain development expression data were log2 transformed, and 1460 transcription factors were individually normalized by dividing by the 90thpercentile expression value. Lists of mouse and human transcription factors are provided (Figure 1—source data 2, Figure 4—source data 5).
Software
Calculations were performed using custom written MATLAB code (The Mathworks) on the Harvard Research Computing Odyssey cluster. Code is available at https://github.com/furchtgott/sibilant. tSNE was done using the package provided in (Van Der Maaten, 2009). Monocole was run in R (Core Team, 2015; Trapnell et al., 2014).
Description of algorithm
Request a detailed protocolThe algorithm proceeds according to the following steps:
Find initial seed clustering configuration ${\left\{C\right\}}^{0}$ using Kmedoids where K is chosen to be larger than the cluster number inferred from the Gap statistic (Tibshirani et al., 2001)
For all triplets of clusters, find most likely $T$ and $\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$ given ${\left\{C\right\}}^{m}$:
Compute $p\left({g}_{i}^{A,B,C}\text{}\text{}T,{\alpha}_{i},{\beta}_{i},{\left\{C\right\}}^{m}\right)$ by integrating numerically over $p\left(\mu ,\sigma \right)$ (Equations 6, 9, and 10).
Compute $p\left(T,\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},{\left\{C\right\}}^{m}\right)$ using Equations 20 and 29.
Identify mostly likely topology $T$ and set of $\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$.
Recluster $\left\{g\right\}$ using Kmedoids in the space of all $\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$ for the triplets with probability $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},{\left\{C\right\}}^{m}\right)$> 0.6 of being nonnull. Determine new clustering configuration ${\left\{C\right\}}^{m+1}$.
Repeat steps 1 and 2 until convergence of $\left\{C\right\}.$
Determine the most likely tree connecting cell clusters, recapitulating highprobability triplet topologies.
Each of these steps is described in the following section.
Bayesian framework for inferring cluster identities, state transitions, and marker and transition genes simultaneously
Notation; Bayes’ rule
Request a detailed protocolGiven gene expression data from single cells $\left\{{g}_{i}\right\}$, we built a probabilistic framework to simultaneously infer cell cluster identities, $\left\{C\right\}\equiv \left\{{c}_{A},{c}_{B},\dots \right\}$, the sequence of transitions $T$ between these clusters, the key sets of marker genes $\left\{{\alpha}_{i}\right\}$ that define each cell cluster, and genes $\left\{{\beta}_{i}\right\}$ that determine the sequence of transitions between clusters. We maximized the joint probability distribution of these variables given the gene expression data, $p\left(T,\left\{{c}_{A},{c}_{B},\dots \right\},\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\mid \left\{{g}_{i}\right\}\right)$ to determine the maximum likelihood estimates of these parameters.
We first consider how to solve this problem in the case in which there are three cell clusters, and we will later build a tree using all possible combinations of three cell clusters. Let the set of three cell clusters be c_{A}, c_{B} and c_{C} with gene expression data $\left\{{g}_{i}^{A,B,C}\right\}$ for all genes $(i=1toN)$ and all cells. The term ${g}_{i}^{A,B,C}$ denotes the expression data for just gene $i$ in cells in clusters c_{A}, c_{B}, and c_{C}. The topology $T$ of the relationships between cell clusters c_{A}, c_{B} and c_{C} can take on four possible values: $T=\mathcal{\mathcal{A}}$: cell cluster c_{A} is in the middle (either c_{A} is the progenitor of c_{B} and c_{C}, or c_{A} is an intermediate cell type between c_{B} and c_{C}); $T=\mathcal{\mathcal{B}}$: cell cluster c_{B} is in the middle; $T=\mathcal{\mathcal{C}}$: cell cluster c_{C} is in the middle; or $T=\varnothing $: we cannot determine the topology. Complementarily, for each gene $i$ we define variables ${\alpha}_{i}$ and ${\beta}_{i}$, where ${\alpha}_{i}=1$ and ${\beta}_{i}=0$ if the gene is a marker gene, ${\alpha}_{i}=0$ and ${\beta}_{i}=1$ if the gene is a transition gene, and ${\alpha}_{i}={\beta}_{i}=0$ otherwise. Our task is to determine the probability $p\left(T,\left\{{c}_{A},{c}_{B},\dots \right\},\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$ given gene expression data for all genes $\left\{{g}_{i}^{A,B,C}\right\}$.
According to Bayes’ rule, $p\left(T,\left\{{c}_{A},{c}_{B},\dots \right\},\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$ is proportional to the probability of the gene expression data given $T$, $\left\{C\right\},\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$:
The denominator of the right hand side of Equation 2 is a normalization constant. Expressions $p\left(\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\mid T,\left\{C\right\}\right)$, $\text{}p\left(T\left\{C\right\}\right)$, and$p\left(\left\{C\right\}\right)$ are respectively the prior probabilities of $\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$ given $T$ and $\left\{C\right\},$the prior probability of $T$ given $\left\{C\right\}$, and the prior probability of $\left\{C\right\}$. We assume that in the absence of any expression data, the probability that a gene is a transition or marker gene is independent of the topology and clustering configuration: $p\left(\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\mid T,\left\{C\right\}\right)\text{}p\left(T\left\{C\right\}\right)\text{}p\left(\left\{C\right\}\right)=p\left(\left\{C\right\}\right)p\left(T\right)\prod _{i}\text{}p\left({\alpha}_{i},{\beta}_{i}\right)$, and $p\left(T\right)=1/4$.
Conditional independence
Request a detailed protocolIn our model, we assume that knowing the clustering configuration $\left\{C\right\}$, the topology $T$ and whether or not a gene is a marker or transition gene is sufficient to determine the probability distribution for its expression levels in each of the cell clusters. Therefore, the gene expression patterns of different genes are conditionally independent given the topology, clustering and gene type:
Thus, Equation 2 becomes
We maximize the evaluated $p\left(T,\left\{{c}_{A},{c}_{B},\dots \right\},\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$ with respect to $T$, $\left\{C\right\}$ and each of the ${\alpha}_{i}$ and ${\beta}_{i}$ to obtain the most likely relationships between cell types c_{A}, c_{B} and c_{C}, as well as the genes most likely to be marker and transition genes.
Expression for $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},{\alpha}_{i}=0,\text{}{\beta}_{i}=1\right)$ (transition genes)
Request a detailed protocolTo infer $p\left(T,\left\{{c}_{A},{c}_{B},\dots \right\},\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$, we need a model to compute the probability of the gene expression data for each gene, $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},{\alpha}_{i},\text{}{\beta}_{i}\right)$, given $T$, $\left\{C\right\},$${\alpha}_{i}$and ${\beta}_{i}$, following Equation 4. Our model for the probability distribution of the expression of a single transition gene $i$ in the three cell types $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},{\alpha}_{i}=0,\text{}{\beta}_{i}=1\right)$ is defined solely by the geometry of the arrangement of the cell types in gene expression space, as described in the main text. For example, for $T=\mathcal{\mathcal{A}}$ and ${\beta}_{i}=1$, our model is that the distribution of the expression levels of ${g}_{i}^{A,B,C}$ in the three cell types A, B and C has the smallest mean value in either B or C but not in A (Figure 2A). If the distribution of the expression of gene $i$ in cell type A is ${D}_{A}\left({g}_{i}^{A}\text{}\text{}{\mu}_{A}^{i},{\sigma}_{A}^{i},\left\{C\right\}\right)$ (we assume a lognormal distribution) with a mean ${\mu}_{A}^{i}$ and standard deviation ${\sigma}_{A}^{i}$, with analogous expressions for cell types B and C, then our model defining $p\left({g}_{i}^{A,B,C}\text{}\text{}T=\mathcal{\mathcal{A}},{\beta}_{i}=1,\left\{C\right\}\right)$ is that either ${\mu}_{B}^{i}<{\mu}_{C}^{i}$ and ${\mu}_{B}^{i}<{\mu}_{A}^{i}$ or ${\mu}_{C}^{i}<{\mu}_{B}^{i}$and ${\mu}_{C}^{i}<{\mu}_{A}^{i}$, where ${\mu}_{A}^{i}$, ${\mu}_{B}^{i}$ and ${\mu}_{C}^{i}$ are the mean values of the expression levels of $g}_{i$ in cell types A, B and C. Thus,
The terms in Equation 5 can be calculated by integrating over the prior probability distribution of the means ${\mu}_{A}^{i}$, ${\mu}_{B}^{i}$ and ${\mu}_{C}^{i}$ and standard deviations ${\sigma}_{A}^{i}$, ${\sigma}_{B}^{i}$ and ${\sigma}_{C}^{i}$, with the conditions on the means constraining the domains of integration:
Probabilities $p\left({g}_{i}^{A,B,C}\text{}\text{}T=\mathcal{\mathcal{B}},{\beta}_{i}=1,\left\{C\right\}\right)$ and $p\left({g}_{i}^{A,B,C}\text{}\text{}T=\mathcal{\mathcal{C}},{\beta}_{i}=1,\left\{C\right\}\right)$ are defined similarly.
In addition to topologies $\mathcal{\mathcal{A}}$, $\mathcal{B}$ and $\mathcal{\mathcal{C}}$, we consider a null hypothesis $\varnothing$ in which transition genes have differential expression levels between states, but these levels are not correlated with any particular topology of states. This corresponds to having gene expression levels from celltypes A, B and C coming from three distributions with no restrictions on the relative order of the three means:
Note that the probability of the data given the null hypothesis is the average of the probabilities of the data given the nonnull hypotheses:
Note that $p({g}_{i}^{A,B,C}T,\left\{C\right\},{\alpha}_{i}=0,{\beta}_{i}=1)$ depends on both $T$ and $\left\{C\right\}$.
Expression for $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},{\alpha}_{i}=1,{\beta}_{i}=0\right)$ (marker genes)
Request a detailed protocolOur model for marker genes assumes that the probability distribution for the expression level of such genes, $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},{\alpha}_{i}=1,{\beta}_{i}=0\right)$ to be independent of $T$ and to be generated from distributions with two celltypes having a low value and the third a high value (for example, ${D}_{AB}\left({g}_{i}^{A,B,C}\text{}\text{}{\mu}_{AB}^{i},{\sigma}_{AB}^{i}\right)$ for celltypes A and B and ${D}_{C}\left({g}_{i}^{C}\text{}\text{}{\mu}_{C}^{i},{\sigma}_{c}^{i}\right)$ for celltype C, with the constraint ${\mu}_{AB}^{i}<{\mu}_{C}^{i})$:
Note that $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},\text{}{\alpha}_{i}=1,{\beta}_{i}=0\right)=\text{}\text{}p\left({g}_{i}^{A,B,C}\text{}\text{}\left\{C\right\},\text{}{\alpha}_{i}=1,{\beta}_{i}=0\right)$ does not depend on $T$ but does depend on $\left\{C\right\}$.
Expression for $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},\text{}{\alpha}_{i}=0,{\beta}_{i}=0\right)$ (irrelevant genes)
Request a detailed protocolOur model for genes that are neither marker nor transition genes is that the expression levels of such genes, $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},\text{}{\alpha}_{i}=0,{\beta}_{i}=0\right)$, is generated from one single distribution ${D}_{ABC}\left({g}_{i}^{A,B,C}\text{}\text{}{\mu}^{i},{\sigma}^{i}\right)$:
Note that $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},\text{}{\alpha}_{i}=0,{\beta}_{i}=0\right)=\text{}\text{}p\left({g}_{i}^{A,B,C}\text{}\text{}\text{}{\alpha}_{i}=0,{\beta}_{i}=0\right)$ does not depend on $T$ or $\left\{C\right\}$.
Numerical integration
Request a detailed protocolEach of the probabilities on the right hand side of Equation 4 is evaluated numerically as above. We assume the distribution of the expression of gene $i$ in cluster c_{A }${D}_{A}\left({g}_{i}^{A}\text{}\text{}{\mu}_{A}^{i},{\sigma}_{A}^{i}\right)$ to be lognormal. Given $m$ log2transformed replicate measurements ${g}_{i}^{A}$ of gene expression of gene $i$ in cells belonging to cluster c_{A}, the probability of the data assuming mean ${\mu}_{A}^{i}$ and standard deviation ${\sigma}_{A}^{i}$ is:
Distributions ${D}_{B},{D}_{C},{D}_{AB},{D}_{AC},{D}_{BC}$ and ${D}_{ABC}$ are defined analogously.
We take the a priori probability distribution of ${\mu}^{i}$ and ${\sigma}^{i}$, $p\left({\mu}^{i},{\sigma}^{i}\right)$ as uniform over a certain range of means and standard deviations estimated from the data. For the log2transformed hematopoietic gene expression data, we take $2<{\mu}^{i}<14$ and $0<{\sigma}^{i}<0.75$.
We take the prior probabilities for the distributions in different cell types to be independent: $p\left({\mu}_{A}^{i},{\mu}_{B}^{i},{\mu}_{C}^{i},{\sigma}_{A}^{i},{\sigma}_{B}^{i},{\sigma}_{C}^{i}\right)=\text{}p\left({\mu}_{A}^{i},{\sigma}_{A}^{i}\right)\text{}p\left({\mu}_{B}^{i},{\sigma}_{B}^{i}\right)\text{}p\left({\mu}_{C}^{i},{\sigma}_{C}^{i}\right)$. The constraints on the order of the means are enforced by the domain of integration, and the prior must be properly normalized over this domain. For example, in Equation 6,
Integrals were evaluated numerically in MATLAB using trapezoidal integration with stepsizes $\delta \mu $ = 0.05 and $\delta \sigma $ = 0.01.
The choice of a lognormal distribution could potentially confound the results, particularly for singlecell RNASeq data with significant zeroinflation. This is a potential area for improvement in our algorithm, but the method can easily be adapted to model different distributions of RNA expression, such as gamma distributions (Shahrezaei and Swain, 2008; Wills et al., 2013) or betaPoisson distributions (Delmans and Hemberg, 2016; Vu et al., 2016). In either case, the probability of the data given different topologies would be computed by numeric integration over the parameters of the distribution, for example, $\alpha $ and $\beta $ for the Gamma distribution, by replacing the lognormal distributions in Equations 7, 9 and 10 with ones from the appropriate model.
The choice of the right parametric form for singlecell RNA expression is still an area of active research. Our choice of lognormal distributions assumes that higher order moments of the distributions (beyond standard deviation) ought to have a minimal contribution to the predictions, but we have not tested this extensively.
Although the default prior was the uniform prior, we also implemented an empirical prior $p\left(\mu ,\sigma \right)$ by estimating the empirical distribution over all the Immgen cell types, using the kernel density estimation code provided in (Botev et al., 2010). The resulting hematopoietic lineage tree was identical. Using the kerneldensityestimated empirical prior may provide more stability in future analyses.
Probability of topology given gene expression and cluster identities $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$
Request a detailed protocolWe can derive the probability of the topology given the gene expression data and cluster identities $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$ by summing over all the $\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$ to find the probability of the data given topology $p\left(\left\{{g}_{i}^{A,B,C}\right\}\text{}\text{}T,\left\{C\right\}\right)$:
where the probability of gene expression data for gene $i$ given topology $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\}\right)$ is obtained by summing $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},\text{}{\alpha}_{i},{\beta}_{i}\right)$ over ${\alpha}_{i}$ and ${\beta}_{i}$:
We consider the odds of a gene being a transition gene $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)$ as a sparsity parameter, which we vary. We take $p\left({\alpha}_{i}=0\text{}\text{}{\beta}_{i}=0\right)=\text{}p\left({\alpha}_{i}=1\text{}\text{}{\beta}_{i}=0\right)=1/2.$
The probability of topology $T$ given data is proportional to the probability of the data given topology $T$ (using Bayes’ rule):
where
Therefore, using Equation 14, we obtain the following expression for $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$:
Equation 18 can be written more explicitly by rewriting Equation 15 as follows:
Here we have used the fact noted earlier that in our generating model, $p\left({g}_{i}^{A,B,C}\text{}\text{}T,{\alpha}_{i}=1,\text{}\text{}{\beta}_{i}=0,\text{}\left\{C\right\}\right)=p\left({g}_{i}^{A,B,C}\text{}\text{}{\alpha}_{i}=1,\text{}\text{}{\beta}_{i}=0,\left\{C\right\}\right)$ and $p\left({g}_{i}^{A,B,C}\text{}\text{}T,{\alpha}_{i}=0,\text{}\text{}{\beta}_{i}=0,\text{}\left\{C\right\}\right)=p\left({g}_{i}^{A,B,C}\text{}\text{}{\alpha}_{i}=0,\text{}\text{}{\beta}_{i}=0\right)$ do not depend on $T$ (Equation 10). The terms $\prod _{i}p\left({g}_{i}^{A,B,C}\text{}\text{}{\beta}_{i}=0,\left\{C\right\}\right)\text{}p\left({\beta}_{i}=0\right)$ cancel out in the numerator and denominator of Equation 18, and we can write Equation 18 in terms of ratios of the probabilities of the data given transitiongene and nontransitiongene status:
We can rewrite Equation 20 as:
where $\mathcal{\mathcal{O}}}_{i$ is the odds that gene $i$ is a transition gene, given clustering:
and $p\left(T\text{}\text{}{g}_{i}^{A,B,C},\text{}{\beta}_{i}=1,\left\{C\right\}\right)$ is the probability of $T$ given only gene expression data for gene $i$, clustering and that gene $i$ is a transition gene:
Thus each gene’s contribution $p\left(T\text{}\text{}{g}_{i}^{A,B,C},\text{}{\beta}_{i}=1,\left\{C\right\}\right)$ to the probability of the topology given total gene expression $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$ is weighted by the odds $\mathcal{\mathcal{O}}}_{i$ that it is transition gene.
Rewriting Equation 21 in terms of negative votes
Request a detailed protocolLet us denote the probability of gene expression data for gene $i$ given that cell cluster $\xi $ has the distribution with minimum mean expression as $p\left({g}_{i}^{A,B,C}\text{}\text{}{\mu}_{\xi}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n},\text{}\left\{\mathrm{C}\text{}\right\}\right).$ For example, $p\left({g}_{i}^{A,B,C}\text{}\text{}{\mu}_{B}^{i}\text{}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{},\text{}\left\{\mathrm{C}\text{}\right\}\right)=\text{}p\left({g}_{i}^{A,B,C}\text{}\text{}{\mu}_{B}^{i}\text{}{\mu}_{A}^{i},{\mu}_{B}^{i}\text{}{\mu}_{C}^{i},\text{}\left\{C\right\}\right)$. Then, using $p\left(T\right)=1/4$ and Equations 5 and 8, we can write:
Therefore, for $T=\mathcal{\mathcal{A}},\mathcal{\mathcal{B}},\mathcal{\mathcal{C}}$, we can rewrite Equation 5 as:
Combining Equations 21 and 25, we derive, for $T\ne \varnothing $:
where $p\left({\mu}_{T}^{i}\phantom{\rule{thinmathspace}{0ex}}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}\text{}{g}_{i}^{A,B,C},{\beta}_{i}=1,\{C\}\right)$ is the probability that cell cluster $T$ (the intermediate cluster in topology $T$) has the distribution with the minimum mean for gene $i$:
Every gene can be thought of as casting a vote $p\left({\mu}_{T}^{i}\phantom{\rule{thinmathspace}{0ex}}\mathrm{i}\mathrm{s}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}\text{}\text{}{g}_{i}^{A,B,C},{\beta}_{i}=1,\{C\}\right)$ against cell type $T$ being the intermediate, and this vote is weighted by the odds $\mathcal{\mathcal{O}}}_{i$ of the gene $i$ being a transition gene and having a unique minimum, given the clustering. This corresponds to Equation 1 in the main text.
Expression for $p\left(T,\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$
Request a detailed protocolOnce $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$ is calculated, it is straightforward to find $p\left(T,\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},\left\{C\right\}\right)$:
where $p\left(\left\{{\alpha}_{i}\right\},\left\{{\beta}_{i}\right\}\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\},T,\left\{C\right\}\right)$ is the probability of $\left\{{\alpha}_{i}\right\}$ and $\left\{{\beta}_{i}\right\}$given the particular topology $T$, clustering $\left\{C\right\}$ and gene expression. Because we have assumed that gene expression patterns $p\left({g}_{i}^{A,B,C}\text{}\text{}T,\left\{C\right\},\text{}{\alpha}_{i},{\beta}_{i}\right)$ are conditionally independent given $T$, $\left\{C\right\}$, ${\alpha}_{i}$ and ${\beta}_{i}$ (Equation 3), the probabilities of being marker or transition genes ${\alpha}_{i}$ or ${\beta}_{i}$ are also conditionally independent given gene expression, clustering and the topology:
where $p\left({\alpha}_{i},{\beta}_{i}\text{}\text{}T,\left\{C\right\},\text{}{g}_{i}^{A,B,C}\right)$ is the probability that gene $i$ is a marker or transition gene given its gene expression, the clustering, and that the topology is $T$.
Choice of prior odds does not affect the most likely topology
Request a detailed protocolThe only free parameter in our calculation above is the prior odds of gene i being a transition gene, $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)$. At one extreme, if $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)\to 0$, then $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)\to p\left(T\right)$: if we assume that none of the genes are transition genes, then knowing gene expression does not give us any new knowledge of the topology $T$, since only transition genes are informative about $T$. At the other extreme, if $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)\to \mathrm{\infty}$ then the null hypothesis dominates: if all genes are transition genes, then there will be negative votes against all topologies. We computed the behavior of $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$ between these two limits to determine the sensitivity of our answer to $p({\beta}_{i}=1)/p\left({\beta}_{i}=0\right)$.
Figure 2—figure supplement 1 shows the dependence of the probabilities $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$ on the prior odds for triplets CMP/ST/MPP and GMP/MEP/FrBC for values of $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)$ between 10^{−8} and 10^{2}. For triplet CMP/ST/MPP the topology $\mathit{S}\mathit{T}\equiv CMP\mathit{S}\mathit{T}MPP$ dominates for $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)$ between 10^{−2} and 10, whereas for triplet MEP/GMP/FrBC there is no value of the prior odds that strongly favors a nonnull topology. For most triplets, the most likely topology does not depend on the choice of prior odds; when building lineage trees, we ignore those triplets where different choices of prior odds lead to different mostlikely topologies, i.e. there is more than one nonnull topology that reaches probability 0.6 over the range of prior odds.
Determination of lineage tree from triplet topologies
Selection of triplets
Request a detailed protocolIn order to build lineage trees from the topologies we determine for each cell type, we select the triplets for which our determination of the topology is most robust. There is one free parameter in our model: the prior odds for a gene to be a transition gene in the absence of gene expression data, $p\left({\beta}_{i}=1\right)/p\left({\beta}_{i}=0\right)$. For each triplet, we vary this parameter between 10^{−6} and 10^{2} and calculate the probability of the topology given gene expression data $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$ as a function of the prior odds.
We want to consider only triplets which showed a single dominant topology. We exclude triplets which show a weak probability for a particular topology or ones which depend on a particular choice of prior odds. We also do not consider triplets which show a strong probability for two different topologies, depending on the choice of prior odds.
There were 14 such triplets among the 165 hematopoietic triplets. Mathematically, these cases come up when the genes that are most likely to show the clear minimum pattern (furthest on the right in a ‘dot plot’) suggest one topology, but if one used a more permissive value of the sparsity parameter, a different topology wins out. One of the cell types might have a small number of genes with very high odds, but then fewer genes with moderately high odds compared to the other cell types. We did not notice a clear pattern in the identity of the triplets exhibiting this behavior, but 9 of the 14 were triplets of length five or greater in the Adolfsson model. One of the triplets with this behavior was the MLP/CMP/GMP triplet, and the dominant topology was either MLP (at low prior odds) or CMP (at higher prior odds). Interestingly, both cell types are progenitors to GMP in the Adolfsson model.
We consider triplets for which only one nonnull topology has probability $p\left(T\text{}\text{}\left\{{g}_{i}^{A,B,C}\right\}\right)$greater than 0.6. The probabilities of the different topologies for each triplet in the hematopoietic tree are shown in Figure 2—source data 1 and for the triplets in the cortical development tree in Figure 4—source data 2.
Pruning rule
Request a detailed protocolWe assemble the triplets with known topology into an undirected graph. Since we determined topologies by considering cell types three at a time, we obtain topological relationships involving both cell types that are nearest neighbors and cell types that are more distantly related. In order to reconstruct the tree, we must determine which cell types are nearest neighbors and which ones are separated by one or more intermediate cell types.
The set of inferred topologies allows us to determine which cell types are separated by intermediates. For every pair of cell types, we ask whether any of the inferred topologies features an intermediate between the two cell types. If such a topology has been inferred, we consider that the two cell types are not nearest neighbors, and that at least one other cell type is an intermediate. For example, we can ignore triplet CMP – LT – MPP because triplet LT – ST – CMP testifies that there exists an intermediate between LT and CMP, and triplet LT – ST – MPP testifies that there exists an intermediate between LT and MPP (Figure 3—figure supplement 1).
Note that this pruning rule does not assume the absence of loops. The lineage tree we infer for the hematopoietic progenitors contains a loop that includes ST to CMP to GMP on one side and ST to MPP to MLP to GMP on the other side. The loop involves triplets CMP – ST – MPP, ST – CMP – GMP, ST – MPP – MLP and MPP – MLP – GMP (we cannot determine the topology of triplet CMP/MLP/GMP). None of the triplets shows a topology that would allow us to break up the loop.
Distinguishing between two models of hematopoiesis
Request a detailed protocolThe topologies we infer support the model from Adolfsson et al. (2005), in which CMP splits from STHSC. In particular, there are several triplets that can distinguish the Adolfsson model from the traditional picture, and they support the Adolfsson model. These triplets include CMP – ST – MPP, CMP – ST – MLP, MEP – ST – MLP and CMP – LT – MPP. These triplets show that, unlike in the traditional picture, cell types CMP and MEP split from ST are not descended from MPP or MLP.
On the other hand, triplets LT – MLP – GMP, ST – MLP – GMP and MPP – MLP – GMP show that MLP is an intermediate between the earliest progenitors and GMP. See also Figure 2—figure supplement 2 and Figure 2—source data 2.
Stability analysis
Request a detailed protocolThe inference algorithm depends on several parameters and priors. We performed a stability analysis for both the microarray hematopoietic data and the human brain single cell data to determine the parameter ranges for which the inferred lineage tree was unchanged.
Hematopoiesis
For the prior probability, given that a gene is not a transition gene, that it is an irrelevant gene, our default value was $p\left({\alpha}_{i}=0\text{}\text{}{\beta}_{i}=0\right)$ =0.5, but the tree was unchanged for values of $p\left({\alpha}_{i}=0\text{}\text{}\text{}{\beta}_{i}=0\text{}\right)$ between 0.25 and 0.65.
For the prior probabilities of different topologies, our default value was $p\left(\varnothing \right)=\text{}p\left(\mathcal{\mathcal{A}}\right)=\text{}p\left(\mathcal{\mathcal{B}}\right)=\text{}p\left(\mathcal{\mathcal{C}}\right)=\text{}0.25$. We varied $p(\varnothing )$ while keeping the prior probabilities of the nonnull topologies equal: $p\left(T\ne \varnothing \right)=\frac{1}{3}\left(1p\left(\varnothing \right)\right)$. The tree was unchanged for values of $p(\varnothing )$ between 0.1 and 0.35.
We used a threshold of 0.6 to consider a triplet significant for the treebuilding step. The tree was unchanged for thresholds between 0.5 and 0.65.
A key input parameter into the algorithm is the expected prior distribution of means and standard deviations $p\left(\mu ,\sigma \right)$ used in the numeric integration. Our default prior was a uniform prior for both means and standard deviations over reasonable ranges of these parameters. We also implemented an empirical prior $p\left(\mu ,\sigma \right)$ by estimating the empirical distribution over all the Immgen cell types, using kernel density estimation (Botev et al., 2010). The resulting hematopoietic lineage tree was identical. Using the kerneldensityestimated empirical prior may provide more stability in future analyses.
Brain Development
For the prior probability, given that a gene is not a transition gene, that it is an irrelevant gene, our default value was $p\left({\alpha}_{i}=0\text{}\text{}{\beta}_{i}=0\text{}\right)$ = 0.5, but the tree was unchanged for values of $p\left({\alpha}_{i}=0\text{}\text{}{\beta}_{i}=0\text{}\right)$ between 0.02 and 0.97.
For the prior probabilities of different topologies, our default value was $p\left(\varnothing \right)=\text{}p\left(\mathcal{\mathcal{A}}\right)=\text{}p\left(\mathcal{\mathcal{B}}\right)=\text{}p\left(\mathcal{\mathcal{C}}\right)=\text{}0.25$. We varied $p(\varnothing )$ while keeping the prior probabilities of the nonnull topologies equal: $p\left(T\ne \varnothing \right)=\frac{1}{3}\left(1p(\varnothing )\right)$. The tree was unchanged for values of $p(\varnothing )$ between 0.06 and 0.96.
We used a threshold of 0.6 to consider a triplet significant for the treebuilding step. The tree was unchanged for thresholds between 0.5 and 0.8.
For the single cell data, we also changed the number of initial seed clusters for the iterative algorithm within a range from 30 to 45. In each case, the tree remained the same, and never more than 23% of the cells were clustered into a different cell type.
Data availability

Immunological Genome ProjectPublicly available at the NCBI Gene Expression Omnibus (accession no: GSE15907).

SingleCell mRNA Sequencing Reveals Rare Intestinal Cell TypesPublicly available at the NCBI Gene Expression Omnibus (accession no: GSE62270).

Regionspecific neural stem cell lineages revealed by singlecell rnaseq from human embryonic stem cells [Smartseq]Publicly available at the NCBI Gene Expression Omnibus (accession no: GSE86982).
References

Local moments and localized statesReviews of Modern Physics 50:191–201.https://doi.org/10.1103/RevModPhys.50.191

Transcriptional control of midbrain dopaminergic neuron developmentDevelopment 133:3499–3506.https://doi.org/10.1242/dev.02501

Six3 controls the neural progenitor status in the murine CNSCerebral Cortex 18:553–562.https://doi.org/10.1093/cercor/bhm092

SOX6 controls dorsal progenitor identity and interneuron diversity during neocortical developmentNature Neuroscience 12:1238–1247.https://doi.org/10.1038/nn.2387

Role of Sox2 in the development of the mouse neocortexDevelopmental Biology 295:52–66.https://doi.org/10.1016/j.ydbio.2006.03.007

Controlling the false discovery rate: a practical and powerful approach to multiple testingJournal of the Royal Statistical Society. Series B 57:289–3000.

Kernel density estimation via diffusionThe Annals of Statistics 38:2916–2957.https://doi.org/10.1214/10AOS799

Embryonic expression of the murine homologue of SALL1, the gene mutated in TownesBrocks syndromeMechanisms of Development 104:143–146.https://doi.org/10.1016/S09254773(01)003641

Tracing cells for tracking cell lineage and clonal behaviorDevelopmental Cell 21:394–409.https://doi.org/10.1016/j.devcel.2011.07.019

Stable signal recovery from incomplete and inaccurate measurementsCommunications on Pure and Applied Mathematics 59:1207–1223.https://doi.org/10.1002/cpa.20124

SoftwareR: A Language and Environment for Statistical ComputingR: A Language and Environment for Statistical Computing.

GATA1 in normal and malignant hematopoiesisSeminars in Cell & Developmental Biology 16:137–147.https://doi.org/10.1016/j.semcdb.2004.11.002

Development and differentiation of the intestinal epitheliumCellular and Molecular Life Sciences 60:1322–1332.https://doi.org/10.1007/s0001800322893

Observed universality of phase transitions in highdimensional geometry, with implications for modern data analysis and signal processingPhilosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367:4273–4293.https://doi.org/10.1098/rsta.2009.0152

Embryonic expression of xenopus laevis SOX7Gene Expression Patterns 4:29–33.https://doi.org/10.1016/j.modgep.2003.08.003

Bistability, bifurcations, and Waddington's epigenetic landscapeCurrent Biology 22:R458–R466.https://doi.org/10.1016/j.cub.2012.03.045

Cell lineage analysis of a mouse tumorCancer Research 68:5924–5931.https://doi.org/10.1158/00085472.CAN076216

Molecular logic of neocortical projection neuron specification, development and diversityNature Reviews Neuroscience 14:755–769.https://doi.org/10.1038/nrn3586

The immunological genome project: networks of gene expression in immune cellsNature Immunology 9:1091–1094.https://doi.org/10.1038/ni10081091

Gbx2 directly restricts Otx2 expression to forebrain and midbrain, competing with class III POU factorsMolecular and Cellular Biology 32:2618–2627.https://doi.org/10.1128/MCB.0008312

Rorβ induces barrellike neuronal clusters in the developing neocortexCerebral Cortex 22:996–1006.https://doi.org/10.1093/cercor/bhr182

The POU proteins Brn2 and Oct6 share important functions in schwann cell developmentGenes & development 17:1380–1391.https://doi.org/10.1101/gad.258203

Control of endodermal endocrine development by Hes1Nature Genetics 24:36–44.https://doi.org/10.1038/71657

TSCAN: Pseudotime reconstruction and evaluation in singlecell RNAseq analysisNucleic Acids Research 44:e117.https://doi.org/10.1093/nar/gkw430

The zincfinger transcription factor Klf4 is required for terminal differentiation of goblet cells in the colonDevelopment 129:2619–2628.

Genetic programs controlling cortical Interneuron fateCurrent Opinion in Neurobiology 26:79–87.https://doi.org/10.1016/j.conb.2013.12.012

HMGA regulates the global chromatin state and neurogenic potential in neocortical precursor cellsNature Neuroscience 15:1127–1133.https://doi.org/10.1038/nn.3165

Absence of the transcription factor Nfib delays the formation of the basilar pontine and other mossy fiber nucleiThe Journal of Comparative Neurology 513:98–112.https://doi.org/10.1002/cne.21943

Expression of Trps1 during mouse embryonic developmentMechanisms of Development 119 Suppl 1:S117–120.https://doi.org/10.1016/S09254773(03)001035

Prox1 expression patterns in the developing and adult murine brainDevelopmental Dynamics 236:518–524.https://doi.org/10.1002/dvdy.21024

Dynamic patterning at the pylorus: formation of an epithelial intestinestomach boundary in late fetal lifeDevelopmental Dynamics 238:3205–3217.https://doi.org/10.1002/dvdy.22134

BMP signaling promotes the growth of primary human colon carcinomas in vivoJournal of Molecular Cell Biology 2:318–332.https://doi.org/10.1093/jmcb/mjq035

Identification of simple reaction coordinates from complex dynamicsThe Journal of Chemical Physics 146:044109.https://doi.org/10.1063/1.4974306

CREB3L1 is a metastasis suppressor that represses expression of genes regulating metastasis, invasion, and angiogenesisMolecular and Cellular Biology 33:4985–4995.https://doi.org/10.1128/MCB.0095913

Tcf3 and Lef1 regulate lineage differentiation of multipotent stem cells in skinGenes & Development 15:1688–1705.https://doi.org/10.1101/gad.891401

Rapid loss of intestinal crypts upon conditional deletion of the Wnt/Tcf4 target gene cMycMolecular and Cellular Biology 26:8418–8426.https://doi.org/10.1128/MCB.0082106

Sox21 is a repressor of neuronal differentiation and is antagonized by YB1Neuroscience Letters 358:157–160.https://doi.org/10.1016/j.neulet.2004.01.026

Roles of the basic helixloophelix genes Hes1 and Hes5 in expansion of neural stem cells of the developing brainJournal of Biological Chemistry 276:30467–30474.https://doi.org/10.1074/jbc.M102420200

Interneuron, interrupted: molecular pathogenesis of ARX mutations and Xlinked infantile spasmsCurrent Opinion in Neurobiology 22:859–865.https://doi.org/10.1016/j.conb.2012.04.006

Signaling mechanisms controlling cell fate and embryonic patterningCold Spring Harbor Perspectives in Biology 4:a005975.https://doi.org/10.1101/cshperspect.a005975

Reprogramming fibroblasts to neuralprecursorlike cells by structured overexpression of pallial patterning genesMolecular and Cellular Neuroscience 57:42–53.https://doi.org/10.1016/j.mcn.2013.10.004

COUPTFII expressing interneurons in human fetal forebrainCerebral Cortex 22:2820–2830.https://doi.org/10.1093/cercor/bhr359

Spatial reconstruction of singlecell gene expression dataNature Biotechnology 33:495–502.https://doi.org/10.1038/nbt.3192

Dynamic expression of notch signaling genes in neural stem/progenitor cellsFrontiers in Neuroscience 5:78.https://doi.org/10.3389/fnins.2011.00078

CSE1L, DIDO1 and RBM39 in colorectal adenoma to carcinoma progressionCellular Oncology 35:293–300.https://doi.org/10.1007/s1340201200882

Trp53 deficiency protects against acute intestinal inflammationThe Journal of Immunology 191:837–847.https://doi.org/10.4049/jimmunol.1201716

The Sox9 transcription factor determines glial fate choice in the developing spinal cordGenes & Development 17:1677–1689.https://doi.org/10.1101/gad.259003

The embryonic cell lineage of the nematode caenorhabditis elegansDevelopmental Biology 100:64–119.https://doi.org/10.1016/00121606(83)902014

Estimating the number of clusters in a data set via the gap statisticJournal of the Royal Statistical Society: Series B 63:411–423.https://doi.org/10.1111/14679868.00293

Regression shrinkage and selection via the lassoJournal of the Royal Statistical Society. Series B 58:267–288.https://doi.org/10.1111/j.14679868.2011.00771.x

Regulation of mammalian epithelial differentiation and intestine development by class I histone deacetylasesMolecular and Cellular Biology 24:3132–3139.https://doi.org/10.1128/MCB.24.8.31323139.2004

Defining cell types and states with singlecell genomicsGenome research 25:1491–1498.https://doi.org/10.1101/gr.190595.115

Stem cells, selfrenewal, and differentiation in the intestinal epitheliumAnnual Review of Physiology 71:241–260.https://doi.org/10.1146/annurev.physiol.010908.163145

ConferenceLearning a parametric embedding by preserving local structureTwelfth International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 384–391.

BetaPoisson model for singlecell RNAseq data analysesBioinformatics 32:2128–2135.https://doi.org/10.1093/bioinformatics/btw202

Sharp thresholds for HighDimensional and noisy sparsity recovery using $\ell _{1}$ Constrained Quadratic Programming (Lasso)IEEE Transactions on Information Theory 55:2183–2202.

Sox3 expression identifies neural progenitors in persistent neonatal and adult mouse forebrain germinative zonesThe Journal of Comparative Neurology 497:88–100.https://doi.org/10.1002/cne.20984

A framework for feature selection in clusteringJournal of the American Statistical Association 105:713–726.https://doi.org/10.1198/jasa.2010.tm09415

Ebf2 is required for development of dopamine neurons in the midbrain periaqueductal gray matter of mouseDevelopmental Neurobiology 75:1282–1294.https://doi.org/10.1002/dneu.22284

Ventral mesencephalonenriched genes that regulate the development of dopaminergic neurons in vivoJournal of Neuroscience 29:5170–5182.https://doi.org/10.1523/JNEUROSCI.556908.2009

Identification, tissue expression, and functional characterization of Otx3, a novel member of the Otx familyJournal of Biological Chemistry 277:28065–28069.https://doi.org/10.1074/jbc.C100767200
Article and author information
Author details
Funding
National Science Foundation
 Leon A. Furchtgott
National Institutes of Health
 Sharad Ramanathan
Allen Foundation
 Vilas Menon
 Sharad Ramanathan
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Sandeep Choubey, Alex Schier, Andrew Murray, David Scadden and Toshihiko Oki for scientific discussions. We also thank Christof Koch, Andrew Murray, KC Huang, Jim Valcourt and Dann Huh for detailed comments and feedback on this work. We are grateful to Simona Lodato for helping us interpret our results and in particular help us write the section on the genes we discover to be important during cortical development. We are particularly grateful to the Senior and Reviewing Editors and to the four peer reviewers, including Nir Yosef, for their very thoughtful comments and suggestions. The study was supported by NSFGRFP and NDSEG Fellowships (LF), an NIH Pioneer Award (SR), and the Allen Institute for Brain Science (SR).
Version history
 Received: August 9, 2016
 Accepted: January 31, 2017
 Version of Record published: March 15, 2017 (version 1)
Copyright
© 2017, Furchtgott et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 5,994
 views

 1,169
 downloads

 21
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
 Developmental Biology
A combination of singlecell techniques and computational analysis enables the simultaneous discovery of cell states, lineage relationships and the genes that control developmental decisions.

 Computational and Systems Biology
 Genetics and Genomics
We propose a new framework for human genetic association studies: at each locus, a deep learning model (in this study, Sei) is used to calculate the functional genomic activity score for two haplotypes per individual. This score, defined as the Haplotype Function Score (HFS), replaces the original genotype in association studies. Applying the HFS framework to 14 complex traits in the UK Biobank, we identified 3619 independent HFS–trait associations with a significance of p < 5 × 10^{−8}. Finemapping revealed 2699 causal associations, corresponding to a median increase of 63 causal findings per trait compared with singlenucleotide polymorphism (SNP)based analysis. HFSbased enrichment analysis uncovered 727 pathway–trait associations and 153 tissue–trait associations with strong biological interpretability, including ‘circadian pathwaychronotype’ and ‘arachidonic acidintelligence’. Lastly, we applied least absolute shrinkage and selection operator (LASSO) regression to integrate HFS prediction score with SNPbased polygenic risk scores, which showed an improvement of 16.1–39.8% in crossancestry polygenic prediction. We concluded that HFS is a promising strategy for understanding the genetic basis of human complex traits.