1. Neuroscience
Download icon

Mapping the transcriptional diversity of genetically and anatomically defined cell populations in the mouse brain

Tools and Resources
  • Cited 0
  • Views 3,531
  • Annotations
Cite this article as: eLife 2019;8:e38619 doi: 10.7554/eLife.38619

Abstract

Understanding the principles governing neuronal diversity is a fundamental goal for neuroscience. Here, we provide an anatomical and transcriptomic database of nearly 200 genetically identified cell populations. By separately analyzing the robustness and pattern of expression differences across these cell populations, we identify two gene classes contributing distinctly to neuronal diversity. Short homeobox transcription factors distinguish neuronal populations combinatorially, and exhibit extremely low transcriptional noise, enabling highly robust expression differences. Long neuronal effector genes, such as channels and cell adhesion molecules, contribute disproportionately to neuronal diversity, based on their patterns rather than robustness of expression differences. By linking transcriptional identity to genetic strains and anatomical atlases, we provide an extensive resource for further investigation of mouse neuronal cell types.

https://doi.org/10.7554/eLife.38619.001

Introduction

The extraordinary diversity of vertebrate neurons has been appreciated since the proposal of the neuron doctrine (Ramon y Cajal, 1894). Classically, this diversity was characterized by neuronal morphology, physiology, and circuit connectivity, but increasingly, defined genetically through driver and reporter strains (Gong et al., 2003; Madisen et al., 2010; Taniguchi et al., 2011; Shima et al., 2016) or genomically by their genome-wide expression profiles. The first genome-wide studies of mammalian neuronal diversity employed in situ hybridization or microarrays (Sugino et al., 2006; Doyle et al., 2008), while more recent studies have utilized advances in single-cell (SC) RNA-seq (Zeisel et al., 2015; Zeisel et al., 2018; Tasic et al., 2016; Tasic et al., 2018; Paul et al., 2017). In theory, SC RNA-seq can be applied in an unbiased fashion to discover all cell types that comprise a tissue, but manipulation of these cell types to better understand their biological composition and function often require the use of genetic tools such as mouse driver strains. Differences in techniques for cell isolation, library preparation or clustering of single cell profiles have not yet led to a consensus view of the number or identity of the neuronal cell types comprising most parts of the mouse nervous system. Furthermore, the relationship between cell populations defined transcriptionally and those that can be specified genetically and anatomically using existing strains has received far less attention (though see Tasic et al., 2018).

Here, we attempt to strengthen the link between genomically and genetically defined cell types in the mouse brain by performing RNA-seq on a large set of genetically identified and fluorescently labeled neurons from micro-dissected brain regions. In total, we profiled 179 sorted neuronal populations and 15 nonneuronal populations. Because each sample of sorted cells may contain more than one 'atomic' cell type, we refer to these as genetically- and anatomically-identified cell populations (GACPs). To assess homogeneity, we quantitatively compared our sorted cell populations to publicly available single-cell datasets, which revealed a comparable level of homogeneity, but a much lower level of noise in the sorted population profiles.

Although neuronal diversity has long been recognized, the question of how this diversity arises has not been addressed sufficiently in a genomic context (Arendt et al., 2016; Muotri and Gage, 2006). We identify two different sets of genes that distinguish GACPs based on the robustness or pattern of their expression differences. The most robust expression differences are those of homeobox transcription factors. These genes also have the lowest transcriptional noise suggesting differential chromatin regulation. Chromatin accessibility measurements reveal that the promoters and gene bodies of these genes are indeed more closed. In contrast, the genes capable of distinguishing the largest numbers of GACPs are neuronal effector genes like receptors, ion channels and cell adhesion molecules. Interestingly, genes defined by the robustness and patterns of their expression differences also differ in their transcript length. Genes with robust, low-noise expression tend to be shorter, while genes with the greatest capacity to distinguish populations tend to be longer.

Here, we provide important new resources for mapping brain cell types including a large set of low-noise profiles from genetically identified neurons, anatomical maps of their distributions, and a method to compare and contextualize single-cell RNA-seq datasets. We implement a novel strategy to mine information from large surveys of cell types, and demonstrate the utility of this strategy in generating specific biological insights into the genes contributing to neuronal diversity.

Results

A dataset of genetically identified neuronal transcriptomes

To identify genes contributing most to mammalian neuronal diversity, we collected transcriptomes from 179 genetically and anatomically identified populations of neurons and 15 populations of nonneuronal cells in mice (Table 1; Figure 1; Figure 1—figure supplement 1; Supplementary file 1,2). The great majority (186/194) were identified both genetically and anatomically, with the remaining identified only anatomically, by their location and projection patterns. Each collected population represents a group of fluorescently labeled cells dissociated and sorted from a specific micro-dissected region of the mouse brain or other tissue. The pipeline for collecting GACP transcriptomes is depicted in Figure 1A (see Materials and methods for additional details). Mouse lines were first characterized by generating a high-resolution atlas of reporter expression (Figure 1B) then, regions containing labeled cells with uniform morphology were chosen for sorting and RNA-seq. In total, we sequenced 2.3 trillion bp in 565 libraries. This effort (NeuroSeq) constitutes the largest and most diverse single collection of genetically identified cell populations profiled by RNA-seq. The raw data is deposited to NCBI GEO (GSE79238). The processed data, including anatomical atlases, RNA-seq coverage, and TPM are available at http://neuroseq.janelia.org (Figure 1C).

Figure 1 with 3 supplements see all
The NeuroSeq dataset.

(A) Schema of pipeline for anatomical and genomic data collection. (B) Example sections from atlases at low (top), medium (middle) and high (bottom) magnifications. (C) Web tools available at http://neuroseq.janelia.org (D) Sensitivity of library preparation measured from ERCC detection across all libraries. The 50% detection sensitivity of the assay itself was 23 copy*kbp.

https://doi.org/10.7554/eLife.38619.002
Table 1
Summary of profiled samples.
https://doi.org/10.7554/eLife.38619.006
Region/typeTransmitter#groupsSubregions#samples
CNS neuronsOlfactory (OLF)glu10AOBmi, MOBgl, PIR, AOB, COAp30
GABA4AOBgr, MOBgr, MOBmi11
Isocortexglu22VISp, AI, MOp5, MO, VISp6a, SSp, SSs, ECT, ORBm, RSPv68
GABA3Isocortex, SSp (Sst+, Pvalb+)7
glu,GABA1RSPv3
Subplate (CTXsp)glu1CLA4
Hippocampus (HPF)glu24CA1, CA1sp, CA2, CA3, CA3sp, DG, DG-sg, SUBd-sp, IG65
GABA4CA3, CA, CA1 (Sst+, Pvalb+)12
Striatum (STR)GABA12ACB, OT, CEAm, CEAl, islm, isl, CP33
Pallidum (PAL)GABA1BST4
Thalamus (TH)glu11PVT, CL, AMd, LGd, PCN, AV, VPM, AD29
Hypothalamus (HY)glu11LHA, MM, PVHd, SO, DMHp, PVH, PVHp36
GABA4ARH, MPN, SCH15
glu,GABA2SFO3
Midbrain (MB)DA2SNc, VTA5
glu2SCm, IC6
5HT2DR10
GABA1PAG4
glu,DA1VTA3
Pons (P)glu7PBl, PG22
NE1LC2
5HT2CSm7
Medulla (MY)GABA7AP, NTS, MV, NTSge, DCO18
glu6NTSm, IO, ECU, LRNm20
ACh2DMX, VII6
5HT1RPA3
GABA,5HT1RPA4
glu,GABA1PRP3
Cerebellum (CB)GABA10CUL4, 5mo, CUL4, 5pu, CUL4, 5gr, PYRpu25
glu4CUL4, 5gr, NODgr10
Retinaglu5ganglion cells (MTN, LGN, SC projecting)14
Spinal Cordglu1Lumbar (L1-L5) dorsal part3
GABA4Lumbar (L1-L5) dorsal part, central part12
PNSJugularglu2(TrpV1+)7
Dorsal root ganglion (DRG)glu2(TrpV1+, Pvalb+)5
Olfactory sensory neurons (OE)glu4MOE,VNO9
nonneuronMicroglia2MOp5(Isocortex),UVU(CB) (Cx3cr1+)6
Astrocytes1Isocortex (GFAP+)4
Ependyma1Choroid Plexus2
Ependyma2Lateral ventricle (Rarres2+)6
Epithelial1Blood vessel (Isocortex) (Apod+, Bgn+)3
Epithelial1olfactory epithelium2
Progenitor1DG (POMC+)3
Pituitary1(POMC+)3
non brainPancreas2Acinar cell, beta cell7
Myofiber2Extensor digitorum longus muscle7
Brown adipose cell1Brown adipose cell from neck.4
total194565

To determine the sensitivity of our transcriptional profiling, we used ERCC spike-ins. Amplified RNA libraries had an average sensitivity (50% detection) of 23 copy*kbp of ERCC spike-ins across all libraries (Figure 1D). Since manually sorted samples had 132 ± 16 cells (mean ± SEM), this indicates our pipeline had the sensitivity to detect a single copy of a transcript per cell 80% of the time. This high sensitivity allowed for deep transcriptional profiling in our diverse set of cell populations.

To assess the extent of contamination in the dataset, we checked expression levels of marker genes for several nonneuronal cell populations (Figure 1—figure supplement 2B). As previously shown (Okaty et al., 2011), manual sorting produced, in general, extremely clean data.

To assess the homogeneity of the sorted, pooled samples, we compared our datasets to publicly available single cell (SC) datasets. To compare across different datasets, we used a method based on linear decomposition by non-negative least squares (NNLS) (See Figure 2 and Figure 2—figure supplement 16). This method tests the degree to which individual profiles can be decomposed into linear mixtures of profiles from another dataset. Such mixtures or impurities can arise in at least two ways (Figure 2A): by pooling similar cell types prior to sequencing in the case of sorted datasets, or by pooling similar profiles after sequencing, at the clustering stage, in the case of SC datasets. Although NNLS is a widely used decomposition procedure, it has not previously been applied to expression profiles. Therefore, we performed a number of control experiments to validate its use. First, we cross-validated the decompositions by dividing each dataset in half and testing the ability to decompose one half by the other (Figure 2—figure supplement 1). This revealed that some NeuroSeq samples had overlapping coefficients and so could not be well distinguished. For example, pairs of populations identified in layer 2/3 of two different regions in the same strain (AI.L23_glu_P157/ORBm.L23_glu_P157) or by retrogradely labeled cells in the same layer and region from two different targets (SSp.L23_glu_M1.inj/SSp.L23_glu_S2.inj and SSp.L5_glu_BPn.inj/SSp.L5_glu_IRT.in) were hard to distinguish. On the other hand, overlapping coefficients were also present for some pairs of cell populations in the SC datasets (such as Oligo Serpinb1a/Oligo Synpr in the Tasic dataset and MGL1/MGL2/MGL3 in the Zeisel dataset). On average the purity, defined as how well a single sample can be decomposed into the most closely corresponding sample, was similar across the three datasets (Figure 2—figure supplement 1D). As a second control, we demonstrated that NNLS decomposition could be used to recover the numbers of cell types isolated from distinct strains in a SC dataset, after mixing these profiles together, despite the fact that this information was not included in the fitting procedure (Figure 2—figure supplement 2). Finally, NNLS (Figure 2B,C) produced comparable or cleaner decompositions than a competing Random Forest algorithm (Figure 2—figure supplement 6). These results indicate that NNLS can be used to reliably decompose mixtures of cellular profiles. Similar average coefficients (i.e. similar purity) were obtained for decompositions of the NeuroSeq data by SC datasets and by decomposing these datasets by each other (Figure 2, Figure 2—figure supplement 36). Hence our decomposition results indicate that although heterogeneity may exist in some of our sorted samples, it is comparable to the inaccuracies introduced by clustering SC profiles.

Figure 2 with 7 supplements see all
Decomposition by non-negative least squares (NNLS) fitting.

(A) Diagram illustrating potential sources of heterogeneity at the separation phase in profiles from sorted cells (left) or at the clustering phase in profiles from single cells (right). (B,C) NNLS coefficients of NeuroSeq cell populations decomposed by two scRNA-seq datasets: (Tasic et al., 2018; Zeisel et al., 2018). (D) Mean purity scores for NeuroSeq and SC datasets. The purity score for a sample is defined as the ratio of the highest coefficient to the sum of all coefficients. Error bars are Std. Dev.

https://doi.org/10.7554/eLife.38619.007

Since merging or splitting of closely related clusters either prior to sequencing or during the clustering process can lead to poor discrimination between samples, we also measured the separability of cell population profiles obtained in each study (Figure 2—figure supplement 7). As expected, the clusters of sorted population samples, which are averages across one hundred cells or more, were much more cleanly separable than SC clusters. Taken together, NNLS decomposition and separability provide a quantitative framework for assessing the trade-offs between homogeneity and reproducibility when measuring population transcriptomes from GACPs and SCs.

To demonstrate the utility of the dataset, made possible by its broad sampling of neuronal populations, we extracted pan-neuronal genes (genes expressed commonly in all neuronal populations but expressed at lower levels or not at all in nonneuronal cell populations; Figure 1—figure supplement 3). Here, broad sampling of cell populations is essential to avoid false positives (Zhang et al., 2014b; Mo et al., 2015; Stefanakis et al., 2015). Because of the high sensitivity and low noise, we were able to be conservative and exclude genes expressed in most but not all neuron types. Extracted pan-neuronal genes contain well-known genes such as Eno2 (Enolase2), which is the neuronal form of Enolase required for the Krebs cycle, Slc2a3 (chloride transporter) required for inhibitory transmission, and Atp1a3 (ATPase Na+/K + transporting subunit alpha 3), which belongs to the complex responsible for maintaining electrochemical gradients across the membrane, as well as genes not previously known to be pan-neuronal, such as 2900011O08Rik (now called Migration Inhibitory Protein; Zhang et al., 2014a). Synaptic genes are often differentially expressed among neurons, but interestingly, some were included in this pan-neuronal list such as Syn1, Stx1b, Stxbp1, Sv2a, and Vamp2. These appear to be common synaptic components, and highlight essential parts of these complexes. Thus, the dataset should be useful for many other applications, especially those requiring comparisons across a wide variety of neuronal cell types.

Metrics to quantify diversity

Analysis of expression differences between individual groups is the basis of most profiling efforts. Variance-based metrics, such as Analysis of Variance (ANOVA) F-Value, or coefficient of variation (CV) are commonly used for this purpose. However, these metrics are jointly affected by the pattern of differential expression and the robustness of the differences, and so cannot readily separate these two features (Figures 3 and 4; Figure 3—figure supplement 1). Since these features may differ in their biological significance, we searched for the simplest way to quantitatively separate them. This led us to adopt two easily calculated variants of widely used metrics for differential expression and fold-change.

Figure 3 with 1 supplement see all
Gene expression metrics related to information content and robustness (Left) Cartoon illustrating the process of calculating fold-change ratio (FCR) and differentially expressed fraction (DEF) for four different hypothetical genes that differ in the information content (2 and 4 vs. 1 and 3) and signal-to-noise ratio (SNR; 1 and 2 vs. 3 and 4) of their expression patterns across cell populations.

(Middle) Expression signals are used to construct matrices for each gene of the log fold-changes between populations (fold-change matrix) and the distinctions between populations based on those differences (Differentiation Matrix; DM; see Materials and methods). (Right) The differentially expressed fraction (DEF) is the fraction of the total pairs of cell populations distinguished (i.e. of nonzero values in DM excluding diagonal). The fold-change ratio (FCR) is the average expression difference between distinguished pairs divided by the average expression difference between undistinguished pairs. Orange and blue bars show that the resulting DEF and FCR calculations capture the variations in information and SNR across the four genes.

https://doi.org/10.7554/eLife.38619.015
Figure 4 with 1 supplement see all
DEF and FCR capture distinct aspects of expression diversity related to information content and robustness.

(A) Highly variable genes (warm colored dots; color scale shows significance of ANOVA across cell populations) include both genes with high FCR and low DEF (like Tfap2c and Hoxa9) and genes with lower FCR and high DEF (like Chrm1 and Slc17a6). (B) Expression profiles of example genes labeled in A. Sample key in horizontal color bar as in Figure 1—figure supplement 13. Red ticks at left indicate 0; Vertical scale is log2(FPKM+1); blue ticks = 6) (C) DMs for example genes, calculated as shown in Figure 3. (DE) HUGO gene groups enriched in the top 1000 FCR and top 1000 DEF genes. Red lines indicate the p = 10-5 threshold used to judge significance.

https://doi.org/10.7554/eLife.38619.017

To quantify the contribution of each gene to cell type diversity, we measured the fraction of cell population pairs in which the gene is differentially expressed. (For differential analysis, the limma-voom framework was used, see Materials and methods). This differentially expressed fraction (DEF) is closely related to the Gini-Simpson diversity index (Simpson, 1949) widely used in ecology to measure species diversity in a community (see Appendix 1). DEF ranges from 0 to 1. The maximum observed value of 0.65 indicates that the gene distinguishes 65% of the pairs, while a value of 0 indicates that the gene distinguishes none (i.e. it is expressed at similar levels in all cells). DEF is easy to calculate and approximates the mutual information (MI) between expression levels and cell populations (Appendix 1).

The robustness of an expression difference depends on its magnitude relative to the underlying noise. Robustness is often quantified as a Signal-to-Noise Ratio (SNR). Since the signals we are interested in are the gene expression differences distinguishing cell types, we computed the ratio of the mean fold-change expression differences between distinguished pairs to the mean fold-change between undistinguished pairs. This fold-change ratio (FCR) indicates the robustness of pair distinctions but is independent of the number of pairs distinguished. High FCR genes robustly distinguish cell populations and are therefore suitable as 'marker genes'.

Unlike DEF and FCR, variance-based methods like ANOVA F-values and CV are either affected by both MI and SNR (ANOVA; Figure 4A–C and Figure 3—figure supplement 1) or by neither (CV; Figure 3—figure supplement 1). The fact that ANOVA does not distinguish between information content and SNR can be appreciated from the fact that the most significant ANOVA genes (Figure 4A–C) include both high DEF and high FCR genes. Therefore, DEF and FCR are useful because they provide independent measures of the robustness and magnitude of differential expression between cell populations.

To determine the types of genes most differentially expressed (highest DEF) and most robustly different (highest FCR) between cell populations, we performed over-representation analysis using the HUGO Gene Groups (Braschi et al., 2018, Figure 4D,E). The most robust expression differences (highest FCR) were those of homeobox transcription factors (TFs) and G-protein coupled receptors (GPCRs; Figure 4D). High DEF genes are enriched for neuronal effector genes including receptors, ion channels and cell adhesion molecules (Figure 4E). High FCR and High DEF enrichments were based on the HUGO gene groups, but similar results were obtained using the PANTHER gene families (Mi et al., 2017) and Gene Ontology annotations (Ashburner et al., 2000, Figure 4—figure supplement 1). In the case of the high FCR genes, the Gene Ontology categories differed, since this ontology lacks a separate category for homeobox transcription factors. Instead multiple parent categories (e.g. sequence-specific DNA binding, RNA polymerase II regulatory region DNA binding etc.) were overrepresented.

Thus, using these two simple metrics we identify synaptic and signaling genes as the most differentially expressed, and homeobox TFs and GPCRs as the most robustly distinguishing families of genes. These two categories of genes drive neuronal diversity by endowing neuronal cell types with specialized signaling and connectivity phenotypes, and by orchestrating cell-type-specific patterns of transcription. In addition, their distinct contributions to distinguishing neuronal types suggests possible differences in the regulation of these two categories of genes.

Homeobox TFs have the highest SNRs and can form a combinatorial code for cell populations

FCR, like SNR, is a ratio between signal and noise, and so can reflect high expression levels in most ON cell types (high signal), low expression levels in most OFF cell types (low noise), or both. Homeobox genes are not among the most abundantly expressed genes. Their average expression levels (~30 FPKM) are significantly lower than, for example, those of neuropeptides (~90 FPKM). This suggests that the high FCRs of homeobox TFs depend more on low noise than high signal. In fact, many homeobox TFs have uniformly low expression in OFF cell types (Figure 5A top). We quantified this 'OFF noise’ for all genes and found that homeobox genes are enriched among genes that have both low OFF noise and at least moderate ON expression levels (red dashed region in Figure 5B; see also Figure 5—figure supplements 1 and 2). Homeobox genes were not enriched in a group of high OFF noise genes (blue dashed region in Figure 5B; data not shown) that was matched for maximum expression level (Figure 5—figure supplement 1C). The enrichment of homeoboxes was also observable in two of the single-cell datasets encompassing multiple brain regions (Figure 5—figure supplement 3).

Figure 5 with 4 supplements see all
Mechanisms contributing to low noise and high information content of homeobox TFs.

(A) Example expression patterns of a LIM class homeobox TF (Lhx1) and a calcium binding protein (Calb2) with similar overall expression levels. Sample key as in Figure 1—figure supplement 13. (B) (upper) OFF state noise (defined as standard deviation (std) of samples with FPKM<1) plotted against maximum expression. (lower) HUGO gene groups enriched in the region indicated by red dashed box in the upper panel (see Figure 5—figure supplement 1 for PANTHER and Gene Ontology enrichments). (C) Average (replicate n = 2) ATAC-seq profiles for the genes shown in A. Some peaks are truncated. Expression levels are plotted at right (gray bars). (D) Length-normalized ATAC profile for genes with high (> 0.3, blue dashed box in B, n = 853) and low (< 0.2, red dashed box in B, n = 1643) OFF state expression noise. (E) Each circle represents the orthogonality of expression patterns calculated using HUGO gene groups. Orthogonality is a measure of the degree of non-redundancy in a set of expression patterns. Since the dispersion of orthogonality depends on family size, results are compared to orthogonality calculated from randomly sampled groups of genes (green solid lines: mean and Std. Dev.; green dashed lines: 99% confidence interval). Families, Z-scores, family size: 1. GPCR: 17.1, n = 277; 2. Homeoboxes: 16.6, n = 148; 3. Ion channels: 10.7, n = 275; 4. C2 domain containing: 7.8, n = 159; 5. Zinc fingers: 6.9, n = 1002; 6. Immunoglobulin superfamily domain containing: 6.7, n = 292; 7. PDZ domain containing: 6.3, n = 144; 8. Fibronectin type III domain containing: 5.9, n = 143; 9. Endogenous ligands: 5.1, n = 165; 10. Basic helix-loop-helix proteins: 4.9, n = 77.

https://doi.org/10.7554/eLife.38619.019

Tight control of expression may reflect closed chromatin. To test this, we measured chromatin accessibility using ATAC-seq (see Materials and methods). As expected, compared to high-noise genes (Figure 5C bottom), genes with low OFF noise had fewer and smaller peaks within the vicinity of their transcription start site (TSS) and gene body (Figure 5C top, Figure 5D), consistent with the idea that chromatin accessibility contributes to their low OFF noise. Functionally, the tight control of homeobox TF expression levels may reflect their known importance as determinants of cell identity, and that establishing and maintaining robust differences between cell types may require tight ON/OFF regulation rather than graded regulation.

Homeobox containing TFs can be subdivided into subfamilies based on their structure. The different homeobox subfamilies differed in their OFF noise and hence in their FCR values. Some families (e.g. HOXL, NKL, PRD) had very low OFF noise and high FCR, while others (e.g. CERS, PROS, CUT) had higher OFF noise and lower FCR (Figure 5—figure supplement 4).

The ability of gene families to provide information about cell identities reflects both how informative individual family members are, and the relationships between them. If the information across family members is independent, the overall information is increased relative to the case in which multiple members contain redundant information. This aspect of 'family-wise’ information is not captured by 'gene-wise’ metrics like mean DEF, or by enrichment analysis (Figure 4D,E). One means of capturing the additive, non-redundancy within a gene family is to measure the orthogonality of expression patterns among the member genes. This analysis (Figure 5E) reveals that homeobox TFs and GPCRs have the greatest orthogonality between cell types among HUGO groups (as well as in PANTHER families, Figure 5—figure supplement 1E). Related to this, we found that the homeobox family can distinguish more than 99% of GACP pairs, suggesting these TFs comprise a combinatorial code for the cell populations profiled. To illustrate this, we computed the minimum set of homeobox TFs needed to distinguish the populations studied and found that a set of as few as 8 could distinguish 99% of GACP pairs (Figure 5—figure supplement 2B). Combinatorial codes could also be produced from other highly orthogonal gene families, as illustrated for GPCRs Figure 5—figure supplement 2C). As illustrated in these heat maps, expression differences for homeobox TFs had higher contrast, consistent with the fact that individually, homeobox TFs have the highest FCR (Figure 4D) and lowest OFF noise (Figure 5B). In summary, we found that many homeobox genes are expressed with a very high SNR and are one of the groups of genes with the most orthogonal expression patterns. This suggests that, similar to other tissues (Pereira et al., 2015Gendrel et al., 2016Zheng et al., 2015; Dasen and Jessell, 2009; Philippidou and Dasen, 2013), homeobox TFs play an important role in specifying cell types in the brain.

Diversity arising from alternative splicing

Alternative splicing is known to increase transcriptome diversity (Andreadis et al., 1987). To assess the contribution of alternative splicing to diversifying transcriptomes across cell populations, we quantified the branch probabilities at each alternative splice donor site within each gene (Figure 6A top). The branch probabilities at each donor site are the relative frequencies with which particular splice acceptors are chosen, and can be estimated from observed junction read counts. Branch probabilities are highly bimodal (Figure 6A bottom), suggesting that most branch point choices are made consistently, in an all-or-none fashion, for any given cell population.

Alternative splicing and neuronal diversity.

(A) (Top) Schematic representation of branch probabilities. Alternative donor sites (red dot) can be spliced to multiple acceptor sites 1,,m with probabilities p1,,pm. (Bottom) Distribution of branch probabilities across all samples and all alternative splice sites. (B) Heatmap showing branch probabilities across neuronal samples for branches with highest splice DEF. Each row corresponds to a branch within the indicated gene on the left and the location is indicated on the right. Samples without junctional reads at this branch are colored white. (C) Enriched HUGO gene groups and PANTHER protein classes for genes with top 1000 combined splice DEF. (D) Splice graphs illustrating examples of alternative splicing leading to inclusion or exclusion (marked 'i', 'e’ of Pfam domains (magenta exons) with branch probabilities shown in the heatmap below. Previously unannotated exons and junctions are blue; annotated are black. Dotted lines indicate branches predicted to lead to nonsense-mediated decay (NMD). A red star above an exon indicates existence of a premature termination codon (PTC) within the exon which satisfies the '50nt rule' for NMD (Nagy and Maquat, 1998) (i.e. more than 50 bp upstream to the next junction), whereas a black star indicates existence of a PTC within 50 bp of the next junction. Dashed lines and hatches indicate that there is no coding path through the element. (>) indicates an annotated translation start site. (E) Proportion of branch points predicted to lead to NMD (purple), altered Pfam inclusion (red), or both (overlapped region), at one or more of its branches.

https://doi.org/10.7554/eLife.38619.024

To test the significance of differential splicing across cell populations, we utilized a statistical test based on the Dirichlet-Multinomial distribution and the log-likelihood ratio test, developed in LeafCutter (Li et al., 2018). We used pair-wise differential expression of each branch to calculate a branch DEF, much as we previously calculated the differentially expressed fraction (DEF) from expression values (Figure 3). Examples of branches with high DEFs are shown in Figure 6B. The list includes known examples like the site of the flip and flop variants of the AMPA receptor subunit Gria2 (Sommer et al., 1990). Another previously known example is the splicing regulator muscleblind like splicing factor 2 (Mbnl2), which is known to regulate splicing in the developing brain (Charizanis et al., 2012) and is known to be spliced at multiple sites, including the one shown in Figure 6B (Pascual et al., 2006).

In order to determine which families of genes are highly differentially spliced, we computed a splice DEF per gene by combining the ability of a gene's alternatively spliced sites to distinguish a pair of samples (i.e. a pair is distinguished by a gene if any alternatively spliced site in the gene can distinguish the pair). Using this combined splice DEF, we found that RNA binding proteins, especially splicing related factors (such as Pcbp2 and Mbnl2) are highly alternatively spliced among neuronal cell types (Zheng and Black, 2013), but over-represented categories also included other families such as Glutamate receptors and G-protein modulators (Figure 6C).

To begin to assess the functional impact of alternative splicing, we determined which alternative sites lead to inclusion or exclusion of a known protein domain using the Pfam database (Finn et al., 2015). In addition to providing information relevant to the potential functions of many previously unknown isoforms, our analysis also provides a more comprehensive view of known splice events. Two examples are shown in Figure 6D. Alternative splicing of Amyloid precursor-like protein 2 (Aplp2) is known to regulate inclusion of a bovine pancreatic trypsin inhibitor (BPTI) Kunitz domain (Sandbrink et al., 1997) and this domain is known to regulate proteolysis of the related protein APP, the amyloid precursor protein implicated in Alzheimer’s disease (Beckmann et al., 2016). Differential inclusion of this exon is known to occur between neurons and nonneurons. Intriguingly, we found that splicing at this site in hippocampal interneurons differs not only from that in forebrain excitatory neurons, but also from other forebrain inhibitory neurons in neocortex and striatum. Kalirin (Kalrn) is a RhoGEF kinase implicated in Huntington’s disease, schizophrenia and synaptic plasticity (Penzes and Jones, 2008). Kalrn is known to be regulated via binding of adaptor proteins to its SH3 (SRC homology 3) domains (Schiller et al., 2006) which is regulated by alternative splicing of this domain. In addition to expanding the number of known variants (blue exons and junctions in Figure 6D) we reveal their detailed distribution across the profiled set of neural populations. In total, the data reveal a detailed quantitative view of hundreds of thousands of known and unknown cell-type-specific splicing events, providing an unmatched resource for investigating their functional significance.

Not all splicing events alter the inclusion or exclusion of known protein domains. Many splicing events introduce frame shifts or new stop codons and hence are predicted to lead to nonsense-mediated decay (NMD). Coupling of regulated splicing to NMD is believed to be an important mechanism for regulating protein abundance (Lewis et al., 2003). Consistent with previous observations (Yan et al., 2015; Mauger and Scheiffele, 2017), we noticed that most alternative sites contain branches that can lead to NMD (Figure 6E). This suggests that alternative splicing may contribute not only to the diversity of isoforms present, but also to diversity defined on the basis of transcript abundance.

The present results provide a comprehensive resource of known and novel splicing events across a large number of neuronal cell types. Altogether, nearly 70% of alternative sites lead to differential inclusion of a known Pfam domain or NMD (Figure 6E), and thus to functional or quantitative diversity across cell types.

Long genes contribute disproportionately to neuronal diversity

We found that neuronal effector genes (ion channels, receptors and cell adhesion molecules, etc.) have the greatest ability to distinguish cell populations (Figure 4E). Previously, these categories of genes have been found to be selectively enriched in neurons and to share the physical characteristic of being long (Sugino et al., 2014; Gabel et al., 2015; Zylka et al., 2015). Consistent with this, DEF, which approximates the mutual information (MI) between expression levels and cell populations, is significantly correlated with length (Figure 7A; correlation coefficient = 0.19; p=7.5e-189), reaching a maximum for the very longest genes. Long genes (100 kb) have nearly twice the average ability to distinguish cell populations (DEF) as shorter genes (Figure 7A), and provide greater family-wise separation between cell types (Figure 7C). Analyzing publicly available single-cell data confirms that this bias is broadly observable (Figure 7—figure supplement 1). In contrast, FCR, which measures the signal-to-noise or robustness of expression differences, is higher for shorter genes, reaching a maximum for genes below 10 kbp in length (Figure 7B).

Figure 7 with 3 supplements see all
Long genes have a greater capacity for distinguishing cell populations.

(A) DEF as a function of gene length. For violin plots in A, B, D, genes are sorted by length and binned (four bins per log unit). (B) Robustness of expression difference (FCR) as a function of gene length. (C) Orthogonality of expression patterns calculated as in Figure 5E, but using long neuronal genes (n = 1829, 100 kb) and short neuronal genes (n = 10572, <100 kb) rather than functionally defined gene families. Z-score is 33.2 for long and 22.1 for short neuronal genes. Both are highly different from randomly sampled genes (green solid lines mean and Std. Dev.; dashed lines = 99% confidence interval), but long genes provide greater separation. (D) Splice DEF as a function of gene length. (E) Fraction of pairs distinguished by splicing (splice-only), transcript abundance (exp-only), or by both measures. (F) Variation in long gene expression in neuronal and nonneuronal populations across major brain regions studied. Error bars are SEM. (CTXsp consisted of single region: Claustrum.)

https://doi.org/10.7554/eLife.38619.025

Recently, (Raman et al., 2018) have argued that many prior observations of long gene bias are not significant when controlling for baseline variability in length-dependent expression. In order to assess the applicability of this argument to the present observations, we compared the fold-changes across length between groups and within replicates of individual groups as in Raman et al. (2018). An example of this test applied to two populations is shown in Figure 7—figure supplement 2A,B. Even after applying corrections for multiple comparisons across all bins, the long gene bins (100 kb) are highly significant. Panels C,D of this figure illustrate the results of performing this comparison for all GACPs in our dataset. The median fraction of significant long gene bins (0.89) greatly exceeded the fraction of short gene bins (0.1). A more detailed analysis of the test developed by Raman et al. and its application to other observations will be published elsewhere.

In addition to being differentially expressed, long genes are likely to have a larger number of exons and hence a greater potential for differential splicing. To evaluate the degree to which differential splicing of long genes contributes to distinguishing cell populations we plotted the splice DEF (Figure 6) as a function of gene length. As expected, DEF calculated from differential splicing also increased with gene length (Figure 7D), although the slope was more gradual and the maximum DEF value achieved was less than that for gene expression (Figure 7A). For each gene, we measured the fraction of cell populations pairs that could be distinguished on the basis of differential expression, differential splicing, or both. This revealed that for the current dataset, the average alternatively spliced gene distinguishes only 1.4% of cell populations, but distinctions based on expression of these same genes were nearly 10 times more common (13.9%, Figure 7E).

Finally, to determine whether neuronal long gene expression contributes more to profiles in some anatomical regions than in others, we plotted the fraction of the longest genes expressed in neuronal and nonneuronal populations across each of the major brain regions studied. The results confirm strong differences between neurons and nonneurons and show the strongest long gene expression in forebrain regions, with weaker expression evident in hindbrain (Figure 7F). Analyses of single-cell datasets revealed similar trends (Figure 7—figure supplement 3).

Discussion

A resource of genetically identified neuronal transcriptomes

The dataset presented here is the largest collection of transcriptomes of anatomically and genetically specified neuronal cell types available in a mammalian species (Table 1). The approach employed in this study provides a complementary view of neuronal diversity to that afforded by SC sequencing. By sorting and pooling ∼100 cells chosen based on genetic and anatomical similarity, we generated profiles with low noise and high depth, but, where tested, with a comparable degree of homogeneity, as that obtained in recent SC studies.

The fact that each transcriptome corresponds to a genetically (or retrogradely) labeled population will foster reproducible studies across investigators. The few profiles in our study that mapped to more than one SC profile (Figure 2), may represent cell types better distinguishable using SCs or improved genetic markers, or alternatively, may represent cell populations that are highly overlapping. The optimal granularity with which cell types may be distinguished remains an open question. Pooling cell profiles either prior to sequencing, as in this study, or after sequencing at the clustering phase, as in SC studies, risks compromising profile homogeneity. However, over-fragmenting clusters risks the opposite problem of reducing the reliability and reproducibility with which populations can be distinguished across studies. Given the complementary advantages of improved reproducibility and separability afforded by pooling profiles, and of reduced heterogeneity afforded by maximally separating profiles, further integration of these approaches with other modalities, such as FISH (Moffitt et al., 2016) are needed to accurately profile the full census of brain cell types. By linking these efforts to genetically identified neurons, the present dataset provides a useful resource for these efforts.

A transcriptional code for neuronal diversity

We utilized easily calculated metrics that capture essential features of the robustness and information content of transcriptome diversity. These measures are simply versions of Fold-Change (FCR) and Differential Expression (DEF) adapted to the analysis of many separate populations simultaneously. Importantly, they capture independent components of the differences captured by variance-based metrics like ANOVA and CV (Figure 4A, Figure 3—figure supplement 1). Metrics like ANOVA are influenced jointly by signal-to-noise and mutual information, while FCR and DEF better separate them (Figure 3—figure supplement 1) and so these metrics may be more broadly useful when making genome-wide comparisons across many populations. In the present dataset, FCR and DEF identified two very different sets of genes contributing to neuronal diversity: high FCR, low-noise genes, exemplified by homeobox transcription factors, and high DEF, long neuronal effector genes like ion channels, receptors and cell adhesion molecules.

The homeobox family of TFs exhibited the most robust (high FCR) expression differences across cell types (Figure 4D). These ON/OFF differences were characterized by extremely low expression in the OFF state (Figure 5). Mechanistically, the low expression was associated with reduced genome accessibility measured by ATAC-seq (Figure 5C,D), presumably reflecting epigenetic regulation of the OFF state, known to occur for example at the clustered Hox genes via Polycomb group (PcG) proteins (Montavon and Soshnikova, 2014). Although this regulation has been studied most extensively at Hox genes, genome-wide ChIP studies reveal that PcG proteins are bound to over 100 homeobox TFs in ES cells (Boyer et al., 2006). Our results indicate that strong cell-type-specific repression persists in the adult brain, presumably due to the continued functional importance of preventing even partial activation of inappropriate programs of neuronal identity.

Although individually, homeobox TFs contain less information about cell types than long neuronal effector genes, their patterns of expression are highly orthogonal and therefore their joint expression pattern is highly informative. As a group, homeobox TFs distinguished more than 99% of neuronal cell types profiled (Figure 5—figure supplement 2). (Note this includes several Purkinje and hippocampal pyramidal cell groups that may actually represent duplicate examples of the same cell types). Historically, homeobox TFs are well known to combinatorially regulate neuronal identity in Drosophila and C. elegans (Pereira et al., 2015; Gendrel et al., 2016) and the vertebrate brainstem and spinal cord (Dasen and Jessell, 2009; Philippidou and Dasen, 2013). Our results suggest a broader importance of homeobox TFs throughout the mammalian nervous system. Continued expression of these factors in adult neurons suggests that they likely also contribute to the maintenance of neuronal identity.

Long genes and neuronal diversity

Our study suggests that long neuronal effector genes contribute disproportionately to neuronal transcriptional diversity (Figure 7). Previously, it was reported that differences in transcript length can bias differential expression analysis of RNA-seq data (Oshlack and Wakefield, 2009). To ensure that we avoided this bias, we used counts of reads only from within the one kbp-long 3’ ends of the genes for calculating expression values. Recently, an alternative statistical analysis has been used to argue that some of these length biases may be artefactual (Raman et al., 2018). Despite concerns about the rigor of this analysis (manuscript in preparation), we found that the observed length biases remain highly significant, even within this statistical framework (Figure 7—figure supplement 2), suggesting that they are robust features of the transcriptional differences between neuronal populations.

Long genes are expressed at higher levels in neurons than in nonneuronal cells in the nervous system, a bias that was also present in SC datasets (Figure 7—figure supplements 1 and 2) and that has been reported previously (Sugino et al., 2014; Gabel et al., 2015; Zylka et al., 2015). These differences are greatest in the forebrain (Figure 7F; Figure 7—figure supplement 2), perhaps reflecting the large numbers of distinct cell types in these regions and the enhanced ability of these genes to distinguish GACPs based on their expression. However, we and others did not measure cell-type-specific protein expression, and so cannot be sure that the long gene bias extends to the level of neuronal proteins.

Long genes tend to have larger numbers of exons and therefore are likely to be expressed in a larger number of distinct isoforms as a result of alternative splicing (alternative start sites also contribute). We quantified differential splicing from analysis of junctional reads. Interestingly, branch probabilities at most sites of alternative splicing were highly bimodal (Figure 6A), suggesting that within each GACP, splicing is largely all or none, a finding previously reported in single immune cells (Shalek et al., 2013) but not found in some single neuron studies (Gokce et al., 2016). This led to patterns that often flipped between high and low probabilities for a given branch as one traversed major brain region boundaries (Figure 6B). More than two thirds of these splicing events lead to inclusion or exclusion of known protein domains (Figure 6E), but many of these, as well as some of the remaining events that do not modify domain structure, also introduce a frame shift or premature stop codon, and so are predicted to lead to nonsense mediated decay (NMD). We did not directly test the contribution of NMD to transcript abundance, but our splicing results are consistent with the idea that this may be an important mechanism for regulating transcript stability and hence transcript abundance across different cell populations (Yan et al., 2015; Traunmüller et al., 2014). While differential splicing is able to distinguish fewer GACPs than transcript abundance (Figure 7E), this may be an underestimate for two reasons. First, as just noted, splicing may influence transcript abundance through NMD, and second, the sensitivity to detect splicing differences depends on an adequate number of junctional reads. Deeper sequencing could increase the apparent contribution of this component of neuronal diversity.

Long genes are enriched in the signaling molecules, receptors and ion channels responsible for input/output transformations in neurons, and the cell adhesion molecules that specify neuronal connectivity. The finding that these genes play an important role in diversifying cortical interneurons (Paul et al., 2017), as well as distinguishing the larger set of populations studied here, is sensible in light of the phenotypic diversity required for neuronal communication and connectivity. These genes are long because of long introns that are rich in sequences derived from transposons and other retroelements (Grishkevich and Yanai, 2014). Whether or how this increased length has any functional significance for the regulation of these genes is unclear from our studies, but it is intriguing that these long genes are disrupted in forms of autism spectrum disorder (Zylka et al., 2015; Wei et al., 2016) and in the related developmental disorder Rett Syndrome (Sugino et al., 2014; Gabel et al., 2015), where loss of the chromatin protein Mecp2 leads to selective upregulation of long neuronal genes in a highly cell-type-specific fashion. These studies suggest the possibility that long neuronal genes are subject to distinct modes of regulation, with particular significance for neuronal diversity.

In contrast to long neuronal effector genes, which tend to be expressed later in development as neurons mature phenotypically (Okaty et al., 2009), low noise, high FCR genes are frequently critical for early development. These genes, such as many of the homeobox TFs, are often quite short and, at least in the case of the Hox genes, are known to be remarkably transposon impoverished (Waterston et al., 2002; Simons et al., 2006). This may reflect selection against transposon insertion, but may also reflect chromatin that is non-permissive for insertion in germ cells and the early embryo, where heritable transposition occurs. The high FCR/low OFF noise of many of these genes detected here may reflect a transcriptional signature of this class of genes. Consistent with this view, low OFF noise genes were nearly six times shorter than high OFF noise genes (Figure 5—figure supplement 1D). Highly restrictive chromatin at these genes may be established early in development to protect them from disruptive transposition (Montavon and Soshnikova, 2014). If so, this tightly closed state is maintained in postmitotic neurons where it may also prevent transcriptional signals associated with inappropriate neural identities. This feature was not uniformly present across all subfamilies of homeobox transcription factors. Interestingly, however, the families with the highest FCR and lowest noise also had the shortest length, while those with higher noise expression (and lower FCR) were longer (Figure 5—figure supplement 4).

The observation that long genes contribute disproportionately to neuronal transcriptional diversity is surprising both because of the increased metabolic cost of expressing them (Castillo-Davis et al., 2002), and since these genes are frequent sites of genome instability associated with genetic lesions leading to autism and other developmental disorders (Wei et al., 2016). These apparent disadvantages may be too weak to lead to selection against long gene expression in mammalian neurons. If this is not the case, however, it raises the question of why the mechanisms used to prevent elongation of shorter, low OFF noise genes were not also applied to neuronal effector genes. This could simply reflect developmental or later functional constraints that exclude the use of these epigenetic protection mechanisms. Alternatively, length itself may confer some advantages that outweigh other disadvantages. This could occur either through benefits provided by the diversification of alternative splicing, or through regulatory features contained within intronic sequences (Zhao et al., 2018).

Materials and methods

Cell types and mouse lines

We assume that cell types are organized hierarchically in a tree-like fashion proceeding from major branches (e.g. 'cortical excitatory neuron’) to more specialized subtypes, with the terminal 'leaf-level’ branches comprising 'atomic’ cell types. Profiled cell populations are defined operationally by the intersection of a transgenic mouse strain (or in some cases anatomical projection target) and a brain region. Mouse lines profiled in this study are summarized in Supplementary file 1. Most were obtained from GENSAT (Gong et al., 2007) or from the Brandeis Enhancer Trap Collection (Shima et al., 2016). For Cre-driver lines, the Ai3, Ai9 or Ai14 reporter (Madisen et al., 2010) was crossed and offspring hemizygous for Cre and the reporter gene were used for profiling. Information on samples profiled is in Supplementary file 2. Populations profiled are designed to sample regions and cell types across the mouse brain within the limits of available resources. In addition several non-brain samples were profiled as out-groups. Replicate numbers (averaging three across all populations) are in Supplementary file 2. Replicates were obtained in single animals, except for a few cases in which pooling across animals was needed due to difficulty in sorting. Our study used a small number of replicates (n = 2–4) per population to maximize the number of populations studied, while still allowing calculation of summary statistics. No explicit power analysis was performed. No attempt was made to remove outliers. Sequenced libraries were not used when total reads were low (<5M reads). Out of 179 neuronal GACPs, there are 165 groups which have more than one replicate. Of these, 14 were recent additions, and most analyses were performed with the remaining 151 groups. All experiments were conducted in accordance with the requirements of the Institutional Animal Care and Use Committees at Janelia Research Campus and Brandeis University.

Tissue data

In addition to cell-type-specific data obtained in this study, we analyzed publicly available RNA-seq data using tissue samples. Information on these samples are described in Supplementary file 3.

Atlas

Animals were anesthetized and perfused with 4% paraformaldehyde and brains were sectioned at 50μm thickness. Every fourth section was mounted on slides and imaged with a slide scanner equipped with a 20x objective lens (3DHISTECH; Budapest, Hungary). In house programs were used to adjust contrast and remove shading caused by uneven lighting. Images were converted to a zoomify-compatible format for web delivery and are available at http://neuroseq.janelia.org.

Cell sorting

Manual cell sorting was performed as described (Hempel et al., 2007; Sugino et al., 2014). Briefly, animals were sacrificed following isoflurane anesthesia, and 300μm slices were digested with pronase E (1 mg/ml, P5147; Sigma-Aldrich) for 1 hr at room temperature, in artificial cerebrospinal fluid (ACSF) containing 6,7-dinitroquinoxaline-2,3-dione (20μM; Sigma-Aldrich), D-(−)−2-amino-5-phosphonovaleric acid (50μM; Sigma-Aldrich), and tetrodotoxin (0.1μM; Alomone Labs). Desired brain regions were micro-dissected and triturated with Pasteur pipettes of decreasing tip size. Dissociated cell suspensions were diluted 5–20 fold with filtered ACSF containing fetal bovine serum (1%; HyClone) and poured over Petri dishes coated with Sylgard (Dow Corning). For dim cells, Petri dishes with glass bottoms were used. Fluorescent cells were aspirated into a micropipette (tip diameter 30–50μm) under a fluorescent stereomicroscope (M165FC; Leica), and were washed three times by transferring to clean dishes. After the final wash, pure samples were aspirated in a small volume (1~3μl) and lysed in 47μl XB lysis buffer (Picopure Kit, KIT0204; ThermoFisher) in a 200μl PCR tube (Axygen), incubated for 30 min at 40°C on a thermal cycler and then stored at −80°C. Detailed information on profiled samples are provided in Supplementary file 2.

RNA-seq

Total RNA was extracted using the Picopure kit (KIT0204; ThermoFisher). Either 1 μl total, or 1μl per 50 sorted cells of 10-5 dilution of ERCC spike-in control (#4456740; Life Technologies) was added to the purified RNA and vacuum concentrated to 5 μl and immediately processed for reverse transcription using the NuGEN Ovation RNA-Seq System V2 (#7102; NuGEN) which yielded 4~8μg of amplified DNA. Amplified DNA was fragmented (Covaris E220) to an average of ~200 bp and ligated to Illumina sequencing adaptors with the Encore Rapid Kit (0314; NuGEN). Libraries were quantified with a KAPA Library Quant Kit (KAPA Biosystems) and sequenced on an Illumina HiSeq 2500 with 4 to 32-fold multiplexing (single end, usually 100 bp read length, see Supplementary file 2).

RNA-seq analysis

Adaptor sequences (AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC for Illumina sequencing and CTTTGTGTTTGA for NuGEN SPIA) were removed from de-multiplexed FASTQ data using cutadapt v1.7.1 (http://dx.doi.org/10.14806/ej.17.1.200) with parameters '–overlap = 7 –minimum-length = 30'. Abundant sequences (ribosomal RNA, mitochondrial, Illumina phiX and low complexity sequences) were detected using bowtie2 (Langmead and Salzberg, 2012) v2.1.0 with default parameters. The remaining reads were mapped to the UCSC mm10 genome using STAR (Dobin et al., 2013) v2.4.0i with parameters '–chimSegmentMin 15 –outFilterMismatchNmax 3'. Mapped reads are quantified with HTSeq (Anders et al., 2015) using Gencode.vM13 (Harrow et al., 2012).

Annotations

For reference annotations we used Gencode.vM13 (Harrow et al., 2012) downloaded from http://www.gencodegenes.org/, and NCBI RefSeq (Pruitt et al., 2014) downloaded from the UCSC genome browser.

Pan-neuronal genes

Pan-neuronal genes satisfied the following conditions: (1) mean neuronal expression level (NE) > 20 FPKM, (2) minimum NE > 5 FPKM, (3) mean NE > maximum nonneuronal expression level (NNE), (4) minimum NE > mean NNE, (5) mean NE > 4x mean NNE, (6) mean NE > mean NNE + 2x standard deviation of NNE, (7) mean NE − 2x standard deviation of NE > mean NNE.

DEF/FCR/DM calculation

To calculate DEF, the following criteria were used to assign a '1’ or '0’ to each element in the differentiation matrix (DM): absolute log fold change >2 and q-value <0.05. Q-values were calculated using the limma package including the voom method (Law et al., 2014). To adjust the power to be similar across cell types, two replicates (the most recent two) were used for all cell populations with more than two replicates. We have tried the same calculations with three replicates (using a fewer number of cell populations) and obtained similar results (data not shown). To avoid possible bias in variances due to transcript length differences (Oshlack and Wakefield, 2009), we quantified counts using reads from within the 3’ 1 kbp of each gene. For genes with transcript lengths shorter than one kbp, we used the whole gene length. We also calculated DEF and FCR across five SC datasets: For (Zeisel et al., 2015; Tasic et al., 2016) and (Tasic et al., 2018), we used log fold change >1 and q-value <0.05 calculated using the limma/voom method for differential gene expression. For (Saunders et al., 2018) and (Zeisel et al., 2018), only cluster average expression was available, and log fold change >1 was defined as the criterion for differential expression.

Overrepresentation, orthogonality and minimal gene sets

Overrepresentation analysis was performed using the top-level HUGO gene groups (Figures 46) and was supplemented (Figure 6, Figure 4—figure supplement 1, Figure 5—figure supplements 1 and 3) using the PANTHER Classification System and the Molecular Function component of the Gene Ontology Annotation (GOM). Orthogonality quantifies the non-redundancy across expression patterns. We calculated orthogonality (Figure 5E) as the mean pairwise decorrelation (1- Pearson’s corr. coef.) over a family of genes. Gene groups with less than 50 members were excluded, since variance of this measure was much larger in small groups of randomly selected genes (dashed green lines in Figure 5E). Minimal gene sets capable of serving as combinatorial codes across cell populations (Figure 5—figure supplement 2) were calculated by a greedy algorithm using the Differentiation Matrix (DM) defined in Figure 3. Specifically, from a set of genes (such as homeobox TFs or other families), the gene with the highest DEF was chosen as the first member of the set. Successive members were chosen, irrespective of their individual DEF, so as to maximize the combined DEF of the set. The combined DEF is the fraction of pairs distinguished by any gene in the group, and is calculated from the combined DM, which is the logical OR of the individual DMs for each gene in the group. This procedure continued until the combined DEF exceeded the desired threshold (0.99 in the case of Figure 5—figure supplement 2). The homeoboxes set was constructed by merging the HUGO Homeoboxes gene group and the PANTHER homeobox protein TFs (PC00119) and had 156 genes. The GPCRs set is a merging of G protein-coupled receptors in HUGO and G-protein coupled receptors (PC00021) in PANTHER and has 347 genes.

Calculation of differential splicing

To identify differential splicing, we utililzed a statistical test based on the Dirichlet-Multinomial distribution and the log-likelihood ratio test, developed in LeafCutter (Li et al., 2018). However, instead of using a group of connected introns as a unit for tests (as done in LeafCutter), we used a group of introns originating from an alternative donor site. Total junctional reads at an alternative donor > 10 was a prerequisite for testing. DM for alternative donors were then calculated as 1 for pairs of cell populations with p<0.05 and maximum delta-PSI > 0.1, and 0 for others. (delta-PSI: absolute difference of PSI, proportion-spliced-in,)

NNLS/random forest decomposition

The following single-cell datasets were downloaded and used for decomposition: (Zeisel et al., 2015) (NCBI GEO GSE60361), (Tasic et al., 2016) (NCBI GEO GSE71585), (Tasic et al., 2018) (http://celltypes.brain-map.org/rnaseq), (Zeisel et al., 2018) (http://mousebrain.org/), (Saunders et al., 2018) (dropviz.org). Deposited count data were converted to log2(CPM+1) and used for comparison. The NeuroSeq dataset was quantified using RefSeq and featurecount (Liao et al., 2014) and converted into log2(CPM+1). Subsets of genes common to NeuroSeq, Tasic 2018 and Zeisel 2018 datasets were used for decomposition. To account for differences in distributions of logCPM values between datasets, they were quantile-normalized to an average profile generated from the decomposed dataset. Since most genes in the single-cell profiles exhibited noisy expression patterns, using the entire gene set for decomposition was not feasible. Therefore, we selected genes deemed most informative for distinguishing cell classes based on the ANOVA F-statistic across cell classes (obtained using limma/voom in R). However, simply taking the top ANOVA genes led to highly biased gene selection since some cell types exhibited much larger transcriptional differences than others (e.g. many ANOVA selected genes were specific to microglia). We therefore selected genes to reduce the redundancy between distinguished cell populations. Beginning with the highest ANOVA gene (highest ANOVA F-value), genes were selected only if their DM (Differentiation Matrix defined in Figure 3) differed from those previously selected, enforced by requiring a Jaccard index threshold of 0.5, across all studies. We chose the top 500 genes meeting this criterion. Decompositions were performed on average profiles created by averaging NeuroSeq replicates or by averaging single-cell profiles using cluster assignments provided by the authors. NNLS was implemented using the R nnls library. For Random forest, the randomForest R package was used.

ATAC-seq

Seven cell types, Purkinje and granule cells from cerebellum, pyramidal cells in layer 5 and 6 from neocortex, in the deep layers of entorhinal cortex, and in CA1 and CA1-3 of hippocampus, labeled in mouse lines P036, P033, P078, 56L, P038, P064, and P036 respectively (all from Shima et al., 2016) were profiled with ATAC-seq. They were isolated by FACS to obtain ∼40,000 labeled neurons. ATAC libraries for Illumina next-generation sequencing were prepared in accordance with a published protocol (Buenrostro et al., 2013). Briefly, collected cells were lysed in buffer containing 0.1% IGEPAL CA-630 (I8896, Sigma-Aldrich) and nuclei pelleted for resuspension in tagmentation DNA buffer with Tn5 (FC-121–1030, Illumina). Nuclei were incubated for 20–30 min at 37°C. Library amplification was monitored by real-time PCR and stopped prior to saturation (typically 8–10 cycles). Library quality was assessed prior to sequencing using BioAnalyzer estimates of fragment size distributions looking for a ladder pattern indicative of fragmentation at nucleosome intervals as well as qPCR to determine relative enrichment at two housekeeping genes compared to background (specifically the TSS of Gapdh and Actb were assessed relative to the average of three intergenic regions). For sequencing, Illumina HiSeq 2500 with 2 to 4-fold multiplexing and paired end 100 bp read length was used. In addition to ATAC-seq, RNA-seq was performed on replicate samples of ∼2000 cells collected in a similar way, and library prepared using the same method described above.

ATAC-seq analysis

Nextera adaptors (CTGTCTCTTATACACATCT) were trimmed from both ends from de-multiplexed FASTQ files using cutadapt with parameters ”-n 3 -q 30,30 m 36’. Reads were then mapped to UCSC mm10 genome using bowtie2 (Langmead and Salzberg, 2012) with parameters ”-X2000 –no-mixed –no-discordant’. PCR duplicates were removed using Picard tools (http://broadinstitute.github.io/picard, v2.8.1) and reads mapping to mitochondrial DNA, scaffolds, and alternate loci were discarded. BigWig genomic coverage files were generated using bedtools (Quinlan and Hall, 2010) and scaled by the total number of reads per million.

Anatomical region abbreviations

Region abbreviations: ACB: Nucleus accumbens AD: Anterodorsal nucleus AI: Agranular insular area AMd: Anteromedial nucleus, dorsal part AOBgr: Accessory olfactory bulb, granular layer AOBmi: Accessory olfactory bulb, mitral layer AP: Area postrema ARH: Arcuate hypothalamic nucleus AV: Anteroventral nucleus of thalamus CA: Hippocampus Ammon's horn CA1: Hippocampus field CA1 CA1sp: Hippocampus field CA1, pyramidal layer CA3: Hippocampus field CA3 CEAm: Central amygdalar nucleus, medial part CEAl: Central amygdalar nucleus, lateral part CL: Central lateral nucleus of the thalamus COAp: Cortical amygdalar area, posterior part CP: Caudoputamen CSm: Superior central nucleus raphe, medial part CUL4,5gr: Cerebellum lobules IV-V, granular layer CUL4,5mo: Cerebellum lobules IV-V, molecular layer CUL4,5pu: Cerebellum lobules IV-V, Purkinje layer DCO: Dorsal cochlear nucleus DG: Hippocampus dentate gyrus DMHp: Dorsomedial nucleus of the hypothalamus, posterior part DMX: Dorsal motor nucleus of the vagus nerve DR: Dorsal nucleus raphe ECT: Ectorhinal area IC: Inferior colliculus IG: Induseum griseum IO: Inferior olivary complex isl: Islands of Calleja islm: Major island of Calleja LC: Locus ceruleus LGd: Dorsal part of the lateral geniculate complex LHA: Lateral hypothalamic area MM, Medial mammillary nucleus MO: Somatomotor area MOBgl: Main olfactory bulb, glomerular layer MOBgr: Main olfactory bulb, granular layer MOBmi: Main olfactory bulb, mitral layer MOE: main olfactory epithelium MOp5: Primary motor area, layer 5 MV: Medial vestibular nucleus NTS: Nucleus of the solitary tract NTSge: Nucleus of the solitary tract, gelatinous part NTSm: Nucleus of the solitary tract, medial part ORBm: Orbital area, medial part OT: Olfactory tubercle PAG: Periaqueductal gray PBl: Parabrachial nucleus, lateral division PCN: Paracentral nucleus PG: Pontine gray PIR: Piriform area PRP: Nucleus prepositus PVH, Paraventricular hypothalamic nucleus PVHd: Paraventricular hypothalamic nucleus, descending division PVHp, Paraventricular hypothalamic nucleus, parvicellular division PVT: Paraventricular nucleus of the thalamus PYRpu: Cerebellum Pyramus (VIII), Purkinje layer RPA: Nucleus raphe pallidus RSPv: Retrosplenial area, ventral part RT, Reticular nucleus of the thalamus SCH: Suprachiasmatic nucleus SCm: Superior colliculus, motor related SFO: Subfornical organ SNc: Substantia nigra, compact part SO: Supraoptic nucleus SSp: Primary somatosensory area SSs: Supplemental somatosensory area SUBd-sp: Subiculum, dorsal part, pyramidal layer VII: Facial motor nucleus VISp: Primary visual area VISp6a: Primary visual area, layer 6a VNO: vemoronasal organ VPM: Ventral posteromedial nucleus of the thalamus VTA: Ventral tegmental area

Appendix 1

Relationship between DEF and Gini-Simpson index or MI

Here we explore in more detail the relationship between DEF (differentially expressed fraction of populations) and Gini-Simpson index (GSI) or MI (mutual information). DEF of a gene is equivalent to the Gini-Simpson index calculated using distinguishable levels of expression of the gene and it is also closely related to mutual information between (discretized) expression levels and cell population labels.

Assume there are Ne distinguishable expression levels of a gene and there are ni cell population groups in level i. Then, the Gini-Simpson index (GSI) is:

(1)GSI=1i=1Nepi2(2)=1i=1Neni(ni1)N(N1)

Where pi is the probability of randomly selected element being in expression level i and N=i=1Neni is the total number of groups. The second equation holds since pi2=ni(ni1)/N(N1) for sampling without replacement.

Since ni(ni1)/N(N1)=(ni(ni1)/2)/(N(N1)/2), this term is the fraction of pairs in level i. So the sum of these are the total fraction of indistinguishable pairs and one minus this sum equals the fraction of distinguishable pairs, which is DEF. Thus, DEF is equivalent to the Gini-Simpson index calculated using distinguishable levels of expression.

To calculate mutual information between expression levels and cell populations, we discretize expression levels into Ne levels. Let Ns be the number of samples. Let nij be counts in the contingency table where i=1,...,Ne and j=1,...,Ns. Then the joint probability distribution and the marginal probability distribution can be written as:

(3) p(i,j)=nijNs
(4) p(i)=jnijNs=niNs
(5) p(j)=inijNs=njNs

Where ni=jnij and nj=inij is the number of samples in level i and nj is the number of replicates in cell type j. The mutual information between expression level (E) and samples (S) is:

(6)I(E;S)=i,jp(i,j)logp(i,j)p(i)p(j)(7)=i,jp(i,j)logp(i,j)p(j)i,jp(i,j)logp(i)(8)=i,jp(j)p(i|j)logp(i|j)i,jp(i,j)logp(i)(9)=jp(j)ip(i|j)logp(i|j)ilogp(i)jp(i,j)(10)=jp(j)H(E|S=j)ip(i)logp(i)(11)=H(E|S)+H(E)

H(E|S=j) is the entropy of expression levels in cell population j, which represents the expression noise in cell population j, and H(E|S) is the average of these across all cell populations. When there are no replicates, H(E|S) is zero. When there are replicates, H(E|S=j) represents how noisy the expression is. This may depend on expression level, and H(E|S), the average of H(E|S=j) may depend on expression prevalence (i.e., how widely the gene is expressed), but in any case, the first term H(E|S) represents reduction of the mutual information by noise.

The second term H(E) is the entropy of the marginal distribution p(i) and represents the main information content about cell groups encoded in expression levels. This can be rewritten using counts in the contingency table as:

(12)H(E)=ip(i)logp(i)(13)=iniNslogniNs(14)=iniNslogni+iniNslogNs(15)=1Nsinilogni+logNs

Thus, it is maximized when all ni’s are 0 or 1, which corresponds to the case in which one expression level corresponds to one cell population, making all cell populations distinguishable by the expression levels. This is true when the number of discretization levels exceeds the number of samples. When the number of discretization levels (Ne) is less than the number of samples (Ns), H(E) takes the maximum value of logNe when all the samples are distributed equally across each bin.

To explore the relationship between H(E) and DEF, the logni in the first term is replaced (approximated) by (ni-1) (first two terms in the Taylor expansion of logni around ni=1.):

(16)H(E)1Nsini(ni1)+logNs(17)=2Nsini(ni1)/2+logNs(18)=2Ns{Ns(Ns1)/2ini(ni1)/2}(Ns1)+logNs(19)=(Ns1)DEF(Ns1)+logNs

Since ni is the number of samples in one expression level, ni(ni-1)/2 is the number of indistinguishable pairs in that expression level when there are no replicates. The term within the curly bracket is then the number of distinguishable pairs, leading to Equation (19).

More formally, since both h(p)=nilogni and d(p)=ni(ni-1)=ni2-Ns are Schur-convex functions* on partitions of Ns, p=(n1,n2,...,nk), when partition p1 majorizes p2 then, h(p1)h(p2) and d(p1)d(p2). When the partition length is 2, that is, when expression levels are discretized into only two levels, corresponding to ON and OFF, then, all of the partitions can be ordered with respect to majorization, therefore, h(p) and d(p) are order-preserved transformations of each other (Figure 3—figure supplement 1C left). When the partition length is greater than 2, this relationship is not satisfied. However, they are still highly correlated to each other (Figure 3—figure supplement 1C right).

When DEF is calculated from global discretization (as in the above case), the maximum number of pairs distinguishable occurs when all samples are equally distributed across bins and the number of distinguishable pairs is (NsNe)2Ne(Ne1)/2. Therefore,

(20)max(DEF)=(NsNe)2Ne(Ne1)/2Ns(Ns1)/2(21)=(11Ne)/(11Ns)(22)11Ne(whenNs1)

As stated above, this is also when the entropy H(E) takes the maximum value of log2Ne in the unit of bits. (Figure 3—figure supplement 1C)

* A Schur-convex function is a function f:RkR which satisfies f(x)f(y) for all x,y where x majorizes y. For x=(x1,x2,...,xk)Rkwhere(x1x2...xk) and y=(y1,y2,...,yk)Rkwhere(y1y2...yk)x majorizes y when i=1kxi=i=1kyi and i=1jxii=1jyiforallj=1,...,k. When x majorizes y, it follows xiyi for all i, so it is easy to see h(x)h(y) and d(x)d(y).

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
    Genenames.org: the HGNC and VGNC resources in 2019
    1. B Braschi
    2. P Denny
    3. K Gray
    4. T Jones
    5. R Seal
    6. S Tweedie
    7. B Yates
    8. E Bruford
    (2018)
    Nucleic Acids Research, 10.1093/nar/gky930.
  8. 8
  9. 9
  10. 10
  11. 11
    Current Topics in Developmental Biology
    1. JS Dasen
    2. TM Jessell
    (2009)
    169–200, Chapter Six Hox Networks and the Origins of Motor Neuron Diversity, Current Topics in Developmental Biology, Elsevier, 10.1016/s0070-2153(09)88006-x.
  12. 12
  13. 13
  14. 14
    The pfam protein families database: towards a more sustainable future
    1. RD Finn
    2. P Coggill
    3. RY Eberhardt
    4. SR Eddy
    5. J Mistry
    6. AL Mitchell
    7. SC Potter
    8. M Punta
    9. M Qureshi
    10. A Sangrador-Vegas
    11. GA Salazar
    12. J Tate
    13. A Bateman
    (2015)
    Nucleic Acids Research, 44, 10.1093/nar/gkv1344.
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
    Annotation-free quantification of RNA splicing using LeafCutter
    1. YI Li
    2. DA Knowles
    3. J Humphrey
    4. AN Barbeira
    5. SP Dickinson
    6. HK Im
    7. JK Pritchard
    (2018)
    Nature Genetics, 50, 10.1101/044107.
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
    La fine structure des centres nerveux. the croonian lecture
    1. S Ramon y Cajal
    (1894)
    Proceedings of the Royal Society of London. Series B, Biological Sciences 55:444–448.
    https://doi.org/10.1098/rspl.1894.0063
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
    Molecular taxonomy of major neuronal classes in the adult mouse forebrain
    1. K Sugino
    2. CM Hempel
    3. MN Miller
    4. AM Hattox
    5. P Shapiro
    6. C Wu
    7. ZJ Huang
    8. SB Nelson
    (2006)
    Nature Neuroscience, 9, 10.1038/nn1618, 16369481.
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
    Initial sequencing and comparative analysis of the mouse genome
    1. RH Waterston
    2. K Lindblad-Toh
    3. E Birney
    4. J Rogers
    5. JF Abril
    6. P Agarwal
    7. R Agarwala
    8. R Ainscough
    9. M Alexandersson
    10. P An
    11. SE Antonarakis
    12. J Attwood
    13. R Baertsch
    14. J Bailey
    15. K Barlow
    16. S Beck
    17. E Berry
    18. B Birren
    19. T Bloom
    20. P Bork
    21. M Botcherby
    22. N Bray
    23. MR Brent
    24. DG Brown
    25. SD Brown
    26. C Bult
    27. J Burton
    28. J Butler
    29. RD Campbell
    30. P Carninci
    31. S Cawley
    32. F Chiaromonte
    33. AT Chinwalla
    34. DM Church
    35. M Clamp
    36. C Clee
    37. FS Collins
    38. LL Cook
    39. RR Copley
    40. A Coulson
    41. O Couronne
    42. J Cuff
    43. V Curwen
    44. T Cutts
    45. M Daly
    46. R David
    47. J Davies
    48. KD Delehaunty
    49. J Deri
    50. ET Dermitzakis
    51. C Dewey
    52. NJ Dickens
    53. M Diekhans
    54. S Dodge
    55. I Dubchak
    56. DM Dunn
    57. SR Eddy
    58. L Elnitski
    59. RD Emes
    60. P Eswara
    61. E Eyras
    62. A Felsenfeld
    63. GA Fewell
    64. P Flicek
    65. K Foley
    66. WN Frankel
    67. LA Fulton
    68. RS Fulton
    69. TS Furey
    70. D Gage
    71. RA Gibbs
    72. G Glusman
    73. S Gnerre
    74. N Goldman
    75. L Goodstadt
    76. D Grafham
    77. TA Graves
    78. ED Green
    79. S Gregory
    80. R Guigó
    81. M Guyer
    82. RC Hardison
    83. D Haussler
    84. Y Hayashizaki
    85. LW Hillier
    86. A Hinrichs
    87. W Hlavina
    88. T Holzer
    89. F Hsu
    90. A Hua
    91. T Hubbard
    92. A Hunt
    93. I Jackson
    94. DB Jaffe
    95. LS Johnson
    96. M Jones
    97. TA Jones
    98. A Joy
    99. M Kamal
    100. EK Karlsson
    101. D Karolchik
    102. A Kasprzyk
    103. J Kawai
    104. E Keibler
    105. C Kells
    106. WJ Kent
    107. A Kirby
    108. DL Kolbe
    109. I Korf
    110. RS Kucherlapati
    111. EJ Kulbokas
    112. D Kulp
    113. T Landers
    114. JP Leger
    115. S Leonard
    116. I Letunic
    117. R Levine
    118. J Li
    119. M Li
    120. C Lloyd
    121. S Lucas
    122. B Ma
    123. DR Maglott
    124. ER Mardis
    125. L Matthews
    126. E Mauceli
    127. JH Mayer
    128. M McCarthy
    129. WR McCombie
    130. S McLaren
    131. K McLay
    132. JD McPherson
    133. J Meldrim
    134. B Meredith
    135. JP Mesirov
    136. W Miller
    137. TL Miner
    138. E Mongin
    139. KT Montgomery
    140. M Morgan
    141. R Mott
    142. JC Mullikin
    143. DM Muzny
    144. WE Nash
    145. JO Nelson
    146. MN Nhan
    147. R Nicol
    148. Z Ning
    149. C Nusbaum
    150. MJ O'Connor
    151. Y Okazaki
    152. K Oliver
    153. E Overton-Larty
    154. L Pachter
    155. G Parra
    156. KH Pepin
    157. J Peterson
    158. P Pevzner
    159. R Plumb
    160. CS Pohl
    161. A Poliakov
    162. TC Ponce
    163. CP Ponting
    164. S Potter
    165. M Quail
    166. A Reymond
    167. BA Roe
    168. KM Roskin
    169. EM Rubin
    170. AG Rust
    171. R Santos
    172. V Sapojnikov
    173. B Schultz
    174. J Schultz
    175. MS Schwartz
    176. S Schwartz
    177. C Scott
    178. S Seaman
    179. S Searle
    180. T Sharpe
    181. A Sheridan
    182. R Shownkeen
    183. S Sims
    184. JB Singer
    185. G Slater
    186. A Smit
    187. DR Smith
    188. B Spencer
    189. A Stabenau
    190. N Stange-Thomann
    191. C Sugnet
    192. M Suyama
    193. G Tesler
    194. J Thompson
    195. D Torrents
    196. E Trevaskis
    197. J Tromp
    198. C Ucla
    199. A Ureta-Vidal
    200. JP Vinson
    201. AC Von Niederhausern
    202. CM Wade
    203. M Wall
    204. RJ Weber
    205. RB Weiss
    206. MC Wendl
    207. AP West
    208. K Wetterstrand
    209. R Wheeler
    210. S Whelan
    211. J Wierzbowski
    212. D Willey
    213. S Williams
    214. RK Wilson
    215. E Winter
    216. KC Worley
    217. D Wyman
    218. S Yang
    219. SP Yang
    220. EM Zdobnov
    221. MC Zody
    222. ES Lander
    223. AT Chinwalla
    224. LL Cook
    225. KD Delehaunty
    226. GA Fewell
    227. LA Fulton
    228. RS Fulton
    229. TA Graves
    230. LW Hillier
    231. ER Mardis
    232. JD McPherson
    233. TL Miner
    234. WE Nash
    235. JO Nelson
    236. MN Nhan
    237. KH Pepin
    238. CS Pohl
    239. TC Ponce
    240. B Schultz
    241. J Thompson
    242. E Trevaskis
    243. MC Wendl
    244. RK Wilson
    245. S Yang
    246. M Daly
    247. R David
    248. J Deri
    249. S Dodge
    250. K Foley
    251. D Gage
    252. S Gnerre
    253. T Holzer
    254. DB Jaffe
    255. M Kamal
    256. EK Karlsson
    257. C Kells
    258. A Kirby
    259. EJ Kulbokas
    260. ES Lander
    261. T Landers
    262. JP Leger
    263. R Levine
    264. E Mauceli
    265. JH Mayer
    266. M McCarthy
    267. J Meldrim
    268. J Meldrim
    269. JP Mesirov
    270. R Nicol
    271. C Nusbaum
    272. S Seaman
    273. T Sharpe
    274. A Sheridan
    275. JB Singer
    276. R Santos
    277. B Spencer
    278. N Stange-Thomann
    279. JP Vinson
    280. CM Wade
    281. J Wierzbowski
    282. D Wyman
    283. MC Zody
    284. A Kasprzyk
    285. E Mongin
    286. AG Rust
    287. G Slater
    288. A Stabenau
    289. A Ureta-Vidal
    290. S Whelan
    291. S Leonard
    292. C Lloyd
    293. L Matthews
    294. S McLaren
    295. K McLay
    296. B Meredith
    297. JC Mullikin
    298. Z Ning
    299. K Oliver
    300. E Overton-Larty
    301. R Plumb
    302. S Potter
    303. M Quail
    304. C Scott
    305. S Searle
    306. R Shownkeen
    307. S Sims
    308. M Wall
    309. AP West
    310. D Willey
    311. S Williams
    312. R Agarwala
    313. V Sapojnikov
    314. C Ucla
    315. RJ Weber
    316. EM Zdobnov
    317. SD Brown
    318. S Batalov
    319. S Schwartz
    320. JF Abril
    321. P Agarwal
    322. M Alexandersson
    323. SE Antonarakis
    324. R Baertsch
    325. E Berry
    326. E Birney
    327. P Bork
    328. N Bray
    329. MR Brent
    330. DG Brown
    331. J Butler
    332. C Bult
    333. F Chiaromonte
    334. DM Church
    335. M Clamp
    336. FS Collins
    337. RR Copley
    338. O Couronne
    339. S Cawley
    340. J Cuff
    341. V Curwen
    342. T Cutts
    343. ET Dermitzakis
    344. C Dewey
    345. NJ Dickens
    346. M Diekhans
    347. I Dubchak
    348. SR Eddy
    349. L Elnitski
    350. RD Emes
    351. P Eswara
    352. E Eyras
    353. A Felsenfeld
    354. P Flicek
    355. WN Frankel
    356. TS Furey
    357. G Glusman
    358. N Goldman
    359. L Goodstadt
    360. ED Green
    361. S Gregory
    362. R Guigó
    363. RC Hardison
    364. D Haussler
    365. A Hinrichs
    366. W Hlavina
    367. F Hsu
    368. T Hubbard
    369. D Karolchik
    370. E Keibler
    371. WJ Kent
    372. DL Kolbe
    373. I Korf
    374. D Kulp
    375. I Letunic
    376. M Li
    377. K Lindblad-Toh
    378. B Ma
    379. DR Maglott
    380. W Miller
    381. R Mott
    382. L Pachter
    383. G Parra
    384. P Pevzner
    385. A Poliakov
    386. CP Ponting
    387. A Reymond
    388. KM Roskin
    389. S Schwartz
    390. A Smit
    391. C Sugnet
    392. M Suyama
    393. G Tesler
    394. D Torrents
    395. J Tromp
    396. R Wheeler
    397. E Winter
    398. RH Waterston
    399. KC Worley
    400. Mouse Genome Sequencing Consortium
    (2002)
    Nature 420:520–562.
    https://doi.org/10.1038/nature01262
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73

Decision letter

  1. Catherine Dulac
    Senior Editor; Harvard University, United States
  2. Chris P Ponting
    Reviewing Editor; University of Edinburgh, United Kingdom
  3. Nenad Sestan
    Reviewer

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "The Transcriptional Logic of Mammalian Neuronal Diversity" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: Nenad Sestan (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The reviewers described the broad survey of gene expression in nearly 200 mouse neuronal subpopulations as being a powerful and useful resource for the neuroscience community. A major strength was the use of transgenic labelled neuronal subpopulations with some anatomical information. Nevertheless, this Editor and some of the reviewers were not convinced by the analysis of "long genes" that gene length is an appropriate explanatory metric of regulatory complexity, particularly given that the density of regulatory elements across mammalian genomes is known to be relatively uniform. We consider the authors' hypothesis not to have been proved. Consequently, either this analysis needs to be removed or replaced or further substantiated. Additionally, we request that the comparison with single cell data is updated, and its presentation revised. Finally, it will be important to report findings using conventional statistical metrics rather than those currently used, unless the authors can justify that these are superior.

The following are the revisions that are required:

1) Currently only gene bodies, and not intergenic sequences, are considered in the analysis. Unless this subjective choice is further justified the authors will need to consider (e.g. for the ATAC-Seq analysis) all intergenic and intragenic regulatory elements. This could be done by considering a gene territory in its vicinity and ATAC-Seq peaks that connect to the gene via HiC peaks, for example. If a regulatory complexity metric is best explained by a gene's full (rather than intragenic) regulatory landscape then the associated speculation in the manuscript needs to be removed. Additionally, reviewers were not convinced: that the frequency of insertion mutations is uniform in the genome, as assumed; that sequence-similar, more cell/tissue-specific, paralogs could modulate this frequency; and, that there are specific population genetic studies that could test the authors' hypothesis.

2) The authors' attempt to reinvent the wheel, statistically, was considered unnecessary. Reporting results using conventional statistics should be sufficient. Attempts to develop novel test statistics should be removed unless with compelling justification. This is easily done since one of the statistics is close to "fraction of comparisons DE" and the other is close to a fold change. The NNLS method is itself not validated and also not essential for the analysis.

3) An important analysis is the comparison with single cell RNA-seq datasets (Figure 2). The problem with the current analysis is that the Ziesel, 2015 and Tasic, 2016 studies are already somewhat out of date, because they are pilot studies to the most recent Tasic et al., 2018 paper in bioRxiv which is already accepted. We understand that the dataset should be available soon, and thus that the timeline should be compatible with this revision. If the data set unexpectedly is not available please do let us know. Another high resolution dataset is from Paul et al., 2017, whose data should also be compared.

4) The use of "cell type" is misleading, even with "operational cell types" defined in the Materials and methods section. Even with known anatomic locations, it is quite likely that the labelled population comprises multiple "atomic types". "Subpopulation" is a more appropriate description and is no less significant. In this discussion, the authors must discuss the extent to which their analysis might be confounded by interregional or interindividual variation.

5) Figures. These could be further streamlined and shuffled to help the reader more. Figure 2 jumps straight into what may seem like an esoteric debate to those not currently diving into or weary of single cell RNA sequencing. A schematic of single cell vs. pooled cell analysis that illustrates shallow and deep RNA capture plus questions of cell purity might help introduce these heatmaps. Also, the Figure 2B panels could be placed in the supplement, and in Figure 2A, it might help to highlight examples of possible low purity cell types to highlight the overall very high purity of the data. Figure 3 – DI and SC will not be intuitive to all readers, even after the schematic in 2A. As a bridge, it might help to label individual genes with high SC and high DI from the scatterplot in B and show the expression of these genes across cell types. Figure 4A dives into this a level deeper by focusing on OFF noise, but Lhx1 and Calb2 could be labeled as examples in 3. Figure 5B: label how many long and short genes were considered.

6) The authors should supplement the use of PANTHER gene families to avoid, to the extent possible, biases in the specific datasets included in this database; it is possible that some gene families, including synaptic and signaling genes or homeobox transcription factors, are over-represented in this dataset. In Figure 4D, the PANTHER gene category "receptor" seems overly broad. What types of receptors? Certainly not all receptor genes are long – olfactory receptors are very short. Also, signaling proteins are on this list but are not discussed in the text along with ion channels and cell adhesion molecules. Why this selective avoidance?

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for sending your article entitled "Mapping the transcriptional diversity of genetically and anatomically defined cell populations in the mouse brain" for peer review at eLife. Your article is being overseen by Chris Ponting, as Reviewing Editor, and Catherine Dulac as the Senior Editor.

Given the issue regarding "long genes" that we describe below, the editors and reviewers invite you to respond within the next two weeks with an action plan and timetable for the completion of additional work. We plan to discuss your response and then issue a binding recommendation.

You have placed considerable emphasis in your two versions of this manuscript on correlations with "long genes". With respect to these aspects (in many places, e.g. subsection “Long genes shape neuronal diversity”, fourth paragraph) it is our view that this argument cannot be retained unless you can satisfactorily refute the results of Raman et al., 2018. These indicate that the length dependencies that you, and others, have seen are likely a PCR artefact.

"Long genes shape neuronal diversity". This returns to our previous issue whether "long gene" expression is correlated with neuronal diversity or whether it causes neuronal diversity. The word "shape" implies causality, yet without evidence: their expression could be a consequence of diverse neuronal cell populations. You also have not shown whether these mRNA expression trends persist with proteins and this will need to be stated explicitly. "rich in transposons and other retroelements". This implies that these introns contain active elements whereas they contain the inactive debris of retrotransposons. The last paragraph of the Discussion is not warranted and should be excluded because it implies (without evidence) that long genes arise because of "exaptations".

https://doi.org/10.7554/eLife.38619.047

Author response

The reviewers described the broad survey of gene expression in nearly 200 mouse neuronal subpopulations as being a powerful and useful resource for the neuroscience community. A major strength was the use of transgenic labelled neuronal subpopulations with some anatomical information. Nevertheless, this Editor and some of the reviewers were not convinced by the analysis of "long genes" that gene length is an appropriate explanatory metric of regulatory complexity, particularly given that the density of regulatory elements across mammalian genomes is known to be relatively uniform. We consider the authors' hypothesis not to have been proved. Consequently, either this analysis needs to be removed or replaced or further substantiated. Additionally, we request that the comparison with single cell data is updated, and its presentation revised. Finally, it will be important to report findings using conventional statistical metrics rather than those currently used, unless the authors can justify that these are superior.

In keeping with your suggestions, we changed the title and rewrote much of the manuscript to remove any attempt to provide evidence that regulatory complexity was responsible for the observed long gene bias. As requested we updated the single cell analyses and recast our statistical analyses in terms of the traditional metrics of fold-change and differential expression. In addition, we added a new analysis of alternative splicing. We have made very extensive changes in the manuscript and have attempted to respond to each point raised in the reviews as outlined below.

The following are the revisions that are required:

1) Currently only gene bodies, and not intergenic sequences, are considered in the analysis. Unless this subjective choice is further justified the authors will need to consider (e.g. for the ATAC-Seq analysis) all intergenic and intragenic regulatory elements. This could be done by considering a gene territory in its vicinity and ATAC-Seq peaks that connect to the gene via HiC peaks, for example. If a regulatory complexity metric is best explained by a gene's full (rather than intragenic) regulatory landscape then the associated speculation in the manuscript needs to be removed. Additionally, reviewers were not convinced: that the frequency of insertion mutations is uniform in the genome, as assumed; that sequence-similar, more cell/tissue-specific, paralogs could modulate this frequency; and, that there are specific population genetic studies that could test the authors' hypothesis.

As requested, we removed the data on intronic peaks and the analyses of the differential presence of these peaks across cell types as a metric of regulatory complexity. We removed mention of any population genetic test of hypotheses about increased regulatory complexity.

2) The authors' attempt to reinvent the wheel, statistically, was considered unnecessary. Reporting results using conventional statistics should be sufficient. Attempts to develop novel test statistics should be removed unless with compelling justification. This is easily done since one of the statistics is close to "fraction of comparisons DE" and the other is close to a fold change.

As suggested, we recast our metrics to make it clearer that they are simply versions of Differential Expression and fold-change for simultaneous comparisons across many populations. The index previously called Differentiation Index (DI) is now called Differentially Expressed Fraction (DEF) and the previous Signal Contrast (SC) is now Fold-Change Ratio (FCR). We hope that the text, arrangement of the figures and the new terms now make clearer how these are simple extensions of the standard statistics to the situation of making succinct comparisons across many populations.

The NNLS method is itself not validated and also not essential for the analysis.

NNLS is a standard method of matrix decomposition. Here we provide extensive validation of its use for decomposition of mixed expression profiles. These include:

- Cross validation (testing the ability of half of each dataset to decompose the other half): Figure 2—figure supplement 1;

- A test of the ability to retrieve merged profiles: Figure 2—figure supplement 2;

- Comparison to another standard algorithm for decomposition, a random forest classifier: Figure 2—figure supplement 6.

This is the core method we used for comparison with single cell datasets, which is stated to be “an important analysis” hence it is essential for our analysis. These analyses have (as requested below) been fully updated to include more recent datasets. The conclusion, supported by these analyses is that there is a similar level of heterogeneity apparent from the decomposition of NeuroSeq sorted data by single cell profiles as is found in the decomposition of one set of single cell profiles by another. This is a key result of the paper.

3) An important analysis is the comparison with single cell RNA-seq datasets (Figure 2). The problem with the current analysis is that the Ziesel, 2015 and Tasic, 2016 studies are already somewhat out of date, because they are pilot studies to the most recent Tasic et al., 2018 paper in bioRxiv which is already accepted. We understand that the dataset should be available soon, and thus that the timeline should be compatible with this revision. If the data set unexpectedly is not available please do let us know. Another high resolution dataset is from Paul et al., 2017, whose data should also be compared.

We updated the analyses to include Tasic et al., 2018, Zeisel et al., 2018 and Paul et al., 2017. These are included in Figure 2 and in Figure 2—figure supplements 1, 3-7. In addition, we added comparisons to these datasets, as well as to Saunders et al., 2018 to Figure 5—figure supplement 3 (Off noise) and Figure 7—figure supplements 1 and 2 (length bias).

4) The use of "cell type" is misleading, even with "operational cell types" defined in the Materials and methods section. Even with known anatomic locations, it is quite likely that the labelled population comprises multiple "atomic types". "Subpopulation" is a more appropriate description and is no less significant. In this discussion, the authors must discuss the extent to which their analysis might be confounded by interregional or interindividual variation.

As requested, we have refrained from referring to the samples profiled as originating from cell types and have instead referred to them as Genetically- and Anatomically-identified Cell Populations (GACPs) or simply as “populations”. While we find this terminology cumbersome and not representative of the common use of the term “cell type” in the literature, we defer to the reviewers on this point.

5) Figures. These could be further streamlined and shuffled to help the reader more. Figure 2 jumps straight into what may seem like an esoteric debate to those not currently diving into or weary of single cell RNA sequencing. A schematic of single cell vs. pooled cell analysis that illustrates shallow and deep RNA capture plus questions of cell purity might help introduce these heatmaps. Also, the Figure 2B panels could be placed in the supplement, and in Figure 2A, it might help to highlight examples of possible low purity cell types to highlight the overall very high purity of the data.

We thank the reviewer for the suggestion to include a schematic in Figure 2. We think this helps us make the important point that apparent heterogeneity can arise both from pooling, as in our study, and from differences in the way clustering is applied in single cell studies. We think that this idea will be useful to readers of this paper and more generally to the literature on cell type-specific gene expression. As requested, we moved the Figure 2B panels to the supplement (now Figure 2—figure supplements 3, 4, and 5, which includes the Paul et al., 2017 data). Finally, we highlight examples of possible low purity cell types in the text of the Results section in the context of describing the cross validation (Figure 2—figure supplement 1).

Figure 3 – DI and SC will not be intuitive to all readers, even after the schematic in 2A. As a bridge, it might help to label individual genes with high SC and high DI from the scatterplot in B and show the expression of these genes across cell types.

To increase the clarity of this section, we renamed DI and SC to indicate their direct relationship to standard measures Differential Expression and Fold Change, and separated the explanation of these relationships (Figure 3) from the application to the data (Figure 4). As requested, we labeled specific example genes in Figure 4A and show their patterns of expression across populations in Figures 4B and C.

Figure 4A dives into this a level deeper by focusing on OFF noise, but Lhx1 and Calb2 could be labeled as examples in 3. Figure 5B: label how many long and short genes were considered.

We found other clearer examples to label in what was previously Figure 3 (now 4, see immediately above) but maintain the previously used examples in the now Figure 5. The numbers of long and short genes in the now Figure 6 are listed in the figure legend.

6) The authors should supplement the use of PANTHER gene families to avoid, to the extent possible, biases in the specific datasets included in this database; it is possible that some gene families, including synaptic and signaling genes or homeobox transcription factors, are over-represented in this dataset. In Figure 4D, the PANTHER gene category "receptor" seems overly broad. What types of receptors? Certainly not all receptor genes are long – olfactory receptors are very short. Also, signaling proteins are on this list but are not discussed in the text along with ion channels and cell adhesion molecules. Why this selective avoidance?

To avoid the possibility of implicit overrepresentation in the categories used for over-representation analyses, we used three different sets of categories. HUGO, PANTHER and Gene Ontology (specifically, GOM, the Gene Ontology Molecular Function categories). We replaced the use of PANTHER in the main figures with HUGO gene groups in most cases, and supplemented the analysis with PANTHER and GOM in the supplementary figures (Figure 4—figure supplement 1, Figure 5—figure supplement 1, Figure 5—figure supplement 3). In each case, the categories listed are those that exceeded the stated significance test, regardless of how broad or narrow these categories are. The use of HUGO to some degree mitigates some of the very broad categories found with PANTHER and Gene Ontology, because of the deeper hierarchies present in these systems of annotation. Nevertheless, some HUGO gene groups are also quite broad.

We note, for example, the Paul et al., 2017 paper uses HUGO but that Figure 2C of this paper contains categories that are equally broad: specifically: “Receptors” and “Signaling.” With respect to our own analyses, Figure 4E lists the categories GABA receptors, GPCRs, Glutamate receptors, GABA A receptors, Amine receptors etc. We now explicitly mention signaling genes in the Results section concerning Figure 4, in the legend of Figure 5 and in the Discussion, where we also cite the related finding by Paul et al., 2017.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

You have placed considerable emphasis in your two versions of this manuscript on correlations with "long genes". With respect to these aspects (in many places, e.g. subsection “Long genes shape neuronal diversity”, fourth paragraph) it is our view that this argument cannot be retained unless you can satisfactorily refute the results of Raman et al., 2018. These indicate that the length dependencies that you, and others, have seen are likely a PCR artefact.

We have now performed the Raman et al., 2018, test and include this as Figure 7—figure supplement 2. As can be seen, our results are highly significant even using this flawed test. We also refer to this result in the main text (subsection “Long genes contribute disproportionately to neuronal diversity”, second paragraph) and in the Discussion (subsection “Long genes and neuronal diversity”, first paragraph). As noted in our previous reply, we are preparing a more detailed rebuttal of the Raman et al. paper and so only include here our analysis of the fact that this cannot account for our observations of length biases.

"Long genes shape neuronal diversity". This returns to our previous issue whether "long gene" expression is correlated with neuronal diversity or whether it causes neuronal diversity. The word "shape" implies causality, yet without evidence: their expression could be a consequence of diverse neuronal cell populations.

We removed the word shape, changing the section title to “Long genes and neuronal diversity.” We also added the word “transcriptional” to clarify the following sentence: “Our study suggests that long neuronal effector genes contribute disproportionately to neuronal transcriptional diversity”.

You also have not shown whether these mRNA expression trends persist with proteins and this will need to be stated explicitly.

We added the following sentence: “However, we and others did not measure cell type-specific protein expression, and so cannot be sure that the long gene bias extends to the level of neuronal proteins.”

"rich in transposons and other retroelements". This implies that these introns contain active elements whereas they contain the inactive debris of retrotransposons.

We changed the sentence to: “These genes are long because of long introns that have accumulated sequences derived from transposons and other retroelements (Grishkevich et al., 2014).”

The last paragraph of the Discussion is not warranted and should be excluded because it implies (without evidence) that long genes arise because of "exaptations".

We have rewritten the final paragraph of the Discussion to present a more balanced account of this as only one of three possibilities.

https://doi.org/10.7554/eLife.38619.048

Article and author information

Author details

  1. Ken Sugino

    Janelia Research Campus, Ashburn, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    For correspondence
    ken.sugino@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5795-0635
  2. Erin Clark

    Brandeis University, Waltham, United States
    Contribution
    Resources, Investigation, Writing—review and editing
    Contributed equally with
    Anton Schulmann
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4013-325X
  3. Anton Schulmann

    Janelia Research Campus, Ashburn, United States
    Contribution
    Software, Formal analysis, Visualization, Writing—review and editing
    Contributed equally with
    Erin Clark
    Competing interests
    No competing interests declared
  4. Yasuyuki Shima

    Brandeis University, Waltham, United States
    Contribution
    Resources, Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
  5. Lihua Wang

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
  6. David L Hunt

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
  7. Bryan M Hooks

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0135-4284
  8. Dimitri Tränkner

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
  9. Jayaram Chandrashekar

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6412-0114
  10. Serge Picard

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
  11. Andrew L Lemire

    Janelia Research Campus, Ashburn, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
  12. Nelson Spruston

    Janelia Research Campus, Ashburn, United States
    Contribution
    Supervision, Funding acquisition, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3118-1636
  13. Adam W Hantman

    Janelia Research Campus, Ashburn, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Project administration, Writing—review and editing
    For correspondence
    hantmana@janelia.hhmi.org
    Competing interests
    No competing interests declared
  14. Sacha B Nelson

    Brandeis University, Waltham, United States
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Supervision, Funding acquisition, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    nelson@brandeis.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0108-8599

Funding

Howard Hughes Medical Institute

  • Ken Sugino
  • Anton Schulmann
  • Lihua Wang
  • David L Hunt
  • Bryan M Hooks
  • Dimitri Tränkner
  • Jayaram Chandrashekar
  • Andrew L Lemire
  • Nelson Spruston
  • Adam W Hantman
  • Sacha B Nelson

National Eye Institute (EY022360)

  • Erin Clark

National Institute of Mental Health (MH105949)

  • Erin Clark
  • Yasuyuki Shima
  • Sacha B Nelson

National Institute of Neurological Disorders and Stroke (NS075007)

  • Erin Clark
  • Yasuyuki Shima
  • Sacha B Nelson

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Jody Clements and Charlotte Weaver for help in preparing the web site, Erina Hara, Asish Gulati, Xiaotang Jing and Zhe Meng for technical help, Keven McGowan for assistance in sequencing, Jim Cox, Amanda Zeladonis and Amanda Wardlaw for help in animal maintenance, Gabe Murphy for help in retinal sample collection, and Rosa Miyares for comments on the manuscript.

Ethics

Animal experimentation: All experiments were conducted in accordance with the requirements of the Institutional Animal Care and Use Committees at Janelia Research Campus (protocol# not available) and Brandeis University (protocol#17001).

Senior Editor

  1. Catherine Dulac, Harvard University, United States

Reviewing Editor

  1. Chris P Ponting, University of Edinburgh, United Kingdom

Reviewer

  1. Nenad Sestan

Publication history

  1. Received: July 20, 2018
  2. Accepted: April 11, 2019
  3. Accepted Manuscript published: April 12, 2019 (version 1)
  4. Version of Record published: May 3, 2019 (version 2)

Copyright

© 2019, Sugino et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 3,531
    Page views
  • 677
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)