Although gene expression divergence has long been postulated to be the primary driver of human evolution, identifying the genes and genetic variants underlying uniquely human traits has proven to be quite challenging. Theory suggests that cell type-specific cis-regulatory variants may fuel evolutionary adaptation due to the specificity of their effects. These variants can precisely tune the expression of a single gene in a single cell type, avoiding the potentially deleterious consequences of trans-acting changes and non-cell type-specific changes that can impact many genes and cell types, respectively. It has recently become possible to quantify human-specific cis-acting regulatory divergence by measuring allele-specific expression in human-chimpanzee hybrid cells—the product of fusing induced pluripotent stem (iPS) cells of each species in vitro. However, these cis-regulatory changes have only been explored in a limited number of tissues and cell types. Here, we quantify human-chimpanzee cis-regulatory divergence in gene expression and chromatin accessibility across six cell types, enabling the identification of highly cell type-specific cis-regulatory changes. We find that cell type-specific genes and regulatory elements evolve faster than those shared across cell types, suggesting an important role for genes with cell type-specific expression in human evolution. Furthermore, we identify several instances of lineage-specific natural selection that may have played key roles in specific cell types, such as coordinated changes in the cis-regulation of dozens of genes involved in neuronal firing in motor neurons. Finally, using novel metrics and a machine learning model, we identify genetic variants that likely alter chromatin accessibility and transcription factor binding, leading to neuron-specific changes in the expression of the neurodevelopmentally important genes FABP7 and GAD1. Overall, our results demonstrate that integrative analysis of cis-regulatory divergence in chromatin accessibility and gene expression across cell types is a promising approach to identify the specific genes and genetic variants that make us human.
This is an important study that leverages a human-chimpanzee tetraploid iPSC model to test whether cis-regulatory divergence between species tends to be cell type-specific. The evidence supporting the study's primary conclusion--that species differences in gene regulation are enriched in cell type-specific genes and regulatory elements--is compelling, although attention to biases introduced by sequence conservation is merited, and the case that is made for cell type-specific changes reflecting adaptive evolution is incomplete. This work will be of broad interest in evolutionary and functional genomics.
In the past few million years, humans have evolved a multitude of unique phenotypes (Shave et al., 2019; Vanderhaeghen & Polleux, 2023). For example, our cardiovascular system has evolved to enable extended periods of physical exertion and the unique aspects of our nervous system enable human language and toolmaking (Shave et al., 2019; Vanderhaeghen & Polleux, 2023). Previous research suggests that much of human adaptation may be caused by changes in gene expression (Fraser, 2013; King & Wilson, 1975; Reilly & Noonan, 2016; Romero et al., 2012). To catalog these changes, studies have compared gene expression in post-mortem tissues of humans and our closest living relatives, chimpanzees (Blake et al., 2020; Kelley & Gilad, 2020; Ma et al., 2022). Although thousands of differentially expressed genes have been identified in post-mortem samples, it is generally not possible to determine whether genetic differences cause a gene to be differentially expressed. In post-mortem studies, genetic differences cannot be disentangled from sources of variation such as differences in diet, environment, cell type abundances, age, post-mortem interval, and other confounding factors. In addition, for many traits the relevant gene expression differences may be specific to early development, but it is impossible to study fetal development in vivo in non-human great apes due to both ethical and technical difficulties. To circumvent these issues, several groups have used great ape iPS cells to study differences in gene expression in cell types present in early development (Benito-Kwiecinski et al., 2021; Field et al., 2019; Kanton et al., 2019; Pavlovic et al., 2018). While the use of iPS cells addresses many of the confounding factors present in post-mortem comparisons, they also introduce new issues such as interspecies differences in iPS cell differentiation kinetics, efficiency, and maturation. Overall, it remains tremendously challenging to identify human-specific changes in gene expression, which limits our ability to link expression differences to either phenotypic differences or natural selection on the human lineage.
One particularly powerful means of studying the evolution of gene expression is through the measurement of allele specific expression (ASE) in hybrids between two species (Combs et al., 2018; Fraser, 2011; Hu et al., 2022; Mack & Nachman, 2017; Wittkopp et al., 2004). This approach has the advantage of eliminating many confounding factors inherent to interspecies comparisons, including differences in cell type composition, environmental factors, developmental stage, and response to differentiation protocols. Because the trans-acting environments of the two alleles in a hybrid are identical, ASE has the additional benefit of reflecting only cis-regulatory changes, which are thought to be less pleiotropic and more likely to drive evolutionary adaptation than broader trans-acting changes (Agoglia et al., 2021; Gokhman et al., 2021; Prud’homme et al., 2007; Wittkopp & Kalay, 2012). Furthermore, ASE enables the use of powerful methods that can detect lineage-specific natural selection and, as a result, contribute to our understanding of the selective pressures that have shaped the evolution of a wide variety of species (Fraser, 2011). Until recently, it has not been possible to disentangle cis- and trans-acting changes fixed in the human lineage since humans cannot hybridize with any other species. However, the development of human-chimpanzee hybrid iPS cells enables measurement of ASE in a wide variety of tissues and developmental contexts (Agoglia et al., 2021; Gokhman et al., 2021; Song et al., 2021). This provides an effective platform to investigate general principles of hominid gene expression evolution, detect lineage-specific selection, and identify candidate gene expression changes underlying human-specific traits.
While gene expression differences between humans and chimpanzees are well-studied, there has been less focus on epigenetic changes, many of which are likely to underlie differences in gene expression (García-Pérez et al., 2021; Kozlenkov et al., 2020; Netherlands Brain Bank et al., 2016; Trizzino et al., 2017). Furthermore, these studies, regardless of whether they utilize post-mortem tissues or cell lines, are subject to the same confounding factors mentioned above. Analogous to ASE, one can use the assay for transposase accessible chromatin using sequencing (ATAC-seq) in interspecies hybrids to measure allele-specific chromatin accessibility (ASCA) (Buenrostro et al., 2013; Corces et al., 2017; Liang et al., 2021; S. Zhang et al., 2020). As with ASE, ASCA is unaffected by many confounders inherent to between-species comparisons and only measures cis-regulatory divergence. Perhaps most importantly, ASCA can implicate specific regulatory elements that likely underlie gene expression differences. These regulatory elements can then be more closely studied to identify the likely causal genetic variants and the molecular mechanisms by which those variants alter gene expression.
Here, we generated RNA-seq and ATAC-seq data from six human-chimpanzee hybrid iPS cell-derived cell types and quantified ASE and ASCA. Using this dataset, we identified thousands of genes and cis-regulatory elements showing cell type-specific ASE and ASCA. We found that cell type-specific genes and cis-regulatory elements are more likely to have divergent expression and accessibility than their more broadly expressed/accessible counterparts. Furthermore, we provide evidence for polygenic selection on the expression level of genes associated with physiologically relevant gene sets including sodium channels and syntaxin-binding proteins in motor neurons. Finally, we use newly developed metrics and machine learning algorithms to link cell type-specific differences in chromatin accessibility and gene expression and identify putative causal mutations underlying these differences. Using this pipeline we identified motor neuron-specific increases in promoter chromatin accessibility and gene expression for FABP7, which plays a key role in neurodevelopment but is not well-studied in neurons. In addition, we focus on a human-accelerated region (HAR) near the promoter of GAD1. While this region is accessible in all cell types, both the accessibility of the HAR and the expression of GAD1 are only chimpanzee-biased in motor neurons. Analysis of scRNA-seq from human and chimpanzee brain organoids showed that increased expression of GAD1 also occurs in ventral forebrain inhibitory neurons. Overall, this study provides insight into the evolution of gene expression in great apes as well as a resource that will inform functional dissection of human-specific molecular changes.
Cis-regulatory divergence of gene expression in six cell types is largely cell type-specific or shared across all cell types
To measure genome-wide cis-regulatory divergence in gene expression, we performed RNA-seq on six cell types derived from human-chimpanzee hybrid iPS cells (Fig. 1a). The cell types profiled were from six diverse developmental lineages including the motor neuron (MN), cardiomyocyte (CM), hepatocyte progenitor (HP), pancreatic progenitor (PP), skeletal myocyte (SKM), and retinal pigment epithelium (RPE) lineages. These represent all three germ layers and a variety of organs (Fig. 1a). It is worth noting that these differentiations do not necessarily lead to a pure population of cells, but rather a population of cells with different levels of maturity along a particular developmental lineage. For example, the SKM population likely contains fully differentiated muscle fibers as well as a small population of proliferating satellite cells; for clarity we refer to this as the SKM cell type. As these different cell types are not shared between tissues, we use cell type-specific and tissue-specific interchangeably throughout the manuscript.
Two independently generated hybrid lines were differentiated for each cell type and at least two biological replicates per hybrid line per cell type were collected (see Methods). Each cell type was sequenced to an average depth of 134 million paired-end reads (Supp Fig. 1). We used a computational pipeline to quantify ASE adapted from the pipeline introduced by Agoglia et al (Agoglia et al., 2021). Briefly, we computed ASE by mapping reads to both the human and chimpanzee genomes, correcting for mapping bias, and assigning reads to the human or chimpanzee genome if a read contained one or more human-chimpanzee single nucleotide differences (see Methods).
As expected, the samples clustered predominantly by cell type (Fig. 1b-c). Within four of the six cell type clusters, individual samples clustered by line rather than species of origin, potentially indicating line to line variability in differentiation (Fig. 1b). This highlights the importance of measuring ASE which, by definition, is measured within each line and so robust to variability between lines. Indeed, when performing PCA within cell types using allelic counts (i.e. counting reads from the human allele and chimpanzee allele separately), human and chimpanzee species differences were clearly separated by principal component (PC) 1 or PC2 in each cell type (Fig. 1d, Supp Fig. 2). To assess the success of our differentiations, we examined each cell type for the expression of known marker genes in our RNA-seq data (Fig. 1e, Supp Fig. 3). All cell types express canonical marker genes and do not express pluripotency markers, indicating that the differentiations were successful (Fig. 1e, Supp Fig 3, Additional file 1) (Burridge et al., 2014; Chal et al., 2016; Korytnikov & Nostro, 2016; Mallanna & Duncan, 2013; Maury et al., 2015; Sharma et al., 2019).
Because our hybrid cells were grown concurrently with their human and chimpanzee diploid “parental” cells, we performed an additional check for purity of the hybrid lines by quantifying genome-wide ASE. We noticed that among our 25 RNA-seq samples, the two PP hybrid2 samples had a slight bias towards higher expression from the chimpanzee alleles across all chromosomes. This is likely due to a small fraction of contaminating chimpanzee cells in these samples. We corrected for this by reducing the chimpanzee allele counts such that the number of reads assigned to the human and chimpanzee alleles was equal. By simulating contamination of a hybrid sample with chimpanzee cells, we found that this correction was conservative and that the log fold-change estimates were largely unaffected by contamination after this correction (Methods, Supp Fig. 4).
We next investigated which genes were differentially expressed between the human and chimpanzee allele in each cell type (Fig. 2a). We identified thousands of genes showing significantly biased ASE in each cell type at a false discovery rate (FDR) cutoff of 0.05 (Fig. 2b). We detected a comparable number of ASE genes in all cell types except SKM. As a result, we repeated all subsequent analyses both including and excluding SKM and obtained qualitatively similar results regardless of whether SKM was included.
While a considerable number of genes had significant ASE in all cell types, many more genes only had significant ASE in a single cell type, suggesting cell type-specific cis-regulatory divergence (Fig. 2b). A notable family of developmentally important genes that exemplifies differences in ASE across cell types is the neurotrophins and their receptors (Fig. 2c) (Caporali & Emanueli, 2009; Huang & Reichardt, 2001). For example, NTRK3, which plays a key role in the development of the nervous system, is only differentially expressed in RPE and MN but is chimpanzee-biased in RPE and human-biased in MN (Supp Fig. 5) (Ichim et al., 2012; Naito et al., 2017). In addition, the gene coding for its primary ligand (NTF3) is differentially expressed in a variety of cell types yet is human-biased in all cell types except MN in which it is chimpanzee-biased (Supp Fig. 5). NTRK1 differential expression is similarly tissue-specific as it is strongly chimpanzee-biased in MN, but human-biased in CM and SKM (Supp Fig. 5). These results indicate that the regulatory landscape of these genes has undergone many complex cis-regulatory changes as the human and chimpanzee lineages have diverged.
To further investigate the relationship between tissue-specificity and ASE, we asked whether genes with variable expression across tissues are more likely to show ASE. Using a standard definition of cell type-specific genes—those with detectable expression in only one cell type in our study—we found that cell type-specific genes were typically enriched for ASE in the one cell type where they are expressed (Supp Fig. 6) (GTEx Consortium, 2017; Jain & Tuteja, 2019). However, other cell type-specific expression patterns such as uniquely low expression in a particular cell type may also indicate an important dosage-sensitive function in that cell type. We therefore focused on a broader definition of cell type-specificity in which genes that are differentially expressed between one cell type and all others in our study (FDR < 0.05 for each pairwise comparison) are considered cell type-specific for that cell type. We found that this more inclusive definition, which assigned many more genes to be cell type-specific, showed an even more significant ASE enrichment than the narrower definition (Fig. 2d). This result is not sensitive to the choice of FDR cutoff nor driven solely by a subgroup of highly expressed genes as it is well-preserved across FDR cutoffs (Supp Fig. 7) and different gene expression levels (Supp Fig. 8). This trend is also robust to separating samples into two groups and using one to define cell type-specific genes and the other to identify differentially expressed genes. This controls for spurious relationships that can result when the same data are used to define two different quantities which are then compared (Supp Fig. 9) (Fraser, 2019).
This enrichment (Fig. 2d) suggests that tissue-specific genes may have less constraint and/or more frequent positive selection on their expression. We reasoned that if the trend was solely driven by constraint, then controlling for constraint—even if imperfectly—would be expected to reduce the strength of the relationship. To investigate this, we binned genes by their variance in ASE across a large cohort of human samples which we have previously shown acts as a reasonable proxy for evolutionary constraint on gene expression (Castel et al., 2020; Starr et al., 2023). Across cell types, we generally observe significant enrichments in each bin and little difference in enrichment between bins, suggesting that differences in constraint on expression of cell type-specific vs. ubiquitously expressed genes are not solely responsible for our observations (Supp Fig. 10). Furthermore, we observe even stronger enrichments using an alternative constraint metric, the probability of haploinsufficiency score (pHI) likely due to the larger number of genes for which pHI can be calculated (Supp Fig. 11) (Collins et al., 2022). Overall, our analysis suggests that differences in constraint are unlikely to fully explain these trends, suggesting a potential role for positive selection.
Lineage-specific selection has acted on tissue-specific gene expression divergence
Next, we sought to use our RNA-seq data to identify instances of lineage-specific selection. We used a previously published method that incorporates a tissue-specific measure of constraint on gene expression to test whether groups of genes have likely been under selection (Starr et al., 2023). Using this strategy, we detected several signals of putative lineage-specific selection, some of which were cell type-specific (Additional file 2). Notably, the four most significant enrichments were found in motor neurons and cardiomyocytes and are highly relevant to those cell types (Fig. 2e; Additional file 2). In cardiomyocytes, the top pathway was “PPAR signaling pathway” which has been shown to play an important role in the regulation of heart morphology and lipid metabolism (Montaigne et al., 2021). For example, NR1H3 (also known as LXRA) is strongly upregulated in human cardiomyocytes as well as all other cell types (Supp Fig. 12a). Furthermore, this upregulation appears to have occurred in the human lineage based on data from non-hybrid cardiomyocytes as well as adult hearts (Supp Fig. 12b) (Blake et al., 2020; Pavlovic et al., 2018). Hybrid cells are essential in determining that the human-specific upregulation of NR1H3 in cardiomyocytes has a strong genetic component as NR1H3 expression is very responsive to diet and other environmental factors (Wang & Tontonoz, 2018).
In motor neurons, multiple categories showed a strong bias toward higher chimpanzee expression including “sodium ion transmembrane transport” and “syntaxin binding”. The genes in these categories are of fundamental importance to the function of motor neurons as sodium ion transporters control excitability and syntaxin binding proteins play a major role in controlling the release of neurotransmitters from synaptic vesicles (Brose et al., 2019; Meisler et al., 2021). Interestingly, several key genes in these sets appear to have human-derived differences in expression that extend beyond motor neurons to other neuronal types. For example, SCN1B, SCN2B, and SYT2 have chimpanzee-biased ASE in our MN data. Similarly, these genes have human-biased expression when comparing human, chimpanzee, and rhesus macaque glutamatergic cortical neurons in a previously published dataset (Supp Fig. 13) (Kozlenkov et al., 2020). We note that several other genes in these genes sets are not differentially expressed between humans and chimpanzees in this cortical neuron dataset, emphasizing the importance of studying individual neuron types (Kozlenkov et al., 2020). Overall, the strong bias in gene expression of sodium ion transporters and syntaxin binding proteins we observe suggests lineage-specific selection that may have altered the electrophysiological properties of human motor neurons.
Patterns of allele-specific chromatin accessibility reveal divergent cis-regulatory elements
While ASE provides insight into what gene expression changes might underlie phenotypic differences between humans and chimpanzees, in the absence of additional data it is very difficult to prioritize which specific mutations might cause expression divergence. To begin to fill this gap, we generated ATAC-seq data from five of the six cell types (all except RPE), with each cell type sequenced to an average depth of 184 million paired-end reads (Supp Fig. 14). ATAC-seq uses a hyperactive Tn5 transposase to cleave DNA that is not bound by nucleosomes to enrich for accessible chromatin (Fig. 3a), a hallmark of active cis-regulatory elements (CREs) (Buenrostro et al., 2013; Corces et al., 2017). We estimated ASCA for individual open chromatin peaks by mapping reads to both species’ reference genomes, correcting for mapping bias, generating a unified list of peaks across all samples, and then counting reads supporting each allele in each peak (see Methods). Cell types clustered well using PC1 and PC2 of the ATAC-seq data, except for the HP sample which clustered closely with PP samples (Fig. 3b, Supp Fig. 15). However, performing PCA on just the HP and PP samples clearly separates the two cell types (Supp Fig. 16). Within each cell type, species differences were clearly separated by PC1 or PC2 (Supp Fig. 17). As an example of ASCA, the accessibility of the promoter of CTSF was strongly chimpanzee-biased, mirroring the chimpanzee-biased ASE for this gene (Fig. 3c).
As a first step in analyzing the ATAC-seq data, we intersected the peaks we identified with the genomic annotations of chromatin states. These fifteen categories, predicted across many tissues and cell types by the chromHMM model (Vu & Ernst, 2022), include terms such as “TSS” (transcription start site) and “enhancer”. We then plotted the median of the absolute human-chimpanzee ASCA log fold-changes for each chromatin state and cell type (Fig 3d). The TSS and promoter annotations were the least divergent in their accessibility, whereas regions of heterochromatin were the most divergent (Fig. 3d). To explore the relationship between interspecies differences in ASCA and ASE, we assigned peaks to the nearest TSS and computed the Pearson correlation between the allelic log fold-change of chromatin accessibility and expression within each cell type and chromatin state. As expected, TSS and promoter annotations showed the strongest correlation between ASCA and ASE, and correlations were stronger when including only differentially accessible peaks (Fig. 3e, Methods). Intriguingly however, ASCA of regions annotated as heterochromatin, polycomb repressed, or quiescent were as strongly correlated with ASE as elements identified as enhancers or DNase hypersensitivity sites (Fig. 3e). Notably this result is robust to how peaks and chromHMM annotations were intersected (Supp Fig. 18a), as well as to removal of all peaks even slightly overlapping TSS or promoter-related annotations (Supp Fig. 18b). Given that heterochromatin/quiescent regions are highly divergent in ASCA (Fig. 3d), this suggests that changes in accessibility of these regions may be particularly important in the evolution of gene expression.
Next, we investigated whether the analog of the relationship between cell type-specific gene expression and ASE (Fig. 2d) holds for chromatin accessibility. Since the number of called peaks is largely dependent on sequencing depth (Supp Fig. 14, 19a), we performed down-sampling to equalize power to detect peaks across cell types (Supp Fig. 19b, Methods). We then called peaks on the down-sampled data and identified peaks as cell type-specific if they were called as peaks in only one cell type. In agreement with the gene expression data, we observed that cell type-specific peaks are enriched for ASCA across all cell types and this enrichment generally holds when using varying log2 fold-change or p-value cutoffs (Supp Figs. 20-21). Analogous to our analysis of gene expression, we also applied a broader definition of cell type-specificity to the ATAC data, in which a peak was considered specific to a cell type if that peak had an absolute log2 fold-change greater than a chosen threshold (e.g. 0.5 or 1) across all pairwise comparisons with other cell types. We observe strong enrichment for ASCA in cell type-specific peaks using this definition except when using the most stringent cutoffs due to the very low number of peaks meeting these criteria (Supp Figs. 22-23, Methods). Finally, we observe the same enrichments when controlling for a recently published metric for constraint on non-coding elements, suggesting that differences in evolutionary constraint may not be solely responsible for the observed trends (Supp Fig. 24) (S. Chen et al., 2022).
We next explored the relationship between cell type-specific ASCA and ASE. To do this, we developed a novel metric called differential expression enrichment (dEE) to quantify how specific the log fold-change is to a particular cell type or tissue. Our method is based on expression enrichment (EE) (Yu et al., 2006), a metric that measures how specific gene expression is to a certain cell type/tissue. dEE estimates how cell type-specific ASE is for a gene (Supp Fig. 25, Methods) and, analogously, dCAE (differential chromatin accessibility enrichment) measures how cell type-specific ASCA is for a cis-regulatory element (Supp Fig. 26, Methods). dEE and dCAE are high in a cell type if there is a high absolute log fold-change in that cell type and much lower absolute log-fold changes or log fold-changes in the opposite direction in the other cell types. For example, dEE would be close to one for a gene in a cell type if the gene had strongly human-biased ASE in that cell type and very weakly human-biased or chimpanzee-biased ASE in the other cell types. On the other hand, if a gene did not have any strong allelic bias, that gene would have dEE close to zero. dEE is conceptually related to a metric we have previously introduced, diffASE, and generalizes diffASE to an arbitrary number of cell types and any assay that produces log fold-changes (Combs et al., 2018; Hu et al., 2022; York et al., 2018). Using these metrics, we identified 154 instances in which a gene with cell type-specific ASE (i.e., high dEE) had a peak with cell type-specific ASCA in the same cell type (i.e., high dCAE, see Additional file 3 for the full list). Of these, 95 showed ASCA and ASE in the same direction which is more than expected by chance (77 expected; p < 0.005, binomial test). These results suggest that tissue-specific cis-regulatory divergence in chromatin accessibility may often impact tissue-specific gene expression, though this divergence is neither necessary nor sufficient to do so.
Identifying candidate causal cis-regulatory variants by integrating ASE and ASCA across cell types
As high dEE and dCAE in a given cell type might be indicative of a causal link between the change in chromatin accessibility and the change in expression, we focused on the 95 peak-gene pairs with matching direction and used two different strategies to identify examples to investigate in detail. First, we prioritized genes known to play important roles in development. For example, we found that the promoter of FABP7 has human-biased ASCA specifically in motor neurons (Fig. 4a-b) and that the FABP7 gene has human-biased ASE in motor neurons (Fig 4c). FABP7 is used as a marker of glial cells and neural progenitor cells (NPCs) and plays a key role in NPC proliferation and astrocyte function (Arai et al., 2005; De Rosa et al., 2012; Ebrahimi et al., 2016; Watanabe et al., 2007). Using previously published single-nucleus RNA-seq data from humans, chimpanzees, and rhesus macaques, we confirmed that FABP7 shows a human-derived up-regulation in several neuronal subtypes but not glial cells (Supp Fig. 27) (Ma et al., 2022). To investigate the genetic basis of this cell type-specific divergence, we leveraged a machine learning model, Sei (K. M. Chen et al., 2022), to nominate potentially causal variants in the promoter of FABP7 (see Methods). Sei is a deep neural network that takes DNA sequence as input and predicts the probability that the sequence has a particular epigenetic state in a variety of cell types and tissues (Fig. 4d) (K. M. Chen et al., 2022). While single-base substitutions differing between human and chimpanzee had only minor impacts on predicted cis-regulatory activity, “chimpanizing” the human sequence of the FABP7 promoter at two small indels (by deleting one base at chr6: 122,779,291 and inserting three bases at chr6: 122,779,115) was sufficient to make the Sei predictions for the chimpanized human sequence closely match the predictions for the complete chimpanzee sequence (Fig. 4e-g). Making only one of these changes had substantial but weaker effects in both cases, suggesting that both mutations might be functionally important (Fig. 4f,g). The 1-base insertion in the human lineage introduces a binding site for the neuronally expressed transcription factors GLIS2/3, suggesting a potential molecular mechanism (Calderari et al., 2018; Castro-Mondragon et al., 2022; Ke et al., 2015).
As another approach to ranking the 95 peaks, we searched for peaks containing human-chimpanzee sequence differences in otherwise highly conserved genomic positions, since these could reflect changes in selective pressure. Using PhyloP scores for 241 placental mammals (Sullivan et al., 2023) to assess conservation, one of the top-ranked peaks was a putative enhancer six kilobases away from the TSS of GAD1, which plays a key role in the synthesis of the neurotransmitter GABA (Feldblum et al., 1993). Notably, part of this peak has been classified as a human accelerated region (HAR) (Girskis et al., 2021; Hubisz & Pollard, 2014; Pollard et al., 2006)—a short sequence that is highly conserved in mammals yet contains an unusual number of human-specific mutations. Both the accessibility in the peak and GAD1 expression are only chimpanzee-biased in motor neurons (Fig. 4a-c, Supp Fig. 28). Applying Sei to estimate the predicted effect of every variant in this region, we found that a single-nucleotide substitution within the HAR (chr2: 170,823,193) has by far the largest predicted cis-regulatory effect and most closely matches the differences in Sei predictions between the full human and chimpanzee haplotypes (Fig. 4d-g). Interestingly, this mutation is predicted to disrupt a binding site for several basic helix-loop-helix transcription factors that play essential roles in neuronal differentiation such as Ascl1 (Supp Fig. 29) (Castro-Mondragon et al., 2022; Mizuguchi et al., 2006; Yang et al., 2017).
As GAD1 is only highly expressed in GABAergic neurons (and was therefore lowly expressed in the cell types studied here, Supp Fig. 30a), we investigated whether this reduced expression of human GAD1 also occurs in cortical organoids which contain GABAergic neurons together with other cell types in which GAD1 is not highly expressed. We analyzed our previously published data from human-chimpanzee hybrid cortical organoids (Agoglia et al., 2021) and found that the expression of GAD1 from the chimpanzee allele spikes higher than that of the human allele around day 50 of hybrid cortical organoid differentiation before dropping in expression over time to match the human expression level (Supp Fig. 30b). Because ASE in the hybrid cells controls for any potential interspecies differences in differentiation kinetics or cell type composition, this difference must be the result of cis-regulatory divergence between humans and chimpanzees. This expression difference is also more pronounced in comparisons of human and chimpanzee parental cortical organoids, with a higher absolute log fold-change at day 50, day 100, and day 150, only returning to equal expression at day 200 (Supp Fig. 30c). While this could be due to differences in cell type proportion between human and chimpanzee organoids, it might also be due to a reinforcing trans-acting effect.
To test whether this difference in expression also occurs specifically during GABAergic neuron differentiation we examined GAD1 expression in single cell RNA-seq data from human and chimpanzee cortical organoids (Kanton et al., 2019). We observed a spike in GAD1 expression in less mature chimpanzee GABAergic neurons that is absent in the corresponding part of the trajectory in human neurons (Supp Fig. 31). Notably, a similar trend holds regardless of which GABAergic sub-trajectory (i.e., equivalent to GABAergic neurons from the caudal, lateral, or medial ganglionic eminences) is examined suggesting this difference is not unique to a particular type of GABAergic neuron (Supp Fig. 31). Finally, we examined the accessibility of the putative GAD1 enhancer more closely. Consistent with a potential role for this enhancer in the spike in GAD1 expression during development, the accessibility of this enhancer mirrors the expression of GAD1 in human cortical and striatal organoids (Supp Fig. 32) with high accessibility between day 50 and day 100 before decreasing somewhat near day 150 (Trevino et al., 2020). Overall, our results demonstrate how the combination of RNA-seq, ATAC-seq, and machine learning models can nominate variants that may be responsible for cell type-specific changes in gene expression and chromatin accessibility.
In this study, we quantified human-chimpanzee cis-regulatory divergence in gene expression and chromatin accessibility in six different cell types representing diverse developmental lineages. Across the thousands of genes with ASE, we found that most cis-regulatory divergence is specific to one or a few cell types. Furthermore, we found that divergent cis-regulation is linked to tissue-specificity, with tissue-specific genes being enriched for ASE and tissue-specific regulatory elements being enriched for ASCA. As this result was largely unchanged when stratifying by evolutionary constraint, our results suggest that changes in the expression of genes with more cell type-specific expression patterns may be less deleterious than changes in more broadly expressed genes, supporting the idea that cell-type specific divergence may be less pleiotropic (Wittkopp & Kalay, 2012). Overall, this suggests that broad changes in expression in cell type-specifically expressed genes may be an important substrate for evolution.
We also identified several sets of genes evolving under lineage-specific selection that may have played a role in establishing unique facets of human physiology and behavior. Most interestingly, we found evidence for selection on sodium ion transporters and syntaxin binding proteins that may alter the electrophysiological properties of motor neurons, and potentially other types of neurons as well (Brose et al., 2019; Meisler et al., 2021). The complexity of the molecular machinery regulating neuronal excitability and synaptic vesicle release make it difficult to say what the effects of these gene expression changes are on the excitability of motor neurons without electrophysiology data from human and great ape neurons coupled with perturbation of candidate genes. However, given the divergence in locomotion and motor skills between humans and chimpanzees, it is tempting to speculate that these changes may have had some role in the evolution of motor control and learning in humans.
In this work, we developed two metrics—dEE and dCAE—to quantify the degree of cell type-specific differential expression and accessibility. These metrics are largely analogous to widely used metrics that quantify tissue- or cell type-specific expression level and applicable to any comparison of log fold-changes across conditions. They markedly improved our ability to identify matching cell type-specific ASE and ASCA and led to the identification of 95 peak-gene pairs that had highly cell type-specific concordant changes in accessibility and expression.
One such example is a potentially human-derived increase in FABP7 expression in several types of human neurons. As FABP7 is not highly expressed in adult mouse neurons, the functional consequences of its higher expression in human neurons are difficult to predict (Yao et al., 2021). FABP7 plays a role in the uptake of the fatty acid Docosahexaenoic Acid (DHA), an important component of neuronal membranes (Akbar et al., 2005; Choi et al., 2021). DHA promotes neuronal survival through phosphatidylserine accumulation, so it is possible that the human-specific FABP7 expression increases neuronal DHA uptake leading to reduced apoptosis in human neurons during development and ultimately contributing to a larger number of neurons in humans (Akbar et al., 2005; Choi et al., 2021).
In addition, we identified a highly conserved developmentally dynamic enhancer near GAD1 that may have partially lost activity in the human lineage resulting in a decrease in GAD1 expression early in GABAergic neuron development. By integrating with the deep learning model Sei, we identified a variant that may account for the chimpanzee-biased ASCA in this region (K. M. Chen et al., 2022). Interestingly, the ASE of GAD1 was coupled with a relatively small (though significant) magnitude of ASCA. This could potentially reflect divergence in transcription factor binding that leaves a “footprint” resulting in subtle ASCA (Vierstra et al., 2020). Overall, our data suggest that this enhancer has lost activity in the human lineage, potentially altering the expression pattern of GAD1 during neuronal development. GAD1 is the rate-limiting enzyme for GABA synthesis so GABA levels are likely responsive to changes in GAD1 expression (Feldblum et al., 1993). GABA release has complex context-specific effects on neurodevelopment making it difficult to speculate as to what the phenotypic effects of reduced GABA synthesis during human neurodevelopment might be (Ben-Ari et al., 2012). However, the high conservation of this cis-regulatory element in placental mammals implies that its human-specific disruption is likely to have important neurodevelopmental effects. Careful perturbation of this enhancer and GAD1 expression in mouse models will be required to explore this further.
Overall, our study provides foundational data, insight, and computational tools that will improve our understanding of cell type-specific cis-regulatory evolution and the role it has played in the establishment of human-specific traits.
Generation of multiple human-chimpanzee hybrid cell types
We used two previously described human-chimpanzee hybrid iPS cell lines (hybrid1 and hybrid2, previously denoted Hy1-25 and Hy1-30 respectively) (Agoglia et al., 2021). Before differentiation, cells were routinely cultured on matrigel in mTeSR1 or mTeSR Plus (Stem Cell Technologies cat #85850 or cat #100-0276). Culture and in vitro differentiation of iPS cells into six cell types (motor neurons (MN), cardiomyocytes (CM), hepatocyte progenitors (HP), pancreatic progenitors (PP), skeletal myocytes (SKM), and retinal pigment epithelium (RPE)) was carried out by the Columbia Stem Cell Core Facility using published protocols (Burridge et al., 2014; Chal et al., 2016; Korytnikov & Nostro, 2016; Mallanna & Duncan, 2013; Maury et al., 2015; Sharma et al., 2019).
Preparation of RNA-seq libraries
All samples were cryopreserved in liquid nitrogen before RNA extraction (Milani et al., 2016). Cells were gently thawed and then washed with PBS and cell pellets were collected via centrifugation at 1,000 RPM for 5 minutes. Cell pellets were loosened by flicking the tube and an appropriate volume of Buffer RLT based on the cell count were added following the RNeasy Mini Kit (Qiagen, 74104) protocol. Total RNA extraction and on-column DNAse digestion were performed using RNeasy Mini Kit (Qiagen, 74104) and RNase-Free DNase Set (Qiagen, 79254). RNA quality was assessed using the Agilent Bioanalyzer RNA Pico assay. Only samples with an RNA integrity number (RIN) greater than or equal to 7 were used to prepare cDNA libraries. All RNA-seq libraries except 3 motor neuron libraries (2 hybrid2 and 1 hybrid1) were prepared using the TruSeq Stranded mRNA kit (Illumina, 20020594) and the TruSeq RNA CD Index Plate (Illumina, 20019792) from between 100 ng and 1 ug total RNA following the manufacturer’s protocols. The other three motor neuron libraries were prepared using Illumina Stranded mRNA Prep (Illumina, 20040532) and IDT for Illumina RNA UD Indexes Set A, Ligation (Illumina, 20040553) due to low yield of total RNA. Notably, the four motor neuron libraries did not cluster by library preparation method. All libraries were normalized, pooled at an equimolar ratio using Qubit measurements, and sequenced on an Illumina HiSeq 4000 to generate 2×150bp paired-end reads.
Identification of confident human-chimpanzee SNVs
To identify a confident list of human-chimpanzee SNVs that could be used to quantify allele-specific expression and chromatin accessibility, we first downloaded hg38-panTro6 MAF files from UCSC and whole-genome sequencing data generated from the parental human and chimpanzee iPS cells (in the form of bam files aligned to hg38 and panTro5, generously provided by the Gilad lab). We first converted them back to fastq files and then mapped reads to panTro6 and hg38 (we mapped both human and chimpanzee to both reference genomes) using bowtie2 with the flags —very-sensitive-local - p 16 (Langmead & Salzberg, 2012). We then used a modified version of our previous approach to filter out SNVs that could not be confidently identified as homozygous in the human and chimpanzee parental lines (Agoglia et al., 2021). Briefly, we extracted SNVs and indels from both human and chimpanzee MAF files, counted reads in the WGS data that supported the human, chimpanzee, or an alternative base at that position, then filtered out any SNVs with < 2 reads or < 90% of reads supporting that species’ base. We then reformatted files, merged with indels for use in Hornet, and generated a modified bed file of SNVs that includes the human and chimpanzee base at the SNV position (van de Geijn et al., 2015).
Generation of allele-specific count tables
An allele specific expression mapping and calling pipeline adapted from Agoglia et al. and our updated high-confidence SNV list was used. The whole pipeline was carried out twice independently using hg38 or panTro6 as the reference genome. This approach was taken to eliminate genes showing strong mapping bias, defined here as genes with an absolute difference in log fold-change between the panTro6 referenced and hg38 referenced runs greater than one. All sequencing reads were trimmed with SeqPrep (adapters specified by the manufacturer for the different library preparation kits) and mapped using STAR with two passes and the following parameters: --outFilterMultimapNmax 1 (Dobin et al., 2013; John St. John, n.d.). Uniquely aligned reads were deduplicated with Picard and Hornet (an implementation of WASP which first removes reads overlapping indels) was used to correct for mapping bias (Broad Institute, n.d.; van de Geijn et al., 2015). Reads were assigned to either the human allele if they contained one or more human-chimpanzee single nucleotide differences that matched the human sequence and zero positions that matched the chimpanzee sequence (and vice versa for assigning reads to the chimpanzee allele) and counted per gene as previously described (Agoglia et al., 2021).
Detection of aneuploidy on chromosome 20 and slight chimpanzee parental contamination in PP hybrid2 samples
In our quality control process, we plotted the log fold-change for each gene along every chromosome and inspected the results. This revealed a clear bias toward the human allele on a part of chromosome 20 for hybrid2 samples, suggesting chromosome 20 aneuploidy which was also reported by Agoglia et al (Agoglia et al., 2021). As a result, we excluded chromosome 20 from all downstream analyses. In addition, we found that PP hybrid2 samples had a slight bias toward the chimpanzee allele across every chromosome which was most likely due to a small fraction of contaminating chimpanzee cells in these samples. Rather than removing these samples, we normalized the allele-specific count tables by subtracting a small number of reads from the chimpanzee allele counts calculated based on the biased ratio summarized from genome wide human and chimpanzee allele counts, to force a global log fold-change (across all autosomes except chromosome 20) of zero between the human and chimpanzee alleles. We applied this normalization to all other samples as well. To evaluate the success of this strategy, we took hybrid and chimpanzee parental iPSC RNA-seq from Agoglia et al (Agoglia et al., 2021) from the same iPS cell lines as used in this study, and then simulated chimpanzee parental contamination by mixing chimpanzee parental data into hybrid 2 data to reach similar a similar degree of chimpanzee bias level to that observed in the PP hybrid2 samples. We then identified genes showing ASE (see “Identifying genes with ASE”) using the counts from the original hybrid samples, simulated contaminated samples, and corrected simulated contaminated samples respectively and compared the outputs (Supp Fig. 4).
PCA and hierarchical clustering
Allelic counts were normalized by DESeq2 rlog and principal components analysis (PCA) was performed on rlog normalized allelic counts with default centering and scaling (Love et al., 2014). The top 1,000 variable genes with the highest variance of normalized allelic counts across all cell types were used to compute Euclidean distance matrices. The R package pheatmap was used to do hierarchical clustering and heatmap plotting.
Identifying genes with ASE
DESeq2 was used to measure allele-specific expression (ASE) in each cell type (Love et al., 2014). All reads from chromosome 20 were removed (as mentioned above). Two replicates per hybrid line per cell type (plus one additional replicate for SKM hybrid2 for a total of 3 samples) were used by DEseq2 with model ∼hybLine+Species to measure differential expression level. A likelihood ratio test (test=“LRT”, betaPrior=FALSE) was used to compute p-values. P-values were then false discovery rate adjusted using an implementation of the Benjamini-Hochberg correction in the R package qvalue (Benjamini & Hochberg, 1995; Storey Lab, n.d.). Log fold-changes were shrunk as recommended by the DESeq2 pipeline (Zhu et al., 2019). Differentially expressed genes were defined as those with FDR < 0.05 when aligned to hg38 and panTro6 as well as an absolute difference in log fold-change <= 1 when comparing the results from the two alignments.
Identifying cell type-specifically expressed genes
For the more traditional definition of cell type-specific genes, we required transcripts per million (TPM) < 1 for a gene in every cell type except one. In the cell type with TPM > 1, we varied how highly expressed the gene had to be in that cell type (again using a TPM cutoff, varying between one and five) to consider that gene to be specific to that cell type. A similar process to the one described in “Identifying differentially expressed genes” was used to identify cell type-specific genes based on the broader definition described in the main text. Rather than using allelic counts, total counts for each sample (i.e. all uniquely mapping deduplicated reads regardless of their allelic origin) were computed by summing all allelic and non-allelic counts. These counts were inputted to DESeq2 and the expression of each gene was compared pairwise between all cell types (Love et al., 2014). Genes were defined as cell type-specifically expressed in a cell type only if all pairwise comparisons between that cell type and other cell types resulted in an FDR < 0.05 using both hg38 and panTro6 aligned counts. Due to the markedly lower number of differentially expressed genes identified in SKM, results were computed both including and excluding SKM. An analogous procedure was used to identify more broadly defined cell type-specific peaks in the down-sampled ATAC-seq dataset. Peaks were defined as specific to a cell type if the absolute log fold-change was greater than 0.5 across all pairwise comparisons with the other cell types. We also tested an absolute log fold-change threshold of 1 to ensure that our results were not sensitive to the choice of cutoff.
Enrichment test for genes with cell type-specific expression patterns and genes showing ASE
Odds ratios were calculated using the unconditional maximum likelihood estimate implemented in the R package epitools function oddsratio.wald (), and 95% confidence intervals and p-values were calculated using the normal approximation. A directly analogous procedure was performed to test for enrichment of peaks with ASCA and cell type-specific peaks.
Enrichment test stratified by expression level or evolutionary constraint
Enrichment tests were carried out as in ‘Enrichment test of cell type-specific expression patterns and genes showing ASE’ except that genes were split into five equal size bins depending on which factors were used to stratify genes, and tests were done in each bin. When stratifying by expression level, genes were ordered in ascending order based on expression level (TPM) and then split into 5 equal size bins where genes in the 0-20% bin are the most lowly expressed genes and genes in the 80-100% bin are the most highly expressed genes. When stratifying by constraint metrics such as ASE variance or pHI, genes were ordered in ascending order based on ASE variance values and then split into five equal size bins where the 0-20% bin contains genes with the lowest ASE variance (i.e. most evolutionarily constrained) and the 80-100% bin contains genes with highest ASE variance (i.e. least evolutionarily constrained). To stratify the ATAC data by constraint, we used the “QCed genomic constraint by 1kb regions” computed by the gnomAD consortium (S. Chen et al., 2022). We further removed any regions that overlapped protein coding exons from the human gtf file using bedtools subtract (Quinlan & Hall, 2010). If a peak overlapped two or more 1 kilobase windows, it was assigned to the window with the highest constraint, mirroring the procedure used by the gnomAD consortium to assign peaks to ENCODE regulatory elements (S. Chen et al., 2022). Once the peaks were ranked by this metric, a procedure identical to that for the gene expression constraint metrics outlined above was performed.
Identification of lineage-specific selection on gene expression
We used a modified version of our previously published pipeline, which uses ASE values from many individuals of a single species to estimate cis-regulatory constraint of each gene (Starr et al., 2023). We restricted these ASE values to GTEx samples from the tissue (s) of origin for each cell type (with the exceptions of RPE which was compared to all GTEx samples and MN which was compared to all brain and peripheral nerve samples as more closely matches tissues such as eyes for RPE are not available in GTEx). We then used the Mann-Whitney U test to compare the human population ASE distribution to the human-chimpanzee ASE distribution as previously described (Starr et al., 2023). We then used the previously described signed ranking by Mann-Whitney p-value that incorporates whether a gene has human or chimpanzee-biased ASE with GSEAPY and the binomial test to identify instances of lineage-specific selection (Starr et al., 2023). Due to the focus on tissue-specificity, we did not filter redundant gene sets with GSEAPY FDR < 0.25 in multiple cell types (Starr et al., 2023; Subramanian et al., 2005).
Preparation of ATAC-seq libraries
We used the OmniATAC protocol with the only modification being the use of 25,000 cells instead of 50,000 since the fused iPS cells are tetraploid (Corces et al., 2017). All samples for ATAC-seq prep were from the same vials used in RNA-seq library preparation except for the motor neuron libraries due to the low yield of total RNA extracted from motor neurons. After library preparation and running samples on a Bioanalyzer, we noticed a considerable number of fragments greater than 1000 bases in length. To reduce these fragments, we size selected with Ampure beads using the protocol from the Kaestner lab available here: https://www.med.upenn.edu/kaestnerlab/assets/user-content/documents/ATAC-seq-Protocol-(Omni)-Kaestner-Lab.pdf.
After size selection and rerunning on the Bioanalzyer, we pooled the libraries together and sequenced them to compute quality control metrics. We used the R package ChrAccR to compute TSS enrichment scores (Mueller, Fabian, n.d.). We pooled all libraries with TSS enrichment score greater than 3.5. This resulted in 2 CM libraries, 2 MN libraries, 2 PP libraries, 1 SKM library, and 1 HP library. After pooling, libraries were sequenced on an Illumina Hiseq 4000 to produce 2×150 paired-end reads.
Mapping the ATAC-seq data
We trimmed reads using SeqPrep and then mapped them to the hg38 and panTro6 reference genomes with bowtie2 in paired-end mode (John St. John, n.d.; Langmead & Salzberg, 2012). The following parameters were used: -X 2000 --very-sensitive-local -p 16. After mapping, duplicates were removed via Picard MarkDuplicates (Broad Institute, n.d.). We then removed multi-mapping reads with the command samtools view -b -q 10 (Li et al., 2009). Due to the format of bowtie2’s output, running Hornet on all reads at once was excessively RAM intensive. Therefore, we split the bam files by chromosome and ran Hornet on each of the chromosomes separately. We used the files of SNVs and indels generated as described above as input to Hornet. After Hornet finished running, we used samtools merge to merge all autosomes and sex chromosomes (we excluded the mitochondrial genome) to create a final bam file for downstream analysis (Li et al., 2009).
Peak calling and filtering
As only one replicate was available for SKM and HP, we generated two pseudo-replicates by randomly assigning reads to one of two files using Picard SplitSamByNumberOfReads (Broad Institute, n.d.). We then called peaks on each file separately, as well as a merged file containing all the reads from a particular cell type. For example, for MN, both replicates were pooled and peaks were called on that file as well as the two replicates separately. Before peak calling, all bam files were converted to bed files. We called peaks using MACS2 callpeak with the following arguments: -f BED -p 0.01 --nomodel --shift 75 --extsize 150 -B --SPMR --keep-dup all --call-summits (Y. Zhang et al., 2008, p. 2). We called peaks on both the chimpanzee-referenced and human-referenced bam files. After peak calling, we sought to filter peaks using a modified version of the ENCODE pipeline designed to eliminate peaks that lack a one-to-one ortholog between humans and chimpanzees. The following pipeline was run on each cell type separately. We first filtered peaks that were not called in both replicates as well as the pooled file using code from the ENCODE pipeline based on bedtools and awk (Quinlan & Hall, 2010). We then used a custom Python script to merge overlapping peaks and used UCSC LiftOver to lift the peaks from hg38 to the panTro6 and back to hg38 as well as from panTro6 to hg38 (Kuhn et al., 2013). We then used bedtools to intersect the resulting human referenced files and filtered out any peaks that did not have at least 25% overlap with a peak in the other file (Quinlan & Hall, 2010). After filtering out peaks overlapping ENCODE blacklisted regions and merging overlapping peaks again, we lifted the file that was originally chimpanzee-referenced back to the chimpanzee genome (Amemiya et al., 2019). Finally, we removed human-referenced peaks if their chimpanzee-referenced counterpart failed to LiftOver (Kuhn et al., 2013).
Annotating the peak lists
To annotate the peaks, we used the list of TSS defined by Horlbeck et al. to annotate peaks (Horlbeck et al., 2016). We lifted over each TSS to hg38, expanded 1000 bases on either side of the midpoint of each TSS to generate promoters, and merged any promoters that overlapped while retaining all unique gene names associated with the promoter (Kuhn et al., 2013). We then used reciprocal LiftOver with panTro6 to filter out non-orthologous promoters and used bedtools intersect to link peaks to promoters and expanded the peak to include the entirety of the promoter if necessary (Quinlan & Hall, 2010). Through this process, we also outputted a list of non-promoter CREs (sometimes labeled as enhancers as enhancers are thought to be the most common type of CRE). We took this list and used bedtools closest to link them to the two closest protein coding genes (Quinlan & Hall, 2010). Notably, the gene naming conventions differ for the Horlbeck et al. TSS list and the GTF file used for RNA-seq processing. We altered all gene names in peaks to match those found in the GTF file. In some cases, the gene no longer existed in the updated hg38 GTF in which case the gene name was replaced with NAN.
Generating a unified peak list
We next merged our cell type-specific peak list across all five cell types to create a unified peak list. To do this, we iteratively intersected all the peaks with bedtools and then merged any overlapping peaks (Quinlan & Hall, 2010). Finally, we added back any peaks that did not intersect a peak found in any other cell types. We then took the chimpanzee and human-referenced versions of these peak lists and ran them through the LiftOver-based non-homologous peak filtering pipeline described above to generate a final file of all identified peaks as well as which cell type (s) they were called in (Kuhn et al., 2013). Then, we reran the annotation pipeline described in ‘Annotating the peak lists’ on this new set of peaks.
Counting reads in peaks and further peak filtering
First, we split the bam files into reads that we could confidently assign to the chimpanzee genome and reads we could assign to the human genome. We used our bed file of high-confidence SNVs and required at least one SNV matching the human genome as well as no SNVs matching the chimpanzee genome for a read to be assigned to human (and vice versa for chimp). We then used a custom Python script to reformat the peak list bed files as GTF files and used HTSeq to count reads in peaks using the following parameters: -s no -m union -r pos (Anders et al., 2015). We only kept peaks if they had a mean read count across replicates within a cell type of at least 25 from either allele. For example, if a peak has an average of 27 reads from the human allele and an average of 10 reads from the chimpanzee allele in MN, that peak would be kept in MN. On the other hand, if the same peak had an average of 24 reads from the human allele and an average of 10 reads from the chimpanzee allele in CM, that peak would be discarded for CM.
We next filtered the reads to remove peaks that might be differentially accessible but show evidence of mapping bias or do not agree between replicates. To do this, we removed any peaks with an absolute log fold-change greater than one in one replicate but with a fold-change of any magnitude in the opposite direction in the other. This was not done for SKM or HP as we had only one replicate. We then removed any peaks that had a log fold-change in opposite directions with an absolute difference greater than 1 in at least one replicate when comparing the human-referenced and chimpanzee-referenced counts. Finally, as described in section ‘‘Detection of aneuploidy on chromosome 20 and slight chimpanzee parental contamination in PP hybrid2 samples’ for RNA-seq data analysis, we removed any peaks on chr20 and took this as our final list of peaks for downstream analyses. Allelic counts were normalized as described in the RNA-seq data analysis. We tested for allele-specific chromatin accessibility (ASCA) using the binomial test applied to the normalized allelic counts (summed by species within a cell type). We considered any peaks with a binomial p-value less than 0.05 to be nominally differentially accessible.
Down-sampling to identify cell type-specific ATAC-seq peaks
As the number of peaks detected by ATAC-seq is generally a function of read depth and our read depth varied widely across cell types, we restricted to one replicate (always hybrid1 if two replicates were available) and down-sampled reads to match the SKM sample with lowest sequencing depth. We then called peaks for cell types with a single ATAC replicate as described above.
Allelic chromatin accessibility tracks
Allelic bam files with reads originating from the human allele and the chimpanzee allele (respectively) were obtained as described in ‘Counting reads in peaks and further peak filtering’. Two replicates in CM, MN, and PP were pooled by cell type. Bam files were converted into bigWig files using python package deepTools bamCoverage with options: --binSize 1 --normalizeUsing CPM --effectiveGenomeSize 2862010578 --ignoreForNormalization chr20 --extendReads (Ramírez et al., 2014). Tracks were visualized and plotted using the python package pyGenomeTracks (Lopez-Delisle et al., 2021). When comparing human and chimpanzee log fold-change track differences in each cell type, deepTools bigwigCompare was used to compare between human bigWig and chimpanzee bigWig with options: --pseudocount 1 --skipZeroOverZero --operation log2 -bs 1 (Ramírez et al., 2014).
ChromHMM annotation and correlation with ASE
A universal chromHMM annotation was obtained for each peak based on overlap with any of the 15 categories in chromHMM (excluding the blacklist category, for which peaks had already been removed) (Ernst & Kellis, 2017; Vu & Ernst, 2022). Divergence was measured as the z-score transformed median of the absolute log fold-change of human and chimpanzee normalized counts in each peak. Each peak was assigned to the closest gene and then Pearson correlation was computed between the chromatin accessibility log fold-change and the expression log fold-change for each peak and its nearest gene. Pearson correlation was computed only on categories including at least 15 peaks. When showing results for differentially accessible peaks, only peaks with binomial p-values less than 0.05 were kept and used in computing the Pearson correlation. When assigning a unique chromHMM to each peak, the chromHMM category that covered the largest portion of each peak was used. When filtering out promoter-related annotations, peaks covering any promoter-related chromHMM categories (“TSS”, “flanking promoter” and “bivalent promoter”) were filtered out and the analysis described above was repeated.
Computation of differential expression enrichment (dEE) and differential chromatin accessibility enrichment (dCAE)
For each target cell type, taking CM as an example, the log fold-change for gene A was fixed as target log fold-change, and the log fold-changes for gene A in the remaining cell types with an opposing sign (compared to the target log fold-change) were set to zero. Then, the dEE value was calculated as the proportion of the target log fold-change in the sum of the zeroed log fold-changes across all cell types. For example, the dEE for gene A in CM would be abs (target LFC)/sum (abs (LFC after zeroing))). dEE ranges from zero to one and low dEE value indicates differential expression with similar magnitude and direction across cell types, and/or the gene does not have any strong allelic bias, whereas a high dEE value indicates that this gene is only strongly differentially expressed (with the sign the log fold-change has in that cell type) in a particular cell type. dCAE uses the same procedure as dEE except the table is populated with the log fold-changes derived from chromatin accessibility measurements. dEE and dCAE are sensitive to the inclusion or exclusion of cell types (by definition), so we excluded RPE when integrative analysis combining results from dEE and dCAE was performed (to match the cell types for which dCAE could be computed, Supp Fig. 25b). After restricting to genes defined as having significant ASE or significant ASCA, we defined genes with dEE >= 0.75 in a particular cell type as showing cell type-specific ASE and peaks with dCAE >= 0.75 in one cell type as showing cell type-specific ASCA. We used bedtools intersect to intersect the peaks with our list of human-chimpanzee single nucleotide differences and the 241-way placental mammal PhyloP scores (Quinlan & Hall, 2010; Sullivan et al., 2023). We also checked whether peaks that contained human-chimpanzee differences in sites with high PhyloP scores were in the list of HARs described in Girskis et al (Girskis et al., 2021).
Predicting regulatory activity with single-variant resolution
We used sequences in fasta format as input to the deep neural network model Sei (K. M. Chen et al., 2022). Sei requires a 4096 base pair input sequence, so we put the center of our region of interest at the center of the input window and expanded equally on either side to contain 4096 base pairs. The human sequence was retrieved from hg38 and the corresponding chimpanzee sequence was retrieved from panTro6. The effect size when comparing the probabilities of each sequence having a particular chromatin state was computed as the log of the human sequence probability divided by the chimpanzee sequence probability. Only annotations for which either the chimpanzee sequence or the human sequence had a probability value greater than or equal to 0.5 were kept for downstream analysis. All SNVs between human and chimpanzee in this input window were identified and ordered based on coordinates. For each SNV position, the human sequence was changed to the chimpanzee allele at that position to generate a new sequence that was input to Sei. The log fold-change for each chromatin annotation was computed for each input sequence as described above and used as a measure of the effect of this change on the sequence. Similarly, an indel can be introduced to modify the human sequence and input to Sei. With indels, the center of the regions of interest (promoter or HAR) were always at the center of the input window and the start or end of the sequence inputted to Sei could possibly lose or gain base pairs. However, we found that for the small indels shown here this had essentially no effect on the Sei output.
Processing of publicly available datasets
The data from Blake et al. and Pavlovic et al. were processed as previously described (Blake et al., 2020; Pavlovic et al., 2018; Starr et al., 2023). For the Pavlovic et al. data, log fold-changes were computed in DESeq2 with the scaled proportion of cardiomyocytes present in each sample (available in the supplemental materials of Pavlovic et al.), sex, and whether cardiomyocytes were treated with T3 as covariates (i.e. using the model ∼sex+scaled_proportion_cardiomyocytes+T3_Treatment+species) (Pavlovic et al., 2018). No covariates were included for Blake et al. as they had little impact on the data (Blake et al., 2020). The log fold-changes and FDR corrected p-values were directly downloaded from the supplemental materials of Kozlenkov et al. (Kozlenkov et al., 2020).
The processed data from Ma et al. (Ma et al., 2022) were downloaded from http://resources.sestanlab.org/PFC/. We pseudobulked the data by cell type by summing counts within each individual. We then separately input each pairwise comparison of two species (human to chimpanzee or human to rhesus macaque) into DESeq2 with no covariates to test for differential expression and compute log fold-changes.
The counts tables from Kanton et al.12 were downloaded from https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7552 and processed with SCANPY (Kanton et al., 2019, p. 29409532; Wolf et al., 2018). The data were filtered by removing cells with n_genes_by_counts > 2500 and >5% mitochondrial reads. We also removed cells with fewer than 200 unique genes and genes that had non-zero counts in fewer than 3 cells. After filtering, any chimpanzee cells not falling in the category (defined by Kanton et al. (Kanton et al., 2019)) “ventral forebrain progenitors and neurons” were eliminated and human cells not in the categories “ventral progenitors and neurons 1”, “ventral progenitors and neurons 2”, or “ventral progenitors and neurons 3” were similarly eliminated. We then merged the two counts tables, normalized/logarithmized the counts, computed PCA, used harmony to integrate cells from different species (human and chimpanzee), and found nearest neighbors with the harmonized principal components (Korsunsky et al., 2019). We then ran Leiden clustering with resolution = 0.5 to identify 7 subclusters (one of which appeared to be a technical artifact with very low counts that was removed) (Traag et al., 2019).
We identified cell types and lineages using canonical marker genes (MKI67 and HES5 for progenitors, NKX2-1 and LHX6 for the medial ganglionic eminence or MGE, MEIS2 and ZFHX3 for the lateral ganglionic eminence or LGE, and SCGN and NR2F1 for the caudal ganglionic eminence or CGE) (Su-Feher et al., 2022). We then used the implementation of PAGA in SCANPY to compute pseudotime using the first cell in the progenitor subcluster as the root (Wolf et al., 2019). We binned cells into five equal bins along pseudotime and compared the expression of cells with non-zero counts for GAD1 in each pseudotime bin. Within each bin, we used a Wilcoxon test to test for higher expression of GAD1 in chimpanzee cells compared to human cells. We repeated the pseudotime analysis, binning, and comparing of GAD1 gene expression for each subtrajectory (MGE, LGE, and CGE).
Description of Additional file 1
This file contains an extended evaluation of the success of the differentiations in generating the desired cell type and other cell types likely to be present in each sample.
Description of Additional file 2
This file contains the results of running the test for selection described by Starr et al on each cell type.
Description of Additional file 3
This file contains the peak-gene pairs that had concordant high dCAE and high dEE in each cell type.
The authors have nothing to declare.
Ethics approval and consent to participate
Not relevant to our study.
Consent for publication
Not relevant to our study.
Availability of data and materials
Raw and processed data generated by this study are publicly available through the Gene Expression Omnibus under accession GSE232949: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE232949. The snRNAseq data from Ma et al. are available here: http://resources.sestanlab.org/PFC/. The scRNAseq data from Kanton et al. are available here: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-7552. The bulk RNA-seq data from Blake et al. and Pavlovic et al. are available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112356 and here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110471 respectively. The log fold-changes and associated statistics were used directly from the supplemental materials of Kozlenkov et al. The human-chimp pairwise alignment used to identify SNVs and indels is available here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsPanTro6/. The human and chimpanzee genomes used are available here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ and here: https://hgdownload.soe.ucsc.edu/goldenPath/panTro6/bigZips/ respectively. The 241-way PhyloP scores were downloaded from here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/cactus241way/. The “QCed genomic constraint by 1kb regions” from the gnomAD consortium are available here: https://gnomad.broadinstitute.org/downloads#v3. All scripts for performing analyses and making figures in this manuscript is publicly available at https://github.com/banwang27/multi-celltypes.
The authors have no competing interests to declare.
Funding for this work came from NIH R01HG012285. A.L.S was supported by an NDSEG fellowship.
The authors wish to acknowledge the by the Columbia Stem Cell Core for their hard work in differentiating these cells. We also acknowledge members of the Fraser lab past and present for helpful discussion. BioRender was used to generate Figures 1a, 2a, and 3a.
- 1.Primate cell fusion disentangles gene regulatory divergence in neurodevelopmentNature 592:421–427https://doi.org/10.1038/s41586-021-03343-3
- 2.Docosahexaenoic acid: A positive modulator of Akt signaling in neuronal survivalProceedings of the National Academy of Sciences 102:10858–10863https://doi.org/10.1073/pnas.0502903102
- 3.The ENCODE Blacklist: Identification of Problematic Regions of the GenomeScientific Reports 9https://doi.org/10.1038/s41598-019-45839-z
- 4.HTSeq—A Python framework to work with high-throughput sequencing dataBioinformatics (Oxford, England) 31:166–169https://doi.org/10.1093/bioinformatics/btu638
- 5.Role of Fabp7, a downstream gene of Pax6, in the maintenance of neuroepithelial cells during early embryonic development of the rat cortexThe Journal of Neuroscience: The Official Journal of the Society for Neuroscience 25:9752–9761https://doi.org/10.1523/JNEUROSCI.2512-05.2005
- 6.The GABA Excitatory/Inhibitory Shift in Brain Maturation and Neurological DisordersThe Neuroscientist 18:467–486https://doi.org/10.1177/1073858412438697
- 7.An early cell shape transition drives evolutionary expansion of the human forebrainCell 184:2084–2102https://doi.org/10.1016/j.cell.2021.02.050
- 8.Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple TestingJournal of the Royal Statistical Society: Series B (Methodological) 57:289–300https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
- 9.A comparison of gene expression and DNA methylation patterns across tissues and speciesGenome Research 30:250–262https://doi.org/10.1101/gr.254904.119
- 10.Broad Institute. (n.d.). Picard.
- 11.Synaptic vesicle fusion: Today and beyondNature Structural & Molecular Biology 26:663–668https://doi.org/10.1038/s41594-019-0277-z
- 12.Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome positionNature Methods 10:1213–1218https://doi.org/10.1038/nmeth.2688
- 13.Chemically defined generation of human cardiomyocytesNature Methods 11:855–860https://doi.org/10.1038/nmeth.2999
- 14.Molecular genetics of the transcription factor GLIS3 identifies its dual function in beta cells and neuronsGenomics 110:98–111https://doi.org/10.1016/j.ygeno.2017.09.001
- 15.Cardiovascular actions of neurotrophinsPhysiological Reviews 89:279–308https://doi.org/10.1152/physrev.00007.2008
- 16.A vast resource of allelic expression data spanning human tissuesGenome Biology 21https://doi.org/10.1186/s13059-020-02122-z
- 17.JASPAR 2022: The 9th release of the open-access database of transcription factor binding profilesNucleic Acids Research 50:D165–D173https://doi.org/10.1093/nar/gkab1113
- 18.Generation of human muscle fibers and satellite-like cells from human pluripotent stem cells in vitroNature Protocols 11:1833–1850https://doi.org/10.1038/nprot.2016.110
- 19.A sequence-based global map of regulatory activity for deciphering human geneticsNature Genetics 54:940–949https://doi.org/10.1038/s41588-022-01102-2
- 20.A genome-wide mutational constraint map quantified from variation in 76,156 human genomes [Preprint]Genetics https://doi.org/10.1101/2022.03.20.485034
- 21.FABP7 Facilitates Uptake of Docosahexaenoic Acid in Glioblastoma Neural Stem-like CellsNutrients 13https://doi.org/10.3390/nu13082664
- 22.A cross-disorder dosage sensitivity map of the human genomeCell 185:3041–3055https://doi.org/10.1016/j.cell.2022.06.036
- 23.Tissue-Specific cis-Regulatory Divergence Implicates eloF in Inhibiting Interspecies Mating in DrosophilaCurrent Biology 28:3969–3975https://doi.org/10.1016/j.cub.2018.10.036
- 24.UpSetR: An R package for the visualization of intersecting sets and their propertiesBioinformatics 33:2938–2940https://doi.org/10.1093/bioinformatics/btx364
- 25.An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissuesNature Methods 14:959–962https://doi.org/10.1038/nmeth.4396
- 26.A radial glia gene marker, fatty acid binding protein 7 (FABP7), is involved in proliferation and invasion of glioblastoma cellsPloS One 7https://doi.org/10.1371/journal.pone.0052113
- 27.STAR: Ultrafast universal RNA-seq alignerBioinformatics (Oxford, England) 29:15–21https://doi.org/10.1093/bioinformatics/bts635
- 28.Astrocyte-expressed FABP7 regulates dendritic morphology and excitatory synaptic function of cortical neuronsGlia 64:48–62https://doi.org/10.1002/glia.22902
- 29.Chromatin-state discovery and genome annotation with ChromHMMNature Protocols 12:2478–2492https://doi.org/10.1038/nprot.2017.124
- 30.Different distributions of GAD65 and GAD67 mRNAS suggest that the two glutamate decarboxylases play distinctive functional rolesJournal of Neuroscience Research 34:689–706https://doi.org/10.1002/jnr.490340612
- 31.Structurally Conserved Primate LncRNAs Are Transiently Expressed during Human Cortical Differentiation and Influence Cell-Type-Specific GenesStem Cell Reports 12:245–257https://doi.org/10.1016/j.stemcr.2018.12.006
- 32.Genome-wide approaches to the study of adaptive gene expression evolution: Systematic studies of evolutionary adaptations involving gene expression will allow many fundamental questions in evolutionary biology to be addressedBioEssays 33:469–477https://doi.org/10.1002/bies.201000094
- 33.Gene expression drives local adaptation in humansGenome Research 23:1089–1096https://doi.org/10.1101/gr.152710.112
- 34.Improving Estimates of Compensatory cis–trans Regulatory DivergenceTrends in Genetics 35:3–5https://doi.org/10.1016/j.tig.2018.09.003
- 35.Epigenomic profiling of primate lymphoblastoid cell lines reveals the evolutionary patterns of epigenetic activities in gene regulatory architecturesNature Communications 12https://doi.org/10.1038/s41467-021-23397-1
- 36.Rewiring of human neurodevelopmental gene regulatory programs by human accelerated regionsNeuron 109:3239–3251https://doi.org/10.1016/j.neuron.2021.08.005
- 37.Human–chimpanzee fused cells reveal cis-regulatory divergence underlying skeletal evolutionNature Genetics 53:467–476https://doi.org/10.1038/s41588-021-00804-3
- 38.Genetic effects on gene expression across human tissuesNature 550:204–213https://doi.org/10.1038/nature24277
- 39.Compact and highly active next-generation libraries for CRISPR-mediated gene repression and activationELife 5https://doi.org/10.7554/eLife.19760
- 40.Cis-Regulatory changes in locomotor genes are associated with the evolution of burrowing behaviorCell Reports 38https://doi.org/10.1016/j.celrep.2022.110360
- 41.Neurotrophins: Roles in neuronal development and functionAnnual Review of Neuroscience 24:677–736https://doi.org/10.1146/annurev.neuro.24.1.677
- 42.Exploring the genesis and functions of Human Accelerated Regions sheds light on their role in human evolutionCurrent Opinion in Genetics & Development 29:15–21https://doi.org/10.1016/j.gde.2014.07.005
- 43.Neurotrophins and cell deathExperimental Cell Research 318:1221–1228https://doi.org/10.1016/j.yexcr.2012.03.006
- 44.TissueEnrich: Tissue-specific gene enrichment analysisBioinformatics (Oxford, England) 35:1966–1967https://doi.org/10.1093/bioinformatics/bty890
- 45.John St. John. (n.d.). SeqPrep.
- 46.Organoid single-cell genomic atlas uncovers human-specific features of brain developmentNature 574:418–422https://doi.org/10.1038/s41586-019-1654-9
- 47.Up-Regulation of Glis2 Involves in Neuronal Apoptosis After Intracerebral Hemorrhage in Adult RatsCellular and Molecular Neurobiology 35:345–354https://doi.org/10.1007/s10571-014-0130-1
- 48.Effective study design for comparative functional genomicsNature Reviews Genetics 21:385–386https://doi.org/10.1038/s41576-020-0242-z
- 49.Evolution at Two Levels in Humans and Chimpanzees: Their macromolecules are so alike that regulatory mutations may account for their biological differencesScience 188:107–116https://doi.org/10.1126/science.1090005
- 50.Fast, sensitive and accurate integration of single-cell data with HarmonyNature Methods 16:1289–1296https://doi.org/10.1038/s41592-019-0619-0
- 51.Generation of polyhormonal and multipotent pancreatic progenitor lineages from human pluripotent stem cellsMethods 101:56–64https://doi.org/10.1016/j.ymeth.2015.10.017
- 52.Evolution of regulatory signatures in primate cortical neurons at cell-type resolutionProceedings of the National Academy of Sciences 117:28422–28432https://doi.org/10.1073/pnas.2011884117
- 53.The UCSC genome browser and associated toolsBriefings in Bioinformatics 14:144–161https://doi.org/10.1093/bib/bbs038
- 54.Fast gapped-read alignment with Bowtie 2Nature Methods 9:357–359https://doi.org/10.1038/nmeth.1923
- 55.The Sequence Alignment/Map format and SAMtoolsBioinformatics (Oxford, England) 25:2078–2079https://doi.org/10.1093/bioinformatics/btp352
- 56.Cell-type-specific effects of genetic variation on chromatin accessibility during human neuronal differentiationNature Neuroscience 24:941–953https://doi.org/10.1038/s41593-021-00858-w
- 57.pyGenomeTracks: Reproducible plots for multivariate genomic datasetsBioinformatics (Oxford, England) 37:422–423https://doi.org/10.1093/bioinformatics/btaa692
- 58.Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2Genome Biology 15https://doi.org/10.1186/s13059-014-0550-8
- 59.Molecular and cellular evolution of the primate dorsolateral prefrontal cortexScience https://doi.org/10.1126/science.abo7257
- 60.Gene Regulation and SpeciationTrends in Genetics 33:68–80https://doi.org/10.1016/j.tig.2016.11.003
- 61.Differentiation of hepatocytes from pluripotent stem cellsCurrent Protocols in Stem Cell Biology 26:1–1https://doi.org/10.1002/9780470151808.sc01g04s26
- 62.Combinatorial analysis of developmental cues efficiently converts human pluripotent stem cells into multiple neuronal subtypesNature Biotechnology 33:89–96https://doi.org/10.1038/nbt.3049
- 63.Sodium channelopathies in neurodevelopmental disordersNature Reviews Neuroscience 22:152–166https://doi.org/10.1038/s41583-020-00418-4
- 64.Cell freezing protocol suitable for ATAC-Seq on motor neurons derived from human induced pluripotent stem cellsScientific Reports 6https://doi.org/10.1038/srep25474
- 65.Ascl1 and Gsh1/2 control inhibitory and excitatory cell fate in spinal sensory interneuronsNature Neuroscience 9:770–778https://doi.org/10.1038/nn1706
- 66.PPAR control of metabolism and cardiovascular functionsNature Reviews Cardiology 18:809–823https://doi.org/10.1038/s41569-021-00569-6
- 67.Mueller, Fabian. (n.d.). ChrAccR.
- 68.Emerging roles of the neurotrophin receptor TrkC in synapse organizationNeuroscience Research 116:10–17https://doi.org/10.1016/j.neures.2016.09.009
- 69.Epigenomic annotation of gene regulatory alterations during evolution of the primate brainNature Neuroscience 19:494–503https://doi.org/10.1038/nn.4229
- 70.A Comparative Assessment of Human and Chimpanzee iPSC-derived Cardiomyocytes with Primary Heart TissuesScientific Reports 8https://doi.org/10.1038/s41598-018-33478-9
- 71.An RNA gene expressed during cortical development evolved rapidly in humansNature 443:167–172https://doi.org/10.1038/nature05113
- 72.Emerging principles of regulatory evolutionProceedings of the National Academy of Sciences 104:8605–8612https://doi.org/10.1073/pnas.0700488104
- 73.BEDTools: A flexible suite of utilities for comparing genomic featuresBioinformatics (Oxford, England) 26:841–842https://doi.org/10.1093/bioinformatics/btq033
- 74.deepTools: A flexible platform for exploring deep-sequencing dataNucleic Acids Research 42:W187–191https://doi.org/10.1093/nar/gku365
- 75.Evolution of Gene Regulation in HumansAnnual Review of Genomics and Human Genetics 17:45–67https://doi.org/10.1146/annurev-genom-090314-045935
- 76.Comparative studies of gene expression and the evolution of gene regulationNature Reviews Genetics 13:505–516https://doi.org/10.1038/nrg3229
- 77.Clinical-grade stem cell-derived retinal pigment epithelium patch rescues retinal degeneration in rodents and pigsScience Translational Medicine 11https://doi.org/10.1126/scitranslmed.aat5580
- 78.Selection of endurance capabilities and the trade-off between pressure and volume in the evolution of the human heartProceedings of the National Academy of Sciences 116:19905–19910https://doi.org/10.1073/pnas.1906902116
- 79.Genetic studies of human-chimpanzee divergence using stem cell fusionsProceedings of the National Academy of Sciences of the United States of America 118https://doi.org/10.1073/pnas.2117557118
- 80.Accounting for cis-regulatory constraint prioritizes genes likely to affect species-specific traitsGenome Biology 24https://doi.org/10.1186/s13059-023-02846-8
- 81.Storey Lab. (n.d.). Qvalue.
- 82.Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profilesProceedings of the National Academy of Sciences 102:15545–15550https://doi.org/10.1073/pnas.0506580102
- 83.Single cell enhancer activity distinguishes GABAergic and cholinergic lineages in embryonic mouse basal gangliaProceedings of the National Academy of Sciences 119https://doi.org/10.1073/pnas.2108760119
- 84.Leveraging Base Pair Mammalian Constraint to Understand Genetic Variation and Human Disease [Preprint]Genomics https://doi.org/10.1101/2023.03.10.531987
- 85.From Louvain to Leiden: Guaranteeing well-connected communitiesScientific Reports 9https://doi.org/10.1038/s41598-019-41695-z
- 86.Chromatin accessibility dynamics in a model of human forebrain developmentScience (New York, N.Y.) 367https://doi.org/10.1126/science.aay1645
- 87.Transposable elements are the primary source of novelty in primate gene regulationGenome Research 27:1623–1633https://doi.org/10.1101/gr.218149.116
- 88.WASP: Allele-specific software for robust molecular quantitative trait locus discoveryNature Methods 12:1061–1063https://doi.org/10.1038/nmeth.3582
- 89.Developmental mechanisms underlying the evolution of human cortical circuitsNature Reviews Neuroscience 24:213–232https://doi.org/10.1038/s41583-023-00675-z
- 90.Global reference mapping of human transcription factor footprintsNature 583:729–736https://doi.org/10.1038/s41586-020-2528-x
- 91.Universal annotation of the human genome through integration of over a thousand epigenomic datasetsGenome Biology 23https://doi.org/10.1186/s13059-021-02572-z
- 92.Liver X receptors in lipid signalling and membrane homeostasisNature Reviews Endocrinology 14:452–463https://doi.org/10.1038/s41574-018-0037-x
- 93.Fabp7 Maps to a Quantitative Trait Locus for a Schizophrenia EndophenotypePLoS Biology 5https://doi.org/10.1371/journal.pbio.0050297
- 94.Evolutionary changes in cis and trans gene regulationNature 430:85–88https://doi.org/10.1038/nature02698
- 95.Cis-regulatory elements: Molecular mechanisms and evolutionary processes underlying divergenceNature Reviews Genetics 13:59–69https://doi.org/10.1038/nrg3095
- 96.SCANPY: Large-scale single-cell gene expression data analysisGenome Biology 19https://doi.org/10.1186/s13059-017-1382-0
- 97.PAGA: Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cellsGenome Biology 20https://doi.org/10.1186/s13059-019-1663-x
- 98.Generation of pure GABAergic neurons by transcription factor programmingNature Methods 14:621–628https://doi.org/10.1038/nmeth.4291
- 99.A taxonomy of transcriptomic cell types across the isocortex and hippocampal formationCell 184:3222–3241https://doi.org/10.1016/j.cell.2021.04.021
- 100.Behavior-dependent cis regulation reveals genes and pathways associated with bower building in cichlid fishesProceedings of the National Academy of Sciences 115https://doi.org/10.1073/pnas.1810140115
- 101.Computational analysis of tissue-specific combinatorial gene regulation: Predicting interaction between transcription factors in human tissuesNucleic Acids Research 34:4925–4936https://doi.org/10.1093/nar/gkl595
- 102.Allele-specific open chromatin in human iPSC neurons elucidates functional disease variantsScience 369:561–565https://doi.org/10.1126/science.aay3983
- 103.Model-based analysis of ChIP-Seq (MACS)Genome Biology 9https://doi.org/10.1186/gb-2008-9-9-r137
- 104.Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differencesBioinformatics 35:2084–2092https://doi.org/10.1093/bioinformatics/bty895