1. Evolutionary Biology
  2. Genetics and Genomics
Download icon

The cis-regulatory effects of modern human-specific variants

  1. Carly V Weiss
  2. Lana Harshman
  3. Fumitaka Inoue
  4. Hunter B Fraser
  5. Dmitri A Petrov  Is a corresponding author
  6. Nadav Ahituv  Is a corresponding author
  7. David Gokhman  Is a corresponding author
  1. Department of Biology, Stanford University, Stanford, United States
  2. Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, United States
  3. Institute for Human Genetics, University of California San Francisco, San Francisco, United States
Research Article
  • Cited 0
  • Views 2,396
  • Annotations
Cite this article as: eLife 2021;10:e63713 doi: 10.7554/eLife.63713
Voice your concerns about research culture and research communication: Have your say in our 7th annual survey.

Abstract

The Neanderthal and Denisovan genomes enabled the discovery of sequences that differ between modern and archaic humans, the majority of which are noncoding. However, our understanding of the regulatory consequences of these differences remains limited, in part due to the decay of regulatory marks in ancient samples. Here, we used a massively parallel reporter assay in embryonic stem cells, neural progenitor cells, and bone osteoblasts to investigate the regulatory effects of the 14,042 single-nucleotide modern human-specific variants. Overall, 1791 (13%) of sequences containing these variants showed active regulatory activity, and 407 (23%) of these drove differential expression between human groups. Differentially active sequences were associated with divergent transcription factor binding motifs, and with genes enriched for vocal tract and brain anatomy and function. This work provides insight into the regulatory function of variants that emerged along the modern human lineage and the recent evolution of human gene expression.

Introduction

The fossil record allows us to directly compare skeletons between modern humans and their closest extinct relatives, the Neanderthal and the Denisovan. From this we can make inferences not only about skeletal differences, but also about other systems, such as the brain. These approaches have uncovered a myriad of traits that distinguish modern from archaic humans. For example, our face is flat with smaller jaws, our development is slower, our pelvises are narrower, our limbs tend to be slenderer, and our brain differs in its substructure proportions (Neubauer et al., 2018; Gunz et al., 2019; Aiello and Dean, 2002) (especially the cerebellum; Kochiyama et al., 2018). Despite our considerable base of knowledge of how modern humans differ from archaic humans at the phenotypic level, we know very little about the genetic changes that have given rise to these phenotypic differences.

The Neanderthal and the Denisovan genomes provide a unique insight into the genetic underpinnings of recent human phenotypic evolution. The vast majority of genetic changes that separate modern and archaic humans are found outside protein-coding regions, and some of these likely affect gene expression (Yan and McCoy, 2020). Such regulatory changes may have a sizeable impact on human evolution, as alterations in gene regulation are thought to underlie most of the phenotypic differences between closely related groups (Britten and Davidson, 1971; King and Wilson, 1975; Enard et al., 2014; Fraser, 2013). Indeed, there is mounting evidence that many of the noncoding variants that emerged in modern humans have altered gene expression in cis, shaped phenotypes, and have been under selection (Yan and McCoy, 2020; McCoy et al., 2017; Petr et al., 2019; Gokhman et al., 2020; Colbran, 2019; Gokhman et al., 2019; Dannemann and Racimo, 2018; Weyer and Pääbo, 2016; Vespasiani et al., 2020; Grogan and Perry, 2020). Fixed variants, in particular, could potentially underlie phenotypes specific to modern humans, and some of these variants might have been driven to fixation by positive selection.

Unfortunately, our ability to infer the regulatory function of noncoding variants is currently limited (Chatterjee and Ahituv, 2017). In archaic humans, incomplete information on gene regulation is further exacerbated by the lack of RNA molecules and epigenetic marks in these degraded samples (Yan and McCoy, 2020). We have previously used patterns of cytosine degradation in ancient samples to reconstruct whole-genome archaic DNA methylation maps (Gokhman et al., 2020; Gokhman et al., 2014; Gokhman et al., 2016). However, despite various approaches to extract regulatory information from ancient genomes (Yan and McCoy, 2020; Colbran, 2019; Gokhman et al., 2016; Barker et al., 2020; Batyrev et al., 2019; Pedersen et al., 2014; Silvert et al., 2019; Moriano and Boeckx, 2020), our understanding of gene regulation in archaic humans remains minimal, with most archaic regulatory information being currently inaccessible (Yan and McCoy, 2020). Additionally, whereas expression quantitative trait locus (eQTL) mapping can be used to identify variants that drive differential expression between individuals, it can only be applied to loci that are variable within the present-day human population. Therefore, fixed noncoding variants are of particular interest in the study of human evolution, but are also particularly difficult to characterize.

Massively parallel reporter assays (MPRAs) provide the ability to interrogate the regulatory effects of thousands of variants en masse (Inoue and Ahituv, 2015). By cloning a candidate regulatory sequence downstream to a short transcribable sequence-based barcode, thousands of sequences and variants can be tested for regulatory activity in parallel. Thus, MPRA is an effective high-throughput tool to identify variants underlying divergent regulation, especially in organisms where experimental options are limited (Tewhey et al., 2016; Klein et al., 2018; Ryu et al., 2018; Uebbing et al., 2021). Here, we conducted a lentivirus-based MPRA (lentiMPRA; Gordon et al., 2020) on the 14,042 fixed or nearly fixed single-nucleotide variants that emerged along the modern human lineage. We generated a library of both the derived (modern human) and ancestral (archaic human and ape) sequences of each locus and expressed them in three human cell types: embryonic stem cells (ESCs), neural progenitor cells (NPCs), and primary fetal osteoblasts. By comparing the transcriptional activities of each pair of sequences, we generated a comprehensive catalog providing a map of sequences capable of promoting expression and those that alter gene expression. We found that 1791 (13%) of the sequence pairs promote expression and that 407 (23%) of these active sequences drive differential expression between the modern and archaic alleles. These differentially active sequences are associated with differential transcription factor (TF) binding affinity and are enriched for genes that affect the vocal tract and brain. This work provides a genome-wide catalog of the cis-regulatory effects of genetic variants unique to modern humans, allowing us to systematically interrogate recent human gene regulatory evolution.

Results

LentiMPRA design and validation

To define a set of variants that likely emerged and reached fixation or near fixation along the modern human lineage, we took all the single-nucleotide variants where modern humans differ from archaic humans and great apes (based on three Neanderthal genomes [Prüfer et al., 2014; Prüfer et al., 2017; Mafessoni et al., 2020], one Denisovan genome [Meyer et al., 2012], and 114 chimpanzee, bonobo, and gorilla genomes [de Manuel et al., 2016]). We excluded any polymorphic sites within modern humans (in either the 1000 Genomes Project [Auton et al., 2015] or in dbSNP [Sherry et al., 2001]), or within archaic humans and great apes (Prüfer et al., 2014; Prüfer et al., 2017; Mafessoni et al., 2020; Meyer et al., 2012; de Manuel et al., 2016) (see 'Materials and methods'). The resulting set of 14,042 variants comprises those changes that likely emerged and reached fixation or near fixation along the modern human lineage (Supplementary file 1a-c). The vast majority of these variants are intergenic (Figure 1—figure supplement 1a). By definition, this list does not include variants that introgressed from archaic humans into modern humans and spread to detectable frequencies. We refer to the derived version of each sequence as the modern human sequence and the ancestral version as the archaic human sequence.

We analyzed variants that likely emerged and reached fixation or near fixation along the modern human lineage (yellow) and that were not polymorphic in any other ape or archaic genome (green) (top). The modern and archaic human variants and their surrounding 200 bp were synthesized, cloned into barcoded expression constructs, and infected in triplicates into three human cell lines using a chromosomally integrating vector, following the lentiMPRA protocol (Gordon et al., 2020) (see 'Materials and methods'). We compared the activity (RNA/DNA) of the modern and archaic human constructs to identify variants promoting differential expression using MPRAnalyze (Ashuach et al., 2019) (bottom).

We synthesized a library composed of 200 bp sequences (due to oligonucleotide synthesis length limitations) per each of the 14,042 variants (one sequence for the modern human allele and one for the archaic human allele, Figure 1—figure supplement 1a–c). Each sequence contained at its center either the modern or archaic human variant. Out of 14,042 sequence pairs, 13,680 (90%) had a single variant separating the human groups. For the 1362 sequence pairs containing additional variants within the 200 bp window, we used either the modern-only or archaic-only variants throughout the sequence. We amplified this library of sequences, each along with a minimal promoter (mP) and barcode. We then inserted these constructs into the lentiMPRA vector, so that the barcode, which is the readout of activity, is located within the 5’UTR of the reporter gene and is transcribed if the assayed sequence is an active regulatory element (Gordon et al., 2020). We associated each sequence with multiple barcodes to achieve a high number of independent replicates of expression per sequence, thereby reducing potential site-of-integration effects. 97% of sequences had at least 10 barcodes associated with them, with a median of 96 barcodes per sequence (Figure 1—figure supplement 2a). Furthermore, we used a chromosomally integrating construct rather than an episomal construct due to the improved technical reproducibility and correlation of results from chromosomally integrating constructs with functional genomic signals like TF ChIP-seq and histone acetylation marks (Inoue et al., 2017). To further reduce lentivirus site-of-integration effects, this vector contained antirepressors on either side and was integrated in multiple independent sites, with each sequence marked by multiple barcodes (see Discussion for additional lentiMPRA limitations). Importantly, despite the caveat of interrogating sequences outside of their endogenous context, MPRAs were shown to generally replicate the endogenous activity of sequences (Inoue et al., 2017; Klein et al., 2020; Kircher et al., 2019).

The brain and skeleton have been the focus of evolutionary studies due to their extensive phenotypic divergence among human lineages (Aiello and Dean, 2002). Therefore, we chose human cells related to each of these central systems: NPCs and primary fetal osteoblasts. In addition, we used ESCs (line H1, from which the NPCs were derived) to gain insight into early stages of development. Finally, the abundance of previously published regulatory maps for these three cell types (Gokhman et al., 2014; Ernst and Kellis, 2012; Kundaje et al., 2015) also enables the investigation of the dynamics of evolutionary divergence at different regulatory levels. While these cell types represent diverse systems, further studies are needed in order to characterize the activity of these sequences in other cell types.

We used the library of 14,042 pairs of archaic and modern human sequences, together with positive and negative control sequences, to infect each cell type. As positive controls for ESCs and NPCs, we added a set of 199 sequences with known regulatory capacity from previous MPRAs (Supplementary file 1d). To our knowledge, there have not been any MPRAs conducted in osteoblasts, so we searched the literature for putative regulatory regions in osteoblasts and other bone cell types and used these as putative positive controls (Supplementary file 1d, see 'Materials and methods'). As negative controls, in all cell types, we randomly chose 100 sequences from the library and scrambled the order of their bases, creating a set of GC content matching sequences that had not been previously established to drive expression (Supplementary file 1e).

We performed three replicates of library infection in each cell type and quantified barcode abundance for each sequence in RNA and DNA (Figure 1). To assess the reproducibility of our lentiMPRA results, we calculated the RNA/DNA ratio (a measure of expression normalized to the number of integrated DNA molecules) for each sequence and compared it across the three replicates per cell type. We saw a strong correlation of RNA/DNA ratios between replicates for all cell types (Pearson’s r = 0.76–0.96, p<10−100, Figure 1—figure supplement 2b), with the lower correlation scores being in ESC, likely due to our use of lower multiplicity of infection (MOI) in these cells due to their increased sensitivity to lentivirus infection. High barcode and read coverage in MPRA generally provides increased power to detect differences in allelic expression (Gordon et al., 2020; Kircher et al., 2019). Thus, to determine how variability depended on our barcode counts, we downsampled the number of barcodes per sequence and calculated the RNA/DNA ratio at each step for each of the three replicates. In agreement with previous studies (Inoue et al., 2017), we found that the number of barcodes used in this study is well within the plateau, suggesting that the number of barcodes is not a limiting factor in our experiment (Figure 1—figure supplement 2c). Finally, we assessed the distribution of RNA/DNA ratios across our scrambled sequences and positive controls. The mean RNA/DNA ratio of the scrambled sequences was lower than that of the positive control sequences in ESCs and NPCs (p=2.7×10−8 for ESCs and p=1.8×10−6 for NPCs, t-test, see 'Materials and methods', Figure 1—figure supplement 2d), but not in osteoblasts (p=0.25). This is unlikely due to a problem with the osteoblasts, as the osteoblast-related controls show similar expression in all three cell types. Moreover, ESC and NPC positive controls are active in osteoblasts (p=1.1×10−3). The correlation between replicates was also similar between osteoblasts and the other two cell types (Figure 1—figure supplement 2b). Thus, the lack of activity of the osteoblast putative positive controls is likely because, in contrast to the ESC and NPC confirmed positive controls, the osteoblast putative positive controls were not previously tested in an MPRA, and some of these putative enhancers were identified in mouse and were not validated in human. Overall, these results suggest that the lentiMPRA was technically reproducible and adequately powered to detect expression.

Figure 1 with 2 supplements see all
Using lentivirus-based MPRA (lentiMPRA) to identify variants driving differential expression in modern humans.

Characterization of active regulatory sequences

We first examined which of the assayed sequences are able to drive expression. To do so, we utilized MPRAnalyze (Ashuach et al., 2019), which uses a model for each of the RNA and DNA counts, estimates transcription rate, and then identifies sequences driving significant expression. We also added an additional stringency filter whereby a sequence is only considered expressed if it had an RNA/DNA ratio significantly higher than that of the scrambled sequences (false discovery rate [FDR] <0.05). We found that in ESCs, 8% (1183) of sequence pairs drove expression in at least one of the alleles, 6% (814) in osteoblasts, and 4% (602) in NPCs (FDR <0.05, Supplementary file 1a-c, Figure 1—figure supplement 2d, see 'Materials and methods'). Hereinafter, we refer to these sequences as active sequences. Overall, 13% (1791) of archaic and modern human sequence pairs were active in at least one cell type, 4% (586) in at least two cell types, and 2% (222) in all three cell types (overlap of 75-fold higher than expected, p<10−100, Super Exact test; Wang et al., 2015, Figure 2a).

Figure 2 with 1 supplement see all
Identification of modern human sequences promoting expression in lentivirus-based MPRA (lentiMPRA).

(a) Overlap between cell types of active sequences. Super Exact test p-value is shown for the overlap of the three groups. (b-d) Enrichment levels of active and repressive histone modification marks within active sequences. Enrichment is computed compared to inactive sequences. The enrichment of H3K27me3 in embryonic stem cells (ESCs) possibly reflects the presence of this mark in bivalent genes, which become active in later stages of development (Blanco et al., 2020). For confidence intervals, see Supplementary file 2. (e) Enrichment of differentially active sequences in various chromatin-based genomic annotations. Missing circles reflect no differentially active sequences in that category. Stars mark significant enrichments (false discovery rate [FDR] <0.05). (f) Violin plots of DNA methylation levels for active (green) vs. inactive (red) sequences in osteoblasts. Methylation levels per sequence were computed as the mean methylation across all modern and archaic human bone methylation samples. The circle marks mean methylation across all sequences in each group. t-test p-value is shown.

Some of these sequences may show activity in the lentiMPRA experiment but not in their endogenous genomic context. To test whether activity in our lentiMPRA reflects true biological function, we investigated whether our active sequences had expected regulatory characteristics in the modern human genome. Active regulatory sequences in the genome tend to bear active chromatin marks. Therefore, we examined whether active sequences in lentiMPRA tend to be enriched for markers of active chromatin in their endogenous context. We first tested overlap with five histone modification marks and one histone variant associated with active chromatin (H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K9ac, and H2A.Z), as well as with two histone modification marks associated with repressed chromatin (H3K9me3 and H3K27me3, see 'Materials and methods') (Kundaje et al., 2015). We found that on average, active sequences were 1.6- to 2.7-fold more likely than inactive sequences to have active chromatin marks, depending on cell type. Also, these sequences tended to show relatively fewer repressive marks compared to active marks (Figure 2b–d, Supplementary file 2). These trends get stronger when looking at more highly active sequences. For example, while only 18% of inactive sequences in ESCs overlap H3K4me2 peaks, 70% of active sequences with an RNA/DNA ratio ≥3 in ESCs overlap H3K4me2 peaks (FDR = 4.4×10−16, Fisher’s exact test, Figure 2b–d, Supplementary file 2). To further test the functional characteristics of active sequences, we analyzed chromHMM annotation (Ernst and Kellis, 2012; Kundaje et al., 2015), which uses chromatin signatures to subdivide the genome into functional regions. Of the 14,042 sequences, 2,163 (15%) overlapped promoter or enhancer chromHMM annotations in at least one of the three cell types. Additional 2658 sequences (19%) overlapped such marks in other cell types not included in this study. Compared to inactive sequences, we found that active sequences are enriched for promoter and enhancer marks (FDR <0.05 in each of the cell types for overlap with active TSS and enhancers, Figure 2e, Figure 1—figure supplement 1, Supplementary file 1f, Supplementary file 2). We also found that compared to inactive sequences, active sequences are 6–32% closer to GTEx (GTEx Consortium, 2015) eQTLs, depending on cell type (FDR <0.05, t-test). Active sequences are also 1.2–1.3× closer to transcription start sites (TSS), with 32–39% of them located within 10 kb of a TSS, depending on cell type (FDR <0.05, t-test, Supplementary file 2).

Active genomic regions often show reduced DNA methylation levels compared to inactive regions (Jones, 2012). To further test if the activity we detected in the lentiMPRA reflects true biological function, we tested whether the active sequences in the lentiMPRA tend to be hypomethylated in their endogenous genomic context. To do so, we used our previously published modern and archaic human DNA methylation maps (Gokhman et al., 2020; Gokhman et al., 2014; Gokhman et al., 2016). Because the DNA methylation maps originate from skeletal samples, we compared them to the osteoblast lentiMPRA data. We found that active sequences are significantly hypomethylated compared to inactive sequences (p=5.5×10−13, t-test, Figure 2f) and that their activity level (RNA/DNA ratio) is negatively correlated with methylation levels (6.0 × 10−9, Pearson’s r = −0.24).

Finally, compared to inactive sequences, active sequences show slightly higher sequence conservation in primates, indicating a potential functional role (PhyloP, −0.05 on average for inactive, −0.04 for active, FDR = 1.1×10−3, t-test) with more highly active sequences showing higher conservation levels (e.g., 0.24 for active sequences with RNA/DNA ratio ≥4, Figure 2—figure supplement 1a, Supplementary file 2). In summary, we found that sequences that are capable of driving expression tend to overlap active chromatin marks, are depleted of repressive chromatin marks, closer to TSS and eQTLs, and have higher sequence conservation, giving us confidence that the MPRA provides us with biologically meaningful results.

Differentially active sequences between modern and archaic humans

We next set out to identify modern and archaic human sequences driving differential expression. We used MPRAnalyze (Ashuach et al., 2019) to compare expression driven by the modern and archaic sequences. Out of the active sequence pairs in each cell type, 110 (9%) in ESCs drive significantly differential expression between modern and archaic humans, 243 (30%) in osteoblasts, and 153 (25%) in NPCs (FDR ≤ 0.05, see Materials and methods, Figure 3a–c, Figure 1—figure supplement 2, see Discussion for cell-type differences). We refer to these sequence pairs hereinafter as differentially active sequences. Overall, we see significant overlap between cell types in differentially active sequences: 407 sequences (23% of active sequences) were differentially active in at least one cell type, 89 (5%) in at least two cell types, and 10 (0.6%) in all three cell types (eightfold higher than expected compared to active sequences, p=5×10−7, Super Exact test (Wang et al., 2015), Figure 3d).

Figure 3 with 2 supplements see all
Differential activity of derived modern human sequences.

(a–c) Distributions of expression fold-changes (RNA/DNA) of active (light) and differentially active (dark) sequences in each cell type. (d) Overlap of differentially active sequences between cell types. Super Exact test p-value is presented for the overlap of the three groups compared to active sequences. In the 10 sequences that were differentially active across all three cells types, the direction of fold-change was identical across all cell types (p=1.9×10−3, binomial test). (e) Violin plots of predicted transcription factor (TF) binding score difference between modern and archaic sequences. Positive scores represent increased binding in the modern sequence. Points show mean.

As expected from such closely related organisms, and similar to other MPRAs that compared nucleotide variants (see Discussion), including one that compared human and chimp sequences (Ryu et al., 2018), most sequences drove modest magnitudes of expression difference; of the 407 differentially active sequences, the median fold-change was 1.2×, and only five sequences had a fold-change greater than 2× (Figure 3a–c). We refer to differentially active sequences where modern human expression is higher/lower than archaic human expression as up/downregulating sequences, respectively. In ESCs and NPCs, sequences were equally likely to be up- or downregulating (51% and 52% of differentially active sequences were downregulating, p=0.92 and 0.63, respectively, binomial test), while in osteoblasts downregulation was observed slightly more often (59%, p=6.9×10−3). Finally, we examined the 89 sequence pairs that were differentially active in two cell types and the 10 sequence pairs that were differentially active in all three cell types, and tested how often the direction of differential activity in one cell type matched the direction in the other cell types. We found a strong agreement in the direction of differential activity across cell types (87 out of 89 of sequence pairs that are differentially active in two cell types, p=6.5×10−24, and 10 out of 10 for three cell types, p=9.5×10−7, binomial test). We also observed a high correlation between the magnitudes of differential activity (Pearson’s r = 0.82, p=1.6×10−27). That differentially active sequences from one cell type are predictive of differential activity in other cell types, even of cell types as disparate as those used here, suggests that these sequences are likely to be differentially active in other cell types not assayed in this lentiMPRA.

To further test the replicability of these results, we examined the relationship between pairs of overlapping differentially active sequences (i.e., variants that are <200 bp apart and thus appear in more than one sequence, three overlapping pairs in ESCs, five in osteoblasts, and two in NPCs). We found that the direction of expression change is identical in all pairs of overlapping sequences (p=2.0×10−3, binomial test), and that the magnitude of their expression change is highly correlated (Pearson’s r = 0.95, 2.4 × 10−5, Figure 2—figure supplement 1b). To validate these results with an orthogonal method, we tested four differentially active sequences from each cell type in a luciferase reporter assay and found that the direction and magnitude of differential expression tended to replicate the lentiMPRA results (9 out of 12 sequences, Pearson’s r = 0.67, p=3.7×10−4, Figure 2—figure supplement 1c, Supplementary file 1g). These results suggest that the lentiMPRA was both technically reproducible across cell types and assays and also indicative of true biological signal.

Finally, we examined the endogenous genomic locations of differentially active sequences, focusing on promoters and enhancers. Between 33% and 45% of these sequences are within 10 kb of a TSS (depending on cell type, Supplementary file 1h). Analyzing chromHMM (Ernst and Kellis, 2012; Kundaje et al., 2015), we found that between 20% and 25% of the differentially active sequences are within putative promoter or enhancer regions (Supplementary file 1f). To test if differentially active sequences are enriched within regulatory elements, we compared the proportion overlapping chromHMM promoters and enhancers in differentially active sequences to that proportion in the other active sequences. We found that differentially active sequences are over-represented within putative enhancer regions in NPCs (2.2-fold, FDR = 0.03, Fisher’s exact test, Figure 1—figure supplement 1c,d). These results support a model of rapid enhancer evolution in modern humans, as previously reported for other mammals (Villar et al., 2015) (see 'Discussion').

Molecular mechanisms underlying differential activity

Next, we sought to understand what regulatory mechanisms might be associated with differential activity. Changes in expression are often linked to changes in regulatory marks. For example, increased DNA methylation tends to be associated with reduced activity (Jones, 2012). We therefore tested methylation levels in each pair of sequences and examined if the human group with the lower sequence activity tends to show higher methylation levels. Here too, because the DNA methylation maps originate from bone samples (Gokhman et al., 2020; Gokhman et al., 2014; Gokhman et al., 2016), we compared them to the osteoblast lentiMPRA data. We found that upregulating sequences indeed have a slight but significant tendency to be hypomethylated in modern compared to archaic humans, and that downregulating sequences tend to be hypermethylated in modern compared to archaic humans (on average −2% methylation in upregulating sequences and +1% methylation in downregulating sequences in the modern compared to the archaic genomes, p=0.028, paired t-test; Figure 3—figure supplement 1a). This trend is slightly more pronounced when looking at the most differentially regulating sequences. For example, the top 10 most downregulating sequences show on average +8% methylation in modern compared to archaic humans, whereas the top 10 most upregulating sequences show −7% methylation in modern compared to archaic humans. We also examined promoter regions (5 kb upstream to 1 kb downstream of a TSS), where the association between methylation and reduced activity is known to be stronger compared to the rest of the genome (Jones, 2012). Indeed, we found that upregulating promoter sequences have +5% methylation on average in the modern compared to the archaic genomes, while downregulating promoter sequences have −8% methylation (p=0.034, paired t-test; Figure 3—figure supplement 1b). This trend is more pronounced in CpG-poor promoters, where the link between methylation and expression is known to be stronger (Lister et al., 2009; Stadler et al., 2011; Schlesinger et al., 2013) (−15% methylation in upregulating sequences and +15% methylation in downregulating promoter sequences in modern compared to archaic humans; p=6×10−3, paired t-test; Figure 3—figure supplement 1c).

We conjectured that some of the differential activity in these loci might have been driven by alterations in TF binding. To investigate this, we compared predicted TF binding affinity to the modern and archaic sequences using FIMO (Grant et al., 2011). We found that (1) compared to other active sequences, the difference in predicted binding between the modern and archaic human alleles tends to be larger for differentially active sequences (combined across cell types: 4.3×, p=0.02, t-test, Figure 3—figure supplement 1d); (2) the directionality of differential expression tends to match the directionality of differential binding, that is, upregulating sequences tend to have stronger predicted binding for the modern human sequence, whereas downregulating sequences tend to have stronger predicted binding for the archaic sequence (p=3.7×10−6 for ESCs, p=1.7×10−6 for osteoblasts, and p=1.3×10−5 for NPCs, binomial test, Figure 3e, see Materials and methods); and (3) the magnitude of expression difference is correlated with the magnitude of predicted binding difference (Pearson’s r = 0.43 and p=1.2×10−3 for ESCs, Pearson’s r = 0.23 and p=0.02 for osteoblasts, and Pearson’s r = 0.35 and p=2.4×10−3 for NPCs, Figure 3—figure supplement 2a–c and Supplementary file 3). These results support the notion that alterations in TF binding played a role in shaping some of the expression differences between modern and archaic humans.

To identify the TFs that primarily drove these observations, we investigated which motif changes are most predictive of expression changes. For each TF and the sequences it is predicted to differentially bind, we examined the correlation between binding and expression fold-change (either positive or negative). We found that changes to the motifs of 14 TFs were predictive of expression changes (Figure 3—figure supplement 2d, Supplementary file 3b). All of these TFs had a positive correlation between changes in their predicted binding affinity and changes in expression of their bound sequences, reflective of their known capability to promote transcription (Suske, 2017; Frey-Jakobs et al., 2018; Bruderer et al., 2013; Frietze et al., 2010; Song et al., 2003; Zhu et al., 2018; Ji et al., 2020; Morita et al., 2016; Syafruddin et al., 2020). Of note, the use of an mP with basal activity in the MPRA design means that transcriptional repression is less likely to be detected, and therefore, further investigation is required in order to identify potential repressive activity in these sequences (see Discussion).

Next, we sought to explore if some motif changes are particularly over-represented within differentially active sequences, suggestive of a more central role in shaping modern human regulatory evolution. To control for sequence composition biases, we used active sequences as a background to search for motif enrichment within differentially active sequences. We found that ZNF281, an inhibitor of neuronal differentiation (Pieraccioli et al., 2018), is significantly enriched: out of 153 differentially active sequences in NPCs, 14 are predicted to be bound by ZNF281 (4.6-fold, FDR = 0.04, Supplementary file 3c). Notably, ZNF281 is also one of the TFs whose predicted differential binding is most tightly linked with differential expression (Figure 3—figure supplement 2d,e). Overall, these data support a model whereby variants in ZNF281 motifs might have modulated ZNF281 binding in NPCs, thereby contributing to neural expression differences between modern and archaic humans.

Potential phenotypic consequences of differential expression

In an attempt to assess the functional effects of the differential transcriptional activity we detected, we first sought to link each sequence to the gene(s) it might regulate in its endogenous genomic location. While most regulatory sequences are known to affect their closest gene (Gasperini et al., 2019; Fulco et al., 2019), some exert their function through interactions with more distal genes, often reflected in chromatin conformation capture assays, such as Hi-C interactions (Gasperini et al., 2020), or eQTL associations (Gasperini et al., 2020; Jung et al., 2019). To predict the genes linked to each sequence, we combined data from four sources: (1) proximity to TSS; (2) proximity to eQTLs (GTEx Consortium, 2015); (3) proximity to putative enhancers (Fishilevich et al., 2017); and (4) spatial interaction with promoters using Hi-C data (Jung et al., 2019) (see 'Materials and methods'). Using these data, we generated for each cell type a list of genes potentially regulated by each sequence. Overall, 1341 (75%) out of the 1791 active sequences were linked to at least one putative target gene (Supplementary file 1h).

To study the potential functional effects of differentially active sequences, we analyzed functions associated with their linked genes. To control for confounders such as cell type-specific regulation, gene length, and GC content, we compared differentially active sequences to other active sequences (instead of the genomic background), which minimizes inherent biases in the active sequences. First, we tested Gene Ontology terms and found an enrichment of the following terms within downregulating sequences: vesicle-mediated transport (6.6-fold, FDR = 1.9×10−3, in osteoblasts), regulation of apoptotic process (6.0-fold, FDR = 1.9×10−3, in ESCs), protein ubiquitination (4.7-fold, FDR = 1.9×10−3, in ESCs), multicellular organism development (3.3-fold, FDR = 0.01, in ESCs), and protein transport (3.3-fold, FDR = 0.02, in osteoblasts, Figure 3—figure supplement 2f, Supplementary file 4a). No enriched terms were found within upregulating sequences. To obtain a more detailed picture of phenotypic function, we ran Gene ORGANizer, a tool that uses monogenic disorders to link genes to the organs they affect (Gokhman et al., 2017). We analyzed the genes linked to differentially active sequences and found that for genes linked to sequences driving upregulation, the most enriched body parts belong to the vocal tract, that is, the vocal cords (5.0-fold, FDR = 1.3×10−3), voice box (larynx, 3.8-fold, FDR = 4.8×10−3), and pharynx (3.3-fold, FDR = 9.5×10−3, all within ESCs, Figure 4a). Interestingly, we have previously reported that the most extensive DNA methylation changes in modern compared to archaic humans arose in genes affecting the vocal cords and voice box (Gokhman et al., 2020). Conversely, within sequences driving downregulation, the enriched body part is the cerebellum (3.0-fold, FDR = 9.2×10−3, in NPCs, Figure 4a, Supplementary file 4b). This is in line with previous reports of cerebellar anatomy differences between modern humans and Neanderthals (Neubauer et al., 2018; Gunz et al., 2019; Aiello and Dean, 2002), including results suggesting that the biggest differences in brain anatomy are in the cerebellum (Kochiyama et al., 2018). These data also provide leads into the functional divergence of organs, like the voice box, that are not preserved in the fossil record.

Differentially active sequences are linked to genes affecting the vocal tract and brain.

(a) Gene ORGANizer enrichment map showing body parts that are significantly over-represented within genes linked to differentially active sequences (false discovery rate [FDR] <0.05). Organs are colored according to the enrichment scale. See Supplementary file 4 for cell types. (b) Human Phenotype Ontology (HPO) phenotypes significantly enriched (FDR <0.05) within differentially active sequences. Fold enrichment is shown in parentheses. See Supplementary file 4 for cell types. (c) CpG islands and read density of active histone modification marks (Kundaje et al., 2015) around the differentially active sequence in SATB2 (GRCh37 genome version). (d) Violin plots of archaic vs. modern activity of the differentially active sequence in SATB2.

Next, we delved into individual phenotypes associated with the differentially active sequences. To this end, we used the Human Phenotype Ontology (HPO) database (Köhler et al., 2014), a curated database of genes and the phenotypes they underlie in monogenic disorders. HPO covers a broad range of phenotypes related to anatomy, physiology, and behavior. We found that enriched phenotypes were involved in speech, heart morphology testicular descent, and kidney function (FDR <0.05, Figure 4b, Supplementary file 4b). These results reveal body parts and phenotypes that were particularly associated with gene expression changes between modern and archaic humans, and could be new candidates for phenotypes under selection.

Downregulation of SATB2 potentially underlies brain and skeletal differences

This catalog of cis-regulatory changes allows us to explore specific sequence changes that potentially underlie divergent phenotypes observed from fossils. To use the most robust data from lentiMPRA, we examined the 10 sequences that are differentially active across all three cell types and looked at their linked genes. To investigate the phenotypes that are potentially linked to these genes, we looked for those genes whose phenotypes can be compared to the fossil record (i.e., skeletal phenotypes). The only gene that fits these criteria was SATB2, a regulator of brain and skeletal phenotypes (Zarate and Fish, 2017). First, we analyzed its linked variant (C to T transition), which is at a position that is relatively conserved in vertebrates (GRCh38: 199,469,203 on chromosome 2, PhyloP score = 0.996). This position is found within a CpG island between two alternative TSS of SATB2 (Figure 4c). It is also found in the first intron of SATB2-AS1, an antisense lncRNA which upregulates SATB2 protein levels (Liu et al., 2017). To determine if this position lies within a regulatory region, we investigated it for chromatin marks in modern humans. We found that it overlaps a DNase I-hypersensitive site (ENCODE Project Consortium, 2012) and shows many peaks of active histone modification marks in all three cell types (Figure 4c, Supplementary file 1f). Indeed, this sequence drives high expression in all three cell types (4th, 8th, and 19th percentile among active sequences, in ESCs, osteoblasts, and NPCs, respectively, FDR <10−5 across all). Although this sequence shows hallmarks of activity in modern humans, compared to the archaic sequence, the modern human sequence is downregulating in all three cell types (−9% in ESCs, FDR = 6.8×10−4, −27% in osteoblasts, FDR = 2.2×10−42, and −12% in NPCs, FDR = 1.1×10−7, Figure 4d). These results suggest that the ancestral version of this sequence possibly promoted even higher expression in archaic humans.

SATB2 encodes a TF expressed in developing bone and brain. Its activity promotes bone formation, jaw patterning, cortical upper layer neuron specification, and tumorigenesis (Zarate and Fish, 2017). Genome-wide association studies show that common variants near and within SATB2 are mainly associated with brain and bone phenotypes, such as reaction time, anxiety, mathematical abilities, schizophrenia, autism, bone density, and facial morphology (Buniello et al., 2019; Claes et al., 2014). Heterozygous loss-of-function (LOF) mutations in SATB2 result in the SATB2-associated syndrome, which primarily affects neurological and craniofacial phenotypes. This includes speech delay, behavioral anomalies (e.g., jovial personality, aggressive outbursts, and hyperactivity), autistic tendencies, small jaws, dental abnormalities, and morphological changes to the palate (Zarate et al., 1993). Additionally, reduced functional levels of SATB2 due to heterozygous LOF have been shown to be the cause of these phenotypes in both human (Zarate and Fish, 2017; Zarate et al., 1993; Gigek et al., 2015; Qian et al., 2019) and mouse (Li et al., 2017; Zhang et al., 2019; Dobreva et al., 2006). Because these phenotypes are driven by changes to functional SATB2 levels (Zarate and Fish, 2017), we conjectured that the differential expression of SATB2 predicted from lentiMPRA might be linked to divergent modern human phenotypes. Thus, we examined whether the phenotypes SATB2 affects are divergent between archaic and modern humans (e.g., if modern human jaw size is different than the jaw size of archaic humans). We focused on phenotypes available for examination from the fossil record, primarily skeletal differences between modern humans and Neanderthals. From HPO, we generated a list of 17 phenotypes known to be affected by SATB2 and found that 88% (15) of them are divergent between these human groups (Supplementary file 5). These include the length of the skull, size of the jaws, and length of the dental arch. Next, based on SATB2 downregulation in modern humans predicted from lentiMPRA, we examined whether the direction of a phenotypic change between patients and healthy individuals matches the direction of phenotypic change between modern and archaic humans. For example, given that SATB2-associated syndrome patients have smaller jaws, we tested if modern human jaws are smaller compared to archaic humans. If SATB2 expression is not in fact related to phenotypic divergence, there is a 50% likelihood for a given phenotype to match the fossil record. Yet, we observed a match in direction in 80% of the phenotypes (12 out of 15, Supplementary file 5). This includes smaller jaws, flatter face, and higher forehead in modern compared to archaic humans. Overall, the observed number of phenotypes that are both divergent and match in their direction of change is 2.3-fold higher than expected by chance (p=1.3×10−4, hypergeometric test, Supplementary file 5, see 'Materials and methods'). Together, these data support a model whereby the C→T substitution in the putative promoter of SATB2, which emerged and reached fixation in modern humans, possibly reduced the expression of SATB2 and possibly affected brain and craniofacial phenotypes. However, further evidence is required to elucidate the potential role of this variant in modern human evolution.

Discussion

Identifying noncoding sequence changes underlying human traits is one of the biggest challenges in genetics. This is particularly difficult in ancient samples, where regulatory information is scarce (Yan and McCoy, 2020; Gokhman et al., 2016). Here, we use an MPRA-based framework to study how sequence changes shaped human gene regulation. By comparing modern to archaic sequences, we investigated the regulatory potential of each of the 14,042 single-nucleotide variants that emerged and reached fixation or near fixation in modern humans. We found an association between divergent TF motifs and the sequences driving expression changes, suggesting that changes to TF binding might have played a central role in shaping divergent modern human expression. Our results also suggest that genes affecting the vocal tract and cerebellum might have been particularly affected by these expression changes, which is in line with previous comparisons based on the fossil record (Neubauer et al., 2018; Gunz et al., 2019; Aiello and Dean, 2002; Kochiyama et al., 2018) and DNA methylation (Gokhman et al., 2020). More importantly, these results provide candidate sequence changes underlying these evolutionary trends.

LentiMPRA is designed for linking DNA sequence changes to expression changes en masse. Notably, it has limitations that could influence our results, mainly by potentially generating false negatives. First, our lentiMPRA library inserts were limited to ~200 bp in length, due to oligonucleotide synthesis technical restrictions, which may be insufficient to detect the activity of longer enhancer sequences (Inoue et al., 2017). Second, some minimally active sequences may not be expressed at a high enough level to pass our limit of detection. At the same time, some minimally active sequences may not be biologically significant. Third, some sequences may regulate expression post-transcriptionally, which lentiMPRA is not designed to detect. Fourth, since test sequences are randomly integrated into the genome, sequences that are dependent on their endogenous genomic environments (e.g., on nearby TF binding sites) might show reduced activity when inserted in new locations, while others might show activity that they otherwise would not have. Our design partially addresses this through the use of antirepressors and multiple independent integrations, which are intended to dilute location-specific effects. Additionally, all biases are expected to similarly affect the modern and archaic human versions of each sequence (Inoue et al., 2017). Fifth, transcriptional repression is less likely to be detected due to the low basal activity of the mP used. Sixth, the level of sequence activity may depend on more than one variant (including non-fixed variants, which we have not tested here). In the cases of non-fixed variants, the extent of differential activity could vary between individuals. At the same time, in the 10% of sequences that include more than one fixed variant, it is generally impossible to determine which of the variants drives the differential activity (with the exception of cases with more than two variants where the tiled sequences include a different combination of these variants).

Finally, differences in the trans environment of a cell could have an effect on the ability of a sequence to exert its cis-regulatory effect, resulting in cell type-specific cis-regulatory effects, as we observed in our data. The trans environment of the same cell type might also differ between two organisms. However, the majority of the cis-regulatory changes we observed would be expected to be present in archaic human cells as well, considering that such conservation has been observed between substantially more divergent organisms (e.g., human-chimpanzee [Ryu et al., 2018] and human-mouse [Mattioli et al., 2020]). In other words, while trans-regulatory changes play a key role in species divergence, the trans environments of the same cell type in two closely related organisms tend to affect cis-regulation similarly. Despite these caveats, MPRAs have been repeatedly shown to be able to replicate the activity of sequences in their endogenous context (Inoue et al., 2017; Klein et al., 2020; Kircher et al., 2019).

Importantly, when genomes from additional modern human individuals are sequenced and new variants mapped, it might become clear that some of the variants we analyzed have not reached fixation. However, regardless of whether they are completely fixed or not, these variants represent derived substitutions that likely emerged in modern humans and spread to considerable frequency. Further investigation is required to determine when they emerged, how rapidly they spread, and whether their effect was neutral or adaptive.

As expected, we observed differences in activity and differential activity between cell types (Tewhey et al., 2016; Kircher et al., 2019; Mattioli et al., 2020). Although some of this variation is likely biological (i.e., cell type-specific gene regulation), it is difficult to determine what proportion of it is due to biological vs. technical factors (e.g., differences in lentivirus preparation, infection rate, or cell growth, see 'Materials and methods'). Importantly, these differences are expected to result in false negatives rather than false positives. In other words, some of the sequences that appear as active or differentially active in one cell type might actually be active or differentially active in additional cell types (including cell types that were not tested in this study). Thus, we largely refrained from comparisons between cell types and the overlap observed in Figure 2a and Figure 3a should not be used to define such similarities. Rather, these diagrams should be used to examine the replicability of our results. Despite these caveats and limitations, lentiMPRA is a powerful high-throughput tool to characterize the regulatory activity of derived variants, and indeed has become a common assay to study the capability of sequences to promote expression (Chatterjee and Ahituv, 2017).

With this method, we found that 1791 (13%) of the 14,042 sequence pairs can promote expression in at least one of the three cell types tested, and that 405 (23%) of these active sequences show differential activity between modern and archaic humans (average fold-change: 1.24×, standard deviation 0.18, Figure 2—figure supplement 1a–c). Interpreting these results in light of previous MPRAs is challenging, not only because of key differences in statistical power and experimental design (e.g., sequence length), but also because of differing variant selection processes for each MPRA. With the exception of highly repetitive regions, which were removed from our library for technical reasons, the sequences we selected included all known modern human-derived fixed or nearly fixed variants (see 'Materials and methods'). Conversely, previous reporter assays and MPRAs on human intra- or inter-species variation used biased sets of variants by selecting sequences with putative regulatory function (e.g., eQTLs [Tewhey et al., 2016], TF binding sites [Weyer and Pääbo, 2016], ChIP-seq peaks [Klein et al., 2018], or TSS [Mattioli et al., 2020]) and/or regions showing particularly rapid evolution (e.g., human accelerated regions; Ryu et al., 2018; Uebbing et al., 2021; Prabhakar et al., 2008; Capra et al., 2013). In line with the fact that our data was not pre-filtered for putative regulatory regions, the proportion of active sequences we observed tends to be slightly lower than these previous studies. However, the magnitude of differential activity, as well as the fraction of differentially active sequences out of the active sequences, was similar to previous studies (Weyer and Pääbo, 2016; Tewhey et al., 2016; Klein et al., 2018; Ryu et al., 2018; Uebbing et al., 2021; Mattioli et al., 2020; Prabhakar et al., 2008; Capra et al., 2013). At the same time, we were capable of measuring regulatory activity in regions that would otherwise be excluded by filtering for a specific set of marks. Thus, future MPRAs on unfiltered sets of variants will enable the comparison of the patterns we observed to patterns within modern humans, between more deeply divergent clades, and of non-fixed modern-archaic differences.

Our results also suggest that differentially active sequences are over-represented within putative enhancers in NPCs (Figure 1—figure supplement 1c–d, Supplementary file 2). Enhancers have been suggested to be an ideal substrate for evolution because of their tissue specificity and temporal modularity (True and Carroll, 2002). Indeed, previous studies of introgression between archaic and modern humans suggested that enhancers are some of the most divergent regions between modern and archaic humans (Petr et al., 2019; Silvert et al., 2019; Telis et al., 2020). In line with the enrichment we observed in NPCs, brain-related putative enhancers show particularly low introgression, perhaps suggesting that the modern human sequences in these regions were adaptive (Silvert et al., 2019; Telis et al., 2020). To fully characterize the underlying mechanisms of differential activity in enhancers, it is important to disentangle the various factors and confounders that might contribute to this enrichment. There are several alternative explanations for the enrichment we observe, namely that variants within enhancers could be more likely to alter expression compared to other active sequences, or they could be particularly detectable in lentiMPRA. This could be tested using saturation mutagenesis MPRAs (Kircher et al., 2019) to compare the effect of random mutations in enhancer and non-enhancer modern human-derived active sequences.

Our results suggest that differentially active sequences are not randomly distributed across the genome, but rather tend to be linked to genes affecting particular body parts and phenotypes. The most pronounced enrichment was in the vocal tract, that is, the vocal cords, larynx, and pharynx. This was evident in the Gene ORGANizer analysis, where these organs are over-represented by up to fivefold, as well as in the HPO analysis, where some of the most enriched phenotypes are nasal speech, palate development, nasal passage opening, and laryngeal stiffness (Figure 4b, Supplementary file 4c). Overall, 53 of the 407 differentially active sequences were linked to genes which are known to affect one or more vocal tract phenotypes. Previous reports have also suggested that the vocal tract went through particularly extensive regulatory changes between modern and archaic humans (Gokhman et al., 2020), as well as between humans and chimpanzees (Gokhman et al., 2021; Prescott et al., 2015). Intriguingly, the anatomy of the vocal tract differs between humans and chimpanzees and has been suggested to affect human phonetic range (Lieberman, 2007). Comparing the anatomy of archaic and modern human larynges is currently impossible because the soft tissues of the larynx rapidly decay postmortem. However, together with these previous reports (Gokhman et al., 2020; Gokhman et al., 2021; Prescott et al., 2015), our results enable the study of vocal tract evolution from a genetic point of view and suggest that genes influencing the modern human vocal tract have possibly gone through regulatory changes that are not shared by archaic humans.

We also identified an enrichment of brain-related phenotypes, particularly those affecting the size of the cerebellum (Figure 4Supplementary file 4b,c). The cerebellum is involved in motor control and perception, as well as more complex functions such as cognitive processing, emotional regulation, language, and working memory (Mariën et al., 2014). Interestingly, the cerebellum has been described as the most morphologically divergent brain region between modern and archaic humans (Neubauer et al., 2018; Kochiyama et al., 2018). Evidence of divergent brain and cerebellar evolution can also be found at the regulatory level. Studies of Neanderthal alleles introduced into modern humans through introgression provide a clue as to the functional effects of divergent loci between archaic and modern humans. These works have shown that many of the introgressed sequences were likely negatively selected, with the strongest effect in regulatory regions (Petr et al., 2019; Silvert et al., 2019), particularly in brain enhancers (Telis et al., 2020). Studies of introgressed sequences have also shown that the cerebellum is one of the regions with the most divergent expression between Neanderthal and modern human alleles (McCoy et al., 2017). Together with our results, these data collectively suggest that sequences separating archaic and modern humans are particularly linked to functions of the brain, and especially the cerebellum.

Functional information on archaic human genomes is particularly challenging to obtain because of the postmortem decay of RNA and epigenetic marks in ancient samples. MPRA not only provides a new avenue to identify differential regulation in archaic samples, but also reveals the sequence changes underlying these differences. Here, we present a catalog providing regulatory insight into the sequence changes that separate modern from archaic humans. This resource will hopefully help assign functional context to various signatures of sequence divergence, such as selective sweeps and introgression deserts, and facilitate the study of modern human evolution through the lens of gene regulation.

Materials and methods

Code and data availability

Request a detailed protocol

Code is available for download on Github: https://github.com/weiss19/AH-v-MH; Weiss, 2021; copy archived at swh:1:rev:a75b6f0b7d278cb0388e52b3d491e262be77c206. Data was deposited in GEO under accession number: GSE152404.

Selection of fixed, derived variants and design of DNA oligonucleotides

Request a detailed protocol

We selected the variants for our lentiMPRA in the following manner. As a basis, we used the list of 321,820 modern human-derived single-nucleotide changes reported to differ between modern humans and the Altai Neanderthal genome (Prüfer et al., 2014). We then filtered this list to include only positions where the Vindija Neanderthal (Prüfer et al., 2017) and Denisovan sequences (Meyer et al., 2012) both match the Altai Neanderthal variant, and are also not polymorphic in any of the four ape species examined (61 Pan troglodytes, 10 Pan paniscus, 15 Gorilla beringei, and 28 Gorilla gorilla) (de Manuel et al., 2016). Next, we excluded loci which had any observed variation within modern humans in dbSNP, as annotated by Prüfer et al., 2014, or in the 1000 Genomes Project (phase 3) (Auton et al., 2015). Finally, for technical limitations in downstream synthesis and cloning, we excluded variants at which the surrounding 200 bp had >25% repetitive elements as defined by RepeatMasker (Smit et al., 1996). The resulting list contained 14,297 sequences and was used to design the initial set of DNA fragments. Upon completion of the lentiMPRA, another high-coverage Neanderthal genome (the Chagyrskaya Neanderthal) was published (Mafessoni et al., 2020), and we subsequently also filtered out loci at which the Chagyrskaya Neanderthal genome did not match the ancestral sequence, bringing the final list of analyzed loci to 14,042 (28,082 archaic and modern sequences, Supplementary file 1a-c).

We designed DNA fragments (oligonucleotides, hereinafter oligos) centered on each variant, including the 99 bp upstream and 100 bp downstream of each variant (200 bp total). For each variant we designed two fragments, one with the ancestral (archaic human and ape) sequence and one with the derived (modern human) sequence. For cases where two or more variants would be included in the same oligo, we used either derived-only (modern human) or ancestral-only (archaic human and ape) variants throughout the oligo. The average variants per oligo out of the 14,042 oligos was 1.1, with 12,680 containing one variant, 1259 containing two, 96 containing three, and 7 containing four. We also included 100 negative control fragments, created by randomly picking 100 of the designed DNA fragments and scrambling their sequence (Supplementary file 1e). Lastly, we incorporated 299 positive control fragments (Ryu et al., 2018; Prabhakar et al., 2008; Visel et al., 2007; Inoue et al., 2019; Hojo et al., 2016; Meyer et al., 2016; Khalid et al., 2018a; Khalid et al., 2018b; Loots et al., 2005; Fukami et al., 2006; Kawane et al., 2018) (i.e., expected to drive expression; Supplementary file 1d). As the library was infected into three cell types (see later), we designed positive controls for each of the cell types. For human ESCs and human NPCs, we used sequences which were previously shown to drive expression in MPRA in each of these cell types (Supplementary file 1d). For fetal osteoblast cells (Hobs), we used putative and confirmed enhancers from mouse and human (Supplementary file 1d). 15 bp adapter sequences for downstream cloning were added to the 5’ (5’-AGGACCGGATCAACT) and 3’ (5’-CATTGCGTGAACCGA) ends of each fragment, bringing the total length of each fragment to 230 bp. We synthesized each fragment as an oligonucleotide through Agilent Technologies, twice independently to minimize synthesis errors (Supplementary file 1i).

Production of the plasmid lentiMPRA library and barcode association sequencing

Request a detailed protocol

The plasmid lentiMPRA library was generated as described in Gordon et al., 2020. In brief, the two independently synthesized Agilent Technologies oligo pools were amplified separately via a five-cycle PCR using a different pairs of primers for each pool (forward primers, 5BC-AG-f01.1 and 5BC-AG-f01.2; reverse primers, 5BC-AG-r01.1 and 5BC-AG-r01.2; Supplementary file 1i), adding an mP downstream of the test sequence. A second round of five-cycle PCR was performed with the same primers for both pools (5BC-AG-f02 and 5BC-AG-r02; Supplementary file 1i) to add a 15 bp random barcode downstream of the mP. The two pools were then combined at a 1:1 ratio and cloned into a doubled digested (AgeI/SbfI) pLS-SceI vector (Addgene, 137725) with NEBuilder HiFi Master Mix (NEB). The resulting plasmid lentiMPRA library was electroporated into 10-beta competent cells (NEB) using a Gemini X2 electroporation system (BTX) (2 kv, 25 μF, 200 Ω) and allowed to grow up overnight on twelve 15 cm 100 mg/mL carbenicillin LB agar plates. Colonies were pooled and midiprepped (Qiagen). We collected approximately 6 million colonies, such that ~200 barcodes were associated with each oligo on average. To determine the sequences of the random barcodes and which oligos they were associated with, we first amplified a fragment containing the oligo, mP, and barcode from each plasmid in the lentiMPRA library using primers that contain Illumina flow cell adapters (P7-pLSmp-ass-gfp and P5-pLSmP-ass-i#, Supplementary file 1i). We sequenced these amplified sequences with a NextSeq 150PE kit using custom primers (R1, pLSmP-ass-seq-R1; R2 [index read], pLSmP-ass-seq-ind1; R3, pLSmP-ass-seq-R2, Supplementary file 1i) to obtain approximately 150 million total reads. We later did a second round of barcode association sequencing of these fragments to obtain approximately 76 million additional reads, for a combined total of 225,592,667 reads. To associate barcodes with oligos, we first mapped read pairs (R1 and R3) to the original list of 28,993 oligos using bowtie2 (--very-sensitive) (Langmead and Salzberg, 2012). Next, we filtered out pairs of reads that (1) did not map to the same oligo, (2) did not have at least one of the reads in the pair with a mapping quality of ≥6, or (3) did not have the ‘proper pair’ SAM designation. We linked each pair of reads with the read covering its barcode (R2) and saved only those barcode reads having at least a quality score of 30 across all 15 bases in the R2 read. We removed any barcodes associated with more than a single unique oligo (i.e., ‘promiscuous’ barcodes), as well as any barcodes where we did not see evidence of its oligo association at least three times. We then created a list of barcode-oligo associations – this final list comprised 3,495,698 unique barcodes spanning 28,678 oligos (98.9% of the original list of 14,297 variant sequence pairs, 100 negative sequences and 299 positive control sequences), which we refer to as the barcode-oligo association list.

Cell culture and differentiation

Request a detailed protocol

Human fetal osteoblasts were purchased from Cell Applications Inc (406 K-05f, tested negative for mycoplasma) and were maintained in Osteoblast Growth Medium (Cell Applications Inc). For passaging, cells were washed with 1× PBS, dissociated with trypsin/EDTA (Cell Applications Inc), and plated at approximately 5000 cells/cm2. H1-ESCs (ESCs, WiCell WA-01, RRID:CVCL_9771, identity authenticated via STR profiling, and tested negative for mycoplasma) were cultured on Matrigel (Corning) in mTeSR1 media (STEMCELL Technologies) and medium was changed daily. For passaging, cells were dissociated using StemPro Accutase (Thermo Fisher Scientific), washed and re-plated on Matrigel-coated dishes at a dilution of 1:5 to 1:10 in mTeSR1 media supplemented with 10 μM Y-27632 (Selleck Chemicals). ESCs were differentiated into NPCs by dual-Smad inhibition as previously described (Chambers et al., 2009; Inoue et al., 2019). Briefly, ESCs were cultured in mTeSR1 media until the cells became 80% confluent and then the media was replaced with neural differentiation media consisting of KnockOut DMEM (Life Technologies) supplemented with KnockOut Serum Replacement (Life Technologies), 2 mM L-glutamine, 1× MEM-NEAA (Life Technologies), 1× beta-mercaptoethanol (Life Technologies), 200 ng/mL Recombinant mouse Noggin (R and D systems), and 10 μM SB431542 (EMD Millipore). On day 4 of differentiation, the neural differentiation media was gradually replaced by N2 media (DMEM/F12 [Thermo Fisher Scientific] supplemented with N2 [Thermo Fisher Scientific]) every 2 days (3:1 ratio on day 6, 1:1 on day 8, and 1:3 on day 10) while maintaining 200 ng/mL Noggin and 10 μM SB431542. On day 12, cells were dissociated into single-cell using TrypLE Express (Thermo Fisher Scientific) and cultured in N2B27 media (1:1 mixture of N2 media and Neurobasal media [Thermo Fisher Scientific] with B27 [Thermo Fisher Scientific] supplemented with 20 ng/mL bFGF [R and D systems] and 20 ng/mL EGF [Millipore sigma]) on Matrigel-coated dish. NPCs were maintained in N2B27 with bFGF and EGF for a month and used for the following experiments at passage 15.

NPCs were validated through RT-qPCR at passage 1 (after 1 week of culturing in N2B27 media supplemented with bFGF and EGF) and at passage 10. RT-qPCR primers were designed for neural marker genes: SOX1/2, NES (NESTIN), MAP2; glial marker genes: GFAP, OLIG2; mesoderm marker genes: T(BRA), GSC; and endoderm marker genes: SOX17, FOXA2 (Supplementary file 1j). Expression of each marker was compared to HPRT expression (Supplementary file 1h). Additionally, validation via RNA-seq at passage 1 was performed. Results can be found in Figure 7A and D of Inoue et al., 2019 (data in GEO under accession number: GSE115046).

Cell line infection with lentiMPRA library, RNA- and DNA-seq, and read processing

Request a detailed protocol

Lentivirus was produced and packaged with the plasmid lentiMPRA library in twelve 15 cm dishes of HEK293T cells using the Lenti-Pac HIV expression packaging kit, following the manufacturer’s protocol (GeneCopoeia). Additional lentivirus was produced as needed in batches of ten 15 cm dishes. Lentivirus containing the lentiMPRA library (referred to hereafter as lentivirus) was filtered through a 0.45 μm PES filter system (Thermo Fisher Scientific) and concentrated with Lenti-X concentrator (Takara Bio). Titration reactions using varying amounts of lentivirus were conducted on each cell type to determine the best volume to add, based on an optimal number of viral particles per cell, as described in Gordon et al., 2020. Lentiviral infection, DNA/RNA extraction, and barcode sequencing were all performed as described in Gordon et al., 2020. Briefly, each replicate consisted of approximately 9.6 million cells each of ESC and osteoblast, and 20 million cells of NPC. ESC and osteoblast cells were seeded into four 10 cm dishes per replicate (with approximately 2.4 million cells in each dish), while NPCs were seeded into five 10 cm dishes per replicate (with approximately 4 million cells per dish). Additional cells were used for NPCs due to decreased efficiency of DNA/RNA extraction in NPCs. Three replicates were performed per cell type. Cells were infected with the lentiMPRA library at an MOI of 50 for NPCs and osteoblasts, and an MOI of 10 for ESCs. We used a lower MOI for ESC because the cells are very sensitive to infection and an MOI higher than 10 would result in cell death. For ESC and osteoblasts, cell media was changed to include 8 μg/mL polybrene before the addition of the lentiMPRA library to increase infection efficiency. The media was replaced with growth media without polybrene approximately 24 hr after infection. Infected cells were grown for 3 days before combining the plates of each replicate for extraction of RNA and DNA via the Qiagen AllPrep mini kit (Qiagen). We subsequently purified mRNA from the RNA using the Oligotex mRNA prep kit (Qiagen) and synthesized cDNA from the resulting mRNA with SuperScript II RT (Invitrogen), using a primer containing a unique molecular identifier (UMI) (P7-pLSmp-ass16UMI-gfp, Supplementary file 1i). DNA fragments were amplified from both the isolated DNA and generated cDNA, keeping each replicate and DNA type separate, with three-cycle PCR using primers that include adapters necessary for sequencing (P7-pLSmp-ass16UMI-gfp and P5-pLSmP-5bc-i#, Supplementary file 1i). These primers also contained a sample index for demultiplexing and a UMI for consolidating replicate molecules (see later). A second round of PCR was performed to amplify the library for sequencing using primers targeting the adapters (P5, P7, Supplementary file 1i). The fragments were purified and further sequenced with six runs of NextSeq 15PE with 10-cycle dual index reads, using custom primers (R1, pLSmP-ass-seq-ind1; R2 [read for UMI], pLSmP-UMI-seq; R3, pLSmP-bc-seq; R4 [read for sample index], pLSmP-5bc-seq-R2, Supplementary file 1i). Later, an additional two runs of 15PE of only the ESC samples were performed due to lower lentivirus infection efficiency in this cell type. Each sample's R1 and R3 reads (containing the barcode) were mapped with bowtie2 (Langmead and Salzberg, 2012) (--very-sensitive) to the barcode-oligo association list. Next, we applied several quality filters on the resulting alignments. We first filtered out read pairs that did not map as proper pairs, and then ensured the mapped sequence completely matched the known barcode sequence by requiring that both R1 and R3 reads have CIGAR strings = 15M, MD flags = 15, and a mapping quality of at least 20. Next, we consolidated read abundance per barcode by selecting only reads with unique UMIs, the result being abundance counts for each barcode, across each replicate library of each cell type for both RNA and DNA.

Data was deposited in GEO under accession number GSE152404.

Measurement of expression and differential expression

Request a detailed protocol

We used the R package MPRAnalyze (Ashuach et al., 2019) (version 1.3.1, https://github.com/YosefLab/MPRAnalyze) to analyze lentiMPRA data. To determine which oligos were capable of promoting expression, we modeled replicate information into both the RNA and DNA models of MPRAnalyze’s quantification framework (rnaDesign = ~ replicate and dnaDesign = ~ replicate) and extracted alpha, the transcription rate, for each oligo. MPRAnalyze used the expression of our 100 scrambled oligos as a baseline against which to measure the level of expression of each tested oligo. We corrected the mean absolute deviation (MAD) score-based p-values from MPRAnalyze for multiple testing across tested oligos, including positive controls and excluding scrambled sequences, using the Benjamini-Hochberg method, thus generating an MAD score-based expression FDR for each oligo. For each variant and for each cell type, we looked at both the archaic and modern sequence oligos and assigned an oligo as potentially capable of driving expression if it had an FDR ≤ 0.05 in at least one sequence, and at least 10 barcodes in both sequences (Supplementary file 1a-c). This left 2097 sequences in ESCs, 1059 in osteoblasts, and 664 in NPCs. Next, we applied a second test for activity, to account for potential overestimation of active sequences in ESCs due to the lower lentiviral infection efficiency in these cells. We aggregated UMI-normalized read abundances across all barcodes of each oligo, across all replicates in a given cell type, and calculated a simple ratio of expression as RNA abundance normalized to DNA abundance (RNA/DNA ratio). Next, similarly to Kwasnieski et al., 2014, we determined an RNA/DNA ratio threshold per cell type. This was done by first removing scrambled sequences that show RNA/DNA ratios >2 standard deviations away from the average RNA/DNA ratio of all of the scrambled sequences, as these likely represent oligos that are, by chance, capable of driving some expression. This left 95 scrambled sequences in ESCs, 94 in osteoblasts, and 97 in NPCs. Then, we used the distribution of RNA/DNA ratios of the remaining scrambled sequences to assign an FDR for each of the non-scrambled oligos. FDR was calculated as the fraction of scrambled sequences that showed an RNA/DNA ratio as high or higher than each non-scrambled oligo. Only oligos that passed both tests described above (FDR ≤0.05 in each test) were considered as ‘active’ (i.e., capable of driving expression). This resulted in 1183 sequences in ESCs, 814 in osteoblasts, and 602 in NPCs.

To measure differential expression between archaic and modern sequences, we used MPRAnalyze’s comparative framework. In essence, this tool uses a barcode’s RNA reads as an indicator of expression level and normalizes this to the DNA reads as a measure of the number of genomic insertions of that barcode (i.e., the number of fragments from which RNA can be transcribed). MPRAnalyze uses information across all the barcodes for both alleles of a given sequence, as well as information across all replicates. For the terms of the model, we included replicate information in the RNA, DNA, and reduced (null) models, allele information in the RNA and DNA models, and barcode information only in the DNA model (rnaDesign = ~ replicate + allele, dnaDesign = ~ replicate + barcode + allele, reducedDesign = ~ replicate). We extracted p-values and the differential expression estimate (fold-change of the modern relative to archaic sequence). Then, we corrected the p-values of the set of active oligos (see above) for multiple testing with the Benjamini-Hochberg method to generate an FDR for each sequence. We set a cutoff of FDR ≤ 0.05 to call a sequence capable of driving differential expression. From this we generated, for each cell type, a list of sequences with differential expression between the archaic and modern alleles (Supplementary file 1a-c).

We tested agreement between replicates by examining how many differentially active sequences show disagreement between the three replicates in the direction of their differential activity. We found that our dataset shows high between-replicate agreement, with the majority of sequences showing the same directionality across all three replicates (ESCs: 76%, osteoblasts: 78%, NPCs: 86%, compared to 25% expected by chance, p<10−16 for all three cell types, one-tailed Binomial test, Supplementary file 1k). Importantly, the log2(fold-change) (LFC) of the disagreeing replicate tends to cross the 0 line only marginally: the median LFC of the disagreeing replicate is 0.05 compared to 0.3 in the agreeing replicates. We also tested activity levels and found no evidence of lower activity in sequences with disagreement (p=0.27, one-tailed t-test). However, their absolute LFC tends to be slightly lower (0.25 vs. 0.32, p=6×10−5, one-tailed t-test).

Luciferase validation assays

Request a detailed protocol

Each assayed oligo was synthesized by Twist Biosciences and cloned into the pLS-mP-Luc vector (Addgene 106253) upstream of the luciferase gene. Lentivirus was generated independently for each vector using techniques as described for MPRA (see above), with the omission of the filtering and concentration step, which was replaced with the collection of the entirety of the cell culture media for use in subsequent infections. In addition, pLS-SV40-mP-Rluc (Addgene 106292), to adjust for infection efficiency, was added at a 1:3 ratio to the assayed vector for a total of 4 μg for lentivirus production. We infected each cell type individually with each viral prep. The amount of lentivirus added was based on titrations in which varying amounts of a subset of viral preps were added to each cell type and cell death was observed 3 days post infection; the virus volume that produced between 30% and 50% death was used for subsequent experiments. Approximately 20,000 cells were plated in 96-well plates and grown for 24–48 hr (~70% confluent) before the addition of lentivirus. For osteoblasts and ESCs, 8 μg/mL polybrene was added to the culture media at the same time as the addition of the lentivirus. The media was changed 24 hr after infection and cells were grown for an additional 48 hr. The cells were then washed with PBS and lysed. Firefly and renilla luciferase expression were measured using the Dual-Luciferase Reporter Assay System (Promega) on the GloMax plate reader (Promega). Each oligo was tested using two biological replicates on different days and each biological replicate consisted of three technical replicates. Activity of a given oligo was calculated by normalizing the firefly luciferase activity to the renilla luciferase. We then calculated the LFC between the modern and archaic alleles as log2(modern/archaic). A full list of oligos tested and their LFC can be found in Supplementary file 1a-c.

We found that the mean difference in fold-change between replicates was threefold lower for the differentially active vs. other active sequences (0.22 vs. 0.60), and that the variance of these differences was ninefold lower for differentially active sequences compared to other active sequences (0.09 vs. 0.83, Supplementary file 1k), suggesting that differentially active sequences reflect a true biological signal.

Predicting target genes

To connect the surrounding locus of each variant to genes it potentially regulates, we combined four data sources. For each locus, we generated four types of gene lists, based on four largely complementary approaches: (1) overlap with known eQTLs, (2) spatial interaction with promoters, (3) proximity to putative enhancers, and (4) proximity to a TSS (Supplementary file 1h). Each data source was obtained and incorporated into each type of list as described below.

Proximity to known eQTLs 

Request a detailed protocol

eQTLs are genetic variants between individuals shown to be associated with expression differences. We reasoned that the target genes of the sequence surrounding a variant are potentially similar to the target genes of nearby eQTLs. We downloaded eQTLs and their associated genes from GTEx (GTEx Consortium, 2015) (http://www.gtexportal.org, v8 on August 26, 2019) and overlapped the locations of each eQTL with our list of sequences. We linked the target genes of any eQTLs within ±1 kb to each variant. We used all tissue types reported by GTEx, for each cell type in the lentiMPRA. Out of the 14,042 loci, 9503 were found within ±1 kb of an eQTL, with 83,777 eQTLS overall overlapping them.

Spatial interaction with a promoter via Hi-C data

Request a detailed protocol

High-throughput chromosome conformation capture (Hi-C) techniques map spatial interactions between segments of DNA. We reasoned that if a variant is found within or near a region that was shown to interact physically with a promoter, that variant could be in a region involved in regulating that promoter. We downloaded promoter capture Hi-C data from Jung et al., 2019, containing a list of all the significant interactions between promoters and other segments of the genome across 27 tissue and cell types. We overlapped our variants with the locations of interacting genomic fragments to find interactions within ±10 kb of each variant. We then linked each variant with the promoters that each interacting fragment was shown to contact. We repeated this process twice: once to obtain a cell type-specific list and once to obtain a generic list. For the cell type-specific (stringent) list of locus-gene links, we included only those interactions observed in cell types corresponding to the cell lines used in our lentiMPRA: ESCs, NPCs, and mesenchymal stem cells (MSCs) as an approximation for osteoblasts (given that osteoblast Hi-C data is not publicly available to the best of our knowledge, and that osteoblasts differentiate from MSCs). For the generic (non-stringent) list, we used interactions across any of the 27 tissue and cell types analyzed by Jung et al., 2019. Out of the 14,042 loci, 4688 overlapped at least one region that interacts with a promoter.

Putative enhancers

Request a detailed protocol

Lastly, we checked which of our variants were in previously reported putative enhancers. To this end, we downloaded the GeneHancer database (Fishilevich et al., 2017) V4_12 and searched for putative enhancers within ±10 kb of each of our variants, linking each variant to the target genes of each putative enhancer within that distance. GeneHancer provides ‘elite’ or ‘non-elite’ status to their defined enhancer-target gene connections depending on the strength of the evidence supporting each connection. Using this information, we repeated the process twice: once for the elite status and once for all annotations. Out of the 14,042 loci, 5017 overlapped at least one putative enhancer.

Promoters

Request a detailed protocol

Promoters were defined as the region 5 kb upstream to 1 kb downstream of GENCODE (Harrow et al., 2012) v29 GRCh38 TSS. If a variant fell within this region, we linked it to that TSS’s gene. Each variant was assigned to all the promoters it fell within. Out of the 14,042 loci, 1466 were found within a promoter.

Overall, 11,207 out of the 14,042 loci were linked to at least one putative target gene, with a median of four target genes per locus. Of the remaining loci, 2830 were linked to their closest TSS, regardless of distance. The last five without hg38 coordinates for their closest TSS were not linked to a gene. Importantly, these links do not necessarily mean that these target genes are regulated by these loci, but rather they serve as a list of potential target genes for the loci showing a regulatory function through lentiMPRA.

DNA methylation in active and differentially active sequences

Request a detailed protocol

The four highest resolution DNA methylation maps for modern and archaic bone samples were taken from Gokhman et al., 2014 and Gokhman et al., 2020. Promoter sequences were defined as sequences within 5 kb upstream to 1 kb downstream of a TSS. CpG-poor promoter sequences were defined as promoter sequences ranking at the bottom half based on their CpG density. Enhancer sequences were defined as sequences annotated in chromHMM as putative enhancers (i.e., enhancers, genic enhancers, and bivalent enhancer) in osteoblast cells.

In putative enhancer sequences we found a slightly weaker link between methylation and activity compared to promoter sequences, with 3% hypermethylation of downregulating sequences and 5% hypomethylation of upregulating sequences. Perhaps in accordance with the much weaker link between enhancer methylation and activity (Jones, 2012), this trend is not significant despite having similar statistical power to the promoter analysis (p=0.12, paired t-test). To test whether our results might have been affected by CpG density, we compared CpG density in differentially active compared to non-differentially active sequences, and in upregulating compared to downregulating sequences. We found no significant difference in CpG density between these groups (p>0.05, t-test).

The hypermethylation of downregulating sequences in modern compared to archaic humans and the hypomethylation of upregulating sequences in modern compared to archaic humans are also observed to some extent when testing these sequences in NPCs, but not in ESCs. For example, the top 10 upregulating sequences are hypomethylated by 7% on average in modern compared to archaic humans, top 10 downregulating sequences are hypermethylated by 13% in modern compared to archaic humans. This is in line with previous observations that differentially methylated regions tend to be shared across tissues (Hernando-Herraez et al., 2013).

Differential TF binding sites

Request a detailed protocol

We predicted differences in binding of human TFs caused by each of our variants as follows. First, we downloaded the entire set of publicly available human TF binding motifs (7705 motifs, 6608 publicly available) from the Catalogue of Inferred Sequence Binding Preferences (CIS-BP) database (http://cisbp.ccbr.utoronto.ca/) and filtered them to include only motifs labeled as directly determined (i.e., we filtered out inferred motifs), resulting in 4351 motifs. Next, to enrich our mapping result for matches covering the variant location, we trimmed each of our oligo sequences containing a single variant to ±30 bp around the variant (the length of the longest motif). We did not trim oligos containing >1 variant. We used FIMO (Grant et al., 2011) to map each remaining motif to both the archaic and modern alleles of each trimmed sequence (or untrimmed, for sequences with >1 variant). A background model was generated using fasta-get-markov using the trimmed (or untrimmed, if >1 variant) sequences. For each motif mapping to both the archaic and modern alleles at the same strand and location, we required that at least one allele had a q-value (as supplied by FIMO≤0.05). Then, we found cases where the FIMO predicted binding score of a motif differed between the archaic and modern alleles. FIMO uses a p-value cutoff of 10−4 for reporting predicted binding. Therefore, some sequence pairs have a reported score for only one of the alleles. To assign these sequence pairs with a score difference, we used a conservative approach where we assigned the unscored allele with this lowest score reported for that motif, representing a score that is closest to a p-value of 10−4. Because the unreported score could be anywhere below the lowest reported score, but could not have been above it, this results in a conservative underestimation of the score difference. Finally, we linked each motif to the TF it is most confidently associated with in CIS-BP, thereby generating lists of TFs that showed differential predicted binding for each sequence. For cases in which multiple unique motifs corresponded to the same TF, we used the motif with the largest score difference between alleles. TF enrichment analyses were done on all predicted differential TF binding sites for TFs with a minimum of 10 predicted differential sites. TFs that are not expressed in the cell types we examined in this study (FPKM <1) were removed from the analyses. For TF expression in ESCs, we used ENCODE RNA-seq data for H1-hESC (ENCODE Project Consortium, 2012). For osteoblast expression, data (Moriarity et al., 2015) was downloaded from GEO under accession number: GSE57925. For NPC expression, data (Lu et al., 2020) was downloaded from GEO under accession number: GSE115407. Fisher’s exact test was used to compute enrichment of a TF among differentially active sequences compared to other active sequences. p-Values were FDR-adjusted.

To further test the enrichment of ZNF281, we examined various cutoffs of the number of predicted bound motifs, ranging from 5 to a maximum of 14 (the number of motifs predicted to be differentially bound by ZNF281) in steps of 1. We found that with the exception of the cutoffs of 5 and 6 (where ZNF281 is only slightly above the significance threshold: FDR = 0.058 and 0.053, respectively), ZNF281 is the only significant TF across all of these cutoffs (FDR ≤0.05). We repeated the same test for FPKM cutoffs, ranging from 0.5 to 3 in steps of 0.5, and found that ZNF281 is the only significantly enriched TF (FDR ≤0.05) across all of these cutoffs.

For the predicted binding vs. expression correlation analysis, a cutoff of 10 sites per TF was used. p-Values were computed using Pearson’s correlation.

Overlapping loci with genomic features

Request a detailed protocol

The following datasets were used for the overlap analyses: GENCODE v28 GRCh38 human genome TSS (Frankish et al., 2019), GTEx v8 eQTLs (GTEx Consortium, 2015), and broad peaks for the following histone modification marks: H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, and H3K27me, and the histone variant H2A.Z from the Roadmap Project for ESCs, ESC-derived NPCs, and osteoblasts (Kundaje et al., 2015). We overlapped each of these datasets with the lists of inactive and active sequences, and computed enrichment p-values using a Fisher’s exact test. We repeated this for various RNA/DNA cutoffs (1, 1.5, 2, 2.5, 3, and 3.5). Sex chromosomes were removed from the analyses. p-values were FDR-adjusted using the Benjamini-Hochberg procedure.

Sequence conservation within primates was taken from the Altai Neanderthal genome annotation, which used the PhyloP metric (Prüfer et al., 2014).

Human-chimpanzee cis-regulatory expression changes

Request a detailed protocol

We investigated the expression of genes associated with differentially active sequences by analyzing human and chimp RNA-seq data. As the expression changes we report are driven by cis-regulatory changes, we used our recently generated RNA-seq data from human-chimp hybrid cells (Gokhman et al., 2021) (GEO accession numbers: GSE146481 and GSE144825). In these hybrid cells, the human and chimpanzee chromosomes are found within the same nuclear environment and are exposed to the same trans factors (e.g., TFs). Therefore, any differential expression observed between the human and chimpanzee alleles within these hybrid cells is attributed to cis-regulatory changes. These cells are hybrid human-chimpanzee induced pluripotent stem cells (iPSCs), and we therefore investigated whether genes associated with upregulating sequences in our ESC lentiMPRA data tend to be upregulated in the hybrid iPSCs and vice versa. It is important to note that differential expression between humans and chimpanzees reflects ~12 million years of evolution (i.e., changes that emerged along the human as well as along the chimpanzee lineages since their split from their common ancestor ~6 million years ago). However, our lentiMPRA data was done on sequences that changed along the modern human lineage (~550–765 thousand years). Therefore, the human-chimpanzee differences span an evolutionary time that is ~20-fold longer than the modern human lineage, and the effect of modern-derived variants on gene expression between humans and chimpanzees is expected to be largely diluted by the many other changes that accumulated along the rest of this time. Indeed, we observe a very slight, but significant correlation between differential expression observed in the lentiMPRA data and differential expression observed in the human-chimp hybrid data (p=0.017, Pearson’s r = 0.1, Figure 3—figure supplement 2g).

Phenotype enrichment analyses

Request a detailed protocol

Body part enrichment analyses were conducted using Gene ORGANizer v13. The analyses were conducted on sequences driving increased expression, sequences driving decreased expression, and all differentially active sequences. This was done in each of the three cell types. We conducted these analyses using various LFC thresholds: 0, 0.5, and 0.75, on the non-stringent locus-gene associations, and using a cutoff of five genes per term. Analyses were done against the active sequences as background and using the ORGANizer tool with the confident option. p-values were FDR-adjusted using the Benjamini-Hochberg procedure. For osteoblasts, non-skeletal organs were removed from the analyses. For NPCs, non-neuronal organs were removed.

For the HPO analyses, we used HPO (Köhler et al., 2014) build 1268 (November 8, 2019), analyzing gene lists identical to the Gene ORGANizer analyses, with the exception of using a cutoff of three genes per term, because fewer genes are linked to HPO terms than to Gene ORGANizer terms. Lists of phenotypes from HPO were generated for each variant through its linked genes. Hypergeometric test p-values were computed per phenotype and FDR-adjusted. Similarly to the Gene ORGANizer analysis, we removed non-skeletal phenotypes from the osteoblast results and non-neuronal phenotypes from the NPC results.

Gene Ontology, Gene ORGANizer, and HPO analyses were also done on the full set of genes linked to the 14,042 fixed variants using the same parameters described above (Supplementary file 6). Importantly, unlike the analyses of differentially active sequences, which can be compared against a non-differentially active sequences background to control for potential biases, the full set of sequences cannot be compared against a background set. Therefore, these results may be affected by different confounders such as GC content, the ability to call SNPs, DNA degradation patterns, and it is still to be determined to what extent these results reflect true evolutionary trends.

SATB2 phenotypic analysis was done as previously described in Gokhman et al., 2019. In short, we used HPO (Köhler et al., 2014) build 1268 (November 8, 2019) to link phenotypes to SATB2. In addition, we conducted a literature search to expand gene-phenotype links to include studies that did not appear on HPO (Supplementary file 5). We used only skeletal directional phenotypes, that is, phenotypes that could be described on a scale (e.g., smaller/larger hands), as these could be examined against the fossil record. This resulted in 34 phenotypes that are the result of SATB2 heterozygous LOF (Supplementary file 5). Phenotypes that are included in another phenotype (e.g., prominent nasal bridge and prominent nose) were merged, and contradicting phenotypes (e.g., broad nose and thin/small nose) were removed. This resulted in a final list of 17 phenotypes (Supplementary file 5). Given that the mechanism underlying these phenotypes is a decrease in the dosage of SATB2, and that SATB2 is possibly downregulated in modern humans, we sought to investigate if similar phenotypes exist between modern human patients with SATB2 heterozygous LOF and archaic humans. For each phenotype, we determined if it is divergent between the modern and archaic humans based on previously published annotation (Gokhman et al., 2019). Then, for remaining divergent phenotypes, we tested if the direction between patients and healthy individuals matches the direction between modern and archaic humans. The significance of directionality match was computed using a binomial test, with a random probability of success p=0.5. To compute the significance of the overall number of phenotypes that are divergent and match in direction, we compared the overall number of annotated divergent phenotypes to the number of divergent phenotypes associated with SATB2 using a hypergeometric test. Out of a total of 696 annotated phenotypes between modern and archaic humans (Gokhman et al., 2019), 434 are annotated as divergent, and the direction of 50% of them (217 phenotypes) is expected to match by chance.

Data availability

Data was deposited in GEO under accession number: GSE152404.

The following data sets were generated

References

  1. Software
    1. Smit A
    2. Hubley R
    3. Green P
    (1996)
    RepeatMasker Open-3.0
    RepeatMasker Open-3.0.
  2. Book
    1. Zarate YA
    2. Kaylor J
    3. Fish J
    (1993)
    SATB2-Associated Syndrome
    In: Adam M. P, Ardinger H. H, Pagon R. A, editors. GeneReviews [Internet]. Seattle, WA: University of Washington, Seattle. pp. 1993–2021.

Decision letter

  1. Patricia J Wittkopp
    Senior and Reviewing Editor; University of Michigan, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This study advances our understanding of human evolution by testing the function of human-specific variants in non-coding regions that regulate gene expression. In some cases, variants are identified that might contribute to traits limited to modern humans. The comprehensiveness of dataset should also make this work an important community resource.

Decision letter after peer review:

Thank you for submitting your article "The cis-regulatory effects of modern human-specific variants" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Patricia Wittkopp as the Senior and Reviewing Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

All three reviewers and I found much to like in this work. In the post-review discussion, there was general agreement of the strengths and weaknesses in the study, with the most pressing issue being the need to be more clear about the caveats resulting from assaying fragments with individual variants, out of their native genomic context, and in a limited number of cell types. These must be addressed directly in a revised version. Beyond this, each reviewer also identified additional points that need clarification and refinement. Because of the depth of these comments, I am including all three reviewer's comments below in their entirety. I look forward to seeing a revised version of this work.

Reviewer #1:

Weiss and colleagues performed a lentiMPRA comparing archaic and human genetic variants in three cell types. I found this study well designed and executed and the results exciting. I especially liked the analysis of SATB2! It is really amazing that by focusing on skeletal phenotypes the authors can root inferences from their study in human biology. I wish this was done for more than one gene! I still have a number of concerns about the study, outlined below.

1. In a 200 bp sequence background, there are typically other variants included. If I understand correctly, the authors only considered one focal variant that differs between modern and archaic alleles and tested for a difference in one of the two possible sequence backgrounds. This means that one variant is tested in an artificial allele that possibly never existed in an archaic or modern human. in vivo, this variant might interact with nearby variants (fixed or segregating), but this is ignored in this setting. It would obviously be too much to demand that both the modern and the archaic variant be tested in the archaic allele as well, but I wonder if the authors can report if there are cases (and if so, how many) where tested variants are close to other variants which would indicate that the result is artificial and thus unreliable. The relevant distance would be within the size of one transcription factor binding site or less than 15-20 bp or so.

2. Line 179/Figure 2, There is a larger overlap between NPCs and osteoblasts than between either and ESCs – due to either cell type's closer developmental similarity with ESCs, I had expected the opposite. Can the authors offer an explanation?

3. Line 176, How often were both archaic and modern alleles active and how often only one?

4. Line 192, "At the same time, these sequences tended to be depleted of repressive marks". Figure 2b shows an enrichment of H3K27me3 in active sequences, while no depletion for H3K9me3. Figure 2c shows a mild depletion at best and Figure 2d a combination of enrichment and depletion of both marks. It seems to me that the quoted statement is not supported by the data.

5. Figure 3, There are a few differentially active sequences that show very low fold changes. I recommend using a cutoff to remove sequences with such small fold changes. What would be the biological significance of a sequence that is barely different between alleles but considered significantly differentially active? Related, if there are sequences that show such small fold changes, it is possible that there are some differentially active sequences that disagree in the direction of bias between replicates. If there are any, those should certainly be excluded!

6. Line 504, As most chromatin regions are inactive in most cell types, the use of insulators may more often than not lead to an increase in activity and thus to false positive detection of regulatory activity in lentiMPRA. This means, importantly, that for many of the active sequences in this study, there is no evidence that they would be active in this cell type at all, because not all sequences overlap a signature of activity (histone mark, active chromatin etc.). Those sequences might only be active because of the use of insulators. I think all that can really be said of those sequences is that they have the potential to actively drive transcription in a hypothetical cell type in which they would be accessible and active.

7. Lines 251-252, "[…] suggests that these sequences are likely to be differentially active in other cell types not assayed in this lentiMPRA." I disagree with this idea. I would argue that this indicates that regulatory logic is independent of cell type but mainly determined by the underlying sequence which determines which transcription factors can bind. That sequence will bind the same or related transcription factors expressed in two different cell types which will lead to similar effects in distinct cell types. But this would also be true for a sequence that is only active in cell type A but heterologously expressed in cell type Y. Especially due to the use of the insulator elements in the construct, we don't know if the sequence would really be active.

8. Line 317, It appears that this analysis does not differentiate between activating and repressive transcription factors. I would expect that repressive transcription factors invert this relationship. Do the authors find that to be true?

9. Lines 332-336, The language here indicates that the authors looked for positive correlations only ("i.e., higher affinity to the modern human sequences was predictive of higher expression"). Then they state that "All of these TFs had a positive correlation" which is trivial if they only looked for positive correlations. It should be clarified if negative correlations were investigated as well.

10. Line 342, The finding that ZNF281 and SP3 are significantly enriched among differentially active sequences in NPCs at an FDR of 0.05 is only corrected for the tests done in NPCs. However, the authors tested for enrichment among all three cell types (plus the union of all three cell types together). This means that multiple testing correction should have been performed over the entire dataset that was used in the test and not separately by cell type. If done over the entire dataset (all three cell types, ignoring the union of all cell types), the FDR for ZNF281 is 0.056 and that of SP3 is 0.13. This means that there is no enrichment of any transcription factors among differentially active sequences after appropriate multiple testing correction. I did not check this for the other analyses, but multiple testing correction should be performed over the entire analyzed dataset and not per cell type throughout the study, for instance for enrichment of GO, HPO etc.

11. Line 551-557, This seems like a gross overinterpretation to me. If a binding site is detected or not given one base pair, is most closely related to the significance cutoff for detection of a binding site in the software being used. This could also be interpreted such that if it depends on one single base pair if there is binding or not, there might not be much binding at all. That the cell regulatory machinery uses the same cutoff for binding as the method to detect binding in this study is extremely unlikely. Further, binding is not only determined by the sequence itself but also by sequence activity and accessibility. In other words, the evolution of a new putative binding site in inaccessible heterochromatin is likely inconsequential. It would be more likely for a newly evolved, putative binding site to be active adjacent to an already existing, active enhancer, where the sequence would already be accessible. But lentiMPRA is not appropriate for detecting such events and speculations about newly evolved regulatory elements seem unwarranted.

12. Line 621-622, How many of the variants included in the study have information in only one of the archaic human genomes?

13. Lines 636-639, The way how the design of the oligos is described is very confusing and maybe actually misleading. For instance here: "For each variant we designed two fragments, one with the ancestral sequence and one with the derived sequence." I don't think that is correct. If I understand correctly, it should say "For each variant we designed two fragments, one with the ancestral variant state and one with the derived variant state, each in the background of the modern human [Is this correct?] sequence." because sequence versions do not differ in their entire sequence but only in the single variant whereas the sequences are otherwise identical.

14. Line 905, Where TFs filtered for expression in the analyzed cell type? If not, that should be done!

15. Line 906-908, "For cases in which multiple unique motifs corresponded to the same TF, we used the motif with the largest score difference between alleles." I don't see why that would make sense. If there are different score differences for different motifs on the same allele, wouldn't the expectation be that in each case the highest score binds? That could mean that a substitution would swap one motif for another, such that the binding differential of either motif is irrelevant to the question but what really matters instead is the binding differential between the respective maximum binding affinity of the two motifs.

Reviewer #2:

Weiss et al. screened 14,042 variants that differ between modern and archaic humans genome-wide for their gene-regulatory impact. Using massively parallel reporter assay (lentiMPRA) in three cell types, they find evidence that ~13% of genomic sequences harboring variants can drive gene expression (=active sequences), and that about 13% * 23% ≈ 3% of variants lead to differential regulatory activity (=differentially active sequences). The authors study, characterize and compare active and differentially active sequences with regard to genome annotations; they find, for example, that differentially active sequences are enriched in annotated enhancer loci. With regards to a possible mechanism for differential regulatory activity observed, Weiss et al. study differences in transcription factor (TF) binding site motifs between corresponding archaic vs. modern sequences, and they are able to report substantial motif changes for ZNF81 and SP3. Linking first variants to genes and then second genes to organs and phenotypes allows the authors to relate putative variants' effects to brain and vocal tract; then a variant with reduced regulatory activity in the modern human version (across all there cell types studied) is discussed in more detail, with the conclusion that reduced SATB2 expression could give rise to brain and craniofacial phenotypes different between modern and archaic humans.

Effects of (modern) human-specific variants are of great interest, so this study using powerful MPRA technology for a comprehensive assessment of all variants is exciting and a step forward. The authors are able to suggest specific variants that may underlie modern human traits, and overall the manuscript is clear. The linking of variants to organs and phenotypes is interesting, and it provides several potential entry points for follow-up studies. Importantly, the data provided constitute a significant resource for the community and will have an impact on further studies in this area.

While it would be great to have a more definite linking of variants to genes, in general and specifically in the case of SABT2, where there is no direct experimental evidence that chr2:199,469,203 affects that gene's expression, that would require additional experiments. An easier improvement would be for the authors to publish the computer code used to generate results; currently this is not the case, nor do the authors state that it is made available upon (reasonable) request. This makes it very hard to reproduce results reported, study details of the analyses performed, or explore effects of parameter choices, thereby diminishing the value/impact of the study. The manuscript would benefit from addressing the following specific points.

1. P8 / Supplementary Figure 1: Panel B: By eye the data looks quite a bit more noisy for ESC than for the other two cell types. Correlation is less for ESC, and seems driven by only few sequences. Additionally reporting Spearman correlation would be good, and I this should be discussed in the manuscript. Panel D: It would be more informative using a log-scale for the RNA/DNA ratio. Also, it would be informative to see the tested sequences together with the scrambled and the positive.

2. P9 L 169: It might be interesting for the authors to provide an overview of where/of what type the sequences are. For example., are they mostly intergenic, proximal, or inside genes? If so, is there perhaps an enrichment for certain types of genes, etc.? Even in case not novel, where archaic vs. modern human differences are would be good to know in light of contrasting active vs. inactive sequences later in the manuscript.

3. P9 L 172 / Methods P38-40 L761-802: While the authors describe their MPRA analysis, safe for actually attempting to re-do the analysis, it remains unclear whether sufficient details are provided to reproduce results; publishing code used for the analysis would be far better. Further on, processed data provided on GEO has replicates already aggregated, so that it cannot be used to reproduce, e.g., the MPRAnalyze part of the analysis and primary data would have to be re-processed first. Also, it is unclear why the authors use a combination of MPRAnalyze and a heuristic involving aggregation across replicates to assign oligos/sequences as "active". It'd be good to explain why that is done, and what is the effect on the set of active oligos.

4. P10 L 190-192: Figure 2: Panels B-D: For each bin of RNA/DNA ratio enrichment appears to be plotted, based on a hypergeometic/Fisher test, probably as the logarithm of the odds ratio. It would be informative to plot a measure of variability/confidence for each bin as well. Standard implementation of the Fisher test in R provides a confidence interval, for example.

5. P10 L 198-200, Figure 2: Similar to the point above, it would be good to see "error bars" on panel e.

6. P12 L 235: It is unclear what the null model for the Super Exact test is. Is overlap of differentially active sequences between cell types higher than expected, given what we already know about the overlap considering all active sequences? Or do cell types overlap for differentially active sequences more than expected given sequences were selected purely at random, but not as much as they do considering all active sequences?

7. P15 L 272-274: Supplementary Figure 3: Variability between lentiMPRA replicates is not displayed in panel B; it would be good to use a display that shows variability in replicates of both arrays, lentiMPRA and luciferase assay.

8. P16 L 300-305: For the methylation analysis, it there a cell type signal in the sense that up-regulation correlates with hypomethylation and down regulation with hypermethylation in osteoblasts, but not in the other two cell types?

9. P17 L 322-325: Supplementary Figure 4: What exactly is plotted in panels B-D: does each point correspond to a sequence-TF combination, or was some kind of averaging performed (over sequences or TFs)?

10. P44 L 900-903: FIMO analysis: Unclear why the lowest-matching score would be used. Why not use the highest log odds on that allele? Also, what background model was used for FIMO?

11. P18, L 342: How sensitive are the results of the enrichment analysis (ZNF281 and SP3) with regards to some of the parameters, like requiring that 10 or more motif changes are required for TFs to be considered? How many TFs did meet that requirement?

12. P20 L 371-376: In the supplementary table, it would be good to provide information about how a gene was linked to a variant (i.e., which of the four lines of evidence used support the link). Also, for the Hi-C data, the authors describe two lists in the methods a cell type-specific list and a general list. However, it appears unclear which one was used for linking to genes.

13. P20 L 387-391 and P21 L 394-396: Which genes are driving theses terms? Do they largely overlap between different terms, or are they different genes in each term. It would be informative to report genes alongside terms in the supplemental table. Also for the HPO analysis in the following paragraph.

14. P23 L 432: PyloP conservation scores are usually -log10(p-value), so while conservation is supported, the p-value corresponding to a score of 0.996 would be about 0.1, which is not that highly conserved.

15. P25 L 484-486. The authors write they have investigated how single nucleotide variants "have affected expression". While technically true in the context of the lentiMPRA, the authors measure changes in the ability of these sequences to drive expression, but do not show that they actually affect "real live" gene expression outside of an artificial construct. Perhaps change this sentence.

Reviewer #3:

The majority of sequence variants that separate modern and archaic human genomes are found in non-coding regions, often overlapping gene regulatory elements. It is difficult to study the functional significance of these variants on molecular phenotypes such as transcription and chromatin regulation due to the lack of intact biological material in archaic specimens. The current work uses an MPRA framework to compare the gene regulatory activities of derived modern human versus ancestral variants in three modern human cell types – ESCs, NPCs and osteoblasts. As expected, the authors found that active sequences were associated with enhancer annotations, and differentially active sequences were associated with divergent TF binding motifs. The presented MPRA data is well controlled. However, the differences detected by the various comparisons made were often subtle, or small in magnitude. This requires robust statistical analysis to understand the significance of those differences. While the authors were moderate in most of their claims, this reviewer thinks some of those claims require further analytical support with some adjustments in the text. With these additional non-experimental, but critical analytical revisions and some clarifications, this would be of potentially high interest to a broad readership as a contribution to eLife.

1. Selection of cell types – this is crucial as TFs are the major drivers of MPRA activity. The claim that ESCs were used "due to their globally active transcription, providing a general view of gene expression" is based on a single study from mouse embryonic stem cells (Efroni et al., 2008). Mouse ESCs are known to have very different biological properties from human ESCs, with mouse ESCs representing a less differentiated state than human ESCs (Hanna et al., PNAS 2010). Therefore, this is not a sufficient justification for using ESCs as a comparative cell type. In particular, enhancers are typically cell type or lineage specific in their activity, and they are driven by TFs that are expressed in a lineage dependent manner. hESCs express basal and pluripotency factors, limiting their capacity to drive transcription of developmentally specialized enhancers. The use of NPCs and osteoblasts are reasonable considering the central interest of the study in understanding the role of derived alleles in shaping modern human phenotypic differences, but beyond that point it is not clear why the cell types were chosen (see point 2).

2. Of the 14000 regions selected for the MPRA – what was the expectation given existing functional genomic data from modern humans and how did that inform the selection of cell types? This question is related to the above point. In supplementary figure 2 the authors provide the annotation classifications of the 14,000 fixed derived variants based on ChromHMM. This data indicates that ~25% of these variants overlap enhancers or TSS regions – but the relative distributions of the annotations differ depending on the ChromHMM context (ESC, NPC or Osteoblasts). It seems that using this methodology, one can predict other cell-types for which these variants may be active, thus providing an informed context to test the MPRA. It is important to know what percentage of the fixed variants are located in any type of regulatory region across a variety of cell-types. This information may also be important for understanding differential activity of modern vs archaic sequences, as the magnitude of differential activity may be cell type dependent. Related to this point, in the Discussion the authors point out that, based on previous studies, trans differences are likely minimal between modern and archaic humans, but what about trans differences in cell-type environment? Their assertion is perhaps an overgeneralization that diminishes the importance of trans effects. This is a point that should be addressed.

3. The authors report a modest median fold-change of 1.2x for differentially active sequences. They use a combination of criteria to associate these sequences with a putative target gene. An important next step to support the significance of these findings would be to understand whether these modest differences are linked to differential expression of target genes. Obviously, such a comparison cannot be made for modern and archaic human transcriptomes; however, one potential avenue for testing this idea would be to use public data from comparative human and chimp RNA-seq studies to look at whether the target genes are differentially expressed between modern human and chimp, provided that the ancestral allele for the differentially active sequence is found in chimp.

4. The section describing the relationship between DNA methylation and differentially active sequences seems to be one of the least well supported pieces of this study. In the absence of gene expression data, a key piece of support for the functional significance of differentially active sequences is to link those differences to epigenetic differences. The availability of DNA methylation maps from Neanderthal and Denisovan makes this possible. However, the way this data is presented (or not presented) is really difficult to interpret (i.e. reporting that upregulating sequences tend to be hypomethylated with a -1% difference compared to what?). A figure would have been very useful here. Also, it's important to note that many enhancers have low CpG density, while promoters and TSSs with high CpG density are typically hypomethylated regardless of gene activity (Lister et al. 2009, Stadler et al. 2011, Schlesinger et al. 2013). Thus, it is necessary to parse these regions when referring to methylation differences.

https://doi.org/10.7554/eLife.63713.sa1

Author response

Reviewer #1:

Weiss and colleagues performed a lentiMPRA comparing archaic and human genetic variants in three cell types. I found this study well designed and executed and the results exciting. I especially liked the analysis of SATB2! It is really amazing that by focusing on skeletal phenotypes the authors can root inferences from their study in human biology. I wish this was done for more than one gene! I still have a number of concerns about the study, outlined below.

We thank the reviewer for their feedback and are glad that they found this work exciting. We share the reviewer’s excitement for the prospect of identifying additional candidate genes, and we trust that this data will serve as an important resource for the identification of genes underlying modern human evolution.

1. In a 200 bp sequence background, there are typically other variants included. If I understand correctly, the authors only considered one focal variant that differs between modern and archaic alleles and tested for a difference in one of the two possible sequence backgrounds. This means that one variant is tested in an artificial allele that possibly never existed in an archaic or modern human. in vivo, this variant might interact with nearby variants (fixed or segregating), but this is ignored in this setting. It would obviously be too much to demand that both the modern and the archaic variant be tested in the archaic allele as well, but I wonder if the authors can report if there are cases (and if so, how many) where tested variants are close to other variants which would indicate that the result is artificial and thus unreliable. The relevant distance would be within the size of one transcription factor binding site or less than 15-20 bp or so.

Following the reviewer’s comment, we realized that this point required further clarification. Most importantly, each sequence included either the ancestral sequence throughout, or the derived sequence throughout. In other words, if a 200 bp window had more than one variant separating modern from archaic humans and apes, the other derived variants were also included in the derived construct. Thus, each focal modern-derived variant was tested against a modern human background, and each focal ancestral variant was tested against an ancestral background in the same experiment. Notably, only 1,362 of the 14,042 pairs of sequences (10%) had >1 fixed variant. Following the reviewer’s comment, we realized that we did not adequately describe this design in the text, and we therefore added the following section to the main text: “13,680 out of 14,042 sequence pairs (90%) had a single variant separating the human groups. For the 1,362 sequence pairs containing additional variants within the 200 bp window, we used either the modern-only or archaic-only variants throughout the sequence”.

As the reviewer mentioned, in addition to the fixed variants we investigate, these sequences may contain additional non-fixed variants in some modern or archaic individuals. We agree with the reviewer that these non-fixed variants may have an effect on the activity of some of these sequences. Because in these cases both alleles of the non-fixed variants exist in the modern human population, these constructs reflect sequences that likely exist in at least some humans. To emphasize these caveats, we added the following section to the discussion of the MPRA limitations: “Sixth, the level of sequence activity may depend on more than one variant (including non-fixed variants, which we have not tested here). In the cases of non-fixed variants, the extent of differential activity could vary between individuals. At the same time, in the 10% of sequences that include more than one fixed variant, in our study it is not possible to determine which of the variants drives the differential activity (with the exception of cases with more than two variants where the tiled sequences include a different combination of these variants).”

2. Line 179/Figure 2, There is a larger overlap between NPCs and osteoblasts than between either and ESCs – due to either cell type's closer developmental similarity with ESCs, I had expected the opposite. Can the authors offer an explanation?

Like the reviewer, we were also expecting more overlap between each cell type and ESCs than to each other. Reassuringly, when looking at the differentially active sequences (Figure 3d) compared to all active sequences (Figure 2a) this trend is far less extreme. Regardless, we believe technical reasons could partially explain the observed trend. ESCs are very sensitive to infection (see methods where we note the MOI used for ESCs was 10 compared to 50 for the other cell types and that “We used a lower MOI for ESC because the cells are very sensitive to infection and a MOI higher than 10 would result in cell death”; L807). Due to the difficulty in infection, the quality of the data was slightly lower for ESCs than for the other cell types. To this point, we added the following to the Results section: “We saw a strong correlation of RNA/DNA ratios between replicates for all cell types (Pearson’s r = 0.76 – 0.96, P < 10-100 , Supplementary Figure 2b), with the lower correlation scores being in ESC, likely due to our use of lower multiplicity of infection (MOI) in these cells due to their increased sensitivity to lentivirus infection”. We also note in the methods that we required additional sequencing for ESCs “due to lower lentivirus infection efficiency in this cell type.”. We believe the technical difficulties with these cells could be a reason we see less of an overlap between ESCs and other cell types. Importantly, we make a point in the discussion to clarify that “it is difficult to determine what proportion of [variation] is due to biological versus technical factors” and “Thus, we largely refrained from comparisons between cell types”.

Additionally, the reason that both we and the reviewers expected this specific trend rests on the assumption that the test sequences are a representative sampling of functional sequences across the genome and therefore can be considered representative of activity in a given cell type. Although our target sequences were selected in an unbiased manner, there may be, by chance, some unknown bias, especially given the low number of active sequences. This bias could be the cause for an unexpected overlap in activity. To clarify this point, we’ve revised the previously mentioned sentence to read “Thus, we largely refrained from comparisons between cell types and the overlap observed in Figure 2a and Figure 3a should not be used to define such similarities. Rather, these diagrams should be used to examine the replicability of our results.”

3. Line 176, How often were both archaic and modern alleles active and how often only one?

In ESCs: 439 sequences were significantly active only in the archaic allele, 447 only in the modern allele, and 297 in both. In osteoblasts: 210 only in the archaic allele, 178 only in the modern allele, and 426 in both. In NPCs: 126 only in the archaic allele, 178 only in the modern allele, and 426 in both. Importantly, the absence of evidence of activity of the other allele is likely not evidence of inactivity, but rather could be the result of marginally significant activity or limited power to detect activity: the mean FDR for alleles that were identified as active is 0.01 (ESCs and osteoblasts) and 0.02 (NPCs). The full data appears in Supplementary File 1. We now refer to this supplementary table in the paragraph introducing our analysis of active sequences: “We found that in ESCs, 8% (1,183) of sequence pairs drove expression in at least one of the alleles, 6% (814) in osteoblasts, and 4% (602) in NPCs (FDR < 0.05, Supplementary File 1, see Methods)”

4. Line 192, "At the same time, these sequences tended to be depleted of repressive marks". Figure 2b shows an enrichment of H3K27me3 in active sequences, while no depletion for H3K9me3. Figure 2c shows a mild depletion at best and Figure 2d a combination of enrichment and depletion of both marks. It seems to me that the quoted statement is not supported by the data.

H3K27me3 indeed shows an enrichment in ESCs. This is likely due to the fact that together with H3K4me3, the H3K27me3 modification in ESCs marks bivalent genes, which become active in later stages of differentiation (Bernstein B.E. et al., Cell, 2006, and Blanco et al., Trends in Genetics, 2020). To clarify the role of this modification in ESCs, we added the following statement to the figure legend: “The enrichment of H3K27me3 in ESCs possibly reflects the presence of this mark in bivalent genes, which become active in later stages of development (Blanco et al., Trends in Genetics, 2020).”

While H3K9me3 may not appear to be depleted when looking at the graph, our analysis shows a slight depletion that is significant (FDR = 0.02 for a minimum RNA/DNA ratio = 1). As the reviewer mentioned, in Figure 2c the depletion is mild, but here too – it is often significant (e.g., FDR = 0.006 for a minimum RNA/DNA ratio = 1 for H3K9me3) and becomes more substantial as the RNA/DNA cutoff increases. In Figure 2d the marks are indeed enriched in lower cutoffs but become depleted in higher cutoffs. We realize that our previous statement did not reflect the complex dynamics observed in these plots. Therefore, we rephrased it to “these sequences tended to show relatively fewer repressive marks compared to active marks (Figure 2b-d, Supplementary File 2)”. We have also added p-values and FDRs to this supplementary table.

5. Figure 3, There are a few differentially active sequences that show very low fold changes. I recommend using a cutoff to remove sequences with such small fold changes. What would be the biological significance of a sequence that is barely different between alleles but considered significantly differentially active?

We agree with the reviewer that some of these fold-changes might represent biologically insignificant changes. At the same time, any cutoff used here (in addition to the statistical cutoff we use) or in any other MPRA will likely be somewhat arbitrary, and for some sequences, even small fold changes might have a biological effect. As also shown both in these lentiMPRAs and other published lentiMPRA datasets (see for example, Inoue et al. Genome Research 2017, Kircher et al. Nature Communications 2019, Inoue et al. Cell Stem Cell, Klein et al. Nature Methods 2020), while the fold-changes can be small, in particular for enhancers, they are consistent across replicates. Also, of note, is that an advantage of our use of lentiMPRAs in this study is that we can quantitatively compare the modern and archaic human sequence side-by-side in the same lentiMPRA experiment to identify sequence changes that change their activity. Additionally, it is possible that for some genes, the effect per sequence is small, but the effects of several sequences might add up to a substantial effect on the gene. To clarify this, we added the following statement to the description of the limitations of MPRA: “At the same time, some minimally active sequences may not be biologically significant”.

Related, if there are sequences that show such small fold changes, it is possible that there are some differentially active sequences that disagree in the direction of bias between replicates. If there are any, those should certainly be excluded!

MPRAnalyze, which we used to determine the significance of differential activity, takes into account in its statistical analyses the level of agreement between replicates. Additionally, when comparing the agreement between different cell types (which is expected to be lower than between replicates), we observed a strong agreement, with 107 out of 109 sequences going in the same direction (P = 9.2x10-30, Binomial test). We thus rephrased this section: “We identified 109 sequence pairs that were differentially active in more than one cell type. Out of these 109, we found that 107 show the same direction of differential activity across cell types (P = 9.2x10-30, Binomial test), and we also observed a high correlation between the magnitudes of differential activity (Pearson’s r = 0.82, P = 1.6x10-27).”

6. Line 504, As most chromatin regions are inactive in most cell types, the use of insulators may more often than not lead to an increase in activity and thus to false positive detection of regulatory activity in lentiMPRA. This means, importantly, that for many of the active sequences in this study, there is no evidence that they would be active in this cell type at all, because not all sequences overlap a signature of activity (histone mark, active chromatin etc.). Those sequences might only be active because of the use of insulators. I think all that can really be said of those sequences is that they have the potential to actively drive transcription in a hypothetical cell type in which they would be accessible and active.

We agree with the reviewer that some of the sequences that show activity in our assay may in fact be inactive in their endogenous genomic context. Importantly, the majority of active sequences (86%) show at least one active histone modification peak. This is of course insufficient to determine whether these sequences are active in their endogenous genomic context. At the same time, the lack of active marks in some sequences does not necessarily mean that they are inactive. We refer to this point in the discussion: “while others might show activity that they otherwise would not have”. Following the reviewer’s comment, we added the following sentence to the “Characterization of active regulatory sequences” section: “Some of these sequences may show activity in the lentiMPRA experiment, but not in their endogenous genomic context”.

7. Lines 251-252, "[…] suggests that these sequences are likely to be differentially active in other cell types not assayed in this lentiMPRA." I disagree with this idea. I would argue that this indicates that regulatory logic is independent of cell type but mainly determined by the underlying sequence which determines which transcription factors can bind. That sequence will bind the same or related transcription factors expressed in two different cell types which will lead to similar effects in distinct cell types. But this would also be true for a sequence that is only active in cell type A but heterologously expressed in cell type Y. Especially due to the use of the insulator elements in the construct, we don't know if the sequence would really be active.

We agree that the observation that a sequence is differentially active in one cell type does not necessarily mean that this sequence is also differentially active, or even just active, in other cell types. Our claim referred to the observation that the magnitude of differential activity was significantly correlated between cell types (Pearson’s R = 0.82, P = 1.6x10-27), and that sequences tended to be differentially active in more than one cell type (overlap of 75-fold higher than expected, P < 10-100, Super Exact test, Figure 2a). In other words, it is unlikely that these sequences are differentially active only in the cell types we examined. To further test this, we analyzed eQTL data from GTEx v8. We tested how often eQTLs detected in the brain cortex are significant only in this tissue. We found that out of 500,000 randomly chosen variant-gene associations, only 113 (0.02%) were significant solely in the brain cortex. We repeated this for cultured fibroblasts and liver and found similar results (<1% of fibroblast and liver eQTLs were significant only in that cell type/tissue). Therefore, despite the complex and cell-type specific interactions between sequence and transcription factors, these interactions are often shared by more than one cell type.

8. Line 317, It appears that this analysis does not differentiate between activating and repressive transcription factors. I would expect that repressive transcription factors invert this relationship. Do the authors find that to be true?

We agree with the reviewer that increased binding by a repressor TF is expected to decrease, rather than increase, the activity of the bound sequence. Although we did not limit the analysis to activator TFs, the inherent design of MPRA is less likely to detect transcriptional repression due to the low basal activity of the minimal promoter used. Therefore, sequences that get bound by repressors and decrease the activity of the construct are less likely to be identified as active or differentially active (Tewhey et al., 2016). Indeed, all of the significant correlations we found between differential binding and differential activity were positive (Supplementary File 3). We also investigated previous studies of the TFs that came up in our analyses and found that none of them are exclusively a repressor, and most are predominantly or exclusively activators (with the exception of ZNF880, for which we did not find evidence establishing its role in transcriptional regulation). For examples of studies establishing the transcriptional role of these TFs, see the following PMIDs:

SP1 – 27697431

ZNF341 – 29907690

VEZF1 – 23921720

ZNF263 – 19887448

MAZ – 12684688

SP3 – 27697431

ZBTB17 – 29616049

ZNF281 – 32512343

EGR2 – 27856665

KLF6 – 32998281

All of which are now cited in the text. We now also emphasize in this section the limitation of the MPRA design in finding transcriptional repression: “…the use of a minimal promoter with basal activity in the MPRA design means that transcriptional repression is less likely to be detected, and therefore, further investigation is required in order to identify potential repressive activity in these sequences (see Discussion).”

9. Lines 332-336, The language here indicates that the authors looked for positive correlations only ("i.e., higher affinity to the modern human sequences was predictive of higher expression"). Then they state that "All of these TFs had a positive correlation" which is trivial if they only looked for positive correlations. It should be clarified if negative correlations were investigated as well.

Following the reviewer’s comment, we realized the text was unclear. We tested both positive and negative correlations in this analysis. To clarify this, we rephrased this section to: “we examined the correlation between binding and expression fold-change (either positive or negative). We found that changes to the motifs of 14 TFs were predictive of expression changes (Supplementary Figure 5d, Supplementary File 3). All of these TFs had a positive correlation between changes in their predicted binding affinity and changes in expression of their bound sequences, reflective of their known capability to promote transcription.”.

10. Line 342, The finding that ZNF281 and SP3 are significantly enriched among differentially active sequences in NPCs at an FDR of 0.05 is only corrected for the tests done in NPCs. However, the authors tested for enrichment among all three cell types (plus the union of all three cell types together). This means that multiple testing correction should have been performed over the entire dataset that was used in the test and not separately by cell type. If done over the entire dataset (all three cell types, ignoring the union of all cell types), the FDR for ZNF281 is 0.056 and that of SP3 is 0.13. This means that there is no enrichment of any transcription factors among differentially active sequences after appropriate multiple testing correction. I did not check this for the other analyses, but multiple testing correction should be performed over the entire analyzed dataset and not per cell type throughout the study, for instance for enrichment of GO, HPO etc.

Following the reviewer’s comment, we applied the FDR adjustment jointly on all three cell types in every analysis of this kind in the paper (TF binding motifs, predicted TF binding vs expression, Gene ORGANizer, HPO and Gene Ontology). Additionally, following comment #14, we have now removed from the analyses TFs that are not expressed in these cell types (see response to #14). Consequently, SP3 is no longer significantly enriched and was therefore removed from the “Molecular mechanisms underlying differential activity” section and from Supplementary Figure 4. ZNF281 remains significantly enriched (4.6-fold in NPCs, FDR = 0.04). Similarly, we have updated the Gene Ontology, Gene ORGANizer and HPO results (Figure 4a,b, and Supplementary Table 4). The top results we focused on in the Gene ORGANizer analysis (i.e., vocal cords, larynx, pharynx, and cerebellum) remain significant (FDR < 0.05), while the urethra is no longer significant and was therefore removed from the “Phenotypic consequences of differential expression” section and from figure 4a. For the HPO analysis, on top of applying FDR jointly on the three cell types, we have lowered the cutoff for the minimum number of genes per term from 5 to 3, because HPO terms tend to be linked to considerably fewer genes than GO and Gene ORGANizer and terms (Gene ORGANizer aggregates HPO terms into organs). Figure 4a,b, Supplementary File 4, and the “Phenotypic consequences of differential expression” section have all been updated with the new FDRs.

11. Line 551-557, This seems like a gross overinterpretation to me. If a binding site is detected or not given one base pair, is most closely related to the significance cutoff for detection of a binding site in the software being used. This could also be interpreted such that if it depends on one single base pair if there is binding or not, there might not be much binding at all. That the cell regulatory machinery uses the same cutoff for binding as the method to detect binding in this study is extremely unlikely. Further, binding is not only determined by the sequence itself but also by sequence activity and accessibility. In other words, the evolution of a new putative binding site in inaccessible heterochromatin is likely inconsequential. It would be more likely for a newly evolved, putative binding site to be active adjacent to an already existing, active enhancer, where the sequence would already be accessible. But lentiMPRA is not appropriate for detecting such events and speculations about newly evolved regulatory elements seem unwarranted.

We agree with the reviewer that this is a more complicated point than our description suggests. Therefore, we have removed this short paragraph from the discussion.

12. Line 621-622, How many of the variants included in the study have information in only one of the archaic human genomes?

For the design of this library, we took only sequences with information from all three high-coverage archaic genomes available when the library was designed (Altai and Vindija Neanderthals genomes, and the Denisovan genome). We later also removed sequences where a fourth archaic human genome (the Chagyrskaya Neanderthal) did not match the sequence in the other archaic genomes. To clarify this, we rephrased this section in the Methods to: “As a basis, we used the list of 321,820 modern human-derived single nucleotide changes reported to differ between modern humans and the Altai Neanderthal genome. We then filtered this list to include only positions where the Vindija Neanderthal and Denisovan sequences both match the Altai Neanderthal variant… and we subsequently also filtered out loci at which the Chagyrskaya Neanderthal genome did not match the ancestral sequence, bringing the final list of analyzed loci to 14,042”.

13. Lines 636-639, The way how the design of the oligos is described is very confusing and maybe actually misleading. For instance here: "For each variant we designed two fragments, one with the ancestral sequence and one with the derived sequence." I don't think that is correct. If I understand correctly, it should say "For each variant we designed two fragments, one with the ancestral variant state and one with the derived variant state, each in the background of the modern human [Is this correct?] sequence." because sequence versions do not differ in their entire sequence but only in the single variant whereas the sequences are otherwise identical.

Following this and comment #1 made by the reviewer, we realized this was unclear and rephrased this section. Please see our response to comment #1.

14. Line 905, Where TFs filtered for expression in the analyzed cell type? If not, that should be done!

We thank the reviewer for pointing this out. Following the reviewer’s comment, we removed from the analyses TFs that are not expressed in the cell types examined (FPKM <= 1). For ESC expression, we used ENCODE RNA-seq data for H1-hESC, downloaded from EBI Array Express under accession number: E-GEOD-26284. For osteoblast expression, we used the Moriarity et al., Nat Gen., 2015 RNA-seq data, downloaded from GEO under accession number: GSE57925. For NPC expression, we used the Lu et al. Mol. Cell, 2020 RNA-seq data, downloaded from GEO under accession number: GSE115407. This is now described in the Methods (“Differential transcription factor binding sites” chapter). The updated results appear in Supplementary Figure 5d and Supplementary File 3. Importantly, the TF we focused on in this section (ZNF281) is expressed in the relevant cell type (NPC).

15. Line 906-908, "For cases in which multiple unique motifs corresponded to the same TF, we used the motif with the largest score difference between alleles." I don't see why that would make sense. If there are different score differences for different motifs on the same allele, wouldn't the expectation be that in each case the highest score binds? That could mean that a substitution would swap one motif for another, such that the binding differential of either motif is irrelevant to the question but what really matters instead is the binding differential between the respective maximum binding affinity of the two motifs.

Following the reviewer’s comment, we now take for each motif its maximum predicted binding score in each of the two alleles and then take the score difference between these maximums. We updated the Methods (“Differential transcription factor binding sites” section) and the binding-expression correlation analysis in Supplementary Figure 5d accordingly (and there was no effect on the TF enrichment results).

Reviewer #2:

Weiss et al. screened 14,042 variants that differ between modern and archaic humans genome-wide for their gene-regulatory impact. Using massively parallel reporter assay (lentiMPRA) in three cell types, they find evidence that ~13% of genomic sequences harboring variants can drive gene expression (=active sequences), and that about 13% * 23% ≈ 3% of variants lead to differential regulatory activity (=differentially active sequences). The authors study, characterize and compare active and differentially active sequences with regard to genome annotations; they find, for example, that differentially active sequences are enriched in annotated enhancer loci. With regards to a possible mechanism for differential regulatory activity observed, Weiss et al. study differences in transcription factor (TF) binding site motifs between corresponding archaic vs. modern sequences, and they are able to report substantial motif changes for ZNF81 and SP3. Linking first variants to genes and then second genes to organs and phenotypes allows the authors to relate putative variants' effects to brain and vocal tract; then a variant with reduced regulatory activity in the modern human version (across all there cell types studied) is discussed in more detail, with the conclusion that reduced SATB2 expression could give rise to brain and craniofacial phenotypes different between modern and archaic humans.

Effects of (modern) human-specific variants are of great interest, so this study using powerful MPRA technology for a comprehensive assessment of all variants is exciting and a step forward. The authors are able to suggest specific variants that may underlie modern human traits, and overall the manuscript is clear. The linking of variants to organs and phenotypes is interesting, and it provides several potential entry points for follow-up studies. Importantly, the data provided constitute a significant resource for the community and will have an impact on further studies in this area.

We share the reviewer’s point of view that the catalog we provide and the trends we identify could serve as entry points and an important resource for the community for follow-up studies.

While it would be great to have a more definite linking of variants to genes, in general and specifically in the case of SABT2, where there is no direct experimental evidence that chr2:199,469,203 affects that gene's expression, that would require additional experiments. An easier improvement would be for the authors to publish the computer code used to generate results; currently this is not the case, nor do the authors state that it is made available upon (reasonable) request. This makes it very hard to reproduce results reported, study details of the analyses performed, or explore effects of parameter choices, thereby diminishing the value/impact of the study. The manuscript would benefit from addressing the following specific points.

We agree with the reviewer that further experimental evidence linking the chr2:199,469,203 variant to SATB2 would be useful. We are currently working on using CRISPR/Cas9 to edit this variant and investigate its function and effect on SATB2 expression. As the reviewer mentions, this is a relatively difficult task and we are unlikely to include it in the current study, especially considering the covid-19-related restrictions on wet lab work. It is also important to emphasize that several independent pieces of evidence suggest that this locus has a regulatory function that potentially affects SATB2: (i) This position is found within the putative promoter of SATB2, 4 kb upstream of the first TSS and 2 kb downstream of the second TSS, (ii) It is a relatively conserved position in vertebrates, found within a CpG island and a DNase I-hypersensitive site that shows active histone modification marks in all three cell types, and (iii) It is also found within a lncRNA that has been shown to regulate SATB2 expression (Liu et al., 2017). To further clarify to readers that the link between this variant and the expression of SATB2 remains to be determined, we toned down this section, which now reads: “Together, these data support a model whereby the C->T substitution in the putative promoter of SATB2, which emerged and reached fixation in modern humans, possibly reduced the expression of SATB2”.

Additionally, as the reviewer suggested, we have made the code fully available on Github (https://github.com/weiss19/AH-v-MH), and have added the following statement to the Methods: “Code is available for download on Github: https://github.com/weiss19/AH-v-MH”.

1. P8 / Supplementary Figure 1: Panel B: By eye the data looks quite a bit more noisy for ESC than for the other two cell types. Correlation is less for ESC, and seems driven by only few sequences. Additionally reporting Spearman correlation would be good, and I this should be discussed in the manuscript.

We agree with the reviewer that the ESC data appears noisier than the other two cell types. As per the reviewer’s suggestion, we calculated Spearman’s correlation for each comparison in Supplementary Figure 1b, and saw that while the correlation for each cell type decreased slightly, for ESCs the correlation decreased more substantially (from ~0.76 to ~0.48). Importantly, this correlation was still very significant (P = 1.8x10-143 to 2.4x10-139). We added Spearman’s r to the figure. We believe technical reasons could partially explain this with ESCs being very sensitive to infection (see methods where we note the MOI used for ESCs was 10 compared to 50 for the other cell types and that “We used a lower MOI for ESC because the cells are very sensitive to infection and a MOI higher than 10 would result in cell death”). Due to the difficulty in infection, the quality of the data was slightly lower for ESCs than for the other cell types. To address this, we added the following to the Results section: “We saw a strong correlation of RNA/DNA ratios between replicates for all cell types (Pearson’s r = 0.76 – 0.96, P < 10-100 , Supplementary Figure 2b), with the lower correlation scores being in ESC, likely due to our use of lower multiplicity of infection (MOI) in these cells due to their increased sensitivity to lentivirus infection”. We also note in the methods that we required additional sequencing for ESCs “due to lower lentivirus infection efficiency in this cell type.”

Panel D: It would be more informative using a log-scale for the RNA/DNA ratio. Also, it would be informative to see the tested sequences together with the scrambled and the positive.

We agree with the reviewer. Following the reviewer’s comment, we changed the scale to a log2 scale and added boxes for the tested sequences (split into active and inactive to show the distribution of each).

2. P9 L 169: It might be interesting for the authors to provide an overview of where/of what type the sequences are. For example., are they mostly intergenic, proximal, or inside genes? If so, is there perhaps an enrichment for certain types of genes, etc.? Even in case not novel, where archaic vs. modern human differences are would be good to know in light of contrasting active vs. inactive sequences later in the manuscript.

We agree with the reviewer that further information about the distribution of these variants was missing. Following the reviewer’s comment, we added to the paragraph introducing the 14,042 variants a description of their distribution, namely that “the vast majority of the variants are intergenic”, and also refer the readers to Supplementary Figure 1a, which describes the genomic annotation of these variants.

We have also added analyses of the enrichment of Gene Ontology, HPO and Gene ORGANizer terms within genes associated with these variants (Supplementary File 5) and added a description of these results to the “Phenotype enrichment analyses” section in the Methods. The top Gene Ontology terms associated with these genes are GO:0042472 inner ear morphogenesis (FDR = 2x10-3), GO:0007157 heterophilic cell-cell adhesion via plasma membrane cell adhesion molecules (FDR = 2x10-3) and GO:0048706 embryonic skeletal system development (FDR = 3x10-3). The top HPO terms associated with these genes are Broad nasal tip (FDR = 4x10-6), Long philtrum (FDR = 2x10-4), and X-linked inheritance (FDR = 3x10-4). The Gene ORGANizer analysis did not provide any significant results. Importantly, unlike the analyses of differentially active sequences, which were compared against a background of non-differentially active sequences to control for potential confounders, the full set of genes in this new analysis is not compared against a background. Therefore, these results may be affected by different confounders such as GC content, the ability to call SNPs, and DNA degradation patterns in ancient samples, and it is still to be determined to what extent these results reflect true evolutionary trends.

3. P9 L 172 / Methods P38-40 L761-802: While the authors describe their MPRA analysis, safe for actually attempting to re-do the analysis, it remains unclear whether sufficient details are provided to reproduce results; publishing code used for the analysis would be far better. Further on, processed data provided on GEO has replicates already aggregated, so that it cannot be used to reproduce, e.g., the MPRAnalyze part of the analysis and primary data would have to be re-processed first.

We agree with the reviewer that it is important for code to be made publicly available for readers of the manuscript. As such, we have created a Github repository to house the code for the project: https://github.com/weiss19/AH-v-MH.

We have also added the following line to the Methods: “Code is available for download on Github: https://github.com/weiss19/AH-v-MH.”.

Also, it is unclear why the authors use a combination of MPRAnalyze and a heuristic involving aggregation across replicates to assign oligos/sequences as "active". It'd be good to explain why that is done, and what is the effect on the set of active oligos.

Following the reviewer’s comment, we realized that the reasoning behind using two approaches to determine if a sequence is active was not clear enough. We applied two separate tests in order to add another layer of stringency to this analysis. Further stringency was required because MPRAnalyze did not seem to completely account for the lower lentiviral infection efficiency in the ESC library, which reduced library complexity, resulting in a possible overestimation of active sequences in these cells (i.e., twice or more active sequences identified in ESCs compared to the other cell types: 2,097 in ESCs, 1,059 in osteoblasts and 664 in NPCs). We therefore decided to apply another test of activity, as used in previous MPRA analyses, where the 95th percentile of the scrambled sequences expression is used as a cutoff for activity (see for example Kwasnieski et al., Genome Research, 2014). Indeed, almost half of the ESC active sequences did not pass the second test, while sequences in the other cells were not as affected (1,183 sequences passed both tests in ESCs, 814 in osteoblasts and 602 in NPCs). We now explain this in more detail in the “Measurement of expression and differential expression” section in the methods: ” Next, we applied a second test for activity, to account for potential overestimation of active sequences in ESCs due to the lower lentiviral infection efficiency in these cells.”

4. P10 L 190-192: Figure 2: Panels B-D: For each bin of RNA/DNA ratio enrichment appears to be plotted, based on a hypergeometic/Fisher test, probably as the logarithm of the odds ratio. It would be informative to plot a measure of variability/confidence for each bin as well. Standard implementation of the Fisher test in R provides a confidence interval, for example.

We agree with the reviewer that confidence intervals would be informative for these results. We have tried to add them to the figure itself but it made the figure very busy and hard to read (as each panel included over 40 confidence intervals, which overlap with one another at each x-axis cutoff). We therefore added this information to Supplementary File 2 and refer to the table in the legend of the figure.

5. P10 L 198-200, Figure 2: Similar to the point above, it would be good to see "error bars" on panel e.

Similar to the previous point, we added confidence intervals to the table, as the figure became too busy with overlapping whiskers.

6. P12 L 235: It is unclear what the null model for the Super Exact test is. Is overlap of differentially active sequences between cell types higher than expected, given what we already know about the overlap considering all active sequences? Or do cell types overlap for differentially active sequences more than expected given sequences were selected purely at random, but not as much as they do considering all active sequences?

The Super Exact test for the overlap of differentially active sequences was done in comparison to the active sequences as background. Following the reviewer’s comment, we realized that this was unclear from the text and we added this information to the legend of Figure 3d and the main text: “8-fold higher than expected compared to active sequences (P = 5x10-7, Super Exact test, Figure 3d).”

7. P15 L 272-274: Supplementary Figure 3: Variability between lentiMPRA replicates is not displayed in panel B; it would be good to use a display that shows variability in replicates of both arrays, lentiMPRA and luciferase assay.

We agree. Following the reviewer’s comment, we added per-replicate data to the figure (now Supplementary Figure 3c).

8. P16 L 300-305: For the methylation analysis, it there a cell type signal in the sense that up-regulation correlates with hypomethylation and down regulation with hypermethylation in osteoblasts, but not in the other two cell types?

The trend is observable also in the NPCs (e.g., the top 10 upregulating sequences are hypomethylated by 7% on average in modern compared to archaic humans, the top 10 downregulating sequences are hypermethylated by 13% in modern compared to archaic humans). This is in line with previous observations that differentially methylated regions tend to be shared across tissues (e.g., Hernando-Herraez et al., Dynamics of DNA methylation in recent human and great ape evolution, PLoS Genet., 2013). In the ESC data we do not see this correlation, perhaps due to its noisier nature in this experiment. We have added this to the “DNA methylation in active and differentially active sequences” section in the Methods.

9. P17 L 322-325: Supplementary Figure 4: What exactly is plotted in panels B-D: does each point correspond to a sequence-TF combination, or was some kind of averaging performed (over sequences or TFs)?

Each point in these plots corresponds to a single sequence, i.e., its expression fold-change (x-axis) and predicted TF binding difference (y-axis). To clarify this, we added “for each sequence” to the figure legend.

10. P44 L 900-903: FIMO analysis: Unclear why the lowest-matching score would be used. Why not use the highest log odds on that allele? Also, what background model was used for FIMO?

A background model for FIMO was generated using fasta-get-markov using the trimmed (or untrimmed, if >1 variant) sequences. As for the score difference calculation, FIMO uses a p-value cutoff of 10-4 for reporting predicted binding (https://meme-suite.org/meme/doc/fimo.html). Therefore, some sequence pairs have a reported score for only one of the alleles. To assign these sequence pairs with a score difference, we used a conservative approach where we assigned the unscored allele with the lowest score reported for that motif, representing a score that is closest to a p-value of 10-4. Because the unreported score could be anywhere below this lowest reported score, but could not have been above it, this results in a conservative underestimation of the score difference. Then, as in other sequence pairs and as the reviewer suggested, we took the highest score difference per TF. To clarify this, we rephrased this section in the Methods and also added the information about the background model (see the “Differential transcription factor binding sites” section in the Methods).

11. P18, L 342: How sensitive are the results of the enrichment analysis (ZNF281 and SP3) with regards to some of the parameters, like requiring that 10 or more motif changes are required for TFs to be considered? How many TFs did meet that requirement?

Following the changes we have made to this analysis based on comment #10 by Reviewer 1 (carrying out multiple testing adjustment across all three cell lines), SP3 is no longer significantly enriched in this analysis. We therefore examined the enrichment of ZNF281 across various cutoffs of the number of predicted bound motifs, ranging from 5 to a maximum of 14 (the number of motifs predicted to be differentially bound by ZNF281) in steps of 1. We found that with the exception of the cutoffs of 5 and 6 (where ZNF281 is only slightly above the significance threshold: FDR = 0.058 and FDR = 0.053, respectively), ZNF281 is the only significant TF across all of these cutoffs (FDR ≤ 0.05). Overall, 27 TFs pass the minimum cutoff of 10 predicted binding motifs that we used.

We also tested the other parameter used in this analysis: the FPKM level of each TF. We repeated the analysis for FPKM minimum cutoffs ranging from 0.5 to 3 in steps of 0.5, and found that ZNF281 is significantly enriched across all of these cutoffs, and is the only significantly enriched TF (FDR ≤ 0.05). We now describe this in the “Differential transcription factor binding sites” section in the Methods.

12. P20 L 371-376: In the supplementary table, it would be good to provide information about how a gene was linked to a variant (i.e., which of the four lines of evidence used support the link). Also, for the Hi-C data, the authors describe two lists in the methods a cell type-specific list and a general list. However, it appears unclear which one was used for linking to genes.

Following the reviewer’s suggestion, we added to the “Linked genes” sheet in Supplementary File 1 the information on the source through which each sequence was linked to its genes (see the “sources for sequence-gene links” columns). With regard to the type of Hi-C data used, for the cell-type-specific (stringent) list, we used the cell-type-specific Hi-C links. For the generic (non-stringent) list, we used the general list. In Supplementary File 1 we report both lists, but for the enrichment analyses we used the generic list to have more statistical power. To address this, we rephrased these sections in the Methods, which now read: “For the cell type-specific (stringent) list of locus-gene links, we included only those interactions observed in cell types corresponding to the cell lines used in our lentiMPRA: ESCs, NPCs and mesenchymal stem cells as an approximation for osteoblasts (given that osteoblast Hi-C data is not publicly available to the best of our knowledge, and that osteoblasts differentiate from MSCs). For the generic (non-stringent) list, we used interactions across any of the 27 tissue and cell types analyzed by Jung et al.” and “We conducted these analyses.… on the non-stringent locus-gene associations”.

13. P20 L 387-391 and P21 L 394-396: Which genes are driving theses terms? Do they largely overlap between different terms, or are they different genes in each term. It would be informative to report genes alongside terms in the supplemental table. Also for the HPO analysis in the following paragraph.

Following the reviewer’s comment, we added to Supplementary File 4 the lists of genes associated with each term in each of the enrichment analyses (HPO, Gene ORGANizer and Gene Ontology). With regard to the overlap of genes between different terms: in terms that are tightly associated with one another (e.g., HP:0000089 – Renal hypoplasia and HP:0000076 – Vesicoureteral reflux) we observe the expected overlap (UFD1, COMT, RAD21 for Renal hypoplasia and UFD1, COMT, RAD21,and CDC45 for Vesicoureteral reflux). For terms that are less tightly associated with one another (e.g., HP:0000028 – Cryptorchidism and HP:0000488 – Retinopathy), we see few to no overlapping genes (DGCR6, CDC45, UFD1, DGCR8, TXNRD2, COMT, RAD21 for Cryptorchidism and PTPN22, ERF2IP, TMEM231,and NDUFAF3 for Retinopathy).

14. P23 L 432: PyloP conservation scores are usually -log10(p-value), so while conservation is supported, the p-value corresponding to a score of 0.996 would be about 0.1, which is not that highly conserved.

Indeed, using “highly conserved” was inaccurate and we therefore changed it to “relatively conserved”.

15. P25 L 484-486. The authors write they have investigated how single nucleotide variants "have affected expression". While technically true in the context of the lentiMPRA, the authors measure changes in the ability of these sequences to drive expression, but do not show that they actually affect "real live" gene expression outside of an artificial construct. Perhaps change this sentence.

We agree that we cannot claim that these variants have necessarily affected expression in their native genomic locations. Therefore, we have rephrased this sentence and this line now reads “By comparing modern to archaic sequences, we investigated the regulatory potential of each of the 14,042 single-nucleotide variants that emerged and reached fixation or near fixation in modern humans”.

Reviewer #3:

The majority of sequence variants that separate modern and archaic human genomes are found in non-coding regions, often overlapping gene regulatory elements. It is difficult to study the functional significance of these variants on molecular phenotypes such as transcription and chromatin regulation due to the lack of intact biological material in archaic specimens. The current work uses an MPRA framework to compare the gene regulatory activities of derived modern human versus ancestral variants in three modern human cell types – ESCs, NPCs and osteoblasts. As expected, the authors found that active sequences were associated with enhancer annotations, and differentially active sequences were associated with divergent TF binding motifs. The presented MPRA data is well controlled. However, the differences detected by the various comparisons made were often subtle, or small in magnitude. This requires robust statistical analysis to understand the significance of those differences. While the authors were moderate in most of their claims, this reviewer thinks some of those claims require further analytical support with some adjustments in the text. With these additional non-experimental, but critical analytical revisions and some clarifications, this would be of potentially high interest to a broad readership as a contribution to eLife.

We thank the reviewer for their constructive feedback and agree that this work has the potential to be of high interest to the readers of eLife. We have added several analyses and clarifications following the reviewer’s comments.

1. Selection of cell types – this is crucial as TFs are the major drivers of MPRA activity. The claim that ESCs were used "due to their globally active transcription, providing a general view of gene expression" is based on a single study from mouse embryonic stem cells (Efroni et al., 2008). Mouse ESCs are known to have very different biological properties from human ESCs, with mouse ESCs representing a less differentiated state than human ESCs (Hanna et al., PNAS 2010). Therefore, this is not a sufficient justification for using ESCs as a comparative cell type. In particular, enhancers are typically cell type or lineage specific in their activity, and they are driven by TFs that are expressed in a lineage dependent manner. hESCs express basal and pluripotency factors, limiting their capacity to drive transcription of developmentally specialized enhancers. The use of NPCs and osteoblasts are reasonable considering the central interest of the study in understanding the role of derived alleles in shaping modern human phenotypic differences, but beyond that point it is not clear why the cell types were chosen (see point 2).

We are glad the reviewer found our use of NPCs and osteoblasts to be reasonable and within the scope of our project. Additional reasons for choosing these cell types were the abundance of previously published datasets for these cell types (e.g., chromHMM) that can be used to uncover regulatory patterns, their stability in culture and infection, and the ability to compare osteoblast data with ancient DNA methylation maps derived from bone samples.

With regard to ESCs, hESCs indeed show many characteristics that distinguish them from mESCs, and the field seems to be split on the exact characteristics of the transcriptome in hESCs. In addition to Efroni et al., 2008, which we cite in our manuscript, there are additional sources that suggest global transcription is a hallmark of embryonic stem cells (e.g., Gulati, et al. 2020 doi: 10.1126/science.aax0249, Egli). There is, however, also evidence that ESCs have their own unique transcriptome that changes through differentiation (Inoue, et al. 2019 doi: 10.1016/j.stem.2019.09.010), and as the reviewer mentioned, the extent to which mESCs and hESCs share transcriptional characteristics is still under debate. Since there is some disagreement as to whether ESCs show global transcription or not, we removed this reasoning from the manuscript, and instead present the other reasons why these cells were picked, namely because: (i) stem cells provide unique insight to an early developmental stage in human cells, which is commonly used to interrogate human evolution and early development (Blake, et al. 2018 doi: 10.1186/s13059-018-1490-5; Loh, et al. 2014 doi: 10.1016/j.stem.2013.12.007; Mora-Bermúdez, et al. 2016 doi: 10.7554/eLife.18683; Nuttle, et al. 2016. doi: 10.1038/nature19075), (ii) the NPCs we used were derived from this H1-ESC line, and (iii) ESCs are extensively characterized, providing a wealth of data available for these cells (e.g., chromHMM and mapping of bivalency). The text now reads “The brain and skeleton have been the focus of evolutionary studies due to their extensive phenotypic divergence among human lineages. Therefore, we chose human cells related to each of these central systems: NPCs and primary fetal osteoblasts. In addition, we used ESCs (line H1, from which the NPCs were derived) to gain insight into early stages of development. Finally, the abundance of previously published regulatory maps for these three cell types also enables the investigation of the dynamics of evolutionary divergence at different regulatory levels. While these cell types represent diverse systems, further studies are needed in order to characterize the activity of these sequences in other cell types.”.

2. Of the 14000 regions selected for the MPRA – what was the expectation given existing functional genomic data from modern humans and how did that inform the selection of cell types? This question is related to the above point. In supplementary figure 2 the authors provide the annotation classifications of the 14,000 fixed derived variants based on ChromHMM. This data indicates that ~25% of these variants overlap enhancers or TSS regions – but the relative distributions of the annotations differ depending on the ChromHMM context (ESC, NPC or Osteoblasts). It seems that using this methodology, one can predict other cell-types for which these variants may be active, thus providing an informed context to test the MPRA. It is important to know what percentage of the fixed variants are located in any type of regulatory region across a variety of cell-types. This information may also be important for understanding differential activity of modern vs archaic sequences, as the magnitude of differential activity may be cell type dependent.

We agree with the reviewer that chromatin marks in other cell types could help shed light on the activity of these sequences in additional cell types that were not included in this work, and that this information was missing from the manuscript. We therefore added to the manuscript chromHMM annotations and analyses in ten additional tissues/cell types. On top of ESCs, osteoblasts, and NPCs, these samples include mesenchymal stem cells, monocytes, skin fibroblasts, brain hippocampus, skeletal muscle, heart left ventricle, sigmoid colon, ovary, fetal lung, and liver. We picked these tissues to represent a wide range of regulatory programs, and to take samples where the Roadmap reported quality was defined as the highest (i.e., 1). The chromHMM annotations for these 13 tissues now appear Supplementary Table 1. We used these annotations to explore which of the 14,042 sequence pairs in this study show active chromHMM annotations (i.e., enhancer- and TSS-related annotations) and in what tissues. We found that 4,821 out of the 14,042 (34%) sequence pairs overlap active chromHMM annotations in at least one tissue, with 740 of the sequence pairs (5%) overlapping active chromHMM annotations in over half of the tissues. We now discuss this in the main text. We have also added a panel to Supplementary Figure 1 showing a histogram of these results, as well as the overlap within the three cell types used in this study. We hope this new data would be informative of potential activity of these sequences in tissues not included in this study.

Interpreting our results in light of previous MPRAs is challenging, not only because of key differences in statistical power and experimental design (e.g., sequence length), but also because of the way sequences were selected for each MPRA. Previous reporter assays and MPRAs on human variation used pre-filtered sets of variants by selecting sequences with putative regulatory function (e.g., eQTLs, TF binding sites, ChIP-seq peaks, or TSSs) and/or regions showing particularly rapid evolution (e.g., human accelerated regions, see Discussion for references). In our experiment, on the other hand, a key consideration was not to pre-filter the pool of sequences based on their potential activity and/or cell type specificity, but rather generate an unbiased scan of fixed modern human-derived variants. This allowed us to measure regulatory activity in regions that would otherwise be excluded by filtering for a specific set of marks, and to get a relatively unbiased estimate of the proportion of fixed changes that have a potential regulatory effect. In line with the fact that our data was not pre-filtered for putative regulatory regions, the proportion of active sequences we observed tended to be slightly lower than in previous studies. However, the magnitude of differential expression we observed, as well as the fraction of differentially active sequences out of the active sequences was similar to previous studies. For example, Tewhey et al., 2016 investigated 32,373 variants previously identified as cis-eQTLs in humans, and found that 12% of these sequences showed activity in their assay, and that 25% of them showed differential activity, with only 46 sequences showing a 2-fold change or higher. Uebbing et al. 2021 investigated 32,776 substitutions in human accelerated regions and human gain enhancers and found that 11.7% of HARs and 33.9% of HGEs were active in neurodevelopment, of which 27.5% and 34.6%, respectively, were differentially active between human and chimpanzee, with a mean fold-change of 1.58x. To address this, the text now reads: “In line with the fact that our data was not pre-filtered for putative regulatory regions, the proportion of active sequences we observed tends to be slightly lower than these previous studies. However, the magnitude of differential activity, as well as the fraction of differentially active sequences out of the active sequences was similar to previous studies16,28–31,83–85“. We trust that future unfiltered MPRA studies will help shed light on how these modern-archaic human differences compare to intra- and inter-species effects of single-nucleotide changes.

Related to this point, in the Discussion the authors point out that, based on previous studies, trans differences are likely minimal between modern and archaic humans, but what about trans differences in cell-type environment? Their assertion is perhaps an overgeneralization that diminishes the importance of trans effects. This is a point that should be addressed.

We agree with the reviewer that this sentence was an overgeneralization and that it diminished the importance of trans effects. Trans differences between cell types are indeed substantially more pronounced than trans differences between the same cell type of two human groups. Because we did not compare the activity of the modern human allele in one cell type to the archaic human allele in another cell type, these trans differences should not cause false positives in our results. However, trans differences between cell types very likely underlie the observation that differential activity in one cell type is not always shared with other cell types. We now realize that the statement was an overgeneralization and our intent was not to downplay the role of trans effects. We therefore rephrased it to: “Finally, differences in the trans environment of a cell could have an effect on the ability of a sequence to exert its cis-regulatory effect, resulting in cell-type-specific cis-regulatory effects, as we observed in our data. The trans environment of the same cell type might also differ between two organisms. However, the majority of the cis-regulatory changes we observed would be expected to be present in archaic human cells as well, considering that such conservation has been observed between substantially more divergent organisms (e.g., human-chimpanzee Ryu et al., 2018 and human-mouse Köhler et al., 2014). In other words, while trans-regulatory changes play a key role in species divergence, the trans environments of the same cell type in two closely related organisms tend to affect cis-regulation similarly.”

3. The authors report a modest median fold-change of 1.2x for differentially active sequences. They use a combination of criteria to associate these sequences with a putative target gene. An important next step to support the significance of these findings would be to understand whether these modest differences are linked to differential expression of target genes. Obviously, such a comparison cannot be made for modern and archaic human transcriptomes; however, one potential avenue for testing this idea would be to use public data from comparative human and chimp RNA-seq studies to look at whether the target genes are differentially expressed between modern human and chimp, provided that the ancestral allele for the differentially active sequence is found in chimp.

Following the reviewer’s comment, we investigated the expression of genes associated with differentially active sequences using human and chimp RNA-seq data. As the expression changes we report are driven by cis-regulatory changes, we used our recently generated RNA-seq data from human-chimp hybrid cells (Gokhman et al., 2021, Nature Genetics, in press, see read count and raw sequencing data in GEO accession numbers: GSE146481 and GSE144825). In these hybrid cells, the human and chimpanzee chromosomes are found within the same nuclear environment and are exposed to the same trans factors (e.g., transcription factors). Therefore, any differential expression observed between the human and chimpanzee alleles within these hybrid cells is attributed to cis-regulatory changes. The cells we generated are hybrid human-chimp induced pluripotent stem cells (iPSCs), and we therefore investigated whether genes associated with upregulating sequences in our ESC MPRA data tend to be upregulated in the hybrid iPSCs, and vice versa. It is important to note that differential expression between humans and chimpanzees reflects ~12 million years of evolution (i.e., changes that emerged along the human as well as along the chimpanzee lineages since their split from their common ancestor ~6 million years ago). However, our MPRA data was done on sequences that changed along the modern human lineage (~550-765 thousand years). Therefore, the human-chimp differences span an evolutionary time that is ~20-fold longer than the modern human lineage, and the effect of modern-derived variants on gene expression between humans and chimpanzees is expected to be largely diluted by the many other changes that accumulated along the rest of this time. Indeed, we observe a very slight, but significant correlation between differential expression observed in the MPRA data and differential expression observed in the human-chimp hybrid data (P = 0.017, Pearson’s r = 0.1).

These results now appear in a new section in the Methods titled: “Human-chimpanzee cis-regulatory expression changes”.

4. The section describing the relationship between DNA methylation and differentially active sequences seems to be one of the least well supported pieces of this study. In the absence of gene expression data, a key piece of support for the functional significance of differentially active sequences is to link those differences to epigenetic differences. The availability of DNA methylation maps from Neanderthal and Denisovan makes this possible. However, the way this data is presented (or not presented) is really difficult to interpret (i.e. reporting that upregulating sequences tend to be hypomethylated with a -1% difference compared to what?). A figure would have been very useful here.

Following the reviewer’s comment, we have made several changes to this paragraph. First, we reworded this section to clarify the comparison that was done. In short, the comparison was done between sequence pairs, i.e., for each pair of modern and archaic human sequences we tested if the human lineage showing reduced activity in the lentiMPRA is also the human lineage showing hypermethylation of that sequence in its native genomic context. For example, if a sequence is downregulating in modern compared to archaic humans, we tested if it is also hypermethylated in modern compared to archaic humans. The results we found show a weak, but significant, association of increased methylation and reduced activity (i.e., downregulating sequences show on average 2% more methylation in modern compared to archaic genomes, and upregulating sequences show 1% less methylation). The trend is slightly stronger when looking at the top 10 most divergent sequences (-7.4% and +8.3%, respectively). Following the reviewer’s suggestion, we added a violin plot showing the methylation levels for these sequences (Supplementary Figure 4a). Additionally, because methylation is known to be most tightly linked with activity in promoter regions, we added a section examining the relationship between differential methylation and differential activity of sequences within promoter regions. Here too, we found that sequences upregulating activity in modern humans are hypomethylated by 5% on average in the modern compared to the archaic genomes, while downregulating sequences are hypermethylated by 8% on average (P = 0.034, paired t-test). These results now appear in Supplementary Figure 4b. We have also toned down this paragraph and removed the sentence about the overlap with differentially methylated regions, as it was both confusing and was only based on four observations. Finally, to further investigate the relationship between DNA methylation and activity, we added a paragraph to the “Characterization of active regulatory sequences” section, where we explore the link between the two. We found that active sequences are significantly hypomethylated compared to inactive sequences (P = 5.5x10-13, t-test) and that their activity level (RNA/DNA ratio) is negatively correlated with methylation levels (P = 6.0x10-9, Pearson’s r = -0.24). We now show these new results in Figure 2f.

Also, it's important to note that many enhancers have low CpG density, while promoters and TSSs with high CpG density are typically hypomethylated regardless of gene activity (Lister et al. 2009, Stadler et al. 2011, Schlesinger et al. 2013). Thus, it is necessary to parse these regions when referring to methylation differences.

We thank the reviewer for this suggestion. Indeed, CpG composition varies across different functional parts of the genome and this could potentially be correlated with methylation levels. Following the reviewer’s comment, we added several tests to the differential methylation analysis section.

First, we ranked differentially active promoter sequences based on their CpG density and tested whether the hypomethylation we observe in upregulating sequences and the hypermethylation we observe in downregulating sequences become more pronounced in promoters with lower CpG density. Indeed, as the reviewer mentioned, the link between methylation and activity is stronger in CpG-poor promoter sequences: within upregulating promoter sequences, when taking those ranking at the bottom half based on their CpG density, they are hypomethylated by 15% on average in modern compared to archaic genomes (compared to 5% in all promoters, see above), while downregulating sequences show 15% hypermethylation (compared to 8% in all promoters, P = 0.006, paired t-test). We added these results to the methylation section and to Supplementary Figure 4c.

Second, to further test the potential effect of CpG density on our results, we compared CpG density in differentially active compared to non-differentially active sequences, and in upregulating compared to downregulating sequences. We found no significant difference in CpG density between these groups (p-values > 0.05, t-test). These results now appear in the “DNA methylation in active and differentially active sequences” section in the Methods.

Third, we tested chromHMM putative enhancer sequences and found a similar, but slightly weaker, trend compared to promoters, with 3% hypermethylation of downregulating sequences and 5% hypomethylation of upregulating sequences. Perhaps in accordance with the weaker link between enhancer methylation and activity, this trend is not significant despite having similar statistical power to the promoter analysis (P = 0.12, paired t-test). These results now appear in the “DNA methylation in active and differentially active sequences” section in the Methods.

Finally, it is important to note that the tests for hypo/hypermethylation were done in a pairwise manner by comparing the modern sequence to the archaic sequence, which are generally matched in their CpG density. Therefore, the decreased methylation we observe in upregulating sequences, and the increased methylation we observe in downregulating sequences are unlikely driven by differences in CpG density.

https://doi.org/10.7554/eLife.63713.sa2

Article and author information

Author details

  1. Carly V Weiss

    Department of Biology, Stanford University, Stanford, Stanford, United States
    Contribution
    Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Lana Harshman
    Competing interests
    No competing interests declared
  2. Lana Harshman

    1. Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, San Francisco, United States
    2. Institute for Human Genetics, University of California San Francisco, San Francisco, San Francisco, United States
    Contribution
    Resources, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Carly V Weiss
    Competing interests
    No competing interests declared
  3. Fumitaka Inoue

    1. Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, San Francisco, United States
    2. Institute for Human Genetics, University of California San Francisco, San Francisco, San Francisco, United States
    Present address
    Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
    Contribution
    Data curation, Validation, Investigation, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0657-434X
  4. Hunter B Fraser

    Department of Biology, Stanford University, Stanford, Stanford, United States
    Contribution
    Supervision, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8400-8541
  5. Dmitri A Petrov

    Department of Biology, Stanford University, Stanford, Stanford, United States
    Contribution
    Supervision, Funding acquisition, Writing - original draft, Writing - review and editing
    For correspondence
    dpetrov@stanford.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3664-9130
  6. Nadav Ahituv

    1. Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, San Francisco, United States
    2. Institute for Human Genetics, University of California San Francisco, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Validation, Investigation, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    nadav.ahituv@ucsf.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7434-8144
  7. David Gokhman

    Department of Biology, Stanford University, Stanford, Stanford, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    dgokhman@stanford.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3536-9006

Funding

National Human Genome Research Institute (1UM1HG009408)

  • Nadav Ahituv

National Institute of Mental Health (1R01MH109907)

  • Nadav Ahituv

National Institute of Mental Health (1U01MH116438)

  • Nadav Ahituv

Uehara Memorial Foundation

  • Fumitaka Inoue

Stanford Center for Computational, Evolutionary and Human Genomics

  • Carly V Weiss

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank Tal Ashuach (MPRAnalyze), Terence Capellini, Evelyn Jagoda, Martin Kircher, and the Fraser, Petrov, and McCoy labs for helpful feedback. DG was supported by the Human Frontier, Rothschild, and Zuckerman fellowships. This work was supported in part by the National Human Genome Research Institute grant 1UM1HG009408 (NA), the National Institute of Mental Health grants 1R01MH109907 (NA) and 1U01MH116438 (NA), the Uehara Memorial Foundation (FI), and the Stanford Center for Computational, Evolutionary and Human Genomics (CEHG).

Senior and Reviewing Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Publication history

  1. Received: October 3, 2020
  2. Accepted: March 30, 2021
  3. Version of Record published: April 22, 2021 (version 1)

Copyright

© 2021, Weiss et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,396
    Page views
  • 270
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Evolutionary Biology
    Kai He et al.
    Research Article

    The speciose mammalian order Eulipotyphla (moles, shrews, hedgehogs, solenodons) combines an unusual diversity of semi-aquatic, semi-fossorial, and fossorial forms that arose from terrestrial forbearers. However, our understanding of the ecomorphological pathways leading to these lifestyles has been confounded by a fragmentary fossil record, unresolved phylogenetic relationships, and potential morphological convergence, calling for novel approaches. The net surface charge of the oxygen-storing muscle protein myoglobin (ZMb), which can be readily determined from its primary structure, provides an objective target to address this question due to mechanistic linkages with myoglobin concentration. Here we generate a comprehensive 71 species molecular phylogeny that resolves previously intractable intra-family relationships and then ancestrally reconstruct ZMb evolution to identify ancient lifestyle transitions based on protein sequence alone. Our phylogenetically informed analyses confidently resolve fossorial habits having evolved twice in talpid moles and reveal five independent secondary aquatic transitions in the order housing the world's smallest endothermic divers.

    1. Computational and Systems Biology
    2. Evolutionary Biology
    Alexander J Tarashansky et al.
    Research Advance

    Comparing single-cell transcriptomic atlases from diverse organisms can elucidate the origins of cellular diversity and assist the annotation of new cell atlases. Yet, comparison between distant relatives is hindered by complex gene histories and diversifications in expression programs. Previously, we introduced the self-assembling manifold (SAM) algorithm to robustly reconstruct manifolds from single-cell data (Tarashansky et al., 2019). Here, we build on SAM to map cell atlas manifolds across species. This new method, SAMap, identifies homologous cell types with shared expression programs across distant species within phyla, even in complex examples where homologous tissues emerge from distinct germ layers. SAMap also finds many genes with more similar expression to their paralogs than their orthologs, suggesting paralog substitution may be more common in evolution than previously appreciated. Lastly, comparing species across animal phyla, spanning mouse to sponge, reveals ancient contractile and stem cell families, which may have arisen early in animal evolution.