Abstract
What genomic sequences make conserved genes generate divergent expression in closely related species, which may have critically driven human evolution, has puzzled researchers for decades. Genomic studies have examined species-specific gene birth, gene loss, and changes in promoters and transcription factor binding sites, but species-specific epigenetic regulation remains barely explored. This study identified human-specific long noncoding RNAs (lncRNAs) from GENCODE-annotated human lncRNAs, predicted their DNA binding sites (DBSs) genome-wide, analyzed these DBSs and their counterparts in modern humans (CEU, CHB, and YRI), archaic humans (Altai Neanderthals, Denisovans, and Vindija Neanderthals), and chimpanzees, and analyzed the impact of DBSs on gene expression in modern and archaic humans. The results suggest that human-specific lncRNAs and their DBSs have substantially rewired gene expression human-specifically and that the rewiring has evolved continuously from archaic to modern humans. Rewired gene expression promotes brain development, makes humans adapt to new environments and lifestyles, and causes differences in modern humans. These results uncover a critical dimension of human evolution and underscore the diverse functions of species-specific lncRNAs.
Introduction
Genomic changes that drive evolution include gene birth (Kaessmann, 2010), gene loss (Albalat and Canestro, 2016), and changes in regulatory sequences (Prud’homme et al., 2007). The limited genomic differences but substantial phenotypic and behavioral divergence between humans and other hominids make what genomic changes at what time have critically driven human evolution an enduring puzzle (Anton et al., 2014; Pollen et al., 2023). Although waves of studies have identified genes promoting human brain development (Evans et al., 2005; Fiddes et al., 2018; Mekel-Bobrov et al., 2005; Pinson et al., 2022; Suzuki et al., 2018), conclusions from gene birth-centered studies are controversial (Currat et al., 2006; Timpson et al., 2007; Yu et al., 2007). Moreover, genes important for brain development (e.g., NOTCH2NL and ASPM) do not necessarily regulate other human traits that precede or coincide with brain enlargement. Recent studies revealed the importance of regulatory sequences for human evolution, including new sequences (Liu et al., 2021), lost sequences (McLean et al., 2011), human accelerated regions (HARs) (Dong et al., 2016; Mangan et al., 2022; Prabhakar et al., 2006; Whalen and Pollard, 2022), and the turnover of transcription factor binding sites (TFBS) (Krieger et al., 2022; Otto et al., 2009; Zhang et al., 2023). An unexplored question is how epigenetic regulation contributes to human evolution.
Experimental studies have generated substantial findings on epigenetic regulation. (a) The chimpanzee and human genomes are almost 99% identical (Chimpanzee Sequencing and Analysis Consortium, 2005). (b) About one-third of human long-noncoding RNAs (lncRNA) are primate-specific (Derrien et al., 2012). (c) Species-specific lncRNAs have distinct expression profiles in organs of different species (Sarropoulos et al., 2019). (d) Many lncRNAs can bind to DNA sequences by forming RNA:DNA triplexes (Abu Almakarem et al., 2012), thereby recruiting histone and DNA modification enzymes to binding sites to regulate nearby gene expression (Lee, 2009; Roadmap Epigenomics et al., 2015). (e) Approximately 40% of differentially expressed human genes result from interspecies epigenetic differences (Hernando-Herraez et al., 2015). Compared with protein-coding gene-centered hypotheses of human evolution, the alternative one is plausible – human-specific (HS) lncRNAs and their DNA binding sites have human-specifically rewired gene expression to critically promote human evolution and determine human traits.
RNA sequence search by structure and sequence alignment can identify orthologous lncRNA genes in genomes (Nawrocki, 2014; Nawrocki and Eddy, 2013). Thus, HS lncRNAs can be identified upon human lncRNAs’ orthologues in other mammals. Many lncRNA can bind to DNA sequences following specific base-pairing rules to form RNA:DNA triplexes (Abu Almakarem et al., 2012). Thus, these rules allow computational prediction of lncRNAs’ DNA binding domains (DBD) and DNA binding sites (DBS) (Lin et al., 2019; Wen et al., 2022). Gene expression in most human organs and tissues has been profiled (GTEx Consortium, 2017). Further, cenomes of modern humans, archaic humans, and multiple apes have been sequenced (1000 Genomes Project Consortium et al., 2012; Chimpanzee Sequencing and Analysis Consortium, 2005; Meyer et al., 2012; Prufer et al., 2017; Pru fer et al., 2013). Together, these results make it feasible to explore the above alternative hypothesis.
In this study, we identified HS lncRNAs upon the orthologues of the GENCODE-annotated human lncRNA genes in 16 mammalian genomes, predicted HS lncRNAs’ DBDs and DBSs in the genomes of modern humans (CEU, CHB, and YRI), predicted these DBSs’ counterparts in the genomes of archaic humans (Altai Neanderthals, Denisovans, Vindija Neanderthals) and chimpanzees (Figure 1A), and experimentally validated DBS prediction. Then, we applied genome analyses and evolutionary/population genetics analyses to HS lncRNAs and their DBSs, applied transcriptome analysis to modern and archaic humans using the GTEx project data, and analyzed multiple experimental datasets. Our genome, transcriptome, and evolutionary analyses suggest that HS lncRNAs and their DBSs have continuously and distinctly rewired gene expression during human evolution. The rewired gene expression influences or determines diverse human traits, from an enlarged brain to modified metabolism pathways.
Results
1. HS lncRNAs regulate diverse genes and transcripts
After searching for orthologues of the 13562 GENCODE-annotated human lncRNA genes (v18) in 16 mammalian genomes and the genomes of archaic humans by RNA sequence and structure alignment using the Infernal program (Figure 1A) (Lin et al., 2019; Nawrocki, 2014; Nawrocki et al., 2009), we identified 66 HS lncRNA genes (Supplementary Table 1). We predicted HS lncRNAs’ DBSs in the 5000 bp promoter regions of the 179128 Ensembl-annotated transcripts (release 79) using the LongTarget program (Supplementary Note 1) (Lin et al., 2019). 105141 strong DBSs (mean binding affinity >60, mean length >147 bp) were identified in 96789 transcripts of 23396 genes, and 152836 weak DBSs (36< mean binding affinity <60) were identified in 127898 transcripts of 33185 genes (Figure 1B; Supplementary Table 2, 3). Genes and transcripts whose promoter regions contain the DBS of an HS lncRNA were assumed targets of the HS lncRNA. Several HS lncRNAs have DBSs in nearly all genes and transcripts (e.g., RP11-423H2.3 has strong DBSs in 73855 transcripts of 17941 genes), and only about 1.5% of target genes (0.6% of target transcripts) are human-specific genes and transcripts (Figure 1C), indicating that HS lncRNAs substantially rewire conserved genes and supporting the above rewiring hypothesis. Many transcripts have DBSs for multiple HS lncRNAs, many HS lncRNAs have multiple DBSs in a single transcript, and HS lncRNAs themselves show complex targeting relationships (Figure 1D; Supplementary Note 2), indicating the complexity of the rewiring.
HARs are conserved genomic loci but have evolved acceleratedly in the human lineage. Multiple researchers assume that HARs critically influence human evolution (especially brain evolution); however, results obtained from HAR studies are controversial (Levchenko et al., 2018). The Zoonomia Project recently identified 312 HARs important for 3D genome rewiring and neurodevelopment (Keough et al., 2023). We found that only 8 HARs overlap 26 DBSs of 14 HS lncRNAs, suggesting that DBS and HAR are sequences of different functions subject to separate examination.
We validated DBS prediction using multiple methods and datasets (Supplementary Note 1; Supplementary Fig 1-5). (a) HS lncRNA DBSs overlap well with experimentally identified DNA methylation and histone modification signals in multiple cell lines (https://genome.UCSC.edu). (b) Many DBSs co-localize with annotated cis-regulatory elements (cCREs) in genes’ promoter regions (https://genome.UCSC.edu). (c) predicted and experimentally detected DBSs of NEAT1, MALAT1, MEG3 overlap well (Mondal et al., 2015; West et al., 2014). (d) We used CRISPR/Cas9 to knock out predicted DBDs (100-200 bp) of seven lncRNAs (three HS lncRNAs and four wrongly transcribed lncRNAs) in multiple cell lines, performed RNA-seq before and after DBD knockout, and analyzed differential gene expression The |fold change| of target genes (genes with DBS) was significantly larger than the |fold change| of non-target genes (genes without DBS) (one-sided Mann-Whitney test, p = 3.1e-72, 1.49e-114, and 1.12e-206 for the HS lncRNA RP13-516M14.1, RP11-426L16.8, and SNORA59B, and p = 2.58e-09, 6.49e-41, 0.034, and 5.23e-07 for the four wrongly transcribed lncRNAs), suggesting that changed expression of target genes is caused by DBD knockout. (e) Finally, the LongTarget program uses the Smith-Waterman local alignment to identify DBD/DBS binding; the alignment algorithm determines that a DBS of 147 bp is extremely unlikely to be generated by chance (p < 8.2e-19 to 1.5e-48).
2. HS lncRNAs’ target genes characterize human evolution and traits
Since strong and weak DBSs may occur at different times in human evolution, we separately examined whether their target genes are enriched in functions important for human evolution. We sorted target genes by DBS binding affinity, chose the top 2000 genes (DBS length >252 bp and DBS binding affinity >151) and bottom 2000 genes (DBS length <60 bp but DBS binding affinity >36), and applied over-representation analysis using the g:Profiler program and Gene Ontology (GO) database to the two gene sets (Supplementary Table 4). The top and bottom 2000 genes are co-enriched in GO terms important for human evolution (instead of just brain evolution), including “neuron projection development” (GO:0031175), “learning” (GO:0007612), “regulation of anatomical structure size” (GO: GO:0090066), “head development” (GO:0060322), and “regulation of locomotion” (GO:0040012)) (Supplementary Note 3; Supplementary Table 4). 514 strongest DBSs have binding affinity >300 (DBS length >500 bp) (Supplementary Table 5), and genes containing these strongest DBSs include IFNAR1 and NFATC1 (important for immune function), KIF21B and NTSR1 (critical for neural development), SLC2A11 and SLC2A1 (involved in glucose usage), BAIAP3 and TAS1R3 (a brain-specific angiogenesis inhibitor and a receptor for the sweet taste response), and some primate-specific noncoding RNA genes (e.g., CTD-3224I3.3 which is highly expressed in the cerebellum, lung, and testis) (Table 1). Of note, genes with weak DBSs are enriched in GO terms including “response to alcohol” (GO:0097305), “response to temperature stimulus” (GO:0009266), “DNA methylation or demethylation” (GO:0044728), and “female pregnancy” (GO:0007565) (Supplementary Note 3; Supplementary Table 4). These findings help reveal how epigenetic regulation reshapes different human traits at different times.
3. The rewiring of gene expression has greatly evolved
The human genome is almost 99%, 99%, and 98% identical to the genomes of chimpanzees, bonobos, and gorillas (Chimpanzee Sequencing and Analysis Consortium, 2005; Prufer et al., 2012). Since where sequence differences are in these genomes is a question, we assessed to what extent HS lncRNAs’ DBS sequences are human-specific and found that 97.81% of strong DBSs have counterparts in chimpanzees. This suggests that many DBSs are not human-specific.
To examine the evolution of HS lncRNAs and their DBSs during human evolution, we identified their counterparts in Altai Neanderthals, Denisovans, and Vindija Neanderthals (Meyer et al., 2012; Prufer et al., 2017; Prüfer et al., 2013). HS lncRNAs and strong DBSs have counterparts in these archaic humans, but the distances of HS lncRNAs are smaller than the distances of DBSs between archaic and modern humans (“distance” means distance per base), supporting that many DBSs occur in ape genomes instead of human-specific. We computed DBS distances in two ways. The first is from the reconstructed human ancestor (downloaded from the EBI website) to chimpanzees, archaic humans, and modern humans; the second is from chimpanzees and archaic humans to modern humans. The first type of distances has the anomaly that many DBSs’ distances from the human ancestor to modern humans are smaller than to archaic humans (for example, DBSs in the top 20% (5033) genes in the human ancestor have distances >0.015 to modern humans, but more DBSs have distances >0.015 to Altai Neanderthals, Denisovans, and even chimpanzees) (Supplementary Note 3). This anomaly may be caused by the human ancestor being built using six primates without archaic humans; therefore, no analysis was followed. In contrast, the second type of distances agrees well with the phylogenetic distances between chimpanzees, archaic humans, and modern humans. Notable, large sequence changes occurred in limited DBSs (Figure 1E), and genes containing these greatly changed DBSs may critically characterize human evolution. Given the high similarity between human and chimpanzee genomes, the top 20% (4248) genes in chimpanzees may sufficiently characterize the human-chimpanzee differences, and these genes have DBS distances >0.034. Taking this as a threshold, 1256, 2514, and 134 genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals have DBS distance >0.034, and these numbers agree with the phylogenetic relationships between these archaic and modern humans (Figure 1A; Supplementary Table 6).
To examine target genes with mostly changed DBSs are enriched in what functions, we applied over-representation analysis using the g:Profiler program and Gene Ontology (GO) database to target genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals. To make statistical significance comparable, we used the top 1256 genes in chimpanzees and three archaic humans as the inputs. The 1256 genes in chimpanzees and Altai Neanderthals are enriched in many common GO terms; however, “behaviour” (GO:0007610) and “cognition” (GO:0050890) have much higher significance in Altai Neanderthals than in chimpanzees (Figure 2). In addition, enrichment of such GO terms as “learning” (GO:0007612), “learning or memory” (GO:0007611), and “glucose homeostasis” (GO:0042593) is specific to Altai Neanderthals. Agreeing with the close relationship between Vindija Neanderthals and modern humans, the 134 genes in Vindija Neanderthals are enriched only in two GO terms with low significance (“smoothened signalling pathway” (GO:0007224) and “regulation of smoothened signalling pathway” (GO:0008589)) (Supplementary Table 7). These results help reveal the periods and positions of epigenetic regulation changes and help infer the periods of related phenotypic changes.
4. Epigenetic regulation by HS lncRNAs shows differences in modern humans
To check whether the rewiring is still going on, we analysed single nucleotide polymorphisms (SNPs, with minimal allele frequency (MAF) ≥0.1) in DBSs using the 1000 Human Genome data (1000 Genomes Project Consortium et al., 2012). Many DBSs are highly polymorphic (>10 SNPs, 5 times the average number of SNPs for all DBSs); many DBSs have mostly changed sequence distances from modern humans to archaic humans and chimpanzees; and the intersection of the above two sets are in genes important for human evolution (Supplementary Table 8), including IFNAR1, NFATC1, TAS1R3, INS, ST3GAL4, and FN3KRP (Table 2). SNPs inside these DBSs, whether favored or recessive, affect DBS binding affinity in modern humans. A notable case is sugar metabolism. A feature of modern human diets is high sugar intake, which causes non-enzymatic oxidation of proteins (i.e., glycation) and deactivates protein functions. Among proteins encoded by genes whose DBSs are most SNP-rich and have mostly changed sequence distances, TAS1R3 recognizes diverse natural and synthetic sweeteners, Insulin decreases blood glucose concentration, ST3GAL4 regulates protein glycosylation, and FN3KRP deglycates proteins. DBSs in these genes not only have counterparts with mostly changed sequences in chimpanzees and archaic humans, but also are SNP-rich, indicating continuous evolution of these genes’ regulation by HS lncRNAs.
We next used multiple statistical tests, including XP-CLR (Chen et al., 2010), iSAFE (Akbari et al., 2018), Tajima’s D (Tajima, 1989), Fay-Wu’s H (Fay and Wu, 2000), the fixation index (Fst) (Weir and Cockerham, 1984), and linkage disequilibrium (LD) (Slatkin, 2008), to detect selection signals in HS lncRNAs and strong DBSs in the modern human CEU, CHB, and YRI (the representative populations of Europeans, Asians, and Africans). These tests detected selection signals in several HS lncRNA genes in CEU and CHB but not in YRI and selection signals in more DBSs in CEU and CHB than in YRI (Supplementary Note 4, 5; Supplementary Table 9, 10, 11). Examples include the DBSs in the two pigmentation-related genes MC1R and MFSD12 and in the two odor reception-related genes OR6C1 and TAS1R3. The enriched SNPs, detected selection signals, and annotated gene functions suggest specific and ongoing influence of rewired gene expression on human evolution.
5. SNPs in DBSs have cis-effects on gene expression
To examine the impact of DBSs on gene expression, we assessed the expression quantitative trait loci (eQTLs) in DBS sequences using the Genotype-Tissue Expression (GTEx) data (GTEx Consortium, 2017). 1727 SNPs with MAF ≥0.1 in at least one population and with absolute values of cis-effect size (ES) ≥0.5 in at least one tissue were identified in DBSs in autosomal genes (Supplementary Table 12,13). These eQTLs include 372 “conserved” ones (also in DBSs in the three archaic humans) and 1020 “novel” ones (in DBSs only in modern humans). Many conserved eQTLs are in brain tissues and have high derived allele frequencies (DAFs) in all three modern populations. A notable eQTL that has high DAFs in all three populations and positive ES ≥0.5 in 44 out of 48 GTEx tissues is rs2246577 in the DBS in FN3KRP, whose encoded protein deglycates proteins to restore their function (Table 2). In contrast, many novel eQTLs are in a particular tissue and have population-specific DAFs (Supplementary Note 6).
We performed two analyses to examine how eQTLs in DBSs influence gene expression via lncRNA-DBS interactions. First, we computed the expression correlation between HS lncRNAs and target transcripts with DBSs having eQTLs in specific tissues. Most (94%) HS lncRNA-target transcript pairs showed significant expression correlation in tissues in which the eQTLs were identified (|Spearman’s rho| >0.3 and FDR <0.05). Second, we examined whether eQTLs are more enriched in DBSs than in promoter regions by computing and comparing the density of eQTLs in each DBS and each Ensembl-annotated promoter. The results suggest that eQTLs are more enriched in DBSs than in promoters (one-sided Mann-Whitney test, p = 0.0) and that these eQTLs influence gene expression via lncRNA-DBS interactions (Supplementary Note 6). Tissue-specific eQTLs and population-specific SNPs in DBSs confirm the cis-effects of HS lncRNA-mediated epigenetic regulation on gene expression.
6. HS lncRNAs contributed to brain evolution from archaic to modern humans
We finally examined the impact of HS lncRNA-DBS interaction on gene expression in the GTEx tissues (GTEx Consortium, 2017). 40 autosomal HS lncRNAs are expressed robustly (median TPM > 0.1) in at least one tissue, and they form 198876 pairs with target transcripts in all tissues. 45% of these pairs show a significant expression correlation in specific tissues (Spearman’s |rho| >0.3 and FDR <0.05). In contrast, when randomly sampling 10000 pairs of lncRNAs and protein-coding transcripts genome-wide, the percent of pairs showing this level of expression correlation (Spearman’s |rho| >0.3 and FDR <0.05) is only 2.3%. Equally notable is that 56% of HS lncRNA-target transcript pairs with a significant correlation are present in at least one brain region (Figure 3A), indicating substantial gene expression regulation by HS lncRNAs in the brain.
We analysed nine experimental datasets to verify the above conclusion substantial gene expression regulation by HS lncRNAs in the brain (Supplementary Note 7). First, we analysed two datasets of epigenetic studies (one examined H3K27ac and H3K4me2 profiling of human, macaque, and mouse corticogenesis, and the other examined gene expression and H3K27ac modification in eight brain regions in humans and four other primates) (Reilly et al., 2015; Xu et al., 2018). 84% and 73% of genes in the two datasets have DBSs for HS lncRNAs, significantly more enriched in HS lncRNA targets than the genome-wide background (p = 1.21e-21 and 1.2e-56, two-sided Fisher’s exact test). Second, we analysed two datasets of PsychENCODE studies (one examined spatiotemporally differentially expressed genes and spatiotemporally differentially methylated sites in 16 brain regions, and the other examined the spatiotemporal transcriptomic divergence across human and macaque brain development) (Li et al., 2018; Zhu et al., 2018). In the first dataset, 65 HS lncRNAs are expressed in all 16 brain regions, 109 transcripts have spatiotemporally differentially methylated sites in their promoters, and 56 of the 109 transcripts have DBSs for HS lncRNAs. In the second dataset, 8951 genes show differential expression between the human and macaque brains, and 72% of differentially expressed genes in the human brain have DBSs of HS lncRNAs. Thus, both datasets show significant enrichment for HS lncRNA regulation. Third, three studies identified several genes critically regulating cortical expansion (Florio et al., 2018; Johnson et al., 2018; Suzuki et al., 2018). Of the 40 protein-coding genes reported by these studies, 29 have DBSs of HS lncRNAs in their promoter regions (Supplementary Table 14). Thus, these genes are enriched with DBSs of HS lncRNAs compared with the genome-wide background (p < 0.01, two-sided Fisher’s exact test). Finally, we analysed two datasets of brain organoids studies. By establishing and comparing cerebral organoids between humans, chimpanzees, and macaques, Pollen et al. identified 261 human-specific gene expression changes (Pollen et al., 2019). By fusing human and chimpanzee iPS cells and differentiating the hybrid iPS cells into hybrid cortical spheroids, Agoglia et al. generated a panel of tetraploid human-chimpanzee hybrid iPS cells and identified thousands of genes with divergent expression between humans and chimpanzees (Agoglia et al., 2021). The 261 and 1102 genes in the two datasets are enriched with DBSs of HS lncRNAs compared with the genome-wide background (p = 1.2e-16 and 3.4e-74).
We next examined the evolution of the impact of HS lncRNAs and their DBSs on gene expression in the GTEx tissues. For each DBS in an HS lncRNA-target transcript pair correlated in a GTEx tissue, we computed the sequence distance of the DBS from modern humans to archaic humans. Then, we compared the distribution of DBS sequence distances in each tissue with the distribution of DBS sequence distances in all tissues (one-sided two-sample Kolmogorov-Smirnov test). DBSs in HS lncRNA-target transcript pairs correlated in brain regions have significantly changed sequence distances since the Altai Neanderthals and Denisovans (Figure 3B), suggesting that gene expression regulation by HS lncRNAs in the brain has experienced significant evolution and that HS lncRNAs contributed greatly to recent brain evolution (Ponce de Leon et al., 2021). Results from other GTEx tissues, including Kidney-Cortex, Ovary, and Skin-Sun Expressed (Figure 3B), also suggest continuous influences of rewired gene expression on human evolution.
Discussion
Modern humans differ from archaic humans and great apes trivially in genomes but greatly in phenotypes, making what genomic differences critically determine the phenotypic differences a profound question. This question has been addressed by many researchers along multiple approaches, and attention has been especially paid to new protein-coding genes, HARs, and evolved TF-TFBS interactions. Findings include HS protein-coding genes promote rapid neocortex expansion (Fiddes et al., 2018; Florio et al., 2018; Pinson et al., 2022), HARs are significantly enriched in domains important for human-specific 3D genome organization (Keough et al., 2023), and gene expression is rewired by TFs and TFBSs (Johnson, 2017). These findings suggest that species-specific lncRNAs and their DBSs may also substantially rewire gene expression. This study identified HS lncRNAs, their DBSs, and counterparts of these DBSs in ancient humans and multiple apes. Our analyses of the evolution of and interaction between HS lncRNAs and their DBSs generate evidence suggesting that HS lncRNAs and their DBSs have contributed greatly to human evolution by distinctly rewiring gene expression. Our analyses of multiple experimental datasets support this conclusion. The rewiring can be dated back to chimpanzees, has significantly evolved from archaic to modern humans, and still influences differences in modern humans. The occurrence of HS lncRNAs and their DBSs may have three situations – (a) HS lncRNAs preceded their DBSs, (b) HS lncRNAs and their DBSs co-occurred, (c) HS lncRNAs succeeded their DBSs. Our results support the third situation and the rewiring hypothesis.
We performed seven CRISPR/Cas9 experiments to validate DBS prediction and analysed multiple experimental datasets to support the main conclusion. Parameters and thresholds may influence data analysis results. When running Infernal, LongTarget, MAFFT7, and Phylip, default parameters were used. When analysing enrichment of GO terms in gene lists with strong and weak DBSs, the top and bottom 2000 genes were used, with the whole genome as the background. When analyzing enrichment of GO terms in gene lists with DBSs with large sequence distances, the threshold of 0.034 was used. DBSs in 4248, 1256, 2514, and 134 genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals have distance >0.034 to modern humans, and these numbers sufficiently characterize the human-chimpanzee differences and qualitatively agree with the phylogenetic relationships. Other parameter and threshold values would generate somewhat different results, but enriched GO terms with high significance will be quite stable. We note that most of the reported are findings rather than conclusions, indicative or suggestive of something that answer the primary question of what genomic differences critically determine the phenotypic differences between humans and apes and between modern and archaic humans.
Several findings are noteworthy. First, Neanderthal-inherited sequences have measurable impacts on gene expression in modern humans, and the impacts are least detectable in the brain (McCoy et al., 2017). Here we find HS lncRNAs have more significantly influenced gene expression in the brain than in other tissues. Second, the regulation of many genes by HS lncRNAs has undergone great evolution, specific examples including those making humans adapt to high sugar intake. Third, intensively regulated genes are enriched in different functions at different periods of human evolution. Specifically, “old” SNPs in DBSs are in genes regulating neural development, and “new” SNPs in DBSs are in genes regulating such processes as glucose metabolism. These findings reveal the diverse, continuous, and subtle contribution of HS lncRNA to human evolution.
The above findings not only reveal HS lncRNAs’ functions and adaptive human evolution, but also raise multiple new questions. First, the protein-coding regions in the mouse and human genomes are about 85% identical; however, many mouse lncRNAs are rodent-specific (Yue et al., 2014). Thus, two questions are how mouse-specific lncRNAs specifically rewire gene expression in mice and how human- and mouse-specific rewiring influences the cross-species transcriptional differences (Breschi et al., 2017; Hodge et al., 2019; Zhu et al., 2018). Second, whether mice and other mammals have lncRNAs such as RP11-423H2.3 that substantially rewires gene expression species-specifically. Third, to what extent clade- and species-specific lncRNAs would make speciation and evolution preordained. These interesting questions have practical importance.
Material and methods
1. Data sources
The sequences of human (GRCh38/hg38, GRCh37/hg19) and chimpanzee genomes (panTro5) were obtained from the UCSC Genome Browser (http://genome.UCSC.edu). Three high-quality archaic human genomes were obtained from the Max Planck Institute for Evolutionary Anthropology (https://www.eva.mpg.de/genetics/index.html), which include an Altai Neanderthal which lived approximately 122 thousand years ago (kya) (Pru fer et al., 2013), a Denisovan (an Asian relative of Neanderthals) which lived approximately 72 kya (Meyer et al., 2012), and a Vindija Neanderthal which lived approximately 52 kya (Prufer et al., 2017). The ancestral sequences for the human genome (GRCh37) (which were generated upon the EPO alignments of six primate species human, chimpanzee, gorilla, orangutan, macaque, and marmoset) were downloaded from the EBI website (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/). The SNP data of modern humans were obtained from the 1000 Genomes Project (phase I) (1000 Genomes Project Consortium et al., 2012). The three modern human populations are CEU (Utah residents with Northern and Western European ancestry), CHB (Han Chinese in Beijing, China), and YRI (Yoruba in Ibadan, Nigeria), which contain 100, 99, and 97 individuals, respectively. The genetic map files of modern humans were downloaded from the 1000 Genomes Project website (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110106_recombination_hotspots/). The 1000 Genomes phased genotype data in VCF format were downloaded from the International Genome Sample Resource website (http://www.internationalgenome.org/). Annotated human genes and transcripts were obtained from the Ensembl website (http://www.Ensembl.org). FASTA-format nucleotide sequences were converted from VCFs using the vcf-consensus program in the VCFtools package when necessary (Danecek et al., 2011). Data in the GeneCards Human Gene Database (www.genecards.org) were used to annotate genes and transcripts.
RNA-sequencing (RNA-seq) data (transcripts per million [TPM]) and eQTL data of human tissues were obtained from the Genotype-Tissue Expression Project (GTEx, v8) website (https://gtexportal.org/) (GTEx Consortium, 2017). Histone modification and DNA methylation signals in multiple cell lines and ENCODE Candidate Cis-Regulatory Elements (cCRE) were obtained from the UCSC Genome Browser (http://genome.UCSC.edu). The genome-wide DNA binding sites of lncRNA NEAT1, MALAT1, and MEG3 were obtained from experimental studies (Mondal et al., 2015; West et al., 2014). The predicted and experimentally detected DBSs in target genes of MALAT1, NEAT1, and MEG3 are given in supplementary tables (Supplementary Table 15).
2. Identifying HS lncRNA genes
We used the Infernal program (Nawrocki et al., 2009), which searches orthologous RNA sequences upon sequence and structure alignment, to identify orthologous exons in 16 mammalian genomes for each exon in each of the 13562 GENCODE (v18)-annotated human lncRNA genes (Lin et al., 2019). The 16 mammals were chimpanzee (CSAC 2.1.4/panTro4), macaque (BGI CR_1.0/rheMac3), marmoset (WUGSC 3.2/calJac3), tarsier (Broad/tarSyr1), mouse lemur (Broad/micMur1), tree shrew (Broad/tupBel1), mouse (GRCm38/mm10), rat (Baylor3.4/rn4, RGSC6.0/rn6), guinea pig (Broad/cavPor3), rabbit (Broad/oryCun2), dog (Broad CanFam3.1/canFam3), cow (Baylor Btau_4.6.1/bosTau7), elephant(Broad/loxAfr3), hedgehog (EriEur2.0/eriEur2), opossum (Broad/monDom5), and platypus (WUGSC 5.0.1/ornAna1) (http://genome.UCSC.edu). If the number of orthologous exons of a human lncRNA gene in a genome exceeded half the exon number of the human lncRNA gene, these orthologous exons were assumed to form an orthologous lncRNA gene. If a human lncRNA gene had no orthologous gene in all of the 16 mammals, it was assumed to be human-specific.
3. Identifying DBSs of HS lncRNAs
LncRNAs bind to DNA sequences by forming RNA:DNA triplexes. Each triplex comprises triplex-forming oligonucleotides (TFO) in the lncRNA and a triplex-targeting site (TTS) in the DNA sequence. We used the LongTarget program to predict HS lncRNAs’ DNA binding domains (DBD) and binding sites (DBS) with the default parameters (Ruleset = all, T.T. penalty = −1000, CC penalty = 0, Offset = 15, Identity ≥ 60, Nt ≥ 50) (Lin et al., 2019; Wen et al., 2022). LongTarget program predicts DBDs and DBSs simultaneously, with a DBD comprising a set of densely overlapping TFOs and a DBS comprising a set of densely overlapping TTSs. For each HS lncRNA, we predicted its DBSs in the 5000 bp promoter regions (−3500 bp upstream and +1500 bp downstream the transcription start site) of the 179128 Ensembl-annotated transcripts (release 79). For each DBS, a binding affinity was calculated by multiplexing the DBS length and the averaged Identity score of all TTSs (the Identity score is the percentage of paired nucleotides). Strong and weak DBSs were classified upon binding affinity ≥ 60 and < 60. A transcript whose promoter region contains a strong DBS of an HS lncRNA was assumed to be a target transcript of the HS lncRNA, and the gene containing this transcript was assumed to be a target gene of the HS lncRNA. As the 1000 Genomes Project (phase I) data and the archaic human genomes are based on GRCh37/hg19, the coordinates of DBSs were converted from GRCh38/hg38 to GRCh37/hg19 using the liftover program from the UCSC Genome Browser (Kuhn et al., 2013).
4. Experimentally validating DBS prediction
A 157 bp sequence (chr17:80252565-80252721, hg19) containing the DBD of RP13-516M14.1, a 202 bp sequence (chr1:113392603-113392804, hg19) containing the DBD of RP11-426L16.8, and a 198 bp sequence (chr17:19460524-19460721, hg19) containing the DBD of SNORA59B, were knocked out in the HeLa cell line, RKO cell line, and SK-MES-1 cell line, respectively. Two sequences (chr1:156643524-156643684, chr10:52445649-52445740, hg38) containing the DBD of two wrongly transcribed noncoding sequences were knocked out in the HCT-116 and A549 cell lines, respectively. The seven knockouts were performed by UBIGENE, Guangzhou, China (http://www.ubigene.com) using CRISPR-UTM, a revised version of CRISPR/Cas9 technology. Before and after the seven DBD knockouts, RNA-sequencing (RNA-seq) was performed by Novogene, Beijing, China (https://cn.novogene.com) and HaploX, Shenzhen, China (https://www.haplox.cn/). The reads were aligned to the human GRCh38 genome using the Hiasat2 program (Kim et al., 2019), and the resulting Sam files were converted to bam files using Samtools (Li et al., 2009). The Stringtie program was used to quantify gene expression levels (Pertea et al., 2015). Fold change of gene expression was computed using the edgeR package (Robinson et al., 2010), and significant up- and down-regulation of target genes after DBD knockout was determined upon |log2(fold change)| > 1 with FDR < 0.1.
Genome-wide DBSs of NEAT1, MALAT1, and MEG3 were experimentally detected (Mondal et al., 2015; West et al., 2014). We also used these data to validate DBS prediction by predicting DBSs of the three lncRNAs and checking the overlap between predicted and experimentally detected DBSs (Supplementary Table 15).
5. Mapping DBSs of HS lncRNAs in the chimpanzee and archaic human genomes
We used the liftover program from the UCSC Genome Browser (http://genome.UCSC.edu) to map loci of DBSs in the human genome hg38 to the chimpanzee genome Pan_tro 3.0 (panTro5). The mapping results were examined by checking the human-chimpanzee pairwise alignment (hg38 vs panTro5) in the UCSC Genome Browser (http://genome.UCSC.edu). 97.81% (100403/102651) of DBSs were mapped, and the 2248 unmapped DBSs in 429 genes were considered human-specific. We used vcf-consensus in the VCFtools package to extract the DBSs of HS lncRNAs from the VCF files of Altai Neanderthals, Denisovans, and Vindija Neanderthals. The variant with the highest quality score was selected whenever multiple variant calls were observed at a given locus. The obtained DBS sequences in the chimpanzees and three archaic humans are called counterparts of DBSs in these genomes.
6. Estimating sequence distances of DBSs between different genomes
We first aligned DBS sequences in the genomes of humans, chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals using the MAFFT7 program for measuring sequence distances from modern humans to chimpanzees and archaic humans (Katoh and Standley, 2013). We then computed sequence distances using the dnadist program with the Kimura 2-parameter model in the PHYLIP (3.6) package (http://evolution.genetics.washington.edu/phylip.html) and using the Tamura-Nei model in the MEGA7 package (Kumar et al., 2016). The two methods generated equivalent results. The largest distance between DBSs in humans and their counterparts in chimpanzees is 5.383. Since 2248 DBSs in 429 human genes do not have counterparts in chimpanzees, we assumed that the sequence distances of these DBSs between humans and chimpanzees were 10.0. Since the top 20% of DBSs in chimpanzees (including these 2248 DBSs) have distances ≥ 0.034, we used 0.034 as the threshold to identify the “mostly changed” DBSs in the chimpanzees and three archaic humans.
We determined human ancestral sequences of DBSs upon the human ancestor sequences from the EBI website, which were generated upon the EPO alignments of six primate species. We used the above-mentioned methods to compute DBSs’ sequence distances from the human ancestor to chimpanzees, archaic humans, and modern humans. We found that when the human-chimpanzee ancestral sequence has the ancestral sequence (which means the inferring of ancestral allele is of high confidence), DBS distances from the human ancestor to modern humans are larger than to archaic humans, but this situation only accounts for about 63.8%. For many DBSs, the distances from the human ancestor to modern humans are smaller than to archaic humans (especially Neanderthals and Denisovans) and even to chimpanzees. This defect may be caused by the absence of archaic humans in building the human ancestral sequence.
7. Detecting positive selection signals in HS lncRNA genes and DBSs
We used multiple tests to detect positive selection signals in HS lncRNA genes. First, we used the XP-CLR test (parameters = -w1 0.001 300 100 -p0 0.95, window size = 0.1 cM, grid size = 100 bp) to perform six pairwise genome-wide scans (i.e., CEU-CHB, CEU-YRI, CHB-CEU, CHB-YRI, YRI-CEU, and YRI-CHB) (Chen et al., 2010). The upper 1% of scores across the entire genome in each pairwise scan were 34.6 in the CEU-YRI scan, 16.8 in the CEU-CHB scan, 45.0 in the CHB-YRI scan, 26.9 in the CHB-CEU scan, 14.1 in the YRI-CEU scan, and 14.1 in the YRI-CHB scan. These scores were used as the thresholds of positive selection signals in these populations. Second, we used the iSAFE program to scan each genomic region that contains a HS lncRNA gene and its 500-kb upstream and downstream sequences (Akbari et al., 2018). Strongly selected loci were detected only in CEU and CHB. Third, we used the VCFtools program to calculate Tajima’s D values for each HS lncRNA gene in CEU, CHB, and YRI (Danecek et al., 2011). The calculation was performed using a 1500-bp non-overlapping sliding window because the lengths of these genes exceed 1500 bp. To generate the background reference for judging the significant increase or decrease of Tajima’s D for HS lncRNA genes in a population, we calculated Tajima’s D values across the whole genome using a sliding window of 1500 bp. As values of Tajima’s D were compared with the background reference, a significant D<0 and D>0 indicate positive (or directional) selection and balancing selection, respectively, instead of population demography dynamics (Tajima, 1989). Fourth, we used the integrated Fst to detect positive selection signals in HS lncRNA genes. The Fst of each HS lncRNA gene was computed for three comparisons, i.e., CEU-YRI, CHB-YRI, and CHB-CEU. Extreme Fst values of SNPs were detected in HS lncRNA genes in the comparisons of CEU-YRI and CHB-YRI. Since allele frequencies for different loci vary in a genome and genetic drift may have different effects at different loci, we used a sliding window of 1500 for comparing Fst values of HS lncRNA genes with the genome-wide background. Extreme Fst values indicate positive selection. Finally, we applied linkage disequilibrium (LD) analysis to each HS lncRNA gene. We computed the pairwise LD (r²) in CEU, CHB, and YRI for common SNPs (with minimal allele frequency (MAF) ≥ 0.05 in at least one population) in HS lncRNA genes and DBSs using the PLINK program (Purcell et al., 2007). Significantly increased LD was detected in SNPs in HS lncRNA genes in CEU and CHB. The LD patterns were portrayed using the Haploview program (Barrett et al., 2005).
Next, we used the above tests to detect positive selection signals in DBSs. First, the 100 bp grid size of the XP-CLR test also allowed the detection of selection signals in DBSs. Second, we performed the Tajima’s D and Fay-Wu’s H tests (Fay and Wu, 2000; Tajima, 1989). We calculated Tajima’s D values for each DBS in CEU, CHB, and YRI using the VCFtools program (Danecek et al., 2011), with a sliding window of 147 bp (the mean length of strong DBSs). To generate the background reference for judging the significant increase or decrease of Tajima’s D in a population, we calculated Tajima’s D values across the whole genome using the same sliding window. As values of Tajima’s D were compared with the background reference, a significant D<0 and D>0 indicate positive (or directional) selection and balancing selection, respectively. Fay-Wu’s H values were calculated similarly using the VariScan program (Vilella et al., 2005). Calculating Fay-Wu’s H demands the ancestral sequences as the outgroup. We extracted the ancestral sequences of DBSs from the human ancestral sequence, which was generated upon the EPO alignments of six primate species (human, chimpanzee, gorilla, orangutan, macaque, and marmoset) (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignm ents/). Third, we computed the Fst to measure the frequency differences of alleles in DBSs between populations. For the CEU-YRI, CHB-YRI, and CHB-CEU pairwise comparisons, we used the revised VCFtools program to compute the weighted Fst for all SNPs in each DBS (Weir and Cockerham, 1984). Fourth, we integrated the weighted Fst values in the three populations into an “integrated Fst” which indicated whether the DBS locus was under selection in a certain population (Nielsen and Slatkin, 2013). We used sliding windows of 147 bp for comparing Fst values of DBSs with the genome-wide background. We empirically denoted the values of integrated Fst in the upper 10% of scores across the entire genome as statistically significant. To determine positive selection more reliably, we used Tajima’s D and the integrated Fst to jointly determine if a DBS was under positive selection in a population. The thresholds that determined the upper 10% of Tajima’s D values across the entire genome in CEU, CHB, and YRI were −0.97, −0.96, and −0.97, respectively, and the threshold that determined the upper 10% of integrated Fst values across the entire genome in the three populations was 0.22. For example, a DBS was assumed to be under positive selection in CEU if (a) the DBS had a Tajima’s D<-0.97 in CEU and Tajima’s D>0.0 in the two other populations and (b) the DBS had an integrated Fst>0.22. Analogously, a DBS was assumed to be under positive selection in both CEU and CHB if the DBS had a Tajima’s D <-0.97 in CEU, <-0.96 in CHB, and >0.0 in YRI, and had an integrated Fst>0.22.
8. Functional enrichment analysis of genes
We used the g:Profiler program (with the parameter settings: Organism=Homo sapiens, Ordered query = No, Significance threshold = Benjamini-Hochberg FDR, User threshold = 0.05, 50 < terms size < 1000) and the Gene Ontology (GO) database to perform over-representation analysis (Raudvere et al., 2019). This analysis is used to determine what pre-defined gene sets (GO terms) are more present (over-represented) in a list of “interesting” genes than what would be expected by chance. The lists of genes included genes with strongest DBSs in humans, genes with weakest DBSs in humans, and genes with mostly changed DBSs in chimpanzees and three archaic humans. To choose genes with strong DBSs, we ranked genes in descending order based on binding affinity and chose the top 2000 genes. To choose genes with weak DBSs, we first ranked genes in ascending order based on binding affinity, then ranked genes in descending order based on the number of regulatory HS lncRNAs, and finally chose the top 2000 genes with binding affinity ≤40 and regulatory HS lncRNAs ≥5. We used the distance threshold of 0.034 to choose genes with mostly changed DBSs. Genes with mostly changed DBSs in chimpanzees include the 429 genes in which 2248 DBSs lack counterparts. 4248, 1256, 2514, and 134 genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals have DBSs with distance ≥0.034. To make the statistical significance comparable across chimpanzees, Altai Neanderthals, and Denisovans, we used the same number of genes (the top 1256 genes whose DBSs have the largest DBS distances) as the input to the g:Profiler program.
9. Analysing SNP frequencies in human populations
The frequency of common SNPs (MAF ≥0.05) in DBSs in the three modern human populations was computed using the VCFtools package (Danecek et al., 2011). The ancestral/derived states of SNPs were inferred upon the human ancestor sequences and were used to determine derived allele frequency.
10. Analysing the cis-effect of SNPs in DBSs on target gene expression
SNPs with MAF >0.1 in DBSs in any of the three modern human populations and absolute values of cis-effect size >0.5 (FDR <0.05) in any of the GTEx tissues were examined for an influence on the expression of the target genes (GTEx Consortium, 2017). SNPs that are eQTLs in the GTEx tissues and have biased derived allele frequencies (DAF) in the three modern human populations were examined to estimate whether the eQTL is population-specific.
11. Examining the tissue-specific impact of HS lncRNA-regulated gene expression
First, we examined the expression of HS lncRNA genes across the GTEx tissues. HS lncRNA genes with a median TPM value >0.1 in a tissue were considered robustly expressed in that tissue. Upon this criterion, 40 HS lncRNA genes were expressed in at least one tissue and were used to examine the impact of HS lncRNA regulation on gene and transcript expression (other HS lncRNAs may function in the cytoplasm). Since a HS lncRNA gene may have multiple transcripts, we adopted the transcript, which contains the predicted DNA binding domain (DBD) and has the largest TPM value as the representative transcript of the HS lncRNA. We calculated the pairwise Spearman’s correlation coefficient between the expression of a HS lncRNA (the representative transcript) and the expression of each of its target transcripts using the scipy.stats.spearmanr program in the scipy package. The expression of a HS lncRNA gene and a target transcript was considered to be significantly correlated if the |Spearman’s rho| > 0.3, with Benjamini-Hochberg FDR <0.05. We examined the percentage distribution of correlated HS lncRNA-target transcript pairs across GTEx tissues and organs (Figure 3A).
We examined all GTEx tissues to determine in which tissues HS lncRNA-regulated gene expression may have undergone significant changes from archaic to modern humans. If a pair of HS lncRNA and target transcript are robustly expressed in a tissue and their expression shows a significant correlation (|Spearman’s rho| >0.3, with Benjamini-Hochberg FDR <0.05) in the tissue, we computed the sequence distance of the HS lncRNA’s DBS in the transcript from the three archaic humans to modern humans. We compared the sequence distances of all DBSs in each tissue with the sequence distances of all DBSs in all GTEx tissues (i.e. the distribution of all DBS distances was taken as the background). A one-sided two-sample Kolmogorov-Smirnov test was used to examine whether the sequence distances of all DBSs in a specific tissue deviate from the background distribution (which reflects the “neutral evolution” of gene expression). For a tissue, if the test result was Benjamini-Hochberg FDR < 0.001, the tissue was considered to have significantly changed gene expression regulated by the HS lncRNA. We used different colors to mark tissues with significantly changed gene expression regulation since Altai Neanderthals, Denisovans, and Vindija Neanderthals, and used “D”, “A.D.”, and “ADV” to indicate changes since Denisovans, since Altai Neanderthals and Denisovans, and since Altai Neanderthals, Denisovans, and Vindija Neanderthals, respectively (Figure 3B).
Declaration of competing interest
The authors declare no competing interest.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (31771456 to H.Z.) and the China Postdoctoral Science Foundation (2020M682788 to J.L.).
Data and code availability
GENCODE human lncRNAs are available at the website (https://www.gencodegenes.org/human/). Human lncRNAs’ orthologues in 16 mammals and the LongTarget program are available at the LongTarget/LongMan website (http://www.gaemons.net/). Other programs and tests are open-source and freely available. The RNA-seq data of the three HS lncRNAs before and after the CRISPR/Cas9 experiments have been deposited in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo) under the accession number GSE213231. The RNA-seq data of the four wrongly transcribed noncoding sequences before and after the CRISPR/Cas9 experiments have been deposited in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo) (the accession number is to be obtained). All other data are in Supplementary Tables.
Supplementary data
Two supplementary files are available with the manuscript online. One PDF file contains 7 Supplementary Notes in which there are 23 Supplementary Figures, and one Excel file contains 15 Supplementary Tables.
Supplementary Note 1 – Supporting evidence for DBS prediction
We used multiple methods and datasets to validate DBS prediction. First, since lncRNAs can recruit epigenetic regulatory enzymes to lncRNAs’ DNA binding sites, we uploaded predicted DBSs of HS lncRNAs to the UCSC Genome Browser as a custom track, compared these DBSs with tracks such as “ENCODE DNA Methylation tracks” (Meissner et al., 2008) and “ENCODE Histone Modification tracks” (Ernst et al., 2011). We found that predicted DBSs overlap well with experimentally identified DNA methylation and histone modification signals in multiple cell lines (Supplementary Figure 1).
Second, since lncRNAs’ DBSs can be detected genome-wide by experiments such as ChIRP-seq, we predicted the DBSs of the lncRNA MALAT1, NEAT1, and MEG3 and compared predicted DBSs with the experimentally identified DNA binding sites (Mondal et al., 2015; West et al., 2014). We found that predicted DBSs agree well with experimentally identified DNA binding sites (Supplementary Figure 2).
Third, we found that many DBSs also co-localize with ENCODE Candidate Cis-Regulatory Elements (cCREs) in promoter regions of genes (Supplementary Figure 3). cCREs are the subset of representative DNase hypersensitive sites across ENCODE and Roadmap Epigenomics samples supported by either histone modifications (H3K4me3 and H3K27ac) or CTCF-binding data.
Fourth, we used the CRISPR/Cas9 technique to knock out the sequences that contain the DBD of seven lncRNAs (three HS lncRNAs and four wrongly transcribed long noncoding RNAs) in multiple cell lines and performed RNA-seq before and after the knockout. The deleted sequences were just 100-200 bp, and gene expression analysis revealed that the |fold change| of target genes was significantly larger than the |fold change| of non-target genes (one-sided Mann-Whitney test, p = 3.1e-72, 1.49e-114, and 1.12e-206 for HS lncRNA RP13-516M14.1, RP11-426L16.8, and SNORA59B, and p = 2.58e-09, 6.49e-41, 0.034, and 5.23e-07 for the four wrongly transcribed lncRNAs in cancer cell lines) (Supplementary Figure 4). These results suggest that the deletion of DBD causes the changed expression of target genes (Supplementary Figure 5).
Fifth, the LongTarget program uses a variant of the Smith-Waterman algorithm to identify all local alignments in each lncRNA/DNA pair. Each alignment is a RNA:DNA triplex that consists of a TFO (triplex-formating oligonucleotides) and a TTS (triplex-targeting sites). A DBS is defined based on a set of densely overlapping TTSs, and a DBD is defined based on both TFOs and the DBS (He et al., 2015). LongTarget can more robustly and accurately identify DBS than some popular methods (Wen et al., 2022). To examine the likelihood a local alignment of 147 bp (the average length of strong DBSs) can be generated by chance in a 5000 bp DNA sequence, we randomly simulated two sequences seqA and seqB (which were 5000 bp and 147 bp) and aligned seqB to seqA using the EMBOSS Water (https://www.ebi.ac.uk/Tools/psa/emboss_water/) local alignment program with alignment identity controlled to 60% (because the default Identity parameter of the LongTarget program was 60%). The equation p = 1 − exp(−Kmne-JS) was used to calculate a p value to estimate the likelihood, in which m and n were the length of seqA and seqB, K was a small adjusting constant (which was 0.1), λ was the normalization coefficient of the score (which was 0.3), and S was the alignment score. We repeated the simulation and calculation process 10000 times and determined that the maximal and minimal p values were between 8.2e-19 and 1.5e-48. Thus, a DBS of 147 bp is extremely unlikely to be generated by chance. This result also supports that the changed expression of target genes is caused by the knockout of DBDs in the CRISPR/Cas9 experiments.
Supplementary Note 2 – Features of HS lncRNA-mediated gene expression regulation
HS lncRNA-mediated gene expression regulation shows multiple features. First, HS lncRNAs themselves form complex targeting relationships (Supplementary Figure 6), suggesting networks and cascades of regulation. Second, some DBSs are human-specific sequences (Supplementary Figure 7), suggesting that human-specific sequences were explored by HS lncRNAs for regulating gene expression. Third, many genes and transcripts have DBSs for multiple HS lncRNAs (Supplementary Figure 8), suggesting co-regulation of functionally related genes. Fourth, certain genes and transcripts have multiple DBSs for one HS lncRNA, suggesting tissue-specific regulation. Fifth, selection signals were detected in some DBSs in specific populations (Supplementary Figure 9). Sixth, generally, HS lncRNAs on the Y chromosome have longer DBSs than HS lncRNAs on the autosomes (Supplementary Figure 10). Finally, SNPs in the DBSs of a HS lncRNA in multiple genes on a chromosome show LD, suggesting an association between these DBSs (Supplementary Figure 11).
Supplementary Note 3 – Target genes are enriched in specific functions
Strong (binding affinity>=60) and weak (binding affinity<60) DBSs may indicate well-established and recently-formed gene expression regulation by HS lncRNAs. To examine if genes with strong and weak DBSs are enriched in different biological functions, we used the g:Profiler program (with the parameter settings: Organism=Homo sapiens, Ordered query=No, Significance threshold=Benjamini-Hochberg FDR, User threshold=0.05, 50<terms size<1000) and the Gene Ontology (GO) database to perform over-representation analysis. An over-representation analysis is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance. We used the top 2000 genes with strong DBSs and the bottom 2000 genes with weak DBSs as the input to g:Profiler (Supplementary Table 2-4). The two gene sets, with strong and weak DBSs, are enriched in many common GO terms with different significance. These GO terms include “neuron projection development” (GO:0031175) (2.36E-16 vs 3.79E-11), “neuron projection morphogenesis” (GO:0048812) (8.56E-16 vs 1.64E-09), and “female sex differentiation” (GO:0046660) (2.40E-2 vs 5.84E-04) (Supplementary Figure 12). Genes with strong and weak DBSs are also enriched in different GO terms. Especially, genes with weak DBSs are enriched in slightly more GO terms (Supplementary Figure 13), including “response to alcohol” (GO:0097305), “negative regulation of locomotion” (GO:0040013), “female gonad development” (GO:0008585), “response to temperature stimulus” (GO:0009266), “cerebral cortex development” (GO:0021987), “DNA methylation or demethylation” (GO:0044728), and “female pregnancy” (GO:0007565) (Supplementary Table 4). These DBSs and target genes reflect the recent adaptive evolution of humans.
Next, we examined DBSs that have most changed sequence distance (“distance” hereafter means the distance per base) in two ways. First, we computed DBS sequence distances from the human ancestor to chimpanzees, archaic humans, and modern humans (hg19) (i.e., from the root note to leaf nodes in the phylogenetic tree); second, we computed DBS sequence distances between modern humans (hg19) and chimpanzees and archaic humans (i.e., between leaf nodes in the phylogenetic tree). In the first type of DBS distances, DBSs in the top 20% (5033) genes have a distance > 0.015 from the human ancestor to modern humans. When the human-chimpanzee ancestral sequence has the ancestral sequence (which means the inferring of ancestral allele is of high confidence), DBS distances from the human ancestor to modern humans are indeed larger than DBS distances from the human ancestor to archaic humans. However, it is unreasonable that more DBSs have distances > 0.015 from the human ancestor to Altai Neanderthals, Denisovans, and even chimpanzees (Supplementary Figure 14). We think these wrong distances may be because no archaic human genomes were used to build the human ancestral sequence. According to the file “homo_sapiens_ancestor_GRCh37_e71.README” and the paper “1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 2015”, a high-confidence call in the human ancestor sequence is made when all three sequences - the ancestral (the common ancestor of humans and chimpanzees), the sister (chimpanzees), and the ancestral of the ancestral sequences - agree. We found that only about 64% calls in the human ancestor sequence are high-confidence calls and that DBSs in human ancestral sequences with high-confidence have correct distances to the other five leaf nodes.
To examine whether DBS distances between leaf nodes mainly reflect the genetic changes in the human lineage or in the chimpanzee lineage, we also computed sequence distances for DBSs with the most changed sequences between humans and gorillas. We found that the DBS distances between humans and chimpanzees were significantly correlated with those between humans and gorillas (Spearman’s rho = 0.57, p = 0.0) and that these DBSs have a larger distance between humans and gorillas (Supplementary Figure 15A). These results suggest that the sequence differences of the most changed DBSs between humans and chimpanzees are driven mainly by the genetic changes occurring in the human lineage but not in the chimpanzee lineage. The same results were observed when the archaic humans were examined (Supplementary Figure 15B).
In the second type of distances, DBSs in 429 genes have no counterparts in chimpanzees, the largest distance between chimpanzees and modern humans was 5.383, DBSs in the top 20% (4248) genes in chimpanzees have distances > 0.034, and DBSs having distances > 0.034 in genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals are 1256, 2514, and 134 (Supplementary Table 6). These distances agree with the phylogenetic distances between chimpanzees, archaic humans, and modern humans (Figure 1AE). We assigned 10.0 as the distance of DBSs that lack counterparts in chimpanzees. Subsequently, to make the statistical significance of enrichment analysis comparable across chimpanzees, Altai Neanderthals, and Denisovans, we used the top 1256 genes whose DBSs > 0.035 between modern humans and chimpanzees, Altai Neanderthals, and Denisovans as the input to the g:Profiler program. Genes in Altai Neanderthals are enriched in more GO terms than genes in chimpanzees, but many of these GO terms are highly related to human traits (Figure 2; Supplementary Table 7). The differences in enriched GO terms in chimpanzees and Altai Neanderthals suggest that HS lncRNA-mediated gene expression regulation has greatly evolved from the human ancestor to Altai Neanderthals.
Finally, we also sorted DBSs by the distances from the human ancestor to the five leaf nodes, identified genes with mostly changed DBSs, and applied over-representation analysis to genes with mostly changed DBSs. It is interesting (and reasonable) to find that genes with largest DBS distances from the human ancestor to Altai Neanderthals and to modern humans are enriched in many common GO terms (e.g., “regulation of anatomical structure size” (GO:0090066), “behavior” (GO:0007610), and “cellular response to xenobiotic stimulus” (GO:0071466)), that the Altai Neanderthal-specific GO terms include “sensory system development” (GO:0048880), “brain development” (GO:0007420), and “adult locomotory behavior” (GO:0008344), and that the modern human-specific GO terms include “female pregnancy” (GO:0007565), “learning” (GO:0007612), and “cognition” (GO:0050890) (Supplementary Table 7). The differences in Neanderthal-specific and modern human-specific GO terms reasonably reflect the different stages of human evolution.
Supplementary Note 4 – Positive selection signals in HS lncRNA genes
We used multiple tests, including XP-CLR (Chen et al., 2010), iSAFE (Akbari et al., 2018), Tajima’s D (Tajima, 1989), the fixation index (Fst) (Weir and Cockerham, 1984), and linkage disequilibrium (LD) (Slatkin, 2008), to detect positive selection signals in HS lncRNA genes.
First, we used the XP-CLR program to scan the genome regions which contain HS lncRNAs and their 500-kb upstream and downstream sequences. Six pairwise comparisons were applied to the three human populations (CEU-CHB, CEU-YRI, CHB-CEU, CHB-YRI, YRI-CEU, and YRI-CHB). Selective sweeps were detected in the genome regions containing RP11-848P1.4 and RP11-598D14.1 in the CEU-YRI and CHB-YRI comparisons (Supplementary Figure 16). Abundant SNPs in these regions have low derived allele frequency (DAF) in YRI, but are nearly fixed in CEU and CHB. Examples (with DAF in YRI, CEU, and CHB) include rs7208589 (DAF = 0.162, 0.990, and 0.995), rs8073226 (DAF = 0.148, 0.990, and 0.995), and rs9915124 (DAF = 0.181, 0.990, and 0.995) in the region containing RP11-848P1.4, and rs4690648 (DAF = 0.185, 0.939, and 0.917), rs11722101 (DAF = 0.269, 0.970, and 0.917), and rs11730933 (DAF = 0.292, 0.970, and 0.917) in the region containing RP11-598D14.1.
Second, we used the iSAFE program to detect favored mutations in HS lncRNA genes in these populations. Because selective sweeps caused by favored mutations may have varied lengths, we ran iSAFE four times with genome regions of 80, 500, 1000, and 2500 kb, which center at the HS lncRNA. Selective sweeps and favored mutations were detected robustly in genome regions containing RP11-598D14.1, RP11-848P1.4, AC006129.1, AC006129.4, CTD-2291D10.1, CTD-2291D10.2, and LA16c-306A4.2 in CEU and CHB, and in the genome region containing CTD-3051D23.4 in CHB (Supplementary Figure 17). Mutations with top iSAFE scores were located in the gene body regions of HS lncRNAs, and their DAF values were low in YRI but high in CEU and CHB. In addition, all of the detected favored mutations have high LD (r2) scores.
Third, we used Tajima’s D and integrated Fst to detect positive selection signals in gene body regions of HS lncRNAs. Positive selection signals were detected in RP11-598D14.1, RP11-848P1.4, RP11-344P13.4, and RP11-426L16.8 in CEU and CHB, in AC129929.5 and RP11-423H2.3 in CEU, and in CTB-151G24.1 in CHB (Supplementary Figure 18). Since Tajima’s D values were referenced with the genome-wide background, D<0 and D>0 indicate positive (or directional selection) and balancing selection, respectively, instead of population demography dynamics. The Fst of each HS lncRNA gene, also referenced with the genome-wide background, was computed for the CEU-YRI, CHB-YRI, and CHB-CEU comparisons. Extreme Fst values of SNPs were detected in RP11-598D14.1 and AC129929.5 in the comparisons of CEU-YRI and CHB-YRI. Since Fst values were referenced with the genome-wide background, extreme Fst values indicate positive selection.
Finally, we applied LD analysis to each HS lncRNA gene. Significantly increased LD was detected in SNPs in AC024592.9, AC129929.5, RP11-157B13.7, RP11-277P12.10, CTD-2142D14.1, and CTD-2291D10.1 in CEU and CHB (Supplementary Figure 19). The above results, taken together, indicate that HS lncRNA genes may have undergone more significant adaptive evolution in CEU and CHB than in YRI.
Supplementary Note 5 – Positive selection signals in DBSs
We used the tests mentioned above, including Tajima’s D (Tajima, 1989), Fay and Wu’s H (Fay and Wu, 2000), integrated Fst (Weir and Cockerham, 1984), and LD (Slatkin, 2008), to detect positive selection signals in DBSs in each population. These tests unveil positive selection signals in DBSs in specific target genes in specific populations (Supplementary Table 9-11).
First, we calculated Tajima’s D, Fst, and integrated Fst for polymorphic DBSs in CEU, CHB, and YRI. Positive selection signals were detected in DBSs in more genes in CEU and CHB than in YRI (Supplementary Table 10). These genes include the two pigmentation-related genes MC1R and MFSD12, the two odor reception-related genes OR6C1 and TAS1R3, and the immune-related gene TLR1. These target genes may suggest the adaptation of gene expression regulation in CEU and CHB to changes in diet and the environment. In YRI, positive selection signals were detected in DBSs in genes such as SLCO4A1 which encodes a protein mediating the Na-independent transport of organic anions (e.g., the thyroid hormones T3 and T4). DBSs in different transcripts of GNAS (a gene important for genomic imprinting) contain SNPs specifically selected for in different populations (Supplementary Figure 9).
Next, we calculated Fay-Wu’s H and integrated Fst for each polymorphic DBS in CEU, CHB, and YRI. Strong negative values (H <-2), together with significantly large integrated Fst (> 0.22), were obtained mainly in DBSs in CEU and CHB (Supplementary Table 11). These genes fall into two classes. The first class includes PASK, CPT1A, and EXOC7, which are important for glucose and lipid metabolism. PASK encodes a protein that plays a role in the regulation of insulin gene expression. CPT1A encodes a protein that exerts an important role in triglyceride metabolism. EXOC7 encodes a protein that plays a crucial role in targeting SLC2A4 vesicles to the plasma membrane in response to insulin in adipocytes. The second class includes COMT, TAS1R3, and ALMS1, which are important for neural development. COMT is involved in the metabolism of adrenalin and noradrenalin. TAS1R3 encodes a protein important for recognizing diverse natural and synthetic sweeteners. Mutations in ALMS1 are associated with Alstrom syndrome.
Third, to examine whether gene expression regulation by HS lncRNAs is coordinated, we examined the LD between SNPs in DBSs of the same HS lncRNA chromosome-wide. When more than one SNP in a DBS has minimal allele frequency (MAF) >= 0.05, we chose the SNP which had the strongest LD (r2). Despite genetic recombination, LD between DBSs was detected on some chromosomes in CEU and CHB (Supplementary Figure 11; Supplementary Note 2).
Fourth, we computed the distances between DBSs in modern humans and their counterparts in archaic humans and compared the distances with the distances between annotated promoters and their counterparts in archaic humans. A considerable portion of DBSs (note that DBSs are within promoter regions) have larger distances than promoters between modern and archaic humans (Supplementary Figure 20). We also computed the frequency distribution of SNPs (> 0.05) in DBSs and found that SNPs are enriched with low and high frequencies (Supplementary Figure 21), which is an indicator of positive selection.
Finally, we analyzed two experimental datasets. It was reported that the expression of a set of genes in T cell activation in response to pathogens shows significant variations in 348 healthy European, African, and Asian individuals (Ye et al., 2014). To examine whether HS lncRNAs contribute to these variations, we examined whether DBSs in these genes have different binding affinities in different populations. We found that DBSs in IFITM3, IL2RA, IL17F, MXRA7, CCL22, and FADS2 contain SNPs with biased frequencies in CEU, CHB, and YRI. A typical example is the DBSs in FADS2 which is expressed differentially in Europeans and Africans. The two DBSs in FADS2-003 and FADS2-010, respectively, contain four SNPs - rs71046746 has DAF 0.96/0.99/0.88 in CEU/CHB/YRI, but the other three unannotated SNPs have DAFs 0.24/0.22/0.01, 0.24/0.22/0.04, and 0.24/0.22/0.02 in CEU/CHB/YRI.
Another study examined genome-wide patterns of selection in 230 West Eurasians who lived between 6500 and 300 BC and identified selection signals in multiple genes which are associated with diet, pigmentation, and immunity (Mathieson et al., 2015). We detected signatures of positive selection in DBSs in LCT, TLR1, TLR6, TLR10, SLC45A2, SLC22A4, MHC, ZKSCAN3, FADS1, FADS2, DHCR7, GRM5, ATXN2, and HERC2. Reliable population-specific selection signals were identified in the DBS of RP11-423H2.3 in HERC2 (the Tajima’s D in CEU/CHB/YRI are −0.19/1.82/-1.12 and the integrated Fst is 0.27), in the DBSs of RP11-423H2.3 in TLR1 and TLR6 (the Tajima’s D in CEU/CHB/YRI are −1.26/-1.2/1.38 and the integrated Fst is 0.24), and in the DBS of SNORA59B in TLR1 and TLR6 (the Tajima’s D in CEU/CHB/YRI are −1.73/-0.9/1.76 and the integrated Fst is 0.24). Taken together, the above analyses suggest that population-specific selection signals in DBSs may help explain the phenotypic and physiological differences between human populations.
Supplementary Note 6 – Considerable SNPs in DBSs have cis-effect on gene expression in specific tissues and populations
To find supporting evidence for DBS influencing gene expression in tissues and organs, we first examined whether SNPs in DBSs have a cis-effect on gene expression using the data of the Genotype-Tissue Expression (GTEx) project (GTEx Consortium, 2017). The expression of some genes in specific tissues is of interest. For example, of the 21 SNPs which are eQTLs exclusively in the GTEx tissue Thyroid, 14 have high DAF in YRI (Supplementary Figure 22). Correspondingly, in Africans, positive selection signals were found in genes involved in energy metabolism (Fan et al., 2019). Second, we computed the eQTL density for each DBS and each Ensembl-annotated promoter and compared eQTL density of the two kinds of sequences. We found that eQTLs were more enriched in DBSs than in promoters (one-sided Mann-Whitney test, p = 0.0) (Supplementary Figure 23). Third, for DBS harboring eQTLs in specific tissues, 94% of the corresponding HS lncRNA-target transcript pairs exhibit expression correlation (|Spearman’s rho| > 0.3 and FDR < 0.05) in the eQTLs’ tissues.
Finally, we examined how many SNPs DBSs are DNA methylation QTL (mQTL) and histone modification QTL (haQTL). One study analyzed DNA methylation data from three populations (10 Caucasians, 10 African Americans, and 10 Japanese) and identified RGS6, CLEC2L, ABCF1, MIR3678, and EP400 as having differential methylation in African Americans compared with Japanese and Caucasians (Giri et al., 2017). Based on this study, we found that DBSs of several HS lncRNAs in some transcripts of RGS6, CLEC2L, and EP400 (also EP400NL) had significantly different Tajima’s D and integrated Fst in YRI. Another study applied QTL analysis to a multi-omics dataset and identified a set of SNPs significantly associated with levels of gene expression, DNA methylation, and histone modification in the human dorsolateral prefrontal cortex (Ng et al., 2017). Ng et al. called SNPs with cis-effect (including eQTL, mQTLs, and haQTLs) xQTL and found that xQTL SNPs were enriched close to transcription start sites. We re-examined these xQTL SNPs and identified 4735 in the DBSs of HS lncRNAs (Supplementary Table 13). Notably, 4319 out of the 4735 xQTLs were meQTL SNPs, many of which had biased frequencies in CEU, CHB, and YRI.
Supplementary Note 7 – HS lncRNAs critically regulate genes important for brain development
The enlarged brain is one of the most prominent features of modern humans. We analyzed multiple datasets of experimental studies to examine the impact of HS lncRNAs on brain development. First, we analyzed two datasets of epigenetic studies. A comparative H3K27ac and H3K4me2 profiling of human, macaque, and mouse corticogenesis revealed that strongly concordant epigenetic gains are enriched in promoters of 301 genes which are active during human corticogenesis (Reilly et al., 2015). Of these genes (257 are annotated in hg38), 84% have DBSs for at least one HS lncRNA and 56% have DBSs for >= 5 HS lncRNAs. Thus, these 301 genes are enriched significantly for regulation by HS lncRNAs compared with genes without DBSs of HS lncRNAs (p = 1.32e-26 for the situation of 56% genes and p = 1.21e-21 for the situation of 84% genes, two-sided Fisher’s exact test). Another study examined gene expression and H3K27ac modification in eight brain regions in humans and four other primates, and revealed 1851 genes (1687 genes annotated in hg38) with human-specific transcriptome differences and 240 genes with chimpanzee-specific transcriptome differences in at least one brain region (Xu et al., 2018). Of these genes, 73% have DBSs for at least one HS lncRNA and 46% have DBSs for >= 5 HS lncRNA. These data again indicate that genes related to brain development are enriched significantly in HS lncRNA targets (p = 1.1e-75 for the situation of 46% genes and p = 1.2e-56 for the situation of 73% genes, two-sided Fisher’s exact test).
Second, we analyzed the data from two studies conducted by the PsychENCODE consortium. The data of one study cover 16 brain regions and 9 time windows and include spatiotemporally differentially expressed genes, spatiotemporally differentially methylated sites, spatiotemporal variations of H3K27ac enrichment, and cell type marker genes (Li et al., 2018). Of the 66 HS lncRNAs, 65 were expressed in all of the 16 brain regions and in the 9 time windows, and 7 HS lncRNAs (RP11-423H2.3, RP11-706O15.5, RP13-539J13.1, SNORD3B-1, SNORD3B-2, RP4-740C4.7, and RP13-516M14.1) exhibited spatial or temporal expression differences. To assess the contribution of these 7 HS lncRNAs to the spatiotemporal changes of gene expression, we mapped the spatiotemporally differentially methylated sites to the promoters of annotated transcripts and identified 109 transcripts which had these sites in their promoters. We then predicted the DBSs of the 7 HS lncRNAs in the promoter regions of the 109 transcripts and found that 56 transcripts had DBSs for at least one of the 7 HS lncRNAs. In addition, 61% of 770 cell type-specific marker genes had at least one DBS of the s7 HS lncRNAs. These results suggest that HS lncRNAs regulate spatiotemporal gene expression and determine cell fate during brain development, probably by regulating DNA methylation in promoter regions. The other PsychENCODE study examined the spatiotemporal transcriptomic divergence across human and macaque brain development (Zhu et al., 2018). We examined the potential relationships between the HS lncRNAs and the 8951 genes showing differential expression between human and macaque brain during brain development. We identified DBSs of at least one HS lncRNAs in the promoter regions of 72% of differentially expressed genes in the human brain. Moreover, DBSs of at least 5 HS lncRNAs were identified in the promoter regions of 44% of differentially expressed genes. Compared with genes without DBS for HS lncRNAs, these differentially expressed genes are significantly enriched for regulation by HS lncRNAs (p = 0, Chi-square test). Of the 7 transcription factor genes differentially expressed between humans and macaques, four had DBSs of HS lncRNAs. These results support that gene expression in the human brain is highly regulated by HS lncRNAs.
Third, three recent studies identified several genes, including NOTCH2NL and Aspm, which are critical for regulating cortical expansion in the human brain (Florio et al., 2018; Johnson et al., 2018; Suzuki et al., 2018). NOTCH2NL is highly expressed in radial glia and activates Notch signaling to promote the clonal expansion of human cortical progenitors (Florio et al., 2018; Suzuki et al., 2018). Aspm is expressed in mice but shows no contribution to mouse corticogenesis. Nevertheless, an Aspm knockout greatly influenced corticogenesis in ferrets (Johnson et al., 2018). We examined whether HS lncRNAs potentially regulate these genes. Of the 40 protein-coding genes reported by the three studies, 14 have DBSs of >= 5 HS lncRNAs and 29 have DBSs of >= 1 HS lncRNAs in their promoter regions (Supplementary Table 14). Compared with the background situation (22562 protein-coding genes’ promoter regions contain, but 20944 protein-coding genes’ promoter regions do not contain, DBSs of HS lncRNAs), the 40 protein-coding genes involved in cortical expansion are highly enriched for regulation by HS lncRNAs (p < 0.01, two-sided Fisher’s exact test).
Recently, brain organoids have emerged as an important approach to studying primate neural development in vitro. By establishing and comparing cerebral organoids between humans, chimpanzees, and macaques, Pollen et al. identified 261 human-specific gene expression changes (Pollen et al., 2019). In another study, Agoglia et al. generated a panel of tetraploid human-chimpanzee hybrid iPS cells (i.e. hybrid induced pluripotent stem cells) by fusing human and chimpanzee iPS cells, and differentiated the hybrid iPS cells into hybrid cortical spheroids (Agoglia et al., 2021). By allele-specific expression (ASE) analysis, Agoglia et al. identified thousands of genes with divergent expression between humans and chimpanzees at at least one time point. We obtained the 261 genes from the first study and the 1102 genes with |logFC| > 1 and adjusted p < 0.05 from the second study. Compared with the background situation mentioned above, the two sets of genes were significantly enriched with DBSs of HS lncRNAs (p = 1.2e-16 and 3.4e-74, respectively).
References
- 1.An integrated map of genetic variation from 1,092 human genomesNature 491:56–65
- 2.Comprehensive survey and geometric classification of base triples in RNA structuresNucleic Acids Res 40:1407–1423
- 3.Primate cell fusion disentangles gene regulatory divergence in neurodevelopmentNature 592:421–427
- 4.Identifying the favored mutation in a positive selective sweepNat Methods 15:279–282
- 5.Evolution by gene lossNat Rev Genet 17:379–391
- 6.Human evolution. Evolution of early Homo: an integrated biological perspectiveScience 345
- 7.Haploview: analysis and visualization of LD and haplotype mapsBioinformatics 21:263–265
- 8.Comparative transcriptomics in human and mouseNat Rev Genet 18:425–440
- 9.Population differentiation as a test for selective sweepsGenome Res 20:393–402
- 10.Initial sequence of the chimpanzee genome and comparison with the human genomeNature 437:69–87
- 11.Comment on “Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens” and “Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans”Science 313
- 12.The variant call format and VCFtoolsBioinformatics 27:2156–2158
- 13.The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expressionGenome Res 22:1775–1789
- 14.Genome-Wide Identification of Regulatory Sequences Undergoing Accelerated Evolution in the Human GenomeMol Biol Evol 33:2565–2575
- 15.Microcephalin, a gene regulating brain size, continues to evolve adaptively in humansScience 309:1717–1720
- 16.Hitchhiking under positive Darwinian selectionGenetics 155:1405–1413
- 17.Human-Specific NOTCH2NL Genes Affect Notch Signaling and Cortical NeurogenesisCell 173:1356–1369
- 18.Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortexElife 7
- 19.Genetic effects on gene expression across human tissuesNature 550:204–213
- 20.DNA Methylation: Insights into Human EvolutionPLoS Genet 11
- 21.Conserved cell types with divergent features in human versus mouse cortexNature 573:61–68
- 22.The rewiring of transcription circuits in evolutionCurr Opin Genet Dev 47:121–127
- 23.Aspm knockout ferret reveals an evolutionary mechanism governing cerebral cortical sizeNature 556:370–375
- 24.Origins, evolution, and phenotypic impact of new genesGenome Res 20:1313–1326
- 25.MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and UsabilityMol Biol Evol 30:772–780
- 26.Three-dimensional genome rewiring in loci with human accelerated regionsScience 380
- 27.Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotypeNat Biotechnol 37:907–915
- 28.Evolution of transcription factor binding through sequence variations and turnover of binding sitesGenome Res 32:1099–1111
- 29.The UCSC genome browser and associated toolsBrief Bioinform 14:144–161
- 30.MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger DatasetsMol Biol Evol 33:1870–1874
- 31.Lessons from X-chromosome inactivation: long ncRNA as guides and tethers to the epigenomeGenes Dev 23:1831–1842
- 32.Human Accelerated Regions and Other Human-Specific Sequence Variations in the Context of Evolution and Their Relevance for Brain DevelopmentGenome Biol Evol 10:166–188
- 33.The Sequence Alignment/Map format and SAMtoolsBioinformatics 25:2078–2079
- 34.Integrative functional genomic analysis of human brain development and neuropsychiatric risksScience 362
- 35.Pipelines for cross-species and genome-wide prediction of long noncoding RNA bindingNat Protoc 14:795–818
- 36.Human brain evolution: Emerging roles for regulatory DNA and RNACurr Opin Neurobiol 71:170–177
- 37.Adaptive sequence divergence forged new neurodevelopmental enhancers in humansCell 185:4587–4603
- 38.Impacts of Neanderthal-Introgressed Sequences on the Landscape of Human Gene ExpressionCell 168
- 39.Human-specific loss of regulatory DNA and the evolution of human-specific traitsNature 471:216–219
- 40.Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiensScience 309:1720–1722
- 41.A High-Coverage Genome Sequence from an Archaic Denisovan IndividualScience 338:222–226
- 42.MEG3 long noncoding RNA regulates the TGF-beta pathway genes through formation of RNA-DNA triplex structuresNat Commun 6
- 43.Annotating functional RNAs in genomes using InfernalMethods Mol Biol 1097:163–197
- 44.Infernal 1.1: 100-fold faster RNA homology searchesBioinformatics 29:2933–2935
- 45.Infernal 1.0: inference of RNA alignmentsBioinformatics 25:1335–1337
- 46.An Introduction to Population GeneticsSinauer Associates Inc Publishers
- 47.Measuring transcription factor-binding site turnover: a maximum likelihood approach using phylogeniesGenome Biol Evol 1:85–98
- 48.StringTie enables improved reconstruction of a transcriptome from RNA-seq readsNat Biotechnol 33:290–295
- 49.Human TKTL1 implies greater neurogenesis in frontal neocortex of modern humans than NeanderthalsScience 377
- 50.Establishing Cerebral Organoids as Models of Human-Specific Brain EvolutionCell 176:743–756
- 51.Human-specific genetics: new tools to explore the molecular and cellular basis of human evolutionNat Rev Genet
- 52.The primitive brain of early HomoScience 372:165–171
- 53.Accelerated evolution of conserved noncoding sequences in humansScience 314
- 54.Emerging principles of regulatory evolutionProc Natl Acad Sci U S A 104:8605–8612
- 55.A high-coverage Neandertal genome from Vindija Cave in CroatiaScience 358:655–658
- 56.The bonobo genome compared with the chimpanzee and human genomesNature 486:527–531
- 57.The complete genome sequence of a Neanderthal from the Altai MountainsNature 505:43–49
- 58.PLINK: a tool set for whole-genome association and population-based linkage analysesAm J Hum Genet 81:559–575
- 59.g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update)Nucleic Acids Res 47:W191–W198
- 60.Evolutionary genomics. Evolutionary changes in promoter and enhancer activity during human corticogenesisScience 347:1155–1159
- 61.Integrative analysis of 111 reference human epigenomesNature 518:317–330
- 62.edgeR: a Bioconductor package for differential expression analysis of digital gene expression dataBioinformatics 26:139–140
- 63.Developmental dynamics of lncRNAs across mammalian organs and speciesNature 571:510–514
- 64.Linkage disequilibrium--understanding the evolutionary past and mapping the medical futureNat Rev Genet 9:477–485
- 65.Human-Specific NOTCH2NL Genes Expand Cortical Neurogenesis through Delta/Notch RegulationCell 173:1370–1384
- 66.Statistical method for testing the neutral mutation hypothesis by DNA polymorphismGenetics 123:585–595
- 67.Comment on papers by Evans et al. and Mekel-Bobrov et al. on Evidence for Positive Selection of MCPH1 and ASPMScience 317
- 68.VariScan: Analysis of evolutionary patterns from large-scale DNA sequence polymorphism dataBioinformatics 21:2791–2793
- 69.Estimating F-Statistics for the Analysis of Population StructureEvolution 38:1358–1370
- 70.Fasim-LongTarget enables fast and accurate genome-wide lncRNA/DNA binding predictionComput Struct Biotechnol J 20:3347–3350
- 71.The long noncoding RNAs NEAT1 and MALAT1 bind active chromatin sitesMol Cell 55:791–802
- 72.Enhancer Function and Evolutionary Roles of Human Accelerated RegionsAnnu Rev Genet 56:423–439
- 73.Human-specific features of spatial gene expression and regulation in eight brain regionsGenome Res 28:1097–1110
- 74.Comment on “Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens”Science 316
- 75.A comparative encyclopedia of DNA elements in the mouse genomeNature 515:355–364
- 76.Transcription factor binding sites are frequently under accelerated evolution in primatesNat Commun 14
- 77.Spatiotemporal transcriptomic divergence across human and macaque brain developmentScience 362
- 1.Primate cell fusion disentangles gene regulatory divergence in neurodevelopmentNature 592:421–427
- 2.Identifying the favored mutation in a positive selective sweepNat Methods 15:279–282
- 3.Population differentiation as a test for selective sweepsGenome Res 20:393–402
- 4.Mapping and analysis of chromatin state dynamics in nine human cell typesNature 473:43–49
- 5.African evolutionary history inferred from whole genome sequence data of 44 indigenous African populationsGenome Biol 20
- 6.Hitchhiking under positive Darwinian selectionGenetics 155:1405–1413
- 7.Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortexElife 7
- 8.DNA methylation profiling reveals the presence of population-specific signatures correlating with phenotypic characteristicsMol Genet Genomics 292:655–662
- 9.Genetic effects on gene expression across human tissuesNature 550:204–213
- 10.LongTarget: a tool to predict lncRNA DNA-binding motifs and binding sites via Hoogsteen base-pairing analysisBioinformatics 31:178–186
- 11.Aspm knockout ferret reveals an evolutionary mechanism governing cerebral cortical sizeNature 556:370–375
- 12.Integrative functional genomic analysis of human brain development and neuropsychiatric risksScience 362
- 13.Genome-wide patterns of selection in 230 ancient EurasiansNature 528:499–503
- 14.Genome-scale DNA methylation maps of pluripotent and differentiated cellsNature 454:766–770
- 15.A High-Coverage Genome Sequence from an Archaic Denisovan IndividualScience 338:222–226
- 16.MEG3 long noncoding RNA regulates the TGF-beta pathway genes through formation of RNA-DNA triplex structuresNat Commun 6
- 17.An xQTL map integrates the genetic architecture of the human brain’s transcriptome and epigenomeNat Neurosci 20:1418–1426
- 18.Establishing Cerebral Organoids as Models of Human-Specific Brain EvolutionCell 176:743–756
- 19.Evolutionary genomics. Evolutionary changes in promoter and enhancer activity during human corticogenesisScience 347:1155–1159
- 20.Linkage disequilibrium--understanding the evolutionary past and mapping the medical futureNat Rev Genet 9:477–485
- 21.Human-Specific NOTCH2NL Genes Expand Cortical Neurogenesis through Delta/Notch RegulationCell 173:1370–1384
- 22.Statistical method for testing the neutral mutation hypothesis by DNA polymorphismGenetics 123:585–595
- 23.Estimating F-Statistics for the Analysis of Population StructureEvolution 38:1358–1370
- 24.Fasim-LongTarget enables fast and accurate genome-wide lncRNA/DNA binding predictionComput Struct Biotechnol J 20:3347–3350
- 25.The long noncoding RNAs NEAT1 and MALAT1 bind active chromatin sitesMol Cell 55:791–802
- 26.Human-specific features of spatial gene expression and regulation in eight brain regionsGenome Res 28:1097–1110
- 27.Intersection of population variation and autoimmunity genetics in human T cell activationScience 345
- 28.Spatiotemporal transcriptomic divergence across human and macaque brain developmentScience 362
Article and author information
Author information
Version history
- Sent for peer review:
- Preprint posted:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
Copyright
© 2023, Lin et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 839
- downloads
- 22
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.