Abstract
What genes and regulatory sequences make humans and apes that share substantial genes but show distinct phenotypes has puzzled researchers for decades. Genomic studies have examined species-specific genes and regulatory sequences (e.g., transcription factor binding sites, TFBS); birth, loss, and changes in these genes and sequences can greatly drive speciation and evolution. However, those involved in epigenetic regulation - species-specific lncRNA genes and regulatory sequences - remain poorly explored. We identified human-specific (HS) lncRNAs from GENCODE-annotated human lncRNAs, predicted their DNA binding domains (DBDs) and binding sites (DBSs) genome-wide, analyzed DBSs in modern humans (CEU, CHB, and YRI), archaic humans (Altai Neanderthals, Denisovans, and Vindija Neanderthals), and their counterparts in chimpanzees, and analyzed how DBSs influence gene expression in modern and archaic humans. Our results suggest that HS lncRNAs and their DBSs have substantially rewired gene expression human-specifically, the rewiring has evolved continuously from archaic to modern humans, and rewired gene expression has promoted brain development, made humans adapt to new environments and lifestyles, and caused differences in modern humans.
1. Introduction
Genomic changes that drive evolution include gene birth (Kaessmann, 2010), gene loss (Albalat and Canestro, 2016), and changes in regulatory sequences (Prud’homme et al., 2007). The limited genomic differences but the substantial phenotypic and behavioral divergence between humans and other hominids make what genomic changes have critically driven human evolution an enduring puzzle (Anton et al., 2014; Pollen et al., 2023). Waves of studies have identified genes promoting human brain development (Evans et al., 2005; Fiddes et al., 2018; Mekel-Bobrov et al., 2005; Pinson et al., 2022; Suzuki et al., 2018), but conclusions from gene birth-centric studies are controversial (Currat et al., 2006; Timpson et al., 2007; Yu et al., 2007). Moreover, genes important for brain development (e.g., NOTCH2NL and ASPM) may not regulate other human traits that coincide with or even precede brain enlargement. Studies revealed regulatory sequence changes important for human evolution, including new sequences (Liu et al., 2021), lost sequences (McLean et al., 2011), human accelerated regions (HARs) (Dong et al., 2016; Mangan et al., 2022; Prabhakar et al., 2006; Whalen and Pollard, 2022), and the turnover of transcription factor binding sites (TFBS) (Krieger et al., 2022; Otto et al., 2009; Zhang et al., 2023). However, how epigenetic regulatory sequences contribute to human evolution remains barely explored.
Experiments have generated substantial findings on epigenetic regulation. (a) About one-third of human lncRNA are primate-specific (Derrien et al., 2012), and the chimpanzee and human genomes are almost 99% identical (Chimpanzee Sequencing and Analysis Consortium, 2005). (b) Species-specific lncRNAs have distinct expression profiles in organs of different species (Sarropoulos et al., 2019). (c) Many lncRNAs can bind to DNA sequences by forming RNA:DNA triplexes (Abu Almakarem et al., 2012), and recruit histone and DNA modification enzymes to binding sites (Lee, 2009; Roadmap Epigenomics et al., 2015). (d) Approximately 40% of differentially expressed human genes result from interspecies epigenetic differences (Hernando-Herraez et al., 2015). Thus, besides human-specific (HS) protein-coding genes and related regulatory sequences (e.g., TFBS), HS lncRNAs and their DNA binding sites may also critically promote human evolution and determine human traits by human-specifically rewiring gene expression.
Since RNA sequence search by structure and sequence alignment can reliably identify orthologous lncRNA genes in genomes (Nawrocki, 2014; Nawrocki and Eddy, 2013), HS lncRNAs can be identified upon human lncRNAs’ orthologues in other mammals. Many lncRNAs contain DNA binding domains (DBDs) and can use DBDs to bind to DNA binding sites (DBSs) in the genome. Since RNA:DNA triplexes follow specific base-pairing rules (Abu Almakarem et al., 2012), DBDs and DBSs can be computationally predicted (Lin et al., 2019; Wen et al., 2022). Detected gene expression in organs and tissues, especially in human organs and tissues reported by the GTEx project (GTEx Consortium, 2017), provides data for examining whether the expression of lncRNAs and their potential targets (genes with DBSs) are correlated. Further, genomes of modern humans, archaic humans, and multiple apes have been sequenced (1000 Genomes Project Consortium et al., 2012; Chimpanzee Sequencing and Analysis Consortium, 2005; Meyer et al., 2012; Prufer et al., 2017; Prüfer et al., 2013), making the above analyses performed across species. Thus, it is time to examine whether HS lncRNAs and their DBSs have substantially and human-specifically rewired gene expression and determined human traits.
This study identified HS lncRNAs upon the orthologues of the GENCODE-annotated human lncRNA genes in 16 mammalian genomes, predicted HS lncRNAs’ DBDs and DBSs in the genomes of modern humans (CEU, CHB, and YRI), archaic humans (Altai Neanderthals, Denisovans, Vindija Neanderthals), and these DBSs’ counterparts in the genome of chimpanzees (Figure 1A). We experimentally validated DBS prediction using CRISPR/Cas9 experiment and multiple cell lines. Cross-population and cross-species analyses allowed us to exploit reasonable methods, controls, gene numbers, and data sources. To analyze DBSs accurately, which have different lengths, unclear biochemical properties, and significant variations across populations and species, we computed the binding affinity for each DBS upon its length and identity (percent of paired nucleotides) and classified DBSs into strong and weak ones. We also classified DBSs into old and young ones upon when most of their sequence changes occurred. These strategies ensure robust genomic analyses, which include evolutionary/population genetics analyses, and transcriptomic analyses, which cover multiple datasets, including the GTEx data and data of multiple experiments. Our analyses reveal that HS lncRNAs and their DBSs have distinctly and continuously rewired gene expression during human evolution, and the rewired gene expression determines or greatly influences diverse human traits from a rapidly evolved brain to a modified metabolism system.

Study Overview.
(A) The relationships between the chimpanzees, the three archaic humans (Altai Neanderthals, Denisovans, and Vindija Neanderthals), and the three modern human populations. Dashed lines indicate the phylogenetic distances from modern humans to archaic humans and chimpanzees. Genes and DBSs have the following implications – the DBS in B2M lacks a counterpart in chimpanzees; the DBS in ABL2 has great differences between archaic and modern humans; the DBS in IRNAR1 is highly polymorphic in modern humans. Red letters in DBSs indicate tissue-specific eQTLs or population-specific favored mutations. (B) DBSs’ mean length and binding affinity. (C) The target genes and transcripts of 66 HS lncRNAs. (D) The targeting relationships between HS lncRNAs. (E) The sequence distances of DBSs (the top 40%) from modern humans to chimpanzees and archaic humans. (F) The impacts of interactions between HS lncRNAs and DBSs on gene expression in GTEx tissues.
2. Results
2.1. HS lncRNAs regulate diverse genes and transcripts
How many human lncRNA genes are human-specific is an important question still without a clear answer. After searching for orthologues of the 13562 GENCODE-annotated human lncRNA genes (v18) in 16 mammalian genomes and in the genomes of archaic humans using the Infernal program that performs RNA sequence and structure alignment (Figure 1A) (Lin et al., 2019; Nawrocki, 2014; Nawrocki et al., 2009), we identified 66 HS lncRNA genes (Supplementary Table 1). Then, we predicted DBSs of HS lncRNAs in the 5000 bp promoter regions of the 179128 Ensembl-annotated transcripts (release 79) using the LongTarget program (Supplementary Note 1) (Lin et al., 2019). The binding affinity of a pair of DBD and DBS depends on their length and identity (i.e., the percent of paired nucleotides). Analyzing DBSs at known imprinted genes predicted using different LongTarget parameters and comparing DBSs with CHART-seq-reported binding sequences (“peaks”) reveal the reasonable affinity threshold of 60 (He et al., 2015) (Supplementary Fig. 2). Since weak DBSs could be generated recently and play subtle functions, we used affinity>=60 and 36<affinity<60 as the thresholds for identifying strong and weak DBSs. 105141 strong DBSs (mean length>147 bp) were identified in 96789 transcripts of 23396 genes; 152836 weak DBSs were identified in 127898 transcripts of 33185 genes (Figure 1B; Supplementary Table 2, 3). These genes and transcripts are HS lncRNAs’ potential targets. Several HS lncRNAs (e.g., RP11-423H2.3) have abundant targets; many targets have DBSs for multiple HS lncRNAs; many HS lncRNAs have multiple DBSs in a single target; and about 1.5% of target genes (0.6% of target transcripts) are human-specific (Figure 1C). Further, HS lncRNAs themselves show complex targeting relationships (Figure 1D; Supplementary Note 2). These results indicate that HS lncRNAs have distinctly rewired the expression of substantial conserved genes instead of regulating evolutionarily new genes.
We validated DBS prediction using multiple methods and datasets (Supplementary Note 1; Supplementary Fig 1-5). (a) HS lncRNA DBSs overlap well with experimentally identified DNA methylation and histone modification signals in multiple cell lines (https://genome.UCSC.edu). (b) Many DBSs co-localize with annotated cis-regulatory elements (cCREs) in genes’ promoter regions (https://genome.UCSC.edu). (c) Predicted DBSs overlap well with experimentally detected DBSs of NEAT1, MALAT1, and MEG3 (Mondal et al., 2015; West et al., 2014). (d) We used CRISPR/Cas9 to knock out predicted DBDs (100-200 bp) of seven lncRNAs (three HS lncRNAs and four wrongly transcribed lncRNAs) in multiple cell lines, performed RNA-seq before and after DBD knockout, analyzed differential gene expression, and found that the |fold change| of target genes was significantly larger than the |fold change| of non-target genes (one-sided Mann-Whitney test, p = 3.1e-72, 1.49e-114, 1.12e-206, 2.58e-09, 6.49e-41, 0.034, and 5.23e-07, respectively). This result suggests that the changed expression of target genes is caused by DBD knockout. (e) The LongTarget program uses the Smith-Waterman local alignment to identify DBD/DBS binding; thus, taking a DBS of 147 bp as an example, it is extremely unlikely to be generated by chance (p < 8.2e-19 to 1.5e-48).
HARs are conserved genomic loci yet show accelerated evolution in the human lineage. Researchers assume that HARs critically influence human evolution, but results obtained from HAR studies are controversial (Levchenko et al., 2018). The Zoonomia Project recently identified 312 HARs important for 3D genome rewiring and neurodevelopment (Keough et al., 2023). Since only 8 HARs overlap 26 DBSs of 14 HS lncRNAs, we infer that DBS and HAR are functionally different sequences.
2.2. HS lncRNAs’ target genes characterize human evolution and traits
Since strong and weak DBSs may occur and function in different periods in human evolution, it is interesting and necessary to examine whether their target genes are enriched in distinct functions. About 5% of genes have significant sequence differences in humans and chimpanzees, but more show expression differences due to regulatory sequences. We sorted target genes by their DBS affinity and, to be prudential, chose the top 2000 genes (DBS length>252 bp and binding affinity>151) and bottom 2000 genes (DBS length#x003C;60 bp but binding affinity>36) to conduct over-representation analysis using the g:Profiler program and Gene Ontology (GO) database to the two gene sets (Supplementary Table 4). The top and bottom 2000 genes (with the whole genome as the reference) are co-enriched in GO terms important for human evolution, including "neuron projection development" (GO:0031175), "learning" (GO:0007612), "regulation of anatomical structure size" (GO: GO:0090066), "head development" (GO:0060322), and "regulation of locomotion" (GO:0040012) (Supplementary Note 3; Supplementary Table 4). The genes containing strongest DBSs (514 DBSs have length >500 bp) include IFNAR1 and NFATC1 (important for immune function), KIF21B and NTSR1 (critical for neural development), SLC2A11 and SLC2A1 (involved in glucose usage), BAIAP3 (a brain-specific angiogenesis inhibitor) and TAS1R3 (a receptor for the sweet taste response), and some primate-specific noncoding RNA genes (e.g., CTD-3224I3.3 which is highly expressed in the cerebellum, lung, and testis) (Table 1; Supplementary Table 4, 5). On the other hand, genes with weak DBSs are enriched in such GO terms as “response to alcohol” (GO:0097305), “response to temperature stimulus” (GO:0009266), “DNA methylation or demethylation” (GO:0044728), and “female pregnancy” (GO:0007565) (Supplementary Note 3). These findings justify the classification of DBSs into strong and weak ones and indicate that strong and weak DBSs have different functions.


Genes with DBSs that have largest binding affinity and mostly changed sequence distances (from modern humans to archaic humans and chimpanzees)
2.3. The rewiring of gene expression has greatly evolved
The human genome is almost 99%, 99%, and 98% identical to the genomes of chimpanzees, bonobos, and gorillas (Chimpanzee Sequencing and Analysis Consortium, 2005; Prufer et al., 2012), but what sequence variants critically differentiate these species remains unknown. We assessed whether HS lncRNAs’ DBSs are conserved in these species or human-specific. Notably, 97.81% of the 105141 strong DBSs have counterparts in chimpanzees, suggesting that these DBSs are similar to HARs in evolution and have undergone human-specific evolution.
HS lncRNAs may be generated before, with, or after their DBSs. To analyze the evolution of HS lncRNAs and their DBSs, we identified their counterparts in not only chimpanzees but also Altai Neanderthals, Denisovans, and Vindija Neanderthals (Meyer et al., 2012; Prufer et al., 2017; Prüfer et al., 2013). Counterparts of HS lncRNAs and strong DBSs were identified in these archaic humans, but the distances (distance per base) of HS lncRNAs are smaller than the distances of DBSs between archaic and modern humans, suggesting that many DBSs are older than HS lncRNAs and HS lncRNAs were generated after most of their DBSs. This suggests that HS lncRNAs make conversed genes in apes expressed specifically in humans.
We computed DBS distances in two ways to examine DBS evolution. The first is from the reconstructed human ancestor (downloaded from the EBI website) to chimpanzees, archaic humans, and modern humans; the second is from chimpanzees and archaic humans to modern humans. The first set of distances has the anomaly that many distances from the human ancestor to modern humans are smaller than to archaic humans. For example, DBSs in the top 20% genes in the human ancestor have distances >0.015 to modern humans, but more DBSs have distances >0.015 to Altai Neanderthals, Denisovans, and even chimpanzees (Supplementary Note 3). This anomaly is caused by the human ancestor being built using six primates without archaic humans. The second set of distances agrees well with the phylogenetic distances between chimpanzees, archaic humans, and modern humans. Large distances are in a small portion of DBSs (Figure 1E), thus targets containing these DBSs may greatly characterize human evolution. Based on a recent study that identified 5,984 genes differentially expressed between human-only and chimpanzee-only iPSC lines (Song et al., 2021), we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences, and we found that these genes have DBS distances >0.034. Taking 0.034 as the threshold, we identified 1256, 2514, and 134 genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals with DBS distance >0.034. These numbers agree with the phylogenetic relationships between these archaic and modern humans (Figure 1A; Supplementary Table 6).
To reveal whether target genes with mostly changed DBSs are enriched in specific functions, we applied over-representation analysis to target genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals using the g:Profiler program and Gene Ontology (GO) database. To make statistical significance comparable, we used just the top 1256 genes as the inputs. The 1256 genes in chimpanzees and Altai Neanderthals are enriched in many common GO terms. However, “behavior” (GO:0007610) and "cognition" (GO:0050890) have much higher significance in Altai Neanderthals than in chimpanzees (Figure 2), and some GO terms such as "learning" (GO:0007612), “learning or memory” (GO:0007611), and “glucose homeostasis” (GO:0042593) is enriched only in Altai Neanderthals. In addition, agreeing with the closer relationship between Vindija Neanderthals and modern humans, the 134 genes in Vindija Neanderthals are enriched only in two GO terms with low significance (“smoothened signaling pathway” (GO:0007224) and “regulation of smoothened signaling pathway” (GO:0008589)) (Supplementary Table 7). These results reveal when and how HS lncRNA-mediated epigenetic regulation influences human evolution.

1256 target genes whose DBSs have the largest distances from modern humans to chimpanzees and Altai Neanderthals are enriched in different Biological Processes GO terms.
Upon significance threshold = 0.05 (Benjamini-Hochberg FDR), the two gene sets in chimpanzees and Altai Neanderthals are enriched in 199 and 409 GO terms (50<terms size<1000), respectively. Shared GO terms are color-marked. Left: Top GO terms enriched in genes in chimpanzees. Right: Top GO terms enriched in genes in Altai Neanderthals.
2.4. Epigenetic regulation by HS lncRNAs shows differences in modern humans
To check whether the rewiring has been population-specifically revised in modern humans, we analyzed single nucleotide polymorphisms (SNPs, with minimal allele frequency (MAF)≥0.1) in strong DBSs using the 1000 Human Genome data (1000 Genomes Project Consortium et al., 2012). We computed the average number of SNPs for all DBSs, identified highly polymorphic DBSs (SNP numbers in CEU, CHB, YRI are 5 times larger than the average), and identified DBSs with mostly changed sequence distances from modern humans to chimpanzees, Altai Neanderthals, and Denisovans (sequence distance>0.034 to chimpanzees, Altai Neanderthals, and Denisovans which are indicated by C, A, D in Table 1,2). The intersection of the two classes of DBSs (highly polymorphic DBSs and mostly changed DBSs), wherein SNPs may greatly change binding affinity, are found in genes important for human evolution (Supplementary Table 8), including IFNAR1, NFATC1, TAS1R3, INS, ST3GAL4, and FN3KRP (Table 2). Genes involved in sugar metabolism are especially notable. A key feature of modern human diets is high sugar intake which causes non-enzymatic oxidation of proteins (i.e., glycation) and deactivates protein functions. Among proteins encoded by genes that have these two classes of DBSs, TAS1R3 recognizes diverse natural and synthetic sweeteners, Insulin decreases blood glucose concentration, ST3GAL4 regulates protein glycosylation, and FN3KRP deglycates proteins. These DBSs indicate continuous evolution of gene regulation human-specifically and population-specifically by HS lncRNAs.


Genes with most polymorphic DBSs and DBSs with mostly changed sequence distances from modern humans to archaic humans and chimpanzees
We next used multiple statistical tests, including XP-CLR (Chen et al., 2010), iSAFE (Akbari et al., 2018), Tajima’s D (Tajima, 1989), Fay-Wu’s H (Fay and Wu, 2000), the fixation index (Fst) (Weir and Cockerham, 1984), and linkage disequilibrium (LD) (Slatkin, 2008), with widely-used parameters and thresholds (Supplementary Note 4, 5), to detect selection signals in HS lncRNAs and strong DBSs in CEU, CHB, and YRI which are the representative populations of Europeans, Asians, and Africans. These tests detected selection signals in several HS lncRNA genes in CEU and CHB but not in YRI and selection signals in more DBSs in CEU and CHB than in YRI. Selection signals in YRI may be underestimated due to fewer samples and smaller sample sizes (than CEU and CHB), yet the overall results agree well with features of human evolution. Signals in considerable DBSs were detected by multiple tests, indicating the reliability of the analysis (Supplementary Note 4, 5; Supplementary Table 9, 10, 11). DBSs in the two pigmentation-related genes MC1R and MFSD12 and the two odor reception-related genes OR6C1 and TAS1R3 are notable examples. The rich SNPs, population-specific selection signals, and specific functions of related genes together indicate the ongoing influence of rewired gene expression on human evolution.
2.5. Young weak DBSs may have greatly promoted recent human evolution
To further examine whether the rewiring has been revised recently, we classified strong and weak DBSs with significant sequence changes into “old” ones, “young” ones, and others. We defined old DBSs as having “Human-Chimp distance>0.034 AND Human-Altai Neanderthals distance==0” and young DBSs as having “Human-Altai Neanderthals distance>0.034 OR Human-Denisovan distance>0.034”, which demark early and late sequence changes in human evolution. We identified favored and hitchhiking mutations in 17 human populations (Tang et al., 2022; Tang et al., 2023), classified DBSs into six classes (strong old, strong young, strong other, weak old, weak young, weak others), and examined the numbers of favored and hitchhiking mutations in each class (Supplementary Table 16). Favored and hitchhiking mutations are most enriched in young weak DBSs, suggesting that these DBSs may have greatly promoted recent human evolution.
2.6. SNPs in DBSs have cis-effects on gene expression
To verify the above genomic and population genetics analysis, we examined SNPs in DBSs and their impact on gene expression using transcriptomic data. Based on the Genotype-Tissue Expression (GTEx) data (GTEx Consortium, 2017), we identified expression quantitative trait loci (eQTLs) in DBSs using the widely adopted criteria of MAF≥0.1 and cis-effect size (ES)≥0.5. 1727 SNPs, with MAF ≥0.1 in at least one population and absolute values of ES≥0.5 in at least one tissue, were identified in DBSs in autosomal genes (Supplementary Table 12,13). These eQTLs include 372 “conserved” ones (i.e., also in DBSs in the three archaic humans) and 1020 “novel” ones (i.e., only in DBSs in modern humans). Many conserved eQTLs are in brain tissues and have high derived allele frequencies (DAFs) in all three modern populations. A notable eQTL that has high DAFs in all three populations and has positive ES≥0.5 in 44 out of 48 GTEx tissues is rs2246577 in the DBS in FN3KRP, whose encoded protein deglycates proteins to restore their function (Table 2). In contrast, many novel eQTLs are in a particular tissue and have population-specific DAFs (Supplementary Note 6).
Next, we performed two analyses to examine how eQTLs in DBSs influence gene expression via lncRNA-DBS interactions. First, we computed the expression correlation between HS lncRNAs and target transcripts with DBSs having eQTLs in specific tissues. Most (94%) HS lncRNA-target transcript pairs showed significant expression correlation in tissues in which the eQTLs were identified (|Spearman’s rho| >0.3 and FDR <0.05). Second, we examined whether eQTLs are more enriched in DBSs than in promoter regions by computing and comparing the density of eQTLs in each DBS and each Ensembl-annotated promoter. Using promoters as the control for the comparison is reasonable as they contain and also are longer than DBSs. The results indicate that eQTLs are more enriched in DBSs than in promoters (one-sided Mann-Whitney test, p = 0.0) and that these eQTLs influence gene expression via lncRNA-DBS interactions (Supplementary Note 6). Thus, tissue-specific eQTLs and population-specific SNPs in DBSs support the genomic and population genetics analysis.
2.7. HS lncRNAs promote brain evolution from archaic to modern humans
To further confirm that SNPs in DBSs have cis-effects on gene expression, we examined the impact of HS lncRNA-DBS on gene expression in the GTEx tissues (GTEx Consortium, 2017). 40 autosomal HS lncRNAs are expressed in at least one tissue (median TPM>0.1) and form 198876 HS lncRNA-DBS pairs with target transcripts in all tissues. Of these pairs, 45% show a significant expression correlation in specific tissues (Spearman’s |rho|>0.3 and FDR<0.05). To check how likely such correlation could be obtained by chance, we randomly sampled 10000 pairs of lncRNAs and protein-coding transcripts genome-wide and found that only 2.3% of pairs show significant expression correlation (Spearman’s |rho|>0.3 and FDR<0.05). Moreover, a higher percentage (56%) of HS lncRNA-target transcript pairs with significant correlation is detected in at least one brain region, indicating more extensive gene expression regulation by HS lncRNAs in the brain (Figure 3A).

The impact of HS lncRNA-DBS interaction on gene expression in GTEx tissues and organs.
(A) The percentage distribution of correlated HS lncRNA-target transcript pairs across GTEx tissues and organs. Higher percentages of correlated pairs are in the brain than in other tissues and organs. (B) The distribution of significantly changed DBSs in HS lncRNA-target transcript pairs across GTEx tissues and organs between archaic and modern humans. Orange, red, and dark red indicate significant DBS changes from Denisovans (D), Altai Neanderthals and Denisovans (AD), and all of the three archaic humans (ADV). DBSs in HS lncRNA-target transcript pairs correlated in seven brain regions (in dark red) have changed significantly and consistently since the Altai Neanderthals, Denisovans, and Vindija Neanderthals (one-sided two-sample Kolmogorov-Smirnov test, significant changes determined by FDR <0.001).
To obtain detailed supporting evidence, we analyzed nine experimental datasets (Supplementary Note 7). First, we analyzed two datasets of epigenetic studies, one examining H3K27ac and H3K4me2 profiling of human, macaque, and mouse corticogenesis, and the other examining gene expression and H3K27ac modification in eight brain regions in humans and four other primates (Reilly et al., 2015; Xu et al., 2018). Compared with the genome-wide background, 84% and 73% of genes in the two datasets have DBSs for HS lncRNAs, indicating significantly higher enrichment in HS lncRNA targets (p = 1.21e-21 and 1.2e-56, two-sided Fisher’s exact test). When taking gene expression in chimpanzees as the control, 1851 genes show human-specific transcriptomic differences in one or multiple brain regions, but only 240 genes show chimpanzee-specific transcriptomic differences. Second, we analyzed two datasets of PsychENCODE studies, one examining spatiotemporally differentially expressed genes and spatiotemporally differentially methylated sites in 16 brain regions, and the other examining the spatiotemporal transcriptomic divergence across human and macaque brain development (Li et al., 2018; Zhu et al., 2018). In the first dataset, 65 HS lncRNAs are expressed in all 16 brain regions, 109 transcripts have spatiotemporally differentially methylated sites in their promoters, and 56 of the 109 transcripts have DBSs for HS lncRNAs. In the second dataset, 8951 genes show differential expression between the human and macaque brains, and 72% of differentially expressed genes in the human brain have DBSs of HS lncRNAs. Thus, both datasets show significant enrichment for HS lncRNA regulation. Third, three studies identified genes critically regulating cortical expansion (Florio et al., 2018; Johnson et al., 2018; Suzuki et al., 2018). Of the 40 reported protein-coding genes, 29 have DBSs of HS lncRNAs in promoter regions (Supplementary Table 14). Thus, these genes are enriched with DBSs of HS lncRNAs compared with the genome-wide background (p<0.01, two-sided Fisher’s exact test). Finally, we analyzed two datasets of brain organoids studies. By establishing and comparing cerebral organoids between humans, chimpanzees, and macaques, Pollen et al. identified 261 human-specific gene expression changes (Pollen et al., 2019). By fusing human and chimpanzee iPS cells and differentiating the hybrid iPS cells into hybrid cortical spheroids, Agoglia et al. generated a panel of tetraploid human-chimpanzee hybrid iPS cells and identified thousands of genes with divergent expression between humans and chimpanzees (Agoglia et al., 2021). We found that the 261 and 1102 genes in the two datasets are enriched with DBSs of HS lncRNAs compared with the genome-wide background (p=1.2e-16 and 3.4e-74).
Finally, we next examined the evolution of the impact of HS lncRNAs and their DBSs on gene expression using the GTEx data. For each DBS in a pair of HS lncRNA-target transcript that show correlated expression in a GTEx tissue, we computed the sequence distances of this DBS from modern humans to archaic humans. Then, we compared the distribution of DBS sequence distances in each tissue with the distribution of DBS sequence distances in all tissues (the control) (one-sided two-sample Kolmogorov-Smirnov test). DBSs in HS lncRNA-target transcript pairs correlated in brain regions have significantly changed sequence distances since the Altai Neanderthals and Denisovans (Figure 3B), suggesting that gene expression regulation by HS lncRNAs in the brain has experienced significant evolution and that HS lncRNAs contributed greatly to recent brain evolution (Ponce de Leon et al., 2021). Results from other GTEx tissues, including Kidney-Cortex, Ovary, and Skin-Sun Expressed (Figure 3B), also suggest continuous influences of rewired gene expression on human evolution.
2.8. HS lncRNAs mediate human-specific correlated gene expression in the brain
Species-specificity of expression correlation highlights species-specific transcriptional regulation. To further explore the evidence in this aspect, we used the GTEx data from the human frontal cortex (BA9) and anterior cingulate cortex (BA24) (n=101 and 83, respectively) (GTEx Consortium, 2017), the transcriptomic data for the same brain regions in macaques (n=22 and 25, respectively) (Zhu et al., 2018), and the eGRAM program we developed recently, to identify co-expressed gene modules. HS lncRNAs’ target genes form distinct modules in the human brain, in which genes have rich correlation relationships (Pearson’s r>0.8) and are enriched in neurodevelopment-related KEGG pathways (hypergeometric distribution test, FDR<0.05). In contrast, their orthologues in the same macaque brain regions do not have these features (Figure 4; Supplementary Figure 24; Supplementary Table 17). The distinct differences in gene modules in these brain regions further support that gene expression has been substantially rewired by HS lncRNAs in the human brain.

Human-specifically rewired gene expression by HS lncRNAs in the frontal cortex (BA9).
(A) Genes expressed in the frontal cortex are enriched for HS lncRNAs’ target genes and neurodevelopment-related pathways. Squares indicate HS lncRNAs; dots indicate gene modules (Module_1 and Module_2 are illustrated); colors indicate enriched KEGG pathways. (B) Comparison of modules of the same genes in humans and macaques (indicated by H and M). In each pair of modules, green dots, and blue dots denote human genes and their orthologues in the corresponding positions, and lines indicate correlation. Correlated expression is more prominent in humans than in macaques.
3. Discussion
Modern humans differ from archaic humans and great apes trivially in genomes but greatly in phenotypes, making what genomic differences critically determine modern humans a profound question. This question has been addressed by many researchers along multiple approaches. Previous attention has been paid to new protein-coding genes, HARs, and TF-TFBS interactions, and findings include that HS protein-coding genes promote rapid neocortex expansion (Fiddes et al., 2018; Florio et al., 2018; Pinson et al., 2022), that HARs are significantly enriched in domains important for human-specific 3D genome organization (Keough et al., 2023), and that gene expression is rewired by TFs and TFBSs (Johnson, 2017). These findings make us postulate that HS lncRNAs have greatly rewired gene expression. To verify this postulation, this study identified HS lncRNAs, their DBSs, and these DBSs’ counterparts in ancient humans and multiple apes. Our analyses of these lncRNAs, DBSs, and target genes, including their evolution and interaction, indicate that HS lncRNAs have greatly promoted human evolution by distinctly rewiring gene expression. This conclusion is supported by our analyses of multiple experimental datasets. The rewiring has greatly evolved from humans’ ancestors to archaic humans and modern humans and still influences modern human populations.
As DBS prediction is the basis of the study, we performed CRISPR/Cas9 experiments to validate predicted DBSs. When running Infernal, LongTarget, MAFFT7, Phylip, and eGRAM, default parameters that suit most situations were used. When analyzing gene sets, the whole genome or the chimpanzee genome were used as controls. When identifying HS lncRNAs’ target genes that probably critically characterize human evolution, the top 20% of genes having the largest DBS distances from humans to chimpanzees were used because this percentage of genes may well characterize the human-chimpanzee differences. Accordingly, the DBS distance of 0.034 was used as the threshold to identify genes that may well characterize the differences between modern and archaic humans. The numbers of critical genes agree with the phylogenetic relationships between modern humans, archaic humans, and chimpanzees. Importantly, we differentiated strong/weak DBSs and old/young DBSs. We found that different classes of DBSs have different evolutionary and population genetics features and genes with these DBSs are enriched in distinct functions. Other values of parameters and thresholds may generate slightly different results, but the core of the results and the main conclusion probably still hold. Finally, not all detected signals reliably indicate positive selection.
Several findings are noteworthy. First, Neanderthal-inherited sequences have measurable impacts on gene expression in modern humans, and the impacts are least detectable in the brain (McCoy et al., 2017). In comparison, we find that HS lncRNAs, which may have evolved quite recently, have more significantly influenced gene expression in the brain than in other tissues. Second, the regulation of many genes by HS lncRNAs has undergone great evolution, notable examples including those making humans adapt to high sugar intake. Third, intensively regulated genes are enriched in different functions at different periods of human evolution. Specifically, “old” SNPs in DBSs are in genes regulating neural development, and “new” SNPs in DBSs are in genes regulating such processes as glucose metabolism. Fourth, recently emerged regulations (i.e., target genes with young weak DBSs) have significantly promoted recent human evolution. These findings reveal diverse, continuous, and subtle contributions of HS lncRNA to human evolution.
The above findings also raise new questions. First, the protein-coding regions in the mouse and human genomes are about 85% identical; however, many mouse lncRNAs are rodent-specific (Yue et al., 2014). Thus, two questions are how mouse-specific lncRNAs specifically rewire gene expression in mice and how human- and mouse-specific rewiring influences the cross-species transcriptional differences (Breschi et al., 2017; Hodge et al., 2019; Zhu et al., 2018). Second, whether mice and other mammals (especially model animals) have species-specific lncRNAs such as RP11-423H2.3 that substantially rewires gene expression species-specifically. Third, whether and how clade- and species-specific lncRNAs would make evolution preordained. These questions are theoretically interesting and important for translational medicine.
4. Material and methods
4.1. Data sources
The sequences of human (GRCh38/hg38, GRCh37/hg19) and chimpanzee genomes (panTro5) were obtained from the UCSC Genome Browser (http://genome.UCSC.edu). Three high-quality archaic human genomes were obtained from the Max Planck Institute for Evolutionary Anthropology (https://www.eva.mpg.de/genetics/index.html), which include an Altai Neanderthal which lived approximately 122 thousand years ago (kya) (Prüfer et al., 2013), a Denisovan (an Asian relative of Neanderthals) which lived approximately 72 kya (Meyer et al., 2012), and a Vindija Neanderthal which lived approximately 52 kya (Prufer et al., 2017). The ancestral sequences for the human genome (GRCh37) (which were generated upon the EPO alignments of six primate species human, chimpanzee, gorilla, orangutan, macaque, and marmoset) were downloaded from the EBI website (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/). The SNP data of modern humans were obtained from the 1000 Genomes Project (phase I) (1000 Genomes Project Consortium et al., 2012). The three modern human populations are CEU (Utah residents with Northern and Western European ancestry), CHB (Han Chinese in Beijing, China), and YRI (Yoruba in Ibadan, Nigeria), which contain 100, 99, and 97 individuals, respectively. The genetic map files of modern humans were downloaded from the 1000 Genomes Project website (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110106_recombination_hotspots/). The 1000 Genomes phased genotype data in VCF format were downloaded from the International Genome Sample Resource website (http://www.internationalgenome.org/). Annotated human genes and transcripts were obtained from the Ensembl website (http://www.Ensembl.org). FASTA-format nucleotide sequences were converted from VCFs using the vcf-consensus program in the VCFtools package when necessary (Danecek et al., 2011). Data in the GeneCards Human Gene Database (www.genecards.org) were used to annotate genes and transcripts.
RNA-sequencing (RNA-seq) data (transcripts per million [TPM]) and eQTL data of human tissues were obtained from the Genotype-Tissue Expression Project (GTEx, v8) website (https://gtexportal.org/) (GTEx Consortium, 2017). Histone modification and DNA methylation signals in multiple cell lines and ENCODE Candidate Cis-Regulatory Elements (cCRE) were obtained from the UCSC Genome Browser (http://genome.UCSC.edu). The genome-wide DNA binding sites of lncRNA NEAT1, MALAT1, and MEG3 were obtained from experimental studies (Mondal et al., 2015; West et al., 2014). The predicted and experimentally detected DBSs in target genes of MALAT1, NEAT1, and MEG3 are given in supplementary tables (Supplementary Table 15).
4.2. Identifying HS lncRNA genes
We used the Infernal program (Nawrocki et al., 2009), which searches orthologous RNA sequences upon sequence and structure alignment, to identify orthologous exons in 16 mammalian genomes for each exon in each of the 13562 GENCODE (v18)-annotated human lncRNA genes (Lin et al., 2019). The 16 mammals were chimpanzee (CSAC 2.1.4/panTro4), macaque (BGI CR_1.0/rheMac3), marmoset (WUGSC 3.2/calJac3), tarsier (Broad/tarSyr1), mouse lemur (Broad/micMur1), tree shrew (Broad/tupBel1), mouse (GRCm38/mm10), rat (Baylor3.4/rn4, RGSC6.0/rn6), guinea pig (Broad/cavPor3), rabbit (Broad/oryCun2), dog (Broad CanFam3.1/canFam3), cow (Baylor Btau_4.6.1/bosTau7), elephant(Broad/loxAfr3), hedgehog (EriEur2.0/eriEur2), opossum (Broad/monDom5), and platypus (WUGSC 5.0.1/ornAna1) (http://genome.UCSC.edu). If the number of orthologous exons of a human lncRNA gene in a genome exceeded half the exon number of the human lncRNA gene, these orthologous exons were assumed to form an orthologous lncRNA gene. If a human lncRNA gene had no orthologous gene in all of the 16 mammals, it was assumed to be human-specific.
4.3. Identifying DBSs of HS lncRNAs
LncRNAs bind to DNA sequences by forming RNA:DNA triplexes. Each triplex comprises triplex-forming oligonucleotides (TFO) in the lncRNA and a triplex-targeting site (TTS) in the DNA sequence. We used the LongTarget program to predict HS lncRNAs’ DNA binding domains (DBD) and binding sites (DBS) with the default parameters (Ruleset = all, T.T. penalty = -1000, CC penalty = 0, Offset = 15, Identity ≥ 60, Nt ≥ 50) (Lin et al., 2019; Wen et al., 2022). LongTarget program predicts DBDs and DBSs simultaneously, with a DBD comprising a set of densely overlapping TFOs and a DBS comprising a set of densely overlapping TTSs. For each HS lncRNA, we predicted its DBSs in the 5000 bp promoter regions (-3500 bp upstream and +1500 bp downstream the transcription start site) of the 179128 Ensembl-annotated transcripts (release 79). For each DBS, a binding affinity was calculated by multiplexing the DBS length and the averaged Identity score of all TTSs (the Identity score is the percentage of paired nucleotides). Strong and weak DBSs were classified upon binding affinity ≥ 60 and < 60. A transcript whose promoter region contains a strong DBS of an HS lncRNA was assumed to be a target transcript of the HS lncRNA, and the gene containing this transcript was assumed to be a target gene of the HS lncRNA. As the 1000 Genomes Project (phase I) data and the archaic human genomes are based on GRCh37/hg19, the coordinates of DBSs were converted from GRCh38/hg38 to GRCh37/hg19 using the liftover program from the UCSC Genome Browser (Kuhn et al., 2013).
4.4. Experimentally validating DBS prediction
A 157 bp sequence (chr17:80252565-80252721, hg19) containing the DBD of RP13-516M14.1, a 202 bp sequence (chr1:113392603-113392804, hg19) containing the DBD of RP11-426L16.8, and a 198 bp sequence (chr17:19460524-19460721, hg19) containing the DBD of SNORA59B, were knocked out in the HeLa cell line, RKO cell line, and SK-MES-1 cell line, respectively. Two sequences (chr1:156643524-156643684, chr10:52445649-52445740, hg38) containing the DBD of two wrongly transcribed noncoding sequences were knocked out in the HCT-116 and A549 cell lines, respectively. The seven knockouts were performed by UBIGENE, Guangzhou, China (http://www.ubigene.com) using CRISPR-UTM, a revised version of CRISPR/Cas9 technology. Before and after the seven DBD knockouts, RNA-sequencing (RNA-seq) was performed by Novogene, Beijing, China (https://cn.novogene.com) and HaploX, Shenzhen, China ( https://www.haplox.cn/). The reads were aligned to the human GRCh38 genome using the Hiasat2 program (Kim et al., 2019), and the resulting Sam files were converted to bam files using Samtools (Li et al., 2009). The Stringtie program was used to quantify gene expression levels (Pertea et al., 2015). Fold change of gene expression was computed using the edgeR package (Robinson et al., 2010), and significant up- and down-regulation of target genes after DBD knockout was determined upon |log2(fold change)| > 1 with FDR < 0.1.
Genome-wide DBSs of NEAT1, MALAT1, and MEG3 were experimentally detected (Mondal et al., 2015; West et al., 2014). We also used these data to validate DBS prediction by predicting DBSs of the three lncRNAs and checking the overlap between predicted and experimentally detected DBSs (Supplementary Table 15).
4.5. Mapping DBSs of HS lncRNAs in the chimpanzee and archaic human genomes
We used the liftover program from the UCSC Genome Browser (http://genome.UCSC.edu) to map loci of DBSs in the human genome hg38 to the chimpanzee genome Pan_tro 3.0 (panTro5). The mapping results were examined by checking the human-chimpanzee pairwise alignment (hg38 vs panTro5) in the UCSC Genome Browser (http://genome.UCSC.edu). 97.81% (100403/102651) of DBSs were mapped, and the 2248 unmapped DBSs in 429 genes were considered human-specific. We used vcf-consensus in the VCFtools package to extract the DBSs of HS lncRNAs from the VCF files of Altai Neanderthals, Denisovans, and Vindija Neanderthals. The variant with the highest quality score was selected whenever multiple variant calls were observed at a given locus. The obtained DBS sequences in the chimpanzees and three archaic humans are called counterparts of DBSs in these genomes.
4.6. Estimating sequence distances of DBSs between different genomes
We first aligned DBS sequences in the genomes of humans, chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals using the MAFFT7 program for measuring sequence distances from modern humans to chimpanzees and archaic humans (Katoh and Standley, 2013). We then computed sequence distances using the dnadist program with the Kimura 2-parameter model in the PHYLIP (3.6) package (http://evolution.genetics.washington.edu/phylip.html) and using the Tamura-Nei model in the MEGA7 package (Kumar et al., 2016). The two methods generated equivalent results. The largest distance between DBSs in humans and their counterparts in chimpanzees is 5.383. Since 2248 DBSs in 429 human genes do not have counterparts in chimpanzees, we assumed that the sequence distances of these DBSs between humans and chimpanzees were 10.0. Since the top 20% of DBSs in chimpanzees (including these 2248 DBSs) have distances ≥ 0.034, we used 0.034 as the threshold to identify the "mostly changed" DBSs in the chimpanzees and three archaic humans.
We determined human ancestral sequences of DBSs upon the human ancestor sequences from the EBI website, which were generated upon the EPO alignments of six primate species. We used the above-mentioned methods to compute DBSs’ sequence distances from the human ancestor to chimpanzees, archaic humans, and modern humans. We found that when the human-chimpanzee ancestral sequence has the ancestral sequence (which means the inferring of ancestral allele is of high confidence), DBS distances from the human ancestor to modern humans are larger than to archaic humans, but this situation only accounts for about 63.8%. For many DBSs, the distances from the human ancestor to modern humans are smaller than to archaic humans (especially Neanderthals and Denisovans) and even to chimpanzees. This defect may be caused by the absence of archaic humans in building the human ancestral sequence.
4.7. Detecting positive selection signals in HS lncRNA genes and DBSs
We used multiple tests to detect positive selection signals in HS lncRNA genes. First, we used the XP-CLR test (parameters = -w1 0.001 300 100 -p0 0.95, window size = 0.1 cM, grid size = 100 bp) to perform six pairwise genome-wide scans (i.e., CEU-CHB, CEU-YRI, CHB-CEU, CHB-YRI, YRI-CEU, and YRI-CHB) (Chen et al., 2010). The upper 1% of scores across the entire genome in each pairwise scan were 34.6 in the CEU-YRI scan, 16.8 in the CEU-CHB scan, 45.0 in the CHB-YRI scan, 26.9 in the CHB-CEU scan, 14.1 in the YRI-CEU scan, and 14.1 in the YRI-CHB scan. These scores were used as the thresholds of positive selection signals in these populations. Second, we used the iSAFE program to scan each genomic region that contains a HS lncRNA gene and its 500-kb upstream and downstream sequences (Akbari et al., 2018). Strongly selected loci were detected only in CEU and CHB. Third, we used the VCFtools program to calculate Tajima’s D values for each HS lncRNA gene in CEU, CHB, and YRI (Danecek et al., 2011). The calculation was performed using a 1500-bp non-overlapping sliding window because the lengths of these genes exceed 1500 bp. To generate the background reference for judging the significant increase or decrease of Tajima’s D for HS lncRNA genes in a population, we calculated Tajima’s D values across the whole genome using a sliding window of 1500 bp. As values of Tajima’s D were compared with the background reference, a significant D<0 and D>0 indicate positive (or directional) selection and balancing selection, respectively, instead of population demography dynamics (Tajima, 1989). Fourth, we used the integrated Fst to detect positive selection signals in HS lncRNA genes. The Fst of each HS lncRNA gene was computed for three comparisons, i.e., CEU-YRI, CHB-YRI, and CHB-CEU. Extreme Fst values of SNPs were detected in HS lncRNA genes in the comparisons of CEU-YRI and CHB-YRI. Since allele frequencies for different loci vary in a genome and genetic drift may have different effects at different loci, we used a sliding window of 1500 for comparing Fst values of HS lncRNA genes with the genome-wide background. Extreme Fst values indicate positive selection. Finally, we applied linkage disequilibrium (LD) analysis to each HS lncRNA gene. We computed the pairwise LD (r²) in CEU, CHB, and YRI for common SNPs (with minimal allele frequency (MAF) ≥ 0.05 in at least one population) in HS lncRNA genes and DBSs using the PLINK program (Purcell et al., 2007). Significantly increased LD was detected in SNPs in HS lncRNA genes in CEU and CHB. The LD patterns were portrayed using the Haploview program (Barrett et al., 2005).
Next, we used the above tests to detect positive selection signals in DBSs. First, the 100 bp grid size of the XP-CLR test also allowed the detection of selection signals in DBSs. Second, we performed the Tajima’s D and Fay-Wu’s H tests (Fay and Wu, 2000; Tajima, 1989). We calculated Tajima’s D values for each DBS in CEU, CHB, and YRI using the VCFtools program (Danecek et al., 2011), with a sliding window of 147 bp (the mean length of strong DBSs). To generate the background reference for judging the significant increase or decrease of Tajima’s D in a population, we calculated Tajima’s D values across the whole genome using the same sliding window. As values of Tajima’s D were compared with the background reference, a significant D<0 and D>0 indicate positive (or directional) selection and balancing selection, respectively. Fay-Wu’s H values were calculated similarly using the VariScan program (Vilella et al., 2005). Calculating Fay-Wu’s H demands the ancestral sequences as the outgroup. We extracted the ancestral sequences of DBSs from the human ancestral sequence, which was generated upon the EPO alignments of six primate species (human, chimpanzee, gorilla, orangutan, macaque, and marmoset) (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/). Third, we computed the Fst to measure the frequency differences of alleles in DBSs between populations. For the CEU-YRI, CHB-YRI, and CHB-CEU pairwise comparisons, we used the revised VCFtools program to compute the weighted Fst for all SNPs in each DBS (Weir and Cockerham, 1984). Fourth, we integrated the weighted Fst values in the three populations into an "integrated Fst" which indicated whether the DBS locus was under selection in a certain population (Nielsen and Slatkin, 2013). We used sliding windows of 147 bp for comparing Fst values of DBSs with the genome-wide background. We empirically denoted the values of integrated Fst in the upper 10% of scores across the entire genome as statistically significant. To determine positive selection more reliably, we used Tajima’s D and the integrated Fst to jointly determine if a DBS was under positive selection in a population. The thresholds that determined the upper 10% of Tajima’s D values across the entire genome in CEU, CHB, and YRI were -0.97, -0.96, and -0.97, respectively, and the threshold that determined the upper 10% of integrated Fst values across the entire genome in the three populations was 0.22. For example, a DBS was assumed to be under positive selection in CEU if (a) the DBS had a Tajima’s D<-0.97 in CEU and Tajima’s D>0.0 in the two other populations and (b) the DBS had an integrated Fst>0.22. Analogously, a DBS was assumed to be under positive selection in both CEU and CHB if the DBS had a Tajima’s D <-0.97 in CEU, <-0.96 in CHB, and >0.0 in YRI, and had an integrated Fst>0.22.
4.8. Functional enrichment analysis of genes
We used the g:Profiler program (with the parameter settings: Organism=Homo sapiens, Ordered query = No, Significance threshold = Benjamini-Hochberg FDR, User threshold = 0.05, 50 < terms size < 1000) and the Gene Ontology (GO) database to perform over-representation analysis (Raudvere et al., 2019). This analysis is used to determine what pre-defined gene sets (GO terms) are more present (over-represented) in a list of “interesting” genes than what would be expected by chance. The lists of genes included genes with strongest DBSs in humans, genes with weakest DBSs in humans, and genes with mostly changed DBSs in chimpanzees and three archaic humans. To choose genes with strong DBSs, we ranked genes in descending order based on binding affinity and chose the top 2000 genes. To choose genes with weak DBSs, we first ranked genes in ascending order based on binding affinity, then ranked genes in descending order based on the number of regulatory HS lncRNAs, and finally chose the top 2000 genes with binding affinity ≤40 and regulatory HS lncRNAs ≥5. We used the distance threshold of 0.034 to choose genes with mostly changed DBSs. Genes with mostly changed DBSs in chimpanzees include the 429 genes in which 2248 DBSs lack counterparts. 4248, 1256, 2514, and 134 genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals have DBSs with distance ≥0.034. To make the statistical significance comparable across chimpanzees, Altai Neanderthals, and Denisovans, we used the same number of genes (the top 1256 genes whose DBSs have the largest DBS distances) as the input to the g:Profiler program.
4.9. Analysing SNP frequencies in human populations
The frequency of common SNPs (MAF ≥0.05) in DBSs in the three modern human populations was computed using the VCFtools package (Danecek et al., 2011). The ancestral/derived states of SNPs were inferred upon the human ancestor sequences and were used to determine derived allele frequency.
4.10. Analysing the cis-effect of SNPs in DBSs on target gene expression
SNPs with MAF >0.1 in DBSs in any of the three modern human populations and absolute values of cis-effect size >0.5 (FDR <0.05) in any of the GTEx tissues were examined for an influence on the expression of the target genes (GTEx Consortium, 2017). SNPs that are eQTLs in the GTEx tissues and have biased derived allele frequencies (DAF) in the three modern human populations were examined to estimate whether the eQTL is population-specific.
4.11. Examining the tissue-specific impact of HS lncRNA-regulated gene expression
First, we examined the expression of HS lncRNA genes across the GTEx tissues. HS lncRNA genes with a median TPM value >0.1 in a tissue were considered robustly expressed in that tissue. Upon this criterion, 40 HS lncRNA genes were expressed in at least one tissue and were used to examine the impact of HS lncRNA regulation on gene and transcript expression (other HS lncRNAs may function in the cytoplasm). Since an HS lncRNA gene may have multiple transcripts, we adopted the transcript, which contains the predicted DNA binding domain (DBD) and has the largest TPM value as the representative transcript of the HS lncRNA. We calculated the pairwise Spearman’s correlation coefficient between the expression of an HS lncRNA (the representative transcript) and the expression of each of its target transcripts using the scipy.stats.spearmanr program in the scipy package. The expression of an HS lncRNA gene and a target transcript was considered to be significantly correlated if the |Spearman’s rho| > 0.3, with Benjamini-Hochberg FDR <0.05. We examined the percentage distribution of correlated HS lncRNA-target transcript pairs across GTEx tissues and organs (Figure 3A).
We examined all GTEx tissues to determine in which tissues HS lncRNA-regulated gene expression may have undergone significant changes from archaic to modern humans. If a pair of HS lncRNA and target transcript are robustly expressed in a tissue and their expression shows a significant correlation (|Spearman’s rho| >0.3, with Benjamini-Hochberg FDR <0.05) in the tissue, we computed the sequence distance of the HS lncRNA’s DBS in the transcript from the three archaic humans to modern humans. We compared the sequence distances of all DBSs in each tissue with the sequence distances of all DBSs in all GTEx tissues (i.e. the distribution of all DBS distances was taken as the background). A one-sided two-sample Kolmogorov-Smirnov test was used to examine whether the sequence distances of all DBSs in a specific tissue deviate from the background distribution (which reflects the "neutral evolution" of gene expression). For each tissue, if the test result was Benjamini-Hochberg FDR < 0.001, the tissue was considered to have significantly changed gene expression regulated by the HS lncRNA. We used different colors to mark tissues with significantly changed gene expression regulation since Altai Neanderthals, Denisovans, and Vindija Neanderthals, and used "D", "A.D.", and "ADV" to indicate changes since Denisovans, since Altai Neanderthals and Denisovans, and since Altai Neanderthals, Denisovans, and Vindija Neanderthals, respectively (Figure 3B).
4.12. Examining enrichment of favored and hitchhiking mutations in DBSs
Using the deep learning network DeepFavored we developed that integrates multiple statistical tests for identifying favored mutations (Tang et al., 2022), we identified 13339 favored mutations and 244098 hitchhiking mutations in 17 human populations (Tang et al., 2023). In this study, we classified DBSs in two ways - into strong ones (affinity>60) and weak ones (36<affinity<60), and into old ones (Human-Chimp distance>0.034 AND Human-Altai Neanderthals distance==0), young ones (Human-Altai Neanderthals distance>0.034 OR Human-Denisovan distance>0.034), and others. We then examined the numbers of favored and hitchhiking mutations in each class of DBSs. The weak young DBSs have the largest proportion of favored and hitchhiking mutations.
4.13. Identifying and analyzing transcriptional regulatory modules
Most clustering algorithms classify genes into disjoint modules based on expression correlation without considering regulatory relationships (Saelens et al., 2018). The GRAM program identifies gene regulatory modules based on correlation and TF-TFBS binding (Bar-Joseph et al., 2003). LncRNAs transcriptionally regulate genes based on lncRNA-DBS binding and correlate gene expression. We developed the eGRAM program to identify gene modules based on correlated expression, TF-TFBS interaction, and lncRNA-DBS interaction. In this study, we used eGRAM to identify gene modules in the same regions in the human and macaque brain and enriched KEGG pathways using reported RNS-seq datasets (n=101 and 83 for frontal cortex and n=22 and 25 for anterior cingulate cortex) (GTEx Consortium, 2017; Zhu et al., 2018). The default parameters of DBS affinity=60, Pearson correlation=0.5, module size=50, and FDR=0.01 (hypergeometric distribution test) were used. The key steps of program are as follows. (a1) Identify each lncRNA’s correlated lncRNAs, which may form a set of co-regulators. (a2, optional) Identify each TF’s correlated TFs, which may form a set of co-regulators. (b1) Compute the correlation between each lncRNA and all genes. (b2, optional) Compute the correlation between each TF and all genes. (c1) Identify each lncRNA’s target genes. (c2, optional) Identify each TF’s target genes. (d1) Identify each lncRNA set’s target module upon correlation and targeting relationships. (d2, optional) Identify each TF set’s target module upon correlation and targeting relationships. (e) Check whether TFs’ modules contain lncRNAs’ targets and whether lncRNAs’ modules contain TFs’ targets, which reveal genes co-regulated by TFs and lncRNAs and genes independently regulated by TFs and lncRNAs. (f) Performs pathway enrichment analysis for all modules.
Data and code availability
GENCODE human lncRNAs are available at the website (https://www.gencodegenes.org/human/). Human lncRNAs’ orthologues in 16 mammals and the LongTarget program are available at the LongTarget/LongMan website (http://www.gaemons.net/). Other programs and tests are open-source and freely available. The RNA-seq data of the three HS lncRNAs before and after the CRISPR/Cas9 experiments have been deposited in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo) under the accession number GSE213231. The RNA-seq data of the four wrongly transcribed noncoding sequences before and after the CRISPR/Cas9 experiments have been deposited in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo) (the accession number is to be obtained). All other data are in Supplementary Tables. The data of favored and hitchhiking mutations are available from the PopTradeOff database or from the authors upon request. The eGRAM code is available on the GitHub website (https://github.com/LinjieCodes/eGRAMv2R1).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (31771456 to H.Z.) and the China Postdoctoral Science Foundation (2020M682788 to J.L.).
Additional information
Author contributions
H.Z. designed the study and drafted the manuscript. J.L. performed most data analyses. Y.W. and J.T. performed XP-CLR and iSAFE analyses. All authors have read the manuscript and consent to its publication.
Additional files
References
- An integrated map of genetic variation from 1,092 human genomesNature 491:56–65Google Scholar
- Comprehensive survey and geometric classification of base triples in RNA structuresNucleic Acids Res 40:1407–1423Google Scholar
- Primate cell fusion disentangles gene regulatory divergence in neurodevelopmentNature 592:421–427Google Scholar
- Identifying the favored mutation in a positive selective sweepNat Methods 15:279–282Google Scholar
- Evolution by gene lossNat Rev Genet 17:379–391Google Scholar
- Human evolution. Evolution of early Homo: an integrated biological perspectiveScience 345:1236828Google Scholar
- Computational discovery of gene modules and regulatory networksNat Biotechnol 21:1337–1342Google Scholar
- Haploview: analysis and visualization of LD and haplotype mapsBioinformatics 21:263–265Google Scholar
- Comparative transcriptomics in human and mouseNat Rev Genet 18:425–440Google Scholar
- Population differentiation as a test for selective sweepsGenome Res 20:393–402Google Scholar
- Initial sequence of the chimpanzee genome and comparison with the human genomeNature 437:69–87Google Scholar
- Comment on "Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens" and "Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans"Science 313:172Google Scholar
- The variant call format and VCFtoolsBioinformatics 27:2156–2158Google Scholar
- The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expressionGenome Res 22:1775–1789Google Scholar
- Genome-Wide Identification of Regulatory Sequences Undergoing Accelerated Evolution in the Human GenomeMol Biol Evol 33:2565–2575Google Scholar
- Microcephalin, a gene regulating brain size, continues to evolve adaptively in humansScience 309:1717–1720Google Scholar
- Hitchhiking under positive Darwinian selectionGenetics 155:1405–1413Google Scholar
- Human-Specific NOTCH2NL Genes Affect Notch Signaling and Cortical NeurogenesisCell 173:1356–1369Google Scholar
- Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortexeLife 7Google Scholar
- Genetic effects on gene expression across human tissuesNature 550:204–213Google Scholar
- LongTarget: a tool to predict lncRNA DNA-binding motifs and binding sites via Hoogsteen base-pairing analysisBioinformatics 31:178–186Google Scholar
- DNA Methylation: Insights into Human EvolutionPLoS Genet 11:e1005661Google Scholar
- Conserved cell types with divergent features in human versus mouse cortexNature 573:61–68Google Scholar
- The rewiring of transcription circuits in evolutionCurr Opin Genet Dev 47:121–127Google Scholar
- Aspm knockout ferret reveals an evolutionary mechanism governing cerebral cortical sizeNature 556:370–375Google Scholar
- Origins, evolution, and phenotypic impact of new genesGenome Res 20:1313–1326Google Scholar
- MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and UsabilityMolecular Biology and Evolution 30:772–780Google Scholar
- Three-dimensional genome rewiring in loci with human accelerated regionsScience 380:eabm1696Google Scholar
- Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotypeNat Biotechnol 37:907–915Google Scholar
- Evolution of transcription factor binding through sequence variations and turnover of binding sitesGenome Res 32:1099–1111Google Scholar
- The UCSC genome browser and associated toolsBrief Bioinform 14:144–161Google Scholar
- MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger DatasetsMolecular Biology and Evolution 33:1870–1874Google Scholar
- Lessons from X-chromosome inactivation: long ncRNA as guides and tethers to the epigenomeGenes Dev 23:1831–1842Google Scholar
- Human Accelerated Regions and Other Human-Specific Sequence Variations in the Context of Evolution and Their Relevance for Brain DevelopmentGenome Biol Evol 10:166–188Google Scholar
- The Sequence Alignment/Map format and SAMtoolsBioinformatics 25:2078–2079Google Scholar
- Integrative functional genomic analysis of human brain development and neuropsychiatric risksScience 362Google Scholar
- Pipelines for cross-species and genome-wide prediction of long noncoding RNA bindingNat Protoc 14:795–818Google Scholar
- Human brain evolution: Emerging roles for regulatory DNA and RNACurr Opin Neurobiol 71:170–177Google Scholar
- Adaptive sequence divergence forged new neurodevelopmental enhancers in humansCell 185:4587–4603Google Scholar
- Impacts of Neanderthal-Introgressed Sequences on the Landscape of Human Gene ExpressionCell 168:916Google Scholar
- Human-specific loss of regulatory DNA and the evolution of human-specific traitsNature 471:216–219Google Scholar
- Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiensScience 309:1720–1722Google Scholar
- A High-Coverage Genome Sequence from an Archaic Denisovan IndividualScience 338:222–226Google Scholar
- MEG3 long noncoding RNA regulates the TGF-beta pathway genes through formation of RNA-DNA triplex structuresNat Commun 6:7743Google Scholar
- Annotating functional RNAs in genomes using InfernalMethods Mol Biol 1097:163–197Google Scholar
- Infernal 1.1: 100-fold faster RNA homology searchesBioinformatics 29:2933–2935Google Scholar
- Infernal 1.0: inference of RNA alignmentsBioinformatics 25:1335–1337Google Scholar
- An Introduction to Population GeneticsSinauer Associates Inc Publishers Google Scholar
- Measuring transcription factor-binding site turnover: a maximum likelihood approach using phylogeniesGenome Biol Evol 1:85–98Google Scholar
- StringTie enables improved reconstruction of a transcriptome from RNA-seq readsNat Biotechnol 33:290–295Google Scholar
- Human TKTL1 implies greater neurogenesis in frontal neocortex of modern humans than NeanderthalsScience 377:eabl6422Google Scholar
- Establishing Cerebral Organoids as Models of Human-Specific Brain EvolutionCell 176:743–756Google Scholar
- Human-specific genetics: new tools to explore the molecular and cellular basis of human evolutionNat Rev Genet :1–25Google Scholar
- The primitive brain of early HomoScience 372:165–171Google Scholar
- Accelerated evolution of conserved noncoding sequences in humansScience 314:786Google Scholar
- Emerging principles of regulatory evolutionProc Natl Acad Sci U S A 104:8605–8612Google Scholar
- A high-coverage Neandertal genome from Vindija Cave in CroatiaScience 358:655–658Google Scholar
- The bonobo genome compared with the chimpanzee and human genomesNature 486:527–531Google Scholar
- The complete genome sequence of a Neanderthal from the Altai MountainsNature 505:43–49Google Scholar
- PLINK: a tool set for whole-genome association and population-based linkage analysesAm J Hum Genet 81:559–575Google Scholar
- g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update)Nucleic Acids Res 47:W191–W198Google Scholar
- Evolutionary genomics. Evolutionary changes in promoter and enhancer activity during human corticogenesisScience 347:1155–1159Google Scholar
- Integrative analysis of 111 reference human epigenomesNature 518:317–330Google Scholar
- edgeR: a Bioconductor package for differential expression analysis of digital gene expression dataBioinformatics 26:139–140Google Scholar
- A comprehensive evaluation of module detection methods for gene expression dataNat Commun 9:1090Google Scholar
- Developmental dynamics of lncRNAs across mammalian organs and speciesNature 571:510–514Google Scholar
- Linkage disequilibrium--understanding the evolutionary past and mapping the medical futureNat Rev Genet 9:477–485Google Scholar
- Genetic studies of human-chimpanzee divergence using stem cell fusionsProc Natl Acad Sci U S A 118Google Scholar
- Human-Specific NOTCH2NL Genes Expand Cortical Neurogenesis through Delta/Notch RegulationCell 173:1370–1384Google Scholar
- Statistical method for testing the neutral mutation hypothesis by DNA polymorphismGenetics 123:585–595Google Scholar
- Uncovering the extensive trade-off between adaptive evolution and disease susceptibilityCell Rep 40:111351Google Scholar
- PopTradeOff: A database for exploring population-specificity of adaptive evolution, disease susceptibility, and drug responsivenessComput Struct Biotechnol J 21:3443–3451Google Scholar
- Comment on papers by Evans et al. and Mekel-Bobrov et al. on Evidence for Positive Selection of MCPH1 and ASPMScience 317:1036Google Scholar
- VariScan: Analysis of evolutionary patterns from large-scale DNA sequence polymorphism dataBioinformatics 21:2791–2793Google Scholar
- Estimating F-Statistics for the Analysis of Population StructureEvolution 38:1358–1370Google Scholar
- Fasim-LongTarget enables fast and accurate genome-wide lncRNA/DNA binding predictionComput Struct Biotechnol J 20:3347–3350Google Scholar
- The long noncoding RNAs NEAT1 and MALAT1 bind active chromatin sitesMol Cell 55:791–802Google Scholar
- Enhancer Function and Evolutionary Roles of Human Accelerated RegionsAnnu Rev Genet 56:423–439Google Scholar
- Human-specific features of spatial gene expression and regulation in eight brain regionsGenome Res 28:1097–1110Google Scholar
- Comment on "Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens"Science 316:370Google Scholar
- A comparative encyclopedia of DNA elements in the mouse genomeNature 515:355–364Google Scholar
- Transcription factor binding sites are frequently under accelerated evolution in primatesNat Commun 14:783Google Scholar
- Spatiotemporal transcriptomic divergence across human and macaque brain developmentScience 362Google Scholar
Article and author information
Author information
Version history
- Sent for peer review:
- Preprint posted:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
- Reviewed Preprint version 3:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.89001. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Lin et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 1,111
- downloads
- 37
- citations
- 4
Views, downloads and citations are aggregated across all versions of this paper published by eLife.