1. Introduction

The limited genomic but substantial phenotypic and behavioral differences between humans and other hominids make the question of which sequence changes have critically driven human evolution an enduring puzzle (Anton et al., 2014; Pollen et al., 2023). Sequence changes include the birth and loss of genes, as well as the turnover and modifications of regulatory sequences (Albalat and Canestro, 2016; Kaessmann, 2010; Prud’homme et al., 2007). Waves of studies have identified genes that promote human brain development (e.g., NOTCH2NL and ASPM) (Evans et al., 2005; Fiddes et al., 2018; Mekel-Bobrov et al., 2005; Pinson et al., 2022; Suzuki et al., 2018). However, many conclusions from the gene birth-centric studies are controversial (Currat et al., 2006; Timpson et al., 2007; Yu et al., 2007). Since genes essential for brain development may not necessarily regulate other human traits (some of which preceded brain enlargement), these new brain-related genes do not fully answer the above question. Moreover, investigations into how species-specific lncRNAs, a major class of new genes in mammals, contribute to human evolution remain limited. Studies revealed regulatory sequence changes important for human evolution, including new sequences (Liu et al., 2021), lost sequences (McLean et al., 2011), human accelerated regions (HARs) (Dong et al., 2016; Mangan et al., 2022; Prabhakar et al., 2006; Whalen and Pollard, 2022), and turnover of transcription factor (TF) binding sites (Krieger et al., 2022; Otto et al., 2009; Zhang et al., 2023). Nevertheless, no systematic analysis of lncRNA binding sites, which also regulate transcription, has been reported.

Experimental studies over the past decades have generated substantial findings about lncRNAs and epigenetic regulation. (a) About one-third of human lncRNAs are primate-specific (Derrien et al., 2012). (b) Species-specific lncRNAs exhibit distinct expression in tissues and organs of different species (Sarropoulos et al., 2019). (c) Many lncRNAs can bind to DNA sequences by forming RNA:DNA triplexes (Abu Almakarem et al., 2012), thereby recruiting histone and DNA modification enzymes to binding sites and regulating transcription (Lee, 2009; Roadmap Epigenomics et al., 2015). (d) Approximately 40% of differentially expressed human genes result from interspecies epigenetic differences (Hernando-Herraez et al., 2015). Thus, not only human-specific (HS) TFs and their DNA binding sites (TF DBS), but also HS lncRNAs and their DNA binding sites (lncRNA DBS), critically regulate gene expression human-specifically.

To examine the latter, which have had few reports, the first step is to identify HS lncRNAs. Unlike DNA sequence search, RNA sequence search for identifying homologous and orthologous lncRNA genes needs combined structure and sequence alignment (Nawrocki, 2014; Nawrocki and Eddy, 2013), and human lncRNAs’ orthologues in other mammals determine HS lncRNAs. Many lncRNAs contain DNA binding domains (DBDs) that can bind to DBSs. Since RNA:DNA triplexes follow specific base-pairing rules (Abu Almakarem et al., 2012), these rules allow computational prediction of DBDs and DBSs, and accordingly, target genes and transcripts (Lin et al., 2019; Wen et al., 2022). Gene expression in organs and tissues, especially those examined by the GTEx project (GTEx Consortium, 2017), provides data to investigate whether the expression of lncRNAs and their targets is correlated, supporting their regulatory relationship. Finally, the sequenced genomes of modern humans, archaic humans, and multiple apes allow cross-species analysis of the genomes and transcriptomes (1000 Genomes Project Consortium et al., 2012; Chimpanzee Sequencing and Analysis Consortium, 2005; Meyer et al., 2012; Prufer et al., 2017; Prüfer et al., 2013). Thus, it is feasible to examine whether and how HS lncRNAs and their DBSs have contributed to human evolution by human-specifically regulating gene expression.

This study identified HS lncRNAs upon the orthologues of the GENCODE-annotated human lncRNA genes in 16 mammalian genomes, predicted HS lncRNAs’ DBDs and DBSs in the genomes of modern humans (CEU, CHB, and YRI), archaic humans (Altai Neanderthals, Denisovans, Vindija Neanderthals), as well as counterparts of DBSs in the genome of chimpanzees (Figure 1A). Joint prediction of DBDs and DBSs (which form RNA:DNA triplexes) using local alignment is more reasonable than (and outperforms) DBS prediction based on base-pairing rules (Lin et al., 2019). We validated the DBS prediction by knocking out (KO) predicted DBDs using CRISPR/Cas9, performing RNA-seq before and after KO, and examining whether the differentially expressed genes were predicted target genes. To accurately analyze DBSs with different lengths and identity values (the percentage of paired RNA and DNA nucleotides in a triplex), we defined binding affinity (length x identity) and classified DBSs into strong and weak ones based on binding affinity and also old and young ones based on when most of their sequence changes occurred. We then utilized multiple datasets and methods to perform cross-population and cross-species analyses, including gene set enrichment analyses, evolutionary/population genetics analyses, and transcriptomic analyses. The results suggest that HS lncRNAs and their DBSs have distinctly and continuously reshaped gene expression during human evolution, and this reshaped gene expression greatly determines human traits, from the rapidly evolved brain to the highly modified immune and metabolism systems.

Study Overview.

(A) The relationships between chimpanzees, the three archaic humans (Altai Neanderthals, Denisovans, and Vindija Neanderthals), and the three modern human populations. Dashed lines indicate the phylogenetic distances from modern humans to archaic humans and chimpanzees. Genes and DBSs have the following implications: the DBS in B2M lacks a counterpart in chimpanzees; the DBS in ABL2 has great differences between archaic and modern humans; the DBS in IRNAR1 is highly polymorphic in modern humans. Red letters in DBSs indicate tissue-specific eQTLs or population-specific favored mutations. (B) The mean length and binding affinity of strong DBSs. (C) Number of target genes and transcripts of the 66 HS lncRNAs. (D) The illustrative figure shows the targeting relationships between HS lncRNAs (see the full figure in Supplementary Note 2). (E) The sequence distances of DBSs (the top 40%) from modern humans to chimpanzees and archaic humans. (F) The illustrative figure shows the GTEx tissues used to investigate the impacts of HS lncRNA-target transcript pairs on gene expression (see the full figure in Figure 3).

2. Results

2.1. HS lncRNAs regulate diverse genes and transcripts

How many human lncRNAs are human-specific is a question without a precise estimate. After searching for orthologues of the 13562 GENCODE-annotated human lncRNA genes (v18) in 16 mammalian genomes and the genomes of archaic humans using the Infernal program (which performs a combined RNA sequence and structure alignment) (Figure 1A) (Lin et al., 2019; Nawrocki, 2014; Nawrocki et al., 2009), we identified 66 lncRNA genes that exist in humans but not in any other species (Supplementary Table 1). Then, we predicted DBSs of the 66 HS lncRNAs in the 5000 bp promoter regions of the 179128 Ensembl-annotated transcripts (release 79) using the LongTarget program (Supplementary Note 1) (Lin et al., 2019). Since DBD1’s DBSs vastly outnumber other DBDs’ DBSs, suggesting that DBD1 (hereafter simply called DBD) and its DBSs are more reliable, for each HS lncRNA, we analysed only DBD1 and its DBSs. We defined the binding affinity of a DBS as the product of its length and identity (the percentage of paired RNA and DNA nucleotides), and focused on DBSs with binding affinity>=60 (strong DBSs). The analysis of DBSs at known imprinted genes and the comparison of DBSs with CHART-seq-reported binding sequences (“peaks”) suggest that binding affinity characterizes DBSs better than length (Supplementary Fig. 2) (He et al., 2015). Since DBSs generated more recently may be weak yet play new and/or subtle functions, we also analyzed DBSs with 36<binding affinity<60 (weak DBSs). 105141 strong DBSs (mean length>147 bp) were identified in 96789 transcripts of 23396 genes, and 152836 weak DBSs were identified in 127898 transcripts of 33185 genes (Figure 1B; Supplementary Table 2, 3). Several HS lncRNAs (especially RP11-423H2.3) have abundant targets; many targets have DBSs of multiple HS lncRNAs; many HS lncRNAs have multiple DBSs in a target; only about 1.5% of target genes (0.6% of target transcripts) are human-specific (Figure 1C); and HS lncRNAs themselves show complex targeting relationships (Figure 1D; Supplementary Note 2). These suggest that the limited number of HS lncRNAs has significantly reshaped the expression of substantial genes.

We validated DBS prediction using multiple methods and datasets (Supplementary Note 1; Supplementary Fig. 1-5). (a) HS lncRNA DBSs overlap well with experimentally identified DNA methylation and histone modification signals in multiple cell lines (https://genome.UCSC.edu). (b) Many DBSs co-localize with annotated cis-regulatory elements (cCREs) in genes’ promoter regions (https://genome.UCSC.edu). (c) Predicted DBSs overlap well with experimentally detected DBSs of NEAT1, MALAT1, and MEG3 (Mondal et al., 2015; West et al., 2014). (d) We used CRISPR/Cas9 to knock out predicted DBDs (100-200 bp) in several lncRNAs across multiple cell lines, performed RNA-seq before and after DBD knockout, and analyzed the resulting differential gene expression. We found that the |fold change| of target genes was significantly larger than the |fold change| of non-target genes (one-sided Mann-Whitney test, p = 3.1e-72, 1.49e-114, 1.12e-206, 2.58e-09, 6.49e-41, 0.034, and 5.23e-07, respectively). (e) The LongTarget program uses a variant of the Smith-Waterman local alignment to identify DBD and DBS simultaneously; statistically, a local alignment (i.e., a triplex containing a pair of DBD and DBS) of 147 bp (the mean length of strong DBSs) is extremely unlikely to be generated by chance (p < 8.2e-19 to 1.5e-48).

HARs are conserved genomic loci yet show accelerated evolution in the human lineage. Multiple researchers postulate that HARs critically promote human evolution; however, results from the related studies are controversial (Levchenko et al., 2018). The Zoonomia Project recently identified 312 HARs important for 3D genome rewiring and neurodevelopment (Keough et al., 2023). We found that only 8 of these HARs overlap 26 DBSs of 14 HS lncRNAs, suggesting that DBSs and HARs are different kinds of sequences.

2.2. HS lncRNAs’ target genes characterize human evolution and traits

AS strong and weak DBSs may occur in different periods in human evolution, it is interesting and necessary to examine whether the related target genes are enriched in different functions. For genes with significant sequence differences between humans and chimpanzees, studies have generated different estimates, ranging from 1.24% to 5% (Britten, 2002; Ebersberger et al., 2002). More genes exhibit expression differences due to variations in regulatory sequences. To reliably differentiate genes with strong and weak DBSs, we sorted target genes by their DBS binding affinity and chose the top 2000 genes (DBS length>252 bp and binding affinity>151) and bottom 2000 genes (DBS length<60 bp but binding affinity>36) to conduct over-representation analysis using the g:Profiler program and Gene Ontology (GO) database (Supplementary Table 4). The top and bottom 2000 genes (with the whole genome as the background) are co-enriched in GO terms important for human evolution, including “neuron projection development” (GO:0031175), “learning” (GO:0007612), “regulation of anatomical structure size” (GO: GO:0090066), “head development” (GO:0060322), and “regulation of locomotion” (GO:0040012) (Supplementary Note 3; Supplementary Table 4). Genes with strongest DBSs (514 DBSs have length >500 bp) include IFNAR1 and NFATC1 (important for immune function), KIF21B and NTSR1 (critical for neural development), SLC2A11 and SLC2A1 (involved in glucose usage), BAIAP3 (a brain-specific angiogenesis inhibitor), TAS1R3 (a receptor for the sweet taste response), and several primate-specific noncoding RNA genes (e.g., CTD-3224I3.3 which is highly expressed in the cerebellum, lung, and testis) (Table 1; Supplementary Table 4, 5). On the other hand, genes with weak DBSs are enriched in such GO terms as “response to alcohol” (GO:0097305), “response to temperature stimulus” (GO:0009266), “DNA methylation or demethylation” (GO:0044728), and “female pregnancy” (GO:0007565) (Supplementary Note 3). Since weak DBSs are more likely to have occurred recently, these findings suggest that some regulatory functions of HS lncRNAs were obtained recently.

Genes with DBSs that have largest binding affinity and mostly changed sequence distances (from modern humans to archaic humans and chimpanzees)

2.3. The reshaping of gene expression has greatly evolved

The human genome is approximately 99%, 99%, and 98% identical to the genomes of chimpanzees, bonobos, and gorillas (Chimpanzee Sequencing and Analysis Consortium, 2005; Prufer et al., 2012). However, the sequence variants that critically differentiate these species remain unclear. We therefore examined whether DBSs of HS lncRNAs are conserved in these species or are human-specific. Notably, 97.81% of the 105141 strong DBSs have counterparts in chimpanzees, raising the possibility that these sequences have evolved considerably during human evolution.

HS lncRNAs may be generated before, with, or after their DBSs. To further clarify these situations, we identified the counterparts of HS lncRNAs and their DBSs also in Altai Neanderthals, Denisovans, and Vindija Neanderthals (Meyer et al., 2012; Prufer et al., 2017; Prüfer et al., 2013). Counterparts of HS lncRNAs and strong DBSs were identified in these archaic humans, but the distances (distance per base) of HS lncRNAs are smaller than the distances of DBSs between archaic and modern humans, suggesting that the counterpart sequences of many DBSs existed before HS lncRNAs. In other words, these distances suggest that HS lncRNAs may have reshaped gene expression substantially in humans.

We computed DBS distances in two ways to examine DBS evolution. The first is from the reconstructed human ancestor (downloaded from the EBI website) to chimpanzees, archaic humans, and modern humans; the second is from chimpanzees and archaic humans to modern humans. In the first set of distances, many DBS distances from the human ancestor to modern humans are smaller than those to archaic humans. For example, DBSs in the top 20% genes in the human ancestor have distances >0.015 to modern humans, but more DBSs have distances >0.015 to Altai Neanderthals, Denisovans, and even chimpanzees (Supplementary Note 3). This anomaly is caused by the reconstructed human ancestor, which is built using six primates without archaic humans. On the other hand, the second set of distances agrees well with the phylogenetic distances between chimpanzees, archaic humans, and modern humans. DBSs with large distances account for a small portion (Figure 1E); we postulate that genes containing these DBSs may greatly characterize human evolution. To set a cutoff for “large distance”, we turned to a recent study that identified 5984 genes differentially expressed between human-only and chimpanzee-only iPSC lines, including 4377 genes whose differential expression is due to trans-acting differences between humans and chimpanzees (Song et al., 2021). A distance cutoff of 0.034 identifies 4248 (the top 20%) genes in chimpanzees, as well as 1256, 2514, and 134 genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals, respectively. Since Prufer et al. reported that “it has been suggested that Denisovans received gene flow from a hominin lineage that diverged prior to the common ancestor of modern humans, Neandertals, and Denisovans” and that “we find that the Denisovan genome carries fewer derived alleles that are fixed in Africans, and thus tend to be older, than the Altai Neandertal genome” (Prufer et al., 2017), these gene numbers agree with the phylogenetic relationships between archaic and modern humans (Figure 1A; Supplementary Table 6).

To determine whether target genes with mostly changed DBSs are enriched in specific functions, we applied overrepresentation analysis to target genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals using the g:Profiler program and GO database. To make statistical significance comparable across chimpanzees, Altai Neanderthals, and Denisovans, we used just the top 1256 genes as the input for the three. The 1256 genes in chimpanzees and Altai Neanderthals are enriched in multiple common GO terms. However, “behavior” (GO:0007610) and “cognition” (GO:0050890) have much higher significance in Altai Neanderthals than in chimpanzees (Figure 2), and some GO terms such as “learning” (GO:0007612), “learning or memory” (GO:0007611), and “glucose homeostasis” (GO:0042593) are enriched only in Altai Neanderthals. Agreeing with the closer relationship between Vindija Neanderthals and modern humans, the 134 genes in Vindija Neanderthals are enriched only in two GO terms with low significance (“smoothened signaling pathway” (GO:0007224) and “regulation of smoothened signaling pathway” (GO:0008589)) (Supplementary Table 7). These results provide valuable insights into when and how HS lncRNA-mediated epigenetic regulation impacts human evolution.

1256 target genes whose DBSs have the largest distances from modern humans to chimpanzees and Altai Neanderthals are enriched in different Biological Processes GO terms.

Upon significance threshold = 0.05 (Benjamini-Hochberg FDR), the two gene sets in chimpanzees and Altai Neanderthals are enriched in 199 and 409 GO terms (50<terms size<1000), respectively. Shared GO terms are color-marked. Left: Top GO terms enriched in genes in chimpanzees. Right: Top GO terms enriched in genes in Altai Neanderthals.

2.4. Epigenetic regulation by HS lncRNAs shows differences in modern humans

To check whether the reshaped gene expression has been population-specifically revised in modern humans, we analyzed single-nucleotide polymorphisms (SNPs, with minimal allele frequency (MAF)≥0.1) in strong DBSs using the 1000 Human Genome data (1000 Genomes Project Consortium et al., 2012). We computed SNP number and SNP number per base for each DBS, identified highly polymorphic DBSs (SNP number per base in CEU, CHB, and YRI are 5 times larger than the average), and obtained the intersection of highly polymorphic DBSs and DBSs with mostly changed sequence distances (distance>0.034, from modern humans to chimpanzees, Altai Neanderthals, and Denisovans) (Supplementary Table 8). Most genes in the intersection are lncRNA genes, and protein-coding genes include those important for human evolution and adaptation (especially metabolism and immune), such as SCTR, TLR, NCR, IFNAR1, NFATC1, TAS1R3, INS, ST3GAL4, and FN3KRP. The rich SNPs, in addition to large sequence distances and Tajima’s D, indicate DBS differences in modern humans. Genes involved in sugar metabolism are especially notable. A key feature of modern human diets is high sugar intake, which can cause non-enzymatic oxidation of proteins (i.e., glycation) and deactivate protein functions. Among proteins encoded by genes in this intersection, TAS1R3 recognizes diverse natural and synthetic sweeteners, Insulin decreases blood glucose concentration, ST3GAL4 regulates protein glycosylation, and FN3KRP deglycates proteins. These DBSs and target genes indicate continuous evolution of transcriptional regulation with human and population specificity.

To examine whether these SNPS are neutral or indicate selection signals, we next used multiple statistical tests (with widely adopted parameters), including XP-CLR (Chen et al., 2010), iSAFE (Akbari et al., 2018), Tajima’s D (Tajima, 1989), Fay-Wu’s H (Fay and Wu, 2000), the fixation index (Fst) (Weir and Cockerham, 1984), and linkage disequilibrium (LD) (Slatkin, 2008), to detect selection signals in HS lncRNAs and strong DBSs in CEU, CHB, and YRI (the three representative populations of Europeans, Asians, and Africans) (Supplementary Note 4, 5; Supplementary Table 9, 10, 11). Selection signals were detected in several HS lncRNA genes in CEU and CHB but not in YRI, as well as in more DBSs in CEU and CHB than in YRI. These results agree well with findings reported in previous studies, including that fewer selection signals are detected in YRI (Sabeti et al., 2007; Voight et al., 2006). Selection signals in considerable DBSs were detected by multiple tests, indicating the reliability of the analysis. DBSs in the two pigmentation-related genes MC1R and MFSD12 and the two odor reception-related genes OR6C1 and TAS1R3 are notable examples. The dense SNPs, population-specific selection signals, and specific functions of these genes together indicate the ongoing reshaping of gene expression in modern humans.

2.5. Young weak DBSs may have greatly promoted recent human evolution

To further examine whether the reshaping has been revised recently, we classified strong and weak DBSs with mostly changed sequence distances into “old” ones, “young” ones, and “others”. DBSs with “Human-Chimp distance>0.034 AND Human-Altai Neanderthals distance=0” were defined as old ones, and DBSs with “Human-Altai Neanderthals distance>0.034 OR Human-Denisovan distance>0.034” were defined as young ones, which demark early and late sequence changes in human evolution. As we identified favored and hitchhiking mutations in 17 human populations (Tang et al., 2022; Tang et al., 2023), we further classified DBSs into six classes - strong old, strong young, strong others, weak old, weak young, weak others, and examined the numbers of favored and hitchhiking mutations in each class (Supplementary Table 16). Favored and hitchhiking mutations are most enriched in the weak young class; we interpret that these DBSs may have considerably promoted recent human evolution.

2.6. SNPs in DBSs have cis-effects on gene expression

To verify the above genomic and population genetics analysis, we further examined SNPs in DBSs and their impact on gene expression using transcriptomic data. Based on the Genotype-Tissue Expression (GTEx) data (GTEx Consortium, 2017), we identified expression quantitative trait loci (eQTLs) in DBSs using the widely adopted criteria of MAF≥0.1 and cis-effect size (ES)≥0.5. 1727 SNPs, with MAF ≥0.1 in at least one population and absolute values of ES≥0.5 in at least one tissue, were identified in DBSs in autosomal genes (Supplementary Table 12,13). These eQTLs include 372 “conserved” ones (i.e., also in DBSs in the three archaic humans) and 1020 “novel” ones (i.e., only in DBSs in modern humans). Many conserved eQTLs are in brain tissues and have high derived allele frequencies (DAFs) in all three modern populations. A notable eQTL with high DAFs in all three populations and positive ES≥0.5 in 44 out of 48 GTEx tissues is rs2246577 in the DBS in FN3KRP, whose encoded protein deglycates proteins to restore their function (Table 2). In contrast, many novel eQTLs are in a particular tissue and have population-specific DAFs (Supplementary Note 6).

Genes with most polymorphic DBSs and DBSs with mostly changed sequence distances from modern humans to archaic humans and chimpanzees

Next, we performed two analyses to examine how eQTLs in DBSs influence gene expression. First, we computed the expression correlation between HS lncRNAs and target transcripts with DBSs having eQTLs in specific tissues. Most (94%) HS lncRNA-target transcript pairs showed significant expression correlation in tissues in which the eQTLs were identified (|Spearman’s rho| >0.3 and FDR <0.05). Second, we examined whether eQTLs are more enriched in DBSs than in Ensembl-annotated promoters by computing and comparing the density of eQTLs in the two classes of regions. In the comparison, we used promoters as the reference (background) because they contain DBSs. The results indicate that eQTLs are more enriched in DBSs than in promoters (one-sided Mann-Whitney test, p=0.0) (Supplementary Note 6). Thus, population-specific SNPs and tissue-specific eQTLs, on the levels of genome and transcriptome, respectively, suggest that DBSs enable HS lncRNAs to regulate transcription with population- and tissue-specific features.

2.7. HS lncRNAs promote brain evolution from archaic to modern humans

To further confirm that SNPs in DBSs have cis-effects on gene expression, we examined the impact of HS lncRNA-DBS on gene expression in the GTEx tissues (GTEx Consortium, 2017). 40 autosomal HS lncRNAs are expressed in at least one tissue (median TPM>0.1) and form 198876 HS lncRNA-target transcript pairs in all tissues. Of these pairs, 45% show a significant expression correlation in specific tissues (Spearman’s |rho|>0.3 and FDR<0.05). To assess the likelihood that these correlations could be obtained by chance, we randomly sampled 10000 pairs of lncRNAs and protein-coding transcripts genome-wide and found that only 2.3% of pairs showed significant expression correlation (Spearman’s |rho|>0.3 and FDR<0.05). Moreover, a higher percentage (56%) of HS lncRNA-target transcript pairs with significant correlation is detected in at least one brain region, indicating more extensive gene expression regulation by HS lncRNAs in the brain (Figure 3A).

The impact of HS lncRNA-DBS interaction on gene expression in GTEx tissues and organs.

(A) The distribution of the percentage of HS lncRNA-target transcript pairs with correlated expression across GTEx tissues and organs. Higher percentages of correlated pairs are in brain regions than in other tissues and organs. (B) The distribution of significantly changed DBSs (in terms of sequence distance) in HS lncRNA-target transcript pairs across GTEx tissues and organs between archaic and modern humans. Orange, red, and dark red indicate significant changes from Denisovans (D), Altai Neanderthals and Denisovans (AD), and all three archaic humans (ADV). DBSs in HS lncRNA-target transcript pairs with correlated expression in seven brain regions (in dark red) have changed significantly and consistently since the Altai Neanderthals, Denisovans, and Vindija Neanderthals (one-sided two-sample Kolmogorov-Smirnov test, significant changes determined by FDR <0.001).

To obtain supporting evidence on the transcriptome level, we analyzed nine experimental datasets (Supplementary Note 7). First, we analyzed two datasets of epigenetic studies: one examining H3K27ac and H3K4me2 profiling in human, macaque, and mouse corticogenesis, and the other examining gene expression and H3K27ac modification in eight brain regions in humans and four other primates (Reilly et al., 2015; Xu et al., 2018). Compared with the genome-wide background, 84% and 73% of genes in the two datasets have DBSs for HS lncRNAs, indicating significantly higher enrichment in HS lncRNA targets (p=1.21e-21 and 1.2e-56, two-sided Fisher’s exact test). When using gene expression in chimpanzees as the control, 1851 genes show human-specific transcriptomic differences in one or multiple brain regions, whereas only 240 genes show chimpanzee-specific transcriptomic differences. Second, we analyzed two datasets of PsychENCODE studies, one examining spatiotemporally differentially expressed genes and spatiotemporally differentially methylated sites in 16 brain regions, and the other examining the spatiotemporal transcriptomic divergence across human and macaque brain development (Li et al., 2018; Zhu et al., 2018). In the first dataset, 65 HS lncRNAs are expressed in all 16 brain regions, 109 transcripts have spatiotemporally differentially methylated sites in their promoters, and 56 of the 109 transcripts have DBSs for HS lncRNAs. In the second dataset, 8951 genes show differential expression between the human and macaque brains, and 72% of differentially expressed genes in the human brain have DBSs for HS lncRNAs. Thus, both datasets show significant enrichment for HS lncRNA regulation. Third, three studies identified genes critically regulating cortical expansion (Florio et al., 2018; Johnson et al., 2018; Suzuki et al., 2018). Of the 40 reported protein-coding genes, 29 have DBSs of HS lncRNAs in promoter regions (Supplementary Table 14). Thus, these genes are enriched with DBSs of HS lncRNAs compared to the genome-wide background (p<0.01, two-sided Fisher’s exact test). Fourth, we analyzed two datasets of brain organoid studies. By establishing and comparing cerebral organoids between humans, chimpanzees, and macaques, Pollen et al. identified 261 human-specific gene expression changes (Pollen et al., 2019). By fusing human and chimpanzee iPS cells and differentiating the hybrid iPS cells into hybrid cortical spheroids, Agoglia et al. generated a panel of tetraploid human-chimpanzee hybrid iPS cells and identified thousands of genes with divergent expression between humans and chimpanzees (Agoglia et al., 2021). We found that 261 and 1102 genes in the two datasets are enriched with DBSs of HS lncRNAs compared to the genome-wide background (p=1.2e-16 and 3.4e-74).

Finally, we examined the evolution of the impact of HS lncRNAs and their DBSs on gene expression further using the GTEx data (the transcript expression matrices). For each DBS in an HS lncRNA-target transcript pair that shows correlated expression in a GTEx tissue, we computed the sequence distances of this DBS from the HS lncRNA-target transcript pair in modern humans to the pairs in the three archaic humans. Then, we compared the distribution of DBS sequence distances in each tissue with the distribution of DBS sequence distances in all tissues as the background (one-sided two-sample Kolmogorov-Smirnov test). DBSs in HS lncRNA-target transcript pairs with correlated expression in brain regions have significantly changed sequence distances since the Altai Neanderthals and Denisovans (Figure 3B), suggesting that gene expression regulation by HS lncRNAs in the brain has undergone significant evolution recently (Ponce de Leon et al., 2021). Results from other GTEx tissues, including Kidney-Cortex, Ovary, and Skin-Sun Expressed (Figure 3B), also suggest continuous influences of reshaped gene expression on human evolution.

The above results prompted us to examine whether HS lncRNAs may have contributed more significantly to human evolution than HS TFs (Supplementary Note 9). Based on the “hg38-panTro6” gene sets reported by Kirilenko et al. and the human TF lists reported by previous studies and used in the SCENIC package (Bahrami et al., 2015; Kirilenko et al., 2023; Lambert et al., 2018), we identified five potential HS TFs. Then, we predicted their DBSs in the 5000 bp promoter regions of the 179128

Ensembl-annotated transcripts (release 79) using FIMO and CellOracle (Grant et al., 2011; Kamimoto et al., 2023), identified counterparts of these DBSs in archaic humans and chimpanzees, computed sequence distances of DBSs from modern humans to archaic humans and chimpanzees (Supplementary Table 18), computed the Pearson correlation of HS TFs and target transcripts across GTEx tissues, and computed sequence distances of DBSs in HS TF-target transcript pairs across GTEx tissues from modern humans to archaic humans. Highly correlated HS TF-target transcript pairs are distributed across multiple tissues, rather than being confined to the brain; additionally, significantly changed DBSs do not densely occur in the brain (Supplementary Figs. 25 and 26). These results further support that HS lncRNAs have promoted human evolution by specifically regulating gene expression.

2.8. HS lncRNAs mediate human-specific correlated gene expression in the brain

To explore the evidence indicating that the high percentage of correlated gene expression in the brain (Figure 3A) is human-specific, we used the GTEx data from the human frontal cortex (BA9) and anterior cingulate cortex (BA24) (n=101 and 83, respectively) (Consortium et al., 2017), the transcriptomic data for the same brain regions in macaques (n=22 and 25, respectively) (Zhu et al., 2018), and the eGRAM program we developed recently for examining transcriptional regulation modules, to identify gene modules. In the human brain, HS lncRNAs’ target genes form distinct modules, characterized by rich correlation relationships (Pearson’s r>0.8) and enrichment in neurodevelopment-related KEGG pathways (hypergeometric distribution test, FDR<0.05). In contrast, in the same macaque brain regions, the orthologues of HS lncRNAs’ target genes do not have these features (Figure 4; Supplementary Figure 24; Supplementary Table 17). The distinct differences in gene modules in these brain regions support that the reshaped gene expression by HS lncRNAs is human-specific.

Human-specifically reshaped gene expression by HS lncRNAs in the frontal cortex (BA9).

(A) Genes expressed in the human frontal cortex are enriched for HS lncRNAs’ target genes and neurodevelopment-related pathways. Squares, dots, and colors indicate HS lncRNAs, gene modules (Module_1 and Module_2 are illustrated), and enriched KEGG pathways, respectively. (B) Comparison of modules and genes in humans (indicated by H) and macaques (indicated by M). In each pair of modules, green and blue dots denote human genes and their orthologues, and lines between dots indicate correlated expression. Note that many orthologous genes in macaques are not in the modules but displayed in the corresponding positions, and correlated expression is more prominent in humans than in macaques.

3. Discussion

The trivial genomic differences but substantial phenotypic and behavioral differences between modern humans, archaic humans, and apes make the question of what genomic differences critically determine modern humans a profound one. This question has been addressed by numerous studies employing various methods. Previous studies notably examined new protein-coding genes and HARs, reporting that HS protein-coding genes promote rapid neocortex expansion (Fiddes et al., 2018; Florio et al., 2018; Pinson et al., 2022) and that HARs are significantly enriched in domains important for human-specific 3D genome organization (Keough et al., 2023). Many studies also examined the functions of lncRNAs in humans and mice. However, HS lncRNAs and lncRNA DBSs have been overlooked. Since gene expression can be substantially rewired by TFs and TF DBSs (Johnson, 2017), it is sensible to postulate that HS lncRNAs and their DBSs may have greatly reshaped gene expression. Verifying this postulation needs examination of cross-species genomic and transcriptomic data. This study aimed to verify this postulation by identifying HS lncRNAs, their DBSs, and the counterparts of these DBSs in modern humans, ancient humans, and chimpanzees. We used multiple methods to analyze diverse data, and our results suggest that HS lncRNAs have greatly reshaped (or even rewired) gene expression in humans.

As DBS prediction is the basis of the study, we used multiple methods, including DBD KO using CRISPR/Cas9, following the analysis of differential expression of target genes, to validate predicted DBSs. When running programs including Infernal, LongTarget, and g:Profiler and performing statistical tests that detect selection signals, we used default parameters that suit most situations and widely used significance levels. When identifying those of HS lncRNAs’ target genes that probably most critically differentiate humans from chimpanzees, the top 20% (4248) of genes were used whose DBSs in humans or the counterparts of these DBSs in chimpanzees have the largest binding affinity or sequence distances. In support that these parameters, significance levels, and cutoffs are reasonable, the numbers of genes we obtained that differentiate modern humans from archaic humans and chimpanzees agree with the phylogenetic relationships revealed by previous studies (Prufer et al., 2017), and are approximate to the number (4377) of differentially expressed genes between human-only and chimpanzee-only iPSC lines due to trans-acting differences between humans and chimpanzees (Song et al., 2021). Importantly, since weak DBSs may be less reliable than strong ones, we classified DBSs into strong/weak ones and old/young ones. We found that different classes of DBSs have distinct evolutionary and functional features, which is generally sensible given that human evolution has undergone different big events. Due to the nature of data analysis, it is challenging to eliminate bias, false positives, false negatives, and the impact of varying parameter values. Other parameters and cutoff values may generate somewhat (but not significantly) different results (e.g., a DBS distance cutoff=0.035 would determine 4157 top target genes). Not all detected signals reliably indicate positive selection, but the main conclusions are based on binding affinity, sequence distance, and SNP number, instead of selection signals in DBSs (Tables 1, 2). Genes have different transcript/isoform numbers, but this may unlikely cause confounding effects in downstream analyses such as overrepresentation analysis (Figure 2) and transcriptomic data analysis (Figure 3) because the input to g:Profiler is a non-redundant gene list and each RNA-seq file normally has just one record for each gene, even if some genes may have one but others have multiple DBSs (before their one or multiple transcripts). While a gene with more DBSs/transcripts may have a higher chance to evolve a large DBS sequence distance, this more likely suggests that the transcript with a large DBS sequence distance has become more/less important than others.

We have a few notes on several findings. First, Neanderthal-inherited sequences have measurable impacts on gene expression in modern humans, but the impacts are least detectable in the brain (McCoy et al., 2017). Here, we find that HS lncRNAs and their DBSs have more significantly influenced gene expression in the brain than in other tissues. Second, the regulation of many genes by HS lncRNAs may have undergone rapid evolution, notable examples including those making humans adapt to high sugar intake. Third, intensively regulated genes are enriched in different functions at different periods of human evolution, as evidenced by genes with strong/weak and young/old DBSs and with DBSs containing “old” and “new” SNPs. Notable details include that DBSs containing “old” and “new” SNPs are in genes regulating neural development and glucose metabolism. Fourth, newly emerged regulations (i.e., target genes with young weak DBSs) may have significantly promoted recent human evolution.

The above findings also raise new questions. First, a high percentage (about 85%) of protein-coding regions in the mouse and human genomes is identical, but many mouse lncRNAs are rodent-specific (Yue et al., 2014). Thus, two questions arise: how these lncRNAs specifically rewire gene expression in mice and to what extent human- and mouse-specific lncRNAs cause the cross-species transcriptional regulation differences (Breschi et al., 2017; Hodge et al., 2019; Zhu et al., 2018). Second, whether mice and other mammals have such species-specific lncRNAs as RP11-423H2.3 that regulate the expression of substantial genes. Third, whether lineage- and species-specific lncRNAs would make many evolutionary novelties preordained.

4. Material and methods

4.1. Data sources

The sequences of human (GRCh38/hg38, GRCh37/hg19) and chimpanzee genomes (panTro5) were obtained from the UCSC Genome Browser (http://genome.UCSC.edu). Three high-quality archaic human genomes were obtained from the Max Planck Institute for Evolutionary Anthropology (https://www.eva.mpg.de/genetics/index.html), which include an Altai Neanderthal that lived approximately 122 thousand years ago (kya) (Prüfer et al., 2013), a Denisovan (an Asian relative of Neanderthals) that lived approximately 72 kya (Meyer et al., 2012), and a Vindija Neanderthal that lived approximately 52 kya (Prufer et al., 2017). The ancestral sequences for the human genome (GRCh37) (which were generated upon the EPO alignments of six primate species human, chimpanzee, gorilla, orangutan, macaque, and marmoset) were downloaded from the EBI website (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignm ents/). The SNP data of modern humans were obtained from the 1000 Genomes Project (phase I) (1000 Genomes Project Consortium et al., 2012). The three modern human populations are CEU (Utah residents with Northern and Western European ancestry), CHB (Han Chinese in Beijing, China), and YRI (Yoruba in Ibadan, Nigeria), which contain 100, 99, and 97 individuals, respectively. The genetic map files of modern humans were downloaded from the 1000 Genomes Project website (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110106_recombination_hotspots/).

The 1000 Genomes phased genotype data in VCF format were downloaded from the International Genome Sample Resource website (http://www.internationalgenome.org/). Annotated human genes and transcripts were obtained from the Ensembl website (http://www.Ensembl.org). FASTA-format nucleotide sequences were converted from VCFs using the vcf-consensus program in the VCFtools package when necessary (Danecek et al., 2011). Data in the GeneCards Human Gene Database (www.genecards.org) were used to annotate genes and transcripts.

RNA-sequencing (RNA-seq) data (transcripts per million [TPM]) and eQTL data of human tissues were obtained from the Genotype-Tissue Expression Project (GTEx, v8) website (https://gtexportal.org/) (GTEx Consortium, 2017). Histone modification and DNA methylation signals in multiple cell lines and ENCODE Candidate Cis-Regulatory Elements (cCRE) were obtained from the UCSC Genome Browser (http://genome.UCSC.edu). The genome-wide DNA binding sites of lncRNA NEAT1, MALAT1, and MEG3 were obtained from experimental studies (Mondal et al., 2015; West et al., 2014). The predicted and experimentally detected DBSs in target genes of MALAT1, NEAT1, and MEG3 are given in the supplementary table (Supplementary Table 15).

4.2. Identifying HS lncRNA genes

We used the Infernal program (Nawrocki et al., 2009), which searches orthologous RNA sequences upon sequence and structure alignment, to identify orthologous exons in 16 mammalian genomes for each exon in each of the 13562 GENCODE (v18)-annotated human lncRNA genes (Lin et al., 2019). The 16 mammals were chimpanzee (CSAC 2.1.4/panTro4), macaque (BGI CR_1.0/rheMac3), marmoset (WUGSC 3.2/calJac3), tarsier (Broad/tarSyr1), mouse lemur (Broad/micMur1), tree shrew (Broad/tupBel1), mouse (GRCm38/mm10), rat (Baylor3.4/rn4, RGSC6.0/rn6), guinea pig (Broad/cavPor3), rabbit (Broad/oryCun2), dog (Broad CanFam3.1/canFam3), cow (Baylor Btau_4.6.1/bosTau7), elephant(Broad/loxAfr3), hedgehog (EriEur2.0/eriEur2), opossum (Broad/monDom5), and platypus (WUGSC 5.0.1/ornAna1) (http://genome.UCSC.edu). If the number of orthologous exons of a human lncRNA gene in a genome exceeded half the exon number of the human lncRNA gene, these orthologous exons were assumed to form an orthologous lncRNA gene. If a human lncRNA gene had no orthologous gene in all of the 16 mammals, it was assumed to be human-specific.

4.3. Identifying DBSs of HS lncRNAs

LncRNAs bind to DNA sequences by forming RNA:DNA triplexes. Each triplex comprises triplex-forming oligonucleotides (TFO) in the lncRNA and a triplex-targeting site (TTS) in the DNA sequence. We used the LongTarget program to predict HS lncRNAs’ DNA binding domains (DBD) and binding sites (DBS) with the default parameters (Ruleset = all, T.T. penalty = -1000, CC penalty = 0, Offset = 15, Identity ≥ 60, Nt ≥ 50) (Lin et al., 2019; Wen et al., 2022). The LongTarget program predicts DBDs and DBSs simultaneously, where a DBD comprises a set of densely overlapping TFOs and a DBS comprises a set of densely overlapping TTSs. For each HS lncRNA, we predicted its DBSs in the 5000 bp promoter regions (−3500 bp upstream and +1500 bp downstream the transcription start site) of the 179128 Ensembl-annotated transcripts (release 79). For each DBS, its binding affinity is the product of DBS length and the averaged Identity score of all TTSs (the Identity score is the percentage of paired nucleotides). Strong and weak DBSs were classified based on binding affinity ≥60 and <60. A transcript whose promoter region contains a strong DBS of an HS lncRNA was assumed to be a target transcript of the HS lncRNA, and the gene containing this transcript was assumed to be a target gene of the HS lncRNA. As the 1000 Genomes Project (phase I) data and the archaic human genomes are based on GRCh37/hg19, the coordinates of DBSs were converted from GRCh38/hg38 to GRCh37/hg19 using the liftover program from the UCSC Genome Browser (Kuhn et al., 2013).

4.4. Experimentally validating DBS prediction

A 157 bp sequence (chr17:80252565-80252721, hg19) containing the DBD of RP13-516M14.1, a 202 bp sequence (chr1:113392603-113392804, hg19) containing the DBD of RP11-426L16.8, and a 198 bp sequence (chr17:19460524-19460721, hg19) containing the DBD of SNORA59B, were knocked out in the HeLa cell line, RKO cell line, and SK-MES-1 cell line, respectively. Two sequences (chr1:156643524-156643684, chr10:52445649-52445740, hg38) containing the DBD of two wrongly transcribed noncoding sequences were knocked out in the HCT-116 and A549 cell lines, respectively. The seven knockouts were performed by UBIGENE, Guangzhou, China (http://www.ubigene.com) using CRISPR-UTM, a revised version of CRISPR/Cas9 technology. Before and after the seven DBD knockouts, RNA sequencing (RNA-seq) was performed by Novogene, Beijing, China (https://cn.novogene.com) and HaploX, Shenzhen, China (https://www.haplox.cn/). The reads were aligned to the human GRCh38 genome using the Hiasat2 program (Kim et al., 2019), and the resulting SAM files were converted to BAM files using Samtools (Li et al., 2009). The Stringtie program was used to quantify gene expression levels (Pertea et al., 2015). Fold change of gene expression was computed using the edgeR package (Robinson et al., 2010), and significant up- and down-regulation of target genes after DBD knockout was determined upon |log2(fold change)| > 1 with FDR < 0.1.

Genome-wide DBSs of NEAT1, MALAT1, and MEG3 were experimentally detected (Mondal et al., 2015; West et al., 2014). We also used these data to validate DBS prediction by predicting DBSs of the three lncRNAs and checking the overlap between predicted and experimentally detected DBSs (Supplementary Table 15).

4.5. Mapping DBSs of HS lncRNAs in the chimpanzee and archaic human genomes

We used the liftover program from the UCSC Genome Browser (http://genome.UCSC.edu) to map loci of DBSs in the human genome hg38 to the chimpanzee genome Pan_tro 3.0 (panTro5). The mapping results were examined by checking the human-chimpanzee pairwise alignment (hg38 vs panTro5) in the UCSC Genome Browser (http://genome.UCSC.edu). 97.81% (100403/102651) of DBSs were mapped, and the 2248 unmapped DBSs in 429 genes were considered human-specific. We used vcf-consensus in the VCFtools package to extract the DBSs of HS lncRNAs from the VCF files of Altai Neanderthals, Denisovans, and Vindija Neanderthals. The variant with the highest quality score was selected whenever multiple variant calls were observed at a given locus. The obtained DBS sequences in chimpanzees and three archaic humans are called counterparts of DBSs in these genomes.

4.6. Estimating sequence distances of DBSs between different genomes

We first aligned DBS sequences in the genomes of humans, chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals using the MAFFT7 program for measuring sequence distances from modern humans to chimpanzees and archaic humans (Katoh and Standley, 2013). We then computed sequence distances using the dnadist program with the Kimura 2-parameter model in the PHYLIP (3.6) package (http://evolution.genetics.washington.edu/phylip.html) and the Tamura-Nei model in the MEGA7 package (Kumar et al., 2016). The two methods generated equivalent results. The largest distance between DBSs in humans and their counterparts in chimpanzees is 5.383. Since 2248 DBSs in 429 human genes do not have counterparts in chimpanzees, we assumed that these DBSs have sequence distances of 10.0 between humans and chimpanzees. The relationship between cross-species sequence distances and cross-species phenotypic distances is little known. Based on the report that 5984 genes were identified differentially expressed between human-only and chimpanzee-only iPSC lines (including 4377 whose differential expression was due to cross-species trans-acting differences) (Song et al., 2021) and on the report that “the Denisovan genome thus tend to be older, than the Altai Neandertal genome” (Prufer et al., 2017), we chose 0.034 as the DBS sequence distance cutoff for identifying “mostly changed DBSs” from modern humans to chimpanzees and archaic humans. 0.034 determined 4248 genes (the top 20%) in chimpanzees, and 1256, 2514, and 134 genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals, respectively, which may critically characterize humans from chimpanzees and three archaic humans. If cutoff=0.035, since 91 genes have DBS sequence distance=0.034, 4248-91=4157 genes, this small difference in gene number (4248 vs 4157) would not significantly influence overrepresentation analysis (Figure 2).

We determined human ancestral sequences of DBSs based on the human ancestor sequences from the EBI website, which were generated based on the EPO alignments of six primate species. We used the above-mentioned methods to calculate the sequence distances DBSs from the human ancestor to chimpanzees, archaic humans, and modern humans. We found that when the human-chimpanzee ancestral sequence has the ancestral sequence (which means the inference of ancestral allele is of high confidence), DBS distances from the human ancestor to modern humans are larger than to archaic humans, but this situation only accounts for about 63.8%. For many DBSs, the distances from the human ancestor to modern humans are smaller than to archaic humans (especially Neanderthals and Denisovans) and even to chimpanzees. This defect may be caused by the absence of archaic humans in building the human ancestral sequence.

4.7. Detecting positive selection signals in HS lncRNA genes and DBSs

We used multiple tests to detect positive selection signals in HS lncRNA genes. First, we used the XP-CLR test (parameters = -w1 0.001 300 100 -p0 0.95, window size = 0.1 cM, grid size = 100 bp) to perform six pairwise genome-wide scans (i.e., CEU-CHB, CEU-YRI, CHB-CEU, CHB-YRI, YRI-CEU, and YRI-CHB) (Chen et al., 2010). The upper 1% of scores across the entire genome in each pairwise scan was 34.6 in the CEU-YRI scan, 16.8 in the CEU-CHB scan, 45.0 in the CHB-YRI scan, 26.9 in the CHB-CEU scan, 14.1 in the YRI-CEU scan, and 14.1 in the YRI-CHB scan. These scores were used as the thresholds of positive selection signals in these populations. Second, we used the iSAFE program to scan each genomic region containing an HS lncRNA gene and its 500-kb upstream and downstream sequences (Akbari et al., 2018). Strongly selected loci were detected only in CEU and CHB. Third, we used the VCFtools program to calculate Tajima’s D values for each HS lncRNA gene in CEU, CHB, and YRI (Danecek et al., 2011). The calculation was performed using a 1500-bp non-overlapping sliding window because the lengths of these genes exceed 1500 bp. To generate a background reference for assessing the significant increase or decrease of Tajima’s D for HS lncRNA genes in a population, we calculated Tajima’s D values across the whole genome using a sliding window of 1500 bp. As the values of Tajima’s D were compared with the background reference, significant D<0 and D>0 indicate positive (or directional) selection and balancing selection, respectively, rather than population demography dynamics (Tajima, 1989). Fourth, we used the integrated Fst to detect positive selection signals in HS lncRNA genes. The Fst of each HS lncRNA gene was computed for three comparisons, i.e., CEU-YRI, CHB-YRI, and CHB-CEU. Extreme Fst values of SNPs were detected in HS lncRNA genes in the comparisons of CEU-YRI and CHB-YRI. Since allele frequencies for different loci vary across a genome and genetic drift may have different effects at different loci, we used a sliding window of 1500 to compare Fst values of HS lncRNA genes with the genome-wide background. Extreme Fst values indicate positive selection. Finally, we applied linkage disequilibrium (LD) analysis to each HS lncRNA gene. We computed the pairwise LD (r²) in CEU, CHB, and YRI for common SNPs (with minimal allele frequency (MAF) ≥ 0.05 in at least one population) in HS lncRNA genes and DBSs using the PLINK program (Purcell et al., 2007). Significantly increased LD was detected in SNPs in HS lncRNA genes in CEU and CHB. The LD patterns were portrayed using the Haploview program (Barrett et al., 2005).

Next, we used the above tests to detect positive selection signals in DBSs. First, the 100 bp grid size of the XP-CLR test also allowed the detection of selection signals in DBSs. Second, we performed Tajima’s D and Fay-Wu’s H tests (Fay and Wu, 2000; Tajima, 1989). We calculated Tajima’s D values for each DBS in CEU, CHB, and YRI using the VCFtools program (Danecek et al., 2011), with a sliding window of 147 bp (the mean length of strong DBSs). To generate a background reference for judging the significant increase or decrease of Tajima’s D in a population, we calculated Tajima’s D values across the whole genome using the same sliding window. As values of Tajima’s D were compared with the background reference, a significant D<0 and D>0 indicate positive (or directional) selection and balancing selection, respectively. Fay-Wu’s H values were calculated similarly using the VariScan program (Vilella et al., 2005). Calculating Fay-Wu’s H demands the ancestral sequences as the outgroup. We extracted the ancestral sequences of DBSs from the human ancestral sequence, which was generated upon the EPO alignments of six primate species (human, chimpanzee, gorilla, orangutan, macaque, and marmoset) (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignm ents/). Third, we computed the Fst to measure the frequency differences of alleles in DBSs between populations. For the CEU-YRI, CHB-YRI, and CHB-CEU pairwise comparisons, we used the revised VCFtools program to compute the weighted Fst for all SNPs in each DBS (Weir and Cockerham, 1984). Fourth, we integrated the weighted Fst values in the three populations into an “integrated Fst” which indicated whether the DBS locus was under selection in a certain population (Nielsen and Slatkin, 2013). We used sliding windows of 147 bp for comparing Fst values of DBSs with the genome-wide background. We empirically denoted the values of integrated Fst in the upper 10% of scores across the entire genome as statistically significant. To determine positive selection more reliably, we used Tajima’s D and the integrated Fst to jointly determine if a DBS was under positive selection in a population. The thresholds that determined the upper 10% of Tajima’s D values across the entire genome in CEU, CHB, and YRI were -0.97, -0.96, and -0.97, respectively, and the threshold that determined the upper 10% of integrated Fst values across the entire genome in the three populations was 0.22. For example, a DBS was assumed to be under positive selection in CEU if (a) the DBS had a Tajima’s D<-0.97 in CEU and Tajima’s D>0.0 in the two other populations and (b) the DBS had an integrated Fst>0.22. Analogously, a DBS was assumed to be under positive selection in both CEU and CHB if the DBS had a Tajima’s D <-0.97 in CEU, <-0.96 in CHB, and >0.0 in YRI, and had an integrated Fst>0.22.

4.8. Functional enrichment analysis of genes

We used the g:Profiler program (with the parameter settings: Organism=Homo sapiens, Ordered query = No, Significance threshold = Benjamini-Hochberg FDR, User threshold = 0.05, 50 < terms size < 1000) and the Gene Ontology (GO) database to perform over-representation analysis (Raudvere et al., 2019). This analysis is used to determine which pre-defined gene sets (GO terms) are more prevalent (over-represented) in a list of “interesting” genes than would be expected by chance. The lists of genes included genes with the strongest DBSs in humans, genes with the weakest DBSs in humans, and genes with mostly changed DBSs in chimpanzees and three archaic humans. To choose genes with strong DBSs, we ranked genes in descending order based on binding affinity and chose the top 2000 genes. To choose genes with weak DBSs, we first ranked genes in ascending order based on binding affinity, then ranked genes in descending order based on the number of regulatory HS lncRNAs, and finally chose the top 2000 genes with binding affinity ≤40 and regulatory HS lncRNAs ≥5. We used the distance threshold of 0.034 to select genes with mostly changed DBSs. Genes with mostly changed DBSs in chimpanzees include the 429 genes in which 2248 DBSs lack counterparts. 4248, 1256, 2514, and 134 genes in chimpanzees, Altai Neanderthals, Denisovans, and Vindija Neanderthals have DBSs with distance ≥0.034. To make the statistical significance comparable across chimpanzees, Altai Neanderthals, and Denisovans, we used the same number of genes (the top 1256 genes whose DBSs have the largest DBS distances) as the input to the g:Profiler program.

4.9. Analysing SNP frequencies in human populations

The frequency of common SNPs (MAF ≥0.05) in DBSs in the three modern human populations was computed using the VCFtools package (Danecek et al., 2011). The ancestral/derived states of SNPs were inferred from the human ancestor sequences and were used to determine derived allele frequency.

4.10. Analysing the cis-effect of SNPs in DBSs on target gene expression

SNPs with MAF >0.1 in DBSs in any of the three modern human populations and absolute values of cis-effect size >0.5 (FDR <0.05) in any of the GTEx tissues were examined for an influence on the expression of the target genes (GTEx Consortium, 2017). SNPs that are eQTLs in the GTEx tissues and have biased derived allele frequencies (DAF) in the three modern human populations were examined to estimate whether the eQTL is population-specific.

4.11. Examining the tissue-specific impact of HS lncRNA-regulated gene expression

First, we examined the expression of HS lncRNA genes across the GTEx tissues. HS lncRNA genes with a median TPM value >0.1 in a tissue were considered robustly expressed in that tissue. Upon this criterion, 40 HS lncRNA genes were expressed in at least one tissue and were used to examine the impact of HS lncRNA regulation on gene and transcript expression (other HS lncRNAs may function in the cytoplasm). Since an HS lncRNA gene may have multiple transcripts, we adopted the transcript that contains the predicted DBD and has the largest TPM value as the representative transcript of the HS lncRNA. We calculated the pairwise Spearman’s correlation coefficient between the expression of an HS lncRNA (the representative transcript) and the expression of each of its target transcripts using the scipy.stats.spearmanr program in the scipy package. The expression of an HS lncRNA gene and a target transcript was considered to be significantly correlated if the |Spearman’s rho|>0.3, with Benjamini-Hochberg FDR<0.05. We examined the percentage distribution of correlated HS lncRNA-target transcript pairs across GTEx tissues and organs (Figure 3A).

We examined all GTEx tissues to determine in which tissues HS lncRNA-regulated gene expression may have undergone significant changes from archaic to modern humans. GTEx data have both gene expression matrices and transcript expression matrices; we used the latter to examine the change of HS lncRNA-target transcript pairs from modern humans to archaic humans. If a pair of HS lncRNA and target transcript is robustly expressed in a tissue and their expression shows a significant correlation (|Spearman’s rho|>0.3, with Benjamini-Hochberg FDR<0.05) in the tissue, we computed the sequence distance of the HS lncRNA’s DBS in the transcript from the three archaic humans to modern humans. We compared the sequence distances of all DBSs in each tissue with the sequence distances of all DBSs in all GTEx tissues (as the background). A one-sided two-sample Kolmogorov-Smirnov test was used to examine whether the sequence distances of all DBSs in a specific tissue deviate from the background distribution (which reflects the “neutral evolution” of gene expression). For each tissue, if the test result was Benjamini-Hochberg FDR < 0.001, the tissue was considered to have significantly changed gene expression regulated by the HS lncRNA. We used different colors to mark tissues with significantly changed gene expression regulation since Altai Neanderthals, Denisovans, and Vindija Neanderthals, and used “D”, “A.D.”, and “ADV” to indicate changes since Denisovans, since Altai Neanderthals and Denisovans, and since Altai Neanderthals, Denisovans, and Vindija Neanderthals, respectively (Figure 3B).

4.12. Examining enrichment of favored and hitchhiking mutations in DBSs

Using the deep learning network DeepFavored, which integrates multiple statistical tests for identifying favored mutations (Tang et al., 2022), we identified 13339 favored mutations and 244098 hitchhiking mutations in 17 human populations (Tang et al., 2023). In this study, we classified DBSs in two ways: into strong ones (affinity>60) and weak ones (36<affinity<60), and into old ones (Human-Chimp distance>0.034 AND Human-Altai Neanderthals distance=0), young ones (Human-Altai Neanderthals distance>0.034 OR Human-Denisovan distance>0.034), and others. We then examined the number of favored and hitchhiking mutations in each class of DBSs. The weak young DBSs have the largest proportion of favored and hitchhiking mutations.

4.13. Identifying and analyzing transcriptional regulatory modules

Most clustering algorithms classify genes into disjoint modules based on expression correlation without considering regulatory relationships (Saelens et al., 2018). The GRAM program identifies gene regulatory modules based on correlation and TF-TFBS binding (Bar-Joseph et al., 2003). LncRNAs transcriptionally regulate genes based on lncRNA-DBS binding and correlate gene expression. We developed the eGRAM program to identify gene modules based on correlated expression, TF-TFBS interaction, and lncRNA-DBS interaction. In this study, we used eGRAM to identify gene modules in the same regions of the human and macaque brains and enriched KEGG pathways using reported RNA-seq datasets (n=101 and 83 for frontal cortex and n=22 and 25 for anterior cingulate cortex) (GTEx Consortium, 2017; Zhu et al., 2018). The default parameters, DBS binding affinity=60, Pearson correlation=0.5, module size=50, and FDR=0.01 (hypergeometric distribution test), were used. The key steps of the program are as follows. (a1) Identify each lncRNA’s correlated lncRNAs, which may form a set of co-regulators. (a2, optional) Identify each TF’s correlated TFs, which may form a set of co-regulators. (b1) Compute the correlation between each lncRNA and all genes. (b2, optional) Compute the correlation between each TF and all genes. (c1) Identify each lncRNA’s target genes. (c2, optional) Identify each TF’s target genes. (d1) Identify each lncRNA set’s target module upon correlation and targeting relationships. (d2, optional) Identify each TF set’s target module upon correlation and targeting relationships. (e) Check whether TFs’ modules contain lncRNAs’ targets and whether lncRNAs’ modules contain TFs’ targets, which reveal genes co-regulated by TFs and lncRNAs and genes independently regulated by TFs and ncRNAs. (f) Performs pathway enrichment analysis for all modules.

4.14 The analysis of HS TFs and their DBSs

Kirilenko et al recently identified orthologous genes in hundreds of placental mammals and birds and organized genes into pairwise datasets using humans and mice as the references (e.g., “hg38-panTro6”, “hg38-mm10”, and “mm10-hg38”) (Kirilenko et al., 2023). Based on the “many2zero” and “one2zero” gene lists (which contain 0 and 147 genes) in the hg38-panTro6 dataset, which indicate the multiple human genes and the one human gene that have no orthologues in chimpanzees, we identified HS protein-coding genes. Further, based on three human TF lists reported by two studies and used in the SCENIC package (Bahrami et al., 2015; Lambert et al., 2018), we identified HS TFs. Based on the JASPAR database and using the CellOracle program (with default parameters), we predicted DBSs of HS TFs. Then, we repeated the steps of our HS lncRNA analyses (Supplementary Note 9), including computing sequence distances of DBSs and examining the impact of HF TF-target transcript pairs on gene expression in GTEx tissues and organs.

Data and code availability

GENCODE human lncRNAs are available at the website (https://www.gencodegenes.org/human/). Human lncRNAs’ orthologues in 16 mammals, as well as the LongTarget program, are available at the LongTarget/LongMan website (http://www.gaemons.net/). The eGRAM code is available on the GitHub website (https://github.com/LinjieCodes/eGRAMv2R1). Other programs and tests are open-source and freely available. The RNA-seq data of the three HS lncRNAs before and after DBD KO have been deposited in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo) (accession number GSE213231). The RNA-seq data of the other four cases of DBD KO have been deposited in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo) (accession number is GSE229846). The data on favored and hitchhiking mutations are available from the PopTradeOff database or upon request. All other data are in Supplementary Tables.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (31771456 to H.Z.) and the China Postdoctoral Science Foundation (2020M682788 to J.L.).

Additional information

Author contributions

H.Z. designed the study and drafted the manuscript. J.L. performed most of the data analyses. Y.W. and J.T. performed XP-CLR, iSAFE, and favored mutation analyses. J.L. and H.Z. developed the eGRAM program. All authors have read the manuscript and consent to its publication.

Funding

National Natural Science Foundation of China (31771456)

China Postdoctoral Science Foundation (2020M682788)

Additional files

Supplementary Tables. Excel file containing 18 Supplementary Tables.

Supplementary Notes. PDF file containing 9 Supplementary Notes and 26 Supplementary Figures.