Promoter sequence and architecture determine expression variability and confer robustness to genetic variants
Abstract
Genetic and environmental exposures cause variability in gene expression. Although most genes are affected in a population, their effect sizes vary greatly, indicating the existence of regulatory mechanisms that could amplify or attenuate expression variability. Here, we investigate the relationship between the sequence and transcription start site architectures of promoters and their expression variability across human individuals. We find that expression variability can be largely explained by a promoter’s DNA sequence and its binding sites for specific transcription factors. We show that promoter expression variability reflects the biological process of a gene, demonstrating a selective trade-off between stability for metabolic genes and plasticity for responsive genes and those involved in signaling. Promoters with a rigid transcription start site architecture are more prone to have variable expression and to be associated with genetic variants with large effect sizes, while a flexible usage of transcription start sites within a promoter attenuates expression variability and limits genotypic effects. Our work provides insights into the variable nature of responsive genes and reveals a novel mechanism for supplying transcriptional and mutational robustness to essential genes through multiple transcription start site regions within a promoter.
Editor's evaluation
This paper presents valuable findings about how human genetic variation impacts gene expression. Using a compelling analysis of new experimental data based on cell lines from 108 individuals, the authors uncover features that distinguish promoters with highly variable expression across individuals from those exhibiting low variability. This work and the associated resource will be of broad interest for further investigations of the interplay between genetic variation and gene expression control.
https://doi.org/10.7554/eLife.80943.sa0Introduction
Transcriptional regulation is the main process controlling how genome-encoded information is translated into phenotypes. Hence, understanding how transcriptional regulation influences gene expression variability is of fundamental importance to understand how organisms are capable of generating proper phenotypes in the face of stochastic, environmental, and genetic variation. Through differentiation, cells acquire highly specialized functions, but need to still maintain their general abilities to accurately regulate both essential pathways as well as responses to changes in the environment. To achieve robustness, regulatory processes must be capable of attenuating expression variability of essential genes (Bartha et al., 2018), while still allowing, or possibly amplifying (Eldar and Elowitz, 2010; Urban and Johnston, 2018), variability in expression for genes that are required for differentiation or responses to environmental changes and external cues. How cells can achieve such precision and robustness remains elusive.
Genetic variation affects the expression level (Montgomery et al., 2010; Pickrell et al., 2010; Stranger et al., 2007) of the majority of human genes (Battle et al., 2017; Lappalainen et al., 2013; Storey et al., 2007). However, genes are associated with highly different effect sizes, with ubiquitously expressed or essential genes frequently being less affected (Battle et al., 2017). This indicates that genes associated with different regulatory programs are connected with different mechanisms or effects of mutational robustness (Payne and Wagner, 2015). Multiple transcription factor (TF) binding sites may act to buffer the effects of mutations in promoters (Spivakov et al., 2012), and promoters can have highly flexible transcription start site (TSS) architectures (Akalin et al., 2009; Carninci et al., 2006; Lehner, 2008). This demonstrates that the sequence and architecture of a promoter may influence its variability in expression across individuals.
Previous studies aimed at identifying processes involved in the regulation of gene expression variability have indeed revealed regulatory features mostly associated with the promoters of genes, such as CpG islands and TATA-boxes (Morgan and Marioni, 2018; Ravarani et al., 2016; Sigalova et al., 2020), the chromatin state around gene TSSs (Faure et al., 2017), and the propensity of RNA polymerase II to pause downstream of the TSS (Boettiger and Levine, 2009). These studies have relied on model organisms or focused on transcriptional noise across single cells. As of yet, regulatory features have not been thoroughly studied from the perspective of variability in promoter activity or across human individuals. Furthermore, it is unclear if regulation of variability mainly acts to attenuate variability to achieve stable expression for certain genes or if independent regulatory processes act in parallel to amplify variability for other genes.
Here, we provide a comprehensive characterization of the sequences, TSS architectures, and regulatory processes determining variability of promoter activity across human lymphoblastoid cell lines (LCLs). We find that variability in promoter activity is to a large degree reflected by the promoter sequence, notwithstanding possible genotypic differences. Furthermore, the presence of binding sites for specific TFs, including those of the ETS family, are highly predictive of low promoter variability independently of their impact on promoter expression levels. In addition, we demonstrate that differences in the variability of promoters reflect their involvement in distinct biological processes, indicating a selective trade-off between stability and plasticity of promoters. Finally, we show that flexibility in TSS usage is associated with attenuated promoter variability. Our results reveal a novel mechanism that confers mutational robustness to gene promoters via switches between proximal core promoters. This study provides fundamental insights into transcriptional regulation, indicating shared mechanisms that can buffer stochastic, environmental, and genetic variation and how these affect the responsiveness and cell-type restricted activity of genes.
Results
TSS profiling reveals variability in promoter activity across individuals
To characterize human variability in promoter activities, we profiled TSSs using CAGE (Cap Analysis of Gene Expression Takahashi et al., 2012; Figure 1A) across 108 Epstein-Barr virus (EBV)-transformed LCLs (Auton et al., 2015) of African origin, 89 from Yoruba in Ibadan, Nigeria (YRI) and 19 from Luhya in Webuye, Kenya (LWK) (Supplementary file 1). The samples had a balanced sex ratio, 56 females and 52 males, and no observable population stratification in the expression data (Figure 1—figure supplement 1). With CAGE, TSSs can be mapped with single base pair resolution and the relative number of sequencing reads supporting each TSS gives simultaneously an accurate estimate of the abundance of its associated RNA (Kawaji et al., 2014). The CAGE data across the LCL panel therefore give us a unique opportunity to both estimate variability in promoter activity and characterize the regulatory features influencing such variability.
We identified 29,001 active promoters of 15,994 annotated genes (Frankish et al., 2019) through positional clustering of proximal CAGE-inferred TSSs on the same strand (Figure 1A; Carninci et al., 2006) detected in at least 10% of individuals (Supplementary file 2). This individual-agnostic strategy ensured a focus on promoters that are active across multiple individuals while also allowing for the measurement of variability in promoter activity across the panel. For example, the CAGE data revealed that the promoters of gene RPL26L1, encoding a putative component of the large 60 S subunit of the ribosome, and transcription factor gene SIX3 have highly different variance yet similar mean expression across individuals (Figure 1B–D).
We used the squared coefficient of variation (CV2) as a measure of promoter expression dispersion, revealing how the normalized expression across individuals deviates from the mean for each identified promoter. We observed that the promoter CV2 decreases by increasing mean expression (Figure 1E; Eling et al., 2018; Kolodziejczyk et al., 2015; Sigalova et al., 2020). To account for this bias, we subtracted the expected dispersion for each promoter according to its expression level (Kolodziejczyk et al., 2015; Newman et al., 2006). Importantly, rank differences in promoter dispersion were maintained for each expression level after adjustment, as seen for promoters of genes RPL26L1 and SIX3 (Figure 1E and F). This strategy thus allowed us to investigate how promoter architecture and sequence determine variability in promoter activity across the panel separately from its impact on expression level (Figure 1F).
Promoter expression variability is reflected by the promoter sequence
To investigate if local sequence features at promoters reflect their variability in activity, we applied machine learning (convolutional neural network (CNN); Figure 2—figure supplement 1A; see Materials and methods) to discern low variable promoters (N=5,054) from highly variable promoters (N=5,683) based on their local DNA sequence alone. We considered the genomic reference sequence to model the intrinsic component of variability encoded within the promoter sequence independently of local genetic variants within the panel. The resulting model was capable of distinguishing between these promoter classes with high accuracy (area under receiving operating curve (AUC)=0.84 for the out of sample test set; Figure 2—figure supplement 1B), equally well for highl and low variable promoters (per-class test set F1 scores of 0.76 and 0.77, respectively).
To assess which sequence features the CNN had learned to distinguish the classes, we calculated importance scores using DeepLift (Shrikumar et al., 2019) for each nucleotide in the input sequences for predicting low and high promoter variability. This approach relies on backpropagation of the contributions of all neurons in the CNN to the input features, nucleotides, and can therefore be used to identify properties or short stretches of DNA indicative of amplifying or attenuating expression variability. We applied motif discovery on clustered stretches, so called metaclusters, of the input sequences with high importance scores (Shrikumar, 2020) and matched the identified metaclusters to known TF binding motifs (Fornes et al., 2020). This strategy revealed TFs indicative of either high or low promoter variability (Figure 2A–C). Noteworthy, we observed motifs for the ETS superfamily of TFs, including ELK1, ETV6, and ELK3, associated with low variable promoters, and motifs for PTF1A, ASCL2, and FOS-JUN heterodimer (AP-1) among highly variable promoters. These results demonstrate that the promoter sequence and its putative TF binding sites are predictive of the expression variability of a promoter.
Sequence features of promoters are highly predictive of promoter variability
To systematically test how predictive TF binding sites are of the variability of active promoters, we made use of binding sites predicted from motif scanning for 746 TFs (Fornes et al., 2020). TF binding site profiles and low/high CpG content (Figure 2—figure supplement 2A) were collected for each identified promoter and the resulting feature data were used to train a machine learning (random forest) classifier features associated with either high or low variability (low variable N=5054, highly variable N=5683). Feature selection (Kursa and Rudnicki, 2010) identified 124 of the 746 TFs as well as CpG ratio to be important for classification, and a classifier based on these selected features demonstrated high predictive performance (AUC = 0.79; per-class F1 score of 0.73 and 0.68 for low and highly variable promoters, respectively; Figure 2—figure supplement 2B), reinforcing the strong link observed between DNA sequence and promoter variability (Figure 2—figure supplement 1B).
Reverse engineering of the random forest classifier (SHAP, Shapley additive explanations) (Lundberg and Lee, 2017) allowed us to calculate the marginal contribution of each of the 125 selected features to the prediction of variability class for each promoter and whether the feature on average is indicative of amplifying or attenuating variability of expression when present in the promoter sequence (Figure 2D; Figure 2—figure supplement 3A). We identified the presence of high observed/expected CpG ratio and TATA-binding protein (TBP) binding sites (TATA-boxes) to be the strongest predictive features of low and high promoter variability, respectively. Although the remaining TFs contribute only marginally on their own to the predictions compared to TATA-box and high CpG content status (Figure 2D; Figure 2—figure supplement 3A), a baseline model (decision tree) based on CpG ratio and TBP binding site presence alone yielded worse performance than the full model (AUC = 0.71 vs 0.79 for the baseline model and the full model, respectively; Figure 2—figure supplement 2B). This demonstrates that the TF binding grammar contributes to a promoter’s expression variability.
TFs associated with highly variable promoters are mostly related to tissue-specific or developmental regulation (e.g., FOXP2, HOXA10) while TFs predictive of low promoter variability are generally associated with ubiquitous activity across cell types and a diverse range of basic cellular processes (e.g. ELK1, ELF4, ETV3). In addition, TFs predictive of high variability (e.g. ZIC2, ZNF449, HOXA10) tend to have binding sites in relatively few highly variable promoters while TFs predictive of low promoter variability (e.g. ELK1, ELK3) show a propensity for having binding sites present in a large number of promoters (Figure 2E; Figure 2—figure supplement 3B and C). This suggests that variably expressed promoters have diverse TF binding profiles and that the regulatory grammar for promoter stability is less complex.
Although the adjusted dispersion of promoters was separated from their expression level (Figure 1E), we observed that the presence of binding sites for some TFs that are predictive of promoter variability are also associated with promoter expression level (Figure 2—figure supplement 4). Importantly, despite this association, the effect of identified features on promoter variability is still evident across a range of promoter expression levels (Figure 2F and G). This is particularly apparent for CpG islands, which seem to have an attenuating effect on promoter variability regardless of expression level (Figure 2F).
Many of the TFs identified as being predictive of low variability (e.g. ELK1, ELK3, ELF4, ETV2, ETV3) belong to the ETS family of TFs (Figure 2D; Figure 2—figure supplement 3A), a large group of TFs that are conserved across Metazoa and characterized by their shared ETS domain that binds 5′-GGA(A/T)–3′ DNA sequences (Sharrocks, 2001). ETS factors are important regulators of promoter activities in lymphoid cells (Hepkema et al., 2020), but are generally involved in a wide range of crucial cellular processes such as growth, proliferation, apoptosis, and cellular homeostasis (Kar and Gutierrez-Hartmann, 2013; Oikawa and Yamada, 2003; Suico et al., 2017). Furthermore, different ETS factors can bind in a redundant manner to the same promoters of housekeeping genes (Hollenhorst et al., 2011; Hollenhorst et al., 2007). However, the shared DNA-binding domain of ETS factors makes it hard to discern individual factors based on their binding motifs alone (Figure 2A). Although in general linked with higher promoter activity (Curina et al., 2017), ETS binding site presence is associated with lower variability across all expression levels (Figure 2G). In addition, the degree of promoter variability decreases by an increasing number of non-overlapping ETS binding sites (Figure 2—figure supplement 5A), regardless of promoter expression level (Figure 2—figure supplement 5B), suggesting that multiple ETS binding sites can either facilitate cooperativity between ETS factors or provide robustness to stabilize promoter variability across individuals.
Taken together, our results indicate that promoter sequence can influence both low and high promoter variability across human individuals independently from its impact on expression level. Our results further indicate that variable promoters exhibit highly diverse binding grammars for TFs that are associated with relatively few promoters, while a more uniform regulatory grammar is indicated for stable promoters, being highly associated with higher CpG content and ETS binding sites.
Variability in promoter activity reflects plasticity and robustness for distinct biological functions
The high performance of predicting promoter variability from local DNA sequence and the distinct TF binding profiles of low and highly variable promoters imply that attenuation and amplification of variability are driven by distinct regulatory mechanisms. This argues that favoring robustness (low variability) over plasticity (high variability) should reflect the biological processes where this provides regulatory advantages. Supporting this hypothesis, we observed that low variable promoters were highly enriched with basic cellular housekeeping processes, in particular metabolic processes (Figure 3A). In contrast, highly variable promoters were enriched with more dynamic biological functions, including signaling, response to stimulus, and developmental processes.
Interestingly, the same features found to be predictive of low and high promoter variability across individuals, including CpG-content and TATA-boxes (TBP binding sites), are also associated with low and high transcriptional noise across individual cells (Faure et al., 2017; Morgan and Marioni, 2018). The presence of a TATA-box is also associated with high gene expression variability in flies (Sigalova et al., 2020). This suggests that some of the same underlying regulatory mechanisms that dictate low or high transcriptional noise across single cells are maintained and conserved between humans and flies at an individual level and manifested to control low and high expression variability across a population, respectively, as well as housekeeping or restricted activity across cell types.
In agreement, genes of highly variable promoters tend to have higher transcriptional noise than those of low variable ones across GM12878 single cells (Cohen’s d=0.411, p<2.2 × 10–16, two sample t-test; Figure 3—figure supplement 1A; Supplementary file 3). Furthermore, we observed an inverse correlation between variability in promoter activity and the number of cell types (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014) and tissues (Battle et al., 2017) the corresponding gene is expressed in (Spearman’s rank correlation ρ = −0.21 and −0.15 for cell types and tissues, respectively, p< 2.2 × 10–16; Figure 3B; Figure 3—figure supplement 1B), demonstrating that highly variable promoters are more cell-type and tissue specific in their expression.
The restricted expression (Figure 3—figure supplement 1B), biological processes (Figure 3A), and promoter TF grammar (Figure 2D and E) of genes associated with highly variable promoters led us to hypothesize that these are more prone to respond to external stimuli. Tumor necrosis factor (TNFα) induces an acute and time-limited gene response to NFkB signaling (Nelson et al., 2004; Turner et al., 2010), with negligible impact on chromatin topology (Jin et al., 2013), and is therefore suitable to study gene responsiveness. We profiled GM12878 TSSs and promoter activities with CAGE before and after 6 hr treatment with TNFα (Supplementary file 4). This revealed enrichment of up-regulated promoters among highly variable promoters (odds ratio (OR)=1.529, p=4.563 × 10–8, Fisher’s exact test) as posited, while low variable promoters were mostly unaffected or down-regulated (OR = 0.459, p<2.2 × 10–16, Fisher’s exact test; Figure 3C). In addition, low variable promoters had a weaker response (Figure 3D).
Furthermore, we observed drug-target genes and genes with GWAS hits to be regulated by highly variable promoters but essential genes to be regulated by low variable promoters (Figure 3E). In contrast, when we compared promoter expression between these same groups of genes we observed no association with drug-targets or GWAS-associated genes. Although essential genes are associated with higher promoter expression, this association is comparably weaker than that with promoter variability (Figure 3—figure supplement 1C).
Taken together, our results demonstrate the importance of low promoter variability for cell viability and survival in different conditions and reveal the responsiveness of highly variable promoters. They further indicate that the variability observed in promoter activity across individuals is strongly associated with the regulation of its associated gene, the expression breadth across cell types, and to some extent also the transcriptional noise across single cells, which reflects a selective trade-off between high stability and high responsiveness and specificity.
Promoters with low variability have flexible transcription initiation architectures
Promoters are associated with different levels of spread of their TSS locations, which has led to their classification into broad or narrow (sharp) promoters according to their positional width (Akalin et al., 2009; Carninci et al., 2006; Lehner, 2008). Although the shape and distinct biological mechanisms of these promoter classes, for example, housekeeping activities of broad promoters, are conserved across species (Carninci et al., 2006; Hoskins et al., 2011), the necessity for positional dispersion of TSSs and its association with promoter variability are poorly understood.
Analysis of promoter widths revealed only a weak relationship with promoter variability. We observed an enrichment of highly variable promoters within narrow promoters having an interquartile range (IQR) of their CAGE-inferred TSSs within a width of 1–5 bp (p<2.2 × 10–16, OR = 2.04, Fisher’s exact test). Low variable promoters, on the other hand, were enriched among those of size 10–25 bp (p<2.2 × 10–16, OR = 1.44, Fisher’s exact test), but beyond this width the association is lost (Figure 4—figure supplement 1A). To simultaneously capture the spread of TSSs and their relative frequencies compared to total RNA expression within a promoter, we considered a width-normalized Shannon entropy as a measure of TSS positional dispersion (Hoskins et al., 2011). This measure will discern promoters whose relative TSS expression is concentrated to a small subset of their widths (low entropy) from those with a more even spread (high entropy). We observed that low variable promoters are associated with a higher entropy than promoters with high variability (Figure 4A). Consistently, low variable promoters tend to have more TSSs substantially contributing to their overall expression across individuals (Figure 4—figure supplement 1B). We reasoned that a weaker association between low promoter variability and broad width than with high entropy may be due to low variable promoters being composed of multiple clusters of TSSs (multi-modal peaks) from independent core promoters. Indeed, decomposition of multi-modal peaks within the CAGE TSS signals of promoters (Supplementary file 5) demonstrated that higher entropy reflects an increased number of decomposed promoters, as indicated by their number of local maxima of CAGE signals (Figure 4B).
The decomposed promoters of gene UFSP2 (Figure 4C and E) clearly illustrate that the activity of sub-clusters of TSSs within promoters and their contributions to the overall activity of the encompassing promoter can vary to a great extent between individuals. In contrast, the decomposed promoters of gene RIT1 (Figure 4D and F) contribute equally to the overall activity of the encompassing promoter across individuals. To assess in general how individual decomposed promoters influence the overall promoter variability, we calculated the expression-adjusted dispersion (adjusted log10-transformed CV2) of local-maxima decomposed promoters. Interestingly, many of the decomposed promoters showed a vastly different variability across individuals compared to the promoters they originate from (Figure 4—figure supplement 1C). This disagreement indicates that decomposed promoters within the same promoter reflect core promoters that may either operate and be regulated independently of each other or differ in their ability to compete for the transcriptional machinery, both of which may contribute to the overall robustness or plasticity of the promoter and, in turn, the gene. As highly multimodal peaks are mainly found to be associated with low variable promoters, we hypothesized that this flexibility in core promoter usage may act as a compensatory mechanism to stabilize their expression.
If the effects of significant changes in expression across the panel are masked by compensatory changes in decomposed promoter usage within the same promoter, this would be revealed by low or even negative expression correlation between decomposed promoters (e.g., decomposed promoters 1 and 2 of UFSP2, Figure 4C and E). Indeed, we observed a strong association between promoter variability and the minimal expression correlation between decomposed promoter pairs within a promoter (Figure 4G). Low variable multi-modal promoters are associated with weakly or even negatively correlated pairs of decomposed promoters. In contrast, highly variable multi-modal promoters are associated with moderately or highly correlated pairs of decomposed promoters. The weak expression correlation between decomposed promoters of low variable promoters demonstrates that decomposed promoters may operate independently of each other, while negatively correlated pairs indicate a competition for the transcriptional machinery or a compensatory shift between decomposed promoters. The association between decomposed promoter correlation and overall promoter stability was maintained when all decomposed promoter pairs were considered (Figure 4—figure supplement 1D), and could not be explained by CpG island status (Figure 4—figure supplement 1E). However, decomposed promoter expression correlation was associated with promoter width (Figure 4—figure supplement 1F), demonstrating that complex low variable promoters with multiple decomposed promoters require larger promoter width, while broad promoter width does not necessarily lead to lower promoter variability.
The spread and dominant position of TSSs in broad promoters are tightly linked to immediate downstream (+1) nucleosome positioning, and changes in +1 nucleosome positioning can alter the preferred TSS (Dreos et al., 2016; FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014; Haberle et al., 2014). Hence, the variability and multi-modal TSS patterns of promoters could be related to their nucleosomal architectures. Indeed, comparison of +1 nucleosome positioning relative to the dominant TSS position of promoters across the panel revealed that low variable promoters tend to have stronger +1 nucleosome positioning (Figure 4—figure supplement 2A and B). However, when analyzing specifically multi-modal low variable promoters (containing at least two decomposed promoters), a strong +1 nucleosome was only observed for promoters with highly correlated decomposed promoters (Figure 4—figure supplement 2C and D). Furthermore, highly variable multi-modal promoters and those containing low correlated pairs of decomposed promoters exhibited more fuzzy +1 nucleosome positioning across cells (Figure 4—figure supplement 2E and F). Our results thus demonstrate that low variable promoters with flexible TSS usage, that is, having weakly or negatively correlated decomposed promoters, are characterized by less restrictive and more fuzzy +1 nucleosome positioning.
Taken together, our results demonstrate that flexible usage of core promoters within promoters with permissive nucleosomal architectures provide stability to the overall expression of a large subset of gene promoters with low variability.
Alternative TSSs of low variability promoters indicate a novel mechanism of mutational robustness
While genetic variants associated with gene expression levels (expression quantitative trait loci, eQTLs) frequently occur within gene promoters, they are rarely found associated with housekeeping or ubiquitously expressed genes, and when they are, they have limited effect sizes (Battle et al., 2017). One explanation for this observation is that mutations that would significantly alter the expression of such genes would be detrimental to cell viability. In addition, the rare and limited effects of eQTLs on housekeeping genes might be due to mechanisms promoting mutational robustness. Our results (Figure 4) indicate that a flexible TSS architecture within a promoter may provide such a mechanism and thereby mask the effects genetic variants may have on individual decomposed promoters.
To test if flexibility in core promoter usage within a promoter may cause mutational robustness, we first performed local eQTL analysis on promoters (within 25 kb). We tested both the association between the genotypes of common genetic variants (MAF ≥10%) and the expression of promoters (promoter eQTL, prQTLs; Figure 5A, top) as well as those of decomposed promoters (decomposed promoter eQTL, dprQTL). 2,457 promoters were associated with at least one prQTLs (5% FDR; Supplementary file 6). While prQTLs were observed across all levels of promoter variability, they were more commonly associated with highly variable promoters (Figure 5B). Fewer prQTL single nucleotide polymorphisms (SNPs) and, in general, common variants were found proximal to low variable promoters, indicating a negative selection for these. As expected, the effect size for the most significant prQTL variant (lead SNP) for each promoter was positively associated with the expression variability of the promoter (Spearman’s rank correlation ρ = 0.16, p< 2.2 × 10–16, Figure 5—figure supplement 1A). This indicates that, in addition to having fewer proximal genetic variants, low variable promoters are less likely to have prQTLs with large regulatory effects. However, QTLs of decomposed promoter expression (dprQTLs) exhibited similar prevalence (Figure 5—figure supplement 1C) and maximum effect sizes (Figure 5—figure supplement 1D) across promoter variability classes. This suggests that a flexible TSS architecture can limit the impact of genetic variants on promoter expression.
To further investigate the observed disagreement between the variability of promoters and their decomposed promoters, we tested the association between the genotypes of common genetic variants and the contribution of decomposed promoters to the overall expression of their encompassing (non-decomposed) promoters (fraction eQTL, frQTL; Figure 5A, bottom). We identified 1,230 promoters to be associated with at least one frQTL (5% FDR; Supplementary file 7). Unlike the prQTLs and dprQTLs, the frQTLs were more commonly associated with decomposed promoters from low variable promoters (Figure 5C). Conceptually, the frQTLs can affect decomposed promoter usage and overall promoter expression levels to different degrees, as exemplified by the promoters of genes RGS14 and GGNBP2 (Figure 5D and E). Gene RGS14 has three decomposed promoters localized within its promoter (Figure 5D), for which SNP rs56235845 (chr5:177371039 T/G) was strongly associated with the contribution to the overall promoter activity for only decomposed promoters 1 and 2 (frQTL beta = 0.210, –0.181, –0.062; FDR = 2.42 × 10–5, 2.54×10–8, 2.64×10–2, for decomposed promoters 1, 2, and 3, respectively). Despite the limited association of the variant with decomposed promoter 3, it still had a noticeable association with the overall promoter activity (prQTL beta = −2.47, FDR = 3.57 × 10–5; Figure 5D, bottom right). In contrast, SNP rs9906189 (chr17:36549567 G/A) was strongly associated with the contribution to the overall promoter activity for both decomposed promoters of gene GGNBP2 (frQTL beta = 0.222,–0.222; FDR = 2.05 × 10–26, 2.05×10–25, for decomposed promoters 1 and 2, respectively), but in opposite directions (Figure 5E). Interestingly, this switch in decomposed promoter usage translates into a limited effect on the overall promoter activity (prQTL beta = −0.063, FDR = 0.989; Figure 5E, bottom right).
Both examples, a partial shift (Figure 5D) and a switch (Figure 5E) in decomposed promoter usage, are indicative of plasticity in TSS usage, which can secure tolerable levels of steady-state mRNA. Although frQTLs were associated with promoters across the wide spectrum of promoter variabilities (Figure 5C), they showed a large difference in their relative effect on the overall promoter activity (maximal relative change in expression between reference and variant alleles; Figure 5F). frQTLs associated with highly variable promoters tend to have a larger relative effect on the overall promoter activity compared to frQTLs associated with low variable promoters. This association is further maintained at the gene level (adjusted RNA-seq Lappalainen et al., 2013 CV2; Figure 5—figure supplement 1B; Supplementary file 8), demonstrating that individual differences in decomposed promoter usage contribute to low promoter variability and, in turn, low gene variability. In total, we found 286 promoters (out of 1230) of 284 genes to be associated with stabilizing frQTLs, for which the same SNP was associated with at least two decomposed promoters (5% FDR) with relative effects in opposite directions (Supplementary file 9). Our results thus indicate that TSS usage flexibility confers mutational robustness that stabilizes the variability of promoters and their associated genes.
Taken together, integrating prQTLs, dprQTLs, and frQTLs provides novel insights into how common genetic variants can influence TSS usage in humans and its potential impact on gene expression. We demonstrate that low variable promoters characterized by multiple decomposed promoters (multi-modal TSS usage) are less affected by the presence of genetic variants compared to highly variable promoters. In addition, we find that part of this tolerance can be explained by a, previously unreported, mechanism of mutational robustness through plasticity in TSS usage. The prevalence of expression-independent decomposed promoters within low variable promoters, as suggested by low pairwise correlation, indicates an extensive regulatory role of TSS plasticity in attenuating expression variability of essential genes.
Discussion
In this study, we provide an extensive characterization of promoter-associated features influencing variability in promoter activity across human individuals and demonstrate their importance for determining stability, responsiveness, and specificity. Overall, we show that the local DNA sequence, putative TF binding sites, and transcription initiation architecture of promoters are highly predictive of promoter variability.
Although the classifier based on TF binding site sequence and CpG island status was able to predict promoter variability well (AUC = 0.79 on the test set), it did not perform as well as the CNN model (AUC = 0.84 on the test set), which was trained on DNA sequence alone. This indicates that additional information influencing variability may be encoded within the promoter sequence. For instance, the density and variations of Initiator elements within promoters could influence TSS flexibility (Carninci et al., 2006; Frith et al., 2007; Haberle et al., 2014; Nepal et al., 2020). In addition, di- or tri-nucleotide sequence patterns and stretches of high AT-richness, which influence local nucleosome positioning (Dreos et al., 2016; Haberle et al., 2014; Segal et al., 2006), impose different requirements for chromatin remodeling activities (Lorch et al., 2014) at gene promoters of low and high variability, which in turn may affect their variability and responsiveness. The promoter sequence may also encode a promoter’s intrinsic enhancer responsiveness (Arnold et al., 2017), which may influence its expression variability. Although current data cannot distinguish between direct or secondary effects, an increased variability mediated via enhancers is supported by a higher dependency on enhancer-promoter interactions for cell-type-specific genes compared to housekeeping genes (Furlong and Levine, 2018; Schoenfelder and Fraser, 2019). However, compatibility differences between human promoter classes and enhancers only result in subtle effects in vitro (Bergman et al., 2022), suggesting that measurable promoter variability is likely a result of both intrinsic promoter variability and additive or synergistic contributions from enhancers. Directly modeling the influence and context-dependency of enhancers on promoter variability would therefore be important to further characterize regulatory features that may amplify gene expression variability.
Despite a clear association with high promoter CpG content and housekeeping genes, low variable promoters were not strongly associated with a broader width, which we would expect from promoters in CpG islands and with housekeeping activity (Carninci et al., 2006). Rather, our results suggest that low variability requires a certain minimum promoter width that can encompass a transcription initiation architecture competent of attenuating variability through flexible TSS usage. Switching between proximal clusters of TSSs (decomposed promoters) within a larger promoter is fundamentally different from that between alternative promoters (Garieri et al., 2017; Valen et al., 2008; Zhang et al., 2017), which will more likely lead to differences in transcript and protein isoforms. Rather, a flexible initiation architecture enables several points of entries for RNA polymerase II to initiate in the same promoter. This ensures proper gene expression across different cell types (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014; Kawaji et al., 2006) and developmental stages (Haberle et al., 2014). Interestingly, ETS factors, here associated with low variable promoters, can interact with transcriptional co-activators and chromatin modifying complexes (Curina et al., 2017; Göös et al., 2022). ETS factors may therefore play a role in TSS selection in promoters with multi-modal architectures (Lam et al., 2019). Here, we show that such flexibility is also associated with low variability across individuals for the same cell type. Our findings further indicate that plasticity in TSS usage within a promoter confers a, previously unreported, layer of mutational robustness that can buffer the effects of genetic variants, leading to limited or no impact on the overall promoter expression. Of note, the presence of weak or negatively correlated expression patterns between decomposed promoters for a large number of promoters suggests that such buffering events will be revealed for more genes with an increased sample size.
Flexibility in TSS usage may ensure transcriptional robustness of genes both in different environments and in the face of genetic variation. Since promoter shape is generally conserved across orthologous promoters (Carninci et al., 2006; Hoskins et al., 2011), it is plausible that robustness through flexible TSS usage is a conserved mechanism. In support, genetic variants may affect promoter shape for ubiquitously expressed genes in flies with limited effect on promoter expression (Schor et al., 2017). Changes in promoter shape in flies thus likely recapitulates the plasticity in TSS usage across human LCLs, despite apparent differences in core promoter elements, promoter nucleotide content, and regulatory features between flies and humans.
Notably, many of the promoter features we, and others (Sigalova et al., 2020), have identified to be predictive of promoter variability, including the presence or absence of CpG islands and TATA boxes, have previously been linked with different levels of transcriptional noise as inferred from single-cell experiments (Faure et al., 2017; Morgan and Marioni, 2018). This suggests that variability in promoter activity across individuals partly reflects the stochasticity in gene expression across cells. Given that the underlying sources of variation are different, for example, genetic and environmental versus stochastic, this indicates that mechanisms that contribute to the buffering of stochastic noise at a single cell level can also contribute to the attenuation of genetic and environmental variation at an individual level.
We observed less restrictive +1 nucleosome positioning across individuals at low variable promoters with flexible TSS usage. Notably, these promoters are also associated with a fuzzy +1 nucleosome positioning across cells within an individual. This indicates that positional shifts of nucleosomes in coordination with shifts in core promoter usage may be due to an inherent property of these promoters in addition to the influence by genetic variation. Our observations indicate that multiple configurations of accessible chromatin may exist for low variable promoters across single cells, which may cause stochastic TSS selection with no or only low impact on expression level. This is compatible with a high density of pyrimidine/purine (YR) dinucleotides within broad promoters (Carninci et al., 2006; Frith et al., 2007), which provide a flexibility for transcription initiation sites in the absence of strong positional signals like the TATA box (Carninci et al., 2006; FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014; Müller and Tora, 2014). Genetic variants biasing such TSS selection and the preference for any open chromatin configuration may therefore cause observable shifts in TSS usage between individuals, enabled by the flexible nucleosome and TSS architecture of the promoter.
It is important to note that the regulatory programs of EBV-immortalized LCLs, like other cultured cells, have been shown to be susceptible to genotype-independent sources of variation, such as primary cellular heterogeneity and EBV’s viral cellular reprogramming (Choy et al., 2008; Ozgyin et al., 2019). While we cannot exclude that such sources may influence the measured variability of some genes, we speculate that non-genetic variation would rather dampen than elevate the associations and effect sizes we report here, in particular the decomposed promoter switches highlighted by our genotype association analysis.
Taken together, our results favor a model in which the regulation of transcriptional noise across single cells reflects specificity across cell types and dispersion across individuals with shared mechanisms conferring stochastic, genetic and environmental robustness (Figure 6). There are several implications of this model. First, the link between low transcriptional noise and low individual variability of promoters and their associations with ubiquitous and essential genes indicates that regulatory mechanisms that ensure broad expression across cell types may enforce low variability across individuals and single cells. Second, our results indicate that encoding responsiveness or developmentally restricted expression patterns of gene promoters may require high stochasticity in expression across single cells, which in turn may disallow ubiquitous expression across cell types. Thus, it is likely that increased variability is not just reflecting the absence of regulatory mechanisms that attenuate variability but the presence of those that amplify it. Finally, given that mutational robustness through flexible TSS usage is mostly associated with low variable genes, this implies that cell-type restricted, responsive and developmental genes may be more susceptible to trait-associated genetic variants, which finds support in the literature (Finucane et al., 2015; Roadmap Epigenomics Consortium et al., 2015; Timshel et al., 2020).
Materials and methods
LCL cell culturing
Request a detailed protocolEpstein-Barr virus immortalized LCLs (Supplementary file 1) were obtained from the NIGMS Human Genetic Cell Repository at Coriell Institute for Medical Research. Cells were incubated at 37°C at 5% carbon dioxide in the Roswell Park Memorial Institute (RPMI) Medium 1640 supplemented with 2mM L-glutamine and 20% of non-inactivated fetal bovine serum and antibiotics. Cell cultures were split every few days for maintenance. All 108 LCLs were grown unsynchronized for 5–7 passages and harvested when they reached>20million cells. The Coriell Institute for Medical Research frequently screens cells for mycoplasma contamination and verifies cell line identity. All cell culturing was done within a year of purchase. To confirm that samples were free of mycoplasma contamination, CAGE reads were aligned to 4 mycoplasma reference genomes, including Mycoplasma hominis, which showed no indication of contamination for any sample (Olarerin-George and Hogenesch, 2015).
CAGE library preparation, sequencing, and mapping
Request a detailed protocolCAGE libraries were prepared in 10 batches in total as described elsewhere (Andersson et al., 2014b; Takahashi et al., 2012) from 1500 ng total RNA from each LCL. 23 libraries (Supplementary file 1) underwent a second round of size selection (Invitrogen E-Gel) to remove excessive primer dimers. The libraries were quality checked using an Agilent 2100 Bioanalyzer system with a RNA pico chips kit (Agilent) and quantified using DNA 1000 chips kit (Agilent). Pooled libraries (Supplementary file 1) were sequenced with spiked-in PhiX on an Illumina HiSeq 2500 machine single-end for 50 cycles using v4 sequencing chemistry (Illumina Inc) and a custom sequencing primer (Takahashi et al., 2012). Libraries were split by barcode and reads were trimmed to remove linker sequences and filtered for a minimum sequence quality of Q30 in 50% of base pairs using the FASTX-Toolkit. rRNA reads matching subsequences of the human ribosomal DNA complete repeating unit (U13369.1) were removed using rRNAdust (version 1.06) (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014). Mapping to the human reference genome (hg38) was performed using BWA (version 0.7.15-r1140) allowing a maximum edit distance of 2. To reduce mapping bias, reads were re-mapped using the WASP pipeline (van de Geijn et al., 2015) and BWA, taking into account biallelic SNVs (Lowy-Gallego et al., 2019). Reads with a mapping quality of 20 were retained for further analyses. Sample-related information, including CAGE run batch ID and E-gel information, are provided in Supplementary file 1.
CAGE tag clustering, filtering and quantification
Request a detailed protocolCAGE-defined transcription start sites (CTSSs) were identified from 5’ ends of mapped CAGE reads for each strand separately. The expression of CTSSs for each LCL was quantified from the number of CAGE reads sharing 5’ ends using CAGEfightR (version 1.10) (Thodberg et al., 2019). To identify broad promoters that could potentially encompass multiple alternative core promoters (decomposed promoters), we performed lenient positional clustering of CTSSs (tag clustering) (Carninci et al., 2006), grouping CTSSs on the same strand within 60 bp of each other. To exclude rare promoters within the panel, tag clustering was performed on CTSSs with at least 1 CAGE read in at least 5 LCLs. The expression of each tag cluster (CAGE-inferred promoter) in each individual LCL was quantified by aggregating the expression of all CTSSs falling within the defined tag cluster region. To allow capture of flexible TSS usage within promoters across the panel, no support filtering was performed at CTSS level for expression quantification. Expression levels were converted to tags per million (TPM), by normalizing the expression count of each tag cluster in each library as a fraction of its number of mapped CAGE reads, scaled by 106. Tag clusters were filtered to be proximal to GENCODE-annotated TSSs (hg38, version 29, within 1000 bp upstream) and to have at least 10 read counts in more than 10 LCLs. The resulting 29,001 gene-associated CAGE-inferred promoters were later decomposed by local maxima decomposition to split multi-modal tag clusters (https://github.com/anderssonlab/CAGEfightR_extensions, version 0.1.1; Andersson, 2021). First, for each CAGE-inferred promoter, local maxima of within-promoter CTSSs with the highest pooled expression separated by at least 20 bp were identified. Second, decomposition was performed for each local maxima separately in decreasing order of pooled expression level. For each local maxima, the fraction between the pooled expression of each CTSS to that of the local maxima was calculated. All CTSSs associated with at least 10% of the local maxima signal that were not gapped by more than 10 bp with CTSS expression less than this value were retained in a new decomposed promoter. For smoothing purposes, neighboring non-zero CTSSs within 1 bp distance of CTSSs fulfilling the fraction criterion were also included. Subsequently, decomposed promoters were merged if positioned within 1 bp from each other.
Geuvadis YRI RNA-seq data analysis
Request a detailed protocolGene expression data quantified in the recount2 project (Collado-Torres et al., 2017) using Geuvadis YRI RNA-seq data (Lappalainen et al., 2013) was downloaded using the recount R package. Only genes with more than 1 transcript per million in at least 10% of YRI samples were considered for expression variability calculation.
GM12878 scRNA-seq data analysis
Request a detailed protocolGM12878 10 X Genomics scRNA-seq data (Osorio et al., 2019) was downloaded from Gene Expression Omnibus (GSE126321) and processed using Seurat (version 4.0.3) (Hao et al., 2021). Cells with a proportion of mitochondrial reads lower than 10% and a sequencing depth deviating less than 2.5 times the standard deviation from the average sequencing depth across cells were considered. The expression of genes with read counts observed in at least 10 cells were normalized using scran (version 1.18.7) (Lun et al., 2016) and retained for expression variability calculation.
Measuring expression variability across individuals
Request a detailed protocolThe raw dispersion of each CAGE tag cluster was calculated using the squared coefficient of variation (CV2) of TPM-normalized promoter (or decomposed promoter) expression across the LCL panel and subsequently log10-transformed. Adjustment of the mean expression-dispersion relationship was performed by subtracting the expected log10-transformed dispersion for each promoter according to its expression level, using a running median (width 50, step size 25) of raw dispersions (log10 CV2) ordered by mean expression level (TPM) across the panel, as described elsewhere (Kolodziejczyk et al., 2015; Newman et al., 2006). The same strategy was used to calculate the adjusted dispersion of gene expression from RNA-seq and scRNA-seq data. Promoters were grouped by variability according to the quartiles of expression-adjusted dispersions (≤Q1, (Q1, Q2], (Q2, Q3],>Q3).
Neural network model, training and hyperparameter tuning
Request a detailed protocolA simple neural network architecture was designed to learn to predict low and high variability from DNA sequence. The neural network model uses as input one-hot-encoded DNA sequences (A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], T = [0,0,0,1]) from the human reference genome (hg38) to predict low and highly variable promoter activity as output. Although CAGE-inferred promoters varied in width, we made use of fixed-length 600bp sequences for each promoter centered on its pooled CAGE summit CTSS. 600bp was used to make sure that sequences influencing promoter variability contained within regions that could cover a central open chromatin site (150–300bp) as well as within flanking nucleosomal DNA (150–200bp) were captured, where also most of the expression output of a promoter originate (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014).
The model (Figure 2—figure supplement 1A) consists of one convolutional layer with 128 hidden units and a kernel size of 10, followed by global average pooling and two dense layers with 128 and 2 nodes, respectively. Batch normalization and dropout (0.1) were applied after each layer. The ReLU activation function (Agarap, 2019) was used in all layers except the final layer, in which a sigmoid activation function was used to predict the variability class (low or high adjusted dispersion).
Promoter sequences from chromosomes 2 and 3 were used as the test set and those from the remaining chromosomes were used for training and hyperparameter tuning with a fivefold cross-validation. Hyperparameters were manually adjusted to yield the best performance on the validation set. The neural network model was implemented and trained in Keras (version 2.3.0, https://github.com/keras-team/kera Chollet, 2022) with TensorFlow backend (version 1.14) (Abadi et al., 2016) using the Adam optimizer (Kingma and Ba, 2017) with a learning rate of 0.0001, batch size of 64, and early stopping with the patience of 15 epochs.
We initially used the first and third quartiles (Q1 and Q3) to distinguish low variable promoters (≤Q1) from highly variable promoters (>Q3), corresponding to an adjusted log10-transformed CV2 of –0.1490 and 0.1922, respectively. To reduce false positives, we slightly adjusted the thresholds for low and highly variable promoters to –0.20 and 0.25, respectively. The final training and test sets for the neural network model together consisted of 5054 low variable and 5683 highly variable promoters. To ensure consistency, the same thresholds were used for training and testing with Random Forest and decision tree classifiers (see below).
Motif discovery using DeepLIFT and TF–MoDISco
Request a detailed protocolTo interpret the neural network model we used DeepLIFT (Shrikumar et al., 2019), a feature attribution method, to compute importance scores for each nucleotide in the 600 bp input sequences for low and highly variable promoters. DeepLIFT relies on backpropagation of the contributions of all neurons in the neural network to the input features, nucleotides, and was used to estimate the importance of each position and nucleotide in the input sequences to predict high and low variability. The resulting importance scores were supplied to TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) (Shrikumar, 2020) to identify DNA stretches (seqlets) with high importance for the predictions. DeepLIFT and TF-MoDISco were run independently on the input sequences for low variable and highly variable promoters. TF-MoDISco identified 18,035 seqlets for low variable promoters and 21,942 seqlets for highly variable promoters by using the importance scores from DeepLIFT over a width of 15 bp with a flank size of 5 bp and a FDR threshold of 0.05. The seqlets identified were merged in 41 and 47 metaclusters for low and highly variable promoters, respectively.
We used Tomtom (MEME package 5.1.1; Gupta et al., 2007) to match the resulting metaclusters to known TF motifs (in MEME format) from the JASPAR database (release 2020, hg38; Fornes et al., 2020). We compared each non-redundant JASPAR vertebrate frequency matrix with the metaclusters using Tomtom based on the Sandelin and Wasserman distance (Sandelin and Wasserman, 2004). Matches were considered those with a minimum overlap between query and target of 5 nucleotides and a p-value <0.05.
Random forest, Boruta and SHAP analysis
Request a detailed protocolTo identify broad-scale trends of high CpG content, we calculated CpG observed/expected ratio in windows +/-500bp around the pooled summit CTSS of each promoter. Calculated CpG ratios revealed a bimodal distribution that informed on thresholding high CpG content promoters as those with CpG observed/expected ratio>0.5 (Figure 2—figure supplement 2A).
Predicted transcription factor binding sites for 746 TFs with scores of 500 or greater (P<10–5) (hg38) were obtained from JASPAR (release 2020, hg38) (Fornes et al., 2020) and for each TF, presence/absence was obtained by overlapping predicted TF binding sites with promoters considered in the modeling. Together, the CpG content status and the presence/absence of predicted TF binding sites were used as features for predicting high and low variability using Random Forest (Pedregosa et al., 2011).
Similarly to the neural network model, promoters from chromosomes 2 and 3 were only used as the test set. The remaining promoters were used for training and hyperparameter tuning with fivefold cross-validation. The Random Forest model was implemented and trained in Scikit-learn (version 2.3.0) with 500 trees, a maximum depth of trees of 10, 50samples split per node, and 50samples to be at a leaf node. The remaining hyperparameters were kept with default values.
Instead of selecting features directly from the Random Forest model, the BorutaShap package (Keany, 2020) was used for feature selection. The main advantage of using the Boruta approach is that the features compete with their randomized version (or shadow feature) and not with themselves. Therefore, a feature is considered relevant only if its score is higher than the best randomized feature. In this way, from the 746 original TF features, only 125 features were kept. The features were selected using only promoters from the training set. Finally, the SHAP library (Lundberg and Lee, 2017) was used to explain the importance of the 125 selected features for the two promoter classes. SHAP calculates Shapley values, a game theoretic approach for optimal credit allocation during cooperation, which can be used to estimate the marginal contribution of each feature to a model’s predictions.
Decision tree baseline model
Request a detailed protocolTo evaluate the contribution of TF binding site presence for predicting promoter variability, we trained a baseline model based on CpG content status and TATA-box presence only. CpG content status (CpG observed/expected ratio>0.5) and the presence/absence of predicted TBP binding sites were used as features for predicting high and low variability using a decision tree classifier. The decision tree model was implemented and trained in Scikit-learn (version 2.3.0) using default parameters. Training and test data were defined as for the CNN and Random Forest models.
Tissue-, cell-type specificity and gene annotations
Request a detailed protocolRNA-seq gene expression values across tissues were obtained from the GTEx consortium (Battle et al., 2017). Promoters were considered expressed in tissues in which their corresponding gene had≥5RPKM average expression across donors.
CAGE gene expression values across cell types were obtained from the FANTOM5 project (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al., 2014). The average normalized (tags per million, TPM) expression per gene was calculated across samples associated with the same cell type facet (grouping of CAGE libraries according to Cell Ontology annotation of samples), according to Andersson et al., 2014a, and a gene was considered expressed in a cell type facet if the average expression was≥5TPM.
Gene lists for FDA approved drug-targets (Wishart et al., 2018), essential genes (Hart et al., 2017) and GWAS hits (MacArthur et al., 2017) were downloaded from the MacArthur Lab Repository (https://github.com/macarthur-lab/gene_lists; MacArthur, 2019).
GM12878 cell culturing, TNF-α stimulation and differential expression analysis
Request a detailed protocolGM12878 cells were obtained from the NHGRI Sample Repository for Human Genetic Research at Coriell Institute for Medical Research. Unstimulated GM12878 cells and those stimulated with 25 ng/ml TNF-α for 6 hr were harvested with four replicates for each condition. Cell culturing, CAGE library preparation and mapping were done as described above for the LCL panel. CAGE reads supporting each of the final filtered promoters identified in the LCL panel were counted for each replicate using CAGEfightR (version 1.10; Thodberg et al., 2019). Differential expression analysis of the aggregated CTSS counts was performed using standard library size adjustment and a generalized linear model with DESeq2 (version 1.30.1; Love et al., 2014). Promoters with FDR-adjusted p-value ≤0.05 were considered to be differentially expressed.
Correlation analysis of decomposed promoter expression
Request a detailed protocolTo test if decomposed promoters could act independently of each other, we calculated Pearson correlation coefficients of LCL expression between pairs of decomposed promoters originating from the same promoter. We focused on promoters with at least 2 decomposed promoters significantly contributing to the overall expression of the promoter. Specifically, we considered decomposed promoters whose expression accounted for at least 5% of the overall promoter expression in at least half of all LCLs, resulting in 37,663 decomposed promoters of 14,889 promoters. To avoid potential bias introduced from a variable number of decomposed promoters per promoter, we considered the lowest correlation across decomposed promoter pairs within a promoter.
Nucleosome positioning analysis
Request a detailed protocolMicrococcal nuclease-digested nucleosome sequencing (MNase-seq) data from 7 EBV-immortalized LCLs (GM18507, GM18508, GM18516, GM18522, GM19193, GM19238, GM19239) were obtained from GEO (GSE36979) (Gaffney et al., 2012).
The locations and fuzziness scores of nucleosomes were called with DANPOS2 (version 2.2.2) (Chen et al., 2013) using the dpos command on each LCL separately.+1 nucleosomes were defined as the closest downstream nucleosome of the dominant TSS position (derived from pooled CAGE data) of each CAGE-inferred promoter.
Positional cross correlations were calculated between CAGE TSSs and 5’ ends of MNase-seq reads on the same strand at dominant TSS positions of CAGE-inferred promoters ±500bp (maximum lag 250) to identify their most likely separation. Cross-correlation analysis was performed on either pooled CAGE data (across all 108 LCLs) versus pooled MNase-seq data (across all 7 LCLs) or using only CAGE and MNase data from one LCL (GM18516). Finally, for both analyses, a weight for each promoter was calculated from the geometric mean of aggregated MNase and CAGE signals. This was used to calculate a weighted average of the cross correlations across considered promoters.
Mapping QTLs
Request a detailed protocolWe tested both the association between the genotypes of common genetic variants (MAF ≥10%) and the expression of promoters (promoter eQTL, prQTLs) and decomposed promoters (decomposed promoter eQTL, dprQTLs), as well as their association with the contribution of to the overall expression of the encompassing (non-decomposed) promoter (fraction eQTL, frQTL). prQTLs, dprQTLs, and frQTLs were mapped using the MatrixEQTL R package (version 2.3) (Shabalin, 2012). We controlled for genetic population stratification and library preparation batches (Supplementary file 1) by including these as covariates. In addition, we included the first 5 principal components derived from normalized promoter expression values (TPM) as covariates for prQTLs.
For prQTL detection, all 29,001 promoters were tested using TPM-normalized expression values. For frQTLs, we calculated the fractional contribution of each decomposed promoter to the expression of its original promoter. To focus the dprQTL and frQTL analyses on relevant shifts in TSS usage, we considered only decomposed promoters whose expression accounted for at least 5% of the overall promoter expression in at least half of all LCLs and promoters with at least 2 such decomposed promoters, resulting in 37,663 decomposed promoters of 14,889 promoters.
For each promoter, we tested common (minor allele frequency ≥10%) biallelic SNVs (Lowy-Gallego et al., 2019) at a maximum distance of 25 kb from the CTSS with maximum pooled CAGE signal within each promoter for association with changes in promoter expression levels or decomposed promoter contribution. Resulting p-values were FDR-adjusted according to the total number of promoters or decomposed promoters tested genome-wide within the MatrixEQTL R package. prQTLs, dprQTLs, and frQTLs with FDR ≤5% were retained. A promoter was associated with a dprQTL or frQTL if at least one of its decomposed promoters was associated with a dprQTL or frQTL at FDR < 5%.
Appendix 1
Data availability
CAGE data were deposited into the Gene Expression Omnibus (GEO) database under accession number GSE188131 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE188131). GM12878 scRNA-seq data were retrieved from GEO (accession number GSE126321). Processed Geuvadis RNA-seq gene expression data were retrieved from recount2 (Collado-Torres et al., 2017) (accession number ERP001942). Processed GTEx RNA-seq gene expression data were retrieved from the GTEx portal (https://www.gtexportal.org/home/datasets, version-8). MNase-seq data were retreived from GEO (accession number GSE36979). Predicted transcription factor binding sites for 746 TFs were obtained from Fornes et al., 2020 (http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/hg38/). Code for data analysis performed in this study is publicly available on GitHub: https://github.com/anderssonlab/Einarsson_et_al_2022/, (copy archived at swh:1:rev:de2e6b8a35c16687c8d55630e65c78489a629c99; Andersson, 2022).
-
NCBI Gene Expression OmnibusID GSE188131. Promoter sequence and architecture determine expression variability and confer robustness of genetic variation.
-
NCBI Gene Expression OmnibusID GSE126321. Single-cell RNA sequencing of lymphoblastoid cell lines of European and African ancestries.
-
dbGaPID phs000424.v8.p2. GTEx Analysis V8.
-
NCBI Gene Expression OmnibusID GSE36979. Genome-wide maps of nucleosome occupancy in human lymphoblastoid cell lines.
References
-
SoftwareCAGEfightR_extensions, version swh:1:rev:2a9d5d91431fd88df2c06ab89fa6acc1a2395ddeSoftware Heritage.
-
SoftwareAnderssonlab, version swh:1:rev:de2e6b8a35c16687c8d55630e65c78489a629c99Software Heritage.
-
Genome-Wide analysis of mammalian promoter architecture and evolutionNature Genetics 38:626–635.https://doi.org/10.1038/ng1789
-
Reproducible RNA-seq analysis using recount2Nature Biotechnology 35:319–321.https://doi.org/10.1038/nbt.3838
-
Influence of rotational nucleosome positioning on transcription start site selection in animal promotersPLOS Computational Biology 12:e1005144.https://doi.org/10.1371/journal.pcbi.1005144
-
JASPAR 2020: update of the open-access database of transcription factor binding profilesNucleic Acids Research 48:D87–D92.https://doi.org/10.1093/nar/gkz1001
-
GENCODE reference annotation for the human and mouse genomesNucleic Acids Research 47:D766–D773.https://doi.org/10.1093/nar/gky955
-
A code for transcription initiation in mammalian genomesGenome Research 18:1–12.https://doi.org/10.1101/gr.6831208
-
Human transcription factor protein interaction networksNature Communications 13:766.https://doi.org/10.1038/s41467-022-28341-5
-
Evaluation and design of genome-wide CRISPR/spcas9 knockout screensG3: Genes, Genomes, Genetics 7:2719–2727.https://doi.org/10.1534/g3.117.041277
-
Genomic and biochemical insights into the specificity of ETS transcription factorsAnnual Review of Biochemistry 80:437–471.https://doi.org/10.1146/annurev.biochem.79.081507.103945
-
Molecular mechanisms of ETS transcription factor-mediated tumorigenesisCritical Reviews in Biochemistry and Molecular Biology 48:522–543.https://doi.org/10.3109/10409238.2013.838202
-
Feature selection with the boruta packageJournal of Statistical Software 36:i11.https://doi.org/10.18637/jss.v036.i11
-
Role of DNA sequence in chromatin remodeling and the formation of nucleosome-free regionsGenes & Development 28:2492–2497.https://doi.org/10.1101/gad.250704.114
-
A unified approach to interpreting model predictionsAdv Neural Inf Process System 30:4766–4775.
-
The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog)Nucleic Acids Research 45:D896–D901.https://doi.org/10.1093/nar/gkw1133
-
Chromatin and DNA sequences in defining promoters for transcription initiationBiochimica et Biophysica Acta 1839:118–128.https://doi.org/10.1016/j.bbagrm.2013.11.003
-
Mechanisms of mutational robustness in transcriptional regulationFrontiers in Genetics 6:322.https://doi.org/10.3389/fgene.2015.00322
-
Scikit-learn: machine learning in pythonJournal of Machine Learning Research: JMLR 12:2825–2830.
-
Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformaticsJournal of Molecular Biology 338:207–215.https://doi.org/10.1016/j.jmb.2004.02.048
-
Long-range enhancer-promoter contacts in gene expression controlNature Reviews. Genetics 20:437–455.https://doi.org/10.1038/s41576-019-0128-0
-
The ETS-domain transcription factor familyNature Reviews. Molecular Cell Biology 2:827–837.https://doi.org/10.1038/35099076
-
Gene-expression variation within and among human populationsAmerican Journal of Human Genetics 80:502–509.https://doi.org/10.1086/512017
-
Roles and regulations of the ETS transcription factor ELF4/MEFJournal of Molecular Cell Biology 9:168–177.https://doi.org/10.1093/jmcb/mjw051
-
CAGEfightR: analysis of 5’-end data using R/BioconductorBMC Bioinformatics 20:487.https://doi.org/10.1186/s12859-019-3029-5
-
DrugBank 5.0: a major update to the drugbank database for 2018Nucleic Acids Research 46:D1074–D1082.https://doi.org/10.1093/nar/gkx1037
Article and author information
Author details
Funding
Danmarks Frie Forskningsfond (6108-00038)
- Robin Andersson
European Research Council (638173)
- Robin Andersson
Novo Nordisk Fonden (NNF18OC0052570)
- Robin Andersson
Novo Nordisk Fonden (NNF20OC0059796)
- Robin Andersson
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank members of the Andersson lab for rewarding discussions. Sequencing was performed by the SNP&SEQ Technology Platform in Uppsala, part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. The SNP&SEQ Platform is also supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation.
Copyright
© 2022, Einarsson et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 3,763
- views
-
- 383
- downloads
-
- 19
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Chromosomes and Gene Expression
- Neuroscience
Pathogenic variants in subunits of RNA polymerase (Pol) III cause a spectrum of Polr3-related neurodegenerative diseases including 4H leukodystrophy. Disease onset occurs from infancy to early adulthood and is associated with a variable range and severity of neurological and non-neurological features. The molecular basis of Polr3-related disease pathogenesis is unknown. We developed a postnatal whole-body mouse model expressing pathogenic Polr3a mutations to examine the molecular mechanisms by which reduced Pol III transcription results primarily in central nervous system phenotypes. Polr3a mutant mice exhibit behavioral deficits, cerebral pathology and exocrine pancreatic atrophy. Transcriptome and immunohistochemistry analyses of cerebra during disease progression show a reduction in most Pol III transcripts, induction of innate immune and integrated stress responses and cell-type-specific gene expression changes reflecting neuron and oligodendrocyte loss and microglial activation. Earlier in the disease when integrated stress and innate immune responses are minimally induced, mature tRNA sequencing revealed a global reduction in tRNA levels and an altered tRNA profile but no changes in other Pol III transcripts. Thus, changes in the size and/or composition of the tRNA pool have a causal role in disease initiation. Our findings reveal different tissue- and brain region-specific sensitivities to a defect in Pol III transcription.
-
- Biochemistry and Chemical Biology
- Chromosomes and Gene Expression
The mRNA 5'-cap structure removal by the decapping enzyme DCP2 is a critical step in gene regulation. While DCP2 is the catalytic subunit in the decapping complex, its activity is strongly enhanced by multiple factors, particularly DCP1, which is the major activator in yeast. However, the precise role of DCP1 in metazoans has yet to be fully elucidated. Moreover, in humans, the specific biological functions of the two DCP1 paralogs, DCP1a and DCP1b, remain largely unknown. To investigate the role of human DCP1, we generated cell lines that were deficient in DCP1a, DCP1b, or both to evaluate the importance of DCP1 in the decapping machinery. Our results highlight the importance of human DCP1 in decapping process and show that the EVH1 domain of DCP1 enhances the mRNA-binding affinity of DCP2. Transcriptome and metabolome analyses outline the distinct functions of DCP1a and DCP1b in human cells, regulating specific endogenous mRNA targets and biological processes. Overall, our findings provide insights into the molecular mechanism of human DCP1 in mRNA decapping and shed light on the distinct functions of its paralogs.