Systematic identification of cis-regulatory variants that cause gene expression differences in a yeast cross

Abstract
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Sequence variation in regulatory DNA alters gene expression and shapes genetically complex traits. However, the identification of individual, causal regulatory variants is challenging. Here, we used a massively parallel reporter assay to measure the cis-regulatory consequences of 5832 natural DNA variants in the promoters of 2503 genes in the yeast Saccharomyces cerevisiae. We identified 451 causal variants, which underlie genetic loci known to affect gene expression. Several promoters harbored multiple causal variants. In five promoters, pairs of variants showed non-additive, epistatic interactions. Causal variants were enriched at conserved nucleotides, tended to have low derived allele frequency, and were depleted from promoters of essential genes, which is consistent with the action of negative selection. Causal variants were also enriched for alterations in transcription factor binding sites. Models integrating these features provided modest, but statistically significant, ability to predict causal variants. This work revealed a complex molecular basis for cis-acting regulatory variation.

Introduction

Individual genomes carry thousands of sequence differences in gene-regulatory elements. Collectively, these variants contribute to variation in many phenotypic traits by altering the expression of one or multiple genes (Albert and Kruglyak, 2015). The presence of individual DNA variants that alter gene expression can be detected by mapping genomic regions called ‘expression quantitative trait loci’ (eQTLs). Among these, ‘local’ eQTLs are located close to or in the gene whose expression they influence. Eukaryotic species ranging from yeast to human carry large amounts of local regulatory variation (Brem et al., 2002; Hasin-Brumshtein et al., 2016; Heyne et al., 2014; Rockman et al., 2010; Stranger et al., 2005; West et al., 2007). In human populations, most genes are affected by one or multiple local eQTLs (GTEx Consortium et al., 2017). Similarly, in a cross between two genetically different yeast isolates, 74% of genes are influenced by local eQTLs (Albert et al., 2018). Most of these local eQTLs arise from DNA variants that perturb cis-acting regulatory mechanisms (Albert et al., 2018; Ronald et al., 2005). When such cis-acting variants are located in a gene’s transcribed region, they may alter mRNA stability, splicing, polyadenylation, or regulation by RNA-binding proteins. Cis-acting variants in promoters or enhancers may alter the transcription of their target genes.

As a consequence of genetic linkage in experimental crosses (Albert et al., 2018) and linkage disequilibrium in outbred populations (GTEx Consortium et al., 2017; Kita et al., 2017), regions mapped as eQTLs almost always contain multiple sequence variants. Typically, it is assumed that most of these variants have no effect, obscuring the identity of one or a few causal variants in each eQTL (Figure 1). Although the causal variants in several local eQTLs have been identified (Chang et al., 2013; Claussnitzer et al., 2015; Lutz et al., 2019; Musunuru et al., 2010; Ronald et al., 2005), most causal variants remain unknown. Because of this lack of systematic information, many questions about local eQTLs remain open, including whether local eQTLs are typically caused by one or multiple variants, if multiple variants interact in a non-additive fashion, what evolutionary forces act on causal variants, which molecular mechanisms causal variants perturb, and whether it may ultimately be possible to combine the answers to these questions to construct models that can predict the consequences of regulatory variants from genome sequence.

Figure 1 with 8 supplements see all

Download asset Open asset

Schematic of MPRA design.

At the top, two native genes in the genome with multiple promoter variants (stars) are shown. Below the two genes, the TSS and Upstream MPRA library designs are illustrated. The TSS library tests all variants within 144 bp of the transcription start site (dashed vertical lines), while the Upstream library tests a subset of TSS variants along with variants located further away from the transcription start site.

Massively parallel reporter assays (MPRAs) have begun to make it possible to dissect regulatory activity of DNA sequences at scale (Kinney et al., 2010; Melnikov et al., 2012; Mulvey et al., 2020; Patwardhan et al., 2009). In these approaches, which have been applied in bacteria (Cambray et al., 2018; Kinney et al., 2010; Kosuri et al., 2013), yeast (Cuperus et al., 2017; Mogno et al., 2013; Sharon et al., 2012), flies (Gisselbrecht et al., 2013), zebrafish (Rabani et al., 2017), mouse tissues (Kwasnieski et al., 2014; Smith et al., 2013), and cultured human cells (Mulvey et al., 2020), pooled libraries of DNA oligos are placed next to or in a reporter gene and inserted into populations of cells. The activity of each oligo is then assayed in bulk by pooled high-throughput sequencing. MPRAs have been performed with synthesized libraries of designed oligos (Sharon et al., 2012), with fragmented genomic DNA (van Arensbergen et al., 2019; Arnold et al., 2013; Wang et al., 2018), and with randomly generated DNA (Cuperus et al., 2017; de Boer et al., 2020; Rosenberg et al., 2015). MPRA readouts have included sequencing of either the oligos themselves (Arnold et al., 2013) or short barcodes that tag each oligo (Kwasnieski et al., 2012). MPRAs have quantified mRNA expression by sequencing cDNA (Kwasnieski et al., 2012) and protein expression based on ‘FACS-Seq’ approaches in which cells are sorted into bins of increasing fluorescent reporter gene activity (Kinney et al., 2010; Matreyek et al., 2018; Sharon et al., 2012). MPRAs have been conducted using plasmid-borne reporters as well as reporters integrated into the genome (Inoue et al., 2017; Maricque et al., 2019; Mogno et al., 2013).

MPRAs have been used to probe DNA sequences for their ability to drive transcription (Arnold et al., 2013; Kheradpour et al., 2013; Wang et al., 2018), dissect the importance of individual bases in regulatory elements (Patwardhan et al., 2009), and examine the combined effects of multiple elements in regulatory ‘grammars’ (Davis et al., 2020; Kosuri et al., 2013; Mogno et al., 2013; Sharon et al., 2012; Smith et al., 2013) in promoters (Kotopka and Smolke, 2020; Lubliner et al., 2015; Sharon et al., 2012; Weingarten-Gabbay et al., 2019), UTRs (Cuperus et al., 2017; Dvir et al., 2013; Rabani et al., 2017; Shalem et al., 2015) and enhancers (Arnold et al., 2013; Klein et al., 2019; Melnikov et al., 2012; Patwardhan et al., 2009). Other applications have assayed sequences that promote splicing (Cheung et al., 2019; Rosenberg et al., 2015), translation (Goodman et al., 2013; Weingarten-Gabbay et al., 2016), DNA methylation (Krebs et al., 2014) and RNA editing (Safra et al., 2017). More recently, MPRAs have identified individual human DNA variants that alter gene expression (Tewhey et al., 2016; Ulirsch et al., 2016) in studies ranging in scale from variants in specific regions implicated by genome-wide association studies for a given disease (Choi et al., 2019; Liu et al., 2017; Pashos et al., 2017; Vockley et al., 2015) to a genome-wide survey of nearly six million common single-nucleotide polymorphisms (van Arensbergen et al., 2019). In spite of these successes, the size of the human genome, which harbors tens of millions of rare as well as common variants (1000 Genomes Project Consortium et al., 2015), combined with a high degree of tissue-specificity in gene expression and the activity of regulatory DNA GTEx Consortium et al., 2017; Inoue et al., 2019; Maricque et al., 2019 have complicated dissection of causal variants in local eQTLs. The compact gene-regulatory regions of S. cerevisiae, combined with comprehensive eQTL maps (Albert et al., 2018), provide an excellent opportunity to study regulatory variants systematically.

Here, we used an MPRA to probe thousands of intergenic variants that differ between two yeast isolates. We identified 451 variants with significant cis-acting effects on mRNA expression. These causal variants underlie known local eQTLs. We found that individual local eQTLs can harbor multiple causal variants, including pairs of variants with non-additive, epistatic effects. Causal variants tended to alter transcription factor binding motifs and showed signs of evolving under negative selection. Combinations of these features predicted causal variants better than expected by chance, albeit with modest accuracy.

Results

An MPRA to assay cis-regulatory variants in yeast promoters

At least half of the genes in the yeast genome are influenced by local eQTLs that segregate between the laboratory strain BY, a close relative of the genome reference strain S288C, and RM, a vineyard isolate that is related to strains commonly used in wine making (Peter et al., 2018). Most of these eQTLs act in cis (Albert et al., 2018), making the BY and RM strains a rich reservoir for identifying causal cis-acting variants. Here, we studied DNA variants in yeast promoters, which we defined as the intergenic region upstream of the transcription start site of a given gene up to the coding region of the adjacent gene, for a maximum of 1000 bases. These promoter regions differ between BY and RM at 11,768 single-nucleotide variants (SNVs) and 2442 insertion/deletion variants (indels) in 3176 genes (Bloom et al., 2013).

To assay the effects of individual variants on gene expression, we designed an MPRA composed of two synthetic promoter libraries encoded by pooled oligonucleotides (Figure 1, Table 1, Supplementary file 1). The ‘TSS’ library assayed all variants in the 144 nucleotides immediately upstream of the transcription start site (Pelechano et al., 2013), a region that is highly enriched for transcription factor binding sites (Lin et al., 2010). When multiple variants were present within this region for a given gene, we designed one sequence that carried the BY allele at all variants, one sequence that carried the RM allele at all variants, and a set of sequences that each carried the RM allele at a single variant and the BY allele at all other variants. The ‘Upstream’ library mostly assayed variants located further upstream of the transcription start site than those in the TSS library, using a pair of sequences that represented 144 bp of genomic DNA centered on the given variant. By design, the TSS and Upstream libraries shared a subset of variants. Together, they assayed 7005 unique variants (5,758 SNVs and 1247 indels; Figure 1—figure supplement 1) in the promoters of 3076 genes, with a median of two variants per gene (Figure 1—figure supplement 2).

Table 1

Library design and representation in experiments.

Library	TSS	Upstream
Designed oligos	7211	9882
Variants in design	3645	4547
Genes in design	2172	1918
Oligos in finished library	6565	9646
Barcodes	9.2 million	20 million
Median barcodes per oligo	590	1008
Variants with data	2427	4467
Genes with data	1429	1824
Number of replicates	12	6

Table 1—source data 1 Information on replicate samples.: https://cdn.elifesciences.org/articles/62669/elife-62669-table1-data1-v2.xlsx
Download elife-62669-table1-data1-v2.xlsx

The synthesized oligo libraries were placed upstream of a yEGFP reporter gene (Sheff and Thorn, 2004) on a low-copy number yeast plasmid (Figure 1). Prior to adding the yEGFP gene, we added random barcodes with a length of 20 nucleotides downstream of the reporter gene (Figure 1 and Figure 1—figure supplement 3). These barcodes were expressed as part of the 3’ end of the reporter mRNA, such that barcode abundance provided a measure of gene expression driven by the given oligo. We used paired-end sequencing to map each barcode to the oligo it tagged (Figure 1—figure supplement 3). Most oligos were tagged by hundreds of barcodes (Table 1, Figure 1—figure supplement 4) to control for possible influences of the expressed barcodes on mRNA levels. The final plasmid libraries we used in our experiments contained more than 90% of the designed oligos, which assayed 5832 variants in the promoters of 2503 genes (Table 1; see Materials and methods and Figure 1—figure supplement 5 for a description of oligos lost during oligo synthesis and/or cloning). The libraries were transformed into the BY strain and grown in independent replicate cultures (Table 1). We grew each culture to late exponential phase, extracted mRNA and plasmid DNA from each culture, and sequenced barcodes (Figure 1—figure supplement 6) to a median depth of 46 million RNA reads (range 19–70 million) and 31 million DNA reads (range 14–84 million) per sample (Table 1—source data 1).

Oligo expression is reproducible and reflects gene expression in the genome

We conducted three analyses to assess the reliability of our data. First, we measured the reproducibility of oligo-driven reporter expression. We summed the RNA counts of all barcodes assigned to a given oligo (Supplementary file 2), divided them by the respective summed DNA counts to normalize for unequal library composition, and log₂-transformed the resulting ratio. Spearman’s rank correlation coefficients (rho) among pairs of replicates ranged from 0.79 to 0.90 (median = 0.83; all p<2.2e-16) in the Upstream library and from 0.55 to 0.93 (median = 0.74; all p<2.2e-16) in the TSS library (Figure 1—figure supplement 7A & D).

Second, our two libraries included a common set of 200 oligos whose sequences we had sampled from a prior study (Sharon et al., 2012), allowing us to determine reproducibility between replicates, libraries, and with prior work. Median correlations between TSS and Upstream replicates (rho = 0.76; all p<2.2e-16) were nearly as high as those between replicates within each library (TSS: 0.8, Upstream: 0.99; all p<2.2e-16; Figure 1—figure supplement 8). Further, the 200 oligos showed significant, positive correlations with their published expression levels (rho ≥ 0.4, p≤1e-6), in spite of experimental and design differences between studies (Figure 1—figure supplement 7B & E).

Third, we asked if the promoter fragments in our plasmid libraries were able to recapitulate the expression of genes in their native genomic locations. To do so, we computed the average expression driven by all oligos extracted from the promoter region of a given gene. Although each oligo contained at most 150 bp of promoter sequence on a plasmid, expression in both libraries correlated significantly (rho ≥ 0.23, p<2.2e-16) with the native mRNA levels of genes in the genome (Figure 1—figure supplement 7C & F). In sum, our assay quantified the regulatory activity of promoter fragments in a manner that was reproducible within our study and when compared to earlier work, and that reflected the activity of native promoters in the genome.

Identification of hundreds of cis-acting promoter variants

The median fold change between alleles was 1.4-fold, and only 29 causal variants (6%) altered expression by more than two-fold (Figure 2A). These magnitudes are in good agreement with known genetic effects in the BY/RM strains, in which the vast majority of local eQTLs (Albert et al., 2018) and allele-specific, cis-acting effects on expression (Albert et al., 2014a) alter mRNA abundance by less than 2-fold.

Figure 2 with 1 supplement see all

Download asset Open asset

Identification of single causal variants.

(A) A scatterplot showing the effect size and significance for each variant. The genome-wide significance threshold is shown as a dashed horizontal line. Variants with the most significant effects, along with the genes they affect, are indicated. Variants and genes highlighted in color are described in the text. The histograms at the top shows the distribution of effect sizes for causal (blue) and non-causal (gray) variants. (B) A variant in the promoter of OLE1 known to affect OLE1 expression has a significant effect in the Upstream MPRA. The figure shows expression values for oligos carrying the two alleles. Colored lines and dots indicate different biological replicate experiments. Boxplots show the median as thick line, with the box showing the 25th and 75th percentiles. Whiskers show the largest value no further than 1.5 times the inter-quartile range; points beyond this range are shown as individual points. (C) As in (B) for a variant in the SFA1 promoter, which was detected in the TSS library. To identify individual causal variants, we tested each promoter variant for its effect on reporter gene expression. We detected 166 variants with significant effects in the TSS library and 293 variants in the Upstream library at a false discovery rate (FDR) of 5% (Figure 2—source data 1). The π₁ statistic (Storey and Tibshirani, 2003) computed across all variants suggested that at least 26 and 31% of variants had effects on gene expression in the TSS and Upstream libraries, respectively, even if these variants could not all be detected with individual significance. There were 451 unique variants that reached significance across the two partially overlapping libraries (A, Figure 2—source data 2).

Figure 2—source data 1 Statistical tests for each variant. Results for the TSS and Upstream libraries are in separate worksheets.: https://cdn.elifesciences.org/articles/62669/elife-62669-fig2-data1-v2.xlsx
Download elife-62669-fig2-data1-v2.xlsx
Figure 2—source data 2 Statistical results for each variant after aggregating across the two sub libraries.: https://cdn.elifesciences.org/articles/62669/elife-62669-fig2-data2-v2.xlsx
Download elife-62669-fig2-data2-v2.xlsx

Our MPRA assayed some variants multiple times in different sequence contexts, and we used this redundancy to assess the reproducibility of variant effects. First, 359 variants that reside in the intergenic region between two divergently expressed genes were tested twice in the same library, but in opposite orientation relative to the reporter gene (Figure 2—figure supplement 1A). Variants that were significant in at least one strand orientation were likely to also be significant in the other orientation (Fisher’s exact test (FET): p=0.0003, odds ratio (OR) = 7). These significant variants also agreed well in the direction of variant effect, that is whether the RM allele drives higher or lower expression than the BY allele (FET: p=0.007, OR = 6.1). Second, 527 variants were assayed in both the TSS and the Upstream libraries, where they were embedded in oligos that had the same strand orientation but differed in the exact window of DNA surrounding each variant (Figure 2—figure supplement 1B). Among these, variants that were significant in at least one library were likely to also be significant in the other (FET p=0.007, OR = 4), with significant directional agreement (FET p=0.046, OR = 2.9). Thus, in spite of the small effects of natural sequence variants, our assay was able to reproducibly identify individual causal variants.

Our assay identified several known causal variants. For example, a variant in the promoter of the OLE1 gene affects OLE1 expression in cis and is likely to be the single causal variant in this promoter (Lutz et al., 2019). This variant was highly significant in the MPRA (Figure 2B), while a second variant in the OLE1 promoter was not (Figure 2A).

In the promoter of the SFA1 gene, the MPRA detected a single causal variant out of nine that were assayed (Figure 2A). The RM allele of the causal variant is also present in a strain used for bioethanol production (YJS329), where it has been proposed to increase SFA1 expression by creating an Msn2/4 binding site (Maurer et al., 2017; Zheng et al., 2012). In our assay, this allele increased gene expression significantly (Figure 2C). Taken together, these results show that our MPRA reproducibly detected hundreds of individual cis-acting DNA variants.

Local eQTLs can be caused by single or multiple causal variants

The identity of causal variants in QTL regions is a major question in the genetics of complex traits. We asked if the variants identified by our MPRA underlie the many local eQTLs that segregate between the BY and RM strains. Specifically, we turned to local eQTLs mapped in 1012 recombinant individuals obtained by crossing BY and RM (Albert et al., 2018). Because of genetic linkage in the cross, the effects of neighboring variants create one aggregated, spread-out signal of local variation at each gene. For 2884 genes, these effects were strong enough to be detected as local eQTLs at genome-wide significance (Albert et al., 2018). We asked to what extent these local eQTLs can be explained by individual causal variants identified here (Figure 3—source data 1).

Initially, we considered all 2300 genes with available data from both the cross and the MPRA, irrespective of whether their local effect had reached significance in the cross. Likewise, we summed the MPRA effects of all assayed variants for a given gene, irrespective of their significance. There was a significant correlation between local eQTL effects and summed MPRA effects (rho = 0.1, p=1e-6). This overall correlation is likely degraded by noise in both datasets, in particular for small, non-significant effects. Therefore, we first restricted our analyses to variants with significant (5% FDR) MPRA effects. The summed effects of these significant variants showed a stronger correlation with eQTL effects than did those of all variants (rho = 0.24, p=5e-6). Next, to also avoid noise in the eQTL effect estimates, we further restricted the comparison to the 238 genes with strong local eQTLs—those with a LOD score of at least 50. This filter again improved the correlation with summed significant MPRA effects (rho = 0.47, p=2e-5). Allele-specific expression in diploid hybrids provides an independent estimate of the aggregated effect of all cis-acting variants that affect a given gene (Emerson et al., 2010; Ronald et al., 2005; Wittkopp et al., 2004). Genes with significant allele-specific expression in published BY/RM hybrid data (Albert et al., 2014a; Albert et al., 2018) were more likely to have a causal variant in our MPRA (FET: OR = 3.2, p<2.2e-16). These agreements show that, in aggregate, our MPRA successfully captured effects of variants that cause local, cis-acting eQTLs.

We asked whether single MPRA variants could account for local eQTLs. At each gene, we extracted the most significant variant among those with an FDR of ≤5%. The effects of these single variants correlated with strong local eQTLs (rho = 0.41, p=3e-4) and showed significant agreement between the direction of the effects of the variants and the eQTLs (FET: OR = 6.9, p=9e-5; Figure 3A). For example, the MPRA effect for the single causal variant in the OLE1 promoter (Lutz et al., 2019) recapitulated the magnitude of the OLE1 local eQTL almost perfectly (Figure 3A & C). Thus, we identified individual causal variants that underlie local eQTLs.

Figure 3

Download asset Open asset

Comparison of variant effects to local eQTLs.

(A) Scatterplot showing the MPRA effect of the most significant causal variant per gene (y-axis) versus the effect of the local eQTL (x-axis). Red dots indicate local eQTLs with a LOD score of at least 50. Genes describes in the text and in panel (C) are highlighted in bold. Other genes also present in panel (B) are indicated in regular font. The dashed diagonal shows the case of equal MPRA and local eQTL effects. Dashed horizontal and vertical lines indicate no effect. The panel only shows variants with a significant MPRA effect. (B) As in (A), but for the second-most significant causal variant per gene. Note the different scale of the y-axis between A and B. (C). Examples of summed MPRA effects of individual variants compared to the local eQTL for the given gene. (D). Spearman correlation coefficients MPRA variants versus local eQTLs as a function of whether a variant is bound by a nucleosome in the genome. Significance of the correlation is indicated. Error bars show 95% confidence intervals for the strength of the correlation.

Figure 3—source data 1 Local eQTLs and variant results.: https://cdn.elifesciences.org/articles/62669/elife-62669-fig3-data1-v2.xlsx
Download elife-62669-fig3-data1-v2.xlsx

QTL regions for organismal traits can harbor multiple causal variants (Mackay et al., 2009), but it is less clear whether QTLs for cellular traits such as gene expression have a similarly complex molecular basis. To test whether local eQTLs contain multiple causal variants, we considered, where available, the second-most significant MPRA variant for each gene, if that variant was still significant at an FDR of 5%. The effects of these second variants correlated with eQTL effects (rho = 0.3, p=0.03; Figure 3B). Further, there were significant correlations between the number of nominally significant (p<0.05) MPRA variants identified for a gene and that gene’s local eQTL LOD score (rho = 0.12, p=0.003) and its absolute eQTL effect size (rho = 0.14, p=3e-5). Thus, some local eQTLs are due to multiple causal promoter variants.

Some genes were affected by more than two causal variants. For example, we detected four causal variants at the CWP1 gene (Figure 2A). At each of these variants, the RM allele increased expression, and their summed effects approximated that of the CWP1 local eQTL (Figure 3C).

Some genes had two significant variants with opposite direction of effect. For example, we detected two variants located 47 bases apart in the promoter of the UFO1 gene that had effects of similar magnitude but in opposite direction (Figure 2A). UFO1 does not have a significant local eQTL (Figure 3C), suggesting that this gene is influenced by two variants whose effects cancel each other out.

Together, these observations suggest that the molecular basis of local eQTLs spans a range of scenarios. Some local eQTLs are due to single causal variants, while others arise from joint effects of multiple causal variants. When multiple variants have effects in the same direction, they sum to create a stronger eQTL. When their effects are in the opposite directions, they may be invisible in a cross.

The effects of individual variants may be influenced by nucleosome binding

Our plasmid-based MPRA tested variants in a molecular context that differs from the native genomic context in two respects: in the DNA sequence beyond that present on the oligos and, potentially, in the chromatin state in which each variant is embedded. Specifically, nucleosomes influence gene expression (Han and Grunstein, 1988), and may mask or otherwise alter the effects of variants they bind (Rando and Winston, 2012; Zhu et al., 2018). We asked whether the effects of variants located in nucleosome-free regions in the genome (Brogaard et al., 2012) were better captured by our MPRA than those of nucleosome-bound variants. Indeed, the summed effects of significant variants in nucleosome-free regions correlated better with strong local eQTLs than those of nucleosome-bound variants, which showed no significant correlation (Figure 3D). A similar difference was seen for the single most significant MPRA variant (Figure 3D). This better agreement between the genome and the MPRA for the effects of variants in nucleosome-free regions would be expected if these variants are also not bound by nucleosomes on the plasmids, and are thus exposed to a molecular environment that resembles that in the genome. More broadly, this result suggests that nucleosome occupancy and chromatin state may alter the regulatory effects of genomic variants, consistent with MPRAs in human cells (Inoue et al., 2019; Maricque et al., 2019) and with the fact that many human local eQTLs vary among tissues and cell types (GTEx Consortium et al., 2017; Yao et al., 2020).

Non-additive interactions among promoter variants

The presence of multiple causal variants in single promoters raises the question of whether these variants have independent, additive effects or if the effects of some variants are modulated by the presence of the other variant(s) in a non-additive, epistatic interaction. To address this question, we made use of the fact that the TSS library included 342 variant pairs for which there were oligos representing all four combinations of the two alleles at the two variants. For each such variant pair, we tested whether the joint effect of both variants differed from the sum of their individual effects (Figure 4—source data 1).

There were five variant pairs that showed evidence for non-additive effects at an FDR of ≤ 20%. While the π₁ statistic (Storey and Tibshirani, 2003) suggested that at least 34% of variant pairs have non-additive effects, the majority of these cases could not be detected at individual significance, presumably due to limited statistical power.

The five significant pairs showed a range of epistatic patterns (Figure 4 and Figure 4—figure supplement 1). For example, the RM alleles at two variants upstream of DAD2 did not have significant effects individually (p>0.4 compared to the oligo with two BY alleles), but when combined, the two RM alleles drove significantly higher expression (p=2e-6, FDR = 0.01%; Figure 4). RM carries a derived allele at each of these variants, indicated by the fact that the BY alleles are present in a strain from the Taiwanese clade of yeast isolates, which split early from other isolates (including BY and RM) during S. cerevisiae evolution (Peter et al., 2018). Both derived alleles are found at high and nearly identical frequencies in the yeast population (~74%), and the two variants, which are separated by only two nucleotides, show very high linkage disequilibrium (D’=0.999, r² = 0.99). Thus, these two epistatic variants form a tightly linked haplotype that increases gene expression only when both RM alleles are present, perhaps because both variants are needed to create a binding site for a transcriptional activator.

Figure 4 with 1 supplement see all

Download asset Open asset

Epistasis among promoter variants.

Each panel shows, for one gene, MPRA expression driven by four oligos with the indicated combination of BY and RM alleles at the two variants. Each panel states the gene name in bold along with multiple-testing adjusted and raw interaction p-values. Variants are given as “chromosome:position reference/alternative allele”. Colored lines between boxplots connect the data for a given oligo in the different biological replicates. Boxplots show the median as thick line, with the box showing the 25th and 75th percentiles. Whiskers show the largest value no further than 1.5 times the inter-quartile range; any observations beyond this range are shown as individual points. (A). Promoter variants for DAD2. (B) Promoter variants for RTT101.

Figure 4—source data 1 Results from the test for epistatic interactions.: https://cdn.elifesciences.org/articles/62669/elife-62669-fig4-data1-v2.xlsx
Download elife-62669-fig4-data1-v2.xlsx

The RM alleles at two SNVs upstream of RTT101 each reduced expression compared to the BY background (FDR < 0.1%), but an oligo carrying both RM alleles reduced expression to a similar degree as either variant alone (Figure 4B). Both RM alleles are derived, and the two variants have very little recombination (D’=0.998). However, the two RM alleles do not form a common haplotype (r² = 0.08), likely because while the RM ‘A’ allele at 352,322 bp on chromosome X is present in 42% of yeast isolates, the RM ‘G’ allele at 352,304 bp is much rarer, with a frequency of 6%. These results suggest that a common promoter variant reduces the expression of RTT101. A second, much rarer variant has a similar effect in isolation, but its effect is obscured in the presence of the more common variant, perhaps because both variants abrogate binding of the same transcriptional activator.

Together, these results show that cis-acting promoter variants can influence gene expression in a non-additive fashion. Such epistatic effects may be widespread, but the resultant deviations from additive expectations are often small, making their systematic detection challenging.

Characteristics of causal variants

The identification of hundreds of causal variants enabled us to ask whether causal variants tend to share molecular or population genetic characteristics. To address this question, we assembled a set of 2998 features describing each variant. These features comprised sequence characteristics of the alleles, evolutionary properties of the variant, and features that describe the gene that is regulated by the promoter in which the variant resides (Supplementary file 3 and Supplementary file 4). Most features (2,967) described transcription factor (TF) binding (see below).

We divided variants into a ‘causal’ group (defined as variants with an FDR of ≤5%; n = 459) and a non-causal group (variants with a raw p-value>0.2; n = 3,774). For variants located in divergent promoters that were assayed in both orientations, the results from the two orientations were considered separately. For this reason, the number of causal variants in these analyses was slightly higher than the 451 unique variants reported above. We used logistic regression to test the association of each feature with variant causality (Figure 5—source data 1). While these analyses cannot isolate the individual contributions of features that are correlated with each other, they provide an overview of the characteristics of causal variants.

Overall, causal variants were more likely to be SNVs than indels (Figure 5A). At the same time, longer indels were more likely to be causal than shorter indels (Figure 5A). Causal variants were less likely to be bound by a nucleosome in the genome (FDR = 4%). Nucleosomes may shield such variants from DNA-binding factors even on the plasmid reporter if the local DNA sequence around the tested variants can direct nucleosome formation at least partially.

Figure 5 with 1 supplement see all

Download asset Open asset

Association of features with variant causality.

(A) Non-TF features. The figure shows the strength of association between each feature and variant causality. Error bars show the standard error of the mean. Significant associations are indicated by three stars (FDR < 5%) or one star (nominal p-value<0.05). Non-significant features are shown in lighter coloring. (B) TF summary features aggregated across all 196 TFs, separated by strand, mode of aggregation across TFs, strength of binding (weak or strong), and mode of comparing allelic PWM scores across sliding sequence windows spanning each variant (Materials and methods). Each of these summary features was significantly associated with variant causality at an FDR of <5%. (C) Distributions of logistic regression estimates for strong vs. weak binding for individual TFs. The p-value shows the result of a Wilcoxon rank test. See also Figure 5—figure supplement 1.

Figure 5—source data 1 Single-feature regression analyses of causal variants. Logistic and linear regression (to predict log-fold change) are in separate worksheets.: https://cdn.elifesciences.org/articles/62669/elife-62669-fig5-data1-v2.xlsx
Download elife-62669-fig5-data1-v2.xlsx

Causal variants were enriched at nucleotides with higher PhastCons scores (FDR = 1%, Figure 5A), suggesting that natural variants that occur at nucleotides that are more conserved across yeast species are more likely to affect gene expression than those at less conserved nucleotides. Variants in the promoters of essential genes were less likely to be causal (FDR = 6%). Causal variants were also less likely to occur in promoters of genes with many synthetic genetic interactions (Figure 5A), which tends to be a property of essential genes (Costanzo et al., 2016). Essential genes are required for yeast viability, and are known to have less genetic (Albert et al., 2018), environmental (Eng et al., 2010), and stochastic (Newman et al., 2006) variation in gene expression. The observation that natural variants in promoters of essential genes are less likely to perturb gene expression is consistent with negative selection acting to purge causal variants from essential gene promoters. In further support of this hypothesis, causal variants had lower derived allele frequencies than non-causal variants (FDR = 10%), as expected if negative selection prevents variants that affect gene expression from rising to higher frequency (Kita et al., 2017; Ronald and Akey, 2007).

Causal variants alter predicted transcription factor binding

Gene expression is shaped by chromatin state as well as by the binding of TFs to cis-regulatory elements (Rando and Winston, 2012). Previous work suggested that while cis-eQTLs can perturb multiple aspects of chromatin, these molecular changes tend to ultimately be caused by sequence-directed differences in TF binding (Degner et al., 2012; Kasowski et al., 2013; Kilpinen et al., 2013; McVicker et al., 2013). To explore the influence of single causal variants on TF binding, we computed the expected propensity of each of 196 TFs to bind to the two alleles at each variant. Because some TFs have different affinities for the two DNA strands at their binding sites (de Boer et al., 2020), we separately considered binding to the sense and the antisense strand, as well as in a strand-agnostic manner. TFs bind to individual ‘strong’ sites that closely match their motif, as well as to weaker sites with imperfect motif matches (de Boer et al., 2020). To capture this distinction, we computed separate feature sets that considered only strong or only weak binding sites. These analyses cannot disambiguate sharing of similar binding motifs by different TFs but probe the overall role of perturbed TF binding in variant causality. Finally, we also aggregated the 196 TF-specific feature sets into summary features that reported maximum and average allelic differences at each variant across the 196 TFs (Materials and methods).

Overall, TF binding was strongly and broadly associated with variant causality (Figure 5B). Across the 2,940 TF-specific features, 481 features for 139 distinct TFs showed significant associations at an FDR of 5% (Figure 5—source data 1). All of the 27 summary features were significant at this threshold. For example, causal variants were more likely to result in gains or losses of strong binding sites (62%; n = 286) than non-causal variants (49%; n = 1,846; FET p=6e-8). These associations were found on both DNA strands (Figure 5B).

Causality was associated with differences in strong as well as weak TF binding (Figure 5B; Figure 5—figure supplement 1). While yeast promoters harbor relatively few strong binding sites, they contain many binding sites that are individually weak but that in combination greatly influence promoter activity (de Boer et al., 2020). In our logistic regression analyses of the 196 individual TFs, differences in the various weak TF-binding metrics showed stronger associations with causality than those in the strong TF-binding metrics (Figure 5C). Because of the relatively low density of strong binding sites, many natural variants change no or only a small number of these sites. In contrast, the high density of weak binding sites presents a large mutational target, such that essentially every variant perturbs one or more weak binding sites. Evidently, these subtle differences can result in detectable expression changes. Overall, these analyses provided clear evidence that variants predicted to perturb TF binding are more likely to alter gene expression than variants not predicted to do so.

Prediction of causal variants

The prediction of the effects of individual variants in a given genome is a major challenge for modern genetics and genomics. We tested whether the hundreds of causal variants we identified might enable us to predict the effects of individual cis-acting variants from their annotated features. We built 112 multiple logistic regression models for predicting variant causality from various feature subsets (Figure 6—source data 1). We trained these models using repeated 10-fold cross-validation on a random subset comprising 90% of the causal and non-causal variants and tested model performance on the remaining 10% of the variants. The best model achieved an area under the receiver operating curve (AUC) of 0.71 (Figure 6A; see inset for details on the model). Cross-validated elastic-net regularization, which is less prone to potential overfitting, did not improve model performance (best model AUC = 0.71).

Figure 6

Download asset Open asset

Prediction of causal variants and variant effects.

(A) Prediction results for 112 models. On the left, the plot shows the performance of binomial classifiers on the 10% test as black bars. On the right, the plot shows the performance of the linear predictors of variant effects as spearman correlation coefficients (rho) between actual and predicted log-fold changes in expression as pink bars. Blue symbols show r2 values for each model when fit to the entire dataset. The best classifier and linear model are indicated by orange bars and shown as insets. (B) Measured expression driven by each oligo versus expression predicted by the de Boer model. The dashed red line denotes the median measured expression level. (C) Observed fold-changes of individual variants measured in our MPRA versus fold-changes predicted by the de Boer model. The pink line shows the linear regression fit for all variants.

Figure 6—source data 1 Multiple-feature regression analyses of causal variants. Logistic and linear regression (to predict log-fold change) are in separate worksheets.: https://cdn.elifesciences.org/articles/62669/elife-62669-fig6-data1-v2.xlsx
Download elife-62669-fig6-data1-v2.xlsx

We also built a series of multiple linear regression models that used the same 112 feature combinations to predict the absolute log-fold change caused by a given variant. When fit to the entire dataset, nearly all models fit the data better than expected by chance, with r² values of up to 0.33 at a p-value of 3e-50 for a model including non-TF features as well as all TF features (Figure 6A). A model considering only the TF features fit the data almost as well (r² = 0.32, p=7e-49). The median r² across the 112 models was 0.09. Next, we used 90% of the variants to train these linear regression models using repeated 10-fold cross-validation and predicted absolute log-fold change in the remaining 10% of the variants. The best correlation between predicted and observed fold-changes was rho = 0.15 (p=0.0004). Several other models performed nearly as well (Figure 6A; Figure 6—source data 1). The median model prediction performance was rho = 0.07. Models fit with cross-validated elastic-net regularization performed no better than the corresponding unregularized linear models (best rho = 0.12, p=0.02; median rho = 0.005). Overall, prediction models built from various feature sets were able to predict the identity and effect size of causal variants better than chance, but with modest accuracy.

Recently, de Boer et al. reported a model for predicting reporter gene expression driven by millions of random DNA fragments (de Boer et al., 2020). We tested whether this model was capable of predicting individual variant effects in our data. Using the DNA sequence centered on each variant in our library as input, we predicted reporter gene expression driven by each oligo in our libraries. These predicted expression values showed a significant correlation with measured expression in our MPRA (Figure 6B). We then computed the predicted fold change of each variant as the difference between the predicted expression for the respective BY and RM oligos. The predicted and measured fold-changes of all variants correlated significantly (Figure 6C). This correlation increased for variants that were causal in our data (Figure 6C), although this subset would be unknown during variant effect prediction from genome sequence alone. When we ignored the direction of the fold-changes, prediction performance was reduced for all variants (rho = 0.09, p=1e-8), and degraded completely for causal variants (rho = 0, p=0.9). This suggests that much of the ability of this independent model to predict variant effects in our data derives from the model’s ability to correctly capture whether altered binding of individual transcription factors to a given allele increases or decreases expression.

In summary, models based on individual features trained on our variant effects, as well as a completely independent model (de Boer et al., 2020), showed similar ability to predict variant effects from DNA sequence. Although prediction performance was modest in both cases, the fact that a model trained on an independent experiment can capture some effects in our data demonstrates that the prediction of individual variant effects from DNA sequence alone is becoming possible.

Discussion

In this work, we used an MPRA to examine the effects of roughly half of all intergenic DNA sequence variants between two yeast strains and identified hundreds of variants with significant, cis-acting effects on gene expression. These significant variants reflected the effects of known local eQTLs, suggesting that we have identified causal variants that underlie natural regulatory variation. Several insights emerged from in-depth analysis of these causal variants.

First, the molecular basis of local eQTLs can include single (as for OLE1) or multiple (as for CWP1) causal variants. Studies seeking to identify the molecular basis of QTL regions often implicitly assume that each QTL is driven by a single causal variant. However, efforts to fine-map QTLs for organismal traits often find that multiple, linked causal variants exist in a region (Kroymann and Mitchell-Olds, 2005; Sinha et al., 2008; Steinmetz et al., 2002). Whether multiple variants also underlie QTLs for molecular traits such as gene expression was less clear, especially for local eQTLs in yeast, given the small mutational target size of the compact regulatory regions. Our results show that the molecular basis of local regulatory variation can involve multiple causal variants even in a single pair of yeast strains.

In some promoter regions (as for UFO1), we detected variants with effects in opposing directions that cancelled each other out, resulting in no local eQTL signal, demonstrating that QTL mapping can miss linked causal variants of opposite effect. Similarly, recent work has shown that the number of trans-acting loci affecting gene expression throughout the genome may have been greatly underestimated due to linkage in experimental crosses with limited recombination (Metzger and Wittkopp, 2019). These results paint an emerging picture of a highly polygenic architecture of gene expression variation, with dozens of trans-acting loci across the genome (Albert et al., 2018; Albert et al., 2014b; Brion et al., 2020; Metzger and Wittkopp, 2019), each of which may harbor multiple causal variants.

Second, we detected several instances of non-additive epistatic interactions between variants in the same promoter. Previous work revealed epistatic eQTL pairs that influence the expression of a given gene, including between cis and trans eQTLs as well as between trans eQTLs (Albert et al., 2018; Brem et al., 2005). Very few of these cases have been resolved to individual variants (Brem et al., 2005; Lutz et al., 2019). Our results show that there can also be epistasis between multiple causal variants within a single local eQTL. Nevertheless, most of our assayed variant pairs act additively. Even for pairs with the most significant epistasis, the quantitative departures from additivity tended to be small (Figure 4 &S10). This result is consistent with searches for epistatic QTLs for various growth traits in this cross. Although examples of higher-order epistatic interactions with profound effects on some traits have been reported (Forsberg et al., 2017), most of the hundreds of detected epistatic pairs involve small deviations from additivity that account for minuscule fractions of phenotypic variance (Albert et al., 2018; Bloom et al., 2015), as expected from quantitative genetic theory (Hill et al., 2008).

Third, cis-acting variants showed signals of evolving under negative selection, in that the derived alleles of causal variants tended to have reduced population frequencies and were less likely to be found in the promoters of essential genes. Similar results reviewed in Signor and Nuzhdin, 2018 have been reported in yeast (Kita et al., 2017; Ronald and Akey, 2007), plants (Josephs et al., 2015) and humans (Battle et al., 2014), but were based on studies in crosses subject to linkage among neighboring variants, or in natural populations subject to both linkage disequilibrium and differences in statistical power for variants with different population frequencies. Our MPRA results are free from confounding effects of linkage and population frequency and extend the observation of negative selection to single, causal cis-acting variants.

Fourth, causal variants were enriched for sequence changes that altered predicted transcription factor binding. This enrichment was present for strong binding sites, as well as weak sites that do not pass strict binding site detection thresholds. Emerging evidence suggests that in addition to strong, high-affinity binding sites, intergenic DNA harbors an abundance of weak, low-affinity binding sites with imperfect motif matches (de Boer et al., 2020; Tanay, 2006; Wunderlich and Mirny, 2009). This latter group provides a larger mutational target than the few strong exact motif matches. As a consequence, almost every natural variant may perturb one or multiple weak binding sites. The joint effects of perturbations to weak sites could further contribute to molecular architectures in which several causal variants shape the overall activity of a given promoter in a given individual. While more work is needed to explore the consequences of alterations to weak binding sites, they may also help explain why many inferred causal variants in human diseases are located close to, but do not obviously change, recognizable TF motifs (Farh et al., 2015).

Our approach had several limitations. First, not all potential cis-acting variants were assayed in our libraries, including roughly half of intergenic variants as well as variants in gene regions such as the most proximal bases of the core promoter, the gene body, the 3’UTR, and the terminator. Future work with dedicated libraries to study these variants will likely reveal many additional causal effects. Meanwhile, the concordance of our results with local eQTLs suggests that we have discovered a non-trivial fraction of causal variants, and that, by extension, many cis-acting variants do indeed reside in promoter regions. Second, causal variants typically have small effects on gene expression, necessitating replicate assays to ensure high statistical power. Higher levels of replication, as well as further optimization of assay design, are likely to improve variant detection and effect quantification. Finally, our plasmid-bound assay cannot capture the full chromatin state of variants in the genome. Nevertheless, our causal variants correlated with local eQTL effects measured in the native genomic context, showing that chromatin state does not override the effects of all variants. This agreement was stronger for variants that reside in nucleosome-free regions than for nucleosome-bound variants. If nucleosomes do not form on the plasmids, such regulatory variants may be exposed to a regulatory environment similar to that in the genome. More broadly, the precise chromatin context in which a variant is embedded likely influences its exact effect on gene expression, consistent with the considerable differences revealed by MPRAs inserted at different sites in the human genome (Maricque et al., 2019), as well as with the high degree of tissue-specificity of human cis-eQTLs (GTEx Consortium et al., 2017).

The prediction of the consequences of individual DNA variants is a major area of research, especially given that accurate predictions could aid discovery of disease-causing mechanisms and diagnosis of individual patients with unknown genetic disorders. As a consequence, a plethora of computational methods aiming to predict the severity or molecular consequences of individual variants has been developed (for example, Huang et al., 2017; Kircher et al., 2014; Lee et al., 2015; Zhou et al., 2018; Zhou and Troyanskaya, 2015, reviewed in Nishizaki and Boyle, 2017). Although our models integrating various features to predict the identity and effect size of causal variants performed better than chance, no single feature and no simple combination of features achieved high prediction accuracy. While the size of our dataset, which included several hundred causal variants, may be insufficient to train predictors with better performance, an independent model trained on millions of DNA sequences (de Boer et al., 2020) and applied to our data performed similarly. Our prediction accuracies are also comparable to those obtained from state-of-the-art machine-learning tools applied to human MPRA data (Kircher et al., 2019) and to simulated genetic architectures (Liu et al., 2019). Clearly, prediction of variant effects remains a challenging problem even in comparatively simple yeast promoters. Models trained in a range of environmental conditions, which would provide variation in chromatin contexts, as well as on MPRA data with longer and more diverse flanking sequences (Kircher et al., 2019) may further improve predictive accuracy.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Additional information
Strain, strain background (Saccharomyces cerevisiae)	BY4741	Albert et al., 2018 (doi:10.7554/eLife.35471)
Recombinant DNA reagent	RCP83 plasmid	This paper; (Addgene plasmid #163466)	Plasmid backbone
Recombinant DNA reagent	SurePrint Oligonucleotide Libraries	Agilent	Custom DNA oligo library
Gene (Aequorea victoria)	yEGFP	pKT0127 (Addgene plasmid #8728)	Reporter gene
Software, algorithm	R version 3.5.0	https://www.r-project.org	Data analysis

MPRA design and analysis code generated for this paper is available at https://github.com/frankwalbert/promoterVariants (copy archived at swh:1:rev:fb7e232981f63281d944ccf273fdafa24ac2272d; Albert, 2020). Unless otherwise specified, analyses were performed in R (https://www.r-project.org), using various packages from the tidyverse (https://tidyverse.tidyverse.org/index.html) (Wickham et al., 2019) and Bioconductor (https://www.bioconductor.org) (Huber et al., 2015). Specific packages are listed below. Sequences of primer and elements of the reporter construct are available in Supplementary file 5.

Share this article

Cite this article

Schematic of MPRA design.

Library design and representation in experiments.

Table 1—source data 1

Identification of single causal variants.

Figure 2—source data 1

Figure 2—source data 2

Comparison of variant effects to local eQTLs.

Figure 3—source data 1

Epistasis among promoter variants.

Figure 4—source data 1

Association of features with variant causality.

Figure 5—source data 1

Prediction of causal variants and variant effects.

Figure 6—source data 1

Author details

Kaushik Renganaath

Contribution

Contributed equally with

Competing interests

Rockie Chong

Contribution

Contributed equally with

Competing interests

Laura Day

Contribution

Competing interests

Sriram Kosuri

Contribution

Competing interests

Leonid Kruglyak

Contribution

For correspondence

Competing interests

Frank W Albert

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism