Introduction

Local regulatory mechanisms within the genome and their interaction with chromatin structure give rise to subtle variations in gene expression. Yet the interacting effects that genetic and epigenetic factors produce on gene transcription are rarely studied at a genome-wide scale, leaving us without global information on a key step between the genetic code and the phenotype. Studies are generally limited to examination of individual regions or overlapping single-nucleotide polymorphisms (SNPs) and open chromatin peaks with limited investigation into how these regulatory elements combine to affect gene transcription 1,2. Genetic analyses are a powerful approach that allow the study of these interactions. Conversely, phenotypic variation in genetically diverse populations is a result of both genetic and epigenetic factors operating in tandem. Understanding the scope and landscape of these interactions on a genome-wide scale is a vital step towards deciphering the genetic regulation of gene expression and, in turn, the mechanisms of non-coding variation on phenotypic outcomes.

Physical distance along a linear genome is a common metric for determining whether a putative regulatory element will affect a given gene’s transcription. This has led to the concept of a “local area” in which most regulatory interactions take place. Many have used a 100 to 500 kb flanking window around a gene to encapsulate the landscape of local epigenetic effects 35. This interval size is motivated more by convenience than rigorous evidence. Incidentally, the estimated scale of this local area is approximately equivalent to the size of a topologically associated domain (TAD). TADs are chromatin loops with an average length of approximately 1.1 Mb that are joined via the architectural protein CTCF and the cohesin protein complex at known binding motifs 6,7. TADs are believed to be mostly cell-type invariant and conserved across species 6,8,9. The current hypothesis is that TADs provide physical boundaries for local regulatory function, but functional analysis of regulatory interactions within TADs has been limited by the lack of informative genetic variation in the studies that led to TAD discovery. Reverse genetic approaches like saturation mutagenesis can be used to introduce functionally informative genetic variation. But such screens are difficult in mammals, and even modern CRISPR-Cas9 based approaches are still limited in their scope and application outside of immortalized cell lines 10.

Samples from genetically diverse populations permit comprehensive study of complex genetic interaction and non-coding regulation on a genome-wide scale 11. The Diversity Outbred mouse population (DO) is a heterogeneous stock composed of mice derived from eight founder strains, including three wild-derived strains 12. The ongoing process of random outbreeding produces a mouse population segregating millions of precisely-mapped SNPs, with well-balanced allele frequencies, and increasingly small blocks of linkage disequilibrium that now approach the size of TADs 13,14. Samples from the DO have been used to create extensive resources and data sets, including mouse embryonic stem cell (mESC) lines. A recent study used mESCs derived from generation 19 DO embryos, grown in the absence of ERK inhibition to study the gene regulatory networks that stabilize pluripotent states in vitro. This revealed phenotypic variability in features of ground state pluripotency 15.

Here we determine the extent to which genetic-epigenetic interactions between single nucleotide polymorphisms (SNPs) and regions of open chromatin are present, and uncover the biological basis of their distribution in three-dimensional (3D) genomic organization. Our results show that genetic-epigenetic interactions were found across the genome and involve regulatory elements that would not be identified with single-omics data. These interacting elements cluster together within TADs bounded by previously identified active CTCF binding sites. We infer chromatin structure from interaction data, analyze interaction contributions from the main and interaction regression elements, identify potential regulatory functions underlying a set of interactions, and correlate these interaction behaviors to CTCF binding differences in inbred founder lines.

Methods

DO mESC production was performed by Predictive Biology. Cultures were grown with G3SK inhibitor (1i medium). Bulk RNA and ATAC sequencing were performed and normalized as previously described 15. Expression data was formatted and analyzed as log2-transformed transcripts per million (TPM). ATAC-seq data was formatted as trimmed mean of M values (TMM).16 Genotyping was performed by Giga Mouse Universal Genotyping Array (GigaMUGA) 17. Aneuploidies were removed with the argyle R package 18. QTL2 haplotype reconstruction, normalization and pseudoprobe processing was carried out as previously described15. Samples with X0 genotypes were removed, and the union of all samples with the required data types resulted in 176 samples.

SNPs were subset to reflect linkage disequilibrium and haplotype prediction confidence, selecting only the closest SNP to grid points of GigaMUGA haplotype probabilities, evenly spaced by genetic distance 19,20. This produced a subset of 68,413 SNPs which were most likely to match the actual haplotype for each sample at a given locus (Supplementary Information).

RNA-seq, ATAC-seq and haplotype data were processed into SQLite databases and fit to a regression model using the stats::step() function in R. We modeled the expression of each gene by linear regression with a SNP-by-ATAC non-additive interaction term:

where yi =RNA abundance (log2TPM), x1 = SNP, and x2 = ATAC intensity (TMM). SNPs were coded as 0 for reference, 1 for heterozygote, and 2 for homozygote. Multi-allelic variants were not present in the subset used and were therefore not accounted for in the method. Models that retained the β3x1x2 term by passing the default Akaike Information Criterion cutoff were deemed to be interacting. These were used as the input for all further analyses.

This regression model was compared to two null models: one in which neither a genetic variant nor open chromatin element affect a gene’s expression additively or interactively (β1 = β2 = β;3 = 0 in Equation 1), and one where both affect gene expression, but they do so independently of each other (β3 = 0 in Equation 1). The former null model produced more conservative results and was consistent with the latter null model. Significance of the full model (Equation 1) was estimated relative to the two null models with an F-ratio test. F statistics were calculated with the R function pf(). Given the possibility of overfitting as the number of model terms increases and the high number of models tested (see Results), a Bonferroni adjusted p value cutoff of p < 1×10−7 was selected (hereafter referred to as “significant models”). Bonferroni corrections were calculated via stats::p.adjust().

mESC TAD data was derived from previously published work 6 with a liftover from mm9 to mm10 via UCSC genome browser 21. Chromosome information for mm10 retrieved from the Integrative Genomics Viewer 22.

DNA motif and binding site discovery was performed using the MEMEsuite tools MEME, STREME, and Tomtom, all set to default parameters (See Supplemental Data) 2326. SNP data were imputed from DO founder genomes referenced from release v3 (REL-1303-SNPs_Indels-GRCm238) of the Mouse Genome Project through the Sanger Institute 27,28.

We downloaded the CTCF ChIP-seq in C57BL/6 Bruce4 and 129/Ola E14TG2a.4 mESCs from the ENCODE portal29 with the following identifiers: ENCSR000CCB and ENCSR362VNF 30. These were also used to estimate the locations of Smad3 activity. CTCF and SMAD3 binding motifs were retrieved from HOCOMOCO 31. with the identifiers CTCF_MOUSE.H11MO.0.A and SMAD3_MOUSE.H11MO.0.B. These were overlapped with our ATAC-seq dataset via FIMO 27.

We performed CTCF ChIP-seq following previously established protocols32. Three cell lines were run for four isogenic mESC strains: C57BL/6J, CAST/EiJ, PWK/Ph, and WSB/EiJ. Cultures were grown in 1i medium. One sequencing run failed, resulting in the loss of one PWK and one WSB sample.

FastQC, Bowtie alignment and g2gtools liftover from imputed strain genomes to BL/6J were performed using a modification of a now publicly available ATAC-seq pipeline, accessed in development 15 December 202033. Mitochondrial read filtering steps were removed. Results were filtered to those ChIP-seq binding locations that appeared in two or more isogenic strains.

All study data and scripts are available at Synapse (https://doi.org/10.7303/syn26534443).

Results

Genetic-Epigenetic Interactions are Pervasive and enriched within TAD boundaries

Although there are numerous reports detailing examples of genetic variants and chromatin state altering gene expression, the non-additive interactions between these features have yet to be systematically explored. We aimed to determine the extent to which genetic-epigenetic regulatory interactions occur on a systemic level, and whether these interactions exhibit any bias in their location within the genome. To explore these interactions, we utilized paired transcriptome and chromatin accessibility profiling of Diversity Outbred mESCs 15. These samples incorporate genetic backgrounds from eight well-characterized founder strains from three M. musculus subspecies, randomly outbred for 19 generations. Thus, the resultant data constitute a unique resource for addressing questions of genetic-epigenetic regulatory interactions.

To search for non-additive interaction effects on a global scale, we developed a method of analysis that could be broadly applied to, and could permit discrimination between, co-occurrence and interaction effects. To that end, we combined regression model fitting (Equation 1) and model selection via information criteria in order to test whether gene expression was affected by a genetic variant, an open chromatin region, both independently of each other, or both working in concert.

To account for the resolution limits on SNP effects imposed by linkage disequilibrium, we subsetted our SNP data to 68,413 high confidence SNPs evenly spaced by genetic distance to reflect this resolution limit (Methods). We collated these with 102,173 ATAC-seq peaks and 13,631 genes with RNA-seq coverage within our sample set.

To test the non-additive interaction models, randomly selected trios of intrachromosomal genes, SNPs, and ATAC-seq peaks were tested with the interaction model. A total of 102,104 ATAC-seq peaks, 68,413 SNPs, and 13,631 genes were used to model 39,021,625 combinations. This resulted in 27,509 significant models (Bonferroni adjusted p < 1×10−7), or 0.007% of all models, involving 19,145 ATAC-seq peaks, 14,476 SNPs, and 1,286 genes (18.75%, 21.16%, and 8.52% of the total input respectively). Complete results are tallied in Tables S1 and S2. The number of significant models was four orders of magnitude greater than expected, thus indicating the presence of true associations (Fig. 1A). The locations of significant associations indicated non-additive interactions were much more likely to occur within the linkage disequilibrium block that contains the gene (Fig. 1B), matching prior work that the majority of detectable genetic effects originate in local regulatory regions 3436. Approximately half (49%) of all significant models contained a SNP or ATAC peak within 4 Mb of the gene they affected.

Interacting and additive models are abundant and favor local genes. A. Density of Bonferroni Adjusted p-value distribution in randomly sampled intrachromosomal models. B. Density of non-additive interacting models where adj. p < 1×10−7, by minimum distance between regulatory elements. C. Distribution of model term retention across intra-TAD models. D & E. Intra-TAD ATAC-seq peak and SNP participation in single-term, additive, and non-additive interacting models.

As a result of these findings, we considered potential chromatin features that might influence the local area for genetic-epigenetic interactions. We hypothesized that TADs may play a role in determining the prevalence of non-additive interactions, as they are a 3D genome structure known to enable enhancer access to local genes. Therefore, we generated a focused subset of regression models, testing all potential combinations between each gene and every SNP and ATAC peak that fall within one TAD of its transcription start site in either direction.

Models with at least one interacting element within the same TAD as the gene were 4.5 times more likely to reach our threshold, compared to models that did not (3.9% versus 0.9%), involving 3,836 SNPs and 3,725 ATAC-seq peaks affecting the expression of 896 genes (Table S1-S2). ATAC-seq peaks involved in non-additive interactions favored results between CTCF binding sites associated with TAD formation (Fig. S1A). Regression models that did not identify interaction involvement also show this local, TAD-bound organization of ATAC-seq peaks (Fig. S1B). Interacting regulatory elements as defined by our model are thus clustered within the same TAD as the gene they affect. This provides a possible structural explanation for previously observed local regulatory ranges.

Interacting elements in genetically diverse samples escape conventional discovery methods

We next evaluated whether genetic-epigenetic, non-additive interactions provide information that genotyping or ATAC-seq alone cannot. To this end, we compared interacting models, found to represent 32.29% of significant intra-TAD models (Fig. 1C), with models associated with SNPs or ATAC-seq peaks without interactions. We broke down the overlap between intra-TAD regression model results by SNP and ATAC-seq peaks separately. We found that 27.10% of SNPs are shared between non-additive interacting models and additive models, and 22.59% are shared with models that contain no ATAC-seq peak contribution. In contrast, few models contain only an ATAC-seq peak contribution, and 37.45% of ATAC-seq peaks are exclusive to interacting models (Fig. 1D). This is also in contrast to results from our randomly sampled gene-ATAC-SNP trios, which show a preference for SNP-only models (Fig. S1C-D). These results suggest that although genetic variants are the primary driver of variation in expression, non-additive interactions with chromatin states further reveal the origins of expression variation. Our findings also suggest that genetic variation and open chromatin data cannot be used independently to capture these regulatory features. Conversely, this suggests that ATAC-seq data alone is a less effective predictor of gene expression in a genetically diverse population.

TAD boundaries limit genetic-epigenetic interactions

The specific preference for non-additive interactions within active CTCF binding sites warranted further study. We sought to determine if active CTCF binding sites provided an appreciable boundary to interactions (Fig. 2A), similar to reported segregation of enhancer elements 37. We carried out regression analyses across the genome for all possible models involving each gene-SNP-ATAC combination within the gene’s TAD and nearest flanking TADs or intra-TAD regions. Although resolution of causal SNP location limited by linkage disequilibrium, ATAC-seq peaks that interact with SNPs could be confidently localized in relation to TAD boundaries.

ATAC-seq peaks that interact with SNPs generally reside within the affected gene’s TAD. A. Schematic of a TAD loop, including gene (purple) and density of interacting model elements (red). Loop interior is in blue, exterior DNA is gray, and CTCF binding sites are in yellow. B. Location of interacting ATAC-seq peaks relative to TAD boundary location, merged across all genes. TAD interior denotes the TAD in which the dependent gene was found. C. Interacting ATAC-seq peaks by distance from associated gene TSS. Local area cutoffs of 100 kb and 500 kb flanking regions are marked.

We found that TADs contained more ATAC peaks that interact with local SNPs to affect expression of a resident gene (Fig. 2B). With a window of 200 kb on either side of an active CTCF boundary, we found 52,321 ATAC-seq peaks sharing a TAD with their interacting gene, versus 24,145 peaks in different TADs or inter-TAD regions This constitutes a significant enrichment of intra-TAD gene-peak interactions (an odds ratio of 1.27, null expectation p < 2.2×10−16) (Fig. S2). We also identified 29 individual ATAC-seq peaks that were involved in more than 200 non-additive interactions each, which are distinctly visible within the aggregate view (Fig. 2B). These were exclusively affecting genes within the same TAD as the interacting ATAC-seq peak, which suggests that our results are not due to higher density of open chromatin within TAD structures. This is consistent with previous findings that the effects of enhancers and other regulatory elements are constrained by TADs 38.

Revising constraints on local regulator area using genetic data

Definitions of local regulatory area are used to establish the scope of analyses. Many cell types and animal models do not have publicly available Hi-C or CTCF ChIA-PET datasets, further limiting many researchers that may want to study genetic-epigenetic non-additive interactions. Our data supports that TADs may act as a biologically-defined boundary. Therefore, we wanted to determine how far the TAD-defined local area was likely to extend from any given gene. We found that to capture 95% of inter-TAD genetic-epigenetic interactions at adjusted p < 1×10−7, a window of 2,071,826 bp upstream and 2,579,188 bp downstream of the gene transcription start site (TSS) was required (Fig. 2C). The density of results using linear DNA sequence does not necessarily experience a linear drop-off over this distance, due to variable TAD lengths (Fig. 3E-F).

TADs provide context for interactions and increase interaction search efficacy. A. Counts of intra-TAD ATAC-seq peaks involved in all non-additive interactive models, centered on the TSS of the gene affected by the SNP-ATAC interaction. Coordinates transformed to a standard scale. B. Example TAD, displaying interacting ATAC peak density and gene locations. Peak relevance generally decays relative to intra-TAD distance rather than linear chromosomal distance. C-F. A comparison between linear sequence-based and TAD-limited search methods for interacting ATAC-seq peaks. C and D compare percentage of significantly interacting ATAC-seq peaks at each gene-relative locus. E and F compare density of ATAC-seq peaks at each locus. TAD-based search shows a higher density of interactions and places limits on search distance due to testing only TAD-internal ATAC-seq peaks.

We next considered the location of these interacting elements with regards to genes and known regulatory elements. ATAC-seq peaks involved in non-additive genetic-epigenetic interactions were analyzed by overlap with mm10 NCBI RefSeq genes and selected Enhancer Atlas 2 datasets (Fig. S3A-B). 65.22% of peaks on each chromosome fell within gene bodies. Normalized for length, exons were especially enriched, with 34.39% of peaks falling within exons, half of which fell within the first (or only) exon. 6.19% of peaks fall within enhancers curated for ESC R, R1, KH2 embryonic stem cell lines. There was no inverse relationship between gene body vs. enhancer peak percentage (Fig. S3C), indicating that this was not simply reflecting the relative density of genes within various TADs across the genome.

Density of interactions between genomic elements is defined by 3D context

If TADs act as a constraint to local interaction, then their 3D looping structure should be reflected in the regulatory patterns of genes found within them. At the most basic level of TAD structural organization, the CTCF binding site brings distant areas of chromatin into close physical association. We analyzed genome-wide distribution of SNP-interacting ATAC peaks relative to genes within their respective TADs. This revealed that the gene-centric increase in regulatory interaction density responded to the 3D context of the TAD loop, crossing CTCF boundaries irrespective of linear DNA sequence. This creates a distribution of non-additive interactions centered at the gene promoter that reaches its minimum halfway around the TAD loop from the gene (Fig. 3A).

We sought to determine whether gene-centered, TAD-internal searches for interacting elements were potentially a more efficient method for searching for non-additive interactions, when compared to local area searches along the linear DNA sequence. We compared the two methods, running a pair of genome-wide, gene-centric analyses out to the limit of our TAD-internal search (+/- ∼2.5 Mb). We calculated the percentage of ATAC-seq peaks within this region that participated in significant interactions. (Fig 3C-D). The overall density of ATAC-seq peaks within these regions was also compared (Fig. E-F).

We found that TAD-based searches produce consistently higher percentages of interacting ATAC-seq peaks across the analysis, and higher ATAC-seq peak density out to +/- ∼1Mb. TAD-based searches experience a progressive falloff in ATAC-seq peak density when compared to linear sequence-based searches. This is largely due to dropout of smaller TADs from the analysis. This results in more variable interaction percentages past ∼1Mb from the gene.

Linear search results show a consistently lower percentage of interacting ATAC-seq peaks, and a lower density of ATAC-seq peaks per locus. This was unexpected, as the immediate vicinity of the gene was expected to contain a similar density and non-additive interaction potential of ATAC-seq peaks when compared to the TAD-based search method. We therefore hypothesized that TAD boundaries in close proximity to some genes might create this discrepancy. We found that while the majority of genes were randomly distributed within their TADs, 12,437 or 24.31% of genes were found to have a TAD boundary close upstream to their transcription start site (Fig. S3D).

Interestingly, there is a depletion in the non-additive interaction rate of ATAC-seq peaks flanking the gene (Fig. 3D), which is not noted in raw interaction counts (Fig 3A). This may be due to the necessary presence of open chromatin at the promoter and the gene body during priming and transcription regardless of interaction. There are also genes which are concomitantly active within our samples, and do not show variation in expression, thus experiencing activity without detectable regulatory interaction.

Our results indicate that the local area for non-additive genetic-epigenetic interaction is not only constrained by TADs, but also shaped according to the overall 3D genome structure of the TAD loop. Analytical methods which reflect this are more efficient at discovering interacting regulatory elements.

Genetic-epigenetic interactions influence gene expression more than epigenetic factors alone

The coefficients estimated by our regression model quantify the effects of the genetic and epigenetic features on the expression of each gene (Equation 1). We postulated that our ATAC-seq, SNP, and non-additive interaction effects would display patterns that corresponded to positive and negative regulatory roles with varying strengths.

In previous analyses of SNP-SNP interactions, the importance of relative magnitude of interaction effects has been a subject of debate, as single effect terms (“main effects”) are normally of greater magnitude than interaction effects 39,40. However, in the context of non-additive genetic-epigenetic interactions we observed interaction effects of greater absolute value than ATAC-seq effects 70.52% of the time, greater than SNP coefficients 13.42% of the time, and greater than both main effects 14.00% of the time. This suggests that ATAC-seq peaks are generally a modifier of underlying genetics, so the majority of ATAC-seq peak effects are smaller than interaction effects. This was supported by the relatively low contribution of ATAC-seq peaks to single-effect regression models (Fig 1D). Their quantitative variation produces relatively subtle effects. This stands in comparison to the binary presence or absence of SNPs, which also directly capture regulatory sequence.

The directionality of effects on gene expression specifies the positive or negative influences from SNPs, ATAC peaks, and their combinations. Positive and negative effects represent correlation or anti-correlation of SNP presence or ATAC-seq score with gene expression. When all effects have the same sign (positive or negative), the total effect is synergistic and the pairwise combination of SNP and ATAC peak alter expression beyond the sum of the two. When the interaction effect has the opposite sign of the main effects, this indicates redundancy or interference, based on the idea that redundant factors create an “or” logic that yields a combined result that is less than the sum-based expectation. Alternatively, mixed main effects with interactions can also signify a suppression of one outcome in favor of the other, in which the sign of the interaction term serves to move the additive expectation nearer to one of the two marginal outcomes.

We found that models indicating redundancy or interference were the most common overall, totaling 39.54% of all models (Fig. S4A). Synergistic effects were found in 16.74% of all models. These two observations, suggest that a greater proportion of regulatory non-additive interactions attenuate gene expression, rather than strengthening it. While ATAC-seq peaks are often correlated with increased gene expression 41, we were surprised to find that an increase in ATAC-seq signal had a negative effect on gene expression in 40.73% of models (Fig. S4A). Due to the high proportion of ATAC-seq peaks found within gene bodies and the association between open chromatin and gene transcription, this result warranted further investigation.

Motif Enrichment Analysis reveals CTCF complex participation in genetic-epigenetic interactions

We looked next for potential binding sites or functional motifs underlying our results to provide clues as to the mechanistic underpinnings of model effects. We were especially interested in the subset of ATAC-seq peaks with negative effects on gene transcription, as these results were counter to our expectations. We hypothesized that these areas of open chromatin might expose binding sites of reppressive regulatory factors.

To test this hypothesis, we tested for DNA motif enrichment analysis using MEMEsuite (Fig. 4A). We selected the subset of ATAC-seq peaks involved in 10 or more significant non-additive interactions, at least 50% of which have negative ATAC-seq effects (negative effectors). This subset was compared versus all ATAC peaks and against shuffled control sequences via STREME analysis, which finds enriched ungapped motifs in provided sequences.

Motif analysis identifies differences in interacting CTCF binding motifs. A. A schematic of our motif analysis through MEMEsuite. FASTA files derived from interacting ATAC-seq peaks are used to identify enriched motifs, identify protein binding sequences, and locate the sequences within the ATAC-seq peaks. B-C. Binding sites found within significant motifs are less protected from genetic variation. SNP counts are shown at each locus in the CTCF binding sequence, comparing motifs within interacting ATAC-seq peaks versus all CTCF binding sites.

Results showed the negative effector subset was enriched for 49 motifs (Supplemental File 1), including the CTCF binding site (p < 2.0×10−14) and SMAD3 binding site (p < 1.5×10−8). These motifs were present, but less significantly enriched in positive effectors, or in all significant ATAC-seq peaks. This suggested that a portion of negative ATAC-seq effects can be functionally explained by altered behavior of CTCF binding sites carried by specific ancestries in the DO. To complement this analysis, we also quantified the CTCF motif occupancy in negative effectors versus other ATAC-seq peaks. FIMO motif scanning showed that 53.3% of top negative ATAC-seq sequences had at least one CTCF binding motif in them, compared to 35.3% in ATAC-seq peaks with positive effectors on gene transcription (See deposited data).

We next examined whether these peak locations contain SNPs that might alter CTCF binding potential. Imputing from the founder genomes of the DO population, we analyzed the locations of SNPs in CTCF binding motifs associated with regression models versus all CTCF motifs. Out of CTCF motifs within this negative subset, 96.1% of these were found to contain a DO founder SNPs, versus 18.0% for positive effect-associated CTCF motifs (p < 2.2×10−16). Similar results were found for Smad3. We also found that across all motifs, the density of SNPs favored the start and end of the binding sequence (Fig. 4B). However, in motifs associated with regression models, the density of SNPs was approximately equal across all bases (Fig. 4C). These bases have previously been identified as having high protein-DNA binding energies with the canonical sequence 41,42.

Most genomic sequences matching the CTCF binding motif are not known to be bound, according to ChIP-seq experiments 43. We wanted to determine the overlap between CTCF binding sites found in our data and known active binding sites in mESCs. Comparing to available ENCODE CTCF ChIP-seq in C57BL/6 Bruce4 and 129/Ola E14TG2a.4 mESCs 44,45, we found 2.04% and 1.87% of these overlapped with negative ATAC effect CTCF binding sites that contained SNPs, versus 17.45% and 16.20% overlapping with positive effect CTCF binding sites that contained SNPs. Expanding the scope to attempt to find flanking Smad3 regions netted consistently low results, with a 1 kb flanking window returning between 1.20% and 0.41% for Bruce4 and E14TG2a.4 respectively. These findings show that the majority of CTCF binding sites found within our significant models are not captured in previous analyses of ESCs two different M. musculus musculus strains.

Putative developmental regulator Platr2 is regulated by multiple redundant elements

To provide an example of our analytical method and probe a gene previously proposed to be important in stem cell development, we examined Platr2 and its TAD. Platr2’s TAD contains a high concentration of confidently called regression models in our analysis (Fig 3B), with 1,031 models of non-additive SNP-ATAC interaction reaching our significance cutoff for resident genes, of which 179 were models of Platr2 expression. It also contains 22 haplotype markers, allowing a certain level of model SNP localization. Previous studies have found a group of genes regulated in trans by expression quantitative trait loci (eQTLs) mapped to Platr2 15. These target genes are associated with embryonic ectoderm, indicating Platr2 may act as a regulator of stem cell state. These factors made it a target of interest for further exploration.

When the direction of effects for Platr2 were analyzed, the distribution showed a shift toward models where ATAC and SNP effects agreed with each other, but not with the interaction term (Fig. S4B). As discussed above, this potentially indicates functional redundancy between haplotype and chromatin openness at these sites. Motif enrichment analysis of interacting ATAC-seq peaks identified a sequence at 16 sites which contain Smad3 binding motifs, and another sequence at 15 sites that contain Ctcf binding motifs (Fig. S4 C-D, Supplemental File 2), which may suggest modulation of CTCF binding strength. These results suggest that Platr2 may have differential regulation patterns governed by changes in TAD formation.

CTCF binding in inbred mESCs validates strain specific effects

CTCF binding to DNA is associated with open chromatin around the binding site, TAD formation, and regulation of local gene expression 32,4648. Genetic variation within CTCF binding sites can alter binding potential41,42. We hypothesized that our regression models indicated areas of strain-specific CTCF binding due to polymorphisms in the DO haplotypes. To test this, we performed CTCF ChIP-seq on mESCs derived from four of the eight DO founder strains, including representatives from the three subspecies contributing to the DO population. C57BL/6J was also included as the standard reference.

Sample ChIP-seq binding intensities clustered by strain in PCA, and were separated by subspecies as expected (Fig. S6). Consensus peaks were identified from replicates of each strain. Fifty-two percent of CTCF ChIP-seq peaks were shared across all four strains, with the remaining 48% found in three or fewer strains (Fig. 5A). A range of binding intensities was also found at individual loci, with 13% having a variance in fold enrichment greater than 10. (Fig. 5B).

CTCF ChIP-seq analysis shows predicted strain-specific differences in binding intensity. A. Percentage of ChIP-seq peaks in surveyed strains. B. Variance (log10) in binding intensity fold enrichment for all ChIP-seq peaks. C. Percentage of significance in association between DO genotype at CTCF peaks and CTCF binding intensity on inbred ChIP-seq samples, in various subsets.

Non-additive interactions are predictive of CTCF binding patterns

With our ChIP-seq data indicating the presence of strain-specific CTCF binding, we hypothesized this differential binding could be predicted from our regression models. We anticipated that inbred founder strain CTCF ChIP-seq would have more SNP effects at binding sites with open chromatin in non-additive interactions with local genotype, as opposed to binding sites without interactions. Both interacting and non-interacting areas contain genetic variation and ATAC-seq peaks, but our models indicated interacting areas had more effects on local gene expression and genetic variability in CTCF binding sites. More specifically, we expected negative effector ATAC-seq peaks to have the greatest predictive power, as we believed their negative effects on local gene expression arise from strain-specific differences in CTCF binding and loop formation, whereas positive effects on gene expression may come from non-localized SNPs in other binding sites.

To test these hypotheses, we retrieved CTCF ChIP-seq locations and identified biallelic SNPs found in the four inbred DO founder strains we used for our ChIP-seq experiments. The correlation of SNP genotype to CTCF binding site occupancy was assayed using a paired sample Student’s t-test (Fig. 5C).

The proportion of results below p < 0.05 in all CTCF ChIP-seq peaks was 16%, establishing a baseline of predictive power for SNPs genome-wide. Subsetting to those CTCF binding sites found in ATAC-seq peaks increased the proportion of significant correlation to 31%. This was expected, as open chromatin in the vicinity of CTCF binding sites is associated with binding site occupancy32,47.

To test whether our theory that non-additive interacting regression models held greater predictive power of CTCF binding intensity, we subset CTCF binding sites in ATAC-seq peaks with significant effects on local gene expression. Binding sites associated with non-additive interacting models had 37% correlation. This outperformed additive models, which had 32% correlation.

ATAC-seq peaks can be associated with significant effects on multiple genes, potentially in combination with multiple SNPs, resulting in some ATAC-seq peaks associating with multiple significant regression. Interestingly, we found that among ATAC-seq peaks over CTCF binding sites, those associated with additive models were a subset of ATAC-seq peaks with non-additive interacting models. This means that any ATAC-seq peaks that had effects on local gene expression and were localized to CTCF binding sites always had interaction effects with the local genotype, and these models are more predictive of CTCF binding intensity.

These results matched our expectations. ATAC-seq peaks co-localized with a candidate polymorphism affecting CTCF binding would be more likely to affect gene expression in a non-additive fashion, as the polymorphism would only affect CTCF binding based on chromatin state at the binding site or any nearby priming factors. This is in contrast to additive models, where genetic effects and ATAC-seq effects are predicted to be independent of each other and are thus less likely to be co-localized.

Subsetting those ATAC-seq peaks that were negative effectors in interacting regression models produced a 41% correlation. This outperformed ATAC-seq peaks with a positive effect, which had 32% correlation. This was in line with our previously stated predictions and suggested that non-additive interactions can be used to evaluate and predict local 3D chromatin structure.

Finally, we wanted to determine whether the areas where we identified the most DO regression models supported our hypothesis, or if a larger amount of activity would result in a leveling effect on DO genotype’s predictive power on CTCF binding. We subset the results to those regions associated with a greater than average number of regression models. Results showed that the gap between negative and positive effector regions widened in these regions to 46% versus 30% correlation. This indicates that our regression model can provide detailed data in regions of dense genetic-epigenetic interaction, reinforcing our previous results.

Discussion

Integrating analysis of multi-omics data in a genetically diverse population allows for greater specificity and modeling of complex regulatory interactions. Populations that segregate natural genetic variation provide dense, randomized perturbations from which gene expression variance can be modeled. This contrasts with experimental designs that rely on a limited number of engineered perturbations in isogenic cells or animals. By collecting genetic, epigenetic, and transcriptomic data from the same cellular panel, models can be inferred without confounding factors such as different experimental protocols and environments. Our approach to integrating genetic and functional genomics data from 176 Diversity Outbred mouse embryonic stem cell lines allowed us to systematically probe how genetic and epigenetic variation interact to jointly influence local gene expression.

We integrated ATAC-seq peaks and SNPs via an interactive regression model of their effects on gene expression to determine how often these factors are independent of each other, and how often they interact. From this, we were able to infer the wide-scale presence of local interactions between these regulatory mechanisms. This method is well-suited for outbred populations, which are not compatible with methods used for analyzing inbred cohorts, such as differential expression or variance between samples. While similar interactions have been observed in isolated contexts and in co-localized SNPs and ATAC-seq peaks 38,49, this marks the first time that the phenomenon has been observed genome-wide, with the genetic mapping resolution provided by outbred, heterozygous samples.

From our analysis of genetic-epigenetic interactions we have put forth several key findings regarding the interactivity, structure, and function in the regulation of gene expression. We discovered that patterns of genetic-epigenetic interaction reflect the structure of topologically associating domains. Our model inferred the presence of frequent interactions between genotypic variation and open chromatin, with a strong preference for coordinates internal to known TADs in mESCs. We demonstrated this inference in several ways, including the clustering of highly interacting areas of open chromatin within TADs, and clustering of interactions based on inter-TAD distance rather than linear DNA sequence. Interacting ATAC-seq peaks were found to cluster within gene bodies and annotated enhancers. These findings confirm that TADs generally define the local area for interactions. While open chromatin density is higher within TADs, we found that open chromatin density is less important for interactivity than TAD boundaries are. Furthermore, linear proximity cannot be used to identify association between genes and regulatory features. SNPs or areas of open chromatin that are segregated by a TAD boundary may be incapable of affecting expression of a nearby gene. Conversely, a distant SNP or ATAC peak may be placed near a gene within a TAD loop.

Interaction effects were further classified based on the magnitude and direction of their effects on gene expression. These models contained an unexpectedly high proportion of increased ATAC-seq peak effects associated with reduced gene expression. Upon further investigation, these were found to be enriched for CTCF binding sites, which were further enriched for genetic variation in Diversity Outbred mice, particularly in the core sequence of CTCF binding sites (Fig. 4B-C). This indicates that these CTCF binding sites may be differentially bound in different samples, resulting in downregulation of transcription, either by cutting off access to nearby enhancers or by abolishing TAD structure.

We further analyzed the interplay between direction of effect for SNPs, ATAC-seq peaks, and their interactive components (as in Fig S4A), allowing us to make hypotheses about the functional significance of various interactions. Models where all effects are positive or negative suggest the open chromatin region and genetic variant enhance each other’s effectiveness in increasing or decreasing gene expression. These synergistic effects are indicative of two regulatory factors working together to produce a greater change in gene expression, beyond what either could produce independently. Functional redundancy or interference can be inferred from models where the SNP and ATAC-seq peak have a positive effect on gene expression and the interaction effect is negative, or vice versa. Redundancy is rather common in our genetic-epigenetic interactions, while synergistic effects are relatively rare. This appears to align with previous analyses of purely genetic interactions in other mouse crosses 6,39. Other models are more cryptic, but contain variation in beta coefficients associated with these models (Fig. S5), suggesting that functional subtypes that can be investigated within each group.

Our approach can be directly extended to specific gene targets. Mediation analysis in the DO by Skelly et al. revealed several mESC genes that act as mediators of downstream gene networks 15,50. Our analysis identified several genetic-epigenetic interactions in the area of one mediator gene, Platr2, and provided a targeted list of SNPs and ATAC-seq peaks that may influence the gene’s expression (Fig. 3B). ATAC-seq peaks were enriched for Smad3 binding sites, a component of the CTCF binding complex. This provides further regulatory information and potential targets for experimental manipulation.

Through DNA motif analysis, we identified distal regulatory activity via our interaction modeling. This permits functional analysis of protein factors on gene activity that is otherwise undetected by this method, such as regulatory proteins with no local QTL. This means that while interaction models may achieve the best resolution in areas of low linkage disequilibrium, they can still be used to identify and infer regulatory action of key conserved proteins in trans.

Our results have several implications for genetic analysis of gene expression in genetically diverse populations. Experiments performed in cell cultures or isogenic models often fail to produce replicable results in other tissues or human trials, Genetic-epigenetic interactions may underpin some of these failures, particularly those that have initially shown strong SNP effects. Thus, it is important to consider generating datasets with a combination of genetic mapping, gene expression, and appropriate epigenetic data, either on a local or genome-wide scale. At present, publicly available data that matches these criteria is not available. As research expands towards greater coverage in wild populations and humans, accessibility is likely to increase. Furthermore, ATAC-seq could be substituted for other experimental datasets, such as H3K4me3 ChIP-seq.

In the future, our study would benefit from the generation of more precise mapping of active CTCF binding sites, across more genetic backgrounds. Our CTCF ChIP-seq experiments expanded on previous findings that mouse strains have different strengths in CTCF binding51. ChIP-seq is capable of identifying where CTCF is binding, but not whether it is actively forming a TAD binding complex. Chromatin conformation experiments like Hi-C and ChIA-PET in the DO founder strains would permit us to directly examine genome structure. This could determine whether there are interactions that were obscured in some samples by the blocking effect of a TAD boundary, and whether there are some TADs that are more tolerant of boundary shifts than others.

Overall, this study demonstrates that genetic-epigenetic interaction analysis can reveal 3D genome structure through the positioning of interacting genome features. We see in this how genetics and genome structure can inform each other. These findings imply that including TAD boundaries and TAD loops in analyzing genomic features affecting gene expression, such as chromatin states and genetic variants, can maximize and contextualize results.

Acknowledgements

We thank G. Churchill for helpful comments. This study was funded by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM115518. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.