The missing link between genetic association and regulatory function
Abstract
The genetic basis of most traits is highly polygenic and dominated by non-coding alleles. It is widely assumed that such alleles exert small regulatory effects on the expression of cis-linked genes. However, despite the availability of gene expression and epigenomic datasets, few variant-to-gene links have emerged. It is unclear whether these sparse results are due to limitations in available data and methods, or to deficiencies in the underlying assumed model. To better distinguish between these possibilities, we identified 220 gene–trait pairs in which protein-coding variants influence a complex trait or its Mendelian cognate. Despite the presence of expression quantitative trait loci near most GWAS associations, by applying a gene-based approach we found limited evidence that the baseline expression of trait-related genes explains GWAS associations, whether using colocalization methods (8% of genes implicated), transcription-wide association (2% of genes implicated), or a combination of regulatory annotations and distance (4% of genes implicated). These results contradict the hypothesis that most complex trait-associated variants coincide with homeostatic expression QTLs, suggesting that better models are needed. The field must confront this deficit and pursue this ‘missing regulation.’
Editor's evaluation
The findings reported here are important because they address the issue of how complex traits arise from their genetic underpinnings. There is an assumption that genetically mediated variation in transcript abundance, usually detected via analysis of expressed quantitative trait loci, is key to this process, but we lack robust evidence in support of that view. This article finds limited evidence that the baseline expression of trait-related genes explains the associations between complex traits and genetic variants (as identified from genome-wide association studies), leading to the view that the field needs to confront a problem of 'missing regulation.'
https://doi.org/10.7554/eLife.74970.sa0Introduction
Modern complex trait genetics has uncovered surprises at every turn, including the paucity of associations between traits and coding variants of large effect, and the ‘mystery of missing heritability,’ in which no combination of common and rare variants can explain a large fraction of trait heritability (Manolio et al., 2009). Further work has revealed unexpectedly high polygenicity for most human traits and very small effect sizes for individual variants. Enrichment analyses have demonstrated that a large fraction of heritability resides in regions with gene regulatory potential, predominantly tissue-specific accessible chromatin and enhancer elements, suggesting that trait-associated variants influence gene regulation (Maurano et al., 2012; Trynka et al., 2013; Gusev et al., 2014). Furthermore, genes in trait-associated loci are more likely to have genetic variants that affect their expression levels (expression quantitative trait loci, or eQTLs), and the variants with the strongest trait associations are more likely also to be associated with transcript abundance of at least one proximal gene (Nicolae et al., 2010). Combined, these observations have led to the inference that most trait-associated variants are eQTLs, and their effects arise from altering transcript abundance, rather than protein sequence. Equivalent sQTL (splice QTL) analyses of exon usage data have revealed a more modest overlap with trait-associated alleles, suggesting that a fraction of trait-associated variants influence splicing, and hence the relative abundance of different transcript isoforms, rather than overall expression levels. The genetic variant causing expression changes may lie outside the locus and involve a knock-on effect on gene regulation, with the variant altering transcript abundances for genes elsewhere in the genome (a trans-eQTL), but the consensus view is that trans-eQTLs are typically mediated by the variant influencing a gene in the region (a cis-eQTL) (GTEx Consortium, 2020). Thus, a model has emerged in which most trait-associated variants influence proximal gene regulation.
Here we argue that this unembellished model—in which genome-wide association study (GWAS) peaks are mediated by the effects on the homeostatic expression assayed in tissue samples—is the exception rather than the rule. We highlight the challenges of current strategies linking GWAS variants to genes and call for a reevaluation of the basic model in favor of more complex models possibly involving context-specificity with respect to cell types, developmental stages, cell states, or the constancy of expression effects.
Our argument begins with several observations that challenge the unembellished model. One challenge is the difference between spatial distributions of eQTLs, which are dramatically enriched in close proximity to genes, and GWAS peaks, which are usually farther away (Stranger et al., 2007; Farh et al., 2015; Mostafavi et al., 2022). Another is that expression levels mediate a minority of complex trait heritability (Yao et al., 2020). Finally, many studies have designed tools for colocalization analysis: a test of whether GWAS and eQTL associations are due to the same set of variants, not merely distinct variants in linkage disequilibrium. If the model is correct, most trait associations should also be eQTLs, but across studies, only 5–40% of trait associations colocalize with eQTLs (Giambartolomei et al., 2014; Chun et al., 2017; Giambartolomei et al., 2018; Hormozdiari et al., 2016).
Despite the doubts raised, the fact that most GWAS peaks do not colocalize with eQTLs cannot disprove the predominant, unembellished model. In a sense, negative colocalization results are confusing because their hypothesis is too broad. If we predict merely that GWAS peaks will colocalize with some genes’ expression, it is not clear what is meant by a peak’s failure to colocalize with any individual gene’s expression.
Thus, a narrower, more testable hypothesis requires identifying genes we believe a priori are biologically relevant to the GWAS trait. If these trait-linked genes have nearby GWAS peaks and eQTLs, failure to colocalize would be a meaningful negative result. Earlier studies tested all GWAS peaks; when a peak has no colocalization, the model is inconclusive. But trait-linked genes that fail to colocalize reveal that our method for detecting non-coding variation is, with current data, incompatible with our model for understanding it.
With this distinction in mind, we created a set of trait-associated genes capable of supporting or contradicting the model of non-coding GWAS associations acting as eQTLs. For this purpose, the selection of genes becomes extremely important. Because the model attempts to explain the genetic relationship between traits and gene expression, true positives cannot be selected based on measurements of genetic association to traits (GWAS) or expression (eQTL mapping). With this restriction, one source of true positives is to identify genes that are both in loci associated with a complex trait and are also known to harbor coding mutations tied to a related Mendelian trait or the same complex trait. Using a model not based on expression, Mendelian genes are enriched in common-variant heritability for cognate complex traits (Weiner et al., 2022). The genes and their coding variants may be detected in familial studies of cognate Mendelian disorders or by aggregation in a burden test on the same complex phenotypes as GWAS (Backman et al., 2021).
For genes whose coding variants can cause detectable phenotypic change, the strong expectation is that a variant of small effect influences the gene identified by its rare coding variants. As an example, APOE and LDLR are both low-density lipoprotein receptor genes (Schneider et al., 1981; Goldstein and Brown, 1973). Coding variants in APOE and LDLR can lead to the Mendelian disorder familial hypercholesterolemia (Goldstein and Brown, 1973; Cenarro et al., 2016). Even in the absence of a Mendelian coding variant, experiments in animal models have found that the overexpression of these genes reduces cholesterol levels (Shimano et al., 1992a; Shimano et al., 1992b; Kawashiri et al., 2001). GWAS on human subjects have found significant associations near APOE and LDLR, so it seems reasonable to suspect that any non-coding effects in these loci may be mediated by these genes. This general relationship between Mendelian and complex traits is supported by several lines of evidence summarized in Appendix 1.
Results
To test the model that trait-associated variants influence baseline gene expression, we assembled a list of putatively causative genes for seven polygenic common traits with available large-scale GWAS data, each of which also has an extreme form in which coding variants of large effect alter one or more genes with well-characterized biology (Table 1). Our selection included four common diseases: type II diabetes (T2D) (Mahajan et al., 2018), where early-onset familial forms are caused by rare coding mutations (insulin-independent MODY; neonatal diabetes; maternally inherited diabetes and deafness; familial partial lipodystrophy); ulcerative colitis (UC) and Crohn disease (CD) (Liu et al., 2015; Goyette et al., 2015), which have Mendelian pediatric forms characterized by severity of presentation; and breast cancer (BC) (Zhang et al., 2020), where germline coding mutations (e.g., BRCA1) or somatic tissue (e.g., PIK3CA) are sufficient for disease. We also chose three quantitative traits: low- and high-density lipoprotein levels (LDL and HDL); and height. Between known Mendelian genes and those from Backman et al., 2021, our analysis included 220 unique gene-trait pairs (Figure 1).
In well-powered GWAS, even relatively rare large-effect coding alleles (mutations in BRCA1 that cause breast cancer, for instance) may be detectable as an association to common variants, which could make the effect of a coding variant appear to be regulatory instead. To account for this possibility, we computed association statistics in each GWAS locus conditional on coding variants. We applied a direct conditional test to datasets with available individual-level genotype data (height, LDL, HDL); for those studies without available genotype data, we computed conditional associations from summary statistics using COJO (Yang et al., 2011; Yang et al., 2012; ‘Materials and methods’). With both methods, the resulting GWAS associations should reflect only non-coding variants.
After controlling for coding variation, we examined whether these genes are more likely than chance to be in close proximity to variants associated with the polygenic form of each trait. In agreement with existing literature (Freund et al., 2018), we observe a significant enrichment for all traits in our combined Mendelian and Backman et al., 2021 gene sets (Figure 1—figure supplement 1).
Of our 220 genes, 147 (67%) fell within 1 Mb of a GWAS locus for the cognate complex trait, over three times as many as the 43 predicted by a random null model (95% confidence interval: 31.5–54.5). Our window of 1 Mb represents roughly the upper bound for distances identified between enhancer–promoter pairs, but most pairs are closer (Nasser et al., 2021), so we would expect enrichment to increase as the window around genes decreases; this proves to be the case. At a distance of 100 kb, we find 104 putatively causative genes (47%), though the null model predicts only 11 (95% CI 4.5–17.0), an order-of-magnitude enrichment (Figure 1—figure supplement 1). Given their known causal roles in the severe forms of each phenotype, these results suggest that the 147 genes near GWAS signals are likely to be the targets of trait-associated non-coding variants. For example, we see a significant GWAS association between breast cancer risk and variants in the estrogen receptor (ESR1) locus even after controlling for coding variation; the baseline expression model would thus predict that non-coding risk alleles alter ESR1 expression to drive breast cancer risk.
We next looked for evidence that the trait-associated variants were also altering the expression of our 147 genes in relevant tissues. Controlling for the number of tests we conducted, 134 (91.1%) of these genes had an eQTL in at least one relevant tissue at a false discovery rate (FDR) of Q < 0.05 (‘Materials and methods’). If these variants act through changes in gene expression, phenotypic associations should be driven by some of the same variants as eQTLs in relevant tissue types. We therefore looked for co-localization between our GWAS signals and eQTLs in relevant tissues (Table 2) drawn from the GTEx Project using three well-documented methods: coloc (Giambartolomei et al., 2014), JLIM (Chun et al., 2017), and eCAVIAR (Hormozdiari et al., 2016). We found support for the colocalization of trait and eQTL association for only 7 genes out of 147 (4.8%) for coloc; 10/147 (6.8%) for JLIM; and 8/147 (5.4%) for eCAVIAR. Accounting for overlap, this represents only 18/220 putatively causative genes (8.2%) or 18/147 (12.2%) putatively causative genes near GWAS peaks, even without full multiple-hypothesis testing correction (‘Materials and methods’), which is not obviously better than random chance. We note that prior estimates of the fraction of GWAS associations colocalizing with eQTLs (25–40%; Giambartolomei et al., 2014; Chun et al., 2017; Hormozdiari et al., 2016; Wen et al., 2017) do not directly evaluate the ability to find causative genes. By contrast, our estimate of the number of putatively causative genes that colocalize with eQTLs tests the consistency of our knowledge, models, and data.
A potential weakness of our approach is the restriction of our search to predefined tissues. We believe this is necessary in order to avoid the disadvantages of testing each gene–trait pair in each tissue—either a large number of false positives or a severe multiple-testing correction that may lead to false negatives. However, restricting to the set of tissues with a known biological role and available expression data almost certainly leaves out tissues with relevance in certain contexts. Some of the tissues we do use have smaller sample sizes, limiting their power to detect eQTLs with smaller effects.
To address potential shortcomings from the available sample of tissue contexts, we incorporated the Multivariate Adaptive Shrinkage Method (MASH) (Urbut et al., 2019). MASH is a Bayesian method that takes genetic association summary statistics measured across a variety of conditions and, by determining patterns of similarity across conditions, updates the summary statistics of each individual condition. In our case, if an eQTL is difficult to find in a tissue of interest, incorporating information from other tissues may help us detect it. Unlike meta-analysis, this method generates summary statistics that still correspond to a specific tissue.
We ran MASH on every locus used in our earlier analysis using data from all non-brain GTEx tissues (‘Materials and methods’). Rerunning coloc with these modified statistics increased the number of GWAS-eQTL colocalizations across all genes by 26% (from 389 to 489). However, the 100 new colocalizations identified only four additional putatively causative genes (Figure 1—figure supplement 2). These results indicate that tissue-type selection was not the limiting factor in our analysis.
Transcriptome-wide association studies (TWAS) (Gamazon et al., 2015; Gusev et al., 2016; Mancuso et al., 2017; Barbeira et al., 2018) are another class of methods applied to identify causative genes under GWAS peaks using gene expression. TWAS measures genetic correlation between traits and is not designed to avoid correlations caused by LD, which gives it higher power in the case of allelic heterogeneity or poorly typed causative variants (Wainberg et al., 2019). However, while sensitive, TWAS analyses typically yield expansive result sets that include many false positives and are sensitive to the number of tissue types (Wainberg et al., 2019). Results from the FUSION implementation of TWAS (Mancuso et al., 2017) across all tissues identified our putatively causative genes as likely tied to the GWAS peak in 66/220 loci (30%). However, only 4/220 (1.8%) genes were identified by FUSION when we restricted the analysis to relevant tissues.
Given the paucity of expression-mediated GWAS peaks, we asked whether GWAS variants in our study loci reside in likely regulatory sites. Taking the 128 genes in the Mendelian subset of putatively causative genes, we fine-mapped each nearby GWAS association using the SuSiE algorithm (Wang et al., 2020b). For 37 of these genes, we identified at least one high-confidence fine-mapped variant (defined as a variant with posterior inclusion probability, or PIP, greater than 0.7) within 100 kb of the transcription start site. We tested whether these fine-mapped variants fall within (i) regulatory DNA marked by DNase I hypersensitive sites (DHS) Meuleman et al., 2020; (ii) a narrowly mapped active histone modification feature (H3K27ac, H3K4me1, or H3K4me3; Nasser et al., 2021); or (iii) sites marked as an ‘enhancer’ by ChromHMM (Ernst and Kellis, 2012; Ernst and Kellis, 2017; ‘Materials and methods’). As many as 32/37 (86%) genes identified this way have a fine-mapped variants within a candidate regulatory region across all the tissue types examined, or 25/37 (68%) when restricting to trait-relevant tissues (Figure 2); (Supplementary file 1 and Supplementary file 2). Despite this strong evidence that these GWAS associations arise from effects on regulatory DNA, only 5/25 loci (20%) demonstrably correspond to expression effects in our eQTL analysis.
In order to more analogously compare our regulatory feature analysis to our eQTL analysis, we computed ‘activity-by-distance’ (ABD)—a simplification of the 'activity-by-contact' method that provides a measure connecting a distal regulatory region to a target gene (Nasser et al., 2021; Fulco et al., 2019; 'Materials and methods' Figure 3). Taking each locus’s feature with the highest ABD score, we implicate 5/37 (14%) of our Mendelian subset of genes. As such, even when a GWAS association and trait-relevant gene are in the same locus, they are difficult to link, whether using eQTLs or current approaches to integrating chromatin data with target genes.
Discussion
Overall, our results are strongly consistent with the idea that complex traits are governed by non-coding genetic variants whose effects on phenotype are mediated by their contribution to the regulation of nearby genes. However, these same results are inconsistent with a model that such effects on gene regulation arise from the genetic influences on baseline gene expression that are captured by eQTLs.
The enrichment of putatively causative genes—selected based on existing biological knowledge—near GWAS peaks supports their role in complex traits. Additionally, the enrichment of fine-mapped GWAS variants in likely regulatory regions marked by DHS and other chromatin features lends support to the general model of GWAS associations arising from effects on gene regulation and expression. However, the inability of varied statistical methods to connect GWAS associations and expression argues against the idea that the causative GWAS variants exert their effects via the homeostatic or blended effects on gene expression that are captured by bulk-tissue eQTLs of the sort discovered by broad expression-data collection projects.
Many explanations have been suggested for the limited success of expression-centric methods to explain the mechanisms of GWAS variants. Undirected, broad approaches—including most GWAS-eQTL linking studies—are designed to be largely independent of a priori biological knowledge and hypotheses. This unconstrained focus is ideal for discovery because it delivers the largest number of positive findings, but it is ill-suited for providing explanations for negative results—when you do not know what you were looking for, it is hard to explain why you did not find it. By testing only loci for which there is a strongly suspected contributing gene, we were better able to distinguish which factors might prevent us from identifying GWAS-to-gene links using expression alone.
As a result, we conclude that a number of explanations often considered when evaluating expression-based variant-to-gene methods are not applicable in the context we examined. These include (i) non-expression-mediated mechanisms, (ii) lack of statistical power for GWAS, (iii) the absence of eQTLs for relevant genes, (iv) and underpowered methods for linking expression to GWAS (Table 3).
Our results suggest that there remains to be discovered a major component connecting variants to gene expression, which we term 'missing regulation.' We propose that this currently lacking component could be exposed by considering more nuanced experimental and analytical models of gene expression. One likely model involves context-dependent gene expression—expression whose relevant effects are confined to (i) particular refined cell types or anatomical partitions, or (ii) cell states such as responses to perturbations, whether exogenous (e.g., environmental change) or endogenous (e.g., cell differentiation). A complementary model may incorporate heterogeneity of gene expression or the variance of expression across relatively short time scales. These models and others may depend on or be augmented by thresholding or buffering of variant effects, which may produce changes in gene expression with nonlinear effects on phenotype. A summary of proposed models can be found in Table 4.
We propose that finding the 'missing regulation' will require not only identifying novel eQTLs explaining GWAS peaks, but also explaining the phenotypic irrelevance of 'red herring eQTLs'—that is, eQTLs for putatively causative genes that fall near GWAS peaks but do not colocalize with them. Above, we use the example genes APOE and LDLR. Both these genes harbor coding variants causing Mendelian hypercholesterolemia, and both have non-coding variants that GWAS have tied to LDL levels. Both have eQTLs in trait-relevant tissues. For APOE, these points cohere into an explanation: the LDL association is an eQTL for the lipid-binding gene. But for LDLR—and for most genes—the association, the mechanism, and the gene cannot be tied together.
Importantly, our results do not diminish the importance or general utility of eQTLs. Rather, they suggest that current models are deficient in two respects: (i) they fail to unify trait-associated non-coding variants with known trait-associated genes, and (ii) they fail to explain the non-effects of identified 'red herring' eQTLs. These deficiencies highlight a need for new approaches to the role of gene regulation in complex traits.
One long-standing goal of GWAS has been to discover genes contributing to complex traits (Manolio et al., 2009; Farh et al., 2015), but low rates of positive findings for expression-based variant-to-gene methods have constrained this possibility (Chun et al., 2017; Baird et al., 2021). Among other challenges, this has limited the benefit of GWAS and expression data for disease-gene mapping and drug discovery (Baird et al., 2021; Umans et al., 2021). Another practical question raised is the value of current large-scale public datasets. Compared to genotypes, expression data are relatively difficult to collect, particularly from specialized cell contexts. If the most relevant models are shown to depend on effects not observable in homeostatic eQTL mapping, the field may need to consider prioritizing other biological contexts and forms of expression data.
Materials and methods
Gene selection
Request a detailed protocolBy manual literature search, we selected 128 genes harboring large-effect-size coding variants for one of the seven phenotypes (Table 1; specifically, we selected 128 gene–trait pairs, representing 121 unique genes). These genes were identified using familial studies, rare disease exome-sequencing analyses, and, for breast cancer, using the MutPanning method (Dietlein et al., 2020) (citations for each gene are included in Table 1). Review papers, as well as the OMIM database (McKusick-nathans institute of genetic medicine, 2021), were generally used as starting points, but an examination of the original literature was needed to confirm genes’ suitability. For example, though SMC3 is known to cause Cornelia de Lange syndrome, which is characterized in part by short stature, SMC3 mutations lead to a milder form of the syndrome, usually without a marked reduction in stature (Deardorff et al., 2007). Several of these phenotypes—height, HDL, cholesterol, breast cancer, and type II diabetes—were also analyzed in Backman et al., 2021, which, through burden testing, identified a total of 110 genes; after accounting for overlaps, this increased our set of putatively causative genes to 220.
The inclusion of genes from Backman et al. ensures that our results are not dependent on an undetected bias in our selection. The set of genes chosen from familial studies offered the advantage that it was selected based on independent methods and data distinct from the large-scale genotyping studies that have characterized the GWAS era. The tradeoff to this was the impossibility of selecting genes through a fully systematic and non-arbitrary process. Because this work was performed in the UKBB, there is some overlap between their data and ours. However, our work did not use exomes, and most of the variants driving their findings are too rare to influence GWAS results. When this is not the case, our decision to condition on coding variants should make the effects used in our work independent from their findings.
Identifying coding variants
Request a detailed protocolBecause GWAS sample sizes are large enough to detect the low-frequency coding variants used to select some of our genes, it is possible that a coding variant would distort the association signal of nearby eQTLs. To minimize this concern, we removed the effects of coding variants on GWAS. Many variants can fall within coding sequences in rare splice variants, so it is important to remove only those variants that appear commonly as coding. These coding variants were selected based on the pext (proportion of expression across transcripts) data (Cummings et al., 2020). Two filters were used. First, we removed genes whose expression in a trait-relevant tissue was below 50% of their maximum expression across tissues. Second, we removed variants that fell within the coding sequence of less than 25% of splice isoforms in that tissue. The remaining variants were used to correct GWAS signal, as explained below. The code for this analysis, and all other quantitative analyses in this paper, can be accessed at https://github.com/NJC12/missing_link_association_function (NJC12, 2023; copy archived at swh:1:rev:46d9072b7cc13f6532203d1494eec4d0f634e092).
GWAS
Request a detailed protocolFor height, LDL cholesterol, and HDL cholesterol, GWAS were performed using genotypic and phenotypic data from the UKBB. In order to avoid confounding, we restricted our sample to the 337K unrelated individuals with genetically determined British ancestry identified by Bycroft et al., 2017. The GWAS were run using Plink 2.0 (Chang et al., 2015), with the covariates age, sex, body mass index (for LDL and HDL only), 10 principal components, and coding variants.
Conditional analysis
Request a detailed protocolBecause UKBB has limited power for breast cancer, Crohn disease, ulcerative colitis, and type II diabetes, we used publicly available summary statistics. The Conditional and Joint Analysis (COJO) (Yang et al., 2011; Yang et al., 2012) program can condition summary statistics on selected variants—in our case, coding variants—by using an LD reference panel. For this reference, we used TOPMed subjects of European ancestry (Taliun et al., 2021). The ancestry of these subjects was confirmed with FastPCA (Galinsky et al., 2016a; Galinsky et al., 2016b), and the relevant data were extracted using bcftools (Danecek et al., 2021). Our conditional GWAS data are available at doi:10.5061/dryad.612jm644q.
Enrichment analysis
Request a detailed protocolAt each distance, the number of Mendelian and non-Mendelian genes within that window around GWAS peaks are counted. p-Values were calculated using Fisher’s exact test (Figure 1, Figure 1—figure supplement 1). Because Mendelian genes may be unusually important beyond our chosen traits, we conduct a set of controls by measuring the enrichment of non-matching Mendelian and complex traits (CD genes and BC GWAS; BC genes and LDL GWAS; LDL genes and UC GWAS; UC genes and height GWAS; height genes and T2D GWAS; T2D genes and HDL GWAS; HDL genes and CD GWAS).
eQTL detection
Request a detailed protocoleQTL summary statistics were taken from GTEx v7. Some methods detect colocalization with variants that are individually significant, but would not pass a genome-wide threshold (Chun et al., 2017). Because we tested only a subset of genes, we used the Benjamini–Hochberg method (Benjamini and Hochberg, 1995) to calculate the FDR based on the number of tests we conducted multiplied by a correction factor to account for variants that are tested in combination with a gene but are not reported (a factor of 20 closely matched the genome-wide FDR results for GTEx). With this method, 204/220 (93%) of our genes displayed an eQTL, including 134/147 genes with a nearby GWAS peak (91%). Even using the FDR statistics of the GTEx project—which are based on the assumption of testing every gene in every tissue—107/220 (49%) of our genes and 76/147 (52%) of genes near GWAS peaks had an eQTL at Q < 0.05.
Colocalization
Request a detailed protocolJLIM (Chun et al., 2017) was run using GWAS summary statistics and GTEx v7 genotypes and phenotypes for the tissues listed in Table 2. Coloc (Giambartolomei et al., 2014) was run using GWAS and GTEx v7 summary statistics for the same tissues. eCAVIAR (Hormozdiari et al., 2016) was run using GWAS and GTEx v7 summary statistics for these tissues, and a reference dataset of LD from UKBB (Weissbrod et al., 2020). MASH was run incorporating data from all non-brain tissues, and coloc was rerun using the adjusted values for the same tissues as before.
MASH
Request a detailed protocolMASH was applied to all GTEx tissues using the mashr R-package (Urbut et al., 2019). We restricted this model to non-brain tissues—which include all of our trait-selected tissues—due to the known tendency of brain and non-brain tissues to cluster separately in expression analysis (Battle et al., 2017; Park et al., 2017; Gamazon et al., 2019).
Fusion (TWAS)
Request a detailed protocolWe used the FUSION implementation of TWAS, which accounts for the possibility of multiple cis-eQTLs linked to the trait-associated variant by jointly calling sets of genes predicted to include the causative gene, to interrogate our 220 loci (Mancuso et al., 2017). FUSION included our putatively causative genes in the set identified as likely relevant to the GWAS peak in 66/220 loci (30%). However, interpretation of this TWAS result is difficult. For many complex traits, TWAS returns a large number of findings (e.g., >150 for LDL cholesterol and >4800 for height). This is in part due to the multiple genes jointly returned at a locus and can also be a result of the large number of tissues and cell types included in the implementation of FUSION. Most hits are found in tissues without any clear relevance to the trait and absent in relevant tissues—LDL, for example, has more TWAS associations between expression and eQTL in prostate adenocarcinoma (24 genes associated), brain prefrontal cortex (23 genes associated), and transformed fibroblasts (21 genes associated) than it does in adipose (16 genes associated), blood (11 genes associated), or liver (5 genes associated). Individual genes were often identified as hits in multiple tissues, but with an inconsistent direction of effect—that is, increased gene expression correlated with an increase in the quantitative trait or disease risk in some tissues, but a decrease in others, which suggests that the gene in question may not be the one whose expression contributes to the complex trait. Because of this possibility and the known biological role of many of our genes, we restricted our results to tissues with established relevance to our traits.
Fine-mapping GWAS hits
Request a detailed protocolWe fine-mapped the GWAS variants located within ±100 kb of our putatively causative genes by applying the SuSiE algorithm (Wang et al., 2020b) on the unconditional summary statistics from the GWAS of breast cancer, Crohn disease, ulcerative colitis, type II diabetes, height, LDL cholesterol, and HDL cholesterol. An LD reference panel from UKBB subjects of European ancestry was used for this analysis. Fine-mapped variants were annotated using snpEff (v4.3t). Only non-coding variants were kept for further analysis.
Functional genomic annotation of fine-mapped hits
Request a detailed protocolWe projected fine-mapped GWAS variants onto active regions of the genome, identified using three alternative approaches: (i) histone modification features, (ii) DHS, and (iii) ChromHMM enhancers.
First, we looked at three histone modification marks, namely, acetylation of histone H3 lysine 27 residues (H3K27ac), mono-methylation of histone H3 lysine 4 residues (H3K4me1), and tri-methylation of lysine 4 residues (H3K4me3) from the Roadmap Epigenomics Project (Wainberg et al., 2019) to identify functional enhancers, which are key contributors of tissue-specific gene regulation. We downloaded imputed narrowPeak sets for H3K27ac, H3K4me1, and H3K4me3 from the Roadmap Epigenomics Project (Wainberg et al., 2019) ftp site (available here) for 14 different tissue types (Supplementary file 1). For each tissue type, we extracted the narrow peaks that are within ±5 Mb of our putatively causative genes. Then following the approach described in Fulco et al. Barbeira et al., 2018, we extended the 150 bp narrow peaks by 175 bp on both sides to arrive at candidate features of 500 bp in length. All features mapping to blacklisted regions (https://sites.google.com/site/anshul kundaje/projects/blacklists) were removed (Kundaje et al., 2015). The remaining features were re-centered around the peak and overlapping features were merged to give the final set of features per histone modification track. The mean activity/strength of a feature (AF) was calculated by taking the geometric mean of the corresponding peak strengths from H3K27ac, H3K4me1, and H3K4me3 marks. We then combined these activity measurements with the linear distances between the features and the transcription start sites of causative genes to compute ABD scores (a simplified version of ABC scores Barbeira et al., 2018) for gene–feature pairs using the following formula:
The ABD score can be thought of as a measure of the contribution of a feature, F to the combined regulatory on gene, G. A high ABD score may serve as a proxy for an increased specificity between a chromatin feature and the gene of interest. We projected the fine-mapped variants onto the chromatin features in different tissue types to assess whether there is an enrichment of likely causal GWAS hits in regulatory features near our putatively causative genes. Both proximity (genomic distance) and specificity (ABD scores) were considered to determine the regulatory contribution of the fine-mapped hits.
Next, we looked at the DHS that are considered to be generic markers of the regulatory DNA and can contain genetic variations associated with traits and diseases (Meuleman et al., 2020). We downloaded the index of human DHS along with biosample metadata from https://www.meuleman.org/research/dhsindex/. The index was in hg38 coordinates, which were converted to hg19 coordinates using the online version of the hgLiftOver package (https://genome.ucsc.edu/cgi-bin/hgLiftOver). We created a DHS index for each tissue type relevant to the traits and diseases we analyzed by including all DHS that are present in at least one biosample from a certain tissue type (Supplementary file 2). We then selected DHS that lie within ±100 kb of the TSS of our putatively causative genes. Since DHS are of variable widths, we recentered the summits in a 350 bp window and merged overlapping sites in the same way as we did for other chromatin marks. We calculated the mean activity (AF) by averaging the strengths of all the merged sites. Next, we calculated the activity by distance score for each DHS and gene pair using the same formula described above. Finally, for each fine-mapped variant, we identified all DHS that fall within ±100 kb of the variant.
Finally, we used in silico chromatin state predictions (chromHMM core 15-state model Wainberg et al., 2019) for relevant tissue types (Supplementary file 1) to identify active enhancer regions in the genome. Tissue-specific chromHMM annotations were downloaded from the Roadmap Epigenomics Project Wainberg et al., 2019 ftp site (available here). We considered a fine-mapped variant to fall in an enhancer region if it was housed within a chromHMM segment described as either enhancer, or bivalent enhancer, or genic enhancer. Since chromHMM annotations are not accompanied by activity measurements, the ABD approach could not be applied here.
Appendix 1
Evidence for the relationship between Mendelian and complex traits
More generally, this expectation is supported by several lines of evidence. Comorbidity between Mendelian and complex traits has been used to identify common variants associated with the complex traits (Blair et al., 2013). Early GWAS found associations near genes identified through familial studies of severe disorders (Voight et al., 2010; Chan et al., 2015), and later implicated some of the same genes in complex and Mendelian forms of cardiovascular (Kathiresan and Srivastava, 2012) and neuropsychiatric (Zhu et al., 2014) traits. More recent analyses have found that GWAS associations are enriched in regions near causative genes for cognate Mendelian traits in blood traits (Vuckovic et al., 2020), lipid traits and diabetes (Weiner et al., 2022), as well as a diverse collection of 62 traits (Freund et al., 2018). Another recent method used transcriptomic, proteomic, and epigenomic data to prioritize genes and found that, in a selection of nine phenotypes, the selected genes were enriched for Mendelian genes causing similar traits (Mountjoy et al., 2021). Together, these suggest that genes causing Mendelian traits also influence cognate complex traits, but not through the same coding mechanism.
Genes can also harbor coding variants tied to less severe forms of a trait. These coding variants are more difficult to identify individually as their effect sizes are much smaller. However, the greater number of variants (in aggregate) and freedom from searching for severe segregating traits allows the use of large population datasets. Backman et al., 2021 used burden testing on UK Biobank data to identify genes whose coding variation affects complex traits, finding many genes not identified through familial studies.
Data availability
Numerical data for results is included in Source Data 1. The dataset generated (GWAS summary statistics conditioned on coding variants) can be found at https://doi.org/10.5061/dryad.612jm644q. Code for this project is available at https://github.com/NJC12/missing_link_association_function (copy archived at swh:1:rev:46d9072b7cc13f6532203d1494eec4d0f634e092). UK Biobank data was used and is available by application to the UK Biobank Data Access Committee https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. TOPMed Whole Genome Sequencing Project - Freeze 5b, Phases 1 and 2 data was used and can be accessed at https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-project-freeze-5b-phases-1-and-2, full list of TOPMed URLs available in Supplementary file 3.
-
Dryad Digital RepositoryGWAS results conditioned on coding variants.https://doi.org/10.5061/dryad.612jm644q
References
-
A homozygous IER3IP1 mutation causes microcephaly with simplified gyral pattern, epilepsy, and permanent neonatal diabetes syndrome (MEDS)American Journal of Medical Genetics. Part A 158A:2788–2796.https://doi.org/10.1002/ajmg.a.35583
-
Mutations in PCSK9 cause autosomal dominant hypercholesterolemiaNature Genetics 34:154–156.https://doi.org/10.1038/ng1161
-
Mutations in the gene encoding the rER protein FKBP65 cause autosomal-recessive osteogenesis imperfectaAmerican Journal of Human Genetics 86:551–559.https://doi.org/10.1016/j.ajhg.2010.02.022
-
The role of regulatory variation in complex traits and diseaseNature Reviews. Genetics 16:197–212.https://doi.org/10.1038/nrg3891
-
Novel CENPJ mutation causes seckel syndromeJournal of Medical Genetics 47:411–414.https://doi.org/10.1136/jmg.2009.076646
-
Confirmation of a new phenotype in an individual with a variant in the last part of exon 30 of CREBBPAmerican Journal of Medical Genetics. Part A 179:634–638.https://doi.org/10.1002/ajmg.a.61052
-
Germline mutations in HRAS proto-oncogene cause Costello syndromeNature Genetics 37:1038–1040.https://doi.org/10.1038/ng1641
-
Gain-Of-Function mutations in RIT1 cause Noonan syndrome, a Ras/MAPK pathway syndromeAmerican Journal of Human Genetics 93:173–180.https://doi.org/10.1016/j.ajhg.2013.05.021
-
Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneityAmerican Journal of Human Genetics 109:223–239.https://doi.org/10.1016/j.ajhg.2022.01.002
-
Negative feedback buffers effects of regulatory variantsMolecular Systems Biology 11:785.https://doi.org/10.15252/msb.20145844
-
An integrated approach to identify environmental modulators of genetic risk factors for complex traitsAmerican Journal of Human Genetics 108:1866–1879.https://doi.org/10.1016/j.ajhg.2021.08.014
-
Multiple Slc26a2 mutations occurring in a three-generational familyEuropean Journal of Medical Genetics 61:24–28.https://doi.org/10.1016/j.ejmg.2017.10.007
-
Whole exome sequencing identifies three novel mutations in ANTXR1 in families with GAPO syndromeAmerican Journal of Medical Genetics Part A 164:2328–2334.https://doi.org/10.1002/ajmg.a.36678
-
Rna processing defects of the helicase gene RecQL4 in a compound heterozygous Rothmund-Thomson patientAmerican Journal of Medical Genetics. Part A 120A:395–399.https://doi.org/10.1002/ajmg.a.20154
-
Defective IL10 signaling defining a subgroup of patients with inflammatory bowel diseaseThe American Journal of Gastroenterology 106:1544–1555.https://doi.org/10.1038/ajg.2011.112
-
Association between a polymorphism in the carboxyl ester lipase gene and serum cholesterol profileEuropean Journal of Human Genetics 12:627–632.https://doi.org/10.1038/sj.ejhg.5201204
-
Controlling the false discovery rate: a practical and powerful approach to multiple testingJournal of the Royal Statistical Society 57:289–300.https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
-
Diverse growth hormone receptor gene mutations in Laron syndromeAmerican Journal of Human Genetics 52:998–1005.
-
Further evidence of the importance of RIT1 in Noonan syndromeAmerican Journal of Medical Genetics Part A 164:2952–2957.https://doi.org/10.1002/ajmg.a.36722
-
Mutations in the pre-replication complex cause meier-gorlin syndromeNature Genetics 43:356–359.https://doi.org/10.1038/ng.775
-
Mrnas, proteins and the emerging principles of gene expression controlNature Reviews. Genetics 21:630–644.https://doi.org/10.1038/s41576-020-0258-4
-
A restricted spectrum of mutations in the Smad4 tumor-suppressor gene underlies myhre syndromeAmerican Journal of Human Genetics 90:161–169.https://doi.org/10.1016/j.ajhg.2011.12.011
-
Germline missense mutations affecting KRAS isoform B are associated with a severe Noonan syndrome phenotypeAmerican Journal of Human Genetics 79:129–135.https://doi.org/10.1086/504394
-
The p.leu167del mutation in ApoE gene causes autosomal dominant hypercholesterolemia by down-regulation of LDL receptor expression in hepatocytesThe Journal of Clinical Endocrinology and Metabolism 101:2113–2121.https://doi.org/10.1210/jc.2015-3874
-
Twelve years of samtools and bcftoolsGigaScience 10:giab008.https://doi.org/10.1093/gigascience/giab008
-
Meier-Gorlin syndrome genotype-phenotype studies: 35 individuals with pre-replication complex gene mutations and 10 without molecular diagnosisEuropean Journal of Human Genetics 20:598–606.https://doi.org/10.1038/ejhg.2011.269
-
Mutations in cohesin complex members SMC3 and SMC1A cause a mild variant of Cornelia de Lange syndrome with predominant mental retardationAmerican Journal of Human Genetics 80:485–494.https://doi.org/10.1086/511888
-
Rad21 mutations cause a human cohesinopathyAmerican Journal of Human Genetics 90:1014–1027.https://doi.org/10.1016/j.ajhg.2012.04.019
-
Differential association of two PTPN22 coding variants with Crohn’s disease and ulcerative colitisInflammatory Bowel Diseases 17:2287–2294.https://doi.org/10.1002/ibd.21630
-
Landscape of conditional eQTL in dorsolateral prefrontal cortex and co-localization with schizophrenia GWASAmerican Journal of Human Genetics 102:1169–1184.https://doi.org/10.1016/j.ajhg.2018.04.011
-
Chromatin-state discovery and genome annotation with chromhmmNature Protocols 12:2478–2492.https://doi.org/10.1038/nprot.2017.124
-
In frame fibrillin-1 gene deletion in autosomal dominant weill-marchesani syndromeJournal of Medical Genetics 40:34–36.https://doi.org/10.1136/jmg.40.1.34
-
Tnf and IL10 SNPs act together to predict disease behaviour in Crohn'’ diseaseJournal of Medical Genetics 42:523–528.https://doi.org/10.1136/jmg.2004.027425
-
Atg16L1 T300A shows strong associations with disease subgroups in a large Australian IBD population: further support for significant disease heterogeneityThe American Journal of Gastroenterology 103:2519–2526.https://doi.org/10.1111/j.1572-0241.2008.02023.x
-
Phenotype-Specific enrichment of Mendelian disorder genes near GWAS regions across 62 complex traitsAmerican Journal of Human Genetics 103:535–552.https://doi.org/10.1016/j.ajhg.2018.08.017
-
Familial hyperglycemia due to mutations in glucokinase: definition of a subtype of diabetes mellitusThe New England Journal of Medicine 328:697–702.https://doi.org/10.1056/NEJM199303113281005
-
System-Wide molecular evidence for phenotypic buffering in ArabidopsisNature Genetics 41:166–167.https://doi.org/10.1038/ng.308
-
A new gene, EVC2, is mutated in Ellis-van Creveld syndromeMolecular Genetics and Metabolism 77:291–295.https://doi.org/10.1016/s1096-7192(02)00178-6
-
Fast principal-component analysis reveals convergent evolution of ADH1B in europe and east asiaAmerican Journal of Human Genetics 98:456–472.https://doi.org/10.1016/j.ajhg.2015.12.022
-
Population structure of UK biobank and ancient eurasians reveals adaptation at genes influencing blood pressureAmerican Journal of Human Genetics 99:1130–1139.https://doi.org/10.1016/j.ajhg.2016.09.014
-
Novel variants of the IL-10 receptor 1 affect inhibition of monocyte TNF-alpha productionJournal of Immunology 170:5578–5582.https://doi.org/10.4049/jimmunol.170.11.5578
-
Cell-Type heterogeneity in adipose tissue is associated with complex traits and reveals disease-relevant cell-specific eQTLsAmerican Journal of Human Genetics 104:1013–1024.https://doi.org/10.1016/j.ajhg.2019.03.025
-
Inflammatory bowel disease and mutations affecting the interleukin-10 receptorThe New England Journal of Medicine 361:2033–2045.https://doi.org/10.1056/NEJMoa0907206
-
Activating mutations in the gene encoding the ATP-sensitive potassium-channel subunit kir6.2 and permanent neonatal diabetesThe New England Journal of Medicine 350:1838–1849.https://doi.org/10.1056/NEJMoa032922
-
Familial hyperalphalipoproteinemiaArchives of Internal Medicine 135:1025–1028.
-
Mutations of the growth hormone receptor in children with idiopathic short statureNew England Journal of Medicine 333:1093–1098.https://doi.org/10.1056/NEJM199510263331701
-
Contribution of RIT1 mutations to the pathogenesis of Noonan syndrome: four new cases and further evidence of heterogeneityAmerican Journal of Medical Genetics Part A 164:2310–2316.https://doi.org/10.1002/ajmg.a.36646
-
The -250G > a promoter variant in hepatic lipase associates with elevated fasting serum high-density lipoprotein cholesterol modulated by interaction with physical activity in a study of 16,156 Danish subjectsThe Journal of Clinical Endocrinology and Metabolism 93:2294–2299.https://doi.org/10.1210/jc.2007-2815
-
Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseasesAmerican Journal of Human Genetics 95:535–552.https://doi.org/10.1016/j.ajhg.2014.10.004
-
Defective mutations in the insulin promoter factor-1 (IPF-1) gene in late-onset type 2 diabetes mellitusThe Journal of Clinical Investigation 104:R41–R48.https://doi.org/10.1172/JCI7469
-
Clinical and molecular genetic analysis of 19 Wolfram syndrome kindreds demonstrating a wide spectrum of mutations in WFS1American Journal of Human Genetics 65:1279–1290.https://doi.org/10.1086/302609
-
Prenatal diagnosis of diastrophic dysplasia with polymorphic DNA markersJournal of Medical Genetics 30:265–268.https://doi.org/10.1136/jmg.30.4.265
-
Phenotypic and molecular characterization of bruck syndrome (osteogenesis imperfecta with contractures of the large joints) caused by a recessive mutation in PLOD2American Journal of Medical Genetics Part A 131A:115–120.https://doi.org/10.1002/ajmg.a.30231
-
An apolipoprotein CII mutation, ciilys19 -- -- ’hr' identified in patients with hyperlipidemiaDisease Markers 9:73–80.
-
Mutations in SRCAP, encoding SNF2-related CREBBP activator protein, cause floating-harbor syndromeAmerican Journal of Human Genetics 90:308–313.https://doi.org/10.1016/j.ajhg.2011.12.001
-
Colocalization of GWAS and eQTL signals detects target genesAmerican Journal of Human Genetics 99:1245–1260.https://doi.org/10.1016/j.ajhg.2016.10.003
-
Progeroid facial features and lipodystrophy associated with a novel splice site mutation in the final intron of the FBN1 geneAmerican Journal of Medical Genetics Part A 155:721–724.https://doi.org/10.1002/ajmg.a.33905
-
Probabilistic colocalization of genetic variants from complex and molecular traits: promise and limitationsAmerican Journal of Human Genetics 108:25–35.https://doi.org/10.1016/j.ajhg.2020.11.012
-
Somatic and germline mosaicism for a R248C missense mutation in FGFR3, resulting in a skeletal dysplasia distinct from thanatophoric dysplasiaAmerican Journal of Medical Genetics. Part A 120A:157–168.https://doi.org/10.1002/ajmg.a.20012
-
The -514 C- > T hepatic lipase promoter region polymorphism and plasma lipids: a meta-analysisThe Journal of Clinical Endocrinology and Metabolism 89:3858–3863.https://doi.org/10.1210/jc.2004-0188
-
Targeted mutation of plasma phospholipid transfer protein gene markedly reduces high-density lipoprotein levelsThe Journal of Clinical Investigation 103:907–914.https://doi.org/10.1172/JCI5578
-
Mutations in FKBP10 cause recessive osteogenesis imperfecta and Bruck syndromeJournal of Bone and Mineral Research 26:666–672.https://doi.org/10.1002/jbmr.250
-
Wfs1/Wolframin mutations, Wolfram syndrome, and associated diseasesHuman Mutation 17:357–367.https://doi.org/10.1002/humu.1110
-
Ptpn11 (protein-tyrosine phosphatase, nonreceptor-type 11) mutations in seven Japanese patients with Noonan syndromeThe Journal of Clinical Endocrinology and Metabolism 87:3529–3533.https://doi.org/10.1210/jcem.87.8.8694
-
Transcriptional bursting in gene expression: analytical results for general stochastic modelsPLOS Computational Biology 11:e1004292.https://doi.org/10.1371/journal.pcbi.1004292
-
Regulation of gene expression via translational bufferingBiochimica et Biophysica Acta. Molecular Cell Research 1869:119140.https://doi.org/10.1016/j.bbamcr.2021.119140
-
A single molecule view of gene expressionTrends in Cell Biology 19:630–637.https://doi.org/10.1016/j.tcb.2009.08.008
-
Mutations in the TGFβ binding-protein-like domain 5 of FBN1 are responsible for acromicric and geleophysic dysplasiasAmerican Journal of Human Genetics 89:7–14.https://doi.org/10.1016/j.ajhg.2011.05.012
-
Mutations of Smad4 account for both LAPS and myhre syndromesAmerican Journal of Medical Genetics. Part A 158A:1520–1521.https://doi.org/10.1002/ajmg.a.35374
-
Missense mutations in the insulin promoter factor-1 gene predispose to type 2 diabetesThe Journal of Clinical Investigation 104:R33–R39.https://doi.org/10.1172/JCI7449
-
Integrating gene expression with summary association statistics to identify genes associated with 30 complex traitsAmerican Journal of Human Genetics 100:473–487.https://doi.org/10.1016/j.ajhg.2017.01.031
-
Systematic localization of common disease-associated variation in regulatory DNAScience 337:1190–1195.