1. Computational and Systems Biology
  2. Genetics and Genomics
Download icon

Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals

  1. Agata Wesolowska-Andersen
  2. Grace Zhuo Yu
  3. Vibe Nylander
  4. Fernando Abaitua
  5. Matthias Thurner
  6. Jason M Torres
  7. Anubha Mahajan
  8. Anna L Gloyn
  9. Mark I McCarthy  Is a corresponding author
  1. Wellcome Centre for Human Genetics, United Kingdom
  2. University of Oxford, United Kingdom
  3. Churchill Hospital, United Kingdom
Research Article
  • Cited 3
  • Views 2,152
  • Annotations
Cite this article as: eLife 2020;9:e51503 doi: 10.7554/eLife.51503

Abstract

Genome-wide association analyses have uncovered multiple genomic regions associated with T2D, but identification of the causal variants at these remains a challenge. There is growing interest in the potential of deep learning models - which predict epigenome features from DNA sequence - to support inference concerning the regulatory effects of disease-associated variants. Here, we evaluate the advantages of training convolutional neural network (CNN) models on a broad set of epigenomic features collected in a single disease-relevant tissue – pancreatic islets in the case of type 2 diabetes (T2D) - as opposed to models trained on multiple human tissues. We report convergence of CNN-based metrics of regulatory function with conventional approaches to variant prioritization – genetic fine-mapping and regulatory annotation enrichment. We demonstrate that CNN-based analyses can refine association signals at T2D-associated loci and provide experimental validation for one such signal. We anticipate that these approaches will become routine in downstream analyses of GWAS.

Introduction

Genome-wide association studies (GWAS) have identified over 400 independent signals implicated in genetic susceptibility to type 2 diabetes (T2D) (Mahajan et al., 2018). However, efforts to derive biological insights from these signals face the challenge of identifying the functional, causal variants driving these associations within the sets of credible variants defined by the linkage disequilibrium (LD) structure at each locus. It remains far from trivial to assign mechanisms of action at these loci: most associated variants map to non-coding sequence, the implication being that they influence disease risk through the transcriptional regulation of one or more of the nearby genes.

There is mounting evidence that disease-associated variants are likely to perturb genes and regulatory modules that are of specific importance within disease-relevant cell types or tissues (Marbach et al., 2016; Battle et al., 2017). For example, several studies have reported significant enrichment of T2D GWAS variants within pancreatic islet enhancer regions (Parker et al., 2013; Pasquali et al., 2014), with that enrichment particularly concentrated in subsets of islet enhancers characterised by open chromatin and hypomethylation (Thurner et al., 2018), and clustered in 3D enhancer hub structures (Miguel-Escalada et al., 2018). Pancreatic islets represent a key tissue for the maintenance of normal glucose homeostasis, and uncovering islet-specific regulatory mechanisms is therefore critical to understanding T2D aetiology and pathogenesis.

Several studies have generated genome-wide epigenomic profiling datasets of whole human pancreatic islets, and/or FACS-sorted individual islet cell types (Bhandare et al., 2010; Gaulton et al., 2010; Stitzel et al., 2010; Parker et al., 2013; Pasquali et al., 2014; Maher, 2012; Thurner et al., 2018; Bramswig et al., 2013; Ackermann et al., 2016). This wealth of genomic data provides a valuable resource for studying the regulatory machinery of human pancreatic islets, and has proven instrumental in prioritising disease-associated variants by considering their overlap with regulatory elements enriched in disease-associated signals (Huang et al., 2017; Thurner et al., 2018). However, in cases where multiple associated variants in high LD reside in the same regulatory region, a method with higher resolution is needed to resolve the causal variant. High-throughput massively parallel reporter assays (MPRA) offer one solution for the empirical assessment of putatively functional variants (Ulirsch et al., 2016; Tewhey et al., 2016), but they are expensive to deploy genome-wide, and may not fully recapitulate the cellular context.

Convolutional neural networks (CNNs) are emerging as a powerful tool to study regulatory motifs in genomic data, and are well suited to extracting high-level information from high-throughput datasets de novo. Indeed, CNN frameworks have been shown to aid in prioritization of genomic variants based on their predicted effect on chromatin accessibility and modifications (Zhou and Troyanskaya, 2015; Kelley et al., 2016) or gene expression (Zhou et al., 2018; Kelley et al., 2018). The methods deployed so far learn the regulatory code de novo from genomic sequences of regulatory regions gathered from multiple tissues, using datasets provided by the ENCODE (Maher, 2012), the NIH Epigenome Roadmap (Bernstein et al., 2010) and GTEx (Battle et al., 2017) consortia. The derived models offer computational predictions of the likely regulatory effects of genomic variation based on disruption or creation of regulatory motifs discovered by the CNNs. While these multi-tissue methods offer an attractive, generally applicable, framework for variant prioritization, they may be missing nuances of the tissue-specific regulatory grammar, and may not be optimal for predictions of regulatory effects that are specific to disease-relevant tissues.

In the present study, we trained CNNs on a broad collection of genome-wide epigenomic profiles capturing chromatin regulatory features from human pancreatic islets, and applied the resulting models to predict the regulatory effects of sequence variants associated with T2D. We demonstrate that these tissue-specific CNN models recapitulate regulatory grammar specific to pancreatic islets, as opposed to discovering regulatory motifs common across multiple tissues. We apply the CNN models to predict islet regulatory variants among the credible sets of T2D-association signals, and demonstrate how CNN predictions can be integrated with genetic and functional fine-mapping approaches to provide single-base resolution of functional impact at T2D-associated loci.

Results

CNNs achieve high performance in predicting islet chromatin regulatory features

We collected 30 genome-wide epigenomic profiling annotations from human pancreatic islets, and their FACS-sorted cell subsets from previously published studies (Bhandare et al., 2010; Gaulton et al., 2010; Stitzel et al., 2010; Parker et al., 2013; Pasquali et al., 2014; Maher, 2012; Thurner et al., 2018; Bramswig et al., 2013; Ackermann et al., 2016)(Supplementary file 1-STable 1), and re-processed them uniformly with the same computational pipelines. The 1000 bp long genomic sequences encompassing the signal peaks were used, together with vectors representing presence/absence of the 30 islet epigenomic features within these regions, as inputs to train the multi-class prediction CNNs. The resulting CNN models predict presence of these 30 features within any 1000 bp long genomic sequence (Figure 1—figure supplement 1). Since the weights in neural network training are initialized randomly and then optimized during training, there is a considerable amount of heterogeneity in the predictive scores resulting from different iterations of the same training process, as networks may converge at different local optimal solutions. To improve robustness of results achieved with these models, we trained a total of 1000 CNNs with 10 different sets of hyperparameters differing in numbers of convolutional filters and their sizes to account for this stochastic heterogeneity (Supplementary file 1-STable 2).

Overall, CNNs achieved high performance in predicting the islet epigenomic features in sequences withheld from training, though we observed that the performance varied depending on the predicted feature (Figure 1, Figure 1—figure supplement 2). The best predictive performance was achieved for features related to promoters, transcription factor (TF) binding and DNA accessibility, with mean areas under receiver-operator curves (AUROC) of 0.95, 0.89 and 0.89, respectively. Histone mark features associated with active, coding regions, repressed regions and enhancers proved more difficult to predict based on their underlying genomic sequence, with mean AUROCs of 0.82, 0.79 and 0.77, respectively.

Figure 1 with 3 supplements see all
Area under precision-recall curves (AUPRC) for 30 islet epigenomic features predicted by CNN models.

The AUPRC values were calculated based on performance on the test set formed by 1000 bp sequences from chr2, held out from training and validation. The boxplots show summary of performance across 1000 individual CNN models, and are grouped by corresponding regulatory element. As the interpretation of AUPRC values depend on how well balanced the dataset it, we denote the class imbalance (equivalent to prediction of a random model) for each feature as open circles, which corresponds to the proportion of sequences with the given feature present. Features marked as ‘a_”, ‘b_”,”e_”, and ‘aci_”, were assayed in FACS-sorted cell populations rather than whole pancreatic islets, and correspond to alpha cells, beta cells, exocrine cells, and acinar cells, respectively (Bramswig et al., 2013; Ackermann et al., 2016).

In addition to area under receiver-operator curve (AUROC)(Figure 1—figure supplement 2), which was used to test predictive performance during CNN training, we also inspected the area under precision-recall curves (AUPRC)(Figure 1), a more appropriate measure when predicted classes are not well balanced. We observed that, while variable between features, the predictive performance of the CNN models for the chromatin regulatory features was high, and for the majority of predicted features far exceeded the performance of random predictors. Again, we observed most accurate predictions for promoters, open chromatin and TF binding features, with AUPRCs of 0.70, 0.65 and 0.48, respectively, while enhancers, as well as active, and repressed regions had lower accuracy, with mean AUPRCs of 0.24, 0.37 and 0.20, respectively.

Convolutional filters recover binding motifs of TFs with roles in pancreatic development

Convolutional filters of the first network layer capture local sequence patterns and motifs, aiding in predictions of islet regulatory features. We hypothesized that many of these would correspond to binding motifs of transcription factors (TFs) with roles in pancreatic islet development and function. For each convolutional filter of the first CNN layer, we derived a position weighted matrix (PWM) based on the observed nucleotide frequencies activating the filter, and compared them to a database of known TF binding motifs. We observed that the number of annotated motifs per network was positively correlated with the size of convolutional filters within the first layer, while the number of filters informative for the predictions (activation standard deviation >0) decreased with filter size (Figure 1—figure supplement 3). On average, only 29 out of 320 filters in first CNN layer were annotated to known TF binding motifs, but an average of 177 filters were informative for predictions, indicating that CNN models identify potentially novel sequence motifs not currently represented in the databases of known TF binding motifs.

In total, we identified 373 recurrent annotated binding motifs with <5% false-discovery rate (FDR) sequence similarity to the filters of the first CNN layer, which were detected in >50 networks (Supplementary file 1-STable 3). The 10 most frequently discovered non-redundant motifs are listed in Table 1. As expected, among the consistently-detected transcription factor motifs, we found motifs for all the transcription factors included in the ChIP-seq training datasets (CTCF, FOXA2, PDX1, MAFB, NKX2.2, NKX6.1). Additionally, CNNs discovered, ab initio, binding motifs for several TFs that are known to be important for pancreatic development and for the maintenance of beta and alpha cell functions, including RFX6, HNF1A and NEUROD1 (Jennings et al., 2015; van der Meulen and Huising, 2015). This demonstrates that the CNN models of the pancreatic islet epigenome are capable of discovering well-established islet TF motifs ab initio from genomic sequences.

Table 1
10 non-redundant transcription factors binding motifs most frequently detected by first layer convolutional filters at FDR < 5%.

Sequence logos of representative CNN filters are shown. Transcription factor binding motifs redundancy was removed with Tomtom motif similarity search with other motifs detected by CNNs with q < 0.05 for similarity to the main motif are listed in the last column; only three motifs with highest similarity are listed.

Motif name/TFRepresentative
CNN filter logo
Motif logoCNNs with filter
match q < 0.05
Similar TF motifs discovered
M6114_1.02
FOXA1


838M6234_1.02 FOXA3, M6241_1.02 FOXJ2, M4567_1.02 FOXA2…
M4427_1.02
CTCF


833M4612_1.02 CTCFL
M1906_1.02
SP1


677M2314_1.02 SP2, M6482_1.02 SP3, M6535_1.02 WT1…
M2296_1.02
MAFK


629M4629_1.02 NFE2, M4572_1.02 MAFF, M4681_1.02 BACH2…
M2292_1.02
JUND


571M4623_1.02 JUNB, M2278_1.02 FOS, M4619_1.02 FOSL1…
M1528_1.02
RFX6


556M4476_1.02 RFX5, M1529_1.02 RFX7, M5777_1.02 RFX4…
M4640_1.02
ZBTB7A


530M6539_1.02 ZBTB7B, M6552_1.02 ZNF148, M6422_1.02 PLAGL1…
M1970_1.02
NFIC


484M5664_1.02 NFIX, M5660_1.02 NFIA, M5662_1.02 NFIB
M2277_1.02
FLI1


442M6222_1.02 ETV4, M2275_1.02 ELF1, M5398_1.02 ERF…
M6281_1.02
HNF1A


418M6282_1.02 HNF1B, M6546_1.02 ZFHX3

Islet CNN models prioritize T2D-associated variants with regulatory roles in pancreatic islets

The regulatory effects of genomic variants can be approximated by comparing the CNN predictions for genomic sequences including different alleles of the same variant. Here, we applied the islet CNN models for prioritization of T2D-associated variants from a recently published GWAS study (Mahajan et al., 2018). This study of ~900,000 cases and controls of European ancestry, identified 403 T2D-risk signals, and performed genetic fine-mapping for 380 of them. We ran CNN predictions for all 109,779 variants included within the 99% credible variant sets for these signals, averaging the regulatory predictions for each variant and feature across 1000 individual CNN models to increase robustness. Variants most likely to influence the islet epigenome were then identified through the cumulative distribution function for the normal distribution, separately for each predicted feature, and the lowest q-value for any of the features was assigned to a variant to signify its overall regulatory potential.

We identified a total of 11,389 variants with a q-value <0.05 for any of the 30 features, approximately 10% of the total number of credible set variants. Broadly similar numbers of regulatory variants fell within the different groups of chromatin features (Figure 2A). Variants with q < 0.05 were significantly more likely to be evolutionarily conserved than the credible set variants with overall q > 0.05, as assessed by one-sided Wilcoxon rank sum test of the GERP scores (Cooper et al., 2005) (p=7.3e-04).

Functional characterization of CNN-predicted regulatory variants.

(A) Distribution of CNN-predicted regulatory variants (q < 0.05) in the six broader CNN feature groups. (B) Enrichment of variants predicted to affect the CNN feature groups within variant list ranked by eQTL result p-value from the InsPIRE study (Viñuela et al., 2019). Enrichment was calculated with R package gage (Luo et al., 2009), red bars indicate gene-set enrichment at the top (and blue at the bottom) of the eQTL p-value -ranked list of variants. (C) Predicted regulatory variants reside in regulatory elements they are predicted to affect. For each variant we found the lowest CNN q-value among feature groups corresponding to different regulatory elements (promoters, enhancers, open chromatin, active regions, TF binding, repressed regions) predicted from genomic sequence, and we ranked all variants according to these six q-values. We then tested whether variants residing in each of the 15 pancreatic islet chromatin states (Thurner et al., 2018) were enriched at the top or bottom of these ranked lists using gene-set enrichment analysis implemented in the R package gage (Luo et al., 2009). Colours in the heatmap represent the strength of the enrichment expressed as log10-transformed enrichment q-values, with red colours representing enrichments at the top (enrichment), and blue at the bottom of the ranked lists (depletion). For plotting purposes all -log10(p-values) below −50, or above 50 were truncated to these values. Stars denote significant enrichments: *<0.05, **<0.01 and ***<0.001. Variant level functional annotations and CNN predictions for the credible set variants are available as Figure 2—source data 1.

To further validate the functional inference from the islet CNN, we compared the CNN predictions at the credible set variants with results from a recent cis-eQTL study of human islet samples from 420 donors performed by the InsPIRE consortium (Viñuela et al., 2019). In this analysis, 91 of the 403 T2D GWAS signals from the largest published meta-analysis (Mahajan et al., 2018) overlapped an islet eQTL (defined as p<10e-08): of these, we identified overlapping regulatory variants (as predicted by the CNN q < 0.05) at 73 (~80%). We considered the enrichments of variants predicted to affect the six different groups of chromatin features, and found that variants predicted to affect promoter or enhancer activities in islets were preferentially enriched among the top pancreatic islet eQTL results (Figure 2B).

We hypothesized that CNN-predicted regulatory variants would also be more likely to show allelic imbalance in chromatin accessibility. We used ATAC-seq data from a previously-published dataset of 17 human pancreatic islets (Thurner et al., 2018) to identify 137 pancreatic islet chromatin accessibility QTLs (caQTLs) among the credible set variants, and found these to be significantly enriched among the variants with the lowest CNN q-scores (p=6.9e-20).

Finally, we reasoned that variants predicted to affect function of specific regulatory elements would be more likely to reside within them (e.g. a variant predicted to disrupt enhancer function should be residing in an enhancer region). We tested this by investigating overlap of predicted regulatory variants with human pancreatic islet chromatin state maps (Thurner et al., 2018). For this purpose, we considered the lowest q-values within each of the earlier described groups of regulatory features, corresponding to promoters, enhancers, open chromatin, transcription factor binding, as well as active and repressed regulatory regions (Figure 2C). Overall, we observed good agreement between the predicted disrupted regulatory elements and the variant located within them. Additionally, we observed a depletion of regulatory variants within heterochromatin and other low methylation sites.

These findings provide validation that variants predicted, on the basis of the CNN, to have the greatest impact on islet regulatory function, were enriched for overlap with regions previously characterized of particular importance for the regulation of islet transcription. However, the regulatory signals captured by cis-eQTL analyses have the disadvantage that they often provide poor localization of the regulatory variant to patterns of local LD. Similarly, chromatin state data lack resolution beyond the 200 bp intervals used for data aggregation. In contrast, the predictions from CNN models offer single-nucleotide resolution.

Convergence between CNN predictions and fine-mapping approaches

If the CNN models are correctly identifying regulatory variants, we would expect to see convergence between variants predicted to have regulatory effects based on the CNNs, and those assigned high genetic posterior probabilities of association (gPPA) from genetic fine-mapping. To test for such convergence, we generated the null distribution of randomly distributed regulatory variants through 1000 permutations of CNN q-values, while preserving the structure of credible sets at the 380 T2D-associated signals. We observed enrichment of islet regulatory variants (CNN q < 0.05) among variants with highest PPAs (Figure 3A), compared to the permutation-based random distribution of regulatory variants (p=0.001). Overall, we found that 28.8% of variants with gPPAs > = 0.8 had predicted regulatory effects with q < 0.05.

Figure 3 with 1 supplement see all
Convergence between CNN regulatory predictions and fine-mapping approaches for functional variant prioritization.

(A) Regulatory variants (black) are enriched among variants with highest genetic PPAs (gPPAs) over permuted background (blue). (B) Regulatory variants (black) are enriched among variants with highest functional PPAs (fPPAs) generated with FGWAS over permuted background (blue). (C) Regulatory variants (black) are enriched among variants with top PPA ranks within 99% sets of credible variants over permuted background (blue), as well as at top ranks of signals acting through insulin secretion (red) over insulin action (purple) mechanisms.

It is standard to complement genetic fine-mapping with information from the genome-wide enrichment of association signals within regulatory annotations in disease-relevant tissues, deriving functional posterior probabilities of association (fPPAs) that combine genetic and epigenomic insights into variant function (Huang et al., 2017). As with the gPPAs, we observed that variants with high fPPAs, obtained through incorporating enrichments in regulatory elements from human pancreatic islet chromatin state maps (Thurner et al., 2018), were enriched for CNN-predicted regulatory effects (p=0.001)(Figure 3B). Overall, 40.6% of variants with fPPAs > = 0.8 had predicted regulatory effects with q < 0.05.

As fine-mapping resolution varies between signals, we conducted analogous analyses based on variant rank within each credible set, irrespective of the quantitative PPA value (Figure 3C). Again, we observed higher proportions of predicted regulatory variants at higher PPA ranks within fine-mapped loci, when compared to random distribution of regulatory variants.

One might expect that CNN models trained with pancreatic islet epigenomic annotations would display the strongest evidence for prediction of regulatory variants at the subset of T2D GWAS signals characterized by defects in insulin secretion (Dimas et al., 2014; Wood et al., 2017), signals that are likely mediated through events in pancreatic islets. Indeed, we observed that the enrichment of islet CNN-regulatory variants was more marked within the top ranks of insulin secretion signals. In contrast, T2D signals characterized by a primary defect in insulin action (which typically involve mechanisms in liver, fat and muscle) showed no enrichment over the permuted background (Figure 3C).

Collectively, these data corroborate the convergence of the agnostically-derived CNN variant regulatory scores and diverse measures of islet biology, and indicate the potential for CNNs to support causal variant prioritization in T2D. They also emphasize the value of functional analyses that take account of the tissue-specificity of both transcriptional regulation and disease pathogenesis.

Islet CNN models correctly identify known functional variants at T2D-associated loci

We tested whether the CNN regulatory predictions for individual variants can be integrated with previous fine-mapping approaches to further resolve T2D-associated signals. Overall, we found that, at 327 out of the 380 fine-mapped signals from the most recent European T2D GWAS study (Mahajan et al., 2018), there was at least one predicted regulatory variant (defined as q < 0.05). Among the 74 signals previously fine-mapped to a single variant (with gPPA or fPPA > = 80%), we found 28 variants predicted to be regulatory in pancreatic islets (Supplementary file 1-STable 4), in line with the previously reported overall enrichment of T2D-association signals in the regulatory elements specific to islets (Parker et al., 2013; Pasquali et al., 2014; Thurner et al., 2018; Miguel-Escalada et al., 2018). These included two well-studied variants (rs10830963 at MNTR1B, rs7903146 at TCF7L2), both previously shown to alter enhancer activities (Gaulton et al., 2015; Gaulton et al., 2010): these served as positive controls for the application of CNNs to functional variant prioritization.

Islet CNN models refine regulatory mechanisms at T2D-associated loci

While functional fine-mapping can be invaluable in narrowing down the list of most likely causal variants through investigating overlaps with regulatory elements in appropriate tissues, this strategy may not provide sufficient resolution at loci where several variants reside in the same regulatory element. Among the credible sets of variants, we identified 93 signals featuring at least two variants with fPPAs > = 20%, indicating that, even after functional fine-mapping, there is no unambiguously causal single variant. At 37 of these 93 signals, CNNs predicted islet regulatory variants among the top fPPA candidates (Supplementary file 1-STable 5). At 25 signals, integration of CNN regulatory predictions downstream of the functional fine-mapping highlighted a single most likely causal variant, with either just a single islet regulatory variant predicted among the credible variants, or the top regulatory variant having much lower q-value (difference in -log10(q) > 100) than other predicted regulatory variants at the signal (a few such signals are highlighted in Figure 4 and Figure 5), narrowing down the list of candidates for further functional follow-up studies.

Examples of T2D-association signals where integration of CNN regulatory variant prediction downstream of functional fine-mapping refines the association signals to single candidate variants.

Genetic PPAs (gPPAs) are shown in the top panels as blue points, functional PPAs (fPPAs) are shown in the middle panels as green points, and -log10-transformed q-values from CNN predictions are shown in the bottom panels as red points.

CNN regulatory predictions help refine the association signal at PROX1 locus, previously fine-mapped to only two variants: rs17712208 and rs79687284.

(A) Genetic PPA (gPPA), functional PPA (fPPA) and -log10(q-value) of the CNN islet regulatory predictions for both variants. (B) Allelic imbalance in open chromatin across four pancreatic islets heterozygous for the variants. Allele counts for the major (grey) and minor (red) alleles are shown for both variants. (C) Table summary with CNN predictions for the H3K27ac mark for both variants. (D) In silico saturated mutagenesis for 40nt flanking sequence around the rs17712208 variant for the H3K27ac predictions. The line plots in the upper panel indicate the SAD (SNP accessibility difference) scores corresponding to absolute highest values from the heatmap below, with blue line indicating loss of function, and red – gain of function changes. Blue fields in the heatmap indicate that a given nucleotide substitution would result in decrease in prediction values for H3K27ac, while red field indicate increase in the predictions. The height of letters in the sequence below the heatmap indicated the relative importance of each nucleotide in the final predictions. (E) Matched HNF1B binding motif is shown below. (F) Luciferase reporter assays confirmed that the A allele of rs17712208 resulted in significant repression of enhancer activity, while no effect was observed for the rs79687284 variant. GFP = green fluorescent protein (negative control), EV = empty vector (baseline). Source file with luciferase intensity values is available as Figure 5—source data 1.

To explore this further, we focused on a T2D-association signal at the PROX1 locus, identified after conditioning the T2D association on the primary (most significant) association signal at rs340874; on the basis of patterns of phenotypic association of the T2D-risk allele with continuous diabetes-related traits, this variant is associated with a primary effect on insulin secretion (Dimas et al., 2014). This conditional signal has been fine-mapped to two plausible variants, rs79687284 and rs17712208, on the basis of genetic and genomic data (Figure 5A). These variants are in perfect LD (R2 = 1.0, D’=1.0) and are located 376 bp apart within the same open strong enhancer in islets. Neither genetic fine-mapping (in European populations) nor functional fine-mapping were able to further resolve the association at this signal. When we investigated pancreatic islet ATAC-seq data from four individuals heterozygous for these two variants (as confirmed by array genotype data), we observed strong allelic imbalance at both variants (rs17712208 p=1.55e-06; rs79687284 p=6.10e-05) (Figure 5B). This supports regulatory effects of this PROX1 signal in the pancreatic islets, but highlights the difficulties in resolving the causal variant. These data are also consistent with the possibility that both variants are contributing to the functional effect.

At this signal, both variants were scored as potentially regulatory by CNNs with q < 0.05, but the q-value at rs17712208 was much lower (q = 1.69e-160) highlighting this variant as the more likely regulatory candidate in the islets. This q-value corresponded to a prediction that the T2D risk A-allele would lead to a significant reduction in the H3K27ac mark (△H3K27ac = −0.12) (Figure 5C) indicative of an active regulatory element presence. The second variant, rs79687284, was assigned a much less remarkable q-value (q = 0.002) and only predicted to affect promoter features, with the top predictions pointing to a mild reduction of H3K4me3 mark in alpha and beta cells (△a_H3K4me3 = −0.01, △b_H3K4me3 = −0.01).

While it remains plausible that rs79687284 plays a regulatory function in a specific cellular context, the functional annotation and CNN predictions point to rs17712208 as the more likely causal variant (particularly given that both map to an enhancer rather than a promoter). We performed an in silico saturated mutagenesis for the rs17712208 variant (Figure 5D) to identify the regulatory sequence motifs at this locus affected by the variant. We observed that the reference T-allele is a crucial nucleotide in the HNF1B binding motif (Figure 5E), and that introduction of an A-allele at this position disrupts this motif, leading to predicted loss of the H3K27ac mark. We note that both PROX1 and HNF1B are co-expressed during pancreatic development in a previously published iPSC-derived model of beta cell differentiation (Perez-Alcantara et al., 2018).

To obtain experimental confirmation, we assessed the transcriptional activity of both variants in the human EndoC-βh1 beta-cell line (Ravassard et al., 2011). This is a widely-used model of beta-cell function, which expresses both PROX1 and HNF1B, and represents the majority cell type in the islets. Using an in vitro reporter assay, we confirmed that the A-allele at rs17712208 resulted in significant repression of enhancer activity (p=0.0001), while no significant change was observed for the T2D-risk allele at rs79687284 (p=0.15)(Figure 5F).

These examples illustrate how diverse approaches to variant prioritization at the T2D-associated loci show strong convergence towards the same candidate variants, and how CNN models can complement fine-mapping approaches in providing single-base resolution to refine the association signals. Coupled with additional evidence that these signals exert their mechanism of action in the pancreatic islets, rather than other tissues implicated in T2D aetiology, the predicted regulatory variants may provide attractive targets for further functional follow-up studies.

Comparison of tissue-specific with multi-tissue CNN model

Finally, we compared the regulatory variant predictions made with the pancreatic islet tissue-specific CNN models described in this study with predictions made with a widely-used multi-tissue variant prioritization tool, DeepSEA (Zhou and Troyanskaya, 2015) (Figure 3—figure supplement 1 - C). We compared our results to the multi-tissue significance scores reported by DeepSEA for each variant, as well as to the predicted effects on pancreatic islet chromatin accessibility (based on ENCODE DNase-seq data [‘PanIslets’] generated from primary pancreatic islet cells, one of the tissues contributing to the overall multi-tissue score). Overall, we observed a modest but highly significant correlation of predictions with both these regulatory prediction scores reported by DeepSEA (multi-tissue: r = 0.227, p<2.2e-16; PanIslets: r = 0.223, p<2.2e-16) (Figure 3—figure supplement 1 - AB).

As with our islet CNN models, we observed strong enrichment of predicted regulatory variants amongst the top ranks of the T2D GWAS 99% credible variant sets for both modes of DeepSEA, as compared to permuted background (Figure 3—figure supplement 1 - C). However, both the multi-tissue and PanIslets DeepSEA models performed better within the subset of T2D signals acting through insulin action rather than through insulin secretion signals, predicting higher proportions of regulatory variants at top ranks of credible sets for the former, contrary to our expectations for the PanIslets model (Figure 3—figure supplement 1 - D). In addition, the DeepSEA framework failed to predict regulatory impact at loci where T2D causative variants have been established through multiple lines of evidence (such as at rs10830962 at MTNR1B (Gaulton et al., 2015) and rs7903146 at TCF7L2 Gaulton et al., 2010), both of which were correctly identified as regulatory in our islet-based study. Similarly, while the rs17712208 variant at PROX1 reported in this study was found to be regulatory using the multi-tissue mode of DeepSEA (p=6.25e-03), the predictions for pancreatic islets accessibility profiles (PanIslets) were only borderline significant for this variant (p=0.049).

Our in silico saturated mutagenesis analysis indicated NEUROD1 as the likely motif created at the MTNR1B locus (not shown), and HNF1B as the motif disrupted at PROX1 (Figure 5E). These transcription factors are critical for the function of pancreatic islet beta cells (Gu et al., 2010), and likely represent tissue-specific processes, more liable to be missed in a multi-tissue model. We anticipate that CNN models trained on data from multiple tissues may be biased towards sequence features and motifs present across all the tissues, and may not detect tissue-specific signals with high confidence. Given the demonstrated enrichment of T2D association signals in regulatory elements specific to the pancreatic islets, we demonstrate the value of training the prediction models in the tissues most relevant to the phenotype in question, at least for those loci where the tissue-of-action can be defined.

Discussion

Application of deep learning methods to characterize the regulatory potential of noncoding variants has been a subject of interest in recent years. Instead of predictions across an array of tissues, in the current study we focused on multiple genome-wide epigenomic annotations available for pancreatic islets, one of the central tissues impacting T2D aetiology. We demonstrate how the CNN regulatory predictions for genomic variants can be integrated downstream of fine-mapping approaches for further refinement of GWAS association signals.

Overall, we observed that CNNs were capable of learning the regulatory code from the underlying genomic sequence, though the prediction accuracy differed among the studied epigenomic features. The individual feature predictive performance was likely affected by a number of factors, in particular the quality of input data, and the total number of peaks called for a feature, as deep learning methods require many training examples to learn to generalize well. Furthermore, epigenomic features differ in their characteristics, including peak width and specificity of sequence motifs predictive of their presence. Thus, when training a multi-task prediction network, we may expect that network parameter optimization might not work equally well for disparate features. Promoter regions, DNA accessibility and binding of specific transcription factors proved the easiest to predict, as these features are characterized by very distinct sequence motifs that can be identified by the network’s convolutional filters. Importantly, the binding motifs learned de novo from the genomics sequences of signal peaks presented to the networks included several transcription factors with well-established functions in pancreatic development and function.

In the present study, we trained 1000 individual CNN models with different hyperparameters to aid in prioritization of the likely functional variants from the largest T2D GWAS study to date (Mahajan et al., 2018). We observed an overall convergence of the genetic fine-mapping, functional fine-mapping and regulatory predictions generated by the tissue-specific CNN models, as evidenced by enrichment of predicted functional variants among variants with highest gPPAs, fPPAs and PPA ranks. To our knowledge, this is the first time such enrichment has been reported for in silico predictions of regulatory variants. Our findings contrast with those from a recent study, which argued that computational methods are not yet mature enough to be applied in GWAS fine-mapping (Liu et al., 2019).

In our study, we have observed multiple examples of convergence of the top scoring variants by CNNs with the top variants identified through functional fine-mapping using islet chromatin state maps. We also note that the CNN-prioritized variants were significantly enriched in islet chromatin accessibility QTLs, as well as variants residing in islet functional regulatory elements, including promoters and strong open enhancers, which is in line with findings of previous functional genomics studies (Thurner et al., 2018; Parker et al., 2013; Pasquali et al., 2014). We demonstrate that when used in conjunction with high-resolution genetic or functional fine-mapping, CNN predictions can provide a powerful way of dissecting the causal variant from a set of T2D-associated variants. The results presented in this study will facilitate further functional follow-up studies to fully elucidate the mechanisms underlying the associated disease susceptibility at the prioritized variants.

One important limitation of this approach is that the value of applying tissue-specific CNN models to guide the identification of causal variants depends on knowing the effector tissue for the signals in question. With complex diseases, such as T2D, this is not always straightforward, as multiple tissues (including pancreatic islets, liver, adipose and skeletal muscle) have been implicated in T2D aetiology. Additionally, even within the correct tissue, inference can be complicated by the fact that tissues represent heterogeneous mixtures of cell populations, of which only a subset might be directly relevant to disease aetiology. The growing availability of cell type specific functional genomic datasets will make it possible to explore the extent to which this more precise assignment of disease pathology allows more accurate inference regarding the causal mechanisms.

The development of high-throughput experimental methods for assaying regulatory variants, such as massively-parallel reporter assays (MPRAs), offers a complementary route to deriving information on variant-specific function. However, these methods are subject to their own sets of biases and errors, and genome-wide deployment can be problematic. More recent methods – such as HiDRA (high-resolution dissection of regulatory activity)(Wang et al., 2018) – allow further scale-up, but are limited to sites heterozygous in the assayed sample. Nevertheless, large scale experimental studies such as these provide a further dimension of information that can be folded into future rounds of supervised model training. CNN models have the potential to complement high-throughput experimental approaches to elucidate regulatory variants, by learning the regulatory grammar at genomic locations assayed directly in these experiments, and allowing extrapolation to variants (and cell types) not assayed directly. Indeed, we anticipate that in the near future, deep learning based fine-mapping approaches will become part of routine downstream analyses of GWA data.

Materials and methods

Collection and processing of islet data

Request a detailed protocol

We collected 30 genome-wide epigenome profiling datasets from human pancreatic islets from previously published studies (Bhandare et al., 2010; Gaulton et al., 2010; Stitzel et al., 2010; Parker et al., 2013; Pasquali et al., 2014; Maher, 2012; Thurner et al., 2018) (Supplementary file 1-STable 1). Where raw sequencing data was available, we uniformly reprocessed the data using the default settings of the AQUAS Transcription Factor and Histone ChIP-Seq processing pipeline (https://github.com/kundajelab/chipseq_pipeline) and ATAC-Seq/DNase-Seq pipeline (https://github.com/kundajelab/atac_dnase_pipelines), mapping against the human reference genome build hg19, and using MACS2 for peak calling. We used the coordinates of the optimal peaks sets produced with the Irreproducible Discovery Rate (IDR) procedure for further analysis. We grouped the 30 datasets into six broader feature categories: promoters, enhancers, open chromatin, TF binding, active, or repressed regions, based on previous reports of regulatory elements associated with presence of the different chromatin modifications (Zhou et al., 2011; Kundaje et al., 2015). This grouping was only used for convenience of presentation of the results, and to facilitate their interpretation.

CNN input sequence and feature encoding

Request a detailed protocol

The training and test datasets for the convolutional neural network training were created analogously to the approach used in Basset (Kelley et al., 2016). Briefly, 1000 bp genomic intervals were assigned to all called narrow peaks, by extending 500 bp on each side of the centre of the peak. Peaks were greedily merged until no peaks overlapped by >200 bp, and the centre of the merged peak was determined as a weighted average of the midpoints of the merged peaks, weighted by the number of chromatin features present in each individual peak. This resulted in 505,273 genomic intervals of 1000 bp length with assigned presence/absence of the 30 islet epigenomic features, with an average of 2.62 chromatin features per interval. The genomic sequences of the intervals were extracted from the hg19 human reference genome and encoded as one-hot code matrix, mapping the sequences into a 4-row binary matrix corresponding to the four DNA nucleotides at each position. For each 1000 bp sequence, we created an accompanying feature vector denoting which of the 30 datasets showed a signal peak overlapping the sequence. All sequences from chromosomes 1 (N = 43,029, 8.52% total) and 2 (N = 40,506, 8.02% total) were held out from the training and applied as validation and test sets, respectively.

CNN training

Request a detailed protocol

CNN models were trained using code from the open source package Basset (Kelley et al., 2016) implemented in the Torch7 framework (http://torch.ch). We trained a total of 1000 CNN models with 10 different hyperparameter settings (Sup. Table 2), differing in sizes and numbers of convolutional filters applied. We averaged the results across these 1000 models to achieve robust results, and overcome the CNN training stochasticity. Each network contained three convolutional layers, each followed by a rectified linear unit (ReLU) and a max pooling layer, and two fully connected layers. The final output layer produced the predictions for the 30 features. The network architecture is schematically shown in SFig.1. The network training was stopped when the area under receiver-operator curve (AUROC) did not improve in 10 subsequent training epochs. The predictive performance of the networks was assessed by evaluating predictions made on the sequences from chr2 constituting the test set. We calculated the AUROCs and AUPRCs using R package PRROC v.1.3 (Grau et al., 2015).

Sequence motifs captured by convolutional filters

Request a detailed protocol

The convolutional filters in the first CNN layer detect repeatedly occurring local sequence patterns, which increase the prediction accuracy. These patterns can be summarized as position-weighted matrices (PWMs), derived by counting nucleotide occurrences that activate the filter to more than half of its maximum value, as implemented in the Basset framework. The resulting PWMs were matched to the CIS-BP database (Weirauch et al., 2014) of known transcription factor binding motifs using the Tomtom v.4.10.1 motif comparison tool (Gupta et al., 2007). We repeated this for all the 1000 trained networks, and found recurrent motifs by identifying binding motifs matched at FDR < 5% in at least 50 networks. The redundant motifs were filtered out from the top motifs table by comparing all the database motifs against each other using the Tomtom v4.10.1 motif comparison tool, and reporting the top scoring motif as the main hit, and other motifs matching it at FDR < 5% as secondary motifs.

Application of CNN models to variant prioritization

Request a detailed protocol

We obtained the credible variant sets from a recent T2D GWAS study (Mahajan et al., 2018). For each variant, we run predictions for the 30 islet epigenomic features with all 1000 trained CNN models for two 1000 bp long sequences: first including the reference allele and second including the alternative allele, with the variant positioned in the middle of the 1000 bp sequence. We calculated the mean differences in prediction scores across all the 1000 CNNs for each of the investigated features between the two sequences for each variant. We then estimated p-values for each of the variants and each feature with the cumulative distribution function of normal distribution, and applied FDR procedure for multiple testing correction. The overall regulatory potential of each variant was quantified as the lowest q-value for any of the 30 features.

Functional validation of CNN regulatory predictions

Request a detailed protocol

To test the validity of the resulting regulatory predictions for the credible set variants, we tested whether there was an enrichment of variants predicted to affect any of the six feature groups among pancreatic islet expression QTLs (eQTLs) from a recent study in 420 individuals (Viñuela et al., 2019). We downloaded the full set of results with nominal p-values for gene-level eQTLs, and focused on the results for the credible set variants, for which we also have generated CNN predictions in this study. We applied gene-set enrichment analysis implemented in R package ‘gage’ v.2.32.1 to calculate enrichments of CNN-predicted regulatory variants within variant list ranked by eQTL p-values.

In a similar manner we tested whether variants with lower q-values are more likely to act as chromatin accessibility QTLs (caQTLs) exhibiting allelic imbalance in open chromatin in the ATAC-seq data from 17 human pancreatic islets (Thurner et al., 2018). We identified the islet caQTLs among the credible set variants by investigating variants overlapping the ATAC-seq peaks. We required > = 2 subjects with a heterozygous genotype for the variant, and > = 5 sequencing reads overlapping each allele. We calculated the significance of open chromatin allelic imbalance with the negative binomial distribution test, and used an FDR-adjusted p-value of 0.05 to define the 137 caQTLs. The enrichment of these caQTLs within the credible set variant list ranked by their lowest q-value was calculated using R package ‘gage’ v.2.32.1.

Finally, we tested whether variants predicted to affect specific regulatory elements (promoters, enhancers, open chromatin, TF binding, active, or repressed regions) were more likely to reside in these regulatory regions in islets, as defined by the high-resolution chromatin state maps of human pancreatic islets, partitioning the genome into 15 regulatory states based on patterns of chromatin accessibility and DNA methylation integrated with established ChIP-seq marks (Thurner et al., 2018). Using the generally applicable gene set enrichment R package gage (version 2.32.1) (Luo et al., 2009) we tested whether variants overlapping each chromatin state were more likely to be found at the top or bottom of the variant list ranked by the lowest q-values within each of the feature groups.

Evaluating convergence between fine-mapping and CNN predictions

Request a detailed protocol

To test the overall convergence between the two complementary methods for variant prioritization: genetic fine-mapping and CNN regulatory predictions, we evaluated whether we observed more regulatory variants at highest genetic PPAs. We examined the proportions of regulatory variants predicted by CNNs using q-value <0.05 as a cut-off, at different thresholds of genomic PPA (gPPA) going down from PPA of 1.00 to 0.00 in steps of 0.01, and observed decreasing proportions of regulatory variants with decreasing gPPA. We compared this to a random distribution of regulatory variants with respect to gPPA by permuting the CNN q-values 1000 times, preserving the overall number of significant variants, as well as the number and overall structure of credible sets. The enrichment p-value was calculated from these permutations by comparing the areas under curves with fraction of variants significant at each gPPA threshold generated in each permutation with the area under the curve calculated from the original results. Areas under curves were calculated using the AUC function of the DescTools R package v. 0.99.28 (Signorelli, 2019). We repeated the same for functional fine-mapping PPAs (fPPA), using fPPAs generated through incorporation of pancreatic islet regulatory elements defined by chromatin state maps (Thurner et al., 2018). To account for the differing degree of fine-mapping resolution at different GWAS loci, we also evaluated whether we observed higher proportion of regulatory variants with q < 0.05 among the variants with the highest PPA within each locus, regardless of the actual PPA value, and repeated the same procedure investigating genomic PPA ranks within each signal. In the same way we evaluated whether we observed lower q-values at the top ranks of signals known to act through insulin secretion mechanisms (N = 34), rather than insulin action (N = 23) (Wood et al., 2017; Dimas et al., 2014).

Identification of T2D association signals further refined by incorporation of CNN regulatory prediction

Request a detailed protocol

We identified the T2D association signals where the islet CNN models can help with further refinement by investigating signals comprising at least two variants with fPPA > = 0.2. In Figures 4 and 5 we highlighted signals where CNN predictions point to a single variant, among these fPPA > = 0.2 variants, with a much higher regulatory score than the remaining variants at these loci.

The in silico saturated mutagenesis was performed using the basset_sat.py script from the Basset framework (Kelley et al., 2016). The TF binding motifs overlapping the variant site were identified with FIMO from the MEME suite v. 4.11.2 (Grant et al., 2011) using the CIS-BP motifs database (Weirauch et al., 2014).

Plasmid transfection and luciferase reporter assay

Request a detailed protocol

We experimentally validated the CNN regulatory predictions for the two variants (rs17712208 and rs79687284) at PROX1 locus with luciferase reporter assay. Briefly, human EndoC-βH1 cells (Ravassard et al., 2011) were grown at 50–60% confluence in 24-well plates and were transfected with 500 ng of empty pGL4.23 [luc2/minP] vector (Promega, Charbonnieres, France) or pGL4-minP-PROX_enhancer vectors (wildtype, rs17712208-A and rs79687284-C) with FuGENE HD (Roche Applied Science, Meylan, France) using a FuGENE:DNA ratio of 6:1. The EndoC-βH1 cells (RRID:CVCL_L909) were a purchased from Univercell BioSolutions (http://www.univercell-biosolutions.com/human-heart-cells-and-stem-cells-production) and tested negative for mycoplasm. Primer sequences used for cloning and site-directed mutagenesis (SDM) are listed in Supplementary file 1-STable 6. Restriction enzymes Nhel and Xhol were used for all subsequent cloning. Luciferase activities were measured 48 hr after transfection using the Dual-Luciferase Reporter Assay kit (Promega). The firefly luciferase activity was normalized to the Renilla luciferase activity obtained by cotransfection of 10 ng of the pGL4.74[hRluc/TK] Renilla Luciferase vector (Promega).

Comparison of single- versus multi-tissue CNN predictions

Request a detailed protocol

Finally, we compared the predictions resulting from our single-tissue pancreatic islet CNN models with the predictions generated with another publicly available multi-tissue variant prioritization tool DeepSEA (Zhou and Troyanskaya, 2015). The functional significance scores for each variant (multi-tissue) and the q-value for chromatin effects in the ENCODE PanIslets primary pancreatic islets cells (single tissue) were generated by submitting VCF files of the credible set variants to the DeepSEA web server (date accessed: 9th May, 2018).

Code availability

Request a detailed protocol

Code used to generate the results of this study is available at https://github.com/agawes/islet_CNN.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
    Genetic fine mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci
    1. KJ Gaulton
    2. T Ferreira
    3. Y Lee
    4. A Raimondo
    5. R Mägi
    6. ME Reschen
    7. A Mahajan
    8. A Locke
    9. NW Rayner
    10. N Robertson
    11. RA Scott
    12. I Prokopenko
    13. LJ Scott
    14. T Green
    15. T Sparso
    16. D Thuillier
    17. L Yengo
    18. H Grallert
    19. S Wahl
    20. M Frånberg
    21. RJ Strawbridge
    22. H Kestler
    23. H Chheda
    24. L Eisele
    25. S Gustafsson
    26. V Steinthorsdottir
    27. G Thorleifsson
    28. L Qi
    29. LC Karssen
    30. EM van Leeuwen
    31. SM Willems
    32. M Li
    33. H Chen
    34. C Fuchsberger
    35. P Kwan
    36. C Ma
    37. M Linderman
    38. Y Lu
    39. SK Thomsen
    40. JK Rundle
    41. NL Beer
    42. M van de Bunt
    43. A Chalisey
    44. HM Kang
    45. BF Voight
    46. GR Abecasis
    47. P Almgren
    48. D Baldassarre
    49. B Balkau
    50. R Benediktsson
    51. M Blüher
    52. H Boeing
    53. LL Bonnycastle
    54. EP Bottinger
    55. NP Burtt
    56. J Carey
    57. G Charpentier
    58. PS Chines
    59. MC Cornelis
    60. DJ Couper
    61. AT Crenshaw
    62. RM van Dam
    63. AS Doney
    64. M Dorkhan
    65. S Edkins
    66. JG Eriksson
    67. T Esko
    68. E Eury
    69. J Fadista
    70. J Flannick
    71. P Fontanillas
    72. C Fox
    73. PW Franks
    74. K Gertow
    75. C Gieger
    76. B Gigante
    77. O Gottesman
    78. GB Grant
    79. N Grarup
    80. CJ Groves
    81. M Hassinen
    82. CT Have
    83. C Herder
    84. OL Holmen
    85. AB Hreidarsson
    86. SE Humphries
    87. DJ Hunter
    88. AU Jackson
    89. A Jonsson
    90. ME Jørgensen
    91. T Jørgensen
    92. WH Kao
    93. ND Kerrison
    94. L Kinnunen
    95. N Klopp
    96. A Kong
    97. P Kovacs
    98. P Kraft
    99. J Kravic
    100. C Langford
    101. K Leander
    102. L Liang
    103. P Lichtner
    104. CM Lindgren
    105. E Lindholm
    106. A Linneberg
    107. CT Liu
    108. S Lobbens
    109. J Luan
    110. V Lyssenko
    111. S Männistö
    112. O McLeod
    113. J Meyer
    114. E Mihailov
    115. G Mirza
    116. TW Mühleisen
    117. M Müller-Nurasyid
    118. C Navarro
    119. MM Nöthen
    120. NN Oskolkov
    121. KR Owen
    122. D Palli
    123. S Pechlivanis
    124. L Peltonen
    125. JR Perry
    126. CG Platou
    127. M Roden
    128. D Ruderfer
    129. D Rybin
    130. YT van der Schouw
    131. B Sennblad
    132. G Sigurðsson
    133. A Stančáková
    134. G Steinbach
    135. P Storm
    136. K Strauch
    137. HM Stringham
    138. Q Sun
    139. B Thorand
    140. E Tikkanen
    141. A Tonjes
    142. J Trakalo
    143. E Tremoli
    144. T Tuomi
    145. R Wennauer
    146. S Wiltshire
    147. AR Wood
    148. E Zeggini
    149. I Dunham
    150. E Birney
    151. L Pasquali
    152. J Ferrer
    153. RJ Loos
    154. J Dupuis
    155. JC Florez
    156. E Boerwinkle
    157. JS Pankow
    158. C van Duijn
    159. E Sijbrands
    160. JB Meigs
    161. FB Hu
    162. U Thorsteinsdottir
    163. K Stefansson
    164. TA Lakka
    165. R Rauramaa
    166. M Stumvoll
    167. NL Pedersen
    168. L Lind
    169. SM Keinanen-Kiukaanniemi
    170. E Korpi-Hyövälti
    171. TE Saaristo
    172. J Saltevo
    173. J Kuusisto
    174. M Laakso
    175. A Metspalu
    176. R Erbel
    177. KH Jöcke
    178. S Moebus
    179. S Ripatti
    180. V Salomaa
    181. E Ingelsson
    182. BO Boehm
    183. RN Bergman
    184. FS Collins
    185. KL Mohlke
    186. H Koistinen
    187. J Tuomilehto
    188. K Hveem
    189. I Njølstad
    190. P Deloukas
    191. PJ Donnelly
    192. TM Frayling
    193. AT Hattersley
    194. U de Faire
    195. A Hamsten
    196. T Illig
    197. A Peters
    198. S Cauchi
    199. R Sladek
    200. P Froguel
    201. T Hansen
    202. O Pedersen
    203. AD Morris
    204. CN Palmer
    205. S Kathiresan
    206. O Melander
    207. PM Nilsson
    208. LC Groop
    209. I Barroso
    210. C Langenberg
    211. NJ Wareham
    212. CA O'Callaghan
    213. AL Gloyn
    214. D Altshuler
    215. M Boehnke
    216. TM Teslovich
    217. MI McCarthy
    218. AP Morris
    219. DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium
    (2015)
    Nature Genetics 47:1415–1425.
    https://doi.org/10.1038/ng.3437
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
    DescTools: Tools fro descriptive statistics
    1. A Signorelli
    (2019)
    DescTools: Tools fro descriptive statistics.
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41

Decision letter

  1. Stephen Parker
    Reviewing Editor; University of Michigan, United States
  2. Detlef Weigel
    Senior Editor; Max Planck Institute for Developmental Biology, Germany

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The reviewers found this to be an exciting application of deep learning CNN models and pancreatic islet epigenomic data to type 2 diabetes (T2D) genome wide association study (GWAS) variants, including the convincing experimental validation of a single-locus effect. Additionally, the bulk enrichment analyses shown in Figure 3 demonstrate the likely broad utility of such an approach. Congratulations on the nice work, and we will look forward to seeing how these models are further used to provide mechanistic insights across diverse complex diseases for which GWAS and tissue-relevant epigenomic annotations are available.

Decision letter after peer review:

Thank you for submitting your article "Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals" for consideration by eLife. Your article has been reviewed by two peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Harry Dietz as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Wesolowska-Andersen et al. present a convolutional neural network (CNN) derived approach to predict regulatory variants in pancreatic islets and refine T2D GWAS signals. Specifically, the authors measure the regulatory potential of T2D associated variants using the CNN trained on diverse islet functional genomics data and compare their predictions to those derived from genetic and functional fine-mapping approaches. They highlight examples of deep learning derived predictions that help them refine signals at multiple loci in a tissue-specific manner. This is a well-written and interesting paper. Perhaps the most striking result is shown in Figure 3 which convincingly shows differential enrichment when T2D GWAS signals are partitioned into insulin action vs. secretion loci. The single locus validation is also quite nice.

Overall, the findings are reasonable and the validation at the end ties the manuscript together nicely. However, we recommend revisions that would improve the overall clarity of the manuscript and associated figures.

Essential revisions:

Showing a cartoon/schematic of the CNN architecture used and how the input and output map to this architecture will be useful for guiding the general readership at eLife.

The authors should mention in the Discussion how their prediction method contrasts with doing some experimental rapid high-throughput approach like Hidra.

It is clear that the authors are aiming to make the case that their models are an improvement over previous prediction approaches. Although they make a strong case, one does wonder if pancreatic islets are the best setting in which to initially do this. After all, pancreatic islets represent a mixed cell population, so any epigenomic features will be drawn from all the diverse cells of this tissue. Given there is an abundance of 'pure' cell types with this sort of data available (for example, in ENCODE), it would have utility to 1) run in that setting first to demonstrate the optimal power of this approach for a relevant disease with an abundance of GWAS hits, and then 2) understand what the 'cost' is if you subsequently run in a mixed cell setting like the one they delineate. The problem with this will be with getting a relevant GWAS data set that aligns with a very densely-profiled cell line and other orthogonal validation data. Another alternative "validation" approach could be to compare the models to islet eQTLs, for which there are now good data sets available. The models presented here should be highly enriched for islet eQTL. Our discussion led to the conclusion that implementing one of the above approaches will strengthen the work: either implementing the models in a homogeneous cell line (to circumvent the issues associated with mixed cell populations) or to perform model comparisons with islet eQTL signals.

https://doi.org/10.7554/eLife.51503.sa1

Author response

Essential revisions:

Showing a cartoon/schematic of the CNN architecture used and how the input and output map to this architecture will be useful for guiding the general readership at eLife.

We have now included a cartoon schematic of the CNN architecture used as a new Figure 1—figure supplement 1.

The authors should mention in the Discussion how their prediction method contrasts with doing some experimental rapid high-throughput approach like Hidra.

We added the following paragraph in the Discussion section: “The development of high-throughput experimental methods for assaying regulatory variants, such as massively-parallel reporter assays (MPRAs), offers a complementary route to deriving information on variant-specific function. […] CNN models have the potential to complement high-throughput experimental approaches to elucidate regulatory variants, by learning the regulatory grammar at genomic locations assayed directly in these experiments, and allowing extrapolation to variants (and cell-types) not assayed directly.”

It is clear that the authors are aiming to make the case that their models are an improvement over previous prediction approaches. Although they make a strong case, one does wonder if pancreatic islets are the best setting in which to initially do this. After all, pancreatic islets represent a mixed cell population, so any epigenomic features will be drawn from all the diverse cells of this tissue. Given there is an abundance of 'pure' cell types with this sort of data available (for example, in ENCODE), it would have utility to 1) run in that setting first to demonstrate the optimal power of this approach for a relevant disease with an abundance of GWAS hits, and then 2) understand what the 'cost' is if you subsequently run in a mixed cell setting like the one they delineate. The problem with this will be with getting a relevant GWAS data set that aligns with a very densely-profiled cell line and other orthogonal validation data. Another alternative "validation" approach could be to compare the models to islet eQTLs, for which there are now good data sets available. The models presented here should be highly enriched for islet eQTL. Our discussion led to the conclusion that implementing one of the above approaches will strengthen the work: either implementing the models in a homogeneous cell line (to circumvent the issues associated with mixed cell populations) or to perform model comparisons with islet eQTL signals.

We fully acknowledge reviewers comment that pancreatic islets are indeed a mixture of cell types, and while function of the insulin-producing beta cells is imperative to blood glucose control and T2D aetiology, there is emerging evidence that other cell types may contribute as well. We would also like to bring to the reviewers’ attention that among the 30 chromatin features included in our networks inputs and predictions, there were several epigenomic datasets measured in FACS-sorted pure islet cell populations – specifically beta cells, alpha cells, as well as acinar and exocrine cells. We did not see many differences between these pure cell population predictions, possibly because the open chromatin and the H3K4me3 profiles for these cell populations represented pairs of features with highest pairwise similarity, as shown in the feature similarity plot using Jaccard metric.

Author response image 1
Pairwise Jaccard distances for the pancreatic islet epigenomic datasets used in CNN training.

Given that, we believe that when looking for cell-type specific differences in epigenomic profiles, it would be important to take into consideration the quantitative differences in open chromatin, or levels of chromatin modifications, rather than investigate them as binary traits.

As the reviewer rightly suggests, to comprehensively test this hypothesis, we would require good quality data for several pure cell types, or cell lines, as well as an appropriate set of GWAS results, which is not trivial. Additionally, this would require setting up a completely new scheme for CNN training, which due to computational time, would not be compatible with the two month turnaround time for the manuscript. We address this concern in Discussion: “Additionally, even within the correct tissue, inference can be complicated by the fact that tissues represent heterogeneous mixtures of cell populations, of which only a subset might be directly relevant to disease aetiology. The growing availability of cell-type specific functional genomic datasets will make it possible to explore the extent to which this more precise assignment of disease pathology allows more accurate inference regarding the causal mechanisms.”

We took forward the reviewer’s suggestion for the additional analysis comparing the CNN results to pancreatic islet eQTL data. Initially, we did not include such comparison in our manuscript as previous islet eQTL studies discovered that the overlap between T2D GWAS signals and islet eQTLs does only explain a small proportion of the GWAS signals, even though there exists a significant enrichment of GWAS hits within the annotated eQTLs.

In our manuscript, we have compared our CNN predictions at the credible set variants the recent human pancreatic islet eQTL study in 420 islet donors from the INSPIRE consortium (Viñuela et al., 2019). The results are summarized in a new Figure 2B and new paragraph in the Results section: “To further validate the functional inference from the islet CNN, we compared the CNN predictions at the credible set variants with results from a recent cis-eQTL study of human islet samples from 420 donors performed by the InsPIRE consortium (Viñuela et al., 2019). […] We considered the enrichments of variants predicted to affect the six different groups of chromatin features, and found that variants predicted to affect promoter or enhancer activities in islets were preferentially enriched among the top pancreatic islet eQTL results (Figure 2B).”

https://doi.org/10.7554/eLife.51503.sa2

Article and author information

Author details

  1. Agata Wesolowska-Andersen

    Wellcome Centre for Human Genetics, Oxford, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Project administration
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8688-2814
  2. Grace Zhuo Yu

    Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, United Kingdom
    Contribution
    Resources, Formal analysis, Validation, Investigation, Methodology
    Competing interests
    No competing interests declared
  3. Vibe Nylander

    Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, United Kingdom
    Contribution
    Resources, Methodology
    Competing interests
    No competing interests declared
  4. Fernando Abaitua

    Wellcome Centre for Human Genetics, Oxford, United Kingdom
    Contribution
    Resources, Supervision, Validation, Investigation, Methodology
    Competing interests
    No competing interests declared
  5. Matthias Thurner

    1. Wellcome Centre for Human Genetics, Oxford, United Kingdom
    2. Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, United Kingdom
    Contribution
    Data curation, Formal analysis
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7329-9769
  6. Jason M Torres

    Wellcome Centre for Human Genetics, Oxford, United Kingdom
    Contribution
    Data curation, Formal analysis, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7537-7035
  7. Anubha Mahajan

    Wellcome Centre for Human Genetics, Oxford, United Kingdom
    Contribution
    Data curation, Formal analysis, Methodology
    Competing interests
    No competing interests declared
  8. Anna L Gloyn

    1. Wellcome Centre for Human Genetics, Oxford, United Kingdom
    2. Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, United Kingdom
    3. Oxford NIHR Biomedical Centre, Churchill Hospital, Oxford, United Kingdom
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Methodology
    Contributed equally with
    Mark I McCarthy
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1205-1844
  9. Mark I McCarthy

    1. Wellcome Centre for Human Genetics, Oxford, United Kingdom
    2. Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, United Kingdom
    3. Oxford NIHR Biomedical Centre, Churchill Hospital, Oxford, United Kingdom
    Present address
    Genentech, South San Francisco, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Methodology, Project administration
    Contributed equally with
    Anna L Gloyn
    For correspondence
    mark.mccarthy@drl.ox.ac.uk
    Competing interests
    Senior editor, eLife. MMcC has served on advisory panels for Pfizer, NovoNordisk and Zoe Global, has received honoraria from Merck, Pfizer, Novo Nordisk and Eli Lilly, and research funding from Abbvie, Astra Zeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, NovoNordisk, Pfizer, Roche, Sanofi Aventis, Servier, and Takeda. As of June 2019, MMcC is an employee of Genentech, and a holder of Roche stock.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4393-0510

Funding

Wellcome (099673/Z/12/Z)

  • Matthias Thurner

Wellcome (090532)

  • Mark I McCarthy

Wellcome (106130)

  • Anna L Gloyn
  • Mark I McCarthy

Wellcome (098381)

  • Mark I McCarthy

Wellcome (203141)

  • Anna L Gloyn
  • Mark I McCarthy

Wellcome (212259)

  • Mark I McCarthy

Wellcome (095101)

  • Anna L Gloyn

Wellcome (200837)

  • Anna L Gloyn

Medical Research Council (MR/L020149/1)

  • Anna L Gloyn

Horizon 2020 Framework Programme (T2D Systems)

  • Anna L Gloyn

NIH Clinical Center (U01-DK105535)

  • Anna L Gloyn
  • Mark I McCarthy

NIH Clinical Center (U01-DK085545)

  • Anna L Gloyn

National Institute for Health Research (NF-SI-0617-10090)

  • Anna L Gloyn

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. The views expressed in this article are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health.

Acknowledgements

MT was a Wellcome doctoral student. ALG is a Wellcome Senior Fellow in Basic Biomedical Science. MMcC is a Wellcome Investigator and an NIHR Senior Investigator. Relevant funding support for this work comes from Wellcome (090532, 106130, 098381, 203141, 212259, 095101, 200837, 099673/Z/12/Z), Medical Research Council (MR/L020149/1), European Union Horizon 2020 Programme (T2D Systems), NIDDK (U01-DK105535), NIH (U01-DK105535; U01-DK085545) and NIHR (NF-SI-0617–10090). The views expressed in this article are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health.

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Developmental Biology, Germany

Reviewing Editor

  1. Stephen Parker, University of Michigan, United States

Publication history

  1. Received: August 30, 2019
  2. Accepted: January 27, 2020
  3. Accepted Manuscript published: January 27, 2020 (version 1)
  4. Version of Record published: February 7, 2020 (version 2)

Copyright

© 2020, Wesolowska-Andersen et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,152
    Page views
  • 402
    Downloads
  • 3
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Computational and Systems Biology
    2. Human Biology and Medicine
    Praveen Anand et al.
    Short Report Updated
    1. Computational and Systems Biology
    2. Human Biology and Medicine
    Nalin Leelatian et al.
    Research Article Updated