Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals

  1. Agata Wesolowska-Andersen
  2. Grace Zhuo Yu
  3. Vibe Nylander
  4. Fernando Abaitua
  5. Matthias Thurner
  6. Jason M Torres
  7. Anubha Mahajan
  8. Anna L Gloyn
  9. Mark I McCarthy  Is a corresponding author
  1. Wellcome Centre for Human Genetics, United Kingdom
  2. University of Oxford, United Kingdom
  3. Churchill Hospital, United Kingdom
6 figures, 1 table and 2 additional files

Figures

Figure 1 with 3 supplements
Area under precision-recall curves (AUPRC) for 30 islet epigenomic features predicted by CNN models.

The AUPRC values were calculated based on performance on the test set formed by 1000 bp sequences from chr2, held out from training and validation. The boxplots show summary of performance across 1000 individual CNN models, and are grouped by corresponding regulatory element. As the interpretation of AUPRC values depend on how well balanced the dataset it, we denote the class imbalance (equivalent to prediction of a random model) for each feature as open circles, which corresponds to the proportion of sequences with the given feature present. Features marked as ‘a_”, ‘b_”,”e_”, and ‘aci_”, were assayed in FACS-sorted cell populations rather than whole pancreatic islets, and correspond to alpha cells, beta cells, exocrine cells, and acinar cells, respectively (Bramswig et al., 2013; Ackermann et al., 2016).

Figure 1—figure supplement 1
Schematic representation of the applied convolutional neural network architecture, with sizes and numbers of filters, and width of pooling indicated for a representative combination of tested hyperparameters.
Figure 1—figure supplement 2
Area under receiver-operator curves (AUROC) for 30 islet epigenomic features predicted by CNN models.

The AUROC values were calculated based on performance on the test set formed by 1000 bp sequences from chr2, held out from training and validation. The boxplots show summary of performance across 1000 individual CNN models, and are grouped by corresponding regulatory element. Features marked as ‘a_”, ‘b_”,”e_”, and ‘aci_”, were assayed in FACS-sorted cell populations rather than whole pancreatic islets, and correspond to alpha cells, beta cells, exocrine cells, and acinar cells, respectively (Ackermann et al., 2016; Bramswig et al., 2013).

Figure 1—figure supplement 3
Influence of the size of filters in the first convolutional layers on filters’ annotation and filter’s influence on predictions.

Boxplots represent summary of 100 individual CNN models differing in the size of convolutional filters of the first layer. Informative filters represent filters with standard deviation of filter activation >0, indicating filters helping with the final CNN predictions of regulatory features, while annotated filters represent filters which could be annotated to match any known TF binding motifs. In grey are shown all the informative filters, and in white a subset of these filters which were not annotated to match any known TF binding motifs. The number of informative filters decreases with increasing filter size. Red boxplots show increasing number of annotated filters with increasing filter size.

Functional characterization of CNN-predicted regulatory variants.

(A) Distribution of CNN-predicted regulatory variants (q < 0.05) in the six broader CNN feature groups. (B) Enrichment of variants predicted to affect the CNN feature groups within variant list ranked by eQTL result p-value from the InsPIRE study (Viñuela et al., 2019). Enrichment was calculated with R package gage (Luo et al., 2009), red bars indicate gene-set enrichment at the top (and blue at the bottom) of the eQTL p-value -ranked list of variants. (C) Predicted regulatory variants reside in regulatory elements they are predicted to affect. For each variant we found the lowest CNN q-value among feature groups corresponding to different regulatory elements (promoters, enhancers, open chromatin, active regions, TF binding, repressed regions) predicted from genomic sequence, and we ranked all variants according to these six q-values. We then tested whether variants residing in each of the 15 pancreatic islet chromatin states (Thurner et al., 2018) were enriched at the top or bottom of these ranked lists using gene-set enrichment analysis implemented in the R package gage (Luo et al., 2009). Colours in the heatmap represent the strength of the enrichment expressed as log10-transformed enrichment q-values, with red colours representing enrichments at the top (enrichment), and blue at the bottom of the ranked lists (depletion). For plotting purposes all -log10(p-values) below −50, or above 50 were truncated to these values. Stars denote significant enrichments: *<0.05, **<0.01 and ***<0.001. Variant level functional annotations and CNN predictions for the credible set variants are available as Figure 2—source data 1.

Figure 2—source data 1

Summary of CNN predictions for all variants from T2D GWAS credible sets.

https://cdn.elifesciences.org/articles/51503/elife-51503-fig2-data1-v2.txt
Figure 3 with 1 supplement
Convergence between CNN regulatory predictions and fine-mapping approaches for functional variant prioritization.

(A) Regulatory variants (black) are enriched among variants with highest genetic PPAs (gPPAs) over permuted background (blue). (B) Regulatory variants (black) are enriched among variants with highest functional PPAs (fPPAs) generated with FGWAS over permuted background (blue). (C) Regulatory variants (black) are enriched among variants with top PPA ranks within 99% sets of credible variants over permuted background (blue), as well as at top ranks of signals acting through insulin secretion (red) over insulin action (purple) mechanisms.

Figure 3—figure supplement 1
Comparison of CNN regulatory predictions made with the islet-specific CNN ensemble to predictions made with the publicly available DeepSEA model.

(A) Comparison of -log10-transformed q-values from the islet CNN ensemble with functional significance scores generated by the omni-tissue DeepSEA model (B) Comparison of -log10-transformed q-values from the islet CNN ensemble with -log10-transformed E-values for the ENCODE PanIslet DNase generated by the DeepSEA model. In scatterplots variants predicted to be regulatory with both approaches are shown in red, variants predicted as regulatory only by DeepSEA are shown in blue, and variants predicted as regulatory only by islet CNNs are shown in green. (C) Enrichment of regulatory variants among variants at the top ranks of T2D GWAS 99% credible sets (black) predicted by the omni-tissue DeepSEA model over the permuted background (blue). Purple line shows the signals acting through insulin action mechanisms, while red line shows the signals acting through insulin secretion (pancreatic islet-mediated) mechanisms. (D) Enrichment of regulatory variants among variants at the top ranks of T2D GWAS 99% credible sets (black) predicted by the single-tissue DeepSEA model based on ENCODE PanIslet DNase dataset over the permuted background (blue). Purple line shows the signals acting through insulin action mechanisms, while red line shows the signals acting through insulin secretion (pancreatic islet-mediated) mechanisms.

Examples of T2D-association signals where integration of CNN regulatory variant prediction downstream of functional fine-mapping refines the association signals to single candidate variants.

Genetic PPAs (gPPAs) are shown in the top panels as blue points, functional PPAs (fPPAs) are shown in the middle panels as green points, and -log10-transformed q-values from CNN predictions are shown in the bottom panels as red points.

CNN regulatory predictions help refine the association signal at PROX1 locus, previously fine-mapped to only two variants: rs17712208 and rs79687284.

(A) Genetic PPA (gPPA), functional PPA (fPPA) and -log10(q-value) of the CNN islet regulatory predictions for both variants. (B) Allelic imbalance in open chromatin across four pancreatic islets heterozygous for the variants. Allele counts for the major (grey) and minor (red) alleles are shown for both variants. (C) Table summary with CNN predictions for the H3K27ac mark for both variants. (D) In silico saturated mutagenesis for 40nt flanking sequence around the rs17712208 variant for the H3K27ac predictions. The line plots in the upper panel indicate the SAD (SNP accessibility difference) scores corresponding to absolute highest values from the heatmap below, with blue line indicating loss of function, and red – gain of function changes. Blue fields in the heatmap indicate that a given nucleotide substitution would result in decrease in prediction values for H3K27ac, while red field indicate increase in the predictions. The height of letters in the sequence below the heatmap indicated the relative importance of each nucleotide in the final predictions. (E) Matched HNF1B binding motif is shown below. (F) Luciferase reporter assays confirmed that the A allele of rs17712208 resulted in significant repression of enhancer activity, while no effect was observed for the rs79687284 variant. GFP = green fluorescent protein (negative control), EV = empty vector (baseline). Source file with luciferase intensity values is available as Figure 5—source data 1.

Author response image 1
Pairwise Jaccard distances for the pancreatic islet epigenomic datasets used in CNN training.

Tables

Table 1
10 non-redundant transcription factors binding motifs most frequently detected by first layer convolutional filters at FDR < 5%.

Sequence logos of representative CNN filters are shown. Transcription factor binding motifs redundancy was removed with Tomtom motif similarity search with other motifs detected by CNNs with q < 0.05 for similarity to the main motif are listed in the last column; only three motifs with highest similarity are listed.

Motif name/TFRepresentative
CNN filter logo
Motif logoCNNs with filter
match q < 0.05
Similar TF motifs discovered
M6114_1.02
FOXA1


838M6234_1.02 FOXA3, M6241_1.02 FOXJ2, M4567_1.02 FOXA2…
M4427_1.02
CTCF


833M4612_1.02 CTCFL
M1906_1.02
SP1


677M2314_1.02 SP2, M6482_1.02 SP3, M6535_1.02 WT1…
M2296_1.02
MAFK


629M4629_1.02 NFE2, M4572_1.02 MAFF, M4681_1.02 BACH2…
M2292_1.02
JUND


571M4623_1.02 JUNB, M2278_1.02 FOS, M4619_1.02 FOSL1…
M1528_1.02
RFX6


556M4476_1.02 RFX5, M1529_1.02 RFX7, M5777_1.02 RFX4…
M4640_1.02
ZBTB7A


530M6539_1.02 ZBTB7B, M6552_1.02 ZNF148, M6422_1.02 PLAGL1…
M1970_1.02
NFIC


484M5664_1.02 NFIX, M5660_1.02 NFIA, M5662_1.02 NFIB
M2277_1.02
FLI1


442M6222_1.02 ETV4, M2275_1.02 ELF1, M5398_1.02 ERF…
M6281_1.02
HNF1A


418M6282_1.02 HNF1B, M6546_1.02 ZFHX3

Additional files

Supplementary file 1

Supplementary Tables.

(STable 1) Summary of publicly available datasets used to train the CNN models of human pancreatic islet epigenomic features. Where indicated, the original raw data was reprocessed with the default setting of either the ATAC-seq/DNase-seq pipeline (available from: https://github.com/kundajelab/atac_dnase_pipelines), or the AQUAS TF and histone ChIP-seq pipeline (available from: https://github.com/kundajelab/chipseq_pipeline), using the human genome GRCh37 as reference. (STable 2) Tested sets of CNN hyperparameters. Convolutional neural networks with each set of hyperparameters differing in numbers and sizes of convolutional filters were trained 100 times, for a total of 1000 CNNs trained. (STable3) Full list of transcription factor binding motifs with <5% FDR sequence match to motifs activating convolutional filters from the first layers of the 1000 CNN ensemble. No motif redundancy removal was applied here. (STable 4) CNN regulatory predictions at 28 T2D association signals fine-mapped to a single most likely causal variant with genetic PPA (gPPA) > = 0.80 or functional PPA (fPPA) > = 0.80. (STable 5) CNN regulatory predictions at signals with at least two variants with functional PPAs (fPPAs) > = 0.2. These are the signals where incorporating CNN predictions downstream of fine-mapping can yield the largest benefits. The table lists all the variants with fPPA > = 0.05 at these signals, together with their CNN q-value (lowest_Q), and the corresponding top scoring CNN feature and the mean predicted score difference across the 1000 trained models. (STable 6) Primer sequences used for cloning of the Prox1 enhancer. Prox1_enhancer_Forward (Reverse)_internal were designed for sequence validation. Restriction enzymes NheI and XhoI were used for all subsequent cloning. SDM = site directed mutagenesis.

https://cdn.elifesciences.org/articles/51503/elife-51503-supp1-v2.xlsx
Transparent reporting form
https://cdn.elifesciences.org/articles/51503/elife-51503-transrepform-v2.docx

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Agata Wesolowska-Andersen
  2. Grace Zhuo Yu
  3. Vibe Nylander
  4. Fernando Abaitua
  5. Matthias Thurner
  6. Jason M Torres
  7. Anubha Mahajan
  8. Anna L Gloyn
  9. Mark I McCarthy
(2020)
Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals
eLife 9:e51503.
https://doi.org/10.7554/eLife.51503