Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals

  1. Agata Wesolowska-Andersen
  2. Grace Zhuo Yu
  3. Vibe Nylander
  4. Fernando Abaitua
  5. Matthias Thurner
  6. Jason M Torres
  7. Anubha Mahajan
  8. Anna L Gloyn
  9. Mark I McCarthy (corresponding author)
  1. Wellcome Centre for Human Genetics, United Kingdom
  2. University of Oxford, United Kingdom
  3. Churchill Hospital, United Kingdom
6 figures, 1 table and 2 additional files

Figures

Figure 1 with 3 supplements
Area under precision-recall curves (AUPRC) for 30 islet epigenomic features predicted by CNN models.

The AUPRC values were calculated from performance on the test set of 1000 bp sequences from chr2, held out from training and validation. The boxplots summarize performance across …
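The evaluation described in this caption can be illustrated with a minimal sketch: sequences on chr2 are set aside as a test set, and the area under the precision-recall curve is computed as average precision over ranked predictions. The function names, the `chrom` record field, and the choice of average precision as the AUPRC estimator are assumptions made for illustration, not details taken from the authors' pipeline.

```python
def split_by_chromosome(records, test_chrom="chr2"):
    """Hold out every record on test_chrom; all others stay available for training.

    Each record is assumed to be a dict with a "chrom" key (hypothetical layout).
    """
    train, test = [], []
    for rec in records:
        (test if rec["chrom"] == test_chrom else train).append(rec)
    return train, test


def average_precision(labels, scores):
    """Estimate AUPRC as average precision.

    labels: 0/1 ground-truth values; scores: predicted scores (higher = positive).
    Tied scores are ranked in arbitrary order in this simple sketch.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, label in pairs if label)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(pairs, start=1):
        if label:
            true_pos += 1
            total += true_pos / rank  # precision at each recalled positive
    return total / n_pos
```

Unlike AUROC, this metric depends on the fraction of positives in the test set, so a sparse feature has a low baseline AUPRC.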

Figure 1—figure supplement 1
Schematic representation of the applied convolutional neural network architecture, with sizes and numbers of filters, and width of pooling indicated for a representative combination of tested hyperparameters.
Figure 1—figure supplement 2
Area under receiver operating characteristic curves (AUROC) for 30 islet epigenomic features predicted by CNN models.

The AUROC values were calculated from performance on the test set of 1000 bp sequences from chr2, held out from training and validation. The boxplots summarize performance across …
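For comparison with the precision-recall metric, the following sketch computes AUROC as the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the pairwise formulation are illustrative assumptions; the authors' actual implementation is not described on this page.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    Quadratic in the number of examples, so suitable only for small inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))
```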

Figure 1—figure supplement 3
Influence of first-layer convolutional filter size on filter annotation and on each filter’s influence on predictions.

Boxplots represent summary of 100 individual CNN models differing in the size of convolutional filters of the first layer. Informative filters represent filters with standard deviation of filter …

Figure 2
Functional characterization of CNN-predicted regulatory variants.

(A) Distribution of CNN-predicted regulatory variants (q < 0.05) in the six broader CNN feature groups. (B) Enrichment of variants predicted to affect the CNN feature groups within variant list …

Figure 2—source data 1

Summary of CNN predictions for all variants from T2D GWAS credible sets.

https://cdn.elifesciences.org/articles/51503/elife-51503-fig2-data1-v2.txt
Figure 3 with 1 supplement
Convergence between CNN regulatory predictions and fine-mapping approaches for functional variant prioritization.

(A) Regulatory variants (black) are enriched among variants with highest genetic PPAs (gPPAs) over permuted background (blue). (B) Regulatory variants (black) are enriched among variants with …
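One way to read "enriched over permuted background" is as a label-permutation test: count the predicted regulatory variants among high-PPA variants, then repeat the count after shuffling the regulatory labels. The sketch below illustrates that idea only. The PPA threshold, the number of permutations, the shuffling scheme, and all names are assumptions, and the authors' actual permutation procedure may differ.

```python
import random


def permutation_enrichment(ppas, is_regulatory, threshold=0.8, n_perm=1000, seed=0):
    """Return (observed count, empirical p-value) for regulatory variants
    among variants with PPA >= threshold (threshold is a hypothetical choice)."""
    rng = random.Random(seed)
    high = [p >= threshold for p in ppas]
    observed = sum(1 for h, r in zip(high, is_regulatory) if h and r)
    labels = list(is_regulatory)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the link between PPA and regulatory status
        permuted = sum(1 for h, r in zip(high, labels) if h and r)
        if permuted >= observed:
            at_least += 1
    # Add-one correction keeps the empirical p-value strictly above zero.
    p_value = (at_least + 1) / (n_perm + 1)
    return observed, p_value
```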

Figure 3—figure supplement 1
Comparison of CNN regulatory predictions made with the islet-specific CNN ensemble to predictions made with the publicly available DeepSEA model.

(A) Comparison of -log10-transformed q-values from the islet CNN ensemble with functional significance scores generated by the omni-tissue DeepSEA model. (B) Comparison of -log10-transformed q-values …

Figure 4
Examples of T2D-association signals where integration of CNN regulatory variant prediction downstream of functional fine-mapping refines the association signals to single candidate variants.

Genetic PPAs (gPPAs) are shown in the top panels as blue points, functional PPAs (fPPAs) are shown in the middle panels as green points, and -log10-transformed q-values from CNN predictions are …

Figure 5
CNN regulatory predictions help refine the association signal at the PROX1 locus, previously fine-mapped to only two variants: rs17712208 and rs79687284.

(A) Genetic PPA (gPPA), functional PPA (fPPA) and -log10(q-value) of the CNN islet regulatory predictions for both variants. (B) Allelic imbalance in open chromatin across four pancreatic islets …

Author response image 1
Pairwise Jaccard distances for the pancreatic islet epigenomic datasets used in CNN training.
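A Jaccard distance between two datasets is one minus the ratio of their shared elements to their combined elements. The sketch below assumes each dataset has been reduced to a set of identifiers for the genomic bins it covers; that representation, and the function names, are illustrative assumptions rather than the authors' stated method.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_jaccard(datasets):
    """Distances for every unordered pair in a {name: iterable of bin ids} mapping."""
    names = sorted(datasets)
    return {
        (x, y): jaccard_distance(datasets[x], datasets[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

A distance near 0 indicates two datasets marking largely the same regions, and a distance near 1 indicates little overlap.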

Tables

Table 1
Ten non-redundant transcription factor binding motifs most frequently detected by first-layer convolutional filters at FDR < 5%.

Sequence logos of representative CNN filters are shown. Transcription factor binding motif redundancy was removed with a Tomtom motif similarity search against other motifs detected by CNNs with q < 0.05…

Columns: Motif name / TF; Representative CNN filter logo; Motif logo; CNNs with filter match q < 0.05; Similar TF motifs discovered. The two logo columns are images and are not reproduced here.

Motif name   TF       CNNs with filter match q < 0.05   Similar TF motifs discovered
M6114_1.02   FOXA1    838                               M6234_1.02 FOXA3, M6241_1.02 FOXJ2, M4567_1.02 FOXA2…
M4427_1.02   CTCF     833                               M4612_1.02 CTCFL
M1906_1.02   SP1      677                               M2314_1.02 SP2, M6482_1.02 SP3, M6535_1.02 WT1…
M2296_1.02   MAFK     629                               M4629_1.02 NFE2, M4572_1.02 MAFF, M4681_1.02 BACH2…
M2292_1.02   JUND     571                               M4623_1.02 JUNB, M2278_1.02 FOS, M4619_1.02 FOSL1…
M1528_1.02   RFX6     556                               M4476_1.02 RFX5, M1529_1.02 RFX7, M5777_1.02 RFX4…
M4640_1.02   ZBTB7A   530                               M6539_1.02 ZBTB7B, M6552_1.02 ZNF148, M6422_1.02 PLAGL1…
M1970_1.02   NFIC     484                               M5664_1.02 NFIX, M5660_1.02 NFIA, M5662_1.02 NFIB
M2277_1.02   FLI1     442                               M6222_1.02 ETV4, M2275_1.02 ELF1, M5398_1.02 ERF…
M6281_1.02   HNF1A    418                               M6282_1.02 HNF1B, M6546_1.02 ZFHX3

Additional files

Supplementary file 1

Supplementary Tables.

(STable 1) Summary of publicly available datasets used to train the CNN models of human pancreatic islet epigenomic features. Where indicated, the original raw data were reprocessed with the default settings of either the ATAC-seq/DNase-seq pipeline (available from: https://github.com/kundajelab/atac_dnase_pipelines) or the AQUAS TF and histone ChIP-seq pipeline (available from: https://github.com/kundajelab/chipseq_pipeline), using the human genome GRCh37 as reference.

(STable 2) Tested sets of CNN hyperparameters. Convolutional neural networks with each set of hyperparameters, differing in the numbers and sizes of convolutional filters, were trained 100 times, for a total of 1000 CNNs trained.

(STable 3) Full list of transcription factor binding motifs with <5% FDR sequence match to motifs activating convolutional filters from the first layers of the 1000-CNN ensemble. No motif redundancy removal was applied here.

(STable 4) CNN regulatory predictions at 28 T2D association signals fine-mapped to a single most likely causal variant with genetic PPA (gPPA) ≥ 0.80 or functional PPA (fPPA) ≥ 0.80.

(STable 5) CNN regulatory predictions at signals with at least two variants with functional PPAs (fPPAs) ≥ 0.2. These are the signals where incorporating CNN predictions downstream of fine-mapping can yield the largest benefits. The table lists all variants with fPPA ≥ 0.05 at these signals, together with their CNN q-value (lowest_Q), the corresponding top-scoring CNN feature, and the mean predicted score difference across the 1000 trained models.

(STable 6) Primer sequences used for cloning of the Prox1 enhancer. Prox1_enhancer_Forward (Reverse)_internal were designed for sequence validation. Restriction enzymes NheI and XhoI were used for all subsequent cloning. SDM = site-directed mutagenesis.
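The selection rule described for STable 5 can be sketched as a small filter: keep signals with at least two variants at fPPA ≥ 0.2, then list that signal's variants with fPPA ≥ 0.05, ordered by CNN q-value. The record layout (`signal`, `rsid`, `fppa`, `lowest_q` keys) and the function name are hypothetical; only the thresholds and the lowest_Q column name come from the table description.

```python
from collections import defaultdict


def select_ambiguous_signals(variants, min_fppa=0.2, min_count=2, report_fppa=0.05):
    """Group variants by signal and keep signals that fine-mapping left ambiguous.

    variants: iterable of dicts with hypothetical keys
              "signal", "rsid", "fppa", "lowest_q".
    Returns {signal: [variants with fppa >= report_fppa, sorted by lowest_q]}.
    """
    by_signal = defaultdict(list)
    for v in variants:
        by_signal[v["signal"]].append(v)

    selected = {}
    for signal, group in by_signal.items():
        n_candidates = sum(1 for v in group if v["fppa"] >= min_fppa)
        if n_candidates >= min_count:
            kept = [v for v in group if v["fppa"] >= report_fppa]
            # The most significant CNN prediction comes first.
            selected[signal] = sorted(kept, key=lambda v: v["lowest_q"])
    return selected
```

Sorting by q-value puts the variant with the strongest predicted regulatory effect at the top of each ambiguous signal, which is where the CNN predictions add the most information.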

https://cdn.elifesciences.org/articles/51503/elife-51503-supp1-v2.xlsx
Transparent reporting form
https://cdn.elifesciences.org/articles/51503/elife-51503-transrepform-v2.docx
