Rapid, reference-free human genotype imputation with denoising autoencoders

  1. Raquel Dias
  2. Doug Evans
  3. Shang-Fu Chen
  4. Kai-Yu Chen
  5. Salvatore Loguercio
  6. Leslie Chan
  7. Ali Torkamani (corresponding author)
  1. University of Florida, United States
  2. Scripps Research Institute, United States

Abstract

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This creates computational-resource and privacy-risk barriers to accessing cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least 4-fold faster inference run time relative to standard imputation tools.

Data availability

The data that support the findings of this study are available from dbGaP and the European Genome-phenome Archive (EGA), but restrictions apply to the availability of these data, which were used under ethics approval for the current study, and so are not openly available to the public. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline. The Python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator. Instructions on how to access the parameters and hyperparameters of each of the 256 autoencoders are shared through our source code repository at https://github.com/TorkamaniLab/imputator_inference. We also share the pre-trained autoencoders and instructions on how to use them for imputation at https://github.com/TorkamaniLab/imputator_inference.

Imputation data format. The imputation results are exported in Variant Call Format (VCF), containing the imputed genotypes and imputation quality scores in the form of class probabilities for each of the three possible genotypes (homozygous reference, heterozygous, and homozygous alternate allele). The probabilities can be used for quality control of the imputation results.
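As an illustration of how the per-genotype class probabilities might be used for quality control, the following is a minimal sketch rather than the authors' pipeline. It assumes the three probabilities are stored under a FORMAT tag named GP, which the text above does not specify. The 0.9 confidence threshold and the function names parse_probabilities and call_genotype are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Order follows the text: homozygous reference, heterozygous, homozygous alternate.
GENOTYPE_LABELS = ("0/0", "0/1", "1/1")


def parse_probabilities(format_field: str, sample_field: str, tag: str = "GP") -> List[float]:
    """Extract three class probabilities from one VCF sample column.

    The tag name "GP" is an assumption; check the header of the actual output file.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    probs = [float(x) for x in values[keys.index(tag)].split(",")]
    if len(probs) != 3:
        raise ValueError("expected three genotype probabilities")
    return probs


def call_genotype(probs: List[float], min_confidence: float = 0.9) -> Tuple[str, float, bool]:
    """Return the most probable genotype, its probability, and a QC pass flag.

    min_confidence is a hypothetical threshold, not a value from the article.
    """
    best = max(range(3), key=lambda i: probs[i])
    return GENOTYPE_LABELS[best], probs[best], probs[best] >= min_confidence


# Made-up sample column for illustration only.
probs = parse_probabilities("GT:GP", "0/1:0.02,0.95,0.03")
print(call_genotype(probs))            # confident heterozygous call
print(call_genotype([0.5, 0.4, 0.1]))  # ambiguous call, fails the threshold
```

A stricter or looser threshold trades the number of retained genotypes against their expected accuracy, so the cutoff would need to be chosen for the downstream analysis.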

The following previously published data sets were used

Article and author information

Author details

  1. Raquel Dias

    Department of Microbiology and Cell Science, University of Florida, Gainesville, United States
    Competing interests
    The authors declare that no competing interests exist.
  2. Doug Evans

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  3. Shang-Fu Chen

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  4. Kai-Yu Chen

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  5. Salvatore Loguercio

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  6. Leslie Chan

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  7. Ali Torkamani

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    For correspondence
    atorkama@scripps.edu
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0003-0232-8053

Funding

National Institutes of Health (R01HG010881)

  • Raquel Dias
  • Doug Evans
  • Shang-Fu Chen
  • Kai-Yu Chen
  • Salvatore Loguercio
  • Ali Torkamani

National Institutes of Health (KL2TR002552)

  • Raquel Dias

National Institutes of Health (U24TR002306)

  • Doug Evans
  • Shang-Fu Chen
  • Kai-Yu Chen
  • Ali Torkamani

National Institutes of Health (UL1TR002550)

  • Doug Evans
  • Shang-Fu Chen
  • Kai-Yu Chen
  • Ali Torkamani

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Matthew Stephens, University of Chicago, United States

Version history

  1. Received: November 16, 2021
  2. Preprint posted: December 2, 2021
  3. Accepted: September 19, 2022
  4. Accepted Manuscript published: September 23, 2022 (version 1)
  5. Version of Record published: October 12, 2022 (version 2)

Copyright

© 2022, Dias et al.

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,801 views
  • 241 downloads
  • 3 citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Cite this article

Raquel Dias, Doug Evans, Shang-Fu Chen, Kai-Yu Chen, Salvatore Loguercio, Leslie Chan, Ali Torkamani (2022) Rapid, reference-free human genotype imputation with denoising autoencoders. eLife 11:e75600. https://doi.org/10.7554/eLife.75600
