Rapid, reference-free human genotype imputation with denoising autoencoders

  1. Raquel Dias
  2. Doug Evans
  3. Shang-Fu Chen
  4. Kai-Yu Chen
  5. Salvatore Loguercio
  6. Leslie Chan
  7. Ali Torkamani (corresponding author)
  1. University of Florida, United States
  2. Scripps Research Institute, United States

Abstract

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This requirement creates computational-resource and privacy-risk barriers to accessing cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least 4-fold faster inference run time relative to standard imputation tools.

Data availability

The data that support the findings of this study are available from dbGaP and the European Genome-phenome Archive (EGA), but restrictions apply to the availability of these data, which were used under ethics approval for the current study, and so they are not openly available to the public. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline. The Python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator. Instructions for accessing the parameters and hyperparameters of each of the 256 autoencoders are shared through our source code repository at https://github.com/TorkamaniLab/imputator_inference. The pre-trained autoencoders, and instructions on how to use them for imputation, are shared in the same repository.

Imputation data format. The imputation results are exported in Variant Call Format (VCF) containing the imputed genotypes and imputation quality scores in the form of class probabilities for each of the three possible genotypes (homozygous reference, heterozygous, and homozygous alternate allele). The probabilities can be used for quality control of the imputation results.
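As an illustration of how the three class probabilities might be used for quality control, the following is a minimal Python sketch. It is not taken from the authors' repositories: the comma-separated probability string, the function names, and the 0.9 threshold are all assumptions made for this example, and the actual VCF field names should be checked against the imputator_inference documentation.

```python
from typing import Sequence, Tuple

# Genotype classes in the order given in the text:
# homozygous reference, heterozygous, homozygous alternate.
GENOTYPES = ("hom_ref", "het", "hom_alt")


def parse_probs(field: str) -> Tuple[float, ...]:
    """Parse a comma-separated probability string (assumed layout)."""
    return tuple(float(x) for x in field.split(","))


def best_genotype(probs: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable genotype class and its probability."""
    if len(probs) != 3:
        raise ValueError("expected three class probabilities")
    idx = max(range(3), key=lambda i: probs[i])
    return GENOTYPES[idx], probs[idx]


def passes_qc(probs: Sequence[float], min_prob: float = 0.9) -> bool:
    """Keep a call only if its top class probability reaches min_prob.

    The 0.9 default is a hypothetical threshold, not a value from the paper.
    """
    _, top = best_genotype(probs)
    return top >= min_prob


if __name__ == "__main__":
    confident = parse_probs("0.02,0.95,0.03")
    ambiguous = parse_probs("0.40,0.35,0.25")
    print(best_genotype(confident), passes_qc(confident))
    print(best_genotype(ambiguous), passes_qc(ambiguous))
```

A stricter or looser threshold trades the number of retained variants against the confidence of each call, so the cutoff would normally be tuned per study.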

The following previously published data sets were used

Article and author information

Author details

  1. Raquel Dias

    Department of Microbiology and Cell Science, University of Florida, Gainesville, United States
    Competing interests
    The authors declare that no competing interests exist.
  2. Doug Evans

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  3. Shang-Fu Chen

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  4. Kai-Yu Chen

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  5. Salvatore Loguercio

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  6. Leslie Chan

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  7. Ali Torkamani

    Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    For correspondence
    atorkama@scripps.edu
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0003-0232-8053

Funding

National Institutes of Health (R01HG010881)

  • Raquel Dias
  • Doug Evans
  • Shang-Fu Chen
  • Kai-Yu Chen
  • Salvatore Loguercio
  • Ali Torkamani

National Institutes of Health (KL2TR002552)

  • Raquel Dias

National Institutes of Health (U24TR002306)

  • Doug Evans
  • Shang-Fu Chen
  • Kai-Yu Chen
  • Ali Torkamani

National Institutes of Health (UL1TR002550)

  • Doug Evans
  • Shang-Fu Chen
  • Kai-Yu Chen
  • Ali Torkamani

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

© 2022, Dias et al.

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Raquel Dias, Doug Evans, Shang-Fu Chen, Kai-Yu Chen, Salvatore Loguercio, Leslie Chan, Ali Torkamani (2022) Rapid, reference-free human genotype imputation with denoising autoencoders. eLife 11:e75600. https://doi.org/10.7554/eLife.75600
