Rapid, reference-free human genotype imputation with denoising autoencoders

Abstract

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This requirement creates computational and privacy barriers to accessing cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least 4-fold faster inference run time relative to standard imputation tools.

Data availability

The data that support the findings of this study are available from dbGaP and the European Genome-phenome Archive (EGA). Restrictions apply to the availability of these data, which were used under ethics approval for the current study, so they are not openly available to the public.

The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline. The Python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator. Instructions on how to access the parameters and hyperparameters of each of the 256 autoencoders are shared through our source code repository at https://github.com/TorkamaniLab/imputator_inference. We have also shared the pre-trained autoencoders, with instructions on how to use them for imputation, at https://github.com/TorkamaniLab/imputator_inference.

Imputation data format. The imputation results are exported in Variant Call Format (VCF), containing the imputed genotypes and imputation quality scores in the form of class probabilities for each of the three possible genotypes (homozygous reference, heterozygous, and homozygous alternate allele). These probabilities can be used for quality control of the imputation results.
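As an illustration of how the three class probabilities might be used for quality control, the following is a minimal sketch and not the authors' implementation. It assumes the probabilities for one variant in one sample have already been read from the VCF as three numbers, ordered homozygous reference, heterozygous, homozygous alternate. The function name `call_genotype`, the genotype labels, and the 0.9 threshold are hypothetical choices made for this example only.

```python
from typing import Optional, Sequence, Tuple

# Assumed class order: homozygous reference, heterozygous, homozygous alternate.
GENOTYPES = ("0/0", "0/1", "1/1")


def call_genotype(
    probs: Sequence[float], min_prob: float = 0.9
) -> Tuple[Optional[str], float]:
    """Pick the most probable genotype class.

    Returns (genotype, probability). The genotype is None when the highest
    class probability is below min_prob (a hypothetical QC threshold).
    """
    if len(probs) != 3:
        raise ValueError("expected exactly three class probabilities")
    best = max(range(3), key=lambda i: probs[i])
    best_prob = probs[best]
    if best_prob < min_prob:
        return None, best_prob  # low-confidence call, filtered out
    return GENOTYPES[best], best_prob


# Made-up probabilities for three variants, for illustration only.
examples = [
    (0.98, 0.01, 0.01),
    (0.40, 0.35, 0.25),
    (0.02, 0.03, 0.95),
]
calls = [call_genotype(p, min_prob=0.9) for p in examples]
for probs, (genotype, prob) in zip(examples, calls):
    status = genotype if genotype is not None else "filtered"
    print(probs, status, prob)
```

A stricter or looser threshold trades the number of retained genotypes against their expected accuracy; the appropriate value depends on the downstream analysis.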

The following previously published data sets were used

1. McCarthy S, Das S, Kretzschmar W, et al. (2016) Haplotype Reference Consortium. EGAS00001001710.
https://ega-archive.org/studies/EGAS00001001710

2. 1000 Genomes Project Consortium (2015) The 1000 Genomes Project Consortium. N/A.
https://www.internationalgenome.org/data-portal/data-collection/phase-3

3. Bild DE, Bluemke DA, Burke GL, et al. (2002) MESA (Multi-Ethnic Study of Atherosclerosis) study. phs001416.
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001416.v2.p1

4. The ARIC investigators consortium (1989) Atherosclerosis Risk in Communities (ARIC). phs001211.
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001211.v4.p3

5. Bergström A, McCarthy SA, Hui R, et al. (2020) Human Genome Diversity Project (HGDP). N/A.
https://www.internationalgenome.org/data-portal/data-collection/hgdp

6. Taliun D, Harris DN, Kessler MD, et al. (2021) TOPMed Cohort. Multiple.
https://topmed.nhlbi.nih.gov/

Article and author information

Author details

Raquel Dias

Department of Microbiology and Cell Science, University of Florida, Gainesville, United States

Competing interests
The authors declare that no competing interests exist.

Doug Evans

Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States

Competing interests
The authors declare that no competing interests exist.

Shang-Fu Chen

Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States

Competing interests
The authors declare that no competing interests exist.

Kai-Yu Chen

Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States

Competing interests
The authors declare that no competing interests exist.

Salvatore Loguercio

Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States

Competing interests
The authors declare that no competing interests exist.

Leslie Chan

Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States

Competing interests
The authors declare that no competing interests exist.

Ali Torkamani

Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States

For correspondence
atorkama@scripps.edu

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0003-0232-8053

Funding

National Institutes of Health (R01HG010881)

Raquel Dias
Doug Evans
Shang-Fu Chen
Kai-Yu Chen
Salvatore Loguercio
Ali Torkamani

National Institutes of Health (KL2TR002552)

Raquel Dias

National Institutes of Health (U24TR002306)

Doug Evans
Shang-Fu Chen
Kai-Yu Chen
Ali Torkamani

National Institutes of Health (UL1TR002550)

Doug Evans
Shang-Fu Chen
Kai-Yu Chen
Ali Torkamani

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.