Rapid, Reference-Free human genotype imputation with denoising autoencoders

  1. Raquel Dias
  2. Doug Evans
  3. Shang-Fu Chen
  4. Kai-Yu Chen
  5. Salvatore Loguercio
  6. Leslie Chan
  7. Ali Torkamani  Is a corresponding author
  1. Scripps Research Translational Institute, Scripps Research Institute, United States
  2. Department of Integrative Structural and Computational Biology, Scripps Research, United States
  3. Department of Microbiology and Cell Science, University of Florida, United States

Abstract

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here, we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least fourfold faster inference run time relative to standard imputation tools.

Editor's evaluation

The paper describes a novel neural-network-based strategy for imputing unmeasured genotypes, which is a standard part of most association testing pipelines. The method is computationally intensive to train, but once training is complete the imputation is fast and accurate and does not require further access to a reference panel. It has the potential to be a practically-appealing alternative to existing methods. although further work (eg training of models) is required before this new approach can be applied genome-wide.

https://doi.org/10.7554/eLife.75600.sa0

Introduction

The human genome is inherited in large blocks from parental genomes, generated through a DNA-sequence-dependent shuffling process called recombination. The non-uniform nature of recombination breakpoints producing these genomic blocks results in correlative genotype relationships across genetic variants, known as linkage disequilibrium. Thus, genotypes for a small subset (1–10%) of observed common genetic variants can be used to infer the genotype status of unobserved but known genetic variation sites across the genome (on the order of ~1 M of >10 M sites; Li et al., 2009; Marchini and Howie, 2010). This process, called genotype imputation, allows for the generation of nearly the full complement of known common genetic variation at a fraction of the cost of direct genotyping or sequencing. Given the massive scale of genotyping required for genome-wide association studies or implementation of genetically informed population health initiatives, genotype imputation is an essential approach in population genetics.

Standard approaches to genotype imputation utilize Hidden Markov Models (HMM) (Browning et al., 2018; Das et al., 2016; Rubinacci et al., 2020) distributed alongside large WGS-based reference panels (Browning and Browning, 2016). In general terms, these imputation algorithms use genetic variants shared between to-be-imputed genomes and the reference panel and apply Hidden Markov Models (HMM) to impute the missing genotypes per sample (Das et al., 2018). The hidden states in the HMMs represent the haplotype in a reference panel that is most closely related to the haplotype being imputed. The HMM parameter estimation also depends on recombination rates, mutation rates, and/or genotype error rates that must be fit by Markov Chain Monte Carlo Algorithm (MCMC) or an expectation-maximization algorithm. Thus, HMM-based imputation is a computationally intensive process, requiring access to both high-performance computing environments and large, privacy-sensitive, WGS reference panels (Kowalski et al., 2019). Often, investigators outside of large consortia will resort to submitting genotype data to imputation servers (Das et al., 2016), resulting in privacy and scalability concerns (Sarkar et al., 2021).

Recently, artificial neural networks, especially autoencoders, have attracted attention in functional genomics for their ability to fill-in missing data for image restoration and inpainting (Chaitanya et al., 2017; Ghosh et al., 2020; Mao et al., 2016; Xie et al., 2012). Autoencoders are neural networks tasked with the problem of simply reconstructing the original input data, with constraints applied to the network architecture or transformations applied to the input data in order to achieve a desired goal like dimensionality reduction or compression, and de-noising or de-masking (Abouzid et al., 2019; Liu et al., 2020; Voulodimos et al., 2018). Stochastic noise or masking is used to modify or remove data inputs, training the autoencoder to reconstruct the original uncorrupted data from corrupted inputs (Tian et al., 2020). Autoencoders that receive corrupted or masked data as input and are trained to predict the original uncorrupted data as the output are also known as denoising autoencoders. These autoencoder characteristics are well-suited for genotype imputation and may address some of the limitations of HMM-based imputation by eliminating the need for dissemination of reference panels and allowing the capture of non-linear relationships in genomic regions with complex linkage disequilibrium structures. Some attempts at genotype imputation using neural networks have been previously reported, though for specific genomic contexts (Naito et al., 2021) at genotype masking levels (5–20%) not applicable in typical real-world population genetics scenarios (Chen and Shi, 2019; Islam et al., 2021; Sun and Kardia, 2008), or in contexts where the neural network must be trained for imputation with specific input variant sets, i.e. the model must be re-trained for imputation from different genotyping array (Kojima et al., 2020).

Here, we present a generalized approach to unphased human genotype imputation using sparse, denoising autoencoders capable of highly accurate genotype imputation at genotype masking levels (98+%) appropriate for array-based genotyping and low-pass sequencing-based population genetics initiatives. We describe the initial training and implementation of autoencoders spanning all of human chromosome 22, achieving equivalent to superior accuracy relative to modern HMM-based methods, and dramatically improving computational efficiency at deployment without the need to distribute reference panels.

Materials and methods

Overview

Sparse, de-noising autoencoders spanning all bi-allelic SNPs observed in the Haplotype Reference Consortium were developed and optimized. Each autoencoder receives masked data as input and is trained to predict the original uncorrupted data as the output. Each bi-allelic SNP was encoded as two binary input nodes, representing the presence or absence of each allele (Figure 1B, E). This encoding allows for the straightforward extension to multi-allelic architectures and non-binary allele presence probabilities. A data augmentation approach using modeled recombination events and offspring formation coupled with random masking at an escalating rate drove our autoencoder training strategy (Figure 1C). Because of the extreme skew of the allele frequency distribution for rarely present alleles (Auton et al., 2015), a focal-loss-based approach was essential to genotype imputation performance. The basic architecture of the template fully-connected autoencoder before optimization to each genomic segment is depicted in Figure 1D. Individual autoencoders were designed to span genomic segments with boundaries defined by computationally identified recombination hotspots (Figure 1A). The starting point for model hyperparameters were randomly selected from a grid of possible combinations and were further tuned from a battery of features describing the complexity of the linkage-disequilibrium structure of each genomic segment.

Schematic overview of the autoencoder training workflow.

(A) Tiling of autoencoders across the genome is achieved by (A.1) calculating a n x n matrix of pairwise SNP correlations, thresholding them at 0.45 (selected values are shown in red background, excluded values in gray), (A.2) quantifying the overall local LD strength centered at each SNP by computing their local correlation box counts and splitting the genome into approximately independent segments by identifying local minima (recombination hotspots). The red arrow illustrates minima between strong LD regions. For reducing computational complexity, we calculated the correlations in a fixed sliding box size of 500x500 common variants (MAF ≥ 0.5%). Thus, the memory utilization for calculating correlations will be the same regardless of genomic density. (B) Ground truth whole genome sequencing data is encoded as binary values representing the presence (1) or absence (0) of the reference allele (blue) and alternative allele (red). (C) Variant masking (setting both alleles as absent, represented by 0, corrupts data inputs at a gradually increasing masking rate). Example masked variants are outlined. (D) Fully-connected autoencoders spanning segments defined as shown in panel (A), are then trained to reconstruct the original uncorrupted data from corrupted inputs; (E) the reconstructed outputs (imputed data) are compared to the ground truth states for loss calculation and are decoded back to genotypes.

Genotype encoding

Request a detailed protocol

Genotypes for all bi-allelic SNPs were converted to binary values representing the presence (1) or absence (0) of the reference allele A and alternative allele B, respectively, as shown in Equation 1.

(1) xi=ifGi=[A,A]: xi=[1,0]ifGi=A,B: xi=1,1ifGi=B,A: xi=1,1ifGi=B,B: xi=0,1ifGi=null: xi=0,0

where x is a vector containing the two allele presence input nodes to the autoencoder and their encoded allele presence values derived from the original genotype, G, of variant i. The output nodes of the autoencoder are similarly rescaled to 0–1 by a sigmoid function, split into three genotype outputs (homozygous reference, homozygous alternate, and heterozygous), and normalized using the Softmax function. The normalized outputs can also be regarded as probabilities and can be combined for the calculation of alternative allele dosage and as a measure of imputation quality. This representation is extensible to other classes of genetic variation, and allows for the use of probabilistic loss functions.

Training data, masking, and data augmentation

Training data

Request a detailed protocol

Whole-genome sequence data from the Haplotype Reference Consortium (HRC) was used for training and as the reference panel for comparison to HMM-based imputation (McCarthy et al., 2016). The dataset consists of 27,165 samples and 39,235,157 biallelic SNPs generated using whole-genome sequence data from 20 studies of predominantly European ancestry (HRC Release 1.1): 83.92% European, 2.33% East Asian, 1.63% Native American, 2.17% South Asian, 2.96% African, and 6.99% admixed ancestry individuals. Genetic ancestry was determined using continental population classification from the 1000 Genomes Phase3 v5 (1000 G) reference panel and a 95% cutoff using Admixture software (Alexander et al., 2009). Genotype imputation autoencoders were trained for all 510,442 unique SNPs observed in HRC on human chromosome 22. For additional comparisons, whole-genome sequence data from 31 studies available through the NHLBI Trans-Omics for Precision Medicine (TOPMed) program were used as an alternative reference panel for HMM-based imputation tools (Taliun et al., 2021). We downloaded Freeze 8 of TOPMed, which is the latest version with all consent groups genotyped across the same set of jointly called variants. GRCh38 TOPMed cohorts were converted to hg19 with Picard 2.25 (‘Picard toolkit’, 2019), and multi allelic SNPs removed with bcftools v.1.10.2 (Danecek et al., 2021). Any variants with missing genotypes were excluded as well, yielding a final reference panel for chr22 consisting of 73,586 samples and 11,089,826 biallelic SNPs. Since the ARIC and MESA cohorts are used for model selection and validation, they were excluded from the TOPMed reference panel. Relatedness analysis using the KING robust kinship estimator revealed significant data leakage boosting HMM-based imputation performance through individuals directly participating in the MESA and other TOPMed cohorts, as well as through numerous first- and second-degree familial relationships spanning MESA individuals and individuals in other TOPMed cohorts.

Validation and testing data

Request a detailed protocol

A balanced (50%:50% European and African genetic ancestry) subset of 796 whole genome sequences from the Atherosclerosis Risk in Communities cohort (ARIC) (Mou et al., 2018), was used for model validation and selection. The Wellderly (Erikson et al., 2016), Human Genome Diversity Panel (HGDP) (Cann et al., 2002), and Multi-Ethnic Study of Atherosclerosis (MESA) (Bild et al., 2002 ) cohorts were used for model testing. The Wellderly cohort consisted of 961 whole genomes of predominantly European genetic ancestry. HGDP consisted of 929 individuals across multiple ancestries: 11.84% European, 14.64% East Asian, 6.57% Native American, 10.98% African, and 55.97% admixed. MESA consisted of 5370 whole genomes across multiple ancestries: 27.62% European, 11.25% East Asian, 4.99% Native American, 5.53% African, and 50.61% admixed. MESA, Wellderly, and HGDP are all independent datasets, not used for autoencoder training, nor model selection, whereas HRC and ARIC were utilized for training and model selection, respectively.

GRCh38 mapped cohorts (HGDP and MESA) were converted to hg19 using Picard v2.25 (Broad Institute, 2022). All other datasets were originally mapped and called against hg19. Multi-allelic SNPs, SNPS with >10% missingness, and SNPs not observed in HRC were removed with bcftools v1.10.2 (Danecek et al., 2021). Mock genotype array data was generated from these WGS cohorts by restricting genotypes to those present on commonly used genotyping arrays (Affymetrix 6.0, UKB Axiom, and Omni 1.5 M). For chromosome 22, intersection with HRC and this array-like masking respectively resulted in: 9025, 10,615, and 14,453 out of 306,812 SNPs observed in ARIC; 8630, 10,325, and 12,969 out of 195,148 SNPs observed in the Wellderly; 10,176, 11,086, and 14,693 out of 341,819 SNPs observed in HGDP; 9237, 10,428, and 13,677 out of 445,839 SNPs observed in MESA. All input genotypes from all datasets utilized in this work are unphased, and no pre-phasing was performed.

Data augmentation

Request a detailed protocol

We employed two strategies for data augmentation – random variant masking and simulating further recombination with offspring formation. During training, random masking of input genotypes was performed at escalating rates, starting with a relatively low masking rate (80% of variants) that is gradually incremented in subsequent training rounds until up to only five variants remain unmasked per autoencoder. Masked variants are encoded as the null case in Equation 1. During finetuning we used sim1000G (Dimitromanolakis et al., 2019) to simulate of offspring formation using the default genetic map and HRC genomes as parents. A total of 30,000 offspring genomes were generated and merged with the original HRC dataset, for a total of 57,165 genomes.

Loss function

Request a detailed protocol

In order to account for the overwhelming abundance of rare variants, the accuracy of allele presence reconstruction was scored using an adapted version of focal loss (FL) (Lin et al., 2017), shown in Equation 2.

(2) FL=-αt1-ptγ [xtlogpt+1-xtlog1-pt]

where the classic cross entropy (shown as binary log loss in brackets) of the truth class (xt) predicted probability (pt) is weighted by the class imbalance factor αt and a modulating factor (1 - pt)γ. t represents the index of each allele in a genomic segment. The modulating factor is the standard focal loss factor with hyperparameter, γ, which amplifies the focal loss effect by down-weighting the contributions of well-classified alleles to the overall loss (especially abundant reference alleles for rare variant sites)(Lin et al., 2017). αt is an additional balancing hyperparameter set to the truth class frequency.

This base focal loss function is further penalized and regularized to encourage simple and sparse models in terms of edge-weight and hidden layer activation complexity. These additional penalties result in our final loss function as shown in Equation 3.

(3) SFL=αt(1pt)γ[xtlog(pt)+(1xt)log(1pt)]+βS(ρ||ρ^)+λ1W1+λ2W2

where W1 and W2 are the standard L1 and L2 norms of the autoencoder weight matrix (W), with their contributions mediated by the hyperparameters λ1 and λ2. S is a sparsity penalty, with its contribution mediated by the hyperparameter β, which penalizes deviation from a target hidden node activation set by the hyperparameter vs the observed mean activation ρ^ over a training batch j summed over total batches n, as shown in Equation 4:

(4) S(ρ||ρ^)=j=1nρlog(ρρ^j)+(1ρ)log(1ρ1ρ^j)

Genome tiling

Request a detailed protocol

All model training tasks were distributed across a diversified set of NVIDIA graphical processing units (GPUs) with different video memory limits: 5 x Titan Vs (12 GB), 8x A100s (40 GB), 60x V100s (32 GB). Given computational complexity and GPU memory limitations, individual autoencoders were designed to span approximately independent genomic segments with boundaries defined by computationally identified recombination hotspots (Figure 1E). These segments were defined using an adaptation of the LDetect algorithm (Berisa and Pickrell, 2016). First, we calculated a n x n matrix of pairwise SNP correlations using all common genetic variation (≥0.5% minor allele frequency) from HRC. Correlation values were thresholded at 0.45, which is the threshold that returns the minimum number of segments spanning chromosome 22 with an average size per segment that fits into the video memory of GPUs. While developing the tiling algorithm, we tested lower thresholds, which made the segments smaller and more abundant, and thus made the GPU memory workload less efficient (e.g. many tiles resulted in many autoencoders per GPU, which thus caused a CPU-GPU communication overhead). Due to the obstacles related to computational inefficiency, CPU-GPU communication overhangs, and GPU memory limits, we did not proceed with model training on segments generated with other correlation thresholds. For each SNP, we calculated a box count of all pairwise SNP correlations spanning 500 common SNPs upstream and downstream of the index SNP. This moving box count quantifies the overall local LD strength centered at each SNP. Local minima in this moving box count were used to split the genome into approximately independent genomic segments of two types – large segments of high LD interlaced with short segments of weak LD corresponding to recombination hotspot regions. Individual autoencoders were designed to span the entirety of a single high LD segment plus its adjacent upstream and downstream weak LD regions. Thus, adjacent autoencoders overlap at their weak LD ends. If an independent genomic segment exceeded the threshold number of SNPs amenable to deep learning given GPU memory limitations, internal local minima within the high LD regions were used to split the genomic segments further to a maximum of 6000 SNPs per autoencoder. Any remaining genomic segments still exceeding 6000 SNPs were further split into 6000 SNP segments with large overlaps of 2500 SNPs given the high degree of informative LD split across these regions. This tiling process resulted in 256 genomic segments spanning chromosome 22: 188 independent LD segments, 32 high LD segments resulting from internal local minima splits, and 36 segments further split due to GPU memory limitations.

Hyperparameter initialization and grid search

Request a detailed protocol

We first used a random grid search approach to define initial hyperparameter combinations producing generally accurate genotype imputation results. The hyperparameters and their potential starting values are listed in Table 1. This coarse-grain grid search was performed on all genomic segments of chromosome 22 (256 genomic segments), each tested with 100 randomly selected hyperparameter combinations per genomic segment, with a batch size of 256 samples, training for 500 epochs without any stop criteria, and validating on an independent dataset (ARIC). To evaluate the performance of each hyperparameter combination, we calculated the average coefficient of determination (r-squared) comparing the predicted and observed alternative allele dosages per variant. Concordance and F1-score were also calculated to screen for anomalies but were not ultimately used for model selection.

Table 1
Description and values of hyperparameters tested in grid search.

λ1: scaling factor for Least Absolute Shrinkage and Selection Operator (LASSO or L1) regularization; λ2: scaling factor for Ridge (L2) regularization; β: scaling factor for sparsity penalty described in Equation 4; ρ: target hidden layer activation described in Equation 4; Activation function type: defines how the output of a hidden neuron will be computed given a set of inputs; Learning rate: step size at each learning iteration while moving toward the minimum of the loss function; γ: amplifying factor for focal loss described in Equation 3; Optimizer type: algorithms utilized to minimize the loss function and update the model weights in backpropagation; Loss type: algorithms utilized to calculate the model error (Equation 2); Number of hidden layers: how many layers of artificial neurons to be implemented between input layer and output layer; Hidden layer size ratio: scaling factor to resize the next hidden layer with reference to the size of Its previous layer; Learning rate decay ratio: scaling factor for updating the learning rate value on every 500 epochs.

Hyperparameter descriptionTested values (coarse-grid search)
λ1 for L1 regularization[1e-3, 1e-4, 1e-5, 1e-6, 1e-1, 1e-2, 1e-7, 1e-8]
λ2 for L2 regularization[1e-3, 1e-4, 1e-5, 1e-6, 1e-1, 1e-2, 1e-7, 1e-8]
Sparsity scaling factor (β)[0, 0.001, 0.01, 0.05, 1, 5, 10]
Target average hidden layer activation (ρ)[0.001, 0.004, 0.007, 0.01, 0.04, 0.07, 0.1, 0.4, 0.7, 1.0]
Activation function type[‘sigmoid’, ‘tanh’, ‘relu’, ‘softplus’]
Learning rate[0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
Amplifying factor for focal loss (γ)[0, 0.5, 1, 2, 3, 5]
Optimizer type[‘Adam’, ‘RMS Propagation’, ‘Gradient Descent’]
Loss type[‘Binary Cross Entropy’, ‘Custom Focal Loss’]
Number of hidden layers[1, 2, 4, 6, 8]
Hidden layer size ratio[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
Learning rate decay ratio[ 0.0, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999, 0.9999]

Hyperparameter tuning

Request a detailed protocol

In order to avoid local optimal solutions and reduce the hyperparameter search space, we used an ensemble-based machine learning approach (Extreme Gradient Boosting–XGBoost) to predict the expected performance (r-squared) of each hyperparameter combination per genomic segment using the results of the coarse-grid search and predictive features calculated for each genomic segment. These features include the number of variants, average recombination rate and average pairwise Pearson correlation across all SNPs, proportion of rare and common variants across multiple minor allele frequency (MAF) bins, number of principal components necessary to explain at least 90% of variance, and the total variance explained by the first two principal components. The observed accuracies of the coarse-grid search, numbering 25,600 training inputs, were used to predict the accuracy of 500,000 new hyperparameter combinations selected from Table 1 without training. All categorical predictors (activation function name, optimizer type, loss function type) were one-hot encoded. The model was implemented using XGBoost package v1.4.1 in Python v3.8.3 with 10-fold cross-validation and default settings.

We then ranked all hyperparameter combinations by their predicted performance and selected the top 10 candidates per genomic segment along with the single best initially tested hyperparameter combination per genomic segments for further consideration. All other hyperparameter combinations were discarded. Genomic segments with sub-optimal performance relative to Minimac were subjected to tuning with simulated offspring formation. For tuning, the maximum number of epochs was increased (35,000) with automatic stop criteria: if there is no improvement in average loss value of the current masking/training cycle versus the previous one, the training is interrupted, otherwise training continues until the maximum epoch limit is reached. Each masking/training cycle consisted of 500 epochs. Final hyperparameter selection was based on performance on the validation dataset (ARIC).

This process results in 256 unique autoencoders spanning the genomic segments of chromosome 22. Each genomic segment consists of a different number of input variables (genetic variants), sparsity, and correlation structure. Thus, 256 unique autoencoder models span the entirety of chromosome 22 (e.g.: each autoencoder has different edge weights, number of layers, loss function, as well as regularization and optimization parameters).

Performance testing and comparisons

Request a detailed protocol

Performance was compared to Minimac4 (Das et al., 2016), Beagle5 (Browning et al., 2018), and Impute5 (Rubinacci et al., 2020) using default parameters. We utilized HRC as reference panel for the HMM-based imputation tools, which is the same dataset used for training the autoencoders, and we applied the same quality control standards for both HMM-based and autoencoder-based imputation. We also provide additional comparisons to HMM-based imputation using the TOPMed cohort. No post-imputation quality control was applied. Population level reconstruction accuracy is quantified by measuring r-squared across multiple strata of data: per genomic segment, at whole chromosome level, and stratified across multiple minor allele frequency bins: [0.001–0.005], [0.005–0.01], [0.01–0.05], [0.05–0.1], [0.1–0.2], [0.2–0.3], [0.3–0.4], [0.4–0.5]. While r-squared is our primary comparison metric, sample-level and population-level model performance is also evaluated with concordance and the F1-score. Wilcoxon rank-sum testing was used to assess the significance of accuracy differences observed. Spearman correlations were used to evaluate the relationships between genomic segment features and observed imputation accuracy differences. Standard errors for per variant imputation accuracy r-squared is equal or less than 0.001 where not specified. Performance is reported only for the independent test datasets (Wellderly, MESA, and HGDP). Note that MESA ultimately is not independent of the TOPMed cohort when used for HMM-based imputation.

We used the MESA cohort for inference runtime comparisons. Runtime was determined using the average and standard error of three imputation replicates. Two hardware configurations were used for the tests: (1) a low-end environment: 16-core Intel Xeon CPU (E5-2640 v2 2.00 GHz), 250 GB RAM, and one GPU (NVIDIA GTX 1080); (2) a high-end environment: 24-Core AMD CPU (EPYC 7352 2.3 GHz), 250 GB RAM, using one NVIDIA A100 GPU. We report computation time only, input/output (I/O) reading/writing times are excluded as separately optimized functions. Since the computational burden of training the models remains on the developer side, the runtime results refer to the task of imputing the missing genotypes given a pre-trained autoencoder set.

Data availability

Request a detailed protocol

The data that support the findings of this study are available from dbGAP and European Genome-phenome Archive (EGA), but restrictions apply to the availability of these data, which were used under ethics approval for the current study, and so are not openly available to the public. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline; Dias et al., 2022. The python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator; Dias, 2021. Instructions on how to access the unique information on the parameters and hyperparameters of each one of the 256 autoencoders is shared through our source code repository at https://github.com/TorkamaniLab/imputator_inference, copy archived at swh:1:rev:2fbd203acf8aaf320a520c6374d6f4d57f068a7c; Dias, 2022. We also shared the pre-trained autoencoders and instructions on how to use them for imputation at https://github.com/TorkamaniLab/imputator_inference; Dias, 2022.

Imputation data format

Request a detailed protocol

The imputation results are exported in variant calling format (VCF) containing the imputed genotypes and imputation quality scores in the form of class probabilities for each one of the three possible genotypes (homozygous reference, heterozygous, and homozygous alternate allele). The probabilities can be used for quality control of the imputation results.

Results

Untuned performance and model optimization

A preliminary comparison of the best performing autoencoder per genomic segment vs HMM-based imputation was made after the initial grid search (Minimac4: Figure 2, Beagle5 and Eagle5: Figure 2—figure supplements 12). Untuned autoencoder performance was generally inferior to all tested HMM-based methods except when tested on the European ancestry-rich Wellderly dataset when masked using the Affymetrix 6.0 and UKB Axiom marker sets, but not Omni 1.5 M markers. HMM-based imputation was consistently superior across the more ancestrally diverse test datasets (MESA and HGDP) (two proportion test, p≤8.77 × 10–6). Overall, when performance across genomic segments, test datasets, and test array marker sets was combined, the autoencoders exhibited an average r-squared per variant of 0.352±0.008 in reconstruction of WGS ground truth genotypes versus an average r-squared per variant of 0.374±0.007, 0.364±0.007, and 0.357±0.007 for HMM-based imputation methods (Minimac4, Beagle5, and Impute5, respectively) (Table 2). This difference was statistically significant only relative to Minimac4 (Minimac4: Wilcoxon rank-sum test p=0.037, Beagle5 and Eagle5: p≥0.66).

Figure 2 with 4 supplements see all
HMM-based (y-axis) versus autoencoder-based (x-axis) imputation accuracy prior to tuning.

Minimac4 and untuned autoencoders were tested across three independent datasets–- MESA (top), Wellderly (middle), and HGDP (bottom) and across three genotyping array platforms–- Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Minimac4 outperformed the untuned autoencoder (left of identity line) and the number of genomic segments in which the untuned autoencoder surpassed Minimac4 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Table 2
Performance comparisons between untuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5).

Average r-squared per variant was extracted from each genomic segment of chromosome 22. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤0.05, ** indicates p-values ≤0.001, and *** indicates p-values ≤0.0001.

MESAWellderlyHGDPAffymetrix 6.0UKB AxiomOmni 1.5 MCombined
AE (untuned)0.303±0.0080.470±0.0090.285±0.0060.339±0.0080.356±0.0070.362±0.0080.352±0.008
Minimac40.337±0.007*0.471±0.0080.314±0.006**0.352±0.0080.370±0.0060.400±0.007**0.374±0.007*
Beagle50.336±0.007*0.460±0.0080.296±0.0050.342±0.0070.367±0.0060.384±0.007*0.364±0.007
Impute50.326±0.007*0.458±0.0080.289±0.0060.336±0.0080.354±0.0060.383±0.008*0.358±0.007

In order to understand the relationship between genomic segment features, hyperparameter values, and imputation performance, we calculated predictive features (see Materials and methods) for each genomic segment and determined their Spearman correlation with the differences in r-squared observed for the autoencoder vs Minimac4 (Figure 2—figure supplement 3). We observed that the autoencoder had superior performance when applied to the genomic segments with the most complex LD structures: those with larger numbers of observed unique haplotypes, unique diplotypes, and heterozygosity, as well as high average MAF, and low average pairwise Pearson correlation across all SNPs (average LD) (Spearman correlation ρ≥0.22, p≤9.8 × 10–04). Similarly, we quantified genomic segment complexity by the proportion of variance explained by the first two principal components as well as the number of principal components needed to explain at least 90% of the variance of HRC genotypes from each genomic segment. Concordantly, superior autoencoder performance was associated with a low proportion explained by the first two components and positively correlated with the number of components required to explained 90% of variance (Spearman ρ≥0.22, p≤8.3 × 10–04). These observations, with predictive features determined in the HRC training dataset and performance determined in the ARIC validation dataset, informed our tuning strategy.

We then used the genomic features significantly correlated with imputation performance in the ARIC validation dataset to predict the performance of and select the hyperparameter values to advance to fine-tuning. An ensemble model inference approach was able to predict the genomic segment-specific performance of hyperparameter combinations with high accuracy (Figure 2—figure supplement 4, mean r-squared=0.935 ± 0.002 of predicted vs observed autoencoders accuracies via 10-fold cross validation). The top 10 best performing hyperparameter combinations were advanced to fine-tuning (Table 3). Autoencoder tuning with simulated offspring formation was then executed as described in Materials and methods.

Table 3
Top 10 best performing hyperparameter combinations that advanced to fine-tuning.

See Materials and methods and Table 1 for a detailed description of the hyperparameters.

λ1λ2βρActivationLearn rateγOptimizerLoss typeHidden layersSize ratioDecay
0.100.010.01tanh1.0*10–40adamCE410.95
0.1010.5sigmoid1.0*10–41adamCE20.90.95
0.1050.5sigmoid1.0*10–14adamCE20.50
0.1010.005relu1.0*10–14adamFL610.25
0.1050.01relu1.0*10–55adamFL410.95
0.100.010.1leakyrelu1.0*10–50adamFL80.90.95
0.1010.01tanh1.0*10–40adamCE610.95
01.0*10–80.0010.05relu1.0*10–54adamCE80.60.95
0.1000.01relu1.0*10–15adamFL80.90
0.100.010.01tanh1.0*10–35adamCE210.95

Tuned performance

After tuning, autoencoder performance surpassed HMM-based imputation performance across all imputation methods, independent test datasets, and genotyping array marker sets. At a minimum, autoencoders surpassed HMM-based imputation performance in >62% of chromosome 22 genomic segments (two proportion test p=1.02 × 10–11) (Minimac4: Figure 3, Beagle5 and Eagle5: Figure 3—figure supplements 12). Overall, the optimized autoencoders exhibited superior performance with an average r-squared of 0.395±0.007 vs 0.374±0.007 for Minimac4 (Wilcoxon rank sum test p=0.007), 0.364±0.007 for Beagle5 (Wilcoxon rank sum test p=1.53*10–4), and 0.358±0.007 for Impute5 (Wilcoxon rank sum test p=2.01*10–5) (Table 4). This superiority was robust to the marker sets tested, with the mean r-squared per genomic segment for autoencoders being 0.373±0.008, 0.399±0.007, and 0.414±0.008 vs 0.352±0.008, 0.370±0.006, and 0.400±0.007 for Minimac4 using Affymetrix 6.0, UKB Axiom, and Omni 1.5 M marker sets (Wilcoxon rank-sums test P-value = 0.029, 1.99*10–4, and 0.087, respectively). Detailed comparisons to Beagle5 and Eagle5 are presented in Figure 3—figure supplements 12.

Figure 3 with 7 supplements see all
HMM-based (y-axis) versus autoencoder-based (axis) imputation accuracy after tuning.

Minimac4 and tuned autoencoders were validated across three independent datasets–- MESA (top), Wellderly (middle), and HGDP (bottom) and across three genotyping array platforms–- Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Minimac4 outperformed the untuned autoencoder (left of identity line) and the number of genomic segments in which the untuned autoencoder surpassed Minimac4 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Table 4
Performance comparisons between tuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5).

Average r-squared per variant was extracted from each genomic segment of chromosome 22. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference untuned autoencoder (AE). * represents p-values ≤0.05, ** indicates p-values ≤0.001, and *** indicates p-values ≤0.0001.

MESAWellderlyHGDPAffymetrix 6.0UKB AxiomOmni 1.5 MCombined
AE (tuned)0.355±0.0070.505±0.0080.327±0.0060.373±0.0080.399±0.0070.414±0.0080.396±0.007
AE (untuned)0.303±0.008***0.470±0.009*0.285±0.006***0.339±0.008*0.356±0.007***0.362±0.008***0.352±0.008***
Minimac40.337±0.007*0.471±0.008**0.314±0.0060.352±0.008*0.370±0.006**0.400±0.0070.374±0.007*
Beagle50.336±0.007*0.460±0.008***0.296±0.005***0.342±0.007**0.367±0.006***0.384±0.007**0.364±0.007**
Impute50.326±0.007*0.458±0.008***0.289±0.006***0.336±0.008**0.354±0.006***0.383±0.008**0.358±0.007***

Tuning improved performance of the autoencoders across all genomic segments, generally improving the superiority of autoencoders relative to HMM-based approaches in genomic segments with complex haplotype structures while equalizing performance relative to HMM-based approaches in genomic segments with more simple LD structures (as described in Materials and methods, by the number of unique haplotypes: Figure 3—figure supplement 3, diplotypes: Figure 3—figure supplement 4, average pairwise LD: Figure 3—figure supplement 5, proportion variance explained: Figure 3—figure supplement 6). Concordantly, genomic segments with higher recombination rates exhibited the largest degree of improvement with tuning (Figure 3—figure supplement 7). Use of the augmented reference panel did not improve HMM-based imputation, having no influence on Minimac4 performance (original overall r-squared of 0.374±0.007 vs 0.363±0.007 after augmentation, Wilcoxon rank-sum test p=0.0917), and significantly degrading performance of Beagle5 and Impute5 (original r-squared of 0.364±0.007 and 0.358±0.007 vs 0.349±0.006 and 0.324±0.007 after augmentation, p=0.026 and p=1.26*10–4, respectively). Summary statistics for these comparisons are available in Supplementary file 1.

Overall chromosome 22 imputation accuracy

After merging the results from all genomic segments, the whole chromosome accuracy of autoencoder-based imputation remained superior to all HMM-based imputation tools, across all independent test datasets, and all genotyping array marker sets (Wilcoxon rank-sums test p≤5.55 × 10–67). The autoencoder’s mean r-squared per variant ranged from 0.363 for HGDP to 0.605 for the Wellderly vs 0.340–0.557 for Minimac4, 0.326–0.549 for Beagle5, and 0.314–0.547 for Eagle5, respectively. Detailed comparisons are presented in in Table 5 and Supplementary file 2.

Table 5
Whole chromosome level comparisons between autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5).

Average r-squared per variant was extracted at whole chromosome level. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤0.05, ** indicates p-values ≤0.001, and *** indicates p-values ≤0.0001. Standard errors that are equal or less than 0.001 are not shown.

MESAWellderlyHGDP
Affymetrix 6.0UKB AxiomOmni 1.5 MAffymetrix 6.0UKB AxiomOmni 1.5 MAffymetrix 6.0UKB AxiomOmni 1.5 M
AE (tuned)0.4100.3950.4520.5370.6050.5860.3630.3640.392
Minimac40.390***0.364***0.436***0.500***0.557***0.551***0.350***0.340***0.385***
Beagle50.383***0.379***0.420***0.484***0.549***0.534***0.326***0.328***0.353***
Impute50.384***0.356***0.429***0.485***0.547***0.539***0.328***0.314***0.359***

Further, when imputation accuracy is stratified by MAF bins, the autoencoders maintain superiority across all MAF bins by nearly all test dataset and genotyping array marker sets (Figure 4, and Supplementary file 3). Concordantly, autoencoder imputation accuracy is similarly superior when measured with F1-scores (Figure 4—figure supplement 1) and concordance (Figure 4—figure supplement 2), though these metrics are less sensitive at capturing differences in rare variant imputation accuracy.

Figure 4 with 3 supplements see all
HMM-based versus autoencoder-based imputation accuracy across MAF bins.

Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across three independent datasets–- MESA (top), Wellderly (middle), and HGDP (bottom) and across three genotyping array platforms–- Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤0.05, ** indicates p-values ≤0.001, and *** indicates p-values ≤0.0001, ns represents non-significant p-values.

When we upgraded the reference panel of the HMM-based tools with the more expansive TOPMed cohort, the superior performance of the HRC-trained autoencoder was still sustained across all datasets except for MESA (Figure 4—figure supplement 3). Given that MESA is a sub-cohort of the TOPMed cohort, we evaluated the possibility of residual data leakage after the removal of MESA from the TOPMed cohort and found that 44 MESA individuals were duplicated in other TOPMed cohorts, 182 MESA individuals had a first degree relative in other TOPMed cohorts, and >92% of MESA individuals had at least one second degree relative in other TOPMed cohorts, resulting in improved imputation performance. Notably, across the most diverse and truly independent HGDP validation dataset, the autoencoder displays superior performance despite only being exposed to training on the less diverse HRC reference cohort.

Ancestry-specific chromosome 22 imputation accuracy

Finally, we evaluated ancestry-specific imputation accuracy. As before, overall autoencoder-based imputation maintains superiority across all continental populations present in MESA (Figure 5, Wilcoxon rank-sums test p=5.39 × 10–19). The autoencoders’ mean r-squared ranged from 0.357 for African ancestry to 0.614 for East Asian ancestry vs 0.328–0.593 for Minimac4, 0.330–0.544 for Beagle5, and 0.324–0.586 for Impute5, respectively. Note, East Asian ancestry exhibits a slightly higher overall imputation accuracy relative to European ancestry due to improved rare variant imputation. Autoencoder superiority replicates when HGDP is split into continental populations (Figure 5—figure supplement 1).

Figure 5 with 3 supplements see all
HMM-based versus autoencoder-based imputation accuracy across ancestry groups.

Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across individuals of diverse ancestry from MESA cohort (EUR: European (top); EAS: East Asian (2nd row); AMR: Native American (3rd row); AFR: African (bottom)) and multiple genotype array platforms (Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right)). Each data point represents the imputation accuracy (average r-squared per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤0.05, ** indicates p-values ≤0.001, and *** indicates p-values ≤0.0001, ns represents non-significant p-values.

Further stratification of ancestry-specific imputation accuracy results by MAF continues to support autoencoder superiority across all ancestries, MAF bins, and nearly all test datasets, and genotyping array marker sets (Figure 5, Figure 5—figure supplement 1). Minimum and maximum accuracies across MAF by ancestry bins ranged between 0.177–0.937 for the autoencoder, 0.132–0.907 for Minimac4, 0.147–0.909 for Beagle5, and 0.115–0.903 for Impute5, with a maximum standard error of ±0.004.

Thus, with training on equivalent reference cohorts, autoencoder performance was superior across all variant allele frequencies and ancestries with the primary source of superiority arising from hard to impute regions with complex LD structures. When the reference panel of the HMM-based tools is upgraded to the more diverse TOPMed dataset, the HRC-trained autoencoder remains superior across all ancestry groups of HGDP (Figure 5—figure supplement 2), as well as in the MESA ancestries well represented in HRC (European and East Asian) but not in MESA ancestries where representation is significantly enhanced by the TOPMed reference panel (American and African) with additional imputation performance deriving from a significant degree of familial relationships spanning the TOPMed reference panel and MESA test cohort (Figure 5—figure supplement 3).

Inference speed

Inference runtimes for the autoencoder vs HMM-based methods were compared in a low-end and high-end computational environment as described in Methods. In the low-end environment, the autoencoder’s inference time is at least ~4 X faster than all HMM-based inference times (summing all inference times from all genomic segments of chromosome 22, the inference time for the autoencoder was 2.4±1.1*10–3 seconds versus 1,754±3.2, 583.3±0.01, and 8.4±4.3*10–3 s for Minimac4, Beagle5, and Impute5, respectively (Figure 6A)). In the high-end environment, this difference narrows to a~3 X advantage of the autoencoder vs HMM-based methods (2.1±8.0*10–4 versus 374.3±1.2, 414.3±0.01, and 6.1±2.1*10–4) seconds for Minimac4, Beagle5, and Impute5, respectively (Figure 6B). These unoptimized results indicate that autoencoder-based imputation can be executed rapidly, without a reference cohort, and without the need for a high-end server or high-performance computing (HPC) infrastructure. However, we must note that to deploy the autoencoder-based imputation to production, the autoencoders must be pre-trained separately across all segments of all chromosomes in the human genome. This initial pre-training can require months of computation time, depending upon the GPU resources available, whereas HMM-based imputation does not require any pre-training after initial parameters are defined. Thus, the HMM-based approach is more flexible to the de-novo use of alternative reference panels – though recent cohorts have revealed scaling limitations. On the other hand, unlike HMM-based imputation tools, pre-trained autoencoders retain the information learned from pre-training and can be continuously fine-tuned with additional genomes and reference panels as they become available. Thus, once pre-trained, autoencoders may be incrementally upgraded using newly available reference panels.

HMM-based versus autoencoder-based inference runtimes.

We plot the average time and standard error of three imputation replicates. Two hardware configurations were used for the tests: (A) a low-end environment: 16-core Intel Xeon CPU (E5-2640 v2 2.00 GHz), 250 GB RAM, and one GPU (NVIDIA GTX 1080); (B) a high-end environment: 24-Core AMD CPU (EPYC 7352 2.3 GHz), 250 GB RAM, using one NVIDIA A100 GPU.

Discussion

Artificial neural network-based data mining techniques are revolutionizing biomedical informatics and analytics (Dias and Torkamani, 2019; Jumper et al., 2021). Here, we have demonstrated the potential for these techniques to execute a fundamental analytical task in population genetics, genotype imputation, producing superior results in a computational efficient and portable framework. The trained autoencoders can be transferred easily, and execute their functions rapidly, even in modest computing environments, obviating the need to transfer private genotype data to external imputation servers or services. Furthermore, our fully trained autoencoders robustly surpass the performance of all modern HMM-based imputation approaches across all tested independent datasets, genotyping array marker sets, minor allele frequency spectra, and diverse ancestry groups. This superiority was most apparent in genomic regions with low LD and/or high complexity in their linkage disequilibrium structure.

Superior imputation accuracy is expected to improve GWAS power, enable more complete coverage in meta-analyses, and improve causal variant identification through fine-mapping. Moreover, superior imputation accuracy in low LD regions may enable the more accurate interrogation of specific classes of genes under a greater degree of selective pressure and involved in environmental sensing. For example, promoter regions of genes associated with inflammatory immune responses, response to pathogens, environmental sensing, and neurophysiological processes (including sensory perception genes) are often located in regions of low LD (Dias and Torkamani, 2019; Frazer et al., 2007). These known disease-associated biological processes that are critical to interrogate accurately in GWAS. Thus, the autoencoder-based imputation approach both improves statistical power and biological coverage of individual GWAS’ and downstream meta-analyses.

HMM-based imputation tools depend on end-user access to large reference panels or datasets to impute a single genome whereas pre-trained autoencoder models eliminate that dependency. However, further development is required to actualize this approach in practice for broad adoption. Autoencoders must be pre-trained and validated across all segments of the human genome – a computationally expensive task. Here we performed training only for chromosome 22. Autoencoder training is computationally intensive, shifting the computational burden to model trainers, and driving performance gains for end-users. As a result, inference time scales only with the number of variants to be imputed, whereas HMM-based inference time depends on both reference panel and the number of variants to be imputed. This allows for autoencoder-based imputation to extend to millions of genomes but introduces some challenges in the continuous re-training and fine-tuning of the pre-trained models as larger reference panels are made available. In addition, our current encoding approach lacks phasing information and no pre-phasing was performed. Pre-phasing can lead to substantial improvements in imputation accuracy. Future models will need to address the need for phasing and continuous fine-tuning of models for application to modern, ever-growing, genomic datasets.

Ideas and speculation

After expanding this approach across the whole genome, our work will provide a more efficient genotype imputation platform on whole genome scale and thus benefit genomic research especially in contexts where the computational power required for modern HMM-based imputation is not accessible. In addition to the speed, cost, and accuracy benefits, our proposed approach can potentially improve automation for downstream analyses. The autoencoder naturally generates a hidden encoding with latent features representative of the original data. This latent representation of the original data acts as an automatic feature extraction and dimensionality reduction technique for downstream tasks such as genetic risk prediction. Moreover, the autoencoder-based imputation approach only requires a reference panel during training – only the neural network needs to be distributed for implementation. Thus, the neural network is portable and avoids privacy issues associated with standard statistical imputation. This privacy-preserving feature will allow developers to deploy real-time data-driven algorithms on personal devices (edge computing). These new features will expand the clinical applications of genomic imputation, as well as its role in preventive healthcare. Another point related to data privacy is that the autoencoders segment the genome, making reconstruction of an individual genome impossible even if reference data were somehow recoverable from the neural networks. Nevertheless, while there are no official data sharing restrictions on deep learning model weights generated from genomic data, future privacy risks may be discovered, necessitating further research into privacy concerns and differential privacy techniques for autoencoder-based genotype imputation.

Data availability

The data that support the findings of this study are available from dbGAP and European Genome-phenome Archive (EGA), but restrictions apply to the availability of these data, which were used under ethics approval for the current study, and so are not openly available to the public. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline (copy archived at swh:1:rev:35d2e292e786ebc41e71f27809dad56b1e1933c4). The python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator, (copy archived at swh:1:rev:e01229e3f245e8bb95b29d4f4f1e547fcff70ae4). Instructions on how to access the unique information on the parameters and hyperparameters of each one of the 256 autoencoders is shared through our source code repository at https://github.com/TorkamaniLab/imputator_inference, (copy archived at swh:1:rev:2fbd203acf8aaf320a520c6374d6f4d57f068a7c). We also shared the pre-trained autoencoders and instructions on how to use them for imputation at https://github.com/TorkamaniLab/imputator_inference. Imputation data format. The imputation results are exported in variant calling format (VCF) containing the imputed genotypes and imputation quality scores in the form of class probabilities for each one of the three possible genotypes (homozygous reference, heterozygous, and homozygous alternate allele). The probabilities can be used for quality control of the imputation results.

The following previously published data sets were used

References

    1. Frazer KA
    2. Ballinger DG
    3. Cox DR
    4. Hinds DA
    5. Stuve LL
    6. Gibbs RA
    7. Belmont JW
    8. Boudreau A
    9. Hardenbol P
    10. Leal SM
    11. Pasternak S
    12. Wheeler DA
    13. Willis TD
    14. Yu F
    15. Yang H
    16. Zeng C
    17. Gao Y
    18. Hu H
    19. Hu W
    20. Li C
    21. Lin W
    22. Liu S
    23. Pan H
    24. Tang X
    25. Wang J
    26. Wang W
    27. Yu J
    28. Zhang B
    29. Zhang Q
    30. Zhao H
    31. Zhao H
    32. Zhou J
    33. Gabriel SB
    34. Barry R
    35. Blumenstiel B
    36. Camargo A
    37. Defelice M
    38. Faggart M
    39. Goyette M
    40. Gupta S
    41. Moore J
    42. Nguyen H
    43. Onofrio RC
    44. Parkin M
    45. Roy J
    46. Stahl E
    47. Winchester E
    48. Ziaugra L
    49. Altshuler D
    50. Shen Y
    51. Yao Z
    52. Huang W
    53. Chu X
    54. He Y
    55. Jin L
    56. Liu Y
    57. Shen Y
    58. Sun W
    59. Wang H
    60. Wang Y
    61. Wang Y
    62. Xiong X
    63. Xu L
    64. Waye MMY
    65. Tsui SKW
    66. Xue H
    67. Wong JTF
    68. Galver LM
    69. Fan JB
    70. Gunderson K
    71. Murray SS
    72. Oliphant AR
    73. Chee MS
    74. Montpetit A
    75. Chagnon F
    76. Ferretti V
    77. Leboeuf M
    78. Olivier JF
    79. Phillips MS
    80. Roumy S
    81. Sallée C
    82. Verner A
    83. Hudson TJ
    84. Kwok PY
    85. Cai D
    86. Koboldt DC
    87. Miller RD
    88. Pawlikowska L
    89. Taillon-Miller P
    90. Xiao M
    91. Tsui LC
    92. Mak W
    93. Song YQ
    94. Tam PKH
    95. Nakamura Y
    96. Kawaguchi T
    97. Kitamoto T
    98. Morizono T
    99. Nagashima A
    100. Ohnishi Y
    101. Sekine A
    102. Tanaka T
    103. Tsunoda T
    104. Deloukas P
    105. Bird CP
    106. Delgado M
    107. Dermitzakis ET
    108. Gwilliam R
    109. Hunt S
    110. Morrison J
    111. Powell D
    112. Stranger BE
    113. Whittaker P
    114. Bentley DR
    115. Daly MJ
    116. de Bakker PIW
    117. Barrett J
    118. Chretien YR
    119. Maller J
    120. McCarroll S
    121. Patterson N
    122. Pe’er I
    123. Price A
    124. Purcell S
    125. Richter DJ
    126. Sabeti P
    127. Saxena R
    128. Schaffner SF
    129. Sham PC
    130. Varilly P
    131. Altshuler D
    132. Stein LD
    133. Krishnan L
    134. Smith AV
    135. Tello-Ruiz MK
    136. Thorisson GA
    137. Chakravarti A
    138. Chen PE
    139. Cutler DJ
    140. Kashuk CS
    141. Lin S
    142. Abecasis GR
    143. Guan W
    144. Li Y
    145. Munro HM
    146. Qin ZS
    147. Thomas DJ
    148. McVean G
    149. Auton A
    150. Bottolo L
    151. Cardin N
    152. Eyheramendy S
    153. Freeman C
    154. Marchini J
    155. Myers S
    156. Spencer C
    157. Stephens M
    158. Donnelly P
    159. Cardon LR
    160. Clarke G
    161. Evans DM
    162. Morris AP
    163. Weir BS
    164. Tsunoda T
    165. Mullikin JC
    166. Sherry ST
    167. Feolo M
    168. Skol A
    169. Zhang H
    170. Zeng C
    171. Zhao H
    172. Matsuda I
    173. Fukushima Y
    174. Macer DR
    175. Suda E
    176. Rotimi CN
    177. Adebamowo CA
    178. Ajayi I
    179. Aniagwu T
    180. Marshall PA
    181. Nkwodimmah C
    182. Royal CDM
    183. Leppert MF
    184. Dixon M
    185. Peiffer A
    186. Qiu R
    187. Kent A
    188. Kato K
    189. Niikawa N
    190. Adewole IF
    191. Knoppers BM
    192. Foster MW
    193. Clayton EW
    194. Watkin J
    195. Gibbs RA
    196. Belmont JW
    197. Muzny D
    198. Nazareth L
    199. Sodergren E
    200. Weinstock GM
    201. Wheeler DA
    202. Yakub I
    203. Gabriel SB
    204. Onofrio RC
    205. Richter DJ
    206. Ziaugra L
    207. Birren BW
    208. Daly MJ
    209. Altshuler D
    210. Wilson RK
    211. Fulton LL
    212. Rogers J
    213. Burton J
    214. Carter NP
    215. Clee CM
    216. Griffiths M
    217. Jones MC
    218. McLay K
    219. Plumb RW
    220. Ross MT
    221. Sims SK
    222. Willey DL
    223. Chen Z
    224. Han H
    225. Kang L
    226. Godbout M
    227. Wallenburg JC
    228. L’Archevêque P
    229. Bellemare G
    230. Saeki K
    231. Wang H
    232. An D
    233. Fu H
    234. Li Q
    235. Wang Z
    236. Wang R
    237. Holden AL
    238. Brooks LD
    239. McEwen JE
    240. Guyer MS
    241. Wang VO
    242. Peterson JL
    243. Shi M
    244. Spiegel J
    245. Sung LM
    246. Zacharia LF
    247. Collins FS
    248. Kennedy K
    249. Jamieson R
    250. Stewart J
    (2007) A second generation human haplotype map of over 3.1 million SNPs
    Nature 449:851–861.
    https://doi.org/10.1038/nature06258
  1. Conference
    1. Islam T
    2. Kim CH
    3. Iwata H
    4. Shimono H
    5. Kimura A
    6. Zaw H
    7. Raghavan C
    8. Leung H
    9. Singh RK
    (2021)
    A Deep Learning Method to Impute Missing Values and Compress Genome-ide Polymorphism Data in Rice In
    Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies. pp. 11–13.
  2. Book
    1. Mao XJ
    2. Shen C
    3. Yang YB
    (2016)
    Advances in Neural Information Processing Systems 29
    New york, USA: Curran Associates, Inc.
    1. Taliun D
    2. Harris DN
    3. Kessler MD
    4. Carlson J
    5. Szpiech ZA
    6. Torres R
    7. Taliun SAG
    8. Corvelo A
    9. Gogarten SM
    10. Kang HM
    11. Pitsillides AN
    12. LeFaive J
    13. Lee S-B
    14. Tian X
    15. Browning BL
    16. Das S
    17. Emde A-K
    18. Clarke WE
    19. Loesch DP
    20. Shetty AC
    21. Blackwell TW
    22. Smith AV
    23. Wong Q
    24. Liu X
    25. Conomos MP
    26. Bobo DM
    27. Aguet F
    28. Albert C
    29. Alonso A
    30. Ardlie KG
    31. Arking DE
    32. Aslibekyan S
    33. Auer PL
    34. Barnard J
    35. Barr RG
    36. Barwick L
    37. Becker LC
    38. Beer RL
    39. Benjamin EJ
    40. Bielak LF
    41. Blangero J
    42. Boehnke M
    43. Bowden DW
    44. Brody JA
    45. Burchard EG
    46. Cade BE
    47. Casella JF
    48. Chalazan B
    49. Chasman DI
    50. Chen Y-DI
    51. Cho MH
    52. Choi SH
    53. Chung MK
    54. Clish CB
    55. Correa A
    56. Curran JE
    57. Custer B
    58. Darbar D
    59. Daya M
    60. de Andrade M
    61. DeMeo DL
    62. Dutcher SK
    63. Ellinor PT
    64. Emery LS
    65. Eng C
    66. Fatkin D
    67. Fingerlin T
    68. Forer L
    69. Fornage M
    70. Franceschini N
    71. Fuchsberger C
    72. Fullerton SM
    73. Germer S
    74. Gladwin MT
    75. Gottlieb DJ
    76. Guo X
    77. Hall ME
    78. He J
    79. Heard-Costa NL
    80. Heckbert SR
    81. Irvin MR
    82. Johnsen JM
    83. Johnson AD
    84. Kaplan R
    85. Kardia SLR
    86. Kelly T
    87. Kelly S
    88. Kenny EE
    89. Kiel DP
    90. Klemmer R
    91. Konkle BA
    92. Kooperberg C
    93. Köttgen A
    94. Lange LA
    95. Lasky-Su J
    96. Levy D
    97. Lin X
    98. Lin K-H
    99. Liu C
    100. Loos RJF
    101. Garman L
    102. Gerszten R
    103. Lubitz SA
    104. Lunetta KL
    105. Mak ACY
    106. Manichaikul A
    107. Manning AK
    108. Mathias RA
    109. McManus DD
    110. McGarvey ST
    111. Meigs JB
    112. Meyers DA
    113. Mikulla JL
    114. Minear MA
    115. Mitchell BD
    116. Mohanty S
    117. Montasser ME
    118. Montgomery C
    119. Morrison AC
    120. Murabito JM
    121. Natale A
    122. Natarajan P
    123. Nelson SC
    124. North KE
    125. O’Connell JR
    126. Palmer ND
    127. Pankratz N
    128. Peloso GM
    129. Peyser PA
    130. Pleiness J
    131. Post WS
    132. Psaty BM
    133. Rao DC
    134. Redline S
    135. Reiner AP
    136. Roden D
    137. Rotter JI
    138. Ruczinski I
    139. Sarnowski C
    140. Schoenherr S
    141. Schwartz DA
    142. Seo J-S
    143. Seshadri S
    144. Sheehan VA
    145. Sheu WH
    146. Shoemaker MB
    147. Smith NL
    148. Smith JA
    149. Sotoodehnia N
    150. Stilp AM
    151. Tang W
    152. Taylor KD
    153. Telen M
    154. Thornton TA
    155. Tracy RP
    156. Van Den Berg DJ
    157. Vasan RS
    158. Viaud-Martinez KA
    159. Vrieze S
    160. Weeks DE
    161. Weir BS
    162. Weiss ST
    163. Weng L-C
    164. Willer CJ
    165. Zhang Y
    166. Zhao X
    167. Arnett DK
    168. Ashley-Koch AE
    169. Barnes KC
    170. Boerwinkle E
    171. Gabriel S
    172. Gibbs R
    173. Rice KM
    174. Rich SS
    175. Silverman EK
    176. Qasba P
    177. Gan W
    178. NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium
    179. Papanicolaou GJ
    180. Nickerson DA
    181. Browning SR
    182. Zody MC
    183. Zöllner S
    184. Wilson JG
    185. Cupples LA
    186. Laurie CC
    187. Jaquish CE
    188. Hernandez RD
    189. O’Connor TD
    190. Abecasis GR
    (2021) Sequencing of 53,831 diverse genomes from the NHLBI topmed program
    Nature 590:290–299.
    https://doi.org/10.1038/s41586-021-03205-y
  3. Book
    1. Xie J
    2. Xu L
    3. Chen E
    (2012)
    Advances in Neural Information Processing Systems 25 (NIPS 2012)
    Newyork, United States: curran associates inc.

Decision letter

  1. Matthew Stephens
    Reviewing Editor; University of Chicago, United States
  2. Molly Przeworski
    Senior Editor; Columbia University, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Rapid, Reference-Free Human Genotype Imputation with Denoising Autoencoders" for consideration by eLife. Your article has been reviewed by 4 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Molly Przeworski as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) More precise technical details need to be provided about the method. Code used to train the method should be made available.

2) Accuracy assessments should be made on a new dataset not used at any point in the training.

3) Comparison with HMM should use a more diverse reference panel and with standard quality controls in place, with separate comparisons for rare vs common variants.

4) The reason for the surprisingly low average accuracy of all methods, compared with

previous studies, needs to be identified and, probably, corrected. (Reported imputation accuracies in previous studies are typically R2>0.8, whereas in this paper reported R2 is mostly in range 0.2-0.6).

5) Software for applying the pre-trained network to impute genotypes needs to be publicly available.

Reviewer #1 (Recommendations for the authors):

Supplemental Figures 12 and 13 – I appreciate the authors using various metrics for comparing imputation accuracies, but I think these figures are a bit confusing because it shows that SNPs with smaller MAF somehow have better imputation accuracy.

Reviewer #2 (Recommendations for the authors):

Two particular concerns for the authors:

1) The code is not yet available. Once it is made available I'll be able to more thoroughly evaluate what the authors have done.

2) As mentioned above, there is a bit of work to do on cleaning up the run time performance evaluations.

Reviewer #3 (Recommendations for the authors):

Besides the points mentioned in the public review, I have the following comments/suggestions:

– While I am somewhat familiar with autoencoders, I found it quite hard to follow the description of "denoising" autoencoders. Part of the problem was that I did not appreciate that these are different from a regular autoencoder until I did a bit more reading. I'm not quite sure how to make the description more accessible to the audience who are not familiar with denoising autoencoders, but I think it would have helped me to have an equation giving the training objective function that is being optimized to supplement Figure 1.

– The description of the HMM methods on p3 is inaccurate. For example, the hidden states are not the "to-be-imputed" variants (a better high-level description would be that the hidden states represent the haplotype in the reference panel that is most closely related to the haplotype being imputed, although different HMMs may have slightly different interpretations for the hidden states) and the genotyped variants are not the "observed states" (they are observed, but they are not the HMM states). In any case it needs rewriting.

– p4 the analogy with single-cell "dropout" is misleading because the zeros in single cell data are not "missing data" in the same sense as the missing genotypes in genotype imputation. (See https://www.nature.com/articles/s41588-021-00873-4 for example).

– p6: although ultimately performance is what matters, the genotype encoding seems a bit weird to me: specifically the way that 0 is used for both missing data and absence of the allele (which are different things!) Is this standard in the denoising autoencoder literature? Also, can you clarify what you mean by "scaled outputs can also be regarded as probabilities" and "This representation maintains the interdepencies among classes". Indeed, I don't see why this last is true: for example, the "true" genotype can never be (0,0), so if one of the pair is 0 the other must be 1, and this interdependency does not seem to be captured by the encoding.

Reviewer #4 (Recommendations for the authors):

Specific comments are given below.

1. The authors trained their models using HRC reference panel, which consists of predominantly European ancestry individuals. They didn't mention the reference panel used in HMM-based imputations, and I assume it was also HRC. As we know, performance of HMM-based imputation depends on the reference panels. There are larger and more diverse reference panels publicly available (e.g. TOPMed freeze 8 reference panel through the TOPMed imputation server). Does AE still have values (whatever training datasets they are using) over the TOPMed imputation server? Evaluations with diverse reference panel is particularly important for at least two reasons: (1) diverse panels are widely used (if not the most widely used); (2) AE relies more on pretrained model that no longer retains individual haplotype level information than HMM-based methods, which is conceptually susceptible to heterogeneity in reference panel. The authors may not have access to individual-level TOPMed data for training their AE models. In that case, they should at least evaluate the performance with the 1000 Genomes Project as reference.

2. The authors didn't mention any post-imputation QC steps, thus it's not clear to me which variants are included in their evaluations/comparisons. As we know, not all markers attempted (that is, markers in the reference panel) can be well imputed. It is essential to have post-imputation QC to filter out poorly imputed markers. The manuscript does not touch this important topic at all, rendering the method practically not useful. Also rather puzzlingly, the R2 showed in all the figures are largely in the range 0.2-0.6, which is inconsistent from what's reported in the literature. For common variants, HMM-based imputation methods have been reported to attain R2 > 0.8 (if not even much higher); and for rare variants, the mode is close to zero. The expected bimodal distribution is not observed. Please clarify. For similar reason, quality assessment should be performed for common and rare variants separately.

3. Have the authors evaluated the computationally identified hotspots? Are they largely consistent with the "true" recombination hotspots? Based on Figure 1E, these hotspots are calculated according to LD. What if the reference panel contains diverse populations? The LD calculated based on heterogeneous individuals would be meaningless.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Rapid, Reference-Free Human Genotype Imputation with Denoising Autoencoders" for further consideration by eLife. Your revised article has been evaluated by two of the original Reviewers, Molly Przeworski (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1. You need to clearly report the (very considerable) training time required for the autoencoder. This relates to comments from both Reviewers which I repeat here:

Reviewer 2: "I don't think the authors are dealing enough with performance issues and the need to share training times for tuning. First, in the revision the authors have edited/added Figure 6 which shows a runtime comparison, however, the methods section description (lines 308-315) does not mention how only prediction time was taken into account, as the authors seem to be explaining in the Response to Reviews document. More importantly, given that software release is mostly a proof of principle- pre-trained models are available for chr22 only -further, genome-wide inference a user would have to (as I understand it): download a reference set, train tiled auto-encoders for the genome on their own hardware, and then do inference on their data of interest. Surely then it's important to profile the performance of this bit of the pipeline for the user."

Reviewer 4. That "the training process is going to take approximately one year to finish." would severely limit the application of the method. If the authors, as developers, do not release pre-trained models, we would be relying on extremely motivated and computationally savvy users to perform the pre-training. Along the line, reporting only the inference speed (i.e., computing time needed to APPLY the pre-trained models) becomes insufficient since the users would need to pre-train the models for all other chromosomes before they can apply the models to their target data.

2. You need to address the original request to compare against HMMs from a diverse reference. This request was made in the original decision, but the response instead focussed on the diversity of the target samples, which is a different issue. (It seems that running the autoencoder with a new diverse reference sample would require considerable new computation, but running the HMMs with a more diverse reference should be relatively straightforward; this does seem to highlight a disadvantage of the long training time of autoencoders if they are to be retrained when new larger reference samples become available.)

This relates to the following comment from Reviewer 4:

For major comment 1, I was not referring to a diverse target but referred to a diverse reference. I know that HRC includes the samples from the 1000 Genomes Project but the aggressive variant filtering and the predominant European ancestry individuals (~30,000 versus ~2,000 non-European from the 1000 Genomes Project) make it a lousy example as a diverse reference.

https://doi.org/10.7554/eLife.75600.sa1

Author response

Essential revisions:

1) More precise technical details need to be provided about the method. Code used to train the method should be made available.

We have made the source code and detailed instructions for use available publicly at Github.

The computational pipeline for autoencoder training and validation is available at:

https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline.

The python script for calculating imputation accuracy is available at: https://github.com/TorkamaniLab/imputation_accuracy_calculator.

2) Accuracy assessments should be made on a new dataset not used at any point in the training.

The MESA, Wellderly, and HGDP datasets are all independent datasets, never used for training, nor model selection. Only HRC was used as reference panel or for training, and ARIC was used for model tuning. We included a statement in the methods clarifying this point.

3) Comparison with HMM should use a more diverse reference panel and with standard quality controls in place, with separate comparisons for rare vs common variants.

Two of our independent testing datasets (MESA, but especially HGDP), are from highly diversified multi-ethnic cohorts. HGDP includes 828 samples from 54 different populations representing all continental populations and including remote populations like Siberia, Oceania, etc. This reference panel is described in more detail in the reference below and likely represents the most diverse human genome dataset available.

Bergström A, et al., Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020 Mar 20;367(6484):eaay5012.

4) The reason for the surprisingly low average accuracy of all methods, compared with

previous studies, needs to be identified and, probably, corrected. (Reported imputation accuracies in previous studies are typically R2>0.8, whereas in this paper reported R2 is mostly in range 0.2-0.6).

We show average accuracy of 0.2-0.6 in Table 4, but that is the average aggregate R2 per variant across all variants (no minor allele frequency filter or binning applied). The reviewer points that the accuracy should be R2>0.8, but this R2>0.8 expectation applies only to common variants (allele frequency >1%). Our results reflect this level of accuracy (R2>0.8) for common variants as plotted in Figure 4. The aggregate accuracy we report in Table 4 is substantially lower because the vast majority of genetic variants are rare, falling below the 1% allele frequency threshold and pulling down the overall average. The parent HRC reference publication below, as well as other supporting references we provide, demonstrate this is expected and conforms with our results.

References:

Rubinacci S, Delaneau O, Marchini J. Genotype imputation using the positional burrows wheeler transform. PLoS genetics. 2020 Nov 16;16(11):e1009049.

Our curves in Figure 4 replicate this graded accuracy across allele frequencies with higher accruacy above 1% allele frequency. Other examples:

McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K, Luo Y. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics. 2016 Oct;48(10):1279.

Vergara C, Parker MM, Franco L, Cho MH, Valencia-Duarte AV, Beaty TH, Duggal P. Genotype imputation performance of three reference panels using African ancestry individuals. Human genetics. 2018 Apr;137(4):281-92.

5) Software for applying the pre-trained network to impute genotypes needs to be publicly available.

We now share the pre-trained autoencoders (including model weights and inference source code), as well as instructions on how to use them for imputation.

These resources are publicly available at https://github.com/TorkamaniLab/imputator_inference. We have added this information to the Data Availability subsection of the Methods.

Reviewer #1 (Recommendations for the authors):

Supplemental Figures 12 and 13 – I appreciate the authors using various metrics for comparing imputation accuracies, but I think these figures are a bit confusing because it shows that SNPs with smaller MAF somehow have better imputation accuracy.

Yes, this is a bit of an accuracy paradox: since most of the genotypes in rare variants are homozygous reference, concordances and F-scores are inflated due to the overwhelming abundance of the negative class. In order to address this source of confusion, we have added a note to the captions of these figures explaining this phenomenon and pointing the reader to Figure S14 (and R-squared in general) for a more accurate picture of balanced accuracy.

Reviewer #2 (Recommendations for the authors):

Two particular concerns for the authors:

1) The code is not yet available. Once it is made available I'll be able to more thoroughly evaluate what the authors have done.

We made both pre-trained models and the source code available now.

2) As mentioned above, there is a bit of work to do on cleaning up the run time performance evaluations.

We have clarified the major concerns on the performance comparisons, including further clarification that all the testing results shown are from independent datasets not used for training, nor tuning; and that the runtime comparisons extracted from HMM refer to the prediction step only, excluding the time for I/O and HMM iterations.

Reviewer #3 (Recommendations for the authors):

Besides the points mentioned in the public review, I have the following comments/suggestions:

– While I am somewhat familiar with autoencoders, I found it quite hard to follow the description of "denoising" autoencoders. Part of the problem was that I did not appreciate that these are different from a regular autoencoder until I did a bit more reading. I'm not quite sure how to make the description more accessible to the audience who are not familiar with denoising autoencoders, but I think it would have helped me to have an equation giving the training objective function that is being optimized to supplement Figure 1.

We edited/included a couple of sentences in the introduction and methods, defining what is the most basic difference between typical autoencoders and the denoising autoencoder. We hope this helps to further clarify this issue.

– The description of the HMM methods on p3 is inaccurate. For example, the hidden states are not the "to-be-imputed" variants (a better high-level description would be that the hidden states represent the haplotype in the reference panel that is most closely related to the haplotype being imputed, although different HMMs may have slightly different interpretations for the hidden states) and the genotyped variants are not the "observed states" (they are observed, but they are not the HMM states). In any case it needs rewriting.

Thanks for catching this. We corrected the HMM definition in the introduction.

– p4 the analogy with single-cell "dropout" is misleading because the zeros in single cell data are not "missing data" in the same sense as the missing genotypes in genotype imputation. (See https://www.nature.com/articles/s41588-021-00873-4 for example).

While there is some subtlety to this point, single-cell “dropout” includes both molecules that actually do not exist in the cell at the time of sampling as well as molecules that do exist but are not captured by the single-cell assay. Our own experience with scRNAseq data confirms this, taking the same library and sequencing it more deeply (or applying other molecular techniques to deplete abundant species) leads to the discovery of a larger number of genes per cell.

These are true missing data.

Despite this, we agree the scRNAseq example is complex and perhaps confusing – a simpler example more true to the nature of missing data in genotype data would be more appropriate. Therefore we’ve replaced those examples with the more classic image applications.

– p6: although ultimately performance is what matters, the genotype encoding seems a bit weird to me: specifically the way that 0 is used for both missing data and absence of the allele (which are different things!) Is this standard in the denoising autoencoder literature? Also, can you clarify what you mean by "scaled outputs can also be regarded as probabilities" and "This representation maintains the interdepencies among classes". Indeed, I don't see why this last is true: for example, the "true" genotype can never be (0,0), so if one of the pair is 0 the other must be 1, and this interdependency does not seem to be captured by the encoding.

We rescale the outputs by the following formula:

AE output example [x’ref, x’alt]: [0.55,0.90]

x′[ref,ref] = x′ref ∗ (1 − x′alt) = 0.55 ∗ 0.10 = 0.055

x′[alt,alt] = (1 − x′ref) ∗ x′alt = 0.45 ∗ 0.90 = 0.405

x′[ref,alt] = x′ref ∗ x′alt = 0.55 ∗ 0.90 = 0.495

where x’ is the reconstructed outputs from the autoencoders for reference (ref) and alternate (alt) alleles.

Next, we convert the rescaled outputs into probabilities ( P[ref,ref], P[alt,alt], P[alt,alt] ) by using a softmax function:P[ref ,ref]=ex[ref ,ref]ex[ref ,ref]+ex[ref ,alt]+ex[alt.alt]=0.252P[ref,alt]=ex[ref,alt]ex[ref,ref]+ex[ref,alt]+ex[alt,alt]=0.357P[alt,alt]=ex[ref,alt]ex[ref,ref]+ex[ref,alt]+ex[alt,alt]=0.391 Corrupting the data can be done by using any randomly chosen value as replacement for the true data. Either 0.0 or 0.5 is typically utilized for representing fixed noise or masking noise. Even in an image, RGB 0, 0, 0 is often used but is not actually “missing” but black. True “missing” is not used in autoencoders. We didn’t find any significant differences by using either of these values for masking.

We have removed the statement about interdependencies to avoid confusion. However, as a point of interest for the reviewer – while we have yet to test this point – 0,0 genotypes may actually be possible – when the position containing the SNP locus is deleted in the genome. This gets a bit complex and we’re not absolutely sure whether we would be able to capture this sort of event with our encoding, so we’ve simply removed the statement.

Reviewer #4 (Recommendations for the authors):

Specific comments are given below.

1. The authors trained their models using HRC reference panel, which consists of predominantly European ancestry individuals. They didn't mention the reference panel used in HMM-based imputations, and I assume it was also HRC. As we know, performance of HMM-based imputation depends on the reference panels. There are larger and more diverse reference panels publicly available (e.g. TOPMed freeze 8 reference panel through the TOPMed imputation server). Does AE still have values (whatever training datasets they are using) over the TOPMed imputation server?

We included further clarification in the methods section:

“We utilized HRC as reference panel for the HMM-based imputation tools, which is the same dataset used for training the autoencoders, and we applied the same quality control standards for both HMM-based and autoencoder-based imputation.”

Until recently, HRC has been the standard for imputation (and is still often used). Both the HMM-method and our autoencoders would improve with the use of an expanded reference. The fairest comparison is to use the same reference for both approaches.

Evaluations with diverse reference panel is particularly important for at least two reasons: (1) diverse panels are widely used (if not the most widely used); (2) AE relies more on pretrained model that no longer retains individual haplotype level information than HMM-based methods, which is conceptually susceptible to heterogeneity in reference panel. The authors may not have access to individual-level TOPMed data for training their AE models. In that case, they should at least evaluate the performance with the 1000 Genomes Project as reference.

We have evaluated our models on data that is significantly more diverse than the 1000 Genomes (note that 1000 Genomes is also not independent from HRC). Two of our independent testing datasets (MESA, but especially HGDP), are from highly diversified multi-ethnic cohorts. HGDP includes 828 samples from 54 different populations representing all continental populations and including remote populations like Siberia, Oceania, etc. This reference panel is described in more detail in the reference below and likely represents the most diverse human genome dataset available – more diverse (though smaller) than TopMed.

Bergström A, et al., Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020 Mar 20;367(6484):eaay5012.

2. The authors didn't mention any post-imputation QC steps, thus it's not clear to me which variants are included in their evaluations/comparisons. As we know, not all markers attempted (that is, markers in the reference panel) can be well imputed. It is essential to have post-imputation QC to filter out poorly imputed markers. The manuscript does not touch this important topic at all, rendering the method practically not useful. Also rather puzzlingly, the R2 showed in all the figures are largely in the range 0.2-0.6, which is inconsistent from what's reported in the literature. For common variants, HMM-based imputation methods have been reported to attain R2 > 0.8 (if not even much higher); and for rare variants, the mode is close to zero. The expected bimodal distribution is not observed. Please clarify. For similar reason, quality assessment should be performed for common and rare variants separately.

We included further clarification in the methods explaining that no post-imputation QC was applied. However, we bin performance by allele frequency which relates to imputability. And the output of our model includes genotype probabilities which can be used as a measure of post-imputation QC for end-users. This point is now explained in the Genotype Encoding section of Methods, and in the documentation of the imputation software repository available at:

https://github.com/TorkamaniLab/imputator_inference

As for overall accuracy, we show average accuracy of 0.2-0.6 in Table 4, but that is the average aggregate R2 per variant across all variants (no MAF filter or binning applied). The reviewer points that the accuracy should be R2>0.8, but this R2>0.8 expectation applies only to common variants (allele frequency >1%). Our results reflect this level of accuracy (r2>0.8) for common variants as plotted in Figure 4. The aggregate accuracy we report in Table 4 is substantially lower because the vast majority of genetic variants are rare, falling below the 1% allele frequency threshold and pulling down the average. The parent HRC reference, as well as other supporting references we provide, demonstrate this is expected and conforms with our results.

References:

Rubinacci S, Delaneau O, Marchini J. Genotype imputation using the positional burrows wheeler transform. PLoS genetics. 2020 Nov 16;16(11):e1009049.

McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K, Luo Y. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics. 2016 Oct;48(10):1279.

Vergara C, Parker MM, Franco L, Cho MH, Valencia-Duarte AV, Beaty TH, Duggal P. Genotype imputation performance of three reference panels using African ancestry individuals. Human genetics. 2018 Apr;137(4):281-92.

3. Have the authors evaluated the computationally identified hotspots? Are they largely consistent with the "true" recombination hotspots? Based on Figure 1E, these hotspots are calculated according to LD. What if the reference panel contains diverse populations? The LD calculated based on heterogeneous individuals would be meaningless.

Although we have not manually verified this across the whole genome, we have done several comparisons across regions of known recombination hotspots versus our LD based tiling method. One of the examples is shown in Author response image 1, where the LD-based tile boundaries identified by our method are defined by the vertical red lines, and the recombination hotspots defined by PRDM9 binding sites are represented in green.

Author response image 1

Similarly, we have compared our LD-based tiling boundaries of HRC versus 1000G. The plot (Author response image 2) shows an example where the HRC scores and 1000G scores are plotted and overlaid for some example regions from chromosome 10. The black line is HRC LD scores, the brown plot is the 1000G LD scores. The blue and red vertical lines are the genomic segment boundaries found by our method.

Author response image 2

However, we had to further split certain segments that were too large due to GPU memory limitations (see Methods).As expected, the sub-segments resulting from this further split did not match to known recombination sites.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1. You need to clearly report the (very considerable) training time required for the autoencoder. This relates to comments from both Reviewers which I repeat here:

Reviewer 2: "I don't think the authors are dealing enough with performance issues and the need to share training times for tuning. First, in the revision the authors have edited/added Figure 6 which shows a runtime comparison, however, the methods section description (lines 308-315) does not mention how only prediction time was taken into account, as the authors seem to be explaining in the Response to Reviews document. More importantly, given that software release is mostly a proof of principle- pre-trained models are available for chr22 only -further, genome-wide inference a user would have to (as I understand it): download a reference set, train tiled auto-encoders for the genome on their own hardware, and then do inference on their data of interest. Surely then it's important to profile the performance of this bit of the pipeline for the user."

Reviewer 4. That "the training process is going to take approximately one year to finish." would severely limit the application of the method. If the authors, as developers, do not release pre-trained models, we would be relying on extremely motivated and computationally savvy users to perform the pre-training. Along the line, reporting only the inference speed (i.e., computing time needed to APPLY the pre-trained models) becomes insufficient since the users would need to pre-train the models for all other chromosomes before they can apply the models to their target data.

We have now expanded the description of the implications of the inference time at the end of the Results section with a discussion of the considerable training time required to generate pre-trained models that achieve our reported inference time. We contrast this with HMM-based imputation tools, statin that they are more easily adaptable to the de-novo application of novel reference panels (lines 541 – 551).

2. You need to address the original request to compare against HMMs from a diverse reference. This request was made in the original decision, but the response instead focussed on the diversity of the target samples, which is a different issue. (It seems that running the autoencoder with a new diverse reference sample would require considerable new computation, but running the HMMs with a more diverse reference should be relatively straightforward; this does seem to highlight a disadvantage of the long training time of autoencoders if they are to be retrained when new larger reference samples become available.)

This relates to the following comment from Reviewer 4:

For major comment 1, I was not referring to a diverse target but referred to a diverse reference. I know that HRC includes the samples from the 1000 Genomes Project but the aggressive variant filtering and the predominant European ancestry individuals (~30,000 versus ~2,000 non-European from the 1000 Genomes Project) make it a lousy example as a diverse reference.

We have acquired the TOPMed reference cohort and performed HMM-based imputation using this reference panel. A description of this reference panel is provided in methods (lines 147 – 160). Imputation accuracy comparisons are provided at the cohort level (line 476 – 485; new Figure 4—figure supplement 3) and broken down by ancestry (lines 521 – 528; new Figure 5—figure supplement 2 and 3). The autoencoder retains superior performance even when the HMM has access to a more diverse cohort except in the case of MESA – which is a component of TOPMed and contains a significant rate of familial relationship to the rest of the TOPMed cohort. Additional minor edits to support the addition of this analysis are made throughout the manuscript.

https://doi.org/10.7554/eLife.75600.sa2

Article and author information

Author details

  1. Raquel Dias

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    3. Department of Microbiology and Cell Science, University of Florida, Gainesville, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  2. Doug Evans

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    Contribution
    Data curation, Software, Visualization, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Shang-Fu Chen

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    Contribution
    Data curation, Software, Validation, Methodology
    Competing interests
    No competing interests declared
  4. Kai-Yu Chen

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    Contribution
    Data curation, Software, Validation
    Competing interests
    No competing interests declared
  5. Salvatore Loguercio

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    Contribution
    Data curation, Formal analysis, Validation, Visualization, Writing – review and editing
    Competing interests
    No competing interests declared
  6. Leslie Chan

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    Contribution
    Software, Validation, Methodology
    Competing interests
    No competing interests declared
  7. Ali Torkamani

    1. Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
    2. Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Project administration, Writing – review and editing
    For correspondence
    atorkama@scripps.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0232-8053

Funding

National Institutes of Health (R01HG010881)

  • Raquel Dias

National Institutes of Health (KL2TR002552)

  • Raquel Dias

National Institutes of Health (U24TR002306)

  • Doug Evans

National Institutes of Health (UL1TR002550)

  • Doug Evans

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work is supported by R01HG010881 to AT, KL2TR002552 to RD, as well as grants U24TR002306 and UL1TR002550. We would like to thank JC Ducom and Lisa Dong from the Scripps High Performance Computing center, as well as Fernanda Foertter, Johnny Israeli, Ohad Mosafi, and Joyjit Daw from NVIDIA for their technical support and collaboration in this project. A portion of this research was conducted using a startup account at the Summit supercomputer from Oak Ridge National Laboratory (ORNL).

Senior Editor

  1. Molly Przeworski, Columbia University, United States

Reviewing Editor

  1. Matthew Stephens, University of Chicago, United States

Publication history

  1. Received: November 16, 2021
  2. Preprint posted: December 2, 2021 (view preprint)
  3. Accepted: September 19, 2022
  4. Accepted Manuscript published: September 23, 2022 (version 1)
  5. Version of Record published: October 12, 2022 (version 2)

Copyright

© 2022, Dias et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 557
    Page views
  • 128
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Raquel Dias
  2. Doug Evans
  3. Shang-Fu Chen
  4. Kai-Yu Chen
  5. Salvatore Loguercio
  6. Leslie Chan
  7. Ali Torkamani
(2022)
Rapid, Reference-Free human genotype imputation with denoising autoencoders
eLife 11:e75600.
https://doi.org/10.7554/eLife.75600

Further reading

    1. Computational and Systems Biology
    2. Neuroscience
    Zhe Chen, Garrett J Blair ... Hugh T Blair
    Tools and Resources

    Epifluorescence miniature microscopes ('miniscopes') are widely used for in vivo calcium imaging of neural population activity. Imaging data is typically collected during a behavioral task and stored for later offline analysis, but emerging techniques for online imaging can support novel closed-loop experiments in which neural population activity is decoded in real time to trigger neurostimulation or sensory feedback. To achieve short feedback latencies, online imaging systems must be optimally designed to maximize computational speed and efficiency while minimizing errors in population decoding. Here we introduce DeCalciOn, an open-source device for real-time imaging and population decoding of in vivo calcium signals that is hardware compatible with all miniscopes that use the UCLA Data Acquisition (DAQ) interface. DeCalciOn performs online motion stabilization, neural enhancement, calcium trace extraction, and decoding of up to 1024 traces per frame at latencies of <50 ms after fluorescence photons arrive at the miniscope image sensor. We show that DeCalciOn can accurately decode the position of rats (n=12) running on a linear track from calcium fluorescence in the hippocampal CA1 layer, and can categorically classify behaviors performed by rats (n=2) during an instrumental task from calcium fluorescence in orbitofrontal cortex (OFC). DeCalciOn achieves high decoding accuracy at short latencies using innovations such as field-programmable gate array (FPGA) hardware for real time image processing and contour-free methods to efficiently extract calcium traces from sensor images. In summary, our system offers an affordable plug-and-play solution for real-time calcium imaging experiments in behaving animals.

    1. Computational and Systems Biology
    2. Immunology and Inflammation
    Anastasia O Smirnova, Anna M Miroshnichenkova ... Alexander Komkov
    Tools and Resources

    High-throughput sequencing of adaptive immune receptor repertoires is a valuable tool for receiving insights in adaptive immunity studies. Several powerful TCR/BCR repertoire reconstruction and analysis methods have been developed in the past decade. However, detecting and correcting the discrepancy between real and experimentally observed lymphocyte clone frequencies is still challenging. Here we discovered a hallmark anomaly in the ratio between read count and clone count-based frequencies of non-functional clonotypes in multiplex PCR-based immune repertoires. Calculating this anomaly, we formulated a quantitative measure of V- and J-genes frequency bias driven by multiplex PCR during library preparation called Over Amplification Rate (OAR). Based on the OAR concept, we developed an original software for multiplex PCR-specific bias evaluation and correction named iROAR: Immune Repertoire Over Amplification Removal (https://github.com/smiranast/iROAR). The iROAR algorithm was successfully tested on previously published TCR repertoires obtained using both 5' RACE (Rapid Amplification of cDNA Ends)-based and multiplex PCR-based approaches and compared with a biological spike-in-based method for PCR bias evaluation. The developed approach can increase the accuracy and consistency of repertoires reconstructed by different methods making them more applicable for comparative analysis.