Figures and data

Theoretical genotype recovery accuracy in relation to sample size, R/N ratio, and MAF.
Simulations were performed for sample sizes of 300, 500, and 1000, with R/N ratios ranging from 0.05 to 1 in increments of 0.05 and MAFs ranging from 0.05 to 0.5. Each point represents the mean accuracy of 30 replications.

Recovery performance in simulations based on 489 European samples from the 1000 Genomes Project.
Genotypic data of 1 million SNPs were used in the simulation. Five replications were conducted for each R/N ratio to obtain the average recovery accuracy of each SNP. (A) depicts simulations with no causal SNPs (heritability = 0). (B) shows the mean accuracy of each MAF bin, contrasting causal and non-causal SNPs at heritability = 0.5. (C) presents the mean accuracy across various MAF bins, R/N ratios, and heritability levels. In simulations with heritability > 0, 10% of randomly selected SNPs were designated as causal. Error bars represent standard deviations.

Recovery application in GTEx.
The algorithm was applied to cis-region GWAS summary statistics for 20,059 genes from GTEx v7 whole blood tissue, with open-access phenotype and covariate data. Because cis-regions overlap, some variants had GWAS data for multiple traits (genes). (A) shows the distribution of variants by MAF and by the number of traits (P) divided by the sample size (N = 369). (B) illustrates the recovery accuracy for 17,568 variants (MAF ≤ 0.1, P/N > 0.2) with available true genotypes. The vertical and horizontal lines represent x = 0.05 and y = 0.25, respectively. (C) shows the genomic positions (GRCh37) of 12,840 variants with MAF ≤ 0.05 and P/N > 0.25.
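The variant selection used in panels (B) and (C) can be sketched as a simple threshold filter on MAF and P/N. This is a minimal illustration only: the function name `select_variants`, the tuple layout, and the example variants are hypothetical and not taken from the GTEx analysis; only N = 369 and the two threshold pairs come from the caption.

```python
N_SAMPLES = 369  # GTEx v7 whole blood sample size stated in the caption


def select_variants(variants, maf_max, pn_min, n_samples=N_SAMPLES):
    """Return IDs of variants with MAF <= maf_max and P/N > pn_min.

    `variants` is an iterable of (variant_id, maf, n_traits) tuples,
    where n_traits is P, the number of traits with GWAS data.
    """
    return [
        vid
        for vid, maf, n_traits in variants
        if maf <= maf_max and n_traits / n_samples > pn_min
    ]


# Hypothetical example variants (not real GTEx data).
example = [
    ("v1", 0.03, 100),  # P/N ~ 0.271
    ("v2", 0.08, 80),   # P/N ~ 0.217
    ("v3", 0.20, 150),  # MAF too high for either panel
    ("v4", 0.04, 50),   # P/N ~ 0.136, too few traits
]

panel_b = select_variants(example, maf_max=0.1, pn_min=0.2)
panel_c = select_variants(example, maf_max=0.05, pn_min=0.25)
```

With N = 369, the P/N > 0.2 threshold corresponds to at least 74 traits per variant, and P/N > 0.25 to at least 93.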

Minimal R/N ratio predicted by MAF for sample identification.
The boundary curve separating recovery accuracy > 0.99 from accuracy ≤ 0.99 was fitted based on theoretical simulations of 1,000 samples, as shown in Figure 1. The model for this boundary is described by the equation $R/N = \frac{2}{1 + e^{b \times \mathrm{MAF}}} - 1$, where b is the parameter to be estimated. (A) shows the fitted curve (red line) based on points at the lower boundary of accuracy > 0.99. (B) shows the mean sample identification rate across 30 replications, based on 1,000 randomly selected common SNPs (MAF > 5%), along with the corresponding R/N cutoff. For a given R/N cutoff, only SNPs with predicted accuracy > 0.99 are retained for sample identification in populations from the 1000 Genomes Project (EUR: European, AFR: African, AMR: Admixed American, EAS: East Asian, SAS: South Asian). Shaded areas are 95% confidence intervals.
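The boundary model and the SNP retention step can be sketched as follows. This is an illustration under stated assumptions, not the fitting code used for the figure: the value of `B_HYPOTHETICAL` is a placeholder (the fitted b is not given in this caption), the function names are invented here, and the retention rule (keep a SNP when its predicted minimal R/N does not exceed the cutoff) is one reading of the caption. Note that the expression is algebraically equal to tanh(−b × MAF / 2), so it stays between 0 and 1 only when b is non-positive.

```python
import math

# Placeholder only: the fitted value of b is not stated in the caption.
B_HYPOTHETICAL = -5.0


def min_rn_ratio(maf, b):
    """Boundary model from the caption: R/N = 2 / (1 + exp(b * MAF)) - 1."""
    return 2.0 / (1.0 + math.exp(b * maf)) - 1.0


def retained_snps(mafs, rn_cutoff, b):
    """Indices of SNPs whose predicted minimal R/N is at or below the cutoff.

    Assumed reading of the caption: a SNP is predicted to reach
    accuracy > 0.99 when the available R/N meets its boundary value.
    """
    return [i for i, maf in enumerate(mafs) if min_rn_ratio(maf, b) <= rn_cutoff]


# Hypothetical MAFs for four common SNPs.
mafs = [0.05, 0.10, 0.30, 0.50]
kept = retained_snps(mafs, rn_cutoff=0.3, b=B_HYPOTHETICAL)
```

With a negative b, the required R/N rises with MAF, so a fixed cutoff retains the lower-MAF SNPs first.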