Genomic privacy risks in GWAS summary statistics

  1. Biostatistics Group, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
  2. Center for Intelligent Medicine Research, Greater Bay Area Institute of Precision Medicine (Guangzhou), State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
  3. Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
  4. Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Alexandre Fournier-Level
    La Trobe University, Bundoora, Australia
  • Senior Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public review):

Summary:

The authors aim to demonstrate that GWAS summary statistics, previously considered safe for open sharing, can, under certain conditions, be used to recover individual-level genotypes when combined with large numbers of high-dimensional phenotypes. By reformulating the GWAS linear model as a system of linear programming constraints, they identify a critical phenotype-to-sample size ratio (R/N) above which genotype reconstruction becomes theoretically feasible.
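The role of the R/N ratio can be illustrated with a counting argument: each phenotype contributes one linear equation in the N unknown genotypes of a variant, so R ≥ N independent, noiseless phenotypes can pin the genotype vector down. The sketch below is a simplified illustration of that argument, not the authors' linear-programming implementation. It assumes that the cross-products between the genotype vector and each phenotype can be derived from the published summary statistics, that the phenotypes are noiseless and linearly independent, and that R = N. All function and variable names are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def simulate_recovery(n_samples=8, seed=1):
    """Simulate one variant with R = N phenotypes and try to recover it."""
    rng = random.Random(seed)
    # True allelic dosages (0/1/2) for one variant across N individuals.
    genotypes = [rng.choice([0, 1, 2]) for _ in range(n_samples)]
    # phenotypes[r][i] is phenotype r measured in individual i (idealized: no noise).
    phenotypes = [[rng.gauss(0, 1) for _ in range(n_samples)]
                  for _ in range(n_samples)]
    # Assumption: these cross-products are obtainable from summary statistics.
    cross_products = [sum(y * x for y, x in zip(row, genotypes))
                      for row in phenotypes]
    estimate = solve_linear_system(phenotypes, cross_products)
    # Snap each estimate to the nearest valid dosage.
    recovered = [min(2, max(0, round(v))) for v in estimate]
    return genotypes, recovered


genotypes, recovered = simulate_recovery()
print(recovered == genotypes)
```

With fewer phenotypes than individuals the system is underdetermined, which is where additional constraints such as the 0 to 2 dosage bounds of a linear program would have to do the work.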

Strengths:

There is conceptual originality and mathematical clarity. The authors establish a fundamental quantitative relationship between data dimensionality and privacy leakage and validate their theory through well-designed simulations and application to the GTEx dataset. The derivation is rigorous, the implementation reproducible, and the work provides a formal framework for assessing privacy risks in genomic research.

Weaknesses:

The study relies on the simplifying assumptions that phenotypes are independent, which does not hold in practice, and that they are measured without noise. Real-world data are highly correlated at multiple levels, not only among genotypes but also across multi-omics layers, so these assumptions may overstate recovery potential. The empirical evidence, while illustrative, is limited to small-scale data and idealized conditions; thus, the full practical impact remains to be demonstrated. The GTEx analysis used only whole blood eQTL data from 369 individuals, which cannot capture the complexity, sample heterogeneity, or cross-tissue dependencies typical of biobank-scale studies.

Reviewer #2 (Public review):

Summary:

This study focuses on the genomic privacy risks associated with Genome-Wide Association Study (GWAS) summary statistics, employing a three-tiered demonstration framework of "theoretical derivation - simulation experiments - real-data validation". The research finds that when GWAS summary statistics are combined with high-dimensional phenotypic data, genotype recovery and individual re-identification can be achieved using linear programming methods. It further identifies key influencing factors such as the effective phenotype-to-sample size ratio (R/N) and minor allele frequency (MAF). These findings provide a practical reference for improving data governance policies in genomic research and have real-world significance.

Strengths:

This study integrates theoretical analysis, simulation validation, and the application of real-world datasets to construct a comprehensive research framework, which is conducive to understanding and mitigating the risk of private information leakage in genomic research.

Weaknesses:

(1) Limited scope of variant types covered:

The analysis is conducted solely on Single Nucleotide Polymorphisms (SNPs), omitting other crucial genomic variant types such as Copy Number Variations (CNVs), Insertions/Deletions (InDels), and chromosomal translocations/inversions. From a genomic structure perspective, variants like CNVs and InDels are also core components of individual genetic characteristics, and in some disease-related studies, association signals for these variants can be even more significant than those for SNPs. From the perspective of privacy risk logic, the genotypes of these variants (e.g., copy number for CNVs, base insertion/deletion status for InDels) can also be quantified and could theoretically be inferred backwards using the combination of "summary statistics + high-dimensional phenotypes". Their privacy leakage risks might differ from those of SNPs (for instance, rare CNVs might be more easily re-identified due to higher genetic specificity).

(2) Bias in data applicability scope:

Both the simulation experiments and real-data validation in the study primarily rely on European population samples (e.g., 489 European samples from the 1000 Genomes Project; the genetic background of whole blood tissue samples from the GTEx project is not explicitly mentioned regarding non-European proportions). It only briefly notes a higher risk for African populations in the individual re-identification risk assessment, without conducting systematic analyses for other populations, such as East Asian, South Asian, or admixed American populations. Significant differences in genetic structure (e.g., MAF distribution, linkage disequilibrium patterns) exist across different populations. This may result in the R/N threshold and the relationship between MAF and recovery accuracy identified in the study not being fully applicable to other populations.

Hence, addressing the aforementioned issues through supplementary work would enhance the study's scientific rigor and application value, potentially providing more comprehensive theoretical and technical support for "privacy protection" in genomic data sharing.

Author response:

Reviewer #1 (Public Review):

Summary:

The authors aim to demonstrate that GWAS summary statistics, previously considered safe for open sharing, can, under certain conditions, be used to recover individual-level genotypes when combined with large numbers of high-dimensional phenotypes. By reformulating the GWAS linear model as a system of linear programming constraints, they identify a critical phenotype-to-sample size ratio (R/N) above which genotype reconstruction becomes theoretically feasible.

Strengths:

There is conceptual originality and mathematical clarity. The authors establish a fundamental quantitative relationship between data dimensionality and privacy leakage and validate their theory through well-designed simulations and application to the GTEx dataset. The derivation is rigorous, the implementation reproducible, and the work provides a formal framework for assessing privacy risks in genomic research.

We thank the reviewer for the positive assessment of our work’s conceptual originality, mathematical rigor, and reproducible implementation.

Weaknesses:

The study relies on the simplifying assumptions that phenotypes are independent, which does not hold in practice, and that they are measured without noise. Real-world data are highly correlated at multiple levels, not only among genotypes but also across multi-omics layers, so these assumptions may overstate recovery potential. The empirical evidence, while illustrative, is limited to small-scale data and idealized conditions; thus, the full practical impact remains to be demonstrated. The GTEx analysis used only whole blood eQTL data from 369 individuals, which cannot capture the complexity, sample heterogeneity, or cross-tissue dependencies typical of biobank-scale studies.

We recognize the concern regarding the independence and noiselessness assumptions in our framework. While assuming independent, noiseless phenotypes represents an idealized scenario, it allows us to clearly demonstrate the conceptual potential of our framework. The GTEx whole blood analysis is intended as a proof-of-concept, illustrating feasibility rather than capturing full biological complexity. In the revised manuscript, we will clarify these assumptions, emphasize that practical reconstruction accuracy may be lower in correlated and noisy real-world data, and expand empirical validation to multiple GTEx tissues and independent cohorts to demonstrate robustness under more realistic conditions.

Reviewer #2 (Public Review):

Summary:

This study focuses on the genomic privacy risks associated with Genome-Wide Association Study (GWAS) summary statistics, employing a three-tiered demonstration framework of “theoretical derivation - simulation experiments - real-data validation”. The research finds that when GWAS summary statistics are combined with high-dimensional phenotypic data, genotype recovery and individual re-identification can be achieved using linear programming methods. It further identifies key influencing factors such as the effective phenotype-to-sample size ratio (R/N) and minor allele frequency (MAF). These findings provide a practical reference for improving data governance policies in genomic research and have real-world significance.

Strengths:

This study integrates theoretical analysis, simulation validation, and the application of real-world datasets to construct a comprehensive research framework, which is conducive to understanding and mitigating the risk of private information leakage in genomic research.

We are glad the reviewer values our integration of theory, simulation, and real data.

Weaknesses:

(1) Limited scope of variant types covered:

The analysis is conducted solely on Single Nucleotide Polymorphisms (SNPs), omitting other crucial genomic variant types such as Copy Number Variations (CNVs), Insertions/Deletions (InDels), and chromosomal translocations/inversions. From a genomic structure perspective, variants like CNVs and InDels are also core components of individual genetic characteristics, and in some disease-related studies, association signals for these variants can be even more significant than those for SNPs. From the perspective of privacy risk logic, the genotypes of these variants (e.g., copy number for CNVs, base insertion/deletion status for InDels) can also be quantified and could theoretically be inferred backwards using the combination of “summary statistics + high-dimensional phenotypes”. Their privacy leakage risks might differ from those of SNPs (for instance, rare CNVs might be more easily re-identified due to higher genetic specificity).

This comment raises an important point regarding variant types beyond SNPs. Our mathematical framework is not inherently restricted to SNPs. It is broadly applicable to any genetic variant that can be represented numerically, e.g., allelic dosage (0/1/2), copy number counts for CNVs, or presence/absence indicators for InDels. Conceptually, CNVs, InDels, and other structural variants can be incorporated in the same way as SNPs.
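A minimal sketch of such a numeric representation follows. It only illustrates the three encodings named above; the function name, the input conventions for each variant type, and the example calls are hypothetical and are not taken from the manuscript.

```python
def encode_variant(kind, call):
    """Map one variant call to a single number (hypothetical conventions)."""
    if kind == "snp":
        # call is a pair of alleles, each 0 (reference) or 1 (alternate);
        # the sum is the allelic dosage 0, 1, or 2.
        return sum(call)
    if kind == "cnv":
        # call is the integer copy number, which may exceed 2.
        return int(call)
    if kind == "indel":
        # call is truthy when the insertion/deletion is present.
        return 1 if call else 0
    raise ValueError(f"unknown variant kind: {kind!r}")


# Made-up example calls, one per line of the comment above.
calls = [("snp", (0, 1)), ("snp", (1, 1)), ("cnv", 3),
         ("indel", True), ("indel", False)]
encoded = [encode_variant(kind, call) for kind, call in calls]
print(encoded)  # prints [1, 2, 3, 1, 0]
```

Once every variant is a number per individual, the same linear constraints apply; only the admissible range of values differs by variant type.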

The main limitation arises from the current availability of GWAS summary statistics for these non-SNP variant types (e.g., CNV dosages ≥ 3), which are still relatively scarce. As a result, empirically evaluating our framework on these variant classes would be challenging. In the revision, we will explicitly emphasize the general applicability of our framework to diverse genetic variants while clearly noting this practical limitation. We also plan to include simulations to investigate the recovery accuracy associated with CNVs and InDels, which will further demonstrate the extensibility of our approach. It should be noted, however, that leaking genotypic data of ordinary SNPs already raises concerns, regardless of other types of genetic variants.

(2) Bias in data applicability scope:

Both the simulation experiments and real-data validation in the study primarily rely on European population samples (e.g., 489 European samples from the 1000 Genomes Project; the genetic background of whole blood tissue samples from the GTEx project is not explicitly mentioned regarding non-European proportions). It only briefly notes a higher risk for African populations in the individual re-identification risk assessment, without conducting systematic analyses for other populations, such as East Asian, South Asian, or admixed American populations. Significant differences in genetic structure (e.g., MAF distribution, linkage disequilibrium patterns) exist across different populations. This may result in the R/N threshold and the relationship between MAF and recovery accuracy identified in the study not being fully applicable to other populations.

Hence, addressing the aforementioned issues through supplementary work would enhance the study’s scientific rigor and application value, potentially providing more comprehensive theoretical and technical support for “privacy protection” in genomic data sharing.

We acknowledge this valid concern regarding the generalizability of our findings. Our analysis already identifies MAF as a key factor influencing recovery accuracy, which begins to address population-specific genetic differences. Importantly, because our reconstruction method treats each variant independently, its success does not rely on population-specific LD patterns. The core determinant of feasibility is the ratio of phenotypic dimensions to sample size (R/N), a relationship we expect to hold across populations.

Nevertheless, we agree that further validation across diverse ancestries would be helpful. In the revised manuscript, we will try to include additional cohorts as extended validation analyses.
