Genealogy: The challenges of maintaining genetic privacy
The direct-to-consumer genetic testing industry has grown rapidly in the past few years, to the extent that the companies offering such tests now hold a large proportion of all the human genetic data ever generated (Regalado, 2019). A common reason why someone might undergo genetic testing is to discover relatives, either within the database of the company that performed the test, or via one of a number of third-party services that allow users to upload genomes generated by other labs. Two new studies demonstrate that it may be possible for a user to obtain genomic data without permission from some databases (Edge and Coop, 2020; Ney et al., 2020).
In general, when a user uploads their genome to a third-party service, the service searches its database for genomes that have segments that are identical or nearly identical to segments of the user's genome. The number of such identical-by-state (IBS) segments, and the length of these segments, both increase with the closeness of the relationship between the user and the person (or persons) in the database. The minimum length of a segment is typically around a few millions of base pairs.
To see how a user could access data they should not be able to access, suppose that Alice uploads her genome and finds that she is related to Bob. If the testing service gives Alice details about the IBS segments she shares with Bob (such as the location of these segments in the genome), then Alice will have obtained a certain amount of genomic information about Bob. Now, two independent groups – Michael Edge and Graham Coop of the University of California, Davis writing in eLife (Edge and Coop, 2020), and Peter Ney, Luis Ceze, and Tadayoshi Kohno of the University of Washington in work to be presented at the NDSS symposium in San Diego in February (Ney et al., 2020) – report how services that give users certain details about IBS segments could be subject to attacks that allow an 'adversary' to obtain potentially significant amounts of genomic information that they should not have permission to access (Edge and Coop, 2020; Ney et al., 2020).
The key insight is that an adversary does not have to upload their own genome, and that they can instead upload multiple genomes, including genomes that are in the public domain. This approach is called 'IBS tiling'. For each IBS segment that is reported, the adversary gains a small amount of genetic information about a 'target' genome in the database. However, by uploading a large number of genomes, it is possible to obtain large amounts of genetic information (Figure 1A). Using simulations, Edge and Coop showed that with about 900 public genomes from the 1000 Genomes Project, IBS tiling is expected to reveal about 60% of the genome of a European target. A related approach developed by Edge and Coop, named 'IBS probing', allows the adversary to learn if the target's genome contains a specific disease allele (Larkin, 2017; Figure 1B).
The risk of IBS tiling and IBS probing is limited in services that only report IBS segments to users who are closely related. Thus, as genomes from public databases will only rarely be close relatives of the target, this will limit the effective number of genomes available for tiling. However, IBS tiling could yield significant amounts of information on targets from founder populations in which the rate of genomic sharing is high, such as Ashkenazi Jews or Finns (Carmi et al., 2014; Martin et al., 2018). Direct-to-consumer genetic testing companies and third-party services could eliminate this risk by not showing users where IBS segments are located within the genome.
The most popular third-party service, GEDmatch, has over a million users, and was recently acquired by the forensics genomics company Verogen (Husbands, 2019). GEDmatch puts very few restrictions on users and is vulnerable to IBS tiling. GEDmatch is routinely used by police forces to investigate crime (Erlich et al., 2018; Kennett, 2019), though (as of recently) they can only search the genomes of users who have opted in to give law-enforcement agencies access to their genetic information.
When comparing genomes, GEDmatch uses a simple algorithm, reporting a region of the genome as an IBS segment so long as the user and the target do not have conflicting homozygous genotypes: that is, if the user genome is, say, AA at a given site, GEDmatch will return an IBS segment if the target is AA or AB at that site, but not if the target is BB (subject to the segments being longer than a certain minimum length, as described above). GEDmatch also provides users with an image, indicating, for each site in the genome, whether the genotypes of the user and the target fully match, partly match, or do not match.
Ney et al. recently demonstrated that it is possible to extract nearly the entire genome of an individual from GEDmatch by uploading an artificial nearly-all-heterozygote genome and examining the resulting IBS segments (which was also shown by Edge and Coop), or by uploading an all-homozygote genome and examining the resulting images. However, these techniques depend crucially on the specifics of the genome comparison methods used by GEDmatch, and could become obsolete if these methods change, or if users are prohibited from uploading artificial or manipulated genomes.
The use of digital signatures could also prevent adversaries from uploading genomes they have downloaded from public resources or have generated computationally (Erlich et al., 2018). This would involve direct-to-consumer genetic testing labs digitally signing their genome files before users can download them, and third-party services only returning information about IBS segments to a user if the genome uploaded by the user has a digital signature from an approved lab.
The practical consequences of an adversary getting access to your genetic information are debatable. For example, some researchers question the potential usefulness of methods that predict the risk of disease based on polygenic scores (Wald and Old, 2019), especially for non-European populations (Martin et al., 2019). However, others argue for a clinical utility of polygenic risk scores (Lambert et al., 2019). Likewise, there are contrasting views on the usefulness of information about mutations in protein-coding regions. For example, some argue that most coding mutations carried by an individual are difficult to interpret, even by physicians (Hoffman-Andrews, 2017). However, databases such as ClinVar allow users to interpret the pathogenicity of many mutations, and some mutations can be strong risk factors for diseases such as Alzheimer's or breast cancer, which may affect insurance decisions.
However, one needs to remember that DNA is immutable, and thus, any loss of privacy cannot be reversed. Moreover, any loss of privacy can go beyond the individual and extend to their relatives. Further, if an entire large US-based database was compromised, an adversary would be able to identify most US individuals, even those not in the database (Erlich et al., 2018). Therefore, I urge all stakeholders to pay attention to the work of these two groups and attempt to keep genetic information secure.
References
-
The known unknown: the challenges of genetic variants of uncertain significance in clinical practiceJournal of Law and the Biosciences 4:648–657.https://doi.org/10.1093/jlb/lsx038
-
Using genetic genealogy databases in missing persons cases and to develop suspect leads in violent crimesForensic Science International 301:107–117.https://doi.org/10.1016/j.forsciint.2019.05.016
-
Towards clinical utility of polygenic risk scoresHuman Molecular Genetics 28:R133–R142.https://doi.org/10.1093/hmg/ddz187
-
Haplotype sharing provides insights into fine-scale population history and disease in FinlandAmerican Journal of Human Genetics 102:760–775.https://doi.org/10.1016/j.ajhg.2018.03.003
-
WebsiteGenotype extraction and false relative attacks: security risks to third-party genetic genealogy services beyond identity inferenceNetwork and Distributed System Security Symposium (NDSS) (San Diego, US). Accessed December 16, 2019.
-
WebsiteMore than 26 million people have taken an at-home ancestry testMIT Technology Review. Accessed December 16, 2019.
-
The illusion of polygenic disease risk predictionGenetics in Medicine 21:1705–1707.https://doi.org/10.1038/s41436-018-0418-5
Article and author information
Author details
Publication history
Copyright
© 2020, Carmi
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 2,431
- views
-
- 181
- downloads
-
- 2
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
If you've uploaded your DNA on genealogy databases, it may be at risk.
-
- Evolutionary Biology
With the availability of high-quality full genome polymorphism (SNPs) data, it becomes feasible to study the past demographic and selective history of populations in exquisite detail. However, such inferences still suffer from a lack of statistical resolution for recent, for example bottlenecks, events, and/or for populations with small nucleotide diversity. Additional heritable (epi)genetic markers, such as indels, transposable elements, microsatellites, or cytosine methylation, may provide further, yet untapped, information on the recent past population history. We extend the Sequential Markovian Coalescent (SMC) framework to jointly use SNPs and other hyper-mutable markers. We are able to (1) improve the accuracy of demographic inference in recent times, (2) uncover past demographic events hidden to SNP-based inference methods, and (3) infer the hyper-mutable marker mutation rates under a finite site model. As a proof of principle, we focus on demographic inference in Arabidopsis thaliana using DNA methylation diversity data from 10 European natural accessions. We demonstrate that segregating single methylated polymorphisms (SMPs) satisfy the modeling assumptions of the SMC framework, while differentially methylated regions (DMRs) are not suitable as their length exceeds that of the genomic distance between two recombination events. Combining SNPs and SMPs while accounting for site- and region-level epimutation processes, we provide new estimates of the glacial age bottleneck and post-glacial population expansion of the European A. thaliana population. Our SMC framework readily accounts for a wide range of heritable genomic markers, thus paving the way for next-generation inference of evolutionary history by combining information from several genetic and epigenetic markers.