
Attacks on genetic privacy via uploads to genealogical databases

  1. Michael D Edge (corresponding author)
  2. Graham Coop (corresponding author)
  1. University of California, Davis, United States
  2. University of Southern California, United States
Research Article
Cite this article as: eLife 2020;9:e51810 doi: 10.7554/eLife.51810

Abstract

Direct-to-consumer (DTC) genetics services are increasingly popular, with tens of millions of customers. Several DTC genealogy services allow users to upload genetic data to search for relatives, identified as people with genomes that share identical by state (IBS) regions. Here, we describe methods by which an adversary can learn database genotypes by uploading multiple datasets. For example, an adversary who uploads approximately 900 genomes could recover at least one allele at SNP sites across up to 82% of the genome of a median person of European ancestries. In databases that detect IBS segments using unphased genotypes, approximately 100 falsified uploads can reveal enough genetic information to allow genome-wide genetic imputation. We provide a proof-of-concept demonstration in the GEDmatch database, and we suggest countermeasures that will prevent the exploits we describe.

Introduction

As genotyping costs have fallen over the last decade, direct-to-consumer (DTC) genetic testing (Hogarth et al., 2008; Hogarth and Saukko, 2017; Khan and Mittelman, 2018) has become a major industry, with over 26 million people enrolled in the databases of the five largest companies (Regalado, 2019). One of the major applications of DTC genetics is genetic genealogy. Customers of companies such as 23andMe and Ancestry, once they are genotyped, can view a list of other customers who are likely to be genetic relatives. These putative relatives’ full names are often given, and sometimes contact details are given as well. Such genealogical matching services are of interest to people who want to find distant genetic relatives to extend their family tree, and can be particularly useful to people who otherwise may not have information about their genetic relatives, such as adoptees or the biological children of sperm donors. Several genetic genealogy services—including GEDmatch, MyHeritage, FamilyTreeDNA, and LivingDNA (Table 1)—allow users to upload their own genetic data if they have been genotyped by another company. These entities generally offer some subset of their services at no charge to uploaders, which helps to grow their databases. Upload services have also been used by law enforcement, with the goal of identifying relatives of the source of a crime-scene sample (Erlich et al., 2018; Edge and Coop, 2019), prompting discussion about genetic privacy (Syndercombe Court, 2018; Ram et al., 2018; Kennett, 2019; Scudder et al., 2019).

Table 1
Key parameters for several genetic genealogy services that allow user uploads as of July 26th, 2019.
Service | Database size (millions) | Individuals shown | IBS/IBD segments reported
GEDmatch | 1.2 | 3000 closest matches shown free; unlimited w/ $10/month license; any two kits can be searched against each other | Yes if longer than user-set threshold. Min. threshold 0.1 cM, default 7 cM
FamilyTreeDNA | 1* | All that share at least one 9 cM block or one 7.69 cM block and 20 total cM | Yes, down to 1 cM, for $19 per kit
MyHeritage | 3 | All that share at least one 8 cM block | Yes, down to 6 cM, for $29 per kit or unlimited for $129/year. Customers may opt out
LivingDNA | Unknown | Putative relatives out to about 4th-cousin range | Only sum length of matching segments reported
DNA.LAND** | 0.159 | Top 50 matches shown with minimum 3 cM segment | Yes

*Although Regalado (2019) reports that FamilyTreeDNA has two million users, he also suggests that only about half of these are genotyped at genome-wide autosomal SNPs, which is in line with other estimates (Larkin, 2018).

**DNA.LAND has discontinued genealogical matching for uploaded samples as of July 26th, 2019.

The genetic signal used to identify likely genealogical relatives is identity by descent (IBD; Browning and Browning, 2012; Thompson, 2013). We use 'IBD' to indicate both 'identity by descent' and 'identical by descent', depending on context. Pairs of people who share an ancestor in the recent past can share segments of genetic material from that ancestor. The distribution of IBD sharing as a function of genealogical relatedness is well studied (Donnelly, 1983; Huff et al., 2011; Browning and Browning, 2012; Thompson, 2013; Buffalo et al., 2016; Conomos et al., 2016; Ramstetter et al., 2018), and DTC genetics entities can use information about the number and length of inferred IBD segments between a pair of people to estimate their likely genealogical relationship (Staples et al., 2016; Ramstetter et al., 2017). These shared segments (IBD segments) result in the sharing of a near-identical stretch of chromosome (a shared haplotype). Shared haplotypes can most easily be identified by looking for long genomic regions where two people share at least one allele at nearly every locus.

For the rest of the main text, we focus on identical-by-state (IBS) segments, which are genomic runs of (near) identical sequence shared among individuals and can be thought of as a superset of true IBD segments. Very long IBS segments, say over 7 centiMorgans (cM), are likely to be IBD, but shorter IBS segments, say <4 cM, may or may not represent true IBD due to recent sharing—they may instead represent a mosaic of shared ancestry deeper in the past. Many of the algorithms for IBD detection that scale well to large datasets rely principally on detection of long IBS segments, at least as their first step (Gusev et al., 2009; Henn et al., 2012; Huang et al., 2014). We consider the effect on our results of attempting to distinguish IBS and IBD in supplementary material.

Many DTC genetics companies, in addition to sharing a list of putative genealogical relatives, give customers information about their shared IBS with each putative relative, possibly including the number, lengths, and locations of shared genetic segments (Table 1). This IBS report may represent substantial information about one’s putative relatives—one already has access to one’s own genotype, and so knowing the locations of IBS sharing with putative relatives reveals information about those relatives’ genotypes in those locations (He et al., 2014). Users of genetic genealogy services implicitly or explicitly agree to this kind of genetic information sharing, in which large amounts of genetic information are shared with close biological relatives and small amounts of information are shared with distant relatives.

Here, we consider methods by which it may be possible to compromise the genetic privacy of users of genetic genealogy databases. In particular, we show that for services where genotype data can be directly uploaded by users, many users may be at risk of sharing a substantial proportion of their genome-wide genotypes with any party that is able to upload and combine information about several genotypes. We consider two major tools that might be used by an adversary to reveal genotypes in a genetic genealogy database. One tool available to the adversary is to upload real genotype data or segments of real genotype data. When uploading real genotypes, the information gained comes by virtue of observed sharing between the uploaded genotypes and genotypes in the database (an issue also raised by Larkin, 2017). Publicly available genotypes from the 1000 Genomes Project (Abecasis et al., 2012), Human Genome Diversity Project (Cann et al., 2002), OpenSNP project (Greshake et al., 2014), or similar initiatives might be uploaded.

A second tool available to the adversary is to upload artificial genetic datasets (Ney et al., 2018). In particular, we consider the use of artificial genetic datasets that are tailored to trick algorithms that use a simple, scalable method for IBS detection, that of identifying long segments in which a pair of genomes contains no incompatible homozygous sites (Henn et al., 2012; Huang et al., 2014). Such artificial datasets can be designed to reveal the genotypes of users at single sites of interest or sufficiently widely spaced sites genome-wide. We describe how a set of a few hundred artificial datasets could be designed to reveal enough genotype information to allow accurate imputation of common genotypes for every user in the database.

Below, we describe these procedures and illustrate them using either publicly available or artificial data. We show that under some circumstances, many users could be at risk of having their genotypes revealed, either at key positions or at many sites genome-wide. In particular, we show that GEDmatch, as of mid-December 2019, was vulnerable to an attack we term IBS baiting that obtains genotype data via artificial data uploads. Our results are largely complementary to the independent work of Ney et al. (2020), which was first posted publicly within a week of the first public posting of this manuscript on bioRxiv. In the discussion, we consider our work in light of other genetic privacy concerns (Erlich and Narayanan, 2014; Naveed et al., 2015 and the work of Ney et al., 2020), and we give some suggested practices that DTC genetics services can adopt to prevent privacy breaches by the techniques described here.

Results

We describe three general methods for revealing the genotypes of users in genetic genealogy databases that allow uploads. The first, IBS tiling, involves uploading many real genotypes in order to identify genotype information from many regions in many people. The second, IBS probing, involves uploading a dataset containing a long haplotype that includes an allele of interest, creating matches at this locus. Genotypes at other places in the genome are chosen to be unlikely to generate IBS with any user in the database, so matches with the uploaded dataset are likely to be users who carry the allele of interest. The third method, IBS baiting, involves uploading fake datasets with long runs of heterozygosity to induce phase-unaware methods for IBS calling to reveal genotypes.

IBS tiling

In IBS tiling, the genotype information shared between a target user in the database and each member of a set of comparison genomes is aggregated into potentially substantial information about the target’s genotypes. For example, consider a user of European ancestries. She is likely to have some degree of IBS sharing with a large set of people from across Europe (Ralph and Coop, 2013) and beyond. If one knows the user’s IBS sharing locations with one random person of European ancestries (and the random person’s genotype), then one can learn a little about the user’s genotype. But if one can upload many people’s genotypes for comparison, then one can uncover small proportions of the target user’s genotypes from many of the comparison genotypes, eventually uncovering much of the target user’s genome by virtue of a ‘tiling’ of shared IBS with known genotypes (Figure 1A). Similar ideas have been suggested with application to IBD-based genotype imputation (Carmi et al., 2014).
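Computationally, the tiling idea amounts to taking the union of the IBS intervals reported against each uploaded genome. The Python sketch below is only an illustration under stated assumptions and is not the pipeline used in this study: the function names (`merge_intervals`, `tiling_curve`), the representation of segments as (start, end) coordinates on a single target haplotype, and the toy numbers are all hypothetical.

```python
def merge_intervals(segments):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous tile: extend it rather than double count.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def tiling_curve(segments_by_upload, genome_length):
    """Cumulative fraction of the target covered after each upload."""
    pooled, curve = [], []
    for segments in segments_by_upload:
        pooled.extend(segments)
        covered = sum(e - s for s, e in merge_intervals(pooled))
        curve.append(covered / genome_length)
    return curve


# Hypothetical toy data: three uploads, each IBS with the target in one segment.
uploads = [[(0, 10)], [(5, 20)], [(30, 40)]]
print(tiling_curve(uploads, genome_length=100))
```

Pooling segments before merging means overlapping tiles from different uploads are counted once, which is why coverage grows more slowly than the raw sum of segment lengths.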

Figure 1
Schematics of the IBS tiling and IBS probing procedures.

(A) In IBS tiling, multiple genotypes are uploaded (green lines) and the positions at which they are IBS with the target (represented by blue lines) are recorded. Once enough datasets have been uploaded, the target will eventually have a considerable proportion of their genome 'tiled’ by IBS with uploads that have known genotypes. (B) In IBS probing, the uploaded probe consists of a haplotype carrying an allele of interest (red dot) surrounded by 'IBS-inert’ segments (purple dashed lines)—fake genotype data designed to be unlikely to share any IBS regions with anyone in the database. In the event of an IBS match in the database, the matching database entry is likely to carry the allele of interest.

We consider the amount of IBS tiling possible within a set of publicly available genotypes for 872 people of European origin genotyped at 544,139 sites. We phased the sample using Beagle 5.0 (Browning and Browning, 2007) and used Refined IBD software (Browning and Browning, 2013) to identify IBS segments (see Materials and methods). In the main text, we include IBS segments that are not particularly likely to be IBD—these are IBS segments returned by Refined IBD with relatively low LOD scores for IBD, between 1 and 3. True IBD segments reveal more than mere IBS segments about shared genotypes because untyped variants (including rare variants) within an IBD segment are likely to be shared. At the same time, mere IBS is sufficient to infer sharing for SNPs that are genotyped within the segment.

Figure 2 shows the median amount of coverage obtainable by IBS tiling as a function of comparison sample size, imposing various restrictions on the minimum segment length in cM. (For similar results, see Figure 2b of Carmi et al., 2014 and Figure 2 of Panoutsopoulou et al., 2014.) Approximately 2.8 Giga base-pairs (Gbp) were covered by IBS segments anywhere in the genome; we take this to be approximately the maximum possible genomic length recoverable by IBS with our SNP set. Using the entire sample (871 genotypes, since the target is left out) and including all called IBS segments >1 cM, the median person has an average of 60% of the maximum length of 2.8 Gbp covered by IBS segments (with the average taken across their two chromosomes), and sites across 82% of this length have at least one of two alleles recoverable by IBS tiling. Increasing the cM threshold required for reporting substantially reduces the amount of IBS tiling. With a cutoff of 3 cM, approximately 6.9% of the median person’s genotype information is recoverable, including at least one of two alleles at sites in 11% of the genome. When a more stringent cutoff of 8 cM is used, only 1% of the genome has at least one of two alleles recoverable for the median person when using a comparison sample of 871. Our reports for segments longer than 3 cM may be conservative because Refined IBD sometimes splits long IBS segments into multiple shorter segments in the presence of phasing errors (Browning and Browning, 2013; Bjelland et al., 2017).

Figure 2 (with 5 figure supplements)
Lengths of genome in Giga base-pairs (Gbp) covered by IBS tiling as a function of minimum required length of IBS segments in centiMorgans (cM) and size of a randomly selected comparison sample for the median person in our dataset.

The top-left panel shows the average coverage across each of the person’s two haplotypes. The top-right shows IBS2 coverage, the length of genome where both haplotypes are covered by IBS tiles. The bottom-left panel shows IBS1, the length of genome where exactly one haplotype is covered by IBS tiles. (IBS1 coverage can decrease at larger comparison sample sizes because IBS2 coverage increases). The bottom-right panel shows IBS1+ coverage, the length of genome covered by either IBS1 or IBS2.

For some people, the amount of information obtainable by IBS tiling is even larger. In our sample, the top 10% of people have genotypes across 76% of their total genome covered by IBS tiles, including one or more alleles at sites in at least 93% of the 2.8 Gbp covered by IBS tiles anywhere. If only segments longer than 3 cM are reported, the top 10% of people have one or both alleles covered at sites in at least 38% of the total, and if only segments longer than 8 cM are reported, the top 10% have one or both alleles covered at sites in at least 6% of the total.

The coverage obtained by IBS tiling and its informativeness about target genotypes depend on the specific practices used for reporting IBS information (Figure 2—figure supplements 1–5). For example, DTC genealogy services may take additional steps to ensure that any short segments reported are likely to be IBD, not merely IBS. Such steps will tend to decrease the amount of IBS tiling possible, particularly for short segments (Figure 2—figure supplement 1). As another example, some DTC genealogy services only report matching segments for pairs of people who share at least one long IBS segment (Table 1), but then allow users to see shorter IBS segments (>1 cM) for those pairs of people. Unsurprisingly, we find that this strategy allows a higher level of IBS tiling than if only long segments are revealed (Figure 2—figure supplement 2), because people who share a long IBS segment may also share shorter segments that are hidden if only long segments are reported.

In this demonstration of IBS tiling, we used haplotype information provided by the Refined IBD software to determine which haplotypes were covered by IBS in each person. Most genetic genealogy services that provide information on the location of IBS matches with putative relatives do not provide haplotype information, making it difficult to distinguish IBS1 (in which one chromosome is covered by an IBS segment) and IBS2 (in which both chromosomes are covered by IBS segments). One tool available to an adversary pursuing IBS tiling is to upload genotype information that is homozygous at all sites using one of two phased haplotypes as a basis, effectively searching for IBS with one chromosome at a time. In the presence of phasing errors, some IBS segments may be missed, and the assumption that phase is known would render the coverage rates in Figure 2 overestimates. At the same time, the decrease in tiling performance is small for short segments, which can be seen by conducting our test of IBS tiling using Germline software with the haploid flag, which causes putative IBS segments to terminate with a single phasing error (Figure 2—figure supplement 3). It may remain difficult to distinguish some cases—such as distinguishing IBS1 from IBS2 with a run of homozygosity on the database genotype—but there will be no question about which uploaded haplotype is IBS with the database genotype. Thus, at any point where a homozygous upload and a target are IBS, at least one of the target’s alleles is known. Further, if the target is IBS with any other uploaded datasets at a genetic locus of interest, it will often be possible to infer the target’s full genotype.
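The homozygous-upload trick described above can be sketched in a few lines of Python. This is an illustrative sketch only: the encoding (haplotype alleles as 0 for major and 1 for minor, unphased genotypes as minor-allele counts), the function names, and the representation of reported segments as inclusive site-index pairs are assumptions, not details of any particular database.

```python
def homozygous_upload(haplotype):
    """Convert one phased haplotype (0 = major, 1 = minor allele per site)
    into unphased genotypes (minor-allele counts) homozygous at every site."""
    return [2 * allele for allele in haplotype]


def revealed_alleles(haplotype, ibs_segments):
    """Alleles the target must carry at least once, given the IBS segments
    reported between the homozygous upload and the target.

    ibs_segments holds (first_site, last_site) index pairs, inclusive.
    """
    revealed = {}
    for first, last in ibs_segments:
        for site in range(first, last + 1):
            revealed[site] = haplotype[site]
    return revealed


# Hypothetical example: a five-site haplotype matching the target at sites 1-2.
hap = [0, 1, 1, 0, 1]
print(homozygous_upload(hap))
print(revealed_alleles(hap, [(1, 2)]))
```

Because the upload carries only one allele at each site, any reported match identifies which allele the target shares, although it does not by itself say whether the target carries one or two copies.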

IBS tiling rates vary somewhat by population, with Finnish samples showing the highest tiling rates among the 1000 Genomes populations included (Figure 2—figure supplement 4). There also appear to be slight biases for IBS tiles to appear in regions with low SNP density and lower heterozygosity, meaning that the proportion of alleles recovered by tiling, and particularly the proportion of minor alleles, is typically slightly lower than the proportion of the genome length in Mbp covered (Figure 2—figure supplement 5).

IBS probing

IBS probing is an application of the same idea underlying IBS tiling. By IBS probing, one could identify people with specific genotypes of interest, such as risk alleles for Alzheimer’s disease (Corder et al., 1993), even if the DTC service does not report chromosomal locations of IBS matches. To identify people carrying a particular allele at a locus of interest, one could use haplotypes carrying the allele in publicly available databases. To do so, one would extract a haplotype that surrounds the allele of interest and place it into a false genetic dataset designed to have no long IBS segments with any real genomes (Figure 1B). Thus, any returned putative relatives must match at the allele of interest, revealing that they carry the allele. We call this attack ‘IBS probing’ by analogy with hybridization probes, as the genuine haplotype around the allele of interest acts as a probe. Whereas IBS tiling recovers genetic information from across the genome, IBS probing acts only on a single locus of interest. The advantage is that IBS probing is possible even in databases that do not report the chromosomal locations of IBS segments.

There are several ways of generating chromosomes unlikely to have long shared segments with any entries in the database. One simple way is to sample alleles at each locus in proportion to their frequencies. Chromosomes generated in this way are free of linkage disequilibrium (LD) and thus unlike genuine chromosomes. If the database distinguishes between IBS and IBD, then these fake data are unlikely to register as IBD with any genuine haplotypes. However, they may appear as IBS in segments where genetic diversity is low, depending on the length threshold used by the database. Near-zero rates of IBS can be obtained by generating more unusual-looking fake data, such as by sampling alleles from one minus their frequency or by generating a dataset of all minor alleles.
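The three sampling schemes mentioned above (alleles drawn at their frequencies, at one minus their frequencies, or all minor alleles) can be sketched as follows. This is a hedged illustration rather than the procedure in Materials and methods: the function name `sample_genotypes`, the `mode` labels, and the coding of genotypes as minor-allele counts are assumptions made here for clarity.

```python
import random


def sample_genotypes(minor_freqs, mode="frequency", seed=None):
    """Draw LD-free genotypes (minor-allele counts 0/1/2), one per site.

    mode="frequency": minor allele drawn with probability f
    mode="inverted":  minor allele drawn with probability 1 - f
    mode="all_minor": every allele is the minor allele
    """
    rng = random.Random(seed)
    genotypes = []
    for f in minor_freqs:
        if mode == "frequency":
            p = f
        elif mode == "inverted":
            p = 1.0 - f
        elif mode == "all_minor":
            p = 1.0
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Two independent allele draws per site, with no linkage between sites.
        genotypes.append(sum(rng.random() < p for _ in range(2)))
    return genotypes


# Hypothetical minor-allele frequencies at five sites.
freqs = [0.05, 0.20, 0.40, 0.10, 0.30]
print(sample_genotypes(freqs, mode="inverted", seed=1))
```

Whether a given inert dataset really produces no matches depends on the database and its length threshold, so in practice it would still need to be checked empirically, as the text notes.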

Figure 3 shows a demonstration of IBS probing performance in our set of 872 Europeans in a window around the APOE locus. For a 1-cM threshold for reporting IBS, we generated probes by retaining 1.9 cM of real data around a site of interest in the APOE locus from all 872 people. Outside that 1.9-cM window, we generated data by drawing alleles randomly (see Materials and methods). For a 3-cM threshold for reporting IBS, we generated probes by retaining 5.9 cM of real data around the site of interest. With 1-cM matching, 1497 of 1744 haplotypes (86%) matched one of the probes at the site of interest. (Target haplotypes were not allowed to match probes constructed from the same person that carried the target haplotype). With 3-cM matching, 164 of 1744 haplotypes (9.4%) matched one of the probes at the site of interest. Very few matches occurred outside the region of interest: none with a 3-cM threshold and only 0.1% of matches with a 1-cM threshold. Moreover, we generated different inert genotypes for all 872 probes, and the great majority of these had no matches with any real sample. An adversary would only need to generate one inert dataset, which can be tested by uploading to the database and confirming that no matches are returned. Probes could then be constructed by stitching real haplotypes at the site of interest into the same set of inert data. The probes would then be likely to match each other, but the adversary would know those identities and could ignore those matches.
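Stitching a real window into inert data is a simple per-site selection. The sketch below assumes genotypes coded as minor-allele counts, positions given in cM, and a window centered on the site of interest; `build_probe` and its parameters are hypothetical names used only for illustration.

```python
def build_probe(real_genotypes, inert_genotypes, positions_cm, site_cm, window_cm):
    """Keep real genotypes within window_cm centered on site_cm and
    use inert genotypes everywhere else."""
    half = window_cm / 2.0
    return [
        real if abs(pos - site_cm) <= half else inert
        for real, inert, pos in zip(real_genotypes, inert_genotypes, positions_cm)
    ]


# Hypothetical five-site example with a 2-cM window around position 2.0 cM.
real = [0, 1, 2, 1, 0]
inert = [2, 2, 2, 2, 2]
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
print(build_probe(real, inert, positions, site_cm=2.0, window_cm=2.0))
```

Reusing one tested inert dataset for every probe, as suggested above, corresponds to passing the same `inert_genotypes` list in each call.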

Figure 3 (with 2 figure supplements)
A demonstration of the IBS probing method around position 45411941 on chromosome 19 (GRCh37 coordinates), in the APOE locus.

We show the proportion of haplotypes among the 872 Europeans in our sample covered IBS by probes constructed from the sample, as a function of the chromosomal location in a 10-Mb region around the site of interest. In red, we show the coverage using a 1-cM threshold for reporting IBS, where the probes are constructed using real data in a 1.9-cM region centered on the site of interest (region boundaries shown in dashed orange). In orange, we show the coverage using a 3-cM threshold for reporting IBS, where the probes are constructed using real data in a 5.9-cM region around the site of interest.

The efficacy of IBS probing will depend on the minimum IBS-match length reported to users, the specific methods used for identifying IBS segments (Figure 3—figure supplements 1–2), and whether the genotype of interest is included on the SNP chip. These factors vary in terms of whether they affect the sensitivity of IBS probing (the proportion of people carrying the allele of interest who are returned by a probe or set of probes) or the precision of IBS probing (the proportion of people returned by a probe who in fact carry the genotype of interest). For example, high thresholds for IBS reporting will mean that uploaded genotypes will need to have long IBS segments with targets at the locus of interest. Long IBS segments are likely to represent relatively close genealogical relatives (i.e. long IBS segments are likely to be IBD segments), and not many targets will be close relatives of the source of any given haplotype of interest, meaning that the sensitivity of IBS probing is reduced by reporting thresholds that require long IBS segments. If the locus of interest or a highly correlated one is not included on the chip used to genotype either the uploaded sample or the target sample, then probing may only be expected to work well if the upload and the target are truly IBD rather than merely IBS, reducing the precision of IBS probing for variants that are not genotyped. Limiting probing results to likely IBD matches will decrease the number of matches returned, particularly for short cM thresholds (Figure 3—figure supplement 1).
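The two performance measures defined above can be computed directly from sets of user identifiers. This is a minimal sketch with hypothetical names and toy data; in a real attack the set of true carriers would of course not be known to the adversary.

```python
def sensitivity_and_precision(returned, carriers):
    """Sensitivity: share of true carriers returned by the probes.
    Precision: share of returned users who are true carriers."""
    returned, carriers = set(returned), set(carriers)
    true_positives = len(returned & carriers)
    sensitivity = true_positives / len(carriers) if carriers else float("nan")
    precision = true_positives / len(returned) if returned else float("nan")
    return sensitivity, precision


# Hypothetical user IDs: three users matched a probe, four truly carry the allele.
sens, prec = sensitivity_and_precision({"u1", "u2", "u3"}, {"u2", "u3", "u4", "u5"})
print(sens, prec)
```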

Another factor that will affect the success of IBS probing is the frequency of the allele of interest. For example, if the allele of interest is very rare, then it is likely to be only somewhat enriched on the haplotypes that tend to carry it, and reported matches may not actually carry the allele, even if they are IBD with an uploaded haplotype that carries it. IBS probing will perhaps be most sensitive and precise when the allele of interest is both common and relatively young, as is the case for founder mutations. In this case, most carriers of the allele will share the same long haplotype around the site of interest, meaning that fewer probes would need to be uploaded in order to learn the identities of the majority of the carriers in the database.

IBS baiting

IBS tiling and IBS probing take advantage of publicly available genotype data. The idea of both is that an adversary uploads genuine genetic datasets—or, in the case of IBS probing, datasets with genuine segments—to learn about entries in the database that share segments with the uploaded genomes.

In this section, we describe an exploit called IBS baiting. The specific strategy for IBS baiting that we describe is possible if the database identifies putative IBS segments by searching for long regions where a pair of people has no incompatible homozygous sites. An incompatible homozygous site is a site at which one person in the pair is homozygous for one allele, and the other person is homozygous for the other allele. Identifying IBS segments in this way does not require phased genotypes and scales relatively easily to large datasets—we refer to methods in this class as 'phase-unaware’ and contrast them with phase-aware methods for IBS detection. Phase-unaware methods are robust to phasing errors, which are an issue for long IBD segments (Durand et al., 2014). Major DTC genetics companies have used phase-unaware methods in the past for IBS detection (Henn et al., 2012; Hon et al., 2013), and some state-of-the-art IBD detection and phasing pipelines feature an initial phase-unaware step (Huang et al., 2014; Loh et al., 2016).

The main tool used in IBS baiting is the construction of apparently IBS segments by assigning every uploaded site in the region to be heterozygous. (SNPs with missing data may also be included in these regions). These runs of heterozygosity, which are unlikely to occur naturally (unlike runs of homozygosity, [McQuillan et al., 2008; Pemberton et al., 2012]), will be identified as IBS with every genome in the database using phase-unaware methods: because they contain no homozygous sites at all, they cannot contain homozygous sites incompatible with any person in the database.
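To make the mechanism concrete, the Python sketch below implements a toy phase-unaware matcher of the kind described: it reports runs of sites with no incompatible homozygous genotypes. It is an assumption-laden illustration, not the algorithm of any specific service. Genotypes are coded as minor-allele counts (0, 1, 2) with `None` for missing data, and segment length is measured between the first and last compatible sites, which is a simplification.

```python
def incompatible(g1, g2):
    """True for opposite homozygotes; None (missing) never conflicts."""
    return {g1, g2} == {0, 2}


def phase_unaware_segments(a, b, positions_cm, min_cm):
    """Inclusive (first, last) site-index pairs for runs with no
    incompatible homozygous site that span at least min_cm."""
    segments, start = [], None
    for i in range(len(a) + 1):
        at_break = i == len(a) or incompatible(a[i], b[i])
        if not at_break:
            if start is None:
                start = i
            continue
        if start is not None:
            if positions_cm[i - 1] - positions_cm[start] >= min_cm:
                segments.append((start, i - 1))
            start = None
    return segments


positions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
target = [0, 2, 0, 2, 1, 0]
bait = [1] * 6  # run of heterozygosity: no homozygous sites at all
print(phase_unaware_segments(bait, target, positions, min_cm=3.0))
```

Since the all-heterozygous bait has no homozygous sites, `incompatible` can never return `True` for it, so the whole region is reported as a single segment regardless of the target's genotypes.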

Here, we consider a database in which an apparent IBS segment is halted exactly at the places at which the first incompatible homozygous site occurs on each side of the segment. We also assume that the database detects all segments without incompatible homozygous sites that pass the required length threshold. Ney et al. (2020) independently proposed a similar approach in their section VII ‘Genetic Marker Extraction Using Matching Segments,’ showing that GEDmatch was vulnerable to it. Similarly, we demonstrate below that IBS baiting can be implemented against GEDmatch.

Single-site IBS baiting

The simplest application of IBS baiting is to use it to reveal genotypes at a single site. If IBS is identified by looking for single incompatible homozygous sites and missing data can be ignored, then users’ genotypes at any single biallelic site of interest can be determined by examining their putative IBS with each of two artificial datasets (Figure 4A). In each artificial dataset, the site of interest is flanked by a run of heterozygosity. The combined length of these two runs of heterozygosity must exceed the minimum length of IBS segment reported by the database. The adversary uploads two datasets with these runs of heterozygosity in place. In one dataset, the site of interest is homozygous for the major allele, and in the other, the site of interest is homozygous for the minor allele. If the target user is homozygous at the site of interest, then one of these two uploads will not show a single, uninterrupted IBS segment—IBS will be interrupted at the site of interest (or may not be called at all). If the IBS segment with the dataset homozygous for the major allele is interrupted, then the target user is homozygous for the minor allele. Similarly, if the IBS segment with the dataset homozygous for the minor allele is interrupted, then the target user is homozygous for the major allele. If both uploads show uninterrupted IBS segments with the target, then the target user is heterozygous at the site of interest. Thus, for any genotyped biallelic site of interest, the genotypes of every user shown as a match can be revealed after uploading two artificial datasets. Depending on how possible matches are made accessible to the adversary, the genotypes of every user could be returned. Genotypes of medical interest that are often included in SNP chips, such as those in the APOE locus (Corder et al., 1993), are potentially vulnerable to single-site IBS baiting.
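The decision rule in this paragraph can be written out as a short Python sketch. It assumes the simplest phase-unaware matcher (a segment breaks at the first incompatible homozygous site) and genotypes coded as minor-allele counts; the function names are hypothetical, and `unbroken` stands in for whatever match report a real database would return.

```python
def make_bait_pair(n_sites, key_index):
    """Two baits that are heterozygous everywhere except the key site."""
    major_bait = [1] * n_sites
    minor_bait = [1] * n_sites
    major_bait[key_index] = 0  # homozygous for the major allele
    minor_bait[key_index] = 2  # homozygous for the minor allele
    return major_bait, minor_bait


def unbroken(bait, target):
    """True if no site has opposite homozygotes (0 vs 2); None means missing."""
    return all({b, t} != {0, 2} for b, t in zip(bait, target))


def call_key_site(unbroken_major, unbroken_minor):
    """Infer the target genotype at the key site from the two match patterns."""
    if unbroken_major and unbroken_minor:
        return "heterozygous"  # a missing target genotype would look the same
    if unbroken_major:
        return "homozygous major"
    if unbroken_minor:
        return "homozygous minor"
    return "undetermined"


target = [0, 2, 1, 2, 0]  # hypothetical target; key site is index 3
major_bait, minor_bait = make_bait_pair(len(target), key_index=3)
print(call_key_site(unbroken(major_bait, target), unbroken(minor_bait, target)))
```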

Figure 4
Schematics of the IBS baiting procedure.

(A) To perform IBS baiting at a single site, two uploads are required, each with runs of heterozygous genotypes flanking the key site. At the key site, the two uploaded datasets are homozygous for different alleles. The three possible target genotypes at the key site can each be determined by examining their IBS coverage with the uploads. If there is a break in IBS with either upload, then the target is homozygous for the allele not carried by the upload that shows the break in IBS (with the broken IBS segment shown as a cyan line). If there is no break in IBS with either upload, then the target is heterozygous at the key site. (B) Target genotypes at many key sites across the genome can be learned by comparison with two uploaded datasets, as long as key sites are spaced widely enough.

Here, we have considered a database using the simplest possible version of a phase-unaware method for detecting IBS, that in which an apparent IBS segment is halted exactly at the places at which the first incompatible homozygous site occurs on each side of the segment. In principle, phase-unaware IBS-detection algorithms can be altered to allow for occasional incompatible homozygous sites before halting as an allowance for genotyping error, or the extent of the reported region might be modified to be less than the full range between incompatible homozygous sites. Versions of IBS baiting might be developed to work within such modifications. The key insight is that if two artificial kits differ at exactly one site in a region and they produce two different patterns of called IBS with a target, then the target’s genotype is revealed at that site. For example, if a database uses a phase-unaware method for IBS calling that requires two incompatible homozygous sites before a putative IBS segment is halted, then an attacker might modify our scheme by putting in a rare homozygote at a site near the key site. For most target users, the rare homozygote in the uploaded files would be an incompatible homozygous site, implying that a mismatch at the key site will cause a break in a putative IBS region. By using different homozygote genotypes nearby, an attacker might still identify the genotypes of everyone in the database at the key site. As discussed below, such measures do not appear to be necessary to perform IBS baiting in GEDmatch. Further, in GEDmatch, uploading a third bait dataset with a missing genotype at the key site can distinguish targets with missing genotypes from heterozygous targets.

Single-site IBS baiting could also be used if chromosomal locations of matches are not reported. To do so, one would use the scheme we describe in a large region surrounding the locus of interest and use fake IBS-inert segments to fill in the rest of the dataset.

Parallel IBS baiting

The second method we consider applies the IBS baiting technique to many sites in parallel (Figure 4B). By parallel application of IBS baiting, users’ genotypes at hundreds or thousands of sites across the genome can be identified by comparison with each pair of artificial genotypes. By repeated parallel IBS baiting, eventually enough genotypes can be learned that genotype imputation becomes accurate, and genome-wide genotypes could in principle be imputed for every user in the database. If IBS segments as short as 1 cM are reported to the user, then accurate imputation (97–98% accuracy) becomes possible after comparison with only about 100 uploaded datasets. The procedure starts by designing a single pair of uploaded files as follows:

  1. Identify a set of key sites to be revealed by the IBS baiting procedure. For every key site, the sum of the distances in cM to the nearest neighboring key site on each side (or the end of the chromosome, if there is no flanking key site on one side) must be at least the minimum IBS length reported by the database.

  2. Produce two artificial genetic datasets. In each, every non-key site is heterozygous. In one, each key site is homozygous for the major allele, in the other, each key site is homozygous for the minor allele.

  3. Upload each artificial dataset and compare them to a target user. Key sites that are covered by putative IBS segments between the target and both artificial datasets are heterozygous in the target. The target is homozygous for the major allele at key sites that are covered by putative IBS segments between the target and the major-allele-homozygous dataset only. Similarly, the target is homozygous for the minor allele at key sites that are covered by putative IBS segments between the target and the minor-allele-homozygous dataset only.
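The three steps above can be sketched as a small local simulation. This is a toy model under stated assumptions, not the procedure of any real database: the genotype codes ("AA", "AB", "BB"), the function names, and the stand-in IBS caller (which breaks a segment at every opposite-homozygous site and measures length from the first to the last compatible marker) are all hypothetical, and missing data are ignored. Key sites are spaced 0.6 cM apart, slightly more than half the 1 cM threshold, because of how this toy caller measures segment length.

```python
import random

def make_bait_pair(n_sites, key_sites):
    """Step 2: two kits, heterozygous everywhere except the key sites."""
    major = ["AB"] * n_sites
    minor = ["AB"] * n_sites
    for k in key_sites:
        major[k] = "AA"  # homozygous for the major allele
        minor[k] = "BB"  # homozygous for the minor allele
    return major, minor

def opposite_homozygous(g1, g2):
    return {g1, g2} == {"AA", "BB"}

def reported_sites(kit, target, pos_cm, min_cm):
    """Toy phase-unaware caller (an assumption): return the indices of sites
    covered by putative IBS segments at least min_cm long."""
    covered, start, n = set(), None, len(kit)
    for i in range(n + 1):
        is_break = i == n or opposite_homozygous(kit[i], target[i])
        if not is_break:
            if start is None:
                start = i
            continue
        if start is not None:
            if pos_cm[i - 1] - pos_cm[start] >= min_cm:
                covered.update(range(start, i))
            start = None
    return covered

def infer_key_genotypes(target, key_sites, pos_cm, min_cm):
    """Step 3: read each key-site genotype off the two coverage patterns."""
    major, minor = make_bait_pair(len(target), key_sites)
    # In a real setting only the reported segments would be visible.
    cov_major = reported_sites(major, target, pos_cm, min_cm)
    cov_minor = reported_sites(minor, target, pos_cm, min_cm)
    calls = {}
    for k in key_sites:
        in_major, in_minor = k in cov_major, k in cov_minor
        if in_major and in_minor:
            calls[k] = "AB"
        elif in_major:
            calls[k] = "AA"
        elif in_minor:
            calls[k] = "BB"
        else:
            calls[k] = None  # not expected when key sites are spaced widely enough
    return calls

# Toy chromosome: one marker every 0.05 cM, key sites every 0.6 cM.
rng = random.Random(0)
pos_cm = [i * 0.05 for i in range(97)]
key_sites = list(range(12, 96, 12))
target = [rng.choice(["AA", "AB", "BB"]) for _ in pos_cm]
calls = infer_key_genotypes(target, key_sites, pos_cm, min_cm=1.0)
```

In this simulation every key-site call matches the simulated target, because a mismatch at one key site never shortens the segment around a neighboring key site below the reporting threshold.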

Carrying out this procedure reveals the target’s genotype at every key site. If IBS segments of length at least $t$ cM are reported, and a chromosome is $c$ cM long, then up to $2c/t - 1$ key sites can be revealed with each pair of uploaded files. (To see this, consider the case where $c = tk$, with $k$ a positive integer, and place key sites at $t/2, t, 3t/2, \ldots, c - t/2$. This calculation ignores the possibility of missing data at key sites in the target.) This means that with a minimum reported IBS threshold of 1 cM, 100 uploaded datasets could reveal approximately 100 genotypes per cM, which is enough to impute genome-wide genotypes at 97–98% accuracy (Shi et al., 2018). In principle, the key sites could also be chosen to ensure good LD coverage and higher imputation accuracy. Of course, higher accuracy imputation can be obtained by recovering exact genotypes for more sites, and with several thousand uploads, the genotypes at every genotyped site could be revealed by IBS baiting without the need to impute.
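The counting argument can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the chromosome length of 50 cM is an arbitrary example, and it assumes that the chromosome length is a multiple of the threshold and that each pair of uploads targets a different set of key sites.

```python
def key_site_positions(c, t):
    """Key sites every t/2 cM at t/2, t, ..., c - t/2 (assumes c is a multiple of t)."""
    n = round(2 * c / t) - 1
    return [(i + 1) * t / 2 for i in range(n)]

def spacing_ok(sites, c, t):
    """Check that, for each key site, the distances to its neighbors on each
    side (or to the chromosome ends) sum to at least t."""
    bounds = [0.0] + sites + [c]
    return all(bounds[i + 1] - bounds[i - 1] >= t for i in range(1, len(bounds) - 1))

def genotypes_per_cm(n_uploads, c, t):
    """Genotypes revealed per cM, assuming every pair of uploads uses new key sites."""
    pairs = n_uploads // 2
    return pairs * len(key_site_positions(c, t)) / c

sites = key_site_positions(50.0, 1.0)
density = genotypes_per_cm(100, 50.0, 1.0)
```

For a 50 cM chromosome and a 1 cM threshold, each pair of uploads covers 99 key sites, so 100 uploads (50 pairs) give 99 genotypes per cM, in line with the figure of approximately 100 per cM.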

IBS baiting in GEDmatch

We hypothesized that IBS baiting would work in the GEDmatch DTC database. GEDmatch provides no public documentation of the IBS algorithm they use, but IBS segments identified by GEDmatch seem to terminate only on incompatible homozygous sites, as would be expected if they use phase-unaware IBS detection. Specifically, the GEDmatch 1-to-1 match tool identifies the locations of IBS segments between pairs of genetic datasets ('kits’ in GEDmatch terminology) and allows the user to specify the minimum genetic length and minimum number of matching SNPs to include in a segment. The 1-to-1 tool also returns a ‘full resolution’ picture of the chromosome that appears to be a SNP-by-SNP picture of the match between the kits along each chromosome. (These pictures are themselves a major security risk. We alerted GEDmatch to the risk in a July 24th, 2019 email (posted here: https://github.com/mdedge/IBS_privacy/blob/master/IBS_baiting_demo/GEDmatch_emails.pdf) but did not analyze them further. Ney et al. (2020) showed in detail that the images provided by GEDmatch allow an adversary to learn the full genotype of a target person.)

To demonstrate IBS baiting in GEDmatch, we uploaded a small number of artificial genotypes to their database beginning in late November 2019. These kits were designed in accordance with the algorithm discussed above, but with some slight alterations to bypass countermeasures that GEDmatch has put in place since we (and, independently, Ney and colleagues) informed them of the risk of IBS baiting in summer 2019. Before uploading any data to GEDmatch, we first confirmed our planned procedure with the UC Davis IRB and with GEDmatch representatives. We uploaded our kits into the GEDmatch ‘research’ and not ‘public’ category to prevent matches to the public database, and only used the 1-to-1 IBS match tool among our own uploaded test kits. In this way, we avoided interacting with any genotype data of real GEDmatch users and did not violate GEDmatch’s terms and conditions.

We targeted four random SNPs along chromosome 22 for IBS baiting. We uploaded two bait genotype kits (B1 and B2) that had opposite-homozygote genotypes at each of these key SNPs. Each key SNP was in turn surrounded by a ∼1 cM stretch of SNPs containing genotypes that were either heterozygous or coded missing. The rest of the genome was specified to be IBS-inert. We then uploaded three target genotype datasets whose genotypes we wanted to determine at the key sites. Two of these target kits (T1 and T3) had opposite-homozygous genotypes at each of the key SNPs, while the third (T2) was heterozygous at each key SNP. (See subsection 'GEDmatch demonstration’ in the Materials and methods for more details on the kit design.) We then used the GEDmatch 1-to-1 match tool, choosing the parameters so a single opposite-homozygous genotype between a bait and target kit would interrupt a putative IBS segment.

In each case, our two bait kits had the correct IBS patterns with the target kits, allowing correct determination of the target genotypes by IBS baiting. On the left of Figure 5, we show a zoomed-in view of the three targets’ matches around one of the key SNP sites. The homozygous targets have IBS matches with only one of the bait kits, whereas the heterozygous target has IBS matches with both bait kits. This pattern is seen across all four target regions (right side of Figure 5, see section 'GEDmatch demonstration’ of the Materials and methods for more detailed results). The target and bait kits displayed in Figure 5 were uploaded and analyzed on December 15, 2019, showing that GEDmatch has remained vulnerable to IBS-baiting attacks even after its acquisition by Verogen, which was announced on December 9, 2019.

Visualization of IBS baiting using GEDmatch’s 1-to-1 chromosome browser.

Left: Zoomed-in view of the region containing key SNP 1, showing the three target kits (T1–T3) matched to the two bait kits (B1 and B2). Right: Zoomed-out views of regions containing all four key SNPs on chromosome 22. For each pair of bait and target kits, the top rectangle (red, yellow, or green) shows the GEDmatch SNP-level pairwise genotype-match image (colored to show no match, half match, or full match) returned by the 1-to-1 GEDmatch tool. The bottom rectangle (black and blue) shows the GEDmatch IBD-track image, black for no putative IBD match, blue regions showing putative IBD segments. The white text on the IBS track is not provided by GEDmatch and was added as a guide to the eye. Opposite-homozygote calls at the key SNP are seen in the left panel as a red line in an otherwise matching region (yellow and green). The spatial positions of SNPs in the match panel appear to have been jittered, perhaps as a countermeasure against a Ney et al. (2020)-style attack; for example, the location of the red line varies slightly among plots that should share the same coordinate system.

Discussion

We have suggested several methods by which an adversary might learn the genotypes of people included in a genetic genealogy database that allows uploads. Our methods take advantage of both the population-genetic distributions of IBS segments and of methods used for calling IBS. In particular, IBS tiling works simply because there are background levels of IBS (and IBD) even among distantly related members of a population (e.g. Ralph and Coop, 2013). In our dataset, the median person had the majority of their genetic information susceptible to IBS tiling on the basis of other members of the dataset, depending on the procedures used for reporting IBS. IBS tiling performance will also depend on the ancestries of the target and comparison samples because IBD rates differ within and among populations (Palamara et al., 2012; Carmi et al., 2013; Ralph and Coop, 2013), as well as on the prevalence of close biological relatives in the dataset. IBS tiling performance improves as the size of the comparison sample increases. Thus, if enough genomes are compared with a target for IBS, eventually a substantial amount of the target genome is covered by IBS with one or more of the comparison genomes.

IBS probing combines the principles behind IBS tiling with the idea of 'IBS-inert’ artificial segments. If the majority of the genome—everywhere except a locus of interest—can be replaced with artificial segments that will not have IBS with any genome in the database, then the adversary knows that any matches identified are in a locus of interest. As such, IBS probing could be used to reveal sensitive genetic information about database participants even if chromosomal locations of matches are not reported to users.

Finally, IBS baiting exploits phase-unaware IBS calling algorithms that use incompatible homozygous sites to delimit putative IBS regions. Although such methods can be useful in genetic genealogy because they scale well to large data, they are vulnerable to fake datasets that include runs of heterozygous sites, which will be identified as IBS with everyone in the database. By inserting homozygous genotypes at key sites and heterozygotes everywhere else, we estimate that approximately 100 well-designed uploads could reveal enough genotypes to impute genome-wide information for any user in a database, provided that the threshold for reporting a matching segment is approximately 1 cM. Similarly, two uploads could reveal any genotype at a single site of interest, such as rs429358, which reveals whether the user carries an APOE-ε4 variant and is associated with risk of late-onset Alzheimer’s disease.

There are millions of people enrolled in genetic genealogy databases that allow uploads (Table 1). Genetic genealogy has many applications, and uploads are popular with users who want to find relatives who may be scattered across different databases. Though allowing uploads brings several benefits for both customers and DTC services, it also entails additional privacy risks. Users of DTC genetic genealogy services that allow uploads could be at risk of having their genetic information extracted by the procedures we describe here, depending on the methods that these services use to identify and report IBS. Concerns arising from the methods we report are in addition to standard digital security concerns. The attacks we describe require little special expertise in computing; the adversary only needs to be able to procure or create the appropriate data files and to process and aggregate the information returned from the database.

We have not set out to determine precisely how vulnerable users of each specific DTC service are. We do not know the full details of methods used by each service for matching, nor have we attempted to deanonymize any real users’ genotypes. We contacted representatives of each of the organizations listed in Table 1 on July 24th, 2019, 90 days before posting this manuscript publicly, in order to give them time to repair any security vulnerabilities related to the methods we describe. We have posted our emails to GEDmatch representatives here: https://github.com/mdedge/IBS_privacy/blob/master/IBS_baiting_demo/GEDmatch_emails.pdf.

On the basis of our results, we do have serious concerns about the privacy of GEDmatch users. As of this writing, GEDmatch uses length thresholds for displaying matching segments that are too short, allowing for effective IBS tiling attacks, and GEDmatch also appears to use phase-unaware IBD detection methods, allowing for IBS baiting attacks. Additionally, as detailed by Ney et al. (2020), whose work was independent of ours, GEDmatch provides users with high-resolution images comparing the chromosomes of any two users at SNP-level resolution, allowing for reconstruction of a target’s genotype using these images. GEDmatch was recently purchased by Verogen, a forensic genetics company, but as of December 15, 2019, GEDmatch has not as yet prevented the attacks we describe. Since our and Ney et al. (2020)’s initial communications with GEDmatch in July, GEDmatch has placed a reCAPTCHA on its upload and 1-to-1 tool forms. Though reCAPTCHA may deter bulk bot attacks to harvest large numbers of kit genotypes, it is still possible for a human to carry out small-scale attacks. Further, even as reCAPTCHA has improved at blocking non-human users in recent years, new attacks have been developed to bypass reCAPTCHA (Baecher et al., 2011; Brown et al., 2017; Zhou et al., 2018; Akrout et al., 2019). As we outline below, there are simple steps that could be taken to make IBS attacks much less of a risk.

In our estimation, the other active services listed in Table 1 (MyHeritage, FamilyTreeDNA, and LivingDNA) are likely substantially less vulnerable than GEDmatch to the attacks we describe here. LivingDNA does not provide a chromosome browser, precluding IBS tiling attacks. MyHeritage and FamilyTreeDNA use thresholds for revealing matching segment locations that make IBS tiling much less efficient. (However, FamilyTreeDNA’s practice of showing matches as short as 1 cM given that two people share at least one long match is still somewhat permissive, see Figure 2—figure supplement 2). Representatives of MyHeritage, FamilyTreeDNA, and LivingDNA have confirmed to us that their IBD-calling algorithms rely on phased data, which should preclude IBS baiting. (We have not tested this ourselves). DTC genetic genealogy is a growing field, and any new entities that begin offering upload services may also face threats of the kind we describe.

Genetic genealogy databases that allow uploads have been in the public eye recently because of their role in long-range familial search strategies recently adopted by law enforcement. In long-range familial search, investigators seek to identify the source of a crime-scene sample by identifying relatives of the sample in a genetic genealogy database that allows uploads. Searching in SNP-based genealogy databases allows the detection of much more distant relationships than does familial searching in traditional forensic microsatellite datasets (Rohlfs et al., 2012), vastly increasing the number of people detectable by familial search (Erlich et al., 2018; Edge and Coop, 2019). At this writing, both GEDmatch and FamilyTreeDNA have been searched in this way. Long-range familial search raises a range of privacy concerns (Syndercombe Court, 2018; Ram et al., 2018; Kennett, 2019; Scudder et al., 2019). One response from advocates of long-range search has been to note that 'raw genetic data are not disclosed to law enforcement… Search results display only the length and chromosomal location of shared DNA blocks’ (Greytak et al., 2018). However, the methods we describe here illustrate that there are several ways to reveal users’ raw genetic data on the basis of the locations of shared DNA blocks. Because companies that work with law enforcement on long-range familial searching—including Parabon Nanolabs and Bode Technology (Kennett, 2019)—now routinely upload tens of datasets to genetic genealogy databases, they may be accidentally accumulating information that would allow them to reconstruct many people’s genotypes.

Data breaches via IBS tiling, IBS probing, and IBS baiting are preventable. We have identified a set of strategies that genetic genealogy services could adopt to protect their genotype data from IBS-based attacks. We give a detailed list of these strategies in Appendix A (also summarized in Table 2). Broadly, the suggestions consist of restrictions on the types of datasets that can be uploaded, restrictions on the kinds of information shared with users, and restrictions on classes of methods used for identifying putative IBD segments. For example, to prevent IBS tiling, the simplest measures are either to forgo the use of a chromosome browser feature or only to show users the positions of long IBS segments, such as segments of at least 8 cM. To prevent IBS baiting, the most robust countermeasure is to phase data before identifying IBS segments, allowing only relatively few phase switches in any putative segment. Phasing the data and only reporting long segments both decrease the uncertainty of IBD calls and so may improve user experience as well. Finally, we also support the strategy of requiring encrypted signatures on uploaded files, proposed by Erlich et al. (2018), which would allow DTC databases to block any files that do not originate from trusted sources. Some of our suggestions limit the potential uses of genetic genealogy data, and users will vary in the degree to which they value these potential uses and in the degree to which they want to protect their genetic information.

Table 2
Potential countermeasures against the methods of attack outlined here, with their likely effectiveness against IBS tiling, IBS probing, and IBS baiting.
Strategy | Prevents IBS tiling | Prevents IBS probing | Prevents IBS baiting
Require cryptographic signature from genotyping service | Yes | Yes | Yes
Restrict reporting of IBS to long segments (e.g. >8 cM) | Partially | Partially | Weakly
Report number and lengths of IBS segments but not locations | Yes | No | Partially
Block homozygous uploads | Partially | No | No
Report small number of matching individuals per kit | Partially | Partially | Partially
Disallow matching between arbitrary kits | Partially | Partially | Partially
Block uploads of publicly available genomes | Partially | No | No
Block uploads with evidence of IBS-inert segments | No | Yes | No
Block uploads with long runs of heterozygosity | No | No | Partially
Use phase-aware methods for IBS detection | No | No | Yes

All these suggestions assume that genealogy services will maintain raw genetic data for people in their database. Another possibility would be for individual people instead to upload an encrypted version of their genetic data, with relative matching performed on the encrypted datasets, as has been suggested previously (He et al., 2014).

Our IBS tiling and IBS probing results focus on users of European ancestries, in part because most users of DTC genetic genealogy services appear to have substantial European ancestries. (DTC genetics companies generally do not release this kind of information on their users, but their research papers suggest that they have access to especially large samples with European ancestries; for example, a 23andMe paper on demography in the United States included almost 150,000 self-described European Americans and less than 10,000 each of self-described African Americans and Latino Americans [Bryc et al., 2015]. For a qualitatively similar sample composition in a study from Ancestry, see Han et al. (2017).) One question is how these results would generalize to other populations. Because IBD sharing is generally greater within populations than between populations (e.g. Ralph and Coop, 2013), potential users are more vulnerable if there are more publicly available genomes from people with similar ancestries. If IBD-detection algorithms are not well calibrated to differences in heterozygosity across populations, then spurious IBD calls will be more common in populations with lower heterozygosity, leading to greater risk of IBS tiling. Finally, we show in Figure 2—figure supplement 4 that in our sample, Finnish samples are more vulnerable to IBS tiling than other populations, which is likely due to Finns tracing substantial ancestry to a founder population that experienced a bottleneck ~100 generations ago (Kere, 2001). Members of other groups with similar demographic histories are likely to be at elevated risk of IBS tiling and IBS probing as well.

We have focused on genetic genealogy databases that allow uploads because at this writing, it is straightforward to download publicly available genetic datasets and to produce fake genetic datasets for upload. In principle, however, another way to perform attacks like the ones we describe would be to use biological samples. For example, a group of people willing to share their genetic data with each other could collaborate to perform IBS tiling by sending actual biological samples for genotyping. Even IBS probing and IBS baiting could be performed with biological samples by adversaries who could synthesize the samples. Though synthesizing such samples is technically challenging now, it may become easier in the future. Such methods could present opportunities to attack databases that do not allow uploads, such as the large databases maintained by Ancestry (>14 million) and 23andMe (>9 million) (Regalado, 2019). They would also thwart the countermeasure of requiring uploaded datasets to include a cryptographic signature indicating their source.

The IBS-based privacy attacks we describe here add to a growing set of threats to genetic privacy (Homer et al., 2008; Nyholt et al., 2009; Im et al., 2012; Gymrek et al., 2013; Humbert et al., 2015; Shringarpure and Bustamante, 2015; Edge et al., 2017; Ayday and Humbert, 2017; Kim et al., 2018; Erlich et al., 2018). A person’s genotype includes sensitive health information that might be used for discrimination, and people whose genetic information is compromised may be vulnerable to scams involving falsified relatives (Ney et al., 2020). Although there are many emerging threats to privacy, some of the more unsettling of which have nothing to do with genetics, genetic data do have special features that might require special considerations. In particular, genetic privacy concerns not only the person whose genotypes are directly revealed but also their relatives whose genotypes may be revealed indirectly (Humbert et al., 2013), a point highlighted by the use of genetic genealogy for long-range forensic searches (Erlich et al., 2018; Edge and Coop, 2019).

Although many forms of genetic discrimination are prohibited legally, rules vary between countries and states. For example, in the United States, the Genetic Information Nondiscrimination Act (GINA) protects against genetic discrimination in the provision of health insurance but does not explicitly disallow genetic discrimination in the provision of life insurance, disability insurance, or long-term care insurance (Bélisle-Pipon et al., 2019). In addition to measures for protecting genetic privacy in the short term, there is a need for more complete frameworks governing the circumstances under which genetic data can be used (Clayton et al., 2019).

Materials and methods

Data assembly

We performed IBS tiling with publicly available genotypes from 872 people of European ancestries. Of these 872 genotypes, 503 came from the EUR subset of the 20130502 release of phase 3 of the 1000 Genomes project (Abecasis et al., 2012), downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. This release set has been pruned to remove close biological relatives. The EUR subset includes the following population codes and numbers of people: CEU (Utah residents with Northern and Western European Ancestry, 99 people), FIN (Finnish in Finland, 99 people), GBR (British in England and Scotland, 91 people), IBS (Iberian Population in Spain, 107 people), TSI (Toscani in Italia, 107 people).

The remaining 369 were selected from samples typed on the Human Origins SNP array (Patterson et al., 2012), including 142 genotypes from the Human Genome Diversity Project (Cann et al., 2002). Specifically, we downloaded the Human Origins data from https://reich.hms.harvard.edu/downloadable-genotypes-present-day-and-ancient-dna-data-compiled-published-papers, using the 1240K+HO dataset, version 37.2. The 369 selected people were all contemporary samples chosen according to population labels. We also excluded people from the Human Origins dataset if they appeared in the 1000 Genomes dataset. The populations used for selecting data, along with the number of participants included after excluding 1000 Genomes samples, were as follows: 'Adygei’ (16), 'Albanian’ (6), 'Basque’ (29), 'Belarusian’ (10), 'Bulgarian’ (10), 'Croatian’ (10), 'Czech’ (10), 'English’ (0), 'Estonian’ (10), 'Finnish’ (0), 'French’ (61), 'Greek’ (20), 'Hungarian’ (20), 'Icelandic’ (12), 'Italian_North’ (20), 'Italian_South’ (4), 'Lithuanian’ (10), 'Maltese’ (8), 'Mordovian’ (10), 'Norwegian’ (11), 'Orcadian’ (13), 'Romanian’ (10), 'Russian’ (22), 'Sardinian’ (27), 'Scottish’ (0), 'Sicilian’ (11), 'Spanish’ (0), 'Spanish_North’ (0), and 'Ukrainian’ (9). The populations with 0 people included are those for which all the samples in the Human Origins dataset are included in the 1000 Genomes phase 3 panel. Samples with group labels marked 'ignore’ were excluded, including samples marked as close relatives.

We down-sampled the sequence data from the 1000 Genomes project to include only sites typed by the Human Origins chip. Of the 597,573 SNPs included in the Human Origins dataset, 558,257 sites appeared at the same position in the 1000 Genomes dataset, 557,999 of which appear as biallelic SNPs. For 546,530 of these, both the SNP identifier and position match in 1000 Genomes, and for 544,139 of them, the alleles agreed as well. We merged the datasets at the set of 544,139 SNPs at which SNP identifiers, positions, and alleles matched between the Human Origins and 1000 Genomes datasets.
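The matching rule can be illustrated with a short sketch. It is not the merge script itself, which used vcftools, bcftools, PLINK, and EIGENSOFT; the data layout (dictionaries keyed by chromosome and position) and the function name are hypothetical, alleles are compared as unordered pairs without any strand flipping (an assumption), and the biallelic filter is omitted.

```python
def shared_sites(ho, kg):
    """Keep sites present in both panels with the same position, SNP identifier,
    and alleles. ho and kg map (chrom, pos) -> (snp_id, allele1, allele2)."""
    kept = {}
    for site, (ho_id, ho_a1, ho_a2) in ho.items():
        if site not in kg:
            continue  # no SNP at this position in the second panel
        kg_id, kg_a1, kg_a2 = kg[site]
        if ho_id != kg_id:
            continue  # same position but different identifier
        if {ho_a1, ho_a2} != {kg_a1, kg_a2}:
            continue  # alleles disagree
        kept[site] = ho_id
    return kept

# Toy example: only the first site passes all three checks.
ho_panel = {
    ("22", 100): ("rs1", "A", "G"),
    ("22", 200): ("rs2", "C", "T"),
    ("22", 300): ("rs3", "A", "C"),
    ("22", 400): ("rs4", "G", "T"),
}
kg_panel = {
    ("22", 100): ("rs1", "G", "A"),
    ("22", 200): ("rs2x", "C", "T"),
    ("22", 300): ("rs3", "A", "G"),
}
merged = shared_sites(ho_panel, kg_panel)
```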

We used vcftools (Danecek et al., 2011), bcftools (Li, 2011), PLINK (Purcell et al., 2007), and EIGENSOFT (Price et al., 2006) to create the merged file. The script used to create it is maintained at github.com/mdedge/IBS_privacy/, and the merged data file is available at https://doi.org/10.25338/B8X619. A permanent version of the scripts used in the publication version of this paper is available with doi 10.5281/zenodo.3620958.

Phasing, IBS calling, and IBS tiling

We phased the combined dataset using Beagle 5.0 (Browning and Browning, 2007) with the default settings and genetic maps for each chromosome. We used linear interpolation to obtain the genetic map position of each SNP on the build GRCh37 LDhat genetic map (Frazer et al., 2007) downloaded from the Beagle website (http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/). We used Refined IBD software (Browning and Browning, 2013) to identify IBS segments, retaining segments of at least 0.1 centiMorgans (cM) with LOD scores >1. We also used Germline (Gusev et al., 2009) to identify IBS segments under alternative parameters, shown in the supplement. The resulting IBS segments were analyzed using the GenomicRanges package (Lawrence et al., 2013) in R (R Development Core Team, 2013). Scripts used for phasing, IBS calling, and IBS tiling are available at github.com/mdedge/IBS_privacy/.
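As an illustration of the interpolation step, the sketch below maps a physical position to a genetic position using a sorted genetic map. It is a minimal example rather than the script used here: the function name and the toy map values are hypothetical, and clamping positions that fall outside the map to its first or last value is an assumption.

```python
from bisect import bisect_right

def interpolate_cm(bp, map_bp, map_cm):
    """Linearly interpolate the genetic position (cM) of physical position bp.
    map_bp must be strictly increasing; map_cm gives the cM value at each entry."""
    if bp <= map_bp[0]:
        return map_cm[0]   # assumption: clamp before the first map entry
    if bp >= map_bp[-1]:
        return map_cm[-1]  # assumption: clamp after the last map entry
    j = bisect_right(map_bp, bp)  # first map entry strictly to the right of bp
    x0, x1 = map_bp[j - 1], map_bp[j]
    y0, y1 = map_cm[j - 1], map_cm[j]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)

# Toy map with three entries (values are made up for illustration).
toy_bp = [100, 200, 400]
toy_cm = [0.0, 1.0, 2.0]
mid_first = interpolate_cm(150, toy_bp, toy_cm)
mid_second = interpolate_cm(300, toy_bp, toy_cm)
```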

IBS probing

To generate IBS-inert genotypes for IBS probing in Figure 3, we computed allele frequencies within the set of 872 Europeans for chromosome 19. Allele frequencies less than 10% were changed to 10%, and then alleles were sampled at one minus their frequency. This strategy generates genetic data that look quite unlike real data, with the advantage (for the purposes of IBS probing) of being unlikely to return IBS matches anywhere. An adversary attempting IBS probing in a real database would need to tailor the approach to the quality control and IBS calling methods used by the database.
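A minimal sketch of this sampling scheme follows. It rests on one reading of the description, which is an assumption: each frequency is treated as the frequency of the minor allele, frequencies below 10% are raised to 10%, and the minor allele is then drawn with probability one minus that frequency. The function names and the 0/1 allele coding are hypothetical.

```python
import random

def inert_haplotype(minor_freqs, rng, floor=0.10):
    """Draw one haplotype; 1 codes the minor allele, 0 the major allele.
    The minor allele is drawn with probability 1 - max(f, floor)."""
    hap = []
    for f in minor_freqs:
        f = max(f, floor)
        hap.append(1 if rng.random() < 1.0 - f else 0)
    return hap

def inert_genotypes(minor_freqs, rng, floor=0.10):
    """Pair two independently drawn haplotypes into diploid genotypes."""
    h1 = inert_haplotype(minor_freqs, rng, floor)
    h2 = inert_haplotype(minor_freqs, rng, floor)
    return list(zip(h1, h2))

# Toy example: at rare-variant sites the minor allele is carried about 90% of the time.
rng = random.Random(1)
rare = inert_haplotype([0.01] * 20000, rng)
rare_fraction = sum(rare) / len(rare)
```

Because common alleles become rare and rare alleles become common, data produced this way look quite unlike real genomes.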

After inert genotypes were produced, we stitched them with real phased genotypes from windows around GRCh37 position 45411941 on chromosome 19, the site of SNP rs429358. SNP rs429358 is in the APOE locus; if a haplotype has a C at rs429358 and a C at nearby SNP rs7412, then that haplotype is said to harbor the APOE-ε4 allele, which confers risk for Alzheimer’s disease (Corder et al., 1993). rs429358 is not genotyped on the Human Origins chip, but it is included on recent chips used by both Ancestry and 23andMe. To simulate probing with a 1 cM threshold for matching, we pulled real data from a region of 1.9 cM around the site, and to simulate probing with a 3 cM threshold, we pulled real data from a region of 5.9 cM around the site. Distances in cM were computed by linear interpolation from a genetic map in GRCh37 coordinates. Scripts used to generate Figure 3 are available at github.com/mdedge/IBS_privacy/.

GEDmatch demonstration

On November 21st, 2019, we first uploaded artificial genetic datasets to GEDmatch’s research mode in order to demonstrate the possibility of IBS baiting. GEDmatch has not published details of its IBS detection procedures. However, the options available to users in the 1-to-1 match tool and the description of how those options can be used to ignore single-site matches led us to hypothesize that GEDmatch uses phase-unaware IBS detection and that the 1-to-1 match tool might be vulnerable to IBS baiting.

Description of GEDmatch 1-to-1 tool

GEDmatch’s 1-to-1 match tool allows the user to compare the IBS matches of any two genetic datasets (or, in GEDmatch parlance, 'kits’), as long as the kit numbers are known to the user. Thus, to identify the genotypes of many users an adversary would need access to the kit numbers of many users. The 1-to-many tool in default GEDmatch reports 3000 of the closest genetic relatives of any kit whose number is known to the user, and reports the kit numbers of those match kits (along with names and email addresses). Thus an adversary can iteratively search for all the kit numbers matching a known kit, and so obtain many kit numbers to use in 1-to-1 searches. We alerted GEDmatch to this issue with the 1-to-many tool, as nearly the entire GEDmatch database of kit numbers and genetic relationships could be scraped.

The 1-to-1 match tool allows the user to specify parameters that govern IBS calling. In particular, the user can specify the minimum cM length of the blocks (down to 0.1 cM) and the minimum number of SNPs in a block (down to 25 SNPs). GEDmatch also allows the user to specify the ‘mis-match bunch limit,’ which appears to be the minimum number of IBS-compatible SNPs after an opposite-homozygous site that are required in order for a second opposite-homozygous site not to break the IBS segment.

Ethics

In order to comply with GEDmatch’s terms and conditions, we used artificial datasets designed not to match any genuine genetic data uploaded to GEDmatch. The kits were uploaded in 'Research’ mode, where they are not visible to other users via 1-to-many search. We did not interact with any other users’ data; we ran GEDmatch’s 1-to-1 comparison tools only comparing among our artificial kits. We exercised care not to interact with any other tools and to avoid accidental discoveries. Prior to uploading the artificial datasets, we also consulted with the UC Davis Institutional Review Board (IRB) to ensure that these uploads do not constitute human subjects research. Upon receiving confirmation from the IRB that our uploads do not constitute human subjects research and before uploading the datasets, we alerted GEDmatch that we would be making the uploads, and we also shared the kit numbers with them after we had completed our analyses.

Construction of artificial datasets

We constructed artificial ‘target’ and ‘bait’ kits using the SNPs included in the 23andMe v4 chip. (The ‘target’ kits are the targets of inference, and the ‘bait’ kits are designed to reveal their genotypes.) We identified the alleles at these SNP positions in the 1000Genomes dataset, along with their frequencies in the EUR subset of 1000Genomes. We assigned as missing (‘- -’) any SNP that we could not match by position in 1000Genomes. We chose four target SNPs at random on chromosome 22, drawing from the set of strand-unambiguous polymorphisms, that is, SNPs that are not A/T or G/C. These strand-unambiguous sites include the majority of SNPs on the chip, for example 89% of the SNPs on the 23andMe chip on chromosome 22.
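A/T and G/C SNPs are strand-ambiguous because reading the complementary strand yields the same pair of alleles, so the strand cannot be inferred from the alleles alone. A minimal sketch of such a filter follows; the function name and the example SNP list are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_unambiguous(allele1: str, allele2: str) -> bool:
    """True if the two alleles of a biallelic SNP are not complements of each other."""
    return COMPLEMENT[allele1] != allele2


# Hypothetical example: keep only strand-unambiguous SNPs as candidate targets
snps = [("rs_a", "A", "G"), ("rs_b", "A", "T"), ("rs_c", "C", "G"), ("rs_d", "C", "T")]
candidates = [name for name, a1, a2 in snps if is_strand_unambiguous(a1, a2)]
```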

Target genomes

We uploaded three artificial target genome kits (T1-T3). These vary in their genotypes at the target SNPs. T1 and T3 are homozygous for different alleles; T2 is heterozygous. At the rest of the loci, we constructed genotypes by randomly sampling alleles according to their frequencies at each SNP. Thus, there is no LD among loci.
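The sampling scheme for non-target loci can be sketched as follows, assuming a hypothetical input list of (allele 1, allele 2, frequency of allele 2) for each SNP. Because every SNP is drawn independently of its neighbors, the resulting kit carries no LD.

```python
import random
from typing import List, Tuple


def sample_genotype(allele1: str, allele2: str, freq2: float,
                    rng: random.Random) -> Tuple[str, str]:
    """Draw two allele copies independently; each is allele2 with probability freq2."""
    return tuple(allele2 if rng.random() < freq2 else allele1 for _ in range(2))


def make_target_background(snps: List[Tuple[str, str, float]],
                           rng: random.Random) -> List[Tuple[str, str]]:
    """Sample one genotype per SNP, independently across SNPs (hence no LD)."""
    return [sample_genotype(a1, a2, f, rng) for a1, a2, f in snps]
```

The genotypes at the target SNPs themselves would then be overwritten with the chosen values for each target kit.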

Bait genomes

We uploaded two artificial bait genome kits. These two kits have opposite-homozygote genotypes at each of the target SNPs. The two bait uploads were then set to have identical genotypes in the rest of their autosomes, with their genotypes specified as below.

To create a region around the target that would bait a phase-unaware method into calling IBD, we took SNPs in the 0.6cM on either side of the target SNP, selected at random 22 on each side, and set them to be heterozygous in both bait genomes. The rest of the SNPs within this bait region were set to be missing. We used only 22 heterozygous SNPs on each side and filled in the rest with missing data (rather than making all sites heterozygous) because kits with large numbers of heterozygous sites generated an error on upload (‘HTZ string too long’) and were not processed further. Blocking uploads with long runs of heterozygous sites is a countermeasure put in place by GEDmatch after we and Ney et al. (2020) initially alerted GEDmatch to the risks of upload-based privacy attacks. However, we found that the countermeasure was not triggered by runs of heterozygous sites with missing sites interspersed, and these runs of heterozygosity interspersed with missingness also effectively baited GEDmatch into calling IBD segments. Additionally, we confirmed with Peter Ney (personal communication) that his previously uploaded kits including long runs of heterozygosity remain active even though re-uploads of those same kits are blocked as of December 3rd, 2019, suggesting that the block applies only to newly uploaded kits and not to existing data on GEDmatch.

The alleles in bait kits at all other autosomal SNPs in the genome were drawn at random with frequency 1-p, where p is the frequency in the 1000Genomes EUR subsample. This scheme was chosen to ensure that the bait genomes were unlikely to have spurious IBS matches anywhere with any target genome, so that the only potential IBS was in the target regions.

Detailed results of baiting

We compared each target to both bait genomes using the 1-to-1 GEDmatch tool. We set the minimum block to a length of > 0.7cM and 25 SNPs, with a mismatch cutoff of 25 SNPs. This ensured that we could detect IBS in the key regions, but that a single opposite-homozygous mismatch would be sufficient to prevent the identification of a putative IBS segment in the key region.

The baiting attempt was successful; we observed IBS only where we expected it between bait and target kits (Figure 5). We observed no putative IBD segments on any chromosome except 22, as expected on the basis of our procedure for filling in artificial genotypes in both sets of kits. The details of the matches on chromosome 22 are reported in Table 3. We observed 4 putative IBD segments overlapping our target bait regions in the comparisons with matching homozygote genotypes at the bait site, that is, in the T1-B1 and T3-B2 comparisons, as well as in both heterozygote-homozygote comparisons, that is, T2-B1 and T2-B2. We observed no putative IBD segments in the pairs with opposite-homozygous mismatches, T1-B2 and T3-B1. Thus the genotypes of the targets are readily discernible from the putative IBD segments output by GEDmatch. The full results returned by GEDmatch are available as images at https://github.com/mdedge/IBS_privacy/tree/master/IBS_baiting_demo (the kit numbers are redacted to prevent reuse).

Table 3
Summary of the SNPs targeted by baiting and the IBS returned by GEDmatch.

For each region, we give the position of the key SNP (target bp). Because by design our bait kits are genetically identical outside of the target SNPs, the IBS regions returned by GEDmatch’s 1-to-1 tool are identical across bait kits generating a match. For each pairwise comparison, we report the IBS information returned: the left and right bp boundaries of the IBS region (IBS L bp and IBS R bp), its cM length, and the number (#) of SNPs in the IBS region when the bait genotype at the target SNP is non-missing. We also report the number (#) of SNPs spanned by the IBS region when the target is matched to Bmiss, the bait kit with a missing genotype at the target SNP.

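| Matching pairs | Target 1 | Target 2 | Target 3 | Target 4 |
| --- | --- | --- | --- | --- |
| target bp | 27613130 | 34024097 | 37673781 | 42008068 |
| T1-(B1 Bmiss) | | | | |
| IBS L bp | 27427698 | 33771672 | 37519864 | 40054428 |
| IBS R bp | 27680780 | 34328741 | 37827711 | 43112674 |
| IBS cM | 1.3 | 0.8 | 1.1 | 1.2 |
| # SNPs | 47 | 45 | 42 | 40 |
| # SNPs Bmiss | 46 | 44 | 41 | 39 |
| T2-(B1 B2 Bmiss) | | | | |
| IBS L bp | 27433179 | 33771672 | 37508507 | 40357667 |
| IBS R bp | 27680780 | 34328741 | 37827711 | 43112674 |
| IBS cM | 1.3 | 0.8 | 1.2 | 0.9 |
| # SNPs | 45 | 45 | 45 | 32 |
| # SNPs Bmiss | 44 | 44 | 44 | 31 |
| T3-(B2 Bmiss) | | | | |
| IBS L bp | 27433179 | 33771672 | 37519864 | 40357667 |
| IBS R bp | 27680780 | 34328741 | 37827711 | 43112674 |
| IBS cM | 1.3 | 0.8 | 1.1 | 0.9 |
| # SNPs | 45 | 45 | 45 | 32 |
| # SNPs Bmiss | 44 | 44 | 41 | 31 |
| Tmiss-(All Baits) | | | | |
| IBS L bp | 27433179 | 33771672 | 37519864 | 40357667 |
| IBS R bp | 27680780 | 34328741 | 37827711 | 43112674 |
| IBS cM | 1.3 | 0.8 | 1.1 | 0.9 |
| # SNPs | 44 | 44 | 44 | 31 |
| # SNPs Bmiss | 44 | 44 | 44 | 31 |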

Some of the IBS blocks have fewer SNPs than we expect. We believe this to be due to the removal of SNPs during the tokenization stage, during which rare SNPs and SNPs with strand-ambiguous alleles seem to be removed (Ney et al., 2020). We did not investigate this further, but multiple uploads could be used to determine the approximate criteria for SNPs to be included, and hence determine where an adversary should set cutoffs.

Both bait kits could also generate IBS matches to a target whose genotype at the target SNP is missing rather than heterozygous. To determine whether a genotype was missing, we implemented a trick borrowed from Ney et al. (2020): we uploaded a third bait kit (Bmiss) with the target SNP set to missing (i.e. ‘- -’) and then looked at the number of SNPs spanned by an IBS match across the target site. In each case, the non-missing baits (B1 and B2) generated an IBS block match with T1-T3 that was one SNP longer than the IBS block generated by the Bmiss bait (Table 3). Comparing these baits to a new target with a missing genotype at each target site (Tmiss), we see that in each pairwise comparison the IBS blocks are the same number of SNPs long regardless of whether the bait genotype at the target SNP was missing (Table 3). Therefore, we can distinguish a heterozygous target from a missing one by using a third bait kit and inspecting the number of SNPs included in an IBS match.
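The decision logic described above can be summarized in a short Python sketch. The function name, its inputs, and the allele labels are hypothetical; each count is the number of SNPs in the IBS block spanning the target site for one bait comparison, or None if no block was reported.

```python
from typing import Optional


def infer_target_genotype(n_b1: Optional[int], n_b2: Optional[int],
                          n_bmiss: Optional[int],
                          allele_b1: str, allele_b2: str) -> str:
    """Infer the target genotype at one site from three bait comparisons.

    B1 is homozygous for allele_b1, B2 is homozygous for allele_b2, and
    Bmiss has a missing genotype at the target SNP.
    """
    if n_b1 is not None and n_b2 is None:
        return allele_b1 + allele_b1   # only B1 matches: homozygous for B1's allele
    if n_b2 is not None and n_b1 is None:
        return allele_b2 + allele_b2   # only B2 matches: homozygous for B2's allele
    if n_b1 is not None and n_b2 is not None:
        if n_bmiss is None:
            return "heterozygous or missing"
        # A called target genotype adds one SNP relative to the Bmiss comparison
        if n_b1 == n_bmiss + 1 and n_b2 == n_bmiss + 1:
            return allele_b1 + allele_b2
        if n_b1 == n_bmiss and n_b2 == n_bmiss:
            return "missing"
    return "undetermined"
```

For example, with the SNP counts reported for the first target site in Table 3 and placeholder alleles A and G, the T2 comparisons (45, 45, and 44 SNPs) indicate a heterozygote, whereas the Tmiss comparisons (44, 44, and 44 SNPs) indicate a missing genotype.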

The possibility of IBS-baiting-like procedures also interacts with the vulnerabilities arising from the presentation of SNP-level visualizations explored by Ney et al. (2020). Even if short IBS blocks were not reported to the user explicitly, it is clear from the zoomed-in view that we can see the target mismatches in question (see Figure 5). One measure that GEDmatch appears to have taken against a Ney et al. (2020)-style attack is to jitter the positions of SNPs in their visualization slightly. However, an attacker could counter such jittering by embedding key sites in runs of heterozygosity, making it easier to identify them in visualizations after jittering. Thus, the images displayed by GEDmatch still pose additional security risks.

Appendix 1

Detailed rationale for proposed countermeasures

Here, we detail the rationale and possible advantages and disadvantages of the countermeasures listed in Table 2.

  1. Require uploaded files to include cryptographic signatures identifying their source. This recommendation was initially made by Erlich et al. (2018). Under this suggestion, DTC genetics services would cryptographically sign the genetic data files they provide to users. Upload services might then check for a signature from an approved DTC service on each uploaded dataset, blocking datasets from upload otherwise. An alternative procedure that would accomplish the same goal would be for the DTC entities to exchange data directly at the user’s request (Ney et al., 2018). Such a procedure would allow upload services to know the source of the files they analyze and to disallow uploaded datasets produced by non-approved entities and user-modified datasets. All the methods we describe require the upload of multiple genetic datasets. As such, requiring cryptographic signatures would force the adversary to have multiple biological samples analyzed by a DTC service in order to implement any of our procedures, and IBS probing and IBS baiting would require synthetic samples, which are much harder to produce than fake datasets. Another benefit of this approach is that it would protect research participants against being reidentified using DTC genetic genealogy services (Erlich et al., 2018). A disadvantage of this strategy is that it requires the cooperation of several distinct DTC services.

  2. Restrict reporting of IBS to long segments. Reporting short IBS segments increases the typical coverage of IBS tiling (Figure 2) and IBS probing (Figure 3), as well as the efficiency of IBS baiting. Very short blocks may be of little practical utility for genetic genealogy (Huff et al., 2011). Reporting only segments longer than 8 cM would substantially limit IBS tiling attacks. A partially effective variant of this strategy is to report short segments only for pairs of people who share at least one long segment (Figure 2—figure supplement 2). One disadvantage is that short segments, though less reliably inferred than longer segments, may still be of interest to genealogists.

  3. Do not report locations of IBS segments. Another tactic for preventing IBS tiling is not to report chromosomal locations at all. If chromosomal locations are not reported, IBS tiling as we have described it becomes impossible.

  4. Block uploads of genomes with excessive homozygosity. IBS tiling is especially informative if genotypes that are homozygous for phased haplotypes are uploaded, so blocking genomes with excessive homozygosity presents a barrier to IBS tiling attacks. However, runs of homozygosity occur naturally (Pemberton et al., 2012), and allowing for naturally occurring patterns of homozygosity would leave a loophole for an adversary who could upload many genotypes that include homozygous regions and use only those regions for tiling.

  5. Report only a small number of putative relatives per uploaded kit. Reporting only the closest relatives (say the 50 - 100 closest relatives) of an uploaded kit would decrease the efficiency of all the methods we describe here. Only a small number of people could have their privacy compromised by each upload. This countermeasure comes with costs to genealogists, who may want to explore as many matches as possible in order to build family trees.

  6. Disallow arbitrary matching between kits. Some services allow searches for IBS between any pair of individuals in the database. Allowing such searches makes all potential IBS attacks easier. This countermeasure might hamper the investigations of genealogists exploring complex hypotheses about relatedness.

  7. Block uploads of publicly available genomes. There are now thousands of genomes available for public download, and these publicly available genomes can be used for IBS tiling. Genetic genealogy databases could include publicly available genomes (potentially without allowing them to be returned as IBS matches for typical users) and flag accounts that upload them. This strategy would go some distance toward blocking IBS tiling, but it could be thwarted in several ways, for example by uploading genetic datasets produced by splicing together haplotypes from publicly available genomes.

  8. Block uploads with evidence of IBS-inert segments. IBS-inert segments—that is false genetic segments designed to be unlikely to be IBS with anyone in the database—are key to IBS probing. Some methods for constructing IBS-inert segments are easy to identify, but others may not be. If a database is large enough, genomes with IBS-inert segments could be identified by looking for genomes that have much less apparent IBS with other database members than might be expected.

  9. Block uploads with long runs of heterozygosity. Long runs of heterozygosity do not arise naturally but are key to the IBS baiting approaches we describe here. Blocking genomes with long runs of heterozygosity—or alternatively, blocking genomes that have much more apparent IBS with a range of other database members than expected—would hamper IBS baiting. However, this countermeasure might be hard to apply to a small-scale IBS baiting attack, where only one or a few short runs of heterozygosity might be necessary. In our sample, the longest run of heterozygosity (in terms of number of SNPs) consisted of 38 SNPs and spanned .06 cM. This suggests that filtering out long runs of heterozygosity might be a promising strategy, though identifying a specific procedure would require more careful consideration of variation in non-European populations and of the composition of commercial SNP chips (including SNP density and allele frequencies).

  10. Use phase-aware methods for IBS detection. Although calling IBS by looking for long segments without incompatible homozygous genotypes scales well to large datasets, such methods are easy to trick, allowing IBS baiting approaches. Phase-aware methods are harder to trick, and faked samples may also stand out as unusual during the process of phasing, raising more opportunities for quality-control checks.
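As an illustration of countermeasure 9, the following Python sketch measures the longest run of heterozygous genotypes in an uploaded kit. It is a hypothetical check, not a description of any service's implementation. The `skip_missing` option reflects the observation reported above that runs of heterozygous sites interspersed with missing sites were not blocked: when it is set, missing calls neither extend nor break a run.

```python
from typing import List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None encodes a missing call ('- -')


def longest_het_run(genotypes: List[Genotype], skip_missing: bool = True) -> int:
    """Length, in called SNPs, of the longest run of heterozygous genotypes."""
    longest = current = 0
    for g in genotypes:
        if g is None:
            if not skip_missing:
                current = 0  # treat a missing call as ending the run
            continue
        if g[0] != g[1]:
            current += 1
            longest = max(longest, current)
        else:
            current = 0      # a homozygous call ends the run
    return longest
```

A service could flag kits whose longest run exceeds a threshold calibrated on genuine data. Any specific threshold would be an assumption requiring the further consideration of populations and chip composition noted in point 9.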

References

  4. P Baecher, N Büscher, M Fischlin, B Milde (2011) Breaking reCAPTCHA: A Holistic Approach via Shape Recognition. In: J Camenisch, S Fischer-Hübner, Y Murayama, A Portmann, C Rieder, editors. Future Challenges in Security and Privacy for Academia and Industry. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 56–67.
  7. SS Brown, N DiBari, S Bhatia (2017) I am 'totally' Human: Bypassing the Recaptcha. 2017 13th International Conference on Signal-Image Technology Internet-Based Systems.
  26. KA Frazer, DG Ballinger, DR Cox, DA Hinds, LL Stuve, et al., International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861. https://doi.org/10.1038/nature06258
  40. M Humbert, E Ayday, J-P Hubaux, A Telenti (2013) Addressing the concerns of the Lacks family: quantification of kin genomic privacy. Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. pp. 1141–1152. https://doi.org/10.1145/2508859.2516707
  47. L Larkin (2017) Cystic fibrosis: a case study in genetic privacy. The DNA Geek. Accessed July 1, 2019.
  48. L Larkin (2018) Database sizes—September 2018 update. The DNA Geek. Accessed July 1, 2019.
  63. R Development Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  68. A Regalado (2019) More Than 26 Million People Have Taken an at-Home Ancestry Test. MIT Technology Review.

Decision letter

  1. Magnus Nordborg
    Reviewing Editor; Austrian Academy of Sciences, Austria
  2. Mark I McCarthy
    Senior Editor; Genentech, United States
  3. Amy L Williams
    Reviewer; Cornell University, United States
  4. Shai Carmi
    Reviewer; The Hebrew University of Jerusalem, Israel

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The manuscript shows that genetic databases, like other databases, are vulnerable to being (ab)used in a manner not foreseen by their owners. The presented scenarios are very realistic and I sincerely hope that this article will spur both genetic service firms and politics into action.

Decision letter after peer review:

Thank you for submitting your article "Attacks on genetic privacy via uploads to genealogical databases" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Magnus Nordborg as the Reviewing Editor and Mark McCarthy as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Amy L Williams (Reviewer #1); Shai Carmi (Reviewer #2). The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Direct-to-consumer genetics services are increasingly popular for genetic genealogy, with millions of customers as of 2019. Several DTC genealogy services allow users to upload their own genetic data in order to search for genetic relatives. This paper demonstrates that such services can also be exploited to reveal much about the genome of individuals in the database, without their consent, and with potentially adversarial consequences.

Essential revisions:

A major concern was whether you intend this paper as a hypothetical discussion, or as a real-world demonstration. The latter would require real-world examples. Either way, it would be necessary to try to specify the conditions under which this works. For example, your IBD baiting method depends sensitively on the algorithms used by the provider. Thus, at a minimum, we would like to see a general discussion of the limitations of these exploits, and of the parameters that affect their efficacy.

Needless to say, you should also discuss the recent work of Ney et al., which you are aware of.

Reviewer #1:

Edge and Coop outline several means of inferring genotypes of individuals in genetic databases that allow users – specifically an attacker – to upload genetic data. There are three such attacks: (1) IBS tiling, wherein an attacker utilizes inferred IBS regions to N (known) samples to reconstruct the genotypes of others in the database in a genome-wide fashion. (2) IBS probing, with the aim of inferring genotypes at a specific locus (for example, the APOE locus). And (3) IBS baiting, with a more systematic form of probing that, given multiple uploads, may be able to collect enough genotypes to allow imputation of a person's common variants genome-wide.

The paper is of broad interest given the robust interest in direct-to-consumer genetic testing and the advent of long-range familial searches by law enforcement. The authors have responsibly handled their discoveries, notifying the various companies that host databases vulnerable to these attacks.

Before it can be published, a few technical topics should be addressed:

1) To perform IBS baiting effectively, the (unphased) IBS detector would need to break segments at opposite homozygous genotypes. However, in practice most such detectors would likely be tolerant to errors, and therefore may not break the segments in the fashion the text assumes. This seems like it may pose a serious issue to such an attack, and this merits more careful consideration.

2) The Supplementary Figure 4 legend says, "All other arguments were kept at their default values. Calling IBS without respect to genotype phase returns many IBS segments, but less can be learned about each segment via tiling than if haplotype phase is respected." Note that services that I'm aware of do not provide phase to the users, nor allow users to upload phased data. (Please say if this is not the case.) Given this, it's hard to make use of the potential benefits of phased data. In general, this issue merits a more prominent mention – the fact that, if phase were known, the amount of information learned by an attacker is greater, and that (unless services give this information), it's hard to envision how an attacker could put this into practice.

3) "(We consider some alternative IBS reporting procedures in the supplement.)" Is this about the supplementary figures? If so, perhaps cite them. I wasn't sure if there was text that I missed?

4) Optionally, the Discussion may wish to expand on the proposals that inhibit genealogical research, which end users may prefer not to have implemented. Points 2, 4, and 5 (since inferring other relatives' relationships can be useful) fall in this category.

Reviewer #2:

Conflict of interest:

The authors have shared the manuscript with me a few weeks ago. I have discussed my thoughts with the authors and they have revised their manuscript according to some of my comments, acknowledging me at the end of the paper. The review below is a slightly edited and expanded version of the remaining comments.

General assessment:

Edge and Coop describe a method for breaching privacy of genomes deposited in genetic genealogy matching services. The key idea is that a genetic match between a (known) uploaded genome and a target genome reveals some DNA sequence of the target. This can be exploited to recover large proportions of the target's genome. This is a novel and innovative approach. The manuscript is very interesting and thought provoking and the results are important. The analysis is overall sound (but see the points below). I am certain that the results will have major implications on genetic genealogy, genetic privacy, and beyond.

1) A result I find intriguing is the high proportion of the genome that can be covered by IBS tiling, which is comparable only to what was previously seen in founder populations. I suggest the following. (All the requested data should be already available to the authors or require very quick experiments.)

a) Show a breakdown of the coverage by ethnicity (at least for the main ethnicities), to make the results a bit more comparable to previous studies.

b) Further emphasize the message that segments need not be IBD (identity-by-descent) to allow privacy breaching – rather, IBS (identity-by-state) is good enough. This explains to some extent why the proportions of genome covered are so high, even though these are (mostly) not founder populations.

c) I think the results of Figure 2—figure supplement 3 are important, and should perhaps be reported as the main results. The reason is that when running Germline in haploid mode with no errors allowed, we are guaranteed no mismatches between the target and the (known) uploaded genome. In other words, we have an exact match to at least one haplotype of the target. (The authors can even improve performance easily by using a diploid mode but allowing no errors. Germline would still require a perfect match, but would allow phasing errors if they happen between blocks.) If the authors choose not to change the order of figures, I would recommend to at least report the mismatch rate between the haplotypes that were found to be matching by RefinedIBD in the main analysis (Figure 2).

d) Regarding Supplementary Figure 4, I think this figure might be somewhat misleading. The problem with the approach taken to generate that figure (if I'm not mistaken) is that Germline will not try to match any sites where either the target or the uploaded genome are heterozygous. Thus, the coverage is likely inflated – there could be entire "covered" segments that provide very little information on the target. At the very least, the uploaded genome should be made artificially homozygous, so that we are guaranteed to have information on the homozygous genotypes of the target.

2) It will be important to evaluate the IBS tiling method against a very simple "null", in which each allele is predicted to be the major (more frequent) allele. In other words, the outcome would be not the proportion of the genome covered, but the proportion of alleles of the target correctly inferred, and this outcome should be compared between IBS tiling and just using the major allele. While this experiment may take a little time (but I believe no more than a few days), I believe it is essential, because otherwise it is difficult to evaluate the success of the proposed method.

3) I am not confident whether these very elegant results form a practical and immediate risk of privacy, or whether the paper is more of a proof of concept. The biggest problem is with IBS baiting. The success of this approach relies on an IBD detection algorithm that is very simple minded. It is not clear to me whether any of the companies is actually using such an algorithm. But more generally, the authors did not demonstrate an actual recovery of genomic material from a genetic genealogy service using any of their methods. Of course, they would not want, and should not, violate the terms and conditions of any company. But I think that if using research genomes (such as 1000 Genomes) or their own genomes, and limiting the experiment in duration and scale, this would be legitimate. Or the authors could even explicitly ask the companies' managements for permission.

This is not to say that the article is not worthwhile without such experiments. On the contrary, the paper describes a very novel approach, and it would be extremely important and urgent that the proposed techniques become known to all stakeholders in personal genomics, both from the industry and from the academia, as well as the actual participants. Also, additional experiments may take too much time or be outside the scope of the present paper.

But as happens frequently with this kind of paper, once it is published, the media and the general public cannot get to the bottom of such subtle nuances (even if authors do their very best). I expect the paper will be very widely covered, and with some likelihood, it could develop into a total media circus and trigger panic. I think that would be an unfortunate consequence, unless there is a real, tangible risk of privacy breaching. If the risk is more theoretical in nature, it will be important to say so explicitly (and possibly drop the part about the letters to the representatives of the companies, which is only going to amplify the drama).

Reviewer #3:

Edge and Coop describe a battery of methods that seek to recover parts of a personal genome through segment matching queries in a direct-to-consumer database that facilitates uploads. Specifically, the authors describe methods for tiling the hacked genome with matched segments, probing it for the genotype at a particular locus, or baiting it to match contrived genomes, designed to recover the genotype at a particular site.

The paper is technically sound. Methodologically, it puts together ideas that had been floated, and actually evaluates them rigorously. In the context of the genetic privacy field, this constitutes an advancement.

1) This reviewer believes that genetic privacy as a whole is overblown. The impacts of violating it are not substantial, and accepting such work in broadly read venues panders to irrational fears thus does science a disservice. While I don't fault the authors for pushing their work to a visible journal, making this more of a comment to the editor, I would nevertheless welcome the authors' rebuttal. Specifically, I would challenge the statements in the last paragraph of the Discussion regarding trait-predictability of traits. These are upper-bounded by the prediction accuracy implied by SNP heritability (accuracy which is markedly lower than the SNP heritability itself). More practically, the likely improvement in prediction does not mean convergence of prediction even to that bound. Worse, given the non-genetic data trace of individuals today, with more precious predictive value, genetic privacy is a distraction. An example ad absurdum, every street camera recovers my height better than my genome would.

2) The paper is somewhat thin in results (basically, Figures 2 and 3). In particular, Section 2.3 is falsely appearing under Results, whereas it only describes a method, without even applying it. This defeats the entire purpose of the manuscript, of actually demonstrating the attacks and quantifying their effectiveness. One quantitative question relevant to (defending against) the baiting attack has to do with feasibility of assembling all-het segments from naturally-occurring human haplotypes of chip SNPs. There are back-of-an-envelope reasons to assume those would not be long enough for the described attack, but actual data would be reassuring and consistent with the nature of contributions of this manuscript.

3) Relatedly, I am specifically concerned regarding the baiting security loophole being practical, as the authors' description of IBS baiting relies on a straw man IBS detector that they construct to have that weakness. As the authors point out, many actual detectors would not willy-nilly extend each segment till conflicting homozygous on both ends, or require some information content to seed a match between segments. Baiting may still be possible, but likely more complicated and potentially impractical.

4) The results reported are all w.r.t. the general European population. It is important to report the (different) results for other continental ancestries, and, on the other hand, in bottleneck populations.

https://doi.org/10.7554/eLife.51810.sa1

Author response

Essential revisions:

A major concern was whether you intend this paper as a hypothetical discussion, or as a real-world demonstration. The latter would require real-world examples. Either way, it would be necessary to try to specify the conditions under which this works. For example, your IBD baiting method depends sensitively on the algorithms used by the provider. Thus, at a minimum, we would like to see a general discussion of the limitations of these exploits, and of the parameters that affect their efficacy.

The reviewers’ comments in this area focus on IBS baiting. We have clarified that IBS baiting is possible in GEDmatch (as of November and December 2019) in two ways: (1) Section VII of Ney et al., which became public shortly after we submitted, executes an attack essentially equivalent to IBS baiting, and (2) we have performed IBS baiting in a small set of fake genetic datasets that we uploaded to GEDmatch in research mode. (These uploads do not violate GEDmatch’s terms and were determined not to be human subjects research by our university IRB. We also confirmed our plan with GEDmatch before initiating uploads.)

We have also expanded the discussion of the circumstances that would make the attacks harder or easier, clarifying the implications with respect to the various existing genealogy services.

Needless to say, you should also discuss the recent work of Ney et al., which you are aware of.

Yes, we have added discussion of Ney et al., which dovetails well with our results, at various relevant points. We mention this in the manuscript, but we want to state clearly here that the work of Ney et al. was entirely independent of ours; we were unaware of their work until after our manuscript was submitted to eLife and bioRxiv. Ney et al. described an exploit on the images provided by GEDmatch’s 1-to-1 search feature. We pointed out that these images were a major security weakness to GEDmatch in our July 24th email to them (all our emails to GEDmatch are now posted on the paper’s GitHub repository), but we did not pursue them further. The other exploit discussed by Ney et al. operates on the same principles as our IBS baiting procedure (described in their section VII).

Reviewer #1:

[…] The paper is of broad interest given the robust interest in direct-to-consumer genetic testing and the advent of long-range familial searches by law enforcement. The authors have responsibly handled their discoveries, notifying the various companies that host databases vulnerable to these attacks.

Before it can be published a few technical topics that should be addressed are:

1) To perform IBS baiting effectively, the (unphased) IBS detector would need to break segments at opposite homozygous genotypes. However, in practice most such detectors would likely be tolerant to errors, and therefore may not break the segments in the fashion the text assumes. This seems like it may pose a serious issue to such an attack, and this merits more careful consideration.

We agree that this is an important issue for planning a method of attack. On the basis of our exploration of GEDmatch’s features, we have always believed that it is susceptible to an IBS baiting attack. The flexibility of their 1-to-1 interface allows users to effectively force IBS blocks to be broken by single mismatches, as the user gets to choose the tolerance to errors by specifying the number of matching SNPs before a second mismatch is allowed. Indeed, the video that GEDmatch points users to demonstrates how to vary the tool’s parameters to allow single mismatches to be ignored (https://www.youtube.com/watch?v=7J2TGtcOYMs&feature=youtu.be). We had initially not gone ahead with the demonstration on GEDmatch because we wanted to avoid getting into detailed explorations of specific databases, and we initially also did not want to single out the only non-commercial database for a demonstration of an attack.

The Ney et al. paper (https://dnasec.cs.washington.edu/genetic-genealogy/ney_ndss.pdf) independently arrived at an attack that is essentially equivalent to IBS baiting as we describe it (with the modification of adding an additional file that reveals whether the site is missing in the target) and showed that it works in GEDmatch. The attack is described in section VII of their paper.

We have now demonstrated that IBS baiting works in GEDmatch. We first found this in late November, and we confirmed that the methods still work as of December 15th, after the acquisition of GEDmatch by Verogen. We have described these in depth in sections 2.3.3 and 4.4. We note that in response to our and Ney et al.’s initial emails, GEDmatch had put in place a number of countermeasures to try to block IBS-baiting-style attacks, notably blocks on long runs of heterozygosity. However, it only took a few trial uploads to work out the rough parameters of these checks and to design baiting kits that could circumvent them. GEDmatch has also put reCAPTCHA in place on their 1-to-1 tool page to block bots, which may block simple bot attacks that try to access large numbers of genotypes.

We have currently chosen not to publicly release the code to simulate our GEDmatch baits, to make it slightly less trivial to perform the hack. The code is very simple. We are happy to reconsider this decision if the editor or reviewers feel that the code is needed.

We have also added a bit more general discussion on this issue in the IBS baiting section. The key point is that the attacker needs to design a set of uploads where changes in called IBS (or lack of change in called IBS) with a target will reveal the target’s genotype at a known site. If the adversary can identify the algorithm, it may often be possible to design a strategy that reveals many genotypes, albeit perhaps not all of them, and perhaps requiring more uploaded datasets. To illustrate this, we have added a short discussion of what one might do if the database allows one incompatible homozygote within any called IBD segment.
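The role of error tolerance discussed in this exchange can be illustrated with a toy phase-unaware segment caller. This is only a sketch under stated assumptions: genotypes are coded as alternate-allele counts (0, 1, 2), a "mismatch" is a pair of opposite homozygotes (0 versus 2), and the function name `toy_ibs_segments` and its parameters are hypothetical. It is not the algorithm used by GEDmatch or any other service.

```python
from typing import List, Sequence, Tuple


def opposite_homozygotes(a: int, b: int) -> bool:
    """True when the two unphased genotypes cannot share any allele (0 vs 2)."""
    return (a == 0 and b == 2) or (a == 2 and b == 0)


def toy_ibs_segments(
    g1: Sequence[int],
    g2: Sequence[int],
    max_mismatches: int = 0,   # hypothetical tolerance parameter
    min_snps: int = 1,         # hypothetical minimum segment length in SNPs
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) SNP index ranges of maximal windows that
    contain at most `max_mismatches` opposite-homozygote sites."""
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    n = len(g1)
    breaks = [i for i in range(n) if opposite_homozygotes(g1[i], g2[i])]
    if len(breaks) <= max_mismatches:
        # Too few incompatible sites to break anything: one window overall.
        candidates = [(0, n - 1)] if n > 0 else []
    else:
        # Sentinels mark the positions just outside the chromosome.
        bounds = [-1] + breaks + [n]
        k = max_mismatches
        candidates = [
            (bounds[i] + 1, bounds[i + k + 1] - 1)
            for i in range(len(bounds) - k - 1)
        ]
    return [(s, e) for s, e in candidates if e - s + 1 >= min_snps]


g1 = [1, 1, 0, 1, 1, 1]
g2 = [0, 2, 2, 1, 0, 2]
strict = toy_ibs_segments(g1, g2, max_mismatches=0)    # breaks at index 2
tolerant = toy_ibs_segments(g1, g2, max_mismatches=1)  # one mismatch allowed
```

In this toy model a heterozygous genotype never triggers a break, which is consistent with why both the reviewers and the countermeasures described above focus on long runs of heterozygosity. With tolerance above zero the returned windows can overlap, so a real caller would need an additional rule for merging or choosing among them.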

2) The Supplementary Figure 4 legend says, "All other arguments were kept at their default values. Calling IBS without respect to genotype phase returns many IBS segments, but less can be learned about each segment via tiling than if haplotype phase is respected." Note that services that I'm aware of do not provide phase to the users, nor allow users to upload phased data. (Please say if this is not the case.) Given this, it's hard to make use of the potential benefits of phased data. In general, this issue merits a more prominent mention – the fact that, if phase were known, the amount of information learned by an attacker is greater, and that (unless services give this information), it's hard to envision how an attacker could put this into practice.

(We have deleted the former Supplementary Figure 4 in response to reviewer 2’s comments, but here is a response to the rest of the comment.) We agree that a service that provides phasing information and allows phased uploads would be easier to exploit, and that most of the services currently do not, which we have clarified. (GEDmatch actually does seem to allow phased matches to be shown---there is a color on their one-to-one match pictures intended to show a phased match---but we do not believe they are implemented in much generality. Since they’re there, though, we’ve avoided saying that none of the services used phased inputs.) We also have emphasized that one way around this concern is to upload homozygous genotypes, and we have emphasized that Figure 2—figure supplement 3 is a projection of the success one might obtain by doing so (since phase errors break IBS segments in Germline’s haploid mode). Filters can be set up to prevent uploads with unrealistic amounts of homozygosity, but an adversary could upload genomes with homozygous runs in key places that are not homozygous everywhere. Because long runs of homozygosity appear naturally, it may be difficult to filter out datasets with enough homozygous material to be useful for IBS tiling.

3) "(We consider some alternative IBS reporting procedures in the supplement.)" Is this about the supplementary figures? If so, perhaps cite them. I wasn't sure if there was text that I missed?

Yes, we meant the supplementary figures and should have been clearer about that. We have deleted this line from the Discussion and clarified the referencing of the supplementary figures in the Results section.

4) Optionally, the Discussion may wish to expand on the proposals that inhibit genealogical research, which end users may prefer not to have implemented. Points 2, 4, and 5 (since inferring other relatives' relationships can be useful) fall in this category.

Yes, we have added some short notes about the potential costs to legitimate genealogical pursuits of these countermeasures. (In response to a comment from reviewer 3, we have moved the detailed discussion of these points to the Appendix.)

Reviewer #2:

[…] Edge and Coop describe a method for breaching privacy of genomes deposited in genetic genealogy matching services. The key idea is that a genetic match between a (known) uploaded genome and a target genome reveals some DNA sequence of the target. This can be exploited to recover large proportions of the target's genome. This is a novel and innovative approach. The manuscript is very interesting and thought provoking and the results are important. The analysis is overall sound (but see the points below). I am certain that the results will have major implications on genetic genealogy, genetic privacy, and beyond.

1) A result I find intriguing is the high proportion of the genome that can be covered by IBS tiling, which is comparable only to what was previously seen in founder populations. I suggest the following. (All the requested data should be already available to the authors or require very quick experiments.)

a) Show a breakdown of the coverage by ethnicity (at least for the main ethnicities), to make the results a bit more comparable to previous studies.

We have added a supplementary figure (Figure 2—figure supplement 4) showing coverage for four EUR subgroups in 1000 Genomes, namely FIN (Finnish), IBS (Iberian in Spain), GBR (British), and TSI (Tuscany). We chose these four subgroups of the combined dataset because they are all large enough subsets to get decent estimates, they are reasonably clearly localized geographically, and they are of approximately equal size. As might be expected, the Finnish sample shows higher coverage by IBS tiling than other groups. We have added some comments about variation among groups in the success of IBS tiling to the Discussion.
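For readers unfamiliar with how a coverage figure of this kind can be computed, the following is a minimal sketch. It assumes that coverage means the fraction of a chromosome's length lying in the union of IBS segments shared with at least one comparison sample, and that segments are given as half-open base-pair intervals with non-negative coordinates. The function name and the toy numbers are illustrative and are not taken from the paper's pipeline.

```python
from typing import Iterable, Tuple


def covered_fraction(segments: Iterable[Tuple[int, int]], chrom_length: int) -> float:
    """Fraction of [0, chrom_length) covered by the union of [start, end) segments."""
    covered = 0
    merged_end = 0  # right edge of the region already counted
    for start, end in sorted(segments):
        start = max(start, merged_end)  # skip the part that was already counted
        if end > start:
            covered += end - start
            merged_end = end
    return covered / chrom_length


# Toy example: two overlapping tiles, one nested tile, and one separate tile.
tiles = [(0, 100), (50, 150), (20, 30), (300, 400)]
fraction = covered_fraction(tiles, chrom_length=1000)
```

Sorting first means overlapping or nested tiles are only counted once, which matters because many comparison samples can match the same region of the target.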

b) Further emphasize the message that segments need not be IBD (identity-by-descent) to allow privacy breaching – rather, IBS (identity-by-state) is good enough. This explains to some extent why the proportions of genome covered are so high, even though these are (mostly) not founder populations.

Thank you. We have emphasized this point by adding these sentences to section 2.1: “True IBD segments reveal more than mere IBS segments about shared genotypes because untyped variants (including rare variants) within an IBD segment are likely to be shared. At the same time, mere IBS is sufficient to infer sharing for SNPs that are genotyped within the segment.”

c) I think the results of Figure 2—figure supplement 3 are important, and should perhaps be reported as the main results. The reason is that when running Germline in haploid mode with no errors allowed, we are guaranteed no mismatches between the target and the (known) uploaded genome. In other words, we have an exact match to at least one haplotype of the target. (The authors can even improve performance easily by using a diploid mode but allowing no errors. Germline would still require a perfect match, but would allow phasing errors if they happen between blocks.) If the authors choose not to change the order of figures, I would recommend to at least report the mismatch rate between the haplotypes that were found to be matching by RefinedIBD in the main analysis (Figure 2).

We agree with this concern but have chosen to retain the figure order. The main reason is that during revision, we realized we had misread GEDmatch’s minimum cM threshold, which is in fact 0.1cM and not 1cM. Because of this, we wanted to add results for 0.1cM, but Germline’s options do not allow the user to search for such short IBS segments (perhaps reasonably for most applications), and so we stuck with refinedIBD for the main text. Still, we have added comments to the main text to emphasize the point raised by the reviewer and point readers to Figure 2—figure supplement 3.

d) Regarding Supplementary Figure 4, I think this figure might be somewhat misleading. The problem with the approach taken to generate that figure (if I'm not mistaken) is that Germline will not try to match any sites where either the target or the uploaded genome are heterozygous. Thus, the coverage is likely inflated – there could be entire "covered" segments that provide very little information on the target. At the very least, the uploaded genome should be made artificially homozygous, so that we are guaranteed to have information on the homozygous genotypes of the target.

We agree that the kind of coverage tracked by the former Supplementary Figure 4 is not very informative about the target for the reasons stated---a tile only reveals that the target is unlikely to have a homozygote opposite to the comparison within the region. We have removed it for now, though reviewer 1 appeared to find the figure informative, and we are open to reintroducing it with extra emphasis in the caption on the limited information gained from tiles in this case.

2) It will be important to evaluate the IBS tiling method against a very simple "null", in which each allele is predicted to be the major (more frequent) allele. In other words, the outcome would be not the proportion of the genome covered, but the proportion of alleles of the target correctly inferred, and this outcome should be compared between IBS tiling and just using the major allele. While this experiment may take a little time (but I believe no more than a few days), I believe it is essential, because otherwise it is difficult to evaluate the success of the proposed method.

We have added a supplementary figure that addresses this concern (Figure 2—figure supplement 5). We are not 100% sure of the alternative hypothesis being proposed and have not included a hypothesis test. (Does the alternative hypothesis allow prediction of the major allele outside IBS tiles? It seems to us that would be the most natural comparison.) But the figure supplement shows the median proportion of total alleles covered and the median proportion of minor alleles covered (minor alleles are ~19% of the total). The figure supplement suggests that there is a slight bias for IBS tiles to be in regions of lower SNP density and in regions with lower heterozygosity. However, the biases are relatively small, so, for example, with a 1cM threshold and all 872 samples included, IBS tiles cover a median of 52% of the minor alleles (as opposed to 57% of the total length in base pairs). Though we do not give these numbers in text, one can calculate easily from the figure legend that a guess of major alleles everywhere gives an average ~81% of alleles guessed accurately (the average major allele frequency), whereas tiling plus a guess of major alleles outside of the tiles gives ~91% accuracy for the median person (all the major alleles plus about half the minor alleles), an increase that is sure to be significant with the many loci considered.
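The arithmetic behind the ~81% and ~91% figures can be checked in a few lines. This is a sketch using the rounded values quoted above. It assumes that a major-allele guess is always right for major alleles, that minor alleles inside tiles are recovered correctly, and that "about half" of the minor alleles corresponds to the 52% median coverage at the 1cM threshold.

```python
# Rounded inputs quoted in the response above.
major_allele_fraction = 0.81                        # average major allele frequency
minor_allele_fraction = 1 - major_allele_fraction   # ~19% of alleles are minor
minor_alleles_in_tiles = 0.52                       # median share of minor alleles covered

# Baseline: guess the major allele everywhere.
baseline_accuracy = major_allele_fraction

# Tiling plus major-allele guesses outside the tiles: every major allele is
# still guessed correctly, and minor alleles are recovered only inside tiles.
tiling_accuracy = major_allele_fraction + minor_allele_fraction * minor_alleles_in_tiles

print(f"major-allele guess only: {baseline_accuracy:.1%}")
print(f"tiling + major-allele guess: {tiling_accuracy:.1%}")
```

Under these assumptions the gain over the baseline comes entirely from minor alleles that fall inside tiles, which is why coverage of minor alleles, rather than coverage by length, is the relevant quantity for this comparison.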

3) I am not confident whether these very elegant results form a practical and immediate risk of privacy, or whether the paper is more of a proof of concept. The biggest problem is with IBS baiting. The success of this approach relies on an IBD detection algorithm that is very simple minded. It is not clear to me whether any of the companies is actually using such an algorithm. But more generally, the authors did not demonstrate an actual recovery of genomic material from a genetic genealogy service using any of their methods. Of course, they would not want, and should not, violate the terms and conditions of any company. But I think that if using research genomes (such as 1000 Genomes) or their own genomes, and limiting the experiment in duration and scale, this would be legitimate. Or the authors could even explicitly ask the companies' managements for permission.

This is not to say that the article is not worthwhile without such experiments. On the contrary, the paper describes a very novel approach, and it would be extremely important and urgent that the proposed techniques become known to all stakeholders in personal genomics, both from the industry and from the academia, as well as the actual participants. Also, additional experiments may take too much time or be outside the scope of the present paper.

But as happens frequently with this kind of papers, once they are published, the media and the general public cannot get to the bottom of such subtle nuances (even if authors do their very best). I expect the paper will be very widely covered, and with some likelihood, it could develop into a total media circus and trigger panic. I think that would be an unfortunate consequence, unless there is a real, tangible risk of privacy breaching. If the risk is more theoretical in nature, it will be important to say so explicitly (and possibly drop the part about the letters to the representatives of the companies, which is only going to amplify the drama).

We thank reviewer 2 for this thoughtful comment. We have made several changes to address it. First, we note that section VII of the Ney paper already shows that IBS baiting has recently been possible in GEDmatch.

As discussed in the response to reviewer 1 we have now implemented a demonstration of IBS baiting against GEDmatch, and we present the results in sections 2.3.3 and 4.4. We have also added more discussion on the specific risks at the services listed in Table 1. Our general view is that the risks are substantial at GEDmatch and much lower (but still likely nonzero) at the other services.

Reviewer #3:

[…] 1) This reviewer believes that genetic privacy as a whole is overblown. The impacts of violating it are not substantial, and accepting such work in broadly read venues panders to irrational fears thus does science a disservice. While I don't fault the authors for pushing their work to a visible journal, making this more of a comment to the editor, I would nevertheless welcome the authors' rebuttal. Specifically, I would challenge the statements in the last paragraph of the Discussion regarding trait-predictability of traits. These are upper-bounded by the prediction accuracy implied by SNP heritability (accuracy which is markedly lower than the SNP heritability itself). More practically, the likely improvement in prediction does not mean convergence of prediction even to that bound. Worse, given the non-genetic data trace of individuals today, with more precious predictive value, genetic privacy is a distraction. An example ad absurdum, every street camera recovers my height better than my genome would.

We agree with many of the points raised here and have expanded the paragraph mentioned to emphasize the points of agreement. In particular, we agree that for many complex traits, prediction may be bounded at fairly low accuracy, even as sample sizes and models improve, and we have dropped the sentence noted by the reviewer. We also agree that there are many other threats to privacy, such as street cameras and many kinds of traces of our behavior online. These threats are much more revealing about many aspects of our lives than genetics (and doubtless this will remain true into the future). At the same time, one does not need to adopt a strong version of genetic exceptionalism to be concerned about a new form of data being leaked. As geneticists, we view it as important to highlight issues with genetic data.

These databases are growing rapidly and have garnered a lot of public attention, both through advertising by companies and news stories following the Golden State Killer. The fact that the genetic data of over a million people, notably through GEDmatch, may have been open to an adversarial attack does concern us and seems worthy of public attention. One reason for choosing eLife as a venue for the article is that we wanted the article to be fully open access so that the large communities of genetic genealogists could discuss the final version of the paper (and freely reuse figures etc.).

Genetic data privacy and sharing are evolving rapidly, and if no clear thinking on policy emerges, it will be too late to reverse decisions that have been made by default. The potential risks from genetic data breaches extend beyond the targeted person, at least to the target’s relatives. Further, others have also noted that there are potential national security issues arising from genetic data breaches (e.g. identification of covert operatives), and that people whose genetic information is compromised might be vulnerable to cyberattacks (e.g. an attacker might generate a genetic profile for a false relative to gain a person’s trust).

It may well be that in the long run, “genetic privacy” is not the right framing for these issues, and that we may need to move to a framework of ownership of genetic data, allowed uses of genetic data, and harms done by misuse. Still, we believe that making the potential risks clear as quickly as possible is a useful contribution to the public discussion.

2) The paper is somewhat thin in results (basically, Figures 2 and 3). In particular, Section 2.3 is falsely appearing under Results, whereas it only describes a method, without even applying it. This defeats the entire purpose of the manuscript, of actually demonstrating the attacks and quantifying their effectiveness. One quantitative question relevant to (defending against) the baiting attack has to do with feasibility of assembling all-het segments from naturally-occurring human haplotypes of chip SNPs. There are back-of-an-envelope reasons to assume those would not be long enough for the described attack, but actual data would be reassuring and consistent with the nature of contributions of this manuscript.

In response to this and the other reviewer comments, we have added an application of IBS baiting in GEDmatch, similar to the analysis of Ney et al. section VII (which was performed independently and without our knowledge).

We have also added the following text after identifying the longest run of heterozygosity (in terms of number of SNPs) in our dataset: “In our sample, the longest run of heterozygosity (in terms of number of SNPs) consisted of 38 SNPs and spanned 0.06 cM. This suggests that filtering out long runs of heterozygosity might be a promising strategy, though identifying a specific procedure would require more careful consideration of variation in non-European populations and of the composition of commercial SNP chips (including SNP density and allele frequencies).”
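The quantity reported here, the longest run of heterozygosity measured in SNPs together with its genetic span, can be computed with a single pass over a genotype vector. The sketch below assumes genotypes coded as alternate-allele counts (1 is heterozygous, `None` is missing and ends a run) and a parallel list of genetic-map positions in cM. The function name and the toy data are hypothetical and are not the code used for the manuscript.

```python
from typing import List, Optional, Sequence, Tuple


def longest_het_run(
    genotypes: Sequence[Optional[int]],
    cm_positions: Sequence[float],
) -> Tuple[int, float]:
    """Return (number of SNPs, span in cM) of the longest run of heterozygotes.

    Ties are resolved in favor of the earliest run.
    """
    if len(genotypes) != len(cm_positions):
        raise ValueError("genotypes and cm_positions must have the same length")
    best_len, best_start = 0, 0
    cur_len, cur_start = 0, 0
    for i, g in enumerate(genotypes):
        if g == 1:
            if cur_len == 0:
                cur_start = i          # a new run begins at this SNP
            cur_len += 1
            if cur_len > best_len:
                best_len, best_start = cur_len, cur_start
        else:
            cur_len = 0                # homozygous or missing site ends the run
    if best_len == 0:
        return 0, 0.0
    best_end = best_start + best_len - 1
    return best_len, cm_positions[best_end] - cm_positions[best_start]


# Toy data: the longest heterozygous run covers indices 4 to 6.
toy_genotypes: List[Optional[int]] = [0, 1, 1, 2, 1, 1, 1, 0]
toy_cm = [0.00, 0.01, 0.02, 0.03, 0.05, 0.06, 0.08, 0.10]
run_snps, run_cm = longest_het_run(toy_genotypes, toy_cm)
```

A filter along the lines suggested in the quoted text could compare `run_snps` or `run_cm` for an upload against thresholds derived from real data, though, as the authors note, choosing such thresholds would depend on the population and the SNP chip.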

3) Relatedly, I am specifically concerned regarding the baiting security loophole being practical, as the authors' description of IBS baiting relies on a straw man IBS detector that they construct to have that weakness. As the authors point out, many actual detectors would not willy-nilly extend each segment till conflicting homozygous on both ends, or require some information content to seed a match between segments. Baiting may still be possible, but likely more complicated and potentially impractical.

We agree that the IBS detector we consider is simplistic, but as mentioned above, IBS baiting does work in GEDmatch, indicating that the IBS detection scheme we describe is not far off of the actual method used in a major database. In response to this comment and one of reviewer 1’s comments, we have added some text on how one might respond to slightly more sophisticated IBS callers in section 2.3.1. It seems to us that there will be some potential to violate privacy using a baiting-like procedure in any database that uses phase-unaware IBS detection and does not take steps to filter out fake profiles. It may become impractical at some point, but it is better to make it effectively impossible by using phase-aware IBS detection.

4) The results reported are all w.r.t. the general European population. It is important to report the (different) results for other continental ancestries, and, on the other hand, in bottleneck populations.

We have added some discussion of the differences among some of the European subgroups (see also the new supplementary figure, Figure 2—figure supplement 4), which allows some extrapolation to patterns that will affect other continental groups. We have chosen to focus on European populations in part because major genetic genealogy databases seem to consist largely of people of European ancestries. We have added the following paragraph to the Discussion:

“Our IBS tiling and IBS probing results focus on users of European ancestries, in part because most users of DTC genetic genealogy services appear to have substantial European ancestries. […] Finally, we show in Figure 2—figure supplement 4 that in our sample, Finnish samples are more vulnerable to IBS tiling than other populations, which is likely due to Finns tracing substantial ancestry to a founder population that experienced a bottleneck 100 generations ago (Kere, 2001). Members of other groups with similar demographic histories are likely to be at elevated risk of IBS tiling and IBS probing as well.”

https://doi.org/10.7554/eLife.51810.sa2

Article and author information

Author details

  1. Michael D Edge

    1. Center for Population Biology, University of California, Davis, Davis, United States
    2. Department of Evolution and Ecology, University of California, Davis, Davis, United States
    3. Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, United States
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology
    For correspondence
    edgem@usc.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-8773-2906
  2. Graham Coop

    1. Center for Population Biology, University of California, Davis, Davis, United States
    2. Department of Evolution and Ecology, University of California, Davis, Davis, United States
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology
    For correspondence
    gmcoop@ucdavis.edu
    Competing interests
    Reviewing editor, eLife
    ORCID: 0000-0001-8431-0302

Funding

National Institutes of Health (GM108779)

  • Graham Coop

National Institutes of Health (GM130050)

  • Michael D Edge

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Matt Bishop, Elizabeth Joh, Peter Ney, and Mike Sweeney for useful conversations, and we thank Shai Carmi, Yaniv Erlich, Debbie Kennett, Leah Larkin, Magnus Nordborg, Rori Rohlfs, Noah Rosenberg, Ann Turner, Amy Williams, and an anonymous reviewer for helpful comments on the manuscript. Swapan Mallick and David Reich answered questions about the Human Origins dataset, Brian Browning answered questions about Refined IBD, and Alexander Gusev answered questions about Germline software. We acknowledge support from the National Institutes of Health (R01-GM108779 and F32-GM130050).

Senior Editor

  1. Mark I McCarthy, Genentech, United States

Reviewing Editor

  1. Magnus Nordborg, Austrian Academy of Sciences, Austria

Reviewers

  1. Amy L Williams, Cornell University, United States
  2. Shai Carmi, The Hebrew University of Jerusalem, Israel

Publication history

  1. Received: September 12, 2019
  2. Accepted: December 23, 2019
  3. Accepted Manuscript published: January 7, 2020 (version 1)
  4. Version of Record published: January 30, 2020 (version 2)

Copyright

© 2020, Edge and Coop

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
