DNA re-identification is used for a broad suite of applications, ranging from cell line authentication to forensics. However, current re-identification schemes suffer from high latency and limited access. Here, we describe a rapid, inexpensive, and portable strategy to robustly re-identify human DNA called 'MinION sketching'. MinION sketching requires as few as 3 min of sequencing and 60-300 random SNPs to re-identify a sample enabling near real-time applications of DNA re-identification. Our method capitalizes on the rapidly growing availability of genomic reference data for cell lines, tissues in biobanks, and individuals. This empowers the application of MinION sketching in research and clinical settings for periodic cell line and tissue authentication. Importantly, our method enables considerably faster and more robust cell line authentication relative to current practices and could help to minimize the amount of irreproducible research caused by mix-ups and contamination in human cell and tissue cultures.https://doi.org/10.7554/eLife.27798.001
The human genome represents the complete set of genetic information needed to make a person. DNA sequencing technologies used to study genomes have become much faster, cheaper and more accessible over recent years. This has enabled them to be used more regularly in various fields like precision medicine, in research laboratories and forensics. Even so, there are still fields where optimization is critical.
Reproducibility is an important issue in biomedical research; one group of scientists working with human cells may report results that other scientists cannot reproduce. Sometimes this is because the original work was done in the wrong type of cells by mistake. Human cells used in biomedical research are very hard to discriminate from each other using microscopes; however, DNA analysis can be used to ensure the origin of the cells.
The MinION device, a USB compatible handheld DNA sequencer, has become available in the last few years. Its size, speed and portability could enable many new uses for DNA sequencing. Technology like this could be used to confirm which cells the scientists are working with before they publish their results. Yet, currently DNA readings from the MinION are not accurate enough to be used to reliably confirm the identity of human cells used in research.
Zaaijer et al. have now developed an approach that can accurately identify human cells using the MinION device. The approach involves “DNA re-identification”, which works by comparing an unknown DNA sample to a collection of known DNA profiles. Using their new method, Zaaijer et al. report that, with three minutes of DNA sequencing, they can correctly identify a DNA sample, with 99.9% confidence. This is a high enough level of accuracy for the system to tell the difference between one person and another, using only their DNA.
This new technology is much faster than current rapid DNA sequencing approaches. Previously, processing DNA samples could take hours or even days and was not particularly portable. The new technology has many applications from finding criminals to diagnosing illnesses and tracking epidemics. It is also an affordable way for laboratories to confirm the identity of cells they are working with. This has the potential to save billions in research funding each year and speed up scientific progress.https://doi.org/10.7554/eLife.27798.002
DNA is a powerful biometric identifier. With the exception of monozygotic twins, DNA profiles are unique to each individual on Earth (Kayser and de Knijff, 2011; Bieber et al., 2006; Gymrek et al., 2013). The ability to re-identify DNA has multiple applications in a broad range of disciplines. In research settings, re-identification is employed to authenticate cell lines and patient-derived xenografts (PDXs) by matching their DNA to validated genomic profiles (NIH, 2016; AMS, 2015; El-Hoss et al., 2016). In clinical genetics, the American College of Medical Genetics recommends using DNA genotyping tests to track sample identity and avoid sample mix-ups during clinical whole genome/exome sequencing (Green, 2013). In forensics, DNA re-identification has become one of the most common techniques to identify samples from crime scenes or from victims of mass disasters, and human trafficking (Kayser and de Knijff, 2011). Despite this wide range of applications, current DNA re-identification methods suffer from high latency and lack of rapid access.
Cell line contamination is a widespread and persistent problem in academic, clinical and commercial research, despite having been recognized for 50 years (Gartler, 1968; Alston-Roberts et al., 2010; Yu et al., 2015; Almeida et al., 2016). The ongoing publication of irreproducible research is a major economic burden on society and costs 28 billion dollars each year in the USA alone (Capes-Davis and ICLAC, 2016). To diminish this the NIH and various journals require researchers to authenticate cell lines by matching their DNA profiles to validated signatures (NIH, 2016; AMS, 2015). The most common DNA re-identification strategy is to genotype a minimum of eight autosomal polymorphic short tandem repeats (STRs) (Masters et al., 2001; Smith et al., 2012; Capes-Davis et al., 2010; Reid et al., 2013; ATCC, 2011). However, this technique entails the use of time consuming PCR-based steps and specialized capillary electrophoresis machines, with the latter not being part of standard laboratory equipment.
The STR profiling method for cell line authentication originates from the forensic sciences, where it is standard practice (Alston-Roberts et al., 2010; Almeida et al., 2016). Since most cell lines are uniquely derived from single patients, the STR profiling method was adopted for cell line authentication as well. However, the unstable genetic nature of cancer cell lines undermines the usefulness of these standards and affects the re-identification efficiency of the multi-allelic STR markers (Capes-Davis et al., 2010; Alston-Roberts et al., 2010; Castro et al., 2013). Previous studies have explored using more stable SNP markers for re-identification. Indeed, a carefully selected panel of ~50 SNPs confers a power for re-identification similar to that provided by the 8–13 STR markers used in forensics and cell line authentication (Jobling and Gill, 2004; Sanchez et al., 2006; Yu et al., 2015). SNPs are increasingly being used for cancer cell line authentication (Sanchez et al., 2006; Castro et al., 2013; Otto et al., 2017).
To overcome current latency and accessibility issues, we have developed a rapid and novel SNP-based strategy for robust re-identification of human DNA using a MinION sequencer (produced by Oxford Nanopore Technologies, ONT). The MinION is a cheap and portable DNA sequencer that weighs only 100 g and can be plugged into a laptop computer. This device can easily be adopted as part of standard laboratory equipment. Our SNP-based strategy, termed ‘MinION sketching’, exploits real-time data generation by sequentially analyzing extremely low coverage shotgun-sequencing data from a sample of interest and comparing observed variants to a reference database of common SNPs (Figure 1). We specifically sought a strategy that does not require PCR in order to maximize speed, reduce the number of steps in the protocol and to omit species bias that follows from using human-specific primers (Alston-Roberts et al., 2010). However, this poses two technical challenges. First, MinION sequencing exhibits a high error rate of 5–15% (Ip et al., 2015), which is two orders of magnitude beyond the expected differences between any two individuals. Second, MinION sketching produces shotgun-sequencing data that only covers a fraction of the human genome due to the limited capacity of a MinION cell. As such, the extremely low coverage dictates that each locus is covered by up to one sequence read, which nullifies the ability to enhance the signal by integrating multiple reads or observing both alleles at heterozygous loci. Taken together, these challenges translate to a noisy identification task where the available genotype data only provide a mere sketch of the actual genomic data.
To address these challenges, we developed a Bayesian algorithm that computes a posterior probability that the sketch matches an entry in the reference database (Hexact) or has no match in the database, taking into account each marker’s allele frequency and the prior probability that a sample matches an entry in the reference database. The Bayesian approach sequentially updates the posterior probability with every new marker that is observed until a match is found. Collectively, our method can robustly re-identify a sample without PCR amplification, yet with very high probability, overcoming the low coverage and high error rate from nanopore sequencing.
In order to benchmark our re-identification method for real-life applications, we tested it in a variety of technical scenarios. To start, we constructed two large-scale proof-of-principle reference databases of genomic datasets that would stress the specificity of our method. The first reference database contains 31,000 genome-wide ~600K-900K SNP genotyping array files from individuals tested by Direct-to-Consumer (DTC) companies such as 23andMe, AncestryDNA, and FamilyTreeDNA (Figure 2A) (Erlich, 2015). The second reference database consists of genome-wide ~700K-800K SNP genotyping array files from 1,099 cancer cell lines in the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012) and can be used for cell line authentication. Next, we created a MinION sketch for DNA samples in multiple technical scenarios (Supplementary file 1A). These testing scenarios included extracting DNA from a spit kit or cell line culture, sequencing with either the R7 chemistry or the newer R9 chemistry, and re-identifying a sample with a genetically unstable background. The genetic reference files for each of these sketched samples were included in our reference databases.
We found that the MinION sketching procedure re-identified human DNA with high accuracy after just minutes of operation. After 13 min of sketching using the R7 chemistry, the Bayesian algorithm re-identified the NA12890 sample (a female CEU individual from the HapMap project) with a posterior probability greater than 99.9%. Despite the high error rate of this relatively old chemistry and the low coverage, the algorithm needed only 195 bi-allelic variants to re-identify the sample (Figure 3 and Supplementary file 1B). This is only ~2 times above the theoretical expectation for re-identifying a person by fingerprinting random markers (Lin et al., 2004). To further test the robustness of our method, we re-sketched NA12890’s sequencing data against reference files for her first-degree relative (NA12877) and second-degree relative (NA12879). Importantly, no exact-matching probability was observed, highlighting the specificity of our method (Figure 3).
Using the 31,000-individual reference database (consisting of genetic profiles from individuals genotyped by DTC companies), we repeated the R7 experiment with a sample of a mixed Ashkenazi-Uzbeki male (YE001). Again, we were able to re-identify this person within 13 min after assessing 110 SNPs (Figure 2B and Supplementary file 1B), further showing that the method produces consistent results across ethnic origins. None of the other 31,000 individuals reached this level of matching probability (Figure 2B). Finally, given that the number of reference samples in our database is in the thousands, but the number of people in the world in the billions, we wondered about the impact of the prior probability on identifying individuals. To this end, we tested various prior probabilities of identifying the YE001 sketch. We found that the initial selection of the prior probability had no effect on the matching ability and only slightly increased the time required to achieve a high-confidence match. Even with a prior probability that considers a database around a million times bigger than the world’s population (1015), the posterior probability reached 99.9% with only 25 min of sketching YE001 (Figure 2—figure supplement 1), showing that our method returns robust results regardless of the chosen prior.
Moving to the newer R9 chemistry provided even faster re-identification results. We sketched samples of a Northern European female (SZ001) and a Northern European-Italian-Ashkenazi male (JP001) using this chemistry. We were able to re-identify these two samples using only 98–134 SNPs, and the fastest identification required fewer than 5 min of MinION sketching (Figure 2C and D and Supplementary file 1C). Again, none of the other 31,000 individuals in our database were matched to SZ001 or JP001 using this strategy. The rapid re-identification seems to be linked intimately to the increased speed with which DNA strands pass through the pores with the R9 chemistry versus the R7 chemistry (250bases/sec vs 70bases/sec). These results suggest that future developments in speeding up the DNA reading time could further reduce the re-identification time.
Next, we explored the applicability of MinION sketching for cancer cell line authentication, a longstanding issue in the research community. We used MinION sketching and the R9 chemistry to authenticate THP1, a monocytic leukemia strain, against the second reference database that consisted of cell lines from the CCLE. To show that more than one sample can be authenticated at the same time, we barcoded the THP1 sample and combined it with an additional, barcoded human sample. From the barcoded THP1 reads that were generated in ~3 min of sequencing, the sketching procedure leveraged 91 SNPs to authenticate the THP1 cell line with a posterior probability of 99.9%. None of the other 1098 CCLE reference files reached a probability of 99.9% or even exceeded a 10% match probability (Figure 4A, Supplementary file 1D).
Thus far, re-identification required an intersection of 91–195 SNPs from the MinION sketch and reference SNP file to reach a match probability of 99.9%. Having observed this range in the number of SNPs required, we wished to find the minimum number of intersected SNPs necessary to obtain a 99.9% match. This way we can optimize the sequencing time. To determine such a ‘stop sketching’ threshold, we simulated 10,000 different sketching runs for the THP1 cell line (Figure 4B) and SZ001 (Figure 4—figure supplement 1A). The majority of simulated MinION sketches reached a match with a 99.9% probability using only 120–140 intersected SNPs. By 300 intersected SNPs, 99.6% of all sketches of the THP1 cell line were matched to its reference file with a probability of 99.9%, and for sketches of SZ001 by 240 SNPs that intersected with its reference. As expected, none of our simulation files failed to reach a correct match with a 99.9% probability with the correct reference file in the database. Although the number of mismatches per run was strongly correlated with the number of SNPs analyzed (Figure 4—figure supplement 1B), the results from our sequencing runs and simulations suggest that even genetically unstable cancer cell lines can be identified with confidence using no more than 300 SNPs. The minimum sequencing run time necessary to infer a match depends on the yield of the specific run and the chemistry used. In summary, the MinION sketching method relies on the presence of the reference file in the database. If computing the posterior probability for 300 SNPs does not result in a 99.9% match, then the reference file for that cell line or individual is almost certainly not present in the reference database and further sketching is highly unlikely to yield any success.
Next, we wondered how a severe contamination with cells of another origin would affect successful cell line authentication. Cell line cross-contamination is caused mostly by overgrowth from secondary cell lines with a substantially shorter generation time (Capes-Davis et al., 2010; Alston-Roberts et al., 2010). To start assessing the effects of contamination, we re-analyzed the data from the THP1 experiment but without resolving the barcodes, which essentially reflects a 50% contamination. The algorithm correctly showed a 0% match probability to the THP1 reference file or any other cell line in the database (Figure 4C). We further explored the effect of the fraction of contamination on matching sketches with the THP1 reference file. By sampling from the above data in different proportions, we found that the algorithm correctly rejects a match for samples with contamination levels above 25% (Figure 5). While it may seem that the algorithm is not as sensitive to contamination as current STR-based methods, periodic testing of a cell culture with our method will reveal the contamination in a more timely fashion (see Discussion).
Lastly, we aimed to explore a sample preparation strategy that requires minimal hands-on time. To this end, we utilized a simple protocol to extract DNA using the rapid transposase-mediated fragmentation and adaptor ligation kit provided by ONT. This method generates 1D reads, where only one of the two strands passes through the nanopore, resulting in reads with a higher error rate (Supplementary file 1E). The advantage of this method is the speed and convenience of the preparation protocol. In only 55 min, we were able to extract DNA and produce a ready-to-sequence library (Figure 6A, example of execution: Figure 6—video 1). The increased error rate resulted in the requirement for more SNPs to reach the re-identification threshold. In our experiment with the rapid sample preparation protocol we needed 239 SNPs to identify SZ001 with >99.9% probability (Figure 6B). As such, re-identification of DNA and cell line authentication can still be completed with the same level of accuracy in one afternoon and using only minimal hands-on time by the researcher.
Our results show that MinION sketching for re-identification of human samples is robust and is faster than other currently available methodologies. With its high accuracy, it can be adopted for the periodic testing and authentication of cell lines in research settings, for verifying biobank entries, for tracing samples in clinical genetics, and for certain forensic applications. Based on only 3–13 min of sequencing and 60–300 informative SNPs, MinION sketching can infer the identity of an anonymous sample. It is a unique addition to current state-of-the-art DNA re-identification methodologies.
The main cause of cell line mix-ups is suggested to be human error (Alston-Roberts et al., 2010; Yu et al., 2015; Almeida et al., 2016). It is therefore crucial to have means to monitor these errors rapidly and periodically. While the American Type Culture Collection (ATCC) offers an STR-based cell line authentication service, the overall procedure requires shipping consumables and samples back-and-forth and takes 2 weeks to complete. This works sufficiently in situations of cell line contamination that originate from mislabeling of a cell culture (100% contamination). Yet, a processing time of 2 weeks is suboptimal when caused by the mistaken transfer of cells from one culture to another, which can lead to cases of fitness competition between the cell lines (Alston-Roberts et al., 2010; Yu et al., 2015). It takes only 10 cells from a line with a doubling time that is 2–4 hr shorter than that of the original strain to overgrow an initial culture (106 cells) within 2 weeks (Figure 5—figure supplement 1), which would currently be the time-point when STR typing results would be returned by the ATCC.
Strikingly, once cells are in log-phase it can take as few as 2 days to change the contamination level of a culture from 1% to 80%. Our contamination simulations show that a contamination ≥25%, and often less than that, precludes a true matching result. Although the current STR-based methods can pick up on lower levels of contamination, in practice this does not make much difference considering: (1) the pace with which a contaminant can invade a cell culture, and (2) the relatively low identification speed of methods currently employed that precludes the timely return of data on the genomic composition of a cell population over multiple time points. Moreover, STR analysis is typically done using human-specific primers for amplification, and this therefore limits the identification of contaminants to ones of human origin (Alston-Roberts et al., 2010). Our method, on the other hand, does have the potential ability to detect DNA from contaminants of non-human origin, such as infectious organisms like mycoplasma. Such contaminants could be identified efficiently when our pipeline is run in combination with metagenomic methods for real-time microbial detection (as in Quick et al., 2015). The key to detect cell line contamination with human and non-human cells is periodic testing.
The MinION can be part of standard lab equipment and facilitates rapid sample preparation and testing just prior to key experiments. We show that with our MinION sketching method, cell line authentication can be achieved in the lab in one afternoon, either using a hands-on or hands-off protocol. The first protocol involves a hands-on ~3 hr library preparation step (including DNA extraction), but after only ~3 mins of sequencing we were able to identify the THP1 cell line out of 1,099 cancer cell lines with a posterior probability of 99.9%. The second protocol requires just 55 min for DNA extraction and transposase-mediated adapter ligation, after which sequencing can start. Our MinION sketching method reduces re-identification latency so that research does not have to be paused for long until the DNA profiling results return.
Our method relies on randomly sampling SNPs from the genome, instead of a fixed set of small numbers of STRs or SNPs in a panel. This way we can omit a time consuming and biased PCR step in our method, and avoid the loss of statistical power caused by allelic dropout. This is particularly advantageous for cancer cell line authentication where genomic instability is prevalent. Cancerous cell lines commonly undergo loss of heterozygosity or exhibit aneuploidy, which affect STR-based re-identification of DNA samples through the loss of alleles (Capes-Davis, 2013). Furthermore, cancer cells that are deficient in their mismatch repair (MMR) pathways and suffer from microsatellite instability are identified more accurately by SNP-based than by STR-based identification methods (Castro et al., 2013; Otto et al., 2017). Because of these challenges, the current official ASN-0002 standards for STR analysis use an 80% matching threshold to positively match the STR profile in question to a reference file. Using this threshold, cell lines can be identified correctly in 98% of the cases (Capes-Davis, 2013). In our experiments, we see a clear correlation between the number of mismatches and the SNPs that are used to infer a match. The occurrence of mismatches in matching the DNA from a genetically unstable cancer cell line to its reference file results in a need to collect more SNP evidence for a match. Still, when we simulated matching the THP1 cell line to its reference file 10,000 times we found that intersecting a minimum of only 300 SNPs leads to a correct match in 99.6% of the cases. Furthermore, we found that based on this simulation, the SZ001 simulation and our experimental data, 300 SNPs could be used as a ‘stop-sketching’ threshold. Importantly, we never observed a false-positive match to any other reference file in the CCLE database. As a SNP-based method MinION sketching improves the precision of re-identifying cancer cell lines compared to the STR-based identification methods. ANSI-approved standards for SNP usage in cancer cell line authentication would be useful to promote the community-wide adoption of SNP-based sample re-identification (also proposed by Yu et al., 2015 and Otto et al., 2017, among others).
The start-up cost for the MinION is currently $1000, and multiplexing 12 DNA samples in one run makes the cost of consumables for sequencing a sample around $100. This cost per sample is already lower than the ATCC STR-typing service, which is $195, but higher on a per-run basis than the Geneprint system method. However, the latter method can only be used with access to the Applied Biosystems 3500 platform, and involves a more elaborate protocol that requires hands-on time, therefore incurring higher costs of labor. Although the balance between cost of labor and costs of machine depreciation and consumables poses a trade-off for all methodologies, the requirement of extensive hands-on laboratory work seems a main driver for avoiding authentication with current STR-based tests. Given that MinION sketching requires only minimal hands-on time and provides re-identification within hours instead of days/weeks, it is a very efficient and competitive re-identification method, especially when working with a small number of samples. While the costs of MinION sequencing continue to decrease, MinION sketching is currently not competitive in price for high-throughput testing until sequencing costs will have decreased further.
In conclusion, to help solve the long-standing issue of (cancer) cell line contamination and to enhance the traceability of tissue samples in biobanks we developed an rapid re-identification method for DNA samples. Our method lowers the barrier for adoption of regular cell line authentication, which is important since only periodic testing will detect contamination and mix-ups efficiently and reduce the costs involved with irreproducible research. MinION sketching can easily be done in laboratories, in the clinic, or in biobanks as a routine sample authentication test.
The matching algorithm uses a Bayesian framework to evaluate the posterior probability of a match. Let be a random variable that either indicates whether the MinION sketch directly matches a known person (), or does not match () with respect to the -th individual in the database. Let be the observed MinION data for the -th bi-allelic marker, with , where and denote the two alleles; and let denote the observation for bi-allelic markers.
The posterior probability of the matching outcome for the -th sample is:
where is the prior probability for the matching status of i-th sample and is specified by the user.
The likelihood is approximated using the following equation:
The likelihood of an exact match given the data of the -th marker, , is given by the following matrix:
where the rows denote the genotype of the -th sample for the -th marker as observed in the DNA database, the columns correspond to the observed genotype in the MinION data, and ∈ denotes the error rate assuming symmetry in confusing allele A for allele B and vice versa. corresponds to a specific row of M based on the observed genotype of a sample in the database. For example, if the genotype of the database sample is AA, then and .
The likelihood of a mismatch given the data of the k-th marker, , basically corresponds to observing the allele in a random person from the population. This probability is the sum of two processes: (i) the random person has the same allele as and the observation is errorless or (ii) the random person does not have the same allele as but a sequencing error flipped the observed allele. Therefore:
where denotes the frequency of the observed allele in the population
Finally, the evidence, is given by:
We purchased the genomic DNA sample for the 1,000 Genomes individual NA12890 from the Coriell Institute. The THP1 cell line (ECACC Cat# 88081201, RRID:CVCL_0006) was used from laboratory resources and its authenticity was thoroughly verified in this study. YE001 and SZ001 were derived from the corresponding authors (Y.E. and S.Z.) using cheek swabs (Catch-All Sample Collection Swab Epicentre QEC89100) or a saliva collection kit (Supplementary file 1A). JP001 was sampled through saliva collection.
Genomic DNA from NA12890 and YE001 (Supplementary file 1A; exp. 1, exp. 2 respectively) were prepared for 2D MinION libraries using the SQK-MAP006 kit (ONT) as described by Zaaijer et al. (2016). 2D libraries are double-stranded DNA fragments with a ligated hairpin loop and adaptors containing a tether and motor protein necessary for MinION sequencing, and were run on the R7 flow cells. DNA samples from SZ001, JP001 and the THP1 cell line were prepared using the SQK-NSK007 kit from ONT (Supplementary file 1A; exp. 3, exp. 4, exp. 5) and run on R9 flow cells.
Samples (Supplementary file 1A, exp. 6) were collected by cheek swab (Catch-All Sample Collection Swab Epicentre QEC89100) scraping ~30 s on both sides of the cheek. Cells were recovered in 200 μl PBS. After addition of 20 μl Proteinase K and 200 μl lysis buffer (DNeasy blood and tissue kit, Qiagen, #69504), the sample was incubated at 56°C for 10 min. The sample was then applied to the column, spun for 1 min, and washed sequentially with buffers AW1 and AW2. Next, 20 μl elution buffer was applied and the column was spun for 1 min on a regular benchtop centrifuge at maximum speed. Recovery of the DNA sample in 20 μl of sterile water resulted in an average yield of ~3–5 ng/μl.
We used the SQK-RAD001 kit to prepare the DNA library. FRM (2.5 μl, ONT) was added to the DNA sample (20 μl) and incubated for 1 min at 30°C. Then, 1 μl RAD (ONT) plus 0.2 μl ligase was added and the mixture was incubated for 10 min.
The R9 flow cell was prepared by applying 500 μl priming mix (RBF 1x) twice. The library was then added to the flow cell without a purification step.
The barcoding protocol was executed according to manufacturer’s instructions for native barcoding kit I (EXP-NBD002, ONT) in conjunction with the Nanopore Sequencing kit (SQK-NSK007, ONT) with some modifications (Supplementary file 1A, exp. 4, exp. 5). In brief; 1.5 μg DNA was used as starting material for each sample and vigorously vortexed for 1 min. The DNA sample was end-repaired and dA-tailed using the NEBNext Ultra II End Repair/dA-Tailing Module (New England Biolabs [NEB] E7546S; 5 min 20°C, and 5 min 65°C). After an AMPure purification, the DNA fragments were subject to ligation using Blunt/TA Ligase Master Mix (NEB M0367S) for 5 min at 20°C and then 5 min at 65°C. The sample was then purified using AMPure magnetic beads and the DNA was eluted off the beads using 31 μl nuclease-free water. The NB01 and NB02 barcodes were ligated to the fragments of each sample with Blunt/TA Ligase mix (NEB) and incubated for 15 min. After an AMPure purification step, the two samples were pooled. Next, we ligated the adaptor (BAM) and hairpin (BHP) to the barcoded DNA fragments using NEB Quick Ligase (NEB) for 20 min at room temperature (22°C). The HTP (ONT) was added and incubated for another 10 min. The 50 μl MyOne C1 beads were prepared in the incubation step, which tethered the hairpin and ligated DNA fragments. The DNA library was eluted off the beads by ELB (ONT) at 37°C for 10 min and was applied to the flow cell.
YE001, JP001 (https://dna.land/consent) and three HapMap samples (NA12890, NA12977, NA12879) are publicly available reference files. The 1,099 cancer cell line files were downloaded (GSE36139, CCLE), base-called using Birdseed and converted into 23andMe file format. The 31,000 DTC genomes were available from two sources: (i) 1,446 DTC genomes were downloaded from the public website OpenSNP.org and (ii) 29,554 genomes were collected using DNA.Land, an online website (https://dna.land). The website procedures were approved by our IRB. Based on current consent, this set of 29,554 genomes cannot be shared. All experiments with this collection were done using an automatic algorithm on a secure server without access to the explicit identifiers of the samples (e.g. names or contact information).
To start a MinION run, we primed the flow cell according to the manufacturer’s protocol. We started MinKnow (protocol ‘MAP_48 Hr_Sequencing_Run_SQK_MAP006’ for R7 and ‘NC_48 hr_Sequencing_Run_FLO-MIN104’ for R9), uploaded the collected reads to Metrichor (a cloud-based program that base-called the reads), and stored them on our computer.
We used Poretools (Loman and Quinlan, 2014) to extract the FASTQ data and time stamps from the local files. Only reads with an average base quality greater than nine were used for the downstream analysis. Next, we aligned the files to hg19 using bwa-mem (v0.7.14) (Li, 2013) using the command ‘bwa mem –V –x ont2d –t 4’. Reads with multiple alignments were not considered for further analysis.
To extract variants, we used a script to retain nucleotides from the MinION output that overlapped known positions of bi-allelic SNPs from dbSNP build-138 with an allele frequency between 1–99%. To minimize the effects of sequencing error, we considered only MinION read bases that matched the common SNP alleles in dbSNP. For example, if at position chr1:10,000 the MinION reported ‘A’ and dbSNP reported a variant ‘C/G’, then we treated this position as a sequencing error. The R7 chemistry run with NA12890 generated 4,920 variants after 1 hr of MinION sequencing, of which 7.7% were rejected after filtering for common SNPs. Intersecting these with the reference file and analyzing the true error from the matched SNPs resulted in 8.9% mismatches. This contrasted with the R9 chemistry, which only resulted in 2% true mismatches (Supplementary file 1C-E).
The Bayesian model was integrated in a Python script in order to match between the MinION data and each entry in the database. To accelerate the search, we implemented the following procedure: (i) if the posterior probability drops below 10−9, the script concludes that the database entry does not match and moves to the next entry, and (ii) the script uses only up to 1 hr of data to determine the posterior probability of a sample.
As a default setting, we used a prior probability of 10−5 for exact matching. The only exception was Figure 2—figure supplement 1 (YE001), where we employed a range of prior probabilities. As a default setting, we used the computed error rate from each read as the εin our Bayesian algorithm.
All code is publicly available on github at https://github.com/TeamErlich/personal-identification-pipeline (Erlich, 2017). A copy is archived at https://github.com/elifesciences-publications/personal-identification-pipeline.
For the simulations, we took reads from exp. 4 and 5 (Supplementary file 1A). The total number of reads was set to 3,000 and a random number of reads that represented the required proportion were selected. For example, for 50% contamination, we took 1,500 random reads from exp. 4 and 1,500 random reads from exp. 5. These were pooled together and again shuffled to simulate a mixture. This process was repeated five times for each contamination fraction. The resulting pooled file was processed using our pipeline and matched to the reference file of the corresponding MinION sketch (either THP1, or JP001).
Availability of the data: The code for our method is available on https://github.com/TeamErlich/personal-identification-pipeline (Erlich, 2017). A copy is archived at https://github.com/elifesciences-publications/personal-identification-pipeline. Replicating the experiments can be done using the THP1 test example (all data available on Github). The MinION reads for THP1 are also available in Zenodo at https://zenodo.org/record/1035914; 10.5281/zenodo.1035913). Building databases for your own MinION queries can be done by using opensnp.org, for cancer cell lines by downloading the CCLE reference files or using your own private SNP array files relevant to your query. The 29,554 genomes provided by DNA.land are not available for distribution to ensure genomic privacy of the individuals who donated their genomes to DNA.land (see Materials and methods section: Reference databases and Table 1).
Cell line misidentification: the beginning of the endNature Reviews. Cancer 10:441–448.https://doi.org/10.1038/nrc2852
Check your cultures! a list of cross-contaminated or misidentified cell linesInternational Journal of Cancer 127:1–8.https://doi.org/10.1002/ijc.25242
Match criteria for human cell line authentication:where do we draw the line?2510–2519, 2519.
High-throughput SNP-based authentication of human cell linesInternational Journal of Cancer 132:308–314.https://doi.org/10.1002/ijc.27675
Presented at the 65th Annual Meeting of the American Society of Human GeneticsDNA.Land: A community-wide platform to study millions of genomesphenomes, Presented at the 65th Annual Meeting of the American Society of Human Genetics, Baltimore.
American College of medical genetics and genomics acmg recommendations for reporting of incidental findings in clinical exome and genome sequencing1–29, 4472.
MinION analysis and reference consortium: phase 1 data release and analysisF1000Research, 4, 10.12688/f1000research.7201.1, 26834992.
Improving human forensics through advances in genetics, genomics and molecular biologyNature Reviews Genetics 12:179–192.https://doi.org/10.1038/nrg2952
Poretools: a toolkit for analyzing nanopore sequence dataBioinformatics 30:3399–3401.https://doi.org/10.1093/bioinformatics/btu555
Assay Guidance ManualAuthentication of Human Cell Lines by STR DNA Profiling Analysis, Assay Guidance Manual, Eli Lilly & Company and the National Center for Advancing Translational Sciences.
Andrew P MorrisReviewing Editor; University of Liverpool, United Kingdom
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
[Editors’ note: this article was originally rejected after discussions between the reviewers, but the authors were invited to resubmit after an appeal against the decision.]
Thank you for submitting your work entitled "Rapid DNA Re-Identification for Cell Line Authentication and Forensics" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Nicholas Loman (Reviewer #3).
Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.
You will see from the reviewers comments below that they agreed that the methodology to identify individuals or cell lines using very low coverage MinION data, assuming the availability of an existing reference database, is sound and that the results of the application of the approach to individual and cell line identification provided a convincing theoretical demonstration of the technique. However, the reviewers raised substantial concerns about the practical utility of the approach in forensics and cell line identification because, for example, of the lack of availability of a reference and the impact of contamination. Given that the emphasised applications are a central element the paper, the reviewers recommended considering other applications, potentially outside the field of human genetics, that would more broadly increase the impact of the technique.
The authors present a novel approach, "MinION sketching" to enable rapid, inexpensive and portable human DNA re-identification. The authors suggest that their approach requires 3 minutes and "91 random SNPs" to identify a sample, and demonstrate applications to cell line authentication or forensics.
I have no concerns about the statistical approach, but it wasn't clear to me how the performance of the algorithm depended on the properties of the SNPs used. With common SNPs, there will be greater probability of a genotype match with a random DNA sample? But presumably common SNPs are more likely to be reported in relevant databases? Is the database of 31,000 individuals likely to be representative of the DNA databases used in forensics (which I still thought were predominantly based on STRs rather than SNPs)? How much does the approach depend on the number of individuals and numbers of SNPs in the reference? Similarly, with the cell line authentication, what impact does the use of a different technology in the reference make on utility and the interpretation of findings? I also didn't see any discussion of "stopping rules" – how does the method determine there is sufficient evidence that there is no match (since the prior probability of a match will be small)?
Whilst the algorithm is currently defined as providing probabilities of a match, can the approach be extended to provide probabilities that the samples are from close relatives?
The novel approach is claimed to be rapid and inexpensive, but is the need for "rapid" response so important in the applications of forensics or cell line authentication? How does the novel approach compare in terms of the amount of DNA required compared with currently used techniques, and does this reduce utility?
This manuscript described the use of the commercially available MinION system based on sequencing via nanopore technology to analyse a limited number of 4 (in words: four) DNA samples in various technical scenarios. The authors then compared the MinION outcome with a SNP microarray database they collected from Direct-to Consumer companies and other sources, to establish sample matches using a Bayesian algorithm they developed. They claim that besides the relatively high error rate of MinION, their algorithm provides evidence for sample matching, likely because of the large number of SNPs generated by both approaches. Based on their results, the authors suggest the future practical use of this approach for re-identification purposes in forensics and elsewhere, and advocate its advantage of a cheap, quick and PCR-free approach, which contrasts to the more expensive, slower PCR approach currently used such as in forensics. Besides it being unclear what the advantage is to speed-up the re-identification for the price of using an error-prone sequencing device (which in turn needs a statistical approach to compensate for the errors), it also is unclear why a PCR-free approach shall provide an advantage for these type of data. Clearly, STR typing benefits from avoiding PCR as this eliminates disturbing slippage artifacts, which can trouble re-identification; however, SNPs do not generate slippage artifacts. Obviously, general advantage of using PCR for identification purposes is that minute amounts of DNA can be successfully used, which is especially suitable to forensic identification. However, I cannot find in the manuscript the sensitivity limits of this approach, which I expect to be higher (i.e. more DNA needed) than possible when using PCR, which would limit the application. Moreover, when proposing the use of their approach in forensics, the authors seem to ignore that SNPs are not used in routine forensic DNA analysis despite their technical advantages of avoiding slippage artifacts etc. simply because forensic reference databases consist of STRs. Hence, even by ignoring the various caveats of this approach when it comes to the robustness and reliability of a forensic DNA test, which I cannot see validated in this study, the usefulness of their approach for identification purposes in forensics as suggested by the authors does not exist in practice. And even if forensic DNA databases would ever move to SNPs, which has been discussed for many years but not a single country has adapted this, the proposed approach would be not suitable, because it would require that reference samples and trace samples would be analysed with different technologies, which causes additional complication that are avoided by the use of the same markers and technology for both type of forensic samples, as is currently the case. Whether their developed Bayesian approach that allows matching error-prone MinION data with error-poor SNP microarray data, which to me is the heart of this manuscript but not its application, is technically sound and novel enough to justify publication in a high-profile life science journal, escapes this reviewer's technical background knowledge and shall be evaluated by a statistical genetics expert instead; the proposed application of their approach for re-identification purposes at least in the field of forensics does not. Another statistical genetics issue that shall be evaluated by a respective expert is if the number of SNPs matched between MinION and reference array dataset, as achieved with this approach, is truly enough for statistically sound individual identification, which not only depends on SNP numbers but also on degree of variation. However, from what I can see I expect that the SNPs matching between both datasets appear to be different between individuals, this issue would require a careful evaluation using much more data then presented here.
This manuscript details a computational framework and an experimental design for the rapid identification of human subjects employing low coverage, noisy nanopore reads. It builds on existing work previously published in eLife relating to using nanopore sequencing in the classroom (https://elifesciences.org/articles/14258).
The method and results are solid and convincing as a method to identify individuals or cell lines using very low coverage MinION data, assuming the availability of an existing reference database.
The Bayesian method used is elegant and seems to give good results, demonstrating the power of combinations of unlinked SNPs in generating a unique signature of humans. We previously demonstrated a similar real-time streaming identification approach for pathogens using phylogenetic placement, a slightly different but related approach (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0677-2)
The ability to rapidly identify individuals outside of a traditional laboratory environment using forensic DNA sequencing will be interesting to a broad audience including biologists, forensic scientists and the public. The authors previously termed this 'CSI' sequencing and I think that is the source of the appeal of this idea and manuscript.
My main concerns about this work relate mainly to whether it has genuine practical uses.
The first example – of human identification seems limited in utility for the following reasons:
1) A reference database, generated from whole-genome sequences or genotyping panels is required for this to work. (And the data generated by this technique cannot be used to populate such a database, meaning a parallel reference database building effort must be employed).
2) The database used for this work from the DNA.land website is not publicly available (nor can it be for privacy reasons), to permit others to reproduce this work.
3) Even if it was – should work like this be encouraged? I find it a little hard to think of outside of forensic investigations where the ability to identify people from their DNA would be something routinely practiced.
In the absence of large scale genetic databases it is hard to see how this could be useful. Would the authors advocate the collection of identifiable large scale genetic databases by authorities? How would this work in practice? If not, is this purely a theoretical demonstration?
The second example given – that of cell line identification – does seem like a potential practical use, although at present the cost of such analysis would likely hinder its adoption compared with a simpler STR panel based approach for identification. Although the authors state that this is done by sending samples off and at great cost, this is also available to researchers to run in their own labs at not great cost (GenePrint system from ProMega at <$10/rxn). The authors may want to discuss how the cost could be brought down to similar levels.
The method as shown is also not likely to work well with lower levels of contamination and/or from contamination from multiple cell lines and does not seem to provide identification in such mixtures.
In summary, I think to improve this article the authors should really spend time outlining the potential practical uses of this technique (including outside of human genetics) and then discuss in more detail the ethical concerns associated with such uses.
[Editors’ note: what now follows is the decision letter after the authors submitted their appeal.]Thank you for choosing to send your work entitled "Rapid DNA Re-Identification for Cell Line Authentication and Forensics" for consideration at eLife. Your letter of appeal has been considered by a Senior Editor and a Reviewing editor, and we are prepared to consider a revised submission with no guarantees of acceptance. We encourage such dialogue post decisions, and welcome the chance to review those decisions. The reviewers were more convinced by the potential application to issues of cell line contamination after reading your letter of appeal, and would recommend that you include all the points made in your rebuttal in the revised version of the manuscript. The reviewers remain less convinced about the application for forensic human identification, and suggest including much more detail in the discussion of the limitations of the approach in this context and the barriers for putting it into practice, whilst speculating that forensics might move to using SNPs in the future.
[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]
Thank you for resubmitting your work entitled "Rapid DNA Re-Identification for Cell Line Authentication and Forensics" for further consideration at eLife. Your revised article has been favorably evaluated by Mark McCarthy (Senior editor), a Reviewing editor, and two reviewers.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance.
The reviewers remain unconvinced that the proposed method is suitable for the needs and requirements of forensic DNA analysis, as outlined in our original rejection letter. The reviewers appreciated the increased emphasis on cell line authentication in the revised version of the manuscript, for which the method is more appropriate, but are not satisfied that forensic applications have been sufficiently de-emphasised, as requested in our response to the authors' letter of appeal.
The reviewers particularly commented that rapid and mobile DNA analysis is not applied for standard DNA profiling for many reasons, including sensitivity, genotype quality, and non-controllable environment increasing risks of contamination, which would also apply to the proposed method. The proposed method is not applicable to STRs, on which almost all forensic DNA analyses are based (including all existing forensic DNA databases), and is not suitable for the analysis of low quantity DNA typically available from crime scene stains.
To be acceptable for publication, the reviewers have requested the following changes be made:
1) Remove "and forensics" from the title.
2) Remove "or in some forensic applications" from the title.
3) Remove from the Introduction the fourth paragraph describing forensics.
4) Remove the paragraphs from the Discussion that discuss the utility of the approach for forensics, i.e. the first two paragraphs of the subsection “Forensics”.
5) Remove "and to provide an alternative method for DNA-based forensics" from the concluding remarks.
In addition, please address the following issues:
1) The reference database (http://files.teamerlich.org/pidp/CCLE_genotypes.tar.gz) should be deposited in a public repository (not the lab website).
2) Concern has been raised over the availability of the code, which should be made publicly available by releasing it with some kind of Creative Commons license, for example.
3) The competing interest statement (Y.E. is a consultant for a DNA forensic company) should be more explicit.https://doi.org/10.7554/eLife.27798.018
- Yaniv Erlich
- Yaniv Erlich
- Yaniv Erlich
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
We thank Eleazar Eskin for useful comments on the model. We thank Neville Sanjana for providing cell lines and for discussions. We thank Jae Young Choi, Thomas Willems and Dina Zielinski for useful comments, William Stephenson and Kunal Pandit from New York Genome Center’s Innovation lab for technical assistance and Aaron Wolman for providing the THP1 cell line. We thank Michael Micorescu from Oxford Nanopore Technologies for useful discussions, and the Columbia Ubiquitous Genomics class 2015 for data generation.
Human subjects: All individuals (YE001, JP001, SZ001) declare they fully consented to participate in the study and to the publication of the MinION data. Further information can be found in the Materials and methods section subheading: "Reference databases'.
- Andrew P Morris, Reviewing Editor, University of Liverpool, United Kingdom
- Received: April 13, 2017
- Accepted: November 2, 2017
- Version of Record published: November 28, 2017 (version 1)
© 2017, Zaaijer et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.