The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria
Abstract
Individual malaria infections can carry multiple strains of Plasmodium falciparum with varying levels of relatedness. Yet, how local epidemiology affects the properties of such mixed infections remains unclear. Here, we develop an enhanced method for strain deconvolution from genome sequencing data, which estimates the number of strains, their proportions, identitybydescent (IBD) profiles and individual haplotypes. Applying it to the Pf3k data set, we find that the rate of mixed infection varies from 29% to 63% across countries and that 51% of mixed infections involve more than two strains. Furthermore, we estimate that 47% of symptomatic dual infections contain sibling strains likely to have been cotransmitted from a single mosquito, and find evidence of mixed infections propagated over successive infection cycles. Finally, leveraging data from the Malaria Atlas Project, we find that prevalence correlates within Africa, but not Asia, with both the rate of mixed infection and the level of IBD.
https://doi.org/10.7554/eLife.40845.001Introduction
Individuals infected with malariacausing parasites of the genus Plasmodium often carry multiple, distinct strains of the same species (Bell et al., 2006). Such mixed infections, also known as complex infections, are likely indicative of intense local exposure rates, being common in regions of Africa with high rates of prevalence (Howes et al., 2016). However, they have also been documented for P. vivax and other malariacausing parasites (Mueller et al., 2007; Collins, 2012), even in regions of much lower prevalence (Howes et al., 2016; Steenkeste et al., 2010). Mixed infections have been associated with increased disease severity (de Roode et al., 2005) and also facilitate the generation of genomic diversity within the parasite, enabling cotransmission to the mosquito vector where sexual recombination occurs (Mzilahowa et al., 2007). The distribution of mixed infection duration, and whether the clearance of one or more strains results purely from host immunity (Borrmann and Matuschewski, 2011) or can be influenced by interactions between the distinct strains (Enosse et al., 2006; Bushman et al., 2016), are all open questions.
Although mixed infections can be studied from genetic barcodes (Galinsky et al., 2015), genome sequencing provides a more powerful approach for detecting mixed infections (O'Brien et al., 2016; Chang et al., 2017). Genetic differences between coexisting strains manifest as polymorphic loci in the DNA sequence of the isolate. The higher resolution of sequencing data allows the use of statistical methods for estimating the number of distinct strains, their relative proportions, and genome sequences (Zhu et al., 2018d). Although genomic approaches cannot identify individuals infected multiple times by identical strains, and are affected by sequencing errors and problems of incomplete or erroneous reference assemblies, they provide a rich characterisation of within host diversity (Manske et al., 2012; Auburn et al., 2012; Pearson et al., 2016).
Previous research has highlighted that coexisting strains can be highly related (Nair et al., 2014; Trevino et al., 2017). For example, in P. vivax, 58% of mixed infections show long stretches of within host homozygosity (Pearson et al., 2016). In addition, (Nkhoma et al., 2012) reported an average of 78.7% P. falciparum allele sharing in Malawi and 87.6% sharing in Thailand. A mixed infection with related strains can arise through different mechanisms. Firstly, relatedness is created when distinct parasite strains undergo meiosis in a mosquito vector. A mosquito vector can acquire distinct strains by biting a single multiplyinfected individual, or multiple infected individuals in close succession. Cotransmission of multiple meiotic progeny produces a mixed infection in a singlebite, containing related strains. Alternatively, relatedness in a mixed infection can result from multiple bites in a parasite population with low genetic diversity, such as is expected during the early stages of an outbreak or following severe population bottlenecks; for instance, those resulting from an intervention (Mouzin et al., 2010; Wong et al., 2017; Daniels et al., 2015). Interestingly, serial cotransmission of a mixed infection is akin to inbreeding, producing strains with relatedness levels well above those of standard siblings.
The rate and relatedness structure of mixed infections are therefore highly relevant for understanding regional epidemiology. However, progress towards utilising this source of information is limited by three problems. Firstly, while strain deconvolution within mixed infections has received substantial attention (Galinsky et al., 2015; O'Brien et al., 2016; Chang et al., 2017; Zhu et al., 2018d), currently, no methods perform both deconvolution of strains and estimation of relatedness. Because existing deconvolution methods assume equal relatedness along the genome, differences in relatedness that occur, for example through infection by sibling strains, can lead to errors in the estimation of the number, proportions and sequences of individual strains (Figure 1). Recently, progress has been made in the case of dualinfections with balanced proportions (Henden et al., 2018), but a general solution is lacking. The second problem is that little is known about how the rate and relatedness structure of mixed infections relates to underlying epidemiological parameters. Informally, mixed infections will occur when prevalence is high; an observation exploited by Cerqueira et al. (2017) when estimating changes in transmission over time. However, the quantitative nature of this relationship, the key parameters that influence mixed infection rates and how patterns of relatedness relate to infection dynamics are largely unexplored. Finally, an important issue, though not one addressed here, is the sampling design. Malaria parasites may be taken from individuals presenting with disease or as part of a surveillance programme. They are also often highly clustered in time and space. What impact different sampling approaches have on observed genomic variation is not clear. Nevertheless, because mixed infection rates are likely to respond rapidly to changes in prevalence (Volkman et al., 2012), exploring these challenges may render critical insights for malaria control in the field.
Here, we develop, test and apply an enhanced method for strain deconvolution called DEploidIBD, which builds on our previouslypublished DEploid software. The method separates estimation of strain number, proportions, and relatedness (specifically the identitybydescent, or IBD, profile along the genome) from the problem of inferring genome sequences. This strategy provides substantial improvements to accuracy when strains are closely related. We apply DEploidIBD to 2344 field isolates of P. falciparum collected from 13 countries over a range of years (2001–2014) and available through the Pf3k Project (see Appendix), and characterise the rate and relatedness patterns of mixed infections. In addition, we develop a statistical framework for characterising the processes underlying mixed infections, estimating that nearly half of symptomatic mixed infections arise from the transmission of sibling strains, as well as demonstrating the propagation of mixed infections through multiple cycles of hostvector transmission. Finally, we investigate the relationships between statistics of mixed infection and epidemiological estimates of pathogen prevalence (MAP, 2017), showing that, at a global level, regional rates of mixed infection and levels of background IBD are correlated with estimates of malaria parasite prevalence.
Strain deconvolution in the presence of relatedness
Existing methods for deconvolution of mixed infections typically assume that the different genetic strains present in mixed infections are unrelated. This assumption allows for efficient computation of priors for allele frequencies within samples, either through assuming independence of loci (O'Brien et al., 2016) or as sequences generated as imperfect mosaics of some (predefined) reference panel (Zhu et al., 2018d). However, when strains are related to each other, and particularly when patterns of IBD vary along the genome (for example through being siblings), the constraints imposed on withinsample allele frequencies through IBD can cause problems for deconvolution methods, which can try to fit complex strain combinations (with relatedness) as simpler configurations (without relatedness). Below we outline the approach we take to integrating IBD into DEploid. Further details are provided in the Appendix.
Decoding genomic relatedness among strains
A common approach to detecting IBD between two genomes is to employ a hidden Markov Model that transitions into and out of IBD states (Chang et al., 2015; Gusev et al., 2009; Gusev et al., 2011). We have generalised this approach to the case of $K$ haploid Plasmodium genomes (strains). In this setting, there are ${2}^{K}$ possible genotype configurations, as each of the $K$ strains can be either reference (i.e. same as the reference genome used during assembly), or alternative (i.e. carry a different allele) at a given locus (we assume all variation is biallelic). In most cases, if each of the $K$ strains constitutes a unique proportion of the infection, each genotype configuration will produce a distinct alternative within sample allele frequency (WSAF; Figure 1A), which defines the expected fraction of total sequencing reads that are alternative at a given locus in the sequenced infection.
The effect of IBD among these $K$ strains is to limit the number of distinct genotype configurations possible, in a way that depends on the pattern of IBD sharing. Consider that, for any given locus, the $K$ strains in the infection are assigned to $j\le K$ possible reference haplotypes. IBD exists when two or more strains are assigned to the same haplotype. In this scenario, the total number of possible patterns of IBD for a given $K$ is equal to ${\sum}_{j=1}^{K}S(K,j)$, where $S(K,j)$ is the number of ways $K$ objects can be split into $j$ subsets; a Stirling number of the second kind (Graham et al., 1988). Thus, for two strains, there are two possible IBD states (IBD or nonIBD), for three strains there are five states (all IBD, none IBD and the three pairwise IBD configurations), for four strains there are fifteen states (see Appendix), and so on. We limit analysis to a maximum of four strains for computational efficiency. Finally, for a given IBD state, only ${2}^{j}$ rather than ${2}^{K}$ genotype configurations are possible, thereby restricting the set of possible WSAF values.
Moving along the genome, recombination can result in changes in IBD state, hence changing the set of possible WSAF values at loci (Figure 1B). To infer IBD states we use a hidden Markov model, which assumes linkage equilibrium between variants for computational efficiency, with a GammaPoisson emission model for read counts to account for overdispersion (see Appendix). Populationlevel allele frequencies are estimated from isolates obtained from a similar geographic region. Given the structure of the hidden Markov model, we can compute the likelihood of the strain proportions by integrating over all possible IBD sharing patterns, yielding a Bayesian estimate for the number and proportions of strains (see Appendix 1 Implementation details). We then use posterior decoding to infer the relatedness structure across the genome (Figure 1B). To quantify relatedness, we compute the mean IBD between pairs of strains and statistics of IBD tract length (mean, median and N50, the lengthweighted median IBD tract length, Figure 1C).
In contrast to our previous work, DEploidIBD infers strain structure in two steps. In the first we estimate the number and proportions of strains using Markov Chain MonteCarlo (MCMC), allowing for IBD as described above. In the second, we infer the individual genomes of the strains, using the MCMC methodology of , which can account for linkage disequilibrium (LD) between variants, but without updating strain proportions. The choice of reference samples for deconvolution is described in Zhu et al. (2018d) and in the Appendix. During this step we do not use the inferred IBD constraints per se, though the inferred haplotypes will typically copy from the same (or identical) members of the reference panel within the IBD tract.
Results
Method validation
Validation using experimentally generated mixed infections
We first sought to characterise the behaviour of DEploidIBD and compare its performance to the previously published method, DEploid. To this end, we reanalysed a set of 27 experimentally generated mixed infections (Wendler, 2015) that had been previously deconvoluted by DEploid (Zhu et al., 2018d) using DEploidIBD (Figure 2—figure supplement 2). These mixed infections were created with combinations of two or three laboratory strains (selected from 3D7, Dd2, HB3 and 7G8), set at varying known proportions (Wendler, 2015), and therefore provide a simple framework for evaluating inference of the number of strains ($K$) and their proportions. Since the method allows deconvolution of mixed infections containing up to four strains, we augmented the experimental mixtures by combining all four lab strains in silico at differing proportions (see Appendix 2 In silico lab mixtures). Using this approach, we found that DEploid and DEploidIBD performed comparably, except in the case of three strains with equal proportions, where LD information is necessary to achieve accurate deconvolution and DEploid performed better. Both DEploid and DEploidIBD struggled to deconvolute our in silico mixtures of four strains, typically underestimating the number of strains present.
Validation against simulated mixed infections
Validation using mixtures of lab strains has two limitations: (i) the strains comprising the mixed infection were part of the reference panel and (ii) no IBD was present. We therefore investigated the ability of DEploidIBD to recover IBD between strains within a mixed infection, in the context of a realistic reference panel, and with strains typical of those we observe in nature. To achieve this, we designed a validation framework where clonal samples from the Pf3k project were combined in silico to produce simulated mixed infections, allowing us to create examples with varying numbers of strains and proportions, and to introduce tracts of IBD, by copying selected sections of the genome between strains. Using this framework, we constructed a broad suite of simulated mixed infections, derived from clonal samples from Africa and Asia that were combined into mixtures of 2, 3 and 4 strains with variable proportions and IBD configurations.
We randomly selected 189 clonal samples of African origin and 204 clonal samples of Asian origin from which to construct our simulated mixed infections and restricted the analysis to chromosome 14 to reduce computational time. Starting with mixed infections containing two strains, we randomly took two samples of African or Asian origin and combined them at proportions ranging from highly imbalanced (10% and 90%) to exactly balanced (each 50%) and used copying to produce either no (0%), low (25%), medium (50%) or high (75%) levels of IBD (note that background IBD between the two clonal strains may also exist). In total this resulted in 4,000 $K=2$ mixed infections, each of which was deconvoluted with DEploid and DEploidIBD. Outputs of DEploidIBD were compared to the true values for each simulated infection, including the inference of $K$, the effective ${K}_{e}$ (computed as ${K}_{e}=1/\sum {w}_{i}^{2}$, where ${w}_{i}$ is the proportion of the $i$th strain, thus incorporating proportion inference), the average pairwise relatedness between strains (for $K=2$, this is the fraction of the genome inferred to be IBD), and the inference of IBD tract length, expressed as the IBD N50 metric.
For mixtures of two strains, both DEploid and DEploidIBD performed well in scenarios where the IBD between strains was low (<=25%). In moderate or high IBD scenarios with imbalanced strain proportions, DEploid tended to underestimate the proportion of the minor strain resulting in underestimation of ${K}_{e}$, whereas DEploidIBD was able to infer the proportion of these mixtures correctly (Figure 2). The main novelty of DEploidIBD is the calculation of an IBD profile between strains. We found that the IBD summary statistics produced by DEploidIBD were accurate across all twostrain mixed infections tested in Africa. In Asia, DEploidIBD tended to estimate more IBD than was simulated (Figure 2B). However, this likely reflects the presence of higher background IBD in Asia rather than systematic error.
To simulate realistic mixed infections containing 3 or 4 strains, we first considered the different transmission scenarios under which they can arise. We modelled a mixed infection of $K$ strains as resulting from $b$ biting events, where $K\in \{3,4\}$ and $1\le b\le K$. When greater than one strain is transmitted in a single biting event, the cotransmitted strains will share IBD, as a consequence of meiosis occurring in the mosquito. Strains transmitted through independent bites, causing superinfection in the host, do not share any IBD beyond background. Following this paradigm, we generated a suite of mixed infection types: $K=3$, $b=1,2,3$ and $K=4$, $b=1,2,2,3,4$ (the first $b=2$ has two strains per bite, the second three and one); and simulated each of these across a variety of proportions, again using sets of clonal samples from Africa and Asia as starting strains.
As with the experimental validation, the balancedproportion $K=3$ mixed infections generated in silico proved challenging to deconvolute, with both methods inferring the presence of two rather than three strains (Figure 2C). In mixed infections with imbalanced proportions, we found that, in African samples with IBD ($b=1,2$), DEploid tended to either underestimate the number of strains present, or infer proportions incorrectly. In Asian samples this is less of an issue as the reference panels can provide better prior information for deconvolution due to lower diversity. In contrast, DEploidIBD consistently gave the correct number of strains and proportions in such cases, and produced IBD statistics that were accurate as long as the median coverage of simulated infections was > 20x. Both methods struggled to deconvolute mixed infections of four strains (Figure 2—figure supplement 2), although performed better (i.e. inferred $K=4$ greater than 50% of the time) for mixtures with less IBD ($b=3,4$). However, even in these cases, estimates of the proportions and IBD statistics were variable, indicating that further work is needed before $K=4$ mixed infections can be reliably deconvoluted.
Finally, we used the in silico approach to explore the quality of haplotypes inferred by DEploidIBD, focusing on $K=2$ infections across variable proportions. We compared the haplotype inferences between DEploid and DEpoloidIBD using the error model described in the Appendix, and found that rates of genotype error are similar for the two approaches in settings of low relatedness (DEploidIBD has an error rate of 0.7% per site per strain for 20/80 mixtures and 1.4% for 50/50 mixtures). However, for the 20/80% mixtures with high relatedness, genotype error for DEploid increased to 1.8%, while remaining at 0.8% for DEploidIBD (Figure 3A). Switch errors in haplotype estimation are comparable between the two methods and decrease with increasing relatedness due to higher homozygosity (Figure 3B). Finally, we identified a simple metric to compute on inferred haplotypes that can identify low quality haplotypes (see Appendix).
Geographical variation in mixed infection rates and relatedness
To investigate how the rate and relatedness structure of mixed infections varies among geographical regions with different epidemiological characteristics, we applied DEploidIBD to 2344 field samples of P. falciparum released by the Pf3k project (Pf3k Consortium, 2016). These samples were collected under a wide range of studies with differing designs, though the majority of samples were collected from symptomatic individuals seeking clinical treatment. An important exception are the samples from Senegal which, though collected passively at a clinic, were screened to contain only one strain by SNP barcode (Daniels et al., 2015). A summary of the data sources is presented in Table 1 and full details regarding study designs can be found at https://www.malariagen.net/projects/pf3k#samplinglocations. Details of data processing are given in the Appendix. For deconvolution, samples were grouped into geographical regions by genetic similarity; four in Africa, and three in Asia. (Table 1). Reference panels were constructed from the clonal samples found in each region. Since previous research has uncovered strong population structure in Cambodia (Miotto et al., 2013), we stratified samples into West and North Cambodia when performing analysis at the country level. Diagnostic plots for the deconvolution of all samples can be found at https://github.com/mcveanlab/mixedIBDSupplement (Zhu, 2018a; copy archived at https://github.com/elifesciencespublications/mixedIBDSupplement) and inferred haplotypes can be accessed at ftp://ngs.sanger.ac.uk/production/pf3k/technical_working/release_5/mixedIBD_paper_haplotypes/. We identified 787 samples where low sequencing coverage or the presence of lowfrequency strains resulted in unusual haplotypes (see Appendix). Estimates of strain number, proportions and IBD states from these samples are used in subsequent analyses, but not the haplotypes. We also confirmed that reported results are not affected by the exclusion of samples with haplotypes with low confidence (data not shown). In all following analyses, only strains present with a proportion of1% in a sample are reported.
We find substantial variation in the rate and relatedness structure of mixed infections across continents and countries. Within Africa, rates of mixed infection vary from 29% in The Gambia to 63% in Malawi (Figure 4A). Senegal has a rate of mixed infection (18%) lower than The Gambia, however as these samples were screened by SNP barcode to be clonal, this rate should be an underestimate. In Southeast Asian samples, mixed infection rates are in general lower, though also vary considerably; from 21% in Thailand to 54% in Bangladesh. Where data for a location is available over multiple years, we find no evidence for significant fluctuation over time (though we note that these studies are typically not well powered to see temporal variations). We observe that between 5.1% (Senegal) and 40% (Malawi) of individuals have infections carrying more than two strains.
Relatedness between samples and populations also varies substantially. In dual infections, the average fraction of the genome inferred to be IBD ranges from 14% in Guinea to 65% in West Cambodia (Figure 4B). Asian populations show, on average, a higher level of relatedness within dual infections (44%) compared to African populations ( 26%). Levels of IBD in samples with three or more strains are comparable to those seen in dual infections (average IBD being 45% in Asia and 37% in Africa) and significantly correlated at the country level, with correlation of 0.75 (p=0.0019, weighted by the number of mixed samples). Overall, 51% of all mixed infections involve strains with over 30% of the genome being IBD.
We next considered the relationship between mixed infection rate and the level of IBD. We find that populations with higher rates of mixed infection tend to have lower levels of IBD within mixed infections (linear model p=0.06 after accounting for a continental level difference and weighted by sample size). However, the continental level effect is driven by Senegal, which has an unusual combination of low mixed infections and also low IBD. Excluding Senegal, we find a consistent pattern across populations (Figure 4C), with a strong negative correlation between mixed infection rate and the level of IBD (Pearson $r=0.65$, p = $3\times {10}^{4}$). Previous work has demonstrated how a recent and dramatic decline in P. falciparum prevalence within Senegal has left an impact on patterns of genetic variation (Daniels et al., 2015), which may explain its unusual profile.
Inferring the origin of IBD in mixed infections
The high levels of IBD observed in many mixed infections suggest the presence of sibling strains (Figure 5). To quantify the expected IBD patterns between siblings, we developed a meiosis simulator for P. falciparum (pfmeiosis), incorporating relevant features of malaria biology that can impact the way IBD is produced in a mosquito and detected in a human host. Most importantly, a single infected mosquito can undergo multiple meioses in parallel, one occurring for each oocyst that forms on the mosquito midgut (Ghosh et al., 2000). In a mosquito infected with two distinct strains, each oocyst can either self (the maternal and paternal strain are the same) or outbreed (the maternal and paternal strains are different). We model a $K=n$ mixed infection as a sample of $n$ strains (without replacement, as drawing identical strains yields $K=n1$) from the pool of strains created by all oocysts. Studies of wildcaught Anopheles Gambiae suggest that the distribution of oocysts is roughly geometric, with the majority of infected mosquitoes carrying only one oocyst (Beier et al., 1991; Collins et al., 1984). In such a case, we find that a $K=2$ infection will have an expected IBD of 1/3, consistent with the observations of Wong et al. (2018). Conditioning on at least one progeny originating from an outbred oocyst (such that a detectable recombination event has occurred), the expected IBD asymptotically approaches 1/2 as the total number of oocysts grows (see Appendix).
Using this simulation framework, we sought to classify observed mixed infections based on their patterns of IBD. We used two summary statistics to perform the classification: mean IBD segment length and IBD fraction. We built empirical distributions for these two statistics for each country in Pf3k, by simulating meiosis between pairs of clonal samples from that country. In this way, we control for variation in genetic diversity (as background IBD between clonal samples) in each country. Starting from a pair of clonal samples ($M=0$, where $M$ indicates the number of meioses that have occurred), we simulated three successive rounds of meiosis ($M=1,2,3$), representing the creation and serial transmission of a mixed infection (Figure 6A). Each round of meiosis increases the amount of observed IBD. For example, in Ghana, the mean IBD fraction for $M=0$ was 0.002, for $M=1$ was 0.41, for $M=2$ was 0.66, and for $M=3$ was 0.80 (Figure 6B). West Cambodia, which has lower genetic diversity, had a mean IBD fraction of 0.08 for $M=0$ and consequently, the mean IBD fractions for higher values of $M$ were slightly increased, to 0.46, 0.68, 0.81 for $M=$ 1, 2 and 3, respectively (Figure 6B).
With these simulated distributions, we used Naive Bayes to classify $K=2$ mixed infections in Pf3k (Figure 6C). Of the 393 $K=2$ samples containing only highquality haplotypes (see Appendix), 325 (83%) had IBD statistics that fell within the range observed across all simulated $M$. Of these, nearly half (183, 47%) were classified as siblings ($M>0$). Moreover, we observe geographical differences in the rate at which sibling and unrelated mixed infections occur. Notably, in Asia a greater fraction of all mixed infections contained siblings (59% vs. 41% in Africa), driven by a higher frequency of $M=2$ and $M=3$ mixed infections (Figure 6D). Mixed infections classified as $M>1$ are produced by serial cotransmission of parasite strains, that is a chain of mixed infections along which IBD increases.
Characteristics of mixed infections correlate with local parasite prevalence
To assess how characteristics of mixed infections relate to local infection intensity, we obtained estimates of P. falciparum prevalence (standardised as $PfP{R}_{210}$, prevalence in the 2to10 year age range) from the Malaria Atlas Project ((MAP, 2017), see Table 1). The countrylevel prevalence estimates range from 0.01% in Thailand to 55% in Ghana, with African countries having up to two orders of magnitude greater values than Asian ones (mean of 36% in Africa and 0.6% in Asia). However, seasonal and geographic fluctuations in prevalence mean that, conditional on sampling an individual with malaria, local prevalence may be much higher than the longerterm (and more geographically widespread) countrylevel average, hence we extracted the individual pixellevel estimate of prevalence (corresponding to a 5 km × 5 km region) from MAP nearest to each genome collection point. We summarise mixed infection rates by the average effective number of strains, which reflects both the number and proportion of strains present.
Given that samples from Senegal were screened to be primarily singlegenome (Daniels et al., 2015), we computed all correlations with prevalence including (${r}_{S+}$) and excluding them (${r}_{S}$; Figure 7). We find that the effective number of strains is a significant predictor of $PfP{R}_{210}$ globally (${r}_{S+}=0.65,p<{10}^{5}$) and in African populations when Senegal is included (${r}_{S+}=0.48,p=0.04$, ${r}_{S}=0.18,p=0.51$), but is uncorrelated across Asia. Similarly, $PfP{R}_{210}$ is negatively correlated with background IBD globally (${r}_{S+}=0.43,p=0.004$) and across Africa but not in Asia. Surprisingly, the amount of IBD observed within $K=2$ mixed infections was not correlated with prevalence in Africa or Asia. The rate of sibling infection ($M=1$) is not correlated with the parasite prevalence (Asia: ${r}_{S+}=0.23,p=0.2$, Africa: ${r}_{S+}=0.16,p=0.5$). However, the rate of supersiblings ($K=2,M>1$) is significantly correlated with $PfP{R}_{210}$ (${r}_{S+}=0.31,p=0.04$) at the global scale, suggesting that serial cotransmission may occur more readily in low prevalence regions.
Discussion
It has long been appreciated that mixed infections are an integral part of malaria biology. However, determining the number, proportions, and haplotypes of the strains that comprise them has proven a formidable challenge. Previously we developed an algorithm, DEploid, for deconvoluting mixed infections (Zhu et al., 2018d). However, we subsequently noticed the presence of mixed infections with highly related strains in which the algorithm performed poorly, particularly with lowfrequency minor strains. Mixed infections containing highly related strains represent an epidemiological scenario of particular interest, because they are likely to have been produced from a single mosquito bite, itself multiply infected, and in which meiosis has occurred to generate sibling strains. Thus, we developed an enhanced method, DEploidIBD, capable not only of deconvoluting highly related mixed infections, but also inferring IBD segments between all pairs of strains present in the infection. Validation work using simulated mixed infections illustrated that DEploidIBD performs well on infections of two or three strains and across a widerange of IBD levels. We note that limitations and technical difficulties remain, including deconvoluting infections with more than three strains, handling mixed infections with highly symmetrical or asymmetrical strain proportions (e.g. $K=3$ with strains at 33%, or $K=2$ with one strain at 2%), analysing data with multiple infecting species, coping with lowcoverage data, and selecting appropriate reference panels from the growing reference resources.
The application of DEploidIBD to the 2344 samples in the Pf3k project has revealed the extent and structure of relatedness among malaria infections and how these characteristics vary between geographic locations. We found that 1026 (44%) of all samples in Pf3k were mixed, being comprised of 480 $K=2$ infections, 372 $K=3$ and 127 $K=4$ infections. Across the entire data set, the total number of genomes extracted from mixed infections is nearly double the number extracted from clonal infections (2584 genomes from $K>1$ vs. 1365 from $K=1$). We also found considerable variation, between countries and continents in the characteristics of mixed infections, suggesting that they are sensitive to local epidemiology. Previous work has highlighted the utility of mixed infection rate in discerning changes in regional prevalence, and we reenforce that finding here, observing a significant correlation between the effective number of strains and parasite prevalence across Pf3k collection sites. Similarly, using DEploidIBD we also observe significant geographical variation in the relatedness profiles of strains within mixed infections. Interestingly, this variation is structured such that regions with high rates of mixed infection tend to contain strains that are less related, resulting in a significant negative correlation between mixed infection rate and mean relatedness within those infections.
The ability to identify the extent and genomic structure of IBD enables inference of the mechanisms by which mixed infections can arise. A mixed infection of $K$ strains can be produced by either $K$ independent infectious bites or by $j<K$ infectious bites. In the first case, parasites are delivered by separate vectors and no meiosis occurs between the distinct strains, thus any IBD observed in the mixed infection must have preexisted as background IBD between the individual strains. In the second case, meiosis may occur between strains, resulting in long tracts of IBD. The exact amount of IBD produced by meiosis is a random variable, dependent on outcomes of meiotic processes, such as the number of recombination events, the distance between them, and the segregation of chromosomes. Importantly, the mean IBD produced during meiosis in P. falciparum also depends on the number and type (selfed vs outbred) of oocysts in the infectious mosquito. Here, we have shown, from first principles, that the amount of IBD expected in a singlebite mixed infection produced from two unrelated parasites strains will always be slightly less than 1/2, and potentially as low as 1/3 (see Appendix).
To quantify the distribution of IBD statistics expected through different mechanisms of mixed infection, we developed a Monte Carlo simulation tool, pfmeiosis, which we used to infer the recent transmission history of individuals with dual ($K=2$) infections. We considered mixed infection chains, in which $M$ successive rounds of meiosis, transmission to host, and uptake by vector can result in sibling strain infections with very high levels of IBD. Using this approach, we found that 47% of dual infections within the Pf3k Project likely arose through cotransmission events. Moreover, and particularly within Asian population samples, we found evidence for long mixed infection chains ($M>1$), representing repeated cotransmission without intervening superinfection. This observation is not a product of lower genetic diversity in Asia, as differences in background IBD between countries have been controlled for in the simulations. Rather, it reflects true differences in transmission epidemiology between continents. These findings have three important consequences. Firstly, they suggest that successful establishment of multiple strains through a single infection event is a major source of mixed infection. Second, they imply that the bottlenecks imposed at transmission (to host and vector) are relatively weak. Finally, they indicate that the differing mechanisms causing mixed infections reflect aspects of local epidemiology.
We note that a nontrivial fraction (17%) of all mixed infections had patterns of IBD inconsistent with the simulations (typically with slightly higher IBD levels than background but lower than among siblings). We suggest three possible explanations. A first is that the unclassified samples result from the IBD profiles produced by DEploidIBD, in particular the overestimation of short IBD tracks, similar to the issue observed by Wong et al. (2018). Alternatively, our estimate of background IBD, generated by combining pairs of random clonal samples from a given country into an artificial $M=0$ mixed infection, will underestimate true background IBD if there is very strong local population structure. Finally, we only simulated simple mixed infection transmission chains, at the exclusion of more complex transmission histories, such as those involving strains related at the level of cousins. The extent to which such complex histories can be inferred with certainty remains to be explored.
Lastly, our results show that the rate and relatedness structure of mixed infections correlate with estimated levels of parasite prevalence, at least within Africa, where prevalence is typically high (Smith et al., 1993). In Asia, which has much lower overall prevalence, as well as greater temporal (and possibly spatial) fluctuations, we do not observe such correlations. However, it may well be that other genomic features that we do not consider in this work could provide much higher resolution, in space and time, for capturing changes in prevalence than traditional methods. Testing this hypothesis will lead to a much greater understanding of how genomic data can potentially be used to inform global efforts to control and eradicate malaria.
Materials and methods
The data analysed within this paper were collected and made openly available to researchers by member of the Pf3k Consortium. Information about studies within the data set can be found at https://www.malariagen.net/projects/pf3k#samplinglocations. Detailed information about data processing can be found at https://www.malariagen.net/data/pf3k5. Briefly, field isolates were sequenced to an average read depth of 86 (range 12.6–192.5). After removing humanderived reads and mapping to the 3D7 reference genome, variants were called using GATK best practice and approximately one million variant sites were genotyped in each isolate. After filtering samples for low coverage and crossspecies contamination, 2344 samples remained. The Appendix provides details on the filters used and data availability. For deconvolution, samples were grouped into geographical regions by genetic similarity; four in Africa, and three in Asia. (Table 1). Reference panels were constructed from the clonal samples found at each region. Since previous research has uncovered severe population structure in Cambodia (Miotto et al., 2013), we stratified samples into West and North Cambodia when performing analysis at the country level.
Data availability
View detailed protocolMetadata on samples is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/pf3k_release_5_metadata_20170804.txt.gz. Sequence data (aligned to Plasmodium falciparum strain 3D7 v3.1 reference genome sequences, for details see ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/201508/Pfalciparum.genome.fasta.gz) is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/5.1/. Diagnostic plots for the deconvolution of all samples can be found at https://github.com/mcveanlab/mixedIBDSupplement (Zhu, 2018a; copy archived at https://github.com/elifesciencespublications/mixedIBDSupplement) and deconvoluted haplotypes can be accessed at ftp://ngs.sanger.ac.uk/production/pf3k/technical_working/release_5/mixedIBD_paper_haplotypes/. Code implementing the algorithms described in this paper, DEploidIBD, is available at https://github.com/DEploiddev/DEploid (Zhu, 2018b; copy archived at https://github.com/elifesciencespublications/DEploid). Code to generate in silico lab mixture of 4 strains are available at https://github.com/DEploiddev/DEploidDataBenchmarkin_silico_lab_mixed_4s (Zhu, 2019b; copy archived at https://github.com/elifesciencespublications/DEploidDataBenchmarkin_silico_lab_mixed_4s). Code to generate in silico field mixtures of 2, 3, four strains are available at https://github.com/DEploiddev/DEploidDataBenchmarkin_silico_field (Zhu, 2019a; copy archived at https://github.com/elifesciencespublications/DEploidDataBenchmarkin_silico_field).
Appendix 1
Deconvolution in the presence of IBD
Notation
We use the same notations as Zhu et al. (2018d) (see Appendix 1—table 1). Our data, $D$, are the allele read counts of sample $j$ at a given site $i$, denoted as ${r}_{j,i}$ and ${a}_{j,i}$ for reference (REF) and alternative (ALT) alleles respectively. These have assigned values of 0 and 1, respectively. Here we consider only biallelic loci, though future extensions to the model to include multiallelic sites could be accommodated. The empirical allele frequencies within a sample (WSAF) ${p}_{j,i}$ and at population level (PLAF) ${f}_{i}$ are calculated by $\frac{{a}_{j,i}}{{a}_{j,i}+{r}_{j,i}}$ and $\frac{{\sum}_{j}{a}_{j,i}}{{\sum}_{j}{a}_{j,i}+{\sum}_{j}{r}_{j,i}}$ respectively. Since all data in this section refers to the same sample, we drop the subscript $j$ from now on.
Modelling mixed infections with IBD
We describe the mixed infection problem by considering the number of strains, $K$, the relative abundance of each strain, $\mathbf{\mathbf{w}}$, and the allelic state of the $k$th strain at the $i$th locus, ${\mathbf{\mathbf{h}}}_{k,i}$. In addition to Zhu et al. (2018d), we also infer the identitybydescent (IBD) state ${\mathcal{S}}_{i}$, which describes the strain relationship with each other at site $i$. For example, for three strains, the IBDstate could be one of the five cases: (1) all strains are not IBD; (2) strains 1 and 2 are IBD; (3) strains 1 and 3 are IBD; (4) strains 2 and 3 are IBD; (5) all strains are IBD (see Appendix 1—table 2). To simplify the problem, we assume independence between markers, and drop the subscript $i$ from now on. As previously, we use a Bayesian approach and define the posterior probabilities of $K$, $\mathbf{\mathbf{w}}$, $\mathbf{\mathbf{h}}$ and $\mathcal{S}$, as
where $e$ is the read error rate. Moreover, we can decompose the joint prior as
The number of strains, $K$, and their proportions $\mathbf{\mathbf{w}}$, are as in Zhu et al. (2018d): $K$ is fixed at a userdefined value (here, $K=4$) and strains below a fixed proportion threshold (here, 0.01) are removed. To model IBD configurations and resulting haplotypes we introduce a parameter $\theta $, which is the probability of outbreeding (i.e. nonIBD) for a mixed infection with two strains. Specifically, for $K$ strains, the prior probability on configuration $\mathcal{S}$ is a function of the number of distinct strains in the configuration, ${C}_{\mathcal{S}}$, and the number of distinct configurations with the same number of distinct strains, ${M}_{\mathcal{C}}$:
That is, the number of distinct strains is drawn from a binomial distribution with parameters $K$ and $\theta $ and the IBD configuration is uniformly drawn from among those with the same number of distinct strains. Consequently, the expected number of distinct haplotypes (i.e. nonIBD) in an infection with $K$ strains is $1+\theta (K1)$. The value of $\theta $ estimated within the Markov chain Monte Carlo (see details below).
Suppose that the probability of recombination event between two adjacent sites is ${p}_{rec}$ (see Implementation details). To model the transitions between IBD states we make the approximation that the IBD state at the $i+1$th site is the same as at the $i$th site, with probability $1{p}_{rec}$ and, if there is a recombination event, the IBD state is drawn from the stationary distribution described above.
To model allelic states (or genotype), conditional on IBD state, we assume that allelic states for each IBD group are independent Bernoulli draws given the population allele frequency (PLAF). For example, if $K=3$ and only strains 1 and 2 are IBD (state 0,0,1), the genotype at site $i$, ${\mathbf{\mathbf{h}}}_{\mathbf{\mathbf{i}}}$, could be $\{0,0,0\}$ or $\{1,1,1\}$, with probabilities ${(1f)}^{2}$, $(1f)f$, $f(1f)$ and ${f}^{2}$ respectively. The joint prior on IBD and allelic states is referred to as $\psi (\mathcal{S},\mathbf{\mathbf{h}})$. Note that we ignore association (known as linkage disequilibrium or LD) between nearby alleles, though note that in implementation we run DEploidIBD twice; first as described here to estimate strain number and proportions, allowing for IBD, then subsequently with a reference panel, as for DEploid, but with the strain number and proportions fixed. This latter step does model LD and consequently results in better estimates of strain haplotypes.
Details of the emission model are the same as described in Zhu et al. (2018d). Briefly, the reference and alternative allele read counts at each site are modelled as being drawn from a betabinomial distribution given the expected WSAF, $q={\mathbf{\mathbf{w}}}^{T}\mathbf{\mathbf{h}}$. To incorporate sequencing error, we modify the expected WSAF such that allele are misread with probability $e$. Thus, the adjusted expected WSAF becomes
As previously, we model overdispersion in read counts relative to the Binomial using a Betabinomial distribution.
Parameter estimation using Markov chain Monte Carlo
To infer the relatedness between strains within a mixed infection and their relative proportions we use a Markov chain Monte Carlo (MCMC) approach. We learn the relative abundance of each strain by exploiting signatures of withinsample allele frequency imbalance, using a MetropolisHastings algorithm, which samples proportions, $\mathbf{\mathbf{w}}$, given IBDconfigurations, $\mathcal{S}$. We then use a Gibbs sampler to update $\mathcal{S}$ and $\mathbf{\mathbf{h}}$ for a given $\mathbf{\mathbf{w}}$ by first sampling the IBD state from the posterior, and then sampling the allelic configuration (genotype) from the selected IBD state at each site to update haplotypes. Note that the hidden Markov model structure for IBD states along the chromosome leads to efficient algorithms for calculating quantities of importance. Notably, by summing over allelic configurations that are consistent with a given IBD configuration, we can  for a given set of strain proportions  calculate the likelihood integrated over all possible IBD configurations, which leads to efficient sampling.
MetropolisHastings update for proportions
We update strain proportions, $\mathbf{\mathbf{w}}$, through modelling underlying log titres, $\mathbf{\mathbf{x}}$, where ${w}_{i}\propto exp({x}_{i})$. As previously, log titres are assumed to be drawn from a $N(log(K),{\sigma}^{2})$ distribution under the prior. To update, we choose $i$ uniformly from $K$ and propose new ${x}_{i}^{\prime}$s from ${x}_{i}^{\prime}={x}_{i}+\delta x$, where $\delta x\sim N(0,{\sigma}^{2}/s)$, and $s$ is a scaling factor. The new proposed proportion is therefore $\frac{\mathrm{exp}({x}_{i}^{\prime})}{{\sum}_{k=1}^{K}\mathrm{exp}({x}_{k}^{\prime})}$. Since the proposal distribution is symmetrical, the Hastings ratio is 1. The proposed update is accepted with probability
Note that, conditional on the current estimate of strain haplotypes, $\mathbf{\mathbf{h}}$, the likelihood is not a function of the specific IBD configuration.
Gibbs sampling update for haplotype and IBDconfiguration update
As described above, in DEploidIBD, allelic states at different sites within a haplotype are independent. However, IBD states are connected through a Markov process. We therefore update the IBD configuration and strain haplotypes in a two stage process. First, we use the Forward algorithm to first compute the integrated likelihood  that is the probability of observing the readlevel data integrating over all possible IBD configurations and allelic states. Defining ${F}_{i,\mathcal{S},\mathbf{\mathbf{h}}}$ to be the integrated likelihood for the set of paths ending in IBD configuration $\mathcal{S}$ at site $i$ and with allelic configuration $\mathbf{\mathbf{h}}$, it follows that
where $\psi (\mathcal{S},\mathbf{\mathbf{h}})$ is the prior on the combination of IBD configuration $\mathcal{S}$ and allelic configuration $\mathbf{\mathbf{h}}$ as defined in Section (Modelling mixed infections with IBD). Note that the summation term is identical for all IBD/allelic configurations.
By storing the output of the Forward algorithm we can then sample from the posterior distribution of the IBD and allelicconfigurations (given current proportions). That is, we sample ${\mathcal{S}}_{l},{\mathbf{\mathbf{h}}}_{l}\propto {F}_{l,\mathcal{S},\mathbf{\mathbf{h}}}$, and then previous steps proportion to ${F}_{i,\mathcal{S},\mathbf{\mathbf{h}}}$ times the probability to transition to the sampled state at position $i+1$.
Caveat: identifiability with balanced mixing
Using a reference panel free approach means that it can be difficult or impossible to deconvolute samples containing strains with equal proportions. For example, it is impossible to distinguish between $\{\frac{1}{3},\frac{1}{3},\frac{1}{3}\}$ and $\{\frac{1}{3},\frac{2}{3}\}$, or $\{\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4}\}$ and $\{\frac{1}{4},\frac{1}{4},\frac{1}{2}\}$. Because of the prior we use, the model will prefer to merge haplotypes when possible, which can lead to underestimation of the number of strains. We advise users to apply DEploid with multiple runs with and without the ‘ibd’ flag and see if such problem occurs.
Gibbs sampling update for IBD parameter
Given the sampled IBD configuration, the IBD parameter $\theta $ can be updated using Gibbs sampling. We first identify the path of the $D$ distinct IBD configurations along the chromosome and, for each, identify the number of distinct IBD groups, ${\mathcal{C}}_{d}$. For example, if the IBD state is 0,1,2,2 there are three distinct IBD groups (0,1,2; ${\mathcal{C}}_{d}=3$), while if the IBD state is 0,0,1,1 there are two IBD groups (0,1; ${\mathcal{C}}_{d}=2$). Under the prior, the number of distinct strains (minus 1) is drawn from a binomial distribution with parameters $K1$ and $\theta $. With a uniform prior on $\theta $, a new value, ${\theta}^{\prime}$ is therefore drawn from:
where ${\mathcal{C}}_{D}={\sum}_{d=1}^{D}{\mathcal{C}}_{d}$.
Implementation details
Below we detail a number of implementation details.
Approximating the likelihood surface
At any site, the likelihood for the WSAF, $q$, induced by the allelic configuration and strain proportions derives from the betabinomial model as implemented in DEploid. That is, the likelihood of ${q}_{i}$ is
where ${\pi}_{i}$ is the adjusted WSAF $\pi =q(1e)+(1q)e$ and ${c}^{1}$ determines the magnitude of overdispersion relative to the binomial, with $c\to \mathrm{\infty}$ recovering the binomial. Here, we use $c=100$.
For computational efficiency, rather than recomputing for every value of $q$, we first approximate the likelihood function for each site through a Beta distribution, matching the first two moments of the posterior on $q$ implied by Equation 6 within a uniform prior on $q$. The estimated parameters of the Beta distribution are then used to approximate the likelihood surface in all subsequent calculations.
MCMC parameters for deconvolution
Number of strains. As described above, we aim to infer more strains than are actually present, starting the MCMC chain with a fixed $K$ (default of 5). In our experience, we find poor performance for $K>4$, hence use the flag k four to specify the number of strains as 4. At the point of reporting, we keep strains with a proportion above a fixed threshold, typically 0.01.
Parameters. We typically set the logtitre variance ${\sigma}^{2}=5$, which can be adjusted when working with extremely unbalanced samples (see Zhu et al., 2018d, Supplementary Material). We set the read error rate as 0.01 and the rate of miscopying as 0.01.
Recombination rate and scaling. We assume a uniform recombination map, where the genetic distance between loci $i$ and $i+1$ is computed by ${\gamma}_{i}={D}_{i}/{d}_{m}$ where ${D}_{i}$ denotes the physical distance between loci $i$ and $i+1$ in nucleotides and ${d}_{m}$ denotes the average recombination rate in Morgans bp^{−1}. We use the recombination rate for P. falciparum of 13,500 base pairs per centiMorgan as reported by Miles et al. (2016). The recombination rate is scaled by a factor $G$, which reflects the effective population size, rate of inbreeding, and relatedness of the reference panel. In practice, we deconvolute over 1 million markers in field samples. We tuned the parameter $G$ using in vitro lab mixtures, finding that a value of $G=20$ ensures small recombination probabilities between any two markers, with a mean of 0.015. A large value of $G$ relaxes the reference panel constraint, becoming an LD free model when $G$ is infinity. The scaled genetic distance $G\gamma $ is used to compute the transition probability of switching from copying reference haplotype $a$ to reference haplotype $b$. Given LD varies enormously between P. falciparum populations, we will investigate how to tune this parameter for future improvement. In the current release, our program allows users to apply the flag G to specify a specific value. We note that IBD deconvolution from errorprone data can benefit from even smaller values of $G$, such as 0.1.
Reporting. We aim to provide users with a single point estimate of the haplotypes, their proportions and IBD configurations, although the full chain is also available for analysis. To achieve this we report values at the last iteration  that is we report a single sample from the posterior. However, to measure robustness, we typically repeat the deconvolution with multiple random starting points. We use a majority vote rule on the inferred number of strains; we then select the chain with the lowest average deviance (after removing the burnin) as our estimate. The deviance measures the difference in log likelihood between the fitted and saturated models, the latter being inferred by setting the WSAF to that of the observed values. These parameters can be modified by users to achieve a preferred balance between computational speed and confidence. By default, we set the MCMC sampling rate as 5, with the first 50% of samples removed as burn in and 800 samples used for estimation.
Reference panel construction. To infer clonal samples for the reference panel we use the Pf3k project data, running the algorithm without LD on all samples and identifying those with a dominant haplotype (proportion > 0.99) as clonal. These clonal samples are grouped by region of sampling to form locationspecific reference panels. In addition, we have included a number of reference strains, described in more detail below.
Summarising pairwise IBD
At the end of a run, we obtain the maximum likelihood estimate of IBD configurations along the chromosome using the Viterbi algorithm. Patterns of IBD are then obtained for all pairs of strains and summarised through the mean fraction IBD and the N50 IBD tract, obtained by identifying all blocks of IBD, sorting them in decreasing size and finding the size of the block such that 50% of all sites that are in IBD lie in blocks of at least this size. A larger N50 statistic indicates more recent common ancestry.
We note that it is also possible to obtain estimates of pairwise IBD from the full posterior distribution of pairwise IBD (using standard hidden Markov model practices). Typically these give very similar answers to the Viterbi solution, though are typically slightly larger due to the identification of low certainty regions. The posterior mean IBD between pairs is also provided to users as output.
Commands
Each isolate deconvolution is repeated 15 times, each time initialized with a random seed. For each run we obtain an estimate of the number of strains and we take the modal value to be the estimate. For reporting, we use the first run that estimated the consensus number of strains. The text below gives an example of an input script for deconvolution. Full details are available in the documentation at the Github page: https://github.com/mcveanlab/DEploid/ (Zhu, 2018b; copy archived at https://github.com/elifesciencespublications/DEploid).
ref=PD0577 C_ref.txt
alt = PD0577 C_alt.txt
plaf = asiaGroup1_PLAF.txt
panel = asiaGroup1PanelMostDiverse10.csv
exludeAt = asiaGroup1_and_pf3k_bad_snp_in_at_least_50_samples.txt
prefix = PD0577 C_IBD time dEploid ref ${ref} alt ${alt} plaf ${plaf} panel ${panel} \ exclude ${exludeAt} o ${prefix} nSample 250 rate 8 burn 0.67 ibd k 4 interpretDEploid.r ref ${ref} alt ${alt} plaf ${plaf} o ${prefix} \ dEprefix ${prefix} exclude ${exludeAt}
Error analysis
For haplotype quality analysis, we compared the inferred haplotype (DEploid output) with the true haplotype. In addition to mismatches between the testing haplotype and the truth, the deconvolution process also introduces switch errors, where the haplotypes are correct, but have undergone in silico recombination events. In addition, when one strain has low frequency, there might be insufficient data to make accurate inference, resulting in missing segments of the true haplotypes. We refer to this as dropout error. Dropout error can also be caused by the sequencing process, when parts of the genome are not well sequenced (low read count).
When comparing inferred haplotypes to true haplotypes we use a dynamic programming approach to find a description of the differences, optimising a cost function in which switch errors are twice as costly as mismatches and allele dropouts. Code for performing the optimal description, errorAnalysis.r, is available at the Github page: https://github.com/DEploiddev/DEploidUtilities/(Zhu, 2018c; copy archived at https://github.com/elifesciencespublications/DEploidUtilities). An example analysis for three strains is shown in Appendix 1—figure 1.
Appendix 2
Generating in silico mixtures for deconvolution benchmark
In silico mixtures of lab strains
For in silico lab mixtures, we used the lab strain GB4 to provide a wholegenome read depth profile. We drew Bernoulli variables to simulate alternative allele counts, with the probability of success as the adjusted within sample allele frequency at each position (Equation 4). The expected WSAF is the dot product of the genotype of the lab strains: 3D7, Dd2, HB3, 7G8 and the proportions. For example, if the genotype is 0, 0, 1, one at an arbitrary position, and the mixture proportion of 0.1, 0.2, 0.4 and 0.3 will lead to an expected WSAF of 0.7. We then further adjust the WSAF with an error term (0.01) to account for all technical error/noise: $0.7\times (10.01)+(10.7)\times .01=0.696$.
To test the accuracy of DEploidIBD in a more realistic setting, we created in silico mixtures of 2, 3 and 4 strains given different transmission scenarios from mosquito bites. Let $K$ denote the number strains within each sample, and $b$ denote the number of independent mosquito bites. In such a scenario, a mixed infection containing $K$ strains can be treated as a result of $b$bite events, where $K\in \{2,3,4\}$ and $1\le b\le K$. For example, a $K=2$ mixed infection can result from cotransmission, where two strains are passed in a single bite ($b=1$); or by superinfection, where each strain is delivered by a unique bite ($b=2$). A mixed infection containing three strains is more complex: besides coinfection (three strains in a single bite, $b=1$) and superinfection (3 strains in three bites, $b=3$), we can also have a superinfection scenario made up from a coinfection plus a clonalinfection ($b=2$). The complete simulation procedure (code) is available at https://github.com/DEploiddev/DEploidDataBenchmarkin_silico_field.
In silico mixtures of two field strains
To simulate a mixture of two strains, we randomly selected two strains from 189 clonal samples of African origin (proportions ranging from 10/90% to 50/50%) using Chromosome 14 data. A further 20 randomly chosen samples were used as the reference panel. In order to compare the accuracy of the two methods at different levels of relatedness, we set 0%, 25%, 50% and 75% of the second haplotype the same as the first haplotype to mimic scenarios of unrelated, low, medium and high relatedness respectively. This operation sets a lower limit to the relatedness between two strains, as background relatedness may also exist. We used empirical read depths and drew read counts for the two alleles from binomial proportions (the same approach for generating in silico lab mixtures). We excluded sites for analysis at zero alternative allele counts in both targeted samples and reference panel, kept and analysed around 7757 polymorphic sites (standard deviation 178) for each in silico samples.
We repeated the in silico experiment with mixtures of two strains from 204 clonal Asian samples, also with mixing proportions of ranging from 10/90% to 50/50%, using about 3041 sites (s.d. 227) from Chromosome 14.
In silico mixtures of three field strains
We further extended benchmarking to in silico mixtures of three strains in African populations with mixing proportions of 10/10/80%, 10/25/65%, 15/25/60%, 10/40/50%, 15/30/55%, 20/30/50%, 33/33/34%. Two generate $b=1$ infections with three strains, we randomly selected two strains, namely parent A and parent B, from 189 clonal samples of Africa origin, and set the first 33% of the first haplotype and the last 66% of the second haplotype the same as parent A; the rest of the first and second haplotypes and the third haplotype the same as parent B, to mimic the scenario of a 1/3 pairwise relatedness within sample.
In addition to the basic coinfection and superinfection scenarios of 3 strains, we also consider more complex events when coinfection presents within a superinfection. This data simulation is similar to simulating 2 strains of 50% relatedness, with one additional unrelated haplotype. Therefore, the overall pairwise relatedness is $1/2$ distributed over three possible pairs, which leads to $1/6$.
In silico mixtures of four field strains
We tested In silico deconvolution experiments on mixtures of 4 strains. From previous experiments, we have learned that strain compositions with even proportions are difficult to deconvolute. In this set of simulations, we experimented with various unbalanced and balanced proportions including 11/22/30/37%, 25/25/25/25%, 20/20/20/40% and 30/30/30/10%. For some cases, the data generation procedures can easily be modified from previous experiment: a $b=4$ event of for four strains is equivalent to a 3bite event of 3 strains with one extra random haplotype; a $b=3$ event of four strains is equivalent to a 2bite event of 3 strains with one extra random haplotype. For 2bite event of 4 strains, there are two possibilities: (i) both bites pass on two strains, which is essentially repeating $b=1$ event of 2 strains twice or (ii) three strains in one bite and one in another, which is essentially a 1bite event of 3 strains with one extra random haplotype.
Haplotype quality for in silico mixtures
We assessed the quality of all the haplotypes deconvolved with DEploid and DEploidIBD for the in silico simulated mixtures (Appendix 2—figure 1, Appendix 2—figure 2). Our results are consistent with the performance we observed in field samples. Complex mixtures with balanced proportions or marginal strains (i.e. with a very low proportion) tend to produce chimeric haplotypes. Nonetheless, further research is needed to explore factors that result in deconvolution failure. DEpoidIBD performs slightly worse than the previous version of the software; we ascribe this to the fact that the simulations lack any complex IBD structure. See (Haplotype quality assessment) for more details about the procedure.
Appendix 3
Pf3k field sample analysis
Sample choice
We used 2640 samples from the Pf3k project (see https://www.malariagen.net/projects/pf3k). We excluded Nigerian samples from downstream analysis as there were only five samples from this country. We discarded samples containing mixed malaria species and those where sequencing coverage depth was below 30X or in which less than 30% of sites were callable. Finally, we exclude all lab strains (including reference strains, artificial mixtures, and crosses samples) and duplicated samples. In total, we retained 2344 field samples from 13 countries.
Data filtering
We ran DEploidIBD on high quality biallelic SNPs (both coding and noncoding, tagged with PASS at the QUAL column in the VCF file) from Pf3k (Pf3k Consortium, 2016). Before the additional filtering step described below, this set contained 1,057,830 SNPs.
bcftools view\
–include ’FILTER=”PASS"’ \
–minalleles 2 \
–maxalleles 2 \
–types snps \
–outputfile SNP_INDEL_Pf3D7_01_v3.high_quality_biallelic_snps.vcf.gz \
–outputtype z \
SNP_INDEL_Pf3D7_01_v3.combined.filtered.vcf.gz
High leverage data points
We found that markers with high coverage for both alleles could mislead our model, inducing it to fit additional strains. We used a threshold of >99.5% coverage (default) to identify markers with extremely high allele counts. We further expanded this list of potential problematic markers by considering their nearest 10 neighbours on both flanks, and excluding those that were tagged more than once (see Appendix 3—figure 1). These poorlygenotyped variants are likely to be errors of mapping and genotype calling.
To track down the causes of high leverage points, we assessed nucleotide diversity in the P. falciparum genome. We used clonal haplotypes to compute nucleotide diversity by running a sliding window along the genome. At each SNP, we use ${n}_{0}$ and ${n}_{1}$ to denote counts for reference and alternative alleles, respectively. Let $n=({n}_{0}+{n}_{1})$ be the number of haplotypes in the population with a nonmissing call. We computed the mean number of pairwise differences for this SNP as follows. First, we computed the total number of pairs as ${n}_{pairs}=n*(n1)/2$. Then, we computed the number of pairs that were the same, ${n}_{same}=({n}_{0}*({n}_{0}1)/2)+({n}_{1}*({n}_{1}1)/2)$, and the number of pairs that were different, ${n}_{d}$ = ${n}_{pairs}{n}_{same}$. Finally, we obtained the mean number of pairwise differences as $mpd={n}_{d}/{n}_{pairs}$. To estimate nucleotide diversity $\pi $, we computed the sum of $mpd$ in a window of 20kbp centred on each SNP, and divided by the number of accessible bases, which produces the mean number of pairwise differences per base.
Regions containing high leverage points tended to be at the ends of chromosomes or within regions of high nucleotide diversity, where read mapping was problematic (see Appendix 3—figure 2). We identified potential outliers in all samples, and filtered out common outliers in at least 50 samples – 48,443 in total.
Analysis preparation
To improve the accuracy and efficiency of the deconvolution process, we first split the data into groups based on genetic similarity. We computed genetic distance between two samples as follows:
where $l$ represents an arbitrary locus, $L$ denotes the total number of loci, and ${\text{WSAF}}_{s,l}$ indicates the nonreference withinsample allele frequency for sample $s$ at locus $l$ is then given by ${\text{WSAF}}_{s,l}=\frac{{a}_{s,l}}{{r}_{s,l}+{a}_{s,l}}$ where ${a}_{s,l}$ is the number of read counts supporting the alternative allele in sample $s$ at locus $l$, and ${r}_{s,l}$ is the number of read counts supporting the reference allele in sample $s$ at locus $l$.
We found that samples from the same geographical region differentiated into clear groups. We used this initial grouping as the basis for defining the reference panels that assisted the deconvolution. The geographical groups arising from this analysis are listed below. In order to reduce computational time, we only used polymorphic sites at each population group:
Malawi, Congo, with 349,242 sites.
Ghana (Navrongo), with 508,606 sites.
Nigeria, Senegal, Mali, with 210,819 sites.
The Gambia, Guinea, Ghana (Kintampo), with 250,827 sites.
Cambodia (Pursat), Cambodia (Pailin), Thailand (Sisakhet), with 44,317 sites.
Vietnam, Laos, Cambodia (Ratanakiri), Cambodia (Preah Vihear), with 88,410 sites.
Bangladesh, Myanmar, Thailand (Mae Sot), Thailand (Ranong), with 84,868 sites.
Haplotype quality assessment
In this work, we also assessed the quality of the haplotypes inferred by DEploidIBD. Our goal was to establish to what degree our inferred haplotypes were statistically indistinguishable, given a suite of population genetics statistics, from the subset of clonal haplotypes that had the same geographical origin. Our assumption was that haplotypes found in mixed infections would have similar characteristics than those present in clonal samples. In our assessment, we found that the distribution of statistics for groups of deconvoluted haplotypes had extreme outliers and presented a higher variance when compared with the clonal population originating on the same region. We noticed that the painting process implemented by DEploid struggles when faced with challenging mixtures. For instance, mixed infections in which the coexisting strains have the same relative proportion (e.g. $k=4$ with each strain having a proportion of 25%), or samples in which proportions are very unbalanced (e.g. k = 2 with the marginal strain at 2%). This often results in an excess of alternative calls being assigned to one of the strains, which in turn provokes a deficit of diversity on the remaining haplotypes, that cannot be explained in terms of their genetic relationship to the reference genome used for mapping and assembly (3D7). We defined our quality metric as a $z$score that approximates how much a deconvoluted haplotype deviates from the mean genetic diversity of the clonal population present in the same geographical area.
For each population, we computed the distribution of alternative calls observed within the subset of clonal samples ($k=1$). Using this distribution as reference, we computed a $z$score for each haplotype in the whole population following
where ${a}_{i}$ denotes the number of alternative calls in the haplotype $i$, and ${\overline{a}}_{r}$ and ${\sigma}_{r}$ are, respectively, the mean and standard deviation of observed alternative calls in the clonal set of samples from the population of origin. We only considere as suitable haplotypes with a $z$score in the range $(3,3)$, thus discarding any strain that is three or more standard deviations away, in terms of alternative calls, from the mean observed for clonal samples. By using the set of clonal samples as the reference distribution, we approximated the number of alternative calls expected in a genome belonging to that population, which serves as a proxy for genetic diversity but is easier to compute. Supp. Appendix 3—figure 3 shows an example of this filtering process for the most problematic population in the dataset (Ghana). Supp. Appendix 3—table 1 lists the number of haplotypes discarded by population while Supp. Appendix 3—table 2 describes the number of haplotypes discarded by COI level. Statistical deconvolution of haplotypes in mixed infections remains a challenging problem and requires further research. Nonetheless, our quality metric can guide other researchers in the process of discarding haplotypes that are clearly artefactual
Combining clonal sample pairs for background IBD computation
We combined randomly selected clonal sample pairs to create artificial mixed infections, as a way to generate a background IBD distributions for each country. We assumed these artificial mixed infections mimic infections generated from two independent mosquito bites. In this setting, strain proportions are determined by their median read depth, whereas sample coverage is obtained by accumulating the reference and alternative allele counts of two clonal samples. Similar to DEploidIBD deconvolution, SNPs with very high coverage resulting from this process caused high leverage in the model. Additionally, the sample sequence depth and skewness were heterogeneous due to different sample preparation and sequencing protocols. We reduced the DEploidIBD filtering threshold from 99.5% to 80%, and used low recombination probabilities to avoid false IBD breakpoint inference. We validated our method using lab crosses (Miles et al., 2016), and compared the IBD block detection using (Li and Stephens, 2003)’s painting with parental strains and the DEploidIBD algorithm (Appendix 3—figure 4).
Appendix 4
Expected levels of IBD in P.falciparum mixed infections
The amount of IBD observed in a mixed infection is a function of the number of oocysts present in the biting mosquito. We will demonstrate this below.
First, let us briefly review the fundamentals of malaria meiosis. In our case, we imagine a mosquito bites a human host containing two distinct malaria strains. Call these strains $A$ and $B$. Some number of gametocytes of $A$ and $B$ are imbibed during the bite, differentiate into gametes, and undergo fertilization to produce zygotes (reviewed in Ghosh et al., 2000; Bennink et al., 2016). Some fraction of these zygotes succeed in establishing themselves as oocysts on the mosquito midgut (Ghosh et al., 2000). Three products of fertilization are possible, and thus the oocysts can be either: $A+A$ or $B+B$ (inbred oocysts, ${n}_{ii}$), or $A+B$ (outbred oocysts, ${n}_{ij}$). The oocyst state of a mosquito can be characterized by (${n}_{ij}$, ${n}_{ii}$). Which strain is maternal and paternal may vary from oocyst to oocyst, but this is of no consequence here.
A $K=2$ mixed infection is established when two distinct sporozoites, produced from the oocysts of this mosquito, infect a host. Each oocyst produces thousands of sporozoites (Beier et al., 1991), of four types (discussed in McKenzie et al., 2001), which pool in the mosquito salivary glands (Ghosh et al., 2000). Imagine drawing a $K=2$ mixed infection from a mosquito harbouring a single outbred oocyst (${n}_{ij}$=1). In such a mosquito there are two copies of each of the two strains (two sets of sister chromatids; $A$, $A$, $B$, $B$). Thus, ignoring recombination for the present, there are two pairs with an IBD fraction of 1 and, if our original strains are unrelated, the remainder of the $\left(\genfrac{}{}{0pt}{}{4}{2}\right)$ pairs will have an IBD of 0. Thus a single ${n}_{ij}$ oocyst has an expected IBD of $E[\rho ]=2/\left(\genfrac{}{}{0pt}{}{4}{2}\right)=1/3$. We draw pairs without replacement because if sporozoites of only one type seed the infection, it will be $K=1$. Importantly, neither recombination nor segregation change this result, as they only shuffle how the total identity is distributed between pairs, rather than create or destroy identity (identity is created by DNA replication and destroyed by mutation); the expectation is taken over all pairs and is thus unaffected.
Computing the expected IBD fraction for a mosquito possessing ${n}_{ij}$ outbred oocysts is an extension of the above. Again ignoring recombination, the expected IBD fraction $E[\rho {n}_{ij}]$ is equal to the total number of pairs with an IBD of 1 (IBD pairs), over all possible pairs. In a mosquito with ${n}_{ij}$ oocysts, we have $2{n}_{ij}$ copies of each parental strain, thus we have $\left(\genfrac{}{}{0pt}{}{2{n}_{ij}}{2}\right)$ IBD pairs for each parental strain, thus $2\left(\genfrac{}{}{0pt}{}{2{n}_{ij}}{2}\right)$ IBD pairs total. Dividing this by the total number of pairs amongst ${n}_{ij}$ oocysts we have
The above yields $1/3$ for ${n}_{ij}=1$, approaching $1/2$ as ${n}_{ij}$ grows. This result has been validated with pfmeiosis in Appendix 4—figure 1.
Including ${n}_{ii}$ oocysts is somewhat involved, as some pairs (selected without replacement) may be identical (thus yielding $K=1$) or completely unrelated (yielding $K=2$, but effectively without having undergone meiosis or producing any detectable recombination breakpoints between parental strains). We are interested in the expected IBD produced as a result of meiosis between parental strains, and thus for the moment we exclude these pairs. In practice, this means the mosquito must have at least one outbred oocyst, and at least one of the infecting sporozoites must be from an outbred oocyst.
The derivation is as above: first ignoring recombination and segregation, then enumerating all IBD pairs (pairs with IBD fraction of 1) and dividing by the total number of pairs to compute the expectation. Note that the additional IBD pairs possible between an outbred and inbred oocyst are given by the term $8{n}_{ij}{n}_{ii}$, and that you can no longer use all possible pairs drawn without replacement as the denominator, but must exclude the pairs described above.
Which is of a similar form to above, but increases to $1/2$ quicker if more inbred oocysts are present. As before the equation is validated in Appendix 4—figure 2A.
The expression for $E[\rho {n}_{ij}>0,{n}_{ii}]$ can also be derived by recognizing that there are three types of pairs possible in a mosquito with a collection of ${n}_{ij}$ and ${n}_{ii}$ oocysts: (1) a pair can contain two strains from a single ${n}_{ij}$, (${n}_{ij}^{o=1}$); (2) a pair can contain two strains from two different ${n}_{ij}$, (${n}_{ij}^{o=2}$); or (3) a pair can contain one strain from an ${n}_{ij}$ oocyst and one from an ${n}_{ii}$ oocyst, ${n}_{ij,ii}^{o=2}$. Pair type (1) is unique to malaria and has an $E[\rho {n}_{ij}^{o=1}]=1/3$, as shown above; pair type (2) are standard siblings with $E[\rho {n}_{ij}^{o=2}]=1/2$; and pair type (3) represent a motherdaughter relationship, also with $E[\rho {n}_{ij,ii}^{o=2}]=1/2$. The full IBD fraction and IBD segment length distributions of these pairs were generated using pfmeiosis and can be seen in Appendix 4—figure 2B. We enumerate the number of each pair type given ${n}_{ij}$ and ${n}_{ii}$, weighted by their expectation, to derive $E[\rho {n}_{ij}>0,{n}_{ii}]$:
As above.
Data availability
Metadata on samples is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/pf3k_release_5_metadata_20170804.txt.gz. Sequence data (aligned to Plasmodium falciparum strain 3D7 v3.1 reference genome sequences, for details see ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/201508/Pfalciparum.genome.fasta.gz) is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/5.1/. Diagnostic plots for the deconvolution of all samples can be found at https://github.com/mcveanlab/mixedIBDSupplement (copy archived at https://github.com/elifesciencespublications/mixedIBDSupplement) and deconvolved haplotypes can be accessed at ftp://ngs.sanger.ac.uk/production/pf3k/technical_working/release_5/mixedIBD_paper_haplotypes/. Code implementing the algorithms described in this paper, DEploidIBD, is available at https://github.com/mcveanlab/DEploid (copy archived at https://github.com/elifesciencespublications/DEploid).

Wellcome Trust Sanger public ftp siteID 5.1 Data. The Pf3k Project (2016): pilot data release 5.
References

Quantitation of malaria sporozoites in the salivary glands of wild afrotropical anophelesMedical and Veterinary Entomology 5:63–70.https://doi.org/10.1111/j.13652915.1991.tb00522.x

The development of malaria parasites in the mosquito midgutCellular Microbiology 18:905–918.https://doi.org/10.1111/cmi.12604

Protective immunity against malaria by 'natural immunization': a question of dose, parasite diversity, or both?Current Opinion in Immunology 23:500–508.https://doi.org/10.1016/j.coi.2011.05.009

Withinhost competition and drug resistance in the human malaria parasite Plasmodium falciparumProceedings of the Royal Society B: Biological Sciences 283:20153038.https://doi.org/10.1098/rspb.2015.3038

First field trial of an immunoradiometric assay for the detection of malaria sporozoites in mosquitoesThe American Journal of Tropical Medicine and Hygiene 33:538–543.https://doi.org/10.4269/ajtmh.1984.33.538

Plasmodium knowlesi: a malaria parasite of monkeys and humansAnnual Review of Entomology 57:107–121.https://doi.org/10.1146/annurevento121510133540

Whole population, genomewide mapping of hidden relatednessGenome Research 19:318–326.https://doi.org/10.1101/gr.081398.108

DASH: a method for identicalbydescent haplotype mapping uncovers association with recent variationThe American Journal of Human Genetics 88:706–717.https://doi.org/10.1016/j.ajhg.2011.04.023

Global epidemiology of Plasmodium vivaxThe American Journal of Tropical Medicine and Hygiene 95:15–34.https://doi.org/10.4269/ajtmh.160141

K13propeller polymorphisms in Plasmodium falciparum parasites from subSaharan AfricaThe Journal of Infectious Diseases 211:1352–1355.https://doi.org/10.1093/infdis/jiu608

Modeling linkage disequilibrium and identifying recombination hotspots using singlenucleotide polymorphism dataGenetics 165:2213–2233.

WebsiteThe malaria atlas project: developing global maps of malaria riskAccessed December 8, 2017.

BookFocus on Senegal Roll Back Malaria: Progress and Impact SeriesGeneva: World Health Organization.

Plasmodium malariae and Plasmodium ovalethe "bashful" malaria parasitesTrends in Parasitology 23:278–283.https://doi.org/10.1016/j.pt.2007.04.009

Singlecell genomics for dissection of complex malaria infectionsGenome Research 24:1028–1038.https://doi.org/10.1101/gr.168286.113

Close kinship within multiplegenotype malaria parasite infectionsProceedings of the Royal Society B: Biological Sciences 279:2589–2598.https://doi.org/10.1098/rspb.2012.0113

Inferring strain mixture within clinical Plasmodium falciparum isolates from genomic sequence dataPLOS Computational Biology 12:e1004824.https://doi.org/10.1371/journal.pcbi.1004824

Wholegenome scans provide evidence of adaptive evolution in malawian Plasmodium falciparum isolatesThe Journal of Infectious Diseases 210:1991–2000.https://doi.org/10.1093/infdis/jiu349

HighResolution SingleCell sequencing of malaria parasitesGenome Biology and Evolution 9:3373–3383.https://doi.org/10.1093/gbe/evx256

Harnessing genomics and genome biology to understand malaria biologyNature Reviews Genetics 13:315–328.https://doi.org/10.1038/nrg3187

ThesisAccessing complex genomic variation in Plasmodium falciparum natural infectionPhD thesis, University of Oxford.
Decision letter

Eduardo FrancoSenior and Reviewing Editor; McGill University, Canada

Rachel DanielsReviewer; Harvard School of Public Health, United States

Bryan GreenhouseReviewer; University of California, San Francisco, United States

Steve SchaffnerReviewer; Broad Institute, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a guest Reviewing Editor and Eduardo Franco as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Rachel Daniels (Reviewer #1); Bryan Greenhouse (Reviewer #2); Steve Schaffner (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
In the discussion among the reviewers, the opinion emerged that this work was more appropriate as a Tools and Resources submission than as a Research Article. One reviewer commented that "the "epi" part is a bit of a stretch but the tool is the real (potential) advance. Regardless, I do think that for the tool to be published still requires additional work on validation and on the clarity of the methods and validation results."
Summary:
The authors address an important and challenging problem – phasing genotypes from deep sequence data obtained from mixedstrain falciparum infections. The manuscript is essentially a methods paper and offers a significant advance on the authors' prior work in this area (DEploid) in that it explicitly considers relatedness (IBD) between strains found within the same infection, primarily as a result of cotransmission. The new tool is validated in lab and in silico mixtures of parasite chromosomes. Following this validation of performance, DEploidIBD is applied to samples from the pf3K database and correlated to data on local prevalence from the Malaria Atlas Project.
The manuscript is overall wellwritten, with clear pseudocode and wellexplained supplementary material; however, there are additional considerations that could perhaps improve the conclusions from the analysis and better delineate the advantages of this improvement over both the previous version and other reports
Essential revisions:
1) In general, I find it difficult to clearly understand how well the method is expected to perform based on the simulations. Most of the comparisons presented are between DEploidIBD and DEploid, and the authors do convincingly show an overall relative improvement. However based on the extent of the simulations and way the results are presented it is difficult for the reader to understand in what settings the method it is likely to perform well and where it is likely to fail. This is critical for the reader to understand the utility of the method. Regarding the extent of the simulations, the laboratory mixtures perform well but with only four strains in the population and optimal reference panels this is not a realistic test. The authors go on to perform in silico mixtures from Asia and Africa, but the former is presented for up to 3 strains while the latter (a more challenging dataset) is only presented for up to two strains. The method as presented is designed to work for up to four strains, and thus should be tested as such. It is quite possible that the information content for some (many) of these cases presents an identifiability problem even with a perfect method, however it would still be useful to provide the reader with information as to the limitations of the data/method. It may also be useful to evaluate the sensitivity of the method to the correctness of the reference panel, for lab mixtures and potentially also for in silico sims.
2) It is difficult for me to understand how well the deconvolution process worked on the simulations. This may in part be a result of my not fully understanding the statistics presented. For switch errors, what are the number of true switches for comparison? How are these errors defined (e.g. if the switch occurs 1 SNP in the wrong direction is it still considered an error?). How is genotype error calculated? Is this the percent of SNPs incorrectly phased within each deconvoluted haplotype? Does the denominator for this error include trivial cases (homozygous calls) or the only the more meaningful heterozygous calls in that sample? This is critical to understand, given the high proportion of "low confidence" haplotypes called on real samples.
3) Because phasing haplotypes is one of the primary goals of the software, evaluating haplotype quality from unknowns is of critical importance. The mechanics of the haplotype quality assessment is explained in the supplementary material, but the authors should explain why outliers in the proportion of alternative calls would be a good indicator of a problematic haplotype. Are these alternative calls with respect to the 3D7 reference genome, or to the reference panel used by the software? I presume the former but if the latter this should be made more clear. It would also be good to evaluate this method for "QC" of the outputs on simulations, in particular more complex simulations as discussed in my comment above.
4) Related to #3, in looking at haplotype quality in the real data (Figure 3A) it appears that more than half of polyclonal samples from Asia and ~75% of polyclonal samples have poor quality haplotypes. These numbers are not reflected well in Appendix 3—table 1, which lump in the (trivial) monoclonal samples, and Appendix 3—table 2 which lump together Asia and Africa. I also cannot reconcile the numbers in Appendix 3—table 2 with Figure 3A, e.g. >80% of COI==4 samples appear poor in Figure 3A but Appendix 3—table 2 mentions this figure at 57%. Perhaps the table and the figure representing different things, but if so I missed that distinction.
5) The impact of the population level correlations of output metrics with epidemiologic data are overstated in a few places in the manuscript, in particular in the impact statement which says that "monitoring fine scale patterns of mixed infections and within sample relatedness will be highly informative for assessing the impact of interventions." It is possible that with additional data and likely some sophisticated modeling this will be turn out to be true, but the general correlations with epidemiologic data presented in this manuscript, while qualitatively showing relationships in the expected direction (for Africa, but not in Asia) are neither unexpected nor quantitative in any way that would provide actionable information beyond basic surveillance data. The potential is certainly there, but the data presented do not support this claim.
6) The results in this manuscript, based on samples collected primarily in symptomatic individuals, cannot be generalized to infections in general, the vast majority of which are asymptomatic in moderate to high transmission areas, and in which the relative probability of superinfection to cotransmission is likely much greater since infections more easily stack up in semiimmune individuals with less chance of symptoms and thus treatment. E.g. "more than half of mixed infections arise from the transmission of siblings".
7) There are major concerns over validation and since this is a methods paper, that is a key component so that the reader can assess the usefulness in their own situation.
The method was validated by analyzing:
A) 27 experimental mixtures of 1 – 3 strains with no IBD betweenthem. The arrangement should provide an easy test for the method,since it employs highly divergent strains and an ideal referencepanel. DEploidIBD performs as well as DEploid, except for one casewhich it gets wrong. The authors comment that "LD information isnecessary to achieve accurate deconvolution". This confuses me, since as described both methods include LD information through the reference panel. What does this mean, and why did the new method perform more poorly than the original?
B) Simulated chromosomes with IBD made of mixtures of (a) pairs of
Asian strains, (b) pairs of African strains, (c) triplets of Asianstrains with a rather artificial IBD pattern. My concern here is thatthe performance of the method has not been tested over the relevantrange of complexity (e.g. that seen in this study). How does it workfor more realistic combinations of 3 strains, or for 3 Africanstrains, or for 4 strains? The method could still be quite valuableeven if it cannot handle something like 4 strains, but it would bevery helpful to know its limits. Also, why was (b) not compared to theperformance of DEploid?
8) The quality of reconstructed haplotypes is determined by comparison with monoclonal samples: haplotypes that look either too similar or too dissimilar to the reference are discarded. Are these haplotypes discarded anywhere besides the meiosis analysis? Are they included in COI estimates? More importantly, the number of discarded haplotypes is sometimes very large, e.g. 45% of all haplotypes for Ghana are discarded, and 57% for COI=4 from all populations. It is hard to have confidence in the remaining haplotypes when so many are clearly wrong, especially since there is no independent basis for exclusion. Further treatment of these poor quality haplotypes how they are generated or how they can be prevented, or what output is unaffected by them is needed.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria" for further consideration at eLife. Your revised article has been favorably evaluated by Eduardo Franco as the Senior and Reviewing Editor, and three reviewers.
The manuscript has been improved. All reviewers acknowledged the substantial improvements and the extra work that was required. They also acknowledged the detailed and thoughtful responses you provided to address the first review round. However, there are some remaining concerns that they raised and which need to be addressed before acceptance, as outlined below.
Comments from the reviewers (edited and rearranged for concision and clarity):
1) It is still not clear how the authors differentiate the probabilities of identitybystate from identitybydescent, or how they account for this difference in their methodologies.
2) A few items remain unclear:
a) Appendix 1, subsection “Modelling mixed infections with IBD”: p_{rec} is introduced here, but it is not clear how it is calculated.
b) Appendix 1, subsection “In silico mixtures of two field strains”: How are the regions of identity between genomes formed, i.e. how are breakpoints set between segments?
c) The handling of the number of strains (K) is not explained well. The authors say "The priors on the number of strains, K, and their proportions, w, are as in Zhu et al., 2018, with associated hyperparameters being userdefined", but that paper doesn't describe priors or associated hyperparameters for K. Rather, it says the number of strains is fixed at a high value and minor strains dropped in the end. This approach also seems to be reflected in the subsection “MCMC parameters for deconvolution”, of Appendix 1,. Since calculation of other parameters is conditional on K, I find this to be quite confusing.
3) The authors have done a great deal of additional simulation to more fully explore the effectiveness of their algorithm. My only suggestion here is to have a few lines summarizing their conclusions about the situations in which the program works well; ideally, this information should appear in the Abstract as well, since it is one of the more important takeaways from the paper.
4) More context and discussion are now provided about the pruning of haplotypes, which is good. There does seem to be curious gap, however. The error rate for reconstructed haplotypes is determined for simulated complex infections, and lowquality haplotypes are identified and removed from real complex infections, but I do not see any connection described between these two activities. Were low quality haplotypes filtered out from the simulated data? Were there any lowquality simulated haplotypes? The simulated dataset is the obvious context for assessing the effectiveness of the filtering procedure, so it's puzzling that this is not described.
5) Figure 2 attempts to convey a large amount of information, and does this very nicely. A few things still took me some time to decipher and wonder if there is any other way to make this easier for a reader. One was the horizontal histograms of K – the bins go from top to bottom and are indicated by value/α of the orange and blue but since there is a lot of gray and absent bins is took a while to figure this out despite the key at the top. I don't have a great suggestion as numerals would likely be too cluttering, but maybe at least mentioning explicitly in the figure legend that K=1 is at the top and K=4 at the bottom? Also, do the "coverage below 20" mean samples with median sequencing depth <20x? Maybe define explicitly in the legend?
6) Figure 3 re: per site genetic error, because many sites in a given sample will have homozygous calls and always be assigned correctly, it could be useful to convey what the (e.g. mean) null expectation would be as a benchmark, i.e. if assignment occurred at random. I'm not sure how easy this is to do or would be worth it, but would make the error result more interpretable (and likely might vary for different scenarios).
https://doi.org/10.7554/eLife.40845.039Author response
Summary:
The authors address an important and challenging problem – phasing genotypes from deep sequence data obtained from mixedstrain falciparum infections. The manuscript is essentially a methods paper and offers a significant advance on the authors' prior work in this area (DEploid) in that it explicitly considers relatedness (IBD) between strains found within the same infection, primarily as a result of cotransmission. The new tool is validated in lab and in silico mixtures of parasite chromosomes. Following this validation of performance, DEploidIBD is applied to samples from the pf3K database and correlated to data on local prevalence from the Malaria Atlas Project.
The manuscript is overall wellwritten, with clear pseudocode and wellexplained supplementary material; however, there are additional considerations that could perhaps improve the conclusions from the analysis and better delineate the advantages of this improvement over both the previous version and other reports
We thank the editor and reviewers for their positive comments. We agree that this is essentially a methods paper, though the findings about the origins of mixed infection (i.e. c. 50% being through a single bite) and correlations with prevalence are both novel and biological.
Essential revisions:
1) In general, I find it difficult to clearly understand how well the method is expected to perform based on the simulations. Most of the comparisons presented are between DEploidIBD and DEploid, and the authors do convincingly show an overall relative improvement. However based on the extent of the simulations and way the results are presented it is difficult for the reader to understand in what settings the method it is likely to perform well and where it is likely to fail. This is critical for the reader to understand the utility of the method. Regarding the extent of the simulations, the laboratory mixtures perform well but with only four strains in the population and optimal reference panels this is not a realistic test. The authors go on to perform in silico mixtures from Asia and Africa, but the former is presented for up to 3 strains while the latter (a more challenging dataset) is only presented for up to two strains. The method as presented is designed to work for up to four strains, and thus should be tested as such. It is quite possible that the information content for some (many) of these cases presents an identifiability problem even with a perfect method, however it would still be useful to provide the reader with information as to the limitations of the data/method. It may also be useful to evaluate the sensitivity of the method to the correctness of the reference panel, for lab mixtures and potentially also for in silico sims.
To address these comments we have carried out a much more comprehensive analysis of DEploidIBD, in order to identify scenarios under which the method should perform well or poorly. In particular, we have added a suite of in silicosimulationbased validation experiments for both African and Asian samples, trialling mixed infections of K=2, K=3, and K=4 with various proportions and IBD profiles (see the expanded ‘Method Validation’ section of the Results and Figure 2.). These include four distinct IBD profiles for the simulated K=2 infections (0%, 25%, 50% and 75% IBD) across 5 proportions, including the challenging equalproportion case. For the K=3 infections, we simulated three IBD profiles designed to represent the IBD structure expected in instances where the infection was generated from one, two, or three independent bites; here across 7 proportion values, again including the equalproportion case. Similarly, for K=4 infections, we simulated five IBD profiles, designed to capture the IBD structure expected in the case of one, two, three, or four bites generating the infection; across 5 proportions including the allequal proportion case. In total this resulted in 122 in silico examples of mixed infections on which to trial the method.
Several observations emerged from this additional validation, such as:
 DEploidIBD performs better than DEploid in scenarios where IBD levels are high
 Equalproportion cases are challenging for both DEploid and DEploidIBD
 Barring the equalproportion cases, the K, proportion, and IBD statistics inferred from DEploidIBD seem to be largely accurate for K=2 and K=3 infections
 For K=4 cases, performance was poor for both DEpoid and DEploidIBD, although DEploidIBD performed marginally better.
These results are now all included in the revision and, combined, allow the reader a much better understanding of the strengths and limitations of the new method (specifically note the highly revised Figure 2s and 3).
2) It is difficult for me to understand how well the deconvolution process worked on the simulations. This may in part be a result of my not fully understanding the statistics presented. For switch errors, what are the number of true switches for comparison? How are these errors defined (e.g. if the switch occurs 1 SNP in the wrong direction is it still considered an error?). How is genotype error calculated? Is this the percent of SNPs incorrectly phased within each deconvoluted haplotype? Does the denominator for this error include trivial cases (homozygous calls) or the only the more meaningful heterozygous calls in that sample? This is critical to understand, given the high proportion of "low confidence" haplotypes called on real samples.
We acknowledge the rationale for using the statistics was not clear and have added a description for the error analysis model in Appendix 1 subsection “Error analysis”, including a link to the code used to calculate the statistics. The revised Figure 3 provides additional quantification of inferred haplotype quality under a much wider range of scenarios than previously. We now write:
“For haplotype quality analysis, we compared the inferred haplotype with the true haplotype. [...] An example analysis for three strains is shown in Figure 1. “
3) Because phasing haplotypes is one of the primary goals of the software, evaluating haplotype quality from unknowns is of critical importance. The mechanics of the haplotype quality assessment is explained in the supplementary material, but the authors should explain why outliers in the proportion of alternative calls would be a good indicator of a problematic haplotype. Are these alternative calls with respect to the 3D7 reference genome, or to the reference panel used by the software? I presume the former but if the latter this should be made more clear. It would also be good to evaluate this method for "QC" of the outputs on simulations, in particular more complex simulations as discussed in my comment above.
Extracting haplotypes from mixed infections was certainly a goal for DEploidIBD, though we do not see it, nor intended to present it, as being the primary one. Rather, both the primary goal and novelty of DEploidIBD was the extraction of IBD profiles from mixed infections, given their clear epidemiological significance. Obtaining highquality haplotypes from complex mixed infections is a very difficult problem and we do not wish to present it as fully solved.
Our goal in analysing haplotype quality was to define a metric (zscore) that could be useful for ourselves and others to identify inferred haplotypes of comparable quality to those obtained from clonal strains. Our investigations show that several of the haplotypes inferred from low coverage samples, or where there are minor strains present, have unusually high or low numbers of variant calls, suggestive of unresolved issues (underestimating K or strain ‘dropout’ respectively).
It is important to note (and we have now highlighted this in the paper) that DEploidIBD infers K, proportions, and IBD profiles before and independently of the haplotypes it produces. We did not stress this point sufficiently in the original paper. Critically, the haplotypes are not used in any of our downstream epidemiological analysis. We have made amendments to the text to ensure we clearly acknowledge outstanding issues with haplotype quality, to emphasize their independence from K, proportion and IBD profile inference, and to highlight that they have not been used in downstream analyses. Appendix 3 provides additional substantial detail on these analyses.
4) Related to #3, in looking at haplotype quality in the real data (Figure 3A) it appears that more than half of polyclonal samples from Asia and ~75% of polyclonal samples have poor quality haplotypes. These numbers are not reflected well in Appendix 3—table 1, which lump in the (trivial) monoclonal samples, and Appendix 3—table 2 which lump together Asia and Africa. I also cannot reconcile the numbers in Appendix 3—table 2 with Figure 3A, e.g. >80% of COI==4 samples appear poor in Figure 3A but Appendix 3—table 2 mentions this figure at 57%. Perhaps the table and the figure representing different things, but if so I missed that distinction.
The reviewer was correct and we have generated a new table, Appendix 3—table 1 to clarify the breakdown of poorquality haplotypes both geographically and across K=2, K=3 and K=4 infections. Regarding the consistency of Figure 3A and Appendix 3—table 2, we failed to highlight that the figure presents percentages on a bysample basis – i.e. the percentage of samples with at least one bad quality haplotype. In contrast, the supplementary tables refer to individual strains (i.e. considering all haplotypes and disregarding their aggregation within samples). We have clarified this point in the main text.
5) The impact of the population level correlations of output metrics with epidemiologic data are overstated in a few places in the manuscript, in particular in the impact statement which says that "monitoring fine scale patterns of mixed infections and within sample relatedness will be highly informative for assessing the impact of interventions." It is possible that with additional data and likely some sophisticated modeling this will be turn out to be true, but the general correlations with epidemiologic data presented in this manuscript, while qualitatively showing relationships in the expected direction (for Africa, but not in Asia) are neither unexpected nor quantitative in any way that would provide actionable information beyond basic surveillance data. The potential is certainly there, but the data presented do not support this claim.
We acknowledge that our results raise more questions than they answer about how genetic data might be used to monitor interventions in real time, though we maintain that features such as the ability to quantify mixed infections and to infer the origins of such events are likely to be useful. In response to these criticisms we have completely rewritten the Discussion, providing a much more balanced view of what this work does and does not say about the value of genetic data in monitoring malaria epidemiology and highlight settings – such as in Asia – where observed patterns suggest that we need a much better understanding of how epidemiological fluctuations in space and time impact genetic diversity.
6) The results in this manuscript, based on samples collected primarily in symptomatic individuals, cannot be generalized to infections in general, the vast majority of which are asymptomatic in moderate to high transmission areas, and in which the relative probability of superinfection to cotransmission is likely much greater since infections more easily stack up in semiimmune individuals with less chance of symptoms and thus treatment. E.g. "more than half of mixed infections arise from the transmission of siblings".
This is a good point which we have now taken into consideration by qualifying this statement in all places it occurs. In the Abstract we now write “…47% of symptomatic dual infections contain sibling strains…”; in the Introduction “…more than half of symptomatic mixed infections arise from the transmission of siblings…”; and in the Discussion “…we found that 47% of dual infections within the Pf3k Project likely arose through cotransmission…”.
7) There are major concerns over validation and since this is a methods paper, that is a key component so that the reader can assess the usefulness in their own situation.
The method was validated by analyzing:
A) 27 experimental mixtures of 1 – 3 strains with no IBD betweenthem. The arrangement should provide an easy test for the method,since it employs highly divergent strains and an ideal referencepanel. DEploidIBD performs as well as DEploid, except for one casewhich it gets wrong. The authors comment that "LD information isnecessary to achieve accurate deconvolution". This confuses me, since as described both methods include LD information through the reference panel. What does this mean, and why did the new method perform more poorly than the original?
The new algorithm (DEploidIBD) deconvolutes mixed infections in a two stage process. In the first stage it estimates strain number, proportions and IBD profiles using a model that does not use a reference panel. In the second, it fixes strain number and proportions and then estimates haplotypes with a reference panel (sa in DEploid). Consequently, if strain number is incorrectly estimated, strain haploytpes will be too. This is what happens in the case of equally balanced strain mixtures where, for example, a mixture of three strains with proportions each of ⅓, is fitted as a mixture of two strains with proportions ⅓ and ⅔. We have clarified this within the text and rewritten the supplementary material to highlight this twostage approach to deconvolution.
B) Simulated chromosomes with IBD made of mixtures of (a) pairs of
Asian strains, (b) pairs of African strains, (c) triplets of Asianstrains with a rather artificial IBD pattern. My concern here is thatthe performance of the method has not been tested over the relevantrange of complexity (e.g. that seen in this study). How does it workfor more realistic combinations of 3 strains, or for 3 Africanstrains, or for 4 strains? The method could still be quite valuableeven if it cannot handle something like 4 strains, but it would bevery helpful to know its limits. Also, why was (b) not compared to theperformance of DEploid?
As described in Essential Revision #1, we have augmented the in silico validation to including K=2, 3, 4 for both Asian and Africa across a range of IBD structures and proportions.
8) The quality of reconstructed haplotypes is determined by comparison with monoclonal samples: haplotypes that look either too similar or too dissimilar to the reference are discarded. Are these haplotypes discarded anywhere besides the meiosis analysis? Are they included in COI estimates? More importantly, the number of discarded haplotypes is sometimes very large, e.g. 45% of all haplotypes for Ghana are discarded, and 57% for COI=4 from all populations. It is hard to have confidence in the remaining haplotypes when so many are clearly wrong, especially since there is no independent basis for exclusion. Further treatment of these poor quality haplotypes how they are generated or how they can be prevented, or what output is unaffected by them is needed.
We have addressed this issue in a previous comment (#3). In brief, we highlight here that we discard haplotypes that have a very significant excess or deficit of genetic diversity when compared with the clonal population of samples from the same geographical region. We agree with the reviewer that other tests may be in order (but see response to #3) to assure the quality of the nondiscarded haplotypes, however we did not elaborate on this as we have not used deconvoluted haplotypes in any analysis; we did not make this clear in the original text. Regarding the causes that biases the deconvolution of haplotypes, we have identify problematic cases in which proportions are very well balanced or extremely unbalanced, but this is still work in progress and needs further research.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Comments from the reviewers (edited and rearranged for concision and clarity):
1) It is still not clear how the authors differentiate the probabilities of identitybystate from identitybydescent, or how they account for this difference in their methodologies.
As with all approaches to detecting IBD, the signal that the algorithms are looking for is long stretches of IBS that results from very recent common ancestry (in the last few generations), while short stretches may result from older common ancestor events and sporadic allele matches can happen by chance (or potentially through error). The parametric model described in the Materials and methods defines exactly how this is achieved and the specific parameter values (also described in the Materials and methods) tune the algorithm to look for particular types of signal. We have attempted to make the method as clear as possible, though would be happy to consider specific suggestions as to where our description could be improved further.
2) A few items remain unclear:
a) Appendix 1, subsection “Modelling mixed infections with IBD”: p_{rec} is introduced here, but it is not clear how it is calculated.
Details on how the probability of recombination between two sites (prec_{}) is calculated are provided in the Implementation Details. We now direct the reader to these when prec_{}is introduced.
b) Appendix 1, subsection “In silico mixtures of two field strains”: How are the regions of identity between genomes formed, i.e. how are breakpoints set between segments?
The breakpoints are set at exactly 25%, 50% or 75% of the chromosome length creating one contiguous tract of IBD. This models scenarios wherein a given bivalent has undergone a single crossover event (all chromosomes must undergo at least one crossover event to ensure homolog pairing), and a chromosome involved in this crossover event was transmitted to the mixed infection, resulting in that chromosome having a single contiguous IBD tract. Given that the majority of chromosomes in P. falciparum have a map length of about 1 Morgan (see Figure 3B of the reference Miles et al.in Genome Research) this IBD structure should be frequently observed in mixed infections (albeit across a continuous spectrum of IBD percentages).
c) The handling of the number of strains (K) is not explained well. The authors say "The priors on the number of strains, K, and their proportions, w, are as in Zhu et al., 2018, with associated hyperparameters being userdefined", but that paper doesn't describe priors or associated hyperparameters for K. Rather, it says the number of strains is fixed at a high value and minor strains dropped in the end. This approach also seems to be reflected in the subsection “MCMC parameters for deconvolution”, of Appendix 1,. Since calculation of other parameters is conditional on K, I find this to be quite confusing.
In practice there is no prior on the number of strains as it is set at a fixed value, i.e. in the joint prior (Equation 2) P(K) = 1. This value, and the threshold for discarding strains (set at 0.01) is what we meant by hyperparameters. To clarify this, which have changed the text to:
“The number of strains, K, and their proportions w, are as in (Zhu, 2017): K is fixed at a userdefined value (here, K=4) and strains below a fixed proportion threshold (here, 0.01) are discarded.”
3) The authors have done a great deal of additional simulation to more fully explore the effectiveness of their algorithm. My only suggestion here is to have a few lines summarizing their conclusions about the situations in which the program works well; ideally, this information should appear in the Abstract as well, since it is one of the more important takeaways from the paper.
We have added to the first paragraph of the Discussion to cover this more thoroughly:
“Validation work using simulated mixed infections illustrated that DEploidIBD performs well on infections of two or three strains and across a widerange of IBD levels. We note that limitations and technical difficulties remain, including deconvoluting infections with more than three strains, handling mixed infections with highly symmetrical or asymmetrical strain proportions (e.g. K=3 with strains at 33%, or K=2 with one strain at 2%), analysing data with multiple infecting species, coping with lowcoverage data, and selecting appropriate reference panels from the growing reference resources.”
Though we recognize a similar overview would be useful in the abstract, we have chosen not to add one, as we are already exactly at the 150 word limit and would like to keep current sentences covering biological findings as well.
4) More context and discussion are now provided about the pruning of haplotypes, which is good. There does seem to be curious gap, however. The error rate for reconstructed haplotypes is determined for simulated complex infections, and lowquality haplotypes are identified and removed from real complex infections, but I do not see any connection described between these two activities. Were low quality haplotypes filtered out from the simulated data? Were there any lowquality simulated haplotypes? The simulated dataset is the obvious context for assessing the effectiveness of the filtering procedure, so it's puzzling that this is not described.
To address this issue, we ran our haplotype filtering strategy on all of the ~57K haplotypes produced for the in silicovalidation and we have now included these results in the Haplotype Quality Assessment section of the of the supplementary materials (see Appendix 2—figure 1 and Appendix 2—figure 2). The results are qualitatively consistent with what was observed in the Pf3k data, namely that balanced or highly asymmetric proportions (displayed as a low or high entropy of proportion in the figures) yield more discarded haplotypes, as do mixtures with four strains. However, the rate at which haplotypes are discarded is substantially lower for the in silicomixtures than the field mixtures, likely reflecting greater complexity in the latter. We note that, in silico mixtures deconvoluted by DEploid tend to have somewhat fewer haplotypes discarded than those deconvoluted by DEploidIBD (e.g. DEploid discards 4.6% of haplotypes from Africa, DEploidIBD discards 8.6%), which likely reflects the stronger prior on haplotypes used in DEploid.
5) Figure 2 attempts to convey a large amount of information, and does this very nicely. A few things still took me some time to decipher and wonder if there is any other way to make this easier for a reader. One was the horizontal histograms of K – the bins go from top to bottom and are indicated by value/α of the orange and blue but since there is a lot of gray and absent bins is took a while to figure this out despite the key at the top. I don't have a great suggestion as numerals would likely be too cluttering, but maybe at least mentioning explicitly in the figure legend that K=1 is at the top and K=4 at the bottom? Also, do the "coverage below 20" mean samples with median sequencing depth <20x? Maybe define explicitly in the legend?
We have taken your suggestions and added to the figure legend to clarify:
“From the left to the right, the panels show the strain proportion compositions, distribution of inferred K in a verticallyoriented histogram (top: K=1, bottom: K=4) and using both methods: DEploid in orange and DEploidIBD in blue, effective number of strains, pairwise relatedness and IBD N50 (the latter two only for DEploidIBD).”
Also:
“Grey points identify experiments of low coverage data (median sequencing depth <20)…”
6) Figure 3 re: per site genetic error, because many sites in a given sample will have homozygous calls and always be assigned correctly, it could be useful to convey what the (e.g. mean) null expectation would be as a benchmark, i.e. if assignment occurred at random. I'm not sure how easy this is to do or would be worth it, but would make the error result more interpretable (and likely might vary for different scenarios).
As the reviewer says, the absolute genotype error will depend on the fraction of sites that are homozygous. For this reason, we only calculated accuracy at sites that are heterozygous in the sample or the samplespecific reference panel. Consequently, the expected genotype error under the null is sample specific – but certainly much higher than the range reported (i.e. off the scale). Likewise, the switch error could be as high as the number of sample heterozygous sites (minus one), which is also off the scale. We have clarified the approach taken in the legend to the figure.
https://doi.org/10.7554/eLife.40845.040Article and author information
Author details
Funding
Wellcome (206194)
 Jacob AlmagroGarcia
Wellcome (090770)
 Jacob AlmagroGarcia
 Richard D Pearson
 Roberto Amato
 Alistair Miles
 Dominic Kwiatkowski
Wellcome (100956/Z/13/Z)
 Sha Joe Zhu
 Gil McVean
Li Ka Shing Foundation
 Gil McVean
Wellcome (204911)
 Jacob AlmagroGarcia
 Richard D Pearson
 Roberto Amato
 Alistair Miles
 Dominic Kwiatkowski
Medical Research Council (G0600718)
 Jacob AlmagroGarcia
 Richard D. Pearson
 Roberto Amato
 Alistair Miles
 Dominic Kwiatkowski
Department for International Development (M006212)
 Jacob AlmagroGarcia
 Richard D Pearson
 Roberto Amato
 Alistair Miles
 Dominic Kwiatkowski
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Senior and Reviewing Editor
 Eduardo Franco, McGill University, Canada
Reviewers
 Rachel Daniels, Harvard School of Public Health, United States
 Bryan Greenhouse, University of California, San Francisco, United States
 Steve Schaffner, Broad Institute, United States
Publication history
 Received: August 9, 2018
 Accepted: July 10, 2019
 Accepted Manuscript published: July 12, 2019 (version 1)
 Version of Record published: August 6, 2019 (version 2)
Copyright
© 2019, Zhu et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,984
 Page views

 284
 Downloads

 27
 Citations
Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Epidemiology and Global Health
Background: Homebased selfsampling for human papillomavirus (HPV) testing may be an alternative for women not attending clinicbased cervical cancer screening.
Methods: We assessed barriers to care and motivators to use athome HPV selfsampling kits during the COVID19 pandemic as part of a randomized controlled trial evaluating kit effectiveness. Participants were women aged 3065 and underscreened for cervical cancer in a safetynet healthcare system. We conducted telephone surveys in English/Spanish among a subgroup of trial participants, assessed differences between groups, and determined statistical significance at p<0.05.
Results: Over half of 233 survey participants reported that clinicbased screening (Pap) is uncomfortable (67.8%), embarrassing (52.4%), and discomfort seeing male providers (63.1%). The last two factors were significantly more prevalent among Spanish versus English speakers (66.4% vs. 30% (p=0.000) and 69.9 vs. 52.2% (p=0.006), respectively). Most women who completed the kit found Pap more embarrassing (69.3%), stressful (55.6%), and less convenient (55.6%) than the kit. The first factor was more prevalent among Spanish versus English speakers (79.6% vs. 53.38%, p=0.001) and among patients with elementary education or below.
Conclusions: The COVID19 pandemic influenced most (59.5%) to participate in the trial due to fear of COVID, difficulty making appointments, and ease of using kits. HPV selfsampling kits may reduce barriers among underscreened women in a safetynet system.
Funding: This study is supported by a grant from the National Institute for Minority Health and Health Disparities (NIMHD, R01MD013715, PI: JR Montealegre).
Clinical trial number: NCT03898167.

 Epidemiology and Global Health
Local cervical cancer epidemiological data essential to project the contextspecific impact of cervical cancer preventive measures are often missing. We developed a framework, hereafter named Footprinting, to approximate missing data on sexual behaviour, human papillomavirus (HPV) prevalence, or cervical cancer incidence, and applied it to an Indian case study. With our framework, we (1) identified clusters of Indian states with similar cervical cancer incidence patterns, (2) classified states without incidence data to the identified clusters based on similarity in sexual behaviour, (3) approximated missing cervical cancer incidence and HPV prevalence data based on available data within each cluster. Two main patterns of cervical cancer incidence, characterized by high and low incidence, were identified. Based on the patterns in the sexual behaviour data, all Indian states with missing data on cervical cancer incidence were classified to the lowincidence cluster. Finally, missing data on cervical cancer incidence and HPV prevalence were approximated based on the mean of the available data within each cluster. With the Footprinting framework, we approximated missing cervical cancer epidemiological data and made contextspecific impact projections for cervical cancer preventive measures, to assist public health decisions on cervical cancer prevention in India and other countries.