1. Epidemiology and Global Health
  2. Microbiology and Infectious Disease
Download icon

The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria

  1. Sha Joe Zhu
  2. Jason A Hendry
  3. Jacob Almagro-Garcia
  4. Richard D Pearson
  5. Roberto Amato
  6. Alistair Miles
  7. Daniel J Weiss
  8. Tim CD Lucas
  9. Michele Nguyen
  10. Peter W Gething
  11. Dominic Kwiatkowski
  12. Gil McVean  Is a corresponding author
  13. for the Pf3k Project
  1. University of Oxford, United Kingdom
  2. Wellcome Sanger Institute, United Kingdom
Tools and Resources
Cite this article as: eLife 2019;8:e40845 doi: 10.7554/eLife.40845
16 figures, 5 tables, 1 data set and 2 additional files


Figure 1 with 2 supplements
Deconvolution of a complex field sample PD0577-C from Thailand.

(A) Scatter-plot showing the number of reads supporting the reference (REF: x-axis) and alternative (ALT: y-axis) alleles. The multiple clusters indicate the presence of multiple strains, but cannot distinguish the exact number or proportions. (B) The profile of within-sample allele frequency along chromosomes 11 and 12 (red points) suggests a changing profile of IBD with three distinct strains, estimated to be with proportions of 22%, 52% and 26% respectively (other chromosomes omitted for clarity, see Figure 1—figure supplement 1); blue points indicate expected allele frequencies within the isolate. However, the strains are inferred to be siblings of each other: green segments indicate where all three strains are IBD (Note: green segments do not appear in this example, but occur in Figure 5); yellow, orange and dark orange segments indicate the regions where one pair of strains are IBD but the others are not. In no region are all three strains inferred to be distinct. (C) Statistics of IBD tract length, in particular illustrating the N50 segment length. A graphical description of the modules and workflows for DEploidIBD is given in Figure 1—figure supplement 2.

Figure 1—figure supplement 1
Whole genome deconvolution of field sample PD0577-C.

The outer ring shows the expected within-sample allele frequency (WSAF) (blue) and observed WSAF (red) across the genome. Red and blue points indicate observed and expected allele frequencies within the isolate. The inner ring indicates the IBD states among the three strains: green segments indicate where all three strains are IBD; yellow, orange and dark orange segments indicate the regions where one pair of strains are IBD but the others are not. In no region are all three strains inferred to be distinct, suggesting that the three strains are siblings.

Figure 1—figure supplement 2
A graphical overview of the data types and work flows for DEploidIBD.

The boxes at the bottom represent final outputs of the pipeline. The rectangular boxes indicate when DEploidIBD is executed, with inputs highlighted by blue arrows. The process has three key steps: Step 1. A reference panel for the set of samples is constructed from high confidence clonal haplotypes, either identified from within a study or from an external resource, such as Pf3k. Step 2: DEploidIBD, using population level allele frequencies, is used to infer the number of strains, strain proportions and IBD profile within each sample. Step 3: DEploidIBD is re-run on each sample to infer haplotypes, but with the proportions estimated in Step two fixed and this time using the haplotype (LD-aware) method previously implemented in DEploid.

Figure 2 with 5 supplements
Performance of DEploidIBD and DEploid on 100 in silico mixtures for each of three different scenarios.

From the left to the right, the panels show the strain proportion compositions, distribution of inferred K in a vertically-oriented histogram (top: K=1, bottom: K=4), using both methods: DEploid in orange and DEploidIBD in blue, effective number of strains, pairwise relatedness and IBD N50 (the latter two only for DEploidIBD). From top to the bottom, cases are ordered from even strain proportions to the most imbalanced composition. Grey points identify experiments of low coverage data (median sequencing depth < 20), and pink identify cases where K is inferred incorrectly. (A) In silico mixtures of two African strains with high-relatedness (75%) for 7757 (s.d. 178) sites on Chromosome 14, Note that DEploid underestimates the minor strain proportion if strains have high relatedness. In the extreme case, DEploid misclassifies a K=2-mixture as clonal, whereas DEploidIBD consistently estimates the correct proportions. (B) In silico mixtures of two Asian strains with high-relatedness (75%) for 3041 sites (s.d. 227) on Chromosome 14, Note that DEploid underestimates strain number when the minor strain is low frequency, while DEploidIBD typically performs well. (C) In silico mixtures of three African strains, where each pair is IBD over a distinct third of the chromosome. Note that both methods fail to deconvolute the case of equal proportions. However, for unbalanced mixtures, DEploidIBD consistently performs better than DEploid.

Figure 2—figure supplement 1
Validation of DEploidIBD using 27 in vitro lab mixtures and four in silico mixtures.

A reference panel of the laboratory strains (3D7, Dd2, HB3 and 7G8; Panel V) was used to deconvolute samples with DEploid. Each experiment is performed with and without IBD inference and with the maximum number of 4 strains. Black crosses indicate the true effective number of strains. Coloured crosses (DEploid in red, DEploidIBD in purple) indicate median values obtained from 30 replicates using the algorithm indicated in the legend. The coloured dots show the inferred effective number of strains across replicates with intensity proportional to fraction. Note one sample where balanced proportions of three strains results in the LD-free (DEploid-IBD) approach fitting the data as a mixture of two strains with proportions of 1/3 and 2/3. For in silico mixtures of four strains, DEploid performs poorly. DEploidIBD shows some improvement in unbalanced mixtures, though misclassifies K=4 mixtures as only having three strains.

Figure 2—figure supplement 2
Illustration of simulation study design.

We conduct simulation studies to mimic K-mixtures (top row) as results of b-biting events, where K{2,3,4} and 1bK. For each K-mixture, the left column illustrate the overall relationship between strains (black dots): connected dots imply strains are from the same mosquito bite. The level of relatedness between parasite strains is reflected by the haplotype segment copied from the parental strains within the mosquito. Each colour represents a unique strain within the mosquito, which we randomly draw from field clonal haplotypes. For example, when K=2, we consider the case that the two strains are from two independent mosquito bites; on the other hand, when two strains are from the same mosquito bite, we consider scenarios of low (25%), moderate (50%) and high (75%) relatedness between two sibling strains. These events are represented in the second, third and forth rows respectively. For K=3, we consider mixed-infection events as products of three mosquito bites, two mosquito bites and a single bite. For K=4, we consider mixed infections as products of four mosquito bites, three mosquito bites, two mosquito bites and a single bite. We further divide the possibilities of the 2-bite event into the case that both bites pass on two strains (2 + 2) and the other possibility that one bite passes on a single strain and the other bit passes on three strains (1 + 3).

Figure 2—figure supplement 3
Additional comparison of DEploidIBD and DEploid on 100 in silico mixtures of two strains from Africa with low and moderate relatedness, illustrated by sub panels (A) and (B), respectively.

Detailed panel description can be found in the caption to Figure 2. DEploid generally performs well for samples of low within sample relatedness, though struggles when the minor strain proportion is below 30%. In contrast, DEploidIBD consitently performs well.

Figure 2—figure supplement 4
Additional comparison of DEploidIBD and DEploid on in silico b=2 bite mixtures of K=3 strains from Africa and Asia, illustrated by sub panels (A) and (B), respectively.

Detailed panel descriptions can be found in the caption to Figure 2. The unrelated strain provides a strong signal in allele frequency imbalance for DEploid to detect and therefore performs better than dealing with b=1 mixtures. Comparing (A) and (B), pairwise relatedness estimates are noisy in Asia because of the background IBD. However, background relatedness generates shorts segments of IBD and therefore leads to IBD N50 underestimation.

Figure 2—figure supplement 5
Comparison of DEploidIBD and DEploid on 100 in silico b=3 bite mixtures of four strains from Africa.

Detailed panel descriptions can be found in the caption to Figure 2. DEploid performs poorly in all cases. In contrast, DEploidIBD performs well when all four strains have unequal proportions, but is less accurate when some strains have equal proportion.

Cumulative distribution of the average per site genotype error (left) and switch error (right) across simulated mixtures (measured at sites that are heterozygous in the sample or sample-specific reference panel).

(A) Error rates of Asian in silico samples of three levels of IBD (25%, 50% and 75%) for a K=2 mixture with proportions of 20/80%. Because DEploidIBD estimates proportions more accurately, it enables better haplotype inference. (B) Error rates of African in silico samples of three levels of IBD (25%, 50% and 75%) for a K=2 mixture with proportions of 20/80%. Inference in Asia benefits from better reference panels (due to lower overall diversity) and therefore gives lower error rates than in Africa. (C) DEploidIBD error rates for African in silico samples of three mosquito biting scenarios for a K=3 mixture with proportions of 10/10/80%. The additional strain increases the difficulty of haplotype inference, particularly in the case of three independent bites.

Characterisation of mixed infections across 2344 field samples of Plasmodium falciparum.

(A) The fraction of samples, by population, inferred by DEploidIBD to be K=1 (clonal), K=2 (dual), K=3 (triple), or K=4 (More than 3). Populations are ordered by rate of mixed infections within each continent. We use shaded regions to indicate the distribution of 787 samples that have low-confidence deconvoluted haplotypes. Senegal is marked with an asterisks as these samples were screened to be clonal. (B) The distribution of average pairwise IBD sharing within mixed infections (including dual, triple and quad infections), broken down into unrelated (where the fraction of the genome inferred to be IBD, ρ, is <0.1), low IBD (0.1ρ<0.3), sib-level (0.3ρ<0.7) and high (ρ0.7). Stars indicate the average IBD scaled between 0 and 1 from bottom to the top. Populations follow the same order as in Panel A. (C) The relationship between the rate of mixed infection and level of IBD. Populations are coloured by continent, with size reflecting sample size and error bars showing ±1 s.e.m.. The dotted line shows the slope of the regression from a linear model. Abbreviations: SN-Senegal, GM-The Gambia, NG-Nigeria, GN-Guinea, CD-The Democratic Republic of Congo, ML-Mali, GH-Ghana, MW-Malawi, MM-Myanmar, TH-Thailand, VN-Vietnam, KH-Cambodia, LA-Laos, BD-Bangladesh.

Example IBD profiles in mixed infections.

Plots showing the ALT versus REF plots (left hand side) and inferred IBD profiles along the genome for five strains of differing composition. From top to bottom: A dual infection of highly related strains (ρ=0.84); a dual infection of two sibling strains (ρ=0.6); a triple infection of three sibling strains (note the absence of stretches without IBD); a triple infection of two related strains and one unrelated strain; and a triple infection of three unrelated strains. The numbers below the sample IDs indicate the average pairwise IBD, r, the mean length of IBD segments, l, in kb and the inferred number of distinct strains, K, respectively.

Identifying sibling strains within mixed infections.

(A) Schematic showing how IBD fraction and IBD segment length distributions are created for k=2 mixed infections using pf-meiosis. Two clonal samples from a given country are combined to create an unrelated (M=0, where M is number of meioses that have occurred) mixed infection. The M=0 infection is then passed through 3 rounds of pf-meioses to generate M=1,2,3 classes, representing serial transmission of the mixed infection (M=1 are siblings). (B) Simulated IBD distributions for M=0,1,2,3 for Ghana (top) and West Cambodia (bottom). A total of 10,000 mixed infections are simulated for each class, from 500 random pairs of clonal samples. (C) Classification results for 393 K=2 mixed infections from 13 countries. Undetermined indicates mixed infections with IBD statistics that were never observed in simulation. (D) Breakdown of class percentage by continent. Total number of samples is given above bars. Colours as in panel C (M=0, grey; M=1, purple; M=2, pink; M=3, orange; Undetermined, black). (E) Same as (D), but by country. Abbreviations as in Figure 4.

The relationship between P. falciparum prevalence and characteristics of mixed infection.

Four mixed infection statistics are shown including the average effective number of strains (Effective K, first column), given by Ke=(wi2)1, where wi is the proportion of the ith strain; background IBD observed between clonal samples (Background Fraction IBD, second column); fraction IBD within K=2 mixed infections (Fraction IBD, third column); and the rate of K=2 mixed infections classified as having M>1 (Supersibling Rate, fourth column). Each point relates to a row in Table 1 from different sampling locations and years. Pearson’s r is computed globally (shown at top in a grey box for each statistic), across Asian countries (upper panel) and across African countries (lower panel). Globally and for Africa, the correlations were computed including Senegal (rS+) and excluding Senegal (rS-). The slope and confidence intervals for the regression line excluding Senegal are drawn. Significant correlations (p<0.05) are highlighted in red and significance levels indicated by asterisks (* <0.05, ** <0.01, *** <0.001).

Appendix 1—figure 1
Comparison of true and inferred haplotypes for Chromosome 14 (2,369 SNPs) in the lab strain mixture sample PG0396-C after running DEploidIBD to infer strain number and proportions (top) and after subsequent refinement of haplotypes by running DEploid with Reference Panel V (bottom).

The yellow, cyan and white backgrounds identify the haplotype segments from strains 7G8, HB3 and Dd2 respectively. Numbers in the titles indicate the inferred switch, mismatch and dropout errors identified by the dynamic programming approach, with the cost of switch errors being twice that of other errors.

Appendix 2—figure 1
Distribution of quality scores haplotypes deconvolved from in silico mixtures using DEploid.

Each row represents a different population (Africa and Asia). The left panels represent the overall distribution of z-scores whereas the right panels stratify results according to the entropy of mixture proportions (y-axis) and number of strains (color).

Appendix 2—figure 2
Distribution of quality scores haplotypes deconvolved from in silico mixtures using DEploidIBD.

Each row represents a different population (Africa and Asia). The left panels represent the overall distribution of Z-scores whereas the right panels stratify results according to the entropy of mixture proportions (y-axis) and number of strains (color).

Appendix 3—figure 1
Identification of high leverage data points for filtering.

(Top) Plot showing total allele counts across all markers for field isolate PG0415. We observe a small number of heterozygous sites with high coverage (shown as crosses on the bottom-left plot), which can potentially mislead our model to over-fit the data with additional strains (above the dotted line). We used a threshold of ≥99.5% coverage to identify markers with high allele counts. Red crosses indicate markers that are filtered out. (Bottom-left) Scatter plot showing alternative against reference allele count. The marked black crosses refer to the outliers identified on the previous plot, which will cause the inference method to mistakenly identify the sample as being a mixed infection. (Bottom-middle) Histogram of allele frequency within sample. (Bottom-right) Allele frequency within sample (WSAF), compared against the population average (PLAF).

Appendix 3—figure 2
Nucleotide diversity for a sliding window size of 20,000 base pairs.

(Top) Histograms showing the heavy tail of ND beyond 0.0007. (Bottom) Figure showing ND along P. falciparum chromosome 1. Scattered Points mark chromosome positions of poorly genotyped SNPs which we exclude from the deconvolution process. These points are jitterred to ease visualization.

Appendix 3—figure 3
Diagnostic plots showing the distribution of haplotype quality (z-scores) for the Ghanian samples.

Left. Scatterplot showing the relationship between haplotype z-score and strain proportion. The top axis shows the number of alternative calls below/above the mean of the subset of clonal samples that correspond to a given z-score. The vertical red line denotes a z-score of whereas the red-shaded area indicate the haplotypes we retain Point colors show the COI level of the sample. Right. Four views of the same plot in which the samples have been highlighted according to their COI level.

Appendix 3—figure 4
In silico validation of IBD estimation using lab crosses.

(A) Visual summary of of IBD block detection between DEploidIBD (top) and ancestral state inference from Li and Stephens (2003) (bottom), using artificial mixtures of lab crosses PG0071-C and PG0058-C (last tract). (B) Scatter plot of IBD segment Nx values extracted by comparing clonal sample ancestry (using DEploidIBD) on artificial mixtures.

Appendix 4—figure 1
Exploring the relationship between number of outbred oocysts (nij) and IBD.

(A) Joint IBD fraction and IBD segment length distributions for K=2 mixed infections simulated from two unrelated strains and a fixed number of outbred oocysts nij, using pf-meiosis. Mean values for each distribution are indicated by same-color dashed lines. Each distribution is created from 1000 simulated mixed infections. (B) Validation of theoretical result given in text (S1.8). Line plot compares trend in expected IBD fraction with the number of outbred oocysts, nij, for infections simulated in panel A, and analytical expression S1.8.

Appendix 4—figure 2
Exploring expected IBD allowing for outbred (nij) and inbred (nii) oocysts.

(A) Validation of expression for expected IBD fraction conditional on outbred nij and inbred nii oocysts (S1.9). Line plot compares trend in expected IBD fraction with varying number of outbred (x-axis, nij) and inbred (line color, nii) oocysts and the analytical expression S1.9 (grey dashed lines). (B) Using pf-meiosis to simulate K=2 mixed infections generated from (1) two strains from the same outbred oocyst from (nijo=1, ’Within oocyst’); (2) two strains different outbred oocysts(nijo=2, ’Standard Siblings’); (3) one strain from an outbred and one strain from an inbred oocyst (nij,iio=2, ’Mother-daughter’).



Table 1
Summary of Pf3k samples in data release 5.1, where D¯ denotes mean read depth and ss is sample size.

Genotyping, including both indel and SNP variants, was performed using a pipeline based on GATK best practices, see Materials and methods. Data available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/5.1. PfPR is the inferred parasite prevalence rate in a 5 × 5 km resolution grid from the MAP project, centred at the Pf3k sample collection sites; Relatedness ρ and effective number of strains Ke are summary metrics from DEploidIBD output.

CountryYearLocationPfPRssD¯ (s.e.)ρ¯Ke¯Reference
Gambia2008Brikam0.0665129 ( 9.4 )0.51.3(Amambua-Ngwa et al., 2012)
Ghana2009Navrongo0.7912186 ( 5.7 )0.211.6(Duffy et al., 2015; Kamau et al., 2015; MalariaGEN Plasmodium falciparum Community Project, 2016)
2010Navrongo0.79171127 ( 10.3 )0.231.5
2011Navrongo0.729776 ( 5.3 )0.211.5
Kintampo0.58689 ( 13.5 )0.111.5
2012Navrongo0.5247111 ( 3.8 )0.291.6
Kintampo0.4140157 ( 8.1 )0.221.6
2013Navrongo0.3188119 ( 4 )0.261.6
Kintampo0.294172 ( 38.4 )0.441.1
Malawi2011Chikwawa0.19230101 ( 3 )0.261.7(Ocholla et al., 2014)
Zomba0.343589 ( 9.1 )0.241.6
Mali2007Bandiagara0.43995 ( 25.2 )0.391.8(Mobegi et al., 2014; MalariaGEN Plasmodium falciparum Community Project, 2016)
Faladje0.373675 ( 10.1 )0.271.3
Kolle0.215182 ( 10.5 )0.31.6
Guinea2011Nzerekore0.499777 ( 4.6 )0.171.4
Congo DR2013Kinshasa0.2411349 ( 3.2 )0.311.5
Senegal2004Thies0.092130 ( 68.2 )0.011.4(Wong et al., 2017)
2009Thies0.0443175 ( 14.9 )0.431.1
2010Thies0.0424159 ( 9.7 )0.31.3
2011Thies0.033297 ( 6 )0.331.1
West2009Pursat0.00711975 ( 8.8 )0.391.3(Amato et al., 2017; MalariaGEN Plasmodium falciparum Community Project, 2016)
Cambodia2010Pursat0.007110595 ( 6.8 )0.651.2
2011Pailin0.00254954 ( 4.1 )0.431.1
Pursat0.009610349 ( 3.1 )0.631.2
2012Pailin0.000963146 ( 5.6 )0.431.0
Pursat0.0079737 ( 19.1 )0.581.4
North2010Ratanakiri0.00395071 ( 6.1 )0.431.3
Cambodia2011Preah Vihear0.027351 ( 5.3 )0.361.2
Ratanakiri0.00328145 ( 4.3 )0.471.4
2012Preah Vihear0.00753043 ( 6.7 )0.371.0
Ratanakiri0.00161544 ( 8.9 )0.31.3
Thailand2011Mae Sot0.000113566 ( 7.5 )0.351.2(Miotto et al., 2013; MalariaGEN Plasmodium falciparum Community Project, 2016)
Sisakhet1e-045112 ( 25.4 )0.171.3
2012Mae Sot5.7e-056983 ( 4.9 )0.581.3
Ranong0.000181182 ( 12.4 )0.381.2
Sisakhet01389 ( 13 )0.371.1
2013Sisakhet0362 ( 8.8 )0.091.2
Bangladesh2012Ramu0.00215053 ( 4.2 )0.451.5
Viet Nam2011Bu Gia Map0.00734367 ( 5 )0.431.3
Phuoc Long0.00532768 ( 7.2 )0.371.2
2012Bu Gia Map0.007219115 ( 8 )0.671.1
Phuoc Long0.00485107 ( 6.3 )0.811.2
Myanmar2011Bago Division0.00761259 ( 7.1 )0.241.2
2012Bago Division0.00844762 ( 5.2 )0.451.2
Laos2011Attapeu0.00945971 ( 4.2 )0.361.4
2012Attapeu0.022577 ( 7.2 )0.681.3
Appendix 1—table 1
Notation used in this article.
iMarker index
jSample index
rRead count for reference allele
aRead count for alternative allele
fPopulation level allele frequency (PLAF)
KNumber of distinct strains within sample
lNumber of sites
𝐰Proportions of strains
𝐱Log titre of strains
𝐡iAllelic states of K parasite strains at site i
hk,iAllelic state of parasite strain k at site i
pObserved within sample allele frequency (WSAF)
qUnadjusted expected WSAF
πAdjusted expected WSAF
eProbability of read error
𝒮iIBD configuration at site i
θProbability of non-IBD in a mixture of two strains
Appendix 1—table 2
IBD configurations for two, three and four strains, ordered top to bottom by the number of IBD pairs.

The (zero-indexed) notation indicates the type assigned to each haplotype, thus 0–1 indicates non-IBD for two strains, while 0-1-2-2 indicates four strains in which the third and fourth are IBD.

IndexIBD state
K = 2K = 3K = 4
Appendix 3—table 1
Number of haplotypes discarded and retained for each population in the Pf3k dataset.
CountryDiscardedRetainedFraction discarded
DR. of Congo621550.29
The Gambia22730.23
Appendix 3—table 2
Number of haplotypes retained and discarded stratified by COI level.
COIRetainedDiscardedFraction discarded

Data availability

Metadata on samples is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/pf3k_release_5_metadata_20170804.txt.gz. Sequence data (aligned to Plasmodium falciparum strain 3D7 v3.1 reference genome sequences, for details see ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/2015-08/Pfalciparum.genome.fasta.gz) is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/5.1/. Diagnostic plots for the deconvolution of all samples can be found at https://github.com/mcveanlab/mixedIBD-Supplement (copy archived at https://github.com/elifesciences-publications/mixedIBD-Supplement) and deconvolved haplotypes can be accessed at ftp://ngs.sanger.ac.uk/production/pf3k/technical_working/release_5/mixedIBD_paper_haplotypes/. Code implementing the algorithms described in this paper, DEploidIBD, is available at https://github.com/mcveanlab/DEploid (copy archived at https://github.com/elifesciences-publications/DEploid).

The following previously published data sets were used
  1. 1
    Wellcome Trust Sanger public ftp site
    1. The Pf3k Project Consortium
    ID 5.1 Data. The Pf3k Project (2016): pilot data release 5.

Additional files

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)