Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum Genomes

Bing Guo
Shannon Takala-Harrison author has email address
Timothy D O’Connor author has email address

Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, United States
Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, United States

https://doi.org/10.7554/eLife.101924.2

Open access
Copyright information

Figures and data

Overview of methods used in benchmarking IBD detection methods.
Benchmarking and optimization of IBD callers for Pf include simulation analyses (top, green shading) and empirical data-based validation analyses (bottom, gray shading). (1) For the simulation study, the genealogical trees and phased genotype data are generated via the combination of forward (SLiM [41, 42]) and coalescent (msprime [43]) simulations (indicated by the superscript a in the diagram). True IBD is obtained from a simulated genealogy tree via tskibd [10] (b) and inferred IBD from the phased genotype using different IBD callers, including hap-IBD [20], hmmIBD [8], isoRelate [6], Refined IBD [44], and phased IBD [21]. IBD benchmarking is performed at two levels. The first is at the IBD segment level. The metrics include false positive and false negative rates (c), population-level total IBD per length bin [22] (d), and total IBD per isolate pair. The second is at the level of IBD-based downstream analyses, including the effective population size (N_e) by IBDNe [22] (e), community membership through the InfoMap algorithm [45, 46] (f), and selection signals by statistics X_iHS [10] (g). As the default parameters of different IBD callers, particularly those developed for human data, may not be ideal for Pf genomes, we performed grid searches for key parameters for each IBD caller so that the comparison was based on the best performance of each caller (see S1 Data for detailed information on simulation and IBD calling parameters and used values). (2) For validation in empirical data where true IBD is not available, we obtained IBD-based estimates using IBD from different callers and assessed which version of IBD can generate the expected patterns. The empirical data sets are subsampled from the MalariaGEN Pf 7 database [37] (h).

High recombination rates reduce genetic marker density and affect the quality of detected IBD segments.

a, The number of common single nucleotide polymorphisms (SNPs) (minor allele frequency ≥ 0.01) per genetic unit (centimorgan, cM) in simulated genomes with different recombination rates. In these simulations (blue line), the mutation rates are fixed; the recombination rates vary widely to include the rate for both humans (red diamond) and Pf (red star). For each recombination rate, n = 4 independent simulations of chromosomes were performed. Data were plotted in the form of mean (marker) ± standard deviation (vertical lines, which are difficult to visualize given low variation among chromosomes). b, Accuracy of IBD segments detected from genomes simulated with different recombination rates. The accuracy of IBD segments is measured by the false negative rates (top panel) and false positive rates (bottom panel). The plotted error rates represent the genome-wide rates (as defined in Methods) of IBD segments identified with default IBD caller parameters unless stated otherwise in S1 Table. These rates are based on one representative set from n = 3 simulation sets. The plotted vertical lines indicate standard deviations of error rates across all genome pairs in the representative simulated set. Both the vertical lines and markers are horizontally staggered for clarity. Only error rates for two IBD detection methods, hmmIBD and hap-IBD, are included in (b) for simplicity. The error rates for all 5 IBD callers are provided in S1 Fig. For both (a) and (b), the genomes were simulated under the single-population model (see Methods). Note that log scales are used for the y axis in (a) and the x axis in (b).

Accuracy of IBD segments detected from Pf genomes varies across IBD callers.
IBD segments were inferred from genomes simulated under the single-population model with a shrinking population size and a recombination rate compatible with Pf. The accuracy of IBD was evaluated using the calculated false positive rate (y axis) and false negative rate (x axis). The rates were calculated for different length bins in centimorgans, including [3-4), [4-6), [6-10), [10-18), [18, inf) centimorgans and at the genome-wide level (defined by overlapping analysis between true IBD segments and inferred IBD segments from each genome pair). These rates are based on one representative set from n = 3 simulation sets. The plotted vertical and horizontal lines represent the standard deviations of error rates (horizontal for false negatives and vertical for false positives) calculated across all relevant segments for length-bin specific rates or across all genome pairs for genome-wide rates in the representative simulated set. The titles of the subplots indicate the IBD callers analyzed. The results of the simulations under the multiple-population model and the UK human demographic model are provided as S2 Data.

IBD caller-specific parameter optimization can improve the quality of IBD segments inferred from simulated Pf genomes (using hap-IBD as an example).

a, Quality of detected IBD measured by false positive and false negative rates before (left column) and after (right column) hap-IBD-specific parameter optimization. As indicated in the axis legend, the error rates were calculated for different length ranges (in centimorgans), including [3-4), [4-6), [6-10), [10-18), [18, inf) and at the genome-wide level. These rates are based on one representative set from n = 3 simulation sets. The plotted vertical and horizontal lines represent the standard deviations of error rates (horizontal for false negatives and vertical for false positives) calculated across all relevant segments for length-bin specific rates or across all genome pairs for genome-wide rates in the representative simulated set. b, Quality of detected IBD measured by total genome pairwise IBD, an estimate of genetic relatedness, before (left column) and after (right column) hap-IBD parameter optimization. Data of n = 1,000 haploid genomes from a representative simulation set was plotted with each subplot for hap-IBD before and after parameter optimization as indicated in the titles. Each dot represents a pair of genomes with the coordinates x and y being true and inferred total IBD. In (b), the blue dots are the pairs with nonzero true and inferred total IBD while red dots are pairs with either true total IBD or inferred total IBD being 0; zero-valued total IBD was replaced with 1.0 cM for visualization purposes. The red dotted line of y = x indicates the expected pattern, that is, true total IBD equal to inferred total IBD if the inferred IBD was 100% accurate. Note that log scales are used in both the x and y axes in (b).

Post-optimization benchmarking of different IBD callers by comparing downstream estimates
N_e. With parameters optimized for each IBD caller, the performance of IBD callers was evaluated by comparing the N_etrajectory for the recent 100 generations estimated via IBDNe based on true (black dashed line) IBD versus inferred IBD (red solid line). True IBD was calculated from simulated genealogical trees via tskibd; inferred IBD includes those inferred from hap-IBD, hmmIBD, isoRelate, Refined IBD, and phased IBD, with their N_e estimates shown from top left to bottom right. The shaded areas surrounding the red lines indicate 95% confidence intervals as determined by IBDNe. The plots show results from one representative set of n = 3 replicated simulation sets. See S9 Fig for pre-optimization results. Note that log scales are used on the y axes.

Validation of the performance of IBD callers in empirical data sets by comparing IBD-based downstream analyses.

a, IBD coverage and detected selection signals in the SEA data set using different IBD callers (rows 1 to 5). Annotations and corresponding vertical dotted lines at the top indicate the center of known and putative drug resistance genes and genes related to sexual commitment; red shading indicates regions that are inferred to be under positive selection (see Methods for definitions). b, N_e estimates of the SEA data set based on IBD inferred from different callers. Line plots are point estimates; the shaded areas around the line plots indicate confidence intervals based on bootstrapping (generated by IBDNe). c, Inference of the population structure of the structured data set by the InfoMap community detection algorithm using the IBD inferred from different IBD callers. The rows of the heatmap are geographic regions of isolates, and the columns are the largest, inferred communities, labeled as c1 to c6. The heatmap color represents the number of isolates in each block with the given row and column labels. The columns are rearranged so that the diagonal blocks tend to have the largest values per row for better visualization. Note that log scales are used in y axes in (b).

Comparison of computational runtime for IBD calling process for different callers.
a, Runtime for different IBD callers to detect IBD from genomes of different sample sizes in single-thread mode. The comparison is based on Pf genomes of size of 100 cM simulated under the single-population model. The x-axis tick labels include the number of pairs of genomes simulated and analyzed (below the plot, reflecting the number of computation units) and the number of haploid genomes (above the plot, representing sample size) analyzed. The line styles and markers for different callers/tools are provided in the legend box on the far right of the figure, which is shared across the two subplots. Values on the y axis represent means (markers) and standard deviations (vertical error lines) from n = 3 sets of independent simulations. Note that at each sample size, the x values for different IBD callers are slightly staggered to prevent the error lines from overlapping; some of the error lines are hard to visualize as they are relatively small. b, Runtime in multithreading mode. (b) is organized similarly to (a), except that the IBD calling processes were run in multithreading mode with 10 CPU threads. Note that log scales are used in y axes and bottom x axes. Also, see S11 Fig for the maximum memory usage for different callers.

Sign up for email alerts