High recombination rates reduce genetic marker density and affect the quality of detected IBD segments. a, The number of common single nucleotide polymorphisms (SNPs) (minor allele frequency ≥ 0.01) per genetic unit (centimorgan, cM) in simulated genomes with different recombination rates. In these simulations (blue line), the mutation rate is fixed and the recombination rate varies widely to cover the rates of both humans (red diamond) and Pf (red star). b, Accuracy of IBD segments detected from genomes simulated with different recombination rates. Accuracy is measured by the false negative rate (top panel) and the false positive rate (bottom panel). The plotted error rates are the genome-wide rates (defined in Methods) of IBD segments called with default IBD caller parameters unless otherwise specified (see S1 Table). For simplicity, only the error rates for two IBD detection methods, hmmIBD and hap-IBD, are included in (b); the error rates for the other IBD callers are provided in S2 Fig. For both (a) and (b), the genomes were simulated under the single-population model (see Methods).
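The relationship in panel (a) can be illustrated with a minimal Python sketch. It assumes a uniform recombination rate per base pair per generation, a "common" threshold of minor allele frequency at or above 0.01, and hypothetical function names and inputs; it is not the simulation pipeline described in Methods.

```python
def genetic_length_cm(n_bp, recomb_rate):
    """Genetic map length in cM, assuming a uniform recombination rate
    (crossovers per bp per generation); 1 cM = 0.01 expected crossovers."""
    return n_bp * recomb_rate * 100.0


def common_snps_per_cm(minor_allele_freqs, n_bp, recomb_rate, maf_min=0.01):
    """Number of common SNPs per cM for one simulated chromosome.

    minor_allele_freqs: one minor allele frequency per SNP (hypothetical input).
    maf_min: assumed threshold for calling a SNP "common".
    """
    n_common = sum(1 for f in minor_allele_freqs if f >= maf_min)
    return n_common / genetic_length_cm(n_bp, recomb_rate)
```

With the mutation rate (and hence the number of common SNPs per base pair) held fixed, a 100-fold increase in `recomb_rate` stretches the same physical length over 100 times more centimorgans, so the density per cM drops 100-fold.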

Accuracy of IBD segments detected from Pf genomes varies across IBD callers. IBD segments were inferred from genomes simulated under the single-population model with a shrinking population size and a recombination rate compatible with Pf. The accuracy of IBD was evaluated using the calculated false positive rate (y axis) and false negative rate (x axis). The rates were calculated for segment length bins of [3-4), [4-6), [6-10), [10-18), and [18, inf) cM, and at the genome-wide level (defined by overlap analysis between true IBD segments and inferred IBD segments from each genome pair). The IBD callers analyzed here, from left to right, are hmmIBD, hap-IBD, isoRelate, Refined IBD, and phased IBD. The results of the simulations under the multiple-population model are provided as S1 Data.
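The overlap analysis for one genome pair can be sketched as follows. The exact definitions are given in Methods; this sketch assumes that the false negative rate is the fraction of true IBD length not covered by inferred segments and that the false positive rate is the fraction of inferred IBD length not covered by true segments. The function names are hypothetical, and segments are assumed to be sorted, non-overlapping (start, end) intervals in cM on one chromosome.

```python
from bisect import bisect_right

BIN_EDGES = [3, 4, 6, 10, 18]  # lower edges of the length bins, in cM
BIN_LABELS = ["[3-4)", "[4-6)", "[6-10)", "[10-18)", "[18, inf)"]


def length_bin(length_cm):
    """Return the bin label for a segment length, or None if shorter than 3 cM."""
    idx = bisect_right(BIN_EDGES, length_cm) - 1
    return BIN_LABELS[idx] if idx >= 0 else None


def total_length(segments):
    """Summed length of (start, end) intervals."""
    return sum(end - start for start, end in segments)


def overlap_length(a, b):
    """Length shared by two sorted, non-overlapping interval lists."""
    i = j = 0
    shared = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance the interval that ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def error_rates(true_segs, inferred_segs):
    """Return (false_negative_rate, false_positive_rate) for one genome pair.

    A rate is None when its denominator (true or inferred length) is zero.
    """
    shared = overlap_length(true_segs, inferred_segs)
    true_len = total_length(true_segs)
    inferred_len = total_length(inferred_segs)
    fn = (true_len - shared) / true_len if true_len > 0 else None
    fp = (inferred_len - shared) / inferred_len if inferred_len > 0 else None
    return fn, fp
```

Per-bin rates would follow the same pattern after grouping segments with `length_bin`; how the bins are assigned (by true or by inferred segment length) is not specified here and is left as an assumption for the reader to check against Methods.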

IBD caller-specific parameter optimization can improve the quality of IBD segments inferred from simulated Pf genomes (using hap-IBD as an example). a, Quality of detected IBD measured by false positive and false negative rates before (left column) and after (right column) hap-IBD-specific parameter optimization. As indicated in the axis legend, the error rates were calculated for different length ranges (in centimorgans), including [3-4), [4-6), [6-10), [10-18), [18, inf) and at the genome-wide level. b, Quality of detected IBD measured by total genome pairwise IBD, an estimate of genetic relatedness, before (left column) and after (right column) hap-IBD parameter optimization. Each dot represents a pair of genomes with the coordinates x and y being true and inferred total IBD. Note: both the x and y axes in (b) use log scales. In (b), the blue dots are the pairs with nonzero true and inferred total IBD while red dots are pairs with either true total IBD or inferred total IBD being 0; zero-valued total IBD was replaced with 1.0 cM for visualization purposes. The red dotted line of y = x indicates the expected pattern, that is, true total IBD equal to inferred total IBD if the inferred IBD was 100% accurate.
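The quantities plotted in panel (b) can be sketched in Python as below. The input layout (tuples of two sample names and a start and end position in cM) and the function names are assumptions made for illustration; only the 1.0 cM replacement of zero-valued totals comes from the caption.

```python
from collections import defaultdict


def total_ibd_per_pair(segments):
    """Sum IBD segment lengths (cM) per unordered genome pair.

    segments: iterable of (sample_a, sample_b, start_cm, end_cm) tuples
    (a hypothetical layout, not a specific caller's output format).
    """
    totals = defaultdict(float)
    for a, b, start, end in segments:
        pair = tuple(sorted((a, b)))
        totals[pair] += end - start
    return dict(totals)


def scatter_points(true_totals, inferred_totals, floor_cm=1.0):
    """Build (pair, x, y, has_zero) points for a log-log scatter plot.

    Zero totals are replaced with floor_cm so they can be drawn on log axes;
    has_zero marks pairs where either total was zero (the red dots).
    """
    points = []
    for pair in sorted(set(true_totals) | set(inferred_totals)):
        t = true_totals.get(pair, 0.0)
        f = inferred_totals.get(pair, 0.0)
        has_zero = t == 0.0 or f == 0.0
        points.append((pair, t if t > 0 else floor_cm, f if f > 0 else floor_cm, has_zero))
    return points
```

Points with `has_zero` set to False that fall on the line y = x correspond to pairs whose inferred total IBD matches the true total exactly.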

Post-optimization benchmarking of different IBD callers by comparing downstream estimates of Ne. With parameters optimized for each IBD caller, the performance of IBD callers was evaluated by comparing the Ne trajectory for the most recent 100 generations estimated via IBDNe based on true IBD (black dashed line) versus inferred IBD (red solid line). True IBD was calculated from simulated genealogical trees via tskibd; inferred IBD includes segments inferred by hap-IBD, hmmIBD, isoRelate, Refined IBD, and phased IBD, with their Ne estimates shown from left to right. The shaded areas surrounding the red lines indicate 95% confidence intervals as determined by IBDNe. See S10 Fig for pre-optimization results.

Validation of the performance of IBD callers in empirical data sets by comparing IBD-based downstream analyses. a, IBD coverage and detected selection signals in the SEA data set using different IBD callers (rows 1 to 5). Annotations and corresponding vertical dotted lines at the top indicate the centers of known and putative drug resistance genes and genes related to sexual commitment; red shading indicates regions that are inferred to be under positive selection (see Methods for definitions). b, Ne estimates of the SEA data set based on IBD inferred by different callers. Lines are point estimates; the shaded areas around the lines indicate confidence intervals based on bootstrapping (generated by IBDNe). c, Inference of the population structure of the structured data set by the InfoMap community detection algorithm using the IBD inferred by different IBD callers. The rows of the heatmap are the geographic regions of isolates, and the columns are the largest inferred communities, labeled c1 to c6. The heatmap color represents the number of isolates in each block with the given row and column labels. For better visualization, the columns are rearranged so that the diagonal blocks tend to have the largest values per row.

Comparison of computational runtime of the IBD calling process for different callers. a, Runtime for different IBD callers to detect IBD from genomes of different sample sizes in single-thread mode. The comparison is based on Pf genomes of 100 cM simulated under the single-population model. The x-axis tick labels show the number of genome pairs analyzed (below the plot, on a linear scale) and the sample size (number of haploid genomes, above the plot, on a nonlinear scale). The line styles and markers for the different callers/tools are provided in the legend box on the far right of the figure, which is shared across the three subplots. b, Runtime in multithreading mode. (b) is organized similarly to (a), except that the IBD calling processes were run with 10 threads. See also S12 Fig for the maximum memory usage of the different callers.
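The two x-axis scales are related by the pair count. The sketch below assumes that every unordered pair of distinct haploid genomes is analyzed, so that n genomes give n(n-1)/2 pairs; the function names are hypothetical.

```python
import math


def n_pairs(n_genomes):
    """Number of unordered pairs among n haploid genomes: n(n-1)/2."""
    return n_genomes * (n_genomes - 1) // 2


def n_genomes_for_pairs(pairs):
    """Invert n(n-1)/2 = pairs; raises if pairs is not a valid pair count."""
    disc = 1 + 8 * pairs
    root = math.isqrt(disc)
    if root * root != disc:
        raise ValueError("not a valid number of pairs")
    return (1 + root) // 2
```

Because the pair count grows quadratically with the sample size, ticks placed evenly in the number of pairs correspond to sample sizes that are spaced nonlinearly, which is why the upper axis is not linear.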