Introduction

Malaria is a mosquito-borne disease that poses a significant public health challenge globally, with an estimated 249 million clinical cases and 608,000 deaths occurring in 2022 [1]. Surveillance remains an important component of malaria control and elimination efforts. Advances in sequencing technologies and the scale of resequencing studies now allow for parasite genomic surveillance, which can provide insights into the efficacy of malaria interventions and guide the design of targeted elimination strategies in different transmission settings [24].

Identity-by-descent (IBD) is an essential tool in population genomics that has been used to estimate genetic relatedness [58], positive selection [6, 911], effective population size (Ne) [10, 12], fine-scale population structure [4, 10, 13] and migration patterns [4, 14]. However, the reliability of the IBD-based analysis is highly dependent on the accuracy of the detected IBD segments. Insufficient density of genetic markers, on a local or genome-wide scale, probably contributes to high error rates in the identification of IBD segments [11, 15, 16] and reduced accuracy of IBD-based estimates of population demography [5, 17]. Many IBD detection methods have been designed for human genomes, where the demographic history and evolutionary parameters, including the recombination rate, differ considerably from Plasmodium falciparum (Pf). The effective population size of humans has increased rapidly in recent history [18], while that of Pf is decreasing, particularly in regions such as Southeast Asia and South America [1, 19], due to enhanced malaria elimination efforts. More importantly, Pf genomes recombine about 70 times more frequently per unit of physical distance [2024] than the human genome [25], while sharing a similar mutation rate [2, 2631] as human genomes [32] on the order of 108 per base pair per generation. The decreasing population size [10] and the high recombination rate in Pf result in a reduced number of common variants, such as single nucleotide polymorphisms, per unit of genetic distance. Large human whole genome sequencing data sets typically provide millions of common biallelic SNP variants [33], while Pf data sets only have tens of thousands [34]. Given that the human genome is about twice as large as Pf in genetic units, the per-centimorgan (cM) SNP density in Pf can be two orders of magnitude lower than in humans, which may not provide sufficient information for IBD detection. Thus, it is critical to understand whether IBD detection methods can still generate accurate IBD segments under low SNP density conditions, considering the specific evolutionary parameters of the Pf genome.

Evaluating the quality of the detected IBD segments requires benchmarking with the known ground truth through simulation studies [15, 16, 35, 36]. The performance of IBD detection tools developed for use in the human context is typically measured using simulated genomes reflecting demographic and evolutionary parameters of human genomes [15, 16, 3537]. Given the differences in SNP density between human and Pf genomes, the quality assessment and parameter optimization derived from human studies likely do not apply directly to Pf. For tools explicitly designed for malaria parasites, such as isoRelate and hmmIBD, the evaluation of the quality of IBD was based on parent-offspring [8] or pedigree-based simulations (up to 25 generations) [6] that focused primarily on close relatives, which more likely mirrors low malaria transmission settings than high transmission settings. Furthermore, benchmarking methods and definitions of IBD accuracy are inconsistent across studies [6, 8, 15, 16], making the results of the quality evaluation of IBD difficult to compare. Considering the limitations of existing evaluations of IBD detection methods for Pf genomes, a unified benchmarking framework specifically designed for high recombining Pf genomes from low- and high-transmission settings is needed. Such a framework will assist researchers in comparing and prioritizing different IBD detection methods for intended downstream analysis.

In the present study, we developed a unified IBD benchmarking framework that reflects the demographic and evolutionary parameters of Pf. We evaluated how different recombination rates and marker densities affect the quality of detected IBD segments, performed IBD caller-specific parameter optimization, and benchmarked different IBD detection methods with their optimized parameters at both the IBD segment and downstream inference levels. Furthermore, we validated our findings from simulation analysis with empirical data sets constructed from subsets of samples from the publicly available whole genome sequencing database MalaraiGEN Pf 7. Our findings indicate that a high recombination rate (given the same mutation rate) is associated with a low SNP density (per genetic unit), which substantially affects the accuracy of the detected IBD segments. IBD segments called by many of the benchmarked methods generally suffer from a high false negative rate (fraction of true IBD not detected) with the default parameters. IBD caller-specific parameter optimization can reduce errors in detected IBD segments and the resulting bias in IBD-based relatedness estimates. With optimized parameters, IBD inferred by most evaluated IBD callers can capture the main patterns of positive selection and population structure; however, only hmmIBD provides an accurate estimate of the trajectory of effective population size. To obtain optimal results, we recommend optimizing the parameters of human-oriented IBD callers before applying to Pf genomes, and using hmmIBD for IBD quality-sensitive analysis when genetic data from haploid genomes are available.

Results

Low SNP density due to high recombination rate affects the accuracy of IBD calls

Ideal IBD segments are typically defined as shared genomic regions descending unbroken by recombination from a common ancestor [38, 39], which can be derived from the ancestral recombination graph (ARG) [10, 17, 35, 40]. When ARG is unknown, IBD segments are generally inferred from mutational information, such as genotype data, as a profile of the underlying ARG [41, 42]. However, high-recombining species like Pf have very low marker densities per genetic unit, and may largely affect the accuracy of inferred IBD segments.

To assess how recombination rates affect marker density per genetic unit and the detection of IBD segments, we simulated genomes with varying recombination rates but a fixed mutation rate. Under a single-population demographic model (see Methods), we found that the density of common biallelic SNPs (minor allele frequency 0.01) per cM, in selectively neutral scenarios, is inversely correlated with recombination rates ranging from 3 × 109 to 106 per base pair per generation (Fig 1a). For instance, the SNP density of Pf -like genomes is 25 SNPs per cM, which is approximately 1/67 of that of the human-like genomes (1,660 SNPs per cM). We further assessed how low marker densities associated with high recombination rates affect the accuracy of detected IBD segments calculating two metrics (via ishare/ibdutils, see Code availability), including the false negative rate (FN), which represents the proportion of a true IBD segment (obtained via tskibd [10]) not covered by inferred IBD segments of the same genome pairs, and the false positive rate (FP), which indicates the faction of a true segment not covered by inferred segments (see Methods for detailed definitions, and S1 Fig for method overview). Our analysis showed that as the recombination rate increases, both the genome-wide FN and FP (Fig 1b) increase for IBD inferred from hmmIBD. The patterns vary in the other four IBD detection methods, and all suffer elevated FNs and/or FPs as the recombination rate increases (Fig 1b and S2 Fig), except for isoRelate, of which the IBD quality tends to be better with a lower marker density. The results suggest that a low SNP density per genetic unit can dramatically affect the reliability of detected IBD segments.

High recombination rates reduce genetic marker density and affect the quality of detected IBD segments. a, The number of common single nucleotide polymorphisms (SNPs) (minor allele frequency 0.01) per genetic unit (centimorgan, cM) in simulated genomes with different recombination rates. In these simulations (blue line), the mutation rates are fixed; the recombination rates vary widely to include the rate for both humans (red diamond) and Pf (red star). b, Accuracy of IBD segments detected from genomes simulated with different recombination rates. The accuracy of IBD segments is measured by the false negative rates (top panel) and false positive rates (bottom panel). The plotted error rates reflect the genome-wide rates (defined in Methods) of IBD segments called with default IBD caller parameters unless otherwise specified (see S1) Table. Only error rates for two IBD detection methods, hmmIBD, and hap-IBD, are included in (b) for simplicity. The error rates for other IBD callers are provided in S2 Fig. For both (a) and (b), the genomes were simulated under the single-population model (see Methods).

Varying quality of IBD inferred from simulated Pf genomes via different IBD callers

Multiple IBD callers have been used for Pf, including Pf -oriented, Hidden Markov Model-based methods, such as hmmIBD [8] and isoRelate [6], and those originally designed for human genomes, such as Refined IBD [12] and Beagle (version 4.1) [4]. We analyzed hap-IBD [15] and phased IBD [16] IBD callers in addition to hmmIBD, isoRelate and Refined IBD, since the former represents two recent key advancements in the development of IBD detection methods that scale well to large sample sizes and genome sizes.

To evaluate the applicability and accuracy of these IBD detection methods in analyzing Pf genomes, we performed benchmarking analyses in simulated genomes (S1 Fig, top panel), mimicking the high recombination rate and the decreasing population size of Pf populations. Our analyses include three sets of comparisons: (1) baseline benchmarking, where we mainly used the default parameter values for each IBD caller and compared the performance in Pf genomes at the level of IBD segment and their simple statistics; (2) post-optimization benchmarking, where we used parameter values optimized specifically for each IBD caller so that the comparisons are based on the optimal performance of each method; (3) human-like genome benchmarking, where detected IBD segments are expected to have low error rates for human-oriented IBD callers and thus were used as an internal control to validate our benchmarking pipeline (see S1 Table for the IBD caller parameters and Methods for demographic models).

Baseline benchmarking analysis shows that all callers except hmmIBD suffer from high FN rates especially for short IBD segments (genome-wide FN/FP rates reflect shorter IBD segments as most segments are short) (Fig 2). Similarly, genetic relatedness metrics based on pairwise total IBD are largely underestimated for most callers (S3 Fig). We found that by default hmmIBD has relatively low FN/FP error rates (Fig 2) and is less biased for relatedness estimates (S3 Fig). In contrast, isoRelate and human-oriented callers have high false negative rates and varying FP rates (Fig 2). Thus, both Pf - and human-oriented IBD callers can suffer high error rates when detecting IBD from genomes with a high recombination rate and a low marker density, with the exent depending on underlying assumptions and methodologies.

Accuracy of IBD segments detected from Pf genomes varies across IBD callers. IBD segments were inferred from genomes simulated under the single-population model with a shrinking population size and a recombination rate compatible with Pf. The accuracy of IBD was evaluated using the calculated false positive rate (y axis) and false negative rate (x axis). The rates were calculated for different length bins in centimorgans, including [3-4), [4-6), [6-10), [10-18), [18, inf) centimorgan and at the genome-wide level (defined by overlapping analysis between true IBD segments and inferred IBD segments from each genome pair). The IBD callers analyzed here, from left to right, include hmmIBD, hap-IBD, isoRelate, Refined IBD, and phased IBD. The results of the simulations under the multiple population model are provided as S1 Data.

IBD-caller-specific parameter optimization for Pf improves IBD accuracy

Since these IBD callers are optimized for different species or genotype datasets, we hypothesized that optimization of IBD caller parameters under a unified framework designed for Pf genomes can improve the performance of these callers in analyzing malaria parasite data. As searching the entire IBD caller parameter space is inefficient, our optimization focused mainly on parameters potentially affected by or needing adjustment due to differences in marker density between the new target (Pf) and previously tested populations (humans). For IBD callers that do not explicitly have marker density-related parameters, such as hmmIBD, we explored other parameters that likely affect IBD quality. We performed grid searches to find parameter value(s) that generate inferred IBD with low and balanced error rates (see S2 Table for parameters explored and their corresponding values, and S1 Data for detailed results).

We found that most IBD callers have a key parameter that can dramatically affect their FN/FP rates (S2 Table). For example, the FN rates of IBD called from hap-IBD change substantially when the min-marker parameter varies (see S1 Data). With a value of 70, the FN rate for short IBD segments dramatically decreases such that the FN and FP error rates become more balanced (Fig 3a). Consistently, genetic relatedness estimates change from being highly underestimated before parameter optimization (Fig 3b, left column) to being more balanced after optimization (Fig 3b, right column). Similar improvements were observed for Refined IBD and phased IBD (S4 Fig). In contrast, the quality metrics remained largely unchanged during parameter optimization attempts for hmmIBD and isoRelate, with hmmIBD being more accurate and unbiased and isoRelate suffering from high false negative rates and an underestimated relatedness (S4 Fig). The observation that human-oriented IBD callers are more optimizable for the simulated Pf genome compared to Pf -oriented IBD callers confirms that IBD callers originally targeting a non-Pf population need adjustment before they can be used for Pf.

IBD caller-specific parameter optimization can improve the quality of IBD segments inferred from simulated Pf genomes (using hap-IBD as an example). a, Quality of detected IBD measured by false positive and false negative rates before (left column) and after (right column) hap-IBD-specific parameter optimization. As indicated in the axis legend, the error rates were calculated for different length ranges (in centimorgans), including [3-4), [4-6), [6-10), [10-18), [18, inf) and at the genome-wide level. b, Quality of detected IBD measured by total genome pairwise IBD, an estimate of genetic relatedness, before (left column) and after (right column) hap-IBD parameter optimization. Each dot represents a pair of genomes with the coordinates x and y being true and inferred total IBD. Note: both the x and y axes in (b) use log scales. In (b), the blue dots are the pairs with nonzero true and inferred total IBD while red dots are pairs with either true total IBD or inferred total IBD being 0; zero-valued total IBD was replaced with 1.0 cM for visualization purposes. The red dotted line of y = x indicates the expected pattern, that is, true total IBD equal to inferred total IBD if the inferred IBD was 100% accurate.

The human-oriented IBD callers, when not optimized for Pf, underperformed hmmIBD, especially for Refined IBD and phased IBD (S4a Fig). To exclude potential problems in our benchmarking pipeline, we simulated genomes with recombination rates and the demographic history consistent with human population in the UK (S5a Fig, left column, and S5b Fig, left column; also see Methods). These callers, evaluated without Pf -optimized parameters indeed perform much better on human genomes, showing consistently lower FN/FP error rates (S5a Fig, left column) and less biased total IBD-based relatedness estimates (S5b Fig, left column), compared to Pf -like genomes (S4a Fig and S4b Fig, left columns). The results support the robustness of our benchmarking pipeline (S5a Fig and S5b Fig, left columns) and demonstrate challenges in applying human-oriented IBD callers to Pf. Furthermore, we found that IBD caller performance varies with demographic configurations (single-population model in S4a Fig and S4b Fig, left columns, versus UK human model in S5a Fig and S5b Fig, right columns) even with the same (Pf) recombination/mutation rates and default IBD caller parameters, suggesting that the optimization is demography dependent (see details in S1 Data).

Post-optimization benchmarking via downstream inferences

IBD-based downstream analyses, such as estimation of Ne, selection signals, and population structure, are key applications of IBD segments in population genetics, which often rely on the high quality of input IBD segments. With optimized IBD caller-specific parameters tailored for Pf -like genomes, we can expect IBD callers to perform at their best, which allows high-level benchmarking by comparing IBD-based downstream estimates.

For IBD-based selection detection, we simulated positive selection on each of the 14 chromosomes via the single-population model, and identified IBD peaks with different callers (see Methods). These peaks, inferred as regions under selection, were true signals if they contain the selected site from simulations. We found that most callers can capture the majority of the simulated signals except Refined IBD, which is less sensitive and only detects 3 out of 14 (S6a Fig); isoRelate shows an increased level of false positives or low signal-to-noise ratios, evident in IBD coverage curves (S6a Fig).

For IBD-based population structure inference, we simulated Pf -like genomes under a selectively neutral condition using the multiple-population demographic model, and performed IBD network community detection via InfoMap [43, 44] to define subpopulations (see Methods). Similarly to IBD-based selection signal detection, we found that IBD inferred from most callers can accurately recapitulate the simulated population structure, comparable to true IBD (S6b Fig). The exception is that isoRelate tends to generate many smaller groups, which is likely due to high FN rates for short IBD segments, missing connectivity among distantly related genomes and capturing closely related subgroups of small sizes.

For IBD-based Ne inference, we simulated neutral Pf genomes using the single-population model and estiamted Ne from detected IBD via IBDNe [17]. We found most of the compared IBD callers suffer from wild oscillation, which has previously been observed [17], and deviate significantly from the truth for older generations (Fig 4), consistent with the general pattern of high error rates in shorter IBD length bins for these IBD callers (S4a Fig, right column). Meanwhile, the IBD inferred by hmmIBD generated highly accurate estimates comparable to true IBD (Fig 4). We explored the mechanisms underlying bias in Ne estimates and found that the strong bias is likely due to underestimation of population-level total IBD for short IBD segments, which is mostly obvious for hap-IBD, isoRelate, and phased IBD, followed by Refined IBD (S7 Fig). For hmmIBD, both the estimates of Ne (Fig 4) and population total IBD are relatively unbiased (S7 Fig, column 2), consistent with its relatively low and balanced FP/FN rates (Fig 2, leftmost panel). These results suggest that IBD-based Ne estimates are highly sensitive to the quality of input IBD segments, and that hmmIBD is more accurate for this analysis.

Post-optimization benchmarking of different IBD callers by comparing downstream estimates Ne. With parameters optimized for each IBD caller, the performance of IBD callers was evaluated by comparing the Ne trajectory for the recent 100 generations estimated via IBDNe based on true (black dashed line) IBD versus inferred IBD (red solid line). True IBD was calculated from simulated genealogical trees via tskibd inferred IBD includes those inferred from hap-IBD, hmmIBD, isoRelate, Refined IBD, and, phased IBD, with their Ne estimates shown from left to right. The shading areas surrounding the red lines indicate 95% confidence intervals as determined by IBDNe. See S10 Fig for pre-optimization results.

Given that Ne estimation in Pf is very sensitive to the quality of detected IBD, we explored whether excluding short, error-prone IBD segments (< 4cM) could improve Ne estimates for non-hmmIBD callers. The exclusion results in reduced oscillation of the trajectory for some callers, like hap-IBD, but wide confidence intervals or underestimation in older generations, in both simulated and emprical data (S8 Fig). We then explored reasons underlying the recent oscillation or drop (around 20 generations ago) commonly observed in the estimated Ne trajectories (S8a Fig, second row and S8b Fig, both rows) [12, 37, 45]. We hypothesized that this oscillation is partially due to IBD segments with TMRCA < 1.5 generations ago being included in IBDNe input [17]. We found that removing these segments can greatly mitigate this problem (S8a Fig), especially for hmmIBD and Refined IBD. The findings suggest caution when interpreting (1) a recent drop in an estimated Ne trajectory in empirical data sets where TMRCA-based filtering is less practical and (2) extremely large estimates in older generations stemming from high error rates for short IBD segments.

To confirm that parameter optimization improves downstream inferences, we compared post-optimization results (S6 Fig and Fig 4) with pre-optimization estimates (S9 Fig and S10 Fig). We found that parameter optimization indeed increased the power in selection detection (in isoRelate, Refined IBD and phased IBD), improved population structure inference (phased IBD and hap-IBD), and reduced oscillation on the Ne trajectory (Refined IBD).

Validation in empirical data set

We further validated the findings from simulation analysis in emprical datasets, by constructing “single” or “multiple” population data sets based on the MalariaGEN Pf 7 data [34] (see Methods for details). As true IBD segments are not available here, we focused on high-level benchmarking by evaluating whether IBD-based downstream estimates are consistent with expected patterns, including Ne estimation and selection signal detection with “single” population data sets and InfoMap population structure inference with the “multiple” population data set. (S3 Table, S4 Table, and S5 Table).

With optimized parameters, all callers, except Refined IBD, captures most known selection signals from the Southeast Asia data set. These signals include selective sweeps associated with antimalarial drug resistance and sexual commitment, such as dihydrofolate reductase (dhfr) [46], multidrug resistance protein 1 (pfmdr1) [47], amino acid transporter 1 (pfaat1) [23], chloroquine resistance transporter (pfcrt) [48], dihydropteroate synthase (dhps) [49], Apicomplexan-specific ApiAP2-g(ap2-g) [50] and apicoplast ribosomal protein S10 (arps10) [51] (Fig 5a). hmmIBD detects more peaks but suffer from noise, likely due to the relatively high FP to FN ratios for short (<4 cM) IBD segments (Fig 1 and Fig 2).

Validation of the performance of IBD callers in empirical data sets by comparing IBD-based downstream analyses. a, IBD coverage and detected selection signals in the SEA data set using different IBD callers (rows 1 to 5). Annotations and corresponding vertical dotted lines at the top indicate the center of known and putative drug resistance genes and genes related to sexual commitment; red shading indicates regions that are inferred to be under positive selection (see Methods for definitions). b, Ne estimates of the SEA data set based on IBD inferred from different callers. Line plots are point estimates; the shading areas around the line plots indicate confidence intervals based on bootstrapping (generated by IBDNe). c, Inference of the population structure of the structured data set by the InfoMap community detection algorithm using the IBD inferred from different IBD callers. The rows of the heatmap are geographic regions of isolates, and the columns are the largest, inferred communities, labeled as c1 to c6. The heat map color represents the number of isolates in each block with the given row and column labels. The columns are rearranged so that the diagonal blocks tend to have the largest values per row for better visualization.

Similar to the simulation analysis, IBD detected using most callers resulted in Ne estimates with unrealistic oscillations for the Southeast Asia data set, including extremely large estimates in more distant past (> 20 generations ago) (Fig 5b). The problems are much less severe with the estimates from hmmIBD, which mirrored the expected reduction in malaria in this region due to the intense efforts to eliminate malaria in recent decades [1].

InfoMap-based community detection reveals expected continental population structure across most callers: African Pf parasites are less structured, and Southeast Asian parasites are more structured and distinct from Oceanian parasites (Fig 5c), consistent with previous non-IBD-based methods [34]. isoRelate IBD estimates generate many small, close groups, likely due to high false negative rates for short IBD segments, especially in high-transmission settings like Africa where parasites have low relatedness and mainly share short IBD segments (Fig 5c).

To further confirm the improvement of IBD-based estimates through parameter optimization, we performed analyses with IBD detected with unoptimized parameters (see S1 Table). We found that the height and number of IBD peaks for hap-IBD and phased IBD decreased significantly (S11a Fig versus Fig 5a). Pre-optimization Ne estimates show more extreme oscillations, especially for human-oriented IBD callers (S11b Fig). Pre-optimization IBD estimates from hap-IBD and phased IBD fail to reveal the expected population structure, particularly in African parasite populations (S11c Fig). These differences underscore the importance of parameter optimization for Pf, especially for IBD callers not validated for Pf.

Computation efficiency comparison and improvement

With the decrease in whole genome sequencing cost and the increase in sample availability, it is important to prioritize IBD callers that scale well for large sample sizes, such as MalariaGEN Pf 7 (n = 20,864) [34]. We compared the IBD inference time and maximum memory usage for different IBD callers with or without parallelization. When using a single thread, probabilistic inference algorithms like isoRelate, Refined IBD, and hmmIBD are about two orders of magnitude slower for those based on identity-by-state-based or positional Burrows-Wheeler transform (PBWT) based algorithms, such as hap-IBD and phased IBD(Fig 6a). Maximum memory consumption is highest in Refined IBD, with hmmIBD being around ten times more efficient (S12a Fig). With multithreading, the patterns are similar to single-thread comparison as most allow parallelization (Fig 6b and S12b Fig). The exception is hmmIBD, as it currently only supports a single-thread computation. Despite the computation efficiency of hmmIBD compared to isoRelate and its high accuracy in detected IBD segments from Pf genomes, it remains significantly slower than IBS-based methods, highlighting the need for further enhancements for large data sets like MalariaGEN Pf 7.

Comparison of computational runtime for IBD calling process for different callers. a, Runtime for different IBD callers to detect IBD from genomes of different sample sizes in single-thread mode. The comparison is based on Pf genomes of size of 100 cM simulated under the single population model. The x-axis tick labels include the number of pairs of genomes analyzed (below the plot, on a linear scale) and the sample size (number of haploid genomes, above the plot, on a nonlinear scale) analyzed. The line styles and markers for different callers/tools are provided in the legend box on the far right of the figure, which is shared across the three subplots. b, Runtime in multithreading mode. (b) is organized similarly to (a), except that the IBD calling processes were run in multithreading mode with 10 threads. Also, see S12 Fig for the maximum memory usage for different callers.

Discussion

Our study sought to address important questions regarding the application of identity-by-descent (IBD) in highly recombining species, such as Plasmodium falciparum, focusing on the reliability of IBD segment detection and the prioritization of detection methods for accurate IBD-based analyses. We observed that: (1) Low marker density per genetic distance, resulting from high recombination rates relative to mutation rates, significantly affects the accuracy of IBD detection; (2) Optimizing IBD caller-specific parameters can improve the performance of IBD callers for Pf genomes, especially for callers designed for human genomes; (3) After parameter optimization, most IBD callers effectively capture expected positive selection signals and population structure patterns; however, only hmmIBD can provide plausible Ne estimates for simulated genomes mimicking the Pf population; (4) Probabilistic (HMM-based) IBD detection methods are significantly less computationally efficient than IBS-based methods; hmmIBD will need enhancement through parallelization and memory optimization, to enable scalability to larger data sets. Consequently, we recommend optimizing human-oriented IBD callers before their application in malaria parasite research and endorse the use of hmmIBD for quality-sensitive IBD-based inferences, for example, the estimation of Ne.

Comparing the performance of multiple IBD detection methods often requires a unified framework, which should include a uniform definition of accuracy and a simulated ground truth that mimics Pf genomes. Our benchmarking framework incorporates several novel features. First, it utilizes a consistent definition of IBD length-specific accuracy based on the overlap of IBD segment lengths, closely aligned with metrics used in hap-IBD [15], phased IBD [16], and Refined IBD [52], as detailed in our Methods section. The approach differs from the original evaluation of hmmIBD, which assesses the accuracy of IBD based on the fraction of SNPs that share IBD, which could overlook the precise accuracy of IBD length of the detected segments [8]. In particular, the original study of isoRelate defined the accuracy (true positive rate) using a less stringent overlap-by-count criterion where a segment is counted as accurate when at least 50% of a detected IBD segment is overlapped by true segments [6]. Second, we generated Pf -like genomes via population genetic simulation that reflect a realistic distribution of IBD segment lengths. This contrasts with previous studies, where methods such as hap-IBD and Refined IBD used human-like data from population genetic simulations [15, 52], whereas hmmIBD, isoRelate, and phased-ibd relied on simulations based on artificial recombination or pedigree models [6, 8]. These models (no-population-based genetic simulation) often produce long shared IBD segments typical of close relatives, failing to capture the IBD length distribution in population samples predominantly comprising distant relatives. Third, our benchmarking extends beyond the segment-level evaluation of IBD callers and includes downstream inferences of population structure and effective population size, providing a more thorough assessment of their application in real-world analyses. This comprehensive approach seeks to complement and broaden the scope of quality evaluations that are predominantly focused on the IBD segment level in original studies, thus improving our understanding of the performance of IBD detection methods in the context of Pf genomes.

The density of markers per genetic unit, as a measure of information enrichment for coancestry inference, plays a crucial role in determining whether and how a candidate IBD segment is accepted in the final IBD calls. IBS-based methods, such as hap-IBD and phased IBD, first identify long IBS segments ( 2 centimorgan) as candidate IBD segments, and subsequently merge short ones separated with small gaps [15], allow a certain number of discordant markers to account for phasing errors [16], or eliminate false positive segments by removing candidate segments supported by only a small number of markers [15]. Similarly, Refined IBD, which combines an IBS-based method with an HMM probabilistic model, uses a LOD score to decide whether a candidate IBD segment should be rejected or accepted [52]. In these studies, default values for threshold parameters related to marker number/density were shown to be effective for genomes like humans. Although the use of different parameter values has been considered for these algorithms to account for marker density differences between sequencing data versus microarray genotype data [15], per-genetic-unit marker density in human data, even for microarray genotype data, is still significantly higher than in Pf sequencing data. We evaluated how different levels of per-genetic-unit marker density affect the detection of IBD segments by varying recombination rates in simulations. With simulated Pf -like genomes of low marker density and high recombination rate, we found that human-oriented IBD callers suffer high false negative rates. One potential explanation is that the thresholds optimized for human data are too stringent for Pf causing excessive rejection of candidate segments. The effect of marker-density on IBD detection is further confirmed by our findings that adjusting the values of marker-density-related parameters using a grid-search approach could significantly reduce IBD error rates and generate more accurate IBD-based downstream estimates. Even though IBD accuracy can be improved by parameter optimization, we found that error rates of the detected IBD segments are still higher in Pf genomes than those of human genomes even after IBD caller parameters are optimized. There are several possible explanations for the high error rates of IBD segments detected from low marker density data: (1) Detected IBD segments can only start and end at the genotype marker site, bypassing any non-genotyped sites; (2) A lower marker density is linked with greater uncertainty in the distribution of IBD endpoints [11]; (3) Ancestral relationships including IBD are too difficult to be reliability inferred given limited mutational information [41, 42, 53, 54].

In our benchmark analysis, we used only common biallelic SNPs as markers for inferring IBD, excluding rare variants and indels. The use of this additional information can potentially provide denser genotype information, thus enhancing our understanding of the population’s ancestral relationships, a key aspect on which the inference of the IBD segment depends. For instance, large-scale whole genome sequencing studies reveal that rare variants account for the majority of all segregating sites [33, 55], which contain crucial information for deciphering recent evolutionary history. However, rare variants are typically not utilized for two main reasons: (1) rare genotypes are very sparsely distributed across many sites and are less informative per site [8, 11]; (2) rare genotype calls are more prone to genotyping or phasing errors. As a result, including rare variation cause reduced accuracy in detected IBD segments due to genotyping/phasing errors and may increase IBD inference time due to data sparsity. Our ongoing work aims to improve the computational efficiency of the hmmIBD algorithm by parallelization and develop a tabular encoding strategy for efficient rare allele-sharing analyses have the potential to mitigate the computational burden of rare variants. Although our simulation analysis indicates that including rare variants is of little effect or detrimental for IBD detetion (S1 Data), the findings could be skewed due to the absence of genotyping errors, the small sample size simulated, and limited number of cut-off values tested. Further investigation in contexts of large sample sizes and varying levels of genotype errors is needed to inform the usability of rare variants for different IBD detection algorithms. Indels are another significant source of underutilized genetic variations in the Pf genome, with abundance on par with that of biallelic SNPs in the MalariaGEN Pf 7 data set [34]. These indels are, in part, the result of microplasticity related to the high AT content (up to 90% in non-coding regions) in Pf genomes [28]. Using these variants could also increase the marker density to infer IBD segments. However, additional research is necessary to determine whether the inclusion of these variants can reduce uncertainty in the inference of IBD for Pf or introduce more bias due to challenges such as sequence-read mapping.

Although a substantial portion of this study concentrated on Pf, the main findings and methodologies may be relevant to high-recombining species beyond Pf. For instance, in regions with intermediate and low malaria transmission, the incidence of Pf has markedly decreased, allowing other species, such as Plasmodium vivax, to become predominant [1]. Clinical malaria caused by simian Plasmodium species, e.g., Plasmodium knowlesi, have also increased in some geographic areas where human Plasmodium species have declined [56]. We expect that the performance of IBD callers will be similar in other Plasmodium species, given their likely comparable high recombination rates [57, 58]; however, generalization would need further exploration as part of future work, considering variations in evolutionary histories and parasite biology [5961]. For other species that exhibit high recombination, but have markedly distinct evolutionary parameters from Plasmodium, the bench-marking framework established in this study can easily be tailored to prioritize and evaluate IBD detection methods in a context-specific manner.

While we have conducted numerous simulation analyses complemented by carefully designed validation studies, our work is subject to a few caveats: (1) Our simulations did not explicitly incorporate inbreeding within the complicated life cycle of the parasite [62], except for the increased inbreeding potential due to the reduced population size in the single population model. Inbreeding can be pervasive, especially in low transmission settings, leading to a change in the length distribution of IBD toward longer segments and a potentially reduced marker density. (2) Our optimization is based on simple accuracy metrics and only focuses on a subset of parameters, to allow faster iteration over different values. Investigating a larger parameter space, including genotyping error rate and higher-level accuracy metrics may generate different optimal values and further improve IBD-based downstream estimates. (3) We assume a constant recombination rate and mutation rate as static genomic/population parameters, rather than traits capable of evolving over time. If this assumption proves to be inaccurate, such as with recombination rates that vary between individuals and populations [63], a more complex benchmarking framework will be required.

In this study, we evaluated the performance of existing IBD segment detection methods in analyzing genomic data of malaria parasite Pf, which is characterized with high recombination rates and low marker densities. Our findings underscore that a high recombination rate, relative to the mutation rate, can compromise the accuracy of detected IBD segments when using methods originally calibrated for the human genome, characterized by a significantly lower recombination rate and a higher marker density (per genetic unit). The accuracy of IBD detection can be improved by parameter optimization via grid search techniques. We advocate for a context-specific evaluation of IBD detection methods when applying them to untested species. Specifically for Pf, our research indicates that hidden Markov model-based probabilistic methods, such as hmmIBD produce less biased IBD estimates leading to more accurate downstream inferences. This is especially important for analyses that heavily rely on the accuracy of detected IBD segments, such as Ne inference.

Methods

Simulation overview

We used population genetic simulations to allow the generation of (1) ground truth, including true IBD segments, true sites under positive selection, true trajectory of population size, and true sub-population assignments (population structure), and (2) inferred patterns, including IBD inferred from phased genotype data via different IBD callers and IBD-based downstream inferences of Ne, positive selection, and population structure. By comparing inferred patterns with ground truth, we calculated metrics at the IBD segment level and the IBD-based downstream estimate level (high level) for benchmarking and optimizing various IBD detection methods for Pf genomes. As described in our accompanying work [10], we combined the flexible forward simulator SLiM [64, 65] and the efficient coalescent simulator msprime [66] to simulate genomes similar to Pf reflecting the high recombination rate, strong positive selection, and population size shrinkage due to malaria reduction (S1 Fig). A detailed implementation of the simulations can be found in a dedicated GitHub repository (https://github.com/bguo068/bmibdcaller_simulations).

In these simulations, we assumed constant recombination rates over the genome, such as 6.67 × 107 per base pair per generation for Pf [23, 67] and 1.0 × 108 for human [25], and a mutation rate of 1.0 × 108 for both Pf [26, 27] and human unless otherwise specified. Parameters for modeling population size changes, population structure, and positive selection are detailed in the following section or our related publication [10].

Simulated demographic models

We used three different demographic models in the simulations, including the single-population model, the multiple-population model, and the UK European human population models [15].

Single- and multiple-population models have been described in our accompanying work [10]. The single population model mimics malaria reduction in settings like Southeast Asia, with a population size decreasing from 10,000 to 1,000 over the last 200 generations. This model was used to benchmark IBD detection methods at the IBD segment level and at the (high) level of downstream estimation, including selection signal detection and Ne estimation. The multiple-population model was mainly used to benchmark IBD calling methods via IBD network-based community detection. Implementation of the two models were provided a GitHub repository (see Code availability).

The UK human demographic model, similar to the one used in [15], simulates a population bottleneck event from a constant size of 10,000 to 3,000 that occurred 5,000 generations ago, followed by growth at rate of 1.4% and 25% per generation beginning 300 and 10 generations ago, respectively. We simulated 14 chromosomes with a size of 60 cM each for a smaller genome size to reduce simulation time. This demographic model serves as a control to detect IBD segments with human-oriented callers, which can help validate our IBD accuracy evaluation pipeline. We also used this model to test whether demographic models impact the performance of IBD callers by replacing the human recombination rate with that of Pf.

To separate the effects of positive selection from that of demographic models and recombination rates, we mostly simulated neutral genomes by setting the selection coefficient s to 0 for the above models, except in the case where we need to benchmark IBD callers for detecting positive selection signals.

Positive selection simulation

To evaluate the performance of IBD callers via IBD-based selection signal detection, we simulated positive selection within the single-population model with selection coefficients s to 0.2 with a single origin starting 80 generations ago. Fourteen chromosomes with a size of 100 cM were simulated independently, each with a selected site at 33.3 centimorgan from their left ends. For selection simulation, we conditioned on the establishment of the selective sweeps, that is, the allele under selection should not be lost at the present-day generation. If lost, the simulation was rerun for a maximum of 100 times until the selective sweep is established.

IBD calling and default parameters

We generated true IBD from simulated genealogical trees in the tree sequences using the tskibd algorithm [10]. Briefly, we sample local/marginal trees along a chromosome and tracked changes in the most recent common ancestor (MRCA) for each pair of sample nodes. If MRCA changes, the shared ancestral segment breaks. We report a shared ancestral segment as an IBD segment if it length exceeds a threshold, such as 2 centimorgans.

For inferred IBD, we used phased genotype data as input. Only biallelic sites with a minor allele frequency no less than 0.01 were included, unless otherwise stated. When needed, a genetic map is generated based on the constant recombination rate as specified in the corresponding simulations. For IBD detection methods designed for diploids, we converted each haploid to a pseudohomozygous diploid. The resulting IBD segments of each pair of pseudo-homozygous diploids (A1/A2 and B1/B2) will have redundant information due to the 100% runs of homozygosity. We only keep one pair A1-B1 and remove other combinations.

As each IBD detection method provides multiple tunable parameters, we detailed values used in S1 Table for both default and optimized scenarios. For the default scenarios, the parameters mostly follow the original documentation. Exceptions are parameters that need consistency across different IBD callers for benchmarking, including the minimum IBD length and the minimum minor allele frequency. The process of obtaining optimized parameter values is described below.

Benchmarking metrics at the IBD segment level

Several metrics were calculated to benchmark IBD methods at the IBD segment level or using their simple aggregates, including the false negative rate and false positive rate, pairwise total IBD and population-level total IBD per length bin.

False positive rate and the false negative rate were obtained following the definition used in the work of Zhou et al. [15]. Rates were first calculated per segment via segment-overlapping analysis; then, they were averaged over all segments of the same length bin. The false positive rate per segment is defined as the proportion of a given IBD segment of some genome pair from the inferred set (for example, IBD called via hmmIBD) that is not covered by any IBD segment from the truth set (generated by tskibd), for the same genome pair. Similarly, the false negative rate per segment is defined as the proportion of a given IBD segment of some genome pair from the truth set that is not covered by any IBD segment from the inferred set. The average false positive rate is calculated as the average per-segment false positive rates for all inferred IBD segments the length of which falls in a certain range (length bin); the average false negative rate is calculated as the average per-segment false positive rates for all true segments of a length bin. The following length bins were used: [3-4), [4-6), [6-10), [10-18), [18, inf) cM, similar to Zhou et al.’s method.

Genome-wide FP/FN rates per pair and their averages across all genome pairs were calculated to capture genome-wide bias. The per-pair genome-wide false positive rate is the ratio of two sums: (1) the numerator sum is the total length of parts of all inferred IBD segments of a certain genome pair that are not covered by any true IBD segments of the same genome pair; (2) the denominator sum is the total length of all inferred IBD segments of a certain genome pair. The per-pair genome-wide false negative rate is defined in a similar way as the percentage of pairwise total true IBD that are not overlapped by any inferred segments of the same pair. We then obtain the aggregate metrics by averaging these rates for all genome pairs.

Pairwise total IBD from truth versus inferred set was calculated as it’s a useful metric to estimate genetic relatedness and build IBD sharing networks. It was calculated as the sum of the lengths of all inferred or true IBD segments of each genome pair.

Given that the Ne estimator IBDNe internally utilizes quantities of population-level total IBD of different length bins, these quantities were calculated here to better examine IBD accuracy for Ne inference. We defined non-overlapping length bins of 0.05 centimorgan width that cover all possible lengths. For each length bin, a poplation total IBD was defined as a sum of length of all IBD segments with segment lengths falling into this bin from any genome pair.

To expedite IBD segment-level analysis and alleviate computational burdens, we developed an open-source tool, ishare/ibdutils (available at https://github.com/bguo068/ishare), which harnesses algorithms such as interval trees to efficiently calculate metrics like those described above.

IBD caller parameter optimization

We optimized key IBD caller-specific parameters by iterating each parameter over the list of discrete values, or two or more parameters over a grid of discrete values. Many optimized parameters were related to marker density for IBD callers hap-IBD, phased IBD, Refined IBD, and isoRelate. Other parameters were searched, if they potentially have a great impact on the quality of the detected IBD. The optimal values for explored parameters are determined by the length-bin-specific or genome-wide error rates (FN and FP) for detected IBD segments as defined above. The parameter value or combination of values that generate lower and generally balanced error rates were selected as optimal values. The parameters searched, the value lists explored and the optimal values selected are summarized in S1 Table and S2 Table. When the optimized values vary across different demographic models, we used the ones optimized for the single-population model for downstream analyses. We provided detailed heatmaps of error rates for all demographic models tested (see S1 Data).

Benchmarking via IBD-based downstream analysis

At a higher level, we benchmarked different IBD callers by comparing downstream estimates based on true IBD sets versus inferred IBD sets. These IBD-based estimates include positive selection scan, Ne estimates, and population structure inference via the community detection algorithm InfoMap.

We scanned positive selection signals using the IBD-based thresholding method followed by validation with integrated haplotype score-based statistics XiHS as previously described [10].

We inferred the trajectory of effective population size, Ne, for the last 100 generations using IBDNe. As this method uses IBD shared by diploid individuals as input, we converted each haploid genome to pseudo-homozygous diploid individuals. We infer the trajectory Ne using most of the default parameters except for setting the minregion parameter to 10 cM to allow the inclusion of short contigs in the analysis. The final estimates are scaled by 0.25 to compensate for the haploid-to-diploid conversion. By default, we used the value of 2 cM for the mincm parameter to only include IBD segments 2 cM for inferring Ne trajectory; as indicated in the Results section, we also set mincm to 4 to test whether excluding short IBD segments can improve the accuracy of Ne estimates. As the IBDNe algorithm does not work well when IBD segments shared by close relatives are included in the input, we followed the procedure described in the original work by Browning et al. [17]. For simulated data, we utilized the TRMCA information of the true IBD segments to filter the IBD segments before calling IBDNe. For true IBD segments (generated by tskibd), we excluded IBD segments with TMRCA < 1.5; for inferred IBD segments (called by hap-IBD, hmmIBD, isoRelate, Refined IBD, phased IBD), we removed (inferred) segments that overlap with any true IBD segment with TMRCA < 1.5 shared by the same genome pair. For empirical data where true IBD and TMRCA are not available, we pruned highly related isolates by iteratively removing the genome that has the highest number of close relatives defined by pairwise total IBD > 0.5 of genome size until no close relatives are present in the remaining subgroup [10].

For population structure inference, we first built the pairwise total IBD matrix, each element being the total IBD for a pair of genomes. The matrix was then squared and used as a weighted adjacency matrix to construct an IBD-sharing network. We then ran the InfoMap algorithm to infer community membership. Genomes assigned the same membership were inferred to be of the same subpopulation. For empirical data, we excluded IBD segments shorter than 4 cM when calculating the total IBD matrix to help reduce noise due to false positives and set each element with a value < 5 to zero in the unsquared IBD matrix to decrease the density of the IBD matrix.

Before all high-level benchmarking analyses, we pruned highly related samples as mentioned above. As IBD-based estimates can be biased by strong positive selection in empirical data, we removed high IBD sharing peaks with peak impact index > 0.01 as previously described [10] before any downstream analyses.

Processing empirical data sets

We constructed empirical data sets for validation using genotype data from whole genome sequencing samples from the MalariaGEN Pf 7 database [34]. We used malariagen_data 7.13 to download high-quality monoclonal samples that pass quality control. Monoclonal samples were determined by Fws > 0.95 (Fws table available from the malariaGEN website). Quality control labels were extracted directly from the metadata (provided in the malariagen_data package).

We then generated haploid genomes (phased genotype data) using the dominant allele from genotype calls. The dominant allele of each genotype call was determined by the per-sample allele depth (AD) fields. For each sample and site, the allele supported by 90% of total reads (total AD values) in a genotype call with at least 5 total reads was used as the dominant allele. Genotype calls without dominant alleles were marked as missing; those with dominant alleles were replaced with a phased genotype homozygous for the dominant allele. The genotype data were further filtered by sample missingness and SNP minor allele frequency and missingness. The resulting genotype data had per-SNP and per-sample missingness < 0.1 and minor allele frequency 0.01. Genotypes based on dominant alleles were further imputed without a panel using Beagle 5.1 [68]. These processing steps generated phased, imputed, pseudo-homozygous diploid genotype data, ready for IBD detection.

We constructed different data sets, including, two “single” population data sets and a “structured, multiple” population data set, by subsampling the above haploid genomes according to sampling time and location. For each data set, we set the time window to 2-3 years to reduce the sample time heterogeneity, and then shifted the window within all possible sampling years and chose one that maximized the sample size. For the “single” population data sets, we further restricted the sample locations to a relatively small geographic region, such as eastern Southeast Asia, as the data set was used for Ne estimation which assumes a homogeneous population. For the “multiple” population data set, we included samples from different continental or subcontinental regions using “Population” labels from the meta-information table provided with the MalariaGEN Pf 7 database. To make the sample size of each “population” more balanced, we set a maximum number of samples of 300. Populations with samples larger than 300 are subsampled to a size of 300; populations with a size smaller than 100 were not included in the multiple population data set. The details of the sampling location and time information were summarized in Supplementary Tables 3-5.

Details about the preparation and analysis of the empirical data sets can be found at (https://github.com/bguo068/bmibdcaller_empirical).

Measuring computational runtime and memory usage

The genomes were simulated with N0 = 1,000 and s = 0.0 under the single population model. The runtime and maximum memory usage were measured using the GNU time 1.7 utility. To allow a more appropriate comparison, we ensured: (1) the time used for data pre- and post-processing was excluded; (2) memory resource allocation was capped at 30 gigabytes per IBD call; (3) input genotype data included only common SNPs with minor allele frequency 0.01; (4) the minimum reported IBD segment length was set to 2.0 cM.

Code availability

Custom tools or scripts were provided in the following GitHub repositories: (1) bmibdcaller_simulations: a Nextflow pipeline to benchmark different IBD detection methods and optimize IBD caller-specific parameters by simulating Plasmodium falciparum-like genomes and using true IBD (https://github.com/bguo068/bmibdcaller_simulations). (2) bmibdcaller_empirical: a Nextflow pipeline to benchmark IBD callers with empirical data by comparing IBD-based estimates with expected patterns (https://github.com/bguo068/bmibdcaller_empirical). (3) ishare/ibdutils: a Rust crate and command line tools designed to facilitate the analysis of rare-variant sharing and identity-by-descent (IBD) sharing (used here mainly for fast IBD segment overlapping analysis) (https://github.com/bguo068/ishare, command line tool ibdutils).

Data availability

All empirical data used were publicly available from MalariaGEN Pf 7 [34].

Acknowledgements

This publication uses MalariaGEN data as described in “Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples” MalariaGEN et al, Wellcome Open Research 2023, 8:22 https://doi.org/10.12688/wellcomeopenres.18681.1. This work was supported by NIH 1R01AI145852 granted to ST-H and TDO by the U.S. National Institutes of Health.