Model-based spatial-temporal mapping of opisthorchiasis in endemic countries of Southeast Asia
Abstract
Opisthorchiasis is an overlooked danger to Southeast Asia. High-resolution disease risk maps are critical but have not been available for Southeast Asia. Georeferenced disease data and potential influencing factor data were collected through a systematic review of the literature and open-access databases, respectively. Bayesian spatial-temporal joint models were developed to analyze both point- and area-level disease data, within a logit regression combining potential influencing factors and spatial-temporal random effects. The model-based risk mapping identified areas of low, moderate, and high prevalence across the study region. Even though the overall population-adjusted estimated prevalence showed a downward trend, a total of 12.39 million (95% Bayesian credible interval [BCI]: 10.10–15.06) people were estimated to be infected with O. viverrini in 2018 in four major endemic countries (i.e., Thailand, Laos, Cambodia, and Vietnam), highlighting the public health importance of the disease in the study region. The high-resolution risk maps provide valuable information for spatial targeting of opisthorchiasis control interventions.
Introduction
Ending the epidemics of neglected tropical diseases (NTDs) by 2030, a target embodied in the sustainable development goals (SDGs) endorsed by the United Nations, empowers the efforts made by developing countries to combat the NTD epidemics (UN, 2015). To date, 20 diseases have been listed as NTDs, and opisthorchiasis falls under the umbrella of foodborne trematodiasis (Ogorodova et al., 2015). Two species causing opisthorchiasis are of public health significance, that is, Opisthorchis felineus (O. felineus), endemic in eastern Europe and Russia, and Opisthorchis viverrini (O. viverrini), endemic in Southeast Asian countries (Petney et al., 2013). The latter species is the focus of this article.
According to WHO’s conservative estimate, the overall disease burden due to opisthorchiasis was 188,346 disability-adjusted life years (DALYs) in 2010 (Havelaar et al., 2015). Fürst and colleagues estimated that more than 99% of the worldwide burden is attributable to O. viverrini infection in Southeast Asia (Fürst et al., 2012). Five countries in Southeast Asia (Cambodia, Lao PDR, Myanmar, Thailand, and Vietnam) are endemic for opisthorchiasis, with an estimated 67.3 million people at risk (Keiser and Utzinger, 2005). It is well documented that chronic and repeated infection with O. viverrini leads to the development of fatal bile duct cancer (cholangiocarcinoma) (International Agency for Research on Cancer, 1994).
The life cycle of O. viverrini involves freshwater snails of the genus Bithynia as the first intermediate host, and freshwater cyprinoid fish as the second intermediate host. Humans and other carnivores (e.g., cats and dogs), the final hosts, become infected by consuming raw or insufficiently cooked infected fish (Andrews et al., 2008; Saijuntha et al., 2014). Behavioral, environmental, and socioeconomic factors affect the transmission of O. viverrini (Grundy-Warr et al., 2012; Phimpraphai et al., 2017; Phimpraphai et al., 2018; Prueksapanich et al., 2018). Consumption of raw or insufficiently cooked fish is culturally rooted in endemic countries and shows a strong relationship with the occurrence of the disease (Andrews et al., 2008; Grundy-Warr et al., 2012). Poor hygienic conditions increase the risk of infection, especially in areas where the raw-fish-eating habit is practiced (Grundy-Warr et al., 2012). In addition, environmental and climatic factors, such as temperature, precipitation, and landscape, which affect either the snail/fish population or the growth of the parasites inside the intermediate hosts, can potentially influence the risk of human infection (Forrer et al., 2012; Suwannatrai et al., 2017). Important control strategies for O. viverrini infection include preventive chemotherapy, health education, environmental modification, improved sanitation, and comprehensive approaches that combine the above (Saijuntha et al., 2014). For purposes of public health control, WHO recommends implementing preventive chemotherapy once a year or once every 2 years depending on the level of prevalence in the population, with complementary interventions such as health education and improvement of sanitation (WHO, 2009).
Understanding the geographical distribution of O. viverrini infection risk at high spatial resolution is critical to prevent and control the disease cost-effectively in priority areas. Thailand conducted national surveys for O. viverrini prevalence in 1981, 1991, 2001, 2009, and 2014 (Echaubard et al., 2016; Suwannatrai et al., 2018), but the results of these surveys were presented at the province level, which is less informative for precisely targeting control interventions. Suwannatrai and colleagues, based on climatic and O. viverrini presence data, produced climatic suitability maps for O. viverrini in Thailand using the MaxEnt modeling approach (Suwannatrai et al., 2017). The maps brought insights for identifying areas with a high probability of O. viverrini occurrence; however, they did not provide direct information on the prevalence of O. viverrini in the population (Elith et al., 2011). A risk map of O. viverrini infection in Champasack province of Lao PDR was presented by Forrer and colleagues (Forrer et al., 2012). To our knowledge, high-resolution, model-based risk estimates of O. viverrini infection are unavailable for the whole endemic region of Southeast Asia.
Bayesian geostatistical modeling is one of the most rigorous inferential approaches for producing high-resolution maps depicting the distribution of disease risk (Karagiannis-Voules et al., 2015). Geostatistical modeling relates georeferenced disease data with potential influencing factors (e.g., socioeconomic and environmental factors) and estimates the infection risk in areas without observed data (Gelfand and Banerjee, 2017). Common geostatistical models are usually based on point-referenced survey data (Banerjee et al., 2014). In practice, disease data collected from various sources often consist of point-referenced and area-aggregated data. Bayesian geostatistical joint modeling approaches provide a flexible framework for the combined analysis of both kinds of data (Moraga et al., 2017; Smith et al., 2008). In this study, we aimed (1) to collect all available survey data on the prevalence of O. viverrini infection at point or area level in Southeast Asia through a systematic review; and (2) to estimate the spatial-temporal distribution of the disease risk at a high spatial resolution, by applying an advanced Bayesian geostatistical joint modeling approach.
Results
A total of 2690 references were identified through systematic review of the peer-reviewed literature, and 13 additional references were gathered from other sources. According to the inclusion and exclusion criteria, 168 records were included, resulting in a total of 580 ADM1-level surveys in 174 areas, 210 ADM2-level surveys in 142 areas, 53 ADM3-level surveys in 51 areas, and 251 point-level surveys at 207 locations in five endemic countries (i.e., Cambodia, Lao PDR, Myanmar, Thailand, and Vietnam) of Southeast Asia (Figure 1). Around 70% and 15% of surveys were conducted in Thailand and Lao PDR, respectively. Only two relevant records were obtained from Myanmar. To avoid large estimation errors, we did not include these data in the final geostatistical analysis. All surveys were conducted after 1970, with around 75% done after 1998. Most surveys (95%) were community based. Around 40% of surveys used the Kato–Katz technique for diagnosis, while another 42% did not specify the diagnostic approach. Mean prevalence calculated directly from survey data was 16.74% across the study region. A summary of survey data is listed in Table 1, and survey locations and observed prevalence in each period are shown in Figure 2. Area-level data cover all regions in Thailand and Lao PDR, and most regions in Cambodia and Vietnam, while point-referenced data are absent in most areas of Vietnam, the western part of Cambodia, and the southern part of Thailand. Around 70% of eligible publications received a quality score of 7 or higher, indicating overall good quality of the eligible literature in our study (Figure 2—figure supplement 1).
Seven variables were selected for the final model through the Bayesian variable selection process (Table 2). The infection risk in the community was 2.61 (95% BCI: 2.10–3.42) times that in school-aged children. Surveys using FECT (formalin-ethyl acetate concentration technique) as the diagnostic method showed a lower prevalence (OR 0.76, 95% BCI: 0.61–0.93) than those using the Kato–Katz method, while no significant difference was found between Kato–Katz and the other diagnostic methods. Human influence index (HII) and elevation were negatively correlated with the infection risk. Each unit increase in HII was associated with a 0.01 (95% BCI: 0.003–0.02) decrease in the logit of the prevalence, and each 1 m increase in elevation was associated with a 0.003 (95% BCI: 0.001–0.005) decrease in the logit of the prevalence. The spatial range was estimated as 83.55 km (95% BCI: 81.34–86.61), the spatial variance ${\sigma}_{\varphi}^{2}$ was 12.59 (95% BCI: 11.96–13.56), the variance of the beta-likelihood ${\sigma}_{\beta}^{2}$ was 0.15 (95% BCI: 0.14–0.15), and the temporal correlation coefficient $\rho$ was 0.66 (95% BCI: 0.65–0.67). Model validation showed that our model correctly estimated 79.61% of locations within the 95% BCI, indicating reasonable predictive accuracy. The ME, MAE, and MSE were 0.24%, 9.06%, and 2.38%, respectively, in the final model, while they were −7.14%, 16.67%, and 5.09%, respectively, in the model based only on point-referenced data, suggesting that the final model performed better than the model based only on point-referenced data. On the other hand, a Monte Carlo test for preferential sampling suggested that preferential sampling may exist for survey locations in one third (6/18) of the survey years (Figure 2—source data 2).
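To make the logit-scale coefficients above easier to read, the following minimal sketch converts a coefficient into an odds ratio and shows its effect on prevalence. Only the two coefficients (a 0.01 decrease per HII unit and a 0.003 decrease per metre of elevation) come from the text; the 20% baseline prevalence, the covariate shifts, and all function names are hypothetical choices for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def shifted_prevalence(baseline, coef, delta):
    """Prevalence after a covariate changes by `delta` units, all else fixed."""
    return inv_logit(logit(baseline) + coef * delta)

hii_coef = -0.01     # change in logit(prevalence) per HII unit (from the text)
elev_coef = -0.003   # change in logit(prevalence) per metre (from the text)
baseline = 0.20      # hypothetical baseline prevalence, for illustration only

or_hii = math.exp(hii_coef)                                # odds ratio per HII unit
p_hii10 = shifted_prevalence(baseline, hii_coef, 10)       # HII higher by 10 units
p_elev100 = shifted_prevalence(baseline, elev_coef, 100)   # elevation higher by 100 m

print(f"Odds ratio per HII unit: {or_hii:.3f}")
print(f"Prevalence at +10 HII units: {p_hii10:.4f}")
print(f"Prevalence at +100 m elevation: {p_elev100:.4f}")
```

Because the model is linear on the logit scale, the same coefficient implies a different absolute change in prevalence at different baselines, which is why the sketch works through the inverse logit rather than subtracting from the prevalence directly.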
The estimated risk maps of O. viverrini infection in selected years (i.e., 1978, 1983, 1988, 1993, 1998, 2003, 2008, 2013, and 2018) are presented in Figure 3. In 2018, high infection risk (prevalence >25%) was mainly estimated in the southern, central, and north-central parts of Lao PDR, some areas in the east-central parts of Cambodia, and some areas of the northeastern and northern parts of Thailand. The southern part of Thailand, the northern part of Lao PDR, and the western part of Cambodia showed low risk estimates (prevalence <5%) of O. viverrini infection. The central and several southern parts of Vietnam showed low to moderate risk of O. viverrini infection, while there was no evidence of O. viverrini in other parts of Vietnam. High estimation uncertainty was mainly present in the central part of Lao PDR, the northern and eastern parts of Thailand, and the central parts of Cambodia and Vietnam (Figure 4).
In addition, the infection risk varied over time across the study region (Figure 5). Areas of northern Thailand showed an increasing trend in the periods 1978–1988 and 1993–2003, while most areas of the country showed a considerable decrease in infection risk after 2008. The infection risk first increased and then decreased in the northern, central, and southern parts of Lao PDR and the central parts of Vietnam. The east-central and western parts of Cambodia showed an increasing trend in recent years.
The population-adjusted estimated prevalence over the study region showed a downward trend after 1995 (Figure 6 and Figure 6—figure supplements 1–9). At the country level, the estimated prevalence declined rapidly in Thailand after 1995 and decreased gradually in Cambodia. In Lao PDR, the overall prevalence remained quite stable before 1990, decreased slightly between 1990 and 1997, increased significantly after 1997, decreased from 2006, and became stable after 2011. The prevalence was stable in Vietnam during the whole study period. We estimated that in 2018, the overall population-adjusted estimated prevalence of O. viverrini infection in the whole study region was 6.57% (95% BCI: 5.35–7.99%), corresponding to 12.39 million (95% BCI: 10.10–15.06) infected individuals (Table 3). Lao PDR showed the highest prevalence (35.21%, 95% BCI: 28.50–40.70%), followed by Thailand (9.71%, 95% BCI: 7.98–12.17%), Cambodia (6.15%, 95% BCI: 2.41–11.73%), and Vietnam (2.15%, 95% BCI: 0.73–4.40%). Thailand had the largest number of individuals estimated to be infected with O. viverrini (6.71 million, 95% BCI: 5.51–8.41), followed by Lao PDR (2.45 million, 95% BCI: 1.98–2.83), Vietnam (2.07 million, 95% BCI: 0.70–4.24), and Cambodia (1.00 million, 95% BCI: 0.39–1.90).
Discussion
In this study, we produced model-based, high-resolution risk estimates of opisthorchiasis across endemic countries of Southeast Asia. The disease is the most important foodborne trematodiasis in the study region (Sripa et al., 2010), accounting for most of the disease burden of opisthorchiasis in the world (Fürst et al., 2012). The estimates were obtained by systematically reviewing all possible georeferenced survey data and applying a Bayesian geostatistical modeling approach that jointly analyzes point-referenced and area-aggregated disease data, as well as environmental and socioeconomic predictors. Our findings will be important for guiding cost-effective control and intervention and serve as a baseline for future progress assessment.
Our estimates suggested that there was an overall decrease of O. viverrini infection in Southeast Asia from 1995 onwards, which may be largely attributed to the decline of infection prevalence in Thailand. This decline was probably due to the national opisthorchiasis control program launched by the Ministry of Public Health of Thailand in 1987 (Jongsuksuntigul and Imsomboon, 2003; Jongsuksuntigul et al., 2003). Our high-resolution risk estimates for Thailand in 2018 showed a pattern similar to the climatic suitability map provided by Suwannatrai and colleagues (Suwannatrai et al., 2017). In our case, we estimated the prevalence in the population instead of the occurrence probability of the parasite, which arms decision makers with more direct epidemiological information for guiding control and intervention. The national surveys in Thailand reported a prevalence of 8.7% and 5.2% in 2009 and 2014, respectively (MOPH, 2014; Wongsaroj et al., 2014). However, we estimated higher prevalences of 12.44% (95% BCI: 10.79–14.26%) and 9.34% (95% BCI: 7.88–11.02%) in 2009 and 2014, respectively. Even though the national surveys covered most provinces in Thailand, their estimates were based on simply calculating the percentage of positive cases among all participants (Wongsaroj et al., 2014), and remote areas might not have been included (Maipanich et al., 2004). Instead, our estimates were based on rigorous Bayesian geostatistical modeling of available survey data with environmental and socioeconomic predictors, accounting for the heterogeneous distribution of infection risk and population density when aggregating country-level prevalence.
Our findings suggested that the overall prevalence of O. viverrini remained high (>20%) in Lao PDR during the study period, consistent with conclusions drawn by Suwannatrai et al., 2018. We estimated that a total of 2.45 (95% BCI: 1.98–2.83) million people living in Lao PDR were infected with O. viverrini, equivalent to that estimated by WHO in 2004 (WHO, 2002). Besides, our risk map for Champasack province shows a pattern similar to the risk map produced by Forrer and colleagues (Forrer et al., 2012). A national-scale survey in Cambodia during the period 2006–2011 reported an infection rate of 5.7% (Yong et al., 2014), lower than our estimate of 8.34% (95% BCI: 5.25–14.95%) in 2011. The former may underestimate the prevalence because more than 77% of participants were schoolchildren (Yong et al., 2014). Another large survey in five provinces of Cambodia suggested a large intra-district variation, which makes the identification of endemic areas difficult (Miyamoto et al., 2014). Our high-resolution estimates for Cambodia help to differentiate the intra-district risk. However, the estimates should be interpreted with caution because of large district-wide variances and a relatively small number of surveys. Indeed, O. viverrini infection was underreported in Cambodia (Khieu et al., 2019), and further point-referenced survey data are recommended for more conclusive results.
Although an overall low prevalence was estimated in Vietnam (2.15%, 95% BCI: 0.73–4.40%) in 2018, it corresponds to 2.07 million (95% BCI: 0.70–4.24 million) people infected, comparable to the number in Lao PDR, mainly because of the larger population of Vietnam. The risk mapping suggested moderate to high risk areas in central Vietnam, with a particularly high risk in Phu Yen province over many years. This agrees with previous studies that considered the province a ‘hotspot’ (Doanh and Nawa, 2016). Of note, even though there was no evidence of O. viverrini infection in the northern part of the country, Clonorchis sinensis, another important liver fluke species, is endemic in the region (Sithithaworn et al., 2012). We did not provide estimates for Myanmar, to avoid large estimation errors. Indeed, only two relevant papers were identified by our systematic review, of which one showed low to moderate prevalence in three regions of Lower Myanmar (Aung et al., 2017), and the other found low endemicity of O. viverrini infection in three districts of the capital city Yangon (Sohn et al., 2019). Nationwide epidemiological studies are urgently needed for a more comprehensive understanding of the disease in Myanmar.
We identified several important factors associated with O. viverrini infection in Southeast Asia, which may provide insights for the prevention and control of the disease. The infection risk was higher in the entire community than in schoolchildren, consistent with multiple studies (Aung et al., 2017; Forrer et al., 2012; Miyamoto et al., 2014; Van De, 2004; Wongsaroj et al., 2014). A negative association was found between O. viverrini infection and elevation, suggesting the disease was more likely to occur in low-altitude areas, which was consistent with a previous study (Wang et al., 2013). HII, a measure of direct human influence on ecosystems (Sanderson et al., 2002), showed a negative relationship with O. viverrini infection risk, indicating the disease was more likely to occur in areas with low levels of human activity, which were often remote and economically underdeveloped. The habit of eating raw or insufficiently cooked fish was more common in rural areas than in economically developed ones, which could partially explain our findings (Grundy-Warr et al., 2012; Keiser, 2019). Indeed, this culturally rooted habit is one of the determinants of human opisthorchiasis (Kaewpitoon et al., 2008; Ziegler et al., 2011). However, the precise geographical distribution of such information is unavailable and thus we could not use it as a covariate in this study.
Our estimate of the number of people infected with O. viverrini is higher than that of a previous study (12.39 million vs 8.6 million [Qian and Zhou, 2019]), emphasizing the public health importance of this neglected disease in Southeast Asia and suggesting that more effective control interventions should be conducted, particularly in the high-risk areas. The successful intervention experience of Thailand may be a useful reference for other endemic countries of the region. The national opisthorchiasis control program, supported by the government of Thailand, applied interrelated approaches, including stool examination and treatment of positive cases, health education aimed at promoting cooked fish consumption, and environmental sanitation to improve hygienic defecation (Jongsuksuntigul and Imsomboon, 2003). In addition, for areas where reducing infection risk is difficult, a new strategy was developed by Sripa and colleagues, using the EcoHealth approach with anthelminthic treatment, novel intensive health education in both communities and schools, ecosystem monitoring, and active participation of the community (Sripa et al., 2015). This ‘Lawa model’ showed good effectiveness in the Lawa Lake area, where the liver fluke was highly endemic (Sripa et al., 2015). Furthermore, common integrated control interventions (e.g., combination of preventive chemotherapy with praziquantel, improvement of sanitation and water sources, and health education) are applicable not only to opisthorchiasis but also to other NTDs, such as soil-transmitted helminth infection and schistosomiasis, which are also prevalent in the study region (Dunn et al., 2016; Gordon et al., 2019). Implementation of such interventions in co-endemic areas could be cost-effective (Linehan et al., 2011; WHO, 2012).
Several limitations exist in our study. We collected data from different sources, so survey locations might not be random and preferential sampling may exist. We performed a risk-preferential sampling test, and the results showed that preferential sampling might exist for survey locations in one third (6/18) of the survey years (Figure 2—source data 2). The corresponding impacts might include an improper variogram estimator, biased parameter estimation, and unreliable exposure surface estimates (Diggle et al., 2010; Pati et al., 2011; Gelfand et al., 2012). To avoid a more complex model, we did not account for preferential sampling in our final model, as the model validation showed reasonable predictive accuracy. However, readers should be aware of this limitation.
We set clear criteria for selection of all possible qualified surveys and did not exclude surveys that reported prevalence in intervals without exact observed values. Sensitivity analysis showed that using the midpoint values of the intervals had little effect on the final results (Figure 3—figure supplement 1). For surveys across a large area, complex designs, such as random sampling from subgroups of the population under a well-designed scheme, are likely to be adopted, as it is impractical to draw simple random samples from the whole area. In such cases, respondents may have unequal probabilities of being selected, so weighting should be used to generalize results to the entire area. The observed disease data we collected were from surveys either at point level (i.e., community or school) or aggregated over areas. For point-level data, as study areas were quite small, a simple sampling design was mostly used in the corresponding surveys. For area-level data, particularly those aggregated across ADM1, complex designs were likely applied. However, most of the corresponding surveys reported only raw prevalence, or prevalence without clarifying whether weighting was applied. Thus, we did not have enough information to address the design effect for each individual survey included. On the other hand, as population density differed across the study region, we calculated the estimated country- and provincial-level prevalence by averaging the estimated pixel-level prevalence weighted by population density. In this way, we took into account the diversity of population density across areas in the regional summaries of the estimates.
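The population-weighted aggregation described above can be sketched as follows. This is an illustrative example rather than the authors' implementation: the pixel values are invented, the function name is hypothetical, and it assumes equal-area pixels so that weighting by population count is equivalent to weighting by population density.

```python
def population_weighted_prevalence(prevalence, population):
    """Average pixel-level prevalence, weighting each pixel by its population."""
    if len(prevalence) != len(population):
        raise ValueError("prevalence and population must have the same length")
    total = sum(population)
    if total <= 0:
        raise ValueError("total population must be positive")
    return sum(p * n for p, n in zip(prevalence, population)) / total

# Hypothetical pixels: a sparsely populated high-risk pixel and two
# more populous lower-risk pixels (values invented for illustration).
pixel_prev = [0.30, 0.10, 0.02]
pixel_pop = [1000, 5000, 20000]

regional_prev = population_weighted_prevalence(pixel_prev, pixel_pop)
unweighted_prev = sum(pixel_prev) / len(pixel_prev)
infected = regional_prev * sum(pixel_pop)

print(f"Population-weighted prevalence: {regional_prev:.4f}")
print(f"Unweighted mean prevalence:     {unweighted_prev:.4f}")
print(f"Estimated number infected:      {infected:.0f}")
```

In this toy case the unweighted mean is pulled up by the sparsely populated high-risk pixel, whereas the weighted figure reflects where people actually live, which is the quantity needed to convert prevalence into a number of infected individuals.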
We assumed similar proportions of age and gender in different surveys, as most of them only reported prevalence aggregated by age and gender. Nevertheless, considering the possible differences in infection risk between the whole population and schoolchildren, we categorized survey types as community-based or school-based. Furthermore, our analysis was based on survey data obtained with different diagnostic methods. The sensitivity and specificity of the same diagnostic method may differ across studies (Charoensuk et al., 2019; Laoprom et al., 2016; Sayasone et al., 2015), while different diagnostic methods may give different results in the same survey. To partially account for the diversity of diagnostic methods, we assumed that the same diagnostic method has similar sensitivity and specificity, and we included the type of diagnostic method as a covariate in the model. Results showed that the odds of infection with the FECT method were significantly lower than with Kato–Katz, which was consistent with results found by Lovis et al., 2009. In addition, most of the diagnostic methods in the surveys were based on fecal microscopy of eggs, which cannot effectively distinguish between O. viverrini and minute intestinal flukes of the family Heterophyidae (e.g., heterophyid and lecithodendriid) (Charoensuk et al., 2019; Sato et al., 2010). Thus, our results may overestimate the O. viverrini infection risk in areas where heterophyid and lecithodendriid are endemic, such as Phongsaly, Saravane, and Champasak provinces in Lao PDR (Sato et al., 2010; Chai et al., 2010; Chai et al., 2013), Nan and Lampang provinces in Thailand (Wijit et al., 2013), and Takeo province in Cambodia (Sohn et al., 2011). There is an urgent need for the application of more powerful diagnostic practices with higher sensitivity and specificity, such as PCR, to better detect the true O. viverrini prevalence (Lovis et al., 2009; Lu et al., 2017; Sato et al., 2010).
Nevertheless, because the treatment and prevention strategies for O. viverrini and minute intestinal flukes are similar (Keiser and Utzinger, 2010), our risk mapping is also valuable for areas co-endemic with the above flukes.
In conclusion, this study contributes to a better understanding of the spatial-temporal characteristics of O. viverrini infection in major endemic countries of Southeast Asia, providing valuable information for guiding control and intervention, and serving as a baseline for future progress assessment. Estimates were based on a rigorous geostatistical framework jointly analyzing point- and area-level survey data with potential predictors. The higher number of infected people we estimated highlights the public health importance of this neglected disease in the study region. More comprehensive epidemiological studies are urgently needed for endemic areas with scant survey data.
Materials and methods
Search strategy, selection criteria, and data extraction
We collected relevant publications reporting prevalence data of opisthorchiasis in Southeast Asia through a systematic review (registered in the International Prospective Register of Systematic Reviews, PROSPERO, No. CRD42019136281), and reported our systematic review according to the PRISMA guidelines (Supplementary file 1A; Moher et al., 2010). We searched PubMed and ISI Web of Science from inception to February 9, 2020, with search terms: (liver fluke* OR Opisthorchi*) AND (Southeast Asia OR Indonesia OR (Myanmar OR Burma) OR Thailand OR Vietnam OR Malaysia OR Philippines OR Lao PDR OR Cambodia OR Timor OR Brunei OR Singapore). We set no limitations on language, date of survey, or study design in our search strategy. To capture literature not found by the above methods, we also reviewed reports from governments or Ministries of Health, theses, relevant books, and documents.
We followed a protocol (Figure 1—figure supplement 1) for inclusion, exclusion, and extraction of survey data. First, we screened titles and abstracts to identify potentially relevant articles. Publications on in vitro studies, or without human or disease studies, were excluded. Quality control was undertaken by rechecking 20% of randomly selected irrelevant papers. Second, full-text review was applied to potentially relevant articles. We excluded publications meeting any of the following conditions: absence of prevalence data; studies done in specific patient groups (e.g., prevalence in patients with specific diseases), in specific population groups (e.g., travelers, military personnel, expatriates, nomads, displaced or migrating populations), or under specific study designs (e.g., case report studies, case–control studies, clinical trials, autopsy studies); drug efficacy or intervention studies (except for baseline data or control groups); population deworming within 1 year; a survey time interval of more than 10 years; and data based only on the direct smear method (because of low sensitivity) or on serum diagnostics (because they cannot distinguish past from active infection). During the full-text review, potentially relevant cited references of the articles were also screened. Studies were included if they reported survey data at provincial level and below, such as administrative divisions of level 1 (ADM1: province, state, etc.), 2 (ADM2: city, etc.), and 3 (ADM3: county, etc.), and at point level (village, town, school, etc.). Duplicates were checked and removed. The quality assessment of each individual record included in the final geostatistical analysis was performed by two independent reviewers, based on a nine-point quality evaluation checklist (Figure 2—figure supplement 1—source data 1).
We followed the GATHER checklist (Supplementary file 1B; Stevens et al., 2016) for the data extraction. Detailed information on records was extracted into a database, which includes literature information (e.g., journal, authors, publication date, title, volume, and issue), survey information (e.g., survey type: community- or school-based, and year of survey), location information (e.g., location name, location type, and coordinates), and disease-related data (e.g., species of parasites, diagnostic method, population age, number of examined, number of positive, and percentage of positive). The coordinates of the survey locations were obtained from Google Maps (https://www.google.com/maps/). For surveys that reported prevalence in intervals without exact observed values, the midpoints of the intervals were assigned.
Environmental, socioeconomic, and demographic data
The environmental data (i.e., annual precipitation, distance to the nearest open water bodies, elevation, land cover, land surface temperature [LST] in the daytime and at night, and normalized difference vegetation index [NDVI]), socioeconomic data (i.e., human influence index, survey type, and travel time to the nearest big city), and demographic data of Southeast Asia were downloaded from open data sources (Figure 7—source data 1). Land cover data were summarized by the most frequent category within each pixel over the period 2001–2018. We combined similar land cover classes and regrouped them into five categories: (i) croplands; (ii) forests; (iii) shrub and grass; (iv) urban; and (v) others. LST in the daytime and at night, as well as NDVI, were averaged over the period 2000–2018. All data were aligned over a 5 × 5 km grid across the study region (Figure 7). Data at point-referenced survey locations were extracted. For divisions (i.e., ADM1, ADM2, or ADM3) with an aggregated outcome of interest (i.e., infection prevalence), we linked the data by averaging them within the corresponding divisions. The above data processing was done using the package ‘raster’ (https://cran.r-project.org/web/packages/raster) in R (version 3.5.0).
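Two of the processing steps above, taking the most frequent land cover category per pixel and averaging a covariate within each administrative division, can be sketched in a few lines. The study itself used the R package named above; the Python below is only an illustration of the logic, with hypothetical function names, invented values, and an assumed tie-breaking rule (the first category encountered wins) that the text does not specify.

```python
from collections import Counter
from statistics import mean

def modal_category(yearly_classes):
    """Most frequent land cover class across years for one pixel.

    Ties go to the class seen first (an assumption, not from the text).
    """
    return Counter(yearly_classes).most_common(1)[0][0]

def area_mean(pixel_values, pixel_area_ids):
    """Mean covariate value of the pixels falling in each division."""
    groups = {}
    for value, area_id in zip(pixel_values, pixel_area_ids):
        groups.setdefault(area_id, []).append(value)
    return {area_id: mean(values) for area_id, values in groups.items()}

# Hypothetical yearly land cover labels for a single pixel.
pixel_years = ["croplands", "forests", "croplands", "croplands", "urban"]
dominant = modal_category(pixel_years)

# Hypothetical elevation values (m) for three pixels in two divisions.
elevation = [100.0, 200.0, 50.0]
division = ["A", "A", "B"]
elevation_by_division = area_mean(elevation, division)

print(dominant)
print(elevation_by_division)
```

A real workflow would operate on raster layers and polygon boundaries rather than Python lists, but the per-pixel mode and the per-division mean are the same operations.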
Model fitting and variable selection
As our outcome of interest was derived from both point-referenced and area-aggregated surveys, a bivariate Bayesian geostatistical joint modeling approach was applied to analyze the area-level and point-level survey data together (Moraga et al., 2017; Utazi et al., 2019), and to account for both disease data reporting numbers of examined and positive, and those reporting only prevalence.
We defined $p_{it}$ as the probability of infection at location $i$ and time period $t$, where $i$ indexes either the location for point-referenced data or the area for area-level data. For data reported with the numbers examined and positive, we assumed that the number of positives $Y_{it}$ followed a binomial distribution $Y_{it}\sim Bin(p_{it},N_{it})$, where $N_{it}$ denotes the number examined; for data reported only with the observed prevalence, we assumed that the observed prevalence $ob_{it}$ followed a beta distribution $ob_{it}\sim Be(p_{it},\sigma_{\beta}^{2})$. The study period was from 1978 to 2018. We modeled predictors on the logit scale of $p_{it}$.
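The two observation models can be sketched in a few lines of Python. This is only an illustration, not the authors' R code: the article does not spell out how $\sigma_{\beta}^{2}$ enters the beta distribution, so the sketch assumes a mean and precision parameterisation with shape parameters $p\phi$ and $(1-p)\phi$. The precision `phi`, the function names, and the example numbers are all hypothetical.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_loglik(y, n, p):
    """Log-likelihood of y positives among n examined with infection probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(p) + (n - y) * math.log(1.0 - p)

def beta_loglik(ob, p, phi):
    """Log-density of an observed prevalence ob under Beta(p*phi, (1-p)*phi).

    The mean/precision form is an assumption made for this sketch.
    """
    a, b = p * phi, (1.0 - p) * phi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(ob) + (b - 1.0) * math.log(1.0 - ob)

# Hypothetical linear predictor on the logit scale
eta = -1.0
p = inv_logit(eta)

# A survey reporting counts and a survey reporting only prevalence
ll_counts = binomial_loglik(y=27, n=100, p=p)
ll_prevalence = beta_loglik(ob=0.27, p=p, phi=50.0)
```

Both likelihoods share the same $p_{it}$, which is what lets count-based and prevalence-only surveys inform one latent risk surface.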
We followed the method proposed by Cameletti and colleagues (Krainski, 2019; Cameletti et al., 2013) to build a spatial-temporal model combined with covariates, defined as an SPDE (stochastic partial differential equation) model for the spatial domain and an AR1 model for the time dimension. A standard grid of 5 × 5 km² was overlaid on each survey area, resulting in a certain number of pixels representing the area. We assumed that survey locations and pixels within survey areas shared the same spatial-temporal process. In addition, we assumed that the infection risk was the same within a 1-year period for the same area, so that different observations from the same year in the same area can be treated as realizations of the random spatial-temporal process. Let $i=1,\dots,n_{A},n_{A}+1,\dots,n_{A}+n_{p}$, where $n_{A}$ is the total number of areas for area-level surveys and $n_{p}$ is the total number of locations for point-referenced surveys. For area-level data, $\mathrm{logit}(p_{it})=\beta_{0}+\tilde{\mathbf{x}}_{it}'\boldsymbol{\beta}+|A_{i}|^{-1}\int_{A_{i}}\omega(s,t)\,ds\,dt$, where $i=1,\dots,n_{A}$; $\tilde{\mathbf{x}}_{it}$ is the vector of covariate values for the $i$th area in time period $t$, with $\tilde{\mathbf{x}}_{it}=|A_{i}|^{-1}\int_{A_{i}}\mathbf{x}(s,t)\,ds\,dt$; and $\beta_{0}$ and $\boldsymbol{\beta}$ are the intercept and the corresponding regression coefficients. $|A_{i}|=\int_{A_{i}}1\,ds$ is the size of the $i$th area and $\omega(s,t)$ denotes the spatial-temporal random effects of pixels within the area.
For point-referenced data, $\mathrm{logit}(p_{it})=\beta_{0}+\mathbf{x}_{it}'\boldsymbol{\beta}+\omega(s_{i},t)$, where $i=n_{A}+1,\dots,n_{A}+n_{p}$, $\mathbf{x}_{it}$ is the vector of covariate values, and $\omega(s_{i},t)$ is the spatial-temporal random effect for the $i$th location in time period $t$. To decrease the computational burden, under the SPDE framework, we built the Gaussian Markov random field (GMRF) on regular temporal knots, that is, $\omega=(\omega_{t=1978},\omega_{t=1983},\omega_{t=1988},\omega_{t=1993},\omega_{t=1998},\omega_{t=2003},\omega_{t=2008},\omega_{t=2013},\omega_{t=2018})'$ (Cameletti et al., 2013; Krainski, 2019). We assumed that the spatiotemporal random effect $\omega(s,t)$ followed a zero-mean Gaussian process, that is, $\omega\sim GP(0,\mathbf{K}_{space}\otimes\mathbf{K}_{time})$, where the spatial covariance matrix $\mathbf{K}_{space}$ was defined by a stationary Matérn covariance function $\sigma_{\varphi}^{2}(\kappa\mathbf{D})^{\nu}K_{\nu}(\kappa\mathbf{D})/(\Gamma(\nu)2^{\nu-1})$ and the temporal covariance matrix by $\mathbf{K}_{time}=\rho^{|t_{u}-t_{o}|}$ with $|\rho|<1$, corresponding to a first-order autoregressive (AR1) stochastic process. The spatiotemporal random effects were assumed to be independent of each other at different times and locations, that is, $\mathrm{Cov}(\omega_{it},\omega_{jt'})=\begin{cases}0, & \text{if } t\ne t'\\ \sigma_{\varphi}^{2}, & \text{if } t=t'\end{cases}$.
Here $\mathbf{D}$ denotes the Euclidean distance matrix, $\kappa$ is a scaling parameter, the range $r=\sqrt{8\nu}/\kappa$ represents the distance at which spatial correlation becomes negligible (<0.1), and $K_{\nu}$ is the modified Bessel function of the second kind, with the smoothness parameter $\nu$ fixed at 1. The latent fields corresponding to other years are approximated by projection of $\omega$ using B-spline basis functions of degree two, that is, $B_{i,1}(t)=\begin{cases}1, & t_{i}\le t<t_{i+1}\\ 0, & \text{otherwise}\end{cases}$ and $B_{i,m}(t)=\frac{t-t_{i}}{t_{i+m-1}-t_{i}}B_{i,m-1}(t)+\frac{t_{i+m}-t}{t_{i+m}-t_{i+1}}B_{i+1,m-1}(t)$, where the recursion in $m$ is carried up to degree two (Krainski, 2019; Cameletti et al., 2013).
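The temporal projection can be illustrated with a small Python sketch of the B-spline recursion. It is not the authors' implementation (they used R-INLA). Read literally, the recursion starts from a piecewise-constant $B_{i,1}$, so a degree-two basis is reached at $m=3$; that reading is an assumption. The 5-year knot vector padded by two knots on each side and the helper names are also assumptions made so that the example is self-contained.

```python
def bspline_basis(i, m, t, knots):
    """Cox-de Boor recursion for B_{i,m}(t): order m (degree m - 1), 0-based i."""
    if m == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + m - 1] - knots[i]
    right_den = knots[i + m] - knots[i + 1]
    # Terms with a zero denominator are dropped (usual 0/0 = 0 convention)
    left = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * bspline_basis(i, m - 1, t, knots)
    right = 0.0
    if right_den != 0:
        right = (knots[i + m] - t) / right_den * bspline_basis(i + 1, m - 1, t, knots)
    return left + right

# Hypothetical knot vector: 5-year spacing, padded beyond 1978 and 2018
knots = list(range(1968, 2029, 5))
order = 3                      # degree two under the reading stated above
n_basis = len(knots) - order

def temporal_weights(year):
    """Weights that project the knot-based fields onto a given year."""
    return [bspline_basis(i, order, year, knots) for i in range(n_basis)]

weights_2000 = temporal_weights(2000)
```

For any year inside the study period exactly three weights are non-zero and they sum to one, so the field for an intermediate year is a smooth weighted average of neighbouring knot fields.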
We formulated the model in a Bayesian framework. Minimally informative priors were specified for parameters and hyperparameters as follows: $\boldsymbol{\beta}\sim N(0,10^{5}\mathbf{I})$, $\log(1/\sigma_{\beta}^{2})\sim\mathrm{logGamma}(1,0.1)$, $\log(1/\sigma_{\varphi}^{2})\sim\mathrm{logGamma}(1,0.01)$, $\log((1+\rho)/(1-\rho))\sim N(0,0.15)$, and $\log(\kappa)\sim N(\log(\sqrt{8}/d),1)$, where $d$ is the median distance between the predicted grids.
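The priors on $\rho$ and $\kappa$ are placed on transformed scales, which the following Python sketch makes explicit. It only shows the transformations and how the prior for $\log(\kappa)$ is centred. The function names and the value of `d` are hypothetical, and the sketch makes no claim about whether the second argument of $N(\cdot,\cdot)$ is a variance or a precision.

```python
import math

def internal_from_rho(rho):
    """theta = log((1 + rho) / (1 - rho)), defined for |rho| < 1."""
    return math.log((1.0 + rho) / (1.0 - rho))

def rho_from_internal(theta):
    """Inverse transform; tanh(theta / 2) always lies in (-1, 1)."""
    return math.tanh(theta / 2.0)

def kappa_prior_centre(d, nu=1.0):
    """Value of kappa at the centre of the prior on log(kappa): sqrt(8 * nu) / d."""
    return math.sqrt(8.0 * nu) / d

def spatial_range(kappa, nu=1.0):
    """Range r = sqrt(8 * nu) / kappa from the Matern parameterisation."""
    return math.sqrt(8.0 * nu) / kappa

d = 250.0                      # hypothetical median distance between grid cells (km)
kappa0 = kappa_prior_centre(d)
range0 = spatial_range(kappa0) # equals d by construction
```

Centring $\log(\kappa)$ at $\log(\sqrt{8}/d)$ therefore amounts to centring the spatial range at $d$ when $\nu=1$.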
Additionally, we applied a variable selection procedure to identify the best set of predictors for a parsimonious model. First, the best functional form (continuous or categorical) of each continuous variable was selected by fitting univariate Bayesian spatial-temporal models with either form as the independent variable and selecting the form with the lowest log score (Pettit, 1990). Second, the best subset method was used to identify the best combination of predictors for the final model. According to previous studies (Aung et al., 2017; Forrer et al., 2012; Miyamoto et al., 2014; Wongsaroj et al., 2014), the infection risk in communities and schools may differ, and different diagnostic methods may yield different observed prevalence (Charoensuk et al., 2019; Laoprom et al., 2016; Sayasone et al., 2015). Thus, the survey type (i.e., community- or school-based) and the diagnostic method (i.e., Kato–Katz, FECT, or other methods) were kept in all candidate models, while the other 10 environmental and socioeconomic variables entered the Bayesian variable selection process. The model with the minimum log score was chosen as the final model.
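The best subset search can be sketched as follows. The sketch assumes that the log score is the mean negative log conditional predictive ordinate (CPO), a common definition, and that each candidate model is refitted by some routine returning CPO values. Here `toy_fit_cpo` is a hypothetical stand-in for that refit, and the variable names are illustrative only.

```python
import math
from itertools import combinations

def log_score(cpo):
    """Mean negative log CPO; smaller values indicate better prediction."""
    return -sum(math.log(c) for c in cpo) / len(cpo)

def best_subset(candidates, forced, fit_cpo):
    """Try every subset of candidates (forced terms always kept) and
    return (score, terms) for the subset with the smallest log score."""
    best = None
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            terms = tuple(forced) + subset
            score = log_score(fit_cpo(terms))
            if best is None or score < best[0]:
                best = (score, terms)
    return best

def toy_fit_cpo(terms):
    """Hypothetical stand-in for refitting the model: pretends that two
    covariates help prediction and every other extra covariate hurts slightly."""
    useful = {"elevation", "ndvi"}
    forced = {"survey_type", "diagnostic_method"}
    gain = 0.05 * len(useful.intersection(terms))
    penalty = 0.01 * len(set(terms) - useful - forced)
    return [0.30 + gain - penalty] * 20

forced_terms = ["survey_type", "diagnostic_method"]
candidate_terms = ["precipitation", "elevation", "ndvi", "travel_time"]
best_score, best_terms = best_subset(candidate_terms, forced_terms, toy_fit_cpo)
```

With 10 candidate variables the same loop would visit $2^{10}=1024$ models, which is why a fast approximate inference method matters for this step.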
Model fitting and variable selection were conducted through the INLA-SPDE approach (Lindgren et al., 2011; Rue et al., 2009), using the INLA package in R (version 3.5.0). The risk of O. viverrini infection in each year of the study period was estimated over a grid with a cell size of 5 × 5 km². The relative change in prevalence was calculated as $(pp_{st_{j}}-pp_{st_{i}})/pp_{st_{i}}$ for pixel $s$ between the former year $t_{i}$ and the later year $t_{j}$, where $pp$ indicates the median of the posterior estimated distribution of infection risk. The corresponding risk maps and prevalence change maps were produced using ArcGIS (version 10.2). In addition, as population density differed across the study region, the population-adjusted estimated prevalence and the number of infected individuals in 2018 were calculated at the country and provincial levels by averaging the estimated pixel-level prevalence weighted by population density, that is, $\widehat{pp}_{A}=\sum_{i\in A}\widehat{pp}_{i}w_{i}/\sum_{i\in A}w_{i}$. Here $\widehat{pp}_{A}$, $\widehat{pp}_{i}$, and $w_{i}$ are the estimated prevalence in area $A$, the estimated prevalence at pixel $i$, and the population density at pixel $i$, respectively, where $i$ belongs to area $A$. Based on previous studies, for the provinces in Vietnam where there was no evidence of O. viverrini infection, we multiplied the estimated results by zero to obtain the final estimated prevalence (Doanh and Nawa, 2016). The R code used for model fitting is publicly available on GitHub (https://github.com/SYSUOpisthorchiasis/Spatialtemporalmappingofopisthorchiasis) and archived in Software Heritage (Zhao, 2021; copy archived at swh:1:rev:6493df4ba60c1f2f1aaaad979174a3a5d928627a).
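The two summary formulas in this paragraph translate directly into code. The short Python sketch below uses made-up pixel values; the function names are hypothetical and the authors' own implementation is in R.

```python
def relative_change(pp_former, pp_later):
    """(pp_later - pp_former) / pp_former for one pixel; negative means a decrease."""
    return (pp_later - pp_former) / pp_former

def population_adjusted_prevalence(prevalence, pop_density):
    """Population-density-weighted mean of pixel-level prevalence within an area."""
    total_weight = sum(pop_density)
    weighted_sum = sum(p * w for p, w in zip(prevalence, pop_density))
    return weighted_sum / total_weight

# Made-up values for three pixels belonging to one area
pixel_prevalence = [0.10, 0.30, 0.05]
pixel_density = [100.0, 300.0, 600.0]

area_prevalence = population_adjusted_prevalence(pixel_prevalence, pixel_density)
change = relative_change(pp_former=0.20, pp_later=0.15)
```

Weighting by population density keeps sparsely populated high-risk pixels from dominating the area-level estimate.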
Model validation, sensitivity analysis, and test of preferential sampling
Model validation was conducted using a 5-fold out-of-sample cross-validation approach. The mean error ($\mathrm{ME}=\frac{1}{N}\sum(ob_{it}-pp_{it})$), mean absolute error ($\mathrm{MAE}=\frac{1}{N}\sum|ob_{it}-pp_{it}|$), mean square error ($\mathrm{MSE}=\frac{1}{N}\sum(ob_{it}-pp_{it})^{2}$), and the coverage rate of observations within the 95% BCI were calculated to evaluate the performance of the model. Furthermore, a Bayesian geostatistical model based only on point-referenced data was fitted and validated, to compare its performance with that of our joint modeling approach. In addition, a sensitivity analysis was conducted to evaluate the effect of using the midpoints of the intervals as the observed prevalence for one study by Suwannatrai and colleagues (Suwannatrai et al., 2018), which reported the observed prevalence of O. viverrini infection in intervals. The sensitivity analysis used the lower and the upper limits of the intervals in the modeling analysis.
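The validation metrics can be computed as in the Python sketch below. It assumes that held-out observations, posterior median predictions, and 95% BCI limits are already available as plain lists; the random fold assignment and all names are illustrative assumptions rather than the authors' procedure.

```python
import random

def assign_folds(n_obs, n_folds=5, seed=0):
    """Randomly assign each observation to one of n_folds folds of near-equal size."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    folds = [0] * n_obs
    for position, index in enumerate(indices):
        folds[index] = position % n_folds
    return folds

def validation_metrics(observed, predicted, lower, upper):
    """ME, MAE, MSE and the share of observations inside the 95% BCI."""
    n = len(observed)
    errors = [ob - pp for ob, pp in zip(observed, predicted)]
    inside = sum(1 for ob, lo, hi in zip(observed, lower, upper) if lo <= ob <= hi)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "coverage": inside / n,
    }

# Made-up held-out prevalences, posterior medians and interval limits
metrics = validation_metrics(
    observed=[0.10, 0.20, 0.40, 0.30],
    predicted=[0.15, 0.20, 0.30, 0.35],
    lower=[0.05, 0.10, 0.20, 0.31],
    upper=[0.20, 0.30, 0.35, 0.50],
)
folds = assign_folds(n_obs=23)
```

An ME near zero with a larger MAE, as in this toy example, indicates errors that cancel on average rather than a model that predicts each survey closely.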
Because the data in this study were sourced from different studies, preferential sampling may exist, and we therefore tested for it. To our knowledge, no method has been developed to test for preferential sampling on observations combined at point and areal levels. As a compromise, we took the centers of the areas with survey data as their locations for the test. We adopted a fast and intuitive Monte Carlo test developed by Watson (Watson, 2020), chosen for its speed and its applicability to data arising from various distributions. We assumed that $S_{t}$ (i.e., the collection of sampled points at time $t$) was a realization from an inhomogeneous Poisson process (IPP) conditional on $\omega(s,t)$ (i.e., the spatial-temporal Gaussian random field), that is, $[S_{t}\mid\omega(s,t)]=\mathrm{IPP}(\lambda(s,t))$, with $\log(\lambda(s,t))=\alpha_{0}+h(\omega(s,t))$, where $h$ is a monotonic function of $\omega(s,t)$. When $h\equiv 0$, the sampling process is independent of $\omega(s,t)$, and preferential sampling is not significant. In this way, the problem of detecting preferential sampling can be transformed into a hypothesis test of $h\equiv 0$. If $h\equiv 0$ is false, for example, if $h$ is a monotonically increasing function of $\omega(s,t)$, then the point patterns $S_{t}$ are expected to exhibit an excess of clustering in areas with higher $\omega(s,t)$, so a positive association can be detected between the localized amount of clustering and the estimated $\omega(s,t)$. First, we used the mean of the distances to the K nearest points ($D_{K}$) to measure the clustering of locations, and calculated the rank correlation $r_{t(K)}$ between $D_{K}$ and the estimated $\omega(s,t)$ for survey year $t$.
Here the estimated $\omega(s,t)$ was obtained from fitting the Bayesian spatial-temporal joint model. Next, the Monte Carlo method was used to sample realizations from the IPP under the null hypothesis (i.e., $h\equiv 0$), from which a set of rank correlations $r_{t(K)}^{M}$ was calculated, approximating the distribution of the rank correlation $\rho_{t(K)}$ under $h\equiv 0$. In this way, the nonstandard sampling distribution of the test statistic can be approximated. Finally, we computed the empirical p-value as the proportion of the Monte Carlo-sampled $r_{t(K)}^{M}$ that were more extreme than $r_{t(K)}$. We set a sample size of 1000 for each Monte Carlo sampling. We considered K from 1 to 8 to measure the clustering of locations, which resulted in eight p-values, one per K, for each survey year. If any of the p-values was smaller than or equal to 0.05, we considered preferential sampling to exist in the corresponding survey year. Since our model could estimate the disease risk in each year of the study period, this test was done for each survey year with 10 or more locations (i.e., 1978, 1981, 1991, 1995, 1998, 2000, 2001, 2004, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, and 2016). The test was conducted using the package 'PStestR' in R (version 3.6.3) (Watson, 2020).
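The logic of the Monte Carlo test can be sketched in plain Python. This is a simplified illustration and not the PStestR implementation: it assumes a unit-square study region, a homogeneous Poisson null with the number of points held fixed, a two-sided notion of "more extreme", and a known function `field(x, y)` standing in for the estimated $\omega(s,t)$. All names and values are hypothetical.

```python
import math
import random

def mean_knn_distance(points, k):
    """D_K: mean distance from each point to its k nearest neighbours."""
    out = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(math.hypot(xi - xj, yi - yj)
                       for j, (xj, yj) in enumerate(points) if j != i)
        out.append(sum(dists[:k]) / k)
    return out

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for idx in order[pos:end + 1]:
            result[idx] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return result

def rank_correlation(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

def preferential_sampling_test(points, field, k, n_sim=1000, seed=0):
    """Return (observed rank correlation, empirical two-sided p-value)."""
    rng = random.Random(seed)
    observed = rank_correlation(mean_knn_distance(points, k),
                                [field(x, y) for x, y in points])
    extreme = 0
    for _ in range(n_sim):
        # Null hypothesis (h = 0): locations do not depend on the field
        sim = [(rng.random(), rng.random()) for _ in points]
        r_sim = rank_correlation(mean_knn_distance(sim, k),
                                 [field(x, y) for x, y in sim])
        if abs(r_sim) >= abs(observed):
            extreme += 1
    return observed, extreme / n_sim

# Hypothetical survey locations and a hypothetical smooth field
location_rng = random.Random(1)
survey_points = [(location_rng.random(), location_rng.random()) for _ in range(15)]
r_obs, p_value = preferential_sampling_test(
    survey_points, field=lambda x, y: x + y, k=3, n_sim=200)
```

In the article this calculation is repeated for K from 1 to 8 and for each survey year with at least 10 locations, and a year is flagged when any of the eight p-values is at most 0.05.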
Data availability
All data generated or analysed during this study are included in the manuscript and supporting files. Source data files have been provided for Figures 2–7, Figure 2—figure supplement 1, Figure 3—figure supplement 1, and Figure 6—figure supplements 1–9.
References

Opisthorchis viverrini: an underestimated parasite in world health. Trends in Parasitology 24:497–501. https://doi.org/10.1016/j.pt.2008.08.011

Book: Hierarchical Modeling and Analysis for Spatial Data (Second Edition). Boca Raton: Chapman & Hall/CRC.

Spatiotemporal modeling of particulate matter concentration through the SPDE approach. AStA Advances in Statistical Analysis 97:109–131. https://doi.org/10.1007/s1018201201963

Prevalence of the intestinal flukes Haplorchis taichui and H. yokogawai in a mountainous area of Phongsaly Province, Lao PDR. The Korean Journal of Parasitology 48:339–342. https://doi.org/10.3347/kjp.2010.48.4.339

Hyperendemicity of Haplorchis taichui infection among riparian people in Saravane and Champasak Province, Lao PDR. The Korean Journal of Parasitology 51:305–311. https://doi.org/10.3347/kjp.2013.51.3.305

Geostatistical inference under preferential sampling. Journal of the Royal Statistical Society: Series C 59:191–232. https://doi.org/10.1111/j.14679876.2009.00701.x

Clonorchis sinensis and Opisthorchis spp. in Vietnam: current status and prospects. Transactions of the Royal Society of Tropical Medicine and Hygiene 110:13–20. https://doi.org/10.1093/trstmh/trv103

The role of evolutionary biology in research and control of liver flukes in Southeast Asia. Infection, Genetics and Evolution 43:381–397. https://doi.org/10.1016/j.meegid.2016.05.019

A statistical explanation of MaxEnt for ecologists. Diversity and Distributions 17:43–57. https://doi.org/10.1111/j.14724642.2010.00725.x

Spatial distribution of, and risk factors for, Opisthorchis viverrini infection in southern Lao PDR. PLOS Neglected Tropical Diseases 6:e1481. https://doi.org/10.1371/journal.pntd.0001481

Global burden of human foodborne trematodiasis: a systematic review and meta-analysis. The Lancet Infectious Diseases 12:210–221. https://doi.org/10.1016/S14733099(11)702948

On the effect of preferential sampling in spatial prediction. Environmetrics 23:565–578. https://doi.org/10.1002/env.2169

Bayesian Modeling and Analysis of Geostatistical Data. Annual Review of Statistics and Its Application 4:245–266. https://doi.org/10.1146/annurevstatistics060116054155

Asian schistosomiasis: current status and prospects for control leading to elimination. Tropical Medicine and Infectious Disease 4:40. https://doi.org/10.3390/tropicalmed4010040

Schistosomes, liver flukes and Helicobacter pylori. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans 61:1–241.

Evaluation of the helminthiasis control program in Thailand at the end of the 8th health development plan. The Journal of Tropical Medicine and Parasitology 26:18–45.

Opisthorchiasis control in Thailand. Acta Tropica 88:229–232. https://doi.org/10.1016/j.actatropica.2003.01.002

Opisthorchiasis in Thailand: Review and current status. World Journal of Gastroenterology 14:2297–2302. https://doi.org/10.3748/wjg.14.2297

Book: Highlighting Operational and Implementation Research for Control of Helminthiasis. Academic Press.

Emerging Foodborne Trematodiasis. Emerging Infectious Diseases 11:1507–1514. https://doi.org/10.3201/eid1110.050614

The drugs we have and the drugs we need against major helminth infections. Advances in Parasitology 73:197–230. https://doi.org/10.1016/S0065308X(10)730086

Is Opisthorchis viverrini emerging in Cambodia? Advances in Parasitology 103:31–73. https://doi.org/10.1016/bs.apar.2019.02.002

Book: Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. Boca Raton: CRC Press, Taylor & Francis Group. https://doi.org/10.1201/9780429031892

Evaluation of a commercial stool concentrator kit compared to direct smear and formalin-ethyl acetate concentration methods for diagnosis of parasitic infection with special reference to Opisthorchis viverrini sensu lato in Thailand. Southeast Asian Journal of Tropical Medicine and Public Health 47:890–900.

An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B 73:423–498. https://doi.org/10.1111/j.14679868.2011.00777.x

Integrated implementation of programs targeting neglected tropical diseases through preventive chemotherapy: proving the feasibility at national scale. The American Journal of Tropical Medicine and Hygiene 84:5–14. https://doi.org/10.4269/ajtmh.2011.100411

Intestinal parasitic infections among inhabitants of the North, West-central and Eastern border areas of Thailand. The Journal of Tropical Medicine and Parasitology 27:51–58.

Field survey focused on Opisthorchis viverrini infection in five provinces of Cambodia. Parasitology International 63:366–373. https://doi.org/10.1016/j.parint.2013.12.003

Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. International Journal of Surgery 8:336–341. https://doi.org/10.1016/j.ijsu.2010.02.007

Report: National Survey on Helminthiasis in Thailand. Department of Disease Control, Ministry of Public Health.

Opisthorchiasis: An Overlooked Danger. PLOS Neglected Tropical Diseases 9:e3563. https://doi.org/10.1371/journal.pntd.0003563

The zoonotic, fish-borne liver flukes Clonorchis sinensis, Opisthorchis felineus and Opisthorchis viverrini. International Journal for Parasitology 43:1031–1046. https://doi.org/10.1016/j.ijpara.2013.07.007

The conditional predictive ordinate for the normal distribution. Journal of the Royal Statistical Society: Series B 52:175–184. https://doi.org/10.1111/j.25176161.1990.tb01780.x

Liver Fluke-Associated Biliary Tract Cancer. Gut and Liver 12:236–245. https://doi.org/10.5009/gnl17102

Human liver flukes in China and ASEAN: time to fight together. PLOS Neglected Tropical Diseases 13:e0007214. https://doi.org/10.1371/journal.pntd.0007214

Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B 71:319–392. https://doi.org/10.1111/j.14679868.2008.00700.x

Liver flukes: Clonorchis and Opisthorchis. Advances in Experimental Medicine and Biology 766:153–199. https://doi.org/10.1007/9781493909155_6

Copro-DNA diagnosis of Opisthorchis viverrini and Haplorchis taichui infection in an endemic area of Lao PDR. The Southeast Asian Journal of Tropical Medicine and Public Health 41:28.

The current status of opisthorchiasis and clonorchiasis in the Mekong Basin. Parasitology International 61:10–16. https://doi.org/10.1016/j.parint.2011.08.014

Unified geostatistical modeling for data fusion and spatial heteroskedasticity with R package ramps. Journal of Statistical Software 25:i10. https://doi.org/10.18637/jss.v025.i10

Adult Opisthorchis viverrini flukes in humans, Takeo, Cambodia. Emerging Infectious Diseases 17:1302–1304. https://doi.org/10.3201/eid1707.102071

Low-Grade Endemicity of Opisthorchiasis, Yangon, Myanmar. Emerging Infectious Diseases 25:1435–1437. https://doi.org/10.3201/eid2507.190495

Epidemiology of Opisthorchis viverrini infection. Advances in Parasitology 101:41–67. https://doi.org/10.1016/bs.apar.2018.05.002

Report: Transforming Our World: The 2030 Agenda for Sustainable Development. United Nations.

A spatial regression model for the disaggregation of areal unit based data to high-resolution grids with application to vaccination coverage mapping. Statistical Methods in Medical Research 28:3226–3241. https://doi.org/10.1177/0962280218797362

Fish-borne trematodes in Vietnam. Southeast Asian Journal of Tropical Medicine and Public Health 35:299–301.

Report: Joint WHO/FAO Workshop on Foodborne Trematode Infections in Asia. World Health Organization.

Report: Report of the WHO Expert Consultation on Foodborne Trematode Infections and Taeniasis/Cysticercosis. World Health Organization.

Report: Accelerating Work to Overcome the Global Impact of Neglected Tropical Diseases. World Health Organization.

High prevalence of haplorchiasis in Nan and Lampang provinces, Thailand, proven by adult worm recovery from suspected opisthorchiasis cases. The Korean Journal of Parasitology 51:767–769. https://doi.org/10.3347/kjp.2013.51.6.767

National survey of helminthiasis in Thailand. Asian Biomedicine 8:779–783. https://doi.org/10.5372/ABM.V8I6.2696

Prevalence of intestinal helminths among inhabitants of Cambodia (2006–2011). The Korean Journal of Parasitology 52:661–666. https://doi.org/10.3347/kjp.2014.52.6.661

Software: Spatialtemporalmappingofopisthorchiasis, version swh:1:rev:6493df4ba60c1f2f1aaaad979174a3a5d928627a. Software Heritage.
Decision letter

Miles P Davenport, Senior Editor; University of New South Wales, Australia

Talía Malagón, Reviewing Editor; McGill University, Canada
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Infection with Opisthorchis viverrini is a neglected tropical disease endemic to Southeast Asian countries. This study is particularly interesting in that it compiles data from decades of prevalence surveys of O. viverrini infection to produce high-resolution maps of infection prevalence in Southeast Asia over the past few decades, identifying regions where prevalence has decreased and increased, and the systemic factors that have influenced the prevalence over time. Such high-resolution geographical data will be highly valuable for public health efforts aimed at treating and preventing this disease.
Decision letter after peer review:
Thank you for submitting your article "Model-based spatial-temporal mapping of opisthorchiasis in endemic countries of Southeast Asia" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Miles Davenport as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
As the editors have judged that your manuscript is of interest, but as described below that additional experiments are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)
Summary:
This paper examined the prevalence of opisthorchiasis in Southeast Asian countries using survey data obtained through an extensive literature search. The data comprised both areal and point-referenced data. The authors then fitted a Bayesian fusion spatiotemporal model to map the prevalence of the disease at 5 × 5 km resolution. The estimates were also aggregated to different administrative units within the study countries. This work applies state-of-the-art modelling techniques to an interesting application.
Essential revisions:
1) Many of the publications studied only areas where Opisthorchis viverrini is endemic. The prevalence in these surveys is likely to be overestimated due to the preferential sampling of areas. The authors have used a method to test for this (Monte Carlo test using R package PStestR) and claim that they did not detect any preferential sampling. However, the reviewers did not find this very convincing given the clustering of points in Figure 2. We would like the authors to give further details regarding the analysis method they used to test for preferential sampling (Monte Carlo test using R package PStestR), show the results of this analysis, and also include some further discussion regarding the impact of preferential sampling on the validity of results.
2) The authors stratified predictions by 10year periods; this is a very coarse time frame for predictions given that incidence of infection can vary from year to year. This limitation should be fully acknowledged by the authors if shorter time frames cannot be used.
3) The reviewers criticized the imputation of sample size in order to convert prevalences to binomial data in papers where sample size was unavailable. While the authors included a sensitivity analysis of the impact of this imputation in Figure 3—source data 3 to help assess this point, this was not considered sufficient to address this issue. The reviewers suggest instead that if the data were originally available as prevalence estimates, these should be treated as such and modelled using a beta likelihood or a normal likelihood (on the logit scale) and not converted artificially to binomial data.
4) The authors should describe how they dealt statistically when they encountered multiple estimates from the same area within each of the 10year periods.
5) Surveys often have complex designs, using weighting to calculate the prevalence over an entire area. How did the authors account for this weighting in their analysis?
6) The authors treated surveys aggregated over ADM2 or ADM3 areas as points, whereas those aggregated over ADM1 areas were treated as areal data. This is a very rough way to handle spatial misalignment. If the data were associated with areas, these should be left as areal data in the analysis and should not be treated as points as one would be enforcing nonexistent geographical precision in the data in doing so. The authors should justify this choice, or discuss how it may impact the accuracy of results.
7) The authors used the AUC statistic to validate their model. This is an inappropriate use of the AUC; ROC and AUC are normally used to check the discrimination ability of logistic regression models and not binomial regression models. The authors mention other metrics which are useful for evaluating binomial regression models such as MSE and MAE, but the values of these metrics are not discussed or presented in the manuscript. Please discard the AUC analysis, and instead include a table showing the values of these other metrics in the main manuscript, as well as the bias and 95% coverage rates of the fitted model.
8) The authors discuss differences in test sensitivity as a source of heterogeneity between surveys, which they ignored by assuming similar sensitivity across all surveys. It is unclear how much this may have affected results. Please give estimates of the magnitude of the difference of sensitivity of different diagnostic tests, as this could heavily influence differences in prevalence across surveys if these differences in sensitivity are very large. Is there a reason why the authors did not assess the diagnostic method as a covariate in their model?
9) The authors excluded surveys that used the smear method to detect opisthorchiasis because of its lack of sensitivity. However, in nearly half of all reports, the diagnostic test used was not reported or was missing. How do the authors then know that these records did not use the smear method to detect disease?
10) The authors need to provide a list of citations of all their included studies as an appendix, consistent with GATHER item 5 and PRISMA item 18. GATHER also suggests providing a table with each data source used, reference information or contact name/institution, population represented, data collection method, year(s) of data collection, sex and age range, diagnostic criteria or measurement method, and sample size, as relevant.
11) The interpretation of the estimated regression coefficients of the categorical variables was poorly done. In particular for model results Table 2: since the authors used a logit link function, the model results can be converted into odds ratios by exponentiating the model coefficients. Please convert all coefficients in this table into odds ratios. Model coefficients have very little inherent interpretability, while odds ratios can be interpreted by readers as measures of relative risk comparing the reference category and the category in question in relation to the outcome variable. The authors may also want to consider dropping the other noncoefficient model parameters from this table (spatial range, correlation coefficient, spatial variance) and report them in the text instead as their units are not consistent with the rest of the table. For the probability %, this would be reinterpreted as the probability that the odds ratio is >1 for risk factors increasing the prevalence of disease, and <1 for risk factors decreasing the prevalence of disease (distance to nearest open body of water and precipitation). Also, for the variables that were modeled as continuous (precipitation, HII), we need the unit size increase associated with each increase in prevalence (i.e. what increase in annual precipitation is associated with the 0.14 decrease in the logit?)
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Model-based spatial-temporal mapping of opisthorchiasis in endemic countries of Southeast Asia" for further consideration by eLife. Your revised article has been evaluated by Miles Davenport (Senior Editor) and a Reviewing Editor.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
1) For Figure 5, negative values are conventionally interpreted as decreases and positive values as increases, so the numbers in this figure are likely to lead to confusion. Please change the calculation to $(pp_{st_{j}}-pp_{st_{i}})/pp_{st_{i}}$, which should invert the sign without changing the numbers and will increase the interpretability of the figure.
2) In Table 2, the exponent of the intercept of the model cannot be interpreted as an odds ratio, as it represents the odds of the prevalence at the reference value of all categories. Please leave the cells for OR and prob(%) blank for this row, as these quantities are not relevant for the intercept.
https://doi.org/10.7554/eLife.59755.sa1
Author response
Essential revisions:
1) Many of the publications studied only areas where Opisthorchis viverrini is endemic. The prevalence in these surveys is likely to be overestimated due to the preferential sampling of areas. The authors have used a method to test for this (Monte Carlo test using R package PStestR) and claim that they did not detect any preferential sampling. However, the reviewers did not find this very convincing given the clustering of points in Figure 2. We would like the authors to give further details regarding the analysis method they used to test for preferential sampling (Monte Carlo test using R package PStestR), show the results of this analysis, and also include some further discussion regarding the impact of preferential sampling on the validity of results.
We thank the reviewers for raising the issue of preferential sampling. To our knowledge, no method has yet been developed to test for preferential sampling when observations combine point-level and areal-level data. As a compromise, we took the centers of the areas with survey data as their locations for the test. We adopted a fast Monte Carlo test developed by Watson, for its speed and its applicability to data arising from various distributions (Watson, 2020). We assumed that $S_t$ (i.e., the collection of sampled points at time $t$) is a realization of an inhomogeneous Poisson process (IPP) conditional on $\omega(s,t)$ (i.e., the spatial-temporal Gaussian random field), that is $[S_t \mid \omega(s,t)] = \mathrm{IPP}(\lambda(s,t))$, with $\log(\lambda(s,t)) = \alpha_0 + h(\omega(s,t))$, where $h$ is a monotonic function of $\omega(s,t)$. When $h \equiv 0$, the sampling process is independent of $\omega(s,t)$, and thus there is no significant preferential sampling. In this way, the problem of detecting preferential sampling can be transformed into a hypothesis test of $h \equiv 0$. If $h \equiv 0$ is false, for example when $h$ is a monotonically increasing function of $\omega(s,t)$, the point patterns $S_t$ are expected to exhibit an excess of clustering in areas with higher $\omega(s,t)$, so a positive association can be detected between the localized amount of clustering and the estimated $\omega(s,t)$ (Watson, 2020). First, we used the mean of the distances to the $K$ nearest points ($D_K$) to measure the clustering of locations, and calculated the rank correlation $r_{t(K)}$ between $D_K$ and the estimated $\omega(s,t)$ for survey year $t$. Here the estimated $\omega(s,t)$ was obtained by fitting the Bayesian spatial-temporal joint model.
Next, the Monte Carlo method was used to sample realizations from the IPP under the null hypothesis (i.e., $h \equiv 0$), from which a set of rank correlations $r_{t(K)}^{M}$ was calculated, approximating the distribution of the rank correlations $\rho_{t(K)}$ under $h \equiv 0$. In this way, the nonstandard sampling distribution of the test statistic can be approximated. Finally, we computed the empirical p-value as the proportion of the Monte Carlo-sampled $r_{t(K)}^{M}$ that are more extreme than $r_{t(K)}$. We drew 1000 samples for each Monte Carlo run. We considered $K$ from 1 to 8 to measure the clustering of locations, which resulted in eight p-values, one per value of $K$, for each survey year. If any of the p-values was smaller than or equal to 0.05, we considered preferential sampling to exist in the corresponding survey year. Since we modified our model to estimate the disease risk for each year of the study period (please see the reply to comment 2), this test was done for each survey year with 10 or more locations. Results showed that significant preferential sampling might exist for survey locations in one third (6/18) of the survey years (Figure 2—source data 2). The corresponding impacts might include an improper variogram estimator, biased parameter estimation and unreliable exposure surface estimates (Diggle et al., 2010; Pati et al., 2011; Gelfand et al., 2012). To avoid a more complex model, we did not account for preferential sampling in our final model, as the model validation showed reasonable prediction accuracy. However, readers should be aware of this limitation. In response to the suggestion, we added the description of the test for preferential sampling in subsection “Model validation, sensitivity analysis and test of preferential sampling”, the test results in the Results, and the limitation and discussion in the Discussion.
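The steps of the test described above can be illustrated with a minimal Python sketch. It is not the PStestR implementation: it assumes a unit-square study region, a null model in which locations are uniformly distributed, a two-sided notion of "more extreme", and a hypothetical callable `field` that returns the estimated $\omega(s,t)$ at given coordinates. All function names are illustrative.

```python
import numpy as np

def rank(x):
    """Ranks 0..n-1 (ties broken by order; adequate for continuous values)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def rank_corr(a, b):
    """Rank correlation, computed as the Pearson correlation of the ranks."""
    return np.corrcoef(rank(a), rank(b))[0, 1]

def mean_knn_dist(pts, k):
    """D_K: mean distance from each point to its k nearest neighbours."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    d.sort(axis=1)                      # column 0 is the zero self-distance
    return d[:, 1:k + 1].mean(axis=1)

def ps_test(pts, field, k, n_sim=1000, seed=0):
    """Monte Carlo test of h = 0 for one survey year and one value of K.

    pts   : (n, 2) array of survey locations in the unit square (assumption)
    field : hypothetical callable giving the estimated field at (n, 2) points
    """
    rng = np.random.default_rng(seed)
    r_obs = rank_corr(mean_knn_dist(pts, k), field(pts))
    r_null = np.empty(n_sim)
    for m in range(n_sim):
        # Under the null, locations are independent of the field.
        sim = rng.uniform(size=pts.shape)
        r_null[m] = rank_corr(mean_knn_dist(sim, k), field(sim))
    # Empirical two-sided p-value: share of null statistics at least as extreme.
    p_value = np.mean(np.abs(r_null) >= np.abs(r_obs))
    return r_obs, p_value
```

Under this sketch, running `ps_test` for $K = 1, \dots, 8$ and flagging a year when any p-value is at most 0.05 would mirror the decision rule stated in the response.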
2) The authors stratified predictions by 10-year periods; this is a very coarse time frame for predictions given that incidence of infection can vary from year to year. This limitation should be fully acknowledged by the authors if shorter time frames cannot be used.
We thank the reviewers for pointing out ways we could improve, and we agree with this point of view. We improved the model by constructing the spatial-temporal random effects with a yearly temporal resolution instead of a 10-year period, which allowed us to estimate the disease risk yearly. We followed the method proposed by Cameletti and colleagues (Cameletti et al., 2013; Krainski, 2019) to build a spatial-temporal model combined with covariates, defined as an SPDE model for the spatial domain and an AR1 model for the time dimension. To decrease the computational burden under the SPDE framework, we built the GMRF on regular temporal knots, that is $\omega = (\omega_{t=1978}, \omega_{t=1983}, \omega_{t=1988}, \omega_{t=1993}, \omega_{t=1998}, \omega_{t=2003}, \omega_{t=2008}, \omega_{t=2013}, \omega_{t=2018})'$, while the latent fields corresponding to other years are approximated by projection of $\omega$ using the B-spline basis function of degree two, that is $B_{i,1}(t) = \begin{cases} 1, & t_i \le t < t_{i+1} \\ 0, & \text{otherwise} \end{cases}$ and $B_{i,m}(t) = \frac{t - t_i}{t_{i+m-1} - t_i} B_{i,m-1}(t) + \frac{t_{i+m} - t}{t_{i+m} - t_{i+1}} B_{i+1,m-1}(t)$, where $m$ is the degree of two (Cameletti et al., 2012; Krainski, 2019). We put a more detailed description in subsection “Model fitting and variable selection”.
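The B-spline recursion above can be sketched in a few lines of Python to show how a year between two knots receives weights from the neighbouring knots. This is an illustration of the recursion only, not the authors' R-INLA code. In this indexing, $m = 1$ is piecewise constant, so the example uses $m = 2$ (piecewise-linear weights); which value of $m$ corresponds to the "degree two" used in the paper is an assumption here.

```python
def bspline_basis(i, m, t, knots):
    """Cox-de Boor recursion for B_{i,m}(t) on the given knot sequence."""
    if m == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + m - 1] - knots[i]
    right_den = knots[i + m] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:       # guard against repeated knots
        left = (t - knots[i]) / left_den * bspline_basis(i, m - 1, t, knots)
    if right_den != 0:
        right = (knots[i + m] - t) / right_den * bspline_basis(i + 1, m - 1, t, knots)
    return left + right

# Temporal knots every five years, as listed in the response.
knots = list(range(1978, 2019, 5))

# Weights for the year 1990, which lies between the knots 1988 and 1993.
weights = {knots[i + 1]: bspline_basis(i, 2, 1990, knots) for i in range(len(knots) - 2)}
```

With these knots, the year 1990 draws only on the 1988 and 1993 knots, and the two weights sum to one.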
3) The reviewers criticized the imputation of sample size in order to convert prevalences to binomial data in papers where sample size was unavailable. While the authors included a sensitivity analysis of the impact of this imputation in Figure 3—source data 3 to help assess this point, this was not considered sufficient to address this issue. The reviewers suggest instead that if the data were originally available as prevalence estimates, these should be treated as such and modelled using a beta likelihood or a normal likelihood (on the logit scale) and not converted artificially to binomial data.
We thank the reviewers for the comment and suggestion. Of the survey data we collected, around 54.2% were reported with the number examined and the number positive, and the other 45.8% were reported only with the observed prevalence. Following the reviewers’ suggestion, and in order to make full use of the available information (i.e., the reported numbers examined and numbers positive), we developed a bivariate model that jointly analyzes data reporting numbers examined and positive, and data reporting only prevalence. For data reported with numbers examined and positive, we assumed that the number positive $Y_{it}$ followed a binomial distribution $Y_{it} \sim Bin(N_{it}, p_{it})$; and for data reported only with the observed prevalence, we assumed that the observed prevalence $\mathrm{ob}_{it}$ followed a beta distribution $\mathrm{ob}_{it} \sim Be(p_{it}, \sigma_{\beta}^{2})$. Here $\mathrm{ob}_{it}$, $N_{it}$, $Y_{it}$ and $p_{it}$ denote the observed prevalence, number examined, number positive and the probability of infection, respectively. Furthermore, we modeled $p_{it}$, the probability of infection (under either type of distribution), on the logit scale with the same predictors and spatial-temporal random effects. Model validation showed that the performance of this model was satisfactory: it correctly estimated 79.61% of observations within a 95% coverage. We added the corresponding method in subsection “Model fitting and variable selection”.
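The idea of a shared probability of infection feeding two different likelihoods can be sketched as a joint log-likelihood in Python. This is only an illustration: the beta distribution is parameterised here by its mean $p_{it}$ and a precision `phi` (shape parameters $p\phi$ and $(1-p)\phi$), which is an assumption and may differ from the $\sigma_{\beta}^{2}$ parameterisation in the manuscript, and the linear predictor `eta` stands in for the covariates plus spatial-temporal random effects.

```python
import math

def logit_inv(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def binom_loglik(y, n, p):
    """Log-likelihood of y positives out of n examined."""
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(p) + (n - y) * math.log(1 - p))

def beta_loglik(ob, p, phi):
    """Log-density of an observed prevalence under a mean-precision beta (assumed form)."""
    a, b = p * phi, (1 - p) * phi
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(ob) + (b - 1) * math.log(1 - ob))

def joint_loglik(count_data, prev_data, phi):
    """Sum both contributions; every record shares the same logit-scale predictor form.

    count_data : iterable of (y, n, eta) for surveys reporting counts
    prev_data  : iterable of (ob, eta) for surveys reporting prevalence only
    """
    ll = sum(binom_loglik(y, n, logit_inv(eta)) for y, n, eta in count_data)
    ll += sum(beta_loglik(ob, logit_inv(eta), phi) for ob, eta in prev_data)
    return ll
```

Because both terms depend on the same $p_{it}$, the surveys that report only prevalence still inform the regression coefficients and random effects.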
4) The authors should describe how they dealt statistically when they encountered multiple estimates from the same area within each of the 10-year periods.
We thank the reviewers for raising this point. In the revised manuscript, we modified the model to a yearly temporal resolution (please see the reply to comment 2). In this case, we assumed that the infection risk was the same within a 1-year period for the same area. Different observations from the same year in the same area can be treated as realizations of the randomized spatial-temporal process. Based on the fitted results, we estimated the infection risk for each year of the study period at each pixel of a grid of 5 × 5 km$^{2}$ resolution. We added the corresponding descriptions in subsection “Model fitting and variable selection”.
5) Surveys often have complex designs, using weighting to calculate the prevalence over an entire area. How did the authors account for this weighting in their analysis?
We thank the reviewer for the comment on this important point. Indeed, for surveys across a large area, complex designs, such as random sampling from subgroups of the population under a well-designed scheme, are likely to be adopted, as it is impractical to draw simple random samples from the whole area. In such cases, respondents may have unequal probabilities of being selected, so weighting should be used to generalize results to the entire area. The observed disease data we collected were from surveys either at point level (i.e., community or school) or aggregated over areas. For point-level data, as study areas were quite small, simple sampling designs were mostly used in the corresponding surveys. For areal-level data, particularly those aggregated across ADM1, complex designs were likely applied. However, most of the corresponding surveys reported only raw prevalence, or prevalence without clarifying whether weighting was applied. Thus, we did not have enough information to address the design effect for each survey included. We put this limitation in the Discussion. On the other hand, as population density differed across the study region, we calculated the estimated country-level and provincial-level prevalence by averaging the estimated pixel-level prevalence weighted by population density, that is $\hat{\mathrm{pp}}_{A} = \sum_{i \in A} \hat{\mathrm{pp}}_{i} w_{i} / \sum_{i \in A} w_{i}$. Here $\hat{\mathrm{pp}}_{A}$, $\hat{\mathrm{pp}}_{i}$ and $w_{i}$ are the estimated prevalence in area $A$, the estimated prevalence at pixel $i$ and the population density at pixel $i$, respectively, where $i$ belongs to area $A$. In this way, we took into account the diversity of population density across areas for regional summaries of the estimates (subsection “Model fitting and variable selection”, and the Discussion).
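As a small worked illustration of the population-weighted average above, the following Python sketch computes the area-level prevalence from pixel-level estimates. The function name and the example numbers are hypothetical.

```python
import numpy as np

def weighted_prevalence(pp, w):
    """Population-weighted mean prevalence over the pixels of one area.

    pp : estimated prevalence at each pixel in the area
    w  : population density at the same pixels
    """
    pp = np.asarray(pp, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(pp * w) / np.sum(w))

# Two pixels: the denser pixel (weight 3) dominates the area-level estimate.
area_prevalence = weighted_prevalence([0.10, 0.30], [1.0, 3.0])
```

In this toy case the unweighted mean would be 0.20, whereas the weighted estimate is pulled toward the prevalence of the more populated pixel.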
6) The authors treated surveys aggregated over ADM2 or ADM3 areas as points, whereas those aggregated over ADM1 areas were treated as areal data. This is a very rough way to handle spatial misalignment. If the data were associated with areas, these should be left as areal data in the analysis and should not be treated as points as one would be enforcing nonexistent geographical precision in the data in doing so. The authors should justify this choice, or discuss how it may impact the accuracy of results.
We thank the reviewers for pointing out ways we could improve. We agree, and now treat all the survey data aggregated over ADM1, ADM2 and ADM3 as areal data. We have revised both the study protocol and results accordingly (Figure 1—figure supplement 1 and Figure 2).
7) The authors used the AUC statistic to validate their model. This is an inappropriate use of the AUC; ROC and AUC are normally used to check the discrimination ability of logistic regression models and not binomial regression models. The authors mention other metrics which are useful for evaluating binomial regression models such as MSE and MAE, but the values of these metrics are not discussed or presented in the manuscript. Please discard the AUC analysis, and instead include a table showing the values of these other metrics in the main manuscript, as well as the bias and 95% coverage rates of the fitted model.
We thank the reviewers for the suggestion. We modified the model validation part by using the mean error ($ME = \frac{1}{N}\sum (\mathrm{ob}_{it} - \mathrm{pp}_{it})$), mean absolute error ($MAE = \frac{1}{N}\sum |\mathrm{ob}_{it} - \mathrm{pp}_{it}|$) and mean square error ($MSE = \frac{1}{N}\sum (\mathrm{ob}_{it} - \mathrm{pp}_{it})^{2}$), as well as the coverage rate of observations within the 95% BCI, to evaluate the performance of the model. The ME, MAE, and MSE were 0.24%, 9.06%, and 2.38%, respectively, in the final model. Our model was able to correctly estimate 79.61% of locations within the 95% BCI, indicating that the model had reasonable prediction accuracy. We have revised both subsection “Model validation, sensitivity analysis and test of preferential sampling” and the Results of the manuscript accordingly.
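The four validation summaries defined above can be computed as in the following Python sketch. The function name and the toy inputs are hypothetical; `lower` and `upper` stand for the bounds of the 95% BCI of the predicted prevalence at each validation location.

```python
import numpy as np

def validation_metrics(ob, pp, lower, upper):
    """ME, MAE, MSE and 95% BCI coverage for observed vs. predicted prevalence."""
    ob, pp, lower, upper = (np.asarray(a, dtype=float) for a in (ob, pp, lower, upper))
    err = ob - pp
    return {
        "ME": float(err.mean()),                # signed bias
        "MAE": float(np.abs(err).mean()),       # average magnitude of error
        "MSE": float((err ** 2).mean()),        # penalises large errors more
        "coverage": float(np.mean((ob >= lower) & (ob <= upper))),
    }

metrics = validation_metrics(
    ob=[0.2, 0.5], pp=[0.1, 0.7], lower=[0.0, 0.6], upper=[0.3, 0.9]
)
```

In this toy example the first observation falls inside its interval and the second does not, so the coverage is one half.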
8) The authors discuss differences in test sensitivity as a source of heterogeneity between surveys, which they ignored by assuming similar sensitivity across all surveys. It is unclear how much this may have affected results. Please give estimates of the magnitude of the difference of sensitivity of different diagnostic tests, as this could heavily influence differences in prevalence across surveys if these differences in sensitivity are very large. Is there a reason why the authors did not assess the diagnostic method as a covariate in their model?
We thank the reviewers for the comment. Previous studies have shown that the sensitivity and specificity of the same diagnostic method may differ across studies, while different diagnostic methods may give different results in the same survey (Charoensuk et al., 2019; Laoprom et al., 2016; Sayasone et al., 2015). Because we lacked sufficient information to assess the quality and procedure of the diagnostic approach in each survey, we did not account for this heterogeneity in the model in the original manuscript. Following the reviewers’ suggestion, and assuming that the same diagnostic method has similar sensitivity and specificity across surveys, we added the type of diagnostic method, that is Kato-Katz, FECT and other methods (including methods other than the above two and methods not stated or missing), as a covariate in the model, with Kato-Katz as the baseline. The odds of infection differed significantly, with FECT resulting in lower odds than Kato-Katz, which was consistent with results found by Lovis and colleagues (Lovis et al., 2009). We have revised the subsection “Model fitting and variable selection”, the Results, Table 2 and the Discussion in the revised manuscript accordingly.
9) The authors used as an exclusion criterion surveys using the smear method to detect opisthorchiasis due to its lack of sensitivity. However, in nearly half of all reports, the diagnostic test used was not reported or missing. How do the authors then know that these records did not use the smear method to detect disease?
We thank the reviewers for the comment. As the direct smear has very low sensitivity, and only 5 relevant surveys used this method, we excluded them from the modeling analysis. However, a certain proportion of surveys (42%) had diagnostic techniques not stated or missing, and we were not able to determine whether these surveys used direct smear as the diagnostic method, which is a limitation of the study. To partially account for this uncertainty, we considered the type of diagnostic method as a covariate in the modified model, grouping surveys with methods not stated or missing, or using methods other than Kato-Katz or FECT, as the type “others”. The result shows that there was no significant difference between the odds of infection with other methods and that with Kato-Katz (Table 2, Results in revised manuscript).
10) The authors need to provide a list of citations of all their included studies as an appendix, consistent with GATHER item 5 and PRISMA item 18. GATHER also suggests providing a table with each data source used, reference information or contact name/institution, population represented, data collection method, year(s) of data collection, sex and age range, diagnostic criteria or measurement method, and sample size, as relevant.
We thank the reviewers for the suggestion. We provided a table listing relevant information (reference, population represented, data collection method, year of survey, etc.) for each data source in Figure 2—source data 1.
11) The interpretation of the estimated regression coefficients of the categorical variables was poorly done. In particular for model results Table 2: since the authors used a logit link function, the model results can be converted into odds ratios by exponentiating the model coefficients. Please convert all coefficients in this table into odds ratios. Model coefficients have very little inherent interpretability, while odds ratios can be interpreted by readers as measures of relative risk comparing the reference category and the category in question in relation to the outcome variable. The authors may also want to consider dropping the other noncoefficient model parameters from this table (spatial range, correlation coefficient, spatial variance) and report them in the text instead as their units are not consistent with the rest of the table. For the probability %, this would be reinterpreted as the probability that the odds ratio is >1 for risk factors increasing the prevalence of disease, and <1 for risk factors decreasing the prevalence of disease (distance to nearest open body of water and precipitation). Also, for the variables that were modeled as continuous (precipitation, HII), we need the unit size increase associated with each increase in prevalence (i.e. what increase in annual precipitation is associated with the 0.14 decrease in the logit?)
We thank the reviewer for these helpful suggestions. Following your suggestion, we revised Table 2 by adding another column for the odds ratio (OR) and redefined “Prob(%)” as the probability of OR>1. We also moved the results for the other non-coefficient model parameters to the text (Results). As we modified the model according to the reviewers’ suggestions, variable selection was rerun. Seven variables were selected for the final model, that is, survey type, diagnostic methods and land surface temperature (LST) in the daytime in categorical form, and human influence index, distance to the nearest open water bodies, elevation and travel time to the nearest big city in continuous form (Table 2 in revised manuscript). We added the interpretations of ORs for each covariate in the revised manuscript as follows: “The infection risk was 2.61 (95% BCI: 2.10–3.42) times in the community as much as that in school-aged children. Surveys using FECT as the diagnostic method showed a lower prevalence (OR 0.76, 95% BCI: 0.61–0.93) compared to that using Kato-Katz method, while no significant difference was found between Kato-Katz and the other diagnostic methods. Human influence index and elevation were negatively correlated with the infection risk. Each unit increase of the HII index was associated with 0.01 (95% BCI: 0.003–0.02) decrease in the logit of the prevalence. And increase of 1 meter in elevation was associated with 0.003 (95% BCI: 0.001–0.005) decrease in the logit of the prevalence.” (Results).
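The conversion from logit-scale coefficients to odds ratios discussed in this comment can be illustrated with a short Python sketch. It assumes that posterior samples of a coefficient are available and summarises the OR by its posterior median and 2.5th and 97.5th percentiles; whether Table 2 reports medians or means is not stated here, so the choice of summary is an assumption, and the function name is hypothetical.

```python
import numpy as np

def odds_ratio_summary(beta_samples):
    """Summarise posterior samples of a logit-scale coefficient as an odds ratio."""
    or_samples = np.exp(np.asarray(beta_samples, dtype=float))
    lower, median, upper = np.percentile(or_samples, [2.5, 50, 97.5])
    return {
        "OR": float(median),
        "lower": float(lower),
        "upper": float(upper),
        # Posterior probability that the covariate increases the odds of infection.
        "prob_or_gt_1": float(np.mean(or_samples > 1.0)),
    }

# A decrease of 0.01 in the logit corresponds to an OR just below 1.
or_per_unit_hii = float(np.exp(-0.01))
```

For example, the reported decrease of 0.01 in the logit per unit of HII corresponds to an odds ratio of about 0.99 per unit.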
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
1) For Figure 5, negative values are conventionally interpreted as decreases and positive values as increases, so the numbers in this figure are likely to lead to confusion. Please change the calculations instead to $(\mathrm{pp}_{st_j} - \mathrm{pp}_{st_i})/\mathrm{pp}_{st_i}$, which should lead to an inversion of the sign without changing the numbers, and will increase the interpretability of the figure.
We thank the editors for the suggestion. We modified the calculation to $(\mathrm{pp}_{st_j} - \mathrm{pp}_{st_i})/\mathrm{pp}_{st_i}$. We have revised the method in the revised manuscript (subsection “Model fitting and variable selection”) and changed Figure 5 and Figure 5—source data 1 accordingly. In the revised Figure 5, red represents an increase of the risk, while blue represents a decrease of the risk.
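A one-line Python sketch makes the sign convention of the revised calculation concrete. The function name and the example prevalences are hypothetical; $t_i$ is taken as the earlier year and $t_j$ as the later one.

```python
def relative_change(pp_ti, pp_tj):
    """Relative change in estimated prevalence at a location from year t_i to year t_j."""
    return (pp_tj - pp_ti) / pp_ti

# Prevalence falling from 20% to 15% gives a negative value, i.e. a decrease.
change = relative_change(0.20, 0.15)
```

With this convention, a fall in prevalence yields a negative value and a rise yields a positive one, matching the usual reading of signs.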
2) In Table 2, the exponent of the intercept of the model cannot be interpreted as an odds ratio, as it represents the odds of the prevalence at the reference value of all categories. Please leave the cells for OR and prob(%) blank for this row, as these quantities are not relevant for the intercept.
We thank the editors for pointing out ways we could improve. Following the editors’ suggestion, we left the cells for OR and prob(%) blank for the row of the intercept (Table 2).
https://doi.org/10.7554/eLife.59755.sa2
Article and author information
Author details
Funding
National Natural Science Foundation of China (81703320)
 Ying-Si Lai
National Natural Science Foundation of China (82073665)
 Ying-Si Lai
Natural Science Foundation of Guangdong (2017A030313704)
 Ying-Si Lai
China Medical Board (17274)
 Ying-Si Lai
Sun Yat-sen University One Hundred Talent Grant
 Ying-Si Lai
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We are grateful to Dr Roy Burstein from Institute for Disease Modeling, Bellevue, Washington, USA for providing very good suggestions for the manuscript.
Ethics
Human subjects: This work was based on survey data pertaining to the prevalence of opisthorchiasis extracted from openly published peer-reviewed literature. All data were aggregated and did not contain any information at the individual or household levels. Therefore, there were no specific ethical issues that warranted special attention.
Senior Editor
 Miles P Davenport, University of New South Wales, Australia
Reviewing Editor
 Talía Malagón, McGill University, Canada
Version history
 Received: June 7, 2020
 Accepted: January 11, 2021
 Accepted Manuscript published: January 12, 2021 (version 1)
 Version of Record published: February 8, 2021 (version 2)
Copyright
© 2021, Zhao et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.