Inference and control of the nosocomial transmission of methicillinresistant Staphylococcus aureus
Abstract
Methicillinresistant Staphylococcus aureus (MRSA) is a continued threat to human health in both community and healthcare settings. In hospitals, control efforts would benefit from accurate estimation of asymptomatic colonization and infection importation rates from the community. However, developing such estimates remains challenging due to limited observation of colonization and complicated transmission dynamics within hospitals and the community. Here, we develop an inference framework that can estimate these key quantities by combining statistical filtering techniques, an agentbased model, and realworld patienttopatient contact networks, and use this framework to infer nosocomial transmission and infection importation over an outbreak spanning 6 years in 66 Swedish hospitals. In particular, we identify a small number of patients with disproportionately high risk of colonization. In retrospective control experiments, interventions targeted to these individuals yield a substantial improvement over heuristic strategies informed by number of contacts, length of stay and contact tracing.
https://doi.org/10.7554/eLife.40977.001eLife digest
Antibioticresistant bacteria like the Methicillinresistant Staphylococcus aureus (MRSA) can live in people for many years without making them sick. During this time, the bacteria can spread to others who come in contact with the MRSAinfected person. The number of people with stealth MRSA infections living in the community has been increasing. As a result, hospitals may not only be dealing with MRSA infections that originated onsite, but also cases imported from the community. That makes tracking and controlling MRSA infections in hospitals difficult.
Now, Pei et al. show that computer modeling can help identify the role MRSA infections from the community play in hospital outbreaks and test ways to control them. In the experiments, data from an MRSA outbreak that occurred at 66 Swedish hospitals over 6 years were analyzed using statistical methods and computer modeling. This helped to identify patients who were likely colonized with MRSA within the hospital and those who had acquired it in the community. Next, Pei et al. used computer modeling to test what would have happened if these highrisk individuals had received interventions to prevent them from spreading MRSA in the hospital. This showed that targeting individuals at highrisk of a MRSA infection could reduce the spread of MRSA in the hospital.
The computer models developed by Pei et al. may help researchers, clinicians and public health officials working to control the spread of antibiotic resistant bacteria. The model can improve our understanding of how antibiotic resistant bacteria spread in healthcare facilities and may enable the development of more effective strategies to control these pathogens. Infectioncontrol strategies created with this system must first be tested in isolated, realworld settings to verify they work before they can be deployed broadly.
https://doi.org/10.7554/eLife.40977.002Introduction
Antimicrobial resistance is a global concern in healthcare systems due to its substantial morbidity and mortality burden and the lack of effective treatment options (CDC, 2013a; Magill et al., 2014; WHO, 2018). Among antibioticresistant agents, Methicillinresistant Staphylococcus aureus (MRSA) emerges as one of the most widespread and virulent pathogens (Grundmann et al., 2006; Klevens et al., 2007; Klein et al., 2007; Jarvis et al., 2012) and has been highlighted as a leading cause of healthcareassociated infections (HAIs) by the U.S. Centers for Disease Control and Prevention (CDC) (CDC, 2013b). Initially confined to healthcare facilities, MRSA has since become increasingly prevalent in the broader population in both the United States and Europe (Chambers, 2001; Naimi et al., 2003; Zetola et al., 2005; Hetem et al., 2012; Kouyos et al., 2013; Tosas Auguet et al., 2016). This entwined transmission among hospitals and the community has obscured understanding of the dynamics and persistence of MRSA. Further, MRSA can colonize patients without symptoms for years, during which it can be transmitted stealthily (Cooper et al., 2004a). These epidemiological features have greatly complicated its control and elimination.
The prevalence of MRSA has large variations across different countries. In Europe, a general northsouth gradient has been observed, with rare incidence in Scandinavian hospitals and much higher occurrence in Mediterranean hospitals (Stefani and Varaldo, 2003; Tiemersma et al., 2004; Johnson, 2011). In particular, Sweden remains one of the few countries with a low prevalence of MRSA infection (Stenhem et al., 2006). A substantial proportion of MRSA cases in Sweden has been imported from abroad due to traveling and healthcare contacts in foreign countries (Stenhem et al., 2010; Larsson et al., 2014). As a consequence, analyzing MRSA outbreaks in Sweden offers a good opportunity to study the hybrid dynamics of MRSA in hospital settings where both nosocomial transmission and importation occurs.
To facilitate better control of MRSA in hospital settings, several critical questions need to be answered. First, what are the relative roles of nosocomial transmission and infection importation from the community? Public health officials require an accurate assessment of the current force of infection within and into hospitals in order to deploy appropriate containment measures; however, with the increasing prevalence of MRSA in the community, disentangling HAIs from infections imported from the community has become difficult. Second, how many patients are colonized, and who and where are these highrisk individuals? Effective control would benefit from accurate determination of asymptomatic colonization rates in the general population; failure to estimate and account for colonization may result in longterm control issues (Cooper et al., 2004a). Although colonized patients can be identified using sequencing methods (Harris et al., 2010; Long et al., 2014), the expense of these assays limits their application, particularly in underdeveloped countries where MRSA has become endemic. In light of this situation, mathematical modeling offers an alternative approach for locating individuals with a high probability of colonization and guiding the targeted deployment of laboratory testing (Grundmann and Hellriegel, 2006; van Kleef et al., 2013; Opatowski et al., 2011). However, this inference problem is again complicated by the unobserved stealth transmission dynamics that occurs in the highly complex timevarying contact networks of the real world (Donker et al., 2010; Vanhems et al., 2013; Jarynowski and Liljeros, 2015; Obadia et al., 2015a; Obadia et al., 2015b; Rocha et al., 2016; Nekkab et al., 2017; Duval et al., 2018).
To address these issues, here we develop an agentbased network modelBayesian inference system for estimating unobserved colonization and importation rates from simple incidence records. We use this system to infer the transmission dynamics of the most commonly diagnosed MRSA strain, UK EMRSA15 (Grundmann et al., 2010; Das et al., 2013), from multiple Swedish hospitals (Materials and methods). Key features estimated include the number of infections acquired in hospital and imported from outside, as well as the locations of individuals with a high colonization probability. Such information is crucial for designing costeffective control measures (Cooper et al., 2003; Cooper et al., 2004b; Hubben et al., 2011; Worby et al., 2013). In retrospective control experiments, decolonization of potentially colonized patients outperforms heuristic intervention strategies based on number of contacts, length of stay and contact tracing. These findings indicate that the modelinference system can inform effective, actionable and costeffective measures for reducing nosocomial transmission.
Results
Spatiotemporal features of MRSA infection
Observed incidence over a 6year outbreak period is reported in Figure 1A. The relative date of diagnosis and associated ward of each observed infection were recorded; however, information on type of ward was not provided. Previous studies indicate that heterogeneity in infection risk exists across different types of ward (Bootsma et al., 2006). For instance, patients in intensive care units (ICUs) typically suffer a higher risk of infection than those in nonICU settings due to higher patienthealthcareworker contact rates, high levels of antibiotic use, and high patient vulnerability to infection. In this study dataset, however, we observed no clear clustering of infections in certain wards. In Figure 1B, we use a raster plot to display the distribution of infections in 114 infected wards over time. Infections are distributed without noticeable clustering, presumably due to effective control measures taken in the study hospitals that maintain a low infection rate even in ICU settings (Tiemersma et al., 2004).

Figure 1—source data 1
 https://doi.org/10.7554/eLife.40977.006
Distributions of some key statistics of the patient flow in infected wards differ from those in uninfected ones (Figure 1C). Infected wards tend to have a higher number of inpatients, a longer average length of stay as well as a larger ward size. Intuitively, the number of MRSA infections in a ward should increase as patientdays within the ward increase. However, the average number of infections is not observed to increase linearly with patientdays, indicating that patientdays per ward alone cannot explain the observed patterns of infection (Figure 1—figure supplement 1). While these raw features provide a general understanding of MRSA transmission, they cannot be effectively employed to assess infection or colonization risk in a specific ward due to their largely overlapping distributions, which prevent a clear classification of risk. Instead, a quantitative analysis using mathematical modeling is needed.
The agentbased model
In hospital settings, MRSA transmission between colonized/infected patients and susceptible individuals is primarily mediated indirectly by healthcare workers (Lowy, 1998; Temime et al., 2009). As a result, accurate representation of actual contact patterns is crucial for modeling MRSA transmission. Many previous studies have formulated transmission models using ordinary differential equations (ODEs) (Cooper et al., 2004a; Kajita et al., 2007; D'Agata et al., 2009) or stochastic processes (Forrester et al., 2007; Kypraios et al., 2010). To account for heterogeneity among different settings, several studies have included multiple facilities in a singlemodel construct, incorporating prior information on facility type in order to characterize and differentiate transmission dynamics (Bootsma et al., 2006; Forrester et al., 2007). These approaches were then generalized to permit connection among institutions at different scales (hospitals, nursing homes and longterm healthcare facilities, or multiple wards or units within a facility) with timevarying contact patterns (Thomas et al., 2018).
In this work, we model nosocomial MRSA transmission using an individuallevel agentbased model (Macal et al., 2014; Assab et al., 2017). One major advantage of using an agentbased approach is that the heterogeneity of contact in different ward types can be accounted for within the model. For instance, even though we were provided no information about ward type, some of the heterogeneity among wards can be represented by the contact pattern specific to each ward, for example a longer length of stay in longterm healthcare units. Other aspects, however, are simplified; for example, due to the observed absence of infection clusters (Figure 1B), we assume a uniform transmission rate across different wards within the model. This assumption is ultimately justified by the good agreement between inferred dynamics and observations (as described later).
In the model, transmission occurs on the substrate of a timevarying contact network, which is constructed using the actual hospitalization records from 66 hospitals in Stockholm County, Sweden. In this contact network, nodes represent uniquely labeled patients, connected by undirected links among individuals sharing a ward at a given time. The rationale behind this network construction approach is that, if two patients stay in the same ward simultaneously, the shared healthcare personnel may facilitate transmission between them. The structure of the contact network is relatively stable over time, as indicated by the degree distributions of the weekly aggregated networks (Figure 1D). In particular, the degree distributions can be well fitted by a Weibull distribution, $P(k)=ab{k}^{b1}{e}^{a{k}^{b}}$, where $a=6.15\times {10}^{4}$ (95% CI: $5.96\times {10}^{4}6.34\times {10}^{4}$), $b=2.13$ (95% CI: $2.122.14$) (${R}^{2}=0.95$).
The contact network is timevarying and exhibits high spatiotemporal complexity. The daily inhospital patient number fluctuates between 4000 and 7000 during the study period (Figure 1—figure supplement 2A–B). Patient hospitalization time, readmission time, and patienttopatient contact time all follow heavytailed distributions, spanning several orders of magnitude (Figure 1—figure supplement 2C). Moreover, about 100 connected components coexist in the contact network each week (Figure 1—figure supplement 2D). Connections between different connected components change over time due to the transfer and readmission of patients. 128,119 patients were transferred from one ward to another during their stay, and another 280,506 patients were readmitted within 1 year of their previous discharge. These patient movements connect healthcare facilities that would otherwise be isolated in the network and facilitate longrange transmission across multiple hospitals. A direct consequence of this patient movement is difficulty tracking the indirect transmission path of MRSA across different hospitals. For instance, a patient located in one hospital can be involved in the transmission occurring in another when he/she moves across multiple facilities. Detailed analyses of the contact network structure and hospitalization traffic can be found in Appendix 1.
Model patients are classified into three categories: susceptible individuals who are free of MRSA ($S$), colonized individuals who carry the bacteria asymptomatically ($C$), and confirmed positive patients ($P$). The model simulates two connected dynamics: nosocomial transmission and importation from the community. Here, the community is broadly defined as all locations outside the study hospitals, and may include households and healthcare facilities not covered in the study. Within hospitals, transitions between states ($S$, $C$, $P$) are governed by parameters that help define either interaction dynamics or the progression of infection. Specifically, a susceptible individual staying in a ward with a colonized person can become MRSA colonized with transmission probability $\beta $ per day. In our model, we assume that patients within a ward have the same rate of contact with each other, presumably mediated by the shared healthcare workers in a ward. The transmission process is densitydependent, as the force of infection in a ward increases with the number of colonized patients within the ward (Begon et al., 2002). Upon colonization, asymptomatic persons can return to the susceptible state at a spontaneous decolonization rate $\alpha $, or they can test positive with an infection progression rate $p$. We assume infected patients will receive treatment, no longer spread bacteria, and return to state $S$ with a recovery rate $\mu $. Treatment is assumed to continue until infected patients are clear of MRSA. Given the exponential decay of infection probability, the characteristic treatment period is $1/\mu $ days. Note that colonization only occurs between individuals connected by a link in the contact network, whereas decolonization, infection and recovery progress spontaneously, independent of the contact network. Outside the study hospitals, the transmission process is not explicitly simulated; instead, two additional parameters are introduced to represent transmission intensity. For patients who appear for the first time in hospital, we assume they belong to states $C$ and $P$ with probability ${C}_{0}$ and ${I}_{0}$, respectively. As importation rates of colonized and infected patients depend on the timevarying MRSA prevalence outside hospitals, we assume the parameters ${C}_{0}$ and ${I}_{0}$ are timedependent. Once patients appear in the contact network, the evolution of their states follows the dynamics as defined above. After discharge, we continue tracking the progression of colonized individuals; however, transmission outside the study hospitals is not represented. The flow of individuals between categories is illustrated schematically in Figure 1E.
For a realistic scenario, diseaserelated model parameters may differ from person to person. To account for this variability during implementation, parameters, for example $\alpha $, $p$ and $\mu $, for each individual are randomly drawn from uniform ranges obtained from prior literature (Table 1). The parameter ranges are enlarged slightly to cover the values reported in these works. Our main objective is to infer the three most important parameters governing transmission dynamics: the transmission rate $\beta $, the infection importation rate ${I}_{0}$ and the colonization importation rate ${C}_{0}$.
Iterated filtering for agentbased models
To infer epidemiological parameters in an agentbased model, we adapt an iterated filtering (IF) algorithm (Ionides et al., 2006; King et al., 2008; Ionides et al., 2011). IF can be used to infer the maximum likelihood estimates (MLEs) of parameters in epidemic models and has been successfully applied to infectious diseases such as cholera (King et al., 2008) and measles (He et al., 2010). Initially developed for ODE models, IF has subsequently been generalized for other model forms (e.g. stochastic models) using the plugandplay approach (He et al., 2010). Here, we adapt IF for agentbased models, leveraging an equationfree approach (Kevrekidis et al., 2003) that allows for mapping between the systemlevel observations (e.g. weekly incidence) used for the IF and the individuallevel states evolved in the agentbased model (Appendix 1). In applying the IF, we perform multiple iterations using an efficient Bayesian filtering algorithm – the Ensemble Adjustment Kalman Filter (EAKF) (Anderson, 2001), which has been widely used in infectious disease forecast and inference (Shaman and Karspeck, 2012; Yang et al., 2015; Pei and Shaman, 2017; Pei et al., 2018a; Kandula et al., 2018). Details of the IF implementation can be found in Materials and methods.
Before applying the inference system to realworld data, we first need to validate its effectiveness. For the realworld data the inference targets are unobserved, so instead we test the inference system using modelgenerated synthetic outbreaks for which we know the exact values of the parameter. Although actual MRSA transmission dynamics cannot be fully described by the simplified agentbased model, performing synthetic tests provides validation that the inference system works if the transmission process generally follows the modelspecified dynamics.
To generate synthetic outbreak observations, we used the agentbased model to simulate weekly incidence during a oneyear period (52 weeks), and then imposed noise to produce the observations used in inference (See details in Appendix 1). We ran 20 iterations of the EAKF within the IF framework. In Figure 2A, we display the inference results for the three parameters $\beta $, ${I}_{0}$ and ${C}_{0}$ at different iterations in one realization of the IF algorithm. The blue horizontal lines mark the target values used to generate the outbreak. The orange boxes show the distribution of posterior parameters (300 ensemble members) after each iteration. The IF algorithm returns the stabilized ensemble mean as the MLEs of parameters. As a result of the stochastic nature of model dynamics and initialization of the inference algorithm, different runs of the IF algorithm usually return slightly different MLEs. To obtain the credible intervals (CIs) for the MLEs, we repeated the inference for 100 times (see Materials and methods). The inferred mean values and 95% CIs for the parameters $\beta $, ${I}_{0}$ and ${C}_{0}$ are $9.00,[8.07,9.68]\times {10}^{3}$, $1.91,[1.38,2.54]\times {10}^{3}$ and $7.18,[5.84,8.70]\times {10}^{2}$, with the actual values $\beta =9\times {10}^{3}$, ${I}_{0}=2\times {10}^{3}$ and ${C}_{0}=7.5\times {10}^{2}$. The inference system thus accurately estimates $\beta $ and ${I}_{0}$ from noisy observations, and slightly underestimates ${C}_{0}$. In its implementation, the performance of the inference system depends on the sensitivity of the observations to each parameter. In the agentbased model used here, observed incidence is less sensitive to ${C}_{0}$ due to the long period of colonization. As a consequence, estimates of ${C}_{0}$ do not always exactly match the actual target and are here biased low. Nevertheless, this slight underestimation does not significantly affect the inferred dynamics. To demonstrate this insensitivity, we ran 1000 simulations using the inferred mean parameters and obtained distributions of weekly incidence from the stochastic agentbased model. The distributions of weekly incidence (blue boxes) are compared with the observed cases (red crosses) in Figure 2B. We also evaluated the agreement between the observed and simulated incidences in Figure 2B (Figure 2—figure supplement 1; Analysis details are explained in Appendix 1). The inferred dynamics fit the observed incidence well. The Matlab code for synthetic test on an example network is uploaded as an additional file.

Figure 2—source data 1
 https://doi.org/10.7554/eLife.40977.014
We repeated the above analysis for the colonized population (Figure 2C) and found that the numbers of unobserved colonized patients can also be well estimated by the inference system. Moreover, the inference system can distinguish the number of infections transmitted in hospital and imported from outside the study hospitals (Figure 2D–E). More tests for alternate synthetic outbreaks and different observation frequencies were performed and are presented in Figure 2—figure supplements 2–5.
Inference of MRSA transmission in swedish hospitals
We next applied the inference system to the UK EMRSA15 incidence data binned every 4 weeks. Because the UK EMRSA15 transmission parameters are unlikely to remain constant over the entire 6year outbreak cycle, we inferred model parameters year by year (52 weeks). Beginning with the first year, we ran the IF inference sequentially through each year. Between 2 consecutive years, the inferred results from the previous year were used to initialize the inference system in the next (see Figure 3—figure supplement 1). In Figure 3A, we present the distributions of the key parameters $\beta $, ${I}_{0}$ and ${C}_{0}$ for each year, generated from 100 independent realizations of the IF algorithm. The parameter estimates together with the associated 95% CIs are reported in Table 2. All parameter values increased in the first 3 or 4 years, and then gradually decreased thereafter.

Figure 3—source data 1
 https://doi.org/10.7554/eLife.40977.019

Table 2—source data 1
 https://doi.org/10.7554/eLife.40977.021
The inferred parameters can be plugged back into the model to run simulations and obtain information addressing our questions of interest (see Video 1 for an example). For instance, we performed 1000 model simulations using the inferred mean parameter values, and generated distributions of incidence from the stochastic agentbased model. These distributions are compared to observations in Figure 3B. All observations fall within the whisker range of Tukey boxplots (see more analyses in Figure 3—figure supplement 2). To further explore whether some of the key observed statistics can be reproduced using the inferred parameters, we display the distribution of the number of infected wards in Figure 3C. The observed number lies at the peak of the simulated distribution (vertical dash line). The spatial distribution of infections among different wards can be characterized by the distribution of wards with a certain number of infections in an outbreak. In Figure 3D, we compare this distribution obtained from 1000 simulations with what we observed in the data (red diamonds): the observed distribution agrees well with the simulated distributions. This close matching indicates that the model structure and inferred parameters can reliably reproduce the observed outbreak pattern in both space and time (see also Figure 3—figure supplement 2). In addition to generating a good model fit, the inference system also discriminates the burdens of nosocomial transmission and infection importation. Nosocomial and imported infections are distinguished by the location of MRSA colonization: if patients acquire MRSA in hospital, they are classified as nosocomial transmission cases; otherwise they are imported cases. Figure 3E compares the distributions of both types of infections generated from 1000 simulations: a substantial number of infections are inferred as importations. In clinical practice, the number of days between hospital admission and infection is usually used to distinguish hospitalacquired from communityacquired infections, typically with 48 hr used as the threshold. We performed this classification and compared the findings with our inference result. As shown in Figure 3—figure supplement 3, the number of imported and nosocomial cases obtained from inference generally matches the classification result using days from admission to infection.
Our findings indicate that, at its onset, during the first year of the outbreak, UK EMRSA15 gradually invaded the hospital system from the community. Only sporadic nosocomial transmission occurred. With the accumulation of infected and colonized patients in the hospitals, a rise in nosocomial transmission occurred, reflected by an increase of the transmission rate $\beta $ during the third year (Figure 3A). Concurrently, both the infection and colonization importation rates, ${I}_{0}$ and ${C}_{0}$, also experienced growth. This simultaneous rise may have been caused by household transmission initiated by asymptomatically colonized patients discharged from hospitals. After this growth phase, both transmission and importation rates were suppressed. Finally, the UK EMRSA15 outbreak appears near eliminated in Swedish hospitals, represented by the inferred low values of all parameters. However, if control measures in hospital were to be relaxed, the colonized patients might spark another outbreak due to the lengthy colonization period, which highlights the need for asymptomatic colonization control in order to effect MRSA elimination (Cooper et al., 2004a).
Designing costeffective interventions
Asymptomatic colonization is a major issue hindering the control and elimination of MRSA in hospitals (Cooper et al., 2004a). Screening can identify colonized patients and evaluate the general colonization burden; however, it is an inefficient and costly measure that wastes resources that otherwise could be used to solve more urgent problems. As shown above, given the heterogeneity of contact among patients, levels of exposure to the hazard of colonization differ substantially. As a result, more efficient intervention strategies can be designed that leverage this individuallevel heterogeneity.
In Figure 4A, we display the inferred distribution of colonized patients in the Swedish hospitals over time. Colonized patient numbers peak in the middle of the record and decline thereafter. To determine who and where these highrisk individuals reside within the network, we can use the agentbased model to quantify colonization risk at the individual level. The distribution of individual colonization probability at T = 40 (week 160), generated from ${10}^{4}$ simulations using inferred parameters, is displayed in Figure 4B. A clear heavytailed powerlaw distribution $y\propto {x}^{2.13}$ emerges in which the colonization probability spans several orders of magnitude (see Figure 4—figure supplement 1 and Appendix 1 for a rigorous statistical analysis of this distribution) (Clauset et al., 2009; Muchnik et al., 2013). The complex spatiotemporal interaction patterns within the network give rise to a small number of patients with a disproportionately high risk of colonization. To examine how these individuals distribute among hospitals, we visualize the colonization probability in Figure 4C. Highrisk patients tend to appear in densely connected clusters.

Figure 4—source data 1
 https://doi.org/10.7554/eLife.40977.025
Costeffective interventions can be practiced by the targeted screen and decolonization of identified highrisk patients. In order to evaluate the effectiveness of such interventions, we performed a retrospective control experiment. Specifically, we used the inferred parameters in Figure 3A to run the model for 6 years to reproduce the outbreak. Every 4 weeks, we used currently available information (as would be available in real time) to estimate patient colonization probabilities (see details in Materials and methods). The colonization probabilities estimated in real time are highly correlated with the results obtained using information from the whole course of the epidemic, shown in Figure 4C. During the model integration, every 4 weeks, we selected patients with an estimated colonization probability higher than a certain threshold for screening. If positive, these inpatients were decolonized. To assess the impact of decolonization success rate on intervention impact, two efficiencies, 100% and 75%, were tested, and we repeated the experiment 1000 times. The findings show that the proposed intervention strategy can avert considerable numbers of colonization and infection (Figure 5A–B). Decreasing the decolonization threshold leads to a larger screened population (as shown in the inset of Figure 5B), and thus reduces colonization and infection further. However, the marginal benefit becomes negligible below a certain threshold value, as the remaining colonized and infected patients are possibly caused by importation, which cannot be directly controlled by inpatient intervention. The decolonization success rate also plays an important role, as indicated by the increased colonization and infection for the lower success rate.

Figure 5—source data 1
 https://doi.org/10.7554/eLife.40977.027
The advantage of the proposed inferencebased intervention can be better appreciated by examining its additional benefit over other heuristic control measures. Here, we compare the performance of the inferencebased intervention with three alternative screening strategies informed by patient number of contacts, length of stay and contact tracing. For the former two, at each month, we ranked patients by their current total number of contacts (i.e. cumulative number of connections in the timevarying network since admission) or length of stay in a descending order, and created the screening list using the topranked patients. By varying the fraction of patients selected from the ranking (from 0% to 5%), we can inspect the control results for different numbers of screened patients. For contact tracing, upon each observation of infection, we tracked patients who stayed in the same ward with an infected individual within a certain time window prior to the infection, and screened those possibly colonized patients in hospitals. Tracing time windows ranging from 1 day to 14 days were tested. The number of screened patients does not increase significantly with tracing times longer than 14 days. Note that, screening and decolonization are performed only within hospitals. If patients listed for screening have already discharged before the diagnosis of infection, they are screened upon their next readmission.
In Figure 5C–D, the average numbers of colonized and infected patients are compared based on the number of screened patients. Heuristic control measures relying on the number of contacts, length of stay and contact tracing all limit MRSA transmission; however, a substantial additional reduction in both colonization and infection can be achieved through inferencebased intervention. On average, inferencebased screening of approximately 0.89% (6,617/743,599) of all patients can avert up to 38% (121/315) of infections and 9% (1,610/17,810) of colonizations. In comparison, the other three methods given similar numbers of screened patients only reduced infections and colonizations by 21% and 4% (number of contacts), 27% and 6% (length of stay), and 28% and 6% (contact tracing), respectively.
The colonization probability obtained from inference quantifies individual systemic risk given the general situation of transmission, regardless of the specific location of undetected colonization. In contrast, screening based on contact tracing identifies colonized individuals related to observed infections; however, with an unknown amount of imported colonization, this approach may overlook a considerable number of colonized patients, who can sustain subsequent transmission. As a result, the inferencebased intervention can identify and treat the pivotal individuals, or superspreaders (Pei and Makse, 2013; Pei et al., 2014; Pei et al., 2017; Pei et al., 2018b; Teng et al., 2016), who may otherwise transmit MRSA asymptomatically in the first place. This preventive approach is more effective than contact tracing in the presence of frequent importation, as it disrupts probable transmission pathways. In realworld hospital settings, the proposed inferencebased intervention could be implemented and evaluated in real time: it only requires hospitalization records and ward information.
Discussion
In this work, we have developed an agentbased modelinference framework that can estimate nosocomial MRSA transmission dynamics in the presence of importation. Further, we have shown that these inferred dynamics can be used to quantify patient colonization risk and guide more effective interventions.
The transmission dynamics generated using the agentbased model are intrinsically stochastic, that is, the observed record of UK EMRSA15 infections is just one realization among an ensemble of all possible outcomes of an underlying highly stochastic process. In order to evaluate the general risk of MRSA transmission, key epidemiological parameters were inferred from the single observed realization. Previous studies have developed methods to infer transmission risk factors and reconstruct transmission paths using individuallevel infection data for diseases such as H1N1 and MERSCoV (Cauchemez et al., 2011; Cauchemez and Ferguson, 2012; Cauchemez et al., 2016). In particular, Bayesian data augmentation approaches have been applied to MRSA models (Forrester et al., 2007; Kypraios et al., 2010); however, these approaches are not readily applicable to our dataset. The data assimilation scheme we developed here enables estimation of epidemiological parameters and key transmission information using aggregated incidence data. As demonstrated in the retrospective control experiment, assessment of individual colonization risk using aggregated data can be quite useful for preventing future MRSA transmission, especially when stealth importations are frequent.
In this study, we omitted representation of heterogeneity across different wards. This simplification is valid for the study Swedish hospitals, as we observed no infection clusters and the model reproduced key statistics of observations well. However, in other settings, clustering analysis and ward information may be necessary before the application of the inference system. Should certain wards suffer a much higher rate of infection, a separate suite of parameters can be defined and inferred for these wards, using priors that better represent this more intense transmission. We also only considered transmission among patients staying in the same ward. In the future, more contact information such as healthcare workers shared by a group of patients could be incorporated into the contact network. In addition, as the community defined in the model may include nonsampled hospitals, inferred community risk may have been overestimated as it also included contributions from those healthcare facilities outside the network.
We note that the reported inference results are obtained using only hospitalization records and UK EMRSA15 case numbers. Should more data (e.g. surgery or treatment records) become available, this additional information could be incorporated into the agentbased model and used to refine the present results. Our modelinference framework provides a foundational platform for flexible simulation and inference of antibiotic resistant pathogens. In this study, we applied this system to Swedish hospitals with low MRSA prevalence. However, in the future, it could be used to provide actionable information for disease control in less developed settings where MRSA is endemic. In a highly interconnected area, transmission of antibiotic resistant pathogens from endemic regions to epidemicfree hospitals is more likely. This risk calls for containment measures in the general population and collaborative control efforts among multiple healthcare facilities (Smith et al., 2005; D'Agata et al., 2009; Ciccolini et al., 2014; Slayton et al., 2015).
Materials and methods
Data description
Request a detailed protocolThe dataset contains admission and discharge records of 743,599 distinct patients from 66 hospitals (271 clinics, 1041 wards) in Stockholm County, Sweden (Jarynowski and Liljeros, 2015; Rocha et al., 2016), spanning over 3500 continuous days during the 2000s. The exact dates and ward types are confidential for the protection of patient privacy. In total, 2,041,531 admission records were collected. The hospitalization dataset is quite comprehensive as the patients constitute over one third of the total 2.2 million population of Stockholm County. In addition, the dataset also contains individual diagnostic records of MRSA, which provide relative date of diagnosis and the strain of MRSA. Diagnosis was performed on patients with symptomatic infections as well as asymptomatic patients in contact with positive cases. A total of 991 positive cases from 172 different strains were confirmed, and the most prevalent strain was UK EMRSA15 (289 cases). UK EMRSA15 is present in 16 countries worldwide (Grundmann et al., 2010; Das et al., 2013). Here, we focus on this specific strain. Although the dataset spans over 3500 days (nearly 10 years), we limit our study to a 6year (300week) period with reported UK EMRSA15 incidence. We display time series of 4week incidence and cumulative incidence for UK EMRSA15 in Figure 1A.
Iterated filtering for agentbased models
Request a detailed protocolWe infer system epidemiological parameters using an iterated filtering (IF) algorithm (Ionides et al., 2006; King et al., 2008; Ionides et al., 2011). This algorithm has been coupled with ODE models and used to infer latent variables associated with the transmission of cholera (King et al., 2008) and measles (He et al., 2010). The IF framework is designed as follows: an ensemble of system states, which represent the distribution of parameters, are repeatedly adjusted using filtering techniques in a series of iterations, during which the variance of the parameters is gradually tuned down. In the process, the distribution of parameters is iteratively optimized per observations and narrowed down to values that achieve maximum likelihood. This approach is based on an analytical proof that guarantees its convergence under mild assumptions (Ionides et al., 2006).
In its original implementation, the data assimilation method used in IF is sequential Monte Carlo, or particle filtering (Arulampalam et al., 2002). Here, due to the high computational cost of the agentbased model, we use a different efficient data assimilation algorithm  the Ensemble Adjustment Kalman Filter (EAKF) (Anderson, 2001). Unlike particle filtering, which requires a large ensemble size (usually of the order O(${10}^{4}$) or higher) (Snyder et al., 2008), the EAKF can generate results similar in performance using only hundreds of ensemble members (Shaman and Karspeck, 2012). Originally developed for use in weather prediction, the EAKF assumes a Gaussian distribution of both the prior and likelihood, and adjusts the prior distribution to a posterior using Bayes rule in a deterministic way such that the first two moments (mean and variance) of an observed variable are adjusted while higher moments remain unchanged during the update (Anderson, 2001). In epidemiological studies, the EAKF has been widely used for parameter inference and forecast of infectious diseases (Shaman and Karspeck, 2012; Yang et al., 2015; Pei and Shaman, 2017; Pei et al., 2018a; Kandula et al., 2018). The implementation details of the EAKF are introduced in Appendix 1.
In this study, we focus on the inference of three transmissionrelated parameters: the nosocomial transmission rate $\beta $, the infection importation rate ${I}_{0}$ and the colonization importation rate ${C}_{0}$. The initial prior ranges for these parameters are reported in Table 1. Other diseaserelated parameters, for example the spontaneous decolonization rate $\alpha $, the infection progress rate $p$, and the recovery rate $\mu $, are drawn uniformly from ranges obtained from previous studies for each individual in the agentbased model (see Table 1). Should more specific information about these parameters become available, it may be possible in the future to better constrain the model with their incorporation into the system. In synthetic testing of the IFEAKF algorithm, we use weekly incidence as observations. Given the parameter vector, $\mathbf{\mathbf{z}}={(\beta ,{I}_{0},{C}_{0})}^{T}$, the IFEAKF algorithm proceeds per the pseudocode in Algorithm 1. During the EAKF update, only the parameters $\beta $, ${I}_{0}$ and ${C}_{0}$ were adjusted; the microscopic state ($S$, $C$ or $P$) in each ensemble member was set as the state at the end of previous time step and was not adjusted. Detailed explanation of the IFEAKF system is provided in Appendix 1.
In each iteration of the IF, the standard deviation of each parameter is shrunk by a factor $a\in (0,1)$ (or equivalently, the variance is discounted by a factor of ${a}^{2}$). In practice, the discount factor $a$ can range between 0.9 and 0.99 (Ionides et al., 2006). If $a$ is too small, the algorithm may ‘quench’ too fast and fail to find the MLE; if it is too close to 1, the algorithm may not converge in a reasonable time interval. We stop the IF algorithm once the estimates of the ensemble mean stabilize. The number of iterations required for this convergence was determined by inspecting the evolution of posterior parameter distributions, as in Figure 2A. Note that once the ensemble mean stabilizes, increasing the iteration time will not affect the MLE, although it can lead to a further narrowing of the ensemble distribution.
Algorithm 1 only returns the MLEs for the parameters; however, it is also desirable to obtain CIs for those MLEs. For deterministic ODE models, Ionides et al. used ‘sliced likelihood’ to numerically estimate the Fisher information and standard errors (SEs) of MLEs (Ionides et al., 2006). Here, for a highly stochastic system, evaluating the Fisher information numerically is challenging. As a result, we took another approach by running multiple realizations of the IF algorithm. In different runs, the MLEs are slightly different due to stochasticity in the agentbased model and in the initialization of the inference algorithm. In this work, we ran 100 independent realizations to generate the average MLEs of inferred parameters and their corresponding 95% CIs. Results from synthetic tests indicate that this approach is effective in calculating MLEs and quantifying their uncertainties.
Algorithm 1. IF–EAKF 

Input: An agentbased model $\mathcal{M}$ in a timevarying contact network $G(V,E,t)$, the number of observations $T$, incidence $\{{o}_{t}\}$, the observational error variance (OEV) $\{{\sigma}_{t,o}^{2}\}$, the initial system parameters ${\overline{\mathbf{\mathbf{z}}}}^{0}={(\beta ,{I}_{0},{C}_{0})}^{T}$, the initial covariance matrix $\mathrm{\Sigma}$, a discount factor $a\in (0,1)$, and the number of iterations $L$. for $l=1$ to $L$ do Generate an ensemble of parameter vectors with $n$ members using a multivariate Gaussian distribution: ${\{{\widehat{\mathbf{\mathbf{z}}}}_{0}^{l}\}}_{n}\sim \mathcal{N}({\overline{\mathbf{\mathbf{z}}}}^{l1},{a}^{2(l1)}\mathrm{\Sigma})$. for $t=1$ to $T$ do Run the agentbased model $\mathcal{M}$ with posterior ${\{{\widehat{\mathbf{\mathbf{z}}}}_{t1}^{l}\}}_{n}$ obtained from last update for one week, and return the ensemble of incidence: ${\{{o}_{t}^{l}\}}_{n}=\mathcal{M}(G,{\{{\widehat{\mathbf{\mathbf{z}}}}_{t1}^{l}\}}_{n})$. Update the prior distribution of parameters $\beta $, ${I}_{0}$ and $C}_{0}:{\left\{{\mathbf{z}}_{t}^{l}\right\}}_{n}\equiv \{{\hat{\mathbf{z}}}_{t1}^{l}{\}}_{n$ to posterior ${\{{\widehat{\mathbf{\mathbf{z}}}}_{t}^{l}\}}_{n}$ using the EAKF, ${\{{o}_{t}^{l}\}}_{n}$, $\{{\sigma}_{t,o}^{2}\}$ and $\{{o}_{t}\}$. Individual states are evolved per the agentbased model and are not updated by the EAKF. end for Calculate the ensemble mean of the posterior over time as the input in next iteration: ${\overline{\mathbf{z}}}^{l}=\sum _{t}E(\{{\hat{\mathbf{z}}}_{t}^{l}{\}}_{n})/T$, where $E$ computes the ensemble mean. end for Output: ${\overline{\mathbf{\mathbf{z}}}}^{L}$ as the MLE of the parameter vector. 
An alternative method to infer posterior parameters is to use Approximate Bayesian computation (ABC) (Beaumont et al., 2002). ABCbased methods employ numerical simulations to approximate the likelihood function, in which the simulated samples are compared with the observed data. In a typical ABC rejection algorithm, large numbers of parameters are sampled from the prior distribution. For each set of parameters, the distance between simulated samples (generated using the parameters) and observed data is calculated. Parameters resulting in a distance larger than a certain tolerance are rejected, and the retained parameters form the posterior distribution. ABC methods can fully explore the likelihood landscape in parameter space. However, it requires large numbers of simulations, which may be prohibitive for the largescale agentbased models considered here. In addition, a good choice of the tolerance in the rejection algorithm is needed. The IF algorithm, instead, is applicable to computationally expensive agentbased models, but may become trapped in the local optimum of the posterior distribution. In practice, this problem can be alleviated by exploring a larger prior parameter space and setting a slower quenching speed, that is, a smaller discount factor $a$.
Inferred parameters and 95% CIs for three synthetic tests
Request a detailed protocolWe report the inferred parameters and their corresponding 95% CIs for the synthetic tests in Table 3. The actual parameters used to generate the synthetic outbreaks are also reported. Results are obtained from 100 independent realizations of the IF algorithm.
Inferencebased intervention
Request a detailed protocolTo guarantee a fair comparison between the inferencebased intervention and other heuristic strategies, we estimated the colonization probability using only realtime information available before control measures are effected. For instance, to estimate the colonization probability at the fifth month in the third year, we first infer the model parameters for the first 2 years, where we have data from the whole year, and then use the partial observation in the remaining 5 months to infer the model parameters for the third year. The inferred parameters are then used to generate 1000 synthetic outbreaks from the beginning, and the current colonization probability for each individual is calculated from these simulations. In the inset of Figure 5A, we show that the colonization probability estimated in real time is highly correlated with that obtained using information from the entire outbreak record. In practice, every 4 weeks, the estimated colonization probability and the decolonization list were updated. The inferencebased intervention only uses information available at the time control measures are effected. As a consequence, it is a practical method that can be implemented in real time.
Appendix 1
Analysis of the hospitalization traffic
We performed an analysis of the admission and discharge traffic in the study hospitals. In the hospitals, the inpatient population changed on a daily basis. To quantify the speed of patient renewal, we define a patient overlap ratio at day $t$ as $Q(t)=Z(t1)\cap Z(t)/Z(t1)$, where $Z(t)$ is the set of patients in the hospital at day $t$, and $Z(t)$ is the number of those patients. In Figure 1—figure supplement 2A, we present the evolution of $Q(t)$ during a period of 300 days. $Q(t)$ exhibits a somewhat periodic behavior with a period of 7 days (see the inset of Figure 1—figure supplement 2A). This is possibly due to reduced patient traffic during weekends.
We next examined the total number of patients in the study hospitals. As shown in Figure 1—figure supplement 2B, the inhospital patient number fluctuates between 4000 and 7000. Patient numbers exhibit a periodic behavior at an annual timescale, as well as at finer weekly timescale. We also present the number of new patients (with respect to the patients present the previous day) in the hospitals each day in Figure 1—figure supplement 2B. The number of new patients is relatively small compared with the total patients.
The distribution of patient time in hospital follows a powerlaw shape with a heavy tail (see Figure 1—figure supplement 2C). Such heterogeneity leads to high spatiotemporal contact network complexity. In fact, the contact time between all pairs of patients follows a similar powerlaw distribution, as shown in the upper inset of Figure 1—figure supplement 2C. The readmission time, that is, individual patient time between discharge and next admission, is a key parameter in MRSA transmission models. The lower inset of Figure 1—figure supplement 2C indicates that this readmission time is also quite heterogeneous, spanning from several days to up to a few years.
Next we examined the topological features of the weekly aggregated contact network in Figure 1—figure supplement 2D, which shows the number of patients in the giant connected component (GCC) and the entire contact network. While most patients belong to the GCC, there also exist many fragmented small connected components (CCs). The total number of CCs in the contact network is also presented in Figure 1—figure supplement 2D. About 100 CCs coexist in the network each week, but the size of small CCs is usually below 200, as shown in the inset of Figure 1—figure supplement 2D. Connections between different CCs change over time due to the transfer and readmission of patients. 128,119 patients were transferred from one ward to another during their stay, and another 280,506 patients were readmitted within one year of their previous discharge. These patient movements connect healthcare facilities that would otherwise be isolated in the network and are responsible for longrange transmission across multiple hospitals.
Given the large heterogeneity in the network structure and contact time, a traditional compartmental model using ordinary differential equations (ODE) may not adequately capture actual transmission dynamics. Therefore, in this study we adopt an individuallevel agentbased model.
EquationFree approach
Instead of using a parsimonious ordinary differential equation model, we employ an agentbased model to account for the spatiotemporal complexity of the underlying contact patterns. In particular, agentbased models can be used to simulate epidemic spread using an EquationFree approach (Kevrekidis et al., 2003). The transmission process evolves following microscopic update rules defined at the individuallevel, and macroscopic states are aggregated from the total simulated population.
The EquationFree approach has been widely used for multiscale modeling in applied mathematics and statistical physics. It consists of three basic elements: (1), lift, which transforms macroscopic observations through lifting to one or more consistent microscopic realizations; (2), evolve, which uses the microscopic simulator to evolve these realizations for a given time; and (3), restrict, which aggregates the evolved microscopic realizations to obtain the macroscopic observation.
In the MRSA transmission model, some quantities, for example colonization importation, infection importation, and weekly incidence, are macroscopic values aggregated from the individuallevel states. In model simulation, we first need to lift these macroscopic quantities to consistent microscopic realizations. To do this, we maintained multiple realizations (300 ensemble members) of individuallevel states. For each new patient entering the hospitals, a random number $r$ was generated from a uniform distribution $\upsilon \sim U[0,1]$. If $\upsilon \le {I}_{0}$, ${I}_{0}\le \upsilon \le {I}_{0}+{C}_{0}$, or $\upsilon \ge {I}_{0}+{C}_{0}$, and used to designate the new patient as infected, colonized or susceptible, respectively. This lifting procedure was performed for all realizations and produced an ensemble of possible microscopic states. These individuallevel states were then evolved following the rules defined in the model over the timevarying contact network $G(V,E,t)$. The model estimate of the observed state, that is 4 week incidence, was obtained by aggregating the total number of new infections across the entire population in the study hospitals. This aggregated, macroscopic state is then used in conjunction with the EAKF algorithm to update the parameter vector $\mathbf{\mathbf{z}}={(\beta ,{I}_{0},{C}_{0})}^{T}$ (see Materials and methods in main text). This multiscale method enables systemlevel analysis directly from microscopic simulations, which bypasses the need to derive macroscopic evolution equations.
The Ensemble Adjustment Kalman Filter
To represent the statespace distribution, the EAKF maintains an ensemble of system state vectors acting as samples from the distribution. In particular, the EAKF assumes that both the prior distribution and likelihood are Gaussian, and thus can be fully characterized by their first two moments, that is mean and covariance. The update scheme for ensemble members is computed using Bayes’ rule (posterior $\propto $ prior $\times $ likelihood) via the convolution of the two Gaussian distributions. For observed state variables, the posterior of the $i}^{\mathrm{t}\mathrm{h}$ ensemble member is updated through
Here ${o}_{t,post}^{i}$ and ${o}_{t,prior}^{i}$ are the posterior and prior of the observed variable for the $i}^{\mathrm{t}\mathrm{h}$ ensemble member at time $t;{\overline{o}}_{t,prior}$ is the mean of the prior observed variable; ${\sigma}_{t,obs}^{2}$ and ${\sigma}_{t,prior}^{2}$ are the variances of the observation and the prior observed variable; and ${o}_{t}$ is the observation at time $t$. Unobserved variables and parameters are updated through their covariability with the observed variable, which can be computed directly from the ensemble. In particular, the $i}^{\mathrm{t}\mathrm{h}$ ensemble member of unobserved variable or parameter ${x}^{i}$ is updated by
Here ${x}_{t,post}^{i}$ and ${x}_{t,prior}^{i}$ are the posterior and prior of the unobserved variable or parameter for the $i}^{\mathrm{t}\mathrm{h}$ ensemble member at time $t$; and $\sigma ({\{{x}_{t,prior}\}}_{n},{\{{o}_{t,prior}\}}_{n})$ is the covariance between the prior of the unobserved variable or parameter ${\{{x}_{t,prior}\}}_{n}$ and the prior of the observed variable ${\{{o}_{t,prior}\}}_{n}$ at time $t$. In the EAKF, variables and parameters are updated deterministically so that the higher moments of the prior distribution are preserved in the posterior.
Synthetic tests
To generate synthetic outbreak observations, we used the agentbased model to simulate weekly incidence during a oneyear period (52 weeks), and then imposed noise to produce the observations used in inference. The synthetic outbreak tested in Figure 2 was generated with the parameters $\beta =9\times {10}^{3}$, ${I}_{0}=2\times {10}^{3}$ and ${C}_{0}=7.5\times {10}^{2}$. To mimic actual observational error variance, we assumed an observational error variance (OEV) of ${\sigma}_{t,o}^{2}=10+{(0.1\times {o}_{simulate}(t))}^{2}$, which is a baseline uncertainty plus a term related to the simulated incidence ${o}_{simulate}(t)$. In reality, because we have only one data point at each observation time point, the variance of observed incidence is unknown. As such, we have to use a heuristic OEV in the inference algorithm. The above form of OEV has been successfully used in realtime influenza forecast (Shaman and Karspeck, 2012; Pei and Shaman, 2017; Pei et al., 2018a; Kandula et al., 2018), and also produces satisfactory performance in the following synthetic tests. The observation used in the inference was drawn from a Gaussian distribution $o(t)\sim \mathcal{N}({o}_{simulate}(t),{\sigma}_{t,o}^{2})$. The initial system state ${\overline{\mathbf{\mathbf{z}}}}^{0}$ was drawn from the following ranges $\beta \in [0,0.001]$, ${I}_{0}\in [0,0.003]$ and ${C}_{0}\in [0,0.1]$, using Latin Hypercube Sampling (LHS) (Tang, 1993). For simplicity, the initial covariance matrix was assumed to be diagonal $\mathrm{\Sigma}=diag({({\mathbf{\mathbf{z}}}_{max}{\mathbf{\mathbf{z}}}_{min})}^{2}/16)$, where ${\mathbf{\mathbf{z}}}_{max}$ and ${\mathbf{\mathbf{z}}}_{min}$ are the vectors of the upper and lower bounds of parameters $\beta $, ${I}_{0}$ and ${C}_{0}$. In each iteration, the covariance matrix was contracted by a factor of ${a}^{2}$ (equivalent to a reduction of the standard deviation by a factor of $a$). We used a discount factor of $a=0.9$ and terminated the algorithm at $L=20$ iterations. For the EAKF, $n=300$ ensemble members were used.
We validated the IFEAKF inference framework for different synthetic scenarios. Figure 2 presents the synthetic situation where nosocomial transmission accounts for the majority of incidence (see Figure 2D–E). To evaluate the goodness of fit for incidence number in Figure 2B, we performed the following statistical analysis. As the agentbased model is a highly stochastic system, the observed incidence in Figure 2B is only one possible outcome of the actual dynamics, whereas in our analysis, the stochasticity of incidence number needs to be considered. To this end, we compared several summary statistics quantifying the goodness of fit in Figure 2B with their distributions calculated from synthetic outbreaks (surrogate data) generated from the inferred dynamics. We first considered the log likelihood of observations. In particular, we generated 1000 synthetic outbreaks using the inferred parameters, and approximated the distribution of incidence number at each week. Then we calculated the log likelihood for the observed incidence in each synthetic outbreak, and estimated its distribution using these 1000 log likelihood values computed from the surrogate data. In Figure 2—figure supplement 1A, we compared the log likelihood computed from Figure 2B (vertical red line) with this distribution (blue bars) and calculated the 2sided pvalue. The pvalue is well above zero, indicating that, in terms of log likelihood, our inferred dynamics span and thus agree well with the observed incidence. In other words, the observed incidence in Figure 2B is a typical outcome from our inferred dynamics. The same analysis was also applied to rootmeansquare error (RMSE), coefficient of determination (${R}^{2}$) and Pearson correlation coefficient (Figure 2—figure supplement 1B–D). The RMSE, ${R}^{2}$ and Pearson correlation coefficient were calculated using the incidence time series in each synthetic outbreak and the mean incidence time series averaged over 1000 simulations.
For the opposite situation in which nosocomial transmission is less than importation, we performed the same test. In this case, we set $\beta =6\times {10}^{3}$, ${I}_{0}=2\times {10}^{3}$ and ${C}_{0}=7.5\times {10}^{2}$. The distributions of posterior parameters after each iteration (blue boxes) shown in Figure 2—figure supplement 2A are gradually adjusted to their targets (red horizontal lines). We ran 100 independent realizations of the inference, and report the inferred values and 95% CIs for the parameters $\beta $, ${I}_{0}$ and ${C}_{0}$ in Table 3. Additionally, weekly incidence, colonized population, and nosocomial and imported infections can be generally reproduced with the inferred parameters (see Figure 2—figure supplement 2B–E). The goodness of fit in Figure 2—figure supplement 2B is analyzed in Figure 2—figure supplement 3.
We finally tested the effect of observation frequency. In the actual diagnostic data from the Swedish hospitals, weekly incidence is very low. To account for the large uncertainty in weekly observation, we instead use 4 week incidence. In Figure 2 and Figure 2—figure supplement 2, we used weekly observations; in Figure 2—figure supplement 4, we assimilated 4 week incidence for a synthetic outbreak generated with the same parameter setting as in Figure 2. The findings indicate that this change of observational frequency does not affect the performance of the inference system. The inferred values and 95% CIs for the parameters $\beta $, ${I}_{0}$ and ${C}_{0}$ are reported in Table 3. The goodness of fit in Figure 2—figure supplement 4B is analyzed in Figure 2—figure supplement 5.
Inference using the actual diagnostic data
Beginning with the first year, we ran the IF inference sequentially through each year. The initial ranges of parameters are set as $\beta \in [,0.01]$, ${I}_{0}\in [0,0.001]$ and ${C}_{0}\in [0,0.1]$. The initial covariance matrix is set as $\mathrm{\Sigma}=diag({({\mathbf{\mathbf{z}}}_{max}{\mathbf{\mathbf{z}}}_{min})}^{2}/16)$, where ${\mathbf{\mathbf{z}}}_{max}$ and ${\mathbf{\mathbf{z}}}_{min}$ are the vectors of the upper and lower bound of parameters $\beta $, ${I}_{0}$ and ${C}_{0}$. In EAKF, the OEV is defined as ${\sigma}_{t,o}^{2}=1+{(0.1\times {o}_{t})}^{2}$, where ${o}_{t}$ is the observed incidence at time $t$. In each year, we ran $L=20$ iterations using a discount parameter of $a=0.9$. Between two consecutive years, we used the ensemble of inferred parameters and microscopic states of individuals in the previous year to initialize the inference system in the next year in order to maintain the continuity of the transmission dynamics. The evolution of the posterior parameter distributions for 20 iterations through the 6 year record is shown in Figure 3—figure supplement 1. In general, the ensemble means of parameters become stable after 10 iterations with the EAKF through the record, which means our choice of $L=20$ is sufficient. Note that increasing $L$ will lead to narrower distributions of posterior parameters in Figure 3—figure supplement 1, but will not affect the MLEs, that is the ensemble mean. We repeated the inference 100 times and obtained the inferred parameters and corresponding 95% CIs reported in Table 2. We further performed an evaluation of the goodness of fit in Figure 3—figure supplement 2. All statistics have pvalues well above zero. This implies the observed incidence in Figure 3B is a plausible outcome from the inferred dynamics.
Following inference, we used the mean estimated parameters to simulate outbreaks. Key information, for example incidence (Figure 3B), the infections transmitted in hospital and imported from outside the study hospitals (Figure 3E), the colonized population (Figure 4A), and the individuallevel colonization rate (Figure 4B–C) were obtained from these simulations.
Statistical test of powerlaw distribution
We used a maximum likelihood estimator to fit and validate the powerlaw data in Figure 4B. For computational convenience, we fit the count of infections of each individual among ${10}^{4}$ simulations (that is, we multiply the frequency in Figure 4B by 10^{4}). Denote the count data by $X=({x}_{1},{x}_{2},\mathrm{\cdots},{x}_{n})$, and suppose the data satisfy a powerlaw distribution $P({x}_{i})\propto {x}_{i}^{\gamma}$. Usually, empirical data follow a powerlaw behavior above a lower bound: $P({x}_{i})\propto {x}_{i}^{\gamma}$ for ${x}_{i}\ge {x}_{min}$ (Clauset et al., 2009). For a given ${x}_{min}$, the maximum likelihood estimator of the powerlaw exponent $\gamma $ for discrete data is $\hat{\gamma}=1+n[{\sum}_{i=1}^{n}\mathrm{ln}({x}_{i}/{x}_{min}1/2){]}^{1}$ (Clauset et al., 2009). The standard error on $\widehat{\gamma}$ is estimated by $\sigma =(\widehat{\gamma}1)/\sqrt{(}n)+O(1/n)$. To find the best lower bound, ${x}_{min}$ ranging from 0 to 100 was tested. The best ${x}_{min}$ was selected by choosing the value that minimizes the KolmogorovSmirnov (KS) statistic between the data and the fitted model. KS statistic is the maximum distance between two cumulative distribution functions (CDFs): $D={\mathrm{max}}_{x\ge {x}_{min}}S(x)P(x)$, where $S(x)$ is the CDF of the data larger than ${x}_{min}$, and $P(x)$ is the CDF of the powerlaw model obtained from MLE. In Figure 4—figure supplement 1A, the change of KS statistic for different ${x}_{min}$ values is reported. We choose the best ${x}_{min}$ value as 24. The fitted powerlaw exponent is $\widehat{\gamma}=2.13$ with a standard deviation of 0.12.
To test the statistical significance of the powerlaw fitting, we generated ${10}^{4}$ synthetic datasets using the fitted model, computed their KS statistic with respect to the model, and compared the obtained distribution with the KS statistic of the data in Figure 4—figure supplement 1B. If the pvalue of the observed KS statistic is close to 0, we reject the hypothesis that the data are generated from the fitted model; otherwise, the powerlaw fitting is statistically significant. This approach has been widely used in testing the statistical significance of powerlaw fitting (Clauset et al., 2009; Muchnik et al., 2013). The 2sided pvalue for our data is 0.5374.
Data availability
The dataset was approved for use in our study by the Regional Ethical Review Board in Stockholm (Record Number 2004/5:8). The data used in this article is a completely anonymized subset of the original data material, and keys to exact dates and identities of hospitals and individuals were not provided. While the data are deidentified, the owner of the data, Stockholm County Council, will not allow public sharing of the dataset. Interested researchers can verify the dataset at Stockholm University, and we, the authors, will do our best to help such interested researchers contact the owner of the data. Requests for data access should be addressed to FL (fredrik.liljeros@sociology.su.se). The Matlab code for parameter inference in a synthetic MRSA outbreak simulated in an example timevarying contact network is provided as Source code file 1. Numerical data for Figures 15 are also provided.
References

An ensemble adjustment kalman filter for data assimilationMonthly Weather Review 129:2884–2903.https://doi.org/10.1175/15200493(2001)129<2884:AEAKFF>2.0.CO;2

A tutorial on particle filters for online nonlinear/nongaussian bayesian trackingIEEE Transactions on Signal Processing 50:174–188.https://doi.org/10.1109/78.978374

Mathematical models of infection transmission in healthcare settings: recent advances from the use of network structured dataCurrent Opinion in Infectious Diseases 30:410–418.https://doi.org/10.1097/QCO.0000000000000390

Approximate Bayesian computation in population geneticsGenetics 162:2025–2035.

A clarification of transmission terms in hostmicroparasite models: numbers, densities and areasEpidemiology and Infection 129:147–153.https://doi.org/10.1017/S0950268802007148

Controlling methicillinresistant Staphylococcus aureus: Quantifying the effects of interventions and rapid diagnostic testingProceedings of the National Academy of Sciences 103:5620–5625.https://doi.org/10.1073/pnas.0510077103

Methods to infer transmission risk factors in complex outbreak dataJournal of the Royal Society Interface 9:456–469.https://doi.org/10.1098/rsif.2011.0379

The changing epidemiology of Staphylococcus aureus?Emerging Infectious Diseases 7:178–182.https://doi.org/10.3201/eid0702.010204

Modeling the Invasion of Community‐Acquired Methicillin‐ResistantStaphylococcus aureusinto HospitalsClinical Infectious Diseases : An Official Publication of the Infectious Diseases Society of America 48:274–284.https://doi.org/10.1086/595844

Nasal carriage of epidemic methicillinresistant Staphylococcus aureus 15 (EMRSA15) clone observed in three chicagoarea longterm care facilitiesAntimicrobial Agents and Chemotherapy 57:4551–4553.https://doi.org/10.1128/AAC.0052813

Mathematical modelling: a tool for hospital infection controlThe Lancet Infectious Diseases 6:39–45.https://doi.org/10.1016/S14733099(05)70325X

Plugandplay inference for disease dynamics: measles in large and small populations as a case studyJournal of the Royal Society Interface 7:271–283.https://doi.org/10.1098/rsif.2009.0151

Nosocomial transmission of communityassociated methicillinresistant Staphylococcus aureus in danish hospitalsJournal of Antimicrobial Chemotherapy 67:1775–1780.https://doi.org/10.1093/jac/dks125

Inference for nonlinear dynamical systemsPNAS 103:18438–18443.https://doi.org/10.1073/pnas.0603181103

National prevalence of methicillinresistant Staphylococcus aureus in inpatients at United States health care facilities, 2010American Journal of Infection Control 40:194–200.https://doi.org/10.1016/j.ajic.2012.02.001

ConferenceContact networks and the spread of mrsa in stockholm hospitals2015 Second European Network Intelligence Conference. pp. 150–154.https://doi.org/10.1109/ENIC.2015.30

Methicillinresistant Staphylococcus aureus: the European landscapeJournal of Antimicrobial Chemotherapy 66:iv43–iv48.https://doi.org/10.1093/jac/dkr076

Modelling an outbreak of an emerging pathogenNature Reviews Microbiology 5:700–709.https://doi.org/10.1038/nrmicro1660

Evaluation of mechanistic and statistical methods in forecasting influenzalike illnessJournal of the Royal Society Interface 15:20180174.https://doi.org/10.1098/rsif.2018.0174

EquationFree, CoarseGrained multiscale computation: enabling mocroscopic simulators to perform systemlevel analysisCommunications in Mathematical Sciences 1:715–762.https://doi.org/10.4310/cms.2003.v1.n4.a5

Hospitalizations and deaths caused by methicillinresistant Staphylococcus aureus, United States, 19992005Emerging Infectious Diseases 13:1840–1846.https://doi.org/10.3201/eid1312.070629

Epidemiology of MRSA in southern Sweden: strong relation to foreign country of origin, health care abroad and foreign travelEuropean Journal of Clinical Microbiology & Infectious Diseases 33:61–68.https://doi.org/10.1007/s1009601319292

Staphylococcus aureus infectionsThe New England Journal of Medicine 339:520–532.https://doi.org/10.1056/NEJM199808203390806

Multistate pointprevalence survey of health careassociated infectionsNew England Journal of Medicine 370:1198–1208.https://doi.org/10.1056/NEJMoa1306801

Spread of hospitalacquired infections: A comparison of healthcare networksPLOS Computational Biology 13:e1005666.https://doi.org/10.1371/journal.pcbi.1005666

Interindividual Contacts and Carriage of MethicillinResistant Staphylococcus aureus: A Nested CaseControl StudyInfection Control & Hospital Epidemiology 36:922–929.https://doi.org/10.1017/ice.2015.89

Detailed contact data and the dissemination of Staphylococcus aureus in hospitalsPLOS Computational Biology 11:e1004170.https://doi.org/10.1371/journal.pcbi.1004170

Contribution of mathematical modeling to the fight against bacterial antibiotic resistanceCurrent Opinion in Infectious Diseases 24:279–287.https://doi.org/10.1097/QCO.0b013e3283462362

Spreading dynamics in complex networksJournal of Statistical Mechanics: Theory and Experiment 2013:P12002.https://doi.org/10.1088/17425468/2013/12/P12002

Theories for influencer identification in complex networks.Complex Spreading Phenomena in Social Systems pp. 125–148.https://doi.org/10.1007/9783319773322_8

Vital signs: estimated effects of a coordinated approach for action to reduce antibioticresistant infections in health care facilities  United StatesMMWR. Morbidity and Mortality Weekly Report 64:826–831.https://doi.org/10.15585/mmwr.mm6430a4

Obstacles to highdimensional particle filteringMonthly Weather Review 136:4629–4640.https://doi.org/10.1175/2008MWR2529.1

Epidemiology of methicillinresistant staphylococci in EuropeClinical Microbiology and Infection 9:1179–1186.https://doi.org/10.1111/j.14690691.2003.00698.x

Imported methicillinresistant Staphylococcus aureus, SwedenEmerging Infectious Diseases 16:189–196.https://doi.org/10.3201/eid1602.081655

Orthogonal arraybased latin hypercubesJournal of the American Statistical Association 88:1392–1397.https://doi.org/10.1080/01621459.1993.10476423

Extended models for nosocomial infection: parameter estimation and model selectionMathematical Medicine and Biology: A Journal of the IMA 35:i29–i49.https://doi.org/10.1093/imammb/dqx010

Methicillinresistant Staphylococcus aureus in Europe, 19992002Emerging Infectious Diseases 10:1627–1634.https://doi.org/10.3201/eid1009.040069

Communityacquired meticillinresistant Staphylococcus aureus: an emerging threatThe Lancet Infectious Diseases 5:275–286.https://doi.org/10.1016/S14733099(05)701122
Decision letter

Ben CooperReviewing Editor; Mahidol Oxford Tropical Medicine Research Unit, Thailand

Prabhat JhaSenior Editor; Saint Michael's Hospital, Canada
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Inference and control of the nosocomial transmission of Methicillinresistant Staphylococcus aureus" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Ben Cooper as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Reviewing Editor and Prabhat Jha as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Martin Bootsma (Reviewer #2); Lulla Opatowski (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This work makes both an important methodological contribution to the communicable disease modelling literature, and also an important practical contribution to the study of nosocomial infections. First, the authors develop a novel inference framework allowing inferences from individual based transmission models (overcoming computational problems with existing particle filtering approaches). Second, the methods are applied to make inferences about parameters for a multiinstitutional MRSA outbreak across a network of hospitals in Sweden and simulations are used to show how such modelbased inferences could translate into improved control measures for MRSA.
Essential revisions:
1) Clarification of exact nature of the MRSA data is needed. Does this only come from clinical infections or was there also screening on asymptomatic carriage?
2) Figure 1B is difficult to read and should be revised. Since only integers (05) are plotted, the continuous scale may not be the best choice and makes it difficult to pick out what numbers are being plotted.
3) Enough information needs to be presented to allow an interested reader to repeat this analysis on their own data. In particular, the following aspects of the methods were not clear to any of the reviewers:
i) In Algorithm 1, why is the discount factor raised to the power 2(l1), rather than just (l1). Is the 2 arbitrary?
ii) How should the discount factor a be chosen, and how do different values of a influence the algorithm? A value of 0.9 is used, but not explanation is given for this choice.
iii) Algorithm 1 returns the MLE of the parameter vector but makes no attempt to measure the associated uncertainty. Indeed, the variance of the parameter vector ensemble approaches zero as L increases. This is also seen in Appendix 1, Figure 4A. However, distributions for parameters are reported in Figure 3A, and the methods say that "parameter distributions become stable after 5 iterations…which means our choice of L=10 is sufficient". So how are the distributions in Figure 3A derived? Are these just taken from the ensemble of parameter vectors stopping after 10 iterations (implying that if L had been taken as 20 or more, the distributions shown in Figure 3A would have had much lower variance?) Surely it is important to produce not only point estimates, but also appropriate measures of uncertainty associated with these; it is not clear if this is being done and justification of the choice of L=10 in the data analysis (subsection “Inference using the actual diagnostic data”) compared with L=20 in the synthetic tests (subsection “Synthetic tests”) is lacking.
iv) In Algorithm 1 the actual EAKF step is not explained. Given that space is not a limitation (and technical details can be put int he appendix), it would be helpful to provide details of this EAKF step with sufficient detail to enable replication by the interested reader.
4) Clarification of model assumptions is needed. In particular, it is unclear whether the increased decolonization rate μ was only present during treatment, and that afterwards the spontaneous decolonization rate α kicks in, or whether treated patients keep their increased rate μ also after the treatment is stopped. Linked to this, what is the duration of treatment?
5) An explicit unambiguous description of the transmission process within a ward is lacking. Is the process densitydependent or frequencydependent?
6) Units are missing in Table 1.
7) A table with all parameters estimates is missing.
8) Figure 5AB legend is confusing as the yaxis doesn't seem to depict a reduction but cumulative incidence. Clarification is needed.
9) Can more formal assessment of model fits with synthetic and real data be provided in addition to the plots in Figure 2 and Figure 3?
10) Figure 5 and associated text: it is unclear whether the model based control measures uses only data/information that would be available at the time point control measures are put in place. If the inference approach instead uses information from the whole course of the epidemic, then this would be an unfair comparison. This needs clarification.
11) Subsection “Statistical test of powerlaw distribution”: "the significance level of 0.05" Following the recent ASA statement on the use (and misuse of pvalues) there is a consensus that we should be striving to move away from such "bright lines". So fine to report the pvalue, but we should try to avoid saying that. 049 is significant while. 051 isn't.
https://doi.org/10.7554/eLife.40977.033Author response
Summary:
This work makes both an important methodological contribution to the communicable disease modelling literature, and also an important practical contribution to the study of nosocomial infections. First, the authors develop a novel inference framework allowing inferences from individual based transmission models (overcoming computational problems with existing particle filtering approaches). Second, the methods are applied to make inferences about parameters for a multiinstitutional MRSA outbreak across a network of hospitals in Sweden and simulations are used to show how such modelbased inferences could translate into improved control measures for MRSA.
We appreciate the reviewers’ positive evaluations of our work.
Essential revisions:
1) Clarification of exact nature of the MRSA data is needed. Does this only come from clinical infections or was there also screening on asymptomatic carriage?
Diagnosis was performed on patients with symptomatic infections as well as some patients with contact with positive cases. So, the screening covers both clinical infections and a portion of asymptomatic carriage. This is now clarified in subsection “Data description” of the revised manuscript.
2) Figure 1B is difficult to read and should be revised. Since only integers (05) are plotted, the continuous scale may not be the best choice and makes it difficult to pick out what numbers are being plotted.
Figure 1B was reproduced using a discrete color palette.
3) Enough information needs to be presented to allow an interested reader to repeat this analysis on their own data. In particular, the following aspects of the methods were not clear to any of the reviewers:
i) In Algorithm 1, why is the discount factor raised to the power 2(l1), rather than just (l1). Is the 2 arbitrary?
After each iteration, the standard deviation of each parameter is discounted by a factor a. This is equivalent to a discount of a^{2} on the variance, which is represented by multiplication with the diagonal matrix Σ. We have clarified this point in subsection “Iterated filtering for agentbased models” and subsection “Synthetic tests” of the revised manuscript.
ii) How should the discount factor a be chosen, and how do different values of a influence the algorithm? A value of 0.9 is used, but not explanation is given for this choice.
In implementation, the discount factor, a, is typically selected to be between 0.9 and 0.99 (Ionides et al., 2006). If it is too small, the algorithm may “quench” too fast and fail to find the MLE; if it is too close to 1, the algorithm may not converge in a reasonable time interval. We now explain the choice of discount factor in subsection “Iterated filtering for agentbased models”.
iii) Algorithm 1 returns the MLE of the parameter vector but makes no attempt to measure the associated uncertainty. Indeed, the variance of the parameter vector ensemble approaches zero as L increases. This is also seen in Appendix 1, Figure 4A. However, distributions for parameters are reported in Figure 3A, and the methods say that "parameter distributions become stable after 5 iterations…which means our choice of L=10 is sufficient". So how are the distributions in Figure 3A derived? Are these just taken from the ensemble of parameter vectors stopping after 10 iterations (implying that if L had been taken as 20 or more, the distributions shown in Figure 3A would have had much lower variance?) Surely it is important to produce not only point estimates, but also appropriate measures of uncertainty associated with these; it is not clear if this is being done and justification of the choice of L=10 in the data analysis (subsection “Inference using the actual diagnostic data”) compared with L=20 in the synthetic tests (subsection “Synthetic tests”) is lacking.
This is an insightful question. Indeed, Algorithm 1 only returns the MLEs of parameters. For deterministic ODE models, Ionides et al. used “sliced likelihood” to numerically estimate the Fisher information and standard errors (SEs) of the MLEs (Ionides et al., 2006). Here, for a highly stochastic system, evaluating the Fisher information numerically is challenging. As a result, we employed another approach by running multiple realizations of the IF algorithm. In these different runs, the MLEs are slightly different due to the stochasticity in the agentbased model and the initialization of the inference algorithm. We ran 100 independent realizations to generate the average MLEs of inferred parameters and their corresponding 95% CIs. We explain this approach in subsection “Iterated filtering for agentbased models”.
The distributions shown in Figure 3A of the initial submission were the posterior distributions of 300 ensemble members after L=10 iterations in one realization of the IF algorithm. The reviewers are correct that if we continue the algorithm, these distributions will converge to a point, as is observed in Figure 2A and supplemental figures. To properly report the inference results in the revision, we now display the distributions of MLEs (ensemble means) obtained from the 100 runs in Figure 3A. In other figures, we have also clarified which distributions have been plotted. Please see changes in subsection “Iterated filtering for agentbased models”, subsection “Inference of MRSA transmission in Swedish hospitals” and relevant figure legends.
We stopped the algorithm after the ensemble mean estimate stabilized. In principle, the number of iterations required for convergence can be determined by inspecting the evolution of posterior parameter distributions, as in Figure 2A. Once the posterior mean stabilizes, increasing the iteration time will not affect the MLE, although it will surely lead to a gradually narrowed ensemble distribution. This point is now explained in subsection “Iterated filtering for agentbased models”. To keep the algorithm setting consistent with the synthetic test, we have extended the iteration time to L=20 for the inference using realworld diagnostic data in the revision.
iv) In Algorithm 1 the actual EAKF step is not explained. Given that space is not a limitation (and technical details can be put int he appendix), it would be helpful to provide details of this EAKF step with sufficient detail to enable replication by the interested reader.
A detailed explanation of the EAKF update was added to the appendix. Please see subsection “The Ensemble Adjustment Kalman Filter”.
4) Clarification of model assumptions is needed. In particular, it is unclear whether the increased decolonization rate μ was only present during treatment, and that afterwards the spontaneous decolonization rate α kicks in, or whether treated patients keep their increased rate μ also after the treatment is stopped. Linked to this, what is the duration of treatment?
We apologize for this ambiguity. In our model, we assume that treatment continues until infected patients are clear of MRSA. Given the exponential decay of infection probability, the characteristic treatment period is 1/μ days. We have clarified this assumption in subsection “The agentbased model” of the revised manuscript.
5) An explicit unambiguous description of the transmission process within a ward is lacking. Is the process densitydependent or frequencydependent?
We now explain the transmission process explicitly in subsection “The agentbased model” of the revised manuscript. Specifically, a susceptible individual staying in a ward with a colonized person can become MRSA colonized with transmission probability β per day. In our model, we assume that patients within a ward have the same rate of contact with each other, presumably mediated by the shared healthcare workers in a ward. The transmission process is densitydependent, as the force of infection in a ward increases with the number of colonized patients within the ward.
6) Units are missing in Table 1.
Units were added in Table 1.
7) A table with all parameters estimates is missing.
All estimated parameters are now reported in Table 2.
8) Figure 5AB legend is confusing as the yaxis doesn't seem to depict a reduction but cumulative incidence. Clarification is needed.
The legend of Figure 5AB was changed to indicate that the yaxis is cumulative incidence.
9) Can more formal assessment of model fits with synthetic and real data be provided in addition to the plots in Figure 2 and Figure 3?
We thank reviewers for this constructive suggestion. For the revision, we performed an analysis to evaluate the model fit. To do so, we inspect whether the observed incidence is a typical outcome from the inferred stochastic dynamics. Specifically, we compared several summary statistics quantifying the goodnessoffit with their distributions calculated from synthetic outbreaks (surrogate data) generated from the inferred dynamics. The tested statistics include log likelihood, RMSE, Rsquared and Pearson correlation coefficient. For all statistics, pvalues are bounded away from zero, indicating that the observed incidence agrees well with the inferred dynamics. Details can be found in subsection “Synthetic tests”, Figure 2—figure supplement 1, Figure 2—figure supplement 3, Figure 2—figure supplement 5 and Figure 3—figure supplement 2.
10) Figure 5 and associated text: it is unclear whether the model based control measures uses only data/information that would be available at the time point control measures are put in place. If the inference approach instead uses information from the whole course of the epidemic, then this would be an unfair comparison. This needs clarification.
To guarantee a fair comparison with other control measures, we reran the inferencebased intervention using only realtime information available before the control measures are effected. The colonization probability estimated in realtime is highly correlated with the results obtained using information from the whole course of the epidemic. The implementation details are now presented in the revised manuscript and can be found in subsection “Designing costeffective interventions” and subsection “Inferencebased intervention”. The results remain similar.
11) Subsection “Statistical test of powerlaw distribution”: "the significance level of 0.05" Following the recent ASA statement on the use (and misuse of pvalues) there is a consensus that we should be striving to move away from such "bright lines". So fine to report the pvalue, but we should try to avoid saying that.049 is significant while.051 isn't.
Statements about statistical significance were removed.
https://doi.org/10.7554/eLife.40977.034Article and author information
Author details
Funding
National Institute of General Medical Sciences (GM110748)
 Jeffrey L Shaman
National Institute of Environmental Health Sciences (ES009089)
 Jeffrey L Shaman
National Institute of Biomedical Imaging and Bioengineering (1R01EB022720)
 Hernán Makse
National Cancer Institute (U54CA137788/U54CA132378)
 Hernán Makse
National Science Foundation (IIS1515022)
 Hernán Makse
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Martin Bootsma, Ben Cooper, Prabhat Jha and Lulla Opatowski for valuable comments on earlier versions of the manuscript. JS is funded by the National Institute of General Medical Sciences GM110748 and the National Institute of Environmental Health Sciences ES009089. HM is funded by the National Institute of Biomedical Imaging and Bioengineering 1R01EB022720, the National Cancer Institute U54CA137788/U54CA132378 and the National Science Foundation IIS1515022. The opinions stated are those of the authors and do not represent the official position of funders.
Ethics
Clinical trial registration
Human subjects: The dataset was approved for use in our study by the Regional Ethical Review Board in Stockholm (Record Number 2004/5:8).
Senior Editor
 Prabhat Jha, Saint Michael's Hospital, Canada
Reviewing Editor
 Ben Cooper, Mahidol Oxford Tropical Medicine Research Unit, Thailand
Publication history
 Received: August 10, 2018
 Accepted: November 16, 2018
 Version of Record published: December 18, 2018 (version 1)
Copyright
© 2018, Pei et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 2,275
 Page views

 307
 Downloads

 20
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
The mouse brain is by far the most intensively studied among mammalian brains, yet basic measures of its cytoarchitecture remain obscure. For example, quantifying cell numbers, and the interplay of sex, strain, and individual variability in cell density and volume is out of reach for many regions. The Allen Mouse Brain Connectivity project produces highresolution full brain images of hundreds of brains. Although these were created for a different purpose, they reveal details of neuroanatomy and cytoarchitecture. Here, we used this population to systematically characterize cell density and volume for each anatomical unit in the mouse brain. We developed a DNNbased segmentation pipeline that uses the autofluorescence intensities of images to segment cell nuclei even within the densest regions, such as the dentate gyrus. We applied our pipeline to 507 brains of males and females from C57BL/6J and FVB.CD1 strains. Globally, we found that increased overall brain volume does not result in uniform expansion across all regions. Moreover, regionspecific density changes are often negatively correlated with the volume of the region; therefore, cell count does not scale linearly with volume. Many regions, including layer 2/3 across several cortical areas, showed distinct lateral bias. We identified strainspecific or sexspecific differences. For example, males tended to have more cells in extended amygdala and hypothalamic regions (MEA, BST, BLA, BMA, and LPO, AHN) while females had more cells in the orbital cortex (ORB). Yet, interindividual variability was always greater than the effect size of a single qualifier. We provide the results of this analysis as an accessible resource for the community.

 Computational and Systems Biology
 Immunology and Inflammation
To appropriately defend against a wide array of pathogens, humans somatically generate highly diverse repertoires of B cell and T cell receptors (BCRs and TCRs) through a random process called V(D)J recombination. Receptor diversity is achieved during this process through both the combinatorial assembly of V(D)Jgenes and the junctional deletion and insertion of nucleotides. While the Artemis protein is often regarded as the main nuclease involved in V(D)J recombination, the exact mechanism of nucleotide trimming is not understood. Using a previously published TCRβ repertoire sequencing data set, we have designed a flexible probabilistic model of nucleotide trimming that allows us to explore various mechanistically interpretable sequencelevel features. We show that local sequence context, length, and GC nucleotide content in both directions of the wider sequence, together, can most accurately predict the trimming probabilities of a given Vgene sequence. Because GC nucleotide content is predictive of sequencebreathing, this model provides quantitative statistical evidence regarding the extent to which doublestranded DNA may need to be able to breathe for trimming to occur. We also see evidence of a sequence motif that appears to get preferentially trimmed, independent of GCcontentrelated effects. Further, we find that the inferred coefficients from this model provide accurate prediction for V and Jgene sequences from other adaptive immune receptor loci. These results refine our understanding of how the Artemis nuclease may function to trim nucleotides during V(D)J recombination and provide another step toward understanding how V(D)J recombination generates diverse receptors and supports a powerful, unique immune response in healthy humans.