Founder effects arising from gathering dynamics systematically bias emerging pathogen surveillance

  1. Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany
  • Senior Editor
    Aleksandra Walczak
    École Normale Supérieure - PSL, Paris, France

Reviewer #1 (Public review):

Summary:

This work considers the biases introduced into pathogen surveillance by congregation effects, and also models homophily and variants/clades. The results are primarily quantitative assessments of this bias, but some qualitative insights are gained, e.g., that initial variant transmission tends to be biased upwards due to this effect, which is closely related to classical founder effects.

Strengths:

The model considered involves a simplification of the process of congregation using multinomial sampling that allows for a simpler and more easily interpretable analysis.

Weaknesses:

This simplification removes some realism, for example, detailed temporal transmission dynamics of congregations.

Reviewer #2 (Public review):

Summary:

In "Founder effects arising from gathering dynamics systematically bias emerging pathogen surveillance", Bradford and Hang present an extension to the SIR model to account for the role of larger-than-pairwise interactions in infectious disease dynamics. They explore the impact of accounting for group interactions on the progression of infection through the various sub-populations that make up the population as a whole. Further, they explore the extent to which interaction heterogeneity can bias epidemiological inference from surveillance data in the form of IFR and variant growth rate dynamics. This work advances the theoretical formulation of the SIR model and may allow for more realistic modeling of infectious disease outbreaks in the future.

Strengths:

(1) This work addresses an important limitation of standard SIR models. While this limitation has been addressed previously in the form of network-based models, those are, as the authors argue, difficult to parameterize to real-world scenarios. Further, this work highlights critical biases that may appear in real-world epidemiological surveillance data. Particularly, over-estimation of variant growth rates shortly after emergence has led to a number of "false alarms" about new variants over the past five years (although also to some true alarms).

(2) While the results presented here generally confirm my intuitions on this topic, I think it is really useful for the field to have them presented in such a clear manner with a corresponding mathematical framework. This will be a helpful piece of work to cite when tempering concerns about rapid increases in the frequency of rare variants.

(3) The authors provide a succinct derivation of their model that helps the reader understand how they arrived at their formulation starting from the standard SIR model.

(4) The visualizations throughout are generally easy to interpret and communicate the key points of the authors' work.

(5) I thank the authors for providing detailed code to reproduce manuscript figures in the associated GitHub repo.

Weaknesses:

(1) The authors argue that network-based SIR models are difficult to parameterize (line 66). However, the model presented here also has a key parameter, namely P_n, or the distribution of risk groups in the population. I think it is important to explore the extent to which this parameter can be inferred from real-world data to assess whether this model is, in practice, any easier to parameterize.

(2) The authors explore only up to four different risk groups, accounting for only four-wise interactions. But, clearly, in real-world settings, there can be much larger gatherings that promote transmission. What was the justification for setting such a low limit on the maximum group size? I presume it's due to computational efficiency, which is understandable, but it should be discussed as a limitation.

(3) Another key limitation that isn't addressed by the authors is that there may be population structure beyond just risk heterogeneity. For example, there may be two separate (or weakly connected) high-risk sub-groups. This will introduce temporal correlation in interactions that is not (and cannot easily be) captured in this model. My instinct is that this would dampen the difference between risk groups shown in Figure 2A. While I appreciate the authors' desire to keep their model relatively simple, I think this limitation should be explicitly discussed as it is, in my opinion, relatively significant.

Author response:

We appreciate Reviewer #1's comments. We hope our framework, like the classic SIR model, can be adapted in the future to build more complex and realistic models.

We appreciate Reviewer #2's thoughtful comments and wish to address some of the weaknesses:

We agree that inferring P_n from real data will be challenging, but think this is an important direction for future research. Further, we’d like to reframe our claim that our approach is "easier to parameterize" than network models. Rather, P_n has fewer degrees of freedom than analogous network models, just as many different networks can share the same degree distribution. Fewer degrees of freedom mean that we expect our model to suffer from fewer identifiability issues when fitting to data, though non-identifiability is often inescapable in models of this nature (e.g., \beta and \gamma in the SIR model are not uniquely identifiable during exponential growth). Whether the resulting model is more or less accurate is a separate question. Classic bias-variance tradeoffs argue that a model of moderate complexity trained on one data set can fit future data better than overly simple or overly complex models.
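
The identifiability example can be made concrete with the standard early-epidemic linearization of the classic SIR model. This is a sketch that assumes frequency-dependent transmission and S \approx N, not a result specific to the gathering framework:

```latex
\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I
\;\approx\; (\beta - \gamma)\, I
\quad\Longrightarrow\quad
I(t) \approx I(0)\, e^{(\beta - \gamma) t}
```

Case counts from this phase constrain only the growth rate r = \beta - \gamma, so any pair (\beta + c, \gamma + c) produces the same early trajectory.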

We chose four risk groups for purposes of illustration, but this number can be increased arbitrarily. It should be noted that the simulation bottleneck when increasing the number of risk groups is numerical, due to the stiffness of the ODEs. This arises because the nonlinearity of the infection terms scales with the number of risk groups (e.g., ~ \beta * S * I^3 for 4 risk groups). As such, a careful choice of numerical solver may be required when integrating the ODEs. This is not an issue for stochastic, individual-based implementations (e.g., Gillespie).

As for how well this captures super-spreading, we believe choosing smaller risk groups does not hinder modeling disease spread at large gatherings. Consider a statistical interpretation, where individuals at a large gathering engage in a series of smaller interactions over time (e.g., conversations among 2, 3, 4, or more people). The key determinants of the resulting gathering size distribution at any one large gathering are the number of individuals within some shared proximity over time and the infectiousness/dispersal of the pathogen. Of course, whether this interpretation is a sufficient approximation for classic super-spreading events (e.g., funerals during the 2014-2015 West Africa Ebola outbreak) is a matter of debate. Our framework is best interpreted at a population level where the effects of any single gathering are washed out by the overall gathering distribution, P_n. As the prior weakness highlighted, establishing P_n is challenging, but we believe empirically measuring proxies of it may provide future insight into how behavior impacts disease spread. For example, prior work has combined contact tracing and co-location data from connection to WiFi networks to estimate the distribution of contacts per individual, and its degree of overdispersion (Petros et al. Med 2022).
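
The origin of terms such as \beta * S * I^3 can be illustrated with a small Python sketch. It assumes only that each attendee of a gathering is drawn independently from a well-mixed population, so the composition of a gathering is multinomial. The function names and the example fractions are hypothetical, and the sketch does not specify how infection is assigned within a gathering.

```python
from math import factorial


def composition_prob(n_s, n_i, n_r, s, i, r):
    """Multinomial probability that a gathering of size n_s + n_i + n_r
    contains exactly n_s susceptible, n_i infected and n_r recovered
    attendees, when each attendee is drawn independently with
    probabilities s, i, r (population fractions summing to 1)."""
    n = n_s + n_i + n_r
    coeff = factorial(n) // (factorial(n_s) * factorial(n_i) * factorial(n_r))
    return coeff * s**n_s * i**n_i * r**n_r


def compositions(n):
    """All (n_s, n_i, n_r) triples of non-negative integers summing to n."""
    return [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]


# Hypothetical population fractions (illustrative values only).
s, i, r = 0.90, 0.08, 0.02

# A gathering of four with one susceptible and three infected attendees:
# the probability is 4 * s * i**3, i.e. proportional to S * I^3.
p_one_s_three_i = composition_prob(1, 3, 0, s, i, r)

# Sanity check: the probabilities of all compositions of size 4 sum to 1.
total = sum(composition_prob(a, b, c, s, i, r) for a, b, c in compositions(4))
```

Because the highest power of I grows with the gathering size, the right-hand sides of the ODEs become increasingly nonlinear as larger gatherings are included, which is consistent with the stiffness noted above.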

We chose to introduce our framework in a simple SIR context familiar to many readers. This decision does not in any way limit applying it to settings with more population structure. Rather, we believe our framework is easily adaptable and that our presentation (hopefully) makes it clear how to do this. For example, two weakly connected groups could be modeled by, for each gathering, first sampling the preferred group and then sampling attendees from the population in a biased manner. The biased sampling could even be a function of gathering size, time, etc. The resulting infection terms are still (sums of) multinomials. More generally, the sampling probability for an individual of some type need not be its frequency (e.g., S/N, I/N). Indeed, we believe generating models with complex social interactions is both simplified and made more robust by focusing on modeling the generative process of attending gatherings.
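
One possible reading of the two-step sampling described above is sketched below in Python. The function name, the `bias` homophily parameter, and the choice to pick the preferred group in proportion to group size are all assumptions made for illustration, not part of the original framework.

```python
import random


def sample_gathering(rng, size, group_weights, bias):
    """Sample the group label of each attendee at one gathering.

    group_weights: dict mapping group label -> relative population size.
    bias: hypothetical homophily parameter in [0, 1]. Each attendee is
          taken from the preferred group with probability `bias`;
          otherwise the attendee is drawn from the whole population in
          proportion to group_weights.
    Returns (preferred_group, list_of_attendee_groups).
    """
    labels = list(group_weights)
    weights = [group_weights[g] for g in labels]
    # Step 1: choose the preferred group for this gathering.
    preferred = rng.choices(labels, weights=weights, k=1)[0]
    # Step 2: draw attendees, biased toward the preferred group.
    attendees = []
    for _ in range(size):
        if rng.random() < bias:
            attendees.append(preferred)
        else:
            attendees.append(rng.choices(labels, weights=weights, k=1)[0])
    return preferred, attendees


# Example: two equally sized groups that mix only weakly (bias near 1).
rng = random.Random(0)
preferred, attendees = sample_gathering(rng, 4, {"A": 0.5, "B": 0.5}, 0.95)
print(preferred, attendees)
```

Setting `bias` to 0 recovers unbiased sampling from the whole population, while `bias` equal to 1 fully separates the groups. In the spirit of the text, `bias` could also be made a function of gathering size or time.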
