COVID19 cluster size and transmission rates in schools from crowdsourced case reports
Abstract
The role of schools in the spread of SARSCoV2 is controversial, with some claiming they are an important driver of the pandemic and others arguing that transmission in schools is negligible. School cluster reports that have been collected in various jurisdictions are a source of data about transmission in schools. These reports consist of the name of a school, a date, and the number of students known to be infected. We provide a simple model for the frequency and size of clusters in this data, based on random arrivals of index cases at schools who then infect their classmates with a highly variable rate, fitting the overdispersion evident in the data. We fit our model to reports from four Canadian provinces, providing estimates of mean and dispersion for cluster size, as well as the distribution of the instantaneous transmission parameter β, whilst factoring in imperfect ascertainment. According to our model with parameters estimated from the data, in all four provinces (i) more than 65% of nonindex cases occur in the 20% largest clusters, and (ii) reducing instantaneous transmission rate and the number of contacts a student has at any given time are effective in reducing the total number of cases, whereas strict bubbling (keeping contacts consistent over time) does not contribute much to reduce cluster sizes. We predict strict bubbling to be more valuable in scenarios with substantially higher transmission rates.
Editor's evaluation
This paper provides an important novel methodology to understand the mode of spread of SARSCoV2 in schools given sparse data.
https://doi.org/10.7554/eLife.76174.sa0eLife digest
During the COVID19 pandemic, public health officials promoted social distancing as a way to reduce SARSCoV2 transmission. The goal of social distancing is to reduce the number, proximity, and duration of facetoface interactions between people. To achieve this, people shifted many activities online or canceled events outright. In education, some schools closed and shifted to online learning, while others continued classes in person with safety precautions.
Better information about SARSCoV2 transmission in schools could help public health officials to make decisions of what activities to keep in person and when to suspend classes. If safety measures lower transmission in schools considerably, then closing schools may not be worth online education's social, educational, and economic costs. However, if transmission of SARSCoV2 in schools remains high despite measures, closing schools may be essential, despite the costs.
Tupper et al. used data about COVID19 cases in children attending inperson school in four Canadian provinces between 2020 and 2021 to fit a computer model of school transmission. On average, their analysis shows that one infected person in a school leads to between one and two further cases. Most of the time, no more students are infected, indicating that normally infection clusters are small; and only rarely does one infected person set off a large outbreak. The model also showed that measures to reduce transmission, like masking or small class sizes, were more effective than interventions such as keeping students with the same cohort all day (bubbling).
Tupper et al. caution that their findings apply to the variants of SARSCoV2 circulating in Canada during the 20202021 school year, and may not apply to newer, highly transmissible strains like Omicron. However, the model could always be adapted to assess school or workplace transmission of more recent strains of SARSCoV2, and more generally of other diseases. Thus, Tupper et al. provide a new approach to estimating the rate of disease transmission and comparing the impact of different prevention strategies.
Introduction
In the management of the COVID19 pandemic, an important consideration is the role of children and in particular schools. In most jurisdictions rates of SARSCoV2 infection among children are similar to those in the adult population (Centers for Disease Control and Prevention, 2021). But severity is much lower in children; the infection fatality rate (IFR) of COVID for at age 10 was estimated to be 0.002% versus an IFR of 0.01% at age 25, and 0.4% at age 55, for the original SARSCoV2 virus present in 2020 (Levin et al., 2020). Cases are more often asymptomatic among children, less likely to require hospitalization and ICU care (Centers for Disease Control and Prevention, 2021), and less likely to be classified as long COVID (Sudre et al., 2021). On the other hand, MISC is a serious condition sometimes resulting from SARSCoV2 infection (CDC, 2021a), and myocarditis happens more frequently as a side effect of infection among younger individuals (Singer et al., 2022).
Jurisdictions have had to make a choice between closing schools, with all the attendant social, economic, and psychological costs (Chaabane et al., 2021), and leaving schools open, allowing possible transmission of SARSCoV2 in that setting (Centers for Disease Control and Prevention, 2021). The direct downside of transmission in schools if it occurs is that children may be infected there, risking the low but nonnegligible harms of COVID19 in that age range, but also adult teachers and staff are put at risk. Transmission in schools may also contribute to overall community transmission, indirectly jeopardizing more vulnerable individuals (Walsh et al., 2021). As a concrete example, if a child contracts SARSCoV2 at school, they may then go on to transmit it to an elderly relative they live with, for whom the consequences are more severe (Laws et al., 2021). Estimating the magnitude of these two kinds of harm and making the decision as to what choice to make involves many sources of uncertainty and value judgements, which helps explain why different jurisdictions have taken different approaches (Harris, 2020). In some jurisdictions schools were open for the 2020–2021 school year, though many measures were put into place in order to reduce the risk of SARSCoV2 transmission (British Columbia Ministry of Education, 2020). Measures included cohorting, staggered entrance and exit times, masks, improvements in ventilation, extra sanitization measures. In other jurisdictions schools were closed for large portions of the year (Partners, 2021).
Studies that have looked at the effect of school closures on the overall rate of SARSCoV2 transmission find mixed results: some find substantial reduction in community transmission when schools are closed, and others small or no effect (Walsh et al., 2021; Chernozhukov et al., 2021). Given that schools involve many children all sharing a room for many hours a day, it may be surprising that there is not a clearer evidence of significant transmission in schools. One explanation is that children may be less likely to transmit SARSCoV2 to each other, either by being less infectious or by being less susceptible (Dattner et al., 2021; Viner et al., 2021). But transmission in schools does occur, and it’s worthwhile to estimate the magnitude and characterize the variation in it.
One source of evidence for transmission in schools are school exposure reports. Throughout the pandemic organizations have collected data submitted by volunteers about COVID cases in schools, and such data has subsequently been published online (National Education Association, 2020; Covid Schools Canada, 2021; Support Our Students Alberta, 2022). Data consists of reports of exposures or clusters in schools, either submitted by parents or determined from reading newspaper reports. Several such websites exist, though many ceased due to excessive workload after the 2020–2021 school year. In some jurisdictions there are also similar sources of data provided by local government (Government of Ontario, 2021; State of Michigan, 2021) or Public Health Agencies (Vancouver Coastal Health, 2021; Health, 2021).
Here, we propose a simple model of transmission in schools, and we use these data on cluster sizes to estimate parameters of the model for four Canadian provinces. Our model allows for heterogeneity in transmission rate, which is able to capture the considerable variability in the sizes of the clusters, with most exposures leading to no further cases (and so a cluster of size 1) but with few having a large number of cases (Tufekci, 2020). We estimate the mean and overdispersion parameters for different jurisdictions. We then use our parameter estimates in a couple of ways: firstly, we explore the overdispersion of cluster sizes in different jurisdictions, giving estimates of what fraction of all cases are in the 20% largest of all clusters. Secondly, we can obtain an estimate of the distribution of the transmission rate $\beta $, the rate at which a single infected individual infects a susceptible person when they are in contact. This parameter, in turn, could be used to simulate school transmission and explore the impacts of interventions (Tupper et al., 2020) as we explore for some parameter choices. In Appendix 1 we perform a similar analysis for eight US states, where only substantially less complete datasets were available.
Finally, two important changes have occurred in 2021 that we expect to impact cluster sizes in schools. On the one hand, in many jurisdictions, large portions of children aged 5 and up have been vaccinated with the Pfizer/BioNTech vaccine (The New York Times, 2020). According to the extent to which the vaccine protects against infection, we expect cluster size will be reduced, as fewer students will be infected if they have been immunized. Observed cluster size may be reduced further even than this, if the vaccine allows hardertodetect infections to occur. On the other hand, now more infectious variants of the coronavirus have emerged; the Alpha, Delta, and Omicron variants have all had a higher estimated transmissibility than their predecessors (CDC, 2021b; CDC, 2022). Increased transmissibility would suggest larger cluster sizes, certainly among unvaccinated ages, but the relative impact of vaccination and the new variants together is difficult to gauge. Furthermore, changes in vaccination, transmission, and immune evasion may all lead to a change in the variability in cluster sizes.
Materials and methods
Our data consist of reports of confirmed cases among students, teachers, and staff in schools in four Canadian provinces during the 2020–2021 school year. Data was collected by Dr Shraddha Pai with COVID Schools Canada (Covid Schools Canada, 2021), an initiative of the group Masks for Canada (Canadian Doctors, Professionals, & Citizens for Masks, 2021). We included the four provinces from this dataset with the most schools reporting cases with date information. For each school, there is a list of confirmed cases among students, teachers, and staff, along with the dates on which the cases were reported. We then assigned cases to clusters based on being at the same school and being reported within 7 days of each other; if the difference in date between two cases was less than or equal to 7 days, or they could be linked by a sequence of such cases, they were put in the same cluster. We chose 7 days on the basis of estimates of the serial interval for COVID19 of approximately 5 days (Rai et al., 2021). (We explore other choices of window in Appendix 1.) Information was not available about whether the cases at the same school were in the same classroom. Accordingly, we interpret clusters as capturing all linked cases at a given school, and not just one classroom.
There is substantial uncertainty in whether each of our determined clusters of cases accurately represents a set of cases linked by transmission. For any cluster of two or more cases, it may be that two independent sets of cases are incorrectly included in the same cluster. This may lead us to overestimate the size of clusters. Likewise, any two of our clusters in the the same school that occur further apart than 7 days may in fact be linked by a chain of undetected transmission, leading to an underestimate of cluster size. Both these factors may occur in our data, but we neglect both of them, taking the observed cluster size as given by our method. We are also unable to distinguish between transmission occurring in a school and in social activities with classmates outside of school.
In a given jurisdiction, we assume exposure events occur according to a Poisson process with variable rate. Independently of this process, once an exposure event occurs at a school, we say $Z$ additional people are infected by the index case, for a total of $Z+1$ individuals in the cluster. The variable $Z$ includes individuals directly infected by the index case, as well as any subsequent infected individuals that are included in the same cluster. Following LloydSmith et al., 2005, we model $Z$ as a Poisson random variable with parameter $\nu $, where $\nu $ itself is a Gammadistributed random variable. As described by LloydSmith et al., 2005, $Z$ is then a negative binomial random variable. Rather than the usual parametrization of a negative binomial distribution, we use parameters ${R}_{c}$ and $k$. The parameter ${R}_{c}$ is the expected number of additional infections in a cluster, and $k$ is the dispersion: a measure of how far the distribution of $Z$ is from being Poisson. As $k\to \mathrm{\infty}$, the distribution of $Z$ approaches that of a Poisson distribution with mean ${R}_{c}$. The variance of $Z$ is ${R}_{c}(1+{R}_{c}/k)$ and so for smaller values of $k$ we expect more of the secondary cases to occur in rare large clusters rather than in frequent small clusters (LloydSmith et al., 2005).
There are then a total of $Z+1$ infected individuals in the school. To give an idea of how the distribution of true cluster size depends on the parameters when they are in this range, in Figure 1 we show the theoretical distributions for varying parameters. On the left, we fix ${R}_{c}+1=2$ and vary $k$. Decreasing $k$ causes there to be more clusters of size 1 (i.e. no transmission) and more large clusters, but reduces the number of intermediatesized clusters. On the right, we fix $k=0.3$ and show the effect of varying mean cluster size ${R}_{c}+1$. As ${R}_{c}$ increases, the frequency of clusters with no or little transmission decreases and the frequency of larger cluster sizes increases.
The number of the total $Z+1$ cases that are actually observed, $X$, depends on the ascertainment model. We consider a model where each case is observed and contributes to the reported cluster size with probability $\mathrm{q}$, so that the observed cluster size $X$ (conditioned on $Z$) is binomial with parameters $n=Z+1$ and probability $q$. The index case is treated the same as the infectees, so $X$ may or may not include the index case. If none of the cases in a cluster are observed, we assume the cluster is not reported, so our model factors in the effect that smaller clusters are more likely to be missed. See Appendix 1 for an explicit statement of the likelihood function.
For each collection of cluster sizes in our datasets we estimate the mean ${R}_{c}$ and dispersion $k$ using the ascertainment model with $\mathrm{q}=0.75$. We base this value on the metaanalysis (Bobrovitz et al., 2021) which reports ascertainment fractions for highincome regions in the Americas between 66% (in the last quarter of 2020) and 85% (in the second quarter of 2021). We use maximum likelihood estimation to obtain estimates of ${R}_{c}$ and $k$, and we use the Hessian of the loglikelihood to obtain 95% confidence ellipses for the parameters [Wasserman, 2013, Sec. 9.10].
Finally, we perform a second analysis using the same model, using a smaller window of time for the definition of a cluster. In this way we hope to identify only the index case and the cases directly infected by the index case. We use the model above for this (smaller) number of cases for each cluster to estimate a distribution for $\nu $, but then use this in turn to estimate a distribution for the instantaneous transmission rate $\beta $. Our reasoning is that if $\nu $ is the random Poisson parameter when the index case it exposed to $n$ people for time $T$, then $\beta $ has approximately the same distribution as $\nu /(nT)$. Under these assumptions, $\beta $ is also a Gammadistributed random variable with parameter we can easily identify, from those for $\nu $.
Results
Figure 2 shows histograms of cluster size according to our definition in the four provinces. In Table 1 we show some statistics associated with the data for each province. In the top we show the number of clusters, the number of schools appearing, the number of schools with more than one reported cluster, and the fraction of schools with multiple clusters. In the bottom we show the fraction of clusters that have only one observed case, and the average number of observed cases in the clusters, the maximum observed cluster size, the index of dispersion (variance divided by mean) of cluster size, and index of dispersion of the number of cases in a cluster subtracting one for the presumed index case.
In Figure 3 (left) we show the rate (in clusters per day per 100,000 population) that cases appear in the dataset over time. In Figure 3 (right) we show the rate of COVID incidence per 100,000 population in the province over the same period of time. There is an apparent correspondence between the two time series, with peaks in rate of clusters per day occuring near peaks in incidence.
Figure 4 (left) shows the estimated mean cluster size ($={R}_{c}+1$) and dispersion $k$ for the four Canadian provinces. Mean cluster sizes ranged from 1.9 to 2.9 cases, and dispersion ranged from 0.34 to 0.53 (recalling that no overdispersion corresponds to $k\to \mathrm{\infty}$.) Recall that we determined clusters by including cases in the same cluster if they were reported within 7 days of each other. In Appendix 1 we explore what happens if we change this window to either 4 or 10 days. We find that estimates of $k$ do not change much: there is less than a 10% change in $k$ in all cases. A window of 4 days leads to smaller cluster sizes (at most 18% smaller) and a window of 10 days leads to larger cluster sizes (at most 11% larger).
In Appendix 1 we explore varying the ascertainment fraction between 0.2 and 1. Though lower ascertainment fractions yield bigger values of ${R}_{c}$ and smaller values of $k$, we see that the parameter estimates are relatively insensitive to values of $\mathrm{q}$ between 0.5 and 1. For example, when q_{1} is reduced from 0.75 to 0.5, the range of ${R}_{c}+1$ shifts from 1.9–2.9 to 3.2–6.4, and the range of $k$ shifts from 0.34–0.53 to 0.22–0.39. The reason for this is that though a given cluster with multiple cases will look smaller with fewer cases detected, and lower detection will thereby bias observed size downwards, many singlecase clusters will not be detected at all, biasing the observed cluster size upwards again. We also consider an alternate model of ascertainment, where the chance of a cluster being reported at all depends on the size of the cluster, and vary the rate of ascertainment in that alternate model; see Appendix 1.
Another way to visualize the variability of transmission we have inferred from the data is to show the distribution of the Poisson parameter $\nu $, of which ${R}_{c}$ is just the mean. In our model $\nu $ is the index casespecific expected number of further cases in a cluster, and is a gammadistributed random variable. Figure 4 (right) shows the estimated distribution of $\nu $ for each jurisdiction, and Table 2 shows some key properties of the distribution for each of the provinces.
As a way of interpreting dispersion values and what they mean for cluster size, we consider the fraction of all cases that occur in the largest 20% of all clusters. (If the distribution of cases follows the Pareto principle Wikipedia contributors, 2021 then 80% of the cases will be in the top 20% largest clusters.) If we consider only secondary cases (not including the index case) we see from Figure 5 (right) the fraction that are due to the 20% largest clusters for various values of mean cluster size and $k$. For example, for Alberta with a mean cluster size of 2.9 and a dispersion $k$ of 0.53, 69% of the secondary cases are in the top 20% of the clusters by size. For Saskatchewan, with a mean cluster size of 1.9 and $k=0.37$, 82% of secondary cases are in the top 20% of clusters by size. When we include index cases, the fractions are correspondingly lower, as we see in Figure 5 (right).
Our model does not consider the details of transmission at the individual level, and so does not make use of an instantaneous transmission rate per contact pair. However, by making some simple assumptions about SARSCoV2 transmission, we can infer a distribution of transmission rate $\beta $ from our estimate of the distribution of the parameter $\nu $. Recall that $\nu $ is a Gammadistributed random variable that gives mean number of secondary cases. Another way to estimate mean cluster size is to use an individual contact model where when an infectious person is in contact with a susceptible person, the susceptible person is infected with rate $\beta $. In such a model we assume that infected individuals are in a classroom for 2 days before isolating (when they develop symptoms), and that the total contact time with their classmates is $T=12$ hr. Assuming that all individuals are in the same class, the infected individual is in contact with $n=25$ other susceptible students for that time period. Then the infected individual will on average infect $\beta nT$ other students. So we estimate $\beta =\nu /(nT)$. Since $\nu $ is Gammadistributed, our estimate of $\beta $ is too. For estimating the distribution of $\beta $ we used a 4day window for the definition of clusters, since this is more likely to include only people directly infected by the index case. Figure 6 shows our estimated distribution of $\beta $ for the different Canadian provinces. Table 3 shows some of the features of the estimated distribution for $\beta $.
One application of these estimates of the distribution of $\beta $ is that we can explore the consequences of different types of interventions in the classroom setting. In Tupper et al., 2020 the authors consider a simple model of SARSCoV2 transmission among a group of contacts and investigate the quantity $R}_{event$, the average number of secondary infections due to the presence of a single infectious individual. $R}_{event$ is determined by $T$, the total length of time the infectious individual is with others; n_{contact}, the number of contacts at any point in time, $\tau $ the length of time the individual is with a fixed set of contacts; and $\beta $, the instantaneous transmission rate. The parameter $\tau $ can vary between some fraction of $T$ (e.g., $T/3$, if the index case divides their time equally between three sets of n_{contact} contacts) or $T$ if the set of contacts is fixed. Interventions can be classified according to which of these parameters they modify: reducing transmission reduces $\beta $, social distancing reduces n_{contact}, and ‘bubbling’ (staying with the same small group rather than mingling) increases $\tau $ to $T$. If we use our distributions for $\beta $ with the model of Tupper et al., 2020 we can estimate how the distribution of cluster sizes is changed with different interventions under different values of the parameters ${R}_{c}$ and $k$.
In Figure 7 we show estimated size distributions of clusters under different interventions. Our baseline simulation settings intend to capture a preCOVID high school classroom: $T=12$ hr (2 days of exposure before the index case isolates), $\tau =3$ hr (each student has four different classes that they attend for equal periods of time), ${n}_{class}=25$, and $\beta $ is sampled from our estimated distribution for a given choice of ${R}_{c}$ and $k$. We consider three interventions: transmission reduction (e.g., by introducing masks) reduces β by a factor of 2; social distancing cuts the size of a class in half; strict bubbling increases $\tau $ to $T$. For all values of ${R}_{c}$ and $k$ we consider, we simulate 10^{7} clusters to obtain a histogram of the number of secondary cases as well a mean and standard deviations, for the baseline conditions and for each of the three interventions, as shown in Figure 7. Means and standard deviations are accurate to the number of digits reported, and are shown with the corresponding histogram in the figure.
Figure 7 (left) shows results for ${R}_{c}$ and $k$ close to that of Manitoba with a 4day window for cluster definition (${R}_{c}=1.0$, $k=0.4$). We see that both reducing transmission and social distancing are effective in reducing the total number of cases, whereas bubbling does not contribute much to reduce cluster sizes. This is characteristic of what (Tupper et al., 2020) call the linear regime: the number of secondary infections depends linearly on the time the infectious person is present with others. Figure 7 (right) shows the results in a hypothetical setting where ${R}_{c}$ is much larger (${R}_{c}=2.5$, $k=0.4$), perhaps due to the existence of a more transmissible variant such as Omicron. Here, transmission reduction is less effective than in the linear regime, and strict bubbling more so; increasing $\beta $ has moved us closer to the socalled saturating regime, where transmission reduction is relatively less effective than bubbling.
Discussion
We have used cluster size data to estimate the mean and dispersion in cluster sizes, accounting for imperfect case detection. We have found that in each of the provinces we consider, the majority of school transmission occurs in a small number of classrooms, with the top 20% of clusters containing between 70% and 80% of the secondary cases in school settings. We developed a method to estimate the transmission rate per contact per unit time, with reference to a simple model of classroom transmission. Having a direct estimate of the transmission rate allows us to compare the benefits of different control measures. We find that with parameters estimated from Canadian jurisdictions during the 2020–2021 school year, interventions that reduce transmission rates (such as masking) and reduce number of contacts at any one time (class size reduction), are more effective than strategies aimed at keeping sets of contacts consistent (such as bubbling).
Overdispersion in transmission of SARSCoV2 and other infectious diseases is well documented (e.g., Woolhouse et al., 1997) and is often described with reference to the 20/80 rule: that 20% of the infected individuals account for 80% of the transmission. Naturally, if the more infectious 20% can be identified, interventions targeting that portion of the population are likely to have a high impact. For SARS in 2003, LloydSmith et al., 2005, estimated that 20% of the cases were responsible for almost 90% of the transmission. Estimates for SARSCoV2 also find considerable overdispersion, with the parameter $k$ between 0.1 (Endo et al., 2020) and 0.5 (Laxminarayan et al., 2020) (with ${R}_{0}=2.5$ this gives the top 20% of cases causing 69–96% of the transmissions; see Sneppen et al., 2021, for a survey). These estimates focus on the distribution of the number of people an infectious person infects directly during the whole course of infection (with mean R_{0}), which is of obvious epidemiological importance, but for which it is difficult to obtain highquality data. When a case is identified, we are not always able to determine who they infected, and indirect methods must be used. We may miss cases, and others may be wrongly attributed to a given index case.
In our present study, we examined a different random quantity, the number of additional cases $Z$ infected, either directly or through intermediaries, by a given index case in a given setting. We denoted the mean of $Z$ by ${R}_{c}$. Including the index case means that the cluster size is $Z+1$, with mean ${R}_{c}+1$. Compared to estimates of R_{0}, ${R}_{c}$ does not count people infected at other sites, but it does include additional cases, because it includes both direct and indirect transmission. $Z$ and its mean ${R}_{c}$ are therefore more focused on the particular setting (in this case a school) than R_{0} is. In general it will depend on the infectiousness of the index case, as well as how conducive the environment is to transmission, and what activities are undertaken there. Determining the distribution of $Z$, as we have done here, provides an alternative means of investigating transmission.
However, these two measures of transmissibility (R_{0} and ${R}_{c}$, the mean of $Z$) may be close enough that it is instructive to compare our estimates for $Z$ with the traditional R_{0}, and our dispersion estimates with dispersion estimates for the number of secondary infections. Our ${R}_{c}$ ranges from 0.9 in Saskatchewan to 1.9 in Alberta. These low values of ${R}_{c}$ are inconsistent with R_{0} estimates (which range from 2 to 6; Alimohamadi et al., 2020), and indicate that in the preDelta time frame in these jurisdictions schools were unlikely to be a major contributor to SARSCoV2 spread. However, with increased transmissibility with new variants such as Omicron, this situation may have changed. The discrepancy is even greater when we consider clusters defined by the 4day window, which are even smaller. Our estimates for $k$ range from 0.34 (in Manitoba) to 0.53 (in Alberta), corresponding closely to earlier estimates of dispersion.
Overdispersion has consequences for controlling transmission and for estimation. Estimating the average transmission rate from a small number of clusters will be difficult, and will result in a high variability. Most likely what will be observed in a small number of sampled clusters will be little to no onward transmission, which would lead to underestimates of the transmission rate. But if one or more larger clusters are included in a sample by chance, then this could lead to an overestimate of the transmission rate.
If we could identify the conditions under which the rare larger clusters occur (highrisk individuals, activities, and settings) we could achieve a disproportionately large effect on reducing transmission by applying new measures in these settings. There are myriad possible reasons for overdispersion of transmission for SARSCoV2, including variation in viral load (Chen et al., 2020), behaviour, and number of contacts. But a key factor in higher dispersion with SARSCoV2 in comparison to other pathogens such as influenza is aerosolization (Goyal et al., 2021), which allows the index case to infect others in the room even if they are not a close contact. Properties of the setting may be very important, with some settings (cramped, poor ventilation) being especially conducive to transmission. It would be good to identify classrooms or schools where there is a high risk of larger clusters. For example, if data were available on the occupancy, ventilation standards, mask use, classroom size, distancing behaviour, and other features of classrooms, we could investigate how this related to the cluster size. Rapid tests may be especially good for identifying the most infectious individuals, given that they are sensitive to viral loads (Mina and Phillips, 2021), but additional data collection is likely needed to quantify settinglevel risks.
Two important changes have happened since the majority of the data here was collected. Firstly, in the jurisdictions studied, effective vaccines have been developed and deployed for those aged 5 and up (The New York Times, 2020). There are several ways in which this may effect cluster sizes in the school setting. To the extent that the general population (including adults) being vaccinated reduces incidence of COVID (WilderSmith and Mulholland, 2021; Leshem and WilderSmith, 2021; Mallapaty, 2021), there will be fewer introductions of SARSCoV2 into the classroom, and so fewer exposures will occur leading to fewer clusters. This effect may be dampened by relaxation of distancing and other measures that were keeping COVID19 at bay and are no longer necessary in the context of vaccination. The distribution of cluster sizes when clusters do occur will also change: many students who might otherwise be infected will be protected by the vaccine, others who are vaccinated but infected (breakthrough infection) may have reduced symptoms and therefore may not be detected. We therefore expect the mean cluster size to be reduced by vaccination, in age ranges where vaccination has been deployed. It is unclear what the consequences will be for the dispersion.
Secondly, new, more transmissible variants of SARSCoV2 have emerged (CDC, 2021a), most notably the Alpha variant, the Delta variant, the Omicron variant, and most recently the BA.2 strain of the Omicron variant, each with a substantially higher transmissibility than its predecessors. A natural way to implement this change in our model is to multiply ${R}_{c}$ by an appropriate factor, boosting the size of clusters, without changing the dispersion parameter $k$. Data from the period in which Delta was the prevalent strain is not available, but schools in the Canada and the US saw resurgences in clusters in schools around school openings (Cravey, 2021; Star staff wire services, 2021; CNN, 2021).
Our data and model have some limitations. The data rely on crowdsourcing, and there is reason to believe that reporting is incomplete. Inequity may effect data collection, as wealthier districts are more likely to have the resources to identify and track transmission. In general, larger clusters may be more likely to be reported. In the modelling, we assumed a Poisson random variable for the cluster size, with an underlying gammadistributed rate variable. This is a flexible model allowing for considerable overdispersion, but it is simple and does not explicitly handle complexities such as the differences between elementary and high schools. Our estimates of the transmission rate were derived (where feasible) from a model with a fixed number of hours that the index case would be infectious in the classroom, and fixed class sizes. Accounting for variation in these would result in more variability in the estimates.
A major limitation of our analysis is how we assigned cases to clusters. Since the only data available was the number of cases reported on a given day at a school, we put cases in the same cluster if they occurred within 7 days of each other. The choice of 7 days was informed by the serial interval of COVID19, but unavoidably, some cases will have been put in clusters that were not linked by transmission, whereas other that were linked were not put in the same cluster. Furthermore, we assumed that all clusters consisted of an index case and a number of other cases directly infected by the index case. In reality, there may be longer chains of transmission. Any of these assumptions may bias our estimates of the distribution of $\nu $ and $\beta $. Finally, our illustrative modelling of the impact of interventions was simple, and used simple assumptions for the impacts of masking, distancing, and cohorts (bubbling). Our estimates of the percontact transmission rate per unit time could, however, be used in more sophisticated simulation modelling to compare interventions.
Despite these limitations, our approach has distinct advantages. We have developed estimates of the persontoperson transmission rate derived directly from data. The data we use (cluster sizes) are relatively easy to access. This approach does not require individuallevel data or contact tracing information, which are often not available; individuals may be identifiable and data are held within public health institutions. However, we note that if it were available, contact tracing data would be an excellent gold standard against which to check our assumptions about cluster identification. Our estimation approach, together with cluster size data, offers a highresolution view of transmission: we can estimate the distribution of cluster sizes in specific settings, accounting for reporting and overdispersion, and in some contexts we can estimate the transmission rate, all without requiring either individuallevel data or assumptions on transmission parameters such as the serial interval (see, in contrast, Cori et al., 2013; Wallinga and Teunis, 2004, which require serial interval estimates). The results offer contextspecific tools to simulate interventions in particular settings (here, schools). The methods are readily generalizable to other structured settings, such as workplace outbreaks where workplaces are similar in size and structure. Our results also suggest the need for data collection activities that can relate cluster sizes to setting variables such as occupancy, density, ventilation, activity, and distancing behaviour. Ultimately this would provide the data needed to design interventions that best reduce school and/or workplace transmission.
Appendix 1
Model for cluster size
We consider two models for ascertainment (whether a case is actually detected), though we only consider the first in the main text.
In the first ascertainment model (individual ascertainment) each of the infected individuals is detected with a probability q_{1}. So $X$, the total number of infected individuals is binomial $(n,p)$ with parameters $n=Z+1$ and $p={q}_{1}$. If by chance none of the individuals are observed, we do not observe the cluster. This is meant to model a situation where cases are detected independently of each other, and one detected case does not lead to further tests or screening.
In the second ascertainment model (cluster ascertainment), at first each case is identified with probability q_{2}, but then if any of the students are identified they are all subsequently identified. This is intended to capture a situation where a single detected case triggers testing for the whole class. Again, if no cases are detected we do not observe the cluster. This is equivalent to saying that clusters of size $m$ are detected in their entirety with probability $1{(1{q}_{2})}^{m}$.
The number $Z$ of new cases given the presence of one infectious case is a Poissondistributed random variable with a rate $\nu $ that is itself a Gammadistributed random variable with a shape $k$ and scale $\theta $. This means $Z$ has a negative binomial distribution $\mathrm{NB}(r,p)$, where $r=k$ and $p=\theta /(1+\theta )$. Letting $\mathrm{\Theta}=(k,\theta )$, the pmf of $Z$ is
Under the individual ascertainment model with ascertainment probability q_{1}, $X$, the number of observed cases, is binomial $(n,p)$ with $n=Z+1$ and $p={q}_{1}$. So, the probability that $i$ individuals are observed is
for $i=0,1,\mathrm{\dots}.$ Since we do not observe clusters with no observed cases the probability of observing a cluster of size $i$ is actually $P(X=i)={C}_{\mathrm{\Theta},{q}_{1}}^{1}{W}_{\mathrm{\Theta},{q}_{1}}(i)$ for $i=1,2,\mathrm{\dots}$, where ${C}_{\mathrm{\Theta},{q}_{1}}={\sum}_{i=1}^{\mathrm{\infty}}{W}_{\mathrm{\Theta},{q}_{1}}$.
If the observed cluster sizes are ${X}_{i}$, $i=1,\mathrm{\dots},n$, the loglikelihood function for $\mathrm{\Theta}=(k,\theta )$ under the individual ascertainment model is then
Under the cluster ascertainment model, the cluster is observed or not with probability $1{(1{q}_{2})}^{Y+1}$. So the probability of observing $X=j$ in a cluster is
${U}_{\mathrm{\Theta},{q}_{2}}(i)=[1{(1{q}_{2})}^{i}]{V}_{\mathrm{\Theta}}(i)$ for $i=0,1,\mathrm{\dots}$ but then since we cannot observe clusters of size 0, an observed cluster has size $i$ with probability
$P(X=i)={D}_{\mathrm{\Theta},{q}_{2}}^{1}{U}_{\mathrm{\Theta},{q}_{2}}(i)$ for $i=1,2,\mathrm{\dots}$, where ${D}_{\mathrm{\Theta},{q}_{1}}={\sum}_{i=1}^{\mathrm{\infty}}{U}_{\mathrm{\Theta},{q}_{2}}(i)$.
If the observed cluster sizes are ${X}_{i}$, $i=1,\mathrm{\dots},n$, the loglikelihood function for $\mathrm{\Theta}=(k,\theta )$ under the cluster ascertainment model is then
Under both ascertainment models, we then go from our estimates of $k$ and $\theta $ to estimates of ${R}_{c}$ via the formula
We use the Delta method to obtain confidence intervals for ${R}_{c}$ from confidence intervals on $k$ and $\theta $.
Analysis of US data
The US data was gathered from the National Educational Association website (Canadian Doctors, Professionals, & Citizens for Masks, 2021) (originally started by Alisha Morris, an educator at a Kansas high school) which collected data from news media and from reports submitted by volunteers (Walker, 2021). We selected the eight states with the most data available, and covering the period between August and November 2020. For the US data we used confirmed student cases listed on a particular date for the cluster size, excluding teachers and staff. We did not collect cases reported on different days at the same school in the same cluster as we did with the Canadian data.
In Appendix 1—table 1 we show some statistics associated with the data for each state. In the top we show the number of clusters, the number of schools appearing, the number of schools with more than one reported cluster, and the fraction of schools with multiple clusters. In the bottom we show the fraction of clusters that have only one observed case, and the average number of observed cases in the clusters, the maximum observed cluster size, the index of dispersion (variance divided by mean) of cluster size, and index of dispersion of the number of cases in a cluster subtracting one for the presumed index case. Comparing with Table 1, we can see several striking differences between the US and Canadian data. There are substantially more clusters reported in Canada than in the US, despite the US states having greater population on average. This may partly be explained by the Canadian data being collected over a longer period than the US data, but this is likely not the full explanation: in Appendix 1—figure 2 we show the rate (in clusters per day) that cases appear in the dataset over time. We can compare with Figure 3 (left) that shows the same thing for the Canadian data. Even at times when both US and Canadian datasets record clusters, Canadian rates are higher than US rates by an order of magnitude, despite incidence rates being similar in US states versus Canadian provinces (Appendix 1—figure 2 (right) versus Figure 3 (right)). This suggests that the method used for gathering cluster reports in different jurisdictions varied substantially between the two datasets, especially when we look at daily incident cases in each. Furthermore, the majority of schools in the US datasets only report one cluster., whereas the opposite is true of the Canadian data.
There are also substantial differences in statistics of cluster sizes. Mean observed cluster sizes were without exception larger in the US states than Canadian provinces, and Canadian provinces tended to have a higher fraction of clusters with only one case. Given the incomplete nature of the US data, we cannot determine whether these differences are due to real differences in transmission in the jurisdictions, or because smaller clusters were less likely to be reported in the US states.
Appendix 1—figure 3 shows the estimated mean cluster size ($={R}_{c}+1$) and dispersion $k$ same for the eight US states. In the US data, mean cluster size was estimated to range from about two in Texas to almost eight in Florida. Dispersions ranged from 0.05 to 0.3, showing considerable overdispersion compared to the Poisson distribution. However, given that we are very probably substantially undersampling clusters in the US data, and the clusters that we are observing are likely larger ones, these estimates of mean cluster size are biased upwards in a way we are not able to control for.
Varying the rate and model of ascertainment
In the main text we estimated parameters with the assumption of the individual ascertainment model with an ascertainment probability of 0.75. Here, we investigate how our main parameters ${R}_{c}+1$ (expected cluster size) and $k$ (dispersion) vary with this ascertainment probability. We also consider the alternate ascertainment model discussed in the previous section ‘Model for cluster size’.
Appendix 1—figure 4 shows the parameter estimates for the two models. The left plots show results for the individual ascertainment model where we set q_{2} = 1 and vary q_{1} from 0.2 to 1. The right plots show results for the group ascertainment model with q_{1} = 1 and q_{2} varying from 0.2 to 1. We see that the parameters do vary with the model and the ascertainment fraction, but relative magnitudes of the parameters in different jurisdictions do not change.
Varying the window for assigning cases to a cluster
In the main text, we reported results when clusters were defined by assigning cases to the same cluster if they were reported within 7 days of each other, or if they could be linked by a chain of such cases. We investigate here how changing the window for defining clusters affects our results. In Figure 5 we show how our estimates for dispersion ($k$) and mean cluster size (${R}_{c}+1$) vary when the window width is set to either 4, 7, or 10 days. Dispersion does not change much with changing the window, but as expected longer windows lead to larger clusters. However, the change in average cluster size from 7 to 10 days is modest, as we see in Appendix 1—table 2.
Data availability
Code and data have been deposited in GitHub (https://github.com/PaulFredTupper/covid19clustersinschools, copy archived at swh:1:rev:77cc5d7f42cde3c4eb71500b52a9797f6762712e) and Zenodo (https://doi.org/10.5281/zenodo.7117270).

ZenodoCrowdsourced COVID19 cases and outbreaks in US and Canadian Schools in 202021.https://doi.org/10.5281/zenodo.7117270
References

Estimate of the basic reproduction number for COVID19: a systematic review and metaanalysisJournal of Preventive Medicine and Public Health = Yebang Uihakhoe Chi 53:151–157.https://doi.org/10.3961/jpmph.20.076

WebsiteFor parents: Multisystem inflammatory syndrome in children (MISC) associated with COVID19Accessed October 26, 2021.

A new framework and software to estimate timevarying reproduction numbers during epidemicsAmerican Journal of Epidemiology 178:1505–1512.https://doi.org/10.1093/aje/kwt133

Assessing the age specificity of infection fatality rates for COVID19: systematic review, metaanalysis, and public policy implicationsEuropean Journal of Epidemiology 35:1123–1138.https://doi.org/10.1007/s10654020006981

Estimates of serial interval for covid19: a systematic review and metaanalysisClinical Epidemiology and Global Health 9:157–161.https://doi.org/10.1016/j.cegh.2020.08.007

Attributes and predictors of long COVIDNature Medicine 27:626–631.https://doi.org/10.1038/s4159102101292y

Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measuresAmerican Journal of Epidemiology 160:509–516.https://doi.org/10.1093/aje/kwh255

BookAll of Statistics: A Concise Course in Statistical InferenceSpringer Science & Business Media.https://doi.org/10.1007/9780387217369

Effectiveness of an inactivated SARScov2 vaccineThe New England Journal of Medicine 385:946–948.https://doi.org/10.1056/NEJMe2111165
Decision letter

Joshua T SchifferReviewing Editor; Fred Hutchinson Cancer Research Center, United States

Eduardo FrancoSenior Editor; McGill University, Canada

Joshua T SchifferReviewer; Fred Hutchinson Cancer Research Center, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "COVID19 clusters in schools: frequency, size, and transmission rates from crowdsourced exposure reports" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Joshua T Schiffer as Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Senior Editor.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. Please submit a revised version that addresses these concerns directly. Although we expect that you will address these comments in your response letter, we also need to see the corresponding revision clearly marked in the text of the manuscript. Some of the reviewers' comments may seem to be simple queries or challenges that do not prompt revisions to the text. Please keep in mind, however, that readers may have the same perspective as the reviewers. Therefore, it is essential that you attempt to amend or expand the text to clarify the narrative accordingly.
Essential revisions:
1) Please reconstruct how the comparison between the US and Canadian data is done in the paper. All 3 reviewers agreed that the US data likely is hampered by overrepresentation of large super spreader events, leading to biased and inaccurate projections of the overdispersion distribution. Please weight the 3 reviewers' suggestion for clarifying the presentation of this data.
2) Please either remove the time to detection analysis or incorporate a method to account for the inherent high degree of stochasticity of this outcome.
3) Please define cluster more clearly.
4) Please provide a more complete discussion about the true value of q and how this might impact model conclusions.
Reviewer #1 (Recommendations for the authors):
– Figure 1: xaxis needs labels and ticks to denote quantity.
– Please remove "not all the news is good for children": this language is too casual.
– All figures: increases font size substantially please.
– Figure 2: Please add an actual correlation plot with rho coefficients and pvalues to formalize what appears to be true by eye. This will be so helpful to demonstrate that the US data is insufficient for the model used in the paper (Pardon the slang but… garbage in, garbage out). The authors can then emphasize the point that appropriately thorough data is necessary to model crowd sourced data of this nature.
– Table 1: given overdispersion, the median, interquartile range and full range of cluster size would be more informative than just the mean. Please add percentages to schools with multiple clusters column.
– Keeping in mind that this is a generalist journal., it would be helpful to show theoretical distributions with varying values of k as a supplemental figure. This would provide valuable context for Figure 3 and how to interpret k (in addition to Figure 4 which is great).
– The assumption that q=0.75 should be cited, or better yet, acknowledged up front as uncertain. The sensitivity analysis with varying values of q should be emphasized more when q is introduced.
– Figure 3: would it be possible to show the median and mean on the xaxis?
– Figure 4: why not include all provinces in Canada?
– Figure 5 is a bit difficult to interpret except for the most statistical minded among us, and could be a supplement (both the v and β parameters) or not included at all.
– For all figures comparing US and Canada data, please normalize the yaxes so comparisons can be made by eye more easily.
– When T < τ, how is”mingling” implemented? Does the number of contacts switch with movement from “classroom to classroom”. This needs greater explanation.
– Figure 7 is interesting and one of the coolest parts of the paper but lacks sufficient detail. For reduced β, increased social distancing and bubbling, the distribution is impacted to various degrees but is the total number of cases lower? Is the variance in the distributions lower? Are these differences statistically significant?
– Figure 7: The selection of R=0.5 and R=5 seems perhaps too extreme. Most persisting variants have been associated with R~1 (0.81.2) whereas omicron was 35 fold higher, albeit probably lower overall in classrooms. I would suggest at least testing an intermediate set of R values. Maybe 1 and 3?
– Please mention the omicron variant in the discussion.
– The authors make the excellent point that the crowd sourcing method is cheaper and easier than contact tracing or phylogenetic approaches but should also mention that contact tracing would be a wonderful goldstandard to validate the accuracy of crowd sourcing in terms of identifying cluster size.
– Figure 8: this would be easier to interpret of the panels were square rather than rectangular.
Reviewer #2 (Recommendations for the authors):
The major concern with the presented analysis is the use of sparse and potentially biased data in the analysis. The authors use compiled reports of case cluster sizes in schools across a number of jurisdictions, but have no methodological means to correct for issues clearly present in the underlying data. For example, most schools in the US only report a single cluster in the dataset, which likely biases the US dataset in a number of significant ways. This is contrasted with Canadian data where almost all schools report multiple clusters. The figures showing incident cases in schools and the community recapitulates this point with the US data showing no clear relationship between schools and the community compared with the Canadian data that shows a clear relationship. In either the case where school clusters contribute to transmission in the community or clusters are simply a byproduct of community transmission, we would expect to see correlation between the incident cases. While the authors mention these limitations to their study, I believe these data issues may present an existential issue with the current analysis as it makes it nearly impossible to interpret the results (even though I admit that the authors are fairly careful in their own framing). For example, it will be natural to wonder why Florida's Rc is nearly 8 while Texas's is around 2, and I believe there is very limited to say about the situation without controlling for underlying issues in the data between the states. I would suggest that the authors consider major methodological changes that might address the potential biases in the analysis. Perhaps limiting the analysis solely to Canadian data could also solve this issue, though there are likely biases in the dataset to consider as well.
The paper focuses on describing characteristics of clusters of cases in schools, and the authors give some details on the number of clusters in their datasets, however we never get a proper definition of what is considered a cluster here. According to the text on pages 11 and 12 it seems that an assumption is that all secondary cases in a cluster are infected by the same seed case (directly or indirectly). However, the interpretation of the results generally consider it to be through the direct route, even though public health policies would diverge significantly. For example a cluster that was infected outside of the school environment but were detected in school (imagine a group of close friends) would be counted in the analysis for Rc which ultimately filters to an understanding of how mitigation measures in school could control outbreaks. Alternatively, a cluster of multiple generations sputtering along might be given a large Rc when the actual R_eff for transmission might be below or close to 1.
The cluster definition is extra important, because it also impacts the estimate of the transmission rate. The transmission rate estimation assumes all transmission is happening within schools and in a single generation. In cases where there is actually a chain of transmission, or when transmission is happening outside of the school setting the formula used to calculate β does not hold. The authors partially acknowledge this by restricting the analysis to Canadian reports, but I think it should be more explicitly discussed in the manuscript either through an attempt to handle the complexity mathematically or with a larger discussion of the interpretation of the results.
Finally, the authors use an ascertainment q=0.75, which seems high, at least for the United States reports. For the US, the CDC estimates 1 in 4 infections were reported for a period covering the authors dataset, and detection is even less likely in children who are more likely to be asymptomatic (e.g. Davies et al.). The authors mention they used an alternate model of ascertainment and running a sensitivity analysis on q, where results for q=0.5 are potentially different from the main text results. Given the very possibility that the actual value of q is in that range, I would urge that the authors revisit their main results taking this into account.
Reviewer #3 (Recommendations for the authors):
Probably the most important observation here is that Canadian school cluster sizes and k parameters seem more consistent with a public health system that is accurately identifying (even small) school outbreaks, whereas the absence of single case outbreaks and the large fraction of very large outbreaks in the US data suggests a system which is calibrated to only identify (more obvious) superspreader events. Canada's per capita excess death rate during the pandemic has been notably lower (around 1/3) that seen in the US. Given that a relatively small contribution to mortality has come from pediatric cases, can the authors spell out a bit more explicitly how better surveillance may have blunted the pandemic's impact in Canada relative to the US?
Thank you for making the relevant data (https://github.com/PaulFredTupper/covid19clustersinschools/tree/main/schools_cluster_paper/datasets) and code accessible.
[Editors’ note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "COVID19 cluster size and transmission rates in schools from crowdsourced case reports" for further consideration by eLife. Your revised article has been evaluated by a Senior Editor and a Reviewing Editor.
The manuscript has been improved and is very close to satisfactory for acceptance but there are some remaining issues that need to be addressed, as outlined below:
1. Please make the abstract more balanced between methodologic gains and scientific conclusions.
2. Apply techniques to account for multiple generations of infection (or at least rerun the model assuming a smaller 45 day window to define clusters in line with the known serial interval) to ensure that this does not substantially impact the model's semiquantitative conclusions.
Reviewer #1 (Recommendations for the authors):
Thank you for the extensive and extremely helpful changes to the article. It is well written, comprehensive and covers an important topic.
Reviewer #2 (Recommendations for the authors):
I greatly appreciate the authors revisions in response to the reviewer comments. In my previous review, I had a comment about the definition of the cluster and multiple generations of transmission. Now that the authors have provided that definition I have one major comment remaining.
The authors responded to my comment with: "These are indeed limitations of our analysis. We feel that we already addressed some of these limitations in the secondtolast paragraph of the discussion, but we have now added the following text as well: 'Furthermore, we assumed that all clusters consisted of an index case and a number of other cases directly infected by the index case. In reality, there may be longer chains of transmission. Any of these assumptions may bias our estimates of the distribution of ν and β.'"
Without strong evidence otherwise, I think the authors should address this concern within the analysis rather than simply a discussion. The definition of cluster now explicitly includes multiple transmission generations (through linking cases in successive weeks with one another). How many clusters have "multiple generations"? How does cluster size compare between single and multiple generation clusters? I imagine the multiple generation clusters are the largest ones, but it may not be the case.
Since every cluster in the current analysis is finite and small (i.e. the size of the cluster is finite and not close to the herd immunity threshold of the school), the data suggest that the R within schools is less than 1, but the author's analysis suggests otherwise because the analysis focuses solely on the index case transmission (e.g. not accounting for the fact that for a cluster of size 9 a single individual infects 8 individuals but each of those 8 individuals are not transmitting further). There is quite a bit of literature about stuttering chains of transmission (e.g. blumberg and lloydsmith's work for monkeypox) that could be appropriate for addressing this issue. I believe the authors could adapt their analysis to account for multiple generations explicitly using those methods or in a different way the authors deem appropriate. This point is important because ultimately it impacts the interpretation of R, the estimation of Β for the subsequent analyses, and potentially the impact of interventions.
https://doi.org/10.7554/eLife.76174.sa1Author response
Essential revisions:
1) Please reconstruct how the comparison between the US and Canadian data is done in the paper. All 3 reviewers agreed that the US data likely is hampered by overrepresentation of large super spreader events, leading to biased and inaccurate projections of the overdispersion distribution. Please weight the 3 reviewers' suggestion for clarifying the presentation of this data.
We have restructured the paper, making the main text be exclusively about the Canadian data, and moving the US data and its analysis to the Appendix. There we highlight the many limitations of this data and explain why we may infer unrealistically large mean cluster sizes in it.
2) Please either remove the time to detection analysis or incorporate a method to account for the inherent high degree of stochasticity of this outcome.
We have removed the time to detection analysis.
3) Please define cluster more clearly.
While trying to respond to this request, we realized that there had been a misunderstanding as to how the data had originally been collected. In the previously submitted version of the paper, it was assumed that the data reflected preidentified clusters. It turns out that the data only listed the number of cases in a school reported on each day that cases were reported. So, for example, if on May 4th, a school was reported as having 4 cases, and then on May 5th, the school reported 3 more cases, that would be included in our analysis as two separate clusters in the same school, even though they are more likely to be linked by transmission. Realizing this led us to redo our analysis. Accordingly, if cases in a school are reported within 7 days of each other, we assumed them to be in the same cluster. What we take our new clusters to be is a set of cases in one school that are linked by transmission within the school. This lead us to a different set of cluster sizes in each province, and therefore changed our parameter estimates. Generally, mean cluster size was estimated to be somewhat larger than before, with dispersion staying about the same. This did not lead to different conclusions in the rest of the paper. For example, our conclusions about what are the most likely interventions to succeed in the school setting did not change. Of course, our data only imperfectly allows us to identify true clusters, and we hope we have made the limitations of our method appropriately clear in the current manuscript.
4) Please provide a more complete discussion about the true value of q and how this might impact model conclusions.
We provide a citation to a metaanalysis of ascertainment fraction that supports our value of q. We discuss how a lower value of q would effect our estimates. And we still have in the SI values for our parameter with q going as low as 0.2.
Reviewer #1 (Recommendations for the authors):
– Figure 1: xaxis needs labels and ticks to denote quantity.
We have fixed this.
– Please remove "not all the news is good for children": this language is too casual.
We have deleted this phrase.
– All figures: increases font size substantially please.
Done. Please let us know if this is sufficient.
– Figure 2: Please add an actual correlation plot with rho coefficients and pvalues to formalize what appears to be true by eye. This will be so helpful to demonstrate that the US data is insufficient for the model used in the paper (Pardon the slang but… garbage in, garbage out). The authors can then emphasize the point that appropriately thorough data is necessary to model crowd sourced data of this nature.
Author response image 1 is a correlation plot for the Canadian data. We indicate the correlation between rate of clusters reported and incident cases in the plot. We don’t report a pvalue because they are not independent data points and a standard pvalue would not be informative. We would prefer to not show this plot in the paper, because the current plot are adequate for seeing the pattern in the data, that pattern is not key to any claims we are making, and the US data has since been moved to the Appendix.
– Table 1: given overdispersion, the median, interquartile range and full range of cluster size would be more informative than just the mean. Please add percentages to schools with multiple clusters column.
These are good suggestions. We have added fraction of schools with multiple clusters. But since most of the clusters have a single case in each province, the median is always 1, and the interquartile range is 0 or 1. Instead we have added maximum cluster size (minimum is of course 1) and index of dispersion and index of dispersion excluding the index case.
– Keeping in mind that this is a generalist journal., it would be helpful to show theoretical distributions with varying values of k as a supplemental figure. This would provide valuable context for Figure 3 and how to interpret k (in addition to Figure 4 which is great).
Thanks for this suggestion. We now have a figure showing the probability mass function for cluster size for varying parameters. We show it on a log scale for probability and for clusters only up to size 10. This makes it harder to compare to data, but otherwise it’s difficult to see the salient features of the distribution.
– The assumption that q=0.75 should be cited, or better yet, acknowledged up front as uncertain. The sensitivity analysis with varying values of q should be emphasized more when q is introduced.
We now cite a metaanalysis that includes an estimate of ascertainment in highincome countries in the Americas.
– Figure 3: would it be possible to show the median and mean on the xaxis?
We were not sure if the reviewer meant to refer to another figure. The mean is a parameter in the model (or at least is directly computed from a parameters in the model). The median is not, and is always 1 for all the model fits of the Canadian data. Apologies if we are not understanding what is being asked here.
– Figure 4: why not include all provinces in Canada?
We now show all four Canadian provinces we consider.
– Figure 5 is a bit difficult to interpret except for the most statistical minded among us, and could be a supplement (both the v and β parameters) or not included at all.
We agree that it’s difficult to interpret, which is part of the challenge of working with over dispersed distributions. But we would like to keep it in the main text, since it is one of the more useful plots we think for deciding what kinds of β values are useful in simulations.
– For all figures comparing US and Canada data, please normalize the yaxes so comparisons can be made by eye more easily.
We have moved all US data to the Appendix, since it has been pointed out that the data is probably not of high enough quality to be compared with the Canadian data.
When T < τ, how is”mingling” implemented? Does the number of contacts switch with movement from “classroom to classroom”. This needs greater explanation.
Yes, this was unclear. We have added the following sentence which we hope clarifies things: “The parameter τ can vary between some fraction of T (for example T/3, if the index case divides their time equally between three sets of n_{contact} contacts) or T if the set of contacts is fixed.” To answer the reviewer’s question, we don’t allow τ > T.
– Figure 7 is interesting and one of the coolest parts of the paper but lacks sufficient detail. For reduced β, increased social distancing and bubbling, the distribution is impacted to various degrees but is the total number of cases lower? Is the variance in the distributions lower? Are these differences statistically significant?
We have added comments in the text to explain more what is happening. In each of the plots, the mean number of cases is shown and have expanded the text there to make this clearer. We have also stated that all mean values are accurate to the number of digits reported, so there is no issue of statistical significance. Finally, we state the variance of the total number of infections in the figure too.
– Figure 7: The selection of R=0.5 and R=5 seems perhaps too extreme. Most persisting variants have been associated with R~1 (0.81.2) whereas omicron was 35 fold higher, albeit probably lower overall in classrooms. I would suggest at least testing an intermediate set of R values. Maybe 1 and 3?
Recall that our R_{c} is not the same as the R usually used in epidemiology. Our first values of R_{c} and k are the ones we estimated for Saskatchewan, so we think they are an appropriate choice.
For our larger value of R_{c} we now use 2.5, which may be more realistic than or original choice of 5.
– Please mention the omicron variant in the discussion.
We have now added the following to the discussion:
“Secondly, new, more transmissible variants of SARSCoV2 have emerged most notably the Α variant, the Δ variant, the Omicron variant, and most recently the BA.2 strain of the Omicron variant, each with a substantially higher transmissibility than its predecessors.”
– The authors make the excellent point that the crowd sourcing method is cheaper and easier than contact tracing or phylogenetic approaches but should also mention that contact tracing would be a wonderful goldstandard to validate the accuracy of crowd sourcing in terms of identifying cluster size.
We have now added the following to the discussion:
“However, we note that if it were available, contact tracing data would be an excellent gold standard against which to check our assumptions about cluster identification.”
– Figure 8: this would be easier to interpret of the panels were square rather than rectangular.
We have made the panels closer to square.
Reviewer #2 (Recommendations for the authors):
The major concern with the presented analysis is the use of sparse and potentially biased data in the analysis. The authors use compiled reports of case cluster sizes in schools across a number of jurisdictions, but have no methodological means to correct for issues clearly present in the underlying data. For example, most schools in the US only report a single cluster in the dataset, which likely biases the US dataset in a number of significant ways. This is contrasted with Canadian data where almost all schools report multiple clusters. The figures showing incident cases in schools and the community recapitulates this point with the US data showing no clear relationship between schools and the community compared with the Canadian data that shows a clear relationship. In either the case where school clusters contribute to transmission in the community or clusters are simply a byproduct of community transmission, we would expect to see correlation between the incident cases. While the authors mention these limitations to their study, I believe these data issues may present an existential issue with the current analysis as it makes it nearly impossible to interpret the results (even though I admit that the authors are fairly careful in their own framing). For example, it will be natural to wonder why Florida's Rc is nearly 8 while Texas's is around 2, and I believe there is very limited to say about the situation without controlling for underlying issues in the data between the states. I would suggest that the authors consider major methodological changes that might address the potential biases in the analysis. Perhaps limiting the analysis solely to Canadian data could also solve this issue, though there are likely biases in the dataset to consider as well.
Thank you for your comments. We agree that the quality of the data from the US is too poor to learn much from, and we have now moved it to the Appendix and explained its limitations more fully. Likewise, we have explained better how the Canadian data was collected, as well as providing some justification for our choice of ascertainment probability. Of course, one may disagree with our choice, but we hope that our sensitivity analysis can allow the reader to make their own decision about what an appropriate value of ascertainment fraction is appropriate, and make their own inferences.
The paper focuses on describing characteristics of clusters of cases in schools, and the authors give some details on the number of clusters in their datasets, however we never get a proper definition of what is considered a cluster here. According to the text on pages 11 and 12 it seems that an assumption is that all secondary cases in a cluster are infected by the same seed case (directly or indirectly). However, the interpretation of the results generally consider it to be through the direct route, even though public health policies would diverge significantly. For example a cluster that was infected outside of the school environment but were detected in school (imagine a group of close friends) would be counted in the analysis for Rc which ultimately filters to an understanding of how mitigation measures in school could control outbreaks. Alternatively, a cluster of multiple generations sputtering along might be given a large Rc when the actual R_eff for transmission might be below or close to 1.
These are indeed limitations of our analysis. We feel that we already addressed some of these limitations in the secondtolast paragraph of the discussion, but we have now added the following text as well:
“Furthermore, we assumed that all clusters consisted of an index case and a number of other cases directly infected by the index case. In reality, there may be longer chains of transmission. Any of these assumptions may bias our estimates of the distribution of ν and β.”
The cluster definition is extra important, because it also impacts the estimate of the transmission rate. The transmission rate estimation assumes all transmission is happening within schools and in a single generation. In cases where there is actually a chain of transmission, or when transmission is happening outside of the school setting the formula used to calculate β does not hold. The authors partially acknowledge this by restricting the analysis to Canadian reports, but I think it should be more explicitly discussed in the manuscript either through an attempt to handle the complexity mathematically or with a larger discussion of the interpretation of the results.
We don’t want to make our model more complicated. We could make a simulation based model, but then we lose our ability to fit (for multiple choices of q, for example) with the same ease we do here. We feel that we now discuss these limitations adequately in the discussion.
Finally, the authors use an ascertainment q=0.75, which seems high, at least for the United States reports. For the US, the CDC estimates 1 in 4 infections were reported for a period covering the authors dataset, and detection is even less likely in children who are more likely to be asymptomatic (e.g. Davies et al.). The authors mention they used an alternate model of ascertainment and running a sensitivity analysis on q, where results for q=0.5 are potentially different from the main text results. Given the very possibility that the actual value of q is in that range, I would urge that the authors revisit their main results taking this into account.
The ascertainment fraction in different regions is difficult to pin down. Indeed, there are
estimates of quite low ascertainment fractions in some regions. However, comparing the seroprevalence data from Canadian blood services to the cumulative incidence of cases, we obtained q = 1, which is admittedly unrealistically high. In the end we defer to a metaanalysis by Bobrovitz et al. In highincome jurisdictions in the Americas they report an estimated seroprevalance to cumulative incidence ratio. In the last quarter of 2020 they report 1.5 for the ratio, it decreases to 1.3 in the first quarter of 2021, and then to 1.2 in the second quarter of 2021. Taking the inverse gives numbers in the vicinity of 0.75. We added the following text:
“We base this value on the metaanalysis Bobrovitz et al. ,which reports ascertainment fractions for highincome regions in the Americas between 66% (in the last quarter of 2020) to 85% (in the second quarter of 2021).”
Reviewer #3 (Recommendations for the authors):
Probably the most important observation here is that Canadian school cluster sizes and k parameters seem more consistent with a public health system that is accurately identifying (even small) school outbreaks, whereas the absence of single case outbreaks and the large fraction of very large outbreaks in the US data suggests a system which is calibrated to only identify (more obvious) superspreader events. Canada's per capita excess death rate during the pandemic has been notably lower (around 1/3) that seen in the US. Given that a relatively small contribution to mortality has come from pediatric cases, can the authors spell out a bit more explicitly how better surveillance may have blunted the pandemic's impact in Canada relative to the US?
We mainly agree with your comments, and if we had a consistent source of data from the two countries, making such a comparison would be interesting. But as the other reviewers pointed out, the difference in how the data was collected in the two countries (and even between states within the US) makes such a comparison impossible.
Accordingly, we have moved the US data and analysis to the Appendix, where we explain its limitations.
[Editors’ note: further revisions were suggested prior to acceptance, as described below.]
The manuscript has been improved and is very close to satisfactory for acceptance but there are some remaining issues that need to be addressed, as outlined below:
1. Please make the abstract more balanced between methodologic gains and scientific conclusions.
Please see the following new abstract. In addition to some small changes in the beginning, we have rewritten the latter half as requested.
“The role of schools in the spread of SARSCoV2 is controversial, with some claiming they are an important driver of the pandemic and others arguing that transmission in schools is negligible. School cluster reports that have been collected in various jurisdictions are a source of data about transmission in schools. These reports consist of the name of a school, a date, and the number of students known to be infected. We provide a simple model for the frequency and size of clusters in this data, based on random arrivals of index cases at schools who then infect their classmates with a highly variable rate, fitting the overdispersion evident in the data. We fit our model to reports from four Canadian provinces, providing estimates of mean and dispersion for cluster size, as well as the distribution of the instantaneous transmission parameter β, whilst factoring in imperfect ascertainment. According to our model with parameters estimated from the data, in all four provinces (i) more than 65% of nonindex cases occur in the 20% largest clusters, and (ii) reducing instantaneous transmission rate and the number of contacts a student has at any given time are effective in reducing the total number of cases, whereas strict bubbling (keeping contacts consistent over time) does not contribute much to reduce cluster sizes. We predict strict bubbling to be more valuable in scenarios with substantially higher transmission rates.”
2. Apply techniques to account for multiple generations of infection (or at least rerun the model assuming a smaller 45 day window to define clusters in line with the known serial interval) to ensure that this does not substantially impact the model's semiquantitative conclusions.
We have added a section in the appendix where we recompute our parameter estimates using different windows for defining clusters: 10 days and 4 days. Changing the window never changes the estimates for the dispersion k by more than 10%. Average cluster size increases with a longer window, but by at most 11%. Decreasing the window decreases the estimate of average cluster size by at most 18%. We reference this in the main text with the following:
“We explore other choices of window in the appendix.”
and later on with
“Recall that we determined clusters by including cases in the same cluster if they were reported within 7 days of each other. In the Appendix we explore what happens if we change this window to either 4 days or 10 days. We find that estimates of κ do not change much: there is less than a 10% change in k in all cases. A window of 4 days leads to smaller cluster sizes (at most 18% smaller) and a window of 10 days leads to larger cluster sizes (at most 11% larger)”
Please see the final section of the appendix for details.
Another way we account for multiple generations of infection is in how we now estimate β; we use the 4 day window for β estimates, since that is more likely to capture only direct transmission. See our response to Reviewer #2 below.
Reviewer #2 (Recommendations for the authors):
I greatly appreciate the authors revisions in response to the reviewer comments. In my previous review, I had a comment about the definition of the cluster and multiple generations of transmission. Now that the authors have provided that definition I have one major comment remaining.
The authors responded to my comment with: "These are indeed limitations of our analysis. We feel that we already addressed some of these limitations in the secondtolast paragraph of the discussion, but we have now added the following text as well: 'Furthermore, we assumed that all clusters consisted of an index case and a number of other cases directly infected by the index case. In reality, there may be longer chains of transmission. Any of these assumptions may bias our estimates of the distribution of ν and β.'"
Without strong evidence otherwise, I think the authors should address this concern within the analysis rather than simply a discussion. The definition of cluster now explicitly includes multiple transmission generations (through linking cases in successive weeks with one another). How many clusters have "multiple generations"? How does cluster size compare between single and multiple generation clusters? I imagine the multiple generation clusters are the largest ones, but it may not be the case.
This is a very good point. Now that we have correctly interpreted the data we have information about how clusters unfold over time in a school. Accordingly, we don’t have to only work with total cluster size, which as you point out, may include multiple generations of infection. Because other data sets only contain cluster size (such as the lowerquality American data we consider in the Appendix) we will continue to use total cluster size and the model with parameter v as the focus of our work. However, there is no longer a need to estimate β from ν; we can actually estimate the number of direct infections of an index case, and use that to estimate β. Accordingly, we estimate the number of direct infections of an index case by fitting our model to data with the cluster size defined by the 4 day window. This is more likely to capture direct infections. This leads to several minor changes in the paper, but the main difference is a modest reduction in our values of β. The main change in the text of the document is the following paragraph:
“Finally, we perform a second analysis using the same model, using a smaller window of time for the definition of a cluster. In this way we hope to identify only the index case and the cases directly infected by the index case. We use the model above for this (smaller) number of cases for each cluster to estimate a distribution for ν, but then use this in turn to estimate a distribution for the instantaneous transmission rate β. Our reasoning is that if ν is the random Poisson parameter when the index case it exposed to n people for time T, then β has approximately the same distribution as ν/(nT). Under these assumptions, β is also a Gammadistributed random variable with parameter we can easily identify, from those for ν.”
Since every cluster in the current analysis is finite and small (i.e. the size of the cluster is finite and not close to the herd immunity threshold of the school), the data suggest that the R within schools is less than 1, but the author's analysis suggests otherwise because the analysis focuses solely on the index case transmission (e.g. not accounting for the fact that for a cluster of size 9 a single individual infects 8 individuals but each of those 8 individuals are not transmitting further). There is quite a bit of literature about stuttering chains of transmission (e.g. blumberg and lloydsmith's work for monkeypox) that could be appropriate for addressing this issue. I believe the authors could adapt their analysis to account for multiple generations explicitly using those methods or in a different way the authors deem appropriate. This point is important because ultimately it impacts the interpretation of R, the estimation of Β for the subsequent analyses, and potentially the impact of interventions.
These are good points. Originally we were only interested in estimating the total number of cases due to the presence of one index case in a school, in part because some of our data (no longer used) only reported final cluster size. This lead us to define R_{c}, the expected number of additional cases in the cluster, which is certainly different than the usual R. With the Canadian data we acquired later, we can actually try to estimate R, because there is some information about how the clusters unfold. Our estimates of R_{c} with a cluster definition window of 4 days is probably a reasonable estimate of R, and hence for an estimate of β The suggestion to use the literature about stuttering chains of transmission for estimation in this context is an interesting one, but we have not developed it for this work.
Since using a four day window instead of a 7 day window did not change parameter estimates we added comments here and there to acknowledge the change. In addition to the changes described above, this is reflected in the main text with the following:
“For estimating the distribution of β we used a 4 day window for the definition of clusters, since this is more likely to include only people directly infected by the index case.”
And
“The discrepancy is even greater when we consider clusters defined by the four day window, which are even smaller.”
https://doi.org/10.7554/eLife.76174.sa2Article and author information
Author details
Funding
Natural Sciences and Engineering Research Council of Canada (RGPIN201906911)
 Paul Tupper
Natural Sciences and Engineering Research Council of Canada (RGPIN201906624)
 Caroline Colijn
Government of Canada (Canada 150 Research Chairs Program)
 Caroline Colijn
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Alisha Morris and the other volunteers at the National Education Association for providing the US data that was used in this study. PT and CC were supported by Natural Science and Engineering Research Council (Canada) Discovery Grants (RGPIN201906911 and RGPIN201906624). CC receives funding from the Federal Government of Canada’s Canada 150 Research Chair program.
Senior Editor
 Eduardo Franco, McGill University, Canada
Reviewing Editor
 Joshua T Schiffer, Fred Hutchinson Cancer Research Center, United States
Reviewer
 Joshua T Schiffer, Fred Hutchinson Cancer Research Center, United States
Publication history
 Received: December 7, 2021
 Preprint posted: December 8, 2021 (view preprint)
 Accepted: October 20, 2022
 Accepted Manuscript published: October 21, 2022 (version 1)
 Version of Record published: November 30, 2022 (version 2)
Copyright
© 2022, Tupper et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 380
 Page views

 49
 Downloads

 0
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Epidemiology and Global Health
Background:
Whether natural selection may have attributed to the observed blood group frequency differences between populations remains debatable. The ABO system has been associated with several diseases and recently also with susceptibility to COVID19 infection. Associative studies of the RhD system and diseases are sparser. A large diseasewide risk analysis may further elucidate the relationship between the ABO/RhD blood groups and disease incidence.
Methods:
We performed a systematic loglinear quasiPoisson regression analysis of the ABO/RhD blood groups across 1,312 phecode diagnoses. Unlike prior studies, we determined the incidence rate ratio for each individual ABO blood group relative to all other ABO blood groups as opposed to using blood group O as the reference. Moreover, we used up to 41 years of nationwide Danish followup data, and a disease categorization scheme specifically developed for diagnosiswide analysis. Further, we determined associations between the ABO/RhD blood groups and the age at the first diagnosis. Estimates were adjusted for multiple testing.
Results:
The retrospective cohort included 482,914 Danish patients (60.4% females). The incidence rate ratios (IRRs) of 101 phecodes were found statistically significant between the ABO blood groups, while the IRRs of 28 phecodes were found statistically significant for the RhD blood group. The associations included cancers and musculoskeletal, genitourinary, endocrinal, infectious, cardiovascular, and gastrointestinal diseases.
Conclusions:
We found associations of diseasewide susceptibility differences between the blood groups of the ABO and RhD systems, including cancer of the tongue, monocytic leukemia, cervical cancer, osteoarthrosis, asthma, and HIV and hepatitis B infection. We found marginal evidence of associations between the blood groups and the age at first diagnosis.
Funding:
Novo Nordisk Foundation and the Innovation Fund Denmark

 Epidemiology and Global Health
 Genetics and Genomics
Background: To evaluate the utility of polygenic risk scores (PRS) in identifying highrisk individuals, different publicly available PRS for breast (n=85), prostate (n=37), colorectal (n=22) and lung cancers (n=11) were examined in a prospective study of 21,694 Chinese adults.
Methods: We constructed PRS using weights curated in the online PGS Catalog. PRS performance was evaluated by distribution, discrimination, predictive ability, and calibration. Hazard ratios (HR) and corresponding confidence intervals [CI] of the common cancers after 20 years of followup were estimated using Cox proportional hazard models for different levels of PRS.
Results: A total of 495 breast, 308 prostate, 332 femalecolorectal, 409 malecolorectal, 181 femalelung and 381 malelung incident cancers were identified. The area under receiver operating characteristic curve for the best performing sitespecific PRS were 0.61 (PGS000873, breast), 0.70 (PGS00662, prostate), 0.65 (PGS000055, femalecolorectal), 0.60 (PGS000734, malecolorectal) and 0.56 (PGS000721, femalelung), and 0.58 (PGS000070, malelung), respectively. Compared to the middle quintile, individuals in the highest cancerspecific PRS quintile were 64% more likely to develop cancers of the breast, prostate, and colorectal. For lung cancer, the lowest cancerspecific PRS quintile was associated with 2834% decreased risk compared to the middle quintile. In contrast, the hazard ratios observed for quintiles 4 (femalelung: 0.95 [0.611.47]; malelung: 1.14 [0.821.57]) and 5 (femalelung: 0.95 [0.611.47]) were not significantly different from that for the middle quintile.
Conclusions: Sitespecific PRSs can stratify the risk of developing breast, prostate, and colorectal cancers in this East Asian population. Appropriate correction factors may be required to improve calibration.
Funding This work is supported by the National Research Foundation Singapore (NRFNRFF201702), PRECISION Health Research, Singapore (PRECISE) and the Agency for Science, Technology and Research (A*STAR). WP Koh was supported by National Medical Research Council, Singapore (NMRC/CSA/0055/2013). CC Khor was supported by National Research Foundation Singapore (NRFNRFI201801). Rajkumar Dorajoo received a grant from the Agency for Science, Technology and Research Career Development Award (A*STAR CDA  202D8090), and from Ministry of Health Healthy Longevity Catalyst Award (HLCA20Jan0022). The Singapore Chinese Health Study was supported by grants from the National Medical Research Council, Singapore (NMRC/CIRG/1456/2016) and the U.S. National Institutes of Health [NIH] (R01 CA144034 and UM1 CA182876).