The unmitigated profile of COVID-19 infectiousness
Abstract
Quantifying the temporal dynamics of infectiousness of individuals infected with SARS-CoV-2 is crucial for understanding the spread of COVID-19 and for evaluating the effectiveness of mitigation strategies. Many studies have estimated the infectiousness profile using observed serial intervals. However, statistical and epidemiological biases could lead to underestimation of the duration of infectiousness. We correct for these biases by curating data from the initial outbreak of the pandemic in China (when mitigation was minimal), and find that the infectiousness profile of the original strain is longer than previously thought. Sensitivity analysis shows our results are robust to model structure, assumed growth rate and potential observational biases. Although unmitigated transmission data is lacking for variants of concern (VOCs), previous analyses suggest that the alpha and delta variants have faster within-host kinetics, which we extrapolate to crude estimates of variant-specific unmitigated generation intervals. Knowing the unmitigated infectiousness profile of infected individuals can inform estimates of the effectiveness of isolation and quarantine measures. The framework presented here can help design better quarantine policies in early stages of future epidemics.
Editor's evaluation
By analyzing a carefully curated dataset of cases observed early, and adjusting for multiple forms of bias, this study provides convincing evidence that in the absence of public health interventions, the duration of infectiousness of COVID-19 (original variant) is longer than previously estimated. These important findings improve our ability to model counterfactual intervention-free scenarios, add to evidence that interventions have reduced the duration of infectiousness, and provide an example of how to navigate the biases and pitfalls inevitably present in outbreak data.
https://doi.org/10.7554/eLife.79134.sa0
Introduction
In an emerging epidemic, such as the current COVID19 pandemic, information about key epidemiological parameters of the causative infectious agent (SARSCoV2 in the case of COVID19) is crucial for monitoring and mitigating the spread of the disease. A central epidemiological parameter, which determines the time scale of transmission, is the generation interval – the time between the infection of the infector (first case) and of the infectee (secondary case). Measuring the generation interval directly is hard in practice, as determining the exact time of infection is challenging. Thus, to infer the generation interval for an emerging infectious disease, researchers usually rely on two widely reported epidemiological parameters: the incubation period – the time between infection with the virus and the onset of symptoms (either for the infector or the infectee) – and the serial interval – the time between onset of symptoms of the infector and infectee (Fine, 2003; Svensson, 2007; Figure 1). Key epidemiological delays, such as incubation periods, serial intervals, and generation intervals, vary across hosts and transmission events, and are thus described as distributions rather than fixed values.
The generationinterval distribution plays a key role in determining the spread and control of emerging epidemics such as the ongoing COVID19 pandemic. At the population level, the generationinterval distribution links incidence of infection, particularly the epidemic growth rate r, with the reproduction number R (Gostic et al., 2020; Wallinga and Lipsitch, 2007). At the individual level, it characterizes the infectiousness profile (i.e., the temporal evolution of infectiousness from the time of infection). In the case of COVID19, short generation intervals, driven by presymptomatic transmission, have limited the effectiveness of different mitigation strategies, including contact tracing (Ferretti et al., 2020b), case isolation, quarantine (Sun et al., 2020), and testing (Grassly et al., 2020; Johansson et al., 2021).
The generation and serialinterval distributions can change over the course of an epidemic. For example, they are affected by the behavior of the population and can be shortened by the introduction of mitigation steps such as social distancing and case isolation, which limit the spread of the disease and reduce the probabilities of transmission after symptom onset (Ali et al., 2020). Our study aims to estimate the temporal dynamics of transmissibility of infected cases in the absence of intervention measures, noted hereafter as the ‘unmitigated generation interval’. Unbiased estimates of the time profile of transmissibility are important for inferring the effectiveness of selfisolation or quarantine policies in the absence of other interventions.
In practice, estimating the unmitigated infectious profile is expected to be challenging, since even in the absence of any mitigation policies, symptomatic individuals may selfisolate, reducing their own chances of late transmission. To address this issue, we apply a strict data curation procedure to account for which transmission events occurred both before major mitigation steps took place and before awareness of the epidemic became widespread. Most available estimates of the generationinterval distribution addressed the effects of mitigation only in a limited manner, not fully accounting for steps such as contact tracing and case isolation (Ferretti et al., 2020a; Ganyani et al., 2020; He et al., 2020).
Even after minimizing mitigation and behavioral effects, estimating the generationinterval distribution directly from contact tracing data remains difficult because the timepoint of infection of both the infector and the infectee are usually unknown. Instead, researchers estimate generationinterval and incubation period distributions by calculating the likelihood of observing all serial intervals in the transmission pair dataset (Ferretti et al., 2020a; Ferretti et al., 2020b; He et al., 2020) or else, they simply use the serialinterval distribution as a proxy for the generationinterval distribution (Flaxman et al., 2020).
While the serialintervalbased framework has been widely applied to infer the generationinterval distribution of COVID19 (Ferretti et al., 2020a; Ferretti et al., 2020b; Ganyani et al., 2020; He et al., 2020; Sun et al., 2020), there are several key methodological issues that could lead to considerable biases. First, the distribution of realized serial intervals depends on the rate of the spread of the disease as well as the direction from which they are measured: either forward from a cohort of infectors who developed symptoms at the same time or backward from a cohort of infectees (Park et al., 2021). A cohort of individuals that develop symptoms on a given day is a sample of all individuals who have been previously infected. When the incidence of infection is increasing, recently infected individuals represent a bigger fraction of this population and thus are overrepresented in this cohort. Therefore, we are more likely to encounter infected individuals with a short incubation period in this cohort compared to an unbiased sample. The forward serial interval is calculated for a cohort of infectors who developed symptoms at the same time and therefore is sensitive to this bias. These dynamical biases are demonstrated using epidemic simulations by Park et al., 2020. However, most analyses of serialinterval distributions assume that the incubation periods of the infector and infectee follow the same distribution (Ganyani et al., 2020; He et al., 2020; Sun et al., 2020), and only a few studies partially account for this dynamical bias (Ferretti et al., 2020a; Ferretti et al., 2020b). Second, incubation periods and temporal profile of infectiousness are likely to be correlated across infectors – that is, individuals that show symptoms later or earlier are also more likely to infect others later or earlier, respectively. 
Most available studies make strict assumptions on the relationship between the incubation period and the generation interval – either assuming that they are independent (Ferretti et al., 2020b; Ganyani et al., 2020; Sun et al., 2020) or that the time from onset of symptoms to transmission (TOST) is independent of the incubation period (He et al., 2020). Only a few studies have compared various correlation models (Ferretti et al., 2020a) or explicitly modeled the infectiousness profile relative to the incubation period (Hart et al., 2021). Finally, biases can arise from the data collection process. For example, determining who infected whom based on their symptomonset dates can miss presymptomatic transmission. Likewise, long serial intervals may represent multiple chains of transmissions where intermediate hosts were not correctly identified. These biases can cause overestimation of the mean serial interval as well as the mean generation interval.
Currently, no available estimate for the generation interval deals with all the biases described above, impairing our ability to accurately describe the infectiousness of SARSCoV2infected individuals in the absence of interventions. Here, we aggregate all available transmission data for Wuhan, China, in the initial stages of the pandemic, when the effects of mitigation steps were minimal, and employ a statistical framework that addresses the major sources of bias in estimating the generationinterval distribution. We estimate a median generation interval of 7.9 days (95% confidence interval [CI] 6.8–9) and an average of 9.7 days (95% CI 8.3–11.2), suggesting that the infectious period is much longer than previously thought. We further combine our generationinterval estimates with previously inferred viral load trajectories (Kissler et al., 2021; Hay et al., 2022) to extrapolate unmitigated generationinterval distributions of alpha, delta, and omicron variants. The estimated unmitigated generationinterval distribution could be adopted for answering questions about quarantine and isolation policy, as well as for estimating the original R_{0} at the initial spread in China. However, estimation of instantaneous R(t) should account for changes in generationinterval distributions, reflecting mitigation effects and the current variant.
Results
We estimated the unmitigated generation interval by focusing on the first period of transmission in China, thus minimizing the potential impacts of early interventions. To choose our analysis period, we relied on previous analyses of the early outbreak and the timeline of interventions in Wuhan and mainland China. We quantified the forward serial-interval distributions based on the symptom-onset dates of the infector. We found that the mean forward serial interval stayed constant until around January 17, 2020, and then decreased gradually, indicating changes in transmission dynamics (Figure 2). In particular, the strict restrictions on mobility (lockdown) imposed in Wuhan city on January 23 (Kraemer et al., 2020) likely impacted generation (and therefore serial) intervals of infectors who were infected a few days prior to this date – for example, an individual who developed symptoms on January 23 would have had reduced transmission after January 23, thereby shortening their generation and serial intervals. The clear negative trend in the mean serial interval from January 17–18 onward also matches the timing of the decrease in the effective reproduction number R(t) for domestic cases in Wuhan, China, estimated by Lipsitch et al., 2020. Therefore, we chose January 17 as our cutoff date. Large uncertainties in early serial-interval data limited our ability to detect changes in the mean forward serial interval before January 17. Previous studies (Kraemer et al., 2020; Lipsitch et al., 2020; Park et al., 2021) found no clear signs of change in the growth of the epidemic prior to the period between January 16 and January 19.
We used the transmission pairs for which the infector developed symptoms between December 12, 2019, and January 17, 2020, as our main dataset for estimating the unmitigated generationinterval distribution. This dataset includes a total of 77 transmission pairs with a mean serial interval of 9.1 days (95% CI: 7.9–10.2), and a standard deviation of 5.2 days. Although this is substantially longer than the mean of 7.8 days (95% CI: 7–8.6 days) suggested by Ali et al., 2020, for the early period of the epidemic, there is considerable uncertainty in both estimates with overlapping CI. Nonetheless, a lower mean serial interval estimated by Ali et al. likely reflects their decision to include infectors who developed symptoms up to January 22, who were already subject to effects of mitigation strategies. Other studies that did not differentiate different stages of the epidemic estimated a much lower mean serial interval (4–6 days) (He et al., 2020; Sun et al., 2020; Zhang et al., 2020; Zhanwei et al., 2020).
We inferred the unmitigated generation-interval distribution of SARS-CoV-2 transmission based on an integrative curated dataset, which focuses on the early-outbreak period in China.
We used the maximum likelihood framework to estimate the parameters of the joint bivariate distribution of the generation interval and the incubation period, assuming a known incubation period distribution (Xin et al., 2021) with a mean of 6.3 days and a standard deviation of 3.6 days. We estimate that the unmitigated generation-interval distribution has a median of 7.9 days (95% CI: 6.8–9), a mean of 9.7 days (95% CI: 8.3–11.2), and a standard deviation of 6.9 days (95% CI: 4.3–10.1). Furthermore, we estimate a correlation parameter (see Methods) of 0.75 (95% CI: 0.5–0.9). Our estimates are robust to the choice of data sources included in the analysis (Appendix 1—figure 4).
We note that our estimated mean generation interval is longer than the observed mean serial interval (9.1 days) of the period in question. This is supported by the theory (Park et al., 2021) of the dynamical effects of the epidemic – in contrast to the common assumption that the mean generation and serial intervals are identical. During the exponential growth phase, the mean incubation period of the infectors is expected to be shorter than the mean incubation period of the infectee – this effect causes the mean forward serial interval to become longer than the mean forward generation interval of the cohorts that developed symptoms during the study period. However, these cohorts of infectors with short incubation periods will also have short forward generation (and therefore serial) intervals due to their correlations. When the latter effect is stronger, the mean forward serial interval becomes shorter than the mean intrinsic generation interval, as these findings suggest.
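The shortening of infectors' incubation periods during exponential growth can be illustrated numerically. The sketch below is not the estimation code used in this study. It assumes that the incubation period is lognormal with the mean and standard deviation quoted above (6.3 and 3.6 days), and that, in a cohort sharing a symptom-onset date while incidence grows as exp(rt), the incubation-period density is reweighted by exp(−rx), as in the framework of Park et al., 2021. The function names and the growth rate used in the usage note are illustrative assumptions.

```python
import math


def lognormal_params(mean, sd):
    """Convert a mean and standard deviation into log-scale (mu, sigma)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def lognormal_pdf(x, mu, sigma):
    """Density of a lognormal distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def infector_mean_incubation(r, mean=6.3, sd=3.6, upper=80.0, steps=16000):
    """Mean incubation period in a cohort with a common symptom-onset date.

    Assumption: incidence grows as exp(r * t), so the intrinsic density f(x)
    is weighted by exp(-r * x) and renormalised (midpoint-rule integration).
    """
    mu, sigma = lognormal_params(mean, sd)
    dx = upper / steps
    numerator = denominator = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        weight = lognormal_pdf(x, mu, sigma) * math.exp(-r * x)
        numerator += x * weight
        denominator += weight
    return numerator / denominator
```

Calling `infector_mean_incubation(0.0)` recovers the intrinsic mean, while any positive growth rate (for example the hypothetical value `r = 0.1` per day) returns a shorter mean, which is the direction of the bias described in the text.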
The joint bivariate distribution and its marginal distributions are shown in Figure 3A. With or without the growth rate adjustment, the model was able to fit the observed serialinterval data well (Figure 3B). Using the inferred bivariate distribution, we derived the distribution of TOST, as shown in Figure 3—figure supplement 1. The negative side of this distribution gives the presymptomatic transmission, which constitutes ≈20% (95% CI: 6–32%) of total transmission.
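As a rough consistency check on the presymptomatic fraction, the calculation can be done in closed form under simplifying assumptions. The sketch below assumes that the generation interval and the infector's incubation period are jointly lognormal with the point estimates reported above, and that the correlation parameter of 0.75 applies on the log scale; the latter is our reading for illustration and should be checked against the Methods. Presymptomatic transmission then corresponds to the generation interval being shorter than the incubation period.

```python
import math


def lognormal_params(mean, sd):
    """Convert a mean and standard deviation into log-scale (mu, sigma)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def presymptomatic_fraction(gi_mean=9.7, gi_sd=6.9,
                            inc_mean=6.3, inc_sd=3.6, rho=0.75):
    """P(generation interval < incubation period) for a bivariate lognormal.

    Assumption: rho is the correlation between log(generation interval)
    and log(incubation period).
    """
    mu_g, s_g = lognormal_params(gi_mean, gi_sd)
    mu_i, s_i = lognormal_params(inc_mean, inc_sd)
    # log(G) - log(I) is normal with the mean and variance below.
    sd_diff = math.sqrt(s_g ** 2 + s_i ** 2 - 2.0 * rho * s_g * s_i)
    return normal_cdf((mu_i - mu_g) / sd_diff)
```

With the default arguments the result is close to the ≈20% reported above, and setting `rho=0` gives a noticeably larger value, which illustrates why the assumed correlation structure matters for this quantity.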
A comparison with the current available estimates of the generationinterval distribution (Ferretti et al., 2020a; He et al., 2020; Sun et al., 2020) reveals that the inferred distribution has a heavier (right) tail (Figure 4) and a higher median (7.9 days compared to 5.4–5.8 days) and standard deviation (6.9 days compared to 3.3–3.9 days). For example, the gamma distribution assumed by Johansson et al., 2021, to give an infectious period of about 10 days (and a peak at 5 days) for the analysis of quarantine and isolation policies has a far smaller tail. One way to quantify the difference in the tails of the different estimates is by comparing the proportion of transmission after a certain timepoint. When comparing the proportion of transmission after day 14, there are clear differences from previously reported distributions. The distributions of Ferretti et al., He et al., and Sun et al. indicate a residual fraction of transmission after 14 days of 2–4%, while the distribution assumed by Johansson et al. indicates only 0.2%. In contrast, our inferred generationinterval distribution predicts that about 18.5% (95% CI of 10–25%) of the transmission occurs after 14 days in the unmitigated scenario (assuming the behavior doesn’t change due to quarantine, isolation, testing, etc.).
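The tail comparison can be reproduced approximately from the summary statistics alone. The following sketch assumes that the marginal generation-interval distribution is lognormal with the reported mean (9.7 days) and standard deviation (6.9 days); it uses only point estimates and therefore ignores the uncertainty captured by the confidence intervals. The function names are illustrative.

```python
import math


def lognormal_params(mean, sd):
    """Convert a mean and standard deviation into log-scale (mu, sigma)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fraction_after(day, mean=9.7, sd=6.9):
    """Share of transmission occurring more than `day` days after infection."""
    mu, sigma = lognormal_params(mean, sd)
    return 1.0 - normal_cdf((math.log(day) - mu) / sigma)


def median_generation_interval(mean=9.7, sd=6.9):
    """Median of the lognormal generation-interval distribution."""
    mu, _ = lognormal_params(mean, sd)
    return math.exp(mu)
```

Under these assumptions `median_generation_interval()` is close to the reported 7.9 days, and `fraction_after(14)` falls near the reported 18.5% residual transmission after day 14.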
In addition to the possible dynamical and statistical biases considered in our analysis, the resulting wide generationinterval distribution might be affected by biases in the data collection process as detailed in the Introduction and Methods sections. The estimated generationinterval distributions were sensitive to the cutoff date with an estimated median of 6.5–8 days and estimated means of 7–10 days for periods ending on January 16 to January 19, 2020 (Figure 5A–C and Figure 5—figure supplement 1). Using cutoff dates of January 21 or later gives generationinterval distribution with median of less than 6 days, and residual transmission of 5% at 14 days after infection, similar to the values found in previous sources (Ferretti et al., 2020a; He et al., 2020; Sun et al., 2020). This demonstrates the impact of mitigations in biasing the inference of generationinterval distributions.
Switching the order of some of the transmission pairs caused a decrease in both the median and mean of the generation interval, as well as a decrease in the correlation parameter (Figure 5G–I, Figure 5—figure supplement 2).
The sensitivity analysis to high serialinterval values caused a slight decrease in the mean generation interval, but still resulted in a wide distribution. Removing the transmission pairs with the highest serial intervals from the dataset caused a small decrease in the generationinterval distribution. For a removal of the top 10% values, the inferred distribution has a median of 7.2 days and a mean of 8.3 days (Figure 5D–F, Figure 5—figure supplement 3). As switching the direction of transmission among randomly selected infectorinfectee pairs gives negative serial intervals (and thus lower mean serial interval), a decrease in the mean generationinterval distribution was expected. However, even when reordering 10% of the pairs the distribution is wide: for example, the median of bootstrap estimates for the median generation interval is 7.2 days (Figure 5H). These bootstrap estimates also yield substantial residual transmission at 14 days (Figure 5I).
Other factors of uncertainty in the estimate are the growth rate and incubation period distribution we assume for the inference of the distribution. Changing the assumed growth rate during this period had very little effect on the results, with estimated mean increasing from 9.5 to 9.7 days, as assumed growth rates decreased from 0.16 to 0.04 day^{–1} (Appendix 1—figure 7). Changing the incubation period to one with a median in the range of 4–5.5 days (Appendix 1—figure 10), as well as inclusion of severe cases in the dataset (Appendix 1—figure 11) had very little effect. These sensitivity analyses demonstrate the robustness of our conclusion: the unmitigated generationinterval distribution is likely wider than previously thought.
To quantify the effect of our estimated generationinterval distribution on the estimates of the basic reproduction number R_{0} of SARSCoV2 wild type, we use the growth rate estimated in a recent study of the early outbreak dynamics in China (Tsang et al., 2020). Combining our estimated generationinterval distribution with the early growth rate (Wallinga and Lipsitch, 2007), we find R_{0} to be 2.2 with a CI of 1.9–2.7 (Appendix 1—figure 5).
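The link between growth rate and reproduction number can be sketched with the relation of Wallinga and Lipsitch, 2007, R = 1/E[exp(−rG)], where G is the generation interval. The example below assumes a lognormal generation interval with the point estimates reported above and evaluates the expectation by numerical integration over a standard normal grid. The growth rate of 0.1 per day in the usage note is an illustrative assumption and is not the value taken from Tsang et al., 2020.

```python
import math


def lognormal_params(mean, sd):
    """Convert a mean and standard deviation into log-scale (mu, sigma)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def reproduction_number(r, mean=9.7, sd=6.9, n=4001, zmax=8.0):
    """R = 1 / E[exp(-r * G)] for a lognormal generation interval G.

    The expectation is a Riemann sum over z ~ N(0, 1) with G = exp(mu + sigma * z).
    """
    mu, sigma = lognormal_params(mean, sd)
    dz = 2.0 * zmax / (n - 1)
    expectation = 0.0
    for i in range(n):
        z = -zmax + i * dz
        g = math.exp(mu + sigma * z)
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        expectation += math.exp(-r * g) * density * dz
    return 1.0 / expectation
```

For instance, `reproduction_number(0.1)` lies within the confidence interval quoted above for R_{0}, and `reproduction_number(0.0)` returns 1, as expected for an epidemic that is neither growing nor shrinking.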
Finally, we estimated the unmitigated generationinterval distributions for new SARSCoV2 variants by incorporating viral kinetic information (Figure 6). The median generation interval of both the alpha and delta variants was estimated to be 6.7 days, 15% shorter than the original variant. The median generation interval of the omicron variant was estimated to be 5.8 days, ≈30% shorter than the original variant. Even though these generation intervals are considerably shorter than the 7.9 days median generation intervals of the wild type, there might be a considerable amount of late transmission – for example, we estimate that more than 15–19% of the transmission potential occurs 5 days after symptoms for the three variants of concern (VOCs), in the absence of mitigation.
Discussion
In this work, we assembled transmission pair data from 12 datasets representing the earlyoutbreak period in China, and modeled the relationship between disease transmission and symptom onset using a bivariate lognormal distribution. By applying a maximum likelihood framework, we found that the unmitigated generationinterval distribution has a heavier right tail than previously estimated (Ferretti et al., 2020a; He et al., 2020; Sun et al., 2020), corresponding to a larger mean and standard deviation. The bias in the previous estimates likely reflects the effects of mitigation steps, such as quarantine of exposed individuals, as well as changes in awarenessdriven behavior, such as faster selfisolation after symptom onset, that prevent transmission during late stages of infection. These sources of bias were not fully accounted for in previous estimates, leading to substantial underestimation of the generationinterval distribution.
Our sensitivity analysis of the cutoff date for the period of unmitigated transmission indicates that using late cutoff dates, such as January 21, leads to underestimation similar to that seen in the previous sources. However, these dates correspond to periods when transmission dynamics were affected heavily by mitigations, such as the Wuhan lockdown that started on January 23. Therefore, our results, based on an earlier cutoff date, are more representative of the unmitigated scenario.
Superspreading events are considered an important feature of the spread of COVID-19. Indeed, if the transmission pair dataset comprised a large number of cases from a single event, the inferred infectiousness profile would be biased due to strong statistical dependencies. However, the data we chose to include in the analysis consisted of at most two events with more than two infectees (4 and 7), and therefore superspreading likely had negligible effects on our analysis.
Accounting for potential correlations between the incubation period and the generation interval provided a better estimate of the proportion of presymptomatic transmission. Our results suggest that, on average, only ≈20% (6–32%) of the unmitigated transmission happens before symptoms appear, lower than commonly stated values that already include mitigation effects (40–60% Ferretti et al., 2020a; Sun et al., 2020). When mitigation strategies are introduced, we would expect the amount of postsymptomatic transmission to decrease, leading to an increase in the fraction of presymptomatic transmission. Thus, it is not surprising that our estimate of the proportion of presymptomatic transmission is lower than previous estimates that looked at a later period (Ferretti et al., 2020a; Sun et al., 2020). Furthermore, our results match the trend shown by Sun et al., 2020, in which the faster isolation of cases increases the presymptomatic fraction of transmission and shortens the mean generation interval.
To check whether these results are sensitive to our choice of a bivariate lognormal distribution to characterize the joint distribution of the generation interval and the incubation period, we repeated our analysis with a different functional form, an adjusted logistic TOST model following Ferretti et al., 2020a (see supplementary for details). Both models estimate large means and standard deviations of the generation intervals, and a low proportion of presymptomatic transmission for the current dataset. Applying both models to the data from Ferretti et al., 2020a, produced similar distributions with lower estimates for the mean generation interval and a higher presymptomatic proportion (Appendix 1—figure 6). This indicates that the results presented in this study are a product of the focus on the data prior to mitigation steps, in combination with the correction for the growth of the epidemic.
Following the sensitivity analyses to the cutoff date, the growth rate, and the model of infectiousness, we can see which of the three biases described in Table 1 has the greatest effect. We conclude that the cutoff date seems to be the dominant factor in our analysis, presumably meaning that taking the effects of interventions into account is the most important for an accurate estimate of the generationinterval distribution. Additional sensitivity analyses, such as to the assumed incubation period, also support this conclusion, as they show only a minor effect.
Our analysis relies on datasets of transmission pairs gathered from previously published studies and thus has several limitations that are difficult to correct for. Transmission pairs data can be prone to incorrect identification of transmission pairs, including the direction of transmission. In particular, presymptomatic transmission can cause infectors to develop symptoms after their infectees, making it difficult to identify who infected whom. Data from the early outbreak might also be sensitive to ascertainment and reporting biases which could lead to missing links in transmission pairs, causing serial intervals to appear longer (e.g., people who transmit asymptomatically might not be identified). Moreover, when multiple potential infectors are present, an individual who developed symptoms close to when the infectee became infected is more likely to be identified as the infector. These biases might increase the estimated correlation of the incubation period and the period of infectiousness. We have tried to deal with these biases by using a bootstrapping approach, in which some data points are omitted in each bootstrap sample. The relatively narrow ranges of uncertainty suggest that the results are not very sensitive to specific transmission pairs data points being included in the analysis. We also performed a sensitivity analysis to address several of the potential biases such as the duration of the unmitigated transmission period, the inclusion of long serial intervals in the dataset, and the incorrect orderings of transmission pairs (see Methods). The sensitivity analysis shows that although these potential biases can decrease the inferred mean generation interval, our main conclusions about the long unmitigated generation intervals (high median length and substantial residual transmission after 14 days) remained robust (Figure 5). 
Due to the nature of early spread of a new unknown disease, it is nearly impossible to find two completely unrelated datasets from the period prior to mitigation, limiting the ability of further validation of the current results.
Our estimates of the unmitigated generationinterval distribution can inform quarantine policy. The tail of the survival function (Figure 4B) indicates that individuals infected with the wild type still have, on average, ≈18% of their transmission potential 14 days after infection. We also found a strong correlation of the incubation period with the generation interval, accentuating the importance of quickly isolating individuals as soon as they show symptoms.
Determining the optimal period of quarantine for individuals exposed to COVID-19 is hard, as it needs to balance the prevention of further transmission with the personal and economic costs of longer quarantine. It is important to consider the basic risk of transmission underlying those considerations, by looking at the distribution of infectiousness in the absence of mitigation measures. The estimates of Johansson et al., 2021, for the residual transmission across different quarantine policies (e.g., with and without testing before release) have served as the basis for recent recommendations by the U.S. Centers for Disease Control and Prevention (CDC, 2020) for a 10-day quarantine period (without PCR testing) for exposed individuals. As can be seen in Figure 4B, our results suggest that this analysis underestimates the residual transmission after 10 days by an order of magnitude for the average individual (35% of the transmission vs. 4%). One of the first and ongoing policies for mitigating transmission is mandatory self-isolation for individuals developing COVID-19-related symptoms (Johansson et al., 2021; Quilty et al., 2021). We estimate a strong correlation of incubation period and infectiousness, enhancing the contribution of self-isolation to transmission prevention. However, even when considering self-isolation of 70% of individuals immediately upon symptoms, as Johansson et al., 2021, assumed in their analysis, we still find a residual transmission of 11.8% compared to 1.3% in Johansson et al.’s estimates (Figure 4—figure supplement 1). Indeed, the unmitigated infectiousness profile suggests that without testing, the residual transmission after quarantine would be substantially higher – thus supporting the policy of requiring PCR or rapid tests for ending quarantines, as required in many countries. The current study does not analyze the possible benefits of such testing policies directly, but only of self-isolation by individuals who developed COVID-19 symptoms.
In addition, quarantine and isolation measures typically begin several days after the infection event, suggesting that the actual amount of post-quarantine and post-isolation transmission would be lower than what we estimate.
The basic reproduction number R_{0} estimates derived here are close to values reported early in the epidemic (Chinazzi et al., 2020; Imai et al., 2020; Li et al., 2020; Wu et al., 2020b), despite the longer estimate for the generation-interval distribution. This is mainly due to using the corrected growth rate, which is considerably lower than previously assumed values (Tsang et al., 2020).
SARSCoV2 viral load trajectories serve an important role in understanding the dynamics of the disease and modeling its infectiousness (Quilty et al., 2021; Cleary et al., 2021). Indeed, the general shapes of the mean viral load trajectories and culture positivity, based on longitudinal studies, are comparable with our estimated unmitigated infectiousness profile (Figure 6—figure supplements 1 and 2, comparison with Chu et al., 2022 ; Killingley et al., 2022; Kissler et al., 2021). However, the nature of the relationship between viral load, culture positivity, symptom onset, and realworld infectivity is complex and not well characterized. Therefore, the ability to infer infectiousness from viral load data is very limited, especially near the tail of infectiousness, several days following symptom onset and peak viral loads. Viral load models are usually made to fit the measurements during an initial exponential clearance phase and in many cases miss a later slow decay (Kissler et al., 2021). Furthermore, there is considerable individuallevel variation in viral trajectories that isn’t accounted for in populationmean models (Kissler et al., 2021; Singanayagam et al., 2021). Other factors limiting the ability to compare generationinterval estimates with viral loads models are the variability of the incubation periods and its relation to the timing of the peak of the viral loads, and the great uncertainty and apparent nonlinearity of the relation between viral loads and culture positivity (Jaafar et al., 2021; Jones et al., 2021). Due to these caveats and in order to avoid overinterpretation of viral load data, we restrict our extrapolation of new VOCs’ infectiousness to a single parameter characterizing the viral duration of clearance.
New SARSCoV2 VOCs continue to emerge and replace previous lineages, adding uncertainty to the pandemic transmission dynamics, including the shape of infectiousness profiles. Although a few studies have tried to characterize generation and serialinterval distributions of new variants (Hart et al., 2021; Pung et al., 2021; Ryu et al., 2021; Hart et al., 2022), these analyses are necessarily subject to behavioral and intervention effects and are likely to underestimate the true duration of SARSCoV2 infectiousness. Instead, we estimated the unmitigated infectiousness profiles of new variants by comparing the differences in decay rates of viral load trajectories (Hay et al., 2022; Kissler et al., 2021). Our analyses suggested that the unmitigated generation intervals of the alpha and delta variants are shorter by 15% than those of the original strain and by 30% for the omicron variant (Figure 6). Our estimates of residual transmission (more than 15% transmission occurring at least 5 days after symptom onset for the alpha, delta, and omicron variants) suggest that caution is needed with the CDC’s newly updated (CDC, 2021) isolation guideline in the absence of additional measures, such as testing, before releasing isolated individuals. Our extrapolations are necessarily crude given the complex relationship between viral load, symptomaticity, and infectiousness discussed above. Moreover, compartmentalization in the respiratory tract, aerosolization, receptor binding affinity, and immune history can also play important roles in determining the infectiousness profiles of SARSCoV2 variants (Puhach et al., 2022). Furthermore, changes in populationlevel susceptibility can also cause the generationinterval distribution to change over time (Kenah et al., 2008; Champredon and Dushoff, 2015).
Our analysis focuses on estimating unmitigated intervals. Therefore, our estimates don’t take into account the effect of current interventions and behavioral changes. Nonetheless, these estimates can be useful for assessing isolation and contact tracing measures. One prediction of our extrapolation procedure is that the durations of infectiousness and incubation periods for the alpha and delta variants would be shorter by 15% relative to the original strain, which is supported by independent studies (Brandal et al., 2021; Grant et al., 2021; Hwang et al., 2021; Singanayagam et al., 2021; Hart et al., 2022). Further transmission data as well as rich viral load trajectory data could assist in better inferences of the infectiousness profiles of new variants.
The current analysis provides an updated benchmark for the unmitigated profile of SARS-CoV-2 infectiousness. Furthermore, with the emergence of new VOCs, which may exhibit transmission dynamics that differ from those of the previously dominant wild type (Kissler et al., 2021), future studies could use our framework to update estimates of the generation-interval distributions for these emerging strains, even under mitigation conditions, and to infer their correlation with the incubation period.
Taken together, our results demonstrate the importance of considering possible biases in the serial-interval data used for estimating the generation-interval distribution, as well as the underlying assumptions made when estimating the distribution from the source data. Our analysis provides a view of the infectiousness profile of an infected individual in the absence of mitigation steps, which is a key ingredient of many models used for guiding policy.
Methods
Data collection
Data on serial intervals of transmission events were gathered from published and preprint literature, using a literature survey as described in the supplementary information. In order to control for biases introduced by later interventions, we focused on data from the early stages of the epidemic, when there were almost no cases identified outside China. Twelve relevant datasets were identified: (Ali et al., 2020; Ganyani et al., 2020; He et al., 2020; Liao et al., 2020; Li et al., 2020; Ren et al., 2021; Wu et al., 2020a; Xia et al., 2020; Yang et al., 2020; Zhang et al., 2020; Zhanwei et al., 2020; Zhao et al., 2020). In total, the combined dataset contained 2000 pairs, including duplicates. We crosschecked for duplicates in the combined dataset in three steps (see Appendix 1—figure 1): First, we removed pairs containing the same ‘infector/infectee ID’ (leaving 1685 pairs). Second, we looked at datasets containing sex and age information of the contacts and identified as duplicates those with matching sex, age, and symptomonset date for both cases (identifying 931 unique transmission pairs in these sets). Lastly, we looked at the datasets not containing information regarding the sex and age of the cases (additional 406 pairs) and added to the dataset only pairs with symptomonset dates that did not occur already in the first group (71 of the 406 cases were added, resulting in 1002 transmission pairs in total). See Appendix 1—figures 2 and 3 for a visualization of the datasets as a function of the symptomonset date.
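The three de-duplication steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a list of dictionaries) and all field names such as `pair_id` and `infector_onset` are hypothetical and are not taken from the curated dataset.

```python
def deduplicate_pairs(pairs):
    """Three-step de-duplication of transmission pairs (illustrative sketch).

    Each pair is a dict with hypothetical keys:
      'pair_id'                          infector/infectee ID, or None
      'infector_onset', 'infectee_onset' symptom-onset dates (e.g. ISO strings)
      'infector_sex', 'infector_age', 'infectee_sex', 'infectee_age' (or None)
    """
    # Step 1: drop pairs that repeat an infector/infectee ID.
    seen_ids, step1 = set(), []
    for p in pairs:
        pid = p.get("pair_id")
        if pid is not None:
            if pid in seen_ids:
                continue
            seen_ids.add(pid)
        step1.append(p)

    demo_fields = ("infector_sex", "infector_age", "infectee_sex", "infectee_age")

    def has_demo(p):
        return all(p.get(f) is not None for f in demo_fields)

    def onset_key(p):
        return (p["infector_onset"], p["infectee_onset"])

    # Step 2: among pairs with sex and age, treat matching sex, age,
    # and symptom-onset dates of both cases as duplicates.
    seen_demo, unique = set(), []
    for p in step1:
        if has_demo(p):
            key = tuple(p[f] for f in demo_fields) + onset_key(p)
            if key not in seen_demo:
                seen_demo.add(key)
                unique.append(p)

    # Step 3: add pairs lacking sex/age only if their onset dates
    # do not already occur in the first group.
    first_group_onsets = {onset_key(p) for p in unique}
    for p in step1:
        if not has_demo(p) and onset_key(p) not in first_group_onsets:
            unique.append(p)
    return unique
```

Matching on onset dates alone in the last step is conservative: it can discard genuinely distinct pairs that happen to share onset dates, which is the trade-off implied by the procedure described in the text.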
Statistical model of serial interval data
Following Klinkenberg and Nishiura, 2011; Park et al., 2021, our model incorporates the possible interaction of the generation interval (${\tau}_{g}$) with the incubation period of the infector (${\tau}_{i}$) using a joint density function, denoted $h\left({\tau}_{i},{\tau}_{g}\right)$. The use of a joint distribution allows us to consider a correlation between the two periods. For example, it allows us to assume that infected individuals who develop symptoms later than average are more likely to transmit later than average, given that viral load peaks around the time of symptom onset. This is supported by longitudinal viral load studies showing that viral loads and culture positivity peak around symptom onset (Killingley et al., 2022; He et al., 2020) and by previous analyses of transmission pairs (Ferretti et al., 2020a; Hart et al., 2021).
When the epidemic is in equilibrium (i.e., the incidence of infection remains constant over time), we can write down the probability density function $s\left(\tau \mid {\alpha}_{1},{\alpha}_{2}\right)$ of observing an infector-infectee pair whose symptom-onset dates differ by a specific period (the serial interval). This probability density function is conditional on the infection times of the infector, ${\alpha}_{1}$, and of the infectee, ${\alpha}_{2}$, both measured relative to the symptom-onset time of the infector. As described in Figure 1, if we define the symptom-onset time of the infector as zero, then ${\alpha}_{1}<0$, and because the infector has to be infected before the infectee, ${\alpha}_{1}<{\alpha}_{2}$. Assuming equilibrium conditions, $s$ is equal to the joint density of the infector's incubation period and the generation interval, $h\left({\tau}_{i},{\tau}_{g}\right)$, evaluated at ${\tau}_{i}=-{\alpha}_{1}$ and ${\tau}_{g}={\alpha}_{2}-{\alpha}_{1}$, multiplied by the probability density function of the infectee’s incubation period, $l\left(\tau -{\alpha}_{2}\right)$ (the marginal distribution derived from $h$ by integration over ${\tau}_{g}$, $l\left({\tau}_{i}\right)={\int}_{0}^{\infty}h\left({\tau}_{i},{\tau}_{g}\right)d{\tau}_{g}$):

$$s\left(\tau \mid {\alpha}_{1},{\alpha}_{2}\right)=h\left(-{\alpha}_{1},{\alpha}_{2}-{\alpha}_{1}\right)\,l\left(\tau -{\alpha}_{2}\right)\qquad (1)$$
where $\tau $ is the serial interval, and ${\alpha}_{1}$, ${\alpha}_{2}$ are the infection times. The notations are further presented together with the definitions in Figure 1. As is shown in Equation (1), the two distributions $\left(h\left({\tau}_{i},{\tau}_{g}\right),l\left({\tau}_{i2}\right)\right)$ depend on the relative infection times of both the infector and the infectee (${\alpha}_{1}$ and ${\alpha}_{2}$). Although the exact time of infection is typically unknown, a possible exposure time window is provided in many cases. To compensate for the lack of information, the model integrates over all possible combinations of infector and infectee exposure times when estimating the parameters of the distribution from the observed serial intervals of the transmission pairs:

$$s\left(\tau \right)={\int}_{-\infty}^{0}{\int}_{{\alpha}_{1}}^{\tau}h\left(-{\alpha}_{1},{\alpha}_{2}-{\alpha}_{1}\right)\,l\left(\tau -{\alpha}_{2}\right)\,d{\alpha}_{2}\,d{\alpha}_{1}$$
Most previous analyses of the serial-interval distributions of COVID-19 have relied on this model, which assumes a constant force of infection (i.e., the per capita rate at which susceptible individuals become infected). However, in the beginning of an epidemic, the number of infections (and therefore the force of infection) increases exponentially, creating a specific ‘backward’ bias. When the force of infection is increasing exponentially, a cohort of infectors that developed symptoms at the same time is more likely to have been infected recently and thus to have shorter incubation periods, on average, than their infectees. Because incubation periods and generation intervals are correlated, infectors with short incubation periods will also tend to have short generation intervals, meaning that individuals who transmit early after infection are overrepresented. It is important to correct for this bias by adding a factor ${e}^{r{\alpha}_{1}}$, where $r$ is the exponential growth rate of the epidemic (Park et al., 2021), which gives, up to a normalization constant:

$$s\left(\tau \right)\propto {\int}_{-\infty}^{0}{\int}_{{\alpha}_{1}}^{\tau}{e}^{r{\alpha}_{1}}\,h\left(-{\alpha}_{1},{\alpha}_{2}-{\alpha}_{1}\right)\,l\left(\tau -{\alpha}_{2}\right)\,d{\alpha}_{2}\,d{\alpha}_{1}$$
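As a rough numerical illustration of the growth-rate correction, the sketch below evaluates the unnormalized serial-interval density by integrating over the infection times of the infector and infectee with a midpoint rule, weighting each infector infection time by the exponential growth factor. It assumes a bivariate lognormal form for the joint density; the parameter values in `example_params`, the grid step, and the truncation at 40 days are placeholders chosen for illustration, not the fitted values of this study.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of a lognormal with log-mean mu and log-standard deviation sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def bivariate_lognormal_pdf(x, y, mu_x, sigma_x, mu_y, sigma_y, rho):
    """Joint density h(x, y); rho is the correlation between log(x) and log(y)."""
    if x <= 0 or y <= 0:
        return 0.0
    zx = (math.log(x) - mu_x) / sigma_x
    zy = (math.log(y) - mu_y) / sigma_y
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    norm = 2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho * rho) * x * y
    return math.exp(-0.5 * q) / norm

def serial_interval_density(tau, r, p, step=0.25, max_incubation=40.0):
    """Unnormalized serial-interval density at tau for growth rate r.

    a1 < 0 is the infector's infection time and a2 in (a1, tau) the infectee's,
    both relative to the infector's symptom onset at time zero.
    """
    total = 0.0
    for i in range(int(max_incubation / step)):
        a1 = -(i + 0.5) * step
        if tau <= a1:
            continue  # no admissible infectee infection time
        n_inner = max(1, math.ceil((tau - a1) / step))
        h2 = (tau - a1) / n_inner
        inner = 0.0
        for j in range(n_inner):
            a2 = a1 + (j + 0.5) * h2
            joint = bivariate_lognormal_pdf(-a1, a2 - a1, p["mu_i"], p["sigma_i"],
                                            p["mu_g"], p["sigma_g"], p["rho"])
            inner += joint * lognormal_pdf(tau - a2, p["mu_i"], p["sigma_i"])
        total += math.exp(r * a1) * inner * h2 * step
    return total

# Placeholder parameters (assumptions for illustration only).
example_params = {"mu_i": math.log(5.0), "sigma_i": 0.5,
                  "mu_g": math.log(8.0), "sigma_g": 0.5, "rho": 0.5}
```

Setting `r = 0` recovers the equilibrium form. Because the weight is at most one for negative infection times, the unnormalized values with `r > 0` are always smaller, so the density must be renormalized over `tau` before it is used in a likelihood.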
Incubation period distribution and growth rate assumptions
We used the incubation period distribution provided by a meta-analysis that reviewed and aggregated 72 studies, as it likely represents the best available estimate for the wild type (Xin et al., 2021). In their meta-analysis, Xin et al., 2021 found an increase in the incubation period following the introduction of interventions in China, matching the theoretical framework shown above. Accordingly, their inferred incubation period distribution includes a correction for the growth rate of the early spread.
The daily growth rates in the early outbreak period in Wuhan in particular and in the rest of China were estimated by another study (Tsang et al., 2020) to be r=0.08 day^{–1} and r=0.10 day^{–1} , respectively. In our main analysis, we used the growth rate measured for mainland China (r=0.10 day^{–1}), taken as a mean growth rate representing the dynamic of the early outbreak relevant for most of the transmission pairs. We further present a sensitivity analysis for this parameter (see Results section). We note that daily growth rate estimates of 0.08–0.10 day^{–1} are lower than previous estimates in the range of 0.17–0.3 day^{–1} (Kamalich et al., 2020a; Park et al., 2020) due to case ascertainment corrections (Park et al., 2020). For the functional form of $h$, we used a bivariate lognormal distribution. Parameters for the incubation period were taken from the metaanalysis (Xin et al., 2021) leaving three free parameters: the shape and the scale of the lognormal distribution defining the generationinterval univariate distribution, and a correlation parameter (defined as the correlation between the logged incubation period and the logged generation interval). In order to test the sensitivity of our results to the choice of a lognormal distribution, we also considered the alternative form used in Ferretti et al., 2020a in Appendix 1—figure 6.
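To make the role of the correlation parameter concrete, the following sketch draws correlated incubation periods and generation intervals from a bivariate lognormal distribution, with the correlation defined on the log scale as in the text. The function names and the parameter values used in the usage note are illustrative assumptions, not the estimates reported here.

```python
import math
import random

def sample_incubation_and_generation(n, mu_i, sigma_i, mu_g, sigma_g, rho, seed=0):
    """Draw n (incubation period, generation interval) pairs.

    mu_* and sigma_* are the means and standard deviations of the logged
    variables; rho is the correlation between the two logged variables.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Mix z1 with independent noise so that corr(z1, z2) = rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        samples.append((math.exp(mu_i + sigma_i * z1),
                        math.exp(mu_g + sigma_g * z2)))
    return samples

def log_correlation(samples):
    """Pearson correlation between the logged components of the samples."""
    xs = [math.log(a) for a, _ in samples]
    ys = [math.log(b) for _, b in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For example, `log_correlation(sample_incubation_and_generation(20000, math.log(5), 0.5, math.log(8), 0.5, 0.5))` should return a value close to the requested 0.5, up to sampling noise.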
Maximum likelihood inference of the generationinterval distribution
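We then chose the parameters $\widehat{\theta}$ that maximize the likelihood of the observed serial intervals ${\tau}_{j}^{obs}$ (the maximum likelihood estimate):

$$\widehat{\theta}=\underset{\theta}{\mathrm{arg\,max}}\sum _{j}\log s\left({\tau}_{j}^{obs}\mid \theta \right)$$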
The sequential least squares programming (SLSQP) method, as implemented in the Python library SciPy, was used to maximize the log-likelihood (Kraft, 1988; Virtanen et al., 2020). We calculated the uncertainties of the estimates using bootstrapping: the dataset was resampled with replacement (100 times for the main analysis and 100 times for sensitivity analyses) and each resample was processed via the maximum likelihood framework. In addition, the growth rate (r) was sampled from the uncertainty distribution found in a previous study of the early outbreak in China (Tsang et al., 2020). We calculated 95% confidence intervals (CIs) from the quantiles of the bootstrap results.
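The bootstrap procedure can be sketched independently of the full model. In the minimal example below, the estimator is a stand-in: a closed-form maximum likelihood fit of a univariate lognormal, used only so that the sketch stays self-contained. It does not reproduce the bivariate model, the SLSQP optimization, or the resampling of the growth rate described above, and the function names are hypothetical.

```python
import math
import random

def fit_lognormal(data):
    """Closed-form MLE of a univariate lognormal: returns (log-mean, log-sd)."""
    logs = [math.log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def bootstrap_ci(data, estimator, n_boot=100, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(data)."""
    rng = random.Random(seed)
    estimates = sorted(
        estimator(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_boot)
    )
    lo_idx = math.floor((1 - level) / 2 * (n_boot - 1))
    hi_idx = math.ceil((1 + level) / 2 * (n_boot - 1))
    return estimates[lo_idx], estimates[hi_idx]

def fitted_median(data):
    """Median of the fitted lognormal, exp(log-mean)."""
    return math.exp(fit_lognormal(data)[0])
```

Calling `bootstrap_ci(intervals, fitted_median)` on a list of positive intervals returns the lower and upper percentile bounds. Swapping `fitted_median` for a function that runs the full likelihood maximization would give the same structure as the analysis described in the text.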
Sensitivity analyses
We conducted three primary sensitivity analyses to investigate potential biases in our approach. First, we tested how sensitive our estimate of the unmitigated generation-interval distribution is to our cutoff date assumption by varying the cutoff between January 11 and January 25. We note that serial-interval data from later dates are generally less reliable, as they are affected by mitigation measures, which prevent late transmissions. Second, we considered the possibility that long serial intervals may be caused by omission of intermediate infections in multiple chains of transmission, which in turn would lead to overestimation of the mean serial and generation intervals. Thus, we refitted our model after removing long serial intervals from the data (by varying the maximum serial interval between 14 and 24 days). We also considered ‘splitting’ these intervals into smaller intervals, but decided this was unnecessarily complex, since several choices would need to be made, and the effects would likely be small compared to the effect of the choice of maximum, since the distribution of the resulting split intervals would not differ sharply from that of the remaining observed intervals in most cases. Finally, we considered the possibility that the lack of negative serial intervals in early serial-interval data might have been caused by the incorrect determination of the direction of transmission, especially given limited information about presymptomatic transmission in the beginning of the pandemic. In other words, infectees who developed symptoms before their infectors may have been incorrectly identified as primary cases. To test for potential biases, we refitted our model after switching the direction of transmission among randomly selected infector-infectee pairs by varying the number of pairs switched (2, 4, 6, or 8 pairs out of 77) and the maximal serial interval for which order switching is allowed (3, 5, or 7 days). For each combination, the analysis was run 30 times with randomly sampled infector-infectee pairs.
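The direction-switching test can be illustrated with a short sketch. Swapping the roles of infector and infectee negates the serial interval of a pair, so the sketch simply flips the sign of randomly chosen intervals that do not exceed the allowed maximum. The function name and the choice to treat only strictly positive intervals as eligible are assumptions made for this illustration.

```python
import random

def switch_transmission_direction(serial_intervals, n_switch, max_interval, seed=None):
    """Negate n_switch randomly chosen serial intervals in (0, max_interval].

    Reversing who infected whom turns a serial interval s into -s.
    """
    rng = random.Random(seed)
    eligible = [i for i, s in enumerate(serial_intervals) if 0 < s <= max_interval]
    chosen = set(rng.sample(eligible, min(n_switch, len(eligible))))
    return [-s if i in chosen else s for i, s in enumerate(serial_intervals)]
```

Repeating the call with different seeds for each combination of `n_switch` and `max_interval`, and refitting the model on each perturbed dataset, mirrors the repeated random sampling described above.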
Beyond the primary sensitivity analyses, we also performed several supplementary sensitivity analyses. First, we tested other possible sensitivities of the data to biases based on location of infection, or the literature source of the data. To test the sensitivity to infection location, we stratified the dataset by where the infectors were infected (Wuhan vs. outside of Wuhan) as detailed in the supplementary information. To test for sensitivity to any specific literature source, we repeated the analysis while removing one dataset at a time, including all the transmission events that were duplicated also in other datasets (defined by the infector and infectee ID). Second, the effect of the assumed growth rate was assessed by varying it between 0.04 and 0.16 day^{–1}, and the effect of the assumed incubation period distribution was assessed by varying its median parameter between 4 and 5.5 days. Furthermore, the effect of inclusion of severe cases was assessed for both the period in focus (prior to January 17) and later dates. Finally, the sensitivity of the results to the choice of the lognormal bivariate distribution model was tested by comparison with another model distribution (given in Ferretti et al., 2020a, see supplementary material for full details).
Estimation of the basic reproduction number
We estimated the basic reproduction number (${R}_{0}$) using the Euler-Lotka equation (Wallinga and Lipsitch, 2007):

$$\frac{1}{{R}_{0}}={\int}_{0}^{\infty}{e}^{-r\tau}g\left(\tau \right)d\tau $$
where $g\left(\tau \right)$ is the distribution of the generation interval and $r$ is the growth rate.
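As a numerical illustration, the Euler-Lotka relation states that the reciprocal of the basic reproduction number equals the integral of the generation-interval density discounted by the exponential growth factor. The sketch below evaluates that integral with a midpoint rule for a lognormal generation-interval distribution. The lognormal parameters passed in the usage note, the integration limit, and the step size are assumptions made for illustration.

```python
import math

def lognormal_pdf(t, mu, sigma):
    """Density of a lognormal with log-mean mu and log-standard deviation sigma."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))

def basic_reproduction_number(r, mu, sigma, upper=200.0, step=0.01):
    """R0 = 1 / integral of exp(-r * t) * g(t) dt, by the midpoint rule."""
    n = int(upper / step)
    integral = 0.0
    for k in range(n):
        t = (k + 0.5) * step
        integral += math.exp(-r * t) * lognormal_pdf(t, mu, sigma) * step
    return 1.0 / integral
```

For instance, `basic_reproduction_number(0.10, math.log(8.0), 0.5)` evaluates the relation at the growth rate used in the main analysis with placeholder distribution parameters. For a fixed growth rate, a generation-interval distribution with more mass at long intervals yields a larger value.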
Extrapolation of the unmitigated generation interval of VOCs
Beyond estimating the unmitigated generation interval for the original wild type of SARS-CoV-2, we also extrapolated the unmitigated generation-interval distributions of the alpha, delta, and omicron variants by combining our estimates with previously inferred viral load trajectories (Hay et al., 2022; Kissler et al., 2021). Kissler et al. estimated exponential growth and clearance (decay) rates of viral load trajectories across 173 participants from the National Basketball Association between November 28, 2020, and August 11, 2021, including individuals infected by alpha and delta variants. Hay et al. extended the analysis to include an additional 204 individuals who were infected by delta or omicron variants. These studies showed that the overall viral shedding time of the new variants was shorter than that of the non-VOCs, mainly due to a significant reduction of the clearance time, that is, the duration of the period from the peak viral load back to an undetectable level of viral load. Following Kissler et al., we assume that the group of non-VOCs represents the original wild-type variant. We assume that differences in clearance durations reflect biological differences in the rate at which the variant infects the host, and therefore base the extrapolation on the ratio of clearance durations, which we express through the clearance rates (a faster rate corresponding to a shorter duration): $\kappa =\frac{{c}_{WT}}{{c}_{VOC}}<1$, where ${c}_{WT}$ and ${c}_{VOC}$ are the clearance rates of the viral trajectories of the wild-type variant and the VOC, respectively. We scaled the infectiousness profile for the VOCs, shortening its time course by $\kappa $:

$${h}^{VOC}\left({\tau}_{i},{\tau}_{g}\right)=\frac{1}{{\kappa}^{2}}{h}^{WT}\left(\frac{{\tau}_{i}}{\kappa},\frac{{\tau}_{g}}{\kappa}\right)$$
where ${h}^{WT},{h}^{VOC}$ are the joint bivariate distribution of incubation period and generation interval of the wildtype variant and VOCs. Since the distribution of infectiousness is lognormal, the scaling affects only one of the parameters of the distribution (the median). See supplementary information for full derivation. The resulting unmitigated generationinterval distribution then estimates the unmitigated infectiousness profile of new variants under a counterfactual scenario, in which behavioral and intervention effects remain the same as in the initial pandemic phase.
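The effect of the time scaling on a lognormal distribution can be checked numerically. In the sketch below, shortening the time course by a factor `kappa` multiplies the median by `kappa` and leaves the shape parameter unchanged, so the squared coefficient of variation is preserved. The numerical values used in the usage note (a median of 8 days, a shape of 0.5, and `kappa = 0.85`, corresponding to a 15% shortening) are illustrative assumptions.

```python
import math

def lognormal_pdf(t, median, sigma):
    """Lognormal density parameterized by its median (scale) and shape sigma."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - math.log(median)) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))

def scale_lognormal(median, sigma, kappa):
    """Parameters after shortening the time course by kappa: only the median changes."""
    return kappa * median, sigma

def lognormal_mean(median, sigma):
    return median * math.exp(sigma ** 2 / 2)

def squared_cv(sigma):
    """Squared coefficient of variation; depends on the shape parameter only."""
    return math.exp(sigma ** 2) - 1
```

With `voc_median, voc_sigma = scale_lognormal(8.0, 0.5, 0.85)`, the density `lognormal_pdf(t, voc_median, voc_sigma)` equals `lognormal_pdf(t / 0.85, 8.0, 0.5) / 0.85` for any positive `t`, which is the change-of-variables form of the scaling.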
Although the connection of viral load levels and infectiousness is not well characterized, previously inferred viral load trajectories qualitatively match the shape of the distribution of transmission probability as a function of the TOST, providing support for our approximation (Figure 6). This apparent similarity was also demonstrated in previous studies (He et al., 2020; Jones et al., 2021; Marc et al., 2021).
Appendix 1
Extended methods
Literature survey for serialinterval data
A literature survey was conducted in order to gather data on serial intervals of transmission events from published and preprint literature. The survey was carried out using a Google Scholar query containing the phrases: ‘serial interval’ + ‘COVID’ + ‘China’. Twelve relevant datasets were identified: (Ali et al., 2020; Ganyani et al., 2020; He et al., 2020; Liao et al., 2020; Li et al., 2020; Ren et al., 2021; Wu et al., 2020a; Xia et al., 2020; Yang et al., 2020; Zhang et al., 2020; Zhanwei et al., 2020; Zhao et al., 2020).
Calculation of the mean serial interval for cohorts of transmission pairs that occurred on the same day
In order to compensate for the scarcity of data with early dates of infector symptom onset, we used a simple probabilistic model with Bayesian inference to derive crude estimates of the mean serial interval as a function of the infector’s symptom-onset date. For each date, the serial intervals of infectors that developed symptoms on that day were assumed to follow a Student’s t distribution whose mean, standard deviation, and degrees of freedom were random variables with the following prior distributions:
Mean – half-normal distribution with standard deviation of 20 days.
Standard deviation – half-normal distribution with standard deviation of 10 days.
Degrees of freedom – exponential distribution with mean of 30.
A Markov chain Monte Carlo method (implemented in the Python library PyMC3) was then used to estimate the mean serial interval and its uncertainty (see Results and Figure 2B).
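The structure of this model can be sketched without external dependencies. The example below swaps in a hand-written random-walk Metropolis sampler in place of the PyMC3 implementation used in the analysis, so it illustrates the priors and likelihood rather than reproducing the original code. The proposal step sizes, chain length, and starting values are arbitrary assumptions, and the standard deviation is treated as the scale parameter of the Student's t distribution.

```python
import math
import random

def student_t_logpdf(x, mu, sd, nu):
    """Log-density of a Student's t with location mu, scale sd, and nu degrees of freedom."""
    z = (x - mu) / sd
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sd)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def log_posterior(mu, sd, nu, data):
    if mu < 0 or sd <= 0 or nu <= 0:
        return -math.inf
    log_prior = (-0.5 * (mu / 20.0) ** 2   # half-normal (sd 20 days) on the mean
                 - 0.5 * (sd / 10.0) ** 2  # half-normal (sd 10 days) on the standard deviation
                 - nu / 30.0)              # exponential (mean 30) on the degrees of freedom
    return log_prior + sum(student_t_logpdf(x, mu, sd, nu) for x in data)

def sample_mean_serial_interval(data, n_iter=6000, burn_in=1000, seed=0):
    """Random-walk Metropolis; returns posterior draws of the mean serial interval."""
    rng = random.Random(seed)
    state = [max(sum(data) / len(data), 0.1), 5.0, 30.0]
    current = log_posterior(*state, data)
    step_sizes = (0.5, 0.5, 5.0)  # arbitrary proposal widths for (mean, sd, nu)
    draws = []
    for it in range(n_iter):
        proposal = [s + rng.gauss(0.0, w) for s, w in zip(state, step_sizes)]
        candidate = log_posterior(*proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            state, current = proposal, candidate
        if it >= burn_in:
            draws.append(state[0])
    return draws
```

The posterior mean and a credible interval for the mean serial interval of one onset date follow from the mean and quantiles of the returned draws. A production analysis would also check convergence, which this sketch omits.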
Sensitivity analysis to the period of interest
For each of the dates between January 11 and 25, 2020, we extracted the dataset consisting of the transmission pairs with infector symptom onset up to that date. We reran the maximum likelihood framework on the extracted datasets. Furthermore, in order to obtain uncertainty estimates, we used a bootstrapping method: we resampled each dataset with replacement 100 times and processed each resample via the maximum likelihood framework. In addition, the growth rate (r) was sampled from the distribution found by a study of the early outbreak in China (Tsang et al., 2020).
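The extraction of one dataset per cutoff date can be sketched as follows. The record layout and the key name `infector_onset` are hypothetical, and the fitting step that would be applied to each subset is left out.

```python
from datetime import date, timedelta

def subsets_by_cutoff(pairs, first_cutoff=date(2020, 1, 11), last_cutoff=date(2020, 1, 25)):
    """Map each cutoff date to the pairs whose infector symptom onset is on or before it.

    Each pair is a dict with a hypothetical 'infector_onset' key holding a datetime.date.
    """
    subsets = {}
    cutoff = first_cutoff
    while cutoff <= last_cutoff:
        subsets[cutoff] = [p for p in pairs if p["infector_onset"] <= cutoff]
        cutoff += timedelta(days=1)
    return subsets
```

Each value of the returned dictionary can then be passed to the fitting and bootstrapping routine, giving one set of estimates per cutoff date.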
Sensitivity analysis to the infection location of the infector
The transmission pairs dataset contains data from various cities and provinces in China. The mitigation steps were enacted at different timepoints across China, first in Wuhan and later in other cities and provinces. Previous analysis showed substantial growth rate differences across provinces (Kamalich et al., 2020b), but it seems that when corrected for case ascertainment, the observed difference in growth rate between Wuhan and the rest of China is small (0.08 day^{–1} vs. 0.1 day^{–1}) (Tsang et al., 2020).
In our main analysis, we do not differentiate transmission pairs by location. Thus, spatial effects could affect our results in two ways: via the estimated growth rate or via the period chosen for analysis as an approximation for unmitigated transmission. Appendix 1—figure 7 shows the sensitivity of the results to a change in the growth rate in the range of 0.04–0.16 day^{–1}; estimates for the mean generation interval change in the range of 8.1–9.1 days. Specifically, assuming a growth rate of 0.08 day^{–1} instead of 0.1 day^{–1} has a minimal effect on the main results of the analysis.
We further test how the duration of unmitigated period affects the results of the analysis when the dataset is stratified by the infection location of the infectors and infectee. Appendix 1—figure 8 compares the mean observed serial intervals when the infector and infectee were infected in or outside of Wuhan. We expect that Wuhan to Wuhan transmissions will be shorter than transmissions from Wuhan to the rest of China, but we do not find significant differences, as the data for transmission pairs from Wuhan is scarce. We also check the sensitivity of the generationintervaldistribution estimates to our choices of the unmitigated period, when it is defined separately for those pairs whose infector has been infected inside or outside Wuhan (shown in Appendix 1—figure 9). Our analysis suggests that reasonable changes in the unmitigated period have minor effects on our main estimates (e.g., the median generation interval and the 90% of the distribution). For example, taking only pairs with an infector that was infected in Wuhan and developed symptoms until January 15 or pairs with an infector that was infected outside of Wuhan and developed symptoms until January 21 leads to a median generation interval of 6.9 days, in comparison to 7.9 days in our main analysis. Both are substantially larger than previous reports (Ferretti et al., 2020a; He et al., 2020; Sun et al., 2020).
Comparison with another model of infectiousness
To check whether these results are sensitive to our choice of a bivariate lognormal distribution to characterize the joint distribution of the generation interval and the incubation period, we repeated our analysis with a different functional form, an adjusted logistic TOST model following Ferretti et al., 2020a. Ferretti et al. modeled transmission by assuming a TOST distribution with a skewed-logistic shape that depends on the incubation period (only on the left side).
Here, ${t}_{i}$ is the specific incubation period, $\tau $ is the mean incubation period (5.42 days), $\alpha $ and $\sigma $ are parameters determining the shape of the distribution, and ${m}_{l}$ and ${m}_{r}$ are the medians of the negative and positive sides of the distribution (${m}_{l}=\frac{\sigma {t}_{i}}{\tau}\ln\left({2}^{\frac{1}{\alpha}}-1\right)$, ${m}_{r}=\sigma \ln\left({2}^{\frac{1}{\alpha}}-1\right)$).
We adjusted their model by an additional parameter ($l$) enabling shifting the TOST along the time axis.
We use our maximum likelihood framework to estimate the generationinterval distribution of the adjusted form based on our compiled dataset. Furthermore, we fitted both our models and the adjusted model to the serialinterval dataset provided in Ferretti et al. supplementary Figure S1. Results of the comparison are presented in Appendix 1—figure 6.
Extrapolation of the unmitigated generation interval of VOCs
Beyond estimating the unmitigated generation interval for the original wild type of SARS-CoV-2, we also extrapolated the unmitigated generation-interval distributions of the alpha, delta, and omicron variants by combining our estimates with previously inferred viral load trajectories (Hay et al., 2022; Kissler et al., 2021). Following the notation of Kissler et al., we use the group of non-VOCs as representing the original wild-type variant. As discussed in the Methods section, we assume that differences in clearance durations reflect biological differences in the rate at which the variant infects the host, and therefore base the extrapolation on the ratio of clearance durations, which we express through the clearance rates (a faster rate corresponding to a shorter duration): $\kappa =\frac{{c}_{WT}}{{c}_{VOC}}<1$, where ${c}_{WT}$ and ${c}_{VOC}$ are the clearance rates of the viral trajectories of the wild-type variant and the VOC, respectively.
We scaled the infectiousness profile for the VOCs, shortening its time course by $\kappa $:

$${h}^{VOC}\left({\tau}_{i},{\tau}_{g}\right)=\frac{1}{{\kappa}^{2}}{h}^{WT}\left(\frac{{\tau}_{i}}{\kappa},\frac{{\tau}_{g}}{\kappa}\right)$$

where ${h}^{WT},{h}^{VOC}$ are the joint bivariate distributions of incubation period and generation interval of the wild-type variant and VOC, respectively. Following the use of a bivariate lognormal distribution for $h$, we get a simple expression for the extrapolated generation-interval distribution:

$${g}^{VOC}\left({\tau}_{g}\right)=\frac{1}{\kappa}{g}^{WT}\left(\frac{{\tau}_{g}}{\kappa}\right)$$

where ${g}^{WT},{g}^{VOC}$ are the generation-interval distributions of the variants.
Therefore, ${g}^{VOC}({\tau}_{g})\sim lognormal(\kappa \mu ,\sigma )$,
where $\mu ,\sigma $ are the scale and shape parameters of ${g}^{WT}$ such that ${g}^{WT}({\tau}_{g})\sim lognormal\left(\mu ,\sigma \right)$.
Thus, the scaling was achieved by multiplying the scale parameter (the median, i.e., the exponential of the log mean) of the lognormal incubation-period and generation-interval distributions by the ratio of the clearance durations. This scaling affects only one of the parameters of the distribution (the median), keeping the squared coefficient of variation constant.
Data availability
All study data are included in the article, SI appendix, and Dataset S1. All code is available in Jupyter notebooks found in https://gitlab.com/milolabpublic/theunmitigatedprofileofcovid19infectiousness, (copy archived at swh:1:rev:5e33057809e940d9b3ecd06a389b07611d15b39e).

Zenodo. PDGLin/COVID19_EffSerialInterval_NPI: Serial interval of SARS-CoV-2 was shortened over time by nonpharmaceutical interventions. https://doi.org/10.5281/zenodo.3940300

githubID COVID19. Estimating the generation interval for COVID19.
References

Intrinsic and realized generation intervals in infectious-disease transmission. Proceedings. Biological Sciences 282:20152026. https://doi.org/10.1098/rspb.2015.2026

Using viral load and epidemic dynamics to optimize pooled testing in resource-constrained settings. Science Translational Medicine 13:abf1568. https://doi.org/10.1126/scitranslmed.abf1568

The interval between successive cases of an infectious disease. American Journal of Epidemiology 158:1039–1047. https://doi.org/10.1093/aje/kwg251

Practical considerations for measuring the effective reproductive number, Rt. PLOS Computational Biology 16:e1008409. https://doi.org/10.1371/journal.pcbi.1008409

Comparison of molecular testing strategies for COVID-19 control: a mathematical modelling study. The Lancet. Infectious Diseases 20:1381–1389. https://doi.org/10.1016/S14733099(20)306307

Generation time of the alpha and delta SARS-CoV-2 variants: an epidemiological analysis. The Lancet. Infectious Diseases 22:603–610. https://doi.org/10.1016/S14733099(22)000019

Transmission dynamics of the delta variant of SARS-CoV-2 infections in South Korea. The Journal of Infectious Diseases 225:793–799. https://doi.org/10.1093/infdis/jiab586

Doubling time of the COVID-19 epidemic by province, China. Emerging Infectious Diseases 26:1912–1914. https://doi.org/10.3201/eid2608.200219

Severe acute respiratory syndrome coronavirus 2 transmission potential, Iran, 2020. Emerging Infectious Diseases 26:1915–1917. https://doi.org/10.3201/eid2608.200536

Generation interval contraction and epidemic data analysis. Mathematical Biosciences 213:71–79. https://doi.org/10.1016/j.mbs.2008.02.007

Viral dynamics of SARS-CoV-2 variants in vaccinated and unvaccinated persons. The New England Journal of Medicine 385:2489–2491. https://doi.org/10.1056/NEJMc2102507

The correlation between infectivity and incubation period of measles, estimated from households with two cases. Journal of Theoretical Biology 284:52–60. https://doi.org/10.1016/j.jtbi.2011.06.015

Book: A software package for sequential quadratic programming. Wiss. Berichtswesen d. DFVLR.

Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. The New England Journal of Medicine 382:1199–1207. https://doi.org/10.1056/NEJMoa2001316

Reconciling early-outbreak estimates of the basic reproductive number and its uncertainty: framework and applications to the novel coronavirus (SARS-CoV-2) outbreak. Journal of the Royal Society, Interface 17:20200144. https://doi.org/10.1098/rsif.2020.0144

Evidence for pre-symptomatic transmission of coronavirus disease 2019 (COVID-19) in China. Influenza and Other Respiratory Viruses 15:19–26. https://doi.org/10.1111/irv.12787

Serial interval and transmission dynamics during SARS-CoV-2 delta variant predominance, South Korea. Emerging Infectious Diseases 28:407–410. https://doi.org/10.3201/eid2802.211774

A note on generation times in epidemic models. Mathematical Biosciences 208:300–311. https://doi.org/10.1016/j.mbs.2006.10.010

How generation intervals shape the relationship between growth rates and reproductive numbers. Proceedings. Biological Sciences 274:599–604. https://doi.org/10.1098/rspb.2006.3754

The incubation period distribution of coronavirus disease 2019: a systematic review and meta-analysis. Clinical Infectious Diseases 73:2344–2352. https://doi.org/10.1093/cid/ciab501

Serial interval of COVID-19 among publicly reported confirmed cases. Emerging Infectious Diseases 26:1341–1343. https://doi.org/10.3201/eid2606.200357

COVID-19 and gender-specific difference: Analysis of public surveillance data in Hong Kong and Shenzhen, China, from January 10 to February 15, 2020. Infection Control and Hospital Epidemiology 41:750–751. https://doi.org/10.1017/ice.2020.64
Decision letter

Katelyn Gostic, Reviewing Editor; University of Chicago, United States

Aleksandra M Walczak, Senior Editor; CNRS LPENS, France

James A Hay, Reviewer; Harvard T.H. Chan School of Public Health, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "The unmitigated profile of COVID19 infectiousness" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Aleksandra Walczak as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: James A Hay (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
(1) All three reviewers raised questions about the potential impact of ascertainment bias and small sample size in the unmitigated transmission pair data. Please address potential impacts on the results, and qualify the conclusions if appropriate.
(2) Address questions from two reviewers about the accuracy of fixed incubation period estimates obtained from a metaanalysis. Should these be corrected for the same biases that affect generation interval estimates?
(3) Please provide more detail about the methods used to estimate R0 and the generation interval of variants of concern. Please also consider editing the methods for clarity and readability by a general audience.
(4) In order to make the manuscript more accessible to a general audience, please provide a clearer explanation of why short forward intervals are overrepresented in a growing epidemic. Consider including a diagram or simulation, as suggested by Reviewer 3.
(5) Please address the impact of uncertainty in viral load trajectories on individual generation times, on the residual fraction, and on our ability to infer generation intervals for variants of concern using viral load trajectories. On a related note, please consider modifications to Figure 6a so that it is easier to visualize whether the viral load trajectory aligns well with the claim that 18% of transmission occurs >14d after infection.
Reviewer #1 (Recommendations for the authors):
1. The methods section is complete, but it might be easier to follow with more attention to organization, transitions, and maybe with additional subheadings. In particular, it would be helpful if key details like which parameters are being estimated, and which data you're fitting to, were easier to locate in this section.
2. The Introduction and Methods cover a lot of ground summarizing all the forms of bias and adjustment that go into producing an accurate unmitigated estimate, and it is currently a bit hard to keep track of all these details. It could be helpful to provide some sort of list, table, or summary paragraph to help readers keep track of all the forms of bias and adjustment that this analysis deals with, including references where appropriate. It would also be helpful to more clearly state that the main contribution of this study is to collect and apply all these statistical corrections to a carefully curated dataset.
3. I got tripped up by this statement on page 12:
"We find that our framework is able to properly reproduce the realized serial interval distribution given the growth rate in the early stages of the outbreak in Wuhan, China (Figure 3b)."
Aren't the models fit to the SI data, meaning that we expect this result and should be alarmed by anything else? I think that this is just a wording issue and that what you're trying to say here is something like, "With or without the growth-rate adjustment, the model was able to fit the observed serial interval data well (Figure 3b)". But with the current phrasing, it sounds (at least to me) like this is being presented as some sort of independent validation of the model. For the same reason, I'd consider changing "estimated SI" to "fitted SI" in the Figure 3b legend.
Reviewer #2 (Recommendations for the authors):
As well as the broad comments made in the public review, I had the following comments:
– "This dataset includes a total of 77 transmission pairs with a mean serial interval of 9.1 days (7.9-10.2 95% CIs) and a standard deviation of 5.2 days. This is substantially longer than the mean of 7.8 days suggested by Ali et al" – is it possible to quantify this difference statistically (e.g. with a test for difference in means between the samples)? Given a mean of 9.1 days and SD=5.2, it wouldn't seem implausible for a random subsample from this dataset to have a mean of 7.8?
– Could the authors clarify which formula they used from Wallinga and Lipsitch (2007) to calculate R0 from generation time, as the exact calculation will depend on assumptions about the distribution of generations etc? I presume the authors used an appropriate formulation but would be useful to state explicitly. The finding that the early R0 is similar despite a longer generation time seems a bit counterintuitive, so it would be helpful to have some more discussion about what's happening here.
– It would be useful to give some intuition about why changing the baseline incubation period had a limited effect on the results. Is this because the epidemic phase adjustment dominates in the calculation?
– The methods for scaling the generation interval for other VOCs are described briefly in the caption to Figure 6, but it would be helpful to have the calculation given explicitly in the methods, so there is no ambiguity in terms like "ratio of the clearance's durations". Also in this figure, it's unclear where the α line is in B and C, so worth mentioning in the caption. Finally, I didn't follow this sentence: "The inset shows a zoom-in on the period of 12-24 days after exposure, a period in which there is a substantial difference between the current estimate and those from previous studies." Are the presented estimates not all new ones derived from the current study and viral shedding data?
– I appreciate that not all of these studies were available at the time of submission, but it could be helpful to update the discussion to also place the results in the context of more recent viral culture duration from serial swabbing data (Chu et al., JAMA Int Med 2022) and/or shedding profiles in human challenge data (Killingley et al., Nature Med 2022).
https://doi.org/10.7554/eLife.79134.sa1
Author response
Essential revisions:
(1) All three reviewers raised questions about the potential impact of ascertainment bias and small sample size in the unmitigated transmission pair data. Please address potential impacts on the results, and qualify the conclusions if appropriate.
We recognize the potential biases in the transmission pairs data. We therefore developed an extensive framework of sensitivity analyses for identifying biases that could substantially affect the results. In the Results section and Figure 5, we show that the main study result, that the unmitigated generation-interval distribution is longer than previously estimated, is robust to reasonable amounts of ascertainment bias. We discuss this point at length and have added several supplemental figures to support this claim.
As reviewer #3 mentioned, we conducted a sensitivity analysis for the inclusion of the longest serial intervals, to investigate possible effects of missing links in the longest transmission pairs. We also discuss why we think it’s not necessary to explicitly model the short intervals that may be unobserved due to missing links.
“Second, we considered the possibility that long serial intervals may be caused by omission of intermediate infections in multiple chains of transmission, which in turn would lead to overestimation of the mean serial and generation intervals. Thus, we refit our model after removing long serial intervals from the data (by varying the maximum serial interval between 14 and 24 days). We also considered “splitting” these intervals into smaller intervals, but decided this was unnecessarily complex, since several choices would need to be made, and the effects would likely be small compared to the effect of the choice of maximum, since the distribution of the resulting split intervals would not differ sharply from that of the remaining observed intervals in most cases.”
We added to the discussion text regarding the effect of possible bias in the dataset, explicitly specifying the ascertainment bias.
“Our analysis relies on datasets of transmission pairs gathered from previously published studies and thus has several limitations that are difficult to correct for. Transmission pairs data can be prone to incorrect identification of transmission pairs, including the direction of transmission. In particular, presymptomatic transmission can cause infectors to report symptoms after their infectees, making it difficult to identify who infected whom. Data from the early outbreak might also be sensitive to ascertainment and reporting biases which could lead to missing links in transmission pairs, causing serial intervals to appear longer (For example, people who transmit asymptomatically might not be identified). Moreover, when multiple potential infectors are present, an individual who developed symptoms close to when the infectee became infected is more likely to be identified as the infector. These biases might increase the estimated correlation of the incubation period and the period of infectiousness. We have tried to account for these biases by using a bootstrapping approach, in which some data points are omitted in each bootstrap sample. The relatively narrow ranges of uncertainty suggest that the results are not very sensitive to specific transmission pairs data points being included in the analysis. We also performed a sensitivity analysis to address several potential biases such as the duration of the unmitigated transmission period, the inclusion of long serial intervals in the dataset, and the incorrect ordering of transmission pairs (see Methods). The sensitivity analysis shows that although these biases could decrease the inferred mean generation interval, our main conclusions about the long unmitigated generation intervals (high median length and substantial residual transmission after 14 days) remained robust (Figure 5).”
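As a rough illustration of the bootstrapping idea mentioned in the quoted passage, the sketch below resamples a set of serial intervals with replacement and reports a percentile interval for the mean. The data are synthetic (77 draws from a gamma distribution chosen only for illustration, not the study's transmission pairs), and the function name `bootstrap_mean_ci` is hypothetical; the study's own procedure may differ.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample draws n points with replacement, so some
    # observations are left out of any given sample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic stand-in for 77 serial intervals in days (assumption, not real data).
data_rng = random.Random(1)
intervals = [data_rng.gammavariate(3.0, 3.0) for _ in range(77)]

point = statistics.fmean(intervals)
lo, hi = bootstrap_mean_ci(intervals)
print(f"mean = {point:.2f} days, 95% bootstrap interval = ({lo:.2f}, {hi:.2f})")
```

A narrow interval in such a check suggests that no small group of data points drives the estimate, which is the sense in which the response refers to "relatively narrow ranges of uncertainty".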
(2) Address questions from two reviewers about the accuracy of fixed incubation period estimates obtained from a meta-analysis. Should these be corrected for the same biases that affect generation interval estimates?
In our analysis we use the incubation period distribution from Xin et al. 2021, which already considers the backward bias caused by the expanding epidemic with the corrected growth rate of 0.1/d. Xin et al. showed in their meta-analysis that the mean incubation period reported by the various sources changed according to the dates used by the source. Incubation periods prior to the peak of the epidemic in China were shorter than those from after the peak, in a manner that coincided with the backward correction they performed (using a similar derivation to that suggested by Park et al. 2021). Accordingly, the distribution of incubation period they report is the intrinsic incubation period, after correction for the growth rate of the initial spread in China. We added two sentences in our methods section to clarify this point:
“In their meta-analysis, Xin et al. found an increase of the incubation period following the introduction of interventions in China, matching the theoretical framework shown above. Their inferred incubation period distribution includes a correction for the growth rate of the early spread, accordingly.”
Furthermore, we perform a sensitivity analysis for the shape of the incubation period distribution, and show that it has a minor effect on our conclusions (Appendix 1—figure 10).
(3) Please provide more detail about the methods used to estimate R0 and the generation interval of variants of concern. Please also consider editing the methods for clarity and readability by a general audience.
We made some edits to the methods section in order to make it more accessible and clear, for example, we added subheadings for the various sections, added a section explaining the derivation of the basic reproduction number, and clarified the section regarding the VOCs extrapolations:
“We estimated the basic reproduction number ($R_0$) using the Euler-Lotka equation (Wallinga and Lipsitch 2007):
$$\frac{1}{R_0}=\int_0^{\infty} e^{-r\tau}\,g(\tau)\,d\tau$$
where $g(\tau)$ is the distribution of the generation interval and $r$ is the growth rate.”
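The Euler-Lotka relation takes R0 to be the reciprocal of the integral of exp(-r*tau) * g(tau) over all generation times. A minimal numerical sketch follows. It assumes a gamma-shaped generation interval with illustrative parameters (shape 3, scale 3 days), which are not the study's estimates, and uses the growth rate of 0.1/d quoted in the response. The helper names `gamma_pdf` and `r0_euler_lotka` are hypothetical.

```python
import math


def gamma_pdf(tau, shape, scale):
    """Density of a gamma distribution at tau > 0."""
    return (tau ** (shape - 1) * math.exp(-tau / scale)
            / (math.gamma(shape) * scale ** shape))


def r0_euler_lotka(pdf, r, upper=200.0, steps=200_000):
    """R0 = 1 / integral_0^upper exp(-r*tau) * pdf(tau) dtau (midpoint rule)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += math.exp(-r * tau) * pdf(tau)
    return 1.0 / (total * h)


r = 0.1                   # growth rate per day (value quoted in the response)
shape, scale = 3.0, 3.0   # illustrative generation-interval parameters (assumption)

r0_numeric = r0_euler_lotka(lambda t: gamma_pdf(t, shape, scale), r)
# For a gamma generation interval the integral has a closed form,
# which gives a check on the numerical result.
r0_closed = (1.0 + r * scale) ** shape
print(f"numeric R0 = {r0_numeric:.4f}, closed form = {r0_closed:.4f}")
```

The numerical route works for any density passed as `pdf`, so the same function could be applied to a fitted generation-interval distribution that has no closed-form transform.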
“Beyond estimating the unmitigated generation interval for the original wild type of SARS-CoV-2, we also extrapolated the unmitigated generation-interval distributions of the α, δ and omicron variants by combining our estimates with previously inferred viral load trajectories (Kissler et al. 2021; Hay et al. 2022). Kissler et al. estimated exponential growth and clearance (decay) rates of viral load trajectories across 173 participants from the National Basketball Association between November 28, 2020, and August 11, 2021, including individuals infected by α and δ variants. Hay et al. extended the analysis to include an additional 204 individuals who were infected by δ or omicron variants. These studies showed that the overall viral shedding time of the new variants was shorter than that of the non-VOC variants, mainly due to a significant reduction of the clearance time – the duration of the period from the peak viral load back to an undetectable level of viral load. Following Kissler et al., we assume that the group of non-VOC variants represents the original wild-type variant. We assume that differences in clearance durations reflect biological differences in the rate at which the variant infects the host, and therefore base the extrapolation on the ratio of clearance durations: $\kappa = \frac{c_{\text{WT}}}{c_{\text{VOC}}} - 1$, where $c_{\text{WT}}, c_{\text{VOC}}$ are the viral trajectories clearance rate of the wild-type and VOC variants. We scaled the infectiousness profile for the VOCs shortening its time course by $\kappa$:
where $h^{\text{WT}}, h^{\text{VOC}}$ are the joint bivariate distributions of incubation period and generation interval of the wild-type and VOC variants. Since the distribution of infectiousness is lognormal, the scaling affects only one of the parameters of the distribution (the median). See Supplementary methods for full derivation. The resulting unmitigated generation-interval distribution then estimates the unmitigated infectiousness profile of new variants under a counterfactual scenario, in which behavioral and intervention effects remain the same as in the initial pandemic phase.”
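The statement that rescaling a lognormal time course changes only its median can be checked with a short sketch. The compression factor `s` and the wild-type parameters below are hypothetical, chosen only for illustration; the exact expression for the scaling in terms of κ is given in the source's Supplementary methods and is not reproduced here.

```python
import math
import random
import statistics


def lognormal_params(samples):
    """Return (mean, standard deviation) of the log-transformed samples."""
    logs = [math.log(x) for x in samples]
    return statistics.fmean(logs), statistics.stdev(logs)


rng = random.Random(0)
mu_wt, sigma_wt = math.log(8.0), 0.6   # hypothetical wild-type parameters
wt = [rng.lognormvariate(mu_wt, sigma_wt) for _ in range(20_000)]

s = 1.5                                # hypothetical time-compression factor
voc = [x / s for x in wt]              # shorten every time by the same factor

mu_hat_wt, sd_wt = lognormal_params(wt)
mu_hat_voc, sd_voc = lognormal_params(voc)
median_ratio = math.exp(mu_hat_voc - mu_hat_wt)

print(f"log-scale SD: WT {sd_wt:.4f}, VOC {sd_voc:.4f}")
print(f"median ratio VOC/WT = {median_ratio:.4f} (1/s = {1 / s:.4f})")
```

Because log(x / s) = log(x) - log(s), dividing by `s` shifts the log-scale mean by -log(s) and leaves the log-scale standard deviation untouched, so the median is divided by `s` while the shape parameter is unchanged.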
(4) In order to make the manuscript more accessible to a general audience, please provide a clearer explanation of why short forward intervals are overrepresented in a growing epidemic. Consider including a diagram or simulation, as suggested by Reviewer 3.
We added an explanation to the paragraph in order to make it clearer:
“A cohort of individuals that develop symptoms on a given day is a sample of all individuals who have been previously infected. When the incidence of infection is increasing, recently infected individuals represent a bigger fraction of this population and thus are overrepresented in this cohort. Therefore, we are more likely to encounter infected individuals with a short incubation period in this cohort compared to an unbiased sample. The forward serial interval is calculated for a cohort of infectors who developed symptoms at the same time and therefore is sensitive to this bias. These dynamical biases are demonstrated using epidemic simulations by Park et al.”
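The overrepresentation of short incubation periods can be reproduced with a small simulation. In this sketch, infection times follow exponentially growing incidence, incubation periods are drawn from a gamma distribution, and the cohort consists of everyone whose symptoms begin in the final day. The incubation parameters (shape 5, scale 1.1 days) and the function name `simulate_cohort_mean` are assumptions made for illustration, not values or methods from the study.

```python
import math
import random


def simulate_cohort_mean(r, shape, scale, horizon=100.0, n=200_000, seed=0):
    """Mean incubation period among people with symptom onset in the last day."""
    rng = random.Random(seed)
    growth = math.exp(r * horizon) - 1.0
    cohort = []
    for _ in range(n):
        # Inverse-CDF draw of an infection time with density ~ exp(r * t) on [0, horizon].
        t_inf = math.log(1.0 + rng.random() * growth) / r
        inc = rng.gammavariate(shape, scale)
        if horizon - 1.0 <= t_inf + inc < horizon:
            cohort.append(inc)
    return sum(cohort) / len(cohort)


r = 0.1                    # growth rate per day
shape, scale = 5.0, 1.1    # illustrative incubation distribution (assumption)

intrinsic_mean = shape * scale
# Weighting a gamma density by exp(-r * tau) gives a gamma with a smaller scale.
analytic_mean = shape * scale / (1.0 + r * scale)
cohort_mean = simulate_cohort_mean(r, shape, scale)

print(f"intrinsic mean incubation: {intrinsic_mean:.2f} d")
print(f"cohort mean (simulated):   {cohort_mean:.2f} d")
print(f"cohort mean (analytic):    {analytic_mean:.2f} d")
```

The cohort mean falls below the intrinsic mean because, for any onset day, candidates infected more recently are more numerous, and only those with short incubation periods have already developed symptoms.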
(5) Please address the impact of uncertainty in viral load trajectories on individual generation times, on the residual fraction, and on our ability to infer generation intervals for variants of concern using viral load trajectories. On a related note, please consider modifications to Figure 6a so that it is easier to visualize whether the viral load trajectory aligns well with the claim that 18% of transmission occurs >14d after infection.
Viral load trajectories data have potential for informing estimates of the infectiousness profile. However, the relationship between viral load, culture positivity, symptom onset, and infectivity is complex and not well characterized. Due to this limitation we tried to use viral loads in a more limited way, extrapolating our results to variants of concern (which lack unmitigated transmission data). Following the comment, we added a detailed discussion of the limitations of using viral loads as a proxy for infectiousness, including the variation of viral loads across individuals. We also added supplementary figures (Figure 6—figure supplements 1-2) to show the possible effect of an individual's viral loads in relation to the infectiousness and for comparison with new viral load and culture results (Chu et al. 2022; Killingley et al. 2022). As the viral load trajectories data for the different VOCs are given only as a function of time from the onset of symptoms, it is not possible to directly link them to the fraction of transmission post 14 days from infection. We made changes to Figure 6 to clarify the possible connection of viral load with the TOST (time from symptoms onset to transmission) distribution and the resulting extrapolation to the unmitigated generation-interval distributions.
“SARS-CoV-2 viral load trajectories serve an important role in understanding the dynamics of the disease and modeling its infectiousness (Quilty et al. 2021; Cleary et al. 2021). Indeed, the general shapes of the mean viral load trajectories and culture positivity, based on longitudinal studies, are comparable with our estimated unmitigated infectiousness profile (Figure 6—figure supplements 1-2, comparison with (Chu et al. 2022; Killingley et al. 2022; Kissler et al. 2021)). However, the nature of the relationship between viral load, culture positivity, symptom onset, and real-world infectivity is complex and not well characterized. Therefore, the ability to infer infectiousness from viral load data is very limited, especially near the tail of infectiousness, several days following symptom onset and peak viral loads. Viral load models are usually made to fit the measurements during an initial exponential clearance phase and in many cases miss a later slow decay (Kissler et al. 2021). Furthermore, there is considerable individual-level variation in viral trajectories that isn’t accounted for in population-mean models (Kissler et al. 2021; Singanayagam et al. 2021). Other factors limiting the ability to compare generation-interval estimates with viral load models are the variability of the incubation periods and its relation to the timing of the peak of the viral loads, and the great uncertainty and apparent nonlinearity of the relation between viral loads and culture positivity (Jaafar et al. 2021; Jones et al. 2021). Due to these caveats and in order to avoid overinterpretation of viral load data, we restrict our extrapolation of new VOCs’ infectiousness to a single parameter characterizing the viral duration of clearance.”
Reviewer #1 (Recommendations for the authors):
1. The methods section is complete, but it might be easier to follow with more attention to organization, transitions, and maybe with additional subheadings. In particular, it would be helpful if key details like which parameters are being estimated, and which data you're fitting to, were easier to locate in this section.
We made some edits to the Methods section in order to make it more readable and clearer. We added subheadings for the various sections. Moreover, we added a section explaining the derivation of the basic reproduction number and clarified the section regarding the VOCs extrapolations.
2. The Introduction and Methods cover a lot of ground summarizing all the forms of bias and adjustment that go into producing an accurate unmitigated estimate, and it is currently a bit hard to keep track of all these details. It could be helpful to provide some sort of list, table, or summary paragraph to help readers keep track of all the forms of bias and adjustment that this analysis deals with, including references where appropriate. It would also be helpful to more clearly state that the main contribution of this study is to collect and apply all these statistical corrections to a carefully curated dataset.
Following the comment we added a table summarizing the main possible biases in inference of infectiousness profile from serial intervals data.
3. I got tripped up by this statement on page 12:
"We find that our framework is able to properly reproduce the realized serial interval distribution given the growth rate in the early stages of the outbreak in Wuhan, China (Figure 3b)."
Aren't the models fit to the SI data, meaning that we expect this result and should be alarmed by anything else? I think that this is just a wording issue and that what you're trying to say here is something like, "With or without the growth-rate adjustment, the model was able to fit the observed serial interval data well (Figure 3b)". But with the current phrasing, it sounds (at least to me) like this is being presented as some sort of independent validation of the model. For the same reason, I'd consider changing "estimated SI" to "fitted SI" in the Figure 3b legend.
We thank the reviewer for pointing out the confusing phrasing. We changed the phrasing as the reviewer suggested, and it now reads:
“The joint bivariate distribution and its marginal distributions are shown in Figure 3A. With or without the growth-rate adjustment, the model was able to fit the observed serial interval data well (Figure 3B)”.
Furthermore, the legend of Figure 3B was also changed accordingly to “Fitted SI”.
Reviewer #2 (Recommendations for the authors):
As well as the broad comments made in the public review, I had the following comments:
– "This dataset includes a total of 77 transmission pairs with a mean serial interval of 9.1 days (7.9-10.2 95% CIs) and a standard deviation of 5.2 days. This is substantially longer than the mean of 7.8 days suggested by Ali et al" – is it possible to quantify this difference statistically (e.g. with a test for difference in means between the samples)? Given a mean of 9.1 days and SD=5.2, it wouldn't seem implausible for a random subsample from this dataset to have a mean of 7.8?
As the reviewers pointed out, the 95% confidence interval of the mean serial interval estimate is 7.9-10.2 days, which overlaps with the one reported by Ali et al. (mean of 7.8 days, CI of 7-8.6 days). We now acknowledge this uncertainty in the main text:
“This dataset includes a total of 77 transmission pairs with a mean serial interval of 9.1 days (95% CI: 7.9-10.2), and a standard deviation of 5.2 days. Although this is substantially longer than the mean of 7.8 days (95% CI: 7-8.6 days) suggested by Ali et al. (Ali et al. 2020) for the early period of the epidemic, there is considerable uncertainty in both estimates with overlapping confidence intervals. Nonetheless, the lower mean serial interval estimated by Ali et al. likely reflects their decision to include infectors who developed symptoms up to January 22nd, who were already subject to the effects of mitigation strategies”.
We do not think that quantifying the statistical significance will provide any more information, especially given that the confidence intervals overlap by a considerable amount. Nonetheless, we still find that the mean serial interval decreases consistently when we use later cutoff dates. We also note that our dataset incorporates several other transmission pairs taken from other sources.
– Could the authors clarify which formula they used from Wallinga and Lipsitch (2007) to calculate R0 from generation time, as the exact calculation will depend on assumptions about the distribution of generations etc? I presume the authors used an appropriate formulation but would be useful to state explicitly. The finding that the early R0 is similar despite a longer generation time seems a bit counterintuitive, so it would be helpful to have some more discussion about what's happening here.
We added to the methods the formula for derivation of $R_0$ using the distribution of the generation interval. See the response to Essential Revisions 3.
The estimate of $R_0$ is similar to the previous estimate mainly due to the use of the corrected growth rate of 0.1/d, as previous studies assumed a shorter GI (which decreases $R_0$) but a higher growth rate (which increases $R_0$).
“The basic reproduction number $R_0$ estimates derived here are close to values reported early in the epidemic (Wu, Leung, and Leung 2020; Li et al. 2020; Chinazzi et al. 2020; Imai et al. 2020), despite the longer estimate for the generation-interval distribution. This is mainly due to using the corrected growth rate, which is considerably lower than previously assumed values (Tsang et al. 2020).”
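The compensation between generation-interval length and growth rate can be made concrete with a toy calculation. The sketch assumes a gamma-shaped generation interval with a fixed shape, for which the Euler-Lotka relation reduces to a closed form that depends only on the product of growth rate and mean. Both parameter pairs are hypothetical and chosen only to show the effect; they are not taken from this study or from the earlier studies it cites.

```python
def r0_gamma(r, mean_gi, shape):
    """Euler-Lotka R0 for a gamma generation interval: (1 + r * mean / shape) ** shape."""
    return (1.0 + r * mean_gi / shape) ** shape


shape = 3.0  # hypothetical shape shared by both scenarios

# Longer generation interval combined with a lower growth rate.
r0_long_slow = r0_gamma(r=0.10, mean_gi=9.0, shape=shape)
# Shorter generation interval combined with a higher growth rate.
r0_short_fast = r0_gamma(r=0.18, mean_gi=5.0, shape=shape)

print(f"long GI, low r:   R0 = {r0_long_slow:.3f}")
print(f"short GI, high r: R0 = {r0_short_fast:.3f}")
```

With the shape held fixed, any two scenarios with the same product of growth rate and mean generation interval (here 0.9) give the same $R_0$, which is one way a longer interval and a lower growth rate can offset each other.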
– It would be useful to give some intuition about why changing the baseline incubation period had a limited effect on the results. Is this because the epidemic phase adjustment dominates in the calculation?
From the sensitivity analysis presented in Appendix 1—figure 10, changing the incubation period distribution mainly affects the estimate of the correlation parameter (a shorter incubation period decreases the correlation parameter). The adjustment for epidemic phase also doesn’t have a large effect on the results. Therefore, the cutoff date seems to be the dominant factor in our analysis, presumably meaning that mitigations have the largest effect on the generation interval distribution. We added the following paragraph to the discussion:
“Following the sensitivity analyses for the cutoff date, the growth rate and the model of infectiousness, we can see which of the three biases described in Table 1 has the greatest effect. We conclude that the cutoff date seems to be the dominant factor in our analysis, presumably meaning that taking the effects of interventions into account is the most important for an accurate estimate of the generation interval distribution. Additional sensitivity analyses, such as to the assumed incubation period, also support this conclusion, as they show only a minor effect.”
– The methods for scaling the generation interval for other VOCs are described briefly in the caption to Figure 6, but it would be helpful to have the calculation given explicitly in the methods, so there is no ambiguity in terms like "ratio of the clearance's durations". Also in this figure, it's unclear where the α line is in B and C, so worth mentioning in the caption. Finally, I didn't follow this sentence: "The inset shows a zoom-in on the period of 12-24 days after exposure, a period in which there is a substantial difference between the current estimate and those from previous studies." Are the presented estimates not all new ones derived from the current study and viral shedding data?
We have now expanded our explanation for the extrapolation of the unmitigated generation intervals of the VOC in the Methods.
The extrapolation of the α and δ variants are extremely close, hence the hidden α line in the panels. We added in the figure caption:
“The extrapolated distributions for the α and δ variants are extremely close, hence the green line is hidden by the red line in panels b-d”.
The anomalous sentence in the figure caption was an editing error and has been removed. Thank you for pointing it out.
– I appreciate that not all of these studies were available at the time of submission, but it could be helpful to update the discussion to also place the results in the context of more recent viral culture duration from serial swabbing data (Chu et al., JAMA Int Med 2022) and/or shedding profiles in human challenge data (Killingley et al., Nature Med 2022).
Following the comment, we added a supplemental figure (Figure 6—figure supplement 2) that compares the estimated infectiousness profile with these two recent studies. We also added a detailed discussion paragraph regarding the comparability of the results with viral load data and the major limitations of this kind of comparison; see the response to Essential Revisions 5.
https://doi.org/10.7554/eLife.79134.sa2
Article and author information
Author details
Funding
Weizmann Institute of Science (The Weizmann CoronaVirus Fund)
 Ron Milo
Weizmann Institute of Science (Weizmann Data Science Research Center and by a research grant from the Estate of Tully and Michele)
 Ron Sender
Canadian Institute for Health Research
 Jonathan Dushoff
Ben B. and Joyce E. Eisenberg Foundation
 Ron Milo
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We would like to thank David Champredon and David Earn for valuable feedback on this manuscript. Funding: Ben B and Joyce E Eisenberg Foundation, The Weizmann CoronaVirus Fund (RM), the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center, and by a research grant from the Estate of Tully and Michele Plesser (RS). RM is the Charles and Louise Gartner Professional Chair. Jonathan Dushoff is supported by the Canadian Institutes of Health Research. YMB is an Azrieli Fellow.
Senior Editor
 Aleksandra M Walczak, CNRS LPENS, France
Reviewing Editor
 Katelyn Gostic, University of Chicago, United States
Reviewer
 James A Hay, Harvard T.H. Chan School of Public Health, United States
Publication history
 Preprint posted: November 19, 2021 (view preprint)
 Received: March 31, 2022
 Accepted: July 27, 2022
 Accepted Manuscript published: August 1, 2022 (version 1)
 Version of Record published: August 19, 2022 (version 2)
Copyright
© 2022, Sender, BarOn et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.