Prediction of SARS-CoV-2 transmission dynamics based on population-level cycle threshold values: An epidemic transmission and machine learning modeling study
Figures
Study design.
Overall design of the study (A) showing the patient population and molecular target of interest for the cycle threshold modeling (first column), sampling and modeling data sources for the study (second column), and modeling approaches (third column). The overall study flowchart (B) is also presented and depicts the process that led to selection of 429 tests for the SEIR modeling analysis. BC: British Columbia; E gene: envelope gene; Nov: November; Rt: reproductive number; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2; SEIR: susceptible-exposed-infectious-recovered.
Twenty most prevalent SARS-CoV-2 variant of concern lineages in British Columbia from January 2021 to January 2022.
The current study was performed during a time of Omicron variant predominance, from November 19, 2021, to January 8, 2022 (vertical dotted lines). BCCDC: British Columbia Centre for Disease Control; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2.
Vaccination status definitions.
Vaccination status was defined based on the date of vaccine receipt relative to the collection date of the sample included in the study. For the Janssen vaccine only, fully vaccinated status was defined as having received one dose 14 days or more prior to sample collection. For all other vaccines, unvaccinated status was defined as having received no SARS-CoV-2 vaccine, or having received a SARS-CoV-2 vaccine less than 21 days prior to sample collection. Partially vaccinated status was defined as having received dose 1 of a SARS-CoV-2 vaccine 21 days or more prior to sample collection, but having received dose 2 less than 14 days prior to sample collection. Fully vaccinated status was defined as 14 days or more since the receipt of dose 2, but having received dose 3 less than 14 days prior to sample collection. SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2.
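The definitions above can be read as a short decision rule. The following Python sketch is illustrative only and is not the study's code: the function name, argument names, and the "other" label are hypothetical. Three points the caption does not spell out are handled by assumption and marked in the comments: a missing dose 2 or dose 3 is treated the same as a dose received too recently to count, a Janssen dose received less than 14 days before collection is treated as unvaccinated, and a dose 3 received 14 days or more before collection is labeled "other".

```python
from datetime import date
from typing import Optional


def vaccination_status(
    collection: date,
    dose1: Optional[date] = None,
    dose2: Optional[date] = None,
    dose3: Optional[date] = None,
    janssen: bool = False,
) -> str:
    """Classify vaccination status relative to the sample collection date."""

    def days_before_collection(dose: Optional[date]) -> Optional[int]:
        return None if dose is None else (collection - dose).days

    d1 = days_before_collection(dose1)
    d2 = days_before_collection(dose2)
    d3 = days_before_collection(dose3)

    if janssen:
        # Caption: one Janssen dose >= 14 days before collection is fully vaccinated.
        # Assumption: a more recent (or missing) Janssen dose counts as unvaccinated.
        return "fully vaccinated" if d1 is not None and d1 >= 14 else "unvaccinated"

    # Caption: no vaccine, or a vaccine received < 21 days before collection.
    if d1 is None or d1 < 21:
        return "unvaccinated"
    # Caption: dose 1 >= 21 days, dose 2 < 14 days (assumption: or no dose 2 yet).
    if d2 is None or d2 < 14:
        return "partially vaccinated"
    # Caption: dose 2 >= 14 days, dose 3 < 14 days (assumption: or no dose 3 yet).
    if d3 is None or d3 < 14:
        return "fully vaccinated"
    # Assumption: the caption does not name this group; labeled "other" here.
    return "other"
```

For example, `vaccination_status(date(2021, 12, 1), dose1=date(2021, 10, 1), dose2=date(2021, 11, 25))` falls in the partially vaccinated group, because dose 2 was received only 6 days before collection.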
SARS-CoV-2 case incidence and E gene Ct distribution across study phases.
Violin plots demonstrating the E gene cycle threshold value distribution in British Columbia across different time points of the study period (A). The median Ct values are presented, with the associated interquartile range shown as dotted lines. The absolute number of cases of confirmed SARS-CoV-2 infection is presented separately (B). Ct E: envelope (E) gene cycle threshold value; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2.
Long-term care facility outbreak investigation modeling findings.
A multiple-cross-section SEIR model was fitted to the outbreak data (A) and showed a peak in incidence on the 12th day of the outbreak, 2 days before the observed peak at the outbreak facility. The population included in this outbreak investigation was sampled at three predetermined time points (dashed red lines). The Markov chain Monte Carlo (MCMC) model-predicted incidence curve is represented by a blue line and is overlaid with the reported number of confirmed SARS-CoV-2-positive cases in this outbreak setting (yellow bars). The blue ribbon represents the 95% credible interval. Violin plots of the viral kinetic parameters for the SEIR model in the outbreak case study are also presented (B). The MCMC approach searches over the viral kinetic parameters described above and is based on prior values. The fit to the detectable Ct distribution at the time points of days 4, 12, and 19 of the outbreak is also presented (C). These panels show the model fit (blue curve) overlaid with the frequency of Ct values (gray bars) and are a good indicator of the Ct distribution across the time points. The darker blue ribbon represents the 95% credible interval. The Ct values increase from outbreak days 12–19 as the epidemic declines. Ct: cycle threshold; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2; SEIR: susceptible-exposed-infected-recovered.
Case study epidemiological data (A) and epidemic curve (B) for the 156 infected individuals in the long-term care facility outbreak.
Overall population modeling findings.
A multiple-cross-section SEIR model was fitted to the overall population-level data (A) and showed an incidence peak from December 27, 2021, to January 1, 2022, which overlapped with the observed peak of reported cases in the province. The Markov chain Monte Carlo (MCMC) model-predicted incidence curve is represented (black lines) and is overlaid with the reported number of confirmed SARS-CoV-2-positive cases (yellow bars). Violin plots of the viral kinetic parameters for the SEIR model are presented (B). Three time horizons, of sizes 5, 6, and 7, were chosen. The MCMC approach searches over the viral kinetic parameters presented above and is based on prior values described separately (Hay et al., 2021). To align with the described Omicron viral kinetics, the incubation period was fixed at 3 days, and the infectious viral kinetic parameter was also fixed. An upper bound of 0.100 was set for I0. The initial reproductive number (R0) increases across more horizons, which in turn shifts the SEIR peak earlier. The fit to the detectable cycle threshold distribution is presented over the largest horizon (C). The highest frequency (gray bars) of the lowest model-fitted Ct values (blue curve) occurs on days 14, 15, and 16, which represented the peak of the epidemic. The darker blue ribbon represents the 95% credible interval. Ct: cycle threshold; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2; SEIR: susceptible-exposed-infected-recovered.
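The link between a higher R0 and an earlier incidence peak can be illustrated with a minimal deterministic SEIR simulation. This is a sketch under stated assumptions, not the Ct-based MCMC model fitted in the study: only the 3-day incubation period comes from the caption, while the 5-day infectious period, the initial infected fraction of 0.001 (chosen below the 0.100 upper bound), the Euler step size, and all function names are hypothetical.

```python
def simulate_seir(r0, incubation_days=3.0, infectious_days=5.0,
                  i0=0.001, days=200, steps_per_day=10):
    """Return daily incidence (new exposures as a population fraction).

    Deterministic SEIR integrated with a simple Euler scheme.
    incubation_days=3 follows the caption; the other defaults are
    illustrative assumptions, not values from the study.
    """
    sigma = 1.0 / incubation_days   # rate of leaving the exposed state
    gamma = 1.0 / infectious_days   # rate of leaving the infectious state
    beta = r0 * gamma               # transmission rate implied by R0
    dt = 1.0 / steps_per_day

    s, e, i, r = 1.0 - i0, 0.0, i0, 0.0
    incidence = []
    for _ in range(days):
        new_today = 0.0
        for _ in range(steps_per_day):
            new_exposed = beta * s * i * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
            new_today += new_exposed
        incidence.append(new_today)
    return incidence


def peak_day(incidence):
    """Index of the day with the highest incidence."""
    return max(range(len(incidence)), key=incidence.__getitem__)


# Compare the timing of the incidence peak for two hypothetical R0 values.
peak_low_r0 = peak_day(simulate_seir(r0=2.0))
peak_high_r0 = peak_day(simulate_seir(r0=3.0))
```

With these illustrative settings, `peak_high_r0` is smaller than `peak_low_r0`: faster early growth depletes the susceptible pool sooner, so the simulated peak arrives earlier.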
True Rt (solid black) vs predicted Rt (dotted blue) on a single-cross-sectional SEIR model using default viral kinetics.
The sample time (t=104) is shown as a dotted red line.
Effect of sample size on machine learning model prediction of the reproductive number.
(A) Predicted Rt vs true Rt across all five machine learning (ML) models on three different sample sizes (n=100 [Panel I], 1,000 [Panel II], and 10,000 [Panel III]). The predicted Rt across all models follows the true Rt more closely with higher sample sizes. The 95% credible intervals are presented as colored ribbons. This is further corroborated by (B) the boxplot of the performance (MSE score) of all five models on the three sample sizes (n=100, 1,000, and 10,000). Increasing the sample size decreases the MSE, resulting in a more accurate predictive model. Random Forest is the best model at higher sample sizes.
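One reason larger samples help is that the summary statistics computed from each snapshot become less noisy. The sketch below illustrates this with the mean squared error (MSE) of a sample mean at the three sample sizes in the caption. It is not the study's ML pipeline: the normal distribution of Ct values, its mean of 25 and standard deviation of 5, the number of repeats, and the function names are all assumptions made for illustration.

```python
import random
import statistics


def mse(true_values, predicted_values):
    """Mean squared error between two equal-length sequences."""
    if len(true_values) != len(predicted_values):
        raise ValueError("sequences must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


def mse_of_sample_mean(n, true_mean=25.0, true_sd=5.0, repeats=50, seed=0):
    """MSE of the sample mean of n simulated Ct values over repeated draws.

    The normal distribution and its parameters are hypothetical.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(true_mean, true_sd) for _ in range(n))
        for _ in range(repeats)
    ]
    return mse([true_mean] * repeats, estimates)


# Error of the summary statistic at the three sample sizes from the caption.
errors = {n: mse_of_sample_mean(n) for n in (100, 1_000, 10_000)}
```

In this toy setting the error shrinks roughly in proportion to 1/n, so `errors[10_000]` is much smaller than `errors[100]`.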
Predicted reproductive number (Rt) vs true Rt (blue) across all models using the same simulation parameters from the SEIR model as presented in Figure 5.
Feature importance analysis by SHapley Additive exPlanations (SHAP) values.
The colors indicate the association between machine learning outputs and simulated cycle threshold (Ct) data. Features that have a higher impact on the prediction of Rt are presented in pink, and features that have a lower impact on the prediction of Rt are in blue. Results are presented stratified by three different population sizes (100, 1,000, and 10,000), with each column in descending order of performance. Of the five features explored, the top-ranking feature across all models was the variance of the Ct data. SHAP: SHapley Additive exPlanations; LGBM: Light Gradient Boosting Model; XGBM: eXtreme Gradient Boosting Model.
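As a rough illustration of what snapshot-level Ct features might look like, the sketch below computes five summary statistics from one sample of Ct values. Only the variance is named in the caption. The other four features (mean, median, skewness, and minimum), the use of population rather than sample formulas, and the function name are hypothetical choices for this example, not the study's feature set.

```python
import statistics


def ct_summary_features(ct_values):
    """Summary statistics of one snapshot of Ct values.

    Only "variance" is named in the caption; the remaining features
    are illustrative assumptions.
    """
    n = len(ct_values)
    if n < 2:
        raise ValueError("need at least two Ct values")

    mean = statistics.fmean(ct_values)
    variance = statistics.pvariance(ct_values, mu=mean)  # population variance
    sd = variance ** 0.5
    # Standardized third moment; defined as 0 when all values are identical.
    skewness = 0.0 if sd == 0 else sum(((x - mean) / sd) ** 3 for x in ct_values) / n

    return {
        "mean": mean,
        "median": statistics.median(ct_values),
        "variance": variance,
        "skewness": skewness,
        "minimum": min(ct_values),
    }


# Toy snapshot of Ct values (made-up numbers).
features = ct_summary_features([20, 22, 24, 26, 28])
```

A feature table for a model would then have one such row per daily snapshot, paired with the Rt value to be predicted.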
Tables
Epidemiological, clinical, and laboratory data of the cohort of asymptomatic individuals tested during the test period of the study.
| Group | Subgroup | Phase 3 (n=500,914) | Subgroup for SEIR analysis (n=429) |
|---|---|---|---|
| Testing* | Positives | 70,704 | 429 |
| | Negatives | 426,666 | 0 |
| | Repeats | 5,942 | 0 |
| | Other | 3,544 | 0 |
| Specimen type | NP | 32,956 | 429 |
| | SG | 37,508 | 0 |
| | Other | 71 | 0 |
| Age (years) | 0–4 | 2,013 | 2 |
| | 5–18 | 8,757 | 23 |
| | 19–39 | 31,497 | 271 |
| | 40–59 | 19,535 | 102 |
| | 60–79 | 7,518 | 29 |
| | ≥80 | 1,376 | 2 |
| | Unknown | 8 | 0 |
| Sex | Female | 37,073 | 206 |
| | Male | 32,733 | 221 |
| | Unknown | 898 | 2 |
| Patient health authority | 1 | 31,490 | 151 |
| | 2 | 9,000 | 4 |
| | 3 | 3,825 | 5 |
| | 4 | 16,394 | 245 |
| | 5 | 9,684 | 8 |
| | Unknown | 311 | 16 |
| Vaccination status | Unvaccinated | 15,494 | 355 |
| | One dose | 1,605 | 4 |
| | Fully vaccinated† | 49,361 | 64 |
| | Other | 4,244 | 6 |
| Asymptomatic testing | | 1,548 | 429 |
| No E gene result | | 18,583 | 0 |
| VoC lineage | Alpha (B.1.1.7) | 0 | 0 |
| | Beta (B.1.351) | 0 | 0 |
| | Delta (B.1.617.2) | 9,261 | 0 |
| | Gamma (P.1) | 0 | 0 |
| | Omicron (B.1.1.529) | 11,657 | 429 |
| | Unknown‡ | 49,786 | 0 |
- E gene, envelope gene; VoC, variant of concern.
- \* For all variables except testing, data presented as first positive result per person.
- † Does not include individuals who received ≥3 doses of vaccine.
- ‡ Due to laboratory testing algorithms, only a selected portion of SARS-CoV-2-positive samples underwent characterization to identify the VoC lineage.
SEIR and ML model comparison across implementation considerations for SARS-CoV-2 incidence prediction.
| Model | | SEIR | ML |
|---|---|---|---|
| Quantitative model comparison | Simulated data MSE (in %) | 0.6 (95% CI, 0.60–0.64%) | Random Forest†: 54 (95% CI, 39–83%) |
| Qualitative model comparisons | Sampling type | Random sampling | Random sampling |
| | Number of SARS-CoV-2-positive samples for which model best suited | Small (»30) | Large (>1,000) |
| | Sampling frequency | Single/multiple snapshots | Daily snapshots |
| | Flexibility: modelling of transmission | Fixed in time | Time-independent |
| | Flexibility: ability to add in multiple predictors | Unable to incorporate | Able to incorporate, and flexible in their representation |
| | Scalability | Single outbreak setting | Population level |
| | Computational complexity* | Low | Low-moderate |
| | Predictive power requirements | Good in single setting with well-mixed population and stable contact behaviour/infection control | No requirements other than sufficient sample size for Ct summary statistics by snapshot |
| | Additional sampling requirements | None | Ordered in time, restricted to fixed interval sampling |
- Ct, cycle threshold; ML, machine learning; SARS-CoV-2, severe acute respiratory syndrome coronavirus type 2; SEIR, susceptible-exposed-infected-recovered.
- \* Relative computational complexity based on the assumed sample size listed in the Scalability row.
- † Random Forest is presented because it was the top-performing ML model.
Additional files
- Supplementary file 1: Vaccination phase definitions used for the study. https://cdn.elifesciences.org/articles/95666/elife-95666-supp1-v2.docx
- Supplementary file 2: Epidemiological, clinical, and laboratory data of the earlier British Columbia SARS-CoV-2 pandemic phases. https://cdn.elifesciences.org/articles/95666/elife-95666-supp2-v2.docx
- Supplementary file 3: SARS-CoV-2 diagnostic testing strategy based on the envelope (E) gene target and test result interpretation criteria used. https://cdn.elifesciences.org/articles/95666/elife-95666-supp3-v2.docx
- Supplementary file 4: Control table of values and priors for the SEIR model. https://cdn.elifesciences.org/articles/95666/elife-95666-supp4-v2.docx
- Supplementary file 5: Hyperparameter selection. https://cdn.elifesciences.org/articles/95666/elife-95666-supp5-v2.docx
- MDAR checklist: https://cdn.elifesciences.org/articles/95666/elife-95666-mdarchecklist1-v2.pdf