Research Article

Epidemiology and Global Health

Prediction of SARS-CoV-2 transmission dynamics based on population-level cycle threshold values: An epidemic transmission and machine learning modeling study

British Columbia Centre for Disease Control, Canada
School of Population and Public Health, University of British Columbia, Canada
Department of Pathology and Laboratory Medicine, University of British Columbia, Canada
Office of the Medical Health Officer, Fraser Health, Canada
LifeLabs, Canada
Fraser Health, Canada
Division of Medical Microbiology & Infection Control, Vancouver Coastal Health Authority, Canada
Division of Medical Microbiology, Kelowna General Hospital, Canada
Division of Microbiology and Molecular Diagnostics, Victoria General Hospital, Canada
Division of Medical Microbiology and Virology, St. Paul’s Hospital, Canada
Ministry of Health, Canada

Feb 16, 2026

https://doi.org/10.7554/eLife.95666

Open access

7 figures, 2 tables and 6 additional files

Figures

Figure 1 with 2 supplements

Study design.

Overall design of the study (A) showing the patient population and molecular target of interest for the cycle threshold modeling (first column), sampling and modeling data sources for the study (second column), and modeling approaches (third column). The overall study flowchart (B) is also presented and depicts the process that led to the selection of 429 tests for the SEIR modeling analysis. BC: British Columbia; E gene: envelope gene; Nov: November; Rt: reproductive number; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2; SEIR: susceptible-exposed-infectious-recovered.

Figure 1—figure supplement 1

Twenty most prevalent SARS-CoV-2 variant of concern lineages in British Columbia from January 2021 to January 2022.

The current study was performed during a time of Omicron variant predominance, from November 19, 2021, to January 8, 2022 (vertical dotted lines). BCCDC: British Columbia Centre for Disease Control; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2.

Figure 1—figure supplement 2

Vaccination status definitions.

Vaccination status was defined based on the date of vaccine receipt relative to the collection date of the sample included in the study. For the Janssen vaccine only, fully vaccinated status was defined as having received one dose 14 days or more prior to sample collection. For all other vaccines, **unvaccinated** status was defined as having received no SARS-CoV-2 vaccine, or having received a SARS-CoV-2 vaccine less than 21 days prior to the sample collection date. **Partially vaccinated** status was defined as having received SARS-CoV-2 vaccine dose 1 greater than or equal to 21 days prior to sample collection, but having received dose 2 less than 14 days prior to sample collection. **Fully vaccinated** status was defined as greater than or equal to 14 days since the receipt of dose 2, but having received dose 3 less than 14 days prior to sample collection. SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2.
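The definitions above can be read as a small decision rule. The following is a minimal Python sketch of that rule, not the study's own code. The function name, argument names, and return labels are hypothetical. Several branches are assumptions that the caption does not spell out: a missing dose 2 is treated as partially vaccinated, a missing dose 3 as fully vaccinated, a Janssen dose given less than 14 days before collection as unvaccinated, a third dose given 14 days or more before collection as "other" (in line with the Table 1 footnote that fully vaccinated excludes recipients of three or more doses), and doses dated after sample collection are ignored.

```python
from datetime import date
from typing import Optional


def _days_before(collection: date, dose: Optional[date]) -> Optional[int]:
    """Days from a dose to sample collection; None if the dose is missing
    or was given after collection (assumption: such doses are ignored)."""
    if dose is None:
        return None
    delta = (collection - dose).days
    return delta if delta >= 0 else None


def vaccination_status(
    collection: date,
    dose1: Optional[date] = None,
    dose2: Optional[date] = None,
    dose3: Optional[date] = None,
    janssen: bool = False,
) -> str:
    """Classify vaccination status at sample collection (hypothetical helper)."""
    d1 = _days_before(collection, dose1)
    d2 = _days_before(collection, dose2)
    d3 = _days_before(collection, dose3)

    if janssen:
        # Janssen: one dose 14 days or more before collection counts as fully vaccinated.
        # Assumption: anything less is treated as unvaccinated.
        return "fully vaccinated" if d1 is not None and d1 >= 14 else "unvaccinated"

    # No vaccine, or dose 1 less than 21 days before collection.
    if d1 is None or d1 < 21:
        return "unvaccinated"
    # Dose 1 at least 21 days before, dose 2 missing (assumption) or under 14 days before.
    if d2 is None or d2 < 14:
        return "partially vaccinated"
    # Dose 2 at least 14 days before, dose 3 missing (assumption) or under 14 days before.
    if d3 is None or d3 < 14:
        return "fully vaccinated"
    # Assumption: three or more effective doses fall outside the three named categories.
    return "other"
```

For example, with a sample collected on December 15, 2021, a person whose only dose was given on December 1 would be classed as unvaccinated under this sketch, because 14 days is below the 21-day threshold.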

Figure 2

SARS-CoV-2 case incidence and E gene Ct distribution across study phases.

Violin plots demonstrating the E gene cycle threshold (Ct) value distribution in British Columbia across different time points of the study period (A). The median Ct values and associated interquartile ranges are presented as dotted lines. The absolute number of cases of confirmed SARS-CoV-2 infection is presented separately (B). Ct. e: Envelope (E) gene cycle threshold value; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2.

Figure 3 with 1 supplement

Long-term care facility outbreak investigation modeling findings.

A multiple-cross-section SEIR model was fitted to the outbreak data (A) and showed a peak in incidence on the 12th day of the outbreak, 2 days before the observed peak at the outbreak facility. The population included in this outbreak investigation was sampled at three predetermined time points (dashed red lines). The Markov chain Monte Carlo (MCMC) model-predicted incidence curve is represented by a blue line and is overlaid on the reported number of confirmed SARS-CoV-2-positive cases in this outbreak setting (yellow bars). The blue ribbon represents the 95% credible interval. Violin plots of the viral kinetic parameters for the SEIR model in the outbreak case study are also presented (B). The MCMC approach searches over these viral kinetic parameters and is based on prior values. The fit to the detectable Ct distribution at the time points of days 4, 12, and 19 of the outbreak is also presented (C). These panels show the model fit (blue curve) overlaid on the frequency of Ct values (gray bars) and are a good indicator of the Ct distribution across the time points. The darker blue ribbon represents the 95% credible interval. The Ct values increase from outbreak day 12 to day 19 as the epidemic declines. Ct: cycle threshold; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2; SEIR: susceptible-exposed-infected-recovered.

Figure 3—figure supplement 1

Case study epidemiological data (A) and epidemic curve (B) for the 156 infected individuals in the long-term care facility outbreak.

Figure 4

Overall population modeling findings.

A multiple-cross-section SEIR model was fitted to the overall population-level data (A) and showed an incidence peak from December 27, 2021, to January 1, 2022, which overlapped with the observed peak of reported cases in the province. The Markov chain Monte Carlo (MCMC) model-predicted incidence curve is represented by black lines and is overlaid on the reported number of confirmed SARS-CoV-2-positive cases (yellow bars). Violin plots of the viral kinetic parameters for the SEIR model are presented (B). Three time horizons of sizes 5, 6, and 7 were chosen. The MCMC approach searches over the viral kinetic parameters presented above and is based on prior values described separately (Hay et al., 2021). To align with the described Omicron viral kinetics, the incubation period was fixed and set at 3 days, and the infectious viral kinetic parameter was fixed. An upper bound of I₀ was set at 0.100. The initial reproductive number (R₀) increases across more horizons, which in turn shifts the SEIR peak earlier. The fit to the detectable cycle threshold distribution is presented over the largest horizon (C). The highest frequency (gray bars) of the lowest model-fitted Ct values (blue curve) occurs on days 14, 15, and 16, which represent the peak of the epidemic. The darker blue ribbon represents the 95% credible interval. Ct: cycle threshold; SARS-CoV-2: severe acute respiratory syndrome coronavirus type 2; SEIR: susceptible-exposed-infected-recovered.
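To illustrate why a larger R₀ shifts the SEIR peak earlier, the following is a minimal deterministic SEIR sketch in Python. It is not the Ct-based MCMC model fitted in the study. The 3-day incubation period comes from the caption above; the 5-day infectious period, the initial infectious fraction of 0.001 (chosen below the stated upper bound of 0.100), the R₀ values, and all function names are hypothetical and serve only as an example.

```python
def simulate_seir(r0, incubation_days=3.0, infectious_days=5.0,
                  i0=0.001, days=300, steps_per_day=10):
    """Return daily incidence (new infections as a population fraction)
    from a deterministic SEIR model integrated with forward Euler steps.

    infectious_days and i0 are illustrative assumptions, not study values.
    """
    sigma = 1.0 / incubation_days   # rate of leaving the exposed compartment
    gamma = 1.0 / infectious_days   # rate of leaving the infectious compartment
    beta = r0 * gamma               # transmission rate implied by R0
    dt = 1.0 / steps_per_day

    s, e, i, r = 1.0 - i0, 0.0, i0, 0.0
    incidence = []
    for _ in range(days):
        new_today = 0.0
        for _ in range(steps_per_day):
            new_exposed = beta * s * i * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
            new_today += new_exposed
        incidence.append(new_today)
    return incidence


def peak_day(incidence):
    """Index (in days) of the maximum daily incidence."""
    return max(range(len(incidence)), key=incidence.__getitem__)


if __name__ == "__main__":
    for r0 in (1.5, 3.0):  # hypothetical values for comparison
        print(f"R0={r0}: incidence peaks on day {peak_day(simulate_seir(r0))}")
```

Under these assumptions, the run with the larger R₀ peaks earlier, which matches the qualitative behavior described in the caption.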

Figure 5

True R_t (solid black) vs predicted R_t (dotted blue) on a single-cross-sectional SEIR model using default viral kinetics.

The sample time (t=104) is marked by a dotted red line.

Figure 6

Effect of sample size on machine learning model prediction of the reproductive number.

(A) Predicted R_t vs true R_t across all five machine learning (ML) models for three sample sizes (n=100 [Panel I], 1,000 [Panel II], and 10,000 [Panel III]). The predicted R_t across all models follows the true R_t more closely at larger sample sizes. The 95% credible intervals are presented as colored ribbons. This is further corroborated by (B) the boxplot of the performance (mean squared error [MSE]) of all five models at the three sample sizes (n=100, 1,000, and 10,000). Increasing the sample size decreases the MSE, resulting in a more accurate predictive model. Random Forest is the best-performing model at larger sample sizes.
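As a reminder of what the MSE in panel B measures, here is a minimal Python sketch of the mean squared error between a true and a predicted R_t series. The function name and the example values are hypothetical, and the sketch does not reproduce the study's evaluation pipeline.

```python
def mean_squared_error(true_rt, predicted_rt):
    """Mean of squared differences between true and predicted R_t values."""
    if len(true_rt) != len(predicted_rt) or not true_rt:
        raise ValueError("series must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_rt, predicted_rt)]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    true_rt = [1.0, 1.0, 2.0]       # hypothetical true values
    predicted_rt = [2.0, 0.0, 2.0]  # hypothetical predictions
    print(mean_squared_error(true_rt, predicted_rt))
```

A lower value means the predicted curve tracks the true curve more closely on average, with large deviations penalized more heavily than small ones.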

Figure 7 with 1 supplement

Predicted reproductive number (R_t) vs true R_t (blue) across all models using the same simulation parameters from the SEIR model as presented in Figure 5.

Figure 7—figure supplement 1

Feature importance analysis by SHapley Additive exPlanations (SHAP) values.

The colors indicate the association between machine learning outputs and simulated cycle threshold (Ct) data. Features that have a higher impact on the prediction of R_t are presented in pink, and features that have a lower impact on the prediction of R_t are in blue. Results are presented stratified by three different population sizes (100, 1,000, and 10,000), with each column in descending order of performance. Of the five features explored, the top-ranking feature across all models was the variance of the Ct data. SHAP: SHapley Additive exPlanations; LGBM: Light Gradient Boosting Model; XGBM: eXtreme Gradient Boosting Model.
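The ML models take summary statistics of each snapshot's Ct values as inputs. A minimal Python sketch of how such per-snapshot features could be computed follows. Only the variance is named in the caption above; the other statistics (mean, median, skewness), the function name, and the example values are assumptions for illustration and may differ from the five features used in the study.

```python
import statistics


def ct_summary_features(ct_values):
    """Summary statistics of one snapshot of Ct values (hypothetical feature set)."""
    n = len(ct_values)
    if n < 2:
        raise ValueError("need at least two Ct values")
    mean = statistics.fmean(ct_values)
    variance = statistics.pvariance(ct_values, mu=mean)  # population variance
    sd = variance ** 0.5
    # Population skewness: mean of cubed standardized deviations.
    skewness = 0.0 if sd == 0 else sum(((x - mean) / sd) ** 3 for x in ct_values) / n
    return {
        "mean": mean,
        "median": statistics.median(ct_values),
        "variance": variance,
        "skewness": skewness,
    }


if __name__ == "__main__":
    snapshot = [18.5, 21.0, 24.2, 27.9, 31.3]  # hypothetical Ct values
    print(ct_summary_features(snapshot))
```

One feature vector like this per snapshot, ordered in time, would form the input rows for a regression model predicting R_t.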

Tables

Table 1

Epidemiological, clinical, and laboratory data of the cohort of asymptomatic individuals tested during the test period of the study.

Group	Subgroup	Phase 3 (n=500,914)	Subgroup for SEIR analysis (n=429)
Testing*	Positives	70,704	429
	Negatives	426,666	0
	Repeats	5,942	0
	Other	3,544	0
Specimen type	NP	32,956	429
	SG	37,508	0
	Other	71	0
Age (years)	0–4	2,013	2
	5–18	8,757	23
	19–39	31,497	271
	40–59	19,535	102
	60–79	7,518	29
	≥80	1,376	2
	Unknown	8	0
Sex	Female	37,073	206
	Male	32,733	221
	Unknown	898	2
Patient health authority	1	31,490	151
	2	9,000	4
	3	3,825	5
	4	16,394	245
	5	9,684	8
	Unknown	311	16
Vaccination status	Unvaccinated	15,494	355
	One dose	1,605	4
	Fully vaccinated^†	49,361	64
	Other	4,244	6
Asymptomatic testing		1,548	429
No E gene result		18,583	0
VoC lineage	Alpha (B.1.1.7)	0	0
	Beta (B.1.351)	0	0
	Delta (B.1.617.2)	9,261	0
	Gamma (P.1)	0	0
	Omicron (B.1.1.529)	11,657	429
	Unknown^‡	49,786	0

E gene, envelope gene; VoC, variant of concern.
* For all variables except testing, data presented as first positive result per person.
† Does not include individuals who received ≥3 doses of vaccine.
‡ Due to laboratory testing algorithms, only a selected portion of SARS-CoV-2-positive samples underwent characterization to identify the VoC lineage.

Table 2

SEIR and ML model comparison across implementation considerations for SARS-CoV-2 incidence prediction.

	Model		SEIR	ML
Quantitative model comparison	Simulated data MSE (in %)		0.6 (95% CI, 0.60–0.64%)	Random Forest^†: 54 (95% CI, 39–83%)
Qualitative model comparisons	Sampling type		Random sampling	Random sampling
	Number of SARS-CoV-2-positive samples for which model best suited		Small (»30)	Large (>1000)
	Sampling frequency		Single/ multiple snapshots	Daily snapshots
	Flexibility	Modelling of transmission	Fixed in time	Time-independent
	Flexibility	Ability to add in multiple predictors	Unable to incorporate	Able to incorporate, and flexible in their representation
	Scalability		Single outbreak setting	Population level
	Computational complexity*		Low	Low-moderate
	Predictive power requirements		Good in single setting with well-mixed population and stable contact behaviour/infection control	No requirements other than sufficient sample size for Ct summary statistics by snapshot
	Additional sampling requirements		None	Ordered in time, restricted to fixed interval sampling

Ct, cycle threshold; ML, machine learning; SARS-CoV-2, severe acute respiratory syndrome coronavirus type 2; SEIR, susceptible-exposed-infected-recovered.
* Relative computational complexity based on the assumed sample size listed in the Scalability row.
† Random Forest is presented because it was the top-performing ML model.

Additional files

Supplementary file 1. Vaccination phase definitions used for the study. https://cdn.elifesciences.org/articles/95666/elife-95666-supp1-v2.docx
Supplementary file 2. Epidemiological, clinical, and laboratory data of the earlier British Columbia SARS-CoV-2 pandemic phases. https://cdn.elifesciences.org/articles/95666/elife-95666-supp2-v2.docx
Supplementary file 3. SARS-CoV-2 diagnostic testing strategy based on the envelope (E) gene target and test result interpretation criteria used. https://cdn.elifesciences.org/articles/95666/elife-95666-supp3-v2.docx
Supplementary file 4. Control table of values and priors for the SEIR model. https://cdn.elifesciences.org/articles/95666/elife-95666-supp4-v2.docx
Supplementary file 5. Hyperparameter selection. https://cdn.elifesciences.org/articles/95666/elife-95666-supp5-v2.docx
MDAR checklist. https://cdn.elifesciences.org/articles/95666/elife-95666-mdarchecklist1-v2.pdf

Cite this article

Afraz Arif Khan
Hind Sbihi
Michael A Irvine
Agatha N Jassem
Yayuk Joffres
Braeden Klaver
Naveed Janjua
Aamir Bharmal
Carmen H Ng
Chris D Fjell
Miguel Imperial
Susan Roman
Marthe K Charles
Amanda Wilmer
John Galbraith
Marc G Romney
Bonnie Henry
Linda MN Hoang
Mel Krajden
Catherine A Hogan

(2026)

Prediction of SARS-CoV-2 transmission dynamics based on population-level cycle threshold values: An epidemic transmission and machine learning modeling study

eLife 15:e95666.

https://doi.org/10.7554/eLife.95666
