Main Text

Coronavirus disease 2019 (COVID-19) vaccinations, which are effective in preventing infection by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and severe COVID-19 illness, have enabled the gradual and safe removal of COVID-19 restrictions on everyday life over the past year. However, we still face the emergence of variants of concerns (VOCs), which is a major worry in the era of “living with COVID-19”. As a result, to prevent major outbreaks, basic public health responses such as testing, isolation, and quarantine are demanded and important. In particular, rapid and accurate diagnostic tests are essential for controlling ongoing transmission. Salivary diagnostic testing is a convenient tool for early and efficient diagnosis of COVID-19 because it is easy for health care professionals and patients to administer1,2.

The oral cavity is an important target for SARS-CoV-23, and viral particles in the lower and upper respiratory tract can reach the oral cavity through liquid droplets1. Saliva droplets are thus a potential route of SARS-CoV-2 transmission1,3. Although SARS-CoV-2 infection dynamics within the respiratory tract are well characterized thanks to data from oropharyngeal and nasopharyngeal swabs49, infection dynamics in saliva are poorly understood. Recent reports3,5 have suggested that saliva and other tissues are highly compartmentalized and that SARS-CoV-2 infection dynamics are qualitatively different in different tissues. However, although individual­level heterogeneity in virus dynamics (especially individual infectiousness) has been evaluated, it remains unknown how individual-level viral shedding patterns in saliva are stratified, and which factor(s) determine the patterns5,10. Thus, to shape a country’s early response to future VOCs, such as through isolation and screening guidelines based on salivary diagnostic testing, it is critical to address these points with the use of a high quality and quantity of saliva specimens annotated with basic clinical patient data.

So far, our ability to understand and characterize whole SARS-CoV-2 infection dynamics has been hindered by several limitations of the clinical data that made it impossible to capture either the early or the late phase of infection or to annotate individual-level clinical information. To overcome these limitations, here we quantified and stratified longitudinal virus dynamics in saliva 3 samples from 144 mildly symptomatic participants of two different but complementary cohorts5,11,12. We successfully identified three groups with significantly different durations of viral shedding (i.e., the mean durations were 11.5 days, 17.4 days, and 30.0 days, respectively), implying a large inter-participant heterogeneity in virus infection dynamics. However, when we analyzed a total of 47 variables, including basic demographic information, daily clinical symptoms, results of blood tests, and vital signs, none explained the stratified grouping.

We also retrospectively explored whether salivary micro-RNAs were associated with the stratification by using stored residual saliva specimens. Micro-RNAs are non-coding RNAs that regulate numerous cellular processes by modulating protein levels through direct binding to mRNA (coding-RNA), thereby influencing translation efficiency or mRNA abundance. Various microRNAs are recognized for their impact on the viral replication ability and immune response in viruses like EBV, HCV, and HIV etc13. Many studies are underway to investigate micro-RNAs as potential targets for virus diagnosis and treatment. However, the relationship between micro-RNAs and the patterns of virus shedding of the SARS-CoV-2 virus in body remains unknown. We quantified the expression levels of 92 micro-RNAs and found that no micro-RNA significantly explained the stratified groups, although the mir-1846 level may have been weakly correlated with the peak viral load. Our findings provide important insights into the complexities of viral shedding patterns in saliva and suggest that predicting the heterogeneity of viral dynamics using basic clinical and micro-RNA data may be a challenging task. These insights are critical for developing accurate diagnostic tools, effective treatments, and prevention strategies for COVID-19.


Description of cohort and study design

We used longitudinal saliva viral load data obtained from cohorts of the nelfinavir (NFV) clinical trial (jRCT207120002311,12) and the University of Illinois at Urbana-Champaign5. All infections were either mild or asymptomatic. All participants in these cohorts confirmed that they had never been previously infected with SARS-CoV-2, and none were vaccinated against SARS-CoV-2 at the time of enrolment. Of 182 participants from these two studies, 144 symptomatic participants, excluding 22 asymptomatic participants and 16 participants with incomplete observational data, were considered in our analysis (Fig 1A). In addition, we annotated the saliva viral load results of 90 participants from the NFV clinical trial with their sex, age, daily symptoms, blood test results, and vital signs, as summarized in Fig 1B, Table 1, and Supplementary Table 1. The data from the two studies complemented one another: the longitudinal data from the University of Illinois contained data on the early phase of infection (before symptom onset), while the data from the NFV clinical trial contained data on the late phase of viral RNA load (more than 14 days after symptom onset) (Fig 1C). The time-series viral load in the saliva samples for all individual used in the analysis are plotted in Fig 1D. We also analyzed 60 stored saliva specimens from 30 participants of the NFV clinical trial for micro-RNA analysis.

Characteristics of cohorts from the NFV clinical trial (jRCT207120002311,12) and the University of Illinois5:

(A) Flowchart of cohorts from the NFV clinical trial and the University of Illinois along with the number of participants and inclusion criteria for our analysis are described. (B) Data collection schedule of viral load, blood test, and vital signs from participants in the NFV clinical trial is described. (C) and (D) show, for each participant (N=144 participants, 2191 samples), the timeline of sample collection and the captured SARS-CoV-2 viral RNA load for saliva RT-qPCR, respectively. The red and blue colors indicate samples for cohorts from the NFV clinical trial and the University of Illinois, respectively.

Clinical data of the overall cohort from the NFV clinical trial and in groups stratified by longitudinal virus dynamics

Quantifying and stratifying SARS-CoV-2 infection dynamics in saliva

We employed a previously developed mathematical model describing SARS-CoV-2 infection dynamics [i.e., Eqs.(1-2)] to evaluate interparticipant heterogeneity (details are provided in Methods), and reconstructed the best-fit virus dynamics in saliva of 144 symptomatic participants (Extended Data Fig 1 and Supplementary Table 2). We also applied the more realistic mathematical model used in Ke et al.5 [i.e., Eqs.(3-6)], but the fitting by these two mathematical models reconstructed our dataset equally (Supplementary Fig 1 and Supplementary Table 3). This is because the longitudinal data from the NFV clinical trial included the late phase of the viral RNA load, which was not included in the data from the University of Illinois; therefore, Eqs.(1-2) are sufficient for the purpose of reconstruction of viral load. To avoid complexities and assumptions on parameters of Eqs.(3-6) as described in Ke et al.5, we used our simple model of Eqs.(1-2) for further analysis.

Next, to stratify the time-course pattern of viral shedding, we first applied unsupervised random forest clustering to the individual “reconstructed” virus dynamics of 144 participants (e.g., Extended Data Fig 1). However, this analysis failed to divide the time-course pattern into different clusters (data not shown). To overcome this problem, we quantified the peak, duration, up-slope (i.e., growth rate), and down-slope (i.e., decay rate) of the reconstructed dynamics as “features” of the virus dynamics (Supplementary Table 4). Interestingly, the unsupervised random forest clustering based on these features identified 3 groups (i.e., G1: N=46, G2: N=61, and G3: N=37) in which the time-course patterns were clearly discriminated. This finding suggested that there is a heterogeneity of virus infection dynamics in saliva (see Methods). Fig 2A indicates a two­dimensional Uniform Manifold Approximation and Projection (UMAP) embedding of the three stratified groups. Using a different color for each group (gray for G1, magenta for G2, and blue for G3), we also plotted the estimated individual viral load (Fig 2B), and the highlighted time­course pattern of each group by the Partial Least-Squares Discriminant Analysis (PLS-DA) (Fig 2C).

Stratification of individual SARS-CoV-2 viral dynamics in saliva:

(A) UMAP of stratified viral RNA load based on the extracted features from the reconstructed individual-level viral dynamics is shown. (B) The reconstructed individual viral RNA load is shown. Colors for individual-level viral dynamics correspond to the colors of the dots in the UMAP described in (A). (C) The time-course patterns of each group highlighted by the Partial Least-Squares Discriminant Analysis (PLS-DA). (D) Distributions between groups of each feature used for stratification of viral shedding patterns are shown. The p-values of ANOVA for the difference in each feature among stratified group are all less than 0.05. (E) Distributions of the number of individuals in each stratified group for the standard-of-care alone (left, n=97) and standard-of-care plus NFV administration (right, n=47) participants are shown. (F) Distributions of the number of individuals in each stratified group for Alpha variants (left, n=30), Delta variants (middle, n=13), and other variants (right, n=66) of SARS-CoV-2 are shown.

The distributions of the four features are described in Fig 2D; a statistically significant between-group difference was found in the duration of viral shedding. The mean durations were 11.5 days (95% CI: 10.6 to 12.4), 17.4 days (16.6 to 18.2), and 30.0 days (28.1 to 31.8) for G1, G2, and G3, respectively. In our previous report10, we consistently confirmed that there were at least 3 groups showing different durations of viral shedding in upper respiratory specimens.

Because of previous work concluding that there are no significant differences in virus infection dynamics between NFV-treated and untreated participants12, we analyzed all data together regardless of treatment (see Methods). To further confirm whether NFV affects the stratification of the time-course pattern of viral shedding, we compared the number of individuals belonging to each group (i.e., G1, G2 and G3) between NFV-treated and untreated participants (including the members of the University of Illinois cohort) and found no trend for the stratification (p=0.784 by the Fisher’s exact test: Fig 2E).

Another possibility that explains the different viral duration observed here may be a difference in VOC genotypes. To test this, we used data from 55 participants in our NFV clinical trial who had been characterized according to which VOCs [i.e., B.1.1.7 (Alpha), B.1.672.2 or AY.29 (Delta), and other variants] they had been infected with (see Methods), in addition to data from 54 participants of the University of Illinois cohort. However, we observed no trend in the number of individuals belonging to each group among the VOCs (p=0.728 by the Fisher’s exact test: Fig 2F), which is consistent with the conclusion in Ke et al.5.

Basic clinical data may not explain heterogeneity in individual viral shedding

Using data from the NFV clinical trial, we annotated the saliva viral loads of 90 participants with basic demographic information, daily symptoms, blood test results, and vital signs (Fig 1A, Table 1 and Supplementary Table 1). We also annotated the saliva viral loads of 52 participants from the University of Illinois with daily symptoms (Supplementary Table 1).

To identify factors that were significantly correlated with the viral shedding patterns in saliva specimens obtained from the NFV clinical trial, we first examined the 39 variables summarized in Table 1. Each factor was compared between the three groups by ANOVA and the p-values were corrected by the False Discovery Rate (FDR). However, we found no clinical data that differed significantly (i.e., corrected p-value of ANOVA of less than 0.05) among the stratified groups (Fig 3A). To avoid overfitting by bootstrap aggregating (bagging), we also trained a random forest classifier (see Methods), a tree-based machine learning algorithm suitable for tabular data14, to predict the group from the clinical data of 90 individuals in the NFV clinical trial cohort and obtained ROC-AUCs of 60%, 49%, and 36% for predicting G1, G2, and G3, respectively (Fig 3B). We were not able to achieve a high ROC-AUC for predicting the shedding patterns based on the basic clinical data. We also attempted to make prediction based on clinical data which exhibited relatively low p-values in the ANOVA analysis. However, we were unable to achieve a high prediction accuracy with this approach (data not shown).

Correlation between clinical data and viral shedding patterns:

(A) P-values of ANOVA corrected by the FDR to compare clinical data among the three stratified groups are shown. Clinical data are listed in reverse order of p-values. (B) and (C) show ROC curves of random forest classifiers trained on predicting each group by using data for 39 clinical values and 8 daily symptoms, respectively. The corresponding AUC (area under curve) value of each ROC curve is shown on the top of each panel.

Next, we asked whether the stratification of the study population is associated with clinical symptoms of COVID-19 that could be caused by active replication of SARS-CoV-2. In general, the clinical symptoms of COVID-19 are cough, fever, shortness of breath, muscle pain, sore throat, confusion, chest pain, headache, rhinorrhea, diarrhea, and nausea and vomiting. In our study, individual-level symptom data were available as 8 categories in the cohorts from both the NFV clinical trial and the University of Illinois (Supplementary Table 1). We tried to use a random forest classifier to investigate whether symptom data could predict the stratified groups and obtained ROC-AUCs of 57%, 54%, and 48% for predicting G1, G2, and G3, respectively (Fig 3C). In fact, SARS-CoV-2 human challenge clearly showed no quantitative correlation between the individuals’ time-series pattern of viral load and symptoms15.

Additionally, we investigated the relationship between each feature of viral dynamics (i.e., duration of viral shedding, peak viral load, up-slope, and down-slope) and the clinical data by using the Pearson’s correlation coefficient (Extended Data Table 1). Overall, correlation coefficients were low (0.06 on average) with high p-values, which suggests that no feature was likely to be explained by these clinical data.

Relationship between salivary microRNAs and viral shedding patterns in COVID-19 patients

Various proteins in saliva have antiviral effects. It is also expected that some micro-RNAs in saliva may impair SARS-CoV-2 replication1 and could thus act as a biomarker to predict COVID-19 disease progression16. We here used the stored residual saliva specimens from the NFV clinical trial to identify micro-RNA(s) associated with the stratified groups (i.e., G1, G2 and G3). We note that, because all residual saliva specimens are annotated with the individual participant and we know which participants belong to which stratified group, we can select and compare saliva specimens from G1, G2, and G3 without bias. This implies that we can impartially select participants in equal numbers from each group, unaffected by other factors. Specifically, we collected 60 stored saliva specimens from the NFV clinical trial to perform micro-RNA analysis for 30 participants. We picked two samples for each participant to evaluate the role of micro-RNAs during both the peak and the late phase (i.e., 30 samples for each phase): the nearest sample from the estimated peak and the most distant sample above the detection limit in the late phase (Fig 4A). We normalized micro-RNA expression among the samples by using the DESeq2 Bioconductor package. We summarize the information on the micro-RNAs we obtained from saliva specimens in Supplementary Table 5.

Correlation between micro-RNA data and viral shedding patterns:

(A) The strategy of micro-RNA data collection from saliva samples in the NFV clinical trial is described. We picked a total of 30 participants by choosing 10 participants from each group. We chose two samples (the nearest to estimated peak and the most distant but above the detection limit in the late phase) from each participant for quantifying micro-RNA. (B) P-values of Kruskal-Wallis ANOVA corrected by the FDR for each micro-RNA level are shown. Micro-RNA levels are listed by reverse order of p-values. Only the 24 micro-RNA levels with the lowest p-values are shown. (C) ROC curves of random forest classifiers trained on predicting each group by using levels of 92 micro-RNAs are shown. The corresponding AUC value of each ROC curve is presented on the top of each panel.

Similar to the analysis using clinical data, we compared the expression levels of 92 micro-RNAs between the three stratified groups. Because the micro-RNA data were non-parametric, we used the Kruskal-Wallis ANOVA for analysis and corrected the p-values by FDR. However, we failed to find micro-RNA levels that differentiated the stratified groups (i.e., with a corrected p-value of Kruskal-Wallis ANOVA of less than 0.05) in the three trials using the data from the peak phase, the late phase, and both phases (e.g., Fig 3B for the total 60 samples). We also trained a random forest classifier to predict each group from the micro-RNA levels for the 60 total samples, and obtained ROC-AUCs of 48%, 57%, and 42%, respectively. Again, we did not obtain enough ROC-AUCs to predict stratified groups by using the collected micro-RNA data.

Furthermore, we investigated the relationship between the 4 features of viral dynamics and micro-RNA levels. Here we used the Spearman’s correlation coefficient (Extended Data Table 2). Overall, we did not find strong correlations between micro-RNA levels and features (Spearman’s correlation coefficients on average of 0.002, 0.024, −0.001, and −0.001 for duration of viral shedding, peak viral load, up-slope, and down-slope, respectively, for the 60 total samples). Only the mir-1846 level exhibited a weak negative correlation (−0.53 Spearman’s correlation coefficient with 0.01 p-value) with the peak viral load (Extended Data Figure 2). We confirmed similar trends even when we analyzed the micro-RNA level for the peak and late phases separately.


Being able to quickly and efficiently diagnose COVID-19 is essential in monitoring the pandemic. Because the sampling process for saliva is noninvasive, and because it is inexpensive and minimizes the risk for transmissions to health care workers1, saliva sampling has excellent potential and advantages over other sampling methods from biological specimens such as the lower and upper respiratory tract2,17. Given the significant individual heterogeneity in the saliva viral shedding5,18, identifying biomarker(s) for viral shedding patterns will be crucial for improving public health interventions in the era of living with COVID-19. To improve our understanding of SARS-CoV-2 infection dynamics in saliva to enable application of saliva testing in the fight against COVID-19, we quantified and stratified longitudinal virus dynamics in saliva samples from 144 mildly symptomatic individuals from the cohorts of the NFV clinical trial11 and the University of Illinois at Urbana-Champaign5. In addition to the large heterogeneity in virus infection dynamics, we identified three groups (i.e., G1, G2 and G3) with different viral shedding patterns (Fig 2D).

Immunocompromised patients have been reported to have a prolonged duration of viral shedding, lasting over three months, underscoring the critical role of host immune responses in controlling viral infections1922. Although oral immune responses remain poorly understood, Huang et al. recently confirmed by using single-cell RNA sequencing of the human minor salivary glands and gingiva that SARS-CoV-2 infection can trigger sustained, localized immune responses in saliva3. In this study, we observed significant differences in the down-slopes of viral shedding in saliva among participants in different groups, with a more rapid decline in G1. This decline is likely attributed to a stronger immune response to SARS-CoV-2 in G1 participants than in participants in G2 and G3, as reflected in the death rate of infected cells due to the immune response (Fig. 2D). Lower levels of viral replication have also been observed among infected participants with high baseline levels of mucosal IgA (but not IgG), as reported elsewhere23. Recently, we demonstrated that rapid anti-spike secretory IgA antibody responses can contribute to reducing viral shedding durations and amounts in nasopharyngeal mucosa24. These findings highlight the importance of biomarkers that directly reflect an individual’s immune response, such as antibody induction, in predicting viral shedding patterns. Therefore, quantifying the time-series pattern of mucosal IgA and its correlation with saliva viral load may provide crucial insights into the stratification of SARS-CoV-2 infection dynamics.

For the purpose of predicting viral shedding patterns during the early stage of infection, we first explored the association of 39 basic clinical variables, 8 daily symptoms, and the levels of 92 micro-RNAs with the stratified groups. However, none of the factors were significant (Table 1, Fig 3A, Fig 4B, Supplementary Table 1, Supplementary Table 2 and Supplementary Fig 2). In contrast, we showed that mir-1846, which is an exogenous micro-RNA that is specifically classified as an Oryza sativa micro-RNA (osa-microRNA)25, may exhibit a weak negative correlation. Exogenous micro-RNAs enter the human body primarily through food and can affect human metabolism by interacting and binding with human genes. mir-1846 is reported to interact with two human genes25 that are known to be associated with the progression of melanoma, various cancers, and leukemia. This suggests that mir-1846 levels may be linked to human immunity. Few studies have investigated the role of mir-1846 in humans, but our findings suggest the need for further investigations into the impact of this micro-RNA level on human immunity. Our research sheds light on the intricate patterns of viral shedding in saliva.

Our approach has several limitations that must be considered in our next study: First, our analysis was limited to participants with symptomatic infection and excluded those with asymptomatic infection (22 asymptomatic individuals out of a total of 182 individuals, i.e., 12% of participants) because we integrated datasets with different time scales from different cohorts. Although our data do not include participants infected with omicron variants, others have reported that the omicron variant may cause a higher proportion of asymptomatic infection26. Thus, evaluating the effect of asymptomatic infection will be important to update our stratification, especially for recent (or future emerging) VOCs. Second, micro-RNAs participate in the post-transcriptional regulation of gene expression; however, they do not provide direct insights into immune cell dynamics. Given the reported association between the duration of viral shedding and mucosal immunity as discussed above, it appears imperative to analyze modalities that are directly linked to the immune response in the future. Another potential limitation of this study is the timing of saliva specimen sampling, although we took great care to select and compare specimens from G1, G2, and G3 without bias. As a result of our clinical trial design (jRCT207120002311,12), participants were enrolled after the onset of symptoms, thereby restricting saliva specimen collection exclusively to the post-symptom phase. Unfortunately, we lack samples from the pre-infection, pre-symptomatic, and early infection phases. Consequently, the absence of individual-level baseline values for micro-RNA means that inter-participant heterogeneity in micro-RNA levels may obscure signals related to distinct viral infection dynamics in saliva.

In conclusion, our study revealed that the dynamics of SARS-CoV-2 infection in saliva can be classified into three groups based mainly on the duration of viral shedding. However, accurately predicting the variability in viral dynamics remains a challenging task, because it requires a more comprehensive understanding of the complex shedding patterns in saliva, as well as detailed clinical and molecular data. The identification of a sensitive, simple, and rapid biomarker for saliva viral shedding will be imperative for future COVID-19 outbreak control.


Ethics statement

The NFV clinical trial was approved by the institutional review board of Nagasaki University Hospital (approval number: I20-001) and is registered with the Japan Registry of Clinical Trials (jRCT2071200023). All participants provided written, informed consent for secondary use of clinical information and samples. The present study was approved by the ethics committee of Nagoya University (approval number: hc 22-01).

Saliva viral load data

Longitudinal saliva viral load data of participants with symptomatic and asymptomatic COVID-19 (122 cases) were obtained from the NFV clinical trial11. Briefly, the NFV clinical trial was a prospective, randomized, open-label, blinded-endpoint, parallel-group trial conducted between July 2020 and October 2021 at 11 university and teaching hospitals in Japan. This study consisted of a 14-day treatment period and a 14-day follow-up period, with no significant differences in the time to viral clearance between patients who received standard-of-care plus NFV administration and those who had the standard-of-care alone11,12. Therefore, the participants with COVID-19 were analyzed together here. In addition, we obtained similar saliva viral load data (60 cases) from the cohort of the University of Illinois at Urbana-Champaign5. This cohort contained all faculty, staff, and students at the University of Illinois at Urbana-Champaign, who undergo at least twice weekly quantitative PCR-RT testing during fall of 2020 and spring of 2021. Among those 182 cases, we focused only on symptomatic participants. Also, we excluded the participants who had less than three measured viral loads that were not limit detections (i.e., 90 cases from the NFV clinical trial and 54 cases from the University of Illinois were used in this study). This decision was based on the understanding that reasonable estimates cannot be derived when the number of data points less than the number of parameters in the mathematical model. The limit of detection for viral load data from the University of Illinois is 1.08 copies/ml. However, the limit of detection for viral load data from NFV clinical trial was unclear. Considering this, we assumed 1.08 copies/ml as the limit of detection for all viral load data.

Viral genome sequencing

The cDNA had been synthesized from RNA of SARS-CoV-2-positive saliva samples. Reverse transcription, multiplex PCR reaction, and Illumina library prep were conducted using a protocol published previously27. The pooled library was first purified by AMPureXP at 0.8 x concentration and then again at 1.2 x concentration. The purified library was sequenced for 151 cycles at both paired-ends in Illumina iSeq100. Sequence analysis was performed using the nf-core/viralcon pipeline (10.5281/zenodo.3901628).

Quantifying biomarkers in saliva

Total RNA from saliva was extracted with MagMAX CORE Nucleic Acid Purification kits (Applied Biosystems, Foster City, CA). Micro-RNAs were detected using Illumina Hiseq x Ten (Illumina, Inc, San Diego, CA) with data processing by ribodepletion (Genewiz-Azenta, South Plainfield, NJ). To remove technical sequences, the pass filter data in the fastq format were processed by Trimmomatic (v0.30) to be high-quality clean data. Following quality trimming, micro-RNAs were identified and checked using miRDeep228. Normalization of micro-RNA expression among samples and differential expression analysis were carried out using the DESeq2 Bioconductor package.

Basic clinical data

Basic clinical data including basic demographic characteristics of the study participants, symptoms, and findings of physical examinations and laboratory tests were obtained according to the study protocol11. We here used information on age, daily symptoms, results of blood tests, and vital signs of the symptomatic participants in the NFV clinical trial (summarized in Table 1 and Supplementary Table 1).

Mathematical modeling

To describe SARS-CoV-2 infection dynamics in saliva specimens, we here mainly used the following mathematical model developed in our recent studies4,6,10:

The variables f(t) and V(t) are the fraction of uninfected target cells and the total amount of virus, respectively, and the parameters β, γ, and δ are the rate constant for virus infection, the maximum rate constant for viral replication, and the death rate of infected cells, respectively.

In addition to comparing the simple model [i.e., a target-cell-limited model; Eqs.(1-2)], we also used the following “immune effector cell model” developed in Ke et al.5 for the saliva viral load (see Supplementary Fig 1):

The variables T(t), E(t), and i(t) are the total number uninfected target cells, cells in the eclipse phase of infection, and productively infected cells, respectively. The parameters 1/k, π, and c are the average duration of the eclipse phase, the virus production rate, and the clearance rate of viruses, respectively. The death rate of infected cells is assumed to be time-dependent to mimic the killing of infected cells by immune effector cells: δ(t) = δ1 for t < t1 and δ(t) = δ1 + δ2 for t1 < t. For a detailed explanation of the immune effector cell model, the reader is referred to Ke et al.5.

Parameter estimation

A nonlinear mixed-effects modelling approach incorporates fixed effects as well as random effects that describe the inter-participant variability in parameters. Including random effects amounts to a partial pooling of the data of all participants to improve estimates of the parameters applicable across the cases. In fact, in our study, the saliva specimens of cases from the cohort of the University of Illinois and the NFV clinical trial were obtained mainly during the “early” (i.e., from the date of infection to no more than 14 days post symptom onset) and “late” (i.e., no more than 28 days from around symptom onset) phases of SARS-CoV-2 infection, respectively. Therefore, cohorts from these two studies can reveal the whole time-course pattern of SARS-CoV-2 infection dynamics in a complementary manner through the random effects of a nonlinear mixed-effects modelling.

In our analyses, the variable V(t) in Eq.(2) corresponds to the viral load in saliva for SARS-CoV-2. To fit the patient’s viral load data, we used a program MONOLIX 2021R2 (, implement maximum likelihood estimation of parameters in nonlinear mixed effect model. The nonlinear mixed effect model allows a fixed effect as well as interpatient variability. This method estimates each parameter θi(= θ x eηi) for each individual where θ is a fixed effect, and ηi is a random effect, and which obeys a Gaussian distribution with mean 0 and standard deviation Ω. Here we used lognormal distributions as prior distributions of parameters to guarantee the positiveness (i.e., negative values do not biologically make sense). In parameter estimation, as time 0 in the original dataset represents the time of symptom onset, we also estimated the time from infection to symptom onset (corresponding to τ in Supplementary Table 2 and 3) along with other parameters. The fixed effect parameters and random effect parameters were estimated by use of the stochastic approximation Expectation/Maximization (SAEM) algorithm and empirical Bayes method, respectively. A right-truncated normal distribution was used in the likelihood function to account for the left censoring of the viral load data (i.e., when the viral load is not detectable)29. We changed the initial values multiple times to avoid a local minimum of AIC and confirmed the robustness of parameter estimation.

Unsupervised clustering and stratification of SARS-CoV-2 infection dynamics

Unsupervised random forest clustering was performed on the selected features of the virus infection dynamics, that is, the peak viral load, duration of viral shedding, up-slope, and down-slope (rfUtilities package in R). The use of random forest allows us to avoid overfitting by bootstrap aggregating (bagging) and to achieve better generalization performance30. After a random forest dissimilarity (i.e., the distance matrix between all pairs of samples) was obtained, it was visualized with Uniform Manifold Approximation and Projection (UMAP) in a two­dimensional plane and was stratified with spectral clustering (Python scikit-learn). The optimal number of clusters was determined by the eigengap heuristic method.

Random forest classifiers for characterizing stratified groups

Random forest classifiers were trained to predict either of the three stratified groups (G1-G3) using randomForest packages in R. The receiver operating characteristic (ROC) curve for each classifier was drawn from out-of-bag (OOB) samples using the pROC package in R. For example, the ROC for G1 is for the random forest classifier predicting “G1” versus “G2 or G3”. We list the variables for the supervised random forest in Table 1, Supplementary Table 1, and Supplementary Table 2.

Statistical analysis

When necessary, the variables were compared among different groups using Fisher’s exact test (for categorical variables), analysis of variance (ANOVA, for numerical variables from clinical data with more than two groups), and Kruskal-Wallis ANOVA (for variables from micro-RNA data with more than two groups). Corrections of p-values for multiple testing were performed by the False Discovery Rate (FDR). Also, the variables were investigated for their relationship with features of viral load using the Pearson’s correlation coefficient (for variables from clinical data) and the Spearman’s correlation coefficient (for variables from micro-RNA data). All statistical analyses were performed using R (version 4.1.3).


This study was supported in part by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2022R1C1C2003637) (to K.S.K.); Scientific Research (KAKENHI) B 23H03497 (to S.I.); Grant-in-Aid for Transformative Research Areas 22H05215 (to S.I.); Grant-in-Aid for Challenging Research (Exploratory) 22K19829 (to S.I.); AMED CREST 19gm1310002 (to S.I.); AMED Research Program on Emerging and Re-emerging Infectious Diseases 22fk0108509 (to S.I.), 23fk0108684 (to S.I.), 23fk0108685 (to S.I.); AMED Research Program on HIV/AIDS 22fk0410052 (to S.I.); AMED Program for Basic and Clinical Research on Hepatitis 22fk0210094 (to S.I.); AMED Program on the Innovative Development and the Application of New Drugs for Hepatitis B 22fk0310504h0501 (to S.I.); AMED Strategic Research Program for Brain Sciences 22wm0425011s0302; AMED JP22dm0307009 (to K.A.); JST MIRAI JPMJMI22G1 (to S.I.); Moonshot R&D JPMJMS2021 (to K.A. and S.I.) and JPMJMS2025 (to S.I.); Institute of AI and Beyond at the University of Tokyo (to K.A.); Shin-Nihon of Advanced Medical Research (to S.I.); SECOM Science and Technology Foundation (to S.I.); The Japan Prize Foundation (to S.I.).

Author contributions

YM, KW, SI, and TM designed the research. HP, RY, KSK, KE, NN, SIwanami, KA, TU, KM, TMorita, CBB, RK and SI carried out the computational analysis. SI and TM supervised the project. All authors contributed to writing the manuscript.

Competing financial interests

The authors declare no conflicts of interest associated with this manuscript.

Institutional review board statement

This study was approved by the ethics committees of Nagoya University (hc22-01).

Extended data figures

Reconstructed viral dynamics in saliva samples for individual participants:

The individual-level model fits to saliva RT-qPCR results using the target-cell­limited model described in Eqs.(1-2), for the same cohorts described in Fig. 1, are shown. The closed dots and the solid curves indicate the measured data and the estimated viral dynamics, respectively. Individuals from the NFV clinical trial and the University of Illinois are shown in red and blue, respectively.

Correlation between mir-1846 level and the features of infection dynamics:

The correlation between the rank of mir-1846 and the rank of (A) the peak viral load, (B) the duration of viral shedding, (C) up-slope and (D) down-slope are shown, respectively. The Spearman’s correlation coefficients and p-values are presented in the top of each panel. The black solid line and the gray shaded area indicate results of the linear regression and their 95% confidence levels, respectively.

Pearson’s correlation coefficients between clinical data and features of viral dynamics.

Spearman’s correlation coefficients between micro-RNA data and features of viral dynamics.

Supplementary Information

Comparison of three model fits to viral load in saliva samples for individual participants:

Three different individual-level model fits to saliva RT-qPCR results using the target-cell-limited model and the immune effector model described in Eqs.(1-2) and Eqs.(3-6), respectively, are shown. The black closed dots and solid curves indicate the measured viral load and the expected viral dynamics using the target-cell-limited model, respectively (Expended Data Fig 1). The red and blue dashed curves indicate estimated viral dynamics using the immune effector model for all the 144 individuals described in Fig 1 and for the 54 individuals in the cohort from the University of Illinois, respectively. Namely, the blue dashed curves are the model fits presented in 1.

Daily symptom data for whole cohorts for each group

Summary of parameter estimation by the model described in Eqs.(1-2)

Summary of parameter estimation by the model described in Eqs.(3-6)

Features of reconstructed individual viral dynamics for each group

Salivary micro-RNAs obtained from 60 specimens of 30 participants