Within-host diversity improves phylogenetic and transmission reconstruction of SARS-CoV-2 outbreaks
Figures
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig1-v2.tif/full/617,/0/default.jpg)
Genomic analysis of technical duplicates before filtering.
(a) Allele frequency comparison between technical replicates for all frequencies (right) and for frequencies up to 1% (left). Colors represent the Ct value for the sample. (b) Proportion of shared minor variants between technical replicates in relation to the Ct value. (c) Total number of minor variants in relation to the Ct value. Lines link the two technical replicates of a sample. Each sequence has a different color, with sequences from the same patient having a different shade of the same color.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig1-figsupp1-v2.tif/full/617,/0/default.jpg)
Collection date distribution and time from symptom onset.
(a) Distribution of collection dates. (b) Histogram of time from symptom onset to sample collection.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig1-figsupp2-v2.tif/full/617,/0/default.jpg)
Sample mean coverage distribution.
Density distribution of mean coverage.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig1-figsupp3-v2.tif/full/617,/0/default.jpg)
Effects of Ct value on whole-genome sequencing data.
(a) Higher Ct values were linked to a higher number of within-sample variants. (b) Correlation between Ct value and isolate sequencing mean coverage. Lower coverage was associated with higher Ct values (R² = 0.13, t-statistic p-value <0.001).
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig1-figsupp4-v2.tif/full/617,/0/default.jpg)
Proportion of shared minor variants between technical replicates using different filters of allele frequency.
Individual plots of shared within-host variants between technical duplicates using increasing thresholds of allele frequency. Colors represent Ct value, while the size of the point shows the total number of within-host variants between the two samples.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig2-v2.tif/full/617,/0/default.jpg)
Probability of sharing within-host variants in sample pairs.
The probability that a variant is shared between a pair of samples was calculated as the number of low-frequency variants found in both samples divided by the total number of variants in the pair. Colors group samples by their relationship. Points represent the mean probability that a variant is shared, across all sample pairs within a group, at each allele frequency. Error bars show the 95th and 5th percentiles.
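The calculation described in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the input layout (a dict from variant key to allele frequency), the function name `sharing_probability`, the `max_freq` cutoff, and the choice of the union of both samples' variants as the denominator are all assumptions made for the example.

```python
def sharing_probability(sample_a, sample_b, max_freq=0.5):
    """Fraction of the low-frequency variants in a pair that occur in both samples.

    sample_a, sample_b: dicts mapping a variant key, e.g. (position, allele),
    to its allele frequency in that sample (hypothetical layout).
    max_freq: frequency below which a variant counts as low-frequency
    (assumed cutoff, not taken from the caption).
    """
    low_a = {v for v, f in sample_a.items() if f < max_freq}
    low_b = {v for v, f in sample_b.items() if f < max_freq}
    total = low_a | low_b          # every low-frequency variant seen in the pair
    if not total:
        return 0.0                 # no variants at all: nothing can be shared
    return len(low_a & low_b) / len(total)


# Hypothetical pair: one variant is shared out of four seen in the pair.
a = {(241, "T"): 0.03, (3037, "T"): 0.10, (14408, "T"): 0.02}
b = {(241, "T"): 0.05, (23403, "G"): 0.20}
print(sharing_probability(a, b))
```

Averaging this value over all pairs in a relationship group would give one point per group, under the same assumptions.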
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig2-figsupp1-v2.tif/full/617,/0/default.jpg)
Allele frequency comparison between sample pairs.
Pairwise allele frequency comparison between isolate pairs with different relationships. Allele frequencies were compared between isolates with no relationship, from the same hospital, from the same department, with epidemiological links, as well as between longitudinal and technical replicates.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig2-figsupp2-v2.tif/full/617,/0/default.jpg)
Probability that minor variants are shared.
Probability that low-frequency variants are shared, inferred with a logistic model with allele frequency and epidemiological relationship as independent variables and whether a variant is shared or not as the dependent variable. The y-axis is shown on a logarithmic scale. Solid lines represent the model's best fit, while shaded areas show the 95% CI.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig2-figsupp3-v2.tif/full/617,/0/default.jpg)
Dynamics of low-frequency variants in longitudinal duplicates.
Frequency of low-frequency variants through time in longitudinal duplicates. Each panel represents a single individual, with variants indicated by dots at each time point. The same variant at different time points is linked by lines. Yellow represents variants that are consistently found at each time point, while gray dots show variants that are present in the first sampling event but lost in subsequent isolates.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig3-v2.tif/full/617,/0/default.jpg)
Model of within-host diversity.
Proposed evolutionary model of within-host diversity in genomic sequences. Uppercase letters represent the major variant in the population, while lowercase letters indicate the presence of a minor variant alongside the major one. (a, c) Genome sequences where some positions show within-sample variation (top), represented by a major allele (larger letter) and a minor one (smaller letter), as well as their representation in the alignment (bottom). (b, d) Models of nucleotide evolution. Character transitions are indicated by arrows. (a) Consensus sequence, where only the major allele is represented in the alignment. (b) Model of nucleotide evolution using the consensus sequence, with four character states representing the four nucleotides. (c) Sequence with within-sample variation, represented by an uppercase letter for the major allele and a lowercase letter for the minor allele. (d) Model of nucleotide evolution with 16 character states accounting for within-sample variation.
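One way to picture the 16 character states is as every combination of a major allele with either no minor allele or one of the three other bases (4 × 4 = 16). The sketch below enumerates them and labels each with the uppercase/lowercase convention from the caption. The state ordering, the names `STATES`, `encode_site`, and `label`, and the two-letter labels are hypothetical choices for illustration, not the encoding used in the paper.

```python
BASES = "ACGT"

# One state per (major, minor) combination: the major allele alone, or the
# major allele together with one of the three other bases as the minor allele.
STATES = [
    (major, minor)
    for major in BASES
    for minor in [None] + [b for b in BASES if b != major]
]


def encode_site(major, minor=None):
    """Return the state index (0-15) for a site with the given alleles."""
    key = (major.upper(), minor.upper() if minor else None)
    return STATES.index(key)


def label(state):
    """Uppercase major allele followed by the lowercase minor allele, if any."""
    major, minor = STATES[state]
    return major + (minor.lower() if minor else "")


print(len(STATES))                    # number of character states
print(label(encode_site("C", "t")))   # major C with minor t
```

A consensus-only alignment uses just the four states without a minor allele, which is the four-state model in panel (b).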
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig4-v2.tif/full/617,/0/default.jpg)
Similarity scores for inferred trees.
Comparison of the phylogenetic trees inferred using simulated sequences from known random starting trees and different phylogenetic models. Colors differentiate the metrics used for the comparison.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig4-figsupp1-v2.tif/full/617,/0/default.jpg)
Similarity scores for inferred trees with different rates.
Comparison of the phylogenetic trees inferred using simulated sequences with different transition/transversion rates to reflect different within-host diversity levels. Colors show the different rates of within-host evolution. Light colors represent trees inferred with consensus alignments, while dark colors show trees inferred with the model accounting for within-host diversity.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig4-figsupp2-v2.tif/full/617,/0/default.jpg)
Similarity scores for inferred trees from coalescent simulations.
Comparison of the phylogenetic trees inferred using simulated sequences from known coalescent starting trees and different phylogenetic models. Colors differentiate the metrics used for the comparison.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig5-v2.tif/full/617,/0/default.jpg)
Inferred phylogenetic trees from outbreak simulations.
Kuhner and Felsenstein (KF) tree distance between phylogenetic trees from simulated outbreaks. Phylogenies were inferred using consensus sequences (blue) and alignments reflecting within-sample diversity (red), and compared to the known phylogeny of the simulated outbreak. For reference, gray color represents a set of random trees. Outbreak simulations were performed with different bottleneck sizes and mutation rates. The mutation rate is measured as the number of mutations per base per generation cycle.
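For readers unfamiliar with the KF (branch score) distance, the sketch below shows the idea under simplifying assumptions: each tree is given as a dict from a split (the set of tips on one side of a branch, chosen consistently in both trees) to that branch's length, and a split missing from one tree counts as length 0. The function name `kf_distance` and the input layout are hypothetical. Conventions also differ on whether the square root is taken; it is taken here.

```python
import math


def kf_distance(splits_a, splits_b):
    """Branch score distance between two trees on the same set of tips.

    splits_a, splits_b: dicts mapping a split (frozenset of tip names on one
    side of a branch) to the length of that branch. A split absent from one
    tree is treated as a branch of length 0 in that tree.
    """
    all_splits = set(splits_a) | set(splits_b)
    squared = sum(
        (splits_a.get(s, 0.0) - splits_b.get(s, 0.0)) ** 2 for s in all_splits
    )
    return math.sqrt(squared)


# Hypothetical four-tip example: the trees disagree only on the internal branch.
tips = {frozenset({t}): 1.0 for t in "ABCD"}
true_tree = {**tips, frozenset({"A", "B"}): 3.0}
inferred = {**tips, frozenset({"A", "C"}): 4.0}
print(kf_distance(true_tree, inferred))
```

Unlike a purely topological metric, this distance is also sensitive to branch-length error on branches that both trees share.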
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig6-v2.tif/full/617,/0/default.jpg)
Previously uninformative clusters present temporal signal when using within-sample diversity.
A set of 11 outbreak clusters (one per panel, each plotting the root-to-tip distance in number of substitutions per genome against time) in which all samples had identical consensus genome sequences (and therefore no temporal signal). Blue indicates regressions that, after using within-sample diversity, have a positive slope (temporal signal), and red shows regressions that have a negative slope (misleading or false-positive temporal signal).
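The root-to-tip regression behind each panel can be sketched as an ordinary least-squares fit of distance on sampling time, with the sign of the slope deciding the color. This is an illustrative sketch only: the function names, the input format, and the example numbers are assumptions and do not come from the study data.

```python
def root_to_tip_slope(times, distances):
    """Least-squares slope of root-to-tip distance against sampling time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    if sxx == 0:
        raise ValueError("all samples share the same sampling time")
    return sxy / sxx


def temporal_signal(times, distances):
    """Classify a cluster by the sign of its regression slope."""
    slope = root_to_tip_slope(times, distances)
    if slope > 0:
        return "positive"
    if slope < 0:
        return "negative"
    return "none"


# Hypothetical cluster: days since first sample vs. substitutions per genome.
days = [0, 1, 2, 3]
print(temporal_signal(days, [0.0, 0.0, 0.0, 0.0]))  # identical distances
print(temporal_signal(days, [0.0, 0.5, 1.0, 1.5]))  # distance grows with time
```

When every sample has the same root-to-tip distance, as with identical consensus genomes, the slope is zero and no temporal signal can be estimated.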
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig6-figsupp1-v2.tif/full/617,/0/default.jpg)
Phylogenetic trees for SARS-CoV-2.
SARS-CoV-2 phylogenetic trees inferred from consensus sequences (left) and an alignment with major and minor variant information (right).
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig6-figsupp2-v2.tif/full/617,/0/default.jpg)
Genetic distance between longitudinal samples.
The genetic distance in the phylogenetic tree inferred using within-sample diversity increased as the time between longitudinal samples increased. The black line shows the best fit of a linear model, while the blue shaded area represents the 95% CI.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig6-figsupp3-v2.tif/full/617,/0/default.jpg)
Number of low-frequency variants within outbreaks as the outbreak progresses.
Y-axis shows the number of low-frequency variants for each isolate within an outbreak, while the x-axis represents the days since that particular outbreak started. Black line shows the best fit in a linear model, while the blue shaded area represents the 95% CI.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig7-v2.tif/full/617,/0/default.jpg)
Within-sample variation improves resolution of infectious disease outbreaks.
Effect of using low-frequency variants in phylogenetic inference. (a) Maximum likelihood phylogeny using the consensus sequences (left) and the alignment leveraging within-sample variation (right). Replicates of the same sample share the same color. Sample IDs are coded as follows: SF, for staff members; P, for patients; and PC, for patient contacts. (b) Transmission network inferred using within-sample variation. Edge width is proportional to the likelihood of direct transmission using a susceptible-exposed-infectious-removed (SEIR) model. Colored edges represent the Edmonds optimum branching and thus the most likely transmission chain. (c) Heatmap of the likelihood of direct transmission between all pairs of samples using an SEIR model. The vertical axis shows the infector, while the horizontal axis shows the infectee.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig7-figsupp1-v2.tif/full/617,/0/default.jpg)
Time-calibrated phylogenetic trees for SARS-CoV-2.
SARS-CoV-2 phylogenetic trees inferred from consensus sequences (left) and an alignment with major and minor variant information (right). Branch lengths are measured in years.
![](https://iiif.elifesciences.org/lax:84384%2Felife-84384-fig7-figsupp2-v2.tif/full/617,/0/default.jpg)
Phylogenetic and transmission reconstruction for SARS-CoV-2 outbreaks.
(a–d) Phylogenies of SARS-CoV-2 outbreaks. The branch lengths are in units of substitutions per genome, and the scales are shown under the trees. Colors represent samples from the same individual. Samples with the same name are technical replicates. Left tree of each panel shows the phylogeny inferred with the consensus alignment. Right tree represents the phylogeny inferred using within-sample variation. Heatmap shows the likelihood of direct transmission for each pair of samples in a susceptible-exposed-infectious-removed (SEIR) model of transmission. Vertical axis is the infector while the horizontal axis shows the infectee.
Tables
SNP distance between pairs of samples.
Sample relationship | Estimate (95% CI) | p-Value |
---|---|---|
None | 11.04 (10.94–11.15) | Reference |
Hospital | 9.78 (9.48–10.09) | <1×10⁻⁴ |
Department | 5.15 (4.54–5.83) | <1×10⁻⁴ |
Epidemiological | 1.5 (1.22–1.78) | <1×10⁻⁴ |
Longitudinal duplicates | 0 (0–0.2) | <1×10⁻⁴ |
Technical replicate | 0 (0–0.2) | <1×10⁻⁴ |
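
As a reminder of what a pairwise SNP distance measures, the sketch below counts the positions at which two aligned consensus sequences differ. It is an illustration only: the function name `snp_distance` is hypothetical, and skipping positions where either sequence has an ambiguous base or gap is an assumption, since the table does not state how such sites were handled. The estimates in the table come from the authors' analysis, not from this function.

```python
def snp_distance(seq_a, seq_b):
    """Count positions where two aligned consensus sequences differ.

    Positions where either sequence is not an unambiguous base (for example
    N or a gap) are skipped. That handling is an assumption of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


# Hypothetical aligned fragments: one difference, plus one masked position.
print(snp_distance("ACGTN", "ACCTA"))
```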
Additional files
- Supplementary file 1: Study participants metadata.
  - https://cdn.elifesciences.org/articles/84384/elife-84384-supp1-v2.tsv
- Supplementary file 2: Sample collection and demographics.
  - https://cdn.elifesciences.org/articles/84384/elife-84384-supp2-v2.csv
- Supplementary file 3: Metrics used for phylogenetic tree comparison.
  - https://cdn.elifesciences.org/articles/84384/elife-84384-supp3-v2.csv
- Supplementary file 4: Transition/transversion rates and base frequencies of the known simulated tree.
  - https://cdn.elifesciences.org/articles/84384/elife-84384-supp4-v2.csv
- Supplementary file 5: Inferred transition/transversion rates and base frequencies when using the consensus sequence. Numbers show the average of 100 simulations.
  - https://cdn.elifesciences.org/articles/84384/elife-84384-supp5-v2.csv
- Supplementary file 6: Inferred transition/transversion rates and base frequencies when accounting for within-host diversity. Numbers show the average of 100 simulations.
  - https://cdn.elifesciences.org/articles/84384/elife-84384-supp6-v2.csv
- MDAR checklist
  - https://cdn.elifesciences.org/articles/84384/elife-84384-mdarchecklist1-v2.docx