Abstract
Accurate estimation of the effects of mutations on SARS-CoV-2 viral fitness can inform public-health responses such as vaccine development and prediction of the impact of a new variant; it can also illuminate biological mechanisms, including those underlying the emergence of variants of concern1. Recently, Lan et al. reported a high-quality model of SARS-CoV-2 secondary structure and its underlying dimethyl sulfate (DMS) reactivity data2. I investigated whether secondary structure can explain some of the variability in how often different nucleotide substitutions are observed across millions of patient sequences in the SARS-CoV-2 phylogenetic tree3. Nucleotide basepairing was compared to the estimated “mutational fitness” of substitutions, a measurement of the difference between a substitution’s observed and expected frequency that is correlated with other estimates of viral fitness4. This comparison revealed that secondary structure is often predictive of substitution frequency, with significantly lower substitution frequencies at basepaired positions. Focusing on the mutational fitness of C→T, the most common type of substitution, I describe C→T substitutions at basepaired positions that characterize major SARS-CoV-2 variants; such mutations may have a greater impact on fitness than is appreciated when considering substitution frequency alone.
Introduction
While investigating the significance of the substitution C29095T, detected in a familial cluster of SARS-CoV-2 infections5, I hypothesized that this synonymous substitution reflected the high frequency of C→T substitution during the pandemic6. Specifically, frequent C29095T substitution had previously complicated attempts to infer recombinant genomes7. Preliminary investigation of C29095T revealed that it was the fourth most frequent C→T substitution; C29095T occurs almost seven times as often as a typical C→T substitution4. While there was no clear reason for the selection of this synonymous substitution, C29095 was found to be unpaired in a secondary structure model8. I hypothesized that deamination may be more frequent for unpaired cytosine residues. This was supported by a previous analysis with a resolution of ∼300 nucleotides9. To determine whether secondary structure was in fact correlated with mutation frequency at single-nucleotide resolution, the data set reported by Lan et al.2 was compared to the mutational fitness estimates reported by Bloom and Neher4. Note that “mutational fitness” is not a measurement of viral fitness per se; rather, it is an estimate made assuming that the expected frequencies of neutral mutations are determined only by the type of substitution (with C→T being much more frequent than all other types of substitutions).
The data compared in this study consisted of estimated mutational fitness for the SARS-CoV-2 genome as reported by Bloom and Neher4 (Supplementary Data ntmut_fitness_all.csv and nt_fitness.csv) as well as population-averaged dimethyl sulfate (DMS) reactivities for SARS-CoV-2-infected Huh7 cells and the corresponding secondary structure model reported by Lan et al.2 (Supplementary Data 7 and 8). Note that the estimated mutational fitness is logarithmically related to the ratio of the observed and expected number of occurrences of a nucleotide substitution, with large and asymmetric differences in the frequencies of different types of synonymous substitutions6. Additionally, note that DMS data were obtained in experiments using the WA1 strain in Lineage A, which differs from the more common Lineage B at three positions and could have a different secondary structure. I focused on the most common types of nucleotide substitutions: those comprising approximately 5% or more of total substitutions.
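The logarithmic relationship described above can be illustrated with a minimal sketch. The use of the natural logarithm and of a pseudocount of 0.5 (to keep the estimate finite when a substitution is never observed) are assumptions made for illustration here, and the function name is hypothetical; the exact definition should be taken from Bloom and Neher4.

```python
import math


def mutational_fitness(observed, expected, pseudocount=0.5):
    """Log ratio of observed to expected counts of a substitution.

    Assumptions for this sketch: natural logarithm, and a small
    pseudocount added to both counts so the value stays finite
    when a substitution has zero observed occurrences.
    """
    return math.log((observed + pseudocount) / (expected + pseudocount))


# A substitution seen as often as expected has an estimate of zero;
# one seen less often than expected has a negative estimate.
print(mutational_fitness(10, 10))
print(mutational_fitness(0, 10))
```

Under these assumptions, a substitution observed about seven times as often as expected would have an estimated mutational fitness close to ln 7.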
Results
There was a significant increase in synonymous substitution frequencies at unpaired positions for C→T, G→T, C→A, and T→C, but not for A→G or G→A (p < 0.05; Tukey’s range test with Bonferroni correction; A→T, G→C, and C→G were also significant in an unplanned analysis of all substitution types). For all substitution types with significant differences, unpaired substitution frequencies were higher than basepaired substitution frequencies. The largest effects were observed for C→T and G→T (Figure 1). In this secondary structure model, 60% of C positions and 73% of G positions are basepaired (limited to those covered in the mutational fitness analysis). Median estimated mutational fitness for synonymous C→T and G→T substitutions is higher at unpaired positions than at basepaired positions by 1.46 and 1.36, respectively. Expressed in terms of substitution frequency rather than mutational fitness, synonymous C→T and G→T substitutions are about four times as frequent at unpaired positions as at basepaired positions. Together, these results demonstrate a meaningful impact of secondary structure on substitution frequencies.
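The conversion from a difference in mutational fitness to a ratio of substitution frequencies can be checked with a short calculation. This sketch assumes that mutational fitness is a natural-log ratio, so that a difference of d corresponds to a frequency ratio of exp(d); the function name is hypothetical.

```python
import math


def frequency_ratio(fitness_difference):
    """Frequency ratio implied by a difference in mutational fitness.

    Assumes mutational fitness is a natural-log ratio of observed to
    expected counts, so differences exponentiate to fold changes.
    """
    return math.exp(fitness_difference)


# Median differences (unpaired minus basepaired) reported in the text.
for label, difference in [("C->T", 1.46), ("G->T", 1.36)]:
    print(f"{label}: {frequency_ratio(difference):.1f}-fold")
```

Under this assumption the two differences correspond to ratios of roughly 4.3 and 3.9, consistent with the approximately fourfold difference stated above.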
Basepairing in the secondary structure model appears to be more predictive of estimated mutational fitness than average DMS reactivity, with correlation coefficients of 0.59 and 0.45, respectively, for C→T substitutions (point biserial and Spearman correlation coefficients). Correlation coefficients remain significant, but are reduced (0.18 and 0.13), when considering nonsynonymous mutations (Figure 2, left), consistent with larger and often negative effects of nonsynonymous mutations on viral fitness4. However, DMS reactivity is more strongly correlated with estimated mutational fitness than basepairing is when analysis is limited to positions with detectable DMS reactivity (excluding the sites plotted at the minimum measured value of 0.00012). No major difference in this trend was observed across the SARS-CoV-2 genome. As a first-order approximation, two constants were calculated to equalize median mutational fitness for synonymous substitutions at basepaired, unpaired, and all positions. An “adjusted mutational fitness” can then be calculated for C→T substitutions by adding +0.32 to mutational fitness at basepaired positions and −1.14 at unpaired positions (results were similar when considering fourfold degenerate positions rather than all synonymous substitutions). Scatter plots compare DMS reactivity to estimated mutational fitness at positions with nonsynonymous C→T substitutions before and after applying this coarse adjustment (Figure 2, left and right).
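One way to obtain two such constants is sketched below: each constant is the overall median minus the median of its group, so that both group medians move to the overall median. This is an illustration under that assumption, with hypothetical function names and toy values rather than data from the study; the notebook linked under Data availability is the authoritative source for the actual calculation.

```python
from statistics import median


def median_offsets(fitness, is_paired):
    """Return (paired_offset, unpaired_offset) that shift each group's
    median onto the median of all positions.

    Assumption for this sketch: offset = overall median - group median.
    """
    overall = median(fitness)
    paired = [f for f, p in zip(fitness, is_paired) if p]
    unpaired = [f for f, p in zip(fitness, is_paired) if not p]
    return overall - median(paired), overall - median(unpaired)


def adjust(fitness, is_paired, paired_offset, unpaired_offset):
    """Add the group-specific offset to each fitness value."""
    return [
        f + (paired_offset if p else unpaired_offset)
        for f, p in zip(fitness, is_paired)
    ]


# Toy values (hypothetical): three basepaired and two unpaired positions.
toy_fitness = [-1.0, -0.5, 0.0, 1.0, 1.5]
toy_paired = [True, True, True, False, False]

offsets = median_offsets(toy_fitness, toy_paired)
adjusted = adjust(toy_fitness, toy_paired, *offsets)
print(offsets, adjusted)
```

In the analysis described above, the constants are derived from synonymous substitutions and then applied to nonsynonymous substitutions. Note that the two reported constants differ by 1.46, which matches the difference in medians between unpaired and basepaired positions for synonymous C→T substitutions.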
For a preliminary estimate of whether nonsynonymous C→T substitutions at basepaired positions are prone to underestimation of mutational fitness, I tested the hypothesis that C→T substitutions with highly ranked fitness at basepaired positions would be mutations that characterize significant SARS-CoV-2 lineages (arbitrarily defined as having 5% prevalence in the one-week average of global sequences on the CoV-Spectrum website10 at any time during the pandemic). This was the case for 6 of the top 15 C→T substitutions at basepaired positions; their encoded mutations are shown in Figure 2. Top-ranked mutations characterize Omicron BA.1, one of the first recognized recombinant lineages XB, Gamma P.1, and the current fast-growing lineage JN.1.7. Half of these mutations have relatively high DMS reactivity for basepaired positions and half have very low DMS reactivity. By comparison, the synonymous substitution C29095T at an unpaired position has very high estimated mutational fitness and DMS reactivity. Despite having a higher median estimated mutational fitness (1.41 vs. 1.10), only 3 of the top 15 nonsynonymous C→T substitutions at unpaired positions define major lineages (BQ.1.1, JN.1.8.1, and BA.2.86.1).
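The ranking comparison above amounts to sorting substitutions by estimated mutational fitness and counting how many of the top entries belong to a set of lineage-defining substitutions. A minimal sketch follows; the record layout, function name, and toy values are hypothetical and are not taken from the study.

```python
def count_lineage_defining(records, lineage_defining, top_n=15):
    """Count lineage-defining substitutions among the top_n records
    ranked by estimated mutational fitness (highest first).

    Each record is a dict with hypothetical keys "substitution"
    and "fitness".
    """
    ranked = sorted(records, key=lambda r: r["fitness"], reverse=True)
    return sum(1 for r in ranked[:top_n] if r["substitution"] in lineage_defining)


# Toy records (hypothetical values, for illustration only).
toy_records = [
    {"substitution": "C100T", "fitness": 2.0},
    {"substitution": "C200T", "fitness": 1.5},
    {"substitution": "C300T", "fitness": 0.2},
]
toy_defining = {"C200T", "C300T"}

print(count_lineage_defining(toy_records, toy_defining, top_n=2))
```

Applying the same count separately to basepaired and unpaired positions gives the two tallies that are compared in the text.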
Of particular note is C22227T at a basepaired position encoding the spike A222V mutation. This was one mutation that characterized the B.1.177 lineage, and it was unclear whether it conferred any fitness advantage11. Further investigation as well as its recurrence in the major Delta sublineage AY.4.2 provided additional evidence of an increase in viral fitness and suggested molecular mechanisms12. Here, I focus on C→T substitutions for comparison to DMS reactivity data, but I also note that top-ranked G→T substitutions at basepaired positions are rich in mutations to ORF3a and also include mutations that characterize variants of concern, such as nucleocapsid D377Y in Delta. Lastly, note that, following the coarse adjustment for basepairing inferred from synonymous substitutions, nonsynonymous C→T substitutions characterizing major variants now have some of the highest estimated mutational fitnesses for C→T substitutions (Figure 2, right).
Discussion
This analysis shows that it is informative to combine apparent viral fitness, inferred from massive sequencing of SARS-CoV-2 during the pandemic, with accurate secondary structure measurements. It is important to remember that apparent “mutational fitness” results from a combination of the rates at which diversity is generated and the subsequent selection processes. Importantly, genome secondary structure can impact both. However, even the unprecedented density of sampling SARS-CoV-2 genomes has been insufficient to reliably infer fitness impacts of single mutations more directly from dynamics subsequent to occurrences in the SARS-CoV-2 phylogenetic tree4,13. Further investigation into phenomena reported here, such as the lack of apparent secondary structure dependence for A→G and G→A substitutions, could inform investigation of underlying mutation mechanisms. I suggest that secondary structure, along with other data correlating with substitution frequency, can be used to refine estimates of mutational fitness.
More sophisticated analysis can incorporate structural heterogeneity2 as well as local sequence context14. Furthermore, additional measurements of secondary structure for genomes of new variants or modeling may reveal significant changes to secondary structure since 2020. For the spike protein, the correlation between estimated mutational fitness and pseudovirus entry quantified by deep mutational scanning serves as one metric that can be used to optimize models15. However, it is critical to evaluate uncertainty in any model estimating fitness of a new variant. To this end, initial estimates can be refined by rapid in vitro characterization and continued genomic surveillance.
Acknowledgements
Although I did not directly access any genome sequencing databases for this work, I am grateful to the patients who volunteered samples, and to the clinicians, technicians, and teams behind the databases who have made sequencing data available so that this work is possible. I thank Erol Akcay, Alex Crits-Cristoph, Florence Débarre, Ryan Hisner, and James Yates for critical comments on the manuscript and discussions. ZH is supported by Fundação para a Ciência e a Tecnologia (FCT) through MOSTMICRO-ITQB (DOI 10.54499/UIDB/04612/2020; DOI 10.54499/UIDP/04612/2020) and LS4FUTURE Associated Laboratory (DOI 10.54499/LA/P/0087/2020).
Data availability
Data analyzed in this manuscript2,4 and a Python notebook to reproduce analysis are available at https://github.com/smmlab/SARS2-fitness-secondary-structure
References
- 1. SARS-CoV-2 variant biology: immune escape, transmission and fitness. Nat. Rev. Microbiol. 21:162–177
- 2. Secondary structural ensembles of the SARS-CoV-2 RNA genome in infected cells. Nat. Commun. 13
- 3. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat. Genet. 53:809–816
- 4. Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evol. 9
- 5. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. The Lancet 395:514–523
- 6. Mutation Rates and Selection on Synonymous Mutations in SARS-CoV-2. Genome Biol. Evol. 13
- 7. Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evol. 7
- 8. In vivo structural characterization of the SARS-CoV-2 RNA genome identifies host proteins vulnerable to repurposed drugs. Cell 184:1865–1883
- 9. C-to-U RNA deamination is the driving force accelerating SARS-CoV-2 evolution. Life Sci. Alliance 6
- 10. CoV-Spectrum: analysis of globally shared SARS-CoV-2 data to identify and characterize new variants. Bioinformatics 38:1735–1737
- 11. Spread of a SARS-CoV-2 variant through Europe in the summer of 2020. Nature 595:707–712
- 12. The structural role of SARS-CoV-2 genetic background in the emergence and success of spike mutations: The case of the spike A222V mutation. PLOS Pathog. 18
- 13. A phylogeny-based metric for estimating changes in transmissibility from recurrent mutations in SARS-CoV-2. bioRxiv https://doi.org/10.1101/2021.05.06.442903
- 14. Mutational signature dynamics indicate SARS-CoV-2’s evolutionary capacity is driven by host antiviral molecules. PLOS Comput. Biol. 20
- 15. A pseudovirus system enables deep mutational scanning of the full SARS-CoV-2 spike. Cell 186:1263–1278
Copyright
© 2024, Zach Hensel
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.