Secondary structure of the SARS-CoV-2 genome is predictive of nucleotide substitution frequency

  1. ITQB NOVA, Universidade NOVA de Lisboa, Lisbon, Portugal

Peer review process

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.

Read more about eLife’s peer review process.

Editors

  • Reviewing Editor
    Richard Neher
    University of Basel, Basel, Switzerland
  • Senior Editor
    John Schoggins
    The University of Texas Southwestern Medical Center, Dallas, United States of America

Reviewer #1 (Public review):

Summary:

This very short paper shows a greater likelihood of C->U substitutions at sites predicted to be unpaired in the SARS-CoV-2 RNA genome, using previously published observational data on mutation frequencies in SARS-CoV-2 (Bloom and Neher, 2023).

General comments:

A preference for unpaired bases as target for APOBEC-induced mutations has been demonstrated previously in functional studies so the finding is not entirely surprising. This of course assumes that A3A or other APOBEC is actually the cause of the majority of C->U changes observed in SARS-CoV-2 sequences.

I'm not sure why the authors did not use the published mutation frequency data to investigate other potential influences on editing frequencies, such as 5' and 3' base contexts. The analysis did not contribute any insights into the potential mechanisms underlying the greater frequency of C->U (or G->U) substitutions in the SARS-CoV-2 genome.

Comments on revisions:

The revisions have addressed my main comments in my review.

Reviewer #2 (Public review):

Hensel investigated the implications of SARS-CoV-2 RNA secondary structure in synonymous and nonsynonymous mutation frequency. The analysis integrated estimates of mutational fitness generated by Bloom and Neher (from publicly available patient sequences) and a population-averaged model of RNA base-pairing from Lan et al (from DMS mutational profiling with sequencing, DMS-MaPseq)

The results show that base-pairing limits the frequency of some synonymous substitutions (including the most common C→T), but not all: G→A and A→G substitutions seem unaffected by base-pairing.

The author then addressed nonsynonymous C→T substitutions at basepaired positions. While there is still a generally higher estimated mutational fitness at unpaired positions, they propose a coarse adjustment to disentangle base-pairing from inherent mutational fitness at a given position. This adjustment reveals that nonsynonymous substitutions at base-paired positions, which define major variants, have higher mutational fitness.

Overall, this manuscript highlights the importance of considering RNA secondary structure in viral evolution studies.

The conclusions of this work are generally well supported by the data presented. Particularly, the author acknowledges most limitations of the analyses and addresses them. Even though no new sequencing results were generated, the author used available data generated from the analysis of roughly seven million sequenced patient samples. Finally, the author discusses ways to improve the current available models.

There are a number of limitations of this work that should be highlighted, specifically in regard to the secondary structure data used in this paper. The Lan et al. dataset was generated using a multiplicity of infection (MOI) of 0.05, 24 hours post-infection (h.p.i.). At such a low MOI and late timepoint, viral replication is not synchronous and sequencing artifacts might be generated by cell debris and viral RNA degradation, therefore impacting the population-averaged results. In addition, the nonsynonymous base-paired positions in Figure 2 have relatively high population-averaged DMS reactivity, which suggests those positions are dynamic. Therefore, the proposed adjustment could result in an incorrect estimation of their inherent mutational fitness.

Additionally, like all such RNA probing experiments within cells, it remains difficult to deconvolve DMS/SHAPE low reactivity with RNA accessibility (e.g. from protein binding).

This work presents clear methods and an easy-to-access bioinformatic pipeline, which can be applied to other RNA viruses. Of note, it can be readily implemented in existing datasets. Finally, this study raises novel mechanistic questions on how mutational fitness is not correlated to secondary structure in the same way for every substitution.

Overall, this work highlights the importance of studying mutational fitness beyond an immune evasion perspective. On the other hand, it also adds to the viral intrinsic constraints to immune evasion.

Comments on revisions:

Following revision by the author, our concerns have been addressed. The additional analysis strengthens the conclusions & the revisions to the text have improved the manuscript for a general audience.

Author response:

The following is the authors’ response to the original reviews.

Reviewer 1 Public Review:

Summary

This very short paper shows a greater likelihood of C->U substitutions at sites predicted to be unpaired in the SARS-CoV-2 RNA genome, using previously published observational data on mutation frequencies in  SARS-CoV-2 (Bloom and Neher, 2023).

General comments

A preference for unpaired bases as a target for APOBEC-induced mutations has been demonstrated previously in functional studies so the finding is not entirely surprising. This of course assumes that A3A or other APOBEC is actually the cause of the majority of C->U changes observed in SARS-CoV-2 sequences.

I'm not sure why the authors did not use the published mutation frequency data to investigate other potential influences on editing frequencies, such as 5' and 3' base contexts. The analysis did not contribute any insights into the potential mechanisms underlying the greater frequency of C->U (or G->U) substitutions in the SARS-CoV-2 genome.

I have added additional discussions of mechanisms focusing on the question of whether basepairing bias is  primarily driven by secondary structure dependence of underlying mutation rates or by conservation of  secondary structure (Discussion lines 178–192) and I added a brief analysis of the 5′ and 3′ contexts of the  relationship between being basepaired in a secondary structure model and apparent mutational fitness  (Figures S1 and S2, Results lines 85–97). I found that the 5′ context of unpaired, but not paired basepairs  influences apparent mutational fitness (preference for 5′ U), and that the  is also . Additionally, there is a 3′  preference for G, indicating some CpG suppression. This contrasts to some degree with another analysis  based on counting lineage frequencies that may have lacked power to detect relatively small effects  (Simmonds  mBio  2024).

Reviewer 1 Author recommendations:

There are at least 5 publications describing the mapping/prediction of SARS-CoV-2 RNA secondary structure from 2022-2023 and their predictions are not entirely consistent. Why did the authors only refer to the Lan et al. paper?

I have added comparisons when the Lan et al secondary structure model is replaced by one of two others  derived from SHAPE data (Results lines 110–122). Unsurprisingly, similar secondary structure models give  similar results and performance is modestly higher for the models from Lan et al. This is consistent  with  their observations that DMS reactivities performed better as classifiers of SL5 and ORF1 secondary  structure (the reason I compared to this secondary structure model and reactivity data set rather than  others), but I did not go into detail on this in the revision since there are many differences in methods  beyond class of reactivity probe. For example, somewhat stronger correlation for the Vero than the Huh7  dataset in Lan  et al  could arise from combining data  from two replicates, from cell type, or from differences  in data analysis methods. It’s also a small difference and cannot be confidently distinguished from noise.

I conducted a preliminary comparison of the performance of DMS and SHAPE data for predicting mutations  where DMS data is available, but I opted against including this analysis in the manuscript for the same  reasons. Instead, I included in results and discussion comments on how, in general, reactivity data contains  information that is predictive of substitution rates that is not captured by binary secondary structure models.  I also discuss how multiple data sources can potentially be integrated to more accurately predict the impact  of a substitution on fitness (Discussion lines 195–201).

Specific substitutions are referred to as C->T and C29085T for example, but as the genome of SARS-CoV-2 is RNA, and T should be a U.

I agree and I have changed all “T” to “U” in the paper and analysis scripts. The choice of “T” was motivated  by what seemed to appear most frequently in papers on SARS-CoV-2 mutational spectra, but “U” is nearly  universal in papers on secondary structure and mutation mechanisms, so I agree it makes more sense in  this paper.

The C29085T substitution is somewhat non-canonical as it is a single base bulge in a longer duplex section of dsRNA, very unlike the favoured sites for mutation in the Nakata et al paper.

I have added a discussion of Nakata  et al ( NAR 2023) ( Introduction lines 29–32). I did not go into this depth  in the revision, but the analysis of ~2M patient sequences in Nakata  et al  also noted a high rate of UUC→UUU substitution, so the UUUC context of C29095 (shared by 3 of the 10 positions highlighted in  Nakata  et al  that had high mutation frequencies with  exogenous APOBEC3A expression) could be  interesting to investigate further.

High C29095U substitution frequency is indeed somewhat at odds with the results in that work, which found  that UC→UU substitutions to be elevated in longer single-stranded regions than the context of C29095U in  SARS-CoV-2 secondary structure models (a single unpaired base opposing three unpaired bases in an  asymmetric internal loop).

I'm not sure why DMS reactivity is considered a separate variable from pairing likelihood as one informs the other.

The intent here, which was not clear, was to show that a binary basepairing model that uses DMS  reactivities as constraints does not capture all of the information available. I have clarified this in as  described above discussing information in different reactivy datasets.

The C29095U substitution is also relavent to the consideration of DMS reactivity in addition to the resulting  secondary structure model. These are not considered as separate predictors and the reason for showing  both is mentioned in the paper: “DMS reactivity was more strongly correlated with estimated mutational  fitness than basepairing when analysis was limited to positions with detectable DMS reactivity.” I have  clarified this in the revised manuscript and also it is relevant to the discussion of a potential model  integrating all available datasets.

Reviewer 2 Public Review:

Hensel investigated the implications of SARS-CoV-2 RNA secondary structure in synonymous and nonsynonymous mutation frequency. The analysis integrated estimates of mutational fitness generated by Bloom and Neher (from publicly available patient sequences) and a population-averaged model of RNA basepairing from Lan et al (from DMS mutational profiling with sequencing, DMS-MaPseq).

The results show that base-pairing limits the frequency of some synonymous substitutions (including the most common CT), but not all: GA and AG substitutions seem unaffected by base-pairing.

The author then addressed nonsynonymous CT substitutions at base-paired positions. While there is still a generally higher estimated mutational fitness at unpaired positions, they propose a coarse adjustment to disentangle base-pairing from inherent mutational fitness at a given position. This adjustment reveals that nonsynonymous substitutions at base-paired positions, which define major variants, have higher mutational fitness.

Overall, this manuscript highlights the importance of considering RNA secondary structure in viral evolution studies.

The conclusions of this work are generally well supported by the data presented. Particularly, the author acknowledges most limitations of the analyses, and addresses them. Even though no new sequencing results were generated, the author used available data generated from the analysis of roughly seven million sequenced patient samples. Finally, the author discusses ways to improve the current available models.

There are a number of limitations of this work that should be highlighted, specifically in regard to the secondary structure data used in this paper. The Lan et al. dataset was generated using a multiplicity of infection (MOI) of 0.05, 24 hours post-infection (h.p.i.). At such a low MOI and late timepoint, viral replication is not synchronous and sequencing artifacts might be generated by cell debris and viral RNA degradation, therefore impacting the population-averaged results. In addition, the nonsynonymous base-paired positions in Figure 2 have relatively high population-averaged DMS reactivity, which suggests those positions are dynamic. Therefore, the proposed adjustment could result in an incorrect estimation of their inherent mutational fitness.

I would go further than this to say that the proposed adjustmentment  will usually  result in an incorrect  estimate. My intent is to propose an improved, but still likely incorrect, estimate by utilizing  in  vitro  data to  refine baseline mutation rates in order to obtain improved, but only coarsely adjusted, estimates of  mutational fitness. I added a note in the discussion that  in vitro  reactivities (and, consequently, secondary  structure models) may not reflect secondary structures  in vivo ( Discussion lines 204–205). I did not go  into  detail regarding the specific technical considerations mentioned here because they are outside the scope of  my expertise.

I am not sure that top-ranked non-synonymous C→U positions have particularly high DMS values after  coarse adjustment for basepairing (labeled amino acid mutations in Figure 2). Of the six common mutations  used as examples, three have minimum values in the dataset considered (which is processed  normalized/filtered data rather than raw data) and three do not have very high DMS reactivity.

However, there is clearly information in base reactivity that is not captured by a binary basepairing model,  which is indicated by residual positive correlation between DMS reactivity and mutational fitness after  adjustment. I now include a figure demonstrating this for synonymous C→U substitutions as Figure S3, and  I have tried to clarify the language throughout the manuscript to make it clear that a more accurate  adjustment is possible.

Additionally, like all such RNA probing experiments within cells, it remains difficult to deconvolve DMS/SHAPE low reactivity with RNA accessibility (e.g. from protein binding).

I agree, and in revising this manuscript it was interesting to see that Nakata  et al ( discussed above)  identified relatively large single-stranded regions with enhanced UC→UU substitution frequencies with  exogenous APOBEC3A expression, while C29095U, for example, is a single unpaired base with high DMS  reactivity and high empirical C→U substitution frequency (discussed briefly in the introduction of the revised  manuscript). Future analyses could consider heterogeneity in secondary structure as well as secondary  structures with low heterogeneity where strained conformations could have higher reactivity.

This work presents clear methods and an easy-to-access bioinformatic pipeline, which can be applied to other RNA viruses. Of note, it can be readily implemented in existing datasets. Finally, this study raises novel mechanistic questions on how mutational fitness is not correlated to secondary structure in the same way for every substitution.

Overall, this work highlights the importance of studying mutational fitness beyond an immune evasion perspective. On the other hand, it also adds to the viral intrinsic constraints to immune evasion.

Reviewer 2 Author recommendations:

Even though the experiment was not performed in this manuscript, it would be helpful for the readers if it was briefly explained how secondary structure is inferred from DMS reactivity, as this technique is not broadly used.

It is not objective to refer to the Lan et al. model of RNA structure as "high quality" given the limitations of their experimental approach (low MOI, asynchronous infection, DMS-only, no long-range interactions) and the lack of external validation of the structure of the genome they propose.

I removed “high-quality” from the abstract. Since a result of the paper is that secondary structure correlates  with synonymous substitution rates, this is an observation that can be used to retrospectively compare the  quality of secondary structure models in this respect. I updated the manuscript to include such a  comparison, and did not find a large difference between secondary structure models (Results lines  110–122). I added a discussion of how multiple data sources can potentially be integrated to more  accurately predict the impact of a substitution of viral fitness.

I have also added a brief discussion of constraints on how much we can confidently infer from these  experiments given limitations of the experimental approach. I note that DMS and SHAPE data provide  information that can be combined to make a stronger model, and that predictions can be rapidly tested  given observations by Gout (Symonds?) et al that  in  vitro  substitution rates correlate with those observed  during the pandemic (Discussion lines 195–201).

Mutational fitness from Bloom & Neher was derived throughout the pandemic, much of which came from a period with the most active surveillance (Delta / Omicron waves). Consequently, these viruses differ from the WA1 strain used by Lan et al. far more than the 3 nt differences between lineage A and B that the author refers to. The following sentence should therefore be revised to avoid misleading the reader:

"Additionally, note that DMS data was obtained in experiments using the WA1 strain in Lineage A, which differs from the more common Lineage B at 3 positions and could have different secondary structure."

Revised:

“Additionally, note that DMS data was obtained in experiments using the WA1 strain in Lineage A,  which differs from the more common Lineage B at 3 positions and could have different secondary  structure. Furthermore, mutational fitness is estimated from the phylogenetic tree of published  sequences (the public UShER tree (Turakhia et al., 2021) additionally curated to filter likely artifacts  such as branches with numerous reversions) that are typically far more divergent and subsequently  will have somewhat different secondary structures. Since the dataset used for mutational fitness  aggregates data across viral clades, my analysis will not capture secondary structure variation  between clades or indels and masked sites that were not considered in that analysis (Bloom and  Neher, 2023).”

To determine the extent to which the results depend on the single RNA structure model, it would be informative "turn the crank again" on the analysis with one of the other RNA structure datasets for SARS-CoV-2 (though most other datasets suffer from similar problems of asynchronicity of infection).

I have added comparisons when the Lan  et al  secondary  structure model is replaced by one of two others  derived from SHAPE data as described above. Also, I conducted preliminary comparisons of underlying  DMS and SHAPE reactivity data as described above, but I opted not to include these in the revised  manuscript given that methods different beyond the chemical probe used. I also discuss how multiple data  sources can potentially be integrated to more accurately predict the impact of a substitution of viral fitness.

In Figure 1 it would be helpful to add the values of the unpaired/basepaired ratios in the plot for clarity.

Furthermore, a similar analysis using the substitution frequency, which strengthens the conclusions, is mentioned in the text, however, it is not shown. It could be shown as part of Figure 1, or as a supplementary figure.

This was a good suggestion since numbers around 1 are not perceived as being very significant. I added  the ratio of median unpaired:paired rates to Figure 1, updated the corresponding manuscript text and the  figure caption, and note that the numbers are somewhat changed from the first version of my manuscript  because of updating to use the most up-to-date mutational fitness estimates.

It is not clear how the two constants were calculated to obtain the "adjusted mutational fitness". It could be shown as part of Figure 2, or as a supplementary figure.

I added dashed lines and arrows to Figure 2 showing median paired/unpaired mutational fitnesses and the  adjustment made to normalize to the overall median. I also added Figure S3 showing this for synonymous  substitutions, where it is more clear given the lower fraction of mutations with substantial fitness impacts.

Minor comments

Statements like "the current fast-growing lineage JN.1.7" never age well... please revise to state the period of time to which this refers.

Revised:

“…lineage JN.1.7, which had over 20% global prevalence in Spring 2024…”

Also, I checked the list of mutations and the examples given remain in the top 15 ranked basepaired,  non-synonymous C→U mutations (BA.2-defining C26060U is added to the list, but I did not update to  include this). It replaces C9246U, which was not mentioned in the first version of the manuscript.

Similarly, please provide context for the reader in the phrase: "This was one mutation that characterized the B.1.177 lineage" (e.g. add its early reference as "EU1" and that it predominated in Europe in autumn 2020, prior to the emergence of the Alpha variant).

Revised to add detail:

This was one of the mutations that characterized the B.1.177 lineage. This lineage, also known as  EU1, characterized a majority of sequences in Spain in summer 2020 and eventually in several  other countries in Europe prior to the emergence of the Alpha variant. However, it was unclear  whether or this lineage had higher fitness than other lineages or if A222V specifically conferred a  fitness advantage.

"massive sequencing of SARS-CoV-2" - the meaning of the word "massive" is unclear. Revise.

Revised  “…millions of patient SARS-CoV-2 sequences published during the pandemic…”

  1. Howard Hughes Medical Institute
  2. Wellcome Trust
  3. Max-Planck-Gesellschaft
  4. Knut and Alice Wallenberg Foundation