Abstract
Background
The vast majority of phylogenetic trees are inferred from molecular sequence data (nucleotides or amino acids) using time-reversible evolutionary models which assume that, for any pair of nucleotide or amino acid characters, the relative rate of X to Y substitution is the same as the relative rate of Y to X substitution. However, this reversibility assumption is unlikely to accurately reflect the actual underlying biochemical and/or evolutionary processes that lead to the fixation of substitutions. Here, we use empirical viral genome sequence data to reveal that evolutionary non-reversibility is pervasive among most groups of viruses. Specifically, we consider two non-reversible nucleotide substitution models: (1) a 6-rate non-reversible model (NREV6) in which Watson-Crick complementary substitutions occur at identical relative rates and which might therefor be most applicable to analyzing the evolution of genomes where both complementary strands are subject to the same mutational processes (such as might be expected for double-stranded (ds) RNA or dsDNA genomes); and (2) a 12-rate non-reversible model (NREV12) in which all relative substitution types are free to occur at different rates and which might therefore be applicable to analyzing the evolution of genomes where the complementary genome strands are subject to different mutational processes (such as might be expected for viruses with single-stranded (ss) RNA or ssDNA genomes).
Results
Using likelihood ratio and Akaike Information Criterion-based model tests, we show that, surprisingly, NREV12 provided a significantly better fit to 21/31 dsRNA and 20/30 dsDNA datasets than did the general time reversible (GTR) and NREV6 models with NREV6 providing a better fit than NREV12 and GTR in only 5/30 dsDNA and 2/31 dsRNA datasets. As expected, NREV12 provided a significantly better fit to 24/33 ssDNA and 40/47 ssRNA datasets. Next, we used simulations to show that increasing degrees of strand-specific substitution bias decrease the accuracy of phylogenetic inference irrespective of whether GTR or NREV12 is used to describe mutational processes. However, in cases where strand-specific substitution biases are extreme (such as in SARS-CoV-2 and Torque teno sus virus datasets) NREV12 tends to yield more accurate phylogenetic trees than those obtained using GTR.
Conclusion
We show that NREV12 should, be seriously considered during the model selection phase of phylogenetic analyses involving viral genomic sequences.
Background
Modelling the nucleotide substitution processes that underly the diversification of virus genome sequences is at the heart of many viral evolutionary analyses. The most widely used nucleotide substitution models belong to the general time reversible (GTR) family (Tavaré, 1986) that assume that the Markov process of evolution will occur in the same way both forward and backward in time such that, when the arrow of time is inverted, the forward process cannot be distinguished from the backward process (Hoff, Orf, Riehm, Darriba, & Stamatakis, 2016), (Lio & Goldman, 1998), (Tavaré, 1986).
The essence of the GTR model is captured in the definition of its instantaneous rate matrix in Eq. (1); a matrix that models the rates at which the four different nucleotides {A,C,G,T} are exchanged, ensuring that the detailed reversibility balance condition: Qji πi =Qij πj (where Qji is the instantaneous rate of change from j to i and πi is the equilibrium probability of state i) is met (Squartini & Arndt, 2008) (Posada, 2003). The instantaneous rate matrix of the GTR model consists of six parameters (a, b, c, d, e, and f) which indicate the relative rates from base i to j in the state space {A, C, G, T} and the πi’s are the equilibrium frequencies of each base
The rate matrix in Eq. (1) is symmetrical in that, for example, the relative rate at which A changes to G is the same as the relative rate at which G changes to A.
Time reversible nucleotide substitution models such as GTR form the basis of almost all nucleotide sequence-focused evolutionary analyses (including those involving eukaryotes, prokaryotes, and viruses) (Lefort, Longueville, & Gascuel, 2017), (Posada & Crandall, 2001), (Minin, Abdo, Joyce, & Sullivan, 2003).
The reliability of a phylogenetic tree constructed using a particular nucleotide sequence dataset should be maximized when the evolutionary models used to construct the tree accurately reflect the evolutionary processes that yielded the nucleotide sequence dataset (Buckley & Cunningham, 2002), (Ripplinger & Sullivan, 2008), (Hoff, Orf, Riehm, Darriba, & Stamatakis, 2016). The suitability of different models for describing the evolution of DNA or RNA sequences is, therefore, expected to depend to some degree on the biological and environmental contexts of the sequences being analyzed.
Mutations in viral genomes arise due to diverse biotic (such as replication enzyme infidelities, RNA/DNA editing enzymes) and abiotic (such as ionizing radiation, inorganic oxidizers and chemical mutagens) factors (Sanjuán & Domingo-Calap, 2016). Mutagenic chemical reactions or types of radiation that, for example, cause G to A or C to U mutations in DNA or RNA, are not the same as those that cause A to G or U to C mutations (Cheng, Cahill, Kasai, Nishimura, & Loeb, 1992), (Nguyen, et al., 1992), (Chelico, Pham, Calabrese, & Goodman, 2006), (Sharma, Patnaik, Taggart, & Baysal, 20016). It should not be expected, therefore, that the relative rates of G to A substitution will equal as the relative rates of A to G substitution. Instead, in evolving double-stranded (ds) DNA and dsRNA molecules where both strands of the genome are in existence for similar amounts of time, both G to A and C to T substitutions should occur at relatively similar rates. Therefore, for nucleotide sequence datasets derived from any organisms with dsDNA or dsRNA genomes, a non-reversible nucleotide substitution model with a different relative substitution rate category for each of the six possible pairs of complementary nucleotide substitutions (e.g. NREV6 in Eq. (2)), might plausibly provide a better description of mutational processes than GTR (Baele, Van de Peer, & Vansteelandt, 2010), (Wickner, 1993).
In the case of single-stranded (ss) RNA viruses, ssDNA viruses, retroviruses, and dsRNA/dsDNA viruses where the two complementary genome strands do not exist for equal amounts of time (Yu, et al., 2004), a model where all twelve different substitutions are free to occur at different rates might be best. Specifically, with ssRNA viruses, ssDNA viruses, and retroviruses, only one of the genome strands (called the virion strand) is packaged into viral particles for transmission and, in many dsRNA viruses, the genome strand that is translated into proteins (called the + strand) exists for longer during the life cycle than does the complementary (or −) strand (Bruslind, 2020), (Onwubiko, et al., 2020). In all these viruses, some degree of strand-specific substitution bias is expected to occur (Van Der Walt, Martin, Varsani, Polston, & Rybicki, 2008), (Polak & Arndt, 2008) such that NREV6 might be anticipated to provide a poorer description of mutational processes than a model such as NREV12 (Eq. (3)) where each of the twelve different types of substitution has a different relative rate (Baele, Van de Peer, & Vansteelandt, 2010).
Because non-reversible models consider the directionality of evolution, they could, in some cases, be used to identify root nodes of phylogenetic trees (Yap & Speed, 2005), (Boussau & Gouy, 2006). It is, however, unclear whether non-reversible models might, in certain situations at least, perform better than reversible models in the context of phylogenetic inference. Although it is possible to use non-reversible nucleotide substitution models such as NREV6 and NREV12 during maximum likelihood-based phylogenetic inference with computer programs such as IQ-TREE (Nguyen, Schmidt, Von Haeseler, & Minh, 2015), these models are still not routinely used for phylogenetic inference. This is in part because non-reversible models render several commonly used algorithmic techniques for efficient likelihood computation inapplicable. It is also in part because it remains undetermined whether, under conditions where strand-specific substitution biases are evident, non-reversible models consistently yield substantially more accurate phylogenetic trees than reversible models.
Here we present evidence that strand-specific nucleotide substitution biases are common within virus genomic sequence datasets such that NREV12 generally provides a significantly better fit than both GTR and NREV6 for such datasets. We then use simulations to demonstrate that whereas strand-specific nucleotide substitution biases reduce the accuracy of phylogenetic inference irrespective of whether GTR or NREV12 are used, when these biases become extreme use of NREV12 can yield significantly more accurate phylogenetic trees than GTR.
Materials And Methods
Virus sequence datasets and phylogenetic trees
We obtained nucleotide sequences from the National Centre for Biotechnology Information Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) and the Los Alamos National Laboratory HIV sequence database (https://www.hiv.lanl.gov/content/index). These included gene and whole-genome sequences for viruses with ssRNA, ssDNA, dsRNA, and dsDNA genomes (datasets are summarized in supplementary Table S1). An outgroup sequence from a closely related virus species was added to each dataset to help root phylogenetic trees accurately. The sequences in each of the datasets were aligned using MUSCLE (Edgar, 2004) implemented in Aliview (Larsson, 2014), and maximum likelihood phylogenetic trees were constructed from each alignment using RAxML v8.2 (Stamatakis, 2016).
Model Testing
We evaluated the fit of NREV12, NREV6, and GTR to the 141 individual sequence datasets using a previously published model test (Harkins, et al., 2009) formulated in the HyPhy scripting language (Pond, Frost, & Muse, 2005). This script (which can be obtained from https://github.com/veg/hyphy-analyses/tree/master/NucleotideNonREV) took as input a rooted maximum likelihood phylogenetic tree (minus the rooting sequence) and its corresponding nucleotide sequence alignment. The first step of the model testing process involved the harvesting of nucleotide sequences from the sequence alignment into a vector of frequencies π = {πA, πC, πG, πT} called the frequency distribution matrix containing three free parameters with that of nucleotide T conditioned at absolute (1 − πA − πC − πG).
Once the frequencies were harvested, the first stochastic rate matrix consisting only of the relative rates picked from a gamma distribution was defined to satisfy the reversibility conditions of relative rates being equal in reverse i.e., rAG = rGA, rAC = rCA, rAT = rTA, rCG = rGC, rTG = rGT, rCT = rTC. Thereafter the instantaneous rate matrix Q was calculated by multiplying the relative substitution rates by the appropriate nucleotide frequencies which were used to form the GTR model probability transition matrix, P by P (t) = eQt. The role of the GTR model during model testing was to model mutations along the branches of tree, T. Given parameters of the GTR model describing (1) equilibrium nucleotide frequencies, and (2) the nucleotide substitution process, the likelihood, L, of the observed data, D, was calculated with the values of all the independent parameters, Ѳ, being investigated to find the combination that maximized the value of L (maximizing L (| D, T). The value of the log likelihood (lnL) under the GTR model was, at this point, stored for future comparisons with those of the NREV6 and NREV12 models.
The model was then changed to one satisfying the complementary relative reversibility conditions of the NREV6 model: i.e. rAG = rTC, rAC = rTG, rAT = rTA, rCG = rGC, rCA = rGT, rCT = rGA. The lnL of observing the data, D, given the tree, T, under the NREV6 model was calculated and stored for later comparisons. The model was then changed to the NREV12 model for which relative rates were defined such that it satisfied complete non-reversibility.
For each dataset we then used the lnL scores and numbers of free parameters in the three models for likelihood ratio tests (LRTs; (Anisimova, Bielawski, & Yang, 2001)) to determine whether (1) NREV12 fitted the data significantly better than GTR and (2) whether NREV12 fitted the data significantly better than NREV6. Specifically, for the NREV12 vs GTR comparison we calculated the LRTstatisticas2 (lnLNREV12 − lnLGTR) with the p value being calculated as 1 − chi (LRT, dfNREV12 − dfGTR). For the NREV12vsGTR comparison we calculated the LRTstatisticas2 (lnLNREV12 − lnLNREV6) with the p value being calculated as 1 − chi (LRT, dfNREV12 − dfNREV6).
Further, the lnL scores and numbers of free parameters for each model were used to calculate AIC scores for each of the models (Posada & Buckley, 2004) which enabled us to identify which of the three models fit the data best. The model with the lowest AIC score was selected as the best fitting model with the AIC scores for the different models being calculated as follows:
Quantification of non-reversibility.
We further defined the degree of non-reversibility (DNR) as the absolute difference between the relative rate differences of two nucleotide pairs; i.e. for two nucleotides, x and y, there exists a relative rate of x to y substitutions that we will refer to as m, and a relative rate y to x substitutions that we will refer to as n. Under the NREV12 model, the degree of non-reversibility (DNR) between x and y is defined simply as the absolute difference between m and n:(|m-n|) and will hereby be referred to as ijDNR were i and j are two nucleotides. We use DNR as a mathematical representation of the degree to which the rates of all pairs of reverse substitutions differ from one another. For each of 140 individual viral sequence datasets we calculated the average DNR of the six ijDNR estimates inferred using the NREV12 model.
Simulations for testing the impact of non-reversible evolution on phylogenetic inference
We tested the accuracy of phylogenetic tree inference under reversible and non-reversible models using simulated datasets with varying average pairwise nucleotide sequence identities (APIs) evolved under the NREV12 model with different degrees of non-reversibility (DNR). The goal of these tests was not to exhaustively evaluate model misspecification issues during phylogenetic tree inference, but rather to check, in instances where viral taxa are known to be evolving in a detectably non-reversible manner (i.e. where NREV12 or NREV6 fits the data better than GTR), whether not accounting for this might decrease the accuracy of phylogenetic inference. Using IQTREE, a phylogenetic inference program that has the option to apply an NREV12-like model (referred to in IQTEE as the UNREST model), a phylogenetic tree was inferred from an alignment of real sequences (Avian Leukosis virus) with an average sequence identity (API) of ~ 90%. The branch lengths on this tree were then scaled to create four other phylogenetic trees representing sequences with approximately 95%, 85%, 80% and 75% API. These five trees are hereafter referred to as “true” trees and each individual tree was used as the starting point of a different set of simulations.
To test whether failure to account for non-reversibility might decrease the accuracy of phylogenetic inference we simulated the evolution of 5,500 nucleotide sequence alignments evolved non-reversibly under varying DNR along the five true phylogenetic trees: 100 datasets per true tree per simulated degree of non-reversibility (DNR). Specifically, simulations were done using HyPhy (Pond, Frost, & Muse, 2005) with relative rates ranging from a completely reversible matrix i.e., CA = AC = 0.166, GA = AG = 1, AT = TA = 0.14, GC = CG = 0.131, TG = GT = 0.188 and CT = TC = 1.101 – representing DNR = 0 – through matrices with DNR = 2, 4, 6, 8, 10, 12, 14, 16, 18 and 20 (Table 1). These baseline simulated substitution rates are reflective of those seen in empirical viral nucleotide sequence datasets.
At each DNR, the relative rates used conformed to standard measures of non-reversibility under the Kolmogorov conditions according to which non-reversibly evolving sequence datasets should yield three irreversibility indices (IRI1, IRI,2 and IRI3) that are all non-zero (Squartini & Arndt, 2008). Accordingly, whereas a for simulated datasets where DNR was 0, the IRI1, IRI2 and IRI3 indices were all approximately zero indicating that the sequences in these datasets had (as expected), evolved in a time-reversible manner, for datasets where DNR was greater than 0, the IRI1, IRI12 and IRI3 indices were all different from zero indicating that the sequences in these datasets had indeed evolved in a time non-reversible manner. Further, it should be noted that all simulations under NREV12 were performed under the stationarity criterion:
πeQT = π (where Q is the rate matrix and π is the nucleotide frequency distribution and t ≥ 0).
Quantifying the accuracy of phylogenetic inferences
We used the weighted Robinson-Foulds metric (wRF; implemented in the R phangorn package (Schliep, 2011)) to quantify differences between the true trees used to simulate datasets and the trees inferred from these datasets using the GTR or NREV12 models. wRF considers differences between both the topology and branch lengths of actual and inferred trees (Kuhner, 2015), (Robinson & Foulds, 1981).
Results And Discussion
Non-reversible nucleotide substitution models generally provide a better fit than reversible models to virus sequence datasets
We tested for evidence of non-reversibility of the nucleotide substitution process in 141 virus sequence datasets (33 ssDNA virus datasets, 30 dsDNA virus datasets, 31 dsRNA virus datasets, and 47 ssRNA virus datasets, all consisting of either full genome sequences (for unsegmented viruses), or complete genome component sequences (for viruses with segmented genomes). Specifically, for each dataset, we compared the goodness-of-fit of the GTR + G, NREV6 + G, and NREV12 + G models (where G represents gamma-distributed rates).
Given that dsDNA viruses such as adenoviruses, papillomaviruses and herpesviruses have both their DNA strands in existence for similar amounts of time before DNA-dependant-DNA polymerase enzymes copy both their + and − DNA strands during replication (Hanson, 2009), we had anticipated that the best fitting substitution model for sequence datasets of these viruses would be NREV6. Using weighted AIC scores to reveal trends of model support (Fig. 2), it is surprising that NREV12 was overall the best-supported model (illustrated by the redder hues around the top corner of the dsDNA plot in Fig. 2. Out of the 30 dsDNA datasets considered, we found that NREV6 provided the best fit to five datasets (HPV18, HPV45, HPV16, BPV, and SV40) and GTR provided the best fit to five (Alphapapilloma virus 6, JC polyomavirus, DPV, RTBV, and DBAV). NREV12 was the best fitting model for the remaining 20 datasets (Table 1). Further, LRTs revealed strong overall support for NREV12, with this model providing a significantly better fit (p < 0.05) than NREV6 for 25/30 of the dsDNA datasets and a significantly better fit than GTR for 24/30 of the datasets.
As NREV6 was not the best fitting model for most of the dsDNA virus datasets we infer that, in most dsDNA virus species, strand-specific substitution biases are not irrelevant. Further, the datasets where NREV6 was not the best fit are from species in families containing other species where NREV6 was the best fit, indicating that such strand-specific substitution biases are unlikely to be a consequence of some broadly conserved feature of viral life cycles in these families (such as, for example, ssDNA replicative intermediates). It is instead plausible that these differences may relate to:
differences in replication fidelity and/or proofreading efficiency on the leading and trailing DNA strands in some dsDNA virus species (Grigoriev, 1999): These differences are common in eukaryotes (Youri, Newlon, & KunkelThomas, 2002) (Furusawa, 2012) and prokaryotes (Fijalkowska, Jonczyk, Tkaczyk, Bialoskorska, & Schaaper, 1998) and, considering that the replication processes of dsDNA viruses analyzed here mirror those of their eukaryote hosts, it is perhaps unsurprising that most of these viruses also display some evidence of strand-specific substitution biases.
extra exposure to DNA damage of displaced template strands during unidirectional rolling circle replication in some papillomavirus species such as HPV16 could be a contributor to strand-specific nucleotide substitution biases (Kusumoto-Matsuo, Kanda, & Kukimoto, 2011).
extra time spent by non-coding strands in single-stranded dissociated states during RNA transcription in some papillomavirus and polyomavirus species (Fernandes & de Medeiros Fernandes, 2012): during transcription processes, the dissociated non-coding strand is transiently more exposed to damage than the coding strand (Wei, et al., 2010) which might also contribute to strand-specific substitution biases.
Similarly, and equally surprising, we found that NREV12 was overall the best supported model for dsRNA viruses (illustrated by the redder hues around the top corner of the dsRNA plot in Fig. 2).
NREV6 fit only two of the 31 dsRNA datasets better than both NREV12 and GTR (Human rotavirus A set H and Fiji virus). NREV12 was found to be the best fitting model for 21/31 datasets and GTR was the best fitting of 8/31 (Table 2). In all three Birnaviridae family datasets (which contains virus species with two genome segments) and in 17/22 of Reoviridae family datasets (which contain virus species with 10–12 genome segments) the NREV12 model provided the best fit. Based on the LRTs, strong overall support for NREV12 was found, with this model providing a significantly better fit (p < 0.05) in 27/31 dsRNA virus datasets relative to NREV6, and 23/31 datasets relative to GTR.
We anticipated that NREV12 might fit many of these dsRNA datasets better than NREV6 simply because, during their infection cycles, the coding + strand of dsRNA viruses (the one from which protein translation occurs) tends to exist for longer periods within an infected cell than the non-coding –strand. Specifically, there are two main steps during double-stranded RNA virus replication (Wickner, 1993). Firstly, synthesis of the viral + strands from a dsRNA template occurs in the cytoplasm within viral particles. These + strands exist within the cell for prolonged periods in the absence of complementary -strands and are used as templates for translation of viral proteins. In the second step the + strands remaining after translation act as templates for -strand synthesis resulting in the formation of new dsRNA molecules. The + strands of dsRNA viruses are therefore likely more impacted by mutational processes, which in turn could explain the pervasive strand specific substitution biases seen in this group of viruses.
For the ssRNA and ssDNA viruses where one genome strand exists during the virus life cycle for far longer periods of time than the other such that complementary substitutions would not be expected to occur at similar rates, we anticipated that NREV12 should provide a better fit than both NREV6 and GTR. Indeed, for ssRNA viruses, NREV12 was a better fit than NREV6 and GTR for 40/47 of the ssRNA datasets and 24/33 of the ssDNA virus datasets (Fig. 2). Of the nine ssDNA virus datasets where NREV12 was not the best fitting model, GTR fit 7/9 better and NREV6 fit 2/9 better (Table 3). Of the seven ssRNA datasets (Table 4) where NREV12 was not the best fitting model, GTR fit 6/7 better and NREV6 fit 1/7 better.
Based on the LRTs, strong overall support for NREV12 was found with this model providing a significantly better fit (p < 0.05) than NREV6 for 45/47 of the ssRNA virus datasets (Table 4) and 31/33 of the ssDNA virus datasets (Table 3). Similarly, based on LRTs, NREV12 provided a significantly better fit than GTR for 27/33 of the ssDNA virus datasets (Table 3) and 40/47 of the ssRNA virus datasets (Table 4).
We found that DNR estimates alone did not cleanly differentiate between datasets for which NREV12 was or was not best supported (Tables 1, 2, 3 and 4). For the 107 nucleotide sequence datasets with a model preference of NREV12, ten had estimated DNRs that were greater than 0.5, 13 had DNRs between 0.25 and 0.5 and 84 had DNRs between 0.0225 and 0.25. For the ten nucleotide sequence datasets with a model preference of NREV6, one had an estimated DNR greater than 0.5, four had estimated DNRs between 0.25 and 0.5 and five had estimated DNRs between 0.064 and 0.25 (Fig. 2). For the 24 nucleotide sequence datasets with a model preference of GTR, none had estimated DNRs greater than 0.5, one had an estimated DNR between 0.25 and 0.5 and the remainder had estimated DNRs between 0.0335 and 0.25.
The dsDNA virus dataset with the highest DNR was Human mastadenovirus D (DNR = 0.646), the dsRNA virus dataset with the highest estimated DNR was Porcine_rotavirus_B (0.351), the ssRNA virus dataset with the highest DNR was SARS-CoV-2 (DNR = 1.536) and the ssDNA virus dataset with the highest DNR was Torque teno sus virus (DNR = 1.56).
Therefore, while NREV12 appears to be generally more appropriate than either NREV6 or GTR for describing mutational processes in ssRNA, ssDNA, dsDNA, and dsRNA viruses, this might only be particularly relevant from a practical perspective when datasets of these viruses yield DNR estimates that are greater than 0.25. For such datasets NREV12 (and possibly NREV 6 in some instances) might be especially useful for both determining the direction of evolution across phylogenetic trees (i.e., it could potentially be used to root these trees) and for quantifying genomic strand-specific nucleotide substitution biases (Harkins et al., 2009).
Assessing the impacts of model misspecification on phylogenetic tree inference
To determine whether it might sometimes be worthwhile using NREV12 rather than GTR for phylogenetic inference when NREV12 is the best fitting nucleotide substitution model, we used simulated datasets to compare the accuracy of phylogenetic trees inferred using these models. Specifically, we simulated datasets with DNRs varying from 0 to 20 along known “true” phylogenetic trees (supplementary Figure S 1) with branches scaled to reflect branch tip sequences with APIs of ~ 75%, ~ 80%, ~ 85%, ~ 90% and ~ 95%. For each of five API levels, we therefore simulated 5500 datasets (comprising 100 datasets for each DNR = 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 and 20).
Phylogenetic trees were inferred from these 5500 simulated datasets and compared to the phylogenetic trees used to simulate the datasets (i.e., the true trees) using wRF distances to assess the impact of varying DNR on the accuracy of phylogenetic inference. We further tested whether the accuracy of phylogenetic inference could be improved for sequences that had evolved under DNR > 0 by using NREV12 instead of GTR. Specifically, for every simulated dataset a phylogenetic tree was inferred using GTR and another using NREV12 and the wRF distances of each of these trees to the true tree was determined. For each of the analysed DNRs a paired t-test (correlated t-test) was then used to compare the wRF scores of trees inferred using GTR and NREV12. We were particularly interested in determining whether trees inferred using a mis-specified model (i.e. GTR in this case) would be less accurate than trees inferred with a correctly specified model (i.e. NREV12).
We found that, irrespective of dataset diversity and the nucleotide substitution model used, phylogenetic inference tended to become more inaccurate (i.e. wRF scores increased) as DNR increased (Fig. 2). This tendency was, however, more pronounced when using a (miss-specified) GTR model than when using a (correctly specified) NREV12 model with, for any given dataset having DNR > 0, the use of NREV12 tending to yield more accurate phylogenetic trees than when GTR was used. There were, however, only statistically significant improvements (p < 0.05 paired t-test) in the accuracy of phylogenetic trees inferred using NREV12 relative to those inferred using GTR in lower diversity datasets (i.e. those with APIs of 85%, 90% and 95%); and then only for DNR > 8. It is noteworthy that the highest estimated DNR in any of the empirical datasets that we analysed was xxx: more than four-fold lower than the point where statistically significant differences in phylogenetic-inference accuracies became apparent in the simulated datasets.
Conclusion
The non-reversible nucleotide substitution model, NREV12, generally provides a substantially better fit to virus nucleotide sequence datasets than does the widely used reversible substitution model, GTR. NREV12 also generally provides a better fit to virus nucleotide sequence datasets than does NREV6; a non-reversible model that would be expected to best describe the evolution of double stranded genome sequences that display no strand-specific nucleotide substitution biases. This suggests that, contrary to our expectations, substantial strand-specific nucleotide substitution biases (i.e. estimated DNRs > 0.25) are common during viral evolution irrespective of genome type. Such biases should be expected for any viruses where one genome strand either is in existence for substantially longer periods of time than the other, or is more exposed to mutagenic processes than the other during transmission, replication, or gene expression.
We had anticipated that, given evidence of sequences evolving both non-reversibly and with strand-specific substitution biases, inferring trees using a model such as NREV12 that appropriately accounts for this might: (1) minimize the impact of increasing DNR on the accuracy of phylogenetic inference (i.e. wRF scores presented in blue in Fig. 2 might have been expected to not increase with increasing DNR); and (2) yield significantly more accurate phylogenetic inferences than when using GTR for all datasets where NREV12 was the most appropriate model and DNRs were greater than zero. However, increasing DNR clearly decreased the accuracy of phylogenetic inference even when using NREV12, and, for datasets where DNRs were greater than zero, using GTR did not consistently yield significantly less accurate phylogenetic inferences than those attained using NREV12. From a practical perspective, choosing a non-reversible nucleotide substitution model to construct phylogenetic trees from virus genome sequences that display strand specific nucleotide substitution biases is not guaranteed to yield more accurate phylogenetic trees. Nevertheless, in instances where strand specific substitution biases are higher than ~ 0.5 (such as are found in our SARS-CoV-2, Torque teno sus virus and Banana bunchy top virus datasets) it may be prudent to select a model such as NREV12 (such as is implemented in programs like IQTREE) over GTR as the better of two suboptimal choices.
The lack of available data regarding the proportions of viral life cycles during which genomes exist in single and double stranded states makes it difficult to rationally predict the situations where the use of models such as GTR, NREV6 and NREV12 might be most justified: particularly in light of the poor over-all performance of NREV6 and GTR relative to NREV12 with respect to describing mutational processes in viral genome sequence datasets. We therefore recommend case-by-case assessments of NREV12 vs NREV6 vs GTR model fit when deciding whether it is appropriate to consider the application of non-reversible models for phylogenetic inference and/or phylogenetic model-based analyses such as those intended to test for evidence of natural section or the existence of molecular clocks.
Abbreviations
API:: Average pairwise nucleotide sequence identity
NREV6:: 6 Rate Non-Reversible Model
NREV12:: 12 Rate Non-Reversible Model
GTR:: General time Reversible Model
dsDNA:: Double-stranded deoxyribonucleic acid
dsRNA:: Double-strandedribonucleic acid
ssDNA:: Single-stranded deoxyribonucleic acid
ssRNA:: Single-strandedribonucleic acid
DNR:: Degrees of Non-Reversibility
wRF:: Weighted Robinson-Foulds distance
LRT:: likelihood ratio test
AIC:: Akaike information criterion test
Declarations
Ethics
The University of Cape Town ethics committee declared that this research did not need ethics approval due to the use of freely accessible nucleotide sequences obtained from the National Centre for Biotechnology Information Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) and the Los Alamos National Laboratory HIV sequence database (https://www.hiv.lanl.gov/content/index).
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on request.
Authors’ contributions
RS contributed to the data collection, model testing, and interpretation of results and, together with PH, drafted the manuscript. DPM contributed to the conception and design of the study and interpretation of the findings. DPM and SLKP edited the HyPhy batch script used to conduct model tests and critically reviewed the manuscript. CBC, PH, CWM, FP, KR, SS contributed to the collection and model testing of viral genome sequences.
Acknowledgment
The authors wish to thank the South African National Research Foundation (NRF) for funding this research under the South African Centre for Epidemiological Modelling and Analysis (SACEMA) bursary. SLKP is supported by the U.S. National Institutes of Health (R01 AI134384 and AI140970) and the US National Science Foundation (RAPID 2027196 NSF/DBI, BIO). DPM is supported by the Wellcome Trust (222574/Z/21/Z).
Consent for Publication
Not applicable
Supplementary Information
References
- 1.Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolutionMolecular biology and evolution 18:1585–1592
- 2.Using non-reversible context-dependent evolutionary models to study substitution patterns in primate non-coding sequencesJournal of molecular evolution 17:34–50
- 3.Efficient likelihood computations with nonreversible models of evolutionSystematic biology 55:756–768
- 4.Bruslind, L. (n.d.). pressbooks. Retrieved from General Microbiology: https://open.oregonstate.education/generalmicrobiology/chapter/the-viruses/
- 5.The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap supportMolecular Biology and Evolution 19:394–405
- 6.APOBEC3G DNA deaminase acts processively 3′→ 5′ on single-stranded DNANature structural & molecular biology 13:392–399
- 7.8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes GT and AC substitutionsBiological Chemistry 267:166–172
- 8.MUSCLE: multiple sequence alignment with high accuracy and high throughputNucleic acids research 32:1792–1797
- 9.Human papillomavirus: biology and pathogenesisHuman Papillomavirus and Related Diseases-From Bench to Bedside-A Clinical Perspective IntechOpen
- 10.Unequal fidelity of leading and lagging strand DNA replication on the Escherichia coli chromosomePreceedings of the National Academy of Science 95:10020–10025
- 11.Implications of fidelity difference between the leading and the lagging strand of DNA for the acceleration of evolutionFrontiers in oncology 2
- 12.Strand-specific compositional asymmetries in double-stranded DNA virusesVirus research 60:1–19
- 13.2 Isolation of Viral DNA from CulturesHandbook of Nucleic Acid Purification
- 14.Experimental evidence indicating that mastreviruses probably did not co-diverge with their hostsVirology Journal 6:1–14
- 15.Does the choice of nucleotide substitution models matter topologically?BMC bioinformatics 17:1–13
- 16.Practical performance of tree comparison metricsSystematic Biology 64:205–214
- 17.Rolling circle replication of human papillomavirus type 16 DNA in epithelial cell extractsGenes Cells. Genes to Cells :23–33https://doi.org/10.1111/j.1365-2443.2010.01458.x
- 18.AliView: a fast and lightweight alignment viewer and editor for large datasetsBioinformatics 30:3276–3278
- 19.SMS: smart model selection in PhyMLMolecular biology and evolution 34:2422–2424
- 20.Models of molecular evolution and phylogenyGenome research 8:1233–1244
- 21.Performance-based selection of likelihood models for phylogeny estimationSystematic biology 52:674–683
- 22.IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogeniesMolecular biology and evolution 32:268–274
- 23.DNA damage and mutation in human cells exposed to nitric oxide in vitroNational Academy of Science 89:3030–3034
- 24.SV40 T antigen interactions with ssDNA and replication protein A: a regulatory role of T antigen monomers in lagging strand DNA replicationNational Library of medicine 48:3657–3677https://doi.org/10.1093/nar/gkaa138
- 25.Transcription induces strand-specific mutations at the 5′ end of human genesGenome Research 18:1216–1223
- 26.HyPhy: hypothesis testing using phylogeniesStatistical methods in molecular evolution :125–181
- 27.Using MODELTEST and PAUP to select a model of nucleotide substitution:6–5
- 28.Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio testsSystematic biology 53:793–808
- 29.Selecting models of nucleotide substitution: an application to human immunodeficiency virus 1 (HIV-1)Molecular biology and evolution 18:897–906
- 30.Selecting the best-fit model of nucleotide substitutionSystematic biology 50:580–601
- 31.Does choice in model selection affect maximum likelihood analysis?Systematic biology 57:76–85
- 32.qpcR: an R package for sigmoidal model selection in quantitative real-time polymerase chain reaction analysisBioinformatics 24:1549–1551
- 33.Comparison of phylogenetic treesMathematical biosciences 53:131–147
- 34.Mechanisms of viral mutationCellular and molecular life sciences 73:4433–4448
- 35.phangorn: phylogenetic analysis in RBioinformatics 27:592–593
- 36.The double-domain cytidine deaminase APOBEC3G is a cellular site-specific RNA editing enzymeScientific reports 6:1–12
- 37.Quantifying the stationarity and time reversibility of the nucleotide substitution processMolecular biology and evolution 25:2525–2535
- 38.The RAxML v8. 2. X Manual
- 39.Some probabilistic and statistical problems in the analysis of DNA sequencesLectures on mathematics in the life sciences 17:57–86
- 40.Experimental observations of rapid Maize streak virus evolution reveal a strand-specific nucleotide substitution biasVirology journal 5:1–11
- 41.New views on strand asymmetry in insect mitochondrial genomesPLoS One 5
- 42.Double-stranded RNA virus replication and packagingThe Journal of biological chemistry 268:3797–3800
- 43.Rooting a phylogenetic tree with nonreversible substitution modelsBMC Evolutionary Biology 5:1–8
- 44.Yeast origins establish a strand bias for replicational mutagenesisMolecular cell 10:207–213
- 45.Single-strand specificity of APOBEC3G accounts for minus-strand deamination of the HIV genomeNature structural & molecular biology 11:435–442
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
Copyright
© 2023, Sianga-Mete et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 179
- downloads
- 2
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.