Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models

  1. Rita Sianga-Mete  Is a corresponding author
  2. Penelope Hartnady
  3. Wimbai Caroline Mandikumba
  4. Kayleigh Rutherford
  5. Christopher Brian Currin
  6. Florence Phelanyane
  7. Sabina Stefan
  8. Steven Weaver
  9. Sergei L Kosakovsky Pond
  10. Darren P Martin
  1. Division of Computational Biology, Institute of Infectious Diseases and Molecular Medicine, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, South Africa
  2. Department of Human Biology, Faculty of Health Sciences, University of Cape Town, South Africa
  3. Centre for Infectious Disease and Epidemiology Research, School of Public Health and Family Medicine, University of Cape Town, South Africa
  4. Centre for Biomedical Engineering, School of Engineering, Brown University, United States
  5. Institute for Genomics and Evolutionary Medicine, Department of Biology, Temple University, United States
  6. Wellcome Center for Infectious Diseases Research in Africa, Institute of Infectious Disease and Molecular Medicine and Department of Medicine, University of Cape Town, South Africa

eLife Assessment

This valuable study revisits the effects of substitution model selection on phylogenetics by comparing reversible and non-reversible DNA substitution models. The authors provide solid evidence that (1) it can be beneficial to include non-time-reversible models in addition to general time-reversible models when inferring phylogenetic trees out of simulated viral genome sequence data sets, and that (2) non time-reversible models may fit the real data better than the reversible substitution models commonly used in phylogenetics, a finding consistent with previous work.

https://doi.org/10.7554/eLife.87361.3.sa0

Abstract

Most phylogenetic trees are inferred using time-reversible evolutionary models that assume that the relative rates of substitution for any given pair of nucleotides are the same regardless of the direction of the substitutions. However, there is no reason to assume that the underlying biochemical mutational processes that cause substitutions are similarly symmetrical. We consider two non-reversible nucleotide substitution models: (1) a 6-rate non-reversible model (NREV6) that is applicable to analysing mutational processes in double-stranded genomes, in that complementary substitutions occur at identical rates and (2) a 12-rate non-reversible model (NREV12) that is applicable to analysing mutational processes in single-stranded (ss) genomes, in that all substitution types are free to occur at different rates. Using likelihood ratio and Akaike information criterion-based model tests, we show that, surprisingly, NREV12 provided a significantly better fit than the general time reversible (GTR) and NREV6 models to 21/31 dsRNA and 20/30 dsDNA datasets. As expected, however, NREV12 provided a significantly better fit to 24/33 ssDNA and 40/47 ssRNA datasets. We tested how non-reversibility impacts the accuracy with which phylogenetic trees are inferred. As simulated degrees of non-reversibility (DNRs) increased, the tree topology inferences using both NREV12 and GTR became more accurate, whereas inferred tree branch lengths became less accurate. We conclude that while non-reversible models should be helpful in the analysis of mutational processes in most virus species, there is no pressing need to use these models for routine phylogenetic inference.

Introduction

Modelling the nucleotide substitution processes that underlie the diversification of virus genome sequences lies at the heart of many viral evolutionary analyses. The most widely used nucleotide substitution models belong to the general time reversible (GTR) family (Tavaré, 1986) and assume that the Markov process of evolution is time-reversible (Hoff et al., 2016; Liò and Goldman, 1998; Tavaré, 1986).

The GTR model is defined by its instantaneous rate matrix Qij (Equation 1), where Qij defines the instantaneous rate of change from nucleotide i{A,C,G,T} to nucleotide j; subject to the detailed balance condition: qjiπi=qjiπj, with rates q and equilibrium frequencies π (Squartini and Arndt, 2008; Posada, 2003). The instantaneous rate matrix of the GTR model includes six rate parameters (a, b, c, d, e, and f). Because only products of substitution rates and evolutionary times can be estimated, one of the rate parameters is set to 1 (e.g. b), or the entire matrix is normalised to yield one expected substitution per unit time.

(1) Q={qij}=(aπCbπGdπTaπAcπGeπTbπAcπCfπTdπAeπCfπG)

The rate matrix in Equation 1 is symmetrical, e.g., the relative rate at which A changes to G is the same as the relative rate at which G changes to A.

Time-reversible nucleotide substitution models such as GTR form the basis of almost all nucleotide sequence-focused evolutionary analyses (including those involving eukaryotes, prokaryotes, and viruses) (Lefort et al., 2017; Posada and Crandall, 2001a; Posada and Crandall, 2001b; Minin et al., 2003).

The reliability of a phylogenetic tree constructed using a particular nucleotide sequence dataset should be maximised when the evolutionary models used to construct the tree accurately reflect the important aspects of the evolutionary process (Buckley and Cunningham, 2002; Ripplinger and Sullivan, 2008; Hoff et al., 2016). The suitability of different models for describing the evolution of DNA or RNA sequences is, therefore, expected to depend to some degree on the biological and environmental contexts of the sequences being analysed.

Mutations in viral genomes arise due to diverse biotic (such as replication enzyme infidelities, RNA/DNA editing enzymes) and abiotic (such as ionising radiation, inorganic oxidisers, and chemical mutagens) factors (Sanjuán and Domingo-Calap, 2016). Mutagenic chemical reactions or types of radiation that, for example, cause G to A or C to U mutations in DNA or RNA are not the same as those that cause A to G or U to C mutations (Cheng et al., 1992;⁠ Nguyen et al., 1992; Chelico et al., 2006; Sharma et al., 2016). It should not be expected, therefore, that the relative rates of G to A substitution will equal the relative rates of A to G substitution. Instead, in evolving double-stranded (ds) DNA and dsRNA molecules where both strands of the genome are in existence for similar amounts of time, both G to A and C to T substitutions should occur at relatively similar rates. Therefore, for nucleotide sequence datasets derived from any organisms with dsDNA or dsRNA genomes, a non-reversible nucleotide substitution model with a different relative substitution rate category for each of the six possible pairs of complementary nucleotide substitutions (e.g. NREV6 in Equation 2), with qAC=qTG,qAG=qTC,qAT=qTA,qCG=qGC,qCT=qGA,qGT=qCA, might plausibly provide a better description of mutational processes than GTR (Baele et al., 2010; Wickner, 1993).

(2) Q={qij}=(aπCbπGcπTfπAdπGeπTeπAdπCfπTcπAbπCaπG)

In the case of ssRNA viruses, ssDNA viruses, retroviruses, and dsRNA/dsDNA viruses where the two complementary genome strands do not exist for equal amounts of time (Yu et al., 2004), a model where all 12 different substitutions occur at different rates might be best. Specifically, with ssRNA viruses, ssDNA viruses, and retroviruses, only one of the genome strands (called the virion strand) is packaged into viral particles for transmission and, in many dsRNA viruses, the genome strand that is translated into proteins (called the + strand) exists for longer during the life cycle than does the complementary (or –) strand (Bruslind, 2020; Onwubiko et al., 2020). In all these viruses, some degree of strand-specific substitution bias is expected to occur (van der Walt et al., 2008; Polak and Arndt, 2008) such that NREV6 might be anticipated to provide a poorer description of mutational processes than a model such as NREV12 (Equation 3), where each of the 12 different types of substitution has a separate rate (Baele et al., 2010).

(3) Q={qij}=(aπAbπAcπAgπCdπCeπChπGiπGfπGjπTkπTlπT)

Because non-reversible models consider the directionality of evolution, they could, in some cases, be used to identify root nodes of phylogenetic trees (Yap and Speed, 2005; Boussau and Gouy, 2006). It is, however, unclear whether non-reversible models might, in certain situations at least, perform better than reversible models in the context of phylogenetic inference. Although it is possible to use non-reversible nucleotide substitution models such as NREV6 and NREV12 during maximum likelihood-based phylogenetic inference with computer programs such as IQ-TREE (Nguyen et al., 2015), these models are not routinely used for phylogenetic inference. This is in part because non-reversible models render several commonly used algorithmic techniques for efficient likelihood computation inapplicable, making inference slower. It is also in part because it remains undetermined whether, under conditions where strand-specific substitution biases are evident, non-reversible models consistently yield substantially more accurate phylogenetic trees than reversible models.

Here, we present evidence that strand-specific nucleotide substitution biases are common within virus genomic sequence datasets such that NREV12 generally provides a significantly better fit than both GTR and NREV6 for such datasets. We then use simulations to demonstrate that whereas strand-specific nucleotide substitution biases reduce the accuracy of phylogenetic inference under both GTR and NREV12, when these biases become extreme, use of NREV12 can yield significantly more accurate phylogenetic trees than GTR.

Results and discussion

Non-reversible nucleotide substitution models generally provide a better fit than reversible models to virus sequence datasets

We tested for evidence of non-reversibility of the nucleotide substitution process in 141 virus sequence datasets (33 ssDNA virus datasets, 30 dsDNA virus datasets, 31 dsRNA virus datasets, and 47 ssRNA virus datasets), all consisting of either full genome sequences (for unsegmented viruses) or complete genome component sequences (for viruses with segmented genomes). Specifically, for each dataset, we compared the goodness-of-fit of the GTR+G, NREV6+G, and NREV12+G models (where G represents gamma-distributed nucleotide substitution rates among sites; Yang, 1994).

Given that dsDNA viruses such as adenoviruses, papillomaviruses, and herpesviruses have both their DNA strands in existence for similar amounts of time before DNA-dependent-DNA polymerase enzymes copy both their + and – DNA strands during replication (Hanson, 2009), we had anticipated that the best fitting substitution model for sequence datasets of these viruses would be NREV6. Using weighted small sample corrected Akaike information criterion (AIC-c) scores to reveal trends of model support (Figure 1), it is surprising that NREV12 was overall the best-supported model (illustrated by the redder hues around the top corner of the dsDNA plot in Figure 2). Out of the 30 dsDNA datasets considered, we found that NREV6 provided the best fit to five datasets (HPV18, HPV45, HPV16, BPV, and SV40) and GTR provided the best fit to five (Alphapapillomavirus 6, JC polyomavirus, DPV, RTBV, and DBAV). NREV12 was the best fitting model for the remaining 20 datasets (Table 1). Further, likelihood ratio tests (LRTs) revealed strong overall support for NREV12, with this model providing a significantly better fit (p<0.05) than NREV6 for 25/30 of the dsDNA datasets and a significantly better fit than GTR for 24/30 of the datasets.

Ternary plots illustrating the relative fit of the NREV12, NREV6, and GTR nucleotide substitution models based on weighted small sample corrected Akaike information criterion (AIC-c) scores for 30 dsDNA, 31 dsRNA, 33 ssDNA, and 47 ssRNA virus nucleotide sequence datasets.

These plots were produced using the Akaike weights function with an overlaid density function (implemented in the qpcR package of RStudio; Ritz and Spiess, 2008) to indicate point densities. Each model is represented by a corner of the triangles, and each circle represents the relative fit of each of the three models to a single nucleotide sequence dataset. The sides of the triangle represent model support axes ranging from 0% to 100%, with the position of a circle in relation to each of the sides of the triangle indicating the probability of models best describing the nucleotide sequence dataset that is represented by that point. Red colours represent a very high density of nucleotide sequence datasets that favour a particular model, blue colours indicate a lower, but still substantial, density of datasets that favour a particular model.

Weighted Robinson-Foulds distances between inferred and true phylogenetic trees for datasets simulated with different degrees of nucleotide substitution non-reversibility and different average pairwise sequence identities (APIs) (~75%, ~80%, ~85%, ~90%, and ~95%).

‘ns’ above a pair of box and whisker plots indicates a paired t-test adjusted p-value of ≥0.05 and ‘*’ indicates a paired t-test adjusted p-value of <0.05.

Table 1
Akaike information criterion (AIC) scores and likelihood ratio test (LRT) results for double-stranded DNA virus datasets.

The lowest small sample corrected AIC (AIC-c) scores indicating the best fitting models are in bold.

Virus familyDatasetAIC score GTRAIC score NREV-6AIC score NREV-12p-Value GTR vs NREV-12p-Value NREV-6 vs NREV-12DNR
PapillomaviridaeAPPV 635099.535108.035102.2>0.050.0070.089
HPV18_225202.925174.625179.2<0.001>0.050.323
HPV45_223600.623599.023602.9>0.05>0.050.285
HPV16_229734.029664.529665.4<0.001>0.050.371
HPV3124681.424677.324672.80.0020.010.165
HPV6_131199.131150.031141.2<0.001<0.0010.451
LPV67165.767188.167145.5<0.001<0.0010.42
DPV69829.769889.269835.1>0.05<0.0010.056
XPV95455.695617.195452.2<0.001<0.0010.072
BATV134821134511133322<0.001<0.0010.402
PolyomaviridaeJC_251806.751819.651812.0>0.050.0030.089
BK_221472.621472.721471.10.030.030.244
SV4016859.816858.016858.40.037>0.050.567
BPV148614.9148573.8148585.2<0.001>0.050.064
CaulimoviridaeCMV124083.9124221.0123888.6<0.001<0.0010.351
CSSV145327.0146575145202<0.001<0.0010.158
SVBV138575138488.1138464.7<0.001<0.0010.174
DBAV46495.546514.146502.0>0.05<0.0010.0335
RTBV54987.955350.154991>0.05<0.0010.082
BDV376325.2376647.6376029.9<0.001<0.0010.140
SiphoviridaeCLV237362.3237351.8237348.6<0.001<0.010.070
TectiviridaeTTIV913864.9913915.4913773.1<0.001<0.0010.279
AdenoviridaeFAV_C3074086.73074207.53073739.1<0.001<0.0010.169
FAV_E103482.3103222.7102636.7<0.001<0.0010.357
FAV_D2326925.62325719.42324784.5<0.001<0.0010.551
FAV_A705328.5705436.5705197.8<0.001<0.0010.645
HMAV_B103796.7103937.44103753.8<0.001<0.00110.890
HMAV_D1748635.217497691748119.1<0.001<0.0010.646
HMAV_C2851144.52851357.128511330.006<0.0010.0225
HMAV_E1915044.81915065.31914998<0.001<0.0010.049

As NREV6 was not the best fitting model for most of the dsDNA virus datasets, we infer that, in most dsDNA virus species, strand-specific substitution biases are not ignorable. Further, the datasets where NREV6 was not the best fit are from species in families containing other species where NREV6 was the best fit, indicating that such strand-specific substitution biases are unlikely to be a consequence of some broadly conserved feature of viral life cycles in these families (such as ssDNA replicative intermediates). It is instead plausible that these differences may relate to:

  1. differences in replication fidelity and/or proofreading efficiency on the leading and trailing DNA strands in some dsDNA virus species (Grigoriev, 1999). These differences are common in eukaryotes (Pavlov et al., 2002; Furusawa, 2012) and prokaryotes (Fijalkowska et al., 1998) and, considering that the replication processes of dsDNA viruses analysed here mirror those of their eukaryote hosts, it is perhaps unsurprising that most of these viruses also display some evidence of strand-specific substitution biases.

  2. extra exposure to DNA damage of displaced template strands during unidirectional rolling circle replication in some papillomavirus species such as HPV16 could be a contributor to strand-specific nucleotide substitution biases (Kusumoto-Matsuo et al., 2011).

  3. extra time spent by non-coding strands in single-stranded dissociated states during RNA transcription in some papillomavirus and polyomavirus species (Fernandes and Medeiros Fernandes, 2012): during transcription processes, the dissociated non-coding strand is transiently more exposed to damage than the coding strand (Wei et al., 2010), which might also contribute to strand-specific substitution biases.

Similarly, and equally surprising, we found that NREV12 was overall the best supported model for dsRNA viruses (illustrated by the redder hues around the top corner of the dsRNA plot in Figure 1).

NREV6 fit only 2 of the 31 dsRNA datasets better than both NREV12 and GTR (Human rotavirus A set H and Fiji virus). NREV12 was found to be the best fitting model for 21/31 datasets and GTR was the best fitting of 8/31 (Table 2). In all three Birnaviridae family datasets (which contain virus species with two genome segments) and in 17/22 of Reoviridae family datasets (which contain virus species with 10–12 genome segments), the NREV12 model provided the best fit. Based on the LRTs, strong overall support for NREV12 was found, with this model providing a significantly better fit (p<0.05) in 27/31 dsRNA virus datasets relative to NREV6 and 23/31 datasets relative to GTR.

Table 2
Akaike information criterion (AIC) scores and likelihood ratio test (LRT) results for double-stranded RNA datasets.

The lowest small sample corrected AIC (AIC-c) scores indicating the best fitting models are in bold.

Virus familyDatasetAIC score GTRAIC score NREV-6AIC score NREV-12GTR vs NREV-12NREV-6 vs NREV-12DNR
BirnaviridaeAQBV31754.931853.331721.9<0.001<0.0010.219
GBV_A47176.947347.247154.8<0.001<0.0010.142
IPNV79186.279221.979182.40.0145<0.0010.162
GBV_B39313.739062.838938.7<0.001<0.0010.201
ReoviridaeBTV_A34803.534895.134801.30.03<0.0010.042
BTV_B48849.948893.48837.1<0.001<0.0010.043
BTV_C28350.928386.528350.8>0.05<0.0010.061
BTV_D24969.124947.324894.0<0.001<0.0010.191
BTV_F20622.720708.520610.2<0.001<0.0010.067
BTV_G63349.963485.063345.90.00426<0.0010.040
BTV_H20596.720685.520586.1<0.001<0.0010.118
BTV_I17592.717622.517588.80.01<0.0010.095
BRVA_C41206.741187.441137.1<0.001<0.0010.128
HRVA_A17030.517043.217035.5>0.050.0030.036
HRVA_B8275.18280.38281.7>0.05>0.050.087
HRVA_C12815.112842.612807.60.003<0.0010.132
HRVA_D28036.88041.08043.7>0.05>0.050.057
HRVA_E7045.97056.17053.3>0.050.020.102
HRVA_F7046.07056.77053.4>0.050.020.0710
HRVA_G18424.218434.018425.1>0.05<0.0010.123
HRVA_H20431.420413.8720420.5.60.002>0.050.163
PRVA_A28540.728441.928398.7<0.001<0.0010.204
PRVA_B14757.714775.514732.6<0.001<0.0010.351
HRVC_A6713.26718.26712.30.0450.0070.124
PTOV202011.3202106.5201878.5<0.001<0.0010.039
FJV_B9274.19250.09250.9<0.001>0.050.194
EndornaviridaeEDV1771992.81772689.11771950.6<0.001<0.0010.121
BPAV70386.570540.270390.7>0.050.000.047
TotiviridaeTTV617302.6617462.6617172.9<0.001<0.0010.052
GDV80435.880396.580387.7<0.0010.0020.109
HypoviridaeHPV66859.866899.866857.80.03<0.0010.057

We anticipated that NREV12 might fit many of these dsRNA datasets better than NREV6 simply because, during their infection cycles, the coding +strand of dsRNA viruses (the one from which protein translation occurs) tends to exist for longer periods within an infected cell than the non-coding –strand. Specifically, there are two main steps during double-stranded RNA virus replication (Wickner, 1993). Firstly, synthesis of the viral +strands from a dsRNA template occurs in the cytoplasm within viral particles. These +strands exist within the cell for prolonged periods in the absence of complementary –strands and are used as templates for translation of viral proteins. In the second step, the +strands remaining after translation act as templates for –strand synthesis, resulting in the formation of new dsRNA molecules. The +strands of dsRNA viruses are therefore likely more impacted by mutational processes, which in turn could explain the pervasive strand-specific substitution biases seen in this group of viruses.

For the ssRNA and ssDNA viruses where one genome strand exists during the virus life cycle for far longer periods of time than the other such that complementary substitutions would not be expected to occur at similar rates, we anticipated that NREV12 should provide a better fit than both NREV6 and GTR. Indeed, for ssRNA viruses, NREV12 was a better fit than NREV6 and GTR for 40/47 of the ssRNA datasets and 24/33 of the ssDNA virus datasets (Figure 1). Of the nine ssDNA virus datasets where NREV12 was not the best fitting model, GTR fit 7/9 better and NREV6 fit 2/9 better (Table 3). Of the seven ssRNA datasets where NREV12 was not the best fitting model, GTR fit 6/7 better and NREV6 fit 1/7 better.

Table 3
Small sample corrected Akaike information criterion (AIC-c) scores and likelihood ratio test (LRT) results for single-stranded DNA datasets.

The lowest AIC-c scores indicating the best fitting models are in bold.

Virus familyDatasetAIC score GTRAIC score NREV-6AIC score NREV-12p-Value GTR vs NREV-12p-Value NREV-6 vs NREV-12DNR
NanoviridaeBBTV M15044.315207.914984.4<0.001<0.0010.662
BBTV N10605.610686.210595.2<0.001<0.0010.533
BBTV R18484.51854418480.8>0.05<0.0010.609
BBTV S12718.912757.212707.3<0.001<0.0010.728
CCDV38622.738632.038630.5>0.050.030.050
MDV36232.83606336064<0.001>0.050.142
PYDV56138.456076.656056.4<0.001<0.0010.187
FBNS100153.6100135.6100120.5<0.001<0.0010.098
GeminiviridaeBegomo 528192.128311.928192.5>0.05<0.0010.1995
Begomo 616743.016722.616724.1<0.001>0.050.214
Begomo 98517.68540.88515.60.03<0.0010.312
Dicot 144730.744594.344583.3<0.001<0.0010.200
Dicot 239909.939919.839917.9>0.05<0.0010.100
MSV252645.3254347.5254347.5<0.001<0.0010.144
PanSV94601.294600.394593.7<0.001<0.0010.182
WDV35301.735313.235253.8<0.001<0.0010.1033
CircoviridaeBFDV17256.717262.717246.7<0.001<0.0010.224
DG_CV12754.812779.512758.3>0.05<0.0010.116
PiCV19180.519192.519191.0>0.050.040.117
CCCC84435.784377.484315.3<0.001<0.0010.132
BTC262910.4262060.1261985.4<0.001<0.0010.178
POCV224940.924953.824915.8<0.001<0.0010.162
CCV90307.990301.590285.9<0.001<0.0010.114
AnelloviridaeTTV_1825811826800825292<0.001<0.0010.513
TTSV332287.9332397.4332258.2<0.001<0.0011.560
ParvoviridaeMVM26756.326743.926686.9<0.001<0.0010.148
HPV67051.267080.167001.8<0.001<0.0010.235
CPV857318569585689.3<0.0010.0070.062
PPV163006.8163090.7162995.9<0.001<0.0010.143
CAV_P37073.337115.537065.7<0.001<0.0010.162
MicroviridaeBMV31175.331164.831147.3<0.001<0.0010.188
PleolipoviridaeAPV85700.285617.485402.8<0.001<0.0010.204
BPV204797.5204802.3204796.70.040.0070.064

Based on the LRTs, strong overall support for NREV12 was found with this model providing a significantly better fit (p<0.05) than NREV6 for 45/47 of the ssRNA virus datasets (Table 4) and 31/33 of the ssDNA virus datasets (Table 3). Similarly, based on LRTs, NREV12 provided a significantly better fit than GTR for 27/33 of the ssDNA virus datasets (Table 3) and 40/47 of the ssRNA virus datasets (Table 4).

Table 4
Small sample corrected Akaike information criterion (AIC-c) scores and likelihood ratio test (LRT) results for single-stranded RNA datasets.

The lowest AIC-c scores indicating the best fitting models are in bold.

Virus familyDatasetAIC score GTRAIC score NREV-6AIC score NREV-12p-Value GTR vs NREV-12p-Value NREV-6 vs NREV-12DNR
AstroviridaeHAV94580.794926.394548.1<0.001<0.0010.096
BAV188307.1188572.9188144.9<0.001<0.0010.108
MMV281072.2281094.5281076.9>0.05<0.0010.072
PAV150626.88150827.6150609.5<0.001<0.0010.069
CKV90902.391233.190873.0<0.001<0.0010.083
GA64998.565223.964975.9<0.001<0.0010.110
CAV_A85558.885617.485547.3<0.01<0.0010.076
BromoviridaeCMV RNA134197.534198.834147.7<0.001<0.0010.124
CMV RNA231398.231455.931388.7<0.001<0.0010.091
CMV RNA324337.224360.324343.9>0.05<0.0010.073
AMS24337.224360.324343.9>0.05<0.0010.073
PSV6770767786.567691<0.001<0.0010.048
CaliciviridaeLAV73042.873102.472984.6<0.001<0.0010.120
NoV207667.2207777.5207660<0.001<0.0010.047
VSV235936.4236051.4235913.3<0.001<0.0010.046
ClosteroviridaeCTV30062.229980.429960.1<0.001<0.0010.272
FlaviviridaeDGV_T169771.970030.569776.2>0.05<0.0010.063
JEV146920.8148101.5146885.5<0.001<0.0010.091
HepeviridaeHPVE1200439.5200863.8200179.8<0.001<0.0010.073
HPVE2155709.1155983.8155518.6<0.001<0.0010.088
PicornaviridaeENV_A552287.9553535.5551794.1<0.001<0.0010.061
HRV_A102218.7102267.0101550.7<0.001<0.0010.285
AIV101073.1101136.7101052.2<0.001<0.0010.093
AHP139635.7140119.6139506.9<0.001<0.0010.170
ECV82078.982181.082065.8<0.001<0.0010.066
CDV130551.3130896.7130478.3<0.001<0.0010.086
TCV53027.353029530230.01510.04220.033
FMDV455180.6455582.6454806.1<0.001<0.0010.117
FusariviridaeFRV52413.152470.652418.4>0.05<0.0010.076
RetroviridaeHIV1_setA344014.4344295.1343669.7<0.001<0.0010.237
HIV1_M80764.180829.580668.1<0.001<0.0010.442
HIV1_setC180575.0180702.3180494.4<0.001<0.0010.107
HIV1_setD298489.9298695.3298260.6<0.001<0.0010.133
HIV1_setE289111.3289292.1288941.9<0.001<0.0010.112
HIV1_setF214375.9214692.2214289.4<0.001<0.0010.148
EIV126149126365.4125300<0.001<0.0010.192
BIV24505.224506.924513.2>0.05>0.050.15
FIV164542.1164487.9164260.4<0.001<0.0010.114
CAV351329.9351871.5350721.9<0.001<0.0010.174
SIV110731.2110816110663.3<0.001<0.0010.144
FiloviridaeEbola_253147.353143.053149.9>0.05>0.500.264
Orthomyxo-viridaeFlu A 282872.883010.282849.7<0.001<0.0010.27
Flu B 150090.450144.150060.9<0.001<0.0010.311
CoronaviridaeSARS-COV1214715.3214968.5214644.39<0.001<0.0010.198
SARS-COV215715.4.215715.615696.7<0.001<0.0011.536
SARB573966.3573815.1572517.0<0.001<0.0010.301
MERS-COV516683.2516983.4516608.9<0.001<0.0010.169

We found that the degree of non-reversibility (DNR) estimates alone did not cleanly differentiate between datasets for which NREV12 was or was not best supported (Tables 1–4). For the 107 nucleotide sequence datasets with a model preference of NREV12, 10 had estimated DNRs that were greater than 0.5, 13 had DNRs between 0.25 and 0.5, and 84 had DNRs between 0.0225 and 0.25. For the 10 nucleotide sequence datasets with a model preference of NREV6, one had an estimated DNR greater than 0.5, four had estimated DNRs between 0.25 and 0.5, and five had estimated DNRs between 0.064 and 0.25 (Figure 1). For the 24 nucleotide sequence datasets with a model preference of GTR, none had estimated DNRs greater than 0.5, one had an estimated DNR between 0.25 and 0.5, and the remainder had estimated DNRs between 0.0335 and 0.25.

The dsDNA virus dataset with the highest DNR was Human mastadenovirus D (DNR = 0.646), the dsRNA virus dataset with the highest estimated DNR was Porcine_rotavirus_B (0.351), the ssRNA virus dataset with the highest DNR was SARS-CoV-2 (DNR = 1.536), and the ssDNA virus dataset with the highest DNR was Torque teno sus virus (DNR = 1.56).

Therefore, while NREV12 appears to be generally more appropriate than either NREV6 or GTR for describing mutational processes in ssRNA, ssDNA, dsDNA, and dsRNA viruses, this might only be particularly relevant from a practical perspective when datasets of these viruses yield DNR estimates that are greater than 0.25. For such datasets, NREV12 (and possibly NREV 6 in some instances) might be especially useful for both determining the direction of evolution across phylogenetic trees (i.e. it could potentially be used to root these trees) and for quantifying genomic strand-specific nucleotide substitution biases (Harkins et al., 2009).

Assessing the impacts of model misspecification on phylogenetic tree inference

To determine whether it is worthwhile to use NREV12 rather than GTR for phylogenetic inference when NREV12 is the best fitting nucleotide substitution model, we used simulated datasets to compare the accuracy of phylogenetic trees inferred using these models.

We found that, regardless of dataset diversity and the nucleotide substitution model used, phylogenetic inference tended to become less accurate (i.e. weighted Robinson-Foulds [wRF] scores increased) as DNR increased (Figure 2). This tendency was, however, more pronounced when using a (mis-specified) GTR model than when using a (correctly specified) NREV12 model with, for any given dataset having DNR>0, the use of NREV12 tending to yield more accurate phylogenetic trees than when GTR was used. There were, however, only statistically significant improvements (p<0.05, paired t-test) in the accuracy of phylogenetic trees inferred using NREV12 relative to those inferred using GTR in lower diversity datasets (i.e. those with average pairwise nucleotide sequence identities (APIs) of 85%, 90%, and 95%) and then only for DNR>8. It is noteworthy that the highest estimated DNR in any of the empirical datasets that we analysed was 1.56 more than fourfold lower than the point where statistically significant differences in phylogenetic inference accuracies became apparent in the simulated datasets.

Materials and methods

Virus sequence datasets and phylogenetic trees

Request a detailed protocol

We obtained viral nucleotide sequences from the National Centre for Biotechnology Information Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) and the Los Alamos National Laboratory HIV sequence database (https://www.hiv.lanl.gov/content/index). These included gene and whole-genome sequences for viruses with ssRNA, ssDNA, dsRNA, and dsDNA genomes (datasets are summarised in Supplementary file 1). An outgroup sequence from a closely related virus species was added to each dataset to help root phylogenetic trees. The sequences in each of the datasets were aligned using MUSCLE (Edgar, 2004) implemented in Aliview (Larsson, 2014)⁠, and maximum likelihood phylogenetic trees were constructed from each alignment using RAxML v8.2 (Stamatakis, 2016)⁠.

Model testing

Request a detailed protocol

We evaluated the fit of NREV12, NREV6, and GTR to the 141 individual sequence datasets using a previously published model test (Harkins et al., 2009) implemented as a HyPhy package module (Pond and Muse, 2005). This script (https://github.com/veg/hyphy-analyses/tree/master/NucleotideNonREV; Delport and Kosakovsky Pond, 2023) is also available as a module in the Datamonkey web server (Weaver et al., 2018), which takes as input a rooted maximum likelihood phylogenetic tree (minus the rooting sequence) and its corresponding nucleotide sequence alignment. The three models described above: GTR, NREV6, and NREV12 are then fitted to the data using maximum likelihood (ML). The equilibrium frequencies (EF) of the GTR model match those empirically observed in the alignment, while EFs for NREV6 and NREV12 are inferred by the model, satisfying the condition πQ=0. An additional model NREV12 + F is estimated, where the distribution of nucleotides at the root of the tree is estimated by maximum likelihood, instead of being set to the empirical frequencies. Nested models were compared by the LRT with significance evaluated using the χd2 distribution with d=difference in degrees of freedom. For all models, we also computed the small AIC-c score.

Quantification of non-reversibility

Request a detailed protocol

We further defined the DNR as the absolute difference between the relative rate differences of two nucleotide pairs; i.e., for two nucleotides, x and y, there exists a relative rate of x to y substitutions that we will refer to as m, and a relative rate of y to x substitutions that we will refer to as n. Under the NREV12 model, the DNR between x and y is defined simply as the absolute difference between m and n: (|m-n|) and will hereby be referred to as ijDNR, where i and j are two nucleotides. We use DNR as a mathematical representation of the degree to which the rates of all pairs of reverse substitutions differ from one another. For each of the 140 individual viral alignments, we calculated the average DNR of the six ijDNR estimates inferred using the NREV12 model.

Simulations for testing the impact of non-reversible evolution on phylogenetic inference

Request a detailed protocol

We tested the accuracy of phylogenetic tree inference under reversible and non-reversible models using simulated datasets with varying APIs evolved under the NREV12 model with different DNRs. The goal of these tests was not to exhaustively evaluate model misspecification issues during phylogenetic tree inference, but rather to check, in instances where viral taxa are known to be evolving in a detectably non-reversible manner (i.e. where NREV12 or NREV6 fits the data better than GTR), whether not accounting for this might decrease the accuracy of phylogenetic inference. Using IQ-TREE, a phylogenetic inference program that has the option to apply an NREV12-like model (referred to in IQ-TREE as the UNREST model), a phylogenetic tree was inferred from an alignment of real sequences (Avian Leukosis virus) (Figure 3) with an API of ~90%. The branch lengths on this tree were then scaled to create four other phylogenetic trees representing sequences with approximately 95%, 85%, 80%, and 75% API. These five trees are hereafter referred to as ‘true’ trees, and each individual tree was used as the starting point of a different set of simulations.

Phylogenetic tree inferred from an alignment of real sequences (Avian Leukosis virus) that was used to simulate datasets with degrees of non-reversibility (DNRs) varying from 0 to 20.

The alignment of Avian Leukosis virus had an average sequence identity (API) of ~90%, and the branches of this tree were scaled to produce four other trees reflecting branch tip sequences with approximate pairwise identities of ~75%, ~80%, ~85%, and~95%.

Phylogenetic trees were inferred from these 5500 simulated datasets and compared to the phylogenetic trees used to simulate the datasets (i.e. the true trees) using wRF distances to assess the impact of varying DNR on the accuracy of phylogenetic inference. We further tested whether the accuracy of phylogenetic inference could be improved for sequences that had evolved under DNR >0 by using NREV12 instead of GTR. Specifically, for every simulated dataset, a phylogenetic tree was inferred using GTR, and another using NREV12 and the wRF distances of each of these trees to the true tree was determined. For each of the analysed DNRs, a paired t-test was then used to compare the wRF scores of trees inferred using GTR and NREV12. We were particularly interested in determining whether trees inferred using a mis-specified model (i.e. GTR in this case) would be less accurate than trees inferred with a correctly specified model (i.e. NREV12).

To test whether failure to account for non-reversibility might decrease the accuracy of phylogenetic inference, we simulated the evolution of 5500 nucleotide sequence alignments evolved non-reversibly under varying DNR along the five true phylogenetic trees: 100 datasets per true tree per simulated DNR. Specifically, simulations were done using HyPhy (Pond and Muse, 2005), with relative rates ranging from a completely reversible matrix (Equation 4)

(4) Q={qij}=(0.16610.140.1660.1311.10110.1310.1880.141.1010.188)

representing DNR = 0 – through matrices with DNR = 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 (Table 5). These baselines-simulated substitution rates are reflective of those seen in empirical viral nucleotide sequence datasets.

Table 5
Relative rate change for C to A, G to A, A to T, G to C, T to G, and C to T mutations under the 11 degrees of non-reversibility alongside the maintained rates for A to C, A to G, T to A, C to G, G to T, and T to C.
Degree of non-reversibility (DNR)Relative rates of different nucleotide substitution types (from-to)
C-AA-CG-AA-GA-TT-AG-CC-GT-GG-TC-TT-C
00.1660.166110.140.140.1310.1310.1180.1181.1011.101
22.1660.166312.140.142.1310.1312.1180.1183.1011.101
44.1660.166514.140.144.1310.1314.1180.1185.1011.101
66.1660.166716.140.146.1310.1316.1180.1187.1011.101
88.1660.166918.140.148.1310.1318.1180.1189.1011.101
1010.1660.16611110.140.1410.1310.13110.1180.11811.1011.101
1212.1660.16613112.140.1412.1310.13112.1180.11813.1011.101
1414.1660.16615114.140.1414.1310.13114.1180.11815.1011.101
1616.1660.16617116.140.1416.1310.13116.1180.11817.1011.101
1818.1660.16619118.140.1418.1310.13118.1180.11819.1011.101
2020.1660.16621120.140.1420.1310.13120.1180.11821.1011.101

At each DNR, the relative rates used conformed to standard measures of non-reversibility under the Kolmogorov conditions according to which non-reversibly evolving sequence datasets should yield three irreversibility indices (IRI1, IRI2, and IRI3) that are all non-zero (Squartini and Arndt, 2008). It should be noted that all simulations under NREV12 were performed under the stationarity criterion: πeQt=π (where Q is the rate matrix, π is the nucleotide frequency distribution, and t≥0).

Quantifying the accuracy of phylogenetic inferences

Request a detailed protocol

We used the wRF (implemented in the R phangorn package; Schliep, 2011⁠) to quantify differences between the true trees used to simulate datasets and the trees inferred from these datasets using the GTR or NREV12 models. wRF considers differences between both the topology and branch lengths of actual and inferred trees (Kuhner and Yamato, 2015; Robinson and Foulds, 1981).

Conclusion

Request a detailed protocol

The non-reversible nucleotide substitution model, NREV12, provides a substantially better fit to most virus nucleotide sequence datasets than does the widely used reversible substitution model, GTR. NREV12 also provides a better fit to most virus nucleotide sequence datasets than does NREV6; a non-reversible model that would be expected to best describe the evolution of double-stranded genome sequences that display no strand-specific nucleotide substitution biases. This suggests that, contrary to our expectations, substantial strand-specific nucleotide substitution biases (i.e. estimated DNRs>0.25) are common during viral evolution irrespective of genome type. Such biases should be expected for any viruses where one genome strand either is in existence for substantially longer periods of time than the other, or is more exposed to mutagenic processes than the other during transmission, replication, or gene expression.

We had anticipated that, given evidence of sequences evolving both non-reversibly and with strand-specific substitution biases, inferring trees using a model such as NREV12 that appropriately accounts for this might: (1) minimise the impact of increasing DNR on the accuracy of phylogenetic inference (i.e. wRF scores presented in blue in Figure 2 might have been expected to not increase with increasing DNR) and (2) yield significantly more accurate phylogenetic inferences than when using GTR for all datasets where NREV12 was the most appropriate model and DNRs were greater than zero. However, increasing DNR clearly decreased the accuracy of phylogenetic inference even when using NREV12, and, for datasets where DNRs were greater than zero, using GTR did not consistently yield significantly less accurate phylogenetic inferences than those attained using NREV12. From a practical perspective, choosing a non-reversible nucleotide substitution model to construct phylogenetic trees from virus genome sequences that display strand-specific nucleotide substitution biases is not guaranteed to yield more accurate phylogenetic trees. Nevertheless, in instances where strand-specific substitution biases are higher than ~0.5 (such as are found in our SARS-CoV-2, Torque teno sus virus, and Banana bunchy top virus datasets), it may be prudent to select a model such as NREV12 (such as is implemented in programs like IQ-TREE) over GTR as the better of two suboptimal choices.

The lack of available data regarding the proportions of viral life cycles during which genomes exist in single- and double-stranded states makes it difficult to rationally predict the situations where the use of models such as GTR, NREV6, and NREV12 might be most justified: particularly in light of the poor overall performance of NREV6 and GTR relative to NREV12 with respect to describing mutational processes in viral genome sequence datasets. We therefore recommend case-by-case assessments of NREV12 vs NREV6 vs GTR model fit when deciding whether it is appropriate to consider the application of non-reversible models for phylogenetic inference and/or phylogenetic model-based analyses such as those intended to test for evidence of natural selection or the existence of molecular clocks.

Declarations

Ethics

The University of Cape Town ethics committee declared that this research did not need ethics approval due to the use of freely accessible nucleotide sequences obtained from the National Centre for Biotechnology Information Taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy) and the Los Alamos National Laboratory HIV sequence database (https://www.hiv.lanl.gov/content/index).

Data availability

We obtained viral nucleotide sequences from the National Centre for Biotechnology Information Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) and the Los Alamos National Laboratory HIV sequence database (https://www.hiv.lanl.gov/content/index). These included gene and whole-genome sequences for viruses with ssRNA, ssDNA, dsRNA, and dsDNA genomes (datasets are summarized in Supplementary file 1).

References

  1. Book
    1. Bruslind L
    (2020)
    Retrieved from General Microbiology
    Oregon State University.
    1. Fernandes JV
    2. Medeiros Fernandes TA
    (2012)
    Human Papillomavirus and Related Diseases-From Bench to Bedside-A Clinical Perspective
    Human papillomavirus: biology and pathogenesis, Human Papillomavirus and Related Diseases-From Bench to Bedside-A Clinical Perspective, IntechOpen, 10.5772/27154.
    1. Hanson L
    (2009)
    Handbook of Nucleic Acid Purification
    1–18, Isolation of viral DNA from cultures, Handbook of Nucleic Acid Purification, CRC press, 10.1201/9781420070972.pt1.
  2. Book
    1. Pond SL
    2. Muse SV
    (2005) HyPhy: hypothesis testing using phylogenies
    In: Muse SV, editors. Statistical Methods in Molecular Evolution. Oxford University Press. pp. 125–181.
    https://doi.org/10.1093/bioinformatics/bti079
    1. Posada D
    2. Crandall KA
    (2001b)
    Selecting the best-fit model of nucleotide substitution
    Systematic Biology 50:580–601.
    1. Tavaré S
    (1986)
    Some probabilistic and statistical problems in the analysis of DNA sequences
    Lectures on Mathematics in the Life Sciences 17:57–86.
    1. Wickner RB
    (1993)
    Double-stranded RNA virus replication and packaging
    The Journal of Biological Chemistry 268:3797–3800.

Article and author information

Author details

  1. Rita Sianga-Mete

    Division of Computational Biology, Institute of Infectious Diseases and Molecular Medicine, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Rondebosch, South Africa
    Contribution
    Conceptualization, Formal analysis, Visualization, Methodology, Writing – original draft, Project administration
    For correspondence
    rita@aims.ac.za
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6626-4007
  2. Penelope Hartnady

    Division of Computational Biology, Institute of Infectious Diseases and Molecular Medicine, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Rondebosch, South Africa
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  3. Wimbai Caroline Mandikumba

    Division of Computational Biology, Institute of Infectious Diseases and Molecular Medicine, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Rondebosch, South Africa
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  4. Kayleigh Rutherford

    Division of Computational Biology, Institute of Infectious Diseases and Molecular Medicine, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, Rondebosch, South Africa
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  5. Christopher Brian Currin

    Department of Human Biology, Faculty of Health Sciences, University of Cape Town, Rondebosch, South Africa
    Contribution
    Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4809-5059
  6. Florence Phelanyane

    Centre for Infectious Disease and Epidemiology Research, School of Public Health and Family Medicine, University of Cape Town, Rondebosch, South Africa
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  7. Sabina Stefan

    Centre for Biomedical Engineering, School of Engineering, Brown University, Providence, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  8. Steven Weaver

    Institute for Genomics and Evolutionary Medicine, Department of Biology, Temple University, Philadelphia, United States
    Contribution
    Methodology
    Competing interests
    No competing interests declared
  9. Sergei L Kosakovsky Pond

    Institute for Genomics and Evolutionary Medicine, Department of Biology, Temple University, Philadelphia, United States
    Contribution
    Supervision, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4817-4029
  10. Darren P Martin

    Wellcome Center for Infectious Diseases Research in Africa, Institute of Infectious Disease and Molecular Medicine and Department of Medicine, University of Cape Town, Rondebosch, South Africa
    Contribution
    Supervision, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8785-0870

Funding

South African Centre for Epidemiological Modelling and Analysis (PhD Study Bursary)

  • Rita Sianga-Mete

U.S. National Institutes of Health (R01 AI134384)

  • Sergei L Kosakovsky Pond
  • Steven Weaver

U.S. National Institutes of Health (R01 AI140970)

  • Sergei L Kosakovsky Pond
  • Steven Weaver

U.S. National Institutes of Health (GM144468)

  • Sergei L Kosakovsky Pond
  • Steven Weaver

Wellcome Trust

https://doi.org/10.35802/222574
  • Darren P Martin

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Acknowledgements

The authors wish to thank the South African National Research Foundation (NRF) for funding this research under the South African Centre for Epidemiological Modelling and Analysis (SACEMA) bursary. SLKP and SW were supported in part by the U.S. National Institutes of Health (R01 AI134384 and AI140970, GM144468). DPM is supported by the Wellcome Trust (222574/Z/21/Z).

Version history

  1. Preprint posted:
  2. Sent for peer review:
  3. Reviewed Preprint version 1:
  4. Reviewed Preprint version 2:
  5. Version of Record published:

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.87361. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Sianga-Mete et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,838
    views
  • 10
    downloads
  • 0
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Rita Sianga-Mete
  2. Penelope Hartnady
  3. Wimbai Caroline Mandikumba
  4. Kayleigh Rutherford
  5. Christopher Brian Currin
  6. Florence Phelanyane
  7. Sabina Stefan
  8. Steven Weaver
  9. Sergei L Kosakovsky Pond
  10. Darren P Martin
(2025)
Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models
eLife 12:RP87361.
https://doi.org/10.7554/eLife.87361.3

Share this article

https://doi.org/10.7554/eLife.87361