Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models

  1. Rita Sianga-Mete (corresponding author)
  2. Penelope Hartnady
  3. Wimbai Caroline Mandikumba
  4. Kayleigh Rutherford
  5. Christopher Brian Currin
  6. Florence Phelanyane
  7. Sabina Stefan
  8. Steven Weaver
  9. Sergei L Kosakovsky Pond
  10. Darren P Martin
  1. Division of Computational Biology, Institute of Infectious Diseases and Molecular Medicine, Department of Integrative Biomedical Sciences, Faculty of Health Sciences, University of Cape Town, South Africa
  2. Department of Human Biology, Faculty of Health Sciences, University of Cape Town, South Africa
  3. Centre for Infectious Disease and Epidemiology Research, School of Public Health and Family Medicine, University of Cape Town, South Africa
  4. Centre for Biomedical Engineering, School of Engineering, Brown University, United States
  5. Institute for Genomics and Evolutionary Medicine, Department of Biology, Temple University, United States
  6. Wellcome Center for Infectious Diseases Research in Africa, Institute of Infectious Disease and Molecular Medicine and Department of Medicine, University of Cape Town, South Africa
3 figures, 5 tables and 2 additional files

Figures

Ternary plots illustrating the relative fit of the NREV12, NREV6, and GTR nucleotide substitution models based on weighted small sample corrected Akaike information criterion (AIC-c) scores for 30 dsDNA, 31 dsRNA, 33 ssDNA, and 47 ssRNA virus nucleotide sequence datasets.

These plots were produced using the Akaike weights function with an overlaid density function (implemented in the qpcR package for R; Ritz and Spiess, 2008) to indicate point densities. Each model is represented by a corner of the triangle, and each circle represents the relative fit of the three models to a single nucleotide sequence dataset. The sides of the triangle are model support axes ranging from 0% to 100%, and the position of a circle relative to each side indicates the probability that each model best describes the nucleotide sequence dataset represented by that point. Red colours represent a very high density of nucleotide sequence datasets that favour a particular model, whereas blue colours indicate a lower, but still substantial, density of datasets that favour a particular model.
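As a rough illustration of how a circle's position in these plots arises, Akaike weights can be computed from a set of AIC (or AIC-c) scores by rescaling each score relative to the best one. The sketch below follows that standard definition in Python; it is not the qpcR implementation, the function name `akaike_weights` is hypothetical, and the example scores are taken from the HPV31 row of Table 1.

```python
import math

def akaike_weights(scores):
    """Return Akaike weights for a dict of {model name: AIC score}.

    Standard definition: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
    where delta_i is the score of model i minus the lowest score.
    """
    best = min(scores.values())
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}

# Scores from the HPV31 row of Table 1 (illustrative input only).
hpv31 = {"GTR": 24681.4, "NREV-6": 24677.3, "NREV-12": 24672.8}
weights = akaike_weights(hpv31)

for model, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {w:.3f}")
```

For these scores NREV-12 receives a weight of roughly 0.89, which would place the HPV31 circle close to the NREV-12 corner of the triangle.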

Weighted Robinson-Foulds distances between inferred and true phylogenetic trees for datasets simulated with different degrees of nucleotide substitution non-reversibility and different average pairwise sequence identities (APIs) (~75%, ~80%, ~85%, ~90%, and ~95%).

‘ns’ above a pair of box and whisker plots indicates a paired t-test adjusted p-value of ≥0.05 and ‘*’ indicates a paired t-test adjusted p-value of <0.05.
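To make the distance measure concrete, the sketch below computes a weighted Robinson-Foulds distance between two trees on the same taxa. It is a minimal illustration, not the software used for this figure: each tree is assumed to be given as a mapping from a bipartition (the set of taxa on one side of a branch) to that branch's length, and the helper names `canonical_split` and `weighted_rf` are hypothetical.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side

def weighted_rf(splits_a, splits_b, taxa):
    """Sum of absolute branch-length differences over all bipartitions.

    A bipartition missing from one tree contributes its full branch length.
    """
    a = {canonical_split(s, taxa): w for s, w in splits_a.items()}
    b = {canonical_split(s, taxa): w for s, w in splits_b.items()}
    return sum(abs(a.get(s, 0.0) - b.get(s, 0.0)) for s in a.keys() | b.keys())

taxa = {"A", "B", "C", "D"}
# Terminal branches are identical in both toy trees.
tips = {frozenset({t}): 0.1 for t in taxa}
# "True" tree ((A,B),(C,D)) versus "inferred" tree ((A,C),(B,D)).
true_tree = {**tips, frozenset({"A", "B"}): 0.3}
inferred_tree = {**tips, frozenset({"A", "C"}): 0.2}

distance = weighted_rf(true_tree, inferred_tree, taxa)
print(round(distance, 6))
```

In this toy case the two trees disagree only on their internal branch, so the distance is the sum of the two internal branch lengths.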

Phylogenetic tree inferred from an alignment of real sequences (Avian Leukosis virus) that was used to simulate datasets with degrees of non-reversibility (DNRs) varying from 0 to 20.

The alignment of Avian Leukosis virus had an average pairwise sequence identity (API) of ~90%, and the branches of this tree were scaled to produce four other trees reflecting branch tip sequences with average pairwise identities of ~75%, ~80%, ~85%, and ~95%.

Tables

Table 1
Akaike information criterion (AIC) scores and likelihood ratio test (LRT) results for double-stranded DNA virus datasets.

The lowest small sample corrected AIC (AIC-c) scores indicating the best fitting models are in bold.

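| Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-value GTR vs NREV-12 | p-value NREV-6 vs NREV-12 | DNR |
|---|---|---|---|---|---|---|---|
| Papillomaviridae | APPV 6 | 35099.5 | 35108.0 | 35102.2 | >0.05 | 0.007 | 0.089 |
| | HPV18_2 | 25202.9 | 25174.6 | 25179.2 | <0.001 | >0.05 | 0.323 |
| | HPV45_2 | 23600.6 | 23599.0 | 23602.9 | >0.05 | >0.05 | 0.285 |
| | HPV16_2 | 29734.0 | 29664.5 | 29665.4 | <0.001 | >0.05 | 0.371 |
| | HPV31 | 24681.4 | 24677.3 | 24672.8 | 0.002 | 0.01 | 0.165 |
| | HPV6_1 | 31199.1 | 31150.0 | 31141.2 | <0.001 | <0.001 | 0.451 |
| | LPV | 67165.7 | 67188.1 | 67145.5 | <0.001 | <0.001 | 0.42 |
| | DPV | 69829.7 | 69889.2 | 69835.1 | >0.05 | <0.001 | 0.056 |
| | XPV | 95455.6 | 95617.1 | 95452.2 | <0.001 | <0.001 | 0.072 |
| | BATV | 134821 | 134511 | 133322 | <0.001 | <0.001 | 0.402 |
| Polyomaviridae | JC_2 | 51806.7 | 51819.6 | 51812.0 | >0.05 | 0.003 | 0.089 |
| | BK_2 | 21472.6 | 21472.7 | 21471.1 | 0.03 | 0.03 | 0.244 |
| | SV40 | 16859.8 | 16858.0 | 16858.4 | 0.037 | >0.05 | 0.567 |
| | BPV | 148614.9 | 148573.8 | 148585.2 | <0.001 | >0.05 | 0.064 |
| Caulimoviridae | CMV | 124083.9 | 124221.0 | 123888.6 | <0.001 | <0.001 | 0.351 |
| | CSSV | 145327.0 | 146575 | 145202 | <0.001 | <0.001 | 0.158 |
| | SVBV | 138575 | 138488.1 | 138464.7 | <0.001 | <0.001 | 0.174 |
| | DBAV | 46495.5 | 46514.1 | 46502.0 | >0.05 | <0.001 | 0.0335 |
| | RTBV | 54987.9 | 55350.1 | 54991 | >0.05 | <0.001 | 0.082 |
| | BDV | 376325.2 | 376647.6 | 376029.9 | <0.001 | <0.001 | 0.140 |
| Siphoviridae | CLV | 237362.3 | 237351.8 | 237348.6 | <0.001 | <0.01 | 0.070 |
| Tectiviridae | TTIV | 913864.9 | 913915.4 | 913773.1 | <0.001 | <0.001 | 0.279 |
| Adenoviridae | FAV_C | 3074086.7 | 3074207.5 | 3073739.1 | <0.001 | <0.001 | 0.169 |
| | FAV_E | 103482.3 | 103222.7 | 102636.7 | <0.001 | <0.001 | 0.357 |
| | FAV_D | 2326925.6 | 2325719.4 | 2324784.5 | <0.001 | <0.001 | 0.551 |
| | FAV_A | 705328.5 | 705436.5 | 705197.8 | <0.001 | <0.001 | 0.645 |
| | HMAV_B | 103796.7 | 103937.44 | 103753.8 | <0.001 | <0.001 | 10.890 |
| | HMAV_D | 1748635.2 | 1749769 | 1748119.1 | <0.001 | <0.001 | 0.646 |
| | HMAV_C | 2851144.5 | 2851357.1 | 2851133 | 0.006 | <0.001 | 0.0225 |
| | HMAV_E | 1915044.8 | 1915065.3 | 1914998 | <0.001 | <0.001 | 0.049 |
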
Table 2
Akaike information criterion (AIC) scores and likelihood ratio test (LRT) results for double-stranded RNA datasets.

The lowest small sample corrected AIC (AIC-c) scores indicating the best fitting models are in bold.

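| Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-value GTR vs NREV-12 | p-value NREV-6 vs NREV-12 | DNR |
|---|---|---|---|---|---|---|---|
| Birnaviridae | AQBV | 31754.9 | 31853.3 | 31721.9 | <0.001 | <0.001 | 0.219 |
| | GBV_A | 47176.9 | 47347.2 | 47154.8 | <0.001 | <0.001 | 0.142 |
| | IPNV | 79186.2 | 79221.9 | 79182.4 | 0.0145 | <0.001 | 0.162 |
| | GBV_B | 39313.7 | 39062.8 | 38938.7 | <0.001 | <0.001 | 0.201 |
| Reoviridae | BTV_A | 34803.5 | 34895.1 | 34801.3 | 0.03 | <0.001 | 0.042 |
| | BTV_B | 48849.9 | 48893. | 48837.1 | <0.001 | <0.001 | 0.043 |
| | BTV_C | 28350.9 | 28386.5 | 28350.8 | >0.05 | <0.001 | 0.061 |
| | BTV_D | 24969.1 | 24947.3 | 24894.0 | <0.001 | <0.001 | 0.191 |
| | BTV_F | 20622.7 | 20708.5 | 20610.2 | <0.001 | <0.001 | 0.067 |
| | BTV_G | 63349.9 | 63485.0 | 63345.9 | 0.00426 | <0.001 | 0.040 |
| | BTV_H | 20596.7 | 20685.5 | 20586.1 | <0.001 | <0.001 | 0.118 |
| | BTV_I | 17592.7 | 17622.5 | 17588.8 | 0.01 | <0.001 | 0.095 |
| | BRVA_C | 41206.7 | 41187.4 | 41137.1 | <0.001 | <0.001 | 0.128 |
| | HRVA_A | 17030.5 | 17043.2 | 17035.5 | >0.05 | 0.003 | 0.036 |
| | HRVA_B | 8275.1 | 8280.3 | 8281.7 | >0.05 | >0.05 | 0.087 |
| | HRVA_C | 12815.1 | 12842.6 | 12807.6 | 0.003 | <0.001 | 0.132 |
| | HRVA_D2 | 8036.8 | 8041.0 | 8043.7 | >0.05 | >0.05 | 0.057 |
| | HRVA_E | 7045.9 | 7056.1 | 7053.3 | >0.05 | 0.02 | 0.102 |
| | HRVA_F | 7046.0 | 7056.7 | 7053.4 | >0.05 | 0.02 | 0.0710 |
| | HRVA_G | 18424.2 | 18434.0 | 18425.1 | >0.05 | <0.001 | 0.123 |
| | HRVA_H | 20431.4 | 20413.87 | 20420.5.6 | 0.002 | >0.05 | 0.163 |
| | PRVA_A | 28540.7 | 28441.9 | 28398.7 | <0.001 | <0.001 | 0.204 |
| | PRVA_B | 14757.7 | 14775.5 | 14732.6 | <0.001 | <0.001 | 0.351 |
| | HRVC_A | 6713.2 | 6718.2 | 6712.3 | 0.045 | 0.007 | 0.124 |
| | PTOV | 202011.3 | 202106.5 | 201878.5 | <0.001 | <0.001 | 0.039 |
| | FJV_B | 9274.1 | 9250.0 | 9250.9 | <0.001 | >0.05 | 0.194 |
| Endornaviridae | EDV | 1771992.8 | 1772689.1 | 1771950.6 | <0.001 | <0.001 | 0.121 |
| | BPAV | 70386.5 | 70540.2 | 70390.7 | >0.05 | 0.00 | 0.047 |
| Totiviridae | TTV | 617302.6 | 617462.6 | 617172.9 | <0.001 | <0.001 | 0.052 |
| | GDV | 80435.8 | 80396.5 | 80387.7 | <0.001 | 0.002 | 0.109 |
| Hypoviridae | HPV | 66859.8 | 66899.8 | 66857.8 | 0.03 | <0.001 | 0.057 |
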
Table 3
Small sample corrected Akaike information criterion (AIC-c) scores and likelihood ratio test (LRT) results for single-stranded DNA datasets.

The lowest AIC-c scores indicating the best fitting models are in bold.

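| Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-value GTR vs NREV-12 | p-value NREV-6 vs NREV-12 | DNR |
|---|---|---|---|---|---|---|---|
| Nanoviridae | BBTV M | 15044.3 | 15207.9 | 14984.4 | <0.001 | <0.001 | 0.662 |
| | BBTV N | 10605.6 | 10686.2 | 10595.2 | <0.001 | <0.001 | 0.533 |
| | BBTV R | 18484.5 | 18544 | 18480.8 | >0.05 | <0.001 | 0.609 |
| | BBTV S | 12718.9 | 12757.2 | 12707.3 | <0.001 | <0.001 | 0.728 |
| | CCDV | 38622.7 | 38632.0 | 38630.5 | >0.05 | 0.03 | 0.050 |
| | MDV | 36232.8 | 36063 | 36064 | <0.001 | >0.05 | 0.142 |
| | PYDV | 56138.4 | 56076.6 | 56056.4 | <0.001 | <0.001 | 0.187 |
| | FBNS | 100153.6 | 100135.6 | 100120.5 | <0.001 | <0.001 | 0.098 |
| Geminiviridae | Begomo 5 | 28192.1 | 28311.9 | 28192.5 | >0.05 | <0.001 | 0.1995 |
| | Begomo 6 | 16743.0 | 16722.6 | 16724.1 | <0.001 | >0.05 | 0.214 |
| | Begomo 9 | 8517.6 | 8540.8 | 8515.6 | 0.03 | <0.001 | 0.312 |
| | Dicot 1 | 44730.7 | 44594.3 | 44583.3 | <0.001 | <0.001 | 0.200 |
| | Dicot 2 | 39909.9 | 39919.8 | 39917.9 | >0.05 | <0.001 | 0.100 |
| | MSV | 252645.3 | 254347.5 | 254347.5 | <0.001 | <0.001 | 0.144 |
| | PanSV | 94601.2 | 94600.3 | 94593.7 | <0.001 | <0.001 | 0.182 |
| | WDV | 35301.7 | 35313.2 | 35253.8 | <0.001 | <0.001 | 0.1033 |
| Circoviridae | BFDV | 17256.7 | 17262.7 | 17246.7 | <0.001 | <0.001 | 0.224 |
| | DG_CV | 12754.8 | 12779.5 | 12758.3 | >0.05 | <0.001 | 0.116 |
| | PiCV | 19180.5 | 19192.5 | 19191.0 | >0.05 | 0.04 | 0.117 |
| | CCCC | 84435.7 | 84377.4 | 84315.3 | <0.001 | <0.001 | 0.132 |
| | BTC | 262910.4 | 262060.1 | 261985.4 | <0.001 | <0.001 | 0.178 |
| | POCV2 | 24940.9 | 24953.8 | 24915.8 | <0.001 | <0.001 | 0.162 |
| | CCV | 90307.9 | 90301.5 | 90285.9 | <0.001 | <0.001 | 0.114 |
| Anelloviridae | TTV_1 | 825811 | 826800 | 825292 | <0.001 | <0.001 | 0.513 |
| | TTSV | 332287.9 | 332397.4 | 332258.2 | <0.001 | <0.001 | 1.560 |
| Parvoviridae | MVM | 26756.3 | 26743.9 | 26686.9 | <0.001 | <0.001 | 0.148 |
| | HPV | 67051.2 | 67080.1 | 67001.8 | <0.001 | <0.001 | 0.235 |
| | CPV | 85731 | 85695 | 85689.3 | <0.001 | 0.007 | 0.062 |
| | PPV | 163006.8 | 163090.7 | 162995.9 | <0.001 | <0.001 | 0.143 |
| | CAV_P | 37073.3 | 37115.5 | 37065.7 | <0.001 | <0.001 | 0.162 |
| Microviridae | BMV | 31175.3 | 31164.8 | 31147.3 | <0.001 | <0.001 | 0.188 |
| Pleolipoviridae | APV | 85700.2 | 85617.4 | 85402.8 | <0.001 | <0.001 | 0.204 |
| | BPV | 204797.5 | 204802.3 | 204796.7 | 0.04 | 0.007 | 0.064 |
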
Table 4
Small sample corrected Akaike information criterion (AIC-c) scores and likelihood ratio test (LRT) results for single-stranded RNA datasets.

The lowest AIC-c scores indicating the best fitting models are in bold.

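| Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-value GTR vs NREV-12 | p-value NREV-6 vs NREV-12 | DNR |
|---|---|---|---|---|---|---|---|
| Astroviridae | HAV | 94580.7 | 94926.3 | 94548.1 | <0.001 | <0.001 | 0.096 |
| | BAV | 188307.1 | 188572.9 | 188144.9 | <0.001 | <0.001 | 0.108 |
| | MMV | 281072.2 | 281094.5 | 281076.9 | >0.05 | <0.001 | 0.072 |
| | PAV | 150626.88 | 150827.6 | 150609.5 | <0.001 | <0.001 | 0.069 |
| | CKV | 90902.3 | 91233.1 | 90873.0 | <0.001 | <0.001 | 0.083 |
| | GA | 64998.5 | 65223.9 | 64975.9 | <0.001 | <0.001 | 0.110 |
| | CAV_A | 85558.8 | 85617.4 | 85547.3 | <0.01 | <0.001 | 0.076 |
| Bromoviridae | CMV RNA1 | 34197.5 | 34198.8 | 34147.7 | <0.001 | <0.001 | 0.124 |
| | CMV RNA2 | 31398.2 | 31455.9 | 31388.7 | <0.001 | <0.001 | 0.091 |
| | CMV RNA3 | 24337.2 | 24360.3 | 24343.9 | >0.05 | <0.001 | 0.073 |
| | AMS | 24337.2 | 24360.3 | 24343.9 | >0.05 | <0.001 | 0.073 |
| | PSV | 67707 | 67786.5 | 67691 | <0.001 | <0.001 | 0.048 |
| Caliciviridae | LAV | 73042.8 | 73102.4 | 72984.6 | <0.001 | <0.001 | 0.120 |
| | NoV | 207667.2 | 207777.5 | 207660 | <0.001 | <0.001 | 0.047 |
| | VSV | 235936.4 | 236051.4 | 235913.3 | <0.001 | <0.001 | 0.046 |
| Closteroviridae | CTV | 30062.2 | 29980.4 | 29960.1 | <0.001 | <0.001 | 0.272 |
| Flaviviridae | DGV_T1 | 69771.9 | 70030.5 | 69776.2 | >0.05 | <0.001 | 0.063 |
| | JEV | 146920.8 | 148101.5 | 146885.5 | <0.001 | <0.001 | 0.091 |
| Hepeviridae | HPVE1 | 200439.5 | 200863.8 | 200179.8 | <0.001 | <0.001 | 0.073 |
| | HPVE2 | 155709.1 | 155983.8 | 155518.6 | <0.001 | <0.001 | 0.088 |
| Picornaviridae | ENV_A | 552287.9 | 553535.5 | 551794.1 | <0.001 | <0.001 | 0.061 |
| | HRV_A | 102218.7 | 102267.0 | 101550.7 | <0.001 | <0.001 | 0.285 |
| | AIV | 101073.1 | 101136.7 | 101052.2 | <0.001 | <0.001 | 0.093 |
| | AHP | 139635.7 | 140119.6 | 139506.9 | <0.001 | <0.001 | 0.170 |
| | ECV | 82078.9 | 82181.0 | 82065.8 | <0.001 | <0.001 | 0.066 |
| | CDV | 130551.3 | 130896.7 | 130478.3 | <0.001 | <0.001 | 0.086 |
| | TCV | 53027.3 | 53029 | 53023 | 0.0151 | 0.0422 | 0.033 |
| | FMDV | 455180.6 | 455582.6 | 454806.1 | <0.001 | <0.001 | 0.117 |
| Fusariviridae | FRV | 52413.1 | 52470.6 | 52418.4 | >0.05 | <0.001 | 0.076 |
| Retroviridae | HIV1_setA | 344014.4 | 344295.1 | 343669.7 | <0.001 | <0.001 | 0.237 |
| | HIV1_M | 80764.1 | 80829.5 | 80668.1 | <0.001 | <0.001 | 0.442 |
| | HIV1_setC | 180575.0 | 180702.3 | 180494.4 | <0.001 | <0.001 | 0.107 |
| | HIV1_setD | 298489.9 | 298695.3 | 298260.6 | <0.001 | <0.001 | 0.133 |
| | HIV1_setE | 289111.3 | 289292.1 | 288941.9 | <0.001 | <0.001 | 0.112 |
| | HIV1_setF | 214375.9 | 214692.2 | 214289.4 | <0.001 | <0.001 | 0.148 |
| | EIV | 126149 | 126365.4 | 125300 | <0.001 | <0.001 | 0.192 |
| | BIV | 24505.2 | 24506.9 | 24513.2 | >0.05 | >0.05 | 0.15 |
| | FIV | 164542.1 | 164487.9 | 164260.4 | <0.001 | <0.001 | 0.114 |
| | CAV | 351329.9 | 351871.5 | 350721.9 | <0.001 | <0.001 | 0.174 |
| | SIV | 110731.2 | 110816 | 110663.3 | <0.001 | <0.001 | 0.144 |
| Filoviridae | Ebola_2 | 53147.3 | 53143.0 | 53149.9 | >0.05 | >0.50 | 0.264 |
| Orthomyxoviridae | Flu A 2 | 82872.8 | 83010.2 | 82849.7 | <0.001 | <0.001 | 0.27 |
| | Flu B 1 | 50090.4 | 50144.1 | 50060.9 | <0.001 | <0.001 | 0.311 |
| Coronaviridae | SARS-COV1 | 214715.3 | 214968.5 | 214644.39 | <0.001 | <0.001 | 0.198 |
| | SARS-COV2 | 15715.4.2 | 15715.6 | 15696.7 | <0.001 | <0.001 | 1.536 |
| | SARB | 573966.3 | 573815.1 | 572517.0 | <0.001 | <0.001 | 0.301 |
| | MERS-COV | 516683.2 | 516983.4 | 516608.9 | <0.001 | <0.001 | 0.169 |
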
Table 5
Relative rate change for C to A, G to A, A to T, G to C, T to G, and C to T mutations under the 11 degrees of non-reversibility alongside the maintained rates for A to C, A to G, T to A, C to G, G to T, and T to C.
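Relative rates of different nucleotide substitution types (from-to) for each degree of non-reversibility (DNR):

| DNR | C-A | A-C | G-A | A-G | A-T | T-A | G-C | C-G | T-G | G-T | C-T | T-C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.166 | 0.166 | 1 | 1 | 0.14 | 0.14 | 0.131 | 0.131 | 0.118 | 0.118 | 1.101 | 1.101 |
| 2 | 2.166 | 0.166 | 3 | 1 | 2.14 | 0.14 | 2.131 | 0.131 | 2.118 | 0.118 | 3.101 | 1.101 |
| 4 | 4.166 | 0.166 | 5 | 1 | 4.14 | 0.14 | 4.131 | 0.131 | 4.118 | 0.118 | 5.101 | 1.101 |
| 6 | 6.166 | 0.166 | 7 | 1 | 6.14 | 0.14 | 6.131 | 0.131 | 6.118 | 0.118 | 7.101 | 1.101 |
| 8 | 8.166 | 0.166 | 9 | 1 | 8.14 | 0.14 | 8.131 | 0.131 | 8.118 | 0.118 | 9.101 | 1.101 |
| 10 | 10.166 | 0.166 | 11 | 1 | 10.14 | 0.14 | 10.131 | 0.131 | 10.118 | 0.118 | 11.101 | 1.101 |
| 12 | 12.166 | 0.166 | 13 | 1 | 12.14 | 0.14 | 12.131 | 0.131 | 12.118 | 0.118 | 13.101 | 1.101 |
| 14 | 14.166 | 0.166 | 15 | 1 | 14.14 | 0.14 | 14.131 | 0.131 | 14.118 | 0.118 | 15.101 | 1.101 |
| 16 | 16.166 | 0.166 | 17 | 1 | 16.14 | 0.14 | 16.131 | 0.131 | 16.118 | 0.118 | 17.101 | 1.101 |
| 18 | 18.166 | 0.166 | 19 | 1 | 18.14 | 0.14 | 18.131 | 0.131 | 18.118 | 0.118 | 19.101 | 1.101 |
| 20 | 20.166 | 0.166 | 21 | 1 | 20.14 | 0.14 | 20.131 | 0.131 | 20.118 | 0.118 | 21.101 | 1.101 |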
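
The pattern in Table 5 can be summarised as follows: for each pair of complementary directions, one rate is held at its DNR = 0 value while the reverse rate is increased by the DNR. The Python sketch below reproduces the table rows from that reading. It is an illustration inferred from the tabulated values rather than the simulation code used in the study, and the names `BASE_RATES` and `rates_for_dnr` are hypothetical.

```python
# Maintained rates (from, to) as listed in the DNR = 0 row of Table 5.
BASE_RATES = {
    ("A", "C"): 0.166,
    ("A", "G"): 1.0,
    ("T", "A"): 0.14,
    ("C", "G"): 0.131,
    ("G", "T"): 0.118,
    ("T", "C"): 1.101,
}

def rates_for_dnr(dnr):
    """Return all 12 relative substitution rates for a given DNR.

    The maintained direction keeps its base rate; the reverse direction
    is the same base rate plus the DNR (as read off Table 5).
    """
    rates = {}
    for (src, dst), base in BASE_RATES.items():
        rates[(src, dst)] = base          # maintained rate, e.g. A to C
        rates[(dst, src)] = base + dnr    # changed rate, e.g. C to A
    return rates

for dnr in range(0, 21, 2):
    r = rates_for_dnr(dnr)
    print(dnr, round(r[("C", "A")], 3), r[("A", "C")],
          round(r[("C", "T")], 3), r[("T", "C")])
```

At DNR = 0 every forward and reverse rate is equal, which corresponds to a reversible (GTR-like) rate set; larger DNR values widen the gap between the two directions.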

Cite this article

Rita Sianga-Mete, Penelope Hartnady, Wimbai Caroline Mandikumba, Kayleigh Rutherford, Christopher Brian Currin, Florence Phelanyane, Sabina Stefan, Steven Weaver, Sergei L Kosakovsky Pond, Darren P Martin (2025) Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models. eLife 12:RP87361. https://doi.org/10.7554/eLife.87361.3