Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models
Figures

Ternary plots illustrating the relative fit of the NREV12, NREV6, and GTR nucleotide substitution models based on weighted small sample corrected Akaike information criterion (AIC-c) scores for 30 dsDNA, 31 dsRNA, 33 ssDNA, and 47 ssRNA virus nucleotide sequence datasets.
These plots were produced using the Akaike weights function with an overlaid density function (implemented in the qpcR package of RStudio; Ritz and Spiess, 2008) to indicate point densities. Each model is represented by a corner of the triangles, and each circle represents the relative fit of each of the three models to a single nucleotide sequence dataset. The sides of the triangle represent model support axes ranging from 0% to 100%, with the position of a circle in relation to each of the sides of the triangle indicating the probability of models best describing the nucleotide sequence dataset that is represented by that point. Red colours represent a very high density of nucleotide sequence datasets that favour a particular model, blue colours indicate a lower, but still substantial, density of datasets that favour a particular model.

Weighted Robinson-Foulds distances between inferred and true phylogenetic trees for datasets simulated with different degrees of nucleotide substitution non-reversibility and different average pairwise sequence identities (APIs) (~75%, ~80%, ~85%, ~90%, and ~95%).
‘ns’ above a pair of box and whisker plots indicates a paired t-test adjusted p-value of ≥0.05 and ‘*’ indicates a paired t-test adjusted p-value of <0.05.

Phylogenetic tree inferred from an alignment of real sequences (Avian Leukosis virus) that was used to simulate datasets with degrees of non-reversibility (DNRs) varying from 0 to 20.
The alignment of Avian Leukosis virus had an average sequence identity (API) of ~90%, and the branches of this tree were scaled to produce four other trees reflecting branch tip sequences with approximate pairwise identities of ~75%, ~80%, ~85%, and~95%.
Tables
Akaike information criterion (AIC) scores and likelihood ratio test (LRT) results for double-stranded DNA virus datasets.
The lowest small sample corrected AIC (AIC-c) scores indicating the best fitting models are in bold.
Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-Value GTR vs NREV-12 | p-Value NREV-6 vs NREV-12 | DNR |
---|---|---|---|---|---|---|---|
Papillomaviridae | APPV 6 | 35099.5 | 35108.0 | 35102.2 | >0.05 | 0.007 | 0.089 |
HPV18_2 | 25202.9 | 25174.6 | 25179.2 | <0.001 | >0.05 | 0.323 | |
HPV45_2 | 23600.6 | 23599.0 | 23602.9 | >0.05 | >0.05 | 0.285 | |
HPV16_2 | 29734.0 | 29664.5 | 29665.4 | <0.001 | >0.05 | 0.371 | |
HPV31 | 24681.4 | 24677.3 | 24672.8 | 0.002 | 0.01 | 0.165 | |
HPV6_1 | 31199.1 | 31150.0 | 31141.2 | <0.001 | <0.001 | 0.451 | |
LPV | 67165.7 | 67188.1 | 67145.5 | <0.001 | <0.001 | 0.42 | |
DPV | 69829.7 | 69889.2 | 69835.1 | >0.05 | <0.001 | 0.056 | |
XPV | 95455.6 | 95617.1 | 95452.2 | <0.001 | <0.001 | 0.072 | |
BATV | 134821 | 134511 | 133322 | <0.001 | <0.001 | 0.402 | |
Polyomaviridae | JC_2 | 51806.7 | 51819.6 | 51812.0 | >0.05 | 0.003 | 0.089 |
BK_2 | 21472.6 | 21472.7 | 21471.1 | 0.03 | 0.03 | 0.244 | |
SV40 | 16859.8 | 16858.0 | 16858.4 | 0.037 | >0.05 | 0.567 | |
BPV | 148614.9 | 148573.8 | 148585.2 | <0.001 | >0.05 | 0.064 | |
Caulimoviridae | CMV | 124083.9 | 124221.0 | 123888.6 | <0.001 | <0.001 | 0.351 |
CSSV | 145327.0 | 146575 | 145202 | <0.001 | <0.001 | 0.158 | |
SVBV | 138575 | 138488.1 | 138464.7 | <0.001 | <0.001 | 0.174 | |
DBAV | 46495.5 | 46514.1 | 46502.0 | >0.05 | <0.001 | 0.0335 | |
RTBV | 54987.9 | 55350.1 | 54991 | >0.05 | <0.001 | 0.082 | |
BDV | 376325.2 | 376647.6 | 376029.9 | <0.001 | <0.001 | 0.140 | |
Siphoviridae | CLV | 237362.3 | 237351.8 | 237348.6 | <0.001 | <0.01 | 0.070 |
Tectiviridae | TTIV | 913864.9 | 913915.4 | 913773.1 | <0.001 | <0.001 | 0.279 |
Adenoviridae | FAV_C | 3074086.7 | 3074207.5 | 3073739.1 | <0.001 | <0.001 | 0.169 |
FAV_E | 103482.3 | 103222.7 | 102636.7 | <0.001 | <0.001 | 0.357 | |
FAV_D | 2326925.6 | 2325719.4 | 2324784.5 | <0.001 | <0.001 | 0.551 | |
FAV_A | 705328.5 | 705436.5 | 705197.8 | <0.001 | <0.001 | 0.645 | |
HMAV_B | 103796.7 | 103937.44 | 103753.8 | <0.001 | <0.001 | 10.890 | |
HMAV_D | 1748635.2 | 1749769 | 1748119.1 | <0.001 | <0.001 | 0.646 | |
HMAV_C | 2851144.5 | 2851357.1 | 2851133 | 0.006 | <0.001 | 0.0225 | |
HMAV_E | 1915044.8 | 1915065.3 | 1914998 | <0.001 | <0.001 | 0.049 |
Akaike information criterion (AIC) scores and likelihood ratio test (LRT) results for double-stranded RNA datasets.
The lowest small sample corrected AIC (AIC-c) scores indicating the best fitting models are in bold.
Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | GTR vs NREV-12 | NREV-6 vs NREV-12 | DNR |
---|---|---|---|---|---|---|---|
Birnaviridae | AQBV | 31754.9 | 31853.3 | 31721.9 | <0.001 | <0.001 | 0.219 |
GBV_A | 47176.9 | 47347.2 | 47154.8 | <0.001 | <0.001 | 0.142 | |
IPNV | 79186.2 | 79221.9 | 79182.4 | 0.0145 | <0.001 | 0.162 | |
GBV_B | 39313.7 | 39062.8 | 38938.7 | <0.001 | <0.001 | 0.201 | |
Reoviridae | BTV_A | 34803.5 | 34895.1 | 34801.3 | 0.03 | <0.001 | 0.042 |
BTV_B | 48849.9 | 48893. | 48837.1 | <0.001 | <0.001 | 0.043 | |
BTV_C | 28350.9 | 28386.5 | 28350.8 | >0.05 | <0.001 | 0.061 | |
BTV_D | 24969.1 | 24947.3 | 24894.0 | <0.001 | <0.001 | 0.191 | |
BTV_F | 20622.7 | 20708.5 | 20610.2 | <0.001 | <0.001 | 0.067 | |
BTV_G | 63349.9 | 63485.0 | 63345.9 | 0.00426 | <0.001 | 0.040 | |
BTV_H | 20596.7 | 20685.5 | 20586.1 | <0.001 | <0.001 | 0.118 | |
BTV_I | 17592.7 | 17622.5 | 17588.8 | 0.01 | <0.001 | 0.095 | |
BRVA_C | 41206.7 | 41187.4 | 41137.1 | <0.001 | <0.001 | 0.128 | |
HRVA_A | 17030.5 | 17043.2 | 17035.5 | >0.05 | 0.003 | 0.036 | |
HRVA_B | 8275.1 | 8280.3 | 8281.7 | >0.05 | >0.05 | 0.087 | |
HRVA_C | 12815.1 | 12842.6 | 12807.6 | 0.003 | <0.001 | 0.132 | |
HRVA_D2 | 8036.8 | 8041.0 | 8043.7 | >0.05 | >0.05 | 0.057 | |
HRVA_E | 7045.9 | 7056.1 | 7053.3 | >0.05 | 0.02 | 0.102 | |
HRVA_F | 7046.0 | 7056.7 | 7053.4 | >0.05 | 0.02 | 0.0710 | |
HRVA_G | 18424.2 | 18434.0 | 18425.1 | >0.05 | <0.001 | 0.123 | |
HRVA_H | 20431.4 | 20413.87 | 20420.5.6 | 0.002 | >0.05 | 0.163 | |
PRVA_A | 28540.7 | 28441.9 | 28398.7 | <0.001 | <0.001 | 0.204 | |
PRVA_B | 14757.7 | 14775.5 | 14732.6 | <0.001 | <0.001 | 0.351 | |
HRVC_A | 6713.2 | 6718.2 | 6712.3 | 0.045 | 0.007 | 0.124 | |
PTOV | 202011.3 | 202106.5 | 201878.5 | <0.001 | <0.001 | 0.039 | |
FJV_B | 9274.1 | 9250.0 | 9250.9 | <0.001 | >0.05 | 0.194 | |
Endornaviridae | EDV | 1771992.8 | 1772689.1 | 1771950.6 | <0.001 | <0.001 | 0.121 |
BPAV | 70386.5 | 70540.2 | 70390.7 | >0.05 | 0.00 | 0.047 | |
Totiviridae | TTV | 617302.6 | 617462.6 | 617172.9 | <0.001 | <0.001 | 0.052 |
GDV | 80435.8 | 80396.5 | 80387.7 | <0.001 | 0.002 | 0.109 | |
Hypoviridae | HPV | 66859.8 | 66899.8 | 66857.8 | 0.03 | <0.001 | 0.057 |
Small sample corrected Akaike information criterion (AIC-c) scores and likelihood ratio test (LRT) results for single-stranded DNA datasets.
The lowest AIC-c scores indicating the best fitting models are in bold.
Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-Value GTR vs NREV-12 | p-Value NREV-6 vs NREV-12 | DNR |
---|---|---|---|---|---|---|---|
Nanoviridae | BBTV M | 15044.3 | 15207.9 | 14984.4 | <0.001 | <0.001 | 0.662 |
BBTV N | 10605.6 | 10686.2 | 10595.2 | <0.001 | <0.001 | 0.533 | |
BBTV R | 18484.5 | 18544 | 18480.8 | >0.05 | <0.001 | 0.609 | |
BBTV S | 12718.9 | 12757.2 | 12707.3 | <0.001 | <0.001 | 0.728 | |
CCDV | 38622.7 | 38632.0 | 38630.5 | >0.05 | 0.03 | 0.050 | |
MDV | 36232.8 | 36063 | 36064 | <0.001 | >0.05 | 0.142 | |
PYDV | 56138.4 | 56076.6 | 56056.4 | <0.001 | <0.001 | 0.187 | |
FBNS | 100153.6 | 100135.6 | 100120.5 | <0.001 | <0.001 | 0.098 | |
Geminiviridae | Begomo 5 | 28192.1 | 28311.9 | 28192.5 | >0.05 | <0.001 | 0.1995 |
Begomo 6 | 16743.0 | 16722.6 | 16724.1 | <0.001 | >0.05 | 0.214 | |
Begomo 9 | 8517.6 | 8540.8 | 8515.6 | 0.03 | <0.001 | 0.312 | |
Dicot 1 | 44730.7 | 44594.3 | 44583.3 | <0.001 | <0.001 | 0.200 | |
Dicot 2 | 39909.9 | 39919.8 | 39917.9 | >0.05 | <0.001 | 0.100 | |
MSV | 252645.3 | 254347.5 | 254347.5 | <0.001 | <0.001 | 0.144 | |
PanSV | 94601.2 | 94600.3 | 94593.7 | <0.001 | <0.001 | 0.182 | |
WDV | 35301.7 | 35313.2 | 35253.8 | <0.001 | <0.001 | 0.1033 | |
Circoviridae | BFDV | 17256.7 | 17262.7 | 17246.7 | <0.001 | <0.001 | 0.224 |
DG_CV | 12754.8 | 12779.5 | 12758.3 | >0.05 | <0.001 | 0.116 | |
PiCV | 19180.5 | 19192.5 | 19191.0 | >0.05 | 0.04 | 0.117 | |
CCCC | 84435.7 | 84377.4 | 84315.3 | <0.001 | <0.001 | 0.132 | |
BTC | 262910.4 | 262060.1 | 261985.4 | <0.001 | <0.001 | 0.178 | |
POCV2 | 24940.9 | 24953.8 | 24915.8 | <0.001 | <0.001 | 0.162 | |
CCV | 90307.9 | 90301.5 | 90285.9 | <0.001 | <0.001 | 0.114 | |
Anelloviridae | TTV_1 | 825811 | 826800 | 825292 | <0.001 | <0.001 | 0.513 |
TTSV | 332287.9 | 332397.4 | 332258.2 | <0.001 | <0.001 | 1.560 | |
Parvoviridae | MVM | 26756.3 | 26743.9 | 26686.9 | <0.001 | <0.001 | 0.148 |
HPV | 67051.2 | 67080.1 | 67001.8 | <0.001 | <0.001 | 0.235 | |
CPV | 85731 | 85695 | 85689.3 | <0.001 | 0.007 | 0.062 | |
PPV | 163006.8 | 163090.7 | 162995.9 | <0.001 | <0.001 | 0.143 | |
CAV_P | 37073.3 | 37115.5 | 37065.7 | <0.001 | <0.001 | 0.162 | |
Microviridae | BMV | 31175.3 | 31164.8 | 31147.3 | <0.001 | <0.001 | 0.188 |
Pleolipoviridae | APV | 85700.2 | 85617.4 | 85402.8 | <0.001 | <0.001 | 0.204 |
BPV | 204797.5 | 204802.3 | 204796.7 | 0.04 | 0.007 | 0.064 |
Small sample corrected Akaike information criterion (AIC-c) scores and likelihood ratio test (LRT) results for single-stranded RNA datasets.
The lowest AIC-c scores indicating the best fitting models are in bold.
Virus family | Dataset | AIC score GTR | AIC score NREV-6 | AIC score NREV-12 | p-Value GTR vs NREV-12 | p-Value NREV-6 vs NREV-12 | DNR |
---|---|---|---|---|---|---|---|
Astroviridae | HAV | 94580.7 | 94926.3 | 94548.1 | <0.001 | <0.001 | 0.096 |
BAV | 188307.1 | 188572.9 | 188144.9 | <0.001 | <0.001 | 0.108 | |
MMV | 281072.2 | 281094.5 | 281076.9 | >0.05 | <0.001 | 0.072 | |
PAV | 150626.88 | 150827.6 | 150609.5 | <0.001 | <0.001 | 0.069 | |
CKV | 90902.3 | 91233.1 | 90873.0 | <0.001 | <0.001 | 0.083 | |
GA | 64998.5 | 65223.9 | 64975.9 | <0.001 | <0.001 | 0.110 | |
CAV_A | 85558.8 | 85617.4 | 85547.3 | <0.01 | <0.001 | 0.076 | |
Bromoviridae | CMV RNA1 | 34197.5 | 34198.8 | 34147.7 | <0.001 | <0.001 | 0.124 |
CMV RNA2 | 31398.2 | 31455.9 | 31388.7 | <0.001 | <0.001 | 0.091 | |
CMV RNA3 | 24337.2 | 24360.3 | 24343.9 | >0.05 | <0.001 | 0.073 | |
AMS | 24337.2 | 24360.3 | 24343.9 | >0.05 | <0.001 | 0.073 | |
PSV | 67707 | 67786.5 | 67691 | <0.001 | <0.001 | 0.048 | |
Caliciviridae | LAV | 73042.8 | 73102.4 | 72984.6 | <0.001 | <0.001 | 0.120 |
NoV | 207667.2 | 207777.5 | 207660 | <0.001 | <0.001 | 0.047 | |
VSV | 235936.4 | 236051.4 | 235913.3 | <0.001 | <0.001 | 0.046 | |
Closteroviridae | CTV | 30062.2 | 29980.4 | 29960.1 | <0.001 | <0.001 | 0.272 |
Flaviviridae | DGV_T1 | 69771.9 | 70030.5 | 69776.2 | >0.05 | <0.001 | 0.063 |
JEV | 146920.8 | 148101.5 | 146885.5 | <0.001 | <0.001 | 0.091 | |
Hepeviridae | HPVE1 | 200439.5 | 200863.8 | 200179.8 | <0.001 | <0.001 | 0.073 |
HPVE2 | 155709.1 | 155983.8 | 155518.6 | <0.001 | <0.001 | 0.088 | |
Picornaviridae | ENV_A | 552287.9 | 553535.5 | 551794.1 | <0.001 | <0.001 | 0.061 |
HRV_A | 102218.7 | 102267.0 | 101550.7 | <0.001 | <0.001 | 0.285 | |
AIV | 101073.1 | 101136.7 | 101052.2 | <0.001 | <0.001 | 0.093 | |
AHP | 139635.7 | 140119.6 | 139506.9 | <0.001 | <0.001 | 0.170 | |
ECV | 82078.9 | 82181.0 | 82065.8 | <0.001 | <0.001 | 0.066 | |
CDV | 130551.3 | 130896.7 | 130478.3 | <0.001 | <0.001 | 0.086 | |
TCV | 53027.3 | 53029 | 53023 | 0.0151 | 0.0422 | 0.033 | |
FMDV | 455180.6 | 455582.6 | 454806.1 | <0.001 | <0.001 | 0.117 | |
Fusariviridae | FRV | 52413.1 | 52470.6 | 52418.4 | >0.05 | <0.001 | 0.076 |
Retroviridae | HIV1_setA | 344014.4 | 344295.1 | 343669.7 | <0.001 | <0.001 | 0.237 |
HIV1_M | 80764.1 | 80829.5 | 80668.1 | <0.001 | <0.001 | 0.442 | |
HIV1_setC | 180575.0 | 180702.3 | 180494.4 | <0.001 | <0.001 | 0.107 | |
HIV1_setD | 298489.9 | 298695.3 | 298260.6 | <0.001 | <0.001 | 0.133 | |
HIV1_setE | 289111.3 | 289292.1 | 288941.9 | <0.001 | <0.001 | 0.112 | |
HIV1_setF | 214375.9 | 214692.2 | 214289.4 | <0.001 | <0.001 | 0.148 | |
EIV | 126149 | 126365.4 | 125300 | <0.001 | <0.001 | 0.192 | |
BIV | 24505.2 | 24506.9 | 24513.2 | >0.05 | >0.05 | 0.15 | |
FIV | 164542.1 | 164487.9 | 164260.4 | <0.001 | <0.001 | 0.114 | |
CAV | 351329.9 | 351871.5 | 350721.9 | <0.001 | <0.001 | 0.174 | |
SIV | 110731.2 | 110816 | 110663.3 | <0.001 | <0.001 | 0.144 | |
Filoviridae | Ebola_2 | 53147.3 | 53143.0 | 53149.9 | >0.05 | >0.50 | 0.264 |
Orthomyxo-viridae | Flu A 2 | 82872.8 | 83010.2 | 82849.7 | <0.001 | <0.001 | 0.27 |
Flu B 1 | 50090.4 | 50144.1 | 50060.9 | <0.001 | <0.001 | 0.311 | |
Coronaviridae | SARS-COV1 | 214715.3 | 214968.5 | 214644.39 | <0.001 | <0.001 | 0.198 |
SARS-COV2 | 15715.4.2 | 15715.6 | 15696.7 | <0.001 | <0.001 | 1.536 | |
SARB | 573966.3 | 573815.1 | 572517.0 | <0.001 | <0.001 | 0.301 | |
MERS-COV | 516683.2 | 516983.4 | 516608.9 | <0.001 | <0.001 | 0.169 |
Relative rate change for C to A, G to A, A to T, G to C, T to G, and C to T mutations under the 11 degrees of non-reversibility alongside the maintained rates for A to C, A to G, T to A, C to G, G to T, and T to C.
Degree of non-reversibility (DNR) | Relative rates of different nucleotide substitution types (from-to) | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
C-A | A-C | G-A | A-G | A-T | T-A | G-C | C-G | T-G | G-T | C-T | T-C | |
0 | 0.166 | 0.166 | 1 | 1 | 0.14 | 0.14 | 0.131 | 0.131 | 0.118 | 0.118 | 1.101 | 1.101 |
2 | 2.166 | 0.166 | 3 | 1 | 2.14 | 0.14 | 2.131 | 0.131 | 2.118 | 0.118 | 3.101 | 1.101 |
4 | 4.166 | 0.166 | 5 | 1 | 4.14 | 0.14 | 4.131 | 0.131 | 4.118 | 0.118 | 5.101 | 1.101 |
6 | 6.166 | 0.166 | 7 | 1 | 6.14 | 0.14 | 6.131 | 0.131 | 6.118 | 0.118 | 7.101 | 1.101 |
8 | 8.166 | 0.166 | 9 | 1 | 8.14 | 0.14 | 8.131 | 0.131 | 8.118 | 0.118 | 9.101 | 1.101 |
10 | 10.166 | 0.166 | 11 | 1 | 10.14 | 0.14 | 10.131 | 0.131 | 10.118 | 0.118 | 11.101 | 1.101 |
12 | 12.166 | 0.166 | 13 | 1 | 12.14 | 0.14 | 12.131 | 0.131 | 12.118 | 0.118 | 13.101 | 1.101 |
14 | 14.166 | 0.166 | 15 | 1 | 14.14 | 0.14 | 14.131 | 0.131 | 14.118 | 0.118 | 15.101 | 1.101 |
16 | 16.166 | 0.166 | 17 | 1 | 16.14 | 0.14 | 16.131 | 0.131 | 16.118 | 0.118 | 17.101 | 1.101 |
18 | 18.166 | 0.166 | 19 | 1 | 18.14 | 0.14 | 18.131 | 0.131 | 18.118 | 0.118 | 19.101 | 1.101 |
20 | 20.166 | 0.166 | 21 | 1 | 20.14 | 0.14 | 20.131 | 0.131 | 20.118 | 0.118 | 21.101 | 1.101 |
Additional files
-
MDAR checklist
- https://cdn.elifesciences.org/articles/87361/elife-87361-mdarchecklist1-v1.docx
-
Supplementary file 1
Summary of the viral genome component datasets used in the study.
- https://cdn.elifesciences.org/articles/87361/elife-87361-supp1-v1.docx