Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models

  1. University of Cape Town
  2. Brown University
  3. Temple University

Editors

  • Reviewing Editor
    Antonis Rokas
    Vanderbilt University, Nashville, United States of America
  • Senior Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public Review):

The study by Sianga-Mete et al. revisits the effects of substitution model selection on phylogenetics by comparing reversible and non-reversible DNA substitution models. This topic is not new: previous work has already shown that non-reversible, and also covarion, substitution models can fit real data better than the reversible substitution models commonly used in phylogenetics. In this regard, the results of the present study are not surprising. Specific comments follow.

Major comments

It is well known that non-reversible models can fit real data better than the commonly used reversible substitution models; see, for example:
https://academic.oup.com/sysbio/article/71/5/1110/6525257
https://onlinelibrary.wiley.com/doi/10.1111/jeb.14147?af=R
The manuscript presents the results (a better fit of non-reversible models compared to reversible models) as surprising, but I do not think they are; the results would be surprising if the reversible models provided the better fit.
I think the introduction of the manuscript should be expanded with more information about non-reversible models and the various previous studies that have already evaluated them. The manuscript should also either indicate that the results are not surprising or justify more clearly why they are surprising.

In the introduction and/or discussion I missed a discussion of recent work on the influence of substitution model selection on phylogenetic tree reconstruction. Some studies indicated that substitution model selection is not necessary for phylogenetic tree reconstruction:
https://academic.oup.com/mbe/article/37/7/2110/5810088
https://www.nature.com/articles/s41467-019-08822-w
https://academic.oup.com/mbe/article/35/9/2307/5040133
Others indicated that substitution model selection is recommended for phylogenetic tree reconstruction:
https://www.sciencedirect.com/science/article/pii/S0378111923001774
https://academic.oup.com/sysbio/article/53/2/278/1690801
https://academic.oup.com/mbe/article/33/1/255/2579471
The results of the present study seem to support this second view. I think the study could be improved by discussing this aspect, including what the study specifically contributes to the debate.

The real data were downloaded from the Los Alamos HIV database. I wonder whether any criteria were used to select the sequences, or whether all sequences in the database for each studied virus category were analysed. Also, was any quality filter applied? How were gaps and ambiguous nucleotides handled? Note that these aspects could affect the fit of the models to the data.

How are the non-reversible model and the data compared, given that the substitution process is non-reversible? In particular, given an input MSA, how can one know whether a nucleotide substitution went from state x to state y or from state y to state x in the real data if there is no reference (i.e., wild-type) sequence? All the sequences are mutants, and one may not have a reference with which to identify the direction of the mutation, which is required for the non-reversible model. One could assume that the most abundant state is the wild-type state, but that may not be the case in reality. I think this is a major problem for the practical application of non-reversible substitution models in phylogenetics.
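
To make the property at issue concrete: a substitution model with rate matrix Q and stationary frequencies π is time-reversible when π_i q_ij = π_j q_ji for every pair of states, so the two directions of a substitution carry the same flux. The Python sketch below checks that condition numerically. The two example matrices, their rates and frequencies, and all function names are illustrative assumptions, not values or code from the manuscript.

```python
def with_diagonal(rates):
    """Fill the diagonal so that every row of the rate matrix sums to zero."""
    n = len(rates)
    Q = [row[:] for row in rates]
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q


def stationary(Q, iters=10000):
    """Approximate the stationary distribution by power iteration.

    Q is first turned into a transition matrix P = I + Q / lam
    (uniformization); lam is chosen slightly above the largest exit rate
    so that every diagonal entry of P is positive.
    """
    n = len(Q)
    lam = 1.1 * max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def is_reversible(Q, tol=1e-9):
    """Detailed balance: pi_i * q_ij == pi_j * q_ji for all i < j."""
    pi = stationary(Q)
    n = len(Q)
    return all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


# Hypothetical GTR-style matrix (state order A, C, G, T):
# symmetric exchangeabilities times target-state frequencies.
freqs = [0.1, 0.2, 0.3, 0.4]
exch = [[0, 1, 2, 1],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [1, 2, 1, 0]]
gtr_rates = [[exch[i][j] * freqs[j] for j in range(4)] for i in range(4)]
Q_gtr = with_diagonal(gtr_rates)

# Hypothetical non-reversible variant: C->T is made three times faster
# while T->C is left unchanged, mimicking a directional bias.
nrev_rates = [row[:] for row in gtr_rates]
nrev_rates[1][3] *= 3
Q_nrev = with_diagonal(nrev_rates)

print("GTR-style matrix reversible:", is_reversible(Q_gtr))
print("Biased matrix reversible:", is_reversible(Q_nrev))
```

Under a reversible model the likelihood of a tree does not depend on where the root is placed, whereas under a non-reversible model it generally does, so any information about the direction of change has to come from the rooted tree rather than from a reference sequence.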

Reviewer #2 (Public Review):

The authors evaluate whether non-time-reversible models fit data presenting strand-specific substitution biases better than time-reversible models do. Specifically, the authors consider what they call NREV6 and NREV12 as candidate non-time-reversible models. On the one hand, they show that AIC tends to select NREV12 more often than GTR on real virus data sets. On the other hand, they show using simulated data that NREV12 leads to inferred trees that are closer to the true generating tree when the data incorporate a certain degree of non-time-reversibility. Based on these two experimental results, the authors conclude that "We show that non-reversible models such as NREV12 should be evaluated during the model selection phase of phylogenetic analyses involving viral genomic sequences". This is a valuable finding, and I agree that this is potentially good practice. However, I miss an experiment that links the two findings to support the conclusion; in particular, an experiment that answers the following question: does the best-fit model also lead to better tree topologies?
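
For reference, the AIC comparison mentioned above uses AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximised log-likelihood; the model with the lowest AIC is preferred. The Python sketch below illustrates the calculation. The log-likelihood values are made up, and the parameter counts (8 for GTR, 11 for NREV12) are assumptions that depend on how base frequencies and branch lengths are tallied; neither is taken from the manuscript.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(fits):
    """fits maps model name -> (maximised log-likelihood, free parameters).

    Returns the name of the model with the lowest AIC and all AIC scores.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits for a single alignment (illustrative numbers only).
example_fits = {
    "GTR": (-10250.0, 8),
    "NREV12": (-10240.0, 11),
}

best_model, aic_scores = best_by_aic(example_fits)
print(best_model, aic_scores)
```

In this made-up case the three extra parameters of NREV12 are outweighed by its gain of 10 log-likelihood units, so AIC prefers it; with a smaller gain the penalty term would favour GTR.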

On simulated data, the significance of the difference between GTR and NREV12 inferences is evaluated using a paired t-test. I miss a rationale or a reference showing that a paired t-test is suitable for measuring the significance of differences in wRF distance. Also, the results show that on average NREV12 performs better than GTR, but a pairwise comparison would be more informative: for how many sequence alignments does NREV12 perform better than GTR?
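
The pairwise comparison requested here could be sketched as follows, assuming one wRF distance to the true tree per alignment and per model. The snippet counts, alignment by alignment, which model gave the closer tree, and adds an exact two-sided sign test as one possible distribution-free alternative to the paired t-test. The sign test is a substitution suggested for illustration, not the authors' method, and the input values and function names are hypothetical.

```python
from math import comb


def paired_summary(wrf_gtr, wrf_nrev12):
    """Per-alignment comparison of wRF distances (lower is better).

    Returns the number of alignments where NREV12 is closer to the true
    tree, where GTR is closer, the number of ties, and the p-value of an
    exact two-sided sign test computed on the non-tied pairs.
    """
    diffs = [g - n for g, n in zip(wrf_gtr, wrf_nrev12)]
    nrev_better = sum(d > 0 for d in diffs)
    gtr_better = sum(d < 0 for d in diffs)
    ties = sum(d == 0 for d in diffs)

    n = nrev_better + gtr_better          # ties carry no sign information
    k = min(nrev_better, gtr_better)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return nrev_better, gtr_better, ties, p_value


# Hypothetical wRF distances for five simulated alignments.
example_gtr = [0.30, 0.25, 0.40, 0.10, 0.22]
example_nrev12 = [0.20, 0.25, 0.35, 0.12, 0.15]

summary = paired_summary(example_gtr, example_nrev12)
print(summary)
```

Reporting the win, loss and tie counts alongside the mean difference shows whether an average improvement is spread across most alignments or driven by a few large differences.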
