Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Editors
- Reviewing Editor: Bryan Bryson, Massachusetts Institute of Technology, Cambridge, United States of America
- Senior Editor: Wendy Garrett, Harvard T.H. Chan School of Public Health, Boston, United States of America
Reviewer #1 (Public review):
Summary:
In this manuscript, Green et al. attempt to use large-scale protein structure analysis to find signals of selection and clustering related to antibiotic resistance. This was applied to the whole proteome of Mycobacterium tuberculosis, with a specific focus on the smaller set of known antibiotic-resistance-related proteins.
Strengths:
The use of geospatial analysis to detect signals of selection and clustering on the structural level is really intriguing. This could have a wider use beyond the AMR-focussed work here and could be applied to a more general evolutionary analysis context. Much of the strength of this work lies in breaking ground into this structural evolution space, something rarely seen in such pathogen data. Additional further research can be done to build on this foundation, and the work presented here will be important for the field.
The size of the dataset and use of protein structure prediction via AlphaFold, giving such a consistent signal within the dataset, is also of great interest and shows the power of these approaches to allow us to integrate protein structure more confidently into evolution and selection analyses.
Weaknesses:
There are several issues with the evolutionary analysis and assumptions made in the paper, which perhaps overstate the findings, or require refining to take into account other factors that may be at play.
(1) The focus on antimicrobial resistance (AMR) throughout the paper confines the findings to that lens. This results in a few different weaknesses:
(a) While the large size of the analysis is highlighted in the abstract and elsewhere, in reality, only a few proteins are studied in depth. These are proteins already associated with AMR by many other studies, somewhat retreading old ground and reducing the novelty.
(b) Beyond the AMR-associated proteins, the proteome work is of great interest, but only casually interrogated and only in the context of AMR. There appears to be an assumption that all signals of positive selection detected are related to AMR, whereas something like cas10 is part of the CRISPR machinery, a set of proteins often under positive selection, and thus unlikely to be AMR-related.
(2) The strength of the signal from the structural information and the novelty of the structural incorporation into prediction are perhaps overstated.
(a) A drop of 13% in F1 for a gain of 2% in PPV is quite the trade-off. This does not indicate as strong a predictor as the abstract claims. While the approach is novel and this is a good finding for a first attempt at such a complex analysis, it is perhaps not as significant as the authors claim.
(b) In relation to this, there is a lack of situating these findings within the wider research landscape. For instance, structure has already been used to predict resistance in PncA (https://academic.oup.com/jacamr/article/6/2/dlae037/7630603, https://www.sciencedirect.com/science/article/pii/S1476927125003664, https://www.nature.com/articles/s41598-020-58635-x) and in RpoB (https://www.nature.com/articles/s41598-020-74648-y). These, and other such works, should be acknowledged, as the novelty of this work is perhaps not as stark as the authors present it to be.
(3) The authors postulate that neutral AA substitutions would be randomly distributed in the protein structure and thus use random mutations as a negative control to simulate this neutral evolution. However, I am unsure if this is a true negative control for neutral evolution. The vast majority of residues would be under purifying selection rather than evolving neutrally, especially in core proteins like rpoB and gyrA, so most of these residues would never be mutated in a real-world dataset. As a result, you are not testing positive selection against neutral evolution; you are testing positive against purifying selection, which will give a much stronger signal. This is likely, in turn, to overestimate the signal of positive selection. This would be better accounted for using a model of neutral evolution, although this is complex and perhaps outside the scope. Still, it needs to be made clear that these negative controls are not representative of neutral evolution.
(4) In a similar vein, the use of 15 Å as a cut-off for stating co-localisation feels quite arbitrary. The average radius of a globular protein is about 20 Å, so this could be quite a large patch of a protein. I think it may be good to situate the cut-off for a 'single location' within a size estimator of the entire protein, as 15 Å could be a neighbourhood in a large protein, but be the whole protein for smaller ones.
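One way to make this concrete is to express the fixed cut-off relative to a per-protein size estimate. The sketch below is illustrative only: it uses the radius of gyration of the C-alpha coordinates as the size estimator, which is an assumption rather than something taken from the manuscript, and the function names are hypothetical.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of points (e.g. C-alpha atoms, in Å) from their centroid."""
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(mean_sq)

def cutoff_relative_to_size(coords, cutoff=15.0):
    """Express a fixed distance cut-off (Å) as a multiple of the radius of gyration."""
    return cutoff / radius_of_gyration(coords)
```

A ratio near or above 1 would suggest that the 15 Å neighbourhood spans most of the protein, whereas a small ratio would indicate a genuinely local patch.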
Reviewer #2 (Public review):
Summary:
This is an important study that, for the first time, systematically places the homoplastic genetic variation observed in the coding regions of a large collection of >31,000 M. tuberculosis samples into its protein structural context. This should be much more informative when, e.g., predicting antimicrobial resistance. The authors imaginatively apply the Getis-Ord score, which originated in geographical spatial analysis but has also been used in human disease, to demonstrate that missense mutations in M. tuberculosis known to be associated with antimicrobial resistance are clustered in space. That they are able to consider almost all of the proteome using a large dataset of >31,000 M. tuberculosis complex clinical samples makes the evidence convincing.
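For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Getis-Ord Gi* score with binary distance weights, applied to 3D residue coordinates. It is not the authors' implementation: the per-residue values (for example, mutation counts), the 15 Å default radius as the weighting distance, and the function name are assumptions made for illustration.

```python
import math

def getis_ord_gi_star(coords, values, radius=15.0):
    """Gi* z-score per site, using binary weights within `radius` (site itself included)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of the values; clamp guards against rounding below zero.
    s = math.sqrt(max(0.0, sum(v * v for v in values) / n - mean ** 2))
    scores = []
    for ci in coords:
        w = [1.0 if math.dist(ci, cj) <= radius else 0.0 for cj in coords]
        w_sum = sum(w)
        w_sq_sum = sum(x * x for x in w)
        numerator = sum(wj * vj for wj, vj in zip(w, values)) - mean * w_sum
        denominator = s * math.sqrt(max(0.0, (n * w_sq_sum - w_sum ** 2) / (n - 1)))
        # Undefined when values are constant or every site is a neighbour; report 0.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores
```

Strongly positive scores mark sites whose spatial neighbourhood carries more signal than expected from the protein-wide mean; strongly negative scores mark unusually quiet neighbourhoods.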
Strengths:
To my knowledge, this is the first study to place the homoplastic missense mutations from a large clinical dataset into their protein structural context and attempt to look for clustering in space, which could be indicative of a recent evolutionary pressure, such as the use of antibiotics. The field usually only views resistance through the genetic paradigm, so it is delightful to see a structural paradigm being brought to bear, as this should, in theory, be much more informative, as protein structure is much closer to function. In addition, the dataset used is large (>31,000 clinical M. tuberculosis samples), and the authors are able to consider almost all of the ORFs (3,687/3,996) in the M. tuberculosis reference, and hence the analysis is comprehensive.
Weaknesses:
It is not apparent at the time of this review whether the study could be reproduced by other researchers. For example, whilst the authors state that the raw sequencing files (FASTQ) underpinning the dataset of 31,428 M. tuberculosis isolates can be downloaded, the table in the Supplement containing the sample and accession identifiers includes rows that do not contain NCBI accessions, e.g. '01R0685', 'IDR 1600023875' or '1479144813357T181715lib5022nextseqn0035151bp', instead of the expected form, e.g. 'SAMEA1016138'. I have searched the NCBI SRA using these terms and got no results, so they cannot be used to download any FASTQ files. There is also no information in the preprint on how the reads were processed (which is a complex process) and how the dataset of SNPs was subsequently built. One can trace back through the references, but I cannot find anywhere where one can download the SNP dataset, which would permit researchers to reproduce at least the latter stages of the work; one obvious option would be to make the SNP dataset available. Likewise, the authors have constructed an "M. tuberculosis structureome", which would be very useful for the community but does not appear to be publicly available. At the time of the review, not all the GitHub repositories were public, so these points may have been rectified once that was corrected.
The authors correctly point out in the Introduction that supervised methods like GWAS or ML need datasets with matching genetic and phenotypic drug susceptibility data, which are much more difficult and expensive to obtain, but they don't then close the loop by comparing their results back to such supervised methods. They pick out RnJ as having previously been identified by a GWAS, but it would have provided a useful validation of their method to demonstrate, e.g., that X% of the genes they identify were also identified by GWAS/ML studies, and therefore that their method can achieve similar results without having to collect pDST data.
Whilst the authors acknowledge that assuming all sites are equally likely to mutate in their random shuffling procedure is a shortcoming, a bigger weakness is, I suspect, that one should also only consider which amino acids could arise at each codon due to a SNP. Shuffling assumes any amino acid can arise at any codon, which in many cases would require multiple nucleotide changes and is therefore possible but highly unlikely.
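A simple format check along the following lines could flag such rows before release. This is a sketch only: the pattern covers the common BioSample prefixes (SAMN, SAMEA, SAMD) and is an approximation rather than an official specification, and the function name is hypothetical.

```python
import re

# Approximate BioSample accession pattern: NCBI (SAMN), EBI (SAMEA), DDBJ (SAMD).
BIOSAMPLE_PATTERN = re.compile(r"SAM(N|EA|D)\d+")

def split_by_accession_format(identifiers):
    """Separate identifiers that look like BioSample accessions from those that do not."""
    valid, suspect = [], []
    for ident in identifiers:
        if BIOSAMPLE_PATTERN.fullmatch(ident.strip()):
            valid.append(ident)
        else:
            suspect.append(ident)
    return valid, suspect
```

Applied to the four identifiers quoted above, only 'SAMEA1016138' passes; a format match does not, of course, guarantee that the accession resolves to downloadable reads.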
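To illustrate the constraint, the sketch below lists the amino acids reachable from a given codon by a single nucleotide change. It uses the standard codon assignments (which bacterial translation table 11 shares), excludes synonymous and stop-gain changes, and is not taken from the authors' code; the function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def snp_reachable_amino_acids(codon):
    """Amino acids reachable from `codon` by exactly one nucleotide change (missense only)."""
    codon = codon.upper()
    original = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            aa = CODON_TABLE[mutant]
            if aa != original and aa != "*":
                reachable.add(aa)
    return reachable
```

For example, the methionine codon ATG can reach only I, K, L, R, T and V through a single SNP, so a shuffling procedure restricted in this way would draw from six alternatives rather than nineteen at that site.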
Finally, the authors implicitly assume that the mutations do not perturb the structure of the proteins, which is likely to be generally true for essential genes but less likely to be true for non-essential genes. This assumption underpins their entire approach and should be borne in mind when evaluating the results.