Peer review process
Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.
Editors
- Reviewing Editor: Jungsan Sohn, Johns Hopkins University School of Medicine, Baltimore, United States of America
- Senior Editor: Satyajit Rath, Indian Institute of Science Education and Research (IISER), Pune, India
Reviewer #1 (Public Review):
In this article, different machine learning models (pan-specific, peptide-specific, pre-trained, and ensemble models) are tested for predicting TCR specificity from a paired-chain peptide-TCR dataset. After several pre-processing steps (filtering and redundancy reduction), the data consist of 6,358 positive observations across 26 peptides, compared with six peptides in NetTCR version 2.1. For each positive sample, five negative samples were generated by swapping the TCRs of a given peptide with TCRs binding to other peptides. A weighted loss function is used to handle the resulting class imbalance in the pan-specific models.
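The negative-sampling step described above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `make_swapped_negatives`, the `(peptide, tcr)` tuple layout, and the rule of excluding TCRs already known to bind the target peptide are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def make_swapped_negatives(positives, n_neg=5, seed=0):
    """Build a labelled dataset from positive (peptide, tcr) pairs.

    Hypothetical helper: for each positive pair, up to `n_neg` negatives
    are created by pairing the same peptide with TCRs that are recorded
    as binders of *other* peptides. Returns (peptide, tcr, label) tuples.
    """
    rng = random.Random(seed)

    # Group the known binders by peptide.
    tcrs_by_peptide = defaultdict(set)
    for peptide, tcr in positives:
        tcrs_by_peptide[peptide].add(tcr)

    samples = []
    for peptide, tcr in positives:
        samples.append((peptide, tcr, 1))

        # Candidate pool: TCRs of other peptides, minus any TCR that is
        # itself a known binder of this peptide (assumed exclusion rule).
        pool = set()
        for other, tcrs in tcrs_by_peptide.items():
            if other != peptide:
                pool |= tcrs
        pool = sorted(pool - tcrs_by_peptide[peptide])

        # Sample without replacement; fewer negatives if the pool is small.
        for neg_tcr in rng.sample(pool, min(n_neg, len(pool))):
            samples.append((peptide, neg_tcr, 0))
    return samples
```

With `n_neg=5` this gives the 1:5 positive-to-negative ratio mentioned in the review, which is the imbalance the weighted loss is meant to offset. The TCR is treated here as an opaque identifier; in practice it would carry the paired α/β chain sequences.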
The results demonstrate that including redundant data during training did not yield a performance gain; rather, a decrease in performance was observed for the pan-specific model. Removing outliers leads to better performance.
To further improve peptide-specific performance, an architecture is created that combines the pan-specific and peptide-specific models: the pan-specific part is trained on pan-specific data while the peptide-specific part of the model is kept frozen, and the peptide-specific part is trained on a peptide-specific dataset while the pan-specific part is kept frozen. This model surpassed the performance of the individual pan-specific and peptide-specific models. Finally, the sequence similarity-based predictions of TCRbase are integrated into the pre-trained CNN model, which further improved model performance (mostly due to better discrimination between binders and non-binders).
Predictive performance for unseen peptides remains low in the pan-specific model; however, an improvement is observed for peptides with high similarity to those in the training dataset. Furthermore, it is shown that 15 observations give satisfactory performance, compared with the ~150 recommended in the literature.
The models are evaluated on an external dataset (the IMMREP benchmark). Peptide-specific models performed competitively with the best models in the benchmark. The pre-trained model performed worst, which the authors suggested could be because of positive and negative sample swapping across the training and testing sets. To resolve this issue, they applied the redundancy removal technique to the IMMREP dataset. The results agreed with the earlier conclusions that the pre-trained models surpass the peptide-specific models and that integrating similarity-based methods leads to a performance boost. This highlights the need for a new benchmark without data redundancy or leakage problems.
The manuscript is well written, clear and easy to understand. The data is effectively presented. The results validate the drawn conclusions.
Reviewer #2 (Public Review):
Summary:
The authors describe a novel ML approach to predict binding between MHC-bound peptides and T-cell receptors. Such approaches are particularly useful for predicting the binding of peptide sequences with low similarity to existing datasets. The authors focus on improving dataset quality and optimizing model architecture to achieve a pan-specific predictive model, in the hope of obtaining a high-precision model for novel peptide sequences.
Strengths:
Since assuring the quality of training datasets is the first major step in any ML training project, the extensive human curation and computational analysis and enhancements made in this manuscript represent a major contribution to the field. Moreover, the systematic approach to testing redundancy reduction and data augmentation is exemplary, and will significantly help future research in the field.
The authors also highlight how their model can identify outliers and how that can be used to improve the model around known sequences, which can help the creation and optimization of future datasets for peptide binding.
The new models presented here are novel and built using paired α/β TCR sequence data to predict peptide-specific TCR binding, and have been extensively and rigorously tested.
Weaknesses:
Achieving an accurate pan-specific model is an ambitious goal, and the authors have significant difficulties when trying to achieve non-random performance for prediction of TCR binding to novel peptides. This is the most challenging task for this kind of model, but also the most desirable when applying such models to biotechnological and bioengineering projects.
The manuscript is a highly technical and extremely detailed computational work, which can make its achievements and impact hard to parse for application-oriented researchers, and still hard to translate to real-world use cases for TCR specificity predictions.