Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.
Read more about eLife's peer review process.

Editors
- Reviewing Editor: Anne-Florence Bitbol, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
- Senior Editor: Alan Moses, University of Toronto, Toronto, Canada
Reviewer #1 (Public review):
Summary:
The manuscript presents a deep learning framework for predicting T cell receptor (TCR) binding to antigens (peptide-MHC) using a combination of data augmentation techniques to address class imbalance in experimental datasets, and introduces both peptide-specific and pan-specific models for TCR-MHC-I binding prediction. The authors leverage a large, curated dataset of experimentally validated TCR-MHC-I pairs and apply a data augmentation strategy based on generative modeling to generate new TCR sequences. The approach is evaluated on benchmark datasets, and the resulting models demonstrate improved accuracy and robustness.
Strengths:
The most significant contribution of the manuscript lies in its data augmentation approach to mitigate class imbalance, particularly for rare but immunologically relevant epitope classes. The authors employ a generative strategy based on two deep learning architectures:
(1) a Restricted Boltzmann Machine (RBM) and
(2) a BERT-based language model. Both are used to generate new CDR3β sequences that serve as synthetic training data to balance the classes of TCR-pMHC binding pairs.
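The balancing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_fn` is a hypothetical stand-in for a trained RBM or BERT sampler, and the toy generator simply mutates real binders.

```python
import random

def balance_with_synthetic(positives, negatives, generate_fn, seed=0):
    """Oversample the minority (binding) class with generated CDR3β
    sequences until both classes are the same size.
    `generate_fn` stands in for a trained RBM or BERT sampler (hypothetical)."""
    rng = random.Random(seed)
    synthetic = []
    while len(positives) + len(synthetic) < len(negatives):
        synthetic.append(generate_fn(rng))
    return positives + synthetic, negatives

# Toy generator: mutate one position of a random real binder. This is
# only a placeholder for sampling from a generative model of the repertoire.
AA = "ACDEFGHIKLMNPQRSTVWY"
def toy_generator(rng, pool=("CASSLGQAYEQYF", "CASSPDRGAYEQYF")):
    s = list(rng.choice(pool))
    i = rng.randrange(len(s))
    s[i] = rng.choice(AA)
    return "".join(s)

pos = ["CASSLGQAYEQYF"] * 3
neg = ["CASSIRSSYEQYF"] * 10
bal_pos, bal_neg = balance_with_synthetic(pos, neg, toy_generator)
print(len(bal_pos), len(bal_neg))
```

After balancing, both classes contain ten examples: the three real binders plus seven generated ones.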
The distinction between peptide-specific (HLA allele-specific) and pan-specific (generalized across HLA alleles) models is well-motivated and addresses a key challenge in immunogenomics: balancing specificity and generalizability. The peptide-specific models show strong performance on known HLA alleles, which is expected, but the pan-specific model's ability to generalize across diverse HLA types, especially those not represented in training, is critical.
Weaknesses:
The paper would benefit from a more rigorous analysis of the biological validity of the augmented data. Specifically, how do the synthetic CDR3β sequences compare to real CDR3β sequences in terms of sequence similarity and motif conservation? The authors should provide a quantitative assessment of real vs. augmented sequences (e.g., via t-SNE or UMAP projections, or by measuring the overlap in known motif positions) before and after augmentation. Without such validation, the risk of introducing "hallucinated" sequences that distort model learning remains a concern. Moreover, it would strengthen the argument if the authors demonstrated that the performance gains are not merely due to overfitting on synthetic data, but reflect genuine generalization to unseen real data. Ultimately, this can only be established through elaborate experimental wet-lab validation, which may be outside the scope of this study.
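One lightweight form of the quantitative check requested above, short of full t-SNE/UMAP projections, is to compare pooled k-mer frequency profiles of real and augmented sequences. The sketch below is a hypothetical illustration with made-up example sequences, not an analysis from the paper:

```python
from collections import Counter
from math import sqrt

def kmer_profile(seqs, k=3):
    """Pooled k-mer frequency profile of a set of CDR3β sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    keys = set(p) | set(q)
    dot = sum(p.get(x, 0.0) * q.get(x, 0.0) for x in keys)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Illustrative sequences only (not real repertoire data).
real = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGQAYEQYF"]
synthetic = ["CASSLGQAYEQYF", "CASSLGTAYEQYF"]
unrelated = ["AAAAAAA", "GGGGGGG"]

sim_syn = cosine(kmer_profile(real), kmer_profile(synthetic))
sim_far = cosine(kmer_profile(real), kmer_profile(unrelated))
print(sim_syn, sim_far)
```

A well-behaved generator should yield a profile close to the real repertoire (high similarity), whereas "hallucinated" sequences would drift toward the unrelated baseline.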
While generative modeling for sequence data is increasingly common, the choice of RBM, a relatively older architecture, could benefit from stronger justification, especially given the emergence of more powerful and scalable alternatives (e.g., ProGen, ESM, or diffusion-based models). Although BERT was also used, it would be valuable in future work to explore other architectures for data augmentation.
The manuscript would be more compelling if the authors performed a deeper analysis of the pan-specific model's behavior across HLA supertypes and allele groups. Are the learned representations truly "pan" or merely a weighted average of the most common alleles? The authors should assess whether the pan-specific model learns shared binding motifs (anchor residue preferences) and whether these features are interpretable through attention maps. A failure to identify such patterns would raise concerns about the model's interpretability and biological relevance.
The exclusive focus on CDR3β for TCR modeling is biologically problematic. TCRs are heterodimers composed of α and β chains, and the CDR1, CDR2, and CDR3 regions of both chains contribute to antigen recognition. The CDR3β loop is often the most diverse and critical, but CDR3α and the CDR1/2 loops also play significant roles in binding affinity and specificity. By generating only CDR3β sequences and not modeling the full TCR αβ heterodimer, the authors risk introducing a systematic bias toward β-chain-dominated recognition, which does not reflect the full complexity of TCR-peptide-MHC interactions.
Reviewer #2 (Public review):
Summary:
This paper presents a thoughtful and well-motivated strategy to address a major challenge in TCR-epitope binding prediction: data imbalance, particularly the scarcity of positive (binding) TCR, peptide pairs. The authors introduce a two-step pipeline combining data balancing, via undersampling and generative augmentation, and a supervised CNN-based classifier. Notably, the use of Restricted Boltzmann Machines (RBMs) and BERT-style transformer models to generate synthetic CDR3β sequences is shown to improve model performance. The proposed method is applied to both peptide-specific and pan-specific settings, yielding notable performance improvements, especially for in-distribution peptides. Generative augmentation also leads to measurable gains for out-of-distribution epitopes, particularly those with high sequence similarity to the training set.
Strengths:
(1) The authors tackle the well-known but under-addressed issue of class imbalance in TCR-epitope binding data, where negatives vastly outnumber positive (binding) pairs. This imbalance undermines classifier reliability and generalization.
(2) The model is tested on both in-distribution (seen epitopes) and out-of-distribution (unseen epitopes) scenarios. Including a synthetic lattice protein benchmark allows the authors to dissect generalization behavior in a controlled environment.
(3) The paper shows a measurable benefit of generative augmentation. For example, AUC improvements of up to +0.11 are observed for peptides closely related to those seen during training, demonstrating the method's practical impact.
(4) A direct comparison between RBM- and Transformer-based sequence generators adds value, offering the community guidance on trade-offs between different generative architectures in TCR modeling applications.
Weaknesses:
(1) Generalization degrades with epitope dissimilarity
The performance drops substantially as the test epitope becomes more dissimilar to the training set. This is expected, but it highlights an essential limitation of the generative models: they help only when the test epitope is similar to one already seen. Table 1 shows that the performance gain from generative augmentation decreases as the test epitope becomes more dissimilar to the training epitopes. For epitopes with a Levenshtein distance of 1 from the training set, the average AUC improvement is approximately +0.11. This gain drops to around +0.06 for epitopes at distance 2, and becomes minimal at distance 4, indicating a clear limitation in the model's ability to generalize to more distant epitopes. The authors should quantify more explicitly how far the model can generalize effectively. What is the performance degradation threshold as a function of Levenshtein distance?
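The distance stratification requested here could be made explicit with a simple analysis. The sketch below groups test epitopes by their minimum Levenshtein distance to the training set; the epitope strings are illustrative examples, not the paper's data:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def stratify_by_distance(test_epitopes, train_epitopes):
    """Group each test epitope by its minimum edit distance to the training set,
    so per-distance AUC (and AUC gain) can then be reported per group."""
    groups = {}
    for ep in test_epitopes:
        d = min(levenshtein(ep, tr) for tr in train_epitopes)
        groups.setdefault(d, []).append(ep)
    return groups

# Illustrative epitopes only.
train = ["GILGFVFTL", "NLVPMVATV"]
test = ["GILGFVFTL", "GILGFVFTV", "ELAGIGILTV"]
groups = stratify_by_distance(test, train)
print(sorted(groups))
```

Reporting AUC per distance group would directly answer where the degradation threshold lies.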
(2) What is the minimal number of positive samples needed for data augmentation to help?
The approach has an intrinsic catch-22: generative models require data to learn the underlying distribution and cannot be applied to epitopes with insufficient data. As a result, the method is unlikely to be effective for completely new epitopes. Could the authors quantify the minimum number of real binders needed for effective generative augmentation? This would be particularly relevant for zero-shot or few-shot prediction scenarios, where only 0-10 positive samples are available. Such experiments would help clarify the practical limits of the proposed strategy.
(3) Lack of end-to-end evaluation on unseen epitopes as inputs
The authors frame peptide-specific models as classification over a few known epitopes, a closed-set formulation. While this is useful for evaluating generation effects, it's not representative of the more practical open-set task of predicting binding to truly novel epitopes. A stronger test would include models that take peptides as input (e.g., pan-specific, peptide-conditioned classifiers), including unseen epitopes at test time. Could the authors attempt an evaluation on benchmarks like IMMREP25 or other datasets where test epitopes are excluded from training?
(4) Focus on β-chain limits generalizability
The current pipeline is trained exclusively on CDR3β sequences. However, the field is increasingly moving toward single-cell sequencing, which provides paired α/β TCR chain data. Understanding how the proposed approach performs when both chains are available would be valuable. Could the authors evaluate the performance gains on paired α/β information, even in a small subset of single-cell data?
(5) Synthetic lattice proteins (LPs) have limited biological fidelity
While the LP-based benchmark presented in Figure 5 is a clever and controlled tool for probing model generalization, it remains conceptually and biophysically distant from real TCR-peptide interactions. Its utility as a toy model is valid, but its limitations should be more explicitly acknowledged:
a) Over-simplified binding landscape: The LP system is designed for tractability, with a simplified sequence-structure mapping and fixed lattice constraints. As shown in Figure 5c, the LP binding landscape is linearly separable, in stark contrast to the complex and often degenerate nature of real TCR-epitope interactions, where multiple structurally distinct TCRs can bind the same peptide and vice versa.
b) Absence of immunological context: The LP model abstracts away key biological factors such as MHC restriction, α/β chain pairing, peptide presentation, and structural constraints of the TCR-pMHC complex. These are essential for understanding binding specificity in actual immune repertoires.
c) Overestimation of generalization: While performance drops on more distant LP structures, even these are structurally and statistically more similar to the training data than truly novel biological epitopes. Thus, the LP benchmark likely underestimates the true difficulty of out-of-distribution generalization in real-world TCR prediction tasks.
d) Simplified biophysics: The LP simulations rely on coarse-grained energy models and empirical potentials that do not capture conformational dynamics, side-chain flexibility, or realistic binding energetics of TCR-peptide interfaces.
In summary, while the LP benchmark helps isolate specific generalization behaviors and sanity-check model performance under controlled perturbations, its biological relevance is limited. The authors should explicitly frame these assumptions and limitations to prevent overinterpreting results from this synthetic system.
Reviewer #3 (Public review):
Summary:
The authors present a method to address class imbalance in T cell receptor (TCR)-epitope binding datasets by generating synthetic positive binding examples using generative models, specifically BERT-based architectures and Restricted Boltzmann Machines (RBMs). They hypothesize that improving class balance can enhance model performance in predicting TCR-peptide binding.
Strengths:
(1) Interesting biological as well as technical topic.
(2) Solid technical foundations.
Weaknesses:
(1) Fundamental Biological Oversight:
While the computational strategy of augmenting positive samples via generative models is technically interesting, the manuscript falls short in addressing key biological considerations. Specifically, the authors simulate and evaluate only CDR3β-peptide binding interactions. However, antigen recognition by T cells involves both the α- and β-chains of the TCR. The omission of CDR3α undermines the biological realism and limits the generalizability of the findings.
(2) Validation of Simulated Data:
The central claim of the manuscript is that simulated positive examples improve predictive performance. However, there is no rigorous validation of the biological plausibility or realism of the generated TCR sequences. Without independent evaluation (e.g., testing whether synthetic TCR-peptide pairs are truly binding), it remains unclear whether the performance gains are biologically meaningful or merely reflect artifacts of the generation process.
(3) Risk of Bias and Overfitting:
Training and evaluating models with generated data introduces a risk of circularity and bias. The observed improvements may not reflect better generalization to real-world TCR-epitope interactions but could instead arise from overfitting to synthetic patterns. Additional testing on independent, biologically validated datasets would help clarify this point.