Effects of imbalanced data on ML classifiers.

a) 2D visualization of a two-class training dataset, with the majority class (yellow) significantly outnumbering the minority one (blue). Data are generated from a Gaussian mixture; the optimal decision boundary (DB) separating the classes passes perpendicularly through the midpoint of the line segment connecting the centers of the two clusters (dark line). A linear classifier trained under imbalance learns a sub-optimal decision boundary (light line), which leads to poor predictive performance (see App. 2 for more details). b) Accuracy and AUC scores for a predictive model trained to distinguish YVLDHLIVV- and YLQPRTFLL-specific CDR3β sequences from bulk CDR3βs, as a function of the fraction ρN of background data in the training set. In practice, we fix the class size of peptide-specific sequences and vary the size of the background-sequence class to change ρN. Performance (evaluated on a balanced test set) is optimal when the two classes are of roughly equal size, i.e. when ρN ≃ 0.5. c) Graphical visualization of the imbalanced composition of TCR datasets, in a two-class setting where P and N denote the class sizes. Our work proposes to restore class balance (i.e. P = N, straight line) by introducing a generative model able to sample new sequences compatible with the positive class, for which few experimental data are available.
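The shift of the decision boundary under imbalance, and the resulting drop in balanced-test accuracy, can be illustrated analytically in a 1D analogue of panel a). This is a sketch under assumed equal-variance Gaussian classes centered at ±μ (not the paper's actual simulation): a MAP classifier's boundary moves away from the midpoint by the log-prior ratio.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def balanced_test_accuracy(prior_neg, mu=1.0, sigma=1.0):
    """Accuracy on a balanced test set of a MAP linear classifier trained
    with class priors (prior_neg, 1 - prior_neg) on two 1D Gaussians
    centered at -mu (negative class) and +mu (positive class).
    Illustrative toy model, not the paper's 2D experiment."""
    # MAP boundary: shifted from the midpoint by the log-prior ratio
    x_star = sigma**2 * math.log(prior_neg / (1.0 - prior_neg)) / (2.0 * mu)
    # Balanced test set: average the two per-class accuracies
    acc_pos = 1.0 - phi((x_star - mu) / sigma)  # P(x > x* | positive)
    acc_neg = phi((x_star + mu) / sigma)        # P(x < x* | negative)
    return 0.5 * (acc_pos + acc_neg)

print(balanced_test_accuracy(0.50))  # balanced training: boundary at midpoint
print(balanced_test_accuracy(0.99))  # heavy imbalance: boundary shifts, accuracy drops
```

With balanced priors the boundary sits at the midpoint (accuracy ≈ 0.84 for μ = σ = 1); with a 99:1 prior the boundary shifts by ln(99)/2 ≈ 2.3 toward the minority class and balanced-test accuracy falls toward chance, mirroring panel b).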

Learning pipelines for peptide-specific (top) and pan-specific (bottom) models.

(Top): peptide-specific models. Column a): Data consist of a few CDR3β sequences known to bind some epitopes (colored symbols and segments) and of many ‘negative’ sequences (yellow). Column b): A generative model is trained over peptide-specific CDR3β sequences, here corresponding to the orange epitope. After training, Gibbs sampling of the inferred probability landscape allows us to generate putative peptide-specific sequences. Column c): A supervised CNN architecture is trained over (natural and generated) peptide-specific CDR3βs and background CDR3βs; after learning, the network is used as a predictive model for TCR specificity over in- and out-of-distribution (black sequences) test data. (Bottom): Pan-specific models. Column a): Compared to the pipeline above, input data are joint sequences of peptides (left, lighter color) and of TCRs (right). Background sequences are obtained through mismatch pairing. Column b): The generative models produce putative binding pairs of peptide and TCR sequences. Column c): A supervised classifier is trained to carry out TCR-epitope binding predictions.
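The Gibbs-sampling step of column b) can be sketched as follows. This is a minimal toy sampler over a Potts-like energy landscape with assumed fields `h` and couplings `J` (the paper's actual samplers are RBM- and BERT-based); each position is resampled in turn from its conditional distribution given the rest of the sequence.

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter amino-acid alphabet

def gibbs_sample(h, J, L, n_sweeps=100, rng=random):
    """Toy Gibbs sampler for a Potts-like sequence model of length L.
    h[i][a]: field at position i for amino acid index a;
    J[i][j][a][b]: coupling between positions i and j.
    Illustrative sketch only, not the paper's RBM/BERT sampler."""
    q = len(AA)
    seq = [rng.randrange(q) for _ in range(L)]  # random initial sequence
    for _ in range(n_sweeps):
        for i in range(L):
            # conditional log-weights of each amino acid at position i
            logits = [h[i][a] + sum(J[i][j][a][seq[j]]
                                    for j in range(L) if j != i)
                      for a in range(q)]
            m = max(logits)
            w = [math.exp(x - m) for x in logits]
            r = rng.random() * sum(w)
            for a, wa in enumerate(w):  # sample from the conditional
                r -= wa
                if r <= 0:
                    seq[i] = a
                    break
            else:
                seq[i] = q - 1  # numerical safety
    return "".join(AA[a] for a in seq)
```

For instance, with zero couplings and a strong field favoring one residue at one position, sampled sequences conserve that residue, analogous to the conserved motifs in the learned landscapes.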

In-distribution performances of peptide-specific models.

AUC and ACC scores of the predictive models for a multiclass classification task involving three peptide-specific sets of CDR3β and background CDR3β sequences, evaluated over a balanced test set of sequences of the same classes (in-distribution case). We compare performances for different training datasets, whose balance is restored through undersampling alone, or by generating new CDR3β sequences via an RBM- or BERT-based generative architecture. The baseline scores refer to an imbalanced training dataset, whose composition can be derived from the class sizes of each epitope as reported in App. 1; 250,000 background CDR3βs are used. Results confirm the benefit of both restoring balance in the training dataset and enlarging the peptide-specific CDR3β space through generative models. Dashed black lines indicate random performance levels.

Pan-specific predictive model results.

Differences in AUC and ACC scores when balance is achieved through undersampling alone (no generation) or with data augmentation as well. In the latter case, new CDR3β sequences were generated only for under-represented groups of peptides (triangular markers). Marker sizes are proportional to the raw group size of natural sequence pairs in the dataset. Here, 𝒢 = 400 (for more details on the choice of this threshold, see App. 5).

Out-of-distribution performances evaluated with AUC and ACC metrics across a test set composed of wild-type binders of the target epitope and CDR3β sequences sampled from other, unseen epitopes.

For each prediction, we separately train a classifier on an enlarged training set that also contains synthetic binders of the target out-of-distribution epitope, generated through the pan-specific model (trained on the in-distribution dataset only, but with the target epitope provided during the generative step, as explained in the text). For comparison, the columns labeled [u] report scores obtained by balancing the training set through undersampling of the negative class only. The column d[in] reports the Levenshtein distance to the closest in-distribution epitope, showing that scores degrade when moving away from in-distribution data.
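The distance d[in] used above is the standard Levenshtein (edit) distance, minimized over the training epitopes. A minimal sketch (the helper name `d_in` is ours, for illustration):

```python
def levenshtein(a, b):
    """Edit distance between two peptide sequences (dynamic programming,
    keeping only the previous row of the DP table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def d_in(epitope, training_epitopes):
    """d[in]: Levenshtein distance to the closest in-distribution epitope."""
    return min(levenshtein(epitope, e) for e in training_epitopes)
```

Test examples: an epitope already in the training set has d[in] = 0, and predictions are expected to degrade as d[in] grows.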

Out-of-distribution predictions on synthetic LP dimers for pan-specific case.

a) The two structures of the proteins forming the dimer, with the amino acids defining a strong binding interaction connected by dotted lines. b) Histogram of single-structure folding scores (the lower, the better), computed according to the ground truth of the model (see App. 6, Eq. (16)). In practice, we take the MSA of binding sequences used for training and compute their folding scores in their native structure, or as if they folded into an out-of-distribution close or not-close structure. The three distributions confirm that the sequence data (orange) are closer to the close structure (green) than to the not-close one (blue). c) tSNE visualization of in-distribution training data (binders and non-binders, green and red respectively) and out-of-distribution hold-out data (binders of the close and not-close structures) over the embeddings of our CNN architecture. In this layer the classification is linear, and a clear decision boundary separates green and red data points. Accuracy on in-distribution data is ACC = 0.99. The close out-of-distribution data are similar to the training data, hence the model performs well on them (ACC = 0.98); conversely, it performs worse on the not-close out-of-distribution data (blue points, ACC = 0.82).

List of the epitopes selected to collect specific CDR3β sequences in order to form the database used in the analysis of peptide-specific models in Results.

List of peptide-specific classes in the dataset used in the analysis of peptide-specific models.

We report sizes of the peptide-specific classes defined by peptide sequences in the TCR dataset we used to study peptide-specific models in Results. Imbalance ratios for the 5 cases reported in Fig. 3 can be obtained from the relative abundances of the corresponding triplets of classes.

List of peptide-specific classes in the dataset used in the analysis of pan-specific models.

We report sizes and imbalance-ratio details for the peptide-specific classes defined by peptide sequences in the TCR dataset used to study pan-specific models in Results. Only a few peptide-specific classes contain more than 1% of the CDR3β binding sequences, causing heavy imbalance. Details refer to the dataset before the training/test split.

Sequence logos of peptide-specific CDR3β classes.

We report the sequence logo profiles for peptide-specific classes of CDR3β sequences, after alignment to the maximal length L = 20. The PM is learned over such profiles for each class. The sequence logos show high conservation of the CASS motif and of the last amino acid, while there is more variability in the central region, which is indeed responsible for the binding affinity.
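The per-position frequency profile underlying a sequence logo (and over which a profile model like the PM can be learned) can be sketched as follows. This is an illustrative stand-in: the naive right-padding in `pad` is an assumption for simplicity, whereas the paper uses a proper alignment to L = 20.

```python
from collections import Counter

GAP = "-"

def pad(seq, L=20):
    """Naive alignment to fixed length: right-pad with gap symbols.
    (Stand-in for the actual alignment procedure.)"""
    return seq + GAP * (L - len(seq))

def position_profile(seqs, L=20):
    """Per-position symbol frequencies: the quantity a sequence logo
    displays, and the profile an independent-site model is fit to."""
    padded = [pad(s, L) for s in seqs]
    profile = []
    for i in range(L):
        counts = Counter(s[i] for s in padded)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile
```

On CDR3β data, such a profile would show columns with a single dominant symbol at the conserved CASS motif and flatter distributions in the variable central region.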

Performance drop due to randomness effect.

Here we report AUC and ACC scores for a VTEHDTLLY-specific model trained over a dataset containing a fraction ρrand of random sequences. When the generative model is not good enough and adds noise to the under-represented class, performance drops to that of a random classifier.

Performance scores for peptide-specific models with the PM generative model.

The supervised architecture is the same as in Fig. 3 of the main text and is trained in the same way, on the same datasets, with balance restored via generation of new CDR3β samples.

Out-of-distribution analysis on synthetic Lattice-Protein dimers.

a) Densities of binding scores 𝒫 bind for strong, weak and non-binding compounds. The y-axis is cut for visualization purposes, as “Non binder” compounds concentrate around zero score. b) Receiver Operating Characteristic (ROC) curves for in-distribution (Strong vs. Weak) and out-of-distribution (Strong vs. Non binder and Weak vs. Non binder) test sets. The weak-vs-non-binder classification has the worst performance. c) tSNE visualization of the embeddings (last feature layer) produced by our CNN architecture for in-distribution test data (strong and weak binders) and out-of-distribution hold-out data (non-binders). Classification is carried out from linear combinations of the embeddings of the input data points; better separation of the clusters reflects higher model performance.
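The AUC reported throughout these ROC analyses has a simple rank interpretation: it equals the probability that a randomly chosen positive scores above a randomly chosen negative (the Mann-Whitney statistic). A minimal O(P·N) sketch of that equivalence (an illustration, not the paper's evaluation code):

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the Mann-Whitney probability
    that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))
```

Perfectly separated scores give AUC = 1.0, while indistinguishable score distributions give AUC = 0.5, the random-classifier level marked by dashed lines in the figures.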

Out-of-distribution performances are related to the distribution of features.

tSNE visualization of the feature vectors (in the second-to-last layer of the classifier architecture) of out-of-distribution data, for a model trained over the WT epitope ELAGIGILTV with a balanced dataset containing 1500 CDR3βs per class. Top: like the WT epitope, EAAGIGILTV is associated with melanoma and is thus targeted by TCRs sharing similar features. Bottom: the VQELYSPIFLIV peptide is 8 mutations away from the WT and is involved in SARS-CoV-2 infections. Our model predictions are reliable for the first epitope (AUC = 0.79) but not for the second (AUC = 0.54). The tSNE plots support the claim that out-of-distribution specificity predictions degrade when the feature vectors of the test data are far from those of the training data.

Pan-specific prediction scores for out-of-distribution tasks.

The AUC (second column) and ACC (fourth column) scores are averaged across many out-of-distribution epitopes, with balance restored through pan-specific generation. Out-of-distribution CDR3βs are grouped according to the Levenshtein distance of their associated epitope to the closest epitope in the training dataset (first column). Predictions worsen as peptides get further away from those in the training dataset. The third and fifth columns show the standard deviations of the scores.

Dependence of AUC scores on the choice of the hyperparameter 𝒢.

We report results of numerical experiments for the TCR-peptide binding prediction task, comparing AUC performance before and after use of the generative model (x and y axes, respectively). The data-point size is proportional to the number of natural peptide-specific sequences in the dataset. All training parameters are fixed across experiments; we rescale the number of training epochs with the training-dataset size so that each experiment minimizes the loss function exactly the same number of times. This factors out all effects except the dependence of performance on the threshold value 𝒢.
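The epoch-rescaling rule described above amounts to fixing the total number of gradient updates across runs. A sketch of the arithmetic (the function name and the default batch size of 64 are illustrative assumptions, not values taken from the paper):

```python
import math

def epochs_for_fixed_updates(target_updates, n_train, batch_size=64):
    """Choose the number of training epochs so that every run performs
    (approximately) the same total number of gradient updates, regardless
    of training-set size. batch_size=64 is an assumed example value."""
    updates_per_epoch = math.ceil(n_train / batch_size)
    return max(1, round(target_updates / updates_per_epoch))
```

For example, a run with a 10× smaller training set gets 10× more epochs, so that the loss is minimized the same number of times in both cases.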

Performance change with peptide-specific vs. pan-specific generative models.

Here we report AUC and ACC scores for a subset of epitopes, in the cases where each peptide class is augmented using either a peptide-specific generative model or a global one. Given our sampling protocols, the two approaches are equivalent in terms of performance scores, while the pan-specific one remains computationally advantageous over the peptide-specific approach.

Unsupervised vs. supervised classification.

(Left) Histograms of the scores assigned by the unsupervised model to balanced test sets of in-distribution epitopes; lower perplexity corresponds to a better score. The gray dashed line marks the threshold obtained by maximizing the prediction accuracy over positive and negative data. Epitopes in the top row are frequent ones and are not enlarged during training of the supervised classifier. (Right) Comparison of accuracy scores between unsupervised classification alone ([BERT]) and the full pipeline ([BERT+CNN]) of Fig. 2, for the same peptides as in the left panel. Values in the [BERT+CNN] column can be obtained from Fig. 4 of the main text.
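The accuracy-maximizing threshold marked by the gray dashed line can be found by a simple scan over candidate cut points. This is an illustrative sketch (function name ours): since lower perplexity means a better score here, scores at or below the threshold are predicted positive.

```python
def best_threshold(pos_scores, neg_scores):
    """Scan candidate thresholds and return the one maximizing accuracy on
    a balanced positive/negative set. Lower score = predicted positive,
    matching a perplexity-like score. Illustrative sketch only."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    best_t, best_acc = None, -1.0
    for t in candidates:
        tp = sum(s <= t for s in pos_scores)  # positives correctly kept
        tn = sum(s > t for s in neg_scores)   # negatives correctly rejected
        acc = 0.5 * (tp / len(pos_scores) + tn / len(neg_scores))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

With well-separated score histograms the scan recovers a threshold between the two modes; with overlapping histograms the achievable accuracy drops accordingly.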