Enhancing TCR specificity predictions by combined pan- and peptide-specific training, loss-scaling, and sequence similarity integration

  1. Mathias Fynbo Jensen
  2. Morten Nielsen (corresponding author)
  1. Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Denmark
14 figures, 4 tables and 3 additional files

Figures

Figure 1
Architecture of the pre-trained model.

The pan-specific CNN block consists of the layers shown in blue, whereas the peptide-specific CNN block consists of the layers shown in red. During the pan-specific training, the weights and biases …

Figure 2
Per peptide performance of the peptide-specific and pan-specific NetTCR 2.1 in terms of AUC, when trained and evaluated on the new dataset.

The peptides are sorted based on the number of positive observations from most abundant to least abundant, with the number of positive observations listed next to the peptide sequence. The …

Figure 3 with 1 supplement
Boxplot of AUC of the pan- and peptide-specific NetTCR 2.1 and 2.2 models, respectively.

The NetTCR 2.2 models include the updates to the model architecture, with the primary change being the introduction of dropout for the concatenated max-pooling layer (dropout rate = 0.6). Both the …

Figure 3—figure supplement 1
Boxplot of AUC 0.1 of the pan- and peptide-specific NetTCR 2.1 and 2.2 models, respectively.

The NetTCR 2.2 models include the updates to the model architecture, with the primary change being the introduction of dropout for the concatenated max-pooling layer (dropout rate = 0.6). Both the …
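The AUC 0.1 metric used in these supplements focuses on the low false-positive-rate region of the ROC curve. A minimal numpy sketch, assuming the definition is the area under the ROC curve up to FPR = 0.1 divided by 0.1 (the article may instead use a different standardization, such as the McClish correction behind scikit-learn's `max_fpr`):

```python
import numpy as np

def auc01(y_true, y_score, max_fpr=0.1):
    """Area under the ROC curve up to `max_fpr`, normalized by `max_fpr`.

    Assumed definition of 'AUC 0.1': a perfect ranking scores 1.0; a
    ranking with no true positives below FPR 0.1 scores 0.0.
    """
    y = np.asarray(y_true, dtype=float)[np.argsort(-np.asarray(y_score))]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    # cut the curve at max_fpr, interpolating the TPR at the cut point
    tpr_at = np.interp(max_fpr, fpr, tpr)
    keep = fpr <= max_fpr
    x = np.concatenate([fpr[keep], [max_fpr]])
    t = np.concatenate([tpr[keep], [tpr_at]])
    area = np.sum((x[1:] - x[:-1]) * (t[1:] + t[:-1]) / 2.0)  # trapezoid rule
    return float(area / max_fpr)
```

Restricting the integral in this way rewards models that rank binders highly among their very top predictions, which matters when only a handful of candidate TCRs can be validated experimentally.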

Figure 4
Mean AUC of the pan-specific and peptide-specific NetTCR 2.2 models, when trained on the original redundancy reduced training data and with redundant observations added back.

The AUC is reported in terms of weighted and unweighted mean across all peptides, as well as unweighted mean when the data is split into peptides with at least 100 positive observations, and less …
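The distinction between the two means can be made concrete: the weighted mean weights each peptide's AUC by its number of positive observations, whereas the unweighted mean treats all peptides equally. A sketch with hypothetical AUC values (the positive counts are the post-reduction counts from Table 1 for three example peptides):

```python
# Hypothetical per-peptide AUCs; counts are post-reduction positives from Table 1.
per_peptide_auc = {"GILGFVFTL": 0.90, "RAKFKQLL": 0.85, "HPVTKYIM": 0.60}
n_positives = {"GILGFVFTL": 1125, "RAKFKQLL": 934, "HPVTKYIM": 53}

# Unweighted: every peptide counts the same, regardless of data volume.
unweighted_mean = sum(per_peptide_auc.values()) / len(per_peptide_auc)

# Weighted: each peptide's AUC is weighted by its share of positive observations.
total = sum(n_positives.values())
weighted_mean = sum(per_peptide_auc[p] * n_positives[p] / total for p in per_peptide_auc)
```

With these made-up numbers the weighted mean exceeds the unweighted mean, since the abundant peptides happen to perform best; reporting both guards against abundant peptides masking poor performance on rare ones.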

Figure 5 with 3 supplements
Difference in AUC between pan-specific CNN trained on the limited dataset (70th percentile) and full dataset.

Peptides with TCRs originating solely from 10x sequencing are highlighted in red. In both cases, the performance was evaluated per peptide on the full dataset. A positive ΔAUC indicates that the …

Figure 5—figure supplement 1
Prediction values on the full test data for each peptide when predicted using the NetTCR 2.2 - Peptide model.

The prediction scores are shown for model 5 in Supplementary file 1.

Figure 5—figure supplement 2
Mean AUC of the pan-specific NetTCR 2.2 models when trained on datasets with potential outliers removed.

The percentile refers to the threshold of prediction scores used for removing observations (see Materials and methods), and the higher the percentile is, the more observations are removed from …

Figure 5—figure supplement 3
Percentage of observations discarded for the 70th percentile limited dataset, as a result of the removal of potential outliers.

Figure 6 with 1 supplement
Per peptide performance of the updated peptide-specific, pan-specific, and pre-trained CNN in terms of AUC, when trained on the limited training dataset and evaluated on the full dataset.

The peptides are sorted based on the number of positive observations from most abundant to least abundant, with the number of positive observations listed next to the peptide sequence. The …

Figure 6—figure supplement 1
Per peptide performance of the updated peptide-specific, pan-specific, and pre-trained CNN in terms of AUC 0.1, when trained on the limited training dataset and evaluated on the full dataset.

The peptides are sorted based on the number of positive observations from most abundant to least abundant, with the number of positive observations listed next to the peptide sequence. The …

Figure 7 with 1 supplement
Performance of the TCRbase ensemble as a function of α, along with boxplots of the optimal α in terms of AUC and AUC 0.1 for the validation partitions.

(A) The predictions of the pre-trained model ensemble (trained on the limited dataset) on the test partitions (full data) were scaled by the kernel similarity to known binders, as given by TCRbase …
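The α-scaling described in this caption can be sketched as a multiplicative rescaling of the CNN output by the TCRbase similarity score. The exponent form below is an assumption consistent with "scaled by the kernel similarity to known binders"; α = 0 recovers the CNN alone, and larger α trusts the sequence-similarity baseline more:

```python
def ensemble_score(cnn_pred: float, tcrbase_pred: float, alpha: float) -> float:
    """Rescale a CNN prediction by TCRbase similarity-to-known-binders.

    alpha = 0 ignores TCRbase entirely; both inputs are assumed to lie
    in [0, 1], so raising tcrbase_pred to a larger alpha down-weights
    TCRs that are dissimilar to known binders more aggressively.
    """
    return cnn_pred * tcrbase_pred ** alpha
```

In practice α would be selected on the validation partitions, as Figure 7 illustrates, rather than fixed a priori.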

Figure 7—figure supplement 1
Difference in true positive rate (TPR) between the TCRbase ensemble (pre-trained + TCRbase models) and the pre-trained models as a function of false positive rate (FPR).

A positive ΔTPR corresponds to an increased performance of the TCRbase ensemble compared to the pre-trained models alone. The models used for this figure are model 16 (NetTCR 2.2 - Pre-trained) and …

Figure 8
Boxplot of direct prediction scores and percentile ranks per peptide of the full test dataset for the TCRbase ensemble.

Peptides with 100% of positive observations coming from 10x sequencing are highlighted in red. The model used in this figure is model 17 (TCRbase ensemble) in Supplementary file 1.
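Percentile rank, under a common convention (an assumption here; the article's exact background set may differ), is the percentage of a background score distribution, e.g. negatives scored against the same peptide, that exceeds the query score, so lower ranks indicate stronger predicted binding:

```python
import numpy as np

def percentile_rank(score: float, background_scores) -> float:
    """Percent of background scores strictly greater than `score`.

    0.0 means the query outscores the entire background. The choice of
    background distribution is an assumption for illustration, not
    necessarily the one used for the figure above.
    """
    bg = np.asarray(background_scores, dtype=float)
    return float(100.0 * np.mean(bg > score))
```

Ranking on percentiles rather than raw scores makes predictions comparable across peptides whose score distributions differ in scale.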

Figure 9 with 1 supplement
Percentage of correctly chosen true peptide-TCR pairs for each peptide in the limited dataset.

This was evaluated using the direct prediction score (blue) and the percentile rank (orange) of the TCRbase ensemble. KLGGALQAK, AVFDRKSDAK, NLVPMVATV, CTELKLSDY, RLRAEAQVK, RLPGVLPRA, and …

Figure 9—figure supplement 1
Boxplot of average rank per peptide for the final updated models.

The rank was evaluated on the limited dataset covering 21 peptides, that is excluding the peptides with low performance (KLGGALQAK, AVFDRKSDAK, NLVPMVATV, CTELKLSDY, RLRAEAQVK, RLPGVLPRA and …

Figure 10
Boxplot of percentile ranks per peptide in the rank test, with KLGGALQAK, NLVPMVATV, CTELKLSDY, RLRAEAQVK, RLPGVLPRA, and SLFNTVATLY excluded.

AVFDRKSDAK was included as an example of a peptide with a poor rank in the rank test. Top TP: Percentile rank of the correctly chosen pairs. Second TN: Percentile rank of the second-best pair, when …

Figure 11 with 1 supplement
Per peptide performance of the old (NetTCR 2.1) and updated (NetTCR 2.2) pan-specific CNN models trained in a leave-one-out setup.

The performance was evaluated in terms of AUC on the full dataset. The performance shown in this figure is based on model 63 (NetTCR 2.1 - Leave one out) and model 19 (NetTCR 2.2 - Leave one out) in …

Figure 11—figure supplement 1
Per peptide performance of the old (NetTCR 2.1) and updated (NetTCR 2.2) pan-specific CNN models trained in a leave-one-out setup.

The performance was evaluated in terms of AUC 0.1 on the full dataset. The performance shown in this figure is based on model 63 (NetTCR 2.1 - Leave one out) and model 19 (NetTCR 2.2 - Leave one out) in …

Figure 12 with 1 supplement
Performance in terms of AUC of various models trained on increasing amounts of data.

These models were trained on the following peptides: GILGFVFTL, RAKFKQLL, ELAGIGILTV, IVTDFSVIK, LLWNGPMAV, CINGVCWTV, GLCTLVAML, and SPRWYFYYL. The pre-trained models were based on the …

Figure 12—figure supplement 1
Performance in terms of AUC 0.1 of various models trained on increasing amounts of data.

These models were trained on the following peptides: GILGFVFTL, RAKFKQLL, ELAGIGILTV, IVTDFSVIK, LLWNGPMAV, CINGVCWTV, GLCTLVAML, and SPRWYFYYL. The pre-trained models were based on the leave-one-out …

Figure 13 with 1 supplement
Boxplot of reported unweighted AUC per peptide for the models in the IMMREP benchmark, as well as the updated NetTCR 2.2 models.

Except for the updated NetTCR 2.2 models (NetTCR 2.2 - Pan, NetTCR 2.2 - Peptide, NetTCR 2.2 - Pre-trained and TCRbase ensemble) the performance of all models is equal to the reported performance in …

Figure 13—figure supplement 1
Boxplot of average rank per peptide per model in the IMMREP test data, as reported in the IMMREP benchmark.

The updated NetTCR 2.2 models are included to the right. The color of the bars indicates the type of input used by the model. Machine-learning models are labeled with black text, whereas …

Figure 14
Boxplot of unweighted AUC per peptide for the NetTCR 2.1 and 2.2 models, when trained and evaluated on the redundancy reduced dataset.

The evaluation was performed using a nested cross-validation setup. The performance is based on model 58 (NetTCR 2.1 - Peptide), model 59 (NetTCR 2.2 - Pan), model 60 (NetTCR 2.2 - Peptide), model …

Tables

Table 1
Per peptide overview of the full positive training data.

The source organism for each epitope is shown, together with the MHC allele it binds. Additionally, the number of observations discarded during each redundancy reduction step, as well …

| Peptide | Organism | MHC | Pre-reduction count | Removed in first reduction | Removed in second reduction | Post-reduction count | Not 10x | 10x |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GILGFVFTL | Influenza A virus | HLA-A*02:01 | 1897 | 645 | 127 | 1125 | 426 | 699 |
| RAKFKQLL | Epstein-Barr virus | HLA-B*08:01 | 1065 | 114 | 17 | 934 | 0 | 934 |
| KLGGALQAK | Human CMV | HLA-A*03:01 | 912 | 8 | 2 | 902 | 0 | 902 |
| AVFDRKSDAK | Epstein-Barr virus | HLA-A*11:01 | 725 | 5 | 4 | 716 | 0 | 716 |
| ELAGIGILTV | Melanoma neoantigen | HLA-A*02:01 | 435 | 6 | 3 | 426 | 55 | 371 |
| NLVPMVATV | Human CMV | HLA-A*02:01 | 384 | 43 | 11 | 330 | 154 | 176 |
| IVTDFSVIK | Epstein-Barr virus | HLA-A*11:01 | 323 | 13 | 2 | 308 | 0 | 308 |
| LLWNGPMAV | Yellow fever virus | HLA-A*02:01 | 322 | 72 | 21 | 229 | 229 | 0 |
| CINGVCWTV | Hepatitis C virus | HLA-A*02:01 | 231 | 4 | 1 | 226 | 75 | 151 |
| GLCTLVAML | Epstein-Barr virus | HLA-A*02:01 | 278 | 59 | 7 | 212 | 95 | 117 |
| SPRWYFYYL | SARS-CoV-2 | HLA-B*07:02 | 158 | 4 | 5 | 149 | 149 | 0 |
| ATDALMTGF | Hepatitis C virus | HLA-A*01:01 | 128 | 21 | 4 | 103 | 0 | 103 |
| DATYQRTRALVR | Influenza A virus | HLA-A*68:01 | 100 | 4 | 3 | 93 | 93 | 0 |
| KSKRTPMGF | Hepatitis C virus | HLA-B*57:01 | 115 | 14 | 12 | 89 | 0 | 89 |
| YLQPRTFLL | SARS-CoV-2 | HLA-A*02:01 | 69 | 6 | 1 | 62 | 54 | 8 |
| HPVTKYIM | Hepatitis C virus | HLA-B*08:01 | 60 | 5 | 2 | 53 | 0 | 53 |
| RFPLTFGWCF | HIV-1 | HLA-A*24:02 | 58 | 7 | 0 | 51 | 51 | 0 |
| GPRLGVRAT | Hepatitis C virus | HLA-B*07:02 | 51 | 3 | 0 | 48 | 0 | 48 |
| CTELKLSDY | Influenza A virus | HLA-A*01:01 | 48 | 0 | 0 | 48 | 48 | 0 |
| RLRAEAQVK | Epstein-Barr virus | HLA-A*03:01 | 47 | 0 | 0 | 47 | 0 | 47 |
| RLPGVLPRA | AML neoantigen | HLA-A*02:01 | 43 | 0 | 0 | 43 | 0 | 43 |
| SLFNTVATLY | HIV-1 | HLA-A*02:01 | 38 | 0 | 0 | 38 | 0 | 38 |
| RPPIFIRRL | Epstein-Barr virus | HLA-B*07:02 | 40 | 2 | 2 | 36 | 24 | 12 |
| FEDLRLLSF | Influenza A virus | HLA-B*37:01 | 31 | 0 | 0 | 31 | 31 | 0 |
| VLFGLGFAI | T1D neoantigen | HLA-A*02:01 | 32 | 1 | 0 | 31 | 31 | 0 |
| FEDLRVLSF | Influenza A virus | HLA-B*37:01 | 36 | 0 | 13 | 23 | 23 | 0 |
Table 2
Overview of number of TCRs for each peptide in the IMMREP 2022 training dataset before and after redundancy reduction.

The redundancy reduction was performed using a kernel similarity threshold of 95%.

| Peptide | Pre-reduction count | Post-reduction count | Percent redundant |
| --- | --- | --- | --- |
| All | 2445 | 1960 | 19.8% |
| GILGFVFTL | 544 | 301 | 44.7% |
| NLVPMVATV | 274 | 242 | 11.7% |
| YLQPRTFLL | 267 | 227 | 15.0% |
| TTDPSFLGRY | 193 | 187 | 3.1% |
| LLWNGPMAV | 188 | 175 | 6.9% |
| CINGVCWTV | 183 | 179 | 2.2% |
| GLCTLVAML | 146 | 91 | 37.7% |
| ATDALMTGF | 104 | 78 | 25.0% |
| LTDEMIAQY | 100 | 94 | 6.0% |
| SPRWYFYYL | 92 | 92 | 0.0% |
| KSKRTPMGF | 85 | 63 | 25.9% |
| NQKLIANQF | 56 | 53 | 5.4% |
| HPVTKYIM | 48 | 41 | 14.6% |
| TPRVTGGGAM | 45 | 44 | 2.2% |
| NYNYLYRLF | 44 | 42 | 4.6% |
| GPRLGVRAT | 40 | 37 | 7.5% |
| RAQAPPPSW | 36 | 14 | 61.1% |
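The greedy redundancy reduction behind these counts can be sketched as a Hobohm-1-style filter: walk through the sequences and keep each one only if it is below the similarity threshold against everything kept so far. The `identity_fraction` function below is a crude stand-in for the article's actual kernel similarity:

```python
def identity_fraction(a: str, b: str) -> float:
    # Crude stand-in for the paper's kernel similarity: fraction of
    # identical positions, defined only for equal-length sequences.
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def redundancy_reduce(seqs, similarity, threshold=0.95):
    """Keep a sequence only if its similarity to every already-kept
    sequence is below the threshold (greedy, order-dependent)."""
    kept = []
    for s in seqs:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

Note the greedy pass is order-dependent; sorting the input (e.g. by annotation quality) before filtering decides which member of a redundant pair survives.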
Table 3
Pearson Correlation Coefficients (PCC) between the optimal α scaling factor and performance per peptide in terms of AUC and AUC 0.1 of the pre-trained CNN model and TCRbase model, respectively, for the validation partitions.

Each partition was considered as a separate sample. p-Values for the null hypothesis that the performance and optimal α are uncorrelated are also shown.

| Metric | PCC to optimal α | p-value |
| --- | --- | --- |
| CNN AUC | −0.1101 | 0.2123 |
| TCRbase AUC | 0.3056 | 0.0004 |
| CNN AUC 0.1 | −0.0809 | 0.3602 |
| TCRbase AUC 0.1 | 0.2068 | 0.0183 |
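The Pearson correlation coefficient reported in Table 3 can be computed as below; `scipy.stats.pearsonr` additionally returns the p-value for the null hypothesis that the two quantities are uncorrelated:

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    # covariance normalized by the product of standard deviations
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))
```

Here each validation partition is one sample, so the correlation asks whether partitions where a model performs well also tend to have a higher (or lower) optimal α.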
Table 4
Degree of redundancy between the IMMREP test and training data, when using a 95% kernel similarity threshold for redundancy within each peptide.

The redundancy reduction was performed on both positive and negative observations. The counts and percentages, however, refer only to the positive observations.

| Peptide | Pre-reduction count | Post-reduction count | Percent redundant |
| --- | --- | --- | --- |
| All | 619 | 467 | 24.56% |
| GILGFVFTL | 136 | 58 | 57.35% |
| NLVPMVATV | 69 | 54 | 21.74% |
| YLQPRTFLL | 67 | 53 | 20.90% |
| TTDPSFLGRY | 49 | 47 | 4.08% |
| LLWNGPMAV | 47 | 44 | 6.38% |
| CINGVCWTV | 46 | 46 | 0.00% |
| GLCTLVAML | 37 | 23 | 37.84% |
| ATDALMTGF | 26 | 22 | 15.38% |
| LTDEMIAQY | 25 | 23 | 8.00% |
| SPRWYFYYL | 24 | 24 | 0.00% |
| KSKRTPMGF | 22 | 13 | 40.91% |
| NQKLIANQF | 15 | 15 | 0.00% |
| TPRVTGGGAM | 12 | 12 | 0.00% |
| HPVTKYIM | 12 | 10 | 16.67% |
| NYNYLYRLF | 12 | 9 | 25.00% |
| GPRLGVRAT | 11 | 11 | 0.00% |
| RAQAPPPSW | 9 | 3 | 66.67% |

Additional files

Supplementary file 1

Overview of training data, model parameters, predictions and performance of the models trained and evaluated in this article, excluding the models trained and evaluated on the IMMREP 2022 dataset.

The listed Model Number for each model can be used to find the source data for the figures in this article (see the figure legends).

https://cdn.elifesciences.org/articles/93934/elife-93934-supp1-v1.xlsx
Supplementary file 2

Overview of training data, model parameters, predictions and performance of the models trained and evaluated on the IMMREP 2022 dataset.

The listed Model Number for each model can be used to find the source data for the figures in this article (see the figure legends).

https://cdn.elifesciences.org/articles/93934/elife-93934-supp2-v1.xlsx
MDAR checklist
https://cdn.elifesciences.org/articles/93934/elife-93934-mdarchecklist1-v1.docx
