PPIscreenML is a method for structure-based screening of protein-protein interactions using AlphaFold
Figures
Building a challenging training/testing set for PPIscreenML.
(A) A collection of 1481 nonredundant active complexes with experimentally derived structures was obtained from DockGround, and five AlphaFold2 (AF2) models were built for each of these. To build decoys, the same collection was screened to identify the closest structural matches (by TM-score) for each component protein. The structural homologs for each template protein were aligned onto the original complex, yielding a new (decoy) complex between two presumably noninteracting proteins. Five AF2 models were built for each of these 1481 decoy complexes. (B) An example of a decoy complex (blue/cyan) superposed with the active complex from which it was generated (brown/wheat). (C) A suite of AlphaFold confidence metrics, structural properties, and Rosetta energy terms was used as input features for training PPIscreenML, a machine learning classifier built to distinguish active protein pairs from compelling inactive (decoy) pairs.
Composition of the active and decoy datasets.
(A) For the set of actives, the distribution of DockQ scores is shown (calculated relative to the PDB structure for each complex). Complexes are included in the dataset only if their DockQ scores are at least 0.23, but most DockQ scores for these active pairings are much higher. (B) Compelling decoy complexes are assembled by drawing an alternate protein with high TM-score relative to the components of the parental (active) complex. In most cases, decoy complexes are assembled using component proteins with TM-scores of at least 0.5, corresponding to close structural matches.
Overview of dataset construction and evaluation.
Each active complex was used to generate one compelling decoy complex. Complexes were divided into training/validation/test sets, and five AlphaFold2 (AF2) models were built for each complex. Key differences in how the sets were constructed/evaluated are highlighted in white. First, the AF2 models of active complexes in the training and validation sets were filtered to keep only correctly docked models (DockQ>0.23); by contrast, all models were retained in the test set with no filtering. Second, performance on the validation set was evaluated by ranking all AF2 models independently; by contrast, performance on the test set was evaluated by ranking each complex, using the best-scoring AF2 model for each complex.
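The two differences described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the record fields (`complex_id`, `is_active`, `dockq`, `score`) and all values are hypothetical, and it assumes that decoy models are never filtered because they have no reference structure for DockQ.

```python
from collections import defaultdict

DOCKQ_CUTOFF = 0.23  # cutoff stated in the caption


def filter_training_models(models):
    """Keep every decoy model; keep an active model only if DockQ > cutoff."""
    return [m for m in models
            if not m["is_active"] or m["dockq"] > DOCKQ_CUTOFF]


def score_per_complex(models):
    """Collapse per-model scores to one score per complex (the best model)."""
    best = defaultdict(lambda: float("-inf"))
    for m in models:
        best[m["complex_id"]] = max(best[m["complex_id"]], m["score"])
    return dict(best)


# Hypothetical records: field names and values are illustrative only.
models = [
    {"complex_id": "A1", "is_active": True, "dockq": 0.71, "score": 0.92},
    {"complex_id": "A1", "is_active": True, "dockq": 0.10, "score": 0.35},
    {"complex_id": "D1", "is_active": False, "dockq": None, "score": 0.20},
    {"complex_id": "D1", "is_active": False, "dockq": None, "score": 0.41},
]

train_models = filter_training_models(models)   # training/validation style
complex_scores = score_per_complex(models)      # test-set style, no filtering
```

In this toy input the mis-docked active model is dropped from the training-style list, while the test-style scoring keeps all models and reports the highest score for each complex.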
List of 57 total features considered when training PPIscreenML.
A list of features extracted from the structural models is shown and grouped into three categories: features from AlphaFold2 (AF2) confidence measures, structural ‘counting’ features extracted using the Python package BioPandas, and features from the Rosetta energy function.
Overlaid histogram of total sequence length between actives and decoys.
The method we describe for building compelling decoy complexes exhibits a slight bias for building decoy pairs from component proteins larger than the starting template. This arises from the use of TM-align to define structural analogs, because large query proteins are slightly more likely to yield high scores than small query proteins (for a given template protein). This artifact can allow a model to ‘cheat’ if it includes any features that can serve as a proxy for the total number of residues in a model; accordingly, we ensured that no such features were included in developing PPIscreenML. Importantly, this artifact does not affect the composition or structural features of the generated decoy complexes, which are not systematically different from the active complexes.
Training and feature reduction for PPIscreenML.
(A) Receiver operating characteristic (ROC) plot demonstrating classification performance on a completely held-out test set, for an XGBoost model using 57 features. (B) The number of features was reduced using sequential backward selection, from 57 features to 7 features. Created with Biorender.com. (C) Classification performance of PPIscreenML (7 features) on the same completely held-out test set.
Comparisons of different machine learning classifiers.
Receiver operating characteristic (ROC) plots for classifiers built using different machine learning frameworks (each uses all 57 features).
Feature reduction for PPIscreenML.
Sequential backward selection was used to characterize models with progressively fewer features. The performance of each candidate model was evaluated on the validation set (drawn from the training set). The vertical dashed line indicates the model selected for PPIscreenML (7 features).
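Sequential backward selection can be sketched generically as follows. This is a minimal illustration of the greedy procedure, not the code used for PPIscreenML: the `evaluate` callback stands in for "train a classifier on this feature subset and score it on the validation set", and the feature names and weights in the toy example are invented.

```python
def sequential_backward_selection(features, evaluate, n_keep):
    """Greedily drop the feature whose removal hurts the score least.

    evaluate(subset) must return a validation score (higher is better).
    Returns the selected features and a history of (subset, score) pairs.
    """
    current = list(features)
    history = [(tuple(current), evaluate(current))]
    while len(current) > n_keep:
        candidates = []
        for f in current:
            subset = [g for g in current if g != f]
            candidates.append((evaluate(subset), subset))
        # Keep the reduced subset that scores best on validation.
        best_score, current = max(candidates, key=lambda c: c[0])
        history.append((tuple(current), best_score))
    return current, history


# Toy stand-in for model training plus validation scoring.
# These weights are invented for illustration only.
TOY_WEIGHTS = {"f1": 0.30, "f2": 0.02, "f3": 0.25, "f4": 0.01, "f5": 0.10}


def toy_evaluate(subset):
    return sum(TOY_WEIGHTS[f] for f in subset)


selected, history = sequential_backward_selection(
    list(TOY_WEIGHTS), toy_evaluate, n_keep=3
)
```

The `history` list plays the role of the curve in this figure: one validation score per feature count, from which a final model size (7 features for PPIscreenML) can be chosen.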
PPIscreenML performance on models generated with AF-PTM.
(A) Receiver operating characteristic (ROC) curve of PPIscreenML and pDockQ tested on models built with AF-PTM. iPTM is not included because it is not available for this version of AF. (B) DockQ distribution for actives in the test set. (C) Among actives that were incorrectly classified by PPIscreenML as ‘not interacting’, most were mis-docked by AF-PTM.
PPIscreenML performance on models generated with AF-Multimer v2.2.
(A) Receiver operating characteristic (ROC) curve of PPIscreenML and pDockQ tested on models built with AF-Multimer v2.2. (B) DockQ distribution for actives in the test set. (C) Among actives that were incorrectly classified by PPIscreenML as ‘not interacting’, most were mis-docked by AF-Multimer v2.2.
Classification performance of PPIscreenML relative to pDockQ and iPTM.
The same held-out test set is used here. These complexes were not seen in any phase of developing PPIscreenML, but may have been used in developing pDockQ or iPTM. (A) Receiver operating characteristic (ROC) plot shows superior performance of PPIscreenML relative to these other two methods. (B) Overlaid histograms show clear separation of actives and decoys scored using PPIscreenML. (C) Overlaid histograms show overlapping distributions when models are scored with pDockQ or iPTM.
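A common way to summarize an ROC plot with a single number is the area under the curve (AUC), which equals the probability that a randomly chosen active outscores a randomly chosen decoy. The sketch below computes it directly from that definition. The caption does not report AUC values, and the scores here are invented; the same function could be applied to any scoring method (PPIscreenML, pDockQ, or iPTM) for a side-by-side comparison.

```python
def roc_auc(active_scores, decoy_scores):
    """Fraction of (active, decoy) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Invented scores, for illustration only.
actives = [0.9, 0.8, 0.4]
decoys = [0.7, 0.3, 0.2]
auc = roc_auc(actives, decoys)  # 8 of 9 pairs are ranked correctly
```

Well-separated score histograms, as in panel B, correspond to an AUC near 1, whereas heavily overlapping histograms push the value toward 0.5.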
Application of PPIscreenML to identify active pairings within the tumor necrosis factor superfamily (TNFSF).
(A) Structurally conserved TNFSF ligands bind to structurally conserved TNFSF receptors; AlphaFold2 (AF2) builds models of these complexes in the canonical pose for cognate pairings (RANKL/RANK are shown in wheat/cyan) but also in some cases for non-cognate pairings (RANKL/CD40 are shown in brown/blue). (B) Each ligand/receptor pairing was built with AF2 and scored with PPIscreenML (heatmap colored from low score in red, to high score in green). Ligand/receptor pairings observed in a comprehensive cellular assay are indicated with white checkmarks. (C) Receiver operating characteristic (ROC) plot demonstrating PPIscreenML classification of TNFSF ligand/receptor pairings.
Tables
Performance of PPIscreenML using various threshold values.
By adjusting the threshold score at which a test complex is assigned as active/decoy, PPIscreenML can be used in regimes that prioritize returning only the most confident pairings (a high threshold score yields high precision but poor recall) or in exploratory regimes that return more speculative pairings as well (a lower threshold score yields high recall but poorer precision).
| PPIscreenML threshold | FPR | TPR | FNR | TNR | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| 0.98 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.01 |
| 0.95 | 0.01 | 0.16 | 0.84 | 0.99 | 0.96 | 0.16 | 0.27 |
| 0.9 | 0.07 | 0.39 | 0.61 | 0.93 | 0.85 | 0.39 | 0.54 |
| 0.85 | 0.11 | 0.60 | 0.40 | 0.89 | 0.85 | 0.60 | 0.70 |
| 0.8 | 0.15 | 0.74 | 0.26 | 0.85 | 0.83 | 0.74 | 0.78 |
| 0.75 | 0.18 | 0.80 | 0.20 | 0.82 | 0.82 | 0.80 | 0.81 |
| 0.7 | 0.21 | 0.85 | 0.15 | 0.79 | 0.80 | 0.85 | 0.83 |
| 0.64 | 0.23 | 0.88 | 0.12 | 0.77 | 0.80 | 0.88 | 0.84 |
| 0.6 | 0.24 | 0.89 | 0.11 | 0.76 | 0.79 | 0.89 | 0.84 |
| 0.56 | 0.24 | 0.92 | 0.08 | 0.76 | 0.79 | 0.92 | 0.85 |
| 0.5 | 0.27 | 0.93 | 0.07 | 0.73 | 0.78 | 0.93 | 0.85 |
| 0.47 | 0.27 | 0.93 | 0.07 | 0.73 | 0.77 | 0.93 | 0.84 |
| 0.4 | 0.29 | 0.95 | 0.05 | 0.71 | 0.76 | 0.95 | 0.85 |
| 0.38 | 0.30 | 0.95 | 0.05 | 0.70 | 0.76 | 0.95 | 0.85 |
| 0.28 | 0.33 | 0.95 | 0.05 | 0.67 | 0.75 | 0.95 | 0.84 |
| 0.21 | 0.35 | 0.95 | 0.05 | 0.65 | 0.73 | 0.95 | 0.83 |
| 0.18 | 0.35 | 0.96 | 0.04 | 0.65 | 0.73 | 0.96 | 0.83 |
| 0.08 | 0.43 | 0.96 | 0.04 | 0.57 | 0.69 | 0.96 | 0.81 |
| 0.05 | 0.48 | 0.97 | 0.03 | 0.52 | 0.67 | 0.97 | 0.79 |
| 0 | 1.00 | 1.00 | 0.00 | 0.00 | 0.50 | 1.00 | 0.67 |
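The columns of this table can be derived from per-complex scores and labels as sketched below. This is a minimal illustration under two assumptions that are not stated in the source: a complex is called active when its score is greater than or equal to the threshold, and precision is reported as 1.0 when nothing is called active. The scores and labels are invented.

```python
def threshold_metrics(scores, labels, threshold):
    """Call score >= threshold 'active' and return the table's rate metrics."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    tn = sum(1 for s, y in pairs if s < threshold and not y)
    tpr = tp / (tp + fn) if tp + fn else 0.0      # recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # Assumption: precision is defined as 1.0 when no complex is called active.
    precision = tp / (tp + fp) if tp + fp else 1.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"FPR": fpr, "TPR": tpr, "FNR": 1 - tpr, "TNR": 1 - fpr,
            "Precision": precision, "Recall": tpr, "F1": f1}


# Invented scores for four actives (label 1) and four decoys (label 0).
scores = [0.95, 0.80, 0.60, 0.30, 0.85, 0.40, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
metrics = threshold_metrics(scores, labels, threshold=0.5)

# With equal numbers of actives and decoys, precision = TPR / (TPR + FPR).
# Using the 0.75-threshold row of the table (TPR 0.80, FPR 0.18):
precision_check = round(0.80 / (0.80 + 0.18), 2)
```

The last line reproduces the 0.82 precision reported for the 0.75 threshold, which is consistent with a test set containing equal numbers of actives and decoys (precision is 0.50 at a threshold of 0).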