Building a challenging training/testing set for PPIscreenML.

(A) A collection of 1,481 non-redundant active complexes with experimentally derived structures was obtained from DockGround, and five AF2 models were built for each of these. To build decoys, the same collection was screened to identify the closest structural matches (by TM-score) for each component protein. The structural homologs for each template protein were aligned onto the original complex, yielding a new (decoy) complex between two presumably non-interacting proteins. Five AF2 models were built for each of these 1,481 decoy complexes. (B) An example of a decoy complex (blue/cyan) superposed with the active complex from which it was generated (brown/wheat). (C) A suite of AlphaFold confidence metrics, structural properties, and Rosetta energy terms were used as input features for training PPIscreenML, a machine learning classifier built to distinguish active protein pairs from compelling inactive (decoy) pairs.
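
As an illustration of this decoy-building strategy, the sketch below scans a library of single-chain structures with TM-align to find the closest structural match for one component of an active complex; the matched homolog would then be superposed onto that chain to yield the decoy. This is a minimal sketch only: it assumes a local TMalign executable and a directory of single-chain PDB files, and the file layout, output parsing, and "skip self" logic are illustrative rather than taken from the PPIscreenML pipeline.

```python
# Hedged sketch: finding the closest structural match (by TM-score) for one
# component chain of an active complex. Assumes a local "TMalign" executable
# and a directory of single-chain PDB files; paths and parsing are illustrative.
import re
import subprocess
from pathlib import Path

def tm_score(query_pdb: str, template_pdb: str) -> float:
    """Run TM-align and return the TM-score normalized by the template chain."""
    out = subprocess.run(["TMalign", query_pdb, template_pdb],
                         capture_output=True, text=True, check=True).stdout
    # TM-align prints two scores; the second is normalized by Chain_2 (template).
    scores = re.findall(r"TM-score=\s*([0-9.]+)", out)
    return float(scores[1])

def closest_homolog(template_chain: str, library_dir: str):
    """Scan a library of single-chain structures for the best match to the template."""
    best_hit, best_score = None, 0.0
    for candidate in Path(library_dir).glob("*.pdb"):
        if candidate.name == Path(template_chain).name:
            continue  # skip the template itself
        score = tm_score(str(candidate), template_chain)
        if score > best_score:
            best_hit, best_score = str(candidate), score
    return best_hit, best_score

# The selected homolog would then be superposed onto the corresponding chain of
# the original complex (e.g. using TM-align's rotation matrix) to yield a decoy
# complex between two presumably non-interacting proteins.
```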

Training and feature reduction for PPIscreenML.

(A) Receiver operating characteristic (ROC) plot demonstrating classification performance on a completely held-out test set, for an XGBoost model using 57 features. (B) The number of features was reduced using sequential backwards selection, from 57 features to 7 features. (C) Classification performance of PPIscreenML (7 features) on the same completely held-out test set.
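
For readers who want to reproduce this style of evaluation, the sketch below trains an XGBoost classifier on a 57-feature matrix and plots its ROC curve on a held-out test set. The hyperparameters, array names, and plotting details are placeholders, not the published PPIscreenML training settings.

```python
# Hedged sketch: training an XGBoost classifier on the 57 input features and
# plotting ROC performance on a held-out test set. Hyperparameters and array
# names are placeholders, not the published PPIscreenML settings.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier

def train_and_plot_roc(X_train, y_train, X_test, y_test):
    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                          eval_metric="logloss")
    model.fit(X_train, y_train)

    # Score the completely held-out test set (1 = active complex, 0 = decoy).
    probs = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc = roc_auc_score(y_test, probs)

    plt.plot(fpr, tpr, label=f"XGBoost, 57 features (AUC={auc:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    return model
```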

Classification performance of PPIscreenML relative to pDockQ and iPTM.

The same test set is used here. These complexes were not seen in any phase of developing PPIscreenML, but may have been used in developing pDockQ or iPTM. (A) Receiver operating characteristic (ROC) plot shows superior performance of PPIscreenML relative to these other two methods. (B) Overlaid histograms show clear separation of actives and decoys scored using PPIscreenML. (C) Overlaid histograms show overlapping distributions when models are scored with pDockQ or iPTM.
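
A comparison of this kind can be reproduced generically as sketched below, with ROC curves and overlaid active/decoy histograms for several score columns of a results table. The column names ("ppiscreenml", "pdockq", "iptm", "label") are assumptions about a hypothetical table, not the authors' data format.

```python
# Hedged sketch: comparing several scoring methods on the same test set by
# ROC curve and by overlaid active/decoy histograms. Column names are
# assumptions about a hypothetical results table (1 = active, 0 = decoy).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def compare_scoring_methods(df, score_cols=("ppiscreenml", "pdockq", "iptm")):
    # (A) ROC curves for each scoring method on the identical complexes.
    for col in score_cols:
        fpr, tpr, _ = roc_curve(df["label"], df[col])
        auc = roc_auc_score(df["label"], df[col])
        plt.plot(fpr, tpr, label=f"{col} (AUC={auc:.2f})")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()

    # (B, C) Overlaid histograms of actives versus decoys for each score.
    for col in score_cols:
        plt.hist(df.loc[df["label"] == 1, col], bins=30, alpha=0.5, label="actives")
        plt.hist(df.loc[df["label"] == 0, col], bins=30, alpha=0.5, label="decoys")
        plt.xlabel(col)
        plt.legend()
        plt.show()
```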

Performance of PPIscreenML using various threshold values.

By adjusting the threshold score at which a test complex is assigned as active/decoy, PPIscreenML can be used in regimes that prioritize returning only the most confident pairings (a high threshold score yields high precision but poor recall) or in exploratory regimes that also return more speculative pairings (a lower threshold score yields high recall but poorer precision).
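
The threshold trade-off described above can be tabulated as in the sketch below, which sweeps a score cutoff and reports precision and recall at each value. The threshold grid and the variable names ("scores", "labels") are illustrative assumptions.

```python
# Hedged sketch: sweeping the classification threshold to trade precision
# against recall. "scores" and "labels" are placeholder arrays of PPIscreenML
# scores and active(1)/decoy(0) labels; the threshold grid is illustrative.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def precision_recall_sweep(scores, labels, thresholds=np.arange(0.1, 1.0, 0.1)):
    rows = []
    scores = np.asarray(scores)
    for t in thresholds:
        calls = (scores >= t).astype(int)  # call "active" at or above the threshold
        rows.append({
            "threshold": round(float(t), 2),
            "precision": precision_score(labels, calls, zero_division=0),
            "recall": recall_score(labels, calls),
        })
    return rows
```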

Application of PPIscreenML to identify active pairings within the tumor necrosis factor superfamily (TNFSF).

(A) Structurally conserved TNFSF ligands bind to structurally conserved TNFSF receptors; AF2 builds models of these complexes in the canonical pose for cognate pairings (RANKL/RANK are shown in wheat/cyan) but also, in some cases, for non-cognate pairings (RANKL/CD40 are shown in brown/blue). (B) Each ligand/receptor pairing was built with AF2 and scored with PPIscreenML (heatmap colored from low scores in red to high scores in green). Ligand/receptor pairings observed in a comprehensive cellular assay are indicated with white checkmarks. (C) Receiver operating characteristic (ROC) plot demonstrating PPIscreenML classification of TNFSF ligand/receptor pairings.
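
A heatmap of this kind could be drawn as sketched below, by arranging per-pairing scores into a ligand-by-receptor matrix and coloring it from red (low) to green (high). The score dictionary, protein name lists, and color limits are assumptions for illustration, not the figure's actual plotting code.

```python
# Hedged sketch: arranging per-pairing PPIscreenML scores into a
# ligand x receptor matrix and drawing a red-to-green heatmap.
# The score dictionary, name lists, and color limits are placeholders.
import matplotlib.pyplot as plt
import numpy as np

def plot_pairing_heatmap(scores, ligands, receptors):
    """scores: dict mapping (ligand, receptor) -> PPIscreenML score in [0, 1]."""
    matrix = np.array([[scores.get((lig, rec), np.nan) for rec in receptors]
                       for lig in ligands])
    fig, ax = plt.subplots()
    im = ax.imshow(matrix, cmap="RdYlGn", vmin=0.0, vmax=1.0)  # low = red, high = green
    ax.set_xticks(range(len(receptors)))
    ax.set_xticklabels(receptors, rotation=90)
    ax.set_yticks(range(len(ligands)))
    ax.set_yticklabels(ligands)
    fig.colorbar(im, ax=ax, label="PPIscreenML score")
    return fig
```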

Composition of the active and decoy datasets.

(A) For the set of actives, the distribution of DockQ scores is shown (calculated relative to the PDB structure for each complex). Complexes are included in the dataset only if their DockQ scores are at least 0.23, but most DockQ scores for these active pairings are much higher. (B) Compelling decoy complexes are assembled by drawing an alternate protein with a high TM-score relative to the components of the parental (active) complex. In most cases decoy complexes are assembled using component proteins with TM-scores of at least 0.5, corresponding to close structural matches.
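
The two distributions summarized in this legend could be plotted as sketched below, with the DockQ cutoff of 0.23 and the TM-score landmark of 0.5 marked. The input arrays and plotting choices are illustrative assumptions.

```python
# Hedged sketch: plotting the two distributions described above from
# placeholder arrays of per-complex values; the 0.23 and 0.5 landmarks come
# from the text, everything else is illustrative.
import matplotlib.pyplot as plt

def plot_dataset_composition(active_dockq, decoy_component_tm):
    fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(8, 3))

    # (A) DockQ of active complexes relative to their PDB structures;
    # the inclusion cutoff of 0.23 is marked.
    ax_a.hist(active_dockq, bins=30)
    ax_a.axvline(0.23, linestyle="--", color="black")
    ax_a.set_xlabel("DockQ (actives)")

    # (B) TM-score of the structural homolog used for each decoy component;
    # 0.5 marks the conventional cutoff for a close structural match.
    ax_b.hist(decoy_component_tm, bins=30)
    ax_b.axvline(0.5, linestyle="--", color="black")
    ax_b.set_xlabel("TM-score (decoy components)")
    return fig
```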

Overview of dataset construction and evaluation.

Each active complex was used to generate one compelling decoy complex. Complexes were divided into training / validation / test sets, and 5 AF2 models were built for each complex. Key differences in how the sets were constructed / evaluated are highlighted in white. First, the AF2 models of active complexes in the training and validation sets were filtered to keep only correctly docked models (DockQ > 0.23); by contrast, all models were retained in the test set with no filtering. Second, performance on the validation set was evaluated by ranking all AF2 models independently; by contrast, performance on the test set was evaluated by ranking each complex on the basis of its best-scoring AF2 model.
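
The two workflow differences highlighted above are straightforward to express over a tabular collection of AF2 models; the sketch below assumes a hypothetical pandas table with "split", "complex_id", "dockq", "label", and "score" columns, which is not the authors' actual data format.

```python
# Hedged sketch: the two workflow differences highlighted above, applied to a
# hypothetical pandas table of AF2 models (columns "split", "complex_id",
# "dockq", "label", "score"); not the authors' actual data format.
import pandas as pd

def prepare_and_evaluate(models: pd.DataFrame):
    # Training/validation sets: keep only correctly docked models of actives
    # (DockQ > 0.23); decoy models are retained regardless of DockQ.
    trainval = models[models["split"].isin(["train", "validation"])]
    trainval = trainval[(trainval["label"] == 0) | (trainval["dockq"] > 0.23)]

    # Test set: no filtering; each complex is scored by its best-scoring
    # AF2 model, giving one score per complex for the ROC analysis.
    test = models[models["split"] == "test"]
    per_complex = test.groupby("complex_id").agg(
        score=("score", "max"), label=("label", "first"))

    return trainval, per_complex
```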

List of 57 total features considered when training PPIscreenML.

A list of features extracted from the structural models is shown, grouped into three categories: features from AF2 confidence measures, structural “counting” features extracted using the Python package Biopandas, and features from the Rosetta energy function.
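
As one concrete example of a structural “counting” feature in the spirit of this table, the sketch below uses Biopandas to count inter-chain atom contacts within a distance cutoff. The 5 Å cutoff and the two-chain assumption are illustrative; the actual PPIscreenML features may be defined differently.

```python
# Hedged sketch: one example of a structural "counting" feature extracted with
# Biopandas: the number of inter-chain atom contacts within a distance cutoff.
# The 5 Å cutoff and the assumption of chains "A" and "B" are illustrative.
from biopandas.pdb import PandasPdb
from scipy.spatial import cKDTree

def count_interface_contacts(pdb_path, chain_a="A", chain_b="B", cutoff=5.0):
    atoms = PandasPdb().read_pdb(pdb_path).df["ATOM"]
    coords_a = atoms.loc[atoms["chain_id"] == chain_a,
                         ["x_coord", "y_coord", "z_coord"]].values
    coords_b = atoms.loc[atoms["chain_id"] == chain_b,
                         ["x_coord", "y_coord", "z_coord"]].values

    # Count atom pairs across the interface closer than the cutoff.
    contacts = cKDTree(coords_a).query_ball_tree(cKDTree(coords_b), r=cutoff)
    return sum(len(hits) for hits in contacts)
```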

Comparisons of different machine learning classifiers.

ROC plots for classifiers built using different machine learning frameworks (each uses all 57 features).

Feature reduction for PPIscreenML.

Sequential backwards selection was used to characterize models with diminishing numbers of features. Performance of each candidate model was evaluated on the validation set (drawn from the training set). The vertical dashed line indicates the model selected for PPIscreenML (7 features).
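
Sequential backwards selection of this sort can be reproduced with scikit-learn, as sketched below. Note that scikit-learn's selector scores candidates by cross-validation rather than a single fixed validation split (a PredefinedSplit could stand in for the latter), and the estimator settings shown are placeholders rather than the published configuration.

```python
# Hedged sketch: sequential backwards selection from all 57 features down to a
# target count, scored by ROC AUC. scikit-learn's selector uses cross-validation
# (a PredefinedSplit could emulate a fixed validation split); the estimator
# settings are placeholders, not the published configuration.
from sklearn.feature_selection import SequentialFeatureSelector
from xgboost import XGBClassifier

def backward_select(X_train, y_train, n_features=7):
    estimator = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    selector = SequentialFeatureSelector(
        estimator,
        n_features_to_select=n_features,
        direction="backward",   # start from all features and drop one at a time
        scoring="roc_auc",
        cv=5,
    )
    selector.fit(X_train, y_train)
    return selector.get_support(indices=True)  # indices of the retained features
```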

PPIscreenML performance on models generated with AF-ptm.

(A) ROC curve of PPIscreenML and pDockQ tested on models built with AF-ptm. iPTM is not included because it is not available for this version of AF. (B) DockQ distribution for actives in the test set. (C) Among actives that were incorrectly classified by PPIscreenML as “not interacting”, most were mis-docked by AF-ptm.

PPIscreenML performance on models generated with AF-multimer version 2.2.

(A) ROC curve of PPIscreenML and pDockQ tested on models built with AF-multimer-2.2. (B) DockQ distribution for actives in the test set. (C) Among actives that were incorrectly classified by PPIscreenML as “not interacting”, most were mis-docked by AF-multimer-2.2.

Overlaid histograms of total sequence length for actives and decoys.

The method we describe for building compelling decoy complexes does exhibit a slight bias toward building decoy pairs from component proteins larger than the starting template. This arises from the use of TM-align to define structural analogs, because large query proteins are slightly more likely to yield high scores than small query proteins (for a given template protein). This artifact can allow a model to “cheat” if it includes any features that can serve as a proxy for the total number of residues in a model; accordingly, we ensured that no such features were included in developing PPIscreenML. Importantly, this artifact does not otherwise affect the composition or structural features of the generated decoy complexes, which are not systematically different from the active complexes.
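
One simple way to guard against such size-proxy features is to correlate each candidate feature with the total residue count, as sketched below; the 0.8 threshold and the tabular input format are illustrative assumptions, not the check used for PPIscreenML.

```python
# Hedged sketch: flagging candidate features that could act as a proxy for
# total model size, by correlating each feature with the total residue count.
# The 0.8 threshold and the tabular input format are illustrative assumptions.
import pandas as pd

def flag_size_proxy_features(features: pd.DataFrame, total_residues: pd.Series,
                             threshold: float = 0.8):
    """Return features whose absolute correlation with total length exceeds threshold."""
    corr = features.corrwith(total_residues).abs()
    return corr[corr > threshold].sort_values(ascending=False)
```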