PPI-hotspotID for detecting protein–protein interaction hot spots from the free protein structure
Abstract
Experimental detection of residues critical for protein–protein interactions (PPI) is a time-consuming, costly, and labor-intensive process. Hence, high-throughput PPI-hot spot prediction methods have been developed, but they have been validated using relatively small datasets, which may compromise their predictive reliability. Here, we introduce PPI-hotspotID, a novel method for identifying PPI-hot spots using the free protein structure, and validated it on the largest collection of experimentally confirmed PPI-hot spots to date. We explored the possibility of detecting PPI-hot spots using (i) FTMap in the PPI mode, which identifies hot spots on protein–protein interfaces from the free protein structure, and (ii) the interface residues predicted by AlphaFold-Multimer. PPI-hotspotID yielded better performance than FTMap and SPOTONE, a webserver for predicting PPI-hot spots given the protein sequence. When combined with the AlphaFold-Multimer-predicted interface residues, PPI-hotspotID yielded better performance than either method alone. Furthermore, we experimentally verified several PPI-hotspotID-predicted PPI-hot spots of eukaryotic elongation factor 2. Notably, PPI-hotspotID can reveal PPI-hot spots not obvious from complex structures, including those in indirect contact with binding partners. PPI-hotspotID serves as a valuable tool for understanding PPI mechanisms and aiding drug design. It is available as a web server (https://ppihotspotid.limlab.dnsalias.org/) and open-source code (https://github.com/wrigjz/ppihotspotid/).
eLife assessment
The article presents a machine-learning method to predict protein hotspot residues. The validation is incomplete, along with the misinterpretation of the results with other current methods like FTMap.
https://doi.org/10.7554/eLife.96643.3.sa0Introduction
Protein–protein interactions (PPIs) play a crucial role in cellular physiology, and their dysregulation is associated with various diseases (David et al., 2012) such as cancer (Nero et al., 2014), infectious diseases, and neurodegenerative diseases (Blazer and Neubig, 2009). Identifying residues critical for PPIs (termed PPI-hot spots) is important for elucidating protein function and designing targeted biomedical interventions (Cukuroglu et al., 2014; Rosell and Fernández-Recio, 2018). Conventionally, PPI-hot spots are defined as residues whose mutations to alanine cause ≥2 kcal/mol drop in the protein binding free energy (Clackson and Wells, 1995; Bogan and Thorn, 1998; DeLano, 2002; Li et al., 2004; Keskin et al., 2005; Moreira et al., 2007). However, this definition, based on measuring the binding free energy change upon mutation to alanine, limits the number of experimentally determined PPI-hot spots. Hence, PPI-hot spots have been more broadly defined to include residues whose mutations, not necessarily to alanine, significantly impair/disrupt PPIs (Fischer et al., 2003; Chen et al., 2022), as detected by experimental methods such as coimmunoprecipitation and yeast two-hybrid screening. As each mutant must be purified and analyzed separately, experimental detection of PPI-hot spots is time-consuming, costly, and labor-intensive.
To enable large-scale detection of PPI-hot spots, high-throughput PPI-hot spot prediction methods have been developed. They generally fall into two categories (Rosário‐Ferreira et al., 2022): (1) methods that compute the binding energy/free energy difference between the wild-type protein and a mutant using classical force fields or empirical scoring functions (Moreira et al., 2007; Massova and Kollman, 1999; Huo et al., 2002; Guerois et al., 2002; Kortemme and Baker, 2002; González-Ruiz and Gohlke, 2006; Grosdidier and Fernández-Recio, 2008; Yogurtcu et al., 2008; Barlow et al., 2018; Ibarra et al., 2019). (2) Methods that employ classifiers such as nearest neighbor, support vector machines, decision trees, Bayesian/neural networks, random forest, and ensemble machine-learning models using various features including conservation, secondary structure, solvent-accessible surface area (SASA), and atom density (Rosário‐Ferreira et al., 2022; Darnell et al., 2007; Cho et al., 2009; Assi et al., 2010; Xia et al., 2010; Lise et al., 2011; Wang et al., 2012; Ye et al., 2014; Munteanu et al., 2015; Melo et al., 2016; Moreira et al., 2017; Qiao et al., 2018; Sitani et al., 2021; Ovek et al., 2022). Most of the PPI-hot spot prediction methods rely on the protein complex structure and some are accessible via webservers; for example, Hotpoint (Tuncbag et al., 2010), KFC2 (Zhu and Mitchell, 2011), PredHS (Deng et al., 2014), and PredHS2 (Wang et al., 2018a). Fewer methods use only the free protein structure (Higa and Tozzi, 2009; Zerbe et al., 2012; Ozbek et al., 2013; Agrawal et al., 2014; Kozakov et al., 2015) or sequence (Qiao et al., 2018; Ofran and Rost, 2007; Chen et al., 2013; Nguyen et al., 2013; Huang and Zhang, 2016; Hu et al., 2017; Jiang et al., 2017; Liu et al., 2018; Preto and Moreira, 2020; Yao et al., 2022), and SPOTONE (hot SPOTs ON protein complexes with Extremely randomized trees) is available as a web server. SPOTONE (Preto and Moreira, 2020) predicts PPI-hot spots from the protein sequence using residue-specific features such as atom type, amino acid (aa) properties, secondary structure propensity, and mass-associated values to train an ensemble of extremely randomized trees.
The PPI-hot spot prediction methods have mostly been trained, validated, and tested on data from the Alanine Scanning Energetics database (ASEdb) (Thorn and Bogan, 2001) and/or the Structural Kinetic and Energetic database of Mutant Protein Interactions (SKEMPI) 2.0 database (Jankauskaite et al., 2019). However, antibody–antigen interactions have different sequence and structural characteristics compared to non-antibody PPIs (Wang et al., 2018b). Therefore, our focus is exclusively on non-antibody proteins in this study. The ASEdb contains 96 PPI-hot spots from 26 proteins. SKEMPI 2.0, which includes single-point mutations not necessarily to alanine that decrease the protein binding free energy by ≥2 kcal/mol, has 343 PPI-hot spots from 117 distinct proteins, 40 of which overlap with ASEdb. Altogether, ASEdb and SKEMPI contain 399 distinct PPI-hot spots in 132 proteins. To increase this number of distinct PPI-hot spots, we have expanded the definition of PPI-hot spots to include mutations in UniProtKB (UniProt Consortium, 2018) that have been manually curated as significantly impairing/disrupting PPIs (Chen et al., 2022). This expanded definition led to the creation of the PPI-HotspotDB database, which contains 4039 experimentally determined PPI-hot spots in 1893 proteins. To calibrate PPI-hot spot prediction methods using the free protein structure, a benchmark was derived from PPI-HotspotDB (Chen et al., 2022). This benchmark called PPI-Hotspot+PDBBM contains nonredundant proteins with free structures and known PPI-hot spots. The proteins in PPI-Hotspot+PDBBM share <60% sequence identity, which has been shown to be a reasonable threshold for grouping domains with similar functions (Chen et al., 2022).
Our aim is to develop a method for identifying PPI-hot spots in non-antibody proteins using free protein structures. First, we updated the PPI-Hotspot+PDBBM benchmark and constructed a dataset comprising 158 nonredundant proteins with free structures harboring 414 experimentally known PPI-hot spots and 504 PPI-nonhot spots (see ‘Materials and methods’). Using this dataset, we applied an automatic machine-learning framework that automates the machine-learning pipeline to detect PPI-hot spots using the aa type as well as structural, energetic, and evolutionary features of each residue in a protein. The resulting prediction model, named PPI-hotspotID, identifies PPI-hot spots using an ensemble of classifiers and only four residue features (conservation, aa type, SASA, and gas-phase energy, ΔGgas). We explored the possibility of detecting PPI-hot spots using the FTMap server in the PPI mode, which identifies hot spots on protein–protein interfaces from free protein structures (Kozakov et al., 2015). These hot spots are identified by consensus sites − regions that bind multiple probe clusters (Zerbe et al., 2012; Kozakov et al., 2015; Kozakov et al., 2011). Such regions are deemed to be important for any interaction involving that region of the target, independent of partner protein (Zerbe et al., 2012). PPI-hot spots were identified as residues in van der Waals (vdW) contact with probe ligands within the largest consensus site containing the most probe clusters. We also explored the possibility of detecting PPI-hot spots using the interface residues predicted by AlphaFold-Multimer (Evans et al., 2021), which has been shown to outperform current docking methods in predicting protein–protein complexes. Finally, we illustrated the utility of PPI-hotspotID by applying it to detect PPI-hot spots of eukaryotic elongation factor 2 (eEF2), a translation factor essential for peptide elongation, and experimentally verified the predictions.
Results
Evaluating the performance of PPI-hot spot detection methods
The goal of PPI-hotspotID is to detect true PPI-hot spots rather than true PPI-nonhot spots in proteins. Hence, we assessed the performance of PPI-hotspotID by computing the sensitivity/recall (the fraction of true PPI-hot spots correctly identified),
the fraction of predicted PPI-hot spots that are true PPI-hot spots; that is,
and the F1-score, which combines recall and precision:
Since our dataset also contains true PPI-nonhot spots, we calculated the specificity (the fraction of true PPI-nonhot spots correctly identified):
In Equations 1–4, TP (true positives) or TN (true negatives) is the number of correctly predicted PPI-hot spots or PPI-nonhot spots, and FP (false positives) or FN (false negatives) is the number of wrongly predicted PPI-hot spots or PPI-nonhot spots.
Performance of PPI-hotspotID vs. FTMap and SPOTONE
We compared the performance of PPI-hotspotID, FTMap (Kozakov et al., 2015), and SPOTONE (Preto and Moreira, 2020) using a dataset containing 414 true PPI-hot spots and 504 nonhot spots. Source data 1 lists the UniProt codes of the PPI-hot spot-containing protein and its binding partner, the PDB code(s) and chain of the free protein structure, the UniProt and PDB numbering of the PPI-hot spot, the wild-type→mutant residue, the corresponding protein binding free energy change and source given by the PubMed reference number, and the PPI-hotspotID assignment, where P indicates PPI-hot spot and N indicates nonhot spot. Source data 2 lists the UniProt codes of the PPI-hot spot-containing protein A and binding partner protein B, the PDB code-chain and length of the free and bound protein A structures, the PDB code-chain of the bound protein B structure, the sequence identity between free and bound protein A structures, and the PPI-hot spots of protein A. Note that the 414 true PPI-hot spots represent only 2% of the total number of residues (21,722) in the 158 proteins.
Given the free protein structure, PPI-hotspotID and SPOTONE (Preto and Moreira, 2020) predict PPI-hot spots based on a probability threshold (>0.5). FTMap, in the PPI mode, detects PPI-hot spots as consensus sites/regions on the protein surface that bind multiple probe clusters (Kozakov et al., 2011). Residues in vdW contact with probe molecules within the largest consensus site were compared with PPI-hotspotID/SPOTONE predictions. Residues not classified as PPI-hot spots by each method were considered as PPI-nonhot spots. Table 1 summarizes the results for our dataset, with the F1 score in parentheses representing the mean validation F1 score. Compared to FTMap/SPOTONE, PPI-hotspotID detected a much higher fraction of true positives (0.67 vs. 0.07/0.10) and achieved a significantly higher F1 score (0.71 vs. 0.13/0.17).
To elucidate the differences between the PPI-hot spots predicted by PPI-hotspotID and those by FTMap or SPOTONE, we compared their respective true-positive predictions. The Venn diagram in Table 1 shows a substantial overlap in true positives between FTMap or SPOTONE and PPI-hotspotID: FTMap shared 23/30 true positives with PPI-hotspotID, whereas SPOTONE shared 34/40 with PPI-hotspotID, but only 3 with FTMap. Only three true positives were predicted by all three methods. PPI-hotspotID identified many true positives that were not detected by FTMap or SPOTONE probably because it employed aspects not considered by FTMap or SPOTONE such as the gas-phase energy, ΔGgas (see ‘Discussion’). Furthermore, SPOTONE defined true negatives as residues whose mutation to alanine led to protein binding free energy changes (ΔΔGbind) of ≤2.0 kcal/mol, whereas we defined true negatives as residues whose alanine/nonalanine mutations resulted in negligible protein binding free energy changes (ΔΔGbind < 0.5 kcal/mol) or did not perturb PPIs in immunoprecipitation or GST pull-down assays (see ‘Materials and methods’).
Interface vs. noninterface PPI-hot spots
We can estimate the fraction of PPI-hot spots at the protein interface for 74 of the 158 nonredundant proteins in our dataset with complex structures. These 74 proteins harboring 243 true PPI-hot spots form 78 PPI pairs. Using the UniProt codes of each protein and its binding partner, we identified all complex structures in the PDB. Based on the complex structures of each PPI pair, we classified a PPI-hot spot as an interface one if it formed hydrogen bonds or vdW contacts with the partner protein (Laskowski et al., 2018); otherwise, it was deemed a noninterface PPI-hot spot. Among the 243 true PPI-hot spots, 67 (27.6%) lacked such contacts across the protein interface. For these 74 proteins, PPI-hotspotID predicted 240 PPI-hot spots, out of which 43 (18%) are noninterface PPI-hot spots, SPOTONE identified only five noninterface PPI-hot spots, whereas FTMap did not predict any. For example, the complex structure of interleukin-4 bound to interleukin-4 receptor subunitα (PDB: 1IAR) (Hage et al., 1999) in Figure 1 revealed three interface PPI-hot spots (E13, R92, and N93) and two noninterface ones (K127 and Y128). Based on the free structure of interleukin-4 (PDB: 1BBN) (Powers et al., 1992), PPI-hotspotID identified all five true positives, SPOTONE detected an interfacial PPI-hot spot (N93), whereas FTMap failed to identify any true positives.
Performance of AlphaFold-Multimer, PPI-hotspotID and their combination in predicting PPI-hot spots
To assess the possibility of detecting PPI-hot spots using the interface residues predicted by AlphaFold-Multimer (Evans et al., 2021) as PPI-hot spots when complex structures are unavailable, we focused on 48 ‘unsolved’ AB complex structures involving 47 proteins in the PPI-Hotspot+PDBBM(1.1), as one of the proteins, human neurotrophin (UniProtID P20783, PDB 1nt30A) interacted with two different proteins (UniProtID Q16288 and P17643). These 48 unsolved complex structures contain 90 PPI-hot spots and 45 nonhot spots. We employed the protein A structure sequence from the PPI-Hotspot+PDBBM(1.1) and the entire protein B sequence from UniProtKB (UniProt Consortium, 2018) as inputs for the AlphaFold-Multimer module in ColabFold (Mirdita et al., 2022). This generated model structures for each AB complex. Interface residues were defined based on the AMBER-relaxed model structure with the highest pTM score using a cutoff distance of 5 Å reflecting residues in close contact. Interface residues were predicted as PPI-hot spots and noninterface residues as nonhot spots.
In identifying PPI-hot spots using PPI-hotspotID, we first excluded 90 true PPI-hot spots and 45 nonhot spots belonging to 47 proteins lacking complex PDB structures from our dataset. We then used an automatic machine-learning framework to train an ensemble of machine-learning models using four features (kC, aa residue type, SASAi, and ΔGigas) on the true PPI-hot spots and nonhot spots in the remaining 111 proteins in our dataset. The final ensemble model was used to identify PPI-hot spots in the 47 proteins lacking complex structures in our dataset. The resulting sensitivity (0.58) and F1 score (0.66) in Table 2 were lower than those in Table 1 using the full dataset. Nevertheless, they were greater than those achieved using AlphaFold-Multimer-predicted interface residues as PPI-hot spots (0.41 and 0.54). When we combined the PPI-hotspotID-predicted PPI-hot spots with the AlphaFold-Multimer-predicted interface residues, the resulting sensitivity (0.70) and F1 values (0.72) were higher than those obtained by each method alone. This indicates that PPI-hotspotID can identify true PPI-hot spots that reside outside the protein–protein interface.
Experimental verification of PPI-hotspotID’s predictions in human eEF2
We experimentally verified predictions made by PPI-hotspotID by using it to detect the PPI-hot spots of eEF2, an essential translation factor that hydrolyzes GTP to catalyze peptide elongation. Binding of cytoplasmic polyadenylation element-binding protein-2 (CPEB2) to eEF2 may interfere with conformational changes of eEF2 on ribosomes, thereby affecting the efficiency of eEF2-mediated GTP hydrolysis, and slowing down translation of hypoxia-inducible factor (HIF)–1α mRNA (Chen and Huang, 2012). No eEF2-CPEB2 complex structure has been solved, but a 5 Å electron microscopy structure of eEF2 (PDB 4v6x-A) (Anger et al., 2013) is available. Using the CPEB2 N-terminus for a yeast two-hybrid screen, a positive clone containing the eEF2 residues 717–803 had been identified and subsequent co-IP assay revealed a CPEB2-binding domain comprising eEF2 residues 743–817 (Chen and Huang, 2012). Thus, we focused on this domain, which shares ≤20% sequence identity with the 158 nonredundant proteins in our dataset, in predicting PPI-hot spots. Based on the free eEF2 structure (PDB 4v6x-A) (Anger et al., 2013), PPI-hotspotID predicted F794 as the PPI-hot spot with the highest probability of 0.67. So, we chose to test F794 and seven other predicted PPI-hot spots (L763, R767, G768, G778, T779, R801, A808) that were >12 Å from F794, as well as four predicted PPI-nonhot spots (E773, P789, V790, Q807).
To validate PPI-hotspotID’s predictions, we mutated the aforementioned predicted PPI-hot spots and PPI-nonhot spots in mouse eEF2 (meEF2), which shares 99% sequence identity with human eEF2. The generated eEF2 mutants (L763A, 766AAA768, E773Q, 778AAA780, 789AA790, F794A, R801A, Q807E, A808S, and D815A), along with wild-type eEF2 and negative control (enhanced green fluorescent protein [EGFP]), were then screened for interaction with CPEB2 by co-immunoprecipitation (co-IP). This assay identified F794 as a critical eEF2 residue for binding to CPEB2. To confirm the initial screening result, we selected three mutants (778AAA780, F794A, and D815A designated as mut1, mut2, and mut3) for further analysis (Figure 2a). The interaction of wild-type and mutant eEF2 with CPEB2 was analyzed again by reciprocal co-IP. The results in Figure 2b show that the F794A mutation (mut2) abolished binding to CPEB2.
-
Figure 2—source data 1
- https://cdn.elifesciences.org/articles/96643/elife-96643-fig2-data1-v1.docx
-
Figure 2—source data 2
- https://cdn.elifesciences.org/articles/96643/elife-96643-fig2-data2-v1.zip
-
Figure 2—source data 3
- https://cdn.elifesciences.org/articles/96643/elife-96643-fig2-data3-v1.docx
Next, we investigated whether disrupting the association between CPEB2 and eEF2 affects HIF-1α expression in vivo. Because eEF2 is an essential and abundant translational factor, its ectopic expression alone was insufficient to override the function of endogenous eEF2. Thus, we tested the F794A mutant under knockdown of endogenous eEF2 condition. HeLa cells were transfected with plasmids lacking (siCtrl) or containing a short hairpin sequence for human eEF2 (siheEF2) and subjected to puromycin selection. Both shRNA sequences, specifically knocking down human but not mouse eEF2, decreased endogenous eEF2 after 4 days (Figure 2c). HeLa cells were then transfected with the siheEF2 and flag-meEF2 (wild-type, mut2, or mut3) plasmids and subjected to puromycin and G418 selection for 4 days. Cells that survived were incubated with S35-methionine/cysteine to metabolically label synthesized proteins. The expression of the F794A or D815A mutant did not affect general protein synthesis (Figure 2d). However, the level of HIF-1α, but not CPEB2 or β-actin, was selectively increased in HeLa cells reconstituted with the F794A mutant (Figure 2e). Figure 2—figure supplement 1 shows full-length gels and blots in Figure 2, and Figure 2—source data 1 shows the uncropped immunoblot images, and Figure 2—source data 2 contains raw unedited blots. Thus, the eEF2 F794A mutation influences the translation of CPEB2-targeted HIF-1α mRNA without affecting general translation function.
Discussion
Identifying PPI-hot spots is challenging especially when the complex structure is lacking. A key hurdle is the lack of experimental data on PPI-hot spots, which hampers the training of accurate machine-learning models for their prediction. Here, we introduced two novel elements that have helped to identify PPI-hot spots using the unbound structure. First, we have constructed a dataset comprising 414 experimentally known PPI-hot spots and 504 nonhot spots, and carefully checked that PPI-hot spots have no mutations resulting in ΔΔGbind < 0.5 kcal/mol, whereas nonhot spots have no mutations resulting in ΔΔGbind ≥ 0.5 kcal/mol or impact binding in immunoprecipitation or GST pull-down assays (see ‘Materials and methods’). In contrast, SPOTONE (Preto and Moreira, 2020) employed nonhot spots defined as residues that upon alanine mutation resulted in ΔΔGbind < 2.0 kcal/mol. Notably, previous PPI-hot spot prediction methods did not employ PPI-hot spots whose mutations have been curated to significantly impair/disrupt PPIs in UniProtKB (see ‘Introduction’). Second, we introduced novel features derived from unbound protein structures such as the gas-phase energy of the target protein relative to its unfolded state. The importance test results indicated the gas-phase energy as an important feature. This finding can be rationalized by considering how PPI-hot spots make significant contributions to the overall binding free energy, ΔGbind. PPI-hot spots can enhance favorable enthalpic contributions to the ΔGbind through hydrogen bonds or vdW contacts across the protein’s interface. This makes them energetically unstable in the absence of the protein’s binding partner and solvent; hence, the gas-phase energy was found to be an important input feature. Alternatively, PPI-hot spots can counteract unfavorable entropic loss upon protein binding by maintaining an optimal binding scaffold; hence, they are energetically stable.
Methods that rely on complex structures generally predict residues that make multiple contacts across the protein–protein interface as PPI-hot spots. Some of these methods assume that PPI-hot spots are exclusively located at the interface and aim to spot them among the interface residues (Wang et al., 2018a). In contrast, PPI-hotspotID leverages evolutionary conservation, residue type, and stability principles based on the free protein structure to detect PPI-hot spots, including those lacking direct contact with the partner protein. Such noninterface PPI-hot spots may serve to maintain an optimal scaffold for protein binding and are not uncommon: from our analysis of the 243 true PPI-hot spots in proteins with the complex structures, we found 67 ‘noninterface’ PPI-hot spots with no hydrogen bonds and/or vdW contacts across the protein interface. PPI-hotspotID identified 43 of these 67 noninterface PPI-hot spots. An illustrative example is the binding of Golgi resident protein (GCP60) with phosphatidylinositol 4-kinase β (PI4K-β). PPI-hotspotID correctly predicted all four experimentally known GCP60 PPI-hot spots including F19 and Y46, which do not form hydrogen bonds across the interface with PI4K-β (Figure 3). These results highlight the ability of PPI-hotspotID to identify PPI-hot spots involved in indirect interactions with partner proteins.
Proteins typically interact with multiple partners, but their PPI-hot spots may have been experimentally characterized for only a few partners. In some cases where PPI-hotspotID predicted residues that were absent in the PPI-Hotspot+PDBBM(1.1) as PPI-hot spots, the protein’s complex structures with other binding partners show intermolecular hydrogen bonds between PPI-hotspotID-predicted residues and residues of the respective partner proteins. This suggests that some of the PPI-hotspotID-predicted residues might be potential PPI-hot spots for other binding partners. For example, the death domain of CRADD (caspase-recruitment domain and death domain-containing adaptor protein) contains 7 experimentally known PPI-hot spots (N121, Q125, Y146, R147, K149, V156, Q169) critical for its interaction with PIDD (p53-induced death domain-containing protein). Based on the free crystal structure of CRADD (PDB 2O71-A) (Park and Wu, 2006), PPI-hotspotID correctly predicted three true positives (Y146, R147, and Q169), as well as G128. In the oligomeric structure (PDB 2OF5) (Park et al., 2007) of seven CRADD proteins in complex with five PIDD proteins, G128 shows no hydrogen-bonding interactions, but its neighbor, L127, forms a backbone – side chain hydrogen bond with R147 in another CRADD chain (Figure 4). A positively charged G128R mutation would repel the nearby positively charged R147 in another CRADD chain, thus disrupting the CRADD–CRADD interface and decreasing CRADD’s affinity for PIDD. Experimental data showed that the G128R CRADD mutant did not co-immunoprecipitate the PIDD death domain, and patients who have non-syndromic mental retardation possess the G128R mutant (Puffenberger et al., 2012). Thus, PPI-hotspotID could unveil a PPI-hot spot, G128, that is not apparent from the 2OF5 complex structure: although G128 does not directly interact with PIDD, its mutation, especially to an Arg, might perturb the CRADD–CRADD interface and thus CRADD’s oligomeric structure and binding affinity for PIDD.
The ability of PPI-hotspotID to detect PPI-hot spots provides biologists with a useful tool, as alanine-scanning mutagenesis and protein–protein complex structure determination to identify PPI-hot spots are laborious, time-consuming, and costly. Conventional methods based on complex structures might miss nonobvious PPI-hot spots with no direct interactions with the protein’s partner. AlphaFold-Multimer and future improved protein–protein complex prediction methods require knowledge of interacting partners and independent calculations for each known partner, which reduces the overall efficiency. Moreover, solved/modeled protein–protein complex structures only reveal the interface residues. In contrast, PPI-hotspotID can reveal nonobvious PPI-hot spots as well as potential PPI-hot spots for other protein partners, thus helping to elucidate the different PPI mechanisms.
Materials and methods
Dataset: True PPI-hot spots
Request a detailed protocolWe updated the PPI-Hotspot+PDBBM benchmark by removing two fused protein structures and adding new PPI-hot spots by (i) reviewing references in ASEdb (Thorn and Bogan, 2001) to include nonalanine mutations with ΔΔGbind > 2 kcal/mol, and (ii) checking the experimental data of certain mutations in UniProtKB (UniProt Consortium, 2018). For example, the PPI-Hotspot+PDBBM benchmark included R43A in aprataxin (UniProtID Q7Z2E3), annotated as ‘loss of interaction with MDC1’, but not K52A, annotated as ‘impairs interaction with MDC1’. However, when we checked the experimental data in the UniProtKB reference, the binding bands were absent for both R43A and K52A mutants; therefore, we added K52A as a PPI-hot spot. The updated benchmark, termed PPI-Hotspot+PDBBM(1.1), contains 414 PPI-hot spots. Among these, 104 PPI-hot spots in 32 nonredundant proteins are based on mutations resulting in ΔΔGbind ≥ 2 kcal/mol from ASEdb (Thorn and Bogan, 2001) and SKEMPI2.0 (Jankauskaite et al., 2019) with no known mutations resulting in ΔΔGbind < 0.5 kcal/mol. The remaining 310 PPI-hot spots in 128 nonredundant proteins are based on mutations that are manually curated in UniProtKB (UniProt Consortium, 2018) to significantly impair/disrupt PPIs. Two of the proteins have PPI-hot spots from ASEdb/SKEMPI2.0 and UniProtKB, resulting in a total of 158 nonredundant proteins with free structures harboring 414 PPI-hot spots.
True PPI-nonhot spots
Request a detailed protocolTo obtain PPI-nonhot spots for the 158 nonredundant proteins with true PPI-hot spots, we identified residues from ASEdb (Thorn and Bogan, 2001) and SKEMPI2.0 (Jankauskaite et al., 2019) databases where mutations to alanine/nonalanine resulted in protein ΔΔGbind < 0.5 kcal/mol. We also identified residues in the UniProtKB where mutations to alanine/nonalanine were curated not to perturb PPIs. We manually checked each reference to ensure that mutations of these residues did not lead to ΔΔGbind changes ≥0.5 kcal/mol or impact binding in immunoprecipitation or GST pull-down assays. PPI-nonhot spots in non-native proteins or regions with missing structures were excluded.
Input features
Request a detailed protocolTo distinguish PPI-hot spots from PPI-nonhot spots, we input sequence, structural, and stability features of each residue in the protein for training various machine-learning classifiers. The input features for each residue i of a protein included its aa type, conservation score, secondary structure, SASA, gas-phase energy, and respective components, polar solvation free energy, and nonpolar solvation free energy. The secondary structure, SASA, and energy components of each residue were computed using the DSSP program (Kabsch and Sander, 1983), FreeSasa (Mitternacht, 2016), and AmberTools version 20 (Case, 2020), respectively, using default parameters.
Per-residue free energy contributions
Request a detailed protocolFor a given free protein structure, the Reduce program (Word et al., 1999) was used to add hydrogens and assign the protonation states of ionizable residues. Additional missing heavy and hydrogen atoms were added using the AmberTools version 20 (Case, 2020) and the Amber FF19SB forcefield (Tian et al., 2020). To eliminate any steric clashes, we performed a conjugate gradients minimization with constraints on the heavy atoms using the Generalized Born model for 500 steps. The resulting structure was used to compute the per-residue energy/free energy contributions using the Molecular Mechanics Poisson–Boltzmann Surface Area module in AmberTools (Case, 2020). For each residue i in the protein, we computed the (i) molecular mechanics energy Eigas = EiMM,int + EiMM,vdW + EiMM,ele, where EiMM,int includes contributions from bonded terms, EiMM,vdW is the vdW interaction energy, and EiMM,ele is the electrostatic interaction energy as well as (ii) the polar (ΔGisolv,pol) and nonpolar (ΔGisolv,npl) solvation free energies relative to the corresponding values of residue i in an extended reference state where the residues do not interact with one another (Chen et al., 2007).
Per-residue conservation score
Request a detailed protocolTo calculate the conservation score, kiC, of residue i in a protein, we implemented a method similar to ConSurf (Glaser et al., 2003; Landau et al., 2005) to run in parallel with the energy evaluation code. First, we searched the UNIREF-90 database (Wu et al., 2006) using HMMER (Johnson et al., 2010) to find sequences similar to the target sequence. Near-duplicates were removed by clustering matched sequences with ≥95% pairwise sequence identity using CD-hit (Li and Godzik, 2006) and keeping only one representative. Since HMMER (Johnson et al., 2010) may only find good matches for a small proportion of the target sequence, we compared the HMMER sequences with the target sequence. We kept only those with >60% overlap with the target sequence and discarded sequences that were dissimilar (≤35% sequence identity) or nearly identical (≥95% sequence identity). Next, we pairwise aligned the remaining sequences, and if two sequences overlapped by >10% of the sequence, we rejected the shorter sequence. After this filtering process, the resulting HMMER hits were used, or if the number of hits exceeded 300, we selected the top 300 hits. These sequences were then aligned to the target sequence using MAFFT-LINSi (Nakamura et al., 2018). We then used the Rate4Site program (Pupko et al., 2002) to compute position-specific evolutionary rates from the generated multiple sequence alignment. These rates were normalized and grouped into ConSurf grades ranging from 1 to 9, where kC = 1 represents the most rapidly evolving residues, and kC = 9 indicates the most conserved residues.
Generating PPI-hot spot predictive model using AutoGluon
Request a detailed protocolWe provided all the aforementioned residue features including the conservation score, kiC, aa type, DSSP secondary structure, SASAi, EiMM,int, EiMM,vdW, EiMM,ele, EiMM, ΔGisolv,pol, andΔGisolv,npl to the Tabular module in AutoGluon v0.8.2 (https://auto.gluon.ai/stable/index.html). AutoGluon was chosen for model training and validation due to its robustness and user-friendly interface, allowing for the simultaneous and automated exploration of various machine-learning approaches and their combinations. Instead of using a single training set to train the model and a separate test set to evaluate its performance, we employed cross-validation as it utilizes the entire dataset for both training and testing, making efficient use of the limited data on PPI-hot spots and PPI-nonhot spots. AutoGluon-Tabular automatically chose a random partitioning of our dataset into multiple subsets/folds for training and validation. Notably, the training and validation data share insignificant homology as the average pairwise sequence identity in our dataset is 26%. Each fold was used once as a test set, while the remaining folds served as the training set. For each test set, the model’s performance was measured using the F1 score.
AutoGluon then trained individual ‘base’ models, including LightGBM, CatBoost, XGBoost, random forests, extremely randomized trees, neural networks, and K-nearest neighbors. Using the aggregated predictions of the base models as features in addition to the original features, AutoGluon trained multiple ‘stacker’ models, whose predictions were fed as inputs to additional higher layer stacker models in an iterative process called multilayer stacking. The output layer used ensemble selection to aggregate the predictions of the stacker models. To improve stacking performance, AutoGluon used all the data for both training and validation through repeated k-fold bagging of all models at all layers of the stack, where k is determined by best precision. We refer the reader to the original study by Erickson et al., 2020, which provides details on the methodology including the types of ‘base’ models, multilayer stack ensembling, and repeated k-fold bagging. Based on the highest mean F1 score, AutoGluon yielded a final PPI-hot spot predictive model that is a weighted regularized ensemble comprising more than a dozen different models (https://auto.gluon.ai/dev/api/autogluon.tabular.models.html).
Selecting key features
Request a detailed protocolNext, we evaluated the importance of each feature by performing a permutation-based test (part of the AutoGluon package), in which a feature in a column was randomly shuffled across different residues (rows), and the F1 score was evaluated. The importance test results revealed the four most important residue features, which in order of their importance are (i) kC, (ii) aa residue type, (iii) SASAi, and (iv) ΔGigas. These four features were used to train an ensemble of machine-learning models using the entire dataset, consisting of 414 true PPI-hot spots and 504 nonhot spots. The resulting PPI-hot spot prediction model, named PPI-hotspotID, yielded an F1 score comparable to the F1 score obtained using the initial set of 10 features. PPI-hotspotID was implemented as a freely accessible web server (https://ppihotspotid.limlab.dnsalias.org/; Wright et al., 2024) with access to four virtual CPUs and 8 GB of memory. Calculations for a 539-residue protein (PDB 1c2bA) took 35 min. The source code for PPI-hotspotID is available at https://github.com/wrigjz/ppihotspotid/ (Wright, 2024).
Detecting PPI-hot spots using the AlphaFold-Multimer-predicted interface
Request a detailed protocolIn cases where experimental complex structures are unavailable, can the protein–protein complexes modeled by AlphaFold-Multimer (Evans et al., 2021) be used to identify PPI-hot spots using the predicted interface residues? To address this, we first identified PPI-hot spots within the PPI-Hotspot+PDBBM(1.1) dataset that lack experimentally determined protein complex structures. Not all the 414 PPI-hot spots in the PPI-Hotspot+PDBBM(1.1) have sequence information and thus UniProtID of the respective binding partners, leaving 360 PPI-hot spots in 135 proteins associated with 155 pairs of PPIs, as some proteins are involved in multiple PPIs. Also, 90 of the 155 PPI pairs have complex structures in the PDB. For the 65 PPI pairs lacking complex structures, 17 pairs contain >1100 residues, exceeding the current size limit of AlphaFold-Multimer (Evans et al., 2021). Thus, we generated structural models for the remaining 48 complexes using the AlphaFold-Multimer module in the ColabFold version 1.3.0 (Mirdita et al., 2022) with default settings. For each AB complex, the input sequence for protein A was based on the free structure sequence in the PPI-Hotspot+PDBBM(1.1), whereas that for protein B was retrieved in its entirety from the UniProtKB (UniProt Consortium, 2018) as the binding region in protein B was unknown. Based on the AMBER-relaxed model structure with the highest pTM score, interface residues were defined as residues of protein A with ≥1 atom within a 5 Å cutoff of any protein B atom.
Experimental verification of predicted eEF2 PPI-hot spots: Plasmid construction
Request a detailed protocolThe predicted PPI-hot spots and PPI-nonhot spots were mutated using the QuikChange Site-Directed Mutagenesis Kit (Stratagene). The pcDNA3.1-flag-meEF2 plasmid was used as the PCR template, and the sets of sense and antisense primers for mutagenesis are listed in Figure 2—source data 3. All constructs were sequenced to confirm the mutations. The shRNA clones, #1 TRCN0000047908 (GCGATCATGAATTTCAAGAAA) and #2: TRCN0000047910 (GCAGTACCTCAACGAGATCAA), against human eEF2 mRNA were obtained from the RNAi Core Facility (Academia Sinica).
Cell lines
Request a detailed protocolHEK-293T cells (# CRL-3216) and HeLa cells (# CCL-2) were obtained from American Type Culture Collection (ATCC).
Testing eEF2-CPEB2 interactions using co-IP and reciprocal co-IP
Request a detailed protocolHEK-293T cells (ATCC, # CRL-3216) were cultured in DMEM with 10% fetal bovine serum (FBS). For reciprocal co-IP, the 8 µg DNA mixture containing 3 µg myc-CPEB2 and 5 µg flag-eEF2 (or a negative control, GFP) plasmids, generated in the previous study (Chen and Huang, 2012), was transfected into a 10 cm dish of 293T cells using Lipofectamine 2000. We transfected more flag-eEF2 plasmid DNA than myc-CPEB2 plasmid because myc-CPEB2 is expressed more abundantly than flag-eEF2. Overnight transfected cells were lysed in 500 µl IP buffer (20 mM HEPES, pH 7.4, 100 mM NaCl, 1 mM MgCl2, 0.1% TritonX-100, 10% glycerol, 0.5 mM DTT, 1× protease inhibitor cocktail, and 100 µg/ml RNaseA) and centrifuged at 10,000 × g for 3 min at 4°C. The supernatant was equally divided and incubated with Protein G beads bound with myc (Abcam, 9E10 clone) or flag (Sigma-Aldrich, F1804) antibody for 3 hr at 4°C to respectively pull down myc-CPEB2 and flag-eEF2. The beads were washed five times with 300 µl IP buffer. If myc-CPEB2 and flag-eEF2 interact, myc-CPEB2 can co-precipitate with flag-eEF2 on flag antibody beads, whereas flag-eEF2 can co-precipitate with myc-CPEB2 on myc antibody beads. GFP was used as a negative control to ensure that the signals on the beads were caused by binding between flag-eEF2 and myc-CPEB2. The precipitated proteins were separated on a sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) for western blot analysis. Similarly, for the initial co-IP screening, the 4 µg DNA mixture containing 1.5 µg myc-CPEB2 and 2.5 µg flag-eEF2 (or a negative control, EGFP) plasmids was transfected into a 6 cm dish of 293T cells, harvested in 200 µl IP buffer, and immunoprecipitated using flag antibody-bound beads.
Functional impact of eEF2 mutants on HIF-1α and global protein synthesis
Request a detailed protocolHeLa cells (ATCC, # CCL-2) were cultured in DMEM with 10% FBS. Each 6 cm plate of HeLa cells was transfected with 2 μg human eEF2 knockdown plasmid and 2 μg flag-meEF2 wild-type/mutant plasmid. Overnight transfected cells were selected with 0.5 μg/ml puromycin and 600 μg/ml G418 for 3 days to knock down endogenous eEF2 and maintain the expression of flag-meEF2, respectively. The selected cells were incubated with 10 μM MG132 for 4 hr before harvesting for western blotting of HIF-1α or replaced with 2 ml Met/Cys-lacking DMEM with 1% FBS and 60 μCi 35S-Met/Cys (PerkinElmer, cat# NEG772002MC) for 2 hr before separation on an SDS-PAGE. Antibodies used are CPEB2, generated in house; HIF-1α (NB100-134) from Novus; eEF2 (SC-13004) from Santa Cruz Biotechnology; and β-actin (A5441) form Sigma-Aldrich.
Data availability
All data generated or analyzed during this study are included in the manuscript and supporting files. The PPI-HotspotID program is available at https://github.com/wrigjz/ppihotspotid/ (Wright, 2024). A web server to perform PPI-hotspot predictions and the dataset comprising 414 experimentally known PPI-hot spots and 504 PPI-nonhot spots are available at https://ppihotspotid.limlab.dnsalias.org/.
References
-
Flex ddG: Rosetta ensemble-based estimation of changes in protein-protein binding affinity upon mutationThe Journal of Physical Chemistry. B 122:5389–5399.https://doi.org/10.1021/acs.jpcb.7b11367
-
Anatomy of hot spots in protein interfacesJournal of Molecular Biology 280:1–9.https://doi.org/10.1006/jmbi.1998.1843
-
CPEB2-eEF2 interaction impedes HIF-1α RNA translationThe EMBO Journal 31:959–971.https://doi.org/10.1038/emboj.2011.448
-
PPI-Hotspot DB: Database of protein–protein interaction hot spotsJournal of Chemical Information and Modeling 62:1052–1060.https://doi.org/10.1021/acs.jcim.2c00025
-
A feature-based approach to modeling protein-protein interaction hot spotsNucleic Acids Research 37:2672–2687.https://doi.org/10.1093/nar/gkp132
-
Hot spots in protein-protein interfaces: towards drug discoveryProgress in Biophysics and Molecular Biology 116:165–173.https://doi.org/10.1016/j.pbiomolbio.2014.06.003
-
Unraveling hot spots in binding interfaces: progress and challengesCurrent Opinion in Structural Biology 12:14–20.https://doi.org/10.1016/s0959-440x(02)00283-x
-
Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutationsJournal of Molecular Biology 320:369–387.https://doi.org/10.1016/S0022-2836(02)00442-4
-
Prediction of binding hot spot residues by using structural and evolutionary parametersGenetics and Molecular Biology 32:626–633.https://doi.org/10.1590/S1415-47572009000300029
-
ConferenceAn improved ensemble learning method with SMOTE for protein interaction hot spots predictionIEEE International Conference on Bioinformatics and Biomedicine (BIBM). pp. 1584–1589.https://doi.org/10.1109/BIBM.2016.7822756
-
Computational alanine scanning of the 1:1 human growth hormone-receptor complexJournal of Computational Chemistry 23:15–27.https://doi.org/10.1002/jcc.1153
-
Predicting and experimentally validating hot-spot residues at protein-protein interfacesACS Chemical Biology 14:2252–2263.https://doi.org/10.1021/acschembio.9b00560
-
Prediction of protein hotspots from whole protein sequences by a random projection ensemble systemInternational Journal of Molecular Sciences 18:E1543.https://doi.org/10.3390/ijms18071543
-
Hot regions in protein--protein interactions: the organization and contribution of structurally conserved hot spot residuesJournal of Molecular Biology 345:1281–1294.https://doi.org/10.1016/j.jmb.2004.10.077
-
ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structuresNucleic Acids Research 33:W299–W302.https://doi.org/10.1093/nar/gki370
-
PDBsum: Structural summaries of PDB entriesProtein Science 27:129–134.https://doi.org/10.1002/pro.3289
-
Computational alanine scanning to probe protein−protein interactions: A novel approach to evaluate binding free energiesJournal of the American Chemical Society 121:8133–8143.https://doi.org/10.1021/ja990935j
-
A machine learning approach for hot-spot detection at protein-protein interfacesInternational Journal of Molecular Sciences 17:1215.https://doi.org/10.3390/ijms17081215
-
ColabFold: Making protein folding accessible to allNature Methods 19:679–682.https://doi.org/10.1038/s41592-022-01488-1
-
Computational alanine scanning mutagenesis--an improved methodological approachJournal of Computational Chemistry 28:644–654.https://doi.org/10.1002/jcc.20566
-
Solvent accessible surface area-based hot-spot detection methods for protein-protein and protein-nucleic acid interfacesJournal of Chemical Information and Modeling 55:1077–1086.https://doi.org/10.1021/ci500760m
-
Oncogenic protein interfaces: small molecules, big challengesNature Reviews. Cancer 14:248–262.https://doi.org/10.1038/nrc3690
-
Protein interaction hotspot identification using sequence-based frequency-derived featuresIEEE Transactions on Bio-Medical Engineering 60:2993–3002.https://doi.org/10.1109/TBME.2011.2161306
-
Protein-protein interaction hotspots carved into sequencesPLOS Computational Biology 3:e119.https://doi.org/10.1371/journal.pcbi.0030119
-
Artificial intelligence based methods for hot spot predictionCurrent Opinion in Structural Biology 72:209–218.https://doi.org/10.1016/j.sbi.2021.11.003
-
Crystal structure of RAIDD death domain implicates potential mechanism of PIDDosome assemblyJournal of Molecular Biology 357:358–364.https://doi.org/10.1016/j.jmb.2005.12.082
-
SPOTONE: Hot spots on protein complexes with extremely randomized trees via sequence-only featuresInternational Journal of Molecular Sciences 21:7281.https://doi.org/10.3390/ijms21197281
-
Using machine‐learning‐driven approaches to boost hot‐spot’s knowledgeWIREs Computational Molecular Science 12:1602.https://doi.org/10.1002/wcms.1602
-
Hot-spot analysis for drug discovery targeting protein-protein interactionsExpert Opinion on Drug Discovery 13:327–338.https://doi.org/10.1080/17460441.2018.1430763
-
ff19SB: Amino-acid-specific protein backbone parameters trained against quantum mechanics energy surfaces in solutionJournal of Chemical Theory and Computation 16:528–552.https://doi.org/10.1021/acs.jctc.9b00591
-
HotPoint: hot spot prediction server for protein interfacesNucleic Acids Research 38:W402–W406.https://doi.org/10.1093/nar/gkq323
-
UniProt: the universal protein knowledgebaseNucleic Acids Research 46:2699.https://doi.org/10.1093/nar/gky092
-
Prediction of hot spots in protein interfaces using a random forest model with hybrid featuresProtein Engineering, Design & Selection 25:119–126.https://doi.org/10.1093/protein/gzr066
-
Local and global anatomy of antibody‐protein antigen recognitionJournal of Molecular Recognition 31:e2693.https://doi.org/10.1002/jmr.2693
-
Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientationJournal of Molecular Biology 285:1735–1747.https://doi.org/10.1006/jmbi.1998.2401
-
BookThe PPI-hotspotID Prediction Server and DatasetWright JD, editors. Academia Sinica.
-
The universal protein resource (UniProt): An expanding universe of protein informationNucleic Acids Research 34:D187–D191.https://doi.org/10.1093/nar/gkj161
-
Prediction of hot spots residues in protein–protein interface using network feature and microenvironment featureChemometrics and Intelligent Laboratory Systems 131:16–21.https://doi.org/10.1016/j.chemolab.2013.11.010
-
Relationship between hot spot residues and ligand binding hot spots in protein-protein interfacesJournal of Chemical Information and Modeling 52:2236–2244.https://doi.org/10.1021/ci300175u
Article and author information
Author details
Funding
Academia Sinica (AS-IA-107-L03)
- Carmay Lim
Ministry of Education (MOST-98-2113-M-001-011)
- Carmay Lim
Institute of Biomedical Sciences, Academia Sinica
- Yi-Shuian Huang
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This research was supported by Academia Sinica (AS-IA-107-L03) and the Ministry of Science and Technology (MOST-98-2113M-001-011), Taiwan.
Version history
- Sent for peer review:
- Preprint posted:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
- Version of Record published:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.96643. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2024, Chen et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 901
- views
-
- 42
- downloads
-
- 0
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Structural Biology and Molecular Biophysics
Under physiological conditions, proteins continuously undergo structural fluctuations on different timescales. Some conformations are only sparsely populated, but still play a key role in protein function. Thus, meaningful structure–function frameworks must include structural ensembles rather than only the most populated protein conformations. To detail protein plasticity, modern structural biology combines complementary experimental and computational approaches. In this review, we survey available computational approaches that integrate sparse experimental data from electron paramagnetic resonance spectroscopy with molecular modeling techniques to derive all-atom structural models of rare protein conformations. We also propose strategies to increase the reliability and improve efficiency using deep learning approaches, thus advancing the field of integrative structural biology.
-
- Structural Biology and Molecular Biophysics
In eukaryotes, RNAs transcribed by RNA Pol II are modified at the 5′ end with a 7-methylguanosine (m7G) cap, which is recognized by the nuclear cap binding complex (CBC). The CBC plays multiple important roles in mRNA metabolism, including transcription, splicing, polyadenylation, and export. It promotes mRNA export through direct interaction with a key mRNA export factor, ALYREF, which in turn links the TRanscription and EXport (TREX) complex to the 5′ end of mRNA. However, the molecular mechanism for CBC-mediated recruitment of the mRNA export machinery is not well understood. Here, we present the first structure of the CBC in complex with an mRNA export factor, ALYREF. The cryo-EM structure of CBC-ALYREF reveals that the RRM domain of ALYREF makes direct contact with both the NCBP1 and NCBP2 subunits of the CBC. Comparing CBC-ALYREF with other cellular complexes containing CBC and/or ALYREF components provides insights into the coordinated events during mRNA transcription, splicing, and export.