Coevolution-based prediction of key allosteric residues for protein function regulation

  1. Juan Xie
  2. Weilin Zhang
  3. Xiaolei Zhu
  4. Minghua Deng
  5. Luhua Lai  Is a corresponding author
  1. Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, China
  2. BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, China
  3. School of Sciences, Anhui Agricultural University, China
  4. School of Mathematical Sciences, Peking University, China
  5. Center for Statistical Science, Peking University, China
  6. Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), China

Abstract

Allostery is fundamental to many biological processes. Due to the distant regulation nature, how allosteric mutations, modifications, and effector binding impact protein function is difficult to forecast. In protein engineering, remote mutations cannot be rationally designed without large-scale experimental screening. Allosteric drugs have raised much attention due to their high specificity and possibility of overcoming existing drug-resistant mutations. However, optimization of allosteric compounds remains challenging. Here, we developed a novel computational method KeyAlloSite to predict allosteric site and to identify key allosteric residues (allo-residues) based on the evolutionary coupling model. We found that protein allosteric sites are strongly coupled to orthosteric site compared to non-functional sites. We further inferred key allo-residues by pairwise comparing the difference of evolutionary coupling scores of each residue in the allosteric pocket with the functional site. Our predicted key allo-residues are in accordance with previous experimental studies for typical allosteric proteins like BCR-ABL1, Tar, and PDZ3, as well as key cancer mutations. We also showed that KeyAlloSite can be used to predict key allosteric residues distant from the catalytic site that are important for enzyme catalysis. Our study demonstrates that weak coevolutionary couplings contain important information of protein allosteric regulation function. KeyAlloSite can be applied in studying the evolution of protein allosteric regulation, designing and optimizing allosteric drugs, and performing functional protein design and enzyme engineering.

Editor's evaluation

The manuscript reports on a useful tool to study protein allosteric regulation function. The work is based on inadequate experimental validation of the predicted residues implicated in mediating allosteric signaling. The study highlights the significance of the weak pairwise term for the prediction of the allosteric function.

https://doi.org/10.7554/eLife.81850.sa0

Introduction

Allostery commonly refers to one type of distant regulation, that is, a perturbation at one site of a macromolecule can affect the function of another site (Dokholyan, 2016), which plays important roles in many biological processes, such as enzyme catalysis (Tsai et al., 2009) and signal transduction (Hilser et al., 2012). Compared to traditional orthosteric drugs, allosteric drugs have unique advantages, including higher specificity, fewer side effects, etc. (Changeux and Christopoulos, 2016; Thal et al., 2018; Wenthur et al., 2014). However, optimization of allosteric molecules faces great challenges as allosteric molecules usually have flat structure-activity relationships (flat SARs or shallow SARs, referring to the phenomenon that no robust SARs could be obtained as small modifications may destroy the activity), and higher binding affinity does not always correspond to better activity (Christopoulos, 2002; Jimenez et al., 2012; Lewis et al., 2008; Lindsley, 2014; Nussinov and Tsai, 2012). In an allosteric pocket, the contribution of each residue to the allosteric effect is different. Bi et al. found that the interactions between allosteric molecules and the target protein can be divided into two types, interactions that only contribute to binding and interactions that contribute to both binding and signaling. Based on these understandings, they rationally designed Tar variants and engineered Escherichia coli to sense new ligands by maintaining the interactions responsible for chemotaxis allosteric signaling while changing the interactions responsible only for ligand binding (Bi et al., 2013). Nussinov et al. proposed that atoms in allosteric effectors could be divided into anchor and driver atoms. The anchors docked into the allosteric pockets, which allowed the drivers to perform a ‘pull’ or ‘push’ action (Nussinov and Tsai, 2014). Both drivers and anchors showed specific interactions with their host proteins, with the former mainly responsible for the allosteric efficacy and the latter for binding affinity (Nussinov et al., 2014). Therefore, it is important to identify residues in the host proteins that form interactions with allosteric molecules and produce allosteric signaling, which we refer to as key allo-residues, so that allosteric molecules can be optimized and designed based on these key allo-residues. Unfortunately, identifying key allo-residues remains challenging. Currently available computational methods mainly focused on the prediction of allosteric sites (Ma et al., 2016; Qi et al., 2012; Wagner et al., 2016; Xie et al., 2022), allosteric pathways (Botello-Smith and Luo, 2019; Lake et al., 2020), and key residues in allosteric pathways (Wang et al., 2020). Kalescky et al. developed the rigid residue scan method to identify key residues for protein allostery, in which multiple molecular dynamics (MD) simulations need to be performed for unbound and bound proteins. As only one residue was regarded as a single rigid body in each simulation, many simulations were necessary, which are computationally expensive and time-consuming (Kalescky et al., 2015). Therefore, methods for systematically and rapidly identifying key allo-residues in protein allosteric pockets need to be developed.

During evolution, unrelated residues may evolve independently, while functionally coupled residues coevolve. Coevolution means that when a residue changes, the residues that are structurally or functionally coupled with it will also change accordingly to maintain the overall spatial structure and biological function (de Juan et al., 2013). In principle, since homologous sequences record the long-term evolution of a protein family, the coupling pattern between residues can be estimated from multiple sequence alignment (MSA; Reynolds et al., 2011).

Various methods have been developed to analyze residue-residue coupling during evolution, which greatly expedite the recent progress of protein structure prediction (Ekeberg et al., 2013; Marks et al., 2011; Morcos et al., 2011). Direct coupling analysis (DCA) is one of such approaches that can remove the indirect correlation between residues and reflect the direct coevolution between residues (Cocco et al., 2013). DCA mainly uses methods in statistical physics to infer the pairwise coupling Jij between positions, which can explain the observed correlation between residues in an MSA. In structure predictions, only the top couplings in Jij were used (Morcos et al., 2011; Weigt et al., 2009). Recent studies showed that the weak, non-contact couplings in Jij are significantly important for the prediction of protein function, although they play as noise in predicting structural contacts. Salinas et al. proposed that the information of allosteric energy interactions is included in the statistics of MSAs and therefore is part of the entire evolutionary constraints (Salinas and Ranganathan, 2018). Russ et al. found that the top coupling items in Jij alone cannot effectively reproduce the alignment statistics in the AroQ family or the functional effects of mutations. This implies that protein functions may depend on many weak, non-contact items in Jij . Although there is no simple physical explanation for these weak items at present, they seem to represent the collective global evolution of residues, and further research is needed to reveal the significance of these items (Russ et al., 2020). Similar findings have been reported in the DCA-based prediction of protein-protein interaction (PPI), where the quality of prediction depends on many weak couplings (Bitbol, 2018).

In the present study, we analyzed the evolutionary couplings (ECs) between residues in orthosteric and allosteric sites. We found that weak couplings in Jij contain allosteric information and developed the first systematic and efficient computational method KeyAlloSite to predict key allo-residues based on the EC model (ECM; see Materials and methods for details). For each protein in the allosteric protein data set, we first performed MSA (Figure 1A) and calculated the ECs between residues using the ECM (Figure 1B). We then studied the coevolution between orthosteric and allosteric sites and found that their EC strength (ECS) is stronger than that between orthosteric and other non-functional pockets. We further performed pairwise comparison of the differences in EC scores of residues in the allosteric pocket with orthosteric pocket (Figure 1C–E; see Materials and methods for details) and got the number of significant differences corresponding to each residue in the allosteric pocket (Figure 1F). After the numbers of significant differences were normalized into Z-scores, residues corresponding to Z-scores larger than a threshold were predicted as key allo-residues. We have applied KeyAlloSite to identify key allo-residues in several allosteric proteins, including BCR-ABL1, Tar, and PDZ3, and compared the prediction results with previously reported experimental data. KeyAlloSite was also used to predict cancer mutations as well as key distant residues for enzymatic catalysis. Our study provides essential information for understanding how allosteric regulations are evolved, for designing and optimizing allosteric drugs, and for designing highly efficient enzymes and other functional proteins.

Steps to identify key allo-residues.

(A) Multiple sequence alignment. (B) Evolutionary coupling (EC) analysis. (C–D) Calculation of the EC values between residues in allosteric and orthosteric pockets. (E) Pairwise compared the difference of EC values corresponding to residues in allosteric pocket. (F) The number of significant differences corresponding to each residue in allosteric pocket.

Results

The evolutionary coupling between orthosteric and allosteric sites is stronger

We selected 23 allosteric proteins from the ‘Core Set’ of ASBench (Huang et al., 2015) as our data set, including 25 known allosteric sites (Supplementary file 1; see Materials and methods for details). The sequence lengths of the proteins in the data set range from 166 to 788 amino acid residues (Figure 2A), and the number of homologous sequences and effective homologous sequences corresponding to each protein were shown in Figure 2B. For each protein in the data set, we used CAVITY (Xu et al., 2018; Yuan et al., 2013) to identify all the potential ligand binding pockets on the protein surface and designated the mth pocket as cavity_m. Previous studies showed that motions of orthosteric and allosteric sites are highly correlated (Ma et al., 2016; Xie et al., 2022; Zhang et al., 2019), and the regulation between orthosteric and allosteric site is bidirectional (An et al., 2019). It would be interesting to see whether orthosteric and allosteric sites are coupled in evolution. We then explored the ECS (ECScavity_m ) between the orthosteric and allosteric pocket, as well as all the other pockets. The ECS between the orthosteric pocket and the mth pocket is defined as the sum of the coupling strength between the residues in the two pockets (see Materials and methods for details). Among the 25 allosteric pockets in the data set, 23 have Z-scores greater than 0.5 (Figure 2C, Supplementary file 2), which means that the recall of KeyAlloSite on predicting allosteric sites is 0.92 (Supplementary file 8). The probabilities that the known allosteric pockets were ranked in the top 1, top 2, and top 3 of Z-scores were 56.0, 76.0, and 96.0%, respectively (Figure 2D, Supplementary file 2), indicating that orthosteric and allosteric pockets are more evolutionarily coupled to each other than the orthosteric and other pockets, which can be used to predict potential allosteric pockets. We further analyzed the two proteins with Z-scores less than 0.5, AR1 and CYP3A4. For AR1, as there were only 108 effective homologous sequences, we speculated that the number of homologous sequences may not be enough for evolutionary analysis. The protein sequence–based phylogenetic tree of AR1 homologous proteins showed that AR1 located near the tail of the tree (Figure 2—figure supplement 1), implying that this allosteric function did not exist in the early evolutionary period, and it may have appeared in the later stage of evolution. Due to the relatively large number of sequences in the early evolutionary period and relatively few sequences in the later stage of evolution, the allosteric signal was weak. As a cytochrome P450 protein, CYP3A4 can bind and catalyze the transformation of a variety of substrates (Williams et al., 2004). Sequence alignments showed that several positions in its orthosteric pocket are less conserved, which may lead to the difficulty in allosteric site prediction based on evolutional coupling analysis.

Figure 2 with 3 supplements see all
Z-scores of allosteric pockets and probabilities of ranking an allosteric pocket in the top 3.

(A) The sequence lengths of all proteins in our data set. (B) The number of homologous sequences. Neff represents the number of effective homologous sequences obtained under 80% reweighting. (C) Z-scores of allosteric pockets on proteins in the data set. Among the 25 allosteric pockets, the Z-scores of 23 allosteric pockets were greater than 0.5. (D) The probabilities that the known allosteric pockets were ranked in the top 1, top 2, and top 3.

We used human Aurora A (AurA) kinase that is not included in the data set as a test case to further verify whether the ECS can be used to predict allosteric sites. AurA (PDB ID: 1OL5) is a Ser-Thr protein kinase that is essential for the cell cycle progression. Its abnormal levels can lead to inappropriate centrosome maturation, spindle formation, and enhanced cancer growth (Toji et al., 2004). AurA is known to be regulated by two distinct allosteric mechanisms, one is specific PPI, which binds TPX2 to its hydrophobic pocket, and the other is phosphorylation of the activation loop at T288 (pT288; Hadzipasic et al., 2020). We used CAVITY to find all of the potential ligand binding pockets on the surface of AurA, and a total of 12 pockets were found; cavity_3 is the known allosteric PPI pocket, and cavity_2 is the orthosteric pocket. For consistency, we chose residues within 6 Å around the ATP molecule as the orthosteric pocket. Then we calculated the ECS between the orthosteric pocket and each of the remaining 11 pockets. Cavity_3 ranked the second among the 11 pockets with a Z-score of 1.48 (Supplementary file 3), indicating that the ECS between the orthosteric and allosteric pockets is indeed stronger. Since phosphorylation mainly occurs on Serine/Threonine/Tyrosine (Ser/Thr/Tyr) residues, we then calculated the ECS between the orthosteric pocket and each of the 26 exposed Ser/Thr/Tyr residues and normalized to Z-scores. Among the 26 Ser/Thr/Tyr residues, 10 residues have Z-scores larger than 0.5, and T288 and T287 ranked the fifth and the fourth with a Z-score of 0.83 and 1.05, respectively (Supplementary file 3). This indicates that KeyAlloSite can also be used to predict post-translational modification (PTM) sites, and the predicted Ser/Thr/Tyr residues with Z-scores greater than 0.5 in addition to T288 and T287 are worth further investigation.

To exclude the possible influence of the pocket size, we further checked the dependence of ECScavity_m on the number of residues used in the calculation. For allosteric and other pockets in each protein, we summed the ECS of the top 200, top 300, and top 400 residue-residue pairs with the highest FNi,j corresponding to each pocket as the ECS between each pocket (except for orthosteric pocket) and orthosteric pocket. When different numbers of residue-residue pairs were used, the Z-scores corresponding to the ECS between allosteric and orthosteric pockets in most proteins were still greater than 0.5, which is weakly different from that when all residue-residue pairs were used (Figure 2—figure supplement 2). This indicates that the number of residues in the mth pocket does not play important role on the ECS between pockets. In other words, ECScavity_m revealed the intrinsic EC between allosteric and orthosteric pockets. These results indicate that the ECS between orthosteric and allosteric pockets is stronger than that between orthosteric and other pockets, which can be used to predict potential allosteric pockets.

We further checked whether the coevolutionary signal between allosteric and orthosteric pockets is significantly different from that between two random patches in proteins with the same number of residue pairs. For each protein in the data set, two residues that are not part of the orthosteric and allosteric sites were randomly selected from the surface residues. Among them, one was taken as the first center, and the residues around it with the same number as the residues in orthosteric pocket were selected as patch1; and the other residue was taken as the second center, and the residues around it with the same number as the residues in allosteric pocket were selected as patch2. Then we calculated the ECS between patch1 and patch2. The process was repeated four times, and then the mean and standard deviation of the ECS were calculated. We then compared the ECS between patch1 and patch2 with that between orthosteric and allosteric sites. The results showed that the ECS between orthosteric and allosteric sites was significantly higher than that between two random patches (Figure 2—figure supplement 3). In other words, there is intrinsic EC between orthosteric and allosteric sites, which is different from the EC between any two random patches.

Coevolution analysis revealed key allo-residues in allosteric pockets

For each protein, we calculated the EC values between the residues in the orthosteric and allosteric pockets by ECM and compared the corresponding pairwise EC values of the residues in the allosteric pocket (Figure 1C–E, Figure 3—figure supplement 2). Since orthosteric and allosteric pockets are two different pockets, the residues in the two pockets generally do not contact. Therefore, most of the EC values between residues in the two pockets are relatively small, that is, they correspond to the weak terms in Jij (Figure 1D). We then calculated the number of significant differences of each residue in the allosteric pocket and normalized to Z-scores. Finally, residues were predicted as key allo-residues if their corresponding Z-scores were greater than 0.8 (Supplementary file 4). We chose this threshold to ensure that the most known key allo-residues can be correctly predicted by KeyAlloSite, and at the same time, the number of predicted key allo-residues should be as small as possible. We calculated the number of known key allo-residues that could be predicted by KeyAlloSite (Supplementary file 5) and the ratios of the predicted key allo-residues in all residues of allosteric pockets in all proteins of the data set (Figure 3—figure supplement 1) for thresholds of 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0. Therefore, we finally chose the threshold as 0.8. For the allosteric pockets, the average number of pocket residues is 43, and the average number of identified key allo-residues is 8, accounting for 18.6% of all the residues in the allosteric site (Figure 3).

Figure 3 with 3 supplements see all
The number of predicted key allo-residues.

Number of residues refers to the number of residues from allosteric pockets, including the number of all residues in allosteric pockets and predicted key allo-residues.

Since the number of homologous sequences is important in coevolution analysis, we selected seven proteins with a relatively large number of homologous sequences and randomly sampled different numbers of homologous sequences. Since the number of homologous sequences required might be related to the sequence length of the protein itself, the number of homologous sequences was divided by the length of the protein to obtain a ratio. Within this ratio, different ratios of homologous sequences were randomly sampled according to different gradients, and each gradient was repeated three times. Then we calculated how many key allo-residues determined by homologous sequences with different gradients were the same as those determined by all homologous sequences. Taking the key allo-residues determined by all homologous sequences as references, we calculated the proportion of the same residues (Figure 3—figure supplement 3). It can be seen that generally speaking, 7 L (± 4L) number of effective homologous sequences is enough to give good and stable results. If the protein sequence was relatively short, the number of homologous sequences required could be less; otherwise, more homologous sequences were needed.

The predicted key allo-residues were supported by experimental results

We searched for literatures to see whether the key allo-residues we predicted were experimentally tested before. The first example is tyrosine-protein kinase ABL1 (BCR-ABL1), which is a fusion protein whose constitutive activity can cause chronic myeloid leukemia (CML). Tyrosine kinase inhibitors targeting the ABL1 ATP-binding site, such as imatinib (Gleevec) and nilotinib (Tasigna), significantly improved the overall survival of CML patients (Kalmanti et al., 2015; Miura, 2015). However, patients may develop drug resistance due to mutations in the ATP-site. The novel fourth generation ABL1 drug, asciminib (ABL001) was developed, which is an allosteric inhibitor that binds to the myristoyl pocket of BCR-ABL1 (Figure 4A). Asciminib was developed from fragment-based drug discovery approach. In the early stage of hit identification, compounds that bind BCR-ABL1 without inhibition activity were found. Among them, hit 4 binds BCR-ABL1 with a Kd of 6 μM. After changing the Cl atom at the para-position of the aniline to the CF3O- group, hit 5 showed inhibition activity with a slightly weakened Kd of 10 μM compared to hit 4 (Figure 4C; Schoepfer et al., 2018). This shows that the interaction between CF3O- and BCR-ABL1 is essential for the allosteric signaling and inhibitory activity. This group forms favorable hydrophobic interaction with L359, one of the key allo-residues predicted by our method (Figure 4B). In contrast, there is no favorable interaction between hit 4 and L359. These experimental evidences support that the predicted allo-residue L359 plays key role in allosteric signaling.

Key allo-residues predicted in BCR-ABL1.

(A) The crystal structure of the kinase domain of BCR-ABL1. The allosteric inhibitor asciminib, represented by sticks, binds to the myristoyl pocket (marine). (B) Predicted key allo-residues in the myristoyl pocket. The predicted key allo-residues are represented by marine sticks. One of the predicted key allo-residues, L359, forms a favorable hydrophobic interaction with a fluorine atom in asciminib, represented by a red dashed line. Water is represented by a red sphere. (C) The structure of fragment-derived hit 4 and hit 5 and the final marketed drug asciminib.

In the allosteric pocket of the asciminib binding site which contains 44 residues, we predicted 7 key allo-residues. In addition to L359, R479, V525, Y454, E450, T453, and T364 were also identified as potential key allo-residues (Supplementary file 6). T453 forms favorable hydrophobic interaction with the pyrazole ring in asciminib, and Y454 participates in the water-mediated H-bond with the oxygen atom in asciminib. Previous studies have shown that the conformational state of helix-I is important for functional activity, and V525 serves as a good indicator for the conformational change. Functional antagonists binding to the myristoyl pocket can bend helix-I and make the disordered region that V525 locates become ordered (Jahnke et al., 2010; Schoepfer et al., 2018). This indicates that V525 plays key role in allosterically regulating ABL1 kinase activity. At the same time, it also shows that coevolutionary information can help to capture multiple allosteric functional conformations of proteins (Morcos et al., 2013).

KeyAlloSite correctly identified key allo-residues in other proteins not in the data set

We further tested KeyAlloSite on proteins not included in the data set. The first protein is the E. coli aspartate (Asp) chemoreceptor Tar. Tar mediates the chemotaxis of bacteria toward attractants, such as Asp, and away from repellents. Tar is a homodimer during transmembrane signaling, and the signals are transmitted from the extracellular region of Tar to the cytoplasmic region through the transmembrane domain (Mise, 2016). Bi et al. discovered six new attractants and two new antagonists of Tar by computational virtual screening and experimental study. By comparing the binding patterns of attractants and antagonists, they found that the interactions between the chemoeffectors and Y149 and/or Q152 in Tar are critical for attractant chemotactic signaling (Bi et al., 2013). We chose the holo structure that binds Asp (PDB ID: 4Z9H) for analysis and used the CAVITY to identify all potential ligand binding pockets on the surface of chain B. Among the three pockets found, cavity_2 is the pocket where Asp binds, containing 21 residues, which we referred as the allosteric pocket. Since previous studies proposed that transmembrane signaling is triggered by the relative piston-like downward sliding of the α4 helix in the periplasmic domain (Mise, 2016), we chose the 16 residues (A166-T181) in the C-terminal of the α4 helix as the orthosteric site (Figure 5A). Through the coevolutionary analysis of residues in the orthosteric and allosteric sites, KeyAlloSite identified six key allo-residues in cavity_2, which were I65', F150', Y149', Q152', P153', and T154' (Figure 5B). It can be seen that our method could predict the key allo-residues Y149 and Q152. We also tested KeyAlloSite on the apo structure of Tar (PDB ID: 4Z9J). Of the eight pockets found by CAVITY, cavity_5 is the allosteric pocket, which contains 24 residues. As in the holo structure, the 16 residues (A166-T181) in the C-terminal of the α4 helix were chosen as the orthosteric site. In the prediction results, Y149' and Q152' ranked fourth and fifth among the 24 residues, with corresponding Z-scores of 1.01 and 0.66, respectively. Therefore, KeyAlloSite could also correctly predict the key allo-residue Y149 in the apo structure. For Q152, its Z-score is slightly smaller than the threshold of 0.8, though with a high ranking. This indicates that conformational changes do have subtle influence on the predicted results, probably mainly due to the change of residue composition in the allosteric pocket detected by CAVITY, which will lead to some fluctuations in the predicted key allo-residues. However, when the conformational changes between apo and holo states are not large, the influence on the results is small.

The key allo-residues predicted by our method in Tar and PDZ3.

(A) The crystal structure of holo-Tar. Aspartate (Asp) is represented by magenta sticks, the allosteric pocket is represented by marine surface, and the salmon helix is selected as the orthosteric site. (B) The key allo-residues predicted at the Asp-binding site. The predicted key allo-residues in the allosteric cavity_2 are represented by marine sticks, among which Y149 and Q152 are the true key allo-residues that have been confirmed by experiments. Hydrogen bonds are shown as red dash lines. (C) The predicted key allo-residues in PDZ3. The peptide bound to the orthosteric site is represented by salmon sticks, the allosteric pocket is represented by marine surface, and the predicted key allo-residues are represented by marine sticks.

The second protein is the PDZ3 domain, and its allosteric mechanism has been extensively studied. We selected the crystal structure of PDZ3 binding with a peptide in its orthosteric site for analysis (PDB ID: 1BE9; Doyle et al., 1996). Since the allosteric site of this structure does not bind an allosteric ligand, we used CAVITY to find the potential ligand binding pockets on its surface. Among the three pockets identified, cavity_1 is the orthosteric pocket, and cavity_2 contains the known allosteric sites, which were used for further analysis. KeyAlloSite predicted D306, S409, L302, A347, and L353 as key allo-residues (Figure 5C). Kalescky et al. used rigid residue scan to identify residues that are important for the allosteric effect of the PDZ3 domain. In the rigid residue scan, only one residue was regarded as a single rigid body in each MD simulation. They proposed that A347 is a ‘switch residue’, which is needed to turn on the allosteric effect (Kalescky et al., 2015). Lockless et al. used evolutionary data of protein families to measure the statistical coupling between amino acid positions. For the PDZ protein family, they found that there are strong statistical couplings between A347 and L353 and the key residue H372 of the orthosteric site, and verified using thermodynamic mutational studies (Lockless and Ranganathan, 1999). Moreover, McLaughlin et al. developed a high-throughput quantitative method that can individually replace a residue at each position with every other residue for comprehensive single-mutation studies. Their results showed that mutations of A347 and L353 caused significant functional loss (McLaughlin et al., 2012). These evidences all indicate that the key allo-residues A347 and L353 we predicted are important for the protein function by allosteric regulation.

KeyAlloSite identified pathogenetic mutations in human proteins

Previous studies have shown that allosteric mutation, that is, abnormal protein allosteric regulation caused by mutation is related to pathological processes such as cancer (Kurochkin et al., 2017). Shen et al. analyzed the dysfunction of allosteric proteins caused by somatic mutations in about 7000 cancer genomes across 33 cancer types, mapped these mutations to allosteric sites, orthosteric sites, and other sites in the Allosteric Database and established the Allo-Mutation data set (Shen et al., 2017). We searched for the somatic mutation data corresponding to the human proteins in our data set from the Allo-Mutation data set, and found that 11 of a total of 51 predicted key allo-residues in 7 human proteins were mutated in a variety of cancers (Table 1). Among them, cancers that contain a large number of mutations in key allo-residues are uterine corpus endometrial carcinoma and skin cutaneous melanoma. This indicates that the abnormal allosteric regulation caused by the mutation of key allo-residues plays key role in the occurrence and development of cancer. These key allo-residues can affect allosteric signal transduction and thus affect protein functions, suggesting that KeyAlloSite can be used to predict key pathogenetic mutations in proteins.

Table 1
Predicted key allo-residues that were mutated in cancers.
ProteinGenePredicted key allo-residuesMutation*Cancer type
AR1ARD732D732NSKCM
AR2ARM832M832ISKCM
PTP-1BPTPN1M282M282TCOAD
CDK2CDK2P155P155HUCEC
CK2alphaCSNK2A1F54; A110F54C; A110TUCEC; UCEC, GBM
MAPK14MAPK14P191; E192P191S; P191H; E192QSKCM; KIRC; BLCA
MAPK8MAPK8E195; M200E195K; M200IUCEC; SKCM
CYP3A4CYP3A4F219F219LUCEC
  1. *

    Mutation: confirmed disease mutations among the predicted key allo-residues.

  2. Cancer type: COAD: colon adenocarcinoma; SKCM: skin cutaneous melanoma; UCEC: uterine corpus endometrial carcinoma; GBM: glioblastoma multiforme; KIRC: kidney renal clear cell carcinoma; BLCA: bladder urothelial carcinoma.

KeyAlloSite can also identify key allosteric functional residues of enzymes

Enzyme evolution studies mainly focus on mutations in the substrate binding pocket (Wu et al., 2013; Yi et al., 2011). However, covariant residues far from the substrate binding site may also play an important role in regulating catalysis, which are difficult to identify. Since KeyAlloSite can calculate the functional coupling between residues outside and inside the orthosteric pocket, we wonder whether it can also be used to identify key allo-residues for enzyme activity regulation. We used Candida antarctica lipase B (CALB, PDB ID: 1TCA), which has many annotated key functional residues (Wu et al., 2020), as a test case. CALB is one of the most widely used biocatalysts in academia and industry that is often applied in acylating kinetic resolution of racemic alcohols and amines and desymmetrization of diols and diacetates, and is robust and easy to express (Wu et al., 2013). We selected the six angstrom-cutoff orthosteric pocket and calculated the EC values between each residue outside the orthosteric pocket and the residues in the orthosteric pocket. Then we compared the difference in EC values corresponding to each residue outside the orthosteric pocket and calculated the significant difference number di corresponding to each residue. Finally, the significant difference number di was normalized to Z-score. Due to the large number of residues outside the orthosteric pocket, residues with Z-scores greater than 0.9 were referred as key allo-residues here. Our method predicted a total of 52 residues from the 296 residues outside the orthosteric pocket, of which 20 residues have been annotated as functional residues in literature according to mutagenesis experiments (Figure 6A, Supplementary file 7). For example, A225 ranked the third out of the 52 predicted residues with a Z-score of 3.04, and A225M improves the catalytic efficiency of the enzyme by about 11 folds. V37 ranked the 35th out of the 52 predicted residues with a Z-score of 1.42, and V37I improves the catalytic efficiency of the enzyme by about threefolds (Wu et al., 2020). Therefore, our method can also be used to predict allosteric functional residues that are important for enzyme catalysis, providing a new computational tool for identifying mutant enzyme with improved catalytic properties.

KeyAlloSite predicted key allo-residues for enzymes.

(A) KeyAlloSite predicted key allo-residues for Candida antarctica lipase B. Among the predicted residues, the residues that have been annotated by the literature are shown as marine spheres, and the orthosteric pocket is represented by salmon surface. (B) KeyAlloSite predicted key allo-residues for Escherichia coli chorismate mutase (CM). Experimentally discovered key functional residues of CM are shown as marine spheres, the labels of key allo-residues predicted by KeyAlloSite are shown in marine, and the orthosteric pocket and ligand are represented by salmon surface and sticks.

Russ et al. recently used an evolution-based model to design chorismate mutase enzymes (CMs). They used DCA to learn the constraints for specifying proteins purely from evolutionary sequence data and performed Monte Carlo sampling from this model to generate artificial sequences. They were able to obtain proteins with natural-like catalytic function with sequence diversity. Eight residues (L40, L41, R44, D50, D83, L92, Q93, and H95) at the periphery of the active site were found to be important for CM catalysis in E. coli specific function (Russ et al., 2020; Figure 6B). We wonder if some of these residues regulate the enzyme function by allostery. We used the 6 angstrom-cutoff orthosteric pocket and identified 46 residues on the surface of the protein outside the orthosteric pocket. We calculated the EC values between each of these 46 residues and the orthosteric pocket and compared the difference in EC values of these 46 residues. We then calculated the significant difference number di corresponding to each of these 46 residues and normalized to Z-scores. Using the criterion of Z-score >0.9, we predicted 10 key allo-residues from the 48 residues, which were L86, A9, R44, Q89, E57, E12, L40, D69, R58, and N5. Among them, two residues (R44 and L40) were shown to be important for CM catalysis by Russ et al. And Russ et al. used DCA to show that sequence-based statistical models contain protein function information, while our study used DCA to predict allosteric sites or key allosteric residues based on the significance analysis of EC between these sites with the orthosteric site. Their study showed that the function of proteins seems to depend on many weak terms in Jij , which lack simple physical interpretations. Our study demonstrated that weak terms in Jij contain information for protein allosteric function evolution and can be used to predict allosteric sites and key allosteric residues.

To further verify the effectiveness of KeyAlloSite, we made indirect comparisons with the statistical coupling analysis (SCA) method. SCA predicts several groups of coevolved residues (sectors) that form physically continuous networks, often able to connect major functional site and allosteric sites, that is, allosteric pathways (Lockless and Ranganathan, 1999; Rivoire et al., 2016; Salinas and Ranganathan, 2018; Shulman et al., 2004; Süel et al., 2003). We analyzed all the proteins in our data set using SCA. We first compared the performance of SCA and KeyAlloSite in predicting known key allo-residues in allosteric sites (Supplementary file 9). For the known key allo-residue L359 in the BCR-ABL1 protein, it was not present in the sectors predicted by SCA, despite that the sectors contain 68 residues, while it could be correctly predicted by KeyAlloSite. For the known key allo-residues Y149 and Q152 in the Tar receptor, KeyAlloSite could correctly predict both of them. However, although the sectors predicted by SCA contained Y149 and Q152, it also included two residues (R69 and R73) that have been experimentally verified to contribute only to ligand binding and not to allosteric signaling (Bi et al., 2013). For the PDZ3, the sectors predicted by SCA contained the key allo-residues A347 and L353, which were also successfully predicted by KeyAlloSite. We further compared the performance of SCA and KeyAlloSite in predicting the key allosteric functional residues of enzymes. For the CALB, the sectors predicted by SCA missed one of the key allo-residues A225, which has been experimentally shown to have a great impact on enzyme activity, and the predicted known key allo-residues account for 32.8% of all the residues in the sectors, while KeyAlloSite could predict the key allo-residue A225, and the predicted known key allo-residues account for 38.5% of all the predicted key allo-residues. For the CMs, the sectors predicted by SCA contained only one key allo-residue D83, while KeyAlloSite could predict key allo-residues R44 and L40. For the KeyAlloSite correctly predicted functional phosphorylation sites T288 and T287 in the AurA, SCA missed both of them. Thus, KeyAlloSite performs better than SCA in predicting key allo-residues.

Discussion

Identifying key allosteric residues responsible for allosteric signaling is important for the design and optimization of allosteric drugs, enzyme, and protein engineering studies. We developed, KeyAlloSite, a novel method for predicting allosteric sites and key allo-residues based on the ECM. To the best of our knowledge, this is the first systematic and efficient computational method to predict key allo-residues. Our study demonstrated that orthosteric and allosteric pockets are coupled in protein function evolution. Our predicted key allo-residues are in accordance with previously reported experimental studies in the BCR-ABL1, Tar, and PDZ3 systems, as well as key cancer mutations. We further showed that KeyAlloSite can be applied to predict key allosteric residues distant from the catalytic site that are important for enzyme catalysis. Our study also gives a possible physical explanation for the weak couplings in Jij , that is, they may represent allosteric functional couplings. The predicted key allo-residues can help us to understand the mechanism of allosteric regulation, to provide reference and guidance for the rational design and optimization of allosteric drugs and to facilitate enzyme engineering.

It should be noted that due to limited available experimental information, for the protein systems that we tested, although all the predicted known key allo-residues ranked among top 20%, there are other residues among the top list with unidentified function. Our analysis predicted that these residues should play important roles in allosteric signaling. Further experimental studies are needed to verify their functions in the future. At the same time, our predicted list of key allo-residues greatly reduces the number of residues that needs to be verified experimentally. We expect future experimental studies on the functions of these newly predicted key ‘allo-residues’ can not only verify and demonstrate the predictive power of KeyAlloSite but also offer more data to improve it.

Several methods were developed to infer protein function based on the coevolutionary couplings between residues. For example, Hopf et al. developed EV mutation to predict the effects of mutations based on the coevolutionary couplings between residues (Hopf et al., 2017). The theoretical formulation for extracting the coupling scores between residues in the initial step of our method is the same as that of Hopf et al., but the problem studied by us is different from that of Hopf et al., and the usage of the coupling scores between residues in the later steps of the two methods is different. Hopf et al. used the coevolutionary coupling scores between residues to predict the effects of mutations by calculating the difference in statistical energy between mutant and wild-type sequences. In contrast, we used the coevolutionary coupling scores between residues to predict the allosteric sites and key allo-residues in allosteric pockets that are mainly responsible for allosteric signaling by pairwise comparing the difference of the coevolutionary coupling scores of residues in allosteric pockets. Although Hopf et al. highlighted the significance of the pairwise coupling term for the prediction of mutation effects, we highlighted the importance of the weak pairwise coupling term for the study of allosteric function.

As KeyAlloSite attempts to capture coevolutionary coupling between residues from MSA, it requires that MSA should contain sufficient homologous and diversified sequences. For the MSAs with only a few homologous sequences, KeyAlloSite usually cannot give accurate predictions. How to reduce the number of homologous sequences required remains further research.

The scores of the key allo-residues predicted by KeyAlloSite depend not only on the coevolutionary information but also on the residues contained in allosteric pockets. When different conformations of the protein are used, pocket detection method may give allosteric sites with slightly or largely different residues (depending on the difference of the conformations) that may influence the final KeyAlloSite. For example, in the case of Tar, the ECs between residues are the same for the apo and holo conformations, while the allosteric pockets found by CAVITY in the two conformations contained a small number of different residues. Because the prediction of key allo-residues by KeyAlloSite requires pairwise comparison of residues in allosteric pockets, the predicted key allo-residues in the two conformations were slightly different. For the apo Tar, on the one hand, although the score of Q152 is 0.66, which is less than the threshold of 0.8, Q152 ranked high among all residues in the allosteric pocket with a ranking of 5/24. When we lower the threshold slightly, we will be able to correctly predict Q152. On the other hand, Bi et al. showed that although Y149 and Q152 are both key allo-residues, Y149 seems to be more important as allosteric signaling can be conducted when the allosteric molecule only interacts with Y149 but not Q152 (Bi et al., 2013). Although the holo Tar has some conformational changes compared to apo Tar, the key allo-residueY149 can be captured by KeyAlloSite when using either the holo or the apo structure. For applications, we recommend to use the holo structure whenever possible.

Although we used the three-dimensional structure of proteins and their binding ligands in our analysis, KeyAlloSite can also be applied in cases where no three-dimensional structures are available on condition that a certain number of homologous sequences of the protein under investigation and location of the functional site are known. Along with the rapid progress in recent years, protein structure prediction methods, such as AlphaFold (Jumper et al., 2021), can be used to predict the protein structure first. At the same time, as our method calculates the EC between any residue of the protein and residues of the orthosteric pocket, KeyAlloSite can be used to predict not only key allosteric residues but also PTM sites that have functional correlation with orthosteric sites, which will be further studied in the future.

Key residues in protein allosteric sites determine the direction and strength of allosteric signaling, while other residues in allosteric sites mainly contribute to binding. When optimizing allosteric molecules, we often face the challenge that simply increasing binding affinity cannot improve the efficacy of allosteric regulation. Therefore, to optimize the allosteric molecules, one can first use KeyAlloSite to predict the key allo-residues in the allosteric pocket and then maintain or enhance the interactions between the allosteric molecules and the key allo-residues while altering the interactions between the allosteric molecules and other residues in the allosteric pocket. In this way, the optimized allosteric molecules can be ensured to have both strong binding affinity and allosteric signal transduction ability.

Materials and methods

The allosteric protein data set

Request a detailed protocol

We selected allosteric proteins from the ‘Core Set’ of ASBench (Huang et al., 2015) and constructed our data set according to the following criteria: (1) the protein functions as a monomer; (2) the corresponding three-dimensional protein structure data should contain both allosteric ligand and orthosteric ligand; (3) the number of effective homologous sequences of the protein should be greater than 100. Finally, 23 allosteric proteins were selected, including 25 known allosteric sites (Supplementary file 1). We collected another two proteins from published literatures, including E. coli aspartate chemoreceptor Tar (PDB ID: 4Z9H) (Mise, 2016) and PDZ3 domain (PDB ID: 1BE9) (Doyle et al., 1996), as key allo-residues have been reported for these two proteins that can be used for comparative analysis. We used HMMER to search for the homologous sequences of each of the selected allosteric proteins from pfam (Finn et al., 2015). Due to the redundancy of homologous sequences, they were reweighted according to the standard of 80% sequence identity to obtain the effective homologous sequences.

The evolutionary coupling model

Request a detailed protocol

We used a global statistical model, the ECM, which can calculate the direct couplings between residues and remove the indirect couplings. The ECM we used here was mainly based on the work of Marks and her co-workers (Ekeberg et al., 2013; Weinreb et al., 2016). From the MSA of a protein family, we can calculate the observed frequency fia and pairwise co-occurrences fijab of residues (a, b) at position (i, j). From this first-order and second-order statistics, we can infer a model to account for the observed statistics optimally, which mainly includes two parameters: single-site propensities hia and direct coevolutionary couplings between residues Jij(a,b). This model defines a probability P for each protein sequence a = (a1,,aL) of length L:

(1) Pa=1Zexp(i=1Lhiai+i=1L-1j=i+1LJij(ai,aj))
(2) Z=aexp(i=1Lhiai+i=1L-1j=i+1LJij(ai,aj))

The parameters h and J of the model were estimated by pseudo-maximum likelihood (PLM) which approximates the full likelihood for each sequence a = (a1,,aL) by a product of conditional likelihoods for each site i:

(3) P(a1,,aL |h,J) i=1LP(ai |aai,h,J)

The global partition function Z is replaced by a number of local partition functions:

(4) Pai |a\ai,h,J=exphiai+jiJijai,ajaexphia+jiJija,aj

After modifying with L2-regularization and sample reweighting, the approximate likelihood function was optimized using a quasi-Newton method (Limited-memory BFGS).

Once the parameters h and J are fitted, the Frobenius norm FN(i,j) of Jij is used to measure the ECS between the two sites i and j.

(5) FN(i,j)=Jij2=klJ,ijk,l2

J,ij was obtained after Jij was centralized.

Since the FN may include bias caused by phylogeny and undersampling, it can be corrected with average product correction (Dunn et al., 2008). The EC value is the EC score of the two sites.

(6) EC(i,j)=FN(i,j)-i,iFNi,,jj,jFNi,j,i,j,iFNi,,j,

Evolutionary coupling strength between orthosteric and other pockets

Request a detailed protocol

We first performed MSA (Figure 1A) and calculated the ECs between residues using the ECM (Figure 1B). We then used the CAVITY program (Xu et al., 2018; Yuan et al., 2013) to identify all the potential ligand binding pockets on the surface of a protein and designated the mth pocket as cavity_m. The ECS (ECScavity_m ) between the orthosteric pocket and the mth pocket is defined as the sum of the coupling strength between the residues in the two pockets. The orthosteric pocket here is defined as all of the residues within 6 Å around the orthosteric ligand. If more than 50% of the residues in the mth pocket overlap with the orthosteric pocket, the pocket will be excluded. Otherwise, the overlapping residues between the mth pocket and orthosteric pocket will be removed from the mth pocket, and the remaining residues will be used to calculate the ECS between the mth pocket and orthosteric pocket.

(7) ECScavity_m =icavity_mjorthosteric pocketFNi,j

Finally, we normalized the ECS of the pockets:

(8) Z-scorecavity_m =ECScavity_m -μcavity σcavity

μcavity is the mean value of the ECS between the orthosteric pocket and the other pockets, and σcavity is the standard deviation of the ECS.

Identification of key allo-residues

Request a detailed protocol

We first calculated the pairwise EC value between one residue ai (i=1,2,...,N) in the allosteric pocket and one residue bj (j=1,2,...,M) in the orthosteric pocket. An N × M matrix E was obtained, and each element Eij in the matrix E represents the EC value between residues in the allosteric and orthosteric pockets (Figure 1C–D). The allosteric pocket here refers to the allosteric pocket found by the CAVITY. In rare cases, if the CAVITY does not find the known allosteric pocket, all of the residues within 8 Å around the allosteric ligand are used as the allosteric pocket. After that, the difference of EC values corresponding to each residue in allosteric pocket was pairwisely compared by using the student’s t-test (α=0.05), that is, whether there was a difference between the mean values of each two rows of matrix E. The result of the comparison was expressed by Cmi (m=1,2,...,N, i=1,2,...,N), if there was a significant difference, Cmi was assigned a value of 1, if there was no significant difference, Cmi was assigned a value of 0. Finally, an N × N matrix C was obtained, and each element in C indicates whether there was a difference between the residues in the allosteric pocket (Figure 1E). On this basis, by adding up each column of C, we could get the number of significant differences di (i=1, 2,..., N) between each residue and the remaining residues in the allosteric pocket, that is, the number of significant differences corresponding to ith residue (Figure 1F).

(9) di=m=1NCmi

Finally, we normalized the number of significant differences of the residues:

(10) Z-scoreresidue_i =di -μresidue σresidue

μresidue is the mean value of the number of significant differences of the residues, and σresidue is the standard deviation of the number of significant differences.

In enzyme catalysis, in addition to residues in the active site, distant residues may also affect enzyme activity, which could not be rationally designed in enzyme engineering. We systematically studied the effect of residues outside of the active site on enzyme catalysis. We calculated the EC value between each residue outside the orthosteric pocket and residues in the orthosteric pocket (i.e. the rows in matrix E are all residues outside the orthosteric pocket here). Using the same steps as above, we determined the number of significant differences di for each residue outside the orthosteric pocket and normalized it to Z-score.

Data availability

Request a detailed protocol

The data that support the results of this study are in the Supplementary data, including information of the allosteric proteins in the data set (Supplementary file 1); list of the Z-scores and ranking of allosteric pockets in the data set (Supplementary file 2); KeyAlloSite prediction results of AurA kinase (Supplementary file 3); list of the predicted key allo-residues in allosteric pockets (Supplementary file 4); key allo-residues predicted by KeyAlloSite with different cutoffs (Supplementary file 5); KeyAlloSite prediction results of tyrosine-protein kinase ABL1 (Supplementary file 6); the key allo-residues predicted by our method on CALB (Supplementary file 7); the confusion matrices of KeyAlloSite in different scenarios (Supplementary file 8); comparison of KeyAlloSite and SCA methods (Supplementary file 9); phylogenetic tree of androgen receptor (Figure 2—figure supplement 1); comparison of ECS between pockets when all residue pairs and partial residue pairs were used (Figure 2—figure supplement 2); difference between the EC between orthosteric and allosteric sites and the EC between two random patches (Figure 2—figure supplement 3); distribution of the ratios of the number of key allo-residues predicted by KeyAlloSite in the number of all residues in allosteric pockets when using different cutoffs in all proteins (Figure 3—figure supplement 1); examples of distributions of the statistics corresponding to significant scores obtained from the t-test (Figure 3—figure supplement 2); and random sampling of homologous sequences (Figure 3—figure supplement 3). The homologous sequences of the proteins in the data set are available in the following GitHub repository: https://github.com/huilan1210/KeyAlloSite, Xie et al., 2023.

Code availability

Request a detailed protocol

KeyAlloSite is available at GitHub (https://github.com/huilan1210/KeyAlloSite, copy archived at swh:1:rev:8464b27b588af48d14033ab40d62f9eca4ed0051, Xie et al., 2023).

Data availability

All data that support the results of this study are included in the manuscript, supplementary files, and GitHub repository (https://github.com/huilan1210/KeyAlloSite; copy archived at swh:1:rev:333d4a48d2570c74806f68a1247611ed64794b97). Source Data files have been provided for all Figures (except Figure 1 and Figure 2-figure supplement 1).

References

Decision letter

  1. Shozeb Haider
    Reviewing Editor; University College London, United Kingdom
  2. Volker Dötsch
    Senior Editor; Goethe University, Germany
  3. Sarath Dantu
    Reviewer; Brunel University London, United Kingdom

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Coevolution-based prediction of key allosteric residues for protein function regulation" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Volker Dötsch as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Sarath Dantu (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

There seems to be a consensus between the reviewers that:

1) There is a lack of comparison of the proposed method against existing ones that do pretty much the same. This is the main weakness of the manuscript.

2) The authors should also provide a better metric of the true positives of their method – and of course report the false positives they have for the cases they've tested. AlloSite seems to predict several residues as key, some of which happen to be oncogenic (Table 1). It's not clear how many of the top 10, let's say, residues that it predicts have an allosteric effect.

3) In addition to the comparison, the authors need to establish that the signal coming from comparison of allosteric vs. orthosteric is different from comparing any two random patches with same number of residue pairs.

4) A complete picture of the predictive power with both strengths and weakness of the tool should be presented.

Reviewer #1 (Public Review):

Allostery refers to processes whereby a change at one site of a biological macromolecule affects the structure and dynamics at another distinct functional site, enabling the regulation of the corresponding function. Xie et al. developed an in-silico method to predict residues involved in allosteric regulation using a coevolution-based method. A fast and accurate method of identifying key residues responsible for allosteric signalling is important for drug design purposes and protein engineering.

Strengths:

1) The authors applied their method to multiple targets from different protein families to test their method.

2) The method is able to predict in a retrospective analysis certain residues involved in allosteric communication between orthosteric and allosteric binding sites.

Weaknesses:

1) There are several tools used in statistical genomics to predict allosteric communication pathways. Even though the paper tries to demonstrate the ability to predict residues that are involved in allosteric communication, KeyAllosite is not compared with any other state-of-the-art tool that does the same (1-3), which would highlight the strengths of this method with respect to existing ones.

2) The authors mention that the number of effective homologue sequences affects the probability of finding an allosteric site in the top 3 scored sites, however, Cdc4 and AR2 have also a low number of effective homologue sequences (Figure 1.A) yet a high Z-score. No sufficient explanation is given regarding this discrepancy. This also demonstrates that a threshold in the Neff under which KeyAlloSite becomes unreliable should be defined.

3) From the low Z-score of CYP3A4, the authors claim that the conservation of residues in the orthosteric site is important for getting an accurate prediction of the coupling strength. It is not clear though if and how is this aspect encoded in the KeyAlloSite. From the description of the method, it seems like the algorithm does not check for the level of conservation of the orthosteric residues. An explanation as to why has it not been incorporated into the algorithm is necessary. The conservation at a given site in an MSA defined as the overall deviance of amino acid frequencies at that site from their mean values, in combination with the statistical coupling of two sites has been shown to be important in the development of allosteric models (4).

4) In the case of AuroraA, the authors do not explain why other Ser/Thr/Tyr residues scored higher than T287 and T288, or if their higher scores are an artefact. However, in many cases, post-translational modifications take place by secondary partners and, therefore, the coupling of a post-translational modification site with the orthosteric site cannot be used to predict such sites. Post-translational modification sites are expected to have a strong coupling with residues of the upstream/downstream effector that is responsible for the modification, rather than with residues of the orthosteric site of the protein.

5) The authors defined allo-residues as the residues whose Z-score is >0.8, but there is no strong argument regarding the choice of this threshold. In the case of the allosteric pockets, for example, the authors use a threshold of 0 to identify pockets with strong coupling strength.

6) In the case of the Tar, it would be good if the authors reported the results starting from an apo structure as well and see if the method is able to find Y149 and Q152. That way, they could test how sensitive/biased the method is to the chosen conformation. If no apo structure is available, maybe another protein where both an apo and a holo structure exist could be used for this purpose.

7) The data in Table 1 is not sufficiently convincing. It would be helpful to also report how many of the identified residues by KeyAlloSite are indeed involved in oncogenic mutations or find another metric to quantify the success rate of KeyAlloSite.

8) The success rate of the method to predict key residues of the function of CALB (38% success rate based on the reported residues in the literature) and CMS (20%), should be interpreted with caution. It may be the case that the rest of the predicted residues have not been tested for their functional role, or that the method has indeed a low success rate in predicting key residues for allosteric communication.

In lines 38-41, the authors refer to SARs of allosteric drugs as "flat", which is not clear what they refer to.

It is not clear if the so-called, "allo-residues", that the authors define in lines 49-52 refer to residues that affect the binding or the signalling.

It would be helpful for the reader if the authors included a section where they describe the method itself and the key steps of developing their model before presenting the results. That way, the reader could follow the results easier.

The authors classified L359 of BCR-ABL1 as an allo-residue and justified the importance of this residue based on the fact that a weaker ligand does interact with this residue. For transparency, it would be good for the authors to report the allosteric scores of all 44 residues in the allosteric site to show that this residue was not cherry-picked and that the method can pick up all the residues that are important for the binding.

The manuscript has several typos and language mistakes (e.g. "screening" instead of "screen" in the Abstract, missing articles, etc.) that should be corrected prior to a resubmission.

Reviewer #2 (Public Review):

The authors designed a statistical approach to analyse coevolution scores from a protein and predict allosteric residues. This approach relies on comparison of residues from the two sites (allosteric vs. orthosteric) and omits the rest of the protein.

While the approach is logical, the predictive power has not been clearly established. To demonstrate the effectiveness of the approach a confusion matrix should be provided as it will show the predictive power of the method in the different scenarios presented in the manuscript (PTM's, enzymes, pathogenic mutations, etc.).

An effective method, along these lines, will have significant impact on protein and drug design.

It would be helpful if statistics on the distribution of significant scores from E row comparison after the t-test are provided. Further, you have so far compared allosteric with orthosteric, it would be interesting to compare the same distribution/statistics with allosteric vs. non-orthosteric residues. It may serve as a very good benchmark. The idea would be to see if the events of significance from allosteric vs. orthosteric are different/unique if we pick and compare any two random patches in the protein.

A boxplot of Zscores for All, top200, top300, top400 pairs from Figure 1—figure supplement 2 would be very informative.

As you have been using top zscore residues (>0.5 or >0.9 for enzymes), direct or indirect evidence for the statement in lines 306-307 is not very apparent from the provided data. This has to be flushed out further.

Please annotate residues for which experimental data is available to compare with keyallosite predictions for example in Supplementary files 4 and 5. In the main text (line 284) you only mention one residue out of 52 to demonstrate the effectiveness of the prediction. Again a confusion matrix would help here.

Can you please clarify if in Figure 2, number of residues refers to the number of residues from allosteric+orthosteric site?

Figure 6 may be ideal as Figure 1 to inform the readers about the protocol of this work.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Coevolution-based prediction of key allosteric residues for protein function regulation" for further consideration by eLife. Your revised article has been evaluated by Volker Dötsch (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1. Can you please provide a detailed description and expand the discussion on the predicted amino acids and why the predicted residues are not the top ranking ones?

2. The theoretical formulation to extract the coupling score and a compare/contrast to Hopf et al. approach would be highly beneficial to the readers.

Reviewer #1 (Recommendations for the authors):

The authors did extensive work trying to address the revision comments, and their effort is much appreciated.

Nevertheless, in the absence of further experimental validation of the importance of the predicted so-called key "allo-residues" in mediating the signal between the orthosteric and allosteric binding site, it is still hard to assess the predictive power of the presented method. There are still residues for example that score higher than known functional residues (see for example residues T235, S245 and S249 in the case of Aurora A) whose implication in signal transduction is not confirmed, or residues whose importance depends on the conformation chosen for analysis despite the coevolutionary conservation metric used to score the residues (Q152 of Tar whose score is way below the 0.8 threshold when an apo structure is considered, which shouldn't be the case given that AlloSite uses only MSA and coevolutionary information for the scoring – it seems like other residues are predicted to be way more important based on the allosteric pocket definition given by CAVITY to decrease the score of Q152 to <0.8 after normalisation).

I fully understand that such experimental validation goes beyond the capacities of a computational group, however, as a reader, I might be hesitant to try this method.

Reviewer #2 (Recommendations for the authors):

The authors have carefully addressed all the comments from the previous review.

Few additional comments for authors to consider:

The Discussion section can be enriched further. At present it only discusses two points, i.e., ability of the method to predict key allosteric residues and requirement of sequence depth. For example: the last two paragraphs 565-580 are repetitive as they both highlight the need for depth in sequence alignments which is a known issue for MSA dependent methods, as again discussed by Hopf et al. Nature Biotechnology volume 35, pages 128-135 (2017). Further, the theoretical formulation to extract the coupling score is identical to Hopf et al. and a compare/contrast of the two approaches would be highly beneficial to the audience. Even Hopf et al. highlight the significance of the pairwise term, as done in this article.

https://doi.org/10.7554/eLife.81850.sa1

Author response

Essential revisions:

There seems to be a consensus between the reviewers that:

1) There is a lack of comparison of the proposed method against existing ones that do pretty much the same. This is the main weakness of the manuscript.

We agree that comparison of our method with existing ones should be given whenever possible. However, most existing methods based on sequence statistical analysis were developed for predicting allosteric communication pathways. To the best of our knowledge, no systematic computational methods for identifying key allo-residues, residues that contribute most to allosteric signaling in allosteric sites are available now. As the main purpose of our current study is to predict key allo-residues in allosteric sites, direct comparisons with other methods are not possible. As suggested, we have performed indirect comparisons with the Statistical Coupling Analysis (SCA) method, which have been included in the revised manuscript (The last paragraph of KeyAlloSite can also identify key allosteric functional residues of enzymes).

SCA predicts several groups of coevolved residues (Sectors) that form physically continuous networks, often able to connect major functional site and allosteric sites, that is, allosteric pathways. We analyzed all the proteins in our data set using SCA. We first compared the performance of SCA and KeyAlloSite in predicting known key allo-residues in allosteric sites (Supplementary File 9). For the known key allo-residue L359 in the BCR-ABL1 protein, it was not present in the sectors predicted by SCA, despite that the sectors contain 68 residues, while it could be correctly predicted by KeyAlloSite. For the known key allo-residues Y149 and Q152 in the Tar receptor, KeyAlloSite could correctly predict both of them. However, although the sectors predicted by SCA contained Y149 and Q152, it also included two residues (R69 and R73) that have been experimentally verified to contribute only to ligand binding and not to allosteric signaling (PNAS 2013, 110, 16814). For PDZ3, the sectors predicted by SCA contained the key allo-residues A347 and L353, which were also successfully predicted by KeyAlloSite. We further compared the performance of SCA and KeyAlloSite in predicting the key allosteric functional residues of enzymes. For the CALB, the sectors predicted by SCA missed one of the key allo-residues A225, which has been experimentally shown to have a great impact on enzyme activity, and the predicted known key allo-residues account for 32.8% of all the residues in the sectors, while KeyAlloSite could predict the key allo-residue A225, and the predicted known key allo-residues account for 38.5% of all the predicted key allo-residues. For the CMs, the sectors predicted by SCA contained only one key allo-residue D83, while KeyAlloSite could predict key allo-residues R44 and L40. For the KeyAlloSite correctly predicted functional phosphorylation sites T288 and T287 in the AurA, SCA missed both of them. Thus, KeyAlloSite performs better than SCA in predicting key allo-residues.

2) The authors should also provide a better metric of the true positives of their method – and of course report the false positives they have for the cases they've tested. AlloSite seems to predict several residues as key, some of which happen to be oncogenic (Table 1). It's not clear how many of the top 10, let's say, residues that it predicts have an allosteric effect.

We have added the confusion matrices of our method in different scenarios (Supplementary file 8) in the revised manuscript. It should be noted that the known real positive data is very limited. At the same time, there is little known real negative data, and we regarded the data of unknown functions as negative data. It can be seen that KeyAlloSite has high recall (0.92 and 1.00) in the prediction of allosteric sites and the prediction of key allo-residues in allosteric pockets. The recall of KeyAlloSite in the prediction of functional post-translational modification (PTM) sites, pathogenic mutations and enzymes are not very high, mainly because there are limited known data on functional PTM sites and pathogenic mutations, and the functions of many key allo-residues predicted by KeyAlloSite are unknown. It remains to be further verified whether they have the function of allosteric regulation. Moreover, since we are now taking the predicted key allo-residues of unknown function as negative samples, our values for recall here are lower bounds of recall. The prediction list of KeyAlloSite provides a good starting point for further experimental studies and we believe that the prediction accuracy would be increased after the functions of predicted residues were experimentally identified.

3) In addition to the comparison, the authors need to establish that the signal coming from comparison of allosteric vs. orthosteric is different from comparing any two random patches with same number of residue pairs.

As suggested, we have added the difference between the evolutionary coupling between orthosteric and allosteric pockets and the evolutionary coupling between two random patches in the protein in the revised manuscript (The last paragraph of The evolutionary coupling between orthosteric and allosteric sites is stronger). For each protein in the data set, two residues that are not part of the orthosteric and allosteric sites were randomly selected from the surface residues. Among them, one was taken as the first center and the residues around it with the same number as the residues in orthosteric pocket were selected as patch1; and the other residue was taken as the second center and the residues around it with the same number as the residues in allosteric pocket were selected as patch2. Then we calculated the evolutionary coupling strength between the patch1 and patch2. The process was repeated four times, and then the mean and standard deviation of the evolutionary coupling strength were calculated. We then compared the evolutionary coupling strength between patch1 and patch2 with that between orthosteric and allosteric sites. The results showed that the evolutionary coupling strength between orthosteric and allosteric sites was significantly higher than that between two random patches (Figure 2—figure supplement 3). In other words, there is intrinsic evolutionary coupling between orthosteric and allosteric sites, which is different from the evolutionary coupling between any two random patches.

4) A complete picture of the predictive power with both strengths and weakness of the tool should be presented.

We have added discussions on the strengths and weaknesses of our approach in the revised manuscript (The second paragraph of Discussion). The main strength of our method is that it is the first systematic and efficient computational method to predict key allosteric residues in allosteric sites that are primarily responsible for allosteric signaling. The weakness of our method is that as KeyAlloSite attempts to capture coevolutionary coupling between residues from multiple sequence alignment (MSA), it requires that MSA should contain sufficient homologous and diversified sequences. For the MSAs with only a few homologous sequences, KeyAlloSite usually cannot give accurate predictions. How to reduce the number of homologous sequences required remains further research.

Reviewer #1 (Public Review):

Allostery refers to processes whereby a change at one site of a biological macromolecule affects the structure and dynamics at another distinct functional site, enabling the regulation of the corresponding function. Xie et al. developed an in-silico method to predict residues involved in allosteric regulation using a coevolution-based method. A fast and accurate method of identifying key residues responsible for allosteric signalling is important for drug design purposes and protein engineering.

We thank the reviewer for the positive remarks and encouragement.

Strengths:

1) The authors applied their method to multiple targets from different protein families to test their method.

2) The method is able to predict in a retrospective analysis certain residues involved in allosteric communication between orthosteric and allosteric binding sites.

Weaknesses:

1) There are several tools used in statistical genomics to predict allosteric communication pathways. Even though the paper tries to demonstrate the ability to predict residues that are involved in allosteric communication, KeyAllosite is not compared with any other state-of-the-art tool that does the same (1-3), which would highlight the strengths of this method with respect to existing ones.

We thank the reviewer for the comment. We have compared our method with statistical coupling analysis method that has been used to predict allosteric communication pathways. Please refer to our response to the first comment in Essential Revisions. We would like to emphasize that our method was developed mainly for predicting key allosteric residues and not for predicting allosteric communication pathways.

2) The authors mention that the number of effective homologue sequences affects the probability of finding an allosteric site in the top 3 scored sites, however, Cdc4 and AR2 have also a low number of effective homologue sequences (Figure 1.A) yet a high Z-score. No sufficient explanation is given regarding this discrepancy. This also demonstrates that a threshold in the Neff under which KeyAlloSite becomes unreliable should be defined.

We thank the reviewer for the comment. A high number of effective homologous sequences guarantee the high probability of the correct prediction of allosteric site. Furthermore, the prediction power also depends on several other factors, such as when the allosteric site appears in evolution, how many effective homologous sequences contain the allosteric site in a particular protein family, and the size of allosteric pockets. We have built the protein sequence based phylogenetic tree of AR1 homologous proteins and found that AR1 located near the tail of the phylogenetic tree (Figure 2—figure supplement 1 in the revised manuscript). This implies that this allosteric function did not exist in the early evolutionary period and the allosteric function may have appeared in the late stage of evolution. Due to the relatively large number of sequences in the early evolutionary period and relatively few sequences in the late stage of evolution, the allosteric signal was weak. For simplicity, we defined the threshold of the number of effective homologous sequences (Neff) as 100 (Section The allosteric protein data set in Methods and Materials). When Neff is less than 100, the prediction result may become unreliable, but a low number of Neff does not necessarily correspond to a small Z-score. For example, Cdc4 and AR2 have high Z-scores despite their small Neff.

3) From the low Z-score of CYP3A4, the authors claim that the conservation of residues in the orthosteric site is important for getting an accurate prediction of the coupling strength. It is not clear though if and how is this aspect encoded in the KeyAlloSite. From the description of the method, it seems like the algorithm does not check for the level of conservation of the orthosteric residues. An explanation as to why has it not been incorporated into the algorithm is necessary. The conservation at a given site in an MSA defined as the overall deviance of amino acid frequencies at that site from their mean values, in combination with the statistical coupling of two sites has been shown to be important in the development of allosteric models (4).

We thank the reviewer for the comment. Since orthosteric and allosteric sites are functionally coupled and coevolve, we mainly focus on the evolutionary coupling between residues in orthosteric and allosteric sites. We did not explicitly encode the conservation of orthosteric residues in KeyAlloSite as orthosteric residues of most proteins are relatively conserved and ECM already contains sequence conservation information. When the conservation of residues in the orthosteric site varies largely as in the case of CYP3A4, it will be hard to give reliable prediction. We have discussed this issue in the revised manuscript (The first paragraph of The evolutionary coupling between orthosteric and allosteric sites is stronger).

4) In the case of AuroraA, the authors do not explain why other Ser/Thr/Tyr residues scored higher than T287 and T288, or if their higher scores are an artefact. However, in many cases, post-translational modifications take place by secondary partners and, therefore, the coupling of a post-translational modification site with the orthosteric site cannot be used to predict such sites. Post-translational modification sites are expected to have a strong coupling with residues of the upstream/downstream effector that is responsible for the modification, rather than with residues of the orthosteric site of the protein.

We thank the reviewer for the comment. For the three residues T235, S245 and S249 with higher scores than T287 and T288, currently no experimental data are available to verify the effect of their phosphorylation on protein function. Further studies are needed to investigate whether these three residues have allosteric regulatory function, and we think these three residues are very worthy of further study.

It is true in many cases, post-translational modifications (PTMs) take place by secondary partners, and the coupling between a PTM site and residues of upstream/downstream effector determines whether this PTM reaction will happen. However, whether this PTM would influence protein function is encoded in its coupling with the protein orthosteric site as functionally coupled residues usually coevolve. A recent study has also shown that there are strong coevolutionary couplings between post-translational modification sites and orthosteric sites (Zhu, et al., J Chem Inf Model. 2022; 62:3331).

5) The authors defined allo-residues as the residues whose Z-score is >0.8, but there is no strong argument regarding the choice of this threshold. In the case of the allosteric pockets, for example, the authors use a threshold of 0 to identify pockets with strong coupling strength.

We thank the reviewer for the comment. We chose this threshold to ensure that most of the known key allo-residues can be correctly predicted by KeyAlloSite, and at the same time, the number of predicted key allo-residues should be as small as possible. We calculated the number of known key allo-residues that could be predicted by KeyAlloSite (Supplementary file 5) and the ratio of predicted key allo-residues in all residues of allosteric pockets of all proteins in the data set (Figure 3—figure supplement 1) for thresholds of 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. It can be seen from the Supplementary file 5 that when taking 0.5, 0.6, 0.7 and 0.8 as thresholds, KeyAlloSite could correctly predict the known key allo-residues in BCR-ABL1, Tar and PDZ3. But when taking 0.9 and 1.0 as thresholds, several known key allo-residues could not be correctly predicted. Moreover, when the threshold was 0.8, the ratio of predicted key allo-residues among all residues of allosteric pockets is less than that of 0.5-0.7 (Figure 3—figure supplement 1). Therefore, we finally chose the threshold as 0.8. We have added this explanation in the revised manuscript (The first paragraph of Coevolution analysis revealed key allo-residues in allosteric pockets).

6) In the case of the Tar, it would be good if the authors reported the results starting from an apo structure as well and see if the method is able to find Y149 and Q152. That way, they could test how sensitive/biased the method is to the chosen conformation. If no apo structure is available, maybe another protein where both an apo and a holo structure exist could be used for this purpose.

We thank the reviewer for the suggestion. We have added the results using the apo structure of Tar in the revised manuscript (The first paragraph of KeyAlloSite correctly identified key allo-residues in other proteins not in the data set). We first used CAVITY to identify all the potential ligand binding pockets on the surface of chain B in the apo structure of Tar (PDB ID: 4Z9J). Of the eight pockets found, cavity_5 is the allosteric pocket, which contains 24 residues. As in the holo structure, the 16 residues (A166-T181) in the C-terminal of the α4 helix were chosen as the orthosteric site. In the prediction results, Y149' and Q152' ranked fourth and fifth among the 24 residues, with corresponding Z-scores of 1.01 and 0.66, respectively. Therefore, KeyAlloSite could also correctly predict the key allo-residue Y149 in the apo structure. For Q152, its Z-score is slightly smaller than the threshold of 0.8, though with a high ranking. This indicates that conformational changes do have subtle influence on the predicted results, probably mainly due to the change of residue composition in the allosteric pocket detected by CAVITY, which will lead to some fluctuations in the predicted key allo-residues. However, when the conformational changes between apo and holo states are not large, the influence on the results is small.

7) The data in Table 1 is not sufficiently convincing. It would be helpful to also report how many of the identified residues by KeyAlloSite are indeed involved in oncogenic mutations or find another metric to quantify the success rate of KeyAlloSite.

We thank the reviewer for the suggestion. Among the 51 KeyAlloSite predicted key allo-residues in 7 human proteins, 11 have been found as oncogenic mutations in the Allo-Mutation database. We have added related information in the revised manuscript (KeyAlloSite identified pathogenetic mutations in human proteins). With the increasing number of known allosteric residues involved in oncogenic mutations, we expect that KeyAlloSite can identify more pathogenetic mutations.

8) The success rate of the method to predict key residues of the function of CALB (38% success rate based on the reported residues in the literature) and CMS (20%), should be interpreted with caution. It may be the case that the rest of the predicted residues have not been tested for their functional role, or that the method has indeed a low success rate in predicting key residues for allosteric communication.

We thank the reviewer for the suggestion. Wu et al. identified 63 annotated functional residues in CALB from literature (FASEB J 2020, 34, 1983), of which 8 were in the substrate binding pocket and the remaining 55 annotated functional residues were outside the substrate binding pocket. There are 296 residues outside the substrate binding pocket, and the annotated functional residues account for 18.6% of all residues, so it is difficult to predict these annotated functional residues. Wu et al. developed the SCA.SIM method and found that the success rate of SCA.SIM was better than that of SCA. The top 27 ranked residues predicted by SCA.SIM contained 11 annotated functional residues, 6 of which are in the substrate binding pocket and 5 of which are outside the substrate binding pocket. This means that the predicted success rate of the annotated functional residues outside the substrate binding pocket is 18.5%. In comparison, the top 27 ranked residues predicted by KeyAlloSite contained 7 annotated functional residues outside of the substrate binding pocket with a success rate of 25.9%. The top 38 ranked residues predicted by SCA.SIM contained 12 annotated functional residues, 6 of which are in the substrate binding pocket and 6 of which are outside the substrate binding pocket with a prediction success rate of 15.8%. In comparison, the top 38 ranked residues predicted by KeyAlloSite contained 9 annotated functional residues outside of the substrate binding pocket with a success rate of 23.7%. Therefore, the success rates of KeyAlloSite was slightly higher than that of SCA.SIM in predicting annotated functional residues outside of the substrate binding pocket. For the rest of the predicted residues that have not been tested for their functional role, we reason that they are likely to allosterically regulate the activity of the enzyme, and further experiments are needed to test whether they really have allosteric regulatory functions.

In lines 38-41, the authors refer to SARs of allosteric drugs as "flat", which is not clear what they refer to.

“Flat” SARs (also sometimes referred to as “Shallow” SARs) means that SARs are not robust and a modest structural modification may destroy the activity of a potent allosteric modulator. More explanations on this have been added in the revised manuscript.

It is not clear if the so-called, "allo-residues", that the authors define in lines 49-52 refer to residues that affect the binding or the signalling.

“Allo-residues” refer to residues that mainly affect signaling. More explanations on this aspect have been added in the revised manuscript.

It would be helpful for the reader if the authors included a section where they describe the method itself and the key steps of developing their model before presenting the results. That way, the reader could follow the results easier.

We have added a section describing the method and the key steps of developing the model before presenting the results in the revised manuscript (The last paragraph of the Introduction).

The authors classified L359 of BCR-ABL1 as an allo-residue and justified the importance of this residue based on the fact that a weaker ligand does interact with this residue. For transparency, it would be good for the authors to report the allosteric scores of all 44 residues in the allosteric site to show that this residue was not cherry-picked and that the method can pick up all the residues that are important for the binding.

We have added the Supplementary file 6, which reported the allosteric scores of all 44 residues in the allosteric site.

The manuscript has several typos and language mistakes (e.g. "screening" instead of "screen" in the Abstract, missing articles, etc.) that should be corrected prior to a resubmission.

We have carefully checked the language and made necessary corrections in the revised manuscript.

Reviewer #2 (Public Review):

The authors designed a statistical approach to analyse coevolution scores from a protein and predict allosteric residues. This approach relies on comparison of residues from the two sites (allosteric vs. orthosteric) and omits the rest of the protein.

While the approach is logical, the predictive power has not been clearly established. To demonstrate the effectiveness of the approach a confusion matrix should be provided as it will show the predictive power of the method in the different scenarios presented in the manuscript (PTM's, enzymes, pathogenic mutations, etc.).

An effective method, along these lines, will have significant impact on protein and drug design.

We thank the reviewer for the comments and encouragement. To demonstrate the predictive power of our method, we have given the confusion matrices in different scenarios (Supplementary file 8) in the revised manuscript. In different scenarios, the known real positive data is very limited. At the same time, there is little real negative data, and we regarded data with unknown functions as negative data. It can be seen that KeyAlloSite has high recall in the prediction of allosteric sites and the prediction of key allo-residues in allosteric pockets. The recall of KeyAlloSite in the prediction of functional post-translational modification (PTM) sites, pathogenic mutations and enzymes are not very high, mainly because there are limited known data on functional PTM sites and pathogenic mutations, and the functions of many key allo-residues predicted by KeyAlloSite are unknown. It remains to be further verified whether they have the function of allosteric regulation. Moreover, since we are now taking the predicted key allo-residues of unknown function as negative samples, our values for recall here are lower bounds of recall. The prediction recall of KeyAlloSite will be higher if the residue of the unknown function we predicted is later determined to have an allosteric regulation function.

It would be helpful if statistics on the distribution of significant scores from E row comparison after the t-test are provided. Further, you have so far compared allosteric with orthosteric, it would be interesting to compare the same distribution/statistics with allosteric vs. non-orthosteric residues. It may serve as a very good benchmark. The idea would be to see if the events of significance from allosteric vs. orthosteric are different/unique if we pick and compare any two random patches in the protein.

As suggested, we have added some examples of distributions of the statistics corresponding to significant scores obtained from the t-test (Figure 3—figure supplement 2).

We also have added the difference between the evolutionary coupling between orthosteric and allosteric pockets and the evolutionary coupling between two random patches in the protein in the revised manuscript (The last paragraph of The evolutionary coupling between orthosteric and allosteric sites is stronger). For each protein in the data set, two residues that are not part of the orthosteric and allosteric sites were randomly selected from the surface residues. Among them, one was taken as the first center and the residues around it with the same number as the residues in orthosteric pocket were selected as patch1; and the other residue was taken as the second center and the residues around it with the same number as the residues in allosteric pocket were selected as patch2. Then we calculated the evolutionary coupling strength between the patch1 and patch2. The process was repeated four times, and then the mean and standard deviation of the evolutionary coupling strength were calculated. We then compared the evolutionary coupling strength between patch1 and patch2 with that between orthosteric and allosteric sites. The results showed that the evolutionary coupling strength between orthosteric and allosteric sites was significantly higher than that between two random patches (Figure 2—figure supplement 3). In other words, there is intrinsic evolutionary coupling between orthosteric and allosteric sites, which is different from the evolutionary coupling between any two random patches.

A boxplot of Zscores for All, top200, top300, top400 pairs from Figure 1—figure supplement 2 would be very informative.

We have added the boxplot of Z-scores for All, top200, top300, top400 pairs from Figure 1—figure supplement 2 as suggested (Figure 2—figure supplement 2C).

As you have been using top zscore residues (>0.5 or >0.9 for enzymes), direct or indirect evidence for the statement in lines 306-307 is not very apparent from the provided data. This has to be flushed out further.

Thank you for the suggestion. We feel sorry for not explaining clearly how weak terms in Jij were used in our prediction. Since orthosteric and allosteric pockets are two different pockets, the residues in the two pockets generally do not contact. Therefore, when we calculate the coevolutionary coupling between residues in the two pockets, most of the evolutionary coupling values between residues in the two pockets are relatively small, that is, they correspond to the weak terms in Jij, as can be seen in Figure 1D. Then we got the number of significant differences of each residue by comparing the differences of these weak evolutionary coupling values of each residue in the allosteric pocket, and normalized the number of significant differences to Z-scores. A larger Z-score indicates that the evolutionary coupling values of this residue are significantly different from the evolutionary coupling values of more residues. In other words, Jij and Z-score are two different metrics. We have added this detailed explanation of the use of the weak terms in Jij in the revised manuscript (The first paragraph of Coevolution analysis revealed key allo-residues in allosteric pockets).

Please annotate residues for which experimental data is available to compare with keyallosite predictions for example in Supplementary files 4 and 5. In the main text (line 284) you only mention one residue out of 52 to demonstrate the effectiveness of the prediction. Again a confusion matrix would help here.

Thank you for the suggestion. We have bolded the predicted key allo-residues that were annotated as functional residues by the experimental data in the previous literature in the revised Supplementary files 4 and 7 as suggested. In the main text (line 284), V225M is a representative of the functional residues already annotated with experimental data in CALB, and other functional residues annotated with experimental data are included in revised Supplementary file 7. We have added another example, V37I, and the prediction accuracy calculated based on confusion matrix was 74.0%. Although the prediction accuracy here is not very high, it is the lower bound of the prediction accuracy of KeyAlloSite. The prediction accuracy of KeyAlloSite will be higher if the residue of the unknown function we predicted is later determined to have an allosteric regulation function.

Can you please clarify if in Figure 2, number of residues refers to the number of residues from allosteric+orthosteric site?

Thank you for pointing this out. The number of residues refers to the number of residues from allosteric site. We have added this detailed description in the revised Figure which is now Figure 3.

Figure 6 may be ideal as Figure 1 to inform the readers about the protocol of this work.

Thank you for the suggestion. We have adjusted Figure 6 as Figure 1 in the revised manuscript as suggested.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

We thank the Senior Editor, the Reviewing Editor and the reviewers for the comments and suggestions which are helpful to improve our work.

1. Can you please provide a detailed description and expand the discussion on the predicted amino acids and why the predicted residues are not the top ranking ones?

Thank you for the suggestion. We have added the detailed description and discussion on the predicted other top-ranking key allo-residues with unidentified function (The second paragraph of Discussion), the comparison to the approach of Hopf et al. (The third paragraph of Discussion), the possible effect of protein conformational changes on the prediction results (The fifth paragraph of Discussion) and possible applications of the predicted key allo-residues (The last paragraph of Discussion) in the revised manuscript.

2. The theoretical formulation to extract the coupling score and a compare/contrast to Hopf et al. approach would be highly beneficial to the readers.

We have added the theoretical formulation for extracting the coupling scores (The first paragraph of The Evolutionary Coupling Model (ECM)) and the comparison to the approach of Hopf et al. (The third paragraph of Discussion) in the revised manuscript. The theoretical formulation for extracting the coupling scores between residues in the initial step of our method is the same as that in the method of Hopf et al., but the problem studied by us is different from that of Hopf et al., and the usage of the coupling scores between residues in the later steps is different. Hopf et al. used the coevolutionary coupling scores between residues to predict the effects of mutations by calculating the difference in statistical energy between mutant and wild-type sequences. In contrast, we used the coevolutionary coupling scores between residues to predict allosteric sites and predict the key allo-residues in allosteric pockets that are mainly responsible for allosteric signaling by pairwise comparing the difference of the coevolutionary coupling scores of residues in allosteric pockets. Although Hopf et al. highlighted the significance of the pairwise term for the prediction of mutation effects, we highlighted the importance of the weak pairwise term for the study of allosteric function.

Reviewer #1 (Recommendations for the authors):

The authors did extensive work trying to address the revision comments, and their effort is much appreciated.

Nevertheless, in the absence of further experimental validation of the importance of the predicted so-called key "allo-residues" in mediating the signal between the orthosteric and allosteric binding site, it is still hard to assess the predictive power of the presented method. There are still residues for example that score higher than known functional residues (see for example residues T235, S245 and S249 in the case of Aurora A) whose implication in signal transduction is not confirmed, or residues whose importance depends on the conformation chosen for analysis despite the coevolutionary conservation metric used to score the residues (Q152 of Tar whose score is way below the 0.8 threshold when an apo structure is considered, which shouldn't be the case given that AlloSite uses only MSA and coevolutionary information for the scoring – it seems like other residues are predicted to be way more important based on the allosteric pocket definition given by CAVITY to decrease the score of Q152 to <0.8 after normalisation).

I fully understand that such experimental validation goes beyond the capacities of a computational group, however, as a reader, I might be hesitant to try this method.

We thank the reviewer for the comments. We agree that experimental validation of the newly predicted key "allo-residues" in mediating allosteric signaling will further verify the predictive power of our method. We are planning to carry out experimental validation studies in the future. Currently we only have a small number of proteins with limited experimental information about key allo-residues. Among these proteins, our predicted key allo-residues rank high among all the alloresidues. At the same time, there are residues with higher scores than these known functional residues, which may play important roles in allosteric signaling. Of course, further experimental validation will be necessary to study the function of these top-ranking residues. With more experimental data at hand, the method can be further validated and improved in the future.

The scores of the key allo-residues predicted by KeyAlloSite depend not only on the coevolutionary information, but also on the residues that constitute the allosteric pockets (Figure 1, see Materials and methods for details). For Tar, the evolutionary couplings between residues are the same for the apo and holo conformations, but the allosteric pockets found by CAVITY in the two conformations contain a number of different residues. Because the prediction of key allo-residues by KeyAlloSite requires pairwise comparison of residues in allosteric pockets, the predicted key allo-residues in the two conformations were slightly different. For the apo Tar, on the one hand, although the score of Q152 is 0.66, which is less than the threshold of 0.8, Q152 ranked high among all residues in the allosteric pocket with a ranking of 5/24. When we lower the threshold slightly, we will be able to correctly predict Q152. On the other hand, Bi et al. showed that although Y149 and Q152 are both key allo-residues, Y149 seems to be more important as allosteric signaling can be conducted when the allosteric molecule only interacts with it and not Q152 (PNAS 2013 110:16814-16819. DOI: https://doi.org/10.1073/pnas.1306811110). Therefore, although Q152 was not predicted in the apo Tar, allosteric molecules can still be optimized based on the Y149 predicted by KeyAlloSite.

Reviewer #2 (Recommendations for the authors):

The authors have carefully addressed all the comments from the previous review.

Few additional comments for authors to consider:

The Discussion section can be enriched further. At present it only discusses two points, i.e., ability of the method to predict key allosteric residues and requirement of sequence depth. For example: the last two paragraphs 565-580 are repetitive as they both highlight the need for depth in sequence alignments which is a known issue for MSA dependent methods, as again discussed by Hopf et al. Nature Biotechnology volume 35, pages 128-135 (2017). Further, the theoretical formulation to extract the coupling score is identical to Hopf et al. and a compare/contrast of the two approaches would be highly beneficial to the audience. Even Hopf et al. highlight the significance of the pairwise term, as done in this article.

We thank the reviewer for the suggestion. We have added the discussion on the predicted other topranking key allo-residues with unidentified function (The second paragraph of Discussion), the comparison to the approach of Hopf et al. (The third paragraph of Discussion), the possible effect of protein conformational changes on the prediction results (The fifth paragraph of Discussion) and the utility of the predicted key allo-residues (The last paragraph of Discussion) in the revised manuscript.

For the last two paragraphs, although they both mentioned the need for depth in sequence alignments, the two paragraphs emphasized different points. The first paragraph emphasized the need for sufficient homologous sequences, while the second paragraph emphasized that although our method needs to use the three-dimensional structure of a protein, when a protein does not have a known three-dimensional structure but has a certain number of homologous sequences, our method can also be applied.

The theoretical formulation for extracting the coupling scores between residues in the initial step of our method is the same as that in the method of Hopf et al., but the problem studied by us is different from that of Hopf et al., and the usage of the coupling scores between residues in the later steps is different. Hopf et al. used the coevolutionary coupling scores between residues to predict the effects of mutations by calculating the difference in statistical energy between mutant and wild-type sequences. In contrast, we used the coevolutionary coupling scores between residues to predict allosteric sites and predict the key allo-residues in allosteric pockets that are mainly responsible for allosteric signaling by pairwise comparing the difference of the evolutionary coupling scores of residues in allosteric pockets. Although Hopf et al. highlighted the significance of the pairwise term for the prediction of mutation effects, we highlighted the importance of the weak pairwise term for the study of allosteric function.

https://doi.org/10.7554/eLife.81850.sa2

Article and author information

Author details

  1. Juan Xie

    Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6975-0449
  2. Weilin Zhang

    BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing, China
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
  3. Xiaolei Zhu

    School of Sciences, Anhui Agricultural University, Hefei, China
    Contribution
    Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Minghua Deng

    1. Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
    2. School of Mathematical Sciences, Peking University, Beijing, China
    3. Center for Statistical Science, Peking University, Beijing, China
    Contribution
    Conceptualization, Formal analysis, Methodology
    Competing interests
    No competing interests declared
  5. Luhua Lai

    1. Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
    2. BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing, China
    3. Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing - review and editing
    For correspondence
    lhlai@pku.edu.cn
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8343-7587

Funding

National Key R&D Program of China (2022YFA1303700)

  • Luhua Lai

National Natural Science Foundation of China (21633001)

  • Luhua Lai

Chinese Academy of Medical Sciences (2021-I2M-5-014)

  • Luhua Lai

National Natural Science Foundation of China (22237002)

  • Luhua Lai

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors thank Jintao Zhu from the Center for Quantitative Biology and Gaoxiang Pan from the College of Chemistry and Molecular Engineering, Peking University for helpful discussions. This study was supported in part by the National Key R&D Program of China (2022YFA1303700), the National Natural Science Foundation of China (21633001, 22237002) and the Chinese Academy of Medical Sciences (2021-I2M-5–014).

Senior Editor

  1. Volker Dötsch, Goethe University, Germany

Reviewing Editor

  1. Shozeb Haider, University College London, United Kingdom

Reviewer

  1. Sarath Dantu, Brunel University London, United Kingdom

Version history

  1. Received: July 13, 2022
  2. Preprint posted: July 26, 2022 (view preprint)
  3. Accepted: February 16, 2023
  4. Accepted Manuscript published: February 17, 2023 (version 1)
  5. Version of Record published: March 2, 2023 (version 2)

Copyright

© 2023, Xie et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,483
    Page views
  • 276
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Juan Xie
  2. Weilin Zhang
  3. Xiaolei Zhu
  4. Minghua Deng
  5. Luhua Lai
(2023)
Coevolution-based prediction of key allosteric residues for protein function regulation
eLife 12:e81850.
https://doi.org/10.7554/eLife.81850

Further reading

    1. Biochemistry and Chemical Biology
    2. Cell Biology
    Sevim Kahraman, Kimitaka Shibue ... Rohit N Kulkarni
    Tools and Resources

    Pancreatic a-cells secrete glucagon, an insulin counter-regulatory peptide hormone critical for the maintenance of glucose homeostasis. Investigation of the function of human a-cells remains a challenge due to the lack of cost-effective purification methods to isolate high-quality a-cells from islets. Here, we use the reaction-based probe diacetylated Zinpyr1 (DA-ZP1) to introduce a novel and simple method for enriching live a-cells from dissociated human islet cells with ~ 95% purity. The a-cells, confirmed by sorting and immunostaining for glucagon, were cultured up to 10 days to form a-pseudoislets. The a-pseudoislets could be maintained in culture without significant loss of viability, and responded to glucose challenge by secreting appropriate levels of glucagon. RNA-sequencing analyses (RNA-seq) revealed that expression levels of key a-cell identity genes were sustained in culture while some of the genes such as DLK1, GSN, SMIM24 were altered in a-pseudoislets in a time-dependent manner. In conclusion, we report a method to sort human primary a-cells with high purity that can be used for downstream analyses such as functional and transcriptional studies.

    1. Biochemistry and Chemical Biology
    2. Cell Biology
    Valentin Chabert, Geun-Don Kim ... Andreas Mayer
    Research Article

    Eukaryotic cells control inorganic phosphate to balance its role as essential macronutrient with its negative bioenergetic impact on reactions liberating phosphate. Phosphate homeostasis depends on the conserved INPHORS signaling pathway that utilizes inositol pyrophosphates and SPX receptor domains. Since cells synthesize various inositol pyrophosphates and SPX domains bind them promiscuously, it is unclear whether a specific inositol pyrophosphate regulates SPX domains in vivo, or whether multiple inositol pyrophosphates act as a pool. In contrast to previous models, which postulated that phosphate starvation is signaled by increased production of the inositol pyrophosphate 1-IP7, we now show that the levels of all detectable inositol pyrophosphates of yeast, 1-IP7, 5-IP7, and 1,5-IP8, strongly decline upon phosphate starvation. Among these, specifically the decline of 1,5-IP8 triggers the transcriptional phosphate starvation response, the PHO pathway. 1,5-IP8 inactivates the cyclin-dependent kinase inhibitor Pho81 through its SPX domain. This stimulates the cyclin-dependent kinase Pho85-Pho80 to phosphorylate the transcription factor Pho4 and repress the PHO pathway. Combining our results with observations from other systems, we propose a unified model where 1,5-IP8 signals cytosolic phosphate abundance to SPX proteins in fungi, plants, and mammals. Its absence triggers starvation responses.