Design and experimental characterization of specificity-switching mutational paths of WW domains

Ahmed Rehan; Eugenio Mauri; Jorge Fernandez-de-Cossio-Diaz; Pierre-Guillaume Brun; Remi Monasson; Marco Ribezzi-Crivellari; Simona Cocco

doi:10.7554/eLife.110491.1

eLife Assessment

In this important study, the authors demonstrate that generative AI techniques (restricted Boltzmann machine) can be used effectively to design and characterize mutational pathways of WW domains with different binding specificities. The computational studies are complemented by experimental validations, and the results provide solid evidence supporting the idea that sequence landscape holds significance in understanding protein evolution from a transition path perspective. The minor weakness of the study in the current form concerns limited success in designing variants with smoothly varying binding specificities. Nevertheless, the work will likely have a major impact on research aimed at understanding how evolution navigates fitness landscapes as well as reconstructing ancestral sequences.

https://doi.org/10.7554/eLife.110491.1.sa2

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield

landmark
fundamental
important
valuable
useful

Strength of evidence

solid: Methods, data and analyses broadly support the claims with only minor weaknesses

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

Specific interactions between proteins and other biomolecules are ubiquitous in cellular processes. How specificity is encoded in the protein sequence and can be modified through a minimal set of concerted mutations is a complex issue. In this work, we focus on the WW protein domain, whose variants specifically bind to different classes of proline-rich peptides. Combining unsupervised learning of homologous WW sequence data with Restricted Boltzmann Machines (RBM) and path-sampling methods, we design mutational paths of putative WW domains interpolating between two natural WW domains with either distinct or similar specificities. Sequences along the designed paths are then experimentally validated with high-throughput in-vitro binding assays against 3 peptides of different classes. The vast majority (93%) of intermediate sequences along the designed paths are responsive to the initial or/and final peptides. On the contrary, domains along scrambled paths, in which the same mutations are introduced in random order are not functional, emphasizing how successful design crucially depends on the ability to model epistatic interactions. Interestingly, switch in specificity between classes I and IV whose representative peptides bind to different pockets on the WW domain appears to be smooth, with intermediates displaying some level of binding cross-reactivity with all tested peptides. We finally show that the RBM paths share a high identity with internal nodes obtained from ancestral sequence reconstruction based on the seed WW domains.

1 Introduction

As living organisms evolve, the proteins they produce may need to adapt and specialize to a new task while preserving their fundamental biological function, such as catalyzing certain classes of reactions or binding to specific targets [1]. An illustration of this flexibility is provided by hemoglobin and myoglobin: these proteins share a common ancestor and the conserved mechanism of binding oxygen to the heme group, however hemoglobin transports oxygen from the lungs to the tissues, and myoglobin stores oxygen for use during muscle contraction [2]. Another interesting example is the WW domain, a family of small protein domains (30-40 amino acids long), whose name comes from the presence of two strongly conserved tryptophan residues [3]. Found in proteins across multiple species, WW domains are essential to key cellular pathways, such as the Hippo pathway, which controls cell proliferation, apoptosis, and organ size [4]. WW domains are widely distributed among proteins and are commonly found both in the cytoplasm and in the nucleus [5]. Following their initial identification, these domains became a primary focus of scientific inquiry due to their involvement in signaling complexes linked to a range of human disorders, including Alzheimer’s disease, Liddle’s syndrome, muscular dystrophy, Huntington’s disease, and some types of cancers [6].

Most WW domains bind proline-rich peptides [7]. However, they can differ significantly in the specific amino-acid motifs they recognize, defining three broad specificity classes (I, II/III, IV) [8, 6, 9], which determine the biological pathways their interactions affect. Two examples of WW domains containing proteins present in the human proteome are the yes-associated protein 1 (YAP1) and the Peptidyl-prolyl cis-trans isomerase NIMA interacting 1 (Pin1). While the former acts as a transcriptional co-activator of genes involved in cellular proliferation (with a major role in oncogenic activity [10]), the latter isomerizes only phospho-Serine/Threonine-Proline motifs and has many roles, among which cell cycle regulation.

The involvement of mutations or dysregulation of WW domain-containing proteins in the pathogenesis of various diseases, including cancer, muscular dystrophy, and cardiovascular disorders, has been well documented. For instance, aberrant activity of the YAP/TAZ transcriptional regulators, which is mediated by their WW domains, is a hallmark of many cancers due to its role in promoting unchecked cell growth [11]. Similarly, mutations in the WW domain of Pin1 are involved in the emergence of Alzheimer’s disease [12], while NEDD4-2 are associated with Liddle syndrome, which disrupts sodium channel regulation and is a risk factor for hypertension [13]. These examples illustrate the significance of WW domains in health and disease, establishing them as crucial targets for therapeutic intervention. Rational design of WW domains has been proposed to change their specificity and recognize mutated ligands. As an example, in Liddle’s syndrome, the PPXY motif of a sodium channel subunit, β EnaC is mutated, which prevents sodium channel degradation by Nedd4; changing specificity of the WW domain from class I to II/III would restore recognition [14].

In addition to their clinical relevance, WW domains serve as valuable models in structural biology and protein evolution due to their small size, conserved structure, and diverse binding partners. Their modular nature and capacity to mediate specific interactions provide insights into protein network dynamics and evolutionary adaptation [15, 8, 6, 9, 16]. In this context, a natural question is to understand how homologous proteins with different ligand specificities evolved from ancestral, and possibly promiscuous proteins [17], while remaining functional at all stages. This question dates back to the seminal works of J. Maynard Smith, who described evolution as a word game in which one can change one letter at a time while keeping meaningful intermediate words, as illustrated by the path “word→wore→gore→gone→gene” [18]. Ancestral sequence reconstruction, by constructing putative intermediates linking the diverse modern proteins, is a practical way to address this problem [19]. However, bridging the gap between the theoretical and conceptual study of evolution through paths [20] and realistic fitness landscapes is difficult [21, 22].

From an experimental point of view, high-throughput experiments allow for building mutational paths between phenotypically distinct proteins separated by D mutations [22, 23, 21, 24, 25], when D is small enough. Examples include Ref. [25], in which mutational paths in a region of 4 residues increasing antibiotic resistance in β-lactamase were investigated, and Ref. [21], where all 2¹³ direct paths involving 13 residues allowing to change the fluorescence properties of a protein, were tested. Let us also mention Ref. [26] where all 20⁴ paths over a 4-residue region in the protein GB1 were characterized and Ref. [27], where about 1000 paths in the fitness landscape of transcription factors for short (9 nucleotides) DNA sequences were tested and characterized. When the number D of mutations separating the two proteins becomes large, the space of possible sequences cannot be explored in an exhaustive manner any longer. Evolutionary methods such as ancestral protein reconstruction (ASR) can help to isolate sets of important mutations involved in the switch of affinity in protein systems [28, 29], although the phenotypic characterization of the mutations remains essentially experimental. To this end, computational models approximating the genotype-phenotype mapping are necessary to prune this huge space and retain only good putative paths leading to the change of function.

Recently, we proposed such a computational approach to design mutational paths [30]. Briefly speaking, the approach relies on two ingredients. First, we use an unsupervised machine-learning architecture, trained from homologous sequence data, to model the global fitness landscape of the protein family of interest, hereafter WW domains. Various generative models were recently proposed based on deep architectures [31, 32], e.g. diffusion processes [33]. Simpler unsupervised architectures such as Boltzmann Machines (BM) or Variational Auto-Encoders have been shown to be good generative models, with the ability to design novel proteins with functionalities comparable to natural ones [16, 9, 34, 35, 36], while being easier to interpret. Hereafter, we use Restricted Boltzmann Machines (RBM), which are bipartite probabilistic graphical models learning relevant amino-acid motifs with key role for structure stability and function [37] and able to design new functional proteins [38] or other biomolecules [39, 40]. Second, we sample the space of mutational paths, defined as chains of sequences that (1) have high scores according to the RBM generative model and (2) differ from the previous and next ones along the path by a small (and controllable) number of mutations. At their extremities the chains are anchored by two sequences of interest, hereafter two natural WW domains with distinct specificities.

This path design approach was shown to successfully generate high-quality mutational paths of in silico (lattice-based) proteins, for which the ground-truth was known [30]. In addition, we showed it could be applied to the WW domain family, using as training data the multiple-sequence alignment from the PFAM/InterPro database (PFAM ID: PF00397). Intermediate configurations had high scores according to the trained model and to computational methods based on protein structure information, such as AlphaFold and ProteinMPNN [41, 42].

In this paper, we go a step further in designing mutational paths for switching specificity in WW domains. We present a high-throughput fully in-vitro binding assay that allows us to simultaneously evaluate the activity of the designed WW domain paths against multiple peptides attached to different specificity classes. These experimental results not only validate the computational approach but also allow us to study in great details how the switching of specificity takes place along the path. Lastly, we show that the RBM designed paths share some similarity with the internal nodes of phylogenetic trees inferred from representative WW domains sequences (the PFAM seed).

2 Computational models for path generation and experimental verification

Our pipeline is described in Figure 1. We consider the space of putative WW domains (Figure 1A), whose sequences contain a variable portion of L = 31 (Figure 1B) amino acids plus appropriate constant flanking regions or ‘spacers’ (Figure 1D). The probabilistic model for the variable part is provided by a two-layer unsupervised neural network, called Restricted Boltzmann Machine (RBM), trained on homologous natural sequence data, see Figures 1B,C (Methods 5.1 and [43, 30]). RBM learns a distribution over the set of all possible sequences, which allows us to score putative WW sequences, and, in turn, to generate new sequences with high scores.

Experimental and computational pipeline for designing WW domains of variable specificities.
A. Sketched of the activity vs. sequence landscape of WW domains. Highly active domains may bind to different peptides (black shapes) and are shown with different colours. B. Homologous sequences define the training data for our restricted Boltzmann machine (RBM) model. The binding specificities of some sequences are annotated, while others are not available. C. After training, the latent variables of the RBM define low-dimensional projections that identify clusters of sequences sharing the same specificity. Sequences along paths may cross regions deprived of natural sequences, of unknown specificity. D. Sketch of the experiment. In vitro transcription and translation of our construct result in the expression of a fusion protein including a WW domain (green), a linker (yellow), and the SNAP tag (blue), labelled with BG-AF647 dye. The WW domain of the fusion protein may bind to peptide-coated magnetic beads (brown). The fluorescence intensity on the bead surfaces is then measured using flow cytometry, assessing the strength of the interaction between the labelled protein and its target peptide.

Natural WW domains are broadly classified according to three specificity classes (I, II/III, IV) [8, 6, 9]. We choose peptides representative of each class to assess the binding specificity of the designed variants (Table 1). Some latent variables inferred by the RBM have been shown to be informative about the speci-ficity classes [43, 30]. More precisely, high-dimensional sequences cluster in the 2-dimensional plane defined by the inputs attached to two particular latent units, I₁ and I₂ (see Ref. [43], Methods Section and Figure 2A). Experimentally tested sequences reported in the literature appear to consistently belong to the same clusters depending on their specificities. This clustering property is sketched in Figure 1C.

Peptides used for the experimental validation of mutational paths in WW domains.

Highlighted in red are residues that bind to the WW domain. The phosphorylated Threonine in the peptide of class IV is indicated by pT.

Path within class I WW domains.
A. Alignment of the tested sequences showing the mutated amino acids. **B, C**. Logo representation of the weight vectors attached to the two RBM latent units most correlated with the specificity classes. D. Projection of the path on the latent weights, clustering sequences according to their binding specificities. The specificity thresholds indicated in the figure (Class I: I₁ *< —*1 and I₂ *> —*3; Class II/III: I₁ *> —*1 and I₂ *> —*3; Class IV: I₁ *> —*1 and I₂ *< —*3) are informed by previous specificity assays on natural sequences [9, 8] (see Figure 3 in [43] and Figure 2a in [30]). **E, F**. Global and Class I specific RBM scores for the sequences in the path. G. Experimental binding responses of the tested sequences. The red band, defined by the experimental noise limit, denotes the non-binding zone.

We learn global RBM models, trained on all available WW data irrespectively of their binding specificity, or class-specific models, trained from sequences attached to one class only, see Methods 5.1.1; notice that the attribution of sequences to one of the three functional classes for training is mostly based on the 2-dimensional clustering mentioned above and not on experimental binding experiments, and is therefore subject to errors. Specific models are, by construction, informative about the binding specificity of the WW sequences but are generally of lesser quality than the global model, due to the smaller number of available annotated WW data, in particular for class IV. As a consequence, they cannot be used to generate de novo high-quality sequences from one class or another, but rather to score the specificity of sequences generated from the global model.

We use path-sampling algorithms, developed in [30], to design paths of sequences fulfilling the following conditions:

the initial and final sequences are natural WW domains, possibly with different specificities, e.g. triangle and diamond in the sketch of Figure 1, and identity (fraction of identical residues) ranging from 30% to 50%, which are typical values within a homologous family.
Intermediate sequences along the path have high scores (probabilities) according to the RBM global model. Amino acids in the intermediate sequences can take any value.
The number of mutations from one sequence to the next along the path is small, typically one or two. The length of the designed path is typically of 1.5 times the hamming distance between the initial and final sequences.

To sample paths we have employed a Monte Carlo algorithm, which randomly draws paths according to the RBM probability distribution (Methods 5.2). We have designed four batches of sequences, listed in Appendix 8.1 together with the parameters used for the models. Intermediate sequences along the paths are scored according to the global RBM model, and to class-specific models to predict their specificities. In our sampled paths amino acids can “transiently” mutate into third-party values (underlined in blue in Figure 2A). Transient mutations can yield higher fitness scores along the evolutionary path, particularly when epistatic interactions are strong and compensatory mutations are expected [44].

To test the sequences generated by the model in high-throughput, we developed a fully in-vitro, rapid and accurate assay: “fluorescence-Activated Bead Counting (fABC)”. The assay assesses interactions between proteins and target molecules immobilized on magnetic beads, see Figure 1D. In the present case our targets are proline-rich peptides specific of the three different subclasses of WW domains (I, II/III, IV). Briefly, synthetic genes coding for SNAP-tagged WW domains[45] are expressed using cell-free in-vitro Transcription and Translation (IVTT) [46]. The SNAP-tag is leveraged for the fluorescent labelling of the WW domains using a benzyl-guanine modified fluorescent dye. Fluorescent binders are then mixed with magnetic beads coated with the target peptide and allowed to equilibrate. Binding activity is then measured by quantifying bead fluorescence on a flow cytometry instrument[47]. We demonstrate how, by using spectrally distinct fluorescently-labelled DNA linkers to coat the magnetic beads, we are able to multiplex different targets within a single run, increasing speed and reducing the cost of the assay. This method provides a robust and versatile platform for studying protein/protein or protein/peptide interactions and sequence/function relationship.

3 Results

We present and discuss hereafter a path connecting two natural WWs of type I, three paths connecting natural WWs of types I and IV, two paths connecting natural WWs of types I and II/III. Along each designed path, we pick up the sequences at regular intervals for experimental measurements of their binding affinities to ligands of classes I, II/III and IV. The density of experimentally assessed sequences is higher close to specificity switching. To test the presence of epistatic effects in specificity switching paths we compare our designed paths with scrambled paths, in which the same list of mutations present in the paths generated by the model are introduced in a different order, hence breaking epistatic interactions with the background sequence.

A complete list of the designed paths and tested sequences is provided in Tables in Appendix 8.1.

3.1 Assessment of natural WW activity levels

Several WW domains perform their function in vivo either as multi-domains or as part of a complex architecture [48, 49] and may require additional activating, post-translational modifications. Our in-vitro assay is based on the expression of individual WW domains and, as a consequence it is possible that binding activity may be lost. To explore and avert these issues, we tested a set of 28 wild-type sequences taken from the homologous sequence data (12 Class I, 15 Class II/III, 1 Class IV) against three peptides, representative of the targets of the three classes. Out of these sequences, a total of 8 did not give any response in our assay (2 classified as Class I and 6 classified as Class II/III). This is in particular the case of the YAP1 WW domain that we considered in [30]. Half of the remaining sequences gave a clear and specific response for the peptides associated to the classes (8 sequences for Class I, 5 for Class II, and 1 for Class IV). Two sequences showed cross-specificity to peptides associated to Classes I and II. In the following we discuss paths starting and ending at sequences with clear and specific responses (Appendix 8.1)

3.2 Path within class I WW domains

We start by testing a path between two Class I wild-types (2LTW and NP595793.1). Supp. Figure 7A shows one of these natural WW domains in complex with the cognate peptide. We show in Figure 2A the sequences of the two natural WWs of type I we considered, as well as the six tested intermediate sequences, sampled by the RBM-path algorithm connecting the two wild-types that were experimentally tested. The full sampled path is given in Suppl. Fig 14). Residues that change along the path are colored. As shown on the structures in Supp. Figure 7A, a large fraction (∼ 40%) of amino acids is mutated along the path. In Figure 2B,C we show the weights attached to the two specificity-related latent units of the RBM (I₁, I₂). Projections of the sequences in the path are consistently confined within the cluster attached to class I (Figure 2D). Most mutated amino acids are not proximal with the ligand, and not located on the sites with large entries in the RBM-weight attached to the latent units detecting class I specificity (I₂) (Figure 2B), in agreement with the fact that the specificity remains unchanged along the path. Twelve positions in total are altered across the path, four of which (9, 11, 12, 16) sample an intermediate residue, allowing to improve the global score of the artificial intermediate proteins with respect to the initial and final natural wild-type.

Figure 2E shows the measurement of the binding to a class I peptide of the tested sequences in the path. All tested sequences show specificity responses much above noise level. The global RBM scores of the sequences along the path and the class I specific RBM score are shown in Figure 2E,F. Interestingly, designed sequences with the largest scores, in particular the second and the third sampled ones, have larger responses than the initial natural WW.

The presence of indirect (transient) mutations (9, 11, 12, 16) improves the global score. In particular, several pairs involving site 9 (9-13, 9-14, 9-16) have a large epistatic score according to the RBM model, and correspond to contacts on the 3D fold, see Suppl. Figure 9. Indirect mutations may be therefore important for stabilizing the designed WWs along the path by exploiting epistatic interactions. As an illustration, the transient mutation A9V appearing on the second tested WW (Seq. 36) increases the stability via a favorable contact interaction, as seen from the value of the Miyazawa-Jernigan energy [50] between amino acids V9 and L12. This interaction is lost on the third tested sequence (Seq. 37) due to the mutation L12N compensated by the mutation T16V through a favorable contact interaction between amino acids V9 and V16. The Class I natural sequence anchoring the path (Seq. 42) has the smallest global score according to the RBM, but a good Class I local score and a very good binding response, illustrating the difference between the global score, a proxy for stability, and the specificity scores. In conclusion, the experimental probing of a designed path connecting two Class I wild types allows us to assess the binding response of the designed sequences along the path and to validate the RBM model predictions, in particular the epistatic interactions and the global and local scores.

3.3 Paths from Class I to Class IV

We now consider a path designed to interpolate between two different specificity classes, I (wild-type NP595793.1) and IV (wild-type PIN1). As shown in Supp. Figure 7B and previously reported [51], ligands to Class IV domains bind in a different pocket compared to their Class I counterparts. This alternative binding mode requires a longer loop between the β₁ and β₂ strands. Position 13 carries a gap in the Class I WW domain, and amino acid S in the Class IV natural domain (Figure 3A). The existence of two binding pockets may favor switching specificity and binding cross-ractivity as was hypothesized in [30].

Path from Class I to Class IV.
A. Alignment of tested sequences along the path showing mutated amino acids. B,C,D. Scores of sequences in the path predicted by the global (top), Class I or IV (middle) and Class I or II/III (bottom) RBM models. E. Two-dimensional projection of the path in the plane of specificity-related latent units. F. Relative binding responses to Class I and IV peptides of sequences along the path. G. Normalizes experimental binding responses to Class I and IV peptides. H. Responses of Seq.53, 54, 35, 48, 27 compared with Seq.82, 83 differing by mutations on residues 13 (carrying a gap for a shorter β₁ *− β*₂ loop or an S), and class I specificity-determining positions 24-25.

Class IV poses challenges from the computational point of view: the number of natural sequences in the cluster associated to Class IV is small compared to the other specificity classes, see green dots in Figure 2D. The number of experimentally validated Class IV sequences is even smaller ([8, 9] and Figure 3 in [43]). Due to the lack of training data, we expect RBM models to be less predictive for Class IV sequences..

Figure 3A shows the list of experimentally tested sequences obtained by subsampling the full path (sampled sequences are represented by red dots; the full path is given in Suppl. Fig 15). About 50% of the protein sequence is modified along the path. Most of the mutated residues are within or close to the two binding pockets, which can be recognized based on the large norm of the RBM weights ( and in Figure 3A) of the two specificity-related latent units. The importance of the residues in positions 8 and 13 and their changes in the path to achieve Class IV specificity agree with the weights (previously shown in Figure 2B) attached to unit 1 ( in Figure 3A). On the contrary, sites 19, 24 and 25 carry large entries in the weight (Figure 2C) associated to latent unit 2 ( in Figure 3A) [43, 30]. Positions 7, 9, 11, 12, 29 carry indirect mutations in the path (shown in blue in Figure 3A). The global RBM score (Figure 3B) is slightly smaller near the end of the path than at the beginning, perhaps due to the difficulty of properly modeling Class IV-like sequences or to a larger stability of Class I molecules with respect to the other classes, as shown by the MSA alignment in Suppl. Figure 10. We observe the expected trends for the specific RBM models: Classes I and IV have, respectively, lower and higher scores at the end of the path than at the beginning (Figure 3C). The two scores cross after the 6th sampled sequence in the path (Seq. 52). The path then goes, in the low-dimensional projection, across a region devoid of natural sequences (Figure 3E). Interestingly, in this region, the Class II/III specific RBM score gets closer its Class I counterpart (Figure 3D). Experiments confirm that sequences in this region keep a significant activity towards both Class I and Class II ligands (Suppl. Table 4). Furthermore, the last designed sequence along the transition path (WW150) shows cross-reactivity to Class I and IV ligands, consistently with our argument given in [30] (Figure 3G). Taken altogether, these results points towards a substantial level of promiscuity [30] along the path.

To confirm that the order in which mutations are introduced along the designed path is dictated by epistatic interactions, and thus depends on the amino-acid background, we study “scrambled” paths, in which mutations are introduced in reverse order (Suppl. Table 7). We observe that scrambled paths have global scores smaller than designed paths (Figure 3B), and do not contain any functional sequence (apart from the anchoring natural WWs), see inset of Figure 3G. Of notice, epistasis effects are correctly detected by Class I specific scores, but not by Class IV specific scores that assign a similar score to scrambled and designed paths. This observation is compatible with the absence of epistasis (Suppl. Figure 10) in the Class IV specific RBM model, probably due to the the scarcity of training data, but could also be related to a lower stability than Class I domains. To better understand the correlation between the two binding modes towards Class I and Class IV ligands, we study the specificity of sequences with adequate pairs of amino acids in position 24 & 25 for Class I specificity, namely RA or RT, or a mutated pair, NA, decreasing the binding affinity to Class I peptide, as well as long or short β₁ − β₂ loop. We find that extending the β₁ − β₂ loop is not sufficient to achieve Class IV specificity, e.g. the sequence with RT in P24-25 and a long loop is characterized as being in Class I. Moreover, decreasing the affinity of Class I with the NA mutations and extending the loop is necessary to bind class IV ligands, underlining the presence of a long-range interaction between the two binding modes.

As shown in Suppl. Table we have tested 3 additional paths interpolating between Classes I and IV. Remarkably, even on the path starting from the YAP1 wild-type, classified as non binding by our setup, all the designed sequences are functional (Supp. Table 3). All paths are found to follow similar trajectories in the two-dimensional plane, and show cross-reactivity in the region of the input space devoid of natural sequences, see Suppl. Figure 13: several sequences (Seqs. 52, 53, 54, 34, 35) even show mild cross-reactivity to Class II peptide. All paths pass through sequence ‘WW150’ before going directly to ‘WW10’ in 2 mutations, showing how constrained is the convergence towards Class IV reactivity.

We have also designed a direct path interpolating between the two classes (Suppl. Figure 13), along which transient mutations are forbidden. This path shows smaller RBM global score, especially for Class I sequences, in agreement with Section 3.2. Experimentally measured responses are very similar to the one found for indirect paths (red: no response; green: large response). In the region lacking natural sequences, the direct path goes through the same sequences (Seqs. 53, 35, 48) as the indirect ones, again underlying how constrained are sequences when approaching Class IV.

3.4 Paths from Class I to Class II/III

We now report results on paths interpolating between Classes I (wild type NP595793.1) and II/III (Wild type 1YWI or formin binding protein), see Figure 4. This path critically differs from the Class I → IV case above as ligands of Classes I and II bind to the same pocket of the WW domain [9, 43, 8, 14] (Supp. Figure 7C).

Path from Class I to Class II/III.
A. Alignment of the tested sequences, showing the mutated amino acids. Same color code as in Figure 2. **B, C**. Global and class-specific RBM scores of the sequences in the path. D. Projection of the path one the plane of RBM latent units relevant for class identification. E. Ratio of experimentally measured responses to peptides of class II over Class I for intermediate sequences along the path. F. Normalized responses of the designed sequences against peptides of Classes I and II. The red band, defined by the experimental noise limit, indicates the non-binding zone.

Figure 4A gives the list of experimentally tested sequences obtained by subsampling the complete path given in Suppl. Fig 16. Note that the natural Class II/III domain at the end of the path (Seq.64) has two flanking sequences (2 amino acids in the N-terminus and 6 in the C-terminus), which are not covered by the RBM model. To study a path ending at Seq. 64 we thus added the same flanking regions to Seq. 2, leading to Seq. 2x with no loss in specificity, as well as to all the designed sequences along the path. Supp. Figure 7C shows the complexes formed by the WW domain and the two ligands; the majority of sites (∼ 75%) in the molecule undergo mutations. Several residues are transiently mutated at the beginning of the path. Some of them (9-11-12) were already encountered on the path within Class I above, and increase the global scores (Figure 4B) with respect to the wild-type WW domain.

The global and specific scores relative to Classes I and II/III of the sequences are shown in Figure 4B,C. Notice that the natural WW domain of Class II/III at the extremity of the path has much lower global and specific scores than its Class I counterpart. This may be due to the fact that Class II/III domains are generally less stable than in Class I, as suggested by the contact map obtained from Class II/III specific RBM (Suppl. Figure 10) and by the lower conservation of residues in Class II/III domains, see sequence logo in Suppl. Figure and [48] (Table 3). We predict a change in specificity after the fifth sampled sequence (Seq. 61) from the low-dimensional representation of Figure 4D and from the crossing of the specificity scores in Figure 4C.This switch requires mutations of all specificity-determining residues: V19W, H21T [9, 48, 43] and R24N, compatible with the weight vector associated to I₂, and sequentially introduced from Seq.84 to Seq.86 along the path.

The normalized responses of the tested molecules are shown in Figure 4F. To better characterize the switch, we investigate the responses of the four designed sequences following Seq. 61 in the path (crossing the specificity threshold in Figure 4E) to the three peptides (C1, C2, C3). No response to C2 peptide is found until the complete mutations of the three key residues above in Seq. 87, showing the largest C2 binding response, coinciding with the maximum of the Class II/III score. Previous rational design-based experiments for switching the specificity of Class I YAP (with peptide PPPPXP) WW to Class II/III FE65 (with peptide PPPPPP) have already pointed out that the two substitutions L19W and H21G were sufficient to perform specificity switching in the YAP1 background having Q24. As already mentioned, YAP1 is not responsive in our assay. Moreover, given the large fraction (6/15) of non responding wild-type sequences of class II, the response to the C2 peptide seems harder to detect. Lastly, we do not exclude that other non-tested ligands could be able to bind to such sequences, eventually in a less specific and more cross-reactive way.

We next experimentally analyze a set of ‘scrambled’ sequences (Suppl. Table 7). These scrambled sequences have RBM specific and global scores smaller than the designed ones, and none of them is functional, see Figure 4F inset.

We report experimental results obtained for a second path switching specificity from classes I to II/III in Suppl. Table 6, see Suppl. Figure for a low-dimensional visualization of the path. All tested sequences but one are responsive and bind peptides of Classes I or as predicted by the class-specific scores.

3.5 Comparison of model scores and binding measurements

The analysis of the mutational paths above has shown the ability of our computational model to generate new sequences behaving as WW domains with desired binding specificities. To further quantify this capability, we study the correlations between the global RBM scores and the experimentally assessed binding responses in Figure 5. The scores of the tested sequences, either natural or designed, have large values, biased towards the right tail of the histogram of natural WW sequences (Figure 5A). In particular, most designed sequences are, according to the global RBM model, as good as, or better than the majority of natural sequences. This result was expected as we set the effective temperature to the value 1/3 (lower than the standard value 1) to design sequences with high scores [34].

Model scores and experimental responses.
A. Histogram of global RBM scores of probed sequences (Natural, Designed, and Shuffled). The histogram of global RBM scores of the full sequence data is also shown for comparison. **B, C, D**. Comparison between RBM local scores and binding responses, for Class I (B), II/III (C), and IV (D) ligands. Sequences in each panel are colored according to the predicted classes, i.e. to the quadrants in the 2D projection plane defined by the specificity-related latent units. E. Pearson correlation coefficient between experimental responses to the three ligands (I, II, IV), and the RBM scores (specific and global).

Sequences along the paths obtained through reshuffling of the mutations in Figures 4 and 3 have, on the contrary, smaller global scores, in agreement with the experimentally observed lack of functionality. The scrambling of mutations, irrespectively of the amino-acid background, is expected to be detrimental in the presence of epistatic interactions.

Histograms of scores for the natural sequences associated to specific classes are given in Suppl. Figure 8 and confirm the RBM scores of Class I natural sequences have typically larger values than the ones of Classes II/III and IV, which could be indicative of stronger structural stability.

In Figure 5B,C,D, we compare, for each class, the scores assigned by the specific RBM models to the experimental binding measures of the corresponding class ligand. We observe positive correlations, see Figure 5E for the three models, and less so for the global score. Class I is an exception, as the global score also shows a strong correlation; this may be due to the fact that class I WW have the larger scores as shown in Suppl. Figure 8 and several contacts are kept by conserved amino acids in Class I domains [48], see logo representations and contact map for Class I in Suppl. Figure 10.

3.6 Comparison with Ancestral Sequence Reconstruction on WW domains

Lastly, we compare the mutational paths designed with our method between wild-type domains with the pathways linking these domains on the phylogenetic tree derived from the WW domains present in the PFAM seed. To do so, we reconstruct the tree connecting all these domains and perform Ancestral Sequence Reconstruction (ASR) on the intermediate nodes. This tree topology is inferred directly from the WW domains only [52] and not from the proteins they belong to. Multiple pathways connect the domain subtypes on the tree (see supplementary figure 19). As illustrated in figure 6A and with the collapsed tree topology in figure 6A, we observe that all type I and IV domains are connected through a narrow path, which crosses the empty quadrant in the 2D-latent space defined by the two specificity-related RBM hidden units. This result is similar to what is obtained with our RBM-based method.

Pathways derived from ancestral sequence reconstruction (ASR) on the WW domains in PFAM’s seed.
A. 2D projections of selected paths linking domains of different types (I, II/III and IV) on the two specificity-related RBM hidden units, and simplified representation of the seed tree topology. Black dots: modern sequences; Purple dots: average ancestral sequences sampled from the posterior distribution. B. 2D projections of paths linking WT domains presented in Figures 3 and 4 along the tree. Black: mutational paths designed with the RBM; Purple: paths obtained with ASR; Dotted purple: ASR path obtained when adding to the WW tree the RBM intermediates. Blue and purple paths are largely overlapped with one another. C., D. Scores of ancestors obtained with ASR (solid lines) and with the global (top) and specific (middle panels) RBM models. Scores obtained along RBM-sampled paths are shown as dashed lines for comparison. C. Path I to II/III. D. Path I to IV. E., F. Identity between intermediate sequences along the RBM path and the closest ancestors reconstructed by ASR; median and maximum identity between seed WW domains are shown as dashed red line for reference. E. Path I to II. F. Path I to IV.

While ASR paths have in median a large number = 113 of mutations (Supplementary Figure 20, middle bottom panel), the RBM paths were constrained to be much shorter, with 14 to 25 mutations, a length shared by few ASR paths only, see Supplementary Figure 13A, B, C. This observation could explain why RBM-designed paths cross a region devoid of contemporary sequences. Remarkably, while ASR paths are not constrained in length, the pathways connecting type I and type IV domains also follow the same narrow path. Notice that while on the phylogenetic tree inferred from WW domain sequences only, the topology directly connects domains of type I with those of type IV, different results are found with trees obtained from families of homologous proteins containing WW domains as a subpart of their sequences. In the latter case, paths going from I to IV pass through region II/III (Supplementary Figure 21) In particular, the phylogeny of the protein NP001351424 (peptidyl-prolyl cis-trans isomerase from Mus musculus), which contains the WT class IV domain used as anchor for path sampling, shows intermediate ancestors between class I and II/III homologs belonging to the class II/III (Supplementary Figure 21D).

The funnel connecting directly WW domains of classes I and IV is therefore an interesting specific feature of WW-domain reconstructed trees, not locked to the evolution of the rest of the protein, and possibly better reflecting the selective pressure associated with binding specificity. The differences between the topologies of the trees found from the WW domains and from the proteins containing them are compatible with previous observations [52]; in particular, domains with different specificity types in the same protein may be less related to domains of the same type on different proteins.

Figures 6B shows that the ASR sequences along the pathways (in purple), connecting the wild-type sequences anchoring the RBM paths are strikingly similar with the RBM intermediates. Their residue identity exceed 90%, as evidenced in Figure 6E and F. Conversely, the median identity between randomly selected modern domains in the seed is 33%. However, the scores of the ancestral sequences along the phylogenetic pathways assigned by the RBM are significantly lower than the ones of the the RBM-designed sequences (Figure 6C and D). This result is expected as ASR reconstruction does not take into account epistasis, differently from RBM, and we expect ASR sequences to generally be of lesser quality.

Incorporating the RBM-designed sequences into the phylogenetic tree further corroborates their similarity with the ancestral sequences. These RBM sequences collapse on the trajectory connecting the wild-type sequences (Figure 6B, with the dotted purple ASR paths overlapped on the blue synthetic paths).

4 Discussion

In this work, we propose and test a computational method to design mutational paths allowing for specificity switching [18, 53]. We focus on WW domains, a small binding unit of many proteins, whose variants specifically bind to different classes of proline-rich peptides. Our approach relies on a model of the fitness landscape of WW domains learned by Restricted Boltzmann Machine (RBM) from homologous WW sequence data and on Monte Carlo path sampling [30, 44] of functional paths [26] that connect extant WW domains across sequence space. Using a high-throughput in-vitro approach, we test the binding response of the designed sequences toward three peptides representative of the three main WW specificity classes. Our study provides experimental validation of multiple designed paths both within the same WW specificity class (binding the same peptide) and across different specificity classes, capable of switching specificity through accumulation of mutations. These experiments, in turn, give insights on the main requirements for the change of specificity in a WW protein domain. We finally compare the RBM-designed paths with pathways obtained by ancestral sequence reconstruction (ASR) on a phylogenetic tree built on the WW domains of the PFAM seed.

Navigability in the fitness landscape was first studied in evolutionary biology using theoretical models [18, 54, 55, 56]. In recent years, the availability of massive experimental data began to allow for partial fitness landscape reconstructions. The approach introduced here for the path design is, in a way, a synthesis between these two approaches. Unlike theoretical frameworks such as the ‘house of cards’ [57] or the NK [54] models, our fitness landscape is directly inferred from sequence data and is therefore, at some level of approximation, representative of the complexity of a real protein domain [43, 30, 58]. Moreover, our approach borrows to statistical physics some conceptual and methodological ingredients, such as Monte-Carlo sampling of paths [54]; in addition, the bipartite nature of the RBM networkmakes it possible to use mean-field approximations to theoretically characterize the topology and accessibility of the fitness landscape [54, 23], e.g. the mean trajectory of paths and their entropy [44, 59]. Computational approaches enormously reduce the huge number of a priori paths to be experimentally tested, and could speed up the rational design of protein selectivity, attempted since the 1990s in the field of drug discovery, in which few mutations targeting key specificity residues were probed, often ignoring the presence of complex epistatic effects between residues [14, 60, 61, 62, 63].

Overall the designed paths are rich in functional sequences: only 4 out of 58 tested sequences were non responsive to any of the three tested ligands, while all the other sequences responded to at least one of the peptide representative of the specificity class corresponding to the natural WW domain at the extremities of the path. This result proves the general validity of combining generative models for the fitness landscapes inferred from sequence data and Monte-Carlo path sampling to propose viable paths. It extends the recent success in protein design with computational methods [64] to the design of full mutational paths. Our RBM model is capable of capturing important epistatic effects constraining the order in which mutations can be introduced along the path. As shown by scrambling the mutational order along the path, sequences in which mutations take place independently from the background content are not functional, as they do not take into account epistasis. We stress that the modeling of epistatic interactions with the RBM is global over all WW domains, and not attached to any specificity class. It is not possible to learn accurate landscapes for each single specificity as only ∼50 out of 17,000 available WW sequences in databases have been experimentally characterized in terms of binding specificity [8].

Being shallow, the RBM parameters of the model are interpretable, and unveil both key specificity determining residues, similarly to methods based on principal component analysis ([65, 66]) and epistatic interactions as in the Direct Coupling Analysis [67, 68]. Moreover, the tested specificities along the paths are in good agreement with the approximate class-specific RBM scores we have introduced; specificity switches take place through changes in determining residues and amino acids identified by the RBM weights attached to the two hidden units identified as most informative about specificity classes [43].

Our designed paths allow for the presence of “transient” substitutions with amino acids different from the ones in the anchoring wild-type WW sequences [69]. In all the paths we tested, we observe such mutations to transient amino acids on variable residues away from the main specificity determinant. These reversed mutations are important: they transiently improve the global RBM score in the initial part of the path, starting from a class I wild-type, and possibly increase the stability of the molecules with respect to the wild-type, while preparing the background for the specificity change. However, class switching typically involves mutating the specificity determining residues to the amino acids found in the final wild-type sequence. The above findings are in full agreement with the stability-specificity tradeoff previously introduced in evolutionary path characterization [56, 70, 71] and in the specific characterization of Pin WW variants [72].

Mutational paths connecting specificity Classes I and IV are particularly interesting, because they cross a region deprived of natural sequences in the low-dimensional projection defined by the two RBM specificity hidden units, see Figure 3E [30]. Synthetic WW domains along the I-IV paths go through a bottleneck in the space of sequences: several designed and tested paths go through the same intermediates, and are characterized by cross-specificity to peptides of Classes I, II and IV. Ancestral promiscuity and switch of ligands have been proposed for several enzymatic families based on ASR [17, 73, 29]. Promiscuous WW domains could be related to ancestral states, which could have specialized during evolution in favor of more stringent ligand selectivity. In Humans, WW domains exist also in tandem and multi-domains, and gene duplication is a plausible mechanism for such a specialization [49]. Reconstruction of ancestral sequences (ASR) from domains contained in PFAM’s seed supports this hypothesis, unveiling a similar narrow pathways connecting type I and IV domains, with ancestral nodes showing a large residue identity with the RBM sequences along the path. However, phylogenetic reconstructions on several families of homologous proteins containing WW domains rather show transitions between classes I and II/III and classes II/III and IV. It is therefore difficult to assess the evolutionary significance of the putative ancestral sequences along the I-IV path found in the WW-domain tree.

ASR and RBM-path sampling are complementary tools to explore hypothetical paths between distantly related modern sequences, both based on hypothesizing evolution as a parsimonious path in a fitness landscape. This analogy between the fitness landscape inferred from sequence data and the phylogenetic paths has already been used in protein engineering as a method to infer novel proteins with enhanced properties such as affinity, stability, and solubility [74, 75] and to predict mutational effects in [76], where the distance between sequences on a Neighbor-Joining tree serves as a proxy to assess the likelihood and, consequently, the fitness of novel mutations in a protein.

However ASR offers much lower resolution in the sequence space compared to the RBM, as the computational complexity in reconstructing a tree makes impossible to use all sequences available in the databases; it does not provide a true quantification of the likelihood of the intermediate sequences and does not model epistatic interactions between residues. The similarity of ASR and RBM paths proves that the sampling of paths in the fitness landscape with the RBM could help to explore likely intermediate sequences between groups of distantly related proteins, a task for which classical ASR is not well suited;which could prove useful both for protein engineering and phylogenetic studies.

Our work could be extended in several directions. First, we plan to extend our approach to sample paths attached not at their two extremities, but anchored to one wild-type WW domain only. Imposing that the affinity to a ligand different from the initial one increases along the path would make the setting more closely matches a natural evolutionary path.

Second, the in-vitro assay could be complemented in several ways. Our current assay does not allow for cooperativity between WW domains, which may be important in vivo. This could explain why some natural WW domains tested alone, such as YAP, do not display affinity to their cognate ligands. It has been previously shown that multi tandem WW domain confer higher ligand affinity [49]. In addition, our current experimental setup is limited to a small number of peptides. It is possible that the four non-responsive sequences we have sampled on the path connecting Class I and Class II/III wild-type WW domain may bind to other non-tested peptides representative of the same classes, possibly in a less specific and more cross-reactive way. To better investigate the above points we plan to extend the experimental binding measures to in vivo binding assays, and to measure binding against a library of peptides through competitive growth in yeast cells [9, 77].

Lastly, while we have here used experiments as a way to validate the paths proposed by the computational method, it would be interesting to iteratively improve our estimate of the fitness landscape inference by integrating the new experimentally labeled data using active learning methods [78, 79, 80, 64].

5 Materials and Methods

5.1 Restricted Boltzmann Machines

We train our unsupervised model on a data-set of homologous proteins, presented as a list of aligned sequences {v¹, v²,..., v^k,..., v^B} (B is the total number of sequences in the data set), where each entry is an array in which each one of the N entries may be in one of the possible 21 states (20 amino acids + the gap site). As a model, we use Restricted Boltzmann Machines [81], a neural network consisting of a visible layer v representing the data and an hidden layer of M neurons . RBMs define a joint probability distribution over the visible and hidden layer as

where Z_RBM is the normalisation constant, the input to an hidden unit is the projection of the sequence on the weigth matrix w_i,µ: I_µ(v) = ∑_i w_i,µ(v_i) and 𝒰_µ are potential energy functions over the real-valued hidden unit activations h_µ. We choose 𝒰_µ to be double Rectified Linear Unit (dReLU) potentials of the form

where we have defined the hyper-parameters γ_µ,±, θ_µ,±. Marginalising over the hidden units we obtain the probability distribution over the sequence space

Where . We use Persistent Contrastive Divergence [82] to train the model over the multi-sequence alignment of the protein family of interest in order to maximise the likelihood. This training algorithm has been shown to be sufficiently robust under cautious regularization [83]. The code and the data used to train our RBMs for WW domains can be found in [84], while the hyperparameters for the training can be found in the Supplementary Materials of [30].

5.1.1 Global and specific RBM models

We learned two classes of RBM models, depending on the train data. A global RBM model used to design the paths was learned on all the sequence data, while three class-specific RBM models, for binding classes of type I, II/III, or IV were learned on sub-set of sequences (containing 8304, 8292, 637 sequences respectively). To define the sub-sets we clusterized the sequences according to the value of their inputs I_µ(v) on the two hidden units µ = 1, µ = 2 chosen as the most informative about the specificity classes, shown in Figs. 2, 4, 3. The specificity threshold indicated in the figure (I₂ < —1; I₂ > —1,I₁ < —3; I₁ > —3) have been informed in [30] by some previous experimental tests [9, 8] (see Figure 3 in [43] and Figure 2a in [30]) on the WW specificity classes. This results in three class-specific RBM models (one for class I, one for class II/III and one for class IV) which can be used as predictor of the binding affinity of a natural or artificial sequence.

5.2 Path sampling algorithm

We define the probability a path of T steps, 𝒱 = {v₁, v₂..., v_{T −1}}, connecting two sequences v_start and v_end, as follows:

Here π is an interaction terms that is used to minimised the number of mutations at each step. Our goal is to sample paths that do at most one mutation at each time step. In practice we set π(v, v^′) = 1 if the two sequences are equal, π(v, v^′) = e^−∧ (with ∧ > 0) if the two sequences differ by one mutation and π(v, v^′) = 0 otherwise. In [30] we propose a Monte-Carlo (coupled with simulated annealing) procedure to sample paths from 𝒫 (𝒱)^β at a given inverse temperature β = 1/T. Other choices of the transition term π are discussed in the following subsection.

Here we stress that our path sampling approach is independent of the restricted Boltzmann machine used to train the model and one can use in principle different probability distribution P (v). An alternative choice would be to define our path sampling algorithm over the space of nucleotide sequences that encode for the specific protein. Each sequence would be scored according to the RBM model after translation into the corresponding amino acids sequence according to genetic code. In this way one can introduce codon biases in the sampling approach. We applied this approach to the case of WW domain and we obtained similar results to the sampling procedure carried out without codon bias.

5.3 Ancestral protein reconstruction on seed domains

Phylogenetic tree inference and Ancestral Sequence Reconstruction (ASR) were conducted to obtain an unrooted tree topology and ancestral node sequences for all 408 isolated WW domain in PF00397. We used IQ-Tree [85] and an evolutionary model defined by the amino-acid mutation matrix LG [86] and Gamma parameters discretized in four classes [87] (LG+G4, identified as the best model to describe domain evolution, see Supplementary Material Section 8.5.2). Gaps were reconstructed from the sequence alignment converted into a binary presence/absence matrix from binary sequences, using the GTR2 model [88]. We calculated inputs to the hidden units of the RBM for modern WW domains, as well as their mean values for ancestral sequences based on the posterior probabilities of ancestral residues. Scripts to perform the ASR are available on the project’s Github.

We also performed phylogenetic reconstruction from the 24 seed domains within or near cluster associated with type IV domains, as well as the NP00131424 protein used as a type IV anchor for the RBM path sampling, as described in Supplementary Material Section 8.5.2.

5.4 Experimental Materials

Cell free expression system (PURExpress ^® In Vitro Protein Synthesis Kit, New England Biolabs, E6800), SNAP labeling Dye (SNAP-Surface^® Alexa Fluor^® 647, New England Biolabs, S9136), Anti-SNAP antibody (SNAP-tag Polyclonal Antibody, Thermo Fisher Scientific, CAB4255), Streptavidin Magnetic beads (Dynabeads™ M-280 Streptavidin, Thermo Fisher Scientific, 11205D), Protein A magnetic beads (Pierce™ Protein A Magnetic Beads, Thermo Fisher Scientific, 88845), oligonucleotide linkers and eBlocks™ Gene Fragment provider (Integrated DNA Technologies), synthetic peptides provider (GenScript Biotech), Gibson assembly (Gibson Assembly^® Master Mix, New England Biolabs, E2611), Polymerase kit (Q5^® High-Fidelity DNA Polymerase, New England Biolabs, M0491) DNA purification beads kit (SPRIselect^®, Beckman Coulter, B23318) and phosphate-buffered saline buffer (DPBS 10X, Thermo Fisher Scientific, 14200075), DTT (DL-Dithiothreitol solution, Sigma-Aldrich, E2611), Tris-HCl (UltraPure™ 1 M Tris-HCI Buffer, pH 7.5, Invitrogen, 15567-027), Tween-20 (TWEEN^® 20, Sigma-Aldrich, P2287), EDTA (0.5M EDTA pH 8.0, Invitrogen, AM9260G), Binding buffer (Tris-HCl 50 mM pH 7.5, EDTA 10 mM, DTT1 mM and 0.5 % Tween 20, prepared in-house), Wash buffer (1X DPBS with 0.5% Tween-20).

5.5 Experimental protocol

Our experimental setup is sketched in Figure 1D. Double-stranded DNA gene fragments encoding the WW domain sequences were synthesized and inserted by Gibson assembly into a plasmid containing a T7 promoter and a ribosome binding site (RBS) upstream of the cloning site, these elements being essential for initiation of the expression process. A sequence encoding a linker and a SNAP-tag has been fused downstream of the WW domain sequence and a stop codon has been placed at the end of the SNAPtag sequence followed by a T7 terminator sequence. The SNAPtag is a protein capable of reacting irreversibly with the benzylguanine (BG) group. This can be used to label the protein fluorescently, using dyes coupled to BG.

After insertion of the WW domain genes, a polymerase chain reaction (PCR) was conducted to amplify the entire construct, producing a linear product encompassing the complete sequence between the promoter and the terminator. The resulting PCR product was then utilized for expression in a cell-free transcription and translation system, with the fluorescent substrate.

Concurrently with the expression of the WW protein, magnetic beads coated with the target peptides, which are presumed to bind with the WW domains, are prepared. These comprise a metallic bead displaying a multitude of copies of the peptide of interest on its surface, as well as a fluorescent dye for the identification of the beads. At the same time the protein A-coated magnetic beads are incubated with a polyclonal anti-SNAPtag antibody for the purpose of assessing the efficacy of the expression process.

Following the expression of the fusion proteins and labelling with the fluorescent dye, a mixture of color-coded beads coated with different peptides and anti-SNAPtag antibody was added to the expression mixture. Upon binding of the WW domain to the peptide, the fluorescent dye attached to the SNAP tag will report the amount of protein bound to the bead. The beads were then washed and passed through a flow cytometer, where the average dye fluorescence intensity on the bead surfaces was measured. This indicated the amount of WW domain fusion proteins bound to each peptide-coated bead and the expression assessment beads, which was then normalized for expression. Each WW domain protein was evaluated against three peptides, each representing a distinct class of WW domain. Further details are available in the supplementary Material 5.5.

Fusion Protein assembly, expression and labelling

eBlocks™ gene fragments for genes comprised in the different pathways were designed, ordered, and cloned into the SNAP fusion plasmid using Gibson Assembly^® Master Mix. Using Q5^® High-Fidelity DNA Polymerase, PCR reactions were performed to amplify linear products containing all the necessary parts for the expression of the fusion proteins. PCR products were purified using SPRIselect^® beads. Expression of the fusion proteins were performed in a PURExpress^® in vitro protein synthesis reactions with the purified linear PCR products at 15 nM and SNAP-Surface^® Alexa Fluor^® 647 at 5 µM final concentration. The cell-free in-vitro transcription and translation (IVTT) reactions were incubated for 5 minutes at 37^○C followed by 4 hours of incubation at 30^○C.

Expression control beads preparation

Expression control beads were prepared using 5 µl of protein A magnetic beads (50 ug) from Thermo Fisher Scientific, washed twice with 1X DPBS from Thermo Fisher Scientific with 0.5% Tween-20 from Sigma-Aldrich, resuspended in 50 µl of 100 ng/µl anti-SNAP polyclonal antibody diluted in 1X DPBS from Thermo Fisher Scientific with 0.5% Tween-20 from Sigma-Aldrich, incubated for 30 min at 37^○C with shaking, then washed three times with binding buffer.

Streptavidin beads preparation with target peptides

beads were prepared using 5 µl of strepta-vidin magnetic beads from Thermo Fisher Scientific, washed twice with 1X DPBS from Thermo Fisher Scientific with 0.5% Tween-20 from Sigma-Aldrich, then resuspended in 50 µl of 1uM DNA linkers from Integrated DNA Technologies diluted in 1X DPBS from Thermo Fisher Scientific with 0.5% Tween-20 from Sigma-Aldrich. The beads were incubated with the DNA linkers for 30 minutes at 37^○C with shaking at 600 RPM in ThermoMixer^® C from Eppendorf, then washed twice with wash buffer and resuspended in 20 ng/µl of target azide peptide from GenScript Biotech diluted in wash buffer, then incubated overnight at room temperature with rotation. After washing the beads twice with binding buffer, the beads were resuspended in 50 µl of binding buffer, after which beads with different targets and identification fluorescent dyes were mixed.

Binding

10 µl of labeled proteins were mixed with 4 µl of beads mixture then incubated for 1 hour at 25^○C with shaking at 600 RPM in ThermoMixer^® C from Eppendorf. After binding, the beads were washed twice with wash buffer and resuspended in 200 µl of wash buffer. Finally, the fluorescence intensity on different bead species was measured Guava^® easyCyte flow cytometer (Millipore).

Experimental Design

Each WW domain was tested against three peptides, each one corresponding to a different class of WW domains. The three peptides are given in Table 1. IVTT No expression control and no binding controls (peptides with different sequences; IAKLDEESILKQ and PALWKSKGAD) was performed for each experiment to confirm the specificity of WW domains binding to the target peptides. All binding experiments, including target and control peptide measurements, were conducted in triplicate to ensure reproducibility and statistical robustness.

Data availability

All generated and tested sequences are provided in the Supplementary Tables 2-7. All codes will be accessibles in https://github.com/orgs/CoccoMonassonLab/ and https://github.com/cossio/TransitionP.

Acknowledgements

This work was supported by the grants ANR-19 Decrypted CE300021-01, ANR-24 CE15-Methaflu and, ANR-23 CE45-0034 ProDiGen.

Additional files

Supplementary material.

Additional information

Funding

Agence Nationale de la Recherche (ANR) (Decrypted CE300021-01)

Simona Cocco

Agence Nationale de la Recherche (ANR) (ANR-23 CE45-0034 ProDiGen)

Simona Cocco

Agence Nationale de la Recherche (ANR) (ANR-24 CE15- Methaflu)

Simona Cocco

References

[1]
1. Murzin Alexey G
1998How far divergent evolution goes in proteinsCurrent opinion in structural biology 8:380–387https://doi.org/10.1016/s0959-440x(98)80073-0 PubMed Google Scholar
[2]
1. Fanelli Alessandro Rossi
2. Antonini Eraldo
3. Caputo Antonio
1964Hemoglobin and myoglobinAdvances in protein chemistry 19:73–222https://doi.org/10.1016/s0065-3233(08)60189-8 PubMed Google Scholar
[3]
1. Sudol Marius
1996Structure and function of the ww domainProgress in biophysics and molecular biology 65:113–132https://doi.org/10.1016/s0079-6107(96)00008-9 PubMed Google Scholar
[4]
1. Salah Z
2. Aqeilan RI
2011WW domain interactions regulate the hippo tumor suppressor pathwayCell Death & Disease 2:e172–e172https://doi.org/10.1038/cddis.2011.53 PubMed Google Scholar
[5]
1. Sudol Marius
2. Sliwa Krzysztof
3. Russo Tommaso
2001Functions of WW domains in the nucleusFEBS Letters 490:190–195https://doi.org/10.1016/s0014-5793(01)02122-6 PubMed Google Scholar
[6]
1. Sudol Marius
2. Hunter Tony
2002NeW wrinkles for an old domainCell 103:1001–1004Google Scholar
[7]
1. Zarrinpar Ali
2. Lim Wendell A
2000Converging on proline: the mechanism of ww domain peptide recognitionNature structural biology 7:611–613https://doi.org/10.1038/77891 PubMed Google Scholar
[8]
1. Otte Livia
2. Wiedemann Urs
3. Schlegel Brigitte
4. Pires Jose Ricardo
5. Beyermann Michael
6. Schmieder Peter
7. Krause Gerd
8. Volkmer-Engert Rudolf
9. Schneider-Mergener Jens
10. Oschkinat Hartmut
2003Ww domain sequence activity relationships identified using ligand recognition propensities of 42 ww domainsProtein Science 12:491–500https://doi.org/10.1110/ps.0233203 PubMed Google Scholar
[9]
1. Russ William P
2. Lowery Drew M
3. Mishra Prashant
4. Yaffe Michael B
5. Ranganathan Rama
2005Natural-like function in artificial ww domainsNature 437:579–583https://doi.org/10.1038/nature03990 PubMed Google Scholar
[10]
1. Shibata Masahiro
2. Ham Kendall
3. Hoque Mohammad Obaidul
2018A time for yap1: Tumorigenesis, immunosuppression and targeted therapyInternational journal of cancer 143:2133–2144https://doi.org/10.1002/ijc.31561 PubMed Google Scholar
[11]
1. Piccolo Stefano
2. Dupont Sirio
3. Cordenonsi Michelangelo
2014The biology of YAP/TAZ: Hippo signaling and beyondPhysiological Reviews 94:1287–1312https://doi.org/10.1152/physrev.00005.2014 PubMed Google Scholar
[12]
1. Butterfield D. Allan
2. Abdul Hafiz Mohmmad
3. Opii Wycliffe
4. Newman Shelley F.
5. Joshi Gururaj
6. Ansari Mubeen Ahmad
7. Sultana Rukhsana
2006Review: Pin1 in alzheimer’s diseaseJournal of Neurochemistry 98:1697–1706https://doi.org/10.1111/j.1471-4159.2006.03995.x PubMed Google Scholar
[13]
1. Fotia Andrew B.
2. Dinudom Anuwat
3. Shearwin Keith E.
4. Koch Jan-Peter
5. Korbmacher Christoph
6. Cook David I.
7. Kumar Sharad
2003The role of individual nedd4-2 (KIAA0439) WW domains in binding and regulating epithelial sodium channelsThe FASEB Journal 17:70–72https://doi.org/10.1096/fj.02-0497fje PubMed Google Scholar
[14]
1. Espanel Xavier
2. Sudol Marius
1999A single point mutation in a group i ww domain shifts its specificity to that of group ii ww domainsJournal of Biological Chemistry 274:17284–17289https://doi.org/10.1074/jbc.274.24.17284 PubMed Google Scholar
[15]
1. Macias Maria J.
2. Wiesner Silke
3. Sudol Marius
2002WW and SH3 domains, two different scaffolds to recognize proline-rich ligandsFEBS Letters 513:30–37https://doi.org/10.1016/s0014-5793(01)03290-2 PubMed Google Scholar
[16]
1. Socolich Michael
2. Lockless Steve W
3. Russ William P
4. Lee Heather
5. Gardner Kevin H
6. Ranganathan Rama
2005Evolutionary information for specifying a protein foldNature 437:512–518https://doi.org/10.1038/nature03991 PubMed Google Scholar
[17]
1. Khersonsky Olga
2. Tawfik Dan Salah
2010Enzyme promiscuity: A mechanistic and evolutionary perspectiveAnnual Review of Biochemistry 79:471–505https://doi.org/10.1146/annurev-biochem-030409-143718 PubMed Google Scholar
[18]
1. Smith John Maynard
1970Natural selection and the concept of a protein spaceNature 225:563–564https://doi.org/10.1038/225563a0 PubMed Google Scholar
[19]
1. Pauling Linus
2. Zuckerkandl Emile
3. Henriksen Thormod
4. Lövstad Rolf
1963Chemical Paleogenetics. Molecular “Restoration Studies” of Extinct Forms of LifeActa Chemica Scandinavica 17:9–16https://doi.org/10.3891/acta.chem.scand.17s-0009 Google Scholar
[20]
1. Wright Sewall
1931Evolution in mendelian populationsGenetics 16:97https://doi.org/10.1093/genetics/16.2.97 PubMed Google Scholar
[21]
1. Poelwijk Frank J.
2. Socolich Michael
3. Ranganathan Rama
2019Learning the pattern of epistasis linking genotype and phenotype in a protein -Nature CommunicationsNat. Commun 10:1–11https://doi.org/10.1038/s41467-019-12130-8 PubMed Google Scholar
[22]
1. De Visser J. Arjan G.M.
2. Krug Joachim
2014Empirical fitness landscapes and the predictability of evolutionNature Reviews Genetics 15:480–490https://doi.org/10.1038/nrg3744 PubMed Google Scholar
[23]
1. Franke Jasper
2. Klozer Alexander
3. De Visser J. Arjan G. M.
4. Krug Joachim
2011Evolutionary accessibility of mutational pathwaysPLoS Computational Biology 7:e1002134https://doi.org/10.1371/journal.pcbi.1002134 PubMed Google Scholar
[24]
1. Moulana Alief
2. Dupic Thomas
3. Phillips Angela M
4. Chang Jeffrey
5. Nieves Serafina
6. Roffler Anne A
7. Greaney Allison J
8. Starr Tyler N
9. Bloom Jesse D
10. Desai Michael M
2022Compensatory epistasis maintains ace2 affinity in sars-cov-2 omicron ba. 1Nature Communications 13:7011https://doi.org/10.1038/s41467-022-34506-z PubMed Google Scholar
[25]
1. Weinreich Daniel M.
2. Delaney Nigel F.
3. DePristo Mark A.
4. Hartl Daniel L.
2006Darwinian evolution can follow only very few mutational paths to fitter proteinsScience 312:111–114https://doi.org/10.1126/science.1123539 PubMed Google Scholar
[26]
1. Wu Nicholas C
2. Dai Lei
3. Olson Anders
4. Lloyd-Smith James O
5. Sun Ren
2016Adaptation in protein fitness landscapes is facilitated by indirect pathseLife 5:e16965https://doi.org/10.7554/eLife.16965 PubMed Google Scholar
[27]
1. Aguilar-Rodriguez Jose
2. Payne Joshua L.
3. Wagner Andreas
2017A thousand empirical adaptive landscapes and their navigabilityNature Ecology & Evolution 1:0045https://doi.org/10.1038/s41559-016-0045 PubMed Google Scholar
[28]
1. Nocedal Isabel
2. Laub Michael T
2022Ancestral reconstruction of duplicated signaling proteins reveals the evolution of signaling specificityeLife 11:e77346https://doi.org/10.7554/eLife.77346 PubMed Google Scholar
[29]
1. Brochier-Armanet Celine
2. Madern Dominique
2021Phylogenetics and biochemistry elucidate the evolutionary link between l-malate and l-lactate dehydrogenases and disclose an intermediate group of sequences with mix functional propertiesBiochimie 191:140–153https://doi.org/10.1016/j.biochi.2021.08.004 PubMed Google Scholar
[30]
1. Mauri Eugenio
2. Cocco Simona
3. Monasson Riemi
2023Mutational paths with sequence-based models of proteins: From sampling to mean-field characterizationPhys. Rev. Lett 130:158402https://doi.org/10.1103/physrevlett.130.158402 PubMed Google Scholar
[31]
1. Chu AE
2. Huang PS
2024Sparks of function by de novo protein designNature Biotechnology 42:203–215https://doi.org/10.1038/s41587-024-02133-2 PubMed Google Scholar
[32]
1. Zambaldi Vinicius
2. La David
3. Chu Alexander E.
4. Patani Harshnira
5. Danson Amy E.
6. Kwan Tristan O. C.
7. Frerix Thomas
8. Schneider Rosalia G.
9. Saxton David
10. Thillaisundaram Ashok
11. Wu Zachary
12. Moraes Isabel
13. Lange Oskar
14. Papa Eliseo
15. Stanton Gabriella
16. Martin Victor
17. Singh Sukhdeep
18. Wong Lai H.
19. Bates Russ
20. Kohl Simon A.
21. Abramson Josh
22. Senior Andrew W.
23. Alguel Yilmaz
24. Wu Mary Y.
25. Aspalter Irene M.
26. Bentley Katie
27. Bauer David L. V.
28. Cherepanov Peter
29. Hassabis Demis
30. Kohli Pushmeet
31. Fergus Rob
32. Wang Jue
2024De novo design of high-affinity protein binders with alphaproteoarXiv https://doi.org/10.48550/arXiv.2409.08022 Google Scholar
[33]
1. Watson JL
2. Bennett NR
3. et al.
2023De novo design of protein structure and function with rfdiffusionNature 620:1089–1100https://doi.org/10.1038/s41586-023-06415-8 PubMed Google Scholar
[34]
1. Russ William P.
2. Figliuzzi Matteo
3. Stocker Christian
4. Barrat-Charlaix Pierre
5. Socolich Michael
6. Kast Peter
7. Hilvert Donald
8. Monasson Remi
9. Cocco Simona
10. Weigt Martin
11. Ranganathan Rama
2020An evolution-based model for designing chorismate mutase enzymesScience 369:440–5https://doi.org/10.1126/science.aba3304 PubMed Google Scholar
[35]
1. Hawkins-Hooker Alex
2. Depardieu Florence
3. Baur Sebastien
4. Couairon Guillaume
5. Chen Arthur
6. Bikard David
2021Generating functional protein variants with variational autoencodersPLOS Computational Biology 17:1–23https://doi.org/10.1371/journal.pcbi.1008736 PubMed Google Scholar
[36]
1. Malbranke Cyril
2. Bikard David
3. Cocco Simona
4. Monasson Remi
5. Tubiana Jerome
2023Machine learning for evolutionary-based and physics-inspired protein design: Current and future synergiesCurrent Opinion in Structural Biology 80:102571https://doi.org/10.1016/j.sbi.2023.102571 PubMed Google Scholar
[37]
1. Tubiana Jerome
2. Cocco Simona
3. Monasson Remi
2019Learning protein constitutive motifs from sequence dataeLife 8:e39397https://doi.org/10.7554/eLife.39397 PubMed Google Scholar
[38]
1. Malbranke Cyril
2. Rostain William
3. Depardieu Florence
4. Cocco Simona
5. Monasson Riemi
6. Bikard David
2023Computational design of novel cas9 pam-interacting domains using evolution-based modelling and structural quality assessmentPLOS Computational Biology 19:e1011621https://doi.org/10.1371/journal.pcbi.1011621 PubMed Google Scholar
[39]
1. Di Gioacchino Andrea
2. Procyk Jonah
3. Molari Marco
4. Schreck John S
5. Zhou Yu
6. Liu Yan
7. Monasson Riemi
8. Cocco Simona
9. Sulc Petr
2022Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selectionPLoS computational biology 18:e1010561https://doi.org/10.1371/journal.pcbi.1010561 PubMed Google Scholar
[40]
1. Fernandez-De-Cossio-Diaz Jorge
2. Hardouin Pierre
3. du Moutier Francois-Xavier Lyonnet
4. Di Gioacchino Andrea
5. Marchand Bertrand
6. Ponty Yann
7. Sargueil Bruno
8. Monasson Remi
9. Cocco Simona
2023Designing molecular rna switches with restricted boltzmann machinesbioRxiv https://doi.org/10.1101/2023.05.10.540155 Google Scholar
[41]
1. Jumper John
2. Evans Richard
3. Pritzel Alexander
4. Green Tim
5. Figurnov Michael
6. Ronneberger Olaf
7. Tunyasuvunakool Kathryn
8. Bates Russ
9. Zidek Augustin
10. Potapenko Anna
11. et al.
2021Highly accurate protein structure prediction with alphafoldNature 596:583–589https://doi.org/10.1038/s41586-021-03819-2 PubMed Google Scholar
[42]
1. Dauparas Justas
2. Anishchenko Ivan
3. Bennett Nathaniel
4. Bai Hua
5. Ragotte Robert J
6. Milles Lukas F
7. Wicky Basile IM
8. Courbet Alexis
9. de Haas Rob J
10. Bethel Neville
11. et al.
2022Robust deep learning-based protein sequence design using proteinmpnnScience 378:49–56https://doi.org/10.1126/science.add2187 PubMed Google Scholar
[43]
1. Tubiana Jerome
2. Cocco Simona
3. Monasson Remi
2019Learning protein constitutive motifs from sequence dataeLife 8:e39397https://doi.org/10.7554/eLife.39397 PubMed Google Scholar
[44]
1. Mauri Eugenio
2. Cocco Simona
3. Monasson Remi
2023Direct vs. global transition paths in potts-like energy landscapesarXiv https://doi.org/10.48550/arXiv.2304.03128 Google Scholar
[45]
1. Keppler Antje
2. Gendreizig Susanne
3. Gronemeyer Thomas
4. Pick Horst
5. Vogel Horst
6. Johnsson Kai
2003A general method for the covalent labeling of fusion proteins with small molecules in vivoNature Biotechnology 21:86–89https://doi.org/10.1038/nbt765 PubMed Google Scholar
[46]
1. Shimizu Yoshihiro
2. Inoue Akio
3. Tomari Yukihide
4. Suzuki Tsutomu
5. Yokogawa Takashi
6. Nishikawa Kazuya
7. Ueda Takuya
2001Cell-free translation reconstituted with purified componentsNature Biotechnology 19:751–755https://doi.org/10.1038/90802 PubMed Google Scholar
[47]
1. Givan Alice Longobardi
2001Flow Cytometry: First PrinciplesWiley Google Scholar
[48]
1. Kasanov Jeremy
2. Pirozzi Gregorio
3. Uveges Albert J
4. Kay Brian K
2001Characterizing class i ww domains defines key specificity determinants and generates mutant domains with novel specificitiesChemistry & biology 8:231–241https://doi.org/10.1016/s1074-5521(01)00005-9 PubMed Google Scholar
[49]
1. Lin Zhijie
2. Yang Zhou
3. Xie Ruiling
4. Ji Zeyang
5. Guan Kunliang
6. Zhang Mingjie
2019Decoding ww domain tandem-mediated target recognitions in tissue growth and cell polarityeLife 8:e49439https://doi.org/10.7554/eLife.49439 PubMed Google Scholar
[50]
1. Miyazawa Sanzo
2. Jernigan Robert L
1985Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximationMacromolecules 18:534–552https://doi.org/10.1021/ma00145a039 Google Scholar
[51]
1. Kato Yusuke
2. Ito Mie
3. Kawai Kunji
4. Nagata Koji
5. Tanokura Masaru
2002Determinants of ligand specificity in groups i and iv ww domains as studied by surface plasmon resonance and model buildingJournal of Biological Chemistry 277:10173–10177https://doi.org/10.1074/jbc.m110490200 PubMed Google Scholar
[52]
1. Sudol Marius
2. Chen Henry I.
3. Bougeret Cecile
4. Einbond Aaron
5. Bork Peer
1995Characterization of a novel protein-binding module — the WW domainFEBS Letters 369:67–71https://doi.org/10.1016/0014-5793(95)00550-s PubMed Google Scholar
[53]
1. de Visser J. Arjan G.M.
2. Krug Joachim
2014Empirical fitness landscapes and the predictability of evolutionNature Reviews Genetics 15:480–490https://doi.org/10.1038/nrg3744 PubMed Google Scholar
[54]
1. Kauffman Stuart
2. Levin Simon
1987Towards a general theory of adaptive walks on rugged landscapesJournal of Theoretical Biology 128:11–45https://doi.org/10.1016/s0022-5193(87)80029-2 PubMed Google Scholar
[55]
1. Kimura Motoo
1983The neutral theory of molecular evolutionCambridge University Press Google Scholar
[56]
1. Das Suman G
2. Direito Susana Ol
3. Waclaw Bartlomiej
4. Allen Rosalind J
5. Krug Joachim
2020Predictable properties of fitness landscapes induced by adaptational tradeoff’seLife 9:e55155https://doi.org/10.7554/eLife.55155 PubMed Google Scholar
[57]
1. Kingman J. F. C.
1978A simple model for the balance between selection and mutationJournal of Applied Probability 15:1–12https://doi.org/10.2307/3213231 Google Scholar
[58]
1. Di Bari Leonardo
2. Bisardi Matteo
3. Cotogno Sabrina
4. Weigt Martin
5. Zamponi Francesco
2024Emergent time scales of epistasis in protein evolutionProceedings of the National Academy of Sciences 121:e2406807121https://doi.org/10.1073/pnas.2406807121 PubMed Google Scholar
[59]
1. Huot Marian
2. Wang Dianzhuo
3. Shakhnovich Eugene
4. Monasson Rémi
5. Cocco Simona
2025Constrained evolutionary funnels shape viral immune escapebioRxiv https://doi.org/10.1101/2025.10.26.684604 Google Scholar
[60]
1. Kurth Torsten
2. Ullmann Dirk
3. Jakubke Hans-Dieter
4. Hedstrom Lizbeth
1997Converting trypsin to chymotrypsin: Structural determinants of s1’ specificityBiochemistry 36:10098https://doi.org/10.1021/bi970937l PubMed Google Scholar
[61]
1. Korendovych Ivan V.
2018Rational and Semirational Protein Design, volume 1685 of Methods in Molecular BiologyNew York, NY: Springer New York pp. 15–23Google Scholar
[62]
1. Venekei Istvan
2. Szilagyi Laszlo
3. Graf Laszlo
4. Rutter William J.
1996Attempts to convert chymotrypsin to trypsinFEBS Letters 379:143–147https://doi.org/10.1016/0014-5793(95)01484-5 PubMed Google Scholar
[63]
1. Yang Kevin K.
2. Wu Zachary
3. Arnold Frances H.
2019Machine-learning-guided directed evolution for protein engineeringNature Methods 16:687–694https://doi.org/10.1038/s41592-019-0496-6 PubMed Google Scholar
[64]
1. Listov Dina
2. Goverde Casper A.
3. Correia Bruno E.
4. Fleishman Sarel Jacob
2024Opportunities and challenges in design and optimization of protein functionNature Reviews Molecular Cell Biology 25:639–653https://doi.org/10.1038/s41580-024-00718-y PubMed Google Scholar
[65]
1. Casari Georg
2. Sander Chris
3. Valencia Alfonso
1995A method to predict functional residues in proteinsNature structural biology 2:171https://doi.org/10.1038/nsb0295-171 PubMed Google Scholar
[66]
1. Halabi Najeeb
2. Rivoire Olivier
3. Leibler Stanislas
4. Ranganathan Rama
2009Protein sectors: evolutionary units of three-dimensional structureCell 138:774–786https://doi.org/10.1016/j.cell.2009.07.038 PubMed Google Scholar
[67]
1. Weigt Martin
2. White Robert A
3. Szurmant Hendrik
4. Hoch James A
5. Hwa Terence
2009Identification of direct residue contacts in protein-protein interaction by message passingProceedings of the National Academy of Sciences 106:67–72https://doi.org/10.1073/pnas.0805923106 PubMed Google Scholar
[68]
1. Cocco Simona
2. Feinauer Christoph
3. Figliuzzi Matteo
4. Monasson Remi
5. Weigt Martin
2018Inverse statistical physics of protein sequences: a key issues reviewReports on Progress in Physics 81:032601https://doi.org/10.1088/1361-6633/aa9965 PubMed Google Scholar
[69]
1. Wu Nicholas C
2. Dai Lei
3. Olson Anders
4. Lloyd-Smith James O
5. Sun Ren
2016Adaptation in protein fitness landscapes is facilitated by indirect pathseLife 5:e16965https://doi.org/10.7554/eLife.16965 PubMed Google Scholar
[70]
1. Tokuriki Nobuhiko
2. Stricher Francois
3. Serrano Luis
4. Tawfik Dan S.
2008How protein stability and new functions trade offPLoS Computational Biology 4:e1000002https://doi.org/10.1371/journal.pcbi.1000002 PubMed Google Scholar
[71]
1. Bloom Jesse D.
2. Labthavikul Sy T.
3. Otey Christopher R.
4. Arnold Frances H.
2006Protein stability promotes evolvabilityProceedings of the National Academy of Sciences 103:5869–5874https://doi.org/10.1073/pnas.0510098103 PubMed Google Scholar
[72]
1. Jager Marcus
2. Zhang Yan
3. Bieschke Jan
4. Nguyen Houbi
5. Dendle Maria
6. Bowman Marianne E
7. Noel Joseph P
8. Gruebele Martin
9. Kelly Jeffery W
2006Structure-function-folding relationship in a ww domainProceedings of the National Academy of Sciences 103:10648–10653https://doi.org/10.1073/pnas.0600511103 PubMed Google Scholar
[73]
1. Wheeler Lucas C
2. Harms Michael J
2021Were Ancestral Proteins Less Specific?Molecular Biology and Evolution 38:2227–2239https://doi.org/10.1093/molbev/msab019 PubMed Google Scholar
[74]
1. Spence Matthew A.
2. Kaczmarski Joe A.
3. Saunders Jake W.
4. Jackson Colin J.
2021Ancestral sequence reconstruction for protein engineersCurrent Opinion in Structural Biology 69:131–141https://doi.org/10.1016/j.sbi.2021.04.001 PubMed Google Scholar
[75]
1. Prakinee Kridsadakorn
2. Phaisan Suppalak
3. Kongjaroon Sirus
4. Chaiyen Pimchai
2024Ancestral Sequence Reconstruction for Designing Biocatalysts and Investigating their Functional MechanismsJACS Au 4:4571–4591https://doi.org/10.1021/jacsau.4c00653 PubMed Google Scholar
[76]
1. Laine Elodie
2. Karami Yasaman
3. Carbone Alessandra
2019Gemme: a simple and fast global epistatic model predicting mutational effectsbioRxiv :543587https://doi.org/10.1101/543587 Google Scholar
[77]
1. Subbanna Mythili S.
2. Winters Matthew J.
3. Davey Norman E.
4. Pryciak Peter M.
2025A quantitative intracellular peptide-binding assay reveals recognition determinants and context dependence of short linear motifsJournal of Biological Chemistry 301:108225https://doi.org/10.1016/j.jbc.2025.108225 PubMed Google Scholar
[78]
1. Cocco Simona
2. Posani Lorenzo
3. Monasson Rémi
2019Functional couplings from sequence and mutational dataIn Preparation Google Scholar
[79]
1. Rotrattanadumrong Rachapun
2. Yokobayashi Yohei
2022Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learningNature Communications 13:4847https://doi.org/10.1038/s41467-022-32538-z PubMed Google Scholar
[80]
1. Yang Jason
2. Lal Ravi G.
3. Bowden James C.
4. Astudillo Raul
5. Hameedi Mikhail A.
6. Kaur Sukhvinder
7. Hill Matthew
8. Yue Yisong
9. Arnold Frances H.
2025Active learning-assisted directed evolutionNature Communications 16:714https://doi.org/10.1038/s41467-025-55987-8 PubMed Google Scholar
[81]
1. Fischer Asja
2. Igel Christian
2012An introduction to restricted Boltzmann machinesIn: Iberoamerican Congress on Pattern Recognition Springer pp. 14–36https://doi.org/10.1007/978-3-642-33275-3_2 Google Scholar
[82]
1. Tieleman Tijmen
2008Training restricted Boltzmann machines using approximations to the likelihood gradientIn: ICML ‘08: Proceedings of the 25th International Conference on Machine learning pp. 1064–1071https://doi.org/10.1145/1390156.1390290 Google Scholar
[83]
1. Tubiana Jerome
2018Restricted Boltzmann machines : from compositional representations to protein sequence analysisPhD thesis, Universite Paris sciences et lettres, Paris, France https://doi.org/10.70675/a6d497cfz68a4z42a8za422zbbc81f218ea5 Google Scholar
[84]
1. Tubiana Jerome
2018Probabilistic graphical models (pgm)https://github.com/jertubiana/PGM
[85]
1. Minh Bui Quang
2. Schmidt Heiko A
3. Chernomor Olga
4. Schrempf Dominik
5. Woodhams Michael D
6. Von Haeseler Arndt
7. Lanfear Robert
2020IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic EraMolecular Biology and Evolution 37:1530–1534https://doi.org/10.1093/molbev/msaa015 PubMed Google Scholar
[86]
1. Gascuel S. Q. Le and O.
2008An Improved General Amino Acid Replacement MatrixMolecular Biology and Evolution 25:1307–1320https://doi.org/10.1093/molbev/msn067 PubMed Google Scholar
[87]
1. Yang Ziheng
1994Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methodsJournal of Molecular Evolution 39:306–314https://doi.org/10.1007/bf00160154 PubMed Google Scholar
[88]
1. Aadland Kelsey
2. Pugh Charles
3. Kolaczkowski Bryan
2019High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function
In:
1. Sikosek T.
, editors. Computational Methods in Protein Evolution New York, New York, NY: Springer pp. 135–170
https://doi.org/10.1007/978-1-4939-8736-8_8 PubMed Google Scholar
[89]
1. Finn Robert D
2. Clements Jody
3. Eddy Sean R
2011Hmmer web server: interactive sequence similarity searchingNucleic Acids Research 39:W29–W37https://doi.org/10.1093/nar/gkr367 PubMed Google Scholar
[90]
1. Kaminski Kamil
2. Ludwiczak Jan
3. Pawlicki Kamil
4. Alva Vikram
5. Dunin-Horkawicz Stanislaw
2023pLM-BLAST: Distant homology detection based on direct comparison of sequence representations from protein language modelsBioinformatics 39:btad579https://doi.org/10.1093/bioinformatics/btad579 PubMed Google Scholar
[91]
1. Katoh K.
2002MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transformNucleic Acids Research 30:3059–3066https://doi.org/10.1093/nar/gkf436 PubMed Google Scholar
[92]
1. Kalyaanamoorthy Subha
2. Minh Bui Quang
3. Wong Thomas K F
4. Von Haeseler Arndt
5. Jermiin Lars S
2017ModelFinder: Fast model selection for accurate phylogenetic estimatesNature Methods 14:587–589https://doi.org/10.1038/nmeth.4285 PubMed Google Scholar
[93]
1. Hesselberth Jay R
2. Miller John P
3. Golob Anna
4. Stajich Jason E
5. Michaud Gregory A
6. Fields Stanley
2006Comparative analysis of Saccharomyces cerevisiaeWW domains and their interacting proteinsGenome Biology 7:R30https://doi.org/10.1186/gb-2006-7-4-r30 PubMed Google Scholar
[94]
1. Eme L.
2. Sharpe S. C.
3. Brown M. W.
4. Roger A. J.
2014On the Age of Eukaryotes: Evaluating Evidence from Fossils and Molecular ClocksCold Spring Harbor Perspectives in Biology 6:a016139–a016139https://doi.org/10.1101/cshperspect.a016139 PubMed Google Scholar
[95]
1. Gaucher Eric A.
2. jan Sridhar Govindara
3. Ganesh Omjoy K.
2008Palaeotemperature trend for Precambrian life inferred from resurrected proteinsNature 451:704–707https://doi.org/10.1038/nature06510 PubMed Google Scholar
1. J.Mistry
2. Chuguransky S.
3. Williams L.
4. Qureshi M.
5. Salazar G.A.
6. Sonnhammer E.L.L.
7. Tosatto S.C.E.
8. Paladin L.
9. Raj S.
10. Richardson L.J.
11. Finn R.D.
12. Bateman A.
2021The WW domain is a protein module with two highly conserved tryptophans that binds proline-rich peptide motifs in vitroPFAM ID PF00397https://www.ebi.ac.uk/interpro/entry/pfam/PF00397/

Article and author information

Author information

Ahmed Rehan
École Supérieure de Physique et de Chimie Industrielles-ESPCI Laboratoire de Biochimie (LBC), Paris, France
- These authors contributed equally.
Eugenio Mauri
Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, Paris, France
- These authors contributed equally.
Jorge Fernandez-de-Cossio-Diaz
Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, Paris, France, Université Paris-Saclay, CNRS, CEA, Institut de Physique Théorique, Gif-sur-Yvette, France, Université Paris-Saclay, CNRS, CEA, Institut de Physique Théorique, Gif-sur-Yvette, France
ORCID iD: 0000-0002-4476-805X
Pierre-Guillaume Brun
Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, Paris, France
ORCID iD: 0009-0007-7536-9987
Remi Monasson
Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, Paris, France
ORCID iD: 0000-0002-4459-0204
Marco Ribezzi-Crivellari
École Supérieure de Physique et de Chimie Industrielles-ESPCI Laboratoire de Biochimie (LBC), Paris, France
ORCID iD: 0000-0003-1217-5572
- Co-last authors.
Simona Cocco
Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, Paris, France
ORCID iD: 0000-0002-1852-7789
- For correspondence: cocco@phys.ens.fr
- Co-last authors.

Author Notes

Competing interests: No competing interests declared

Version history

Preprint posted: December 10, 2025
Sent for peer review: January 20, 2026
Reviewed Preprint version 1: April 9, 2026

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.110491. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

views: 249
downloads: 13
citations: 0

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Significance of findings

Strength of evidence

Abstract

1 Introduction

2 Computational models for path generation and experimental verification

Experimental and computational pipeline for designing WW domains of variable specificities.

Peptides used for the experimental validation of mutational paths in WW domains.

Path within class I WW domains.

3 Results

3.1 Assessment of natural WW activity levels

3.2 Path within class I WW domains

3.3 Paths from Class I to Class IV

Path from Class I to Class IV.

3.4 Paths from Class I to Class II/III

Path from Class I to Class II/III.

3.5 Comparison of model scores and binding measurements

Model scores and experimental responses.

3.6 Comparison with Ancestral Sequence Reconstruction on WW domains

Pathways derived from ancestral sequence reconstruction (ASR) on the WW domains in PFAM’s seed.