Figures and data

Experimental and computational pipeline for designing WW domains of variable specificities.
A. Sketch of the activity vs. sequence landscape of WW domains. Highly active domains may bind to different peptides (black shapes) and are shown with different colours. B. Homologous sequences define the training data for our restricted Boltzmann machine (RBM) model. The binding specificities of some sequences are annotated, while others are not available. C. After training, the latent variables of the RBM define low-dimensional projections that identify clusters of sequences sharing the same specificity. Sequences along paths may cross regions devoid of natural sequences, where the specificity is unknown. D. Sketch of the experiment. In vitro transcription and translation of our construct result in the expression of a fusion protein including a WW domain (green), a linker (yellow), and the SNAP tag (blue), labelled with BG-AF647 dye. The WW domain of the fusion protein may bind to peptide-coated magnetic beads (brown). The fluorescence intensity on the bead surfaces is then measured using flow cytometry, assessing the strength of the interaction between the labelled protein and its target peptide.

Peptides used for the experimental validation of mutational paths in WW domains.
Highlighted in red are residues that bind to the WW domain. The phosphorylated Threonine in the peptide of class IV is indicated by pT.

Path within class I WW domains.
A. Alignment of the tested sequences showing the mutated amino acids. B, C. Logo representation of the weight vectors attached to the two RBM latent units most correlated with the specificity classes. D. Projection of the path on the latent weights, clustering sequences according to their binding specificities. The specificity thresholds indicated in the figure (Class I: I1 < −1 and I2 > −3; Class II/III: I1 > −1 and I2 > −3; Class IV: I1 > −1 and I2 < −3) are informed by previous specificity assays on natural sequences [9, 8] (see Figure 3 in [43] and Figure 2a in [30]). E, F. Global and Class I specific RBM scores for the sequences in the path. G. Experimental binding responses of the tested sequences. The red band, defined by the experimental noise limit, denotes the non-binding zone.
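The thresholds quoted in panel D can be read as a simple decision rule on the two latent inputs. The snippet below is only an illustrative sketch of that rule: the function name `predicted_class` is hypothetical, and returning `None` for points on a boundary or in the remaining quadrant (I1 < −1 and I2 < −3) is an assumption, since the caption assigns no class there.

```python
from typing import Optional


def predicted_class(i1: float, i2: float) -> Optional[str]:
    """Map latent inputs (I1, I2) to a specificity class using the caption thresholds.

    Hypothetical helper; not taken from the authors' code.
    """
    if i1 < -1 and i2 > -3:
        return "I"
    if i1 > -1 and i2 > -3:
        return "II/III"
    if i1 > -1 and i2 < -3:
        return "IV"
    # Assumption: boundary values and the quadrant I1 < -1, I2 < -3
    # are left unassigned because the caption defines no class for them.
    return None


if __name__ == "__main__":
    # Toy inputs chosen for illustration only, not measured values.
    for point in [(-2.0, 0.0), (0.5, 1.0), (0.5, -4.0), (-2.0, -4.0)]:
        print(point, predicted_class(*point))
```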

Path from Class I to Class IV.
A. Alignment of tested sequences along the path showing mutated amino acids. B, C, D. Scores of sequences in the path predicted by the global (top), Class I or IV (middle) and Class I or II/III (bottom) RBM models. E. Two-dimensional projection of the path in the plane of specificity-related latent units. F. Relative binding responses to Class I and IV peptides of sequences along the path. G. Normalized experimental binding responses to Class I and IV peptides. H. Responses of Seq.53, 54, 35, 48, 27 compared with Seq.82, 83 differing by mutations on residues 13 (carrying a gap for a shorter β1 − β2 loop or an S), and class I specificity-determining positions 24-25.

Path from Class I to Class II/III.
A. Alignment of the tested sequences, showing the mutated amino acids. Same color code as in Figure 2. B, C. Global and class-specific RBM scores of the sequences in the path. D. Projection of the path on the plane of RBM latent units relevant for class identification. E. Ratio of experimentally measured responses to peptides of Class II over Class I for intermediate sequences along the path. F. Normalized responses of the designed sequences against peptides of Classes I and II. The red band, defined by the experimental noise limit, indicates the non-binding zone.

Model scores and experimental responses.
A. Histogram of global RBM scores of probed sequences (Natural, Designed, and Shuffled). The histogram of global RBM scores of the full sequence data is also shown for comparison. B, C, D. Comparison between RBM local scores and binding responses, for Class I (B), II/III (C), and IV (D) ligands. Sequences in each panel are colored according to the predicted classes, i.e. to the quadrants in the 2D projection plane defined by the specificity-related latent units. E. Pearson correlation coefficient between experimental responses to the three ligands (I, II, IV), and the RBM scores (specific and global).
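For readers unfamiliar with the statistic in panel E, the Pearson correlation coefficient between a list of model scores and a list of measured responses can be computed as in the minimal sketch below. The function name `pearson_r` and the toy numbers are illustrative assumptions; the caption does not say which software the authors used.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    scores = [-3.0, -1.5, 0.0, 2.0]
    responses = [0.1, 0.3, 0.2, 0.9]
    print(pearson_r(scores, responses))
```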

Pathways derived from ancestral sequence reconstruction (ASR) on the WW domains in PFAM’s seed.
A. 2D projections of selected paths linking domains of different types (I, II/III and IV) on the two specificity-related RBM hidden units, and simplified representation of the seed tree topology. Black dots: modern sequences; Purple dots: average ancestral sequences sampled from the posterior distribution. B. 2D projections of paths linking WT domains presented in Figures 3 and 4 along the tree. Black: mutational paths designed with the RBM; Purple: paths obtained with ASR; Dotted purple: ASR path obtained when adding to the WW tree the RBM intermediates. Blue and purple paths largely overlap. C., D. Scores of ancestors obtained with ASR (solid lines), computed with the global (top) and specific (middle panels) RBM models. Scores obtained along RBM-sampled paths are shown as dashed lines for comparison. C. Path I to II/III. D. Path I to IV. E., F. Identity between intermediate sequences along the RBM path and the closest ancestors reconstructed by ASR; median and maximum identity between seed WW domains are shown as dashed red lines for reference. E. Path I to II. F. Path I to IV.
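The identity measure in panels E and F could be computed along the following lines. This is a sketch under stated assumptions: sequences are taken to be pre-aligned strings of equal length, identity is the fraction of alignment columns with matching symbols (a gap matching a gap counts as identical), and the names `identity` and `closest_identity` are hypothetical. The caption does not specify the authors' exact convention.

```python
from typing import Iterable


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences carry the same symbol."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    # Assumption: gap-gap columns count as matches.
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest_identity(query: str, ancestors: Iterable[str]) -> float:
    """Identity between a query and its most similar reconstructed ancestor."""
    return max(identity(query, anc) for anc in ancestors)


if __name__ == "__main__":
    # Toy aligned strings for illustration only; not sequences from the paper.
    intermediate = "AC-DE"
    reconstructed = ["AC-DF", "GG-GG"]
    print(closest_identity(intermediate, reconstructed))
```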