Learning protein constitutive motifs from sequence data
Figures
 
              Reverse and forward modeling of proteins.
(A) Example of Multiple-Sequence Alignment (MSA), here of the WW domain (PF00397). Each column corresponds to a site on the protein, and each line to a different sequence in the family. The color code for amino acids is as follows: red = negative charge (E,D), blue = positive charge (H, K, R), purple = non charged polar (hydrophilic) (N, T, S, Q), yellow = aromatic (F, W, Y), black = aliphatic hydrophobic (I, L, M, V), green = cysteine (C), grey = other, small amino acids (A, G, P). (B) In a Restricted Boltzmann Machine (RBM), weights connect the visible layer (carrying protein sequences ) to the hidden layer (carrying representations ). Biases on the visible and hidden units are introduced by the local potentials and . Owing to the bipartite nature of the weight graph, hidden units are conditionally independent given a visible configuration, and vice versa. (C) Sequences in the MSA (dots in sequence space, left) code for proteins with different phenotypes (dot colors). RBM define a probabilistic mapping from sequences onto the representation space (right), which is indicative of the phenotype of the corresponding protein and encoded in the conditional distribution , Equation (3) (black arrow). The reverse mapping from representations to sequences is , Equation (4) (black arrow). In turn, sampling a subspace in the representation space (colored domains) defines a complex subset of the sequence space, and allows the design of sequences with putative phenotypic properties that are either found in the MSA (green circled dots) or not encountered in Nature (arrow out of blue domain). (D) Three examples of potentials defining the hidden-unit type in RBM (see Equation (1) and panel (B)): quadratic (black, , ) and double Rectified Linear Unit (dReLU) (dReLU1 (green), , ; and dReLU2 (purple), , , , ) potentials. In practice, the parameters of the hidden unit potentials are fixed through learning of the sequence data. (E) Average activity of hidden unit , calculated from Equation (3), as a function of the input defined in Equation (2). The three curves correspond to the three choices of potentials in panel (A). For the quadratic potential (black), the average activity is a linear function of . For dReLU1 (green), small inputs barely activate the hidden unit, whereas dReLU2 (Purple) essentially binarizes the inputs .
 
              Modeling Kunitz Domain with RBM.
(A) Sequence logo and secondary structure of the Kunitz domain (PF00014), showing two α-helices and two -strands. Note the presence of the three C-C disulfide bridges between positions 11&35, 2&52 and 27&48. (B) Weight logos for five hidden units(see text). Positive and negative weights are shown by letters located, respectively, above and below the zero axis. Values of the norms are given. The color code for the amino acids is the same as that in Figure 1A. (C) Top: distribution of inputs over the sequences in the MSA (dark blue), and average activity vs. input function (full line, left scale); red points correspond to the activity levels used for design in Figure 5. Bottom: histograms of Hamming distances between sequences in the MSA (grey) and between the 20 sequences (light blue) with largest (for unit 2,3,4) or smallest (1,5) . (D 3D visualization of the weights, shown on PDB structure 2knt Merigeau et al., 1998 using VMD Humphrey et al., 1996. White spheres denote the positions of the three disulfide bridges in the wildtype sequence. Green spheres locate residues such that , with for hidden units , for , and for .
 
              Modeling the WW domain with RBM.
(A) Sequence logo and secondary structure of the WW domain (PF00397), which includes three -strands. Note the two conserved W amino acids in positions 5 and 28. (B) Weight logos for four representative hidden units. (C) Corresponding inputs, average activities and distances between the top-20 feature-activating sequences. (D) 3D visualization of the features, shown on the PDB structure 1e0m Macias et al., 2000. White spheres locate the two W amino acids. Green spheres locate residues such that for each hidden unit . (E) Scatter plot of inputs vs. . Gray dots represent the sequences in the MSA; they cluster into three main groups. Colored dots show artificial or natural sequences whose specificities, given in the legend, were tested experimentally. Upper triangle: natural, from Russ et al. (2005). Lower triangle: artificial, from Russ et al. (2005). Diamond: natural, from Otte et al. (2003). Crosses: YAP1 (0) and variants (1 and 2 mutations from YAP1), from Espanel and Sudol (1999). The three clusters match the standard ligand-type classification.
 
              Modeling HSP70 with RBM.
(A, B) 3D structures of the DNaK E. coli HSP70 protein in the ADP-bound (A: PDB: 2kho Bertelsen et al., 2009) and ATP-bound (B: PDB: 4jne Qi et al., 2013) conformations. The colored spheres show the sites carrying the largest entries in the weights in panel (C). (C) Weight logos for hidden units , 2 and 5 (see Appendix 1—figure 21 for the other hidden units). Owing to the large protein length, we show only weights for positions with large weights (), with surrounding positions up to ±5 sites away; dashed lines vertical locate the left edges of the intervals. Protein backbone colors: blue = NBD; cyan = linker; red = SBD; gray = LID. Colors: orange = Unit 1 (NBD loop); black = Unit 2 (SBD β strand); green = Unit 3 (SBD/LID); yellow = Unit 4 (Allosteric). (D) Scatter plot of inputs vs. . Gray dots represent the sequences in the MSA, and cluster into four main groups. Colored dots represent the main sequence categories based on gene phylogeny, function and expression. (E) Histogram of input , showing separation between allosteric and non-allosteric protein sequences in the MSA.
 
              Sequence design with RBM.
(A) Conditional sampling of WW domain-modeling RBM. Sequences are drawn according to Equation (3), with activities fixed to , , and , see red points indicating the values of in Figure 3C. Natural sequences in the MSA are shown with gray dots, and generated sequences with colored dots. Four clusters of sequences are obtained; the first three are putatively associated to, respectively, ligand-specific groups I, II/III and IV. The sequences in the bottom left cluster, obtained through very strong conditioning, do not resemble any of the natural sequences in the MSA; their binding specificity is unknown. (B) Sequence logo of the red sequences in panel (A), with ‘long - loop’ and ‘type I’ features. (C) Conditional sampling of Kunitz domain-modeling RBM, with activities fixed to , see red dots indicating in Figure 2C. Red sequences combine the absence of the 11–35 disulfide bridge and a strong activation of the Bikunin-AMBP feature, although these two phenotypes are never found together in natural sequences. (D) Sequence logo of the red sequences in panel (C), with ‘no disulfide bridge’ and ‘bikunin’ features. (E) Scatter plot of the number of mutations to the closest natural sequence vs log-probability, for natural (gray) and artificial (colored) WW domain sequences. The color code is the same as that in panel (A); dark dots were generated with the high-probability trick, based on duplicated RBM (see 'Materials and methods'). Note the existence of many high-probability artificial sequences far away from the natural ones. (F) The same scatter plot as in panel (E) for natural and artificial Kunitz-domain sequences.
 
              Contact predictions using RBM.
(A) Sketch of the derivation with RBM of effective epistatic interactions between residues. The change in log probability resulting from a double mutation (purple arrow) is compared to the sum of the changes accompanying the single mutations (blue and red arrows) (see text and 'Materials and methods', Equations (15,16)). (B) Positive Predictive Value (PPV) vs. pairs of residues, ranked according to their scores for the Kunitz domain. RBM predictions with quadratic (Gaussian RBM) and dReLU potentials are compared to direct coupling-based methods, namely the Pseudo-Likelihood Method (plmDCA) Ekeberg et al., 2014) and Boltzmann Machine (BM) learning Sutto et al., 2015). (C) Same as panel (B) for the WW domain. (D) Distant contact predictions for the 17 protein domains used to benchmark plmDCA in Ekeberg et al. (2014) obtained using fixed regularization and . PPV for contacts between residues separated by at least five sites along the protein backbone vs. ranks of the corresponding couplings, expressed as fractions of the protein length ; solid lines indicate the median PPV and colored areas the corresponding 1/3 to 2/3 quantiles.
 
              Benchmarking RBM with lattice proteins.
(A) , one of the 103,406 distinct structures that a 27-mer can adopt on the cubic lattice Shakhnovich and Gutin, 1990. Circled sites are related to the features shown in Figure 6C. (B), another fold with a contact map (set of neighbouring sites) close to Jacquin et al., 2016. (C) Four weight logos for a RBM inferred from sequences folding into , see 'Supporting Information' for the remaining 96 weights. Weight 1 corresponds to the contact between sites 3 and 26, see black dashed contour in panel (A). The contact can be realized by amino acids of opposite (-+) charges () or by hydrophobic residues (). Weights 2 and 3 are related to, respectively, the triplets of amino acids 8-15-27 and 2-16-25, each realizing two overlapping contacts on (blue dashed contours). Weight 4 codes for electrostatic contacts between sites 3 & 26, 1 & 18 and 1 & 20, and imposes the conditon that the charges of amino acids 1 and 26 have the same sign. The latter constraint is not due to the native fold (1 and 26 are ‘far away’ on ) but because folding must be impeded in the ‘competing’ structure, (Figure 7B and 'Materials and methods') in which sites 1 and 26 are neighbours Jacquin et al., 2016). (D) Distributions of inputs () and average activities (full line, left scale). All features are activated across the entire sequence space (not shown). (E) Conditional sampling with activities fixed to , see red dots in panel (D). Designed sequences occupy specific clusters in the sequence space, corresponding to different realizations of the overlapping contacts encoded by weights 2 and 3 (Figure 6C). Conditioning to makes it possible to generate sequences combining features that are not found together in the MSA (see bottom left corner), even with very high probabilities (see 'Materials and methods'). (F) Scatter plot of the number of mutations to the closest natural sequence vs. the probability of folding into structure (see Jacquin et al., 2016 for a precise definition) for natural (gray) and artificial (colored) sequences. Note the large diversity and the existence of sequences with higher than those in the training sample.
 
              Nature of the representations built by RBM and interpretability of weights.
(A) The effect of sparsifying regularization. Left: log-probability (see , Equation (5)) as a function of the regularization strength (square root scale) for RBM with hidden units trained on WW domain sequence data. Right: the weights attached to three representative hidden units are shown for (no regularization) and 0.03 (optimal log-likelihood for the test set, see left panel); weights shown in Figure 3 were obtained at higher regularization . For larger regularization, too many weights vanish, and the log-likelihood diminishes. (B) Sequences (purple dots) in the MSA attached to a protein family define a highly sparse subset of the sequence space (symbolized by the blue square), from which a RBM model is inferred. The RBM then defines a distribution over the entire sequence space, with high scores for natural sequences and over many more other sequences putatively belonging to the protein family. The representations of the sequence space by RBM can be of different types, three examples of which are sketched in the following panels. (C) Mixture model: each hidden unit focuses on a specific region in sequence space (color ellipses, different colors correspond to different units), and the attached weights form a template for this region. The representation of a sequence thus involves one (or a few) strongly activated hidden units, while all remaining units are inactive. (D) Entangled model: all hidden units are moderatly active across the sequence space. The pattern of activities vary from one sequence to another in a complex manner. (E) Compositional model: a moderate number of hidden units are activated for each protein sequence, each recognizing one of the motifs (shown by colors) in the sequence and controling one of the protein's biological properties. Composing the different motifs in various ways (right circled compositions) generates a large diversity of sequences.
 
              Representative weights of the protein families selected in Ekeberg et al. (2014).
RBM parameters: , . The format is the same as that used in Figures 2B, 3B and 4B. Weights are ordered by similarity, from top to bottom: Sushi domain (PF00084), Heat shock protein Hsp20 (PF00011), SH3 Domain (PF00018), Homeodomain protein (PF00046), Zinc finger–C4 type (PF00105), Cyclic nucleotide-binding domain (PF00027), and RNA recognition motif (PF00076). Green spheres show the sites that carry the largest weights on the 3D folds (in order, PDB: 1elv, 2bol, 2hda, 2vi6, 1gdc, 3fhi, 1g2e). The ten weights with largest norms in each family are shown in Supplementary files 5–6.
 
              Representative weights of the protein families selected in Ekeberg et al. (2014).
RBM parameters: , . The format is the same as that used in Figures 2B, 3B and 4B. Weights are ordered by similarity (from top to bottom): SH2 domain (PF00017), superoxide dismutase (PF00081), K homology domain (PF00013), fibronectin type III domain (PF00041), double-stranded RNA-binding motif (PF00035), zinc-binding dehydrogenase (PF00107), cadherin (PF00028), glutathione S-transferase, C-terminal domain (PF00043), and 2Fe-2S iron-sulfur cluster binding domain (PF00111). Green spheres show the sites that carry the largest weights on the 3D folds (in order, PDB: 1o47, 3bfr, 1wvn, 1bqu, 1o0w, 1a71, 2o72, 6gsu, 1a70). The ten weights with largest norms in each family are shown in Supplementary files 5–6.
 
              Duplicate RBM for biasing sampling toward high-probability sequences.
Visible-unit configurations are sampled from .
 
              Model selection for RBM trained on the Lattice Proteins MSA.
Likelihood estimates for various potentials and number of hidden units, evaluated on train and held-out test sets. Top row: without regularization (). Bottom row: with regularization ().
 
              Model selection for RBM trained on the WW domain MSA.
Likelihood estimates for various potentials and number of hidden units, evaluated on train and held-out test sets. Top row: without regularization (). Bottom row: with regularization ().
 
              Sparsity-generative performance trade-off for RBM trained on the MSA of the Lattice Protein .
(A–D) Likelihood as function of regularization strength, for (top) and (bottom) sparse penalties, on train(left) and test (middle) sets. (E) Number of connected hidden units (such that ) against effective strength penalty, for and penalties. For penalty, ; for , .
 
              Hidden layer representation redundancy as a function of the hidden-unit potentials.
Distribution of Pearson correlation coeffcients between hidden-unit average activities, for RBM trained with , on (a) Lattice Proteins MSA, (b) Kunitz domain MSA, and (c) WW domain MSA. Bernoulli RBM feature the highest correlations.
 
              Comparison of Gaussian and dReLU RBM with trained on the Kunitz domain MSA.
Scatter plot of likelihoods for each model, where each point represents a sequence of the MSA. The color code is defined in Equation 19; hot colors indicate ’outlier’ sequences.
 
              Quantitative quality assessment of sequences generated by RBM trained on the Lattice Protein MSA.
(a) Distributions of the probability of folding into the native structure (Equation (14) in 'Materials and methods'), for sequences generated by various models. The horizontal bars locate the average values of . Models with higher capacity (more parameters, less regularization) generate sequences with higher quality but lower diversity. (b) Distribution of distances from a randomly selected wildtype. The unregularized BM samples have lower diversity, whereas the regularized RBM samples better reproduce the data distribution. (c) Log-probability of dReLU RBM shown in the main text (Figure 7) vs true fitness evaluated on sequences from the MSA used (train) or not (test) for training.
 
              Quality assessment of sequences generated by RBM trained on (a) the Kunitz domain MSA and (b) the WW domain MSA.
Scatter plot of the number of mutations to the closest natural sequence vs log-probability of a BM trained on the same data, for natural (gray) and RBM-generated (colored) WW domain sequences. The color code is that same as that used in Figure 5A. Note similar likelihoods values for RBM-generated sequences and natural ones, including the unseen combinations.
 
              Evaluating the role of regularization and sequence reweighting on generated sequence diversity for the WW domain.
The y-axis indicates the log-likelihood of the data generated by the model; entropy is the negative average log-likelihood.
 
              Pairwise couplings learned from Kunitz domain MSA.
Scatter plot of inferred pairwise direct couplings learned by BM vs effective pairwise couplings computed from the RBM through Equation (15) in the 'Materials and methods'.
 
              Contact map and contact predictions for the Kunitz domain.
(a) Lower diagonal: the 551 pairs of residues at nm in the structure. Upper diagonal: top 551 contacts predicted by dReLU RBM with , shown in Figure 2. (b) Positive Predicted Value vs rank for distant contacts for RBM () and pairwise models. Distant contacts are well predicted, including those involved in the secondary structure.
 
              Contact predictions for Lattice Proteins, with (a) Bernoulli (b) Gaussian (c) dReLU RBM and (d) BM potentials.
Models with quadratic or dReLU potentials and large number of hidden units are typically similar in performance to pairwise models, trained either with Monte Carlo or Pseudo-likelihood Maximization.
 
              Contact predictions as a function of RBM parameters for (a) Kunitz and (b) WW domains.
Both panels show the area under curve metric (integrated up to the true number of contacts) for various trainings, with different training parameters, regularization choice and hidden units number/potentials, against the weight sparsity. In both cases, large sparse regularization and a high number of hidden units reproduce the performance of the pairwise models.
 
              Features inferred using the first and second half of the sequences.
https://doi.org/10.7554/eLife.39397.033 
              Top 12 patterns with highest contributions to the log-probability, see eqn (23) in Cocco et al. (2013), inferred by the Hopfield-Potts model on the Kunitz domain.
https://doi.org/10.7554/eLife.39397.034 
              Top 12 patterns with the highest contributions to the log-probability (see equation (23) in Cocco et al. (2013)), inferred by the Hopfield-Potts model on the WW domain.
https://doi.org/10.7554/eLife.39397.035 
              Top 12 patterns with the highest contributions to the log-probability (see equation (23) in Cocco et al. (2013), inferred by the Hopfield-Potts model on the Lattice Proteins data.
https://doi.org/10.7554/eLife.39397.036 
              Hopfield-Potts model for sequence generation.
(A) Fitness against distance to closest sequence for the Hopfield-Potts model with pseudo-count 0.01 or 0.5, sampled with or without the high bias. Gray ellipses denote the corresponding values for the RBM. (B) Distribution of distances between generated sequences.
 
              Contact prediction for 17 protein families including the Hopfield-Potts model.
https://doi.org/10.7554/eLife.39397.038 
              Phylogenetic identity of feature-activating Kunitz sequences with the RBM shown in Figure 2.
(A) Scatter plot of inputs of hidden units 2 and 3; color depicts the organisms' position in the phylogenic tree of species. Most of the sequences that lack the disulfide bridge are nematodes. (B) Sequence logo of the 137 sequences above the dashed line (), showing the electrostatic triangle that putatively replaces the disulfide bridge.
 
              Distribution of inputs for the five features shown in main text plus hidden unit 34.
Distributions of inputs for Kunitz domains belonging to specific genes are shown.
 
              Truncated weight logo of 10 selected HSP70 hidden units (1/2).
https://doi.org/10.7554/eLife.39397.041 
              Truncated weight logo of 10 selected HSP70 hidden units (2/2).
https://doi.org/10.7554/eLife.39397.042 
              Corresponding structures (1/3).
Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), highlighting dimeric contacts.
 
              Corresponding structures (2/3).
Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), highlighting dimeric contacts.
 
              Corresponding structures (3/3).
Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), highlighting dimeric contacts.
 
              Corresponding input distributions.
Note that both hidden unit 4 and 9 discriminate the non-allosteric subfamily from the rest; and that hidden unit 8 discriminates eukaryotic Hsp expressed in the endoplasmic reticulum from the rest.
 
              Some scatter plots of inputs for the 10 hidden units shown.
https://doi.org/10.7554/eLife.39397.047 
              Statistics of the length and amino-acid content of the unstructured tail of Hsp70.
Hidden unit 5 defines a set of sites, mostly located on the unstructured tail of Hsp70; its sequence logo and input distribution suggests that for a given sequence, the tail can be enriched either in tiny (A, G)or hydrophilic amino-acids (E,D,K,R,T,S,N,Q). This is qualitatively confirmed by the non-gaussian statistics of the distributions of the fractions of tiny and hydrophilic amino-acids in the tail (blue histograms and top left contour plots). This effect could, however, be due to the variable length of the loop (bottom histogram). To assess this enrichment, we built a null model where the tail size was random (same statistics as Hsp70), and each amino-acid was drawn randomly, independently from the others, using the same amino-acid frequency as that in the tail of Hsp70. The null model statistics (orange histograms and lower left contour plots) are clearly different, validating the collective mode.
Additional files
- 
            Supplementary file 1Weight logo for all hidden units inferred from the Kunitz domain MSA. 
- https://doi.org/10.7554/eLife.39397.014
- 
            Supplementary file 2Weight logo for all hidden units inferred from the WW domain MSA. 
- https://doi.org/10.7554/eLife.39397.015
- 
            Supplementary file 3Weight logo for all hidden units inferred from the LP MSA. 
- https://doi.org/10.7554/eLife.39397.016
- 
            Supplementary file 4Weight logo of 12 Hopfield-Potts pattern inferred from the Hsp70 protein MSA. The format is the same as that used for Appendix 1—figures 14–16. 
- https://doi.org/10.7554/eLife.39397.017
- 
            Supplementary file 5Weight logo and associated structures of the 10 weights with highest norms, excluding the gap modes for each of the 16 additional domains shown in Figure 9. 
- https://doi.org/10.7554/eLife.39397.018
- 
            Supplementary file 6Weight logo and associated structures of the 10 sparse (i.e. within the 30% most sparse weights of the RBM) weights with highest norms, excluding the gap modes for each of the 16 additional domains shown in Figure 9. 
- https://doi.org/10.7554/eLife.39397.019
 
                 
         
         
        