39 figures and 6 additional files

Figures

Reverse and forward modeling of proteins.

(A) Example of Multiple-Sequence Alignment (MSA), here of the WW domain (PF00397). Each column i = 1, …, N corresponds to a site on the protein, and each line to a different sequence in the family. The color …

https://doi.org/10.7554/eLife.39397.003
Modeling Kunitz Domain with RBM.

(A) Sequence logo and secondary structure of the Kunitz domain (PF00014), showing two α-helices and two β-strands. Note the presence of the three C-C disulfide bridges between positions 11&35, 2&52 …

https://doi.org/10.7554/eLife.39397.004
Modeling the WW domain with RBM.

(A) Sequence logo and secondary structure of the WW domain (PF00397), which includes three β-strands. Note the two conserved W amino acids in positions 5 and 28. (B) Weight logos for four …

https://doi.org/10.7554/eLife.39397.005
Modeling HSP70 with RBM.

(A, B) 3D structures of the DNaK E. coli HSP70 protein in the ADP-bound (A: PDB: 2kho; Bertelsen et al., 2009) and ATP-bound (B: PDB: 4jne; Qi et al., 2013) conformations. The colored spheres show the …

https://doi.org/10.7554/eLife.39397.006
Sequence design with RBM.

(A) Conditional sampling of WW domain-modeling RBM. Sequences are drawn according to Equation (3), with activities (h3, h4) fixed to (h3-, h4+), (h3+, h4-), (h3+, h4+) and (3h3-, h4-); see the red points indicating the values of h3±, h4± in Figure …

https://doi.org/10.7554/eLife.39397.007
Contact predictions using RBM.

(A) Sketch of the derivation with RBM of effective epistatic interactions between residues. The change in log probability resulting from a double mutation (purple arrow) is compared to the sum of …

https://doi.org/10.7554/eLife.39397.008
Benchmarking RBM with lattice proteins.

(A) SA, one of the 103,406 distinct structures that a 27-mer can adopt on the cubic lattice (Shakhnovich and Gutin, 1990). Circled sites are related to the features shown in Figure 6C. (B) SG, another …

https://doi.org/10.7554/eLife.39397.009
Nature of the representations built by RBM and interpretability of weights.

(A) The effect of sparsifying regularization. Left: log-probability (see Equation (5)) as a function of the regularization strength λ₁² (square root scale) for RBM with M=100 hidden units trained on …

https://doi.org/10.7554/eLife.39397.010
Representative weights of the protein families selected in Ekeberg et al. (2014).

RBM parameters: λ₁²=0.25, M=0.05×N×20. The format is the same as that used in Figures 2B, 3B and 4B. Weights are ordered by similarity, from top to bottom: Sushi domain (PF00084), Heat shock protein Hsp20 …

https://doi.org/10.7554/eLife.39397.011
Representative weights of the protein families selected in Ekeberg et al. (2014).

RBM parameters: λ₁²=0.25, M=0.05×N×20. The format is the same as that used in Figures 2B, 3B and 4B. Weights are ordered by similarity (from top to bottom): SH2 domain (PF00017), superoxide dismutase (PF00081), K …

https://doi.org/10.7554/eLife.39397.012
Duplicate RBM for biasing sampling toward high-probability sequences.

Visible-unit configurations 𝐯 are sampled from P₂(v) ∝ P(v)².
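As an illustration of why a duplicated RBM can sample from a distribution proportional to P(v)², the sketch below checks the identity by brute-force enumeration on a toy model. It assumes binary visible and hidden units (not the Potts visible units and dReLU hidden units used in the paper), and it assumes the duplication is realized by doubling the visible fields and copying every hidden unit once; the function name `rbm_marginal` and all parameter values are hypothetical.

```python
import itertools
import numpy as np

def rbm_marginal(g, W, c):
    """Exact marginal P(v) of a tiny RBM with binary (0/1) visible and
    hidden units, obtained by enumerating all visible configurations."""
    n = len(g)
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    inputs = configs @ W.T + c  # input to each hidden unit, shape (2^n, M)
    # log of unnormalized P(v): g.v + sum_mu log(1 + exp(input_mu))
    log_p = configs @ g + np.logaddexp(0.0, inputs).sum(axis=1)
    p = np.exp(log_p - log_p.max())
    return configs, p / p.sum()

rng = np.random.default_rng(0)
N, M = 4, 3  # toy sizes, chosen only to keep enumeration cheap
g = rng.normal(size=N)       # visible fields
W = rng.normal(size=(M, N))  # weights
c = rng.normal(size=M)       # hidden fields

# Target distribution: P(v)^2, renormalized.
_, p = rbm_marginal(g, W, c)
p_squared = p**2 / (p**2).sum()

# Duplicate RBM: doubled visible fields, two copies of each hidden unit.
_, p_dup = rbm_marginal(2 * g, np.vstack([W, W]), np.concatenate([c, c]))

print(np.allclose(p_dup, p_squared))
```

Because the duplicated model is itself an RBM, it can in principle be sampled with the same procedure as the original model, while putting more weight on high-probability configurations.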

https://doi.org/10.7554/eLife.39397.013
Appendix 1—figure 1
Model selection for RBM trained on the Lattice Proteins MSA.

Likelihood estimates for various potentials and number of hidden units, evaluated on train and held-out test sets. Top row: without regularization (λ₁²=0). Bottom row: with regularization (λ₁²=0.025).

https://doi.org/10.7554/eLife.39397.021
Appendix 1—figure 2
Model selection for RBM trained on the WW domain MSA.

Likelihood estimates for various potentials and number of hidden units, evaluated on train and held-out test sets. Top row: without regularization (λ₁²=0). Bottom row: with regularization (λ₁²=0.25).

https://doi.org/10.7554/eLife.39397.022
Appendix 1—figure 3
Sparsity-generative performance trade-off for RBM trained on the MSA of the Lattice Protein SA.

(A–D) Likelihood as a function of regularization strength, for L₁² (top) and L₁ (bottom) sparse penalties, on train (left) and test (middle) sets. (E) Number M_eff of connected hidden units (such that max_{i,v} |w_{iμ}(v)| > 0) …

https://doi.org/10.7554/eLife.39397.023
Appendix 1—figure 4
Hidden layer representation redundancy as a function of the hidden-unit potentials.

Distribution of Pearson correlation coefficients between hidden-unit average activities, for RBM trained with M=100, on (a) Lattice Proteins MSA, (b) Kunitz domain MSA, and (c) WW domain MSA. Bernoulli …

https://doi.org/10.7554/eLife.39397.024
Appendix 1—figure 5
Comparison of Gaussian and dReLU RBM with M=100 trained on the Kunitz domain MSA.

Scatter plot of likelihoods for each model, where each point represents a sequence of the MSA. The color code is defined in Equation 19; hot colors indicate ’outlier’ sequences.

https://doi.org/10.7554/eLife.39397.025
Appendix 1—figure 6
Quantitative quality assessment of sequences generated by RBM trained on the Lattice Protein MSA.

(a) Distributions of the probability pnat of folding into the native structure SA (Equation (14) in 'Materials and methods'), for sequences generated by various models. The horizontal bars locate the …

https://doi.org/10.7554/eLife.39397.026
Appendix 1—figure 7
Quality assessment of sequences generated by RBM trained on (a) the Kunitz domain MSA and (b) the WW domain MSA.

Scatter plot of the number of mutations to the closest natural sequence vs log-probability of a BM trained on the same data, for natural (gray) and RBM-generated (colored) WW domain sequences. The …

https://doi.org/10.7554/eLife.39397.027
Appendix 1—figure 8
Evaluating the role of regularization and sequence reweighting on generated sequence diversity for the WW domain.

The y-axis indicates the log-likelihood of the data generated by the model; entropy is the negative average log-likelihood.
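The entropy estimate mentioned in this caption can be written as a one-line computation. The following is a minimal sketch, assuming the log-likelihoods of model-generated sequences are already available as an array; the function name `entropy_estimate` is hypothetical and the toy input is not data from the paper.

```python
import numpy as np

def entropy_estimate(log_likelihoods):
    """Entropy estimate: negative average log-likelihood of samples
    drawn from the model itself."""
    return -float(np.mean(log_likelihoods))

# Toy check: samples from a uniform distribution over 4 states each have
# log-likelihood log(1/4), so the estimate equals log(4).
toy_log_likelihoods = np.full(10, np.log(0.25))
print(entropy_estimate(toy_log_likelihoods))
```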

https://doi.org/10.7554/eLife.39397.028
Appendix 1—figure 9
Pairwise couplings learned from Kunitz domain MSA.

Scatter plot of inferred pairwise direct couplings learned by BM vs effective pairwise couplings computed from the RBM through Equation (15) in the 'Materials and methods'.

https://doi.org/10.7554/eLife.39397.029
Appendix 1—figure 10
Contact map and contact predictions for the Kunitz domain.

(a) Lower diagonal: the 551 pairs of residues at D<0.8 nm in the structure. Upper diagonal: top 551 contacts predicted by dReLU RBM with M=100, shown in Figure 2. (b) Positive Predictive Value vs rank for …
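For readers unfamiliar with the metric in panel (b), the sketch below shows one common way to compute a positive-predictive-value curve against rank: residue pairs are sorted by decreasing predicted score, and the value at rank k is the fraction of the top k pairs that are true contacts. The function name `ppv_curve` and the toy inputs are hypothetical, and the sketch does not reproduce any filtering of pairs that the authors may have applied.

```python
import numpy as np

def ppv_curve(scores, is_contact):
    """Fraction of true contacts among the top-k scored pairs, for every k."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # best score first
    hits = np.asarray(is_contact, dtype=float)[order]
    return np.cumsum(hits) / np.arange(1, len(hits) + 1)

# Toy example with four residue pairs.
scores = [0.9, 0.1, 0.8, 0.4]  # predicted coupling strength per pair
is_contact = [1, 0, 0, 1]      # 1 if the pair is a contact in the structure
print(ppv_curve(scores, is_contact))
```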

https://doi.org/10.7554/eLife.39397.030
Appendix 1—figure 11
Contact predictions for Lattice Proteins, with (a) Bernoulli, (b) Gaussian, (c) dReLU RBM and (d) BM potentials.

Models with quadratic or dReLU potentials and a large number of hidden units are typically similar in performance to pairwise models, trained either with Monte Carlo or Pseudo-likelihood Maximization.

https://doi.org/10.7554/eLife.39397.031
Appendix 1—figure 12
Contact predictions as a function of RBM parameters for (a) Kunitz and (b) WW domains.

Both panels show the area under curve metric (integrated up to the true number of contacts) for various trainings, with different training parameters, regularization choice and hidden units …

https://doi.org/10.7554/eLife.39397.032
Appendix 1—figure 13
Features inferred using the first and second half of the sequences.
https://doi.org/10.7554/eLife.39397.033
Appendix 1—figure 14
Top 12 patterns with the highest contributions to the log-probability (see equation (23) in Cocco et al. (2013)), inferred by the Hopfield-Potts model on the Kunitz domain.
https://doi.org/10.7554/eLife.39397.034
Appendix 1—figure 15
Top 12 patterns with the highest contributions to the log-probability (see equation (23) in Cocco et al. (2013)), inferred by the Hopfield-Potts model on the WW domain.
https://doi.org/10.7554/eLife.39397.035
Appendix 1—figure 16
Top 12 patterns with the highest contributions to the log-probability (see equation (23) in Cocco et al. (2013)), inferred by the Hopfield-Potts model on the Lattice Proteins data.
https://doi.org/10.7554/eLife.39397.036
Appendix 1—figure 17
Hopfield-Potts model for sequence generation.

(A) Fitness pnat against distance to closest sequence for the Hopfield-Potts model with pseudo-count 0.01 or 0.5, sampled with or without the high P(v) bias. Gray ellipses denote the corresponding …

https://doi.org/10.7554/eLife.39397.037
Appendix 1—figure 18
Contact prediction for 17 protein families including the Hopfield-Potts model.
https://doi.org/10.7554/eLife.39397.038
Appendix 1—figure 19
Phylogenetic identity of feature-activating Kunitz sequences with the RBM shown in Figure 2.

(A) Scatter plot of inputs of hidden units 2 and 3; color depicts the organisms' position in the phylogenetic tree of species. Most of the sequences that lack the disulfide bridge are nematodes. (B) …

https://doi.org/10.7554/eLife.39397.039
Appendix 1—figure 20
Distribution of inputs for the five features shown in main text plus hidden unit 34.

Distributions of inputs for Kunitz domains belonging to specific genes are shown.

https://doi.org/10.7554/eLife.39397.040
Appendix 1—figure 21
Truncated weight logo of 10 selected HSP70 hidden units (1/2).
https://doi.org/10.7554/eLife.39397.041
Appendix 1—figure 22
Truncated weight logo of 10 selected HSP70 hidden units (2/2).
https://doi.org/10.7554/eLife.39397.042
Appendix 1—figure 23
Corresponding structures (1/3).

Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), …

https://doi.org/10.7554/eLife.39397.043
Appendix 1—figure 24
Corresponding structures (2/3).

Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), …

https://doi.org/10.7554/eLife.39397.044
Appendix 1—figure 25
Corresponding structures (3/3).

Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), …

https://doi.org/10.7554/eLife.39397.045
Appendix 1—figure 26
Corresponding input distributions.

Note that hidden units 4 and 9 both discriminate the non-allosteric subfamily from the rest, and that hidden unit 8 discriminates eukaryotic Hsp expressed in the endoplasmic reticulum from the rest.

https://doi.org/10.7554/eLife.39397.046
Appendix 1—figure 27
Some scatter plots of inputs for the 10 hidden units shown.
https://doi.org/10.7554/eLife.39397.047
Appendix 1—figure 28
Statistics of the length and amino-acid content of the unstructured tail of Hsp70.

Hidden unit 5 defines a set of sites, mostly located on the unstructured tail of Hsp70; its sequence logo and input distribution suggest that for a given sequence, the tail can be enriched either …

https://doi.org/10.7554/eLife.39397.048

Additional files

Supplementary file 1

Weight logo for all hidden units inferred from the Kunitz domain MSA.

https://doi.org/10.7554/eLife.39397.014
Supplementary file 2

Weight logo for all hidden units inferred from the WW domain MSA.

https://doi.org/10.7554/eLife.39397.015
Supplementary file 3

Weight logo for all hidden units inferred from the LP MSA.

https://doi.org/10.7554/eLife.39397.016
Supplementary file 4

Weight logo of 12 Hopfield-Potts patterns inferred from the Hsp70 protein MSA.

The format is the same as that used for Appendix 1—figures 14–16.

https://doi.org/10.7554/eLife.39397.017
Supplementary file 5

Weight logo and associated structures of the 10 weights with highest norms, excluding the gap modes for each of the 16 additional domains shown in Figure 9.

https://doi.org/10.7554/eLife.39397.018
Supplementary file 6

Weight logo and associated structures of the 10 sparse (i.e. within the 30% most sparse weights of the RBM) weights with highest norms, excluding the gap modes for each of the 16 additional domains shown in Figure 9.

https://doi.org/10.7554/eLife.39397.019
