(A) Example of Multiple-Sequence Alignment (MSA), here of the WW domain (PF00397). Each column corresponds to a site on the protein, and each line to a different sequence in the family. The color …
(A) Sequence logo and secondary structure of the Kunitz domain (PF00014), showing two α-helices and two β-strands. Note the presence of the three C-C disulfide bridges between positions 11&35, 2&52 …
(A) Sequence logo and secondary structure of the WW domain (PF00397), which includes three β-strands. Note the two conserved W amino acids in positions 5 and 28. (B) Weight logos for four …
(A, B) 3D structures of the DNaK E. coli HSP70 protein in the ADP-bound (A: PDB: 2kho; Bertelsen et al., 2009) and ATP-bound (B: PDB: 4jne; Qi et al., 2013) conformations. The colored spheres show the …
(A) Conditional sampling of WW domain-modeling RBM. Sequences are drawn according to Equation (3), with activities fixed to , , and , see red points indicating the values of in Figure …
(A) Sketch of the derivation with RBM of effective epistatic interactions between residues. The change in log probability resulting from a double mutation (purple arrow) is compared to the sum of …
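The double-mutation comparison sketched in panel (A) can be illustrated with a short Python example. This is a minimal sketch, not the authors' implementation: it assumes the comparison is with the sum of the two single-mutation changes (the usual definition of pairwise epistasis), and `double_mutant_epistasis`, `toy_log_prob` and `additive_log_prob` are hypothetical names. The toy model uses a single softplus term, the form a Bernoulli hidden unit contributes to the log probability once it is summed out; the normalization constant is omitted because it cancels in the difference.

```python
import math


def double_mutant_epistasis(log_prob, wild_type, i, a, j, b):
    """Change in log probability for the double mutant minus the sum of the
    changes for the two single mutants (all measured from the wild type)."""

    def mutate(seq, changes):
        residues = list(seq)
        for pos, amino_acid in changes:
            residues[pos] = amino_acid
        return "".join(residues)

    lp_wt = log_prob(wild_type)
    delta_i = log_prob(mutate(wild_type, [(i, a)])) - lp_wt
    delta_j = log_prob(mutate(wild_type, [(j, b)])) - lp_wt
    delta_ij = log_prob(mutate(wild_type, [(i, a), (j, b)])) - lp_wt
    return delta_ij - (delta_i + delta_j)


def toy_log_prob(seq):
    """Hypothetical unnormalized log probability: one hidden-unit-like term
    whose input counts W residues at sites 0 and 2 (softplus of the input)."""
    hidden_input = float(seq[0] == "W") + float(seq[2] == "W")
    return math.log1p(math.exp(hidden_input))


def additive_log_prob(seq):
    """Hypothetical independent-site model: each W adds a fixed amount."""
    return sum(0.5 for residue in seq if residue == "W")


coupled = double_mutant_epistasis(toy_log_prob, "AAA", 0, "W", 2, "W")
independent = double_mutant_epistasis(additive_log_prob, "AAA", 0, "W", 2, "W")
```

In this toy setting the independent-site model gives zero epistasis, while the shared hidden-unit term makes the two sites interact, so `coupled` is nonzero.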
(A) , one of the 103,406 distinct structures that a 27-mer can adopt on the cubic lattice (Shakhnovich and Gutin, 1990). Circled sites are related to the features shown in Figure 6C. (B), another …
(A) The effect of sparsifying regularization. Left: log-probability (see , Equation (5)) as a function of the regularization strength (square root scale) for RBM with hidden units trained on …
RBM parameters: , . The format is the same as that used in Figures 2B, 3B and 4B. Weights are ordered by similarity, from top to bottom: Sushi domain (PF00084), Heat shock protein Hsp20 …
RBM parameters: , . The format is the same as that used in Figures 2B, 3B and 4B. Weights are ordered by similarity (from top to bottom): SH2 domain (PF00017), superoxide dismutase (PF00081), K …
Visible-unit configurations are sampled from .
Likelihood estimates for various potentials and number of hidden units, evaluated on train and held-out test sets. Top row: without regularization (). Bottom row: with regularization ().
Likelihood estimates for various potentials and number of hidden units, evaluated on train and held-out test sets. Top row: without regularization (). Bottom row: with regularization ().
(A–D) Likelihood as a function of regularization strength, for (top) and (bottom) sparse penalties, on train (left) and test (middle) sets. (E) Number of connected hidden units (such that ) …
Distribution of Pearson correlation coefficients between hidden-unit average activities, for RBM trained with , on (a) Lattice Proteins MSA, (b) Kunitz domain MSA, and (c) WW domain MSA. Bernoulli …
Scatter plot of likelihoods for each model, where each point represents a sequence of the MSA. The color code is defined in Equation 19; hot colors indicate ’outlier’ sequences.
(a) Distributions of the probability of folding into the native structure (Equation (14) in 'Materials and methods'), for sequences generated by various models. The horizontal bars locate the …
Scatter plot of the number of mutations to the closest natural sequence vs log-probability of a BM trained on the same data, for natural (gray) and RBM-generated (colored) WW domain sequences. The …
The y-axis indicates the log-likelihood of the data generated by the model; entropy is the negative average log-likelihood.
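The relation between entropy and log-likelihood stated in this caption can be made concrete with a short Python sketch. It assumes the per-sequence log-likelihoods of model-generated samples are already available; `entropy_estimate` and the numerical values are hypothetical and only illustrate the arithmetic.

```python
def entropy_estimate(log_likelihoods):
    """Estimate the entropy as the negative average log-likelihood of
    sequences sampled from the model itself."""
    if not log_likelihoods:
        raise ValueError("need at least one log-likelihood value")
    return -sum(log_likelihoods) / len(log_likelihoods)


# Hypothetical log-likelihoods of four sequences sampled from a model.
sample_log_likelihoods = [-42.0, -40.5, -43.5, -42.0]
entropy = entropy_estimate(sample_log_likelihoods)
```

The estimate is only meaningful when the sequences are drawn from the same model that assigns the log-likelihoods; applied to other data, the same average would measure a cross-entropy instead.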
Scatter plot of inferred pairwise direct couplings learned by BM vs effective pairwise couplings computed from the RBM through Equation (15) in the 'Materials and methods'.
(a) Lower diagonal: the 551 pairs of residues at nm in the structure. Upper diagonal: top 551 contacts predicted by dReLU RBM with , shown in Figure 2. (b) Positive Predicted Value vs rank for …
Models with quadratic or dReLU potentials and a large number of hidden units typically perform similarly to pairwise models trained with either Monte Carlo or Pseudo-likelihood Maximization.
Both panels show the area under curve metric (integrated up to the true number of contacts) for various trainings, with different training parameters, regularization choice and hidden units …
(A) Fitness against distance to closest sequence for the Hopfield-Potts model with pseudo-count 0.01 or 0.5, sampled with or without the high bias. Gray ellipses denote the corresponding …
(A) Scatter plot of inputs of hidden units 2 and 3; color depicts the organisms' position in the phylogenetic tree of species. Most of the sequences that lack the disulfide bridge are nematodes. (B) …
Distributions of inputs for Kunitz domains belonging to specific genes are shown.
Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), …
Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), …
Left: ADP-bound conformation (PDB: 2kho). Right: ATP-bound conformation (PDB: 4jne). For the last hidden unit, we show the structure of the dimer Hsp70–Hsp70 in ATP conformation (PDB: 4JNE), …
Note that hidden units 4 and 9 both discriminate the non-allosteric subfamily from the rest, and that hidden unit 8 discriminates eukaryotic Hsp expressed in the endoplasmic reticulum from the rest.
Hidden unit 5 defines a set of sites, mostly located on the unstructured tail of Hsp70; its sequence logo and input distribution suggest that for a given sequence, the tail can be enriched either …
Weight logo for all hidden units inferred from the Kunitz domain MSA.
Weight logo for all hidden units inferred from the WW domain MSA.
Weight logo for all hidden units inferred from the LP MSA.
Weight logo of 12 Hopfield-Potts patterns inferred from the Hsp70 protein MSA.
The format is the same as that used for Appendix 1—figures 14–16.
Weight logo and associated structures of the 10 weights with highest norms, excluding the gap modes for each of the 16 additional domains shown in Figure 9.
Weight logo and associated structures of the 10 highest-norm weights among the sparse ones (i.e. within the 30% most sparse weights of the RBM), excluding the gap modes, for each of the 16 additional domains shown in Figure 9.