Learning protein constitutive motifs from sequence data
Abstract
Statistical analysis of evolutionaryrelated protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex highdimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residueresidue tertiary contacts, extended secondary motifs (αhelixes and βsheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families.
https://doi.org/10.7554/eLife.39397.001eLife digest
Almost every process that keeps a cell alive depends on the activities of several proteins. All proteins are made from chains of smaller molecules called amino acids, and the specific sequence of amino acids determines the protein’s overall shape, which in turn controls what the protein can do. Yet, the relationships between a protein’s structure and its function are complex, and it remains unclear how the sequence of amino acids in a protein actually determine its features and properties.
Machine learning is a computational approach that is often applied to understand complex issues in biology. It uses computer algorithms to spot statistical patterns in large amounts of data and, after 'learning' from the data, the algorithms can then provide new insights, make predictions or even generate new data.
Tubiana et al. have now used a relatively simple form of machine learning to study the amino acid sequences of 20 different families of proteins. First, frameworks of algorithms –known as Restricted Boltzmann Machines, RBM for short – were trained to read some aminoacid sequence data that coded for similar proteins. After ‘learning’ from the data, the RBM could then infer statistical patterns that were common to the sequences. Tubiana et al. saw that many of these inferred patterns could be interpreted in a meaningful way and related to properties of the proteins. For example, some were related to known twists and loops that are commonly found in proteins; others could be linked to specific activities. This level of interpretability is somewhat at odds with the results of other common methods used in machine learning, which tend to behave more like a ‘black box’.
Using their RBM, Tubiana et al. then proposed how to design new proteins that may prove to have interesting features. In the future, similar methods could be applied across computational biology as a way to make sense of complex data in an understandable way.
https://doi.org/10.7554/eLife.39397.002Introduction
In recent years, the sequencing of many organisms' genomes has led to the collection of a huge number of protein sequences, which are catalogued in databases such as UniProt or PFAM Finn et al., 2014). Sequences that share a common ancestral origin, defining a family (Figure 1A), are likely to code for proteins with similar functions and structures, providing a unique window into the relationship between genotype (sequence content) and phenotype (biological features). In this context, various approaches have been introduced to infer protein properties from sequence data statistics, in particular aminoacid conservation and coevolution (correlation) (Teppa et al., 2012; de Juan et al., 2013).
A major objective of these approaches is to identify positions carrying amino acids that have critical impact on the protein function, such as catalytic sites, binding sites, or specificitydetermining sites that control ligand specificity. Principal component analysis (PCA) of the sequence data can be used to unveil groups of coevolving sites that have a specific functional role Russ et al., 2005; Rausell et al., 2010; Halabi et al., 2009. Other methods rely on phylogeny Rojas et al., 2012, entropy (variability in aminoacid content) Reva et al., 2007, or a hybrid combination of both Mihalek et al., 2004; Ashkenazy et al., 2016.
Another objective is to extract structural information, such as the contact map of the threedimensional fold. Considerable progress was brought by maximumentropy methods, which rely on the computation of direct couplings between sites reproducing the pairwise coevolution statistics in the sequence data Lapedes et al., 1999; Weigt et al., 2009; Jones et al., 2012; Cocco et al., 2018. Direct couplings provide very good estimators of contacts Morcos et al., 2011; Hopf et al., 2012; Kamisetty et al., 2013; Ekeberg et al., 2014 and capture the pairwise epistasis effects necessary to model the fitness changes that result from mutations Mann et al., 2014; Figliuzzi et al., 2016; Hopf et al., 2017.
Despite these successes, we still do not have a unique, accurate framework that is capable of extracting the structural and functional features common to a protein family, as well as the phylogenetic variations specific to subfamilies. Hereafter, we consider Restricted Boltzmann Machines (RBM) for this purpose. RBM are a powerful concept coming from machine learning Ackley et al., 1987; Hinton, 2012; they are unsupervised (sequence data need not be annotated) and generative (able to generate new data). Informally speaking, RBM learn complex data distributions through their statistical features (Figure 1B).
In the present work, we have developed a method to train efficiently RBM from protein sequence data. To illustrate the power and versatility of RBM, we have applied our approach to the sequence alignments of 20 different protein families. We report the results of our approach, with special emphasis on four families — the Kunitz domain (a protease inhibitor that is historically important for protein structure determination Ascenzi et al., 2003, the WW domain (a short module binding different classes of ligands (Sudol et al., 1995, Hsp70 (a large chaperone protein Bukau and Horwich, 1998), and latticeprotein in silico data Shakhnovich and Gutin, 1990; Mirny and Shakhnovich, 2001 — to benchmark our approach on exactly solvable models Jacquin et al., 2016. Our study shows that RBM are able to capture: (1) structurerelated features, be they local (such as tertiary contacts), extended such as secondary structure motifs ($\alpha $helix and $\beta $sheet)) or characteristic of intrinsically disordered regions; (2) functional features, that is groups of amino acids controling specificity or activity; and (3) phylogenetic features, related to subfamilies sharing evolutionary determinants. Some of these features involve only two residues (as direct pairwise couplings do), others extend over large and not necessarily contiguous portions of the sequence (as in collective modes extracted with PCA). The pattern of similarities of each sequence with the inferred features defines a multidimensional representation of this sequence, which is highly informative about the biological properties of the corresponding protein (Figure 1C). Focusing on representations of interest allows us, in turn, to design new sequences with putative functional properties. In summary, our work shows that RBM offer an effective computational tool that can be used to characterize and exploit quantitatively the genotype–phenotype relationship that is specific to a protein family.
Results
Restricted Boltzmann Machines
Definition
A Restricted Boltzmann Machine (RBM) is a joint probabilistic model for sequences and representations (see Figure 1C). It is formally defined on a bipartite, twolayer graph (Figure 1B). Protein sequences $\mathbf{\mathbf{v}}=({v}_{1},{v}_{2},\mathrm{\dots},{v}_{N})$ are displayed on the Visible layer, and representations $\mathbf{\mathbf{h}}=({h}_{1},{h}_{2},\mathrm{\dots},{h}_{M})$ on the Hidden layer. Each visible unit takes one out of 21 values (20 amino acids + 1 alignment gap). Hiddenlayer unit values ${h}_{\mu}$ are real. The joint probability distribution of $\mathbf{\mathbf{v}},\mathbf{\mathbf{h}}$ is:
up to a normalization constant. Here, the weight matrix ${w}_{i\mu}$ couples the visible and the hidden layers, and ${g}_{i}({v}_{i})$ and ${\mathcal{U}}_{\mu}({h}_{\mu})$ are local potentials biasing the values of, respectively, the visible and the hidden variables (Figure 1B,D).
From sequence to representation, and back
Given a sequence $\mathbf{\mathbf{v}}$ on the visible layer, the hidden unit receives the input
This expression is analogous to the score of a sequence with a positionspecific weight matrix. Large positive or negative ${I}_{\mu}$ values signal a good match between the sequence and, respectively, the positive and the negative components of the weights attached to unit $\mu $, whereas small ${I}_{\mu}$ values correspond to a bad match.
The input ${I}_{\mu}$ determines, in turn, the conditional probability of the activity ${h}_{\mu}$ of the hidden unit,
up to a normalization constant. The nature of the potential $\mathcal{U}$ is crucial in determining how the average activity $h$ varies with the input $I$ (see Figure 1E and below).
In turn, given a representation (set of activities) $\mathbf{\mathbf{h}}$ on the hidden layer, the residues on site $i$ are distributed according to:
Hidden units with large activities ${h}_{\mu}$ strongly bias this probability, and favor values of ${v}_{i}$ corresponding to large weights ${w}_{i\mu}({v}_{i})$.
Use of Equation (3) allows us to sample the representation space given a sequence, while Equation (4) defines the sampling of sequences given a representation (see both directions in Figure 1C). Iterating this process generates highprobability representations, which, in turn, produce very likely sequences, and so on.
Probability of a sequence
The probability of a sequence, $P(\mathbf{\mathbf{v}})$, is obtained by summing (integrating) $P(\mathbf{\mathbf{v}},\mathbf{\mathbf{h}})$ over all its possible representations $\mathbf{\mathbf{h}}$.
where ${\mathrm{\Gamma}}_{\mu}(I)=\mathrm{log}\int \mathit{d}h{e}^{{U}_{\mu}(h)+hI}$ is the cumulantgenerating function associated with the potential ${\mathcal{U}}_{\mu}$ and is a function of the input to hidden unit $\mu $ (see Equation (2)).
For quadratic potentials ${\mathcal{U}}_{\mu}(h)=\frac{{\gamma}_{\mu}}{2}{h}^{2}+{\theta}_{\mu}h$ (Figure 1E), the conditional probability $P({h}_{\mu}\mathbf{\mathbf{v}})$ is Gaussian, and the RBM is said to be Gaussian. The cumulantgenerating functions ${\mathrm{\Gamma}}_{\mu}(I)=\frac{1}{2{\gamma}_{\mu}}{(I{\theta}_{\mu})}^{2}$ are quadratic, and their sum in Equation (5) gives rise to effective pairwise couplings between the visible units, ${J}_{ij}({v}_{i},{v}_{j})={\sum}_{\mu}\frac{1}{{\gamma}_{\mu}}{w}_{i\mu}({v}_{i}){w}_{j\mu}({v}_{j})$. Hence, a Gaussian RBM is equivalent to a HopfieldPotts model Cocco et al., 2013, where the number $M$ of hidden units plays the role of the number of HopfieldPotts ‘patterns’.
Nonquadratic potentials ${\mathcal{U}}_{\mu}$, and, hence, nonquadratic $\mathrm{\Gamma}(I)$, introduce couplings to all orders between the visible units, all generated from the weights ${w}_{i\mu}$. RBM thus offer a practical way to go beyond pairwise models, and express complex, highorder dependencies between residues, based on the inference of a limited number of interaction parameters (controlled by $M$). In practice, for each hidden unit, we consider the class of 4parameter potentials,
hereafter called double Rectified Linear Unit (dReLU) potentials (Figure 1E). Varying the parameters allows us to span a wide class of behaviors, including quadratic potentials, doublewell potentials (leading to bimodal distributions for ${h}_{\mu}$) and hard constraints (e.g. preventing ${h}_{\mu}$ from being negative).
RBM can thus be thought of both as a framework to extract representations from sequences through Equation (3), and as a way to model complex interactions between residues in sequences through Equation (5). They constitute a natural candidate to unify (and improve) PCAbased and directcouplingbased approaches to protein modeling.
Learning
The weights ${w}_{i\mu}$ and the defining parameters of the potentials ${g}_{i}$ and ${\mathcal{U}}_{\mu}$ are learned by maximizing the average logprobability ${\u27e8\mathrm{log}P(\mathbf{\mathbf{v}})\u27e9}_{MSA}$ of the sequences $\mathbf{\mathbf{v}}$ in the Multiple Sequence Alignment (MSA). In practice, estimating the gradients of the average logprobability with respect to these parameters requires sampling from the model distribution $P(\mathbf{\mathbf{v}})$, which is done through Monte Carlo simulation of the RBM (see 'Materials and methods').
We also introduce penalty terms over the weights ${w}_{i\mu}(v)$ (and the local potentials ${g}_{i}(v)$ on visible units) to avoid overfitting and to promote sparse weights. Sparsity facilitates the biological interpretation of weights and, thus, emphasizes the correspondence between representation and phenotypic spaces (Figure 1C). Crucially, imposing sparsity also forces the RBM to learn a socalled compositional representation, in which each sequence is characterized by a subset of strongly activated hidden units, which are of size large compared to 1 but small compared to $M$ (Tubiana and Monasson, 2017. All technical details about the learning procedure are reported in the 'Materials and methods'.
In the next sections, we present results for selected values of the number of hidden units and of the regularization penalty. The values of these (hyper)parameters are justified afterwards.
Kunitz domain
Description
The majority of natural proteins are obtained by concatenating functional building blocks, called protein domains. The Kunitz domain, with a length of about 50–60 residues (protein family PF00014 Finn et al., 2014)) is present in several genes and its main function is to inhibit serine proteases such as trypsin. Kunitz domains play a key role in the regulation of many important processes in the body, such as tissue growth and remodeling, inflammation, body coagulation and fibrinolysis. They are implicated in several diseases, such as tumor growth, Alzheimer's disease, and cardiovascular and inflammatory diseases and, therefore, have been largely studied and shown to have a large potential in drug design Shigetomi et al., 2010; Bajaj et al., 2001).
Some examples of proteins containing a Kunitzdomain include the Basic Pancreatic Trypsin Inhibitor (BPTI, which has one Kunitz domain), Bikunin (two domains) Fries and Blom, 2000, Hepatocyte growth factor activator inhibitor (HAI, two domains) and tissue factor pathway inhibitor (TFPI, three domains) Shigetomi et al., 2010; Bajaj et al., 2001).
Figure 2A shows the MSA sequence logo and the secondary structure of the Kunitz domain. It is characterized by two $\alpha $ helices and two $\beta $ strands. cysteinecysteine disulfide bridges largely contribute to the thermodynamic stability of the domain, as frequently observed in small proteins. The structure of BPTI was the first one ever resolved (Ascenzi et al., 2003, and is often used to benchmark folding predictions on the basis of simulations Levitt and Warshel, 1975) and coevolutionary approaches Morcos et al., 2011; Hopf et al., 2012; Kamisetty et al., 2013; Cocco et al., 2013; Haldane et al., 2018. We train a RBM with $M=100$ dReLU on the MSA of PF00014, constituted by $B=8062$ sequences with $N=53$ consensus sites.
Inferred weights and interpretations
Figure 2B shows the weights ${w}_{i\mu}(v)$ attached to five selected hidden units. Each logo identifies the aminoacid motifs in the sequences $\mathbf{\mathbf{v}}$ that give rise to large (positive or negative) inputs ($I$) onto the associated hidden unit( see Equation (2)).
Weight 1 in Figure 2B has large components on sites 45 and 49 that are in contact in the final ${\alpha}_{2}$ helix (Figure 2A and D). The distribution of the inputs (${I}_{1}$) partitions the MSA into three subfamilies (Figure 2C, top panel, dark blue histogram). The two peaks in ${I}_{1}\simeq 2.5$ and ${I}_{1}\simeq 1.5$ identify sequences in which the contact is due to an electrostatic interaction with, respectively, $(+,)$ and $(,+)$ charged amino acids on sites 45 and 49; the other peak in ${I}_{1}\simeq 0$ identifies sequences realizing the contact differently, for example with an aromatic amino acid on site 45. Weight 1 also shows a weaker electrostatic component on site 53 (Figure 2B); the foursite separation interval between sites 45, 49– and 53 fits well with the average helix turn of 3.6 amino acids (Figure 2D).
Weight 2 focuses on the contact between residues 11 and 35, realized in most sequences by a CC disulfide bridge (Figure 2B and a negative ${I}_{2}$ peak in Figure 2C (top). A minority of sequences in the MSA, corresponding to ${I}_{2}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$ and mostly coming from nematode organisms (Appendix 1—figure 19), do not show the CC bridge. A subset of these sequences strongly and positively activate hidden unit 3 (Appendix 1—figure 19 and ${I}_{3}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$ peak in Figure 2C). Positive components in the weight 3 logo suggest that these proteins stabilize their structure through electrostatic interactions between sites 10 ($$ charge) and site 33–36 (+ charges both) (see Figure 2B and D) that compensates for the absence of a C–C bridge on the neighbouring sites 11–35.
Weight 4 describes a feature that is mostly localized on the loop preceding the β_{1}β_{2} strands (sites 7 to 16) (see Figure 2B and D). Structural studies of the trypsin–trypsin inhibitor complex have shown that this loop binds to proteases Marquart et al., 1983): site 12 is in contact with the active site of the protease and is therefore key to the inhibitory activity of the Kunitz domain. The two amino acids (R, K) having a large positive contribution to weight 4 in position 12 are basic and bind to negatively charged residues (D, E) on the active site of trypsinlike serine proteases. Although several Kunitz domains with known trypsin inhibitory activity, such as BPTI, TFPI, TPPI2 and so on, give rise to large and positive inputs (${I}_{4}$), Kunitz domains with no trypsin/chymotrypsin inhibition activity, such as those associated with the COL7A1 and COL6A3 genes Chen et al., 2001; Kohfeldt et al., 1996, correspond to negative or vanishing values of ${I}_{4}$. Hence, hidden unit 4 possibly separates the Kunitz domains that have trypsinlike protease inhibitory activity from the others.
This interpretation is also in agreement with mutagenesis experiments carried out on sites 7 to 16 to test the inhibitory effects of Kunitz domains BPT1, HAI1, and TFP1 against trypsinelike proteases Bajaj et al., 2001; Kirchhofer et al., 2003; Shigetomi et al., 2010; Grzesiak et al., 2000; Chand et al., 2004). Kirchhofer et al. (2003) showed that mutation R12A on the first domain (out of two) of HAI1 destroyed its inhibitory activity; a similar effect was observed with R12X, where X is a nonbasic residue, in the first two domains (out of three) of TFP1 as discussed by Bajaj et al. (2001). Grzesiak et al. (2000) showed that for BPTI, the mutations G9F, G9S, G9P reduced its affinity with human serine proteases . Conversely, in Kohfeldt et al. (1996) it was shown that the set of mutations P10R, D13A & F14R could convert the COL6A3 domain into a trypsin inhibitor. All of these results are in agreement with the above interpretation and the logo of weight 4. Note that, although several sequences have large ${I}_{4}$ (top histogram in Figure 2C), many correspond to small or negative values. This may be explained by the facts that (i) many of the Kunitz domains analyzed are present in two or more copies, and as such, not all of them need to bind strongly to trypsin (Bajaj et al., 2001 and (ii) a Kunitz domain may have other specificities that are encoded by other hidden units. In particular, weight 34 in 'Supporting Information', displays on site 12 large components that are associated with medium to largesized hydrophobic residues (L, M, Y), and is possibly related to other serine protease specificity classes such as chymotrypsin (Appel, 1986).
Weight 5 codes for a complex extended mode. To interpret this feature, we display in Figure 2C (bottom histogram) the distributions of Hamming distances between all pairs of sequences in the MSA (gray histograms) and between the 100 sequences $\mathbf{\mathbf{v}}$ in the MSA with largest inputs ${I}_{\mu}(\mathbf{\mathbf{v}})$ to the corresponding hidden unit (light blue histograms). For hidden unit 5, the distances between those topinput sequences are smaller than those between random sequences in the MSA, suggesting that weight 5 is characteristic of a cluster of closely related sequences. Here, these sequences correspond to the protein Bikunin, which is present in most mammals and some other vertebrates Shigetomi et al., 2010. Conversely, for other hidden units (e.g. 1,2), both histograms are quite similar, showing that the corresponding weight motifs are found in evolutionary distant sequences.
The five weights above were chosen on the basis of several criteria. (i) Weight norm, which is a proxy for the relevance of the hidden unit. Hidden units with larger weight norms contribute more to the likelihood, whereas weights with low norms may arise from noise or overfitting. (ii) Weight sparsity. Hidden units with sparse weights are more easily interpretable in terms of structural or functional constraints. (iii) Shape of input distributions. Hidden units with multimodal input distributions separate the family into subfamilies, and are therefore potentially interesting. (iv) Comparison with available literature. (v) Diversity. The remaining 95 inferred weights are shown in the 'Supporting Information'. We find a variety of both structural features, (for example pairwise contacts as in weights 1 and 2, that are also reminiscent of the localized, loweigenvalue modes of the HopfieldPotts model Cocco et al., 2013)) and phylogenetic features (activated by evolutionary related sequences as hidden unit 5). The latter, in particular, include stretches of gaps, mostly located at the extremities of the sequence Cocco et al., 2013. Several weights have strong components on the same sites as weight 4, showing the complex pattern of amino acids that controls binding affinity.
WW domain
Description
WW is a protein–protein interaction domain, found in many eukaryotes and human signaling proteins, that is involved in essential cellular processes such as transcription, RNA processing, protein trafficking, and receptor signaling. WW is a short domain of length 30–40 aminoacids (Figure 3A, PFAM PF00397, $B=7503$ sequences, $N=31$ consensus sites), which folds into a threestranded antiparallel $\beta $sheet. The domain name stems from the two conserved tryptophans (W) at positions 5–28 (Figure 3A), which serve as anchoring sites for the ligands. WW domains bind to a variety of proline (P)rich peptide ligands, and can be divided into four groups on the basis of their preferential binding affinity (Sudol and Hunter, 2000. Group I binds specifically to the PPXY motif, where X is any amino acid; Group II to PPLP motifs; Group III to prolineargininecontaining sequences (PR); Group IV to phosphorylated serine/threonineproline sites (p(S/T)P). The modulation of binding properties allow hundreds of WW domain to specifically interact with hundreds of putative ligands in mammalian proteomes.
Inferred weights and interpretation
Four weight logos of the inferred RBM are shown in Figure 3B; the remaining 96 weights are given in the 'Supporting Information'. Weight 1 codes for a contact between sites 4 & 22, which is realized either by two amino acids with oppositive charges (${I}_{1}\phantom{\rule{thinmathspace}{0ex}}<\phantom{\rule{thinmathspace}{0ex}}0$) or by one small and one negatively charged amino acid (${I}_{1}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$). Weight 2 shows a $\beta $sheet–related feature, with large entries defining a set of mostly hydrophobic (${I}_{2}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$) or hydrophilic (${I}_{2}\phantom{\rule{thinmathspace}{0ex}}<\phantom{\rule{thinmathspace}{0ex}}0$) residues localized on the ${\beta}_{1}$ and ${\beta}_{2}$ strands (Figure 3B) and in contact on the 3D fold (see Figure 3D). The activation histogram in Figure 3C, with a large peak on negative ${I}_{2}$, suggests that this part of the WW domain is exposed to the solvent in most, but not all, natural sequences.
Weights 3 and 4 are supported by sites on the ${\beta}_{2}$${\beta}_{3}$ binding pocket and on the ${\beta}_{1}$${\beta}_{2}$ loop of the WW domain. The distributions of activities in Figure 3C highlight different groups of sequences in the MSA that strongly correlate with experimental ligandtype identification (see Figure 3E). We find that: (i) Type I domains are characterized by ${I}_{3}\phantom{\rule{thinmathspace}{0ex}}<\phantom{\rule{thinmathspace}{0ex}}0$ and ${I}_{4}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$; (ii) Type II/III domains are characterized by ${I}_{3}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$ and ${I}_{4}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$; (iii) there is no clear distinction between Type II and Type III domains; and (iv) Type IV domains are characterized by ${I}_{3}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$ and ${I}_{4}\phantom{\rule{thinmathspace}{0ex}}<\phantom{\rule{thinmathspace}{0ex}}0$. These findings are in good agreement with various studies:
Mutagenesis experiments have shown the importance of sites 19, 21, 24 and 26 for binding specificity Espanel and Sudol, 1999; Fowler et al., 2010). For the YAP1 WW domain, as confirmed by various studies (see table 2 in Fowler et al., 2010), the mutations H21X and T26X reduce the binding affinity to Type I ligands, whereas Q24R increases it and S12X has no effect. This is in agreement with the negative components of weight 3 (Figure 3B): ${I}_{3}$ increases upon mutations H21X and T26X, decreases upon Q24R and is unaffected by S12X. Moreover the mutation L19W alone, or in combination with H21[D/G/K/R/S] could switch the specificity from Type I to Type II/III Espanel and Sudol, 1999. These results are consistent with Figure 3E: YAP1 (blue cross) is of Type I but one or two mutations move it to the right side, closer to the other cluster (orange crosses). Espanel and Sudol (1999) also proposed that Type II/III specifity required the presence of an aromatic amino acid (W/F/Y) on site 19, in good agreement with weight 3.
The distinction between Types II and III is unclear in the literature, because WW domains often have high affinity with both ligand types.
Several studies Russ et al., 2005; Kato et al., 2002; Jäger et al., 2006) have demonstrated the importance of the ${\beta}_{1}$${\beta}_{2}$ loop for achieving Type IV specificity, which requires a longer, more flexible loop, as opposed to a short rigid loop for other types. The length of the loop is encoded in weight 4 through the gap symbol on site 13: short and long loops correspond to, respectively, positive and negative ${I}_{4}$. The importance of residues R11 and R13 was shown by Kato et al. (2002) and Russ et al. (2005), where removing R13 of Type IV hPin1 WW domain reduced its binding affinity to [p(S/T)P] ligands. These observations agree with the logo of weight 4, which authorizes substitutions between K and R on sites 11 and 13.
A specificityrelated sector of eight sites was identified in Russ et al. (2005), five of which carry the top entries of weight 3 (green balls in Figure 3D). Our approach not only provides another specificityrelated feature (weight 4) but also the motifs of amino acids that affectType I and IV specificity, in good agreement with the experimental findings of Russ et al. (2005).
Hsp70 protein
Description
70kDa heat shock proteins (Hsp70) form a highlyconserved family that is represented in essentially all organisms. Hsp70, together with other chaperone proteins, perform a variety of essential functions in the cell: they can assist the folding and assembly of newly synthetized proteins, trigger refolding cycles of misfolded proteins, transport unfolded proteins through organelle membranes, and when necessary, deliver nonfunctional proteins to the proteasome, endosome or lysosome for recycling Bukau and Horwich, 1998; Young et al., 2004; Zuiderweg et al., 2017. There are 13 HSP70s proteinencoding genes in humans, differing by where (nucleus/cytoplasm, mitochondria or endoplasmic reticulum) and when they are expressed. Some, such as HSPA8 (Hsc70), are constitutively expressed whereas others, such as HSPA1 and HSPA5, are stressinduced (respectively by heat shock and glucose deprivation). Notably, Hsc70 can make up to 3% of the total total mass of proteins within the cell, and thus is one of its most important housekeeping genes. Structurally, Hsp70 are multidomain proteins of ength of 600–670 sites (631 for the E. coli DNaK gene). They consist of:
A Nucleotide Binding Domain (NBD, 400 sites) that can bind and hydrolyse ATP.
A Substrate Binding Domain (SBD sites), folded in a betasandwich structure, which binds to the target peptide or protein.
A flexible, hydrophobic interdomainlinker linking the NBD and the SBD.
A LID domain, constituted by several (up to 5) $\alpha $ helices, which encapsulates the target protein and blocks its release.
An unstructured Cterminal tail of variable length, which is important for detection and interaction with other cochaperones, such as Hop proteins (Scheufler et al., 2000.
Hsp70 functions by adopting two different conformations (see Figure 4A and B). When the NBD is bound to ATP, the NBD and the SBD are held together and the LID is open, such that the protein has low binding affinity for substrate peptides. After the hydrolysis of ATP to ADP, the NBD and the SBD detach from one another, and the LID is closed, yielding high binding affinity and effectively trapping the peptides between the SBD and the LID. By cycling between both conformations, Hsp70 can bind to misfolded proteins, unfold them by stretching (e.g. with two Hsp70 molecules bound at two ends of the protein) and release them for refold cycles. Since Hsp70 alone have low ATPase activity, this cycle requires another type of cochaperone, Jprotein, which simultaneously binds to the target protein and the Hsp70 to stimulate the ATPase activity of Hsp70, as well as a Nucleotide Exchange Factor (NEF) that favors conversion of the ADP back to ATP and hence release of the target protein (see Figure 1 in Zuiderweg et al. (2017)).
We constructed an MSA for HSP70 with $N=675$ consensus sites and $B=32,170$ sequences, starting from the seeds of Malinverni et al. (2015), and queried SwissProt and Trembl UniprotKB databases using HMMER3 Eddy, 2011. Annotated sequences were grouped on the basis of their phylogenetic origin and functional role. Prokaryotes mainly express two Hsp70 proteins: DnaK ($B=17,118$ sequences in the alignment), which are the prototype Hsp70, and HscA ($B=3,897$), which are specialized in chaperoning of ironsulfur cluster containing proteins. Eukaryotes' Hsp70 were grouped by their location of expression (mitochondria, $B=851$; chloroplasts, $B=416$; endoplasmic reticulum, $B=433$; nucleus or cytoplasm and others, $B=1,452$). We also singled out Hsp110 sequences, which, despite the high homology with Hsp70, correspond to nonallosteric proteins ($B=294$). We then trained a dReLU RBM over the full MSA with $M=200$ hidden units. We show below the weight logos, structures and input distributions for ten selected hidden units (see Figure 4 and Appendix 1—figures 21–26).
Inferred weights and interpretation
Weight 1 encodes a variability of the length of the loop within the IIB subdomain of the NBD, see stretch of gaps from sites 301 to 306. As shown in Figure 4D (projection along x axis), it separates prokaryotic DNaK proteins (for which the loop is 4–5 sites longer) from most eukaryotic Hsp70 proteins and from prokaryotic HscA. An additional hidden unit (Weight 6 in Appendix 1—figure 21) further separates eukaryotic Hsp70 from HscA proteins, whose loops are 4–5 sites shorter (distribution of inputs ${I}_{6}$ in Appendix 1—figure 26). This structural difference between the three families was previously reported and is of high functional importance to the NBD (Buchberger et al., 1994; Brehmer et al., 2001. Shorter loops increase the nucleotide exchange rates (and thus the release of target protein) in the absence of NEF, and the loop size controls interactions with NEF proteins Brehmer et al., 2001; Briknarová et al., 2001; Sondermann et al., 2001). Hsp70 proteins that have long and intermediate loop sizes interact specifically with GrpE and Bag1 NEF proteins, respectively, whereas short, HscAlike loops do not interact with any of them. This cochaperone specificity allows for functional diversification within the cell; for instance, eukaryotic Hsp70 proteins that are expressed within mitochondria and chloroplasts, such as the human gene HSPA9 and the Chlamydomonas reinhardtii HSP70B, share the long loop with prokaryotic DNaK proteins, and therefore do not interact with Bag proteins. Within the DNaK subfamily, two main variants of the loop can be isolated as well (Weight 7 in Appendix 1—figure 22), hinting at more NEFprotein specificities.
Weight 2 encodes a small collective mode localized on ${\beta}_{4}{\beta}_{5}$ strands, at the edge of the β sandwich within the SBD. The weights are quite large ($w\sim 2$), and the input distribution is bimodal, notably separating HscA and chloroplast Hsp70 (${I}_{2}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$) from mitochondrial Hsp70 and the other eukaryotic Hsp70 (${I}_{2}\phantom{\rule{thinmathspace}{0ex}}<\phantom{\rule{thinmathspace}{0ex}}0$). We note also a similarity in structural location and aminoacid content with weight 3 of the WW–domain, which controls binding specificity (Figure 3B). Although we have found no trace of this motif in the literature, this evidence suggests that it could be important for substrate binding specificity. Endoplasmicreticulumspecific Hsp70 proteins can also be separated from the other eukaryotic proteins by looking at appropriate hidden units (see Weight 8 in Appendix 1—figure 22 and the distribution of input ${I}_{8}$ in Appendix 1—figure 26).
RBM can also extract collective modes of coevolution spanning multiple domains, as shown by Weight 3 (Appendix 1—figure 21). The residues supporting Weight 3 (green spheres in Figure 4A and B) are physically contiguous in the ADP conformation, but not in the ATP conformation. Hence, Weight 3 captures interdomain coevolution between the SBD and the LID domains.
Weight 4 (sequence logo in Appendix 1—figure 21) also codes for a wide, inter–domain collective mode, which is localized at the interface between the SBD and the NBD domains. When the Hsp70 protein is in the ATP conformation, the sites carrying weight 4 are physically contiguous, whereas in the ADP state they are far apart (see yellow spheres in Figure 4A and B). Moreover, its input distribution (shown in Figure 4E), separates the nonallosteric Hsp110 subfamily (${I}_{4}\sim 0$) from the other subfamilies (${I}_{4}\sim 40$), suggesting that this motif is important for allostery. Several mutational studies have highlighted 21 important sites for allostery within E. coli DNaK Smock et al., 2010; seven of these positions carry the top entries of Weight 3, four appear in another Hsp110specific hidden unit (Weight 9 in Appendix 1—figure 22), and several others are highly conserved and do not coevolve at all.
Last, Weight 5 (Figure 4C) codes for a collective mode that is located mainly on the unstructured Cterminal tail, with a few sites on the LID domain. Its aminoacid content is strikingly similar across all sites: positive weights for hydrophilic residues (in particular, lysine) and negative weights for tiny, hydrophobic residues. Hydrophobicrich or hydrophilicrich sequences are found in the MSA (see Appendix 1—figure 28). This motif is consistent with the role of the tail in cochaperone interaction: hydrophobic residues are important for the formation of Hsp70–Hsp110 complexes via the Hop protein Scheufler et al., 2000. Highcharge content is also frequently encountered, and is the basis of a recognition mechanism, in intrinsically disordered protein regions Oldfield and Dunker, 2014. This could suggest the existence of different protein partners.
Some of the results presented here were previously obtained with other coevolutionary methods. In Malinverni et al. (2015), the authors showed that Direct Coupling Analysis could detect conformationspecific contacts; these are similar to hidden units 3 and 4 presented here which are located on contiguous sites in the ADPbound and ATPbound conformations, respectively. In Smock et al. (2010), an interdomain sector of sites discriminating between allosteric and nonallosteric sequences was found. This sector shares many sites with our weight 4, and is also localized at the SBD/NBD edge. However, only a sector could be retrieved with sector analysis, whereas many other meaningful collective modes could be extracted using RBM.
Sequence design
The biological interpretation of the features inferred by the RBM guides us to sample new sequences $\mathbf{\mathbf{v}}$ with putative functionalities. In practice, we sample from the conditional distribution $P(\mathbf{\mathbf{v}}\mathbf{\mathbf{h}})$, Equation (4), where a few hiddenunit activities in the representation $\mathbf{\mathbf{h}}$ are fixed to desired values, whereas the others are sampled from Equation (3). For WW domains, we condition on the activities of hidden units 3 and 4, which are related to binding specificity. Fixing ${h}_{3}$ and ${h}_{4}$ to levels corresponding to the peaks in the histograms of inputs in Figure 3C allows us to generate sequences belonging specifically to each one of the three ligandspecificity clusters (see Figure 5A).
In addition, sequences with combinations of activities that are not encountered in the natural MSA can be engineered. As an illustration, we used conditional sampling to generate hybrid WWdomain sequences with strongly negative values of ${h}_{3}$ and ${h}_{4}$, corresponding to a Type Ilike ${\beta}_{2}$${\beta}_{3}$ binding pocket and a long, Type IVlike ${\beta}_{1}$${\beta}_{2}$ loop (see Figure 5A and B).
For Kunitz domains, the property ‘no 11–35 disulfide bond’ holds only for some sequences of nematode organisms, whereas the BikuninAMBP gene is present only in vertebrates; the two corresponding motifs are thus never observed simultaneously in natural sequences. Sampling our RBM conditioned to appropriate levels of ${h}_{2}$ and ${h}_{5}$ allows us to generate sequences with both features activated (see Figure 5C and D).
The sequences designed by RBM are far away from all natural sequences in the MSA, but have comparable probabilities (see Figure 5E (WW) and Figure 5F (Kunitz)). Their probabilities estimated with pairwise directcoupling models (trained on the same data), whose ability to identify functional and artificial sequences has already been tested (Balakrishnan et al., 2011; Cocco et al., 2018 andare also large (see Appendix 1—figure 7).
Our RBM framework can also be modified to design sequences with very high probabilities, even larger than in the MSA, by appropriate duplication of the hidden units (see 'Materials and methods'). This trick can be combined with conditional sampling (see Figure 5E and F).
Contact predictions
As illustrated above, the cooccurrence of large weight components in highly sparse features often corresponds to nearby sites on the 3D fold. To extract structural information in a systematic way, we use our RBM to derive effective pairwise interactions between sites, which can then serve as estimators for contacts as approaches that are based on directcoupling Cocco et al., 2018. The derivation is sketched in Figure 6A. We consider a sequence ${\mathbf{\mathbf{v}}}^{a,b}$ with residues a and b on, respectively, sites i and j. Single mutations $a\to {a}^{\prime}$ or $b\to {b}^{\prime}$ on, respectively, site i or j are accompanied by changes in the log probability of the sequence (indicated by the full arrows in Figure 6A). Comparison of the change resulting from the double mutation with the sum of the changes resulting from the two single mutations provides our RBMbased estimate of the epistatic interaction (see Equations (15,16) in 'Materials and methods'). These interactions are well correlated with the outcomes of the DirectCoupling Analysis (see Appendix 1—figure 9).
Figure 6 shows that the quality of the prediction of the contact maps of the Kunitz (Figure 6B) and the WW (Figure 6C) domains with RBM is comparable to stateoftheart methods based on direct couplings (Morcos et al., 2011); predictions for longrange contacts are reported in Appendix 1—figure 10. The quality of contact prediction with RBM:
Does not seem to depend much on the choice of the hiddenunit potential see the Gaussian and dReLU PPV performances in Figure 6B,C and D, although the latter have better performance in terms of sequence scoring than the former (see Appendix 1—figures 1, 2 and 5).
Strongly increases with the number of hidden units (see Appendix 1—figures 11,12). This dependence is not surprising, as the number $M$ of hidden units acts in practice as a regularizor over the effective coupling matrix between residues. In the case of Gaussian RBM, the value of $M$ fixes the maximal rank of the matrix ${J}_{ij}({v}_{i},{v}_{j})$ (see 'Materials and methods'). The value $M=100$ of the number of hidden units is small compared to the maximal ranks $R=20\times N$ of the couplings matrices of the Kunitz ($R=1060$) and WW ($R=620$) domains, and explains why DirectCoupling Analysis gives slightly better performance than RBM in the contact predictions of Figure 6B and C.
Worsens with stronger weightsparsifying regularizations (see Appendix 1—figure 12) as expected.
We further tested RBM distant contact predictions in a fully blind setting on the 17 protein families (the Kunitz domain plus 16 other domains) that were used for to benchmark plmDCA (Ekeberg et al., 2014), a stateoftheart procedure for inferring pairwise couplings in DirectCoupling Analysis. The number of idden units was fixed to $M=0.3R$, that is proportionally to the domain lengths, and the regularization strength was fixed to ${\lambda}_{1}^{2}=0.1$. Contact predictions averaged over all families are reported in Figure 6D for different choices of the hiddenunit potentials (Gaussian and dReLU). We find that performances are comparable to those of plmDCA, but the computational cost of training RBM is substantially higher.
Benchmarking on lattice proteins
Lattice protein (LP) models were introduced in the ${90}^{\prime}s$ to study protein folding and design (Mirny and Shakhnovich, 2001. In one of those models Shakhnovich and Gutin, 1990, a ‘protein’ of $N=27$ amino acids may fold into $\sim {10}^{5}$ distinct structures on a $3\times 3\times 3$ cubic lattice, with probabilities depending on its sequence (see 'Materials and methods' and Figure 7A and B). LP sequence data were used to benchmark the DirectCoupling Analysis in Jacquin et al. (2016), and we follow the same approach here to assess the performances of RBM in a case where the ground truth is known. We first generate a MSA containing sequences that have large probabilities (${p}_{nat}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0.99$) of folding into one structure shown in Figure 7A (Jacquin et al., 2016). A RBM with $M=100$ dReLU hidden units is then learned, (see Appendix 1 for details about regularization and crossvalidation).
Various structural LP features are encoded by the weights as in real proteins, including complex negativedesign related modes (see Figure 7C and D and the remaining weights in 'Supporting Information'). The performances in terms of contact predictions are comparable to stateofthe art methods on LP (see Appendix 1—figure 11).
The capability of RBM to design new sequences that have desired features and high values of fitness, exactly computable in LP as the probability of folding into the native structure in Figure 7A, can be quantitatively assessed. Conditional sampling allows us to design sequences with specific hiddenunit activity levels, or combinations of features that are not found in the MSA (Figure 7E). These designed sequences are diverse and have large fitnesses, comparable to those of the MSA sequences and even higher when generated by duplicated RBM (Figure 7F), and well correlated with the RBM probabilities $P(\mathbf{\mathbf{v}})$ (Appendix 1—figure 6).
Crossvalidation of the model and interpretability of the representations
Each RBM was trained on a randomly chosen subset of 80% of the sequences in the MSA, while the remaining 20% (the test set) were used for validation of its predictive power. In practice, we compute the average logprobability of the test set to assess the performances of the RBM for various values of the number $M$ of hidden units, for the regularization strength ${\lambda}_{1}^{2}$ and for different hiddenunit potentials. Results for the WW and Kunitz domains and for Lattice Proteins are reported in Figure 8 and in Appendix 2 (Model Selection). The dReLU potential, which includes quadratic and Bernoulli (another popular choice for RBM) potentials as special cases, is consistently better than the quadratic and Bernoulli potentials individually. As expected, increasing $M$ allows RBM to capture more features in the data distribution and, therefore, improves performances up to a point, after which overfitting starts to occur.
The impact of the regularization strength ${\lambda}_{1}^{2}$ favoring weight sparsity (see definition in 'Materials and methods' Equation (8)) is twofold (see Figure 8A for the WW domain). In the absence of regularization (${\lambda}_{1}^{2}=0$) weights have components on all sites and residues, and the RBM overfit the data, as illustrated by the large difference between the logprobabilities of the training and test sets. Overfitting notably results in generated sequences that are close to the natural ones and not very diverse, as seen from the entropy of the sequence distribution (Appendix 1—figure 8). Imposing mild regularization allows the RBM to avoid overfitting and maximizes the logprobability of the test set (${\lambda}_{1}^{2}=0.03$ in Figure 8A), but most sites and residues carry nonzero weights. Interestingly, imposing stronger regularizations has low impact on the generalization abilities of RBM (resulting in a small decrease in the test set logprobability), while making weights much sparser (${\lambda}_{1}^{2}=0.25$ in Figure 3). For regularizations that are too large, too few nonzero weights remain available and the RBM is not powerful enough to model the data adequately (causing a drop in logprobability of the test set).
Favoring sparser weights in exchange for a small loss in logprobability has a deep impact on the nature of the representation of the sequence space by the RBM (see Figure 8B). Good representations are expected to capture the invariant properties of sequences across evolutionarily divergent organisms, rather than idiosyncratic features that are attached to a limited set of sequences (mixture model in Figure 8C). For sparseenough weights, the RBM is driven into the compositional representation regime (see Tubiana and Monasson, 2017) of Figure 8E, in which each hidden unit encodes a limited portion of a sequence and the representation of a sequence is defined by the set of hidden units with strong inputs. Hence, the same hidden unit (e.g. weights 1 and 2 coding for the realizations of contacts in the Kunitz domain in Figure 2B) can be recruited in many parts of the sequence space corresponding to very diverse organisms (see bottom histograms attached to weights 1 and 2 in Figure 2C, which shows that the sequences corresponding to strong inputs are scattered all over the sequence space). In addition, silencing or activating one hidden unit affects only a limited number of residues (contrary to the entangled regime of Figure 8D), and a large diversity of sequences can be generated through combinatorial choices of the activity states of the hidden units, an approach that guarantees efficient sequence design.
In addition, inferring sparse weights makes their comparison across many different protein families easier. In Figure 9 and 10, we show some representative weights that were obtained after training RBMs with the MSAs of the 16 families considered by Ekeberg et al. (2014) (the 17th family, the Kunitz domain, is shown in Figure 2), which were chosen to illustrate the broad classes of encountered motifs; see 'Supporting information' for the other top weights of the 16 families. We find that weights may code for a variety of structural properties:
Pairwise contacts on the corresponding structures, realized by various types of residueresidue physicochemical interactions (see Figure 9A and B). These motifs are similar to weights 2 of the Kunitz domain (Figure 2B) and weight 1 of the WW domain (Figure 3B).
Structural triplets, carrying residues in proximity either on the tertiary structure or on the secondary structure (see Figure 9C,D,E and F). Many such triplets arise from electrostatic interactions and carry amino acids with alternating charges (Figure 9C,D and E); they are often found in αhelices and reflect their $\sim 4$site periodicity (Figure 9E and last two sites in Figure 9D), in agreement with weight 1 of the Kunitz domain (Figure 2B). Triplets may also involve residues with nonelectrostatic interactions (Figure 9F).
Other structural motifs involving four or more residues, for example between βstrands (see Figure 9G). Such motifs were also found in the WW domain (see weight 2 in Figure 3B).
In addition, weights may also reflect nonstructural properties, such as:
Stretches of gaps at the extremities of the sequences, indicating the presence of subfamilies containing shorter proteins (see Figure 10A and B).
Stretches of gaps in regions corresponding to internal loops of the proteins (see Figure 10C and D). These motifs control the length of these loops, similarly to weight 1 of HSP70 (see Figure 4C).
Contiguous residue motifs on loops (Figure 10E and F) and β–strands (Figure 10G). These motifs could be involved in binding specificity, as found in the Kunitz and WW domains (weights 4 in Figure 2B and 3B).
Phylogenetic properties shared by a subset of evolutionary close sequences (see bottom histograms Figure 10H and I), contrary to the motifs listed above. These motifs are generally less sparse and scattered over the protein sequence, as weight 5 of the Kunitz domain in Figure 2B.
For all those motifs, the top histograms of the inputs on the corresponding hidden units indicate how the protein families cluster into distinct subfamilies with respect to the features.
Discussion
In summary, we have shown that RBM are a promising, versatile, and unifying method for modeling and generating protein sequences. RBM, when trained on protein sequence data, reveal a wealth of structural, functional and evolutionary features. To our knowledge, no other method used to date has been able to extract such detailed information in a unique framework. In addition, RBM can be used to design new sequences: hidden units can be seen as representationcontroling knobs, that are tunable at will to sample specific portions of the sequence space corresponding to desired functionalities. A major and appealing advantage of RBM is that the twolayer architecture of the model embodies the very concept of genotypephenotype mapping (Figure 1C). Codes for learning and visualizing RBM are attached to this publication (see 'Materials and methods').
From a machinelearning point of view, the values of RBM that define parameters (such as class of potentials and number $M$ of hidden units, or regularization penalties) were selected on the basis of the logprobability of a test set of natural sequences not used for training and on the interpretability of the model. The dReLU potentials that we introduced in this work (Equation (6)) consistently outperform other potentials for generative purposes. As expected, increasing $M$ improves likelihood up to some level, after which overfitting starts to occur. Adding sparsifying regularization not only prevents overfitting but also facilitates the biological interpretation of weights (Figure 8A). It is thus an effective way to enhance the correspondence between representation and phenotypic spaces (Figure 1C). It also allows us to drive the RBM operation point at which most features can be activated across many regions of the sequence space (Figure 8E); examples are provided by hidden units 1 and 2 for the Kunitz domain in Figure 2B and C and hidden unit 3 for the WW domain in Figure 3B and C. Combining these features allows us to generate a variety of new sequences with high probabilities, such as those shown in Figure 5. Note that some inferred features, such as hidden unit 5 in Figure 2C and D and, to a lesser extent, hidden unit 2 in Figure 3B and C, are, by contrast, activated by evolutionary close sequences. Our inferred RBMs thus share some partial similarity with the mixture models of Figure 8C. Interestingly, the identification of specific sequence motifs with structural, functional or evolutionary meaning does not seem to be restricted to a few protein domains or proteins, but could be a generic property as suggested by our study of 16 additional families (Figure 9 and 10).
Despite the algorithmic improvements developed in the present work (see 'Materials and methods'), training RBM is challenging as it requires intensive sampling. Generative models that are alternatives to RBM, and that do not require Markov Chain sampling, exist in machine learning; they include Generative Adversarial Networks (Goodfellow et al., 2014) and Variational Auto–encoders (VAE) (Kingma and Welling, 2013. VAE were recently applied to protein sequence data for fitness prediction (Sinai et al., 2017; Riesselman et al., 2018. Our work differs in several impo rtant points: our RBM is an extension of directbased coupling approaches, requires much less hidden units (about 10 to 50 times fewer than were used in Sinai et al., 2017 and Riesselman et al., 2018), has a simple architecture with two layers carrying sequences and representations, infers interpretable weights with biological relevance, and can be easily tweaked to design sequences with desired statistical properties. We have shown that RBM can successfully model small domains (of a few tens of amino acids) as well as much longer proteins (of several hundreds of residues). The reason is that, even for very large proteins, the computational effort can be controlled through the number $M$ of hidden units (see 'Materials and methods' for discussion about the running time of our learning algorithm). Choosing moderate values of $M$ makes the number of parameters to be learned reasonable and avoids overfitting, yet allows for the discovery of important functional and structural features. It is, however, unclear how $M$ should scale with $N$ to unveil ‘all’ the functional features of very complex and rich proteins (such as Hsp70).
From a computational biology point of view, RBM unifies and extends previous approaches in the context of protein coevolutionary analysis. From the one hand, the features extracted by RBM identify ‘collective modes’ that control the biological functionalities of the protein, in a similar way to the socalled sectors extracted by statistical coupling analysis (Halabi et al., 2009). However, contrary to sectors, the collective modes are not disjoint: a site may participate in different features, depending on the value of the residue it carries. On the other hand, RBM coincide with directcoupling analysis (Morcos et al., 2011 when the potential $\mathcal{U}(h)$ is quadratic in $h$. For nonquadratic potentials $\mathcal{U}$, couplings to all orders between the visible units are present. The presence of highorder interactions allows for a significantly better description of gap modes Feinauer et al., 2014, of multiple longrange couplings due to ligand binding, and of outliers sequences (Appendix 1—figure 5). Our dReLU RBM model offers an efficient way to go beyond pairwise coupling models, without an explosion in the number of interaction parameters to be inferred, as all highorder interactions (whose number, ${q}^{N}$, is exponentially large in $N$) are effectively generated from the same $M\times N\times q$ weights ${w}_{i\mu}(v)$. RBM also outperforms the HopfieldPotts framework Cocco et al., 2013, an approach previously introduced to capture both collective and localized structural modes. HopfieldPotts ’patterns’ were derived with no sparsity regularization and within the meanfield approximation, which made the HopfieldPotts model insufficiently accurate for sequence design (see Appendix 1—figures 14–18).
The weights shown in Figures 2B, 3B and 4B are stable with respect to subsampling (Appendix 1—figure 13) and could be unambiguously interpreted and related to existing literature. However, the biological significance of some of the inferred features remains unclear, and would require experimental investigation. Similarly, the capability of RBM to design new functional sequences need experimental validation besides the comparison with past design experiments (Figure 5E) and the benchmarking on in silico proteins (Figure 7). Although recombining different parts of natural proteins sequences from different organisms is a well recognized procedure for protein design (Stemmer, 1994; Khersonsky and Fleishman, 2016, RBM innovates in a crucial aspect. Traditional approaches cut sequences into fragments at fixed positions on the basis of secondary structure considerations, but such parts are learned and need not be contiguous along the primary sequence in RBM models. We believe that protein design with detailed computational modeling methods, such as Rosetta (Simons et al., 1997; Khersonsky and Fleishman, 2016, could be efficiently guided by our RBMbased approach, in much the same way as protein folding greatly benefited from the inclusion of longrange contacts found by directcoupling analysis (Marks et al., 2011; Hopf et al., 2012.
Future projects include developing systematic methods for identifying functiondetermining sites, and analyzing more protein families. As suggested by the analysis of the 16 families shown in Figure 9 and 10, such a study could help to establish a general classification of motifs into broad classes with structural or functional relevance, shared by distinct proteins. In addition, it would be very interesting to use RBM to determine evolutionary paths between two, or more, protein sequences in the same family, but with distinct phenotypes. In principle, RBM could reveal how functionalities continuously change along the paths, and could provide a measure of viability of intermediary sequences.
Materials and methods
Data preprocessing
Request a detailed protocolWe use the PFAM sequence alignments of the V31.0 release (March 2017) for both Kunitz (PF00014) and WW (PF00397) domains. All columns with insertions are discarded, then duplicate sequences are removed. We are left with, respectively, $N=53$ sites and $B=8062$ unique sequences for Kunitz, and $N=31$ and $B=7503$ for WW; each site can carry $q=21$ different symbols. To correct for the heterogeneous sampling of the sequence space, a reweighting procedure is applied: each sequence ${\mathbf{\mathbf{v}}}^{\mathrm{\ell}}$ with $\mathrm{\ell}=1,\mathrm{\dots},B$ is assigned a weight ${w}_{\mathrm{\ell}}$ equal to the inverse of the number of sequences with more than 90% aminoacid identity (including itself). In all that follows, the average over the sequence data of a function $f$ is defined as
Learning procedure
Objective function and gradients
Request a detailed protocolTraining is performed by maximizing, through stochastic gradient ascent, the difference between the logprobability of the sequences in the MSA and the regularization costs,
Regularization terms include a standard ${L}_{2}$ penalty for the potentials acting on the visible units, and a custom ${L}_{2}/{L}_{1}$ penalty for the weights. The latter penalty corresponds to an effective ${L}_{1}$ regularization with an adaptive strength that increases with the weights, thus promoting homogeneity among hidden units. (This can be seen from the gradient of the regularization term, which reads ${\lambda}_{1}^{2}\left({\sum}_{i,{v}^{\prime}}{w}_{i\mu}({v}^{\prime})/qN\right)\text{sign}({w}_{i\mu}(v))$.) Besides, it prevents hidden units from ending up entirely disconnected (${w}_{i\mu}(v)=0\forall i,v$), and makes the determination of the penalty strength ${\lambda}_{1}^{2}$ more robust (see Appendix 1—figure 2).
According to Equation (5), the probability of a sequence $\mathbf{\mathbf{v}}$ can be written as,
is the effective ‘energy’ of the sequence, which depends on all the model parameters. The gradient of ${\u27e8\mathrm{log}P(\mathbf{\mathbf{v}})\u27e9}_{MSA}$ over one of these parameters, denoted generically by $\psi $, is therefore
Hence, the gradient is the difference between the average values of the derivative of ${E}_{eff}$ with respect to $\psi $ over the model and the data distributions.
Moment evaluation
Request a detailed protocolSeveral methods have been developed to evaluate the model average in the gradient ( see Equation (10)) Fischer and Igel, 2012. The naive approach is to run for each gradient iteration a full Markov Chain Monte Carlo (MCMC) simulation of the RBM until the samples reach equilibrium, then use these samples to compute the model average Ackley et al., 1987. A more efficient approach is the Persistent Constrastive Divergence Tieleman, 2008: the samples obtained from the previous simulation are used to initialize for the next MCMC simulation, and only a small number of Gibbs updates (${N}_{MC}\sim 10$) are performed between each gradient evaluation. If the model parameters evolve slowly, the samples are always at equilibrium, and we obtain the same accuracy as that provided the naive approach at a fraction of the computational cost. In practice, the Persistent Contrastive Divergence (PCD) algorithm succeeds if the mixing rate of the Markov Chain — which depends on the nature and dimension of the data, and the model parameters — is fast enough. In our training sessions, PCD proved sufficient to learn relevant features and good generative models for small proteins and regularized RBM.
Stochastic gradient ascent
Request a detailed protocolThe optimization is carried out by Stochastic Gradient Ascent. At each step, the gradient is evaluated using a minibatch of the data, as well as a small number of MCMC configurations. In most of our training sessions, we used the same batch size (=100) for both sets. The model is initialized as follows:
Weights ${w}_{i\mu}(v)$ are randomly and independently drawn from a Gaussian distribution with zero mean and variance equal to $\frac{0.1}{N}$. The scaling factor $\frac{1}{N}$ ensures that the initial input distribution has variance of the order of $1$.
The potentials ${g}_{i}(v)$ are given their values in the independentsite model: ${g}_{i}(v)=\mathrm{log}{\u27e8{\delta}_{{v}_{i},v}\u27e9}_{\text{MSA}}$, where $\delta $ denotes the Kronecker function.
For all hiddenunit potentials, we set ${\gamma}_{+}={\gamma}_{}=1$, ${\theta}_{+}={\theta}_{}=0$.
The learning rate is initially set to $0.1$, and decays exponentially after a fraction of the total training time (e.g. 50%) until it reaches a final, small value, for example 10^{4}.
Dynamic reparametrization
Request a detailed protocolFor Gaussian and dReLU potentials, there is a redundancy between the slope of the hidden unit average activity and the global amplitude of the weight vector. Indeed, for the Gaussian potential, the model distribution is invariant under rescaling transformations ${\gamma}_{\mu}\to {\lambda}^{2}{\gamma}_{\mu}$, ${w}_{i\mu}\to \lambda {w}_{i\mu}$, ${\theta}_{\mu}\to \lambda {\theta}_{\mu}$ and offset transformation ${\theta}_{\mu}\to {\theta}_{\mu}+{K}_{\mu}$, ${g}_{i}\to {g}_{i}{\sum}_{\mu}{w}_{i\mu}\frac{{K}_{\mu}}{{\gamma}_{\mu}}$. Though we can set ${\gamma}_{\mu}=1,{\theta}_{\mu}=0\forall \mu $ without loss of generality, it can lead either to numerical instability (at high learning rate) or slow learning (at low learning rate). A significantly better choice is to adjust the slope and offset dynamically so that $\u27e8{h}_{\mu}\u27e9\sim 0$ and $\text{Var}({h}_{\mu})\sim 1$ at all times. This new approach, reminiscent of batch normalization for deep networks, is implemented in the training algorithm released with this work. Detailed equations and benchmarks will be available online soon.
Gauge choice
Request a detailed protocolSince the conditional probability Equation 4 is normalized, the transformations ${g}_{i}(v)\to {g}_{i}(v)+{\lambda}_{i}$ and ${w}_{i\mu}(v)\to {w}_{i\mu}(v)+{K}_{i\mu}$ leave the conditional probability invariant. We choose the zerosum gauges, defined by ${\sum}_{v}{g}_{i}(v)=0$, ${\sum}_{v}{w}_{i\mu}(v)=0$. Since the regularization penalties over the fields and weight depend on the gauge choice, the gauge must be enforced throughout all training and not only at the end. The updates on the fields leave the gauge invariant, so the transformation ${g}_{i}(v)\to {g}_{i}(v)\frac{1}{q}{\sum}_{{v}^{\prime}}{g}_{i}({v}^{\prime})$ can be used only once, after initialization. On the other hand, this is not the case for the updates on the weights, so the transformation ${w}_{i\mu}(v)\frac{1}{q}{\sum}_{{v}^{\prime}}{w}_{i\mu}({v}^{\prime})$ must be applied after each gradient update.
Evaluating the partition function
Request a detailed protocolEvaluating $P(\mathbf{\mathbf{v}})$ requires knowledge of the partition function $Z={\displaystyle \sum _{\mathbf{\mathbf{v}}}}\mathrm{exp}\left({E}_{\text{eff}}(\mathbf{\mathbf{v}})\right)$ (see denominator in Equation (9)). The later expression, which involves summing over ${q}^{N}$ terms is not tractable. Instead, we estimate $Z$ using the Annealed Importance Sampling algorithm (AIS) Neal, 2001; Salakhutdinov and Murray, 2008. Briefly, the idea is to estimate partition function ratios. Let ${P}_{1}(\mathbf{\mathbf{v}})=\frac{{P}_{1}^{*}(\mathbf{\mathbf{v}})}{{Z}_{1}}$, ${P}_{0}=\frac{{P}_{0}^{*}(\mathbf{\mathbf{v}})}{{Z}_{0}}$ be two probability distributions with partition functions ${Z}_{1}$, ${Z}_{0}$. Then:
Therefore, provided that ${Z}_{0}$ is known (e.g. if ${P}_{0}$ is an independent model with no couplings), one can in principle estimate ${Z}_{1}$ through Monte Carlo sampling. The difficulty lies in the variance of the estimator: if ${P}_{1}$, ${P}_{0}$ are very different from one another, then some configurations can be very likely for ${P}_{1}$ and have very low probability with ${P}_{0}$; these configurations almost never appear in the Monte Carlo estimate of $\u27e8.\u27e9$, but the probability ratio can be exponentially large. In Annealed Importance Sampling, we address this problem by constructing a continuous path of interpolating distributions ${P}_{\beta}(\mathbf{\mathbf{v}})={P}_{1}{(\mathbf{\mathbf{v}})}^{\beta}{P}_{0}{(\mathbf{\mathbf{v}})}^{1\beta}$, and estimate ${Z}_{1}$ as a product of the ratios of the partition functions:
where we choose a linear set of interpolating inverse temperatures of the form ${\beta}_{l}=\frac{l}{{l}_{\text{max}}}$. To evaluate the successive expectations, we use a fixed number $C$ of samples initially drawn from ${P}_{0}$, and gradually anneal them from ${P}_{0}$ to ${P}_{1}$ by successive applications of Gibbs sampling at ${P}_{\beta}$. Moreover, all computations are done in logarithmic scales for numerical stability purposes: we estimate $\mathrm{log}\frac{{Z}_{1}}{{Z}_{0}}\approx {\u27e8\mathrm{log}\frac{{P}_{1}^{*}(\mathbf{\mathbf{v}})}{{P}_{0}^{*}(\mathbf{\mathbf{v}})}\u27e9}_{\mathbf{\mathbf{v}}\sim {P}_{0}}$, which is justified if ${P}_{1}$ and ${P}_{0}$ are close. In practice, we used $C=20$ chains, ${n}_{\beta}=5\times {10}^{4}$ steps. For the initial distribution ${P}_{0}$, we take the closest (in terms of KL divergence) independent model to the data distribution ${P}_{MSA}$. The visible layer fields are those of the independent model inferred from the MSA, and the weights are ${\mathbf{\mathbf{w}}}^{\beta =0}=0$. For the hidden potential values, we infer the parameters from the statistics of the hidden layer activity conditioned to the data.
Explicit formula for sampling and training RBM
Request a detailed protocolTraining, sampling and computing the probability of sequences with RBM requires: (1) sampling from $P(\mathbf{\mathbf{v}}\mathbf{\mathbf{h}})$, (2) sampling from $P(\mathbf{\mathbf{h}}\mathbf{\mathbf{v}})$, and (3) evaluating the effective energy ${E}_{\text{eff}}(\mathbf{\mathbf{v}})$ and its derivatives. This is done as follows:
Each sequence site $i$ is encoded as a categorical variable taking integer values ${v}_{i}\in [0,20]$, with each integer corresponding to one of the 20 aminoacids + 1 gap. Similarly, the fields and weights are encoded as a $N\times 21$ matrix and a $M\times N\times 21$ tensor, respectively. Owing to the bipartite structure of the graph, $P(\mathbf{\mathbf{v}}\mathbf{\mathbf{h}})={\prod}_{i}P({\mathbf{\mathbf{v}}}_{\mathbf{\mathbf{i}}}\mathbf{\mathbf{h}})$ (see Equation (4)). Therefore, sampling from $P(\mathbf{\mathbf{v}}\mathbf{\mathbf{h}})$ is done in three steps: compute the inputs received from the hidden layer, then the conditional probabilities $P({v}_{i}\mathbf{\mathbf{h}})$ given the inputs, and sample each visible unit independently the corresponding conditional distributions.
The conditional probability $P(\mathbf{\mathbf{h}}\mathbf{\mathbf{v}})$ factorizes. Given a visible configuration $\mathbf{\mathbf{v}}$, each hidden unit is sampled independently from the others via $P({h}_{\mu}\mathbf{\mathbf{v}})$ (see Equation (3)). For a quadratic potential $\mathcal{U}(h)=\frac{1}{2}\gamma {h}^{2}+\theta h$, this conditional distribution is Gaussian. For the dReLU potential $\mathcal{U}(h)$ in Equation (6), we introduce first
$\mathrm{\Phi}(x)=\mathrm{exp}(\frac{{x}^{2}}{2})\left[1\text{erf}(\frac{x}{\sqrt{2}})\right]\sqrt{\frac{\pi}{2}}$Some useful properties of $\mathrm{\Phi}$ are:
• $\mathrm{\Phi}(x){\sim}_{x\to \mathrm{\infty}}\mathrm{exp}(\frac{{x}^{2}}{2})\sqrt{2\pi}$
• $\mathrm{\Phi}(x){\sim}_{x\to \mathrm{\infty}}\frac{1}{x}\frac{1}{{x}^{3}}+\frac{3}{{x}^{5}}+\mathcal{\mathcal{O}}(\frac{1}{{x}^{7}})$
• ${\mathrm{\Phi}}^{\mathrm{\prime}}(x)=x\mathrm{\Phi}(x)1$
To avoid numerical issues, $\mathrm{\Phi}$ is computed in practice with its definition for $x\phantom{\rule{thinmathspace}{0ex}}<\phantom{\rule{thinmathspace}{0ex}}5$ and with its asymptotic expansion otherwise. We also write $\mathcal{T}\mathcal{N}(\mu ,{\sigma}^{2},a,b)$, the truncated Gaussian distribution of mode $\mu $, width $\sigma $ and support $[a,b]$.
Then, $P(hI)$ is given by a mixture of two truncated Gaussians:
(13) $P(hI)={p}^{+}\mathcal{T}\mathcal{N}(\frac{I{\theta}^{+}}{{\gamma}_{+}},\frac{1}{{\gamma}_{+}},0,+\mathrm{\infty})+{p}^{}\mathcal{T}\mathcal{N}(\mu =\frac{I{\theta}^{}}{{\gamma}^{}},{\sigma}^{2}=\frac{1}{{\gamma}^{}},\mathrm{\infty},0)$where ${Z}^{\pm}=\mathrm{\Phi}\left(\frac{\mp (I{\theta}^{\pm})}{\sqrt{{\gamma}^{\pm}}}\right)\frac{1}{\sqrt{{\gamma}^{\pm}}}$, and ${p}^{\pm}=\frac{{Z}^{\pm}}{{Z}^{+}+{Z}^{}}$.
Evaluating ${E}_{\text{eff}}$ and its derivatives requires an explicit expression for the cumulant–generating function $\mathrm{\Gamma}(I)$. For quadratic potentials, $\mathrm{\Gamma}(I)$ is quadratic too. For dReLU potentials, we have $\mathrm{\Gamma}(I)=\mathrm{log}({Z}^{+}+{Z}^{})$, where ${Z}^{\pm}$ is defined above.
Computational complexity
Request a detailed protocolThe computational complexity is of the order of $M\times N\times B$, with more accurate variants taking more time. The algorithm scales reasonably to large protein sizes, and was tested successfully for $N$ up to $\sim 700$, taking in the order of 1–2 days on an Intel Xeon Phi processor with 2 $\times $ 28 cores.
Sampling procedure
Request a detailed protocolSampling from $P$ in Equation (5) is done with Markov Chain Monte Carlo methods, with the standard alternate Gibbs sampler described in the main text and in Fischer and Igel (2012). Conditional sampling, that is sampling from $P(\mathbf{\mathbf{v}}{h}_{\mu}={h}_{\mu}^{c})$, is straightforward with RBM: it is achieved by the same Gibbs sampler while keeping ${h}_{\mu}$ fixed.
The RBM architecture can be modified to generate sequences with high probabilities (as in Figure 5E&F). The trick is to duplicate the hidden units, the weights, and the local potentials acting on the visible units, as shown in Figure 11. By doing so, the sequences $\mathbf{\mathbf{v}}$ are distributed according to
Hence, with the duplicated RBM, sequences with high probabilities in the original RBM model are given a boost when compared to lowprobability sequences (Figure 11). Note that more subtle biases can be introduced by duplicating some (but not all) of the hidden units in order to give more importance in the sampling to the associated statistical features.
Contact map estimation
Request a detailed protocolRBM can be used for contact prediction in a manner similar to pairwise coupling models, after derivation of an effective coupling matrix ${J}_{ij}^{\text{eff}}(a,b)$. Consider a sequence $\mathbf{\mathbf{v}}$, and two sites $i,j$. Define the set of mutated sequences ${\mathbf{\mathbf{v}}}^{a,b}$ with amino acid content: ${v}_{k}^{a,b}={v}_{k}$ if $k\ne i,j$, $a$ if $k=i$, $b$ if $k=j$ (Figure 6A). The differential likelihood ratio
where P is the marginal distribution in Equation (5), measures epistatic contributions to the double mutation $a\to {a}^{\prime}$ and $b\to {b}^{\prime}$ on sites i and j, respectively, in the background defined by sequence $\mathbf{\mathbf{v}}$ (see Figure 6A). The effective coupling matrix is then defined as
where the average is taken over the sequences $\mathbf{\mathbf{v}}$ in the MSA. For a pairwise model, $\mathrm{\Delta}\mathrm{\Delta}{R}_{ij}$ does not depend on the background sequence $\mathbf{\mathbf{v}}$, and Equation (16) coincides with the true coupling in the zerosum gauge. Contact estimators are based on the Frobenius norms of ${J}^{\text{eff}}$, with the Average Product Correction (see Cocco et al., 2018).
Code availability
Request a detailed protocolThe Python 2.7 package for training and visualizing RBMs, which was used to obtain the results reported in this work, is available at https://github.com/jertubiana/ProteinMotifRBM (Tubiana, 2019; copy archived at https://github.com/elifesciencespublications/ProteinMotifRBM). In addition, Jupyter notebooks are provided for reproducing most of the figures in this article.
Appendix 1
Supporting methods and figures
Latticeprotein synthetic sequences
LP models have been introduced in the $\prime 90$ to investigate the uniqueness of folding shared by the majority of real proteins Shakhnovich and Gutin, 1990, and have been more recently used to benchmark graphical models inferred from sequence data Jacquin et al., 2016). There are $\mathcal{\mathcal{N}}=103,\phantom{\rule{thinmathspace}{0ex}}406$ possible folds, that is selfavoiding paths of the 27 aminoacidlong chains, on $3\phantom{\rule{thinmathspace}{0ex}}\times 3\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}3$ a lattice cube Shakhnovich and Gutin, 1990. The probability that the protein sequence $\mathbf{v}=({v}_{1},\phantom{\rule{thinmathspace}{0ex}}{v}_{2},\phantom{\rule{thinmathspace}{0ex}}...,\phantom{\rule{thinmathspace}{0ex}}{v}_{27})$ folds in one of these, say, $S$, is
where the energy of sequence $\mathbf{\mathbf{v}}$ in structure $S$ is given by
In the formula above, $c}^{(S)$ is the contact map: ${c}_{ij}^{(S)}=1$ if the pair of sites $ij$ is in contact, that is $i$ and $j$ are nearest neighbors on the lattice, and zero otherwise. The pairwise energy $E({v}_{i},{v}_{j})$ represents the aminoacid physicochemical interactions, given by the the MiyazawaJernigan (MJ) knowledgebased potential Miyazawa and Jernigan, 1996.
A collection of 36,000 sequences that specifically fold on structure $S}_{A$ (Figure 7A) with high probability ${P}_{nat}(\mathbf{v};{S}_{A})\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0.995$ were generated by Monte Carlo simulations as described in Jacquin et al. (2016). Like real MSA, Lattice Protein data feature short and longrange correlations between aminoacid on different sites, as well as highorder interactions that arise from competition between folds Jacquin et al., 2016).
Model selection
We discuss here the choice of parameters (strength of regularization, number of hidden units, shape of hiddenunit potentials, …) for the RBM used in the main text. Our goal is to achieve good generative performances and to learn biologically interpretable representations. We estimate the accuracy of the fit to the data distribution using the average loglikelihood, divided by the number of visible units $\frac{1}{N}\u27e8\mathrm{log}P(\mathbf{v}){\u27e9}_{MSA}$. For visibleunit variables with $q=21$ possible values (i.e. 20 amino acids + gap symbol), this number typically ranges from $\mathrm{log}21\simeq 3.04$ (uniform distribution) to 0. Evaluating $P(\mathbf{v})$ (Methods Equation (1)) requires knowledge of the partition function, $Z=\sum _{\mathbf{v}}\mathrm{exp}\left(\sum _{i=1}^{N}{g}_{i}({v}_{i})+\sum _{\mu =1}^{M}{\mathrm{\Gamma}}_{\mu}({I}_{\mu}(\mathbf{v}))\right)$ (see section titled 'Evaluating the partition function').
Number of hidden units
The number of hidden units is critical for the generative performance. We trained RBMs on the Lattice Protein data set for various potentials (Bernoulli, quadratic and dReLU), numbers of hidden units (1–400) and regularizations (${\lambda}_{1}^{2}=0$, ${\lambda}_{1}^{2}=0.025$). The likelihood estimation shows that, as expected, the larger $M$, the better the ability to fit the training data (Appendix 1—figure 1). Overfitting resulting in a decrease in test set performance may occur for large $M$. For the regularized case, the likelihood saturates at about 100 hidden units. Similar results were obtained for WW (see Appendix 1—figure 2).
Besides generative performance, the representation also changes as M increases. For very low values of M, each hidden unit tries to explain as much covariation as possible and its corresponding weight vector is extended, as in PCA. For larger numbers of hidden units, weights tend to become more sparse; they stabilize at some point, after which new hidden units simply duplicate previous ones.
Sparse regularization
We first investigate the importance of the sparsifying penalty term. Our study shows that, unlike in the case of MNIST digit data (Tubiana and Monasson, 2017), sparsity does not arise naturally from training RBM on protein sequences but requires the introduction of a specific sparsifying regularization (see Figure 8). On the one hand, sparse weights, such as those shown in Figures 2, 3, 4 and 7, are easier to interpret, but, on the other hand, regularization generally leads to a decrease in the generative performance. We show below that the choice of regularization strength used in this work is a good compromise between sparsity and generative performance.
We train several RBM on the Lattice Proteins MSA, with a fixed number of hidden units ($M=100$), fixed potential, and varying strength of the sparse penalty ${\lambda}_{1}^{2}$ (defined in 'Materials and methods, Equation (8)), and evaluate their likelihoods. We repeat the same procedure using the standard ${L}_{1}$ regularization (${\lambda}_{1}{\sum}_{i,v,\mu}{w}_{i\mu}(v)$) instead of ${L}_{1}^{2}$. Results are shown in Appendix 1—figure 3. In both cases, the likelihood on the test set decreases mildly with the regularization strength. However, for ${L}_{1}$ regularization, several hidden units become disconnected (i.e. ${w}_{i\mu}(v)=0$ for all $i,v$) as we increase the penalty strength. The ${L}_{1}^{2}$ penalty achieves sparse weights without disconnecting hidden units when the penalty is too large, hence it is more robust and requires less fine tuning.
Hiddenunit potentials
Last, we discuss the choice of the hiddenunit potentials. A priori, the major difference between Bernoulli, quadratic and dReLU potentials are that: (i) the Bernoulli hidden unit takes discrete values whereas quadratic and dReLU hidden units take continuous ones; and (ii) after marginalization, quadratic potentials create pairwise effective interactions whereas Bernoulli and dReLU potentials create nonpairwise ones. It was shown in the context of image processing and text mining that nonpairwise models are more efficient in practice, and theoretical arguments also highlight the importance of highorder interactions (Tubiana and Monasson, 2017).
In terms of generative performance, our results on Lattice Proteins and WW domain MSAs show that, for the same number of parameters, dReLU RBM perform better than Gaussian and Bernoulli RBM. Similar results, not shown, were obtained for the Kunitz domain MSA. Although RBM with Bernoulli hidden units are known to be universal approximators as $M\to \mathrm{\infty}$
One of the key aspects that explains the difference in performance between dReLU and Gaussian RBM is the ability of the former to better model ’outlier’ sequences, with rare extended features such as BikuninAMBP (Weight 5 in the main text, Figure 2) or the nonaromatic W28substitution feature (Weight 3 in the main text, Figure 3). Indeed, thanks to the thresholding effect of the average activity, dReLU (unlike quadratic potentials) can account for outliers without altering the distribution for the bulk of the other sequences. To illustrate this property, in Appendix 1—figure 5, we compare the likelihoods for all sequences of two RBMs trained with quadratic (resp. dReLU) potentials, $M=100$, ${\lambda}_{1}^{2}=0.25$ on the Kunitz domain MSA. The color coding indicates the degree of anomaly of the sequence, which is obtained as follows:
Compute the average activity ${h}_{\mu}^{l}$ of dReLU RBM for all data sequences $\mathbf{v}}^{l$.
Normalize (zscore) each dimension: $\hat{h}}_{\mu}=\frac{{h}_{\mu}{\u27e8{h}_{\mu}\u27e9}_{MSA}}{\sqrt{\text{Var}[{h}_{\mu}{]}_{MSA}}$.
Define:
(19) ${c}^{l}=\mathrm{arg}\underset{\mu}{max}{\hat{h}}_{\mu}^{l}$
For instance, a sequence ${\mathbf{\mathbf{v}}}^{l}$ with ${c}^{l}=10$ has at least one hiddenunit average activity that is 10 standard deviations away from the mean. Clearly, most sequences have very similar likelihood but the outlier sequences are better modeled by dReLU potentials.
The features that are extracted are fairly robust with respect to the choice of potential when regularization is used. Clearly, the nature of the potentials does not matter for finding contacts features because for any potential, a hidden unit connected to only two sites will create only pairwise effective interaction. For larger collective modes, some difference arise. As discussed above, Bernoulli features are more redundant, and Gaussian RBM tend to miss outlier features.
Summary
To summarize, the systematic study suggests that:
More general potentials, such as dReLU, perform better than the simpler quadratic and Bernoulli potentials.
There exist values of sparsity regularization penalties that allow for both good generative performance and interpretability.
As the number of hidden units increases, more features are captured and generative performance improves. Beyond some point, increasing M simply adds duplicate hidden units and does not enhance performance.
Sequence generation
We use Lattice Proteins to check that our RBM is a good generative model, that is able to generate sequences that have both high fitness and high diversity (far away from one another and from the sequences provided in the training data set), as was done for Boltzmann Machines Jacquin et al., 2016). Various RBM are trained, sequences are generated for each RBM and scored using the ground truth ${p}_{nat}$ (see Appendix 1—figure 6). We find that: (i) RBMs with low likelihood (Bernoulli and/or small M) generate lowquality sequences; (ii) unregularized BMs and RBMs, which tend to overfit, generate sequences with higher fitness but low diversity; and (iii) the true fitness function is predicted well by the inferred log probability. Moreover, conditional sampling also generates highquality sequences, even when conditioning on unseen combination of features.
For RBMs trained on real proteins sequences, no groundtruth fitness is available and sequence quality cannot be assessed numerically. Appendix 1—figure 7 shows nonetheless that the generated sequences, including those with recombined features that do not appear in nature, are consistent with a pairwise model trained on the same data.
Finally, in Appendix 1—figure 8, we show the role of regularization and sequence reweighting on sequence generation. Sequences drawn from the unregularized model are closer to those of the training data, and the corresponding sequence distribution has significantly lower entropy $S=\sum _{\mathbf{v}}P(\mathbf{v})\mathrm{log}P(\mathbf{v})$ (i.e. the average negative logprobability of the generated sequences). There are respectively about ${e}^{S}\sim {10}^{12}$ and ${10}^{18}$ distinct sequences for the unregularized and regularized models, respectively. We find that sequence reweighting plays a similar role as regularization: with reweighting, sequences are slightly further away from the training set and the model has higher entropy.
Contact predictions
Since RBMs learn a full energy landscape, they can predict epistatic interactions (see 'Materials and methods'), and therefore contacts, as shown in Figure 6. The effective couplings derived with RBM are consistent with those inferred from a pairwise model (see Appendix 1—figure 9). Predictions for distant contacts in the Kunitz domain are shown in Appendix 1—figure 10, and are slightly worse than with DCA.
We briefly discuss the best set of parameters for contact prediction. As seen from Appendix 1—figure 11, all RBMs can predict contacts maps on Lattice Proteins more or less accurately. As for the likelihood and generative performance, increasing the number of hidden units significantly improves contact prediction. The best hidden unit potentials for predicting contacts are dReLU and quadratic.
We also studied how constraints on the sparsity of weights, tuned by the regularization penalty ${\lambda}_{1}^{2}$, influenced the performance. Because weights are never exactly zero, proxies are required for an appropriate definition of sparsity. In order to avoid arbitrary thresholds, we use Participation Ratios. The Participation Ratio $(P{R}_{e})$ of a vector $\mathbf{\mathbf{x}}=\{{x}_{i}\}$ is
If $\mathbf{\mathbf{x}}$ has K nonzero and equal (in modulus) components, PR is equal to K for any e. In practice, we use the values e = 2 and 3: the higher e is, the more small components are discounted against strong components in $\mathbf{\mathbf{x}}$. Also note that it is invariant under rescaling of $\mathbf{\mathbf{x}}$. We then define the weight sparsity ${p}_{\mu}$ of a hidden unit, through
and average it over $\mu $ to get a unique estimator of weight sparsity across the RBM. The results are reported in Appendix 1—figure 12, and show that performance strongly worsens when sparsity increases, both in Lattice Proteins and in real families.
Feature robustness
To assess feature robustness, we repeat the training on WW using only one of the two halves of the sequences data, and look for the closest features to those shown in the main text. The closest features, shown below, are quite similar to the original ones.
Comparison with the HopfieldPotts model
The HopfieldPotts model is a special case of RBM with: (i) quadratic potentials for hidden units, ii) no regularization but orthogonality constraints on the weights, and (iii) meanfield inference rather than PCD Monte Carlo learning. The consequences are that: (i) we cannot model highorder interactions, (ii) we do not observe a compositional regime in which the weights are sparse and typical configurations are obtained by combinations of these weights, instead, the representation is entangled and the weights attached to high eigenvalues are extended over most sites of the protein; and (iii) the model is not generative, that is, it does not reproduce the data moments and cannot generate a diverse set of sequences. To illustrate this fact, we show:
Examples of weights inferred from the the Kunitz and WW domains, and for Lattice Proteins (weights corresponding to Hsp70 can be found in a 'Supporting information' file). Loweigenvalue weights are sparse, as reported in Cocco et al. (2013), but high eigenvalue weights that encode collective modes are extended, and therefore hard to interpret and to relate to function.
Contact predictions with HopfieldPotts, showing worse performance than RBM or plmDCA.
Benchmarking of generated sequences with HopfieldPotts on Lattice Proteins (similar to Figure 7F). Using a small pseudocount, sequences are very poor (have a very low folding probability). Using a larger pseudocount, sequences have reasonable fitness $p}_{\text{nat}$, although lower than those for high$P(\mathbf{v})$ RBM, but quite low diversity. This phenomenon is characteristic of sequences generated with meanfield models (see figure 3A in Jacquin et al. (2016). We also note that the Lattice Protein benchmark is actually optimistic for the HopfieldPotts model, as the pseudocount trick does not work as well whenever a sequence has many conserved sites.
Additional figure: hiddeninput distribution for the Kunitz domain, separated by phylogenetic identity and genes
Additional figure: weight logos, 3D visualizations, input distributions of 10 hidden units for Hsp70
Hidden unit numbering: 1 = short vs long loop; 2 = function feature on SBD; 3 = LID/SBD interdomain; 4 = NBD/SBD interdomain and nonallosteric specific; 5 = unstructured tail; 6 = short/long vs very short loop; 7 = long loop variant; 8 = ER specific; 9 = second nonallosteric specific; 10 = dimer contacts.
Data availability
The Python 2.7 package for training and visualizing RBMs, used to obtained the results reported in this work, is available at https://github.com/jertubiana/ProteinMotifRBM (copy archived at https://github.com/elifesciencespublications/ProteinMotifRBM). It can be readily used for any protein family. Moreover, all four multiple sequence alignments presented in the text, as well as the code for reproducing each panel are also included. Jupyter notebooks are provided for reproducing most figures of the article.

Protein Data BankID 2KNT. THE 1.2 ANGSTROM STRUCTURE OF KUNITZ TYPE DOMAIN C5.

Protein Data BankID 2KHO. NMRRDC / XRAY structure of E. coli HSP70 (DNAK) chaperone (1605) complexed with ADP and substrate.

Protein Data BankID 4JNE. Allosteric opening of the polypeptidebinding site when an Hsp70 binds ATP.

Protein Data BankID 1ELV. CRYSTAL STRUCTURE OF THE CATALYTIC DOMAIN OF HUMAN COMPLEMENT C1S PROTEASE.

Protein Data BankID 2BOL. CRYSTAL STRUCTURE AND ASSEMBLY OF TSP36, A METAZOAN SMALL HEAT SHOCK PROTEIN.

Protein Data BankID 2VI6. Crystal Structure of the Nanog Homeodomain.

Protein Data BankID 1GDC. REFINED SOLUTION STRUCTURE OF THE GLUCOCORTICOID RECEPTOR DNABINDING DOMAIN.

Protein Data BankID 3FHI. Crystal structure of a complex between the catalytic and regulatory (RI{alpha}) subunits of PKA.

Protein Data BankID 1G2E. CRYSTAL STRUCTURE OF HUD AND AURICH ELEMENT OF THE TUMOR NECROSIS FACTOR ALPHA RNA.

Protein Data BankID 1O47. CRYSTAL STRUCTURE OF SH2 IN COMPLEX WITH RU82209.

Protein Data BankID 3BFR. The crystal structure of Sod2 from Saccharomyces cerevisiae.

Protein Data BankID 1WVN. Crystal Structure of domain 3 of human alpha polyC binding protein.

Protein Data BankID 1O0W. Crystal structure of Ribonuclease III (TM1102) from Thermotoga maritima at 2.0 A resolution.

Protein Data BankID 1A71. TERNARY COMPLEX OF AN ACTIVE SITE DOUBLE MUTANT OF HORSE LIVER ALCOHOL DEHYDROGENASE, PHE93=>TRP, VAL203=>ALA WITH NAD AND TRIFLUOROETHANOL.

Protein Data BankID 2O72. Crystal Structure Analysis of human Ecadherin (1213).

Protein Data BankID 6GSU. FIRSTSPHERE AND SECONDSPHERE ELECTROSTATIC EFFECTS IN THE ACTIVE SITE OF A CLASS MU GLUTATHIONE TRANSFERASE.
References

Readings in Computer Vision522–533, A learning algorithm for boltzmann machines, Readings in Computer Vision, Elsevier.

Chymotrypsin: molecular and catalytic propertiesClinical Biochemistry 19:317–322.https://doi.org/10.1016/S00099120(86)800029

The bovine basic pancreatic trypsin inhibitor (Kunitz inhibitor): a milestone proteinCurrent Protein & Peptide Science 4:231–251.https://doi.org/10.2174/1389203033487180

Structure and biology of tissue factor pathway inhibitorThrombosis and Haemostasis 86:959–972.https://doi.org/10.1055/s00371616518

Learning generative models for protein fold familiesProteins: Structure, Function, and Bioinformatics 79:1061–1078.https://doi.org/10.1002/prot.22934

Tuning of chaperone activity of Hsp70 proteins by modulation of nucleotide exchangeNature Structural Biology 8:427–432.https://doi.org/10.1038/87588

Structural analysis of BAG1 cochaperone and its interactions with Hsc70 heat shock proteinNature Structural Biology 8:349–352.https://doi.org/10.1038/86236

A conserved loop in the ATPase domain of the DnaK chaperone is essential for stable binding of GrpENature Structural & Molecular Biology 1:95–101.https://doi.org/10.1038/nsb029495

Structurefunction analysis of the reactive site in the first Kunitztype domain of human tissue factor pathway inhibitor2Journal of Biological Chemistry 279:17500–17507.https://doi.org/10.1074/jbc.M400802200

Inverse statistical physics of protein sequences: a key issues reviewReports on Progress in Physics 81:032601.https://doi.org/10.1088/13616633/aa9965

Emerging methods in protein coevolutionNature Reviews Genetics 14:249–261.https://doi.org/10.1038/nrg3414

Accelerated profile HMM searchesPLOS Computational Biology 7:e1002195.https://doi.org/10.1371/journal.pcbi.1002195

Fast pseudolikelihood maximization for directcoupling analysis of protein structure from many homologous aminoacid sequencesJournal of Computational Physics 276:341–356.https://doi.org/10.1016/j.jcp.2014.07.024

A single point mutation in a group I WW domain shifts its specificity to that of group II WW domainsJournal of Biological Chemistry 274:17284–17289.https://doi.org/10.1074/jbc.274.24.17284

Improving contact prediction along three dimensionsPLOS Computational Biology 10:e1003847.https://doi.org/10.1371/journal.pcbi.1003847

Coevolutionary landscape inference and the ContextDependence of mutations in BetaLactamase TEM1Molecular Biology and Evolution 33:268–280.https://doi.org/10.1093/molbev/msv211

Pfam: the protein families databaseNucleic Acids Research 42:D222–D230.https://doi.org/10.1093/nar/gkt1223

Iberoamerican Congress on Pattern Recognition14–36, An introduction to restricted boltzmann machines, Iberoamerican Congress on Pattern Recognition, Springer, 10.1007/9783642332753_2.

Bikuninnot just a plasma proteinase inhibitorThe International Journal of Biochemistry & Cell Biology 32:125–137.https://doi.org/10.1016/S13572725(99)001259

Advances in Neural Information Processing Systems2672–2680, Generative adversarial nets, Advances in Neural Information Processing Systems, MIT Press.

Inhibition of six serine proteinases of the human coagulation system by mutants of bovine pancreatic trypsin inhibitorJournal of Biological Chemistry 275:33346–33352.https://doi.org/10.1074/jbc.M006085200

Neural Networks: Tricks of the Trade599–619, A practical guide to training restricted boltzmann machines, Neural Networks: Tricks of the Trade, Springer, 10.1007/9783642352898_32.

Mutation effects predicted from sequence covariationNature Biotechnology 35:128–135.https://doi.org/10.1038/nbt.3769

VMD: visual molecular dynamicsJournal of Molecular Graphics 14:33–38.https://doi.org/10.1016/02637855(96)000185

Benchmarking inverse statistical approaches for protein structure and design with exactly solvable modelsPLOS Computational Biology 12:e1004889.https://doi.org/10.1371/journal.pcbi.1004889

Determinants of ligand specificity in groups I and IV WW domains as studied by surface plasmon resonance and model buildingJournal of Biological Chemistry 277:10173–10177.https://doi.org/10.1074/jbc.M110490200

Why reinvent the wheel? building new proteins based on readymade partsProtein Science 25:1179–1187.https://doi.org/10.1002/pro.2892

Conversion of the Kunitztype module of collagen VI into a highly active trypsin inhibitor by sitedirected mutagenesisEuropean Journal of Biochemistry 238:333–340.https://doi.org/10.1111/j.14321033.1996.0333z.x

Correlated mutations in models of protein sequences: phylogenetic and structural effectsLecture NotesMonograph Series, 33:236–256.https://doi.org/10.1214/lnms/1215455556

Representational power of restricted boltzmann machines and deep belief networksNeural Computation 20:1631–1649.https://doi.org/10.1162/neco.2008.0407510

Structural analysis of WW domains and design of a WW prototypeNature Structural Biology 7:375–379.https://doi.org/10.1038/75144

The geometry of the reactive site and of the peptide groups in trypsin, trypsinogen and its complexes with inhibitorsActa Crystallographica Section B Structural Science 39:480–490.https://doi.org/10.1107/S010876818300275X

1.2 Å refinement of the Kunitztype domain from the α3 chain of human type VI collagenActa Crystallographica Section D Biological Crystallography 54:306–312.https://doi.org/10.1107/S0907444997010846

A family of evolutionentropy hybrid methods for ranking protein residues by importanceJournal of Molecular Biology 336:1265–1282.https://doi.org/10.1016/j.jmb.2003.12.078

Protein folding theory: from lattice to allatom modelsAnnual Review of Biophysics and Biomolecular Structure 30:361–396.https://doi.org/10.1146/annurev.biophys.30.1.361

ConferenceRectified linear units improve restricted boltzmann machinesProceedings of the 27th international conference on machine learning (ICML10). pp. 807–814.

Annealed importance samplingStatistics and Computing 11:125–139.https://doi.org/10.1023/A:1008923215028

Intrinsically disordered proteins and intrinsically disordered protein regionsAnnual Review of Biochemistry 83:553–584.https://doi.org/10.1146/annurevbiochem072711164947

Allosteric opening of the polypeptidebinding site when an Hsp70 binds ATPNature Structural & Molecular Biology 20:900–907.https://doi.org/10.1038/nsmb.2583

Predicting the functional impact of protein mutations: application to cancer genomicsNucleic Acids Research 39:e118.https://doi.org/10.1093/nar/gkr407

The ras protein superfamily: evolutionary tree and role of conserved amino acidsThe Journal of Cell Biology 196:189–201.https://doi.org/10.1083/jcb.201103008

ConferenceOn the quantitative analysis of deep belief networksProceedings of the 25th International Conference on Machine Learning. pp. 872–879.https://doi.org/10.1145/1390156.1390266

Enumeration of all compact conformations of copolymers with random sequence of linksThe Journal of Chemical Physics 93:5967–5971.https://doi.org/10.1063/1.459480

Antiinflammatory actions of serine protease inhibitors containing the Kunitz domainInflammation Research 59:679–687.https://doi.org/10.1007/s0001101002055

An interdomain sector mediating allostery in Hsp70 molecular chaperonesMolecular Systems Biology 6:.https://doi.org/10.1038/msb.2010.65

ConferenceTraining restricted boltzmann machines using approximations to the likelihood gradientProceedings of the 25th International Conference on Machine Learning. pp. 1064–1071.https://doi.org/10.1145/1390156.1390290

Pathways of chaperonemediated protein folding in the cytosolNature Reviews Molecular Cell Biology 5:781–791.https://doi.org/10.1038/nrm1492

The remarkable multivalency of the Hsp70 chaperonesCell Stress and Chaperones 22:173–189.https://doi.org/10.1007/s121920170776y
Decision letter

Lucy J ColwellReviewing Editor; Cambridge University, United Kingdom

Detlef WeigelSenior Editor; Max Planck Institute for Developmental Biology, Germany
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Learning protein constitutive motifs from sequence data" for consideration at eLife. Your article has been favorably evaluated by three reviewers, one of whom served as a guest Reviewing Editor, and the evaluation has been overseen by Aviv Regev as the Senior Editor.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
1) There is a difference of opinion among the reviewers about whether your methodology needs to be compared with other protein structures, and in particular examples that are not so well known and annotated. However the consensus opinion is that the manuscript would benefit from additional comparisons. Reviewer 3 explicitly calls out the possibility that there is overfitting involved, and it is hard to dismiss this out of hand without more effort to compare to the types of examples that they mention. Reviewer 3 makes reasonable suggestions of other examples you might compare to – and I look forward to seeing how these work in the revised manuscript.
2) All of the reviewers have major comments for clarifications and addressing these will improve the manuscript. For example, reviewer 1 asks you to comment on the weights (how they are chosen, how they depend on model hyperparameters, etc.); Reviewer 2 points out that the methodology is not fully developed in this manuscript, but instead will be described elsewhere. The consensus view of the reviewers is that it is important for the methods to be fully described in this paper so that the results are reproducible by others, and a promise of publishing these elsewhere is insufficient.
3) There are other explicit comments included in these reviews about ways of clarifying the manuscript that I trust you will address in a revised version.
Reviewer #1:
In this manuscript Tubiana et al. show that Restricted Boltzman Machines are able to learn models of protein families from sequence data. The authors observe two contrasting approaches to building models from protein family sequence data that have been shown in the literature to either (i) identify functionally important groups of sequence positions, or ii) provide information about sequence positions that are close in 3D protein structure. Approach (i) often involves applying PCA type approaches to large sets of protein sequences, and examining the weights assigned to different sequence positions on those principal components with largest eigenvalues, while versions of approach (ii) such as DCA have been motivated by maximumentropy based modelling.
The authors note that a unique framework that can extract both structurally and functionally important features of a protein family is still missing, and propose Restricted Boltzmann Machines as a solution. In the manuscript, they report a method to efficiently learn RBM from sequence data, and describe applications to three diverse protein families, in addition to data generated using a latticeprotein model. Much of the manuscript is devoted to describing the sets of sequence positions that are structurally and/or functionally important in these protein families, and that can be identified by examining the weight coefficients of the RBM.
Within the main text, the authors cite their previous publication, from 2013, in which they propose the HopfieldPotts model as a means of naturally interpolating between PCAbased approaches to identifying functionally important groups of residues, and DCAtype approaches. There they also observed sparse modes (or patterns) that correspond to contacts in 3D structure, and in addition less sparse modes that correspond to features such as conserved sequence positions that form spatially connected and functionally important regions in the folded protein that are reminiscent of protein sectors, or positions at which many gap symbols are found in the alignments.
This makes me question whether there is really no framework that can extract both structurally and functionally important features of a protein family. I think the authors should be careful to accurately place this manuscript in the context of prior work. Moreover, there appear to be some similarities between the energy function of an RBM, and that of a Hopfield model – this would be an interesting point for the authors to discuss. It would be interesting to understand from the main text why the authors feel that the modeling assumptions of the RBM are applicable to protein sequence data.
My major comment is that the new manuscript focuses on a rather detailed description of how a small number of RBM weights correspond to structurally and functionally important sequence positions. Almost no time at all in the main text describing how they were able to efficiently learn RBM from sequence data, which seems to be the novel contribution of this work. The authors also do not provide any comparison between the weights of RBM and patterns learned from sequence data in prior work, such as their own 2013 paper. Specifically, did the patterns derived from the HopfieldPotts model differ significantly from those found in the RBM weights? A thorough comparison to prior work is necessary to distinguish the advances made in this work.
Beyond this, I would ask the authors to at least comment on how the diverse weights are chosen, and what happens to the weights as the number of hidden units in the model is varied. Moreover, it would be interesting to understand how the weights change as the number, and diversity of sequences in the protein family alignment is varied. In addition, the authors pay only passing attention to the fact that the protein sequences contained in the alignment are chosen uniformly at random from some underlying distribution, but rather are biased by their shared evolutionary history. The authors should provide some insight into how this affects the ability of the model to generalize.
Reviewer #2:
This manuscript introduces restricted Boltzmann machines (RBMs) for protein sequence analysis. There has been a recent revolution in inferring 3D protein structure by statistical analysis of pairwise residue covariations in deep protein sequence alignments, mostly using a class of models called Potts models; these authors have helped pioneer that work. Here the authors show that RBMs can efficiently capture effective pairwise couplings (as Potts models do), while also capturing even higherorder correlation patterns that arise as a result of phylogenetic and functional constraint. The math is wellprincipled and presented clearly and concisely. The paper shows a wellchosen set of illustrative biological examples that connect this work to the body of related previous work with Potts models and other methods.
This is a mathematical manuscript that introduces a compelling new analysis approach, but there isn't (yet) a new biological result. Maybe it won't have immediate impact on a wide eLife audience of biologists, although biologists should be able to get the gist. I expect there might be a discussion amongst reviewers and editors of whether it needs to be buttressed with more explicit biological relevance and biological results. My view is no. I believe this manuscript will be important and influential as it stands. I would compare it to the 1994 Anders Krogh and David Haussler JMB paper that introduced profile hidden Markov models, or the 2009 Martin Weight and Terry Hwa paper that introduced Potts models. These papers ignited new areas of computational biology research that later led to important biological results. The Krogh paper was weakened and diluted in its review process by JMB reviewers and editors asking for more biology. Many of us in the field worked from a predilution preprint more than the final published JMB version. I would advise not similarly diluting this strong manuscript. eLife should publish mathematical/technical papers occasionally, when the math is wellprincipled and clear, and where there's a good chance of founding a new important direction in biological data analysis. This manuscript is a nice example.
 At the beginning of Results, it would help to see even more intuitive rationale for the strengths of RBMs over Potts models, before diving in. For example, Materials and methods has an important line 'RBM<s> offer a practical way to go beyond pairwise models without an explosion in the number of interaction parameters'. Move this line up and clarify. I got a fair way into the paper thinking that RBMs would have to give up detailed pairwise correlation structure, in return for capturing phylogenetic and functional manyresidue correlation structure.
 One weakness of the paper is that it does not fully describe the method. In particular, I don't know how to calculate the partition function from reading this paper; I would have to refer to other cited work (Tieleman, 2009; Fischer and Igel, 2012), which might be fair enough. However, Materials and methods says that 'practical details… including several improvements to the standard Persistent Contrastive Divergence algorithm… will be presented elsewhere'. If those details are essential for the results, they need to be in the paper somewhere.
 '(precise Github address here)': I assume that's an oversight, and that code will be provided. Indeed that would go a long way towards addressing the above weakness, as far as I'm concerned.
Reviewer #3:
This manuscript describes a novel generative machine learning model called RBM (Restricted Boltzmann Machines) for multiple protein sequence alignment (MSA). Through RBM, a protein sequence can be generated from some hidden nodes trained from the MSA. A unique feature of this RBM model is that each hidden node may represent a small set of residues which may not be adjacent to one another along the primary sequence, but may form contacts or a functional motif. This is very different from the widelyused Potts model, in which mainly pairwise residue correlation is considered. The authors have shown that their RBM model can be used to detect motifs and interresidue contact prediction.
A major concern is that the authors tested their RBM model on only 4 natural proteins with solved structures and good functional annotations. By testing on very few proteins with solved structures, it is hard to convince the readers that the authors did not overfit their model. There are a few ways to address this issue:
1) Apply the model to analyze proteins without solved structures and then verify their results after the solved structures are available. For example, the authors may test their methods through the online benchmark CAMEO for protein contact prediction. Each week CAMEO will send out some test sequences for prediction. The native structures of these test proteins are not publicly available during the test, but will be public after 45 days.
2) The authors may also test their method on a large set of proteinswith solved structures for contact prediction, but using a simple way to choose M (the number of hidden nodes) and the regularization factor for all the test proteins. The authors may compare their result with currently popular contact prediction methods such as Evfold and CCMpred. It will also be interesting to study the relationship between accuracy and the number of effective sequences in MSA.
https://doi.org/10.7554/eLife.39397.090Author response
1) There is a difference of opinion among the reviewers about whether your methodology needs to be compared with other protein structures, and in particular examples that are not so well known and annotated. However the consensus opinion is that the manuscript would benefit from additional comparisons. Reviewer 3 explicitly calls out the possibility that there is overfitting involved, and it is hard to dismiss this out of hand without more effort to compare to the types of examples that they mention. Reviewer 3 makes reasonable suggestions of other examples you might compare to – and I look forward to seeing how these work in the revised manuscript.
To answer the concerns about overfitting, we have followed the suggestion of reviewer 3, and have further tested our modeling approach on 16 more families, used as a benchmark for contact predictions by plmDCA (an important inference method in the context of DirectCoupling Analysis – DCA) in the standard 2014 paper by Ekeberg et al. This new analysis allows us to:
a) Carry out a statistical study of the performance of RBM with respect to DCA for contact prediction. The new panel in Figure 6 shows that the performances of RBM under fixed regularization and with a number of hidden units proportional to the maximal rank (M=0.3 R) are comparable with the ones obtained by plmDCA.
b) Compare the most significant weights (with largest norms) across the 16 families, and extract broad classes of structural and functional motifs repeated in many families, including the proteins we had presented in the first version of the manuscript. A new figure (Figure 9) has been added to report these results, accompanied by a discussion in the section “Cross validation of the model and interpretability of the representations”.
We believe that these new results provide convincing evidence that our approach is not limited to a few “good” proteins or protein domains, but can be applied to any protein (with performances depending, of course, on the number of available sequences).
2) All of the reviewers have major comments for clarifications and addressing these will improve the manuscript. For example, reviewer 1 asks you to comment on the weights (how they are chosen, how they depend on model hyperparameters, etc.); Reviewer 2 points out that the methodology is not fully developed in this manuscript, but instead will be described elsewhere. The consensus view of the reviewers is that it is important for the methods to be fully described in this paper so that the results are reproducible by others, and a promise of publishing these elsewhere is insufficient.
We have added a detailed description of the main steps of the training algorithm in the Materials and methods section, as requested by reviewer 2. In particular, in the subsection "Learning Procedure", we now give the general structure of the equations for the gradient of the logprobability, the tricks used for moment evaluations, and the procedure followed to estimate the partition function. We also better present the main algorithmic innovations of the present paper, that is, the extension of RBM to multicategorical Pottslike variables, and the introduction of general nonquadratic hiddenunit potentials, interpolating between quadratic, doublewell (Bernoulli) and Rectified Linear Unit potentials (with hardwall constraints). Readers should have no problem in understanding our code for training RBM, which is made available here, together with the scripts necessary to reproduce the figures in our manuscript. In addition, we have expanded the paragraph entitled “Learning” in the main text (section “Restricted Boltzmann Machines”); this paragraph lists the main ingredients of our training procedure, and is intended for readers not interested in the technical details presented in the Materials and methods section.
We have also added a discussion on the comparison of RBM with the Hopfield Potts approach, another procedure for extracting simultaneously structural and functional information from sequence data proposed by two of us and collaborators in 2013, as required by reviewer 1. We have better developed the discussion on how the hyperparameters of the RBM should be chosen to learn efficiently interpretable features. Following reviewer 1’s request, we now explain how weights are selected, and study the stability of the extracted features under undersampling.
3) There are other explicit comments included in these reviews about ways of clarifying the manuscript that I trust you will address in a revised version.
Reviewer #1:
[…] Within the main text, the authors cite their previous publication, from 2013, in which they propose the HopfieldPotts model as a means of naturally interpolating between PCAbased approaches to identifying functionally important groups of residues, and DCAtype approaches. There they also observed sparse modes (or patterns) that correspond to contacts in 3D structure, and in addition less sparse modes that correspond to features such as conserved sequence positions that form spatially connected and functionally important regions in the folded protein that are reminiscent of protein sectors, or positions at which many gap symbols are found in the alignments.
This makes me question whether there is really no framework that can extract both structurally and functionally important features of a protein family. I think the authors should be careful to accurately place this manuscript in the context of prior work. Moreover, there appear to be some similarities between the energy function of an RBM, and that of a Hopfield model – this would be an interesting point for the authors to discuss.
We fully agree that the HopfieldPotts model (Cocco, Monasson and Weigt, 2013) was an attempt to extract simultaneously structural and functional information from sequence data, and, in this regard, is conceptually similar to RBM (with the notable exception that the idea of representation was essentially lacking). Actually, the connection between RBM and HopfieldPotts model is quite deep, as the latter is a particular case of the former, obtained when the hiddenunit potentials are quadratic functions of their arguments h. This connection, which is clearly emphasized in the new manuscript (see section “Restricted Boltzmann Machines”, in between Equations 5 and 6) was, indeed, a major incentive for us to start studying RBM. Some of the results found with RBM, such as very localized modes on two sites coding for contacts or extended modes due to stretches of gaps at the sequence extremities, were already found with HopfieldPotts models. But, despite those similarities, the work on the Hopfield Potts approach cited above suffered from several drawbacks (which we feel free to list, being among the authors of this work…):
1) the patterns (here, the weights) w were inferred within the meanfield hypothesis, known to be fast but inaccurate. In particular, 2point statistics are not reproduced. As a consequence, the inferred HopfieldPotts model cannot be used to generate new sequences in practice; it is simply too poor an approximation of the underlying sequence distribution.
2) the quadratic nature of the interactions made the patterns not uniquely defined, as they could be rotated in the muspace with no change on the probability distribution of the sequences. To avoid this problem, we artificially imposed orthogonality between the weight vectors. There was (and there is) no justification for this procedure. The same problem arises in Independent Component Analysis: it is well known that identification of independent components cannot be done based on quadratic moments only, and requires knowledge of higherorder moments (captured in RBM by nonquadratic potentials).
3) the importance of sparsity was not recognized. While some inferred patterns were sparse in practice (such as the ones corresponding to contacts), most were not (see middle panel in Figure 2 of the abovementioned paper) and could not be interpreted.
We have added in Appendix 1 several figures (for Kunitz, WW, and LP; results for HSP70 are given in a separate file in Appenndix 1) to illustrate these points, and showing in particular:
1) Some of the patterns inferred for the protein families presented in the main text. Loweigenvalue patterns are sparse (as reported in the 2013 PLoS Comp Bio paper), but higheigenvalue patterns coding for collective modes are extended, and therefore hard to relate to function.
ii) Contact predictions with the HopfieldPotts model, showing worse performances than RBM (for the same number of hidden units).
iii) Benchmarking of generated sequences with Hopfield Potts on Lattice Proteins, similar to what is shown in Figure 7F for RBM. With a small pseudocount, sequences generated by the HopfieldPotts model are very bad. Using a larger pseudocount, sequences have decent folding probabilities, but very low diversity (in agreement with findings by Jacquin et al., 2016).
Our current work on RBM does not suffer from any of these problems, and we are convinced that RBM are much more accurate and controlled ways to infer sequence motifs than the HopfieldPotts approach. Yet, we agree with reviewer 1 that proper acknowledgements about previous works are required, and we have modified the Discussion accordingly.
It would be interesting to understand from the main text why the authors feel that the modeling assumptions of the RBM are applicable to protein sequence data.
RBM are natural candidates to interpolate between PCAbased and DCAbased approaches are they can be thought as either a method for extracting modes/representations (which goes beyond PCA) or an effective way to model complex (pairwise and higherorder) interactions between residues (which extends DCA). We have added a sentence in the section on RBM to explain this important point.
My major comment is that the new manuscript focuses on a rather detailed description of how a small number of RBM weights correspond to structurally and functionally important sequence positions. Almost no time at all in the main text describing how they were able to efficiently learn RBM from sequence data, which seems to be the novel contribution of this work.
As reviewer 1 emphasizes, we primarily aimed at convincing readers that RBM could learn structurally and functionally relevant motifs in protein sequences. This is made possible by:
1) the introduction of parametric class of nonquadratic potentials, called double Rectified Linear Unit potentials, which spans a large variety of behaviors, allowing hidden units to have smooth, bimodal, or constrained (e.g. positive valued) conditional probabilities. We have added a formula (numbered 6) for dReLU potentials in the main text.
2) the introduction of the socalled L1/L2 sparsity regularization, which is an adaptive L1 penalty on the weights. This regularization is briefly described in the Results sections, and details are given in the Materials and methods section. We also discuss at length in the section "Crossvalidation of model and interpretability of representations" how the choice of the regularization strength is crucial to obtain meaningful weights.
3) the selection of hyperparameters (in particular, the number of hidden units) and how they affect performance (for instance, for contact prediction) is also discussed in the main text and in Appendix.
The authors also do not provide any comparison between the weights of RBM and patterns learned from sequence data in prior work, such as their own 2013 paper. Specifically, did the patterns derived from the HopfieldPotts model differ significantly from those found in the RBM weights? A thorough comparison to prior work is necessary to distinguish the advances made in this work.
Beyond this, I would ask the authors to at least comment on how the diverse weights are chosen?
The weights shown in Figures 2, 3, 4 were selected manually based on several criteria:
i) Weight norm, which is a proxy for the importance of the corresponding hidden unit. Hidden units with larger weight norms contribute more to the likelihood, whereas lowweight norms may arise from noise/overfitting.
ii) Weight sparsity. Hidden units with sparse weights are easier to interpret in terms of structural/functional constraints.
iii) Shape of input distribution. Hidden units with bimodal input distribution separate the family in two subfamilies and are therefore potentially interesting.
iv) Comparison with available literature, in particular mutagenesis experiments.
v) Diversity of features in the weights (many weights correspond to contacts, and we do show all of them).
We have made these criteria explicit in the manuscript.
The weights shown in Figure 9 (one for each one of the 16 families) were picked up arbitrarily among the top (having the largest norms) 10 weights of each family to illustrate the different broad classes of motifs. A large scale, statistical approach would be necessary to better define those classes, as underlined in Discussion.
And what happens to the weights as the number of hidden units in the model is varied.
At low number M of hidden units (and fixed large regularization), the weights are not sparse, which is similar to what is found with PCA or HopfieldPotts. As M increases, weights get sparser, as more hidden units are available to code for the various statistical features of the data. Above a certain value of M, the weights stabilize, and extra hidden units are essentially copies of the previous ones. The results shown here correspond to after the compositional regime emergence, prior to the ‘copy’ regime.
Moreover, it would be interesting to understand how the weights change as the number, and diversity of sequences in the protein family alignment is varied.
To show how the weights change with sampling, we have trained again our RBM after random removal of 50% the sequences in the MSA of the WW domain. We compare in Appendix 1—figure 13 the inferred weights to the ones obtained with the full MSA and shown in Figure 3B. We find that the weights are quite robust despite the removal of half of the data. Not surprisingly, weights with small norms (and not selected according to criterion i) above) are more sensitive to sampling noise.
The question on diversity is related to the presence of evolutionary correlations between sequences; the less diverse the sequences are, the stronger should be the reweighting factor, see point below.
In addition, the authors pay only passing attention to the fact that the protein sequences contained in the alignment are chosen uniformly at random from some underlying distribution, but rather are biased by their shared evolutionary history. The authors should provide some insight into how this affects the ability of the model to generalize.
Reviewer 1 is right that the sequences sampled in databases are not independent. This problem is shared by all sequence analysis methods such as Hidden Markov Models or DCA. We treat this dependence here in a rough way: each sequence S is assigned a weight factor (subsequently used for computing averages over data) proportional to the inverse of the number of neighbor sequences S’ in the alignment; S’ and S are considered as neighbors if their Hamming distance is smaller than a cut off, generally 20% of the sequence length. This flattening procedure is known to improve contact prediction with DCA.
For RBM, this flattening reduces overfitting as well. To see this, we have repeated the training on the WW domain sequence data without reweighting (or regularization). We have generated sequences and plotted their logprobabilities vs. their distances to closest sequences in the MSA (same as Figures 5E and F). Without regularization or reweighting, the generated sequences are closer to the set of natural sequences, and their logprobability is higher, meaning that they are less diverse. We have added a comment in the "Crossvalidation…" section and the corresponding figure in the Appendix.
We stress that reweighting is not necessary for well sampled sequences, i.e. for rather uniformly distributed sequences. In particular, for Lattice Proteins where we have full control and can sample sequences by Monte Carlo at equilibrium, reweighting does not improve performance, e.g. for contact prediction. Conversely, when the sampling of sequences is biased around a given (wild type) sequence, reweighting is needed to get back to good performance levels, see Jacquin et al., 2016.
Reviewer #2:
[…] At the beginning of Results, it would help to see even more intuitive rationale for the strengths of RBMs over Potts models, before diving in. For example, Materials and methods has an important line 'RBM<s> offer a practical way to go beyond pairwise models without an explosion in the number of interaction parameters'. Move this line up and clarify. I got a fair way into the paper thinking that RBMs would have to give up detailed pairwise correlation structure, in return for capturing phylogenetic and functional manyresidue correlation structure.
We agree that this sentence was unclear. As one key advantage of RBMs with respect to Potts models is that they can capture higherorder interactions, we have now expanded this notion in two places in the manuscript. First, we have expanded the section on the distribution of sequences in the section “Restricted Boltzmann Machines”; we now explain that RBMs reduce to pairwise interaction models when the hiddenunit potentials are quadratic, and offer a possibility to take into account higherorder interactions with nonquadratic potentials. We have introduced a new formula (numbered 6) to explicitly show the class of (dReLU) potentials considered in our work and fitted from the sequence data. Secondly, we come back on this notion in the Discussion and explain why RBM are efficient ways of introducing highorder interactions (whose number is a priori exponential in the sequence length N) from a limited number of parameters (the weights w).
 One weakness of the paper is that it does not fully describe the method. In particular, I don't know how to calculate the partition function from reading this paper; I would have to refer to other cited work (Tieleman, 2009; Fischer and Igel, 2012), which might be fair enough. However, Materials and methods says that 'practical details… including several improvements to the standard Persistent Contrastive Divergence algorithm… will be presented elsewhere'. If those details are essential for the results, they need to be in the paper somewhere.
We agree with reviewer 2 that the paper should be selfcontained. We have therefore added a detailed description of the main steps of the training algorithm in the Materials and methods section. In particular, in the subsection "Learning Procedure", we now give the general structure of the equations for the gradient of the logprobability, the tricks used for moment evaluations, and the procedure followed to estimate the partition function (with a paragraph explaining the principle of the Annealed Importance Sampling algorithm). We also better present the main algorithmic innovations of the present paper, that is, the extension of RBM to multicategorical Pottslike variables, and the introduction of general nonquadratic hiddenunit potentials, interpolating between quadratic, doublewell (Bernoulli) and Rectified Linear Unit potentials (with hardwall constraints). Readers should have no problem in understanding our code for training RBM, which is made available here, together with the scripts necessary to reproduce the figures in our manuscript. In addition, we have expanded the paragraph entitled “Learning” in the main text (section “Restricted Boltzmann Machines”); this paragraph lists the main ingredients of our training procedure, and is intended for readers not interested in the technical details presented in the Methods section.
 '(precise Github address here)': I assume that's an oversight, and that code will be provided. Indeed that would go a long way towards addressing the above weakness, as far as I'm concerned.
Yes, all the codes (training algorithm and scripts for reproducing the figures of the paper) are provided.
Reviewer #3:
[…] A major concern is that the authors tested their RBM model on only 4 natural proteins with solved structures and good functional annotations. By testing on very few proteins with solved structures, it is hard to convince the readers that the authors did not overfit their model. There are a few ways to address this issue:
1) Apply the model to analyze proteins without solved structures and then verify their results after the solved structures are available. For example, the authors may test their methods through the online benchmark CAMEO for protein contact prediction. Each week CAMEO will send out some test sequences for prediction. The native structures of these test proteins are not publicly available during the test, but will be public after 45 days.
2) The authors may also test their method on a large set of proteinswith solved structures for contact prediction, but using a simple way to choose M (the number of hidden nodes) and the regularization factor for all the test proteins. The authors may compare their result with currently popular contact prediction methods such as Evfold and CCMpred. It will also be interesting to study the relationship between accuracy and the number of effective sequences in MSA.
We understand the concerns about overfitting expressed by reviewer 3. Following his/her second suggestion we have further analyzed 16 protein families that were used to benchmark plmDCA by Ekeberg et al. in their original paper (Ekeberg, Hartonen and Aurell, 2014), which is exactly the same method as Evfold and CCMpred: Evfold directly uses the plmDCA code, and CCMpred is a GPU implementation of plmDCA.
This new analysis allows us to:
1) Carry out a statistical study of the performance of RBM with respect to DCA for contact prediction. The new panel in Figure 6 shows that the performances of RBM under fixed regularization and with a number of hidden units proportional to the maximal rank (M=0.3 R) are comparable with the ones obtained by plmDCA.
2) Compare the most significant weights (with largest norms) across the 16 families, and extract broad classes of structural and functional motifs repeated in many families, including the proteins we had presented in the first version of the manuscript. A new figure (Figure 9) has been added to report these results, accompanied by a discussion in the section “Crossvalidation of the model and interpretability of the representations”.
We believe that these new results provide convincing evidence that our approach is not limited to a few “good” proteins or protein domains, but can be applied to any protein (with performances depending of course on the number of available sequences).
https://doi.org/10.7554/eLife.39397.091Article and author information
Author details
Funding
École Normale Supérieure (Allocation Specifique)
 Jérôme Tubiana
Agence Nationale de la Recherche (CE30002101 RBMPro)
 Simona Cocco
 Rémi Monasson
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank D Chatenay for useful comments on the manuscript and L Posani for his help on lattice proteins.
Senior Editor
 Detlef Weigel, Max Planck Institute for Developmental Biology, Germany
Reviewing Editor
 Lucy J Colwell, Cambridge University, United Kingdom
Publication history
 Received: October 3, 2018
 Accepted: February 24, 2019
 Accepted Manuscript published: March 12, 2019 (version 1)
 Version of Record published: March 27, 2019 (version 2)
Copyright
© 2019, Tubiana et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 6,710
 Page views

 1,158
 Downloads

 53
 Citations
Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
The mouse brain is by far the most intensively studied among mammalian brains, yet basic measures of its cytoarchitecture remain obscure. For example, quantifying cell numbers, and the interplay of sex, strain, and individual variability in cell density and volume is out of reach for many regions. The Allen Mouse Brain Connectivity project produces highresolution full brain images of hundreds of brains. Although these were created for a different purpose, they reveal details of neuroanatomy and cytoarchitecture. Here, we used this population to systematically characterize cell density and volume for each anatomical unit in the mouse brain. We developed a DNNbased segmentation pipeline that uses the autofluorescence intensities of images to segment cell nuclei even within the densest regions, such as the dentate gyrus. We applied our pipeline to 507 brains of males and females from C57BL/6J and FVB.CD1 strains. Globally, we found that increased overall brain volume does not result in uniform expansion across all regions. Moreover, regionspecific density changes are often negatively correlated with the volume of the region; therefore, cell count does not scale linearly with volume. Many regions, including layer 2/3 across several cortical areas, showed distinct lateral bias. We identified strainspecific or sexspecific differences. For example, males tended to have more cells in extended amygdala and hypothalamic regions (MEA, BST, BLA, BMA, and LPO, AHN) while females had more cells in the orbital cortex (ORB). Yet, interindividual variability was always greater than the effect size of a single qualifier. We provide the results of this analysis as an accessible resource for the community.

 Computational and Systems Biology
 Immunology and Inflammation
To appropriately defend against a wide array of pathogens, humans somatically generate highly diverse repertoires of B cell and T cell receptors (BCRs and TCRs) through a random process called V(D)J recombination. Receptor diversity is achieved during this process through both the combinatorial assembly of V(D)Jgenes and the junctional deletion and insertion of nucleotides. While the Artemis protein is often regarded as the main nuclease involved in V(D)J recombination, the exact mechanism of nucleotide trimming is not understood. Using a previously published TCRβ repertoire sequencing data set, we have designed a flexible probabilistic model of nucleotide trimming that allows us to explore various mechanistically interpretable sequencelevel features. We show that local sequence context, length, and GC nucleotide content in both directions of the wider sequence, together, can most accurately predict the trimming probabilities of a given Vgene sequence. Because GC nucleotide content is predictive of sequencebreathing, this model provides quantitative statistical evidence regarding the extent to which doublestranded DNA may need to be able to breathe for trimming to occur. We also see evidence of a sequence motif that appears to get preferentially trimmed, independent of GCcontentrelated effects. Further, we find that the inferred coefficients from this model provide accurate prediction for V and Jgene sequences from other adaptive immune receptor loci. These results refine our understanding of how the Artemis nuclease may function to trim nucleotides during V(D)J recombination and provide another step toward understanding how V(D)J recombination generates diverse receptors and supports a powerful, unique immune response in healthy humans.