Statistical analysis of evolutionarily related protein sequences provides insight into their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. Here we apply RBM to twenty protein families, and present detailed results for two short protein domains, Kunitz and WW, one long chaperone protein, Hsp70, and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, extended secondary motifs (α-helix and β-sheet) and intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing the different modes and turning them up or down at will. Our work therefore shows that RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.
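
As a rough illustration of how an RBM reads a protein sequence, the sketch below encodes an aligned sequence as one categorical state per site and computes the input each hidden unit receives from it. It assumes the generic textbook form of an RBM with categorical visible units; it is not the authors' package, and the alphabet ordering, the function names and the random weights are all hypothetical.

```python
import random

# 20 amino acids plus the alignment gap symbol (an assumed alphabet ordering).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def encode(sequence):
    """Map an aligned sequence to a list of integer states, one per site."""
    return [INDEX[aa] for aa in sequence]


def hidden_inputs(states, weights):
    """Input to each hidden unit mu: the sum over sites i of w[mu][i][v_i]."""
    return [sum(w_mu[i][v] for i, v in enumerate(states)) for w_mu in weights]


def random_weights(n_hidden, n_sites, seed=0):
    """Hypothetical random weights; a trained RBM would learn these from data."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.1) for _ in ALPHABET]
             for _ in range(n_sites)]
            for _ in range(n_hidden)]


if __name__ == "__main__":
    seq = "ACD-W"
    w = random_weights(n_hidden=3, n_sites=len(seq))
    # One number per hidden unit: how strongly the sequence activates that feature.
    print(hidden_inputs(encode(seq), w))
```

In this picture, each hidden unit's weight table plays the role of a sequence motif, and a large positive or negative input indicates that the sequence carries that motif.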

Data availability

The Python 2.7 package for training and visualizing RBMs, used to obtain the results reported in this work, is available at https://github.com/jertubiana/ProteinMotifRBM. It can be readily used for any protein family. Moreover, all four multiple sequence alignments presented in the text, as well as the code for reproducing each panel, are also included. Jupyter notebooks are provided for reproducing most figures of the article.

The following previously published data sets were used

Article and author information

Author details

  1. Jérôme Tubiana

    Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0001-8878-5620
  2. Simona Cocco

    Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France
    Competing interests
    The authors declare that no competing interests exist.
  3. Rémi Monasson

    Laboratoire de Physique Théorique, École Normale Supérieure, Paris, France
    For correspondence
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-4459-0204


Centre National de la Recherche Scientifique

  • Jérôme Tubiana
  • Simona Cocco
  • Rémi Monasson

Ecole Normale Supérieure (Allocation Specifique)

  • Jérôme Tubiana

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Lucy J Colwell, Cambridge University, United Kingdom

Version history

  1. Received: October 3, 2018
  2. Accepted: February 24, 2019
  3. Accepted Manuscript published: March 12, 2019 (version 1)
  4. Version of Record published: March 27, 2019 (version 2)


© 2019, Tubiana et al.

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.




Cite this article

  1. Jérôme Tubiana
  2. Simona Cocco
  3. Rémi Monasson
Learning protein constitutive motifs from sequence data
eLife 8:e39397.
