1. Computational and Systems Biology
  2. Physics of Living Systems

Learning protein constitutive motifs from sequence data

  1. Jérôme Tubiana
  2. Simona Cocco
  3. Rémi Monasson (corresponding author)
  1. École Normale Supérieure, France
Tools and Resources
Cite this article as: eLife 2019;8:e39397 doi: 10.7554/eLife.39397


Statistical analysis of evolutionarily related protein sequences provides insights about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. Here we apply RBM to twenty protein families, and present detailed results for two short protein domains, Kunitz and WW, one long chaperone protein, Hsp70, and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, extended secondary motifs (α-helix and β-sheet) and intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing the different modes and turning them up or down at will. Our work therefore shows that RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.
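
To make the modelling idea concrete, the following is a minimal sketch of how an RBM can score an aligned protein sequence. It is not the code of the article's package: the function names (`zero_weights`, `hidden_inputs`, `free_energy`), the 21-symbol alphabet ordering, and the choice of binary hidden units are all assumptions made for illustration, and the hidden-unit potential used in the article may differ. Each hidden unit receives an input summed over sequence positions, and summing out the hidden units gives a free energy whose negative is the unnormalized log-probability of the sequence.

```python
import math

# Assumed alphabet: 20 amino acids plus the alignment gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def zero_weights(n_hidden, length):
    """Hypothetical helper: weights[mu][i][a] couples hidden unit mu
    to amino acid a at alignment position i, initialised to zero."""
    return [[[0.0] * len(ALPHABET) for _ in range(length)]
            for _ in range(n_hidden)]


def hidden_inputs(seq, weights):
    """Input I_mu = sum_i w[mu][i][v_i] received by each hidden unit."""
    return [sum(w_mu[i][AA_INDEX[aa]] for i, aa in enumerate(seq))
            for w_mu in weights]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(seq, fields, weights, hidden_bias):
    """Free energy F(v) for binary hidden units h in {0, 1} (an assumption):
    F(v) = -sum_i g_i(v_i) - sum_mu log(1 + exp(I_mu + b_mu)).
    Lower F means higher model probability, up to normalisation."""
    visible = sum(fields[i][AA_INDEX[aa]] for i, aa in enumerate(seq))
    hidden = sum(softplus(inp + b)
                 for inp, b in zip(hidden_inputs(seq, weights), hidden_bias))
    return -(visible + hidden)


# Toy usage: one hidden unit that favours A at position 0
# and disfavours C at position 2 of a length-3 alignment.
weights = zero_weights(n_hidden=1, length=3)
weights[0][0][AA_INDEX["A"]] = 2.0
weights[0][2][AA_INDEX["C"]] = -1.0
fields = [[0.0] * len(ALPHABET) for _ in range(3)]

print(hidden_inputs("AGC", weights))
print(free_energy("AGC", fields, weights, hidden_bias=[0.0]))
```

In this picture, a hidden unit with a large input for a given sequence signals that the sequence carries the motif encoded by that unit's weights, which is the sense in which inferred features can be read off and, for design, turned up or down.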

Data availability

The Python 2.7 package for training and visualizing RBMs, used to obtain the results reported in this work, is available at https://github.com/jertubiana/ProteinMotifRBM. It can be readily used for any protein family. All four multiple sequence alignments presented in the text, together with the code for reproducing each panel, are also included. Jupyter notebooks are provided for reproducing most figures of the article.


Article and author information

Author details

  1. Jérôme Tubiana

    Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0001-8878-5620
  2. Simona Cocco

    Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France
    Competing interests
    The authors declare that no competing interests exist.
  3. Rémi Monasson

    Laboratoire de Physique Théorique, École Normale Supérieure, Paris, France
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-4459-0204


Funding

Centre National de la Recherche Scientifique

  • Jérôme Tubiana
  • Simona Cocco
  • Rémi Monasson

École Normale Supérieure (Allocation Spécifique)

  • Jérôme Tubiana

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Lucy J Colwell, Cambridge University, United Kingdom

Publication history

  1. Received: October 3, 2018
  2. Accepted: February 24, 2019
  3. Accepted Manuscript published: March 12, 2019 (version 1)
  4. Version of Record published: March 27, 2019 (version 2)


© 2019, Tubiana et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.



