Learning protein constitutive motifs from sequence data
Abstract
Statistical analysis of evolutionarily related protein sequences provides insights into their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to twenty protein families, and present detailed results for two short protein domains, Kunitz and WW, one long chaperone protein, Hsp70, and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, extended secondary motifs (α-helix and β-sheet) and intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and turning up or down the different modes at will. Our work therefore shows that RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.
Data availability
The Python 2.7 package for training and visualizing RBMs, used to obtain the results reported in this work, is available at https://github.com/jertubiana/ProteinMotifRBM. It can be readily used for any protein family. Moreover, all four multiple sequence alignments presented in the text, as well as the code for reproducing each panel, are also included. Jupyter notebooks are provided for reproducing most figures of the article.
- THE 1.2 ANGSTROM STRUCTURE OF KUNITZ TYPE DOMAIN C5. Protein Data Bank, 2KNT.
- Crystal Structure of the Nanog Homeodomain. Protein Data Bank, 2VI6.
- CRYSTAL STRUCTURE OF SH2 IN COMPLEX WITH RU82209. Protein Data Bank, 1O47.
- The crystal structure of Sod2 from Saccharomyces cerevisiae. Protein Data Bank, 3BFR.
- Crystal Structure of domain 3 of human alpha polyC binding protein. Protein Data Bank, 1WVN.
- Crystal Structure Analysis of human E-cadherin (1-213). Protein Data Bank, 2O72.
Article and author information
Funding
Centre National de la Recherche Scientifique
- Jérôme Tubiana
- Simona Cocco
- Rémi Monasson
Ecole Normale Supérieure (Allocation Specifique)
- Jérôme Tubiana
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Copyright
© 2019, Tubiana et al.
This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- 7,387 views
- 1,290 downloads
- 108 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.