Learning protein constitutive motifs from sequence data

Abstract

Statistical analysis of evolutionarily related protein sequences provides insights into their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. Here we apply RBM to twenty protein families and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins used for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, extended α-helix and β-sheet secondary motifs, and intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing the different modes and turning them up or down at will. Our work therefore shows that RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.
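
To make the idea of hidden units reading features from a sequence more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 21-symbol alphabet (20 amino acids plus the alignment gap), binary (Bernoulli) hidden units as a simplification, and hypothetical names (`encode`, `hidden_inputs`, `hidden_probabilities`, `sample_hidden`) chosen only for illustration. Each hidden unit receives an input that sums one weight per site, selected by the residue found at that site.

```python
import math
import random

# 20 amino acids plus the alignment gap symbol (ordering is an assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
SYMBOL_INDEX = {symbol: k for k, symbol in enumerate(ALPHABET)}


def encode(sequence):
    """Map an aligned sequence to a list of integer states, one per site."""
    return [SYMBOL_INDEX[symbol] for symbol in sequence]


def hidden_inputs(states, weights):
    """Input to each hidden unit: sum over sites i of weights[mu][i][v_i]."""
    return [
        sum(site_weights[state] for site_weights, state in zip(unit_weights, states))
        for unit_weights in weights
    ]


def hidden_probabilities(states, weights, hidden_bias):
    """P(h_mu = 1 | v) for Bernoulli hidden units (a simplifying assumption)."""
    return [
        1.0 / (1.0 + math.exp(-(total + bias)))
        for total, bias in zip(hidden_inputs(states, weights), hidden_bias)
    ]


def sample_hidden(states, weights, hidden_bias, rng):
    """Draw a binary hidden configuration given the visible sequence."""
    probabilities = hidden_probabilities(states, weights, hidden_bias)
    return [1 if rng.random() < p else 0 for p in probabilities]


# Toy model: 2 hidden units, 3 sites, all weights zero except one entry.
n_hidden, n_sites = 2, 3
toy_weights = [
    [[0.0] * len(ALPHABET) for _ in range(n_sites)] for _ in range(n_hidden)
]
toy_weights[0][0][SYMBOL_INDEX["A"]] = 2.0  # hidden unit 0 responds to A at site 0
toy_states = encode("AC-")
toy_hidden = sample_hidden(toy_states, toy_weights, [0.0, 0.0], random.Random(0))
```

In this toy setting, a hidden unit with large weights on a few sites acts as a detector for a sequence motif at those sites, which is the sense in which learned features can be inspected and interpreted.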

Data availability

The Python 2.7 package for training and visualizing RBMs, used to obtain the results reported in this work, is available at https://github.com/jertubiana/ProteinMotifRBM. It can be readily used for any protein family. All four multiple sequence alignments presented in the text, as well as the code for reproducing each panel, are also included. Jupyter notebooks are provided for reproducing most figures of the article.

The following previously published data sets were used:

(1997) THE 1.2 ANGSTROM STRUCTURE OF KUNITZ TYPE DOMAIN C5
Protein Data Bank, 2KNT.
https://www.rcsb.org/structure/2KNT

(2000) PROTOTYPE WW domain
Protein Data Bank, 1E0M.
https://www.rcsb.org/structure/1e0m

Zuiderweg ERP, Bertelsen EB
(2009) NMR-RDC / XRAY structure of E. coli HSP70 (DNAK) chaperone (1-605) complexed with ADP and substrate
Protein Data Bank, 2KHO.
https://www.rcsb.org/structure/2KHO

Qi R, Sarbeng EB, Liu Q, Le KQ, Xu X, Xu H, Yang J, Wong JL, Vorvis C, Hendrickson WA, Zhou L, Liu Q
(2013) Allosteric opening of the polypeptide-binding site when an Hsp70 binds ATP
Protein Data Bank, 4JNE.
https://www.rcsb.org/structure/4JNE

(2001) CRYSTAL STRUCTURE OF THE CATALYTIC DOMAIN OF HUMAN COMPLEMENT C1S PROTEASE
Protein Data Bank, 1ELV.
https://www.rcsb.org/structure/1ELV

(2005) CRYSTAL STRUCTURE AND ASSEMBLY OF TSP36, A METAZOAN SMALL HEAT SHOCK PROTEIN
Protein Data Bank, 2BOL.
https://www.rcsb.org/structure/2BOL

(2007) Yes SH3 domain
Protein Data Bank, 2HDA.
https://www.rcsb.org/structure/2HDA

(2008) Crystal Structure of the Nanog Homeodomain
Protein Data Bank, 2VI6.
https://www.rcsb.org/structure/2VI6

Baumann H, Paulsen K, Kovacs H, Berglund H, Wright APH, Gustafsson J-A, Hard T
(1994) REFINED SOLUTION STRUCTURE OF THE GLUCOCORTICOID RECEPTOR DNA-BINDING DOMAIN
Protein Data Bank, 1GDC.
https://www.rcsb.org/structure/1GDC

Kim C
(2009) Crystal structure of a complex between the catalytic and regulatory (RI{alpha}) subunits of PKA
Protein Data Bank, 3FHI.
https://www.rcsb.org/structure/3FHI

Wang X, Hall TMT
(2001) CRYSTAL STRUCTURE OF HUD AND AU-RICH ELEMENT OF THE TUMOR NECROSIS FACTOR ALPHA RNA
Protein Data Bank, 1G2E.
https://www.rcsb.org/structure/1G2E

(2004) CRYSTAL STRUCTURE OF SH2 IN COMPLEX WITH RU82209
Protein Data Bank, 1O47.
https://www.rcsb.org/structure/1O47

He Y-X, Zhao M-X, Zhou C
(2008) The crystal structure of Sod2 from Saccharomyces cerevisiae
Protein Data Bank, 3BFR.
https://www.rcsb.org/structure/3BFR

(2005) Crystal Structure of domain 3 of human alpha polyC binding protein
Protein Data Bank, 1WVN.
https://www.rcsb.org/structure/1WVN

Bravo J, Staunton D, Heath JK, Jones EY
(1998) CYTOKYNE-BINDING REGION OF GP130
Protein Data Bank, 1BQU.
https://www.rcsb.org/structure/1BQU

Joint Center for Structural Genomics (JCSG)
(2002) Crystal structure of Ribonuclease III (TM1102) from Thermotoga maritima at 2.0 A resolution
Protein Data Bank, 1O0W.
https://www.rcsb.org/structure/1O0W

(1998) TERNARY COMPLEX OF AN ACTIVE SITE DOUBLE MUTANT OF HORSE LIVER ALCOHOL DEHYDROGENASE, PHE93=>TRP, VAL203=>ALA WITH NAD AND TRIFLUOROETHANOL
Protein Data Bank, 1A71.
https://www.rcsb.org/structure/1A71

Parisini E, Wang J-H
(2007) Crystal Structure Analysis of human E-cadherin (1-213)
Protein Data Bank, 2O72.
https://www.rcsb.org/structure/2O72

Xiao G, Ji X, Armstrong RN, Gilliland GL
(1996) FIRST-SPHERE AND SECOND-SPHERE ELECTROSTATIC EFFECTS IN THE ACTIVE SITE OF A CLASS MU GLUTATHIONE TRANSFERASE
Protein Data Bank, 6GSU.
https://www.rcsb.org/structure/6gsu

Binda C, Coda A, Mattevi A, Aliverti A, Zanetti G
(1998) SPINACH FERREDOXIN
Protein Data Bank, 1A70.
https://www.rcsb.org/structure/1A70

Article and author information

Author details

Jérôme Tubiana

Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France

Competing interests
The authors declare that no competing interests exist.

ORCID: 0000-0001-8878-5620

Simona Cocco

Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France

Competing interests
The authors declare that no competing interests exist.

Rémi Monasson

Laboratoire de Physique Théorique, École Normale Supérieure, Paris, France

For correspondence
monasson@lpt.ens.fr

Competing interests
The authors declare that no competing interests exist.

ORCID: 0000-0002-4459-0204

Funding

Centre National de la Recherche Scientifique

Jérôme Tubiana
Simona Cocco
Rémi Monasson

Ecole Normale Supérieure (Allocation Specifique)

Jérôme Tubiana

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.