Large-scale determination of previously unsolved protein structures using evolutionary information

  1. Sergey Ovchinnikov
  2. Lisa Kinch
  3. Hahnbeom Park
  4. Yuxing Liao
  5. Jimin Pei
  6. David E Kim
  7. Hetunandan Kamisetty
  8. Nick V Grishin
  9. David Baker  Is a corresponding author
  1. University of Washington, United States
  2. Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, United States
  3. University of Texas Southwestern Medical Center, United States
  4. Facebook Inc., United States
  5. Howard Hughes Medical Institute, University of Washington, United States
17 figures, 4 tables and 3 additional files


Accurate blind structure prediction of CASP11 targets T0806 and T0824.

(A) Location of the most strongly co-evolving residue pairs. Lines connect residue pairs with normalized coupling strength greater than 1.0; yellow, distance less than 5 Å; orange, distance less than 10 Å and red, greater than 10 Å in the models. (B) CASP11 submitted models, colored from N to C terminus (blue to red). (C) X-ray crystal structures. For T0806, the Cα RMSD over the full-length protein is 3.6 Å and 2.9 Å over 223 aligned residues. For T0824; the Cα RMSD over the full-length protein is 4.2 Å and 2.7 Å over 77 aligned residues. For statistics on all five models submitted during CASP, see Figure 1—source data 1.
Figure 1—source data 1

The Cκ-RMSD and GDT-TS calculations are over the full-length sequence.

The total GREMLIN score for the model is reported. The most accurate models have the best GREMLIN score.
Conserved residues tend to cluster in the predicted structures.

Residue conservations from multiple sequence alignments were mapped to predicted structures using Al2Co (Pei and Grishin, 2001) and are colored in rainbow from blue (variable) to red (conserved). The most conserved residues (red or orange), displayed as spheres to highlight their positions, tend to line interaction surfaces and indicate potential functional sites.
Predicted structure of the Cytochrome bd oxidase complex.

(A) Location of the top co-evolving residue pairs in our model. For clarity, the monomers have been pulled apart slightly. (B) Location of conserved and experimentally characterized residues (Borisov et al., 2011) on structure model. (C) Residues that coordinate heme in CydA are in the same location as histidines (yellow spheres) in Cytochrome b561 (PDB: 4O6Y). H126 (red sphere) overlaps one of these histidines and is the proposed as a heme d coordination site. For clarity, both the model of CydA and the structure of Cyt b562 (4O6Y) are trimmed to highlight the four helix bundle(s).
Predicted structure of the tartrate dehydratase heterotetramer composed of two copies each of ttdA and ttdB.

(A) Co-evolving residue pairs. The monomers have been pulled apart to reveal the contacts. (B) The ttdA subunit (rainbow) contains a 4Fe-4S cluster (white spheres) that is near the interface with ttdB (green).
Succinate-acetate/proton symporter SatP (YaaH).

(A) Co-evolving residue pairs in homo-oligomer model. (B) Co-evolving residue pairs in SatP monomer model. (C) SatP co-evolution-based model places known acetate selective residues (magenta spheres) lining the channel. Conserved residues (gray spheres) line the periplasmic surface. The 6TMH channels are formed by threefold pseudo-symmetric TMH hairpins. (D) Proton-gated UreI channel protomer. C-alpha positions at the periplasmic constriction site (magenta spheres) and the cytoplasmic constriction site (pink spheres) are highlighted. The SatP model has the same fold as UreI (C vs D).
Lipid II flippase (FtsW) in complex with the transmembrane domain of Peptidoglycan synthase (FtsI).

FtsW is an essential cell division protein that transports lipids across the cytoplasmic membrane and is required for localization of FtsI. (A) Location of the top co-evolving residue pairs. (B) White spheres indicate conserved positions in FtsW that when mutated to alanine result in loss of flippase activity. (C, D) The last seven transmembrane (TM) helices of FtsW (TMH4-TMH10) adopting a similar topology as TMH4-TMH7 and TMH10-12 of the TM domain of STT3 (PDB: 3WAK). Both the model of FtsW and 3WAK was trimmed over the aligned helices for clarity. (C) Two residues from the corresponding TMH4 of FtsW (R145 and K153) are essential for flippase activity. (D) The side chain of R145 overlaps the residues that coordinate the divalent metal in the conserved DxH motif at the N-terminus of STT3 TMH4. (E) The model of RfaL adopts a similar fold as ftsW.
Prolipoprotein diacylglyceryl transferase (LGT).

(A) Predicted contacts indicated on model, (B) model with conserved positions at which alanine mutations result in loss in activity indicated in white spheres; five of these are clustered at the periplasmic end of the model.
UppP catalyzes the dephosphorylation of undecaprenyl diphophate (UPP).

(A) Location of the top co-evolving residue pairs. (B) Spheres in white indicate conserved residues experimentally shown to decrease activity to <1% (Chang et al., 2014); all these residues are in the core in the model. YfcA, a protein of unknown function is a very distant sequence homologue of UppP (they are in different PFAM families); (C) the predicted structure of YfcA has the same fold as UppP with prominent broken helices (highlighted in blue and yellow).
PrsW is an intramembrane protease that is crucial in the resistance to antimicrobial peptides.

(A) Location of the top co-evolving residue pairs. Our model of PrsW (B) contains a 4TMH substructure ([TMHs 3–6], (C) which is very similar to a substructure of type II CAAX prenyl protease [TMHs 4–7; D]). The predicted active site residues in PrsW motifs EExxK (TMH3; blue spheres, C) and HxxxD (TMH6; red spheres, C) are in positions similar to those of conserved residues in motifs EExxxR (TMH4; blue spheres, D) and HxxxN (TMH7; red spheres, D) of type II CAAX prenyl protease. Another conserved histidine in the fifth TMH of PrsW (but absent in type II CAAX prenyl protease) is also located in the predicted active site of PrsW (black sphere).
Our model of the inner membrane protein YeiH (A, B) is structurally similar to the structure of the antiporter NapA (C).

Lower panels: TM helices of core domains are highlighted in white and magenta: while these helices cross over each other in NapA (right), the core of our model of YeiH (left) is formed by two pairs of broken helices (TM5, 6 and TM8, 9) that exit the membrane on the same side.
Our model of RarD has a similar architecture to EmrE but different fold.

(A, B) Full-length RarD and EmrE homodimer. (C) RarD internal repeat and EmrE monomer. The N-terminus helix (blue) is swapped in RarD relative to EmrE due to helix insertion (gray).
Predicted structures of YqfA and YhhN have topologies similar to G protein-coupled receptors (GPCR-like).

The core seven TM helix (TMH) fold exhibited by members of the GPCR superfamily is colored in rainbow from the N- to the C-terminus. (A) Bacteriorhodopsin binds retinal (pink spheres) in a pocket formed by TMH3-7 [PDB ID: 1m0k]. (B) The agonist (pink spheres) binding site of P2Y12 receptor is formed by the same set of helices [PDB ID: 4pxz]. (C) A co-evolution-based structure model for YhhN has the GPCR topology with an N-terminal TMH extension. Conserved residues that might form an active site (magenta spheres) cluster in a similar place as the YqfA catalytic residues. (D) Our co-evolution-based structure model for YqfA has a GPCR like topology and clusters residues that may form an active site (magenta spheres mark the Calpha position) in a region that corresponds to the GPCR ligand-binding pocket. (E) Side and top view of the TMalign superposition of YqfA model (in rainbow) over the recently released 3wxw (in white) human ortholog. The N- and C-terminal loops were trimmed for clarity. The TMalign score between the model and the homolog is 0.8.
YfiP predicted structure has methyltransferase-like fold with knot.

Left: the top co-evolving residues pairs. Middle: conserved residues (magenta) surround the AdoMet-binding site and a conserved Cys could bind a Fe4S cluster. Right: 3nk7 methyltransferase bound to AdoMet.
Bacillus subtilis YitE model.

(A) The top co-evolving residues. YitE (B) has architecture similar to aquaporin (PDB:2B6P) (C), including the internal pseudo-symmetry (blue vs green), but completely different connectivity.
Escherichia coli protein YgdD.

Our model of the E. coli protein YgdD trimer (C) is based on predicted contacts satisfied within the monomer (B) and between monomers in the homo-trimer (A). Structural similarity to heme copper oxidase (D) along with a weak HHpred sequence match over part of the protein suggests that YgdD is evolutionarily related to heme–copper oxidases.
Figure 16 with 1 supplement
Dependence of the accuracy of predicted contacts on the normalized GREMLIN score (sco), the effective number of sequences (seq), the length (len), and the sequence separation (sep).

Contacts are defined based on amino acid specific Cβ-Cβ distance cutoffs as described in SI Table 3 in Kamisetty et al. (2013). (A) Observed vs predicted accuracies over a large data set of proteins of known structure with deep alignments (Supplementary file 3), sub sampled to different extents (seq/√(len) = 4 (red), 8 (green), 15 (purple), 32 (cyan), and 96 (orange)). Circles represent observed contact prediction accuracies, solid lines, a fit to a sigmoid function of the normalized coupling value, the number of sequences, the length, and the sequence separation (see Figure 16—figure supplement 1 and Figure 16—figure supplement 2). (B) Observed vs predicted accuracies in an independent data set of variable length alignments for 7047 pdb chains (Supplementary file 3), using maximum number of sequences obtained with HHblits as opposed to subsampling a large alignment. Circles again represent observed contact prediction accuracies; solid lines, the predicted accuracy using the model obtained by fitting to the data in (A). The contact prediction accuracy is correctly modeled for the independent data set, justifying its use on the unknown cases described in this article. The Equation use to calculate P(contact|sco,seq,len,sep) is

Figure 16—figure supplement 1
Contact prediction accuracy is better correlated with (#sequences/sqrt(length)) than with (#sequences/length).

Accuracy is computed for the top 3L/2 GREMLIN predictions, with sequence separation ≥3, based on Cβ-Cβ amino acid specific distance as described in SI Table 3 in Kamisetty et al. (2013). The number of sequences after reducing the redundancy to 80% is shown. A set of 7047 pdb chains (see Supplemental file 3) was divided into two groups by length (less than 150 and greater than 400). (A) Larger proteins with similar number of sequence were less accurate then the smaller proteins. (B) #Sequences/length as often used does not accurately account for length dependence. There is a clear separation between the blue and green distributions. (C) #Sequences/√length better accounts for the length dependency. The blue and green distributions overlap.
The Rc metric used to assess fit of predicted contacts to a model.

The expected total GREMLIN score if the structure was native was estimated by summing sco*P(contact|sco, seq, len, sep) over all contacts with sep ≥6. To evaluate the fit of a particular model to a predicted contact set, we take the ratio of the actual total GREMLIN score of the model to the expected total score computed as above; we refer to this ratio of observed and expected contact scores as ‘Rc’. Blue line: the distribution of Rc in native structures with 4L–10L sequences; Red line: distribution of Rc after randomly reassigning contact predictions to structures. Rc values less than 0.7 are very infrequently observed for native structures; we use this value as a cutoff to evaluate the fit of a predicted contact set to a model.


Table 1

Transmembrane protein benchmark
PDBNameSeq/lenFull proteinConvergedAligned
4HE8_H (3.3)NADH-quinone oxidoreductase subunit 817.34.92692.11832.2234
1SOR_A (N/A)Aquaporin-026.22.72212.11882.0200
4Q2E_A (3.4)Phosphatidate cytidylyltransferase18.65.42623.51762.8178
4HTT_A (6.8)Sec-independent protein translocase protein14.63.92251.81242.4181
4P6V_E (3.5)Na(+)-translocating NADH-quinone reductase subunit D14.35.01941.4492.8155
4J72_A (3.3)Phospho-N-acetylmuramoyl-pentapeptide-transferase19.96.63233.12512.4237
3V5U_A (1.9)Sodium/Calcium exchanger10.23.92973.72842.3245
4PGS_A (2.5)Uncharacterized protein YetJ15.43.52072.71752.2183
4QTN_A (2.8)Vitamin B3 transporter PnuC9.04.22023.01552.8178
4OD4_A (3.3)4-hydroxybenzoate octaprenyltransferase22.83.92753.42422.8231
4O6M_A (1.9)CDP-alcohol phosphotransferase13.34.11884.01652.3159
4WD8_A (2.3)Bestrophin domain protein5.94N/A268Not converged
4F35_A (3.2)Transporter, NadC family14.5N/A434Not converged
  1. Column 1, PDB code (resolution of the crystal structure); column 2, protein name; column 3, sequences per length, after filtering to reduce the redundancy to 90%; column 4, RMSD of predicted structure to native structure; column 5, length of native structure modeled; column 6, RMSD over converged and constrained region; column 7, length of converged and constrained region; column 8, RMSD over TM-align structural alignment; column 9, length of structurally aligned region.

Table 2

Comparison of fold recognition and Rosetta models for large protein families
Known functionRcTMscore
WECH: O-acetyltransferase (YiaH)24,750−
SATP: Succinate-acetateproton symporter (YaaH)2298−
LSPA: Lipoprotein signal peptidase8156−
YADH: ABC-type multidrug transport permease42,626−
YEBZ: Putative copper export protein4067−
CRCB: Fluoride ion exporter7829−
LPTG: Lipopolysaccharide export system permease8101−
FTSW: Lipid II flippase14,900−
RFAL: O-antigen ligase13,535−
CCMB: Heme exporter protein B2433−
MLAE: ABC transporter permease for lipid asymmetry7662−
SULP: Sulfate permease6647−
TOLQ: Biopolymer transport protein9256−
LGT: Prolipoprotein diacylglyceryl transferase8121−
Q97UR7: N-methylhydantoinase B (HyuB-3)4491−
YGAZ: putative L-valine exporter6435−
CCMC: Heme exporter protein C5965−
YEDZ: Sulfoxide reductase heme-binding subunit2247−
YIAM: TRAP transporter small permease protein10,715−
TTDA: Tartrate dehydratase, alpha subunit4238−
UPPP: Undecaprenyl pyrophosphate phosphatase7842−
PLSY: Probable glycerol-3-phosphate acyltransferase6112−
FLIL: Flagellar protein2690−
CYDB: Cytochrome bd oxidase 268640.
CYDA: Cytochrome bd oxidase 162000.
MOTA: Motility protein A, flagellar motor proton conductor47340.
SLYB: Outer membrane lipoprotein18600.
MRED: Rod shape-determining protein15460.
ZUPT: Zinc transporter10,5170.
YOHK: Putative effector of murein hydrolase LrgB39412.
PRSW: Membrane proteinase25005.
DDG: Lipid A biosynthesis palmitoleoyl acyltransferase94305.
Unknown functionRcTMscore
YQFA: UPF0073 inner membrane protein7596−
YCED: Uncharacterized protein1604−
YPHA: Inner membrane protein2986−
YADS: UPF0126 inner membrane protein5222−
YHHN: Uncharacterized membrane protein2529−
YIDH: Inner membrane protein1041−
YITE: UPF0750 membrane protein8326−
HDED: Acid resistance membrane protein2885−
YFIP: DTW domain-containing protein3100−
YPJD: ABC-type uncharacterized permease6180−
YJFL: UPF0719 inner membrane protein1581−
YTEJ: Uncharacterized membrane protein5733−
YIHY: UPF0761 membrane protein10,144−
YQAA: Inner membrane protein2187−
YHID: Uncharacterized protein4416−
YLOU: Uncharacterized protein3738−
YGDD: UPF0382 inner membrane protein3025−
YJCH: Inner membrane protein1307−
YFCA: UPF0721 transmembrane protein18,8460.
YOHJ: Putative effector of murein hydrolase36080.
YHHQ: Inner membrane protein33980.
YAII: UPF0178 protein31440.
YUXK: Predicted thiol-disulfide oxidoreductase18811.
YICC: UPF0701 protein42931.
YEIH: UPF0324 inner membrane protein48634.
RARD: Putative chloramphenical resistance permease74,5076.
  1. Column 2: number of unique proteins in family; Column 3: negative log10 of E-value of top match found in HHsearch profile–profile search of PDB; Columns 4–6: fit to predicted contacts (Rc value) of best fitting of top 10 HHsearch hits (column 4), of best fitting of top 10 SPARKS-X hits (column 5), and Rosetta model (column 6). Native structures have Rc values ranging from 0.7 to 1.2 (Figure 17). Columns 7–9: structural similarity (TMscore) between Rosetta model (M) and best fitting HHsearch model, between Rosetta model and best fitting SPARKS-X model, and between best fitting HHsearch and SPARKS-X models. The Rosetta models fit the contacts as well as expected for native structures and are very different from best fitting HHsearch and SPARKS-X models. For RARD and YEIH, the HHsearch E-value is less than 1E-04, the recommended threshold for inclusion in the same Pfam clan (Xu and Dunbrack, 2012), but the fit with the co-evolutionary contacts was very poor (Rc < 0.3); these two cases are discussed in sections below. For FLIL and YAII, the Rc values for very weak HHSearch and SPARKS-X hits (E-values worse than 0.1) are greater than 0.6 but the contacts constrain only a portion of the structure.

Table 3

Comparison of methods on CASP11 targets
BAKER*Jones-UCL*Evfold-web server
  1. *

    Full-length Cα-RMSD and GDT-TS calculation based on the best of five models submitted to CASP11 from BAKER and Jones-UCL groups. For Evfold, the values for best of 50 models generated by the web server are reported, sorted by full-length Cα-RMSD. For the comparison, the alignments used during CASP11 were provided as input to the Evfold-web server, with PLM option selected. For T0824, the minimal number of sequence limit was set to 0 to allow Evfold-web server to run. PLM, pseudo-likelihood.

Table 4

Comparison of methods on transmembrane benchmark set
BAKEREvfold-web server
1SOR_A (aquaporin)2.769.76.144.5
  1. The Cα-RMSD and GDT-TS calculations are over the full sequence. For Evfold web server results, we report the best Cα-RMSD of 50 models returned. For the comparison, the alignments we used were provided as input to the Evfold-web server, and the pseudo-likelihood method was selected.

Additional files

Supplementary file 1

Detailed table containing all 131 large protein families. Also the complete list of protein coding genes from E. coli (ECOLI), B. subtilis (BACSU) Halobacterium salinarum (HALSA), and Sulfolobus solfataricus (SULSO) along with number of sequences are provided.
Supplementary file 2

Detailed table containing 58 proteins modeled.
Supplementary file 3

Table of 10,440 PDB chains used in the study and the number of sequences for each.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Sergey Ovchinnikov
  2. Lisa Kinch
  3. Hahnbeom Park
  4. Yuxing Liao
  5. Jimin Pei
  6. David E Kim
  7. Hetunandan Kamisetty
  8. Nick V Grishin
  9. David Baker
Large-scale determination of previously unsolved protein structures using evolutionary information
eLife 4:e09248.