1. Biophysics and Structural Biology
  2. Computational and Systems Biology
Download icon

Molecular dynamics-based refinement and validation for sub-5 Å cryo-electron microscopy maps

  1. Abhishek Singharoy Is a corresponding author
  2. Ivan Teo
  3. Ryan McGreevy
  4. John E Stone
  5. Jianhua Zhao
  6. Klaus Schulten Is a corresponding author
  1. University of Illinois at Urbana-Champaign, United States
  2. University of California San Francisco School of Medicine, United States
Tools and Resources
Cited
10
Views
2,972
Comments
0
Cite as: eLife 2016;5:e16105 doi: 10.7554/eLife.16105

Abstract

Two structure determination methods, based on the molecular dynamics flexible fitting (MDFF) paradigm, are presented that resolve sub-5 Å cryo-electron microscopy (EM) maps with either single structures or ensembles of such structures. The methods, denoted cascade MDFF and resolution exchange MDFF, sequentially re-refine a search model against a series of maps of progressively higher resolutions, which ends with the original experimental resolution. Application of sequential re-refinement enables MDFF to achieve a radius of convergence of ~25 Å demonstrated with the accurate modeling of β-galactosidase and TRPV1 proteins at 3.2 Å and 3.4 Å resolution, respectively. The MDFF refinements uniquely offer map-model validation and B-factor determination criteria based on the inherent dynamics of the macromolecules studied, captured by means of local root mean square fluctuations. The MDFF tools described are available to researchers through an easy-to-use and cost-effective cloud computing resource on Amazon Web Services.

https://doi.org/10.7554/eLife.16105.001

eLife digest

To understand the roles that proteins and other large molecules play inside cells, it is important to determine their structures. One of the techniques that researchers can use to do this is called cryo-electron microscopy (cryo-EM), which rapidly freezes molecules to fix them in position before imaging them in fine detail.

The cryo-EM images are like maps that show the approximate position of atoms. These images must then be processed in order to build a three-dimensional model of the protein that shows how its atoms are arranged relative to each other. One computational approach called Molecular Dynamics Flexible Fitting (MDFF) works by flexibly fitting possible atomic structures into cryo-EM maps. Although this approach works well with relatively undetailed (or ‘low resolution’) cryo-EM images, it struggles to handle the high-resolution cryo-EM maps now being generated.

Singharoy, Teo, McGreevy et al. have now developed two MDFF methods – called cascade MDFF and resolution exchange MDFF – that help to resolve atomic models of biological molecules from cryo-EM images. Each method can refine poorly guessed models into ones that are consistent with the high-resolution experimental images. The refinement is achieved by interpreting a range of images that starts with a ‘fuzzy’ image. The contrast of the image is then progressively improved until an image is produced that has a resolution that is good enough to almost distinguish individual atoms.

The method works because each cryo-EM image shows not just one, but a collection of atomic structures that the molecule can take on, with the fuzzier parts of the image representing the more flexible parts of the molecule. By taking into account this flexibility, the large-scale features of the protein structure can be determined first from the fuzzier images, and increasing the contrast of the images allows smaller-scale refinements to be made to the structure.

The MDFF tools have been designed to be easy to use and are available to researchers at low cost through cloud computing platforms. They can now be used to unravel the structure of many different proteins and protein complexes including those involved in photosynthesis, respiration and protein synthesis.

https://doi.org/10.7554/eLife.16105.002

Introduction

Cryo-electron microscopy (cryo-EM) has evolved into one of the most effective structure determination tools in modern day structural biology, achieving in recent years resolutions rivalling those of X-ray crystallography or NMR spectroscopy (Cheng, 2015). Furthermore, cryo-EM based structure determination overcomes two major bottlenecks faced in traditional X-ray crystallography, namely, the arduous task of preparing well-ordered crystals of macromolecules (Unger, 2002), and the more fundamental problem with capturing these molecules in unphysiological states as a result of crystal contacts (Neutze et al., 2015). Consequently, cryo-EM provides a natural way of resolving the structures of large macromolecular complexes.

Historically, computational methods were required to bridge the resolution gap between crystallography and cryo-EM to produce atomic-resolution models of biomolecular complexes. Various real-space refinement methods that combine crystallographic structures and cryo-EM densities for structure determination have been developed, including DireX (Schröder et al., 2007), Flex-EM (Topf et al., 2008), Rosetta (DiMaio et al., 2015), FRODA (Jolley et al., 2008), Phenix real space refinement (Afonine et al., 2013), and Molecular Dynamics Flexible Fitting (MDFF) (Trabuco et al., 2008, 2009; McGreevy et al., 2016).

MDFF, in particular, has proven to be an extremely successful refinement method as evidenced by its numerous applications (Goh et al., 2015; McGreevy et al., 2016) ranging from the intricate ribosomal machinery (Villa et al., 2009; Trabuco et al., 2011; Frauenfeld et al., 2011; Wickles et al., 2014) to a host of non-enveloped viruses (Zhao et al., 2013). So far this success has been limited to structure determination from typically low-resolution cryo-EM maps in the 7–25 Å range which, indeed, represented the state-of-the-art at the time of MDFF’s inception (Trabuco et al., 2008). However, seminal advances in detection hardware and programs over the past three years (Li et al., 2013; Milazzo et al., 2011) have enabled now the routine availability of high-resolution (<5 Å) EM maps for a range of biological systems including ion channels (Liao et al., 2013), enzymes (Bartesaghi et al., 2014, 2015), membrane fusion machinery (Zhao et al., 2015), and key functional components of the ribosome (Fischer et al., 2015; Brown et al., 2015).

High-resolution maps pose an imminent challenge to the traditional map-guided structure determination methods as the maps now characterize near-atomic scale features, the interpretation of which requires extremely precise structure building and validation protocols (DiMaio et al., 2015). For example, conformation of the protein sidechains, which are more flexible than the backbone, are now discernible within the maps and, thus, require precise modeling of the dihedral angles up to Cβ atoms while also respecting the map boundaries (Barad et al., 2015).

In order to produce atomic models with correct backbone and sidechain geometries, as well as minimal potential energy, structure determination tools must be augmented with chemically accurate force fields and exhaustive search algorithms respecting density constraints. Inspired by crystallographic modeling techniques, where such structure-building requirements have already been addressed for the resolution of 3–5 Å diffraction data (DiMaio et al., 2013; Murshudov et al., 2011; McGreevy et al., 2014), tools such as Rosetta have introduced Monte Carlo simulation-based segment building and refinement protocols with heuristic force fields (DiMaio et al., 2015), to handle high-resolution EM maps. Other notable automated model-building tools that can be used for the refinement of high-resolution EM maps include Buccaneer (Cowtan, 2006), ARP/wARP (Langer et al., 2008), and Moulder (Topf et al., 2006).

Driven by a vision to extend the capabilities of flexible fitting approaches (Topf et al., 2008; Trabuco et al., 2008; Tama et al., 2004; Suhre et al., 2006; Kovacs et al., 2008; Wu et al., 2013) for addressing high-resolution maps, two new MDFF methods are introduced here. These methods, denoted cascade MDFF (cMDFF) and resolution exchange MDFF (ReMDFF), augment the traditional MDFF method (Trabuco et al., 2008, 2009; McGreevy et al., 2016) (called direct MDFF henceforth) with enhanced conformational sampling techniques, namely simulated annealing (Brünger, 1988) and replica exchange molecular dynamics (Sugita and Okamoto, 1999). The central idea behind the techniques introduced is to fit a search model sequentially to a series of maps of progressively higher resolutions, ending with the original experimental resolution; all but the last in the series are computationally blurred lower-resolution derivatives of the original map, so that larger-scale features of the structure are determined first by fitting to the blurred densities, and smaller-scale refinements are performed subsequently during the fitting to higher-resolution densities. Altogether, this treatment enables a richer conformational sampling of the model within the map than direct MDFF, thereby allowing accurate modeling of the global and local structural features from the map; a similar treatment has previously been employed to increase the radius of convergence of MDFF protocols, but with crystallographic data (Singharoy et al., 2015).

The cMDFF and ReMDFF methods are demonstrated for structure analysis based on 3.2-Å and 3.4-Å resolution maps of β-galactosidase (Bartesaghi et al., 2014) and the TRPV1 channel (Liao et al., 2013), respectively. The two methods were found to resolve atomic structures with accuracy greater than that of direct MDFF and comparable to that of Rosetta, even with poor choices of search models. The accuracy is evaluated in terms of the quality of fit measured through global and local cross-correlations (GCC and LCC), integrated Fourier shell coefficients (iFSC), and EMRinger scores (Barad et al., 2015), as well as in terms of the quality of structural integrity measures like MolProbity (Chen et al., 2010).

In the second part of the present study we establish that structural flexibility, as measured by root mean square fluctuations or RMSF within the MDFF simulation, provides an ensemble-based indicator of local and overall resolution of a map offering, thus, a quality measure of an EM map based on the inherent dynamics of the imaged macromolecule. In line with this new finding, RMSF values are shown to provide a physical basis for the determination of optimal sharpening B-factors that maximize the signal-to-noise ratio within a map. These B-factors are determined at three different levels of model description: whole-system, per-domain, and per-residue.

Finally, use of the ReMDFF method on cloud computing platforms is discussed. Cloud computing is now a highly suitable approach for computational biology and can be employed for large-scale scientific computing, data analysis, and visualization tasks. For example, Amazon Web Services has been previously demonstrated to be a low-cost cloud computing platform for processing cryo-EM data (Cianfrocco et al., 2015). We demonstrate now the usage of Amazon Web Services, highlighting the platform’s capability for rapidly fitting structures to EM density with ReMDFF. The web-interface makes it readily possible for experimental groups around the world to deploy MDFF in an easy and economical way, bypassing the need for their own staff, software, and hardware resources.

Results

In the following section, we first describe the methodological advances achieved within cMDFF and ReMDFF for the resolution of sub-5 Å maps. Search model preparation, refinement, and structure validation protocols based on these advances are subsequently demonstrated for five exemplary protein complexes that were chosen based on the availability of high-resolution (3–5 Å) EM maps and atomic structures. Finally, the performance of ReMDFF on Amazon’s cloud computing platforms is described, demonstrating that our MDFF software offers an efficient web-based resource for structure determination from EM maps.

Simulation concept

In direct MDFF, an initial atomic structure is subjected to an MD simulation with an additional potential energy term VEM that is proportional to the sign-inverse of the EM map. Through VEM, steering forces locally guide atoms towards high-density regions, thereby fitting the structure to the map (see Materials and methods).

The equilibrium structure obtained in the simulation represents a global minimum in VEM. For maps in the low resolution range (6–15 Å), this global minimum is broad, accomodating an ensemble of conformations defined by the overall shape of the macromolecule (Trabuco et al., 2008, 2011). In contrast, at the mid-resolution range of 4–6 Å, densities corresponding to the backbones become discernible, and at sub-4 Å resolutions, even sidechains can be resolved. At such high resolutions, VEM now features multiple proximal local minima which correspond to recurring spatial patterns within a macromolecule, such as helices aligned in parallel or strands in a β-sheet. As shown in Figure 1, the energy barriers separating these local minima are typically twice as high as those in the case of low-resolution maps. The existence of such potential minima in high-resolution maps exposes MDFF to a long-known weakness of traditional MD-based algorithms, namely entrapment of the fitted structure within undesired local minima instead of reaching the global minimum of VEM. Not unexpectedly, therefore, direct MDFF yields structurally poor or functionally irrelevant models with high-resolution EM maps (Figure 2—figure supplement 1) (DiMaio et al., 2015).

Visual summary of advanced MDFF methodology.

A graphic table illustrating MDFF refinement of a model of carbon monoxide dehydrogenase using a high-resolution map. The map represents an open conformation while the initial search model was obtained through crystallography of a closed conformation. This search model was independently fitted, using direct MDFF, to individual members of a set of maps obtained by applying Gaussian blurs of various half-widths (σ, first column) to the experimental density. These maps are visualized as a 3D surface in the second column, while the resulting MDFF potentials VEM are represented in cross-section in the third column. Notice the increase in number of contiguous density regions as σ increases. This increase in contiguity is manifested in the lowering of high VEM barriers (red) for small σ values to low or flat energy profiles (blue) for larger σ values, as observed in the VEM potential cross-sections. Reduced barrier heights allow the structure to explore the conformational space freely during fitting. The structure after 500 ps of fitting, shown in red, is superimposed on the known target structure, shown in blue, in the fourth column. The time evolution of RMSD with respect to the target during fitting is shown in the fifth column. The RMSD plots show that direct fitting to lower resolution maps requires fewer time steps to reach convergence. In fact, the structure never becomes less deviated than the initial 7-Å RMSD from the target in the direct MDFF of the highest-resolution map (i.e. in the absence of Gaussian blurring). The inset shows refinements of the same structure through cMDFF and ReMDFF employing the same set of maps. A clear improvement over direct MDFF is apparent, with convergence to within 1.7 Å and 1.0 Å of the target achieved within 1000 and 100 ps for cMDFF and ReMDFF respectively.

https://doi.org/10.7554/eLife.16105.003

To mitigate this weakness in direct MDFF, we introduce cMDFF and ReMDFF. In cMDFF, the structure is fitted, in a series of MDFF simulations, to maps of gradually increasing resolution. First, the experimental map is smoothened by applying a series of Gaussian blurs with increasing half-widths, σ, to obtain a set of theoretical maps with gradually decreasing resolution; σ = 0 Å corresponds to the experimental map, and σ > 0 Å corresponds to a smoothened one (see Materials and methods). Illustrated in Figure 1, the density-dependent potential derived from the smoothest map (i.e. the one with the largest σ value) features a clear global minimum representing the large-scale structural features of the protein. Second, a search model is fitted to this map, allowing resolution of these large-scale features. Third, the resulting structure is employed as the search model for fitting to the next higher-resolution map in the series. These fitting and search model refreshment steps are repeated through the series of maps in order of decreasing σ, until the structure is finally fitted to the experimental map (see Video 1 for TRPV1 refinement. An additional virtual reality version can be found at https://www.youtube.com/watch?v=UwwVC6C9tw0).

Video 1
cMDFF Refinement of TRPV1.
https://doi.org/10.7554/eLife.16105.004

The gradual increase in map resolution over the course of the simulations allows the structure to explore a greater conformational space than in direct MDFF. The structure thus avoids entrapment within local minima of the density-dependent potential and is accurately fitted to the near-atomic density features of the experimental map while also resolving the larger scale features.

In ReMDFF, the cascade scheme is infused with a greater degree of automation. Multiple MDFF simulation replica are run in parallel with each replica fitting a model to a map of a specific resolution. Based on a Metropolis formula analogous to that of conventional replica exchange molecular dynamics simulations (Sugita and Okamoto, 1999), but now derived in terms of density and not temperature (see Materials and methods), the models are exchanged at regular time intervals between maps from neighboring pairs of replica. Stepwise improvements in fit occur during exchanges between a poorly fitted model at a high resolution with a well-fitted model at a lower resolution. This well-fitted model is further refined against the high-resolution map until convergence is reached, and exchange between the chosen resolutions ceases. Further details are described in Materials and methods.

ReMDFF has advantages over cMDFF both in terms of efficiency and automation as it can take advantage of modern parallel computing hardware and the powerful and adaptive replica exchange interface of NAMD (Jiang et al., 2014). Nonetheless, as presented in the following, both cMDFF and ReMDFF outperform direct MDFF in quality and speed across a range of high-resolution examples.

Search model preparation and refinement

In an initial proof-of-principle computation, cMDFF and ReMDFF were applied to fit a structure of carbon monoxide dehydrogenase to a 3-Å synthetic density map. The same techniques were subsequently applied to obtain refined structures of two more protein systems, namely TRPV1 (Liao et al., 2013) and β-galactosidase (Bartesaghi et al., 2014), for which experimental densities of 3.4 Å and 3.2 Å resolution respectively are available. In each case, a direct MDFF simulation was also performed for the purpose of comparison. The MDFF-derived structures were then subjected to a model validation analysis, to evaluate the quality of the models with established protocols in the cryo-EM field. Additional examples chosen for this analysis include γ-secretase at 4.5 Å (Lu et al., 2014) and 3.4-Å (Bai et al., 2015), β-galactosidase at 2.2-Å (Bartesaghi et al., 2015) and the proteasome at 3.3 Å (Li et al., 2013) resolution. Comparisons between the direct and advanced MDFF protocols, and wherever possible, with other available fitting techniques, such as Rosetta (DiMaio et al., 2015), elucidate the general pros and cons of the flexible fitting strategy.

Proof of principle

The performance of cMDFF and ReMDFF was evaluated on a test system, carbon monoxide dehydrogenase, which exhibits a closed and an open conformation (Darnault et al., 2003). Both these conformations have been crystallized, and are reported respectively in chains C and D of the PDB entry 1OAO. For our demonstration, the closed conformation (1OAO:chain C) was used as the search model, while the open one (1OAO:chain D) was the target.

First, a 3-Å resolution synthetic density map was constructed in Phenix (Adams et al., 2010), employing phases from the 1OAO structure and the associated diffraction data truncated at 3 Å. This map was then masked about chain D to yield a high-resolution envelope characterizing the open conformation. Assuming that the crystallographic model provides an accurate benchmark, the corresponding map for chain D determined here represents the best possible density data at 3 Å resolution that is experimentally attainable for the open conformation. Finally, through direct MDFF, cMDFF and ReMDFF, the search model constructed from chain C was fitted into this density to derive an atomic structure representing the open conformation. Both the cMDFF and ReMDFF refinements were performed for a set of six maps with σ values ranging from 5 to 0 Å at constant decrements of 1 Å.

Accuracy of the fitting protocols was evaluated by comparing the fitted chain C structures with the crystallographically reported target chain D model. Direct MDFF of the 3 Å synthetic map performed for 2 ns converged to a structure with an RMSD of 7 Å relative to the target model. In sharp contrast, the cMDFF- and ReMDFF-generated structures are within 1.7 Å and 1 Å RMSD of the target (see the inset of Figure 1). It is also noted that fitting to the lowest-resolution (i.e. one with σ = 5 Å) brings about an immediate decrease in RMSD from the target generating structures that are within 2 Å RMSD. Fitting of the structure to subsequent high-resolution maps brought the RMSD down to 1.0 Å.

The results demonstrate that the new protocols are capable of attaining well-fit structures where direct MDFF does not. In particular, one can think of the new protocols as extending the radius of convergence to at least 7 Å, rendering the fitting procedures less dependent on the quality of the starting structure.

Refinement of β-galactosidase

In a second test case, a search model was fitted into the 3.2 Å map (Bartesaghi et al., 2014) of β-galactosidase employing direct MDFF, cMDFF, and ReMDFF. Noting that the radius of convergence of the proposed MDFF protocols was at least 7 Å for the aforementioned test case, the initial search model was prepared such that it had an RMSD of 7 Å from the reported structure. This model was obtained by applying to the reported structure (obtained by de novo modeling within the EM map [Bartesaghi et al., 2014]) a high temperature MD protocol described in Appendix 1, Section 3 and, subsequently, choosing from the collection of trajectory structures one of RMSD 7 Å from the reported structure and with the lowest GCC with respect to the reported map (Bartesaghi et al., 2014) (Figure 2—figure supplement 2a).

Summarized in Table 1, the fitting results, in terms of quality of fit as well as model quality, are significantly better for cMDFF and ReMDFF than for direct MDFF: (i) RMSD of the fitted structure with respect to the reported de novo model is 0.7 Å and 0.9 Å for cMDFF and ReMDFF respectively, much lower than the 3.7 Å RMSD attained with direct MDFF (Figure 2—figure supplement 3a); (ii) EMRinger scores for cMDFF and ReMDFF are 3.16 and 3.45 respectively, higher than the 1.91 obtained for direct MDFF, implying accurate fitting of sidechains into the density; (iii) MolProbity scores are consistently small for all the flexible fitting techniques in part due to fewer, less severe steric clashes and fewer Ramachandran outliers (further detailed in Table 2); (iv) integrated FSC (iFSC2, corresponding to the range 3.4–10 Å on the FSC plot obtained as per Appendix 1 - Section 6), considered a more stringent measure of model quality than CC (DiMaio et al., 2015), attained higher values of 5.22 Å and 4.66 Å for cMDFF and ReMDFF, respectively, than 2.74 Å for direct MDFF. iFSC1, evaluated at the lower resolution range of 5–10 Å improves from 2.11 Å for direct MDFF to 4.22 Å and 3.76 Å for cMDFF and ReMDFF, respectively, showing a trend similar to that of iFSC2 corresponding to the high-resolution range; and (v) GCCs improved from an initial value of 0.48 to 0.56, 0.67 and 0.67 for direct, cMDFF, and ReMDFF protocols respectively. Similarly, typical residue LCC values improved from about 0 to greater than 0.80 (Figure 2—figure supplement 4a and Figure 2—figure supplement 5). Overall, cMDFF and ReMDFF refinements produce structures that interpret the 3.2-Å β-galactosidase map much more accurately than direct MDFF does. Figure 2a shows, visually, how the cMDFF-derived structure differs from the direct MDFF structure in terms of fit. In judging the RMSD values to the target model the reader is reminded that equilibrium MD simulations of a single structure at room temperature typically exhibit RMSD values relative to the initial structure or the average structure of about 3 Å; the same is true for β-galactosidase (see Figure 2—figure supplement 6 in Appendix 1 - Section 7). Consequently, an RMSD of 0.7 Å of the cMDFF/ReMDFF-fitted model relative to the target implies a high-quality refinement. The high quality of this refinement is further supported by visualizations of accurate sidechain placements within the density, shown in Figure 2—figure supplement 7.

Figure 2 with 8 supplements see all
Comparison between cMDFF and direct MDFF fitted models.

Models of (a) β-galactosidase and (b) TRPV1, obtained from cMDFF (blue) and direct MDFF (red) fitting simulations are superimposed. The cMDFF-fitted models fit well into the high-resolution maps (grey) of each molecule, whereas the direct MDFF models have become trapped in local minima that result in portions of the models protruding from the maps. ReMDFF-fitted models are almost identical to those from cMDFF and are therefore not shown.

https://doi.org/10.7554/eLife.16105.005
Table 1

β-galactosidase MDFF results. cMDFF and ReMDFF provide better fitted structures than direct MDFF according to various criteria. It is noteworthy that all structures refined by any form of MDFF display an improved MolProbity (Chen et al., 2010) score compared to the original de novo structure.

https://doi.org/10.7554/eLife.16105.014
StructureRMSD(Å)EMRingeriFSC1(Å)iFSC2(Å)MolProb.GCC
de novo (Bartesaghi et al., 2014)0.02.254.035.003.140.67
Refined de novo0.64.234.195.201.230.68
Initial7.70.240.140.151.490.48
Direct MDFF3.72.312.112.741.380.56
cMDFF0.73.164.225.221.370.67
ReMDFF0.93.453.764.661.130.67
Table 2

Structure quality indicators for β-galactosidase structures. β-galactosidase structures investigated in the present study were uploaded to the MolProbity server (http://molprobity.biochem.duke.edu) to extract the quantities presented below. The results show that the cMDFF- and ReMDFF-refined structures not only exhibit good measures of fit, but also improve the clash score and rotamer geometries, relative to the de novo and initial structures, while incurring only a small expense in Ramachandran statistics, bad angles, and Cβ deviations.

https://doi.org/10.7554/eLife.16105.015
de novo (Bartesaghi et al., 2014)Refined de novoInitialDirect MDFFcMDFFReMDFF
Clashscore53.70.00.00.00.00.0
Poor rotamers (%)11.63.84.23.04.41.37
Favored rotamers (%)67.490.887.892.189.895.3
Ramachandran outliers (%)0.20.72.73.01.62.7
Ramachandran favored (%)97.495.891.191.194.490.9
MolProbity3.141.231.491.381.371.13
Cβ deviations (%)0.00.054.920.180.290.39
Bad bonds (%)0.090.043.610.020.010.03
Bad angles (%)0.030.603.980.630.490.37
RMS distance (Å)0.007 (0.025%)0.019 (0%)0.035 (0.237%)0.022 (0%)0.019 (0%)0.021 (0%)
RMS angle (degrees)1.1 (0.009%)2.2 (0.009%)3.6 (1.177%)2.4 (0.103%)2.1 (0.018%)2.3 (0.085%)
Cis prolines (%)8.068.066.456.456.458.06
Cis non-prolines (%)1.151.150.00.01.150.0

The cMDFF- and ReMDFF-refined structures were found to be comparable in every quality measure in comparison to the reported de novo structure (Bartesaghi et al., 2014); in fact, the overall Molprobity and EMRinger scores are significantly better in cMDFF and ReMDFF. However, a closer look at the Molprobity score (Table 2) reveals that even though cMDFF vastly improves clash score and poor rotamers, it marginally increases the percentage of Ramachandran outliers and Cβ deviations relative to the de novo structure. Nonetheless, both cMDFF and ReMDFF improved structural statistics with respect to the initial model (Table 1, third row) which was intentionally chosen to have a large deviation (RMSD of 7.7 Å) from the de novo structure.

Noting that the quality of MDFF output depends strongly on that of the search model, a second cMDFF simulation was also performed to refine the de novo structure within the reported map. The simulation, labeled ‘refined de novo’ in Table 1, yielded a structure that was superior in all the quality measures considered in comparison to the de novo structure as well as to the structures obtained from the various MDFF fittings of the other, 7.7 Å deviated initial model. A closer look at the Molprobity scores (Table 2) now reveals that not only are clash score and poor rotamers vastly improved, but the Ramachandran outliers and Cβ deviations are also comparable to the de novo structure.

A third cMDFF refinement was further performed with a search model of even lower structural quality (Supplementary file 1E, third row) compared to that of Table 1 (third row). Structural statistics for the refinements from this lower-quality search model are provided in Supplementary file 1F. Comparison of the three cMDFF refinements starting with three search models of varying structural quality reveal that the poorer the secondary structure of the search model, measured in terms of higher percentages of rotamer and Ramachandran outliers, the worse the local structural statistics of the MDFF/cMDFF/ReMDFF-refined model. Surprisingly, large scale map-model validation measures, such as RMSD, GCC, or iFSC values, remain insensitive to such local discrepancies in the refined model. This insensitivity is apparent in the similarity of RMSD, GCC, and iFSC values of the three refined models of β-galactosidase (Table 1 and Supplementary file 1E), which indeed feature very different number of Ramachandran outliers: 0.7% (Table 2-Refined de novo), 1.6% (Table 2-cMDFF) and 7.8% (Supplementary file 1F-cMDFF).

The high-precision of sub-5 Å maps demands correct assignment of secondary structure, which determines backbone geometry directly and sidechain conformations indirectly. In light of the initial model quality dependence of MDFF, it is advisable to begin refinement with an initial model of maximal secondary structure information, yet with minimal tertiary structure. Dependable initial secondary structure is also required since MD simulations have limited capability of recovering the structure of a protein fold if the search model begins with a random coil conformation (Freddolino et al., 2010). To this end, notwithstanding observed cMDFF improvements of the secondary structure (Table 2 columns 4 vs. 6, and also demonstrated for γ-secretase [Supplementary file 1D]), moderate- to high-confidence homology models will serve as the most optimal starting point, as has been successfully shown for MDFF with low-resolution EM (Chan et al., 2011; Noble et al., 2013; Wickles et al., 2014) and crystallographic data (McGreevy et al., 2014; Li et al., 2014). Quantitatively speaking, employing the current example (β-galactosidase) and the following one (TRPV1), it is now shown that a search model which deviates from the target by an RMSD of 7–25 Å featuring Ramachandran outliers of ~3%, poor rotamers of 11–38%, overall Molprobity score of ~4 and EMRinger score of ~0.20 can be accurately refined against a sub-5 Å EM map with cMDFF or ReMDFF.

In terms of efficiency, the ReMDFF protocol exhibits the quickest convergence, arriving at steady state within 0.1 ns of simulation, whereas cMDFF requires around 0.8 ns. Both methods employed eleven maps with Gaussian blurs starting from a width of 5 Å and decreasing in steps of 0.5 Å towards the original reported map. To ensure that the cMDFF procedure did not over-fit the structures, cross-validation using EMRinger and FSC analysis was performed using half-maps from the EMD-5995 entry. iFSC and EMRinger values were found to be almost identical in both direct and cross comparisons. Details are provided in Appendix 1 - Section 11.

In addition to the MDFF simulations described so far for β-galactosidase, other simulations were performed to investigate in more detail the capabilities of MDFF. These simulations are described in Appendix 1 - Section 7. First, it was found that fitting of the Cβ atoms to the density is crucial for accurate placement of the sidechains. In agreement with prior EMRringer results (Barad et al., 2015), it is confirmed that MDFF placement of the backbone does not guarantee correct sidechain geometries, even with state-of-the-art CHARMM36 (Klauda et al., 2010) force fields. Second, MD simulation of the cMDFF-fitted β-galactosidase model revealed that this model is indeed an excellent representation of the most probable structures of the thermodynamic ensemble that characterizes the 3.2 Å map.

Refinement of TRPV1

In the third test case, cMDFF and ReMDFF protocols were employed to fit an initial model to the 3.4-Å map (Liao et al., 2013) of the temperature-sensing protein TRPV1. The search model was prepared from the reported de novo structure through an interactive MD protocol described in Appendix 1 - Section 3. The model deviated from the de novo structure by an overall RMSD of 10 Å and locally, by about 25 Å in the vicinity of the ankyrin repeats represented by residues 199 to 430. This degree of deviation is in the ballpark of the lowest resolutions of usable EM maps and, therefore, represents the upper limit of uncertainty between a search model and the fitted structure that MDFF can still reconcile. Having to address an RMSD of 25 Å between search and target models, the present example represents an extreme test case for judging the radius of convergence of the proposed MDFF methods.

Fitting results for TRPV1, described in Appendix 1 - Section 8, were significantly better for cMDFF and ReMDFF than for direct MDFF, but now with much poorer search models than those employed for the β-galactosidase refinements. For example, cMDFF and ReMDFF refinements produced structures that interpret the 3.4-Å TRPV1 map within an RMSD of 2.4 Å and 2.5 Å from the target de novo model, much more accurately than does direct MDFF which converges to structures at an RMSD of 7.9 Å.

The cMDFF- and ReMDFF-obtained structures were found to be better in every overall quality metric in comparison to the de novo structure. Figure 2b illustrates the contrast in fit between the cMDFF and direct MDFF-derived structures. To observe the effect of MDFF on a substantially well-fitted initial structure, a direct MDFF simulation was also performed to refine the de novo structure within the reported map. The simulation yielded a structure that was comparable in all the quality measures considered to the structures obtained from cMDFF and ReMDFF (Appendix 1 - Section 8). However, the TRPV1 fitting results show that cMDFF and ReMDFF can have a radius of convergence as high as 25 Å in RMSD, whereas direct MDFF requires at the outset a well-fitted structure to deliver a satisfactory model. Also, since the number of Ramachandran outliers were minimal in both the 25 Å-deviated and de novo initial models, the cMDFF, ReMDFF, and direct MDFF-refined models exhibited low percentages of Ramachandran outliers, as reflected in Supplementary file 1B.

As was observed already for β-galactosidase, the ReMDFF protocol exhibited the quickest convergence, arriving at steady state within 0.02 ns of simulation, whereas cMDFF required around 0.27 ns. Both methods employed six maps with Gaussian blurs starting from a width of σ = 5 Å and decreasing in steps of 1 Å to the reported σ = 0 Å map. Cross-validation with half-maps was also performed on the cMDFF structure, as per the β-galactosidase simulations, to ensure that it was not over-fitted. As in the case of β-galactosidase, iFSC and EMRinger scores for direct and cross comparisons were similar. FSC analysis results are described in Appendix 1 - Section 11.

A separate set of model validation analyses was performed on the well-resolved TM portion of TRPV1 to pursue a direct comparison of a MDFF refined model with one from Rosetta (Barad et al., 2015). As reported in Appendix 1 - Section 9, MDFF produced results comparable to those of Rosetta when all the heavy atoms are coupled to the density: though the EMRinger score is marginally lower relative to that from Rosetta, the GCC and iFSCs are higher for MDFF; also MDFF provides a marginally higher MolProbity score. Altogether, major discrepencies between Rosetta and direct MDFF that were reported for the high-resolution EM maps (DiMaio et al., 2015) are now absent when employing the cMDFF and ReMDFF protocols, even with poorer choices of search models than those used with Rosetta. Thus, cMDFF and ReMDFF enable flexible fitting techniques to pursue resolution of structures within state-of-the-art maps obtained via cryo-EM.

Model validation

An EM density map represents a thermodynamic ensemble of atomic conformations (Schröder et al., 2007; Brunger et al., 2012; Schröder et al., 2010). Conventionally, however, only a single model representing a best fit to the map is reported. One may ask how statistically representative a single model can be. To quantify the deviation of a fitted model from the rest of an ensemble of simulated molecules, root mean square fluctuation (RMSF) of the model relative to the ensemble-averaged structure was computed during an MDFF refinement simulation employing the protocol described in Methods. In the following, the RMSF of a fitted model is first shown to be indicative of the quality of fit of the model, as well as to represent the degree of natural conformational variation exhibited within the thermodynamic ensemble underlying the map. Second, the RMSF values are found to correlate both locally and globally with the resolution of an EM map, providing an interpretation of map quality based on the inherent (i.e., natural) dynamics of the macromolecule under observation. Finally, RMSF values are also employed to identify optimal B-factor values for the sharpening of a map. Altogether, the results of the present study demonstrate that RMSF of a fitted model during an MDFF refinement provides valuable information on the model.

RMSF and quality of fit

The relationship between RMSF values and quality of model fit is demonstrated for the cMDFF refinement of β-galactosidase at 3.2 Å resolution. The initial conformation is a poor fit of the map, characterized by low values of GCC, LCC, and iFSC (the row containing ‘initial’ structure in Table 1). Such conformations belong to a diverse ensemble of poorly fit structures, explored by the search model in the early phase of the fitting, that gives rise to high initial RMSF values shown in Figure 4—figure supplement 1. In the ending phase of the refinement, well-fitted structures are obtained with improved GCC, LCC, and iFSC (cMDFF row of Table 1). Owing to the high resolution (3.2 Å) of the EM density map, the population of these well-fitted structures is much smaller than that of partially-fitted structures. Thus, as can be observed in Figure 4—figure supplement 1, the converged ensemble of the well-fitted structures exhibit much smaller fluctuations than the initial one. Low RMSF values for the fitted structure indicate therefore, that (i) the structure has been modeled unambiguously within the map, and (ii) the structure can be regarded as representative of the ensemble underlying the 3.2-Å β-galactosidase map; since per-residue fluctuations about the fitted structure (Figure 4—figure supplement 1) are less than 1 Å, only marginal backbone and sidechain variations within the ensemble arise (Singharoy et al., 2013).

RMSF and quality of map

Apart from representing the quality of fit, RMSF values monitored during an MDFF simulation correlate closely with the overall and local resolution of an EM map. Even though high-resolution cryo-EM data are becoming increasingly obtainable, resolution is not always uniform throughout a map. For example, Figure 3 reflects the variation in local resolution of map regions corresponding to residues of β-galactosidase and TRPV1. Conformational flexibility can cause heterogeneity in the cryo-EM data (Leschziner and Nogales, 2007), producing local resolutions lower than that of the overall map.

Models colored by local resolution, square of RMSF, and B-factor.

The published models of (a) β-galactosidase (PDB 5A1A) and (b) TRPV1 (PDB 3J5P) are colored by the local EM map resolutions, the per-residue mean square fluctuations (RMSF2) during MDFF simulation, and published B-factors. Comparison of these figures shows qualitative agreement between local resolution, RMSF2, and B-factor. In fact, the local resolutions and B-factors correlate linearly with RMSF2 of a fitted model both in the presence as well as absence of the EM map (more details in Figure 4).

https://doi.org/10.7554/eLife.16105.016

Local resolution analysis (Kucukelbir et al., 2014) can be especially important for determining the parts of a high-resolution map that realistically contain side chain information and the parts that do not, preventing over-interpretation of the latter. MDFF protocols can be adjusted to account for such local variations and better inform the process of model validation. For example, if analysis shows that certain residues reside in low-resolution regions of the density, the per-atom weighting factor applied to the forces derived from the density can be lowered.

The RMSF-resolution correlation is found to hold in the cases of TRPV1 (PDB 3J5P), 2.2-Å β-galactosidase (PDB 5A1A), γ-secretase (PDB 5A63 and 4UPC), and the T20S Proteasome (PDB 3J9I). Generally, the lower the resolution of the map, the higher the corresponding overall RMSF during MDFF simulations. For example, the overall RMSF during MDFF of the 4.5-Å γ-secretase model and map is greater than that of the 3.4-Å model and map which, in turn, is greater than that of the 2.2-Å map and model of β-galactosidase (see RMSF labels on the upper column of Figure 4). The correlation between map resolution and model RMSF extends further to local features within the density. In Figure 4 (upper row), RMSFs of atoms plotted against local resolutions of the corresponding map regions display linear correlation between the two quantities. Again, higher RMSF indicates lower local resolution.

Figure 4 with 4 supplements see all
RMSF vs. local resolution plots for various simulations.

For each test case shown, atoms in the MDFF-refined structure are classified by local resolution of the map regions they are fitted into. The average RMSF value of atoms (during MDFF simulation) in each resolution bin is calculated and plotted against the local resolution in the cases of (a) β-galactosidase (β-gal) at 2.2 Å, (b) TRPV1 at 3.4 Å, γ-secretase (γ-sec) at (c) 3.4 Å and (d) 4.5 Å resolution, and proteasome (see Figure 4—figure supplement 3). The numbers of atoms in the resolution bins are displayed as a histogram (in red) spanning a system-specific range of resolutions. The lowest resolution bins contained low (<20) populations and visual inspection consistently revealed the atoms to be on the edges of the density or were otherwise located inside map noise, and were therefore ignored during further analysis. A clear linear correlation between RMSF and local resolution can be found in each case, such that applying a linear fit produces the high R2 value shown in each graph heading. Also displayed in each heading is an overall RMSF, averaged over all atoms in the system. The overall RMSF reflects the conformational variety of structures that fit within the map, and is found to correspond to the map resolution such that higher resolutions produce lower RMSFs. The second row of plots show that the RMSF during MDFF simulation also linearly correlates with RMSF during unbiased MD simulations of (e) β-gal, (f) TRPV1 and (g,h) γ-sec, establishing that fluctuations during MDFF reflect the inherent flexibility of a system.

https://doi.org/10.7554/eLife.16105.017

The physical basis for considering RMSF values of a group of fitted atoms during an MDFF refinement as an indicator of the map resolution follows from the linear correlation of these values with RMSF of the same set of atoms during unbiased MD (Figure 4, bottom row). Noting that the RMSF value during unbiased MD simulations reflects flexibility (Karplus, 1990), this linear correlation clearly establishes the dependence of RMSF during MDFF refinement on the inherent flexibility of the macromolecule, at least for our four demonstration systems (see Appendix 1 - Section 2 for MD simulation details). Since the flexibility of a molecule during the imaging process contributes to the limiting resolution of the resulting EM density map (Kucukelbir et al., 2014), it is not surprising that overall and local map resolutions correlate with overall and local RMSF values of the best fitted model (Figure 4).

In broad terms, through the present study we establish that RMSF, together with GCC, LCC, EMRinger (Barad et al., 2015), and iFSC, provide a comprehensive set of criteria for evaluating model and map quality on both global and local levels. The added value of RMSF is particularly evident on the local level, where the other measures may not perform as consistently. For example, a high LCC may be the result of a highly flexible structure fitting to a low-resolution region of the map, and not necessarily of a good representation of the local structure. As a result, although multiple low-resolution regions of the model in Figure 4—figure supplement 2a possess similar LCCs, disparate RMSFs of the same regions clearly indicate differences in local quality of the model. Likewise, EMRinger scoring, when applied to small groups of residues, does not correlate with local resolution (Figure 4—figure supplement 2b), and, therefore, is incapable of distinguishing regions of small number of atoms by local model quality. In contrast, RMSF clearly resolves local resolution, and, thus, resolves the map and model quality even with as few as 100 atoms.

RMSF and B-factor determination

High contrast within an EM map allows clear identification of secondary structural elements. However, experimental imaging discrepancies arising from specimen movement and charging, radiation damage, and partial microscope coherence, or computational discrepancies due to inaccurate determination of the single particle order parameters (Wade, 1992; Fernández et al., 2008), introduce fuzziness to the EM map, thus hindering secondary structure identification. B-factor sharpening restores lost contrast by resolving the fuzzy features and, therefore, is a crucial step of map generation that affects map interpretation. Here, we describe the use of RMSF as a physical basis for the determination of optimal B-factors that preserve contrast during map sharpening.

Figure 5, Figure 5—figure supplement 3, Figure 4—figure supplement 4 demonstrate the relationship between RMSF and B-factor for the 2.2-Å map of β-galactosidase, 3.4-Å map of TRPV1 and 4.5-Å map of γ-secretase; these three systems were chosen due to their reasonable size and availability of unsharpened data. An initial decrease in the RMSF values is observed with increase in B-factor sharpening of the map. However, the RMSF eventually reaches a minimum before increasing as the B-factor is further increased. Surprisingly, for all the three structures the B-factor corresponding to the minimum RMSF coincides with the one determined by Guinier analysis (Fernández et al., 2008; Rosenthal and Henderson, 2003) to provide maximum contrast. In fact, for TRPV1, B-factors determined from the RMSF minima are found to be higher for the soluble regions than for the transmembrane helices (Figure 5—figure supplement 1, b vs. c), again, in agreement with those B-factors derived from the Guinier analysis. Therefore, the RMSF analysis of maps with varying B-factor sharpening provides an alternate procedure for the determination of the optimal B-factor, which has been traditionally determined by Guinier analysis.

Figure 5 with 4 supplements see all
Effect of map sharpening on residue flexibility of β-galactosidase.

(a) Overall RMSF of a fitted 2.2Å β-galactosidase structure (PDB 5A1A) during direct MDFF fitting as a function of the B-factor of the fitting map exhibits a parabolic trend. Guinier analysis identifies a B-factor of −75 as optimal, for which the corresponding RMSF (shown in red) coincides with the minimum of the trend line. EMRinger scores (shown in blue) of the same structures show a negative parabolic trend, with the peak coinciding with the minimum of the RMSF plot. (b) The linear relationships between local RMSF during MDFF and during unbiased MD for the unsharpened map and optimally sharpened map are compared. While the linear relationship is preserved in both cases, RMSFs in the sharpened case are slightly lower than in the unsharpened case.

https://doi.org/10.7554/eLife.16105.022

The rationale for optimal B-factor selection based on an analysis of RMSF values is the following. As pointed out above and in Figure 4, RMSF of a fitted model is a function of the quality of the corresponding map. Following this argument, the maps in Figure 5, Figure 5—figure supplement 1 and Figure 5—figure supplement 2 that produce the model with the lowest RMSF during fitting represent the density envelopes where atoms can be positioned with the least uncertainty at the given experimental resolutions. The corresponding sharpening B-factor thus provides the optimal contrast for atom placement and secondary structure determination.

Our argument for the RMSF-based selection of B-factors is supported by the well-established quadratic relationship of B-factors with the RMSF of atoms of a structure in an experimental setting, i.e. 8π2/3(RMSF)2 (Rosenthal and Henderson, 2003; Liu and Xiong, 2014). This relationship indeed implies that the smaller the average atomic fluctuations, the lower are the B-factors, and by definition (Liu and Xiong, 2014) the higher are the measured structure factor amplitudes, and hence the contrast, of the resulting map in both crystallographic and EM experiments (Liu and Xiong, 2014; Fernández et al., 2008). By analogy, RMSF within MDFF is a measure of these average atomic fluctuations under the experimental setting (Figure 4, bottom row) and, therefore, within a set of maps of varying B-factors, the highest contrast is indicated by the lowest RMSF obtained during fitting of a model. Since Guinier analysis of EM data selects B-factors based on the same atomic displacement-structure factor relationship that justifies the RMSF-based selection of B-factors (Rosenthal and Henderson, 2003), the B-factor selected from our analysis of the lowest RMSF matches excellently with those from the Guinier analysis of EM maps.

The higher contrast of the B-factor sharpened map, over the unsharpened one, is evident from the higher quality of the models derived from the sharpened maps. Presented in Figure 5a, Figure 5—figure supplement 3b, and Figure 5—figure supplement 1a, for the cases of β-galactosidase, γ-secretase, and TRPV1, respectively, are EMRinger scores of models fitted to unsharpened and sharpened maps. These scores indicate clearly that the models derived from maps at B-factor sharpening of −50 (Figure 5a), −100 (Figure 5—figure supplement 3b), and −100 (Figure 5—figure supplement 1a) are more accurate than those derived from the unsharpened ones. Indeed, as further illustrated in Figure 5a, Figure 5—figure supplement 3b, and Figure 5—figure supplement 1, the maximum EMRinger score is attained for the same B-factor that produces the minimum RMSF. Sharpening by B-factors of magnitudes any higher than 50 for β-galactosidase, and 100 for γ-secretase or TRPV1 damages key density features thus reducing the quality of the associated fitted models. In addition, correlation of the RMSF of the fitted structure with that from an unbiased MD of the system is still preserved (Figure 5b and Figure 5—figure supplement 2b) for the B-factor sharpened map, confirming that the dynamical fluctuations of the structure within the sharpened map reflects the inherent dynamics of the system. Altogether, RMSF establishes a unique map-model validation criterion that represents an ensemble view of the fitted structures, while also preventing over-sharpening of the EM maps.

RMSF and per-residue B-factors

The quadratic relationship between RMSF and B-factors can be further employed to determine per-residue B-factors. Presented in Figure 3, the per-residue RMSF2 of β-galactosidase and TRPV1, derived from MDFF, show excellent agreement with the distribution of local resolution and experimentally reported B-factors: regions with higher RMSF2 correspond to lower local resolution and higher per-residue B-factors, and vice versa. In fact, the quantitative agreement between the B-factors derived from MDFF through the computation of 8π2/3(RMSF)2 (0.02 < (RMSF)2<4.43 [Figure 3]) and those reported in the experiments is remarkable (Figure 5—figure supplement 4); a cross-correlation of 55% is found between the data sets, which improves to 60% in the structured regions. Larger discrepencies are observed between the RMSF-based and reported B-factors of TRPV1, particularly in the soluble region (Figure 3b). This discrepency is expected, as for poorly resolved regions MD provides up to six-fold higher B-factors (Kuzmanic et al., 2014); the higher B-factors indeed have been demonstrated to be a more accurate representation of conformational diversity (Kuzmanic et al., 2014). For the TRPV1 example, the highest computed B-factor is 8π2/3*9.25 = 240.5 (Figure 3b) compared to the reported value 53.17. However, the majority of the computed B-factors fluctuatate about the value of 8π2/3*1.73 = 44.98, which is in fair agreement with the reported values of 20–30.

Overall, B-factors of cryo-EM maps are typically calculated by Guinier analysis for maps with resolutions better than ~10 Å (Fernández et al., 2008; Rosenthal and Henderson, 2003). Using a mask around specific regions of interest, estimation of local B-factors for different parts of a map is also possible. However, this method is limited to large domains of macromolecular complexes due to problems associated with tight masking of cryo-EM maps. For maps with resolutions better than ~3 Å, local B-factors could be estimated and refined in X-ray crystallography programs. However, most highresolution cryo-EM maps have resolutions between 3 to 5 Å (Liao et al., 2013; Bartesaghi et al., 2014; Zhao et al., 2015; Fischer et al., 2015). In this resolution range, B-factors could be estimated from MDFF and used as prior information to improve model building and refinement. For resolutions better than 3 Å, MDFF-derived values may serve as initial estimates of B-factors. Furthermore, many cryo-EM maps show local variations in resolution, complicating the model building process. A possible solution may be to combine local B-factor refinement for highresolution regions of a map and B-factors derived from MDFF for lower resolution regions. A resolution-dependent weighting scheme could be incorporated to combine the different values for optimal performance, which may help improve the accuracy of atomic models derived from high-resolution cryo-EM maps.

Accessibility through cloud computing

ReMDFF involves many independent, though sporadically communicating, MDFF simulations that can be run well on a parallel computer as exploited by the NAMD software for the case of replica exchange simulations (Jiang et al., 2014). As a result, ReMDFF provides an efficient and automated method which can converge on a final fitted structure more quickly than direct or cMDFF (Figure 2—figure supplement 3). However, a potential bottleneck may exist with respect to the computing hardware that a researcher has access to. Cloud computing offers a potential solution, allowing a researcher to focus on the scientific challenges of their project without having to worry about local availability and administration of suitable computer hardware.

To prove the feasibility of performing ReMDFF simulations on a cloud platform, we performed ReMDFF for a test system, carbon monoxide dehydrogenase (PDB 1OAO:chain C), on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform. The test system converges to the known target structure in approximately 0.1 ns simulation time. The time to convergence requires very little wall clock time, and, therefore, incurs a small monetary cost to a user (Table 3). However, it should be noted that human error and varying experience level can easily add to the incurred cost of cloud usage. Some systems may require multiple simulations to achieve a high quality structure and, therefore, additional time beyond the example discussed here. Furthermore, preparing a structure for simulation may require additional time and resources over the purely simulation-oriented results presented here. The files and information necessary to run ReMDFF on the test system using EC2 cloud computing resources are available at (http://www.ks.uiuc.edu/Research/cloud/). The Implementation section of the Methods contains further details for setting up and running ReMDFF simulations.

Table 3

Performance and cost results for ReMDFF of carbon monoxide dehydrogenase on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform. Costs are incurred on a per-hour basis, with a 1 hr minimum.

https://doi.org/10.7554/eLife.16105.027
Instance typeCPUPerformance (ns/day)Time (hours)Simulation cost ($)
c3.8xlarge305.880.411.68
c3.4xlarge123.330.720.84
c3.2xlarge61.351.780.84

Discussion

Flexible fitting methods have facilitated structure determination from low-resolution EM maps for more than a decade (Tama et al., 2004; Suhre et al., 2006; Velazquez-Muriel et al., 2006; Orzechowski and Tama, 2008; Topf et al., 2008; Kovacs et al., 2008; Lopéz-Blanco and Chacón, 2013; Wu et al., 2013) and continue to be the methods of choice for resolving molecular systems with atomic resolution. MDFF, in particular, has been a front-runner among methods that have facilitated the discovery of some of the most complicated structures in modern day structural biology (Hsin et al., 2009; Sener et al., 2009; Gumbart et al., 2011; Frauenfeld et al., 2011; Zhao et al., 2013; Wickles et al., 2014).

cMDFF and ReMDFF, the new variants of MDFF introduced above, offer now accurate fitting of atomic-level structures within sub-5 Å EM maps, a feat thus far inaccessible to direct MDFF. These new methods extend the radius of convergence of MDFF to at least 25 Å, fitting models to maps of resolutions as high as 3.2 Å. This radius of convergence is at least twice that reported for Rosetta refinements of the 20S proteasome (DiMaio et al., 2015). Such a broad radius of convergence will allow the refinement of extremely poorly guessed initial models with MDFF, as demonstrated in the cases of β-galactosidase and TRPV1 reported here.

ReMDFF simulations involving the so-called replica-exchange molecular dynamics converge quickly using a small number of replicas and are thus amenable to cloud computing applications. Running ReMDFF on the cloud lowers greatly the barrier to usage of the method, providing a cost-effective and practical solution to fitting structures to high-resolution cryo-EM densities for researchers who neither own nor can administer their own advanced computer hardware.

The accuracy of structures refined by cMDFF and ReMDFF has been confirmed by standard error analysis protocols, both in terms of quality of fit as well as in terms of the quality of the model. The results clearly show that sidechain refinements through MDFF produce accurate placement of Cα and Cβ atoms and modeling of the associated dihedrals.

Beyond the standard error analysis protocols, which apply to single static structures, the quality of both the fit and the model can be evaluated by ensemble-based measures. An example demonstrated in this study pertains to the use of RMSF during MDFF refinement for simultaneously evaluating the quality of model, map, and fit. Furthermore, utilizing the fact that the inherent flexibility of a macromolecule is a key determinant of the achievable resolution of the corresponding map, RMSF values have been employed to identify the optimal amount of sharpening for a given map offering the highest contrast, as established typically through Guinier analysis. The RMSF computations in MDFF provide a viable means of determining per-residue B-factors.

Altogether, interpretation of a map as being representative of an ensemble rather than a single model brings to light new ways of model validation. EM maps, including those at high-resolution, typically do not have a uniform local resolution (Kucukelbir et al., 2014) and contain low-resolution regions, such as flexible exterior or transmembrane segments (Leschziner and Nogales, 2007). The sub-5 Å EM maps have also been able to resolve proteins in multiple conformations (Matthies et al., 2016). Both these type of maps will continue to benefit from accurate ensemble-based flexible fitting techniques for the foreseeable future. MDFF provides a natural method to model flexible regions; de novo models constructed for one conformational state can be flexibley fitted into the density of the other state(s), thus avoiding the arduous task of model construction for all the conformational states for capturing a conformational transition process with cryo-EM.

Materials and methods

In the following section, we first outline the methodology underlying direct MDFF, cMDFF, and ReMDFF, along with specific protocols applied for the refinement of β-galactosidase and TRPV1 models. Second, we describe the computations of local and global root mean square fluctuation (RMSF) values, that are utilized for the evaluation of map-model quality. Finally, details on the computational implementation of all the three MDFF protocols are provided.

Direct MDFF

MDFF requires, as input data, an initial structure and a cryo-EM density map. A potential map is generated from the density and subsequently used to bias a MD simulation of the initial structure. The structure is subject to the EM-derived potential while simultaneously undergoing structural dynamics as described by the MD force field.

Let the Coulomb potential associated with the EM map be Φ(𝐫). Then the MDFF potential map is given by

(1) VEM(r)={ζ(Φ(r)ΦthrΦmaxΦthr)ifΦ(r)Φthr,ζifΦ(r)<Φthr.

where ζ is a scaling factor that controls the strength of the coupling of atoms to the MDFF potential, Φthr is a threshold for disregarding noise, and Φmax=max(Φ(r)). The potential energy contribution from the MDFF forces is then

(2) UEM=iwiVEM(ri),

where i labels the atoms in the structure and wi is an atom-dependent weight, usually the atomic mass.

During the simulation, the total potential acting on the system is given by

(3) Utotal=UMD+UEM+USS

where UMD is the MD potential energy as provided by MD force fields (e.g. CHARMM) and USS is a secondary structure restraint potential that prevents warping of the secondary structure by the potentially strong forces due to UEM. A detailed description of the potentials arising in Equation 3 is given in Trabuco et al (Trabuco et al., 2008, 2009).

After the MDFF and restraint potentials are created through the MDFF plugin of VMD (Humphrey et al., 1996), the initial structure is rigid-body docked (e.g. with Situs [Wriggers, 2010]) into the density map. Prior to simulation, MDFF-specific parameters can be modified and include ζ and the subset of atoms to be coupled to the MDFF potential. The latter typically consists of all non-hydrogen atoms or backbone atoms and ζ is usually set to 0.3. MDFF can be performed in various simulated conditions, including different temperatures and vacuum, membrane, and explicit or implicit (Tanner et al., 2011) solvent environments. The choice of parameters and conditions depends on the requirements of each specific case. For example, a highly polar molecule would be more accurately simulated in explicit solvent rather than in vacuum, but the computational cost would be much higher in this case. The MDFF simulation is run until the system has reached stationarity, as determined by RMSD; typical run times are nanoseconds.

Cascade MDFF

In cascade MDFF (cMDFF), the initial structure is sequentially fitted to a series of potential maps of successively higher resolution, with the final potential map being the original one derived from the EM map. Starting with i=1, the ith map in the series is obtained by applying a Gaussian blur of width σi0Å to the original potential map, such that σi decreases as the structure is fitted in the sequence i=1,2,,L, where L is the total number of maps in the series, so that σL0 \AA. One can intuitively understand cMDFF as fitting the simulated structure to an initially large and ergodic conformational space that is shrinking over the course of the simulation towards the highly corrugated space described by the original MDFF potential map.

Figure 1 provides a visual representation of cMDFF. For a mathematical illustration, suppose that the original potential map can be written as a sum of Gaussians

(4) VEM(r)=ncnG(r;rn,sn),

where cn is a weight, G(𝐫;𝐫n,sn) is a Gaussian function centered at 𝐫n with half-width sn and evaluated at 𝐫. The result of a Gaussian blur of half-width σi on VEM is (see Appendix 1 - Section 1 for details)

(5) Vσi(𝐫)=ncnG(𝐫;𝐫n,sn2+σ2).

Hence, the half-width σi allows one to tune the characteristic width of the potential map through the half-widths of its component Gaussians sn2+σi2. The initial fitting starts with a large σ1, corresponding to a diffuse potential which allows much structural mobility, and proceeds along decreasing values of σi, corresponding to narrower potentials with steeper gradients, so that the structure is gradually settled into the original potential map, characterized by σL0 \AA.

In practice, the series of cMDFF maps is generated from the original potential map using VMD’s volutil Gaussian blur tool. Optimal values for the first half-width σ1 and the change in σi from one map to the next are case-dependent. Values used in the present study were obtained through trial and error. In general, structures far from the ideal conformation benefit from a large σ1 (>5 Å) so as to explore a large conformation space. On the other hand, if the original map has a high resolution, small changes in σi (<1 Å) would allow a gradual convergence required to avoid being trapped in local potential minima. In our simulations, the change in σi is initially 1 Å. A concrete example is σ1=5 Å, σ2=4 Å, σ3=3 Å, σ4=2 Å, σ5=1 Å, σ6=0 Å. A second trial using changes of 0.5 Å was performed (σ1=5 Å, σ2=4.5 Å, σ3=4 Å,, σ10=0.5 Å, σ11=0 Å), and if the resulting structure of the second trial presented a better fit, then the first trial was discarded.

Resolution exchange MDFF

Replica Exchange MD (ReMD) is an advanced sampling method that explores conformational phase space in search of conformational intermediates, which are separated by energy barriers too high to be overcome readily by fixed temperature simulations. Instead of working with a single, fixed MD simulation, ReMD carries out many simulations in parallel, but at different temperatures T1 < T2 < T3 <  where the lowest temperature T1 is the temperature of actual interest, typically, room temperature. The simulations of several copies of the system, the so-called replicas, run mainly independently, such that ReMD can be easily parallellized on a computer, but at regular time points the instantaneous conformations of replicas of neighboring temperatures are compared in terms of energy and transitions between replicas are permitted according to the so-called Metropolis criterion (Sugita and Okamoto, 1999). This way the highest temperature replicas overcome the energy barriers between conformational intermediates and through the Metropolis criterion moves the T1 replica benefits from it such that transitions between intermediates occur frequently. The application of the Metropolis criterion in the protocol guarantees that the conformations of the T1 replica are Boltzmann-distributed.

ReMDFF extends the concept of ReMD to MDFF by simply differentiating replicas not by temperatures T1 < T2 < T3 < , but by the half-width parameters σ1 > σ2 > σ3 > . Numerical experiments showed that ReMDFF works extremely well as documented in the present study. As NAMD can parallelize ReMD well (Jiang et al., 2014), it can do the same for ReMDFF, such that the enhanced sampling achieved translates into extremely fast MDFF convergence. At certain time instances replicas i and j, of coordinates 𝐱i and 𝐱j and fitting maps of blur widths σi and σj, are compared energetically and exchanged with Metropolis acceptance probability

(6) p(xi,σi,xj,σj)=min(1,exp(E(xi,σj)E(xj,σi)+E(xi,σi)+E(xj,σj)kBT)),

where kB is the Boltzmann constant, E(𝐱,σ) is the instantaneous total energy of the configuration 𝐱 within a fitting potential map of blur width σ.

MDFF protocols for β-galactosidase and TRPV1

Computational protocols for performing direct MDFF, cMDFF, and ReMDFF refinements of the two test systems, β-galactosidase and TRPV1, are now outlined.

Direct MDFF

In order to provide a basis for comparison with cMDFF and ReMDFF, direct MDFF simulations were performed for both β-galactosidase and TRPV1. Original published maps of resolutions 3.2 Å and 3.4 Å for β-galactosidase (EMD-5995 [Bartesaghi et al., 2014]) and TRPV1 (EMD-5778 [Liao et al., 2013]), respectively, were fitted with search models characterized by RMSD values of 7 Å and 25 Å, respectively, relative to the known de novo target models (PDB entries 3J7H and 3J5P, respectively); these search models were prepared as described in Appendix 1 - Section 3. Scale factors ζ (Equation 1) of values 1.0 and 0.3 were employed for β-galactosidase and TRPV1, respectively, to couple all the heavy atoms of the models to the respective maps. All other simulation parameters are noted in Appendix 1 - Section 2. The resulting structure from each MDFF simulation was then subjected to a final re-refinement applying a scaling factor of 1.0. Furthermore, the temperature was ramped down from 300 K to 0 K over 30 ps and held at 0 K for an additional 1 ns. This re-refinement step additionally improved the fitting of sidechains (McGreevy et al., 2014). Thus, all the MDFF results reported in the present study pertain to the structures that are obtained from direct MDFF, cMDFF, or ReMDFF simulations followed by the final re-refinement step.

cMDFF

The general simulation protocol consists of a series of consecutive MDFF simulations. The search models for β-galactosidase and TRPV1 were first minimized over between 500 to 1000 time steps. Next, MDFF simulation runs were sequentially performed, starting with the map of the lowest resolution, progressing through maps of successively higher resolution, and ending with the original map. Each run was chosen to be long enough for the structure to equilibrate within the MDFF potential. Taking advantage of the stochastic nature of MDFF simulations, multiple independent cMDFF simulations (10 in the present study) were performed for each system to be fitted, generating an ensemble of fitted structures. From the ensemble, the best structure was determined by the various quality indicators described in Results. This structure was then subjected to the final re-refinement step to allow for accurate resolution of sidechains.

For β-galactosidase, cMDFF was initiated with a map blurred with half-width σ1=5 Å. Use of 0.5 Å resolution steps produced better-fitted structures than 1 Å steps. Hence, the final simulation utilized L=11 maps in total, including the original. A search model was obtained by subjecting the published structure to a 5-ns equilibration MD run with temperature 1000 K, yielding a structure with backbone RMSD of 7.6 Å relative to the original structure (see Appendix 1 - Section 3). 70-ps MDFF simulations were performed at each of the 11 resolutions to achieve convergence during the cMDFF protocol. Simulation parameters for all these MDFF runs are identical to those used for the direct MDFF simulations above.

For TRPV1, Gaussian blurred maps were generated starting with a half-width of σ1=5 Å, and decreasing by 1 Å for each subsequent map, thus yielding a series of L=6 maps, including the original. The initial structure for this case was obtained using an interactive MD protocol, described in Appendix 1 - Section 2, to displace the ankyrin repeat region of one subunit. The resulting model deviated from the de novo model by an RMSD of about 10 Å in the displaced region and about 25 Å in the overall structure. Unlike β-galactosidase, now, 100-ps MDFF simulations were performed at each of the 6 resolutions to achieve convergence.

ReMDFF

ReMDFF was performed on both β-galactosidase and TRPV1 using the same search models, simulation parameters, and high-resolution maps as in the cMDFF simulations. 11 and 6 replicas were employed for β-galactosidase and TRPV1 respectively with an exchange trial interval of 1 ps. In each case, the total energy of each replica was monitored and the simulation was run until the energies reached a stationary level. The ReMDFF simulation was found to converge in 0.1 ns for the β-galactosidase refinement, and in 0.02 ns for that of TRPV1. Finally, similar to direct MDFF and cMDFF, the re-refinement step was performed to improve sidechain geometry.

Fluctuation analysis

The local resolutions of a density map can be computed with ResMap (Kucukelbir et al., 2014) and used within VMD to select the atoms of a structure that are contained in a range of resolutions found by the ResMap analysis. First, the local resolution map output by ResMap is loaded into VMD and then the interpvol keyword can be used to automatically select the atoms found inside the volume values specified, using interpolation. The average RMSF of each selection can then be calculated for a structure during the steps of a MDFF simulation after the initial fitting has occured and the structure has stabilized. In principle any criteria for atom selection can be used for RMSF analysis, though we use local resolution of the EM density here to illustrate the correlation between the two measurements. Additionally, we compute a global average RMSF of the entire structure.

The ensemble-based nature of the RMSF analysis means that the quality metric is not dependent on a single structure, but instead a large family of structures can be employed as a better representative of the data. Ensemble-based analyses are a natural and powerful benefit of the MD-based nature of MDFF. RMSF analysis does not, however, require MDFF to be used as the method of refinement. In principle, any refinement method can be used to obtain the fitted model. A subsequent short MDFF simulation of the fitted model can then be performed to obtain the data necessary for the RMSF analysis.

Implementation

Incorporating advanced simulation techniques, such as multi-copy algorithms (Jiang et al., 2014), into the MDFF protocol creates a more efficient and accurate computational strategy in cMDFF and ReMDFF. However, these advanced simulation techniques come with an added complexity in the setup and execution of the methods. The current implementation of these methods in NAMD (Phillips et al., 2005) and VMD (Humphrey et al., 1996) aim to automate the steps previously discussed. The MDFF Graphical User Interface (GUI) (McGreevy et al., 2016) can be used to set up cMDFF and ReMDFF simulations and provides default parameters, including the number and extent of smoothed maps used for the fitting, with which to run. The parameters for the smoothed maps and number of steps used per map are set heuristically based on previous experience and represent an adequate initial starting point. The GUI automatically generates each of the smoothed maps and converts them to potentials for use in the ReMDFF simulation. All parameters can be tuned by a user to adapt the protocols to their specific system and preference. The GUI produces a series of NAMD configuration files and scripts used for running the simulation, as well sorting and visualizing the results in VMD. Instruction on the use of MDFF, including the GUI, is given in the tutorial found at http://www.ks.uiuc.edu/Training/Tutorials/science/mdff/tutorial_mdff-html/.

Future development will allow for the automatic generation of the smoothed maps in NAMD at runtime. NAMD will also analyse the dynamics of the system to determine when the simulation has converged and move on to fitting to the next density map in the sequence in case of cMDFF calculations. Furthermore, advanced visualization and analysis techniques in VMD (e.g. new graphical representations) will be critical for properly understanding the RMSF analyses and to provide greater insight when examining the quality of a model.

The use of advanced simulation techniques also comes with an added cost of computational requirements. Adapting the algorithms to best utilize available computational hardware is key when developing efficient methods. Fortunately, the cMDFF and ReMDFF methods and associated analysis algorithms are well suited to highly efficient software implementations on contemporary multi-core CPUs and graphics processing unit (GPU) accelerators. We observe that by storing the complete cascade resolution series in efficient multi-resolution data structures such as mip maps (Williams, 1983), the MDFF cascade algorithm can access a continuously variable resolution representation of the original cryo-EM density map, while making efficient use of CPU and GPU processors and memory systems (Stone et al., 2007).

The parallel nature of ReMDFF presents an opportunity for efficient, automated sampling of maps of varying resolution. However, to achieve the best efficiency, the simulations should be performed on multi-core CPUs with relatively high core counts (i.e. at least 1 core per replica). Access to such multi-core computers could prohibit use of ReMDFF, however access to machines with the necessary hardware is easily achieved through cloud computing. The cloud computing model provides researchers with access to powerful computational equipment that would otherwise be too costly to procure, maintain, and administer on their own. A particular obstacle is that structural modeling often involves the use of different software suites, such as VMD (Humphrey et al., 1996) for simulation preparation and Situs (Wriggers, 2010) for initial rigid-body docking or VMD, NAMD (Phillips et al., 2005), and Rosetta (Leaver-Fay et al., 2011) for iterative refinement of models (Lindert and McCammon, 2015). Cloud platforms can easily bundle different software packages used in a modeling workflow to guarantee their availablity and interoperability on a standardized system. Through the cloud version of our MDFF program suite a user does not need to be aware of any of the above mentioned technical issues.

To prove the feasibility of performing ReMDFF simulations on a cloud platform, we ran ReMDFF for the test system, carbon monoxide dehydrogenase (PDB 1OAO:chain C), on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform. For the purposes of testing ReMDFF on EC2, we ran benchmarks on a variety of compute-optimized EC2 instance types. The details of the instance types can be found in Table 3. We used the same 6 smoothed density maps as the previously discussed cMDFF and ReMDFF simulations in the Proof of Principle and, therefore, also 6 replicas. The files and information necessary to run ReMDFF on the test system using EC2 cloud computing resources are available at (http://www.ks.uiuc.edu/Research/cloud/).

Appendix 1

1 Theory of cascade MDFF

The MDFF potential is given by

(7) UEM(R)=iwiVEM(ri),

where 𝐑 is the system configuration, i labels the atoms in the structure, wi and 𝐫i are weight and position, respectively, assigned to atom i, and VEM is the potential map obtained from the EM density map.

Subjecting the potential map VEM to a Gaussian blur increases its widths in a manner described as follows. The Gaussian blur process produces a potential map Vσ from the potential map VEM through convolution with a normalized Gaussian of specified width σ:

(8) Vσ(r)=drG(r;r,σ)VEM(r),
(9)G(r;r,σ) = A(σ)exp(||rr||22σ2),(10)A(σ) = 1(2πσ2)3/2.

One can characterize the resolution of VEM(r) explicitly by assuming that it can be written as a sum of Gaussians.

(11) VEM(r) = ncnG(r;rn,σn),

where cn are weighting factors, 𝐫n and σn are, respectively, the centers and widths of the component Gaussians. Substituting the above expression into Equation 8 yields

(12)Vσ(r) = drG(r;r,σ)ncnG(r;rn,σn)(13) = ncnA(σ)An(σn)drexp(||rr||22σ2||rrn||22σn2),

Evaluation of the above expression gives

(14)Vσ(r) = nCnexp(||rrn||22(σ2+σn2)),(15)Cn = cnA(σ)An(σn)(2π)3/2(σσnσ2+σn2)3(16) = cn[2π(σ2+σn2)]3/2.

Note that setting σ=0 A˚ in (Equations 14–16) recovers the expression for the initial map in Equation 11, i.e. V0(𝐫)=V(𝐫). On the other hand, increasing σ results in an increase in the widths of the component Gaussians, and leads in turn to an increase in the range of the MDFF force 𝐅σ(𝐫), where

(17)Fσ(r) = Vσ(r)(18) = nCnσ2+σn2exp(||rrn||22(σ2+σn2))(rrn).

2 MD simulation setup

Unless otherwise stated, all simulations reported in the present study used the following MD parameters. MDFF simulations were run in vacuum, maintained at 300 K via a Langevin thermostat and employed a time step of 1 fs. Secondary structure restraints were imposed using NAMD’s Extra Bonds function to prevent loss of secondary structure due to strong MDFF guiding forces.

MD simulations were prepared using CHARMM-GUI (Jo et al., 2008) and run under NPT conditions, maintained at 303.15 K temperature and 1 atm pressure using a Langevin thermostat and barostat. All systems were parametrized using the CHARMM36 force field (Klauda et al., 2010; Best et al., 2012). Structures were solvated in explicit water (TIP3P model) boxes, with at least 15 Å separation between structure and water box boundaries. Particle-mesh Ewald electrostatics was used and the time step was 2 fs. For simulations of the reported structures of β-galactosidase, backbone atoms were held fixed while a minimization over 1000 time steps was performed, followed by 32-ns and 40-ns equilibration for the 3.2-Å and 2.2-Å structures, respectively. Following the equilibration step, production runs of 30 ns were performed for both structures.

In the case of TRPV1, the channel was embedded in a membrane of standard lipid composition POPE, POPC, POPG at ratio 2:1:1. Initial runs involved minimization over 1000 steps and 1-ns equilibration of the lipid tails, with all other atoms fixed. In the following MD run, minimization was performed over 5000 steps and equilibration was performed for 3 ns, with protein backbone atoms held fixed. Finally, the entire system was equilibrated for a further 6.4 ns. During the equilibration, C-terminal residues 752 to 762 were harmonically restrained because a substantial C-terminal segment was missing in the structure.

3 Preparation of initial test structures

The initial test structure of β-galactosidase was obtained by subjecting the reported structure (Bartesaghi et al., 2014) to a 4-ns MD simulation at a temperature of 300 K in vacuum and using secondary structure restraints. Trajectory frames recorded at 2-ps intervals were evaluated for backbone RMSDs with respect to the reported structure. A frame with an RMSD value of 7.6 Å (Figure 2—figure supplement 2a) and lowest global cross-correlation with respect to the reported map was picked to be the initial test structure. The structural quality measure of this model is provided in Table 2. A second search model was prepared by repeating the same protocol but at 1000 K. This search model is also characterized by an RMSD of 7.6 Å but now features a more distorted local structure as measured in terms of increased rotamer and Ramachandran outliers (Supplementary file 1F).

The initial test structure of TRPV1 was also derived from a reported structure (Liao et al., 2013). In order to render the disjointed reported structure contiguous for correct structural dynamics during simulation, the missing loop region (residues 503 to 506) was added by hand. Additionally, the substantial ankyrin repeat region (residues 111 to 198) was removed because the corresponding density was missing from the map. For the purpose of testing the robustness of cMDFF and contrasting its performance with that of direct MDFF, the structure was distorted (see Figure 2—figure supplement 2b) during an interactive MD (Stone et al., 2001; Grayson et al., 2003) simulation, subjecting residues 199 to 430 in one subunit’s extramembrane domain to a series of transformations, consisting roughly of a polar angle change of 15° toward the cytoplasmic pole followed by an azimuthal rotation of 30°, so that the backbone RMSD of the transformed region was about 22 Å relative to the original structure.

4 Protocol efficiency

In the present study, the convergence of fitting results of TRPV1 and β-galactosidase for different protocols can be evaluated through RMSD value employing the reported de novo structures as reference. Figure 2—figure supplement 3 shows the time evolution of RMSD for cMDFF, ReMDFF, and direct MDFF for the two proteins, It should be noted that the plots do not include the final refinement step, which is the same across all three protocols.

cMDFF and ReMDFF can be seen to reach similar RMSD levels, outperforming direct MDFF. ReMDFF converges more quickly than cMDFF in either example. Of the 6 replicas employed for the ReMDFF of TRPV1 two resulted in poorly fitted structures, having become trapped in density minima even after exchanging with the lowest-resolution map chosen with σ = 5 Å. All replicas can be monitored during the simulation and poorly fitted ones can be discarded by a user. It is also worth noting that the region of the TRPV1 map to which both cMDFF and ReMDFF successfully fitted the model, characterizes a diverse range of local resolutions from 4 Å to 6 Å challenging thus the conformational sampling capability of any flexible fitting technique. For the same reason, this region was avoided during Rosetta refinements of TRPV1 (DiMaio et al., 2015) but is addressed now via MDFF.

5 Global vs. local measures of fit

The cross-correlation coefficient calculated over an entire structure, termed global cross-correlation coefficient (GCC), has been a popular indicator of goodness-of-fit of a structure to a corresponding EM density map. However, averaging over the entire structure smears out potentially useful local structure information and in some cases, can be misleading (see Figure 2—figure supplement 1), since GCC cannot distinguish between correct and wrong assignments of residues to a given map region as long as the residues are equally well fitted.

Local measures of fit allow one to assess every part of the structure individually. For example, local cross-correlation coefficients (LCCs) (Stone et al., 2014) were tracked over the course of the cMDFF simulations of β-galactosidase and TRPV1 (Figure 2—figure supplement 4). The improvement in LCC of the majority of residues in each case lends greater confidence in the fitting result. At the same time, residues that have relatively lower LCCs can be identified and given attention (Stone et al., 2014; McGreevy et al., 2016).

6 Fourier shell coefficients

Fourier Shell Coefficients (FSCs) can be used as a means of evaluating quality of fit by comparing the degree of similarity between the original map and a simulated map derived from the structure to be evaluated, using the simulated map feature of VMD’s MDFF package (Trabuco et al., 2008, 2009) and the same voxel size as the original map.

In the present study, FSC curves for fitted TRPV1 and β-galactosidase structures were calculated via the FSC operation in SPIDER (Shaikh et al., 2008) using a shell width of 0.5 reciprocal space units, and resolution cutoff of half the voxel size. In the case of TRPV1, both the full structure and MDFF-fitted region were evaluated. The latter was obtained by applying a mask of the region around residues 199 to 430 in the fitted structure to the simulated map, and in the reported structure to the original map.

As a means of summarizing comparisons by FSC, other studies have used 'integrated FSCs', a numerical measure obtained by integrating under the FSC plot within a predefined resolution interval. Two integrated FSC measures, corresponding to the intervals 3.4 Å to 10 Å and 5 Å to 10 Å, were obtained in the present study and tabulated in Table 1, and Supplementary file 1A and 1C.

7 Other simulations of β-galactosidase

Beyond the simulations reported in Table 1, further simulations were performed to explore the performance of MDFF within the contexts of further types of analyses. The first of these simulations was a direct MDFF simulation of the reported β-galactosidase structure, fitting only backbone atoms to the 3.2-Å map. The fitted structure was compared to the 'refined de novo' structure reported in Results. It was found that EMRinger scores were lower at 2.35 when only backbone atoms were fitted, compared to 4.23 when non-hydrogen atoms were fitted. This result suggests that even if the backbone is correctly placed, the MD force fields alone (i.e. CHARMM36 (Best et al., 2012) here) are incapable of providing sidechain geometries consistent with the map. Refinement of the sidechains will therefore require explicit fitting to the density, above and beyond the orientations captured by the force fields alone.

In the second simulation, the resulting structure of the cMDFF simulation was subjected to an equilibrium MD simulation in explicit solvent. As shown in Figure 2—figure supplement 6, the equilibrium RMSD fluctuations during this simulation ranged between 3.0 Å to 3.4 Å of the starting structure. It is worth noting that these RMSD values agree well with the 3.2 Å resolution limit of the β-galactosidase map. Thus, the result indicates that uncertainties of the map resolution reflects quantitatively the structural variations of the cMDFF-fitted β-galactosidase model at the room temperature. Consequently, this model is representative of the thermodynamic ensemble that the EM map characterizes. Further analysis of the quality of the 3.2-Å β-galactosidase model is presented in the Model Validation subsection of Results.

The third set of simulations takes advantage of a unique opportunity, presented by the availability of two different maps of the same structure, at resolutions of 3.2 Å and 2.2 Å, to compare the results of fitting β-galactosidase to maps of different resolutions. The reported structures were subjected to direct MDFF simulation for 0.7 and 1 ns for the 3.2-Å and 2.2-Å models, respectively. The RMSF for each residue is calculated over consecutive 10-ps windows during the fitting. The RMSF values for all residues, including those for the PETG binding pockets (Bartesaghi et al., 2015) are plotted in Figure 4—figure supplement 4, reflecting smaller fluctuations during the fitting to the 2.2-Å map than in the 3.2-Å one. The relationship between fluctuation and map quality is examined in greater detail in Results, and imply that the RMSF of the fitted structure correlates negatively with the resolution of the corresponding map. Thus, the smaller fluctuations of the 2.2 Å structure than those of the 3.2 Å one further validate our proposed RMSF-map quality relationship of Figure 4.

8 MDFF refinement for TRPV1

Similar to the β-galactosidase results in Table 1, the refinement of TRPV1 was significantly better with cMDFF and ReMDFF than with direct MDFF, but starting with much poorer search models than what were employed for the β-galactosidase refinements. The TRPV1 refinement results are summarized in Supplementary file 1A: (i) RMSD of the fitted structure with respect to the reported de novo structure is 7.9 Å for direct MDFF, higher than the 2.4 Å and 2.5 Å RMSD values for cMDFF and ReMDFF, respectively (Figure 2—figure supplement 3b); (ii) EMRinger scores for cMDFF and ReMDFF are 1.68 and 1.99 respectively, higher than the 1.51 score obtained for direct MDFF; (iii) MolProbity scores (Chen et al., 2010) are 2.4 and 2.5 for cMDFF and ReMDFF, smaller than the 7.9 score for direct MDFF, implying fewer, less severe steric clashes and fewer poor rotamers in the former than in the latter; (iv) integrated FSC (iFSC2, for the range 3.4-10 Å obtained as described in Appendix 1 - Section 6 above), attains higher values of 2.62 and 2.75 for cMDFF and ReMDFF respectively, than the 1.79 value for direct MDFF. iFSC1, corresponding to the lower resolution range of 5-10 Å is found to behave similarly to iFSC2; and (v) GCCs improve from an initial value of 0.16 to 0.50, 0.54 and 0.53 for direct, cMDFF, and ReMDFF protocols, respectively. Similarly, typical residue LCC values improve from 0 to 0.5 or higher, as shown in Figure 2—figure supplement 4b.

Measures of structural quality for the above fits are tabulated in Supplementary file 1B. Cross-validation with half-maps was also performed on the cMDFF structure, as per the β-galactosidase simulations, to ensure that it was not overfitted. As in the case of β-galactosidase, iFSC and EMRinger scores for direct and cross comparisons were similar. FSC analysis results are described in Appendix 1 - Section 11.

9 MDFF fitting of TRPV1 TM domain

Noting that the transmembrane (TM) region of the TRPV1 map is better resolved than the map for the extramembrane region, we performed 200 ps of MDFF on a truncated TM portion of the structure (residues 381 to 695) to characterize the performance of MDFF in the transmembrane region alone.

The TM region had previously been refined employing Rosetta tools (Leaver-Fay et al., 2011), providing an opportunity for comparison. Two MDFF simulations were performed, the first with only non-hydrogen sidechain atoms coupled to the density and harmonic restraints holding backbone atoms in the configuration of the reported structure (Liao et al., 2013) and another with all non-hydrogen atoms coupled to the density and the backbone restraints removed.

MDFF characteristics for fitting the isolated TM region of TRPV1 are summarized in Supplementary file 1C. Quality of fit measures, namely EMRinger and iFSC, for the backbone-restrained simulations were lower than those of the Rosetta-derived structure. However, MolProbity scores for the MDFF-derived structures are better than those of Rosetta. Allowing the backbone to be fitted into the map without restraints from the reported structure substantially improved the quality of fit measures so that they are comparable to those of Rosetta’s, while maintaining a lower MolProbity score.

10 RMSF as a local measure of map quality

Per-residue Root mean square fluctuation (RMSF) has been found to be a good local measure of model fit, decreasing in relation to the increasing quality of fit over a cMDFF simulation, as illustrated in Figure 4—figure supplement 1. In addition to being a measure of fit, RMSF values can also be used for the analysis of local map quality. When monitored during MDFF refinement simulations, RMSF values serve as a good measure of local map quality, decreasing linearly with increase in local map resolution, as demonstrated in a number of test cases (Figure 4), including the 20S proteasome (Figure 4—figure supplement 3). In contrast to RMSF, other local measures of fit, such as LCC and EMRinger scores, are not necessarily good indicators of local map quality, as is evident from their poor correlation with local map resolution in Figure 4—figure supplement 2.

11 Cross-validation of MDFF-fitted structures

To demonstrate that the over-fitting does not occur during MDFF refinements, which is also fairly representative of cMDFF and ReMDFF refinements, the reported de novo structures of β-galactosidase and TRPV1 were each directly fitted to two half-maps (labelled 1 and 2) of the corresponding reported EM map, (Bartesaghi et al., 2014) for β-galactosidase and (Liao et al., 2013) for TRPV1. Subsequently, simulated maps were created from the fitted structures using VMD’s MDFF plugin and resolution settings equivalent to the reported maps. In total, there were two simulated maps, also with labels 1 and 2 corresponding to the half-map from which the fits were obtained, for each protein. FSC plots describing the direct comparison of simulated maps with the corresponding half-maps (e.g. simulated map 1 with half-map 1) as well as the cross comparison of simulated maps with the non-corresponding half-maps (e.g. simulated map 1 with half-map 2) were created. Figure 2—figure supplement 8 show the plots for β-galactosidase and TRPV1, respectively. The high degree of similarity between the cross comparisons as well as between cross comparisons and direct comparisons indicate a very low degree of over-fitting. In fact, iFSC values calculated for the plots (see Figures) are practically uniform. EMRinger scores for the same sets of comparisons were also calculated. For β-galactosidase, the EMRinger scores were 3.25 for simulated map 1 against half-map 1, 2.97 for simulated map 2 against half-map 2, 2.92 for simulated map 1 against half-map 2, and 2.81 for simulated map 2 against half-map 1; these numbers are fairly comparable to the EMRinger scores with the full maps as presented in Table 1. For TRPV1 again, the EMRinger scores were 1.43 for all comparisons. The high degree of similarity between EMRinger scores for the different comparisons corroborates the favorable conclusion drawn from the FSC calculations.

References

  1. 1
  2. 2
    New tool: phenix.real_space_refine
    1. PV Afonine
    2. JJ Headd
    3. TC Terwilliger
    4. PD Adams
    (2013)
    Computational Crystallography Newsletter 4:43–44.
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
    Crystallographic Computing 4: Techniques and New Technologies
    1. AT Brünger
    (1988)
    Crystallographic refinement by simulated annealing, Crystallographic Computing 4: Techniques and New Technologies, Oxford, Clarendon Press, 10.1016/0022-2836(88)90211-2.
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
    Molecular Machines in Biology
    1. J Gumbart
    2. E Schreiner
    3. LG Trabuco
    4. K Chan
    5. K Schulten
    (2011)
    142–157, Viewing the mechanisms of translation through the computational microscope, Molecular Machines in Biology, 10.1017/cbo9781139003704.010.
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
    Molecular dynamics: applications to proteins
    1. M Karplus
    (1990)
    In: J.-L Rivail, editors. Modelling of Molecular Structures and Properties, Volume 71 of Studies in Physical and Theoretical Chemistry. Amsterdam: Elsevier Science Publishers. pp. 427–461.
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
    A system for interactive molecular dynamics simulation
    1. JE Stone
    2. J Gullingsrud
    3. P Grayson
    4. K Schulten
    (2001)
    In: JF Hughes , CH Séquin, editors. ACM Symposium on Interactive 3D Graphics. New York: ACM SIGGRAPH. pp. 191–194.
    https://doi.org/10.1145/364338.364398
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
    Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques
    1. L Williams
    (1983)
    1–11, Pyramidal parametrics, Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques, New York, ACM, 10.1145/800059.801126.
  80. 80
  81. 81
  82. 82
  83. 83

Decision letter

  1. Axel T Brunger
    Reviewing Editor; Howard Hughes Medical Institute, Stanford University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Ensemble-based model refinement and validation with Resolution Exchange MDFF for sub-5 Å cryo-electron microscopy maps" for consideration by eLife. Your article has been favorably evaluated by John Kuriyan (Senior editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The novelty of this paper lies in the combination of a variety of tools and concepts, none of which individually are new. Moreover, the authors published work on MDFF fitting of EM maps. However, this new combination (MD-based fitting of EM maps, combined with replica exchange and gradual resolution increase) works very well in the test cases that are presented in this work. The radius of convergence is impressive.

The main application of this combined method is for cases where the initial model is very far from the EM map. However, in practice the applications of their method may be somewhat limited since either (a) the EM map is near atomic resolution (3.5 A or better) allowing ab initio tracing, or (b) the map is lower resolution in which case higher resolution structures of fragments are generally needed in order to have confidence in the final model. The new method is probably most useful in the resolution range 4-7 A where ab initio fitting to the maps is not possible and when high-resolution structures are available of all components, but these components undergo large conformational changes.

Along similar lines, the advantages of cMDFF/ReMDFF over 'direct' MDFF are clearly demonstrated for high-resolution, although it would be helpful to know if there are also advantages of using cMDFF/ReMDFF over direct MDFF at moderate resolutions (lower than 4 A resolution).

Essential revisions:

1) A common practice in the EM field is to perform refinements against half maps to judge the degree over-fitting. Assuming that the authors have access to the EM data for one of the test cases, such over-fitting test should be performed – as shown in DiMaio, et al. (2013) Protein Sci, 22, 865-868 or Amunts, et al. (2014) Science 343, 1485-1489.

2) It was unclear how well the protocols work at the second step – ensuring an accurate model that sufficiently describes the density at the side chain level. In the absence of figures demonstrating the fit of side chains to density (an omission that should be corrected), one of the reviewers compared the provided model coordinates (source data to Table 1) with the deposited structure of β-galactosidase (PDB accession code 3J7H). This revealed substantial problems with the model. There are a large number of Ramachandran outliers (~10%); more than 1% of non-prolines are in cis-configurations (less than 0.05% are expected); the backbone of the region around the galactoside binding site is incorrect, and many of the side chains are not in their correct density. The authors themselves note 'even if the backbone is correctly placed, the MD force fields […] are incapable of providing sidechain geometries consistent with the map'. This may represent a considerable limitation to using cMDFF/ReMDFF to refine atomic models against sub 5 Å EM maps. It would be helpful if the authors could demonstrate that cMDFF/ReMDFF is capable of refining good starting models without disrupting backbone and side chain geometries. This would more accurately reflect the starting model for many of the potential users of cMDFF/ReMDFF. For example, the deposited model for β-galactosidase should be refined with cMDFF and ReMDFF and not just direct MDFF. The manuscript should provide guidance on how to preserve side chain geometries and the prevent cis-peptide bond formation during refinement. This would improve the utility of cMDFF/ReMDFF as a method of fitting a well-refined structure to classes of different conformational states that can range from high to mid resolution – a good test case might be CorA (Matthies et al. 2016. Cell, 164(4):747-756). Further statistics should be provided to demonstrate the quality of the model after cMDFF/ReMDFF. For example, root-mean-square deviations for bond lengths and bond angles and Ramachandran statistics, which are typically minimal requirements when reporting the quality of structural models. These should be added to Table 1.

3) The authors propose per-residue root mean square fluctuation (RMSF) values as a way of specifying the precision of atom positions. These are shown to correlate with local map quality, suggesting they can be used for model-to-map validation. In many ways RMSF values appear to be analogous to atomic B factors. It would be helpful to see a side-by-side comparison of a protein structure colored by local resolution, atomic B factors and RMSF (an expansion of Figure 3). One would expect them to correlate in a way that EMRinger and local correlation coefficient do not (Figure 4—figure supplement 2). What are the advantages of RMSF over B factors, which are a more established measure? Do the authors actually recommend depositing ensembles?

4) It is unclear what advantages are to be gained by utilizing RMSF values to determine the B factor for sharpening the map (which involves multiple simulations) compared to Guinier analysis performed during the standard post-processing procedure when both provide the same, or similar, values (Figure 5—figure supplement 2). Could RMSF values be calculated for different parts of the model to generate locally sharpened maps that subsequently improve refinement (although one suspects that applying a mask during post-processing would result in a similar effect)?

5) There is no relation of the presented method with "phase extension" approaches in crystallography where phase experimental phase information is generally much less accurate than amplitude information, and phase extension method make use of the inherently higher accuracy of the amplitude information along with approaches such as solvent flattening of map averaging. The cMDFF and ReMDFF approaches do not actually improve the EM map, they just use filtered less rugged density maps for better convergence of the global conformational changes. Please avoid this analogy.

6) The title is a bit misleading, reading "Ensemble-based refinement…". One might expect an ensemble refinement method, which this is not. Similar with the expression "ensemble-based flexible fitting" (end of Discussion): the flexible fitting itself is not "ensemble-based". Ensemble-based here simply refers to the calculation of RMSF value from the trajectory.

7) Figure 5: The plot describes map sharpening. Sharpening implies applying a negative B-factor, therefore one expects negative B-factor values on the x-axis.

8) This sentence – "well within the structural uncertainty represented by the 3 Å resolution limit of the crystallographic data" –, in the Results, is a bit misleading, since it sounds as if the structural uncertainty for 3 Angstroms data is 3 Å, which is not the case but unfortunately a common misconception. The accuracy of atomic positions is usually considerably higher than the limiting resolution of a crystal. Please remove or reword this sentence.

9) The authors should discuss in more detail the limitations and potential applications of their method as suggested in the summary above.

10) Overall, the manuscript should be edited to remove repetition (cloud computing is described as being cost-effective four times in the paper) and inaccuracies. The Materials and methods section is almost indistinguishable in style from the Results section and could be shortened considerably, with some of the more insightful descriptions moved to the Results (for example the first paragraph of the subsection “Fluctuation Analysis”). Additionally, many of the reference choices seem unnecessary. For example, in the first paragraph of the Introduction, describing how cryo-EM has evolved, the authors cite papers on NMR methods for complexes over 20 kDa (Clore and Gronenborn), membrane proteins by XFEL (Neutze et al.) and ribosome structures at mid-resolution (the papers by Rawat et al.). It would be better to cite one of the recent reviews on cryo EM, for example 'Cheng, Y. 2015. Single-particle cryo-EM at Crystallographic resolution. Cell, 161(3):450-7.

11) All starting and refined models should be provided as supplementary data files.

https://doi.org/10.7554/eLife.16105.056

Author response

Essential revisions:

1) A common practice in the EM field is to perform refinements against half maps to judge the degree over-fitting. Assuming that the authors have access to the EM data for one of the test cases, such over-fitting test should be performed – as shown in DiMaio, et al. (2013) Protein Sci, 22, 865-868 or Amunts, et al. (2014) Science 343, 1485-1489.

We have performed half-map cross-validation analyses for our β-galactosidase and TRPV1 test systems, to demonstrate that little to no over-fitting has occurred. We report the details of these analyses in Supplementary file 1–section 11 and Figure 2—figure supplement 8.

2) It was unclear how well the protocols work at the second step – ensuring an accurate model that sufficiently describes the density at the side chain level. In the absence of figures demonstrating the fit of side chains to density (an omission that should be corrected), one of the reviewers compared the provided model coordinates (source data to Table 1) with the deposited structure of β-galactosidase (PDB accession code 3J7H).This revealed substantial problems with the model. There are a large number of Ramachandran outliers (~10%); more than 1% of non-prolines are in cis-configurations (less than 0.05% are expected); the backbone of the region around the galactoside binding site is incorrect, and many of the side chains are not in their correct density. The authors themselves note 'even if the backbone is correctly placed, the MD force fields […] are incapable of providing sidechain geometries consistent with the map'. This may represent a considerable limitation to using cMDFF/ReMDFF to refine atomic models against sub 5 Å EM maps. It would be helpful if the authors could demonstrate that cMDFF/ReMDFF is capable of refining good starting models without disrupting backbone and side chain geometries. This would more accurately reflect the starting model for many of the potential users of cMDFF/ReMDFF. For example, the deposited model for β-galactosidase should be refined with cMDFF and ReMDFF and not just direct MDFF.

We concur with the reviewer that the previously submitted refined structure of β-galactosidase suffered from unfavorable structural quality statistics. In fact, several of the model validation criteria, including the EMRinger scores that evaluate detailed side-chain conformations seem quite insensitive to the local structural discrepancies. Fortunately, a closer analysis of the MolProbity statistics revealed that the structural defects in our reported model of β-galactosidase were inherited from a poor choice of starting structure. We addressed this issue at the outset of this letter (see first key improvement).

As noted by the reviewer, starting structures in actual use cases tend to be more structurally sound than the one used in the original submission, in particular with regards to Ramachandran outliers. Consequently, we have performed a new set of simulations of β-galactosidase using a better starting structure, i.e., one with fewer outliers, while still maintaining an RMSD of >7 Å w.r.t. the target structure to demonstrate a high radius of convergence of cMDFF and ReMDFF (subsection “Refinement of β-galactosidase”). The resulting fitted structures exhibit favorable structural statistics, demonstrating little to no disruption of backbone and sidechain geometries. Figure 2—figure supplement 7 is newly added to show examples of proper sidechain fitting within the density map.

The manuscript should provide guidance on how to preserve side chain geometries and the prevent cis-peptide bond formation during refinement.

The new set of refinements (Tables 1 and 2), together with the one starting with the poor initial search model (Supplementary files 2E and 2F) and another one starting with the reported model (PDB: 3J7H, Table 1 (refined de novo)) allow a systematic analysis of the quality of MDFF out- put as a function of β-galactosidase starting models (subsection “Refinement of β-galactosidase”). This analysis provides a guideline for the choice of initial models with MDFF refinements (in the aforementioned subsection).

This would improve the utility of cMDFF/ReMDFF as a method of fitting a well-refined structure to classes of different conformational states that can range from high to mid resolution – a good test case might be CorA (Matthies et al. 2016. Cell, 164(4):747-756).

As noted above, cMDFF/ReMDFF refinements have now been repeated with the reported β- galactosidase (3J7H).

We believe that the set of test systems employed now adequately showcases the advances made in MDFF. TRPV1, in particular, represents a map featuring a diverse range of local resolutions (3.4 Å to ~ 7 Å), which challenges the model refinement capabilities of MDFF across high to mid resolutions.

Nonetheless, it is noted in the Discussion section that MDFF provides the perfect tool to model structures within maps that capture a conformational transition, as is the case with the Mg2+ transporter reported by Matthies et al. A study of conformational transitions involving high- resolution EM data requires a separate publication, and is thus avoided in the current manuscript.

Further statistics should be provided to demonstrate the quality of the model after cMDFF/ReMDFF. For example, root-mean-square deviations for bond lengths and bond angles and Ramachandran statistics, which are typically minimal requirements when reporting the quality of structural models. These should be added to Table 1.

Table 1 has been updated to reflect the new results, and two new tables, Table 2 and Supplementary file 2B, have been added to report the structural statistics, including RMS deviations for bond lengths and angles, and Ramachandran statistics for both β-galactosidase and TRPV1.

3) The authors propose per-residue root mean square fluctuation (RMSF) values as a way of specifying the precision of atom positions. These are shown to correlate with local map quality, suggesting they can be used for model-to-map validation. In many ways RMSF values appear to be analogous to atomic B factors. It would be helpful to see a side-by-side comparison of a protein structure colored by local resolution, atomic B factors and RMSF (an expansion of Figure 3). One would expect them to correlate in a way that EMRinger and local correlation coefficient do not (Figure 4—figure supplement 2). What are the advantages of RMSF over B factors, which are a more established measure? Do the authors actually recommend depositing ensembles?

Figure 3 is now updated to also include structures colored by the square of RMSF and local B- factors. The spatial variations in local-resolution, RMSF2 and atomic B-factors show excellent qualitative agreement. Furthermore, utilizing the well-established relationship between B-factors and RMSF, B=8π 2/3(RMSF2), B-factors can be computed from MDFF refinements. The MDFF derived B-factors are shown to be in good agreement with those reported for β-galactosidase (Figure 5—figure supplement 4). An additional advantage of monitoring RMSF during MDFF refinements is that it not only indicates the quality-of-map/model (subsection “RMSF and Quality of Map”), but also the quality-of-fit (subsection “RMSF and Quality of Map”). More advantages of computing RMSF are provided in the response to the next question.

It is indeed advisable to submit a representative ensemble of structures rather than a single one. The rationale for supporting an ensemble view is the following. A functional biomolecule is not restricted to a single structure. It rather represents a distribution of dynamical structures all of which contribute to the well-defined function. An EM map is a manifestation of this distribution of structures, more so because unlike crystallography, it captures the biomolecules in a native- like environment. Consequently, reporting a single structure does not suffice to elucidate the conformational diversity that the full distribution of functionally relevant structures manifest. We report an ensemble of 12 structures that represent the conformational diversity of β-galactosidase (Figure 5—source data 1). Even though the statistics of the contributing structures are comparable, the flexibility of the loop regions over the structured regions is clearly observed; such information is obscured if only a single model is submitted.

4) It is unclear what advantages are to be gained by utilizing RMSF values to determine the B factor for sharpening the map (which involves multiple simulations) compared to Guinier analysis performed during the standard post-processing procedure when both provide the same, or similar, values (Figure 5—figure supplement 2).

Overall B-factors of cryo-EM maps are typically calculated by Guinier analysis for maps with resolutions better than ~10 Å (Rosenthal and Henderson, 2003). Using a mask around specific regions of interest, estimation of local B-factors for different parts of a map is also possible. However, this method is limited to large domains of macromolecular complexes due to problems associated with tight masking of cryo-EM maps. For maps with resolutions better than ~3 Å, local B-factors could be estimated and refined in X-ray crystallography programs. However, most high-resolution cryo-EM maps have resolutions between ~3 to 5 Å. In this resolution range, B- factors could be estimated from MDFF and used as prior information to improve model building and refinement. For resolutions better than ~3 Å, MDFF-derived values may serve as initial estimates of B-factors to be further refined by crystallography programs. These points are now made in the last paragraph of the subsection “RMSF and per-residue B-factors”.

Could RMSF values be calculated for different parts of the model to generate locally sharpened maps that subsequently improve refinement (although one suspects that applying a mask during post-processing would result in a similar effect)?

Yes, domain-based sharpening is now demonstrated with RMSF computations for the soluble and transmembrane domains of TRPV1 (Figure 5—figure supplement 1).

5) There is no relation of the presented method with "phase extension" approaches in crystallography where phase experimental phase information is generally much less accurate than amplitude information, and phase extension method make use of the inherently higher accuracy of the amplitude information along with approaches such as solvent flattening of map averaging. The cMDFF and ReMDFF approaches do not actually improve the EM map, they just use filtered less rugged density maps for better convergence of the global conformational changes. Please avoid this analogy.

The analogy is now avoided.

6) The title is a bit misleading, reading "Ensemble-based refinement…". One might expect an ensemble refinement method, which this is not. Similar with the expression "ensemble-based flexible fitting" (end of Discussion): the flexible fitting itself is not "ensemble-based". Ensemble-based here simply refers to the calculation of RMSF value from the trajectory.

The title of the manuscript has been changed to “Molecular dynamics-based model refinement and validation for sub-5 Å cryo-electron microscopy maps”. The expression in the Discussion referred to by the reviewer no longer contains the term “ensemble-based”.

7) Figure 5: The plot describes map sharpening. Sharpening implies applying a negative B-factor, therefore one expects negative B-factor values on the x-axis.

We have updated Figure 5 and its figure supplements 1 and 2 to reflect negative B-factors.

8) This sentence – "well within the structural uncertainty represented by the 3 Å resolution limit of the crystallographic data" –, in the Results, is a bit misleading, since it sounds as if the structural uncertainty for 3 Angstroms data is 3 Å, which is not the case but unfortunately a common misconception. The accuracy of atomic positions is usually considerably higher than the limiting resolution of a crystal. Please remove or reword this sentence.

We agree; the sentence is now removed.

9) The authors should discuss in more detail the limitations and potential applications of their method as suggested in the summary above.

Relevant points in terms of the choice of search model, B-factor determination and complementarity with de novomodel building methods have now been added in subsections “Refinement of β-galactosidase”, “RMSF and per-residue B-factors” and last paragraph of the Discussion respectively.

10) Overall, the manuscript should be edited to remove repetition (cloud computing is described as being cost-effective four times in the paper) and inaccuracies. The Materials and methods section is almost indistinguishable in style from the Results section and could be shortened considerably, with some of the more insightful descriptions moved to the Results (for example the first paragraph of the subsection “Fluctuation Analysis”). Additionally, many of the reference choices seem unnecessary. For example, in the first paragraph of the Introduction, describing how cryo-EM has evolved, the authors cite papers on NMR methods for complexes over 20 kDa (Clore and Gronenborn), membrane proteins by XFEL (Neutze et al) and ribosome structures at mid-resolution (the papers by Rawat et al). It would be better to cite one of the recent reviews on cryo EM, for example 'Cheng, Y. 2015. Single-particle cryo-EM at Crystallographic resolution. Cell, 161(3):450-7.

Repetition in the revised manuscript has been reduced and overlapping paragraphs in the methods with the results have now been removed entirely or moved into the results. Unnecessary references have been removed in favor of the suggested review article.

11) All starting and refined models should be provided as supplementary data files.

All models are provided with this resubmission.

https://doi.org/10.7554/eLife.16105.057

Article and author information

Author details

  1. Abhishek Singharoy

    Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, United States
    Contribution
    AS, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Contributed equally with
    Ivan Teo and Ryan McGreevy
    For correspondence
    singharo@illinois.edu
    Competing interests
    The authors declare that no competing interests exist.
  2. Ivan Teo

    1. Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, United States
    2. Department of Physics, University of Illinois at Urbana-Champaign, Urbana, United States
    Contribution
    IT, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Contributed equally with
    Abhishek Singharoy and Ryan McGreevy
    Competing interests
    The authors declare that no competing interests exist.
  3. Ryan McGreevy

    Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, United States
    Contribution
    RM, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Contributed equally with
    Abhishek Singharoy and Ivan Teo
    Competing interests
    The authors declare that no competing interests exist.
  4. John E Stone

    Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, United States
    Contribution
    JES, Conception and design, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  5. Jianhua Zhao

    Department of Biochemistry and Biophysics, University of California San Francisco School of Medicine, San Francisco, United States
    Contribution
    JZ, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  6. Klaus Schulten

    1. Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, United States
    2. Department of Physics, University of Illinois at Urbana-Champaign, Urbana, United States
    Contribution
    KS, Computational simulation technologies, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    kschulte@ks.uiuc.edu
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon 0000-0001-7192-9632

Funding

National Institutes of Health (9P41GM104601)

  • Abhishek Singharoy
  • Ivan Teo
  • Ryan McGreevy
  • John E Stone
  • Klaus Schulten

National Science Foundation (MCA93S028)

  • Abhishek Singharoy
  • Ivan Teo
  • Ryan McGreevy
  • John E Stone
  • Klaus Schulten

Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana-Champaign

  • Abhishek Singharoy

National Institutes of Health (5R01GM098243-02)

  • Ryan McGreevy
  • John E Stone
  • Klaus Schulten

National Institutes of Health (U54GM087519)

  • Klaus Schulten

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work is supported by grants NIH 9P41GM104601, NIH 5R01GM098243-02, and NIH U54GM087519 from the National Institutes of Health. The authors gratefully acknowledge computer time provided by the NSF-funded Extreme Science and Engineering Discovery Environment (XSEDE) MCA93S028, Sriram Subramanian for extremely insightful discussions on the high-resolution cryo-EM methods in general, and the β-galactosidase study in particular, and Benjamin Barad from the Fraser group at UCSF for guidance on the use of the EMRinger program. The authors also acknowledge the Beckman Postdoctoral Fellowship program for supporting A Singharoy.

Reviewing Editor

  1. Axel T Brunger, Reviewing Editor, Howard Hughes Medical Institute, Stanford University, United States

Publication history

  1. Received: March 16, 2016
  2. Accepted: July 6, 2016
  3. Accepted Manuscript published: July 7, 2016 (version 1)
  4. Accepted Manuscript updated: July 8, 2016 (version 2)
  5. Version of Record published: August 18, 2016 (version 3)
  6. Version of Record updated: October 7, 2016 (version 4)

Copyright

© 2016, Singharoy et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,972
    Page views
  • 802
    Downloads
  • 10
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Comments

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Neuroscience
    Uday K Jagadisan, Neeraj J Gandhi
    Research Article Updated