1. Biophysics and Structural Biology

Low cost, high performance processing of single particle cryo-electron microscopy data in the cloud

  1. Michael A Cianfrocco (corresponding author)
  2. Andres E Leschziner
  1. Harvard University, United States
  2. Harvard Medical School, United States
Tools and Resources
Cite as: eLife 2015;4:e06664 doi: 10.7554/eLife.06664

Abstract

The advent of a new generation of electron microscopes and direct electron detectors has realized the potential of single particle cryo-electron microscopy (cryo-EM) as a technique to generate high-resolution structures. Calculating these structures requires high performance computing clusters, a resource that may be limiting to many likely cryo-EM users. To address this limitation and facilitate the spread of cryo-EM, we developed a publicly available ‘off-the-shelf’ computing environment on Amazon's elastic cloud computing infrastructure. This environment provides users with single particle cryo-EM software packages and the ability to create computing clusters with 16–480+ CPUs. We tested our computing environment using a publicly available 80S yeast ribosome dataset and estimate that laboratories could determine high-resolution cryo-EM structures for $50 to $1500 per structure within a timeframe comparable to local clusters. Our analysis shows that Amazon's cloud computing environment may offer a viable computing environment for cryo-EM.

https://doi.org/10.7554/eLife.06664.001

eLife digest

Microscopes can be used to view objects or structural details that are not visible with the naked eye. A type of microscope called an electron microscope—which uses beams of particles called electrons—is particularly useful for examining tiny objects or structures because it can produce images with a higher level of detail than microscopes that use light.

There are several ways to prepare biological samples for electron microscopy. One technique is called cryo-electron microscopy, or cryo-EM for short, where the sample is rapidly frozen and then viewed under the electron microscope. Using this technique it is possible to produce highly detailed images of viruses, individual compartments within cells and even single proteins.

To convert the images of proteins into three-dimensional models, high-performing clusters of computers are required. It can be difficult and expensive for many scientists to access these resources, which may limit the wider use of cryo-EM in research.

To address this problem and aid the spread of cryo-EM, Cianfrocco and Leschziner developed a publicly available ‘off the shelf’ system on Amazon's elastic cloud computing infrastructure. This provides users with software packages and the ability to create a cluster containing up to around 480 computers to analyze cryo-EM data.

Cianfrocco and Leschziner tested the system using a publicly available cryo-EM dataset of a structure in yeast cells called the 80S ribosome, which contains proteins and molecules of ribonucleic acid. This revealed that a highly detailed model of the 80S ribosome could be developed in a time frame similar to what it would have taken on a local high-performing computing cluster within a university. The cost of using this system was also competitive in price with that of maintaining a local computing cluster, with the added flexibility of its ‘pay-as-you-go’ structure.

These findings show that Amazon's cloud computing infrastructure may be a useful alternative to using clusters of computers based within a research institute or university. This will help the spread of cryo-EM as a general tool to reveal the three-dimensional structures of large molecules. Further work is required to make this cloud-based computing tool easily accessible to researchers who may have limited experience with using Linux software and computing clusters.

https://doi.org/10.7554/eLife.06664.002

Introduction

Cryo-electron microscopy (cryo-EM) has long served as an important tool to provide structural insights into biological samples. Recent advances in cryo-EM data collection and analysis, however, have transformed single particle cryo-EM (Kuhlbrandt, 2014; Bai et al., 2015), allowing it to achieve resolutions better than 5 Å for samples ranging in molecular weight from the 4 MDa eukaryotic ribosome (Bai et al., 2013) to the 170 kDa membrane protein γ-secretase (Lu et al., 2014). These high-resolution structures are the result of a new generation of cameras that detect electrons directly without the need of a scintillator, which results in a dramatic increase in the signal-to-noise ratio relative to CCD cameras, the previous most commonly used device (McMullan et al., 2009). In addition to direct electron detection, the high frame rate of these cameras allows each image to be recorded as a ‘movie’, dividing it into multiple frames. These fractionated images can be used to correct for sample movement during the exposure, further increasing the quality of the cryo-EM images (Campbell et al., 2012; Li et al., 2013; Scheres, 2014).

In addition to these technological developments in the detectors, improvements in computer software packages have played an equally important role in moving cryo-EM into the high-resolution era. Atomic or near-atomic structures have been obtained with software packages such as EMAN2 (Tang et al., 2007), Sparx (Hohn et al., 2007), FREALIGN (Grigorieff, 2007), Spider (Frank et al., 1996), and Relion (Scheres, 2012, 2014). In general, obtaining these structures involved computational approaches that sorted the data into homogeneous classes that could then be refined to high resolution.

While these advances in microscopy and analysis have been essential for the recent breakthroughs in cryo-EM, their implementation is computationally intensive and requires high-performance computing clusters. A recent survey of high-resolution single particle cryo-EM structures showed that refinement of these structures required processing times in excess of 1000 CPU-hours (Scheres, 2014). Therefore, computational time (i.e., access to high-performance clusters) may represent a bottleneck to determining high-resolution structures by single particle cryo-EM.

In order to address this limitation, we explored the possibility of using Amazon's elastic cloud computing (EC2) for processing cryo-EM data. To help others take advantage of this resource, we have created a publicly available ‘off-the-shelf’ software environment that allows new users to start up a cluster of Amazon CPUs preinstalled with cryo-EM software and we have used it to test the performance of Amazon's EC2 platform. We were able to determine a 4.6 Å structure of the 80S ribosome using a published dataset (Bai et al., 2013) for an overall cost of $100 USD within a timeframe comparable to that of a local cluster. Given the range of prices for accessing Amazon CPUs (users can bid for significantly reduced costs) and the accessibility statistics, we estimate that typical cryo-EM structures can be determined for $50–$1500 per structure.

EC2 through Amazon Web Services (AWS)

AWS is a division of Amazon that offers a variety of cloud-based solutions for website hosting and high-performance computing, amongst other services. Many different types of privately held companies take advantage of Amazon's computing infrastructure because of its affordability, flexibility, and security. Of note, global biotechnology companies such as Novartis (AWS, 2014a), Bristol-Myers-Squibb (AWS, 2013), and Pfizer (AWS, 2014b) have utilized the computing power of Amazon for scientific data processing. Many academic researchers have also begun to use Amazon's EC2 resources for analyzing datasets from super-resolution light microscopy (Hu et al., 2013), genomics (Krampis et al., 2012; Yazar et al., 2014), and proteomics (Mohammed et al., 2012; Trudgian and Mirzaei, 2012).

The overall workflow starts with users logging into a virtual machine (‘instance’) on AWS (Figure 1). AWS offers a variety of instance types that have been configured for different computing tasks. For example, instances have been optimized for computing performance, GPU-based calculations, or memory-intensive calculations. After logging onto an instance, storage drives are mounted onto it, allowing data, which can be encrypted for security, to be transferred onto the storage drives (Figure 1).

Workflow for analyzing cryo-EM data on Amazon's cloud computing infrastructure.

After collecting cryo-EM data (Step 1), particles are extracted from the micrographs and prepared for further analysis (Step 2). After logging into an ‘instance’ (Step 3), data are uploaded to a storage server (elastic block storage) (Step 4). At this point, STARcluster can be configured to launch a cluster of 2–30 instances that is mounted with the data from the storage volume (Step 5). A detailed protocol can be found at an accompanying Google site: http://goo.gl/AIwZJz.

https://doi.org/10.7554/eLife.06664.003

While users can utilize a single instance for calculations, the maximum number of CPU cores per instance is 18. Therefore, creating a computing cluster with a larger number of CPUs on AWS requires additional steps. The Software Tools for Academics and Researchers (STAR) group at the Massachusetts Institute of Technology developed a straightforward package that allows users to group individual AWS instances into a cluster. The STARcluster program is a Python-based, open-source package that automatically creates a cluster preconfigured with the software needed to manage a computer cluster (Ivica et al., 2009). This package allows users to specify the number of instances to be included in the cluster as well as the instance type. By taking advantage of this tool, private clusters can be built with sizes ranging from 16 to 480 CPUs (Figure 1).

Global availability of spot instances on Amazon EC2

While Amazon provides dedicated access to instances through ‘on-demand’ reservations, there are ‘spot instances’ that are 80–90% cheaper than the on-demand price. Spot instances are unused instances within Amazon EC2 that are open for competitive bidding, where users gain access to them by making offers above the current minimum bid. This means that while the on-demand rate for high-memory, 16-CPU instances (called ‘r3.8xlarge’) is $2.80/hr, spot instance prices can be as low as $0.25–$0.35/hr.

In order to determine if spot instances offer a consistent reduction in price, we analyzed the global availability of r3.8xlarge spot instances. Currently, Amazon has 9 regions worldwide within 7 countries: US-East-1 (United States), US-West-1 (United States), US-West-2 (United States), SA-East-1 (Brazil), EU-Central-1 (Germany), EU-West-1 (Ireland), AP-Northeast-1 (Japan), AP-Southeast-1 (Singapore), and AP-Southeast-2 (Australia). For each region, we retrieved spot instance prices for r3.8xlarge instances over the past 3 months and analyzed the time they spent below prices ranging from $0.35/hr to $0.65/hr (corresponding to discounts of 87.5% and 76.8%, respectively, relative to the full on-demand rate of $2.80/hr) (Figure 2 and Figure 2—figure supplement 1). This analysis revealed that, globally, spot prices for r3.8xlarge instances were below $0.35/hr, or 12.5% of the on-demand price, 49.8% of the time (Figure 2). At $0.65/hr, 76.8% below the on-demand price, one could access r3.8xlarge spot instances 82.2% of the time. These data indicate that spot instances provide dependable, cost-effective access to Amazon's computing resources.
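The discounts quoted above follow directly from the ratio of the spot price to the on-demand rate. A minimal sketch of that arithmetic is given here; the function name `discount_percent` is illustrative only and does not come from the authors' scripts.

```python
# Illustrative arithmetic only; the prices are the ones quoted in the text.
ON_DEMAND_RATE = 2.80  # USD per hour for an r3.8xlarge instance


def discount_percent(spot_price, on_demand=ON_DEMAND_RATE):
    """Return the percentage saved relative to the on-demand rate."""
    return 100.0 * (1.0 - spot_price / on_demand)


for price in (0.35, 0.65):
    print(f"${price:.2f}/hr is {discount_percent(price):.1f}% below on-demand")
```

The same function can be applied to any other bid price a user is considering, provided the on-demand rate is updated to the current value.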

Figure 2 (with 1 supplement)
Global availability of Amazon r3.8xlarge spot instances.

Shown is the average percentage of time that the r3.8xlarge instance type spent with its current spot instance price below the queried price. The data are averaged over all of Amazon's regions worldwide (except for SA-East-1, which does not offer r3.8xlarge instances). Spot instance prices were collected over a 90-day period from 1 January 2015 to 1 April 2015; the average is shown ± the s.e. Source data: Figure 2—source data 1.

https://doi.org/10.7554/eLife.06664.004

Performance analysis of Amazon's EC2 environment with an 80S yeast ribosome dataset

To test the performance of Amazon's EC2 environment, we analyzed a previously published 80S Saccharomyces cerevisiae ribosome dataset (Bai et al., 2013) (EMPIAR 10002) on a 128 CPU cluster (8 × 16 CPUs; using the r3.8xlarge instance). After extracting 62,022 particles, we performed 2D classification within Relion. Subsequent 3D classification of the particles into four classes revealed that two classes adopted a similar structural state. We merged those two classes and used the associated particles to carry out a 3D refinement in Relion—we were able to obtain a structure with an overall resolution of 4.6 Å (Figure 3A–C).

Cryo-EM structure of 80S ribosome at an overall resolution of 4.6 Å.

(A) Overall view of 80S reconstruction filtered to 4.6 Å while applying a negative B-factor of −116 Å². (B) Gold standard FSC curve. (C) Selected regions from the 60S subunit. Cryo-EM maps were visualized with UCSF Chimera (Pettersen et al., 2004). Source data: Dryad Digital Repository dataset (http://datadryad.org/review?doi=doi:10.5061/dryad.9mb54) (Cianfrocco and Leschziner).

https://doi.org/10.7554/eLife.06664.007

This structure, whose generation included particle picking, CTF estimation, 2D and 3D classification, and refinement, cost us $99.64 on Amazon's EC2 environment. This cost was achieved by bidding on spot instances for particle picking (m1.small at $0.02/hr), 2D classification (STARcluster of r3.8xlarge instances at $0.65/hr), and 3D classification and refinement (STARcluster of r3.8xlarge instances at $0.65/hr). Thus, even though obtaining this structure required 1266 total CPU-hours, Amazon's EC2 computing infrastructure provided the necessary resources to calculate it to near-atomic resolution at a reasonable price.

To further test the performance of Amazon instances, we carried out 3D classification and refinement on a variety of STARcluster configurations using Relion. As before, we ran our tests on clusters of r3.8xlarge high-memory instances (256 GiB RAM and 16 CPUs per instance). Comparing performance across cluster sizes showed that 256 CPUs had the fastest overall time and the highest speedup relative to a single CPU for both 3D classification and refinement (Figure 4A,B). However, cluster sizes of 128 and 64 CPUs were the most cost effective for 3D classification and refinement, respectively, as these were the cluster configurations where the speedup per dollar reached a maximum (Figure 4C). Importantly, the average time required to boot up these STARclusters was ≤ 10 min for all cluster sizes (Figure 4D) and, once booted up, the clusters do not have any associated job wait times. Therefore, these tests showed that Amazon's EC2 infrastructure was amenable to the analysis of single particle cryo-EM data using Relion over a range of STARcluster sizes.

Relion performance on STARcluster configurations of Amazon instances.

(A) Processing times (minutes) for Relion to perform 3D Classification or 3D refinement on 80S ribosome dataset. (B) Speedup for each cluster size relative to a single CPU (black line) shown alongside performance estimate for a perfectly parallel cluster using Amdahl's Law (curve labeled ‘Theoretical limit’). For cluster sizes ≤ 64 CPUs, Relion exhibits near-perfect performance on STARcluster configurations, while cluster sizes > 64 show that Relion's performance reaches a maximum at 256 CPUs for both 3D classification and 3D refinement. (C) Speedup/Cost is plotted against cluster size, where Speedup/Cost is defined as the speedup observed divided by the cost associated with Amazon's pricing at $0.35/hr/16 CPUs. (D) Average STARcluster boot up time (± s.d.) was measured for clusters of increasing size (n = 5). Source data: Figure 4—source data 1.

https://doi.org/10.7554/eLife.06664.008

From our analysis of the 80S yeast ribosome, we extrapolated the processing times and combined them with previously published 3D refinement times to estimate typical costs on Amazon's EC2. First, we estimated the cost for 3D refinement in Relion for previously published structures (Supplementary file 2A)—these calculated costs ranged from $12.65 to $379.03 per structure, depending on the spot instance price and required CPU-hours. We then combined these data with conservative estimates for particle picking, CTF estimation, particle extraction, 2D and 3D classification to predict the overall cost of structure determination on Amazon's EC2 (Supplementary file 2B). From these considerations, we estimated that published structures could be determined using Amazon's EC2 environment at costs of $50–$1500 per structure (Supplementary file 2B).
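The relationship between CPU-hours, cluster size, and spot price that underlies such estimates can be sketched as follows. This is an idealized illustration, not the authors' calculation: it assumes perfect scaling, ignores boot time, idle time, storage charges, and billing granularity, and the function name `estimate_cost` is hypothetical. Actual bills will therefore tend to be higher than this lower bound.

```python
import math

CPUS_PER_INSTANCE = 16  # physical cores per r3.8xlarge instance (32 vCPUs)


def estimate_cost(cpu_hours, cluster_cpus, spot_price_per_instance_hour):
    """Return (wall-clock hours, USD cost) for a job, assuming perfect scaling.

    cpu_hours: total work, in physical-core hours
    cluster_cpus: number of physical cores in the cluster
    spot_price_per_instance_hour: USD per hour for one 16-CPU instance
    """
    instances = math.ceil(cluster_cpus / CPUS_PER_INSTANCE)
    wall_hours = cpu_hours / cluster_cpus
    cost = wall_hours * instances * spot_price_per_instance_hour
    return wall_hours, cost


# Hypothetical example: a 1000 CPU-hour refinement on a 128-CPU cluster
# (8 instances) at a spot price of $0.35 per instance-hour.
hours, usd = estimate_cost(1000, 128, 0.35)
print(f"{hours:.2f} h wall-clock, ${usd:.2f}")
```

Under these assumptions the cost depends only on the total CPU-hours and the spot price, so cluster size changes the turnaround time rather than the bill; in practice, imperfect scaling makes larger clusters more expensive per unit of work.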

EM-packages-in-the-Cloud: a pre-configured software environment for single-particle cryo-EM image analysis

Given the success we had in analyzing cryo-EM data on Amazon's EC2 at an affordable price and within a reasonable timeframe, we have made our software environment publicly available as an ‘Amazon Machine Image’ (AMI), under the name ‘EM-packages-in-the-Cloud-v3.93.’ The EM-packages-in-the-Cloud-v3.93 AMI provides the software environment necessary for analyzing data on a single instance, and is preconfigured with STARcluster software. The EM-packages-in-the-Cloud-v3.93 AMI has the following cryo-EM software packages installed: Relion (Scheres, 2012, 2014), FREALIGN (Grigorieff, 2007), EMAN2 (Tang et al., 2007), Sparx (Hohn et al., 2007), Spider (Frank et al., 1996), EMAN (Ludtke et al., 1999), and XMIPP (Sorzano et al., 2004). In addition to this AMI that is capable of running on a single instance, we have also made available a second AMI—EM-packages-in-the-Cloud-Node-v3.1—that provides users with the same software packages as described above, but can set up and run within a cluster of multiple EC2 instances. These two publicly available AMIs allow users to boot up a cluster to analyze cryo-EM data in a few short steps. The protocols describing this can be found as a PDF (Supplementary file 1) or on a Google site that is being launched in conjunction with this article: http://goo.gl/AIwZJz. In addition to detailed instructions, the site includes a help forum to facilitate a conversation on cloud computing for single particle cryo-EM.

Cloud computing as a tool to facilitate high-resolution cryo-EM

Recent advances in single particle cryo-EM have drawn the interest of the broader scientific community. In addition to technical advances in electron optics, the new direct electron detectors and data analysis software have dramatically improved the resolutions that can be achieved for a variety of structural targets. In contrast to the other high-resolution techniques (X-ray crystallography, NMR), structure determination by cryo-EM is extremely computationally intensive. The publicly available ‘EM-packages-in-the-Cloud’ environment we have presented and characterized here will help remove some of the limitations imposed by these computational requirements.

We believe that cloud-based approaches have the potential to impact the future of cryo-EM image processing on two fronts: (1) new cryo-EM users or laboratories will have immediate access to a high performance cluster, and (2) existing labs may use this resource to increase their productivity. As the number of laboratories using cryo-EM increases, and as existing laboratories begin to pursue high-resolution cryo-EM, gaining immediate access to a high performance cluster may become difficult. For instance, while there are government-funded high performance clusters in the United States (e.g., XSEDE STAMPEDE), it may take up to a month for a user application to be reviewed (Rogelio Hernandez-Lopez, personal communication). Assuming that the application is approved, these clusters may not have appropriate software installed, which further delays data processing. Finally, the user will have a set limit for the number of CPU hours available per project, requiring a new application to be submitted to access the cluster again. All of these problems can be circumvented by using Amazon's EC2 infrastructure, which provides immediate, cost-effective access to hundreds of CPUs with no geographic restrictions.

The power of cloud-based solutions to alleviate the computational burden associated with cryo-EM data processing stems from their high degree of scalability and reasonable cost. By minimizing computational time and increasing global accessibility, high-performance cloud computing may help usher in the era when high-resolution cryo-EM becomes a routine structural biology tool.

Materials and methods

Global availability of spot instances

Global spot instance prices were retrieved for the 90-day period from 1 January 2015 to 1 April 2015 using the Amazon Command Line Tools command ec2-describe-spot-price-history. Retrieval of spot instance prices for all regions was implemented automatically in a custom Python program, get_spot_histories_all_regions_all_zones.py. From these spot instance prices, the percentage of time spent below given prices was calculated using measure_time_at_spotPrice.py as the cumulative time that spot instances spent below a given price divided by the total time (90 days). Both programs can be found in the Github repository mcianfrocco/Cianfrocco-and-Leschziner-EMCloudProcessing.
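The percentage-time calculation described above can be illustrated with a short sketch. This is not the code from measure_time_at_spotPrice.py; the function name `fraction_time_below`, the input layout (a time-sorted list of price-change records), and the example prices are all assumptions made for illustration. Each recorded price is assumed to hold until the next record.

```python
from datetime import datetime


def fraction_time_below(history, threshold, start, end):
    """Fraction of the window [start, end] spent at a price below threshold.

    history: list of (datetime, price) tuples sorted by time; each price is
    assumed to stay in effect until the next record (or until `end`).
    """
    below_seconds = 0.0
    for i, (timestamp, price) in enumerate(history):
        seg_start = max(timestamp, start)
        next_change = history[i + 1][0] if i + 1 < len(history) else end
        seg_end = min(next_change, end)
        if seg_end > seg_start and price < threshold:
            below_seconds += (seg_end - seg_start).total_seconds()
    return below_seconds / (end - start).total_seconds()


# Hypothetical 4-day price history for one availability zone.
history = [
    (datetime(2015, 1, 1), 0.30),
    (datetime(2015, 1, 2), 0.80),
    (datetime(2015, 1, 4), 0.33),
]
window = (datetime(2015, 1, 1), datetime(2015, 1, 5))
print(fraction_time_below(history, 0.35, *window))
```

Averaging this fraction over zones and regions would give a global figure of the kind plotted in Figure 2, although the exact averaging used by the authors is described only in their repository.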

Setting up a cluster on Amazon EC2 with spot instances

In order to minimize costs, STARclusters were assembled from ‘spot instances,’ which are unused instances that can be reserved through a bidding process. The spot instances are different from ‘on-demand’ instances: on-demand instances provide users with guaranteed access while spot instances are reserved until there is a higher bid, at which point the user is logged out of the spot instance. When this happens, the MPI-threaded Relion calculation will abort, requiring the user to resubmit the job to the STARcluster and start Relion from the previous iteration. Even if the user is logged out of all instances within a STARcluster, the data is automatically saved within the EBS-backed volumes on Amazon EC2.

CPUs vs vCPUs

In selecting an instance type, new users should be aware of the differences between CPUs and vCPUs on Amazon's EC2 network. Namely, there are two vCPUs per physical CPU core on Amazon. This means that while r3.8xlarge instances have 32 vCPUs, there are actually only 16 physical CPU cores in each instance, with each core having two hyperthreads. Practically, this means that Amazon's instances perform better than a 16 CPU machine but worse than a 32 CPU machine. To account for this difference, all numbers reported here are CPU numbers converted from vCPUs: 1 CPU = 2 vCPUs.

Image processing

Micrographs from the 80S S. cerevisiae ribosome dataset (Bai et al., 2013) were downloaded from the EMPIAR database for electron microscopy data (EMPIAR 10002). The SWARM feature of EMAN2 (Tang et al., 2007) was used to pick particles semi-automatically. Micrograph defocus was estimated using CTFFIND3 (Mindell and Grigorieff, 2003). The resulting particle coordinates and defocus information were used for particle extraction by Relion-v1.3 (Scheres, 2012, 2014). The particle stacks and associated data files were then uploaded to an elastic block storage volume on Amazon's EC2 processing environment at a speed of 10 MB/s (24 min total upload time).

After 2D classification in Relion, 3D classification was performed on 62,022 80S ribosome particles (1.77 Å/pixel), also in Relion. These were classified into 4 groups (T = 4) for 13 iterations using a ribosome map downloaded from the Electron Microscopy Data Bank (EMDB-1780) that was low pass filtered to 60 Å. Further 3D classification using a local search of 10° and an angular sampling of 1.8° continued for 13 iterations. At this point, two classes were identified as belonging to the same structural state and were selected for high-resolution refinement (32,533 particles). Refinement of these selected particles continued for 31 iterations using 3D auto-refine in Relion. The final resolution was determined to be 4.6 Å using Post process in Relion, applying a mask to the merged half volumes and a negative B-factor of −116 Å².

Performance analysis

80S ribosome data were reanalyzed on clusters of increasing size using both 3D classification and 3D refinement. The time points collected involved running 3D classification for 2 rounds and 3D refinement for 6 rounds, using the same number of particles and box sizes listed above: 62,022 particles for classification and 32,533 particles for refinement with box sizes of 240 × 240 pixels. The Relion commands were identical to the commands used above and the calculations were terminated after the specified iteration.

From these time points, the speedup of each cluster size was calculated relative to a single CPU. Speedup (S) was calculated as:

$$S = \frac{\text{Calculation time for 1 CPU}}{\text{Calculation time for } x \text{ CPUs}}.$$

The measured speedup values were then compared to the speedup expected for a perfectly parallel algorithm (P = 1) using Amdahl's law (Amdahl, 1967):

$$S = \frac{1}{(1 - P) + \frac{1}{n}(P)} = \frac{1}{(1 - 1) + \frac{1}{n}(1)} = n,$$

where P is the fraction of an algorithm that is parallel and n is the number of processors. The calculation times for 3D classification on a single CPU were obtained by using 1 CPU on a 16 CPU r3.8xlarge instance. For 3D refinement on a single CPU (or two vCPUs), the refinement was run on 4 vCPUs and the calculation time was multiplied by 2 to convert it to a single CPU (or two vCPUs). For cost analysis, the measured speedup was divided by the cost to run the job on spot instances of r3.8xlarge at a price of $0.35/hr. Cluster boot up times were calculated from the elapsed time between submitting the STARcluster command and the STARcluster fully booting up.
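The quantities defined in this section can be written out as a short sketch. The function names and the example timings are hypothetical and are not taken from the authors' scripts or source data; the cost term assumes billing at $0.35/hr per 16-CPU r3.8xlarge instance, as stated above.

```python
def measured_speedup(time_one_cpu, time_x_cpus):
    """Speedup S: single-CPU calculation time divided by cluster time."""
    return time_one_cpu / time_x_cpus


def amdahl_speedup(n, p=1.0):
    """Amdahl's law for n processors and parallel fraction p.

    With p = 1 (perfectly parallel) this reduces to S = n.
    """
    return 1.0 / ((1.0 - p) + p / n)


def speedup_per_dollar(speedup, cluster_hours, cluster_cpus,
                       price_per_instance_hour=0.35, cpus_per_instance=16):
    """Measured speedup divided by the spot-instance cost of the run."""
    instances = cluster_cpus / cpus_per_instance
    cost = cluster_hours * instances * price_per_instance_hour
    return speedup / cost


# Hypothetical timings: 100 h on 1 CPU versus 2 h on a 64-CPU cluster.
s = measured_speedup(100.0, 2.0)
print(s, amdahl_speedup(64), speedup_per_dollar(s, 2.0, 64))
```

Comparing `measured_speedup` against `amdahl_speedup(n)` for each cluster size shows how far a run falls from the perfectly parallel limit, while `speedup_per_dollar` identifies the cluster size at which additional CPUs stop paying for themselves.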

Data accession information

Further information regarding ‘EM-Packages-in-the-Cloud’ can be found in Supplementary file 1 and at an associated Google Site: http://goo.gl/AIwZJz. The final 80S yeast ribosome structure at 4.6 Å has been submitted to the EM Databank as EMDB 2858. A detailed description of global spot instance price analyses and image processing is available at https://github.com/mcianfrocco/Cianfrocco-and-Leschziner-EMCloudProcessing/wiki. Associated computing scripts and data files have been uploaded to Github (https://github.com/mcianfrocco/Cianfrocco-and-Leschziner-EMCloudProcessing) and Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.9mb54) (Cianfrocco and Leschziner, 2015), respectively.

References

  1. GM Amdahl (1967). Proceedings of the April 18-20, 1967, spring joint computer conference, Atlantic City, New Jersey, ACM, 483–485.
  2. AWS (2013). Bristol-Myers Squibb on AWS. Date Accessed: April 20, 2015. http://aws.amazon.com/solutions/case-studies/bristol-myers-squibb/.
  3. AWS (2014a). AWS case study: Novartis. Date Accessed: April 20, 2015. http://aws.amazon.com/solutions/case-studies/novartis/.
  4. AWS (2014b). AWS case study: Pfizer. Date Accessed: April 20, 2015. http://aws.amazon.com/solutions/case-studies/pfizer/.
  8. MA Cianfrocco, AE Leschziner (2015). Data from: single particle cryo-electron microscopy image processing in the cloud: high performance at low cost. Dryad Data Repository, 10.5061/dryad.9mb54.
  13. C Ivica, JT Riley, C Shubert (2009). Paper presented at: information technology interfaces, 2009 ITI '09 proceedings of the ITI 2009 31st international conference on.

Decision letter

  1. Sjors HW Scheres
    Reviewing Editor; Medical Research Council Laboratory of Molecular Biology, United Kingdom

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for sending your work entitled “Low cost, high performance processing of single particle cryo-electron microscopy data in the cloud” for consideration at eLife. Your Tools and Resources article has been favorably evaluated by John Kuriyan (Senior editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors.

The following individuals responsible for the peer review of your submission have agreed to reveal their identity: Sjors Scheres (Reviewing editor); Steven Ludtke (peer reviewer). A further reviewer remains anonymous.

The Reviewing editor and the other reviewers discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission.

All three reviewers agreed that this paper represents a novel and original way of alleviating the high computational burdens that many cryo-EM labs face with the advent of huge amounts of data from new direct electron detectors. As this paper has the potential to accelerate discovery, and to change how many labs operate, publication was recommended by all three.

The following concerns (in order of importance) were raised:

Firstly, one of the reviewers had actually recently performed the EC2 cost analysis himself and was surprised to see the estimates in this paper less expensive than his own. He writes: “The first issue is an Amazon trick. The r3.8xlarge instances are marketed as “32 vCPUs”. Actually this is 32 threads running on 16 cores when you read the fine print. For image processing this is generally 18-20 “CPUs” worth of compute power for the “32 vCPUs”. CPU-hr/mo/yr normally measure core-hours, not thread hours. This is a factor of ∼1.8. If the authors disagree, I would encourage them to run a scalability test on a small problem on a single 32 vCPU instance. Perusing the Amazon site, the cost of a single r3.8xlarge instance with 16 physical cores is currently $2.80 for on demand use, rather than the $0.35 quoted by this manuscript. While it is possible to reduce this by up to ∼50% through contract prepurchase, the only mechanism I can see for getting the price anywhere close to the cited level is by bidding on unused hours, which can mean substantial delays. Currently the purchase price for an equivalent cluster is ∼$350/core ($175/thread), or ∼$7500 for a node almost identical to the $2.80/hr instance. Anyway, by my calculation, when you take all expenses into account, EC2 is about 3-5x more expensive than owning a cluster. However, if the Amazon price were suddenly 10x lower, this would be compelling. If my cost analysis is in error, I would be honestly grateful to see a correction, as it would substantially alter how we operate.”

Secondly, in Table 1, the reported times for the other 3 cases are incorrect. They are merely the sum of the values reported for the old and the new movie processing in the Scheres, 2014 eLife paper. There are no reported CPU costs for the entire processing procedures of these structures in the literature. However, on the RELION wiki (http://www2.mrc-lmb.cam.ac.uk/relion/index.php/FAQs#Computational_issues) it is stated: “We do 3.x Angstrom ribosome reconstructions from say 100-200 thousand particles in approximately two weeks using around 200-300 cores in parallel”. This would result (given current cost estimates) in about $800 per structure, which is still very reasonable.

Thirdly, the authors are encouraged to base their analysis on Amdahl's Law instead of the “near-linear increase” stated in the paper. While I would normally consider this a minor concern, this analysis will yield numbers which will be interesting to consider.

[Editors’ note: the decision letter after resubmission follows.]

Thank you for choosing to send your work entitled “Low cost, high performance processing of single particle cryo-electron microscopy data in the cloud” for consideration at eLife. Your revised submission has been evaluated by John Kuriyan (Senior editor) and Sjors Scheres (Reviewing editor). Based on our discussions and the individual reviews sent previously, we regret to inform you that your work will not be considered further for publication in eLife.

It is unfortunate that the standard prices for Amazon's cloud are so high. We feel that bidding on unused hours is likely to be unpredictable in the future and this lack of predictability makes it less appropriate as a means of evaluating the costs of the calculations described in the paper. The true cost of doing an entire structure determination project (not only a single refinement run) at standard Amazon prices would probably be quite substantially higher than the value mentioned in the Abstract. We fear that this could be more expensive than buying a local cluster. Although some smaller labs may still benefit from the cloud setup, this paper will not likely change the way cryo-EM labs work in general.

[Editors’ note: after an appeal against the decision, further revisions were requested before acceptance.]

In order to give a fair and transparent view to the casual reader, we feel that the addition of a discussion on the costs of a “typical” structure determination project would add value to the paper. The phrase “as we illustrate here by determining a near-atomic resolution structure of the 80S yeast ribosome for $28.89 USD in ∼10 hours” in the current Abstract is not representative of a typical case. Many data sets will contain several hundreds of thousands of particles, and each 2D or 3D classification or refinement run will cost on the order of $100-200 (based on your estimates for gamma-secretase and mitoribosome). Because a typical project involves several such jobs, real costs will quickly reach more than a thousand dollars per structure, even when using the $0.35/hour bidding rate. This is perfectly acceptable and still competitive with buying a local cluster. But discussing such values in the paper will prevent unpleasant surprises when PIs start receiving EC2 bills.

https://doi.org/10.7554/eLife.06664.013

Author response

The following concerns (in order of importance) were raised:

Firstly, one of the reviewers had actually recently performed the EC2 cost analysis himself and was surprised to see the estimates in this paper less expensive than his own. He writes: “The first issue is an Amazon trick. The r3.8xlarge instances are marketed as “32 vCPUs”. Actually this is 32 threads running on 16 cores when you read the fine print. For image processing this is generally 18-20 “CPUs” worth of compute power for the “32 vCPUs”. CPU-hr/mo/yr normally measure core-hours, not thread hours. This is a factor of ∼1.8. If the authors disagree, I would encourage them to run a scalability test on a small problem on a single 32 vCPU instance.

We would like to thank the reviewers for pointing out the difference between vCPUs and CPUs. To make sure there is no confusion, we updated the text so that all analysis and discussion of Amazon EC2 involves a discussion of CPUs and CPU core-hours (instead of hyperthreads and hyperthread-hours). This will allow the manuscript to be readily compared to other published work that reports on CPUs and CPU core-hours. We also included a section within the Materials and methods where we discuss the differences between CPUs and vCPUs.

Perusing the Amazon site, the cost of a single r3.8xlarge instance with 16 physical cores is currently $2.80 for on demand use, rather than the $0.35 quoted by this manuscript. While it is possible to reduce this by up to ∼50% through contract prepurchase, the only mechanism I can see for getting the price anywhere close to the cited level is by bidding on unused hours, which can mean substantial delays. Currently the purchase price for an equivalent cluster is ∼$350/core ($175/thread), or ∼$7500 for a node almost identical to the $2.80/hr instance. Anyway, by my calculation, when you take all expenses into account, EC2 is about 3-5x more expensive than owning a cluster. However, if the Amazon price were suddenly 10x lower, this would be compelling. If my cost analysis is in error, I would be honestly grateful to see a correction, as it would substantially alter how we operate.”

We agree that Amazon EC2 is not particularly cost effective for the scenarios presented by the reviewers. We were able to minimize costs by bidding on unused hours through spot instance requests. This allowed us to reserve r3.8xlarge instances at a price of $0.35/hr instead of the on-demand price of $2.80/hr. Also, we have not seen any delays in getting access to these spot instances (Figure 2D), which has made them reliable in our experience with EC2. To help convey these ideas to the reader, we have included a section within the Materials and methods describing how we were able to reserve the spot instances and comparing them to the on-demand instances.
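The gap between the on-demand and spot prices discussed above can be made concrete with a short calculation. The sketch below uses only figures quoted in this exchange (16 physical cores per r3.8xlarge, $2.80/hr on demand, $0.35/hr spot, and the reviewer's ∼$7500 purchase price for a comparable node). The function names are illustrative, and the break-even figure is an assumption-laden simplification: it ignores power, cooling, administration and storage, so it is a rough guide rather than a full cost model.

```python
# Figures quoted in the decision letter and author response.
CORES_PER_INSTANCE = 16      # r3.8xlarge: 32 vCPUs (threads) on 16 physical cores
ON_DEMAND_PER_HR = 2.80      # USD per instance-hour
SPOT_PER_HR = 0.35           # USD per instance-hour (spot bid used by the authors)
NODE_PURCHASE_PRICE = 7500   # USD, reviewer's estimate for a comparable node


def cost_per_core_hour(price_per_instance_hr, cores=CORES_PER_INSTANCE):
    """Price of one physical core for one hour."""
    return price_per_instance_hr / cores


def break_even_hours(purchase_price, price_per_instance_hr):
    """Instance-hours of rental that cost as much as buying the node outright.

    Hardware price only: running costs of a local node are not included.
    """
    return purchase_price / price_per_instance_hr


if __name__ == "__main__":
    for label, price in [("on-demand", ON_DEMAND_PER_HR), ("spot", SPOT_PER_HR)]:
        print(f"{label}: ${cost_per_core_hour(price):.4f} per core-hour, "
              f"break-even after "
              f"{break_even_hours(NODE_PURCHASE_PRICE, price):,.0f} instance-hours")
```

Under these assumptions, renting at the on-demand price matches the hardware price after roughly 2,700 instance-hours (about 16 weeks of continuous use), whereas at the spot price the same point is reached after roughly 21,400 instance-hours (about 2.4 years of continuous use).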

Secondly, in Table 1, the reported times for the other 3 cases are incorrect. They are merely the same of the values reported for the old and the new movie processing in the Scheres, 2014 eLife paper. There are no reported CPU costs for the entire processing procedures of these structures in the literature. However, on the RELION wiki (http://www2.mrc-lmb.cam.ac.uk/relion/index.php/FAQs#Computational_issues) it is stated: “We do 3.x Angstrom ribosome reconstructions from say 100-200 thousand particles in approximately two weeks using around 200-300 cores in parallel”. This would result (given current cost estimates) in about $800 per structure, which is still very reasonable.

During the preparation of the manuscript, we overlooked this incorrect comparison and we would like to thank the reviewers for raising this issue. We have updated the table to indicate that we are comparing 3D refinement processing times, not total processing times.

Thirdly, the authors are encouraged to base their analysis on Amdahl's Law instead of the “near-linear increase” stated in the paper. While I would normally consider this a minor concern, this analysis will yield numbers which will be interesting to consider.

We performed the suggested analysis and found it to be very helpful in providing a theoretical limit to increases expected by Amazon STARcluster configurations. The results have been included in Figure 2B.
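For reference, Amdahl's Law bounds the speedup S on N processors when only a fraction p of the work can be parallelized: S(N) = 1 / ((1 - p) + p/N). A minimal sketch follows. The parallel fraction used here (p = 0.99) is an assumed value chosen purely for illustration; it is not a value measured in the manuscript, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speedup from Amdahl's Law: 1 / ((1 - p) + p / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    p = 0.99  # assumed parallel fraction, for illustration only
    for n in (16, 32, 64, 128, 256):
        s = amdahl_speedup(p, n)
        # Efficiency is the speedup divided by the number of processors.
        print(f"{n:4d} CPUs: speedup {s:6.1f}x, efficiency {s / n:.0%}")
```

Whatever the number of processors, the speedup can never exceed 1 / (1 - p), which is why even a small serial fraction makes scaling fall away from linear as the cluster grows.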

[Editors’ note: the author response to the decision on the revised submission follows.]

While we understand the rationale behind your decision, we believe that some assumptions underlying this decision are not entirely correct. We realize we are at fault for not having presented enough data in our manuscript on the likelihood of being able to secure computational resources at the costs we quoted. We have included below these data, as part of a detailed response to the main points raised in your decision letter. We hope that this new information will more successfully make the point that cloud-based computing represents a cost-effective and general solution to the computational burden imposed by cryo-EM data analysis.

It is unfortunate that the standard prices for Amazon's cloud are so high. We feel that bidding on unused hours is likely to be unpredictable in the future and this lack of predictability makes it less appropriate as a means of evaluating the costs of the calculations described in the paper.

We would like to thank you for bringing this up because it is an important concern that we should have addressed more clearly. As pointed out, the prices of Amazon’s instances can change. Despite this, the vast majority of spot instance prices have remained consistently low. This can be seen in an analysis of r3.8xlarge spot instance prices over the past three months in both the United States (Virginia) and European Union (Ireland) (Author response image 1). This analysis showed that users can gain access to > 90% of instances at a price of $0.65/hr, which is less than 25% of the standard rate of $2.80/hr. Even at a price of $0.65/hr, the cost of 3D classification and refinement of the 80S ribosome would have been $53.64 (Author response table 1).

Author response image 1

Percentage of instances below bid price over last 90 days. Shown are the percentages of r3.8xlarge instances that are below the spot instance price across different regions and zones.

https://doi.org/10.7554/eLife.06664.015
Author response table 1

Cost of 80S ribosome 3D classification and refinement on a 128 CPU STARcluster configuration at increasing spot instance price.

https://doi.org/10.7554/eLife.06664.016

Spot instance bid price    80S ribosome 3D classification and refinement cost
$0.35                      $28.89
$0.45                      $37.14
$0.55                      $45.39
$0.65                      $53.64
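The values in Author response table 1 scale linearly with the bid price. A minimal sketch of that scaling, assuming the cost is simply run time × number of instances × price per instance-hour, with the 10.32 hr run time and eight r3.8xlarge instances quoted elsewhere in this response (the function name is illustrative):

```python
RUN_HOURS = 10.32   # 3D classification and refinement run time quoted in the response
N_INSTANCES = 8     # a 128 CPU STARcluster comprises eight r3.8xlarge instances


def run_cost(price_per_instance_hr, hours=RUN_HOURS, instances=N_INSTANCES):
    """Cost in USD of one run at a given spot price per instance-hour."""
    return hours * instances * price_per_instance_hr


if __name__ == "__main__":
    for bid in (0.35, 0.45, 0.55, 0.65):
        print(f"bid ${bid:.2f}/hr -> ${run_cost(bid):.2f}")
```

Because the run time is quoted to only two decimal places, this calculation agrees with the tabulated costs to within a few cents rather than exactly.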

We would also like to make two final points regarding the future of prices for Amazon instances:

Competition: Multiple cloud-based service providers (e.g. Google, Microsoft, Alibaba) already provide virtual machines for users across the world. These providers are vying to supply large privately held companies with computing resources. This will translate into stabilizing market forces that will help keep the price of cloud computing stable (if not lower it).

Moore’s law: Further development of CPU processing power will make the most ‘powerful’ generation of CPUs on Amazon cheaper within 2 years, further driving down cost.

The true cost of doing an entire structure determination project (not only a single refinement run) at standard Amazon prices would probably be quite substantially higher than the value mentioned in the Abstract. We fear that this could be more expensive than buying a local cluster.

We would also like to thank the reviewers for raising the issue of the ‘true cost’ of single particle analysis on Amazon EC2. The estimates outlined below will highlight the cost-effectiveness of Amazon’s EC2 infrastructure; individual virtual machines can be used at rates as low as $0.003/hr. The following is an estimate for the entire processing of a single dataset:

1) Picking particles and CTF estimation: Estimated cost $2.00 (100 hrs at $0.02/hr)

For this estimate, we used a spot instance price of $0.02/hr on an m1.small instance, which has had an average spot instance price of $0.0103 +/- 0.0038 over the last 90 days in US-Virginia. At this rate, a user could select particles and estimate the CTF over the course of 100 hours (2.5 x 40 hr. work weeks) for $2.00.

2) Particle extraction: $0.20 (10 hrs at $0.02/hr)

The particle extraction of the 80S ribosome presented in our manuscript required ∼10 hrs.

3) 2D classification using Relion: $23.29 (8.32 hrs at $2.80/hr)

For the 80S ribosome dataset presented in Bai et al. 2013, we performed 2D classification using Relion on a 128 CPU STARcluster of eight instances at a spot price of $0.35/instance/hr, for a total rate of $2.80/hr. Classifying 62,022 particles (pixel size 3.54, box size 120 x 120 pixels) into 250 classes (T=2) over 25 iterations cost $23.29.

4) 3D classification & refinement using Relion: $28.89 (10.32 hrs at $2.80/hr)

This cost reflects a price of $0.35/hr/instance on a 128 CPU STARcluster, which comprises eight r3.8xlarge instances.

In summary:

Particle picking & CTF estimation $2.00

Particle extraction $0.20

2D classification using Relion $23.29

3D classification & refinement using Relion $28.89

Total cost $54.38
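The summary above can be recomputed from the hours and hourly rates given for each step. The sketch below does so under the assumption that each step costs hours × hourly rate, where the rate for the Relion steps is the combined $2.80/hr for eight instances; the data structure and function name are illustrative.

```python
# (step, hours, USD per hour) as stated in the estimate above.
STEPS = [
    ("Particle picking & CTF estimation", 100, 0.02),
    ("Particle extraction", 10, 0.02),
    ("2D classification using Relion", 8.32, 2.80),
    ("3D classification & refinement using Relion", 10.32, 2.80),
]


def total_cost(steps):
    """Sum of hours x hourly rate over all processing steps, in USD."""
    return sum(hours * rate for _, hours, rate in steps)


if __name__ == "__main__":
    for name, hours, rate in STEPS:
        print(f"{name}: ${hours * rate:.2f}")
    print(f"Total cost: ${total_cost(STEPS):.2f}")
```

The recomputed total differs from the quoted $54.38 by about a cent, because the per-step figures in the summary were rounded before being added.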

Although some smaller labs may still benefit from the cloud setup, this paper will not likely change the way cryo-EM labs work in general.

The spread of cryo-EM as a common structural biology tool necessitates the spread of a cryo-EM computing software platform. Unlike X-ray crystallography, which can be used to solve atomic structures on a single workstation, cryo-EM data analysis inherently requires access to hundreds of CPUs for 3D structure determination. Currently, major universities have invested heavily in computing infrastructures, which are absent from many universities around the world. Therefore, as more labs begin to use cryo-EM software, they will need access to larger computing resources. Many of those newcomers may be small laboratories, or laboratories in which only a few members need high performance computing. We believe cloud-based computing will provide a solution to these challenges.

While cloud computing is continuing to mature, we have created a tool that will immediately address the computational burden imposed by cryo-EM data analysis. Cloud-based computing is a tool that could and would scale rapidly with the spread of cryo-EM.

https://doi.org/10.7554/eLife.06664.014

Article and author information

Author details

  1. Michael A Cianfrocco

    1. Department of Molecular and Cellular Biology, Harvard University, Cambridge, United States
    2. Department of Cell Biology, Harvard Medical School, Boston, United States
    Contribution
    MAC, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    mcianfrocco@fas.harvard.edu
    Competing interests
    The authors declare that no competing interests exist.
  2. Andres E Leschziner

    Department of Molecular and Cellular Biology, Harvard University, Cambridge, United States
    Contribution
    AEL, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.

Funding

Damon Runyon Cancer Research Foundation (Damon Runyon) (2171-13)

  • Michael A Cianfrocco

National Institutes of Health (NIH) (R01GM107214)

  • Andres E Leschziner

National Institutes of Health (NIH) (R01GM092895A)

  • Andres E Leschziner

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank all the members of the Leschziner and Reck-Peterson labs for critical discussions. We would like to especially thank Rogelio Hernandez-Lopez, Anthony Roberts, and Daniel Cianfrocco for critical feedback on the development of this Amazon computing environment. We also would like to thank the Structural Biology Consortium (SBGrid) for pricing information on cluster and file server sizes. MAC is an HHMI fellow of the Damon Runyon Cancer Research Foundation and AEL is supported by NIH/NIGMS (R01 GM107214 and R01 GM092895A).

Reviewing Editor

  1. Sjors HW Scheres, Reviewing Editor, Medical Research Council Laboratory of Molecular Biology, United Kingdom

Publication history

  1. Received: January 25, 2015
  2. Accepted: May 1, 2015
  3. Accepted Manuscript published: May 8, 2015 (version 1)
  4. Version of Record published: May 22, 2015 (version 2)

Copyright

© 2015, Cianfrocco and Leschziner

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
