CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning

  1. Ryan Conrad
  2. Kedar Narayan (corresponding author)
  1. Center for Molecular Microscopy, Center for Cancer Research, National Cancer Institute, National Institutes of Health, United States
  2. Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, United States
11 figures, 2 tables and 6 additional files

Figures

Figure 1
Preparation of a deep learning appropriate 2D EM image dataset rich with relevant and unique features.

(a) Percent distribution of collated experiments grouped by imaging technique: TEM, transmission electron microscopy; SEM, scanning electron microscopy. (b) Distribution of imaging plane pixel …

Figure 1—source data 1

Details of imaging technique, organism, tissue type and imaging plane pixel spacing in collated imaging experiments.

https://cdn.elifesciences.org/articles/65894/elife-65894-fig1-data1-v1.xlsx
Figure 2
CEM500K pre-training improves the transferability of learned features.

(a) Example images and colored label maps from each of the six publicly available benchmark datasets: clockwise from top left: Kasthuri++, UroCell, CREMI Synaptic Clefts, Guay, Perez, and Lucchi++. …

Figure 2—source data 1

IoU scores achieved with different datasets used for pre-training.

https://cdn.elifesciences.org/articles/65894/elife-65894-fig2-data1-v1.xlsx
Figure 3
Features learned from CEM500K pre-training are more robust to image transformations and encode for semantically meaningful objects with greater selectivity.

(a) Mean firing rates calculated between feature vectors of images distorted by (i) rotation, (ii) Gaussian blur, (iii) Gaussian noise, (iv) brightness, (v) contrast, (vi) scale. Dashed black lines …

Figure 4
Models pre-trained on CEM500K yield superior segmentation quality and training speed on all segmentation benchmarks.

(a) Plot of percent difference in segmentation performance between pre-trained models and a randomly initialized model. (b) Example segmentations on the UroCell benchmark in 3D (top) and 2D …
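
The percent difference plotted in panel (a) can be illustrated with a small calculation. This is a minimal sketch, assuming "percent difference" means the relative change in IoU with respect to the randomly initialized baseline; the source does not spell out the formula. The function name `percent_difference` is hypothetical, and the IoU values are copied from the Random Init. and CEM500K-moco columns of Table 2.

```python
def percent_difference(pretrained_iou: float, random_init_iou: float) -> float:
    """Relative change of a pre-trained model's IoU versus random init, in percent.

    Assumed definition: 100 * (pretrained - baseline) / baseline.
    """
    if random_init_iou == 0:
        raise ValueError("percent difference is undefined when the baseline IoU is 0")
    return 100.0 * (pretrained_iou - random_init_iou) / random_init_iou


# (Random Init. IoU, CEM500K-moco IoU) pairs copied from Table 2.
table2 = {
    "Guay": (0.308, 0.429),
    "Kasthuri++": (0.905, 0.915),
    "Lucchi++": (0.894, 0.895),
    "Perez": (0.672, 0.901),
    "UroCell": (0.424, 0.734),
}

percent_gains = {
    name: percent_difference(pretrained, baseline)
    for name, (baseline, pretrained) in table2.items()
}

for name, gain in percent_gains.items():
    print(f"{name}: {gain:+.1f}%")
```

CREMI Synaptic Clefts is left out because its random-initialization IoU in Table 2 is 0.000, so a relative change cannot be computed under this definition.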

Figure 4—source data 1

IoU scores for different pre-training protocols.

https://cdn.elifesciences.org/articles/65894/elife-65894-fig4-data1-v1.xlsx
Figure 4—source data 2

IoU scores for different training iterations by pre-training protocol.

https://cdn.elifesciences.org/articles/65894/elife-65894-fig4-data2-v1.xlsx
Appendix 1—figure 1
Deduplication and image filtering.

(a) Breakdown of fractions (top) and representative examples (bottom) of patches labeled ‘uninformative’ by a trained deep learning (DL) model based on defect (as determined by a human annotator). (b

Appendix 1—figure 2
Randomly selected images from CEMraw, CEMdedup, and CEM500K.
Appendix 1—figure 3
Schematics of the MoCoV2 algorithm and UNet-ResNet50 model architecture.

(a) Shows a single step in the MoCoV2 algorithm. A batch of images is copied; images in each copy of the batch are independently and randomly transformed and then shuffled into a random order (the …

Appendix 1—figure 4
Randomly selected images from the Bloss et al., 2018 pre-training dataset.
Appendix 1—figure 5
Visual comparison of results on the UroCell benchmark.

The ground truth and Authors’ Best Results are taken from the original UroCell publication (Žerovnik Mekuč et al., 2020). The results from the CEM500K-moco pre-trained model have been colorized to …

Appendix 1—figure 6
Images from source electron microscopy (EM) volumes are unequally represented in the subsets of CEM.

The line at 45° shows the expected curve for perfect equality between all source volumes (i.e. each volume would contribute the same number of images to CEMraw, CEMdedup, or CEM500K). Gini …
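
The equality curve and Gini coefficient mentioned in this caption can be sketched as follows. This is an illustration under assumptions: the caption is truncated, so the exact computation used by the authors is not shown here. The sketch uses the standard Gini formula on sorted counts, the function names `gini` and `lorenz_curve` are hypothetical, and the per-volume image counts in the usage lines are made up.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means every source volume contributes equally; values near 1 mean
    a few volumes contribute almost all images.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Standard formula on ascending data: G = sum_i (2i - n - 1) * x_i / (n * sum x)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


def lorenz_curve(counts):
    """Points (fraction of volumes, cumulative fraction of images), from (0, 0) to (1, 1)."""
    values = sorted(counts)
    total = sum(values)
    points = [(0.0, 0.0)]
    running = 0
    for i, x in enumerate(values, start=1):
        running += x
        points.append((i / len(values), running / total))
    return points


# Hypothetical image counts per source volume (not taken from the paper).
equal_counts = [500, 500, 500, 500]
skewed_counts = [10, 40, 150, 1800]
print(gini(equal_counts), gini(skewed_counts))
```

With equal counts the curve returned by `lorenz_curve` lies on the 45° line and the coefficient is 0; the more the curve sags below that line, the larger the coefficient.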

Appendix 1—figure 7
Plot showing the percent of random crops from an image that will be 100% uninformative based on the percent of the image that is informative.

Tables

Table 1
Comparison of segmentation Intersection-over-Union (IoU) results on the benchmark datasets for models that were randomly initialized or pre-trained with MoCoV2 on the Bloss dataset, CEMraw, CEMdedup, or CEM500K.

* denotes benchmarks that exclusively contain electron microscopy (EM) images from mouse brain tissue. The best result for each benchmark is highlighted in bold and underlined.

Benchmark               Random Init. (No Pre-training)  Bloss et al., 2018  CEMraw  CEMdedup  CEM500K
All Mitochondria        0.306                           0.694               0.719   0.722     0.745
CREMI Synaptic Clefts   0.000                           0.242               0.254   0.259     0.265
Guay                    0.349                           0.380               0.372   0.391     0.404
*Kasthuri++             0.855                           0.907               0.913   0.913     0.915
*Lucchi++               0.788                           0.899               0.880   0.890     0.894
*Perez                  0.547                           0.874               0.854   0.866     0.869
UroCell                 0.208                           0.638               0.652   0.699     0.729
*Average Mouse Brain    0.730                           0.893               0.883   0.890     0.893
Average Other           0.216                           0.489               0.499   0.518     0.536
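
For readers unfamiliar with the metric reported in Table 1, the following is a minimal sketch of Intersection-over-Union for a single pair of binary masks. It assumes masks are given as equally shaped nested lists of 0/1 values; the function name `iou` and the toy masks are hypothetical, and the benchmarks' own evaluation code may differ, for example in how multi-class labels or 3D volumes are handled.

```python
def iou(prediction, ground_truth):
    """Intersection-over-Union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1  # pixel labeled foreground in both masks
            if p or g:
                union += 1  # pixel labeled foreground in at least one mask
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are foreground in either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [1, 1, 0]]
score = iou(pred, gt)
print(score)
```

An IoU of 0.000, as in the randomly initialized CREMI Synaptic Clefts row, means the predicted and ground-truth foreground regions do not overlap at all.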
Table 2
Comparison of segmentation IoU scores for different weight initialization methods versus the best results on each benchmark as reported in the publication presenting the segmentation task.

All IoU scores are the average of five independent runs. References listed after the benchmark names indicate the sources for Reported IoU scores.

Benchmark                         Training Iterations  Random Init.  IN-super  IN-moco  CEM500K-moco  Reported
All Mitochondria                  10000                0.587         0.653     0.653    0.770         -
CREMI Synaptic Clefts             5000                 0.000         0.196     0.226    0.254         -
Guay (Guay et al., 2020)          1000                 0.308         0.275     0.300    0.429         0.417
Kasthuri++ (Casser et al., 2018)  10000                0.905         0.908     0.911    0.915         0.845
Lucchi++ (Casser et al., 2018)    10000                0.894         0.865     0.892    0.895         0.888
Perez (Perez et al., 2014)        2500                 0.672         0.886     0.883    0.901         0.821
  Lysosomes                       -                    0.842         0.838     0.816    0.849         0.726
  Mitochondria                    -                    0.130         0.860     0.866    0.884         0.780
  Nuclei                          -                    0.984         0.987     0.986    0.988         0.942
  Nucleoli                        -                    0.731         0.859     0.865    0.885         0.835
UroCell                           2500                 0.424         0.584     0.618    0.734         -
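
In Table 2, the Lysosomes, Mitochondria, Nuclei and Nucleoli rows follow the Perez row and carry no iteration count of their own. The check below assumes they are per-class scores of the Perez benchmark, which the table does not state explicitly, and tests whether the Perez row matches their unweighted mean. All numbers are copied from the table; the variable names are hypothetical.

```python
columns = ["Random Init.", "IN-super", "IN-moco", "CEM500K-moco", "Reported"]

# Perez row and the four rows that follow it in Table 2.
perez_row = [0.672, 0.886, 0.883, 0.901, 0.821]
class_rows = {
    "Lysosomes": [0.842, 0.838, 0.816, 0.849, 0.726],
    "Mitochondria": [0.130, 0.860, 0.866, 0.884, 0.780],
    "Nuclei": [0.984, 0.987, 0.986, 0.988, 0.942],
    "Nucleoli": [0.731, 0.859, 0.865, 0.885, 0.835],
}

# Unweighted mean of the four class rows, column by column.
class_means = [sum(col) / len(col) for col in zip(*class_rows.values())]
deviations = [abs(mean - listed) for mean, listed in zip(class_means, perez_row)]

for name, mean, listed in zip(columns, class_means, perez_row):
    print(f"{name}: mean of class rows = {mean:.4f}, Perez row = {listed:.3f}")
```

Under that assumption every column agrees with the Perez row to within rounding of the third decimal place, which is consistent with the Perez score being an average over the four organelle classes.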

Additional files

Source data 1

Details of image datasets acquired from external sources.

https://cdn.elifesciences.org/articles/65894/elife-65894-data1-v1.xlsx
Source data 2

Zipped folder containing .xl files for Figures 1, 2 and 4 source data.

https://cdn.elifesciences.org/articles/65894/elife-65894-data2-v1.zip
Supplementary file 1

Details of benchmarks used in this paper.

https://cdn.elifesciences.org/articles/65894/elife-65894-supp1-v1.docx
Supplementary file 2

IoU scores for pre-training with CEM500K after removing benchmark data from the pre-training dataset.

https://cdn.elifesciences.org/articles/65894/elife-65894-supp2-v1.docx
Supplementary file 3

IoU scores on Guay benchmark using different hyperparameter choices.

https://cdn.elifesciences.org/articles/65894/elife-65894-supp3-v1.docx
Transparent reporting form
https://cdn.elifesciences.org/articles/65894/elife-65894-transrepform-v1.docx
