Research Article

On the objectivity, reliability, and validity of deep learning enabled bioimage analyses

Institute of Clinical Neurobiology, University Hospital Würzburg, Germany
Department of Business and Economics, University of Würzburg, Germany
Institute of Physiology I, Westfälische Wilhelms-Universität, Germany
Department of Pharmacology, Medical University of Innsbruck, Austria
Department of Pharmacology and Toxicology, Institute of Pharmacy and Center for Molecular Biosciences Innsbruck, University of Innsbruck, Austria
Department of Child and Adolescent Psychiatry, Center of Mental Health, University Hospital Würzburg, Germany
Comprehensive Anxiety Center, Germany

Oct 19, 2020

Open access

Abstract

Bioimage analysis of fluorescent labels is widely used in the life sciences. Recent advances in deep learning (DL) allow automating time-consuming manual image analysis processes based on annotated training data. However, manual annotation of fluorescent features with a low signal-to-noise ratio is somewhat subjective. Training DL models on subjective annotations may be unstable or yield biased models. In turn, these models may be unable to reliably detect biological effects. An analysis pipeline integrating data annotation, ground truth estimation, and model training can mitigate this risk. To evaluate this integrated process, we compared different DL-based analysis approaches. With data from two model organisms (mice, zebrafish) and five laboratories, we show that ground truth estimation from multiple human annotators helps to establish objectivity in fluorescent feature annotations. Furthermore, ensembles of multiple models trained on the estimated ground truth establish reliability and validity. Our research provides guidelines for reproducible DL-based bioimage analyses.

eLife digest

Research in biology generates many image datasets, mostly from microscopy. These images have to be analyzed, and much of this analysis relies on a human expert looking at the images and manually annotating features. Image datasets are often large, and human annotation can be subjective, so automating image analysis is highly desirable. This is where machine learning algorithms, such as deep learning, have proven to be useful. In order for deep learning algorithms to work first they have to be ‘trained’. Deep learning algorithms are trained by being given a training dataset that has been annotated by human experts. The algorithms extract the relevant features to look out for from this training dataset and can then look for these features in other image data.

However, it is also worth noting that because these models try to mimic the annotation behavior presented to them during training as well as possible, they can sometimes also mimic an expert’s subjectivity when annotating data. Segebarth, Griebel et al. asked whether this was the case, whether it had an impact on the outcome of the image data analysis, and whether it was possible to avoid this problem when using deep learning for imaging dataset analysis.

For this research, Segebarth, Griebel et al. used microscopy images of mouse brain sections, where a protein called cFOS had been labeled with a fluorescent tag. This protein typically controls the rate at which DNA information is copied into RNA, leading to the production of proteins. Its activity can be influenced experimentally by testing the behaviors of mice. Thus, this experimental manipulation can be used to evaluate the results of deep learning-based image analyses.

First, the fluorescent images were interpreted manually by a group of human experts. Then, their results were used to train a large variety of deep learning models. Models were trained either on the results of an individual expert or on the results pooled from all experts to come up with a consensus model, a deep learning model that learned from the personal annotation preferences of all experts. This made it possible to test whether training a model on multiple experts reduces the risk of subjectivity. As the training of deep learning models is random, Segebarth, Griebel et al. also tested whether combining the predictions from multiple models in a so-called model ensemble improves the consistency of the analyses. For evaluation, the annotations of the deep learning models were compared to those of the human experts, to ensure that the results were not influenced by the subjective behavior of one person. The results of all bioimage annotations were finally compared to the experimental results from analyzing the mice’s behaviors in order to check whether the models were able to find the behavioral effect on cFOS.

Segebarth, Griebel et al. concluded that combining the expert knowledge of multiple experts reduces the subjectivity of bioimage annotation by deep learning algorithms. Combining such consensus information in a group of deep learning models improves the quality of bioimage analysis, so that the results are reliable, transparent and less subjective.

Introduction

Modern microscopy methods enable researchers to capture images that describe cellular and molecular features in biological samples at an unprecedented scale. One of the most frequently used imaging methods is fluorescent labeling of biological macromolecules, both in vitro and in vivo. In order to test a biological hypothesis, fluorescent features have to be interpreted and analyzed quantitatively, a process known as bioimage analysis (Meijering et al., 2016). However, fluorescence does not provide clear signal-to-noise borders, forcing human experts to utilize individual heuristic criteria, such as morphology, size, or signal intensity to classify fluorescent signals as background, or to, often manually, annotate them as a region of interest (ROI). This cognitive decision process depends on the graphical perception capabilities of the individual annotator (Cleveland and McGill, 1985). Constant technological advances in fluorescence microscopy facilitate the automatized acquisition of large image datasets, even at high resolution and with high throughput (Li et al., 2010; McDole et al., 2018; Osten and Margrie, 2013). The ever increasing workload associated with image feature annotation therefore calls for computer-aided automated bioimage analysis. However, attempts to replace human experts and to automate the annotation process using traditional image thresholding techniques (e.g. histogram shape-, entropy-, or clustering-based methods [Sezgin and Sankur, 2004]) frequently lack flexibility, as they rely on a high signal-to-noise ratio in the images or require computational expertise for user-based adaptation to individual datasets (von Chamier et al., 2019). In recent years, deep learning (DL) and in particular deep convolutional neural networks have shown remarkable capacities in image recognition tasks, opening new possibilities to perform automatized image analysis. 
DL-based approaches have emerged as an alternative to conventional feature annotation or segmentation methods (Caicedo et al., 2019) and are even capable of performing complex tasks such as artificial labeling of plain bright-field images (von Chamier et al., 2019; Christiansen et al., 2018; Ounkomol et al., 2018). The main difference between conventional and DL algorithms is that conventional algorithms follow predefined, hard-coded rules, while DL algorithms learn the respective task from a training dataset (LeCun et al., 2015). Yet, deployment of DL approaches necessitates both computational expertise and suitable computing resources. These requirements frequently prevent non-AI experts from applying DL to routine image analysis tasks. Initial efforts have already been made to break down these barriers, for instance, by integration into prevalent bioimaging tools such as ImageJ (Falk et al., 2019) and CellProfiler (McQuin et al., 2018), or using cloud-based approaches (Haberl et al., 2018). To harness the potential of these DL-based methods, they must be integrated into the bioimage analysis pipeline. We argue that such an integration into the scientific process ultimately requires DL-based approaches to meet the same standards as any method in an empirical study. We can derive these standards from the general quality criteria of qualitative and quantitative research: objectivity, reliability, and validity (Frambach et al., 2013).

Objectivity refers to the neutrality of evidence, with the aim to reduce personal preferences, emotions, or simply limitations introduced by the context in which manual feature annotation is performed (Frambach et al., 2013). Manual annotation of fluorescent features has long been known to be subjective, especially in the case of weak signal-to-noise thresholds (Schmitz et al., 1999; Collier et al., 2003; Niedworok et al., 2016). Notably, there is no objective ground truth reference in the particular case of fluorescent label segmentation, causing a critical problem for training and evaluation of DL algorithms. As multiple studies have pointed out that annotations of low quality can cause DL algorithms to either fail to train or to reproduce inconsistent annotations on new data (von Chamier et al., 2019; Falk et al., 2019), this is a crucial obstacle for applying DL to bioimage analysis processes.

Reliability is concerned with the consistency of evidence (Frambach et al., 2013). To allow an unambiguous understanding of this concept, we further distinguish between repeatability and reproducibility. Repeatability or test-retest reliability is defined as 'closeness of the agreement between the results of successive measurements of the same measure and carried out under the same conditions' (Taylor and Kuyatt, 1994, 14), which is guaranteed for any deterministic DL model. Reproducibility, on the other hand, is specified as 'closeness of the agreement between the results of measurements of the same measure and carried out under changed conditions' (Taylor and Kuyatt, 1994, 14), for example, different observer, or different apparatus. This is a critical point, since the output of different DL models trained on the same training dataset can vary significantly. This is caused by the stochastic training procedure (e.g. random initialization, random sampling and data augmentation [Ronneberger et al., 2015]), the choice of model parameters (e.g. model architecture, weights, activation functions), and the choice of hyperparameters (e.g. learning rate, mini-batch size, training epochs). Consequently, the reproducibility of DL models merits careful investigation.

Finally, validity relates to the truth value of evidence, that is, whether we in fact measured what we intended to. Moreover, validity implies reliability, but not vice versa (Frambach et al., 2013). On the basis of a given ground truth, validity is typically measured using appropriate similarity measures such as the F1 score for detection and Intersection over Union (IoU) for segmentation purposes (Ronneberger et al., 2015; Falk et al., 2019; Caicedo et al., 2019). In addition, the DL community has established widely accepted standards for training models. These comprise, among other things, techniques to avoid overfitting (regularization techniques and cross-validation), tuning hyperparameters, and selecting appropriate metrics for model evaluation. However, these standards do not apply to the training and evaluation of a DL model in the absence of a ground truth, as in the case of fluorescent features.

Taken together and with regard to the discussion about a reproducibility crisis in the fields of biology, medicine and artificial intelligence (Siebert et al., 2015; Baker, 2016; Ioannidis, 2016; Hutson, 2018; Fanelli, 2018; Chen et al., 2019), these limitations indicate that DL could aggravate this crisis by adding even more unknowns and uncertainties to bioimage analyses.

However, the present study asks whether DL, if instantiated in an appropriate manner, also holds the potential to instead enhance the objectivity, reproducibility and validity of bioimage analysis. To tackle this conundrum, we investigated different DL-based strategies on five fluorescence image datasets. We show that training of DL models on the pooled input of multiple human experts utilizing ground truth estimation (consensus models) increases objectivity of fluorescent feature segmentation. Furthermore, we demonstrate that ensembles of consensus models are even capable of enhancing the reliability and validity of bioimage analysis of ambiguous image data, such as fluorescence features in histological tissue sections.

Results

In order to evaluate the impact of DL on bioimage analysis results, we instantiated three exemplary DL-based strategies (Figure 1; strategies color-coded in gray, blue, and orange) and investigated them in terms of objectivity, reliability, and validity of fluorescent feature annotation. The first strategy, expert models (gray), reflects mere automation of the annotation process of fluorescent features in microscopy images. Here, manual annotations of a single human expert are used to train an individual (and hence expert-specific) DL model with a U-Net (Ronneberger et al., 2015) architecture. U-Net and its variants have emerged as the de facto standard for biomedical image segmentation purposes (McQuin et al., 2018; Falk et al., 2019; Caicedo et al., 2019). The second strategy, consensus models (blue), addresses the objectivity of signal annotations. Contrary to the first strategy, simultaneous truth and performance level estimation (STAPLE) (Warfield et al., 2004) is used to estimate a ground truth and create consensus annotations. The estimated ground truth (est. GT) annotation reflects the pooled input of multiple human experts and is therefore thought to be less affected by a potential subjective bias of a single expert. We then train a single U-Net model to create a consensus model. The third strategy, consensus ensembles (orange), seeks to ensure reliability and eventually validity. Going beyond the second strategy, we train multiple consensus U-Net models to create a consensus ensemble. Such model ensembles are known to be more robust to noise (Dietterich, 2000). Hence, we hypothesize that the consensus ensembles mitigate the randomness in the training process. Moreover, deep ensembles are supposed to yield high-quality predictive uncertainty estimates (Lakshminarayanan et al., 2017).
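How the members of a consensus ensemble are combined is not specified in this section. The following minimal sketch assumes that each model outputs a per-pixel foreground probability and that the ensemble averages these maps before thresholding. The function name `ensemble_mask`, the threshold of 0.5, and the toy values are illustrative assumptions, not the study's implementation.

```python
def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities from several models
    and threshold the mean to obtain one binary segmentation mask.

    prob_maps: list of 2D lists (one per model), all of the same shape.
    """
    n_models = len(prob_maps)
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            mean_p = sum(m[y][x] for m in prob_maps) / n_models
            row.append(1 if mean_p >= threshold else 0)
        mask.append(row)
    return mask


# Toy 2x2 probability maps from three hypothetical models.
toy_maps = [
    [[0.9, 0.4], [0.2, 0.6]],
    [[0.8, 0.7], [0.1, 0.4]],
    [[0.7, 0.3], [0.3, 0.2]],
]
toy_mask = ensemble_mask(toy_maps)
print(toy_mask)
```

Averaging before thresholding means that a pixel is only kept if the models agree on it on average, which is one simple way to dampen the run-to-run variation of individually trained models.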

Figure 1

Schematic illustration of bioimage analysis strategies and corresponding hypotheses.

Four bioimage analysis strategies are depicted. Manual (white) refers to manual, heuristic fluorescent feature annotation by a human expert. The three DL-based strategies for automatized fluorescent feature annotation are based on expert models (gray), consensus models (blue) and consensus ensembles (orange). For all DL-based strategies, a representative subset of microscopy images is annotated by human experts. Here, we depict labels of cFOS-positive nuclei and the corresponding annotations (pink). These annotations are used in either individual training datasets (gray: expert models) or pooled in a single training dataset by means of ground truth estimation from the expert annotations (blue: consensus models, orange: consensus ensembles). Next, deep learning models are trained on the training dataset and evaluated on a holdout validation dataset. Subsequently, the predictions of individual models (gray and blue) or model ensembles (orange) are used to compute binary segmentation masks for the entire bioimage dataset. Based on these fluorescent feature segmentations, quantification and statistical analyses are performed. The expert model strategy enables the automation of a manual analysis. To mitigate the bias from subjective feature annotations in the expert model strategy, we introduce the consensus model strategy. Finally, the consensus ensembles alleviate the random effects in the training procedure and seek to ensure reliability and eventually, validity.

For each of the three strategies, we complete the bioimage analysis by performing quantification and hypothesis testing on a typical fluorescent microscopy image dataset (Figure 1—figure supplement 2). These images describe changes in fluorescence signal abundance of a protein called cFOS in brain sections of mice. cFOS is an activity-dependent transcription factor and its expression in the brain can be modified experimentally by behavioral testing of the animals (Gallo et al., 2018). The low signal-to-noise ratio of this label, its broad usage in neurobiology and the well-established correlation of its abundance with behavioral paradigms render it an ideal bioimage dataset to test our hypotheses (Shuvaev et al., 2017; Gallo et al., 2018).

Consensus ensembles yield the best results for validity and reproducibility metrics

The primary goal in bioimage analysis is to rigorously test a biological hypothesis. To leverage the potentials of DL models within this procedure, we need to trust our model – by establishing objectivity, reliability, and validity. Pertaining to the case of fluorescent labels, validity (measuring what is intended to be measured) requires objectivity to know what exactly we intend to measure in the absence of a ground truth. Similarly, reliability in terms of repeatability and reproducibility is a prerequisite for a valid and trustworthy model. Starting from the expert model strategy, we seek to establish objectivity (consensus models) and, successively, reliability and validity in the consensus ensemble strategy. In the following analysis, we first turn toward a comprehensive evaluation of the objectivity and its relation to validity before moving on to the concept of reliability.

To assess the three different strategies, a training dataset of 36 images and a test set of nine microscopy images (1024 × 1024 px, 1.61 px/µm, on average ∼35 nuclei per image, see also Figure 1—figure supplement 2) showing cFOS immunoreactivity were manually annotated by five independent experts (experts 1–5). In the absence of a rigorously objective ground truth, we used STAPLE (Warfield et al., 2004) to compute an estimated ground truth (est. GT) based on all expert annotations for each image. First, we trained a set of DL models on the 36 training images and corresponding annotations, either made by an individual human expert or as reflected in the est. GT (see Materials and methods for the data set and detailed training, evaluation and model selection strategy). Then, we used our test set to evaluate the segmentation (Mean IoU) and detection (F1 score) performance of human experts and all trained models by means of similarity analysis on the level of individual images.
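To make the idea of ground truth estimation more concrete, the following is a simplified sketch of binary STAPLE as an expectation-maximization loop. It alternates between estimating a per-pixel foreground probability and re-estimating each annotator's sensitivity and specificity. The function name `staple_binary`, the initial values, the fixed global prior, and the toy annotations are assumptions made for illustration; this is not the implementation used in the study.

```python
def staple_binary(decisions, n_iter=50, init=0.9, eps=1e-6):
    """Simplified binary STAPLE via expectation-maximization.

    decisions: one flat list of 0/1 pixel labels per annotator.
    Returns per-pixel foreground probabilities and each annotator's
    estimated sensitivity and specificity.
    """
    n_raters, n_pix = len(decisions), len(decisions[0])
    # Assumption: the foreground prior is the pooled annotated fraction.
    prior = sum(sum(d) for d in decisions) / (n_raters * n_pix)
    sens = [init] * n_raters
    spec = [init] * n_raters
    w = [prior] * n_pix
    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is foreground.
        for i in range(n_pix):
            fg, bg = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i]:
                    fg *= sens[j]
                    bg *= 1.0 - spec[j]
                else:
                    fg *= 1.0 - sens[j]
                    bg *= spec[j]
            w[i] = fg / (fg + bg)
        # M-step: update annotator performance, clamped away from 0 and 1.
        total_fg = max(sum(w), eps)
        total_bg = max(n_pix - sum(w), eps)
        for j in range(n_raters):
            hit = sum(w[i] for i in range(n_pix) if decisions[j][i])
            rej = sum(1.0 - w[i] for i in range(n_pix) if not decisions[j][i])
            sens[j] = min(max(hit / total_fg, eps), 1.0 - eps)
            spec[j] = min(max(rej / total_bg, eps), 1.0 - eps)
    return w, sens, spec


# Three hypothetical annotators labeling the same six pixels.
annotations = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],  # a conservative annotator
    [1, 1, 1, 1, 0, 0],  # a liberal annotator
]
probs, sens, spec = staple_binary(annotations)
est_gt = [int(p >= 0.5) for p in probs]
print(est_gt)
```

In contrast to a plain majority vote, the estimate weights each annotator by the performance inferred for them, so a consistently conservative or liberal annotator has less influence on ambiguous pixels.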

For the pairwise comparison of annotations (segmentation masks), we calculated the intersection over union (IoU) for all overlapping pairs of ROIs between two segmentation masks (Figure 2A; see 7.9.1 Segmentation and detection). Following Maška et al., 2014, we consider two ROIs with an IoU of at least 0.5 as matching and calculated the F1 score $M_{F1 score}$ as the harmonic mean of precision and recall (Figure 2B; see 7.9.1 Segmentation and detection). As bioimaging studies predominantly use measures related to counting ROIs in their analyses, we also focused on the feature detection performance ( $M_{F1 score}$ ). The color coding (gray, blue, orange) introduced in Figure 2C refers to the different strategies depicted in Figure 1 and applies to all figures, if not indicated otherwise.
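The matching rule described above can be sketched as follows. ROIs are represented as sets of pixel coordinates, two ROIs count as matching at an IoU of at least `t`, and the F1 score is the harmonic mean of precision and recall. The one-to-one greedy matching, the function names, and the toy ROIs are assumptions for illustration.

```python
def iou(roi_a, roi_b):
    """Intersection over union of two ROIs given as sets of (y, x) pixels."""
    union = len(roi_a | roi_b)
    return len(roi_a & roi_b) / union if union else 0.0


def f1_score(ref_rois, pred_rois, t=0.5):
    """Detection F1 score between reference and predicted ROIs.

    Each reference ROI is matched to at most one predicted ROI
    (greedy, highest IoU first) and counts as detected if IoU >= t.
    """
    unmatched = list(range(len(pred_rois)))
    true_pos = 0
    for ref in ref_rois:
        best = max(unmatched, key=lambda k: iou(ref, pred_rois[k]), default=None)
        if best is not None and iou(ref, pred_rois[best]) >= t:
            true_pos += 1
            unmatched.remove(best)
    false_pos = len(pred_rois) - true_pos
    false_neg = len(ref_rois) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    # F1 is undefined when both masks are empty; 0.0 is returned here by choice.
    return 2 * true_pos / denom if denom else 0.0


# Toy example: two reference ROIs, three predicted ROIs.
ref = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (1, 0)}, {(5, 6), (5, 7), (5, 8)}, {(9, 9)}]
toy_f1 = f1_score(ref, pred)
print(round(toy_f1, 3))
```

In the toy example only the first reference ROI is matched (IoU 0.75), so precision is 1/3 and recall is 1/2.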

Figure 2

Similarity analysis of fluorescent feature annotations by manual or DL-based strategies.

(A) Representative example of IoU ( $M_{IoU}$ ) calculations on a field of view (FOV) in a bioimage. Image raw data show the labeling of cFOS in a maximum intensity projection image of the CA1 region in the hippocampus (brightness and contrast enhanced). The estimated ground truth (est. GT) annotations (green), derived from the annotations of five expert neuroscientists, are compared to those of one human expert, an expert model, a consensus model, and a consensus ensemble (magenta, respectively). IoU results of two ROIs are shown in detail for each comparison (magnification of cyan box). Scale bar: 100 µm. (B) F1 score ( $M_{F1 score}$ ) calculations on the same FOV as shown in (A). The est. GT annotations (green; 53 ROIs) are compared to those of a consensus ensemble (magenta; 48 ROIs). IoU-based matching of ROIs at an IoU-threshold of $t = 0.5$ is depicted in three magnified subregions of the image (cyan boxes 1-3). Scale bar: 100 µm. (C–H) All comparisons are performed exclusively on a separate image test set which was withheld from model training and validation. (C) Color coding refers to the individual strategies, as introduced in Figure 1: white: manual approach, gray: expert models, blue: consensus models, orange: consensus ensembles. (D) $M_{F1 score}$ between individual manual expert annotations and their overall reliability of agreement given as the mean of Fleiss' $κ$ . (E) $M_{F1 score}$ between annotations predicted by individual models and the annotations of the respective expert (or est. GT), whose annotations were used for training. $N_{models per expert} = 4$ . (F) $M_{F1 score}$ between manual expert annotations, the respective expert models, consensus models, and consensus ensembles compared to the est. GT as reference. A horizontal line denotes human expert average. $N_{models} = 4$ , $N_{ensembles} = 4$ . (G) Means of $M_{F1 score}$ of the individual DL-based strategies and of the human expert average compared to the est. GT plotted for different IoU matching thresholds $t$ . A dashed line indicates the default threshold $t = 0.5$ . $N_{models} = 4$ , $N_{ensembles} = 4$ . (H) Annotation reliability of the individual strategies assessed as the similarities between annotations within the respective strategy. We calculated ${\bar{M}}_{IoU}$ , $M_{F1 score}$ and Fleiss' $κ$ . $N_{experts} = 5$ , $N_{models} = 4$ , $N_{ensembles} = 4$ .

To better grasp the difficulties in annotating cFOS-positive nuclei as fluorescent features in these images, we first compared manual expert annotations (Figure 2D). The analysis revealed substantial differences between the annotations of the different experts and showed varying inter-rater agreement (Schmitz et al., 1999; Collier et al., 2003; Niedworok et al., 2016). The level of inter-rater variability was inversely correlated with the relative signal intensities (Figure 2—figure supplement 1; Niedworok et al., 2016).

By comparing the annotations of the expert models (gray) to the annotations of the respective expert (Figure 2E), we observed a higher $M_{F1 score}$ median compared to the inter-rater agreement (Figure 2D) in the majority of cases. Furthermore, comparing the similarity analysis results of human experts with those of their respective expert-specific models revealed that they are closely related (Figure 2F, Figure 2—figure supplement 3, and Figure 2—figure supplement 4). As pointed out by von Chamier et al., 2019, this indicates that our expert models are able to learn and reproduce the annotation behavior of the individual experts. This becomes particularly evident in the annotations of the DL models trained on expert 1 (Figure 2F, Figure 2—figure supplement 3, and Figure 2—figure supplement 4).

Overall, the expert models yield a lower similarity to the est. GT compared to the consensus models (blue) or consensus ensembles (orange). Notably, both consensus models and consensus ensembles perform on par with human experts. Here, the consensus ensembles outperform all other strategies, even at varying IoU thresholds (Figure 2F and Figure 2G).

In order to test for reliability of our analysis, we measured the repeatability and reproducibility of fluorescent feature annotation of our DL strategies. We assumed that the repeatability is assured for all our strategies due to the deterministic nature of our DL models (unchanged conditions imply unchanged model weights). Hence, our evaluation was focused on the reproducibility, meaning the impact of the stochastic training process on the output. Inter-expert and inter-model comparisons within each strategy unveiled a better performance of the consensus ensembles strategy concerning both detection ( $M_{F1 score}$ ) and segmentation ( ${\bar{M}}_{IoU}$ ) of the fluorescent features (Figure 2H). Calculating the Fleiss’ kappa value (Fleiss and Cohen, 1973) revealed that consensus ensemble annotations show a high reliability of agreement (Figure 2H). Following the Fleiss’ kappa interpretation from Landis and Koch, 1977, the results for the consensus ensembles indicate a substantial or almost perfect agreement. In contrast, the Fleiss’ kappa values for human experts refer to a fair agreement while the results for the alternative DL strategies lead to a moderate agreement (Figure 2H).
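For readers unfamiliar with the statistic, Fleiss' kappa and the Landis and Koch interpretation bands can be sketched as below. The input is a table with one row per rated item and one column per category, holding the number of raters who chose that category. Whether items correspond to pixels or ROIs is not specified in this section, and the function names and toy tables are illustrative assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = number of raters who
    assigned item i to category j (same number of raters per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_obs - p_exp) / (1.0 - p_exp)


def landis_koch(kappa):
    """Verbal interpretation of kappa following Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Five raters, categories [background, foreground].
perfect = [[5, 0], [0, 5], [5, 0], [0, 5]]
split = [[3, 2], [2, 3]]
print(fleiss_kappa(perfect), landis_koch(fleiss_kappa(perfect)))
print(fleiss_kappa(split), landis_koch(fleiss_kappa(split)))
```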

In summary, the similarity analysis of the three strategies shows that training of DL models solely on the input of a single human expert poses a high risk of incorporating an intrinsic bias and therefore resembles, as hypothesized, a mere automation of manual image annotation. Both consensus models and consensus ensembles perform on par with human experts regarding the similarity to the est. GT, but the consensus ensembles yield by far the best results regarding their reproducibility. We conclude that, in terms of similarity metrics, only the consensus ensemble strategy meets the bioimaging standards for objectivity, reliability, and validity.

Consensus ensembles yield reliable bioimage analysis results

Similarity analysis is indispensable for assessing the quality of a model’s output, that is, the predicted segmentations (Ronneberger et al., 2015; Caicedo et al., 2019; Falk et al., 2019). However, the primary goal of bioimage analysis is the unbiased quantification of distinct image features that correlate with experimental conditions. So far, it has remained unclear whether objectivity, reliability, and validity for bioimage analysis can be inferred directly from similarity metrics.

In order to systematically address this question, we used our image dataset to quantify the abundance of cFOS in brain sections of mice after Pavlovian contextual fear conditioning. It is well established in the neuroscientific literature that rodents show changes in the distribution and abundance of cFOS in a specific brain region, namely the hippocampus, after processing information about places and contexts (Keiser et al., 2017; Campeau et al., 1997; Huff et al., 2006; Ramamoorthi et al., 2011; Tayler et al., 2013; Murawski et al., 2012; Guzowski et al., 2001). Consequently, our experimental dataset offered us a second line of evidence, the objective analysis of mouse behavior, in addition to the changes of fluorescent features to validate the bioimage analysis results of our DL-based strategies.

Our dataset comprised three experimental groups (Figure 3A). In one group, mice were directly taken from their homecage as naive learning controls (H). In the second group, mice were re-exposed to a previously explored training context as context controls (C-). Mice in the third group underwent Pavlovian fear conditioning and were also re-exposed to the training context (C+) (Figure 3A). These three groups of mice showed different behavioral responses. For instance, fear (threat; LeDoux, 2014) conditioned mice (C+) showed increased freezing behavior after fear acquisition and showed strong freezing responses when re-exposed to the training context 24 hr later (Figure 3—figure supplement 1). After behavioral testing, brain sections of the different groups of mice were prepared and labeled for the neuronal activity-related protein cFOS by indirect immunofluorescence. Sections were also labeled with the neuronal marker NeuN (Fox3), allowing the anatomical identification of hippocampal subregions of interest. Images were acquired as confocal microscopy image stacks (x,y-z) and maximum intensity projections were used for subsequent bioimage analysis (Figure 1—figure supplement 2). Overall, we quantified the number of cFOS-positive nuclei and their mean signal intensity in five regions of the dorsal hippocampus (DG as a whole, suprapyramidal DG, infrapyramidal DG, CA3, and CA1), and tested for significant differences between the three experimental groups (Figure 3B–D). To extend this analysis beyond hypothesis testing at a certain significance level, we calculated the effect size ( $η^{2}$ ) for each of these 30 pairwise comparisons.
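This section does not state how $η^{2}$ is derived from a pairwise test. A common convention for rank-based two-group comparisons, assumed here, is $η^{2} = Z^{2}/N$, where $Z$ is the normal approximation of the Mann-Whitney-U statistic and $N$ the total number of observations. The sketch below follows that convention without tie or continuity correction; the function names and toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def eta_squared_mwu(group_a, group_b):
    """Effect size eta^2 = Z^2 / N from a Mann-Whitney-U comparison.

    Assumption: Z uses the normal approximation of U without tie or
    continuity correction.
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return z * z / (n_a + n_b)


# Toy counts of labeled nuclei per image for two hypothetical groups.
eta_separated = eta_squared_mwu([1, 2, 3], [4, 5, 6])
eta_identical = eta_squared_mwu([1, 2], [1, 2])
print(round(eta_separated, 3), eta_identical)
```

Reporting an effect size alongside the p-value allows analyses to be compared even when all of them cross the same significance threshold.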

Figure 3

Application of different DL-based strategies for fluorescent feature annotation.

The figure introduces how three DL-based strategies are applied for annotation of a representative fluorescent label, here cFOS, in a representative image data set. Raw image data show behavior-related changes in the abundance and distribution of the protein cFOS in the dorsal hippocampus, a brain center for encoding of context-dependent memory. (A) Three experimental groups were investigated: Mice kept in their homecage (H), mice that were trained to a context, but did not experience an electric foot shock (C-) and mice exposed to five foot shocks in the training context (C+). 24 hr after the initial training (TR), mice were re-exposed to the training context for memory retrieval (RET). Memory retrieval induces changes in cFOS levels. (B–D) Brightness and contrast enhanced maximum intensity projections showing cFOS fluorescent labels of the three experimental groups (H, C-, C+) with representative annotations of a consensus ensemble, for each hippocampal subregion. The annotations are used to quantify the number of cFOS-positive nuclei for each image (#) per mm² and their mean signal intensity (mean int., in bit-values) within the corresponding image region of interest, here the neuronal layers in the hippocampus (outlined in cyan). In B: granule cell layer (supra- and infrapyramidal blade), dotted line: suprapyramidal blade, solid line: infrapyramidal blade. In C: pyramidal cell layer of CA3; in D: pyramidal cell layer in CA1. Scale bars: 200 µm. (E) Analyses of cFOS-positive nuclei per mm², representatively shown for stratum pyramidale of CA1. Corresponding effect sizes are given as $η^{2}$ for each pairwise comparison. Two quantification results are shown for each strategy and were selected to represent the lowest (model 1 or ensemble 1) and highest (model 2 or ensemble 2) effect sizes (increase in cFOS) reported within each annotation strategy. Total analyses performed: $N_{expert models} = 20$ , $N_{consensus models} = 36$ , $N_{consensus ensembles} = 9$ . Number of analyzed mice (N) and images (n) per experimental condition: $N_{H} = 7$ , $N_{C-} = 7$ , $N_{C+} = 6$ ; $n_{H} = 36$ , $n_{C-} = 32$ , $n_{C+} = 28$ . ***p<0.001 with Mann-Whitney-U test. Statistical data are available in Figure 3—source data 1.

Figure 3—source data 1: Source files for analyses of cFOS-positive nuclei in CA1. https://cdn.elifesciences.org/articles/59780/elife-59780-fig3-data1-v2.zip

We illustrate our metrics with the detailed quantification of cFOS-positive nuclei in the stratum pyramidale of CA1 as a representative example and show two analyses for each DL strategy (Figure 3E). These two examples represent those two models of each strategy that yielded the lowest and the highest effect sizes, respectively (Figure 3E). Despite a general consensus of all models and ensembles on a context-dependent increase in the number of cFOS-positive nuclei, these quantifications already indicate that the variability of effect sizes decreases from expert models to consensus models and is lowest for consensus ensembles (Figure 3E).

The analysis in Figure 4 allows us to further explore the impact of the different DL strategies on the bioimage analysis results for each hippocampal subregion. Here, we display a high-level comparison of the effect sizes and corresponding significance levels of 20 independently trained expert models (4 per expert), 36 consensus models, and 9 consensus ensembles (each derived from four consensus models). In contrast to the detailed illustration of selected models in Figure 3E, Figure 4A, for instance, summarizes the results for all analyses of the stratum pyramidale of CA1. As indicated before, all models and ensembles show a highly significant context-dependent increase in the number of cFOS-positive nuclei, but also a notable variation in effect sizes for both expert and consensus models. Moreover, we identify a significant context-dependent increase in the mean signal intensity of cFOS-positive nuclei for all consensus models and ensembles. The expert models, by contrast, yield a high variation in effect sizes at different significance levels. Interestingly, all four expert models trained on the annotations of expert 1 (and two other expert models only in the case of H vs. C+) did not yield a significant increase, indicating that expert 1’s annotation behavior was incorporated into the expert-1-specific models and that this also affects the bioimage analysis results (Figure 4A).

Figure 4

Consensus ensembles significantly increase reliability of bioimage analysis results.

(**A–E**) Single data points represent the calculated effect sizes for each pairwise comparison of all individual bioimage analyses for each DL-based strategy (gray: expert models, blue: consensus models, orange: consensus ensembles) in indicated hippocampal subregions. Three horizontal lines separate four significance intervals (n.s.: not significant, *: 0.05 ≥ p>0.01, **: 0.01 ≥ p>0.001, ***: p ≤ 0.001 after Bonferroni correction for multiple comparisons). The quantity of analyses of each strategy that report the respective statistical result of the indicated pairwise comparison (effect, x-axis) at a level of p ≤ 0.05 are given below each pairwise comparison in the corresponding color coding. In total, we performed all analyses with: N_{expert models} = 20, N_{consensus models} = 36, N_{consensus ensembles} = 9. Number of analyzed mice (N) for all analyzed subregions: N_H = 7, N_C- = 7, N_C+ = 6. Numbers of analyzed images (n) are given for each analyzed subregion. Source files including source data and statistical data are available in Figure 4—source data 1. (A) Analyses of cFOS-positive nuclei in stratum pyramidale of CA1. n_H = 36, n_C- = 32, n_C+ = 28. (B) Analyses of cFOS-positive nuclei in stratum pyramidale of CA3. n_H = 35, n_C- = 31, n_C+ = 28. (C) Analyses of cFOS-positive nuclei in the granule cell layer of the whole DG. n_H = 35, n_C- = 31, n_C+ = 27. (D) Analyses of cFOS-positive nuclei in the granule cell layer of the suprapyramidal blade of the DG. n_H = 35, n_C- = 31, n_C+ = 27. (E) Analyses of cFOS-positive nuclei in the granule cell layer of the infrapyramidal blade of the DG. n_H = 35, n_C- = 31, n_C+ = 27. (F) Reliability of bioimage analysis results are assessed as *variation per effect* (left side) and *variation per model* (right side). For the *variation per effect*, single data points represent the standard deviation of reported effect sizes ( $η^{2}$ ), calculated within each DL-based strategy for each of the 30 pairwise comparisons. 
Consensus ensembles show significantly lower standard (std.) deviations of $η^{2}$ per pairwise comparison compared to alternative strategies (X²(2) = 26.472, p<0.001, N_effects = 30, Kruskal-Wallis ANOVA followed by pairwise Mann-Whitney tests with Bonferroni correction, *p<0.05, ***p<0.001). For the *variation per model*, the standard deviation of centered $η^{2}$ across all pairwise comparisons was calculated for each individual model and ensemble (y-axis). In addition, the number of deviations from the congruent majority vote (at p ≤ 0.05 after Bonferroni correction for multiple comparisons) was determined for each individual model and ensemble across all pairwise comparisons (x-axis). Visualizing the interaction of both measures for each model or model ensemble individually reveals that consensus ensembles show the highest reliability of all three DL-based strategies. The statistical data for the variation per effect are available in Figure 4—source data 2.

Figure 4—source data 1 Source files for the analysis of cFOS positive nuclei in the hippocampal subregions.: https://cdn.elifesciences.org/articles/59780/elife-59780-fig4-data1-v2.zip
Figure 4—source data 2 Statistical data for the variation per effect.: https://cdn.elifesciences.org/articles/59780/elife-59780-fig4-data2-v2.zip

The meta-analysis reveals a context-dependent increase of cFOS in almost all analyzed hippocampal regions (Figure 4A–D), except for the infrapyramidal blade of the dentate gyrus (Figure 4E). Notably, the majority votes of all three strategies at a significance level of p ≤ 0.05 (after Bonferroni correction for multiple comparisons) are identical for each pairwise comparison (Figure 4A–E). However, the results can vary between individual models or ensembles (Figure 4A–E).

In order to assess the reliability of bioimage analysis results of the individual strategies, we further examined the variation per effect and variation per model in Figure 4F. For the variation per effect, we calculated the standard deviation of reported effect sizes within each strategy for every pairwise comparison (effect). This confirmed the visual impression from Figure 4A–E as the consensus ensembles yield a significantly lower standard deviation compared to both alternative strategies (Figure 4F). To illustrate the variation per model, we show the interaction between the number of biological effects that the corresponding model (or ensemble) reported differently compared to the congruent majority votes versus the standard deviation of its centered effect sizes across all 30 analyzed effects. This analysis shows that no expert model detected all biological effects in the microscopy images that were defined by the majority votes of all models. This is in stark contrast to the consistency of effect interpretation across the consensus ensembles (Figure 4F).
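The two reliability measures can be illustrated with a minimal Python sketch. It assumes that centering means subtracting the mean $η^{2}$ of each pairwise comparison across the models passed in, that the sample standard deviation is used, and that the majority vote is taken over the same pool of models. The function names and toy values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def variation_per_effect(eta2_by_model):
    """Std. of eta^2 across models, one value per pairwise comparison (effect)."""
    n_effects = len(eta2_by_model[0])
    return [stdev([m[i] for m in eta2_by_model]) for i in range(n_effects)]


def variation_per_model(eta2_by_model):
    """Std. of centered eta^2 across all effects, one value per model.

    Assumption: centering subtracts the mean eta^2 of each effect across models.
    """
    n_effects = len(eta2_by_model[0])
    effect_means = [mean([m[i] for m in eta2_by_model]) for i in range(n_effects)]
    return [
        stdev([m[i] - effect_means[i] for i in range(n_effects)])
        for m in eta2_by_model
    ]


def deviations_from_majority(significant_by_model):
    """Count, per model, the effects on which it disagrees with the majority vote."""
    n_models = len(significant_by_model)
    n_effects = len(significant_by_model[0])
    majority = [
        sum(m[i] for m in significant_by_model) * 2 > n_models
        for i in range(n_effects)
    ]
    return [
        sum(m[i] != majority[i] for i in range(n_effects))
        for m in significant_by_model
    ]


# Hypothetical toy data: three models, three pairwise comparisons.
eta2 = [[0.30, 0.10, 0.02], [0.34, 0.12, 0.02], [0.20, 0.02, 0.02]]
significant = [[True, True, False], [True, True, False], [True, False, False]]

per_effect = variation_per_effect(eta2)
per_model = variation_per_model(eta2)
deviations = deviations_from_majority(significant)
```

In this toy example, the third model deviates once from the majority vote and also shows the largest spread of centered effect sizes.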

Consequently, we conclude that the consensus ensemble strategy is best suited to satisfy the bioimaging standards for objectivity, reliability, and validity.

Applicability of consensus ensemble strategy for the bioimage analysis of external data sets

Bioimage analysis of fluorescent labels comes with a huge variability in terms of investigated model organisms, analyzed fluorescent features and applied image acquisition techniques (Meijering et al., 2016). In order to assess our consensus ensemble strategy across these varying parameters, we tested it on four external datasets that were created in a fully independent manner and according to individual protocols (Lab-Mue, Lab-Inns1, Lab-Inns2, and Lab-Wue2; see Materials and methods and Figure 5—source data 2). Due to limited dataset sizes, the lab-specific training datasets consisted of just five microscopy images each and the corresponding est. GT based on the annotations from multiple experts. In the biomedical research field, the limited availability of training data is a common problem when training DL algorithms. For this reason, extensive data augmentation and regularization techniques, as well as transfer learning strategies are widely used to cope with small datasets (Ronneberger et al., 2015; Christiansen et al., 2018; Falk et al., 2019). Transfer learning is a technique that enables DL models to reuse the image feature representations learned on another source, such as a task (e.g. image segmentation) or a domain (e.g. the fluorescent feature, here cFOS-positive nuclei). This is particularly advantageous when applied to a task or domain where limited training data is available (Yosinski et al., 2014; Oquab et al., 2014). Moreover, transfer learning might be used to reduce observer variability and to increase feature annotation objectivity (Bayramoglu and Heikkilä, 2016). There are typically two ways to implement transfer learning for DL models, either by fine-tuning or by freezing features (i.e. model weights) (Yosinski et al., 2014). The latter approach, if applied to the same task (e.g. image segmentation), does not require any further model training. 
These out-of-the-box models reduce time and hardware requirements and may further increase objectivity of image analysis, by altogether excluding the need for any additional manual input.
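The difference between the three ways of starting a model can be sketched with a deliberately tiny example. This is a one-parameter linear model fitted by gradient descent, not a U-Net, and all values (the data, the pretrained weight, the learning rate, the tolerance, and a fixed start at 0.0 in place of random initialization) are hypothetical choices for illustration only.

```python
def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, xs, ys, lr=0.05, tol=1e-4, max_steps=1000):
    """Plain gradient descent; returns the final weight and the steps used."""
    steps = 0
    while mse(w, xs, ys) > tol and steps < max_steps:
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        steps += 1
    return w, steps


# Hypothetical "new" dataset, similar but not identical to the source data.
xs = [1.0, 2.0, 3.0]
ys = [2.2 * x for x in xs]

# Hypothetical weight learned earlier on a similar dataset.
pretrained_w = 2.0

# From scratch: start from an uninformed weight (fixed at 0.0 for determinism).
_, scratch_steps = train(0.0, xs, ys)

# Fine-tuned: start from the pretrained weight and continue training.
_, fine_tuned_steps = train(pretrained_w, xs, ys)

# Frozen: no training at all, the pretrained weight is only evaluated.
frozen_loss = mse(pretrained_w, xs, ys)
```

Under these assumptions, the fine-tuned start needs fewer steps than the start from scratch, while the frozen weight is usable only to the extent that its error on the new data is acceptable, which is why such models still have to be validated.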

Consequently, we hypothesized that transfer learning from pretrained model ensembles would substantially reduce the training efforts (Falk et al., 2019) and might even increase objectivity of bioimage analysis. To test this, we followed three different initialization variants of the consensus ensemble strategy (Figure 5A). In addition to starting the training of DL models with randomly initialized weights (Figure 5A - from scratch), we reused the consensus ensemble weights from the previous evaluation (Lab-Wue1) by means of fine-tuning (Figure 5A - fine-tuned) and freezing of all model layers (Figure 5A - frozen). Although no training of the frozen model is required, we tested and evaluated the performance of frozen models to ensure their validity. After performing the similarity analysis, we compared the full bioimage analyses, including quantification and hypothesis testing, of the different initialization variants. Finally, to establish a notion of external validity, we also compared these results with the manually and independently performed bioimage analysis of a lab-specific expert (Figure 5, Figure 5—figure supplement 1, and Figure 5—figure supplement 2).

Figure 5


Consensus ensembles for DL-based feature annotation in external bioimage data sets.

(A) Schematic overview depicting three initialization variants for creating consensus ensembles on new datasets. Data annotation by multiple human experts and subsequent ground truth estimation are required for all three initialization variants. In the *from scratch* variant, a U-Net model with randomly initialized weights is trained on pairs of microscopy images and estimated ground truth annotations. This variant was used to create consensus ensembles for the initial *Lab-Wue1* dataset. Alternatively, the same training dataset can be used to adapt a U-Net model with pretrained weights by means of transfer learning (*fine-tuned*). In both variants, models are evaluated and selected on the basis of a validation set after model training. In a third variant, U-Net models with pretrained weights can be evaluated directly on a validation dataset, without further training (*frozen*). In all three variants, consensus ensembles of the respective models are then used for bioimage analysis. (B) Overall reliability of bioimage analysis results of each variant assessed as variation per effect. In all three variants, consensus ensembles (orange) showed lower standard deviations than consensus models (blue). The *frozen* results need to be considered with caution as they are based on models that did not meet the selection criterion (see Figure 5—source data 3). N_{pairwise comparisons} = 6; N_{consensus models} = 15, and N_{consensus ensembles} = 3 for each variant. (**C–E**) Detailed comparison of the two external datasets with highest (*Lab-Mue*) and lowest (*Lab-Wue2*) similarity to *Lab-Wue1*. (C) Representative microscopy images. Orange: representative annotations of a lab-specific *from scratch* consensus ensemble. PVT: paraventricular nucleus of the thalamus, eRet: early retrieval, lRet: late retrieval, HB: hindbrain, wt: wildtype, kd: *gad1b* knock-down. Scale bars: *Lab-Mue* 100 µm and *Lab-Wue2* 6 µm. 
(D) Mean $M_{F1 score}$ of *from scratch* (solid line) and *fine-tuned* (dashed line) consensus models on the validation dataset over the course of training (iterations). Mean $M_{F1 score}$ of *frozen* consensus models are indicated with arrows. Box plots show the $M_{F1 score}$ among the annotations of human experts as reference and the mean $M_{F1 score}$ of selected consensus models. Two dotted horizontal lines mark the whisker ends of the $M_{F1 score}$ among the human expert annotations. (E) Effect sizes of all individual bioimage analyses (black: manual experts, blue: consensus models, orange: consensus ensembles). Three horizontal lines separate the significance intervals (n.s.: not significant, *: 0.05 ≥ p>0.01, **: 0.01 ≥ p>0.001, ***: p ≤ 0.001 with Mann-Whitney-U tests). *Lab-Mue*: N_{consensus ensembles} = 3 for all initialization variants; N_{from scratch/fine-tuned consensus models} = 12 (for each ensemble, 4/5 trained models per ensemble met the selection criterion), N_{frozen consensus models} = 12 (for each ensemble, 4/4 models per ensemble did not meet the selection criterion). N_eRet = 4, N_lRet = 4; n_eRet = 12, n_lRet = 11. *Lab-Wue2*: N_{consensus ensembles} = 3 for each initialization variant; N_{from scratch/fine-tuned consensus models} = 15 (for each ensemble, 5/5 trained models per ensemble met the selection criterion), N_{frozen consensus models} = 12 (for each ensemble, 4/4 models per ensemble did not meet the selection criterion). N_wt = 5, N_kd = 4, n_wt = 20, n_kd = 15. Source files of all statistical analyses (including Figure 5—figure supplement 1 and Figure 5—figure supplement 2) are available in Figure 5—source data 1. Information on all bioimage datasets (e.g. the number of images, image resolution, imaging techniques, etc.) is available in Figure 5—source data 2. Source files on model performance and selection are available in Figure 5—source data 3.

Figure 5—source data 1 Statistical data for Lab-Mue, Lab-Wue2, Lab-Inns1, and Lab-Inns2.: https://cdn.elifesciences.org/articles/59780/elife-59780-fig5-data1-v2.zip
Figure 5—source data 2 Characteristics of all five bioimage datasets.: https://cdn.elifesciences.org/articles/59780/elife-59780-fig5-data2-v2.docx
Figure 5—source data 3 Model performance with selection criterion for Lab-Mue, Lab-Wue2, Lab-Inns1, and Lab-Inns2.: https://cdn.elifesciences.org/articles/59780/elife-59780-fig5-data3-v2.zip

Dataset characteristics

The first dataset (Lab-Mue) has image parameters very similar to those of our original Lab-Wue1 dataset (Figure 5C - Lab-Mue and Figure 5—source data 2). Mice experienced restraint stress and subsequent Pavlovian fear conditioning (cue-conditioning, tone-footshock association) and the number of cFOS-positive cells in the paraventricular thalamus (PVT) was compared between early (eRet) and late (lRet) phases of fear memory retrieval. In the context of transfer learning, this dataset originates from a very similar domain and requires the same task (image segmentation). Another two external datasets are focused on the quantification of cFOS abundance (similar domain), albeit showing less similarity in image parameters to our initial dataset (Figure 5—figure supplement 1, Figure 5—figure supplement 2 and Figure 5—source data 2). In Lab-Inns1, mice underwent Pavlovian fear conditioning and extinction in the same context. The image dataset of Lab-Inns2 shows cFOS immunoreactivity in the infralimbic cortex (IL) following fear renewal, meaning return of extinguished fear in a context different from the extinction training context. Since heterogeneity in this behavioral response was observed, mice were classified as responders (Resp) or non-responders (nResp), based on freezing responses (see Materials and methods). The image dataset of Lab-Wue2 shows the least similarity of image parameters to the dataset of Lab-Wue1. This dataset represents another commonly used model organism in neurobiology, the zebrafish. Here, cell bodies of specific neurons (GABAergic neurons) instead of nuclei were fluorescently labeled (Figure 5C - Lab-Wue2 and Figure 5—source data 2). Hence, this dataset originates from a different domain but was acquired using the same technique.

Similarity analysis

As only limited training data was available, we executed the similarity analysis for all external datasets by means of a k-fold cross-validation. We observed that the inter-rater variability differed between laboratories and between experts but remained comparable to that previously observed for Lab-Wue1 (Figure 5D, Figure 5—figure supplement 3, and Figure 5—figure supplement 4). Both from scratch and fine-tuned initialization variants resulted in individual consensus models that reached human expert level performance (Figure 5D, Figure 5—figure supplement 1, Figure 5—figure supplement 2). However, models adapted from pretrained weights yielded a higher validity in terms of similarity to the estimated ground truth. They either exceeded the maximal $M_{F1 score}$ reached by from scratch models (Figure 5D - Lab-Mue, Figure 5—figure supplement 1, Figure 5—figure supplement 2) or reached it after fewer training iterations (Figure 5D - Lab-Wue2). As expected, the performance of frozen Lab-Wue1-specific consensus models was highly dependent on the image similarity between the original and the new dataset. Consequently, the out-of-the-box segmentation performance of the frozen Lab-Wue1 models was very poor on dissimilar images (Figure 5D - Lab-Wue2), but we found it to be on par with human experts and adapted models on images that are highly similar to the original dataset (Figure 5D - Lab-Mue - very similar domain and the same task).
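How a k-fold cross-validation partitions a very small training set can be sketched as follows. The value of k is not stated in this section, so k = 5 (one held-out image per fold for a five-image dataset) is an assumption, and the image names and the helper `k_fold_splits` are hypothetical.

```python
def k_fold_splits(items, k):
    """Yield (training, validation) lists so that each item is validated once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


# Hypothetical names for the five annotated images of one laboratory.
images = [f"image_{i}" for i in range(1, 6)]

# Assumption: k = 5, i.e. one image is held out for validation in each fold.
splits = list(k_fold_splits(images, k=5))

for training, validation in splits:
    print("train on", training, "validate on", validation)
```

Each image serves exactly once as validation data, so every model is evaluated on an image it was not trained on even though only five annotated images exist.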

Bioimage analysis results

To further strengthen the validity of our workflow, we compared all DL-based bioimage analyses to the manual analysis of a human expert from the individual laboratory (Figure 5E, Figure 5—figure supplement 1, Figure 5—figure supplement 2, and Table 1).

Table 1

Bioimage analyses results of external datasets.

Data are based either on manual analysis or on annotations by a consensus ensemble. The results are given for the individual consensus ensemble initialization variants (from scratch, fine-tuned). p-Values of Lab-Inns2 are corrected for multiple comparisons using Bonferroni correction. $μ_{1}$ : mean group 1, $μ_{2}$ : mean group 2, U: U-statistic, eRet: early retrieval, lRet: late retrieval, Ctrl: control, Ext: extinction, Sal: saline, Resp: L-DOPA/MS-275 responder, nResp: L-DOPA/MS-275 non-responder, wt: wildtype, kd: gad1b knock-down.

Lab	Groups	Initialization variant	$μ_{1}$	$μ_{2}$	U	Significance level (p)	$η^{2}$
Mue	eRet ∼ lRet	Manual	1.00	1.65	19.0	** (0.002)	0.39
		From scratch	1.00	1.70	25.0	** (0.007)	0.31
		Fine-tuned	1.00	1.68	24.0	** (0.006)	0.32
Inns1	Ctrl ∼ Ext	Manual	1.00	3.92	10.0	** (0.005)	0.43
		From scratch	1.00	2.26	13.0	* (0.010)	0.35
		Fine-tuned	1.00	1.85	14.0	* (0.013)	0.33
Inns2	Sal ∼ Resp	Manual	1.00	1.83	5.0	** (0.002)	0.59
		From scratch	1.00	1.96	0.0	*** (<0.001)	0.71
		Fine-tuned	1.00	2.07	0.0	*** (<0.001)	0.71
	Sal ∼ nResp	Manual	1.00	1.05	27.0	n.s. (1.000)	0.00
		From scratch	1.00	1.63	8.0	n.s. (0.130)	0.29
		Fine-tuned	1.00	1.42	12.0	n.s. (0.377)	0.16
	Resp ∼ nResp	Manual	1.83	1.05	42.0	n.s. (0.130)	0.29
		From scratch	1.96	1.63	41.0	n.s. (0.173)	0.26
		Fine-tuned	2.07	1.42	42.0	n.s. (0.130)	0.29
Wue2	wt ∼ kd	Manual	1.00	0.28	227.5	* (0.010)	0.19
		From scratch	1.00	0.45	220.0	* (0.021)	0.16
		Fine-tuned	1.00	0.37	216.0	* (0.029)	0.14
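The relation between the U-statistic and $η^{2}$ in Table 1 can be illustrated with one common convention, $η^{2} = Z^{2}/N$, where Z is the normal approximation of U. This is a sketch under assumptions: it uses no tie or continuity correction, it takes the group sizes to be the image counts given in the Figure 5 legend (n_wt = 20, n_kd = 15), and it is not confirmed to be the exact procedure used for the table. The function name is hypothetical.

```python
import math


def eta_squared_from_u(u, n1, n2):
    """Effect size eta^2 = Z^2 / N from a Mann-Whitney U-statistic.

    Assumption: normal approximation without tie or continuity correction.
    """
    n = n1 + n2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n + 1) / 12)
    z = (u - mean_u) / sd_u
    return z * z / n


# Lab-Wue2, manual analysis: U = 227.5 with assumed group sizes 20 and 15.
eta2_wue2_manual = eta_squared_from_u(227.5, 20, 15)
print(round(eta2_wue2_manual, 2))
```

For this row the sketch yields approximately 0.19. Agreement for other rows is not guaranteed, because it depends on how ties and group sizes were handled.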

For Lab-Mue, the bioimage analyses of all DL-based approaches, including the frozen consensus models and ensembles pretrained on Lab-Wue1, revealed a significantly higher number of cFOS-positive cells in the PVT of mice 24 hr after fear conditioning (lRet), which was confirmed by the manual expert analysis (Figure 5E - Lab-Mue, Table 1). Yet again, the formation of model ensembles increased the reproducibility of results by yielding little or almost no variation in the effect sizes (Figure 5E - Lab-Mue).

The manual expert analysis of the Lab-Inns1 dataset revealed a significantly higher number of cFOS-positive nuclei in the basolateral amygdala (BLA) after extinction of a previously learned fear, which was also reliably detected by all consensus ensembles, regardless of initialization variant (Figure 5—figure supplement 1, Table 1). However, this significant difference was present in the analyses of most, but not all, individual consensus models, both from scratch and fine-tuned (Figure 5—figure supplement 1). Again, this could be attributed to a higher variability between the effect sizes of individual models, compared to a higher homogeneity among ensembles (Figure 5—figure supplement 1).

For Lab-Inns2, the manual expert analysis as well as all DL-based approaches that were adapted to the Lab-Inns2 dataset show increased numbers of cFOS-positive cells in the infralimbic cortex of L-DOPA/MS-275 responders (Resp) compared to control (Sal) mice (Figure 5—figure supplement 2, Table 1). However, in L-DOPA/MS-275 non-responders (nResp), we did not observe a significant increase of cFOS-positive nuclei (Figure 5—figure supplement 2, Table 1). Furthermore, the high effect sizes of the comparison between L-DOPA/MS-275 responders and non-responders further indicate that the differences observed in the behavioral responses of Resp and nResp mice were also reflected in the abundance of cFOS in the infralimbic cortex (Figure 5—figure supplement 2, Table 1).

Manual expert analysis of the fourth external dataset revealed a significantly lower amount of GABA-positive somata in gad1b knock-down zebrafish, compared to wildtypes (Figure 5E - Lab-Wue2, Table 1). Again, this effect was reliably detected by all deep-learning-based approaches that included training on the Lab-Wue2-specific training dataset and the effect sizes of ensembles showed less variability (Figure 5E - Lab-Wue2). Despite its poor segmentation performance and hence, poor validity, this effect was also present in the bioimage analysis of the frozen consensus models and ensembles pretrained on Lab-Wue1 (Figure 5E - Lab-Wue2).

As with our initial dataset, we assessed reliability by calculating the variation per effect as the standard deviation of the reported effect sizes within each group and pooled these results across all external datasets. Consistent with the higher reliability of from scratch and fine-tuned ensemble annotations (Figure 5—figure supplement 5), this analysis shows that the formation of model ensembles reduced the variation per effect in both variants, compared to the respective individual models (Figure 5B). The frozen models and ensembles exhibit a similar effect, but need to be considered with caution as they are based on models that did not meet the selection criterion (reliably performing on par with human experts; see the section ‘Training, evaluation and model selection’ for a detailed explanation).

In summary, we assessed the reproducibility of our consensus ensemble strategy by using four external datasets. These datasets were acquired with different image acquisition techniques, investigate two common model organisms, and analyze the two main cellular compartments (nuclei and somata) at varying resolutions (Figure 5—source data 2). In line with the results obtained on our initial dataset, we observed an increased reproducibility for the consensus ensembles compared to individual consensus models after training on all four external datasets (Figure 5B).

Moreover, our data also suggests that pretrained consensus models can even be deployed out-of-the-box, but only when carefully validated. Thus, sharing pretrained model weights across different laboratories reduces lab-specific biases within the bioimage analysis and may further increase objectivity and validity.

Ultimately, we conclude that our proposed ensemble consensus workflow is reproducible for different datasets and laboratories and increases objectivity, reliability, and validity of DL-based bioimage analyses.

Discussion

The present study contributes to bridging the gap between ‘methods’ and ‘biology’ oriented studies in image feature analysis (Meijering et al., 2016). We explored the potentials and limitations of DL models utilizing the general quality criteria for quantitative research: objectivity, reliability, and validity. Thereby, we put forward an effective but easily implementable strategy that aims to establish reproducible, DL-based bioimage analysis within the life science community.

The number of DL-based tools for bioimage annotations and their accessibility for non-AI specialists is gradually increasing (McQuin et al., 2018; Haberl et al., 2018; Falk et al., 2019). DL models can hold advantages over conventional algorithms (Caicedo et al., 2019) and have the potential to be commonly used for bioimage analysis tasks throughout the life sciences. Usually, the performance of new bioimage analysis tools or methods is assessed by means of similarity measures to a certain ground truth (Ronneberger et al., 2015; McQuin et al., 2018; Haberl et al., 2018; Falk et al., 2019; Caicedo et al., 2019). However, this is rarely sufficient to establish trust in the use of DL models for bioimage analysis, as the vast number of parameters and the flexibility to adapt DL models to virtually any task render them prone to internalizing unintended, subjective human biases (von Chamier et al., 2019). This is particularly true in the case of fluorescent feature analysis in bioimage datasets, as an objective ground truth is not available. In conjunction with the stochastic training process, this is a very critical point, because it holds the potential for intended or unintended tampering similar to p-hacking (Head et al., 2015), for example by training DL models until non-significant results become significant.

To investigate the effects of DL-based strategies on the bioimage analysis of fluorescent features, we acquired a typical bioimage dataset (Lab-Wue1) and five experts manually annotated corresponding ROIs (here cFOS-positive nuclei) in a representative subset of images. Then, we tested three DL-based strategies for automatized feature segmentation. DL models were either trained on the manual annotations of a single expert (expert models) or on the input of multiple experts pooled by ground truth estimation (consensus models). In addition, we formed ensembles of consensus models (consensus ensembles).
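Pooling the annotations of several experts into one estimated ground truth can be illustrated with a pixel-wise majority vote. This is only a simplified stand-in: the study uses STAPLE for ground truth estimation (see Limitations), which additionally weights experts by their estimated performance. The masks below are hypothetical toy data with three instead of five experts.

```python
def majority_vote(masks):
    """Pixel-wise majority vote over binary masks of identical shape.

    Simplified stand-in for ground truth estimation; not the STAPLE algorithm.
    """
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [int(sum(m[r][c] for m in masks) * 2 > n) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical binary annotations (1 = feature, 0 = background) of a 2 x 3 image.
expert_a = [[1, 1, 0], [0, 0, 0]]
expert_b = [[1, 0, 0], [0, 1, 0]]
expert_c = [[1, 1, 0], [0, 0, 1]]

consensus = majority_vote([expert_a, expert_b, expert_c])
print(consensus)
```

Pixels marked by only one of the three experts are dropped from the consensus, which shows how pooling dampens annotations that are specific to a single expert.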

Similarity analysis of fluorescent feature annotation

In accordance with previous studies, similarity analyses revealed a substantial level of inter-rater variability in the heuristic annotations of the single experts (Schmitz et al., 1999; Collier et al., 2003; Niedworok et al., 2016). Furthermore, we confirmed the concerns already put forward by others (Falk et al., 2019; von Chamier et al., 2019) that training of DL models solely on the input of a single human expert imposes a high risk of incorporating an individual human bias into the trained models. We therefore conclude that models trained on single expert annotations resemble an automation of manual image annotation, but cannot remove subjective biases from bioimage analyses. Importantly, only consensus ensembles simultaneously led to a significant increase in both the reliability and the validity of fluorescent feature annotations. Our analyses also show that annotations of multiple experts are imperative for two reasons: first, they mitigate or even eliminate the bias of expert-specific annotations and, second, they are essential for the assessment of model performance.

Reproducibility and validity of bioimage analyses

Our bioimage dataset from Lab-Wue1 enabled us to look at the impact of different DL-based strategies on the results of bioimage analyses. This revealed a striking model-to-model variability as the main factor impairing the reproducibility of DL-based bioimage analyses. Convincingly, the majority votes for each effect were identical for all three strategies. However, the variance within the reported effect sizes differed significantly for each strategy. This entailed, for example, that no expert model was in full agreement with the congruent majority votes. On the contrary, consensus ensembles detected all effects with significantly higher reliability. Thus, our data indicates that bioimage analysis performed with a consensus ensemble significantly reduces the risk of obtaining irreproducible results.

Evaluation of consensus ensembles on external datasets

We then tested our consensus ensemble approach and three initialization variants on four external datasets with limited training data and varying similarities in terms of image parameters to our original dataset (Lab-Wue1). In line with previous studies on transfer learning, we demonstrate that the adaptation of models from pretrained weights to new, yet similar data requires fewer training iterations, compared to the training of models from scratch (Falk et al., 2019). We extend these analyses and show that the reliability of fine-tuned ensembles was at least equivalent to from scratch ensembles, if not higher. Furthermore, we also provide initial evidence that pretrained ensembles can be used even without any adaptation, if task similarity is sufficiently high. Our data suggest that this component in the analysis pipeline could further increase the objectivity of bioimage analyses.

Potentials of open-source pretrained consensus ensemble libraries

Sharing model weights from validated models in open-source libraries, similar to TensorFlow Hub (https://www.tensorflow.org/hub) or PyTorch Hub (https://pytorch.org/hub/), offers a great opportunity to provide annotation experience across labs in an open science community. In this study, for instance, we used the nuclear label of cFOS, an activity-dependent transcription factor, as fluorescent feature of interest. This label is in its signature indistinguishable from a variety of other fluorescent labels, like those of transcription factors (CREB, phospho-CREB, Pax6, NeuroG2, or Brain3a), cell division markers (phospho-histone H3), apoptosis markers (Caspase-3), and multiple others. Similar to the pretrained and shared models of Falk et al., 2019, we surmise that the learned feature representations (i.e. model weights) of our cFOS consensus ensembles may serve as a good initialization for models that aim at performing nucleosomatic fluorescent label segmentation in brain slices.

In line with the results of the Kaggle Data Science Bowl 2018 (Caicedo et al., 2019), however, our findings indicate that a model adapted to a specific data set usually outperforms a general model trained on different datasets from different domains. To use and share frozen out-of-the-box models across the science community, we therefore need to create a well-documented library that contains validated model weights for each specific task and domain (e.g. for each organism, marker type, image resolution, etc.). In conjunction with data repositories, this would also allow retrospective data analysis of prior studies.

In summary, open-source model libraries may contribute to a better reproducibility of scientific experiments (Fanelli, 2018) by improving the objectivity in bioimage analyses, by offering openness to analysis criteria, and by sharing pretrained models for (re-)evaluation.

Limitations

This paper describes a blueprint for the evaluation of DL models in biomedical imaging. Therefore, some of our methodological decisions were shaped by standardization considerations concerning the future deployment in bioimage analysis pipelines.

The project was triggered by segmentation tasks for fluorescent labels (cFOS) in the cell nucleus. These are rather simple features, and we could readily annotate data from different labs, which facilitated the evaluation. However, this limits the generalizability to more complex image segmentation tasks, where training data annotation is slow and tedious. In particular, human perceptive capabilities for richer graphical features, such as area, volume, or density, are much worse than for regular, linear image features (Cleveland and McGill, 1985; Feldman-Stewart et al., 2000). A case in point is the annotation of images showing ramified neurons or astrocytes. Such tasks would cause an enormous workload rendering complete human annotation virtually impossible. In this respect, we concur with prior research asserting that DL models based on human annotations will not be an option in these settings (Driscoll et al., 2019).

The characteristics of the strategies we examined are based on best practices in the field of DL and derived from the extant literature (Meijering et al., 2016; Falk et al., 2019; Caicedo et al., 2019). The focus on the U-Net model architecture (Ronneberger et al., 2015) is a direct consequence of this standardization idea. Yet, it is also an important limitation of our study. Unlike more conventional studies that introduce a new method and provide a comprehensive performance comparison to the state of the art, we rely on U-Net as the widely studied de facto standard for biomedical image segmentation (McQuin et al., 2018; Falk et al., 2019; Caicedo et al., 2019). Similarly, we chose STAPLE (Warfield et al., 2004) as the benchmark procedure for ground truth estimation and thereby did not consider alternatives and variants (Lampert et al., 2016). In addition, we tried different ways to incorporate the individual expert annotations into one DL model. For instance, we followed the approach of Guan et al., 2018 and modeled individual experts in a multi-head DL model instead of pooling them in the first place. However, we discarded this approach because it increased complexity without improving the results in our tests.
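To illustrate the idea behind STAPLE-style ground truth estimation, the following is a minimal sketch of the binary expectation-maximization scheme described by Warfield et al., 2004, written in plain Python for flattened binary masks. It is not the implementation used in this study: the function name `staple_binary`, the initial sensitivity and specificity of 0.9, the fixed iteration count, and the clamping constant `eps` are assumptions made for illustration. For real image data, an established implementation (for example, the STAPLE filter in SimpleITK) would be the more appropriate choice.

```python
def staple_binary(annotations, n_iter=50, eps=1e-6):
    """Simplified binary STAPLE (EM) for flattened 0/1 masks.

    annotations: list of equally long 0/1 lists, one per annotator.
    Returns (foreground_probabilities, sensitivities, specificities).
    Illustrative sketch only; initial values and eps are assumptions.
    """
    n_raters = len(annotations)
    n_pixels = len(annotations[0])
    if any(len(a) != n_pixels for a in annotations):
        raise ValueError("all annotations must have the same length")

    # Fixed global foreground prior: mean of all annotator decisions.
    prior = sum(sum(a) for a in annotations) / (n_raters * n_pixels)
    prior = min(max(prior, eps), 1 - eps)

    sens = [0.9] * n_raters  # assumed initial sensitivities
    spec = [0.9] * n_raters  # assumed initial specificities
    weights = [prior] * n_pixels

    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is foreground.
        for i in range(n_pixels):
            fg, bg = prior, 1 - prior
            for j in range(n_raters):
                if annotations[j][i]:
                    fg *= sens[j]
                    bg *= 1 - spec[j]
                else:
                    fg *= 1 - sens[j]
                    bg *= spec[j]
            weights[i] = fg / (fg + bg)

        # M-step: re-estimate each annotator's performance parameters.
        fg_sum = sum(weights)
        bg_sum = n_pixels - fg_sum
        for j in range(n_raters):
            tp = sum(w for w, d in zip(weights, annotations[j]) if d)
            tn = sum(1 - w for w, d in zip(weights, annotations[j]) if not d)
            sens[j] = min(max(tp / max(fg_sum, eps), eps), 1 - eps)
            spec[j] = min(max(tn / max(bg_sum, eps), eps), 1 - eps)

    return weights, sens, spec


# Toy example: two annotators agree, a third deviates on two pixels.
expert_a = [1, 1, 1, 0, 0, 0, 0, 0]
expert_b = [1, 1, 1, 0, 0, 0, 0, 0]
expert_c = [1, 1, 0, 1, 0, 0, 0, 0]
probs, sens, spec = staple_binary([expert_a, expert_b, expert_c])
consensus = [1 if p >= 0.5 else 0 for p in probs]
```

In this toy case the estimated consensus follows the two agreeing annotators, and the deviating annotator receives a lower estimated sensitivity. Unlike a plain majority vote, the estimate weights each annotator by the performance inferred from the data.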

Accessibility of our workflow and pretrained consensus ensembles

To enable other researchers to easily access, interact with, and reproduce our results, and to share our trained models, we provide an open-source Python library that is easily accessible for both local installation and cloud-based deployment.

With Jupyter Notebooks becoming the computational notebook of choice for data scientists (Perkel, 2018), we also implemented a training pipeline for non-AI experts in a Jupyter Notebook optimized for Google Colab, providing free access to the required computational resources (e.g., GPUs and TPUs). In summary, we recommend using the annotations of multiple human experts to train and evaluate DL consensus model ensembles. In this way, DL can be used to increase the objectivity, reliability, and validity of bioimage analyses and pave the way for higher reproducibility in science.
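As a rough illustration of how the predictions of several models can be combined into one ensemble prediction, the sketch below averages per-pixel foreground probabilities and thresholds the mean. Averaging is one common way to form an ensemble and is assumed here for illustration; the function name `ensemble_predict`, the flattened list representation, and the threshold of 0.5 are hypothetical and do not describe the API of our library.

```python
from statistics import mean


def ensemble_predict(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities across models.

    prob_maps: list of equally long lists of probabilities, one per model.
    Returns (mean_probabilities, binary_mask).
    """
    n_pixels = len(prob_maps[0])
    if any(len(m) != n_pixels for m in prob_maps):
        raise ValueError("all probability maps must have the same length")

    # Mean probability per pixel over all ensemble members.
    mean_probs = [mean(m[i] for m in prob_maps) for i in range(n_pixels)]
    # A pixel counts as foreground if the mean reaches the threshold.
    mask = [1 if p >= threshold else 0 for p in mean_probs]
    return mean_probs, mask


# Three hypothetical models scoring the same three pixels.
model_outputs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.3],
    [0.7, 0.3, 0.9],
]
mean_probs, mask = ensemble_predict(model_outputs)
```

Here the third pixel is labeled as foreground although one model scored it below the threshold, which shows how averaging dampens the influence of a single deviating model.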

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Genetic reagent (Mus musculus, male)	C57BL/6J	Charles River	Cat# CRL:027; RRID:IMSR_CRL:27	Lab-Mue; Lab-Inns1
Genetic reagent (Mus musculus, male)	C57BL/6J	Jackson Laboratory	Cat# JAX:000664; RRID:IMSR_JAX:000664	Lab-Wue1
Genetic reagent (Mus musculus, male)	129S1/SvImJ (S1)	Charles River	RRID:MGI:5658424	Lab-Inns2
Genetic reagent (Danio rerio)	AB/AB	European Zebrafish Resource Center		Lab-Wue2
Antibody	Anti-cFOS (rabbit polyclonal)	Santa Cruz	Cat# sc-52; RRID:AB_2106783	Lab-Mue (1:500); Lab-Inns2 (1:1,000)
Antibody	Anti-cFOS (rabbit polyclonal)	Millipore	Cat# PC38; RRID:AB_2106755	Lab-Inns1 (1:20,000)
Antibody	Anti-cFOS (rabbit polyclonal)	Synaptic Systems	Cat# 226003; RRID:AB_2231974	Lab-Wue1 (1:10,000)
Antibody	Anti-GABA (rabbit polyclonal)	Sigma-Aldrich	Cat# A2025; RRID:AB_477652	Lab-Wue2 (1:400)
Antibody	Anti-NeuN (guinea-pig polyclonal)	Synaptic Systems	Cat# 266004; RRID:AB_2619988	Lab-Wue1 (1:400)
Antibody	Anti-Parvalbumin (mouse monoclonal)	Sigma-Aldrich	Cat# P3088; RRID:AB_477329	Lab-Inns1 (1:2,500)
Antibody	Anti-Parvalbumin (mouse monoclonal)	Swant	Cat# PV235; RRID:AB_10000343	Lab-Wue1 (1:5,000)
Software, algorithm	ImageJ	Fiji www.fiji.sc/	RRID:SCR_002285	Lab-Mue; Lab-Inns2; Lab-Wue1; Lab-Wue2
Software, algorithm	Improvision Openlab software	Perkin Elmer www.perkinelmer.com/pages/020/cellularimaging/products/openlab.xhtml	RRID:SCR_012158	Lab-Inns1, Version 5.5.0
Software, algorithm	GraphPad Prism software	GraphPad Prism www.graphpad.com/scientific-software/prism/	RRID:SCR_015807	Lab-Inns1, Version 7.0
Software, algorithm	CellSens Dimension Desktop software	Olympus www.olympus-lifescience.com/en/software/cellsens/	RRID:SCR_016238	Lab-Inns2, Version 1.9
Software, algorithm	Fluoview FV10-ASW	Olympus www.photonics.com/Product.aspx?PRID=47380	RRID:SCR_014215	Lab-Wue1
Software, algorithm	Tensorflow	www.tensorflow.org, Abadi et al., 2016	RRID:SCR_016345
Software, algorithm	Keras	www.keras.io, Chollet, 2015
Software, algorithm	ImageJ	www.imagej.net/, Rueden et al., 2017	RRID:SCR_003070
Software, algorithm	SciPy	www.scipy.org, Jones et al., 2001	RRID:SCR_008058
Software, algorithm	scikit-learn	www.scikit-learn.org/, Pedregosa et al., 2011
Software, algorithm	scikit-image	www.scikit-image.org/, van der Walt et al., 2014
Software, algorithm	Pingouin	https://pingouin-stats.org/, Vallat, 2018
Software, algorithm	simpleITK	www.simpleitk.org/, Lowekamp et al., 2013

Animal behavior experiments, immunofluorescence analyses, and image acquisition were performed in five independent laboratories using lab-specific protocols. The experiments were not planned jointly, which preserved the individual character of the datasets. We refer to the lab-specific protocols as follows:

Lab-Mue: Institute of Physiology I, University of Münster, Germany
Lab-Inns1: Department of Pharmacology, Medical University of Innsbruck, Austria
Lab-Inns2: Department of Pharmacology and Toxicology, Institute of Pharmacy and Center for Molecular Biosciences Innsbruck, University of Innsbruck, Austria
Lab-Wue1: Institute of Clinical Neurobiology, University Hospital, Würzburg, Germany
Lab-Wue2: Department of Child and Adolescent Psychiatry, Center of Mental Health, University Hospital of Würzburg, Würzburg, Germany

Schematic illustration of bioimage analysis strategies and corresponding hypotheses.

Similarity analysis of fluorescent feature annotations by manual or DL-based strategies.

Application of different DL-based strategies for fluorescent feature annotation.

Figure 3—source data 1

Consensus ensembles significantly increase reliability of bioimage analysis results.

Figure 4—source data 1

Figure 4—source data 2

Consensus ensembles for DL-based feature annotation in external bioimage data sets.

Figure 5—source data 1

Figure 5—source data 2

Figure 5—source data 3

Bioimage analyses results of external datasets.

Author details

Dennis Segebarth (contributed equally)
Matthias Griebel (contributed equally)
Nikolai Stein
Cora R von Collenberg
Corinna Martin
Dominik Fiedler
Lucas B Comeras
Anupam Sah
Victoria Schoeffler
Teresa Lüffe
Alexander Dürr
Rohini Gupta
Manju Sasi
Christina Lillesaar
Maren D Lange
Ramon O Tasan
Nicolas Singewald
Hans-Christian Pape
Christoph M Flath (contributed equally; for correspondence)

Robert Blum

Contribution

Contributed equally with