Performance of 3D Semantic and Instance Segmentation Models.

a: Raw mesoSPIM whole-brain sample, with volumes and corresponding ground-truth labels from somatosensory (S1) and visual (V1) cortical regions. b: Evaluation of instance segmentation performance for several supervised models over three data subsets. F1-score is computed from the Intersection over Union (IoU) with ground-truth labels, then averaged. Error bars represent 50% Confidence Intervals (CIs). c: 3D instance labels from the supervised models, as noted, for the visual cortex volume evaluated in b. d: Illustration of our WNet3D architecture, showing the dual 3D U-Net structure with modifications (see Methods). e: Example 3D instance labels from WNet3D; top row is S1, bottom is V1, with artifacts removed. f: Semantic segmentation performance: comparison of model efficiency, indicating the volume of training data required to achieve a given performance level. Each supervised model was trained with an increasing percentage of training data (10, 20, 60 or 80%, left to right within each model grouping); the F1-score with an IoU >= 0 was computed on unseen test data, over three data subsets for each training/evaluation split. Our self-supervised model (WNet3D) is also trained on a subset of the training set of images, but always without human labels. Far right: performance of the pretrained WNet3D available in the plugin, with and without removing artifacts in the image. See Methods for details. The central box represents the interquartile range (IQR) of values, with the median as a horizontal line; the upper and lower box limits are the upper and lower quartiles. Whiskers extend to data points within 1.5 IQR of the quartiles. g: Instance segmentation performance of Swin-UNetR and WNet3D (pretrained, see Methods), evaluated on unseen data across 3 data subsets, compared with a Swin-UNetR model trained using labels from the WNet3D self-supervised model. Here, WNet3D was trained on separate data, producing semantic labels that were then used to train a supervised Swin-UNetR model, still on held-out data. This supervised model was evaluated in the same way as the other models, on 3 held-out images from our dataset, unseen during training. Error bars indicate 50% CIs.
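
As a minimal sketch of how an F1-score can be derived from IoU overlap with ground-truth instance labels (panels b and g), the Python function below matches predicted and ground-truth instances one-to-one, greedily by IoU, and counts a match as a true positive when its IoU reaches a chosen threshold. The function name `f1_at_iou`, the greedy matching rule, the flat-sequence input, and the convention for empty volumes are assumptions made for illustration; the evaluation code behind the figure may differ (see Methods).

```python
from collections import Counter


def f1_at_iou(gt_labels, pred_labels, iou_threshold=0.5):
    """F1-score for instance labels at a single IoU threshold.

    gt_labels and pred_labels are flat, equal-length sequences of integer
    instance IDs (0 = background), e.g. a flattened 3D label volume.
    """
    gt_labels = list(gt_labels)
    pred_labels = list(pred_labels)
    if len(gt_labels) != len(pred_labels):
        raise ValueError("label volumes must have the same number of voxels")

    # Voxel count of every ground-truth and predicted instance.
    gt_sizes = Counter(g for g in gt_labels if g != 0)
    pred_sizes = Counter(p for p in pred_labels if p != 0)
    # Voxel overlap for each (ground-truth, predicted) pair of instances.
    overlap = Counter(
        (g, p) for g, p in zip(gt_labels, pred_labels) if g != 0 and p != 0
    )

    # Overlapping pairs with their IoU, highest IoU first.
    pairs = sorted(
        (
            (inter / (gt_sizes[g] + pred_sizes[p] - inter), g, p)
            for (g, p), inter in overlap.items()
        ),
        reverse=True,
    )

    # Greedy one-to-one matching (an illustrative choice, not the paper's).
    matched_gt, matched_pred = set(), set()
    for iou, g, p in pairs:
        if iou < iou_threshold:
            break
        if g not in matched_gt and p not in matched_pred:
            matched_gt.add(g)
            matched_pred.add(p)

    tp = len(matched_gt)          # matched instances
    fp = len(pred_sizes) - tp     # predicted instances without a match
    fn = len(gt_sizes) - tp       # ground-truth instances without a match
    denom = 2 * tp + fp + fn
    # Assumption: two empty volumes are treated as a perfect score.
    return 2 * tp / denom if denom else 1.0
```

For a NumPy label volume, `volume.ravel()` can be passed directly; scores from several thresholds or volumes can then be averaged.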

CellSeg3D napari plugin pipeline, training, and example outputs.

a: Workflow diagram depicting the segmentation pipeline: raw data can either be used directly (self-supervised) or be labeled and used for training, after which other data can be used for model inference. Each stream concludes with post hoc inspection and refinement, if needed (post-processing analysis and/or refining the model). b: Instance segmentation performance (zero-shot) of the pretrained WNet3D on select datasets featured in c, shown as F1-score vs IoU with ground-truth labels. c: Qualitative examples of semantic and instance segmentation with WNet3D. d: Qualitative example of a WNet3D-generated prediction (thresholded) and labels on a crop from a whole-brain sample with c-FOS-labeled neurons, acquired with a mesoSPIM.

Dataset ground-truth cell count per volume.

Parameters used in Figure 2b, c for instance segmentation with Voronoi-Otsu.

Hyperparameter tuning of baselines and statistics

a,b,c: Hyperparameter optimisation for several supervised models. In Cellpose, the cell probability threshold value is applied before the sigmoid, hence values between -12 and 12 were tested. CellSeg3D models return predictions between 0 and 1 after applying the softmax; the values tested were therefore in this range. Error bars show 95% CIs. d: StarDist hyperparameter optimisation. Several values were tested for the non-maximum suppression (NMS) threshold and the cell probability threshold. The heatmap shows the F1-score. e: Pooled F1-scores per split, related to Figure 1f, used for the statistical testing shown in f. The central box represents the interquartile range (IQR) of values, with the median as a horizontal line; the upper and lower box limits are the upper and lower quartiles. Whiskers extend to data points within 1.5 IQR of the quartiles. Outliers are shown separately. f: Pairwise Conover’s test p-values for the Dice metric values per model shown in e. Colors are based on level of significance. g: Example image of WNet3D output before and after artifact filtering; the filtered output is also shown in Figure 1e.
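
To illustrate why a threshold applied before the sigmoid is not confined to the 0 to 1 range, the short Python sketch below converts between pre-sigmoid values (logits) and probabilities. The helper names and the threshold values are illustrative assumptions, not the grid used for the figure.

```python
import math


def sigmoid(logit_value):
    """Map a pre-sigmoid value (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit_value))


def logit(probability):
    """Inverse of sigmoid: map a probability in (0, 1) back to a logit."""
    return math.log(probability / (1.0 - probability))


# Illustrative thresholds only (not the values tested in the figure).
for threshold in (-4.0, 0.0, 4.0):
    print(f"logit threshold {threshold:+.1f} -> probability {sigmoid(threshold):.4f}")
```

A logit threshold of 0 corresponds to a probability of 0.5, and a symmetric logit range maps to probabilities symmetric about 0.5.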

Training WNet3D

a: Overview of the training process of WNet3D. The loss for the encoder Uenc is the SoftNCuts, whereas the reconstruction loss for Udec is MSE. The weighted sum of losses is calculated as indicated in Methods. For select epochs, input volumes are shown, with outputs from encoder Uenc above, and outputs from decoder Udec below.
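
As a rough sketch of the weighted sum of losses mentioned above, the Python snippet below combines a precomputed SoftNCuts value with an MSE reconstruction loss. The function names, the default weights of 0.5, and the treatment of the SoftNCuts term as an already-computed scalar are all assumptions for illustration; the actual weighting is the one given in Methods.

```python
def mse(reconstruction, target):
    """Mean squared error between two equal-length sequences of intensities."""
    pairs = list(zip(reconstruction, target))
    return sum((r - t) ** 2 for r, t in pairs) / len(pairs)


def total_loss(ncuts_loss, reconstruction_loss, ncuts_weight=0.5, rec_weight=0.5):
    """Weighted sum of the encoder and decoder losses.

    ncuts_loss is assumed to be a scalar already produced by a SoftNCuts
    implementation; the weights here are hypothetical placeholders.
    """
    return ncuts_weight * ncuts_loss + rec_weight * reconstruction_loss


# Toy example: a placeholder SoftNCuts value and a two-voxel reconstruction.
reconstruction_error = mse([1.0, 2.0], [1.0, 4.0])
combined = total_loss(0.8, reconstruction_error)
```

Raising `ncuts_weight` relative to `rec_weight` in this sketch would put more emphasis on the encoder's segmentation objective than on reconstruction fidelity.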