BrainAlignNet can perform non-rigid registration to align the neurons in the C. elegans head

(A) Network training pipeline. The network takes in a pair of images and a pair of centroid position lists corresponding to the images at two different time points (fixed and moving). (In the LocalNet diagram, this is represented as “IN”. Intermediate cuboids represent intermediate representations of the images at various stages of network processing. In reality, the cuboids are four-dimensional, but we represent them with three dimensions (up/down is x, left/right is y, in/out is channel, and we omit z) for visualization purposes. Spaces and arrows between cuboids represent network blocks, layers, and information flow. See Methods for a detailed description of network architectures.) Image pairs were selected based on the similarity of worm postures (see Methods). The fixed and moving images were pre-registered using an Euler transformation, translating and rotating the moving images to maximize their cross-correlation with the fixed images. The fixed and moving neuron centroid positions were obtained by computing the centers of the same neurons in both the fixed and moving images as a list of (x, y, z) coordinates. This information was available because calcium traces had previously been extracted from these videos with an older, slower version of our image analysis pipeline. The network outputs a Dense Displacement Field (DDF), a 4-D tensor that indicates a coordinate transformation from fixed image coordinates to moving image coordinates. The DDF is then used to transform the moving images and fixed centroids to resemble the fixed images and moving centroids. During training, the network is tasked with learning a DDF that transforms the centroids and images in a way that minimizes the centroid alignment and image loss, as well as the regularization loss (see Methods). Note that, after training, only images (not centroids) need to be input into the network to align the images.
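
To make the role of the DDF concrete, the sketch below shows one way a dense displacement field can be applied to resample a moving image onto the fixed image’s grid and to map fixed centroids into moving-image coordinates. This is a minimal illustration assuming a simple (X, Y, Z, 3) displacement convention and linear interpolation via scipy; the function names, shapes, and interpolation choices are assumptions for illustration, not the BrainAlignNet implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_image_with_ddf(moving_image, ddf):
    """Resample a 3-D moving image onto the fixed image's grid using a DDF.

    moving_image : (X, Y, Z) array of intensities.
    ddf          : (X, Y, Z, 3) array; ddf[x, y, z] is assumed to store the
                   displacement that maps fixed-image coordinate (x, y, z) to a
                   moving-image coordinate. Shapes/conventions are illustrative.
    """
    grid = np.stack(
        np.meshgrid(*[np.arange(s) for s in moving_image.shape], indexing="ij"),
        axis=-1,
    )
    sample_points = (grid + ddf).astype(np.float64)  # moving-image coordinates
    # Interpolate the moving image at the displaced coordinates; the result lives
    # on the fixed image's grid (the "warped moving image").
    return map_coordinates(
        moving_image, sample_points.transpose(3, 0, 1, 2), order=1, mode="nearest"
    )

def warp_centroids_with_ddf(fixed_centroids, ddf):
    """Map fixed-image centroids into moving-image coordinates via the DDF.

    Assumes centroids lie within the image bounds; the DDF is sampled at the
    nearest voxel for simplicity.
    """
    warped = []
    for x, y, z in np.round(np.asarray(fixed_centroids)).astype(int):
        warped.append(np.array([x, y, z], dtype=float) + ddf[x, y, z])
    return np.array(warped)
```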

(B) Network loss curves. The training and validation loss curves show that validation performance plateaued around 300 epochs of training.

(C) Example of registration outcomes on neuronal ROI images. The network-learned DDF warps the neurons in the moving image (‘moving ROIs’). The warped-moving ROIs are meant to be closer to the fixed ROIs. Each neuron is uniquely colored in the ROI images to represent its identity. The centroids of these neurons are represented by the white dots. Here, we take a z-slice of the 3-D fixed and moving ROI blocks on the x-y plane to show that the DDF can warp the x and y coordinates of the moving centroids to align with the x and y coordinates of the fixed centroids with one-pixel precision.

(D) Example of registration outcomes on tagRFP images. We show the indicated image blocks as Maximal Intensity Projections (MIPs) along the z-axis, overlaying the fixed image (orange) with different versions of the moving image (blue). While the fixed image remains untransformed, the uninitialized moving image (left) gets warped by an Euler transformation (middle) and a network-learned DDF (right) to overlap with the fixed image.

(E) Registration outcomes shown on example tagRFP and ROI images for four different trained networks. We randomly selected one registration problem from one of the testing datasets and tasked the trained networks with creating a DDF to warp the moving (RFP) image and moving ROI onto the fixed (RFP) image and fixed ROI. The full network with full loss function aligns neurons in both RFP and ROI images almost perfectly. For the networks trained without the centroid alignment loss, regularization loss, or image loss (while keeping the rest of the training configurations identical), the resulting DDF is unable to fully align the neurons and displays unrealistic deformation (closely inspect the warped moving ROI images).

(F) Evaluation of registration performance on testing datasets before network registration and after registration with four different networks. “pre-align” shows alignment statistics on images after Euler alignment, but before neural network registration. Here, we evaluated 80-100 problems per animal for all animals in the testing data. Two performance metrics are shown. Normalized cross-correlation (NCC, top) quantifies alignment of the fixed and warped moving RFP images, where a score of one indicates perfect alignment. Centroid distance (bottom) is measured as the mean Euclidean distance between the centroids of all neurons in the fixed ROI and the centroids of their corresponding neurons in the warped moving ROI; a distance of 0 indicates perfect alignment. All violin plots are accompanied by lines indicating the minimum, mean, and maximum values. **p<0.01, ***p<0.001, ****p<0.0001, distributions of registration metrics (NCC and centroid distance) were compared pairwise across all four versions of the network with the two-tailed Wilcoxon signed rank test only on problems that register frames from unique timepoints. For all datasets (“pre-align”, “full”, “no-centroid”, “no-regul.”, “no-image”), n=85, 65, 58, 44, 36 registration problems (from 5 animals); for NCC, W=35281, 64854, 78754 for “no-centroid”, “no-regul.”, “no-image” vs “full”, respectively; for centroid distance, W=12168, 12634, 13345 for “no-centroid”, “no-regul.”, “no-image” vs. “full”, respectively.
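
For reference, the two metrics reported here can be computed as in the following sketch, assuming a zero-normalized cross-correlation over whole images and simple per-neuron Euclidean distances; the exact implementation details in the paper may differ.

```python
import numpy as np

def normalized_cross_correlation(fixed, warped_moving):
    """Zero-normalized cross-correlation between two same-shaped images.

    Returns 1.0 when the images are identical up to an affine intensity change.
    """
    a = fixed.astype(np.float64).ravel()
    b = warped_moving.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mean_centroid_distance(fixed_centroids, warped_moving_centroids):
    """Mean Euclidean distance between matched neuron centroids (0 = perfect)."""
    diffs = np.asarray(fixed_centroids) - np.asarray(warped_moving_centroids)
    return float(np.linalg.norm(diffs, axis=1).mean())
```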

(G) Example image of the head of an animal from a strain that expresses both pan-neuronal NLS-tagRFP and eat-4::NLS-GFP. The neurons expressing both NLS-tagRFP and eat-4::NLS-GFP are a subset of all the neurons expressing pan-neuronal NLS-tagRFP.

(H) A comparison of the registration quality of the four trained registration networks: full network, no-centroid alignment loss, no-regularization loss, and no-image loss. Each network was evaluated on four datasets in which both pan-neuronal NLS-tagRFP and eat-4::NLS-GFP are expressed, examining 3,927 registration problems per dataset (15,708 in total). Each network was tasked with registering the tagRFP images, and the resulting DDFs from the tagRFP registrations were also used to register the eat-4::GFP images. For each channel in each problem, we determined which of the four networks had the highest performance (i.e. highest NCC). Note that the no-centroid alignment network performs best in the RFP channel, but not in the GFP channel. Instead, the full network performs best in the GFP channel. This suggests that the network without the centroid alignment loss deforms RFP images in a manner that does not accurately move the neurons to their correct locations (i.e. scrambles the pixels).

BrainAlignNet supports calcium trace extraction with high accuracy and high SNR

(A) Diagram of ANTSUN 1.4 and 2.0, which are two full calcium trace extraction pipelines that only differ with regards to image registration. Raw tagRFP channel data is input into the pipeline, which submits image pairs with similar worm postures for registration using either elastix (ANTSUN 1.4; red) or BrainAlignNet (ANTSUN 2.0; blue). The registration is used to transform neuron ROIs identified by a segmentation U-Net (the cuboid diagram is represented as in Figure 1A). These are input into a heuristic function (ANTSUN 2.0-specific heuristics shown in blue) which defines an ROI linkage matrix. Clustering this matrix then yields neuron identities.

(B) Sample dataset from an eat-4::NLS-GFP strain, showing ratiometric (GFP/tagRFP) traces without any further normalization. This strain has some GFP+ neurons (bright horizontal lines) as well as some GFP− neurons (dark horizontal lines, which have F∼0). Registration artifacts between GFP+ and GFP− neurons would be visible as bright points in GFP− traces or dark points in GFP+ traces.

(C) Error rate of ANTSUN 2.0 registration across four eat-4::NLS-GFP animals, computed based on mismatches between GFP+ and GFP− neurons in the eat-4::NLS-GFP strain. Dashed red line shows the error rate of ANTSUN 1.4. Individual dots are different recorded datasets. Note that all error rates are <1%.
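
One plausible way to estimate such an error rate from ratiometric traces is sketched below: call each neuron GFP+ or GFP− from its median trace value and count timepoints that contradict that call (dark points in GFP+ traces, bright points in GFP− traces). The thresholds and the counting rule are illustrative assumptions, not the published procedure.

```python
import numpy as np

def registration_error_rate(traces, high_thresh=0.5, low_thresh=0.1):
    """Estimate a registration error rate from ratiometric GFP/tagRFP traces.

    traces : (n_neurons, n_timepoints) array of GFP/tagRFP values.
    Each neuron is called GFP+ or GFP- from its median trace value; timepoints
    that contradict that call are counted as putative registration errors.
    """
    medians = np.median(traces, axis=1)
    gfp_plus = medians > high_thresh
    errors = 0
    for trace, is_plus in zip(traces, gfp_plus):
        if is_plus:
            errors += np.sum(trace < low_thresh)    # GFP+ neuron momentarily "dark"
        else:
            errors += np.sum(trace > high_thresh)   # GFP- neuron momentarily "bright"
    return errors / traces.size
```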

(D) Sample dataset from a pan-neuronal GCaMP strain, showing F/Fmean fluorescence. Robust calcium dynamics are visible in most neurons.

(E) Number of detected neurons across three pan-neuronal GCaMP animals for the two different ANTSUN versions (1.4 or 2.0). Individual dots are individual recorded datasets.

(F) Computation time to process one animal based on ANTSUN version (1.4 or 2.0). ANTSUN 1.4 was run on a computing cluster that provided an average of 32 CPU cores per registration problem; computation time is the total number of CPU hours used (i.e., the time it would have taken to run ANTSUN 1.4 registration locally on a comparable 32-core machine). ANTSUN 2.0 was run locally on NVIDIA A4000, A5500, and A6000 graphics cards.

BrainAlignNet can be used to perform neuron alignment in jellyfish

(A) Example of image registration on a pair of mCherry images (from a testing animal, withheld from training data), composed by overlaying a moving image (blue) on the fixed image (orange). While the fixed image remains untransformed, the uninitialized moving image (left) gets warped by an Euler transformation (middle) and a BrainAlignNet-generated DDF (right) to overlap with the fixed image.

(B) Evaluation of registration performance by examining mCherry image alignment (via NCC) on testing datasets before and after registration with BrainAlignNet. “pre-align” shows alignment statistics on images after Euler alignment, but before neural network registration. Here, we evaluated all registration problems for all three animals in the testing set. As in Figure 1, NCC quantifies alignment of the fixed and warped moving RFP images, where a score of 1 indicates perfect alignment. All violin plots are accompanied by lines indicating the minimum, mean, and maximum values. ****p<0.0001, two-tailed Wilcoxon signed rank test. For both “pre-align” and “post-BrainAlignNet”, n=25,997 registration problems (from 3 animals).

(C) Evaluation of registration performance by examining neuron alignment (measured via distance between matched centroids) on testing datasets before and after BrainAlignNet registration. Centroid distance is measured as the mean Euclidean distance between the centroids of all neurons in the fixed image and the centroids of the corresponding neurons in the warped moving image; a distance of 0 indicates perfect alignment. ****p<0.0001, two-tailed Wilcoxon signed rank test. For both “pre-align” and “post-BrainAlignNet”, n=25,997 registration problems (from 3 animals).

The AutoCellLabeler Network can automatically annotate >100 neuronal cell types in the C. elegans head

(A) Procedure by which AutoCellLabeler generates labels for neurons. First, the tagRFP component of a multi-spectral image is passed into a segmentation neural network, which extracts neuron ROIs, labeling each pixel with an arbitrary number, one number per neuron. Then, the full multi-spectral image is input into AutoCellLabeler, which outputs a probability map. This probability map is applied to the ROIs to generate labels and confidence values for those labels. The network cuboid diagrams are represented as in Figure 1A.

(B) AutoCellLabeler’s training data consists of a set of multi-spectral images (NLS-tagRFP, NLS-mNeptune2.5, NLS-CyOFP1, and NLS-mTagBFP2), human neuron labels, and a pixel weighting matrix based on confidence and frequency of the human labels that controls how much each pixel is weighted in AutoCellLabeler’s loss function.

(C) Pixel-weighted cross-entropy loss and pixel-weighted IoU metric scores for training and validation data. Cross-entropy loss captures the discrepancy between predicted and actual class probabilities for each pixel. The IoU metric describes how accurately the predicted labels overlap with the ground truth labels.
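
For illustration, a pixel-weighted cross-entropy of this general form can be written as below (PyTorch); the tensor shapes and the way the weight map enters the loss are assumptions rather than the exact AutoCellLabeler implementation.

```python
import torch
import torch.nn.functional as F

def pixel_weighted_cross_entropy(logits, target, pixel_weights):
    """Cross-entropy loss in which each pixel contributes according to a weight map.

    logits        : (B, C, X, Y, Z) raw class scores per pixel.
    target        : (B, X, Y, Z) integer class label per pixel.
    pixel_weights : (B, X, Y, Z) per-pixel weights, e.g. derived from the
                    confidence and frequency of the human labels.
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, X, Y, Z)
    return (per_pixel * pixel_weights).sum() / (pixel_weights.sum() + 1e-12)
```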

(D) During the label extraction procedure, AutoCellLabeler is less confident of its label on pixels near the edge of ROI boundaries. Therefore, we allow the central pixels to have much higher weight when determining the overall ROI label from pixel-level network output.
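
A sketch of this aggregation step is shown below, assuming the per-pixel weight is taken from a Euclidean distance transform of the ROI mask (so boundary pixels get near-zero weight and central pixels get the largest weight); the actual weighting scheme used by the pipeline may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def roi_label_from_probability_map(prob_map, roi_mask):
    """Aggregate pixel-level class probabilities into a single label for one ROI.

    prob_map : (C, X, Y, Z) per-pixel class probabilities output by the network.
    roi_mask : (X, Y, Z) boolean mask of one segmented neuron ROI.
    """
    # Distance to the ROI boundary: ~0 at the edge, largest at the center, 0 outside.
    center_weight = distance_transform_edt(roi_mask)
    weighted = (prob_map * center_weight[None]).reshape(prob_map.shape[0], -1).sum(axis=1)
    weighted /= weighted.sum() + 1e-12
    label = int(np.argmax(weighted))      # most likely class for this ROI
    confidence = float(weighted[label])   # network confidence in that label
    return label, confidence
```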

(E) Distributions of AutoCellLabeler’s confidence across test datasets based on the relationship of its label to the human label (“Correct” = agree, “Incorrect” = disagree, “Human low conf” = human had low confidence, “Human no label” = human did not even guess a label for the neuron). ****p<0.0001, as determined by a Mann-Whitney U Test between the indicated condition and the “Correct” condition where the network agreed with the human label; n=835, 25, 322, 302 labels (from 11 animals) for the conditions “Correct”, “Incorrect”, “Human low conf”, “Human no label”, respectively; U=16700, 202691, 210797 for “Incorrect”, “Human low conf”, “Human no label” vs “Correct”, respectively.

(F) Categorization of neurons in test datasets based on AutoCellLabeler’s confidence. Here “Correct” and “Incorrect” are as in (E), but “No human label” also includes low-confidence human labels. Printed percentage values are the accuracy of AutoCellLabeler on the corresponding category, computed as the number of correct labels divided by the sum of correct and incorrect labels in that category.

(G) Distributions of accuracy of AutoCellLabeler’s high confidence (>75%) labels on neurons across test datasets based on the confidence of the human labels. n.s. not significant, *p<0.05, as determined by a paired permutation test comparing mean differences (n=11 test datasets).

(H) Accuracy of AutoCellLabeler compared with high-confidence labels from new human labelers on neurons in test datasets that were labeled at low confidence, not at all, or at high confidence by the original human labelers. Error bars are bootstrapped 95% confidence intervals. Dashed red line shows accuracy of new human labelers relative to the old human labelers, when both gave high confidence to their labels. There was no significant difference between the human vs human accuracy and the network accuracy for any of these categories of labels, determined via two-tailed empirical p-values from the bootstrapped distributions.

(I) Distributions of number of high-confidence labels per animal over test datasets. High confidence was 4-5 for human labels and >75% for network labels. We note that we standardized the manner in which split ROIs were handled for human- and network-labeled data so that the number of detected neurons could be properly compared between these two groups. n.s. not significant, ***p<0.001, as determined by a paired permutation test comparing mean differences (n=11 animals).

(J) Distributions of accuracy of high-confidence labels per animal over test datasets, relative to the original human labels. A paired permutation test comparing mean differences to the full network’s label accuracy did not find any significance.

(K) Number of ROIs per neuron class labeled at high confidence in test datasets that fall into each category, along with average confidence for all labels for each neuron class in those test datasets. “New” represents ROIs that were labeled by the network as the neuron and were not labeled by the human. “Correct” represents ROIs that were labeled by both AutoCellLabeler and the human as that neuron. “Incorrect” represents ROIs that were labeled by the network as that neuron and were labeled by the human as something else. “Lost” represents ROIs that were labeled by the human as that neuron and were not labeled by the network. “Network conf” represents the average confidence of the network for all its labels of that neuron. “Human conf” represents the average confidence of the human labelers for all their labels of that neuron. Neuron classes with high values in the “Correct” column and low values in the “Incorrect” column indicate a very high degree of accuracy in AutoCellLabeler’s labels for those classes. If those classes also have a high value in the “New” column, it could indicate that AutoCellLabeler is able to find the neuron with high accuracy in animals where humans were unable to label it.

Variants of AutoCellLabeler can annotate neurons from fewer fluorescent channels and in different strains

(A) Distributions of number of high-confidence labels per animal over test datasets for the networks trained on the indicated set of fluorophores. The “tagRFP (on low SNR)” column corresponds to a network that was trained on high-SNR, tagRFP-only data and tested on low-SNR tagRFP data due to shorter exposure times in freely moving animals. *p<0.05, **p<0.01, ***p<0.001, as determined by a paired permutation test comparing mean differences to the full network (n=11 animals).

(B) Distributions of accuracy of high-confidence labels per animal over test datasets for the networks trained on the indicated set of fluorophores. The “tagRFP (on low SNR)” column is as in (A). n.s. not significant, *p<0.05, **p<0.01, as determined by a paired permutation test comparing mean differences to the full network (n=11 animals).

(C) Same as Figure 4K, except for the tagRFP-only network.

(D) Accuracy vs detection tradeoff for various AutoCellLabeler versions. For each network, we can set a confidence threshold above which we accept labels. By varying this threshold, we can produce a tradeoff between accuracy of accepted labels (x-axis) and number of labels per animal (y-axis) on test data. Each curve in this plot was generated in this manner. The “tagRFP-only (on low SNR)” values are as in (A). The “tagRFP-only (on freely moving)” values come from evaluating the tagRFP-only network on 100 randomly-chosen timepoints in the freely moving (tagRFP) data for each test dataset. The final labels were then computed on each immobilized ROI by averaging together the 100 labels and finding the most likely label. To ensure fair comparison to other networks, only immobilized ROIs that were matched to the freely moving data were considered for any of the networks in this plot.
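
Each tradeoff curve of this kind can be generated by sweeping the confidence threshold, as in the sketch below; inputs and variable names are illustrative.

```python
import numpy as np

def accuracy_detection_tradeoff(confidences, predicted, ground_truth, n_animals,
                                thresholds=np.linspace(0.0, 1.0, 101)):
    """Sweep a confidence threshold to trade off label accuracy vs labels per animal.

    confidences  : (n_rois,) network confidence for each ROI's label.
    predicted    : (n_rois,) network label for each ROI.
    ground_truth : (n_rois,) reference (e.g. human) label for each ROI.
    Returns a list of (threshold, accuracy, labels_per_animal) tuples.
    """
    curve = []
    for t in thresholds:
        accepted = confidences >= t
        if not accepted.any():
            continue
        accuracy = float(np.mean(predicted[accepted] == ground_truth[accepted]))
        curve.append((float(t), accuracy, accepted.sum() / n_animals))
    return curve
```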

(E) Evaluating the performance of tagRFP-only AutoCellLabeler on data from another strain, SWF415, which expresses pan-neuronal NLS-GCaMP7f and pan-neuronal NLS-mNeptune2.5. Notably, the pan-neuronal promoter used for NLS-mNeptune2.5 differs from the pan-neuronal promoter used for NLS-tagRFP in NeuroPAL. Performance here was quantified by computing the fraction of network labels with the correct expected activity-behavior relationships for the neuron class (y-axis; quantified by whether an encoding model showed significant encoding; see Methods). For example, when the label was the reverse-active AVA neuron, did the corresponding calcium trace show higher activity during reverse? The blue line shows the expected fraction as a function of the true accuracy of the network (x-axis), computed via simulations (see Methods). Orange circle shows the actual fraction when AutoCellLabeler was evaluated on SWF415. Based on this, the dashed line shows the estimated true accuracy of this labeling.

CellDiscoveryNet and ANTSUN 2U can perform unsupervised cell type discovery by analyzing data across different C. elegans animals

(A) A schematic comparing the approaches of AutoCellLabeler and CellDiscoveryNet. AutoCellLabeler uses supervised learning, taking as input both images and manual labels for those images, and learns to label neurons accordingly. CellDiscoveryNet uses unsupervised learning, and can learn to label neurons after being trained only on images (with no labels provided).

(B) CellDiscoveryNet training pipeline. The network takes as input two multi-spectral NeuroPAL images from two different animals. It then outputs a Dense Displacement Field (DDF), which is a coordinate transformation between the two images. It warps the moving image under this DDF, producing a warped moving image that should ideally look very similar to the fixed image. The dissimilarity between these images is the image loss component of the loss function, which is added to the regularization loss that penalizes non-linear image deformations present in the DDF.

(C) Network loss curves. Both training and validation loss curves start to plateau around 600 epochs.

(D) Distributions of normalized cross-correlation (NCC) scores comparing the CellDiscoveryNet predictions (warped moving images) and the fixed images for each pair of registered images. These NCCs were computed on all four channels simultaneously, treating the entire image as a single 4D matrix for this purpose. The “Train” distribution contains the NCC scores for all pairs of images present in CellDiscoveryNet’s training data, while the “Val+Test” distribution contains any pair of images that was not present in its training data.

(E) Distributions of centroid distance scores based on human labels. These are computed over all (moving, fixed) image pairs on all neurons with high-confidence human labels in both moving and fixed images. The centroid distance scores represent the Euclidean distance between the network’s prediction of the neuron’s location and its correct location as labeled by the human. Values of a few pixels or less roughly indicate that the neuron was mapped to its correct location, while large values mean the neuron was mis-registered. The “Train” and “Val+Test” distributions are as in (D). The “High NCC” distribution is from only (moving, fixed) image pairs where the NCC score was greater than the 90th percentile of all such NCC scores. ****p<0.0001, Mann-Whitney U Test comparing All versus High NCC (n=5,048 vs 486 image pairs, U = 1.678 × 10^6).

(F) Labeling accuracy vs number of linked neurons tradeoff curve. Accuracy is the fraction of linked ROIs with labels matching their cluster’s most frequent label (see Methods). Number of linked neurons is the total number of distinct clusters; each cluster must contain an ROI in more than half of the animals to be considered a cluster. The parameter w7 describes when to terminate the clustering algorithm: higher values mean the clustering algorithm terminates earlier, resulting in more accurate but fewer detections. Red dot is the selected value w7 = 10^-9, where 125 clusters were detected with 93% labeling accuracy.

(G) Number of neurons labeled per animal in the 11 testing datasets. This plot compares the number of neurons labeled as follows: human labels with 4-5 confidence, AutoCellLabeler labels with 75% or greater confidence, and CellDiscoveryNet with ANTSUN 2U labels with parameter w7 = 10^-9. ***p<0.001, as determined by a paired permutation test comparing mean differences (n=11 animals).

(H) Accuracy of neuron labels in the 11 testing datasets. This plot defines the original human confidence 4-5 labels as ground truth. “Human relabel” labels are confidence 4-5 labels assigned by different humans (independently from the first set of human labels). AutoCellLabeler labels are those with 75% or greater confidence. CellDiscoveryNet labels were created by running ANTSUN 2U with w7 = 10^-9 and defining the correct label for each cluster to be its most frequent label. A paired permutation test comparing mean differences to the full network’s label accuracy did not find any significant differences.

(I) Same as Figure 4K, except using labels from CellDiscoveryNet with ANTSUN 2U. The neurons “NEW 1” through “NEW 5” are clusters that were not labeled frequently enough by humans to be able to determine which neuron class they corresponded to, as described in the main text.

Example images and performance of network trained to register arbitrary image pairs

(A) Performance of image registration in five different animals in the testing set. Normalized Cross-Correlation (NCC) scores of aligned tagRFP images are shown, which indicate the extent of image alignment (best achievable score is 1). The 90-100 registration problems examined per animal are shown as violin plots, with overlaid lines indicating minimum, mean, and maximum values.

(B) Performance of image registration in five different animals in the testing set. Centroid distance is the average Euclidean distance between the centroids of matched neurons in each image (best achievable score is 0). The 90-100 registration problems examined per animal are shown as violin plots, with overlaid lines indicating minimum, mean, and maximum values.

(C) Performance of image registration in five different registration problems (i.e. image pairs) from one example animal. Centroid distance is the average Euclidean distance between the centroids of matched neurons in that image pair (best achievable score is 0). All of the centroid position distances for each registration problem are shown as violin plots, with overlaid lines indicating minimum, mean, and maximum values.

(D) Five example image pairs in the training set for BrainAlignNet. These are maximum intensity projections of the tagRFP channel, showing two different timepoints that were selected to be the fixed and moving images in each of these five registration problems.

(E) Five example image pairs in the training set for the network trained to align arbitrary image pairs, including much more challenging problems. Note that the head bending is more dissimilar for these image pairs, as compared to those in (D). Data are shown as in (D).

(F) Performance of the network trained to register arbitrary image pairs. Quantification is for testing data. We quantify centroid distance (average alignment of neuron centroids) and NCC (image similarity) as in panels (A-C). By both metrics, this network’s performance is far worse than that of the BrainAlignNet presented in Fig. 1. The two panels on the right show that results are qualitatively similar for different animals in the testing set.

Characterization of pan-neuronal GFP datasets processed by ANTSUN 2.0

(A) Example rimb-1::GFP (pan-neuronal GFP) dataset processed by ANTSUN 2.0. The data are shown as ratiometric GFP/RFP without any further normalization.

(B) Quantification of the standard deviation of GFP traces from 3 pan-neuronal datasets processed by either ANTSUN 1.4 (without BrainAlignNet) or 2.0 (with BrainAlignNet). To standardize across datasets, the standard deviation here was computed on traces that were normalized by F/Fmean. Ideally, GFP traces should have low standard deviation; processing with ANTSUN 2.0 did not impair trace quality, compared to the previously described ANTSUN 1.4 (ref. 36).
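
For reference, this quantification corresponds to the following simple computation (a sketch; array shapes are illustrative).

```python
import numpy as np

def gfp_trace_variability(traces):
    """Per-neuron standard deviation of F/Fmean-normalized traces.

    traces : (n_neurons, n_timepoints) array of GFP/RFP ratio values.
    """
    normalized = traces / traces.mean(axis=1, keepdims=True)  # F / Fmean per neuron
    return normalized.std(axis=1)
```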

BrainAlignNet Performance on Additional Withheld Jellyfish Data

(A) Image registration quality was assessed via image alignment on image pairs before and after BrainAlignNet registration. These image pairs were from the animals used in the training data, but were different image pairs than those used for training (n = 25,697). As in Fig. 3, Normalized Cross-Correlation (NCC) scores of aligned mCherry images indicate image alignment. NCC is shown for Euler-initialized images (“pre-align”) and BrainAlignNet-registered images. ****p<0.0001, two-tailed Wilcoxon signed rank test.

(B) Image registration quality was assessed via centroid alignment on image pairs before and after BrainAlignNet registration. These image pairs were from the animals used in the training data, but were different image pairs than those used for training (n = 25,697). Centroid distance is as described in Fig. 3 and is shown for Euler-initialized images (“pre-align”) and BrainAlignNet-registered images. ****p<0.0001, two-tailed Wilcoxon signed rank test.

Further characterization of the AutoCellLabeler network

(A) Tradeoff between network labeling accuracy (x-axis) and number of neurons labeled (y-axis) for the full AutoCellLabeler network. The number of neurons labeled can be varied by adjusting the confidence threshold that the network must exceed to label an ROI; varying this threshold generates the full curve, which captures the tradeoff. The blue circle indicates the 75% confidence threshold that we selected to use in our analyses.

(B) Confusion matrix showing which neurons could potentially be confused for one another by AutoCellLabeler. Note that, except for the diagonal, the matrix is mostly white, reflecting that the network is mostly (98%) accurate. Neurons with some inaccuracies were clustered to the lower left (boxed region). With a linear color scale, the diagonal (correct labels) would be off-scale bright, so we capped the colorbar range at 4 counts to keep the actual confusion entries visible. For reference, the mean value across the diagonal is 9.7.
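
A minimal matplotlib sketch of this display convention (capping the color scale so that rare off-diagonal confusion entries remain visible next to the much brighter diagonal) is shown below; class names, colormap, and axis assignments are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(confusion, class_names, cap=4):
    """Plot a confusion matrix with the color scale capped at `cap` counts so that
    rare off-diagonal (confusion) entries remain visible next to the bright diagonal."""
    fig, ax = plt.subplots(figsize=(10, 10))
    im = ax.imshow(confusion, cmap="Reds", vmin=0, vmax=cap)  # counts above cap saturate
    ax.set_xticks(range(len(class_names)))
    ax.set_xticklabels(class_names, rotation=90, fontsize=4)
    ax.set_yticks(range(len(class_names)))
    ax.set_yticklabels(class_names, fontsize=4)
    # Axis assignment (which axis holds predicted vs human labels) is illustrative.
    ax.set_xlabel("Human label")
    ax.set_ylabel("Network label")
    fig.colorbar(im, ax=ax, label="count (capped)")
    return fig
```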

(C) Positive correlation between human and autolabel confidence across the neuronal cell types (each cell type is a blue dot). This plot also highlights that a subset of cells are more difficult for human labelers and, therefore, also for AutoCellLabeler (i.e. the cells that are not clustered in the upper right).

Further characterization of the different AutoCellLabeler variants

(A) These plots, displayed as in Fig. 4F, show the performance of the indicated cell annotation networks (trained and/or evaluated on different fluorophores). Data are displayed to show network performance on different ROIs that it labels with different levels of confidence. Printed percentage values are the accuracy of AutoCellLabeler within the corresponding confidence category, computed as the number of correct labels divided by the sum of correct and incorrect labels in that category. Note that the lower-performing networks (for example, tagRFP-only) are still accurate for their high-confidence labels, and that their decreased accuracy is mostly due to a lower fraction of high-confidence labels (i.e. more cell types where the networks had low confidence in their annotations).

(B) Example maximum intensity projection images of the worm in the tagRFP channel under three different imaging conditions: immobilized high-SNR (created by averaging together 60 immobilized lower-SNR images, our typical condition for NeuroPAL imaging); immobilized lower-SNR (i.e. one of those 60 images); and freely moving (taken with the same imaging settings as immobilized lower-SNR, but in a freely moving animal).

Further characterization of CellDiscoveryNet and ANTSUN 2U performance

(A) Matrix of all clusters generated by running ANTSUN 2U. Each row is a distinct cluster (i.e. inferred cell type), while each column is a distinct animal. Black entries mean that the given cluster did not include any ROIs in the given animal (i.e., ANTSUN 2U failed to label that cluster in that animal). Non-black entries mean that the cluster contained an ROI in that animal. Row names correspond to the most frequent human label among ROIs in the cluster (this was defined by first disambiguating the most frequent neuron class, and then disambiguating L from R). Green entries correspond to cases where the given ROI’s label matched the most frequent class label (row name ignoring L/R), orange entries correspond to cases where the given ROI’s label did not match the most frequent class label, and blue entries mean that the given ROI did not have a high-confidence human label. The neurons “NEW 1” through “NEW 5” are clusters that were not labeled frequently enough by humans to be able to determine which neuron class they corresponded to, as described in the main text. Note that there are two rows of “glia”, potentially corresponding to two different types of glia in different stereotyped locations (though in all labeling in this paper, glia are given a single label type rather than being subdivided into glial subtypes).