
Fast deep neural correspondence for tracking and identifying neurons in C. elegans using semi-synthetic training

  1. Xinwei Yu
  2. Matthew S Creamer
  3. Francesco Randi
  4. Anuj K Sharma
  5. Scott W Linderman
  6. Andrew M Leifer  Is a corresponding author
  1. Department of Physics, Princeton University, United States
  2. Princeton Neuroscience Institute, Princeton University, United States
  3. Department of Statistics, Stanford University, United States
  4. Wu Tsai Neurosciences Institute, Stanford University, United States
Tools and Resources
Cite this article as: eLife 2021;10:e66410 doi: 10.7554/eLife.66410

Abstract

We present an automated method to track and identify neurons in C. elegans, called ‘fast Deep Neural Correspondence’ or fDNC, based on the transformer network architecture. The model is trained once on empirically derived semi-synthetic data and then predicts neural correspondence across held-out real animals. The same pre-trained model both tracks neurons across time and identifies corresponding neurons across individuals. Performance is evaluated against hand-annotated datasets, including NeuroPAL (Yemini et al., 2021). Using only position information, the method achieves 79.1% accuracy at tracking neurons within an individual and 64.1% accuracy at identifying neurons across individuals. Accuracy at identifying neurons across individuals is even higher (78.2%) when the model is applied to a dataset published by another group (Chaudhary et al., 2021). Accuracy reaches 74.7% on our dataset when using color information from NeuroPAL. Unlike previous methods, fDNC does not require straightening or transforming the animal into a canonical coordinate system. The method is fast and predicts correspondence in 10 ms, making it suitable for future real-time applications.

eLife digest

Understanding the intricacies of the brain often requires spotting and tracking specific neurons over time and across different individuals. For instance, scientists may need to precisely monitor the activity of one neuron even as the brain moves and deforms; or they may want to find universal patterns by comparing signals from the same neuron across different individuals.

Both tasks require matching which neuron is which in different images and amongst a constellation of cells. This is theoretically possible in certain ‘model’ animals where every single neuron is known and carefully mapped out. Still, it remains challenging: neurons move relative to one another as the animal changes posture, and the position of a cell is also slightly different between individuals. Sophisticated computer algorithms are increasingly used to tackle this problem, but they are far too slow to track neural signals as real-time experiments unfold.

To address this issue, Yu et al. designed a new algorithm based on the Transformer, an artificial neural network originally used to spot relationships between words in sentences. To learn relationships between neurons, the algorithm was fed hundreds of thousands of ‘semi-synthetic’ examples of constellations of neurons. Rather than relying on painstakingly collated experimental data, these datasets were created by a simulator based on a few simple measurements. Testing the new algorithm on the tiny worm Caenorhabditis elegans revealed that it was faster and more accurate, finding corresponding neurons in about 10 ms.

The work by Yu et al. demonstrates the power of using simulations rather than experimental data to train artificial networks. The resulting algorithm can be used immediately to help study how the brain of C. elegans makes decisions or controls movements. Ultimately, this research could allow brain-machine interfaces to be developed.

Introduction

The nervous system of the nematode C. elegans is well characterized, such that each of the 302 neurons is named and has a stereotyped location across animals (White et al., 1986; Sulston, 1976; Witvliet et al., 2020). The capability to find corresponding neurons across animals is essential to investigate neural coding and neural dynamics across animals. Despite the worm’s overall stereotypy, the variability in neurons’ spatial arrangement is sufficient to make predicting neural correspondence a challenge. For whole-brain calcium imaging (Schrödel et al., 2013; Venkatachalam et al., 2016; Nguyen et al., 2016), identifying neurons across animals is additionally challenging because the nuclear localized markers that are used tend to obscure morphological features that would otherwise assist in neural identification.

An ideal method for finding neural correspondence in C. elegans should accommodate two major use cases. The first is tracking neurons within an individual across time as the animal’s head moves and deforms. Here, the goal is to be able to say with confidence that a neuron imaged in a volume taken at time t1 is the same as another neuron taken from a volume imaged at time t2. Tracking across time is needed to extract calcium dynamics from neurons during freely moving population calcium imaging (Venkatachalam et al., 2016; Nguyen et al., 2016; Lagache et al., 2020). Additionally, very fast real-time tracking will be needed to bring closed-loop techniques such as brain-machine interfaces (Clancy et al., 2014), and optical patch clamping (Hochbaum et al., 2014) to moving animals.

The second and more general use case is finding neural correspondence across individuals. Often this is to identify the name of a neuron with respect to the connectome (White et al., 1986) or a gene expression atlas (Hammarlund et al., 2018). Even when a neuron’s name cannot be ascertained, being able to identify which neurons are the same across recordings allows researchers to study neural population codes common across individuals.

For both use cases, a method to find neural correspondence is desired that is accurate, fast, requires minimal experimental training data and that generalizes across animal pose, orientation, imaging hardware, and conditions. Furthermore, an ideal method should not only perform well when restricted to neural positioning information but should also be flexible enough to leverage genetically encoded color labeling information or other features for improved accuracy when available. Multicolor strains are powerful new tools that use multiple genetically encoded fluorescent labels to aid neural identification (Yemini et al., 2021; Toyoshima et al., 2019) (we use one of those strains, NeuroPAL (Yemini et al., 2021), for validating our model). However, some applications, like whole-brain imaging in moving worms, are not yet easily compatible with the multicolor imaging required by these new strains, so there remains a need for improved methods that use position information alone.

A variety of automated methods for C. elegans have been developed that address some, but not all, of these needs. Most methods developed so far focus on finding the extrinsic similarity (Bronstein, 2007) between one neuron configuration, called a test, and another neuron configuration called a template. Methods like these deform space to minimize distances between neurons in the template and neurons in the test and then attempt to solve an assignment problem (Lagache et al., 2018). For example, a simple implementation would be to use a non-rigid registration model, like Coherent Point Drift (CPD) (Myronenko and Song, 2010), to optimize a warping function between neuron positions in the test and template. More recent non-rigid registration algorithms like PR-GLS (Ma et al., 2016) also incorporate relative spatial arrangement of the neurons (Wen et al., 2018).

Models can also do better by incorporating the statistics of neural variability. NeRVE registration and clustering (Nguyen et al., 2017), for example, also uses a non-rigid point set registration algorithm (Jian and Vemuri, 2011) to find a warping function that minimizes the difference between a configuration of neurons at one time point and another. But NeRVE further registers the test neurons onto multiple templates to define a feature vector and then finds neural correspondence by clustering those feature vectors. By using multiple templates, the method implicitly incorporates more information about the range and statistics of that individual animal’s poses to improve accuracy.

A related line of work uses generative models to capture the statistics of variability across many individual worms. These generative models specify a joint probability distribution over neural labels and the locations, shapes, sizes, or appearance of neurons identified in the imaging data of multiple individuals (Bubnis et al., 2019; Varol et al., 2020; Nejatbakhsh et al., 2020; Nejatbakhsh and Varol, 2021). These approaches are based on assumptions about the likelihood of observing a test neural configuration, given an underlying configuration of labeled neurons. For example, these generative models often begin with a Gaussian distribution over neuron positions in a canonical coordinate system and then assume a distribution over potentially non-rigid transformations of the worm’s pose for each test configuration. Then, under these assumptions, the most likely neural correspondence is estimated via approximate Bayesian inference.

The success of generative modeling hinges upon the accuracy of its underlying assumptions, and these are challenging to make for high-dimensional data. An alternative is to take a discriminative modeling approach (Bishop, 2006). For example, recent work (Chaudhary et al., 2021) has used conditional random fields (CRF) to directly parameterize a conditional distribution over neuron labels, rather than assuming a model for the high-dimensional and complex image data. CRF allows for a wide range of informative features to be incorporated in the model, such as the angles between neurons, or their relative anterior-posterior positions, which are known to be useful for identifying neurons (Long et al., 2009). Ultimately, however, it is up to the modeler to select and hand curate a set of features to input into the CRF.

The next logical step is to allow for much richer features to be learned from the data. Artificial neural networks are ideal for tackling this problem, but they require immensely large training sets. Until now, their use for neuron identification has been limited. For example, in one tracking algorithm, artificial neural networks provide only the initialization, or first guess, for non-rigid registration (Wen et al., 2018).

Our approach is based on a simple insight: it is straightforward to generate very large semi-synthetic datasets of test and template worms that nonetheless are derived from measurements. We use neural positions extracted from existing imaging datasets, and then apply known, nonlinear transformations to warp those positions into new shapes for other body postures, or other individuals. Furthermore, we simulate the types of noise that appear in real datasets, such as missing or spurious neurons. Using these large-scale semi-synthetic datasets, we train an artificial neural network to map the simulated neural positions back to the ground truth. Given sufficient training data (which we can generate at will), the network learns the most informative features of the neural configurations, rather than requiring the user to specify them by hand.

Importantly, using semi-synthetic data also allows us to train our model even when we completely lack experimentally acquired ground truth data. And indeed, in this work, semi-synthetic data is derived exclusively from measurements that lack any ground truth correspondence either within or across animals. All ground truth for training comes only from simulation. Realistic synthetic, semi-synthetic or augmented datasets have been key to cracking other challenging problems in neuroscience (Parthasarathy et al., 2017; Yoon et al., 2017; Sun et al., 2018; Lee et al., 2020; Mathis and Mathis, 2020; Pereira et al., 2020) and have already shown promising potential for tracking neurons (Wen et al., 2018).

In this work, we use semi-synthetic data to train a Transformer network, an artificial neural network architecture that has shown great success in natural language processing tasks (Vaswani et al., 2017). Transformers incorporate an attention mechanism that can leverage similarities between pairs of inputs to build a rich representation of the input sequence for downstream tasks like machine translation and sentiment prediction. We reasoned this same architecture would be well-suited to extract spatial relationships between neurons in order to build a representation that facilitates finding correspondence to neurons in a template worm.

Not only is the Transformer well-suited to learning features for the neural correspondence problem, it also obviates the need to straighten (Peng et al., 2008) the worm in advance. Until now, existing methods have either required the worm to be straightened in preprocessing (Bubnis et al., 2019; Chaudhary et al., 2021) or explicitly transformed them during inference (Varol et al., 2020; Nejatbakhsh et al., 2020). Straightening the worm is a non-trivial task, and it is especially error-prone for complicated poses such as when the worm rolls along its centerline.

Finally, one of the main advantages of the Transformer architecture is that it permits parallel processing of the neural positions using modern GPU hardware. In contrast to existing methods, which have not been optimized for speed, the Transformer can make real-time predictions once it has been trained. This speed is a necessary step toward bringing real-time applications (Clancy et al., 2014; Hochbaum et al., 2014) to freely moving animals.

Results

Fast deep neural correspondence accurately matches neurons across semi-synthetic individuals

We developed a fast deep neural correspondence (fDNC) model that seeks to find the correspondence between configurations of C. elegans neurons in different individuals or in the same individual across time (Figure 1). We used a deep learning artificial neural network architecture, called the transformer architecture (Vaswani et al., 2017), that specializes in finding pairwise relations in datasets (Figure 1F). The transformer architecture identified similarities across spatial relations of neurons in a test and a template to identify correspondences between the neurons.

Fast deep neural correspondence model.

(A–D) Schematic of training and analysis pipeline for using the fast Deep Neural Correspondence (fDNC) model to predict correspondence between neurons across individuals. (A) Volumetric images of fluorescently labeled neuronal nuclei are segmented to extract neuron positions. (Scale bar, 10 µm). (B) Semi-synthetic training data is generated with a simulator. The simulator transforms the neural positions of a real worm and introduces noise to generate new semi-synthetic individuals. Approximately N=10^4 neuron configurations without labels from 12 moving worms were used to generate 2.304 × 10^5 labeled semi-synthetic worms for training. (C) During training, the fDNC model finds optimal internal parameters to minimize the difference between predicted neural correspondence and true correspondence in pairs of semi-synthetic worms. (D) Given positions for neurons in real worm A and positions for neurons in real worm B, the trained model predicts correspondences between them. Furthermore, if labels for neurons in A are known, the model can then assign corresponding labels to neurons in worm B. (E) Detailed schematic of the simulator from panel B. (F) Transformer architecture of the fDNC model. The position features of a template worm with n neurons and a test worm with m neurons are taken as input. The features are computed via a multi-head attention mechanism. ‘Add and Norm’ refers to an addition and layer normalization step. a and b are neuron positions and u and v are embeddings for the template and test, respectively. We choose the number of layers N=6 and the embedding dimension d_emb=128 by evaluating the performance on a held-out validation set.
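The attention step named in panel F can be illustrated with a minimal single-head sketch of scaled dot-product attention as defined by Vaswani et al., 2017. This is not the fDNC implementation: the real model uses multi-head attention with learned projections, whereas the sketch below assumes the queries, keys, and values are given directly, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(q.k / sqrt(d)) weighted sum of values."""
    d = len(keys[0])
    n_out = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt of the key dimension.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output coordinate is the attention-weighted average of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(n_out)])
    return outputs


# Toy example: two "neurons" with identical keys receive equal attention weight.
out = scaled_dot_product_attention([[1.0, 0.0]],
                                   [[0.0, 0.0], [0.0, 0.0]],
                                   [[1.0, 2.0], [3.0, 4.0]])
```

In this toy case the output is simply the average of the two value vectors, because identical keys yield uniform attention weights.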

Within a single individual, neural positions vary as the worm moves, deforms, and changes its orientation and pose. Across isogenic individuals, there is an additional source of variability that arises from the animal’s development. In practice, further variability also arises from experimental measurements: neuron positions must first be extracted from fluorescent images (Figure 1A), and slight differences in label expression, imaging artifacts, and optical scattering all contribute to errors in segmenting individual neurons.

We created a simulator to model these different sources of variability and used it to generate realistic pairs of empirically derived semi-synthetic animals with known correspondence between their neurons for training our model (Figure 1B,E). The simulator took configurations of neuron positions that lacked ground truth from real worms as inputs and then scaled and deformed them, forced them to adopt different poses sampled from real worms, and then introduced additional sources of noise to generate many new semi-synthetic individuals. We then trained our fDNC model on these experimentally derived semi-synthetic individuals of different sizes and poses.
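The general idea of such a simulator can be sketched in a few lines. The code below is a toy illustration only and is not the authors' simulator: the quadratic bend, the parameter values, and the function name `make_semi_synthetic` are all assumptions chosen for brevity. It shows how scaling, a nonlinear deformation, reorientation, positional jitter, missing neurons, and spurious detections can be applied while keeping track of the ground truth correspondence.

```python
import math
import random


def make_semi_synthetic(positions, rng, scale_range=(0.9, 1.1), jitter=0.5,
                        p_missing=0.1, n_spurious=2):
    """Return (new_positions, labels).

    labels[i] is the index of the source neuron in `positions`,
    or -1 if point i is a spurious detection.
    """
    scale = rng.uniform(*scale_range)          # animal-to-animal size variation
    angle = rng.uniform(0.0, 2.0 * math.pi)    # random in-plane orientation
    bend = rng.uniform(-0.01, 0.01)            # toy nonlinear "posture" (assumed form)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    new_positions, labels = [], []
    for idx, (x, y, z) in enumerate(positions):
        if rng.random() < p_missing:
            continue                           # neuron missed by segmentation
        x, y, z = scale * x, scale * y, scale * z
        y += bend * x * x                      # quadratic bend along the x axis
        xr = cos_a * x - sin_a * y             # rotate in the x-y plane
        yr = sin_a * x + cos_a * y
        new_positions.append((xr + rng.gauss(0.0, jitter),
                              yr + rng.gauss(0.0, jitter),
                              z + rng.gauss(0.0, jitter)))
        labels.append(idx)

    # Spurious detections drawn uniformly from the bounding box of the input points.
    lows = [min(p[k] for p in positions) for k in range(3)]
    highs = [max(p[k] for p in positions) for k in range(3)]
    for _ in range(n_spurious):
        new_positions.append(tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)))
        labels.append(-1)

    # Shuffle so that point order carries no information about identity.
    order = list(range(len(labels)))
    rng.shuffle(order)
    return [new_positions[i] for i in order], [labels[i] for i in order]


seed_worm = [(float(i), 0.5 * i, 0.0) for i in range(50)]
points, truth = make_semi_synthetic(seed_worm, random.Random(0))
```

Because every generated point carries the index of the neuron it came from, any two semi-synthetic individuals derived from the same seed configuration have a known correspondence, even though the seed itself has no labels.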

Training our model on the empirically derived semi-synthetic data offered advantages compared to experimentally acquired data. First, it allowed us to train on larger datasets than would otherwise be practical. We trained on 2.304 × 10^5 semi-synthetic individuals, but only seeded our simulator with unlabeled neural configurations from experimentally acquired recordings of 12 individuals (4 × 10^3 volumes spread across the 12 individuals, all of which lacked ground-truth correspondence). Second, we did not need to provide ground truth correspondence because the simulator instead generates its own ground truth correspondence between semi-synthetic individuals, thereby avoiding a tedious and error-prone manual step. Consequently, no experimentally acquired ground truth correspondence was used to train the model. Later in the work, we use ground truth information from human annotated NeuroPAL (Yemini et al., 2021) strains to evaluate the performance of our model, but no NeuroPAL strains were used for training. Importantly, the amount of test data with ground truth correspondence needed for evaluating performance is much smaller than the amount of training data that would be needed for training. Third, by using large and varied semi-synthetic data, we force the model to generalize its learning to a wide range of variabilities in neural positions and we avoid the risks of overtraining on idiosyncrasies specific to our imaging conditions or segmentation. Overall, we reasoned that training with semi-synthetic data should make the model more robust and more accurate across a wider range of conditions, orientations and animal poses than would be practical with experimentally acquired ground-truth datasets.

We trained our fDNC model on 2.304 × 10^5 semi-synthetic individuals (Figure 1C) and then, after training, evaluated its performance on 2000 additional held-out semi-synthetic pairs of individuals which had not been accessible to the model during training (Figure 1D and Figure 2). Model performance was evaluated by calculating the accuracy of the model’s predicted correspondence with respect to the ground truth in pairs of semi-synthetic individuals. One individual is called the ‘test’ and the other is the ‘template’. Every neuron in the test or template, whichever has fewer, is assigned a match. Accuracy is reported as the number of correctly predicted matches between test and template, divided by the total number of ground truth matches in the test and template pair. Our fDNC model achieved 96.5% mean accuracy on the 2000 pairs of held-out semi-synthetic individuals. We compared this performance to that of Coherent Point Drift (CPD) (Myronenko and Song, 2010), a classic registration method used for automatic cell annotation. CPD achieved 31.1% mean accuracy on the same held-out semi-synthetic individuals. Our measurements show that the fDNC model significantly outperforms CPD at finding correspondence in semi-synthetic data. For the rest of the work, we use experimentally acquired human annotated data to evaluate performance.
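The accuracy definition above translates directly into code. The following is a minimal sketch, assuming that both the predicted and the ground truth correspondence are stored as dictionaries mapping a test neuron index to a template neuron index; that representation and the function name are illustrative choices, not taken from the authors' code.

```python
def correspondence_accuracy(predicted, ground_truth):
    """Fraction of ground truth matches that the prediction reproduces.

    Both arguments map test-neuron index -> template-neuron index.
    Predictions for neurons without a ground truth match are ignored.
    """
    if not ground_truth:
        raise ValueError("ground truth must contain at least one match")
    correct = sum(1 for test_idx, template_idx in ground_truth.items()
                  if predicted.get(test_idx) == template_idx)
    return correct / len(ground_truth)


truth = {0: 0, 1: 1, 2: 2, 3: 3}
pred = {0: 0, 1: 2, 2: 1, 3: 3, 4: 5}   # neurons 1 and 2 swapped; 4 has no ground truth
accuracy = correspondence_accuracy(pred, truth)
```

Here two of the four ground truth matches are recovered, so the accuracy is 0.5; the extra prediction for neuron 4 does not affect the score because it has no ground truth match.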

Figure 2 (with 1 supplement)
fDNC accurately predicts matches between neurons from semi-synthetic worms.

(A) Schematic of evaluation pipeline. fDNC model performance is evaluated on pairs of semi-synthetic worms with known correspondence that had been held out from training. Given neural positions in worms A and B, the model predicts matches between A and B. Accuracy is the number of correctly predicted matches divided by the total number of ground truth matches for the A-B pair. (B) Model performance of Coherent Point Drift (CPD) registration is compared to the fDNC model on 2000 randomly selected pairs of held-out semi-synthetic individuals, without replacement. (p=0, Wilcoxon signed rank test).

fDNC accurately tracks neurons within an individual across time

We next evaluated the fDNC model’s performance at tracking neurons within an individual over time, as is needed, for example, to measure calcium activity in moving animals (Venkatachalam et al., 2016; Nguyen et al., 2016). We evaluated model performance on an experimentally acquired calcium imaging recording of a freely moving C. elegans from Nguyen et al., 2017 in which a team of human experts had manually tracked and annotated neuron positions over time (strain AML32, 1514 volumes, six volumes per second; additional details are described in the 'Datasets' section of 'Materials and methods'). The recording has sufficiently large animal movement that the average distance a neuron travels between volumes (4.8 µm) is of similar scale to the average distance between nearest neuron neighbors (5.3 µm). The recording was excluded from training and from the set of recordings used by the simulator. We collected neuron configurations from all n time points during this recording to form n-1 pairs of configurations upon which to evaluate the fDNC model. Each pair consisted of a test and template. The template was always from the same time point t, while the test was taken to be the volume at any of the other time points. We applied the pre-trained fDNC model to the pairs of neuron configurations and compared the model’s predicted correspondence to the ground truth from manual human tracking (Figure 3). Across the pairs, the fDNC model showed a mean accuracy of 79.1%. We emphasize that the fDNC model achieved this high accuracy on tracking a real worm using only neuron position information even though it is trained exclusively on semi-synthetic data.
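The two length scales compared above (how far a neuron travels between volumes versus how far it sits from its nearest neighbor) can be computed from annotated positions as follows. This is a minimal sketch with toy coordinates; it assumes positions are given as tuples in the same units and that the two volumes list the same neurons in the same order. The function names are illustrative.

```python
import math


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def mean_displacement(points_t1, points_t2):
    """Average distance travelled by each neuron between two volumes."""
    return sum(math.dist(p, q) for p, q in zip(points_t1, points_t2)) / len(points_t1)


volume_t1 = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
volume_t2 = [(x + 1.0, y, z) for x, y, z in volume_t1]   # rigid shift by 1 unit

nn_scale = mean_nearest_neighbor_distance(volume_t1)
motion_scale = mean_displacement(volume_t1, volume_t2)
```

When the motion scale approaches the nearest-neighbor scale, as it does in the recording described above, simply assigning each neuron to the closest neuron in the next volume becomes unreliable, which is why a correspondence model is needed.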

Figure 3 (with 1 supplement)
Tracking neurons within an individual across time.

(A) Schematic shows how the pose and orientation of a freely moving animal change with time. Black dot indicates head. (B) Pipeline to evaluate the fDNC model at tracking neurons within an individual across time. The fDNC model takes in positional features of a template neuron configuration from one time t1 of a freely moving worm, and predicts the correspondence at another time t2, called the test. Recording is of a moving animal undergoing calcium imaging from Nguyen et al., 2017. Ground truth neuron correspondence is provided by manual human annotation. The same time point is used as the template for all 1513 template-test pairs. (C) Performance of fDNC and alternative models at tracking neurons within an individual are displayed in order of mean performance. CPD refers to Coherent Point Drift. NeRVE(1) refers to the restricted NeRVE model that has access to only the same template as CPD and fDNC. NeRVE(100) refers to the full NeRVE model which uses 100 templates from the same individual to make a single prediction. A Wilcoxon signed rank significance test of fDNC’s performance compared to CPD, NeRVE(1) and NeRVE(100) yields p=2.5×10^-223, 1.3×10^-140 and 1.5×10^-102, respectively. Boxplots show median and interquartile range. (D) fDNC tracking performance by neuron. Cumulative fraction of neurons is shown as a function of the acceptable error rate. (E) Detailed comparison of fDNC tracking to human annotation of a moving GCaMP recording from Nguyen et al., 2017. Color at each time point indicates the neuron label manually annotated by a human. White gaps indicate that the neuron is missing at that time point. In the case of perfect agreement between human and fDNC, each row will have only a single color or white.

We compared the performance of our fDNC model to that of CPD Registration, and to Neuron Registration Vector Encoding and clustering (NeRVE), a classical computer vision model that we had previously developed specifically for tracking neurons within a moving animal over time (Nguyen et al., 2017; Figure 3C). fDNC clearly outperformed CPD, achieving 79.1% accuracy compared to CPD’s 62.7%.

Both CPD and fDNC predict neural correspondence of a test configuration by comparing only to a single template. In contrast, the NeRVE method takes 100 templates, where each one is a different neuron configuration from the same individual, and uses them all to inform its prediction. The additional templates give the NeRVE method extra information about the range of possible neural configurations made by the specific individual whose neurons are being tracked. We therefore compared the fDNC model both to the full NeRVE method and also to a restricted version of the NeRVE method in which NeRVE had access only to the same single template as the CPD or fDNC models. (Under this restriction, the NeRVE method no longer clusters and the method collapses to a series of Gaussian mixture model registrations [Jian and Vemuri, 2011]). In this way, we could compare the two methods when given the same information. fDNC’s mean performance of 79.1% was statistically significantly more accurate than the restricted NeRVE model (mean 73.1%, p=1.3×10^-140, Wilcoxon signed rank test). The full NeRVE model that had access to additional templates outperformed the fDNC model slightly (82.9%, p=1.5×10^-102, Wilcoxon signed rank test).

Because CPD, NeRVE, and fDNC are all time-independent algorithms, their performance on a given volume is the same, even if nearby volumes are omitted or shuffled in time. One benefit of this approach is that errors from prior volumes do not accumulate over the duration of the recording. To visualize performance over time, we show a volume-by-volume comparison of fDNC’s tracking to that of a human (Figure 3E). We also characterize model performance on a per neuron basis (Figure 3D).

Finally, we used fDNC to extract whole brain calcium activity from a previously published recording of a moving animal in which two well-characterized neurons AVAL and AVAR were unambiguously labeled with an additional colored fluorophore (Hallinen et al., 2021; Figure 3—figure supplement 1A, Video 1). Activity extracted from neurons AVAL and AVAR exhibited calcium transients when the animal underwent prolonged backward locomotion, as expected (Figure 3—figure supplement 1B). We conclude that the fDNC model is suitable for tracking neurons across time and performs similarly to the NeRVE method.

Video 1
Video of neuron tracking during calcium imaging in moving animal.

The fDNC algorithm is applied to a calcium imaging recording from Hallinen et al., 2021 (six volumes per second, 200 planes per second). Same recording as in Figure 3—figure supplement 1. Images are shown from the RFP channel and show nuclear localized tagRFP in each neuron. For each volume, a single optical plane is shown that contains neuron AVAR (labeled in pink). Labels assigned by fDNC are shown. Color indicates whether the neuron resides in the displayed optical plane (green), or up to two planes above or below (white). The time of the video corresponding to Figure 3—figure supplement 1 is shown in the top left corner.

In the following sections, we further show that the fDNC method is orders of magnitude faster than NeRVE. Moreover, unlike NeRVE which can only be used within an individual, fDNC is also able to predict the much more challenging neural correspondence across individuals.

fDNC is fast enough for future real-time applications

Because it relies on an artificial neural network, the fDNC model finds correspondence for a set of neurons faster than traditional methods (Table 1). From the time that a configuration of segmented neurons is loaded onto a GPU, it takes only an average of 10 ms for the fDNC model to predict correspondence for all neurons on a 2.4 GHz Intel machine with an NVIDIA Tesla P100 GPU. If not using a GPU, the model predicts correspondence for all neurons in 50 ms. In contrast, on the same hardware it takes CPD 930 ms and it takes NeRVE on average over 10 s. The fDNC model may be a good candidate for potential closed-loop tracking applications because its speed of 100 volumes per second is an order of magnitude faster than the 6–10 volumes per second recording rate typically used in whole-brain imaging of freely moving C. elegans (Nguyen et al., 2016; Venkatachalam et al., 2016). We note that for a complete closed-loop tracking system, fast segmentation algorithms will also be needed in addition to the fast registration and labeling algorithms presented here. The fDNC model is agnostic to the details of the segmentation algorithm so it is well suited to take advantage of fast segmentation algorithms when they are developed.
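The throughput comparison in this paragraph is simple arithmetic, shown below as a short sketch. The per-volume times are the ones quoted above; the dictionary keys and the function name are illustrative, and the calculation covers only the correspondence step, not segmentation.

```python
def volumes_per_second(seconds_per_volume):
    """Convert a per-volume processing time into a throughput."""
    return 1.0 / seconds_per_volume


# Per-volume correspondence times quoted in the text (seconds).
seconds_per_volume = {"fDNC (GPU)": 0.010, "fDNC (CPU)": 0.050, "CPD": 0.930}
recording_rate_hz = 10.0   # upper end of the 6-10 volumes per second range

for method, seconds in seconds_per_volume.items():
    rate = volumes_per_second(seconds)
    keeps_up = rate >= recording_rate_hz
    print(f"{method}: {rate:.1f} volumes/s, keeps up with recording: {keeps_up}")
```

At 10 ms per volume the model processes about 100 volumes per second, roughly ten times the recording rate, whereas CPD at 930 ms per volume processes about one volume per second and would fall behind.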

Table 1
Time required to predict neural correspondence.

Table shows the measured time per volume required for different models to predict neural correspondence of a single volume. Time required is measured after neuron segmentation is complete and a configuration of neural positions has been loaded into memory. The same hardware is used for all models.

Method                              Time (s/volume)
CPD (Myronenko and Song, 2010)      0.93
NeRVE(1) (Nguyen et al., 2017)      10
NeRVE(100) (Nguyen et al., 2017)    >10
fDNC [this work]                    0.01

The fDNC model uses built-in libraries to parallelize the computations for labeling a single volume, and this contributes to its speed. In particular, each layer of the neural network contains thousands of artificial neurons performing the same computation. Computations for each neuron in a layer can all be performed in parallel and modern GPUs have as many as 3500 CUDA cores.

In practice, the method is even faster for post-processing applications (not real-time) because it is also parallelizable at the level of each volume. Labeling one volume has no dependencies on any previous volumes and therefore each volume can be processed simultaneously. The number of volumes to be processed in parallel is limited only by the number of volumes that can be loaded onto the memory of a GPU. When tracking during post-processing in this work, we used 32 volumes simultaneously.
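Because volumes are independent, post-processing reduces to splitting the recording into fixed-size batches and labeling each batch in one call. The sketch below shows only the batching logic; `predict_batch` is a hypothetical stand-in for a function that labels a whole batch at once, and is not part of the authors' code.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def label_recording(volumes, predict_batch, batch_size=32):
    """Apply `predict_batch` (hypothetical) to each batch and keep the volume order."""
    results = []
    for batch in batched(volumes, batch_size):
        results.extend(predict_batch(batch))
    return results


# Toy usage: 70 placeholder "volumes" and a dummy predictor that doubles each entry.
toy_volumes = list(range(70))
batch_sizes = [len(batch) for batch in batched(toy_volumes)]
toy_labels = label_recording(toy_volumes, lambda batch: [2 * v for v in batch])
```

With a batch size of 32, a 70-volume recording is processed as two full batches and one partial batch of 6, and the results come back in the original volume order.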

fDNC accurately finds neural correspondence across individuals

Having shown that fDNC performs well at identifying neurons within the same individual, we wanted to assess its capability to identify neurons across different animals, using neural position information alone, as before. Identifying corresponding neurons across individuals is crucial for studying the nervous system. However, finding neural correspondence across individuals is more challenging than within an individual because there is variability in neuronal position from both the animal’s movement and its development. To evaluate the fDNC model’s performance at finding neural correspondence across individuals using only position information, we applied the same semi-synthetically-trained fDNC model to a set of 11 NeuroPAL worms. NeuroPAL worms contain extra color information that allows a human to assign ground truth labels to evaluate the model’s performance. Crucially, the fDNC model was blinded to this additional color information. In these experiments, NeuroPAL color information was only used to evaluate performance after the fact, not to find correspondence.

NeuroPAL worms have multicolor neurons labeled with genetically encoded fluorescent proteins (Yemini et al., 2021). Only a single volume was recorded for each worm since immobilization is required to capture multicolor information from the NeuroPAL strain. For each of the 11 NeuroPAL recordings, neurons were automatically segmented and manually annotated based on the neuron’s position and color features as described in Yemini et al., 2021 (see Figure 4A,B). Across the 11 animals, a human assigned a ground-truth label to a mean of 43% of segmented head neurons, providing approximately 58 labeled neurons per animal (Figure 4C, additional details in 'Datasets' section of 'Materials and Methods'). The remaining segmented neurons were not confidently identifiable by the human and thus were left without ground truth labels. We selected as the template the recording that contained the largest number of confidently labeled, human-annotated ground truth neurons. We evaluated our model by comparing its predicted correspondence between neurons in the other 10 test datasets and this template, using only position information (no color information). All 11 ground-truth recordings were held-out in that they were not involved in the generation of the semi-synthetic data that had been used to train the model.

fDNC model finds neural correspondence across individuals.

(A) Fluorescence image shows neuronal nuclei of a NeuroPAL worm. A single optical slice is shown from an optical stack. (Scale bar, 10 µm). Genetically encoded color labels in NeuroPAL animals aid ground truth manual neural identification (Yemini et al., 2021) and are used here to evaluate performance. Black dots indicate neurons found via automatic segmentation. (B) Locations of all segmented neurons from A. Neurons that additionally have a human annotated label are shown in green. Those that a human was unable to label are red. (C) Number of segmented neurons (mean 133.6) and subset of those that were given human annotations (mean 57.5) is shown for 11 NeuroPAL individuals. Box plot shows median and interquartile range. (D) Pipeline to evaluate fDNC model performance across NeuroPAL individuals is shown. Predicted labels are compared with human annotated labels to compute accuracy. (E) Performance of the fDNC model and CPD is shown evaluated on NeuroPAL recordings using position information alone. Accuracy is the fraction of labeled neurons present in both test and template that are correctly matched. Performance is evaluated on 10 pairs of 11 recordings, where the template is always the same (Worm A). (p=0.005, Wilcoxon signed-rank test). (F) Performance evaluated on a separate publicly accessible dataset of nine NeuroPAL individuals from Chaudhary et al., 2021 (p=0.018, Wilcoxon signed-rank test).

We applied the synthetically trained fDNC model to each pair of held-out NeuroPAL test and template recordings and calculated the accuracy as the number of correctly predicted matches divided by the total number of ground truth matches in the pair. Across the 10 pairs of NeuroPAL recordings using position information alone, the fDNC model had a mean accuracy of 64.1%, significantly higher than the CPD method’s accuracy of 53.1% (p=0.005, Wilcoxon signed-rank test).
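The accuracy definition above can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code used in this work; the function name and the dictionary-based data layout are assumptions made for the example.

```python
def matching_accuracy(predicted, test_labels, template_labels):
    """Fraction of ground truth matches in a test-template pair that were recovered.

    predicted:       dict, test neuron index -> predicted template neuron index
    test_labels:     dict, test neuron index -> human label (labeled neurons only)
    template_labels: dict, template neuron index -> human label (labeled neurons only)
    """
    label_to_template = {label: j for j, label in template_labels.items()}
    # A ground truth match exists only when the same label appears in both worms.
    truth = {i: label_to_template[label]
             for i, label in test_labels.items()
             if label in label_to_template}
    if not truth:
        raise ValueError("no ground truth matches in this pair")
    correct = sum(1 for i, j in truth.items() if predicted.get(i) == j)
    return correct / len(truth)
```

Note that a test neuron whose label is absent from the template does not enter the denominator, which is why the denominator is the intersection of labeled neurons in test and template.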

We wondered whether we could better use the likelihood information about potential matches generated by the algorithm. For each neuron i in the test recording, the fDNC model computes a relative confidence, $p_{ij}$, with which that neuron corresponds to each possible neuron j in the template. A Hungarian algorithm finds the most probable match by considering all $p_{ij}$ values for all neurons in the test. By default we use this best match in evaluating performance. The $p_{ij}$ values also provide the user with a list of alternative matches ranked by the model’s estimate of their respective likelihood. We therefore also assessed the accuracy for the top three most likely matches.

Given that i and j are a ground truth match, we asked whether the value $p_{ij}$ is among the top three values of the set $\{p_{ik}\}$, where k ranges over all the neurons in the template. We defined accuracy as the number of instances in which this criterion was met, divided by the number of ground truth matches in the test-template pair. When considering the top three neurons, the fDNC model achieves an accuracy of 76.6% using only position information.
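The top-three criterion can be sketched as follows. This is a minimal illustration under assumed data structures (a nested list of confidences and a dictionary of ground truth matches), not the code used to produce the reported numbers.

```python
def top_k_accuracy(p, truth, k=3):
    """Fraction of ground truth matches ranked among the k most confident candidates.

    p:     p[i][j] is the confidence that test neuron i matches template neuron j
    truth: dict, test neuron index -> template neuron index (ground truth matches)
    """
    hits = 0
    for i, j in truth.items():
        # Template indices sorted from most to least confident for test neuron i.
        ranked = sorted(range(len(p[i])), key=lambda col: p[i][col], reverse=True)
        if j in ranked[:k]:
            hits += 1
    return hits / len(truth)
```

Setting `k=1` ranks each test neuron independently, so it can differ from the default accuracy, which uses the Hungarian assignment and therefore forbids two test neurons from claiming the same template neuron.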

Validating on an alternative dataset

Data quality, selection criteria, human annotation, hardware, segmentation, and preprocessing can all vary from lab to lab making it challenging to directly compare methods. To validate our model against different measurement conditions and to allow for a direct comparison with another recent method, we applied our fDNC model to a previously published dataset of 9 NeuroPAL individuals (Chaudhary et al., 2021). This public dataset used different imaging hardware and conditions and was annotated by human experts from a different group. On this public dataset, using position information alone, our method achieved 78.2% accuracy while CPD achieved 58.9%, Figure 4F. When assessing the top three candidate accuracy, the fDNC model performance was 91.3%. The fDNC model performance was overall higher on the published dataset than on our newly collected dataset presented here. This suggests that our method performs well when applied to real-world datasets in the literature.

We further sought to compare the fDNC model to the reported accuracy of a recent model called Conditional Random Fields (CRF) from Chaudhary et al., 2021 by comparing their performance on the same published dataset from that work. There are fundamental differences between the two methods that make a direct comparison of their performance challenging. CRF assigns labels to a test worm. In contrast, fDNC assigns matches between two worms or two configurations, the test and template. To evaluate whether a match is correct using fDNC, we require a ground truth label in both test and template. Consequently, our denominator for accuracy is the intersection of neurons with ground truth labels in test and template. In contrast, the denominator for evaluating accuracy of the CRF model is all neurons with ground truth labels in the test.

When applied to the same dataset in Chaudhary et al., 2021, fDNC had an accuracy of 78%. But for the purposes of comparison with CRF this could, in principle, correspond to an accuracy of 61.2–82.5%, depending on how well those neurons in the test that lack ground truth labels in the template were matched. These bounds are calculated for the extreme cases in which neurons with ground truth labels in the test but not in the template are either all matched incorrectly (61.2%) or all matched perfectly (82.5%). Seventy-eight percent is the accuracy under the assumption that those neurons with ground truth labels in the test but not in the template are correctly matched at the same rate as those neurons with ground truth labels in both. In other words, we assume the neurons we have ground truth information about are representative of the ones we don’t. For the sake of comparison, we use this assumption to compare fDNC to the published values of CRF (Table 2).
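The three cases described above can be written compactly. The notation here is introduced only for illustration: $a$ is the accuracy measured on neurons with ground truth labels in both test and template, and $f$ is the fraction of labeled test neurons that also carry a label in the template.

```latex
\begin{aligned}
A_{\min} &= a f
  && \text{(unmatched-label neurons all wrong)} \\
A_{\max} &= a f + (1 - f)
  && \text{(unmatched-label neurons all correct)} \\
A_{\text{assumed}} &= a f + a (1 - f) = a
  && \text{(same rate as the intersection)}
\end{aligned}
```

The width of the interval, $A_{\max} - A_{\min} = 1 - f$, depends only on how many labeled test neurons lack a labeled counterpart in the template. The bounds quoted in the text may have been computed pair by pair, so substituting dataset-wide averages into these expressions need not reproduce them exactly.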

Table 2
Comparison of across-animal model performance on additional dataset.

Table lists reported mean accuracy of different models evaluated on the same publicly accessible dataset from Chaudhary et al., 2021. We note in the text an assumption needed to compare these methods. N indicates the number of template-test pairs used to calculate accuracy. (CRF method uses an atlas as the template, whereas we randomly take one of the nine individuals and designate that as the template). CPD and fDNC performance on this dataset are also shown in Figure 4F.

| Method | Accuracy | N | Reported in |
|---|---|---|---|
| CPD | 59% | 8 | This work |
| CRF (open atlas) | 40% | 9 | Chaudhary et al., 2021 |
| CRF (data driven atlas) | 74% | 9 | Chaudhary et al., 2021 |
| fDNC | 78% | 8 | This work |

fDNC accuracy is higher than the reported performance for the open atlas variant of CRF. Under the specific assumption described above, it is also higher than the data driven atlas variant, although we note that this could change with different assumptions, and we are unable to test for statistical significance. The fDNC method also offers other advantages compared to the CRF approach in that the fDNC method is optimized for speed and avoids the need to transform the worm into a canonical coordinate system. Importantly, compared to the data-driven atlas variant of the CRF, the fDNC model has an advantage in that it does not require assembling a data-driven atlas from representative recordings with known ground-truth labels. Taken together, we conclude that the fDNC model’s accuracy is comparable to that of the CRF model while also providing other advantages.

Incorporating color information

Our method only takes positional information as input to predict neural correspondence. However, when additional features are available, the position-based predictions from the fDNC model can be combined with predictions based on other features to improve overall performance. As demonstrated in Yemini et al., 2021, adding color features from a NeuroPAL strain can reduce the ambiguity of predicting neural correspondence. We applied a very simple color model to calculate the similarity of color features between neuron i in the test recording and every possible neuron j in the template. The color model returns matching probabilities, $p_{ij}^{c}$, based on the Kullback-Leibler divergence of the normalized color spectra in a pair of candidate neurons (details described in Materials and methods). The color model is run in parallel to the fDNC model (Figure 5A). Overall matching probabilities $p_{ij}^{\text{all}}$ that incorporate both color and position information are calculated by combining the color matching probabilities $p_{ij}^{c}$ with the position probabilities $p_{ij}$. The Hungarian algorithm is run on the combined matching probabilities to predict the best matches.
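To make the idea concrete, the sketch below computes a Kullback-Leibler divergence between normalized color spectra and folds the result into position-based probabilities. It is only an illustration: the exact color model is described in Materials and methods, and two choices made here are assumptions, namely mapping divergence to probability with a row-normalized `exp(-KL)` and combining color and position by an element-wise product that is renormalized per row.

```python
import math


def normalize(spectrum, eps=1e-9):
    """Scale non-negative channel intensities so that they sum to one."""
    total = sum(spectrum) + eps * len(spectrum)
    return [(s + eps) / total for s in spectrum]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def color_probabilities(test_colors, template_colors):
    """Row-normalized exp(-KL) scores (assumed mapping from divergence to probability)."""
    rows = []
    for c_test in test_colors:
        scores = [math.exp(-kl_divergence(normalize(c_test), normalize(c_tmpl)))
                  for c_tmpl in template_colors]
        total = sum(scores)
        rows.append([s / total for s in scores])
    return rows


def combine(p_position, p_color):
    """Element-wise product renormalized per row (assumed combination rule)."""
    combined = []
    for row_pos, row_col in zip(p_position, p_color):
        prod = [a * b for a, b in zip(row_pos, row_col)]
        total = sum(prod)
        combined.append([x / total for x in prod])
    return combined
```

With this kind of combination, a neuron pair must look plausible under both position and color to receive a high overall probability, and the position-only model is recovered when the color probabilities are uniform.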

fDNC performance when incorporating color features.

(A) Pipeline to evaluate fDNC performance across animals with additional color features. A simple color model is added in parallel to the fDNC model to use both color and position information from 11 NeuroPAL recordings. Accuracy is calculated from ground truth human annotation and is the fraction of labeled neurons present in both test and template that are correctly matched. Matching probabilities from the color and fDNC models are combined to form the final matching probabilities. (B) Accuracy of the position-only fDNC model and the combined fDNC and color model are evaluated on 11 NeuroPAL recordings (same recordings as in Figure 4). p=5.0×10⁻³, Wilcoxon signed rank test.

Adding color information increased the fDNC model’s average accuracy from 64.1% to 74.7% (Figure 5B) when evaluated on our dataset, and improved the accuracy in every recording evaluated. The top three candidate labels attained 92.4% accuracy. Accuracy was calculated from a comparison to human ground truth labeling, as before.

We chose a trivially simple color model in part to demonstrate the flexibility with which the fDNC model framework can integrate information about other features. Since our simple color model utilized no prior knowledge about the distributions of colors in the worm, we would expect a more sophisticated color model, for example, the statistical model used in Yemini et al., 2021, to do better. And indeed that model evaluated on a different dataset is reported to have a higher performance with color than our model on our dataset (86% reported accuracy in Yemini et al., 2021 compared to 75% for the fDNC evaluated here). But that model also performs much worse than fDNC when both are restricted to use only neural position information (50% reported accuracy for Yemini et al., 2021 compared to 64% for the fDNC). Together, this suggests the fDNC model framework can take advantage of additional feature information like color and still perform relatively well when such information is missing.

Discussion

Identifying correspondence between constellations of neurons is important for resolving two classes of problems: The first is tracking the identities of neurons across time in a moving animal. The second is mapping neurons from one individual animal onto another, and in particular onto a reference atlas, such as one obtained from electron microscopy (Witvliet et al., 2020). Mapping onto an atlas allows recordings of neurons in the laboratory to be related to known connectomic, gene expression, or other measurements in the literature.

The fDNC model finds neural correspondence within and across individuals with an accuracy that is comparable to, or compares favorably with, that of other methods. The model focuses primarily on identifying neural correspondence using position information alone. For tracking neurons within an individual using only position, fDNC achieves a high accuracy of 79%, while for across individuals using only position it achieves 64% accuracy on our dataset and 78% on a published dataset from another group.

We expect that an upper bound may exist, set by variability introduced during the animal’s development, that ultimately limits the accuracy with which any human or algorithm can find correspondence across individuals via only position information. For example, pairs of neurons in one individual that perfectly switch position with respect to another individual will never be unambiguously identified by position alone. It is unclear how close fDNC’s performance of 64% on our dataset or 78% on the dataset in Chaudhary et al., 2021 comes to this hypothetical upper bound, but there is reason to think that at least some room for improvement remains.

Specifically, we do not expect accuracy at tracking within an individual to be fundamentally limited, in part because we do not expect two neurons to perfectly switch position on the timescale of a single recording. Therefore fDNC’s 79% accuracy within-individuals suggests room for improving within-individual correspondence, and by extension, across-individual correspondence because the latter necessarily includes all of the variability of the former. One avenue for achieving higher performance could be to improve the simulator’s ability to better capture variability of a real testset, for example by using different choices of parameters in the simulator.

Even at the current level of accuracy, the ability to find correspondence across animals using position information alone remains useful. For example, we are interested in studying neural population coding of locomotion in C. elegans (Hallinen et al., 2021), and neural correspondence at 64% accuracy will allow us to reject null hypotheses about the extent to which neural coding of locomotion is stereotyped across individuals.

The fDNC model framework also makes it easy to integrate other features which further improve accuracy. We demonstrated that color information could be added by integrating the fDNC model with a simple color model to increase overall accuracy. We expect that performance would improve further with a more sophisticated color model that takes into account the statistics of the colors in a NeuroPAL worm (Yemini et al., 2021).

The fDNC model framework offers a number of additional advantages beyond accuracy. First, it is versatile and general. The same pre-trained model performed well at both tracking neurons within a freely moving individual across time and at finding neural correspondence across different individuals. Without any additional training, it achieved even higher accuracy on a publicly accessible dataset acquired on different hardware with different imaging conditions from a different group. This suggests that the framework should be applicable to many real-world datasets. The model provides probability estimates of all possible matches for each neuron. This allows an experimenter to consider a collection of possible matches such as the top three most likely.

In contrast to previous methods, an advantage of the fDNC method is that it does not require the worm to be straightened, axis aligned, or otherwise transformed into a canonical coordinate system. This eliminates an error-prone and often manual step. Instead, the fDNC model finds neural correspondence directly from neural position information even in worms that are in different poses or orientations.

Importantly, the model is trained entirely on semi-synthetic data, which avoids the need for large experimentally acquired ground truth datasets to train the artificial neural network. Acquiring ground truth neural correspondence in C. elegans is time consuming, error prone, and often requires manual hand annotation. The ability to train the fDNC model with semi-synthetic data derived from measurements alleviates this bottleneck and makes the model attractive for use with other organisms with stereotyped nervous systems where ground truth datasets are similarly challenging to acquire.

The model is also fast and finds neural correspondence of a new neural configuration in 10 ms. The development of fast algorithms for tracking neurons is an important step for bringing real-time closed loop applications such as optical brain-machine interfaces (Clancy et al., 2014) and optical patch clamping (Hochbaum et al., 2014) to whole-brain imaging in freely moving animals. By contrast, existing real-time methods for C. elegans in moving animals are restricted to small subsets of neurons, are limited to two dimensions, and work only at low spatial resolution (Leifer et al., 2011; Stirman et al., 2011; Kocabas et al., 2012; Shipley et al., 2014). We note that to be used in a real-time closed loop application, our fDNC model would need to be combined with faster segmentation algorithms because current segmentation algorithms are too slow for real-time use. Because segmentation can be easily parallelized, we expect that faster segmentation algorithms will be developed soon.

Many of the advantages listed here stem from the fDNC model’s use of the transformer architecture (Vaswani et al., 2017) in combination with supervised learning. The transformer architecture, with its origins in natural language processing, is well suited to find spatial relationships within a configuration of neurons. By using supervised learning on empirically derived semi-synthetic training data of animals in a variety of different poses and orientations, the model is forced to learn relative spatial features within the neurons that are informative for finding neural correspondence across many postures and conditions. Finally, the transformer architecture leverages recent advances in GPU parallel processing for speed and efficiency, which is an important step toward future real-time applications.

Materials and methods

Key resources table
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
|---|---|---|---|---|
| Strain, strain background (C. elegans) | AML320 | this work | | See Table 4 |
| Strain, strain background (C. elegans) | OH15262 | Yemini et al., 2021 | RRID:WB-STRAIN:WBStrain00047397 | |

Datasets

Recordings used by simulator to generate semi-synthetic data

Our model was trained on a semi-synthetic training dataset that was simulated from 4000 volumes spread across recordings of 12 freely moving animals of strain AML32 acquired during calcium imaging. The recordings fed to the simulator had no ground truth correspondence either within or across animals. Each recording had originally contained approximately 3000 volumes recorded at six volumes/s. The recordings fed to the simulator were set aside after use by the simulator and were never re-used for evaluating model performance.

Datasets used to evaluate performance

The model’s performance was evaluated on various types of datasets with ground truth correspondence, as shown in Table 3. All these recordings were held-out in the sense that they were never used for training. Some of these recordings had ground truth correspondence within an individual over time, while others had ground truth correspondence across individuals. One of the NeuroPAL (Yemini et al., 2021) datasets is a published dataset from an independent research group (Chaudhary et al., 2021). One of the calcium imaging datasets, from Hallinen et al., 2021, had no ground truth correspondence and served to demonstrate the model’s ability to extract calcium activity.

Table 3
Ground truth content, by dataset.

Table lists ground truth properties for each dataset used in this work to evaluate the model. None of the datasets listed here were used for training. ‘Vol’ refers to volume and ‘indiv’ refers to individuals. Ground truth ‘matches pair⁻¹’ indicates the average number of ground truth matches for random pairs of test and template, which is a property of the ground truth dataset, and does not depend on the model tested.

| | Held-out semi-synthetic testset | Ca2+ imaging | NeuroPAL | NeuroPAL | Ca2+ imaging |
|---|---|---|---|---|---|
| Figure | Figure 2 | Figure 3 | Figures 4 and 5 | Figure 4F | Figure 3—figure supplement 1 |
| Type | - | moving | immobile | immobile | moving |
| Correspondence | across indiv | within indiv | across indiv | across indiv | within indiv |
| Ground truth | simulator | human | human | human | - |
| Ground truth matches pair⁻¹ | 85.7 | 64.4 | 50.1 | 50.5 | - |
| Ground truth labels vol⁻¹ | 102.1 | 69.2 | 57.5 | 64.3 | - |
| Segmented neurons vol⁻¹ | 114.1 | 118.4 | 133.6 | 118.8 | 131.1 |
| Total vols | 2000 | 1514 | 11 | 9 | 1400 |
| Individuals | 2000 | 1 | 11 | 9 | 1 |
| Vols indiv⁻¹ | 1 | 1514 | 1 | 1 | 1400 |
| Vols s⁻¹ | - | 6 | - | - | 6 |
| Strain | - | AML32 | AML320 (via OH15262) | OH15495 | AML310 |
| Reference | this work | Nguyen et al., 2017 | this work | Chaudhary et al., 2021 | Hallinen et al., 2021 |

Data availability

Neural configurations acquired as part of this study have been posted in an Open Science Framework repository with DOI:10.17605/OSF.IO/T7DZU, available at https://dx.doi.org/10.17605/OSF.IO/T7DZU. The publicly accessible dataset from Chaudhary et al., 2021 is available at https://github.com/shiveshc/CRF_Cell_ID, commit 74fb2feeb50afb4b840e8ec1b8ee7b7aaa77a426. Datasets from Nguyen et al., 2017 and Hallinen et al., 2021 are publicly available in repositories associated with their respective publications.

Strains

Those strains used to create new datasets presented in this work are listed in Key Resources. All strains mentioned in this study, including those involved in previously published datasets, are listed in Table 4. All strains express a nuclear localized red fluorescent protein in all neurons. All but strains OH15495 and OH15262 also express nuclear localized GCaMP6s in all neurons. NeuroPAL (Yemini et al., 2021) strains further express many additional fluorophores.

Table 4
List of all strains mentioned in this work.
| Strain | RRID | Genotype | Notes | Ref |
|---|---|---|---|---|
| AML32 | RRID:WB-STRAIN:WBStrain00000192 | wtfIs5[Prab-3::NLS::GCaMP6s; Prab-3::NLS::tagRFP] | | Nguyen et al., 2017 |
| AML310 | RRID:WB-STRAIN:WBStrain00048356 | wtfIs5[Prab-3::NLS::GCaMP6s; Prab-3::NLS::tagRFP]; wtfEx258 [Prig-3::tagBFP::unc-54] | | Hallinen et al., 2021 |
| AML320 | | (otIs669[NeuroPAL] V 14x; wtfIs145 [pBX + rab-3::his-24::GCaMP6::unc-54]) | derived from OH15262 | this work |
| OH15262 | RRID:WB-STRAIN:WBStrain00047397 | otIs669[NeuroPAL] | | Yemini et al., 2021 |
| OH15495 | RRID:WB-STRAIN:WBStrain00047403 | otIs696[NeuroPAL] | | Yemini et al., 2021; Chaudhary et al., 2021 |

Imaging

To image neurons in the head of freely moving worms, we used a dual-objective spinning-disk based tracking system (Nguyen et al., 2016) (Yokogawa CSU-X1 mounted on a Nikon Eclipse TE2000-S). Fluorescent images of the head of a worm were recorded through a 40x objective with both 488- and 561 nm excitation laser light as the animal crawled. The 40x objective translated up and down along the imaging axis to acquire 3D image stacks at a rate of 6 head volumes/s.

To image neurons in the immobile multi-color NeuroPAL worms (Yemini et al., 2021), we modified our setup by adding emission filters in a motorized filter wheel (Prior ProScan-II), and adding a Stanford Research Systems SR474 shutter controller (with SR475 shutters) to programmatically illuminate the worm with different wavelength laser light. We used three lasers of different wavelengths: 405 nm (Coherent OBIS-LX 405 nm 100 mW), 488 nm (Coherent SAPPHIRE 488 nm 200 mW), and 561 nm (Coherent SAPPHIRE 561 nm 200 mW). Only one laser at a time reached the sample, through a 40x oil-immersion objective (1.3 NA, Nikon S Fluor). The powers measured at the sample, after spinning disk and objective, were 0.14 mW (405 nm), 0.35 mW (488 nm), and 0.36 mW (561 nm). In the spinning disk unit, a dichroic mirror (Chroma ZT405/488/561tpc) separated the excitation from the emission light. The latter was relayed to a cooled sCMOS camera (Hamamatsu ORCA-Flash 4.0 C11440-22CU), passing through the filters mounted on the filter wheel (Table 5). Fluorescent images were acquired in different ‘channels’, that is, different combinations of excitation wavelength, emission filter, and camera exposure time (Table 6). The acquisition was performed using custom software written in LabVIEW that specifies the sequence of channels to be imaged, and controls shutters, filter wheel, piezo translator, and camera. After setting the z position, the software acquires a sequence of images in the specified channels.

Table 5
List of emission filters for multicolor imaging.
| Filter label | Filters (Semrock part n.) |
|---|---|
| F1 | FF01-440/40 |
| F2 | FF01-607/36 |
| F3 | FF02-675/67 + FF01-692/LP |

Table 6
Imaging channels used.

| Channel | Excitation λ (nm) | Emission window (nm) [filter] | Primary fluorophore |
|---|---|---|---|
| ch0 | 405 | 420–460 [F1] | mtagBFP |
| ch1 | 488 | 589–625 [F2] | CyOFP |
| ch2 | 561 | 589–625 [F2] | tagRFP-t |
| ch3 | 561 | 692–708 [F3] | mNeptune |

Preprocessing and segmentation

We extracted the position of individual neurons from 3D fluorescent images to generate a 3D point cloud (Figure 1A). This process is called segmentation and the fDNC model is agnostic to the specific choice of the segmentation algorithm. Segmentation was always performed on tagRFP, never on GCaMP.

For recordings of strain AML32, we used a segmentation algorithm adapted from Nguyen et al., 2017. We first applied a threshold to find pixels where the intensities are significantly larger than the background. Then, we computed the 3D Hessian matrix of the intensity image and its eigenvalues. Candidate neurons were regions where the maximal eigenvalue was negative. Next, we searched for the local intensity peaks in the region and spatially disambiguated peaks in the same region with a watershed separation based on pixel intensity.

For recordings of NeuroPAL strains, we used the same segmentation algorithm as in Yemini et al., 2021. The publicly accessible dataset from Chaudhary et al., 2021 used in Figure 4 had already been segmented prior to our use.

Generating semi-synthetic point clouds with correspondence for training

We developed a simulator to generate a large training set of semi-synthetic animals with known neural correspondence. The simulator takes as its input the point clouds collected from approximately 4000 volumes spread across 12 recordings of freely moving animals. Each recording contains roughly 3000 volumes. For each volume, the simulator performs a series of stochastic deformations and transformations to generate 64 new semi-synthetic individuals where the ground truth correspondence between neurons in the individuals and the original point cloud is known. A total of 2.304 × 10⁵ semi-synthetic point clouds were used for training.

The simulator introduces a variety of different sources of variability and real-world deformations to create each semi-synthetic point cloud (Figure 1B,E). The simulator starts by straightening the worm in the XY plane using its centerline so that it now lies in a canonical worm coordinate system. Before straightening, Z is along the optical axis and XY are defined to be perpendicular to the optical axis and are arbitrarily set by the orientation of the camera. After straightening, the animal’s posterior-anterior axis lies along the X axis. To introduce animal-to-animal variability in relative neural position, a non-rigid transformation is applied to the neuron point cloud against a template randomly selected from recordings of the real observed worms using coherent point drift (CPD) (Myronenko and Song, 2010). To add variability associated with rotation and distortion of the worm’s head in the transverse plane, we apply a random affine transformation to the transverse plane. To simulate missing neurons and segmentation errors, spurious neurons are randomly added, and some true neurons are randomly removed, for up to 20% of the observed neurons. To introduce variability associated with animal pose, we randomly deform the centerline of the head. Lastly, to account for variability in animals’ size and orientation, a random affine transformation in the XY plane is applied that rescales the animal’s size by up to 5%. With those steps, the simulator deforms a sampled worm and generates a new semi-synthetic worm with different orientation and posture while maintaining known correspondence.

Centerlines generated by the simulator were directly sampled from recordings of real individuals. The magnitude of added Gaussian noise was arbitrarily set to have a standard deviation of 0.42 µm.
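The key property of the simulator is that every perturbation keeps track of which original neuron each new point came from. The sketch below illustrates this for a subset of the steps only (neuron removal, spurious additions, a rotation with rescaling in XY, and Gaussian jitter); it omits straightening, the CPD-based non-rigid transformation, the transverse-plane affine transformation, and centerline deformation. It is not the authors' simulator. The function name, the full-circle rotation range, and the reading of the 20% limit as applying separately to additions and removals are assumptions.

```python
import math
import random


def augment(points, rng, max_outlier_frac=0.2, max_scale=0.05, jitter_sd=0.42):
    """Return (new_points, source) for one semi-synthetic point cloud.

    points: list of (x, y, z) neuron positions in micrometers
    rng:    a random.Random instance
    source: source[k] is the index of the original neuron behind new point k,
            or None if point k is spurious (this is the known correspondence).
    """
    n = len(points)
    # Randomly remove some true neurons (assumed: up to 20% of the observed ones).
    n_remove = rng.randint(0, int(max_outlier_frac * n))
    keep = sorted(rng.sample(range(n), n - n_remove))

    # Random rotation and isotropic rescaling (up to 5%) in the XY plane.
    theta = rng.uniform(0.0, 2.0 * math.pi)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    new_points, source = [], []
    for idx in keep:
        x, y, z = points[idx]
        xr = scale * (cos_t * x - sin_t * y) + rng.gauss(0.0, jitter_sd)
        yr = scale * (sin_t * x + cos_t * y) + rng.gauss(0.0, jitter_sd)
        new_points.append((xr, yr, z + rng.gauss(0.0, jitter_sd)))
        source.append(idx)

    # Randomly add spurious neurons inside the bounding box of the kept points.
    n_add = rng.randint(0, int(max_outlier_frac * n))
    lo = [min(p[d] for p in new_points) for d in range(3)]
    hi = [max(p[d] for p in new_points) for d in range(3)]
    for _ in range(n_add):
        new_points.append(tuple(rng.uniform(lo[d], hi[d]) for d in range(3)))
        source.append(None)

    # Shuffle so that the ordering carries no information about identity.
    order = list(range(len(new_points)))
    rng.shuffle(order)
    return [new_points[k] for k in order], [source[k] for k in order]
```

Calling `augment` repeatedly on the same input volume with a seeded `random.Random` yields multiple reproducible semi-synthetic individuals, each with its own `source` list serving as ground truth.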

Deep neural correspondence model

Overview and input

The deep neural correspondence model (fDNC) is an artificial neural network based on the Transformer (Vaswani et al., 2017) architecture (Figure 1C) and is implemented in the automatic differentiation framework PyTorch (Paszke et al., 2017). The fDNC model takes as input the positional coordinates of a pair of worms, a template worm a, and test worm, b (Figure 1F). For each worm, approximately 120 neurons are segmented and passed to the fDNC model. The input neuron sequences are randomly shuffled for both template worm and test worm. This eliminates the possibility that the information from the original sequence order is used.

Architecture

The model works as an encoder, which maps the input neuron coordinates $(a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_m)$ to continuous embeddings $(u_1, u_2, \ldots, u_n, v_1, v_2, \ldots, v_m)$. The model is composed of a stack of N=6 identical layers. Each layer consists of two sub-layers: a multi-head self-attention mechanism (Vaswani et al., 2017) and a fully connected feed-forward network. The multi-head attention mechanism is the defining feature of the transformer architecture and makes the architecture well-suited for finding relations in sequences of data, such as words in a sentence or, in our case, spatial locations of neurons in a worm. Each head contains a one-to-one mapping between the nodes in the artificial network and the C. elegans neurons. In the transformer architecture, features of a previous layer are mapped via a linear layer into three attributes of each node, called the query, the key and the value pairs. These attributes of each node contain high dimensional feature vectors which, in our context, represent information about the neuron’s relative position. The multi-head attention mechanism computes a weight for each pair of nodes (corresponding to each pair of C. elegans neurons). The weights are calculated by performing a set computation on the query and key. The output is calculated by multiplying this resultant weight by the value. In our implementation, we set the number of heads in the multi-head attention module to be eight and we set the dimension of our feature vectors to be 128. We chose the best set of the hyperparameters (details in Training section) by evaluating on a validation set, which is distinct from the training set and also from any data used for evaluation. A residual connection (He et al., 2016) and layer normalization (Jl et al., 2016) are employed for each sub-layer, as is widely used in artificial neural networks.
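For readers unfamiliar with attention, the query-key-value computation can be sketched for a single head. This follows the standard scaled dot-product formulation of Vaswani et al., 2017; that the fDNC implementation uses exactly this form is an assumption, and the sketch leaves out the linear projections, the multiple heads, the feed-forward sub-layer, residual connections, and layer normalization.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each argument is a list holding one feature vector per node (neuron).
    Returns (outputs, weights): one output vector per node, and weights[a][b],
    the weight node a places on node b.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per pair (this node, other node): scaled query-key dot product.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w[n] * values[n][d] for n in range(len(values)))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights
```

Because every node attends to every other node, each neuron's output feature depends on the whole configuration of neurons rather than on its own coordinates alone.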

Calculating probabilities for potential matches

The fDNC model generates a high dimensional ($d=128$) embedding $u_i$ for neuron $i$ from the template worm and $v_j$ for neuron $j$ from the test worm. The similarity of a pair of embeddings, as measured by the inner product $\langle u_i, v_j \rangle$, determines the probability that the pair is a match. Specifically, we define the probability that neuron $i$ in the template worm matches neuron $j$ in the test worm as $p_{ij}$, where

(1) $p_{ij} = \frac{e^{\langle u_i, v_j \rangle}}{\sum_{k=1}^{m} e^{\langle u_i, v_k \rangle}}.$

Equivalently, the vector $p_i = (p_{i1}, \ldots, p_{im})$ is modeled as the ‘softmax’ function of the inner products between the embedding of neuron $i$ and the embeddings of all candidate neurons $1, \ldots, m$. The softmax output is non-negative and sums to one so that $p_i$ can be interpreted as a discrete probability distribution over assignments of neuron $i$.
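Equation 1 can be written out as a short sketch in plain Python. The function name and the two-dimensional toy embeddings are hypothetical (the model's embeddings are 128-dimensional); subtracting the row maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def match_probabilities(template_emb, test_emb):
    """Return p[i][j]: softmax over j of the inner product <u_i, v_j>."""
    probs = []
    for u in template_emb:
        scores = [sum(a * b for a, b in zip(u, v)) for v in test_emb]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

# Toy embeddings: two template neurons (u) and three test neurons (v).
U = [[1.0, 0.0], [0.0, 2.0]]
V = [[0.0, 2.0], [3.0, 0.0], [0.5, 0.5]]
P = match_probabilities(U, V)
```

Each row of P is a probability distribution over the candidate test neurons for one template neuron.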

We also find the most probable correspondence between the two sets of neurons by solving a maximum weight bipartite matching problem where the weights are given by the inner products between test and template worm embeddings. This is a classic combinatorial optimization problem, and it can be solved in polynomial time using the Hungarian algorithm (Kuhn, 1955).
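A sketch of the maximum weight bipartite matching step follows. It is a textbook potentials-based form of the Hungarian algorithm written in plain Python, not the code used in this work; in practice a library routine such as scipy.optimize.linear_sum_assignment would typically be used instead. The function name and the toy weight matrix are hypothetical, and the sketch assumes the first set has no more neurons than the second.

```python
def max_weight_matching(weights):
    """Hungarian algorithm for a weight matrix with n rows <= m columns.

    Returns match, where match[i] is the column assigned to row i, such
    that the sum of weights[i][match[i]] is maximal.
    """
    n, m = len(weights), len(weights[0])
    cost = [[-w for w in row] for row in weights]  # maximize = minimize negated
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j]: row currently assigned to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while True:  # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            match[p[j] - 1] = j - 1
    return match

# Toy inner products: picking the largest entry in each row independently
# would assign rows 0 and 1 to the same column, so a joint solution is needed.
weights = [[2.0, 3.0, 0.0],
           [3.0, 3.5, 0.5],
           [0.0, 1.0, 4.0]]
matching = max_weight_matching(weights)
```

The example shows why a joint optimization is used rather than a per-neuron argmax: the one-to-one constraint changes which pairing is best overall.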

End-user output

The fDNC model returns two sets of outputs to the end user. One is the algorithm’s estimate of the most probable matches for each neuron in the test worm; that is, the solution to the maximum weight bipartite matching problem described above. The other is an ordered list of alternative candidate matches for each individual neuron in the test worm and their probabilities ranked from most to least probable.

Training

The model was trained on $2.304 \times 10^5$ semi-synthetic animals derived from recordings of 12 individuals. The model was trained only once and the same trained model was used throughout this work.

Training proceeded as follows. We performed supervised learning with ground truth matches provided by the semi-synthetically generated data. A cross-entropy loss function was used. If neuron $i$ and neuron $j$ were matched by a human, the cross-entropy loss pushes the model toward the output $p_{ij}=1$. If neuron $i$ and neuron $j$ were not matched, the loss pushes the model toward the output $p_{ij}=0$. The model was trained for 12 hr on a 2.40 GHz Intel machine with an NVIDIA Tesla P100 GPU.
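The loss can be illustrated with a minimal sketch, assuming it is the mean negative log probability that the model assigns to the ground truth partner of each neuron. The function name, the small constant eps, and the convention of skipping neurons without a ground truth partner are assumptions of this sketch, not details taken from the released code.

```python
import math

def cross_entropy_loss(probs, true_match, eps=1e-12):
    """Mean negative log probability of the ground truth match.

    probs[i][j]   : model probability that neuron i matches neuron j
    true_match[i] : index of the ground truth partner of neuron i, or
                    None if neuron i has no partner (skipped; assumption)
    """
    losses = [-math.log(probs[i][j] + eps)
              for i, j in enumerate(true_match) if j is not None]
    return sum(losses) / len(losses)

# A confident, correct model has a loss near zero...
confident = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
loss_confident = cross_entropy_loss(confident, [0, 1])

# ...whereas a model that guesses uniformly over four candidates has loss log(4).
uniform = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
loss_uniform = cross_entropy_loss(uniform, [0, 1])
```

Because each row of probabilities sums to one, driving the probability of the true match toward 1 necessarily drives the probabilities of all other candidates toward 0.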

We trained different models with different hyperparameters and chose the one with best performance. The training curve for each model we trained is shown in Figure 2—figure supplement 1. All the models converged after 12 hr of training. We show the performance of trained models on a held-out validation set consisting of 12,800 semi-synthetic worms in Table 7. We chose the model with 6 layers and 128 dimensional embedding space since it reaches the highest performance and increasing the complexity of the model did not appear to increase the performance dramatically.

Table 7
Model validation for hyperparameter selection.

Table lists losses of models with different hyperparameter values. $N$ represents the number of layers for the transformer architecture. $d_{emb}$ is the dimension of the embedding space. The loss shown is the average cross entropy loss evaluated on a held-out validation set.


N    d_emb = 32    d_emb = 64    d_emb = 128
4    83.1%         88.4%         90.7%
6    86.3%         94.6%         96.8%
8    90.5%         96.4%         96.8%

Evaluating model performance and comparing against other models

To evaluate performance, putative matches are found between template and test, and compared to ground truth. Every segmented neuron in the test or template (whichever has fewer) is assigned a match. Accuracy is defined as the number of proposed matches that agree with ground truth, divided by the total number of ground truth matches. The number of ground truth matches is a property of the dataset used to evaluate our model, and is listed in Table 3.
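The accuracy definition above can be expressed as a short sketch. The function name and the dictionary representation of matches (test neuron index mapped to template neuron index) are hypothetical choices for illustration.

```python
def matching_accuracy(proposed, ground_truth):
    """Fraction of ground truth matches that the proposed matches reproduce.

    Both arguments map a test neuron index to a template neuron index.
    Proposed matches for neurons without ground truth are ignored, and a
    ground truth neuron with no proposed match counts as an error.
    """
    correct = sum(1 for test_idx, template_idx in ground_truth.items()
                  if proposed.get(test_idx) == template_idx)
    return correct / len(ground_truth)

# Four ground truth matches; the proposal gets neurons 0 and 1 right,
# mismatches neuron 2, and proposes nothing for neuron 4.
ground_truth = {0: 0, 1: 1, 2: 2, 4: 4}
proposed = {0: 0, 1: 1, 2: 3, 3: 5}
accuracy = matching_accuracy(proposed, ground_truth)
```

The denominator is the number of ground truth matches, so proposed matches for unlabeled neurons (neuron 3 here) neither help nor hurt the score.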

Coherent Point Drift

We use Coherent Point Drift (CPD) Registration (Myronenko and Song, 2010) as a baseline with which to compare our model’s performance. In our implementation, CPD is used to find the optimal non-rigid transformation to align the test worm with respect to the template worm. We then calculate the distance between each neuron in the transformed test worm and each neuron in the template worm. We use the Hungarian algorithm (Kuhn, 1955) to find the optimal correspondence that minimizes the total squared distance for all matches.

Color model

The recently developed NeuroPAL strain (Yemini et al., 2021) expresses four different genetically encoded fluorescent proteins in specific expression patterns to better identify neurons across animals. Manual human annotation based on these expression patterns serves as ground truth in evaluating our model’s performance at finding across-animal correspondence. In Figure 5B, we also explored combining color information with our fDNC model. To do so, we developed a simple color matching model that operated in parallel to our position-based fDNC model. Outputs of both models were then combined to predict the final correspondence between neurons.

Our color matching model consists of two steps: First, the intensity of each of the color channels is normalized by the total intensity. Then the similarity of color for each pair of neurons is measured as the inverse of the Kullback–Leibler divergence between their normalized color features.
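The two steps can be sketched as follows. Several details are assumptions of this sketch rather than statements about the released code: the direction of the Kullback–Leibler divergence, the small constant EPS used to smooth the normalized intensities and to keep the inverse finite when two colors are identical, and all function names.

```python
import math

EPS = 1e-6  # assumption: smoothing constant, not a value from the paper

def normalize_color(intensities):
    """Step 1: divide each channel by the total intensity (with smoothing)."""
    total = sum(intensities) + EPS * len(intensities)
    return [(c + EPS) / total for c in intensities]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive p, q."""
    value = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return max(value, 0.0)  # guard against tiny negative rounding error

def color_similarity(color_a, color_b):
    """Step 2: inverse of the KL divergence between normalized colors."""
    p = normalize_color(color_a)
    q = normalize_color(color_b)
    return 1.0 / (kl_divergence(p, q) + EPS)

# Four-channel toy colors: two neurons with the same hue at different
# brightness, and one neuron with a different hue.
red = [10.0, 1.0, 1.0, 1.0]
red_bright = [20.0, 2.0, 2.0, 2.0]
blue = [1.0, 1.0, 10.0, 1.0]
same_hue = color_similarity(red, red_bright)
different_hue = color_similarity(red, blue)
```

Normalizing by total intensity makes the comparison depend on the relative mix of channels rather than on overall brightness, so the two red neurons score as far more similar than red and blue.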

To calculate the final combined matching matrix, we add the color similarity matrix, multiplied by a factor λ, to the position matching log probability matrix from our fDNC model. We chose λ=60 so that the amplitude of values in the color similarity matrix is comparable to our fDNC output. We note the matching results are not particularly sensitive to the choice of λ. The most probable matches are obtained by applying the Hungarian algorithm to the combined matching matrix.
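The combination step can be sketched on a toy example. The matrices below are made-up values for two neurons, the function names are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm, which is only practical here because the example is tiny.

```python
import math
from itertools import permutations

LAMBDA = 60.0  # weighting factor for the color similarity, as stated in the text

def combine(log_prob, color_sim, lam=LAMBDA):
    """Add the scaled color similarity to the position log probabilities."""
    return [[lp + lam * cs for lp, cs in zip(lp_row, cs_row)]
            for lp_row, cs_row in zip(log_prob, color_sim)]

def best_assignment(score):
    """Exhaustive stand-in for the Hungarian algorithm (small square input)."""
    n = len(score)
    return max(permutations(range(n)),
               key=lambda perm: sum(score[i][perm[i]] for i in range(n)))

# Position alone weakly prefers the pairing (0->0, 1->1)...
log_prob = [[math.log(0.6), math.log(0.4)],
            [math.log(0.5), math.log(0.5)]]
# ...but color favors the swapped pairing (hypothetical similarity values).
color_sim = [[0.0, 0.05],
             [0.05, 0.0]]

position_only = best_assignment(log_prob)
with_color = best_assignment(combine(log_prob, color_sim))
```

When the position information is ambiguous, even a modest color similarity can change the final assignment, which is the intended effect of running the two models in parallel.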

Code

Source code in Python is provided for the model, for the simulator, and for training and evaluation. A Jupyter notebook with a simple example is also provided. Code is available at https://github.com/XinweiYu/fDNC_Neuron_ID (Yu, 2021; copy archived at swh:1:rev:19c678781cd11a17866af7b6348ac0096a168c06).

Data availability

All datasets generated as part of this work have been deposited in a public Open Science Framework repository, DOI: https://doi.org/10.17605/OSF.IO/T7DZU.

The following previously published data sets were used
    Nguyen JP, Linder AN, Plummer GS, Shaevitz JW, Leifer AM (2017) Tracking Neurons in a Moving and Deforming Brain Dataset. IEEE DataPorts. https://doi.org/10.21227/H2901H

References

  Bishop CM (2006) Pattern Recognition and Machine Learning. New York: Springer-Verlag.
  Bronstein AM (2007) Rock, paper, and scissors: extrinsic vs. intrinsic similarity of non-rigid shapes. Proceedings / IEEE International Conference on Computer Vision. https://doi.org/10.1109/ICCV.1995.466933
  He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. https://doi.org/10.1109/CVPR33180.2016
  Jian B, Vemuri BC (2011) Robust Point Set Registration Using Gaussian Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 33:1633–1645. https://doi.org/10.1109/TPAMI.2010.223
  Ma J, Zhao J, Yuille AL (2016) Non-Rigid Point Set Registration by Preserving Global and Local Structures. IEEE Transactions on Image Processing : a Publication of the IEEE Signal Processing Society 25:53–64. https://doi.org/10.1109/TIP.2015.2467217
  Myronenko A, Song X (2010) Point set registration: coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence 32:2262–2275. https://doi.org/10.1109/TPAMI.2010.46
  Nejatbakhsh A, Varol E (2021) Neuron matching in C. elegans With Robust Approximate Linear Regression Without Correspondence. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2837–2846. https://doi.org/10.1109/WACV48630.2021.00288
  Parthasarathy N, Batty E, Falcon W, Rutten T, Rajpal M, Chichilnisky EJ (2017) Neural Networks for Efficient Bayesian Decoding of Natural Images from Retinal Neurons. In: Guyon I, Luxburg U. V, Bengio S, Wallach H, Fergus R, Vishwanathan S, editors. Advances in Neural Information Processing Systems, 30. Curran Associates, Inc. pp. 6434–6445. https://doi.org/10.1101/153759
  Paszke A, Gross S, Chintala S, Chanan G, Yang E, Devito Z (2017) Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems (NIPS 2017).
  Sulston JE (1976) Post-embryonic development in the ventral cord of Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 275:287–297. https://doi.org/10.1098/rstb.1976.0084
  Sun R, Paninski L, Dy J, Krause A (2018) Scalable approximate bayesian inference for particle tracking data. Proceedings of the 35th International Conference on Machine Learning.
  Varol E, Nejatbakhsh A, Sun R, Mena G, Yemini E, Hobert O (2020) Statistical Atlas of C. elegans Neurons. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Springer International Publishing. pp. 119–129. https://doi.org/10.1007/978-3-030-59722-1_12
  White JG, Southgate E, Thomson JN, Brenner S (1986) The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 314:1–340. https://doi.org/10.1098/rstb.1986.0056

Decision letter

  1. Gordon J Berman
    Reviewing Editor; Emory University, United States
  2. Ronald L Calabrese
    Senior Editor; Emory University, United States
  3. Gordon J Berman
    Reviewer; Emory University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This manuscript will be of interest to C. elegans neuroscientists and also biologists interested in methodological innovations in live imaging. The method described in the paper is clever and elegant, and the solution to the neuron correspondence problem is significant because it is another step toward closed-loop neural perturbation experiments in mobile worms.

Decision letter after peer review:

Thank you for submitting your article "Fast deep learning correspondence for neuron tracking and identification in C. elegans using synthetic training" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Gordon J Berman as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Ronald Calabrese as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) Many details about the training and characterization are missing. There should be more supplemental info accompanying the manuscript on training the network, characterizations of robustness against noise and errors, verifications – including quantifications like scrambling the data sequence, etc. Not having the information creates a big uncertainty on how to evaluate how close the work is to being usable in a real scenario (i.e., one where segmentation is also available). For whole-brain imaging, there are so many perturbations that can throw off tracking algorithms – segmentation error, cells sometimes "show up" and sometimes "disappear" (the newer versions of GCaMP are really low in baseline), etc. Not seeing these explored systematically concerned the reviewers, as they didn't have a good sense of whether the authors have attempted to look at these issues. Thus, we ask the authors to add additional figures and quantifications along these lines to buttress the manuscript's claims.

2) There also was a paucity of details on the network training itself. The methods section mentions, in passing, that some hyper-parameter choices were made on a validation set. Which hyper-parameters were selected in this way, and what ranges of parameters were tried? In this exploration, did the authors observe that network performance was sensitive to some hyper-parameters?

3) Since speed was a crucial criterion for model design, was there any trade-off between network size and running speed, such that future hardware may possibly achieve higher accuracy without further conceptual advances? Similarly, the results all report the performance of one trained network, which the authors report taking half a day to train. This is both a long time and not so long. Were other networks trained and not described, or did they fail to converge? If so, how often do these networks train successfully if using different initial conditions, and do their performances differ?

4) Relatedly, the reviewers also thought that the clarity of the algorithm performance is lacking. For instance, there are no training curves shown for the algorithm. Adding details on the network performance/training into supplementary materials would be beneficial.

5) The majority of the paper uses the authors' own data, which has unique features and structures, leading the reviewers to wonder if the presented results are as generalizable as the authors claim. For example, the authors only accounted for cells that are present in all data sets. What would the numbers look like if they divided by all cells that might show up in any of the animals (a significantly larger number)? The authors do bring up the issue of coverage and accuracy trade-off, but did not really explore this issue at all – the reviewers thought that this point is critical to whole-brain imaging, as the data are rather noisy. So this will have to be addressed using other whole-brain data (maybe published data from another lab, e.g. the Zimmer lab), re-do analysis on the neuron identification part, and more characterization on the coverage-accuracy trade-off. If using the Zimmer data, for example, they should be able to show that they can get the same temporal PCs. If the algorithm is very generalizable and extremely fast and quite accurate as the authors claim, it should be fairly simple to use it on a real whole-brain experiment data set and show that meaningful conclusions can come of it. Without this, one should not make such claims that "The method is fast and predicts correspondence in 10 ms making it suitable for future real-time applications."

6) Related to this point, the reviewers thought that the tracking having an ~80% accuracy is not a meaningful goal. First, this accuracy is an average, and it has no bearing on whether a cell can be *continuously* tracked. The traces may be broken, and worse, wrong cells are linked together. This 80% does not guarantee anything at this point. Having an accuracy on a per-frame basis is not the main goal, and there are existing data from the authors themselves and others where this point could be validated. One would have to show that the traces are similar, and better yet, the temporal PCs are similar. The tracking having 80% accuracy cannot be used for optogenetics at all. It is not meaningful to fire the laser at cells with 20% uncertainty in their identities and carry out any meaningful experiments. Thus, either additional validation on the continuity of the tracked accuracy needs to be provided, or the text on optogenetics needs to be significantly toned down or removed.

7) What is the practical use-case for this cross-individual correspondence, since 65.8% means there are still a lot of errors? Perhaps the authors can discuss (even with some back-of-the-envelope estimates) how much this means for an experiment that compares neural activity between two different worms? What is an experiment that may require doing this fast correspondence estimation between two worms in real-time? Practically speaking, how often would one need to compute correspondences between pairs of frames between two worms? Would the overall correspondence be better if more volumes from each animal were used to find a consensus, or would the authors recommend using NeRVE in that case?

8) "Recording" was used multiple times in the text and it's not clear whether they are time series or single volumes. For instance, it is not clear what exactly are the "12 individual animals" used for generating the training data. Are they single time frames or are they video? If videos, how many frames? It is not clear whether the NeuroPAL data sets are videos or single volumes.

9) Relatedly, if many time points of 12 individual animals are used to generate training data, this is not fully synthetic. The basis of the training data from many worm heads holds a lot of information. The question is also whether all (any) of the augmentation components are necessary or useful. There should be a full characterization of the differential benefit of the different augmentations from not augmenting at all. Calling it synthetic data (e.g., line 566) may be somewhat of a misnomer.

Reviewer #1:

In this submission, the authors introduce a new methodology for tracking neural correspondences in calcium imaging of freely moving C. elegans using the transformer architecture. The method produces state-of-the-art assignment accuracies in a manner that is significantly faster and more robust than existing approaches, potentially allowing for real-time tracking applications once other aspects of the computer vision pipeline become faster. The authors demonstrate the ability of their method on data within and between individuals (using the NeuroPAL worm lines), as well as on synthetic control data.

Reviewer #2:

Yu et al. developed a deep neural network model with the goal of solving two challenging problems in live imaging of neural activity in mobile C. elegans. They call their method fast Deep Learning Correspondence (fDLC), and it simultaneously achieves (1) identification of the same neurons within one animal in a movie, and (2) identification of corresponding neurons between two animals. These problems are difficult because worms change poses as they move, including wiggles within a horizontal plane as well as rolls, and there may be developmental variability between individual worms. Many past approaches have relied on computationally ‘straightening’ the worm into a canonical coordinate system, or on generative models that rely on manually curated features. Instead, fDLC takes a neural network approach; the model builds on the transformer architecture, popularized by its success in natural language processing (NLP). To circumvent the prohibitively large quantities of data typically required to train such models, the authors used a relatively modest experimental dataset and synthetically augmented the training data, simulating worm-like movements and imaging conditions to generate arbitrarily large training sets. They then tested the trained model on a variety of data not used in training, reporting good accuracy. The fDLC approach described here is particularly impressive because of its speed -- it computes correspondences among ~100 neurons in 10 msec per volume on relatively standard GPU hardware.

Strengths

The paper is generally clearly written, the methods and results are well presented, and the figures are concise summaries of the results. The introduction gives a thorough and thoughtful review of the related literature and how this work relates to previous methods. I particularly appreciate that the authors have made their code publicly available. I believe the paper describes a valuable contribution that will be of significant impact in the study of C. elegans neuroscience, as it solves a series of related technical challenges whose solution will open the door for more bold experiments.

The method described is well suited for the problem, and the performance described is impressive when compared to the closest methods available in the literature. The results are all well demonstrated and justify the conclusions.

Weaknesses

The strengths of the fDLC method are to enable real-time neural perturbations and to allow direct comparisons between different worms. However, as the authors point out, the real-time experiments are currently still intractable because cell segmentation remains slow. Further, direct comparisons between different worms remain to be demonstrated as an application of fDLC.

On the first application, the true impact of fDLC may have to wait for further development of real-time cell segmentation. This seems like an imminently achievable technology. On the second application, it remains to be shown whether the 65.8% accuracy -- while quite impressive -- is sufficient to allow novel analyses and insights to be gained. For instance, if one were analyzing a dataset of 10 separately imaged worms, the overall accuracy of identifying an individual corresponding neuron among these 10 animals may be significantly lower.

– Perhaps I'm being a bit nit-picky on terminology, but the use of the phrase ‘transfer learning’ in the abstract (also in Figure 1) seems a bit of a stretch. Am I interpreting correctly that the ‘transfer’ is between the train and test sets, without any further refinement? In what way is this ‘transfer learning’ beyond the standard machine learning use of the test/train split?

– The methods section mentions, in passing, that some hyper-parameter choices were made on a validation set. Which hyper-parameters were selected in this way, and what ranges of parameters were tried? In this exploration, did the authors observe that network performance was sensitive to some hyper-parameters?

– In the tracking results of the same worm across time, the fDLC approach treats each set of coordinates as independent measurements and does not explicitly use any temporal information. Nevertheless, I would imagine that segmented neuron positions from adjacent frames of the same movie, when the worm has not moved its pose by much, may be easier to track than pairs of frames picked at random. Is this true? What about frames that are 2, 3, etc. samples apart?

– The acronym ‘fDLC’ may be easily confused with some modification of DeepLabCut. While this work is also a deep learning based tracking software, I think mistaking this method for DeepLabCut may not be desirable.

Reviewer #3:

This manuscript describes a deep learning model for tracking neurons in C. elegans worms; a side utility of the algorithm is described to be for neuron identification. The problems it is trying to address are significant as there is a need for fast neuron tracking in moving C. elegans whole brain imaging; the premise of the work of using synthetic data for training is interesting. The manuscript has several significant deficiencies, including claims not fully supported by evidence and overreaching conclusions.

Major strengths:

1. The idea of using augmentation of real data to generate training sets for an ML model is interesting, particularly in situations where data are hard to come by.

2. fDLC's speed is attractive for the use cases.

Major weaknesses:

1. For tracking to have ~80% accuracy is not meaningful. First, this accuracy is an average, and it has no bearing on whether a cell can be *continuously* tracked. The traces may be broken, and worse, wrong cells are linked together. This 80% does not guarantee anything at this point. Having an accuracy on a per-frame basis is not useful at all. To actually have an impact on tracking, traces have to be shown, and these traces need to be verified. There are existing data from the authors themselves and others. One would have to show that the traces are similar, and better yet, the temporal PCs are similar. The tracking having 80% accuracy cannot be used for optogenetics at all. It is not meaningful to fire the laser at cells with 20% uncertainty in their identities and carry out any meaningful experiments. This claim does not make sense. The text on optogenetics needs to be significantly toned down, or better yet, removed.

2. What the Transformer network learned with the data is unclear. The paper does not show exactly what the Transformer network has learned – what features of the data are important? This is critical, as it is possible that another form of information is actually being learned from the data. For instance, in training where the hand-curated cells are used, the cells may be entered in a particular order. It is therefore possible that the Transformer network is learning the order in which the cells are entered, rather than the actual spatial relationships. To show that the Transformer network is really learning something meaningful in the data, one would have to scramble the order of the data and show that the results are not different.

3. The authors stressed that the learning does not require users to prescribe what to look for, but the warping, transformation, noises added are in essence adding information in a user-defined way. This claim does not make sense. In the text, the authors also use language such as "roughly matched (their) estimate of variability observed by eye". This is not rigorous and seems dangerous. Exact details and rationales of choices for the warping, transformation, noises added, etc. need to be included and fully justified.

4. Clarity of algorithm performance is lacking. For instance, there are no training curves shown for the algorithm.

5. Related, importantly, the accuracy of the algorithm must be very much data-dependent. Sources that can perturb a perfect scenario need to be examined. For instance, how would cells' activities in GCaMP recordings affect accuracy? How would segmentation error affect accuracy? It is not possible to evaluate the real-world utility if these issues are not explored. For all we know, it could be the best data that are fed to the algorithm that is used to calculate the accuracies here.

6. The authors stated that there is a trade-off between accuracy and coverage. This is an important point, but the authors did not fully characterize such a trade-off (related to the accuracy comment above); nor was the coverage assumption/definition that went into each part of the work clearly stated. In the tracking part, what would the coverage be? How is it defined? Comparisons to literature algorithms for neuron identification should not be done when the coverage is also not well defined, i.e. the denominators for the percentages in table 4 are ill-defined.

7. Clarity of the experimental data is lacking. "Recording" was used multiple times in the text and it's not clear whether they are time series or single volumes. For instance, it is not clear what exactly are the "12 individual animals" used for generating the training data. Are they single time frames or are they video? If videos, how many frames? It is not clear whether the NeuroPAL data sets are videos or single volumes.

8. Related to the issue above, if many time points of 12 individual animals are used to generate training data, this is not at all synthetic. The basis of the training data from many worm heads holds a lot of information. The question is also whether all (any) of the augmentation components are necessary or useful. There should be a full characterization of the differential benefit of the different augmentations from not augmenting at all. Calling it synthetic data in my opinion is a misnomer (e.g. line 566).

9. Generally speaking, if the algorithm is very generalizable and extremely fast, and quite accurate as the authors claim, it should be fairly simple to use it on a real whole-brain experiment data set and show that meaningful conclusions can come of it. Without this, one should not make such claims that "The method is fast and predicts correspondence in 10 ms making it suitable for future real-time applications."

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Fast deep neural correspondence for tracking and identifying neurons in C. elegans using semi-synthetic training" for further consideration by eLife. Your revised article has been evaluated by Ronald Calabrese (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Essential Revisions:

1. The data are from the authors themselves, not peer-reviewed, and not independently validated. The authors did not use, for instance, the Zimmer lab's data; the reason why was unclear to the reviewers. Also, the authors themselves have at least one volume of sensible data from their own previous work (NeRVE, Nguyen et al., 2017) in which they actually performed PCA on the GCaMP data. Applying fDNC to that set of data and showing that PCAs are comparable would make their claim much stronger.

2. The accuracy is a central claim in the paper. It is good that the authors now define what accuracy is in the text, but it is still confusing. A match between the template and the test does not assign a name necessarily -- unless the template neurons already have labels/identities from the ground truth information. From the text, it seems that the template is used as the reference with identities already assigned and that only the neurons common to both the test and template are considered since the denominator of the accuracy is defined as "the total number of ground truth matches". (Another interpretation of the definition would suggest that neurons that both the template and the test got wrong but match each other would have been counted as accurate?!)

There are two issues – the definition is not applicable to some other methods, and this definition is artificially favorable for fDNC.

a. In Table 2, the authors compare the accuracies of fDNC to that of CPD and CRF (ref 3). This is not appropriate. fDNC and CPD both use template matching, while CRF does not. This is to say that the accuracy definition is not the same for these methods.

b. The accuracy of fDNC is artificially more favorable. NeuroPAL datasets do not reliably identify the same neurons. When using one NeuroPAL dataset as template, and another as the test set, the matches are on the order of 70-80%. The definition of accuracy the authors use, therefore, is artificially high (by some significant percentage). The errors associated with neurons not common to the test and the template are discounted.

c. The coverage and the accuracy discussion should be restored.

3. Implying that fDNC is not "data-privileged" is false (page 13). fDNC is not naive – information from 4000 volumes from 12 animals is there, and fDNC must use a known annotated NeuroPAL dataset as a template, and therefore there is information again (e.g. variability of positions etc). Revising the discussion around this point is important.

Reviewer #1:

I thank the authors for their thoughtful revisions and especially for providing additional methodological details and caveats. I think that it would make a good addition to the literature.

Reviewer #2:

I thank the authors for their careful and detailed responses to comments and concerns. I especially appreciate the additional methodological details on the training of their model and the clarified definition of how performance is evaluated. I think this work is an interesting and valuable contribution to the literature, and another substantial step in achieving real-time manipulations in this popular experimental organism.

Reviewer #3:

The revised manuscript is improved for many of the details, including the data used and how the model was constructed, which are good for reproducibility.

The responses are unsatisfying in a few major places:

1. One of the central concerns from the previous round of review is on whether the algorithm performs well enough for tracking. The revision is unsatisfactory.

a. The authors were asked to apply their tracking to real data to show that the tracking results can generate meaningful data. The authors misunderstood the request as asking for biological insights. The intention is to VALIDATE, not to generate new insights. In fact, that's precisely the reason to apply the algorithm/model to well-curated data that are already peer-reviewed and published.

b. Figure 3 and the supplemental data were a step in the right direction, but are still unsatisfactory. Figure 3E now shows the tracking accuracy; as expected, the errors are sporadic. Some neurons appear to be ok while others not. This is THE reason PCA was asked in the last round. Figure 3 supplement showing AVAL/R traces are not enough to demonstrate the traces are sensible. It is anecdotal. AVA are among the most "obvious" neurons; peaks correlating to reversal behavior made the cells easy to identify and the tracking errors very easily ignored. The question is whether the rest of the neurons (>99% of them) can give sensible traces.

c. The data are from the authors themselves, not peer reviewed and not independently validated. The authors dodged the request to use, for instance, the Zimmer lab's data; the reason is unclear to me. Also, if anything, the authors themselves have at least one volume of sensible data from their own previous work (NeRVE, Nguyen et al., 2017) in which they actually did PCA on the GCaMP data. The least they can do is to apply fDNC to that set of data and show that PCAs are comparable.

d. Tracking is tracking, and should not be confounded with the discussion on neuron identification. The accuracy for tracking purposes should be discussed separately.

2. The authors cited a bioRxiv paper (ref 22) and glossed over its contribution. This paper is now published (https://elifesciences.org/articles/59187). The major contribution of 3DeeCellTracker is also a deep learning algorithm for tracking cells, and the paper also dealt with the sort of data this manuscript addresses. There is no discussion and no comparison.

a. In fact, Wen et al. directly compared 3DeeCellTracker performance with other algorithms, including even on the dataset from Nguyen et al. (2017). The accuracies reported in Wen et al., are quite favorable (>90%). This data set should be directly compared.

b. fDNC may be faster, which suggests a trade-off between speed and accuracy. It seems pertinent to include this discussion.

3. The accuracy is a central claim in the paper. It is good that the authors now define what accuracy is in the text, but it is still confusing. A match between the template and the test does not assign a name necessarily, UNLESS the template neurons already have labels/identities from the ground truth information. From the text, it seems that the template IS used as the reference with identities already assigned, and that only the neurons common to both the test and template are considered since the denominator of the accuracy is defined as "the total number of ground truth matches". (Another interpretation of the definition would suggest that neurons that both the template and the test got wrong but match each other would have been counted as accurate?!)

There are two issues: the definition is not applicable to some other methods, and the definition is artificially favorable for fDNC.

a. In Table 2, the authors compare the accuracies of fDNC to that of CPD and CRF (ref 3). This is not appropriate. fDNC and CPD both use template matching, while CRF does not. This is to say that the accuracy definition is not the same for these methods.

b. The accuracy of fDNC is artificially more favorable. NeuroPAL datasets do not reliably identify the same neurons. When using one NeuroPAL dataset as template, and another as the test set, the matches are on the order of 70-80%. The definition of accuracy the authors use, therefore, is artificially high (by some significant percentage). The errors associated with neurons NOT common to the test and the template are discounted.

c. It seems that the coverage and the accuracy discussion should be restored.

https://doi.org/10.7554/eLife.66410.sa1

Author response

Essential revisions:

1) Many details about the training and characterization are missing. There should be more supplemental info accompanying the manuscript on training the network, characterizations of robustness against noise and errors, verifications – including quantifications like scrambling the data sequence, etc. Not having the information creates a big uncertainty on how to evaluate how close the work is to being usable in a real scenario (i.e., one where segmentation is also available). For whole-brain imaging, there are so many perturbations that can throw off tracking algorithms – segmentation error, cells sometimes "show up" and sometimes "disappear" (the newer versions of GCaMP are really low in baseline), etc. Not seeing these explored systematically concerned the reviewers, as they didn't have a good sense of whether the authors had attempted to look at these issues. Thus, we ask the authors to add additional figures and quantifications along these lines to buttress the manuscript's claims.

Thank you for these suggestions:

– To characterize training we have added Figure 2 – Supplementary Figure 1 showing training curves and Table 7 showing final performance for all hyperparameters that we explored.

– We have added the following new figures and a video to further demonstrate performance on real-world calcium imaging datasets of moving C. elegans:

– Figure 3E shows a volume-by-volume comparison of the model’s assigned neural identities to those that were manually annotated in a freely moving calcium imaging dataset.

– Figure 3D shows a breakdown of model performance by neuron for that dataset.

– Figure 3 – Supplementary Figure 1 shows calcium activity extracted from a recently published recording of whole-brain activity of a moving worm (Hallinen et al., 2021). In that recording an additional fluorophore unambiguously labels AVAL and AVAR. We show that the calcium activity of AVAL and AVAR shows the expected transients.

– Video 1 shows labeled neurons over time from the same dataset in Figure 3 – Supplementary Figure 1.

– Regarding concerns related to GCaMP’s baseline activity: We note that all animals in this study expressed both RFP and GCaMP. Segmentation is performed only on RFP, thus we do not anticipate GCaMP activity to have an impact. We now clarify this in the text: “Segmentation was always performed on tagRFP, never on GCaMP.”

– Regarding the scrambling of data sequences: all of the input neuron sequences used in this work have been randomly shuffled both for the training set and the test set, thereby preventing the model from learning any information from original sequence order. We now clarify this in the text. “The input neuron sequences have been randomly shuffled for both template worm and test worm. This eliminates the possibility that the information from the original sequence order is used.”

We also note that, by design, the transformer model can’t extract information from the order of input sequence without additional position embeddings due to its permutation invariance property.

– Regarding challenging our model with realistic noise: For semi-synthetic worms, up to 20% of the total neurons are randomly added or removed. As mentioned above, we have also now added an additional real-world recording and show that the calcium activity of a well-characterized neuron pair, AVAL and AVAR, exhibits expected transients.
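To make the add/remove perturbation described above concrete, the following is a minimal sketch and not the simulator used in the paper. The function name `perturb_neurons`, the way the 20% budget is split between dropped and added neurons, and the uniform sampling of new neurons inside the bounding box are all illustrative assumptions; the response states only that up to 20% of neurons are randomly added or removed.

```python
import random


def perturb_neurons(points, max_fraction=0.2, rng=None):
    """Randomly drop and add neurons in one volume's point cloud.

    points: list of (x, y, z) tuples.
    max_fraction: cap on the fraction of neurons changed (0.2 mirrors the
        "up to 20%" figure in the response).
    Assumption: the budget is shared between drops and additions, and new
    neurons are drawn uniformly inside the bounding box of the original points.
    """
    rng = rng or random.Random()
    n = len(points)
    limit = int(max_fraction * n)

    # Choose how many neurons change in total, then split into drops and adds.
    n_changed = rng.randint(0, limit)
    n_drop = rng.randint(0, n_changed)
    n_add = n_changed - n_drop

    # Keep a random subset of the original neurons.
    kept = rng.sample(points, n - n_drop)

    # Add spurious neurons inside the bounding box of the original cloud.
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    added = [tuple(rng.uniform(lo[i], hi[i]) for i in range(3))
             for _ in range(n_add)]

    # Shuffle so that list order carries no information about identity.
    perturbed = kept + added
    rng.shuffle(perturbed)
    return perturbed


if __name__ == "__main__":
    cloud = [(float(i), float(i % 7), float(i % 3)) for i in range(100)]
    noisy = perturb_neurons(cloud, max_fraction=0.2, rng=random.Random(0))
    print(len(cloud), "->", len(noisy), "neurons")
```

Under these assumptions a 100-neuron volume always yields between 80 and 120 points, which gives a sense of how much missing and spurious detection the training data is meant to tolerate.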

2) There also was a paucity of details on the network training itself. The methods section mentions, in passing, that some hyper-parameter choices were made on a validation set. Which hyper-parameters were selected in this way, and what ranges of parameters were tried? In this exploration, did the authors observe that network performance was sensitive to some hyper-parameters?

In Table 7, we now report the hyperparameters that we tried. The hyper-parameters include the dimensionality of hidden space (32, 64, 128) and the number of layers (4, 6, 8) for the transformer architecture. All the models we tried converged (see new Figure 2 – Supplementary Figure 1). We have now added a paragraph of text: “We trained different models with different hyperparameters and chose the one with best performance. The training curve for each model we trained is shown in Figure 2 Supplementary Figure 1. All the models converged after 12 hours of training. We show the performance of trained models on a held-out validation set consisting of 12,800 semi-synthetic worms in Table 7. We chose the model with 6 layers and 128 dimensional embedding space since it reaches the highest performance and increasing the complexity of the model did not appear to increase the performance dramatically.”

3) Since speed was a crucial criterion for model design, was there any trade-off between network size and running speed, such that future hardware may possibly achieve higher accuracy without further conceptual advances? Similarly, the results all report the performance of one trained network, which the authors report taking half a day to train. This is both a long time and not so long. Were other networks trained and not described, or did they fail to converge? If so, how often do these networks train successfully if using different initial conditions, and do their performances differ?

The model already achieves a very high accuracy over our semi-synthetic data (96.5%). This suggests that the current bottleneck for improving performance on real data likely has less to do with speed and more to do with the semi-synthetic data’s ability to capture the full variability of real measurements. We now mention this in the discussion:

“Therefore fDNC's 79% accuracy within-individuals suggests room for improving within-individual correspondence, and by extension, across-individual correspondence because the latter necessarily includes all of the variability of the former. One avenue for achieving higher performance could be to improve the simulator's ability to better capture variability of a real dataset, for example by using different choices of parameters in the simulator.”

All the models with different hyperparameters converged. We have added text to describe convergence and convergence time.

“We trained different models with different hyperparameters and chose the one with best performance. The training curve for each model we trained is shown in Figure 2 Supplementary Figure 1. All the models converged after 12 hours of training. We show the performance of trained models on a held-out validation set consisting of 12,800 semi-synthetic worms in Table 7. We chose the model with 6 layers and 128 dimensional embedding space since it reaches the highest performance and increasing the complexity of the model did not appear to increase the performance dramatically.”

4) Relatedly, the reviewers also thought that the clarity of the algorithm performance is lacking. For instance, there are no training curves shown for the algorithm. Adding details on the network performance/training into supplementary materials would be beneficial.

Training curves have been added, see Figure 2 – Supplementary Figure 1.

5) The majority of the paper uses the authors' own data, which has unique features and structures, leading the reviewers to wonder if the presented results are as generalizable as the authors claim. For example, the authors only accounted for cells that are present in all data sets. What would the numbers look like if they divided by all cells that might show up in any of the animals (a significantly larger number)? The authors do bring up the issue of coverage and accuracy trade-off, but did not really explore this issue at all – the reviewers thought that this point is critical to whole-brain imaging, as the data are rather noisy. So this will have to be addressed using other whole-brain data (maybe published data from another lab, e.g. the Zimmer lab), re-do analysis on the neuron identification part, and more characterization on the coverage-accuracy trade-off. If using the Zimmer data, for example, they should be able to show that they can get the same temporal PCs. If the algorithm is very generalizable and extremely fast and quite accurate as the authors claim, it should be fairly simple to use it on a real whole-brain experiment data set and show that meaningful conclusions can come of it. Without this, one should not make such claims that "The method is fast and predicts correspondence in 10 ms making it suitable for future real-time applications."

We demonstrate generalizability by showing that the model performs well on all recordings in a published dataset from another group (Chaudhary et al., 2021) and on a previously published GCaMP dataset (Figure 3). We have now added an additional calcium imaging dataset, Figure 3 – Supplementary Figure 1 and corresponding video, Video 1. Extracted calcium dynamics of neurons AVAL and AVAR from this dataset exhibit expected transients.

We thank the reviewers for pointing out that our definition of accuracy may have unnecessarily caused confusion. We have made changes that should remove ambiguity.

– We now use a more straightforward definition of accuracy consistently across the entire manuscript. And we note that we do account for all neurons: “To evaluate performance, putative matches are found between template and test, and compared to ground truth. Every segmented neuron in the test or template (whichever has fewer) is assigned a match. Accuracy is defined as the number of proposed matches that agree with ground truth, divided by the total number of ground truth matches.”

– In Table 3, we now list the average number of ground truth matches for pairs of test and templates sampled from each dataset:

“The number of ground truth matches is a property of the dataset used to evaluate our model, and is listed in Table 3.”

– We regenerated all figures using this definition of accuracy. Numerical values are all now slightly different (e.g. in the worst case 65.8% became 64.1%), but our conclusions remain the same.

– We removed all discussion of what we previously had termed “coverage” because it no longer applies to this definition of accuracy.
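The accuracy definition quoted above can be written out in a few lines. This is a sketch of one reading of that definition, not the authors' evaluation code; the function name `matching_accuracy` and the dictionary representation (test-neuron index mapped to template-neuron index) are assumptions made for illustration.

```python
def matching_accuracy(proposed, ground_truth):
    """Fraction of ground-truth matches that the proposed matching reproduces.

    proposed, ground_truth: dicts mapping a test-neuron index to the
    template-neuron index it is matched to (hypothetical representation).
    """
    if not ground_truth:
        raise ValueError("no ground-truth matches to score against")
    # Numerator: proposed (test, template) pairs that agree with ground truth.
    correct = sum(
        1 for test_idx, template_idx in ground_truth.items()
        if proposed.get(test_idx) == template_idx
    )
    # Denominator: total number of ground-truth matches.
    return correct / len(ground_truth)


if __name__ == "__main__":
    ground_truth = {0: 5, 1: 3, 2: 8, 3: 1}
    # Test neuron 4 has no ground-truth partner, so its proposed match is not scored.
    proposed = {0: 5, 1: 3, 2: 7, 3: 1, 4: 2}
    print(matching_accuracy(proposed, ground_truth))  # 0.75
```

Under this reading, a proposed match for a neuron that has no ground-truth counterpart affects neither the numerator nor the denominator.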

We support our claim that the method is “fast and predicts correspondence in 10 ms” by providing evidence of the model’s speed (Table 1) and accuracy (Figure 3).

We disagree with the comment that we “should not make such claims” without additional “meaningful conclusions.” We have followed eLife’s author guidelines for Tools and Resources submissions: “Tools and Resources articles do not have to report major new biological insights or mechanisms, but it must be clear that they will enable such advances to take place, for example, through exploratory or proof-of-concept experiments.” Our experiments demonstrate the potential of this method for new discovery, and we are excited to use this method in all of our future scientific investigations.

6) Related to this point, the reviewers thought that the tracking having an ~80% accuracy is not a meaningful goal. First, this accuracy is an average, and it has no bearing on whether a cell can be *continuously* tracked. The traces may be broken, and worse, wrong cells are linked together. This 80% does not guarantee anything at this point. Having an accuracy on a per-frame basis is not the main goal, and there are existing data from the authors themselves and others where this point could be validated. One would have to show that the traces are similar, and better yet, the temporal PCs are similar. The tracking having 80% accuracy cannot be used for optogenetics at all. It is not meaningful to fire the laser at cells with 20% uncertainty in their identities and carry out any meaningful experiments. Thus, either additional validation on the continuity of the tracked accuracy needs to be provided, or the text on optogenetics needs to be significantly toned down or removed.

– Figure 3E now demonstrates the extent to which neurons are tracked continuously.

– Figure 3D now shows a breakdown of accuracy per-neuron.

– To further demonstrate that the model works with real-world data, in Figure 3 Supplementary Figure 1 we now apply the method to an additional previously published real-world recording of a moving animal during calcium imaging recording and show that well-characterized neurons AVAL and AVAR exhibit expected calcium transients.

– We have added text to note that the model tracks neurons without regard to time- or history-dependence: “Because CPD, NeRVE and fDNC are all time-independent algorithms, their performance on a given volume is the same, even if nearby volumes are omitted or shuffled in time.” We argue that, in this context, the average per-frame accuracy is relevant.

We have added a paragraph describing these additional accuracy results:

“… To visualize performance over time, we show a volume-by-volume comparison of fDNC's tracking to that of a human (Figure 3E). We also characterize model performance on a per neuron basis (Figure 3D). Finally, we used fDNC to extract whole brain calcium activity from a previously published recording of a moving animal in which two well-characterized neurons AVAL and AVAR were unambiguously labeled with an additional colored fluorophore (Hallinen et al., 2021), (Figure 3 – Supplementary Figure 1A). Calcium activity extracted from neurons AVAL and AVAR exhibited calcium activity transients when the animal underwent prolonged backward locomotion, as expected (Figure 3 – Supplementary Figure 1B).”

We have now revised language to deemphasize optogenetics and also to give more specific examples of real-time applications:

“The development of fast algorithms for tracking neurons are an important step for bringing real-time closed loop applications such as optical brain-machine interfaces (Clancy et al., 2014) and optical patch clamping (Hochbaum et al., 2014) to whole-brain imaging in freely moving animals.”

We also have added more context by comparing to the existing state of real-time methods in C. elegans, including for optogenetics:

“By contrast, existing real-time methods for C. elegans in moving animals are restricted to small subsets of neurons, are limited to two-dimensions, and work at low spatial resolution (Leifer et al., 2011; Stirman et al., 2011; Kocabas et al., 2012; Shipley et al., 2014).”

Obviously, we strive for 100% accuracy, but any automated method entails some amount of uncertainty. Our method is an important step toward improving labeling accuracy, and we believe ~80% is sufficient for many optogenetics experiments.

Nonetheless, we have broadened the discussion of real-time applications to focus less on optogenetics and to include BMI which relies only on calcium imaging.

7) What is the practical use-case for this cross-individual correspondence, since 65.8% means there are still a lot of errors. Perhaps the authors can discuss (even with some back-of-the-envelope estimates) how much this means for an experiment that compares neural activity between two different worms? What is an experiment that may require doing this fast correspondence estimation between two worms in real-time? Practically speaking, how often would one need to compute correspondences between pairs of frames between two worms? Would the overall correspondence be better if more volumes from each animal were used to find a consensus, or would the authors recommend using NeRVE in that case?

We have now added a paragraph to the discussion to clarify that correspondence is needed in two classes of use cases:

“Identifying correspondence between constellation of neurons is important for resolving two classes of problems: The first is tracking the identities of neurons across time in a moving animal. The second is mapping neurons from one individual animal onto another, and in particular onto a reference atlas, such as one obtained from electron microscopy (Witvliet et al., 2020). Mapping onto an atlas allows recordings of neurons in the laboratory to be related to known connectomic, gene expression, or other measurements in the literature”.

We have also now added four paragraphs to the discussion that put performance into a broader context, discuss potential fundamental limits, and provide one specific use case from our own work. We also now remind the reader that our model achieves 78% accuracy on the Chaudhary et al. dataset.

“The fDNC model finds neural correspondence within and across individuals with an accuracy that compares favorably to other methods. The model focuses primarily on identifying neural correspondence using position information alone. For tracking neurons within an individual using only position, fDNC achieves a high accuracy of 79%, while for across individuals using only position it achieves 64% accuracy on our dataset, and 78% on a published dataset from another group.

We expect that an upper bound may exist, set by variability introduced during the animal's development, that ultimately limits the accuracy with which any human or algorithm can find correspondence across individuals via only position information. For example, pairs of neurons in one individual that perfectly switch position with respect to another individual will never be unambiguously identified by position alone. It is unclear how close fDNC's performance of 64% on our dataset or 78% on the dataset in [2] comes to this hypothetical upper bound, but there is reason to think that at least some room for improvement remains.

Specifically, we do not expect accuracy at tracking within an individual to be fundamentally limited, in part because we do not expect two neurons to perfectly switch position on the timescale of a single recording. Therefore fDNC's 79% accuracy within-individuals suggests room for improving within-individual correspondence, and by extension, across-individual correspondence because the latter necessarily includes all of the variability of the former. One avenue for achieving higher performance could be to improve the simulator's ability to better capture variability of a real dataset, for example by using different choices of parameters in the simulator.

Even at the current level of accuracy, the ability to find correspondence across animals using position information alone remains useful. For example, we are interested in studying neural population coding of locomotion in C. elegans [34], and neural correspondence at 64% accuracy will allow us to reject null hypothesis about the extent to which neural coding of locomotion is stereotyped across individuals.”

We are unable to evaluate the effect of using more volumes from each individual because we lack across-animal datasets that also have within-animal multi-volume ground truth correspondence, see Table 3.

8) "Recording" was used multiple times in the text and it's not clear whether they are time series or single volumes. For instance, it is not clear what exactly are the "12 individual animals" used for generating the training data. Are they single time frames or are they video? If videos, how many frames? It is not clear the NeuroPAL data sets are videos or single volumes.

We have now added Table 3, which lists information about the number of individuals, volumes, volume rate and other properties for each dataset used.

9) Relatedly, if many time points of 12 individual animals are used to generate training data, this is not fully synthetic. The basis of the training data from many worm heads holds a lot of information. The question is also whether all (any) of the augmentation components are necessary or useful. There should be a full characterization of the differential benefit of the different augmentations from not augmenting at all. Calling it synthetic data (e.g., line 566) may be somewhat of a misnomer.

Thank you for pointing out that the term synthetic could be confusing. To avoid ambiguity, we now use the term “semi-synthetic” throughout.

Note, however, that the 12 individual animals used by the simulator lack any ground truth correspondence within or between animals (only positions and postures are derived from measurements). We now emphasize:

“Importantly, using semi-synthetic data also allows us to train our model even when we completely lack experimentally acquired ground truth data. And indeed, in this work, semi-synthetic data is derived exclusively from measurements that lack any ground truth correspondence either within-, or across animals. All ground truth for training comes only from simulation.”

Implicit in the reviewer’s question, is another: Even if we had large numbers of ground truth datasets of multiple volumes from within a single animal, would that be sufficient to achieve good performance across animals? This is an interesting hypothetical. It is worth noting that the variability across animals is necessarily greater than the variability within animals, so it is possible that it would not be sufficient. We now mention this in the text:

“…suggests room for improving within-individual correspondence, and by extension, across-individual correspondence because the latter necessarily includes all of the variability of the former.”

One might further ask, why not collect more ground truth data? Here the transformer required O(10^5) semi-synthetic volumes to reach peak performance. It took the whole lab two weeks of dedicated effort to manually generate the ground-truth dataset with O(10^3) volumes, as described in (Nguyen et al., 2017). Based on these estimates, it would take two years to generate comparable ground truth data to train the transformer.

Reviewer #2:

[…]

On the first application, the true impact of fDLC may have to wait for further development of real-time cell segmentation. This seems like an imminently achievable technology. On the second application, it remains to be shown whether the 65.8% accuracy -- while quite impressive -- is sufficient to allow novel analyses and insights to be gained. For instance, if one were analyzing a dataset of 10 separately imaged worms, the overall accuracy of identifying an individual corresponding neuron among these 10 animals may be significantly lower.

– Perhaps I'm being a bit nit-picky on terminology, but the use of the phrase `transfer learning` in the abstract (also in Figure 1) seems a bit of a stretch. Am I interpreting correctly that the `transfer` is between the train and test sets, without any further refinement? In what way is this `transfer learning` beyond the standard machine learning use of the test/train split?

We had sought to highlight that our test set evaluates within- and across-animal correspondence, while our semi-synthetic training set is derived from individual volumes that lack any correspondence information at all. We agree that the term transfer learning is at best confusing or at worst incorrect and have therefore removed ‘transfer learning’ from the text. Thank you for pointing this out.

– The methods section mentions, in passing, that some hyper-parameter choices were made on a validation set. Which hyper-parameters were selected in this way, and what ranges of parameters were tried? In this exploration, did the authors observe that network performance was sensitive to some hyper-parameters?

As discussed in response to “Essential Revisions #2” we have added Table 7 and accompanying text describing the choice and performance of hyper-parameters.

– In the tracking results of the same worm across time, the fDLC approach treats each set of coordinates as independent measurements and does not explicitly use any temporal information. Nevertheless, I would imagine that segmented neuron positions from adjacent frames of the same movie, when the worm has not moved its pose by much, may be easier to track than pairs of frames picked at random. Is this true? What about frames that are 2, 3, etc. samples apart?

As discussed in response to “Essential Revisions #1,” we now include Figure 3E, which shows a volume by volume comparison of fDNC predictions to that of a human for each neuron over time. The fDNC algorithm does not use temporal correlations and in fact its performance on a volume is the same, even if surrounding volumes are omitted or shuffled in time.

It is interesting to ask, under what conditions would temporal information be useful? Certainly, as the reviewer suggests, in the regime where neuron motion between frames is small compared to the mean distance between neurons, we would expect temporal information to be valuable. Any benefit of temporal information must be weighed against the potential drawback that time-dependent algorithms can accumulate errors over time. In our recordings, neuron motion between frames is of similar length scale to the mean distance between closest neuron neighbors, and this may hint at why this and previous work (Nguyen et al., 2017) have been successful with time-independent strategies. We now mention this in the text:

“The recording has sufficiently large animal movement that the average distance a neuron travels between volumes (31 μm) is of similar scale to the average distance between nearest neuron neighbors (35 μm).”
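The two length scales compared in the quoted sentence can be computed from tracked neuron positions roughly as follows. This is a sketch under stated assumptions, not the analysis code behind the reported values: the function names are hypothetical, and the two volumes are assumed to list the same neurons in the same order (that is, ground-truth correspondence is already known).

```python
import math


def mean_displacement(volume_a, volume_b):
    """Mean distance each neuron travels between two consecutive volumes.

    Assumes both lists hold the same neurons in the same order.
    """
    return sum(math.dist(p, q) for p, q in zip(volume_a, volume_b)) / len(volume_a)


def mean_nearest_neighbor_distance(volume):
    """Mean distance from each neuron to its closest other neuron in one volume."""
    total = 0.0
    for i, p in enumerate(volume):
        total += min(math.dist(p, q) for j, q in enumerate(volume) if j != i)
    return total / len(volume)


if __name__ == "__main__":
    volume_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
    # Every neuron shifts by (3, 4, 0), i.e. by a distance of 5.
    volume_b = [(x + 3.0, y + 4.0, z) for x, y, z in volume_a]

    motion = mean_displacement(volume_a, volume_b)
    spacing = mean_nearest_neighbor_distance(volume_a)
    print(motion / spacing)  # 0.5
```

A ratio well below 1 corresponds to the regime in which frame-to-frame temporal information would be expected to help; a ratio near 1, as reported for this recording, is the regime in which a time-independent method is a natural choice.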

– The acronym `fDLC` may be easily confused with some modification of DeepLabCut. While this work is also a deep learning based tracking software, I think mistaking this method for DeepLabCut may be not desirable.

We thank the reviewer for pointing this out. We have adjusted the acronym. We now use ‘fDNC’ for fast Deep Neural Correspondence.

Reviewer #3:

This manuscript describes a deep learning model for tracking neurons in C. elegans worms; a side utility of the algorithm is described to be for neuron identification. The problems it is trying to address are significant as there is a need for fast neuron tracking in moving C. elegans whole brain imaging; the premise of the work of using synthetic data for training is interesting. The manuscript has several significant deficiencies, including claims not fully supported by evidence and overreaching conclusions.

Major strengths:

1. The idea of using augmentation to real data to generate training sets for ML model is interesting, particularly in situations where data are hard to come by.

2. fDLC's speed is attractive for the use cases.

Major weaknesses:

1. For tracking to have ~80% accuracy is not meaningful. First, this accuracy is an average, and it has no bearing on whether a cell can be *continuously* tracked. The traces may be broken, and worse, wrong cells are linked together. This 80% does not guarantee anything at this point. Having an accuracy on per-frame basis is not useful at all. To actually have an impact on tracking, traces have to be shown, and these traces need to be verified. There are existing data from the authors themselves and others. One would have to show that the traces are similar, and better yet, the temporal PCs are similar. The tracking having 80% accuracy cannot be used for optogenetics at all. It is not meaningful to fire the laser at cells with 20% uncertainty in their identities and carry out any meaningful experiments. This claim does not make sense. The text on optogenetics needs to be significantly toned down, or better yet, removed.

Please see detailed response to “Essential Revisions: #6”.

2. What the Transformer network learned with the data is unclear. The paper does not show exactly what the Transformer network has learned – what features of the data are important? This is critical, as it is possible that another form of information is actually being learned from the data. For instance, in training where the hand-curated cells are used, the cells may be entered in a particular order. It is therefore possible that the Transformer network is learning the order in which the cells are entered, rather than the actual spatial relationships. To show that the Transformer network is really learning something meaningful in the data, one would have to scramble the order of the data and show that the results are not different.

The order of data is indeed scrambled in both training and test sets and we have clarified this in the text. Therefore the model is not learning the order. Please see detailed response to “Essential Revisions: #1”.
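The kind of order-scrambling check the reviewer describes can be sketched as follows. This is a minimal illustration, not the fDNC code: `nearest_neighbor_match` is a hypothetical stand-in matcher, and `is_order_independent` is an assumed helper name. The check permutes the order in which test neurons are presented and verifies that each neuron still receives the same match.

```python
import random


def nearest_neighbor_match(test, template):
    """Stand-in matcher (not fDNC): index of the closest template point
    for each test point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(template)), key=lambda j: sq_dist(p, template[j]))
            for p in test]


def is_order_independent(matcher, test, template, perm=None, seed=0):
    """Return True if scrambling the order of test points leaves each
    point's match unchanged."""
    if perm is None:
        perm = list(range(len(test)))
        random.Random(seed).shuffle(perm)
    shuffled = [test[i] for i in perm]
    original = matcher(test, template)
    permuted = matcher(shuffled, template)
    # shuffled[k] is test[perm[k]], so its match must equal original[perm[k]].
    return all(permuted[k] == original[perm[k]] for k in range(len(perm)))


# Hypothetical 3D positions for illustration only.
template = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
test = [(0.1, 0, 0), (0.9, 0.1, 0), (0, 1.1, 0), (0, 0, 0.8)]
print(is_order_independent(nearest_neighbor_match, test, template,
                           perm=[2, 0, 3, 1]))
```

A matcher that relied on input order (for example, one that simply returned `0, 1, 2, ...`) would fail this check for any non-identity permutation.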

3. The authors stressed that the learning does not require users to prescribe what to look for, but the warping, transformation, noises added are in essence adding information in user-defined way. This claim does not make sense. In the text, the authors also use language such as "roughly matched (their) estimate of variability observed by eye". This is not rigorous and seems dangerous. Exact details and rationales of choices for the warping, transformation, noises added, etc need to be included and fully justified.

We have removed that text and now specify in the discussion that one avenue for future improvement is to better tune the simulator to capture variability. “One avenue for achieving higher performance could be to improve the simulator's ability to better capture variability of a real dataset, for example by using different choices of parameters in the simulator.”

4. Clarity of algorithm performance is lacking. For instance, there are no training curves shown for the algorithm.

Training curves have been added in Figure 2 – Supplementary Figure 1.

5. Related, importantly, the accuracy of the algorithm must be very much data-dependent. Sources that can perturb a perfect scenario need to be examined. For instance, how would cells' activities in GCaMP recordings affect accuracy? How would segmentation error affect accuracy? It is not possible to evaluate the real-world utility if these issues are not explored. For all we know, it could be the best data that are fed to the algorithm that is used to calculate the accuracies here.

To account for differences in data, and to provide fair comparison against other methods, we evaluate performance on multiple recordings from our own group, including those that we have published previously, and on all recordings in a published dataset from a different group (Chaudhary, 2021). Performance on individual recordings in each dataset is visible in Figure 4 and spans a wide range. Also, our model performs better on the Chaudhary dataset (78.2%) than on our own (64.1%). That we use a wide range of datasets, and that our model performs even better on another group’s dataset, is evidence that we are not using only “the best data to calculate accuracies.”

Regarding GCaMP, we note that accuracy reported in Figure 3 is evaluated on a recording that contains GCaMP activity (originally from Nguyen et al., 2017). Moreover, we have now added a new example where we apply our method to a recording we recently published (Hallinen et al., 2021), and in this case we also show that GCaMP activity behaves as expected (Figure 3 – Supplementary Figure 1).

6. The authors stated that there is a trade-off between accuracy and coverage. This is an important point, but the authors did not fully characterize such trade-off (related to the accuracy comment above); nor was the coverage assumption/definition that went into each part of the work clearly stated. In the tracking part, what would the coverage be? How is it defined? Comparisons to literature algorithms for neuron identification should not be done when the coverage is also not well defined, i.e. the denominators for the percentages in table 4 are ill-defined.

Regarding coverage, it is important to note that the algorithm assigns every segmented neuron in the test or template (whichever has fewer) a match. We now reiterate this point more often in the text:

“Every segmented neuron in the test or template (whichever has fewer) is assigned a match. Accuracy is defined as the number of proposed matches that agree with ground truth, divided by the total number of ground truth matches. The number of ground truth matches is a property of the dataset used to evaluate our model and is listed in Table 3.”

The numerator and denominator are now well defined for all calculations of the accuracy of our model, NeRVE and CPD, including in Table 4 (Table 2 in new version): “number of proposed matches that agree with ground truth, divided by the total number of ground truth matches”.
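The accuracy definition quoted above can be written as a short sketch. This is an illustration under stated assumptions, not the authors' evaluation code: the function name, the dictionary-based data layout, and the example neuron labels and indices are all hypothetical. A ground truth match is taken here to be a pair of neurons, one in the test and one in the template, that carry the same human-assigned label.

```python
def matching_accuracy(proposed, test_labels, template_labels):
    """Accuracy = proposed matches that agree with ground truth,
    divided by the total number of ground truth matches.

    proposed:        dict mapping test index -> template index
    test_labels:     dict mapping test index -> ground truth name
    template_labels: dict mapping template index -> ground truth name
    Only neurons with a human-assigned label appear in the label dicts.
    """
    template_by_name = {name: idx for idx, name in template_labels.items()}
    # Ground truth matches: labels present in both test and template.
    ground_truth = {t_idx: template_by_name[name]
                    for t_idx, name in test_labels.items()
                    if name in template_by_name}
    if not ground_truth:
        raise ValueError("no ground truth matches between test and template")
    correct = sum(1 for t_idx, tmpl_idx in ground_truth.items()
                  if proposed.get(t_idx) == tmpl_idx)
    return correct / len(ground_truth)


# Hypothetical example: every test neuron is assigned a match, but only
# two labels (AVAL, AVAR) are shared between test and template.
test_labels = {0: "AVAL", 1: "AVAR", 2: "RIML"}
template_labels = {0: "AVAR", 1: "AVAL", 3: "SMDVL"}
proposed = {0: 1, 1: 2, 2: 0, 3: 3}
print(matching_accuracy(proposed, test_labels, template_labels))
```

In this sketch, matches proposed for neurons outside the shared label set neither raise nor lower the score, so the denominator depends only on the dataset's labels and not on the model.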

We now list information about the denominator explicitly in Table 3 by showing the ground truth matches from test and template pairs sampled from each dataset. We note that this is a property of the dataset and not of the model. Now that we have a more simplified definition of accuracy, a discussion of “coverage” is no longer relevant and has been removed.

There remains the question of how best to compare our model’s accuracy to that of the CRF model from (Chaudhary et al) because that model reports accuracy using templates that are privileged (see below). We have added two paragraphs describing the specific assumption under which our two models can be directly compared:

“The CRF model compares test and template like we do, but their template is privileged in the sense that it is derived from either the literature (“open atlas”) or aggregated from their other recordings (“data-driven”). Moreover, the data-driven atlas incorporates statistics about variability from across their recordings. By contrast our template is simply one of the other recordings in the dataset. Pairing a test with a privileged template provides slightly more ground truth matches on which to evaluate performance, because the privileged template contains more ground truth labels. We expect the difference is modest, however, because the number of ground truth matches is still limited by the number of neurons with ground truth labels in the test.

Nonetheless, we must make an assumption to directly compare reported accuracy of the CRF model on the dataset in (Chaudhary et al) to the fDNC model's performance on the same dataset. We must assume that on average there is nothing particularly special about those neurons that have ground truth labels present in the intersection of test and privileged template, but that lack a ground truth label in a non-privileged template sampled from the recordings. Under this assumption, we compared the reported performance of the CRF model on the published dataset in (Chaudhary et al) to the performance of the fDNC model evaluated on the same dataset (Table 2).”

7. Clarity of the experimental data is lacking. "Recording" was used multiple times in the text and it's not clear whether they are time series or single volumes. For instance, it is not clear what exactly are the "12 individual animals" used for generating the training data. Are they single time frames or are they video? If videos, how many frames? It is not clear the NeuroPAL data sets are videos or single volumes.

See response to “Essential Revisions #8”.

8. Related to the issue above, if many time points of 12 individual animals are used to generate training data, this is not at all synthetic. The basis of the training data from many worm heads holds a lot of information. The question is also whether all (any) of the augmentation components are necessary or useful. There should be a full characterization of the differential benefit of the different augmentations from not augmenting at all. Calling it synthetic data in my opinion is a misnomer (e.g. line 566).

See response to “Essential Revisions #9”.

9. Generally speaking, if the algorithm is very generalizable and extremely fast, and quite accurate as the authors claim, it should be fairly simple to use it on a real whole-brain experiment data set and show that meaningful conclusions can come of it. Without this, one should not make such claims that "The method is fast and predicts correspondence in 10 ms making it suitable for future real-time applications."

See response to “Essential Revisions #5”.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential Revisions:

1. The data are from the authors themselves, not peer-reviewed, and not independently validated. The authors did not use, for instance, the Zimmer lab's data; the reason why was unclear to the reviewers. Also, the authors themselves have at least one volume of sensible data from their own previous work (NeRVE, Nguyen et al., 2017) in which they actually performed PCA on the GCaMP data. Applying fDNC to that set of data and showing that PCAs are comparable would make their claim much stronger.

– The published population recordings that we know of from the Zimmer group are for immobilized animals. Tracking an immobile recording would not be a good demonstration of the fDNC method. Neuron configurations do not change over time in immobilized animals, so tracking during immobilization is relatively trivial. fDNC would only be used for tracking neurons in moving animals.

– Figure 3e shows fDNC applied to the requested GCaMP recording from Nguyen et al. 2017.

We have further clarified the caption to make this clear.

“Detailed comparison of fDNC tracking to human annotation of a moving GCaMP recording from Nguyen et al. (2017) [18]”.

– The most stringent comparison we can perform is to compare fDNC tracking to human ground truth tracking, as we have done in Figure 3d and e with the GCaMP recordings from Nguyen 2017. Comparing calcium activity, as suggested, is one step further removed, and is a less informative comparison. If the reviewer’s goal is to assess whether fDNC-tracked neurons can result in plausible calcium traces, Figure 3—figure supplement 1 (p. 27) shows that it does.

– We disagree that PCA analysis of neural activity is informative or relevant to demonstrate the method, and including the analysis risks dragging the paper in a confusing new direction. But to facilitate peer review, we have included the requested analysis below. Left shows neural activity from Nguyen et al. 2017 using human tracking projected into its first two principal components. Right shows activity from the same recording tracked via fDNC and projected into its first two principal components.

Author response image 1

Comparing tracking via calcium activity in this way is less stringent and less informative than comparing tracking directly as in Figure 3d and e. We also find these plots difficult to interpret and potentially confusing. Finally, far from being a standard analysis, in published work low dimensional neural state space trajectories have only previously been applied to immobile C. elegans, not to moving animals. So including these plots would also broach new scientific ground that is beyond the scope of this methods paper.

2. The accuracy is a central claim in the paper. It is good that the authors now define what accuracy is in the text, but it is still confusing. A match between the template and the test does not assign a name necessarily -- unless the template neurons already have labels/identities from the ground truth information. From the text, it seems that the template is used as the reference with identities already assigned and that only the neurons common to both the test and template are considered since the denominator of the accuracy is defined as "the total number of ground truth matches". (Another interpretation of the definition would suggest that neurons that both the template and the test got wrong but match each other would have been counted as accurate?!)

There may be a misunderstanding. Ground truth labels are only used for evaluating performance after the fact; they are not part of the model. See response to Essential feedback #3. All neurons get matched, no label is required. As stated in the text,

“Every segmented neuron in the test or template (whichever has fewer) is assigned a match.”

There are two issues – the definition is not applicable for some other methods and that this definition is artificially favorable for fDNC.

a. In Table 2, the authors compare the accuracies of fDNC to that of CPD and CRF (ref 3). This is not appropriate. fDNC and CPD both use template matching, while CRF does not. This is to say that the accuracy definition is not the same for these methods.

We have rewritten the section where we compare fDNC to CPD to highlight the reviewer’s point about template matching and the differences in the models between fDNC and CRF. We now make explicit the assumptions under which we compare fDNC and CRF, despite their differences, and provide quantitative bounds on the range of possible assumptions. And, out of an abundance of caution, we have tempered our conclusions about relative accuracy. We say that fDNC’s accuracy is “comparable” to CRF in addition to having other advantages.

We hope the reviewers and editors will recognize the value in comparing methods on the same published datasets, and understand that in this case an assumption is necessary to make the comparison.

“We further sought to compare the fDNC model to the reported accuracy of a recent model called Conditional Random Fields (CRF) from (Chaudhary et al., 2021) by evaluating fDNC on the same published dataset from that work. […] Taken together, we conclude that the fDNC model's accuracy is comparable to that of the CRF model while also providing other advantages.”

b. The accuracy of fDNC is artificially more favorable. NeuroPAL datasets do not reliably identify the same neurons. When using one NeuroPAL dataset as template, and another as the test set, the matches are on the order of 70-80%. The definition of accuracy the authors use, therefore, is artificially high (by some significant percentage). The errors associated in neurons not common to the test and the template are discounted.

– For neurons that are not part of the set of ground truth labels that intersect test and template, we neither catch errors nor catch correct matches. It is not obvious to us whether this undercounts or overcounts our accuracy compared to the hypothetical in which a human had a complete set of ground truth labels at their disposal.

– We have carefully considered alternative definitions of accuracy and of all of them, this definition best reflects the information we have. We note that (Chaudhary et al.) faces the same challenge in their framework with respect to neurons that lack ground truth in their test worms and they approach this similarly. They evaluate performance on only those neurons with ground truth labels in the test and ignore segmented neurons that lack ground truth labels for the purposes of reporting accuracy.

– In Table 3 we provide quantitative details about the number of segmented neurons per individual, the number of ground truth labels per individual, and the number of ground truth matches per pair, so that a reader has all of the information they need to understand the ramifications of our choice of accuracy.

c. The coverage and the accuracy discussion should be restored.

The key points about how we define accuracy and how we think about the denominator are present and clearer than in the initial submission. Table 3, in particular, precisely quantifies how many neurons are segmented and how many have ground truth labels. The previous round of reviewer feedback made clear that the “coverage and accuracy” framing was causing confusion. We hesitate to revive it.

3. Implying that fDNC is not "data-privileged" is false (page 13). fDNC is not naive – information from 4000 volumes from 12 animals is there, and fDNC must use a known annotated NeuroPAL dataset as a template, and therefore there is information again (e.g. variability of positions etc). Revising the discussion around this point is important.

– We rewrote the section (pasted above, in response to Reviewer #3 feedback 2b.) and removed the word “privileged” as it is imprecise and may be causing confusion. Thank you for pointing this out.

– There may be a misunderstanding. fDNC finds matches between two configurations. It does not require a known annotated NeuroPAL dataset as a template (and does not use NeuroPAL for training). We added new text to clarify:

“Later in the work we use ground truth information from human annotated NeuroPAL (Yemini et al., 2021) strains to evaluate the performance of our model, but no NeuroPAL strains were used for training.”

– For example, in Figure 3 fDNC finds correspondence between two worms even though it is blind to any NeuroPAL color information. NeuroPAL is then used only to evaluate the performance of the matches. We added text to clarify:

“NeuroPAL worms contain extra color information that allows a human to assign ground truth labels to evaluate the model's performance. Crucially, the fDNC model was blinded to this additional color information. In these experiments, NeuroPAL color information was only used to evaluate performance after the fact, not to find correspondence.”

https://doi.org/10.7554/eLife.66410.sa2

Article and author information

Author details

  1. Xinwei Yu

    Department of Physics, Princeton University, Princeton, United States
    Contribution
    Conceptualization, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-8699-3546
  2. Matthew S Creamer

    Princeton Neuroscience Institute, Princeton University, Princeton, United States
    Contribution
    Investigation, Writing - review and editing, Collected data
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-9458-0629
  3. Francesco Randi

    Department of Physics, Princeton University, Princeton, United States
    Contribution
    Resources, Designed optics and related software libraries
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-6200-7254
  4. Anuj K Sharma

    Department of Physics, Princeton University, Princeton, United States
    Contribution
    Resources, Writing - review and editing, Performed all transgenics
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-5061-9731
  5. Scott W Linderman

    1. Department of Statistics, Stanford University, Stanford, United States
    2. Wu Tsai Neurosciences Institute, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Funding acquisition, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-3878-9073
  6. Andrew M Leifer

    1. Department of Physics, Princeton University, Princeton, United States
    2. Princeton Neuroscience Institute, Princeton University, Princeton, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Project administration, Writing - review and editing
    For correspondence
    leifer@princeton.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-5362-5093

Funding

Simons Foundation (543003)

  • Andrew M Leifer

Simons Foundation (697092)

  • Scott W Linderman

National Science Foundation (IOS-1845137)

  • Andrew M Leifer

National Science Foundation (PHY-1734030)

  • Andrew M Leifer

National Institutes of Health (R21NS101629)

  • Andrew M Leifer

National Institutes of Health (1R01NS113119)

  • Scott W Linderman

National Institutes of Health (P40 OD010440)

  • Matthew S Creamer

Swartz Foundation (Swartz Fellowship for Theoretical Neuroscience)

  • Francesco Randi

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Eviatar Yemini and Oliver Hobert of Columbia University for strain OH15262. We acknowledge productive discussions with John Murray of University of Pennsylvania. This work used computing resources from the Princeton Institute for Computational Science and Engineering. Research reported in this work was supported by the Simons Foundation under awards SCGB #543003 to AML and SCGB #697092 to SWL; by the National Science Foundation, through an NSF CAREER Award to AML (IOS-1845137) and through the Center for the Physics of Biological Function (PHY-1734030); by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award numbers R21NS101629 to AML and R01NS113119 to SWL; and by the Swartz Foundation through the Swartz Fellowship for Theoretical Neuroscience to FR. Some strains are being distributed by the CGC, which is funded by NIH Office of Research Infrastructure Programs (P40 OD010440).

Senior Editor

  1. Ronald L Calabrese, Emory University, United States

Reviewing Editor

  1. Gordon J Berman, Emory University, United States

Reviewer

  1. Gordon J Berman, Emory University, United States

Publication history

  1. Received: January 9, 2021
  2. Accepted: July 13, 2021
  3. Accepted Manuscript published: July 14, 2021 (version 1)
  4. Version of Record published: August 16, 2021 (version 2)

Copyright

© 2021, Yu et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
