New idtracker.ai rethinks multi-animal tracking as a representation learning problem to increase accuracy and reduce tracking time
Figures
Performance for a benchmark of 33 videos of flies, zebrafish, and mice.
(a) Median IDF1 score computed using all images of animals in the videos, excluding animal crossings. The videos are ordered by decreasing IDF1 score of the original idtracker.ai results for ease of visualization. (b) Median tracking times, shown at the scale of hours and, in the inset, at the scale of days. The videos are ordered by increasing tracking times of the original idtracker.ai results for ease of visualization. The names of the videos in (a) and (b) start with a letter for the species, followed by the number of animals in the video, and possibly an extra number to distinguish videos of the same species and group size. Lines between points are for visualization purposes only.
Figure 1—source data 1
Numerical benchmark results with median, mean, and 20–80 percentile values of tracking accuracy and times.
- https://cdn.elifesciences.org/articles/107602/elife-107602-fig1-data1-v1.xlsx
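The IDF1 score reported in the benchmark can be illustrated with a minimal toy sketch (an illustration of the metric's idea, not the implementation used in the paper): find the one-to-one mapping between ground-truth and predicted identity labels that maximizes matches, then compute the F1 score of that mapping.

```python
from itertools import permutations

def idf1(gt_ids, pred_ids):
    """Toy IDF1-style score over equal-length sequences of per-detection
    identity labels. Brute-forces the label mapping (fine for small label
    sets; real trackers use the Hungarian algorithm instead)."""
    gt_labels = sorted(set(gt_ids))
    pred_labels = sorted(set(pred_ids))
    best_tp = 0
    for perm in permutations(pred_labels, len(gt_labels)):
        mapping = dict(zip(gt_labels, perm))
        tp = sum(1 for g, p in zip(gt_ids, pred_ids) if mapping[g] == p)
        best_tp = max(best_tp, tp)
    false_pos = len(pred_ids) - best_tp
    false_neg = len(gt_ids) - best_tp
    return 2 * best_tp / (2 * best_tp + false_pos + false_neg)
```

For example, `idf1(["a", "a", "b", "b"], [1, 1, 2, 1])` maps a→1, b→2 (three matches out of four detections) and yields 0.75.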
Performance for the benchmark with full trajectories with animal crossings.
(a) Median IDF1 score computed using all images of animals in the videos including animal crossings. (b) Median tracking times. Lines between points are for visualization purposes only. Source data can be found in Figure 1—source data 1.
Boxplot representation of the benchmark results.
IDF1 scores without (a) and with (b) animal crossings, and tracking times (c), for the three versions of idtracker.ai and for TRex.
Robustness to blurring and light conditions.
First column: the unmodified video zebrafish_60_1. Second column: zebrafish_60_1 with Gaussian blurring of 1 pixel, a resolution reduction to 40% of the original, and MJPG video compression. Version 5 of idtracker.ai fails on this video, reaching only an 82% IDF1 score. Third column: videos of 60 zebrafish with manipulated light conditions (the same test as in the original idtracker.ai; Romero-Ferrero et al., 2019). First row: uniform light conditions across the arena (zebrafish_60_1). Second row: a similar setup but with the lights off on the bottom and right sides of the arena.
Memory usage across the different software.
The solid line is a logarithmic fit of the peak RAM usage as a function of the number of blobs in a video. The Python tool psutil.virtual_memory().used was used to measure memory usage five times per second during tracking. The value presented is the maximum measured value minus the baseline measured before tracking starts. Disclaimer: both software packages include automatic optimizations that adjust to machine resources, so results may vary on systems with less available memory. These results were measured on computers with the specifications given in ‘Methods’.
Tracking by identification using deep contrastive learning.
(a) Schematic representation of a video with five fish. (b) A ResNet18 network with eight outputs generates a representation of each animal image as a point in an eight-dimensional space (here shown in 2D for visualization). Each pair of images corresponds to two points in this space, separated by a Euclidean distance. The ResNet18 network is trained to minimize this distance for positive pairs and maximize it for negative pairs. (c) 2D t-SNE visualizations of the learned 8-dimensional representation space. Each dot represents an image of an animal from the video. As training progresses, clusters corresponding to individual animals become clearer. Here, we plot this process for the example video zebrafish_60_1 after training for 0, 2000, 4000, and 15,000 batches (each batch contains 400 positive and 400 negative pairs of images, that is, 1600 images per batch). The t-SNE plot at 15,000 training batches is also shown color-coded by human-validated ground-truth identities. The pink rectangle at 2000 batches of training highlights clear clusters, and the orange square fuzzy clusters. (d) The Silhouette score measures cluster coherence and increases during training, as illustrated for a video with 60 zebrafish. (e) A Silhouette score of 0.91 corresponds to a human-validated error rate of less than 1% per image.
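The pairwise objective in (b), minimizing the Euclidean distance for positive pairs and maximizing it for negative pairs, can be sketched as a margin-based contrastive loss. This is a standard formulation; the exact loss and margin used by the software are assumptions here.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, is_positive, margin=1.0):
    """Margin-based contrastive loss on batches of embedding pairs.
    Positive pairs (same animal) are pulled together; negative pairs
    (different animals) are pushed apart, up to `margin`."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)            # Euclidean distances
    pos_term = is_positive * d ** 2                      # shrink distance
    neg_term = (1 - is_positive) * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(pos_term + neg_term))
```

With the 8-dimensional embeddings of the caption, `emb_a` and `emb_b` would each be arrays of shape `(batch, 8)` produced by the ResNet18, and `is_positive` a 0/1 vector marking which pairs come from the same fragment.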
Tracking with strong occlusions.
Accuracies when we mask a region of a video defined by an angle θ, so that the tracking system has no access to the information behind the mask. Light and dark gray regions correspond to angles for which no global fragments exist in the video. Dark gray regions correspond to angles for which the video has a fragment connectivity lower than 0.5, with fragment connectivity defined as the average number of other fragments each fragment coexists with, divided by N − 1, with N the total number of animals; see Figure 3—figure supplement 1 for an analysis justifying this value of 0.5. The original idtracker.ai (v4) and its optimized version (v5) cannot work in the gray regions, and the new idtracker.ai is expected to deteriorate only in the dark gray region.
Fragment connectivity analysis.
The relation between fragment connectivity and the corresponding IDF1 score for every video and angle θ of the new idtracker.ai in Figure 3. Fragment connectivity is calculated as the average number of other fragments each fragment coexists with, divided by N − 1, with N the total number of animals. Values below 0.5 are associated with low IDF1 scores, and idtracker.ai alerts the user in such cases.
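The fragment-connectivity quantity defined in this caption can be sketched directly. The helper is hypothetical, and fragments are represented here as simple (start_frame, end_frame) intervals; two fragments coexist when their intervals overlap.

```python
def fragment_connectivity(fragments, n_animals):
    """Average number of other fragments each fragment coexists with
    (overlapping frame intervals), divided by N - 1."""
    def coexist(a, b):
        return a[0] <= b[1] and b[0] <= a[1]    # intervals overlap
    counts = [
        sum(1 for j, g in enumerate(fragments) if j != i and coexist(f, g))
        for i, f in enumerate(fragments)
    ]
    return sum(counts) / len(counts) / (n_animals - 1)
```

For three animals whose fragments all span the same frames, each fragment coexists with the other two, giving a connectivity of 2 / (3 − 1) = 1.0; fully disjoint fragments give 0.0, well below the 0.5 threshold the caption warns about.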
Protocol 2 failure rate.
For idtracker.ai (v4 and v5), the probability of not tracking the video with protocol 2; for TRex, the probability that it fails without generating trajectories.
Models comparison.
Error in image identification as a function of training time for different deep learning models, randomly initialized, in six test videos. For each network, we report the multiply-accumulate operations (MAC) in giga-operations (G) (for a batch of 1600 images of size 40 × 40 × 1) and the number of parameters in millions (M). Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on the clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
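The evaluation loop described above (cluster every 100 batches, score with the Silhouette, keep the best model so far) rests on two routines that can be sketched minimally in NumPy. These are illustrative implementations, not the library routines the authors used.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: assign points to the nearest centroid, recompute
    centroids, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette score: (b - a) / max(a, b) per point, where a is the
    mean intra-cluster distance and b the mean distance to the nearest
    other cluster. Near 1 means tight, well-separated clusters."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    n = len(X)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean()
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

In the paper's loop, `X` would be the learned embeddings of 20,000 sampled images and `k` the number of animals; the Silhouette score then serves as a label-free proxy for identification accuracy, matching panel (e) of Figure 2.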
ResNet models comparison.
Error in image identification as a function of training time for different deep learning models, randomly initialized (except the pre-trained ResNet18), in six test videos. For each network, we report the multiply-accumulate operations (MAC) in giga-operations (G) (for a batch of 1600 images of size 40 × 40 × 1) and the number of parameters in millions (M). Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on the clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
Embedding dimensions comparison.
Error in image identification as a function of training time for different embedding dimensions in six test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
over comparison.
Error in image identification as a function of training time for different ratios of in six test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
Batch size comparison.
Error in image identification as a function of training time for different batch sizes of pairs of images in six test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
Exploration and exploitation comparison.
Error in image identification as a function of training time for different exploration/exploitation weights α in six test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
Segmentation GUI.
Enables users to set the basic parameters required for running idtracker.ai.
Validator GUI.
Enables users to inspect tracking results, correct errors, and access additional tools.
Video Generator GUI.
Allows users to define parameters for general and individual video generation.
SocialNet output example showing learned attraction-repulsion and alignment areas for social interactions around the focal animal.
Confusion matrix between the two parts of drosophila_80 in v6 of idmatcher.ai.
It contains predictions from the network trained on the first part applied to images from the second part, and vice versa. In this example, the image-level accuracy is 82.9%, which is enough for 100% accurate identity matching.