Figures and data

Tracking by identification using deep contrastive learning.
a Schematic representation of a video with five fish. It shows 7 portions of the video in which animals cross or touch (dashed-border boxes) and 14 individual fragments, i.e., sequences of images of a single individual between two crossings (gray-background boxes). The blue-border fragments form a global fragment: there are as many individual fragments as animals and all of them coexist in one or more frames. Some pairs of images of the same identity are highlighted with green borders (positive images) and some pairs of images of different identities with red borders (negative images). b A ResNet18 network with 8 outputs generates a representation of each animal image as a point in an 8-dimensional space (here shown in 2D for visualization). Each pair of images corresponds to two points in this space, separated by a Euclidean distance. The ResNet18 network is trained to minimize this distance for positive pairs and maximize it for negative pairs. c 2D t-SNE visualizations of the learned 8-dimensional representation space. Each dot represents an image of an animal from the video. As training progresses, clusters corresponding to individual animals become clearer; plots are shown after 0, 2,000, 4,000 and 15,000 training batches. The t-SNE plot at 15,000 training batches is also shown color-coded by human-validated ground-truth identities. In the plot at 2,000 training batches, the pink rectangle highlights clear clusters and the orange square highlights fuzzy clusters. d The silhouette score measures cluster coherence and increases during training, as illustrated for a video with 60 zebrafish. e A silhouette score of 0.91 corresponds to a human-validated error rate of less than 1% per image.
Figure 1—figure supplement 1. Model comparison
Figure 1—figure supplement 2. Embedding dimensions comparison
Figure 1—figure supplement 3. Dneg over Dpos comparison
Figure 1—figure supplement 4. Batch size comparison
Figure 1—figure supplement 5. Exploration and exploitation comparison
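
For reference, a minimal sketch of the training objective in Figure 1b, assuming a standard margin-based contrastive loss over image pairs embedded by a ResNet18 with 8 outputs; the exact loss and hyperparameters used by idtracker.ai (for instance the Dneg over Dpos target explored in Figure 1—figure supplement 3) may differ.

```python
# Sketch (PyTorch): ResNet18 trunk with an 8-dimensional embedding head,
# trained so that positive pairs (same animal) end up close and negative
# pairs (different animals) end up far apart in Euclidean distance.
# The margin value is a placeholder, not the value used in the paper.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

EMBEDDING_DIM = 8  # as in Figure 1b
model = resnet18(num_classes=EMBEDDING_DIM)  # final layer outputs 8 values

def pair_loss(emb_a, emb_b, is_positive, margin=10.0):
    """Contrastive loss on a batch of image pairs.

    emb_a, emb_b: (B, 8) embeddings of the two images in each pair.
    is_positive: (B,) bool tensor, True when both images show the same animal.
    """
    dist = F.pairwise_distance(emb_a, emb_b)   # Euclidean distance per pair
    pos_term = dist.pow(2)                     # pull positive pairs together
    neg_term = F.relu(margin - dist).pow(2)    # push negative pairs beyond the margin
    return torch.where(is_positive, pos_term, neg_term).mean()

# Usage sketch: images_a and images_b are (B, 3, H, W) crops of single animals
# sampled from individual fragments, labels is a (B,) bool tensor of positives.
# loss = pair_loss(model(images_a), model(images_b), labels)
# loss.backward()
```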

Performance for a benchmark of 33 videos of flies, zebrafish and mice.
a. Median accuracy computed using all images of animals in the videos, excluding animal crossings. b. Median tracking times, shown on the scale of hours and, in the inset, on the scale of days. Supplementary Table 1, Supplementary Table 2 and Supplementary Table 3 give more complete statistics (median, mean and 20th to 80th percentiles) for the original idtracker.ai (version 4 of the software), the optimized v4 (version 5) and the new idtracker.ai (version 6), respectively.

Tracking with strong occlusions.
Accuracies when we mask a region of the video defined by an angle θ, so that the tracking system has no access to the information behind the mask. Light and dark gray regions correspond to the angles for which no global fragments exist in the video. Dark gray regions correspond to angles for which the video does not have enough coexisting individual fragments, specifically fewer than 0.25(N − 1) coexisting individual fragments on average, with N the number of animals in the video. The original idtracker.ai (v4) and its optimized version (v5) cannot work in the gray regions, whereas the new idtracker.ai is expected to deteriorate only in the dark gray region.
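
A minimal sketch of the dark-gray criterion described above, assuming each individual fragment is summarized by its first and last frame and that "coexisting" means overlapping in time; the exact bookkeeping in idtracker.ai may differ.

```python
# Sketch: decide whether a (masked) video has enough coexisting individual
# fragments, i.e., on average at least 0.25 * (N - 1) other fragments
# overlapping each fragment in time. Fragments are (start_frame, end_frame).
def mean_coexisting_fragments(fragments):
    """Average, over fragments, of how many other fragments overlap in time."""
    counts = []
    for i, (s_i, e_i) in enumerate(fragments):
        overlapping = sum(
            1
            for j, (s_j, e_j) in enumerate(fragments)
            if j != i and s_i <= e_j and s_j <= e_i  # intervals share a frame
        )
        counts.append(overlapping)
    return sum(counts) / len(counts) if counts else 0.0

def outside_dark_gray_region(fragments, n_animals):
    """True when the coexistence criterion of the figure is satisfied."""
    return mean_coexisting_fragments(fragments) >= 0.25 * (n_animals - 1)
```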

idtracker.ai's new graphical user interface.
New graphical user interface (GUI) for versions v5 and v6 of idtracker.ai. On the left, the segmentation GUI; on the right, the Validator tool.

Performance of original idtracker.ai (v4) in the benchmark.

Performance of optimized v4 (v5) in the benchmark.

Performance of new idtracker.ai (v6) in the benchmark.

Performance of TRex in the benchmark.

Model comparison.
Error in image identification as a function of training time for different deep learning models in 6 test videos. For each network we report the multiply-accumulate operations (MAC) in giga operations (G) and the number of parameters in millions (M). Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.
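
The evaluation loop shared by this and the following figure supplements can be sketched as follows; the variable names and the cluster-to-identity matching step are illustrative assumptions, not the exact implementation.

```python
# Sketch: every 100 training batches, embed a random sample of 20,000 images,
# cluster the embeddings with k-means, compute the Silhouette Score, and
# compare the clusters with ground-truth identities to get an error rate.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_embeddings(embeddings, true_ids, n_animals):
    """embeddings: (20000, d) array of image embeddings.
    true_ids: (20000,) ground-truth identities as integers 0..n_animals-1."""
    labels = KMeans(n_clusters=n_animals, n_init=10).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)

    # Match clusters to identities with the assignment maximizing agreement,
    # then report the fraction of mismatched images as the error.
    confusion = np.zeros((n_animals, n_animals), dtype=int)
    for cluster, identity in zip(labels, true_ids):
        confusion[cluster, identity] += 1
    rows, cols = linear_sum_assignment(-confusion)
    error = 1.0 - confusion[rows, cols].sum() / len(true_ids)
    return score, error
```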

Embedding dimensions comparison.
Error in image identification as a function of training time for different embedding dimensions in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Dneg over Dpos comparison.
Error in image identification as a function of training time for different ratios of Dneg/Dpos in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Batch size comparison.
Error in image identification as a function of training time for different batch sizes of pairs of images in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Exploration and exploitation comparison.
Error in image identification as a function of training time for different exploration/exploitation weights α in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Performance for the benchmark with full trajectories, including animal crossings.
a. Median accuracy computed using all images of animals in the videos, including animal crossings. b. Median tracking times. Supplementary Table 1, Supplementary Table 2, Supplementary Table 3 and Supplementary Table 4 give more complete statistics (median, mean and 20th to 80th percentiles) for the original idtracker.ai (version 4 of the software), the optimized v4 (version 5), the new idtracker.ai (version 6) and TRex, respectively.

Protocol 2 failure rate.
For idtracker.ai (v4 and v5), the probability that a video is not tracked with Protocol 2; for TRex, the probability that it fails without generating trajectories.

Memory usage across the different software packages.
The solid line is a logarithmic fit to the peak memory usage as a function of the number of blobs in a video. Disclaimer: both software packages include automatic optimizations that adjust to the resources of the machine, so results may vary on systems with less available memory. These results were measured on computers with the specifications given in Methods.
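
A minimal sketch of the fit shown as the solid line, assuming peak memory is modeled as a + b*log(number of blobs); the actual parameterization used in the figure may differ.

```python
# Sketch: logarithmic fit of peak memory usage vs. number of blobs.
import numpy as np
from scipy.optimize import curve_fit

def log_model(n_blobs, a, b):
    return a + b * np.log(n_blobs)

# n_blobs and peak_memory are measured values per benchmark video (placeholders).
# params, _ = curve_fit(log_model, n_blobs, peak_memory)
# fitted_curve = log_model(np.sort(n_blobs), *params)
```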

Robustness to blurring and light conditions.
First column: Unmodified video zebrafish_60_1. Second column: zebrafish_60_1 with a Gaussian blur of sigma = 1 pixel, a resolution reduction to 40% of the original and MJPG video compression. Third column: Videos of 60 zebrafish with manipulated light conditions (same test as in idtracker.ai, Romero-Ferrero et al. (2019)). First row: Uniform light conditions across the arena (zebrafish_60_1). Second row: Similar setup but with the lights off on the bottom and right sides of the arena.
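
A minimal sketch (OpenCV) of the degradation applied in the second column, i.e., a Gaussian blur with sigma = 1 pixel, downscaling to 40% of the original resolution and MJPG re-encoding; the file names are placeholders.

```python
import cv2

cap = cv2.VideoCapture("zebrafish_60_1.avi")             # placeholder file name
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * 0.4)     # 40% of original resolution
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * 0.4)

fourcc = cv2.VideoWriter_fourcc(*"MJPG")                  # MJPG compression
out = cv2.VideoWriter("zebrafish_60_1_degraded.avi", fourcc, fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.GaussianBlur(frame, (0, 0), 1.0)  # kernel size derived from sigma = 1 pixel
    frame = cv2.resize(frame, (width, height))    # reduce resolution to 40%
    out.write(frame)                              # re-encode with MJPG

cap.release()
out.release()
```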