Performance for a benchmark of 33 videos of flies, zebrafish and mice.

a. Median accuracy was computed using all images of animals in the videos excluding animal crossings. The videos are ordered by decreasing accuracy of the original idtracker.ai results for ease of visualization. b. Median tracking times are shown at the scale of hours and, in the inset, at the scale of days. The videos are ordered by increasing tracking times of the original idtracker.ai results for ease of visualization. Supplementary Tables 1, 2 and 3 give more complete statistics (median, mean and 20-80 percentiles) for the original idtracker.ai (version 4 of the software), the optimized v4 (version 5) and the new idtracker.ai (version 6), respectively. The names of the videos in a. and b. start with a letter for the species (z for zebrafish, d for Drosophila, m for mice), followed by the number of animals in the video, and possibly an extra number to distinguish videos of the same species and group size.

Performance for the benchmark with full trajectories with animal crossings.

a. Median accuracy was computed using all images of animals in the videos including animal crossings. b. Median tracking times. Supplementary Tables 1, 2, 3 and 4 give more complete statistics (median, mean and 20-80 percentiles) for the original idtracker.ai (version 4 of the software), the optimized v4 (version 5), the new idtracker.ai (version 6) and TRex, respectively.

Robustness to blurring and light conditions.

First column: Unmodified video zebrafish_60_1. Second column: zebrafish_60_1 with a Gaussian blur of 1 pixel, a resolution reduction to 40% of the original and MJPG video compression. Version 5 of idtracker.ai fails on this video, tracking it with only 82% accuracy. Third column: Videos of 60 zebrafish with manipulated light conditions (same test as in the original idtracker.ai publication, Romero-Ferrero et al. (2019)). First row: Uniform light conditions across the arena (zebrafish_60_1). Second row: Similar setup but with lights off on the bottom and right sides of the arena.
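For reference, a degradation comparable to the one in the second column could be produced with a short script along the following lines. This is a sketch using OpenCV; the input file name and the reading of "blurring of 1 pixel" as a Gaussian with a 1-pixel standard deviation are assumptions, not the exact pipeline used to generate the benchmark video.

```python
# Sketch: blur, downscale to 40% and re-encode a video with the MJPG codec.
import cv2

cap = cv2.VideoCapture("zebrafish_60_1.avi")  # hypothetical file name
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * 0.4)
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * 0.4)
out = cv2.VideoWriter("zebrafish_60_1_degraded.avi",
                      cv2.VideoWriter_fourcc(*"MJPG"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Gaussian blur with a 1-pixel standard deviation (assumption), then downscale.
    blurred = cv2.GaussianBlur(frame, ksize=(0, 0), sigmaX=1.0)
    small = cv2.resize(blurred, (w, h), interpolation=cv2.INTER_AREA)
    out.write(small)

cap.release()
out.release()
```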

Memory usage across the different software packages.

The solid line is a logarithmic fit of the peak RAM usage as a function of the number of blobs in a video. The Python function psutil.virtual_memory().used was used to measure memory usage 5 times per second during tracking. The reported value is the maximum measured value minus the baseline measured before tracking starts. Disclaimer: both software packages include automatic optimizations that adjust to the machine's resources, so results may vary on systems with less available memory. These results were measured on computers with the specifications given in Methods.
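The measurement described above can be sketched as follows. The sampling thread illustrates the 5 Hz polling and is not the exact script used; run_tracking() is a hypothetical placeholder for launching the tracking job being measured.

```python
# Sketch: peak RAM above the pre-tracking baseline, sampled 5 times per second.
import threading
import time

import psutil


def run_tracking():
    """Placeholder for launching the tracking job being measured."""
    time.sleep(2)


def measure_peak_memory(job):
    samples = []
    stop = threading.Event()

    def monitor():
        while not stop.is_set():
            samples.append(psutil.virtual_memory().used)
            time.sleep(0.2)  # 5 samples per second

    baseline = psutil.virtual_memory().used
    thread = threading.Thread(target=monitor)
    thread.start()
    job()
    stop.set()
    thread.join()
    return max(samples) - baseline


print(f"RAM peak above baseline: {measure_peak_memory(run_tracking) / 1e9:.2f} GB")
```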

Tracking by identification using deep contrastive learning.

a. Schematic representation of a video with five fish. b. A ResNet18 network with 8 outputs represents each animal image as a point in an 8-dimensional space (shown here in 2D for visualization). Each pair of images corresponds to two points in this space, separated by a Euclidean distance. The ResNet18 network is trained to minimize this distance for positive pairs and maximize it for negative pairs. c. 2D t-SNE visualizations of the learned 8-dimensional representation space. Each dot represents an image of an animal from the video. As training progresses (shown after 0, 2,000, 4,000 and 15,000 training batches, each batch containing 400 positive and 400 negative pairs of images), clusters corresponding to individual animals become clearer. The t-SNE plot at 15,000 training batches is also shown color-coded by human-validated ground-truth identities. The pink rectangle at 2,000 training batches highlights clear clusters and the orange square highlights fuzzy clusters. d. The Silhouette score measures cluster coherence and increases during training, as illustrated for a video with 60 zebrafish. e. A Silhouette score of 0.91 corresponds to a human-validated error rate of less than 1% per image.
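As an illustration of the training objective in panel b, the sketch below pairs a torchvision ResNet18 (adapted to 8 outputs and single-channel input) with a generic margin-based contrastive loss. The margin value and the exact form of the loss are assumptions and may differ from the objective actually used (the distance targets explored in the Dneg over Dpos comparison in a later figure are not modeled here).

```python
# Sketch: ResNet18 embedding into 8 dimensions trained with a margin-based
# contrastive loss (positive pairs pulled together, negative pairs pushed apart).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

embedder = resnet18(num_classes=8)  # 8-dimensional embedding
# Adapt the first convolution to grayscale 40x40x1 crops (assumption).
embedder.conv1 = torch.nn.Conv2d(1, 64, 7, 2, 3, bias=False)


def contrastive_loss(emb_a, emb_b, is_positive, margin=10.0):
    """Minimize distance for positive pairs, push negatives beyond a margin."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos_term = is_positive * dist.pow(2)
    neg_term = (1 - is_positive) * F.relu(margin - dist).pow(2)
    return (pos_term + neg_term).mean()


# images_a, images_b: (batch, 1, 40, 40) tensors; labels: 1 for positive pairs, 0 otherwise
# loss = contrastive_loss(embedder(images_a), embedder(images_b), labels)
```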

Tracking with strong occlusions.

Accuracies when we mask a region of the video defined by an angle θ so that the tracking system has no access to the information behind the mask. Light and dark gray regions correspond to angles for which no global fragments exist in the video. Dark gray regions correspond to angles for which the video has a fragment connectivity lower than 0.5, with fragment connectivity defined as the average number of other fragments each fragment coexists with, divided by N − 1, where N is the total number of animals; see Figure 3—figure Supplement 1 for an analysis justifying this value of 0.5. The original idtracker.ai (v4) and its optimized version (v5) cannot work in the gray regions, whereas the new idtracker.ai is expected to deteriorate only in the dark gray region.

Fragment connectivity analysis.

Relation between fragment connectivity and the corresponding accuracy of the new idtracker.ai for every video and angle θ in Figure 3. Fragment connectivity is calculated as the average number of other fragments each fragment coexists with, divided by N − 1, where N is the total number of animals. Values below 0.5 are associated with low accuracy, and idtracker.ai alerts the user in such cases.
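A minimal sketch of this metric, assuming for illustration that each fragment can be summarized by its start and end frames (two fragments coexist if their frame intervals overlap):

```python
# Sketch: fragment connectivity = mean number of coexisting fragments / (N - 1).
def fragment_connectivity(fragments, n_animals):
    def coexist(a, b):
        # Two (start_frame, end_frame) intervals overlap.
        return a[0] <= b[1] and b[0] <= a[1]

    counts = [
        sum(coexist(f, g) for j, g in enumerate(fragments) if j != i)
        for i, f in enumerate(fragments)
    ]
    mean_coexisting = sum(counts) / len(counts)
    return mean_coexisting / (n_animals - 1)


# Example: values below 0.5 would trigger the warning mentioned above.
# fragments = [(0, 100), (50, 200), (150, 300)]
# print(fragment_connectivity(fragments, n_animals=3))
```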

Performance of original idtracker.ai (v4) in the benchmark.

Performance of optimized v4 (v5) in the benchmark.

Performance of new idtracker.ai (v6) in the benchmark.

Performance of TRex in the benchmark.

Protocol 2 failure rate.

For idtracker.ai (v4 and v5), the probability of not tracking the video with Protocol 2; for TRex, the probability that it fails without generating trajectories.

Model comparison.

Error in image identification as a function of training time for different randomly initialized deep learning models in 6 test videos. For each network we report the multiply-accumulate operations (MAC) in giga operations (G) (for a batch of 1600 images of size 40×40×1) and the number of parameters in millions (M). Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on the clusters. We then compute the Silhouette score and the ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.
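A hedged sketch of this evaluation step, assuming scikit-learn and a majority-vote mapping from clusters to identities (an illustrative simplification of "assigning identities based on the clusters"):

```python
# Sketch: cluster embeddings with k-means, compute the Silhouette score and
# the per-image ground-truth error via a majority-vote cluster-to-identity map.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def evaluate_embeddings(embeddings, true_ids, n_animals):
    """embeddings: (n_images, 8) array; true_ids: (n_images,) integer identities."""
    cluster_labels = KMeans(n_clusters=n_animals, n_init=10).fit_predict(embeddings)
    score = silhouette_score(embeddings, cluster_labels)

    correct = 0
    for k in range(n_animals):
        ids_in_cluster = np.asarray(true_ids)[cluster_labels == k]
        if ids_in_cluster.size:
            correct += np.bincount(ids_in_cluster).max()
    error = 1.0 - correct / len(true_ids)
    return score, error


# During training, this would be called every 100 batches on the embeddings of
# 20,000 randomly selected images; the curves report the error of the checkpoint
# with the best Silhouette score observed so far.
```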

ResNet models comparison.

Error in image identification as a function of training time for different randomly initialized deep learning models (except the pre-trained ResNet18) in 6 test videos. For each network we report the multiply-accumulate operations (MAC) in giga operations (G) (for a batch of 1600 images of size 40×40×1) and the number of parameters in millions (M). Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on the clusters. We then compute the Silhouette score and the ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.

Embedding dimensions comparison.

Error in image identification as a function of training time for different embedding dimensions in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.

Dneg over Dpos comparison.

Error in image identification as a function of training time for different ratios of Dneg/Dpos in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.

Batch size comparison.

Error in image identification as a function of training time for different batch sizes of pairs of images in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.

Exploration and exploitation comparison.

Error in image identification as a function of training time for different exploration/exploitation weights α in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette score observed up to that point.

Segmentation GUI.

Enables users to set the basic parameters required for running idtracker.ai.

Validator GUI.

Enables users to inspect tracking results, correct errors, and access additional tools.

Video Generator GUI.

Allows users to define parameters for general and individual video generation.

SocialNet output example showing learned attraction-repulsion and alignment areas for social interactions around the focal animal.

Confusion matrix between the two parts of drosophila_80 in v6 of idmatcher.ai.

It contains predictions from the network trained on the first part applied to images from the second part, and vice versa. In this example, the image-level accuracy is 82.9%, which is enough for 100% accurate identity matching.
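A minimal sketch of how such a cross-prediction confusion matrix can be turned into an identity matching, assuming SciPy's optimal assignment (whether idmatcher.ai uses exactly this step is an assumption). It illustrates why an image-level accuracy well below 100% (here 82.9%) can still yield a perfect matching: each identity contributes many images, so the correct pairing dominates its row.

```python
# Sketch: match identities between two video parts from a confusion matrix,
# where counts[i, j] counts images of identity i predicted as identity j.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_identities(counts: np.ndarray) -> np.ndarray:
    # Maximize the total number of agreeing predictions across the matrix.
    rows, cols = linear_sum_assignment(-counts)
    return cols  # cols[i]: identity in the second part matched to identity i in the first
```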