Video-tracking systems that attempt to follow individuals frame-by-frame can fail during occlusions, resulting in identity swaps that accumulate over time Branson et al. (2009); Plum (2024); Chen et al. (2023); Chiara and Kim (2023); Liu et al. (2023); Bernardes et al. (2021). idTracker Pérez-Escudero et al. (2014) introduced the paradigm of animal tracking by identification from the animal images. This approach, unfeasible for humans, avoids the accumulation of errors by identity swaps during occlusions. Its successor, idtracker.ai Romero-Ferrero et al. (2019), built on this paradigm by incorporating deep learning and achieved accuracies often exceeding 99.9% in videos of up to 100 animals.

While both idTracker and idtracker.ai perform well in high-quality video, they share a limitation that can be critical in videos of lower quality or with many occlusions. To understand this limitation, consider the schematics of a video in Figure 1a. The first step of both idTracker and idtracker.ai consists in detecting instances when animals touch or cross paths (Figure 1a, shown as boxes with dashed borders and containing images of overlapping fish in this example). The video is then divided into individual fragments, each consisting of the set of images of a single individual between two animal crossings (Figure 1a shows 14 of them as rectangles with a gray background). A global fragment for a video with N animals is a collection of N fragments that coexist in one or more consecutive frames in the video (Figure 1a, the 5 fragments with blue borders are a global fragment). The significance of a global fragment is that it provides a set of images and identity labels for all the animals in the video.

Tracking by identification using deep contrastive learning.

a Schematic representation of a video with five fish. It shows 7 portions of video with animals crossing or touching (dashed-border boxes), and 14 individual fragments, sequences of images of a single individual between two crossings (gray-background boxes). The blue-border fragments form a global fragment, as there are as many individual fragments as animals and all the individual fragments coexist in one or more frames. Some pairs of images of the same animal identity are highlighted with green borders (positive images) and some images of different identities are highlighted with red borders (negative images). b A ResNet18 network with 8 outputs generates a representation of each animal image as a point in an 8-dimensional space (here shown in 2D for visualization). Each pair of images corresponds to two points in this space, separated by a Euclidean distance. The ResNet18 network is trained to minimize this distance for positive pairs and maximize it for negative pairs. c 2D t-SNE visualizations of the learned 8-dimensional representation space. Each dot represents an image of an animal from the video. As training progresses, clusters corresponding to individual animals become clearer; plots are shown at 0, 2,000, 4,000 and 15,000 training batches. The t-SNE plot at 15,000 training batches is also shown color-coded by human-validated ground-truth identities. The pink rectangle at 2,000 training batches highlights clear clusters and the orange square fuzzy clusters. d The silhouette score measures cluster coherence and increases during training, as illustrated for a video with 60 zebrafish. e A silhouette score of 0.91 corresponds to a human-validated error rate of less than 1% per image.

Figure 1—figure supplement 1. Models comparison

Figure 1—figure supplement 2. Embedding dimensions comparison

Figure 1—figure supplement 3. Dneg over Dpos comparison

Figure 1—figure supplement 4. Batch size comparison

Figure 1—figure supplement 5. Exploration and exploitation comparison

The core idea of idTracker and the original idtracker.ai is to use global fragments for the classification of images of animals into identities. In idtracker.ai, this process starts by training a convolutional neural network (CNN) with the images and labels of the global fragment that contains the longest fragment for the animal that moves the least. Once trained, the network assigns identities to all animal images in the remaining global fragments. Only global fragments meeting strict quality criteria, such as ensuring all animals in a global fragment have unique identities, are retained for further training. This iterative process of training, assigning, and selecting continues until most of the video has images assigned to identities. A second algorithm then tracks animals during crossings given that animals are already identified outside crossings.

Figure 2a (blue line) shows the accuracies of the original idtracker.ai (version 4 of the software) for a benchmark of 33 videos of zebrafish, flies and mice. These accuracies were computed using all the images of animals in the videos excluding animal crossings. Figure 2—figure Supplement 1a shows the same results but for the complete trajectory with animal crossings. The names of the videos start with a letter for the species (z,f,m), followed by the number of animals in the video, and possibly an extra number to distinguish the video if there are several of the same species and animal group size. The videos in this figure are ordered by decreasing accuracy of the original idtracker.ai results for ease of visualization. The first 15 videos are videos of zebrafish, flies and mice with an accuracy of > 99.9%. The accuracy in the remaining videos gradually decreases to 92.67% in video m_4_2, and to 50.4% (outside the figure range) for video d_100_3.

Performance for a benchmark of 33 videos of flies, zebrafish and mice.

a. Median accuracy was computed using all images of animals in the videos excluding animal crossings. b. Median tracking times are shown for the scale of hours and, in the inset, for the scale of days. Supplementary Table 1, Supplementary Table 2, Supplementary Table 3 give more complete statistics (median, mean and 20-80 percentiles) for the original idtracker.ai (version 4 of the software), optimized v4 (version 5) and new idtracker.ai (version 6), respectively.

Figure 2b (blue line) shows the times that the original idtracker.ai takes to track each of the videos in the benchmark. The videos are ordered by increasing tracking times for ease of visualization. The original idtracker.ai has a faster protocol, “Protocol 2”, which works well for the simplest videos, with tracking times ranging from a few minutes to several hours. However, for complex videos, the software may switch from “Protocol 2” to “Protocol 3”, Protocol 3 being a two-step process. In the first step, all the global fragments are used to train the CNN filters. The second step proceeds like Protocol 2 but with the initial weights of the CNN filters obtained from the first step. While effective, this approach can be extremely slow, often requiring several days or weeks for a single video. Since it is stochastic whether a video is tracked using Protocol 2 or 3 (Figure 2—figure Supplement 2), a reasonable strategy to use the original idtracker.ai is to track each video multiple times until Protocol 2 successfully tracks the entire video or, when a patience threshold is reached (here set to 5 attempts), switch to Protocol 3. The tracking times shown in Figure 2b (blue line) correspond to this procedure, with the time being the accumulated time of the multiple attempts made by the software until final tracking. Some of the videos take a few minutes to track, others a few hours, and six videos take more than three days, one nearly two weeks. If we were to run idtracker.ai a single time instead of following this protocol, the tracking times for some of the videos would be longer.

We first optimized idtracker.ai by improving data loading protocols and redefining the main objects in the software (animal images and fragments) and their properties (see Methods for details). This version of the optimized original idtracker.ai (version 5 of the software) achieved better accuracies, Figure 2a (orange line), and Figure 2—figure Supplement 1a (orange line) for accuracies including animal crossings. The mean accuracy across the benchmark for this optimized version is 99.58% and 99.40% excluding and including animal crossings, respectively, while for the original idtracker.ai they are 97.52% and 97.38%.

Although this version also uses Protocols 2 and 3, we obtain much shorter tracking times, never longer than a day (Figure 2b, orange line). On average, tracking is 13.6 times faster than with the original idtracker.ai and, for the more difficult videos, 118.4 times faster. However, waiting a day to track some videos can make a tracking pipeline too slow. To further improve accuracy and tracking times, we retained these optimizations while also changing the main logic of idtracker.ai. In the original idtracker.ai, when global fragments are short, the quality of the initial CNN is low, leading to either reduced accuracy or the triggering of the very slow Protocol 3. The new system had to be able to track without global fragments.

We reformulate multi-animal tracking as a representation learning problem. In representation learning, we learn a transformation of the input data that makes it easier to perform downstream tasks Xing et al. (2002); Bengio et al. (2013); Ericsson et al. (2022), in our case clustering into animal identities without needing identity labels. This is possible due to the structure of the video, Figure 1a. Note that pairs of images of the same individual can be obtained from the same fragment (Figure 1a, green boxes). Also, pairs of images from different individuals can be obtained from different fragments that coexist in time for one or more frames (Figure 1a, red boxes). These pairs can be used as “positive” and “negative” pairs of images for contrastive learning, a self-supervised learning framework designed to learn a representation space in which “positive” examples are close together, and “negative” examples are far apart Schroff et al. (2015); Dong and Shen (2018); Kaya and Bilge (2019); Chen et al. (2020a); Chen et al. (2020b); Guo et al. (2020); Wang et al. (2020); Yang et al. (2020).
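As an illustration of how the fragment structure yields training pairs without identity labels, the sketch below represents each fragment by its images and frame range; the names (Fragment, sample_positive_pair, sample_negative_pair) are hypothetical and not part of the idtracker.ai API.

```python
import random
from dataclasses import dataclass

@dataclass
class Fragment:
    """Images of a single animal between two crossings (hypothetical container)."""
    images: list        # cropped identification images of one individual
    start_frame: int
    end_frame: int

def coexist(a: Fragment, b: Fragment) -> bool:
    """Two fragments coexist if their frame ranges overlap, so they hold different identities."""
    return a.start_frame <= b.end_frame and b.start_frame <= a.end_frame

def sample_positive_pair(fragments):
    """Two images from the same fragment: same (unknown) identity."""
    f = random.choice([f for f in fragments if len(f.images) >= 2])
    return tuple(random.sample(f.images, 2))

def sample_negative_pair(fragments):
    """One image from each of two coexisting fragments: different identities.
    Assumes at least one pair of coexisting fragments exists."""
    while True:
        a, b = random.sample(fragments, 2)
        if coexist(a, b):
            return random.choice(a.images), random.choice(b.images)
```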

We first evaluated neural networks suitable for contrastive learning with animal images. In addition to our previous CNN from idtracker.ai, we tested 23 networks from 8 different families of state-of-the-art convolutional neural network architectures, selected for their compatibility with consumer-grade GPUs and ability to handle small input images (20 × 20 to 100 × 100 pixels) typical in collective animal behavior videos. Among these architectures, ResNet18 He et al. (2016) was the fastest to obtain low errors (Figure 1—figure Supplement 1).

A ResNet18 with M outputs maps each input image to a point in an M-dimensional representation space (illustrated in Figure 1b as a point on a plane). Experiments showed that using M = 8 achieved faster convergence to low error (Figure 1—figure Supplement 2). ResNet18 is trained using a contrastive loss function (Chopra et al. (2005), see Methods for details). Each image in a positive or negative pair is input separately into the network, producing a point in the 8-dimensional representation space. For an image pair, we then obtain two points in an 8-dimensional space, separated by some (Euclidean) distance. The loss function is used to minimize (or maximize) this Euclidean distance for positive (or negative) pairs until the distance reaches Dpos (or Dneg, respectively). The effect of Dpos is to prevent the collapse to a single point of the positive images coming from the same fragment, allowing a small region of the 8-dimensional representation space for all the positive pairs of the same identity. The effect of Dneg is to prevent excessive scatter of the points representing images from negative pairs. We empirically determined that Dneg/Dpos = 10 results in faster convergence to low error (Figure 1—figure Supplement 3), and we use Dpos = 1 and Dneg = 10.
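A minimal sketch of such a margin-based contrastive loss in PyTorch; the squaring of the hinge terms and the reduction over the batch are assumptions, so this is illustrative rather than the exact idtracker.ai implementation.

```python
import torch

D_POS, D_NEG = 1.0, 10.0  # margins, with Dneg/Dpos = 10 as in the main text

def contrastive_loss(emb_a, emb_b, is_positive):
    """emb_a, emb_b: (batch, 8) embeddings of the two images of each pair.
    is_positive: (batch,) float tensor with 1 for positive pairs and 0 for negative pairs."""
    dist = torch.norm(emb_a - emb_b, dim=1)          # Euclidean distance per pair
    pos_term = torch.clamp(dist - D_POS, min=0.0)    # penalize positives farther than Dpos
    neg_term = torch.clamp(D_NEG - dist, min=0.0)    # penalize negatives closer than Dneg
    return (is_positive * pos_term**2 + (1.0 - is_positive) * neg_term**2).mean()
```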

As the model trains, the representation space becomes increasingly structured, with similar data points forming coherent clusters. Figure 1c visualizes this progression using 2D t-SNE van der Maaten and Hinton (2008) plots of the 8-dimensional representation space. After 2,000 training batches, initial clusters emerge, and by 15,000 batches, distinct clusters corresponding to individual animals are evident. Ground truth identities verified by humans confirm that each cluster corresponds to an animal identity (Figure 1c, colored clusters).

The method to select positive and negative pairs is critical for fast learning Awasthi et al. (2022); Khosla et al. (2021); Rösch et al. (2024). This is because not all image pairs contribute equally to training. Figure 1c shows that, at 2,000 training batches, some clusters are well-defined (e.g., those inside the orange square) while others remain fuzzy (e.g., those inside the pink rectangle). Images in well-defined clusters have negligible impact on the loss or weight updates, as positive pairs are already close and negative pairs are sufficiently separated. Sampling from these well-defined clusters, therefore, wastes time. In contrast, fuzzy clusters contain images that still contribute significantly to the loss and benefit from further training. To address this, we developed a sampling method that prioritizes pairs from underperforming clusters requiring additional learning, while maintaining baseline sampling for all clusters based on fragment size (Methods). This ensures consistent updates across the representation space and prevents forgetting in well-defined clusters.

To assign identities to animal images, we perform K-means clustering Sculley (2010) on the points representing all images of the video in the learned 8-dimensional representation space. Each image is then assigned to a cluster with a probability that increases the closer it is to the cluster center. To evaluate clustering quality, we compute the mean Silhouette index Rousseeuw (1987), which quantifies intra-cluster cohesion and inter-cluster separation. A maximum value of 1 indicates ideal clustering. During training, the mean Silhouette index increases (Figure 1d). We empirically determined that a value of 0.91 for this index corresponds to an identity assignment error below 1% for a single image (Figure 1e). As a result, we use 0.91 as the stopping criterion for training (Methods).

After the assignment of identities to images of animals, we run some steps that are common to the previous idtracker.ai. For example, we make a final assignment at the level of fragments: all images in a fragment must receive the same identity, which eliminates some errors in individual images. Also, an algorithm already present in idTracker assigns identities during animal crossings, taking into account that the identities are known before and after each crossing.

The new idtracker.ai has a higher accuracy than the original idtracker.ai and its optimized version, Figure 2a (magenta line). Its average accuracy in the benchmark is 99.92% and 99.78% without and with crossings, respectively, an important improvement over the original idtracker.ai (97.52% and 97.38%) and its optimized version (99.58% and 99.40%). It also gives much shorter tracking times than the original idtracker.ai and its optimized version, Figure 2b (magenta line). It is on average 44 times faster than the original idtracker.ai and, for the more difficult videos, up to 440 times faster.

As for the original idtracker.ai, the new idtracker.ai can work well with lower resolutions, blur and video compression, and with inhomogeneous light (Figure 2—figure Supplement 4). We also compared the new idtracker.ai to TRex Walter and Couzin (2021), which is based on Protocol 2 of idtracker.ai but with additional operations like eroding crossings to make global fragments longer.

TRex gives comparable accuracies to the original idtracker.ai in the benchmark, but by avoiding Protocol 3, it is on average 31 times faster than the original idtracker.ai and up to 315 times faster (Figure 2—figure Supplement 1b). However, the new idtracker.ai is both more accurate and faster than TRex (Figure 2—figure Supplement 1). The mean accuracy of TRex across the benchmark is 98.14% and 97.89% excluding and including animal crossings, respectively. This is noticeably below the values for the new idtracker.ai of 99.92% and 99.78%, respectively. Also, the new idtracker.ai is on average 3.9 times faster and up to 16.5 times faster than TRex. Additionally, the new idtracker.ai has a memory peak lower than TRex (Figure 2—figure Supplement 3).

The new idtracker.ai also works on videos that the original idtracker.ai cannot track at all because there are no global fragments. Global fragments are absent in videos with very extensive animal occlusions, for example because animals touch or cross more frequently, parts of the setup are covered, or the camera focuses on only a specific region of the setup. To study this systematically, we added a mask on the video with an angle θ (Figure 3). The tracking systems have no access to the information behind the mask. The light and dark gray regions in Figure 3 correspond to videos with no global fragments, and the original idtracker.ai and its optimized version declare tracking impossible. The new idtracker.ai, however, works well until approximately 1/4 of the setup is visible, and afterward it degrades. This also shows the limit of the new idtracker.ai. For the clustering process to be successful, we need enough coexisting individual fragments to have both positive and negative examples. Empirically, we find a deterioration with fewer than 0.25(N − 1) coexisting individual fragments, with N the number of animals in the video (Figure 3, dark gray region). The new idtracker.ai flags when this condition is not met.

Tracking with strong occlusions.

Accuracies when we mask a region of a video defined by an angle θ and the tracking system has no access to the information behind the mask. The light and dark gray regions correspond to the angles for which no global fragments exist in the video. The dark gray region corresponds to angles for which the video does not have enough coexisting individual fragments, specifically on average fewer than 0.25(N − 1) coexisting individual fragments, with N the number of animals in the video. The original idtracker.ai (v4) and its optimized version (v5) cannot work in the gray regions, and the new idtracker.ai is expected to deteriorate only in the dark gray region.

The final output of the new idtracker.ai consists of the xy coordinates for each identified animal and video frame. Additionally, it provides several quality metrics: an estimate of the probability of correct identity assignment for each animal and frame, the Silhouette score as a measure of clustering quality, and the average number of coexisting individual fragments per fragment divided by (N − 1), with N the number of animals in the video, which is expected to give good results when above 0.25. The software can also generate a video with the computed animal trajectories for visualization, and an individual video per animal so that pose estimators like the ones in Lauer et al. (2022); Pereira et al. (2022); Segalin et al. (2021); Tang et al. (2025); Biderman et al. (2024) can be run on it. For analysis of trajectories and spatial relationships, the user can run our Python package trajectorytools on the trajectories.
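As a simple usage sketch, assuming the xy output has been loaded into a NumPy array of shape (n_frames, N, 2); the on-disk format is not described here, so random data stands in for real trajectories. Per-animal speeds can then be computed directly:

```python
import numpy as np

# Stand-in for the tracker's xy output: positions of N animals across all frames, in pixels.
n_frames, n_animals = 1000, 5
xy = np.random.rand(n_frames, n_animals, 2) * 500

frames_per_second = 30.0                                # assumed frame rate
velocity = np.diff(xy, axis=0) * frames_per_second      # (n_frames - 1, N, 2), px/s
speed = np.linalg.norm(velocity, axis=2)                # per-frame speed of each animal
print("mean speed per animal (px/s):", speed.mean(axis=0))
```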

In summary, the new idtracker.ai takes a representation-learning approach to tracking that avoids the need for segments of the video in which all animals are visible. This makes the new idtracker.ai work on more videos, more accurately, much faster and with a lower memory peak.

Methods

Tested computer specifications

The software idtracker.ai depends on PyTorch and is thus compatible with any machine that can run PyTorch, including Windows, MacOS, and Linux systems. Although no specific hardware is required, a graphics card is highly recommended for hardware-accelerated machine-learning computations.

Version 6 of idtracker.ai was tested on computers running Ubuntu 24.04, Fedora 41, and Windows 11 with NVIDIA GPUs from the 1000 to the 4000 series and MacOS 15 with Metal chips. The benchmark presented in this study was performed on a desktop computer running Ubuntu 24.04 LTS 64-bit with an AMD Ryzen 9 5950X processor (16 cores / 32 threads at 3.4 GHz), 128 GB RAM and an NVIDIA GeForce RTX 4090.

Improvements to original idtracker.ai in version 5

Following the last publication of idtracker.ai Romero-Ferrero et al. (2019), the software underwent continuous maintenance, including feature additions, performance optimizations, and hyperparameter tuning (released via PyPI from March 2023 for v5.0.0 to June 2024 for v5.2.12). These updates improved the implementation and tracking pipeline but did not alter the core algorithm. Significant advancements were made in user experience, tool availability, processing speed, and memory efficiency. Below, we summarize the most notable changes.

Blob memory optimization

Blobs are defined as collections of connected pixels belonging to one or more animals. In v4, blobs stored pixel indices, causing memory usage to scale quadratically with blob size. In v5, blobs are represented by simplified contours using the Teh-Chin chain approximation Teh and Chin (1989), reducing memory usage by 93% in blob instances. This also accelerated blob-related computations (centroid, orientation, area, overlap, identification image creation, etc.).
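The Teh-Chin approximation is available in OpenCV; the sketch below shows how a blob mask can be reduced to a simplified contour from which blob properties are computed. It is an illustration, not the actual v5 code.

```python
import cv2
import numpy as np

# Binary mask of a single blob (255 = animal pixels); here a toy disk for illustration.
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.circle(mask, (100, 100), 40, 255, thickness=-1)

# CHAIN_APPROX_TC89_L1 applies the Teh-Chin chain approximation, keeping far fewer
# points than the full pixel list while preserving the blob's shape.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_TC89_L1)
contour = contours[0]

area = cv2.contourArea(contour)                 # blob area computed from the contour alone
moments = cv2.moments(contour)
centroid = (moments["m10"] / moments["m00"], moments["m01"] / moments["m00"])
print(f"contour points: {len(contour)}, area: {area:.0f}, centroid: {centroid}")
```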

Efficient image loading

Identification images are now efficiently loaded on demand from HDF5 files, eliminating the need to load all images into memory. This enables training with all images regardless of video length, with minimal memory usage.
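A minimal sketch of on-demand loading from an HDF5 file inside a PyTorch Dataset; the file path and dataset name are hypothetical and do not reflect the actual layout used by idtracker.ai.

```python
import h5py
import torch
from torch.utils.data import Dataset

class IdentificationImages(Dataset):
    """Reads individual images from disk only when requested, keeping memory usage low."""

    def __init__(self, h5_path: str, dataset_name: str = "images"):  # names are assumptions
        self.h5_path = h5_path
        self.dataset_name = dataset_name
        self._file = None  # opened lazily, which plays well with multiple dataloader workers

    def __len__(self):
        with h5py.File(self.h5_path, "r") as f:
            return len(f[self.dataset_name])

    def __getitem__(self, index):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        image = self._file[self.dataset_name][index]          # only this image is read from disk
        return torch.from_numpy(image).float().unsqueeze(0)   # (1, H, W) grayscale tensor
```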

Code optimization

The source code was revised to eliminate speed bottlenecks. The most impactful changes include:

  • Frame segmentation accelerated by 80% through optimized OpenCV usage.

  • Faster blob-to-blob overlap checks by first evaluating bounding boxes before deeper comparisons.

  • Persistent storage of blob overlap checks to avoid redundant computations when reloading data.

  • Efficient disk access for identification images by reading them in sorted batches, minimizing I/O overhead.

  • Reduced bounding box image sizes to the minimum necessary, lowering memory and processing demands.

  • Optimized and parallelized Torch data loaders for more efficient model training.

  • Caching of computationally expensive properties for blobs, fragments, and global fragments.

  • Sorted Fragment lists to speed up coexistence detection.

Changes to the identification protocol

In v4, identity assignments to high-confidence fragments were fixed and excluded from downstream correction, regardless of later evidence. In v5, this was relaxed for short fragments (fewer than 4 frames), allowing corrections due to their statistical unreliability and frequent image noise.

Improved graphical user interface and introduction of Exclusive ROIs

The graphical user interface was redesigned for improved usability and now includes the “Exclusive Regions of Interest” feature, which allows users to define spatially distinct regions in multi-arena experiments where animal identities are treated independently (see Figure 4 left image). It also incorporates a redesigned video generator for visualizing tracking results.

idtracker.ai's new graphical user interface.

New graphical user interface (GUI) for versions v5 and v6 of idtracker.ai. On the left, the segmentation GUI; on the right, the Validator tool.

Validation application

A standalone GUI for inspecting and correcting tracking results. It allows users to navigate video frames, review tracked positions and metadata, detect tracking errors, and apply corrections using integrated plugins (see Figure 4, right image).

Direct integration with idmatcher.ai

A utility for matching identities across videos, originally introduced in Romero-Ferrero et al. (2023). It allows users to propagate consistent identity labels across multiple recordings, facilitating longitudinal or multi-session experiments. It is now a native feature of both v5 and v6, fully integrated into the idtracker.ai ecosystem.

Protocol details for the new idtracker.ai

In this section, we give an overview of the tracking protocol. Please refer to Appendix 1 for details.

Architectures

The contrastive learning network (Figure 1b) is a ResNet18 He et al. (2016) with a single channel in the first convolutional layer for grayscale images and 8 neurons in the last layer. The network receives grayscale images because idtracker.ai always works with grayscale converted video frames.
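A sketch of how such a network can be instantiated with torchvision; the exact construction in idtracker.ai may differ.

```python
import torch
from torch import nn
from torchvision.models import resnet18

def build_embedding_network(embedding_dim: int = 8) -> nn.Module:
    model = resnet18(weights=None)  # trained from scratch, no ImageNet weights
    # Single input channel, since idtracker.ai works with grayscale images.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Replace the 1000-class head with an 8-dimensional embedding output.
    model.fc = nn.Linear(model.fc.in_features, embedding_dim)
    return model

net = build_embedding_network()
images = torch.rand(16, 1, 52, 52)   # a batch of small grayscale crops
embeddings = net(images)             # shape: (16, 8)
```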

Loss function

The contrastive loss L for a pair of images (I, J) and label l is defined as:
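A standard margin-based form consistent with the definitions below (the squaring of the hinge terms is an assumption) is

$$
\mathcal{L}(I, J, l) \;=\; l\,\max\bigl(0,\; D_{I,J} - D_{\mathrm{pos}}\bigr)^{2} \;+\; (1 - l)\,\max\bigl(0,\; D_{\mathrm{neg}} - D_{I,J}\bigr)^{2},
$$

with l = 1 for a positive pair and l = 0 for a negative pair.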

Here DI,J is the Euclidean distance between the embeddings of images I and J. Dpos is the maximum allowed distance between the two images of a positive pair, and Dneg the minimum allowed distance between the two images of a negative pair.

Training

ResNet18 is trained using the Adam optimizer with the hyperparameters described in Kingma and Ba (2017). The learning rate is set to 0.001, with training batches of 1,600 images (400 positive and 400 negative pairs). See Appendix 2 for details.

Pair selection

The selection of pairs was done by combining two sampling strategies:

  1. Sampling fragments according to their size so that fragments containing more images are sampled more often.

  2. Sampling fragments according to the loss function by increasing the sampling probability of pairs of fragments from which the corresponding images had positive loss, and decreasing the sampling probability of pairs of fragments from which the corresponding images had zero loss.

See Appendix 2 for more details on the pair sampling strategy.

Clustering and stopping criteria

For clustering, we use minibatch K-means clustering, which significantly reduces the computation time compared to a classical implementation Sculley (2010).

Training is stopped by computing K-means clustering on a subset of images (1,000 times the number of animals) and measuring the corresponding Silhouette score (SS) Rousseeuw (1987) every (number of animals × 5) batches. We stop training if there have been 30 consecutive SS evaluations without any improvement (patience of 30), or if there have been 2 consecutive SS evaluations without any improvement and the SS has already reached the target value of 0.91. See Appendix 2 for more details on the criteria used to stop the training of the network.
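A sketch of this clustering and stopping check using scikit-learn; the actual implementation and scheduling in idtracker.ai may differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

TARGET_SS, PATIENCE = 0.91, 30

def silhouette_of_embeddings(embeddings: np.ndarray, n_animals: int) -> float:
    """Cluster a sample of 8-dimensional embeddings into n_animals clusters and score the result."""
    kmeans = MiniBatchKMeans(n_clusters=n_animals, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    return silhouette_score(embeddings, labels)

def should_stop(ss_history: list) -> bool:
    """Stop on long stagnation, or on short stagnation once the target score is reached."""
    best = max(ss_history)
    evaluations_since_best = len(ss_history) - 1 - ss_history.index(best)
    return evaluations_since_best >= PATIENCE or (
        evaluations_since_best >= 2 and best >= TARGET_SS
    )
```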

Occlusion tests

For the occlusion tests, we took videos of freely behaving animals in a round arena (included in the benchmark) and occluded a sector of the circle between 0 and θ radians. For the tracking software, animals disappeared when entering this occluded section of the arena. The light gray area in Figure 3 corresponds to a degree of occlusion that prevents the existence of global fragments. The dark gray area in Figure 3 corresponds to a degree of occlusion where there are less than 0.25(N−1) coexisting individual fragments (N being the number of animals in the video). With these degrees of occlusion, too few animals overlap at any given time and identification is expected to deteriorate in this regime (Figure 3, dark gray region). idtracker.ai flags when this condition is not met.

Computation of tracking accuracy

Using the idtracker.ai Validator tool (see Methods), we manually generated ground-truth trajectories based on v5 outputs. This ground-truth consists of the positions and identities of all animals in each frame and their classification as either individual or crossing.

To detect tracking errors, we analyze the video frame by frame, verifying whether the predicted position of each animal deviates from the ground-truth by more than a threshold T. Errors are also recorded when the software loses the identity or fails to detect an animal in a given frame.

Tracking accuracy is then defined as one minus the proportion of errors in the trajectory. For accuracy with crossings, we consider all trajectory points, whereas for accuracy without crossings, we exclude points corresponding to crossing events in the ground-truth.

We present all results using a threshold T = 1BL with BL being a body length. We also verified that accuracy remains largely unaffected by the value of this threshold. For instance, reducing it to T = 0.5BL results in a very small change of the mean accuracy (without crossings) across the benchmark in the new idtracker.ai from 99.92% to 99.90%.
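A sketch of this computation, assuming predicted and ground-truth positions are arrays of shape (n_frames, N, 2) with NaN marking frames where an animal was lost or not detected (a hypothetical convention).

```python
import numpy as np

def tracking_accuracy(predicted, ground_truth, body_length, threshold_bl=1.0, crossing_mask=None):
    """predicted, ground_truth: (n_frames, N, 2) positions, NaN where an animal is missing.
    crossing_mask: optional boolean (n_frames, N) array marking ground-truth crossings,
    which are excluded for the accuracy without crossings."""
    distances = np.linalg.norm(predicted - ground_truth, axis=2)              # (n_frames, N)
    errors = (distances > threshold_bl * body_length) | np.isnan(distances)   # misses count as errors
    if crossing_mask is not None:
        errors = errors[~crossing_mask]
    return 1.0 - errors.mean()
```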

Benchmark of accuracy and tracking time

To evaluate the tracking time and accuracy of different versions of idtracker.ai and version 1.1.9 of TRex, we used a set of 33 videos with their corresponding human-validated ground-truth trajectories. Each video is 10 minutes long and features one of three species: mice, drosophila, or zebrafish, with the number of individuals ranging from 2 to 100 (see Methods).

Previous versions of idtracker.ai (v4 and v5) can resort to protocol 3 for tracking, a method that can take days to process more complex videos but is necessary when protocol 2 fails. TRex, which lacks an equivalent of protocol 3, can instead fail to track certain videos, leading to missing accuracy outputs (Figure 2—figure Supplement 2).

To estimate the accuracy and tracking time that a standard user might experience, we simulate a realistic user workflow. This simulation accounts for the possibility that the software may fail to track the video, prompting the user to try again with a slightly different parameter configuration, up to a certain number of attempts.

The user is given up to 5 attempts to successfully track a video. Attempts are sampled from a precomputed dataset of tracking runs. Accuracy is taken from the first successful run. The reported tracking time is the sum of the time taken by that successful run and all preceding failed attempts. In cases where all attempts fail, accuracy is determined by protocol 3 (in v4 and v5 of idtracker.ai), and tracking time includes the time required for protocol 3 plus the total time of all failed attempts. This sampling process is repeated 10,000 times per software and video to obtain statistically robust estimates of the tracking times and accuracies. Figure 2 and Figure 2—figure Supplement 1 report the median accuracies, without and with crossings, respectively, and tracking times. Supplementary Table 1, Supplementary Table 2, Supplementary Table 3, and Supplementary Table 4 present the median, mean, and the 20 and 80 percentiles in v4, v5, v6 and TRex respectively.
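A sketch of this resampling procedure; the run records and field names are hypothetical simplifications of the precomputed dataset described below.

```python
import random

def simulate_user(runs, protocol3_time=0.0, protocol3_accuracy=None, max_attempts=5):
    """runs: list of dicts {'success': bool, 'time': float, 'accuracy': float or None}
    for one video and one software, drawn from the precomputed dataset of tracking runs."""
    total_time = 0.0
    for _ in range(max_attempts):
        run = random.choice(runs)               # sample one attempt with replacement
        total_time += run["time"]
        if run["success"]:
            return run["accuracy"], total_time  # the first successful run decides accuracy
    # All attempts failed: fall back to protocol 3 (idtracker.ai v4/v5 only).
    return protocol3_accuracy, total_time + protocol3_time

# Repeated 10,000 times per software and video to estimate medians, e.g.:
# samples = [simulate_user(runs, p3_time, p3_accuracy) for _ in range(10_000)]
```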

Dataset of tracking runs

To build the dataset of tracking runs we used for the benchmark of accuracies and times, we define input parameters through each software’s graphical interface. Fixed parameters (e.g., number of animals, regions of interest) are held constant, while those with multiple valid values are treated as variable, with their ranges annotated. In idtracker.ai, the variable parameter is the intensity_threshold, whereas in TRex, the variable parameters are threshold and track_max_speed.

Tracking is repeated for each video and software until either 5 successful runs or 35 total runs are reached. For the original version of idtracker.ai, this is limited to 3 successful runs or 7 total runs due to significantly longer tracking times. In successful runs, both accuracy and tracking time are recorded. In failed runs, when idtracker.ai defaults to protocol 3 or TRex fails to output identities (see Figure 2—figure Supplement 2), only the time until failure is recorded. For previous idtracker.ai versions (v4 and v5), failure time corresponds to the time until the software switched to protocol 3.

Each tracking run is conducted by randomly sampling values for the variable parameters from the annotated ranges and executing the full tracking process. To ensure a fair comparison, TGrabs is included when running TRex, graphical interfaces are always disabled at runtime to maximize performance, and output_interpolate_positions is enabled in TRex.

Supplementary tables

Performance of original idtracker.ai (v4) in the benchmark.

Performance of optimized v4 (v5) in the benchmark.

Performance of new idtracker.ai (v6) in the benchmark.

Performance of TRex in the benchmark.

Acknowledgements

We thank Alfonso Perez-Escudero, Paco Romero-Ferrero, Francisco J. Hernandez Heras, and Madalena Valente for discussions. This work was supported by Fundaçao para a Ciência e Tecnologia PTDC/BIA-COM/5770/2020 (to G.G.dP.) and Champalimaud Foundation (to G.G.dP.).

Additional information

Data availability

All videos used in this study, their tracking parameters and human-validated groundtruth can be found in our data repository at https://idtracker.ai.

Author contributions

T.C. and G.G.dP. devised the project and main algorithm, T.C. performed tests of the algorithm as a standalone, J.T. developed version 5, implemented the new algorithm into the idtracker.ai architecture and made final tests with help from T.C., G.G.dP. supervised the project, T.C. wrote the Appendices with help from J.T. and G.G.dP., and G.G.dP. wrote the main text with help from J.T. and T.C.

Software availability

idtracker.ai is a free and open source project (license GPL v.3). Information about its installation and usage can be found on the official website https://idtracker.ai. The source code is available in gitlab.com/polavieja_lab/idtrackerai and the package is pip-installable from PyPI. All versions can be found in these platforms, specifically “original idtracker.ai (v4)” as v4.0.12, “optimized v4 (v5)” as v5.2.12 and “new idtracker.ai (v6)” as v6.0.0.

Appendix 1

Preliminary concepts

Image-based tracking relies on identifying individuals through their visual features. The process begins by distinguishing the pixels corresponding to animals from those of the background. Let b represent a blob that is distinct from the background. For each blob b segmented from a video, an identification image Ib is generated by first taking the minimal bounding box image around b and then converting all pixels in Ib that do not belong to b to black. The blob within Ib is then rotated so that its first principal component is aligned at a fixed angle with the x-axis and, finally, the image is cropped to a specified square size suitable for batch processing.

Each image Ib is classified as either an individual or a crossing of individuals. For more details on the background subtraction and individual-crossing classification process, please refer to Appendix D1-2 of the Supplementary Information of Romero-Ferrero et al. (2019).

A Fragment F is defined as a sequence of blobs that maintain a one-to-one spatial overlap, meaning they share pixels in each pair of consecutive frames over time. If two blobs merge into a single blob in the subsequent frame, or if a single blob splits into two in the next frame, each of these three blobs will terminate or initiate a new Fragment. Fragments are classified as either individual or crossing Fragments based on the classification of the blobs they contain. Blobs of different classifications are not permitted within the same Fragment. Since crossings are solved as a post-processing step after identification, from now on we will not take into consideration crossing Fragments, and we will refer to individual Fragments as Fragments.

A pair of Fragments is said to coexist if they both contain blobs from the same frames in the video. Moreover, with N being the number of individuals in a video, a Global Fragment is defined as a collection of N Fragments all sharing a common frame.

By construction, we can assume that all blobs in a Fragment correspond to the same identity; this is the Fragment's identity. It follows that coexisting Fragments have different identities and that a Global Fragment contains all identities, one per Fragment.
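For concreteness, a sketch of these two definitions in code, representing each Fragment by the set of frames it spans (an illustrative choice, not the internal representation).

```python
def coexist(frames_a: set, frames_b: set) -> bool:
    """Two Fragments coexist if they contain blobs from at least one common frame."""
    return not frames_a.isdisjoint(frames_b)

def find_global_fragments(fragment_frames: dict, n_animals: int) -> set:
    """fragment_frames maps a Fragment identifier to the set of frames it spans.
    A Global Fragment is a set of N Fragments that all share a common frame."""
    found = set()
    all_frames = set().union(*fragment_frames.values())
    for frame in all_frames:
        present = frozenset(fid for fid, frames in fragment_frames.items() if frame in frames)
        if len(present) == n_animals:
            found.add(present)   # deduplicate collections that share several common frames
    return found
```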

From now on, we will denote by Fi the fragment with arbitrary unique identifier i, and by Iik the identification image with arbitrary unique identifier k in fragment i.

General overview of Identification Protocols in the original idtracker.ai

In this section we give a brief, high-level overview of the algorithm idtracker.ai uses to assign identities to the different fragments. Please refer to Romero-Ferrero et al. (2019) for a more complete description of the algorithm.

Cascade of Training and Identification Protocols

The identification process begins with three sequential protocols that incrementally refine the identification network’s ability to label individuals. The protocols leverage segments of the video where individuals appear distinctly, called global fragments, to construct a labeled dataset for the training of the network.

Protocol 1: Basic Accumulation of Global Fragments

In Protocol 1, the algorithm searches for global fragments. The initial set of labeled images from these fragments forms the base dataset to train the identification network. This trained network is then used to label additional global fragments throughout the video. If Protocol 1 is not able to accumulate at least 99.95% of all images in the global fragments, the algorithm proceeds to Protocol 2.

Protocol 2: Iterative Expansion with High-Quality Fragments

Protocol 2 builds on the initial training by iteratively alternating between accumulating new global fragments and using them to further train the identification network. With each iteration, the network labels more fragments, adding only those that pass strict quality checks (explained in the section below). This process continues until either 99.95% of the images in the global fragments are labeled with high certainty, or no more high-quality fragments are available.

Protocol 3: Pretraining and Fine-Tuning for Complex Scenarios

Protocol 2 might fail for videos with high visual complexity (accumulating less than 90% of the images). In those cases, idtracker.ai proceeds to Protocol 3. Protocol 3 pretrains the convolutional layers of the identification network on a large sample of global fragments, using the same convolutional layers for each global fragment while changing only the last classification layer. Although this protocol is effective in tracking videos that cannot be tracked with Protocol 2, it is very slow and may take days for some videos.

Labeling and Accumulating Images in Global Fragments

The process of labeling and accumulating images from global fragments involves the following steps:

  1. Selection of Global Fragments: The algorithm identifies global fragments where all animals are visually distinct, ensuring unambiguous initial identity assignments.

  2. Labeling with the Trained Network: The identification network, trained on an initial set of global fragments, predicts identities across additional fragments belonging to the other global fragments. Each fragment is assigned an identity based on the network’s classification probabilities of its corresponding images, denoted P1(F, i).

  3. Quality Checks: Labeled fragments are subjected to a series of quality checks to ensure the reliability of their identity assignments. For each global fragment these checks include:

    • Certainty: Each fragment F must have a high certainty score, defined by the distinction between the highest and second-highest identity probabilities:

    • where P1(F, i) represents the probability of fragment F being assigned identity i. Here, a and b represent the identity predictions with the highest and second-highest P1 values for fragment F, with Sa and Sb being the vectors of softmax values of all the images in the fragment F assigned to the identities a and b respectively.

    • Consistency: The identity assignment for each fragment must remain consistent across frames, preventing arbitrary changes in identity due to minor variations in appearance. This is reflected in the value of P1.

    • Uniqueness: Within a single global fragment, each assigned identity must be unique, ensuring that no two animals share the same identity label within that fragment.

  4. Accumulation into the Training Set: Fragments that pass the quality checks are added to the training dataset, allowing the network to improve its accuracy iteratively. This accumulation process continues, increasing the network’s generalization ability across the video.

Residual Identification

After the cascade protocols, residual identification is applied to label any fragments that remain unlabeled or have low-certainty assignments. This step uses a probabilistic approach that accounts for temporal coexistence constraints, refining identity assignments. For each unlabeled fragment F, an adjusted probability P2(F, i) is computed for assigning identity i, considering neighboring fragments γ(F) that overlap in time:

where P1(F, i) represents the initial probability of F being identity i.

Afterwards a new measure of identification certainty is defined as

in which a and b again represent the identity predictions with the highest and second-highest P1 values for fragment F. Fragments are then assigned identities in descending order of certainty, with the highest-confidence fragments labeled first.

In this work, the primary advancement was the replacement of protocols in idtracker.ai with an identification method based on deep metric learning. Additionally, several smaller but significant technical improvements were implemented, enhancing the feature set and improving tracking time and memory usage.

Appendix 2

Contrastive protocol

Contrastive learning is a type of self-supervised learning that aims to learn useful data representations by contrasting positive and negative pairs of examples. The fundamental idea is to bring similar (positive) pairs closer in the representation space while pushing dissimilar (negative) pairs farther apart. This approach leverages the inherent structure of the data, allowing the model to learn without labeled examples.

The representation space, or embedding, in contrastive learning is a high-dimensional space in which each data point is mapped to a vector that captures essential features and patterns of the original data. The primary objective is to position similar data points in close proximity while ensuring that dissimilar data points are situated at a considerable distance from one another. Positive pairs are typically created by applying different transformations or augmentations to the same data point, such as cropping, rotating, or color jittering an image, preserving the inherent semantics of the original data point. These augmentations ensure that the model learns robust features invariant to such transformations. Conversely, negative pairs are composed of different data points expected to be dissimilar, such as two distinct images.

As the model undergoes training, the representation space becomes increasingly structured, with similar types of data points forming coherent clusters. These clusters encapsulate the inherent similarities within the data, even if the specific instances differ, such as different breeds of cats or different poses. By maximizing the agreement between positive pairs and minimizing the agreement between negative pairs, the model learns to distinguish subtle differences and similarities within the data. The contrastive loss minimizes the distance between positive pairs and maximizes the distance between negative pairs in the representation space. This contrastive objective ensures the learned representations capture essential features and discriminative patterns, facilitating downstream tasks such as classification, clustering, and retrieval, even without labeled data. Thus, the representation space serves as a learned map where the positions of data points reflect their semantic relationships, enabling the model to capture and utilize the underlying structure of the data for various tasks.

We apply the principles of contrastive learning to create an embedding of all the images in a video that reflects the fragmented structure of the video. Specifically, points in the embedding corresponding to images from coexisting fragments (different identities) are positioned further apart than points corresponding to images from the same fragment (same identity) (Figure 1a–c).

  1. Segmentation and Fragmentation: The video is segmented and the blobs grouped into fragments based on temporal or content-based criteria.

  2. Training ResNet18: ResNet18 is trained using positive pairs (images from the same fragment) and negative pairs (images from coexisting fragments). The network learns a representation space where the distance between positive pairs is minimized, while the distance between negative pairs is maximized.

  3. Clustering in the Representational Space: All images are passed through the network. K-means clustering is then applied to the embedded images, assigning them to different cluster labels.

  4. Cluster based labeling of Single Image: Each cluster is labeled as a distinct animal identity. Images are classified based on their assigned clusters, and a probability distribution for each identity prediction is computed based on the Euclidean distance to the center of each cluster. If global fragments are present, proceed to next step; otherwise, proceed to Step 7.

  5. Fragment Identification with Global Fragments: A thorough identification process is conducted to classify all images belonging to global fragments, correcting any errors from the initial classification. If more than 99.9% of all the images in global fragments are successfully accumulated (i.e., they pass the quality checks described in Appendix 1), go to Step 7; otherwise, go to the next step.

  6. Run Accumulation Protocol if Step 5 Fails: Run protocol 2 from idtracker.ai v5 but using correctly identified images as the ground truth, as a sort of synthetic first Global Fragment.

  7. Residual Identification: A thorough identification process is conducted to classify all images in the video, correcting any errors from the initial classification step.

Network architecture

Deep metric learning often requires larger networks than standard supervised classification. To identify the most suitable architecture, we evaluated several state-of-the-art image classification networks, including the model used in the original idtracker.ai.

There were specific constraints in selecting the optimal architecture. The image size is automatically set during each tracking session to fit the average blob size, but it is typically small, ranging from 20×20 to 100×100 pixels. This limited some architectures, such as AlexNet, which requires a fixed input size of 227×227, and DenseNet, which has a minimum input size of 29×29. Additionally, the large training batches commonly associated with deep metric learning necessitate a compact model that can be trained on a consumer-grade GPU. This constraint excluded other architectures, including EfficientNet and the larger ResNet models (ResNet101 and ResNet152).

As shown in Figure 1—figure Supplement 1, ResNet18 offered the best balance between training speed and tracking accuracy.

Embedding dimension

Another critical hyperparameter is the embedding dimension. Here, too, there is a trade-off between achieving a robust representation of subtle differences between animals— differences that may be minimal and even challenging to detect visually—and maintaining a compact network size and efficient training speed. This parameter was empirically determined to be 8 (Figure 1—figure Supplement 2).

Loss function

The contrastive loss function operates on pairs of data points, aiming to minimize the distance between positive pairs and maximize the distance for negative pairs. Mathematically for our case, the contrastive loss L for a pair of images (Iik, Ijl) is defined as:

where Dik,jl is the Euclidean distance between the embeddings of Iik and Ijl, Dneg is the minimum allowed distance in a negative pair of images (images coming from coexisting fragments), and Dpos is the maximum allowed distance in a positive pair of images (images from the same fragment). It is important to emphasize that the network processes one image at a time, obtaining a single independent point in the representational space for each image. The Euclidean distance between the embeddings for the corresponding pairs of images is computed only afterwards.

Dneg and Dpos serve as thresholds to regulate distances in the embedding space. Dneg prevents images from negative pairs from being pushed indefinitely far apart, while Dpos prevents the collapse of images from positive pairs into a single point. These thresholds are crucial in our problem, where we aim to embed individuals of the same identity in similar regions of the representational space. However, we face the restriction of not being able to compare all possible pairs of images and are instead limited to the fragment structure of the video to obtain the labels lik, jl.

This limitation means that the loss function does not directly pull together embeddings of the same identity, but rather images from the same fragment. Similarly, the loss does not push apart embeddings of different identities but images from coexisting fragments.

Dpos helps prevent the collapse of all images from the same fragment to a single point, allowing for the creation of a diffuse region in the representational space where fragments from the same identity are clustered together. Dneg prevents excessive scattering, ensuring better compression of the representational space and maintaining the integrity of clusters of images from the same identity.

In the contrastive protocol, we used Dpos = 1 and Dneg = 10. These values were determined empirically; they provide effective embeddings and were robust for tracking multiple videos across various species and different numbers of animals (Figure 1—figure Supplement 3).

Clustering and assignment

After training the network using contrastive loss, we pass all images through the network to generate their corresponding embeddings in the learned representational space. These embeddings are then grouped using K-means clustering. Each cluster ideally represents images of the same identity, as the training process has encouraged the network to place similar images close together and dissimilar ones farther apart in the embedding space. Next, we perform single-image classification, assigning each image a label based on the cluster to which its embedding belongs. Afterwards, the assignment method follows two cases. If global fragments are present, we follow the accumulation procedure described in Appendix 1. If, on the contrary, there are no global fragments, we move straight to residual identification, as explained in Appendix 1.

In order to identify fragments, we need not only an identity prediction for each image but also a probability distribution over all the identities. Let dj(Iik) be the distance of image Iik to the center of cluster j. We define the probability of image Iik belonging to identity j as a decreasing function of this distance.
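One expression consistent with this description and with the exponent of 7 discussed below (the exact form is assumed here for concreteness) is

$$
P\bigl(j \mid I_{ik}\bigr) \;=\; \frac{d_j(I_{ik})^{-7}}{\sum_{m=1}^{N} d_m(I_{ik})^{-7}} .
$$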

Equation (2) is used to emphasize differences in distances between points and clusters, creating a more peaked probability distribution that clearly distinguishes closer clusters from farther ones. The exponent of 7 sharpens the probability distribution and reduces the influence of distant clusters, making the assignment more discriminative. In higher-dimensional spaces like the 8-dimensional space in the paper, distances are more spread out, and using a high power helps to counteract this dispersion, resulting in more confident cluster assignments.

If we are in a scenario where global fragments exist, we use them for K-means initialization: we use the embeddings from one global fragment as initial cluster centers, choosing the global fragment whose smallest fragment is the largest. This approach provides a strong initialization for the K-means algorithm, aligning it with the different identities and mitigating issues related to random initialization. It also allows us to better compare clusters as training progresses.

Stopping criteria

Stopping network training using the loss function directly can be highly variable, as different video conditions, the number of individuals and the sampling method significantly influence this value. To circumvent this we use the silhouette score (SS) Rousseeuw (1987) of the clusters of the embedded images. Let d(I, J) be the Euclidean distance between the embeddings of images I and J. For each image I in cluster Ca, we compute the mean intra-cluster distance a(I), the mean nearest-cluster distance b(I), and from them the silhouette value s(I); the SS is the mean of s(I) over the sampled images.
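Following the standard definitions of Rousseeuw (1987):

$$
a(I) = \frac{1}{|C_a| - 1}\sum_{J \in C_a,\, J \neq I} d(I, J),
\qquad
b(I) = \min_{k \neq a}\; \frac{1}{|C_k|}\sum_{J \in C_k} d(I, J),
\qquad
s(I) = \frac{b(I) - a(I)}{\max\{a(I),\, b(I)\}} .
$$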

To determine when to stop training, every m batches we compute the SS by clustering the embeddings of a random sample of the images in the video, also generating a checkpoint of the model. m was set to the maximum between 100 and 5 times the number of animals in the video. We stop training if: 1) there have been 30 consecutive SS evaluations without any improvement (patience of 30), or 2) there have been 2 consecutive SS evaluations without any improvement and the SS has already achieved a value of 0.91. After stopping the training, the model with the highest SS is chosen. A threshold of 0.91 was validated empirically (Figure 1d and Figure 1e). The number of images used for the computation of the SS is 1,000 times the number of animals.

Pairs selection

Ideally, we would create two datasets of image pairs: one containing negative pairs and another containing positive pairs. However, the challenge with this approach is that very long videos or those containing a large number of animals can yield trillions of pairs of images, making the process computationally prohibitive. Therefore, we approach the problem with a hierarchical sampling method: first, we randomly select a pair of coexisting fragments, and then we sample an image from each fragment. For a positive pair, we sample two images from the same fragment.

Following this idea, we start by creating two datasets. The first consists of a list of all the fragments in the video, from which we will sample the positive pairs. The second dataset contains all possible pairs of coexisting fragments in the video. From these lists we exclude all fragments with fewer than 4 images to reduce the impact of noisy blobs.

Empirical testing has revealed that large and balanced batches, with an equal number of positive and negative pairs, are ideal for our setting of contrastive learning. More concretely, we choose batches consisting of 400 positive pairs of images and 400 negative pairs of images (1,600 images in total), as this was the smallest batch size that did not compromise training speed or accuracy (Figure 1—figure Supplement 4). Intuitively, large batch sizes allow for a good spread of pairs from a significant proportion of the video, thereby forcing the network to learn a global embedding of the video. Since positive pairs tend to diminish the size of the representational space while negative pairs tend to increase it, a good balance between the two forces the network to compress the representational space while respecting the negative relationships Chen et al. (2020a). This balance between positive and negative pairs is somewhat surprising, given that several works emphasize the importance of negative examples over positive ones Awasthi et al. (2022); Khosla et al. (2021). While we do not yet have an explanation for why this balance appears to perform better in our case, we note that it is not possible to compare all images from one class against those of another, as negative pairs of images can only be sampled from coexisting fragments. Additionally, positive pairs that compress the space can only be sampled from the same fragment and not the same identity. Since we cannot compare images freely and are constrained by the fragment structure of the video, we might need more positive pairs to ensure a higher degree of compression of the representational space, such that not only images from the same fragment are close together, but also images from the same identity.

The hierarchical sampling allows us to address the question of how to select pairs of fragments to optimize the training speed of the network. Since we sample pairs of fragments rather than pairs of images directly, we need to skew the probability of a pair of fragments being sampled to reflect the number of images they contain. More concretely, let $f_i$ be the number of images in fragment $F_i$. For negative pairs we define $f_{i,j} = f_i + f_j$ and set the probability of sampling the pair $(F_i, F_j)$ according to its size as

$$P_s(F_i, F_j) = \frac{f_{i,j}}{\sum_{(k,l)} f_{k,l}},$$

where the sum runs over all pairs of coexisting fragments.

For positive pairs, the probability of sampling a given fragment $F_i$ is

$$P_s(F_i) = \frac{f_i}{\sum_k f_k}.$$
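For concreteness, these size-biased probabilities are just normalized weights; the sketch below computes them with NumPy from the hypothetical `Fragment` objects of the earlier sketch. The resulting arrays can be passed directly to `numpy.random.choice` as its `p` argument when drawing fragment indices.

```python
import numpy as np

def size_probabilities(fragments, negative_fragment_pairs):
    """Size-biased sampling probabilities for fragments (positive pairs)
    and for coexisting fragment pairs (negative pairs)."""
    # Positive pairs: a fragment is sampled proportionally to its number of images.
    f = np.array([len(frag.images) for frag in fragments], dtype=float)
    p_positive = f / f.sum()

    # Negative pairs: a pair (F_i, F_j) is sampled proportionally to f_i + f_j.
    f_pair = np.array([len(a.images) + len(b.images)
                       for a, b in negative_fragment_pairs], dtype=float)
    p_negative = f_pair / f_pair.sum()
    return p_positive, p_negative
```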

By examining the evolution of the clusters during training (Figure 1c), it becomes clear that the learning process is not uniform; some identities become separated sooner than others. The second and third columns of the top row of Figure 1c illustrate this phenomenon. The images embedded in the pink rectangle of the representational space already satisfy the loss function, meaning that images forming negative pairwise relationships are already embedded further apart than $D_{\mathrm{neg}}$, and images forming positive pairwise relationships are already embedded closer than $D_{\mathrm{pos}}$. Consequently, the loss for these pairs is effectively zero, and passing them through the network does not alter the weights, merely prolonging training. In contrast, the separation of the clusters in the orange square is incomplete, indicating that image pairs in this region still contribute to the loss function. These pairs are more pertinent, as they contain information that the network has yet to learn. To bias the sampling of image pairs towards those that still contribute to the loss function, each pair of fragments is assigned a loss score. When a pair of images is sampled for training, if the loss for that pair is not zero, the loss score for the corresponding pair of fragments is incremented by one. This score then undergoes an exponential decay of 2% per batch. More specifically, let $ls(i, j)$ be the loss score of the pair of fragments $F_i$ and $F_j$, and $\mathcal{L}(I_i^l, I_j^k)$ the loss of the images $I_i^l$ and $I_j^k$. If the pair $I_i^l$, $I_j^k$ is sampled, the loss score is updated as

$$ls(i, j) \leftarrow ls(i, j) + 1 \quad \text{if } \mathcal{L}(I_i^l, I_j^k) > 0.$$

The exponential decay, $ls(i, j) \leftarrow 0.98\, ls(i, j)$, is applied independently to every pair of fragments after each batch, regardless of whether images were sampled from those fragments in that batch. The loss score is converted into a probability distribution over all pairs of fragments by

$$P_{ls}(F_i, F_j) = \frac{ls(i, j)}{\sum_{(k,l)} ls(k, l)}.$$

The final probability of sampling pairs of fragments is given by

$$P(F_i, F_j) = \alpha\, P_s(F_i, F_j) + (1 - \alpha)\, P_{ls}(F_i, F_j),$$

where $\alpha \in [0, 1]$ weighs size-based exploration against loss-based exploitation.

The balance between these two probabilities can be seen as an exploitation versus exploration paradigm. $P_s(F_i, F_j)$ enforces constant exploration, while $P_{ls}(F_i, F_j)$ exploits the current state of learning by dynamically updating the sampling probability. This ensures that pairs of fragments containing unlearned information are sampled more frequently, while maintaining a baseline of exploration based on fragment size. We tried several values of $\alpha$ and found that an intermediate value produced the largest reduction in the time required to train the network across a large collection of videos (Figure 1—figure supplement 5). Notably, with $\alpha = 0$ the contrastive protocol fails to solve the tracking problem. This failure occurs because the sampling becomes highly biased towards specific regions of the representational space, leading to only local solutions for the separation of negative pairs and the compression of positive pairs. In effect, the network experiences catastrophic forgetting by focusing excessively on small groups of fragments at a time, thereby compromising the embeddings of other images.
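The loss-score bookkeeping and its mixture with the size-based probabilities could be implemented along the following lines. The 2% decay and the convex combination with weight alpha follow our reading of the text above, so the exact form, as well as the class and method names, should be treated as an assumption rather than the package's implementation.

```python
import numpy as np

DECAY = 0.98   # 2% exponential decay of the loss scores per batch

class PairSampler:
    """Samples negative fragment pairs, mixing size-based exploration with
    loss-score-based exploitation (weight alpha on exploration)."""

    def __init__(self, pair_sizes, alpha):
        self.p_size = pair_sizes / pair_sizes.sum()    # exploration distribution
        self.loss_scores = np.zeros_like(self.p_size)  # one score per fragment pair
        self.alpha = alpha

    def probabilities(self):
        total = self.loss_scores.sum()
        # With no accumulated loss scores yet, fall back to pure exploration.
        p_loss = self.loss_scores / total if total > 0 else self.p_size
        return self.alpha * self.p_size + (1 - self.alpha) * p_loss

    def sample(self, n_pairs):
        return np.random.choice(len(self.p_size), size=n_pairs,
                                p=self.probabilities())

    def update(self, sampled_indices, pair_losses):
        # Increment the score of every sampled pair whose image-pair loss was non-zero.
        for idx, loss in zip(sampled_indices, pair_losses):
            if loss > 0:
                self.loss_scores[idx] += 1
        # Decay is applied to all pairs after every batch, sampled or not.
        self.loss_scores *= DECAY
```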

Models comparison.

Error in image identification as a function of training time for different deep learning models in 6 test videos. For each network, we report the number of multiply-accumulate operations (MAC) in giga-operations (G) and the number of parameters in millions (M). Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Embedding dimensions comparison.

Error in image identification as a function of training time for different embedding dimensions in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Dneg over Dpos comparison.

Error in image identification as a function of training time for different ratios of Dneg/Dpos in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Batch size comparison.

Error in image identification as a function of training time for different batch sizes of pairs of images in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Exploration and exploitation comparison.

Error in image identification as a function of training time for different exploration/exploitation weights α in 6 test videos. Every 100 training batches, we perform k-means clustering on a randomly selected set of 20,000 images, assigning identities based on clusters. We then compute the Silhouette Score and ground-truth error on the same set. The reported error corresponds to the model with the best Silhouette Score observed up to that point.

Performance for the benchmark with full trajectories with animal crossings.

a. Median accuracy was computed using all images of animals in the videos including animal crossings. b. Median tracking times. Supplementary Table 1, Supplementary Table 2, Supplementary Table 3 and Supplementary Table 4 give more complete statistics (median, mean and 20-80 percentiles) for the original idtracker.ai (version 4 of the software), optimized v4 (version 5), new idtracker.ai (version 6) and TRex, respectively.

Protocol 2 failure rate.

Probability, for the different tracking systems, of not tracking the video with Protocol 2 in idtracker.ai (v4 and v5), and, for TRex, the probability that it fails without generating trajectories.

Memory usage across the different software packages.

The solid line is a logarithmic fit to the peak memory usage as a function of the number of blobs in a video. Disclaimer: both software packages include automatic optimizations that adjust to the machine's resources, so results may vary on systems with less available memory. These results were measured on computers with the specifications given in Methods.

Robustness to blurring and light conditions.

First column: Unmodified video zebrafish_60_1. Second column: zebrafish_60_1 with Gaussian blurring of sigma=1 pixel, a resolution reduction to 40% of the original, and MJPG video compression. Third column: Videos of 60 zebrafish with manipulated light conditions (same test as in idtracker.ai Romero-Ferrero et al. (2019)). First row: Uniform light conditions across the arena (zebrafish_60_1). Second row: Similar setup but with the lights off on the bottom and right sides of the arena.