TRex, a fast multi-animal tracking system with markerless identification, and 2D estimation of posture and visual fields

Tristan Walter (corresponding author)
Iain D Couzin (corresponding author)
  1. Max Planck Institute of Animal Behavior, Germany
  2. Centre for the Advanced Study of Collective Behaviour, University of Konstanz, Germany
  3. Department of Biology, University of Konstanz, Germany
19 figures, 12 tables and 5 additional files


Videos are typically processed in four main stages, each illustrated here with a list of prominent features.

Some of them are accessible from both TRex and TGrabs, while others are software specific (as shown at the very top). (a) The video is either recorded directly with our software (TGrabs), or converted from a pre-recorded video file. Live-tracking enables users to perform closed-loop experiments, for which a virtual testing environment is provided. (b) Videos can be tracked and parameters adjusted with visual feedback. Various exploration and data presentation features are provided and customized data streams can be exported for use in external software. (c) After successful tracking, automatic visual identification can, optionally, be used to refine results. An artificial neural network is trained to recognize individuals, helping to automatically correct potential tracking mistakes. In the last stage, many graphical tools are available to users of TRex, a selection of which is listed in (d).

Figure 2 with 1 supplement
An overview of the interconnection between TRex, TGrabs and their data in- and output formats, with titles on the left corresponding to the stages in Figure 1.

Starting at the top of the figure, video is either streamed to TGrabs from a file or directly from a compatible camera. At this stage, preprocessed data are saved to a .pv file which can be read by TRex later on. Thanks to its integration with parts of the TRex code, TGrabs can also perform online tracking for limited numbers of individuals, and save results to a .results file (that can be opened by TRex) along with individual tracking data saved to numpy data-containers (.npz) or standard CSV files, which can be used for analysis in third-party applications. If required, videos recorded directly using TGrabs can also be streamed to a .mp4 video file which can be viewed in commonly available video players like VLC.

Figure 2—video 1
This video shows an overview of the typical chronology of operations when using our software.

Starting with the raw video, segmentation using TGrabs (Figure 2a) is the first and only step that is not optional. Tracking (Figure 2b) and posture estimation (both also available for live-tracking in TGrabs) are usually performed in that order, but can be partly parallelized (e.g. performing posture estimation in parallel for all individuals). Visual identification (Figure 1c) is only available in TRex due to relatively long processing times. All clips from this composite video have been recorded directly in TRex.

Figure 3
Activation differences for images of randomly selected individuals from four videos, next to a median image of the respective individual – which hides thin extremities, such as legs in (a) and (c).

The captions in (a-d) detail the species per group and number of samples per individual. Colors represent the relative activation differences, with hotter colors suggesting bigger magnitudes, which are computed by performing a forward-pass through the network up to the last convolutional layer (using keract). The outputs for each identity are averaged and stretched back to the original image size by cropping and scaling according to the network architecture. Differences shown here are calculated per cluster of pixels corresponding to each filter, comparing average activations for images from the individual’s class to activations for images from other classes.

Figure 3—source data 1

Code, as well as images/weights needed to produce this figure (see README).
Figure 4
An overview of TRex’ main interface, which is also part of the online documentation.

Interface elements are sorted into categories in the four corners of the screen (labelled here in black). The omni-box in the bottom left corner allows users to change parameters on-the-fly, helped by live auto-completion and documentation for all settings. Only some of the many available features are displayed here. Generally, interface elements can be toggled on or off using the bottom-left display options or moved out of the way with the cursor. Users can customize the tinting of objects (e.g. sourcing it from their speed) to generate interesting effects, which can be recorded for use in presentations. Additionally, all exportable metrics (such as border-distance, size, x/y, etc.) can also be shown as an animated graph for a number of selected objects. Keyboard shortcuts are available for select features such as loading, saving, and terminating the program. Remote access is supported and offers the same graphical user interface, for example in case the software is executed without an application window (for batch processing purposes).

Figure 5
The maximum memory used by TRex and idtracker.ai when tracking a subset of all videos (the same videos as in Table 3).

Results are plotted as a function of video length (min) multiplied by the number of individuals. We have to emphasize here that, for the videos in the upper length regions of multiple hours, we had to set idtracker.ai to store segmentation information on disk – as opposed to in RAM. This uses less memory, but is also slower. For the video with flies we tried out both and also settled for on-disk, since otherwise the system ran out of memory. Even then, the curve still accelerates much faster for idtracker.ai, ultimately leading to problems with most computer systems. To minimize the impact that hardware compatibility has on research, we implemented switches limiting memory usage while always trying to maximize performance given the available data. TRex can be used on modern laptops and normal consumer hardware at slightly lower speeds, but without any fatal issues.

Figure 5—source data 1

Each data-point from Figure 5 as plotted, indexed by video and software used.
Figure 6
Convergence behavior of the network training for three different normalization methods.

This shows the maximum achievable validation accuracy after 100 epochs for 100 individuals (Video 7), when sub-sampling the number of examples per individual. Tests were performed using a manually corrected training dataset to generate the images in three different ways, using the same, independent script (see Figure 8): using no normalization (blue), using normalization based on image moments (green, similar to idtracker.ai), and using posture information (red, as in TRex). Higher numbers of samples per individual result in higher maximum accuracy overall, but – unlike the other methods – posture-normalized runs already reach an accuracy above the 90% mark for ≥75 samples. This property can help significantly in situations with more crossings, when longer global segments are harder to find.

Figure 7 with 1 supplement
Visual field estimate of the individual in the center (zoomed in; the individuals are approximately 2–3 cm long; Video 15).

Right (blue) and left (orange) fields of view intersect in the binocular region (pink). Most individuals can be seen directly by the focal individual (1, green), which has a wide field of view of 260° per eye. Individual 3 on the top-left is not detected by the focal individual directly and not part of its first-order visual field. However, second-order intersections (visualized by gray lines here) are also saved and accessible through a separate layer in the exported data.

Figure 7—video 1
A clip from Video 15, showing TRex’ visual-field estimation for Individual 1.

Figure 8
Comparison of different normalization methods.

Images all stem from the same video and belong to the same identity. The video has previously been automatically corrected using the visual identification. Each object visible here consists of N images M_i, i ∈ [0, N), that have been accumulated into a single image using min_{i ∈ [0, N)} M_i, with min being the element-wise minimum across images. The columns represent the same samples from the same frames, but normalized in three different ways: in (a), images have not been normalized at all. Images in (b) have been normalized by aligning the objects along their main axis (calculated using image-moments), which only gives the axis within 0–180 degrees. In (c), all images have been aligned using posture information generated during the tracking process. As the images become more and more recognizable to us from left to right, the same applies to a network trying to tell identities apart: reducing noise in the data speeds up the learning process.
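The element-wise minimum accumulation described in the caption can be sketched as follows. This is a pure-Python stand-in for illustration only (a real pipeline would use numpy, e.g. np.min(stack, axis=0), on image arrays); `accumulate_min` is a hypothetical helper, not TRex code.

```python
# Sketch of the accumulation above: N grayscale images M_i of equal size
# are collapsed into a single image by taking the element-wise minimum.
def accumulate_min(images):
    """Element-wise minimum across a list of equally-sized 2D images."""
    assert images, "need at least one image"
    h, w = len(images[0]), len(images[0][0])
    return [[min(img[y][x] for img in images) for x in range(w)]
            for y in range(h)]

# Dark (low-valued) pixels of any single sample survive in the result,
# which is why overlaying many samples stays recognizable.
a = [[255, 10], [200, 255]]
b = [[30, 255], [255, 40]]
print(accumulate_min([a, b]))  # [[30, 10], [200, 40]]
```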

Appendix 1—figure 1
Using the interactive heatmap generator within TRex, the foraging trail formation of Constrictotermes cyphergaster (termites) can be visualized during analysis, along with other potentially interesting metrics (based on posture- as well as basic positional data).

This is generalizable to all output data fields available in TRex, for example also making it possible to visualize ‘time’ as a heatmap, showing where individuals were more likely to be located during the beginning or towards the end of the video. Video: H. Hugo.

Appendix 1—figure 2
The file opening dialog.

On the left is a list of compatible files in the current folder. The center column shows meta-information provided by the video file, including its frame-rate and resolution – or some of the settings used during conversion and the timestamp of conversion. The column on the right provides an easy interface for adjusting the most important parameters before starting up the software. Most parameters can be changed later on from within TRex as well.

Appendix 2—figure 1
Example of morphological operations on images: ‘Erosion’.

Blue pixels denote on-pixels with color values greater than zero, white pixels are ‘off-pixels’ with a value equal to zero. A mask is moved across the original image, with its center (dot) being the focal pixel. A focal pixel is retained if all the on-pixels within the structure element/mask are on top of on-pixels in the original image. Otherwise the focal pixel is set to 0. The type of operation performed is entirely determined by the structure element.
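The rule in the caption can be written out directly. The sketch below is an illustrative pure-Python implementation of binary erosion (not TRex code); `erode`, `cross`, and the anchor convention are hypothetical names chosen here, and real applications would use OpenCV's `cv2.erode` or `scipy.ndimage.binary_erosion` instead.

```python
# Minimal sketch of binary erosion: a focal pixel stays "on" only if every
# on-pixel of the structuring element lies on an on-pixel of the image.
def erode(image, selem, anchor):
    """image, selem: 2D lists of 0/1; anchor: (row, col) center of selem."""
    h, w = len(image), len(image[0])
    ay, ax = anchor
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            keep = True
            for sy, row in enumerate(selem):
                for sx, s in enumerate(row):
                    if not s:
                        continue  # off-pixels of the mask impose no constraint
                    iy, ix = y + sy - ay, x + sx - ax
                    # pixels outside the image count as off
                    if not (0 <= iy < h and 0 <= ix < w) or not image[iy][ix]:
                        keep = False
            out[y][x] = 1 if keep else 0
    return out

cross = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]  # structure element, center anchor
img = [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]
# Only the center pixel has the full cross neighborhood inside the image:
print(erode(img, cross, (1, 1)))  # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

As the caption notes, swapping the structure element (e.g. a full 3×3 square instead of a cross) is all it takes to change the operation's behavior.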

Appendix 3—figure 1
An example array of pixels, or image, to be processed by the connected components algorithm.

This figure should be read from top to bottom, just as the connected components algorithm would do. When this image is analyzed, the red and blue objects will temporarily stay separate within different ‘blobs’. When the green pixels are reached, both objects are combined into one identity.
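The merge step described above can be sketched with a single top-to-bottom labeling pass plus a union-find structure that fuses the two provisional blobs once a bridging ("green") pixel connects them. This is an illustrative textbook two-pass algorithm, not the actual TRex implementation, and `connected_components` is a name chosen here.

```python
# Sketch of two-pass connected-components labeling (4-connectivity).
def connected_components(image):
    """Label on-pixels of a 2D 0/1 image; returns a grid of labels/None."""
    h, w = len(image), len(image[0])
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    labels, nxt = [[None] * w for _ in range(h)], 0
    for y in range(h):          # read top to bottom, like the figure
        for x in range(w):
            if not image[y][x]:
                continue
            left = labels[y][x - 1] if x and image[y][x - 1] else None
            up = labels[y - 1][x] if y and image[y - 1][x] else None
            if left is None and up is None:
                parent[nxt] = nxt            # start a new provisional blob
                labels[y][x], nxt = nxt, nxt + 1
            else:
                labels[y][x] = left if left is not None else up
                if left is not None and up is not None:
                    parent[find(left)] = find(up)  # the bridging-pixel merge
    # second pass: resolve provisional labels to their representatives
    for y in range(h):
        for x in range(w):
            if labels[y][x] is not None:
                labels[y][x] = find(labels[y][x])
    return labels

img = [[1, 0, 1],
       [1, 0, 1],
       [1, 1, 1]]  # the bottom row bridges the two vertical strips
result = connected_components(img)
assert result[0][0] == result[0][2]  # both strips end up as one identity
```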

Appendix 4—figure 1
A bipartite graph (a) and its equivalent tree-representation (b).

It is bipartite since nodes can be sorted into two disjoint and independent sets ({0,1,2} and {3,4}), where no nodes have edges to other nodes within the same set. (a) is a straight-forward way of depicting an assignment problem, with the identities on the left side and objects being assigned to the identities on the right side. Edge weights are, in TRex and this example, probabilities for a given identity to be the object in question. This graph is also an example for an unbalanced assignment problem, since there are fewer objects (orange) available than individuals (blue). The optimal solution in this case, using weight-maximization, is to assign 0→3 and 2→4 and leave one individual unassigned. Invalid edges have been pruned from the tree in (b), enforcing the rule that objects can only appear once in each path. The optimal assignments have been highlighted in red.
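The unbalanced, weight-maximizing assignment in the figure can be sketched by brute force for tiny instances. The weight matrix below is made up for illustration (chosen so the optimum reproduces the figure's 0→3, 2→4); `best_assignment` is a hypothetical helper, not TRex code, and for real problem sizes a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` (or the tree- and Hungarian-based methods discussed in this appendix) would be used.

```python
# Brute-force sketch of an unbalanced assignment: three identities {0,1,2}
# compete for two objects {3,4}, maximizing the summed probabilities.
from itertools import permutations

def best_assignment(weights):
    """weights[i][j]: probability that identity i is object j.
    Returns (best total weight, mapping identity -> object column)."""
    n_ids, n_objs = len(weights), len(weights[0])
    best, best_map = -1.0, None
    # try every ordered choice of which identities receive the objects
    for ids in permutations(range(n_ids), n_objs):
        total = sum(weights[i][j] for j, i in enumerate(ids))
        if total > best:
            best = total
            best_map = {i: j for j, i in enumerate(ids)}
    return best, best_map

# Hypothetical edge weights; columns 0/1 stand for objects 3/4:
w = [[0.9, 0.1],   # identity 0
     [0.4, 0.3],   # identity 1
     [0.2, 0.8]]   # identity 2
total, mapping = best_assignment(w)
print(mapping)  # {0: 0, 2: 1}, i.e. 0->3 and 2->4, identity 1 unassigned
```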

Appendix 4—figure 2
Pooling together the same set of videos as in Table 5, we evaluate the efficiency of our crossings solver.

Consecutive frame segments are sequences of frames without gaps, for example due to crossings or visibility issues. We find these consecutive frame segments in the data exported by TRex, and compare the distribution of segment lengths to idtracker.ai's results (as a reference for an algorithm without a way to resolve crossings). In idtracker.ai's case, we segmented the non-interpolated tracks by missing frames, assuming tracks to be correct in between. The Y-axis shows the percentage of ∑_{k ∈ [1, V]} video_length_k × #individuals_k across the V videos that one column makes up – the overall coverage for TRex was 98%, while idtracker.ai was slightly worse with 95.17%. Overall, the data distribution suggests that, probably due to its attempts to resolve crossings, TRex produces longer consecutive segments.
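Extracting such segments from a per-individual list of tracked frame indices amounts to splitting at gaps. The helper below is a hypothetical sketch of that step (field names and layout of the actual TRex export differ).

```python
# Sketch: split a sorted list of frame indices into gapless (start, end) runs.
def consecutive_segments(frames):
    """Return (start, end) pairs of consecutive runs in sorted frame indices."""
    segments = []
    start = prev = None
    for f in frames:
        if start is None:
            start = prev = f
        elif f == prev + 1:
            prev = f
        else:                       # gap, e.g. a crossing or lost detection
            segments.append((start, prev))
            start = prev = f
    if start is not None:
        segments.append((start, prev))
    return segments

# An individual tracked in frames 0-4, lost during 5-6, tracked again 7-9:
print(consecutive_segments([0, 1, 2, 3, 4, 7, 8, 9]))  # [(0, 4), (7, 9)]
```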

Appendix 4—figure 2—source data 1

A list of all consecutive frame segments used in Appendix 4—figure 2.

In the table, they are indexed by their length, the software they were produced by, the video they originate from, as well as the bin they belong to.
Appendix 4—figure 2—source data 2

The raw data-points as plotted in Appendix 4—figure 2.
Appendix 4—figure 3
Mean values of processing-times and 5 %/95 % percentiles for video frames of all videos in the speed dataset (Table 1), comparing two different matching algorithms.

Parameters were kept identical, except for the matching mode, and posture was turned off to eliminate its effects on performance. Our tree-based algorithm is shown in green and the Hungarian method in red. Grey numbers above the graphs show the number of samples within each bin, per method. Differences between the algorithms increase very quickly, proportionally to the number of individuals. Especially the Hungarian method quickly becomes very computationally intensive, while our tree-based algorithm shows a much shallower curve. Some frames could not be solved in reasonable time by the tree-based algorithm alone, at which point it falls back to the Hungarian algorithm. Data-points belonging to these frames (N=79) have been excluded from the results for both algorithms. One main advantage of the Hungarian method is that, with its bounded worst-case complexity (see Appendix D, Matching an object to an object in the next frame), no such combinatorial explosions can happen. However, even given this advantage, the Hungarian method still leads to significantly lower processing speed overall (see also Appendix 4—table 3).

Appendix 4—figure 3—source data 1

Raw data for producing this figure and Appendix 4—table 3.

Each sample is represented as a row here, indexed by method (tree, approximate, hungarian), video and the bin (horizontal line in this figure).
Appendix 5—figure 1
The original image is displayed on the left.

Each square represents one pixel. The processed image on the right is overlaid with lines of different colors, each representing one connected component detected by our outline estimation algorithm. Dots in the centers of pixels are per-pixel identities returned by OpenCV's findContours function (for reference), coded in the same colors as ours. Contours calculated by OpenCV's algorithm cannot be used to estimate the one-pixel-wide ‘tail’ of the 9-like shape seen here, since it becomes a 1D line without sub-pixel accuracy. Our algorithm also detects diagonal lines of pixels, which would otherwise become an aliased line when scaled up.

Appendix 12—figure 1 with 2 supplements
Screenshots from videos V1 and V2 listed in Appendix 12—table 1.

Left (V1), video of four ‘black mice’ (17 min, 1272 × 909 px resolution) from Romero-Ferrero et al., 2019. Right (V2), four C57BL/6 mice (1:08 min, 1280 × 960 px resolution) by M. Groettrup, D. Mink.

Appendix 12—figure 1—video 1
A clip of the tracking results from V1, played back at normal speed.

Although it succumbs to noise in some frames (e.g. around 13 s), posture estimation remains remarkably robust to it throughout the video – sometimes even through periods where individuals overlap (e.g. at 27 s). Identity assignments are near perfect here, confirming our results in Appendix 12—table 1.

Appendix 12—figure 1—video 2
Tracking results from V2, played back at two times normal speed.

Since resolution per animal in V2 is lower than V1, and contrast is lower, posture estimation in V2 is also slightly worse than in V1. Importantly, however, identity assignment is very stable and accurate.

Appendix 12—figure 2
Median of all normalized images (N = 7161, 7040, 7153, 7076) for each of the four individuals from V1 in Appendix 12—table 1.

Posture information was used to normalize each image sample, which was stable enough — also for TRex — to tell where the head is, and even to make out the ears on each side (brighter spots).

Appendix 12—figure 3
Median of all normalized images (N = 1593, 1586, 1620, 1538) for each of the four individuals from V2 in Appendix 12—table 1.

Resolution per animal is lower than in V1, but ears are still clearly visible.


Table 1
A list of the videos used in this paper as part of the evaluation of TRex, along with the species of animals in the videos and their common names, as well as other video-specific properties.

Videos are given an incremental ID to make references more efficient in the following text, and are sorted by the number of individuals in the video. Individual quantities are given accurately, except for the videos with more than 100 individuals, where the exact number may be slightly more or less. These videos have been analyzed using TRex’ dynamic analysis mode that supports unknown quantities of animals. Videos 7 and 8, as well as 11–13, are available as part of the original idtracker paper (Pérez-Escudero et al., 2014). Many of the videos are part of yet unpublished data: guppy videos have been recorded by A. Albi, videos with sunbleak (Leucaspius delineatus) have been recorded by D. Bath. The termite video has been kindly provided by H. Hugo and the locust video by F. Oberhauser. Due to the size of some of these videos (>150 GB per video), they can only be made available upon specific request. Raw versions of these videos (some trimmed), as well as full preprocessed versions, are available as part of the dataset published alongside this paper (Walter et al., 2020).

ID | Species | Common name | # ind. | Fps (Hz) | Duration | Size (px²)
0 | Leucaspius delineatus | Sunbleak | 1024 | 40 | 8 min 20 s | 3866 × 4048
1 | Leucaspius delineatus | Sunbleak | 512 | 50 | 6 min 40 s | 3866 × 4140
2 | Leucaspius delineatus | Sunbleak | 512 | 60 | 5 min 59 s | 3866 × 4048
3 | Leucaspius delineatus | Sunbleak | 256 | 50 | 6 min 40 s | 3866 × 4140
4 | Leucaspius delineatus | Sunbleak | 256 | 60 | 5 min 59 s | 3866 × 4048
5 | Leucaspius delineatus | Sunbleak | 128 | 60 | 6 min | 3866 × 4048
6 | Leucaspius delineatus | Sunbleak | 128 | 60 | 5 min 59 s | 3866 × 4048
7 | Danio rerio | Zebrafish | 100 | 32 | 1 min | 3584 × 3500
8 | Drosophila melanogaster | Fruit-fly | 59 | 51 | 10 min | 2306 × 2306
9 | Schistocerca gregaria | Locust | 15 | 25 | 1 hr 0 min | 1880 × 1881
10 | Constrictotermes cyphergaster | Termite | 10 | 100 | 10 min 5 s | 1920 × 1080
11 | Danio rerio | Zebrafish | 10 | 32 | 10 min 10 s | 3712 × 3712
12 | Danio rerio | Zebrafish | 10 | 32 | 10 min 3 s | 3712 × 3712
13 | Danio rerio | Zebrafish | 10 | 32 | 10 min 3 s | 3712 × 3712
14 | Poecilia reticulata | Guppy | 8 | 30 | 3 hr 15 min 22 s | 3008 × 3008
15 | Poecilia reticulata | Guppy | 8 | 25 | 1 hr 12 min | 3008 × 3008
16 | Poecilia reticulata | Guppy | 8 | 35 | 3 hr 18 min 13 s | 3008 × 3008
17 | Poecilia reticulata | Guppy | 11 | 40 | 1 hr 9 min 32 s | 1312 × 1312
Table 2
Results of the human validation for a subset of videos.

Validation was performed by going through all problematic situations (e.g. individuals lost) and correcting mistakes manually, creating a fully corrected dataset for the given videos. This dataset may still have missing frames for some individuals, if they could not be detected in certain frames (as indicated by ‘of that interpolated’). This was usually a very low percentage of all frames, except for Video 9, where individuals tended to rest on top of each other – and were thus not tracked – for extended periods of time. This baseline dataset was compared to all other results obtained using the automatic visual identification by TRex (N=5) and idtracker.ai (N=3) to estimate correctness. We were not able to track Videos 9 and 10 with idtracker.ai, which is why the corresponding correctness values are not available.

Video | # ind. | Reviewed (%) | Of that interpolated (%) | % correct (TRex) | % correct (idtracker.ai)
7 | 100 | 100.0 | 0.23 | 99.07 ± 0.013 | 98.95 ± 0.146
8 | 59 | 100.0 | 0.15 | 99.68 ± 0.533 | 99.94 ± 0.0
9 | 15 | 22.2 | 8.44 | 95.12 ± 6.077 | N/A
10 | 10 | 100.0 | 1.21 | 99.7 ± 0.088 | N/A
13 | 10 | 100.0 | 0.27 | 99.98 ± 0.0 | 99.96 ± 0.0
12 | 10 | 100.0 | 0.59 | 99.94 ± 0.006 | 99.63 ± 0.0
11 | 10 | 100.0 | 0.5 | 99.89 ± 0.009 | 99.34 ± 0.002
Table 2—source data 1

A table of positions for each individual of each manually approved and corrected trial.
Table 3
Evaluating comparability of the automatic visual identification between idtracker.ai and TRex.

Columns show various video properties, as well as the associated uniqueness score (see Guiding the training process) and a similarity metric. Similarity (% similar individuals) is calculated based on comparing the positions for each identity exported by both tools, choosing the closest matches overall and counting the ones that are differently assigned per frame. An individual is classified as ‘wrong’ in that frame if the euclidean distance between the matched solutions from idtracker.ai and TRex exceeds 1% of the video width. The column ‘% similar individuals’ shows percentage values, where a value of 99% would indicate that, on average, 1% of the individuals are assigned differently. To demonstrate how uniqueness corresponds to the quality of results, the last column shows the average uniqueness achieved across trials. A file containing all X and Y positions for each trial and each software combined into one very large table is available from Walter et al., 2020, along with the data in different formats.

Video | # ind. | N (TRex) | % similar individuals | Final uniqueness
7 | 100 | 5 | 99.8346 ± 0.5265 | 0.9758 ± 0.0018
8 | 59 | 5 | 98.6885 ± 2.1145 | 0.9356 ± 0.0358
13 | 10 | 5 | 99.9902 ± 0.3737 | 0.9812 ± 0.0013
11 | 10 | 5 | 99.9212 ± 1.1208 | 0.9461 ± 0.0039
12 | 10 | 5 | 99.9546 ± 0.8573 | 0.9698 ± 0.0024
14 | 8 | 5 | 98.8356 ± 5.8136 | 0.9192 ± 0.0077
15 | 8 | 5 | 99.2246 ± 4.4486 | 0.9576 ± 0.0023
16 | 8 | 5 | 99.7704 ± 2.1994 | 0.9481 ± 0.0025
Table 3—source data 1

Assignments between identities from multiple solutions, as calculated by a bipartite-graph matching algorithm.

For each permutation of trials from TRex and idtracker.ai for the same video, the algorithm sought to match the trajectories of the same physical individuals in both trials with each other by finding the ones with the smallest mean euclidean distance per frame between them. Available from Walter et al., 2020.
Table 4
Both TRex and idtracker.ai analyzed the same set of videos, while continuously logging their memory consumption using an external tool.

Rows have been sorted by video_length × #individuals, which seems to be a good predictor for the memory consumption of both solutions. idtracker.ai has mixed mean values, which, at low individual densities, are similar to TRex’ results. Mean values can be misleading here, since more time spent in low-memory states skews results. The maximum, however, is more reliable since it marks the memory that is necessary to run the system. Here, idtracker.ai clocks in at significantly higher values (almost always more than double) than TRex.

Video | # ind. | Length | Max. consec. | TRex memory (GB) | idtracker.ai memory (GB)
12 | 10 | 10 min | 26.03 s | 4.88 ± 0.23, max 6.31 | 8.23 ± 0.99, max 28.85
13 | 10 | 10 min | 36.94 s | 4.27 ± 0.12, max 4.79 | 7.83 ± 1.05, max 29.43
11 | 10 | 10 min | 28.75 s | 4.37 ± 0.32, max 5.49 | 6.53 ± 4.29, max 29.32
7 | 100 | 1 min | 5.97 s | 9.4 ± 0.47, max 13.45 | 15.27 ± 1.05, max 24.39
15 | 8 | 72 min | 79.4 s | 5.6 ± 0.22, max 8.41 | 35.2 ± 4.51, max 91.26
10 | 10 | 10 min | 13.91 s | 6.94 ± 0.27, max 10.71 | N/A
9 | 15 | 60 min | 7.64 s | 13.81 ± 0.53, max 16.99 | N/A
8 | 59 | 10 min | 102.35 s | 12.4 ± 0.56, max 17.41 | 35.3 ± 0.92, max 50.26
14 | 8 | 195 min | 145.77 s | 12.44 ± 0.8, max 21.99 | 35.08 ± 4.08, max 98.04
16 | 8 | 198 min | 322.57 s | 16.15 ± 1.6, max 28.62 | 49.24 ± 8.21, max 115.37
Table 4—source data 1

Data from log files for all trials as a single table, where each row is one sample.

The total memory of each sample is calculated as SWAP+PRIVATE+SHARED. Each row indicates at which exact time, by which software, and as part of which trial it was taken.
Table 5
Evaluating time-cost for automatic identity correction – comparing TRex to results from idtracker.ai.

Timings consist of preprocessing time in TGrabs plus network training in TRex, which are shown separately as well as combined (ours (min), N=5). The time it takes to analyze videos strongly depends on the number of individuals and on how many usable samples per individual the initial segment provides. The length of the video factors in as well, as does the stochasticity of the gradient descent (training). idtracker.ai timings (N=3) contain the whole tracking and training process from start to finish, using its terminal_mode (v3). Parameters have been manually adjusted per video and setting, to the best of our abilities, spending at most one hour per configuration. For videos 16 and 14, we had to set idtracker.ai to store segmentation information on disk (as opposed to in RAM) to prevent the program from being terminated for running out of memory.

Video | # ind. | Length | Sample | TGrabs (min) | TRex (min) | Ours (min) | idtracker.ai (min)
7 | 100 | 1 min | 1.61 s | 2.03 ± 0.02 | 74.62 ± 6.75 | 76.65 | 392.22 ± 119.43
8 | 59 | 10 min | 19.46 s | 9.28 ± 0.08 | 96.7 ± 4.45 | 105.98 | 495.82 ± 115.92
9 | 15 | 60 min | 33.81 s | 13.17 ± 0.12 | 101.5 ± 1.85 | 114.67 | N/A
11 | 10 | 10 min | 12.31 s | 8.8 ± 0.12 | 21.42 ± 2.45 | 30.22 | 127.43 ± 57.02
12 | 10 | 10 min | 10.0 s | 8.65 ± 0.07 | 23.37 ± 3.83 | 32.02 | 82.28 ± 3.83
13 | 10 | 10 min | 36.91 s | 8.65 ± 0.04 | 12.47 ± 1.27 | 21.12 | 79.42 ± 4.52
10 | 10 | 10 min | 16.22 s | 4.43 ± 0.05 | 35.05 ± 1.45 | 39.48 | N/A
14 | 8 | 195 min | 67.97 s | 109.97 ± 0.05 | 70.48 ± 3.67 | 180.45 | 707.0 ± 27.55
15 | 8 | 72 min | 79.36 s | 32.1 ± 0.42 | 30.77 ± 6.28 | 62.87 | 291.42 ± 16.83
16 | 8 | 198 min | 134.07 s | 133.1 ± 2.28 | 68.85 ± 13.12 | 201.95 | 1493.83 ± 27.75
Table 5—source data 1

Preprocessed log files (see also in Walter et al., 2020) in a table format.

The total processing time (s) of each trial is indexed by video and software used – TGrabs for conversion, and TRex and idtracker.ai for visual identification. This data is also used in Appendix 4—table 4.
Appendix 4—table 1
Quantiles for frame timings for videos of the speed dataset (posture disabled).

Videos 15, 16, and 14 each contain a short sequence of taking the fish out, causing a lot of big objects and noise in the frame. This leads to relatively high spikes in these segments of the video, resulting in high peak processing timings here. Generally, processing time is influenced by many factors involving not only TRex, but also the operating system as well as other programs. While we did try to control for these, there is no way to be certain. However, having sporadic spikes in the timings per frame does not significantly influence overall processing time, since they can be compensated for by later frames. We can see that videos of all quantities ≤256 individuals can be processed faster than they could be recorded. Videos that cannot be processed faster than real-time are underlaid in gray.

Columns – video characteristics: Video, # ind., Ms/frame; Ms/frame (processing): 5%, Mean, 95%, Max; processing time: > real-time, % video length.
Appendix 4—table 2
A quality assessment of assignment decisions made by the general purpose tracking system without the aid of visual recognition – comparing results of two accurate tracking algorithms with the assignments made by an approximate method.

Here, decisions are reassignments of an individual after it has been lost, or after the tracker was too ‘unsure’ about an assignment. Decisions can be either correct or wrong, which is determined by comparing to reference data generated using automatic visual recognition: every segment of frames between decisions is associated with a corresponding ‘baseline-truth’ identity from the reference data. If this association changes after a decision, then that decision is counted as wrong. Analysing a decision may fail if no good match can be found in the reference data (which is not interpolated). Failed decisions are ignored. Comparative values for the Hungarian algorithm (Kuhn, 1955) are always exactly the same as for our tree-based algorithm, and are therefore not listed separately. Left-aligned total, excluded and wrong counts in each column are results achieved by an accurate algorithm; numbers to their right are the corresponding results using an approximate method. Raw data of trial runs using the Hungarian and tree-based matching algorithms, as well as baseline data from manually or automatically corrected trials used in this table, is available for download from Walter et al., 2020.

Video | # ind. | Length | Total (acc. / appr.) | Excluded (acc. / appr.) | Wrong (acc. / appr.)
7 | 100 | 1 min | 717 / 755 | 22 / 22 | 45 (6.47%) / 65 (8.87%)
8 | 59 | 10 min | 279 / 312 | 146 / 100 | 55 (41.35%) / 32 (16.09%)
9 | 15 | 1 h 0 min | 838 / 972 | 70 / 111 | 100 (13.02%) / 240 (27.87%)
13 | 10 | 10 min 3 s | 331 / 337 | 22 / 22 | 36 (11.65%) / 54 (17.14%)
12 | 10 | 10 min 3 s | 382 / 404 | 42 / 43 | 83 (24.41%) / 130 (36.01%)
11 | 10 | 10 min 10 s | 1067 / 1085 | 50 / 52 | 73 (7.18%) / 92 (8.91%)
14 | 8 | 3 h 15 min 22 s | 7424 / 7644 | 1428 / 1481 | 1174 (19.58%) / 1481 (24.03%)
15 | 8 | 1 h 12 min | 3538 / 3714 | 427 / 517 | 651 (20.93%) / 962 (30.09%)
16 | 8 | 3 h 18 min 13 s | 2376 / 3305 | 136 / 206 | 594 (26.52%) / 1318 (42.53%)
sum | | | 16952 / 18528 | 2343 / 2554 | 2811 (19.24%) / 4374 (27.38%)
Appendix 4—table 3
Comparing computation speeds of the tree-based tracking algorithm with the widely established Hungarian algorithm (Kuhn, 1955), as well as an approximate version optimized for large quantities of individuals.

Posture estimation has been disabled, focusing purely on the assignment problem in our timing measurements. The tree-based algorithm is programmed to fall back on the Hungarian method whenever the current problem ‘explodes’ computationally – these frames were excluded. Listed are relevant video metrics on the left and mean computation speeds on the right side for three different algorithms: (1) the tree-based and (2) the approximate algorithm presented in this paper, and (3) the Hungarian algorithm. Speeds listed here are percentages of real-time (the videos’ fps), demonstrating usability in closed-loop applications and overall performance. Results show that increasing the number of individuals both increases the time-cost and produces much larger relative standard deviation values. (1) is almost always faster than (3), while becoming slower than (2) with increasing individual numbers. In our implementation, all algorithms produce faster than real-time speeds with 256 or fewer individuals (see also Appendix 4—table 1), with (1) and (2) even getting close for 512 individuals.

Video | # ind. | Fps (Hz) | Size (px²) | Tree (% real-time) | Approximate (% real-time) | Hungarian (% real-time)
0 | 1024 | 40 | 3866 × 4048 | 35.49 ± 65.94 | 38.69 ± 65.39 | 12.05 ± 18.72
1 | 512 | 50 | 3866 × 4140 | 51.18 ± 180.08 | 75.02 ± 193.02 | 8.92 ± 29.12
2 | 512 | 60 | 3866 × 4048 | 59.66 ± 121.4 | 65.58 ± 175.51 | 23.18 ± 26.83
3 | 256 | 50 | 3866 × 4140 | 174.02 ± 793.12 | 190.62 ± 743.54 | 127.86 ± 9841.21
4 | 256 | 60 | 3866 × 4048 | 140.73 ± 988.15 | 155.9 ± 760.05 | 108.48 ± 2501.06
5 | 128 | 60 | 3866 × 4048 | 318.6 ± 347.8 | 353.58 ± 291.63 | 312.05 ± 337.71
6 | 128 | 60 | 3866 × 4048 | 286.13 ± 330.08 | 314.91 ± 303.53 | 232.33 ± 395.21
7 | 100 | 32 | 3584 × 3500 | 572.46 ± 98.21 | 611.5 ± 96.46 | 637.87 ± 97.03
8 | 59 | 51 | 2306 × 2306 | 744.98 ± 364.43 | 839.45 ± 257.56 | 864.01 ± 223.47
9 | 15 | 25 | 1880 × 1881 | 4626 ± 424.8 | 4585.08 ± 378.64 | 4508.08 ± 404.56
10 | 10 | 100 | 1920 × 1080 | 2370.35 ± 303.94 | 2408.27 ± 297.83 | 2362.42 ± 296.99
11 | 10 | 32 | 3712 × 3712 | 6489.12 ± 322.59 | 6571.28 ± 306.34 | 6472.0 ± 322.03
12 | 10 | 32 | 3712 × 3712 | 6011.59 ± 318.12 | 6106.12 ± 305.96 | 5549.25 ± 318.21
13 | 10 | 32 | 3712 × 3712 | 6717.12 ± 325.37 | 6980.12 ± 316.59 | 6726.46 ± 316.87
14 | 8 | 30 | 3008 × 3008 | 8752.2 ± 2141.03 | 8814.63 ± 2140.4 | 8630.73 ± 2177.16
15 | 8 | 25 | 3008 × 3008 | 9786.68 ± 1438.08 | 10118.04 ± 1380.2 | 9593.44 ± 1439.28
16 | 8 | 35 | 3008 × 3008 | 6861.42 ± 1424.91 | 10268.82 ± 1339.8 | 9680.68 ± 1387.14
17 | 11 | 40 | 1312 × 1312 | 15323.05 ± 637.17 | 15250.39 ± 639.2 | 15680.93 ± 640.99
Appendix 4—table 4
Comparing the time-cost for tracking and converting videos in two steps with doing both of those tasks at the same time.

The columns prepare and tracking show timings for the tasks when executed separately, while live shows the time when both of them are performed at the same time using the live-tracking feature of TGrabs. The column win shows the time ‘won’ by combining tracking and preprocessing as the percentage (prepare+tracking-live)/(prepare+tracking). The process is more complicated than simply adding up timings of the tasks. Memory and the interplay of work-loads have a huge effect here. Posture is enabled in all variants.

| Video | # ind. | Length (min) | Fps (Hz) | Prepare (min) | Tracking (min) | Live (min) | Win (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1024 | 8.33 | 40 | 10.96 ± 0.3 | 41.11 ± 0.34 | 65.72 ± 1.35 | -26.23 |
| 1 | 512 | 6.67 | 50 | 11.09 ± 0.24 | 24.43 ± 0.2 | 33.67 ± 0.58 | 5.24 |
| 2 | 512 | 5.98 | 60 | 11.72 ± 0.2 | 20.86 ± 0.47 | 31.1 ± 0.62 | 4.55 |
| 3 | 256 | 6.67 | 50 | 11.09 ± 0.21 | 7.99 ± 0.17 | 12.32 ± 0.17 | 35.26 |
| 4 | 256 | 5.98 | 60 | 11.76 ± 0.26 | 9.04 ± 0.26 | 15.08 ± 0.13 | 27.46 |
| 5 | 128 | 6.0 | 60 | 11.74 ± 0.26 | 4.54 ± 0.1 | 12.08 ± 0.25 | 25.79 |
| 6 | 128 | 5.98 | 60 | 11.77 ± 0.29 | 4.74 ± 0.13 | 12.13 ± 0.32 | 26.49 |
| 7 | 100 | 1.0 | 32 | 1.92 ± 0.02 | 0.47 ± 0.01 | 2.03 ± 0.02 | 14.88 |
| 8 | 59 | 10.0 | 51 | 6.11 ± 0.07 | 7.68 ± 0.12 | 9.28 ± 0.08 | 32.7 |
| 9 | 15 | 60.0 | 25 | 12.59 ± 0.18 | 5.32 ± 0.07 | 13.17 ± 0.12 | 26.47 |
| 10 | 10 | 10.08 | 100 | 4.17 ± 0.06 | 2.02 ± 0.02 | 4.43 ± 0.05 | 28.3 |
| 11 | 10 | 10.17 | 32 | 8.58 ± 0.04 | 0.74 ± 0.01 | 8.8 ± 0.12 | 5.66 |
| 12 | 10 | 10.05 | 32 | 8.68 ± 0.04 | 0.75 ± 0.01 | 8.65 ± 0.07 | 8.3 |
| 13 | 10 | 10.05 | 32 | 8.67 ± 0.03 | 0.71 ± 0.01 | 8.65 ± 0.07 | 7.76 |
| 14 | 8 | 195.37 | 30 | 110.51 ± 2.32 | 8.99 ± 0.22 | 109.97 ± 2.05 | 7.98 |
| 15 | 8 | 72.0 | 25 | 31.84 ± 0.53 | 3.26 ± 0.07 | 32.1 ± 0.42 | 8.55 |
| 16 | 8 | 198.22 | 35 | 133.45 ± 2.22 | 11.38 ± 0.2 | 81.33 ± 2.28 | 8.1 |
| mean | | | | | | | 14.55 |
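The Win column follows directly from the definition given in the caption; a minimal sketch of the computation (timings taken from the rows for videos 0 and 3 above):

```python
def win_percent(prepare, tracking, live):
    """Time 'won' by live-tracking, as a percentage of the two-step approach:
    (prepare + tracking - live) / (prepare + tracking) * 100."""
    separate = prepare + tracking
    return (separate - live) / separate * 100.0

# Video 0 (1024 individuals): separate steps take 10.96 + 41.11 minutes,
# live-tracking takes 65.72 minutes -- combining the tasks is a net loss.
print(round(win_percent(10.96, 41.11, 65.72), 1))  # -26.2

# Video 3 (256 individuals): here combining the tasks saves about a third.
print(round(win_percent(11.09, 7.99, 12.32), 1))  # 35.4
```

The small deviations from the tabulated percentages (e.g. −26.2 vs. −26.23) arise because the table's Win column was computed from unrounded timings.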
Appendix 4—table 5
Statistics for running the tree-based matching algorithm on the videos of the speed dataset.

We achieve low leaf- and node-visit counts across the board – this is especially notable for videos with high numbers of individuals. High values for ‘# nodes visited’ are only impactful if they make up a large portion of the assignments. They are the result of too many candidate assignments – the weak point of the tree-based algorithm – and lead to combinatorial ‘explosions’, in which the method takes an impractically long time to finish. If such an event is detected, TRex automatically switches to an algorithm with a bounded run-time, such as the Hungarian method.

| Video | # ind. | # nodes visited (5, 50, 95, 100%) | # leaves visited | # improvements |
| --- | --- | --- | --- | --- |
| 0 | 1024 | [1535; 2858; 83243; 18576918] | 1.113 ± 0.37 | 1.113 |
| 1 | 512 | [1060; 8156; 999137; 19811558] | 1.247 ± 0.61 | 1.247 |
| 2 | 512 | [989; 2209; 56061; 8692547] | 1.159 ± 0.47 | 1.159 |
| 3 | 256 | [452; 479; 969; 205761] | 1.064 ± 0.29 | 1.064 |
| 4 | 256 | [475; 496; 584; 608994] | 1.028 ± 0.18 | 1.028 |
| 5 | 128 | [233; 245; 258; 7149] | 1.012 ± 0.12 | 1.012 |
| 6 | 128 | [237; 259; 510; 681702] | 1.046 ± 0.25 | 1.046 |
| 7 | 100 | [195; 199; 199; 13585] | 1.014 ± 0.14 | 1.014 |
| 8 | 59 | [117; 117; 117; 16430] | 1.014 ± 0.2 | 1.014 |
| 9 | 15 | [24; 29; 29; 635] | 1.027 ± 0.22 | 1.027 |
| 10 | 10 | [17; 19; 19; 56] | 1.001 ± 0.02 | 1.001 |
| 11 | 10 | [19; 19; 19; 129] | 1.006 ± 0.1 | 1.006 |
| 12 | 10 | [19; 19; 19; 1060] | 1.023 ± 0.23 | 1.023 |
| 13 | 10 | [19; 19; 19; 106] | 1.001 ± 0.04 | 1.001 |
| 14 | 8 | [11; 15; 15; 893] | 1.003 ± 0.08 | 1.003 |
| 15 | 8 | [13; 15; 15; 597] | 1.024 ± 0.23 | 1.024 |
| 16 | 8 | [15; 15; 15; 2151] | 1.009 ± 0.17 | 1.009 |
| 17 | 1 | [1; 1; 1; 1] | 1.0 ± 0.02 | 1.0 |
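The node, leaf, and improvement counters above come from a tree search over possible individual-to-detection assignments. The function below is not TRex's implementation but an illustrative branch-and-bound sketch of the same idea, written for this appendix: candidates are visited best-first, branches that cannot beat the best complete assignment found so far are pruned, and every leaf that is reached therefore improves on the previous best – which is why ‘# improvements’ tracks ‘# leaves visited’ so closely in the table.

```python
def tree_match(cost):
    """Branch-and-bound assignment of n individuals to n detections.

    cost[i][j] is the cost (e.g. a negative log-probability) of assigning
    individual i to detection j. Returns the optimal assignment together
    with counters analogous to the matching stats reported above.
    """
    n = len(cost)
    best = {"cost": float("inf"), "assign": None}
    stats = {"nodes": 0, "leaves": 0, "improvements": 0}

    def descend(i, used, partial):
        stats["nodes"] += 1
        if partial >= best["cost"]:  # this branch cannot improve: prune it
            return
        if i == n:  # complete assignment; by the check above it is better
            stats["leaves"] += 1
            stats["improvements"] += 1
            best["cost"], best["assign"] = partial, list(used)
            return
        # visit detections best-first, so the first leaf is near-optimal
        for j in sorted(range(n), key=lambda j: cost[i][j]):
            if j not in used:
                used.append(j)
                descend(i + 1, used, partial + cost[i][j])
                used.pop()

    descend(0, [], 0.0)
    return best["assign"], stats

# Three individuals with unambiguous costs: the first leaf is already optimal
# and every other branch is pruned immediately.
assignment, stats = tree_match([[1, 9, 9], [9, 1, 9], [9, 9, 1]])
print(assignment)  # [0, 1, 2]; only one leaf visited
```

With ambiguous costs, the number of competing branches multiplies – exactly the combinatorial blow-up described above that makes TRex fall back to the Hungarian method.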
Appendix 9—table 1
Settings used for trials, as saved inside the JSON files used for tracking.

The minimum intensity was always set to 0 and background subtraction was always enabled. An ROI (region of interest) is an array of 2D vectors, typically describing a convex polygon that contains the area of the tank (e.g., for fish or locusts). Since this format is quite lengthy, we only indicate here whether the area of interest was limited or not.

| Video | Length (# frames) | # blobs | Area | Max. intensity | ROI |
| --- | --- | --- | --- | --- | --- |
| 7 | 1921 | 100 | [165, 1500] | 170 | Yes |
| 8 | 30,626 | 59 | [100, 2500] | 160 | Yes |
| 9 | 90,001 | 8 | [190, 4000] | 147 | Yes |
| 11 | 19,539 | 10 | [200, 1500] | 10 | Yes |
| 12 | 19,309 | 10 | [200, 1500] | 10 | Yes |
| 13 | 19,317 | 10 | [200, 1500] | 10 | Yes |
| 14 | 351,677 | 8 | [200, 2500] | 50 | No |
| 15 | 108,000 | 8 | [250, 2500] | 10 | No |
| 16 | 416,259 | 8 | [200, 2500] | 50 | No |
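As a concrete illustration of the ROI column: restricting tracking to the tank area amounts to testing whether a detection's position falls inside the stored polygon. The helper below is a generic convex-polygon containment test written for this appendix, not TRex's actual implementation, and the example vertices are hypothetical:

```python
def inside_convex_roi(roi, point):
    """True if `point` lies inside (or on the boundary of) the convex
    polygon `roi`, a list of (x, y) vertices in consistent winding order."""
    x, y = point
    sign = 0
    n = len(roi)
    for i in range(n):
        x1, y1 = roi[i]
        x2, y2 = roi[(i + 1) % n]
        # cross product tells us on which side of edge i the point lies
        cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
        if cross != 0:
            side = 1 if cross > 0 else -1
            if sign == 0:
                sign = side
            elif side != sign:  # point switches sides: outside the polygon
                return False
    return True

# Hypothetical square tank ROI of 1000 x 1000 px:
tank = [(0, 0), (1000, 0), (1000, 1000), (0, 1000)]
print(inside_convex_roi(tank, (512, 480)))   # True
print(inside_convex_roi(tank, (1200, 480)))  # False
```

For a convex ROI this side-consistency check is sufficient; arbitrary (possibly concave) polygons would need a general point-in-polygon test such as ray casting.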
Appendix 12—table 1
Analogous to our analysis in Table 3, we compared automatically generated trajectories with manually verified ones for two additional videos.

Unlike the table in the main text, the sample size per video is only one here, which is why the standard deviation is zero in both cases. Results show very high accuracy for both videos, but relatively high percentages of interpolated frames compared to Table 3, where only the results for Video 9 showed more than 8% interpolation and all others remained below 1%.

| Video | # ind. | Reviewed (%) | Interpolated (%) | TRex |
| --- | --- | --- | --- | --- |
| (V1) Romero-Ferrero et al., 2019 | 4 | 100.0 | 6.41 | 99.6 ± 0.0 |
| (V2) D. Mink, M. Groettrup | 4 | 100.0 | 1.74 | 99.82 ± 0.0 |

Additional files

Transparent reporting form
Appendix 4—figure 2—source data 1

A list of all consecutive frame segments used in Appendix 4—figure 2.

In the table, the segments are indexed by their length, the software they were produced by, the video they originate from, as well as the bin they belong to.
Appendix 4—figure 2—source data 2

The raw data-points as plotted in Appendix 4—figure 2.
Appendix 4—figure 3—source data 1

Raw data for producing this figure and Appendix 4—table 3.

Each sample is represented as a row, indexed by method (tree, approximate, Hungarian), video, and bin (corresponding to one horizontal line in the figure).
Appendix 4—table 1—source data 1

Raw samples for this table and Appendix 4—table 5.


Tristan Walter, Iain D Couzin. TRex, a fast multi-animal tracking system with markerless identification, and 2D estimation of posture and visual fields. eLife 10:e64000.