
anTraX, a software package for high-throughput video tracking of color-tagged insects

  1. Asaf Gal (corresponding author)
  2. Jonathan Saragosti
  3. Daniel JC Kronauer (corresponding author)
Laboratory of Social Evolution and Behavior, The Rockefeller University, United States
Tools and Resources
Cite this article as: eLife 2020;9:e58145 doi: 10.7554/eLife.58145

Abstract

Recent years have seen a surge in methods to track and analyze animal behavior. Nevertheless, tracking individuals in closely interacting, group-living organisms remains a challenge. Here, we present anTraX, an algorithm and software package for high-throughput video tracking of color-tagged insects. anTraX combines neural network classification of animals with a novel approach for representing tracking data as a graph, enabling individual tracking even in cases where it is difficult to segment animals from one another, or where tags are obscured. The use of color tags, a well-established and robust method for marking individual insects in groups, relaxes requirements for image size and quality, and makes the software broadly applicable. anTraX is readily integrated into existing tools and methods for automated image analysis of behavior to further augment its output. anTraX can handle large-scale experiments with minimal human involvement, allowing researchers to simultaneously monitor many social groups over long time periods.

Introduction

Our understanding of behavior, together with the biological, neural, and computational principles underlying it, has advanced dramatically over recent decades. Consequently, the behavioral and neural sciences have moved to study more complex forms of behavior at ever-increasing resolution. This has created a growing demand for methods to measure and quantify behavior, which has been met with a wide range of tools to measure, track, and analyze behavior across a variety of species, conditions, and spatiotemporal scales (Anderson and Perona, 2014; Berman, 2018; Brown and de Bivort, 2018; Krakauer et al., 2017; Dell et al., 2014; Robie et al., 2017a; Todd et al., 2017; Egnor and Branson, 2016). One of the exciting frontiers of the field is the study of collective behavior in group-living organisms and particularly the behavior of groups of insects. Insects provide an attractive and powerful model for collective and social behavior, as they exhibit a wide range in social complexity, from solitary to eusocial, while allowing for controlled, high-throughput experiments in laboratory settings (Feinerman and Korman, 2017; Lihoreau et al., 2012; Gordon, 2014; Schneider et al., 2012). However, although complex social behavior has been the focus of extensive research for over a century, technological advances are only beginning to enable systematic and simultaneous measurements of behavior in large groups of interacting individuals.

Solutions for automated video tracking of insects in social groups can be roughly divided into two categories (for reviews see Dell et al., 2014; Robie et al., 2017a): methods for tracking unmarked individuals (Branson et al., 2009; Pérez-Escudero et al., 2014; Romero-Ferrero et al., 2019; Sridhar et al., 2019; Feldman et al., 2012; Khan et al., 2005; Fasciano et al., 2014; Fasciano et al., 2013; Bozek et al., 2020), and methods for tracking marked individuals (Mersch et al., 2013; Robinson et al., 2012). The former category has the obvious advantages of reduced interference with natural behavior, unbounded number of tracked individuals, and not having the burden of tagging animals and maintaining these tags throughout the experiment. At the same time, these approaches, when applied to individual tracking, are limited by a more extensive computational burden, higher error rates, and stricter requirements for image quality. The most common approach for tracking unmarked individuals is to try and follow the trajectory of an individual for the duration of the video. The challenge in this approach is to resolve individuals from each other and link their locations in consecutive frames during close range interactions, when they are touching or occluding each other. Common solutions to this problem are to employ sophisticated segmentation methods (Branson et al., 2009; Pérez-Escudero et al., 2014; Sridhar et al., 2019), to use predictive modeling of the animals' motion (Branson et al., 2009; Fasciano et al., 2013), or to use image characteristics to match individuals before and after occlusions (Fasciano et al., 2014). The success of these solutions is case-specific and will usually be limited to relatively simple problems, where interactions are brief, occlusion is minimal, or image resolution is sufficient to resolve the individuals even during an interaction. 
One important limitation of this approach is that no matter how low the error rate is, it tends to increase rapidly with the duration of the experiment. The reason is that once identities are swapped, the error is unrecoverable, and will propagate from that moment on. A different algorithmic approach for tracking unmarked individuals is to use object recognition techniques to assign separate pieces of trajectories to the same individual (Pérez-Escudero et al., 2014; Romero-Ferrero et al., 2019). While this approach is promising and performs well on many tracking problems, it requires high image quality to identify unique features for each individual animal. It will also generally not perform well on animals with high postural variability and is hard to validate on large datasets.

On the other hand, tagging individuals with unique IDs has the advantage of having a stable reference, enabling error recovery. This approach also provides a simpler method for human validation or correction and enables following the behavior of individuals even if they leave the tracked region, or across experiments when the same animals are tested in different conditions. The use of general-purpose libraries such as AprilTags (Mersch et al., 2013; Olson, 2011; Heyman et al., 2017; Greenwald et al., 2018; Stroeymeyt et al., 2018) and ArUco (Garrido-Jurado et al., 2014), or application-specific patterned tags (Crall et al., 2015; Boenisch et al., 2018; Wario et al., 2015; Wild et al., 2018), has become the gold standard for this approach in recent years. However, these tags are applicable only to species with body sizes sufficiently large to attach them, have adverse effects on the animals’ behavior, and are often lost during experiments. They also require relatively high image resolution to correctly read the barcode pattern. Taken together, while currently available methods cover a wide range of experimental scenarios, the ability to accurately track the behavior of animals in groups remains one of the major hurdles in the field. As a result, much of the experimental work still relies on manual annotation, or on partially automated analysis pipelines that require considerable manual effort to correct computer-generated annotations (see Aguilar et al., 2018; Gelblum et al., 2015; Leitner and Dornhaus, 2019; Valentini et al., 2020 for recent examples). In principle, marked animals can also be tracked by general-purpose image-based trackers such as idTracker.ai, supplementing the pixel information of the animals’ appearances with artificial features. To the best of our knowledge, however, this approach has not been formally described, and it can be expected to perform less well than trackers specifically designed for a given marking technique.

Here, we present anTraX, a new software solution for tracking color-tagged insects and other small animals. Color tagging is one of the best-established and widely used methods to mark insects, both in the field and in the laboratory (Leitner and Dornhaus, 2019; Valentini et al., 2020; Walker and Wineriter, 1981; Gordon, 1989; Hagler and Jackson, 2001; Ulrich et al., 2018; Holbrook et al., 2011; Holbrook, 2009), with long-term durability and minimal effects on behavior. anTraX works by combining traditional segmentation-based object tracking with image-based classification using convolutional neural networks (CNNs). In addition, anTraX uses a graph object for representing tracking data (Nillius et al., 2006), enabling the inference of identity of unidentified objects by propagating temporal and spatial information, thereby optimizing the use of partial tag information. anTraX is uniquely suited for tracking small social insects that form dense aggregates, in which individuals are unidentifiable over large parts of the experiment even for the human observer. It will also be useful in tracking and analyzing behavior in heterogenic groups of ‘solitary’ insects, where keeping track of the individual identity for long experimental durations is important. Such experiments are of increasing interest, as the study of behavior in classical model systems like Drosophila fruit flies is shifting toward understanding more complex behavioral phenomena such as social interactions, individuality and inter-species interactions (Schneider et al., 2012; Seeholzer et al., 2018; Schneider and Levine, 2014; Honegger and de Bivort, 2018; Ayroles et al., 2015; Akhund-Zade et al., 2019).

While we tested anTraX and found it useful for behavioral analyses in a range of study systems, it was specifically developed for experiments with the clonal raider ant Ooceraea biroi. The clonal raider ant is an emerging social insect model system with a range of genomic and functional genetic resources (Ulrich et al., 2018; Oxley et al., 2014; Trible et al., 2017; McKenzie and Kronauer, 2018; Chandra et al., 2018; Teseo et al., 2014). The unique biological features of the species enable precise control over the size, demographics and genetic composition of the colony, parameters that are essential for systematic study of collective behavior in ants (Ulrich et al., 2018; Chandra et al., 2020). Moreover, the species is amenable to genetic manipulations (Trible et al., 2017), which opens new possibilities not only for understanding the genetic and neural bases of social and collective behaviors, but also for developing and validating theoretical models by manipulating behavioral rules at the level of the individual and studying the effects on group behavior. While these ants have great promise for the study of collective behavior, they are hard to track using available approaches, due to their small size and tendency to form dense aggregates. anTraX thus constitutes a crucial element in the clonal raider ant toolbox, enabling researchers to characterize behavior in unprecedented detail both at the individual and collective level.

anTraX was designed with large-scale behavioral experiments in mind, where hundreds of colonies are recorded in parallel for periods of weeks or months, making manual tracking or even error correction impractical. Its output data can be directly imported into software packages for higher level analysis of behavior (e.g. Kabra et al., 2013) or higher resolution postural analysis of individuals in the group (Pereira et al., 2019; Mathis et al., 2018; Berman et al., 2014; Graving et al., 2019). This enables the utilization of these powerful tools and methods for the study of social insects and collective behavior. anTraX is modular and flexible, and its many parameters can be set via a graphical interface. The software is open source, and its main algorithmic components can be easily modified. Here, we provide a brief description of the different steps and principles that constitute the anTraX algorithm, while a full description is given in the Appendix and the online documentation. We validate the performance of anTraX using a number of benchmark datasets that represent a variety of behavioral settings and experimental conditions.

Materials and methods

The anTraX algorithm consists of three main steps. First, similar to other multi-object tracking algorithms (Pérez-Escudero et al., 2014; Romero-Ferrero et al., 2019), it segments the frames into background and ant-containing blobs and organizes the extracted blobs into trajectory fragments termed tracklets. The tracklets are linked together to form a directed tracklet graph (Nillius et al., 2006). The second step of the algorithm is tracklet classification, in which identifiable single-animal tracklets are labeled with a specific ID by a pre-trained CNN, while other tracklets are marked as either unidentified single-animal tracklets, or as multi-animal tracklets. Third, we infer the identity of unclassifiable tracklets in the tracklet graph by using temporal, spatial and topological propagation rules.

Object tracking and construction of the tracklet graph

Each frame is subtracted from the background, and a fixed threshold is applied to segment the frame into background regions and animal-containing blobs to be tracked. When two or more animals are close together, they will often be merged into a single larger blob (Figure 1A–C). Unlike other tracking solutions, we do not attempt to separate these larger blobs into single animal blobs at this stage, because those attempts are based on heuristic decisions that do not generalize well across species and experimental conditions. Instead, we will infer the composition of these larger blobs from the tracklet graph in a later step. Each blob in the currently processed frame is then linked to blobs in the previous frame (Figure 1D–E). A link between a blob in frame t and a blob in frame t-1 implies that some or all of the animals that were part of the first blob, are present in the second one. A blob can be linked to one blob (the simplest case, where the two blobs have the same ant composition), to a few blobs (where animals leave or join the blob), or to none (suggesting the animals in the blob were not detected in the other frame). We use Optical Flow to decide which blobs should be linked across frames (Figure 1D). While Optical Flow is computationally expensive, we found it to be significantly more accurate than alternatives such as simple overlap or linear assignment (based either on simple spatial distance or on distance in some feature space). To reduce the computation cost, we run the optical flow in small regions of the image that contain more than one linking option (see Appendix section 1.4 for details).
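As a rough illustration of the segmentation step described above, the following Python sketch thresholds the difference between a frame and the background and groups the remaining pixels into blobs by flood fill. It is not the anTraX implementation: the function name `segment_blobs`, the list-of-lists image representation, and the choice of 8-connectivity are assumptions made for this example, and the optical-flow linking of blobs across frames is not shown.

```python
from collections import deque


def segment_blobs(frame, background, threshold):
    """Return a list of blobs; each blob is a set of (row, col) pixels.

    Hypothetical helper: images are plain lists of lists of gray values.
    """
    rows, cols = len(frame), len(frame[0])
    # Foreground mask: pixels differing from the background by more than
    # a fixed threshold.
    mask = [[abs(frame[r][c] - background[r][c]) > threshold
             for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood fill over 8-connected foreground pixels.
            blob, queue = set(), deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs


# Toy example: dark animal pixels on a bright, uniform background.
background = [[200] * 6 for _ in range(4)]
frame = [row[:] for row in background]
for r, c in [(0, 0), (0, 1), (1, 1), (3, 4), (3, 5)]:
    frame[r][c] = 50
blobs = segment_blobs(frame, background, threshold=60)
```

In this toy frame the five dark pixels form two separate blobs. Animals that touch would end up in one blob, which is the case the text deliberately leaves unresolved at this stage.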

Figure 1
Blob tracking and construction of the tracklet graph.

(A) An example frame from an experiment with 16 ants marked with two color tags each. (B) The segmented frame after background subtraction. Each blob is marked with a unique color. Some blobs contain single ants, while others contain multiple ants. (C) A higher resolution segmentation example. While some ants are not distinguishable from their neighbors even for the human eye, others might be segmented by tuning the segmentation parameters, or by using other, more sophisticated segmentation algorithms. The anTraX algorithm takes a conservative approach and leaves those cases unsegmented to avoid segmentation errors. (D) Optical flow is used to estimate the ‘flow’ of pixels from one frame to the next, giving an approximation of the movements of the ants. The cyan silhouettes represent the location of an ant in the first frame, and the red silhouettes represent the location in the second frame. The results of the optical flow procedure are shown with blue arrows, depicting the displacement of pixels in the image. (E) An example of constructing and linking tracklets. Each layer represents a section of segmented frame. Two ants are approaching each other (tracklets marked τ1 and τ2), until they are segmented together. At that point, the two tracklets end, and a third multi-ant tracklet begins (τ3). Once the two ants are again segmented individually, the multi-ant tracklet ends, and two new single-ant tracklets begin (τ4 and τ5). (F) The graph representation of the tracklet example in E. (G) A tracklet graph from an experiment with 36 ants, representing 3 min of tracking data. The nodes are located according to the tracklet start time on the vertical axis, beginning at the bottom. The inset depicts a zoomed-in piece of the graph.

Blobs are organized into tracklets, each defined as a list of linked blobs in consecutive frames that are composed of the same group of individuals (Figure 1E–F). After linking, tracklets are updated as follows: (i) A blob in the current frame t that is not linked to any blob in the previous frame t-1 will 'open' a new tracklet. (ii) A blob in the previous frame that is not linked to any blob in the current frame will 'close' its tracklet. (iii) If a pair of blobs in the previous and current frames are exclusively linked, the current blob will be added to the tracklet that contains the previous blob. (iv) Whenever a blob in the current or previous frame is connected to more than one blob, the tracklets of the linked blobs in the previous frame will 'close', and new tracklets will 'open' with the blobs in the current frame. In this last case, the link between blobs belonging to different tracklets is registered as an edge in the directed tracklet graph, pointing from the earlier tracklet to the later one. The tracklet graph is constructed by running an iterative loop over all the frames in the experiment. The result of this part of the algorithm, after processing all frames in the video, is a directed acyclic graph containing references to all tracklets and blobs in the dataset (Figure 1G).
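The update rules (i)–(iv) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the anTraX code: the function `build_tracklet_graph` and its input format are hypothetical, and the blob links, which anTraX derives from optical flow, are supplied by hand here.

```python
def build_tracklet_graph(frames, links):
    """Build tracklets and graph edges from per-frame blobs and links.

    frames: list of lists of blob names, one list per frame.
    links:  links[t] is a set of (previous_blob, current_blob) pairs
            between frame t-1 and frame t (links[0] is empty).
    Returns (tracklets, edges), where tracklets[i] is a list of
    (frame_index, blob_name) and edges holds (earlier, later) indices.
    """
    tracklets, edges = [], set()
    owner = {}  # blob name in the previous frame -> tracklet index
    for t, blobs in enumerate(frames):
        succ, pred = {}, {}
        for p, c in links[t]:
            succ.setdefault(p, []).append(c)
            pred.setdefault(c, []).append(p)
        new_owner = {}
        for b in blobs:
            parents = pred.get(b, [])
            if len(parents) == 1 and len(succ[parents[0]]) == 1:
                # Rule (iii): exclusive link, extend the existing tracklet.
                idx = owner[parents[0]]
            else:
                # Rules (i) and (iv): open a new tracklet and register an
                # edge from every linked tracklet in the previous frame.
                idx = len(tracklets)
                tracklets.append([])
                for p in parents:
                    edges.add((owner[p], idx))
            tracklets[idx].append((t, b))
            new_owner[b] = idx
        # Rule (ii): tracklets that were not extended are closed implicitly.
        owner = new_owner
    return tracklets, edges


# Two animals approach, merge into one blob for two frames, then separate
# (the scenario of Figure 1E).
frames = [["a", "b"], ["a", "b"], ["ab"], ["ab"], ["a", "b"]]
links = [
    set(),
    {("a", "a"), ("b", "b")},
    {("a", "ab"), ("b", "ab")},
    {("ab", "ab")},
    {("ab", "a"), ("ab", "b")},
]
tracklets, edges = build_tracklet_graph(frames, links)
```

With this input the sketch yields five tracklets (two single-animal, one merged, two single-animal) joined by four edges, matching the structure drawn in Figure 1F.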

Tracklet classification

The next step is tracklet classification, in which we label tracklets containing single animals that can be reliably identified with a specific ID (Appendix section 2.3). The successful propagation of individual IDs on top of the tracklet graph requires at least one identification of each ID at this step. Propagation will improve with additional independent identifications of individuals throughout the video. Nevertheless, it is important to note that our approach does not rely on the identification of each and every tracklet, but rather on inferring the composition of tracklets based on propagation of IDs on top of the tracklet graph. Hence, we apply a conservative algorithm that classifies only reliable cases and leaves ambiguous ones as unidentified. Classification is done by training and applying a convolutional neural network (CNN) on each blob image in the tracklet. The most frequent ID is then applied to the entire tracklet (Figure 2A). In addition to the ID label, we also assign a classification confidence score to each classified tracklet, which takes into account the number of identified blobs in the tracklet, the confidence of each classification, and the prevalence of contradictory classifications across blobs in the tracklet (see Appendix section 2.4). anTraX comes with a graphical interface for training, validating, and running the CNN (see Supplementary Material and online documentation).
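The majority-vote idea behind tracklet classification can be illustrated with a short Python sketch. The per-blob labels stand in for CNN outputs, and the score used here (the share of blobs in the tracklet that agree with the winning ID) is a hypothetical placeholder; the actual confidence score used by anTraX is defined in Appendix section 2.4 and is not reproduced here. The function name `classify_tracklet` and the `min_fraction` cutoff are likewise assumptions.

```python
from collections import Counter


def classify_tracklet(blob_labels, min_fraction=0.5):
    """Return (tracklet_id, score) from per-blob labels.

    blob_labels: one label per blob, either an ID string or 'unknown'.
    The score is a hypothetical stand-in for a confidence measure.
    """
    ids = [label for label in blob_labels if label != "unknown"]
    if not ids:
        return None, 0.0
    top_id, top_count = Counter(ids).most_common(1)[0]
    # Share of all blobs in the tracklet that agree with the winning ID;
    # unknown and contradictory blobs both lower the score.
    score = top_count / len(blob_labels)
    if score < min_fraction:
        # Conservative choice: leave ambiguous tracklets unidentified.
        return None, score
    return top_id, score


clear = classify_tracklet(["GO", "GO", "unknown", "GO", "BG"])
ambiguous = classify_tracklet(["unknown", "unknown", "BG"])
```

Here the first tracklet is labeled `GO`, while the second is left unidentified because too few of its blobs support any single ID.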

Figure 2 (with 1 supplement)
Tracklet classification and ID propagation on the tracklet graph.

(A) Schematic of the tracklet classification procedure. All blobs belonging to the tracklet are classified by a pre-trained CNN classifier. The classifier assigns a label to each blob, which can be an individual ID (depicted as colored rectangles in the figure), or an ambiguous label (‘unknown’, depicted in gray). The tracklet is then classified as the most abundant ID in the label set, along with a confidence score that depends on the combination of blob classifications and their scores (see Supplementary Material for details). (B) A simple example of propagating IDs on top of the tracklet graph. The graph represents a tracking problem with three IDs (represented as red/blue/green) and eight tracklets, of which some are single-animal (depicted as circles) and some are multi-animal (depicted as squares). Three of the single-animal tracklets have classifications, and are depicted as color-filled circles. The graph shows how, within four propagation rounds, assigned IDs are propagated as far as possible, both negatively (round head arcs) and positively (arrow heads), until the animal composition of all nodes is fully resolved. See also Figure 2—video 1 for an expanded animated example. (C) An example of a solved tracklet graph from an experiment with 16 ants, representing 10 min of tracking. Single ant tracklets are depicted as circle nodes and multi ant tracklets are depicted as square nodes. Black circles represent single ant tracklets that were assigned an ID by the classifier. A subgraph that corresponds to a single focal ant ID (‘GO’: an ant marked with a green thorax tag and an orange abdomen tag) is highlighted in color. Green nodes represent single ant tracklets assigned by the classifier. Blue nodes represent tracklets assigned by the propagation algorithm. Red nodes are residual ambiguities. (D) Example snapshots of the focal ant GO at various points along its trajectory, where it is often unidentifiable. 
The second image from the bottom shows an image where the ant is identifiable. While the third image from the bottom shows an unidentifiable ant, it belongs to a tracklet which was assigned an ID by the classifier based on other frames in the tracklet. The first and last images show the focal ant inside aggregations, and were assigned by the propagation algorithm. The purple arrows connect each image to its corresponding node in C. (E) The 10-min long trajectories corresponding to the graph in C. The trajectory of the focal ant GO is plotted in orange, while the trajectories of all other ants are plotted in gray. Purple arrows again point from the images in D to their respective location in the trajectory plot. (F) Plot of the x and y coordinates of the focal ant during the 10 min represented in the graph in C. Gaps in the plot (marked with green asterisks) correspond to ambiguous segments, where the algorithm could not safely assign the ant to a tracklet. In most cases, these are short gaps when the ant does not move, and they can be safely interpolated to obtain a continuous trajectory.

ID propagation

The last part of the algorithm is the propagation of ID assignments on the tracklet graph. While formal approaches for solving this problem using Bayesian inference have been proposed (Nillius et al., 2006), we chose to implement an ad-hoc greedy iterative process that we found to work best in our particular context. Each node in the graph (corresponding to a tracklet) is annotated with a dynamic list of assigned IDs (IDs that are assigned to the tracklet) and a list of possible IDs (IDs that might be assigned to the tracklet, i.e., that were not yet excluded). Initially, all nodes are marked as ‘possible’ for all IDs, and no IDs are assigned to any nodes. All the classified tracklets from the previous step are now ranked by their confidence score. Starting with the highest confidence tracklet, its ID is propagated on the graph as far as possible. Propagation is done vertically on the graph on top of edges, both positively (an ID that is assigned to a node must also be assigned to at least one of its successors and one of its predecessors) and negatively (an ID cannot be in the possible list of a node, if it is not in at least one successor and one predecessor node), horizontally (if an ID is assigned to a node, it cannot be assigned to any other time-overlapping node), and using topological constraints (Figure 2B, Figure 2—video 1). Only non-ambiguous propagation is performed, and propagation is halted whenever an ambiguity or contradiction arises. We iterate the propagation until no more assignments can be made. Some of the propagation rules are modified in cases of tracklets that start or end in regions where individuals can enter or leave the tracked area (see Appendix section 3). Figure 2C–F visualizes an example of tracking an ant throughout a 10 min segment from an actual experiment and depicts the path of the ant through the tracklet graph along with its spatial trajectory.
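A heavily simplified Python sketch of the propagation rules may help make them concrete. It is not the anTraX algorithm: it seeds all classified tracklets in order of confidence and then iterates the vertical and horizontal rules to a fixed point, it assumes a closed arena, and it omits the topological constraints and the special handling of open boundaries. The function `propagate_ids` and its data layout are hypothetical.

```python
def propagate_ids(nodes, edges, classified, all_ids):
    """Propagate IDs on a tracklet graph (simplified illustration).

    nodes:      {name: (start_frame, end_frame, is_single_animal)}
    edges:      set of (earlier_node, later_node)
    classified: list of (confidence, node, id) from the classifier
    """
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
    possible = {n: set(all_ids) for n in nodes}
    assigned = {n: set() for n in nodes}

    def overlaps(a, b):
        return a != b and nodes[a][0] <= nodes[b][1] and nodes[b][0] <= nodes[a][1]

    def assign(n, i):
        if i in assigned[n]:
            return False
        others = [m for m in nodes if overlaps(n, m)]
        if i not in possible[n] or any(i in assigned[m] for m in others):
            return False  # contradiction: do not propagate
        assigned[n].add(i)
        if nodes[n][2]:
            possible[n] = {i}  # a single-animal tracklet holds one ID
        for m in others:
            possible[m].discard(i)  # horizontal rule
        return True

    for _, n, i in sorted(classified, key=lambda c: -c[0]):
        assign(n, i)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            for nbrs in (succ[n], pred[n]):
                if not nbrs:
                    continue  # tracklet touches the start or end of the video
                # Negative rule: drop IDs that no neighbor can contain.
                for i in list(possible[n] - assigned[n]):
                    if not any(i in possible[m] for m in nbrs):
                        possible[n].discard(i)
                        changed = True
                # Positive rule: an assigned ID moves to a neighbor only
                # when exactly one neighbor can contain it.
                for i in list(assigned[n]):
                    cands = [m for m in nbrs if i in possible[m]]
                    if len(cands) == 1 and assign(cands[0], i):
                        changed = True
    return assigned, possible


# Two animals (R, B): two single tracklets merge into one and split again.
nodes = {0: (0, 1, True), 1: (0, 1, True), 2: (2, 3, False),
         3: (4, 4, True), 4: (4, 4, True)}
edges = {(0, 2), (1, 2), (2, 3), (2, 4)}
classified = [(0.9, 0, "R"), (0.8, 3, "B")]
assigned, possible = propagate_ids(nodes, edges, classified, {"R", "B"})
```

In this example, two classified tracklets are enough to resolve the composition of all five nodes, including the multi-animal tracklet in the middle.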

Export positional and postural results for analysis

The tracking results are saved to disk and can be accessed using supplied MATLAB and Python interface functions. For each individual ID in the experiment, a table is returned, containing its assigned spatial coordinates in each frame of the experiment, and a flag indicating the type of the location estimation (e.g. direct single-animal classification, inferred single-animal, multi-animal tracklet). For frames where the location is derived from single animal tracklets (i.e. the animal was segmented individually), the animal orientation is also returned. Locations estimated from multi-animal tracklets are necessarily less accurate than locations from single-animal tracklets, and users should be aware of this when analyzing the data. For example, calculating velocities and orientations is only meaningful for single-animal tracklet data, while spatial fidelity can be estimated based also on approximate locations. A full description of how to import and process the tracking results is provided in Appendix section 3.6 and the online documentation.
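The caveat about location types can be illustrated with a small Python sketch that computes speeds only between consecutive frames in which the animal was segmented individually. The record layout, the flag names, and the function `speeds` are hypothetical and do not reflect the actual anTraX interface functions, which are described in the online documentation.

```python
import math

# Hypothetical flag names for locations taken from single-animal tracklets.
SINGLE_FLAGS = {"classified_single", "inferred_single"}


def speeds(track, fps):
    """Per-step speed (distance units per second), or None where undefined.

    track: list of per-frame records, each a dict with 'x', 'y' and 'flag',
           or None for frames in which the ID was not assigned.
    """
    result = []
    for prev, cur in zip(track, track[1:]):
        usable = (prev is not None and cur is not None
                  and prev["flag"] in SINGLE_FLAGS
                  and cur["flag"] in SINGLE_FLAGS)
        if usable:
            step = math.hypot(cur["x"] - prev["x"], cur["y"] - prev["y"])
            result.append(step * fps)
        else:
            # Approximate (multi-animal) or missing locations are skipped.
            result.append(None)
    return result


track = [
    {"x": 0.0, "y": 0.0, "flag": "classified_single"},
    {"x": 3.0, "y": 4.0, "flag": "inferred_single"},
    {"x": 3.0, "y": 4.0, "flag": "multi"},
    None,
]
step_speeds = speeds(track, fps=10)
```

Only the first step yields a speed; the steps that involve a multi-animal location or a missing assignment are left undefined rather than computed from approximate positions.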

User interface and parameter tuning

anTraX has many parameters that control the image segmentation step, the classifier architecture and training procedure, and the propagation algorithm. The optimal value for each depends on the specific nature and settings of the processed experiment, from the resolution and quality of the camera, to the details of the organisms and number of tags. anTraX comes with a graphical user interface to tune and verify the value of these parameters. anTraX also contains a user interface for creating an image database and training the CNN for tracklet classification.

Parallelization and usage on computer clusters

anTraX was specifically designed to process large-scale behavioral experiments, which can contain hundreds of video files and tens of terabytes of data. anTraX includes scripts to process such large datasets in batch mode where individual video files are tracked in parallel on multicore computers and high-performance computer clusters. Following per-video processing, anTraX will run a routine to ‘stitch’ the results of the individual files together (see online documentation).

Availability and dependencies

The core tracking steps of anTraX are implemented using MATLAB version 2019a, while the classification parts are implemented using TensorFlow v1.15 in the Python 3.6 environment. Compiled binaries are available for use with the freely available MATLAB Runtime Library and can be run with a command line interface. anTraX can be run on Linux/OSX systems, and large datasets benefit considerably from parallelization on computer clusters. anTraX depends on the free FFmpeg library for handling video files. The result files are readable with any programming language, and we supply a Python module for easily interfacing with output data. anTraX is distributed under the GPLv3 license, and its source code and binaries are freely available (Gal et al., 2020a). anTraX is a work in progress and will be continuously extended with new features and capabilities. Online documentation for installing and using the software is available at http://antrax.readthedocs.io. Users are welcome to subscribe, report issues, and suggest improvements using the GitHub interface.

Results

anTraX tracks individual ants with near-human accuracy over a wide range of conditions

As with any tracking algorithm, the performance of anTraX depends on many external factors, such as image quality, frame rate, the quality of the color tags (size, color set, number of tags per individual), and the behavior of the organisms (e.g. their tendency to aggregate and their activity level). anTraX was benchmarked using a number of datasets spanning a variety of experimental conditions (e.g. image quality and resolution, number of tracked individuals, number of tags and colors, size variability in the colony) and study organisms, including four different ant species, as well as the fruit fly Drosophila melanogaster (Table 1, Figure 3 and its supplements). All benchmark datasets, together with the raw videos, full description, configuration files, and trained classifiers are available for download (Gal et al., 2020b).

Figure 3 (with 17 supplements)
Example of anTraX tracking output, based on the J16 dataset.

In this experiment, the ants are freely behaving in a closed arena that contains the nest (the densely populated area on the top left) and exploring ants. A short annotated clip from the tracked dataset is given as Figure 3—video 1. Tracking outputs and annotated videos of all datasets are also given in the supplementary figures and videos of this figure. (A) A labeled frame (background subtracted), showing the location of each ant in the colony, as well as a ‘tail’ of the last 10 s of trajectory. Ants that are individually segmented have precise locations. The ants clustered together have approximate locations. Labels indicate the color tag combination of the ant (e.g. ‘BG’ indicates a blue thorax tag and a green abdomen tag; colors are blue (B), green (G), orange (O), and pink (P)). (B) Individual trajectories for each ant in the colony, based on 1 hr of recording. (C) A cropped image of each ant from the video.

Table 1
Summary description of the benchmark datasets.

All raw videos and parameters of the respective tracking session are available for download (Gal et al., 2020b).

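| Dataset | Species | #Animals | #Colors | #Tags | Open ROI* | Duration (hr) | Camera | FPS | Image size (pixels) | Resolution (pixels/mm) |
|---|---|---|---|---|---|---|---|---|---|---|
| J16 | Ooceraea biroi | 16 | 4 | 2 | No | 24 | Logitech C910 | 10 | 960 × 720 | 10 |
| A36 | Ooceraea biroi | 36 | 6 | 2 | No | 24 | PointGrey Flea3 12MP | 10 | 3000 × 3000 | 25 |
| C12 | Camponotus fellah | 12 | 7 | 3 | No | 6 | Logitech C910 | 10 | 2592 × 1980 | 17 |
| C32 | Camponotus sp. | 28 | 6 | 3 | No | 24 | PointGrey Flea3 12MP | 10 | 2496 × 2500 | 13 |
| G6 × 16 | Ooceraea biroi | 6 × 16 | 3 | 2 | No | 1.33 | Logitech C910 | 10 | 2592 × 1980 | 17 |
| V25 | Ooceraea biroi | 25 | 5 | 2 | Yes | 3 | Logitech C910 | 10 | 2592 × 1980 | 17 |
| T10 | Temnothorax nylanderi | 10 | 5 | 4 | No | 6 | Logitech C910 | 10 | 2592 × 1980 | 17 |
| D7 | Drosophila melanogaster | 7 | 7 | 1 | No | 3 | PointGrey Flea3 12MP | 18 | 1056 × 1050 | 26 |
| D16 | Drosophila melanogaster | 16 | 4 | 2 | No | 5 | PointGrey Flea3 12MP | 18 | 1200 × 1200 | 16 |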
ROI: region of interest; FPS: frames per second. *Whether or not the ants can leave the tracked region. Dataset G6 × 16 is derived from six replicate colonies with 16 ants each.

The performance of the tracking algorithm can be captured using two separate measures. The first is the rate of assignment, defined as the ratio of assigned locations in the experiments to the total possible assignments (i.e. the number of IDs times the number of frames). The second measure is the assignment error, defined as the ratio of wrong assignments to the total number of assignments made. While the assignment rate can be computed directly and precisely from the tracking results, the error rate in assigning IDs for a given data set needs to be tested against human annotation of the same dataset. Because the recording duration of these datasets is typically long (many hours), it is impractical to manually annotate them in full. Instead of using fewer or smaller datasets, which would have introduced a sampling bias, we employed a validation approach in which datasets were subsampled in a random and uniform way. In this procedure, a human observer was presented with a sequence of randomly selected test points, where each test point corresponded to a location assignment made by the software to a specific ID in a specific frame. The user was then asked to classify the assignment as either ‘correct’ or ‘incorrect’. If the user was unsure of the correctness of the assignment, they could skip to the next one. The process was repeated until the user had identified 500 points as either correct or incorrect. The accuracy of the tracking was measured as the ratio of correct test points to the sum of correct and incorrect test points, as determined by the human observer. This procedure samples the range of experimental conditions and behavioral states represented in each of the datasets in an unbiased manner, and provides a tracking performance estimate that can be applied and compared across experiments. Overall, anTraX performed at a level close to the human observer in all benchmark datasets (Table 2).
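The two performance measures, together with the Clopper-Pearson interval reported alongside them in Table 2, can be written out as a short Python sketch. The function names are hypothetical, and the interval is obtained here by simple bisection on the binomial distribution function rather than by whatever routine was used for the published values.

```python
import math


def assignment_rate(n_assigned, n_ids, n_frames):
    """Assigned locations divided by all possible (ID, frame) pairs."""
    return n_assigned / (n_ids * n_frames)


def assignment_error(n_correct, n_incorrect):
    """Share of validated test points judged incorrect (skipped points excluded)."""
    return n_incorrect / (n_correct + n_incorrect)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for k errors in n test points."""
    def solve(f, target):
        # f is decreasing in p; bisect for the p at which f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Example: 6 of 500 validated test points judged incorrect.
error = assignment_error(n_correct=494, n_incorrect=6)
ci_low, ci_high = clopper_pearson(6, 500)
# Example: no incorrect test points among 500.
zero_low, zero_high = clopper_pearson(0, 500)
```

Because the interval depends only on the number of incorrect points and the fixed sample of 500, datasets with the same error count share the same interval, which is why identical bounds recur in Table 2.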

Table 2
Summary of tracking performance measures for the benchmark datasets using anTraX.

Assignment rate is defined as the proportion of all data points (the number of individuals times the number of frames) in which a blob assignment was made. In cases of closed boundary regions of interest (ROIs; in which the tracked animals cannot leave the tracked region) this measure is in the range of 0–1. In cases of open boundary ROIs (marked with asterisks; e.g., dataset V25), the upper boundary is lower, reflecting the proportion of time the ants are present in the ROI. The assignment error is an estimate of the proportion of wrong assignments (i.e. cases in which an ant ID was assigned to a blob that does not contain the respective ant). As explained in the text, the estimate is obtained by presenting the user with a sequence of randomly sampled assignments from the dataset and measuring the proportion of assignments deemed ‘incorrect’ by the observer, relative to the sum of all ‘correct’ and ‘incorrect’ assignments. To calculate the error rates reported in the table, the presentation sequence continued until exactly 500 assignments were marked as ‘correct’ or ‘incorrect’, ignoring cases with the third response ‘can’t say’. A 95% confidence interval of the error according to the Clopper-Pearson method for binomial proportions is also reported in the table. To quantify the contribution of using graph propagation in the tracking algorithm, the analysis was repeated ignoring assignments made during the graph propagation step, and the results are reported here for comparison. A graphical summary of the performance measures is shown in Figure 4A–B.

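| Dataset | Assignment rate (without graph propagation) | Assignment error (without graph propagation) | Assignment error 95% CI (without graph propagation) | Assignment rate (with graph propagation) | Assignment error (with graph propagation) | Assignment error 95% CI (with graph propagation) |
|---|---|---|---|---|---|---|
| J16 | 0.28 | 0.012 | 0.0044–0.026 | 0.93 | 0 | 0–0.0074 |
| A36 | 0.24 | 0.014 | 0.0056–0.0286 | 0.81 | 0.006 | 0.0012–0.0174 |
| C12 | 0.82 | 0 | 0–0.0074 | 0.99 | 0 | 0–0.0074 |
| C32 | 0.26 | 0.042 | 0.0262–0.0635 | 0.79 | 0.022 | 0.011–0.039 |
| G6 × 16 | 0.57 | 0.122 | 0.0946–0.154 | 0.89 | 0.078 | 0.056–0.105 |
| V25 | 0.07* | 0.058 | 0.0392–0.0822 | 0.48* | 0.012 | 0.0044–0.026 |
| T10 | 0.56 | 0.06 | 0.041–0.0845 | 0.96 | 0.018 | 0.0083–0.0339 |
| D7 | 0.88 | 0 | 0–0.0074 | 0.98 | 0 | 0–0.0074 |
| D16 | 0.89 | 0.004 | 0.0005–0.0144 | 0.997 | 0 | 0–0.0074 |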
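The Clopper-Pearson intervals in Table 2 can be reproduced from the number of ‘incorrect’ responses out of 500 decided test points. The sketch below is a dependency-free illustration that inverts the binomial tail probabilities by bisection; the function names are hypothetical, and this is not necessarily how the published values were computed.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve(f, target, increasing):
    """Bisection for f(p) == target on [0, 1], with f monotonic in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k errors in n trials."""
    # Lower bound solves P(X >= k | p) = alpha/2 (increasing in p).
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound solves P(X <= k | p) = alpha/2 (decreasing in p).
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper
```

For example, an error of 0.012 corresponds to 6 incorrect assignments out of 500, and `clopper_pearson(6, 500)` gives an interval of approximately 0.0044–0.026, as reported in the table.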

Graph inference dramatically improves tracking performance

The main novelty of anTraX compared to other tracking solutions is the use of a tracklet graph for ID inference. This method increases the tracking performance in several ways. First, it allows identification of tracklets that are unidentifiable by the classifier, using propagation of IDs from classified tracklets. Second, it corrects classification errors by overriding low-reliability assignments made by the classifier with IDs propagated from high-reliability tracklets. Third, it assigns IDs to multi-individual blobs and tracklets. This provides an approximate location for analysis, even when an animal cannot be individually segmented. Table 2 and Figure 4A–B show the increase in assignment coverage and decrease in assignment errors following graph propagation in all benchmark datasets.

Figure 4
Tracking performance.

(A) Contribution of graph inference to reduction of assignment error. The graph compares the assignment error in the benchmark datasets, defined as the rate of assigning wrong IDs to blobs across all IDs and frames in the experiment, and estimated as explained in the main text, before the graph propagation step of the algorithm (blue circles, ‘noprop’ category) and after the graph propagation step (orange circles, ‘final’ category). (B) Contribution of graph inference to increased assignment rate (the ratio of assignments made by anTraX to the total number of assignments possible in the experiment) in the benchmark datasets. The graph compares the assignment rate, as defined in the main text, before and after the graph propagation step (same depiction as in A). The performance measures for all benchmark datasets are reported in Table 2 and Figure 4—source data 1. (C–D) Same as in A and B, calculated for a large-scale dataset described in the text (10 colonies of 16 ants, recorded over 14 days). The performance measures for all replicas are reported in Figure 4—source data 2. (E) Generalizability of the blob classifier. Each point in categories 1–4 represents the generalization error of one classifier (trained with examples from the number of replicas corresponding to its category) on data from one replica that was not used for its training. The replicas were recorded under similar conditions, but using different ants, different cameras, and different experimental setups. For classifiers trained on more than one replica, the combinations of replicas were randomly chosen, while maintaining the constraint that each replica is tested against the same number of classifiers in each condition. In the category ‘All’, the points depict the validation error of the full classifier, trained on data from the 10 replicas. All classifiers used the same network architecture, were trained from scratch, and were trained until saturation. The dashed line represents the mean validation error for the full classifier. The list of errors for all trained classifiers is given in Figure 4—source data 3.

To further demonstrate the utility of graph propagation, we used data from a full, large-scale experiment. We tracked the behavior of 10 clonal raider ant colonies, each consisting of 16 ants, for 14 days. The colonies were filmed at relatively low resolution using simple webcams (Logitech C910, 960 × 720 pixels image size, 10 frames per second), similar to that of benchmark dataset J16. This dataset represents a relatively challenging classification scenario, because the tags are small, and the colors are dull. Figure 4C–D show a comparison of assignment rate and accuracy across the 10 replicates before and after graph propagation, with both measures improving greatly. Moreover, the assignments made by the propagation algorithm are as reliable as the assignments made directly by the classifier (Figure 4—figure supplement 1). The propagation algorithm is also robust to classification errors, and successfully blocks their propagation on the tracklet graph (Figure 4—figure supplement 2).

The blob classifier generalizes well across experimental replicates

Collecting examples and training the blob classifier is the most time-consuming step in the tracking pipeline, and a good classification is essential for high-quality tracking (Figure 4—figure supplement 3). Ideally, a universal blob classifier would be trained to identify the same tag combination across experiments, without the need to retrain a classifier for each experiment. In reality, however, this is impractical. CNN classifiers do not generalize well outside the image distribution they were trained on, so even apparently small changes in experimental conditions (e.g. the type or level of lighting used, or the color tuning of the camera) can markedly decrease classification performance. Nevertheless, when experiments are conducted using similar conditions (e.g. study organism, marking technique, experimental setup, etc), it is possible to construct a classifier that will generalize across these experiments with minimal or no retraining. This enables construction of efficient tracking pipelines for high-throughput and replicate experiments, without the need for additional manual annotations.

We assessed the generalizability of blob classifiers with the 10 replicates of the experiment described in the previous section. We trained a classifier on examples from one replicate, and then used it to classify blobs sampled from the other replicates. We similarly evaluated the performance of classifiers trained with examples from two, three, and four replicates, and compared the results to the performance of a classifier trained on examples from all replicates. The comparison shows that, despite variability in animal shape and behavior, tagging process, cameras, and experimental setups across replicates, the classifier performs remarkably well (Figure 4E). Moreover, a classifier trained with an example set obtained from as few as two replicates performs about as well as a classifier trained with examples from all replicates. The generalizability of this result will depend on how well conditions are standardized between replicates or experiments. Nevertheless, it demonstrates that robust behavioral tracking pipelines can be constructed with minimal retraining.

anTraX can be combined with JAABA for efficient behavioral annotation of large datasets

While direct analysis of the tracking output is a possibility, phenotyping high-throughput experiments and extracting useful information from large-scale trajectory data beyond very simple measures are challenging and impractical. In recent years, the field of computational ethology has shifted to the use of machine learning, both supervised and unsupervised, for analyzing behavioral data (Todd et al., 2017; Egnor and Branson, 2016; Datta et al., 2019). One of the most useful and mature tools is JAABA, a package for behavioral annotation of large datasets using supervised learning (Kabra et al., 2013; Robie et al., 2017b). In short, JAABA projects trajectory data onto a high dimensional space of per-frame features. The user then provides the software with a set of examples for a certain behavior, and the software trains a classifier to find all occurrences of that behavior in a new dataset. anTraX includes functions to generate the per-frame data in a JAABA-compatible way. In addition to the original list of JAABA features, a set of anTraX-specific features is also generated (see online documentation for details). Beyond useful information about the appearance and kinematics of the tracked animals, these extra features provide information about whether an animal was segmented individually or was part of a multi-animal blob. This enables JAABA to learn behaviors that can only be assigned to individually segmented animals, such as those that depend on the velocity of the animal. The user can then label examples and train a classifier in the JAABA interface. This classifier can then be used to analyze entire experiments using the anTraX interface.

To demonstrate the power of this approach, we present two examples of using JAABA together with anTraX. In the first example, we train a classifier to detect O. biroi ants carrying a larva while walking. O. biroi ants carry their larvae under their body, in a way that is not always obvious even to a human observer (Figure 5A, Figure 5—video 1). By using subtle changes in the ants’ appearance and kinematics, JAABA is able to classify this behavior with >93% accuracy (tested on a set of annotated examples not used for training). An example of trajectories from a 30 min period annotated with JAABA is shown in Figure 5B.

Figure 5
Interfacing anTraX with third party behavioral analysis packages for augmenting tracking data.

(A) Ants carrying a larva while they move (green/green and yellow/blue) can be difficult to distinguish from ants not carrying larvae (blue/green and yellow/purple), even for a human observer. Figure 5—video 1 shows examples for ants walking with and without a larva. (B) However, using labeled examples to train a classifier, JAABA can reliably distinguish ants walking while carrying a larva from ants walking without one from anTraX tracking output. Shown here is a 30 min segment from the A36 dataset, where trajectories classified by JAABA as ants carrying a larva are plotted in red on the background of all other tracks (in gray). (C) Classifying stops using JAABA. The plot shows a 60 min segment from the A36 experiment, where all stops longer than 2 s are marked with a colored dot. The stops are classified into four categories: rest (red), local search (green), self-grooming (blue), and object-interaction (e.g. with a food item; pink). Figure 5—video 2 shows examples of stops from all types. (D) Applying a simple DeepLabCut model to track the ants’ antennae and main body axes, shown on segmented ant images from dataset A36. Figure 5—video 3 shows an animated tracking of all ants in the colony. (E–F) Using the results from DeepLabCut to track the behavior of an ant along its trajectory. A one-hour trajectory of one ant from dataset A36 is shown on the background of the tracks of all other ants in the colony in that period (in gray). In E, the focal trajectory is colored according to the total rate of antennal movement (measured in angular velocity units rad/s). In F, the focal trajectory is colored according to how much the antennae move in-phase or anti-phase (measured in angular velocity units rad/s). Together, these panels show the behavioral variability in antennal movement patterns.

In the second example, we used JAABA to classify the behavior of ants during periods when they are not moving. We trained a classifier to detect four distinct behaviors (Figure 5—video 2): rest, in which the ant is completely immobile; local search, in which the ant does not move but uses its antennae to explore the immediate environment; self-grooming, in which the ant stops to groom itself; and object-interaction, in which the ant interacts with a non-ant object such as a piece of food, a larva or a trash item. JAABA was able to identify these behaviors with >92% accuracy. Figure 5C shows the spatial distribution of the classified behaviors during all periods where an ant stops walking for more than 2 s in a 60-min experiment, across all ants in the colony.

anTraX can be combined with DeepLabCut to augment positional data with pose tracking

Much attention has recently been given to tracking the relative position of animal body parts, taking advantage of the fast progress in machine learning and computer vision (Pereira et al., 2019; Mathis et al., 2018; Graving et al., 2019). This allows for the measurement and analysis of aspects of an animal’s behavior beyond what is extractable from its trajectory. Although these tools can in principle be directly applied to videos with multiple individuals (Iqbal et al., 2017; Insafutdinov et al., 2016), they are still not mature enough for large-scale use. A more reasonable approach is to combine individual animal pose tracking with a track-and-crop step (see discussion within Graving et al., 2019). To track body parts of individual animals within a group or a colony, we took advantage of the fact that anTraX segments and crops the images of individual animals as part of its workflow, and included an option to run pre-trained DeepLabCut models (Mathis et al., 2018) on these images, without the need to export the data in a DeepLabCut-readable format (which would have resulted in a heavy computational overhead). This way, the positions of the tracked body parts relative to the animal’s centroid are returned together with the spatial location of the centroid. For training such a model, anTraX enables exporting cropped single-animal videos that are loadable into the DeepLabCut user interface. Currently, this is only supported for single-animal tracklets, where animals are segmented individually.

Of course, the ability to perform accurate and useful pose estimation depends on the resolution at which animals appear in the video. To demonstrate the potential of this approach, we trained a simple DeepLabCut model to track the main body axis and antennae positions of ants from benchmark dataset A36. Figure 5D and Figure 5—video 3 show examples from the segmented and cropped images of the individual ants in the videos.

Ants use different antennation patterns to explore their environment (Draft et al., 2018), and the ability to track these patterns in parallel to their movement in space can contribute to our understanding of their sensory processing during free behavior. We used the pose tracking results to visualize the different modes of antennae movement used by the ants to explore their environment. Figure 5E and F show the total movement rate and the relative phase of the two antennae, respectively, along the trajectory of one ant in a 1-hr segment of the experiment, demonstrating the variability and richness inherent to these patterns.

Discussion

anTraX is a new algorithm and software package that provides a solution for a range of behavioral tracking problems not well addressed by available methods. First, by using a deep neural network for image classification, it enables the tracking of insects that are individually marked with color tags. While color tags have been used successfully for behavioral analysis for decades in a wide range of social insects, and in many species they are the only practical type of marker, their use has been severely limited by the lack of automation. Second, unlike other existing approaches, it handles cases where insects tightly aggregate and are not segmentable, as well as cases where the tags are obscured. This is achieved by representing the tracking data as a directed graph, and using graph walks and logical operations to propagate information from identified to unidentified nodes. Third, anTraX handles very long experiments with many replicate colonies and minimal human oversight, and natively supports parallelization on computational clusters for particularly large datasets. Finally, anTraX can easily be integrated into the expanding ecosystem of open-source software packages for behavioral analysis, making a broad range of cutting-edge ethological tools available to the social insect community. anTraX is open-source and modular, with each step of the algorithm (segmentation, linking, classification, and propagation) implemented as a separate module that can be easily augmented or replaced to fit experimental designs that are not well handled by the current version of the algorithm. For example, the traditional background-subtraction segmentation can be replaced with a deep learning-based semantic segmentation, that is, training and using a classifier to label each pixel of the image as belonging to either background or foreground (Rajchl et al., 2017; Moen et al., 2019; Badrinarayanan et al., 2017). 
This can potentially allow analysis of field experiments with natural backgrounds, or experiments with non-static backgrounds, such as videos taken with a moving camera. Another possible extension is an informed ‘second pass segmentation’ step, where multi-animal blobs are further segmented into single-animal blobs, taking into account the composition of the blob (number and IDs of animals). Knowing the composition of the blob provides a method to algorithmically validate the segmentation, allowing a ‘riskier’ segmentation approach. Another approach to locate animals in aggregations more precisely is to use neural network-based detection of the tags themselves. This method has successfully been used for bees tagged with fiducial markers inside a hive (Wild et al., 2018). Having a record of the composition of tracklets and blobs also paves the way for performing image-based behavioral analysis of interactions (Dankert et al., 2009; Klibaite et al., 2017; Klibaite and Shaevitz, 2019), or constructing specialized image classifiers for interaction types (e.g. allogrooming, trophallaxis, aggression, etc). Lastly, a newer generation of pose-estimation tools, including SLEAP (Pereira et al., 2020) and the recent release of DeepLabCut with multi-animal support, enable the tracking of body parts for multiple interacting animals in an image. These tools can be combined with anTraX in the future to extend pose tracking to multi-animal tracklets, and to augment positional information for individual animals within aggregations.

In summary, anTraX fills an important gap in the range of available tools for tracking social insects, and considerably expands the range of trackable species and experimental conditions. It also interfaces with established ethological analysis software, thereby making these tools broadly accessible for the study of social insects. anTraX therefore has the potential to greatly accelerate our understanding of the mechanisms and principles underlying complex social and collective behavior.

Appendix 1

Detailed description of the anTraX algorithm

The anTraX algorithm consists of three main steps (Appendix 1—figure 1). In the first step, we detect the tracked animals in each frame of the video and organize the extracted blobs into trajectory pieces we term tracklets. As we will detail below, these tracklets are in turn linked together to form an acyclic directed graph we name the tracklet graph. The second step of the algorithm is tracklet classification, in which identifiable tracklets are classified based on the color tag information present in their blobs. In the third step of the algorithm, we use the topology of the tracklet graph to propagate identity information from the classified tracklets to the entire set of tracklets.

In this appendix, we detail each part of the algorithm and fully describe its various computational steps and parameters. A practical tutorial for running the software and using its graphical interface can be found in the online documentation.

Appendix 1—figure 1
Flow diagram of the anTraX algorithm.

1. Creating the tracklet graph

1.1 Creating a background image

anTraX uses background subtraction for segmentation. Although using a static background is somewhat limiting in designing and performing experiments (requiring a static environment and a static camera), and it is possible to segment images for tracking without this step if there is a decent contrast between the objects and the background, background subtraction has the advantage of giving a stable object segmentation that simplifies later steps.

For creating a background image, anTraX uses random sampling of frames from the entire duration of the experiment, or from a segment defined by the user (Appendix 1—figure 2A,B). The number of frames nB is configurable, and the background IBG is computed by applying either a per-pixel median or max operator:

(1A) $I_{BG}(i,j,c) = \operatorname{med}\{ I_{t_b}(i,j,c) \}_{b=1}^{n_B}$
(1B) $I_{BG}(i,j,c) = \max\{ I_{t_b}(i,j,c) \}_{b=1}^{n_B}$

where $t_b$ is a randomly drawn timepoint in the experiment, $I_{t_b}$ is the corresponding frame, $i$ and $j$ are the image coordinates, and $c$ is the color channel index.

Appendix 1—figure 2
Background creation.

(A) An example raw frame. (B) A background frame generated using a median function. Regions outside the ROI mask are dimmed. (C) Full segmented frame.

Generally, the median operation is useful in cases where animals are active enough to have each pixel in the image free of animals for at least half the frames. Otherwise, the max operation gives better results. The anTraX GUI enables the user to test and optimize the parameters in the background image creation step.
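The background computation of Equations 1A and 1B can be sketched with NumPy as follows. The function name, the array layout (frames stacked as a `(T, H, W, 3)` array) and the fixed random seed are illustrative assumptions, not the anTraX implementation.

```python
import numpy as np

def make_background(frames, n_b, method="median", seed=0):
    """Per-pixel background from n_b randomly sampled frames.

    frames: array of shape (T, H, W, 3). Returns an (H, W, 3) array.
    """
    rng = np.random.default_rng(seed)
    # Draw n_b distinct timepoints t_b from the available frames.
    idx = rng.choice(len(frames), size=min(n_b, len(frames)), replace=False)
    sample = frames[idx]
    if method == "median":   # Eq. 1A
        return np.median(sample, axis=0)
    if method == "max":      # Eq. 1B, assumes dark animals on a light background
        return sample.max(axis=0)
    raise ValueError(f"unknown method: {method}")
```

With the median operator, a pixel covered by a dark animal in more than half of the sampled frames ends up dark in the background, whereas the max operator recovers the bright background as long as the pixel is free in at least one sampled frame.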

1.2 Creating an ROI mask

Typically, tracking should be performed only in part of the image, either because the animals to be tracked are confined to a region smaller than the image, or because the user cares about behavior in a small region of interest (ROI). The ROI mask IROI (Appendix 1—figure 2B) is a binary image with the same dimensions as the video frames, which is 1 in regions to be tracked and 0 in regions to be ignored.

The anTraX GUI includes a utility to create the mask by drawing shapes to be included or excluded on a frame.

1.3 Image segmentation

The first step in analyzing each frame is segmenting it into blobs (Appendix 1—figure 2C, Appendix 1—figure 3): contiguous regions of the frame that significantly differ from the background and correspond to individual animals or tightly clustered groups of animals. Segmentation is done by first subtracting the image from the background (using the fact that the animals are dark and tracked on a light background), then converting the difference to a grayscale image (Appendix 1—figure 3A-B), and comparing to a user defined threshold θs and the ROI mask to produce a binary image:

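(2) $I_t^1(i,j,c) = I_{BG}(i,j,c) - I_t(i,j,c)$
(3) $I_t^2(i,j) = \frac{1}{3}\sum_c I_t^1(i,j,c)$
(4) $I_t^{bw}(i,j) = \begin{cases} 1, & I_t^2(i,j) \cdot I_{ROI}(i,j) > \theta_s \\ 0, & \text{else} \end{cases}$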
Appendix 1—figure 3
Image segmentation.

(A) Raw image. (B) Background subtracted grayscale image. (C) Unfiltered binary image. (D) Final segmented image after morphological operations and blob filtering. Each separate blob is shown in a different color.

The resulting binary image (Appendix 1—figure 3C) will then undergo optional morphological operations (image closing, image opening, hole-filling, convex hull filling) that, depending on the specific conditions of the experiment, are useful for noise reduction.

Blobs (connected components; using the eight-connectivity criterion) are then extracted from the final binary image. For each detected blob, we register the coordinates of its centroid, its area, its maximal intensity (in the It2 grayscale image) and the parameters of the best fitted ellipse (orientation, eccentricity, and major axis length). Blobs are then optionally filtered by minimal area and minimal intensity criteria (Appendix 1—figure 3D).

The anTraX GUI allows the user to test and configure all the segmentation parameters.
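A minimal sketch of the segmentation step (Equations 2–4) followed by blob extraction with the eight-connectivity criterion is given below. The optional morphological operations and the area and intensity filters are omitted, and the function names are hypothetical rather than taken from anTraX.

```python
import numpy as np
from collections import deque

def segment(frame, background, roi, theta_s):
    """Background subtraction, grayscale conversion and thresholding."""
    diff = background.astype(float) - frame.astype(float)  # Eq. 2
    gray = diff.mean(axis=2)                                # Eq. 3
    return (gray * roi) > theta_s                           # Eq. 4

def label_blobs(bw):
    """Label 8-connected components of a binary image with 1..n."""
    labels = np.zeros(bw.shape, dtype=int)
    n = 0
    for i, j in zip(*np.nonzero(bw)):
        if labels[i, j]:
            continue          # pixel already belongs to a labeled blob
        n += 1
        labels[i, j] = n
        queue = deque([(i, j)])
        while queue:          # breadth-first flood fill
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if (0 <= yy < bw.shape[0] and 0 <= xx < bw.shape[1]
                            and bw[yy, xx] and not labels[yy, xx]):
                        labels[yy, xx] = n
                        queue.append((yy, xx))
    return labels, n
```

Blob properties such as centroid and area can then be computed per label, for example with `np.argwhere(labels == k)`.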

1.4 Linking blobs across frames

After blobs are extracted from a frame, the next step in the algorithm is to link them to the blobs in the previous frame (Appendix 1—figure 4A-E): a link between a blob in frame t and a blob in frame t-1 implies that some or all of the individual animals that belong to the first blob are present in the second one. A blob can be linked to one blob (the simplest case, where the two blobs have the same composition), to a few blobs (where animals leave or join the blob), or to none (suggesting the animals in the blob were not detected in the other frame). We rely on the fact that videos were recorded at a frame rate high enough that blobs corresponding to the same individuals will overlap in consecutive frames even when the tracked animals are moving at their maximum possible speed (for O. biroi ants, for example, 10 frames per second is sufficient). Under this condition, the most accurate method to link blobs is optical flow, which takes into account the actual pixel content of the image. It is, however, computationally expensive, and running it on full frames is not practical for long, high-resolution videos. On the other hand, simpler and commonly used methods, such as the popular Munkres linear assignment algorithm (the Hungarian algorithm; Munkres, 1957), are prone to errors in dense problems such as those we aim to solve, and often require a considerable amount of manual correction after automated tracking.

Appendix 1—figure 4
Detailed linking example.

(A–B) Raw images of the first and second frame, respectively. (C) Color blend of the frames, showing the displacement of the ants between frames. (D–E) Segmentation of the first and second frame, respectively. (F) Segmentation blend. Also shown is the clustering of the blobs into linking problems (gray background). The two upper problems are trivial, and no assignment algorithm is required. The problem at the bottom will be solved using optical flow. (G) Optical flow for the bottom problem in F. Arrows represent the estimated translation of the pixels. (H) Final linking between the blobs based on optical flow.

In sophisticated tracking solutions, the distance-based cost function that underlies the linear assignment is corrected with predictive modeling of the animals’ behavior, or with other distinguishing features of the animals such as shape, orientation, and appearance. These, however, are often problem-specific and do not generalize well across tracking problems. We chose to implement a dynamic approach, in which the linking method is chosen based on the difficulty of assignment. The linking step begins with dividing the linking problem into a few independent subproblems, by using a maximal linking distance (dlink), which by default is set to twice the maximal velocity vmax times the inter-frame time interval. Practically, this is done by creating a binary image, defined as the pixel-wise logical OR of the two segmented binary frames, and dilating it using a disk with a radius equal to dlink (Appendix 1—figure 4F). The resulting image is then divided into connected components, and all the blobs that overlap with each component are treated as an independent subproblem. For each subproblem we choose the appropriate linking method: (i) a problem with one or more blobs in one of the frames and no blobs in the other results in no links, (ii) a problem with exactly one blob in each of the frames will link the blobs with no further processing, (iii) otherwise, a small region containing only the blobs in the subproblem will be cropped from each of the frames, and an optical flow assignment will be performed (Appendix 1—figure 4G).
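The choice of linking method per subproblem can be sketched as follows. The sketch assumes that the dilation and connected-component labeling have already been carried out, so that each blob is mapped to the component it overlaps; all function names and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

def max_linking_distance(v_max, fps):
    """Default d_link: twice the maximal velocity times the inter-frame interval."""
    return 2.0 * v_max / fps

def split_into_subproblems(prev_component, curr_component):
    """Group blobs by the connected component of the dilated image they overlap.

    prev_component / curr_component: dict blob_id -> component_id.
    Returns dict component_id -> (previous blobs, current blobs).
    """
    problems = defaultdict(lambda: ([], []))
    for blob, comp in prev_component.items():
        problems[comp][0].append(blob)
    for blob, comp in curr_component.items():
        problems[comp][1].append(blob)
    return dict(problems)

def choose_linking_method(n_prev, n_curr):
    """Cases (i)-(iii) described in the text."""
    if n_prev == 0 or n_curr == 0:
        return "no_links"        # (i) blobs in only one of the frames
    if n_prev == 1 and n_curr == 1:
        return "direct"          # (ii) one blob in each frame
    return "optical_flow"        # (iii) everything else
```

Only subproblems of the third kind pay the cost of optical flow, and then only on a cropped region.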

For solving a subproblem using optical flow, we do the following: We first crop a region from the two frames, corresponding to the bounding box of the subproblem’s connected component. This region includes all of the blobs that belong to this subproblem, but no others. We then compute the optical flow field between the two cropped frames using the Horn-Schunck method (Horn and Schunck, 1981). Next, we define the Flow Number, n_of(a,b), for each pair of blobs across the two frames as the number of flow field vectors pointing from blob a in frame t-1 to blob b in frame t. The flow number is an estimate of the number of pixels in blob a that have moved to blob b in the consecutive frame. For each pair, if the flow number is greater than a threshold number θof, the blobs are linked (Appendix 1—figure 4H). The threshold number defaults to a third of the minimal size of a single animal in pixels and can be configured using the anTraX graphical interface.
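Given a flow field, the flow number and the resulting links can be computed as in the sketch below. The flow field is taken as an input here (computing it with the Horn-Schunck method is not shown), and the data layout (2D lists of labels and displacements) and function names are assumptions for illustration.

```python
from collections import Counter

def flow_numbers(labels_prev, labels_curr, flow_u, flow_v):
    """Count, for each blob pair (a, b), the flow vectors that start in
    blob a of frame t-1 and end in blob b of frame t.

    labels_*: 2D lists of blob labels (0 = background).
    flow_u, flow_v: per-pixel displacement along x (columns) and y (rows).
    """
    h, w = len(labels_prev), len(labels_prev[0])
    counts = Counter()
    for y in range(h):
        for x in range(w):
            a = labels_prev[y][x]
            if a == 0:
                continue                      # background pixel
            ty = round(y + flow_v[y][x])      # pixel position after the flow
            tx = round(x + flow_u[y][x])
            if 0 <= ty < h and 0 <= tx < w and labels_curr[ty][tx] != 0:
                counts[(a, labels_curr[ty][tx])] += 1
    return counts

def link_blobs(counts, theta_of):
    """Link every pair whose flow number exceeds the threshold."""
    return sorted(pair for pair, n in counts.items() if n > theta_of)
```

In this form a blob that splits produces links to both of its successors, provided that enough pixels flow to each of them.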

Once again, all the parameters of the linking step can be configured and tested in the anTraX GUI.

1.5 Updating the tracklet graph

As defined above, a blob can correspond to an arbitrary number of tracked individuals. Instead of trying to break these blobs down into individual animals, our tracking approach relies on registering the transition of individuals between blobs that possibly contain multiple animals. For this purpose, we define the tracklet as a list of linked blobs in consecutive frames that have the same composition of individuals. In other words, no animal has left or entered the group between the first and last frame of the tracklet (Figure 1 in the main text).

After linking the blobs in frame t to those in frame t-1, the tracklets are updated in the following way:

  1. A blob in the current frame t that is not linked to any blob in the previous frame t-1 will ‘open’ a new tracklet.

  2. A blob in the previous frame that is not linked to any blob in the current frame will ‘close’ its tracklet.

  3. If two blobs in the previous and current frames are exclusively linked, the current blob will be added to the tracklet of the previous blob.

  4. Whenever blobs in the current or previous frame are connected to more than one blob, the tracklets of the linked blobs in the previous frames will ‘close’, and new tracklets will ‘open’ for the blobs in the current frame. In these cases, the linking between the blobs across different tracklets will be registered as a link between the tracklets. In cases where a tracklet has its last blob linked to the first blob of a different tracklet, the former is defined as the parent tracklet, and the latter as its child tracklet.
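The four update rules above can be sketched for a single pair of consecutive frames as follows. The data structures (blob identifiers, a set of link pairs, and a dictionary from blobs to tracklet identifiers) are hypothetical and chosen only to make the rules explicit.

```python
def update_tracklets(prev_blobs, curr_blobs, links, tracklet_of, new_id):
    """Apply the four update rules for one pair of consecutive frames.

    links: set of (prev_blob, curr_blob) pairs.
    tracklet_of: dict mapping each previous blob to its open tracklet.
    new_id: callable returning a fresh tracklet id.
    Returns (tracklet_of_curr, closed_tracklets, parent_child_edges).
    """
    links = sorted(links)
    out = {p: [c for q, c in links if q == p] for p in prev_blobs}
    inc = {c: [p for p, d in links if d == c] for c in curr_blobs}
    current, closed, edges = {}, [], []
    for c in curr_blobs:
        parents = inc[c]
        if len(parents) == 1 and len(out[parents[0]]) == 1:
            # Rule 3: exclusive link, extend the existing tracklet.
            current[c] = tracklet_of[parents[0]]
        else:
            # Rules 1 and 4: open a new tracklet and register its parents.
            current[c] = new_id()
            edges += [(tracklet_of[p], current[c]) for p in parents]
    for p in prev_blobs:
        children = out[p]
        exclusive = len(children) == 1 and len(inc[children[0]]) == 1
        if not exclusive:
            # Rules 2 and 4: no link, or a non-exclusive link, closes the tracklet.
            closed.append(tracklet_of[p])
    return current, closed, edges
```

Accumulating the returned edges over all frames yields the parent and child relations of the tracklet graph.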

Although the linking and tracklet construction processes are very conservative, errors can still occur when the assumptions of the algorithm are violated. For example, in benchmark dataset J16, in which the behavior of a 16-ant colony is recorded in an uncovered arena surrounded by Fluon-coated walls, ants sometimes climb on the arena’s walls and fall down on top of another ant, hence violating the maximal linking distance assumption. In such a case, the tracklet corresponding to the climbing ant will end without parenting a child tracklet, while the tracklet of the second ant will contain one ant in its first part and two ants in its second part. In the analyzed dataset, such linking errors occur very rarely (less than 0.05% of the tracklets), and in most cases will not lead to classification errors, due to the robustness of the ID propagation step to such errors (section 3).

When a tracklet is closed, the orientation of its blobs still has a ±π ambiguity. This is because the orientation is defined as that of the best-fit ellipse, which MATLAB’s blob analysis algorithm sets independently for each blob, so it is not consistent along the tracklet. We use a method adapted from Branson et al., 2009 to disambiguate the orientation. In short, this method uses the fact that whenever the tracked animal is moving fast, we can reliably assign the correct orientation in the direction of movement, and then propagate this assignment to the entire tracklet using dynamic programming. In tracklets where the animal is not moving fast enough, the result can be incorrect, but it is at least consistent along the tracklet. Most of these cases will be corrected later, after the tag identification step. Multi-animal tracklets generally do not have a meaningful orientation.
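The idea of resolving the ±π ambiguity by dynamic programming can be illustrated with a small Viterbi-style sketch. It is only an illustration of the principle, not the cost function of Branson et al., 2009 or of anTraX: the function name, the unary cost (the negative projection of the velocity on the candidate heading) and the `smooth_weight` parameter are assumptions made for this example.

```python
import numpy as np


def disambiguate_orientation(theta, velocity, smooth_weight=1.0):
    """Choose theta[k] or theta[k] + pi for every blob in a tracklet.

    theta:    (n,) ellipse orientations in radians, defined modulo pi
    velocity: (n, 2) displacement of the blob centroid per frame
    Returns (n,) headings wrapped to (-pi, pi].
    """
    theta = np.asarray(theta, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    n = len(theta)
    cand = np.stack([theta, theta + np.pi], axis=1)               # (n, 2)
    heading = np.stack([np.cos(cand), np.sin(cand)], axis=2)      # (n, 2, 2)
    # Unary cost: low when the heading points along a fast movement.
    unary = -np.einsum("nsd,nd->ns", heading, velocity)

    def angle_between(a, b):
        return np.abs(np.angle(np.exp(1j * (a - b))))

    cost = unary[0].copy()
    back = np.zeros((n, 2), dtype=int)
    for k in range(1, n):
        # trans[i, j]: cost of going from state i at k-1 to state j at k.
        trans = smooth_weight * angle_between(cand[k - 1][:, None], cand[k][None, :])
        total = cost[:, None] + trans
        back[k] = np.argmin(total, axis=0)
        cost = total.min(axis=0) + unary[k]

    # Backtrack the cheapest sequence of flip states.
    state = np.empty(n, dtype=int)
    state[-1] = int(np.argmin(cost))
    for k in range(n - 1, 0, -1):
        state[k - 1] = back[k, state[k]]
    chosen = cand[np.arange(n), state]
    return np.angle(np.exp(1j * chosen))


# One fast step in the -x direction is enough to orient the whole tracklet.
headings = disambiguate_orientation(
    [0.1, 0.05, 0.0, 0.02], [[0, 0], [0, 0], [-5, 0], [0, 0]]
)
```

Because switching between the two candidate headings is penalized by the angular jump it causes, a single reliable frame fixes the orientation of all other frames in the tracklet, which matches the behavior described above.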

The end result of this part of the algorithm, after processing all frames in the video, is an acyclic directed graph containing references to all tracklets and blobs in the experiment.

2. Classifying tracklets

2.1 Color correction

The actual RGB values of the color tags are highly sensitive to changes in illumination and variability in camera sensors, both between experiments, and within an experiment as a function of time and location. These sources of variability can adversely affect the performance of the tracklet classifier. To overcome this problem, at least partially, we include the option of applying a color correction step on images before classification (Appendix 1—figure 5). To do so, we use a white reference frame W, which is an image of a white or gray surface taken using the same conditions as the videos. The color corrected frame is then:

(5) $I_W(i,j,c) = \dfrac{I(i,j,c)}{W(i,j,c)}$
Appendix 1—figure 5
Color correction.

(A) The original frame. (B) The color corrected frame. Insets show a zoomed in view of a focal ant. The color correction removes the green bias in the original frame and enhances the color segmentation.

Pixel values that exceed the pixel value range are truncated.

In cases where the tracking background approximates a homogeneous white surface, as is the case with all the benchmark datasets, the white reference can be automatically generated by anTraX by filtering the background image with a 2D Wiener filter. In other cases, a white reference image can be taken in the experimental setup before or after the experiment.
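As a rough illustration of the color correction in equation 5, the following NumPy sketch divides each pixel by the corresponding pixel of the white reference and truncates values that exceed the valid range. The function name `color_correct`, the 8-bit input format, and the small `eps` guard against division by zero are assumptions of this example rather than details of the anTraX code.

```python
import numpy as np


def color_correct(frame, white_ref, eps=1e-6):
    """Divide an 8-bit RGB frame by a white reference frame, pixel by pixel.

    frame, white_ref: uint8 arrays of identical shape (height, width, 3)
    Returns the corrected frame as uint8, with out-of-range values truncated.
    """
    image = frame.astype(np.float64) / 255.0
    white = white_ref.astype(np.float64) / 255.0
    corrected = image / np.maximum(white, eps)   # equation 5
    corrected = np.clip(corrected, 0.0, 1.0)     # truncate values above the range
    return np.round(corrected * 255.0).astype(np.uint8)


# A white reference with a green bias, and two example pixels.
white_ref = np.full((1, 2, 3), [200, 250, 200], dtype=np.uint8)
frame = np.array([[[40, 50, 40], [220, 250, 80]]], dtype=np.uint8)
corrected = color_correct(frame, white_ref)
```

In this example the first pixel, which carries the same green bias as the reference, becomes a neutral gray after correction, and the red channel of the second pixel is truncated at the top of the range.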

2.2 Training a blob classifier

Classifying a tracklet begins with classifying the individual blobs it contains. To do so, we train a convolutional neural network (CNN) image classifier using TensorFlow (Abadi et al., 2016). To create a classifier, the user has to supply a list of possible labels. Typically, this will be the list of IDs (unique tag combinations) in the experiment, plus optional labels for non-animal objects that can be detected in the videos (e.g. larvae or food items). One of the limitations of using CNNs for classification is the high rate of false positives, that is, blobs that are assigned an ID even though they are not identifiable. To overcome this, we add a special label for unrecognizable blobs, which are treated as a separate class (labeled as ‘Unknown’ or ‘UK’).

To train the classifier, we collect a set of example images for each classifier label (Appendix 1—figure 6). This can be done easily using an anTraX GUI app (see Online Documentation for details).

Appendix 1—figure 6
An example subset from a training set.

Shown are examples from six ant IDs with a total of four tag colors. The UK label represents ant images that are not classifiable to the human eye. The NO label represents segmented objects that are not ants (food items, larvae, etc). To allow the classifier to generalize well, it is important that the variability of the training set captures the variability in the experiment, and includes images of ants in various poses, lighting conditions, and across experimental replicates.

In short, the GUI presents the user with all the blob images from a random tracklet. The user can then select the appropriate ID and choose to either export all images into the training set, or to select only a subset of images (useful if not all blobs in the tracklet are recognizable).

In many cases, especially in social insects, where behavioral skew can be considerable, some animals are rarely observed outside an aggregation. It is therefore challenging to collect examples for them using a random sampling approach. One solution to this problem, which is the recommended one for high-throughput experiments, is to pool data from several experiments into one classifier, as discussed in the main text. Another solution, in case this is not possible, is to scan the video for instances in which the focal animal leaves the group, and ‘ask’ the GUI for tracklets from this segment of the video. Alternatively, one can run a first pass of training and classification using the available examples, and then ask the GUI to display only unclassified tracklets, increasing the probability of spotting the missing animal.

The resulting example set is augmented using various transformations (flipping, rotations, shearing, and brightness and color shifts; Appendix 1—figure 7). Some of these transformations are only applicable in certain cases (e.g. horizontal flipping is only valid where tags have a horizontal symmetry), and some are range-configurable through the anTraX interface. It is important to tune these range parameters appropriately, because training the classifier on images that cannot occur in the real data only slows down training and reduces performance. For example, rotations are applied by default in the range of ±15°, as we found that this value captures the variability in head orientation relative to blob orientation well in most of our datasets. For animals with low eccentricity, higher values for this parameter will produce better generalization.

Appendix 1—figure 7
Dataset augmentation using TensorFlow’s intrinsic mechanism for image transformation on a single example image to generate a larger training dataset.

As usual with supervised classifiers, there is a tradeoff between the complexity of the classifier (the size and architecture of the network), and its performance, training time, and the optimal size of the training dataset. anTraX contains a few CNN architectures that we have found to work well with our data. However, it can also use an arbitrary, user-defined architecture (see Online Documentation for details).

2.3 Filtering tracklets for classification

Once the blob classifier is trained, it can be applied to the tracklets of the experiment. Because direct classification is only meaningful for tracklets that represent individual animals, we first filter the tracklet list to identify possible single-animal tracklets. To do so, we use the typical size range of individual animals (interactively adjustable in the anTraX GUI). A tracklet whose average blob area falls within that range is considered a possible single-animal tracklet, and is passed on to the classifier. Although this filtering method is not perfect, it rarely leads to false negatives (single-animal tracklets with average blob size outside of the specified range). If the rate of false positives is high (which is usually the case in problems with high size variability between individuals), it is useful to include a separate class for multi-animal blobs.

For performance reasons, this filtering is done during the blob tracking step, and the images constituting possible single-animal tracklets are saved separately to disk, thus avoiding the need to extract them again from the videos. It is therefore important to set the single animal size range before running the tracking.
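The size filter described in this section amounts to a simple threshold on the mean blob area of each tracklet. The sketch below assumes that blob areas are available as a dictionary from tracklet identifier to a list of areas in pixels; the function name and the data layout are hypothetical.

```python
from statistics import mean


def possible_single_animal(tracklet_areas, min_area, max_area):
    """Return the ids of tracklets whose mean blob area is in the single-animal range.

    tracklet_areas: dict tracklet id -> list of blob areas (pixels)
    min_area, max_area: user-defined size range of an individual animal
    """
    return {
        tid
        for tid, areas in tracklet_areas.items()
        if min_area <= mean(areas) <= max_area
    }


areas = {
    "t1": [100, 110, 90],   # a single ant
    "t2": [250, 260],       # probably two ants in one blob
    "t3": [20, 25],         # too small, e.g. debris
}
single = possible_single_animal(areas, min_area=60, max_area=150)
```

Only tracklets that pass this test would be handed to the blob classifier; the others are treated as multi-animal or non-animal candidates.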

2.4 Classifying single-animal tracklets

To classify a possible single-animal tracklet, we perform the following steps:

  1. The blob classifier is applied to each blob in the tracklet. The output of the classifier for each tracklet is a matrix of likelihoods, $p_{kl}$, that is, the probability of blob $k$ belonging to class $l$ given its image. We define the most probable label for a blob as:

    (6) $l_k^* = \operatorname{argmax}_l \, p_{kl}$

  2. If the most likely label for all of the blobs in the tracklet is a non-animal label, the tracklet is classified as non-animal, and the most abundant label in the tracklet is chosen as the tracklet label.

  3. If any of the blobs in the tracklet are classified as multi-animal, the tracklet will be classified as multi-animal. This step will occur only if a multi-animal class has been included in the classifier.

  4. If there is no ID label in the sequence of most likely labels, the tracklet is marked as unidentified.

  5. Otherwise (i.e. the tracklet is single-animal and there are at least some blobs classified as a specific ID label), we define a score for each possible ID as the sum of the likelihoods for that ID over all blobs:

    (7) $s_l = \sum_k p_{kl}$

    The tracklet is labeled with the ID that has the maximal score:

    (8) $L = \operatorname{argmax}_l \, s_l$

    where the argmax operation is performed over the labels that represent specific IDs (i.e. excluding the unknown, multi-animal, and no-animal classes). In addition, we define and register the classification confidence score as:

    (9) $S = \dfrac{n \, s_L}{\sum_l s_l}$

    where $n$ is the number of blobs in the tracklet classified as specific animal IDs, $s_L$ is the score of the assigned label, and the sum is over all label indices that belong to a specific ID (i.e. excluding non-specific labels such as 'Unknown' or 'NoAnimal'). This heuristic score definition takes into account the likelihoods the classifier has assigned to each label, but also the number of identifiable blobs in the tracklet. Using this definition, the confidence score will increase as evidence for the assignment accumulates (so longer tracklets with more identifiable blobs will have a higher score).
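The five classification steps and equations 6 to 9 can be summarized in a short NumPy sketch. It assumes that the classifier output for a tracklet is available as a matrix `p` with one row per blob and one column per label; the function name, the argument names, and the ID labels `BG` and `GP` are hypothetical and only serve the example.

```python
from collections import Counter

import numpy as np


def classify_tracklet(p, labels, id_labels, non_animal_labels=(), multi_label=None):
    """Classify one possible single-animal tracklet from blob likelihoods.

    p:       (n_blobs, n_labels) matrix of classifier likelihoods
    labels:  label name for each column of p
    Returns (tracklet_label, confidence), with confidence None when no
    specific ID is assigned.
    """
    p = np.asarray(p, dtype=float)
    best = [labels[i] for i in p.argmax(axis=1)]              # equation 6
    if all(b in non_animal_labels for b in best):             # step 2
        return Counter(best).most_common(1)[0][0], None
    if multi_label is not None and multi_label in best:       # step 3
        return multi_label, None
    n = sum(b in id_labels for b in best)
    if n == 0:                                                # step 4
        return "unidentified", None
    id_columns = [labels.index(name) for name in id_labels]
    s = p[:, id_columns].sum(axis=0)                          # equation 7
    winner = int(np.argmax(s))                                # equation 8
    confidence = n * s[winner] / s.sum()                      # equation 9
    return id_labels[winner], float(confidence)


labels = ["BG", "GP", "UK", "NO"]
p = [
    [0.7, 0.1, 0.2, 0.0],
    [0.6, 0.2, 0.2, 0.0],
    [0.1, 0.1, 0.8, 0.0],   # an unrecognizable blob
]
label, confidence = classify_tracklet(p, labels, ["BG", "GP"], non_animal_labels=("NO",))
```

Here two of the three blobs are identifiable, so the confidence is twice the fraction of the ID score mass that falls on the winning label. A longer tracklet with the same per-blob likelihoods would receive a proportionally higher score.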

2.5 Verification and retraining

Although this is not the final tracklet ID assignment, it is useful to be able to estimate the performance of tracklet classification. In particular, it is important to assess how well a classifier trained on examples from one experiment performs on tracklets from another. If there is a significant drop in performance, examples from the new experiment can be added to the training set, and an incremental training can be run. Both validation and adding new examples can be done using the anTraX GUI.

3. Propagating IDs on the tracklet graph

At this stage, we have the tracklet graph, in which a subset of single-animal tracklets have been labeled with a specific ID and a confidence score for that label. The rest of the tracklets in the network are either unidentified single-animal tracklets, or multi-animal tracklets. We assume that some of these classifications can be incorrect. The next step in the algorithm is to make the actual ID assignments for the tracklets, and to propagate these assigned IDs over the tracklet graph, trying to identify the composition of all tracklets, including multi-animal tracklets. In the process, a large portion of the incorrect classifications will be identified and overridden by the algorithm.

3.1 Initializing the graph

We start the propagation algorithm by creating a dynamic list of possible IDs (initially representing all individuals in the experiment) and a dynamic list of assigned IDs (initially an empty list) for each node in the graph. These lists are continuously updated during the propagation process. For all nodes (tracklets) that have been assigned a ‘non-animal’ label in the classification step, we initialize the possible ID list also as an empty one, effectively removing these nodes from the graph.

3.2 Propagation and assignment rules

Propagating and assigning IDs are done according to a set of rules executed in a specific order.

For each node to which we want to assign an ID, we do the following:

  1. If the ID we want to assign is not on the list of possible IDs for that node, abort.

  2. If the node represents a single-animal tracklet (i.e. is in the area range of a single animal as defined by the user AND was not classified as a multi-animal tracklet by the classifier), assign the ID and eliminate all other possible IDs. If it is not a single-animal node, assign the ID without eliminating other IDs.

  3. Horizontal propagation (negative): for all other nodes that overlap in time with the currently assigned node, eliminate the ID we just assigned.

  4. Vertical propagation (positive): for each parent node of the current node, check whether the currently assigned ID is on its list of possible IDs. If there is only one such parent, and it has not already been assigned the ID, assign the ID to that parent. Do the same for child nodes.

  5. Topological propagation (positive): two nodes are defined as twin nodes if they constitute a 2-vertex cut set (i.e. cutting the graph at both nodes creates a disconnected subgraph) and the corresponding disconnected subgraph does not contain any other 0-indegree or 0-outdegree nodes (i.e. no animals enter or leave the subgraph). Such a pair will have exactly the same composition of IDs. This is not true in cases where one of the tracklets in the subgraph touches a border of the ROI at a point where animals can exit and enter; these cases are flagged during tracking, and no assignment is made (the special case of open boundary ROIs is discussed below). For each assignment, we also assign the ID to the first descendant twin node and the first ancestor twin node (if they exist).

For each node from which we want to eliminate a possible ID, we do the following (see Appendix 1—figure 8, Figure 2B, Figure 2—video 1 for illustrated examples):

Appendix 1—figure 8
Propagation rules.

The figure depicts the first three steps in solving an example graph. The graph has 15 tracklets and 4 IDs. Circular nodes mark single-animal tracklets, while square nodes mark multi-animal tracklets. The colored circles inside the nodes mark the current assignments of the node. Empty circles indicate possible assignments, and full circles indicate actual assignments. The full solution of the example is given in Figure 2—video 1. (A) Negative horizontal propagation. (B) Positive vertical propagation. (C) Positive topological propagation.

  1. If the ID is already marked as ‘assigned’ for that node (i.e. the ID was already propagated from that node), abort.

  2. Vertical propagation (negative): for each parent node of the current node, if there is no other child node for which the ID that we are currently eliminating is possible, eliminate the ID for that parent. For each child of the current node, if there is no other parent for which the ID is possible, eliminate the ID for that child.

  3. Topological propagation (negative): eliminate the ID for the first ancestor twin node and the first descendant twin node (if they exist).
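To make the interplay of the assignment and elimination rules concrete, the following sketch implements the horizontal and vertical rules on a toy graph. It is a simplified illustration under stated assumptions: the `Node` class and the function names are hypothetical, topological (twin node) propagation is omitted, and a guard against re-assigning an already assigned ID is added to terminate the recursion.

```python
class Node:
    """A tracklet node with a time span and its ID bookkeeping."""

    def __init__(self, name, start, end, single, ids):
        self.name, self.start, self.end, self.single = name, start, end, single
        self.possible = set(ids)   # IDs that may be present in the tracklet
        self.assigned = set()      # IDs known to be present
        self.parents, self.children = [], []


def link(parent, child):
    parent.children.append(child)
    child.parents.append(parent)


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def assign(node, id_, nodes):
    if id_ not in node.possible or id_ in node.assigned:
        return
    node.assigned.add(id_)
    if node.single:
        # A single-animal node cannot hold any other ID.
        for other_id in list(node.possible - {id_}):
            eliminate(node, other_id, nodes)
    # Horizontal propagation (negative): the animal cannot be elsewhere.
    for other in nodes:
        if other is not node and overlaps(node, other):
            eliminate(other, id_, nodes)
    # Vertical propagation (positive): a unique possible parent or child.
    for neighbours in (node.parents, node.children):
        candidates = [m for m in neighbours if id_ in m.possible]
        if len(candidates) == 1:
            assign(candidates[0], id_, nodes)


def eliminate(node, id_, nodes):
    if id_ in node.assigned or id_ not in node.possible:
        return
    node.possible.discard(id_)
    # Vertical propagation (negative).
    for parent in node.parents:
        if not any(id_ in c.possible for c in parent.children):
            eliminate(parent, id_, nodes)
    for child in node.children:
        if not any(id_ in p.possible for p in child.parents):
            eliminate(child, id_, nodes)


# Two ants (A, B) merge into a multi-animal tracklet m and separate again.
ids = {"A", "B"}
a1, b1 = Node("a1", 0, 10, True, ids), Node("b1", 0, 10, True, ids)
m = Node("m", 11, 20, False, ids)
a2, b2 = Node("a2", 21, 30, True, ids), Node("b2", 21, 30, True, ids)
for parent, child in [(a1, m), (b1, m), (m, a2), (m, b2)]:
    link(parent, child)
nodes = [a1, b1, m, a2, b2]

assign(a1, "A", nodes)   # three classified source tracklets ...
assign(a2, "A", nodes)
assign(b1, "B", nodes)   # ... are enough to identify b2 by propagation
```

After these three assignments the multi-animal node `m` is known to contain both ants, and `b2` is identified as B even though the classifier never labeled it.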

3.3 Propagating from classified single-animal tracklets

Before we start the propagation, we rank all the single-animal tracklets that were labeled with a specific ID by the classifier (the ‘source’ tracklets) according to their confidence scores. We start by assigning the tracklet with the highest score with its classified ID, and then recursively propagate according to the rules above. When no more propagations can be made, we move on to the next tracklet on the list. All nodes with assignments inherit their confidence from the confidence of the source node.

Once the last source tracklet has been reached, we conduct another round of propagation, this time starting from all nodes with assigned IDs (not only the CNN-classified nodes), again sorting them according to their confidence, so that higher confidence propagations will have precedence. This process is repeated until no more propagations can be made.

3.4 Handling open boundary ROIs

The assumption that underlies the propagation rules as described above is that a tracklet indeed represents a given set of tagged animals in each of its frames, and that the tracklet graph correctly captures the flow of individual animals between tracklets. This assumption is violated if the ROI of the experiment is open (i.e. animals are free to enter and leave the tracked region), because a tracklet that touches the open boundary can have a changing set of tracked animals. To handle these cases, blobs that overlap with an open boundary are treated differently. In the blob linking step, whenever a blob that touches the open boundary is linked to a blob that does not touch the open boundary in the previous frame, the tracklet closes (even if it is a 1:1 link as defined in section 1.4), and a new tracklet opens and is linked to the previous one by a graph edge. The same happens when a blob that does not touch the boundary is linked to a blob that does. This way, the blobs touching the boundary (i.e. blobs that can ‘lose’ or ‘gain’ animals) are confined to the same tracklet. These special tracklets do not participate in the propagation process (i.e. they do not act as sources for IDs and do not accept vertical or topological propagations). Open boundaries are marked by the user as part of the ROI mask creation (see online documentation). See also benchmark dataset V25 (Figure 3—figure supplement 1, Figure 3—video 2) for an example of a tracked experiment with an open boundary.
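The splitting rule for open boundaries can be pictured as cutting a chain of 1:1 linked blobs wherever the ‘touches the open boundary’ flag changes. The sketch below assumes that this flag is available for every blob; the function name and data layout are hypothetical.

```python
from itertools import groupby


def split_at_open_boundary(blobs, touches_boundary):
    """Split a chain of 1:1 linked blobs into tracklets at boundary transitions.

    blobs:            blob identifiers in consecutive frames
    touches_boundary: one boolean per blob
    Returns a list of (blob_list, is_boundary_tracklet) tuples; consecutive
    entries would be connected by a graph edge.
    """
    segments = []
    pairs = zip(blobs, touches_boundary)
    for flag, group in groupby(pairs, key=lambda pair: pair[1]):
        segments.append(([blob for blob, _ in group], flag))
    return segments


segments = split_at_open_boundary(
    ["f0", "f1", "f2", "f3", "f4", "f5"],
    [False, False, True, True, False, False],
)
```

The middle segment is the special boundary tracklet: it stays in the graph as a link between its neighbors but would be excluded from ID propagation.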

3.5 Propagation of incorrect classifications

The tracklet classification is never error-free, and some incorrect assignments will be made. As the confidence score of incorrectly classified tracklets will usually be low, the rate of incorrect assignments by the propagation algorithm will usually be lower than the error rate of the classifier. The reason is that, in many cases, these tracklets will already have been assigned by propagation from a more reliable assignment by the time the algorithm reaches them. Nevertheless, some propagation from these incorrectly classified tracklets is to be expected. This propagation will continue until it contradicts an already assigned tracklet. These erroneous propagations are typically short (Figure 4—figure supplement 2), and can often be filtered out algorithmically (see next section).

3.6 Connected component filtering

Ideally, at this point, when all possible ID propagation options are exhausted, we have inferred the maximum information about the composition of each tracklet. If we look at the subgraph corresponding to a specific ID (defined by all the nodes that are possible for that ID, along with their connecting edges), we expect to see a single connected component (Appendix 1—figure 9A). This connected component will consist of nodes assigned with that ID, which do not have nodes parallel to them in the subgraph, as well as nodes without ID assignments, which can in principle have ambiguities (parallel nodes that are members of the same subgraph). However, as discussed in the previous section, because the tracklet classification process usually produces some errors, the ID subgraph can have several disconnected components (Appendix 1—figure 9B). To filter out connected components that correspond to classification errors, we assign a confidence score to each connected component, defined as the sum of the confidence scores of all the ID assignments in that component. We then go over the list of components sorted by their confidence, and accept them in order. Whenever a component contradicts one of the already accepted components (e.g. it overlaps in time, or does not contain a possible route on the graph to a previously accepted component), we discard it. To eliminate a component, we undo all assignments made of the focal ID to the nodes of that component, and all the eliminations that resulted from these assignments. This is done separately for each ID subgraph.

Appendix 1—figure 9
Connected component filtering.

An example from a 10-min tracklet graph. Green nodes are those assigned by the classifier, blue nodes are assigned by the propagation algorithm, and purple nodes are ambiguous (‘possible’ but not ‘assigned’). (A) A focal ant subgraph in which graph assignment propagation was consistent and did not result in contradictions. (B) A subgraph for a different focal ant in the same graph, for which the classifier made an incorrect assignment. As a consequence, the subgraph is fractured into a few connected components. (C) The subgraph of the same focal ant as in B, following the connected component filtering step and a second round of assignment propagations. The erroneous component was filtered, and the algorithm was able to complete the ID path through the graph.

Following the connected component filtering, we again run the ID propagation loop to close the gaps between the accepted components (Appendix 1—figure 9C). This procedure is repeated until no more component filtering can be made.
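A simplified version of the connected component filtering for a single ID subgraph is sketched below. It is an illustration under stated assumptions, not the anTraX implementation: only the temporal overlap criterion is used to detect contradictions (the route-on-graph criterion is omitted), and the function names and data layout are hypothetical.

```python
def connected_components(nodes, edges):
    """Connected components of an undirected view of the ID subgraph."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, component = [n], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            component.add(u)
            stack.extend(adjacency[u] - seen)
        components.append(component)
    return components


def filter_components(nodes, edges, confidence, span):
    """Accept components by decreasing confidence, rejecting temporal overlaps.

    nodes:      nodes for which the focal ID is possible
    confidence: dict node -> confidence of its ID assignment (if assigned)
    span:       dict node -> (first_frame, last_frame)
    """
    def score(component):
        return sum(confidence.get(n, 0.0) for n in component)

    def interval(component):
        return (min(span[n][0] for n in component),
                max(span[n][1] for n in component))

    accepted, rejected = [], []
    for component in sorted(connected_components(nodes, edges), key=score, reverse=True):
        start, end = interval(component)
        clash = any(start <= e and s <= end for s, e in map(interval, accepted))
        (rejected if clash else accepted).append(component)
    return accepted, rejected


# A consistent path a -> b -> c, and a stray node x from a misclassification.
nodes = ["a", "b", "c", "x"]
edges = [("a", "b"), ("b", "c")]
confidence = {"a": 3.0, "c": 2.0, "x": 0.5}
span = {"a": (0, 10), "b": (11, 20), "c": (21, 30), "x": (12, 15)}
accepted, rejected = filter_components(nodes, edges, confidence, span)
```

In the full algorithm, the assignments and eliminations that originated in a rejected component would then be undone before the propagation loop is run again.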

3.7 Finalizing assignments and exporting data

At this point, when all inference options are exhausted, each ID is represented in several types of nodes/tracklets. In order of decreasing assignment quality, these are:

  1. Single-animal tracklets that were assigned by the classifier and confirmed by the graph propagation algorithm (i.e. that were not identified as erroneous and overridden).

  2. Single-animal tracklets for which IDs were inferred by the propagation algorithm.

  3. Multi-animal tracklets for which IDs were inferred by the propagation algorithm.

  4. Tracklets for which no ID was assigned, but which are the only possible tracklet for a particular ID.

  5. Points of ambiguity, where no assignment was made with confidence, and several temporally overlapping nodes could possibly contain the focal ID.

When exporting trajectory data for the experiment, the assignment type for each point in the trajectory is also reported.

3.8 Multi-colony experiments

anTraX enables tracking multiple colonies/groups within the same video. This feature is useful when designing and performing high-throughput experiments, where one camera records several colonies. For multi-colony experiments, the software assigns a colony ID to each tracklet during the initial tracking step, based on the spatial location of the tracklet. During the graph propagation step, the software partitions the tracklets into a number of graphs, one for each colony. Propagation is then performed on each colony-graph separately, and the final trajectories are saved separately for each colony. Dataset G6 × 16 (Figure 3—figure supplement 8, Figure 3—video 9) gives an example of tracking an experiment where 6 colonies of 16 ants each are recorded with a single camera.
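The partition of tracklets by colony can be sketched as a lookup of each tracklet’s location in a set of colony regions. The example below assumes rectangular regions and one representative position per tracklet; both are simplifications made for illustration, and the names are hypothetical.

```python
from collections import defaultdict


def assign_colony(xy, colony_boxes):
    """Return the name of the colony region containing the point, or None."""
    x, y = xy
    for name, (x0, y0, x1, y1) in colony_boxes.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None


def partition_by_colony(tracklet_xy, colony_boxes):
    """Group tracklet ids by colony, so each group can form its own graph."""
    groups = defaultdict(list)
    for tid, xy in tracklet_xy.items():
        groups[assign_colony(xy, colony_boxes)].append(tid)
    return dict(groups)


colony_boxes = {"C1": (0, 0, 100, 100), "C2": (100, 0, 200, 100)}
tracklet_xy = {"t1": (10, 20), "t2": (150, 40), "t3": (90, 99)}
groups = partition_by_colony(tracklet_xy, colony_boxes)
```

Each group would then be turned into a separate tracklet graph, and ID propagation would run on every colony graph independently.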

References

  1. TensorFlow: a system for large-scale machine learning
     M Abadi, P Barham, J Chen, Z Chen, A Davis, J Dean (2016)
     Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation.
  39. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model
     E Insafutdinov, L Pishchulin, B Andres, M Andriluka, B Schiele (2016)
     European Conference on Computer Vision.
  40. PoseTrack: Joint multi-person pose estimation and tracking
     U Iqbal, A Milan, J Gall (2017)
     Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR.
     https://doi.org/10.1109/CVPR.2017.495
  52. Algorithms for the assignment and transportation problems
     J Munkres (1957)
     Journal of the Society for Industrial and Applied Mathematics 5:32–38.
     https://doi.org/10.1137/0105003
  53. Multi-target tracking - Linking identities using Bayesian network inference
     P Nillius, J Sullivan, S Carlsson (2006)
     Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
     https://doi.org/10.1109/CVPR.2006.198

Decision letter

  1. Gordon J Berman
    Reviewing Editor; Emory University, United States
  2. Catherine Dulac
    Senior Editor; Harvard University, United States
  3. Joshua W Shaevitz
    Reviewer; Princeton University, United States
  4. Alfonso Perez-Escudero
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Here, the authors present a new software package for tracking color-tagged animals, anTraX. The method combines object segmentation and tracking through thresholded connected components (blobs), a convolutional neural network that uses shape and color to identify animals when they are alone outside of groups, and a graph-based method for connecting single-animal tracks and group blobs appropriately. This is an interesting and novel strategy that combines aspects of traditional computer vision with newer work using neural networks to allow tracking of tagged animals.

Decision letter after peer review:

Thank you for submitting your article "anTraX: high throughput video tracking of color-tagged insects" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Catherine Dulac as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Joshua W Shaevitz (Reviewer #2); Alfonso Perez-Escudero (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Essential revisions:

As evidenced by the attached reviews, the reviewers were generally positive about the submission, finding the work timely, of significant potential impact, and clearly written. There were several shared concerns, however, that need to be addressed before the article can be accepted.

1) The reviewers all agreed that a key missing feature of the submission was comparing the accuracy, speed, and required computational resources of the method to other existing approaches (e.g., idtracker.ai, DeepLabCut, SLEAP). An analysis along the lines performed in Sridhar et al., 2019 would greatly benefit readers and potential users, making them aware of the apparent benefits and potential disadvantages of the approach described in this submission. If such comparisons are not possible due to the technical limitations of the software, then the authors should clearly describe what the technical limitations of the existing software are and why they could not successfully track their video data.

2) The reviewers also would like to see more description of the limitations of the blob detection algorithm used here. What if the video is mostly made of connected groups (potentially morphing into/out of each other) and there are very few instances of single animals moving about between groups? In experiments at high density, this is certainly the case, and other animals that live in groups spend a vast amount of time huddled together. Would anTraX be appropriate for this kind of data? How is the position of each final track found if there is a group involved? Is it just the centroid of the multi-animal blob? Doesn't this cause discontinuities in the final tracks that are problematic for further analysis (e.g., using JAABA as the authors highlight)? For example, socially housed rodents often spend much of their time clustered together. How would anTraX fare on this type of data?

3) The number of ground truth testing frames (200 annotations per data set?) seems rather small. While high-quality ground truth data are difficult to collect, the method the authors use for validating the accuracy of their algorithm via manual labeling as "correct" or "incorrect" (or "skip") could be strongly influenced by sampling bias, or observer bias and fatigue, as is the case with any manual annotation task. This is especially worrying with the small number of annotations used here (although the exact number needs to be spelled out more clearly). We ask that the authors expand their ground truth testing set (or justify why an expansion is not necessary) and describe more clearly how many testing frames were used and how these frames were chosen.

4) Relatedly, performing an analysis of the method's accuracy as a function of the number of training frames is an important means to assess the practicability of the method to potential users.

5) The anTraX algorithm also uses multiple types of tracking subroutines, so it would be prudent to compare accuracy across these different stages of tracking (e.g. images labeled by the CNN classifier might be very accurate compared to other parts of the tracking). Using only 200 random samples could easily be oversampling frames where individuals are well separated or accurately labeled by the CNN classifier, which would heavily bias the results to look as if the algorithm is highly accurate when the quality of the tracking could in fact be highly variable. Also, for example, it is unclear how a wrong ID propagates across tracklets. While the methods appear robust to the issue, it should be discussed explicitly and addressed here. Performing these types of comparisons across stages of tracking would be informative to the reader to assess whether the approach would be a good one for their particular data.

6) The reviewers felt that the text fails to acknowledge its similarity with previous methods in two aspects. For instance, the concept of tracklet has been used before, at least in Pérez-Escudero et al., 2014 and Romero-Ferrero et al., 2019, with essentially identical criteria to define each tracklet (although this paper presents a significant improvement in the method to assign blobs via optic flow), and the concept of representing the tracking with a network has been used at least in M. Schiegg, P. Hanslovsky, B. X. Kausler, L. Hufnagel, F. A. Hamprecht. Conservation Tracking. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), 2013. We accordingly ask the authors to add a more thorough exposition of how anTraX fits into the previous tracking literature.

Reviewer #1:

This paper is an important and timely contribution to the fields of collective, social, and quantitative animal behavior and will certainly serve to make innovative research questions in these fields more tractable, especially those related to group-living animals such as eusocial insects. Overall the paper is well-written and provides a readable review describing the current state-of-the-art for multi-object tracking of animal groups. As the paper makes clear, using computer vision to reliably track groups of individuals over extended periods of time is an exceptionally difficult problem. The authors identify key technical problems with existing software related to disambiguating multiple closely-interacting individuals when tracking groups of animals over long periods. They then provide a clever and comprehensive, albeit complex, solution to address these problems. Their software combines conventional computer vision algorithms with modern deep-learning-based computer vision algorithms and a novel graph-based propagation algorithm that together allow for reliably tracking groups of animals for extended periods of time. The algorithm allows for high-throughput data collection using minimally invasive, human-readable, color-based markers that negate the need for invasive tagging and expensive high-resolution imaging, while also maintaining individual identities through challenging occlusions and ambiguity (even in cases where a human would struggle to disambiguate identities). The code for their software is well written and documented, which must have required an incredible amount of work for which the authors should be commended. They also give clear and useful demonstrations of how to integrate their software with the currently-available suite of tools for quantitative analysis of animal behavior (DeepLabCut, JAABA, etc.), which nicely places their work into the broader context of modern methods for quantifying and analyzing animal behavior.

Despite these many strengths, I have identified a few closely-related issues with the validation methods used to benchmark the anTraX tracking algorithm that, if properly addressed, would strengthen the results presented in the paper:

– No comparisons to existing tracking methods.

– Sample sizes used for assessing accuracy are small with no comparisons of accuracy across time, group size, tag complexity, before/during/after occlusions, or tracking subroutine.

– No direct comparison to the ground-truth trajectory data or "gold standard" barcode tracking.

– No discussion of computation time or complexity.

If the authors are able to address my concerns at least partially, then the manuscript is potentially adequate for publication as a Tools and Resources article but could certainly benefit from the additional analyses described below, especially in regard to my concerns about sample size and sampling bias.

This paper appears to fall into the category described by eLife as: "substantial improvements and extensions of existing technologies" in which case "the new method should be properly compared and benchmarked against existing methods", yet the authors make no attempt to compare their software to existing software. The most common form of computer vision-based tracking used in labs today is still basic background subtraction and blob detection followed by some form of energy minimization algorithm such as the linear assignment algorithm (recently reviewed by Sridhar et al., 2019). While the anTraX authors provide comparisons of their software both with and without their novel graph propagation algorithm, their initial tracking algorithm appears to utilize a complex set of ad-hoc criteria for switching between computationally-expensive optic-flow and cheaper object tracking algorithms. Therefore these results are not directly comparable to other methods.

For the uninitiated reader, it would be useful to expand on this discussion in the main text and more clearly demonstrate how the anTraX algorithm compares to commonly-used baseline tracking. The authors appear to have already benchmarked this type of tracking, given their statement “Flow is computationally expensive, we found it to be significantly more accurate than alternatives such as simple overlap or linear assignment (based either on simple spatial distance or on distance in some feature space)”, but do not provide the results of this comparison. In this same vein, it would also be useful to provide direct comparisons to the output of other existing, more complex tracking software (similar to the comparisons made in Sridhar et al., 2019) both quantitatively and qualitatively. A comparison with the recent deep-learning-based idtracker.ai software (Romero-Ferrero et al., 2019), in terms of how quickly identity assignment errors propagate in time between the two algorithms, would be especially relevant. If these comparisons are not possible due to technical limitations of the software, then the authors should clearly describe what the technical limitations of the existing software are and why they could not successfully track their video data. The addition of these comparisons would serve to greatly improve the paper by giving the reader a better idea of how well the anTraX software addresses the problems it claims to solve when compared to a simple baseline as well as more complex existing software.

Second, while I can appreciate that high-quality ground truth data are difficult to collect, the method the authors use for validating the accuracy of their algorithm via manual labeling as "correct" or "incorrect" (or "skip") could be strongly influenced by sampling bias, or observer bias and fatigue, as is the case with any manual annotation task. Relatedly, and perhaps more importantly, it appears that the size of ground truth data used for assessing tracking accuracy is rather small (200 annotations per dataset?), though it is actually not clear to me exactly how much ground-truth data they collected for their analysis. In addition to reporting their sample sizes in the main text, the authors should report this in the transparent reporting form as well, instead of erroneously claiming "no statistical analyses were performed". It's also not clear how these annotations were sampled within the video time series. This is an important point because, as the authors are aware, unrecoverable errors tend to propagate as the time series progresses. It would also be useful to make this comparison across time with other tracking algorithms to assess how assignment errors propagate and/or are corrected (i.e. by the CNN classifier). Additionally the authors have benchmarked several datasets with varying group size and tag complexity. It would be useful to explicitly analyze (or at least plot) and discuss how these aspects do affect or might affect the quality of the tracking data produced by anTraX (e.g. what are the limits of group-size and tag complexity). The anTraX algorithm also appears to be able to detect when animals come in close contact, so assessing accuracy before, during, and after prolonged occlusions would be very relevant, as this is the main novelty of the paper. The anTraX algorithm also uses multiple types of tracking subroutines, so it would be prudent to compare accuracy across these different stages of tracking (e.g. I would expect images labeled by the CNN classifier to be very accurate compared to other parts of the tracking). Using only 200 random samples could easily be oversampling frames where individuals are well separated or accurately labeled by the CNN classifier, which would heavily bias the results to look as if the algorithm is highly accurate when the quality of the tracking could in fact be highly variable.

Third, comparing the anTraX algorithm to an automated ground-truth would greatly strengthen the results presented in the paper. Using a dataset of animals marked with machine-readable (though not necessarily human-readable) barcodes would be an ideal option for improving this comparison (as well as my first two issues of comparing to other tracking software and improving ground-truth comparisons), as I assume anTraX works with any marked animals, not just color-based markers. The authors actually refer to barcode tracking as the "gold standard", so I was surprised that they did not make any attempt to directly compare the two. I have access to such a barcode dataset (tagged groups of desert locusts, S. gregaria, in a laboratory environment) with high-quality ground-truth barcode trajectories that include complex occlusions and ambiguities. I am happy to make part of this dataset publicly available (e.g. on Zenodo) for the authors to freely use for making such a comparison. To circumvent the need for manual labor, the ground truth ID labels from the barcode tracking software could easily be substituted to train the blob classifier CNN with the assumption that a human could, with enough time, disambiguate the barcodes. A second option would be to apply color thresholding to their single tag data (where individuals are marked with a single color), which should give reliable ground-truth trajectories for at least some of the animals (see for example Jolles et al., 2017, Curr. Biol.). A third option would be to use a dataset of solitary animals moving in an arena that can be reliably tracked and then artificially superimposed (using background subtraction) into a combined "multi-individual" video. This would, of course, be less realistic than real data, as ambiguous "interactions" between individuals are less likely to occur, but would be a significant step toward addressing this issue of more comprehensively comparing the tracking data output to a reliable ground truth.

Finally, the authors mention that a high-powered computer or cluster is advised in some cases, so it would be useful to discuss how much processing time is required for running the anTraX algorithm on the different datasets compared with other algorithms.

Comparing the anTraX algorithm across different factors (time, group size, tag complexity, occlusion, subroutine, etc.) with both the simple baseline of blob-detection-based tracking, existing (more complex) software, and the so-called "gold standard" of barcode tracking would go a long way to help better place this work in the broader context of existing tracking algorithms for animal groups. Additionally these added comparisons would help readers make a better informed decision whether or not using the anTraX software (vs. other tracking software) is worth the investment for their research project.

Reviewer #2:

The authors present a new software package for tracking color-tagged animals, anTraX. The method combines object segmentation and tracking through thresholded connected components (blobs), a CNN that uses shape and color to identify animals when they are alone outside of groups, and then a graph-based method for connecting single animal tracks and group blobs appropriately. This is an interesting strategy that combines aspects of traditional computer vision with newer work using NNs to allow tracking of tagged animals.

In general, I think this is an interesting technique that occupies the space somewhere between traditional tracking methods and new deep-learning approaches to tracking animals and their body parts. I have a few concerns that should be addressed before publication.

1) My biggest worry with this technique is about the treatment of the animal groups (the multi-animal blobs).

– What if the video is mostly made of connected groups (potentially morphing into/out of each other) and there are very few instances of single animals moving about between groups? In experiments at high density this is certainly the case and other animals that live in groups spend a vast amount of time huddled together. Would anTraX be appropriate for this kind of data? I worry that because of this issue anTraX is less broadly applicable than pitched in the article.

– How is the position of each final track found if there is a group involved? Is it just the centroid of the multi-animal blob? Doesn't this cause discontinuities in the final tracks that are problematic for further analysis (e.g. using JAABA as the authors highlight)? Also, JAABA doesn't really apply for animals when they are in a group, right, because the resultant track is shared among all animals in the group?

– The authors highlight a DLC analysis of the single animal images. But this fails for groups, right? I think this needs to be said more clearly.

2) At several steps in the analysis the segments (blobs) are fit to ellipses. This obviously makes sense for elongated objects, but what if the animal segments (which are inherently smoother than the animal itself) are essentially round? Would this affect the analysis?

3) How are the training images for the CNN found? The text says "To train the classifier, we collect a set of example images for each classifier label (Appendix—figure 4-6). This can be done easily using an anTraX GUI app (see Online Documentation for details).". The authors should describe how the training images are selected etc in the main manuscript. Are they taken from a random sample of the single animal blobs? What about animals that are rarely represented in that set, e.g. if an animal is almost always in the group but rarely appears as a single?

4) Recent advances in NN-based pose tracking now allow for multiple animals (see maDLC and SLEAP on which I am an author). I realize that these packages just recently became available but it would be useful for the authors to compare their method to those which don't utilize the tags for ID. This is not strictly necessary for a revision but would clearly be of interest to the field.

Reviewer #3:

The paper presents a tracking system for manually marked individuals. Overall, I think it's a really good paper, and the software seems a useful contribution, both for potential users and for future development of similar tracking systems. The paper is clear and well written, and the conclusions seem well supported by the results. All my issues are mostly about presentation:

1) One of the most time-consuming steps is training of the classifier, in which the user must manually annotate many images of the animals. This step is almost absent from the current main text. Even the supplement does not give an estimate of how many images need to be manually annotated. In my opinion, the main text should explicitly address this step, and include an estimate of the amount of manual work needed.

2) Error propagation may be an issue with this algorithm, since a wrong ID can propagate across tracklets. The method seems quite robust against this issue, but I think that it should be discussed explicitly and addressed in the validation. To do so, I think the validation should include more information. For all the mistakes detected, it should report length of the tracklet with wrong ID and the certainty of the ID assignment (since this certainty will correlate with the probability of propagation). Also, for each mistake detected, the authors should check whether that mistake propagated to any neighboring tracklets. While this must be done manually, given the low number of mistakes after ID propagation, it should be easily doable.

3) In my opinion the method has enough novelty to grant publication in eLife, but I feel that the text fails to acknowledge its similarity with previous methods in two aspects: First, the concept of tracklet has been used before, at least in Pérez-Escudero et al., 2014 and Romero-Ferrero et al., 2019, with essentially identical criteria to define each tracklet (although this paper presents significant improvement in the method to assign blobs via optic flow). Second, the concept of representing the tracking with a network has been used at least in M. Schiegg, P. Hanslovsky, B. X. Kausler, L. Hufnagel, F. A. Hamprecht. Conservation Tracking. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), 2013.

4) I find that one of the main limitations to usability of this software is that installation seems hard. From the description in the web, I understand that I need all the following: (1) A Linux machine, (2) Python (and know what a virtual environment is and how to create it), (3) Matlab or Matlab's MCR, (4) git (and knowledge about how to clone a repository). Given that the target audience of this software are experimentalists in the field of animal behavior, I think that these requirements will severely damage its usability. And it seems that part of these requirements are in fact unnecessary (at least git should not be needed for basic users). And even having the necessary skills, the need of so many different steps makes me worry that something will fail along the way. I think that a modest investment in simplifying this step will increase the number of users substantially.

https://doi.org/10.7554/eLife.58145.sa1

Author response

Essential revisions:

As evidenced from the attached reviews, the reviewers were generally positive about the submission, finding the work timely, of significant potential impact, and clearly written. There were several shared concerns, however, that need to be addressed before accepting the article.

1) The reviewers all agreed that a key missing feature of the submission was comparing the accuracy, speed, and required computational resources of the method to other existing approaches (e.g., idtracker.ai, DeepLabCut, SLEAP). An analysis along the lines performed in Sridhar et al., 2019 would greatly benefit readers and potential users, making them aware of the apparent benefits and potential disadvantages of the approach described in this submission. If such comparisons are not possible due to the technical limitations of the software, then the authors should clearly describe what the technical limitations of the existing software are and why they could not successfully track their video data.

The issue with comparing anTraX to other software packages is that there is no other software that can directly track the types of experiments we developed anTraX for. This includes the benchmark experiments presented in this manuscript. Below we briefly review the existing tracking software and how they fall short in tracking these kinds of videos successfully.

General marker-less multi-object tracking algorithms: Obviously, color tagged animals can be tracked as if they were marker-less by general-purpose multi-object tracking algorithms. However, these are rarely useful when it is important to keep track of the identity of each individual animal throughout the experiment. As we review in the Introduction of our paper, most of these approaches can overcome simple occlusions, crossings, and interactions by either trying to segment animals from each other (requiring sufficient image quality and resolution) or using various kinematic modeling approaches to connect tracks following an occlusion to the tracks preceding it. These methods work only for brief, simple interactions, and will fail on experiments like ours, which sacrifice resolution and image quality for duration and high throughput and focus on social insects that tend to cluster tightly together for long periods. It’s also important to note that even a relatively low identity switching error (say 1%) entails the complete mixing of identities in experiments that are much longer than the typical time between interactions. Nevertheless, marker-less methods are still heavily used in our field, and are usually combined with extensive manual correction steps to overcome these issues.
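The effect of a small per-interaction switch error can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that each interaction carries the same, independent probability of an identity switch; this independence assumption and the function name are introduced here for illustration only and are not part of any tracking package.

```python
def prob_identity_intact(switch_prob: float, n_interactions: int) -> float:
    """Probability that an identity survives n interactions without a switch,
    assuming each interaction independently causes a switch with switch_prob."""
    return (1.0 - switch_prob) ** n_interactions


# With a 1% switch error per interaction, identities degrade quickly
# once the number of interactions becomes large.
for n in (10, 100, 1000):
    print(n, prob_identity_intact(0.01, n))
```

Under these assumptions, roughly a third of identities remain intact after 100 interactions, and essentially none after 1000, which is why long experiments cannot rely on frame-to-frame linking alone.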

Multi animal DLC and SLEAP: While these newly introduced methods are primarily designed for pose tracking, they can also be viewed as pose-assisted centroid trackers, in which pose information is used to solve cases of overlap and occlusion. From the point of view of multiple object tracking, these methods can be considered sophisticated marker-less trackers, as they do not attempt to classify or recognize the individual animals, but rather to relate objects in each frame in the video to their location in the previous frame. Several disadvantages make these algorithms unsuitable for tracking our experiments. First, they require high image resolution to allow reliable pose tracking of individual animals. Second, as they use the same model to pose-track all the animals, they will not handle experiments of groups with morphological variability well (e.g. our C12 dataset). Third, like the more traditional marker-less trackers mentioned above, they will fail on tight aggregations where even the human eye cannot identify single animals, and they will accumulate identity switch errors whenever the probability of a switch error at a crossing is not strictly zero.

idTracker.ai: This is the only software currently available that is theoretically capable of tracking color tagged insects, as the reviewers correctly mention below. Its use of neural networks to learn a signature of the tracked animals has the potential to overcome the accumulation of switch errors that happens in marker-less trackers. However, there are several issues that prevent idTracker.ai from working on our data “as is”. First, idTracker.ai collects training sets by looking for points in the videos where all animals are segmented individually (i.e., the number of blobs equals the number of animals). This does not occur in most of our datasets, and also not in many typical experiments involving social insects. Second, idTracker.ai currently works by converting the frames to greyscale, thus discarding most of the relevant information for identification in our videos (although it does surprisingly well on short videos with the greyscale signature of the color tags). While this could be easily solved by a simple modification to the algorithm, doing so is beyond the scope of benchmark comparisons. Third, the idTracker.ai algorithm itself makes it impractical to track videos longer than 30 minutes, as the execution time climbs strongly supra-linearly with the length of the video. Because of this, the types of experiments we optimized anTraX for really fall outside of the current scope of idTracker.ai, and we therefore prefer not to include a formal performance comparison between idTracker.ai and anTraX in the paper.

In summary, we feel that our discussion of existing tracking software in the Introduction is sufficient, and that a formal comparison would not be any more informative.

2) The reviewers also would like to see more description of the limitations of the blob detection algorithm used here. What if the video is mostly made of connected groups (potentially morphing into/out of each other) and there are very few instances of single animals moving about between groups? In experiments at high density, this is certainly the case and other animals that live in groups spend a vast amount of time huddled together. Would anTraX be appropriate for this kind of data? How is the position of each final track found if there is a group involved? Is it just the centroid of the multi-animal blob? Doesn't this cause discontinuities in the final tracks that are problematic for further analysis (e.g., using JAABA as the authors highlight)? For example, socially housed rodents often spend much of their time clustered together. How would anTraX fare on this type of data?

The reviewers are correct: in order to work, anTraX assumes a mixture of single-animal and multi-animal tracklets. In that sense, it is not a universal tool. If a video includes only multi-animal tracklets, anTraX won’t have anything to work with, and will fail to track the animals. However, the case of a mixture of multi-ant tracklets and single-ant tracklets represents most behavioral modes in the study of collective behavior of social insects, which is the target user-base of this tool. anTraX deals well with scenarios in which animals “spend a vast amount of time huddled together”, as long as individuals can be segmented and identified occasionally, which will be the case in almost all biologically realistic scenarios.

We are now explicitly mentioning this requirement in the text:

“The successful propagation of individual IDs on top of the tracklet graph requires at least one identification of each ID at this step. Propagation will improve with additional independent identifications of individuals throughout the video.”

In the context of this answer, it is also useful to note that the necessity of segmenting each animal individually at some point during the experiment is much relaxed compared to the requirement of the current state-of-the-art (idTracker.ai) to have all animals segmented individually in the same frame at some point during the experiment.
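The difference between the two segmentation requirements can be stated as two logical conditions. The following is a toy sketch only: it represents each frame as the set of animals that appear as single-animal blobs, using hypothetical identity labels that neither program would know in advance, and the function names are invented for this illustration. It does not reproduce the code of anTraX or idTracker.ai.

```python
def all_singles_in_one_frame(frames, ids):
    """Stricter condition: at least one frame in which every animal
    is segmented individually at the same time."""
    return any(ids <= singles for singles in frames)


def each_single_at_some_point(frames, ids):
    """Relaxed condition: every animal is segmented individually
    in at least one frame, not necessarily the same one."""
    seen = set().union(*frames)
    return ids <= seen


# Toy example: three animals that are never all isolated simultaneously.
ids = {"A", "B", "C"}
frames = [{"A", "B"}, {"C"}, {"A"}]
print(all_singles_in_one_frame(frames, ids))
print(each_single_at_some_point(frames, ids))
```

In this toy example the stricter condition is not met, while the relaxed one is, which is the typical situation in videos where animals aggregate for long periods.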

The location of an individual in a multi-animal blob is assumed to be at the centroid. As we write in the manuscript:

“Locations estimated from multi-animal tracklets are necessarily less accurate than locations from single-animal tracklets, and users should be aware of this when analyzing the data. For example, calculating velocities and orientations is only meaningful for single-animal tracklet data, while spatial fidelity can be estimated based also on approximate locations.”.

This is something to be aware of when analyzing positional data, as generally kinematic measures will not be usable for groups. However, many animal behavior analyses are positional, and not kinematic, so knowing in which group the animal resides is useful information.

anTraX does not simply export trajectory data to JAABA. anTraX implements its own version of the first step of the JAABA analysis pipeline – the per-frame features generation, where the trajectories are projected into a high dimensional feature space – and writes NaN for the kinematic features in those cases. It also defines many other, anTraX specific, features that describe the properties of the tracklets: the number of animals in the group, the group area and location relative to other animals, etc. A full list of these features and their definitions is given in the online documentation. JAABA can then be used to either classify the behavior of the animal when isolated (for that to work, we added a few negative examples from groups, which JAABA then uses to learn this behavior is for isolated individuals only), or to classify behavior in groups that is not dependent on kinematics. We have modified both the paper and the online documentation (https://antrax.readthedocs.io/en/latest/jaaba/) to emphasize this. The revised manuscript now reads:

“Beyond useful information about the appearance and kinematics of the tracked animals, these extra features provide information about whether an animal was segmented individually or was part of a multi-animal blob. This enables JAABA to learn behaviors that can only be assigned to individually segmented animals, such as those that depend on the velocity of the animal.”
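The idea of writing NaN for kinematic features whenever an animal is part of a multi-animal blob can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, its arguments, and the single "speed" feature are hypothetical and are not taken from the anTraX or JAABA feature definitions, which are listed in the online documentation.

```python
import math


def per_frame_speed(xy, n_animals_in_blob, dt=1.0):
    """Speed per frame from consecutive positions. NaN is written when the
    current or previous frame comes from a multi-animal blob, because the
    blob centroid is only an approximate location for the individual."""
    speeds = [math.nan]  # the first frame has no previous position
    for i in range(1, len(xy)):
        if n_animals_in_blob[i] != 1 or n_animals_in_blob[i - 1] != 1:
            speeds.append(math.nan)
        else:
            (x0, y0), (x1, y1) = xy[i - 1], xy[i]
            speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds


# Toy trajectory: the animal joins a three-animal blob in the third frame.
positions = [(0, 0), (3, 4), (3, 4), (6, 8)]
blob_sizes = [1, 1, 3, 1]
print(per_frame_speed(positions, blob_sizes))
```

A companion feature such as the number of animals in the blob (here `blob_sizes`) can be passed on unchanged, so that a downstream classifier can learn which behaviors apply only to individually segmented animals.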

3) The number of ground truth testing frames (200 annotations per data set?) seems rather small. While high-quality ground truth data are difficult to collect, the method the authors use for validating the accuracy of their algorithm via manual labeling as "correct" or "incorrect" (or "skip") could be strongly influenced by sampling bias, or observer bias and fatigue, as is the case with any manual annotation task. This is especially worrying with the small number of annotations used here (although the exact number needs to be spelled out more clearly). We ask that the authors expand their ground truth testing set (or justify why an expansion is not necessary) and to more clearly describe how many testing frames were used and how these frames were chosen.

Because concerns regarding our validation method are alluded to in several reviewer comments, we provide an in-depth overview of our rationale in choosing this method here, followed by a description of improvements made in response to these comments.

The ideal validation of anTraX, like any tracking algorithm, is against a ground truth, i.e. the “real” trajectories of each individual in the group, for the duration of the entire experiment, and for all benchmark datasets. The common way to generate such ground truth data is by performing manual tracking by a human, or preferably by a number of people independently. No such ground truth data readily exist for experiments that represent the intended use-niche of anTraX: long-term video recordings of groups of color tagged social insects. Moreover, such ground truth data is practically impossible to generate for any of our benchmark datasets (spanning altogether trajectories in total length of thousands of hours).

Therefore, several possibilities can be considered to validate our algorithm:

1) Benchmark anTraX using smaller datasets, for example by taking a short segment of a few of our benchmark datasets and manually tracking them fully. While this approach will definitely give the most reliable performance estimation for the tested dataset, it will introduce a significant sampling bias. One of the challenges in tracking long experiments is the non-stationarity, expressed in slow transitions between many behavioral modes, changes in background, changes in tag quality, etc. Taking a short segment out of a long experiment will miss this behavioral and experimental complexity and will not necessarily be predictive of how well anTraX does on long experiments.

2) Comparing anTraX to tracking results of an existing algorithm that is assumed to perform close-to-perfectly. However, as we explained above, no such algorithm exists for tracking data of the kind we optimized anTraX for (color-tagged groups of closely interacting insects). Reviewer #1 suggested several clever ways around this problem: First, they suggested using the fact that our approach is actually not limited to color-tags, but can classify individual animals based on any image feature. Such features can include the natural appearance of the animals (not unlike the principle behind idTracker.ai), or the features of a machine-readable barcode attached to the animal, which can be tracked well by existing algorithms. Reviewer #1 also very kindly offered to share a dataset of tracked barcode marked animals. However, although we would be excited to try this approach and thank the reviewer for their generosity, we feel this is more of an “off label” use of the algorithm. We optimized the anTraX image processing pipeline and CNN classifier for use with low-resolution images and color-based information. It would need to be significantly changed in order to work equally well with barcodes. This validation method also suffers from the problem that performance would be estimated against a benchmark outside the range of scenarios we developed anTraX to handle.

3) A second suggestion by reviewer #1 is to compare anTraX’s performance on groups of animals tagged with a single-color mark each, to a simple tracking approach (color blob tracking). The problem with this suggestion is that such a tracker is not expected to perform well on cases of tag occlusion (e.g., when an animal grooms itself) or animal interactions, and hence cannot serve as a gold standard. This method also suffers again from a sampling bias, this time choosing a relatively easy tracking scenario, with a low number of individuals.

4) The reviewer’s third suggestion is to create an “artificial colony” by superimposing a few single animal videos, each with a unique color tag. This way, ground truth data can be generated by tracking the single animal videos, while using anTraX to track the combined video. While this is an interesting suggestion, it suffers again from the problem of not testing anTraX directly on the type of experiments and range of behaviors we designed it for. It could be a nice complementary approach but will not give a good sense of how anTraX performs on “real” data.

We feel that our chosen approach, while not perfect, has significant advantages over these alternatives:

1) The random sampling (of timepoints and individuals) of validation points across the entire experiment offers the most unbiased way of subsampling our extensive benchmark dataset collection. This allows for a simple way to estimate the gross performance of the algorithm over a wide range of experimental conditions.

2) anTraX’s validation interface also allows us to narrow down the range of tracking point selection by time or individual, thus allowing a more specific estimation of tracking performance. For the purpose of this tools-and-resources paper, we think the average performance for a given dataset is the most informative. For analyses more specific to certain behaviors, this feature is most useful. For example, in a soon-to-be-published work (preprint available at https://www.biorxiv.org/content/10.1101/2020.08.20.259614v1), we analyzed the foraging behavior of ant colonies, and estimated the tracking performance during very specific events in time, and for specific ants of interest.

3) The random sampling of validation points, together with the simple binary measure, allows for estimating confidence intervals for the tracking error. This enables a simple determination of the number of points required for validation, based on the accuracy needed for such an estimation.

4) It is true that like any manual annotation task, our method suffers from observer bias and fatigue, as the reviewers pointed out. The best way around this is using multiple annotations of the same dataset. However, this is a classical bias/variance tradeoff: it is the same total effort to annotate N points twice (lower bias but lower confidence) or 2N points once (higher bias but higher confidence). Since our algorithm does not provide marginal improvement over other methods, but rather introduces the ability to track experiments that could not be tracked by any other existing method (as we detail in the answer to issue #1), we felt that optimizing this tradeoff is of less importance.

5) For the same reason, we felt that demonstrating how our algorithm is able to successfully track many conditions is of more relevance than giving very precise error estimations. This is the reason we were originally satisfied with 200 validation points per experiment, which corresponds to confidence intervals of at most 1-3%.
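
As a rough illustration of points 3 and 5, the width of a confidence interval for a binary (correct/incorrect) error estimate can be related to the number of validation points. The sketch below assumes the normal (Wald) approximation to the binomial distribution and a 95% level; this is an illustrative assumption, not a description of how the intervals in Table 2 were computed, and the function names and example error rates are hypothetical.

```python
import math


def wald_half_width(error_rate, n_points, z=1.96):
    """Half-width of an approximate 95% CI for a binomial proportion (Wald)."""
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n_points)


def points_needed(error_rate, half_width, z=1.96):
    """Smallest number of points whose Wald half-width does not exceed `half_width`."""
    return math.ceil(z ** 2 * error_rate * (1.0 - error_rate) / half_width ** 2)


if __name__ == "__main__":
    # Example error rates (1% and 5%) are illustrative, not reported results.
    for n in (200, 500):
        for p in (0.01, 0.05):
            print(f"n={n}, error={p:.2f}: +/- {100 * wald_half_width(p, n):.2f}%")
```

Under these assumptions, moving from 200 to 500 points shrinks the half-width by a factor of sqrt(200/500), or about 0.63, whatever the underlying error rate.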

In order to address, at least partially, the reviewers’ concerns regarding the validity of our tracking performance estimation, we expanded our analysis in the following ways:

1) We increased the number of total validation points to 500 (from 200 in the original analysis) for each experiment. This significantly reduces the confidence intervals of the estimates (estimates and confidence intervals are reported in Table 2). The 500 points in the new analysis were resampled from the experiments and did not use the 200 points used in the original analysis. Noticeably, the difference between the estimated error in the original analysis and that of the new expanded analysis was negligible, providing further reassurance that our validation approach is reliable.

2) For each of the validation points, we extracted the duration of the tracklet to which it belongs. Under the assumption that validation of a point along a tracklet represents the entire tracklet well, we estimated that the total 9500 validation points used across all datasets represent a total of almost 4 hours of trajectory data. Thus, crudely, our analysis is equivalent to benchmarking against a fully tracked dataset of that volume, albeit sampled randomly across all the experimental conditions represented in our datasets.

3) To address point #5 below, we further analyzed all validation points according to their type of assignment, either directly by the classifier, or by the propagation algorithm. We estimated the assignment accuracy per assignment type. No significant difference was found in the accuracy of the two types of assignments. The results of this analysis are reported in the new Figure 4—figure supplement 1.

4) To address point #2 of reviewer #3, we added an analysis of the distribution of lengths of erroneous segments. While this is not directly correlated with overall performance, it is informative with regard to the algorithm’s ability to block error propagation. This analysis is reported in the new Figure 4—figure supplement 2.

5) We expanded the description of the rationale and details of the validation method in the revised version, including a clear statement about the size of the validation set, and the way it was sampled:

“Because the recording duration of these datasets is typically long (many hours), it is impractical to manually annotate them in full. Instead of using fewer or smaller datasets, which would have introduced a sampling bias, we employed a validation approach in which datasets were subsampled in a random and uniform way. In this procedure, a human observer was presented with a sequence of randomly selected test points, where each test point corresponded to a location assignment made by the software to a specific ID in a specific frame.”

“The process was repeated until the user had identified 500 points as either correct or incorrect”.

“This procedure samples the range of experimental conditions and behavioral states represented in each of the datasets in an unbiased manner, and provides a tracking performance estimate that can be applied and compared across experiments.”

6) We have amended the transparent reporting form to include details of the performance estimation to reflect the error estimation process.
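
The per-assignment-type comparison described in point 3 above amounts to grouping validation points by how the ID was assigned and estimating an accuracy, with an interval, for each group. A minimal sketch follows, assuming each validation point is stored as an (assignment type, correct?) pair and using the Wilson score interval; the data layout, function names, and the toy numbers are hypothetical and are not results from the paper.

```python
import math
from collections import defaultdict


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n_total == 0:
        raise ValueError("need at least one validation point")
    p = n_correct / n_total
    denom = 1.0 + z ** 2 / n_total
    center = (p + z ** 2 / (2 * n_total)) / denom
    margin = z * math.sqrt(p * (1 - p) / n_total + z ** 2 / (4 * n_total ** 2)) / denom
    return center - margin, center + margin


def accuracy_by_type(points):
    """points: iterable of (assignment_type, is_correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # type -> [n_correct, n_total]
    for assignment_type, is_correct in points:
        counts[assignment_type][0] += int(is_correct)
        counts[assignment_type][1] += 1
    return {t: (c / n, wilson_interval(c, n)) for t, (c, n) in counts.items()}


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    toy = ([("classifier", True)] * 290 + [("classifier", False)] * 10
           + [("propagation", True)] * 192 + [("propagation", False)] * 8)
    for kind, (acc, (lo, hi)) in accuracy_by_type(toy).items():
        print(f"{kind}: accuracy={acc:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Overlapping intervals for the two groups would be consistent with "no significant difference", although a formal test would be needed to support that claim.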

We hope that the reviewers will appreciate that this was a significant amount of work, and that they will deem our improved and expanded validation approach suitable for the purpose of this publication.

4) Relatedly, performing an analysis of the method's accuracy as a function of the number of training frames is an important means to assess the practicability of the method to potential users.

As with any supervised machine learning algorithm, the accuracy of the blob classifier in anTraX depends on the size and quality of the training set. While it is true that, generally, the more examples the NN is given to train on, the more accurate it will be, the size of the training set is less important than the distribution of the examples in the image space. A good training set will have a denser distribution of examples in regions of the image space that are both harder to classify (i.e., regions where images belonging to different individuals are packed closely together) and relevant for the task (i.e., images that represent the actual data well). Like many supervised machine learning tracking algorithms (e.g., LEAP, SLEAP, DLC), anTraX works best using an iterative training approach: starting by either using an existing classifier (possibly trained for a different experiment) or one trained on a limited set of easy examples for a first pass tracking run, then using misclassified images to enhance the training set. This recommended workflow is described in detail in the online documentation.

In the revised manuscript, we have addressed the reviewers’ request as follows:

1) We have added an analysis of the accuracy of the “blob classifier” as a function of number of training frames. The training frames have been resampled randomly from the complete training set. The results are reported in the new Figure 4—figure supplement 3.

2) We have also added an analysis of the relationship between the accuracy of the blob classifier and the final performance of the tracking algorithm. This gives the user a sense of what to aim for when iteratively training the blob classifier. This is reported in the new Figure 4—figure supplement 3.

3) In addition, as a response to a request by reviewer #2, we have expanded the detailed description of the training step in the Appendix:

“In short, the GUI presents the user with all the blob images from a random tracklet. The user can then select the appropriate ID and choose to either export all images into the training set, or to select only a subset of images (useful if not all blobs in the tracklet are recognizable). In many cases, especially in social insects where behavioral skew can be considerable, some animals are rarely observed outside an aggregation. It is therefore challenging to collect examples for them using a random sampling approach. One solution to this problem, which is the recommended one for high throughput experiments, is to pool data from several experiments into one classifier as discussed in the main text. Another solution, in case this is not possible, is to scan the video for instances in which the focal animal leaves the group, and “ask” the GUI for tracklets from this segment of the video. Alternatively, one can do a first pass of classification (not full tracking but simply running the blob classifier), and then ask the GUI to display only unclassified tracklets, increasing the probability of spotting the missing animal.”
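
The analysis in point 1 above (classifier accuracy as a function of the number of training frames, with training frames resampled randomly from the complete set) follows a standard learning-curve recipe. A minimal sketch is given here, assuming a toy nearest-centroid classifier and synthetic Gaussian feature vectors as stand-ins for the anTraX CNN and real blob images; all function names and data are hypothetical and not part of anTraX.

```python
import numpy as np


def make_toy_data(n_per_class, n_classes=4, n_features=8, seed=0):
    """Synthetic 'blob features': one Gaussian cluster per individual ID."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=3.0, size=(n_classes, n_features))
    x = np.concatenate([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return x, y


def nearest_centroid_fit(x, y):
    """Store one mean feature vector per label present in the training subset."""
    labels = np.unique(y)
    centroids = np.stack([x[y == k].mean(axis=0) for k in labels])
    return labels, centroids


def nearest_centroid_predict(model, x):
    """Assign each sample the label of its closest centroid."""
    labels, centroids = model
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]


def learning_curve(x_train, y_train, x_test, y_test, sizes, n_repeats=5, seed=0):
    """Mean test accuracy for random training subsets of each requested size."""
    rng = np.random.default_rng(seed)
    curve = {}
    for size in sizes:
        accs = []
        for _ in range(n_repeats):
            idx = rng.choice(len(x_train), size=size, replace=False)
            model = nearest_centroid_fit(x_train[idx], y_train[idx])
            pred = nearest_centroid_predict(model, x_test)
            accs.append(float(np.mean(pred == y_test)))
        curve[size] = float(np.mean(accs))
    return curve


if __name__ == "__main__":
    x, y = make_toy_data(n_per_class=300)
    order = np.random.default_rng(1).permutation(len(x))
    train, test = order[:800], order[800:]
    curve = learning_curve(x[train], y[train], x[test], y[test], sizes=[8, 32, 128, 512])
    for size, acc in curve.items():
        print(size, round(acc, 3))
```

Averaging over several random subsets per size reduces the influence of any single lucky or unlucky draw, which matters most at the small training-set sizes.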

5) The anTraX algorithm also uses multiple types of tracking subroutines, so it would be prudent to compare accuracy across these different stages of tracking (e.g. images labeled by the CNN classifier might be very accurate compared to other parts of the tracking). Using only 200 random samples could easily be oversampling frames where individuals are well separated or accurately labeled by the CNN classifier, which would heavily bias the results to look as if the algorithm is highly accurate when the quality of the tracking could in fact be highly variable. Also, for example, it is unclear how a wrong ID propagates across tracklets. While the methods appear robust to the issue, it should be discussed explicitly and addressed here. Performing these types of comparisons across stages of tracking would be informative to the reader to assess whether the approach would be a good one for their particular data.

We have performed this analysis, and the results are reported in the new Figure 4—figure supplement 1. See also our response to issue #3 above.

6) The reviewers felt that the text fails to acknowledge its similarity with previous methods in two aspects. For instance, the concept of tracklet has been used before, at least in Pérez-Escudero et al., 2014 and Romero-Ferrero et al., 2019, with essentially identical criteria to define each tracklet (although this paper presents a significant improvement in the method to assign blobs via optic flow), and the concept of representing the tracking with a network has been used at least in M. Schiegg, P. Hanslovsky, B. X. Kausler, L. Hufnagel, F. A. Hamprecht. Conservation Tracking. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), 2013. We accordingly ask the authors to add a more thorough exposition of how anTraX fits into the previous tracking literature.

We thank the reviewers for pointing this out. We definitely borrowed the term “tracklet” from previous work and did not mean to make it seem otherwise. We have added the appropriate citation to the revised manuscript:

“First, similar to other multi-object tracking algorithms, it segments the frames into background and ant-containing blobs and organizes the extracted blobs into trajectory fragments termed tracklets (Pérez-Escudero et al., 2014, Romero-Ferrero et al., 2019).”

We also appreciate pointing us to the line of work using graph models to represent tracking data and we explicitly mention it in the revised manuscript:

“The tracklets are linked together to form a directed tracklet graph (Nillius et al., 2006)”

“While formal approaches for solving this problem using Bayesian inference have been proposed (Nillius et al., 2006), we chose to implement an ad-hoc greedy iterative process that we found works best in our particular context.”

Reviewer #1:

This paper is an important and timely contribution to the fields of collective, social, and quantitative animal behavior and will certainly serve to make innovative research questions in these fields more tractable, especially those related to group-living animals such as eusocial insects.

[…]

Comparing the anTraX algorithm across different factors (time, group size, tag complexity, occlusion, subroutine, etc.) with both the simple baseline of blob-detection-based tracking, existing (more complex) software, and the so-called "gold standard" of barcode tracking would go a long way to help better place this work in the broader context of existing tracking algorithms for animal groups. Additionally these added comparisons would help readers make a better informed decision whether or not using the anTraX software (vs. other tracking software) is worth the investment for their research project.

We appreciate the reviewer’s many suggestions regarding the validation of the anTraX algorithm, especially their generous offer to share data! We have addressed many of the major points in our response to issue #3 in the combined review. To answer also the other points:

Comparing to baseline trackers: This is indeed of interest to us as developers, and during the design and development of anTraX we have considered and measured the performance of many alternatives to the different computational steps in the “baseline” tracking. However, the anTraX approach here is very conservative and traditional. This is not where the novelty of anTraX lies. Moreover, we do not make any claims to outperform other algorithms here, and as our choices were mostly guided by the needs of later stages in the algorithm, it is reasonable to assume that other algorithms will outperform anTraX. We agree that a systematic comparison between different segmentation and linkage algorithms would be interesting and useful for the community, but feel it is well outside of the scope of the current paper.

Comparing performance across group size and tag complexity: While it is true that our benchmark datasets contain examples for various tag complexities and group sizes, they also vary in many other important properties (e.g., image quality, resolution, species, number of colors). While readers are free to look at the raw numbers reported in Tables 1 and 2, we are hesitant to make explicit claims regarding the dependency of performance on any specific factor. Such an analysis will require a properly controlled experiment in which the feature in question varies while the others are held constant.

Computer clusters: anTraX’s approach to parallelization is straightforward: the experiment is tracked in chunks, each as a separate job/thread, and a “stitching” step is run at the end. The gain is therefore proportional to the parallelization factor used, minus the very low overhead of the stitching step. Because we developed anTraX for high-throughput, long-duration experiments, we included a built-in interface to handle this parallelization, unlike other software packages.
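
The chunk-and-stitch pattern described above can be sketched generically as follows. This is not the anTraX interface: `track_chunk` and `stitch` are hypothetical placeholders, and a thread pool is used only to keep the example self-contained (CPU-bound tracking would more plausibly run as separate processes or cluster jobs).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_frames, chunk_size):
    """Return (start, stop) frame ranges that together cover 0..n_frames."""
    return [(s, min(s + chunk_size, n_frames)) for s in range(0, n_frames, chunk_size)]


def track_chunk(frame_range):
    """Hypothetical stand-in for tracking one chunk of video."""
    start, stop = frame_range
    # A real tracker would return tracklets; here each frame yields a dummy record.
    return [{"frame": f, "chunk_start": start} for f in range(start, stop)]


def stitch(chunk_results):
    """Concatenate per-chunk outputs in chunk order (placeholder for stitching)."""
    stitched = []
    for result in chunk_results:
        stitched.extend(result)
    return stitched


def track_in_parallel(n_frames, chunk_size, n_workers=4):
    """Track all chunks concurrently, then stitch the results together."""
    chunks = split_into_chunks(n_frames, chunk_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(track_chunk, chunks))  # map preserves chunk order
    return stitch(results)


if __name__ == "__main__":
    records = track_in_parallel(n_frames=1000, chunk_size=300)
    print(len(records), "frame records from", len(split_into_chunks(1000, 300)), "chunks")
```

Because chunks are independent until the final step, the wall-clock time scales roughly with the number of chunks divided by the number of workers, plus the cost of stitching.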

Reviewer #2:

[…]

1) My biggest worry with this technique is about the treatment of the animal groups (the multi-animal blobs).

– What if the video is mostly made of connected groups (potentially morphing into/out of each other) and there are very few instances of single animals moving about between groups? In experiments at high density this is certainly the case and other animals that live in groups spend a vast amount of time huddled together. Would anTraX be appropriate for this kind of data? I worry that because of this issue anTraX is less broadly applicable than pitched in the article.

– How is the position of each final track found if there is a group involved? Is it just the centroid of the multi-animal blob? Doesn't this cause discontinuities in the final tracks that are problematic for further analysis (e.g. using JAABA as the authors highlight)? Also, JAABA doesn't really apply for animals when they are in a group, right, because the resultant track is shared among all animals in the group?

This has been addressed in the response to issue #2 in the combined review section.

– The authors highlight a DLC analysis of the single animal images. But this fails for groups, right? I think this needs to be said more clearly.

Yes! Pose tracking is currently only available for individually segmented animals. We have emphasized this more clearly in the revised version. We plan to integrate the recently introduced maDLC and SLEAP to extend this into multi-animal tracklets, although tight groups will probably not benefit much from this. We have emphasized this point in the revised version:

“Currently, this is only supported for single-animal tracklets, where animals are well segmented individually.”

We also added a discussion point with regard to multi animal pose estimation:

“Lastly, a newer generation of pose-estimation tools, including SLEAP (Pereira et al., 2020) and the recent release of DeepLabCut with multi-animal support, enable the tracking of body parts for multiple interacting animals in an image. These tools can be combined with anTraX in the future to extend pose tracking to multi-animal tracklets, and to augment positional information for individual animals within aggregations.”

2) At several steps in the analysis the segments (blobs) are fit to ellipses. This obviously makes sense for elongated objects, but what if the animal segments (which are inherently smoother than the animal itself) are essentially round? Would this affect the analysis?

The main use of the ellipse fitting in the algorithm is to assign an initial orientation to the blob. This orientation is not used in the algorithm itself; it is just added to the tracking output. For low-eccentricity blobs, this orientation will be practically meaningless, but it will not affect the tracking algorithm. If orientation is needed in such cases, it can also be found using pose tracking, or using the blob classifier (for identifiable blobs only).
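
To illustrate why orientation becomes uninformative for round blobs, the sketch below estimates a blob's orientation from the second moments of its pixel coordinates, a common way to fit an ellipse to a binary mask. It is an assumption that this resembles the fitting used in anTraX; the function name and test masks are hypothetical.

```python
import numpy as np


def blob_orientation(mask):
    """Return (angle_rad, eccentricity) of the second-moment ellipse of a binary blob.

    The angle is measured from the image x-axis (columns) and folded into
    [0, pi), because an ellipse axis has no head or tail.
    """
    rows, cols = np.nonzero(mask)
    if rows.size < 2:
        raise ValueError("blob needs at least two pixels")
    coords = np.stack([cols, rows]).astype(float)  # x = column, y = row
    cov = np.cov(coords)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    major = eigvecs[:, 1]  # eigenvector of the largest eigenvalue
    angle = float(np.arctan2(major[1], major[0]) % np.pi)
    ratio = np.clip(eigvals[0] / eigvals[1], 0.0, 1.0)
    eccentricity = float(np.sqrt(1.0 - ratio))
    return angle, eccentricity


if __name__ == "__main__":
    elongated = np.zeros((20, 40), dtype=bool)
    elongated[8:12, 5:35] = True  # long axis along the columns (x)
    yy, xx = np.mgrid[:41, :41]
    disk = (yy - 20) ** 2 + (xx - 20) ** 2 <= 100  # nearly round blob
    print("elongated:", blob_orientation(elongated))
    print("disk:", blob_orientation(disk))
```

For the disk, the two eigenvalues are nearly equal, so the eccentricity is close to zero and the reported angle is determined by numerical noise rather than by the shape.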

3) How are the training images for the CNN found? The text says "To train the classifier, we collect a set of example images for each classifier label (Appendix—figure 4-6). This can be done easily using an anTraX GUI app (see Online Documentation for details).". The authors should describe how the training images are selected etc in the main manuscript. Are they taken from a random sample of the single animal blobs? What about animals that are rarely represented in that set, e.g. if an animal is almost always in the group but rarely appears as a single?

We have expanded the technical description of the training set collection in the appendix to include these details:

“In short, the GUI presents the user with all the blob images from a random tracklet. The user can then select the appropriate ID and choose to either export all images into the training set, or to select only a subset of images (useful if not all blobs in the tracklet are recognizable). In many cases, especially in social insects where behavioral skew can be considerable, some animals are rarely observed outside an aggregation. It is therefore challenging to collect examples for them using a random sampling approach. One solution to this problem, which is the recommended one for high throughput experiments, is to pool data from several experiments into one classifier as discussed in the main text. Another solution, in case this is not possible, is to scan the video for instances in which the focal animal leaves the group, and “ask” the GUI for tracklets from this segment of the video. Alternatively, one can do a first pass of classification (not full tracking but simply running the blob classifier), and then ask the GUI to display only unclassified tracklets, increasing the probability of spotting the missing animal.”

4) Recent advances in NN-based pose tracking now allow for multiple animals (see maDLC and SLEAP on which I am an author). I realize that these packages just recently became available but it would be useful for the authors to compare their method to those which don't utilize the tags for ID. This is not strictly necessary for a revision but would clearly be of interest to the field.

SLEAP and maDLC are definitely exciting developments in the field, and we have been waiting for their publication. While these methods are primarily for pose-tracking, they can also be viewed as pose-assisted centroid trackers, where pose information is used to solve cases of overlap and occlusions. We did not refer to these methods in the original manuscript because it was submitted a couple of weeks before their release. The revised version now explicitly describes and cites them.

In short, and as we described earlier in this response letter, these methods can be seen as sophisticated marker-less trackers. As such, they suffer from the same disadvantages as other marker-less trackers: accumulation of identity switching errors and dependency on high image quality. Therefore, they are less suited to directly track the same types of experiments as anTraX.

We see these methods as complementary to anTraX, and plan on porting them into the anTraX interface in a similar manner to what we did for single animal DLC. This will allow us to extend pose tracking, which, as the reviewer highlighted above, works currently only for single animal tracklets, to multi animal tracklets. We also hope to use their advantage in separating overlapping animals to improve our tracking accuracy by better locating individuals in multi animal blobs. The revised version now expands on this point in the Discussion:

“Lastly, a newer generation of pose-estimation tools, including SLEAP (Pereira et al., 2020) and the recent release of DeepLabCut with multi-animal support, enable the tracking of body parts for multiple interacting animals in an image. These tools can be combined with anTraX in the future to extend pose tracking to multi-animal tracklets, and to augment positional information for individual animals within aggregations.”

Reviewer #3:

The paper presents a tracking system for manually marked individuals. Overall, I think it's a really good paper, and the software seems a useful contribution, both for potential users and for future development of similar tracking systems. The paper is clear and well written, and the conclusions seem well supported by the results. All my issues are mostly about presentation:

1) One of the most time-consuming steps is training of the classifier, in which the user must manually annotate many images of the animals. This step is almost absent from the current main text. Even the supplement does not give an estimate of how many images need to be manually annotated. In my opinion, the main text should explicitly address this step, and include an estimate of the amount of manual work needed.

As we describe in our response to issue #4 in the combined review, we have added an analysis of the number of frames required to train the blob classifier. We also expanded the description of the labeling procedure in the appendix. Note, however, that as the labeling is done per tracklet (which can have up to a few hundred blobs in some experiments), the number of labeled blobs is not always a good predictor of the manual work needed.

2) Error propagation may be an issue with this algorithm, since a wrong ID can propagate across tracklets. The method seems quite robust against this issue, but I think that it should be discussed explicitly and addressed in the validation. To do so, I think the validation should include more information. For all the mistakes detected, it should report length of the tracklet with wrong ID and the certainty of the ID assignment (since this certainty will correlate with the probability of propagation). Also, for each mistake detected, the authors should check whether that mistake propagated to any neighboring tracklets. While this must be done manually, given the low number of mistakes after ID propagation, it should be easily doable.

As the reviewer mentions, an incorrect classification will indeed propagate locally to multi-ant tracklets and unidentified single ant tracklets. However, in the majority of cases, such propagations will not be carried for long, and will terminate due to a contradiction, either by bumping into a correct classification of the individual, or by contradicting a higher-ranked assignment of the incorrect ID. Assuming the error rate is not too high, this will result in a small disconnected ID subgraph (see Appendix section “Propagating IDs on the tracklet graph”, subsection “Propagation of incorrect classifications”, for definition), that will be filtered out (Appendix—figure 9).

We have included an analysis in line with what the reviewer suggested, reported in the new Figure 4—figure supplement 2 (see also our response to issue #3 in the combined review).
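
The idea of filtering out a small disconnected ID subgraph can be sketched as a connected-components computation over the tracklets assigned to one ID. The sketch below is illustrative only: the graph representation, the function names, and the rule of keeping the component with the largest total duration are assumptions, and the actual criterion is the one defined in the Appendix.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Connected components of the graph restricted to `nodes` (edges treated as undirected)."""
    nodes = set(nodes)
    neighbors = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


def filter_id_subgraph(id_tracklets, edges, durations):
    """Keep the component with the largest total duration; return (kept, discarded)."""
    components = connected_components(id_tracklets, edges)
    if not components:
        return set(), set()
    main = max(components, key=lambda c: sum(durations[t] for t in c))
    return main, set(id_tracklets) - main


if __name__ == "__main__":
    # Hypothetical tracklet graph: t5 carries the same ID but is cut off from the rest.
    edges = [("t1", "t2"), ("t2", "t3"), ("t3", "t4"), ("t5", "t6")]
    durations = {"t1": 40.0, "t2": 12.5, "t3": 30.0, "t5": 1.5}
    kept, discarded = filter_id_subgraph(["t1", "t2", "t3", "t5"], edges, durations)
    print("kept:", sorted(kept), "discarded:", sorted(discarded))
```

In this toy case the short, isolated tracklet is dropped, which mirrors how a locally propagated misclassification would be removed once it is blocked by contradicting assignments.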

3) In my opinion the method has enough novelty to grant publication in eLife, but I feel that the text fails to acknowledge its similarity with previous methods in two aspects: First, the concept of tracklet has been used before, at least in Pérez-Escudero et al., 2014 and Romero-Ferrero et al., 2019, with essentially identical criteria to define each tracklet (although this paper presents significant improvement in the method to assign blobs via optic flow). Second, the concept of representing the tracking with a network has been used at least in M. Schiegg, P. Hanslovsky, B. X. Kausler, L. Hufnagel, F. A. Hamprecht. Conservation Tracking. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), 2013.

Addressed in the combined review section.

4) I find that one of the main limitations to usability of this software is that installation seems hard. From the description in the web, I understand that I need all the following: (1) A Linux machine, (2) Python (and know what a virtual environment is and how to create it), (3) Matlab or Matlab's MCR, (4) git (and knowledge about how to clone a repository). Given that the target audience of this software are experimentalists in the field of animal behavior, I think that these requirements will severely damage its usability. And it seems that part of these requirements are in fact unnecessary (at least git should not be needed for basic users). And even having the necessary skills, the need of so many different steps makes me worry that something will fail along the way. I think that a modest investment in simplifying this step will increase the number of users substantially.

The somewhat complicated installation process is mostly due to the hybrid nature of our software, combining MATLAB and Python components. This does not allow us to use PyPI to directly install the software. We tried to compensate by providing detailed step-by-step installation instructions that were tested with a few experimental ecologists. We are aware of this weak point, and plan to move to a pure Python implementation in the next version of anTraX, which will simplify the process tremendously. It will, however, take some time. For now, we followed the reviewer’s suggestion and added a direct download installation flow to the online documentation (https://antrax.readthedocs.io/en/latest/installation11/#Get-anTraX).

https://doi.org/10.7554/eLife.58145.sa2

Article and author information

Author details

  1. Asaf Gal

    Laboratory of Social Evolution and Behavior, The Rockefeller University, New York, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    asafg1@gmail.com
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-0834-2649
  2. Jonathan Saragosti

    Laboratory of Social Evolution and Behavior, The Rockefeller University, New York, United States
    Contribution
    Conceptualization, Data curation, Software, Investigation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Daniel JC Kronauer

    Laboratory of Social Evolution and Behavior, The Rockefeller University, New York, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing - review and editing
    For correspondence
    dkronauer@rockefeller.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-4103-7729

Funding

National Institute of General Medical Sciences (R35GM127007)

  • Daniel JC Kronauer

Searle Scholars Program

  • Daniel JC Kronauer

Klingenstein-Simons (Fellowship Award in the Neurosciences)

  • Daniel JC Kronauer

Pew Charitable Trusts

  • Daniel JC Kronauer

Howard Hughes Medical Institute

  • Daniel JC Kronauer

Human Frontier Science Program (LT001049/2015)

  • Asaf Gal

Rockefeller University (Kravis Fellowship)

  • Jonathan Saragosti

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Z Frentz for invaluable advice during the development of the algorithm. T Kay, O Snir, S Valdés Rodríguez and L Olivos-Cisneros helped in collecting and marking ants and flies. Y Ulrich, G Alciatore, and V Chandra tested the software and shared datasets for benchmarking. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM127007 to DJCK. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was also supported by a Searle Scholar Award, a Klingenstein-Simons Fellowship Award in the Neurosciences, a Pew Biomedical Scholar Award, and a Faculty Scholars Award from the Howard Hughes Medical Institute to DJCK. AG was supported by the Human Frontier Science Program (LT001049/2015). JS was supported by a Kravis Fellowship awarded by Rockefeller University. This is Clonal Raider Ant Project paper #14.

Senior Editor

  1. Catherine Dulac, Harvard University, United States

Reviewing Editor

  1. Gordon J Berman, Emory University, United States

Reviewers

  1. Joshua W Shaevitz, Princeton University, United States
  2. Alfonso Perez-Escudero

Publication history

  1. Received: April 22, 2020
  2. Accepted: October 29, 2020
  3. Version of Record published: November 19, 2020 (version 1)

Copyright

© 2020, Gal et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


