Figures and data

The ASBAR Framework.
The ASBAR framework’s data and model pipeline (red) comprises two modules: a pose estimation module (green) based on DeepLabCut and an action recognition module (blue) integrating models from MMAction2.
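As a rough illustration of this two-module structure (not the actual ASBAR configuration format), the pipeline can be summarized as follows; the backbone, keypoint count, and clip length shown here are taken from the figures below:

```python
# Illustrative summary of the two-module pipeline (red) described above;
# this is not the actual ASBAR configuration format.
asbar_pipeline = {
    "pose_module": {                      # green: DeepLabCut-based pose estimation
        "backbone": "resnet_152",         # best-performing variant in the model comparison
        "num_keypoints": 17,              # keypoints per individual (cf. PCK example below)
    },
    "action_module": {                    # blue: MMAction2-based action recognition
        "input": "stacked pseudo-heatmaps (3D heatmap volume)",
        "classifier": "3D-CNN",
        "clip_len": 20,                   # consecutive frames per sample, as in the experiments
    },
}
```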

From RGB image to pseudo-heatmaps.
The transformation of an RGB image into a 3D heatmap volume. An input image is passed through a Conv-Deconv architecture to output a probabilistic scoremap of the keypoint location (e.g., the right elbow). By finding a local maximum in the scoremap, the location coordinates and confidence can be extracted. Using a Gaussian transformation, a pseudo-heatmap is generated for each keypoint and used as input to the subsequent behavior recognition model.
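A minimal sketch of this transformation, assuming a simple argmax readout of the scoremap and an isotropic Gaussian with an illustrative sigma (the exact rendering parameters are not stated in the caption):

```python
import numpy as np

def argmax_with_confidence(scoremap):
    """Read the keypoint location (x, y) and its confidence from a single-keypoint scoremap."""
    iy, ix = np.unravel_index(np.argmax(scoremap), scoremap.shape)
    return ix, iy, float(scoremap[iy, ix])

def pseudo_heatmap(x, y, score, height, width, sigma=2.0):
    """Render one keypoint as a 2D Gaussian pseudo-heatmap whose peak equals its confidence."""
    ys, xs = np.mgrid[0:height, 0:width]
    return score * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
```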

From extracted poses to behavior classification.
From a set of consecutive RGB frames (e.g., 20 in our experiments), the animal pose is extracted, transformed into pseudo-heatmaps, and stacked as input to the behavior recognition model. A 3D-CNN is trained to classify the represented action into the correct behavior category (e.g., here ‘walking’).
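A sketch of the stacking step, reusing the pseudo_heatmap helper above; the (keypoint, time, height, width) axis order is an assumption, not necessarily the layout expected by the recognition model:

```python
import numpy as np

def stack_pseudo_heatmaps(keypoints, scores, height, width, sigma=2.0):
    """keypoints: (T, K, 2) per-frame (x, y) coordinates; scores: (T, K) confidences.
    Returns a 3D heatmap volume of shape (K, T, height, width) for the 3D-CNN."""
    T, K, _ = keypoints.shape
    volume = np.zeros((K, T, height, width), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y = keypoints[t, k]
            volume[k, t] = pseudo_heatmap(x, y, scores[t, k], height, width, sigma)
    return volume
```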

Examples from the pose and behavior datasets.
(Left) Sample images from the OpenMonkeyChallenge dataset, one of the largest collections of primate images annotated with 2D poses. This dataset contains over 100,000 images from 26 primate species. (Right) Sample video frames from the PanAf500 dataset, comprising 500 videos of gorillas and chimpanzees recorded in African forests using camera traps. The dataset includes annotations for bounding boxes and behaviors. Visual challenges include small individual sizes due to camera distance, abundant vegetation, nocturnal imaging, and varying backgrounds.

Final within-domain model performance.
Mean and 95% confidence intervals of the MAE (in pixels) after 40,000 iterations (end of training). Disjoint confidence intervals indicate statistically significant differences. ResNet-152 performs significantly better than all other models on this task.
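For reference, a minimal way to compute the reported statistic; the study’s exact confidence-interval procedure (e.g., whether it is computed across cross-validation folds) may differ from this normal-approximation sketch:

```python
import numpy as np

def mae_with_95ci(pred, gt):
    """pred, gt: (N, K, 2) predicted and ground-truth keypoints in pixels.
    Returns the Mean Average Euclidean Error and a normal-approximation 95% CI."""
    per_image = np.linalg.norm(pred - gt, axis=-1).mean(axis=1)      # mean pixel error per image
    mean = per_image.mean()
    half_width = 1.96 * per_image.std(ddof=1) / np.sqrt(len(per_image))
    return mean, (mean - half_width, mean + half_width)
```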

Models’ relative performance throughout ‘within-domain’ training.
The mean ± std of the Mean Average Euclidean Error (MAE) in pixels (left, lower is better) and the percentage of correct keypoints (PCK nasal dorsum) (right, higher is better) for all nine model variations. Results of 5-fold cross-validation, evaluated on test set data every 5,000 iterations.

Out-of-Domain performance on PanAf500-Pose.
Models are evaluated using two metrics that account for the animal’s relative size and distance: PCK nasal dorsum (left, higher is better) and normalized error rate (right, lower is better). ResNet-152 demonstrates superior performance in predicting great ape poses in their natural habitat. Vertical and horizontal dashed lines indicate the maximum and minimum values, along with the corresponding number of iterations. ResNet-152 at 60,000 iterations is selected for pose extraction.
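The caption does not restate how the normalized error rate is computed; one plausible per-keypoint definition, assuming the same animal-size reference as PCK nasal dorsum, is

```latex
\mathrm{NMER}_k \;=\; \frac{\lVert \hat{p}_k - p_k \rVert_2}{d_{\text{nasal dorsum}}},
```

where \hat{p}_k and p_k are the predicted and ground-truth locations of keypoint k and d_{nasal dorsum} is the per-image nasal dorsum length; the actual normalization used in the study may differ.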

Keypoint detection rate on within-domain vs. out-of-domain test data.
The keypoint detection rate, defined as the percentage of keypoints detected within a given pixel distance, is shown for OMC (left) and PanAf500-Pose (right). For example, within a distance of 10 pixels or less, the nose is detected in approximately 95% of the 89,223 images in OMC. In contrast, the tail is detected within the same distance in only about 38% of cases.
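A sketch of how such a detection-rate curve can be computed for one keypoint type from per-image prediction errors (the threshold range is chosen for illustration):

```python
import numpy as np

def detection_rate_curve(errors, max_dist=50):
    """errors: 1D array of Euclidean distances (in pixels) between predicted and
    ground-truth locations of one keypoint across all images.
    Returns, for each integer threshold d, the fraction of keypoints within d pixels."""
    thresholds = np.arange(max_dist + 1)
    rates = np.array([(errors <= d).mean() for d in thresholds])
    return thresholds, rates  # e.g., the nose would reach ~0.95 at d = 10 on OMC
```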

Normalized error rate for chimpanzees and gorillas in OMC.
Mean and 95% confidence intervals for the normalized error rate (NMER). Disjoint confidence intervals indicate statistical significance. The model demonstrates lower error rates for all gorilla keypoints, suggesting higher prediction accuracy for this species.

Normalized confusion matrix of behavior recognition.
For each true behavior label (rows), the percentage of predictions across all predicted behaviors (columns) is shown. For instance, 51% of samples labeled as ‘standing’ were correctly classified, while 16% were misclassified as ‘walking’ and 33% as ‘sitting.’ The diagonal cells represent the per-class accuracy, and their average corresponds to the Mean Class Accuracy (MCA) metric. A perfect classification model would yield a normalized confusion matrix with values of 1 on the diagonal and 0 elsewhere.
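The matrix and the MCA metric can be reproduced from predicted and true labels as follows (a minimal sketch; behavior labels are assumed to be encoded as integers):

```python
import numpy as np

def normalized_confusion_and_mca(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix (rows: true labels, columns: predicted labels)
    and the Mean Class Accuracy, i.e., the average of the diagonal."""
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    cm /= np.maximum(cm.sum(axis=1, keepdims=True), 1)   # each row sums to 1 (if the class occurs)
    return cm, float(np.mean(np.diag(cm)))
```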

Performance comparison with previous studies.
Comparison of Top-1 Accuracy, Top-3 Accuracy, and Mean Class Accuracy (MCA) between ASBAR and previous video-based methods. ASBAR achieves comparable performance to video-based approaches across all metrics.
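For clarity, Top-k accuracy counts a sample as correct when the true behavior is among the model’s k highest-scoring classes; a minimal sketch (Top-1 is the special case k = 1, and MCA is computed as in the confusion-matrix sketch above):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=3):
    """scores: (N, C) class scores; labels: (N,) integer ground-truth labels."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k highest-scoring classes
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))
```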

Prediction comparison of the nine models at test time.
After 40,000 training iterations, the models’ test predictions are visually compared on one example from the test set. Note, for example, that i) ResNet-50 (center) wrongly predicts the top of the head as the tail’s position, ii) only three models predict the left ankle’s position accurately (ResNet-50 (center), ResNet-101 (center right), and EfficientNet-B1 (bottom left)), and iii) no model correctly detects the left knee’s location.

PCK nasal dorsum.
The turquoise segment represents the length between the center of the eyes and the tip of the nose, i.e., the nasal dorsum. Any model prediction (represented in green) that falls within this distance of the ground-truth location (indicated in red) is considered detected. In this case, all keypoints are detected except for the shoulders, neck, left wrist, and hip (circled in purple). Hence, for this image, the detection rate would be 12/17 ≈ 0.706 = 70.6%.
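The per-image detection rate illustrated here can be written as follows (a sketch; keypoint ordering and array layout are assumptions):

```python
import numpy as np

def pck_nasal_dorsum(pred, gt, eye_center, nose_tip):
    """pred, gt: (K, 2) predicted and ground-truth keypoints of one image.
    A keypoint counts as detected if its error does not exceed the nasal dorsum length."""
    threshold = np.linalg.norm(np.asarray(nose_tip) - np.asarray(eye_center))
    errors = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(errors <= threshold))   # here: 12 of 17 detected -> 12/17 ≈ 0.706
```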

Normalized error rate by family, species, and keypoint.
For all OMC images at test time, we visualize the normalized error rate (NMER) for each species.

Examples of UI elements of the ASBAR graphical user interface.
The GUI is terminal-based and can therefore be rendered even when ASBAR is accessed remotely, for example on a cloud-based platform or a high-performance computer.