Image examples from OpenMonkeyChallenge (pose dataset).

A large collection of primate images annotated with pose. The dataset was primarily designed for an open benchmarking competition and includes a total of more than 100,000 images of primates from 26 species.

Image examples from PanAf (behavior dataset).

Videos of gorillas and chimpanzees are captured in the forest using video camera traps. Notable visual challenges include, among others, the small apparent size of certain individuals due to camera distance, abundant vegetation, nocturnal imaging, and changing backgrounds.

From RGB image to pseudo-heatmaps.

The transformation of an RGB image into a 3D heatmap volume. An input image is passed through a Conv-Deconv architecture to output a probabilistic scoremap of each keypoint location (e.g., the right elbow). By finding a local maximum in the scoremap, the location coordinates and confidence can be extracted. Using a Gaussian transformation, a pseudo-heatmap is generated for each keypoint and used as input to the subsequent behavior recognition model.
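To make this last step concrete, the sketch below renders a single keypoint as a Gaussian pseudo-heatmap from its predicted coordinates and confidence; the image size, the standard deviation, and the scaling by the confidence score are illustrative assumptions rather than the exact parameters of our pipeline.

```python
import numpy as np

def keypoint_to_pseudo_heatmap(x, y, confidence, height, width, sigma=2.0):
    """Render one keypoint as a 2D Gaussian pseudo-heatmap.

    The Gaussian is centered on the predicted (x, y) location and scaled
    by the keypoint's confidence score (assumed convention).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    squared_dist = (xs - x) ** 2 + (ys - y) ** 2
    return confidence * np.exp(-squared_dist / (2 * sigma ** 2))

# Example: right elbow predicted at (48, 30) with confidence 0.9
heatmap = keypoint_to_pseudo_heatmap(48, 30, 0.9, height=64, width=64)
print(heatmap.shape, heatmap.max())  # (64, 64), peak value ~0.9 at the keypoint
```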

From extracted poses to behavior classification.

From a set of consecutive RGB frames (e.g., 20 in our experiments), the animal pose is extracted, transformed into pseudo-heatmaps, and stacked as input to the behavior recognition model. A 3D-CNN is trained to classify the represented action into the correct behavior category (e.g., ’walking’).
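As an illustration of the input format, the sketch below stacks one pseudo-heatmap per keypoint and per frame into a 4D volume and feeds it to a toy 3D-CNN; the tensor layout, dimensions, and network are assumptions for illustration, not the MMAction2 model used in our experiments.

```python
import torch
import torch.nn as nn

# Assumed dimensions: 17 keypoints, 20 frames, 64x64 pseudo-heatmaps
num_keypoints, num_frames, H, W = 17, 20, 64, 64
num_behaviors = 9  # placeholder number of behavior classes

# Stack pseudo-heatmaps into one clip: (keypoints, frames, H, W)
clip = torch.rand(num_keypoints, num_frames, H, W)
batch = clip.unsqueeze(0)  # add batch dimension -> (1, K, T, H, W)

# Minimal 3D-CNN classifier (toy stand-in for the behavior recognition model)
model = nn.Sequential(
    nn.Conv3d(num_keypoints, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, num_behaviors),
)

logits = model(batch)
print(logits.shape)  # (1, num_behaviors); argmax gives the predicted behavior
```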

The ASBAR framework.

The data/model pipeline of the ASBAR framework (red). The framework comprises two modules: the first, pose estimation (green), is based on the DeepLabCut toolbox; the second, action recognition (blue), integrates APIs from MMAction2.
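As a rough sketch of how the two modules can be chained programmatically, the snippet below uses the public DeepLabCut and MMAction2 APIs; the configuration files, checkpoints, and video names are placeholders, and exact function signatures may vary between library versions.

```python
# Pose estimation module (DeepLabCut): train, evaluate, and run inference.
import deeplabcut

config_path = "asbar_project/config.yaml"                    # placeholder project config
deeplabcut.train_network(config_path)                        # train the pose model
deeplabcut.evaluate_network(config_path)                     # 'within-domain' evaluation
deeplabcut.analyze_videos(config_path, ["panaf_clip.mp4"])   # extract poses from video

# Action recognition module (MMAction2): load a recognizer and classify a clip.
from mmaction.apis import init_recognizer, inference_recognizer

recognizer = init_recognizer("behavior_config.py",       # placeholder model config
                             "behavior_checkpoint.pth",  # placeholder weights
                             device="cuda:0")
result = inference_recognizer(recognizer, "panaf_clip.mp4")
```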

Models’ relative performance throughout ’within-domain’ training.

The mean ± std of the mean average Euclidean error (MAE) in pixels (left, lower is better) and of the percentage of correct keypoints (PCK nasal dorsum) (right, higher is better) for all nine model variations. Evaluation results of 5-fold cross-validation on test set data, reported every 5,000 iterations.
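For reference, the MAE is computed per fold as the Euclidean distance between predicted and ground-truth keypoints, averaged over labeled keypoints and images, and then summarized as mean ± std across the five folds; a minimal sketch with placeholder data follows (array shapes are assumptions).

```python
import numpy as np

def mae_pixels(pred, gt, visible):
    """Mean average Euclidean error (MAE) in pixels for one evaluation fold.

    pred, gt: (num_images, num_keypoints, 2) coordinates
    visible:  (num_images, num_keypoints) boolean mask of labeled keypoints
    """
    dists = np.linalg.norm(pred - gt, axis=-1)   # per-keypoint pixel error
    return dists[visible].mean()                 # average over labeled keypoints

# Aggregate across the 5 cross-validation folds (random data as placeholder)
fold_maes = []
for _ in range(5):
    pred = np.random.rand(200, 17, 2) * 256
    gt = np.random.rand(200, 17, 2) * 256
    visible = np.ones((200, 17), dtype=bool)
    fold_maes.append(mae_pixels(pred, gt, visible))

print(f"MAE: {np.mean(fold_maes):.2f} ± {np.std(fold_maes):.2f} px")
```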

Final ’within-domain’ models’ relative performance.

The mean and 95% confidence intervals of the MAE in pixels after 40,000 iterations (end of training). Disjoint confidence intervals indicate statistically significant differences. ResNet-152 performs significantly better than any other model on this task.

’Out-of-domain’ performance on PanAf-Pose.

Models are compared with two metrics accounting for the animal’s relative size and/or distance. PCK nasal dorsum (left, higher is better) and normalized error rate (right, lower is better) demonstrate the superiority of ResNet-152 (RN-152) in predicting great ape poses in their natural habitat. Vertical/horizontal dashed lines represent max/min values and the corresponding number of iterations. We select RN-152 at 60,000 iterations for pose extraction.

Keypoint detection rate on ’within-domain’ vs. ’out-of-domain’ test data.

The keypoint detection rate at a given pixel distance, i.e., the percentage of keypoints detected within that distance, is visualized for OMC (left) and PanAf-Pose (right). For instance, within a distance of 10 pixels or less, the nose is detected in around 95% of the 89,223 images of OMC. In comparison, within the same distance, the tail is detected in only around 38% of the cases.
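The curves underlying this figure can be obtained by computing, for each pixel threshold, the fraction of errors for a given keypoint that fall below that threshold; a minimal sketch with placeholder errors is shown below.

```python
import numpy as np

def detection_rate_curve(errors_px, max_distance=50):
    """Fraction of keypoints detected within each pixel distance.

    errors_px: 1D array of Euclidean errors (in pixels) for one keypoint type,
               e.g., the nose, across all test images.
    """
    thresholds = np.arange(0, max_distance + 1)
    rates = np.array([(errors_px <= t).mean() for t in thresholds])
    return thresholds, rates

# Placeholder errors for one keypoint across a test set
nose_errors = np.abs(np.random.normal(loc=4.0, scale=3.0, size=10_000))
thresholds, rates = detection_rate_curve(nose_errors)
print(f"Detection rate within 10 px: {rates[10]:.1%}")
```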

Normalized error rate for chimpanzees and gorillas.

The mean and 95% confidence intervals of the normalized error rate (NMER). Disjoint confidence intervals suggest statistically significant differences. Here the model’s error rate is lower for all gorilla keypoints, i.e., the model predicts those keypoints more accurately.
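As a sketch of the metric, the normalized error rate divides each keypoint’s pixel error by a per-image reference length so that animals of different apparent sizes remain comparable; in the snippet below, the nasal dorsum length is assumed as that reference, and the data are placeholders.

```python
import numpy as np

def normalized_error_rate(pred, gt, reference_length):
    """Pixel error normalized by a per-image reference length.

    pred, gt:          (num_images, num_keypoints, 2) coordinates
    reference_length:  (num_images,) length in pixels, assumed here to be the
                       nasal dorsum, to account for apparent size and distance
    """
    errors = np.linalg.norm(pred - gt, axis=-1)   # (num_images, num_keypoints)
    return errors / reference_length[:, None]     # normalized per image

pred = np.random.rand(50, 17, 2) * 256
gt = np.random.rand(50, 17, 2) * 256
nasal_dorsum = np.random.uniform(10, 30, size=50)
nmer = normalized_error_rate(pred, gt, nasal_dorsum)
print(nmer.mean(axis=0))  # mean normalized error per keypoint
```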

Performance comparison with previous studies.

Comparison of Top-1 accuracy, Top-3 accuracy, and mean class accuracy with previous video-based methods. Our framework improves both Top-K accuracies while reducing the size of the behavior recognition model by a factor of around 20.
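For clarity, Top-K accuracy counts a prediction as correct when the true class appears among the K highest-scoring classes, and mean class accuracy averages per-class accuracies to compensate for class imbalance; a minimal sketch with placeholder scores follows.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """scores: (num_samples, num_classes) class scores; labels: (num_samples,)."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the K best classes
    return np.mean([label in row for row, label in zip(top_k, labels)])

def mean_class_accuracy(scores, labels):
    """Average of per-class Top-1 accuracies (robust to class imbalance)."""
    preds = scores.argmax(axis=1)
    classes = np.unique(labels)
    return np.mean([(preds[labels == c] == c).mean() for c in classes])

scores = np.random.rand(1000, 9)                 # placeholder class scores
labels = np.random.randint(0, 9, size=1000)
print(top_k_accuracy(scores, labels, k=1),
      top_k_accuracy(scores, labels, k=3),
      mean_class_accuracy(scores, labels))
```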

Final confusion matrix on PanAf.

For each true behavior label (rows), the percentage of predictions is reported across all predicted behaviors (columns). The cells on the diagonal represent the percentage of correct predictions per class. For example, 61% of all samples labeled ’standing’ were correctly classified, while the remaining ones were wrongly predicted as ’sitting’ (28%) or ’walking’ (11%).
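A row-normalized confusion matrix of this kind can be computed as sketched below (placeholder labels, with a subset of behavior classes assumed for illustration); each row then sums to 1 and the diagonal holds the per-class accuracy.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

behaviors = ["sitting", "standing", "walking"]   # illustrative subset of classes
y_true = np.random.choice(len(behaviors), size=500)
y_pred = np.random.choice(len(behaviors), size=500)

# normalize="true" divides each row by the number of samples of that true label,
# so each row sums to 1 and the diagonal holds the per-class accuracy.
cm = confusion_matrix(y_true, y_pred, normalize="true")
for name, row in zip(behaviors, cm):
    print(name, np.round(row, 2))
```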

Examples of UI elements of the ASBAR graphical user interface.

The GUI is terminal-based and can therefore be rendered even when accessed on a remote machine, such as a cloud-based platform or a high-performance computing cluster. Researchers may thus remotely train and evaluate the different models of DeepLabCut and MMAction2 without writing any programming code or terminal commands. See more details at https://github.com/MitchFuchs/asbar

Prediction comparison of the nine models at test time.

After 40,000 training iterations, the models’ test predictions are visually compared on one example of the test set. Note, for example, that i) ResNet-50 (center) wrongly predicts the top of the head as the tail’s position, ii) only three models predict the left ankle’s position accurately (ResNet-50 (center), ResNet-101 (center right), and EfficientNet-B1 (bottom left)), and iii) no model correctly detects the left knee’s location.

PCK nasal dorsum.

The turquoise segment represents the length between the center of the eyes and the tip of the nose, i.e., the nasal dorsum. Any model prediction (represented in green) that falls within this distance of the ground-truth location (indicated in red) is considered detected. In this case, all keypoints are detected except for the shoulders, neck, left wrist, and hip (circled in purple). Hence, for this image, the detection rate is 12/17 ≈ 0.706, i.e., around 70.6%.
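The per-image computation described above can be written compactly: a keypoint counts as detected when its prediction error does not exceed the nasal dorsum length; the sketch below uses placeholder coordinates.

```python
import numpy as np

def pck_nasal_dorsum(pred, gt, eye_center, nose_tip):
    """Fraction of keypoints whose error is within the nasal dorsum length.

    pred, gt:             (num_keypoints, 2) predicted and ground-truth coordinates
    eye_center, nose_tip: (2,) points defining the nasal dorsum segment
    """
    threshold = np.linalg.norm(np.asarray(nose_tip) - np.asarray(eye_center))
    errors = np.linalg.norm(pred - gt, axis=-1)
    return (errors <= threshold).mean()

gt = np.random.rand(17, 2) * 256
pred = gt + np.random.normal(scale=5.0, size=gt.shape)   # noisy placeholder predictions
rate = pck_nasal_dorsum(pred, gt, eye_center=[120, 80], nose_tip=[120, 95])
print(f"Detection rate: {rate:.1%}")  # fraction of the 17 keypoints detected
```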

Normalized error rate by family, species, and keypoint.

For all OMC images at test time, we visualize the normalized error rate (NMER) for each species.