Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.
Read more about eLife’s peer review process.
Editors:
- Reviewing Editor: Ammie Kalan, University of Victoria, Victoria, Canada
- Senior Editor: George Perry, Pennsylvania State University, University Park, United States of America
Reviewer #1 (Public Review):
This work provides a new dataset of 71,688 images of different ape species across a variety of environmental and behavioral conditions, along with pose annotations for each image. The authors demonstrate the value of their dataset by training pose estimation networks (HRNet-W48) on their own dataset as well as on other primate datasets (OpenMonkeyPose for monkeys, COCO for humans), ultimately showing that the model trained on their dataset performed best (performance measured by PCK and AUC). Together with ablation studies in which they train pose estimation models with either specific species removed or a certain percentage of the images removed, they provide solid evidence that their large, specialized dataset is uniquely positioned to aid the task of pose estimation for ape species.
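For readers less familiar with the metrics mentioned here, the following is a minimal sketch of how PCK at a fixed threshold and the AUC over a grid of thresholds are commonly computed; the bounding-box normalization and the 0–0.5 threshold range are assumptions about convention, not a description of the authors' exact evaluation code.

```python
import numpy as np

def pck(pred, gt, bbox_size, visible, thresh=0.2):
    """Fraction of visible keypoints predicted within `thresh` * bbox_size of the
    ground truth.  pred, gt: (N, K, 2); bbox_size: (N,); visible: (N, K) bool."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) pixel errors
    norm = dist / bbox_size[:, None]                 # errors normalized by instance scale
    return float((norm[visible] <= thresh).mean())

def auc(pred, gt, bbox_size, visible, thresholds=np.linspace(0.0, 0.5, 51)):
    """Mean PCK over a grid of thresholds, i.e. area under the PCK curve."""
    return float(np.mean([pck(pred, gt, bbox_size, visible, t) for t in thresholds]))
```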
The diversity and size of the dataset make it particularly useful: it covers a wide range of ape species and poses, making it well suited for training off-the-shelf pose estimation networks or for contributing to the training of a large foundational pose estimation model. In conjunction with new tools focused on extracting behavioral dynamics from pose, this dataset can be especially useful for understanding the basis of ape behaviors using pose.
Since the dataset provided is the first large, public dataset of its kind exclusively for ape species, more details should be provided on how the data were annotated, as well as summaries of the dataset statistics. In addition, the authors should provide the full list of hyperparameters for each model that was used for evaluation (e.g., mmpose config files, textual descriptions of augmentation/optimization parameters).
Overall, this work is a terrific contribution to the field and is likely to have a significant impact on both computer vision and animal behavior.
Strengths:
- Open-source dataset with excellent documentation of the annotation format, as well as example code for working with it.
- Properties of the dataset are mostly well described.
- Comparison to pose estimation models trained on humans vs. monkeys, finding that models trained on human data generalized better to apes than those trained on monkey data, in accordance with phylogenetic similarity. This provides evidence for an important consideration in the field: how well can we expect pose estimation models to generalize to new species when using data from closely or distantly related ones?
- Sample efficiency experiments reflect an important property of pose estimation systems, indicating how much data would be necessary to generate similar datasets for other species, as well as how much data may be required for fine-tuning these types of models (also characterized via ablation experiments in which some species are left out).
- The sample efficiency experiments also reveal important insights about the scaling properties of different model architectures, finding that HRNet saturates in performance improvements as a function of dataset size sooner than other architectures such as CPMs (even though HRNets still perform better overall).
Weaknesses:
- More details are needed on the training hyperparameters used (preferably the full config if trained via mmpose).
- Should include a dataset datasheet, as described in Gebru et al. 2021 (arXiv:1803.09010).
- Should include a crowdsourced annotation datasheet, as described in Diaz et al. 2022 (arXiv:2206.08931). Alternatively, the specific instructions provided to Hive/annotators would be highly relevant to convey what annotation protocols were employed here.
- Should include model cards, as described in Mitchell et al. 2019 (arXiv:1810.03993).
- It would be useful to include more information on the sources of the data, as the images were collected from many different sites and individuals, which may introduce structural biases such as lighting conditions that vary with geography and time of year.
- Is there a reason not to use OKS (object keypoint similarity)? This metric incorporates several factors such as landmark visibility, scale, and landmark-type-specific annotation variability, as in Ronchi & Perona 2017 (arXiv:1707.05388). The latter (variability) could use the human pose values (for landmark types that are shared), the least variable keypoint class in humans (eyes) as a conservative estimate of accuracy, or leverage a unique aspect of this work (crowdsourced annotations), which affords the ability to estimate these values empirically (see the OKS sketch following this list).
- A report of the scales present in the dataset would be useful (e.g., a histogram of unnormalized bounding boxes) and would align well with existing pose dataset papers such as MS-COCO (arXiv:1405.0312), which reports the distribution of instance sizes and instance density per image (see the bounding-box sketch following this list).
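To make the OKS suggestion concrete, here is a minimal sketch of a COCO-style OKS computation; the per-keypoint sigma values shown are placeholders (the COCO "eye" value of roughly 0.025 is used only as the conservative example mentioned above), not values taken from this paper.

```python
import numpy as np

def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity for a single instance (COCO-style).
    pred, gt: (K, 2) keypoint coordinates; visible: (K,) bool;
    area: instance scale (e.g. bounding-box or segmentation area in pixels);
    sigmas: (K,) per-keypoint falloff constants reflecting annotation variability."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)              # squared pixel errors
    k2 = (2.0 * sigmas) ** 2                            # per-keypoint variance terms
    e = d2 / (2.0 * area * k2 + np.spacing(1))          # scale-normalized error exponent
    return float(np.exp(-e)[visible].mean())            # average over labeled keypoints

# Sigmas could be borrowed from shared human landmarks, set conservatively to the
# least-variable human class (eyes, ~0.025), or estimated from the crowdsourced
# annotations in this dataset.  Placeholder values, not from the paper:
sigmas_example = np.full(16, 0.025)
```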
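And a sketch of how the requested scale statistics could be reported, assuming for illustration a COCO-style JSON layout with `bbox` fields; the file name and format are assumptions, not necessarily the OpenApePose release format.

```python
import json
import numpy as np

# Hypothetical file name and COCO-style layout; the actual release format may differ.
with open("annotations.json") as f:
    coco = json.load(f)

# Square root of bounding-box area is a common proxy for instance scale.
scales = np.sqrt([w * h for (_, _, w, h) in (a["bbox"] for a in coco["annotations"])])
counts, edges = np.histogram(scales, bins=50)                 # instance-scale histogram
per_image = len(coco["annotations"]) / len(coco["images"])    # instance density
print("median instance scale (px):", np.median(scales))
print("mean instances per image:", per_image)
```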
Reviewer #2 (Public Review):
The authors present the OpenApePose database, a collection of over 70,000 ape images that will be important for many applications within primatology and the behavioural sciences. The authors have also rigorously tested the utility of this database against available pose image databases for monkeys and humans to clearly demonstrate its solid potential. However, the variation in the database with regard to individuals, background, and source/setting is not clearly articulated; this information would benefit those wishing to make use of the resource in the future. At present, there is also a lack of clarity as to how this image database can be extrapolated to aid video data analyses, which would be highly beneficial as well.
I have two major concerns with regard to the manuscript as it currently stands which, if addressed, I think would aid the clarity and utility of this database for readers.
1. Human annotators are mentioned as manually placing the 16 landmarks for all images, but there is no assessment of inter-observer reliability or the like. I think something to this end is currently missing, along with a statement of how many annotators there were (see the sketch after this point for one way such reliability could be summarized). This will be essential to know for others who may want to use this database in the future.
Relevant to this comment, in your description of the database, a table or similar could be included, providing the number of images from each source/setting per species and/or the number of individuals, i.e., something to give a brief overview of the variation beyond species (subspecies would also be of benefit, for example).
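To illustrate what such a reliability summary could look like: if some images were labeled by multiple annotators, the spread of their annotations around the consensus location, normalized by instance scale, could be reported per keypoint. This is only a sketch under that assumption; the array names are illustrative, not from the paper.

```python
import numpy as np

def annotation_spread(annotations, bbox_size):
    """Per-keypoint inter-annotator variability for one image.
    annotations: (A, K, 2) landmark coordinates from A annotators;
    bbox_size: scalar instance scale used for normalization."""
    mean_pos = annotations.mean(axis=0)                      # (K, 2) consensus location
    dists = np.linalg.norm(annotations - mean_pos, axis=-1)  # (A, K) deviations
    return dists.mean(axis=0) / bbox_size                    # (K,) normalized spread
```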
2. You mention around line 195 that you used a specific function for splitting the dataset into training, validation, and test sets, but there is no information given as to whether this was simply random or whether an attempt was made to balance across species, individuals, and background/source. I would actually think that a balanced approach would be more appropriate/useful here, so whether or not this was done, and the reasoning behind that choice, must be justified (see the stratified-split sketch below).
This is especially relevant given that in one test you report balancing across species (for the sample size subsampling procedure).
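For illustration, a species-balanced (stratified) split is straightforward to express, for example with scikit-learn; the species names, list contents, and split proportions below are placeholders, and this is not a description of the authors' actual procedure.

```python
from sklearn.model_selection import train_test_split

# Illustrative lists standing in for the real image IDs and species labels.
image_ids = [f"img_{i:05d}" for i in range(1000)]
species = ["chimpanzee", "gorilla", "orangutan", "bonobo", "gibbon"] * 200

# Stratifying on species keeps species proportions equal across splits;
# balancing by individual would instead require a grouped split (e.g. GroupShuffleSplit).
train_ids, rest_ids, train_sp, rest_sp = train_test_split(
    image_ids, species, test_size=0.2, stratify=species, random_state=0)
val_ids, test_ids, _, _ = train_test_split(
    rest_ids, rest_sp, test_size=0.5, stratify=rest_sp, random_state=0)
```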
Another, perhaps major, concern that I think should also be addressed is that this is an image database tested on images, while the abstract and manuscript mention the importance of pose estimation for video datasets; yet the current manuscript does not provide any clear test on video data, nor does it engage with the practicalities of applying this image-based database to video datasets. This needs to be addressed somewhere to clarify the database's practical utility.