vassi – verifiable, automated scoring of social interactions in animal groups

  1. Behavioural Evolution Research Group, Max Planck Institute of Animal Behavior, Konstanz, Germany
  2. International Max Planck Research School, Radolfzell, Germany
  3. Centre for the Advanced Study of Collective Behavior, University of Konstanz, Konstanz, Germany
  4. Department of Wildlife, Fish & Environmental Studies, Swedish University of Agricultural Sciences (SLU), Umeå, Sweden

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Gordon Berman
    Emory University, Atlanta, United States of America
  • Senior Editor
    Kate Wassum
    University of California, Los Angeles, Los Angeles, United States of America

Reviewer #1 (Public review):

Summary:

In this manuscript, Nührenberg et al. describe vassi, a Python package for mutually exclusive classification of social behaviors. This package imports and organizes trajectory data and manual behavior labels, and then computes feature representations for use with available Python machine learning-based classification tools. These representations include all possible dyadic interactions within an animal group, enabling classification of social behaviors between pairs of animals at a distance. The authors validate this package by reproducing the behavior classification performance on a previously published dyadic mouse dataset, and demonstrate its use on a novel cichlid group dataset. The authors have created a package that is agnostic to the mechanism of tracking and will reduce the barrier of data preparation for machine learning, which can be a stumbling block for non-experts. The package also evaluates the classification performance with helpful visualizations and provides a tool for inspection of behavior classification results.

Strengths:

(1) A major contribution of this paper was creating a framework to extend social behavior classification to groups of animals such that the actor and receiver can be any member of the group, regardless of distance. To implement this framework, the authors created a Python package and an extensive documentation site, which is greatly appreciated. This package should be useful to researchers with knowledge of Python, virtual environments, and machine learning, as it relies on scripts rather than a GUI, and it may facilitate the development of new machine learning algorithms for behavior classification.

(2) The authors include modules for correctly creating train and test sets, and for evaluating classifier performance. This is extremely useful. Beyond evaluation, they have created a tool for manual review and correction of annotations, and they demonstrate the utility of this validation tool in the case of rare behaviors, where correct classification is difficult but the number of examples to review is reasonable.

(3) The authors provide well-commented step-by-step instructions for the use of the package in the documentation.

Weaknesses:

(1) While the classification algorithm was not the subject of the paper (the authors used off-the-shelf methods), the pipeline only reproduced the performance reported on the CALMS21 dyadic dataset and did not improve upon previously published results. Furthermore, the results from the novel cichlid fish dataset, including a macro F1 score of 0.45, did not compellingly show that the workflow described in the paper produces useful behavioral classifications for groups of interacting animals performing rare social behaviors. I commend the authors for transparently reporting the results, both with the macro F1 scores and the confusion matrices for the classifiers. The mutually exclusive, all-vs-all data annotation scheme for rare behaviors results in extremely unbalanced datasets, such that categorical classification becomes a difficult problem. To address this performance limitation, the authors built a validation tool that allows the user to manually review the behavior predictions.

(2) The pipeline makes a few strong assumptions that should be made more explicit in the paper.

First, the behavioral classifiers are mutually exclusive and one-to-one: an individual animal can perform only one behavior at any given time, and that behavior has only one recipient. These assumptions are implicit in how the package creates the data structure and should be made clearer to the reader. Additionally, the authors emphasize that they have extended behavior classification to animal groups, but more accurately, they have extended behavioral classification to all possible pairs within a group.
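The all-pairs framing described above can be illustrated with a short sketch (the function name and structure are hypothetical, not vassi's actual API): classification operates over all ordered (actor, receiver) pairs, so a group of n individuals yields n(n-1) candidate dyads per time point.

```python
from itertools import permutations

def directed_pairs(individuals):
    """All ordered (actor, receiver) pairs in a group.

    A group of n individuals yields n * (n - 1) directed dyads; under
    the all-vs-all framing, each dyad is a separate classification unit.
    """
    return list(permutations(individuals, 2))

group = ["A", "B", "C"]
pairs = directed_pairs(group)
# [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("C", "A"), ("C", "B")]
```

Note that the quadratic growth in dyads is also why the background class dominates: most pairs are not interacting at most time points.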

Second, the package expects comprehensive behavior labeling of the tracking data as input. Any frames not manually labeled are assumed to be the background category. Additionally, the package will interpolate through any missing segments of tracking data and assign the background behavioral category to those trajectory segments as well. The effects of these assumptions are not explored in the paper, which may limit the utility of this workflow for naturalistic environments.
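As a rough sketch of the interpolation assumption flagged above (a simplified stand-in, not vassi's implementation), filling an interior tracking gap linearly looks like this; under the package's labeling scheme, the filled frames would then receive the background category like any other unannotated frame.

```python
def interpolate_gaps(track):
    """Linearly interpolate None entries in a 1-D trajectory.

    Assumes gaps are interior (observed values exist on both sides);
    the filled frames are indistinguishable from observed ones downstream.
    """
    track = list(track)
    n = len(track)
    i = 0
    while i < n:
        if track[i] is None:
            start = i - 1  # last observed index before the gap
            j = i
            while j < n and track[j] is None:
                j += 1  # first observed index after the gap
            x0, x1 = track[start], track[j]
            span = j - start
            for k in range(i, j):
                t = (k - start) / span
                track[k] = x0 + t * (x1 - x0)
            i = j
        else:
            i += 1
    return track

filled = interpolate_gaps([0.0, None, None, 3.0])  # [0.0, 1.0, 2.0, 3.0]
```

Exploring how such synthetic segments affect classifier training would address the reviewer's concern directly.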

(3) Finally, the authors described the package as a tool for biologists and ethologists, but the level of Python and machine learning expertise required to use the package to develop a novel behavior classification workflow may be beyond the ability of many biologists. More accessible example notebooks would help address this problem.

Reviewer #2 (Public review):

Summary:

The authors present a novel supervised behavioral analysis pipeline (vassi), which extends beyond previously available packages with its innate support of groups of any number of organisms. Importantly, this program also allows for iterative improvement upon models through revised behavioral annotation.

Strengths:

vassi's support of groups of any number of animals is a major advancement for those studying collective social behavior. Additionally, the built-in ability to choose different base models and iteratively train them is an important advancement beyond current pipelines. vassi also produces behavioral classifiers with precision/recall metrics for dyadic behavior similar to those of currently published packages using similar algorithms.

Weaknesses:

vassi's performance on group behaviors is potentially too low to proceed with (F1 roughly 0.2 to 0.6). Different sources have slightly different definitions, but an F1 score of 0.7 or 0.8 is often considered good, while anything lower than 0.5 is typically considered bad. There has been no published consensus within behavioral neuroscience (that I know of) on a minimum F1 score for use. Collective behavioral research is extremely challenging to perform due to hand-annotation times, and the field needs to discuss the trade-off between throughput and accuracy before these scores can be either adopted or discarded. It would also be useful to see the authors perform a few rounds of iterative corrections on these classifiers to see whether performance improves.
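To make the imbalance point concrete, here is a minimal macro F1 computation (plain Python, not tied to vassi): each class contributes equally to the unweighted mean, so a few missed events of a rare behavior pull the score down even when frame-wise accuracy is high.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Rare classes count as much as the dominant background class, which
    is why macro F1 is a harsh metric on heavily imbalanced datasets.
    """
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Mostly-background frames with one rare behavior: missing half of the
# rare events costs far more macro F1 than the 95% frame accuracy suggests.
y_true = ["bg"] * 18 + ["chase"] * 2
y_pred = ["bg"] * 19 + ["chase"] * 1
score = macro_f1(y_true, y_pred)  # ~0.82, despite 19/20 frames correct
```

This asymmetry between frame accuracy and macro F1 is the core of the throughput-versus-accuracy trade-off raised above.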

While the interaction networks in Figure 2b-c look visually similar based on interaction pairs, the weights of the interactions appear to be quite different between hand and automated annotations. This could lead to incorrect social network metrics, which are increasingly popular in collective social behavior analysis. It would be very helpful to see calculated SNA metrics for hand versus machine scoring to see whether or not vassi is reliable for these datasets.
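The distinction between matching dyads and diverging edge weights can be sketched with toy data (hypothetical counts, not taken from the paper): two annotations can agree on which directed pairs interact while disagreeing on how often, which is exactly what would distort weighted SNA metrics.

```python
from collections import Counter

def edge_weights(events):
    """Weighted directed edges from a list of (actor, receiver) events."""
    return Counter(events)

# Toy hand vs automated annotations of the same recording (hypothetical)
hand = [("A", "B")] * 4 + [("A", "C")] * 1 + [("B", "C")] * 2
auto = [("A", "B")] * 2 + [("A", "C")] * 1 + [("B", "C")] * 4

w_hand, w_auto = edge_weights(hand), edge_weights(auto)
same_dyads = set(w_hand) == set(w_auto)  # True: same interaction pairs
same_weights = w_hand == w_auto          # False: edge weights differ
```

Comparing per-individual metrics derived from such weighted edge lists (e.g., out-strength or rank in an aggression hierarchy) between hand and machine scoring is the kind of validation the reviewer requests.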

Author response:

We thank the reviewers and editors for their assessment and for identifying the main issues of our framework for automated classification of social interactions in animal groups. Based on the reviewers’ feedback, we would like to briefly summarize three areas in which we aim to improve both our manuscript and the software package.

Firstly, we will revise our manuscript to better define the scope of our classification pipeline. As reviewer #1 correctly points out, our framework is built around the scoring and analysis of dyadic interactions within groups, rather than emergent group-level or collective behavior. This structure more faithfully reflects the way that researchers score social behaviors within groups, following focal individuals while logging all directed interactions of interest (e.g., grooming, aggression or courtship), and with whom these interactions are performed. Indeed, animal groups are often described as social networks of interconnected nodes (individuals), in which the connections between these nodes are derived from pairwise metrics, for example proximity or interaction frequency. For this reason, vassi does not aim to classify higher-level group behavior (i.e., the emergent, collective state of all group members) but rather the pairwise interactions typically measured. Our classification pipeline replicates this structure, and therefore produces raw data that is familiar to researchers who study social animal groups with a focus on pairwise interactions. Since this may be seen as a limitation when studying group-level behavior (which usually involves more than two individuals and is undirected), we will make this distinction between different forms of social interaction clearer in the introduction.

Secondly, we acknowledge the low performance of our classification pipeline on the cichlid group dataset. We included analyses in the first version of our manuscript that, in our opinion, can justify the use of our pipeline in such cases (comparison to proximity networks), but we understand the reviewers' concerns. Based on their comments, we will perform additional analyses to further assess whether the use of vassi on this dataset results in valid behavioral metrics. This may, for example, include a comparison of per-individual SNA metrics between pipeline results and ground truth, or equivalent comparisons on the level of group structure (e.g., hierarchy derived from aggression counts). We thank reviewer #2 for these suggestions. As the reviewers further point out, there is no consensus yet on when the performance of behavioral classifiers is sufficient for reliable downstream analyses; although a field-wide discussion is beyond the scope of this manuscript, our results may help to substantiate such a discussion in future research.

Finally, we appreciate the reviewers' feedback on vassi as a methodological framework and will address the remaining software-related issues by improving the documentation and the accessibility of our example scripts. This will lower the technical barrier to using vassi in further research. Additionally, we aim to incorporate a third dataset to demonstrate how our framework can be used for iterative training on a sparsely annotated dataset of groups, while broadening the taxonomic scope of our manuscript.
