Abstract
Automated detection of complex animal behavior remains a challenge in neuroscience. Developments in computer vision have greatly advanced automated behavior detection and allow high-throughput preclinical and mechanistic studies. An integrated hardware and software solution is necessary to facilitate the adoption of these advances in the field of behavioral neurogenetics, particularly for non-computational laboratories. We have published a series of papers using an open field arena to annotate complex behaviors such as grooming, posture, and gait as well as higher-level constructs such as biological age and pain. Here, we present our integrated rodent phenotyping platform, JAX Animal Behavior System (JABS), to the community for data acquisition, machine learning-based behavior annotation and classification, classifier sharing, and genetic analysis. The JABS Data Acquisition Module (JABS-DA) enables uniform data collection with its combination of 3D hardware designs and software for real-time monitoring and video data collection. The JABS Active Learning Module (JABS-AL) allows behavior annotation, classifier training, and validation. We introduce a novel graph-based framework (ethograph) that enables efficient bout-wise comparison of JABS-AL classifiers. The JABS Analysis and Integration Module (JABS-AI), a web application, enables users to deploy and share any classifier trained on JABS, reducing the effort required for behavior annotation. It supports inference and sharing of trained JABS classifiers and downstream genetic analyses (heritability and genetic correlation) on three curated datasets spanning 168 mouse strains that we are publicly releasing alongside this study. This enables the use of genetics as a guide to proper behavior classifier selection. This open-source ecosystem provides the neuroscience and genetics community with a shared platform for advanced behavior analysis and reduces the barrier to entry into this new field.
1 Introduction
Behavioral analysis in animal models seeks to link complex and dynamic behaviors with underlying genetic and neural circuit functions [1]. In the context of disease, altered genetic circuits shape altered neural circuits, which in turn produce altered behaviors. The primary purpose of behavior analysis in the animal is to understand the mechanisms of disease and to seek novel therapeutics to improve human health. The laboratory mouse has been at the forefront of these discoveries. However, linking altered genetic circuits to functional changes in neural circuits and ultimately behavior is challenging. These challenges are broad, but one major hurdle has always been the behavior quantification task itself. Animal behavior quantification has rapidly advanced in the past few years with the application of machine learning to the problem of behavior annotation and with the adoption of computational ethology approaches to behavioral neurogenetics [2–6].
These advances are mainly due to breakthroughs in the statistical learning and computer science fields, which have been adopted and extended for biological applications and have made the task of behavior annotation at high resolution scalable, objective, and more accurate [7]. Although significant advances have been made in the annotation of animal behavior using machine vision, a major challenge remains in the democratization of these technologies. As a simple example, many labs adapt their existing apparatus to generate an intermediate representation of the animal for tasks such as tracking. These are often segmentation masks or keypoints. Each lab generally trains a custom model for these, which, depending on the complexity of the task, can require large amounts of human-annotated training data. Many do not validate or even report the performance of their models, which are taken at face value to work. This is a large data labeling burden that is repeated by individual labs. The next step of extracting behaviors from these intermediate representations is even more challenging. The process entails creating features from intermediate representations followed by heuristics or classifiers to determine when a behavior of interest occurs. Behaviorists often disagree on behavior definitions, even within labs, and therefore these behavior classifiers are incredibly valuable. They encode a behaviorist’s expertise in the form of mathematical weights. Since labs start with niche behavior apparatus and intermediate representations, the process of feature extraction, classification, and the logic of assigning behaviors stays within a lab. That is, it is challenging for labs to share classifiers, because they only work in their hardware setup. This paradigm is not sustainable and prohibits the application of engineering principles to biology. The paradigm described above, combined with the fact that a high level of expertise is needed for proper use and interpretation of machine learning methods, can be a challenge to the reproducibility and replicability of scientific discoveries and ultimately therapeutic discoveries.
This challenge has not gone unnoticed in the field of animal behavior annotation, and labs have created tools for behavior annotation and classification that lower the barrier to entry for non-ML experts [4, 8–10]. These software libraries allow behavior annotation but do not enable a standardized pipeline for data collection, tracking, and classifier sharing, which remains an unmet need.
With this in mind, we present two complementary systems that are designed for behavior characterization in rodent models. The first platform, called JAX Animal Behavior System (JABS), consists of video collection hardware and software, a behavior labeling and active learning app, and an online database for sharing classifiers. This is an open field system which we have used in over 6 papers [11–16]. Adoption of JABS will allow laboratories to bypass the need for creating segmentation or pose estimation models for routine open-field tasks. In addition, existing models for frailty, nociception, seizures, and others can be adopted. The second, called Digital InVivo System (DIV Sys), is a hardened and scalable home-cage monitoring system (see Robertson et al.). Both end-to-end systems are designed to enable community members to leverage others’ work and to extend the capabilities of the system. We hope that these platforms will be adopted and extended by the community.
2 Results
JABS is an integrated platform developed in our lab over the past five years with pose and segmentation models as intermediate representations. Our lab has previously used computer-vision methods to track visually diverse mice under different environmental conditions [11], infer pose for gait and posture analysis [13], and detect complex behaviors like grooming [12]. We have also used computer-vision derived features to predict complex constructs, such as health, frailty, and pain [14, 16]. These models have been trained and validated on genetically diverse mouse strains for high-quality foundational metrics [11, 13]. JABS hardware and software have been used to characterize complex behaviors such as grooming, gait, and posture, as well as complex states such as frailty, pain, and intensity of seizures [11–14]. JAX has made components of JABS, including ML models, free to use for non-commercial purposes.
The process and various components of JABS are illustrated in Figure 1A. Briefly, our system comprises three components encompassing five different processes, namely, i) data acquisition, ii) behavior annotation, iii) classifier training, iv) behavior characterization, and v) data integration. The first component (JABS-DA module) is the custom-designed standardized data acquisition hardware and software that provides a controlled environment, optimized video storage, and live monitoring capabilities. The second component (JABS-AL module) is a Python-based GUI active learning app for behavior annotation and training classifiers using the annotated data. One can then use the trained classifiers to predict whether the behavior occurs in unlabeled frames. The last component of JABS is the analysis and integration module (JABS-AI), a web application that provides an interactive user interface to browse the strain survey results from different classifiers and download existing classifiers and related training data. The app can also be used to classify various behaviors in user-submitted videos (pose files) using the classifiers available in the database. Furthermore, researchers have the option to contribute their custom classifiers, trained through the JABS-AL app. These user-generated classifiers can be submitted to perform predictions within our extensive strain survey dataset, coupled with comprehensive genetic analysis, including assessments of heritability and genetic correlations. Next, we discuss the individual components of JABS in detail.

JABS data acquisition module
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B) Detailed example of JABS data acquisition including a picture of the monitoring hardware, architecture of the real-time monitoring app, and screenshots from videos taken during daytime and nighttime.
2.1 JABS data acquisition - Hardware and Software
We use a standardized hardware setup for high quality data collection and optimized storage (Figure 1B). The end result is uniform video data across day and night. Complete details of the software and hardware, including 3D designs used for data collection, are available on our Github (https://github.com/KumarLabJax/JABS-data-pipeline/tree/main). We also provide a step-by-step assembly guide (https://github.com/KumarLabJax/JABS-data-pipeline/blob/main/Multi-day%20setup%20PowerPoint%20V3.pptx).
We have organized the animal habitat design into three groups of specifications. The first group of specifications comprises requirements necessary for compatibility with our machine learning algorithms. The second group describes components that can be modified as long as they produce data that adheres to the first group. The third group describes components that do not affect compatibility with our machine learning algorithms. While we distinguish between abstract requirements in group 1 and specific hardware in group 2 that meets those requirements, we recommend that users of our algorithms use our specific hardware in group 2 to ensure compatibility.
The design elements that are critical to match specifications in order to re-use machine learning algorithms include (1a) the camera viewpoint, (1b) minimum camera resolution and frame rate, (1c) field of view and imaging distance, (1d) image quality and (1e) the general appearance of the habitat (cage or enclosure). The design elements that are flexible but impact the compatibility are (2a) camera model, (2b) compute interface for capturing frames, (2c) lighting conditions, (2d) strains and ages of mice and (2e) animal bedding contents and quantity. Design elements that have no impact on compatibility are (3a) height of habitat walls to prevent mice from escaping, (3b) animal husbandry concerns, (3c) mounting hardware, (3d) technician ergonomic considerations and (3e) electrical connection hardware and management.
2.1.1 Group 1 specifications
Our system operates on a top-down camera viewpoint. This specification enables flexibility, allows more diverse downstream hardware, and eases construction. The top-down viewpoint enables wider adoption due to construction simplicity and the ability to test more varied assays. While other approaches such as imaging from the bottom through a clear floor are possible and enable a better view of animal appendages, they come at the cost of limiting assay duration and increasing construction complexity. For instance, long-term monitoring requires bedding, and the accumulation of feces and urine eventually obstructs a bottom-up view. We therefore use top-down data acquisition.
Our algorithms are trained on 800x800 pixel images acquired at 30 frames per second. This resolution was selected to strike a balance between the resolution of the data and the size of data produced. While imaging at higher spatial and temporal resolution is possible and sometimes necessary for certain behaviors, these values were selected for general mouse behavior such as grooming, gait, posture, and social interactions. We train and test our algorithms at this resolution. We note that these are minimum requirements, and down-sampling higher resolution and frame rate data still allows our algorithms to be applied.
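Since these values are minimums, recordings made at higher spatial or temporal resolution can be down-sampled before applying our models. Below is a minimal OpenCV sketch of that idea, assuming a square input frame and an integer frame-rate ratio; the file names and the mp4v codec choice are illustrative placeholders, and actual recordings should follow the compression guidance in section 2.1.2.

```python
import cv2

def downsample(src: str, dst: str, out_size: int = 800, out_fps: int = 30) -> None:
    """Down-sample a higher-resolution recording to the JABS minimum of
    800x800 pixels at 30 fps (assumes a square frame and integer fps ratio)."""
    cap = cv2.VideoCapture(src)
    in_fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(in_fps / out_fps)))      # e.g. 60 fps -> keep every 2nd frame
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")         # illustrative codec choice only
    writer = cv2.VideoWriter(dst, fourcc, out_fps, (out_size, out_size))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            writer.write(cv2.resize(frame, (out_size, out_size)))
        idx += 1
    cap.release()
    writer.release()

# hypothetical usage: downsample("raw_1600x1600_60fps.avi", "jabs_800x800_30fps.mp4")
```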
Similar to the pixel resolution, we also specify the field of view and imaging distance for the acquired images in real-world coordinates. These are necessary to achieve similar camera perspectives on imaged mice. Cameras must be mounted at a working distance of approximately 100cm above the floor of the arena. Additionally, the field of view of the arena should allow for between 5-15% of the pixels to view the walls (field of view between 55cm and 60cm). Mounting the camera far from the arena floor reduces the effect of both perspective distortion and barrel distortion. We selected values such that custom camera calibrations are not necessary, as any error introduced by these distortions is typically less than 1%.
Additionally, image quality must be sufficient for the machine learning algorithms to work well. Hardware and software parameters should be carefully adjusted to achieve sharpness and overall image quality similar to our training data. While we cannot provide an exact number or metric to meet this quality, users of our algorithms should strive for quality equal to or better than that in our training data. One of the most overlooked aspects of image quality in behavioral recordings is image compression. We recommend against using typical software-default video compression settings and instead recommend using either the defaults outlined in the software we use or recording uncompressed video data. Software defaults will introduce compression artifacts into the video and will affect algorithm performance.
Finally, the general appearance of the cage should be visually similar to the variety of training data used in training the machine learning algorithms. Documentation for assessing the limitations of each individual algorithm is published [11–14]. While our group strives for broad visual diversity in mouse behavioral assays, we still need to acknowledge that any machine learning algorithm should always be validated on the new datasets to which it is applied. Generally, our machine learning algorithms earlier in the processing pipeline, such as pose estimation, are trained on more diverse datasets than algorithms later in the pipeline, such as pain and frailty predictions.
2.1.2 Group 2 specifications
In order to achieve compliant imaging data for use with our machine learning algorithms, we specify the hardware we use. While the hardware and software mentioned in this section are modifiable, we recommend that careful consideration be taken such that changes still produce compliant video data.
We modified a standard open field arena that has been used for high-throughput behavioral screens [17]. The animal environment floor is 52 cm square with 92 cm high walls to prevent animals from escaping and to limit environmental effects. The floor was cut from a 6mm sheet of Celtec (Scranton, PA) Expanded PVC Sheet, Celtec 700, White, Satin / Smooth, Digital Print Grade, and the walls from 6mm thick Celtec Expanded PVC Sheet, Celtec 700, Gray (6 mm x 48 in x 96 in), Satin / Smooth, Digital Print Grade. All non-moving seams were bonded with adhesive from the same manufacturer. We used a Basler (Highland, IL) acA1300-75gm camera with a Tamron (Commack, NY) 12VM412ASIR 1/2” 4-12mm F/1.2 Infrared Manual C-Mount Lens. Additionally, to control for lighting conditions, we mounted a Hoya (Wattana, Bangkok) IR-80 (800nm), 50.8mm Sq., 2.5mm Thick, Colored Glass Longpass Filter in front of the lens using a 3D printed mount. Our cameras are mounted 105 +/- 5 cm above the habitat floor and powered using the power over ethernet (PoE) option with a TRENDnet (Torrance, CA) Gigabit Power Over Ethernet Plus Injector. For IR lighting, we used six 10-inch segments of LED infrared light strips (LightingWill DC12V SMD5050 300LEDs IR InfraRed 940nm Tri-chip White PCB Flexible LED Strips 60LEDs 14.4W Per Meter) mounted on 16-inch plastic around the camera. We used 940nm LEDs after testing 850nm LEDs, which produced a marked red hue. The light sections were coupled with the manufacturer's connectors and powered from a 120VAC:12VDC power adapter.
For image capture, we connected the camera to an nVidia (Santa Clara, CA) Jetson AGX Xavier development kit embedded computer. To store the images, we connected a local four-terabyte (4TB) USB hard drive (Toshiba (Tokyo, Japan) Canvio Basics 4TB Portable External Hard Drive USB 3.0) to the embedded device. When writing compressed videos to disk, our software both applies optimized de-noising filters and selects low compression settings for the codec. While most other systems rely on the codec for compression, we rely on applying more specific de-noising to remove unwanted information instead of risking visual artifacts in important areas of the image. We utilize the free ffmpeg library to handle these filtering and compression steps, with the specific settings available in our shared C++ recording software. The complete parts list and assembly steps are described at https://github.com/KumarLabJax/JABS-data-pipeline.
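The exact filter chain and encoder settings we use are published in the shared C++ recording software; the Python sketch below only illustrates the general pattern of applying a mild de-noising filter before conservative (low) compression via the ffmpeg command-line tool. The specific filter strengths and CRF value shown here are placeholders, not our published settings.

```python
import subprocess

def encode_with_denoise(src: str, dst: str) -> None:
    """Illustrative re-encode: mild spatio-temporal de-noising followed by
    conservative compression settings, rather than codec defaults."""
    cmd = [
        "ffmpeg", "-i", src,
        "-vf", "hqdn3d=2:1:2:3",   # mild de-noising filter (placeholder strengths)
        "-c:v", "libx264",
        "-crf", "17",              # visually near-lossless compression (placeholder)
        "-preset", "slow",
        "-pix_fmt", "yuv420p",
        dst,
    ]
    subprocess.run(cmd, check=True)
```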
2.1.3 Group 3 specifications
Finally, here we present hardware and software that can be modified without risk of affecting video compliance. For natural light, we used an F&V (Netherlands) fully dimmable R-300SE Daylight LED ring light powered by a 120VAC:12VDC power adapter. These lights are adjustable to meet the visible lighting needs of specific assays without affecting the visual appearance of the data. To keep the animals nourished, we installed water bottles and a food hopper external to the animal environment. These were placed on the outside of the arena on a removable panel. The panel can be customized as needed for experiments without the need to replace or modify the entire arena. To suspend the camera and lights, we used a wire shelf from the shelving system used for technician ergonomics (described in the next paragraph).
To raise the animal cage to an ergonomic height, we used the 24-inch by 24-inch option of the Metro (Wilkes-Barre, PA) Super Erecta wire shelving system with three shelves. As mentioned in the previous paragraph, the topmost shelf was used to suspend the camera and lights. We also hinged one wall, turning it into a door, to allow easier animal access. The electronic devices were interconnected with CAT5 cables and a network switch, and a powered USB hub was used between the USB hard drive and the nVidia compute device. We used a digital timer for the visible LED light, a 120V power strip to consolidate the power, and an uninterruptible power supply (battery backup) between the chamber and facility power.
For ease of use and reduction of environmental noise, we also include software for remote monitoring and welfare checks. The software consists of three main components: a recording client implemented in C++, a control server implemented with the Flask Python framework, and a web-based user interface implemented with Angular (Figure 1). The recording client runs locally on each Nvidia Jetson Xavier computer and communicates with the server using the Microsoft C++ REST SDK to provide centralized monitoring and control of distributed recording devices. The recording client captures raw frames from the camera and encodes video using the ffmpeg library. In addition to saving encoded video on the local hard drive, the recording client can optionally send video over the RTMP protocol to an NGINX server configured with the nginx-rtmp plug-in. The web interface communicates with the control server, which relays recording start and stop commands to individual recording devices, enabling the user to remotely control various aspects of recording in addition to viewing the live stream from the NGINX streaming server using the HTTP Live Streaming (HLS) protocol (Figure 2).
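A minimal sketch of this relay pattern is shown below. The endpoint paths, ports, and device registry are hypothetical placeholders used only to illustrate how the web interface, control server, and recording clients interact; the actual JABS control server exposes its own API.

```python
# Sketch of the control-server relay pattern: web UI -> Flask server -> recording clients.
# Endpoint paths, addresses, and the device registry below are hypothetical.
from flask import Flask, jsonify, request
import requests

app = Flask(__name__)

# hypothetical registry of recording clients (Jetson devices) by name
DEVICES = {"arena-01": "http://10.0.0.11:8080", "arena-02": "http://10.0.0.12:8080"}

@app.route("/devices", methods=["GET"])
def list_devices():
    """Return the devices known to the control server for the dashboard view."""
    return jsonify(sorted(DEVICES))

@app.route("/devices/<name>/recording", methods=["POST"])
def set_recording(name):
    """Relay a start/stop command from the web UI to one recording client."""
    action = request.json.get("action")          # "start" or "stop"
    resp = requests.post(f"{DEVICES[name]}/recording", json={"action": action}, timeout=5)
    return jsonify({"device": name, "status": resp.status_code})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```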

JABS data acquisition module (JABS-DA)
consists of a web-based control system for recording and monitoring experiments. (A) Screenshots from the Angular web client that allows monitoring of multiple JABS acquisition units in multiple physical locations. The Dashboard view allows monitoring of all JABS units and their status, the Device Status view provides detailed data on individual devices, the recording session dashboard allows initiation of new experiments, and the remote welfare view allows live video to be observed from each unit.
2.2 Environment checks
To evaluate the suitability of JABS-DA for long-term housing of mice, we conducted a series of experiments comparing environmental conditions and animal health outcomes within these arenas to those observed in standard JAX housing cages. Our goal was to provide data for the JAX Institutional Animal Care and Use Committee (IACUC) to confirm health and welfare of animals over time in these apparatus. These data can be used for Institutional ACUC protocols by others. We compare our data with established guidelines from the Guide for the Care and Use of Laboratory Animals (the Guide) [18]. Our experiments were performed in one room at The Jackson Laboratory, Bar Harbor, Maine (JAX) with temperature and humidity set to 70-74°F (∼21-23°C) and 40-50%, respectively.
One concern related to use of the JABS arena in long-term experiments was that the 90 cm height of the walls without lower air openings might result in inadequate air flow and a build-up of toxic gases. To address this, we compared environmental parameters in JABS arenas with those of a standard JAX housing cage. Two JABS arenas were observed, each with 12 male C57BL/6J mice 12-16 weeks old, for a 14-day period. At the same time, one wean cage containing 10 male C57BL/6J age-matched mice was observed on a conventional rack in the same room to match air flow. We used a #2 Wean Cage (30.80 x 30.80 x 14.29 cm) from Thoren (Hazleton, Pennsylvania) with 727.8 cm2 floor space, which is a common housing container for mice and is approved at JAX to house 10 animals. This commercial cage has a floor area that is ∼1/4 that of the JABS arena. The ceiling height in the wean cage ranges from 5-14 cm due to the sloped metal wire cover that contains food and water. The JABS arena, by contrast, has no ceiling. Food, water, bedding type and depth, and light level were all matched in the arenas and wean cage. Bedding (1:1 ratio of aspen chip/shavings mix and Alpha-Dri) was left unchanged for the full two-week period to minimize interaction with mice in JABS arenas as much as possible. To determine if forced air flow was needed for an acceptable arena environment, one of the two arenas and the wean cage were exposed to normal room air flow, while the second arena had a 6-inch quiet electric fan mounted above for increased circulation. The fan was pointed to blow air upward to draw air out of the arena instead of actively blowing air towards the mice.
We monitored CO2 and ammonia, common housing gases [18]. CO2 was measured with an Amprobe CO2 meter daily, excluding weekends and holidays, in both arenas and the wean cage. CO2 was recorded in the room’s center before and after each arena and wean cage measurement as a control. For higher levels, CO2 is shown as a range due to oscillation. Ammonia was tested with Sensidyne Ammonia Gas Detector Tubes (5-260 ppm) in the arena without a fan and the wean cage on days 0, 2, 4, 7, and 14, with samples taken near the floor and waste accumulation areas. Temperature and humidity data loggers (MadgeTech RHTEMP1000IS) were placed on the floor in each arena and the wean cage for the experiment’s duration. An environment monitor (Hobo, U12-012, Onset) was mounted on the wall for room background data. Body weight was measured daily, excluding weekends and holidays. Grain and water were weighed at the start and end of each experiment to check consumption.
We observed daily room background CO2 levels of 454 to 502 ppm throughout the 14-day experiment. These are very close to expected outdoor levels and indicative of a high air exchange rate [19]. JABS arena CO2 levels varied from a low of 511 ppm on day 1 to an oscillating range of 630 to 1565 ppm on day 14. The JAX standard wean cage experienced an oscillating range of 2320 to 2830 ppm on day 0, climbing to an oscillating range of 3650 to 4370 ppm on day 14. The wean cage CO2 values approximately match those from another published study of maximum grouped mice in similar static housing [20]. Indoor CO2 is often evaluated as the level above background [19]. We observe a maximum JABS arena CO2 level above background of 1082 ppm. This is 3.8-fold lower than the maximum observed CO2 level above background in the wean cage (4121 ppm) (Figure S1A, arena with fan excluded from graph for clarity).
Ammonia levels in the JABS arena were below 5 ppm on days 0, 2, 4, and 7, rising to 18 ppm on day 14. In the wean cage, levels were <5 ppm on days 0 and 2, rose to 52 ppm on day 4, and were ∼230 ppm on days 7 and 14. Initial concerns about high JABS arena walls hindering airflow were alleviated as CO2 and ammonia levels indicated better air exchange than standard housing. NIOSH's recommended maximum ammonia exposure for humans is 25 ppm over 8 hours, with a similar recommendation for mice [18, 21]. Ammonia levels are mainly influenced by air changes per hour (ACH) [22, 23]. JAX animal rooms have ∼10 ACH and PIV cages have ∼55-65 ACH. Ammonia levels were consistently 10-50 times lower in the JABS arena compared to the control static wean cage and remained well within recommended limits (Figure S1B). Future JABS arena observations must consider the impact of ammonia on behavior [24]. Mice used in JABS experiments come from PIV housing, where ammonia levels are expected to be similar to those in the JABS arena, minimizing behavioral impact [23].
Temperatures in all locations (room background, two JABS arenas, and one wean cage) remained in a range of 22-26°C throughout the experiment. Variance in room background readings suggests temperature fluctuations are due more to innate room conditions (such as environmental controls) than anything else. We find that the arena structure does not adversely affect control of the temperature to which mice are exposed (Figure S1C).
The probes which measured temperature also measured humidity. The room probe, mounted on a wall 1 foot above the floor in the 8x8 feet room, recorded consistent background humidity of 45 ±5% (Figure S1D, green line). Housing probes in the bedding of each chamber (centered in the JABS arenas and along a wall in the smaller wean cage) recorded 55-60% humidity in the JABS arenas, except for occasional spikes not correlated with background changes, likely due to mouse urination (Figure S1D, blue and black lines). In contrast, wean cage humidity rose from 55-60% to above 75% within 12 hours and continued climbing to 97.5% by day 14 (Figure S1D, red line). Higher humidity in the micro-environments was due to mouse urination and limited air flow (Guide [18]). The JABS arenas maintained a drier environment because they had a higher bedding-to-mouse ratio (3.2 times more per mouse) and better air circulation compared to the wean cage (Figure S1D).
Weight is often used as a marker for health, though body condition score is used as a more reliable indicator of serious concerns [15, 25, 26]. Mice in JABS arenas lost weight compared to those in the wean cage, and this was initially a cause of concern. However, mice in JABS arenas maintained a healthy appearance and normal body condition score throughout the experiment. Other measurements demonstrating normal parameters, along with other control experiments not shown, led us to believe the weight differences arise because JABS arena mice are active while wean cage mice, with more limited movement available, are sedentary. Mice started the experiment at 25-33 grams body weight. The lowest average recorded during the experiment was 95.6% of the start value, for mice in the JABS arena without a fan on day 9. The lowest individual value recorded was 85.8% of start value, at 23.6 grams on day 14, also in the arena without a fan (Figure S1E).
Per mouse grain usage was comparable between the JABS arena and the wean cage and in an expected range [27] (Figure S1F). Per mouse water usage was comparable between the JABS arena and the wean cage and in the expected range [28]. Somewhat higher water use in the arena could be indicative of higher activity requiring more hydration (Figure S1G). Since only one JABS arena and one wean cage were tested, error bars are not available to aid in interpretation.
Three mice from one arena and three from a wean cage were necropsied immediately following 14 days in the JABS arena or control wean cage to determine if any environmental conditions, such as possible low air flow in arenas potentially leading to a buildup of heavy unhealthy gases like ammonia or CO2, were detrimental to mouse health. Nasal cavity, eyes, trachea, and lungs were collected from each mouse. They were H&E stained and analyzed by a qualified pathologist. No signs of pathology were observed in any of the tissue samples collected (Figure S2).
Based on these environmental and histological analyses, we conclude that the JABS arena is comparable to, and in many respects better than, a standard wean cage. The lack of holes near the floor does not create a build-up of ammonia or CO2. Mice ate and drank at normal levels. We initially observed a slight decrease in body weight, which recovered over the following days. We hypothesize that this could be due to the novel environment and the increase in space for movement, leading to more active mice.
2.3 JABS-AL: An active learning module for behavior classifier training
In this section, we first present an overview of behavior annotation and classifier training using the JABS-AL module, which utilizes our Python-based, open-source graphical user interface (GUI) application developed to be compatible with Mac, Linux, and Windows operating systems. We then evaluate the utility and accuracy of JABS-trained classifiers through two complementary approaches. In the first approach, we benchmark the performance of JABS classifiers against a previous neural network based approach [12], providing a comparison of the performance of the two approaches on the same dataset. In the second approach, we studied how classifiers for the same behavior trained by two different human annotators in the lab compare with each other in terms of behavior identification, allowing us to assess the inherent variability among expert annotators.
2.3.1 Behavior annotation and classifier training
There are two prominent approaches in the literature for training behavioral classifiers. The first approach trains the classifiers using the raw video files, as previously demonstrated to identify grooming behavior through the use of a deep neural network [12]. The second approach involves first extracting pose keypoints in each frame using deep neural networks, which serve as inputs for machine learning classifiers. Previously, we utilized a deep neural network to extract poses and used the keypoints to study gait behavior [13]. The pose-based approach offers the flexibility to use the identified poses for training classifiers for multiple behaviors, and we used this approach for JABS. Additionally, the extracted keypoints can be used to generate quantifiable and interpretable features for studying various aspects of animal behavior such as gait and posture. In addition to the raw video file, the JABS annotation and classification active learning module requires pose files from our previously established neural network for pose estimation as input to train the classifiers. Note that the raw videos are needed only for annotating behaviors, and one can predict the behaviors using only the pose files.
We have developed an easy-to-use, open-source Python GUI application to annotate behaviors in videos, as shown in Fig. 3A. This tool allows users to easily annotate behaviors in video recordings through mouse/trackpad or keyboard shortcuts, with the option to leave frames unlabeled for ambiguous cases. The GUI provides statistics on the total number of frames as well as the number of frames and bouts annotated for a particular behavior. The annotations are displayed below the video as an ethogram (Fig. 3B). The user can annotate multiple behaviors for the same video. Once a minimum number of frames (100) and videos (2) have been annotated, the user can train a classifier using one of the tree-based methods Random Forest (RF), Gradient Boosting, or XGBoost (XGB) [29–32] and check the classifier's accuracy with k-fold cross-validation, selecting a value of k that balances computational efficiency and accuracy. We used our HRNet-based pose estimation neural network [13] to estimate the location of twelve keypoints in the videos and computed a number of per-frame and per-window features. The per-frame features include informative quantities such as distances between keypoints and linear and angular velocities of keypoints, which are used as input for these classifiers. We also incorporate temporal information from the videos by computing window features that include information from w (window size) frames on each side of the current frame; a minimal sketch of this idea is shown below. A complete list of base features currently included in JABS is provided in the supplementary information (Table S2). The weights assigned to different features by the trained classifiers improve interpretability.
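To make the feature pipeline concrete, here is a minimal NumPy sketch of per-frame features computed from keypoints and of window features aggregated over w frames on each side of the current frame. The keypoint indices, the two features shown, and the mean/std aggregations are illustrative placeholders; the actual JABS feature set is listed in Table S2.

```python
import numpy as np

def per_frame_features(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (n_frames, 12, 2) array of x,y positions.
    Returns an (n_frames, n_features) matrix of illustrative per-frame features."""
    nose, base_tail = keypoints[:, 0], keypoints[:, 9]        # hypothetical keypoint indices
    body_length = np.linalg.norm(nose - base_tail, axis=1)    # distance between two keypoints
    speed = np.r_[0, np.linalg.norm(np.diff(nose, axis=0), axis=1)]  # per-frame nose velocity
    return np.column_stack([body_length, speed])

def window_features(per_frame: np.ndarray, w: int) -> np.ndarray:
    """Aggregate each per-frame feature over a window of w frames on either side
    of the current frame (mean and std shown as simple examples)."""
    out = []
    for i in range(len(per_frame)):
        win = per_frame[max(0, i - w): i + w + 1]
        out.append(np.r_[win.mean(axis=0), win.std(axis=0)])
    return np.asarray(out)
```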

JABS-AL is a behavior annotation and classification module that allows training classifiers with sparse labels.
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B) Screenshot of the Python-based open-source GUI application used for annotating multiple videos frame by frame. One can annotate multiple mice and multiple behaviors. The labeled data is used for training classifiers using either random forest or gradient boosting methods. An adjustable window size (number of frames on the left and right of the current frame) includes features from a window of frames around the current frame. The labels and predicted labels are displayed at the bottom. (C) A sample workflow for training a typical classifier. Multiple experts can sparsely label videos to train multiple classifiers for the same behavior. These classifiers can be compared, and experts can consult to iterate through the training process.
Typically, to arrive at an optimal classifier for a behavior, we start by training multiple classifiers using annotated data from different human experts for the same set of videos and then evaluate the performance of each classifier against a separate set of test videos, as depicted in Fig. 3C. Further, since there is no ground truth for the test videos, we compare the frame-level and bout-level predictions from each classifier (for the same behavior but different expert annotators) against each other to evaluate the degree of agreement and consistency. Finally, depending on the expert consensus on the desired level of agreement, a classifier is selected or the whole process is repeated with new or corrected labels. Once training is completed, the classifier can be exported and used to predict labels for every unlabeled frame in all the videos in the project directory. One can also use the command-line interface of the app in a high-performance computing environment to train and/or predict using the Python scripts included with the software. A detailed user guide along with a video tutorial to install and run the JABS active learning app is available online (https://jabs-tutorial.readthedocs.io/en/latest/JABS_user_guide.html).
2.3.2 Benchmarking JABS classifier using grooming behavior
Previously, a CNN-based grooming behavior classifier trained on raw videos attained human-level accuracy [12]. We re-purpose this large training dataset as a benchmark for estimating the learning capacity of pose-based classifiers. Further, we evaluate how the performance of the classifier varies with the choice of machine learning algorithm, window size (w) of the features, and the amount of training data. For the choice of machine learning algorithm, we utilize two popular tree-based methods, namely Random Forest (RF) and XGBoost (XGB). Briefly, the dataset contains 1,253 video segments; we held out 153 video clips for validation (the same validation set used in [12]) and used the rest to train the classifier. This split results in similar distributions of frame-level classifications between training and validation sets. More details of the dataset are available in Table S1. We trained multiple classifiers by varying the amount of annotated data, window size, and machine learning algorithm. Our best accuracy from the neural network based approach for this dataset was 0.937, and the best classifier from JABS, using all the annotated data, a window size (w) of 60 frames, and the XGB algorithm, achieved a comparable per-frame accuracy of 0.9364. We noticed that with the same set of features, XGB typically achieved better accuracy than RF across different window sizes and training data sizes. The results for these benchmark tests are shown in Figure 4B-D. Our tests with different window sizes show that grooming performance increases as we increase the window size, reaches a maximum (around 60 frames), and then degrades for large window sizes (Fig. 4B). Because grooming typically lasts a few seconds, classifiers using features from nearby frames perform better as they incorporate the relevant temporal context; including features from too few or too many frames decreases performance. We also investigated the impact of the amount of labeled data on the performance of JABS classifiers, as this can help optimize the annotation process, ultimately reducing the time and resources required to train the model. To do this, we trained the XGB and RF classifiers using subsets of the full dataset (about 20 hours) consisting of 10, 20, 50, 100, 500, and 1100 training videos. These correspond to approximately 1.3%, 2.2%, 4.4%, 8.5%, 46.1%, and 100% of a total of 2,181,790 frames. As expected, the performance of JABS improves as we include more labeled data. However, the results demonstrate that a high degree of accuracy, approaching 85%, can be attained using only 10 videos of training data, as evidenced by the corresponding area under the receiver operating characteristic curve (AUROC) of approximately 0.94, as depicted in Figure 4C-E. Additionally, the true positive rate (TPR) decreased only minimally, by about 1%, when the training data was reduced from 100% to 50%, while maintaining a false positive rate (FPR) of 5% (Fig. 4F).
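Below is a minimal sketch of the evaluation used in these benchmarks: training an XGBoost classifier on window features and reporting per-frame accuracy, AUROC, and the TPR at a fixed 5% FPR. It assumes feature matrices and frame labels are already prepared; the hyper-parameters are illustrative and this is not the exact JABS training code.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

def evaluate(X_train, y_train, X_val, y_val):
    """Train an XGBoost classifier on window features and report the per-frame
    metrics used in the benchmarks (accuracy, AUROC, TPR at 5% FPR)."""
    clf = XGBClassifier(n_estimators=200)          # illustrative hyper-parameters
    clf.fit(X_train, y_train)
    prob = clf.predict_proba(X_val)[:, 1]          # probability of the behavior class
    acc = accuracy_score(y_val, prob > 0.5)
    auroc = roc_auc_score(y_val, prob)
    fpr, tpr, _ = roc_curve(y_val, prob)
    tpr_at_5pct_fpr = np.interp(0.05, fpr, tpr)    # TPR at a fixed 5% FPR
    return acc, auroc, tpr_at_5pct_fpr
```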

JABS Benchmarks: Selecting hyper-parameters and benchmarking JABS classifiers using grooming dataset.
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. Using feature window size, type of classification algorithm, and the number of training videos as our benchmarking parameters: (B) Accuracy of JABS classifiers trained using different window size features. Each boxplot shows the range of accuracy values for different numbers of training videos and types of classification algorithms. (C, D) The effect of increasing the training data size on the accuracy and AUROC score of the JABS classifiers. (E) ROC curves for the JABS classifier trained with a window size of 60, the XGB algorithm, and varying training data size. (F) True positive rate at 5% false positive rate corresponding to the JABS classifier from panel (E) as the amount of training data is changed. (G) Comparing the performance of JABS based classifiers with a 3D Convolutional neural network (CNN) and JAABA based classifiers for different training data sizes.
In the rapidly evolving field of automated quantification of animal behavior, two predominant methodologies have been established for learning behavior: using raw video data, and using a reduced representation of the animal with certain keypoints from which informed features are calculated [8, 9, 13, 33]. To understand the trade-offs and strengths of each approach, we evaluate the performance of different classifiers that employ these methodologies when utilizing varying amounts of training data, as depicted in Fig. 4G. Interestingly, our findings demonstrate that utilizing a keypoint-based low-dimensional representation of animal behavior, as employed by the JABS and JAABA [8] methodologies, leads to superior performance compared to using high-dimensional raw video data as employed by 3D CNNs, particularly when the availability of training data is limited. However, as the quantity of training data increases, the performance of both approaches tends to converge.
Therefore, by distilling the essence of a video into a series of key poses, JABS is able to effectively learn and generalize, even with smaller training sets. It has been shown to have a learning capacity on par with deep neural networks, as demonstrated by per-frame accuracy on the same benchmark dataset. Further, achieving 85% accuracy with just 1.4% of the labeled data suggests that researchers can strike a balance between labeling effort and desired accuracy by carefully selecting the amount of training data.
2.4 JABS analysis and integration module
In supervised machine learning, the accuracy and reliability of a trained classifier depend heavily on the quality of labeled data. Further, it has been observed that labeling of the same behavior by different human experts introduces variability among annotations due to a variety of factors, including personal biases, subjectivity, and individual differences in understanding what constitutes a behavior [34, 35]. Therefore, it is critical to accurately capture the inter-annotator variability before selecting classifiers for downstream predictions. To capture this variability, we employ both frame-based and bout-based comparison and demonstrate that bout-based comparison gives a better estimate of inter-annotator agreement.
2.4.1 Frame and bout-wise classifier comparison of inter-annotator variability
In order to test inter-annotator variability, we generated a set of single-mouse behavior classifiers for two simple behaviors, left turn and right turn. We inferred behavior from all four classifiers on a large set of videos and compared the two pairs of classifiers from each annotator (Figures 5, 6). The classifiers for all behaviors achieved good accuracy and F1 scores (Table S3). Further, the classifiers for the same behavior trained with different human annotations resulted in inter-annotator variability in predictions. This inter-annotator variability can be associated with (a) subjective differences in behavior definition among human labelers, (b) varying levels of annotator expertise, and (c) training within and across labs. We investigated the source of this variability and sought to determine the best method to mediate its effects. To capture this effect, we first visualized the predictions made by two classifiers trained for the same behaviors (left and right turn) but with different human annotators: annotator-1 (A1) & annotator-2 (A2). Figure 5B,C shows two sample ethograms corresponding to the predictions made by A1 & A2 for the left turn behavior. These ethograms show a high level of concordance between the two annotators. However, upon closer examination, we observed that the percentage of left or right turn behavior predicted (across all the videos) by A2 was higher than by A1 (see Figure 5D,G). The confusion matrix (shown in Figure 5E,H) quantifies the level of agreement between predictions made by annotators A1 and A2 for left and right turn behavior. However, since this behavioral task is heavily class-imbalanced (the number of frames with no behavior is much larger than that with behavior), accuracy can be misleading, as the classifier can achieve high accuracy by simply predicting the majority class (not-behavior) for all frames. To address this imbalance, we calculate Cohen's kappa (κ) [36], a commonly used measure of inter-annotator agreement that accounts for class imbalance. Mathematically, it is defined as κ = (p_o − p_e)/(1 − p_e), where p_o is the observed frame-wise agreement between the two sets of predictions and p_e is the agreement expected by chance given each classifier's label frequencies.
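As an illustration, the frame-wise κ between two classifiers' predictions can be computed with scikit-learn's cohen_kappa_score; the short arrays below are toy placeholders standing in for predictions over every frame of a video.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy frame-wise predictions (1 = behavior, 0 = not-behavior) from classifiers
# trained on two annotators' labels.
pred_a1 = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 0])
pred_a2 = np.array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0])

kappa = cohen_kappa_score(pred_a1, pred_a2)  # chance-corrected frame-wise agreement
print(f"Cohen's kappa: {kappa:.2f}")
```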

Frame based comparison of classifiers from different annotators but trained for the same behavior.
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B, C) Two sample ethograms for the left turn behavior showing variation in behavior inference for two different annotators. (D, G) Kernel density estimate (KDE) of the percentage of frames predicted to be a left turn and a right turn, respectively, by each annotator across all the videos. The major discrepancy between the two annotators is that A2 systematically predicts a larger number of frames as behavior compared to A1. (E, H) Confusion matrix showing the agreement between predictions of two classifiers over all the videos in the strain survey for left and right turn behavior. (F, I) Venn diagram capturing the frame-wise behavior agreement between the two annotators for left and right turn behavior.

Bout based comparison of classifier predictions from different annotators but trained for the same behavior.
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B) Ethogram depicting frame-wise left turn predictions for annotators A1 (red) and A2 (blue). (C) Ethograph corresponding to the ethogram in panel (B), capturing the bout-level information as a bipartite network. The nodes represent bouts, with node size and color proportional to the bout length and annotator, respectively. Edge weights capture the fraction of bout overlap between two bouts predicted by different annotators for the same behavior. Edge weight and node size with zero value indicate bouts missed by an annotator; these have been given a small positive value for visualization purposes only. (D, E) Bout length distribution of annotators A1 and A2 for left and right turn behavior. (F) The mathematical definition of the average bout agreement between two annotators, where w(u, v) represents the weight between nodes u and v (u ∈ U, v ∈ V) in the ethograph 𝓖(U, V, E) and w∗ is the bout overlap threshold (fixed at 0.5 for our study). (G) Overview of the workflow for stitching and filtering at the bout level. (H, I) Hyper-parameter tuning to find optimal filtering and stitching thresholds. (J) Sample ethogram and its corresponding ethograph before and after applying stitching and filtering. (K) Inter-annotator agreement in frame-wise predictions underestimates the agreement, whereas the bout-wise comparison post filtering and stitching captures the overall agreement in a more biologically meaningful way.
We observed in the ethogram (Figure 5B,C) that although many of the same bouts are captured by both A1 and A2, most of the frame discrepancies occur at the beginnings and ends of bouts. A2 seems to predict longer bouts than A1 (Figure 5D). Between two humans labeling the same behavior, there are unavoidable and sometimes substantial discrepancies in the exact frames labeled, even when the labelers are trained in the same lab [9, 35]. To most behaviorists, detecting the same bouts of behavior is more important than the exact starting and ending frames of these bouts, as again, there are human-level discrepancies in this as well. Therefore, we used a bout-based comparison rather than a frame-based comparison to evaluate the performance of the classifiers.
For the bout-based comparison, we looked at how much overlap there was between the bouts of a behavior predicted by annotators A1 and A2, taking inspiration from the machine learning image-recognition and action-detection fields, where the overlap in pixels between a predicted bounding box and the ground-truth label box is called the intersection over union (IoU) [37, 38]. We developed a graph-based approach called an ethograph to represent the bouts of behavior recorded in the ethograms of annotators A1 and A2. Concretely, we define the ethograph for two annotators as a bipartite graph 𝓖 = (U, V, E), where U and V are two disjoint sets corresponding to the bouts predicted by each annotator, and E represents the edges that connect each element in set U to an element in set V, capturing the overlap in time between the bouts. The vertices of an ethograph represent bouts, with vertex color encoding the annotator and vertex size proportional to the duration of the bout. The edges (E) of the graph (𝓖) represent the temporal overlap between bouts (corresponding to different annotators), with the thickness of the edge proportional to the amount of bout overlap. Figure 6B,C shows the ethograms and their associated ethograph for the left turn behavior as predicted by annotators A1 and A2. In contrast to traditional frame-based ethograms, which simply display the sequential list of frames in which a behavior is observed, the ethograph allows for a more intuitive and visual representation of the temporal overlap between the bouts corresponding to different annotators (or even behaviors). This can be especially useful in identifying patterns and trends that may not be immediately apparent from comparing ethograms. By coloring the vertices and edges based on the annotator, it becomes easy to see which behaviors are consistently identified by both annotators and which are more subjective and open to interpretation. Moreover, we can easily compute the bout-based agreement between the two annotators as the fraction of edges having thickness greater than some fixed threshold (see Figure 6F for the mathematical definition), which is the fraction of bouts having overlap greater than a chosen overlap threshold. The bout agreement between two annotators for the left and right turn at a threshold of 0.5 is shown as a Venn diagram in Figure 6D,E along with the density distribution of bout length. The agreement between the two annotators with the bout-based measure was much better than that with the frame-based comparison (see Figure 5F,I).
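Below is a simplified Python sketch of this bout-level agreement computation: per-frame predictions are converted to bouts, the temporal overlap between bouts is scored IoU-style, and agreement is the fraction of bouts whose best overlap with the other annotator's bouts exceeds the threshold w∗ (0.5 here). It is an illustrative re-implementation of the idea, not the JABS ethograph code.

```python
def bouts_from_frames(pred):
    """Convert a 0/1 per-frame prediction sequence into (start, end) bouts (end exclusive)."""
    bouts, start = [], None
    for i, v in enumerate(list(pred) + [0]):      # trailing 0 closes an open bout
        if v and start is None:
            start = i
        elif not v and start is not None:
            bouts.append((start, i))
            start = None
    return bouts

def overlap_fraction(a, b):
    """Temporal overlap between two bouts as a fraction of their combined span (IoU)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union else 0.0

def bout_agreement(bouts_u, bouts_v, w_star=0.5):
    """Fraction of bouts (ethograph nodes) whose best overlap with any bout from
    the other annotator exceeds the threshold w*."""
    matched = sum(any(overlap_fraction(u, v) >= w_star for v in bouts_v) for u in bouts_u)
    matched += sum(any(overlap_fraction(u, v) >= w_star for u in bouts_u) for v in bouts_v)
    total = len(bouts_u) + len(bouts_v)
    return matched / total if total else 1.0
```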
The predictions coming out of a classifier contained many short bouts (1-3 frames) of behavior that signal false positive bouts, as they are much shorter than a typical bout of annotated behavior. Moreover, certain bouts of behavior were split by very short bouts (1-3 frames) of not-behavior, signaling the presence of false negative bouts that fragment a bout of behavior (see Figure 6). To address this issue, we proposed a stitching and filtering step on the predictions coming out of the classifier. First, we stitched together bouts whose distance to the neighboring bout was less than a fixed threshold. This repaired the fragmented bouts, as illustrated in Figure 6G. We then applied bout filtering, which removed bouts shorter than a fixed threshold. To decide the optimal values of the stitching and bout filtering thresholds, a hyper-parameter scan was performed for each behavior. Figure 6H,I presents the results from the hyper-parameter scan over stitching and bout filtering thresholds when the percentage bout overlap is fixed at 25%, 50%, and 75% for left (H) and right turn (I). Figure 6J captures the effect of applying bout filtering and stitching to a portion of an ethogram corresponding to the predictions made by A1 and A2 for the left turn behavior. The effect is clearly discernible in the changes to the ethograph, particularly with bouts (nodes) having multiple overlaps (edges) reducing to a single overlap (edge) per bout.
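A minimal sketch of the stitching and filtering steps on a list of (start, end) bouts is shown below; the gap and minimum-length thresholds correspond to the per-behavior hyper-parameters selected by the scan in Figure 6H,I, and the values in the usage comment are illustrative.

```python
def stitch_bouts(bouts, max_gap):
    """Merge consecutive bouts separated by gaps of at most max_gap frames,
    repairing bouts fragmented by short runs of false-negative frames."""
    if not bouts:
        return []
    merged = [bouts[0]]
    for start, end in bouts[1:]:
        prev_start, prev_end = merged[-1]
        if start - prev_end <= max_gap:
            merged[-1] = (prev_start, end)      # stitch into the previous bout
        else:
            merged.append((start, end))
    return merged

def filter_bouts(bouts, min_length):
    """Drop bouts shorter than min_length frames (likely false-positive bouts)."""
    return [(s, e) for s, e in bouts if e - s >= min_length]

# Hypothetical usage with thresholds chosen per behavior:
# bouts = filter_bouts(stitch_bouts(bouts_from_frames(pred), max_gap=3), min_length=5)
```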
In summary, when comparing classifiers, it is important to consider the inherent variability of human annotators. Frame-wise comparison penalizes this natural variability, making it a sub-optimal measure of agreement. On the other hand, bout-wise comparison takes this variability into account, making it a more biologically meaningful measure of agreement between classifiers. In addition to using bout-wise comparison, applying techniques like stitching and filtering can further improve agreement by reducing false and fragmented bouts in classifier predictions. By considering these factors, we can better understand inter-annotator variability and design more effective guidelines for behavior annotation.
2.4.2 Compilation of Strain Survey Datasets
In the present study, we have curated and are publicly releasing three comprehensive datasets, namely JABS600, JABS1200, and JABS-BxD, that encapsulate behavioral data derived from approximately 168 unique mouse strains, with a nearly equal representation of females and males. The JABS600 dataset includes a total of 598 videos corresponding to 60 strains, approximately balanced with five males and five females per strain. The JABS1200 dataset contains 1139 videos corresponding to 60 strains, with approximately nine males and nine females per strain. Finally, the JABS-BxD dataset includes a total of 1083 videos corresponding to 108 BxD strains derived from a cross between C57BL/6J (B6) and DBA/2J (D2) mice. The duration of each video is approximately one hour, furnishing a substantial repository of behavioral data, which is invaluable for large-scale automated analysis of behavioral patterns. Furthermore, each video is supplemented with a corresponding keypoint file comprising 12 keypoints per frame, which is instrumental in extracting specific behavioral features. Additional information about the datasets is given in supplemental figure 2. In line with our dedication to scientific openness and collaboration, we have made these datasets - encompassing both the video recordings and the keypoint files - available for public access (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FSAPNJG), making it easier for fellow researchers across labs to leverage our findings, replicate our experiments, and advance the field of automated behavior quantification.
2.4.3 Strain Survey of Multiple Behaviors
One of the advantages of a standardized data acquisition system such as JABS is that data can be repurposed. For instance, a classifier trained by one lab can be used for inference on videos generated by another lab. We trained a set of behavior classifiers using the JABS active learning system and then inferred them on a previously published strain survey dataset [11]. The training dataset was composed of multiple human-annotated short videos (around 10 minutes each), and we trained classifiers for left turn, right turn, grooming, rearing supported, rearing unsupported, scratch, and escape as examples. These can easily be extended to other behaviors. To capture the effect of genotype on behavior, we subsampled the original strain survey dataset to 600 one-hour open field videos representing 60 different strains, with 5 females and 5 males for each strain, and made predictions using the trained classifiers. Further, we define three aggregate phenotypes associated with each behavior, namely the total duration of the behavior (in minutes) in the first 5, 20, and 55 minutes of the one-hour video [12], to capture the dynamic changes in behavior over time. The results are shown in Figure 7B, where the heatmap shows the Z-scores for the total duration of the behavior in 5, 20, and 55 minutes (|Z-score| > 1 thresholding is applied for easier visualization). The red and blue entries for a particular phenotype represent strains exhibiting the behavior more than one standard deviation above and below the mean of the phenotype, respectively. Such data have multiple uses. First, any user of JABS can conduct a rich analysis with little effort to yield biological insight. Such data can be used to refine classifiers by adding edge cases to training data. In addition, downstream genetic analyses such as heritability quantification and GWAS are possible with these data [12, 13]. In our analysis, we observed a high number of escape attempts in C58/J mice. This strain has previously been shown to have a high number of repetitive behaviors, and has been proposed as a strain for the study of autism features [39, 40] (Figure 7, bottom panel). We find that other strains such as I/LnJ, C57/L, and MOLF/EiJ show increased levels of escape behaviors, thus increasing the number of potential strains that could be used to model this behavior.
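The aggregate phenotypes can be assembled directly from frame-level predictions; below is a minimal pandas/NumPy sketch that computes the total duration of a behavior within the first 5, 20, or 55 minutes of a video and z-scores the strain means as in Figure 7B. The data-frame layout and column names are illustrative assumptions, not the JABS analysis code.

```python
import numpy as np
import pandas as pd

FPS = 30  # frames per second of JABS recordings

def total_duration_minutes(pred: np.ndarray, minutes: int) -> float:
    """Total predicted behavior duration (in minutes) within the first `minutes`
    of a one-hour video, from a per-frame 0/1 prediction array."""
    n_frames = int(minutes * 60 * FPS)
    return pred[:n_frames].sum() / (FPS * 60)

def strain_zscores(df: pd.DataFrame, phenotype: str = "duration_55") -> pd.Series:
    """Z-score per-strain means of an aggregate phenotype.
    `df` is assumed to hold one row per animal with 'strain' and phenotype columns."""
    strain_means = df.groupby("strain")[phenotype].mean()
    return (strain_means - strain_means.mean()) / strain_means.std()
```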

JABS-AI module: Aggregated phenotypes for behaviors using our large strain survey, JABS600.
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B) Z-transformed scores for the total duration of behavior (at 5, 20, 55 mins) for each aggregate phenotype (|Z-score| > 1 thresholding is applied for all behaviors except escape).
In addition to phenotypic diversity due to genotype, we explored sexual dimorphism in our dataset with these new classifiers. We examined the impact of sex on the aggregated phenotypes in various strains using a univariate approach. To test the statistical significance of the sex effect, we used a nonparametric rank test and corrected for multiple testing by controlling the false discovery rate (FDR) with the Benjamini-Hochberg method. The LOD scores and effect sizes are presented in Figure S5B, with the left panel showing the strength of evidence against the null hypothesis of no sex effect. The right panel shows the direction and magnitude of the effect, represented by the color and size of each circle, respectively. The strains highlighted in pink exhibit a significant sex effect for at least one of the aggregated phenotypes. It is important to note that we are generally underpowered with five animals of each sex; nevertheless, a high proportion of phenotypes show a sex effect.
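A minimal sketch of this per-strain sex-effect test is given below, using a Mann-Whitney U rank test and Benjamini-Hochberg correction. The column names and data frame layout are assumptions for illustration.

```python
# Per-strain nonparametric test for a sex effect on one aggregate phenotype,
# with Benjamini-Hochberg FDR correction across strains and a LOD-style score.
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def sex_effect_table(df: pd.DataFrame, phenotype: str) -> pd.DataFrame:
    """One row per strain: rank-test p-value, BH q-value, and LOD = -log10(q).
    `df` is assumed to have one row per animal with 'strain' and 'sex' columns."""
    rows = []
    for strain, g in df.groupby("strain"):
        males = g.loc[g["sex"] == "M", phenotype]
        females = g.loc[g["sex"] == "F", phenotype]
        if len(males) and len(females):
            _, p = mannwhitneyu(males, females, alternative="two-sided")
            rows.append({"strain": strain, "pvalue": p})
    res = pd.DataFrame(rows)
    res["qvalue"] = multipletests(res["pvalue"], method="fdr_bh")[1]  # BH correction
    res["lod"] = -np.log10(res["qvalue"])
    return res
```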
2.5 JABS-AI: Heritability, genetic correlation, and GWAS analysis
Next, one of our goals is to understand the genetic architecture that governs these complex behavioral phenotypes. To facilitate this, we utilized the data derived from 49 inbred strains along with 11 F1 hybrid strains to perform a genome-wide association study (GWAS). We excluded the six wild-derived strains because their pronounced divergence risked distorting the outcomes of our mouse GWAS. We first carried out a power analysis for both strain survey datasets (JABS600, JABS1200) using the simulation algorithm provided by the Genome-wide Efficient Mixed Model Association (GEMMA) software. GEMMA accounts for population structure and genetic relatedness between individuals, making it well suited for our inbred and hybrid strains. The power analysis shown in Fig. 8A revealed that we had sufficient statistical power to detect genetic associations; notably, the JABS1200 dataset demonstrated higher power than the JABS600 dataset. With JABS1200 established as our dataset of choice for the GWAS, we assessed each of the 72 phenotypes for potential association with genotype using GEMMA, focusing on the Wald test p-value. These 72 phenotypes are derived from eight basic classifiers: turn left and turn right (each trained by two different annotators), grooming, scratching, supported rearing, and unsupported rearing. Each classifier contributes three bout-based measures (average bout length, total duration, and total number of bouts), and each measure is computed over three time windows (5, 20, and 55 minutes). We tested a substantial number of SNPs (211,077), which necessitated accounting for the inherent correlations among SNP genotypes. To establish an empirical p-value threshold, we randomly shuffled the values of one normally distributed phenotype (TL_T 20_duration) and recorded the smallest p-value from each permutation. This process allowed us to set a p-value threshold of 1.9e-05, corresponding to a corrected p-value of 0.05. We first report the heritability estimates for the phenotypes computed over 55 minutes of observed behavior, as shown in Fig. 8B. Most phenotypes have heritability in the range 0.2 to 0.8, with bout-length-based phenotypes having lower heritability than bout-number- or bout-duration-based phenotypes. Next, to further shed light on the pleiotropic action of genes, we estimated the genetic correlations across these phenotypes using the bivariate linear mixed model implemented in GEMMA. We plot the genomic restricted maximum likelihood (GREML) estimates of the bivariate genetic correlations in Fig. 8C. The magnitude of the genetic correlation estimates the genetic overlap (shared genetic loci) between two traits, whereas the sign gives the direction of the overlapping effects, i.e., a negative sign corresponds to effects in opposite directions on the two traits and vice versa. We hypothesize that, for a given behavior, the bout-based measures share common genetic effects and affect the traits in the same direction.
Indeed, we find positive genetic correlations between the number of bouts (nBouts) and the total duration of the behavior (duration), between the average bout length (avgLen) and the total duration (duration), and between the number of bouts (nBouts) and the average bout length (avgLen) for all behaviors except the turn right behavior from annotator 1 (A1_TR). We also find positive genetic correlations between the two annotators (A1, A2) for the turn left and turn right behaviors, as expected, since the genetic architecture underlying the same behavior annotated by two people should overlap maximally.
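The permutation procedure used above to set the empirical genome-wide threshold can be illustrated with a short sketch: shuffle one phenotype, recompute per-SNP association p-values, record the minimum, and take the 5th percentile of the minima. The real analysis used GEMMA's mixed model; the plain per-SNP linear regression below is a stand-in used only to show the thresholding logic, and the input names are assumptions.

```python
# Empirical genome-wide p-value threshold via permutation of one phenotype.
# A simple per-SNP linear regression replaces the GEMMA mixed model here,
# purely to illustrate the procedure.
import numpy as np
from scipy import stats

def min_pvalue(genotypes: np.ndarray, phenotype: np.ndarray) -> float:
    """genotypes: (n_animals, n_snps) dosage matrix; returns the smallest per-SNP p-value."""
    pvals = [stats.linregress(genotypes[:, j], phenotype).pvalue
             for j in range(genotypes.shape[1])]
    return min(pvals)

def empirical_threshold(genotypes, phenotype, n_perm=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    minima = []
    for _ in range(n_perm):
        shuffled = rng.permutation(phenotype)   # break the genotype-phenotype link
        minima.append(min_pvalue(genotypes, shuffled))
    return np.quantile(minima, alpha)           # threshold controlling family-wise error at alpha

# threshold = empirical_threshold(G, pheno["TL_T 20_duration"].values)  # hypothetical inputs
```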

JABS-AI module: Large-scale GWAS investigation of different mouse behaviors utilizing the JABS1200 dataset
(A) Statistical power comparison between two datasets (JABS600 vs JABS1200) at the genome-wide significance threshold of 2.4e-07; the y axis shows how power varies with SNP effect size (x axis). (B) Heritability (PVE) estimates for the aggregate (55 min) phenotypes. (C) Lower triangular matrix of genetic correlations among all 55-minute aggregate phenotypes, estimated with a bivariate linear mixed model. (D) Linkage disequilibrium (LD) block sizes, along with the mean genotype correlations for SNPs at varying genomic distances. (E) Aggregated GWAS results shown as a Manhattan plot. Peak SNP clusters, extracted from (F), determine the colors; SNPs within the same LD block are colored to match their peak SNP. Each SNP is assigned the minimum p-value across all phenotypes. (F) Heatmap of all significant peak SNPs for each phenotype. Each row, representing an SNP, is colored according to its k-means cluster; the same color scheme is used in panel E.
We adopted a specific approach to identify quantitative trait loci (QTL): we started with the SNP that exhibited the lowest p-value across the genome and designated it as a locus. We then grouped together adjacent SNPs showing a significant level of correlation in their genotypes (r2 ≥ 0.2), employing a greedy strategy. We continued this process, moving on to the next SNP with the lowest p-value until we allocated all significant SNPs to a QTL. Given the inherent genetic structure of inbred mouse strains, large linkage disequilibrium (LD) blocks are expected, as represented in Fig. 8D.
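A minimal sketch of this greedy grouping is shown below: the most significant remaining SNP seeds a locus, all significant SNPs correlated with it at r² ≥ 0.2 are absorbed, and the process repeats. The genotype matrix and p-value vector names are assumptions.

```python
# Greedy grouping of significant SNPs into QTL by genotype correlation.
import numpy as np

def group_qtl(genotypes: np.ndarray, pvalues: np.ndarray,
              p_threshold: float = 1.9e-5, r2_min: float = 0.2):
    """genotypes: (n_animals, n_snps); pvalues: (n_snps,). Returns a list of loci."""
    remaining = list(np.flatnonzero(pvalues < p_threshold))
    loci = []
    while remaining:
        peak = min(remaining, key=lambda j: pvalues[j])      # lowest remaining p-value
        r2 = np.array([np.corrcoef(genotypes[:, peak], genotypes[:, j])[0, 1] ** 2
                       for j in remaining])
        members = [j for j, v in zip(remaining, r2) if v >= r2_min]  # includes the peak itself
        loci.append({"peak": peak, "snps": members})
        remaining = [j for j in remaining if j not in set(members)]
    return loci
```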
Additionally, we observe pleiotropy, with certain loci displaying significant associations with multiple phenotypes, an anticipated occurrence given the correlation among many of our phenotypes and the potential for individual traits to be influenced by the same genetic loci. To get a clearer picture of the pleiotropic structure in our GWAS findings, we constructed a heatmap (Fig. 8F) of significant QTL across all phenotypes and employed k-means clustering to identify QTL sets governing groups of phenotypes. The phenotypes fall into six groups: grooming bout length; grooming bout number and total time; rearing supported; rearing unsupported; turn bout length; and turn bout number and total time. We uncovered seven unique clusters of QTL (A-G), each regulating a different combination of these phenotype subgroups (Fig. 8F). Clusters B and G notably contained pleiotropic QTL that influenced overall turn and rearing behaviors, respectively. Within cluster F, we identified distinct QTL sets: one that governs grooming behavior and another, non-overlapping set that governs turn bout length. This distinction signifies distinct genetic underpinnings for these behaviors even within the same cluster. Finally, we color the associated SNPs in the Manhattan plot (Fig. 8E), showing the QTL associated with all phenotypes.
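The clustering step can be sketched as follows: each significant peak SNP is a row of significance values across phenotypes, and k-means groups the rows into pleiotropic clusters. k = 7 matches the seven clusters (A-G) reported above; the input matrix is a hypothetical name for the GWAS summary.

```python
# k-means clustering of the peak-SNP-by-phenotype significance matrix (as in Fig. 8F).
import numpy as np
from sklearn.cluster import KMeans

def cluster_qtl(snp_by_phenotype: np.ndarray, n_clusters: int = 7, seed: int = 0) -> np.ndarray:
    """snp_by_phenotype: (n_peak_snps, n_phenotypes) matrix of -log10(p) or 0/1 calls."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(snp_by_phenotype)  # cluster label per peak SNP

# labels = cluster_qtl(peak_snp_matrix)   # labels color heatmap rows and Manhattan-plot SNPs
```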
2.6 Data Integration: A web application for classifier sharing and downstream genetic analysis
In conjunction with the release of the curated datasets, we have developed and launched a web-based application, JABS-AI, aimed at streamlining the sharing and utilization of classifiers. Through this platform, users can view, download and rate the classifiers for various behaviors that have been developed and trained in our laboratory as shown in Figure 9B. In addition, it provides an insight into their heritability scores and offers a feature to examine the pair-wise genetic correlations amongst different phenotypes. An added functionality of this web application is that it allows users to upload their own classifiers (trained using JABS active learning application) for any specific behavior. Upon uploading, the application automatically executes the classifier on a dataset of the user’s choosing from our strain survey datasets. It conducts an automated analysis of behavior and genetics, and subsequently dispatches the results to the user’s designated email address within a few hours (see figure 9A). This web application serves as a facilitative tool aimed at fostering collaboration among researchers and streamlining the advancement of automated behavior quantification studies by providing a platform for the efficient sharing and analysis of behavioral classifiers.

JABS-AI : Data integration module for classifier sharing and genetic analysis:
(A) Illustrates the fundamental workflow of the web application, beginning with the user employing a classifier trained via the JABS active learning application. The user subsequently deposits this classifier into our web application, which performs comprehensive automated analyses, encompassing both behavioral and genetic aspects, on the user-selected strain survey dataset. The outcome of these analyses, encapsulating detailed behavioral patterns and genetic correlations, are then dispatched to the user’s designated email address within a short timeframe. (B) Screenshot of the webapp highlighting the tabular presentation of the repository of classifiers developed in our laboratory, complete with pertinent metadata such as the date of creation, training hyperparameters, and user ratings. When any two classifiers are selected, the application offers the option to analyze the genetic correlations between the phenotypes corresponding to the selected classifiers, in conjunction with their heritability scores.
3 Discussion
Democratization of machine vision methods for advanced behavior quantification remains a challenge. Often, tracking and behavior classifiers are not transferable between laboratories, which limits the reuse of prior work: each laboratory essentially starts from scratch with advanced behavior quantification. JABS and the companion DIV Sys are designed to overcome these limitations. JABS components include video data acquisition, behavior annotation, classifier sharing, and genetic analysis. By adopting JABS-DA, laboratories can use our pose estimation and segmentation models, which work across 62 mouse strains of varying coat colors and sizes. This greatly lowers the barrier to entry for advanced behavior quantification. The next step of creating behavior classifiers is carried out using JABS-AL, an active learning system modeled after JAABA. We benchmarked JABS-AL on a grooming dataset and show that it reaches very good performance with 10% of the data needed by a 3D-CNN for action detection. Once constructed, behavior classifiers can be shared through JABS-AI, a cloud-based tool. Labs can create their own behavior classifiers using JABS-AL, or download an existing one from another lab through JABS-AI, in order to annotate single behaviors. The power of JABS-AI also lies in the embedded strain survey data: a deposited classifier is inferred on one of three datasets, and heritability and genetic correlation results are returned.
A key decision point is the adoption of a common apparatus to create a uniform visual appearance of the video data across laboratories. This enables cross-application of foundational models and exchange of behavior classifiers across labs. We realize that this may be challenging for some labs with limited space and budget. Indeed, JABS has a large footprint with a 2x2x6 feet (W x L x H) space requirement, costs several thousand dollars in components, and requires some computational expertise to set up and operate. Laboratories must balance this cost against the labor and time costs of adapting an existing setup for advanced behavior analysis. In lieu of adopting a common apparatus, efforts are being made to build foundational models that can handle diverse environments and even diverse animals. While similar datasets are common in human pose estimation, the development of equivalent datasets for animals is still underway, and such foundational models are not yet available. Even when they are, the initial abstraction step is simple compared to the later step of behavior classification. For instance, in our gait and posture paper, producing a pose estimation model that generalizes to diverse mouse strains took approximately six months of iterative hard example mining, whereas deriving gait and posture measures from keypoints and validating them genetically took over 1.5 years. By simply adopting JABS, laboratories gain access to both the pose model and the validated gait and posture algorithms.
Another benefit of JABS is the ability to apply novel behavior classifiers to large-scale, genetically diverse datasets collected at JAX through JABS-AI. We have modeled this after existing platforms such as GeneNetwork and DO QTL. Currently, we provide heritability estimates for any classifier that is deposited. Users can also select behaviors for genetic correlation studies; thus, even if two behaviors appear different, the analysis can reveal whether they measure the same underlying genetic architecture. Although the current version of JABS-AI does not offer GWAS analysis due to compute restrictions, the method can easily be extended to such analysis. It is also feasible to link animal behaviors to human traits through PheWAS analysis [11, 41], which would provide users with even more detailed information about the genetic regulators of complex behaviors. The current datasets, JABS600, JABS1200, and JABS-BxD, consist of young wild-type animals. We have also collected datasets from aging populations with varying frailty status and from animals that display nocifensive behaviors; these could be integrated into JABS-AI for preclinical analyses independent of genetic analysis.
While JABS is designed for individual behavior annotation, a common task in behavioral neurogenetics is to determine an internal state, e.g., anxiety or a social state. Often this is accomplished by measuring a single behavior. A more powerful approach is to apply behavior indices to predict such states; these indices can be constructed from multiple behaviors and even other covariates. For example, we trained a model to predict frailty using data from over 600 JABS-DA open field tests of C57BL/6J mice of varying age and frailty, using 34 JABS features to derive the frailty index. Similarly, in companion work, we derived a pain scale from 82 JABS features. These indices were constructed with almost 1000 animals and can readily be transferred to other labs that collect data using the JABS system. This is incredibly powerful and allows labs to leverage each other's models by using a common platform. Similarly, for pain states, we have tested multiple strains and built pain intensity models that can be reused. We believe a true advantage of advanced phenotyping using video data is the ability to reuse and extract more information from existing data, which ultimately allows us to use fewer animals, a core 3R principle.
Here, we describe JABS as a single-animal open field assay lasting from minutes to a few hours. However, we have designed the JABS arena for long-term housing of animals, with a food hopper and lixit. This was the primary reason we worked with JAX-IACUC to certify JABS-DA for long-term monitoring with the key required environmental measures. By blocking visible light and imaging with IR LED illumination, we obtain uniform data day or night. We routinely collect video data with three mice over several days. The models for tracking, instancing, and identity maintenance need to evolve, and we plan to extend JABS-AI with classifiers for social interactions and homeostatic behaviors. Thus, future iterations of JABS will develop and share multi-animal behavior analysis.
Even when data acquisition is standardized, another fundamental source of variability enters the system when different human experts, within or across labs, annotate the same videos for the same behavior. This variability can arise from a variety of factors, including differences in training, personal biases, and individual interpretation of behavior. As behaviors become more complex, we expect behaviorists to show more disagreement. These disagreements can be as simple as differing judgments of the starts and stops of the behavior, or as fundamental as differing opinions on the behavior itself, such as distinguishing aggression from play. In a previous study, we asked five humans to annotate grooming behavior and found that agreement ranged from 86% to 91%. In that case we simply compared, frame by frame, labels from annotators who were asked to label every frame in the same set of videos. In most cases, such comparisons are infeasible. A more realistic comparison is provided in this manuscript: two annotators built their own classifiers for left and right turn behaviors, and we then compared the predictions from each set of classifiers on JABS-DA video. This is akin to two different laboratories depositing classifiers for the same behavior. JABS-AI does not store the primary training data from each annotator (lab); it simply uses the trained classifiers to infer on a new set of videos (e.g., JABS600). From these inferences we can compare the overlap between the two classifiers. We do not assume that either classifier is ground truth and simply compare the two.
In section 2.4.1, we demonstrate that even for simple behaviors like left and right turns, there is a significant amount of disagreement between predictions from classifiers trained by two expert annotators within the same lab for the same behavior. One of the most commonly used statistical measures of inter-annotator variability is Cohen's kappa, which assesses the level of agreement between annotators while accounting for agreement expected by chance. Cohen's kappa works well for frame-wise comparison but is ill-defined for bout-wise comparison because, unlike frames, bouts are not conserved between annotators. To overcome this limitation, we introduced a new graph-based approach, the ethograph. This network approach lets us define measures that quantify the agreement between two annotators when comparing bouts of behavior. By comparing entire sequences of frames, the ethograph reduces subjectivity and allows a more holistic and consistent interpretation of behaviors, making it well suited for bout-wise comparison and potentially providing a more accurate estimate of inter-annotator agreement than the frame-based kappa statistic.
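For reference, the frame-wise agreement measure discussed here is straightforward to compute; the toy labels below are invented purely to show the calculation and how kappa differs from raw overlap.

```python
# Frame-wise Cohen's kappa between two annotators' per-frame labels,
# compared against the raw fraction of frames on which they agree.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-frame binary labels (1 = behavior, 0 = not behavior)
annotator_1 = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
annotator_2 = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

kappa = cohen_kappa_score(annotator_1, annotator_2)
raw_agreement = (annotator_1 == annotator_2).mean()
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```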
Even though the frame-wise comparison shows poor overlap (κ = 0.64 and 0.65 for left and right turn, respectively), each classifier does a good job of identifying turning behaviors. Turn behaviors span only a few frames, and the two annotators differ in where they place the starts and stops: one annotator labels just the core turn, while the other starts labeling a few frames earlier and ends a few frames later. Classifiers from both annotators generally find the same bouts of turning, which we can visualize in the ethograph. We therefore explored bout-wise accuracy metrics as an alternative to frame-wise metrics, and we also explored post-processing the predictions with filter and stitch hyperparameters. With these additions, we observe much higher agreement between the two annotators' classifiers for the same behaviors (overlap increases from 49% to 61%). It is important that users define the behavior as clearly as possible and document the filter and stitch parameters they use.
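The stitch and filter post-processing mentioned above can be sketched as two passes over the binary predictions: short gaps between bouts are filled in, then bouts shorter than a minimum length are dropped. The frame counts below are illustrative hyperparameters, not the values used in this study.

```python
# Post-processing of frame-wise binary predictions: stitch short gaps, then
# filter out short bouts.
import numpy as np

def bouts(pred: np.ndarray):
    """Return (start, end_exclusive) index pairs for runs of 1s in a binary vector."""
    padded = np.diff(np.concatenate(([0], pred.astype(int), [0])))
    starts, ends = np.flatnonzero(padded == 1), np.flatnonzero(padded == -1)
    return list(zip(starts, ends))

def stitch_and_filter(pred: np.ndarray, max_gap: int = 5, min_bout: int = 10) -> np.ndarray:
    out = pred.astype(int).copy()
    # Stitch: fill gaps of <= max_gap frames between consecutive bouts
    runs = bouts(out)
    for (s1, e1), (s2, e2) in zip(runs, runs[1:]):
        if s2 - e1 <= max_gap:
            out[e1:s2] = 1
    # Filter: drop bouts shorter than min_bout frames
    for s, e in bouts(out):
        if e - s < min_bout:
            out[s:e] = 0
    return out
```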
JABS users confronted with multiple classifiers for the same behavior in JABS-AI must prioritize one classifier. JABS-AI offers a genetic solution to this challenge: prioritizing the classifier that is more heritable. Heritability is an estimate of the variance explained by genetics and can act as a discriminator in this situation. We also calculate genetic correlations, which allow users to determine which underlying genetic construct is being measured. For instance, left and right turn are highly genetically correlated; therefore, for the purposes of genetics, there is simply a turn behavior. However, for certain unilateral models such as brain lesions, stroke, optogenetic stimulation, or injury, the ability to distinguish left and right turns can be critical.
3.1 Future directions and challenges
We see several areas of improvement in JABS in the future. First, the success of such a platform depends on community adoption. As such, JAX has made JABS free for noncommercial use, and we have listed all parts and software used to make JABS. We realize that many laboratories may not have the computational or fabrication resources to construct JABS and that commercial suppliers who can provide a turnkey system are needed for JABS-DA. JABS-AL and JABS-AI require fewer, though still significant, resources to support.
JABS-AI currently does not support upload of training videos due to resource limitations. This prevents other users from interrogating the primary training labels. It also prevents users from downloading and labeling new behaviors or modifying classifiers that have been uploaded. Future versions could support sharing of complete training data instead of the classifier only.
Furthermore, since the classifiers are trained on a few densely labeled short video recordings and then used to make predictions on a large strain survey consisting of many mouse strains, some variability in the predictions is due purely to out-of-distribution strains in the survey. The inter-annotator variability in predictions on the new set of strains can therefore be attributed both to variability in human labeling and to genetic variability in the strain survey. Calculating heritability scores can help in this scenario by providing a quantitative measure of the extent to which the inter-annotator variability is due to genetic factors versus interpretation by the human labelers.
3.1.1 Rodent Homes and Hotels
Finally, JABS and DIV Sys are complementary systems that enable behavioral monitoring across multiple scales and resolutions. DIV Sys facilitates long-term observation of home-cage behaviors, whereas JABS offers high-resolution tracking of gait, posture, and other discrete actions [12, 13]. The larger space in JABS can potentially accommodate additional tasks designed to probe specific neural circuits [42, 43], and neural recordings can be collected from instrumented mice in the same arena. We see these as "ethological tasks" that can be performed continuously over long periods of time in order to interrogate neural and genetic circuits in customizable environments, or "hotels". Examples include mazes and other tasks that neurobehavior researchers have been developing. These assays can be validated using genetic or pharmacological models on a shared platform such as JABS. Together, the two platforms provide a dual approach: continuous surveillance of mice in their home-cage environments (via DIV Sys) alongside targeted assessments of particular behaviors in a dedicated hotel arena (via JABS). This combined paradigm presents a powerful framework for linking genetic and neural changes to complex behaviors. Indeed, elucidating how altered behaviors result from altered neural circuits and altered genetic pathways remains a central challenge in computational ethology, one that platforms such as JABS and DIV Sys are poised to address.
5 Supplementary Material
5.1 Grooming benchmark dataset

Data used for the grooming benchmark: number of videos (first column) and number of annotated frames (second and third columns).
5.2 Quantifying strain survey dataset imbalance
Strain Imbalance (SI):

Gender Imbalance (GI) for each strain $i$:

The Average Gender Imbalance (AGI) can be calculated as the mean of the Gender Imbalance (GI) over all strains:

$$\mathrm{AGI} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{GI}_i$$

where $n$ is the number of strains, $n_{im}$ is the number of male samples for strain $i$, and $n_{if}$ is the number of female samples for strain $i$.
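A minimal sketch of these imbalance measures is given below. Only the AGI definition (mean of per-strain GI) is stated explicitly in the text; the GI formula used here, |n_im - n_if| / (n_im + n_if), is an assumption made for illustration, and the data frame layout is hypothetical.

```python
# Gender Imbalance (GI) per strain and Average Gender Imbalance (AGI) across strains.
import pandas as pd

def gender_imbalance(n_male: int, n_female: int) -> float:
    # Assumed GI definition: normalized male/female count difference (0 = balanced)
    return abs(n_male - n_female) / (n_male + n_female)

def average_gender_imbalance(df: pd.DataFrame) -> float:
    """AGI = mean of GI over all strains; df has one row per video with 'strain' and 'sex'."""
    counts = df.groupby(["strain", "sex"]).size().unstack(fill_value=0)
    gi = counts.apply(lambda row: gender_imbalance(row.get("M", 0), row.get("F", 0)), axis=1)
    return float(gi.mean())
```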
5.3 List of features for JABS




List of JABS features

JABS data acquisition module: Environmental parameters in the arena.
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B) Carbon dioxide concentrations and (C) ammonia concentrations were both much higher in the standard wean cage than in the JABS arena. Carbon dioxide was also compared to room background levels. (D) Temperature and (E) humidity measured at floor level in JABS arenas and a standard wean cage compared to room background across a 14 day period. (F) Average body weight as percent of start weight in each JABS arena and wean cage across the 14 day period. (G) Food and (H) water consumption shown as grams per mouse per day for one JABS arena and one wean cage for a 14 day period.

Representative hematoxylin and eosin (H&E) stained tissue sections from mice after spending 14 days in the JABS arena or control wean cage.
Tissues selected for examination (eye, lung, trachea and nasal passages) are those expected to be most affected if the mice lived in a space with inadequate air flow. All tissues appeared normal.

JABS 600 Strain Distribution by Sex

JABS 1200 Strain Distribution by Sex

Classifiers trained by JABS with their respective window sizes and F1 scores

JABS behavior characterization module: Univariate analysis captures the combined effect of sex and strain on the aggregate phenotypes using JABS600 dataset:
(A) JABS pipeline highlighting individual steps towards automated behavioral quantification. (B) The LOD scores (−log10(qvalue)) and effect sizes are shown at left and right panels, respectively. In the left panel, the number of *s represents the strength of evidence against the null hypothesis of no sex effect, while + represents a suggestive effect. In the right panel, the color (red for female and blue for male) and area of the circle (area being proportional to the size of the effect) represent the direction and magnitude of the effect size. Strains with a sex difference in at least one of the aggregated phenotypes are colored pink.
Acknowledgements
We thank members of the Kumar Lab for helpful advice and Leinani Hession for training behavior classifiers. Michelle Foskett (Process Quality Control) and Rosalinda Doty (Diagnostic and Pathology Services) helped with the environment and pathology data. This work was funded by The Jackson Laboratory Directors Innovation Fund, National Institutes of Health DA041668 (NIDA), DA048634 (NIDA), and AG078530 (NIA). All code and training data will be available at Kumarlab.org and the Kumar Lab Github (https://github.com/KumarLabJax/JABS-data-pipeline).
References
- 1. Big behavioral data: psychology, ethology and the foundations of neuroscience. Nature Neuroscience 17:1455–1462.
- 2. Computational neuroethology: a call to action. Neuron 104:11–24.
- 3. Quantifying behavior to understand the brain. Nature Neuroscience: 1–13.
- 4. Open-source tools for behavioral video analysis: Setup, methods, and best practices. eLife 12:e79305. https://doi.org/10.7554/eLife.79305
- 5. Toward a science of computational ethology. Neuron 84:18–31.
- 6. A primer on motion capture with deep learning: principles, pitfalls, and perspectives. Neuron 108:44–65.
- 7. A new era in quantification of animal social behaviors. Neurosci. Biobehav. Rev. 157:105528.
- 8. JAABA: interactive machine learning for automatic annotation of animal behavior. Nature Methods 10:64.
- 9. The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice. eLife 10.
- 10. LabGym: Quantification of user-defined animal behaviors using learning-based holistic assessment. Cell Reports Methods 3.
- 11. Robust mouse tracking in complex environments using neural networks. Communications Biology 2.
- 12. Action detection using a neural network elucidates the genetics of mouse grooming behavior. eLife 10:e63207. https://doi.org/10.7554/eLife.63207
- 13. Stride-level analysis of mouse open field behavior using deep-learning-based pose estimation. Cell Reports 38:110231.
- 14. A machine vision based frailty index for mice. bioRxiv.
- 15. Highly accurate and precise determination of mouse mass using computer vision. Patterns.
- 16. A high-throughput machine vision-based univariate scale for pain and analgesia in mice. bioRxiv.
- 17. Second-generation high-throughput forward genetic screen in mice to isolate subtle behavioral mutants. Proceedings of the National Academy of Sciences 108:15557–15564.
- 18. Guide for the care and use of laboratory animals.
- 19. A study of indoor carbon dioxide levels and sick leave among office workers. Environmental Health 1:1–10.
- 20. Nasal histopathology and intracage ammonia levels in female groups and breeding mice housed in static isolation cages. Journal of the American Association for Laboratory Animal Science 54:478–486.
- 21. Guidelines for the housing of mice in scientific institutions. Animal Welfare Unit, NSW Department of Primary Industries. Anim Res Rev Panel 1:1–43.
- 22. Ammonia build-up in animal boxes and its effect on rat tracheal epithelium. Laboratory Animals 10:93–104.
- 23. Intracage ammonia levels in static and individually ventilated cages housing C57BL/6 mice on 4 bedding substrates. Journal of the American Association for Laboratory Animal Science 53:146–151.
- 24. Alterations in behavior produced by inhaled ozone or ammonia. Fundamental and Applied Toxicology 5:1110–1118.
- 25. Body condition scoring: comparing newly trained scorers and micro-computed tomography imaging. Lab Animal (New York) 30:46–49.
- 26. Use of a body condition score technique to assess health status in a rat model of polycystic kidney disease. Journal of the American Association for Laboratory Animal Science 49:155–159.
- 27. Effects on mouse food consumption after exposure to bedding from sick mice or healthy mice. Journal of the American Association for Laboratory Animal Science 59:687–694.
- 28. Biology of the laboratory mouse.
- 29. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- 30. Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, pp. 278–282.
- 31. Random forests. Machine Learning 45:5–32.
- 32. Greedy function approximation: a gradient boosting machine. Annals of Statistics: 1189–1232.
- 33. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience 21.
- 34. Can you believe my eyes? The importance of interobserver reliability statistics in observations of animal behaviour. Animal Behaviour 78:1487–1491.
- 35. Interpreting expert annotation differences in animal behavior. arXiv. https://doi.org/10.48550/arXiv.2106.06114
- 36. Interrater reliability: the kappa statistic. Biochemia Medica 22:276–282.
- 37. SlowFast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211.
- 38. Action tubelet detector for spatio-temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4405–4413.
- 39. Social deficits, stereotypy and early emergence of repetitive behavior in the C58/J inbred mouse strain. Behavioural Brain Research 208:178–188. https://www.sciencedirect.com/science/article/pii/S0166432809007086
- 40. Novel object exploration in the C58/J mouse model of autistic-like behavior. Behavioural Brain Research 282:54–60. https://www.sciencedirect.com/science/article/pii/S0166432814008249
- 41. The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genetic Epidemiology 35:410–422.
- 42. Mice in a labyrinth: Rapid learning, sudden insight, and efficient exploration. bioRxiv. https://doi.org/10.1101/2021.01.14.426746v1
- 43. Colony formation of C57BL/6J mice in visible burrow system: identification of eusocial behaviors in a background strain for genetic animal models of autism. Behavioural Brain Research 176:27–39.
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.107259. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Choudhary et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.