The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice

  1. Cristina Segalin
  2. Jalani Williams
  3. Tomomi Karigo
  4. May Hui
  5. Moriel Zelikowsky
  6. Jennifer J Sun
  7. Pietro Perona
  8. David J Anderson
  9. Ann Kennedy  Is a corresponding author
  1. Department of Computing & Mathematical Sciences, California Institute of Technology, United States
  2. Division of Biology and Biological Engineering 156-29, TianQiao and Chrissy Chen Institute for Neuroscience, California Institute of Technology, United States
  3. Howard Hughes Medical Institute, California Institute of Technology, United States
9 figures, 2 videos, 3 tables and 1 additional file

Figures

Figure 1 with 3 supplements
The Mouse Action Recognition System (MARS) data pipeline.

(A) Sample use strategies of MARS, including either out-of-the-box application or fine-tuning to custom arenas or behaviors of interest. (B) Overview of data extraction and analysis steps in a typical neuroscience experiment, indicating contributions to this process by MARS and Behavior Ensemble and Neural Trajectory Observatory (BENTO). (C) Illustration of the four stages of data processing included in MARS.
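
As a conceptual companion to (C), the following sketch shows how the four MARS stages (detection, pose estimation, feature extraction, behavior classification) chain together; all function names and values are illustrative placeholders, not the MARS API.

```python
# Minimal sketch of the four MARS processing stages chained together.
# Function names and return values are illustrative placeholders, not MARS code.
import numpy as np

def detect_mice(frame):
    """Stage 1 (hypothetical): return one bounding box per mouse, (x1, y1, x2, y2)."""
    return [(10, 10, 120, 90), (200, 150, 310, 230)]

def estimate_pose(frame, bbox):
    """Stage 2 (hypothetical): return 7 (x, y) keypoints for the cropped mouse."""
    x1, y1, x2, y2 = bbox
    return np.column_stack([np.linspace(x1, x2, 7), np.linspace(y1, y2, 7)])

def extract_features(poses):
    """Stage 3 (hypothetical): compute per-frame pose features for both mice."""
    resident, intruder = poses
    return {"rel_dist_centroid": float(np.linalg.norm(resident.mean(0) - intruder.mean(0)))}

def classify_behavior(features):
    """Stage 4 (hypothetical): map features to a behavior label (threshold in pixels)."""
    return "close_investigation" if features["rel_dist_centroid"] < 50 else "other"

frame = np.zeros((570, 1024), dtype=np.uint8)        # placeholder video frame
poses = [estimate_pose(frame, box) for box in detect_mice(frame)]
print(classify_behavior(extract_features(poses)))
```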

Figure 1—figure supplement 1
Mouse Action Recognition System (MARS) camera positioning and sample frames.

(A) Contents of the home cage and positioning of cameras for data collection. (B) Sample top- and front-view frames from mice with and without head-attached cables, including representative examples of occlusion and motion blur in the dataset (bottom row of images).

Figure 1—figure supplement 2
The Mouse Action Recognition System (MARS) annotation dataset.

Number of hours scored for each behavior in the 14.2 hr MARS dataset, broken down by training, validation, and test sets.

Figure 1—figure supplement 3
Mouse Action Recognition System (MARS) graphical user interface.

(1) File navigator, supporting queueing of multiple jobs while tracking is running. (2) User options: specify video source (top-/front-view camera), type of features to extract, and analyses to perform (pose estimation, feature extraction, behavior classification, video output). (3) Display of status updates during analysis. (4) Progress bars for current video and for all jobs in the queue.

Figure 2
Quantifying human annotation variability in top- and front-view pose estimates.

(A, B) Anatomical keypoints labeled by human annotators in (A) top-view and (B) front-view movie frames. (C, D) Comparison of annotator labels in (C) top-view and (D) front-view frames. Top row: left, crop of the original image shown to annotators (annotators were always provided with the full video frame); right, approximate outline of the mouse (traced for clarity). Middle-bottom rows: keypoint locations provided by three example annotators, and the extracted ‘ground truth’ taken as the median of all annotations. (E, F) Ellipses showing variability of human annotations of each keypoint in one example frame from (E) top view and (F) front view (N = 5 annotators; ellipse radius = 1 standard deviation). (G, H) Variability in human annotations of mouse pose, plotted as the percentage of human annotations falling within radius X of ground truth for (G) top-view and (H) front-view frames.
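
For panels (G, H), the curve at each radius is simply the fraction of annotator keypoints falling within that distance of the group median; a minimal sketch with synthetic annotation data (not the analysis code used in the paper):

```python
# Sketch: fraction of human keypoint annotations within radius X of the
# group median ("ground truth"), as plotted in panels G and H.
# The annotation array below is synthetic example data.
import numpy as np

# annotations: (n_annotators, n_keypoints, 2) pixel coordinates for one frame
rng = np.random.default_rng(0)
annotations = rng.normal(loc=[[100, 80]] * 7, scale=3.0, size=(5, 7, 2))

ground_truth = np.median(annotations, axis=0)                 # (n_keypoints, 2)
errors = np.linalg.norm(annotations - ground_truth, axis=-1)  # (n_annotators, n_keypoints)

for radius in (1, 3, 5, 10):   # radii in pixels
    frac = (errors <= radius).mean()
    print(f"within {radius:>2} px: {100 * frac:5.1f}% of annotations")
```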

Figure 3
Performance of the mouse detection network.

(A) Processing stages of mouse detection pipeline. (B) Illustration of intersection over union (IoU) metric for the top-view video. (C) Precision-recall (PR) curves for multiple IoU thresholds for detection of the two mice in the top-view video. (D) Illustration of IoU for the front-view video. (E) PR curves for multiple IoU thresholds in the front-view video.
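
For reference, the IoU metric illustrated in (B) and (D) for axis-aligned bounding boxes can be computed as in the sketch below (generic reference code, not MARS internals):

```python
# Intersection over union (IoU) of two axis-aligned bounding boxes,
# each given as (x1, y1, x2, y2). A reference sketch, not MARS code.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes do not intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.142857...
```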

Figure 4 with 1 supplement
Performance of the stacked hourglass network for pose estimation.

(A) Processing stages of pose estimation pipeline. (B) Mouse Action Recognition System (MARS) accuracy for individual body parts, showing performance for videos with vs. without a head-mounted microendoscope or fiber photometry cable on the black mouse. Gray envelope shows the accuracy of the best vs. worst human annotations; dashed black line is median human accuracy. (C) Histogram of object keypoint similarity (OKS) scores across frames in the test set. Blue bars: normalized by human annotation variability; orange bars: normalized using a fixed variability of 0.025 (see Materials and methods). (D) MARS accuracy for individual body parts in front-view videos with vs. without microendoscope or fiber photometry cables. (E) Histogram of OKS scores for the front-view camera. (F) Sample video frames (above) and MARS pose estimates (below) in cases of occlusion and motion blur.
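
The OKS scores in (C) and (E) follow the COCO-style definition, with the per-keypoint falloff sigma taken either from human annotation variability or fixed at 0.025; a minimal sketch with made-up keypoints (the area value and coordinates are illustrative):

```python
# Object keypoint similarity (OKS), COCO-style, as used to score pose
# estimates in panels C and E. This is a generic sketch, not MARS code.
import numpy as np

def oks(pred, gt, area, sigmas):
    """pred, gt: (n_keypoints, 2) arrays; area: object area in px^2;
    sigmas: per-keypoint falloff (e.g., human variability, or a fixed 0.025)."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)             # squared keypoint distances
    k = 2.0 * np.asarray(sigmas)                       # COCO convention: k_i = 2 * sigma_i
    return float(np.mean(np.exp(-d2 / (2.0 * area * k ** 2 + np.spacing(1)))))

gt = np.array([[100, 80], [90, 70], [110, 70], [100, 60],
               [95, 40], [105, 40], [100, 25]], float)          # 7 toy keypoints
pred = gt + np.random.default_rng(1).normal(scale=2.0, size=gt.shape)
print(oks(pred, gt, area=60 * 30, sigmas=[0.025] * 7))
```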

Figure 4—figure supplement 1
Breakdown of Mouse Action Recognition System (MARS) keypoint errors for top- and front-view pose models.

Left: precision/recall curves as a function of object keypoint similarity (OKS) cutoff; area under the curve indicated in legend. Right: breakdown of error sources and their effect on the precision/recall curve at an OKS cutoff of 0.85. Error types are as defined in Ruggero Ronchi and Perona, 2017. Classes of keypoint position errors: Miss: large localization error; Swap: confusion between similar parts of different instances (animals); Inversion: confusion between semantically similar parts of the same instance (e.g., left/right ears); Jitter: small localization errors; Opt Score: mis-ranking of predictions by confidence (not relevant); Bkg FP: performance after removing background false positives; FN: performance after removing false negatives.

Figure 5 with 4 supplements
Quantifying inter-annotator variability in behavior annotations.

(A) Example annotation for attack, mounting, and close investigation behaviors by six trained annotators on segments of male-female (top) and male-male (bottom) interactions. (B) Inter-annotator variability in the total reported time mice spent engaging in each behavior. (C) Inter-annotator variability in the number of reported bouts (contiguous sequences of frames) scored for each behavior. (D) Precision and recall of annotators (humans) 2–6 with respect to annotations by human 1.
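
Precision and recall in (D) are computed frame-wise from binary annotation vectors; a minimal sketch with toy annotation vectors (not the evaluation code used in the paper):

```python
# Frame-wise precision and recall of one annotator's binary labels against
# a reference annotator, as in panel D. Toy vectors for illustration only.
import numpy as np

def precision_recall(pred, ref):
    """pred, ref: boolean arrays, one entry per video frame."""
    tp = np.sum(pred & ref)
    precision = tp / max(np.sum(pred), 1)   # fraction of predicted frames that are correct
    recall = tp / max(np.sum(ref), 1)       # fraction of reference frames that are recovered
    return precision, recall

ref  = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=bool)   # e.g., human 1
pred = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)   # e.g., human 2
p, r = precision_recall(pred, ref)
print(f"precision={p:.2f} recall={r:.2f} F1={2 * p * r / (p + r):.2f}")
```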

Figure 5—figure supplement 1
Expanded set of human annotations.

All panels as in Figure 5, but with the two omitted annotators (humans 7 and 8) included. (A) Example annotation for attack, mounting, and close investigation behaviors by eight trained annotators on segments of male-female (top) and male-male (bottom) interactions. (B) Inter-annotator variability in the total reported time mice spent engaging in each behavior. (C) Inter-annotator variability in the number of reported bouts (contiguous sequences of frames) scored for each behavior. (D) Precision and recall of annotators (humans) 2–8 with respect to annotations by human 1 (source of Mouse Action Recognition System [MARS] behavior classifier training annotations).

Figure 5—figure supplement 2
Within-annotator bias and variance in annotation of attack start time.

Annotations of all attack bouts in the 10-video dataset by six human annotators. All attack bouts are aligned to the first frame on which at least three human annotators scored attack as occurring. Colored dots then reflect the time when each annotator scored each bout as starting, relative to this aligned time (the group median). Each annotator shows a characteristic bias (a shift in their mean annotation start time before or after the group median) and variance (the spread of annotation start times around this mean) in their annotation style. Some annotators did not score any attack initiated within a ±1 s window of the group median for a given bout: these points are plotted at time –1. Note that the average attack bout in the dataset is 1.65 s long (using annotations from human 1).

Figure 5—figure supplement 3
Inter-annotator accuracy on individual videos.

(A) Mean precision and recall of annotators 1–6, computed relative to the median of the other five annotators (mean ± SEM). Each plotted point is one video. (B) Mean annotator F1 score (harmonic mean of precision and recall) plotted against the mean bout duration for each behavior in each video. Plot suggests a close positive correlation between the average duration of behavior bouts in a video (or dataset) and the accuracy of annotators as computed by precision and recall. (C) Mean annotator F1 score plotted against the total number of frames annotated for a given behavior in each video. Correlation is weaker than in (B).

Figure 5—figure supplement 4
Inter- and intra-annotator variability.

We asked eight individuals to annotate the same pair of 10 min videos twice, with at least 10 months between annotation sessions. Box plots in (B) and (D) show median (line), 25th to 75th percentiles (box), and minimum/maximum values (whiskers). *p<0.05, **p<0.01, ***p<0.001; effect sizes computed as U/(n1 * n2), where n1 and n2 are category sample sizes. (A) F1 score within and between annotators: we treated a given annotator (X axis) as ground truth and computed the F1 score of each annotator with respect to these labels (for self-comparison, we used the first annotation session as ground truth and the second as ‘prediction’). (B) Summary of F1 score values in (A), showing mean F1 score vs. self and vs. other across annotators (attack self vs. other: p=0.00623, effect size = 0.878, Wilcoxon rank sum test, N = 6 self vs. 15 other; close investigation self vs. other: p=0.0292, effect size = 0.811, Wilcoxon rank sum test, N = 6 self vs. 15 other). (C, D) Same as in (A), but including two additional annotators who were more variable (attack self vs. other: p=0.000498, effect size = 0.911, Wilcoxon rank sum test, N = 8 self vs. 28 other; close investigation self vs. other: p=0.00219, effect size = 0.862, Wilcoxon rank sum test, N = 8 self vs. 28 other). (E) Same data as in (C), displayed as a matrix to capture annotator identity.
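
The effect size U/(n1 * n2) used above comes from the Wilcoxon rank sum (Mann-Whitney U) test; a sketch using scipy with toy F1 values (sample sizes chosen to mirror the 6 self vs. 15 other comparison):

```python
# Rank-sum test with effect size U / (n1 * n2), as used for the self- vs.
# other-annotator comparisons above. The F1 values here are toy data.
import numpy as np
from scipy.stats import mannwhitneyu

f1_self  = np.array([0.88, 0.91, 0.85, 0.90, 0.87, 0.89])               # n1 = 6
f1_other = np.array([0.80, 0.78, 0.83, 0.75, 0.82, 0.79, 0.81, 0.77,
                     0.84, 0.76, 0.80, 0.79, 0.78, 0.82, 0.81])          # n2 = 15

u_stat, p_value = mannwhitneyu(f1_self, f1_other, alternative="two-sided")
effect_size = u_stat / (len(f1_self) * len(f1_other))   # fraction of (self, other) pairs with self > other
print(f"U={u_stat:.0f}, p={p_value:.3g}, effect size={effect_size:.3f}")
```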

Figure 6 with 3 supplements
Performance of behavior classifiers.

(A) Processing stages of estimating behavior from pose of both mice. (B) Example output of the Mouse Action Recognition System (MARS) behavior classifiers on segments of male-female and male-male interactions compared to annotations by human 1 (source of classifier training data) and to the median of the six human annotators analyzed in Figure 5. (C) Precision, recall, and precision-recall (PR) curves of MARS with respect to human 1 for each of the three behaviors. (D) Precision, recall, and PR curves of MARS with respect to the median of the six human annotators (precision/recall for each human annotator was computed with respect to the median of the other five). (E) Mean precision and recall of human annotators vs. MARS, relative to human 1 and relative to the group median (mean ± SEM).

Figure 6—figure supplement 1
Mouse Action Recognition System (MARS) precision and recall are closely correlated with those of human annotators on individual videos.

(A) Mean precision and recall of annotators 1–6 for each behavior in each of the 10 tested videos (plotted points; as in Figure 5—figure supplement 3), and MARS precision-recall (PR) curves for those videos. PR curves and points that are the same color correspond to the same video. (B) Mean annotator F1 score plotted against MARS’s F1 score for each behavior in each video. Performance of MARS is well predicted by the inter-human F1 score, which is in turn correlated with mean behavior bout duration (see Figure 5—figure supplement 3).

Figure 6—figure supplement 2
Evaluation of Mouse Action Recognition System (MARS) on a larger test set.

(A) Precision-recall (PR) curves of MARS classifiers for test set 1 (‘no cable’), test set 2 (‘with cable’), and for the two sets combined. (B) F1 score of MARS classifiers for each behavior in each video, plotted against mean behavior bout duration in that video. Plots show no strong difference in performance between videos in which mice are unoperated (‘no cable’) and videos in which mice are implanted with a head-attached device (‘cable’).

Figure 6—figure supplement 3
Training Mouse Action Recognition System (MARS) on new datasets.

(A) Sample frame from the CRIM13 dataset. (B) Performance of the MARS pose estimator fine-tuned to CRIM13 data, as a function of fine-tuning training set size. (C) Radius at which 90% of keypoints are correctly localized (90% PCK radius) on CRIM13 data, as a function of training set size. (D) Performance of MARS classifiers for three additional social behaviors as a function of training set size (number of frames annotated for the behavior of interest). (E) Same classifiers as in (D), now showing performance as a function of the number of bouts annotated for the behavior of interest.
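
The 90% PCK radius in (C) is the distance within which 90% of predicted keypoints fall relative to ground truth; a sketch with synthetic keypoint errors (not CRIM13 data):

```python
# 90% PCK radius: the distance r such that 90% of predicted keypoints lie
# within r of their ground-truth location. Synthetic errors for illustration.
import numpy as np

rng = np.random.default_rng(2)
pred_err = rng.normal(scale=4.0, size=(500, 7, 2))       # fake (frames, keypoints, xy) offsets in px
dists = np.linalg.norm(pred_err, axis=-1).ravel()        # per-keypoint Euclidean errors (px)

pck90_radius = np.percentile(dists, 90)                  # radius containing 90% of keypoints
print(f"90% PCK radius: {pck90_radius:.2f} px")
```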

Figure 7
Screenshot of the Behavior Ensemble and Neural Trajectory Observatory (BENTO) user interface.

(A, left) The main user interface showing synchronous display of video, pose estimation, neural activity, and pose feature data. (Right) List of data types that can be loaded and synchronously displayed within BENTO. (B) BENTO interface for creating annotations based on thresholded combinations of Mouse Action Recognition System (MARS) pose features.
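
Conceptually, the annotation tool in (B) evaluates a frame-wise boolean combination of thresholded pose features; a minimal sketch with toy feature values (feature names follow Table 3, thresholds are arbitrary):

```python
# Conceptual sketch of panel B: turning thresholded combinations of MARS pose
# features into a frame-wise annotation. Feature values here are toy data.
import numpy as np

rel_dist_centroid = np.array([9, 7, 4, 3, 2, 2, 3, 6, 8, 9], float)  # cm
speed = np.array([1, 2, 3, 3, 2, 1, 1, 2, 2, 1], float)              # cm/s

# "Approach-like" frames: mice close together while the resident is moving.
annotation = (rel_dist_centroid < 5) & (speed > 1.5)
print(annotation.astype(int))   # 1 = frame annotated, 0 = not
```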

Figure 8
Application of Mouse Action Recognition System (MARS) in a large-scale behavioral assay.

All plots: mean ± SEM, N = 8–10 mice per genotype per line (83 mice total); *p<0.05, **p<0.01, ***p<0.001. (A) Assay design. (B) Time spent attacking by group-housed (GH) and single-housed (SH) mice from each line compared to controls (Chd8 GH het vs. ctrl: p=0.0367, Cohen’s d = 1.155, two-sample t-test, N = 8 het vs. 8 ctrl; Nlgn3 het GH vs. SH: p=0.000449, Cohen’s d = 1.958, two-sample t-test, N = 10 GH vs. 8 SH). (C) Time spent engaged in close investigation by each condition/line (BTBR SH vs. SH ctrl: p=0.0186, Cohen’s d = 1.157, two-sample t-test, N = 10 BTBR vs. 10 ctrl). (D) Cartoon showing segmentation of close investigation bouts into face-, body-, and genital-directed investigation. Frames are classified based on the position of the resident’s nose relative to a boundary midway between the intruder mouse’s nose and neck, and a boundary midway between the intruder mouse’s hips and tail base. (E) Average duration of close investigation bouts in BTBR mice for investigation as a whole and broken down by the body part investigated (close investigation: p=0.00023, Cohen’s d = 2.05; face-directed: p=0.00120, Cohen’s d = 1.72; genital-directed: p=0.0000903, Cohen’s d = 2.24; two-sample t-test, N = 10 BTBR vs. 10 ctrl for all).
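
A geometric sketch of the segmentation rule in (D), assigning a close-investigation frame to face-, body-, or genital-directed investigation from the resident's nose position along the intruder's body axis; the keypoints and single-point hip approximation are toy values, not MARS code:

```python
# Sketch of panel D: classify a close-investigation frame as face-, body-, or
# genital-directed from the resident's nose position relative to boundaries at
# the midpoint of the intruder's nose/neck and hips/tail base. Toy keypoints.
import numpy as np

def investigation_target(res_nose, itr_nose, itr_neck, itr_hips, itr_tail):
    axis = itr_tail - itr_nose                       # intruder body axis (nose -> tail base)
    axis /= np.linalg.norm(axis)
    s = np.dot(res_nose - itr_nose, axis)            # resident nose projected onto that axis
    face_boundary = np.dot((itr_nose + itr_neck) / 2 - itr_nose, axis)
    genital_boundary = np.dot((itr_hips + itr_tail) / 2 - itr_nose, axis)
    if s < face_boundary:
        return "face-directed"
    if s > genital_boundary:
        return "genital-directed"
    return "body-directed"

print(investigation_target(res_nose=np.array([0.8, 0.5]),
                           itr_nose=np.array([0.0, 0.0]),
                           itr_neck=np.array([2.0, 0.0]),
                           itr_hips=np.array([6.0, 0.0]),
                           itr_tail=np.array([8.0, 0.0])))   # -> "face-directed"
```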

Figure 9
Analysis of a microendoscopic imaging dataset using Mouse Action Recognition System (MARS) and Behavior Ensemble and Neural Trajectory Observatory (BENTO).

(A) Schematic of the imaging setup, showing head-mounted microendoscope. (B) Sample video frame with MARS pose estimate, showing appearance of the microendoscope and cable during recording. (C) Sample behavior-triggered average figure produced by BENTO. (Top) Mount-triggered average response of one example neuron within a 30 s window (mean ± SEM). (Bottom) Individual trials contributing to mount-triggered average, showing animal behavior (colored patches) and neuron response (black lines) on each trial. The behavior-triggered average interface allows the user to specify the window considered during averaging (here 10 s before to 20 s after mount initiation), whether to merge behavior bouts occurring less than X seconds apart, whether to trigger on behavior start or end, and whether to normalize individual trials before averaging; results can be saved as a pdf or exported to the MATLAB workspace. (D) Normalized mount-triggered average responses of 28 example neurons in the medial preoptic area (MPOA), identified using BENTO. Grouping of neurons reveals diverse subpopulations of cells responding at different times relative to the onset of mounting (pink dot = neuron shown in panel C).
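
The behavior-triggered average in (C) aligns a neuron's activity trace to each behavior onset and averages across bouts; a sketch with synthetic data (frame rate, window, and onset times are illustrative, and this is not the BENTO implementation):

```python
# Sketch of a behavior-triggered average (panel C): align a neural trace to
# each behavior onset and average across bouts. Synthetic data, not BENTO code.
import numpy as np

fps = 30                                                  # imaging/behavior frame rate
trace = np.random.default_rng(3).normal(size=30 * 600)    # 10 min of one neuron's activity
onsets = [2000, 5000, 9000, 12500]                        # frame indices where mounting starts
pre, post = 10 * fps, 20 * fps                            # 10 s before to 20 s after onset

snippets = np.array([trace[t - pre:t + post] for t in onsets
                     if t - pre >= 0 and t + post <= len(trace)])
bta = snippets.mean(axis=0)                               # behavior-triggered average
sem = snippets.std(axis=0, ddof=1) / np.sqrt(len(snippets))
print(bta.shape, sem.shape)                               # (900,), (900,)
```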

Videos

Video 1
Sample MARS output.
Video 2
Joint display of video, pose estimates, neural activity, and behavior within BENTO.

Tables

Table 1
Performance of MARS top-view pose estimation model.

‘Sigma from data’ column normalizes pose model performance by observed inter-human variability of each keypoint estimate.

| Metric          | Sigma from data | Sigma = 0.025 |
|-----------------|-----------------|---------------|
| mAP             | 0.902           | 0.628         |
| AP @ IoU = 0.50 | 0.990           | 0.967         |
| AP @ IoU = 0.75 | 0.957           | 0.732         |
| mAR             | 0.924           | 0.681         |
| AR @ IoU = 0.50 | 0.991           | 0.970         |
| AR @ IoU = 0.75 | 0.970           | 0.790         |
  1. mAP: mean average precision; AP: average precision; mAR: mean average recall; IoU: intersection over union; MARS: Mouse Action Recognition System.

Table 2
Statistical significance testing.

All t-tests are two-sided unless otherwise stated. All tests from distinct samples unless otherwise stated. Effect size for two-sample t-test is Cohen’s d. Effect size for rank sum test is U/(n1 * n2), where n1 and n2 are sample sizes of the two categories.

| Figure | Panel | Identifier | Sample size | Statistical test | Test stat. | CI | Effect size | DF | p-Value |
|--------|-------|------------|-------------|------------------|------------|----|-------------|----|---------|
| 8 | B | Chd8 GH mutant vs. GH control | 8 het, 8 wt | Two-sample t-test | t = 2.31 | 0.216–5.85 | 1.155 | 14 | 0.0367 |
| 8 | B | Nlgn3 GH mutant vs. SH mutant | 10 GH, 8 SH | Two-sample t-test | t = 4.40 | 2.79–7.99 | 1.958 | 16 | 0.000449 |
| 8 | C | BTBR SH mutant vs. SH control | 10 het, 10 wt | Two-sample t-test | t = 2.59 | 0.923–8.91 | 1.157 | 18 | 0.0186 |
| 8 | E | Close investigation | 10 het, 10 wt | Two-sample t-test | t = 4.58 | 0.276–0.743 | 2.05 | 18 | 0.000230 |
| 8 | E | Face-directed | 10 het, 10 wt | Two-sample t-test | t = 3.84 | 0.171–0.582 | 1.72 | 18 | 0.00120 |
| 8 | E | Genital-directed | 10 het, 10 wt | Two-sample t-test | t = 5.01 | 0.233–0.568 | 2.24 | 18 | 0.0000903 |
| Figure 5—figure supplement 4 | B | Attack | 6 vs. self, 15 vs. other | Wilcoxon rank sum | U = 79 | n/a | 0.878 | n/a | 0.00623 |
| Figure 5—figure supplement 4 | B | Close investigation | 6 vs. self, 15 vs. other | Wilcoxon rank sum | U = 73 | n/a | 0.811 | n/a | 0.0292 |
| Figure 5—figure supplement 4 | D | Attack | 8 vs. self, 28 vs. other | Wilcoxon rank sum | U = 204 | n/a | 0.911 | n/a | 0.000498 |
| Figure 5—figure supplement 4 | D | Close investigation | 8 vs. self, 28 vs. other | Wilcoxon rank sum | U = 193 | n/a | 0.862 | n/a | 0.00219 |
  1. GH: group-housed; SH: singly housed.
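
For reference, the Cohen's d values reported above for two-sample t-tests use the pooled standard deviation; a sketch with toy group data:

```python
# Cohen's d with pooled standard deviation, the effect size reported for the
# two-sample t-tests in Table 2. Group values below are toy data.
import numpy as np
from scipy.stats import ttest_ind

het = np.array([12.1, 15.3, 9.8, 14.2, 11.7, 13.5, 16.0, 10.9])   # e.g., time attacking (s)
wt  = np.array([7.2, 8.9, 6.5, 9.4, 7.8, 8.1, 6.9, 9.0])

t_stat, p_value = ttest_ind(het, wt)                 # two-sided, two-sample t-test
n1, n2 = len(het), len(wt)
pooled_sd = np.sqrt(((n1 - 1) * het.var(ddof=1) + (n2 - 1) * wt.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (het.mean() - wt.mean()) / pooled_sd
print(f"t={t_stat:.2f}, df={n1 + n2 - 2}, p={p_value:.3g}, Cohen's d={cohens_d:.2f}")
```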

Table 3
MARS feature definitions.
Position Features

| Name | Units | Definition | Res. | Intr. |
|------|-------|------------|------|-------|
| (p)_x, (p)_y | cm | x, y coordinates of each body part, for p in (nose, left ear, right ear, neck, left hip, right hip, tail). | x | x |
| centroid_x, centroid_y | cm | x, y coordinates of the centroid of an ellipse fit to the seven keypoints representing the mouse's pose. | x | x |
| centroid_head_x, centroid_head_y | cm | x, y coordinates of the centroid of an ellipse fit to the nose, left and right ear, and neck keypoints. | x | x |
| centroid_hips_x, centroid_hips_y | cm | x, y coordinates of the centroid of an ellipse fit to the left and right hip and tail base keypoints. | x | x |
| centroid_body_x, centroid_body_y | cm | x, y coordinates of the centroid of an ellipse fit to the neck, left and right hip, and tail base keypoints. | x | x |
| dist_edge_x, dist_edge_y | cm | distance from the centroid of the mouse to the closest vertical (dist_edge_x) or horizontal (dist_edge_y) wall of the home cage. | x | x |
| dist_edge | cm | distance from the centroid of the mouse to the closest of the four walls of the home cage. | x | x |

Appearance Features

| Name | Units | Definition | Res. | Intr. |
|------|-------|------------|------|-------|
| phi | radians | absolute orientation of the mouse, measured by the orientation of a vector from the centroid of the head to the centroid of the hips. | x | x |
| ori_head | radians | absolute orientation of a vector from the neck to the tip of the nose. | x | x |
| ori_body | radians | absolute orientation of a vector from the tail base to the neck. | x | x |
| angle_head_body_l, angle_head_body_r | radians | angle formed by the left (right) ear, neck, and left (right) hip keypoints. | x | x |
| major_axis_len, minor_axis_len | cm | major and minor axis of an ellipse fit to the seven keypoints representing the mouse’s pose. | x | x |
| axis_ratio | none | major_axis_len/minor_axis_len (as defined above). | x | x |
| area_ellipse | cm^2 | area of the ellipse fit to the mouse’s pose. | x | x |
| dist_(p1)(p2) | cm | distance between all pairs of keypoints (p1, p2) of the mouse’s pose. | x | x |

Locomotion Features

| Name | Units | Definition | Res. | Intr. |
|------|-------|------------|------|-------|
| speed | cm/s | mean change in position of the centroids of the head and hips (see Position Features), computed across two consecutive frames. | x | x |
| speed_centroid | cm/s | change in position of the mouse’s centroid (see Position Features), computed across two consecutive frames. | x | x |
| acceleration | cm/s^2 | mean change in speed of the centroids of the head and hips, computed across two consecutive frames. | x | x |
| acceleration_centroid | cm/s^2 | change in speed of the mouse’s centroid, computed across two consecutive frames. | x | x |
| speed_fwd | cm/s | speed of the mouse in the direction of ori_body (see Appearance Features). | x | x |
| radial_vel | cm/s | component of the mouse’s centroid velocity along the vector between the centroids of the two mice, computed across two consecutive frames. | x | x |
| tangential_vel | cm/s | component of the mouse’s centroid velocity tangential to the vector between the centroids of the two mice, computed across two consecutive frames. | x | x |
| speed_centroid_w(s) | cm/s | speed of the mouse’s centroid, computed as the change in position between timepoints (s) frames apart (at 30 Hz). | x | x |
| speed_(p)_w(s) | cm/s | speed of each keypoint (p) of the mouse’s pose, computed as the change in position between timepoints (s) frames apart (at 30 Hz). | x | x |

Image-Based Features

| Name | Units | Definition | Res. | Intr. |
|------|-------|------------|------|-------|
| pixel_change | none | mean squared value of (pixel intensity on current frame minus mean pixel intensity on previous frame) over all pixels, divided by mean pixel intensity on current frame (as defined in Hong et al.). | x | |
| pixel_change_ubbox_mice | none | pixel change (as above) computed only on pixels within the union of the bounding boxes of the detected mice (when bounding box overlap is greater than 0; 0 otherwise). | x | |
| (p)_pc | none | pixel change (as above) within a 20 pixel-diameter square around the keypoint for each body part (p). | x | x |

Social Features

| Name | Units | Definition | Res. | Intr. |
|------|-------|------------|------|-------|
| resh_twd_itrhb (resident head toward intruder head/body) | none | binary variable that is one if the centroid of the other mouse is within a –45° to 45° cone in front of the animal. | x | x |
| rel_angle_social | radians | relative angle between the body of the mouse (ori_body) and the line connecting the centroids of both mice. | x | x |
| rel_dist_centroid | cm | distance between the centroids of the two mice. | x | |
| rel_dist_centroid_change | cm | change in distance between the centroids of the two mice, computed across two consecutive frames. | x | |
| rel_dist_gap | cm | distance between ellipses fit to the two mice along the vector between the two ellipse centers, equivalent to Feature 13 of Hong et al. | x | |
| rel_dist_scaled | cm | distance between the two animals along the line connecting the two centroids, divided by the length of the major axis of one mouse, equivalent to Feature 14 of Hong et al. | x | x |
| rel_dist_head | cm | distance between the centroids of ellipses fit to the heads of the two mice. | x | |
| rel_dist_body | cm | distance between the centroids of ellipses fit to the bodies of the two mice. | x | |
| rel_dist_head_body | cm | distance from the centroid of an ellipse fit to the head of mouse A to the centroid of an ellipse fit to the body of mouse B. | x | x |
| overlap_bboxes | none | intersection over union of the bounding boxes of the two mice. | x | |
| area_ellipse_ratio | none | ratio of the areas of ellipses fit to the poses of the two mice. | x | x |
| angle_between | radians | angle between the mice, defined as the angle between the projections of the centroids. | x | |
| facing_angle | radians | angle between the head orientation of one mouse and the line connecting the centroids of both animals. | x | x |
| dist_m1(p1)_m2(p2) | cm | distance between keypoints of one mouse with respect to the other, for all pairs of keypoints (p1, p2). | x | |
  1. MARS: Mouse Action Recognition System.
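
To make the conventions in Table 3 concrete, the sketch below computes a few representative features (centroid, speed_centroid, rel_dist_centroid, ori_head, facing_angle) from two 7-keypoint poses; the keypoint coordinates, ordering, and array layout are illustrative assumptions, not MARS internals:

```python
# Illustration of a few Table 3 features computed from 7-keypoint poses
# (nose, left ear, right ear, neck, left hip, right hip, tail base), in cm.
# Keypoint values, ordering, and layout are illustrative, not MARS code.
import numpy as np

NOSE, NECK = 0, 3
fps = 30.0

res_prev = np.array([[5.0, 5.0], [4.5, 4.5], [5.5, 4.5], [5.0, 4.0],
                     [4.5, 3.0], [5.5, 3.0], [5.0, 2.0]])
res = res_prev + np.array([0.2, 0.1])                      # resident moved slightly
itr = np.array([[9.0, 5.0], [9.5, 5.5], [9.5, 4.5], [10.0, 5.0],
                [11.0, 4.5], [11.0, 5.5], [12.0, 5.0]])    # intruder pose

centroid = res.mean(axis=0)                                          # centroid_x, centroid_y
speed_centroid = np.linalg.norm(res.mean(0) - res_prev.mean(0)) * fps  # cm/s between frames
rel_dist_centroid = np.linalg.norm(res.mean(0) - itr.mean(0))          # cm

ori_head = np.arctan2(*(res[NOSE] - res[NECK])[::-1])      # absolute head orientation (rad)
to_intruder = itr.mean(0) - res.mean(0)
raw = np.arctan2(*to_intruder[::-1]) - ori_head
facing_angle = np.abs((raw + np.pi) % (2 * np.pi) - np.pi)  # wrapped to [0, pi]

print(centroid, round(speed_centroid, 2), round(rel_dist_centroid, 2), round(facing_angle, 2))
```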

Additional files

Cite this article

  1. Cristina Segalin
  2. Jalani Williams
  3. Tomomi Karigo
  4. May Hui
  5. Moriel Zelikowsky
  6. Jennifer J Sun
  7. Pietro Perona
  8. David J Anderson
  9. Ann Kennedy
(2021)
The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice
eLife 10:e63720.
https://doi.org/10.7554/eLife.63720