Figures and data

CALMS21 test dataset results.
Behavioral categories are abbreviated: ‘attack’ – ‘att’, ‘investigation’ – ‘inv’ and ‘mount’ – ‘mnt’. All values represent means of 20 pipeline runs with different random states; standard deviations are shown as error bars where applicable. A: Behavioral timeline with ground-truth intervals (annotations, lower bars, blue) and predictions (upper bars, orange) for the four behavioral categories. Lines represent per-category model outputs (classification probabilities after smoothing). The last resident-intruder sequence of the test dataset is visualized; see SI Figure 5 for all 19 test dataset sequences. B: Per-frame confusion matrix of the four behavioral categories. Within each cell, upper values show the proportion of frames in agreement with the annotated ground-truth data (normalized across each row); lower values show the absolute frame counts. Results were visualized after model output smoothing and thresholding. C and D: Per-frame unweighted average (macro) F1 scores of raw model outputs (‘model’), after smoothing (‘smooth’), and after thresholding (‘thresh’), calculated across only the three behavioral foreground categories and across all four categories. E: Per-frame F1 scores for each category, calculated on raw model outputs, after smoothing, and after thresholding. C and E: Horizontal lines mark the F1 scores of the baseline model as reported in Sun et al. (2021).
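The macro F1 scores in panels C – E are unweighted means of per-category F1 scores, computed either over the three foreground categories or over all four. A minimal per-frame sketch (function name and example labels are hypothetical, not from the pipeline):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-category F1 scores over the given labels."""
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical per-frame labels for the four CALMS21 categories
true = ["att", "att", "inv", "inv", "mnt", "other", "other", "other"]
pred = ["att", "inv", "inv", "inv", "mnt", "other", "other", "att"]

foreground = macro_f1(true, pred, ["att", "inv", "mnt"])
all_categories = macro_f1(true, pred, ["att", "inv", "mnt", "other"])
```

Averaging over foreground categories only, as in panel C, excludes the easier and far more frequent ‘other’ category from the score.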

Ethogram of N. multifasciatus that was used for the behavioral scoring of the social cichlids dataset.

Qualitative classification results of the social cichlids dataset.
A: Behavioral timelines for one focal fish (actor) of the test dataset and its three most frequent interaction partners (recipients). Colored bars denote ground-truth intervals (annotations, lower bars, blue) and predictions (upper bars, orange) for the six behavioral categories. Lines represent per-category model outputs (classification probabilities after smoothing). For recipients 2 and 3, intervals and lines are offset along both the x and y axes. The behavioral background category was excluded for a clearer visualization of the sparse behavioral data. B and C: Annotated and predicted interaction networks of one group (15 fish), split by behavioral category. Edge line strength represents interaction counts. Note that this visualization contains data that is not part of the test dataset, since the full dataset was split by individual fish and not by groups. For visualization purposes, we instead used 5-fold cross-validation to fit five independent classifiers that were used for predictions on the full dataset. See SI Figure 4 for correlation tests between annotated and predicted behavior counts of all dyads from the test dataset. Note: The actor and three recipients of the behavioral timelines are part of the interaction networks, marked by node color (black and grey for actor and recipients, respectively).

Classification results of the social cichlids test dataset.
All values represent means of 20 pipeline runs with different random states; standard deviations are shown as error bars where applicable. A: Confusion matrix of predicted intervals and their true category (i.e., the category of the annotated interval with the longest overlap). B: Unweighted average (macro) F1 score of raw model outputs (‘model’), after smoothing (‘smooth’), and after thresholding (‘thresh’), calculated across all behavioral categories on predicted intervals. C: Per-category F1 scores for each of the same (post-)processing steps. Note that all results were computed based on predicted behavioral intervals (i.e., interaction counts). For frame-based and annotation interval-based results, see SI Figure 2 and SI Figure 3, respectively.
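Assigning each predicted interval the ‘true’ category of the annotated interval with the longest overlap (as in panel A) can be sketched as follows; the interval representation (start frame, end frame, category) and the function names are hypothetical:

```python
def overlap(a, b):
    """Overlap length of two half-open frame intervals (start, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def true_category(predicted, annotations, background="none"):
    """Category of the annotated interval with the longest overlap;
    the background category if the prediction overlaps no annotation."""
    best = max(annotations, key=lambda ann: overlap(predicted, ann), default=None)
    if best is None or overlap(predicted, best) == 0:
        return background
    return best[2]

# Hypothetical annotations: (start_frame, end_frame, category)
annotations = [(0, 50, "attack"), (40, 100, "investigation")]
matched = true_category((30, 55), annotations)  # overlaps 20 vs. 15 frames
```

The symmetric matching (annotated interval to predicted interval with the longest overlap) yields the annotation-based confusion matrices in SI Figure 3.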

Interactive validation tool.
A: An interactive table widget enables the inspection and editing of entire behavioral datasets from within the JupyterLab coding environment. Columns can be sorted (with multi-column sorting; in the example: 1 – ‘Actor’ – ascending, 2 – ‘Recipient’ – ascending, 3 – ‘Probability’ – descending) and interactively filtered through selection, value ranges or quantile ranges (accessible via the respective pop-ups). Active filters are indicated by the blue tick button. All fields are editable, but only allow valid entries where applicable (e.g., existing behavioral categories). Boolean columns are exposed as checkboxes. Free text input is also possible, e.g., for comments. Action buttons can link to other widgets, for example, video playback. B: Video playback interface to visualize behavioral sequences. Each row in the interactive table (i.e., annotations or predictions, depending on the use case) can be used to render a corresponding video, optionally with overlays for tracking data and a behavioral label. Videos are played directly in the interface and can be looped, stopped and maximized. C – E: Rendering options to configure video output. In the example, actor and recipient are highlighted in red and blue, respectively.

CALMS21 test dataset evaluation results.
The values report means and standard deviations of 20 pipeline runs with different random states. F1 scores were calculated on three levels, i.e., frames, annotated intervals and predicted intervals (the latter two evaluated as counts), and for three processing steps (raw model outputs, classification results after probability smoothing, and after thresholding). Classification performance was evaluated separately for each behavioral category and as an unweighted average (macro F1) across all categories and across the three behavioral foreground categories ‘attack’, ‘investigation’, and ‘mount’.

Social cichlids test dataset evaluation results.
All values report means and standard deviations of 20 pipeline runs with different random states. F1 scores were calculated on three levels, i.e., frames, annotated intervals and predicted intervals (the latter two evaluated as counts), and for three processing steps (raw model outputs, classification results after probability smoothing, and after thresholding). Classification performance was evaluated separately for each behavioral category and as an unweighted average (macro F1) across all categories and across the six behavioral foreground categories (excluding ‘none’).

Comparison of correlations between ground-truth behavioral interaction counts (annotations) and two potential proxies: (1) predicted counts as resulting from the classification pipeline, and (2) association time, the cumulative duration that two individuals spend within a defined distance threshold (3 average body lengths, with 1 and 5 body lengths as a sensitivity analysis).
We used Williams’ tests between two dependent correlations that share one variable (two-tailed) to assess whether correlations with predicted counts differed in strength from the equivalent correlations with association time. Note that the correlation coefficients with predicted counts are repeated for each association threshold.
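Williams’ test compares a correlation r12 (here, annotated vs. predicted counts) with a dependent correlation r13 (annotated counts vs. association time) that shares variable 1, using a t statistic with n − 3 degrees of freedom. A sketch in the commonly cited Steiger (1980) formulation (function name and example inputs hypothetical; r23 is the correlation between the two proxies):

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams' t for two dependent correlations sharing one variable.
    Returns (t, df); a two-tailed p value follows from the t distribution."""
    # Determinant of the 3x3 correlation matrix
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    t = (r12 - r13) * math.sqrt(
        ((n - 1) * (1 + r23))
        / (2 * det * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3)
    )
    return t, n - 3

# Hypothetical dyad-level correlations for one behavioral category
t, df = williams_t(r12=0.8, r13=0.5, r23=0.4, n=50)
```

Equal correlations give t = 0; a positive t indicates that predicted counts track the annotations more closely than association time does.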

Overview of sample counts (video frames) and proportions by category in the CALMS21 training and test dataset and after subsampling for model training.
The ‘sampling frequency’ column shows the category-specific (sub-)sampling strategies employed for this dataset: full sampling (value of 1.0, all frames) and sampling with a target count (for both ‘investigation’ and ‘other’; the proportion in parentheses is the realized subsampling frequency). The resulting samples that were used for model training (‘subsampled training data’) can differ from the target count due to constraints from stratified sampling by behavioral intervals.

Overview of category counts (video frames per dyad) and proportions by category in the full social cichlids dataset, when split into a training dataset (all dyads in which a subset of 80% of all individuals were ‘actors’) and a test dataset (the remaining dyads), and after subsampling for model training.
The categories were either fully sampled (sampling frequency of 1.0) or subsampled with a given frequency. Note that for the behavioral background category ‘none’, we first randomly selected 1% of all available samples (asterisk in table), and then added further samples in which the actor was interacting with a different individual. For these additional ‘none’ samples, we sampled the first and second non-interacting closest neighbors with 0.1× the sampling frequency of the actor’s actual interaction, and the third, fourth and fifth neighbors with 0.05× that sampling frequency. Also note that the category ‘none’ contains more than 99% of all samples because individuals can only interact with one other individual at a time, while groups consist of 15 individuals.
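The neighbor-rank-dependent sampling of the additional ‘none’ samples described above can be summarized as a small lookup (function name hypothetical):

```python
def none_sampling_frequency(neighbor_rank, interaction_frequency):
    """Sampling frequency for a 'none' sample of the k-th closest
    non-interacting neighbor, relative to the sampling frequency of
    the actor's actual interaction category."""
    if neighbor_rank in (1, 2):     # first and second closest neighbors
        return 0.1 * interaction_frequency
    if neighbor_rank in (3, 4, 5):  # third to fifth closest neighbors
        return 0.05 * interaction_frequency
    return 0.0                      # more distant neighbors: not sampled

freq = none_sampling_frequency(2, 1.0)  # second neighbor of a fully sampled category
```

Tying the ‘none’ sampling to the actor’s concurrent interaction keeps the background samples concentrated around socially active moments rather than drawing them uniformly.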

Spatiotemporal features that were extracted to train classifiers for the CALMS21 dataset.
In total, 201 feature values were extracted to describe the spatiotemporal movement patterns of dyadic mouse interactions. Features are either individual or dyadic and result in at least one value per feature and combination of postural elements (i.e., keypoints or segments). Some features are temporal and were calculated for different time steps (video frames). For example, ‘speed’ (an individual feature) was calculated for three keypoints and for three steps, resulting in a total of 9 values. In comparison, ‘target velocity’ is a dyadic feature and was calculated for two keypoints of the actor mouse and along six target vectors between keypoints of the actor and recipient mouse. In contrast to ‘speed’ (a scalar feature with one value), ‘target velocity’ is itself a vector of two components, i.e., the projection and rejection of the keypoint displacement onto a target vector. For the three temporal steps, this results in 2 (keypoints) × 6 (target vectors) × 2 (components) × 3 (steps) = 72 values.
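The per-feature value counts above follow directly from multiplying the factor cardinalities:

```python
# 'speed' (individual, scalar): 3 keypoints x 3 temporal steps
speed_values = 3 * 3

# 'target velocity' (dyadic, 2 vector components: projection and rejection):
# 2 actor keypoints x 6 target vectors x 2 components x 3 temporal steps
target_velocity_values = 2 * 6 * 2 * 3

print(speed_values, target_velocity_values)  # → 9 72
```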

Additional CALMS21 test dataset validation results.
All values represent means of 20 pipeline runs with different random states; standard deviations are shown as error bars where applicable. A – C: Evaluation based on annotated behavioral intervals. A: Confusion matrix for annotated intervals and their predicted category (i.e., the category of the predicted interval with the longest overlap). Proportional values are normalized across rows; absolute counts are shown below in parentheses. B: Macro F1 scores calculated on the annotated intervals across all categories for three (post-)processing steps – raw model outputs (‘model’), after probability smoothing (‘smooth’) and after thresholding (‘thresh’). C: F1 scores calculated for each category and the three processing steps. D – F: Corresponding evaluation based on predicted behavioral intervals. D: Confusion matrix for predicted intervals and their true category (i.e., the category of the annotated interval with the longest overlap). E and F: Macro F1 and per-category F1 scores calculated on predicted intervals. Behavioral categories are abbreviated: ‘attack’ – ‘att’, ‘investigation’ – ‘inv’ and ‘mount’ – ‘mnt’.

Additional social cichlids test dataset validation results based on annotated intervals.
The predicted category that corresponds to a behavioral annotation is selected as the category of the predicted interval with the longest overlap. All values represent means of 20 pipeline runs with different random states; standard deviations are shown as error bars where applicable. A: Confusion matrix depicting the recall of annotations. Note that the ‘none’ column represents false negatives (i.e., missed predictions). Proportional values are normalized across rows; absolute counts are shown below in parentheses. B and C: Macro F1 and per-category F1 scores calculated on annotated intervals for three (post-)processing steps – raw model outputs (‘model’), after probability smoothing (‘smooth’) and after thresholding (‘thresh’).

Additional social cichlids test dataset validation results based on video frames.
All values represent means of 20 pipeline runs with different random states; standard deviations are shown as error bars where applicable. A: Confusion matrix across all video frames of the test dataset. Proportional values are normalized across rows; absolute counts are shown below in parentheses. Note the disproportionate number of frames that belong to the behavioral background category ‘none’. B and C: Macro F1 and per-category F1 scores calculated on video frames for three (post-)processing steps – raw model outputs (‘model’), after probability smoothing (‘smooth’) and after thresholding (‘thresh’).

Visualization of correlations between ground-truth, annotated behavioral interaction counts and two potential behavioral proxies.
A – F: Correlations with predicted counts, split by behavioral foreground category. G – L: Correlations with association time, the cumulative duration that two individuals spent within a distance of three average body lengths. All correlations with predicted counts are significantly stronger than the corresponding correlations with association time (Williams’ tests between dependent correlations that share one variable; P < 0.001 in all cases). For statistical estimates and a sensitivity analysis with other association distance thresholds (1 and 5 body lengths), see SI Table 3.
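Association time, as used for panels G – L, is the cumulative duration two individuals spend within the distance threshold. A minimal per-frame sketch (trajectories and function name hypothetical; the threshold is in the same units as the positions):

```python
import math

def association_time(pos_a, pos_b, threshold, frame_duration=1.0):
    """Cumulative time (in units of frame_duration) during which two
    individuals are within `threshold` distance of each other."""
    return frame_duration * sum(
        math.dist(a, b) <= threshold for a, b in zip(pos_a, pos_b)
    )

# Hypothetical 2D trajectories, one position per frame
fish_a = [(0, 0), (1, 0), (5, 5), (2, 1)]
fish_b = [(1, 0), (1, 1), (9, 9), (2, 2)]
close_time = association_time(fish_a, fish_b, threshold=3.0)
```

In practice, the threshold would be expressed in multiples of the average body length (1, 3 or 5, as in the sensitivity analysis).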

Classification results of all 19 resident-intruder sequences in the CALMS21 test dataset after post-processing (i.e., category-specific output smoothing and thresholding).
Upper bars (orange) show predicted intervals; lower bars (blue) show ground-truth annotations. Lines represent model outputs – classification probabilities for each category – after smoothing. All sequences are visually aligned with the longest sequence, ‘0’. Note the skewed distribution of annotated ‘attack’ intervals: only five of the 19 sequences have annotations for this behavioral category, but within those at a relatively high frequency.