Abstract
Recent advances in genetic engineering have accelerated the production of nonhuman primate models for neuropsychiatric disorders. To use these models for preclinical drug testing, behavioral screening methods will be necessary to determine how the model animals deviate from controls, and whether treatments can restore typical patterns of behavior. In this study, we collected a multimodal dataset from a large cohort of marmoset monkeys and described typical patterns in their natural behavior. We found that these behavioral measurements varied substantially across days, and that behavioral state usage was highly correlated with the behavior of cagemates and with the vocalization rate of other animals in the colony. To elicit acute behavioral responses, we presented animals with a panel of stimuli including novel, appetitive, neutral, aversive, and social stimuli. By comparing these behavioral conditions, we demonstrate that outlier detection can be used to identify atypical responses to a range of stimuli. This data will help guide the study of marmosets as models for neuropsychiatric disorders.
Introduction
Expanding the number of species across the phylogenetic tree that can be used to test potential treatments for neuropsychiatric disorders should increase the probability that these treatments can generalize to humans1. Marmosets have recently gained significant attention as a nonhuman primate model species2–6 that displays several types of social behavior that are disrupted in human neuropsychiatric disorders, such as vocal communication7,8, face processing9,10, and social learning11–13. We applied methods for studying natural behavior, vocalizations, responses to stimuli, and cognitive ability to a large cohort of marmosets to find patterns characteristic of typically developing animals. This data will help determine which symptoms are present in neuropsychiatric disorder model animals, and whether treatments can reverse those symptoms.
The application of deep learning to the quantification of video data has recently expanded the potential scope of behavioral measurements. Specifically, methods for extracting animal posture from video data14–16 have been transformative because they allow high-dimensional video data to be distilled into much lower-dimensional postural time series data that retains information about the animals’ overt movements. Furthermore, novel approaches to supervised17,18 and unsupervised19–22 modeling have demonstrated that patterns in these large new datasets can be identified and interpreted. These computational approaches to ethology have already been applied to the study of neuropsychiatric disorder model mice to reveal similarities and differences across different mouse models of autism spectrum disorder (ASD)23. Deep learning has also been applied to the annotation and study of patterns in audio data24,25. This has allowed for high-dimensional audio data (spectrograms) to be compressed into low-dimensional time series data, which can be modeled to understand the basic patterns underlying these complex social processes26.
Because the phenotypes of neuropsychiatric disorders are defined in relation to the behavioral patterns of typically developing controls, we collected a large baseline video and audio dataset from group-housed captive animals, comprising 5,400 sessions from 120 animals living in 36 different family units. We found that the animals’ natural behavioral state usage was highly correlated with environmental variables such as their cagemates’ behavior and the vocalizations of other animals in the colony. To elicit acute behavioral responses, we presented the animals with a set of stimuli including novel, appetitive, neutral, aversive, and social stimuli. We found that patterns in the acute responses to these stimuli were consistent across animals, meaning that this data could be used to benchmark methods for characterizing the response to each stimulus. Human patients with ASD can display diverse symptoms27 including differences in social affect, repetitive behaviors, restricted interests, lower verbal ability, and impaired cognitive ability.
Importantly, this diversity is observed even in monogenic forms of ASD, such as in Phelan-McDermid syndrome28–30, which can be studied in animal models with alterations in the Shank3 gene31. These patients have a range of symptoms, and individual patients can exhibit different combinations of symptoms. For this reason, we aimed to develop a framework for identifying outliers on an individual basis (by comparing data from each animal to the whole-population distribution) rather than on a group basis (comparing across pre-defined groups of animals).
To benchmark our methods, we measured the distinguishability between nine stimulus-response conditions. Surprisingly, we found that most stimuli did not cause large changes in the total usage of each behavioral state. Instead, the responses to these stimuli were better differentiated by comparing the correlation patterns between the usage of each behavioral state and the time spent interacting with the stimuli. Using these patterns, we showed that outlier detection could distinguish between conditions with accuracy similar to that of 2-class classification. Crucially, we found that outlier detection was also robust to heterogeneity, whereas 2-class classification required homogeneity in the training and test data. This demonstrates the importance of using outlier detection in cases where experimental groups may be heterogeneous, such as in ASD models.
Overall, these results will help guide studies of how marmoset models for neuropsychiatric disorders differ from typical animals, and whether treatments can reverse these changes.
Results
Extracting behavioral data from video
Marmosets form stable breeding pairs that cooperate to raise infants during their development, which takes over a year. Because marmosets often give birth to 2 sibling animals at a time, they usually live in a family of at least 4 total animals when raised in captivity32. Once the adult female gives birth to another set of siblings, older siblings are paired with new partners in different cages to form their own family units. To study marmosets living in this controlled semi-natural environment, we trained a neural network to label 21 postural key points of multiple animals (Fig. 1a-b) and developed a classifier to distinguish animals based on the color of the hair tuft near the ear, which we artificially dyed to make animals easily distinguishable to a human observer (Fig. 1c-d). This allowed us to extract postural time series data from 36 cages with similar family structure (two adult parents and two juvenile animals).
Principal component analysis of the posture time series data revealed that most of the variance was explained by the animal’s X position, Y position, and body angle (Fig. 1e, Fig. S1). This is an expected result from video data where the animals are small compared to the field of view. To extract more detailed information about the animals’ body movements, we created an egocentric version of the data that was normalized with respect to X position, Y position, and body angle. In this egocentric data, the principal components of variance reflect differences in posture, such as head angle and limb position (Fig. 1f, Fig. S1). Because the Z position of the animal can vary dramatically as animals climb on the cage walls, we used stereo vision to estimate the animals’ depths in the cage (Fig. S2). Recent work has shown that it is also possible to estimate 3D postures directly from 2D postures33, and this will be a promising future direction for enabling similar types of high-throughput video data collection with even fewer cameras. In this study, we used stereo vision to estimate depth since we collected video from multiple cameras above each cage. Notably, we temporarily installed smooth transparent ceilings during the weeks of recording to restrict the animals’ ability to climb on the ceiling, and temporarily placed horizontal cage dividers to reduce the total cage size during data collection.
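The egocentric normalization described above can be sketched in a few lines of numpy. This is a simplified illustration rather than the study's actual pipeline: the `(n_keypoints, 2)` frame format and the keypoint indices used to define the body axis are hypothetical.

```python
import numpy as np

def egocentrize(keypoints, tail_idx=0, head_idx=1):
    """Remove X position, Y position, and body angle from one frame of
    2D keypoints (shape (n_keypoints, 2)). tail_idx and head_idx are
    hypothetical indices of two keypoints that define the body axis."""
    centered = keypoints - keypoints.mean(axis=0)  # remove X/Y position
    axis = centered[head_idx] - centered[tail_idx]
    theta = np.arctan2(axis[1], axis[0])           # body angle
    c, s = np.cos(-theta), np.sin(-theta)
    R = np.array([[c, -s], [s, c]])                # rotation by -theta
    return centered @ R.T                          # remove body angle

def pca(data, n_components=2):
    """PCA via SVD; each row of `data` is one flattened posture."""
    X = data - data.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return X @ Vt[:n_components].T, explained[:n_components]
```

Applying `pca` to raw frames would recover position and body angle as the top components, whereas applying it to `egocentrize`d frames leaves only postural variance such as head angle and limb position.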
Labeling natural behavior on a frame-by-frame basis is a difficult task because behaviors can have different timescales and levels of description. Additionally, an animal may perform multiple behaviors at once. Our approach to this problem was to train a recurrent neural network to use long input sequences of features (25 seconds) to predict the following 5 seconds (Fig. 2a-b). After training the network, we hierarchically clustered the latent space using t-SNE while systematically varying the alpha and perplexity parameters34. We then traced cluster identities across hierarchical levels (Fig. 2c, S3). We found that the lowest level of clustering (level 5) resulted in sub-second labels with very specific human descriptions (Fig. 2e-f, S4). The highest level of clustering (level 1) divided the data based on broad categories that humans described as “climbing”, “active”, “social”, and “alone” (Fig. S4). To increase the human-interpretability of these clusters, we plotted the distribution of features across clusters at every hierarchical level (Fig. S4). This unsupervised hierarchical approach to labeling behavioral states allowed us to study behavioral responses at multiple levels of description.
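The idea of tracing cluster identities across hierarchical levels can be illustrated with a toy stand-in: cluster the latent vectors at several granularities, then map each fine cluster to the coarse cluster containing most of its frames. The sketch below substitutes plain k-means (with k-means++ initialization) for the paper's t-SNE-based clustering, and the granularities in `ks` are arbitrary.

```python
import numpy as np

def _init_pp(X, k, rng):
    """k-means++ initialization: spread the initial centers apart."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, n_iter=50, n_restarts=5, seed=0):
    """Plain k-means (best of several restarts), a toy stand-in for the
    density-based clustering of the latent space."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_restarts):
        centers = _init_pp(X, k, rng)
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)
            centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        inertia = d.min(1).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def trace_hierarchy(X, ks=(2, 4)):
    """Label frames at several granularities, then map each fine cluster
    to the coarse cluster that contains most of its frames."""
    levels = [kmeans(X, k) for k in ks]
    links = []
    for coarse, fine in zip(levels[:-1], levels[1:]):
        links.append({f: np.bincount(coarse[fine == f]).argmax()
                      for f in np.unique(fine)})
    return levels, links
```

Each dictionary in `links` records which coarse-level state a fine-level state belongs to, which is the information needed to move between levels of description.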
To understand how the input time series data was weighted in the bottleneck layer of the model, we measured the average attention allocation to previous time points (Fig. S5). We found that most attention was concentrated in the 1-second period prior to the current frame, but that attention was more evenly distributed across the prior 20-second period when animals’ behavioral states remained similar across those frames (Fig. S5). This demonstrates a key benefit of using a deep neural network model to describe the ongoing behavioral state. Compared to using features with pre-determined durations (which would specifically set the timescale of the analysis based on the kernel size of those features), this approach allowed the model to access information from different timescales.
Comparing striatal neural data to video data
To determine whether the organization of the behavioral latent space was relevant to the neural representation of ongoing behavior, we recorded striatal neuron activity from 4 animals using silicon probes mounted on movable drives (Fig. 3a-c). Each animal was implanted with two 64-channel probes. We recorded neural activity in the striatum because it is known to contain diverse signals related to movement35 and social interactions36,37. To allow animals to have a full range of movement and to have natural interactions with cagemates, we used on-head amplifiers and an on-head data logger to wirelessly record from these neurons while animals moved freely in their home cages.
Because the number of concurrently recorded neurons was relatively low, we did not use population-based methods to analyze the data – as is possible in studies where hundreds or thousands of neurons are recorded simultaneously38. Instead, to estimate whether individual neuronal activity was related to the behavioral states defined by our clustering method, we compared each neuron’s firing rate distribution to shuffled data based on the overall frequency of each behavior during the session. We did not track the same neurons across multiple sessions, but instead moved the probes between sessions to record new neurons in subsequent sessions. We pooled the data from 234 sessions (from 4 animals) and hierarchically clustered neurons according to their firing rate enrichment maps (Fig. 3d). We found that the highest-level clusters of striatal neurons had average activity patterns that were clearly related to the behavioral state clustering. Neural cluster 1 had increased activity when animals were in the “social” state (Fig. 3e), cluster 2 had increased activity when animals were in the “alone” state (Fig. 3e), and cluster 3 had increased activity when animals were in the “climbing” state (Fig. 3e). Interestingly, cluster 4 had higher activity when animals were in the “alone” or “active” states (Fig. 3e). This class of neurons is likely related to movement, because transitions between these two states would be characteristic of an animal alternating between movement and temporary pauses between movements (Fig. S3). Finally, we also found that individual neurons could have patterns of enrichment similar to the group averages (Fig. 3f).
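The shuffle comparison above can be sketched as a simple permutation test. This is a simplified illustration, not the study's exact procedure: spike counts and state labels are assumed to be aligned per video frame, and permuting the labels preserves the overall frequency of each behavior, as described in the text.

```python
import numpy as np

def state_enrichment(spikes, states, n_states, n_shuffles=1000, seed=0):
    """Per-state firing-rate enrichment for one neuron versus a
    label-shuffle null.

    spikes: (n_frames,) spike counts per frame.
    states: (n_frames,) integer behavioral-state label per frame.
    Returns the observed mean rate per state and, per state, the fraction
    of shuffles whose mean exceeded the observed one (one-sided p-value)."""
    rng = np.random.default_rng(seed)
    observed = np.array([spikes[states == s].mean() for s in range(n_states)])
    null = np.empty((n_shuffles, n_states))
    for i in range(n_shuffles):
        perm = rng.permutation(states)  # preserves state frequencies
        null[i] = [spikes[perm == s].mean() for s in range(n_states)]
    p_greater = (null >= observed).mean(axis=0)
    return observed, p_greater
```

A neuron whose firing is genuinely elevated in one state will have a small `p_greater` for that state, while the shuffle controls for how often the state occurs.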
Large-scale data collection reveals variability across days and correlation between cagemates
To treat the study of marmoset behavior as a quantitative data science problem39,40, we collected a large dataset of natural behavioral data. We recorded video data from 120 individual animals living in 36 families at time points corresponding to the siblings being aged 3 months, 6 months, 9 months, 12 months, and 15 months. Some parents were observed with different sets of siblings, as the data collection took several years to complete. In total, we recorded 5,400 sessions (defined as the data from one animal in one day) and sorted the sessions using PCA to visualize the largest sources of variance (Fig. S6). We found that behavioral state usage varied dramatically across days (Fig. 4a-b). Notably, there was a large amount of variability across all sessions (Fig. 4c), and individual animals measured longitudinally had large within-animal variability across sessions as well (Fig. 4d). To determine whether this variability could be caused by non-uniform sleep quality, we used motion watches to measure activity across all hours. We found that animals in our study had very consistent sleep patterns and that sleep quality was similar across days (Fig. S7). Because of the variability in behavioral state usage across days, it would be difficult to find stable phenotypes of individual animals without collecting weeks or months of data per animal. To determine whether animals’ stable behavioral traits could be measured more efficiently, we aimed to better understand the sources of the observed variability.
When measuring the time spent in “active” behavioral states across 45 sessions (non-continuous days) in an example family, we found that cagemates displayed highly similar patterns of activity (Fig. 4e-f). We pooled data from all simultaneously measured animals and found a high correlation between the time spent active and the cagemates’ time spent active (Fig. 4e). To better understand the timescale of this effect, we compared the average correlation to cagemates across timescales ranging from seconds to hours (Fig. S6). For this analysis, we calculated correlation based on the average usage of all behavioral states. We found that both the correlations to cagemates and the correlation to shuffled data increased as a function of bin size. This is because the distribution of behavioral states becomes more similar to the mean as the sampling window size increases. However, the correlation to cagemates was higher than the correlation to shuffled data across all bin sizes (Fig. S6). Interestingly, measuring an animal’s average time spent active relative to cagemates gave an accurate estimate of the animal’s average time spent active relative to the colony average (Fig. 4g). Overall, these results indicate that measurements of animals’ behavioral traits depend heavily on their social environment.
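The bin-size analysis can be sketched with a permutation control. Note one simplification: the paper computed correlations over the usage of all behavioral states (which is why its shuffle control also rises with bin size), whereas this sketch correlates a single activity trace between two animals, so the permuted control stays near zero at every bin size.

```python
import numpy as np

def binned_corr(a, b, bin_size):
    """Pearson correlation of two per-frame time series after averaging
    within non-overlapping bins of `bin_size` frames."""
    n = (len(a) // bin_size) * bin_size
    A = a[:n].reshape(-1, bin_size).mean(axis=1)
    B = b[:n].reshape(-1, bin_size).mean(axis=1)
    return np.corrcoef(A, B)[0, 1]

def cagemate_vs_shuffle(a, b, bin_sizes, n_shuffles=200, seed=0):
    """Compare the real cagemate correlation to a permutation control
    at each bin size. Returns {bin_size: (real_r, mean_shuffled_r)}."""
    rng = np.random.default_rng(seed)
    out = {}
    for bs in bin_sizes:
        real = binned_corr(a, b, bs)
        shuf = np.mean([binned_corr(a, rng.permutation(b), bs)
                        for _ in range(n_shuffles)])
        out[bs] = (real, shuf)
    return out
```

The key comparison is that the real correlation exceeds the shuffled control at every bin size, mirroring the analysis in Fig. S6.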
Correlation between total call rate in a housing room and behavioral state usage
Families of marmosets are housed in separate cages, but animals can make loud calls that likely influence the behavioral state of neighboring cages despite the lack of visual access. Because we found that animals’ behavioral state usage was highly correlated with their cagemates’ behaviors, we also wanted to measure the effect that these loud vocalizations from neighboring cages could have on behavioral state usage. To do this, we installed microphones outside of each cage in a room, such that the signal would be loudest when a call originated from a nearby cage (Fig. 5a, S8), but loud calls in the room could still be detected by each microphone. As is standard in this colony, visual barriers between cages prevented animals from directly observing their neighbors. We trained a neural network to identify segments of audio data that contained vocalizations using human labels (Fig. 5b) and trained a separate network to classify the vocalizations according to their call type (Fig. S9).
Similar to behavioral state usage (Fig. 5c), we found that the total call rate varied dramatically across recordings (Fig. 5d). We split the audio data into 5-minute (non-continuous) segments and found that the average call rate in these segments varied from 0 calls per minute to 60 calls per minute (Fig. 5d-e). We measured the correlation between call rate and behavioral state usage and found a strong negative correlation between call rate and the high-level “social” behavioral state (Fig. 5f-g). To better understand which specific behaviors were anticorrelated with high call rate, we plotted the correlation across all 80 low-level behavioral states (Fig. 5h) and found that call rate was specifically anticorrelated with social behavioral states in which the animals are at rest and not moving (Fig. 5h, S4). This implies that a high rate of calls in a room can interrupt animals during social resting states and cause them to preferentially exhibit more active/attentive states. To confirm the generality of this effect, we compared data from 52 animals living in families of 4 animals (two parents and two siblings) to data from 43 animals living in families of 2 animals (two unrelated adults). In both datasets, a similar overall pattern emerged (Fig. S10).
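Computing the per-state correlations with call rate amounts to one Pearson r per behavioral state. A minimal numpy version, assuming a matrix of usage fractions per 5-minute segment (the array layout here is illustrative, not the study's data format):

```python
import numpy as np

def state_call_correlations(usage, call_rate):
    """Pearson correlation between each behavioral state's usage and the
    room call rate.

    usage: (n_segments, n_states) fraction of time in each state per segment.
    call_rate: (n_segments,) calls per minute per segment.
    Returns an (n_states,) array of correlation coefficients."""
    u = usage - usage.mean(axis=0)
    c = call_rate - call_rate.mean()
    return (u * c[:, None]).sum(0) / (
        np.sqrt((u ** 2).sum(0)) * np.sqrt((c ** 2).sum()))
```

The most negative entries of the returned array identify the states (here, social resting states) whose usage drops when the room call rate is high.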
Measuring behavioral response to a panel of stimuli
Because the animals’ natural behavior was highly variable and highly correlated with their cagemates’ behavior and facility conditions, we aimed to measure species-typical responses that could be robustly elicited by presenting animals with a defined set of stimuli. We selected a range of stimuli including positive, neutral, negative, novel, and social stimuli (Fig. 6a, Fig. S11). These stimuli were presented to the animals over the course of a 3-week recording session, which was repeated every 3 months based on the age of the juveniles in the cage. We found that most stimuli did not cause large changes in behavioral state usage rates (Fig. 6a, Fig. S12). Interestingly, we also found that some stimuli which did not cause a large change in behavioral state usage did still cause a significant change in the time spent near the stimulus (Fig. 6b). To better understand this data, we looked for correlations between behavioral state usage and the amount of time spent interacting with each stimulus (Fig. S11, Fig. S12).
As an example of this type of analysis, we compared the “toy response” and “tablet response” conditions (Fig. 6c-f). We found that the usage of most behavioral states was similar between these conditions, though a few were different (Fig. 6c-d). This allowed the conditions to be weakly discriminated based on behavioral state usage (Fig. 6d). By contrast, the correlation patterns between behavioral state usage and time spent interacting with each stimulus were very different between conditions (Fig. 6e-f). For example, social behavioral state usage was much more highly correlated with tablet interaction time than with toy interaction time. This resulted in high discriminability between the conditions using the behavioral state correlation data (Fig. 6f).
Overall, these results suggest that some stimuli elicit large responses that can be clearly distinguished based on behavioral state usage, but some stimuli required the behavioral state correlation measurements to be discriminable from other stimulus response conditions (Fig. S12). For the following analyses, we use both behavioral state usage and behavioral state correlation measurements as features for studying the species-typical responses to each stimulus condition.
Outlier detection
Given the heterogeneity across patients in human neuropsychiatric disorders, we aimed to develop a phenotyping method to determine whether each animal had a “typical” or “atypical” response to each stimulus. To validate this methodology, we measured the distinguishability of the behavioral responses to each stimulus based on patterns in the behavioral state usage and behavioral state correlations (Fig. 7).
We first estimated a probability distribution for each feature in each condition (Fig. 7a). Next, we assigned outlier scores for new data, which came either from held-out animals responding to the same stimulus (“Condition 1”), or from animals responding to a different stimulus (“Condition 2”).
We assigned an outlier score for each feature based on the probability distributions derived from animals in the training set (Fig. 7a). As expected, data from different conditions had higher outlier scores (Fig. 7b). To benchmark the number of animals needed to robustly discriminate between conditions using outlier detection, we compared classification accuracy between 2-class classification using SVM models (Fig. 7c) and 1-class outlier detection (Fig. 7d).
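As a concrete illustration of this scoring scheme, the per-feature distributions can be approximated with univariate Gaussians fit to the training animals, with the mean absolute z-score across features serving as the outlier score. This is a simplified stand-in for the probability distributions used in the study, not the exact estimator.

```python
import numpy as np

class GaussianOutlierScorer:
    """Per-feature outlier scores relative to a control population.

    Fits a univariate normal to each feature of the training data and
    scores new samples by their mean absolute z-score across features."""

    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0) + 1e-9  # guard against zero variance
        return self

    def score(self, X):
        return np.abs((X - self.mu) / self.sigma).mean(axis=1)
```

Held-out animals from the same condition should receive low scores, while animals responding to a different stimulus should receive higher scores, as in Fig. 7b.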
We found that 2-class classification using SVM could accurately discriminate between conditions even with very few animals used as training data (Fig. 7e-f). This approach could be useful in cases where the number of animals available is small, and the phenotype is very consistent across animals because SVM models can assign greater importance to variables that differ between pre-defined classes. By contrast, 1-class outlier detection methods must weigh all variables equally. In spite of this, we found that outlier detection scores were able to accurately discriminate between conditions if the number of animals used in the training set was sufficient (Fig. 7g-h). For studies of nonhuman primates, where the total number of animals used per study must be minimized for ethical and logistical reasons, large datasets of typically developing animals will be valuable to use as control data for studies of disordered states that can use outlier detection to determine the type and severity of the disorders. In this study, we were able to use outlier scores to distinguish between stimulus response conditions with accuracy similar to that of 2-class classification in cases where a large number of age-matched typically developing animals were used in the training set (Fig. 7). One caveat to this approach is that outlier detection was less accurate when non-age-matched animals were used in the training set (Fig. S13). This is likely because of large-scale differences between the behavioral state usage distributions of juvenile and adult animals (Fig. S14).
In the preceding analysis, we used data from typically developing animals that had similar responses to each stimulus. To simulate increased heterogeneity in the dataset, we artificially mixed data from multiple conditions so that we could measure the effect of heterogeneity on the discriminability of conditions (Fig. 8). Using 2-class classification, we found that the score given to these artificial outliers varied significantly based on the number of outliers artificially added to the training dataset (Fig. 8a). By contrast, outlier detection was not dependent on the number of outliers added to the training dataset (Fig. 8b). We measured this property across all pairs of conditions and found that classification-based scores varied based on the number of outliers artificially added to the training set across many pairs of conditions (Fig. 8c-d). By contrast, outlier detection was robust to changes in the heterogeneity of the training data across pairs of conditions (Fig. 8e-f). This demonstrates the potential usefulness of our outlier detection strategy for studying neuropsychiatric disorder model animals that could have heterogeneous phenotypes.
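The contrast between the two approaches can be illustrated with a toy simulation. This is not the paper's SVM-based analysis: a nearest-centroid rule stands in for the 2-class classifier, a Gaussian z-score computed against a fixed normative control set stands in for the 1-class detector, and all distributions and sample sizes are invented for illustration. When condition-2 samples contaminate the condition-1 training labels, the learned class-1 centroid drifts and the classifier's scores shift, while the normative 1-class score is unaffected by the mislabeled training data.

```python
import numpy as np

def scores_under_contamination(frac, n_train=300, seed=0):
    """Toy demo: score condition-2 test samples when a fraction `frac`
    of the condition-1 training set is actually condition-2 data.

    2-class score: difference of distances to the two class centroids
    (the class-1 centroid is learned from the contaminated labels).
    1-class score: mean |z| relative to a fixed normative control set,
    which does not depend on the contaminated training labels."""
    rng = np.random.default_rng(seed)
    normative = rng.normal(0.0, 1.0, size=(2000, 3))  # large control dataset
    mu, sigma = normative.mean(0), normative.std(0)
    n_bad = int(frac * n_train)
    class1 = np.concatenate([rng.normal(0.0, 1.0, size=(n_train - n_bad, 3)),
                             rng.normal(3.0, 1.0, size=(n_bad, 3))])
    class2 = rng.normal(3.0, 1.0, size=(n_train, 3))
    test2 = rng.normal(3.0, 1.0, size=(500, 3))
    c1, c2 = class1.mean(0), class2.mean(0)
    two_class = (np.linalg.norm(test2 - c1, axis=1)
                 - np.linalg.norm(test2 - c2, axis=1)).mean()
    one_class = np.abs((test2 - mu) / sigma).mean()
    return two_class, one_class
```

Comparing the returned scores at 0% and 30% contamination shows the discriminative score drifting while the normative outlier score stays stable, which is the property that makes large shared control datasets useful for heterogeneous experimental groups.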
Discussion
Methods for quantifying natural behavior
Finding patterns in natural behavior is a complex problem, and several different approaches may yield meaningful descriptions of video data. Broadly, the analysis of video data can be either pixel-based19,20 or keypoint-based41,42. We chose to use a keypoint-based method because of the large variety of cage decorations, objects, stimuli, and other animals present in the study. Pixel-based methods may be more appropriate when the conditions are more constant across video frames, such as in the quantification of facial expression in head-fixed mice43. Another key decision is whether to use supervised or unsupervised learning approaches. Supervised learning can be beneficial in cases where a small number of human-recognizable behavioral modules are the subject of the investigation, such as in studies of aggressive responses44 or escape responses45 in mice. Unsupervised learning can be more difficult to implement because the patterns identified by the model may not have much biological relevance. For instance, models using raw x/y/z coordinates rather than egocentric coordinates may label different positions in the cage as key behavioral states, rather than different motions. In cases where input data contains missing values or large jumps, the model may identify these features as the main sources of variance. We chose to use an unsupervised learning approach because our goal was to describe the entire distribution of the animals’ daily behaviors in a relatively unbiased manner rather than selecting human-identifiable behaviors manually and quantifying the frequency of those behaviors.
Aside from questions regarding methodology, the study of natural behavior is also complicated by the lack of a specific definition of what constitutes a behavioral module. Importantly, behaviors can occur on different timescales, and it is not always clear (a priori) which timescale will be relevant for each study. For instance, some neuropsychiatric disorder model animals may have motor skill deficits requiring a detailed kinematic analysis of sub-second behaviors46 and others may have deficits in complex social interactions that occur across several seconds or minutes. Many studies have focused on sub-second units of behavior because they are the shortest segments that can be clearly differentiated into units, often referred to as syllables19,20. Another complication is that behavior can be described at many levels of specificity. At higher levels of specificity, more data is needed to find patterns. But at lower levels of specificity, patterns may not be apparent due to mixed effects within a single category. We chose a method where each frame would have a single label, while preserving hierarchical information. This allowed us to identify patterns at multiple levels of specificity.
Large-scale data collection and reproducibility
Logistical constraints and ethical considerations limit the collection of behavioral datasets from nonhuman primates at the scale that is common for several other model organisms. NHP research is comparatively much more expensive, due to the larger living spaces required, longer generation times, and more complex veterinary care needed to maintain high-quality living conditions. We opportunistically collected data from a large colony of marmosets that was primarily used for transgenic model production. This open-source dataset will be useful for future studies that can use this data as a large normative control group to achieve strong statistical power even in cases where the number of animals in experimental groups is lower. Additionally, sharing this large dataset will allow computational neuroscientists to develop new methods for analysis without conducting their own large-scale data collection efforts. This is crucial because it will allow their models to be trained on data from a variety of animals and conditions, rather than on example datasets comprised of a small number of videos.
When comparing natural behavioral data across facilities, challenges can arise due to environmental variables such as the behavioral distribution of family members, number of vocalizations in the room, room configuration, and stimuli present. Although these conditions may change the exact distribution of behavioral state usage, we expect that the general principles for marmoset natural behavioral analysis identified in this study should apply across facilities.
High dimensional phenotyping for neuropsychiatric disorders
Neuropsychiatric disorders can have a wide range of symptoms, and disease model animals will need to be tested in a variety of contexts to study their natural behavior, responses to stimuli, cognitive ability, and performance in other assays. People with autism can have a variety of symptoms, such as differences in social affect, vocal communication, cognitive ability, responses to novel or high-intensity sensory stimuli, restricted interests, or repetitive behavior. Some of these symptoms may only manifest in certain conditions, such as in novel environments or when meeting new people. Some symptoms may also be secondary effects, such as differences in organ health47. To test which phenotypes are present in disease model animals, high throughput and high dimensional methods will be essential. This is particularly important for testing whether treatments can rescue multiple deficits or only a subset of the deficits.
Outlier detection for individual phenotyping
Several physical traits are routinely measured in developing human infants to ensure that they are within normal ranges. For instance, height and weight charts are used to determine how each infant compares to the overall population. Identifying outliers using these measurements is important, because they may need additional care. However, assessing normality in social and cognitive domains is more difficult because of the complexity of these behaviors. Model animals can serve a crucial purpose for preclinical screening in these areas, because measurements of complex behaviors can be collected more reproducibly in well-controlled lab environments.
Additionally, even patients with the same monogenic form of autism can exhibit substantial heterogeneity. For instance, one ASD patient could have typical behavioral patterns with respect to sensory stimuli, but atypical behavioral patterns related to social stimuli. Because of this, outlier detection in model animals should be done on an individual basis, to identify whether each individual is an outlier in each condition or assay.
Many studies of disease model mice have focused on group-based differences in specific assays31,48. Although bulk differences between heterogeneous classes can be detectable with large sample sizes (as is typical in mouse studies), this is logistically impractical for NHP research.
Rather than producing large cohorts of control animals for each study, a much more efficient use of animals is to collect large datasets of control data that can be used to power future studies with smaller cohorts. This could be done for data related to natural behaviors, responses to stimuli, cognitive tests, and other assays. These datasets could be standardized and shared across institutes, as is becoming more common for mouse research49. The normative models from these datasets could be used to test whether smaller cohorts of animals (controls, disease models, or treated animals) are outliers relative to the dataset. The open-source data from this study will contribute to these ongoing efforts.
Methods
Animals
For this study, we opportunistically collected behavioral data from 120 animals that were part of a large pre-existing colony used for transgenic animal production. No animals were obtained or bred specifically for use in this project. Additionally, we used data from 4 animals that had been implanted with silicon probes for electrophysiology. The animals were all group housed in family units consisting of 2 adult animals and juveniles up to the age of 15 months. After 15 months, the juveniles would be paired with other animals (in new cages). The lights in the facility were turned on at 7:00 am and turned off at 7:00 pm. Veterinary checks and other work in the rooms (such as cage changes) were done between 7:00 am and 10:00 am. Each session of data was collected between 10:00 am and 4:00 pm. After 5:00 pm, the animals were not disturbed to ensure that their sleep cycle would not be affected.
All procedures were performed in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the MIT Committee on Animal Care (IACUC). The MIT Division of Comparative Medicine (DCM) routinely monitored animal welfare and provided them with veterinary care as needed.
Home cage video recording
Cages were scheduled for recording sessions based on the ages of the juvenile animals living in the cage, such that they would be recorded for 3 weeks at a time when the animals were 3 months, 6 months, 9 months, 12 months, and 15 months old. Recording sessions were scheduled for weekdays to avoid variance between weekday and weekend environmental conditions. Once juveniles were 15 months old, they were transferred to new cages to begin their own families with new partners. These measurements began once the animals were 3 months old, because that was when the hair tufts near their ears were long enough to dye (to differentiate between animals) and also because the animals began to behave more independently from their parents (rather than clinging to one of their parents) at this time. One week before each recording session, cages were changed to accommodate a transparent ceiling panel to facilitate top-down video recording. The smooth transparent ceiling panels prevented animals from climbing on the ceiling during the recording sessions, and horizontal dividers were added to further restrict the total cage size during the recordings.
One day before the first recording session, cameras were placed above the cages, the ceiling was replaced with a transparent panel, and animals’ hair was dyed. The hair colors were planned such that each cage would have a red-haired animal, a white-haired animal, a blue-haired animal, and a yellow-haired animal. Red, blue, and yellow were chosen as the dye colors to maximize distinguishability between animals, according to human observers. Human distinguishability is essential to this approach since we used a supervised model to classify the animals based on hair color. To apply hair dye, one experimenter caught and restrained the animal while a second experimenter applied the dye to the hair. Dye was applied liberally to the tufts of hair near the ears so that there was no white hair remaining. Animals were then left in a temporary holding cage for 20 minutes so that the dye could dry before they were transferred back to their home cage. This was done to prevent family members from rubbing the dye onto themselves and to prevent the dye from marking items/surfaces in the home cage. After animals were returned to their cages, we waited at least one day before any recordings were performed to allow the dye to dry completely and allow the other animals to habituate to the dye.
Video recording was done using ZED stereo cameras mounted above the cages and collected using the ZED SDK developed by the manufacturer. Video was collected from 2 cameras at 60 frames per second with 720p resolution (1280 pixels by 720 pixels per image). Camera settings (i.e., brightness, contrast, exposure, and gain) were manually adjusted after each camera installation to avoid image quality issues caused by oversaturation or insufficient gain.
Because light sources directly above the cages would cause glare in the video, we either moved the cages or moved the light sources to ensure that cages were not directly below light bulbs. After exporting the proprietary “svo” videos generated by the cameras into the “avi” format using the ZED SDK, further processing was done using python and Matlab.
Pose extraction
Postural keypoints were extracted from the raw video data using DeepLabCut (DLC)14 and sorted using a two-step method we developed for this specific use case. First, tracklets were joined based on kinematic features (as in the typical DLC pipeline). Then, tracklets were sorted based on the average distance between each tracklet and the labels of each hair color. To train a network to recognize hair color, we placed one label on the center of the head of each animal, rather than on each hair tuft. This approach leveraged the kinematic data to join nearby points and used the hair color data to ensure that animal identities were not mixed after animals had close interactions or were occluded by other animals. Egocentric postures were calculated by setting the middle point on the body to (0,0) and then rotating all points based on the angle formed by the front point on the body and the back point on the body. Features were then calculated based on both the raw time series data and the egocentric time series data. All code and posture time series data will be made available online alongside the publication of this paper. Raw data will be available upon request, due to logistical issues with publicly hosting the entire unprocessed dataset.
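The egocentric alignment described above (centering on the body midpoint and rotating by the front–back body-axis angle) can be sketched in numpy. This is a minimal illustration, not the study code; the function name and keypoint indices are hypothetical.

```python
import numpy as np

def egocentrize(points, mid_idx, front_idx, back_idx):
    """Center a (T, K, 2) keypoint time series on the body midpoint and
    rotate each frame so the front-back body axis lies along the x-axis.
    (Hypothetical helper illustrating the alignment described above.)"""
    points = np.asarray(points, dtype=float)
    centered = points - points[:, mid_idx:mid_idx + 1, :]   # midpoint -> (0, 0)
    dx = points[:, front_idx, 0] - points[:, back_idx, 0]
    dy = points[:, front_idx, 1] - points[:, back_idx, 1]
    theta = np.arctan2(dy, dx)                              # body-axis angle per frame
    c, s = np.cos(-theta), np.sin(-theta)                   # rotate by -theta
    x, y = centered[..., 0], centered[..., 1]
    return np.stack([x * c[:, None] - y * s[:, None],
                     x * s[:, None] + y * c[:, None]], axis=-1)
```

After this transform, features can be computed identically from the raw and egocentric time series, as in the text.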
Behavioral state labeling
A recurrent neural network was trained to use 25 seconds of input data (features) to predict the next 5 seconds of data (features). This network had a bottleneck size of 256, and the data from this bottleneck was used for hierarchical t-SNE clustering. As suggested in tree-SNE34, we systematically varied the alpha (exaggeration) and perplexity (# of neighbors) in the t-SNE clustering algorithm. At high levels of alpha/perplexity, we observed fewer distinct clusters. At lower levels of alpha/perplexity, we observed more distinct clusters. Initial clustering was done using Density-Based Spatial Clustering of Applications with Noise (DBSCAN)50. Cluster labels were assigned to new data (obtained after initial model training) using openTSNE51 to embed new points into the existing embedding, and then using k-nearest neighbors (KNN) to assign a state label. This was done at the lowest hierarchical level only, because higher levels could be inferred from this information. All code and model output data will be made available online alongside the publication of this paper.
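The final assignment step (labeling newly embedded points by their nearest neighbors in the existing embedding) can be sketched as a simple majority-vote KNN. This is an illustrative numpy sketch, not the study code; the function name and the value of k are assumptions.

```python
import numpy as np

def knn_assign(embedded_new, embedded_train, labels_train, k=5):
    """Assign a behavioral state label to each newly embedded point by
    majority vote among its k nearest neighbors in the existing
    low-dimensional embedding."""
    embedded_train = np.asarray(embedded_train, dtype=float)
    labels_train = np.asarray(labels_train)
    out = np.empty(len(embedded_new), dtype=labels_train.dtype)
    for i, p in enumerate(np.asarray(embedded_new, dtype=float)):
        d = np.linalg.norm(embedded_train - p, axis=1)      # distances to training points
        nearest = labels_train[np.argsort(d)[:k]]           # labels of k nearest neighbors
        vals, counts = np.unique(nearest, return_counts=True)
        out[i] = vals[np.argmax(counts)]                    # majority vote
    return out
```

Because this is applied only at the lowest hierarchical level, higher-level cluster membership follows directly from the assigned label.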
Home cage audio recording
Microphones were placed outside of each cage under study and also outside of each neighboring cage. Small wireless microphones were used so that they could be placed around the room without adding any wires or large equipment that could potentially draw the animals’ attention. Audio data was recorded at 48,000 samples per second (48 kHz) and logged locally on each wireless recording device. The data from each device was synchronized to the video data using a series of 5 claps from a clapper board at the beginning and end of each video. These claps were clearly visible in the video and easily identifiable in the audio data by a human observer. The clap onset time was manually annotated for each video and audio recording.
Vocalization labeling
Audio data was converted into spectrograms using SciPy50 with nperseg = 700, noverlap = 525, and nfft = 2000. These spectrograms were segmented using a supervised neural network trained to perform binary classification where 0 = no call and 1 = call. To determine the average call rate (Fig. 5), a threshold of 0.5 was applied to the output of the neural network, such that all outputs below 0.5 were set to 0 and all outputs above 0.5 were set to 1. Spectrograms were segmented into discrete calls using this binary classification. Next, the discrete calls were classified by a separate neural network trained to perform classification based on human categorization of calls in the training set into 8 classes. The types of calls labeled by this model were limited to the types of calls that were commonly observed in our training data. The audio data will be made available online alongside the publication of this paper.
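The step that converts the thresholded network output into discrete calls amounts to finding contiguous runs of above-threshold time bins. A minimal sketch (function and variable names are illustrative, not from the study code):

```python
import numpy as np

def segment_calls(p_call, threshold=0.5):
    """Convert per-bin call probabilities from the segmentation network into
    discrete (start, end) call intervals (end exclusive), where indices refer
    to spectrogram time bins."""
    binary = (np.asarray(p_call) > threshold).astype(int)   # 0 = no call, 1 = call
    edges = np.diff(np.concatenate(([0], binary, [0])))     # +1 at onsets, -1 at offsets
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return [(int(s), int(e)) for s, e in zip(starts, ends)]
```

Each resulting interval can then be cut from the spectrogram and passed to the 8-class call-type classifier.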
Neural data acquisition
Animals were implanted with silicon probes mounted on drives that could be lowered into the striatum. Data was collected using on-head amplifiers and on-head data loggers. Two silicon probes with 64 channels each were implanted into each of the four animals according to the protocol published by Cambridge Neurotech52 and lowered between sessions using drives made by Cambridge Neurotech. The full description of this surgical procedure will be published in a separate paper. Data was logged on-head during each session and the logger was removed so that it could be charged, and the data could be downloaded between sessions. This reduced the total weight on the head of the animal during the non-recording time. During non-recording time, a lightweight black cover was attached in place of the data logger. Data was acquired at a rate of 25,000 samples per second (25 kHz) using data loggers made by White Matter. Spike sorting was performed using Kilosort53 version 3 followed by further manual curation. Units with missing data (either due to drift or due to lapses in the connection between the amplifiers and the logger) were not included for further analysis.
Neural data analysis
Neural data was aligned to video data using a flashing light on the top of the data logger that could be seen by the camera at the beginning and end of each session. Each data format (video, audio, and neural) had a different sampling rate. Video was sampled at 60 Hz, audio was sampled at 48,000 Hz, and neural data was sampled at 25,000 Hz. We converted all data into bins of 60 samples per second to match the rate of the video recording, which was the slowest of the three data modalities. We compared the observed distribution of spikes across the behavioral states to a simulated distribution based on the frequency of behavioral events in each session. We clustered neurons based on these normalized spike distributions across the behavioral latent space, and manually inspected the 4 highest-level clusters using post-hoc statistical tests to determine whether the neurons in each cluster had increased activity during each of the highest-level behavioral states.
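The conversion of spike times into 60 Hz bins matched to video frames can be sketched with a histogram over frame-edge times. This is an illustrative sketch under the sampling rates stated above; the function name and arguments are assumptions.

```python
import numpy as np

def bin_spikes_to_video(spike_samples, neural_fs=25_000, video_fs=60, n_frames=None):
    """Convert spike times (given in neural samples at 25 kHz) into per-frame
    spike counts at 60 Hz, so spiking can be aligned bin-for-bin with the
    video-derived behavioral states."""
    t = np.asarray(spike_samples, dtype=float) / neural_fs  # spike times in seconds
    if n_frames is None:
        n_frames = int(np.ceil(t.max() * video_fs)) if t.size else 0
    edges = np.arange(n_frames + 1) / video_fs              # 1/60-s bin edges
    counts, _ = np.histogram(t, bins=edges)
    return counts
```

The same binning applies to the audio stream, since 60 Hz is the slowest of the three modalities.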
Stimulus presentation
Stimuli were presented over a course of 3 weeks, and the same series of stimuli were presented to each cage every 3 months. The first 5 sessions were “control” sessions. In this condition, a horizontal divider was placed in the middle of the cage to restrict the total cage size and simplify analysis. Ledges, branches, and hammocks were removed so that animals would not be occluded from view. Nest boxes and other cage attachments were also removed, so that animals could not leave the field of view. A bed was placed in the rear left corner of the cage so that animals had a comfortable place to sit when resting. Food was placed into two bowls that fit into slots designed to hold them so that they remained in place. Water was freely available through spouts on the side of the cages. At the beginning of each session, cameras were turned on and audio/neural data was synchronized to the cameras as described above (using a clapper board to synchronize the audio data to the video and using the lights on top of the data logger to synchronize the neural data to the video). After the completion of each session, cages were restored to their normal state.
After the 5 days of these control sessions, stimuli were presented once per day. During these sessions, the cages were set up as described above and then one of the 8 types of stimuli was added. Including the control sessions, this means that there were 9 total stimulus-response conditions (one control condition, and 8 stimulus conditions). Because the control condition was performed 5 times per set of recordings, and the toy condition was performed 3 times per set of recordings, the conditions were not evenly represented in the total dataset. The conditions are described in the following section.
Control. The only items present in the cage were a bed that animals could rest in, two bowls of food placed in slots designed to hold the bowls, and water-dispensers. These items were also present in the other stimulus conditions.
Novel object. A large and brightly colored plastic dog toy was placed in the cage. Importantly, this type of object is not typically placed in the cage. Marmosets cautiously approach these objects but have low overall interest in them.
Treat. A mixture of foraging material and foods high in sugar were placed in a bowl. Foods such as banana chips, marshmallows, pudding, craisins, and dried mango were mixed into the foraging materials. Importantly, these foods were not regularly given as part of the standard diet. Marmosets quickly approach and consume these highly appetitive items.
Noise. A movie was played by a computer outside of the cage. Marmosets do not have visual access to the movie, and only hear the dialogue and soundtrack. This “background noise” stimulus is not meant to have a large impact on their behavior.
Mask. A gorilla mask was placed outside the cage in the same position as the computer playing the “background noise” stimulus. The mask stimulus is similar to “human intruder” experiments. Marmosets have a very large negative response to this stimulus, often climbing to the furthest corner of the cage in a group, and sometimes making angry calls.
Toys. A set of colorful wicker balls that animals frequently chew on and interact with was placed in the cage. The toys are presented for 3 consecutive days. The wicker balls are attached to the side of the cage using zip ties so that they cannot be moved to different positions in the cage, which could complicate the interpretation of the data.
Intruder animal. A novel animal from a different family is placed in a transfer box inside the cage. Animals had auditory and visual access to the intruder, but were not able to physically interact with it (they were separated by a transparent plastic panel). This was done to prevent injury of either the family or the intruder animals. This is a highly salient stimulus and causes a large increase in activity.
Tablet 1. Animals were presented with a neutral stimulus (blue square) in the center of the screen. They could touch the screen to receive a syrup reward that was delivered from a metal spout attached to the bottom of the tablet. Further cognitive tasks used the same apparatus, but had increased task demands (forced choice, following a moving target, remembering past trials, or foraging) and took place individually rather than in the group setting. This cognitive data will be fully explored in an accompanying study.
Tablet 2. Animals were presented with a social stimulus (marmoset face) in the center of the screen. They could touch the screen to receive a syrup reward that was delivered from a metal spout attached to the bottom of the tablet.
2-class classification
For 2-class classification, we used a support vector machine (SVM) approach to distinguish the behavioral responses to different stimuli. First, we normalized features using z-scoring (so that each feature would have a mean of 0 and a standard deviation of 1). Then, we trained 2-class SVM models to discriminate between each pair of stimulus-response conditions using these normalized features (derived from both behavioral state usage and behavioral state correlations). We evaluated these models by calculating scores for held-out data from each class.
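The pipeline above (z-score the features on training data, fit a 2-class SVM, then score held-out data) can be sketched with scikit-learn. The use of scikit-learn, the synthetic data, and the variable names are assumptions for illustration only; the study's features were behavioral state usage and state correlations rather than random draws.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-animal feature vectors from two
# stimulus-response conditions (40 "animals" x 10 features each).
cond_a = rng.normal(0.0, 1.0, size=(40, 10))
cond_b = rng.normal(1.5, 1.0, size=(40, 10))
X = np.vstack([cond_a, cond_b])
y = np.array([0] * 40 + [1] * 40)

# Hold out 10 animals per condition for evaluation.
X_train = np.vstack([X[:30], X[40:70]])
y_train = np.concatenate([y[:30], y[40:70]])
X_test = np.vstack([X[30:40], X[70:]])
y_test = np.concatenate([y[30:40], y[70:]])

scaler = StandardScaler().fit(X_train)            # z-score using training statistics
clf = SVC().fit(scaler.transform(X_train), y_train)
score = clf.score(scaler.transform(X_test), y_test)  # held-out accuracy
```

Note that the scaler is fit on the training split only, so no information from the held-out animals leaks into the normalization.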
Several other model types could be used for 2-class classification, such as Gaussian naïve Bayes (GNB) or linear discriminant analysis (LDA). Because we obtained high discriminability between conditions using SVM models even when using a small number of animals as training data, we did not seek to further optimize our 2-class classification approach.
Outlier detection
For 1-class outlier detection, we first generated a probability density estimate (PDE) for each input feature based on the distribution of that feature in the training data. This PDE for behavioral state usage features was defined in the range of (0,1) and the PDE for behavioral state correlation features was defined in the range of (−1,1). Both types of distributions were made with a kernel bandwidth of 0.2. Outlier scores were calculated by evaluating the PDE at a single point to obtain a density value. Importantly, this is a relative measurement (i.e. if x1 has a higher density value than x2, then x1 is relatively more likely to have been drawn from the distribution than x2). As such, these measurements are not probability measurements because we evaluated the PDE at a fixed point rather than finding the probability that the data would fall within a certain interval. We calculated this relative density measurement for each input feature (including behavioral state usage features and behavioral state correlation features), and then averaged across these values to generate a single outlier score.
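For a scalar feature, the density evaluation described above can be sketched as a Gaussian kernel density estimate with a fixed bandwidth of 0.2, averaged across features. This is an illustrative numpy sketch; the function names are hypothetical and the study's features were state-usage and state-correlation values.

```python
import numpy as np

def density_score(x, train, bandwidth=0.2):
    """Relative density of scalar value x under a Gaussian kernel density
    estimate built from the training values. Higher values mean x is more
    typical of the training distribution; this is a relative measure, not
    a probability."""
    train = np.asarray(train, dtype=float)
    z = (x - train) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean()

def outlier_score(features, train_features, bandwidth=0.2):
    """Average the per-feature density values into a single score
    (lower = more outlying), as described in the text."""
    cols = np.asarray(train_features, dtype=float).T
    return float(np.mean([density_score(x, col, bandwidth)
                          for x, col in zip(features, cols)]))
```

Evaluating the estimate at a single point, as here, yields a relative density rather than a probability over an interval, matching the caveat above.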
To estimate the number of animals needed, we randomly sampled a set number of animals (i.e. 4, 8, 16, 32) from the training set and compared the scores of held-out animals from the same condition to scores obtained from other conditions. As the number of animals in the training set increases, the resulting PDE should more accurately estimate the true distribution, and outliers should be more discriminable from inliers. In line with this reasoning, we observed that when the number of observations (animals) used as training data was low, all data used in the test set had a high outlier score – regardless of whether it came from the same condition or different conditions. In these cases (with few observations in the training set) each new observation would appear to be an outlier. By contrast, when we increased the number of observations (animals) used as training data, the outlier scores could accurately discriminate between conditions (as measured by auROC). For auROC analysis, we calculated the area under the ROC curve, such that a score of 0.5 would indicate a model with no discriminative ability and a score of 1.0 would indicate perfect discrimination between the conditions.
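The auROC computation can be expressed through its pairwise-comparison equivalence: the probability that a randomly chosen same-condition score exceeds a randomly chosen other-condition score, with ties counted as half. A minimal numpy sketch (function name is illustrative):

```python
import numpy as np

def auroc(scores_same, scores_other):
    """Area under the ROC curve for discriminating same-condition from
    other-condition held-out data, where same-condition data is expected
    to receive the higher density scores. 0.5 = no discrimination,
    1.0 = perfect discrimination."""
    same = np.asarray(scores_same, dtype=float)[:, None]
    other = np.asarray(scores_other, dtype=float)[None, :]
    # Fraction of (same, other) pairs correctly ordered; ties count as 0.5.
    return float(np.mean((same > other) + 0.5 * (same == other)))
```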
Creating artificial heterogeneity to test 2-class classification and outlier detection
To determine whether each model was robust to outliers, we artificially added “outlier” data to the training set by mixing data from different stimulus-response conditions. To do this, we pooled data from all other stimulus-response conditions and randomly selected a number of datasets (i.e. 0, 2, 4, 8, 16) to add into one of the conditions. We then calculated either 2-class classification scores or outlier scores for these artificial outliers as described above. The “change in score” metric indicates the difference between the average score given to outliers when the number of outliers added was 0 and when the number of outliers added was 2, 4, 8, or 16.
Data and code availability
Data and code associated with this paper will be available online upon publication. Processed data types such as postural time series data, behavioral state data, and vocalization time series data will be publicly hosted. Large data types such as raw multi-view video data will be available upon request, due to logistical issues with publicly hosting the entire unprocessed dataset.
Additional information
Acknowledgements
We thank Alex Mathis and Mackenzie Mathis for the helpful discussions and access to beta versions of their software. We thank MIT DCM for support with animal care. This work was funded by the Yang-Tan Center for Molecular Therapeutics in Neuroscience and the Tan-Yang Center for Autism Research of the Yang Tan Collective at MIT, the Stanley Center for Psychiatric Research at Broad Institute of MIT and Harvard, the Poitras Center for Psychiatric Disorders Research at MIT, and the Simons Center for the Social Brain at MIT.
Author contributions
W.M. and E.C. designed the study and equipment. W.M., E.C., and K.B. collected the behavioral datasets. W.M., S.P., and H.X. collected the neural dataset. W.M. performed the analysis and wrote the manuscript. All authors reviewed the manuscript.
Declaration of interests
The authors declare no competing interests.
Supplementary figures
References
- 1. Cross-species neuroscience: closing the explanatory gap. Philosophical Transactions of the Royal Society B 376.
- 2. Marmosets: A Neuroscientific Model of Human Social Behavior. Neuron 90:219–233.
- 3. Generation of transgenic non-human primates with germline transmission. Nature 459:523–527.
- 4. The neuronal representation of pitch in primate auditory cortex. Nature 436:1161–1165.
- 5. Current Status of and Perspectives on the Application of Marmosets in Neurobiology. Annu Rev Neurosci 44:27–48.
- 6. Marmosets: A Neuroscientific Model of Human Social Behavior. Neuron 90:219–233.
- 7. Antiphonal call timing in marmosets is behaviorally significant: interactive playback experiments. J Comp Physiol A Neuroethol Sens Neural Behav Physiol 195:783–789.
- 8. Representing the dynamics of natural marmoset vocal behaviors in frontal cortex. bioRxiv. https://doi.org/10.1101/2024.03.17.585423
- 9. Functional mapping of face-selective regions in the extrastriate visual cortex of the marmoset. J Neurosci 35:1160–1172.
- 10. The marmoset monkey as a model for visual neuroscience. Neurosci Res 93:20–46.
- 11. Vocal Learning via Social Reinforcement by Infant Marmoset Monkeys. Curr Biol 27:1844–1852.
- 12. Development of a Marmoset Apparatus for Automated Pulling (MarmoAAP) to Study Cooperative Behaviors. bioRxiv. https://doi.org/10.1101/2024.02.16.579531
- 13. Cross-Modal Representation of Identity in Primate Hippocampus. Science 382.
- 14. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat Methods 19:496–504.
- 15. SLEAP: A deep learning system for multi-animal pose tracking. Nat Methods 19:486–495.
- 16. Geometric deep learning enables 3D kinematic profiling across species and environments. Nat Methods 18.
- 17. JAABA: interactive machine learning for automatic annotation of animal behavior. Nat Methods 10:64–67.
- 18. Simple Behavioral Analysis (SimBA) as a platform for explainable machine learning in behavioral neuroscience. Nat Neurosci. https://doi.org/10.1038/S41593-024-01649-9
- 19. Mapping the stereotyped behaviour of freely moving fruit flies. J R Soc Interface 11.
- 20. Mapping Sub-Second Structure in Mouse Behavior. Neuron. https://doi.org/10.1016/j.neuron.2015.11.031
- 21. B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat Commun 12.
- 22. Live Mouse Tracker: real-time behavioral analysis of groups of mice. bioRxiv. https://doi.org/10.1101/345132
- 23. Deep phenotyping reveals movement phenotypes in mouse neurodevelopmental models. Mol Autism 13:1–18.
- 24. DeepSqueak: a deep learning-based system for detection and analysis of ultrasonic vocalizations. Neuropsychopharmacology 44:859–868.
- 25. Analysis of ultrasonic vocalizations from mice using computer vision and machine learning. Elife 10.
- 26. A model of marmoset monkey vocal turn-taking. Proceedings of the Royal Society B: Biological Sciences 291.
- 27. Molecular and network-level mechanisms explaining individual differences in autism spectrum disorder. Nat Neurosci 26:650–663.
- 28. The Neurological Manifestations of Phelan-McDermid Syndrome. Pediatr Neurol 122:59–64.
- 29. Definition and clinical variability of SHANK3-related Phelan-McDermid syndrome. Eur J Med Genet 66.
- 30. Phelan-McDermid syndrome: a classification system after 30 years of experience. Orphanet J Rare Dis 17.
- 31. Shank3 mutant mice display autistic-like behaviours and striatal dysfunction. Nature 472:437–442.
- 32. Sibling chimerism among microglia in marmosets. bioRxiv. https://doi.org/10.1101/2023.10.16.562516
- 33. LiftPose3D, a deep learning-based approach for transforming two-dimensional to three-dimensional poses in laboratory animals. Nat Methods 18:975–981.
- 34. Tree-SNE: Hierarchical Clustering and Visualization Using t-SNE. arXiv.
- 35. The Striatum Organizes 3D Behavior via Moment-to-Moment Action Selection. Cell 174:44–58.
- 36. Performance error-related activity in monkey striatum during social interactions. Sci Rep 6.
- 37. Social information signaling by neurons in primate striatum. Curr Biol 23:691–696.
- 38. High-dimensional geometry of population responses in visual cortex. bioRxiv. https://doi.org/10.1101/374090
- 39. Toward a Science of Computational Ethology. Neuron 84:18–31.
- 40. Promises and challenges of human computational ethology. Neuron 109:2224–2238.
- 41. Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. bioRxiv. https://doi.org/10.1101/2023.03.16.532307
- 42. Learnable latent embeddings for joint behavioural and neural analysis. Nature 617:360–368.
- 43. Facial expressions of emotion states and their neuronal correlates in mice. Science 368.
- 44. An approximate line attractor in the hypothalamus encodes an aggressive state. Cell 186.
- 45. Rapid Innate Defensive Responses of Mice to Looming Visual Stimuli. Current Biology 23:2011–2015.
- 46. High-speed video gait analysis reveals early and characteristic locomotor phenotypes in mouse models of neurodegenerative movement disorders. Behavioural Brain Research 311:340–353.
- 47. Evaluation of Brain-Body Health in Individuals With Common Neuropsychiatric Disorders. JAMA Psychiatry 80.
- 48. Cortico-striatal synaptic defects and OCD-like behaviours in Sapap3-mutant mice. Nature 448:894–900.
- 49. Standardized and reproducible measurement of decision-making in mice. Elife 10.
- 50. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17:261–272.
- 51. openTSNE: A Modular Python Library for t-SNE Dimensionality Reduction and Embedding. J Stat Softw 109:1–30.
- 52. Standard Operating Protocol: Implantation of Cambridge NeuroTech chronic silicon probe and mini-amp-64 digital headstage in mice. Protocol Exchange. https://doi.org/10.21203/RS.3.PEX-2188/V2
- 53. Solving the spike sorting problem with Kilosort. bioRxiv. https://doi.org/10.1101/2023.01.07.523036
Copyright
© 2024, Menegas et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.