Scratch-AID, a deep learning-based system for automatic detection of mouse scratching behavior with high accuracy

  1. Huasheng Yu  Is a corresponding author
  2. Jingwei Xiong
  3. Adam Yongxin Ye
  4. Suna Li Cranfill
  5. Tariq Cannonier
  6. Mayank Gautam
  7. Marina Zhang
  8. Rayan Bilal
  9. Jong-Eun Park
  10. Yuji Xue
  11. Vidhur Polam
  12. Zora Vujovic
  13. Daniel Dai
  14. William Ong
  15. Jasper Ip
  16. Amanda Hsieh
  17. Nour Mimouni
  18. Alejandra Lozada
  19. Medhini Sosale
  20. Alex Ahn
  21. Minghong Ma
  22. Long Ding
  23. Javier Arsuaga
  24. Wenqin Luo  Is a corresponding author
  1. Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, United States
  2. Graduate Group in Biostatistics, University of California Davis, United States
  3. Howard Hughes Medical Institute, Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Harvard Medical School, United States
  4. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, United States
  5. Department of Molecular and Cellular Biology, University of California Davis, United States
  6. Department of Mathematics, University of California Davis, United States

Abstract

Mice are the most commonly used model animals for itch research and for development of anti-itch drugs. Most laboratories manually quantify mouse scratching behavior to assess itch intensity. This process is labor-intensive and limits large-scale genetic or drug screenings. In this study, we developed a new system, Scratch-AID (Automatic Itch Detection), which could automatically identify and quantify mouse scratching behavior with high accuracy. Our system included a custom-designed videotaping box to ensure high-quality and replicable mouse behavior recording and a convolutional recurrent neural network trained with frame-labeled mouse scratching behavior videos, induced by nape injection of chloroquine. The best trained network achieved 97.6% recall and 96.9% precision on previously unseen test videos. Remarkably, Scratch-AID could reliably identify scratching behavior in other major mouse itch models, including the acute cheek model, the histaminergic model, and a chronic itch model. Moreover, our system detected significant differences in scratching behavior between control and mice treated with an anti-itch drug. Taken together, we have established a novel deep learning-based system that could replace manual quantification for mouse scratching behavior in different itch models and for drug screening.

Editor's evaluation

Scratch assays are the gold standard for measuring itch in rodents. However, the current limitation is that this is performed manually which is enormously taxing in terms of hours spent counting scratching bouts. The authors have developed a valuable automatic system to quantify scratch behavior with high accuracy and provided a valuable tool for the field. This will be resourceful for the greater itch biology community.

https://doi.org/10.7554/eLife.84042.sa0

Introduction

Itch is a disturbing symptom associated with skin diseases, immune problems, systemic diseases, and mental disorders (Cevikbas and Lerner, 2020; Hong et al., 2011; Kremer et al., 2020; Ständer et al., 2007). Chronic itch affects about 13–17% of the population (Matterne et al., 2009; Weisshaar and Dalgard, 2009), severely worsening the quality of life of affected patients. Unfortunately, treatment options for many chronic itch conditions are still limited (Yosipovitch et al., 2018; Yu et al., 2021).

Mice are the most widely used model animals for studying itch mechanisms and for developing new preclinical anti-itch drugs (Han et al., 2013; Liu et al., 2009; Solinski et al., 2019; Sun and Chen, 2007). Since itch is an unpleasant sensation that provokes the desire to scratch (Ikoma et al., 2006), scratching behavior has been assessed as a proxy for itch intensity in mice (Liu et al., 2009; Morita et al., 2015). Till now, this quantification process has been mainly conducted by watching videos and manually counting scratching bouts or the total scratching time, which is tedious and time consuming, unavoidably introduces human errors and bias, and limits the large-scale genetic or drug screenings.

Given the biological importance and the obvious need, several research groups have tried different strategies to automate this process, including an acoustic recording method (Elliott et al., 2017), a method using magnetic field and metal ring to detect paw movement (Mu et al., 2017), and several video analysis-based approaches (Bohnslav et al., 2021; Kobayashi et al., 2021; Park et al., 2019; Sakamoto et al., 2022). Nevertheless, these methods have not been widely adopted by other research laboratories, due to the uncertain performance of the trained models in different lab environments, the requirement of specialized equipment, and/or inadequate evaluation of these methods in different mouse itch models.

In recent years, with the rapid progress in the field of artificial intelligence, deep learning has been applied in various scientific research areas. For example, convolutional neural networks (CNN) are widely used in computer visual recognition tasks (Gu et al., 2018), whereas recurrent neural networks (RNN) (Graves, 2013) are developed for analyzing temporal dynamic features. Moreover, rapid improvement of computing power, especially in the graphics processing unit capacity, together with new open-source deep learning libraries, such as PyTorch (Paszke et al., 2019), Keras (Gulli and Pal, 2017), and Tensorflow (Abadi et al., 2016), have greatly accelerated broad applications of deep learning.

Animal behavior analysis is one of the research areas benefiting from the applications of deep learning. For example, DeepLabCut can track different body parts in freely moving animals for behavior analysis (Mathis et al., 2018). DeepEthogram recognizes and annotates different behaviors of mice and flies (Bohnslav et al., 2021). These examples support the proof-of-principle that deep-learning is a powerful avenue for automating animal behavior analysis. Nevertheless, for a given animal behavior, like mouse scratching, a designated method, which achieves high sensitivity, specificity, and generalization to replace human observers, still needs to be established.

To meet this challenge, we developed a new deep learning-based system, Scratch-AID (Automatic Itch Detection), which achieved automatic quantification of mouse scratching behavior with high accuracy. We first designed a videotaping box to acquire high-quality recording of mouse behavior in a reproducible environment from the bottom. We recorded 40 videos of 10 adult wildtype mice (5 males and 5 females) after nape injection of a non-histamine pruritogen, chloroquine (CQ), and manually labeled all video frames as the reference. We then designed a convolutional recurrent neural network (CRNN) by combining CNN and RNN and trained it with 32 scratching videos from 8 randomly picked mice. We obtained a series of prediction models using different training parameters and evaluated these models with test videos (eight unseen test videos from the two remaining mice). The best trained model achieved 97.6% recall and 96.9% precision on test videos, similar to the manual quantification results. Impressively, Scratch-AID could also quantify scratching behavior from other major acute and chronic itch models with high accuracy. Lastly, we applied Scratch-AID in an anti-itch drug screening paradigm and found that it reliably detected the drug effect. In summary, we have established a new system for accurate automatic quantification of mouse scratching behavior. Based on the performances, Scratch-AID could replace manual quantification for various mouse itch models and for drug or genetic screenings.

Results

The overall workflow

Our workflow to develop a new system for detection and quantification of mouse scratching behavior consists of four major steps (Figure 1A): (1) Videotape mouse scratching behavior induced by an acute nape itch model; (2) Manually annotate scratching frames in all videos for training and test datasets; (3) Design a deep learning neural network; train this network with randomly selected training videos and adjust different training parameters; and evaluate the performance of the trained neural networks on test videos; (4) Evaluate the generalization of the trained neural network in additional itch models and a drug screening paradigm.

Figure 1 with 1 supplement see all
The overall workflow and building a customized videotaping box for mouse scratching behavior recording.

(A) A diagram showing the workflow to develop a deep learning-based system for automatic detection and quantification of mouse scratching behavior. (B) An image of the designed videotaping box for high-quality video recording of mouse scratching behavior. Scale bar, 5 cm. (C) A cartoon showing the acute itch model induced by the chloroquine (CQ) injection in the nape, followed by video recording in the customized videotaping box. (D) Representative images showing different phases (P1–P4) of a scratching train (upper). Red arrows indicate the scratching hind paw. A cartoon showing the dynamic movement of the scratching hind paw in a scratching train (bottom). The cycle of scratching bout (P2) and paw licking (P3) may repeat once or more times in a scratching train. Scale bar, 1 cm. (E) The total number of scratching trains in each video. (F) The distribution of scratching train duration (n = 1135 scratching trains). The inset is the zoom-in of the red rectangle.

Design a videotaping box for high-quality and reproducible recording of mouse scratching behavior

High-quality videos recorded from a reproducible environment are critical for stable performance of trained prediction models and for adoption by other research laboratories. Thus, we designed a mouse videotaping box for such a purpose. It consisted of two boxes with white acrylic walls joined by a transparent acrylic floor (Figure 1B). The top (‘mouse’) box (Length × Width × Height = 14.68 × 14.68 × 5 cm) had a lid for mouse access. The bottom (‘camera’) box (Length × Width × Height = 14.68 × 14.68 × 23.6 cm) had a door for access to the camera (Logitech C920e Business Webcam). The walls and the lid of the box were non-transparent to minimize interference from outside visual stimuli. Ambient light penetrated the walls and provided sufficient illumination for behavior recording. A mouse could freely move inside the top box, and a camera recorded mouse behavior from the bottom (30 frames/s). Compared to the top or side views, the bottom view can clearly capture the key body parts involved in scratching behavior, such as the scratching hind paw and mouth, and their movements in great details (Video 1). The magnification, resolution, and brightness of the video can be adjusted by the camera recording software (Logitech C920e Business Webcam driver and software) to achieve consistent video recording. In short, this customized videotaping box allows high-quality video recording of mouse scratching and other behaviors in a stable and reproducible environment.

Video 1
An example of a mouse scratching train recorded by the designed videotaping box.

Spontaneous scratching is a rare event under normal conditions in mice. Itch sensation and scratching are usually induced by different itch models for research. Common mouse itch models are classified as cheek or nape, histaminergic or non-histaminergic, and acute or chronic, based on the body location where itch sensation is evoked, the kind of pruritogens, and the duration of itch sensation (Ikoma et al., 2006; Liu and Dong, 2015; Shimada and LaMotte, 2008; Thurmond et al., 2008). We first used an acute nape itch model induced by a non-histaminergic pruritogen, CQ, because it triggered immediate and robust scratching behavior in mice (Liu et al., 2009). After intradermal injection of CQ (200 μg in 15 μl saline) into left or right nape of the mice, a 20-min video was recorded using the customized videotaping box (Figure 1C).

In response to CQ injection, mice scratched the affected skin region using their ipsilateral hind paw. Evoked mice displayed multiple episodes of scratching, which were separated by non-scratching phases. Each episode of scratching, here defined as a scratching train, usually contained four phases: start (lifting the scratching hind paw toward the affected skin), scratching bout (rhythmic movement of hind paw against the affected skin), paw licking (putting the scratching hind paw into the mouth and licking), and end (putting down the scratching hind paw back to the floor) (Figure 1D and Video 1). The cycle of scratching bout and paw licking might occur once or repeat multiple times, depending on the itch intensity and the internal state of the mouse. The time from the start to the end of a given scratching train is defined as the duration. The total scratching time is the sum of durations of all scratching trains, which is an effective parameter to quantify the scratching behavior and assess itch intensity.

Video annotation

Forty scratching videos from 10 adult wildtype C57 mice (5 males and 5 females, 2- to 3-month-old) were recorded (Supplementary file 1). For neural network training and testing and for comparing performances between trained neural networks and manual quantification, two methods were used to annotate mouse scratching behavior in these videos. The first method (manual annotation) was to watch these videos at normal (1×) speed (30 frames/s) and label the start and end time points (converted into frame numbers for subsequent analysis) of each scratching train. This is consistent with the field common practice for manually quantifying mouse scratching behavior. Our manual annotation results were produced by 10 human observers, thus reflecting an averaged precision of the manual quantification process. The second method (reference annotation, or ground-truth annotation) was to accurately determine the start and end of each scratching train by analyzing each video frame-by-frame. The reference annotation of the 40 videos were used as the training and test datasets. The total number of scratching train in each video and the distribution of scratching train durations were quantified (Figure 1E, F and Supplementary file 2).

Deep learning neural network design and model training

Mouse scratching behavior displayed unique dynamic (temporal) and static (spatial) features, which were highlighted by tracking the key body parts using DeepLabCut (Mathis et al., 2018; Figure 1—figure supplement 1; Video 2). One of the most obvious dynamic features was the rhythmic movement of the scratching hind paw (Figure 1—figure supplement 1A, B). Some unique static features included the relative positional relationships between the scratching hind paw and other body parts (Figure 1—figure supplement 1C–F). To fully capture these dynamic and static features, we designed a CRNN to take advantage of the different strengths of CNN and RNN (Figure 2A and Figure 2—figure supplement 1A). The CRNN contained a CNN (ResNet-18 He et al., 2016; Figure 2—figure supplement 1B) that extracts static features, such as the relative position of different body parts, an RNN (two-layer bidirectional gated recurrent unit [GRU]; Dey and Salem, 2017; Figure 2—figure supplement 1C) that extracts dynamic features, such as the rhythmic movement of the scratching hind paw in consecutive frames, and a fully connected layer (the classifier) to combine the features extracted by both CNN and RNN and generate the prediction output.

Video 2
Key body parts tracked by DeepLabCut showing the dynamic and static features of mouse scratching behavior.
Figure 2 with 2 supplements see all
Deep learning neural network design and training.

(A) Cartoon showing the architecture of designed deep learning neural network consisting of the combination of convolutional neural networks (CNN), recurrent neural networks (RNN), and classifier. (B) Cartoon showing the preparation of inputs for the training dataset. Consecutive N frames were selected as one input for training. The interval between two adjacent inputs in a video was 4–10 frames. (C) The information of a sample training and test datasets. The training loss decreased (D) while the accuracy increased (E) during the training process with different input length (N = 3, 5, 7, 13, 23, 45 frames). The inset is the zoom-in of part of the figure.

The 40 videos were randomly split into two parts, 80% of them (32 videos from 8 mice) were assigned to the training dataset and 20% of them (8 videos from 2 mice) to the test dataset (Figure 2C). Each video was converted into individual frames, and each frame was classified as ‘scratching’ (within a scratching train) or ‘non-scratching’ (out of a scratching train) based on the reference annotation. For the training dataset preparation, N consecutive frames (a parameter adjusted for optimal model performance) were selected as one input to capture the dynamic features of scratching (Figure 2B). To avoid large sets of redundancy in the training dataset, two adjacent inputs were apart between 4 and 10 frames (Figure 2B). An input was labeled as ‘scratching’ (class 1) if more than half of frames (N/2) in the input were scratching frames; otherwise labeled as ‘non-scratching’ (class 0) (Figure 2B).

Since the dynamic features of scratching behavior spanned multiple frames, the input length (N frames) would be a critical training parameter. In CQ triggered acute nape itch model, the average duration of one cycle of scratching bout and paw licking was around 30 frames (Figure 2—figure supplement 2A, B). Thus, we tested a range of input length from 3 to 45 frames for model training. During the training process, the loss (discrepancy between model prediction and reference annotation) decreased, and the prediction accuracy (correct prediction of both scratching and non-scratching frames/all frames) increased (Figure 2D, E). After 10 epochs (one epoch means the training covers the complete training dataset for one round), the accuracy reached a plateau (Figure 2E). The prediction accuracies were more than 0.98 for all input length, improving slightly with the increase of the input length (Figure 2E). These results demonstrate that the designed CRNN network works very efficiently to capture the scratching features and recognize scratching behavior in the training dataset.

Model evaluation on test datasets

We evaluated the performance of the trained prediction models on eight unseen test videos. First, similar to what described above, each test video was converted into inputs with ‘N’ frames (the same ‘N’ was used for training and test), except that the two adjacent inputs were only 1 frame apart. Second, the trained neural network predicted each input to be ‘scratching’ or ‘non-scratching’. Third, to convert the prediction from one input (containing N frames) into the prediction for each individual frame, we used the following rule: the prediction of the middle frame of each input would be the same as that of the input. For example, if an input was predicted as ‘scratching’, then the middle frame of this input would be a ‘scratching’ frame. This conversion predicted each frame of tested videos as ‘scratching’ or ‘non-scratching’ expect for the few frames at the beginning or at the end of a video (see method for the missing data interpretation). Fourth, recall (the number of correctly predicted scratching frames/the number of reference scratching frames), precision (the number of correctly predicted scratching frames/the number of all predicted scratching frames), and F1 score (2*recall*precision/(recall + precision)) were calculated. Compared to the overall accuracy (correct prediction of both scratching and non-scratching frames/all frames), the recall, precision, and F1 score give a more precise and in depth evaluation of a model’s performance (Powers, 2020), especially when scratching is a relative rare event in a video.

To rule out the possibility that the good performance of our models was due to a specific combination of the training and test datasets, we rotated the training and test videos for cross-validation. For all different combination of training and test videos, F1 scores were all above 0.9 (Figure 3—figure supplement 1A, B), supporting the stable and high performance of our prediction models. In addition, the prediction model performed better with the increased input length (Figure 3—figure supplement 1C). The best model, the one trained with videos 1–32 with the input length of 45 (N = 45), was selected for additional analyses and tests.

The average recall and precision of the best model were 97.6% and 96.9%, respectively, for the eight test videos (Figure 3A), and the recall and precision for individual videos were above 95% in most cases (Figure 3B). This performance was similar to or even slightly better than that of manual annotation, which had an average recall and precision of 95.1% and 94.2%, respectively (Figure 3C, D). When comparing the total scratching time to the reference annotation, the average discrepancy of the model prediction was 1.9% whereas that of manual annotation was 2.1% (Figure 3E). The correlation between the model prediction and the reference annotation was 0.98, similar to the manual annotation results (Figure 3F). When examining the probability traces of model prediction and the reference annotation (one example shown in Figure 3G), we found that the model successfully recognized almost all the scratching trains in test videos, and that the prediction of the start and end of each scratching train aligned well with the reference annotation (Figure 3H). Taken together, these results demonstrate the high reliability and accuracy of our model to recognize and quantify mouse scratching behavior of new videos.

Figure 3 with 5 supplements see all
Performance of the best model on test videos.

The recall, precision, and F1 score of the best model on average (A) or in individual videos (B). The recall, precision, and F1 score of manual annotation on average (C) or in individual videos (D). (E) The comparison among model prediction, manual quantification, and the reference annotation. The reference annotation is normalized to 100% shown as the red line. (F) The correlations between model prediction or manual quantification and reference annotation. R2, Pearson correlation coefficient. (G) An example scratching probability trace (red curve) predicted by the model and aligned with the reference annotation (green bar). (H) The two zoom-ins from (G) showing the nice alignment between the model prediction and the reference annotation.

The trained neural network model focused on the scratching hind paw to recognize mouse scratching behavior

How did the trained neural network model recognize mouse scratching behavior and distinguish them from other behaviors? Although deep learning neural networks are processed as a black box, saliency maps can give some hints (Selvaraju et al., 2017), because they plot which parts of each frame (pixels) were mainly used during model prediction. The most salient parts were centered around the scratching hind paw in the scratching frames (Figure 4A, B and Figure 4—figure supplement 1A), suggesting that the prediction model focused on the features of the scratching hind paw. In some scratching frames, other body parts were also highlighted, such as the two front paws (Figure 4B), suggesting the model also utilized the positional relationship of these body parts to recognize scratching behavior. In contrast, for other ‘non-scratching’ mouse behaviors, such as wiping, grooming, rearing, and locomotion, the ‘salient’ parts showed no clear association with particular mouse body parts (Figure 4C–F and Figure 4—figure supplement 1B–E). Together, these saliency maps indicate that the trained neural network learns to focus on the dynamic and static features of scratching for the prediction of mouse scratching behavior.

Figure 4 with 1 supplement see all
The prediction model focused on the scratching hind paw for scratching behavior recognition.

(A, B) Saliency map showing the gradient value of each pixel of scratching frames during mouse scratching behavior prediction by the best model. The model focused on the scratching hind paw (A, B) and other body parts, such as front paws (B).Scale bar, 2 cm. Saliency map showing the gradient value of each pixel of wiping (C), grooming (D), rearing (E), and locomotion (F) frames during mouse scratching behavior prediction by the model.

Model prediction error analysis

To further understand the performance of the best trained neural network model, we systematically analyzed its prediction errors in eight test videos (Figure 3—figure supplement 2) and compared to those from the manual quantification (Figure 3—figure supplement 3) and other trained models. We classified the prediction errors into five categories: Type 1, false positive (non-scratching region was predicted as a scratching train); Type 2, false negative (a real scratching train was not recognized); Type 3, blurred boundary (the prediction of the start or end of a scratching train was shifted); Type 4, missed interval (two or more adjacent scratching trains were predicted as one scratching train); and Type 5 split scratching train (one scratching train was predicted as two or more scratching trains) (Figure 3—figure supplement 2A). We found that the dominant prediction error of the trained neural network model was Type 3 error, accounting for around 3% of the total scratching frames, followed by Type 2 and 5 errors accounting for around 1% (Figure 3—figure supplement 2B). For manual quantification, the major errors came from Types 3 and 4, accounting for 10% and 8% of the total scratching frames, respectively (Figure 3—figure supplement 3A).

For Type 1 error of the model prediction (Figure 3—figure supplement 2C1–C3), the durations of all false positive scratching trains were shorter than 10 frames (0.3 s) (Figure 3—figure supplement 2C2) and temporarily close to a real scratching train (within 30 frames, <1 s) (Figure 3—figure supplement 2C3). They were not caused by confusion with other behaviors, such as wiping, grooming, rearing, locomotion, and resting (Figure 3—figure supplement 4). Type 1 error was also rare for manual annotation (Figure 3—figure supplement 3B).

The models might miss short scratching trains, hence caused the Type 2 error. Indeed, all missed scratching trains were shorter than 40 frames (<1.3 s) (Figure 3—figure supplement 2D1, D2). For all scratching trains lasting less than 30 frames (<1 s), 18.5% of them were missed by the model prediction. This number decreased to 2.7% for scratching trains spanning between 30 and 60 frames (1–2 s). No scratching train was missed if they were longer than 60 frames (>2 s) (Figure 3—figure supplement 2D3). The Type 2 error positively correlated with the input length (N) of prediction models. It became zero or close to zero when models trained with shorter input lengths (3, 13, and 23 frames) (Figure 3—figure supplement 5A–C). For manual annotation, Type 2 error was not common (Figure 3—figure supplement 3C1, C2).

Type 3 error (Figure 3—figure supplement 2E1–E3) was dominant among all five type errors. The average start and end frame shift of the model prediction were 2.2 and 7.0 frames (Figure 3—figure supplement 2E3), while those of the manual annotation were 11.5 and 12.8 frames (Figure 3—figure supplement 3D1, D2). The start and end frame shift of the manual quantification was similar (~350 ms), which likely reflected the temporal delay of the real time human visual system processing. For the model prediction, which was an off-line frame by frame process, the temporal shifts were less than the human visual processing. In addition, the start of a scratching train was more accurately recognized by the model than the end of a scratching train (Figure 3—figure supplement 2E3). This might reflect the feature of scratching trains. It was relatively clear to determine the start of a scratching train when a mouse lifted its hind paw, but more ambiguous to determine when a mouse put its hind paw back onto the floor to complete a scratching train. The start and end shift did not correlate with the length of a scratching train (Figure 3—figure supplement 5D). Thus, the relative error (percentage of error frames) would decrease when the duration of a scratching train increases. Indeed, the prediction accuracy (as indicated by the F1 score) positively correlated (R2 = 0.5723) with the average scratching train duration in a video (Figure 3—figure supplement 5E).

Type 4 error was caused when two adjacent scratching trains were too close to each other and were predicted as one scratching train (Figure 3—figure supplement 2F1–F3). All missed intervals were shorter than 30 frames (<1 s) (Figure 3—figure supplement 2F2). Conversely, 51.4% of intervals less than 30 frames between the two adjacent scratching trains were not recognized. All intervals longer than 30 frames were recognized (Figure 3—figure supplement 2F3). Type 4 error was more common in manual annotation than in the model prediction (Figure 3—figure supplement 3E1, E2).

Type 5 error occurred when one scratching train was predicted as two or more scratching trains, separated by mispredicted intervals. The average lengths of these mispredicted intervals were around 10 frames by model prediction and around 40 frames by manual annotation (Figure 3—figure supplement 2G1, G2 and Figure 3—figure supplement 3F). When reviewing these intervals, we found that more than 80% of them were within or partially overlapped with a paw licking phase (Figure 3—figure supplement 2G3), especially when the duration of the paw licking was more than 30 frames (Figure 3—figure supplement 5F). Thus, it seems likely that the model predicted some long licking frames within a scratching train as ‘non-scratching’. Type 4 and 5 errors reflect the intrinsic complexity of scratching behavior. Human definitions and field consensus, such as using ‘2s’ as the threshold between two adjacent scratching trains (Darmani and Pandya, 2000), would help to reduce these types of errors.

In summary, we have established a novel system combining a customized videotaping box and a well-trained CRNN neural network to automatically identify and quantify mouse scratching behavior with high accuracy. We named it as the Scratch-AID system.

Performance of the Scratch-AID system on other major acute itch models

In addition to the nape, the other commonly used body location to induce mouse itch sensation is the cheek (Shimada and LaMotte, 2008). To test whether the Scratch-AID system trained by the nape CQ model could also recognize and quantify scratching behavior of the cheek model, we injected CQ (200 μg in 15 μl saline) into the cheek of five wildtype mice (three males and two females) and recorded seven videos (Figure 5A). We compared the scratching behavior quantification by the Scratch-AID and manual annotation. The Scratch-AID prediction had 93.4% recall, 94.8% precision, and 0.941 F1 score (Figure 5B), while those of manual quantification were 96.0%, 88.6%, and 0.919 (Figure 5C). The correlation between the Scratch-AID prediction and reference annotation was 0.9926, and that of the manual quantification was 0.9876 (Figure 5D, E). The total scratching time in individual videos from both model prediction and manual annotation were close to the reference annotation (Figure 5F). These results demonstrate that the Scratch-AID system can reliably identify and quantify scratching behavior triggered by acute itch sensation in the cheek.

The Scratch-AID (Automatic Itch Detection) performance on other acute itch models.

(A) A cartoon showing an acute itch model induced by chloroquine (CQ) injection in the mouse cheek. The average recall, precision, and F1 score of Scratch-AID (B) or manual annotation (C). Error bar, standard error of the mean (SEM). The correlation between model prediction (D) or manual quantification (E) and reference annotation. R2, Pearson correlation coefficient. (F) The comparison among model prediction, manual quantification, and reference annotation. The reference annotation is normalized to 100% shown as the red line. (G) A cartoon showing an acute itch model induced by histamine injection in the mouse nape. The average recall, precision and F1 score of Scratch-AID (H) or manual annotation (I). Error bar, SEM. The correlation between model prediction (J) or manual quantification (K) and reference annotation. R2, Pearson correlation coefficient. (L) The comparison among model prediction, manual quantification, and reference annotation. The reference annotation is normalized to 100% shown as the red line.

Different pruritogens administrated at the same body location trigger scratching behavior with different dynamic features (Wimalasena et al., 2021). Thus, we tested whether the Scratch-AID system trained by CQ injection could recognize scratching behavior triggered by a different pruritogen, histamine. 100 μg histamine (in 15 μl saline) was intradermally injected into the nape and four videos were recorded (Figure 5G). The recall, precision, and F1 score from the Scratch-AID prediction were 96.6%, 90.91%, and 0.936 (Figure 5H), while those of the manual annotation were 96.3%, 80.5%, and 0.877 (Figure 5I). The correlation between the Scratch-AID prediction and reference annotation is 0.9707, and that of the manual quantification is 0.9895 (Figure 5J, K). The total scratching time in individual videos from both model prediction and manual annotation was similar to those of reference annotation (Figure 5L). Taken together, these data show that our Scratch-AID system, although only trained with the CQ nape acute itch model, can recognize and quantify scratching behavior of acute itch models induced at different skin locations or triggered by different pruritogens, and that the prediction accuracy of Scratch-AID is comparable to that of the manual annotation.

Performance of the Scratch-AID system on a chronic itch model

Chronic itch is a devastating symptom and severely affects the quality of patients’ life (Yosipovitch and Fleischer, 2003; Yu et al., 2021). Investigating the underlying mechanisms using the mouse model is important for developing novel therapies for treating chronic itch in various conditions. To test whether the Scratch-AID system could be used to study mouse chronic itch models, we generated a squaric acid dibutylester (SADBE) induced contact dermatitis chronic itch model (Beattie et al., 2022; Qu et al., 2015) and recorded nine videos from three wildtype mice (Figure 6A). Affected mice displayed spontaneous scratching toward the nape and/or head. Noticeably, dynamic features of spontaneous scratching behavior under this chronic itch condition were different from those exhibited by the CQ acute itch model: the total scratching time was less for the same given period of time (20 min) (Figure 6—figure supplement 1A), and the average duration of the scratching trains was shorter (53 frames on average) than the acute scratching behavior induced by CQ (280 frames on average) (Figure 6—figure supplement 1B and Figure 1F). Despite these considerable differences, the recall, precision, and F1 score of the Scratch-AID prediction were 84.1%, 88.5%, and 0.862, respectively, compared to 84.1%, 66.3%, and 0.740 of manual annotation (Figure 6B, C). A likely reason for the decreased recall and precision of both model prediction and manual annotation was the dominant short scratching trains in this chronic itch model (Figure 6—figure supplement 1B). The correlation between the model prediction and reference annotation was 0.9845 (Figure 6D), which was comparable to the manual annotation 0.9887 (Figure 6E). The total scratching time quantified by the model was slightly more accurate than the manual annotation (Figure 6F). From the prediction traces (an example shown in Figure 6G), the Scratch-AID system was capable to capture the low-frequency short scratching trains.

Figure 6 with 1 supplement see all
The Scratch-AID performance on a chronic itch model.

(A) A cartoon showing a squaric acid dibutylester (SADBE) induced chronic itch model. The average recall, precision, and F1 score of Scratch-AID (B) or manual annotation (C). Error bar, standard error of the mean (SEM). The correlation between model prediction (D) or manual quantification (E) and reference annotation. R2, Pearson correlation coefficient. (F) The comparison among model prediction, manual quantification, and reference annotation. The reference annotation is normalized to 100% shown as the red line. (G) An example scratching probability trace (red curve) predicted by the model and aligned with the reference annotation (green bar) (left). Zoom-in (right panel) of the blue square part showing nice alignment of the model prediction with the reference annotation.

Application of the Scratch-AID system in anti-itch drug screening

Finally, we tested whether the Scratch-AID system could be applied for anti-itch drug screening. Here, we used the histaminergic acute itch nape model and the pretreatment of Benadryl (Loew et al., 1946), an FDA approved anti-histaminergic itch cream, as a proof-of-principle example. Benadryl or control cream was topically applied onto the mouse nape skin 1 hr before the intradermal injection of histamine (200 μg in 15 μl saline). Scratching behavior was then recorded for 20 min from six C57 wildtype mice (two males, four females) in the Benardryl treated group and seven C57 wildtype mice (two males, five females) in the control group (Figure 7A). Quantification of the total scratching time (frames) using the Scratch-AID showed a significant reduction with Benardryl treatment (Figure 7B). Similar results were found by manual annotation (Figure 7C). These results suggest that the Scratch-AID system is sensitive to detect the change of scratching behavior after an anti-itch drug treatment, which highlights the potential use of the Scratch-AID system in high-throughput and large-scale anti-itch drug screenings.

Application of the Scratch-AID (Automatic Itch Detection) system in a drug screening paradigm.

(A) A diagram showing the experimental design of an anti-itch drug test. Quantification of scratching behavior in anti-itch cream treated group or control group by Scratch-AID (B) or manual annotation (C). Error bar, standard error of the mean (SEM). Differences between the two groups were analyzed using unpaired two-tailed Student’s t-test, ** p < 0.01.

Discussion

Scratching is an itch-specific behavior, and the mouse is the major model animal to study itch mechanisms and to develop novel anti-itch drugs. In this study, we developed a new system, Scratch-AID, which combined a customized videotaping box and a well-trained neural network, for automatic quantification of mouse scratching behavior with high accuracy. Its performance is comparable to the manual annotation on major itch models and an anti-itch drug screening paradigm. As far as we are aware, this is the first deep learning-based system that can achieve such high accuracy and broad generalization.

It is remarkable that a model trained with CQ induced acute nape itch videos could reliably recognize scratching behavior in other itch models, even when the body sites (nape vs. cheek) or the dynamics of scratching behavior (e.g., acute vs. chronic itch, different pruritogens) were different. The impressive performance and generalization of the Scratch-AID system are likely attributed to the high-quality and reproducible video recording; a large amount of high-quality training datasets with the frame-by-frame annotation; efficient CRNN deep learning neural network design; and training parameter optimizations.

Variable video recording condition is a major barrier that has prevented the adoption of a trained neural network by different laboratories. The trained models usually do not perform well with videos recorded under different conditions (illumination, field size, image resolution, magnification of mice, image angle, backgrounds, etc.). Thus, we built a customized videotaping box to provide a reproducible and high-quality recording environment (Figure 1B). This helps to standardize the videotaping process, reduce the noise, and facilitate the stable performance and generalization of the trained deep learning models. Meanwhile, this video recording box is easy to be set up, scaled up, and adopted. Combined with the well-trained prediction model that has a comparable performance as manual quantification, the Scratch-AID system is ready to replace the manual quantification.

The amount and quality of training datasets are important to train a high accurate and generalized model. In general, the performance of a prediction model positively corelates with the size of the training dataset. In this study, we recorded 40 scratching videos in total and chose the CQ induced acute nape itch model, which triggered robust scratching behavior in mice, for model training and testing. The large number of scratching videos and the high amount of incidence of targeted behavior in each video provided a high-quality training dataset. In addition, the clear definition of scratching train and accurate annotation of videos, as conducted frame by frame, were also critical for training, testing, and error analysis.

CRNN is the classical deep learning neural network for analysis of animal behaviors. Here, we used ResNet (He et al., 2016) for the CNN part, which simplifies the neural network model by constructing a residual learning block through a shortcut connection of identity mapping (Figure 2—figure supplement 1B). GRU was used for the RNN part for extracting dynamic features (Figure 2—figure supplement 1C). Our study shows that ResNet and GRU make a good combination of deep learning architecture for analyzing animal behaviors. In addition, our network design is highly efficient that the accuracy plateau was achieved after only 10 epochs of training.

There is still some room to improve the performance of our predication models. To increase the capability of Scratch-AID to capture short scratching trains, we could train the CRNN neural network with scratching videos from chronic itch models. In addition, optimization of training parameters could also help with the improvement of prediction accuracy. For example, different error types varied when changing the input length (Figure 3—figure supplement 2B and Figure 3—figure supplement 5A–C). The Type 2 and 4 errors increased when the input length increased, while Type 1, 3, and 5 errors decreased when the input length increased. Thus, it is a trade-off to optimize and select the best input length for different itch models. Increasing the size of the training dataset may also help to get better prediction models. When checking the videos with relatively low prediction accuracy, for example, the video number 5 (V5) in the chronic itch model (Figure 6F), we found that the missed scratching frames could be due to a rare posture during scratching behavior, in which the scratching hind paw was partially occluded by the tail (Video 3). Thus, with more training videos containing some rare scratching postures, the trained neural network models could be more ‘knowledgeable’ for the diversity of scratching behavior.

Video 3
An example of a rare posture during the scratching behavior with scratching hind paw partially occluded by the tail.

Quantification of animal behaviors is critical for understanding the underlying molecular, cellular, and circuit mechanisms. Compared to manual analysis, the deep learning-based automatic analysis will not only improve efficiency and accuracy, but also reduce human bias and errors. Along this direction, we developed the Scratch-AID system to achieve automatic quantification of mouse scratching behavior. Our study also provides useful insights for developing new deep learning neural network models to achieve automatic analysis of other animal behaviors.

Materials and methods

Mouse lines and treatments

Request a detailed protocol

Mice (C57BL/6J purchased from The Jackson Laboratory, RRID: IMSR_JAX:000664) were housed in the John Morgan animal facility at the University of Pennsylvania. All animal treatments were conducted in accordance with protocols approved by the Institutional Animal Care and Use Committee and the guidelines of the National Institutes of Health (Protocol No. 804886).

Acute itch model

Request a detailed protocol

Acute itch models were conducted as previously described (Cui et al., 2022). Mouse cheeks or napes were shaved 3 days prior to experiments, and mice were placed in the videotaping box for acclimation (15 min/day for 3 days). At the day of experiment, mice were first acclimated to the videotaping box for 5  min, CQ (Sigma, C6628) (200 µg in 15 μl saline) or histamine (Sigma, H7250) (100 µg in 15 μl saline) was then intradermally injected into either the cheek or the nape, and scratching behavior was recorded for 20 min.

Behavior recording

Request a detailed protocol

The mouse scratching behavior was recorded using a Microsoft laptop (Windows 10 Pro, purchased from Amazon) connecting to a web camera (Logitech C920e Business Webcam, purchased from Amazon). Logitech Capture 2.06.12 Software was used to adjust the following recording parameters: brightness 170, contrast 0, resolution 720 × 720, frame rate 30 fps. The brightness and contrast could be adjusted according to ambient light to achieve consistent illumination.

Annotation of mouse scratching behavior

Request a detailed protocol

The start of a scratching train was defined as when the mouse started to lift up its hind paw and prepared to scratch at the beginning of a scratching train. For the end of a scratching train, there were two cases. If the mouse did not lick its hind paw after last scratching bout, the end frame would be when the mouse put its hind paw back onto the ground; if the mouse licked its hind paw after last scratching bout, the end frame would be when the mouse put its hind paw into the mouth. For the manual annotation, each video was manually watched at the normal (1×) speed (30 frames/s). The start and end time point of each scratching train was manually annotated, the time point was converted into the frame number (30 frames/s) for downstream analysis. For the reference annotation, each video was first converted into individual frames using python package OpenCV (Bradski, 2000). Then the start and end of each scratching train was determined frame by frame. Frames within a scratching train were defined as ‘scratching’ frames, otherwise defined as ‘non-scratching’ frames. For CQ cheek acute itch video number 2 (Figure 5F V2) and SADBE chronic itch video number 5 (Figure 6F V5), long lickings (>60 frames) during scratching trains were annotated as non-scratching frames for the model evaluation. For group comparison, the annotation was performed in a double-blind manner.

Training sample preparation and preprocessing

Request a detailed protocol

In the training procedure, 40 videos from 10 mice were first randomly split into 32 training videos from 8 mice (V1–V32), and 8 test videos from 2 mice (V33–V40). For cross-validation, training and test videos were rotated. Test videos were V1–V8, V9–V16, V17–V24, V25–V32, or V33–V40, and the rest of videos were used as training videos. For the model training, input was prepared by the following procedure: First, each video was converted into frames; then, consecutive N frames (N = 3, 5, 7, 13, 23, or 45) were selected as one input with an interval of 4–10 frames between two adjacent inputs. Then each input was annotated as ‘scratching’ (class 1) if more than N/2 frames were scratching frames, otherwise labeled as ‘non-scratching’ (class 0).

All frames in each input were first converted into gray scale images and resized to 300 × 300. Then a random square crop with size ranging from 288 × 288 to 300 × 300 was applied and followed by a random horizontal and vertical flip with probability of 0.5. Finally, these frames were resized to 256 × 256 and fed into the CRNN network.

CRNN architecture

Request a detailed protocol

Our CRNN model (Figure 2A and Figure 2—figure supplement 1) consisted of a CNN part, an RNN part, and a full connection (FC) part. The CNN was modified from ResNet-18 (He et al., 2016 ) by changing the final FC layer into an FC layer with embedding size 256. The RNN consisted of two bidirectional GRU layers with hidden vector size 512. The FC part consisted of two FC layers with embedding size 256 and ReLu activation, and embedding size 2 for the final prediction. The prediction results were transformed into maximum value in the final output vector.

Model training and predication

Request a detailed protocol

The model was trained by PyTorch (version 1.10.2, RRID:SCR_018536) (Paszke et al., 2019). For model training hyperparameters, batch size was set as 16 or 32 depending on the input size. The max epoch was set as 20. ADAM optimizer was used with initial learning rate 10−4 and the learning rate reduced by multiplying factor 0.3 every 5 epochs. The binary cross entropy was used as loss function. Dropout rate of FC layer was set as 0.2. The model was trained on a customized desktop with Intel i9-10900k CPU (purchased from Newegg), 64 GB RAM (CORSAIR Vengeance LPX 64 GB, purchased from Newegg), and NVIDIA GeForce RTX 3090 with 24 GB memory (purchased from Amazon).

For model prediction on new videos, the input preparation was similar to the training dataset, except that the adjacent inputs were only 1 frame apart. Each individual frame was predicted as ‘scratching’ or ‘non-scratching’ based on the following rule: the prediction of the middle frame of each input would be the same as the input prediction. For the few frames at the beginning or at the end of each video (depending on the input length N) that could not be the middle frame of an input, they were predicted as ‘scratching’ or ‘non-scratching’ based on the first input or last input prediction.

Saliency map

Request a detailed protocol

To obtain the saliency map, we first calculated gradient value of each pixel and kept the only positive gradients, then rescaled into range 0–1 based on a previous published paper (Selvaraju et al., 2017). Then we generated the heatmap based on the gradient values (<0.1 transparent, 0.1–0.6 light blue to dark blue, >0.6 dark blue) and stacked the heatmap onto the original frame.

SADBE chronic itch model

Request a detailed protocol

Mice (8–12 weeks) were singly housed before starting the behavior experiment. In day 1, mice were individually anesthetized in chamber using isoflurane until they did not move and showed decreased respiration rates. Mice were continued for anesthesia with a nose cone to allow access to the body of the mouse. After ensuring a mouse was fully anesthetized, the abdominal skin was shaved. In day 4, after anesthesia similar in day 1, 25 µl 1% SADBE in acetone (Sigma, 3339792) was applied to the shaved area of the abdominal skin. After application of SADBE, anesthesia continued for 3 more minutes to ensure SADBE fully absorbed before putting then back in the home cage. In days 9–12, mice were habituated in the videotaping box for 1 hr and 15 min. In day 11, the nape was shaved after habituation. In days 13–18, mice were habituated in the videotaping box for 1 hr and 15 min, then 25 µl of 1% SADBE was applied to the nape. In day 18, after 1% SADBE application, each mouse was put back into the videotaping box and recorded for three videos.

Anti-itch drug experiment

Request a detailed protocol

The nape of mice (8–12 weeks, two males, five females) was shaved 3 days prior to experiments. Mice were placed in the videotaping box for acclimation (15 min per day for 3 days). At the day of experiment, mice were randomly selected and acclimated in the videotaping box for 5 min. Then, 0.2 g anti-itch cream (Benadryl, Johnson & Johnson Consumer Inc, purchased from CVS) or control cream (Neutrogena, Johnson Consumer Inc, purchased from CVS) was applied into the nape. One hour later, histamine (200 µg in 15 µl saline) was intradermally injected into the nape, and scratching behavior was recorded for 20 min. The experimenter who did the histamine injection was blind to anti-itch cream or control cream.

Illustration drawing

Request a detailed protocol

Cartoons with mice were made partially in BioRender (BioRender, 2022, RRID:SCR_018361).

Statistical analysis

Request a detailed protocol

Data and statistical analyses were performed using Prism 6.0 (GraphPad Software, RRID:SCR_002798). The criteria for significance were: ns (not significant) p ≥ 0.05, *p < 0.05, **p < 0.01, ***p < 0.001. Differences in means between two groups were analyzed using unpaired two-tailed Student’s t-test.

Data availability

The training and test videos generated during the current study can be downloaded from DRYAD (https://doi.org/10.5061/dryad.mw6m9060s). The codes for model training and test can be downloaded from GitHub (https://github.com/taimeimiaole/Scratch-AID, copy archived at swh:1:rev:d8a1e6b94e54be2c857285d74623e495a6bd47bf).

The following data sets were generated
    1. Luo W
    (2022) Dryad Digital Repository
    Data From: Scratch-AID: a deep-learning based system for automatic detection of mouse scratching behavior with high accuracy.
    https://doi.org/10.5061/dryad.mw6m9060s

References

  1. Conference
    1. Abadi M
    2. Barham P
    3. Chen J
    4. Chen Z
    5. Davis A
    6. Dean J
    7. Devin M
    8. Ghemawat S
    9. Irving G
    10. Isard M
    (2016)
    TensorFlow}: a system for {Large-Scale} machine learning
    12th USENIX symposium on operating systems design and implementation (OSDI).
  2. Software
    1. BioRender
    (2022)
    BioRender
    BioRender.Com.
    1. Bradski G
    (2000)
    The opencv library
    Dr Dobb’s Journal: Software Tools for the Professional Programmer 25:120–123.
  3. Conference
    1. Dey R
    2. Salem FM
    (2017) Gate-variants of Gated Recurrent Unit (GRU) neural networks
    2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS.
    https://doi.org/10.1109/MWSCAS.2017.8053243
  4. Book
    1. Gulli A
    2. Pal S
    (2017)
    Deep Learning with Keras
    Packt Publishing Ltd.
  5. Conference
    1. He K
    2. Zhang X
    3. Ren S
    4. Sun J
    (2016) Deep residual learning for image recognition
    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR.
    https://doi.org/10.1109/CVPR.2016.90
  6. Conference
    1. Liu Q
    2. Dong X
    (2015)
    The role of the Mrgpr receptor family in itch
    Pharmacology of itch. pp. 71–88.
    1. Loew ER
    2. MacMILLAN R
    3. Kaiser ME
    (1946)
    The anti-histamine properties of benadryl, beta-di-methylaminoethyl benzhydryl ether hydrochloride
    The Journal of Pharmacology and Experimental Therapeutics 86:229–238.

Decision letter

  1. Brian S Kim
    Reviewing Editor; Icahn School of Medicine at Mount Sinai, United States
  2. Catherine Dulac
    Senior Editor; Harvard University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article " Scratch-AID: A Deep-learning Based System for Automatic Detection of Mouse Scratching Behavior with High Accuracy" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Catherine Dulca as the Senior Editor. The reviewers have opted to remain anonymous.

Reviewer #1 (Recommendations for the authors):

1. Can Scratch-AID be modified to quantify scratching bouts?

2. Saliency maps analysis suggest that the trained neural network is focusing on the movement of the hindleg to identify scratching. Although mice will spend the majority of their time scratching the injected or treated skin location, they do scratch uninjected/untreated area as well in both acute and chronic itch models. Different laboratories have different strategies to determine whether scratching to the untreated locations is included for quantification. Is Scratch-AID able to differentiate scratching directed towards treated vs untreated locations?

Reviewer #2 (Recommendations for the authors):

Mouse behavior assay is the primary readout for almost all itch research. The itch sensation triggers a stereotypical scratch-lick response by the ipsilateral hind paw, and several key parameters including the number of scratch bouts and total scratching duration can be precisely quantified and compared across genotypes and treatment groups. So far, the vast majority of itch behavior videos are manually scored, requiring hundreds of hours of labor and are prone to experimenter errors. In this paper, the authors introduce an automated system to replace the manual itch counting process. First, they designed a standardized videotaping box to produce consistent and high-quality videos of itch behaviors. They then annotated the videos frame by frame for scratch bouts, and trained a neural network to identify scratching behaviors with high precision. The system (named Scratch-AID by the authors) performed consistently with manual and reference scoring, and yielded key quantitative parameters (scratching bout counts and total scratching times). Saliency map revealed that the trained neural network focused on the scratching hind paw, which is consistent with what human observers look at. The authors further showed that Scratch-AID can be easily adapted to quantify scratching at slightly different sites (nape vs cheek), in response to different itch mediators (chloroquine vs histamine), in both acute and chronic itch models, and that it successfully revealed the effect of a generic anti-itch drug Benadryl. Overall, the model appears accurate and easily adaptable and can probably be widely used by the itch field.

There had been a few previous attempts to train neural networks to score itch behaviors (Kobayashi et al., 2021; Sakamoto et al., 2022). The main advance of this paper is the incorporation of a standardized recording system, and larger number of videos in the training dataset. The authors provided very detailed and thorough analyses of their own model, but did not compare its sensitivity and precision to the previous models. It is also unclear why the authors did not choose the high-speed video system that they previously published for detailed analyses of pain behaviors (Abdus-Saboor et al., 2019), despite evidence that multiple itch parameters can only be captured at high speed (Wimalasena, et al. 2021). The authors show that scratch-AID can accurately score multiple itch models (despite it being trained on chloroquine nape injection videos), but does not show whether it can distinguish itch behaviors induced by different mechanisms and mediators. For example, whether it can detect activation of different itch neuron populations (termed NP1, NP2 and NP3 by single cell transcriptomic analysis), or distinguish direct itch neuron activation by mediators such as chloroquine vs mast cell activators such as 48/80. Nonetheless, this study provides a great opportunity for the itch field to standardize behavior data acquisition and analysis, and will greatly enhance the throughput and quality of future itch behavior data.

The data are overall very convincing and the system an exciting advancement that will probably be widely used. I have some questions regarding data analysis and potential concerns for future users of the technology.

1. Does the model distinguish ipsilateral vs contralateral scratches?

2. Can the authors present the data in more detail? For example, present and compare both scratch bout count and total scratching time between scratch-AID and manual scoring for each of the itch models.

3. This is probably a given but can the authors provide evidence that the model clearly distinguishes pain (by capsaicin injection) vs itch (chloroquine and histamine)?

4. If all parameters are analyzed in more detail (average duration of scratch bouts, temporal distribution of scratches across the imaging period), can the model distinguish the differences between NP1(by β-alanine), NP2(by chloroquine) and NP3(by 5-HT) activation?

5. Can the model distinguish direct neuronal activation (chloroquine) vs immune cell activation (48/80, anti-IgE)?

6. Since the training is done with black mice, can the model detect itch behavior by white mice?

7. Can the authors comment on how the new system compares to previous itch-counting apparatus (such as SCLABA-REAL) and previous neural network models?

8. What's the max number of animals that can be recorded and analyzed in a single box?

9. Since video quality is vital for correct predictions by neural networks, can the authors comment on the influence of potential confounding factors (urine and feces, wear and tear of the acrylic imaging surface).

10. This is potentially a great opportunity for the itch field to standardize the itch behavior assay. Can the authors comment briefly in the discussion on other aspects of standardization such as time of the day for experiments (circadian rhythm), injection volumes, habituation, and sexual dimorphism?

11. Can the authors discuss how the rest of the field can apply the system in their own research? Will it be released as an open source software? And will the custom video box be commercially available?

https://doi.org/10.7554/eLife.84042.sa1

Author response

Reviewer #1 (Recommendations for the authors):

1. Can Scratch-AID be modified to quantify scratching bouts?

We thank the reviewer for asking the important question. Our current model cannot directly quantify scratching bouts. Nevertheless, since the high correlation between the total number of scratching bouts and total scratching time, the total scratching bouts can be estimated based on the total scratching time and average duration of each scratching bout in a specific itch model. In addition, with more detailed manual annotation of scratching bouts in each scratching train, it is possible to train a neural network to recognize and quantify scratching bouts in the future.

2. Saliency maps analysis suggest that the trained neural network is focusing on the movement of the hindleg to identify scratching. Although mice will spend the majority of their time scratching the injected or treated skin location, they do scratch uninjected/untreated area as well in both acute and chronic itch models. Different laboratories have different strategies to determine whether scratching to the untreated locations is included for quantification. Is Scratch-AID able to differentiate scratching directed towards treated vs untreated locations?

We appreciate the reviewer’s interesting comments and questions. Our current model doesn’t distinguish the scratching location. We viewed it as a strength that Scratch-AID can be generalized to recognize scratching in different body locations and itch models. This could become a problem if the number of off-target scratching makes a significant portion of the total scratching and obscures the real phenotype. In this case, the researchers may want to double check the videos and distinguish the scratching location, such as ipsilateral vs. contralateral, or cheek vs. nape. In principle, with more training videos containing different scratching locations and more detailed manual annotations, a neural network can be trained to only recognize scratching towards a particular body location.

Reviewer #2 (Recommendations for the authors):

[…]

The data are overall very convincing and the system an exciting advancement that will probably be widely used. I have some questions regarding data analysis and potential concerns for future users of the technology.

1. Does the model distinguish ipsilateral vs contralateral scratches?

Our current models cannot distinguish ipsilateral vs contralateral scratches. In principle, with the manual annotation of left or right-side scratching, a neural network can be trained to distinguish left vs right scratching. This could be an interesting future direction.

2. Can the authors present the data in more detail? For example, present and compare both scratch bout count and total scratching time between scratch-AID and manual scoring for each of the itch models.

We systematically compared the total scratching time between scratch-AID and manual scoring for different itch models (Fig. 3, Fig. 5 and Fig. 6). Our current model cannot count the scratch bouts. As discussed above (answers to question #1 of reviewer #1), it will take additional manual annotation to train the neural network for this purpose.

3. This is probably a given but can the authors provide evidence that the model clearly distinguishes pain (by capsaicin injection) vs itch (chloroquine and histamine)?

Our model can distinguish itch-induced scratching behavior and pain-related wiping behavior. In all test videos, no wiping behavior was recognized as scratching behavior. We showed examples that the wiping behavior and other behaviors were not recognized as scratching behavior (false positive, type I error, Figure 3—figure supplement 4).

4. If all parameters are analyzed in more detail (average duration of scratch bouts, temporal distribution of scratches across the imaging period), can the model distinguish the differences between NP1(by β-alanine), NP2(by chloroquine) and NP3(by 5-HT) activation?

Our current model recognizes scratching trains. Thus, it can display the temporal distribution of scratching trains across the imaging period, quantify scratching train durations or average duration, but cannot analyze the average duration of scratching bouts (as the model can’t quantify the scratching bouts). If the temporal distribution of scratching trains or scratching train durations are different between NP1, NP2 and NP3 neuron mediated itch, our model may distinguish the differences.

5. Can the model distinguish direct neuronal activation (chloroquine) vs immune cell activation (48/80, anti-IgE)?

As discussed above, our current model can quantify the total scratching time, display the temporal distribution of scratching trains, and quantify scratching train durations. If these features show differences in scratching behavior induced by neuronal activation vs immune cell activation, for example the direct neuronal activation might trigger scratching behaviors in a shorter delay than the immune cell activation, then our model could be able to distinguish direct neuronal activation vs immune cell activation.

6. Since the training is done with black mice, can the model detect itch behavior by white mice?

We haven’t tested our model using white mice. Since the videotaping box is white color, the video quality for white mice might decrease due to the low contrast. Testing Scratch-AID with mice of different coat colors could be a future direction. If the contrast is the problem, one option is to change to box color for better contrast.

7. Can the authors comment on how the new system compares to previous itch-counting apparatus (such as SCLABA-REAL) and previous neural network models?

We appreciate the reviewer raising the question. However, it’s hard for us to do the side-by-side comparison between our model and SCLABA-REAL, because there are no published data about performance of SCLABA-REAL. According to the product information, SCLABA-REAL records the video from top view, and uses frame-to-frame subtraction to recognize the scratching behavior in a real time. Our Scratch-AID records the video from bottom view and uses raw frames to recognize the scratching behavior in an off-line manner.

Compared to previous neural network models, the main novelties of our study include: (1) designed of a videotaping box for stable and reproducible video recording in different labs; (2) collected a large amount of frame-labeled dataset for the training; (3) optimized CRNN architecture and hyperparameters for more effective learning; and (4) tested the generalization of the model in major mouse itch models and in a drug screening paradigm.

8. What's the max number of animals that can be recorded and analyzed in a single box?

Only one mouse can be recorded and analyzed in a single box. However, with multiple boxes, several mice can be recorded simultaneously.

9. Since video quality is vital for correct predictions by neural networks, can the authors comment on the influence of potential confounding factors (urine and feces, wear and tear of the acrylic imaging surface).

This is a very good question. According to our experience, sufficient habituation greatly reduced the urine and feces. We also cleaned the floor (urine and feces) between two recordings. A small amount of urine and feces did not have significant influence on the performance of the model. We don’t think that wear and tear of the acrylic imaging surface will be a big issue. After around 200 video recordings, no obvious wear and tear of the acrylic imaging surface were observed. Even if there are some wear and tear after long-term use, it’s easy to replace the imaging floor with a new one (same material and quality).

10. This is potentially a great opportunity for the itch field to standardize the itch behavior assay. Can the authors comment briefly in the discussion on other aspects of standardization such as time of the day for experiments (circadian rhythm), injection volumes, habituation, and sexual dimorphism?

The reviewer raises an important point. As the reviewer mentioned, circadian rhythm, injection volumes, habituation, sexual dimorphism and other factors might affect the itch behavior assay and results. For our experiments, we did the scratching behavior recording during the daytime, injected 15 ul pruritogens dissolved in saline for both cheek and nape model, habituated each mouse in the recoding box 15 min/day for three consecutive days, and included both male and female mice. To standardize the mouse itch behavior assay, the itch field need to discuss together to reach a consensus.

11. Can the authors discuss how the rest of the field can apply the system in their own research? Will it be released as an open source software? And will the custom video box be commercially available?

We really hope that the rest of the mouse itch field will try and take advantage of our Scratch-AID system for their studies. This is the motivation for us to work on this project. Our videos, codes, and models will be publicly available as this manuscript is published. We would also like to further optimize our prototype models, add features as suggested by the reviewers, and launch a user-friendly software or a web-based analysis system (if we could have funding support and identified a software developer).

https://doi.org/10.7554/eLife.84042.sa2

Article and author information

Author details

  1. Huasheng Yu

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Jingwei Xiong
    For correspondence
    huasheng.yu@pennmedicine.upenn.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5641-2512
  2. Jingwei Xiong

    Graduate Group in Biostatistics, University of California Davis, Davis, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Huasheng Yu
    Competing interests
    No competing interests declared
  3. Adam Yongxin Ye

    Howard Hughes Medical Institute, Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Suna Li Cranfill

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3431-0061
  5. Tariq Cannonier

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  6. Mayank Gautam

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7257-5837
  7. Marina Zhang

    Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, United States
    Contribution
    Conceptualization, Software, Methodology
    Competing interests
    No competing interests declared
  8. Rayan Bilal

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Software, Methodology
    Competing interests
    No competing interests declared
  9. Jong-Eun Park

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation, Software, Methodology
    Competing interests
    No competing interests declared
  10. Yuji Xue

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation, Visualization, Methodology
    Competing interests
    No competing interests declared
  11. Vidhur Polam

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation, Visualization
    Competing interests
    No competing interests declared
  12. Zora Vujovic

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  13. Daniel Dai

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  14. William Ong

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  15. Jasper Ip

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9773-1544
  16. Amanda Hsieh

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  17. Nour Mimouni

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  18. Alejandra Lozada

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  19. Medhini Sosale

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  20. Alex Ahn

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  21. Minghong Ma

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  22. Long Ding

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Supervision, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1716-3848
  23. Javier Arsuaga

    1. Department of Molecular and Cellular Biology, University of California Davis, Davis, United States
    2. Department of Mathematics, University of California Davis, Davis, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  24. Wenqin Luo

    Department of Neuroscience, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Project administration, Writing - review and editing
    For correspondence
    luow@pennmedicine.upenn.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2486-807X

Funding

National Science Foundation (DMS-1854770)

  • Javier Arsuaga

National Institutes of Health (R01 NS083702)

  • Wenqin Luo

National Institutes of Health (R34 NS118411)

  • Long Ding
  • Wenqin Luo

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the Luo and Ma lab members for their help and support. We thank Dr. Ji Zhu, Mr. Simin Liu, Dr. You Lv, Ms. Yakun Wang, and Dr. Wei Yang for their help and suggestions for this project. This work was supported by NSF grant (DMS-1854770) of Dr. Arsuaga, NIH R01 (NS083702) of Dr. Luo, and R34 (NS118411) of Drs. Ding and Luo.

Ethics

Mice were housed in the John Morgan animal facility at the University of Pennsylvania. All animal treatments were conducted in accordance with protocols approved by the Institutional Animal Care and Use Committee and the guidelines of the National Institutes of Health (Protocol No. 804886).

Senior Editor

  1. Catherine Dulac, Harvard University, United States

Reviewing Editor

  1. Brian S Kim, Icahn School of Medicine at Mount Sinai, United States

Publication history

  1. Preprint posted: October 7, 2022 (view preprint)
  2. Received: October 9, 2022
  3. Accepted: November 29, 2022
  4. Accepted Manuscript published: December 8, 2022 (version 1)
  5. Version of Record published: December 19, 2022 (version 2)

Copyright

© 2022, Yu, Xiong et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 785
    Page views
  • 114
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Huasheng Yu
  2. Jingwei Xiong
  3. Adam Yongxin Ye
  4. Suna Li Cranfill
  5. Tariq Cannonier
  6. Mayank Gautam
  7. Marina Zhang
  8. Rayan Bilal
  9. Jong-Eun Park
  10. Yuji Xue
  11. Vidhur Polam
  12. Zora Vujovic
  13. Daniel Dai
  14. William Ong
  15. Jasper Ip
  16. Amanda Hsieh
  17. Nour Mimouni
  18. Alejandra Lozada
  19. Medhini Sosale
  20. Alex Ahn
  21. Minghong Ma
  22. Long Ding
  23. Javier Arsuaga
  24. Wenqin Luo
(2022)
Scratch-AID, a deep learning-based system for automatic detection of mouse scratching behavior with high accuracy
eLife 11:e84042.
https://doi.org/10.7554/eLife.84042

Further reading

    1. Neuroscience
    2. Physics of Living Systems
    Sabrina A Jones, Jacob H Barfield ... Woodrow L Shew
    Research Article

    Naturally occurring body movements and collective neural activity both exhibit complex dynamics, often with scale-free, fractal spatiotemporal structure. Scale-free dynamics of both brain and behavior are important because each is associated with functional benefits to the organism. Despite their similarities, scale-free brain activity and scale-free behavior have been studied separately, without a unified explanation. Here we show that scale-free dynamics of mouse behavior and neurons in visual cortex are strongly related. Surprisingly, the scale-free neural activity is limited to specific subsets of neurons, and these scale-free subsets exhibit stochastic winner-take-all competition with other neural subsets. This observation is inconsistent with prevailing theories of scale-free dynamics in neural systems, which stem from the criticality hypothesis. We develop a computational model which incorporates known cell-type-specific circuit structure, explaining our findings with a new type of critical dynamics. Our results establish neural underpinnings of scale-free behavior and clear behavioral relevance of scale-free neural activity.

    1. Neuroscience
    Barna Zajzon, David Dahmen ... Renato Duarte
    Research Article

    Information from the sensory periphery is conveyed to the cortex via structured projection pathways that spatially segregate stimulus features, providing a robust and efficient encoding strategy. Beyond sensory encoding, this prominent anatomical feature extends throughout the neocortex. However, the extent to which it influences cortical processing is unclear. In this study, we combine cortical circuit modeling with network theory to demonstrate that the sharpness of topographic projections acts as a bifurcation parameter, controlling the macroscopic dynamics and representational precision across a modular network. By shifting the balance of excitation and inhibition, topographic modularity gradually increases task performance and improves the signal-to-noise ratio across the system. We demonstrate that in biologically constrained networks, such a denoising behavior is contingent on recurrent inhibition. We show that this is a robust and generic structural feature that enables a broad range of behaviorally-relevant operating regimes, and provide an in-depth theoretical analysis unravelling the dynamical principles underlying the mechanism.