Analysis of ultrasonic vocalizations from mice using computer vision and machine learning

  1. Antonio HO Fonseca
  2. Gustavo M Santana
  3. Gabriela M Bosque Ortiz
  4. Sérgio Bampi
  5. Marcelo O Dietrich
  1. Laboratory of Physiology of Behavior, Department of Comparative Medicine, Yale School of Medicine, United States
  2. Institute of Informatics, Federal University of Rio Grande do Sul, Brazil
  3. Graduate Program in Biological Sciences - Biochemistry, Federal University of Rio Grande do Sul, Brazil
  4. Interdepartmental Neuroscience Program, Biological and Biomedical Sciences Program, Graduate School in Arts and Sciences, Yale University, United States
  5. Department of Neuroscience, Yale School of Medicine, United States
5 figures, 3 tables and 7 additional files

Figures

Figure 1 with 1 supplement
Overview of the VocalMat pipeline for ultrasonic vocalization (USV) detection and analysis.

(A) Workflow of the main steps used by VocalMat, from audio acquisition to data analysis. (B) Illustration of a segment of a spectrogram. The time-frequency plane is depicted as a grayscale image in which pixel values correspond to intensity in decibels. (C) Example of a segmented USV after contrast enhancement, adaptive thresholding, and morphological operations (see Figure 1—figure supplement 1 for further details of the segmentation process). (D) Illustration of some of the spectral information obtained from the segmentation. Intensity information is kept for each time-frequency point along the segmented USV candidate.
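VocalMat itself is implemented in MATLAB; as a rough Python sketch of the first pipeline step only, the audio can be converted into a decibel-scaled spectrogram and treated as a grayscale image (the file name, window length, and overlap below are assumptions, not VocalMat's actual parameters):

```python
# Illustrative sketch: build a spectrogram and treat it as a grayscale
# image, as in Figure 1B. Parameters are placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, audio = wavfile.read("recording.wav")      # hypothetical mono USV recording
freqs, times, power = spectrogram(
    audio.astype(np.float64), fs=fs,
    window="hamming", nperseg=1024, noverlap=512,
)

# Each (frequency, time) pixel is an intensity in decibels.
power_db = 10.0 * np.log10(power + 1e-12)

# Rescale to [0, 1] so the spectrogram can be processed as a grayscale image.
img = (power_db - power_db.min()) / (power_db.max() - power_db.min())
```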

Figure 1—figure supplement 1
Image processing pipeline for segmentation of ultrasonic vocalizations (USVs) in spectrograms.

(A) Segment of a spectrogram after contrast adjustment (γ = 1). (B) Output image after binarization using adaptive thresholding. (C) Result of the morphological opening with a 4 × 2 rectangle. (D) Result of the dilation with a line of length l = 4 at 90°. (E) Removal of small objects (≤60 pixels); the mean of the point cloud for each detected USV candidate is shown in red, and the green lines mark 10 ms intervals. (F) Result after separating syllables based on the maximum allowed interval between two tones within a syllable. Different colors distinguish individual syllables.
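The same sequence of operations can be sketched with scikit-image (a hedged illustration, not VocalMat's MATLAB implementation; the adaptive-threshold block size is a placeholder):

```python
# Sketch of the panel A-F segmentation steps using scikit-image.
import numpy as np
from skimage import exposure, filters, measure, morphology

def segment_usv_candidates(img):
    """img: spectrogram rescaled to [0, 1] and treated as a grayscale image."""
    # (A) Contrast adjustment; gamma = 1 leaves intensities unchanged.
    adjusted = exposure.adjust_gamma(img, gamma=1.0)
    # (B) Binarization by adaptive (local) thresholding; block size assumed.
    binary = adjusted > filters.threshold_local(adjusted, block_size=51)
    # (C) Morphological opening with a 4 x 2 rectangle to remove speckle.
    opened = morphology.opening(binary, morphology.rectangle(4, 2))
    # (D) Dilation with a vertical line (l = 4, 90 degrees) to bridge gaps.
    dilated = morphology.dilation(opened, np.ones((4, 1), dtype=bool))
    # (E) Discard objects of 60 pixels or fewer.
    cleaned = morphology.remove_small_objects(dilated, min_size=61)
    # (F) Label connected components; each label is one USV candidate.
    return measure.label(cleaned)
```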

Figure 2
Noise elimination process for ultrasonic vocalization (USV) candidates.

(A) In a set of 64 audio files, VocalMat identified 59,781 USV candidates. (B) Examples of USVs among the pool of candidates that were manually labeled as either noise or real USVs. The score (upper-right corner) indicates the calculated contrast Ck for the candidate. (C) Example of the contrast calculation (Ck) for a given USV candidate k. The red dots indicate the points detected as part of the USV candidate (Xk), and the dashed white rectangle indicates its evaluated neighborhood (Wk). (D) Distribution of Ck for the USV candidates in the test data set. (E) Each USV candidate was manually labeled as real USV or noise. The distribution of Ck for the real USVs (cyan) is compared to the distribution for all the USV candidates (red) in the test data set. The blue line indicates the cumulative distribution function (CDF) of Ck for all the USV candidates; the arrow marks the inflection point of the CDF curve. (F) Example of a segment of spectrogram with three USVs. Analyzing this segment without the ‘Local Median Filter’ results in an elevated number of false positives (noise detected as USV). Red and cyan ticks denote the time stamps of the USV candidates identified without and with the ‘Local Median Filter’, respectively.
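As a hedged sketch of the computation behind panels C-E, assume Ck compares the median intensity of a candidate's own pixels (Xk) with the median intensity of its surrounding window (Wk); the exact definition and the padding used below are assumptions:

```python
# Assumed form of the contrast score C_k behind the 'Local Median Filter'.
import numpy as np

def contrast_score(spec, candidate_mask, pad=10):
    """spec: 2D spectrogram intensities; candidate_mask: boolean mask X_k."""
    rows, cols = np.nonzero(candidate_mask)
    r0, c0 = max(rows.min() - pad, 0), max(cols.min() - pad, 0)
    r1, c1 = rows.max() + pad + 1, cols.max() + pad + 1

    window = spec[r0:r1, c0:c1]          # neighborhood W_k (dashed rectangle)
    inside = spec[candidate_mask]        # pixels of the candidate, X_k
    return np.median(inside) / np.median(window)

# Candidates whose score falls below a cutoff (e.g., the inflection point
# of the CDF of C_k in panel E) are discarded as noise.
```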

Figure 3
VocalMat ultrasonic vocalization (USV) classification using a convolutional neural network.

(A) Illustration of the AlexNet architecture after end-to-end training on our training data set. The last three layers of the network were replaced to perform a 12-category classification task (11 USV types plus noise). The output of the CNN is a probability distribution over the labels for each input image. (B) Linear regression between the number of USVs detected manually and the number reported by VocalMat for the audio files in our test data set (see Figure 4—figure supplement 1 for individual confusion matrices). (C) Distribution of the probabilities P(USV) for true positives (green), false positives (red), false negatives (cyan), and true negatives (magenta). Ticks represent individual USV candidates.
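The paper trained the network in MATLAB; the PyTorch sketch below shows an equivalent transfer-learning setup for panel A, replacing AlexNet's final fully connected layer with a 12-way classifier (treating the noise class as index 11 is an assumption):

```python
# Equivalent transfer-learning setup in PyTorch (illustrative only).
import torch
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1")
model.classifier[6] = torch.nn.Linear(4096, 12)   # 11 USV types + noise
model.eval()

# At inference, softmax turns logits into a distribution over the 12 labels.
image = torch.randn(1, 3, 224, 224)               # stand-in spectrogram tile
probs = torch.softmax(model(image), dim=1)        # shape: (1, 12)
p_usv = 1.0 - probs[0, 11]                        # assuming noise is index 11
```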

Figure 4 with 1 supplement
VocalMat performance for ultrasonic vocalization (USV) classification.

(A) Examples of the 11 categories of USVs plus noise that VocalMat uses to classify the USV candidates. (B) Confusion matrix illustrating VocalMat’s performance in multiclass classification (see also Supplementary file 5 and Figure 4—figure supplement 1 for individual confusion matrices). (C) Comparison of classification performance when labels are assigned based on the most likely label (Top-one) versus the two most likely labels (Top-two) (see Supplementary file 6). Symbols represent medians ± 95% confidence intervals.
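The panel B and C summaries can be computed directly from the network's per-class probabilities; the sketch below uses synthetic stand-in data to show the confusion matrix and the Top-one versus Top-two accuracies:

```python
# Confusion matrix and Top-one/Top-two accuracy from class probabilities.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(12), size=500)     # stand-in softmax outputs
y_true = rng.integers(0, 12, size=500)           # stand-in true labels

top1 = probs.argmax(axis=1)
cm = confusion_matrix(y_true, top1)              # panel B

# Top-two: count a prediction as correct if the true label is among
# the two most likely labels.
top2 = np.argsort(probs, axis=1)[:, -2:]
top1_acc = (top1 == y_true).mean()
top2_acc = (top2 == y_true[:, None]).any(axis=1).mean()
```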

Figure 4—figure supplement 1
Confusion matrix illustrating VocalMat’s performance in multiclass classification per recording file.

Figure 5 with 1 supplement
Vocal repertoire visualization using Diffusion Maps.

(A) Illustration of the embedding of the ultrasonic vocalizations (USVs) for each experimental condition. The probability distribution of all the USVs in each experimental condition is embedded in a Euclidean space given by the eigenvectors computed through Diffusion Maps. Colors identify the different USV types. (B) Pairwise distance matrix between the centroids of USV types within each manifold obtained for the four experimental conditions. (C) Comparison between the pairwise distance matrices in the four experimental conditions by Pearson’s correlation coefficient.
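A minimal diffusion-maps sketch for panel A, assuming each USV is represented by its 12-dimensional label-probability vector (the kernel bandwidth and the number of retained eigenvectors are placeholders):

```python
# Minimal diffusion map: Gaussian kernel, row normalization, eigenvectors.
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, epsilon=1.0, n_components=3):
    """X: (n_usvs, n_features) array, e.g., per-USV label probabilities."""
    K = np.exp(-cdist(X, X, "sqeuclidean") / epsilon)   # affinity kernel
    P = K / K.sum(axis=1, keepdims=True)                # row-stochastic matrix
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    idx = order[1:n_components + 1]   # skip the trivial constant eigenvector
    # Diffusion coordinates: eigenvectors scaled by their eigenvalues.
    return eigvecs[:, idx].real * eigvals[idx].real
```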

Figure 5—figure supplement 1
Alignment of the manifolds between pairs of experimental conditions.

(A) Illustration of the resulting manifold alignment for each pair of experimental conditions. The quality of the alignment between the manifolds is assessed by (B) Cohen’s coefficient and (C) overall projection accuracy into joint space.
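The paper's alignment procedure is not reproduced here; as one simple stand-in that conveys the idea, the matched per-type centroids of two embeddings can be aligned with an ordinary Procrustes transform (all data and names below are illustrative):

```python
# Illustrative manifold alignment via Procrustes on matched centroids.
import numpy as np
from scipy.spatial import procrustes

def type_centroids(embedding, labels, n_types=11):
    """Centroid of each USV type within one condition's embedding."""
    return np.stack([embedding[labels == t].mean(axis=0)
                     for t in range(n_types)])

rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=(200, 3)), rng.normal(size=(180, 3))
lab_a, lab_b = np.arange(200) % 11, np.arange(180) % 11

# Aligned centroid sets plus a disparity score (lower = better alignment).
c_a, c_b, disparity = procrustes(type_centroids(emb_a, lab_a),
                                 type_centroids(emb_b, lab_b))
```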

Tables

Table 1
Summary of performance of VocalMat in detecting ultrasonic vocalizations (USVs) in the test data set.
Audio file | True positive | False negative | True negative | False positive | Accuracy (%)
1          | 316           | 1              | 58            | 2              | 99.20
2          | 985           | 1              | 105           | 15             | 98.55
3          | 696           | 12             | 73            | 5              | 97.84
4          | 862           | 13             | 51            | 4              | 98.17
5          | 441           | 2              | 16            | 3              | 98.48
6          | 696           | 2              | 87            | 4              | 99.24
7          | 787           | 5              | 122           | 5              | 98.91
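The accuracy column follows the usual ratio of correct calls to all candidates; for example, for audio file 1:

```python
# Accuracy = (TP + TN) / (TP + FN + TN + FP); audio file 1 from Table 1.
tp, fn, tn, fp = 316, 1, 58, 2
accuracy = 100 * (tp + tn) / (tp + fn + tn + fp)
print(f"{accuracy:.2f}")   # 99.20
```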
Table 2
Summary of detection performance.
Tool       | Missed USV rate (%) | False discovery rate (%)
Ax         | 4.99                | 37.67
MUPET      | 33.74               | 38.78
USVSEG     | 6.53                | 7.58
DeepSqueak | 27.13               | 7.61
VocalMat   | 1.64                | 0.05
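Assuming the standard definitions (not quoted from the paper), the two Table 2 metrics can be written in terms of the same detection counts used in Table 1:

```python
# Assumed definitions of the Table 2 metrics, in percent.
def missed_usv_rate(tp, fn):
    # Fraction of real (manually validated) USVs that the tool missed.
    return 100 * fn / (tp + fn)

def false_discovery_rate(tp, fp):
    # Fraction of the tool's detections that are not real USVs.
    return 100 * fp / (tp + fp)
```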
Table 3
Summary of experimental conditions covered in the test data set.
Age | Microphone gain | Location              | Heating
P9  | Maximum         | Environmental chamber | No
P9  | Maximum         | Environmental chamber | No
P9  | Maximum         | Environmental chamber | No
P10 | Intermediary    | Open field            | No
P10 | Intermediary    | Open field            | No
P10 | Maximum         | Environmental chamber | Yes
P10 | Maximum         | Environmental chamber | Yes

Cite this article

Fonseca AHO, Santana GM, Bosque Ortiz GM, Bampi S, Dietrich MO (2021) Analysis of ultrasonic vocalizations from mice using computer vision and machine learning. eLife 10:e59161. https://doi.org/10.7554/eLife.59161