Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires

  1. Jack Goffinet
  2. Samuel Brudner
  3. Richard Mooney
  4. John Pearson  Is a corresponding author
  1. Department of Computer Science, Duke University, United States
  2. Center for Cognitive Neurobiology, Duke University, United States
  3. Department of Neurobiology, Duke University, United States
  4. Department of Biostatistics & Bioinformatics, Duke University, United States
  5. Department of Electrical and Computer Engineering, Duke University, United States

Abstract

Increases in the scale and complexity of behavioral data pose a growing challenge for data analysis. A common strategy involves replacing entire behaviors with small numbers of handpicked, domain-specific features, but this approach suffers from several crucial limitations. For example, handpicked features may miss important dimensions of variability, and correlations among them complicate statistical testing. Here, by contrast, we apply the variational autoencoder (VAE), an unsupervised learning method, to learn features directly from data and quantify the vocal behavior of two model species: the laboratory mouse and the zebra finch. The VAE converges on a parsimonious representation that outperforms handpicked features on a variety of common analysis tasks, enables the measurement of moment-by-moment vocal variability on the timescale of tens of milliseconds in the zebra finch, provides strong evidence that mouse ultrasonic vocalizations do not cluster as is commonly believed, and captures the similarity of tutor and pupil birdsong with qualitatively higher fidelity than previous approaches. In all, we demonstrate the utility of modern unsupervised learning approaches to the quantification of complex and high-dimensional vocal behavior.

Introduction

Quantifying the behavior of organisms is of central importance to a wide range of fields including ethology, linguistics, and neuroscience. Yet given the variety and complex temporal structure of many behaviors, finding concise yet informative descriptions has remained a challenge. Vocal behavior provides a paradigmatic example: audio data are notoriously high dimensional and complex, and despite intense interest and significant progress across a number of fields, many aspects of vocal behavior remain poorly understood. A major goal of these various lines of inquiry has been to develop methods for the quantitative analysis of vocal behavior, and these efforts have resulted in several powerful approaches that enable the automatic or semi-automatic analysis of vocalizations (Tchernichovski and Mitra, 2004; Coffey et al., 2019; Van Segbroeck et al., 2017; Sainburg et al., 2019; Tchernichovski et al., 2000; Mandelblat-Cerf and Fee, 2014; Mets and Brainard, 2018; Kollmorgen et al., 2020; Holy and Guo, 2005).

Key to this approach has been the existence of software packages that calculate acoustic features for each unit of vocalization, typically a syllable (Burkett et al., 2015; Tchernichovski and Mitra, 2004; Van Segbroeck et al., 2017; Coffey et al., 2019; Chabout et al., 2015). For example, Sound Analysis Pro (SAP), focused on birdsong, calculates 14 features for each syllable, including duration, spectral entropy, and goodness of pitch, and uses the set of resulting metrics as a basis for subsequent clustering and analysis (Tchernichovski and Mitra, 2004). More recently, MUPET and DeepSqueak have applied a similar approach to mouse vocalizations, with a focus on syllable clustering (Van Segbroeck et al., 2017; Coffey et al., 2019). Collectively, these and similar software packages have helped facilitate numerous discoveries, including circadian patterns of song development in juvenile birds (Derégnaucourt et al., 2005), cultural evolution among isolate zebra finches (Fehér et al., 2009), and differences in ultrasonic vocalizations (USVs) between mouse strains (Van Segbroeck et al., 2017).

Despite these insights, this general approach suffers from several limitations: first, handpicked acoustic features are often highly correlated, and these correlations can result in redundant characterizations of vocalization. Second, an experimenter-driven approach may exclude features that are relevant for communicative function or, conversely, may emphasize features that are not salient or capture negligible variation in the data. Third, there is no diagnostic approach to determine when enough acoustic features have been collected: Could there be important variation in the vocalizations that the chosen features simply fail to capture? Lastly and most generally, committing to a syllable-level analysis necessitates a consistent definition of syllable boundaries, which is often difficult in practice. It limits the types of structure one can find in the data and is often difficult to relate to time series such as neural data, for which the relevant timescales are believed to be orders of magnitude faster than syllable rate.

Here, we address these shortcomings by applying a data-driven approach based on variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014) to the task of quantifying vocal behavior in two model species: the laboratory mouse (Mus musculus) and the zebra finch (Taeniopygia guttata). The VAE is an unsupervised modeling approach that learns from data a pair of probabilistic maps, an ‘encoder’ and a ‘decoder,’ capable of compressing the data into a small number of latent variables while attempting to preserve as much information as possible. In doing so, it discovers features that best capture variability in the data, offering a nonlinear generalization of methods like Principal Components Analysis (PCA) and Independent Components Analysis (ICA) that adapts well to high-dimensional data like natural images (Dai et al., 2018; Higgins et al., 2017). By applying this technique to collections of single syllables, encoded as time-frequency spectrograms, we looked for latent spaces underlying vocal repertoires across individuals, strains, and species, asking whether these data-dependent features might reveal aspects of vocal behavior overlooked by traditional acoustic metrics and provide more principled means for assessing differences among these groups.

Our contributions are fourfold: first, we show that the VAE’s learned acoustic features outperform common sets of handpicked features in a variety of tasks, including capturing acoustic similarity, representing a well-studied effect of social context on zebra finch song, and comparing the USVs of different mouse strains. Second, using learned latent features, we report new results concerning both mice and zebra finches, including the finding that mouse USV syllables do not appear to cluster into distinct subtypes, as is commonly assumed, but rather form a broad continuum. Third, we present a novel approach to characterizing stereotyped vocal behavior that does not rely on syllable boundaries, one which we find is capable of quantifying subtle changes in behavioral variability on tens-of-milliseconds timescales. Lastly, we demonstrate that the VAE’s learned acoustic features accurately reflect the relationship between songbird tutors and pupils. In all, we show that data-derived acoustic features confirm and extend findings gained by existing approaches to vocal analysis and offer distinct advantages over handpicked acoustic features in several critical applications.

Results

VAEs learn a low-dimensional space of vocal features

We trained a VAE (Kingma and Welling, 2013; Rezende et al., 2014) to learn a probabilistic mapping between vocalizations and a latent feature space. Specifically, we mapped single-syllable spectrogram images (D=16,384 pixels) to vectors of latent features (D=32) and back to the spectrogram space (Figure 1a). As with most VAE methods, we parameterized both the encoder and decoder using convolutional neural networks, which provide useful inductive biases for representing regularly sampled data such as images or spectrograms. The two maps are jointly trained to maximize a lower bound on the probability of the data given the model (see Materials and methods). As in other latent variable models, we assume each observed spectrogram can be explained by an unobserved ‘latent’ variable situated in some ‘latent space.’ As visualized in Figure 1b, the result is a continuous latent space that captures the complex geometry of vocalizations. Each point in this latent space represents a single spectrogram image, and trajectories in this latent space represent sequences of spectrograms that smoothly interpolate between start and end syllables (Figure 1c). Although we cannot visualize the full 32-dimensional latent space, methods like PCA and the UMAP algorithm (McInnes et al., 2018) allow us to communicate results in an informative and unsupervised way. The VAE training procedure can thus be seen as a compression algorithm that represents each spectrogram as a collection of 32 numbers describing data-derived vocal features. In what follows, we will show that these features outperform traditional handpicked features on a wide variety of analysis tasks.
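To make this workflow concrete, the sketch below encodes a stack of syllable spectrograms with a trained encoder and projects the resulting latent means to two dimensions. It is a minimal illustration rather than the authors' released pipeline; the encoder interface, array shapes, and variable names are assumptions.

```python
# Minimal sketch (assumed interface, not the released code): encode (N, 128, 128)
# syllable spectrograms with a trained VAE encoder and project the 32-dimensional
# latent means to two dimensions for plotting.
import numpy as np
import torch
import umap  # umap-learn package

def embed_syllables(encoder, spectrograms, batch_size=64, device="cpu"):
    """Return (N, 32) latent means for an array of (N, 128, 128) spectrograms."""
    encoder = encoder.to(device).eval()
    latent_means = []
    with torch.no_grad():
        for i in range(0, len(spectrograms), batch_size):
            batch = torch.as_tensor(spectrograms[i:i + batch_size],
                                    dtype=torch.float32, device=device)
            mu, _ = encoder(batch.unsqueeze(1))  # assumes encoder returns (mean, covariance terms)
            latent_means.append(mu.cpu().numpy())
    return np.concatenate(latent_means)

# latents = embed_syllables(trained_encoder, syllable_spectrograms)
# embedding = umap.UMAP(n_components=2).fit_transform(latents)  # 2-D view as in Figure 1b
```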

Figure 1 with 1 supplement
Variational autoencoders (VAEs) learn a latent acoustic feature space.

(a) The VAE takes spectrograms as input (left column), maps them via a probabilistic ‘encoder’ to a vector of latent dimensions (middle column), and reconstructs a spectrogram via a ‘decoder’ (right column). The VAE trains these probabilistic maps so that the original and reconstructed spectrograms match as closely as possible. (b) The resulting latent vectors can then be visualized via dimensionality reduction techniques like principal components analysis. (c) Interpolations in latent space correspond to smooth syllable changes in spectrogram space. A series of points (dots) along a straight line in the inferred latent space is mapped, via the decoder, to a series of smoothly changing spectrograms (right). This correspondence between inferred features and realistic dimensions of variation is often observed when VAEs are applied to data like natural images (Kingma and Welling, 2013; Rezende et al., 2014).

Finally, we note that, while the VAE is compressive—that is, it discards some data—this is both necessary in practice and often desirable. First, necessity: as noted above, nearly all current methods reduce raw audio waveforms to a manageable number of features for purposes of analysis. This is driven in part by the desire to distill these complex sounds into a small collection of interpretable features, but it also stems from the needs of statistical testing, which suffers drastic loss of power for high-dimensional data without advanced methods. Second, this compression, as we will show, is often beneficial as it facilitates visualization and analyses of large collections of vocalizations in new ways. Thus, we view the VAE and its compression-based approach as a method for learning structure from complex data that complements both other dimension-reduction techniques like PCA and traditional acoustic signal processing.

Learned features capture and expand upon typical acoustic features

Most previous approaches to analyzing vocalizations have focused on tabulating a predetermined set of features such as syllable duration or entropy variance that are used for subsequent processing and analysis (Van Segbroeck et al., 2017; Coffey et al., 2019; Tchernichovski and Mitra, 2004; Burkett et al., 2015). We thus asked whether the VAE’s learned feature space simply recapitulated these known features or also captured new types of information missed by traditional acoustic metrics. To address the first question, we trained a VAE on a publicly available collection of mouse USVs (31,440 total syllables [Van Segbroeck M et al., 2019]), inferred latent features for each syllable, and colored the results according to three acoustic features—frequency bandwidth, maximum frequency, and duration—calculated by the analysis program MUPET (Van Segbroeck et al., 2017). As Figure 2a–c shows, each acoustic feature appears to be encoded in a smooth gradient across the learned latent space, indicating that information about each has been preserved. In fact, when we quantified this pattern by asking how much variance in a wide variety of commonly used acoustic metrics could be accounted for by latent features (see Materials and methods), we found that values ranged from 64% to 95%, indicating that most or nearly all traditional features were captured by the latent space (see Figure 2—figure supplement 1 for individual acoustic features). Furthermore, we found that, when the analysis was reversed, commonly used acoustic features were not able to explain as much variance in the VAE latent features, indicating a prediction asymmetry between the two sets (Figure 2e). That is, the learned features carry most of the information available in traditional features, as well as unique information missed by those metrics.

Figure 2 with 6 supplements
Learned acoustic features capture and expand upon traditional features.

(a–c) UMAP projections of latent descriptions of mouse ultrasonic vocalizations (USVs) colored by three traditional acoustic features. The smoothly varying colors show that these traditional acoustic features are represented by gradients within the latent feature space. (d) Many traditional features are highly correlated. When applied to the mouse USVs from (a) to (c), the acoustic features compiled by the analysis program MUPET have high correlations, effectively reducing the number of independent measurements made. (e) To better understand the representational capacity of traditional and latent acoustic features, we used each set of features to predict the other (see Materials and methods). We find that, across software programs, the learned latent features were better able to predict the values of traditional features than vice versa, suggesting that they have a higher representational capacity. Central line indicates median, upper and lower box the 25th and 75th percentiles, respectively. Whiskers indicate 1.5 times the interquartile range. Feature vector dimensions: MUPET, 9; DeepSqueak, 10; SAP, 13; mouse latent, 7; zebra finch latent, 5. (f) As another test of representational capacity, we performed PCA on the feature vectors to determine the effective dimensionality of the space spanned by each set of features (see Materials and methods). We find in all cases that latent features require more principal components to account for the same portion of feature variance, evidence that latent features span a higher dimensional space than traditional features applied to the same datasets. Colors are as in (e). Latent features with colors labeled ‘MUPET’ and ‘DeepSqueak’ refer to the same set of latent features, truncated at different dimensions corresponding to the number of acoustic features measured by MUPET and DeepSqueak, respectively.

We thus attempted to compare the effective representational capacity of the VAE to current best approaches in terms of the dimensionalities of their respective feature spaces. We begin by noting that the VAE, although trained with a latent space of 32 dimensions, converges on a parsimonious representation that makes use of only 5–7 dimensions, with variance apportioned roughly equally between these (Figure 2—figure supplement 2; Dai et al., 2018). For the handpicked features, we normalized each feature independently by z-score to account for scale differences. For comparison purposes, we applied the same normalization step to the learned features, truncated the latent dimension to the number of handpicked features, and calculated the cumulative feature variance as a function of number of principal components (Figure 2f). In such a plot, shallow linear curves are preferred since this indicates that variance is apportioned roughly equally among principal components and the effective dimensionality of the space is large. Equivalently, this means that the feature correlation matrix is close to the identity, with a nearly flat eigenvalue spectrum. As Figure 2f thus makes clear, the spaces spanned by the learned latent features have comparatively higher effective dimension than the spaces spanned by traditional features, suggesting that the learned features have a higher representational capacity. While the three software packages we tested (SAP [Tchernichovski and Mitra, 2004], MUPET [Van Segbroeck et al., 2017], DeepSqueak [Coffey et al., 2019]) measure upwards of 14 acoustic features per syllable, we find that these features often exhibit high correlations (Figure 2d, Figure 2—figure supplement 3), effectively reducing the number of independent measurements made. While correlations among features are not necessarily undesirable, they can complicate subsequent statistical testing because nonlinear relationships among features violate the assumptions of many statistical tests. VAE features, which allow for nonlinear warping, avoid this potential difficulty.
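This effective-dimensionality comparison can be summarized in a few lines: z-score each feature set, run PCA, and inspect the cumulative variance curve. The sketch below assumes generic NumPy arrays of per-syllable features (the array names are illustrative) and is not tied to any particular software package.

```python
# Sketch of the analysis behind Figure 2f: cumulative variance explained by
# successive principal components of a z-scored feature set. A shallower,
# more linear curve indicates a higher effective dimensionality.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def cumulative_variance(features):
    """features: (n_syllables, n_features) array; returns cumulative variance fractions."""
    z = StandardScaler().fit_transform(features)  # z-score each feature
    return np.cumsum(PCA().fit(z).explained_variance_ratio_)

# mupet_curve  = cumulative_variance(mupet_features)       # e.g., 9 handpicked features
# latent_curve = cumulative_variance(latent_means[:, :9])  # latent features truncated to match
```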

The degree to which the learned features capture novel information can also be demonstrated by considering their ability to encode a notion of spectrogram similarity since this is a typical use to which they are put in clustering algorithms (although see Van Segbroeck et al., 2017 for an alternative approach to clustering). We tested this by selecting query spectrograms and asking for the closest spectrograms as represented in both the DeepSqueak acoustic feature space and the VAE’s learned latent space. As Figure 3 shows, DeepSqueak feature space often fails to return similar spectrograms, whereas the learned latent space reliably produces close matches (see Figure 3—figure supplement 1 for comparisons to metrics in spectrogram space, Figure 3—figure supplement 2 for a representative sample using all feature sets, and Figure 3—figure supplement 3 for more details on nearest neighbors returned by DeepSqueak feature space). This suggests that the learned features better characterize local variation in the data by more accurately arranging nearest neighbors.
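A simple way to reproduce this comparison is to query nearest neighbors under each representation, as sketched below. The feature arrays and query index are placeholders.

```python
# Sketch of the nearest-neighbor comparison in Figure 3: retrieve the syllables
# closest to a query under a given feature representation (Euclidean distance).
from sklearn.neighbors import NearestNeighbors

def nearest_syllables(features, query_index, k=5):
    """Return indices of the k nearest syllables to the query (excluding itself)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features[query_index:query_index + 1])
    return idx[0, 1:]

# neighbors_vae = nearest_syllables(latent_means, q)          # learned features
# neighbors_ds  = nearest_syllables(deepsqueak_features, q)   # handpicked features
```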

Figure 3 with 3 supplements
Latent features better represent acoustic similarity.

Top row: example spectrograms; middle row: nearest neighbors in latent space; bottom row: nearest neighbors in DeepSqueak feature space.

Latent spaces facilitate comparisons between vocal repertoires

Many experimental designs require quantifying differences between sets of vocalizations. As a result, the ability of a feature set to distinguish between syllables, individuals, and groups poses a key test of the VAE-based approach. Here, we apply the VAE latent features to several comparison problems for which handpicked features are often used.

A common comparison in birdsong research is that between female-directed and undirected song. It is well-established that directed song is more stereotyped and slightly faster than undirected song (Sossinka and Böhner, 1980). We thus asked whether the learned features could detect this effect. In Figure 4a, we plot the first two principal components of acoustic features calculated by the Sound Analysis Pro software package (Tchernichovski and Mitra, 2004) for both directed and undirected renditions of a single zebra finch song syllable. We note a generally diffuse arrangement and a subtle leftward bias in the directed syllables compared to the undirected syllables. Figure 4b displays the same syllables with respect to the first two principal components of the VAE’s latent features, showing a much more concentrated distribution of directed syllables relative to undirected syllables (see Figure 4—figure supplement 1 for all syllables). In fact, when we quantify this reduction of variability across all feature-space dimensions and song syllables (see Materials and methods), learned latent features consistently report greater variability reductions than SAP-generated features (Figure 4c; SAP: 0–20%, VAE: 27–37%), indicating that latent features are more sensitive to this effect. Additionally, we find that latent features outperform SAP features in the downstream tasks of predicting social context (Appendix 1—table 1), mouse strain (Appendix 1—table 2), and mouse identity (Appendix 1—table 3).

Figure 4 with 3 supplements
Latent features better capture differences in sets of vocalizations.

(a) The first two principal components in SAP feature space of a single zebra finch song syllable, showing differences in directed and undirected syllable distributions. (b) The first two principal components of latent syllable features, showing the same comparison. Learned latent features more clearly indicate differences between the two conditions by clustering directed syllables together. (c) Acoustic variability of each song syllable as measured by SAP features and latent features (see Materials and methods). Latent features more clearly represent the constriction of variability in the directed context. Spectrogram scale bars denote 100 ms. (d) A UMAP projection of the latent means of USV syllables from two strains of mice, showing clear differences in their vocal repertoires. (e) Similarity matrix between syllable repertoires for each of the 40 recording sessions from (d). Lighter values correspond to more similar syllable repertoires (lower maximum mean discrepancy, MMD). (f) t-SNE representation of similarities between syllable repertoires, where the distance metric is the estimated MMD. The dataset, which is distinct from that represented in (d) and (e), contains 36 individuals, 118 recording sessions, and 156,180 total syllables. Color indicates individual mice and scatterpoints of the same color represent repertoires recorded on different days. Distances between points represent the similarity in vocal repertoires, with closer points more similar. We note that the major source of repertoire variability corresponds to genetic background, reflected in the two distinct clusters (Figure 4—figure supplement 3). A smaller level of variability can be seen across individuals in the same clusters. Individual mice have repertoires with even less variability, indicated by the close proximity of most repertoires for each mouse.

Similarly, we can ask whether latent features are able to capture differences between groups of individuals. In Van Segbroeck et al., 2017, the authors compared USVs of 12 strains of mice using a clustering-based approach. Here, we perform an alternative version of this analysis using two mouse strains (C57/BL6 and DBA/2) from a publicly available dataset that were included in this earlier study. Figure 4d shows a UMAP projection of the 31,440 detected syllables, colored by mouse strain. Visualized with UMAP, clear differences between the USV distributions are apparent. While individual VAE latent features (vector components) are generally less interpretable than traditional acoustic features such as ‘mean frequency,’ when taken together with an ‘atlas’ of USV shapes derived from this visualization (Figure 4—figure supplement 2), they allow us to develop an intuitive understanding of the differences between the USVs of the two strains: the C57 mice mostly produce noisy USVs, while the DBA mice produce a much greater variety, including many short low-frequency syllables that C57s rarely produce.

Given these results, we asked whether these strain differences are evident at the level of individual 6.5 min recording sessions. To compare distributions of syllables without making restrictive parametric assumptions, we employed maximum mean discrepancy (MMD), a difference measure between pairs of distributions (Gretton et al., 2012). We estimated MMD between the distributions of latent syllable encodings for each pair of recording sessions (see Materials and methods) and visualized the result as a distance matrix (Figure 4e). Here, lighter values indicate more similar syllable repertoires. We note that, in general, values are brighter when comparing repertoires within strains than when comparing across strains, consistent with the hypothesis of inter-strain differences. We also note some substructure, including a well-defined cluster within the C57 block (annotated).

Finally, we used a much larger library of female-directed mouse USVs (36 individuals, 2–4 20-min recording sessions each, 40 total hours of audio, 156,000 syllables) to investigate the diversity and stability of syllable repertoires. We repeated the above procedure, estimating MMD for each pair of recording sessions (Figure 4—figure supplement 3), and then computed a t-distributed stochastic neighbor embedding (t-SNE) layout of the recording sessions with estimated MMD as the distance metric to visualize the distribution of syllable repertoires (see Materials and methods). In Figure 4f, each recording session is represented by a scatterpoint, and recordings of the same individual are connected and displayed in the same color. We note an overall organization of syllables into two clusters, corresponding to the genetic backgrounds of the mice (Figure 4—figure supplement 3). Furthermore, we note that almost all recordings of the same individuals are co-localized, indicating that within-subject differences in syllable repertoire are smaller than those between individuals. Although it has been previously shown that a deep convolutional neural network can be trained to classify USV syllables according to mouse identity with good accuracy (Ivanenko et al., 2020, Figure S1), here we find that repertoire features learned in a wholly unsupervised fashion achieve similar results, indicating that mice produce individually stereotyped, stable vocal repertoires.

Latent features fail to support cluster substructure in USVs

Above, we have shown that, by mapping complex sets of vocalizations to low-dimensional latent representations, VAEs allow us to visualize the relationships among elements in mouse vocal repertoires. The same is true for songbirds such as the zebra finch, T. guttata. Figure 5 compares the geometry of learned latent spaces for an individual of each species as visualized via UMAP. As expected, the finch latent space exhibits well-delineated clusters corresponding to song syllables (Figure 5a). However, as seen above, mouse USVs clump together in a single quasi-continuous mass (Figure 5b). This raises a puzzle since the clustering of mouse vocalizations is often considered well-established in the literature (Holy and Guo, 2005; Burkett et al., 2015; Woehr, 2014; Chabout et al., 2015; Hertz et al., 2020) and is assumed in most other analyses of these data (Van Segbroeck et al., 2017; Coffey et al., 2019). Clusters of mouse USVs are used to assess differences across strains (Van Segbroeck et al., 2017), social contexts (Chabout et al., 2015; Coffey et al., 2019; Hammerschmidt et al., 2012), and genotypes (Gaub et al., 2010), and the study of transition models among clusters of syllables has given rise to models of syllable sequences that do not readily extend to the nonclustered case (Holy and Guo, 2005; Chabout et al., 2015; Hertz et al., 2020).

Figure 5 with 5 supplements
Bird syllables clearly cluster, but mouse ultrasonic vocalizations (USVs) do not.

(a) UMAP projection of the song syllables of a single male zebra finch (14,270 syllables). (b) UMAP projection of the USV syllables of a single male mouse (17,400 syllables). (c) The same UMAP projection as in (b), colored by MUPET-assigned labels. (d) Mean silhouette coefficient (an unsupervised clustering metric) for latent descriptions of zebra finch song syllables and mouse syllables. The dotted line indicates the null hypothesis of a single covariance-matched Gaussian noise cluster fit by the same algorithm. Each scatterpoint indicates a cross-validation fold, and scores are plotted as differences from the null model. Higher scores indicate more clustering. (e) Interpolations (horizontal series) between distinct USV shapes (left and right edges) demonstrating the lack of data gaps between putative USV clusters.

We therefore asked whether mouse USVs do, in fact, cluster or whether, as the latent space projection suggests, they form a single continuum. In principle, this is impossible to answer definitively because, without the benefit of ground truth labels, clustering is an unsupervised classification task. Moreover, there is little consensus among researchers as to the best method for assessing clustering and where the cutoff between clustered and nonclustered data lies (Jain et al., 1999). In practice, new clustering algorithms are held to function well when they outperform previous approaches and produce sensible results on data widely agreed on to be clustered. Thus, while it is clear that zebra finch song syllables should be and are clustered by the VAE (Figure 5a), we can only ask whether clustering is a more or less satisfying account of the mouse data in Figure 5b.

To address this question, we performed a series of analyses to examine the clustering hypothesis from complementary angles. First, we asked how clusters detected by other analysis approaches correspond to regions in the latent space. As shown in Figure 5c, clusters detected by MUPET roughly correspond to regions of the UMAP projection, with some overlap between clusters (e.g., purple and blue clusters) and some noncontiguity of single clusters (red and orange clusters). That is, even though clusters do broadly label different subsets of syllables, they also appear to substantially bleed into one another, unlike the finch song syllables in Figure 5a. However, it might be objected that Figure 5b displays the UMAP projection, which only attempts to preserve local relationships between nearest neighbors and is not to be read as an accurate representation of the latent geometry. Might the lack of apparent clusters result from distortions produced by the projection to two dimensions? To test this, we calculated several unsupervised clustering metrics on full, unprojected latent descriptions of zebra finch and mouse syllables. By these measures, both bird syllables and mouse USVs were more clustered than moment-matched samples of Gaussian noise, a simple null hypothesis, but mouse USVs were closer to the null than to birdsong on multiple goodness-of-clustering metrics (Figure 5d, Figure 5—figure supplement 1). Additionally, we find that MUPET acoustic features admit uniformly poorer clusters than latent features as quantified by the same metrics (Figure 5—figure supplements 2 and 3). On a more practical note, we compared the consistency of cluster labels assigned by Gaussian mixture models (GMMs) trained on disjoint subsets of zebra finch and mouse latent syllable descriptions to determine how well the structure of the data determines cluster membership across repeated fittings (Figure 5—figure supplement 4). We find near-perfectly consistent assignments of zebra finch syllables into six clusters and much less consistent clusters for mouse syllables for more than two clusters. Finally, we tested whether the data contained noticeable gaps between syllables in different clusters. If syllable clusters are well-defined, there should not exist smooth sequences of datapoints connecting distinct examples. However, we find that even the most acoustically disparate syllables can be connected with a sequence of syllables exhibiting more-or-less smooth acoustic variation (Figure 5e), in contrast to zebra finch syllables (Figure 5—figure supplement 5). Thus, even though clustering may not constitute the best account of mouse USV syllable structure, learned latent features provide useful tools to both explore and quantify the acoustic variation within and across species.
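As one concrete example of these clustering checks, the sketch below compares the mean silhouette coefficient of GMM-assigned labels on latent features against the same procedure applied to covariance-matched Gaussian noise. The number of clusters and other details are illustrative, not the exact settings used for the figure.

```python
# Sketch of the clustering-vs-null comparison behind Figure 5d: silhouette score
# of GMM labels on latent features, minus the same score on covariance-matched
# Gaussian noise fit by the same algorithm.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def silhouette_above_null(latents, n_clusters=6, seed=0):
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    score = silhouette_score(latents, gmm.fit_predict(latents))
    # Null: Gaussian noise with the same mean and covariance as the data.
    rng = np.random.default_rng(seed)
    noise = rng.multivariate_normal(latents.mean(0), np.cov(latents.T), size=len(latents))
    null_labels = GaussianMixture(n_components=n_clusters, random_state=seed).fit_predict(noise)
    null_score = silhouette_score(noise, null_labels)
    return score - null_score  # positive values indicate more clustering than noise
```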

Measuring acoustic variability over tens of milliseconds

The results above have shown that data-derived latent features represent more information about syllables than traditional metrics and can successfully capture differences within and between individuals and groups. Here, we consider how a related approach can also shed light on the short-time substructure of vocal behavior.

The analysis of syllables and other discrete segments of time is limited in at least two ways. First, timing information, such as the lengths of gaps between syllables, is ignored. Second, experimenters must choose the unit of analysis (syllable, song motif, bout), which has a significant impact on the sorts of structure that can be identified (Kershenbaum et al., 2016). In an attempt to avoid these limitations, we pursued a complementary approach, using the VAE to infer latent descriptions of fixed duration audio segments, irrespective of syllable boundaries. Similar to the shotgun approach to gene sequencing (Venter et al., 1998) and a related method of neural connectivity inference (Soudry et al., 2015), we trained the VAE on randomly sampled segments of audio, requiring that it learn latent descriptions sufficient to characterize any given time window during the recording. That is, this ‘shotgun-VAE’ approach encouraged the autoencoder to find latent features sufficient to ‘glue’ continuous sequences back together from randomly sampled audio snippets.
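The sampling scheme itself is straightforward: draw windows of fixed duration at random onsets within annotated vocal bouts and convert each to a spectrogram for training. The sketch below illustrates the sampling step; the bout format, window length, and function names are assumptions.

```python
# Sketch of 'shotgun' window sampling: fixed-duration audio snippets drawn at
# random onsets from within vocal bouts, irrespective of syllable boundaries.
import numpy as np

def sample_shotgun_windows(audio, sample_rate, bouts, window_s=0.2, n_windows=1000, seed=0):
    """audio: 1-D array; bouts: list of (onset_s, offset_s) pairs longer than window_s."""
    rng = np.random.default_rng(seed)
    win = int(window_s * sample_rate)
    windows = []
    for _ in range(n_windows):
        onset_s, offset_s = bouts[rng.integers(len(bouts))]      # pick a bout at random
        start = rng.integers(int(onset_s * sample_rate),
                             int(offset_s * sample_rate) - win)  # random onset within the bout
        windows.append(audio[start:start + win])
    return np.stack(windows)  # each row is then converted to a spectrogram for VAE training
```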

Figure 6a shows a UMAP projection of latent features inferred from fixed-duration segments from a subset of the mouse USVs shown in Figure 5b. While this projection does display some structure (silence on the right, shorter to longer syllables arranged from right to left), there is no evidence of stereotyped sequential structure (see also Figure 6—videos 1 and 2). In contrast, Figure 6b shows the same technique applied to bouts of zebra finch song, with the song represented as a single well-defined strand coursing clockwise from the bottom to the top left of the projection. Other notable features are the loop on the left containing repeated, highly variable introductory notes that precede and often join song renditions and a ‘linking note’ that sometimes joins song motifs. Most importantly, such a view of the data clearly illustrates not only stereotypy but variability: introductory notes are highly variable, but so are particular syllables (B, E) in contrast to others (C, F).

Figure 6 with 4 supplements
A shotgun variational autoencoder approach learns low dimensional latent representations of subsampled, fixed-duration spectrograms and captures short-timescale variability in behavior.

(a) A UMAP projection of 100,000 200-ms windows of mouse ultrasonic vocalizations (cp. Figure 5b). (b) A UMAP projection of 100,000 120-ms windows of zebra finch song (cp. Figure 5a). Song progresses counterclockwise on the right side, while more variable, repeated introductory notes form a loop on the left side. (c) A single rendition of the song in (b). (d) The song’s first principal component in latent space, showing both directed (cyan) and undirected (purple) renditions. (e) In contrast to a syllable-level analysis, the shotgun approach can measure zebra finch song variability in continuous time. Song variability in both directed (cyan) and undirected (purple) contexts is plotted (see Materials and methods).

Following this, we asked whether the shotgun VAE method could be used to assess the phenomenon of reduced variability in directed birdsong (Sossinka and Böhner, 1980). We examined the song portion of Figure 6b in both directed and undirected conditions, warping each in time to account for well-documented differences in rhythm and tempo. We then trained a VAE on randomly sampled 80 ms portions of the warped spectrograms. As a plot of the first principal component of the latent space shows (Figure 6d), the VAE is able to recover the expected reduction in directed song variability on a tens-of-milliseconds timescale relevant to the hypothesized neural underpinnings of the effect (Fee and Goldberg, 2011). This result recapitulates similar analyses that have focused on harmonic and tonal syllables like A and B in Figure 6c (Kao and Brainard, 2006), but the shotgun VAE method is applicable to all syllables, yielding a continuous estimate of song variability (Figure 6e). Thus, not only do VAE-derived latent features capture structural properties of syllable repertoires, the shotgun VAE approach serves to characterize continuous vocal dynamics as well.

Latent features capture song similarity

Above, we saw how a subsampling-based ‘shotgun VAE’ approach can capture fine details of zebra finch vocal behavior modulated by social context. However, a principal reason songbirds are studied is their astonishing ability to copy song. A young male zebra finch can successfully learn to sing the song of an older male over the course of multiple months after hearing only a handful of song renditions. At least three methods exist for quantifying the quality of song learning outcomes, with two using handpicked acoustic features (Tchernichovski et al., 2000; Tchernichovski and Mitra, 2004; Mandelblat-Cerf and Fee, 2014) and another using Gaussian distributions in a similarity space based on power spectral densities to represent syllable categories (Mets and Brainard, 2018). We reiterate that handpicked acoustic features are sensitive to experimenter choices, with some acoustic features like pitch only defined for certain kinds of sounds. Additionally, restrictive parametric assumptions limit the potential sensitivity of a method. By contrast, the VAE learns a compact feature representation of complete spectrograms and MMD provides a convenient nonparametric difference measure between distributions. Therefore, as a final assessment of the VAE’s learned features, we asked whether latent features reflect the similarity of tutor and pupil songs.

To assess the VAE’s ability to capture effects of song learning, we trained a syllable VAE and a shotgun VAE on the song motifs of 10 pairs of adult zebra finch tutors and pupils. As Figure 7a shows for the syllable-level analysis, most syllables form distinct clusters in a latent UMAP embedding, with many tutor/pupil syllable pairs sharing a cluster (Figure 7—video 1). For a specific tutor/pupil pair shown in Figure 7b, we highlight three corresponding syllables from the two motifs. The first and third syllables (C and E) are well-copied from tutor to pupil, but the pupil’s copy of the second syllable (syllable D) does not contain the high-frequency power present in the first half of the tutor’s syllable. This discrepancy is reflected in the latent embedding, with the two versions of syllable D corresponding to distinct clusters. We quantified the quality of copying by calculating MMD between each pair of tutor and pupil syllables, shown in Figure 7c. A dark band of low MMD values along the diagonal indicates well-copied syllables and syllable orderings for most birds.

Figure 7 with 3 supplements
Latent features capture similarities between tutor and pupil song.

(a) Latent UMAP projection of the song syllables of 10 zebra finch tutor/pupil pairs. Note that many tutor and pupil syllables cluster together, indicating song copying. (b) Example song motifs from one tutor/pupil pair. Letters A–E indicate syllables. Syllables “C” and “E” are well-copied, but the pupil’s rendition of syllable “D” does not have as much high-frequency power as the tutor’s rendition. The difference between these two renditions is captured in the latent UMAP projection below. Additionally, the pupil sings a concatenated version of the tutor’s syllables “A” and “B,” a regularity that a syllable-level analysis cannot capture. Thus, the pupil’s syllable “AB” does not appear near tutor syllable “B” in the UMAP projection. Scale bar denotes 100 ms. (c) Quality of song copying, estimated by maximum mean discrepancy (MMD) between every pair of tutor and pupil syllables. The dark band of low MMD values near the diagonal indicates good song copying. Syllables are shown in motif order. (d) Latent UMAP projection of shotgun VAE latents (60 ms windows) for the song motifs of 10 zebra finch tutor/pupil pairs. Song copying is captured by the extent to which pupil and tutor strands co-localize. (e) Top: Detail of the UMAP projection in panel d shows a temporary split in pupil and tutor song strands spanning the beginning of syllable “D” with poorly copied high-frequency power. Labeled points correspond to the motif fragments marked in panel b. Bottom: Details of the syllable and shotgun MMD matrices in panels c and f. Note high MMD values for the poorly copied “D” syllable. Additionally, the syllable-level analysis reports high MMD between the pupil’s fused “AB” syllable and the tutor’s “A” and “B” syllables, though the shotgun VAE approach reports high similarity (low MMD) between pupil and tutor throughout these syllables. (f) MMD between pupil and tutor shotgun VAE latents indexed by continuous-valued time-in-motif quantifies song copying on fine timescales. The dark bands near the diagonal indicate well-copied stretches of song. The deviation of MMD values from a rank-one matrix is displayed for visual clarity (see Figure 7—figure supplement 2 for details). Best viewed zoomed in.

To complement the previous analysis, we performed an analogous analysis using the shotgun VAE approach. After training the VAE on 60 ms chunks of audio drawn randomly from each song motif, we projected the learned latent features into two dimensions using UMAP (Figure 7d) with a modified distance to prevent motif ‘strands’ from breaking (see Materials and methods, Figure 7—figure supplement 1). As expected, we see a close correspondence between pupil and tutor motif strands, with only a single tutor/pupil pair (pink) not overlapping. In fact, out of the roughly 3.6 million possible pairings of pupils and tutors, it is easily apparent which is the true pairing. Additionally, we find that the finer details are informative. For instance, the poorly copied syllable ‘D’ from Figure 7b corresponds to a temporary divergence of the pupil strand from the tutor strand, reflected by large MMD values for both shotgun and syllable analyses (Figure 7e). Additionally, we find that the shotgun VAE approach accurately judges the pupil’s fused ‘AB’ syllable to be similar to the tutor’s ‘A’ and ‘B’ syllables, in contrast to the syllable-level analysis (Figure 7e, bottom). We quantified the quality of copying in continuous time for all tutor/pupil pairs by calculating MMD between the distributions of song latents corresponding to distinct times within motifs. Deviations from a simple rank-one structure are shown in Figure 7f (see Figure 7—figure supplement 2 for the original MMD matrix). Consistent with Figure 7c, a dark band near the diagonal indicates good copying, quantifying the quality of copying in much more detail than a syllable-level analysis could provide.

Discussion

The complexity and high dimensionality of vocal behavior have posed a persistent challenge to the scientific study of animal vocalization. In particular, comparisons of vocalizations across time, individuals, groups, and experimental conditions require some means of characterizing the similarity of selected groups of vocal behaviors. Feature vector-based approaches and widespread software tools have gone a long way toward addressing this challenge and providing meaningful scientific insights, but the reliance of these methods on handpicked features leaves open the question of whether other feature sets might better characterize vocal behavior.

Here, we adopt a data-driven approach, demonstrating that features learned by the VAE, an unsupervised learning method, outperform frequently used acoustic features across a variety of common analysis tasks. As we have shown, these learned features are more parsimonious (Figure 2—figure supplement 3), capture more variability in the data (Figure 2e and f), and better characterize vocalizations as judged by nearest neighbor similarity (Figure 3, Figure 3—figure supplement 1, Figure 3—figure supplement 2). Moreover, these features facilitate comparisons across sessions (Figure 4f), social contexts (Figure 4a–c), and individuals (Figure 4d–f, Figure 7), quantifying not only differences in mean vocal behavior (Figure 4d), but also in vocal variability (Figure 4c).

This data-driven approach is closely related to previous studies that have applied dimensionality reduction algorithms (UMAP [McInnes et al., 2018] and t-SNE [van der Maaten and Hinton, 2008]) to spectrograms to aid in syllable clustering of birdsong (Sainburg et al., 2019) and visualize juvenile song learning in the zebra finch (Kollmorgen et al., 2020). Additionally, a related recent publication (Sainburg et al., 2020) similarly described the application of UMAP to vocalizations of several more species and the application of the VAE to generate interpolations between birdsong syllables for use in playback experiments. Here, by contrast, we restrict use of the UMAP and t-SNE dimensionality reduction algorithms to visualizing latent spaces inferred by the VAE and use the VAE as a general-purpose tool for quantifying vocal behavior, with a focus on cross-species comparisons and assessing variability across groups, individuals, and experimental conditions.

Moreover, we have argued above that, despite conventional wisdom, clustering is not the best account of the diversity of mouse vocal behavior. We argued this on the basis of multiple converging lines of evidence, but note three important qualifications: first, the huge variety of vocal behavior among rodents (Berryman, 1976; Holy and Guo, 2005; Novakowski, 1969; Sadananda et al., 2008; Smith et al., 1977; Miller and Engstrom, 2007) suggests the possibility of clustered vocal behavior in some mouse strains not included in our data. Second, it is possible that the distinction between clustered and nonclustered data depends crucially on dataset size. If real syllables even occasionally fall between well-defined clusters, a large enough dataset might lightly ‘fill in’ true gaps. Conversely, even highly clustered data may look more or less continuous given an insufficient number of samples per cluster. While this is not likely given the more than 15,000 syllables in Figure 5, it is difficult to rule out in general. Finally, our purely signal-level analysis of vocal behavior cannot address the possibility that a continuous distribution of syllables could nevertheless be perceived categorically. For example, swamp sparrows exhibit categorical neural and behavioral responses to changes in syllable duration (Prather et al., 2009). Nonetheless, we argue that, without empirical evidence to this effect in rodents, caution is in order when interpreting the apparent continuum of USV syllables in categorical terms.

Lastly, we showed how a ‘shotgun VAE’ approach can be used to extend our approach to the quantification of moment-by-moment vocal variability. In previous studies, syllable variability has only been quantified for certain well-characterized syllables like harmonic stacks in zebra finch song (Kao and Brainard, 2006). Our method, by contrast, provides a continuous variability measure for all syllables (Figure 6e). This is particularly useful for studies of the neural basis of this vocal variability, which is hypothesized to operate on millisecond to tens-of-milliseconds timescales (Fee and Goldberg, 2011).

Nonetheless, as a data-driven method, our approach carries some drawbacks. Most notably, the VAE must be trained on a per-dataset basis. This is more computationally intensive than calculating typical acoustic features (≈1 hr training times on a GPU) and also prevents direct comparisons across datasets unless they are trained together in a single model. Additionally, the resulting learned features, representing nonlinear, nonseparable acoustic effects, are somewhat less interpretable than named acoustic features like duration and spectral entropy. However, several recent studies in the VAE literature have attempted to address this issue by focusing on the introduction of covariates (Sohn et al., 2015; Louizos et al., 2015; Khemakhem et al., 2019) and ‘disentangling’ approaches that attempt to learn independent sources of variation in the data (Higgins et al., 2017; Burgess et al., 2018), which we consider to be promising future directions. We also note that great progress in generating raw audio has been made in the past few years, potentially enabling similar approaches that bypass an intermediate spectrogram representation (van den Oord et al., 2016; Kong et al., 2020).

Finally, we note that while our focus in this work is vocal behavior, our training data are simply syllable spectrogram images. Similar VAE approaches could also be applied to other kinds of data summarizable as images or vectors. The shotgun VAE approach could likewise be applied to sequences of such vectors, potentially revealing structures like those in Figure 6b. More broadly, our results suggest that data-driven dimensionality reduction methods, particularly modern nonlinear, overparameterized methods, and the latent spaces that come with them, offer a promising avenue for the study of many types of complex behavior.

Materials and methods

Animal statement

All experiments were conducted according to protocols approved by the Duke University Institutional Animal Care and Use Committee (mice: A171-20-08, birds: A172-20-08).

Recordings

Recordings of C57BL/6 and DBA/2 mice were obtained from the MUPET Wiki (Van Segbroeck M et al., 2019). These recordings are used in Figure 2, Figure 2—figure supplements 1–5, Figure 3, Figure 3—figure supplements 1–3, Figure 4d–e, Figure 4—figure supplement 2, Figure 5e, and Appendix 1 (Appendix 1—table 2).

Additional recordings of female-directed mouse USVs are used in Figure 4 and Figure 4—figure supplement 3. These recordings comprise 36 male mice from various genetic backgrounds over 118 recording sessions of roughly 20 minutes each (≈40 total hours, 156,180 total syllables). USVs were recorded with an ultrasonic microphone (Avisoft, CMPA/CM16), amplified (Presonus TubePreV2), and digitized at 300 kHz (Spike 7, CED). A subset of these recordings corresponding to a single individual (17,400 syllables) is further used in Figure 5b–d, Figure 6a, Figure 2—figure supplement 6, Figure 5—figure supplement 1, Figure 5—figure supplement 2, Figure 5—figure supplement 4, and Figure 6—figure supplement 1. Because these recordings contained more noise than the first set of C57/DBA recordings, we removed false-positive syllables by training the VAE on all detected syllables, projecting latent syllables to two dimensions, and then removing syllables contained within the resulting cluster of noise syllables, with 15,712 syllables remaining (see Figure 2—figure supplement 6).

A single male zebra finch was recorded over a 2-day period (153–154 days post-hatch [dph]) in both female-directed and undirected contexts (14,270 total syllables, 1100 directed). Song was recorded with Sound Analysis Pro 2011.104 (Tchernichovski and Mitra, 2004) in a soundproof box. These recordings are used in Figure 2, Figure 2—figure supplement 1, Figure 2—figure supplement 2, Figure 2—figure supplement 3, Figure 2—figure supplement 4, Figure 2—figure supplement 5, Figure 3—figure supplement 2, Figure 4, Figure 4—figure supplement 1, Figure 5, Figure 5—figure supplement 1, Figure 5—figure supplement 3, Figure 5—figure supplement 4, Figure 5—figure supplement 5, Figure 6, Figure 6—figure supplement 1, Figure 6—figure supplement 2, and Appendix 1—table 1.

For Figure 7, we selected 10 adult, normally reared birds from different breeding cages in our colony. Until at least 60 dph, each of these pupil birds had interacted with only one adult male, the tutor from his respective breeding cage. We recorded the adult (>90 dph) vocalizations of these pupil birds for 5–12 days each with Sound Analysis Pro 2011.104 (Tchernichovski and Mitra, 2004), then recorded their respective tutors under the same conditions for 5–12 days each. These recordings are used in Figure 7 and Figure 7—figure supplement 1.

Software comparisons

We compared our VAE method to three widely used vocal analysis packages: MUPET 2.0 (Van Segbroeck et al., 2017), DeepSqueak 2.0 (Coffey et al., 2019) (for mouse USVs), and SAP 2011.104 (Tchernichovski and Mitra, 2004) (for birdsong), each with default parameter settings. MUPET clusters were found using the minimum number of clusters (10). DeepSqueak features were generated using the DeepSqueak ‘import from MUPET’ feature.

Audio segmenting

For all mouse USV datasets, individual syllable onsets and offsets were detected using MUPET with default parameter settings. The shotgun VAE analysis in Figure 6 was restricted to manually defined regions (bouts) of vocalization. In this figure, zebra finch songs were segmented semi-automatically: first, we selected four representative song motifs from each individual. Then we converted these to spectrograms using a Short Time Fourier Transform with Hann windows of length 512 and overlap of 256, averaged these spectrograms, and blurred the result using a Gaussian filter with 0.5 pixel standard deviation. The result was a song template used to match against the remaining data. Specifically, we looked for local maxima in the normalized cross-correlation between the template and each audio file. Matches corresponded to local maxima with cross-correlations above 1.8 median absolute deviations from the median, calculated on a per-audio-file basis. A spectrogram was then computed for each match. All match spectrograms were then projected to two dimensions using UMAP (McInnes et al., 2018), from which a single well-defined cluster, containing stereotyped song, was retained. Zebra finch syllable onsets and offsets were then detected using SAP (Tchernichovski and Mitra, 2004) on this collection of song renditions using default parameter settings. After segmentation, syllable spectrograms were projected to two dimensions using UMAP, and eight well-defined clusters of incorrectly segmented syllables were removed, leaving six well-defined clusters of song syllables. For Figure 7, song motifs were hand-labeled for approximately 10 min of song-rich audio per animal. These labels were used to train an automated segmentation tool, vak 0.3.1 (Nicholson and Cohen, 2020), for each animal. Trained vak models were used to automatically label motifs in the remaining audio data for each animal. Automatic segmentation sometimes divided single motifs or joined multiple motifs. To correct for these errors, short (<50 ms) gaps inside motifs were eliminated. After this correction, putative motif segments with durations outside 0.4–1.5 s were discarded. Syllable segments were derived from a subset of the vak motif segments by aligning the motif amplitude traces and manually determining syllable boundaries, resulting in 75,430 total syllable segments.
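For reference, the template-matching step can be sketched as follows: slide the averaged motif spectrogram across each file's spectrogram, compute a normalized cross-correlation at every offset, and keep local maxima that exceed a per-file robust threshold. This is an illustrative reimplementation under the stated parameters, not the exact code used.

```python
# Sketch of template matching for song detection: normalized cross-correlation
# between an average-motif template and a file's spectrogram, thresholded at
# 1.8 median absolute deviations above the per-file median.
import numpy as np
from scipy.signal import argrelextrema

def detect_motifs(spec, template, n_mads=1.8):
    """spec: (freq, time) spectrogram; template: (freq, t_template) averaged motif."""
    t_len = template.shape[1]
    t_norm = (template - template.mean()) / (template.std() + 1e-9)
    scores = np.empty(spec.shape[1] - t_len)
    for t in range(len(scores)):
        window = spec[:, t:t + t_len]
        w_norm = (window - window.mean()) / (window.std() + 1e-9)
        scores[t] = np.mean(t_norm * w_norm)  # normalized cross-correlation at offset t
    mad = np.median(np.abs(scores - np.median(scores)))
    threshold = np.median(scores) + n_mads * mad
    peaks = argrelextrema(scores, np.greater)[0]  # local maxima
    return peaks[scores[peaks] > threshold]       # candidate motif onsets (in frames)
```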

Spectrograms

Spectrograms were computed using the log modulus of a signal’s Short Time Fourier Transform, computed using Hann windows of length 512 and overlap of 256 for bird vocalization, and length 1024 and overlap 512 for mouse vocalization. Sample rates were 32 kHz for bird vocalization and 250 kHz for mouse vocalization, except for the recordings in Figure 4f, which were sampled at a rate of 300 kHz. The resulting time/frequency representation was then interpolated at desired frequencies and times. Frequencies were mel-spaced from 0.4 to 8 kHz for bird recordings in Figure 7, mel-spaced from 0.4 to 10 kHz for all other bird recordings, and linearly spaced from 30 to 110 kHz for mouse recordings. For both species, syllables longer than $t_{\max} = 200$ ms were discarded. Additionally, short syllables were stretched in time in a way that preserved relative duration, but encouraged the VAE to represent fine temporal details. Specifically, a syllable of duration $t$ was stretched by a factor of $\sqrt{t_{\max}/t}$. See Figure 2—figure supplement 5 for a comparison to linear frequency spacing for zebra finches and no time stretching for mice and zebra finches. The resulting spectrograms were then clipped to manually tuned minimum and maximum values. The values were then linearly stretched to lie in the interval [0,1]. The resulting spectrograms were 128 × 128 = 16,384 pixels, with syllables shorter than $t_{\max}$ zero-padded symmetrically.
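A simplified version of this preprocessing for mouse syllables (log-magnitude STFT, interpolation onto a fixed frequency grid, clipping, and rescaling to [0, 1]) is sketched below. Mel spacing, time stretching, and zero padding are omitted, and the clipping values shown are placeholders for the manually tuned values.

```python
# Sketch of syllable spectrogram preprocessing for mouse USVs: log-magnitude
# STFT (Hann, length 1024, overlap 512), interpolation onto a 128 x 128 grid
# spanning 30-110 kHz, clipping, and linear rescaling to [0, 1].
import numpy as np
from scipy.signal import stft
from scipy.interpolate import RegularGridInterpolator

def usv_spectrogram(audio, fs=250_000, f_min=30e3, f_max=110e3,
                    out_shape=(128, 128), clip_min=2.0, clip_max=6.0):
    f, t, s = stft(audio, fs=fs, window="hann", nperseg=1024, noverlap=512)
    log_s = np.log(np.abs(s) + 1e-12)
    interp = RegularGridInterpolator((f, t), log_s, bounds_error=False,
                                     fill_value=log_s.min())
    f_grid = np.linspace(f_min, f_max, out_shape[0])   # linear frequency spacing for mice
    t_grid = np.linspace(t[0], t[-1], out_shape[1])
    ff, tt = np.meshgrid(f_grid, t_grid, indexing="ij")
    img = interp(np.stack([ff.ravel(), tt.ravel()], axis=-1)).reshape(out_shape)
    img = np.clip(img, clip_min, clip_max)             # placeholder clip values
    return (img - clip_min) / (clip_max - clip_min)    # rescale to [0, 1]
```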

Model training

Our VAE is implemented in PyTorch (v1.1.0) and trained to maximize the standard evidence lower bound (ELBO) objective using the reparameterization trick and ADAM optimization (Kingma and Welling, 2013; Rezende et al., 2014; Paszke et al., 2017; Kingma and Ba, 2014). The encoder and decoder are deep convolutional neural networks with fixed architecture diagrammed in Figure 1—figure supplement 1. The latent dimension was fixed to 32, which was found to be sufficient for all training runs. The approximate posterior was parameterized as a normal distribution with low rank plus diagonal covariance: $q(z) = \mathcal{N}(z; \mu, uu^{\top} + \mathrm{diag}(d))$, where $\mu$ is the latent mean, $u$ is a 32 × 1 covariance factor, and $d$ is the latent diagonal, a vector of length 32. The observation distribution was parameterized as $\mathcal{N}(\mu, 0.1 I)$, where $\mu$ is the output of the decoder. All activation functions were rectified linear units (ReLUs). The learning rate was set to $10^{-3}$ and the batch size to 64.
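For concreteness, reparameterized sampling from this low-rank-plus-diagonal posterior can be written as below. This is a sketch of the distributional form described above, assuming the encoder outputs μ, u, and d; PyTorch's torch.distributions.LowRankMultivariateNormal provides an equivalent built-in.

```python
# Sketch of the reparameterization trick for q(z) = N(mu, u u^T + diag(d)).
# mu: (batch, 32), u: (batch, 32, 1), d: (batch, 32) with d > 0 (assumed encoder outputs).
import torch

def sample_posterior(mu, u, d):
    eps_diag = torch.randn_like(mu)                               # noise for the diagonal part
    eps_low_rank = torch.randn(mu.shape[0], 1, device=mu.device)  # noise for the rank-one part
    return mu + d.sqrt() * eps_diag + u.squeeze(-1) * eps_low_rank

# Equivalent built-in distribution (supports rsample() and log_prob()):
# q = torch.distributions.LowRankMultivariateNormal(loc=mu, cov_factor=u, cov_diag=d)
# z = q.rsample()
```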

Comparison of VAE and handpicked features

For each analysis tool (MUPET, DeepSqueak, SAP), we assembled two feature sets: one calculated by the comparison tool (e.g., MUPET features) and one a matched VAE set. For the first set, each feature calculated by the program was z-scored and all components with nonzero variance were retained (9/9, 10/10, and 13/14 components for MUPET, DeepSqueak, and SAP, respectively). For the second set, we trained a VAE on all syllables, computed latent means of these via the VAE encoder, and removed principal components containing less than 1% of the total feature variance (7, 5, and 5 out of 32 components retained for MUPET, DeepSqueak, and SAP syllables, respectively). Each feature set was used as a basis for predicting the features in the other set using k-nearest neighbors regression with k set to 10 and nearest neighbors determined using Euclidean distance in the assembled feature spaces. The variance-explained value reported is the average over five shuffled train/test folds (Figure 2e).
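
A sketch of this cross-prediction analysis using scikit-learn is below; the exact variance-explained computation in the paper may differ in details, and the two feature matrices are assumed to be aligned syllable-by-syllable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def knn_variance_explained(X_from, X_to, k=10, n_folds=5, seed=0):
    """Variance in X_to explained by k-nearest-neighbors regression on X_from."""
    r2_scores = []
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X_from):
        knn = KNeighborsRegressor(n_neighbors=k).fit(X_from[train], X_to[train])
        pred = knn.predict(X_from[test])
        resid = ((X_to[test] - pred) ** 2).sum()
        total = ((X_to[test] - X_to[train].mean(axis=0)) ** 2).sum()
        r2_scores.append(1.0 - resid / total)
    return np.mean(r2_scores)

# Example usage, with X_traditional and X_latent the per-syllable feature matrices:
#   X_trad = StandardScaler().fit_transform(X_traditional)           # z-scored traditional features
#   pca = PCA().fit(X_latent)
#   X_lat = pca.transform(X_latent)[:, pca.explained_variance_ratio_ > 0.01]
#   knn_variance_explained(X_trad, X_lat)   # traditional -> latent
#   knn_variance_explained(X_lat, X_trad)   # latent -> traditional
```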

Unlike latent features, traditional features do not come equipped with a natural scaling. For this reason, we z-scored traditional features, avoiding tethering our analyses to the particular scales of the acoustic features involved. Then, to fairly compare the effective dimensionalities of traditional and latent features in Figure 2d, we also z-scored the latent features, thereby disregarding their natural scaling. PCA was then performed on the resulting scaled feature sets.

Birdsong variability index

For Figure 4c, given a set $\{z_i \mid i = 1, \ldots, n\}$ of feature vectors of $n$ syllables, we defined a variability index for the data as follows:

(1) $\mathrm{V.I.} = \min_{z_i} \rho(z_i)$

where ρ(z) is proportional to a robust estimator of the variance of the data around z:

(2) $\rho(z) = \mathrm{median}_{z_j} \, \lVert z - z_j \rVert_2^2$

We calculate the above metric for every combination of syllable (A–F), feature set (SAP-generated vs. VAE-generated), and social context (directed vs. undirected) and report the variability index of the directed condition relative to the variability index of the undirected condition (Figure 4c).

For Figure 6e, we would ideally use the variability index defined above, but ρ(z) is expensive to compute for each datapoint, as required in (1). Thus, we use an approximate center point defined by the median along each coordinate: $\hat{z}^{i} \equiv \mathrm{median}(z^{i})$, where the superscript here represents the $i$th coordinate of $z$. That is, $\hat{z}$ contains the medians of the marginal distributions. This value is calculated for each combination of time point and social context (directed vs. undirected) and plotted in Figure 6e.
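
Both variants of the variability index can be computed directly from a matrix of latent features (one row per syllable or time point); a minimal NumPy sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

def variability_index(Z):
    """Exact V.I.: minimum over candidate centers z_i of the median squared
    distance from z_i to all points (Equations 1-2)."""
    sq_dists = cdist(Z, Z, metric="sqeuclidean")
    return np.median(sq_dists, axis=1).min()

def variability_index_fast(Z):
    """Approximate V.I. used for the time-resolved analysis: the median squared
    distance to the coordinate-wise median point z_hat."""
    z_hat = np.median(Z, axis=0)
    return np.median(((Z - z_hat) ** 2).sum(axis=1))
```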

Maximum mean discrepancy

We used the MMD integral probability metric to quantify differences in sets of syllables (Gretton et al., 2012). Given random variables $x$ and $y$, MMD is defined over a function class $\mathcal{F}$ as $\mathrm{MMD}(x, y) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_x[f(x)] - \mathbb{E}_y[f(y)] \right)$. Here, $\mathcal{F}$ was taken to be the unit ball in a reproducing kernel Hilbert space with a fixed spherical Gaussian kernel. For Figure 4e–f, the kernel width σ was chosen to be the median distance between points in the aggregate sample, a common heuristic (Gretton et al., 2012). In Figure 7, the kernel bandwidth was chosen to be 25% of the median distance between points in the aggregate sample in order to focus on finer differences in distributions. For Figure 4e, we obtained 20 approximately 6.5 min recordings of male C57BL/6 mice and 20 approximately 6.5 min recordings of male DBA/2 mice (see Recordings). Latent means of USVs from a single recording were treated as independent and identically distributed draws from a recording-specific USV distribution, and MMD was estimated using these latent means. In Figure 4e, the MMD values are plotted as a matrix and the order of rows was obtained by agglomerative clustering. The same procedure was followed for Figure 4f and Figure 7. For Figure 4f, a t-SNE embedding of the recording sessions was computed, with the distance between recording sessions taken to be the estimated MMD between them (see Figure 4—figure supplement 3 for the MMD matrix).
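
A simple (biased, V-statistic) estimator of MMD with a spherical Gaussian kernel and the median-distance bandwidth heuristic can be sketched as follows; whether the original analysis used the biased or unbiased estimator is not specified here, so treat this as illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def gaussian_kernel(X, Y, sigma):
    """Spherical Gaussian kernel matrix between row-wise samples X and Y."""
    return np.exp(-cdist(X, Y, "sqeuclidean") / (2.0 * sigma ** 2))

def mmd(X, Y, bandwidth_scale=1.0):
    """Biased MMD estimate between two sets of latent means.

    The kernel width is bandwidth_scale times the median pairwise distance in
    the aggregate sample (1.0 as in Figure 4e-f, 0.25 as in Figure 7).
    """
    agg = np.vstack([X, Y])
    sigma = bandwidth_scale * np.median(pdist(agg))
    k_xx = gaussian_kernel(X, X, sigma).mean()
    k_yy = gaussian_kernel(Y, Y, sigma).mean()
    k_xy = gaussian_kernel(X, Y, sigma).mean()
    return np.sqrt(max(k_xx + k_yy - 2.0 * k_xy, 0.0))
```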

Unsupervised clustering metrics

We used three unsupervised clustering metrics to assess the quality of clustering for both zebra finch and mouse syllables: the mean silhouette coefficient (Rousseeuw, 1987), the Calinski–Harabasz Index (Caliński and Harabasz, 1974), and the Davies–Bouldin Index (Davies and Bouldin, 1979). For each species (zebra finch and mouse), we partitioned the data for 10-fold cross-validation (train on 9/10 of the data, test on the held-out 1/10). For a null comparison, for each of the 10 disjoint subsets of the data, we created a synthetic Gaussian noise dataset matched for covariance and number of samples. These synthetic noise datasets were then used to produce the dotted line in Figure 5d.

For each data split, we fit a GMM with full covariance to the training set using Expectation Maximization. We then evaluated each clustering metric on the test set. The number of clusters, k, was set to six in Figure 5d, but qualitatively similar results were obtained when k was allowed to vary between 2 and 12 (Figure 5—figure supplement 1). Reported values in Figure 5d and Figure 5—figure supplement 1 are the differences between the unsupervised metrics evaluated on real data and on Gaussian noise for each cross-validation fold, with signs chosen so that higher values indicate more clustered data.
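
A simplified sketch of this procedure with scikit-learn is below; it draws a single covariance-matched noise dataset rather than one per data subset, and the sign of the Davies–Bouldin index is flipped so that higher values indicate more clustered data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def clustering_scores(Z, n_clusters=6, n_folds=10, seed=0):
    """Cross-validated clustering metrics on real data minus matched Gaussian noise."""
    rng = np.random.default_rng(seed)
    noise = rng.multivariate_normal(Z.mean(axis=0), np.cov(Z.T), size=len(Z))
    results = []
    for data in (Z, noise):
        fold_scores = []
        for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(data):
            gmm = GaussianMixture(n_clusters, covariance_type="full",
                                  random_state=seed).fit(data[train])
            labels = gmm.predict(data[test])          # cluster labels on held-out data
            fold_scores.append((silhouette_score(data[test], labels),
                                calinski_harabasz_score(data[test], labels),
                                -davies_bouldin_score(data[test], labels)))
        results.append(np.array(fold_scores))
    real, null = results
    return real - null   # per-fold differences; higher = more clustered than noise
```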

Shotgun VAE

To perform the analysis in Figure 6a–b, regions of active vocalization were defined manually for both species (22 min of mouse recordings, 2 min of zebra finch recordings). Zebra finch bouts containing only calls and no song motifs were excluded. For both species, the duration of audio chunks was chosen to be roughly as long as the longest syllables (zebra finch: 120 ms; mouse: 200 ms). No explicit training set was made. Rather, onsets and offsets were drawn uniformly at random from the set of fixed-duration segments and the corresponding spectrograms were computed on a per-datapoint basis. Thus, the VAE likely never encountered the same spectrogram twice, encouraging it to learn the underlying time series.
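
The "shotgun" sampling scheme can be sketched as a dataset that draws a fresh random window each time an example is requested; `make_spectrogram` is an assumed helper standing in for the spectrogram pipeline described above.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ShotgunDataset(Dataset):
    """Draws random fixed-duration windows from regions of active vocalization."""

    def __init__(self, bouts, fs, window_s=0.12, samples_per_epoch=10000):
        self.bouts = bouts                 # list of 1-D audio arrays (vocal bouts)
        self.fs = fs
        self.win = int(window_s * fs)      # e.g. 120 ms (finch) or 200 ms (mouse)
        self.n = samples_per_epoch

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        # Onsets are drawn uniformly at random, so the model essentially never
        # sees the same spectrogram twice during training.
        bout = self.bouts[np.random.randint(len(self.bouts))]
        onset = np.random.randint(len(bout) - self.win)
        chunk = bout[onset:onset + self.win]
        spec = make_spectrogram(chunk, fs=self.fs)   # assumed helper, returns 128 x 128
        return torch.from_numpy(spec).float()
```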

To perform the variability reduction analysis in Figure 6d–e, song renditions were collected (see Audio segmenting) and a spectrogram was computed for each. The whole collection of spectrograms was then jointly warped using piecewise-linear time warping (Williams et al., 2019). Fixed-duration training spectrograms were made by interpolating normal spectrograms (as described in Spectrograms) at linearly spaced time points in warped time, generally corresponding to nonlinearly spaced time points in real time. As above, spectrograms were made during training on a per-datapoint basis. After training the VAE on these spectrograms, latent means were collected for 200 spectrograms for each song rendition, linearly spaced in warped time from the beginning to the end of the song bout. Lastly, for each combination of condition (directed vs. undirected song) and time point, the variability index described above was calculated. A total of 186 directed and 2227 undirected song renditions were collected and analyzed.

To generate the shotgun VAE training set for Figure 7, 2000 vak-labeled motifs were selected from each animal (see Audio segmenting). A single 60 ms window was drawn from each motif to create a training set of 40,000 total spectrogram windows drawn from the 20-animal cohort. After training a VAE on this dataset, the hand-labeled motif segments used to train vak models (see Audio segmenting) were segmented into overlapping 60 ms windows that spanned each motif with an 8 ms step size between successive windows (52,826 total windows). These windows were reduced with the trained VAE and their latent means subsequently analyzed.

Modified UMAP

Although the standard UMAP embedding of shotgun VAE latents from single-finch datasets generates points along smoothly varying strands (see Figure 6b), UMAP typically broke motifs into multiple strand-like pieces in the 20-animal dataset from Figure 7. To encourage embeddings that preserve the neighbor relationship of successive windows, we modified the distance measure underlying the UMAP. First, we computed the complete pairwise Euclidean distance matrix between all windows in latent space. Then, we artificially decreased the distance between successive windows from the same motif by multiplying corresponding distance matrix entries by 10-3. This precomputed distance matrix was then passed to UMAP as a parameter. See Figure 7—figure supplement 1 for a comparison of the two UMAP projections.
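
A sketch of this modification using umap-learn; the variable names are illustrative, and the full pairwise distance matrix is assumed to fit in memory.

```python
import numpy as np
import umap
from scipy.spatial.distance import pdist, squareform

def temporally_biased_umap(latents, motif_ids, shrink=1e-3, seed=0):
    """UMAP embedding that keeps successive windows of the same motif adjacent.

    latents:   (n_windows, latent_dim) array of latent means, ordered in time
    motif_ids: (n_windows,) array identifying the motif each window came from
    """
    dist = squareform(pdist(latents))                # full Euclidean distance matrix
    same_motif = motif_ids[:-1] == motif_ids[1:]     # successive windows, same motif
    idx = np.where(same_motif)[0]
    dist[idx, idx + 1] *= shrink                     # shrink successive-window distances
    dist[idx + 1, idx] *= shrink                     # keep the matrix symmetric
    return umap.UMAP(metric="precomputed", random_state=seed).fit_transform(dist)
```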

Data and code availability statement

The latest version of Autoencoded Vocal Analysis, the Python package used to generate, plot, and analyze latent features, is available online: https://github.com/pearsonlab/autoencoded-vocal-analysis (Goffinet, 2021; copy archived at swh:1:rev:f512adcae3f4c5795558e2131e54c36daf23b904). Mouse and zebra finch recordings used in this study are archived on the Duke Digital Repository: https://doi.org/10.7924/r4gq6zn8w.

Appendix 1

Appendix 1—table 1
Comparison of feature sets on the downstream task of predicting finch social context (directed vs. undirected context) given acoustic features of single syllables.

Classification accuracy, in percent, averaged over five disjoint, class-balanced splits of the data is reported. Empirical standard deviation is shown in parentheses. Euclidean distance is used for nearest neighbor classifiers. Each SAP acoustic feature is independently z-scored as a preprocessing step. Latent feature dimension is truncated when >99% of the feature variance is explained. Random forest (RF) classifiers use 100 trees and the Gini impurity criterion. The multilayer perceptron (MLP) classifiers are two-layer networks with a hidden layer size of 100, ReLU activations, and an L2 weight regularization parameter ‘alpha,’ trained with ADAM optimization with a learning rate of 10⁻³ for 200 epochs. D denotes the dimension of each feature set, with Gaussian random projections used to decrease the dimension of spectrograms.

Predicting finch social context (Figure 4a–c)
                    Spectrogram                            SAP          Latent
                    D = 10       D = 30       D = 100      D = 13       D = 5
k-NN (k = 3)        92.5 (0.2)   95.3 (0.1)   97.3 (0.3)   93.0 (0.3)   96.9 (0.2)
k-NN (k = 10)       93.0 (0.2)   95.3 (0.2)   97.1 (0.3)   93.2 (0.1)   96.7 (0.2)
k-NN (k = 30)       92.7 (0.1)   94.2 (0.2)   96.0 (0.2)   92.8 (0.1)   96.3 (0.1)
RF (depth = 10)     92.6 (0.1)   92.7 (0.1)   93.1 (0.1)   92.8 (0.1)   94.9 (0.2)
RF (depth = 15)     92.7 (0.1)   93.2 (0.2)   93.6 (0.1)   93.6 (0.2)   96.1 (0.1)
RF (depth = 20)     92.8 (0.1)   93.4 (0.1)   93.8 (0.1)   93.8 (0.2)   96.4 (0.2)
MLP (α = 0.1)       92.8 (0.4)   95.4 (0.4)   97.6 (0.3)   92.9 (0.1)   95.7 (0.1)
MLP (α = 0.01)      92.9 (0.3)   95.4 (0.3)   97.5 (0.2)   93.1 (0.2)   96.2 (0.1)
MLP (α = 0.001)     92.7 (0.6)   95.2 (0.5)   97.5 (0.2)   93.0 (0.2)   96.3 (0.0)
Appendix 1—table 2
Comparison of feature sets on the downstream task of predicting mouse strain (C57 vs. DBA) given acoustic features of single syllables.

Classification accuracy, in percent, averaged over five disjoint, class-balanced splits of the data is reported. Empirical standard deviation is shown in parentheses. Euclidean distance is used for nearest neighbor classifiers. Each MUPET and DeepSqueak acoustic feature is independently z-scored as a preprocessing step. Latent feature dimension is truncated when >99% of the feature variance is explained. Random forest (RF) classifiers use 100 trees and the Gini impurity criterion. The multilayer perceptron (MLP) classifiers are two-layer networks with a hidden layer size of 100, ReLU activations, and an L2 weight regularization parameter ‘alpha,’ trained with ADAM optimization with a learning rate of 10⁻³ for 200 epochs. D denotes the dimension of each feature set, with Gaussian random projections used to decrease the dimension of spectrograms.

Predicting mouse strain (Figure 4d–e)
                    Spectrogram                            MUPET        DeepSqueak   Latent
                    D = 10       D = 30       D = 100      D = 9        D = 10       D = 7
k-NN (k = 3)        68.1 (0.2)   76.4 (0.3)   82.3 (0.5)   86.1 (0.2)   79.0 (0.3)   89.8 (0.2)
k-NN (k = 10)       71.0 (0.3)   78.2 (0.1)   82.7 (0.6)   87.0 (0.1)   80.7 (0.3)   90.7 (0.4)
k-NN (k = 30)       72.8 (0.3)   78.5 (0.2)   81.3 (0.5)   86.8 (0.2)   81.0 (0.2)   90.3 (0.4)
RF (depth = 10)     72.8 (0.2)   76.6 (0.2)   79.1 (0.3)   87.4 (0.5)   81.2 (0.4)   88.1 (0.5)
RF (depth = 15)     73.1 (0.3)   78.0 (0.3)   80.5 (0.2)   87.9 (0.4)   82.1 (0.3)   89.6 (0.4)
RF (depth = 20)     73.2 (0.2)   78.3 (0.2)   80.7 (0.3)   87.9 (0.4)   81.9 (0.3)   89.6 (0.4)
MLP (α = 0.1)       72.4 (0.3)   79.1 (0.4)   84.5 (0.3)   87.8 (0.2)   82.1 (0.4)   90.1 (0.3)
MLP (α = 0.01)      72.3 (0.4)   78.6 (0.3)   82.9 (0.4)   88.1 (0.3)   82.4 (0.4)   90.0 (0.4)
MLP (α = 0.001)     72.4 (0.4)   78.5 (0.8)   82.8 (0.1)   87.9 (0.2)   82.4 (0.3)   90.4 (0.3)
Appendix 1—table 3
Comparison of feature sets on the downstream task of predicting mouse identity given acoustic features of single syllables.

Classification accuracy, in percent, averaged over five disjoint, class-balanced splits of the data is reported. A class-weighted log-likelihood loss is targeted to help correct for class imbalance. Empirical standard deviation is shown in parentheses. Each MUPET acoustic feature is independently z-scored as a preprocessing step. Latent feature principal components are truncated when >99% of the feature variance is explained. The multilayer perceptron (MLP) classifiers are two-layer networks with a hidden layer size of 100, ReLU activations, and an L2 weight regularization parameter ‘alpha,’ trained with ADAM optimization with a learning rate of 10⁻³ for 200 epochs. Chance performance is 2.8% for top-1 accuracy and 13.9% for top-5 accuracy. D denotes the dimension of each feature set, with Gaussian random projections used to decrease the dimension of spectrograms.

Predicting mouse identity (Figure 4f)
                    Spectrogram                            MUPET        Latent
                    D = 10       D = 30       D = 100      D = 9        D = 8
Top-1 accuracy
MLP (α = 0.01)      9.9 (0.2)    14.9 (0.2)   20.4 (0.4)   14.7 (0.2)   17.0 (0.3)
MLP (α = 0.001)     10.8 (0.1)   17.3 (0.4)   25.3 (0.3)   19.0 (0.3)   22.7 (0.5)
MLP (α = 0.0001)    10.7 (0.2)   17.3 (0.3)   25.1 (0.3)   20.6 (0.4)   24.0 (0.2)
Top-5 accuracy
MLP (α = 0.01)      36.6 (0.4)   45.1 (0.5)   55.0 (0.3)   46.5 (0.3)   49.9 (0.4)
MLP (α = 0.001)     38.6 (0.2)   50.7 (0.6)   62.9 (0.4)   54.0 (0.2)   59.2 (0.6)
MLP (α = 0.0001)    38.7 (0.5)   50.8 (0.3)   63.2 (0.4)   57.3 (0.4)   61.6 (0.4)

Data availability

Dataset 1: Online, publicly available MUPET dataset: ~5 GB. Available at: https://github.com/mvansegbroeck/mupet/wiki/MUPET-wiki. Figs: 2, 3, 4d-e.

Dataset 2: Single zebra finch data: ~200-400 MB of audio generated as part of work in progress in Mooney Lab. Figs: 2e-f, 4a-c, 5a, 5d, 6b-e.

Dataset 3: Mouse USV dataset: ~30-40 GB of audio generated as part of work in progress in Mooney Lab. Figs: 4f.

Dataset 5: This is a subset of dataset 3, taken from a single mouse: ~1GB of audio. Figs: 5b-e, 6a.

Dataset 6: 10 zebra finch pupil/tutor pairs: ~60 GB of audio generated as part of work in progress in Mooney Lab. Figs: 7.

Datasets 2-6 are archived in the Duke Digital Repository (https://doi.org/10.7924/r4gq6zn8w).

The following previously published data sets were used
    1. Pearson J
    2. Mooney R
    3. Brudner S
    4. Goffinet J
    (2021) Duke Digital Repository
    Data from: Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires.
    https://doi.org/10.7924/r4gq6zn8w

References

    1. Dai B
    2. Wang Y
    3. Aston J
    4. Hua G
    5. Wipf D
    (2018)
    Connections with robust PCA and the role of emergent sparsity in variational autoencoder models
    The Journal of Machine Learning Research 19:1573–1614.
    1. Davies DL
    2. Bouldin DW
    (1979) A cluster separation measure
    IEEE Transactions on Pattern Analysis and Machine Intelligence 1:224–227.
    https://doi.org/10.1109/TPAMI.1979.4766909
    1. Gretton A
    2. Borgwardt KM
    3. Rasch MJ
    4. Schölkopf B
    5. Smola A
    (2012)
    A kernel two-sample test
    Journal of Machine Learning Research 13:723–773.
    1. Higgins I
    2. Matthey L
    3. Pal A
    4. Burgess C
    5. Glorot X
    6. Botvinick M
    7. Mohamed S
    8. Lerchner A
    (2017)
    beta-VAE: learning basic visual concepts with a constrained variational framework
    ICLR 2:6.
    1. van der Maaten L
    2. Hinton G
    (2008)
    Visualizing data using t-SNE
    Journal of Machine Learning Research 9:2579–2605.
  1. Software
    1. Nicholson D
    2. Cohen Y
    (2020)
    VAK
    0.3.
  2. Software
    1. Paszke A
    2. Gross S
    3. Chintala S
    4. Chanan G
    5. Yang E
    6. DeVito Z
    7. Lin Z
    8. Desmaison A
    9. Antiga L
    10. Lerer A
    (2017)
    Automatic differentiation in PyTorch
    NeurIPS 2017 Autodiff Workshop.
  3. Conference
    1. Sohn K
    2. Lee H
    3. Yan X
    (2015)
    Learning structured output representation using deep conditional generative models
    Advances in Neural Information Processing Systems. pp. 3483–3491.
  4. Book
    1. Tchernichovski O
    2. Mitra P
    (2004)
    Sound Analysis Pro User Manual
    New York: CCNY.
  5. Book
    1. Venter JC
    2. Adams MD
    3. Sutton GG
    4. Kerlavage AR
    5. Smith HO
    6. Hunkapiller M
    (1998)
    Shotgun Sequencing of the Human Genome
    Washington, D.C: American Association for the Advancement of Science.

Decision letter

  1. Jesse H Goldberg
    Reviewing Editor; Cornell University, United States
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom
  3. Jesse H Goldberg
    Reviewer; Cornell University, United States
  4. Ofer Tchernichovski
    Reviewer; Hunter College, City University of New York, United States
  5. Scott W Linderman
    Reviewer; Stanford University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Animal vocalizations are notoriously complex and difficult to categorize. Traditionally, sounds are transformed into spectrograms, which are then segmented into syllables and analyzed according to hand-selected features such as pitch, amplitude and frequency modulation. Here, the authors take a new approach: they use variational autoencoders to analyze vocalizations from songbirds and mice and find that they can quantify the similarity between distinct utterances. This approach will complement existing sound analysis methods to further our understanding of animal social behavior.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires" for consideration by eLife. Your article has been reviewed by 4 peer reviewers, including Jesse H Goldberg as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Ofer Tchernichovski (Reviewer #3); Scott W Linderman (Reviewer #4).

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife (but note caveat below).

The reviewers mostly agreed that the VAE is a potentially interesting approach to categorizing vocalization data, and there was enthusiasm about the codebase available in github. Some major problems that arose in review were (1) lack of strong behavioral insights; (2) lack of clarity about data pre-processing – and how this would affect results; and (3) concern about novelty given the widespread use of VAEs for similar problems.

We would, in principle, be open to considering a revised version of this manuscript if the relatively long list of concerns by reviewers 2 and 4 were adequately addressed and if the VAE approach could perform similarity-score metrics (as requested by reviewer 3).

Reviewer #1:

The authors proposed the use of variational autoencoder (VAE) to vocal recordings of model organisms (mouse and zebra finch) to better capture the acoustic features that are missed by conventional acoustic metrics. This manuscript explores the effectiveness of the VAE approach from two perspectives: (i) a motifs based clustering which seeks to match the acoustics data against several predetermined template and (ii) an unsupervised clustering based on randomly segmented vocal recordings. These approaches involve the generation of a collection of images from spectrograms that are then fed to variational encoders to estimate the number of latent variables. With these latent variables, the authors then employed UMAP to visualize the variation within the dataset. The analyses are well conducted and will be useful for broad range of scientists investigating animal vocalizations.

i. From the zebra finch's discussion, this approach performs well in clustering the song syllables based on the four motifs predefined and at the same time could delineate the differences between directed and undirected songs as compared to previous acoustic metrics. While the authors provided a comparison between the ability of SAP acoustics features and VAE latent features in differentiating the directed and undirected birdsong, a comparison between the clustering of song syllables with the use of SAP features and VAE latent feature was not offered (Figure 5a). It would be interesting to see a side-by-side comparison of the 'Goodness of Clustering' metric in this VAE approach vs SAP.

ii. As for the randomly segmented vocal recordings, this method could generate a manifold where different acoustic features were situated in different regions and offer a continuous variability measure for all syllables. It is worth noting that the vocal recordings were randomly segmented based on the length of one syllable. What happens when the length is extended beyond that? Will the manifold produced look like the one shown in Figure 6?

iii. As for the mouse vocal recordings, the VAE approach generates a continuum-like points cloud that performs reasonably well in differentiating different acoustic features of the mouse vocalizations albeit the lack of boundaries in separating them. Could the smooth variation of points be due to the sampling rate? The mouse vocalizations were sampled at a much higher rate (10x) as compared to bird vocalizations. I would expect a greater resolution for the mouse data and thus the VAE can capture more subtle differences between the vocalizations, yielding a continuum-like points cloud. Of course these sampling rate differences are justified because of the different spectral properties of birdsong and mouse USVs – but a simple sentence or two about how sampling rates may affect these analyses would be useful..

Reviewer #2:

Authors introduce a new method for behavioral analysis of vocalizations that supposedly improves on weaknesses of handpicked features, which is that they miss variability and introduce undesirable correlations. They apply VAEs to bird and mouse vocalizations and perform analysis of the latent space. They show that VAEs outperform handpicked features on some simple analysis tasks including time-resolved analysis of variability.

Overall, even if it is correct that their method outperforms traditional-features based analysis, I don't think this insight is in any way relevant. Essentially, it is like saying: here is a bad way of doing things, and we offer one that is slightly better but still much worse than the gold standard. The problem is that there are many methods out there for doing things right from the start, so I cannot see a need for VAEs for the problems addressed. The problem with VAEs is that they are compressive (not representing all aspects of the data), which is why they are less desirable than methods that operate on equivalent representations of the raw data. If you are given money to buy groceries, why would you throw away part of it before you enter the store?

Overall it seems authors have not much to argue for other than their opposition to handpicked features. They pretend not much is out there other than these features for behavioral analysis, which does not reflect the wealth of methods out there. They have not even given appropriate credit to the real usefulness of the handpicked-feature approach, which is interpretability (which their method lacks). For example, one main utility of handpicked features is that some are learned. Songbirds have been shown to learn pitch from a tutor, so pitch is relevant. The same cannot be said about their latent features: which one of these do birds learn from a tutor?

Also, correlations among features that authors criticize as a bug can be seen as a feature, it provides insights into the aspects of their vocalizations that animals cannot control independently.

I could also not detect any new behavioral insights. The only real strength in their manuscript I could detect is that their method allows them to visually segregate directed from undirected song (Figure 4b), vocalizations from different mice strains (Figure 4f), and songbird syllables (Figure 5a). Thus, their method could be a suitable pre-processing for clustering efforts. Also, on a positive note, their method also produces some interesting looking plots (Figure 6b) of to-be identified utility.

L15 Authors claim that finding concise metrics remains a challenge, despite the large number of concise metrics ranging from Euclidean, Hamming, cosine, Wasserstein, Jensen-Shannon, Bhattacharia, Levenstein, ROUGE, etc. Surprisingly, their paper deals with the least concise metric imaginable, a deep autoencoder with thousands of parameters!

L17 Also not clear what would substantiate the claim that vocal behavior remains poorly understood.

L19 They mention several powerful approaches to enable automatic analysis of vocalizations but cite none.

L20 Given the wished-for generality of the problem they would like to study, it sounds strange that key to a successful approach would be a software package. Software packages are the last stage of analysis tool development.

L26 [10] did not discover overnight consolidation of learned birdsong as claimed, but instead 'inappropriate consolidation', which is challenged by a more recent analysis in [26].

L29 Authors criticize 'correlations among features', claiming these could result in redundant characterizations of vocalizations. Implicitly they argue for features that do not correlate, e.g. independent component analysis. But again, no citations, no following up on the embarked ideas. Correlations could actually be a feature of the behavior itself (and quite interesting to study).

I do not believe that mouse USV syllables form a continuum of syllables. Because authors did not find clusters, this does not mean that they are not there. Rather than trying to find shortcomings (e.g. of their method or the number of samples analyzed), authors generalize from a negative finding to inexistence. By their rejection of vocal clustering, they also ignore previous results showing such clustering [18, 4, 48, 6, 16]. Quite audacious. Is the moon still there when I close my eyes?

In the caption of Figure 2d, authors state 'an ideal representation would exhibit minimal off-diagonal correlation', i.e., ICA is ideal. Why do we need VAEs then if ICA is ideal?

Caption Figure 2e, the representational capacity will depend on the number of features, which is not reported. Same for Figure 2f, the more features used, the more principal components will be needed, so this may be a trivial effect of unequal number of features.

With regards to Figure 2f, it is not even clear from their analysis whether for a given fixed dimensionality, VAEs encode more variance than simple PCA, and if so, at what cost on memory (principal components vs auto-encoder network). For example, in the original Hinton paper in Science, the outcome of this analysis was rather surprising (VAEs are not clearly outperforming PCA in terms of representational capacity).

Last paragraph of Page 4, other than some pretty pictures (Figure S4) there is no (numerical) evidence for their claims of superiority of their latent features.

L115-125 and Figure 4: This is an ill-advised analysis. Why would one choose SAP features to detect changes in song? It is obvious that change detection requires the most sensitive analysis possible, so why would one perform compression beforehand? Same goes for their latent features. Even if it is better than SAP, why would one choose it and not the raw data itself (and a suitable standard metric)?

Same for L 126-L135 on data in mice.

L 173-L187: Authors argue about clustering failures of MUPET using UMAP representations, ignoring the fact that UMAP provides a faulty account of true distance. Their analysis of latent features is a step in the right direction, but falls short of analysis of the raw data (or an equivalent representation, e.g. https://asa.scitation.org/doi/abs/10.1121/1.4731466 and others).

L214: Their method is not useful for analyzing moment-by-moment variability of song, because they need to pre-process songs by 'warping each in time to account for well-documented differences in rhythm and tempo', which is the only problem that would complicate a more rigorous analysis.

Reviewer #3:

This manuscript presents a new implementation of variational autoencoder machine learning algorithm (VAE) for the analysis of animal vocalization. The new method is impressive and powerful compared to existing methods. I also like the AVA Python program package, which is well documented. Results demonstrate that AVA can capture important biological differences between vocalization e.g., between directed and undirected songs in birds, and identify similar syllables in mouse song. It can clusters syllables to types in adult zebra finches. The evidence for the lack of clusters in the mouse song are strong and convincing, and of important implications.

The principal weakness of the manuscript, in its present form, is that only insufficient evidence are provided to allow judging how AVA can perform in more difficult tasks, for which software like SAP is often used. For example, can AVA perform robust tutor-pupil similarity measurement in zebra finches? Can it identify clusters in young zebra finches? There is also no serious attempt to show replicability across different training sets. Once these concerns are address I feel that the manuscript should be appropriate for publication.

1. Detecting similarly as in Figure S5 across random repetition of the same zebra finch syllable spectrograms is not convincing enough. It is important to show how well can AVA performs when comparing tutor and pupil zebra finch songs. This testing should include examples of tutor pupil zebra finch songs (some with high similarity and others with lower similarity) should be plotted using UMAP projection as in Figure 6b.

2. It is relatively easy to detect clusters in adult zebra finch songs, but to study song learning it is often needed to cluster song syllables in young birds. SAP can often detect clusters in 50-day old birds. I wonder if AVA can detect clusters even earlier? This would be a very convincing demonstration to the power and usability of the new approach. Again, it is critical to show how AVA behaves with presented with more challenging tasks.

3. One issue that bothers me a lot is how the specific training of the algorithm might affect the outcomes. Say for example that lab 1 trained AVA with one dataset, and lab 2 trained AVA with a second dataset. But assume that both datasets were randomly sampled from the same population of birds. How comparable the results would be? For example will a similarity measurement of the same tutor and pupil would be comparable across the labs who trained AVA independently?

4. I like the "higher representational capacity" of the new method, but sometimes "with much wisdom comes much sorrow": higher representation capacity can potentially cause trouble if it makes the method too sensitive to things we do not care about. At this level, I would like to see some evidence for robustness to noise. For example, it should be easy to test how sensitive AVA is for small differences in recording conditions, say, to recording with in a sound attenuation chamber while door is open or closed?

Reviewer #4:

Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires.

The authors use variational autoencoders (VAEs) to learn low-dimensional representations of spectrograms of birdsong and mouse ultrasonic vocalizations (USVs). They find these representations of vocal behavior to be useful for studying social interactions and differences between strains of mice. Further investigations suggest that mouse USVs do not cluster as nicely as previously thought, and rather span a continuous manifold in feature space. Finally, VAEs trained on random snippets of the spectrogram highlight variability (and stereotypy) in zebra finch songs, in contrast to the unstructured (or highly variable) space of mouse USVs.

The proposed model operates in the frequency domain, consuming snippets of time-warped spectrograms. Given that one of the central claims of this paper is the superiority of unsupervised methods for feature extraction, I think these preprocessing steps warrant further consideration. For example, the 2D convolutions in the VAE must implicitly leverage the fact that neighboring frequencies are adjacent in the 128x128 “images,” but does the choice of frequency spacing (mel-spaced for song birds vs linearly-spaced for mouse USVs) affect the learned representations? How important is the time-warping to downstream representations and analyses? The spectral preprocessing also complicates the “shotgun” analysis in Figure 6. Each point in the VAE latent space corresponds to a window of time rather than a single frame. How do the projections change as you vary the window size?

Motivated in part by these concerns, some recent approaches like WaveNet (van den Oord et al., 2016) have directly modeled raw waveform data. The sampling rates used for human speech and music (16kHz) are lower than those necessary for USVs, but the same principles should apply. For example, a pre-trained WaveNet with minimal fine-tuning for song bird or mouse USV could yield a very competitive generative model of vocalizations and offer a different representation of this behavior. The comparison may be beyond the scope of this paper, but I think it is worthy of discussion.

Overall, this paper offers a nice application of nonlinear latent variable models to vocal behavior data. The techniques themselves are not particularly novel – variational autoencoders have been widely used in the machine learning community for over five years now – and the finding that learned features can outperform handcrafted ones has been shown across many domains. Given the wealth of works on sequential VAEs for time-series data, I think the novelty of the shotgun VAE is somewhat overstated. In my view, the main contribution lies in the codebase (I looked through the Github repo and was quite impressed!), the analysis pipeline, and the proof-of-concept. That is why I think it is especially important to assess the sensitivity of these results to various design choices that went into the pipeline, including the very first choices about how to preprocess the raw waveform data into time-warped and windowed spectrograms.

Other comments:

– Figure 1a: It's not clear why the length 32 vector is appearing as a square matrix here.

– Please label which dataset (song bird or mouse) the point clouds and spectrograms are coming from in all of the figures. A consistent color scheme could help.

– Figure 2f only has three solid lines. Where is the plot of explained variance in MUPET features by VAE features?

– The paragraph starting on line 97 paints an overly optimistic view of VAEs. Designing deep generative models that can reliably disentangle latent factors is still an active area of research, as is model selection.

– Figures 3, S4, and S5 suggest that nearest neighbor reconstruction with DeepSqueak (and other handcrafted features) is surprisingly bad. Are you using just the Euclidean distance in DeepSqueak feature space? Did you whiten the DeepSqueak features before computing distances? Can you explain why it DeepSqueak is failing so dramatically?

– Throughout, the spectrograms fail to indicate the time window or frequency bands.

– Figure 4a+b aim to show that SAP do not separate directed and undirected vocalizations as well as latent features do, but is this information simply not present in the first two PCs? A classification accuracy assessment would be more convincing.

– The 2D embedding in Figure 4f is confusing to me. Why not just show the full distance matrix from Figure S8, but with the lines to indicate which rows/columns belong to each mouse? That figure gives considerably more information than the tSNE embedding, in my opinion. In particular, it looks like there is a solid group of C57 mice that are very similar to DBA mice, as measured by MMD. The use of tSNE seems rather arbitrary and lossy. Moreover, the colors in Figure 4f convey little information beyond identity, when there seems to be lots of extra info about strain that could be conveyed.

– There are no axis labels or titles in Figure 5a-c, just random clouds of points.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Jesse H Goldberg as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Ofer Tchernichovski (Reviewer #2); Scott W Linderman (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. This was deemed a strong resubmission and only one relatively minor issue related to Figure 7 is necessary for revision.

Essential revisions:

1. Figure 7 does not show clearly enough how the SG approach overcomes the problem of fused syllables (A and B). Figure 7c should somehow indicate the similarity in the fused AB vs A,B area. The issue is that it is not easy to see how the color code correspond to specific areas in the sonograms presented. An overlay panel could help here.

Reviewer #1:

In this paper, the authors addressed the reviewers' concerns and expanded extensively on the utility of variational autoencoder (VAE). The authors included an extra section discussing VAE 's capability in handling more complicated scenarios by studying the tutor and pupil song learning experiment. One can readily visualize the differences between tutor and pupil syllables via the latent embeddings. Although the latent features could be hard to interpret, one could view it as an initial exploratory analysis in identifying possible acoustic structure discrepancies. The authors also included additional data benchmarking latent features against conventional acoustics features for classification tasks and offered a more in-depth study comparing the clustering of song syllables using traditional acoustic features and VAE latent features. Moreover, they discussed the effect of time stretch and frequency spacing parameters on SAP features prediction and VAE's replicability issue for completeness.

The new Figure 7 showing tutor-pupil analyses is a welcome addition to the paper.

While it remains uncertain if this method will actually supersede others in quantifying finch and/or mouse datasets, this paper could, at minimum, provide a case study of advantages and disadvantages for using the VAE approach for vocalization datasets.

Reviewer #2:

This study applies an unsupervised learning approach for assessing acoustic similarity and for classifying animal vocalizations. Investigation focuses on mice vocalization and song learning in zebra finches. The method demonstrate an impressive capacity to map and compare vocal sounds in both species and to assess vocal learning. It has clear advantages upon existing methods. It is still an open question to what extent this approach can successfully capture vocal development during early stages of song learning. In particular, the learned latent features have no simple interpretation in production and perception of vocal sounds, which future studies will need to address.

Two remaining issues:

1. figure 7 does not show clearly enough how the SG approach overcomes the problem of fused syllables (A and B). Figure 7c should somehow indicate the similarity in the fused AB vs A,B area. The issue is that it is not easy to see how the color code correspond to specific areas in the sonograms presented. An overlay panel could help here.

2. The lack of song development analysis is still an issue.

Reviewer #3:

I thank the authors for their detailed responses. They have satisfactorily addressed all of my concerns.

https://doi.org/10.7554/eLife.67855.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

The reviewers mostly agreed that the VAE is a potentially interesting approach to categorizing vocalization data, and there was enthusiasm about the codebase available in github. Some major problems that arose in review were (1) lack of strong behavioral insights; (2) lack of clarity about data pre-processing – and how this would affect results; and (3) concern about novelty given the widespread use of VAEs for similar problems.

We thank all four reviewers for their detailed and thoughtful critiques. In our substantially revised manuscript, we have attempted to address each of the concerns above by adding a major new analysis and numerous supplementary analyses that we believe help to round out our discussion and demonstrate the utility of the VAE as a tool for vocal analysis. In particular:

In response to a suggestion by Reviewer 3, we have included a significant new analysis comparing the similarity of zebra finch tutor and pupil song, now Figure 7. We believe these new results both demonstrate an improvement on existing methods and clearly point the way toward addressing point (1) above, since assessing similarity between tutor and pupil song constitutes a key method of behavioral analysis in birdsong research.

1. In response to suggestions by Reviewers 1, 3, and 4, we have performed several new analyses (included as Supplementary Figures 4, 5, 8, 11, 13, 14, 15, 17, and 18 and Supplementary Tables 1-3) that examine the effects of preprocessing choices, the reliability of training, and the quantitative performance of VAE features for downstream tasks. We discuss each of these more fully below; collectively, these work to address the concerns raised under point (2) above.

2. In response to concerns raised by Reviewers 2 and 4, we have revised the text to better place our work in the context of both vocal analysis and machine learning more broadly (point (3)). We clarify that, while the VAE has been applied to many kinds of problems, establishing its usefulness for a new audience is nonetheless valuable (as the reviewers’ comments attest) and necessitates a careful comparison with existing approaches to build confidence in the method.

Taken together, we believe these revisions both address each of the key concerns listed above and greatly improve the rigor and reach of the work. Replies to individual reviewers follow. Original comments are displayed in italics, our response, as here, in bold. Text added or altered in the revised manuscript is in blue.

We would, in principle, be open to considering a revised version of this manuscript if the relatively long list of concerns by reviewers 2 and 4 were adequately addressed and if the VAE approach could perform similarity-score metrics (as requested by reviewer 3).

Reviewer #1:

The authors proposed the use of variational autoencoder (VAE) to vocal recordings of model organisms (mouse and zebra finch) to better capture the acoustic features that are missed by conventional acoustic metrics. This manuscript explores the effectiveness of the VAE approach from two perspectives: (i) a motifs based clustering which seeks to match the acoustics data against several predetermined template and (ii) an unsupervised clustering based on randomly segmented vocal recordings. These approaches involve the generation of a collection of images from spectrograms that are then fed to variational encoders to estimate the number of latent variables. With these latent variables, the authors then employed UMAP to visualize the variation within the dataset. The analyses are well conducted and will be useful for broad range of scientists investigating animal vocalizations.

i. From the zebra finch's discussion, this approach performs well in clustering the song syllables based on the four motifs predefined and at the same time could delineate the differences between directed and undirected songs as compared to previous acoustic metrics. While the authors provided a comparison between the ability of SAP acoustics features and VAE latent features in differentiating the directed and undirected birdsong, a comparison between the clustering of song syllables with the use of SAP features and VAE latent feature was not offered (Figure 5a). It would be interesting to see a side-by-side comparison of the 'Goodness of Clustering' metric in this VAE approach vs SAP.

We thank the reviewer for this suggestion. Results from this analysis are shown in Figure S13, which shows latent features produce better clusters than MUPET features, and Figure S14, which shows that SAP features also admit well-clustered clusters. In fact, on two of the three metrics we examine, SAP features produce better clusters than VAE features, while on the third, VAE features perform best. We believe these results illustrate two points:

1. When data cleanly separate into clusters — as they do in the bird data — multiple methods perform well, with the best-performing method depending on what measure one wants to optimize. That is, tight, well-isolated clusters satisfy multiple (semantically overlapping) notions of what clustering entails.

2. When data do not obviously cluster, the particular definition of clustering, as well as the assumptions of clustering methods, play a much larger role. As a result, MUPET features produce worse clusters than VAE features (by the metrics investigated). We unpack this more fully in our response to Reviewer 2 below.

ii. As for the randomly segmented vocal recordings, this method could generate a manifold where different acoustic features were situated in different regions and offer a continuous variability measure for all syllables. It is worth noting that the vocal recordings were randomly segmented based on the length of one syllable. What happens when the length is extended beyond that? Will the manifold produced look like the one shown in Figure 6?

This is an excellent question. Results are presented in Figure S17. Clearly, the manifolds are not identical, since (a) changing the size of the window introduces a change in the amount of data latent features must represent, and (b) the UMAP projection does not necessarily preserve distances or orientation, only local relationships. Nonetheless, features of the song like the linking note spiral and the song motif, including the correct ordering of syllables, are still present, consistent with the fact that the shotgun VAE does preserve local similarities in the data.

iii. As for the mouse vocal recordings, the VAE approach generates a continuum-like points cloud that performs reasonably well in differentiating different acoustic features of the mouse vocalizations albeit the lack of boundaries in separating them. Could the smooth variation of points be due to the sampling rate? The mouse vocalizations were sampled at a much higher rate (10x) as compared to bird vocalizations. I would expect a greater resolution for the mouse data and thus the VAE can capture more subtle differences between the vocalizations, yielding a continuum-like points cloud. Of course these sampling rate differences are justified because of the different spectral properties of birdsong and mouse USVs – but a simple sentence or two about how sampling rates may affect these analyses would be useful..

The answer to the specific question here is simple — while the raw data were sampled at different rates, the spectrogram data on which the VAE operates were of the same dimension (128 x 128 after preprocessing) in both bird and mouse — but raises an important issue regarding clustering: Is the quasi-continuum we see in Figure 5b,c the result of some kind of “fill in” behavior (as we ourselves raise in ll. 318-328)? Might there be some feature of the VAE that favors an unclustered latent representation?

This is an important reason for including both bird and mouse data, as we have done in the manuscript. Under the same set of analyses, bird data reliably and cleanly cluster, while the mouse data are, even under the most generous interpretation, ambiguous. Moreover, our data set sizes are not drastically larger than those used by other groups (again, we use MUPET’s own benchmark data), which argues against the idea that clustering is stymied by the size of the data set. But of course, as Reviewers 1 and 2 both point out, UMAP is a visualization strategy, one not guaranteed to preserve metric information, which may lead to misleading displays. This is why we have performed all analysis on latent features (not UMAP representations) and attempted to tackle the problem from multiple directions, as we further explain in response to Reviewer 2 below.

Reviewer #2:

Authors introduce a new method for behavioral analysis of vocalizations that supposedly improves on weaknesses of handpicked features, which is that they miss variability and introduce undesirable correlations. They apply VAEs to bird and mouse vocalizations and perform analysis of the latent space. They show that VAEs outperform handpicked features on some simple analysis tasks including time-resolved analysis of variability.

Overall, even if it is correct that their method outperforms traditional-features based analysis, I don't think this insight is in any way relevant. Essentially, it is like saying: here is a bad way of doing things, and we offer one that is slightly better but still much worse than the gold standard. The problem is that there are many methods out there for doing things right from the start, so I cannot see a need for VAEs for the problems addressed. The problem with VAEs is that they are compressive (not representing all aspects of the data), which is why they are less desirable than methods that operate on equivalent representations of the raw data. If you are given money to buy groceries, why would you throw away part of it before you enter the store?

We believe it would be helpful to clarify here: our claim is not that traditional feature based analysis is a “bad” way of doing things. Clearly, this approach, whatever its limitations, has proven tremendously fruitful and continues to be the standard of practice in much of the vocalization community. Our goal is to suggest the VAE as an additional tool that offers improvements in several cases where feature-based methods have encountered difficulty, as our analyses show.

We must also confess some confusion as to which particular gold standard the reviewer is referencing, and this makes it difficult for us to reply on this point. Moreover, we are unsure how some degree of data compression is to be avoided. Feature-based methods are all a form of compression — they select some aggregate measures of the data while eschewing others — as are spectrograms, which throw away phase data. If nothing else, standard statistical tests each reduce the entire data to a single summary statistic for hypothesis testing.

To be clear: when a given analysis question can be answered by a method operating on raw data (with no intermediate compressed representation), we are not, a priori, advocating against this. But in the literature, there are many studies in which feature based methods remain the approach of choice (Derégnaucourt et al. (2005), Gaub et al. (2010), Hammerschmidt et al. (2012), Holy and Guo (2005), Chabout et al. (2015), Woehr (2014)), and we would argue that the present manuscript shows that VAE methods may offer advantages in these cases. It is difficult to say much more than this without a particular alternative analysis in view.

Overall it seems authors have not much to argue for other than their opposition to handpicked features. They pretend not much is out there other than these features for behavioral analysis, which does not reflect the wealth of methods out there. They have not even given appropriate credit to the real usefulness of the handpicked-feature approach, which is interpretability (which their method lacks). For example, one main utility of handpicked features is that some are learned. Songbirds have been shown to learn pitch from a tutor, so pitch is relevant. The same cannot be said about their latent features: which one of these do birds learn from a tutor?

Also, correlations among features that authors criticize as a bug can be seen as a feature, it provides insights into the aspects of their vocalizations that animals cannot control independently.

We apologize for some lack of clarity on several points relevant to the reviewer’s concerns. In the interest of brevity and clarity, we have not attempted a comprehensive review of methods for the analysis of animal vocalization. Nor do we claim that the VAE could or should supersede all of these. Instead, we have focused on the feature-based approach, which overwhelmingly remains the standard of practice in the field and forms the basis of the analysis packages (SAP, MUPET, DeepSqueak) against which we provide extensive comparison.

We can also offer more clarity on two specific points the reviewer raises here:

1. While the learned latent features are not interpretable in the sense that they precisely correspond to constructs like pitch and maximum frequency well-known in signal processing, the point of panels 1a-c and Figure S1 is that this information is nonetheless present in the latent features in a straightforward way. That is, simple linear combinations of the latent features often correlate well (if imperfectly) with the traditional metrics of interest (Figure S1), in addition to capturing some information missed by these features.

The reviewer’s point about pitch is an important one, which we might rephrase as follows: it is not so much that juveniles learn pitch per se from a tutor as that juveniles are engaged in a complex, multi-feature problem of song learning, successful completion of which involves changes in pitch. That is, pitch forms only one axis along which behavior is changing, and our preference for describing the more complex learning problem in these terms is our own familiarity with this construct, both quantitatively and experientially. And while we do not wish to dispute the reasonability of this, as our new tutor/pupil song comparison shows (Figure 7), latent features precisely capture the similarity of the resulting songs, which has proven difficult to do with existing feature-based methods.

On decorrelation of features: As the reviewer notes, decorrelation of learned features is not, a priori, a bug, and the correlations among traditional features could possibly be put to interesting uses. In our analysis of feature decorrelation, our primary concern is for subsequent statistical testing, where nonlinear relationships among features violate the assumptions of many statistical tests and may lead to erroneous results. If traditional features were linearly related to one another, then a simple PCA or whitening operation, as is often done, would be sufficient to remedy this, but this does not appear to hold in real data (Figure S8). VAE features, which allow for nonlinear warping, avoid this. We have attempted to clarify this in the text (ll. 115-117):

“While correlations among features are not necessarily undesirable, they can complicate subsequent statistical testing because nonlinear relationships among features violate the assumptions of many statistical tests. VAE features, which allow for nonlinear warping, avoid this potential difficulty.”
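To illustrate this point concretely (an illustration added here, not taken from the manuscript), the short Python sketch below builds two synthetic features related as x and x³ plus noise, whitens them, and shows that the linear correlation vanishes while a nonlinear dependence between the whitened features remains. All data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "acoustic features" that are nonlinearly related by construction
# (the x vs. x^3 example mentioned in a later response).
x = rng.normal(size=5000)
features = np.column_stack([x, x**3 + 0.1 * rng.normal(size=x.size)])

# ZCA whitening: zero mean, identity covariance. This removes *linear* correlation.
centered = features - features.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
white = centered @ eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

print("correlation before whitening:", np.corrcoef(features.T)[0, 1])
print("correlation after whitening: ", np.corrcoef(white.T)[0, 1])   # ~0 by construction

# A nonlinear dependence survives whitening: the magnitudes of the two
# whitened coordinates remain correlated.
print("correlation of magnitudes:   ", np.corrcoef(np.abs(white).T)[0, 1])
```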

I could also not detect any new behavioral insights. The only real strength I could detect in their manuscript is that their method allows them to visually segregate directed from undirected song (Figure 4b), vocalizations from different mouse strains (Figure 4f), and songbird syllables (Figure 5a). Thus, their method could be a suitable pre-processing step for clustering efforts. Also, on a positive note, their method produces some interesting-looking plots (Figure 6b) of to-be-identified utility.

We apologize that what we believe to be novel behavioral insights were not better delineated. Briefly:

1. Mice produce stereotyped syllable repertoires that are stable across days (Figures 4f and S11, Table S3), but there is no evidence that the sequences in which these syllables are produced are stereotyped (Figure 6a).

2. We argue that the best account of USV distributions is an unclustered, quasicontinuum account (Figure 5; more on this below).

3. We show that zebra finch directed song is less variable in all syllable types, not just the tonal or harmonic syllables for which this has previously been established. That is, the VAE allows this to be observed and quantified for all syllables (Figures 4a-c), and to do so on sub-syllable timescales (Figures 6c-e).

4. We demonstrate a novel VAE-based method of computing tutor-pupil similarity in birdsong and quantify it in Figure 7.

L15 Authors claim that finding concise metrics remains a challenge, despite the large number of concise metrics ranging from Euclidean, Hamming, cosine, Wasserstein, Jensen-Shannon, Bhattacharyya, Levenshtein, ROUGE, etc. Surprisingly, their paper deals with the least concise metric imaginable, a deep autoencoder with thousands of parameters!

We apologize for the confusion. We are here using “metrics” in the more colloquial sense of “useful measurements,” not the mathematical construct the reviewer seems to be referencing. This has been fixed in the text, which now reads (ll. 16-17):

“[G]iven the variety and complex temporal structure of many behaviors, finding concise yet informative descriptions has remained a challenge.”

L19 They mention several powerful approaches to enable automatic analysis of vocalizations but cite none.

Thank you. This has been fixed (ll. 19-21).

L20 Given the wished-for generality of the problem they would like to study, it sounds strange that key to a successful approach would be a software package. Software packages are the last stage of analysis tool development.

The primary purpose of this work is to introduce and establish the VAE as a useful tool for vocal analysis. While we would contend that this is conceptually independent from any particular software implementation, we also note that, practically, code helps. We provide code as a means of disseminating these techniques, which Reviewers 3 and 4 applauded as providing additional value.

L26 [10] did not discover overnight consolidation of learned birdsong as claimed, but instead 'inappropriate consolidation', which is challenged by a more recent analysis in [26].

Thank you for this correction. The text has been edited to read (ll. 26-29):

“Collectively, these and similar software packages have helped facilitate numerous discoveries, including circadian patterns of song development in juvenile birds [10], …”

L29 Authors criticize 'correlations among features', claiming these could result in redundant characterizations of vocalizations. Implicitly they argue for features that do not correlate, e.g. independent component analysis. But again, no citations, no following up on the embarked ideas. Correlations could actually be a feature of the behavior itself (and quite interesting to study).

While we agree that correlations among features might provide interesting information, they can also reflect simple mathematical relationships among the feature sets chosen (x and x³ are strongly correlated by construction). However, as we have attempted to clarify under point 2 in our response above, such correlations can prove highly statistically problematic when attempting to test hypotheses about, e.g., changes in sets of vocal features across days, animals, or experimental conditions.

I do not believe that mouse USV syllables form a continuum of syllables. The fact that the authors did not find clusters does not mean that clusters are not there. Rather than examining shortcomings (e.g., of their method or the number of samples analyzed), the authors generalize from a negative finding to inexistence. By their rejection of vocal clustering, they also ignore previous results showing such clustering [18, 4, 48, 6, 16]. Quite audacious. Is the moon still there when I close my eyes?

We believe the reviewer is voicing here a warranted skepticism of our most controversial claim. And of course the absence of evidence is not evidence of absence. Yet, as we lay out in the manuscript (ll. 187-193), there is no absolute correct answer to the question of whether some particular data cluster, only more or less satisfying accounts. Reviewer 3 was convinced by our presentation, while others may remain skeptical. In favor of our conclusion, we note:

1. We argue from three directions: (a) the lack of apparent clustering for USVs as compared to bird song syllables (Figures 5a-b, S12); (b) the markedly worse performance of clustering algorithms (as quantified by several methods) in the mouse as compared to bird case (Figures 5c, S12-S15); (c) the existence of interpolating sequences within the data between qualitatively different USV syllables (Figures 5d, S16). Of these three, the last does not depend at all on VAE features and clearly fails for the bird song example.

2. Several of the works cited by ourselves and the reviewer ([4,6,16]) simply assume that mouse USVs cluster or that clustering offers a useful quantitative account of the behavior. One work [18] provides evidence of syllable clustering on the basis of a single-dimensional time/frequency curve description of USVs using a relatively small dataset of 750 syllables. We believe our analysis complements this approach by considering a more complete syllable description applied to much larger collections of syllables (17,400 syllables in Figures 5b-c). One of these works [48] reports a single bimodally-distributed acoustic feature, peak frequency, and refers to corresponding modes as clusters, but does not explicitly argue for clustering. To the best of our knowledge, our work provides the most comprehensive investigation of the USV syllable clustering hypothesis.

3. Many of the same works that lay claim to clustering do so using traditional features similar to MUPET features. To explicitly test whether MUPET features produce better-behaved clusters than learned latent features, we added the analysis depicted in Figure S13. We found that MUPET features actually produce much worse clusters than latent features. In fact, when quantified by three unsupervised clustering metrics, MUPET feature-derived clusters are judged to be less clustered than moment-matched Gaussian noise (a minimal sketch of this type of comparison is given after this list).

4. On a more practical note, we have added Figure S15, which explores the consistency of syllable clusters under independent clusterings of different data splits. If cluster structure is well-determined by the data, we would expect syllables to be reliably segregated into the same clusters. This is indeed what we find for six clusters of zebra finch syllables, the number of clusters found by hand-labeling. However, for mouse syllables, we find poor consistency for more than two clusters.

5. Thus, while we cannot definitively prove the absence of clusters (nor do we know how to make such a claim rigorous), we have shown by a series of analyses that the USV data we analyze do not readily conform to a clustering account in the way bird song syllables do. Not only are the results markedly worse quantitatively, we show that there are sequences of USVs in the data that connect even the most disparate syllable types. We believe this calls into question the utility of a clustering account of USVs.

6. Finally, we have attempted in our original manuscript to nuance our conclusion (ll. 318-328), discussing multiple reasons we might have failed to find clusters. These include the possibility of different results in other species/strains, dataset size, and a distinction between mathematical clustering and categorical perception.
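For readers who wish to run a similar check on their own feature sets, here is a minimal sketch of the comparison referenced in point 3: unsupervised clustering metrics computed for a syllable-by-feature matrix and for moment-matched Gaussian noise. The placeholder array `features`, the choice of k-means, and the particular metrics shown are illustrative assumptions, not necessarily the exact settings used for Figure S13.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def clustering_scores(features, n_clusters=6, seed=0):
    """Fit k-means and report three unsupervised clustering metrics."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    return {
        "silhouette": silhouette_score(features, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(features, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(features, labels),        # lower is better
    }

# `features`: syllables x feature dimensions (placeholder data here).
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 32))

# Moment-matched Gaussian noise: same mean vector and covariance as the data.
mean, cov = features.mean(axis=0), np.cov(features, rowvar=False)
noise = rng.multivariate_normal(mean, cov, size=features.shape[0])

print("data: ", clustering_scores(features))
print("noise:", clustering_scores(noise))
```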

In the caption of Figure 2d, authors state 'an ideal representation would exhibit minimal off-diagonal correlation', i.e., ICA is ideal. Why do we need VAEs then if ICA is ideal?

While ICA results in independence, it does so by choosing linear combinations of the original input vectors. The VAE can be viewed in one way as a nonlinear generalization of this demixing operation. So while ICA might be ideal if linear independence were the only desideratum, it must usually be paired with some form of dimension reduction, which the VAE will perform in a way that preserves more information.

Nonetheless, we agree that the point is confusing in the way it was stated. In the revised text, we have clarified (Figure 2 caption):

“When applied to the mouse USVs from a-c, the acoustic features compiled by the analysis program MUPET have high correlations, effectively reducing the number of independent measurements made.”

Caption Figure 2e, the representational capacity will depend on the number of features, which is not reported. Same for Figure 2f, the more features used, the more principal components will be needed, so this may be a trivial effect of unequal number of features.

With regards to Figure 2f, it is not even clear from their analysis whether for a given fixed dimensionality, VAEs encode more variance than simple PCA, and if so, at what cost on memory (principal components vs auto-encoder network). For example, in the original Hinton paper in Science, the outcome of this analysis was rather surprising (VAEs are not clearly outperforming PCA in terms of representational capacity).

Thank you. We have added the number of features to the caption of Figure 2e. In Figure 2f, we agree, and have truncated the latent feature dimension to match the number of traditional features for a fairer comparison.

“Reducing the dimensionality of data with neural networks” (Hinton and Salakhutdinov, 2006) concerns deterministic autoencoders, which we do not believe are directly relevant to this work.

Last paragraph of Page 4, other than some pretty pictures (Figure S4) there is no (numerical) evidence for their claims of superiority of their latent features.

While we would argue that Figures 3 and S6 (previously S4) demonstrate clear qualitative improvements in nearest neighbor recovery, we have also included in our revised submission additional details regarding the failure of DeepSqueak features in identifying nearest neighbors (Figure S8) and several comparisons between the utility of VAE features and other feature sets for downstream classification tasks (Tables S1-S3). As those results demonstrate, VAE features clearly outperform handpicked feature sets.

L115-125 and Figure 4: This is an ill-advised analysis. Why would one choose SAP features to detect changes in song? It is obvious that change detection requires the most sensitive analysis possible, so why would one perform compression beforehand? Same goes for their latent features. Even if it is better than SAP, why would one choose it and not the raw data itself (and a suitable standard metric)? Same for L 126-L135 on data in mice.

We appreciate the reviewer raising this point. In situations with many more features than data points, learning in a generalizable way that does not severely overfit the data requires extra assumptions, for example sparsity or other informative prior beliefs. Dimension reduction can be seen as an attempt to cope with this fundamental limitation: if we can distill the relevant variation in our data into a much lower dimensional space, then we can re-enter the classical regime with more data points than features where off-the-shelf regression and classification are known to work well.
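The following toy sketch (ours, not from the manuscript) illustrates the statistical point: with far more input dimensions than samples, a classifier fit on a low-dimensional projection can generalize as well as or better than one fit on the raw inputs, with many fewer effective parameters. The synthetic data, dimensions, and pipeline choices are all illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 200 "syllables" with 5000 input dimensions; only a low-dimensional
# subspace actually carries the class signal.
n, d, k = 200, 5000, 8
latents = rng.normal(size=(n, k))
labels = (latents[:, 0] + 0.5 * latents[:, 1] > 0).astype(int)
mixing = rng.normal(size=(k, d))
X = latents @ mixing + 0.5 * rng.normal(size=(n, d))

raw_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
reduced_clf = make_pipeline(StandardScaler(), PCA(n_components=k),
                            LogisticRegression(max_iter=2000))

print("raw inputs, d=5000:  ", cross_val_score(raw_clf, X, labels, cv=5).mean())
print("PCA to d=8, then fit:", cross_val_score(reduced_clf, X, labels, cv=5).mean())
```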

We also note that many highly influential analyses of song learning have relied on SAP features to detect changes in song (Tchernichovski et al. “Dynamics of the vocal imitation process: how a zebra finch learns its song” (2001); Ravbar et al. “Vocal exploration is locally regulated during song learning” (2012); Vallentin et al. "Inhibition protects acquired song segments during vocal learning in zebra finches." (2016); and many others).

L 173-L187: Authors argue about clustering failures of MUPET using UMAP representations, ignoring the fact that UMAP provides a faulty account of true distance. Their analysis of latent features is a step in the right direction, but falls short of analysis of the raw data (or an equivalent representation, e.g. https://asa.scitation.org/doi/abs/10.1121/1.4731466 and others).

First, all of our clustering comparisons are performed on features (MUPET, SAP, VAE, etc.), with UMAP only used for visualization, as stated in the manuscript. And we agree with the reviewer that the UMAP representation cannot tell the whole story. Our UMAP plot simply serves the purpose of providing a contrast between mouse and bird syllables and motivating further analyses.

As for the reviewer’s other concern, we think the reviewer is suggesting that clustering might better be performed on the raw data, and that failure to find clusters in the USV case might be attributable to dimension reduction. To this objection, we first note that the studies we cite that report clustering of USVs (with which the reviewer appears to agree) all use a feature-based clustering approach. It is hard to see how it can both be the case that (a) feature-based clustering methods are illegitimate for USVs and (b) reports of USV clustering based on features can be trusted.

Moreover, we are unsure what clustering based on raw data in this case might mean. Even our spectrograms, which ignore phase data, comprise tens of thousands of data points per observation, and clustering data in dimensions this high is known to be borderline meaningless without strong priors or dimension reduction (cf. Aggarwal, Hinneburg, and Keim (2001); Bouveyron and Brunet-Saumard (2014)). The reviewer may have one of these specific techniques in mind, but we also note that the highly structured nature of spectrograms implies that invariances in the interpretation of the data (like small pixel shifts in time) may not be respected by a naive clustering of raw data.

L214: Their method is not useful for analyzing moment-by-moment variability of song, because they need to pre-process songs by 'warping each in time to account for well-documented differences in rhythm and tempo', which is the only problem that would complicate a more rigorous analysis.

We apologize for some lack of clarity in our description. While time warping does improve our ability to align and thus compare syllables, the difference is not nearly so large as might be imagined. Figure S18 reproduces Figures 6d-e without time warping applied. More to the point, we are a bit confused by the contention that, apart from time warping, the moment-by-moment analysis of variability in song is trivial. We were unaware of any published results on this problem. Indeed several other reviewers noted this new analysis as a real advance beyond the current state of the art.

Reviewer #3:

This manuscript presents a new implementation of the variational autoencoder machine learning algorithm (VAE) for the analysis of animal vocalization. The new method is impressive and powerful compared to existing methods. I also like the AVA Python program package, which is well documented. Results demonstrate that AVA can capture important biological differences between vocalizations, e.g., between directed and undirected songs in birds, and identify similar syllables in mouse song. It can cluster syllables into types in adult zebra finches. The evidence for the lack of clusters in the mouse song is strong and convincing, and of important implications.

The principal weakness of the manuscript, in its present form, is that insufficient evidence is provided to allow judging how AVA can perform in more difficult tasks, for which software like SAP is often used. For example, can AVA perform robust tutor-pupil similarity measurement in zebra finches? Can it identify clusters in young zebra finches? There is also no serious attempt to show replicability across different training sets. Once these concerns are addressed, I feel that the manuscript should be appropriate for publication.

We appreciate the reviewer’s positive assessment of the work. Based on the reviewer’s suggestions, we have implemented several new analyses detailed below. We hope the reviewer agrees that these significantly enhance the work by demonstrating the robustness and applicability of the VAE.

1. Detecting similarity as in Figure S5 across random repetitions of the same zebra finch syllable spectrograms is not convincing enough. It is important to show how well AVA performs when comparing tutor and pupil zebra finch songs. This testing should include examples of tutor-pupil zebra finch songs (some with high similarity and others with lower similarity) plotted using a UMAP projection as in Figure 6b.

This is an excellent suggestion. To address this, we have added an entirely new analysis (Figure 7) to test how well the VAE-based approach performs detecting tutor and pupil song similarity. In this analysis we take 10 pairs of zebra finch tutors and pupils and inspect the VAE’s learned representations using both syllable-based and shotgun VAE approaches. For both approaches, we find highly similar learned representations for the tutor and pupil pairs (Figure 7a,d). Additionally, we find that the VAE represents finer scale variations in the quality of song copying (Figure 7b,e) and demonstrate that the maximum mean discrepancy (MMD) metric proposed earlier in the manuscript is effective for quantifying the quality of song copying. We believe this constitutes a novel advance on current practice for quantifying tutor/pupil similarity in zebra finch song.
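For readers unfamiliar with maximum mean discrepancy (MMD), the following is a minimal sketch of a biased Gaussian-kernel MMD estimator between two sets of latent vectors (for example, embeddings of tutor and pupil syllables). This is a generic illustration with placeholder data, not the exact code used for Figure 7, and the median bandwidth heuristic is our assumption.

```python
import numpy as np

def squared_distances(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    aa = np.sum(a ** 2, axis=1)[:, None]
    bb = np.sum(b ** 2, axis=1)[None, :]
    return np.maximum(aa + bb - 2.0 * a @ b.T, 0.0)

def mmd2(x, y, bandwidth=None):
    """Biased estimate of squared MMD between samples x and y (rows are latent vectors)."""
    if bandwidth is None:
        # Median heuristic: median pairwise distance over the pooled sample.
        pooled = np.vstack([x, y])
        dists = np.sqrt(squared_distances(pooled, pooled))
        bandwidth = np.median(dists[dists > 0])
    k = lambda a, b: np.exp(-squared_distances(a, b) / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# Placeholder latent features standing in for tutor and pupil syllable embeddings.
rng = np.random.default_rng(0)
tutor = rng.normal(size=(500, 32))
pupil = rng.normal(loc=0.2, size=(500, 32))
print("MMD^2 estimate:", mmd2(tutor, pupil))
```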

2. It is relatively easy to detect clusters in adult zebra finch songs, but to study song learning it is often necessary to cluster song syllables in young birds. SAP can often detect clusters in 50-day-old birds. I wonder if AVA can detect clusters even earlier? This would be a very convincing demonstration of the power and usability of the new approach. Again, it is critical to show how AVA behaves when presented with more challenging tasks.

We agree with the reviewer that the ability to cluster adult songs is no surprise. We also agree that identifying syllables in young birds is a much more interesting test. Indeed, as part of a separate collaboration between our labs, we have been pursuing these questions. In practice, we find the major obstacle to detecting syllable clusters in very young birds is our ability to reliably segment syllables. In order to bypass this difficulty, we are currently using the shotgun VAE approach for analyzing the song of very young birds. Encouragingly, despite the segmenting difficulties, we find that the syllable-based VAE analysis identifies some structure in the song of a 50-day-old juvenile, including a clear separation between calls and subsong.

Largely the same structure is evident five days earlier, when the bird is 45 days old.

3. One issue that bothers me a lot is how the specific training of the algorithm might affect the outcomes. Say, for example, that lab 1 trained AVA with one dataset and lab 2 trained AVA with a second dataset, but assume that both datasets were randomly sampled from the same population of birds. How comparable would the results be? For example, would a similarity measurement of the same tutor and pupil be comparable across the labs that trained AVA independently?

We share this concern and note two framing points. First, replicability will be a function of what question is subsequently asked of the VAE. For instance, any rotation of features in latent space is equally valid, so there is no guarantee – in fact, it is unlikely – that “latent feature 1” will be the same across training runs. However, other analyses that are invariant to this degree of freedom may replicate across retraining. Second, producing replicable results across training runs of neural network models constitutes an interesting and important research question in machine learning, one on which we have a separate manuscript in preparation. That is, while we believe this is an important and interesting point, a proper treatment of it is beyond the scope of this work.

To address the reviewer’s suggestion, we added Figure S5 to investigate the similarity of latent features across training runs, using both identical training data and disjoint subsets of data. For mouse USV syllables, we find very good correspondence in both cases, both in terms of pairwise distances and in terms of linear predictions of one set of latent features from the other. For zebra finch syllables, we find poorer correspondence in pairwise distances but fairly accurate linear predictions in both cases. Taken together with Figure S15, which shows near-perfect identification of zebra finch syllable clusters, we believe the poorer alignment of zebra finch syllables compared to mouse syllables is driven by the inconsistency of the between-cluster structure, that is, the relative positions and orientations of syllable clusters, which is underdetermined by the data.
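As a sketch of the kind of consistency check described above (with placeholder arrays rather than the manuscript's data or code): fit a linear map from one training run's latent features to another's and evaluate it on held-out syllables, and separately compare pairwise distances, which are invariant to rotations of either latent space.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

# Placeholder latent features for the same syllables from two training runs
# (run B is a rotated, noisy copy of run A here, purely for illustration).
rng = np.random.default_rng(0)
run_a = rng.normal(size=(3000, 32))
rotation = np.linalg.qr(rng.normal(size=(32, 32)))[0]
run_b = run_a @ rotation + 0.1 * rng.normal(size=run_a.shape)

# 1. Linear predictability of one run's features from the other's, on held-out syllables.
a_train, a_test, b_train, b_test = train_test_split(run_a, run_b, random_state=0)
r2 = LinearRegression().fit(a_train, b_train).score(a_test, b_test)
print("held-out R^2 of linear map, run A -> run B:", r2)

# 2. Rotation-invariant comparison: correlation of pairwise distances.
subset = rng.choice(len(run_a), size=500, replace=False)
r, _ = pearsonr(pdist(run_a[subset]), pdist(run_b[subset]))
print("correlation of pairwise distances:", r)
```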

4. I like the "higher representational capacity" of the new method, but sometimes "with much wisdom comes much sorrow": higher representational capacity can potentially cause trouble if it makes the method too sensitive to things we do not care about. At this level, I would like to see some evidence of robustness to noise. For example, it should be easy to test how sensitive AVA is to small differences in recording conditions, say, recording in a sound attenuation chamber with the door open versus closed?

This is well put and an important point to address. Because the VAE is trained to capture the distribution in the training data, it will represent the largest sources of variation in the features it learns. Unfortunately, this method has no built-in notion of “relevant” versus “irrelevant” features. However, in the experiments reported here, systematically varying each of the latent features does produce changes in syllable shapes that appear meaningful. That is, they do not appear to capture artifacts. However, if artifacts were present non-uniformly in the data, it is very likely that the VAE would encode this. This could be addressed in two ways: (a) better data preprocessing, which would remove the artifact; (b) approaches like domain adaptation and other methods of de-biasing machine learning that work by identifying sources of variance (like race or gender in lending decisions) that systems should ignore (see, for example, Louizos et al. “The variational fair autoencoder” (2015)). We believe this will be an interesting avenue to pursue in future work.
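One concrete way to check whether an individual latent dimension tracks meaningful acoustic variation rather than an artifact is to decode a sweep along that dimension and inspect the resulting spectrograms. The sketch below assumes a hypothetical decode(z) function mapping latent vectors to spectrogram images; the AVA package provides a trained decoder, but the interface shown here is illustrative, not its actual API.

```python
import numpy as np
import matplotlib.pyplot as plt

def latent_traversal(decode, base_z, dim, span=3.0, steps=7):
    """Decode spectrograms while sweeping one latent dimension.

    `decode` is assumed to map latent vectors of shape (n, latent_dim)
    to spectrogram images of shape (n, height, width); this interface is hypothetical.
    """
    zs = np.repeat(base_z[None, :], steps, axis=0)
    zs[:, dim] = np.linspace(-span, span, steps)
    return decode(zs)

# Example usage with a stand-in decoder (replace with the trained model's decoder).
def fake_decode(zs):
    return np.random.rand(len(zs), 128, 128)

specs = latent_traversal(fake_decode, base_z=np.zeros(32), dim=0)
fig, axes = plt.subplots(1, len(specs), figsize=(2 * len(specs), 2))
for ax, spec in zip(axes, specs):
    ax.imshow(spec, origin="lower")
    ax.axis("off")
plt.show()
```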

Reviewer #4:

Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires.

The authors use variational autoencoders (VAEs) to learn low-dimensional representations of spectrograms of bird song and mouse ultrasonic vocalizations (USVs). They find these representations of vocal behavior to be useful for studying social interactions and differences between strains of mice. Further investigations suggest that mouse USVs do not cluster as nicely as previously thought, and rather span a continuous manifold in feature space. Finally, VAEs trained on random snippets of the spectrogram highlight variability (and stereotypy) in zebra finch songs, in contrast to the unstructured (or highly variable) space of mouse USVs.

The proposed model operates in the frequency domain, consuming snippets of time-warped spectrograms. Given that one of the central claims of this paper is the superiority of unsupervised methods for feature extraction, I think these preprocessing steps warrant further consideration. For example, the 2D convolutions in the VAE must implicitly leverage the fact that neighboring frequencies are adjacent in the 128x128 “images,” but does the choice of frequency spacing (mel-spaced for song birds vs linearly-spaced for mouse USVs) affect the learned representations? How important is the time-warping to downstream representations and analyses? The spectral preprocessing also complicates the “shotgun” analysis in Figure 6. Each point in the VAE latent space corresponds to a window of time rather than a single frame. How do the projections change as you vary the window size?

This is an excellent point. In practice, preprocessing can be as important as the model itself in deriving usable results. In response to the reviewer’s questions, we have performed several new analyses, including:

1. In Figure S4, we compare latent representations of mel vs. linear frequency spacing for zebra finch syllables and stretched vs. non-stretched syllables for both mouse and zebra finch. For zebra finch syllables, we find that both comparisons produce latent features that encode hand-picked features to a similar extent and also display reasonably consistent pairwise distances (R² ≈ 0.4). For mouse syllables, we find that the time stretch does provide a small but consistent benefit in terms of latent features representing hand-picked features, and also fairly consistent pairwise distances with or without the time stretch (R² = 0.69).

2. We have repeated the shotgun VAE analysis for temporal windows of varying size. The results are shown in Figure S17. While there are obvious but trivial differences (overall rotations), major features such as the linking note and song motif of the bird song maps are reproduced, as is the coarse structure of the mouse USVs. However, fine details, particularly in the non-metric UMAP projection, do change, and these highlight the need, as mentioned in our response to Reviewer 3 above, to identify downstream analyses that do not depend on small details of these maps.

Motivated in part by these concerns, some recent approaches like WaveNet (van den Oord et al., 2016) have directly modeled raw waveform data. The sampling rates used for human speech and music (16kHz) are lower than those necessary for USVs, but the same principles should apply. For example, a pre-trained WaveNet with minimal fine-tuning for song bird or mouse USV could yield a very competitive generative model of vocalizations and offer a different representation of this behavior. The comparison may be beyond the scope of this paper, but I think it is worthy of discussion.

We agree that this is both a really interesting direction and worth discussing. We have added the following text to the manuscript (ll. 341-342):

“We also note that great progress in generating raw audio has been made in the past few years, potentially enabling similar approaches that bypass an intermediate spectrogram representation (van den Oord et al., 2016; Kong et al., 2020).”

Overall, this paper offers a nice application of nonlinear latent variable models to vocal behavior data. The techniques themselves are not particularly novel – variational autoencoders have been widely used in the machine learning community for over five years now – and the finding that learned features can outperform handcrafted ones has been shown across many domains.

We agree. Our purpose here is to introduce these methods to many in the neuroscience community who have not been aware of their utility and to demonstrate that they produce useful and novel results in this particular domain.

Given the wealth of works on sequential VAEs for time-series data, I think the novelty of the shotgun VAE is somewhat overstated.

This is fair. Incorporating some of the sequential VAE methods is an interesting direction for future work. We have modified the text to temper the claim of novelty for this approach. Nevertheless, we feel the subsampling-based approach offers an intuitive and accessible single-parameter stand-in for what are often substantially more powerful (but complicated) time series models.

In my view, the main contribution lies in the codebase (I looked through the Github repo and was quite impressed!), the analysis pipeline, and the proof-of-concept. That is why I think it is especially important to assess the sensitivity of these results to various design choices that went into the pipeline, including the very first choices about how to preprocess the raw waveform data into time-warped and windowed spectrograms.

We concur. To recap our efforts along these lines, we have included Figure S4, which investigates the effect of spectrogram frequency spacing and time stretching on the resulting latent features.

Other comments:

– Figure 1a: It's not clear why the length 32 vector is appearing as a square matrix here.

Thank you. This has now been fixed.

– Please label which dataset (song bird or mouse) the point clouds and spectrograms are coming from in all of the figures. A consistent color scheme could help.

Thank you for this suggestion. Green and brown/orange are used to indicate mouse vocalization (as in Figure 4d) and purple and cyan are used to indicate zebra finch vocalization (as in Figure 4a).

– Figure 2f only has three solid lines. Where is the plot of explained variance in MUPET features by VAE features?

We regret the confusion. In the original submission, only two lines were shown because there was a single set of latent features corresponding to the same syllables as both the MUPET and DeepSqueak features. In response to comments from reviewer 2, however, we have now truncated the single set of latent features to match the dimensions of the MUPET and DeepSqueak feature sets, resulting in separate lines. The caption for the figure now contains additional text:

“Latent features with colors labeled “MUPET” and “DeepSqueak” refer to the same set of latent features, truncated at different dimensions corresponding to the number of acoustic features measured by MUPET and DeepSqueak, respectively.”

– The paragraph starting on line 97 paints an overly optimistic view of VAEs. Designing deep generative models that can reliably disentangle latent factors is still an active area of research, as is model selection.

The reviewer is correct that many of these problems still constitute pressing research questions. We have omitted this mention of disentangling, instead adding a reference to (Dai et al., 2018), which specifically analyzes sparse latent representations in VAEs.

– Figures 3, S4, and S5 suggest that nearest neighbor reconstruction with DeepSqueak (and other handcrafted features) is surprisingly bad. Are you using just the Euclidean distance in DeepSqueak feature space? Did you whiten the DeepSqueak features before computing distances? Can you explain why DeepSqueak is failing so dramatically?

This is an excellent question. For Figures 3 and S7 (previously S5), we used Euclidean distance with standardized but not whitened DeepSqueak features (ll. 423-430). We have added Figure S8 to investigate the failure of the DeepSqueak feature set, in particular, in returning reasonable nearest neighbors, by constructing alternative versions of the feature set. In one version we whitened the DeepSqueak features; in the next we considered the feature set without the three features most poorly predicted by latent features (from Figure S1); and lastly, we considered the linear projection of DeepSqueak feature space most predictive of the VAE’s features. All three feature sets return more suitable nearest neighbors for some of the query spectrograms shown in Figure 3, but the improvements are slight and none are as close as the latent feature nearest neighbors. In particular, the poor performance of the last feature set, an attempt to bring DeepSqueak features in line with latent features, suggests that DeepSqueak features are not sufficient to capture the full range of mouse USV syllables.
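As an illustration of the whitening variant described above (with a placeholder feature matrix rather than the actual DeepSqueak features), the sketch below retrieves nearest neighbors of a query syllable under standardized versus PCA-whitened versions of a feature set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 10))   # placeholder: syllables x acoustic features
query_idx = 0

def nearest(features_transformed, query_idx, k=5):
    """Indices of the k nearest neighbors of the query (excluding the query itself)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features_transformed)
    _, idx = nn.kneighbors(features_transformed[query_idx:query_idx + 1])
    return idx[0, 1:]

standardized = StandardScaler().fit_transform(features)
whitened = PCA(whiten=True).fit_transform(features)

print("standardized neighbors:", nearest(standardized, query_idx))
print("whitened neighbors:    ", nearest(whitened, query_idx))
```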

– Throughout, the spectrograms fail to indicate the time window or frequency bands.

While we agree that this information is useful, we are somewhat concerned about the additional visual clutter to figures that need to display many spectrograms. These numbers are clearly stated in the methods.

– Figure 4a+b aim to show that SAP features do not separate directed and undirected vocalizations as well as latent features do, but is this information simply not present in the first two PCs? A classification accuracy assessment would be more convincing.

We thank the reviewer for this suggestion. To address this, we performed additional analyses using different feature sets for downstream classification tasks (Tables S1-S3). In particular, classifiers trained to predict social context performed better when using latent features than when using SAP features (Table S1).

– The 2D embedding in Figure 4f is confusing to me. Why not just show the full distance matrix from Figure S8, but with the lines to indicate which rows/columns belong to each mouse? That figure gives considerably more information than the tSNE embedding, in my opinion. In particular, it looks like there is a solid group of C57 mice that are very similar to DBA mice, as measured by MMD. The use of tSNE seems rather arbitrary and lossy. Moreover, the colors in Figure 4f convey little information beyond identity, when there seems to be lots of extra info about strain that could be conveyed.

We appreciate the suggestion. We have elected to retain the t-SNE embedding in the main figure, since it makes certain features more available at a glance, and provides an analogous presentation to our visualizations of syllables, but we have also added Figure S11 to include the full MMD matrix with easily visualized individual and strain information.

– There are no axis labels or titles in Figure 5a-c, just random clouds of points.

Thank you. We have fixed this.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Essential revisions:

1. Figure 7 does not show clearly enough how the SG approach overcomes the problem of fused syllables (A and B). Figure 7c should somehow indicate the similarity in the fused AB vs. A, B area. The issue is that it is not easy to see how the color code corresponds to specific areas in the sonograms presented. An overlay panel could help here.

We appreciate the suggestion. Figure 7e now contains an inset illustrating similarity between the fused AB syllable from the pupil and the individual A and B syllables from the tutor. When assessed by MMD, the VAE trained on syllables judges the fused syllable to be distinct from both A and B, while the shotgun VAE shows a clear similarity (dark line) between A, B, and AB.

https://doi.org/10.7554/eLife.67855.sa2

Article and author information

Author details

  1. Jack Goffinet

    1. Department of Computer Science, Duke University, Durham, United States
    2. Center for Cognitive Neurobiology, Duke University, Durham, United States
    3. Department of Neurobiology, Duke University, Durham, United States
    Contribution
Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Supervision, Visualization, Writing – original draft
    Competing interests
    No competing interests declared
ORCID: 0000-0001-6729-0848
  2. Samuel Brudner

    Department of Neurobiology, Duke University, Durham, United States
    Contribution
Formal analysis, Investigation, Methodology, Validation, Supervision, Writing – original draft
    Competing interests
    No competing interests declared
ORCID: 0000-0002-6043-9328
  3. Richard Mooney

    Department of Neurobiology, Duke University, Durham, United States
    Contribution
Conceptualization, Writing – review and editing, Investigation, Funding acquisition, Project administration, Resources, Visualization, Writing – original draft
    Competing interests
    No competing interests declared
ORCID: 0000-0002-3308-1367
  4. John Pearson

    1. Center for Cognitive Neurobiology, Duke University, Durham, United States
    2. Department of Neurobiology, Duke University, Durham, United States
    3. Department of Biostatistics & Bioinformatics, Duke University, Durham, United States
    4. Department of Electrical and Computer Engineering, Duke University, Durham, United States
    Contribution
Conceptualization, Formal analysis, Writing – review and editing, Investigation, Methodology, Funding acquisition, Resources, Supervision, Visualization, Writing – original draft
    For correspondence
    john.pearson@duke.edu
    Competing interests
    No competing interests declared
ORCID: 0000-0002-9876-7837

Funding

National Institute of Mental Health (R01-MH117778)

  • Richard Mooney

National Institute of Neurological Disorders and Stroke (R01-NS118424)

  • Richard Mooney
  • John Pearson

National Institute on Deafness and Other Communication Disorders (R01-DC013826)

  • Richard Mooney
  • John Pearson

National Institute of Neurological Disorders and Stroke (R01-NS099288)

  • Richard Mooney

Eunice Kennedy Shriver National Institute of Child Health and Human Development (F31-HD098772)

  • Samuel Brudner

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by NIH grants R01-DC013826 (RM, JP), R01-NS118424 (RM, JP), R01-NS099288 (RM), R01-MH117778 (RM), and F31-HD098772 (SB) and by a hardware grant from the NVIDIA corporation.

Ethics

All data generated in conjunction with this study were generated by experiments performed in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health. All of the animals were handled according to approved Institutional Animal Care and Use Committee (IACUC) protocols of Duke University, protocol numbers A171-20-08 and A172-20-08.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Jesse H Goldberg, Cornell University, United States

Reviewers

  1. Jesse H Goldberg, Cornell University, United States
  2. Ofer Tchernichovski, Hunter College, City University of New York, United States
  3. Scott W Linderman, Stanford University, United States

Publication history

  1. Received: February 24, 2021
  2. Accepted: May 12, 2021
  3. Accepted Manuscript published: May 14, 2021 (version 1)
  4. Version of Record published: June 18, 2021 (version 2)

Copyright

© 2021, Goffinet et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
