Vocalization categorization behavior explained by a feature-based auditory categorization model

Abstract
Editor's evaluation
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Vocal animals produce multiple categories of calls with high between- and within-subject variability, over which listeners must generalize to accomplish call categorization. The behavioral strategies and neural mechanisms that support this ability to generalize are largely unexplored. We previously proposed a theoretical model that accomplished call categorization by detecting features of intermediate complexity that best contrasted each call category from all other categories. We further demonstrated that some neural responses in the primary auditory cortex were consistent with such a model. Here, we asked whether a feature-based model could predict call categorization behavior. We trained both the model and guinea pigs (GPs) on call categorization tasks using natural calls. We then tested categorization by the model and GPs using temporally and spectrally altered calls. Both the model and GPs were surprisingly resilient to temporal manipulations, but sensitive to moderate frequency shifts. Critically, the model predicted about 50% of the variance in GP behavior. By adopting different model training strategies and examining features that contributed to solving specific tasks, we could gain insight into possible strategies used by animals to categorize calls. Our results validate a model that uses the detection of intermediate-complexity contrastive features to accomplish call categorization.

Editor's evaluation

This important study combines behavioral data from guinea pigs and data from a classifier model to make a compelling case for which auditory features are important for classifying vocalisations. This study is likely to be of interest to both computational and experimental neuroscientists, in particular auditory neurophysiologists and cognitive and comparative neuroscientists. A strength of this work is that a model trained on natural calls was able to predict some aspects of responses to temporally and spectrally altered cues.

https://doi.org/10.7554/eLife.78278.sa0

Introduction

Communication sounds, such as human speech or animal vocalizations (calls), are typically produced with tremendous subject-to-subject and trial-to-trial variability. These sounds are also typically encountered in highly variable listening conditions—in the presence of noise, reverberations, and competing sounds. A central function of auditory processing is to extract the underlying meaningful signal being communicated so that appropriate behavioral responses can be produced. A key step in this process is a many-to-one mapping that bins communication sounds, perhaps carrying similar ‘meanings’ or associated with specific behavioral responses, into distinct categories. To accomplish this, the auditory system must generalize over the aforementioned variability in the production and transmission of communication sounds. We previously proposed, based on a model of visual categorization (Ullman et al., 2002), a theoretical model that identified distinctive acoustic features that were highly likely to be found across most exemplars of a category and were most contrastive with respect to other categories. Using these ‘most informative features (MIFs)’, the model accomplished auditory categorization with high accuracy (Liu et al., 2019). In that study, we also showed that features of intermediate lengths, bandwidths, and complexities were typically more informative for auditory categorization. In a second study, we further showed in a guinea pig (GP) animal model that neurons in the superficial layers of the primary auditory cortex (A1) demonstrated call-feature-selective responses and complex receptive fields that were consistent with model-predicted features, providing support for the model at the neurophysiological level (Montes-Lourido et al., 2021a). In this study, we investigated whether the feature-based model held true at a behavioral level, by determining whether the model, trained solely using natural GP calls, could predict GP behavioral performance in categorizing both natural calls as well as calls with altered spectral and temporal features.

Studies in a wide range of species have probed the impact of alterations to spectral and temporal cues on call recognition. For example, in humans, it has been shown that speech recognition relies primarily on temporal envelope cues based on experiments that measured recognition performance when subjects were presented with noise-vocoded speech at different spectral resolutions (Shannon et al., 1995; Smith et al., 2002). However, recognition is also remarkably resilient when the envelope is altered because of tempo changes—for example, word intelligibility is resilient to a large degree of time-compression of speech (Janse et al., 2003). Results from other mammalian species are broadly consistent with findings in humans. In gerbils, it has been shown that firing rate patterns of A1 neurons could be used to reliably classify calls that were composed of only four spectral bands (Ter-Mikaelian et al., 2013). In GPs, small neuronal populations have been shown to be resistant to such degradations as well (Aushana et al., 2018). Slow amplitude modulation cues have been proposed as a critical cue for the neuronal discriminability of calls (Souffi et al., 2020), but behaviorally, call identification can be resilient to large changes in these cues. For example, mice can discriminate between calls that have been doubled or halved in length (Neilans et al., 2014). This remarkable tolerance to cue variations might be related to the wide range of variations with which calls are produced in different behavioral contexts. For example, for luring female mice and during direct courtship, male mice modify many call parameters including sequence length and complexity (Chabout et al., 2015). In contrast, one study in GPs using naturalistic stimuli (a human footstep sound) demonstrated that while GPs discriminated time-compressed sounds from the natural sound on which they were trained, they did not distinguish between time-expanded sounds and natural sounds (Ojima and Horikawa, 2015). Along the spectral dimension, mouse call discrimination can be robust to changes in long-term spectra, including moderate frequency shifts and removal of frequency modulations (Neilans et al., 2014). Indeed, it has been suggested that the bandwidth of ultrasonic vocalizations is more important for communication than the precise frequency contours of these calls (Screven and Dent, 2016). Again, given that mice also modify the spectral features of their calls in a context-dependent manner (Chabout et al., 2015), it stands to reason that their perception of call identity is also robust to alterations of spectral features. In GPs using footstep stimuli, animals reliably discriminated sounds subjected to a band-reject filter from the natural sound (Ojima and Horikawa, 2015).

Overall, these studies suggest that for calls, in particular, information is encoded at multiple levels. Whereas the specific parameters of a given call utterance might carry rich information about the identity (Boinski and Mitchell, 1997; Miller et al., 2010; Gamba et al., 2012; Fukushima et al., 2015) and internal state of the caller as well as social context (Seyfarth and Cheney, 2003; Coye et al., 2016), call category identity encompasses all these variations. In some behavioral situations, listeners might need to be sensitive to these specific parameter variations—for example, for courtship, female mice have been shown to exhibit a high preference for temporal regularity of male calls (Perrodin et al., 2020). But in other situations, animals must and do generalize over this variability to extract call category identity, which is critical for providing an appropriate behavioral response. What mechanisms enable animals to generalize over this tremendous variability with which calls are heard and how they accomplish call categorization, however, is not well understood.

In this study, based on our earlier modeling and neurophysiological results (Liu et al., 2019; Montes-Lourido et al., 2021a), we hypothesized that animals can generalize over this production variability and achieve call categorization by detecting features of intermediate complexity within these calls. To test this hypothesis, we trained feature-based models and GPs to classify multiple categories of natural, spectrotemporally rich GP calls. We then tested the categorization performance of both the model and GPs with manipulated versions of the calls. We found that the feature-based model of auditory categorization, trained solely using natural GP calls, could explain about 50% of the overall variance of GP behavioral responses to manipulated calls. By comparing different model versions, we could derive further insight into possible behavioral strategies used by GPs to solve these call categorization tasks. Examining the factors contributing to high model performance in different conditions also provided insight into why a feature-based encoding strategy is highly advantageous. Overall, these results provide support at a behavioral level for a feature-based auditory categorization model, further validating our model as a novel and powerful approach to deconstructing complex auditory behaviors.

Results

The acquisition of learned categorization behaviors likely involves two underlying processes—the acquisition of the knowledge of auditory categories, and the expression of this knowledge by learning the association between categories and reward outcomes (Kuchibhotla et al., 2019; Moore and Kuchibhotla, 2022). Many models have been developed to characterize the latter process, that is, the association of a stimulus with reward. Examples include models of classical conditioning and reinforcement learning (Rescorla and Wagner, 1972), models that account for the motivational state of subjects during behavior (Berditchevskaia et al., 2016), models that account for exploratory drive in subjects (Pisupati et al., 2021), or models that account for arousal state (de Gee et al., 2020). However, the former process of how knowledge of auditory categories is acquired, especially when categories are ‘non-compact’ or evident only in multiparametric spaces, is far less understood. Thus, while we briefly describe the behavioral acquisition of a vocalization categorization task for completeness (Figure 1—figure supplement 1 ), in this study, our goal was to determine the computational principles underlying the former process, that is, what features animals may use to acquire knowledge about acoustic stimulus category, and in particular, vocalization categories.

GPs learn to report call category in a Go/No-go task

We trained GPs on call categorization tasks using a Go/No-go task structure. Animals initiated trials by moving to the ‘home base’ region of the behavioral arena (Figure 1A, B). Stimuli were presented from an overhead speaker. On hearing Go stimuli, GPs were trained to move to a reward region, where they received a food pellet reward. The correct response to No-go stimuli was to remain in the home base. We trained two cohorts of GPs to categorize two pairs of call categories—Cohort 1 was trained on chuts (Go) versus purrs (No-go), calls that had similar spectral content (long-term spectral power) but different temporal (overall envelope) structure (Figure 1C), and Cohort 2 was trained on wheeks (Go) versus whines (No-go), calls that had similar temporal structure but different spectral content (Figure 1D). GPs were trained on this task over multiple short sessions every day (~6 sessions of ~40 trials each, ~10 min per session; see Materials and methods). On each trial, we presented a randomly chosen exemplar from an initial training set of 8 exemplars per category. We estimated hit rates and false alarm (FA) rates from all trials in a given day and computed a sensitivity index (d′). GPs were considered trained when d′ reliably crossed a threshold of 1.5. On average, GPs acquired this task after ~2–3 weeks of training (~4,000 total trials,~250 trials per exemplar; Figure 1—figure supplement 1). On the last day of training, GPs displayed a mean d′ of 1.94±0.26 for the chuts versus purrs task, and a mean d′ of 1.90±0.57 for the wheeks versus whines task.

Figure 1 with 1 supplement see all

Download asset Open asset

Call categorization behavior in GPs.

(A) Behavioral setup, indicating home base region for trial initiation (yellow) and reward area (green). Some naive animals observed expert animals performing the task to speed up task acquisition. (B) Video tracking was employed to detect GP position and trigger task events (stimulus presentation, reward delivery, etc.). (C) Spectrograms of example chut calls (Go stimuli for Cohort 1) and purr calls (No-go stimuli for Cohort 1). (D) Spectrograms of example wheek calls (Go stimuli for Cohort 2) and whine calls (No-go stimuli for Cohort 2). Performance of GPs during the training phase of the call categorization task is shown in Figure 1—figure supplement 1. GP, guinea pig.

To gain insight into possible behavioral strategies that GPs might adopt to solve the categorization task, we examined trends of behavioral performance over the training period. Initially, GPs exhibited low hit rates as well as low FA rates, suggesting that they did not associate the auditory stimulus with reward (Figure 1—figure supplement 1D). Note that this initial phase was not recorded for the first cohort (chuts vs. purrs task, Figure 1—figure supplement 1A). Within 2–3 days, GPs formed a stimulus-reward association and exhibited ‘Go’ responses for all stimuli but did not discriminate between Go and No-go stimulus categories. This resulted in high hit rates as well as FA rates, but low d′. For the remainder of the training period, hit rates remained stable whereas FA rates gradually declined, suggesting that the improvements to d′ resulted from GPs learning to suppress responses to No-go stimuli (Figure 1—figure supplement 1A, B, D, E).

While these data were averaged over all sessions daily for further analyses, we noticed within-day trends in performance that might provide insight into the behavioral state of the GPs. We analyzed performance across intra-day sessions, averaged over 4 days after the animals acquired the task (Figure 1—figure supplement 1C, F). In early sessions, both hit rates and FA rates were high, suggesting that the GPs weighted the food reward highly, risking punishments (air puffs/time outs) in the process. In subsequent sessions, both the hit rate and FA rate declined, suggesting that the GPs shifted to a punishment-avoidance strategy. Despite these possible changes in decision criteria used by the GPs, they maintained consistent performance, as d′ remained consistent across sessions. Therefore, in all further analyses, we used d′ values averaged over all sessions as a performance metric. Note that these observations of possible motivational and other fluctuations are not intended to be captured by our model. Rather, our model solely focuses on what spectrotemporal features GPs can use to deduce stimulus categories.

A feature-based computational model can be trained to accomplish call categorization

In parallel, we extended a feature-based model that we previously developed for auditory categorization (Liu et al., 2019) to accomplish GP call categorization in a Go/No-go framework. Briefly, we implemented a three-layer model consisting of a spectrotemporal representation layer, a feature-detection (FD) layer, and a winner-take-all (WTA) layer. The spectrotemporal layer was a biophysically realistic model of the auditory periphery (Zilany et al., 2014). For the FD layer, we used greedy search optimization and information theoretic principles to derive a set of MIFs for each call type that was optimal for the categorization of that call type from all other call types (Figure 2A, B; Liu et al., 2019). We derived five distinct sets of MIFs for each call type that could accomplish categorization (see Materials and methods). We refer to models using these distinct MIF sets as different instantiations of the model.

Figure 2

Download asset Open asset

Framework of the model trained to perform call categorization tasks.

(A) and (B) Example cochleagrams for target (A) and distractor (B) calls. Cochleagram rows were normalized to set the maximum value as 1 and then smoothed for display. White rectangles denote detected MIFs for that call. For an input call, the target (green) FD stage response is the sum of all detected target MIF weights normalized by the sum of all MIF weights for that call type. The distractor response (red) is similarly computed. (C) The output of the winner-take-all (WTA) stage is determined based on the difference between the target and distractor FD stage responses. Dots represent the WTA outputs for all calls used for training the models. Rows represent the five instantiations of the model with different MIF sets. MIF, maximally informative features; det, detected MIFs; all, all MIFs.

Call-specific MIF sets in the FD layer showed near-perfect performance (area under the curve, or AUC>0.97 for all 20 MIF sets [4 call categories×5 instantiations per category], mean=0.994) in categorizing target GP calls from other calls in the training data set. Similar to results from Liu et al., 2019, the number of MIFs for each instantiation of the model ranged from 8 to 20 (mean=16.5), with MIFs spanning ~3 octaves in bandwidth and ~110 ms in duration on average (Table 1). To assess the performance of the WTA layer based on these training data, we estimated d′ using equation 1 (Materials and methods). The WTA output also showed near-perfect performance for classifying the target from the distractor for both chuts versus purrs (mean d′=4.65) and wheeks versus whines (mean d′=3.69) tasks (Figure 2C).

Table 1

Properties of MIFs.

Call name	Instantiation	Number of MIFs	MIF duration (ms)(mean±std)	MIF bandwidth (octaves)(mean±std)
Chut	1, 2, 3, 4, 5	20, 20, 20, 20, 20	88±63, 106±53, 108±56, 109±64, 133±47	4.0±2.0, 4.4±1.2, 3.1±1.9, 3.7±1.9, 2.6±1.8
Purr	1, 2, 3, 4, 5	8, 9, 20, 20, 20	91±49, 83±43, 116±49, 116±56, 86±63	2.6±1.2, 2.8±1.2, 3.1±1.4, 3.2±1.5, 3.6±1.2
Wheek	1, 2, 3, 4, 5	8, 14, 13, 11, 12	144±47, 99±58, 104±68, 116±62, 114±65	2.3±1.6, 2.6±1.8, 2.9±2.2, 2.1±1.1, 2.5±1.7
Whine	1, 2, 3, 4, 5	20, 20, 15, 20, 20	109±55, 111±68, 133±37, 117±51, 108±70	3.5±1.8, 3.4±1.6, 2.6±1.4, 3.2±1.5, 3.9±1.6
Summary		16.5±4.7	109±57	3.2±1.7

Both GPs and the model generalize to new exemplars

To determine if GPs learned to report call category or if they simply remembered the specific call exemplars on which they were trained, we tested whether their performance generalized to a new set of Go and No-go stimuli (eight exemplars each) that the GPs had not encountered before. On each generalization day, we ran four sessions of ~40 trials each, with the first two sessions containing only training exemplars and the last two sessions containing only new exemplars. All GPs achieved a high-performance level (d′>1) to the new exemplars by generalization day 2 (Figure 3), that is, after being exposed to only a few repetitions of the new exemplars (~5 trials per new exemplar on generalization day 1). As an additional control to ensure that GPs did not rapidly learn reward associations for the new exemplars, for GPs performing the wheeks versus whines task (n=3), we also quantified generalization performance when the regular training exemplars and a second new set of exemplars were presented in an interleaved manner (400 trials with an 80/20 mix of training and new exemplars). GPs achieved d′>1 for new exemplars in this interleaved set as well, further supporting the notion that GPs were truly reporting call category.

Figure 3

Download asset Open asset

GP and model performance generalizes to new exemplars.

(A) and (C) Hit (green) and false alarm (FA; red) rates of GPs when categorizing new exemplars as a function of generalization day. We presented ~5 trials of each new exemplar per day. Dark lines correspond to average over subjects, faint lines correspond to individual subjects. (B) and (D) Quantification of generalization performance. Black line corresponds to average d′, gray lines are d′ values of individual subjects. GPs achieved a d′>1 by generalization day 2, that is, after exposure to only ~5 trials of each new exemplar on day 1. The feature-based model (orange, n = 5 instantiations, error bars correspond to s.e.m.) also generalized to new exemplars that were not part of the model’s training set of calls. Source data are available in Figure 3—source data 1. GP, guinea pig.

Figure 3—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for generalization to new vocalization exemplars.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig3-data1-v2.xlsx
Download elife-78278-fig3-data1-v2.xlsx

Similar to GPs, to test model generalization, we quantified model performance for new call exemplars (Figure 3B, D). Models using different MIF sets, that is, all instantiations of the model for chut, purr, wheek, and whine classification achieved high categorization performance (d′>1) for the new exemplars. In summary, GPs as well as the feature-based model could rapidly generalize to novel exemplars.

Both GPs and the model exhibit similar categorization-in-noise thresholds

Real-world communication typically occurs in noisy listening environments. To test how well GPs could maintain categorization in background noise, we assessed their performance when call stimuli were masked by additive white noise at several signal-to-noise ratios (SNRs) for both Go and No-go stimuli. Experiments were conducted in a block design, using a fixed SNR level per session (~40 trials) and testing 5 or 6 SNR levels each day. At the most favorable SNR (>20 dB), GPs exhibited high hit rates and low FA rates, leading to high d′ (>2) for both call groups (Figure 4). With increasing noise level (i.e., decreasing SNR), we observed a decrease in hit rate and an increase in FA, as expected, with a concomitant decrease in d′. To determine the effect of SNR value on d′, we constructed a generalized linear model with a logit link function to predict trial-by-trial behavioral outcomes, with stimulus type (Go or No-go), SNR value, and an interaction term as predictors, and animal ID as a random effect (see Equation 3, Materials and methods). We compared this full model to a null model consisting only of stimulus type as a predictor and animal ID as a random effect (see Equation 5, Materials and methods). This comparison revealed a strong effect of SNR value of d′ (chuts vs. purrs: χ²=124.6, p=2.2×10^–16; wheeks vs. whines: χ²=182.5; p=2.2×10^–16). At the most adverse SNR (–18 dB) for both call groups, hit and FA rates were similar, suggesting that the animals were performing at chance level. To estimate the SNR corresponding to the performance threshold (d′=1) for call categorization in noise, we fit a psychometric function to the behavioral d′ data (see Materials and methods). We obtained performance thresholds (SNR at which d′=1) for both the chuts versus purrs (–6.8 dB SNR) and wheeks versus whines (–11 dB SNR) tasks that were qualitatively similar to human speech discrimination performance in white noise (Phatak and Allen, 2007).

Figure 4

Download asset Open asset

Call categorization is robust to degradation by noise.

(A) and (C) Hit (green) and false alarm (FA; red) rates of GPs categorizing calls with additive white noise at different SNRs. (B) and (D) Sensitivity index (d′) as a function of SNR. Black symbols correspond to the mean d′ across animals (n=4); error bars correspond to s.e.m. Black line corresponds to a psychometric function fit to the behavioral data. Orange symbols correspond to the mean d′ across five instantiations of the model, error bars correspond to s.e.m. Orange line corresponds to a psychometric function fit to the model data. Dashed blue line signifies d′=1. The statistical significance of SNR value on behavior was evaluated using a generalized linear model (see main text). Source data are available in Figure 4—source data 1. GP, guinea pig; MAE, mean absolute error; SNR, signal-to-noise ratio.

Figure 4—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for vocalization categorization in noise (at different SNRs).: https://cdn.elifesciences.org/articles/78278/elife-78278-fig4-data1-v2.xlsx
Download elife-78278-fig4-data1-v2.xlsx

We also tested the performance of the feature-based model (trained only on clean stimuli) on the same set of noisy stimuli as the behavioral paradigm. Model performance trends mirrored behavior, with a higher threshold for the chuts vs. purrs task (–5.4 dB SNR) compared to the wheeks versus whines task (–15 dB SNR). Similar to behavioral data, we confirmed a significant effect of SNR value on model performance using a trialwise GLM analysis (chuts vs. purrs: χ²=402.3, p=2.2×10^–16; wheeks vs. whines: χ²=203.7; p=2.2×10^–16). To compare behavioral data with the model, we fit a line to the d′ values obtained from the model plotted against the d′ values obtained from behavior, and computed the R² value. We used the mean absolute error (MAE) to quantify the absolute deviation between model and behavior d′ values. Although the model over-performed for the wheeks vs. whines task, it could explain a high degree of variance (R²=0.94 for both tasks) of GP call-in-noise categorization behavior.

Stimulus information might be available to GPs in short-duration segments of calls

Several studies across species, including humans (Marslen-Wilson and Zwitserlood, 1989; Salasoo and Pisoni, 1985), birds (Knudsen and Gentner, 2010; Toarmino et al., 2011), sea-lions (Pitcher et al., 2012), and mice (Holfoth et al., 2014), have suggested that the initial parts of calls might be the most critical parts for recognition. We reasoned that if that were the case for GPs as well, and later call segments did not add much information for call categorization, we might observe a plateauing of behavioral performance after a certain length of a call was presented. To test this, we presented call segments of different lengths (50–800 ms) beginning at the call onsets (Figure 5A, D) to estimate the minimum call duration required for successful categorization by GPs. Trials were presented in a randomized manner in sessions of ~40 trials, that is, each trial could be a Go or No-go stimulus of any segment length. We did not observe systematic changes to d′ values when comparing the first and second halves of the entire set of trials used for testing, demonstrating that the GPs were not learning the specific manipulated exemplars that we presented. GPs showed d′ values >1 for as small as 75 ms segments for both tasks, and as expected, the performance stabilized for all longer segment lengths (Figure 5B, C, E and F). As with the SNR experiment, we used a comparison between a full GLM including stimulus length value and a null GLM to evaluate the statistical significance of the effect of stimulus length on behavioral performance. We confirmed a significant effect of segment length on behavioral performance (chuts vs. purrs: χ²=58.7, p=1.8×10^–13; wheeks vs. whines: χ²=120.2; p=2.2×10^–16). These data suggest that short-duration segments of calls carry sufficient information for call categorization, at least in the tested one-vs.-one scenarios. The fact that call category can be extracted from the earliest occurrences of such segments suggests two possibilities: (1) a large degree of redundancy is present in calls, or (2) the repeated segments can be used to derive information beyond call category (e.g., caller identity or emotional valence).

Figure 5

Download asset Open asset

GPs can obtain information for categorization from short-duration segments of calls.

(A) and (D) Schematic showing truncation of stimuli at different segment lengths from the onset of calls. (B) and (E) Average (n=3 GPs) hit rate (green) and false alarm rates (red) as a function of stimulus segment length. (C) and (F) Black symbols correspond to average GP d′ (n=3 GPs), error bars correspond to s.e.m. Orange symbols correspond to average model d′ (n=5 model instantiations), error bars correspond to s.e.m. Dashed blue line denotes d′=1. The statistical significance of segment length value on behavior was evaluated using a generalized linear model (see main text). Source data are available in Figure 5—source data 1. FA, false alarm; GP, guinea pig.

Figure 5—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for the segment-length manipulation.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig5-data1-v2.xlsx
Download elife-78278-fig5-data1-v2.xlsx

Model performance, however, only exceeded a d′ value of 1 for ~150 ms call segments, and performance only plateaued after a 200 ms duration (Figure 5C, F). The effect of segment length on model performance was statistically significant (chuts vs. purrs: χ²=613.2, p=1.8× 0^–13; wheeks vs. whines: χ²=775.4; p=2.2×10^–16). This observation could reflect the fact that the MIFs identified for categorization were on average about 110 ms long. Despite these differences, model performance was in general agreement with behavioral performance for both the chuts versus purrs and wheeks versus whines tasks (R²=0.674 and 0.444, respectively).

Temporal manipulations had little effect on model performance and GP behavior

To investigate the importance of temporal cues for GP call categorization, we introduced several gross temporal manipulations to the calls. We first started by changing the tempo of the calls, that is, stretching/compressing the calls without introducing alterations to the long-term spectra of calls (Figure 6A, D). This resulted in calls that were ~0.45, 0.5, ~0.56, ~0.63, ~0.77, ~1.43, 2.5, and 5 times the original lengths of the calls. As earlier, we presented stimuli in randomized order and verified that d′ did not vary systematically between the first and second half of trials, suggesting that the GPs were not learning new associations for the manipulated exemplars. GP behavioral performance remained above a d′ of 1 for all these perturbations, showing high hit rates and low FA rates (Figure 6B, E) leading to similar d′ across probed conditions (Figure 6C, F). Similar trialwise GLM analysis as earlier, but including the sign of the manipulation (compression vs. expansion) as an additional predictor in the full model, revealed a weak effect (based on the value of the χ² statistic, which roughly corresponds to effect size) of tempo shift on performance (chuts vs. purrs: χ²=46; p=5.7×10^–10; wheeks vs. whines: χ²=13.6; p=0.003).

Figure 6

Download asset Open asset

Call categorization is resistant to changes in tempo.

(A) and (D) Schematic showing changes to call tempo without affecting spectral content. (B) and (E) Average (n=4 GPs for chuts vs. purrs; n=3 GPs for wheeks vs. whines) hit rate (green) and false alarm rates (red) as a function of tempo change, expressed as times change in call duration (1 corresponds to the natural call). (C) and (F) Black points correspond to average GP d′; error bars correspond to s.e.m. Orange points correspond to average model d′ (n=5 model instantiations), error bars correspond to s.e.m. Dashed blue line denotes d′=1. The statistical significance of call duration value on behavior was evaluated using a generalized linear model (see main text). Source data are available in Figure 6—source data 1. FA, false alarm; GP, guinea pig.

Figure 6—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for tempo-shifted vocalizations.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig6-data1-v2.xlsx
Download elife-78278-fig6-data1-v2.xlsx

Similarly, model performance also remained above d′ of 1 for all tempo manipulations. We emphasize that the MIFs used in the model were trained only on natural calls, and were not temporally compressed or stretched to match the test stimuli. GLM analysis as above revealed a weak effect of tempo shift on model d′ for chuts versus purrs (χ²=43.3; p=2.14×10^–9), and no effect for wheeks versus whines (χ²=0.322; p=0.96). Note that while the model qualitatively captured GP behavioral trends, we obtained low R² values likely because of the weak effect of tempo shift values on both animal and model behaviors. The variability of the observed behavior was likely dominated by non-stimulus-related factors such as motivation, exploratory drive, or arousal (Berditchevskaia et al., 2016; de Gee et al., 2020; Pisupati et al., 2021), factors not explicitly modeled here. However, the relatively low MAE for the tempo manipulations (comparable with MAEs of the SNR manipulation which showed high R² values) suggests a broad correspondence between the model and behavior.

The tempo manipulations lengthened or shortened both syllables and inter-syllable intervals (ISIs). Because a recent study in mice (Perrodin et al., 2020) suggested that the regularity of ISI values might be crucial for the detection of male courtship songs by female mice, we next asked whether GPs used individual syllables or temporal patterns of calls for call categorization. First, as a low-level control, we replaced the ISIs of calls with silence instead of the low level of background noise present in recordings to ensure that GPs were not depending on any residual ISI information (silent ISI). Second, since many call categories show a distribution of ISI durations (Figure 7A, E), we replaced the ISI durations in a call with ISI values randomly sampled from the ISI distribution of the same call category (random ISI). The hit and FA rates for both silent and random ISI stimuli were comparable to the regular calls for both categorization tasks (Figure 7C, G), and thus, no significant difference in d′ values was observed across these conditions (Figure 7D, H; repeated measures ANOVA; p=0.536 for chuts vs. purrs and p=0.365 for wheeks vs. whines). Consistent with these behavioral trends, model performance was also largely unaffected by these ISI manipulations.

Figure 7

Download asset Open asset

Call categorization is resistant to manipulations to the inter-syllable interval (ISI).

(A) and (E) Distribution of ISI lengths for the call types used in the categorization tasks. (B) and (F) Comparison of the Go rates for natural and chimeric calls. We compared Go rates rather than d′ because chimeric calls were presented in a catch-trial design (see main text and Materials and methods). Chim. refers to chimeric calls with one call’s syllables and the other call’s ISIs. For example, chimeric chuts have chut syllables and purr ISIs. Label on x-axis refers to syllable identity. (C) and (G) Comparison of hit (green) and FA (red) rates for regular calls, calls where we replaced ISI values with values drawn from the same calls’ ISI distributions, and calls where we replaced the ISI with silence (rather than background noise). (D) and (H) Comparison of GP (black; n=3 GPs) and model (orange; n=5 instantiations) d′ values across these manipulations. Error bars correspond to s.e.m. Source data are available in Figure 7—source data 1. FA, false alarm; GP, guinea pig.

Figure 7—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for various ISI manipulations (silent ISI, random ISI, chimeric calls).: https://cdn.elifesciences.org/articles/78278/elife-78278-fig7-data1-v2.xlsx
Download elife-78278-fig7-data1-v2.xlsx

Finally, because the Go/No-go stimulus categories vary in their ISI distributions (Figure 7A, E), particularly for chuts versus purrs, we generated chimeric calls with syllables of one category and ISI values of the other category (e.g., chut syllables with purr ISIs). Since we combined properties of two call categories, we presented chimeric stimuli in a catch-trial design (see Materials and methods) and compared the Go response rates using syllable identity as the label for a category. While the response rates were marginally lower for the chimeric chuts (chut syllables with purr ISI values) compared to regular chuts (paired t-test; p=0.039), responses were unaltered for regular and chimeric purrs (paired t-test; p=0.415), chimeric wheeks (paired t-test; p=0.218), and chimeric whines (paired t-test; p=0.099) (Figure 7B, F). We did not test the model with chimeric stimuli because ‘Go’ and ‘Nogo’ category labels could not be assigned.

As a more drastic manipulation, we tested the effects of temporally reversing the calls (Figure 8A, D). Given that both chuts and purrs are calls with temporally symmetric spectrotemporal features, compared to natural calls, we observed no changes in the hit and FA rates (Figure 8B) or d′ values for reversed calls (Figure 8C; paired t-test; p=0.582). Wheeks and whines, however, show strongly asymmetric spectrotemporal features. Interestingly, reversal did not significantly affect the categorization performance for this task as well (Figure 8E, F; paired t-test; p=0.151). The model also maintained robust performance (d′>1) for call reversal conditions but with an ~18% decrease in d′ compared to behavior. Overall, these results suggest that GP behavioral performance is tolerant to temporal manipulations such as tempo changes, ISI manipulations, and call reversal, and this tolerance can be largely captured by the feature-based model.

Figure 8

Download asset Open asset

Call categorization is resistant to time-reversal.

(A) and (D) Schematics showing spectrogram and waveform of natural (left) and reversed (right) purr (A) and wheek (D) calls. (B) and (E) Average (n=3 GPs) hit rate (green) and FA rate (red) for natural and reversed calls. (C) and (F) Average performance of GPs (black; n=3 GPs) and model (orange; n=5 model instantiations) for natural and reversed calls. Error bars correspond to s.e.m. Source data are available in Figure 8—source data 1. FA, false alarm; GP, guinea pig.

Figure 8—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for natural and reversed vocalizations.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig8-data1-v2.xlsx
Download elife-78278-fig8-data1-v2.xlsx

Spectral manipulations cause similar degradation in model performance and GP behavior

Because temporal manipulations did not significantly affect GP behavioral or model classification performance, we reasoned that categorization was primarily driven by within-syllable spectral cues. To ascertain the impact of spectral manipulations on call categorization, we varied the fundamental frequency (F0) of the calls from one octave lower (–50%) to one octave higher (+100%) than the regular calls without altering call lengths (Figure 9A and D). As earlier, we verified that d′ did not vary systematically between the first and second half of trials, suggesting that the GPs were not learning new associations for the manipulated exemplars. Both increases and decreases to the F0 of the calls significantly affected behavioral performance, revealed by a GLM analysis as described earlier (chuts vs. purrs: χ²=106.7; p=2.2×10^–16; wheeks vs. whines: χ²=158.2; p=2.2×10^–16). Particularly, we saw a rise in FA rates (Figure 9B) as the F0 deviated farther from the natural values, leading to a significant drop in d′ values for several conditions (Figure 9C). Model performance was also significantly affected by F0 shift magnitude (chuts vs. purrs: χ²=93.9; p=2.2×10^–16; wheeks vs. whines: χ²=456; p=2.2×10^–16). As earlier, MIFs used in the model were trained only on natural calls, and were not frequency-shifted to match the test stimuli. Model performance mirrored behavioral trends as evidenced by high R² (0.723 and 0.468 for chuts vs. purrs and wheeks vs. whines, respectively) and low MAE values.

Figure 9

Download asset Open asset

Call categorization is sensitive to fundamental frequency (F0) shifts.

(A) and (D) Schematics showing spectrograms of natural calls (middle) and versions where the F0 has been decreased (left) or increased (right). (B) and (E) Average (n=4 GPs for chuts vs. purrs; n=3 GPs for wheeks vs. whines) hit rate (green) and FA rate (red) for F0-shifted calls Note that 0% change in F0 is the natural call, –50% change corresponds to shifting F0 one octave lower, and 100% change corresponds to shifting F0 one octave higher than the natural call. (C) and (F) Average performance of GPs (black) and model (orange; n=5 model instantiations) for natural and F0-shifted calls. Error bars correspond to s.e.m. The statistical significance of F0-shift value on behavior was evaluated using a generalized linear model (see main text). Source data are available in Figure 9—source data 1. FA, false alarm; GP, guinea pig.

Figure 9—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for F0-shifted vocalizations.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig9-data1-v2.xlsx
Download elife-78278-fig9-data1-v2.xlsx

Finally, because wheeks and whines differ in their spectral content at high frequencies (Figure 1D), we asked whether GPs exclusively used the higher harmonics of wheeks to accomplish the categorization task. To answer this question, we low-pass filtered both wheeks and whines at 3 kHz (Figure 10A), removing the higher harmonics of the wheeks while leaving the fundamental relatively unaffected. Although GP performance showed a decreasing trend for the filtered calls (Figure 10B, C), it was not significantly different from regular calls (paired t-test; p=0.169), indicating that the higher harmonics might be an important but not the sole cue used by GPs for the task. Similar to behavior, the model performed slightly poorly but above a d′ of 1 in the low-pass filtered condition.

Figure 10

Download asset Open asset

Call categorization is mildly affected by low-pass filtering.

(A) Schematic spectrograms of natural calls (top) and low-pass filtered (bottom) wheek and whine calls. (B) Average (n=3 GPs) hit rate (green) and FA rate (red) for natural and low-pass filtered (cutoff=3 kHz) calls. (C) Average performance of GPs (black) and model (orange; n=5 model instantiations) for natural and low-pass filtered calls. Error bars correspond to s.e.m. Source data are available in Figure 10—source data 1. FA, false alarm; GP, guinea pig.

Figure 10—source data 1 Hit rates, false alarm rates, and d′ values of GPs and the feature-based model for natural and low-pass filtered vocalizations.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig10-data1-v2.xlsx
Download elife-78278-fig10-data1-v2.xlsx

Feature-based model explains a high degree of variance in GP behavior

The feature-based model was developed purely based on theoretical principles, made minimal assumptions, was trained only on natural GP calls, and had no access to GP behavioral data. For training the model, we used exemplars that clearly provided net evidence for the presence of one category or the other (Figure 3C; green and red tick marks in Figure 11A, D). We tested the model (and GPs), however, with manipulated stimuli that spanned a large range of net evidence values (histograms in Figure 11A, D), with many stimuli close to the decision boundary (blue ticks correspond to an SNR value of –18 dB). Despite the difficulty imposed by this wide range of manipulations, the model explained a high degree of variance in GP behavior as evidenced by high R² and low MAE across individual paradigms (call manipulations) as well as overall (Figure 11B–F; R²=0.60 for chuts vs. purrs and 0.37 for wheeks vs. whines; using model and behavior d′ values pooled across all tasks).

Figure 11 with 2 supplements see all

Download asset Open asset

Feature-based model explains ~50% of the overall variance in GP behavior.

(A) Stacked distributions of the evidence for the presence of Go (green) and No-go (red) stimuli (across all manipulations for the chuts vs. purrs task), showing that the output is generally >0 for chuts (green; Go stimulus) and <0 for purrs (red; No-go stimulus). The evidence for easy tasks, such as generalizing to new natural chuts (green ticks) or purrs (red ticks), is typically well away from 0 (decision boundary). In contrast, the evidence for difficult tasks, such as the –18 dB SNR condition (blue ticks), falls near 0. Dashed black line corresponds to the winner-take-all output as a probability of reporting a Go response. (**B, C**) Compared to the model trained with the specific task performed by the GP (chuts vs. purrs; one vs. one), the model trained to classify each call type from all other call types (one vs. many) was more predictive of behavior as indicated by higher R² (B) and lower MAE (C). (**D–F**) Same as (**A–C**) but for the wheeks versus whines task. The performance of an SVM classifier that uses the long-term spectrum to classify natural and manipulated calls is shown in Figure 11—figure supplement 1. The performance of a feature-based classifier, with feature duration constrained to 75 ms, is shown in Figure 11—figure supplement 2. GP, guinea pig; MAE, mean absolute error; SVM, support vector machine.

To determine whether simpler models based on the overall spectral content of vocalizations could explain some of these results, we built a support vector machine (SVM) classifier that attempted to classify vocalizations based on the long-term spectra of calls (see Materials and methods). Similar to the feature-based model, a WTA stage was implemented by comparing the outputs of the target-call SVM model and the distractor-call SVM model for each input call (response=Go if target-SVM output>distractor SVM output). The spectrum-based SVM model achieved high performance on natural vocalizations (d′=3.48 for novel chuts vs. purrs and 2.65 for novel wheeks vs. whines), but failed to capture many aspects of GP responses to manipulated calls, such as the robust performance at low SNRs and modulation of performance by F0-shifted calls. Overall, this resulted in a much lower fraction of the variance explained (Figure 11—figure supplement 1). These analyses illustrate that although a simpler model might successfully classify natural vocalizations, applying the model to manipulated stimuli reveals the shortcomings of the spectrum-based model in capturing GP behavioral trends, further highlighting the need for using spectrotemporal features for classification.

Comparing models with different training procedures yields insight into GP behavioral strategy

The high explanatory power of the feature-based model could be leveraged to gain further insight into what information the GPs were using or learning to accomplish these categorization tasks. On the one hand, because GPs are exposed to these call categories from birth, the GPs may simply be employing the features that they have already acquired for call categorization over their lifetimes to solve our specific categorization tasks. The model presented so far is aligned with this possibility—we trained features to categorize one call type from all other call types (one vs. many categorization) and used a large number of call exemplars for training. Alternatively, GPs could be de-novo learning stimulus features that distinguished between the particular Go and No-go exemplars we presented during training. To test this possibility, we re-trained the model only using the eight exemplars of the targets and distractors that we used to train GPs for one versus one categorization. When tested on manipulated calls, the one versus one model typically performed poorly compared to the original one versus many model. Compared to the one versus many model, the one versus one model was less consistent with behavior as indicated by lower R² (Figure 11B, E) and higher MAE values (Figure 11C, F). One explanation for this result could be that eight exemplars of each category are insufficient to adequately train the model. But the one versus one model achieved suprathreshold (d′>1) performance on both the training and generalization call exemplars, suggesting that these training data were, to some degree, sufficient for classifying natural calls. Furthermore, if GPs were indeed learning stimulus features from the training data set, they would also face the same constraints on training data volume as the model. A second explanation is that the one versus many model better matches GP behavior because rather than re-learning new task-specific features, GPs might be using call features that they had acquired previously over their lifespan to solve our call categorization task. These results also suggest that training a feature-based categorization system (in-silico or in-vivo) on exemplars that capture within-category variability is critical to obtain a system that can flexibly adapt and maintain robust performance to unheard stimuli that exhibit large natural or artificial variations.

The effect of training our model on the one versus many categorization task using a large number of call exemplars for training was that the model learned features that truly captured the within-class and outside-class variability of calls. This resulted in a model that accurately predicted GP performance across a range of stimulus manipulations. To understand how the model was able to achieve robustness to stimulus variations, and to gain insight into how GPs may flexibly weight features differently across the various stimulus manipulations, we examined the relative detection rates of various model MIFs across different stimulus paradigms in which we observed strong behavioral effects (Figure 12, Figure 12—figure supplement 1).

Figure 12 with 1 supplement see all

Download asset Open asset

Different subsets of MIFs are flexibly recruited to solve categorization tasks for different manipulations.

(A) We estimated the relative detection rate (i.e., the difference between the detection rate of a given MIF for all within-category and outside-category calls) of all MIFs (discs) for each behavioral paradigm (e.g., SNR). Colors denote different instantiations of the MIFs. Disc diameter is monotonically proportional to the relative detection rate, using a power-law relationship (fourth power) to highlight the most robust features. While MIFs of all center frequencies (CFs) and bandwidths were uniformly recruited for generalizing calls of chut call type, MIFs with lower CFs were preferentially selected for SNR conditions, likely because high-frequency chut features were masked by white noise. In contrast, MIFs with high CF were preferred by the model to solve the F0-shift task. (B) Similar results were obtained for purrs. (C) MIFs of all durations and bandwidths were uniformly recruited for generalizing calls of chut call type. In contrast, shorter duration MIFs were preferred for segment-length conditions whereas longer-duration MIFs were preferentially recruited for F0-shift conditions. (D) Results were similar for purrs. Source data are available in Figure 12—source data 1. Similar analyses for wheeks and whines are presented in Figure 12—figure supplement 1.

Figure 12—source data 1 Characteristics of most informative features (CF, bandwidth, and duration) and their relative detection rates for the various stimulus manipulations.: https://cdn.elifesciences.org/articles/78278/elife-78278-fig12-data1-v2.xlsx
Download elife-78278-fig12-data1-v2.xlsx

In Figure 12, we plot the relative detection rates of MIFs (discs, ~20 MIFs per instantiation) from different model instantiations (colors, 5 instantiations) for the different call types used. That is, we computed the difference between the rate of detection of each MIF (for each call category) in response to within-category and outside-category stimuli, and plotted this difference (disc areas) as a function of MIF tuning properties (CF, bandwidth, and duration). For the generalization stimuli, that is, new natural calls on which the model had not been trained, almost all MIFs showed relatively large net detection rates which resulted in plots (Figure 12A–D, left panels) with discs of about equal area. For example, all chut MIFs were detected at high rates in novel chut calls (within-category calls), and detected at low rates in non-chut call types (outside-category calls). Note, however, that the learned MIFs are spread out across a range of CFs, bandwidths, and durations. Given these data alone, one might argue that learning ~20 MIFs per call category is highly redundant, and that high performance could be achieved using only a subset of these MIFs. But examining which features maintain high relative detection rates in other stimulus paradigms underscores the utility of learning this wide feature set. When we added white noise to the stimulus, low-CF features showed higher relative detection rates (Figure 12A, B, right top) and thus contributed more towards categorization. This could likely be attributed to GP calls having high power at low frequencies, resulting in more favorable local SNRs at lower frequencies. But when we altered stimulus F0, high-CF features contributed more towards categorization (Figure 12A, B, right bottom). Similarly, low-duration, high-bandwidth features contributed more when categorizing time-restricted calls, whereas high-duration, low-bandwidth features contributed more when categorizing F0-shifted calls (Figure 12C, D). That the model quantitatively matched GP behavior suggests that a similar strategy might be employed by GPs as well. Note that our contention is not that the precise MIFs obtained in our model are also the precise MIFs used by the GPs—indeed, we were able to train several distinct MIF sets that were equally proficient at categorizing calls. Rather, we are proposing a framework in which GPs learn intermediate-complexity features that account for within-category variability and best contrast a call category from all other categories, and similar to the model, recruit different subsets of these features to solve different categorization tasks.

Discussion

In this study, we trained GPs to report call categories using an appetitive Go/No-go task. We then tested GP call categorization when we challenged them with spectrally and temporally manipulated calls. We found that GPs maintained their call categorization across a wide range of gross temporal manipulations such as changes to tempo and altered ISI distributions. In contrast, GP behavior was strongly affected by altering the F0 of calls. In parallel, to determine which features GPs might be using to extract knowledge of call category from stimuli, we extended a previously developed feature-based model of auditory categorization by adding a WTA feature-integration stage that enabled us to obtain a categorical decision from the model on a trial-by-trial basis. We trained the model using natural GP calls. When we challenged the model with the identical stimuli used in the GP experiments, we obtained model responses that explained ~50% of the overall variance in GP behavior. This result suggests that learning to generalize over the variability in natural call categories resulted in the model being able to generalize over some artificial call manipulations. That the model tracks GP performance on these manipulated stimuli is a key observation that supports our reasoning that GPs learn model-like features to generalize over call variability and accomplish call categorization. We had previously reported electrophysiological support for the feature-based model by demonstrating that a large fraction of neurons in the superficial layers of auditory cortex exhibited feature-selective responses, resembling the FD stage of the model (Montes-Lourido et al., 2021a). The results described in the present manuscript lend further support to the model at a behavioral level. Taken together, these studies strongly suggest how a spectral content-based representation of sounds at lower levels of auditory processing may be transformed into a goal-directed representation at higher processing stages by extracting and integrating task-relevant features.

The feature-based model was highly predictive of GP behavior, although it was conceptualized from purely theoretical considerations, trained only using natural GP calls, and implemented without access to any behavioral data. We developed the feature-based model to capture how knowledge of auditory categories is learned by encoding informative features, but did not account for how this knowledge is expressed during behavior. Rather, we made minimal assumptions and used a WTA framework with a static decision criterion (with a small amount of error). Insights from our behavioral observations could be used to further refine the model by adding a realistic behavioral expression stage. For example, our data indicated that GPs altered their behavioral strategy over the course of multiple sessions within a given day. This could possibly reflect an early impulsivity in their decision-making brought on by food deprivation (evidenced by a FA rate) that gradually switches to a punishment-avoidance strategy with increasing satiation (although d′ remains consistent across sessions). Additional arousal-dependent effects on behavior are likely. Nevertheless, the fact that the current model could explain ~50% of the variance in behavioral trends we observed suggests that the fundamental strategy employed by the model—that of detecting features of intermediate complexity to generalize over within-category variability—also lies at the core of GP behavior.

Furthermore, we could leverage the model to gain insight into possible behavioral strategies used by GPs in performing the tasks. For example, we could compare models trained to categorize calls in one versus many or one versus one conditions to ask which scenario was more consistent with GP behavior: (1) whether the GPs used prior features that they acquired over their lifetimes to categorize a given call type from all other calls, or (2) whether GPs were de-novo learning new features to solve the particular categorization task on which they were trained. While the relatively lower volume of data used to train the one vs. one model is a potential confound, the model trained on call features that distinguish a particular call from all other calls was more closely aligned with GP behavioral data, supporting the first possibility. Examining how different subsets of features could be employed to solve different categorization tasks revealed possible strategies that GPs might use to flexibly recruit different feature representations to solve our tasks. While we have used GPs as an animal model for call categorization in this study, we have previously shown that the feature-based model shows high performance across species (GPs, marmosets, and macaques), and that neurons in the primary auditory cortex (A1) of marmosets also exhibit feature-selective responses (Liu et al., 2019). Thus, it is likely that our model reflects general principles that are applicable across species, and offers a powerful new approach to deconstruct complex auditory behaviors.

On the behavioral side, previous GP studies have largely used conditioned avoidance (e.g., Heffner et al., 1971) or classical conditioning using aversive stimuli (e.g., Edeline et al., 1993) to study simple tone detection and discrimination behaviors. Studies probing GP behaviors using more complex stimuli are rare (e.g., Ojima et al., 2012; Ojima and Horikawa, 2015). Our study of GP call categorization behavior using multiple spectrotemporally rich call types and parametric manipulations of spectral and temporal features offers comprehensive insight into cues that are critical for call categorization and builds significantly on previous studies. First, we showed that GPs can categorize calls in challenging SNRs, and that thresholds vary depending on the call types to be categorized. We demonstrated that information for GP call categorization was available in short-duration segments of calls, and consistent with some previous studies in other species (Holfoth et al., 2014; Knudsen and Gentner, 2010; Marslen-Wilson and Zwitserlood, 1989; Pitcher et al., 2012), GPs could extract call category information soon after call onset. GP call perception was robust to large temporal manipulations, such as reversal and larger changes to tempo than have been previously tested (Neilans et al., 2014). These results are also consistent with the resilience of human word identification to large tempo shifts (Janse et al., 2003). Our finding that GP call categorization performance is robust to ISI manipulations is also not necessarily inconsistent with results from mice (Perrodin et al., 2020); in that study, while female mice were found to strongly prefer natural calls compared to calls with ISI manipulations, it is possible that mice still identified the call category correctly. For gross spectral manipulations, we found that GP call categorization was robust to a larger range of F0-shifts than have been previously tested (Neilans et al., 2014). Critically, for all but one of these manipulations, the feature-based model captured GP behavioral trends with surprising accuracy both qualitatively and quantitatively.

An analysis of model deviation from behavior could suggest a roadmap for future improvements to our model that could yield further insight into auditory categorization. The one paradigm where we observed a systematic under-performance of the model compared to GP behavior was when we presented call segments of varying lengths from call onset. While the GPs were able to accomplish categorization by extracting information from as little as 75 ms segments, the model required considerably more information (~150 ms). This is likely because the model was based on the detection of informative features that were on average of ~110 ms duration, which were identified from an initial random set of features that could be up to 200 ms in duration. We set this initial limit based on an upper limit estimated from electrophysiological data recorded from A1 ( Montes-Lourido et al., 2021a). We consciously did not impose further restrictions on feature duration or bandwidth to ensure that the model did not make any assumptions based on observed behavior. Upon observing this deviation from behavior, to test whether restricting feature lengths to shorter durations would further improve model performance, we repeated the modeling by constraining the initial feature length to a maximum of 75 ms (the lowest duration for which GPs show above-threshold performance). We found that the constrained MIF model better matched GP behavior on the segment-length task (R² of 0.62 and 0.58 for the chuts vs. purrs and wheeks vs. whines tasks) with the model exceeding d′=1 for 75 ms segments for most tested cases. The constrained MIF model maintained similarity to behavior for the other manipulations as well (Figure 11—figure supplement 2), and yielded higher overall R² values (0.66 for chuts vs. purrs, 0.51 for wheeks vs. whines), explaining ~59% of the variance in GP behavior. This result illustrates how behavioral experiments can provide constraints for the further development of theoretical models in the future.

We also observed the over-performance of the model compared to behavior in some paradigms. Some of this over-performance might be explained by the fact that the model does not exhibit motivation changes, and so forth, as outlined above. A second source of this over-performance might arise from the fact that our model integrates evidence from the FD stage perfectly, that is, we take the total evidence for the presence of a call category to be the weighted sum of the log-likelihoods of all detected features (counting detected features only once) over the entire stimulus, and do not explicitly model a leaky integration of feature evidence over time, as is the case in evidence-accumulation-to-threshold models (Cheadle et al., 2014; Keung et al., 2020). Future improvements to the model could include a realistic feature-integration stage, where evidence for a call category is generated when a feature is detected and degrades with a biologically realistic time constant. In this case, a decision threshold could be reached before the entire stimulus is heard, but model parameters would need to be derived from or fit to observed behavioral data (Glaze et al., 2015).

How do the proposed model stages map onto the auditory system? In an earlier study, we provided evidence that feature detection likely occurs in the superficial layers of A1, in that a large fraction of neurons in this stage exhibits highly selective call responses and complex spectrotemporal receptive fields (Montes-Lourido et al., 2021a). How and at what stage these features are combined to encode a call category remains an open question. Neurons in A1 can acquire categorical or task-relevant responses to simple categories, for example, low versus high tone frequencies, or low versus high temporal modulation rates, with training (Bao et al., 2004; Fritz et al., 2005). In contrast, categorical responses to more complex sounds or non-compact categories only seem to arise at the level of secondary or higher cortical areas or the prefrontal cortex (Russ et al., 2008; Yin et al., 2020), which may then modulate A1 via descending connections. These results, taken together with studies that demonstrate enhanced decodability of call identity from the activity of neurons in higher cortical areas (Fukushima et al., 2014; Grimsley et al., 2012; Grimsley et al., 2011), suggest that secondary ventral-stream cortical areas, such as the ventral-rostral belt in GPs, are promising candidates as the site of evidence integration from call features. The WTA stage may be implemented via lateral inhibition at the same level using similar mechanisms as has been suggested in the primary visual cortex (Chettih and Harvey, 2019) or may require a further upstream layer. Further experiments are necessary to explore these questions.

The feature-based model we developed offers a trade-off between performance and biological interpretability. Modern deep neural network (DNN) based models can attain human-level performance (e.g., in vision: Rajalingham et al., 2015, in audition: Kell et al., 2018) but what features are encoded at the intermediate network layers remain somewhat hard to interpret. These models also typically require vast quantities of training data. In contrast, our model is based on an earlier model for visual categorization (Ullman et al., 2002) that is specifically trained to detect characteristic features that contrast the members of a category from non-members. Thus, we can develop biological interpretations for what features are preferably encoded and more importantly, why certain features are more advantageous to encode. Because the features used in the model are the most informative parts of the calls themselves, they can be identified without a parametric search. This approach is especially well-suited for natural sounds such as calls that are high-dimensional and difficult to parameterize. We are restricted, however, in that we do not know all possible categorization problems that are relevant to the animal. By choosing well-defined categorization tasks that are ethologically critical for an animal’s natural behavior (such as call categorization in the present study), we can maximize the insight that we can derive from these experiments as it pertains to a range of natural behaviors. In the visual domain, the concept of feature-based object recognition has yielded insight into how human visual recognition differs from modern machine vision algorithms (Ullman et al., 2016). Our results lay the foundation for pursuing an analogous approach for understanding auditory recognition in animals and humans.

Materials and methods

All experimental procedures conformed to the NIH Guide for the use and care of laboratory animals and were approved by the Institutional Animal Care and Use Committee of the University of Pittsburgh (protocol number 21069431).

Animals

We acquired data from four male and three female adult, wild-type, pigmented GPs (Elm Hill Labs, Chelmsford, MA), weighing ~500–1000 g over the course of the experiments. After a minimum of 2 weeks of acclimatization to handling, animals were placed on a restricted diet for the period of behavioral experiments. During this period, GPs performed auditory tasks for food pellet rewards (TestDiet, St. Louis, MO). The weight and body condition of animals were closely monitored and the weight was not allowed to drop below 90% of baseline weight. To maintain this weight, depending on daily behavioral performance, we supplemented GPs with restricted amounts of pellets (~10–30 g), fresh produce (~10–30 g), and hay (~10–30 g) in their home cages. All animals had free access to water. After behavioral testing for ~2–3 weeks, animals were provided ad-libitum food for 2–3 days to obtain an updated estimate of their baseline weights.

Share this article

Cite this article

Call categorization behavior in GPs.

Framework of the model trained to perform call categorization tasks.

Properties of MIFs.

GP and model performance generalizes to new exemplars.

Figure 3—source data 1

Call categorization is robust to degradation by noise.

Figure 4—source data 1

GPs can obtain information for categorization from short-duration segments of calls.

Figure 5—source data 1

Call categorization is resistant to changes in tempo.

Figure 6—source data 1

Call categorization is resistant to manipulations to the inter-syllable interval (ISI).

Figure 7—source data 1

Call categorization is resistant to time-reversal.

Figure 8—source data 1

Call categorization is sensitive to fundamental frequency (F0) shifts.

Figure 9—source data 1

Call categorization is mildly affected by low-pass filtering.

Figure 10—source data 1

Feature-based model explains ~50% of the overall variance in GP behavior.

Different subsets of MIFs are flexibly recruited to solve categorization tasks for different manipulations.

Figure 12—source data 1

Author details

Manaswini Kar

Contribution

Competing interests

Marianny Pernia

Contribution

Contributed equally with

Competing interests

Kayla Williams

Contribution

Contributed equally with

Competing interests

Satyabrata Parida

Contribution

Competing interests

Nathan Alan Schneider

Contribution

Competing interests

Madelyn McAndrew

Contribution

Contributed equally with

Competing interests

Isha Kumbam

Contribution

Contributed equally with

Competing interests

Srivatsun Sadagopan

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags