Auditory cortical error signals retune during songbird courtship

Caleb Jones; Jesse H. Goldberg

doi:10.7554/eLife.91769.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Andrew King
University of Oxford, Oxford, United Kingdom
Senior Editor
Andrew King
University of Oxford, Oxford, United Kingdom

Reviewer #1 (Public Review):

Summary:
This study examines the context-dependent modulation of auditory cortical neurons in response to expected sensory input, either self-generated sounds or expected perturbations of self-generated sounds. Specifically, using songbirds, the authors ask whether social context (the presence of a female conspecific) affects 1) the response of auditory cortical neurons to the bird's own song when he is singing; and 2) the response of neurons to perturbations of auditory feedback that the bird has been trained to expect.

Strengths:
First, the authors report that across the population, the responses of the neurons do not differ when a male bird sings alone or if he sings to a female. A fraction of auditory cortical neurons, however, do show significant differences in the firing rate, precision, and/or degree of burst firing when males sing alone vs. when they sing to females. This finding is broadly consistent with the literature showing that sensory neurons (visual, auditory, somatosensory, etc.) can be rapidly reconfigured into different "information processing modes" depending on behavioral state (e.g., quiescence vs. vigilance).

For the perturbation experiments, the authors trained birds to expect distorted auditory feedback during a particular syllable. They found that some neurons showed greater responses during perturbation when a female was present (compared to when males were alone) while other neurons had smaller responses during perturbation when a female was present. In addition, the responses of a small number of auditory cortical neurons were not affected by behavioral state. These results contrast with their prior report that the responses of midbrain dopaminergic neurons that project to the basal ganglia are "uniformly reduced" in the presence of a female, raising a question of how an evaluation signal is transformed in the circuit from the primary sensory region to the midbrain.

Weaknesses:
While the experiments and analysis are solid, the finding that social context can alter responses of auditory cortical neurons in a multitude of ways (increase, decrease or no change) raises several questions that can be examined with additional analysis. For example, do context-dependent differences in auditory responses derive from context-dependent differences in the songs? Are context-dependent differences present in all classes of neurons and throughout the auditory system?

The observed heterogeneity in the firing properties of auditory cortical neurons, both in response to self-generated sounds and during perturbations of auditory feedback, raises the question of which neurons are sensitive to social context (which likely can be addressed by the authors in a revision). The authors should provide additional details about the recordings:

a) What are the locations of the recording sites?
Prior work has shown that there is an organized map of spectrotemporal features of sounds in the auditory cortex of songbirds; spectral tuning widths change along the medial-lateral axis and temporal tuning widths differ between the input and output layers of Field L. Were the recordings primarily in Field L2 (thalamo-recipient region), L1 or L3? Were some recordings lateral to Field L in secondary auditory regions? Were the neurons that showed context-dependent changes in firing properties localized or distributed throughout Field L (i.e., were the context-dependent differences in neural responses truly brain-wide)? At a minimum, the authors should include a schematic showing the different regions of Field L and a summary of the location of the recording sites. Images of the processed tissue with electrolytic lesions would also be helpful.

b) Was the context-dependent modulation limited to a particular class of neurons (distinguished by spike waveform shape, spontaneous firing rate, or other feature)?

While the authors attribute differences in the responses of single auditory cortical neurons to the presence of a female, other potential explanations for the observed differences should be examined (and potentially ruled out):

a) Prior work has shown that songs of zebra finches differ slightly when males sing alone compared to when they sing to females: songs are faster; pitch is less variable; and the number of introductory elements is greater when males sing to females. Do some of the observed social context-dependent differences in the responses of auditory neurons reflect differences in the songs in the two conditions? This idea is supported in part by a prior study in juvenile zebra finches (Keller & Hahnloser, 2009) showing that ~20% of the neurons they recorded in Field L and a secondary auditory region (CLM) showed anticipatory activity even before the onset of a song bout, suggesting a source of premotor (or at least non-auditory) drive to neurons in the auditory cortex. Did the authors of this study also find premotor activity in Field L, and if so, did it differ between the two social contexts? Might differences in Field L responses reflect motor/song differences?

b) For the perturbation experiments, the authors report heterogeneous responses to playback, with some neurons firing more and others firing less when a female is present compared to when the male is alone. Keller and Hahnloser (2009) found that in juvenile birds, responses of Field L to perturbations of auditory feedback were sensitive to sound amplitude; perturbation responses increased with relative perturbation amplitude. This raises a question of whether perturbation amplitude is different when a male is alone and when a female is present (i.e., the male may move towards the female when she is present and if the speaker is close to the female, the perturbation may be louder than when the male is alone; alternatively, the male may be more active when he is alone so the loudness of the perturbation may be more variable across song bouts). It would be useful to know if (and how much) perturbation amplitude varied depending on the location inside the cage as well as whether the sound pressure level of the underlying song was higher (e.g., Lombard effect). Addition of details of the experimental setup/procedure would help to allay concerns that the amplitude of the white noise varied significantly depending on behavioral context.
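To give a rough sense of scale for the amplitude concern, the sketch below estimates how much the perturbation level at the bird's ear could change with distance from the speaker. It assumes idealized free-field spreading from a point source, which a cage inside a recording chamber will only approximate; the function name and the distances are hypothetical and are not taken from the manuscript.

```python
import math

def level_change_db(reference_distance_cm, new_distance_cm):
    """Change in sound level (dB) when moving from the reference distance
    to a new distance, assuming free-field spreading from a point source.
    Positive values mean the sound is louder at the new distance."""
    return 20.0 * math.log10(reference_distance_cm / new_distance_cm)

# Hypothetical example: the male moves from 20 cm to 10 cm from the speaker.
print(round(level_change_db(20.0, 10.0), 2))
```

Under this assumption, halving the distance to the speaker raises the level by about 6 dB, so reporting calibrated levels at several cage positions would help to bound the effect.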

Finally, I am still trying to make sense of the differences in the context-dependent modulation of responses of auditory cortical neurons vs. midbrain dopaminergic neurons. Given the heterogeneity of responses in Field L, both to self-generated sounds and to expected perturbations during singing, how are the signals decoded downstream of Field L? At the population level, neither the mean firing rate nor the timing of firing of Field L neurons changed with courtship. Similarly, across the population, the responses to perturbations of auditory feedback were not affected by courtship state (error signal attenuated in 11 neurons, increased in 22 neurons and not affected in 10 neurons). Yet, the courtship state "uniformly" reduces the response of midbrain dopaminergic neurons to auditory perturbation. It would be helpful if the authors could include a model and/or more discussion of how this change may arise.

https://doi.org/10.7554/eLife.91769.1.sa2

Reviewer #2 (Public Review):

Summary:

In the manuscript 'Auditory cortical error signals retune during songbird courtship', Jones and Goldberg study auditory cortex in male zebra finches. They explore song-related responses in two different contexts, when the male is either alone or in the presence of a female. Social-context related responses are hypothesized based on previous results on downstream VTA neurons where such modulation is found. They play jamming stimuli through a loudspeaker to probe sensitivity of song-related neural responses to these external stimuli. They find a heterogeneity of responses, in line with auditory cortical neurons computing the social modulation of responses found in VTA.

Strengths:

In general, the work is interesting and sheds light onto auditory processing and self-perception mechanisms in songbirds.

Weaknesses:

Stability of responses has not been studied: some neurons seem to have responses that slowly drift in time, which could lead to observed differences between alone and with-female conditions. Also, possible motor confounds and sound-of-audience confounds should be addressed. The language is often imprecise.

Stability and Reversal: It is a bit unfortunate that stability of effects seemingly has not been studied by reversing experimental conditions. The work would be much stronger if authors could show that audience-dependent tuning is robust in individual cells. Did they record from some neurons during reversal back to the alone condition? Ideally, the responses should be identical before and after recording with an audience. This would control for possible non-stationarities in their neuron recordings/spike-sorting/circadian trends. If authors do not have such data, it would be worthwhile to even just try to divide the dataset for each neuron and condition (either the audience or isolate condition) into two parts to verify that the response is the same in either part (provided sufficient song renditions are recorded). See also my comment below about Fig. 2A.
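The split-half check suggested above could be sketched as follows. This is a minimal illustration, not the authors' analysis: it assumes one spike count per motif rendition, listed in recording order, and it uses a permutation test on rendition order so that no Gaussian assumption is needed. All names and example values are hypothetical.

```python
import random
from statistics import mean

def split_half_difference(counts):
    """Mean count in the second half of renditions minus the first half."""
    half = len(counts) // 2
    return mean(counts[half:]) - mean(counts[:half])

def split_half_p_value(counts, n_shuffles=10000, seed=0):
    """Two-sided permutation p-value for a difference between halves.

    Shuffling rendition order builds the null distribution expected
    when the response does not drift within the condition.
    """
    rng = random.Random(seed)
    observed = abs(split_half_difference(counts))
    shuffled = list(counts)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Small tolerance so that floating-point ties count as hits.
        if abs(split_half_difference(shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical spike counts per rendition within one condition.
stable = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5, 6, 4]
drifting = [9, 8, 9, 8, 9, 8, 2, 1, 2, 1, 2, 1]
print(split_half_p_value(stable), split_half_p_value(drifting))
```

A small p-value within a single condition would flag drift that could be mistaken for an audience effect when the two conditions are recorded sequentially.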

Motor responses: Does DAF playback change song? If so, especially if it applies only in one of the two conditions (audience/no audience), then the observed response differences could be motor-related rather than auditory responses. Analyses of song spectrograms right after DAF would presumably provide the answer.

Similarly, motif-aligned spiking activity was time warped to the median duration of undirected or directed motifs. Could the shorter motifs during directed song (as has been reported in other studies) lead to alignment differences that would account for the different error responses in alone/with-female conditions? In other words, could increased error responses be due to the fixed 100 ms analysis window of the audience condition that extends into a song region beyond the 100 ms region of the no-audience condition where there is increased firing? And vice versa for observed decreases in error responses, i.e. is there a firing pause just after the offset of the 100 ms window in the no-audience condition that causes audience dependence of responses? A simple compensation of song tempo differences by shortening/stretching the analysis window in one of the two conditions would allow this to be tested.
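The tempo compensation proposed above amounts to scaling the analysis window by the ratio of motif durations. A minimal sketch, assuming the window is anchored at perturbation onset and that median motif durations are known for each condition (the durations and spike times below are hypothetical):

```python
def scaled_window_ms(base_window_ms, reference_motif_ms, condition_motif_ms):
    """Scale the analysis window by the ratio of motif durations."""
    return base_window_ms * condition_motif_ms / reference_motif_ms

def count_spikes(spike_times_ms, onset_ms, window_ms):
    """Count spikes in the half-open interval [onset, onset + window)."""
    return sum(onset_ms <= t < onset_ms + window_ms for t in spike_times_ms)

# Hypothetical durations: directed motifs 3% shorter than undirected ones.
undirected_motif_ms = 600.0
directed_motif_ms = 582.0
window_directed = scaled_window_ms(100.0, undirected_motif_ms, directed_motif_ms)

# Hypothetical spike times relative to perturbation onset (ms).
spikes = [10.0, 50.0, 96.5, 98.0, 120.0]
print(window_directed, count_spikes(spikes, 0.0, window_directed))
```

If the reported increases or decreases in error response persist with the scaled window, an alignment artifact becomes a less likely explanation.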

Audience versus sound of audience: In the first sentence of the discussion authors write: "we discovered that auditory representations of an animal's own vocalizations change with an audience." Is it truly the audience that causes the difference in error responses or is it the sounds the audience makes? One control would be to play back stimuli that simulate a non-silent audience through a loudspeaker to see whether error responses depend on the soundscape created by a typical audience (either present or absent). Authors probably do not have such data and to record it would go beyond the scope of this study, but it would be important to discuss this possibility or perform some analysis in that vein.

https://doi.org/10.7554/eLife.91769.1.sa1

Reviewer #3 (Public Review):

Summary:

In this study, Jones et al. examine how neural activity in a primary auditory area (field L) of singing male songbirds is modulated by the presence or absence of an audience (a female conspecific). Prior work has demonstrated that the presence of an audience attenuates the responses of dopaminergic neurons to distortions of auditory feedback (DAF). Here the authors report that even in a region that is primarily considered sensory, responses to DAF are also modulated by the audience, although in a heterogeneous manner that does not readily explain previously observed attenuation. These findings address an interesting question and will potentially be important in adding to an understanding of how non-sensory factors can alter response properties of neurons even in primary sensory regions in a context dependent fashion. However, to be fully persuasive, additional analyses will be required to address how much of the apparent modulation by audience may be explained by other factors such as changes in recorded neurons or their properties over time.

Full Public Review:

In this study, Jones et al. examine how neural activity in a primary auditory area (field L) of singing male songbirds is modulated by the presence or absence of an audience (a female conspecific). They test whether activity in Field L differs between conditions in which the male is singing to a female (directed song) or alone (undirected song) and whether response to distortions of auditory feedback (DAF) differ between these conditions. Previous work has shown that in other parts of the songbird brain, sensory-motor activity can differ between directed and undirected song, and that responses to DAF are attenuated when males sing directed song versus undirected song. These prior results raise the interesting question of the extent to which such modulations of activity by the presence of an audience are already present in primary sensory areas such as Field L. This possibility is also motivated by prior work that has shown that Field L activity is not exclusively explained by auditory input, but can also be modulated by the bird's state - whether it is singing or not.

Against this background, the questions asked here are of interest for two inter-related reasons:

First, the authors address whether the presence of an audience (a female conspecific) alters activity in a primary auditory area during singing. Primary auditory areas such as Field L, and analogous mammalian thalamo-recipient cortical regions such as A1, are often thought of as responding very specifically to the features of sensory stimuli, but are also understood to be modulated by a variety of factors including the attentional and behavioral state of the animal. For audition, such modulation includes whether or not animals are vocalizing and listening to themselves or listening to playback of their own vocalizations. Cited works from Keller (2009) as well as Eliades and Wang (2008) have indicated that the act of vocalizing can modulate auditory responses to self-generated feedback in primary auditory areas relative to those arising from playback of the same sounds. Here, the question is whether responses to self-generated feedback differ between conditions of singing alone versus singing to a female audience. A demonstration that the presence of an audience matters to responses in Field L would add to a general understanding of how it is that non-auditory factors can modulate sensory responses.

Second, the authors address the possible source of an audience-dependent modulation of responses to feedback perturbation in the VTA previously reported by Goldberg and colleagues (2023). In the VTA, responses to perturbations during singing are consistently attenuated when males are singing to females versus when they are singing alone, but the underlying mechanisms of this modulation are unknown. Here, the authors test the possibility that such modulation by an audience is already present at the level of Field L. The previously reported attenuation in VTA is quite striking and reflects a nice example of how neural processing can differ with varying behavioral priorities. Understanding whether this modulation of responses to DAF arises already in primary auditory areas would further a mechanistic understanding of an intriguing example of state-dependent modulation of sensory processing and behavior, and lend broad insight into related phenomena.

The authors report 1) that activity in Field L differs between directed and undirected singing at many individual recording sites, but that these changes are heterogeneous, with both increases and decreases in activity, so that there is no consistent change across the population and 2) that the responses to DAF differ between directed and undirected song, but that there is no consistent attenuation of response (as observed in the VTA) and instead heterogeneous increases and decreases in response to DAF so that there is no net change at the population level.

These findings, if firmly established, are important and of general interest. While they do not readily explain the source of the audience-dependent attenuation of auditory responses to DAF in the VTA, the demonstration of audience-dependent modulation of self-generated feedback and its disruption in a primary auditory area is an exciting result that would provide an opportunity for further investigation of how changes in social context influence brain and behavior. The manuscript is generally well written, although the presentation is terse. My main reservations about the current manuscript relate to aspects of experimental design and analysis that need to be clarified and addressed before these conclusions will be fully persuasive. There are also some places where further discussion of the findings and their relationship to prior studies would be helpful.

1. A central concern relates to whether the main reported effects associated with differences in singing directed versus undirected song reflect only those changes in conditions, versus contributions from changes in unit isolation or response properties over time. The authors record undirected song in a block in the morning and only after collecting at least 40 renditions do they later record responses during directed song over a series of repeated exposures to a female. Therefore, differences between data collected during undirected song and directed song also reflect differences between data collected initially during the morning versus later. It is unclear from methods whether any of these recordings during undirected and directed conditions are interleaved, but if this is not the case, then it is crucial to ask how stable were neural recordings with respect to unit isolation, and potential changes to response properties, over the duration of the experiments. This would be less of a concern if the results mirrored those observed in the VTA, where attenuation of responses was observed across the entire population during directed versus undirected conditions - it is hard to explain a phenomenon that is consistently observed across the population as arising from a change in which neurons and spikes are contributing to responses, or other forms of non-stationarity. However, because there are no significant differences reported at the population level in the current study, it is important to address the possibility that observed differences between conditions reflect some form of noise or drift in recorded units, rather than being entirely due to directed versus undirected singing. I have elaborated in more detail below on this concern, including places where the data seems to suggest some non-stationarity of responses, and have some suggestions for ways in which this concern might be addressed.

2. A second concern, related to this first one, has to do with the categorical definition of 'error neurons'. The authors note in their text that it could be problematic to apply categorical definitions to continuous distributions, and yet that seems to be what they then do. The authors have a metric of error sensitivity that they apply to each neuron's response to DAF in both undirected and directed conditions (the error score). They show that there is a continuous distribution of error scores (Figure 2 - figure supplement 1) across the population, with no bimodality that would be suggestive of distinct error sensitive and error-insensitive neurons. One nice feature of their analysis is that they also show the distribution of error scores computed in an analogous fashion for a period of neural activity in the song prior to DAF. This control data set makes it persuasive that there is a significant response to DAF, but also shows that there can be a broad range of error scores even when no DAF has been played, and that this range of 'noise' responses to DAF overlaps substantially with the actual responses to DAF. Despite the continuum of error scores, the authors define a subset of neurons as error responsive only if their responses to DAF exceed a specific threshold (2.5 standard deviations). One of the main conclusions of the paper is based on finding a subset of 22 neurons that exhibited error responses (by this definition) only during singing to a female and 11 neurons that exhibited error responses only when singing alone. These neurons are described as 'retuned' because they have error responses in only one condition.

The problem here is that for some, if not many, of the neurons that are categorically defined as being responsive to DAF in only one condition (directed versus undirected) there is almost certainly not a significant difference in the actual responses to DAF between conditions. This is apparent in the relevant data figure (figure 2 - figure supplement 1) and is a consequence of using a threshold to split a continuous distribution into groups defined as error responsive or not. For example, several neurons in this plot that have almost identical scores in the directed and undirected condition are counted as examples of retuning because the error scores are just a bit over 2.5 in the directed condition and just a bit under 2.5 in the undirected condition.

That this kind of categorical approach may be problematic is apparent in the control data in the plot. Despite the absence of any perturbation, there are error responsive neurons present in these data that are considered selective for directed versus undirected singing - this is an expected consequence of using a threshold on dispersed or noisy biological data. Shifting to a more stringent threshold of three standard deviations, as the authors do, does not help with this problem, as that still treats as categorically different responses that fall on either side of a line, even if only by a tiny amount. I suggest that the authors devise a measure for each neuron to test whether the responses to DAF are significantly different under the two conditions (directed versus undirected). As noted above, this measure should take into account some assessment of the stationarity of responses, as well as the distribution of responses (which, in some of the examples does not seem to be Gaussian around a mean response level, but rather highly variable across trials).
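One possible form for such a per-neuron measure is a label-shuffling permutation test on per-rendition responses to DAF. The sketch below is an illustration under assumptions, not the authors' method: it assumes each rendition yields one response value (for example, a spike count in the analysis window), and all names and data are hypothetical.

```python
import random
from statistics import mean

def condition_permutation_p(undirected, directed, n_shuffles=10000, seed=0):
    """Two-sided permutation p-value for a difference in mean response
    between undirected and directed renditions of one neuron.

    Condition labels are shuffled across the pooled renditions, so the
    test does not assume Gaussian-distributed responses.
    """
    rng = random.Random(seed)
    observed = abs(mean(directed) - mean(undirected))
    pooled = list(undirected) + list(directed)
    n_und = len(undirected)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_und:]) - mean(pooled[:n_und]))
        # Small tolerance so that floating-point ties count as hits.
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical neuron with near-identical responses in both conditions.
similar_und = [3, 4, 5, 4, 3, 5, 4, 4]
similar_dir = [4, 3, 5, 4, 4, 5, 3, 4]
# Hypothetical neuron with clearly different responses.
low_und = [1, 2, 1, 2, 1, 2, 1, 2]
high_dir = [8, 9, 8, 9, 8, 9, 8, 9]
print(condition_permutation_p(similar_und, similar_dir))
print(condition_permutation_p(low_und, high_dir))
```

Unlike a fixed threshold on the error score, this compares the two conditions directly, so a neuron scoring just above threshold in one condition and just below in the other would not be labeled as different unless its responses actually differ.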

3. There are several places where further discussion of the previous literature and how the current results relate to that literature would be helpful. This includes:

3a. Some discussion of what is already known about the auditory tuning of field L, and the extent to which responses associated with distortion of feedback may reflect the frequency tuning of field L neurons versus something that might be construed more specifically as detecting an error in perceived feedback. For example, Field L neurons have previously been characterized as having relatively simple spectro-temporal receptive fields, often with a single frequency band that is excitatory and nearby frequency bands that are inhibitory. It would be beyond the scope of this paper to directly assess the extent to which both song responses and responses to DAF are well predicted by simple STRFs that might be measured for the recorded neurons, or computed from activity during a range of vocalizations, but perhaps worth discussing whether a neuron with such frequency tuning would potentially exhibit 'error responses' of the sort described here, simply because the DAF stimulus happens to fall into the excitatory or inhibitory regions of the neuron's receptive field. While it is OK to use the term 'error responsive' in the current study, it would be good to make clear that changes in firing associated with playing DAF should be expected even for neurons that have simple auditory receptive fields (i.e. with center surround tuning to specific frequencies in a tonotopic map, as has been described for Field L) without necessarily indicating that these neurons are specifically registering any deviation or 'error' between expected feedback and experienced feedback. In this respect, there are multiple subdivisions of Field L with different tuning properties. Please specify further what criteria were used to determine recording locations and how these correspond with previously defined subdivisions.

3b. It would also be useful to discuss further previous work on differences in auditory tuning or responses between conditions when subjects are vocalizing, versus when vocalizations are played back (as in Keller, Eliades) and whether the results in the current study are similar or different. For example, this prior work has indicated that efference copy or other signals that precede vocalizations can reach and influence activity in auditory areas - with the most compelling evidence for this being the modulation of activity prior to the onset of vocalizations. Was this also observed in the current study, and to what extent might this kind of mechanism contribute to the processing of feedback distortions? With respect to this kind of efference signal, or other possibilities, can the authors provide some discussion or speculation about possible mechanisms that might be differentially engaged between conditions of singing directed versus undirected song?

3c. The previous study on DAF responses in VTA indicates enhanced responses to female calls during directed song. To what extent did the current study control for any vocalizations or other sounds produced by females during the directed singing, and could this have contributed to differences in Field L activity between conditions? This question is motivated partly by the highly variable responses in raster plots even within one condition - might some of this reflect motifs during which transient noises are produced from female calling or other movements by the male or female?

More regarding stability of recordings:

The data presented in Figure 1D illustrate some of my concerns about the stationarity of recordings. In the directed condition there are no spikes at all following the first handful of motif renditions. Were the directed and undirected recordings interleaved here? If not, could the recorded neuron simply have been lost, changed in amplitude of recorded spikes so that it was no longer counted, or reduced its responsiveness over the course of the recordings? Because the recordings of undirected and directed singing are described as occurring sequentially, it seems likely that this type of change in recorded signal could contribute to changes in measured responses over time, independently of effects due to directed versus undirected singing.

A minor issue with this example is that the raw example trace with male alone does not seem to have a corresponding set of points in the raster plot. For panel E, I also cannot find rasters that correspond to the example recordings shown at top.

Figure 2A also shows a neuron that looks like it has non-stationarity; for the alone condition without altered feedback, the main peak has no spikes for the bottom half of the rasters. For the directed condition, much of the difference between control and distorted feedback conditions seems to come from a few trials towards the bottom of the raster plot that show more and earlier firing than most other rasters.

Other more subtle examples are suggested in the figures, such as Figure 1F where responses in the alone condition seem to increase over the course of recordings. A related issue apparent in some of the raster plots is that the firing rate distributions within a given condition sometimes appear to be very non-Gaussian, with some motifs during which there is a lot of activity, or apparent bursting, and others in which there is little activity. In addition to the examples above, this includes responses in Fig 1E and Fig 2F. Does anything distinguish these cases or trials? Where differences between conditions are driven by firing differences that are present on only a subset of trials, such as in Fig 2A, there is some deviation from the normal criteria for use of T-tests/Z-scores. Please consider this point and discuss any caveats and/or apply other tests (Monte Carlo? Non-parametric?) as appropriate.

These potential issues of non-stationarity, and non-Gaussian firing rate distributions in each condition, make it complicated to think about what differences in activity reflect changes from undirected to directed conditions versus these other factors.

Approaches to addressing this issue could include more specifically indicating examples in which recordings from the alone condition and directed condition are interleaved and exhibit reversible (between conditions) changes in the pattern of responses (both without DAF in comparing alone versus directed, and with DAF demonstrating differences in DAF influences between conditions). Some good interleaved examples of this sort would be very helpful to illustrate the robustness of differences between conditions. More generally, the methods and/or raster plots should include some further explanation of the time periods over which recordings were made in the alone versus directed conditions, and the extent to which they are interleaved or not.

Another approach that could be used if there are not many instances of interleaved recordings is to try to document the stationarity or stability of unit isolation and/or responses over time. It would be most helpful when applied to recordings from a given singing condition (i.e. alone or directed) that are interleaved, but even in cases where this is not possible perhaps one could assess the stability of waveforms and unit isolation across time. For example, in Figure 2 - Supplementary figure 2, the left-hand and middle examples appear to have quite good unit isolation, and might be the sorts of cases where measures of unit isolation and waveform stability could be used to argue that a gain or loss of spikes due to drift in recordings or changes to SNR and spike detection are not contributing to changes in firing patterns over time (and across conditions).

It potentially would also be informative to present the prevalence of the main effects reported in the study as a function of some measures of unit isolation, SNR, and recording stability. It would be reassuring to see that significant differences between conditions are equally or more prevalent under the conditions of greatest unit isolation and recording stability than in cases with worse SNR or stability.

One other way that the authors might be able to address my main concern would be to look at the stability of firing patterns within conditions, where differences across trials most directly indicate the potential contributions of technical or biological changes in neural activity over time that are not related to the experimental conditions.

To further address some of these issues, it would be helpful to have additional explanations in this paper (rather than by reference to Goldberg and Fee, 2010) of the criteria that were used for counting spikes, and assessing stability of recordings. All I found about this in the Goldberg and Fee, 2010 reference was that "Spikes were sorted off-line using custom Matlab software." Does this require human inspection and judgment? Is there a simple threshold, or waveform measurement used for detecting spikes from single units? Are some sort of signal-to-noise measures, or ISI violations, used to score how well units are isolated?
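As an example of the kind of isolation score meant here, the fraction of inter-spike intervals (ISIs) shorter than an assumed refractory period can be computed in a few lines. The 1 ms default below is an assumption chosen for illustration, not a criterion from the manuscript or from Goldberg and Fee (2010), and the spike times are hypothetical.

```python
def isi_violation_fraction(spike_times_ms, refractory_ms=1.0):
    """Fraction of inter-spike intervals shorter than the refractory period.

    A well-isolated single unit should show very few such intervals;
    a high fraction suggests contamination by other units or noise.
    """
    times = sorted(spike_times_ms)
    isis = [later - earlier for earlier, later in zip(times, times[1:])]
    if not isis:
        return 0.0
    return sum(isi < refractory_ms for isi in isis) / len(isis)

# Hypothetical spike times (ms): two of the four intervals are under 1 ms.
print(isi_violation_fraction([0.0, 0.5, 10.0, 20.0, 20.4]))
```

Reporting such a score per unit, alongside a signal-to-noise measure, would let readers judge whether the reported effects hold for the best-isolated units.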

For the specific examples shown in figures, it would be useful to indicate by small tick marks or otherwise which spikes were counted as single units. For example in figure 2 column B, for the condition with female, did only the 1-3 largest spikes get counted, or also the spikes of medium height?

Page 11: "Many channels on the probes recorded multi-unit activity, which were taken note of but not analyzed in this study."

What were the criteria for this? For several of the examples in the figures there are spikes of varying amplitudes and as mentioned above it would be helpful to clarify how the spikes were sorted into single units in such cases.

Categorical scores:

Page 13: "Neurons with error responses greater than 2.5 in only one condition (undirected versus directed) were considered to have retuned; neurons with error scores greater than 2.5 in both conditions were considered not to have retuned."

This definition results in cases where responses of 2.45 vs 2.55 are described as 'retuned', even if these responses are not significantly different. The figure (Figure 2 - figure supplement 1) indicates that multiple neurons that were scored as retuning had responses that fall very near the threshold in this way.

Page 13, "Our results did not fundamentally change with ... a more stringent threshold of 3..."

The stringency is not the issue here; rather, it is the categorical threshold. Retuning would be more persuasively demonstrated if the authors could provide a test of whether or not the responses for individual neurons differ significantly between conditions, appropriately taking into account multiple comparisons, stability of recordings, non-Gaussian firing rate distributions across motif renditions, etc., and use this metric to report effects, rather than setting a categorical threshold.
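For the multiple-comparisons part of this suggestion, one standard option is the Benjamini-Hochberg procedure applied to the per-neuron p-values. It is named here as one possibility among several and is not prescribed by the manuscript; the sketch assumes one p-value per neuron from a between-condition test, and the example values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis
    is rejected while controlling the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical per-neuron p-values from a between-condition test.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

The number of neurons that survive correction would then replace the threshold-based counts of 'retuned' neurons.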

https://doi.org/10.7554/eLife.91769.1.sa0
