Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.
Editors
- Reviewing Editor: Andrew King, University of Oxford, Oxford, United Kingdom
- Senior Editor: Andrew King, University of Oxford, Oxford, United Kingdom
Reviewer #1 (Public review):
This is a very interesting paper addressing the hierarchical nature of the mammalian auditory system. The authors use an unconventional technique to assess brain responses -- functional ultrasound imaging (fUSI). This measures blood volume in the cortex at a relatively high spatial resolution. They present dynamic and stationary sounds in isolation and together, and show that the effect of the stationary sounds (relative to the dynamic sounds) on blood volume measurements decreases as one ascends the auditory hierarchy. Since the dynamic/stationary nature of sounds is related to their perception as foreground/background sounds (see below for more details), this suggests that neurons in higher levels of the cortex may be increasingly invariant to background sounds.
The study is interesting, well conducted, and well written. I am broadly convinced by the results. However, I do have some concerns about the validity of the results, given the unconventional technique. fUSI is convenient because it is much less invasive than electrophysiology, and can image a large region of the cortex in one go. However, the relationship between blood volume and neuronal activity is unclear, and blood volume measurements are heavily temporally averaged relative to the underlying neuronal responses. I am particularly concerned about the implications of this for a study on dynamic/stationary stimuli in auditory cortical hierarchy, because the time scale of the dynamic sounds is such that much of the dynamic structure may be affected by this temporal averaging. Also, there is a well-known decrease in temporal following rate that is exhibited by neurons at higher levels of the auditory system. This means that results in different areas will be differently affected by the temporal averaging. I would like to see additional control models to investigate the impact of this.
I also think that the authors should address several caveats: the fact that their measurements heavily spatially average neuronal responses, and therefore may not accurately reflect the underlying neuronal coding; that the perceptual background/foreground distinction is not identical to the dynamic/stationary distinction used here; and that ferret background/foreground perception may be very different from that in humans.
Major points
(1) Changes in blood volume due to brain activity are indirectly related to neuronal responses. The exact relationship is not clear; however, we do know two things for certain: (a) each measurable unit of blood volume change depends on the response of hundreds or thousands of neurons, and (b) the time course of the volume changes is slow compared to the potential time course of the underlying neuronal responses. Both of these mean that important variability in neuronal responses will be averaged out when measuring blood changes. For example, if two neighbouring neurons have opposite responses to a given stimulus, this will produce opposite changes in blood volume, which will cancel each other out in the blood volume measurement due to (a). This is important in the present study because blood volume changes are implicitly being used as a measure of coding in the underlying neuronal population. The authors need to acknowledge that this is a coarse measure of neuronal responses and that important aspects of neuronal responses may be missing from the blood volume measure.
(2) More importantly for the present study, however, the effect of (b) is that any rapid changes in the response of a single neuron will be cancelled out by temporal averaging. Imagine a neuron whose response is transient, consisting of rapid excitation followed by rapid inhibition. Temporal averaging of these two responses will tend to cancel out both of them. As a result, blood volume measurements will tend to smooth out any fast, dynamic responses in the underlying neuronal population. In the present study, this temporal averaging is likely to be particularly important because the authors are comparing responses to dynamic (nonstationary) stimuli with responses to more constant stimuli. To a first approximation, neuronal responses to dynamic stimuli are themselves dynamic, and responses to constant stimuli are themselves constant. Therefore, the averaging will mean that the responses to dynamic stimuli are suppressed relative to the real responses in the underlying neurons, whereas the responses to constant stimuli are more veridical. On top of this, temporal following rates tend to decrease as one ascends the auditory hierarchy, meaning that the comparison between dynamic and stationary responses will be differently affected in different brain areas. As a result, the dynamic/stationary balance is expected to change as you ascend the hierarchy, and I would expect this to directly affect the results observed in this study.
It is not trivial to extrapolate from what we know about temporal following in the cortex to know exactly what the expected effect would be on the authors' results. As a first-pass control, I would strongly suggest incorporating into the authors' filterbank model a range of realistic temporal following rates (decreasing at higher levels), and spatially and temporally averaging these responses to get modelled cerebral blood flow measurements. I would want to know whether this model showed effects similar to those in Figure 2. From my guess about what this model would show, I think it would not predict the effects shown by the authors in Figure 2. Nevertheless, this is an important issue to address and to control for.
(3) I do not agree with the equivalence that the authors draw between the statistical stationarity of sounds and their classification as foreground or background sounds. It is true that, in a common foreground/background situation - speech against a background of white noise - the foreground is non-stationary and the background is stationary. However, it is easy to come up with examples where this relationship is reversed. For example, a continuous pure tone is perfectly stationary, but will be perceived as a foreground sound if played loudly. Background music may be very non-stationary but still easily ignored as a background sound when listening to overlaid speech. Ultimately, the foreground/background distinction is a perceptual one that is not exclusively determined by physical characteristics of the sounds, and certainly not by a simple measure of stationarity. I understand that the use of foreground/background in the present study increases the likely reach of the paper, but I don't think it is appropriate to use this subjective/imprecise terminology in the results section of the paper.
(4) Related to the above, I think further caveats need to be acknowledged in the study. We do not know what sounds are perceived as foreground or background sounds by ferrets, or indeed whether they make this distinction reliably to the degree that humans do. Furthermore, the individual sounds used here have not been tested for their foreground/background-ness. Thus, the analysis relies on two logical jumps - first, that the stationarity of these sounds predicts their foreground/background perception in humans, and second, that this perceptual distinction is similar in ferrets and humans. I don't think it is known to what degree these jumps are justified. These issues do not directly affect the results, but I think it is essential to address these issues in the Discussion, because they are potentially major caveats to our understanding of the work.
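The temporal-averaging concern raised in point (2) can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions: the 100 Hz sampling rate, the response shapes, and the 1 s boxcar window are all hypothetical, and the boxcar is a crude stand-in for a hemodynamic kernel, not the authors' filterbank model.

```python
def moving_average(signal, window):
    """Causal boxcar average (zero-padded at the start).

    A crude, assumed stand-in for the slow blood-volume response.
    """
    out = []
    running = 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            running -= signal[i - window]  # drop the sample leaving the window
        out.append(running / window)
    return out


# Hypothetical neuronal responses, sampled at 100 Hz for 2 s.
n_samples = 200
transient = [0.0] * n_samples
for i in range(10):           # 100 ms of excitation ...
    transient[i] = 1.0
for i in range(10, 20):       # ... followed by 100 ms of inhibition
    transient[i] = -1.0
sustained = [0.5] * n_samples  # constant response to a stationary sound

window = 100                   # assumed 1 s averaging window
slow_transient = moving_average(transient, window)
slow_sustained = moving_average(sustained, window)

# Peak magnitude of each response after averaging.
peak_transient = max(abs(v) for v in slow_transient)
peak_sustained = max(abs(v) for v in slow_sustained)
print(peak_transient, peak_sustained)
```

In this toy case the raw transient response has twice the peak amplitude of the sustained one, yet after averaging its peak is several times smaller. A realistic control would replace the boxcar with a measured hemodynamic response and use area-specific temporal following rates.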
Reviewer #2 (Public review):
Summary:
Noise invariance is an essential computation in sensory systems for stable perception across a wide range of contexts. In this paper, Landemard et al. perform functional ultrasound imaging across primary, secondary, and tertiary auditory cortex in ferrets to uncover the mesoscale organization of background invariance in auditory cortex. Consistent with previous work, they find that background invariance increases throughout the cortical hierarchy. Importantly, they find that background invariance is largely explained by progressive changes in spectrotemporal tuning across cortical stations, which are biased towards foreground sound features. To test if these results are broadly relevant, they then re-analyze human fMRI data and find that spectro-temporal tuning fails to explain background invariance in human auditory cortex.
Strengths:
(1) Novelty of approach: Though the authors have published on this technique previously, functional ultrasound imaging offers unprecedented temporal and spatial resolution in a species where large-scale calcium imaging is not possible and electrophysiological mapping would take weeks or months. Combining mesoscale imaging with a clever stimulus paradigm, they address a fundamental question in sensory coding.
(2) Quantification and execution: The results are generally clear and well supported by statistical quantification.
(3) Elegance of modeling: The spectrotemporal model presented here is explained clearly and, most importantly, provides a compelling framework for understanding differences in background invariance across cortical areas.
Weaknesses:
(1) Interpretation of the cerebral blood volume signal: While the results are compelling, more caution should be exercised by the authors in framing their results, given that they are using an indirect measure of neural activity. This is the difference between stating "CBV in area MEG was less background invariant than in higher areas" and saying "MEG was less background invariant than other areas". Beyond framing, the basic properties of the CBV signal should be better explored:
a) Cortical vasculature is highly structured (e.g. Kirst et al., 2020, Cell). One potential explanation for the results is simply differences in vasculature and blood flow between primary and secondary areas of auditory cortex, even if fUS is sensitive to changes in blood flow, changes in capillary beds, etc. (Mace et al., 2011, Nat. Methods). This concern could be addressed by either analyzing spontaneous fluctuations in the CBV signal during silent periods or computing a signal-to-noise ratio of voxels across areas across all sound types. This is especially important given the complex 3D geometry of gyri and sulci in the ferret brain.
b) Figure 1 leaves the reader uncertain what exactly is being encoded by the CBV signal, as temporal responses to different stimuli look very similar in the examples shown. One possibility is that the CBV is an acoustic change signal. In that case, sounds that are farther apart in acoustic space from previous sounds would elicit larger responses, which is straightforward to test. Another possibility is that the fUS signal reflects time-varying features in the acoustic signal (e.g. the low-frequency envelope). This could be addressed by cross-correlating the stimulus envelope with the fUS waveform. The third possibility, which the authors argue, is that the magnitude of the fUS signal encodes the stimulus ID. A better understanding of the justification for only looking at the fUS magnitude in a short time window (2-4.8 s re: stimulus onset) would increase my confidence in the results.
(2) Interpretation of the human data: The authors acknowledge in the discussion that there are several differences between fMRI and fUS. The results would be more compelling if they performed a control analysis where they downsampled the ferret fUS data spatially and temporally to match the resolution of fMRI and demonstrated that their ferret results hold with lower spatiotemporal resolution.
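The downsampling control suggested in weakness (2) could be sketched roughly as follows. This is an illustrative outline only: the function names, the toy 4 x 4 image, and the downsampling factors are hypothetical, and the real factors would have to be derived from the actual fUS and fMRI voxel sizes and sampling rates, which are not given here.

```python
def block_mean_2d(image, factor):
    """Average non-overlapping factor x factor blocks of a 2D list (rows x cols).

    Trailing rows/columns that do not fill a whole block are dropped.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def bin_time(series, factor):
    """Average consecutive groups of `factor` samples; drop the remainder."""
    usable = len(series) - len(series) % factor
    return [sum(series[i:i + factor]) / factor
            for i in range(0, usable, factor)]


# Toy 4 x 4 "fUS image" with values 0..15 and a short voxel time course.
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
coarse_image = block_mean_2d(toy_image, 2)
coarse_series = bin_time([1, 2, 3, 4, 5, 6, 7], 3)
print(coarse_image)   # [[2.5, 4.5], [10.5, 12.5]]
print(coarse_series)  # [2.0, 5.0]
```

The invariance analyses would then be rerun on the coarsened data to check whether the ferret gradient survives at fMRI-like resolution.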
Reviewer #3 (Public review):
This paper investigates invariance to natural background noise in the auditory cortex of ferrets and humans. The authors first replicate, in ferrets, a finding from human neuroimaging showing that invariance to background noise increases along the cortical hierarchy (i.e., from primary to non-primary auditory cortex). Next, the authors ask whether this pattern of invariance could be explained by differences in tuning to low-level acoustic features across primary and non-primary regions. The authors conclude that this tuning can explain the spatial organization of background invariance in ferrets, but not in humans. The conclusions of the paper are generally well supported by the data, but additional control analyses are needed to fully substantiate the paper's claims. Finally, additional discussion, and potentially analysis, are needed to reconcile these findings with similar work in the literature (particularly that of Hamersky et al. 2025 J. Neurosci.).
The paper is very straightforwardly written, with a generally clear presentation including well-designed and visually appealing figures. Not only does this paper provide an important replication in a non-human animal model commonly used in auditory neuroscience, but it also extends the original findings in three ways. First, the authors reveal a more fine-grained gradient of background invariance by showing that background invariance increases across primary, secondary, and tertiary cortical regions. Second, the authors address a potential mechanism that might underlie this pattern of invariance by considering whether differences in tuning to frequency and spectrotemporal modulations across regions could account for the observed pattern of invariance. The spectrotemporal modulation encoding model used here is a well-established approach in auditory neuroscience and seems appropriate for exploring potential mechanisms underlying invariance in auditory cortex, particularly in ferrets. However, as discussed below, the analyses based on this simple encoding model are only informative to the extent that the model accurately captures neural responses. Thus, its limitations in modeling non-primary human auditory cortex should be considered when interpreting cross-species comparisons. Third, the authors provide a more complete picture of invariance by additionally analyzing foreground invariance, a complementary measure not explored in the original study. While this analysis feels like a natural extension and its inclusion is appreciated, the interpretation of these foreground invariance findings remains somewhat unclear, as the authors offer limited discussion of their significance or relation to existing literature.
As mentioned above, interpretation of the invariance analyses using predictions from the spectrotemporal modulation encoding model hinges on the model's ability to accurately predict neural responses. Although Figure S5 suggests the encoding model was generally able to predict voxel responses accurately, the authors note in the introduction that, in human auditory cortex, this kind of tuning can explain responses in primary areas but not in non-primary areas (Norman-Haignere & McDermott, PLOS Biol. 2018). Indeed, the prediction accuracy histograms in Figure S5C suggest a slight difference in the model's ability to predict responses in primary versus non-primary voxels. Additional analyses should be done to a) determine whether the prediction accuracies are meaningfully different across regions and b) examine whether controlling for prediction accuracy across regions (i.e., sub-selecting voxels across regions with matched prediction accuracy) affects the outcomes of the invariance analyses.
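One way the matched sub-selection in (b) might be done is histogram matching: bin voxels by prediction accuracy and keep the same number from each region in every bin. The sketch below assumes this approach; the function name, bin width, and accuracy values are hypothetical and are not taken from the paper.

```python
import random


def match_by_accuracy(acc_a, acc_b, bin_width=0.1, seed=0):
    """Return index lists into acc_a and acc_b with matched accuracy histograms.

    Within each accuracy bin present in both regions, the same number of
    voxels is kept from each region (random subsample of the larger set).
    """
    rng = random.Random(seed)

    def bin_indices(acc):
        bins = {}
        for i, a in enumerate(acc):
            bins.setdefault(int(a // bin_width), []).append(i)
        return bins

    bins_a, bins_b = bin_indices(acc_a), bin_indices(acc_b)
    keep_a, keep_b = [], []
    for b in sorted(set(bins_a) & set(bins_b)):
        k = min(len(bins_a[b]), len(bins_b[b]))
        keep_a += sorted(rng.sample(bins_a[b], k))
        keep_b += sorted(rng.sample(bins_b[b], k))
    return keep_a, keep_b


# Hypothetical per-voxel prediction accuracies for two regions.
acc_primary = [0.82, 0.85, 0.74, 0.91]
acc_nonprimary = [0.71, 0.83, 0.55]
keep_primary, keep_nonprimary = match_by_accuracy(acc_primary, acc_nonprimary)
print(keep_primary, keep_nonprimary)
```

The invariance analyses would then be repeated on only the retained voxels, so that any remaining regional difference cannot be attributed to how well the model fits each region.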
A related concern is the procedure used to train the encoding model. From the methods, it appears that the model may have been fit using responses to both isolated and mixture sounds. If so, this raises questions about the interpretability of the invariance analyses. In particular, fitting the model to all stimuli, including mixtures, may inflate the apparent ability of the model to "explain" invariance, since it is effectively trained on the phenomenon it is later evaluated on. Put another way, if a voxel exhibits invariance, and the model is trained to predict the voxel's responses to all types of stimuli (both isolated sounds and mixtures), then the model must also show invariance to the extent it can accurately predict voxel responses, making the result somewhat circular. A more informative approach would be to train the encoding model only on responses to isolated sounds (or even better, a completely independent set of sounds), as this would help clarify whether any observed invariance is emergent from the model (i.e., truly a result of low-level tuning to spectrotemporal features) or simply reflects what it was trained to reproduce.
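The proposed split can be sketched with a deliberately simplified example: fit on isolated sounds only, then predict the held-out mixtures. Everything here is an assumption for illustration; a single-feature least-squares line stands in for the multi-feature spectrotemporal model, and the stimulus values are invented.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical stimuli: (kind, acoustic feature, measured voxel response).
stimuli = [
    ("isolated", 1.0, 2.0),
    ("isolated", 2.0, 4.0),
    ("isolated", 3.0, 6.0),
    ("isolated", 4.0, 8.0),
    ("mixture", 5.0, 9.0),
    ("mixture", 6.0, 11.0),
]

# Fit ONLY on isolated sounds, so mixtures never inform the model.
train = [(f, r) for kind, f, r in stimuli if kind == "isolated"]
slope, intercept = fit_line([f for f, _ in train], [r for _, r in train])

# Predict the held-out mixtures and compare with the measured responses.
mixtures = [(f, r) for kind, f, r in stimuli if kind == "mixture"]
predicted = [slope * f + intercept for f, _ in mixtures]
errors = [p - r for p, (_, r) in zip(predicted, mixtures)]
print(predicted, errors)
```

Invariance computed from such held-out predictions would reflect what the tuning model generalizes to mixtures, not what it was fitted to reproduce.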
Finally, the interpretation of the foreground invariance results remains somewhat unclear. In ferrets (Figure 2I), the authors report relatively little foreground invariance, whereas in humans (Figure 5G), most participants appear to show relatively high levels of foreground invariance in primary auditory cortex (around 0.6 or greater). However, the paper does not explicitly address these apparent cross-species differences. Moreover, the findings in ferrets seem at odds with other recent work in ferrets (Hamersky et al. 2025 J. Neurosci.), which shows that background sounds tend to dominate responses to mixtures, suggesting a prevalence of foreground invariance at the neuronal level. Although this comparison comes with the caveat that the methods differ substantially from those used in the current study, given the contrast with the findings of this paper, further discussion would nonetheless be valuable to help contextualize the current findings and clarify how they relate to prior work.