
Detecting changes in dynamic and complex acoustic environments

  1. Yves Boubenec (corresponding author)
  2. Jennifer Lawlor
  3. Urszula Górska
  4. Shihab Shamma
  5. Bernhard Englitz
  1. Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, France
  2. École normale supérieure, PSL Research University, France
  3. Radboud Universiteit, Netherlands
  4. Jagiellonian University, Poland
  5. University of Maryland, United States
Research Article
Cite as: eLife 2017;6:e24910 doi: 10.7554/eLife.24910

Abstract

Natural sounds, such as wind or rain, are characterized by the statistical occurrence of their constituents. Despite their complexity, listeners readily detect changes in these contexts. Here we address the neural basis of statistical decision-making using a combination of psychophysics, EEG and modelling. In a texture-based, change-detection paradigm, human performance and reaction times improved with longer pre-change exposure, consistent with improved estimation of baseline statistics. Change-locked and decision-related EEG responses were found at a centro-parietal scalp location, whose slope depended on change size, consistent with sensory evidence accumulation. The potential's amplitude scaled with the duration of pre-change exposure, suggesting a time-dependent decision threshold. Auditory cortex-related potentials showed no response to the change. A dual timescale, statistical estimation model accounted for subjects' performance. Furthermore, a decision-augmented auditory cortex model accounted for performance and reaction times, suggesting that the primary cortical representation requires little post-processing to enable change-detection in complex acoustic environments.

https://doi.org/10.7554/eLife.24910.001

Introduction

Many natural and environmental sounds are composed of shorter, elementary events, whose occurrence can be described on a statistical level (Lederman, 1979; McDermott and Simoncelli, 2011; Thoret et al., 2014; Turner and Sahani, 2007). For example, individual drops of water can add together to sound like rain or like a dripping faucet, depending on their number, rate, and relative timing (McDermott et al., 2013). However, in real life, listeners face a dynamic acoustic environment, where statistics do not remain constant for very long. Changes in the statistics of the sound of rustling leaves amidst the sounds of an ongoing storm, or changes in the acoustic composition of a busy cityscape, provide relevant information about putative threats. We investigate here determinants of human performance and their neural representation in these contexts, addressing the hypothesis that the behavior and neural representation are consistent with statistical estimation.

Changes in sound statistics can only be detected if the statistical properties before the change have been estimated sufficiently well (Kaya and Elhilali, 2014; McDermott et al., 2013). Without this estimate, the listener cannot distinguish between 'what to ignore' given the current statistics and 'what to recognize' as a change. Moreover, the quality of this estimate can influence the speed and certainty of detection, which are essential in real-life contexts. The present study thus investigates the factors influencing detection of deviations in sound statistics, and what the underlying dynamics of auditory sensory and evidence accumulation processes are in the human brain. For this purpose, listeners are presented with a continuous sound, whose statistics change at a random time. Hence, they are faced with the dual-task of estimating the baseline statistics and detecting a potential change in those statistics at any moment, which mimics real-life challenges.

The estimation of sound statistics depends on many factors, but most importantly on the complexity of a stimulus in relation to the time available to sample it (Kaya and Elhilali, 2014). A simple stimulus, governed by only a few parameters, can be reliably estimated more quickly than a complex stimulus. We introduce a statistically controlled stimulus that combines simplicity with broad spectral distribution. In contrast to previous studies with narrow-band complex stimuli (Andreou et al., 2015; Cervantes Constantino et al., 2012; Overath et al., 2010; Teki et al., 2013), the sounds here form a minimalistic, but well-controlled model for natural, acoustic textures that are only defined by first order statistics. The task for the subjects was to listen to the texture of the stimulus (for a variable pre-change duration), and then signal the detection of a change in the texture as soon as possible.

We found that detection performance improves with the time available to sample the baseline statistics before the change. As expected, detection performance also depended on the saliency of the change. EEG recordings from auditory projection sites showed a strong response related to the onset of the sound, but did not exhibit a discernible response related specifically to the subsequent change in stimulus statistics. By contrast, EEG signals over parietal cortex appeared after the time of change, and displayed a build-up rate that depended on the size of the change (consistent with EEG responses in other evidence integration tasks, e.g. O’Connell et al., 2012; Kelly and O'Connell, 2013). The peak amplitude of this potential also increased with change size, but decreased with pre-change interval, i.e. the time available to the subjects to sample the stimulus baseline statistics. Performance and reaction times were well predicted by a model of statistical estimation based on the difference in the outputs of two leaky integrators operating at fast and slow timescales. In addition, a model of auditory cortical processing (Chi et al., 2005; Overath et al., 2008) augmented with an accumulation-to-bound decision stage also accounted for the EEG responses and subjects’ behaviors, thus suggesting that decision-making in such statistically complex acoustic environments may only require minor post-processing (channel-selection and averaging) beyond the early auditory cortex.

Results

We investigated the neural mechanisms of detecting changes in the statistics of auditory stimuli, on the basis of human behavioral performance, neural response and models of acoustic processing leading to decision-making. In a set of psychoacoustic experiments, listeners (n = 12) were presented with complex acoustic stimuli, whose statistics could change at a random time. Several parameters of the change were varied in order to estimate their influence on the change's saliency. In a different set of listeners (n = 18) EEG responses were collected to track the brain dynamics reflecting the accumulation of sensory evidence leading to the detection of a change in sound statistics. We propose a simple model to account for the listener's behavior, which is based on the estimation of stimulus statistics on two timescales. Finally, we suggest a neural implementation of this principle based on a model of auditory cortical processing.

Detection of changes in statistics is consistent with estimation of marginal distribution

The ability to detect a change in stimulus statistics improved in trials that provided more time before the change (‘change time’ in Figure 1A) for subjects to listen to the baseline statistics of the texture. Performance increased monotonically to different asymptotic levels for the four tested change sizes (50, 80, 110, 140%, Figure 2A). Asymptotic performance depended on change size: bigger changes in marginal probability led to greater asymptotic performance, with levels ranging from 50% to 95% (Figure 2A, p_size < 10^−5, Friedman; p_time < 10^−5, Friedman). Change size also influenced the dependence on change time, such that greater change sizes led to improved performance at shorter change times than for smaller change sizes (Figure 2A). This translates to a combined steepening and leftward shift of the performance curves with change size. The significance of this effect was assessed by fitting the performance curves for individual subjects with a parametric function of sigmoidal shape (an Erlang CDF, see Materials and methods) in order to extract the change size-dependent time constant. The characteristic time constant τ decreased significantly as a function of change size (Figure 2B; p < 10^−6, Kruskal-Wallis).
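The fitted function is only named in this section; its exact parameterization is given in Materials and methods. As a minimal sketch of what such a fit evaluates, the block below assumes an Erlang CDF with integer shape k and time constant τ, scaled by an asymptotic performance level. The function names, the scaling by an asymptote and all parameter values are illustrative assumptions, not the authors' code.

```python
import math

def erlang_cdf(t, k, tau):
    """Erlang CDF with integer shape k and time constant tau (seconds)."""
    if t <= 0:
        return 0.0
    x = t / tau
    # 1 - exp(-x) * sum_{n<k} x^n / n!
    partial = sum(x**n / math.factorial(n) for n in range(k))
    return 1.0 - math.exp(-x) * partial

def performance_curve(t, asymptote, k, tau):
    """Hypothetical performance model: asymptote times an Erlang CDF."""
    return asymptote * erlang_cdf(t, k, tau)

# Illustrative values only: a smaller tau gives a steeper, left-shifted curve.
slow_curve = [performance_curve(t, 0.9, 2, 1.0) for t in (0.5, 1.0, 2.0, 4.0)]
fast_curve = [performance_curve(t, 0.9, 2, 0.5) for t in (0.5, 1.0, 2.0, 4.0)]
```

In this form, a decrease in τ with change size corresponds to the steepening and leftward shift of the performance curves described above.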

Figure 1
Dynamical change-detection paradigm with auditory textures.

(A) Subjects listened to an acoustic textural stimulus, whose predictability was governed by its marginal frequency distribution (grey curve, left panel). Tones in individual frequency bins were drawn independently consistent with the marginal (middle panel). Listeners were instructed to report changes by a button press. The frequency marginal was modified (indicated in orange in the right panel distribution) after a randomly chosen point in time (‘change time’). The probabilities in two adjacent or non-adjacent frequency bins were increased together, and the distribution over the bins renormalized to maintain average global level. (B) The distribution of change times was chosen from an exponential distribution. This ensured that the probability of a change in the next time-bin remained constant (shown here is the empirical distribution). (C) Response times occurred before (false alarms) and after the change time (hits). Subjects usually responded only after an initial listening duration, allowing them to acquire the sound statistics.

https://doi.org/10.7554/eLife.24910.002
Figure 2 (with 3 supplements)
Detecting a change in statistics improves with size and time of change.

(A) Performance of change detection depended significantly on change time (abscissa) and change size (shades of orange indicate the step size as percent of the original bin probability, see inset). Only changes in contiguous bins were used presently, to maintain identical trial numbers across difficulties. (B) The dynamics of the performance curve varied with change size, indicated by the speed parameter τ of an Erlang CDF fitted to the data (see Materials and methods). (C) Dynamical d’ confirms the dependence of performance on change time and change size. The dependence on change time suggests an improved detection relying on a converged estimate of the baseline statistics, whereas the dependence on change size indicates a higher level of certainty can be attained more rapidly if the amount of evidence is larger. (D) Instantaneous false alarm rate is uniform across time, after an initial hesitation to respond in the first 2 s. The initial hesitation is likely due to the task-design, requiring an initial estimation of the sound statistics.

https://doi.org/10.7554/eLife.24910.003

The observed performance could alternatively be explained by a timing strategy or a pattern recognition strategy. Both of these explanations can be rejected based on the data and the paradigm: if subjects had used a timing strategy, their instantaneous false alarm rate (as a function of change time, divided by the window length) should never reach a constant value. Instead, the false alarm rate exhibits an initial linear increase, followed by a constant false alarm rate per unit time (Figure 2D), a feature that was embodied in the behavior of the models (see Figures 7E/8F). Furthermore, the initial rising portion of the false alarm rate is a consequence of the dual estimation task design. The uniform regime of false alarm rate is consistent with the use of an exponential distribution of change times, which keeps the change occurrence probability constant per unit of time (see Figure 1B and Materials and methods).
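The constant change probability per unit time follows from the hazard rate of the exponential distribution, which is the density divided by the survival function. The sketch below checks this numerically; the mean change time of 3 s is a hypothetical value chosen for illustration, not the value used in the experiment.

```python
import math

def exponential_hazard(t, mean_change_time):
    """Hazard rate (pdf / survival) of an exponential distribution at time t."""
    rate = 1.0 / mean_change_time
    pdf = rate * math.exp(-rate * t)
    survival = math.exp(-rate * t)  # probability that no change has occurred by t
    return pdf / survival

# Hypothetical mean change time of 3 s; the hazard is the same at every t.
hazards = [exponential_hazard(t, mean_change_time=3.0) for t in (0.5, 2.0, 6.0)]
```

Because the hazard does not depend on elapsed time, waiting longer gives the listener no additional information about when the change will occur.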

Some subjects could have attempted to use a pattern recognition strategy, i.e. effectively ignoring the statistics of the first stimulus. However, based on the stimulus design, a pattern recognition strategy would have failed, since the first stimulus was drawn randomly for each trial, and the second was a stochastic modification of the first. Further, in this case, detection performance should not have depended on change time. Altogether, these results are inconsistent with both a pattern recognition and a timing strategy.

Using the time-dependent false alarm rate, the sensitivity of the subjects to detect a change can be analyzed with a time-dependent d' (see Materials and methods for details of computing this d'). This analysis exhibited similar monotonically increasing shapes as a function of both change time and size (Figure 2C). Further, probability in a frequency bin was positively correlated with change detection (Figure 2—figure supplement 1), consistent with the idea that a high rate of samples provided a better estimate of the probability value in a frequency bin. We can rule out that only large probability bins were attended to, since the performance for equal size changes in large probability bins is dominated by the change in other, lower probability bins (Figure 2—figure supplement 2). Finally, longer stimulus duration in the current trial predicted a reduced performance in the following trial (Figure 2—figure supplement 3), suggesting that the converged estimate in the previous trial could ‘contaminate’ the estimation process in the subsequent trial. This is another indication that subjects were not using a pattern recognition strategy, as such a strategy completely ignores the statistics presented in the previous trial.
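The time-dependent d' used here is specified in Materials and methods. For orientation only, the sketch below shows the standard signal-detection definition, d' = z(hit rate) − z(false alarm rate), evaluated per change-time bin. The clipping constant `eps` and the example rates are assumptions made for illustration, and the paper's procedure may differ.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate, eps=1e-3):
    """Standard d' = z(hit) - z(false alarm); rates are clipped to (eps, 1 - eps)."""
    z = NormalDist().inv_cdf
    h = min(max(hit_rate, eps), 1.0 - eps)
    f = min(max(false_alarm_rate, eps), 1.0 - eps)
    return z(h) - z(f)

# Hypothetical hit rates per change-time bin with a constant false alarm rate.
hit_rates = [0.55, 0.70, 0.85, 0.90]
sensitivities = [d_prime(h, 0.10) for h in hit_rates]
```

With a constant false alarm rate, d' rises with the hit rate, which is the pattern summarized in Figure 2C.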

In summary, these findings indicate that change detection (i) improves with time allowed to sample the stimulus, (ii) improves with the size of the change and (iii) saturates with longer observation intervals. These properties are consistent with statistical decision-making, where a decision can only be made if the observed change in a stimulus property is substantial compared to the current uncertainty about the same property. Subjects using statistical decision-making can (i) reduce their uncertainty by collecting more stimulus information over time, (ii) use larger differences in the stimulus property to overcome the uncertainty more rapidly, and (iii) will not be able to improve their performance once the estimation of the stimulus statistics has saturated.

Reaction times are consistent with statistical estimation

The dependence of performance on change time suggests a dynamical mechanism performing an on-going estimation of the initial statistics. To gain insights into these dynamics, we examined the dependence of reaction times on the parameters of the change, especially its size (which intuitively correlates inversely with task difficulty, according to Piéron’s law; Pins and Bonnet, 1996) and its time of occurrence (or ‘change time’).

Reaction time distributions changed both in duration and shape as a function of change size (Figure 3A). Median reaction time decreased with larger change sizes (p < 10^−3; Kruskal-Wallis, Figure 3B), in accordance with the increase in performance with larger change sizes. Receiver operating characteristic (ROC)-based analysis indicated that the distributions of reaction times differed across change sizes and from chance level (Figure 3—figure supplement 1; p < 10^−7; Friedman). More specifically, we found a significant difference between the most difficult condition and chance level (p < 10^−5; Kruskal-Wallis), confirming that subjects were performing above chance at all change sizes. This suggests that the time necessary to detect the deviation between the pre- and post-change stimulus statistics was reduced for larger change sizes.

Figure 3 (with 1 supplement)
Reaction times also reflect estimation of pre- and post-change stimulus properties.

(A) Reaction time distribution sharpens with change size. (B) Median response time decreases significantly by 20% (p < 10^−4, Kruskal-Wallis) with larger change size (different colors indicate different change sizes). These effects indicate a faster, temporally more constrained decision, which could indicate more rapid evidence accumulation for larger changes. (C) Reaction time distribution sharpens with change time. (D) Median reaction time decreases rapidly with change time by 25% (p < 10^−5, Kruskal-Wallis). Both effects indicate a higher degree of certainty in decision making, which could indicate a more converged estimation of the pre-change statistics.

https://doi.org/10.7554/eLife.24910.007

For shorter change times, the reaction time distribution changed in a qualitatively similar manner to that observed for smaller change sizes, although the effect was less pronounced (Figure 3C). Median reaction times decreased with change times, mirroring the dependence of performance on change times (p < 10^−5; Kruskal-Wallis, Figure 3D). This dependence can already be seen in the raw data (Figure 1C), where hit trials (black) for longer change times exhibited shorter reaction times. Again, the timing of the first correct responses decreased correspondingly with longer change time, suggesting more accurate estimation of the initial statistics.

Dependence on spectral location of acoustic change

Changes in stimulus statistics are effectively a probabilistic redistribution of the stimulus energy in the spectrotemporal domain, here restricted to the spectral axis. Therefore, we hypothesized that a detection process acting in a spectrally localized manner should perform better when the total energy of the change is concentrated in a restricted frequency region. Indeed, we found that performance decreased for non-localized changes when compared with localized ones (Figure 4A). This effect was significant only when distances below and above eight semitones were grouped (Figure 4A; p < 5 × 10^−3). Finally, we found that performance did not vary as a function of relative position along the frequency axis (p=0.28; Figure 4B), contrary to the predictions of a recent study (Catz and Noreña, 2013) showing that the cortical representation at the extreme edges of the stimulus spectrum could be enhanced for sharp contrast, resulting in lower change detection thresholds.

Figure 4
Detectability of changes depends on spectral properties of the change.

(A) Spectral distance between the changed bin centers ('change distribution', measured in semitones, st) significantly reduces performance (p=0.01, Kruskal-Wallis test). Spectral distance ranged from neighboring (three st) bin centers to locations at the edges of the tested range (23 st). (B) Absolute spectral position of the changed bins does not influence performance (p=0.85, Kruskal-Wallis). Absolute spectral position was not significantly correlated with the detectability.

https://doi.org/10.7554/eLife.24910.009

EEG responses correlate with accumulation of sensory evidence

We collected neural responses using electroencephalography (EEG) in human subjects performing the above psychoacoustic task to study the relationship between behavioral performance and neural responses, and to narrow down the scalp regions whose neural response reflects the change in statistics. The analysis was focused on a subset of the recording electrodes, namely an auditory (central location, El.1; corresponding to the center in Nie et al., 2014) and a centro-parietal (14,27,28; corresponding to Twomey et al., 2015) set. Depicted potentials show averages across each set of electrodes. Subjects exhibited similar performance and reaction time dependencies on change time as in the psychophysical experiments (Figure 5—figure supplement 1). Change times were binned into four bins based on their distribution and Hit rate to equalize trials per bin.

At stimulus onset, the average auditory potential exhibited a classical, large and rapid event-related potential (ERP) (Figure 5A,C, composed of N1 and P2), followed by a negative sustained potential (indicated as NS in the figure) previously described for prolonged stimulus duration (Hari et al., 1980; Lammertmann and Lütkenhöner, 2001; Lütkenhöner et al., 2011). However, there was no systematic evidence for a response to the change in statistics (Figure 5B1, EEG of Hit trials aligned to change times). EEG aligned to subjects’ response time also did not show a significant response (Figure 5B2, EEG of Hit trials aligned to button-press, different colors indicate different change sizes, averaged over all change times, see below for differences in change time). This suggests that the detection of the change in statistics was not accompanied by an overall response in the auditory cortex comparable to other stimulus changes such as stimulus onset or offset (compare also to the model responses in Figure 8B, see also Discussion). While this does not preclude the information about the change to be available in early auditory cortex, there is no specific, overall reaction to the change, compared to the continuous representation of the stimulus.

Figure 5 (with 2 supplements)
The CPP potential shows a dependence on both time and size of change, while the central potential remains unaffected.

(A) After stimulus onset, the central potential (Ch. 1, black dot in C) shows a classical N1-P2 progression, followed by a sustained negative potential (labelled NS here). Different shades of red indicate different change sizes. Curves are averaged over all change times, to avoid crowding the plots. Note that the lowpass filtering at 20 Hz (common for all potentials) reduces the N1/P2 amplitudes below their typical size. (B1) Locked to the time of change, the central potential shows a slow negative trend, which, however, does not depend systematically on change size. (B2) Preceding the response, the central electrodes show no significant change in potential, which only starts to deviate from 0 after the button press. (C) At 200 ms after stimulus onset, the topography of the potential indicates a typical auditory onset response for bilateral stimulation, i.e. centered on Cz (El.1 in the equidistant layout, black dot). (D) The potential above the central parietal cortex (average over Ch. 14,27,28 in the equidistant cap, black dots in F) shows no substantial change at stimulus onset. (E1) Aligned to the time of change, the CPP electrodes show a progressive increase in potential, with some staggering according to change size. In comparison to the response-locked potentials, the present potential is wider and smaller since it is composed of responses at different times. (E2) In contrast to the central electrodes, the CPP electrodes show a clear increase before the response, peaking at or slightly after the response time. (F) The topography locked to the response is found to be centered over the parietal cortex, tending towards the occipital cortex (black dots mark Ch. 14,27,28). The inset shows the difference between the 140% and 50% condition, indicating that the difference in potential is also localized consistently with the average topography. Note that there was no display change during the entire tone presentation, and a 0.5 s gap after the response before the screen changed; hence, visual responses can be excluded. (G) CPP slope of the potential leading up to the response in relation to the different change time and size conditions was measured in a window of 300–50 ms before the response. (H) CPP slope depended significantly on change size (2-way ANOVA with change time and change size as factors, p<<0.001 for change size as a factor). (I) CPP slope did not depend significantly on change time (ANOVA as above, p=0.07). (J) CPP slope for false alarms showed no significant dependence on the time into the trial (p=0.76, 1-way ANOVA). (K) Peak height of the CPP was measured in a symmetric window of 80 ms around the response time. (L) Peak height of the CPP showed a significant increase with change size (2-way ANOVA with change time and size as factors, p<<0.001 for change size). (M) Peak height depended significantly on change time, decreasing with longer change times (ANOVA as above, p<<0.001 for change time). (N) Peak heights for false alarms showed no dependence on time into the trial (p=0.43, 1-way ANOVA) but were significantly smaller than the hit trials (p < 10^−9, 1-way ANOVA). Error bars indicate single SEMs for all plots.

https://doi.org/10.7554/eLife.24910.010

The centro-parietal electrodes exhibited a centro-parietal positivity (CPP) reported previously (O’Connell et al., 2012; Kelly and O'Connell, 2013; Twomey et al., 2015) in a similar location (see Figure 5F for its topography at response time). In contrast to the central electrodes, the CPP did not display any clear response to sound onset (Figure 5D) but exhibited a long-lasting response following change events (Figure 5E1). This EEG signal built up before subjects’ responses across change sizes (Figure 5E2) and outlasted the button press. The difference between change sizes was colocalized with the CPP (Figure 5F, inset), indicating that the difference in amplitude is not due to a global shift in potential. In previous studies, the CPP potential was clearly linked to evidence integration in decision making tasks, e.g. in simple visual and auditory detection tasks (O’Connell et al., 2012) and a complex visual discrimination task (Kelly and O'Connell, 2013). We therefore hypothesized the CPP to also be indicative of evidence integration in complex auditory detection tasks. In order to assess this, we examined how the CPP potential depended on the amount of evidence, and whether it exhibited accumulation-to-threshold dynamics.

Both the slope (Figure 5G) and the height (Figure 5K) of the response-aligned CPP potential depended on the stimulus parameters. The slope increased significantly with change size (Figure 5H, p<<0.001, 2-way ANOVA across change size and change time), but was not significantly dependent on change time (Figure 5I, p=0.074, same ANOVA). The effect of change size on slope is consistent with a representation of task-related evidence in the CPP signal, as reported previously in other change detection tasks (O’Connell et al., 2012).
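The slope measure can be illustrated with a minimal sketch. It assumes that the slope is obtained as a least-squares linear fit to the response-locked trace within the window of 300–50 ms before the response stated in the caption of Figure 5G. The fitting method, the function name `window_slope` and the synthetic trace are assumptions for illustration, not the authors' analysis code.

```python
def window_slope(times_s, potential_uv, start_s=-0.300, end_s=-0.050):
    """Least-squares slope (µV/s) of a response-locked trace within [start_s, end_s].

    times_s are in seconds relative to the button press (negative = before).
    """
    pts = [(t, v) for t, v in zip(times_s, potential_uv) if start_s <= t <= end_s]
    if len(pts) < 2:
        raise ValueError("need at least two samples inside the window")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    return num / den

# Synthetic trace sampled at 1 kHz that rises linearly at 4 µV/s.
times = [i / 1000 for i in range(-500, 101)]
trace = [4.0 * t + 1.0 for t in times]
slope = window_slope(times, trace)
```

Applied per condition, a measure of this kind yields one slope per change size and change time bin, which is what the ANOVA above compares.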

The height of the potential also increased significantly with change size (Figure 5L, p<<0.001, 2-way ANOVA across change size and change time), and decreased significantly as a function of time (Figure 5M, p<<0.001, same ANOVA). Such a change size dependence has been reported before (see Figure 2 in O’Connell et al., 2012), and at first appears inconsistent with a fixed threshold. However, since the execution of the button press requires some time, the application of the threshold has to precede the button press by some delay. The observed difference in heights could thus reflect a continued accumulation of evidence at different slopes, during the time interval between decision commitment and response completion, until the execution of the decision is communicated to the CPP source. Consistent with this interpretation, CPP height did not exhibit a dependence on change size, if measured in a window of 200–100 ms preceding the response time (p=0.16, same ANOVA, close to the crossing in Figure 5E2). In addition, we verified that the CPP height did not depend on the reaction time (Figure 6), as expected from an evidence accumulator signal (Kelly and O'Connell, 2013).

Figure 6
The CPP potential shows no dependence on whether responses occur early or late after the change.

(A) CPP potentials aligned to response as in Figure 5E2 (for second change-time bin, i.e. around 2.4 s). The solid lines are the early responses (up to median reaction time) and the dashed lines are the late responses (median reaction time to end of response-window). (B) Across all conditions the reaction time did not significantly influence the height of the CPP potential (p=0.36 for reaction time, 3-way ANOVA over reaction time, change size and change time).

https://doi.org/10.7554/eLife.24910.013

The height decrease as a function of change time is indicative of a reduction in threshold as a function of time (Figure 5M). However, we did not observe an increase in FA rate later in the trial (Figure 2D), suggesting no increase in unfounded decisions. Although the time-dependence of CPP height could result in a decrease of CPP height for late versus early reaction times, we did not find any significant decrease in CPP height for late reaction times, which may be due to a rather small effect-size (Figure 6B).

Finally, CPP responses aligned to false alarms exhibited similar slope and amplitude to the lower signal conditions (50%, 80%), but were significantly lower than the signal conditions overall (p<<0.001, 1-way ANOVA, across change size). Neither slope nor height depended on time into the trial (Figure 5J/N, p=0.76, p=0.43, respectively, 1-way ANOVA, across different time-into-trial bins). Together these results suggest that the decision threshold on the CPP is close to the lowest change size / false alarm height.

None of these results depended on the detrending method, as verified by the alternative use of a classical high-pass filter (see Materials and methods and Figure 5—figure supplement 2).

In summary, we found central and centro-parietal electrodes to respond in a diametrically opposed manner to stimulus onset and (detection of) change in statistics. The CPP potential remained practically silent to stimulus onset, but reflected properties of the stimulus/task when aligned to button press. These results reinforce the notion that the CPP potential reflects sensory evidence accumulation and exhibits accumulation-to-threshold dynamics, with the possibility of continued integration until actual response execution. As a function of change time, only the CPP potential's height reduced, suggesting a time-dependent threshold.

Dual timescale statistical estimation model matches human response behavior

The psychoacoustic results demonstrate that a listener’s ability to detect a change in a statistical property of the environment depends on the time available to estimate this parameter, both for the pre- and post-change stimulus. However, how does the listener know when to start estimating the new statistics? Since, as in real life, the change occurs at an unpredictable time, one solution would be to compare the recent statistics to a longer term estimate of the same statistics, acting as a baseline (or 'null') distribution. A minimal implementation of this solution consists of two processes estimating the same statistical property on different timescales (Figure 7).

Figure 7
Dual timescale statistical estimation replicates behavioral results.

(A) The dual timescale model consists of two dynamical estimation processes operating with different speeds. If their estimates differ by more than a threshold T, a change in the stimulus is detected. The model was fitted to the entire set of behavioral data (D–G). (B) In a single trial the slow (Pslow, blue) and the fast (Pfast, purple) estimates of the actual stimulus probability (light grey) vary with the stimulus (black) on different timescales. Here, a decision (|Pfast − Pslow| > T) is detected at 300 ms after the change in the stimulus (red). (C) The distribution of response times compared with the change times exhibits a similar shape as for the real subjects (see Figure 1C). (D) Detection performance of the model (dashed lines) closely matches the human data (continuous line with 1 SEM error hull) both as a function of change time and change size (different shades, see legend in G; see text for parameter values). (E) False alarm rates are also matched closely (same legend as in D). (F) Miss rates are matched equally closely (same legend as in D). (G) Response time distributions are also matched closely, which is of interest as no explicit model of response times was included in the model (same legend as in D).

https://doi.org/10.7554/eLife.24910.014

For this purpose, we turned to models of statistical estimation of the drift diffusion type, used previously to account for visual and auditory decision making in paradigms where subjects were asked to choose between two alternatives (Britten et al., 1996; Brunton et al., 2013). In these models a dynamic variable compares the stimulus information in favor of the two alternatives, and when it reaches a predefined bound, a decision is made. We extended this model to a pair of variables, estimating the statistical property on different timescales (Figure 7A–B and Materials and methods). A deviation is detected if the long-term estimate (Figure 7B, Pslow) and the short-term estimate (Pfast) differ by more than a threshold (Figure 7B, T). As introduced above, this was intended to capture the dual task the participants faced in our paradigm, namely to estimate the base (initial) statistics while simultaneously scanning for deviations from these statistics. The modified model is governed by four parameters, which control the timescales of the dynamic variables and the threshold. To make the model applicable to our auditory textures, we assume that multiple copies of it operate in parallel in different frequency channels (see Materials and methods).

We presented an analogous stimulus to the model, exhibiting a change in the probability of tone occurrence at a random time (Figure 7A left and 7B, gray, only one frequency bin shown) and in a random frequency location, and quantified the model’s response in performance and response time. On individual trials, the model behaved comparably to humans (Figure 7C, compare to Figure 1C), with an initial hesitation to respond, and a mixture of false alarms (gray), correct responses (black) and misses (not shown). We quantified the response outcomes (performance, false alarms, misses) and the reaction times as a function of the change times and the change size (Figure 7D–G). The match between the human data and the model was close, with an average residual (mean squared error) of 0.049 (in units of probability). The correlation coefficients between the real data and the fit were 0.97, 0.99 and 0.98 for performance (Figure 7D), false alarms (Figure 7E) and misses (Figure 7F), respectively.

The reaction times could be accounted for both in mean and distribution for different change sizes (r = 0.95, MSE = 0.009 (norm. prob.), Figure 7G). For the condition with the biggest step, a certain fraction of the responses occurred very early, which may be subject-dependent and which we were unable to replicate in the present model.

The parameters that best fit the average human data were τf = 0.2 s, τs = 1.1 s, τa = 0.65 s, and T=0.40 (in units of probability). Hence, the time constants of the fast and the slow processes differed by more than fivefold, and the threshold for detecting a step was surprisingly high. The time for eliciting a motor signal was consistent with the asymptotic times we found in the human data (see Figures 3B and 140 %). The time constant of the transitional period represents (as the other parameters) an average over the subjects. Inspecting individual subjects revealed some variability in their propensity to react early (min median: 0.77 s; max median: 1.03 s).
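To make the role of the fitted values concrete, the following is a minimal sketch of a dual timescale detector for a single frequency channel. It is a simplification under stated assumptions, not the fitted model: both estimators are plain leaky integrators with fixed time constants, the adaptive transition of the slow time constant (τa) and the parallel frequency channels are omitted, and the function name `detect_change` and the step size `dt` are hypothetical choices for illustration.

```python
import math


def detect_change(stimulus, dt=0.03, tau_fast=0.2, tau_slow=1.1, threshold=0.40):
    """Return the index of the first sample where p_fast - p_slow > threshold.

    stimulus: sequence of per-step occurrence probabilities (or 0/1 tone
    occurrences) in one frequency channel. Returns None if no change is found.
    ASSUMPTION: fixed time constants; the adaptive slow time constant of the
    model described in the text is not reproduced here.
    """
    # Exact discretization of a first-order leaky integrator for step size dt.
    alpha_fast = 1.0 - math.exp(-dt / tau_fast)
    alpha_slow = 1.0 - math.exp(-dt / tau_slow)
    # ASSUMPTION: both estimates start at the first sample.
    p_fast = p_slow = stimulus[0]
    for i, x in enumerate(stimulus):
        p_fast += alpha_fast * (x - p_fast)
        p_slow += alpha_slow * (x - p_slow)
        if p_fast - p_slow > threshold:
            return i
    return None


if __name__ == "__main__":
    steps = 67  # about 2 s at 30 ms per step
    large_step = [0.1] * steps + [0.9] * steps
    small_step = [0.1] * steps + [0.3] * steps
    print("large step detected at index:", detect_change(large_step))
    print("small step detected at index:", detect_change(small_step))
```

In this simplified setting, a large noise-free step in probability drives the fast estimate away from the slow one quickly enough to cross T, whereas a small step never does, which mirrors the observation that small changes are often perceived as unchanged statistics.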

The residual differences in the fit could be a consequence of the fact that the data from multiple listeners was pooled, rather than fitted individually. With the current limitation of ~1000 trials / listener, a single listener fit would be dominated by within-subject variability across trials, requiring more trials before stabilizing.

In summary, the dual timescale estimation model captures the human performance and reaction times well, suggesting that its basic principle may be implemented by the brain. The fitted timescales of estimation suggest that a rapid estimate of the present statistics can be formed within 200 ms. While this time appears sufficient to reliably distinguish the larger steps in statistics, it is insufficient to detect small changes in occurrence probability, which are often perceived as unchanged statistics.

In relation to the CPP's response properties, it is noteworthy that the decision variable in the dual-time scale model exhibits a similar, positive dependence between slope and change size (evident from Equations 2/3/3). If the evidence accumulation continued during the time interval between decision commitment and the actual motor execution, this would translate into a dependence of the height on the change-size as well. In agreement with the CPP dynamics, the slope should not depend on change time. Instead, the estimate of Pslow(t) should exhibit better convergence as a function of change time, leading to improved discrimination against Pfast(t). This non-decision period separating the crossing of the threshold and the actual response execution is implicitly incorporated in the model as the motor-related increment in reaction times.

Detection of changed statistics based on spectrotemporal processing in auditory cortex

The dual timescale model successfully captures human performance via an estimation of stimulus statistics. While this suggests a consistency with the principle of statistical estimation, it does not provide any insights into putative neural implementations. For this purpose, we turn to an established model of auditory cortical processing ('cortical model', Chi et al., 2005; Elhilali et al., 2009; Krishnan et al., 2014; Patil et al., 2012; Yang et al., 1992), which we augment here with a decision stage specific to the present task. In particular, this alternative model investigates whether the cortical model (and hence the primary auditory cortex) represents the acoustic stimulus in a way that supports an account of our psychoacoustic data, i.e. supports decision making in certain statistical contexts.

The cortical model emulates the spectrotemporal response properties of neurons in primary auditory cortex, which have been extensively studied by various groups (Ahrens et al., 2008; Eggermont, 2002; Kowalski et al., 1996). Its responses are based on a filterbank-based, joint spectrotemporal modulation analysis following the output of the early stages of the auditory system (auditory spectrogram, Figure 8A). Parameters and properties were set to approximate the responses of neurons in auditory cortex (see Materials and methods for details) (Chi et al., 2005; Yang et al., 1992). The spectrotemporal filters cover the experimentally observed range of 1–30 Hz and 0.5–8 cycle/oct, and their outputs are weighted in correspondence with the experimentally observed abundance of these properties in A1 (Kowalski et al., 1996; Figure 8B).

A cortical filter-bank model provides an implementation consistent with the behavioral results.

(A) Conceptual structure of the model. The cochleogram (top panel) is passed through modulation filters (scale Ω: 0.54 cycle/oct.; rate ω: 0.72 Hz) to obtain a cortical representation of the sound (middle panel). Changes are detected with a threshold (bottom panel, grey dashed line) applied to the frequency-averaged cortical representation (collapsing threshold parameters: λ = 1.14 s; b = 10.77; a = 6.23). The first peak exceeding the threshold is classified as a change (purple arrow). The timing of the change is indicated by a red arrow in the three panels. (B) Average output of the cortical model across all modulation filters. Although trial onset elicits an overall increase in activity, the change in statistics does not lead to an average change in activity (depiction for single trial length, with change time indicated by arrow). (C) Single filter output as a function of change time (average over 100 trials for each curve). Change times are indicated by colored arrows. Notice that the change-related peak is not discernible for early changes, due to its interaction with the onset response. Same parameters as in A. (D) Single filter output as a function of change size (average over 100 trials for each curve). Same parameters as in A. (E) Performance for human participants (thin lines) and the decision model (dashed thick lines), as a function of change size and change time. Same colors as in D. (F) False alarm rate as a function of change size and change time. Same colors as in D. (G) Response time distributions as a function of change size. Same colors as in D. (H) Decrease in performance with respect to the distance between incremented bins. Actual data are shown as a full line, the model result as a dashed black line.

https://doi.org/10.7554/eLife.24910.015

We simulated two types of readouts from the model to account for two of the main experimental constraints. For the first, we summed all cortical outputs to simulate an effective EEG recording with limited spatial separation of sources, leading to a global response. As expected, in this case, trial onsets and offsets produced strong responses (Figure 8B), with a plateau of sustained response for the whole duration of the stimulus. The responses due to the statistical change in the stimulus were largely diluted in the summed response and thus could not be discerned, consistent with the present EEG recordings of the auditory electrodes (Figure 5B).

The ranges of spectral bandwidths and timescales related to the change were kept constant over the whole duration of the task. Consequently, a more optimal strategy would be to focus on the temporal modulation filters in cortex that are most activated by the statistical change. Hence, we postulated that high-order areas could monitor the outputs of the task-relevant temporal filters. For example, subjects could make their decisions based on the largest output produced by the change. These would be sampled from the filter with the temporal dynamics and spectral modulation that roughly matched those of the stimuli. The response thus selected is shown in Figure 8A. Aside from the strong responses at stimulus onset and offset, the responses now exhibited in addition a prominent intermediate peak due to the change in statistics (Figure 8A). This change-induced response peak vanished in trials when the pre-change interval was very short because it became fused with the large onset peak (Figure 8C). Change size was encoded in the slope of this cortical response (Figure 8D), consistent with the neural CPP response (Figure 5H). The variability in responses of the cortical outputs was solely due to the random tone-clouds preceding and following the change in the stimulus.

To quantitatively simulate the perceptual decisions of the listeners, we analyzed the cortical filter outputs for individual trials. In order to take into account the time-dependence of the CPP amplitude in the EEG recordings, we used a time-varying threshold that remained identical across all conditions (see Figure 8A and Materials and methods). The first peak exceeding this threshold (if any) was considered to be the decision point (purple arrow in Figure 8A). This readout mechanism was fitted to the performance and false alarm rate across change sizes and change times by allowing five free parameters (Figure 8E–F): the (bandwidth) scale Ω, the (temporal) rate ω, and the decision parameters (λ, a, b; see Materials and methods). The parameters that best fitted the human dataset were Ω = 0.54 cycle/oct, ω = 0.72 Hz (a rate corresponding approximately to dynamics or an integration time-constant of the order of 1–2 s), and a = 6.2, b = 10.8, and λ = 1.14 s (ρ = 0.95; p < 5 × 10⁻¹⁶; MSE = 0.7%). The scale value corresponds to a full width at half-maximum for the scale filter of approximately 0.56 octave, very close to the frequency region spanned by localized changes (0.55 octave). This may indicate that subjects preferentially used a single scale value for monitoring the frequency modulation and that they estimated the most common frequency modulation across trials, since half of the trials contained localized changes. Importantly, reaction times predicted by the model matched subject reaction times remarkably well in distributional shape, mean and spread (Figure 8G), although the fitting procedure did not make use of this information (ρ = 0.90; p < 5 × 10⁻¹³; MSE = 9.1%).
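The readout rule described above (first peak exceeding a time-varying threshold) can be sketched as follows. The functional form of the threshold is defined in Materials and methods and is not reproduced in this section, so the exponential decay from a + b towards a used here is purely an assumption for illustration, as are the function names and the synthetic Gaussian-bump trace standing in for a cortical filter output.

```python
import math


def collapsing_threshold(t, a=6.2, b=10.8, lam=1.14):
    """ASSUMED form: decays from a + b at t = 0 towards a, time constant lam (s)."""
    return a + b * math.exp(-t / lam)


def first_suprathreshold_peak(trace, dt):
    """Return the time (s) of the first local maximum above the threshold, or None."""
    for i in range(1, len(trace) - 1):
        is_peak = trace[i - 1] < trace[i] >= trace[i + 1]
        if is_peak and trace[i] > collapsing_threshold(i * dt):
            return i * dt
    return None


def synthetic_trace(change_time, dt=0.01, duration=4.0):
    """Hypothetical filter output: a large onset bump plus a smaller change bump."""
    n = int(round(duration / dt))
    trace = []
    for i in range(n):
        t = i * dt
        onset = 12.0 * math.exp(-((t - 0.1) / 0.05) ** 2)
        change = 9.0 * math.exp(-((t - change_time) / 0.2) ** 2)
        trace.append(onset + change)
    return trace


if __name__ == "__main__":
    for change_time in (0.5, 3.0):
        decision = first_suprathreshold_peak(synthetic_trace(change_time), dt=0.01)
        print("change at", change_time, "s -> decision at", decision)
```

Under this assumed threshold, the onset bump and an early change bump stay below the initially high criterion, while the same change bump late in the trial exceeds it, which is the qualitative behavior the decision stage is meant to capture.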

Scale filters integrate frequency modulations over a limited spectral bandwidth set by the scale factor Ω. Such scale filters are more prone to detect changes localized in the spectrotemporal modulation domain. Therefore spectrally distributed changes could be missed by the decision stage as they elicit less activity in the filter outputs. This is reminiscent of the observation that listeners detected changes more efficiently if their energy was concentrated in the frequency domain (Figure 4A). Consistent with this, we found a decrease in the model performance for non-localized changes, without fitting the model parameters to this aspect of the data (Figure 8H).

Thus, the model describes a physiological mechanism that accounts for the behavioral data, as well as suggesting an implementation for the basis of statistical estimation in neural terms. In relation to the neural responses, it provides an interpretation for the lack of change-related signal in the auditory EEG electrodes, and the decision signal's slope also scales with the change size (Figure 8D), i.e. the amount of evidence. Similar to the dual timescale model, the peak size of the decision signal increases with change size (Figure 8D), unless interrupted by the threshold. The reduction of the threshold over time (to compensate for the task designs) is consistent with the experimentally observed reduction in CPP size as a function of change time (Figure 5M).

Discussion

We investigated how listeners detected changes in spectrotemporally broad acoustic textures, as a model for change detection in complex auditory environments. The results demonstrated that listeners estimated the statistics of the stimulus to make their decision, as evidenced by the dependence of performance, reaction times, and the CPP response on the stimulus parameters. We developed a drift-diffusion type model for estimating certain stimulus statistics, which accounted well for the response performance and dynamics in human listeners. Finally, we adapted a model of auditory cortical processing to provide a link between statistical estimation and the underlying physiology. The model accounted equally well for the human performance by exploiting a range of temporal filters, providing a potential, neurally plausible substrate for statistical decision-making. The decision signals of both models exhibit consistent integration behavior with the CPP potential.

Relation to previous spectral detection tasks

The present experimental paradigm mimics the unexpected transformation of a sound source within a natural auditory environment. There are some relations to previous research on spectral representations of sound, e.g. profile analysis (Green, 1992; Green and Berg, 1991; Hartmann, 1986; Lentz and Richards, 1997; Neff and Green, 1987). Our work, however, differs in several relevant ways. In profile analysis, subjects detected spectral shape changes on static spectra that were presented in isolation for short times (fraction of a second each). By comparison, our stimuli were dynamic and sustained (multiple seconds), and changes were detected in the midst of a continuous background with an explicit measure of reaction times. This enabled us to explore the dynamic acquisition of the statistical information.

Further, a series of recent studies investigated detection of change occurring in first- and second-order sound statistics (Sohoglu and Chait, 2016; Barascud et al., 2016). In particular, these authors probed the detection of appearing or disappearing regular sound sources in an acoustic scene (Sohoglu and Chait, 2016). This type of changes featured modifications of first- and second-order sound statistics, which also included an increase in the overall sound level. In comparison, our stimulus design allowed us to limit the change to the first-order statistics while keeping the overall sound level constant.

Our experimental task offers a compromise between complexity of spectrotemporal structure versus tractability and interpretability of the changes. Furthermore, the task design and acoustic stimulus are well-suited for electrophysiological studies with behaving animals, where one can easily estimate neuronal receptive fields from the responses to tone clouds at the same time as the animal detects the changes (Ahrens et al., 2008; Wang et al., 2012).

Another important aspect of the experiments was their interleaved (as opposed to block-based) design for change sizes and other parameters, which had several consequences. For instance, it is likely that the observed performance underestimated optimal performance, since the time, location and size of changes were unexpected. This also prevented subjects from using a template-match strategy on the largest change size, and provided access to reaction times, which consistently mirrored performance, and perhaps the certainty of the subjects in their decisions (Kiani et al., 2014).

Modelling statistical decision-making on two levels

Following the modeling steps proposed by Marr (1982), we provided an algorithmic and an (neural) implementational model of our subjects’ behavior. The algorithmic approach implemented the principle of statistical estimation, while the neural model leveraged principles of auditory cortex processing. Although both models analyzed recent inputs, and effectively detected deviations from them, they differed fundamentally in their levels of description and abstraction.

The statistical estimation model implements the principle of statistical integration in a close-to minimal form, and provides a link to classical drift-diffusion models. It is a mechanistic, non-neural description of the process that performed statistical estimation in the classical sense, by representing and comparing the probability of stimuli in frequency bins, based on a lossy memory. Previous work has suggested a possible neural implementation of such a decision making process, in the form of competing neuronal populations, each corresponding to one alternative choice (reviewed in Insabato et al., 2014). While this approach can in principle be extended to the estimation of other properties of a stimulus distribution, i.e. moments or correlations, it has to be adapted more specifically to each particular task. In the present case we chose a static set of parameters, since the change time distribution remained unchanged in a session. More generally, (temporal) integration properties can adapt to the recent statistics, as recently shown in related contexts (Raviv et al., 2012; Ossmy et al., 2013).

The cortical model differs fundamentally in that it seeks to capture basic sensory neural responses and is inspired by physiological mechanisms. In this sense, it is agnostic to the type of stimulus, and can be readily extended to handle more complex scenarios such as changes in natural stimuli, speech and music. To create behavioral performance from its representation, we merely added a filter selection and a decision criterion. The spectrotemporal filters implemented in the cortical model exhibit alternating excitatory (positive) and inhibitory (negative) fields (Figure 8A) that compare the spectral stimulus properties over a given time window set by a filter's temporal rate. As such, it effectively integrates the recent input with opposing signs to detect a change, which can be compared to the difference between the fast and slow estimators in the statistical estimation model.

Therefore, we may view this model as approximating a neural implementation of the statistical model, and thus as a bridge to a neural interpretation of the behavioral performance and the EEG recordings. Several properties of human performance and of the neural data can be considered within each model’s framework. The most relevant are (i) reduced performance in detecting early changes, (ii) longer reaction times for early changes, and (iii) reduced height of the centro-parietal EEG responses for late changes.

In the cortical model, the reduced performance results primarily from the large onset response masking the responses to the smaller subsequent change, rendering the peak response less detectable (i). In order to simulate the instructions to the subjects not to report the stimulus onset as a change, the detection threshold was set to decrease from a larger initial value, which will delay responses for early changes (ii). Interestingly, this choice for the threshold is in line with the reducing CPP potential heights as a function of change times (iii). Overall, the integration time-constants in the cortical model on the order of 1 s (due to bandpass filters tuned at rates near ~1 Hz) appear sufficiently long to explain the decision dynamics exhibited by the subjects (Figure 2). These time-constants, while on the slow-end of the range, are still found in the primary and secondary auditory cortical regions (Kowalski et al., 1996; Liang et al., 2002).

In the statistical model, the reduced performance (i) is a consequence of the model’s design having two estimators: one with a fast and the other with an adaptive time-constant (τf and τs). At stimulus (trial) onset, the absence of prior evidence is reflected by the equality of the two time-constants. As the trial progresses, τs becomes longer, and the difference between the two estimator outputs increases to reflect the buildup of evidence for a change in stimulus statistics (see Materials and methods). The dynamics are a consequence of the time-constant dynamics (as above) as well as the not-yet converged estimate of the initial occurrence probability (ii). There is no correspondence for the observed decrease of the CPP potential as a function of change time (iii).

It has previously been proposed that subjects may be trading response speed for accuracy (Teichert et al., 2014). We think this may apply to the first period up to 1 s, where subjects responded very little. After this point accuracy quickly rose (Figure 2C), as did the false alarm rate (before reaching its plateau). The time controlling the divergence between the two estimators in the statistical model (τa = 0.65 s) roughly matches this time scale and may be accounting for an initial postponing of decision by the subjects. An alternative modelling strategy would include a dead-time, corresponding to the minimal time subjects take before responding. While this appeared unnecessary for the present data, such a model may become relevant if the paradigm includes blocks of different response window length, where subjects are forced to respond more quickly to perform successfully.

In summary, what is typically termed accumulation of evidence (and its associated performance and dynamics) could be explained by the dynamics of the onset response in the cortical model intertwined with its integration time-constants. Future experiments need to further test the validity of this neural interpretation, given the ubiquity of such ‘sudden’ events in natural stimuli due to saccades (in vision), attentional switches, or trial onsets, which could also influence the detectability of changes (as e.g. in change blindness, Levin and Simons, 1997; Rensink et al., 2000).

EEG recordings and the site of decision-making

As discussed above, recognizing a change in the statistics of a complex spectrotemporal sound requires the extraction and accumulation of evidence from the stimulus to estimate decision-relevant properties. This transition from a stimulus-related to a task-related representation needs to occur along several stations of the auditory system. Our EEG recordings provide partial evidence regarding their putative location. Specifically, we found a clear difference in the representation of the stimulus at the central electrodes (estimated to originate from auditory cortex activity) and at the centro-parietal electrodes (estimated to reflect parietal activity): while the central electrodes exhibited a sharp onset response at stimulus onset (Figure 5A) and offset, they showed little evidence of the change response or of the presumed accumulation of evidence for a change (Figure 5B).

In sharp contrast, the centro-parietal electrodes displayed no response to the onset (Figure 5D), and clear evidence of sensory evidence accumulation after the change when aligned to the response (Figure 5E2). Previous studies using a linear increase of sensory evidence found a quadratic time progression of the centro-parietal potential. In the present task, the constant amount of evidence as a function of time resulted in more linear dynamics of the centro-parietal potential, supporting the integration hypothesis. The task-irrelevant abrupt sensory event (i.e. the onset) was thus filtered out in the parietal EEG response while the task-relevant event (the change in statistics), although more subtle in nature, was selectively integrated and converted into a decision signal.

A set of related EEG studies termed the corresponding potential the centro-parietal positivity (CPP, Kelly and O'Connell, 2013; O’Connell et al., 2012). Consistently, we found the CPP slope and amplitude at the response time to be correlated with evidence level (change size). However, the amplitude was independent of reaction time, and did not depend on change size before the response time (200–100 ms). Additionally, the amplitude of the CPP was found to be inversely related to change time, as one would expect if performance were influenced by the estimated maximal trial duration. Such a reduction would be expected as a consequence of two aspects of the general task design: on the one hand, the requirement not to respond to the onset of the first texture, and on the other hand, the possibility of approximating the maximal trial duration. For practical reasons, arbitrary trial lengths are not realizable. Hence, subjects could form an expectation of the maximal sound duration, which means participants could be subject to an urgency signal that would lower their criterion in this time-range.

On the other hand, the texture onset may also play an important role, since the requirement to not respond to an otherwise salient change, may be regulated by a change in threshold (as modeled in the decision stage of the cortical model).

A decrease in decision threshold has been used in models of decision-making for dealing with speed-accuracy trade-offs (Bogacz et al., 2006) and observed in electrophysiological studies (Heitz and Schall, 2012). Although it could in theory explain the time-dependence we observed for CPP heights, subjects did not exhibit any urgency to respond, even after a few seconds, as exemplified by the constant FA rate per unit of time (Figure 2D). Instead, we propose that this decrease in threshold reflects a more sensitive criterion for change detection, via a more settled estimate of the initial statistics. The decision threshold would thus be dynamically adjusted during the course of a trial. Importantly, the lack of increase in FA rate suggests that the improved estimate of the initial statistics would also reduce the neural response to expectable deviations, such that the sensitivity (type I errors) stays at the same level.

This predominance of decision-related signals in the centro-parietal electrodes is consistent with decades of research in the accumulation of task-related visual information in the parietal cortex, more specifically in decision-making with saccades in the lateral intraparietal (LIP) cortex (Huk and Shadlen, 2005; Roitman and Shadlen, 2002; Shadlen and Newsome, 2001). Neurons in LIP have been shown to exhibit activity correlated with the accumulation of visual evidence coming from MT (Huk and Shadlen, 2005; Mazurek et al., 2003). Their firing rate usually exhibits a linear increase until the animal makes a decision (Shadlen and Newsome, 2001). In these studies, typically a fixed threshold on neural firing rate is used to relate neural activity to decision making.

It has recently been suggested that individual neurons change their firing rate instantaneously at the single trial level (Latimer et al., 2015). We presently observed gradual, rather than step-wise changes in our across-trial averages. However, we predict that even single trial EEG signals would be gradual as these step-changes occur randomly between neurons, and hence are unlikely to be synchronized at the population-level. Due to the large ensemble of neural responses contributing to a single scalp location’s potential, this instead results in the commonly seen ramping activity on the EEG level, as observed in our data.

The lack of evidence for a change-related signal in the auditory EEG potentials cannot, however, fully rule out the presence of a change-related signal in auditory cortex in the present stimulus context. The representation of the change could be diverse and distributed, which may average out in the non-selective, coarse averaging on the EEG level (see Figure 8B). This is also consistent with recent work demonstrating choice-related signals in auditory cortex (e.g. Bizley et al., 2013; Tsunada et al., 2016). Our cortical modelling suggests that the representation in auditory cortex provides a good substrate for initial accumulation of sensory information about changes in stimulus statistics, which is then selected and amplified in parietal cortex, leading up to the sustained parietal activity and a full representation of accumulated evidence and choice (Shadlen and Newsome, 2001). This interpretation is supported by the match in performance, reaction times (Figure 8E–G), and in the progression of activity between specific filters of the cortical model (Figure 8C) and neural data (Figure 5E).

In conclusion, as with many other cognitive functions, it is likely that higher-order areas such as the LIP and PFC select and potentially amplify task-relevant outputs of the auditory cortex. To test this hypothesis and the value of the proposed models, it will be necessary to extend change detection tasks to more natural and complex stimuli. As shown previously (Lewicki, 2002; Smith and Lewicki, 2006), natural statistics shape neural processing, and in a similar way should be informative about which changes to focus on in research. Furthermore, the models should be extended to include the effects of cognitive functions in modulating this process, such as attention or expectations.

Materials and methods

Participants

In the main psychophysical study, 15 normal hearing subjects (mean age: 24.8y, 6 females) participated, 10 of which could be included for final analysis (see below for criteria). A different set of 18 subjects participated in the combined psychophysics and EEG experiment (mean age: 30 ± 10 years, 7 females), all of which could be included for final analysis (see below for criteria). All experiments were performed in accordance with the guidelines of the Helsinki Declaration. The Ethics Committees for Health Sciences at Université Paris Descartes approved the experimental procedures.

Experimental setup

Acoustic stimulation

Subjects were seated in front of a screen with access to a response box in an acoustically-sealed booth (Industrial Acoustics Company GmbH). Acoustic stimulus presentation and behavioral control were performed using custom-written software in MATLAB (BAPHY, from the Neural Systems Laboratory, University of Maryland, College Park; available upon request). The acoustic stimulus was sampled at 100 kHz, and converted to an analog signal using an IO board (National Instruments, PCIe-6353) before being sent to diotic presentation using high-fidelity headphones (Sennheiser i380, calibrated flat, i.e. ±5 dB, within 100–20000 Hz). Reaction times were measured via a custom-built response box and collected by the same IO card sampled at 1 kHz.

Electroencephalogram (EEG) acquisition

EEG recordings were collected in a separate set of 18 normal-hearing subjects while listening and responding to the texture change stimuli. Continuous EEG data were recorded using a 64-channel system (ActiCap, BrainProducts, Gilching, Germany) at a sampling rate of 500 Hz with one reference and one ground electrode. In order to standardize electrode placement on the skull, we used a default fabric head-cap that holds the electrodes (EasyCap, Equidistant layout, 60 scalp, four ocular channels). The analysis of EEG responses was carried out offline (see section Data analysis).

Stimulus design and trial procedure

We investigated the conditions under which listeners could detect a change in the statistics of complex acoustic stimuli. More specifically, we wondered how subjects capture the percept of a spectrotemporally complex stimulus, and then use it as a background to detect changes relative to it. Concretely, in an experimental trial, a sound texture was presented, allowing the subjects a randomly varying period of time to form a percept of the stimulus (i.e. ‘estimate the baseline statistics’), and then a change in the frequency distribution of the tones was introduced (while maintaining the overall sound level). After the change, subjects had up to 2 s to indicate that they had detected it. The stimulus captures the central textural properties of complex spectrotemporal structure and statistical predictability. Both the stimulus design and the procedure are described in detail below.

Stimulus design

Briefly, the stimulus was a 'cloud' of tones, i.e. a train of short pure-tone chords (30 ms) drawn from a range of 2.2 octaves (400 to 1840 Hz), where the occurrence probability of each tone was governed by a marginal distribution (see below, Figure 1, and sound examples in Supplementary files 1–4). The frequency resolution of the tone distribution was 12 semitones per octave, i.e. 26 logarithmically spaced pure tones covered the frequency range used. To limit the number of experimental conditions, these were grouped into eight spectral bins, each comprising 3–4 of the pure tone frequencies (see Figure 1 for illustration). The marginal distribution was chosen to ensure that the actual rate of tones per bin was controlled, independent of the number of pure tone frequencies constituting the bin. The entire stimulus can be described by a spectrogram denoted by S(t,f) as a function of time and frequency.

The minimal temporal unit of the stimulus was a 30 ms chord, i.e. a synchronous presentation of multiple pure tones. The number of tones for each chord was drawn from a Poisson distribution with a fixed mean of 2 tones per octave. The mean number of tones per chord was kept fixed as a function of time to avoid changes in level (see below). The frequency of each tone in a chord was chosen in two successive steps: First, one of the eight spectral bins was selected according to a marginal probability distribution (see below). Second, within this bin, one of the pure tone frequencies constituting the bin was randomly selected. Chords at different times were drawn independently from each other.

The baseline marginal probability distribution was composed of eight frequency bins with discrete probability values (Figure 1A, left). These values were chosen pseudo-randomly for each trial, forcing subjects to always reestimate the stimulus statistics. The probability in each bin could take one of three values: 0.083, 0.125, 0.188. To avoid differences in spectral density, the number of bins with each probability was fixed: three bins with p=0.083, two bins with p=0.125 and three bins with p=0.188. The marginal distribution is thus normalized, i.e. the sum across bins equals 1. Since multiple pure tone frequencies constituted each of the eight bins, the probability per pure tone frequency bin was correspondingly lower: based on this marginal distribution and the number of tones per chord, the effective probability of a tone falling in a pure tone frequency bin thus ranges between 0.021–0.063 per chord duration, corresponding to an average rate of ~147 tones/s.
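The two-step sampling of a chord described above can be sketched in Python as follows. This is a minimal illustration, not the BAPHY implementation: the function names, the starting frequency of the semitone grid, the assignment of 3 or 4 tones to particular bins, and the implicit renormalization of the rounded probabilities are all assumptions made for the example.

```python
import math
import random

F_MIN_HZ = 400.0                 # lower edge of the tone range (from the text)
N_TONES = 26                     # 12 tones per octave over ~2.2 octaves
CHORD_S = 0.030                  # chord duration in seconds
MEAN_TONES_PER_CHORD = 2 * 2.2   # 2 tones per octave over 2.2 octaves

# Semitone-spaced tone frequencies, assumed here to start at 400 Hz.
TONES_HZ = [F_MIN_HZ * 2 ** (k / 12) for k in range(N_TONES)]

# Assumed grouping into eight bins of 3-4 tones (the exact split is hypothetical).
BIN_SIZES = [3, 3, 3, 4, 3, 3, 3, 4]
BINS, _start = [], 0
for size in BIN_SIZES:
    BINS.append(TONES_HZ[_start:_start + size])
    _start += size


def poisson(mean, rng):
    """Draw a Poisson-distributed count (Knuth's multiplication method)."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_chord(marginal, rng):
    """Return the tone frequencies (Hz) of one 30 ms chord."""
    n_tones = poisson(MEAN_TONES_PER_CHORD, rng)
    chord = []
    for _ in range(n_tones):
        # Step 1: pick a spectral bin; random.choices normalizes the weights.
        (bin_idx,) = rng.choices(range(len(BINS)), weights=marginal)
        # Step 2: pick one of the pure tones within that bin.
        chord.append(rng.choice(BINS[bin_idx]))
    return chord


rng = random.Random(0)
baseline = [0.083, 0.083, 0.083, 0.125, 0.125, 0.188, 0.188, 0.188]
rng.shuffle(baseline)            # values are assigned to bins anew on each trial
chords = [draw_chord(baseline, rng) for _ in range(100)]   # 3 s of stimulus

mean_rate = MEAN_TONES_PER_CHORD / CHORD_S   # expected tones per second
```

With 4.4 tones per chord on average and 30 ms chords, `mean_rate` comes out at roughly 147 tones/s, matching the rate quoted in the text.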

The change in statistics consisted in a change in the baseline marginal distribution. Two out of the eight spectral bins were increased in probability at a random point in time (referred to as change time, more details below) during stimulus presentation, i.e. the stimulus continued uninterrupted. The increment size will be referred to as change size and was drawn from a set of discrete values: 30, 50, 80, 110, 140% (inset in Figure 2A), relative to the single bin probability in a uniform distribution (for eight bins this is 0.125, i.e. a 50% change size would be an increment of 0.0625). In order to exclude cues from global level changes, the marginal distribution was simultaneously renormalized, thus keeping the global level constant within a trial (i.e. as mentioned above the rate of tones per chord was kept constant at all times). Since the 30% condition was only collected for three subjects, it is omitted from most plots, although results were generally consistent with the other conditions.
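The change-size arithmetic can be illustrated with a short Python sketch. The text states that the distribution was renormalized but not how; dividing by the new sum is an assumption here, and the function name and example bin order are hypothetical.

```python
UNIFORM_P = 1 / 8   # single-bin probability of a uniform eight-bin distribution


def apply_change(marginal, changed_bins, change_size_pct):
    """Raise the changed bins by the change size, then renormalize to sum to 1."""
    increment = change_size_pct / 100 * UNIFORM_P
    changed = list(marginal)
    for idx in changed_bins:
        changed[idx] += increment
    total = sum(changed)
    return [p / total for p in changed]


# Hypothetical baseline for one trial; bins 2 and 3 (0-based) receive the change.
baseline = [0.083, 0.125, 0.188, 0.083, 0.188, 0.125, 0.083, 0.188]
post_change = apply_change(baseline, changed_bins=(2, 3), change_size_pct=50)
```

For a 50% change size the increment is 0.5 × 0.125 = 0.0625 per changed bin before renormalization, as stated above.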

The relative spectral locations of the two changed bins were separated into two conditions:

  1. Localized: the frequency bins containing the change were adjacent. To limit the number of conditions, only 4 pairs of bins, {1,2}, {3,4}, {5,6}, {7,8} were tested at all increment levels.

  2. Non-localized: the frequency bins containing the change were separated in frequency. To limit the number of conditions, we chose a subset of distances (D=[2, 3, 5, 7] bins, i.e. [6.6, 9.9, 16.5, 23.1] semitones (st)) and only used the change size 110% (determined as intermediate difficulty during pilot studies). Since certain inter-bin distances are more frequent (i.e. six for D = 2, five for D = 3, three for D = 5, one for D = 7), the number of trials going into each distance differs, which scales the error bars accordingly (see Figure 4B).
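The number of available bin pairs per distance follows from simple counting, which a few lines of Python can confirm. The 3.3 semitones per bin is inferred from the distances quoted above (6.6 st for D = 2); the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

N_BINS = 8
SEMITONES_PER_BIN = 3.3   # 6.6 st for D = 2 in the text implies 3.3 st per bin

# Count all unordered bin pairs by their separation in bins.
pair_counts = Counter(j - i for i, j in combinations(range(1, N_BINS + 1), 2))
tested = {d: pair_counts[d] for d in (2, 3, 5, 7)}
distances_st = {d: round(d * SEMITONES_PER_BIN, 1) for d in tested}
```

For eight bins there are 8 − D pairs at distance D, which reproduces the six, five, three and one pairs listed for D = 2, 3, 5 and 7.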

The time at which the change occurred (change time) was drawn randomly from an exponential distribution (mean: 3.2 s) limited to the interval of [0,8] s (Figure 1B). This choice of distribution prevents subjects from developing a timing strategy, by keeping the probability of a change constant over time. The associated flat hazard rate minimizes the ability of participants to anticipate the trial end (Janssen and Shadlen, 2005; Kiani et al., 2008). The change time is an important parameter with respect to the estimation of the first marginal distribution, with the hypothesis that greater change times improve detection of changes.
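One way to draw such change times is inverse-CDF sampling from an exponential restricted to the interval. The sketch below assumes that the 3.2 s mean refers to the exponential before truncation, which the text does not state explicitly; the function name is hypothetical.

```python
import math
import random

MEAN_S = 3.2    # mean of the underlying exponential (assumed to be pre-truncation)
T_MAX_S = 8.0   # upper limit of the change-time interval


def draw_change_time(rng):
    """Inverse-CDF sample from an exponential restricted to [0, T_MAX_S]."""
    u = rng.random()
    cdf_at_max = 1 - math.exp(-T_MAX_S / MEAN_S)
    return -MEAN_S * math.log(1 - u * cdf_at_max)


rng = random.Random(1)
change_times = [draw_change_time(rng) for _ in range(10000)]
```

Under this assumption the truncation lowers the empirical mean below 3.2 s, while the hazard rate of the underlying exponential stays constant at 1/3.2 per second.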

Given the low, per-bin probabilities (see above), individual tones remained distinguishable. Hence, the spectrotemporal density was low enough to avoid fusion into a single stream per channel, although the present study still has some parallels with previous paradigms, e.g. spectral shape analysis (Green, 1988, 1992; Green and Berg, 1991) (see Discussion).

Procedure

The experiment proceeded in three phases: instruction, training, and main experiment. After reading the instructions, subjects went through 10 min (60 trials) of training, in which they were required to reach a detection performance of at least 40%. The training comprised only stimuli of the two largest change sizes (110%, 140%). Three subjects in the psychophysics-only group did not attain the criterion level of performance and were not tested further.

The main experiment was composed of two sessions of about 70 min each, comprising a total of 930 trials. The two sessions were never more than two days apart. Each session contained three blocks of about 20 min, for a total of 465 trials per session, corresponding to 30 repetitions of each condition (for the three subjects in which the 30% condition was tested the total trial number increased to 1050). In between blocks subjects could take a short break.

The instructions specified that subjects would be compensated according to their performance, although an easily attainable threshold of proficiency would give them the maximal compensation. However, all subjects were compensated equally according to the length of the experiment (€10/hour).

After reading the instructions, subjects were aware that the change could arise at any moment on each trial and that their task was to detect it within a 2 s window. When subjects heard a change, they pressed a response button. This terminated the trial and the sound. Hence, the subject had up to 2 s after the change to detect the change in stimulus statistics.

Visual feedback was always displayed on a screen in front of them after the end of the trial. A red square was displayed if the button was pressed before the change (false alarm), or if the button was not pressed within the 2 s time window after the change (miss). A green square was displayed if the button was pressed after the change, but within the 2 s window (hit).

In addition, stimulus level was roved from trial to trial, chosen randomly between 60 and 80 dB SPL. This procedure is classically applied to prevent subjects from adopting an absolute level strategy (Green and Berg, 1991). Overall level was not found to significantly influence performance (p=0.89, ANOVA). The inter-trial interval was ~1 s with a small, random jitter (<0.1 s) depending on computer load.

Psychophysics procedure during EEG recordings

During the EEG recordings, stimuli and experimental procedures identical to those of the psychophysics experiments were used. In addition, subjects were required to continuously fixate a white cross on the screen. They were asked not to blink and to keep fixation especially during the sound presentation. After the end of the trial (i.e. either the end of the sound or their response), they received a visual text feedback after 0.5 s. After the feedback disappeared, eye blinks were allowed during the intertrial interval indicated by on-screen text underneath the fixation cross. At 1 s before the next stimulus, the text disappeared, indicating that blinking should be prevented subsequently.

Data analysis

The ability of the subjects to detect the change in stimulus statistics was quantified using two measures, performance and d-prime, denoted d'. These analyses (Figures 1–4) were performed on the data obtained during the psychophysics experiments, and restricted to trials with localized changes unless stated otherwise in the text. In addition, the dependence of reaction times on stimulus parameters was analyzed.

Performance

We computed a subject’s performance as the fraction of successful detections (hits) out of all trials in which the change occurred before the response (hits + misses). False alarms were excluded from the performance computation, since these responses occurred before the change (see d’ for an inclusion of false alarms).

d’ analysis

We developed a time-dependent d’ measure, in which longer trials serve as catch trials before the change occurs (Green and Swets, 1966). We computed d’ values to assess the ability to detect changes (Egan et al., 1961) while taking the false alarm rate into account, as classically done in signal detection theory. Due to the present task structure, d’ could be computed as a continuous function of time from stimulus onset. We used the usual approximation d’(t) = Z(HR(t)) - Z(FAR(t)), where Z(p) is the inverse of the Gaussian cumulative distribution function (CDF). HR(t) is the hit rate as a function of time since stimulus onset. HR was computed as the fraction of correct change detections, in relation to the number of trials with changes occurring at t (Macmillan and Creelman, 1991). As detailed above, a correct detection had to occur within 2 s of the change time. Similarly, the false alarm rate FAR(t) was computed as the number of false alarms that occurred over all 2 s windows (starting at t) in which no change in statistics occurred. This created an artificial reaction time for each false alarm, which we used for comparison with the distribution of actual reaction times resulting from hits (Yin et al., 2010). d’ was computed for different times and change sizes, yielding only a limited number of trials per condition. To avoid degenerate cases (i.e. d’ would be infinite for perfect scores), the analysis was not performed separately by subject, but over the pooled data. Confidence bounds (95%) were then estimated by bootstrapping within the dataset. The analysis was verified on surrogate data from a random responder (binomial with p=0.01 per time bin at 40 Hz sampling rate), providing d’ close to 0 as expected.
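The d’ approximation for a single time bin can be written in a few lines of Python using the standard library. The counts below are hypothetical pooled numbers chosen only to exercise the function, not data from the study.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """d' = Z(HR) - Z(FAR), with Z the inverse standard-normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def rate(n_events, n_trials):
    """Fraction of trials with an event (hit or false alarm)."""
    return n_events / n_trials


# Hypothetical pooled counts for one time bin and change size.
example = d_prime(rate(80, 100), rate(10, 100))
```

`NormalDist().inv_cdf` raises an error for rates of exactly 0 or 1, which corresponds to the degenerate cases that pooling across subjects is meant to avoid.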

Reaction times

We obtained reaction times by subtracting the change time from the response time in each hit trial. For each condition, the distribution of reaction times was assembled and the median reaction time computed. Note that very early and late reaction times will in some cases not correspond to actual reaction to the change in statistics, but are coincidental, which cannot be distinguished on a trial-by-trial level. The results presented for the effect of change size on performance and reaction time were computed using only the data with change in contiguous bins (localized change). Results for the trials with non-localized bins (at 110% change size) were qualitatively the same, however, they were excluded from this analysis to keep the number of trials per condition equal.

These measures were computed as a function of change size and change time. While change times were drawn without binning from an exponential distribution for the experiment, they were binned for analysis using bins of exponentially increasing width (in order to achieve comparable numbers of trials in each bin).

Performance dynamics

In order to compare the performance dynamics for different change sizes, we fitted an adapted version of the Erlang CDF to the data according to:

(1) $P(\Delta c, t_c) = P_0(\Delta c) + P_{\max}(\Delta c)\,\dfrac{\gamma\big(k,\; t_c/\tau(\Delta c)\big)}{(k-1)!}$

where tc is change time, Δc change size, γ the incomplete gamma function, τ the function rate, and k controls the function shape. k was kept constant across subjects and change sizes, assuming the shape of the performance curves is invariant. Optimizations were performed using nonlinear least-squares minimization on the residuals of the fit (via ‘lsqnonlin’ in Matlab).
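For integer shape k, the normalized incomplete gamma term in Equation (1) has a closed form, which allows the fitted function to be evaluated without special-function libraries. The sketch below assumes integer k and uses hypothetical function names; it does not reproduce the ‘lsqnonlin’ fitting step.

```python
import math


def erlang_cdf(x, k):
    """gamma(k, x) / (k - 1)! for integer k >= 1, via the closed-form sum."""
    return 1 - math.exp(-x) * sum(x ** n / math.factorial(n) for n in range(k))


def performance(t_c, p0, p_max, tau, k):
    """Equation (1): P0 + Pmax * gamma(k, t_c / tau) / (k - 1)!."""
    return p0 + p_max * erlang_cdf(t_c / tau, k)
```

The function starts at P0 for a change time of zero and saturates at P0 + Pmax for long change times, with τ setting how quickly the plateau is reached.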

To control for inattentive subjects, we set a 30% threshold for the total false alarm rate. Two subjects were discarded according to this criterion leaving a total of 10 subjects for the data analysis, with false alarm rates below 25%.

Analysis of EEG recordings

We analyzed two signals based on the EEG: the classical auditory event-related potential (ERP), and the centro-parietal positive potential (CPP). First, slow trends were removed from all electrodes using a low-dimensional polynomial fit (‘nt_detrend’, from the NoiseTools Matlab toolbox by de Cheveigné and Parra, 2014). We verified that a classical high-pass filter (Matlab: filtfilt, 0.1 Hz, 15th order, 50 dB attenuation in the stop band) gave very similar results. Electrodes were low-pass filtered below 30 Hz with a 45th order Chebyshev filter using the ‘filtfilt’ function in MATLAB to avoid phase distortion. All electrodes were referenced to the common average potential. All trials with at least one scalp channel exceeding 500 µV at any time after referencing were discarded as artifacts. All subjects had a low or moderate rate of blinks and eye movements and could thus be included for a total of 18 subjects.

Classical auditory ERPs were estimated from the central electrode (El. 1 in the equidistant layout of EasyCap, equal to Cz; corresponding to the center in Nie et al., 2014). The CPP signal was based on a set of centro-parietal electrodes (El. 14, 27, 28, similar to Twomey et al., 2015). Trials were then extracted in the period encompassing 0.5 s before and 3 s after the sound presentation and time-shifted either to the onset or to their corresponding change times (see Figure 5). EEG data were segmented into shorter epochs locked on stimulus onset or response time for display. The epochs were baseline-corrected relative to a 150 ms interval prior to onset, and a 200 ms interval before change time for alignments to both change and response time.

CPP amplitude was computed as the peak amplitude at the response time in a window of ±80 ms. CPP slope was the average slope in a window of 300–50 ms before response time, computed as the mean derivative of the CPP. Topographic distributions of the EEG signal were plotted with EEGLAB (‘topoplot’ function) (Delorme and Makeig, 2004).
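The two CPP measures can be sketched as follows, assuming the CPP trace is a plain sequence of voltages sampled at 500 Hz, that the sample index of the response is known, and that "peak" means the maximum within the window. The function names are hypothetical.

```python
FS_HZ = 500   # EEG sampling rate from the text


def cpp_amplitude(cpp, response_idx, half_window_s=0.080):
    """Peak CPP value within +/-80 ms of the response sample."""
    half = round(half_window_s * FS_HZ)
    lo, hi = max(response_idx - half, 0), response_idx + half + 1
    return max(cpp[lo:hi])


def cpp_slope(cpp, response_idx, start_s=0.300, end_s=0.050):
    """Mean derivative (per second) from 300 ms to 50 ms before the response."""
    first = response_idx - round(start_s * FS_HZ)
    last = response_idx - round(end_s * FS_HZ)
    # The mean of successive differences telescopes to the end-point difference.
    return (cpp[last] - cpp[first]) / ((last - first) / FS_HZ)
```

Because the mean of a discrete derivative over a window depends only on the first and last samples, the slope reduces to a single difference quotient.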

Dual timescale model

We assume that subjects continuously estimate a wide range of statistical properties of the acoustic environment, and are able to detect unexpected deviations in these properties for the purpose of detecting changes in the ongoing sound. Among these properties are the probabilities of having a tone in the different frequency channels. Since these are the only determining properties in our stimulus design, we developed a phenomenological model, which estimates and detects changes in the marginal tone probabilities across multiple frequency channels (see next section for a more biologically motivated model based on a cortical filter-bank).

The model consists of change-detector modules, which operate independently on a limited spectral range and whose output is combined to enable change-detection on a full spectrum. For simplicity, the spectral division of the modules was matched to the presently relevant division of the psychophysical stimulus S(t,f) (see above), i.e. we here consider four modules, one for each pair of frequency bins, whose marginal probability could change. Since the modules operate independently, frequency separation is not relevant in the present model (but see below in the cortical model). For the present model, these frequency bins are referred to as Si(t) (with i∈[1,4]), which contain a random set of tones, adhering to the same marginal probabilities as the psychophysical stimulus.

For each frequency bin Si(t), a pair of dynamical processes {Pslow(t),Pfast(t)}i, acts as a change detector. Pslow,i estimates the long-term probability of the presence of a tone at a given time in Si(t), and Pfast,i estimates the more recent probability of the presence of a tone in Si(t). The dynamics of the processes are given by:

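(2) $\dfrac{dP_{\mathrm{fast},i}(t)}{dt} = -\dfrac{P_{\mathrm{fast},i}(t) - S_i(t)}{\tau_f}, \qquad \dfrac{dP_{\mathrm{slow},i}(t)}{dt} = -\dfrac{P_{\mathrm{slow},i}(t) - S_i(t)}{\tau_s}$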

where τs > τf, which separates the speeds of the two processes. Normally, Pfast,i and Pslow,i have similar values, since Pfast,i simply tracks the stimulus faster than Pslow,i. However, if the probability of occurrence changes in the stimulus, the difference between the two processes grows, since Pfast,i reacts faster to this change. A change in the environmental statistics is hence detected if |Pfast,i − Pslow,i| > T, where T is a threshold and a free parameter of the model. Identical models exist for the different frequency channels Si(t). If T is exceeded in a particular Si(t), this is considered a detected change in the environment at the corresponding time Ti. Only the first detected change in any Si is recorded as the response. The time of the actual response is then given by Ti + Tm, where Tm is a constant delay of 250 ms that accounts for non-integration-related processes, such as stimulus representation and motor execution, up to the button press (akin to the non-decision time of Ratcliff and McKoon [2008]). The model is termed a dual timescale model.
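A single change-detector module of this kind can be sketched with forward-Euler integration. The time constants and threshold below are hypothetical values picked for illustration, not the fitted parameters, and both trackers are started at the baseline probability, an assumption that sidesteps the stimulus onset.

```python
def first_detection(stimulus, tau_f, tau_s, threshold, dt, p_init):
    """Euler-integrate the fast and slow trackers; return the first crossing index.

    stimulus : sequence of tone-presence values S_i(t) for one channel
    p_init   : starting value of both trackers (assumed already adapted)
    Returns None if |P_fast - P_slow| never exceeds the threshold.
    """
    p_fast = p_slow = p_init
    for step, s in enumerate(stimulus):
        p_fast += dt * (s - p_fast) / tau_f   # dP_fast/dt = -(P_fast - S) / tau_f
        p_slow += dt * (s - p_slow) / tau_s   # dP_slow/dt = -(P_slow - S) / tau_s
        if abs(p_fast - p_slow) > threshold:
            return step
    return None


DT_S = 0.01              # 100 Hz simulation rate, as in the text
MOTOR_DELAY_S = 0.250    # constant non-decision time T_m

# Hypothetical noise-free input: tone probability steps from 0.2 to 0.6 at 3 s.
stimulus = [0.2] * 300 + [0.6] * 300
idx = first_detection(stimulus, tau_f=0.3, tau_s=3.0, threshold=0.2,
                      dt=DT_S, p_init=0.2)
response_time_s = None if idx is None else idx * DT_S + MOTOR_DELAY_S
```

In the full model one such module runs per frequency channel, and the earliest crossing across modules determines the response; with a stochastic tone train in place of the noise-free step, the threshold also governs the false alarm rate.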

If we used the model as described so far, it would (correctly) detect a change in statistics at the onset of the stimulus (the transition from silence to stimulus). In the present task design, the subjects were instructed to ignore the change associated with the start of the stimulus, and to detect only the change in statistics within the stimulus. As laid out in the introduction, two estimations needed to be performed simultaneously: (1) estimate the probability from stimulus onset, (2) compare this estimate to the changed probability in the latter part of the stimulus (which occurs at an unknown time). To account for this initial period of estimation, we change the dynamics of Pslow (the slower tracking process) as a function of stimulus time. Intuitively this means that Pslow and Pfast initially operate on the same timescales, and thus the detection threshold T is never exceeded. The modified equations therefore become

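(3) $\dfrac{dP_{\mathrm{fast},i}(t)}{dt} = -\dfrac{P_{\mathrm{fast},i}(t) - S_i(t)}{\tau_f}, \qquad \dfrac{dP_{\mathrm{slow},i}(t)}{dt} = -\dfrac{P_{\mathrm{slow},i}(t) - S_i(t)}{\theta(t)}, \qquad \theta(t) = (\tau_s - \tau_f)\,e^{-t/\tau_a}$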

The speed at which the tracking dynamics diverge is regulated by τa. Overall, the model has four free parameters (T, τf, τs, τa), which were matched to account for the experimentally collected data. The phenomenological model accounted for the dependence of performance on change time and change size. Given the numerators in (2) and (3), the slopes of both Pslow and Pfast and of their difference will depend on change size (compare to the EEG data in Figure 5). Simulations were run at a sampling rate of 100 Hz. Fitting was performed by exhaustive search in the parameter space to avoid local minima and biasing by initial values.

The model structure is inspired by earlier accounts of decision-making in random-dot motion stimuli, i.e. so-called drift-diffusion models (Bogacz et al., 2006; Britten et al., 1996), which have also recently been used to account for acoustic click-rate comparison tasks (Brunton et al., 2013). In contrast to these models, the dynamical process Pslow in our case becomes an estimate of the medium-term occurrence probability, and Pfast an estimate of the recent occurrence probability, and a decision is made across the set of estimators (similar to Churchland et al., 2008). Note that the processes can transiently exceed 1; however, on average the right-hand side of the dynamical equations is zero when the dynamical process equals the probability that Si is drawn from.

Auditory multiresolution cortical model

The cortical model is an approximation to the analysis performed up to primary auditory cortex, which has been used successfully in a range of different auditory projects. A full description of the model can be found in Chi et al. (2005) and Yang et al. (1992), but an outline of its basic principles is provided below.

Computational structure of the cortical model

The cortical model processes the audio signal via two stages, inspired by the auditory pathway up to the midbrain and by the primary auditory cortex. The first stage transforms the sound into an auditory spectrogram, and the second performs a spectrotemporal analysis on this spectrogram.

The processing of the acoustic signal in the cochlea is modelled as a bank of 128 constant-Q, asymmetric bandpass filters, equally spaced on the logarithmic frequency scale spanning 5.3 octaves. The cochlear output is then transduced into inner hair cell potentials via a high-pass and low-pass operation. The resulting auditory nerve signals undergo further spectral sharpening via a lateral inhibitory network. Finally, a midbrain model resulting in additional loss in phase locking is performed using short term integration with a time constant of 4 ms, resulting in a time-frequency representation (the auditory spectrogram z(t,f)) (top panel in Figure 8A). The central stage further analyzes the spectrotemporal content of the auditory spectrogram using a bank of modulation-selective filters centered at each frequency along the tonotopic axis, mimicking neurophysiological receptive fields. This step corresponds to a 2D affine wavelet transform, with a spectrotemporal mother wavelet, defined as a Gabor-shape in frequency and exponential in time. Each filter h is tuned (Q = 1) to a specific rate (ω in Hz) of temporal modulations and a specific scale of spectral modulations (Ω in cycles/octave), and a bidirectional orientation (+ for upward and - for downward). The response of each cortical filter in the model is given by

(4) $r_{\pm}(t,f;\omega,\Omega;\theta,\Phi) = z(t,f) \ast_{t,f} h_{\pm}(t,f;\omega,\Omega;\theta,\Phi)$

where *t,f denotes convolution in time and frequency, and θ and Φ are the characteristic phases of the cortical filter, which determine the degree of asymmetry in the time and frequency axes respectively (middle panel in Figure 8A). Because changes were isotropic within the sound spectrum, we averaged the upward and downward components of the scale modulation filter. To simplify the analysis, we limited our computations to the real cortical outputs across frequency (i.e. responses corresponding to zero-phase filters). The resulting modulation response is denoted R(t;ω,Ω) (bottom panel in Figure 8A). Simulations were run at a sampling rate of 100 Hz.

Decision process based on the cortical model output

On a single trial basis, the stochastic nature of the stimulus was reflected in the noisy outputs of the cortical model. To facilitate change detection on single trials, we post-filtered the modulation response R(t;ω,Ω) using the average response to a change in statistics. Concretely, the shape of the trial-averaged response in R(t;ω,Ω) was convolved with single trials, to improve detection of change. Due to the different modulation rates, the length of the average response shape varied by modulation rate ω as 1/ (2ω) ms. A unique combination of rate ω and scale Ω was used across all trials to characterize the modulation response. Next, we implemented a decision criterion on top of the filtered R(t;ω,Ω).

Due to the comparative nature of the present paradigm and because the onset peak was not driven by any task-relevant feature of the sound, a time-dependent decision boundary was better suited to match the experimentally observed reaction times in both models. This was inspired by previous studies that described either time-varying collapsing boundaries (Ditterich, 2006) or linearly increasing urgency-related gain (Cisek et al., 2009; Drugowitsch et al., 2012). We designed the time-dependent threshold as follows:

(5) $T(t) = b\,e^{-t/\lambda} + a$

where a and b scale the amplitude of the threshold and λ sets its time dependence. The first peak exceeding the time-dependent threshold was labelled as the decision time.
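The decision rule can be sketched as a search for the first local maximum of the filtered response that lies above the threshold. The sketch assumes that the threshold decays exponentially from a + b toward a, so that the onset peak is ignored; the function names and the peak definition (a sample larger than its left neighbour and not smaller than its right neighbour) are illustrative choices.

```python
import math


def threshold(t, a, b, lam):
    """Equation (5), assuming a decaying form: T(t) = b * exp(-t / lam) + a."""
    return b * math.exp(-t / lam) + a


def decision_index(response, dt, a, b, lam):
    """Index of the first local peak of `response` above the threshold, or None."""
    for i in range(1, len(response) - 1):
        is_peak = response[i - 1] < response[i] >= response[i + 1]
        if is_peak and response[i] > threshold(i * dt, a, b, lam):
            return i
    return None
```

With a large b, an early peak of a given height stays below the threshold while a later peak of the same height crosses it, which is the behaviour needed to suppress responses to the stimulus onset.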

In total, the decision stage is controlled by five parameters: the time-varying threshold (λ, a, b), the scale Ω, and the rate ω, while other parameters of the cortical model were kept fixed. The threshold parameters tune the balance between conservative and liberal decisions. To take into account this aspect we fitted both performance and false alarm rate across all subjects for all change sizes and change times. Motor-related delay was accounted for by a 250 ms offset added to the estimated reaction times, as was done for the phenomenological model.

Statistical analysis

If not specified otherwise, nonparametric tests were used. When data were normally distributed (for performance), we checked that statistical conclusions were the same. One-way analysis of variance was computed with the Kruskal-Wallis test; two-way with Friedman's test. Error bars are ±2 SEM (standard error of the mean), unless specified otherwise. All statistical analyses were performed using Matlab (The Mathworks, Natick).

References

  15. Ditterich J (2006) Stochastic models of decisions about motion direction: behavior and physiology. Neural Networks : The Official Journal of the International Neural Network Society 19:981–1012. https://doi.org/10.1016/j.neunet.2006.05.042
  18. Eggermont JJ (2002) Temporal modulation transfer functions in cat primary auditory cortex: separating stimulus effects from neural mechanisms. Journal of Neurophysiology 87:305–321.
  20. Green DM, Berg BG (1991) Spectral weights and the profile bowl. The Quarterly Journal of Experimental Psychology Section A 43:449–458. https://doi.org/10.1080/14640749108400981
  21. Green DM, Swets JA (1966) Signal detection theory and psychophysics. Society 1:521.
  22. Green DM (1988) Profile Analysis (Oxford Psy). Oxford University Press.
  23. Green DM (1992) The number of components in profile analysis tasks. The Journal of the Acoustical Society of America 91:1616–1623. https://doi.org/10.1121/1.402442
  35. Kowalski N, Depireux DA, Shamma SA (1996) Analysis of dynamic spectra in ferret primary auditory cortex. I. characteristics of single-unit responses to moving ripple spectra. Journal of Neurophysiology 76:3503–3523.
  43. Liang L, Lu T, Wang X (2002) Neural representations of sinusoidal amplitude and frequency modulations in the primary auditory cortex of awake primates. Journal of Neurophysiology 87:2237–2261.
  45. Macmillan NA, Creelman CD (1991) Detection Theory: A Users' Guide. CUP Archive.
  46. Marr D (1982) Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information. New York: Freeman.
  61. Roitman JD, Shadlen MN (2002) Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience 22:9475–9489.
  62. Shadlen MN, Newsome WT (2001) Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology 86:1916–1936.
  69. Turner R, Sahani M (2007) Modeling natural sounds with modulation cascade processes. Advances in Neural Information Processing Systems 20:1–8.
  74. Yin P, Fritz JB, Shamma SA (2010) Do ferrets perceive relative pitch? The Journal of the Acoustical Society of America 127:1673–1680. https://doi.org/10.1121/1.3290988

Decision letter

  1. Timothy EJ Behrens
    Reviewing Editor; University College London, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: a previous version of this paper was rejected after a second round of review, but the authors submitted for reconsideration. What follows is the first decision letter after peer review, requesting revisions.]

Thank you for submitting your article "Detecting changes in dynamic and complex acoustic scenes" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom, Barbara G Shinn-Cunningham (Reviewer #1), is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Jennifer K Bizley (Reviewer #2) and Simon P Kelly (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This manuscript combines psychophysical, electrophysiological and modeling approaches to understanding auditory change detection. Listeners need to first estimate the probability distribution across a "tone cloud," then detect a change in this probability distribution. Behavioral data show that changes are easier to detect when they are later, larger, and when the spectral changes are in neighboring frequency channels. The authors complement these behavioral results with EEG data from a very similar task. Finally the paper includes two different models that account for the key characteristics of the behavioral data.

All three reviewers think the paper is interesting and publishable, but we had a number of concerns about the presentation. We appreciate that you had to make some difficult analysis decisions, both with respect to the psychophysics and the ERPs, and see that your choices are well considered and principled. The experimental results are compelling, showing nice behavioral and neuroelectric imaging on similar tasks. However, the modeling is not all that convincing, and the rationale for including both a statistical decision model and a more physiologically motivated model is not clear. We also have a fairly long list of minor edits / suggestions for you to consider to improve the clarity and readability of the manuscript.

Essential revisions:

1) The writing makes it hard to appreciate the study.

The manuscript is a bit of a tough haul; it is quite heavy going and descriptive. The paper would be easier to read, and have bigger impact, if it were streamlined to better emphasize the research questions and hypotheses and to provide clear interpretations (e.g., to highlight what is learned from each experiment, to explain the importance of model parameter values). For instance, even understanding what the stimuli were and what the listeners were being asked to do was a real challenge. The description of the stimulus design is quite lengthy, yet very hard to follow. Perhaps it would help to start with a more intuitive, descriptive explanation of what you did, why you did it, and what the task was measuring before (in sentence two of your Stimulus Design section) jumping directly to descriptions of probabilities and marginal distributions and what the exact range of frequencies was. There is no forest emerging from these technical-tree details.

2) We also had concerns with the modeling.

2a) Claims about temporal dynamics of neural decisions that are based on averages of electrical activity over many trials are problematic. When you find a slow effect in an average, there is no way of knowing if activity is slowly accumulating within each trial, or if it is suddenly jumping from low to high at some point in time, but at different times, within each trial. In either case, you end up with an average that will look like a slow increase (see the recent paper by Jonathan Pillow on this topic). Many of the arguments about evidence accumulation rely on this aspect of the data. This weakness needs to be acknowledged and discussed.
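The averaging concern in 2a can be illustrated with a toy simulation. The sketch below is not taken from the manuscript or from the cited work; the trial count, the number of samples and the uniform range of step times are arbitrary assumptions chosen only for illustration. Each simulated trial jumps instantaneously from 0 to 1 at a random time, yet the across-trial average rises gradually.

```python
import random

def step_trial(n_samples, step_index):
    """One toy trial: activity jumps from 0 to 1 at step_index and stays there."""
    return [0.0 if t < step_index else 1.0 for t in range(n_samples)]

def trial_average(trials):
    """Average across trials, sample by sample."""
    n_trials = len(trials)
    return [sum(column) / n_trials for column in zip(*trials)]

random.seed(0)          # fixed seed so the sketch is reproducible
N_SAMPLES = 100         # assumed trial length (arbitrary units)
N_TRIALS = 2000         # assumed number of trials

# Step times drawn uniformly between samples 20 and 80 (an assumption).
trials = [step_trial(N_SAMPLES, random.randint(20, 80)) for _ in range(N_TRIALS)]
average = trial_average(trials)

# Samples at which the average sits strictly between the two single-trial levels.
intermediate = [v for v in average if 0.05 < v < 0.95]
```

No single trial ever takes an intermediate value, but the average spends many samples between 0 and 1, so a ramp in the average cannot by itself distinguish gradual accumulation from steps at variable times.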

2b) Assuming you can justify the accumulation modeling (see 2a), you need a clearer motivation for including both models and a better discussion of how the two relate to each other. Both offer some sort of insight into how this task might be solved, and can account for important characteristics of the dataset (including features that they were not explicitly modeled to fit). But it is not clear how the results of the cortical model relate to the accumulation dynamics. The cortical model seems to say that everything can happen in A1. But there is not any temporal integration function applied beyond A1, which seems discrepant with the CPP finding (unless the CPP comes from A1?). You address this briefly in the discussion, suggesting that the job of the CPP might be just to select and amplify certain accumulators, but these claims must be made more carefully with reference to the existing literature about the CPP signal. Because it is not clear how the two models are related, it is very hard to figure out what the take-away message is (e.g. how does the cortical model relate to the slow and fast approximation steps of the statistical model?). What do we learn from having both? Do either (or both) generate testable predictions for further study? What experiments might be necessary to gather evidence for the statistical model's physiological instantiation? Either choose one model, which emphasizes what you want the reader to understand, or make clear how the models relate, and what each contributes to understanding the phenomena you observed.

3) There are some experimental points that need to be clarified.

The psychophysical and EEG sessions differed in potentially crucial ways, yet this is played down in the paper. The behavioral data from the EEG sessions are not reported at all. Comparing the psychophysics sessions to the EEG sessions, respectively: 1) responses are immediate versus delayed/withheld until stimulus offset, 2) there are 0 versus 50% catch trials, and 3) feedback is given versus not. These factors muddy the degree to which measured EEG signals reflect mechanisms and strategies for performing the task in the psychophysics sessions. We think it would help if you a) show the RT and accuracy (on both change and catch trials) data for the EEG session, and b) discuss these differences head-on, clarifying why it can be assumed that despite the methodological differences, the EEG effects reflect the same mechanisms at play during the psychophysics sessions.

4a) Analysis of the EEG data also raised some questions.

You conclude that neural activity in auditory cortex and parietal cortex carry different signals. However, since two completely different analyses were performed for auditory ERPs and the CPP, is this justifiable? The data in Figure 5 shows how the different analysis methods affect the resulting signals (the CPP are smoothed by the low-pass filtering). Are the differences between brain regions maintained if the same analysis methods are applied to both sets of electrodes? In order to determine that a decision-making signal is present in one set and not the other, shouldn't the same analysis be applied to both? Another example of this kind of issue is in subsection “EEG recordings and the site of decision-making”, where you conclude that the representation in the CPP electrodes is less temporally precise than the auditory cortical one. However, since the CPP signal has been low-pass filtered at 30 Hz, this is hardly surprising (!).

4b) Your "scene" may consist of a single, changing auditory object.

A "scene" typically has multiple "objects" that come and go; we question whether your stimuli are perceived as containing objects that appear and disappear. Superficially, this may seem like a semantic issue, but if one believes that "auditory objects" are the "units of attention" (as at least one of the reviewers does), this distinction has implications for the way in which the task is solved. Specifically, your task may depend on detecting changes in the spectral envelope of a single object, which could explain why you do not see change-related signals in auditory cortex, although other researchers do. For example, in the Chait lab, where change-related cortical signals are observed, acoustic scenes likely are perceived as consisting of multiple objects that appear or disappear. If your listeners do hear multiple sources, this needs to be explained to the reader; if they hear one object changing, then this needs to be explained, and the discussion of how your study relates to "the real world" modified.

[Editors’ note: what now follows is the decision letter after re-review.]

Thank you for submitting your work entitled "Detecting changes in dynamic and complex acoustic environments" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Jennifer K Bizley (Reviewer #2) and Simon P Kelly (Reviewer #3).

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

All of us are very interested in the topic and the questions you pose, and we all feel that the study has significant value, which is why we were receptive to the initial submission.

However, the new information provided in the revised submission brought to light problems with the data and interpretation that were not clear in the initial submission (see detailed reviews below). The corrected topographic plots in particular call into question the explanation of the activity as being attributable to an accumulation-based decision process. Unfortunately, the data included in the current manuscript do not provide sufficient support for your conclusions, and are inconsistent with published results on this topic (an issue that is not addressed in interpreting your results). The study would likely benefit from collection of additional data, as indicated below.

Reviewer #2:

The authors have addressed all of the concerns I originally outlined. I still have some reservations about the levels of performance of the subjects for all but the largest change sizes. Perhaps I've misunderstood the additional data on performance during the EEG task (Figure 5—figure supplement 3), but it seems that performance for catch trials and non-catch trials is indistinguishable, suggesting that subjects are effectively randomly responding.

Reviewer #3:

I found that the authors were generally attentive to our comments, but an important part of the findings remains critically unconvincing. My own main issue was that it was not clear that the results of the EEG experiment really provide a view on the neural dynamics at play in the psychophysics task, because so many aspects of the task (immediate versus delayed response, presence of 50% catch trials, different set of change sizes and timings, and provision of feedback) were different.

Having prompted the authors to add the behavioral results for the task version run during EEG recording (which they did) and explain and discuss the impact of the task differences, I regret to say my concerns over this have grown rather than diminished. The authors state that "it is the same task" but it most certainly is not from the point of view of decision mechanisms, into which the study is supposed to provide new insights – the subjects listen out for the same kind of acoustic event, yes, but the various contingencies of any given task protocol – what they know or can surmise about timing, about the probability of the event happening at all, about the range of trial difficulties possible and certainly the requirement to report their detection immediately – can each have potentially huge impact on the strategies employed to perform the task and by extension the neural dynamics of decision formation.

The authors gave a single reason for the task differences: to avoid motor preparation signals in the EEG. This rationale does not hold up because the prior work identifying the "CPP" as a decision signal clearly demonstrated that response preparation can be easily dissociated from the CPP evidence accumulator simply from the fact that the CPP does not care about response hand. In fact, immediate decision reports were crucial because the ability to time-lock the signal to these actions and show that the accumulator peaks at response regardless of RT was one of the major identifying criteria of the decision signal.

Another major reason for my fallen enthusiasm is the fact that, after correcting an electrode-ordering mistake in this revision, the topography of the positive buildup appears to be focused at the very inferior edge of the cap. Neither the previously characterized decision signal "CPP" nor the P300 to which it has been equated, to my knowledge, has ever been found with such an inferior topography. The overall problem is that the authors are using the presence of this ERP positivity and the effects of change time on its buildup as evidence supporting their claim that this task is accomplished through sensory evidence accumulation. But for this to work, it needs to be definitively and independently established that the signal is in fact an evidence accumulator – otherwise, the argument is circular, i.e., the neural evidence for the improved change detection at later change times being a result of steeper evidence accumulation comes from the finding that an accumulator signal rises more steeply for later change times, but the sole basis upon which that signal is identified as an evidence accumulator is the fact that it rises more steeply for later change times! Clearly what is needed is an independent way to establish that the signal is an evidence accumulator. Shared properties with an established accumulator signal (CPP) gave at least partial support, but with the topography now so distinct I think the authors have lost this one piece of support. Source localizing to parietal cortex does not help much because parietal cortex in general has not been associated with decision formation any more than any other cognitive function, let alone with evidence accumulation in particular, and a parietal neural source certainly played no part in the CPP's identification as a decision signal.

I feel my hands are tied on highlighting this issue as a preclusive one when I consider the simple fact that if an ERP study with as few as 6 subjects, equating an occipital positivity with a previously characterized centro-parietal one, were to be submitted to a low-ranking but specialized journal like Psychophysiology, it would likely be dismissed out of hand.

The obvious solution to all of the above would be to collect EEG data from a more reasonable number of subjects (>15) on a task with identical parameters to that run in the psychophysical sessions. This would readily allow independent tests of the assertion that the positivity is an evidence accumulator: 1) The peaking at RT regardless of whether RT is fast or slow I mentioned before, and 2) The other major identifying property is the dependence of buildup rate on evidence strength which could be tested across the four different change sizes 50-140%. All the authors would have to do to deal with the broader range of change times is to sort trials into change-time bins. In addition to these independent tests, more reliable analyses could be conducted to address many other unanswered questions in the paper about what the positive buildup signal is actually doing in the task, e.g., what temporal profile does the signal have during the baseline pre-change part of the stimulus – does it steadily build up reflecting the accumulation of information used to construct the baseline statistical distribution, or does the signal step up discretely as suggested by the current waveforms, and if so, why? And what is the signal actually reflecting with respect to the models, e.g. the sum of the two accumulators working on different timescales? Are the apparent buildup rate effects seen already in the pre-change baseline for later changes statistically reliable, and if so what could this mean?
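The suggestion to sort trials into change-time bins can be sketched as follows. This is only an illustration of the bookkeeping: the field names `change_time` and `rt`, the bin edges, and the toy trial values are all hypothetical and are not data from the study.

```python
from collections import defaultdict

def bin_by_change_time(trials, bin_edges):
    """Group trials into half-open bins [lo, hi) according to their change time."""
    bins = defaultdict(list)
    for trial in trials:
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= trial["change_time"] < hi:
                bins[(lo, hi)].append(trial)
                break
    return dict(bins)

def mean_rt_per_bin(binned):
    """Mean reaction time within each change-time bin."""
    return {edges: sum(t["rt"] for t in ts) / len(ts) for edges, ts in binned.items()}

# Made-up toy trials (times in seconds), for illustration only.
toy_trials = [
    {"change_time": 0.3, "rt": 0.9},
    {"change_time": 0.8, "rt": 0.7},
    {"change_time": 1.6, "rt": 0.6},
    {"change_time": 2.4, "rt": 0.5},
    {"change_time": 2.9, "rt": 0.4},
]
binned = bin_by_change_time(toy_trials, [0.0, 1.0, 2.0, 3.0])
mean_rts = mean_rt_per_bin(binned)
```

The same grouping could then be applied to the response-locked EEG epochs, so that buildup rate and peak amplitude are compared within each change-time bin rather than across a continuous range of change times.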

Further comments

The visual topography of the positivity: The authors measure the positive signal from electrodes Pz to Oz on the posterior midline. Pz is the most inferior the CPP has been measured thus far (Twomey et al. 2015, cited by the authors), and this is included in the authors' cluster, but where is the signal actually biggest? From the topography it looks like it could be Oz, and this strangely suggests a visual cortical origin. It now becomes important to know what the subjects actually looked at during the task, which is not detailed anywhere – did they fixate on a cross and nothing else? Were there stimuli on the screen that were presented during the trial or after, and could be processed or anticipated by the subject? One alternative possibility to a decision buildup account is that this positivity reflects suppression of visual cortex upon detection of the change, which would happen earlier on late-change trials where the change is detected quickest.

If the occipital positivity is in fact the same signal as the CPP, then the reason for its dramatic inferior shift should be figured out. A likely reason is that 6 subjects simply don't provide a reliable enough measurement, and one or two outlying subjects are dominating. But it could also be a signal issue – maybe the polynomial detrending has removed the actual decision signal from centro-parietal sites? "nt_detrend" probably has been used exclusively for datasets with relatively fast responses, and not for tasks where the potentials are very slow-moving like the current one – what happens to the signals when this step is removed, or replaced by a first-order (linear) detrend? It is possible in principle that strong auditory-evoked negative activity more frontally causes the positive focus to shift back on the scalp, but this seems an unlikely explanation given that the CPP/P300 have been measured with the same centro-parietal topographies for many auditory tasks including ones containing continuous audio streams (O'Connell et al. 2012).

Abstract: "in parietal cortex" should definitely be replaced with "over parieto-occipital scalp". This is EEG.

[Editors’ note: what now follows is the decision letter after the authors resubmitted for consideration.]

Thank you for resubmitting your work entitled "Detecting changes in dynamic and complex acoustic environments" for further consideration at eLife. Your revised article has been favorably evaluated by Timothy Behrens (Senior and Reviewing editor), and two reviewers.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

Reviewer #1:

In this paper, Boubenec and colleagues demonstrate compelling psychophysical and electrophysiological correlates of statistical estimation mechanisms underlying naturalistic acoustic change detection, and provide both an abstract and physiologically-based computational model to explain their data. This is a revision of a previous submission, based on new EEG data recorded during the exact same task as was physiologically characterized and modeled, and the results are now more definitive and highly interesting, including stronger evidence for accumulation (steeper buildup for stronger changes) and additional effects on peak amplitude of the neural index of accumulation. The concepts and hypotheses are now also laid out very clearly, with nice sections explaining predictions based on statistical estimation strategies. There are just a couple of substantive points that are important to clear up at this stage.

There are a couple of important checks to perform on the CPP amplitudes at response to verify the effects on this measure. First, the topography of the effect of change size on amplitude should be shown, around the time of response, e.g. the largest change size minus smallest – does it have a similar CPP topography? This would go some way to verifying that the amplitude effects are not caused by an overlapping, separate process (e.g. slow fronto-central negativity). Also, since the baseline for the response-locked waveforms was 1000-1200ms prior to response, conditions with longer RTs are baselined relative to intervals further out into the stimulus; this is a potential confound because it could be that for lower change sizes, the hit trials are those in which the CPP waveform gets a head-start by accumulating noise or a false change, and so it has less of a distance to go to threshold when the actual change happens. Or given how long some RTs are for weaker changes, the "head-start" could be from genuine physical evidence! This would result in an amplitude decrease at response when using the current baselining regime, even in the absence of any change in bound. Therefore a pre-stimulus or a pre-change baseline should instead be used for the response-locked waveforms. It is important to see whether this changes the results. I would be fairly confident that the main effects hold up – the slope effects certainly will because baselining doesn't matter in that case. Nevertheless, these controls are important to rule out critical potential confounds.
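The baseline confound described above can be made concrete with a toy calculation. In the sketch below, both conditions reach the same fixed bound at response; the waveform shape, the bound of 1.0, the head-start of 0.3 and the window positions are arbitrary assumptions, not values from the study.

```python
def ramp_to_bound(n_samples, bound, head_start, ramp_start):
    """Toy response-locked waveform: flat at head_start, then a linear ramp
    that reaches `bound` at the last sample (the response)."""
    wave = []
    for t in range(n_samples):
        if t < ramp_start:
            wave.append(head_start)
        else:
            frac = (t - ramp_start) / (n_samples - 1 - ramp_start)
            wave.append(head_start + frac * (bound - head_start))
    return wave

def mean(values):
    return sum(values) / len(values)

BOUND = 1.0               # identical decision bound in both conditions (assumed)
PRE_STIMULUS_LEVEL = 0.0  # level before stimulus onset (assumed)

# Strong change: no head-start. Weak change: noise or early evidence has
# already lifted the signal before the pre-response baseline window.
strong = ramp_to_bound(200, BOUND, head_start=0.0, ramp_start=100)
weak = ramp_to_bound(200, BOUND, head_start=0.3, ramp_start=100)

pre_response_window = slice(0, 50)  # stands in for the 1000-1200 ms window

# Amplitude at response relative to the pre-response baseline.
amp_strong_pre_response = strong[-1] - mean(strong[pre_response_window])
amp_weak_pre_response = weak[-1] - mean(weak[pre_response_window])

# Amplitude at response relative to a pre-stimulus baseline.
amp_strong_pre_stimulus = strong[-1] - PRE_STIMULUS_LEVEL
amp_weak_pre_stimulus = weak[-1] - PRE_STIMULUS_LEVEL
```

With the pre-response baseline, the weak condition appears to peak lower even though the bound is unchanged, because the head-start is subtracted away; with the pre-stimulus baseline, the two amplitudes agree.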

Possible explanations for the decrease in amplitude of CPP across change times tend to be provided briefly in a scattered fashion. It is important to reiterate the lack of effect of RT here – if bound is collapsing over time, surely the CPP amplitude would decrease with RT as well as with change time? Further, as the authors point out themselves, if bound were collapsing over time, false alarm rate should increase – it seems that this point should be woven into the discussion of this effect more than it is. In general, since several possible explanations are given for this effect, it might be better not to scatter them in various parts of the Results and Discussion, but rather go through them all in one coherent section in the Discussion?

Around paragraph three of subsection “EEG recordings and the site of decision-making” the discussion is a bit muddled – it appears to both argue for and against a decreasing bound in the same breath. This could use clarifying.

Finally on this point, have the authors considered other, simpler mechanisms for preventing the subject from clicking when the stimulus initially onsets? The work of Tobias Teichert (see e.g. review in 2014) and others has demonstrated that subjects may be in control of the timing of onset of decision processes (specifically evidence accumulation) – given there's a "dead zone" apparent in Figure 1C showing that subjects are not willing to respond until a certain time has elapsed, couldn't they be simply deferring the kick-off of their decision mechanisms? This is easily parameterized in a fixed delay, and would avoid the need to introduce dramatic time-dependence in the fixed mechanisms (e.g. lengthening of the slow estimator's time constant) to explain that aspect. The authors should at least discuss this possibility.

Figure 5A-F: I think it might make more sense to have each signal (medial, CPP) in the rows, and the response-locked waveforms to the right of the stimulus locked ones, as is standard across neurophysiological decision research. The labels should also be consistent in nature – perhaps they should be "fronto-central" and "parietal?" – "medial" is not a good descriptor since both signals are medial.

Missing from Figure 5 main figure are the change-onset locked waveforms. Supplement 3 shows the auditory electrodes, but really both signals should be shown time locked to change onset for a complete view over the dynamics. The x-axes of the current figures can easily be cut to zoom in on the main dynamics (e.g. the current 500-ms baseline shown before stimulus onset need only be 100 ms), so there should be room to squeeze them in between. This is important given that in places there are references to what happens to the CPP in response to the change, not just prior to response.

I believe eLife format requires methods to go at the end.

Reviewer #2:

The inclusion of the additional data (18 subjects in a simultaneous behavioural-EEG paradigm) significantly strengthens this manuscript providing clarity on the issues raised in previous reviews. The manuscript makes a significant contribution to the auditory field and I have only a few minor comments.

https://doi.org/10.7554/eLife.24910.023

Author response

[Editors’ note: the first author response to the requests for revisions follows.]

Essential revisions:

1) The writing makes it hard to appreciate the study.

The manuscript is a bit of a tough haul; it is quite heavy going and descriptive. The paper would be easier to read, and have bigger impact, if it were streamlined to better emphasize the research questions and hypotheses and to provide clear interpretations (e.g., to highlight what is learned from each experiment, to explain the importance of model parameter values). For instance, even understanding what the stimuli were and what the listeners were being asked to do was a real challenge. The description of the stimulus design is quite lengthy, yet very hard to follow. Perhaps it would help to start with a more intuitive, descriptive explanation of what you did, why you did it, and what the task was measuring before (in sentence two of your Stimulus Design section) jumping directly to descriptions of probabilities and marginal distributions and what the exact range of frequencies was. There is no forest emerging from these technical-tree details.

The description of the stimulus is inherently complex, and so we agree with the concern that the description in Methods does not offer an 'easy way in'. We therefore added more entry text to give a simpler, intuitive description ahead of the stimulus & procedure description in the Methods (paragraph Stimulus Design & Trial Procedure). We also added a set of sound samples as supplementary material, using the possibility of embedding multimedia files within the course of the paper. Also, we have revised the Abstract, Introduction and Discussion to streamline the presentation for improved readability.

2) We also had concerns with the modeling.

2a) Claims about temporal dynamics of neural decisions that are based on averages of electrical activity over many trials are problematic. When you find a slow effect in an average, there is no way of knowing if activity is slowly accumulating within each trial, or if it is suddenly jumping from low to high at some point in time, but at different times, within each trial. In either case, you end up with an average that will look like a slow increase (see the recent paper by Jonathan Pillow on this topic). Many of the arguments about evidence accumulation rely on this aspect of the data. This weakness needs to be acknowledged and discussed.

Indeed, Jonathan Pillow and colleagues recently showed that the spiking activity of individual neurons in parietal areas could follow discrete, instantaneous changes in their underlying firing distribution. However, as far as we know, they do not suggest that simultaneous discrete steps would occur at the level of the population. Rather, the idea is that each neuron independently undergoes a discrete change in its spiking activity, at a time that varies from one trial to the next. Thus, even at the single-trial level, averaging responses over a large neuronal population would be expected to produce a ramping response, much like that of the cortical model or the EEG recordings. To clarify this point further for readers, we added the following statements in the EEG recordings and the site of decision-making paragraph of the Discussion section:

“It has recently been suggested that individual neurons change their firing rate instantaneously at the single trial level (Latimer et al., 2015). We presently observed gradual, rather than step-wise changes in our across-trial averages. However, we predict that even single trial EEG signals would be gradual as these step-changes occur randomly, and hence are unlikely to be synchronized at the population-level. Due to the large ensemble of neural responses contributing to a single scalp location’s potential, this instead results in the commonly seen ramping activity on the EEG level, as observed in our data.”

2b) Assuming you can justify the accumulation modeling (see 2a), you need a clearer motivation for including both models and a better discussion of how the two relate to each other. Both offer some sort of insight into how this task might be solved, and can account for important characteristics of the dataset (including features that they were not explicitly modeled to fit). But it is not clear how the results of the cortical model relate to the accumulation dynamics. The cortical model seems to say that everything can happen in A1. But there is not any temporal integration function applied beyond A1, which seems discrepant with the CPP finding (unless the CPP comes from A1?). You address this briefly in the discussion, suggesting that the job of the CPP might be just to select and amplify certain accumulators, but these claims must be made more carefully with reference to the existing literature about the CPP signal. Because it is not clear how the two models are related, it is very hard to figure out what the take-away message is (e.g. how does the cortical model relate to the slow and fast approximation steps of the statistical model?). What do we learn from having both? Do either (or both) generate testable predictions for further study? What experiments might be necessary to gather evidence for the statistical model's physiological instantiation? Either choose one model, which emphasizes what you want the reader to understand, or make clear how the models relate, and what each contributes to understanding the phenomena you observed.

The dual-timescale model, which accounts well for the data, is clearly only remotely related to the physiology, and instead mostly applies at the level of statistical estimation as a guiding principle. Since Drift-Diffusion Models have been very successful in describing many experimental datasets, we think it is important to demonstrate that our psychophysical data can be explained by such a mechanism. On the other hand, the cortical model simulations make a good case that the lack of a grand response in A1 is consistent with the local processing in the model.

We have now added text to motivate the choice of the models and their relative strengths. We also described in more detail the “mechanistic” differences of how the two models accomplish their performance, and on which level we would see their implementation. Briefly, detection of statistical changes can be accomplished by a classical statistical estimation model, but can also be accounted for by a different, more physiological model with multiple time-scales. We made this idea easier to understand for the reader with the following text (Modelling statistical decision-making on two levels paragraph of the Discussion section):

“To create behavioral performance from its representation, we merely added a filter selection and a decision criterion. The spectrotemporal filters implemented in the cortical model exhibit alternating excitatory (positive) and inhibitory (negative) fields (Figure 8A) that compare the spectral stimulus properties over a given time window set by a filter's temporal rate. As such, it effectively integrates the recent input with opposite signs to detect a change, which can be compared to the difference between the fast and slow estimators in the statistical estimation model.”
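The comparison between a fast and a slow estimator mentioned in the quoted text can be illustrated with a generic toy sketch. This is not the authors' statistical estimation model or their cortical model: the leaky running averages, the time constants (5 and 100 samples) and the noiseless step input are assumptions chosen only to show the principle that the difference between two timescales responds transiently to a change.

```python
def leaky_average(samples, time_constant):
    """Exponentially weighted running mean; time_constant is in samples."""
    alpha = 1.0 / time_constant
    estimate = samples[0]
    trace = []
    for x in samples:
        estimate += alpha * (x - estimate)
        trace.append(estimate)
    return trace

def change_signal(samples, fast_tau, slow_tau):
    """Difference between a fast and a slow running estimate of the input."""
    fast = leaky_average(samples, fast_tau)
    slow = leaky_average(samples, slow_tau)
    return [f - s for f, s in zip(fast, slow)]

# Toy input: a statistic that is constant, then steps up at sample 200.
CHANGE_INDEX = 200
stimulus = [0.0] * CHANGE_INDEX + [1.0] * 100

signal = change_signal(stimulus, fast_tau=5, slow_tau=100)
peak_after_change = max(signal[CHANGE_INDEX:])
```

Before the change the two estimates agree and the difference is zero; after the change the fast estimate moves first, so the difference rises and then decays as the slow estimate catches up. A decision criterion applied to such a difference would turn it into a detection response.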

3) There are some experimental points that need to be clarified.

The psychophysical and EEG sessions differed in potentially crucial ways, yet this is played down in the paper. The behavioral data from the EEG sessions are not reported at all. Comparing the psychophysics sessions to the EEG sessions, respectively: 1) responses are immediate versus delayed/withheld until stimulus offset, 2) there are 0 versus 50% catch trials, and 3) feedback is given versus not. These factors muddy the degree to which measured EEG signals reflect mechanisms and strategies for performing the task in the psychophysics sessions. We think it would help if you a) show the RT and accuracy (on both change and catch trials) data for the EEG session, and b) discuss these differences head-on, clarifying why it can be assumed that despite the methodological differences, the EEG effects reflect the same mechanisms at play during the psychophysics sessions.

The differences between the psychophysical and EEG sessions are now discussed explicitly right after their introduction in the Materials and methods.

“The delayed response and catch trials were introduced in the EEG study to improve the reliability of observing the neural response. Delaying the response allows us to observe the neural integration, without a (preparatory) motor response interfering. Since listeners had more of an opportunity to respond in each trial, catch trials had to be introduced to assess detection performance (see Figure 5—figure supplement 1). While the overall set of trials differed somewhat between the psychophysics and the EEG experiments, the task in both cases is the same. Given the (on average) longer integration duration before a decision is made in the EEG, we expected reaction times and performance to differ less. However, they were both significantly dependent on change time (Figure 5—figure supplement 1B-C).”

Further, a figure supplement has been added (Figure 5—figure supplement 1), which details both reaction times and performance for the EEG condition. Both reaction times and performance show a dependence on change time, although the absolute reaction times are considerably longer than in the psychoacoustic version, since subjects had less pressure to react quickly. Subjects appeared to be conservative for short trials, i.e. did not detect a change in statistics, and hence were 'below chance' in change detection, but above chance in 'catch detection'.

4a) Analysis of the EEG data also raised some questions.

You conclude that neural activity in auditory cortex and parietal cortex carry different signals. However, since two completely different analyses were performed for auditory ERPs and the CPP, is this justifiable? The data in Figure 5 shows how the different analysis methods affect the resulting signals (the CPP are smoothed by the low-pass filtering). Are the differences between brain regions maintained if the same analysis methods are applied to both sets of electrodes? In order to determine that a decision-making signal is present in one set and not the other, shouldn't the same analysis be applied to both? Another example of this kind of issue is in subsection “EEG recordings and the site of decision-making”, where you conclude that the representation in the CPP electrodes is less temporally precise than the auditory cortical one. However, since the CPP signal has been low-pass filtered at 30 Hz, this is hardly surprising (!).

This is an excellent point that we had checked beforehand but did not include in the manuscript at that time. We have now inserted a figure supplement to Figure 5 showing the low-pass filtering analysis applied to the set of auditory electrodes. The buildup signal is specific to the parieto-occipital electrodes, as shown by the absence of any significant post-change activity over the low-passed auditory-related electrodes. We also refer to this figure supplement in the manuscript:

“This signal was absent from the set of auditory electrodes when applying the same analysis (Figure 5—figure supplement 2), indicating that this post-change activity was specific to the group of parieto-occipital electrodes.”

We also realized that a small subset of the electrode labels was shifted by one index. This does not qualitatively affect the previous results. Auditory electrodes were untouched by this correction, whereas the electrodes displaying the slow post-change potential were more occipital than described in the previous version of the manuscript (see the updated topographic plot in Figure 5B). This has been corrected throughout the revised manuscript.

In order to verify that this occipital shift of the potential is not at odds with the CPP signal stemming from a parietal source, we performed source localization on the onset and change components (using the clustering approach in the dipfit toolbox in EEGLAB). For the onset component of the response, associated with the fronto-central positivity (Figure 5—figure supplement 3A and B), the dipoles are localized in superior temporal cortex, BA 41 (Figure 5—figure supplement 3C). For the change component of the response, the dipoles are more distributed but cluster around the parietal cortex, in the range beyond the TPJ (temporo-parietal junction). We are aware that source localization is not trivial; however, the present locations were estimated without much tweaking of parameters, and are cleanly associated with the temporal response shapes we discussed in the central and the parieto-occipital electrode locations.

4b) Your "scene" may consist of a single, changing auditory object.

A "scene" typically has multiple "objects" that come and go; we question whether your stimuli are perceived as containing objects that appear and disappear. Superficially, this may seem like a semantic issue, but if one believes that "auditory objects" are the "units of attention" (as at least one of the reviewers does), this distinction has implications for the way in which the task is solved. Specifically, your task may depend on detecting changes in the spectral envelope of a single object, which could explain why you do not see change-related signals in auditory cortex, although other researchers do. For example, in the Chait lab, where change-related cortical signals are observed, acoustic scenes are likely perceived as consisting of multiple objects that appear or disappear. If your listeners do hear multiple sources, this needs to be explained to the reader; if they hear one object changing, then this needs to be explained, and the discussion of how your study relates to "the real world" modified.

We agree with the reviewers that the choice of terminology was not optimal. Auditory scenes are classically considered to have a small number of identifiable acoustic objects, whereas acoustic textures are better categorized under the term 'acoustic environment', in analogy with the typical, naturally occurring examples of auditory textures, i.e. rain, wind, water, etc, typical for natural environments. From this perspective, we agree – and never intended to claim anything different – that the auditory texture is typically perceived as a whole, while being composed of a large number of acoustic elements (e.g. drops, bubbles, short tones in our case). While listeners may differ in their perception/strategy, each could choose to listen to a certain subset of the texture, e.g. certain constituents or – as suggested in point 10 by the reviewers – a certain frequency range, e.g. the loudest one. We show in the new Figure 2—figure supplement 3, that the latter strategy is unlikely, given the performance data, see point 10 for details. If a listener had, nonetheless, adopted such a strategy, we hypothesize that auditory objects would be transiently formed on an individual element basis, but it would still hardly qualify as an acoustic scene. In consequence, we have gone through the manuscript and replaced the term 'scene' with 'environment', and carefully scouted for similar formulations.

Regarding the work of Prof. Chait, we have devoted a new paragraph in the Discussion to a comparison with her work.

[Editors’ note: the author responses to the second round of review (rejection) follow.]

Reviewer #2:

The authors have addressed all of the concerns I originally outlined. I still have some reservations about the levels of performance of the subjects for all but the largest change sizes. Perhaps I've misunderstood the additional data included in the supplemental material (Figure 5—figure supplement 3) on performance during the EEG task, but it seems that performance for catch trials and non-catch trials is indistinguishable, suggesting that subjects are effectively randomly responding.

The task was designed with a high level of difficulty, to ensure that subjects had to listen closely to detect a change. We think that the performance data is indicative of the subjects actively trying to solve the task, rather than guessing, for the following reasons:

– The performance depends strongly and significantly on the change time. Hence, both above 50% performance in the 3s condition and below 50% performance for the 0.75s condition are indicative of actively performing subjects; otherwise chance performance would be expected for both. As in other detection paradigms, some conditions are easy, leading to above-chance performance, whereas others are designed to be very hard to detect. Because conditions were not presented in blocks, subjects could not be biased towards chance performance by balancing their perceptual reports to equate the change and no-change responses within a condition. This led to below-chance performance for the hardest non-catch condition (50%), where subjects were missing a substantial proportion of changes. If we were to continue the change time axis in both directions, we would expect to find a sigmoidal dependence, possibly not actually reaching 0 and 100% due to fatigue and unsystematic response errors. While the average performance of these subjects across all non-catch conditions could well be 50%, the performance is systematically determined by the independent variable (change time here). We agree that a higher level of performance would be more reassuring about the subjects' strategy in all conditions; however, this would come at the cost of making the task easy for every parameter value, and thus losing the dependence of performance on variables of the task.

– Certainty of decision appeared to vary as a function of change time as well, since the reaction times for both catch and signal trials reduced significantly with change time. This indicates also that subjects were not guessing after stimulus end, but made an active decision based on the previous information.

– The correct classification of catch trials ('no response') overall is significantly greater than chance (p=0.0000007, t-test of all conditions compared to 50%). This is the performance indicated by the percentages (y-axis) in B. The slight decrease in performance with change time in this bigger dataset surprised us, but could indicate an increased sensitivity to small fluctuations over time, which then erroneously classify a subset of the catch trials as signals.
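The comparison against chance in the last point can be sketched as a one-sample t statistic against 50%. This is only an illustration: the accuracy values below are hypothetical, the function name `one_sample_t` is ours, and the sketch stops at the t statistic and degrees of freedom rather than reproducing the reported p-value.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, popmean):
    """Return (t, degrees_of_freedom) for a one-sample t-test against popmean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    spread = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("all observations are identical; t is undefined")
    standard_error = spread / math.sqrt(n)
    return (mean(values) - popmean) / standard_error, n - 1


# Hypothetical percent-correct values for catch trials (not the study's data).
catch_accuracy = [62.0, 58.0, 71.0, 55.0, 66.0, 60.0]
t_value, dof = one_sample_t(catch_accuracy, 50.0)
print(f"t({dof}) = {t_value:.2f}")
```

A positive t with a small associated p-value would indicate that catch trials are classified correctly more often than chance; the p-value itself would come from the t distribution with the returned degrees of freedom.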

Author response image 1
Change detection reaction times and performance during the delayed response EEG experiment as a function of exposure to the first texture. (A) Reaction time decreased significantly as a function of change time and trial type both for catch (brown) and change trials (blue, 1-way ANOVA, p-values indicated in the figure).

Reaction times were normalized within each subject before averaging to account for individual overall differences. (B) The accuracy (correct response for either trial type) of catch trials stayed unchanged (brown, 1-way ANOVA), while the performance for the change trials improved significantly with change time (blue, 1-way ANOVA).

https://doi.org/10.7554/eLife.24910.020

Please note that this plot is not provided in the manuscript, since in response to reviewer 3, the delayed response dataset has been replaced with the “immediate” response dataset. The performance in that task (during EEG performance) is depicted in Figure 5—figure supplement 1.

Reviewer #3:

I found that the authors were generally attentive to our comments, but an important part of the findings remains critically unconvincing. My own main issue was that it was not clear that the results of the EEG experiment really provide a view on the neural dynamics at play in the psychophysics task, because so many aspects of the task (immediate versus delayed response, presence of 50% catch trials, different set of change sizes and timings, and provision of feedback) were different.

Having prompted the authors to add the behavioral results for the task version run during EEG recording (which they did) and explain and discuss the impact of the task differences, I regret to say my concerns over this have grown rather than diminished. The authors state that "it is the same task" but it most certainly is not from the point of view of decision mechanisms, into which the study is supposed to provide new insights – the subjects listen out for the same kind of acoustic event, yes, but the various contingencies of any given task protocol – what they know or can surmise about timing, about the probability of the event happening at all, about the range of trial difficulties possible and certainly the requirement to report their detection immediately – can each have potentially huge impact on the strategies employed to perform the task and by extension the neural dynamics of decision formation.

While we think both task designs have their merits, we agree that an exact match between the paradigms removes potential doubts. We would like to thank the reviewer for this criticism, which prompted us to add the data from the immediate response experiment. Here, subjects had to respond immediately and the stimulus conditions are matched (change times and change sizes) with the psychophysical experiment (N=18, depicted in the new Figure 5 and its figure supplements). The data from the delayed paradigm have become unnecessary, and including them would complicate the presentation to a degree that would not be in the interest of the readers. We have therefore replaced the corresponding figures and text with the new, matched task. Below we refer to the delayed task a few times, to address some of the reviewer's questions. Further details are provided below.

The authors gave a single reason for the task differences: to avoid motor preparation signals in the EEG. This rationale does not hold up because the prior work identifying the "CPP" as a decision signal clearly demonstrated that response preparation can be easily dissociated from the CPP evidence accumulator simply from the fact that the CPP does not care about response hand. In fact, immediate decision reports were crucial because the ability to time-lock the signal to these actions and show that the accumulator peaks at response regardless of RT was one of the major identifying criteria of the decision signal.

We agree that the motor response may not have a disruptive influence on the CPP potential or invalidate its significance as a decision signal. However, the fact that the motor response is synchronous with parts of the CPP signal may lead to shifts in topography and changes in timing (see e.g. Salisbury et al., 2001). We think that it remains an interesting and important question where the difference in location between the post-change potentials arises from; however, resolving it would require this manuscript to grow beyond the limits of the eLife format.

We also agree that the possibility to time-lock the EEG response to the behavioral response time is crucial for the analysis of decision signals. For the purpose of the present manuscript, we can confirm that in the immediate response task, the location of the response-locked potential is consistent with previous CPP reports. Within the added dataset (immediate response, matched conditions) we provide these time-locked analyses (new Figure 5 and figure supplements) and demonstrate that the present signal recorded in the CPP location exhibits similar properties to the signals recorded by the reviewer and others in evidence accumulation tasks (see below, point 4, for more details). In addition we show a decrease in the potential as a function of change time – to our knowledge a novel finding – which could be indicative of a reducing threshold as a function of time-into-trial.

Another major reason for my fallen enthusiasm is the fact that, after correcting an electrode-ordering mistake in this revision, the topography of the positive buildup appears to be focused at the very inferior edge of the cap. Neither the previously characterized decision signal "CPP" nor the P300 to which it has been equated, to my knowledge, has ever been found with such an inferior topography. The overall problem is that the authors are using the presence of this ERP positivity and the effects of change time on its buildup as evidence supporting their claim that this task is accomplished through sensory evidence accumulation. But for this to work, it needs to be definitively and independently established that the signal is in fact an evidence accumulator – otherwise, the argument is circular, i.e., the neural evidence for the improved change detection at later change times being a result of steeper evidence accumulation comes from the finding that an accumulator signal rises more steeply for later change times, but the sole basis upon which that signal is identified as an evidence accumulator is the fact that it rises more steeply for later change times! Clearly what is needed is an independent way to establish that the signal is an evidence accumulator. Shared properties with an established accumulator signal (CPP) gave at least partial support, but with the topography now so distinct I think the authors have lost this one piece of support. Source localizing to parietal cortex does not help much because parietal cortex in general has not been associated with decision formation any more than any other cognitive function, let alone with evidence accumulation in particular, and a parietal neural source certainly played no part in the CPP's identification as a decision signal.

We can now address this point more encompassingly in light of the added dataset. We will first reiterate here the criteria for considering our observations as a decision signal, its relation to the CPP as well as to our present main question, the detection of changes in the statistics of auditory textures.

First, a useful set of criteria (or at least indicators) for a decision signal as provided by O'Connell et al., 2012, Kelly & O'Connell, 2013 and others, are

1) Encoding the integral of sensory evidence (i.e. linearly for constant evidence as a function of time or quadratically, if the amount of sensory evidence increases linearly itself, as in O'Connell et al., 2012).

a) As a corollary, the slopes of the potential should increase with the level of evidence.

2) Existence of a threshold, which determines when a sufficient amount of evidence has been reached that validates a response.

a) As a corollary, EEG responses for early and late reaction times should have similar height.

b) As suggested by a few models, the threshold could depend on time, indicating either an increase in certainty or relating to the task design (e.g. an expectation of a trial end)

c) The size of the EEG response at decision time (or just before) should not depend on the rate of evidence.

d) False alarms should have low/lower EEG responses than signal conditions, indicating that stochastically a decision was made before reaching the actual threshold (or a fluctuating threshold that is lower than the typical threshold)

3) Other properties would be generalizing over modalities, independence of response type, etc., which were not tested here.

4) In relation to previous reports on EEG potentials related to evidence integration, one would expect a centro-parietal location with a positive polarity, known as CPP, as first reported by O'Connell et al., 2012.
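Criteria 1, 1a and 2a above can be illustrated with a minimal accumulator sketch: a decision variable integrates a constant evidence rate until it crosses a fixed threshold. The function name, the evidence rates and the threshold are hypothetical choices for illustration only; this is not either of the models proposed in the manuscript.

```python
def accumulate_to_threshold(evidence_rate, threshold, dt=0.01, max_time=5.0):
    """Integrate a constant evidence rate until the threshold is crossed.

    Returns (decision_time, trace); decision_time is None if the threshold
    is not reached within max_time.
    """
    decision_variable = 0.0
    trace = []
    n_steps = int(round(max_time / dt))
    for step in range(1, n_steps + 1):
        decision_variable += evidence_rate * dt  # integral of constant evidence
        trace.append(decision_variable)
        if decision_variable >= threshold:
            return step * dt, trace
    return None, trace


# Hypothetical evidence rates in arbitrary units (not fitted to any data).
weak_time, weak_trace = accumulate_to_threshold(evidence_rate=0.5, threshold=1.0)
strong_time, strong_trace = accumulate_to_threshold(evidence_rate=1.4, threshold=1.0)
print(weak_time, strong_time)
```

With constant evidence the trace is a linear ramp whose slope equals the evidence rate (criteria 1 and 1a), and both traces end at approximately the same height, the threshold, irrespective of how long they took to get there (criterion 2a).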

Based on the new dataset we can confirm essentially all of these points, with details provided below. Starting with point 4, the CPP was defined as the average potential from electrodes 14, 27 and 28 (EasyCap, 61 Channel, Equidistant layout, where for example electrode 14 has coordinates Theta = 45, Phi = 90) based on the peak region of the potential (see Figure 5F). The potential's scalp location is hence consistent with previous reports (e.g. Twomey et al. 2015), although still slightly more occipital than some reports (O'Connell et al. 2012). These residual differences could be related to the complexity of the auditory stimulus, where a more complex stimulus (as presently used) may lead to a slightly more posterior signal from higher auditory cortex, which mixes with the generator of the CPP, or to some modality-specific changes in the localization of the generator (Dreo et al., 2016).

Considering point 1a above, we find that the slopes depend significantly on the change size, which in the present experiment can be equated with the instantaneous rate of evidence. Hence, the decision variable should increase linearly as a function of time, with increasing slopes for higher change sizes. We indeed find such a highly significant dependence (Figure 5H).
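One simple way to quantify such a slope is an ordinary least-squares fit of the potential against time within the post-change window, computed per change size. The sketch below uses synthetic, noise-free traces; the function name `fit_slope` and all values are assumptions for illustration and do not reproduce the analysis behind Figure 5H.

```python
def fit_slope(times, values):
    """Ordinary least-squares slope of values regressed on times."""
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("times and values must have equal length >= 2")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


times = [0.1 * i for i in range(6)]  # seconds after the change (hypothetical)
# Hypothetical mean potentials (arbitrary units) for two change sizes in percent.
potentials = {
    50: [0.2 + 0.5 * t for t in times],
    140: [0.2 + 1.4 * t for t in times],
}
slopes = {size: fit_slope(times, trace) for size, trace in potentials.items()}
print(slopes)
```

The fitted slope is insensitive to a constant offset in the trace, so a larger slope for the larger change size reflects a faster buildup rather than a baseline difference.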

Considering point 2a, we find the potentials for early and late reaction times to not be significantly different (see Figure 6 A and B).

Considering point 2b, we find the threshold to decrease as a function of change time, with the most significant decrease towards the end of the trial (Figure 5M). This would be consistent with subjects expecting the end of the maximal length trial (which is not completely avoidable, even in the present case of an exponential distribution of change times), and adjusting their response threshold as they approach the expected end. However, we did not find an increase in the false alarm rate towards the end of the trial, which should have accompanied such a behavior. On the other hand, converging thresholds have also been suggested in other decision theoretic accounts (Bogacz et al, 2009). The reduction in threshold could, hence, reflect the improved estimate of the statistics of the first texture, thus requiring less evidence to make a decision at the same level of certainty.

Considering point 2c, we in fact find an increase of the CPP height with the change size, similar to what was previously reported (O'Connell et al., 2012). While this may appear to be inconsistent with a fixed threshold, we would like to suggest a possible interpretation: if the CPP potential is read off by, and informs, the decision maker, then in the intervening time before the response is executed the CPP could continue to rise and thus reach different final levels for different rates of instantaneous evidence. Consistent with this idea, we find the CPP to not depend significantly on the change size in the window of 200-100 ms before the response time (2-way ANOVA, as in Figure 5 lower panels, only with a shifted time window).

Considering point 2d, we find that the CPP of false alarms is significantly smaller than the CPP for hits (p=1e-9, 1-way ANOVA as a function of change size), mirroring a result by O'Connell et al., 2012.

Together these results seem to indicate that the significance of the CPP as indicator for an evidence integrating, decision related process extends to the domain of complex broadband acoustic stimuli, such as acoustic textures.

Next, we would like to discuss the implications in the context of auditory texture processing, as well as the relation to our models. Different views exist on whether an auditory texture is mostly recognized based on its statistics (as has recently been proposed by McDermott and colleagues) or whether it is individual, characteristic elements that are indicative of the texture type. Within the limits of our paradigm, the identification of the CPP signal and its dependence on change size supports the hypothesis that evidence is (statistically) integrated, rather than instantaneously recognized.

The properties of the CPP, especially its relation to change size, are reflected in both models we proposed. The slope of the decision variables is in both cases increasing with the instantaneous evidence (this follows directly from the defining equations), and the height of the potential will depend on the combination of the instantaneous rate of evidence, i.e. on the change size. In both models a transient overshoot of the variable above the threshold would be observed, if the trials are terminated by the moment of decision execution, rather than internal evidence read-out, hence, the potential size would also depend on the change size.

Finally, the converging threshold used in the cortical model is reminiscent of the reduction in CPP size as a function of change time. An alternative interpretation of the reduction in threshold could, however, be that the subjects roughly estimate the trial end, which could lead to a lowering of their threshold criterion. The constant FA rate (Figure 2D), however, argues against this interpretation. We therefore favor the interpretation that the reduction in threshold is a consequence of an improved estimate of the statistics of the first texture, thus requiring less evidence to detect a deviation in statistics at the same level of certainty.

Reference:

Dreo, J., Attia, D., Pirtošek, Z. and Repovš, G. (2016), The P3 cognitive ERP has at least some sensory modality-specific generators: Evidence from high-resolution EEG. Psychophysiol. doi:10.1111/psyp.12800

I feel my hands are tied on highlighting this issue as a preclusive one when I consider the simple fact that if an ERP study with as few as 6 subjects, equating an occipital positivity with a previously characterized centro-parietal one, were to be submitted to a low-ranking but specialized journal like Psychophysiology, it would likely be dismissed out of hand.

We hope to have addressed these issues fully now.

The obvious solution to all of the above would be to collect EEG data from a more reasonable number of subjects (>15) on a task with identical parameters to that run in the psychophysical sessions. This would readily allow independent tests of the assertion that the positivity is an evidence accumulator: 1) The peaking at RT regardless of whether RT is fast or slow I mentioned before, and 2) The other major identifying property is the dependence of buildup rate on evidence strength which could be tested across the four different change sizes 50-140%. All the authors would have to do to deal with the broader range of change times is to sort trials into change-time bins.

As detailed above, we have confirmed both properties mentioned (1 & 2) on the new matched dataset (n=18), with the change-times divided into 4 bins.

In addition to these independent tests, more reliable analyses could be conducted to address many other unanswered questions in the paper about what the positive buildup signal is actually doing in the task, e.g., what temporal profile does the signal have during the baseline pre-change part of the stimulus – does it steadily build up reflecting the accumulation of information used to construct the baseline statistical distribution, or does the signal step up discretely as suggested by the current waveforms, and if so, why? And what is the signal actually reflecting with respect to the models, e.g. the sum of the two accumulators working on different timescales? Are the apparent buildup rate effects seen already in the pre-change baseline for later changes statistically reliable, and if so what could this mean?

We agree that we have so far not placed a significant emphasis on the build-up signal in the pre-change period (depicted in Figure 5E (old) before). The presence of this building-potential after stimulus onset is indeed interesting, although again its interpretation is not trivial: this potential is present for both the delayed response and the immediate response paradigms, however, only positive and dominant at the occipital electrodes (in both paradigms!). Looking at the new EEG data at the CPP electrodes (14,27,28) there is barely any deflection visible (Figure 5B).

Hence, while it would be tempting to speculate about the meaning of the potential w.r.t. evidence integration, this would be better informed with a matched fMRI study, that can pinpoint the location of an integration/sensory memory (as e.g. in Linke et al., 2011). Along the lines of the cited study, it could also be interpreted as a persistent suppression in activity in sensory cortex, which is reflected in a dipole increase at another location, here the occipital electrodes.

Author response image 2
Recreated Figure 5 for the delayed paradigm with a larger number of subjects (n=13), demonstrating that the topography of the potential is unchanged, as is the dependence of slope on change time (which we, however, now interpret as a combination of change time and response time).
https://doi.org/10.7554/eLife.24910.021

Further comments

The visual topography of the positivity: The authors measure the positive signal from electrodes Pz to Oz on the posterior midline. Pz is the most inferior the CPP has been measured thus far (Twomey et al. 2015, cited by the authors), and this is included in the authors' cluster, but where is the signal actually biggest? From the topography it looks like it could be Oz, and this strangely suggests a visual cortical origin. It now becomes important to know what the subjects actually looked at during the task, which is not detailed anywhere – did they fixate on a cross and nothing else? Were there stimuli on the screen that were presented during the trial or after, and could be processed or anticipated by the subject? One alternative possibility to a decision buildup account is that this positivity reflects suppression of visual cortex upon detection of the change, which would happen earlier on late-change trials where the change is detected quickest.

If the occipital positivity is in fact the same signal as the CPP, then the reason for its dramatic inferior shift should be figured out. A likely reason is that 6 subjects simply don't provide a reliable enough measurement, and one or two outlying subjects are dominating.

The screen remained static during the entire duration of sound presentation, i.e. there was no visual cue presented close to the time when the change occurred (this information has been added in more detail to the Methods section of the manuscript, thanks for pointing this out). With visual stimulation excluded, there are two main alternatives left: either (1) mixing with a motor signal, or (2) generation of a different signal.

1) Button presses have been shown to affect the P300 topography (e.g. Salisbury et al., 2001), however, a comparison with their results is difficult, since their P300 location was again different to start with. Their results, however, indicate that the location of a right-hand button press component is on the left side, which aligns with a slight leftward shift that we observe in our immediate response CPP potential (new Figure 5F, aligned to response).

2) Regarding your hypothesis that the positivity reflects suppression of visual cortex: Given the consistent location of the change potential in the CPP location for the immediate responses, we think that – together with its properties – there is little doubt about the nature of the signal, at least in the immediate response case (see additional arguments above following your point 4). We think that in the delayed case, interpreting the shift to the more posterior location as a suppression of visual cortex is not likely, given that we have previously demonstrated that the generator of the signal localizes to parietal cortex. We agree here with the reviewer that scalp topography cannot be simply translated to anatomical origin of brain activity.

As indicated above, the resolution of the question is beyond the scope of this article, and the inclusion of the immediate response data has provided the necessary evidence to conclude that the observed CPP potential is consistent with a decision variable that is integrating sensory evidence. The resolution of the question of the different topographies will be separately investigated subsequently.

But it could also be a signal issue – maybe the polynomial detrending has removed the actual decision signal from centro-parietal sites? "nt_detrend" probably has been used exclusively for datasets with relatively fast responses, and not for tasks where the potentials are very slow-moving like the current one – what happens to the signals when this step is removed, or replaced by a first-order (linear) detrend? It is possible in principle that strong auditory-evoked negative activity more frontally causes the positive focus to shift back on the scalp, but this seems an unlikely explanation given that the CPP/P300 have been measured with the same centro-parietal topographies for many auditory tasks including ones containing continuous audio streams (O'Connell et al., 2012).

We have performed two tests to assess the potential influence of nt_detrend on the recovery of slow potentials.

1) Reanalysis of the delayed response data with a classical high-pass filter (Matlab: filtfilt, 0.1 Hz, 15th order, 50 dB attenuation in the stop band) on the new dataset. The results were quite similar although obviously not identical (Author response image 3).

2) Analysis of the immediate response EEG data with a classical high-pass filter (same setting as above). The results of this analysis are provided below, and show very similar results as for the filtering with nt_detrend. Especially, slow potentials are not removed to a larger degree than for high-pass filtering (see below, Figure 5—figure supplement 2).
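The concern behind these tests can be illustrated with a plain first-order detrend, as the reviewer suggested trying. The sketch below is neither nt_detrend nor the high-pass filter used in the two tests above; the function name and the synthetic signals are assumptions. It only shows why detrending choices matter for slow potentials: a ramp spanning the whole epoch is removed entirely, whereas a buildup confined to the later part of the epoch survives in attenuated and distorted form.

```python
def linear_detrend(values):
    """Subtract the least-squares straight line fitted over the whole epoch."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return [y - (mean_y + slope * (x - mean_x)) for x, y in enumerate(values)]


# Synthetic epochs in arbitrary units (hypothetical, 100 samples each).
full_ramp = [0.01 * i for i in range(100)]                 # drift over the whole epoch
late_ramp = [0.0] * 50 + [0.02 * i for i in range(50)]     # buildup in the second half

full_residual = linear_detrend(full_ramp)
late_residual = linear_detrend(late_ramp)
late_range = max(late_residual) - min(late_residual)
print(max(abs(r) for r in full_residual), late_range)
```

Comparing the outputs of two different detrending or filtering schemes on the same data, as done in the two tests above, is the more direct check; this sketch only motivates why that comparison is informative.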

Author response image 3
Recreation of Figure 5 for the delayed paradigm with a classical high-pass filter, same caption (compare to Author response image 2).
https://doi.org/10.7554/eLife.24910.022

Abstract: "in parietal cortex" should definitely be replaced with "over parieto-occipital scalp". This is EEG.

We changed this description.

[Editors’ note: the author responses to the re-review follow.]

Reviewer #1:

In this paper, Boubenec and colleagues demonstrate compelling psychophysical and electrophysiological correlates of statistical estimation mechanisms underlying naturalistic acoustic change detection, and provide both an abstract and physiologically-based computational model to explain their data. This is a revision of a previous submission, based on new EEG data recorded during the exact same task as was physiologically characterized and modeled, and the results are now more definitive and highly interesting, including stronger evidence for accumulation (steeper buildup for stronger changes) and additional effects on peak amplitude of the neural index of accumulation. The concepts and hypotheses are now also laid out very clearly, with nice sections explaining predictions based on statistical estimation strategies. There are just a couple of substantive points that are important to clear up at this stage.

There are a couple of important checks to perform on the CPP amplitudes at response to verify the effects on this measure. First, the topography of the effect of change size on amplitude should be shown, around the time of response, e.g. the largest change size minus smallest – does it have a similar CPP topography? This would go some way to verifying that the amplitude effects are not caused by an overlapping, separate process (e.g. slow fronto-central negativity).

We have performed this test by taking the difference between the 140% (largest) and 50% (smallest) conditions. The topography is now shown as an inset in Figure 5F. The difference topography is naturally smaller in amplitude and noisier (the noise is mostly introduced by the 50% condition, which has fewer hits), but otherwise shows the same shape and peaks in similar locations. The color scale is the same as in the main panel of Figure 5F, although it could be expanded for clarity.

Also, since the baseline for the response-locked waveforms was 1000-1200ms prior to response, conditions with longer RTs are baselined relative to intervals further out into the stimulus; this is a potential confound because it could be that for lower change sizes, the hit trials are those in which the CPP waveform gets a head-start by accumulating noise or a false change, and so it has less of a distance to go to threshold when the actual change happens. Or given how long some RTs are for weaker changes, the "head-start" could be from genuine physical evidence! This would result in an amplitude decrease at response when using the current baselining regime, even in the absence of any change in bound. Therefore a pre-stimulus or a pre-change baseline should instead be used for the response-locked waveforms. It is important to see whether this changes the results. I would be fairly confident that the main effects hold up – the slope effects certainly will because baselining doesn't matter in that case. Nevertheless, these controls are important to rule out critical potential confounds.

We agree that this choice of baseline is more sensible in general (for the Response plots). We have changed the analysis accordingly and replotted the results in Figure 5 (this applies to panels B2, E2, and all panels from G onwards). No qualitative changes were observed, although the p-values and averages changed slightly. Specifically, the significance of the amplitude effect at response decreased, but it still appears quite solid. The baselining description in the Methods has been adapted accordingly, as follows:

“The epochs were baseline-corrected relative to a 150 ms interval prior to onset, and a 200 ms interval before change time for alignments to both change and response time”.
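The baselining step can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code: the function name `baseline_correct`, the 100 Hz sampling rate and the synthetic epoch are hypothetical, and time zero is taken to be the change time, so that the 200 ms pre-change window is (-0.2 s, 0 s).

```python
def baseline_correct(epoch, times_s, window_s):
    """Subtract the mean voltage in the baseline window from the whole epoch.

    epoch    : voltage samples for one channel and one trial
    times_s  : time of each sample relative to the alignment event, in seconds
    window_s : (start, stop) of the baseline interval, stop exclusive
    """
    start, stop = window_s
    baseline = [v for v, t in zip(epoch, times_s) if start <= t < stop]
    if not baseline:
        raise ValueError("baseline window contains no samples")
    offset = sum(baseline) / len(baseline)
    return [v - offset for v in epoch]


# Hypothetical epoch at 100 Hz: a flat 2 uV offset before the change (t < 0),
# followed by a linear ramp of 5 uV/s after the change.
fs = 100
times = [i / fs for i in range(-30, 50)]            # -0.30 s to 0.49 s
epoch = [2.0 + (0.0 if t < 0 else 5.0 * t) for t in times]

# Baseline on the 200 ms before change time, for both alignments.
corrected = baseline_correct(epoch, times, (-0.2, 0.0))
```

Because the same pre-change window is subtracted regardless of the reaction time, a response-locked waveform built from `corrected` no longer depends on how far into the stimulus the response occurred.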

Possible explanations for the decrease in amplitude of CPP across change times tend to be provided briefly in a scattered fashion. It is important to reiterate the lack of effect of RT here – if bound is collapsing over time, surely the CPP amplitude would decrease with RT as well as with change time?

An excellent point to check! The results shown in Figure 6B show a tiny, non-significant decrease in the predicted direction (0.1μV). We checked each change time bin separately, but the effect was not significant in any bin. However, it is worth considering the expected effect size: the overall decrease in threshold suggested by Figure 5M would be 4.3-3.0μV = 1.3μV over a period of 4s. The average difference in RT between early and late responses (as analyzed in Figure 6) would be ~0.5s (estimated from the data in Figure 3C). Hence the expected effect size over this time difference would be 1.3/(4/0.5) = ~0.16μV. The observed difference across conditions in Figure 6B is 0.1μV. We therefore conclude that, as far as we can tell, the data are qualitatively consistent with this hypothesis, or at least not inconsistent with it. However, the expected effect size is small, so substantially more data would be needed to perform this control. This is now stated in the manuscript as follows:
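The effect-size estimate above can be reproduced with a short calculation. The sketch assumes, as the estimate itself does, that the threshold falls linearly over the 4 s period; the function name `expected_cpp_drop` is hypothetical, and the input values are the ones quoted from Figure 5M and Figure 3C.

```python
def expected_cpp_drop(start_uv, end_uv, span_s, rt_diff_s):
    """Expected CPP height difference for a given RT difference,
    assuming the threshold decreases linearly from start_uv to end_uv
    over span_s seconds."""
    total_drop_uv = start_uv - end_uv            # drop over the whole period
    return total_drop_uv * (rt_diff_s / span_s)  # share covered by the RT gap


# Values quoted in the text: 4.3 uV to 3.0 uV over 4 s, RT difference ~0.5 s.
expected = expected_cpp_drop(4.3, 3.0, span_s=4.0, rt_diff_s=0.5)
print(round(expected, 2))  # about 0.16 uV
```

An expected difference of about 0.16μV is of the same order as the observed 0.1μV, which is why the data cannot discriminate between a collapsing and a constant bound on the basis of reaction time alone.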

“Although the time-dependence of CPP height could result in a decrease of CPP height for late versus early reaction times, we did not find any significant decrease in CPP height for late reaction times, which may be due to a rather small effect-size (Figure 6B).”

Further, as the authors point out themselves, if bound were collapsing over time, false alarm rate should increase – it seems that this point should be weaved into the discussion of this effect more than it is. In general, since several possible explanations are given for this effect, it might be better not to scatter them in various parts of the Results and Discussion, but rather go through them all in one coherent section in the discussion?

Around paragraph three of subsection “EEG recordings and the site of decision-making” the discussion is a bit muddled – it appears to both argue for and against a decreasing bound in the same breath. This could use clarifying.

We have reduced the interpretation in the Results to a minimum and gathered all of the considerations surrounding the reduction in height over time in the Discussion (arguably a better place for them, although the legibility of the Results may have benefitted from having them there). We would like to mention that a decreasing threshold (applied to the neural response, rather than the stimulus) is not necessarily inconsistent with a flat false alarm rate. If the response potential for the same deviation in the stimulus also decreases in size over time, these two effects could balance each other. Such a decrease in response potential could be predicted on the basis that certain stimulus changes may appear unexpected at the beginning, but become expected after the statistics have been sampled for some time.

Finally on this point, have the authors considered other, simpler mechanisms for preventing the subject from clicking when the stimulus initially onsets? The work of Tobias Teichert (see e.g. review in 2014) and others have demonstrated that subjects may be in control of the timing of onset of decision processes (specifically evidence accumulation) – given there's a "dead zone" apparent in Figure 1C showing that subjects are not willing to respond until a certain time has elapsed, couldn't they be simply deferring the kick-off of their decision mechanisms? This is easily parameterized in a fixed delay, and would avoid the need to introduce dramatic time-dependence in the fixed mechanisms (e.g. lengthening of the slow estimator's time constant) to explain that aspect. The authors should at least discuss this possibility.

We agree that this is an alternative for modeling the strategy at the beginning of the trials. The current choice of a time-constant for the divergence of the integration properties can be considered a soft variant of your suggestion of a fixed dead-time. Unless we make the subsequent transition to regular integration instantaneous, we would, however, have to introduce another parameter (i.e. the time-constant would be replaced by the dead-time and another, faster time-constant). We think that this could be a good method to model specific 'shapes of reluctance to respond' at the beginning. However, differentiating between them would require a separate experiment, for example one in which the response window is changed across blocks. We now discuss these options in the Discussion and propose a future experiment to differentiate between them.

Figure 5A-F: I think it might make more sense to have each signal (medial, CPP) in the rows, and the response-locked waveforms to the right of the stimulus locked ones, as is standard across neurophysiological decision research. The labels should also be consistent in nature – perhaps they should be "fronto-central" and "parietal?" – "medial" is not a good descriptor since both signals are medial.

We had previously chosen this alignment because it matched the onset component (top row) with the corresponding topoplot. But we also acknowledge the more logical arrangement of time going from left to right. The plots have been rearranged accordingly. Also, 'CPP' has been renamed to 'parietal' (in the figure only; it is kept as CPP in the rest of the manuscript), and 'medial' to 'central'. Since the peak of the onset potential is taken at a central electrode, we considered it slightly misleading to include 'fronto' in the name.

Missing from Figure 5 main figure are the change-onset locked waveforms. Supplement 3 shows the auditory electrodes, but really both signals should be shown time locked to change onset for a complete view over the dynamics. The x-axes of the current figures can easily be cut to zoom in on the main dynamics (e.g. the current 500-ms baseline shown before stimulus onset need only be 100 ms), so there should be room to squeeze them in between. This is important given that in places there are references to what happens to the CPP in response to the change, not just prior to response.

Good point. We have accordingly inserted the change-time locked potential between the Onset and Response plots, and zoomed in on the onset plot. The axis widths were scaled correspondingly to preserve the temporal relation between the plots. Accordingly, Figure Supplement 3 has been removed. The parietal change-locked potentials exhibit similar dependencies on the stimulus parameters; however, the wide distribution of response times, together with the response potentials, leads to a more widely spread potential shape than in our previous dataset. References to the change-time related plots have been inserted, and all labels have been changed in the text and captions.

https://doi.org/10.7554/eLife.24910.024

Article and author information

Author details

  1. Yves Boubenec

    1. Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, Paris, France
    2. Département d'études cognitives, École normale supérieure, PSL Research University, Paris, France
    Contribution
    YB, Conceptualization, Data curation, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    Jennifer Lawlor
    For correspondence
    boubenec@ens.fr
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-0106-6947
  2. Jennifer Lawlor

    1. Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, Paris, France
    2. Département d'études cognitives, École normale supérieure, PSL Research University, Paris, France
    Contribution
    JL, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology
    Contributed equally with
    Yves Boubenec
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-6116-3001
  3. Urszula Górska

    1. Department of Neurophysiology, Donders Centre for Neuroscience, Radboud Universiteit, Nijmegen, Netherlands
    2. Psychophysiology Laboratory, Institute of Psychology, Jagiellonian University, Krakow, Poland
    3. Smoluchowski Institute of Physics, Jagiellonian University, Krakow, Poland
    Contribution
    UG, Data curation, Validation
    Competing interests
    The authors declare that no competing interests exist.
  4. Shihab Shamma

    1. Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, Paris, France
    2. Département d'études cognitives, École normale supérieure, PSL Research University, Paris, France
    3. Department of Electrical and Computer Engineering, University of Maryland, College Park, United States
    4. Institute for Systems Research, University of Maryland, College Park, United States
    Contribution
    SS, Funding acquisition, Validation, Writing—review and editing
    Competing interests
    The authors declare that no competing interests exist.
  5. Bernhard Englitz

    1. Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, Paris, France
    2. Département d'études cognitives, École normale supérieure, PSL Research University, Paris, France
    3. Department of Neurophysiology, Donders Centre for Neuroscience, Radboud Universiteit, Nijmegen, Netherlands
    Contribution
    BE, Conceptualization, Data curation, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Competing interests
    The authors declare that no competing interests exist.

Funding

Agence Nationale de la Recherche (ANR-10-LABX-0087 IEC)

  • Yves Boubenec
  • Jennifer Lawlor
  • Shihab Shamma
  • Bernhard Englitz

Agence Nationale de la Recherche (ANR-10-IDEX-0001-02 PSL*)

  • Yves Boubenec
  • Jennifer Lawlor
  • Shihab Shamma
  • Bernhard Englitz

Advanced European Research Council (ERC 295603)

  • Yves Boubenec
  • Jennifer Lawlor
  • Shihab Shamma
  • Bernhard Englitz

European Commission's Marie Curie grant (660328)

  • Shihab Shamma

Army Research Office (63113-LS)

  • Shihab Shamma

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank the Equipe Audition, the Donders Center for Cognitive Neuroscience, and the NSF-funded Neuromorphic Cognition Engineering Workshop in Telluride, CO, USA, for allowing us to use sound booths and EEG equipment, and for helpful discussions. Funding was provided through the ERC ADAM project. ANR-10-LABX-0087 IEC and ANR-10-IDEX-0001-02 PSL* supported the research unit. SS was also supported by an ARO grant 63113-LS. BE was supported by a European Commission's Marie Curie Grant (660328).

Ethics

Human subjects: All experiments were performed in accordance with the guidelines of the Helsinki Declaration. The Ethics Committees for Health Sciences at Université Paris Descartes approved the experimental procedures.

Reviewing Editor

  1. Timothy EJ Behrens, Reviewing Editor, University College London, United Kingdom

Publication history

  1. Received: January 5, 2017
  2. Accepted: March 4, 2017
  3. Accepted Manuscript published: March 6, 2017 (version 1)
  4. Accepted Manuscript updated: March 8, 2017 (version 2)
  5. Version of Record published: March 27, 2017 (version 3)
  6. Version of Record updated: March 29, 2017 (version 4)
  7. Version of Record updated: August 23, 2017 (version 5)

Copyright

© 2017, Boubenec et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 957
    Page views
  • 276
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Scopus, Crossref.
