1. Neuroscience
Download icon

Rapid computations of spectrotemporal prediction error support perception of degraded speech

  1. Ediz Sohoglu  Is a corresponding author
  2. Matthew H Davis
  1. School of Psychology, University of Sussex, United Kingdom
  2. MRC Cognition and Brain Sciences Unit, United Kingdom
Research Article
  • Cited 2
  • Views 1,036
  • Annotations
Cite this article as: eLife 2020;9:e58077 doi: 10.7554/eLife.58077

Abstract

Human speech perception can be described as Bayesian perceptual inference but how are these Bayesian computations instantiated neurally? We used magnetoencephalographic recordings of brain responses to degraded spoken words and experimentally manipulated signal quality and prior knowledge. We first demonstrate that spectrotemporal modulations in speech are more strongly represented in neural responses than alternative speech representations (e.g. spectrogram or articulatory features). Critically, we found an interaction between speech signal quality and expectations from prior written text on the quality of neural representations; increased signal quality enhanced neural representations of speech that mismatched with prior expectations, but led to greater suppression of speech that matched prior expectations. This interaction is a unique neural signature of prediction error computations and is apparent in neural responses within 100 ms of speech input. Our findings contribute to the detailed specification of a computational model of speech perception based on predictive coding frameworks.

Introduction

Although we understand spoken language rapidly and automatically, speech is an inherently ambiguous acoustic signal, compatible with multiple interpretations. Such ambiguities are evident even for clearly spoken speech: A /t/ consonant will sometimes be confused with /p/, as these are both unvoiced stops with similar acoustic characteristics (Warner et al., 2014). In real-world environments, where the acoustic signal is degraded or heard in the presence of noise or competing speakers, additional uncertainty arises and speech comprehension is further challenged (Mattys et al., 2012; Peelle, 2018).

Given the uncertainty of the speech signal, listeners must exploit prior knowledge or expectations to constrain perception. For example, ambiguities in perceiving individual speech sounds are more readily resolved if those sounds are heard in the context of a word (Ganong, 1980; Rogers and Davis, 2017). Following probability theory, the optimal strategy for combining prior knowledge with sensory signals is by applying Bayes theorem to compute the posterior probabilities of different interpretations of the input. Indeed, it has been suggested that spoken word recognition is fundamentally a process of Bayesian inference (Norris and McQueen, 2008). This work aims to establish how these Bayesian computations might be instantiated neurally.

There are at least two representational schemes by which the brain could implement Bayesian inference (depicted in Figure 1C; see Aitchison and Lengyel, 2017). One possibility is that neural representations of sensory signals are enhanced or ‘sharpened’ by prior knowledge (Murray et al., 2004; Friston, 2005; Blank and Davis, 2016; de Lange et al., 2018). Under a sharpening scheme, neural responses directly encode posterior probabilities and hence representations of speech sounds are enhanced in the same way as perceptual outcomes are enhanced by prior knowledge (McClelland and Elman, 1986; McClelland et al., 2014). Alternatively, neural representations of expected speech sounds are subtracted from bottom-up signals, such that only the unexpected parts (i.e. ‘prediction error’) are passed up the cortical hierarchy to update higher-level representations (Rao and Ballard, 1999). Under this latter representational scheme, higher-level neural representations come to encode posterior probabilities (as required by Bayesian inference) but this is achieved by an intermediate process in which prediction errors are computed (Aitchison and Lengyel, 2017). In many models, both representational schemes are utilized, in separate neural populations (cf. predictive coding, Rao and Ballard, 1999; Spratling, 2008; Bastos et al., 2012).

Figure 1 with 1 supplement see all
Overview of experimental design and hypotheses.

(A) On each trial, listeners heard and judged the clarity of a degraded spoken word. Listeners’ prior knowledge of speech content was manipulated by presenting matching (‘clay’) or mismatching (‘fast’) text before spoken words presented with varying levels of sensory detail (3/6/12-channel vocoded), panel reproduced from B, Sohoglu and Davis, 2016. (B) Ratings of speech clarity were enhanced not only by increasing sensory detail but also by prior knowledge from matching text (graph reproduced from Figure 2A, Sohoglu and Davis, 2016). Error bars represent the standard error of the mean after removing between-subject variance, suitable for repeated-measures comparisons (Loftus and Masson, 1994). (C) Schematic illustrations of two representational schemes by which prior knowledge and speech input are combined (for details, see Materials and methods section). For illustrative purposes, we depict degraded speech as visually degraded text. Under a sharpening scheme (left panels), neural representations of degraded sensory signals (bottom) are enhanced by matching prior knowledge (top) in the same way as perceptual outcomes are enhanced by prior knowledge. Under a prediction error scheme (right panels), neural representations of expected speech sounds are subtracted from sensory signals. These two schemes make different predictions for experiments that assess the content of neural representations when sensory detail and prior knowledge of speech are manipulated. (D) Theoretical predictions for sharpened signal (left) and prediction error (right) models. In a sharpened signal model, representations of the heard spoken word (e.g. ‘clay’; expressed as the squared correlation with a clear [noise-free] ‘clay’) are most accurately encoded in neural responses when increasing speech sensory detail and matching prior knowledge combine to enhance perception. Conversely, for models that represent prediction error, an interaction between sensory detail and prior knowledge is observed. For speech that mismatches with prior knowledge, increasing sensory detail results in better representation of the heard word ‘clay’ because bottom-up input remains unexplained. Conversely, for speech that matches prior knowledge, increased sensory detail results in worse encoding of ‘clay’ because bottom-up input is explained away. Note that while the overall magnitude of prediction error is always the smallest when expectations match with speech input (see Figure 1—figure supplement 1), the prediction error representation of matching ‘clay’ is enhanced for low-clarity speech and diminished for high-clarity speech. For explanation, see the Discussion section.

© 2016, PNAS. Figure 1A reproduced from Figure 1B, Sohoglu and Davis, 2016. The author(s) reserves the right after publication of the WORK by PNAS, to use all or part of the WORK in compilations or other publications of the author's own works, to use figures and tables created by them and contained in the WORK.

© 2016, PNAS. Figure 1B reproduced from Figure 2A, Sohoglu and Davis, 2016. The author(s) reserves the right after publication of the WORK by PNAS, to use all or part of the WORK in compilations or other publications of the author's own works, to use figures and tables created by them and contained in the WORK.

A range of experimental evidence has been used to distinguish sharpened and prediction error representations. An often observed finding is that matching prior knowledge reduces the amplitude of evoked neural responses (Ulanovsky et al., 2003; Grill-Spector et al., 2006; Rabovsky et al., 2018). This observation is commonly attributed to prediction errors since expected stimuli are relatively predictable and therefore should evoke reduced prediction error. However, reduced activity is equally compatible with sharpened responses because, under this representational scheme, neuronal activity encoding competing features (i.e. ‘noise’) is suppressed (see Figure 1C; Murray et al., 2004; Friston, 2005; Blank and Davis, 2016; Aitchison and Lengyel, 2017; de Lange et al., 2018).

One way to adjudicate between representations is by manipulating signal quality alongside prior knowledge and measuring the consequences for the pattern (rather than only the mean) of neural responses. Computational simulations reported by Blank and Davis, 2016 demonstrate a unique hallmark of prediction errors which is that neural representations of sensory stimuli show an interaction between signal quality and prior knowledge (see Figure 1D). This interaction arises because sensory signals that match strong prior expectations are explained away more effectively as signal quality increases and hence neural representations are suppressed even as perceptual outcomes improve. Whereas for sensory signals that follow uninformative prior expectations, increased signal quality leads to a corresponding increase in sensory information that remains unexplained (see Figure 1C). This pattern – opposite effects of signal quality on neural representations depending on whether prior knowledge is informative or uninformative – highlights an important implication of neural activity that represents prediction errors. Rather than directly signaling perceptual outcomes, these neural signals in the sensory cortex serve the intermediary function of updating higher-level (phonological, lexical, or semantic) representations to generate a perceptual interpretation (the posterior, in Bayesian terms) from a prediction (prior). It is these updated perceptual interpretations, and not prediction errors, that should correlate most closely with perceived clarity (Sohoglu and Davis, 2016).

By contrast, in computational simulations implementing a sharpening scheme, representational patterns are similarly enhanced by increased signal quality and matching prior knowledge (see Figure 1D; Murray et al., 2004; Friston, 2005; Blank and Davis, 2016; Aitchison and Lengyel, 2017; de Lange et al., 2018). Using prior written text to manipulate listeners’ prior knowledge of degraded spoken words, Blank and Davis, 2016 showed that multivoxel representations of speech in the superior temporal gyrus (as measured by fMRI) showed an interaction between signal quality and prior expectations, consistent with prediction error computations. This is despite the observation that the mean multivoxel responses to speech were always reduced by matching expectations, no matter the level of sensory detail.

Although the study of Blank and Davis, 2016 provides evidence in support of prediction errors, key questions remain. First, it remains unclear at which levels of representation prediction errors are computed. Predictive coding models that utilize prediction errors are hierarchically organized such that predictions are signaled by top-down connections and prediction errors by bottom-up connections. Therefore, in principle, prediction errors will be computed at multiple levels of representation. Previous studies using similar paradigms have suggested either a higher-level phonetic (Di Liberto et al., 2018a) or a lower-level acoustic locus of prediction error representations (Holdgraf et al., 2016). Second, due to the sluggishness of the BOLD signal, the timecourse over which prediction errors are computed is unknown. Therefore, it is unclear whether prediction errors are formed only at late latencies following re-entrant feedback or more rapidly during the initial feedforward sweep of cortical processing (Sohoglu and Davis, 2016; Kok et al., 2017; de Lange et al., 2018; Di Liberto et al., 2018a).

In this study, we reanalyzed MEG recordings of neural activity from a previous experiment in which we simultaneously manipulated signal quality and listeners’ prior knowledge during speech perception (Sohoglu and Davis, 2016). Listeners heard degraded (noise-vocoded) spoken words with varying amounts of sensory detail and hence at different levels of signal quality. Before each spoken word, listeners’ read matching or mismatching text and therefore had accurate or inaccurate prior knowledge of upcoming speech content (Figure 1A). Our previously reported analyses focused on the mean amplitude of evoked responses, which as explained above, cannot adjudicate between sharpened and prediction error representations. We, therefore, used linear regression to test which of several candidate speech features are encoded in MEG responses (Ding and Simon, 2012; Pasley et al., 2012; Crosse et al., 2016; Holdgraf et al., 2017) and further asked how those feature representations are modulated by signal quality and prior knowledge. Following Blank and Davis, 2016, the two-way interaction between sensory detail and prior knowledge is diagnostic of prediction errors (see Figure 1C and D). Because of the temporal resolution of MEG, if we observe such an interaction we can also determine the latency at which it occurs.

Results

Behavior

During the MEG recordings, listeners completed a clarity rating task in which they judged the subjective clarity of each degraded (noise-vocoded) spoken word (Figure 1A and B). Ratings of speech clarity were enhanced both when sensory detail increased (F (2,40) = 295, μp2 = .937, p <.001) and when listeners had prior knowledge from matching written text (F (1,20) = 93.2, μp2 = .823, p <.001). These behavioral results have previously been reported (Sohoglu and Davis, 2016) but we include them here to facilitate interpretation of the present MEG analyses.

Encoding analysis: Stimulus feature space selection

We used ridge regression to predict the MEG data from the stimulus features. A model that accurately predicts the MEG data would indicate that the component features in the model are well represented in neural responses. Four feature spaces were obtained from the original clear versions of the spoken stimuli (i.e. before noise-vocoding; see Figure 2 and Materials and methods for a full description). Constructing the feature spaces from clear speech enabled us to more easily compare model accuracies for speech stimuli presented under different degradation levels: any model accuracy differences as a function of vocoder channels should be attributed to neural encoding rather than inherent acoustic differences in the stimulus. The first two feature spaces (envelope and spectrogram) were simple acoustic representations that captured time-varying sound energy. These were followed by more complex feature spaces with greater abstraction from the sound waveform: one acoustic (spectrotemporal modulations) and one linguistic (phonetic features). In the first stage of our analysis, we examined which of these four feature spaces best predicted the MEG data.

Stimulus feature spaces used to model MEG responses, for the example word ‘cape’.

(A) shown as an audio waveform for the original clear recording (i.e. before vocoding). (B) Envelope: broadband envelope derived from summing the envelopes across all spectral channels of a 24-channel noise-vocoder. (C) Spectrogram: derived from the envelope in each spectral channel of a 24-channel noise-vocoder. (D) Spectrotemporal modulations: derived from the spectral and temporal decomposition of a spectrogram into 25 spectrotemporal modulation channels, illustrated at regular temporal intervals through the spoken word. (E) Phonetic features: derived from segment-based representations of spoken words.

We averaged model accuracies over conditions and over the 20 sensors with the highest model accuracies (computed separately for each feature space, hemisphere, and participant; shown as black bars in Figure 3A). There was a significant main effect of feature space on model accuracies (F (3,60) = 51.5, μp2 = .720, p <.001), which did not interact with hemisphere (F (3,60) = 1.56, μp2 = .072, p = .223). Post-hoc t-tests revealed that the spectrotemporal modulation feature space best predicted the MEG data (p-values shown in Figure 3A). Sensor selections (i.e. those sensors with the highest model accuracies; shown in Figure 3B) were similarly distributed over temporal and frontal sites for all feature spaces, consistent with bilateral neural generators in the superior temporal cortex.

Figure 3 with 1 supplement see all
Encoding model comparison.

(A) Mean model accuracies for the four feature spaces (black bars) and the two feature space combinations (white bars). Error bars represent the standard error of the mean after removing between-subject variance, suitable for repeated-measures comparisons (Loftus and Masson, 1994). Braces indicate the significance of paired t-tests ***p<0.001 (B) Topographic distribution of MEG gradiometer sensors over which model accuracies were averaged. In each hemisphere of each participant, we selected 20 sensors with the highest model accuracies. The topographies show the percentage of participants for which each sensor was selected, consistent with neural sources in the superior temporal cortex.

We additionally tested more complex combinations of feature spaces that previous work (Daube et al., 2019; Di Liberto et al., 2015) has shown to be good models of neural responses (shown as white bars in Figure 3A). We were particularly interested in how the model performance of the spectrotemporal modulation feature space compared with these more complex feature spaces. Despite having lower dimensionality (25 features), the spectrotemporal modulation feature space remained the model that best predicted neural responses, outperforming Spectrogram+Phonetic features (24+13 = 37 features; F (1,20) = 16.5, μp2 = .453, p = .001) and Spectrogram+Spectral derivative (24+24 = 48 features; F (1,20) = 98.2, μp2 = .831, p <.001). No effects involving hemisphere were significant (all p’s >.125).

Note that because different feature spaces tend to be correlated with each other, combining different feature spaces does not result in additive increases in model accuracy (Norman-Haignere and McDermott, 2018; Kriegeskorte and Douglas, 2019). For example, the model performance for Spectrogram+Phonetic features does not equal the sum of the model accuracies for the individual Spectrogram and Phonetic features models (Figure 3A). To more clearly identify the unique contributions of the feature spaces, we tested additional combinations of feature spaces and conducted more systematic comparisons.

We first sought to replicate previous findings by testing whether the addition of Phonetic features or the Spectral derivative could improve model performance beyond the Spectrogram alone (Di Liberto et al., 2015; Daube et al., 2019). Results and p-values are shown in Figure 3—figure supplement 1A, expressed as performance increases over the spectrogram alone. We replicated previous findings in that both the Spectrogram+Phonetic features model and the Spectrogram+Spectral derivative model outperformed the Spectrogram feature space. We also tested whether adding the spectrotemporal modulation model to the spectrogram feature space could improve performance. This resulted in the biggest performance gain, consistent with our earlier findings indicating that spectrotemporal modulations was the feature space that individually predicted MEG responses most accurately.

Next, we tested whether combining the spectrotemporal modulation model with other feature spaces could improve model accuracy (shown in Figure 3—figure supplement 1B). We found that phonetic features, the spectral derivative and spectrogram all resulted in performance gains beyond spectrotemporal modulations alone. Of these three feature spaces, adding phonetic features resulted in the largest increase in model accuracy. This indicates that although spectrotemporal modulations can explain the largest MEG variance of any feature model individually, other feature spaces (phonetic features in particular) are also represented in MEG responses.

Because the performance of a spectrogram feature space can be improved by including a compressive non-linearity (Daube et al., 2019), as found in the auditory periphery, we also repeated the above analysis after raising the spectrogram values to the power of 0.3. While this change increased model accuracies for the spectrogram and spectral derivative feature spaces, the overall pattern of results remained the same.

Acoustic analysis: Stimulus modulation content and effect of vocoding

Our analysis above suggests that a feature space comprised of spectrotemporal modulations is most accurately represented in neural responses. One of the motivations for testing this feature space stems from the noise-vocoding procedure used to degrade our speech stimuli, which removes narrowband spectral modulations while leaving slow temporal modulations intact (Shannon et al., 1995; Roberts et al., 2011). To investigate the acoustic impact of noise-vocoding on our stimuli, we next characterized the spectrotemporal modulations that convey speech content in our stimulus set and how those modulations are affected by noise-vocoding with a wide range of spectral channels, from 1 to 24 channels. As shown in Figure 4A, modulations showed a lowpass profile and were strongest in magnitude for low spectrotemporal frequencies. This is consistent with previous work (Voss and Clarke, 1975; Singh and Theunissen, 2003) demonstrating that modulation power of natural sounds decays with increasing frequency following a 1/f relationship (where f is the frequency). Within this lowpass region, different spectrotemporal modulations relate to distinct types of speech sound (Elliott and Theunissen, 2009). For example, fast temporal and broad spectral modulations reflect transient sounds such as stop consonant release bursts whereas slow temporal and narrowband spectral modulations reflect sounds with a sustained spectral structure such as vowels. More intermediate modulations correspond to formant transitions that cue consonant place and manner of articulation (Liberman et al., 1967).

Acoustic comparison of vocoded speech stimuli at varying levels of sensory detail.

(A) The magnitude of different spectrotemporal modulations for speech vocoded with different numbers of channels and clear speech (for comparison only; clear speech was not presented to listeners in the experiment). (B) Mean between-word Euclidean distance for different spectrotemporal modulations in speech vocoded with different numbers of channels. (C) Paired t-test showing significant differences in spectrotemporal modulation magnitude for comparison of 958 spoken words vocoded with 24 channels versus one channel (p<0.05 FDR corrected for multiple comparisons across spectrotemporal modulations). (D) Mean difference of between-word Euclidean distances for different spectrotemporal modulations for 24 versus 1 channel vocoded speech.

Over items, increasing the number of vocoder channels resulted in significantly higher signal magnitude specific to intermediate spectral and temporal modulations (1–2 cycles per octave and 2–4 Hz, all effects shown are FDR corrected for multiple comparisons across spectrotemporal modulations; see Figure 4C). Thus, while low-frequency modulations dominate the speech signal overall (irrespective of the number of vocoder channels), it is the intermediate spectrotemporal modulations that are most strongly affected by noise-vocoding. These intermediate spectrotemporal modulations are known to support speech intelligibility (Elliott and Theunissen, 2009; Venezia et al., 2016; Flinker et al., 2019), consistent with the strong impact of the number of vocoder channels on word report accuracy (e.g. Shannon et al., 1995; Davis and Johnsrude, 2003; Scott et al., 2006; Obleser et al., 2008) and subjective clarity (e.g. Obleser et al., 2008; Sohoglu et al., 2014). The opposite effect (i.e. decreased signal magnitude with an increasing number of vocoder channels) was observed for broadband spectral modulations (0.5 cycles/octave) across all temporal modulation rates (reflecting the increase in envelope co-modulation when fewer spectral channels are available) and for fast temporal modulations (16 Hz) and narrowband (>2 cycles/octave) spectral modulations (reflecting stochastic fluctuations in the noise carrier; Stone et al., 2008).

We also asked which spectrotemporal modulations were most informative for discriminating between different words. For each spectrotemporal modulation bin, we computed the Euclidean distances between the time series of all the words in our stimulus set and averaged the resulting distances (see Figure 4B). Between-word distances resembled the magnitude-based analysis, with the greatest distance for low spectrotemporal frequencies. Between-word distances also showed a similar effect of increasing the number of vocoder channels (see Figure 4D).

Encoding analysis: Condition effects

Next, we determined between-condition differences in the ability of the spectrotemporal modulation feature space to predict MEG responses. Any such differences would indicate that the neural representation of spectrotemporal modulations is modulated by prior knowledge or speech sensory detail. Model accuracies for different conditions are shown in Figure 5A. There was a significant interaction between prior knowledge and speech sensory detail (F (2,40) = 5.41, μp2 = .213, p = .01), which marginally interacted with hemisphere (F (2,40) = 3.57, μp2 = .151, p = .051). Follow-up tests in the left hemisphere sensors again showed a statistical interaction between sensory detail and prior knowledge (F (2,40) = 7.66, μp2 = .277, p = .002). For speech that Mismatched with prior knowledge, model accuracies increased with increasing speech sensory detail (F (2,40) = 4.49, μp2 = .183, p = .021). In the Matching condition, however, model accuracies decreased with increasing sensory detail (F (2,40) = 3.70, μp2 = .156, p = .037). The statistical interaction between sensory detail and prior knowledge can also be characterized by comparing model accuracies for Matching versus Mismatching conditions at each level of sensory detail. While model accuracies were greater for Matching versus Mismatching conditions for low clarity 3 channel speech (t (20) = 2.201, dz = .480, p = .040), the opposite was true for high clarity 12 channel speech (t (20) = 2.384, dz = .520, p = .027). Model accuracies did not significantly differ between Matching and Mismatching conditions for 6 channel speech with intermediate clarity (t (20) = 1.085, dz = .237, p = .291). This interaction is consistent with a prediction error scheme and inconsistent with sharpened representations (compare Figure 5A with Figure 1D). No significant differences were observed in the right hemisphere.

Figure 5 with 1 supplement see all
Spectrotemporal modulation encoding model results.

(A) Mean model accuracies for lags between 0 and 250 ms as a function of sensory detail (3/6/12 channel vocoded words), and prior knowledge (speech after mismatching/matching text) in the left and right hemisphere sensors. Error bars represent the standard error of the mean after removing between-subject variance, suitable for repeated-measures comparisons (Loftus and Masson, 1994). (B) Root Mean Square (RMS) amplitude across all left hemisphere sensors for the Temporal Response Functions (TRFs) averaged over spectrotemporal modulations, conditions, and participants. The gray box at the bottom of the graph indicates the lags used in computing the model accuracy data in panel A.

We also conducted a control analysis in which we compared the observed model accuracies to empirical null distributions. This allowed us to test whether neural responses encode spectrotemporal modulations even in the Mismatching three channel and Matching 12 channel conditions, that is, when model accuracies were lowest. The null distributions were created by fully permuted each spoken word’s feature representation (spectrotemporal modulation channels and time-bins) and repeating this for each condition and 100 permutations. As can be seen in Figure 5—figure supplement 1A (only the left hemisphere sensors shown), this results in near-zero null distributions. Accordingly, observed model accuracies in all six conditions were significantly above chance as defined by the null distributions (all p’s < 0.01). Hence, even in the Mismatching three channel and Matching 12 channel conditions, neural responses encode spectrotemporal modulations in speech.

To test whether the interaction between prior knowledge and sensory detail in the observed data remains after accounting for the empirical null distributions, we z-scored the observed data with respect to the feature-shuffled distributions (shown in Figure 5—figure supplement 1B). As expected, given the near-zero model accuracies in the null distributions, the z-scored data show essentially the same pattern of results as seen previously in Figure 5A (prior knowledge by sensory detail interaction in the left hemisphere: F (2,40) = 5.92, μp2 = .228, p = .006).

In a further control analysis, we created additional null distributions by randomly permuting the feature representations across trials (i.e. shuffling the words in our stimulus set while keeping the component features of the words intact). The resulting null distributions for the left hemisphere sensors are shown in Figure 5—figure supplement 1C. Once again, observed model accuracies in all six conditions (shown as broken lines in the figure) were significantly above chance as defined by the null distributions (all p’s < 0.01). This confirms that neural responses encode the specific acoustic form of the heard spoken word. To our surprise, however, a prior knowledge by sensory detail interaction is also apparent in the null distributions (see Figure 5—figure supplement 1C). This suggests that a spectrotemporal representation of a generic or randomly chosen word can also predict MEG responses with some degree of accuracy. This is possible because the words used in the experiment have homogeneous acoustic properties, for example, they are all monosyllabic words spoken by the same individual. All words therefore share, to some extent, a common spectrotemporal profile.

As before, we z-scored the observed data with respect to the empirical null distributions (now from word shuffling). As shown in Figure 5—figure supplement 1D, an interaction is still apparent in the z-scored data. Confirming this, repeated measures ANOVA showed a significant interaction between prior knowledge and sensory detail (F (2,40) = 3.41, μp2 = .146, p = .048). Although unlike the original interaction in Figure 5A, this interaction was equally strong in both hemispheres (F (2,40) = .324, μp2 = .016, p = .683). In addition, simple effects of speech sensory detail were not significant (Mismatching: F (2,40) = 1.67, μp2 = .077, p = .202; Matching: F (2,40) = 1.69, μp2 = .078, p = .199) although there were marginal changes from 3 to 6 channels for both Mismatching (F (1,20) = 3.33, μp2 = .143, p = .083) and Matching (F (1,20) = 3.32, μp2 = .142, p = .083) speech.

Taken together, the above control analyses confirm that encoding of an acoustic representation of a heard word – either for the specific word spoken, or from generic acoustic elements shared with other monosyllabic words from the same speaker – shows an interaction between prior knowledge and sensory detail that is more consistent with a prediction error scheme. We will return to this point in the Discussion.

Our encoding analysis integrates past information in the stimulus (i.e. over multiple lags from 0 to 250 ms) to predict the neural response. Thus, the analysis of model accuracies above does not indicate when encoding occurs within this 250 ms period. One way to determine this is to examine the weights of the encoding model linking stimulus variation to neural responses, that is, temporal response functions (TRFs). The TRFs show two peaks at 87.5 and 150 ms (shown in Figure 5B). This indicates that variations in spectrotemporal modulations are linked to the largest changes in MEG responses 87.5 and 150 ms later. These findings are consistent with previous work showing multiple early components in response to ongoing acoustic features in speech that resemble the prominent P1, N1 and P2 components seen when timelocking to speech onset (e.g. Lalor and Foxe, 2010; Ding and Simon, 2012; Di Liberto et al., 2015). Thus, analysis of the TRFs confirms that encoding of spectrotemporal modulations is associated with short-latency neural responses at a relatively early hierarchical stage of speech processing. In a later section (see ‘Decoding analysis’ below), we address when condition differences emerge.

Decoding analysis

To link MEG responses with specific spectrotemporal modulations, we used linear regression for data prediction in the opposite direction: from MEG responses to speech spectrotemporal modulations (i.e. decoding analysis). As shown in Figure 6A, intermediate temporal modulations (2–4 Hz) were best decoded from MEG responses. This observation is consistent with previous neurophysiological (Ahissar et al., 2001; Luo and Poeppel, 2007; Peelle et al., 2013; Ding and Simon, 2014; Di Liberto et al., 2015; Park et al., 2015; Obleser and Kayser, 2019) and fMRI (Santoro et al., 2017) data showing that intermediate temporal modulations are well-represented in auditory cortex. These intermediate temporal modulations are also most impacted by noise-vocoding (see Figure 4C and D) and support speech intelligibility (Elliott and Theunissen, 2009; Venezia et al., 2016).

Figure 6 with 1 supplement see all
Decoding of spectrotemporal modulations from MEG responses to speech.

(A) The grid shows model accuracies for specific spectrotemporal modulations averaged over conditions. The left bar graph depicts model accuracy for each spectral modulation frequency, averaged over temporal modulations. Bottom bar graph depicts model accuracy for each temporal modulation frequency, averaged over spectral modulations. Braces indicate the significance of paired t-tests ***p<0.001 (B) Effect size (model accuracy differences, r) for the interaction contrast: 12–3 channels (Mismatch) – 12–3 channels (Match). The effect size display has been thresholded so as to only show cells in which the sensory detail by prior knowledge interaction is statistically significant at p<0.05 FDR corrected for multiple comparisons across spectrotemporal modulations. (C) Effect size (model accuracy differences, r) for comparisons between 12 and 3 channels, computed separately for Mismatch and Match conditions. Red shows greater model accuracy for 12 channel than for three channel speech (observed for speech that Mismatches with written text). Blue shows lower model accuracy for 12 channel than for three channel speech (observed for speech that matches written text). ch = channels. (D) Timecourse of decoding accuracy (single-lag analysis). Black trace shows model accuracy differences (r) attributable to the interaction contrast 12–3 channels (Mismatch) – 12–3 channels (Match). Red and blue traces show the contrast 12–3 channels separately for Mismatch and Match conditions, respectively. Shading around each trace represents the standard error of the mean. Horizontal bars at the bottom indicate significant lags for each contrast using the same color scheme as the traces (dark sections indicate p<0.05 FDR corrected across lags and light sections indicate p<0.05 uncorrected). Data have been averaged over the spectrotemporal modulations showing a significant interaction in panel B, indicated also as an inset (top-right).

Having previously identified a significant interaction between prior knowledge and sensory detail in our encoding analysis, we next conducted a targeted t-test on decoding accuracies using the interaction contrast: 12–3 channels (Mismatch) – 12–3 channels (Match). This allowed us to identify which spectrotemporal modulations show evidence of the interaction between prior knowledge and sensory detail. As shown in Figure 6B, this interaction was observed at intermediate (2–4 Hz) temporal modulations (FDR corrected across spectrotemporal modulations). Visualization of the difference between 12 and 3 channels for Mismatch and Match conditions separately confirmed that this interaction was of the same form observed previously (i.e. increasing sensory detail led to opposite effects on decoding accuracy when prior knowledge mismatched versus matched with speech). Thus, intermediate temporal modulations are well represented in MEG responses and it is these representations that are affected by our manipulations of sensory and prior knowledge.

Our analysis up to this point does not reveal the timecourse of the neural interaction between sensory detail and prior knowledge. While the weights of the spectrotemporal encoding model implicate early neural responses (shown in Figure 5B), this evidence is indirect as the weights were averaged over conditions. To examine the timecourse of the critical interaction, we conducted a single-lag decoding analysis (O'Sullivan et al., 2015) in which we decoded spectrotemporal modulations from neural responses at each lag separately from −50 to 250 ms relative to speech input. For this analysis, we first averaged over the decoding accuracies in Figure 6B that previously showed the interaction when integrating over lags. As shown in Figure 6D, the interaction contrast 12–3 channels (Mismatch) – 12–3 channels (Match) emerged rapidly, peaking at around 50 ms. The contrast of 12–3 channels separately for Mismatch and Match conditions again confirmed the form of the interaction (i.e. opposite effects of sensory detail on decoding accuracy in Mismatch and Match trials) although these effects were only significant at an uncorrected p<0.05 level. We also directly compared Match and Mismatch trials by computing the contrast Match – Mismatch for 3 channels and 12 channels separately (shown in Figure 6—figure supplement 1). While we did not observe significant differences for three channel speech (even at an uncorrected p<0.05 level), for 12 channel speech decoding accuracy was lower in Match versus Mismatch trials from 25 to 40 ms and then again from 125 to 150 and 175–200 ms (albeit at uncorrected p<0.05 level).

Discussion

Here using linear regression and MEG responses to noise-vocoded words, we report four main findings. First, spectrotemporal modulation content in the acoustic signal is well represented in MEG responses, more so than alternative speech features (envelope, spectrogram, and phonetic features). Second, the information content of speech is well characterized by spectrotemporal modulations. Modulations at intermediate temporal (2–4 Hz) and spectral scales (1–2 cycles/octave) are critical for distinguishing between individual spoken words and these spectrotemporal modulations are affected by noise-vocoding manipulations that impact speech intelligibility. Third, signal quality and prior knowledge have an interactive influence on these modulation-based representations, peaking approximately 50 ms after speech input: when mismatching text precedes degraded speech, neural representations of spectrotemporal modulations are enhanced with increasing sensory detail whereas the opposite effect is observed when text matches speech. Fourth, this interaction is observed specifically for neural representations of intermediate spectrotemporal modulations that convey speech content. This result stands in marked contrast to what would be expected on the basis of a sharpening scheme in which both signal quality and prior knowledge enhance neural representations, just as they enhance perceptual clarity. An interactive influence of signal quality and prior knowledge is, however, fully consistent with prediction error representations in which neurons signal the difference between expected and heard speech (Blank and Davis, 2016).

Cortical responses encode spectrotemporal modulations in speech

Our finding that MEG responses are driven by spectrotemporal modulations in speech adds to a growing body of evidence implicating spectrotemporal representations as an important component of the neural code by which the cortex represents speech and other sounds. Linguistically relevant features such as formant transitions and speaker fundamental frequency are clearly apparent in these modulations and their removal by filtering has marked consequences for speech intelligibility and speaker discrimination (Elliott and Theunissen, 2009; Flinker et al., 2019). In animal electrophysiology (Chi et al., 2005; Theunissen and Elie, 2014), neurons in auditory cortex are shown to be well-tuned to spectrotemporal features in the modulation domain. These findings complement human intracranial (Pasley et al., 2012; Hullett et al., 2016) and fMRI (Santoro et al., 2014; Santoro et al., 2017) studies demonstrating that the relationship between auditory stimuli and neural responses is best modelled using spectrotemporal modulations.

Analysis of the acoustic properties of noise-vocoded spoken words show that although low-frequency spectral and temporal modulations dominate the speech signal, only the intermediate modulations (2–4 Hz temporally; 1–2 cycles per octave spectrally) are degraded by noise-vocoding that severely impairs intelligibility (e.g. Shannon et al., 1995; Davis and Johnsrude, 2003; Scott et al., 2006; Obleser et al., 2008). These intermediate spectrotemporal modulations encode formant transitions that are an important cue to consonant place and manner of articulation (Liberman et al., 1967; Roberts et al., 2011). Notably, our decoding analysis shows that it was the intermediate temporal modulations (2–4 Hz) that were preferentially represented in MEG responses. Thus, rather than faithfully encoding the acoustic structure of speech by tracking low-frequency modulations, cortical responses are selectively tuned to acoustic properties (intermediate spectrotemporal modulations) that convey speech information.

Our findings agree with recent work (Di Liberto et al., 2015; Daube et al., 2019) suggesting that cortical responses to speech (as measured with MEG) are not completely invariant to acoustic detail. While our analysis that combined different feature spaces also indicate contributions from acoustically invariant phonetic features (consistent with Di Liberto et al., 2015), the spectrotemporal modulation feature space best predicted MEG responses when considered individually. Despite the lack of full invariance, spectrotemporal modulations are nonetheless a higher-level representation of the acoustic signal than, for example, the spectrogram. Whereas a spectrogram-based representation would faithfully encode any pattern of spectrotemporal stimulation, a modulation-based representation would be selectively tuned to patterns that change at specific spectral scales or temporal rates. Such selectivity is a hallmark of neurons in higher-levels of the auditory system (Rauschecker and Scott, 2009). Our findings are thus consistent with the notion that superior temporal neurons, which are a major source of speech-evoked MEG responses (e.g. Bonte et al., 2006; Sohoglu et al., 2012; Brodbeck et al., 2018), lie at an intermediate stage of the speech processing hierarchy (e.g. Mesgarani et al., 2014; Evans and Davis, 2015; Yi et al., 2019). It should be noted however that in Daube et al., 2019, the best acoustic predictor of MEG responses was based on spectral onsets rather than spectrotemporal modulations. It may be that the use of single words in the current study reduced the importance of spectral onsets, which may be more salient in the syllabic transitions during connected speech as used by Daube et al., 2019.

Our findings may also reflect methodological differences: the encoding/decoding methods employed here have typically been applied to neural recordings from >60 min of listening to continuous – and clear – speech. By comparison, our stimulus set comprised single degraded words, generating fewer and possibly less varied neural responses, which may limit encoding/decoding accuracy for certain feature spaces. However, we think this is unlikely for two reasons. First, we could replicate several other previous findings from continuous speech studies, for example, greater encoding accuracy for spectrogram versus envelope models (Di Liberto et al., 2015) and for the combined spectrogram + phonetic feature model versus spectrogram alone (Di Liberto et al., 2015; Daube et al., 2019). Second, our effect sizes are comparable to those reported in previous work (e.g. Di Liberto et al., 2015) and model comparison results highly reliable, with the spectrotemporal modulation model outperforming the other individual feature spaces in 17 out of 21 participants in the left hemisphere and 18 out of 21 participants in the right hemisphere. Thus, rather than reflecting methodological factors, we suggest that our findings might instead reflect differences in the acoustic, phonetic, and lexical properties of single degraded words and connected speech – including differences in speech rate, co-articulation, segmentation, and predictability. Future work is needed to examine whether and how these differences are reflected in neural representations.

Prediction error in speech representations

Going beyond other studies, however, our work further shows that neural representations of speech sounds reflect a combination of the acoustic properties of speech, and prior knowledge or expectations. Our experimental approach – combining manipulations of sensory detail and matching or mismatching written text cues – includes specific conditions from previous behavioral (Sohoglu et al., 2014) and neuroimaging (Sohoglu et al., 2012; Blank and Davis, 2016; Sohoglu and Davis, 2016; Blank et al., 2018) studies that were designed to distinguish different computations for speech perception. In particular, neural representations that show an interaction between sensory detail and prior knowledge provide unambiguous support for prediction error over sharpening coding schemes (Blank and Davis, 2016). This interaction has previously been demonstrated for multivoxel patterns as measured by fMRI (Blank and Davis, 2016). The current study goes beyond this previous fMRI work in establishing the latency at which prediction errors are computed. As prediction errors reflect the outcome of a comparison between top-down and bottom-up signals, it has been unclear whether such a computation occurs only at late latencies and plausibly results from re-entrant feedback connections that modulate late neural responses (Lamme and Roelfsema, 2000; Garrido et al., 2007). However, our decoding analysis suggests that the neural interaction between sensory detail and prior knowledge that we observe is already apparent within 100 ms of speech input. This suggests that prediction errors are computed during early feedforward sweeps of processing and are tightly linked to ongoing changes in speech input.

One puzzling finding from the current study is that the interaction between prior knowledge and sensory detail on neural representations is also present when randomly shuffling the spectrotemporal modulation features of different words. While an interaction is still present after controlling for contributions from the shuffled data – albeit only for speech with 3 to 6 channels of sensory detail – our findings suggest that between-condition differences in model accuracy are also linked to a generic spectrotemporal representation common to all the words in our study. However, it is unclear how differences in encoding accuracy between matching and mismatching conditions could result from a generic representation (since the same words appeared in both conditions). One possibility is that listeners can rapidly detect mismatch and use this to suppress or discard the representation of the written word in mismatch trials. That is, they switch from a high-precision (written text) prior, to a lower precision prior midway through the perception of a mismatching word (see Cope et al., 2017 for evidence consistent with flexible use of priors supported by frontal and motor speech areas). However, the timecourse of when speech can be decoded in the present study appears inconsistent with this explanation since the neural interaction is present already at early time lags. In this analysis, we focussed on neural responses that track the speech signal with shorter or longer delays. Future research to link neural processing with specific moments in the speech signal (e.g. comparing mismatch at word onset versus offset; Sohoglu et al., 2014; Blank et al., 2018) would be valuable and might suggest other ways of assessing when and how priors are dynamically updated during perception.

Previous electrophysiological work (Holdgraf et al., 2016; Di Liberto et al., 2018b) has also shown that matching prior knowledge, or expectation can affect neural responses to speech and representations of acoustic and phonetic features. However, these findings were observed under a fixed degradation level, close to our lowest-clarity, 3-channel condition (see also Di Liberto et al., 2018a). Thus, it is unclear whether the enhanced neural representations observed in these earlier studies reflect computations of sharpened signals or prediction errors – both types of representational scheme can accommodate the finding that prior knowledge enhances neural representations for low clarity speech (see Figure 1D). By manipulating degradation level alongside prior knowledge, we could test for the presence of interactive effects that distinguish neural representations of prediction errors from sharpened signals (Blank and Davis, 2016). This has potential implications for studies in other domains. For example, multivariate analysis of fMRI (Kok et al., 2012) and MEG (Kok et al., 2017) signals has shown that visual gratings with expected orientations are better decoded than unexpected orientations. In these previous studies, however, the visual properties and hence perceptual clarity of the decoded gratings was held constant. The current study, along with fMRI results from Blank and Davis, 2016, suggests that assessing the interaction between expectation and perceptual clarity more clearly distinguishes prediction error from sharpened signal accounts.

Enhanced prediction error representations of low-clarity speech when prior expectations are more accurate might seem counterintuitive. In explaining this result, we note that in our study, listeners were able to make strong (i.e. precise) predictions about upcoming speech: Matching and mismatching trials occurred equally frequently and thus upon seeing a written word, there was a. 5 probability of hearing the same word in spoken form. However, in the low-clarity condition, acoustic cues were by definition weak (imprecise). Therefore, in matching trials, listeners would have made predictions for acoustic cues that were absent in the degraded speech input, generating negative prediction errors (shown as dark ‘clay’ in Figure 1C). Critically, however, these negative prediction errors overlap with the cues that are normally present in clear speech (shown as light ‘clay’ in Figure 1C). This close correspondence between negative prediction error and degraded speech cues would result in good encoding/decoding accuracy. In mismatching trials, negative prediction errors would also be generated (shown as dark ‘fast’ in Figure 1C) but would necessarily mismatch with heard speech cues. Thus, while the overall magnitude of prediction error might be highest in this condition (see Figure 1—figure supplement 1), there would be limited overlap between predicted and heard speech cues, resulting in worse model accuracy. In summary then, an important factor in explaining our results is the relative precision of predictions and the sensory input.

Other work has taken a different approach to investigating how prior knowledge affects speech processing. These previous studies capitalized on the fact that in natural speech, certain speech sounds are more or less predictable based on the preceding sequence of phonemes (e.g. upon hearing ‘capt-”, ‘-ain’ is more predictable than ‘-ive’ because ‘captain’ is a more frequent English word than ‘captive’). Using this approach, MEG studies demonstrate correlations between the magnitude of neural responses and the magnitude of phoneme prediction error (i.e. phonemic surprisal, Brodbeck et al., 2018; Donhauser and Baillet, 2020). However, this previous work did not test for sharpened representations of speech; leaving open the question of whether expectations are combined with sensory input by computing prediction errors or sharpened signals. Indeed, this question may be hard to address using natural listening paradigms since many measures of predictability are highly correlated (e.g. phonemic surprisal and lexical uncertainty, Brodbeck et al., 2018; Donhauser and Baillet, 2020). Here by experimentally manipulating listeners’ predictions and the quality of sensory input, we could test for the statistical interaction between prior knowledge and sensory detail that most clearly distinguishes prediction errors from sharpened representations. Future work in which this approach is adapted to continuous speech could help to establish the role of predictive computations in supporting natural speech perception and comprehension.

The challenge of experimental control with natural listening paradigms may also explain why acoustic representations of words during connected speech listening are enhanced when semantically similar to previous words (Broderick et al., 2019). Under the assumption that semantically similar words are also more predictable, this finding might be taken to conflict with the current findings, that is, weaker neural representations of expected speech in the high-clarity condition. However, this conflict can be resolved by noting that semantic similarity and word predictability are only weakly correlated and furthermore may be neurally dissociable (see Frank and Willems, 2017, for relevant fMRI evidence). Further studies that monitor how prior expectations modulate representations concurrently at multiple levels (e.g. acoustic, lexical, semantic) would be valuable to understand how predictive processing is coordinated across the cortical hierarchy.

Our evidence for prediction error representations challenges models, such as TRACE (McClelland and Elman, 1986; McClelland et al., 2014), in which bottom-up speech input can only ever be sharpened by prior expectations. Instead, our findings support predictive coding accounts in which prediction errors play an important role in perceptual inference. However, it should be noted that in these predictive coding accounts, prediction errors are not the end goal of the system. Rather, they are used to update and sharpen predictions residing in separate neural populations. It is possible that these sharpened predictions are not apparent in our study because they reside in deeper cortical layers (Bastos et al., 2012) to which the MEG signal is less sensitive (Hämäläinen et al., 1993) or in higher hierarchical levels (e.g. lexical/semantic versus the acoustic level assessed here). It has also been proposed that prediction errors and predictions are communicated via distinct oscillatory channels (e.g. Arnal et al., 2011; Bastos et al., 2012), raising the possibility that other neural representations could be revealed with time-frequency analysis methods. In future, methodological advances (e.g. in layer-specific imaging; Kok et al., 2016) may help to resolve this issue.

We also consider how our results relate to the ‘opposing process’ account recently proposed by Press et al., 2020. According to this account, perception is initially biased by expected information. However, if the stimulus is sufficiently surprising, then later processing is redeployed toward unexpected inputs. Importantly in the context of the current study, this latter process is proposed to operate only when sensory inputs clearly violate expectations, that is, when sensory evidence is strong. While this account is formulated primarily at the perceptual level, a possible implication is that our interaction effect results from the conflation of two distinct processes occurring at different latencies: an early component in which neural responses are upweighted for expected input (leading to greater model accuracy for matching versus mismatching trials) and a later component triggered by high-clarity speech whereby neural responses are upweighted for surprising information (leading to greater model accuracy in mismatching trials). However, the timecourse of when high-clarity speech can be decoded is inconsistent with this interpretation: for high-clarity speech, decoding is only ever lower in matching trials, and never in the opposite direction, as would be expected for neural responses that are initially biased for predicted input (see Figure 6—figure supplement 1). It may be that the opposing process model is more applicable to neural representations of predictions rather than the prediction errors that are hypothesized to be apparent here. Future research with methods better suited to measuring neural representations of predictions (e.g. time-frequency analysis, as mentioned above) may be needed to provide a more definitive test of the opposing process model.

Materials and methods

Participants

21 (12 female, 9 male) right-handed participants were tested after being informed of the study’s procedure, which was approved by the Cambridge Psychology Research Ethics Committee (reference number CPREC 2009.46). All were native speakers of English, aged between 18 and 40 years (mean = 22, SD = 2) and had no history of hearing impairment or neurological disease based on self-report.

Spoken stimuli

Request a detailed protocol

468 monosyllabic words were presented to each participant in spoken or written format drawn randomly from a larger set of monosyllabic spoken words. The spoken words were 16-bit, 44.1 kHz recordings of a male speaker of southern British English and their duration ranged from 372 to 903 ms (mean = 591, SD = 78). The amount of sensory detail in speech was varied using a noise-vocoding procedure (Shannon et al., 1995), which superimposes the temporal-envelope from separate frequency regions in the speech signal onto white noise filtered into corresponding frequency regions. This allows parametric variation of spectral detail, with increasing numbers of channels associated with increasing intelligibility. Vocoding was performed using a custom Matlab script (The MathWorks Inc), using 3, 6, or 12 spectral channels spaced between 70 and 5000 Hz according to Greenwood’s function (Greenwood, 1990). Envelope signals in each channel were extracted using half-wave rectification and smoothing with a second-order low-pass filter with a cut-off frequency of 30 Hz. The overall RMS amplitude was adjusted to be the same across all audio files.

Each spoken word was presented only once in the experiment so that unique words were heard on all trials. The particular words assigned to each condition were randomized across participants. Before the experiment, participants completed brief practice sessions each lasting approximately 5 min that contained all conditions but with a different corpus of words to those used in the experiment. Stimulus delivery was controlled with E-Prime 2.0 software (Psychology Software Tools, Inc).

Procedure

Request a detailed protocol

Participants completed a modified version of the clarity rating task previously used in behavioral and MEG studies combined with a manipulation of prior knowledge (Davis et al., 2005; Hervais-Adelman et al., 2008; Sohoglu et al., 2012). Speech was presented with 3, 6, or 12 channels of sensory detail while prior knowledge of speech content was manipulated by presenting mismatching or matching text before speech onset (see Figure 1A). Written text was composed of black lowercase characters presented for 200 ms on a gray background. Mismatching text was obtained by permuting the word list for the spoken words. As a result, each written word in the Mismatching condition was also presented as a spoken word in a previous or subsequent trial and vice versa.

Trials commenced with the presentation of a written word, followed 1050 (±0–50) ms later by the presentation of a spoken word (see Figure 1A). Participants were cued to respond by rating the clarity of each spoken word on a scale from 1 (‘Not clear’) to 4 (‘Very clear’) 1050 (±0–50) ms after speech onset. The response cue consisted of a visual display of the rating scale and responses were recorded using a four-button box from the participant’s right hand. Subsequent trials began 850 (±0–50) ms after participants responded.

Manipulations of sensory detail (3/6/12 channel speech) and prior knowledge of speech content (Mismatching/Matching) were fully crossed, resulting in a 3 × 2 factorial design with 78 trials in each condition. Trials were randomly ordered during each of three presentation blocks of 156 trials.

Schematic illustrations of sharpened signal and prediction error models

Request a detailed protocol

To illustrate the experimental predictions for the sharpened signal and prediction error models, we generated synthetic representational patterns of speech (shown in Figure 1C). To facilitate visualization, speech was represented as pixel-based binary patterns (i.e. written text) corresponding to the words ‘clay’ and ‘fast’. To simulate our experimental manipulation of sensory detail, we used a two-dimensional filter to average over eight (‘low’ sensory detail) or four (‘medium’) local pixels in the speech input patterns. To simulate the ‘high’ sensory detail condition, the input patterns were left unfiltered. Patterns for the speech input and predictions were added to uniformly distributed noise (mean = 0, standard deviation = 0.5) and normalized to sum to one. Sharpened signal patterns were created by multiplying the input and prediction patterns and normalising to sum to one. The prediction error patterns were created by subtracting the predictions from the input.

To approximate the model accuracy measure in our MEG analysis, we correlated a clear (noise-free) representation of the speech input (‘clay’) with the sharpened signal or prediction error patterns (shown in Figure 1D as the squared Pearson’s correlation R2). In addition to this correlation metric focusing on representational content, we also report the overall magnitude of prediction error by summing the absolute prediction error over pixels (shown in Figure 1—figure supplement 1).

Data acquisition and pre-processing

Request a detailed protocol

Magnetic fields were recorded with a VectorView system (Elekta Neuromag, Helsinki, Finland) containing two orthogonal planar gradiometers at each of 102 positions within a hemispheric array. Data were also acquired by magnetometer and EEG sensors. However, only data from the planar gradiometers were analyzed as these sensors have maximum sensitivity to cortical sources directly under them and therefore are less sensitive to noise artifacts (Hämäläinen, 1995).

MEG data were processed using the temporal extension of Signal Source Separation (Taulu et al., 2005) in Maxfilter software to suppress noise sources, compensate for motion, and reconstruct any bad sensors. Subsequent processing was done in SPM8 (Wellcome Trust Centre for Neuroimaging, London, UK), FieldTrip (Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, the Netherlands) and NoiseTools software (http://audition.ens.fr/adc/NoiseTools) implemented in Matlab. The data were subsequently highpass filtered above 1 Hz and downsampled to 80 Hz (with anti-alias lowpass filtering).

Linear regression

Request a detailed protocol

We used ridge regression to model the relationship between speech features and MEG responses (Pasley et al., 2012; Ding and Simon, 2013; Di Liberto et al., 2015; Holdgraf et al., 2017; Brodbeck et al., 2018). Before model fitting, the MEG data were epoched at matching times as the feature spaces (see below), outlier trials removed and the MEG time-series z-scored such that each time-series on every trial had a mean of zero and standard deviation of 1. We then used the mTRF toolbox (Crosse et al., 2016; https://sourceforge.net/projects/aespa) to fit two types of linear model.

The first linear model was an encoding model that mapped from the stimulus features to the neural time-series observed at each MEG sensor and at multiple time lags:

y =Sw + ε

where y is the neural time-series recorded at each planar gradiometer sensor, S is an Nsamples by Nfeatures matrix defining the stimulus feature space (concatenated over different lags), w is a vector of model weights and ε is the model error. Model weights for positive and negative lags capture the relationship between the speech feature and the neural response at later and earlier timepoints, respectively. Encoding models were fitted with lags from −100 to 300 ms but for model prediction purposes (see below), we used a narrower range of lags (0 to 250 ms) to avoid possible artefacts at the earliest and latest lags (Crosse et al., 2016).

Our encoding models tested four feature spaces (shown in Figure 2), derived from the original clear versions of the spoken stimuli (i.e. before noise-vocoding):

  1. Broadband envelope, a simple acoustic-based representation of the speech signal, which captured time-varying acoustic energy in a broad 86–4878 Hz frequency range. Thus, this feature space contained information only about the temporal (and not spectral) structure of speech, which alone can cue many speech sounds (e.g. ‘ch’ versus ‘sh’ and ‘m’ versus ‘p’; Rosen, 1992). This feature space was obtained by summing the envelopes across the spectral channels in a 24-channel noise-vocoder. The center frequencies of these 24 spectral channels were Greenwood spaced: 86, 120, 159, 204, 255, 312, 378, 453, 537, 634, 744, 869, 1011, 1172, 1356, 1565, 1802, 2072, 2380, 2729, 3127, 3579, 4093, and 4678 Hz (see Greenwood, 1990). Envelopes were extracted using half-wave rectification and low-pass filtering at 30 Hz.

  2. Spectrogram, which was identical to the Envelope feature space but with the 24 individual spectral channels retained. Because this feature space was spectrally resolved, it contained more information about the identity of speech sounds than the broadband envelope (e.g. fricatives such as ‘f’ versus ‘s’ and stop consonants such as ‘b’ versus ‘d’; Liberman et al., 1967).

  3. Spectrotemporal modulations captured regular fluctuations in energy across the frequency and time axes of the spectrogram. This feature space can be considered an abstraction of the spectrogram that more directly indexes perceptually relevant spectrotemporal features, such as spectrally regular harmonics in vowel sounds or fast temporal modulations of stop consonants. Whereas the spectrogram might be appropriate for modelling subcortical processing, it has been proposed that spectrotemporal modulations are a better model of neural representation in auditory cortex (Chi et al., 2005; Theunissen and Elie, 2014). We used the NSL toolbox in Matlab (http://nsl.isr.umd.edu/downloads.html) to first compute a 128-channel ‘auditory’ spectrogram (with constant Q and logarithmic center frequencies between 180 and 7040 Hz), before filtering with 2D wavelet filters tuned to spectral modulations of 0.5, 1, 2, 4, and 8 cycles per octave and temporal modulations of 1, 2, 4, 8, and 16 Hz. All other parameters were set to the default (frame length = 8 ms, time constant = 8 ms and no nonlinear compression). Note that temporal modulations can be positive or negative, reflecting the direction of frequency sweeps (i.e. upward versus downward; Elliott and Theunissen, 2009). This default representation is a very high-dimensional feature space: frequency × spectral modulation × temporal modulation × temporal modulation direction with 128 × 5×5 × 2 = 6400 dimensions, each represented for every 8 ms time sample in the speech file. We, therefore, averaged over the 128 frequency channels and positive- and negative-going temporal modulation directions. The resulting feature space was a time-varying representation of spectrotemporal modulation content comprised of 25 features (five spectral modulations x five temporal modulations).

  4. Phonetic features. The final feature space comprised 13 phonetic features describing the time-varying phonetic properties of speech including vowel backness, voicing, place, and manner of articulation. We first used a forced-alignment algorithm included as part of BAS software (Kisler et al., 2017; https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/WebMAUSGeneral) to estimate the onset time of each phoneme. We then converted phoneme representations into the following articulatory phonetic features: Voiced, Unvoiced, Bilabial, Labiodental/Dental, Alveolar, Velar, Plosive, Nasal, Fricative, Liquid, Front, Central, and Back (International Phonetic Association, 1999). Between the onset time of each phoneme and the onset time of the subsequent phoneme, the corresponding phonetic features were set to one and non-corresponding phonetic features to 0. Phonetic features that occurred infrequently in our stimulus set were excluded or merged (labiodental and dental features). For dipthong vowels and affricate consonants, we averaged the feature vectors for the component speech sounds.

In addition to the four feature spaces above, we also tested more complex combinations of feature spaces that previous work has shown to be good predictors of neural responses. The first of these combined the spectrogram feature space above with the phonetic feature space (Di Liberto et al., 2015). The second combined the spectrogram with the half-wave rectified derivative of the spectrogram to capture spectral onsets (Daube et al., 2019).

All feature spaces were downsampled to 80 Hz to match the MEG sampling rate and the acoustic feature spaces (i.e. 1, 2, 3 above) were z-score transformed such that each time-series on every trial had a mean of zero and standard deviation of 1. For each trial/word, all feature spaces were zero-padded before speech onset to match the duration of the longest negative lag used for model fitting (i.e. 100 ms). Similarly, zero-padding was added after speech offset to match the duration of the longest positive lag (i.e. 300 ms). This enabled information at the beginning and end of each spoken word to inform the model fits for negative and positive lags, respectively.

The second linear model was a decoding model that mapped in the opposite direction from the neural time-series back to speech spectrotemporal modulations (since analysis of encoding models showed this feature space to be most predictive of neural responses). Before model fitting, to remove redundant dimensions and reduce computation time, the data from the 204 planar gradiometers were transformed into principal components and the first 50 components retained. The model was therefore as follows:

s =Yw + ε

Here s is a vector expressing the time-varying spectrotemporal modulations (one vector for each of the 25 spectrotemporal modulations), Y is an Nsamples by Ncomponents matrix of MEG data (concatenated over different lags), w is a vector of model weights and ε is the model error. Decoding models were fitted with lags from −100 to 300 ms and from 0 to 250 ms for model prediction purposes. As with the encoding analysis, here model weights for positive and negative lags capture the relationship between the speech feature and the neural response at later and earlier timepoints, respectively. Note that this notation may differ to that used in other work, where instead negative lags are used to denote the post-stimulus period for decoding analyses (see Crosse et al., 2016). The single-lag analysis proceeded similarly except that model prediction was conducted separately for each lag, generating model accuracies for all lags between −50 and 250 ms.

To control for overfitting, we used ridge regression and varied the lambda parameter (over 17 values as follows: 20, 21, 23, … 220) which varies the degree to which strongly positive or negative weights are penalized (Crosse et al., 2016). This lambda parameter was optimized using a leave-one-trial-out cross-validation procedure. For each feature space, sensor, condition, and participant, we fitted the models using data from all but one trial. We then averaged the model weights across trials and used the result to predict the MEG response (for the encoding models) or speech spectrotemporal modulations (for the decoding model) of the left-out trial. We computed model accuracy as the Pearson correlation between the predicted and observed data and repeated this procedure such that model accuracies were obtained for all trials. Before optimizing lambda, model accuracies were averaged across trials, and also across sensors (encoding models) or modulations (decoding model). The optimal lambda value was then selected as the mode of the model accuracy distribution over participants and conditions (Holdgraf et al., 2016). When testing the complex models that combined feature spaces, lambda was optimized for each component feature space independently resulting in a search space of 17 × 17 = 289 lambda values (i.e. 'banded' ridge regression; Nunez-Elizalde et al., 2019). When conducting the single-lag decoding analysis, lambda was optimized for each lag separately.

When testing for differences in encoding model accuracies, we averaged model accuracies over trials and over the 20 sensors with the highest model accuracies (in each hemisphere separately). For the decoding analysis, model accuracies were averaged over trials. The resulting data were then entered into repeated measures ANOVAs or t-tests to examine differences in model accuracy between feature spaces or conditions.

When testing for above-chance encoding model accuracies, null distributions were created for each condition separately by randomly permuting the feature representations across trials (i.e. shuffling the words in our stimulus set) or within trials (i.e. shuffling the spectrotemporal modulation channels and time-bins for each word). This was repeated for 100 permutations using the same ridge regression lambda that was optimized for the non-shuffled data. For each of the 100 permutations, we then subjected the data to the same analysis pipeline as for the non-permuted data (averaging over trials, selecting 20 left-hemisphere and 20 right-hemisphere sensors with the highest model accuracy and averaging over participants). When z-scoring the observed data with respect to the null distributions, the 20 left-hemisphere and 20 right-hemisphere sensors were selected after z-scoring. This ensured the same sensors were used when combining the observed and permuted data to compute the z-scores. z-scores were obtained by taking the observed model accuracy of each condition and participant, subtracting the mean of the relevant null distribution, and dividing by the null distribution standard deviation.

To visualize the weights of the spectrotemporal modulation encoding model, or ‘temporal response functions (TRFs)’, we used denoising source separation (DSS) to transform the TRFs into a set of linear components (spatial filters) ordered by their consistency over trials (de Cheveigné and Parra, 2014). The first three DSS components (i.e. the three most consistent components) were retained and projected back into sensor-space. When applying DSS, the covariance matrices were computed by summing the covariances over features and conditions. Following DSS, we averaged the TRFs across trials and computed the root mean square (RMS) amplitude across all left hemisphere sensors.

Data availability

Raw data, experimental materials and analysis scripts are available on a public repository (https://osf.io/b2jpt/).

The following data sets were generated
    1. Sohoglu E
    (2020) Open Science Framework
    ID b2jpt. Rapid computations of spectrotemporal prediction error support perception of degraded speech.

References

    1. Friston K
    (2005) A theory of cortical responses
    Philosophical Transactions of the Royal Society B: Biological Sciences 360:815–836.
    https://doi.org/10.1098/rstb.2005.1622
    1. Ganong WF
    (1980) Phonetic categorization in auditory word perception
    Journal of Experimental Psychology: Human Perception and Performance 6:110–125.
    https://doi.org/10.1037/0096-1523.6.1.110
    1. Rosen S
    (1992) Temporal information in speech: acoustic, auditory and linguistic aspects
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 336:367–373.
    https://doi.org/10.1098/rstb.1992.0070
    1. Warner N
    2. McQueen JM
    3. Cutler A
    (2014) Tracking perception of the sounds of english
    The Journal of the Acoustical Society of America 135:2995–3006.
    https://doi.org/10.1121/1.4870486

Decision letter

  1. Andrew J King
    Senior Editor; University of Oxford, United Kingdom
  2. Peter Kok
    Reviewing Editor; University College London, United Kingdom
  3. Peter Kok
    Reviewer; University College London, United Kingdom
  4. Clare Press
    Reviewer; Birkbeck, University of London, United Kingdom
  5. Edmund C Lalor
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper convincingly demonstrates the importance of prediction errors – the mismatch between what we predict to hear and what we actually hear – for processing speech signals. Theoretical proposals on the role of prediction errors in perception abound, but direct empirical evidence is scarce. Furthermore, the authors show that these prediction errors occur very early on, suggesting that predictions play a fundamental role in sensory processing.

Decision letter after peer review:

Thank you for submitting your article "Rapid computations of spectrotemporal prediction error support perception of degraded speech" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Peter Kok as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Andrew King as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Clare Press (Reviewer #2); Edmund C Lalor (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

This manuscript by Sohoglu and Davis uses sophisticated analysis of previously published MEG data to adjudicate between two models of how prior information might influence the perception of degraded speech. Based on an interesting interaction effect in the data – increasing sensory detail leads to better representation with mismatching prior information, but worse representation for matching prior information – they conclude that the benefits of prior information on degraded speech perception involve neural computations that are consistent with a predictive coding model. They demonstrate that this interaction was present for the first 250 ms of processing. The reviewers found that the manuscript addresses a very interesting question, is well written, and uses a clever design and analyses. The reviewers had questions about the interpretation of the effects that need to be better supported or toned down in a revised manuscript.

Revisions for this paper:

1) Consistency of the data with a predictive coding model. It seems that a predictive coding explanation of the reported interaction between sensory detail and prior knowledge depends on two assumptions, and it would strengthen the manuscript to outline these assumptions – assuming the authors agree with them. First, predictive accounts assume two unit types: error units and hypothesis units. It seems that this interaction requires assuming that the error units contribute to the signal to a greater extent than the hypothesis units. Second, it rests upon assuming high precision for the prior distributions. e.g., the authors state "This interaction arises because sensory signals that match prior expectations are explained away more effectively as signal quality increases and hence neural representations are suppressed even as perceptual outcomes improve." This is only true if the prior has high precision. If the prior has low precision then the opposite effect seems more likely. Put another way, if the statistics of one's environment dictate that sight of the word “clay” is followed by low clarity audio of clay, then the 3 channel likely matches the prior to a greater extent than the 12 channel. Is anything known about the statistical structure of the environment with respect to visual prediction of the auditory signal for these words that could support the assumption of high precision on priors?

2) Theoretical positions involving sharpening that would also predict this interaction. We agree that this interaction would not be predicted under a simple sharpening account, but the recent “opposing process” account cited by the authors – which includes sharpening – would make this prediction. This account predicts that sharpening is followed in time by mechanisms that increase the sensory processing of mismatching inputs, but only those generating high KLD. This would only be the case for high quality, not low quality, inputs and therefore would generate this interaction – even if some other particulars would be assumed different. It is worth discussing that the interaction is also consistent with this account, which involves sharpening, even if less consistent with pure sharpening accounts.

3) Regardless of the specific model used to explain the data, one specific aspect of the main interaction effect – as schematized in Figure 1D and as shown in the data in Figure 5A – needs clarification. Why might we expect a bigger response for a low sensory detail stimulus when it is matching rather than non-matching? That is, why are the two leftmost points on the right plot in Figure 1D the way they are? It seems intuitive that the error should be smaller for an imprecise "clay" than for an imprecise "fast", but the data suggest the opposite is true. We wonder if there would there be any complementary value to trying to predict the MEG based on an acoustic representation of the mismatching text? In other words, if I have a strong prediction of "fast" and I hear an imprecise "clay" would I get a smaller response (representing the error for "fast") than if I had heard a precise "clay". We leave it up to the authors to decide whether this analysis is feasible and valuable; in any case, this finding (reduced error for matching vs. mismatching low quality speech), and its interpretation, should be explained more fully.

4) The logic of the study seems to rest upon effects reflecting sensory processing. It seems plausible that higher level regions would also exhibit high model accuracy, especially given that 250 ms will have allowed sensory information to be communicated throughout most of the cortex. It would therefore strengthen the manuscript to increase the evidence that the effect is sensory. This could be achieved through additional analysis e.g., restricting the analysis to a shorter temporal window, if the data allows, or through theoretical reasoning to support the sensory nature of effects, e.g., related to the linear relationship between spectrotemporal modulations and the MEG signal. In any case, given that a lot of (re-entrant) processing occurs in the first 250 ms, the current claims about early processing, especially references to the feedforward sweep and suggestions that re-entrant processing only occurs > 250 ms, should be strongly toned down.

5) The reviewers had a concern about the use of different speech representations with the TRF. First, they thought the analysis of Figure 3 lacked an assessment of a model that combined spectrotemporal modulations and phonetic features. These models both do pretty well on their own, but should also be complementary.

6) A complementary concern was that there may not be enough data – or enough variability in the specific stimuli – to conduct a fair test of the various representations. Some representations have more dimensions than others. And it seems the TRFs were fit using data acquired to 468 words (many of which were presented with very low levels of intelligibility). This might not be enough data to get quality fits for some of the representations and, thus, the results in Figure 3 may not be quite fair. We appreciate what the authors are trying to do in terms of finding a representation that gives them the best TRF predictions and, thus, the best SNR for testing their hypothesis. However, this issue should be discussed somewhere in the manuscript – to make it clear that the results of Figure 3 are based on a fairly limited amount of data and might not necessarily generalize to a larger data set.

7) Broderick et al., 2019 report seemingly opposite effects of prior knowledge to those presented here. In their case prior knowledge results from semantic similarity of preceding spoken words, and they observed enhanced rather than suppressed encoding of matching spoken words. This discrepancy should be discussed.

8) If the speech system preferentially represents inputs that mismatch predictions, how do the authors explain that valid prior knowledge enhances speech intelligibility? This should be discussed.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Rapid computations of spectrotemporal prediction error support perception of degraded speech" for further consideration by eLife. Your revised article has been evaluated by Andrew King (Senior Editor) and a Reviewing Editor.

The manuscript has been greatly improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

1) The new analyses performed in response to point 3 revealed that the main result of the paper, an interaction between match/mismatch and sensory detail, is also present in a null distribution of their data. The authors offer a potential explanation, but admit that they do not fully understand this result. This seems concerning; if the main result of their study is also present in a control analysis with shuffled trial labels, it is unclear how to interpret these effects. Given the impact of this finding on the interpretability of the results, it does not seem appropriate to omit these results from the manuscript. We recommend including these findings in the paper, along with the (admittedly speculative, but plausible) explanation offered by the authors. Note that this refers specifically to the results of the null distribution analysis of the main results, the reviewers agree with the authors that it's fine to omit the more speculative “written mismatch” analysis (i.e., using the acoustic representation of a word they didn't actually hear) suggested by the reviewers in the previous round.

2) The authors point out how they think they obtained their results because MEG is more sensitive to error unit than hypothesis unit processing. If this was a general rule, two visual MEG studies by Kok and colleagues (2017 PNAS; 2020 J Neurosci) would be unlikely to have found these sharpening patterns. Could the authors say anything about this discrepancy? Is it perhaps more plausible that these auditory data, in combination with the visual findings, might suggest a difference between visual and auditory systems in the extent to which MEG reflects hypothesis v error processing?

3) Relatedly, the authors have interestingly added a time resolved analysis. Given the interaction is seen very early, they consider this finding less consistent with the opposing process account (Press et al., 2020). However, the opposing process account describes perception, and in terms of neural processing is proposed to be reflected in hypothesis unit processing, not error unit processing. If the authors believe they are seeing the processing in error units only in their study, the data do not in fact speak to the opposing process account. If our understanding of the authors' claims is correct, we would suggest amending the paragraph that mentions this account – mentioning that it's a theory of perception and more concerned with hypothesis unit processing which is hypothesised not to be reflected in the present data.

4) The authors have increased clarity concerning how the predictive coding account could explain the present dataset – by clarifying the assumed precision of the priors. However, the reviewers felt the fact that there is greater representation of matching 3 channel than non-matching 3 channel under this account could still be explained more clearly. It was not fully clear how the clarifications provided speak to this particular point. E.g, if assuming a precise gaussian prediction for clay, an imprecise gaussian with the same mean (clay) will overlap with that gaussian more than an imprecise gaussian with a different mean (fast). Higher Kullback Leibler divergence for mismatch still i.e., by prescribing the shape of the prior, they have accounted for why 3 channel can generate more error than 12 channel, but not relative differences between match and non-match in the 3 channel. We think it would help readers if the logic could be further clarified here as our conceptualisation of error processing may deviate from the authors'.

5) With respect to the discussion of the lack of correspondence to the Broderick findings – this may be due to the fact that the Broderick paper was based on semantic similarity – which is not the same thing as predictability. Indeed, Frank and Willems (2017) found that semantic similarity and semantic surprisal were not strongly correlated with each other. It may even be that the Broderick results based on semantic similarity map better onto the earlier activity from hypothesis units and do not tap into error unit activity as much (which might be accessible by exploring the top-down role of semantic surprisal).

6) We don't insist on making changes necessarily – but the transition from the "Stimulus feature space selection" subsection to the "Acoustic analysis…" subsection (p12 – 13) still seems a little weak. You have just spent a lot of time comparing models and then quite quickly converge on the spectrotemporal one. You attempt to justify it might best match with the vocoded speech – but might not that actually work against you? The idea is that you are hoping to be sensitive to error, no? So if you have a strong prior and then get a degraded speech input, don't you want to be sensitive to the details of that prior – and, hence, the error. Again, we don't insist on any major change here. It may suffice that you just wanted to keep the story fairly straightforward and so wanted to pick the single speech representation that performed best.

https://doi.org/10.7554/eLife.58077.sa1

Author response

Revisions for this paper:

1) Consistency of the data with a predictive coding model. It seems that a predictive coding explanation of the reported interaction between sensory detail and prior knowledge depends on two assumptions, and it would strengthen the manuscript to outline these assumptions – assuming the authors agree with them. First, predictive accounts assume two unit types: error units and hypothesis units. It seems that this interaction requires assuming that the error units contribute to the signal to a greater extent than the hypothesis units. Second, it rests upon assuming high precision for the prior distributions. e.g., the authors state "This interaction arises because sensory signals that match prior expectations are explained away more effectively as signal quality increases and hence neural representations are suppressed even as perceptual outcomes improve." This is only true if the prior has high precision. If the prior has low precision then the opposite effect seems more likely. Put another way, if the statistics of one's environment dictate that sight of the word “clay” is followed by low clarity audio of clay, then the 3 channel likely matches the prior to a greater extent than the 12 channel. Is anything known about the statistical structure of the environment with respect to visual prediction of the auditory signal for these words that could support the assumption of high precision on priors?

We agree that the consistency of our data with a predictive coding account depends on these two assumptions.

In the previous version of the manuscript, we briefly mentioned the first assumption. In the revision, we now outline the assumption in greater detail:

“Our evidence for prediction error representations challenges models, such as TRACE (McClelland and Elman, 1986; McClelland et al., 2014), in which bottom-up speech input can only ever be sharpened by prior expectations. […] In future, methodological advances (e.g. in layer-specific imaging; Kok et al., 2016) may help to resolve this issue.”

We also now outline the second assumption in the Discussion section. We believe that outlining this second assumption also helps to address comment #3 below i.e. clarifying why the interaction between prior knowledge and sensory detail comes about.

“Enhanced prediction error representations of low-clarity speech when prior expectations are more accurate might seem counterintuitive. […] In summary then, an important factor in explaining our results is the relative precision of predictions and the sensory input.”

2) Theoretical positions involving sharpening that would also predict this interaction. We agree that this interaction would not be predicted under a simple sharpening account, but the recent “opposing process” account cited by the authors – which includes sharpening – would make this prediction. This account predicts that sharpening is followed in time by mechanisms that increase the sensory processing of mismatching inputs, but only those generating high KLD. This would only be the case for high quality, not low quality, inputs and therefore would generate this interaction – even if some other particulars would be assumed different. It is worth discussing that the interaction is also consistent with this account, which involves sharpening, even if less consistent with pure sharpening accounts.

We now discuss the opponent process model in relation to our results in the Discussion section:

“We also consider a final possibility: that sharpened signals and prediction errors are apparent at different times relative to stimulus input (Press et al., 2020). […] Future research to link neural processing with specific moments in the speech signal (e.g. comparing mismatch at word onset versus offset; Sohoglu et al., 2014; Blank et al., 2018) would be valuable and might suggest other ways of assessing the opponent process account.”

3) Regardless of the specific model used to explain the data, one specific aspect of the main interaction effect – as schematized in Figure 1D and as shown in the data in Figure 5A – needs clarification. Why might we expect a bigger response for a low sensory detail stimulus when it is matching rather than non-matching? That is, why are the two leftmost points on the right plot in Figure 1D the way they are? It seems intuitive that the error should be smaller for an imprecise "clay" than for an imprecise "fast", but the data suggest the opposite is true.

We hope that our response above to comment #1, and the associated changes to the manuscript, clarify why prediction error representations should be enhanced for low-clarity speech when (strong) expectations are matching versus mismatching. We emphasize representations here as the interaction between sensory detail and prior knowledge that distinguishes prediction error and sharpening models is apparent for multivariate (and not univariate) measures (Blank and Davis, 2016). Thus, according to a prediction error account, we should expect a more informative response (but not necessarily a “bigger” response or an increase in the overall magnitude of perdiction error) for a low sensory detail stimulus when it is matching rather than mismatching.

We wonder if there would there be any complementary value to trying to predict the MEG based on an acoustic representation of the mismatching text? In other words, if I have a strong prediction of "fast" and I hear an imprecise "clay" would I get a smaller response (representing the error for "fast") than if I had heard a precise "clay". We leave it up to the authors to decide whether this analysis is feasible and valuable;

Following Figure 1C, we might in fact expect the opposite pattern i.e. the acoustic representation of mismatching text could be most apparent for low-clarity speech, under both prediction error and sharpened representational schemes. Note that even if this is true, this analysis is less valuable for distinguishing between accounts, since in this case both representational schemes make similar experimental predictions.

We are also not confident about the feasibility of this analysis. Acoustic information is dynamic and unfolds over time. It is therefore unclear how best to timelock neural responses to dynamic acoustic predictions from mismatching text that are never physically present.

Nonetheless, we have performed the analysis and show the results below. Here “(written)Mismatch” is a new condition in which neural encoding accuracy is assessed for the acoustic (spectrotemporal modulation) representation of the mismatching written word. Here we make the simplifying assumption that the acoustic representation of the written word is apparent at the same time as heard speech, and follows the exact same timecourse as if the spoken word were present. Both of these assumptions might be false – if for instance, it might be that only predictions for the initial portion of the word are expressed, or that predictions are pre-played at a different rate from the spoken word itself.

As can be seen in Author response image 1, encoding accuracy in the (written)Mismatch condition increases with increasing speech clarity i.e. counter to what may be suggested in Figure 1C on the basis of either prediction error or sharpening accounts.

Author response image 1
Encoding accuracy in left hemisphere sensors.

Data have been analyzed in an identical fashion to Figure 5A except that we show a new (written)Mismatch condition in which the spectrotemporal modulation representation of the written word is used to predict neural responses. There was a significant main effect of sensory detail for this (written)Mismatch condition (F(2,40) = 4.466 p = .023). However, as explained below, permutation testing in which we shuffled the order of the stimuli (over 100 permutations) revealed that each (written)Mismatch datapoint did not significantly differ from the permutation null distribution. ns p >=.05. This is in contrast to the match/mismatch conditions reported in the manuscript which all showed above-chance encoding accuracy for the spoken word that was heard irrespective of sensory clarity.

Puzzled by this, we then, compared encoding accuracy to a null distribution. This null distribution was created by randomly permuting the feature representations across trials (i.e. shuffling the words in our stimulus set), for each condition separately, and repeating this for 100 permutations (but always using the same ridge regression lambda that was previously found to be optimal for the non-shuffled data). For each of the 100 permutations, we then subjected the data to the same analysis pipeline as for the non-permuted data (averaging over trials, selecting 20 left-hemisphere sensors with the highest model accuracy and averaging over participants). This resulted in the null distributions shown in Author response image 2:

Author response image 2
Histograms showing the number of permutations (random shuffles of feature space across trials) as a function of (binned) model accuracy when predicting the MEG response from the spectrotemporal modulation representation of mismatching text.

Vertical blue lines indicate the observed (non-permuted) model accuracies.

Importantly, encoding accuracy for (written)Mismatch is never above chance, as defined by this null distribution. It is therefore difficult to interpret between-condition differences in encoding accuracy for this (written)Mismatch condition. However, we also noticed that the null distributions appear to shift depending on experimental condition – i.e. the null distribution of model accuracy increased for written(Mismatch) when higher-clarity speech is presented.

To follow-up on this, we also constructed null distributions for the original six conditions i.e. we predicted the MEG response from a spectrotemporal modulation representation of mismatching and matching speech, presented with 3/6/12 channels of sensory detail. Here we denote prior knowledge conditions as (speech)Mismatch and speech(Match) to distinguish these conditions from the new (written)Mismatch condition presented above. As before, we constructed a null distribution based on 100 permutations in which we randomly assigned spectrotemporal modulation representations for different words to different MEG trials within each condition.

Reassuringly, model accuracies in these conditions are well above chance as defined by the null distribution (p <.01 for all six conditions). Hence, in both match and mismatch conditions, neural responses encode the specific acoustic form of the spoken word that is heard. Surprisingly, however, a match/mismatch by sensory detail interaction is also apparent in the null distributions, as shown in Author response image 3:

Author response image 3
Histograms showing the number of permutations (random shuffles of feature space across trials) as a function of (binned) model accuracy when predicting the MEG response from the spectrotemporal modulation representation of mismatching/matching speech.

Vertical blue lines indicate the observed (non-permuted) model accuracies.

We do not fully understand this result. Our preferred explanation is that a spectrotemporal representation of a generic or randomly chosen word can also predict MEG responses with some degree of accuracy. This is possible because the words used in the experiment have homogeneous acoustic properties – e.g. they are all monosyllabic words spoken by the same individual. All words therefore share, to some extent, a common spectrotemporal profile. By this interpretation, then, the result in Author response image 3 again confirms that encoding of this shared spectrotemporal profile in MEG responses depends on the combination of (mis)match with prior knowledge and sensory detail.

To help confirm that other simple properties of the neural response cannot explain the observed model accuracies (e.g. differences in the mean univariate response over conditions), we also confirmed that we obtain near-zero encoding accuracy for a fully-permuted model. That is, randomly permuting spectrotemporal modulation channels and time-bins gives us the results shown in Author response image 4:

Author response image 4
Histograms showing permutation null distributions when predicting the MEG response from the spectrotemporal modulation representation of mismatching/matching speech but now for a fully-permuted model (randomly permuting spectrotemporal modulation channels and time-bins).

Note the scale for these graphs is much smaller than the previous figures (-.002 to.002 versus.095 to.14), reflecting the substantial drop in model accuracy to near-zero for the fully permuted model.

We therefore feel confident that the interaction between match/mismatch and sensory detail – both in the results shown in Figure 5A of the manuscript, and as shown in Author response image 3 of this letter – supports a prediction error scheme. Encoding of an acoustic representation of a heard word – either for the specific word spoken, or from generic acoustic elements shared with other monosyllabic words from the same speaker – shows an interaction between prior knowledge and sensory detail that is a unique signature of a prediction error scheme. We still do not fully understand why a generic representation would lead to differences in encoding accuracy between match and mismatch conditions. One possibility is that listeners can rapidly detect mismatch and use this to suppress or discard the representation of the written word in mismatch trials. That is, they switch from a high-precision (written text) prior, to a low precision or neutral prior midway through the perception of a mismatching word. However, the time course of the interaction shown in decoding analyses (see point 4 below) would argue against this – there’s no evidence that the neural interaction emerges only at later time points.

In addition, we still do not fully understand why encoding accuracy for (written)Mismatch as shown in Author response image 1 was never above chance levels. At face value, both a prediction error and a sharpening scheme would lead us to expect this outcome – at least for speech that is sufficiently degraded. Indeed it has previously been shown that expected sensory stimuli can be decoded from stimulus omissions (e.g. Kok, Failing and de Lange, 2014, J Cogn Neurosci), or from responses to uninformative stimuli (e.g. Wolff et al., 2017, Nat Neurosci). Further studies in which we use other methods (e.g. time-frequency analysis) to detect less timelocked neural representations of predicted speech stimuli would be valuable to address this issue.

We therefore feel that the added value of this written mismatch analysis is limited and have opted not to add these analyses to the manuscript.

We present it here in the hope that you might help us to understand how to interpret these analyses and potentially incorporate these into our manuscript. However, in the absence of a more confident interpretation, or an explanation of how these findings should change our conclusions we would prefer to exclude these from the manuscript.

In any case, this finding (reduced error for matching vs. mismatching low quality speech), and its interpretation, should be explained more fully.

As we mention above, we hope that our response to comment #1, and the associated changes to the manuscript clarify this issue.

4) The logic of the study seems to rest upon effects reflecting sensory processing. It seems plausible that higher level regions would also exhibit high model accuracy, especially given that 250 ms will have allowed sensory information to be communicated throughout most of the cortex. It would therefore strengthen the manuscript to increase the evidence that the effect is sensory. This could be achieved through additional analysis e.g., restricting the analysis to a shorter temporal window, if the data allows, or through theoretical reasoning to support the sensory nature of effects, e.g., related to the linear relationship between spectrotemporal modulations and the MEG signal. In any case, given that a lot of (re-entrant) processing occurs in the first 250 ms, the current claims about early processing, especially references to the feedforward sweep and suggestions that re-entrant processing only occurs > 250 ms, should be strongly toned down.

We agree that a key interpretation of our study is that prediction errors are computed at an early acoustic level involving representations of spectrotemporal modulations. We have therefore followed your suggestion to implement a time-resolved analysis, which we now report in the Results section and in Figure 6 as a new panel D. This analysis suggests that the critical interaction effect between sensory detail and prior knowledge (on spectrotemporal decoding) peaks approximately 50 ms after speech input. This finding is consistent with our interpretation of an early sensory locus.

In addition, we no longer suggest that 250 ms is a latency indicative of feedforward or feedback processing.

We thank you for this helpful suggestion to strengthen our manuscript.

5) The reviewers had a concern about the use of different speech representations with the TRF. First, they thought the analysis of Figure 3 lacked an assessment of a model that combined spectrotemporal modulations and phonetic features. These models both do pretty well on their own, but should also be complementary.

We now include a more comprehensive model comparison that includes the feature space combination of spectrotemporal modulations and phonetic features. This appears as a new Figure 3—figure supplement 1 and new sections of text in the Results section.

The outcome of this new analysis indicates that spectrotemporal modulations+phonetic features outperforms spectrotemporal modulations alone. This suggests that both phonetic features and spectrotemporal modulations are represented in neural responses. We now have modified the Discussion section to reflect these new findings:

“Our findings agree with recent work (Di Liberto et al., 2015; Daube et al., 2019) suggesting that cortical responses to speech (as measured with MEG) are not completely invariant to acoustic detail. While our analysis that combined different feature spaces also indicate contributions from acoustically invariant phonetic features (consistent with Di Liberto et al., 2015), the spectrotemporal modulation feature space best predicted MEG responses when considered individually.”

Importantly, however, this does not negate our previous finding that spectrotemporal modulations is the feature space that individually best explains MEG variance. Spectrotemporal modulations also provides the clearest link to our noise-vocoding manipulation, which degrades speech by removing narrowband spectral modulations. In the revised manuscript, we have made this link more explicit:

“Our analysis above suggests that a feature space comprised of spectrotemporal modulations is most accurately represented in neural responses. […] To investigate the acoustic impact of noise-vocoding on our stimuli, we next characterized the spectrotemporal modulations that convey speech content in our stimulus set and how those modulations are affected by noise-vocoding with a wide range of spectral channels, from 1 to 24 channels.”

6) A complementary concern was that there may not be enough data – or enough variability in the specific stimuli – to conduct a fair test of the various representations. Some representations have more dimensions than others. And it seems the TRFs were fit using data acquired to 468 words (many of which were presented with very low levels of intelligibility). This might not be enough data to get quality fits for some of the representations and, thus, the results in Figure 3 may not be quite fair. We appreciate what the authors are trying to do in terms of finding a representation that gives them the best TRF predictions and, thus, the best SNR for testing their hypothesis. However, this issue should be discussed somewhere in the manuscript – to make it clear that the results of Figure 3 are based on a fairly limited amount of data and might not necessarily generalize to a larger data set.

We acknowledge that because our stimulus set consisted of single words, rather than connected speech, there is a concern apriori that there may be not be enough data to distinguish different representations. However, our results indicate that this is not the case because 1) we can replicate several key findings from studies using similar methods and 2) our effect sizes are comparable to those reported in previous work and 3) our model comparison results are highly reliable (shown in Figure 3A).

Nonetheless, we agree that this issue should be acknowledged, and we now do so in the Discussion section:

“Our findings may also reflect methodological differences: the encoding/decoding methods employed here have typically been applied to neural recordings from > 60 minutes of listening to continuous – and undegraded – speech. […] Future work is needed to examine whether and how these differences are reflected in neural representations.”

7) Broderick et al., 2019 report seemingly opposite effects of prior knowledge to those presented here. In their case prior knowledge results from semantic similarity of preceding spoken words, and they observed enhanced rather than suppressed encoding of matching spoken words. This discrepancy should be discussed.

We now discuss this study in the Discussion section:

“The challenge of experimental control with natural listening paradigms may also explain why accurate semantic predictions appear to enhance acoustic representations during connected speech listening (Broderick et al., 2019) i.e. opposite to the effect we observe for the high-clarity condition shown in Figure 5A. […] In our paradigm the assignment of spoken words to conditions was counterbalanced across participants, allowing us to control for all acoustic differences between matching and mismatching expectation conditions.”

8) If the speech system preferentially represents inputs that mismatch predictions, how do the authors explain that valid prior knowledge enhances speech intelligibility? This should be discussed.

We now clarify this in the Introduction section:

“This pattern – opposite effects of signal quality on neural representations depending on whether prior knowledge is informative or uninformative – highlights an important implication of neural activity that represents prediction errors. Rather than directly signaling perceptual outcomes, these neural signals in sensory cortex serve the intermediary function of updating higher-level (phonological, lexical or semantic) representations so as to generate a perceptual interpretation (the posterior, in Bayesian terms) from a prediction (prior). It is these updated perceptual interpretations, and not prediction errors, that should correlate most closely with perceived clarity (Sohoglu and Davis, 2016).”

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been greatly improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

1) The new analyses performed in response to point 3 revealed that the main result of the paper, an interaction between match/mismatch and sensory detail, is also present in a null distribution of their data. The authors offer a potential explanation, but admit that they do not fully understand this result. This seems concerning; if the main result of their study is also present in a control analysis with shuffled trial labels, it is unclear how to interpret these effects. Given the impact of this finding on the interpretability of the results, it does not seem appropriate to omit these results from the manuscript. We recommend including these findings in the paper, along with the (admittedly speculative, but plausible) explanation offered by the authors. Note that this refers specifically to the results of the null distribution analysis of the main results, the reviewers agree with the authors that it's fine to omit the more speculative “written mismatch” analysis (i.e., using the acoustic representation of a word they didn't actually hear) suggested by the reviewers in the previous round.

Thank you for this helpful suggestion. We agree that this analysis is important for interpretation and now include it in our revised manuscript as new next in the Results and Discussion sections as well as a new Figure 5—figure supplement 1. We also provide additional analysis in which the observed data are z-scored with respect to the feature-shuffled and word-shuffled distributions (Figure 5—figure supplement 1B and D, respectively). Interestingly, this analysis suggests that an interaction between prior knowledge and sensory detail on model accuracy is still present in the data even after controlling for contributions from the word-shuffled distributions. The precise form of the interaction differs to that seen previously: increasing sensory detail has opposite effects on model accuracy when prior knowledge is matching vs mismatching but only for speech ranging from 3 to 6 channels. Nonetheless, this suggests an interaction that favours prediction errors is word-specific to some extent, as originally interpreted. On the other hand, it is also true that a large contribution to the original interaction effect appears to be linked to a generic representation and we now clearly acknowledge this in the revised manuscript (Results and Discussion sections).

2) The authors point out how they think they obtained their results because MEG is more sensitive to error unit than hypothesis unit processing. If this was a general rule, two visual MEG studies by Kok and colleagues (2017 PNAS; 2020 J Neurosci) would be unlikely to have found these sharpening patterns. Could the authors say anything about this discrepancy? Is it perhaps more plausible that these auditory data, in combination with the visual findings, might suggest a difference between visual and auditory systems in the extent to which MEG reflects hypothesis v error processing?

We cannot rule out the possibility that these differences in findings reflect differences in how visual and auditory systems are expressed in MEG signals. However, given that our interaction between prior knowledge and sensory detail is also present in speech fMRI data (Blank and Davis, 2016), we do not think the discrepancy is linked to MEG specifically. This could suggest a more general difference between vision and audition (irrespective of imaging modality). On the other hand, there is other work that has shown decreased decoding accuracy for expected visual stimuli (inconsistent with a sharpening model; Kumar 2017 JoCN).

Our preferred explanation is therefore that if visual paradigms were adapted to include a range of perceptual clarity levels (so as to permit a test of the interaction with expectation), sharpening versus prediction error models could be more definitively tested. We now provide further discussion along these lines:

“This has potential implications for studies in other domains. For example, multivariate analysis of fMRI (Kok et al., 2012) and MEG (Kok et al., 2017) signals has shown that visual gratings with expected orientations are better decoded than unexpected orientations. In these previous studies, however, the visual properties and hence perceptual clarity of the decoded gratings was held constant. The current study, along with fMRI results from Blank and Davis (2016), suggests that assessing the interaction between expectation and perceptual clarity more clearly distinguishes prediction error from sharpened signal accounts.”

3) Relatedly, the authors have interestingly added a time resolved analysis. Given the interaction is seen very early, they consider this finding less consistent with the opposing process account (Press et al., 2020). However, the opposing process account describes perception, and in terms of neural processing is proposed to be reflected in hypothesis unit processing, not error unit processing. If the authors believe they are seeing the processing in error units only in their study, the data do not in fact speak to the opposing process account. If our understanding of the authors' claims is correct, we would suggest amending the paragraph that mentions this account – mentioning that it's a theory of perception and more concerned with hypothesis unit processing which is hypothesised not to be reflected in the present data.

Thank you for pointing out this inaccuracy in our description of the opposing process account. We now clarify that the model is primarily formulated at the perceptual level. We still think that our decoding results can be argued to be inconsistent with the model. But we now acknowledge that the model might be more applicable to hypothesis unit processing, rather than the prediction errors we hypothesise are reflected in our data:

“We also consider how our results relate to the ‘opposing process’ account recently proposed by Press et al. (2020). […] Future research with methods better suited to measuring neural representations of predictions (e.g. time-frequency analysis, as mentioned above) may be needed to provide a more definitive test of the opposing process model.”

4) The authors have increased clarity concerning how the predictive coding account could explain the present dataset – by clarifying the assumed precision of the priors. However, the reviewers felt the fact that there is greater representation of matching 3 channel than non-matching 3 channel under this account could still be explained more clearly. It was not fully clear how the clarifications provided speak to this particular point. E.g, if assuming a precise gaussian prediction for clay, an imprecise gaussian with the same mean (clay) will overlap with that gaussian more than an imprecise gaussian with a different mean (fast). Higher Kullback Leibler divergence for mismatch still i.e., by prescribing the shape of the prior, they have accounted for why 3 channel can generate more error than 12 channel, but not relative differences between match and non-match in the 3 channel. We think it would help readers if the logic could be further clarified here as our conceptualisation of error processing may deviate from the authors'.

We believe this point can be clarified by highlighting the distinction between the overall magnitude of neural response and representational content.

We agree that, under a prediction error scheme, there should be higher Kullback Liebler Divergence (KLD) for Mismatch than Match conditions for low-clarity speech. To confirm this, we computed the KLD between the speech input and prediction patterns in Figure 1C. The results are shown in Author response image 5:

Author response image 5
Kullback Liebler Divergence (KLD) between speech input and prediction patterns (shown in Figure 1C).

However, because the KLD measure integrates over all representational units, this measure is effectively capturing the overall magnitude of prediction error. A similar result would be obtained if instead of the KLD, we took the sum of absolute prediction error (cf. Gagnepain et al., 2012 Current Biology).

In contrast to analysing the KLD (or summed absolute prediction error), we can instead correlate a clear template pattern of “clay” with the prediction error patterns. This allows us to approximate our main model accuracy measure used in the manuscript i.e. measuring how well the original (clear) spoken word is represented in MEG responses to vocoded versions of the speech signal. Expressing the correlation as the coefficient of determination (R2), gives us the results in Figure 1D.

As can be seen in Figure 1D, when using a measure that captures representational content of heard speech, we can successfully reproduce the qualitative pattern of results in the low clarity condition i.e. better prediction error representation of heard speech when prior expectations match than mismatch.

To clarify this point, we have made the following changes:

  • Revised Figure 1D so that we now show the R2 measure described above

(instead of the “cartoon” depiction in previous versions)

  • We include a new figure supplement to Figure 1 to illustrate how the overall magnitude of prediction error is expected to differ between conditions. For this we show the summed absolute prediction error as a simpler alternative to the KLD

  • In Figure 1C, we now include depictions for three levels of sensory detail for better correspondence with our experimental design (previously we only showed two levels)

  • The changes above are also accompanied by next text in the Materials and methods section to explain how the illustrations and associated metrics (R2, magnitude of prediction error) in Figure 1 and supplemental figure 1 were obtained

  • We have modified the caption to Figure 1 to highlight the different outcomes for the R2 and magnitude of prediction error metrics

We have also modified the relevant part of the Discussion:

“Therefore, in matching trials, listeners would have made predictions for acoustic cues that were absent in the degraded speech input, generating negative prediction errors (shown as dark “clay” in Figure 1C). […] Thus, while the overall magnitude of prediction error might be highest in this condition (see Figure 1 —figure supplement 1), there would be limited overlap between predicted and heard speech cues, resulting in worse model accuracy.”

5) With respect to the discussion of the lack of correspondence to the Broderick findings – this may be due to the fact that the Broderick paper was based on semantic similarity – which is not the same thing as predictability. Indeed, Frank and Willems (2017) found that semantic similarity and semantic surprisal were not strongly correlated with each other.

We agree that the lack of correspondence here is likely because semantic similarity is dissociable from word predictability. We have amended this section to clarify this point:

“The challenge of experimental control with natural listening paradigms may also explain why acoustic representations of words during connected speech listening are enhanced when semantically similar to previous words (Broderick et al., 2019). Under the assumption that semantically similar words are also more predictable, this finding might be taken to conflict with the current findings i.e. weaker neural representations of expected speech in the high-clarity condition. However, this conflict can be resolved by noting that semantic similarity and word predictability are only weakly correlated and furthermore may be neurally dissociable (see Frank and Willems, 2017, for relevant fMRI evidence).”

It may even be that the Broderick results based on semantic similarity map better onto the earlier activity from hypothesis units and do not tap into error unit activity as much (which might be accessible by exploring the top-down role of semantic surprisal).

We now acknowledge that an important question for future research is to investigate how prediction is expressed at multiple hierarchical levels, including semantics:

“Further studies that monitor how prior expectations modulate representations concurrently at multiple levels (e.g. acoustic, lexical, semantic) would be valuable to understand how predictive processing is coordinated across the cortical hierarchy.”

While we do not mention hypothesis units in this section, we have modified a later part of the discussion to acknowledge that hypothesis unit processing might be more evident in higher (e.g. semantic) levels than the acoustic level apparent in the current study:

“It is possible that these sharpened predictions are not apparent in our study because they reside in deeper cortical layers (Bastos et al., 2012) to which the MEG signal is less sensitive (Hämäläinen et al., 1993) or in higher hierarchical levels (e.g. lexical/semantic versus the acoustic level assessed here).”

6) We don't insist on making changes necessarily – but the transition from the "Stimulus feature space selection" subsection to the "Acoustic analysis…" subsection (p12 – 13) still seems a little weak. You have just spent a lot of time comparing models and then quite quickly converge on the spectrotemporal one. You attempt to justify it might best match with the vocoded speech – but might not that actually work against you? The idea is that you are hoping to be sensitive to error, no? So if you have a strong prior and then get a degraded speech input, don't you want to be sensitive to the details of that prior – and, hence, the error. Again, we don't insist on any major change here. It may suffice that you just wanted to keep the story fairly straightforward and so wanted to pick the single speech representation that performed best.

We acknowledge that multiple representations are likely to be reflected in the MEG signal, not just the spectrotemporal modulation feature space. This is also acknowledged in the manuscript.

However, examining how prior knowledge influences multiple feature representations requires individual contributions from each feature space to be identified. Because different feature spaces tend to be correlated, this analysis is a challenging one to perform (e.g. because the variance that can be attributed uniquely to a single feature space tends to be small). We therefore have decided to focus on the feature space that explains the most variance (spectrotemporal modulation model).

As the Editors/Reviewers point out, our approach has the advantage of simplifying the analysis and interpretation of what is arguably already a complex study. We also maintain that the spectrotemporal modulation framework is ideally suited to characterising the acoustic consequences of noise-vocoding and therefore also the neural consequences.

While apriori it might be hypothesised that prior knowledge of speech content modulates speech processing at a higher (non-acoustic level) such as phonetic features (di Liberto et al., 2018a), this would be inconsistent with our previous MEG studies (e.g. Sohoglu et al., 2012; Sohoglu et al., 2016) as well as from ECoG data from other groups (Holdgraff et al., 2016; Leonard et al., 2016 Nature Comms). The very early latency at which we observe the interaction effect in our decoding analysis is also consistent with the notion that prior knowledge modulates processing at an early acoustic stage (Figure 6D).

https://doi.org/10.7554/eLife.58077.sa2

Article and author information

Author details

  1. Ediz Sohoglu

    School of Psychology, University of Sussex, Brighton, United Kingdom
    Contribution
    Conceptualization, Data curation, Formal analysis, Writing - original draft, Writing - review and editing
    For correspondence
    E.Sohoglu@sussex.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0755-6445
  2. Matthew H Davis

    MRC Cognition and Brain Sciences Unit, Cambridge, United Kingdom
    Contribution
    Conceptualization, Funding acquisition, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2239-0778

Funding

Medical Research Council (SUAG/044 G101400)

  • Matthew H Davis

Medical Research Council (Centenary Award)

  • Ediz Sohoglu

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research was supported by the Medical Research Council (Centenary Award to ES and SUAG/044 G101400 to MHD). We are grateful to Maarten van Casteren, Clare Cook, and Lucy MacGregor for their excellent technical support during data collection.

Ethics

Human subjects: Volunteers were tested after informed consent, and consent to publish. The study was approved by the Cambridge Psychology Research Ethics Committee (reference number CPREC 2009.46).

Senior Editor

  1. Andrew J King, University of Oxford, United Kingdom

Reviewing Editor

  1. Peter Kok, University College London, United Kingdom

Reviewers

  1. Peter Kok, University College London, United Kingdom
  2. Clare Press, Birkbeck, University of London, United Kingdom
  3. Edmund C Lalor

Publication history

  1. Received: April 20, 2020
  2. Accepted: October 19, 2020
  3. Version of Record published: November 4, 2020 (version 1)

Copyright

© 2020, Sohoglu and Davis

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,036
    Page views
  • 122
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Genetics and Genomics
    2. Neuroscience
    Li Hou et al.
    Research Article Updated

    Long-term flight depends heavily on intensive energy metabolism in animals; however, the neuroendocrine mechanisms underlying efficient substrate utilization remain elusive. Here, we report that the adipokinetic hormone/corazonin-related peptide (ACP) can facilitate muscle lipid utilization in a famous long-term migratory flighting species, Locusta migratoria. By peptidomic analysis and RNAi screening, we identified brain-derived ACP as a key flight-related neuropeptide. ACP gene expression increased notably upon sustained flight. CRISPR/Cas9-mediated knockout of ACP gene and ACP receptor gene (ACPR) significantly abated prolonged flight of locusts. Transcriptomic and metabolomic analyses further revealed that genes and metabolites involved in fatty acid transport and oxidation were notably downregulated in the flight muscle of ACP mutants. Finally, we demonstrated that a fatty-acid-binding protein (FABP) mediated the effects of ACP in regulating muscle lipid metabolism during long-term flight in locusts. Our results elucidated a previously undescribed neuroendocrine mechanism underlying efficient energy utilization associated with long-term flight.

    1. Neuroscience
    Krishna N Badhiwala et al.
    Research Article

    Hydra vulgaris is an emerging model organism for neuroscience due to its small size, transparency, genetic tractability, and regenerative nervous system; however, fundamental properties of its sensorimotor behaviors remain unknown. Here, we use microfluidic devices combined with fluorescent calcium imaging and surgical resectioning to study how the diffuse nervous system coordinates Hydra's mechanosensory response. Mechanical stimuli cause animals to contract, and we find this response relies on at least two distinct networks of neurons in the oral and aboral regions of the animal. Different activity patterns arise in these networks depending on whether the animal is contracting spontaneously or contracting in response to mechanical stimulation. Together, these findings improve our understanding of how Hydra’s diffuse nervous system coordinates sensorimotor behaviors. These insights help reveal how sensory information is processed in an animal with a diffuse, radially symmetric neural architecture unlike the dense, bilaterally symmetric nervous systems found in most model organisms.