
Neurophysiological evidence of efference copies to inner speech

  1. Thomas J Whitford (corresponding author)
  2. Bradley N Jack
  3. Daniel Pearson
  4. Oren Griffiths
  5. David Luque
  6. Anthony WF Harris
  7. Kevin M Spencer
  8. Mike E Le Pelley
  1. University of New South Wales (UNSW Sydney), Australia
  2. Westmead Institute for Medical Research, Australia
  3. University of Malaga, Spain
  4. University of Sydney, Australia
  5. Veterans Affairs Boston Healthcare System, United States
  6. Harvard Medical School, United States
Research Article
Cite this article as: eLife 2017;6:e28197 doi: 10.7554/eLife.28197

Abstract

Efference copies refer to internal duplicates of movement-producing neural signals. Their primary function is to predict, and often suppress, the sensory consequences of willed movements. Efference copies have been almost exclusively investigated in the context of overt movements. The current electrophysiological study employed a novel design to show that inner speech – the silent production of words in one’s mind – is also associated with an efference copy. Participants produced an inner phoneme at a precisely specified time, at which an audible phoneme was concurrently presented. The production of the inner phoneme resulted in electrophysiological suppression, but only if the content of the inner phoneme matched the content of the audible phoneme. These results demonstrate that inner speech – a purely mental action – is associated with an efference copy with detailed auditory properties. These findings suggest that inner speech may ultimately reflect a special type of overt speech.

https://doi.org/10.7554/eLife.28197.001

eLife digest

As you read this text, the chances are you can hear your own inner voice narrating the words. You may hear your inner voice again when silently considering what to have for lunch, or imagining how a phone conversation this afternoon will play out. Estimates suggest that we spend at least a quarter of our lives listening to our own inner speech. But to what extent does the brain distinguish between inner speech and the sounds we produce when we speak out loud?

Listening to a recording of your own voice activates the brain more than hearing yourself speak out loud. This is because when the brain sends instructions to the lips, tongue, and vocal cords telling them to move, it also makes a copy of these instructions. This is known as an efference copy, and it enables regions of the brain that process sounds to predict what they are about to hear. When the actual sounds match those predicted – as when you hear yourself speak out loud – the brain’s sound-processing regions dampen down their responses.

But does the inner speech in our heads also generate an efference copy? To find out, Whitford et al. tracked the brain activity of healthy volunteers as they listened to speech sounds through headphones. While listening to the sounds, the volunteers had to produce either the same speech sound or a different speech sound inside their heads. A specific type of brain activity decreased whenever the inner speech sound matched the external speech sound. This decrease did not occur when the two sounds were different. This suggests that the brain produces an efference copy for inner speech similar to that for external speech.

These findings could ultimately benefit people who suffer from psychotic symptoms, for example as part of schizophrenia. Symptoms such as hearing voices are thought to reflect problems with producing and interpreting inner speech. The technique that Whitford et al. have developed will enable us to test this long-held but hitherto untestable idea. The results should increase our understanding of these symptoms and may eventually lead to new treatments.

https://doi.org/10.7554/eLife.28197.002

Introduction

Sensory attenuation – also known as self-suppression – refers to the phenomenon that self-generated sensations feel less salient, and evoke a smaller neurophysiological response, than externally-generated sensations which are physically identical (Hughes et al., 2013; Cardoso-Leite et al., 2010). Sensory attenuation is believed to result from the action of an internal forward model, or IFM (Blakemore et al., 2000a; Wolpert and Miall, 1996). According to this account, the sensory consequences of self-generated movements are predicted based on a copy of the outgoing motor command, known as an efference copy. These predicted sensations are compared to the actual sensations resulting from the movement, and the difference between predicted and actual sensation (i.e., the sensory discrepancy – Wolpert and Miall, 1996) is sent higher up the neuronal hierarchy for further processing (Seth and Friston, 2016). In the case of a self-generated movement, the internal prediction is able to account for, and ‘explain away’, much of the resulting sensation, which is why self-initiated sensations typically feel less salient, and evoke a smaller neurophysiological response, than externally-initiated sensations (Blakemore et al., 1998).

Sensory attenuation has been extensively studied in the context of overt speech production. Auditory stimuli elicit an electrophysiological brain response (the auditory-evoked potential) with a characteristic N1 component. The amplitude of this component is known to be sensitive to sound intensity; i.e., loud sounds evoke larger N1 amplitudes than soft sounds (Näätänen and Picton, 1987; Hegerl and Juckel, 1993). Numerous electroencephalographic (EEG) and magnetoencephalographic (MEG) studies have found that self-generated vocalizations elicit an N1 component (M100 in the MEG) of smaller amplitude than the N1 component elicited when passively listening to the same sounds (Ford et al., 2007a; Curio et al., 2000; Heinks-Maldonado et al., 2005; Oestreich et al., 2015; Houde et al., 2002). This phenomenon has been dubbed N1-suppression, and it suggests that self-generated sounds are processed as though they were physically softer than externally-generated sounds, reflecting the action of an IFM (Greenlee et al., 2011; Heinks-Maldonado et al., 2006). The suggestion that an IFM is responsible for sensory attenuation to overt speech is bolstered by the finding that experimentally altering auditory feedback (e.g., by pitch-shifting or delaying an individual’s voice, such that auditory sensations do not match the predictions of the IFM) reduces the amount of N1-suppression (Heinks-Maldonado et al., 2006; Behroozmand and Larson, 2011; Behroozmand et al., 2011; Aliu et al., 2009).

The central aim of the present study is to explore whether N1-suppression, which has consistently been observed in response to overt speech, also occurs in response to inner speech, which is a purely mental action. Inner speech – also known as covert speech, imagined speech, or verbal thoughts – refers to the silent production of words in one’s mind (Perrone-Bertolotti et al., 2014; Alderson-Day and Fernyhough, 2015). Inner speech is one of the most pervasive and ubiquitous of human activities; it has been estimated that most people spend at least a quarter of their lives engaged in inner speech (Heavey and Hurlburt, 2008). An influential account of inner speech suggests that it ultimately reflects a special case of overt speech in which the articulator organs (e.g., mouth, tongue, larynx) do not actually move; that is, inner speech is conceptualized as ‘a kind of action’ (Jones and Fernyhough, 2007, p.396 – see also Feinberg, 1978; Pickering and Garrod, 2013; Oppenheim and Dell, 2010). Support for this idea has been provided by studies showing that inner speech activates similar brain regions to overt speech, including audition and language-related perceptual areas and supplementary motor areas, but does not typically activate primary motor cortex (Palmer et al., 2001; Zatorre et al., 1996; Aleman et al., 2005; Shuster and Lemieux, 2005). While previous data suggest that inner and overt speech share neural generators, relatively few neurophysiological studies have explored the extent to which these two processes are functionally equivalent. If inner speech is indeed a special case of overt speech – ‘a kind of action’ – then it would also be expected to have an associated IFM (Tian and Poeppel, 2010 – see also Feinberg, 1978; Numminen and Curio, 1999; Whitford et al., 2012; Ford et al., 2001a).

A significant challenge in determining the existence (or otherwise) of an IFM to inner speech is that inner speech does not elicit a measurable auditory-evoked potential, which means that N1-suppression to inner speech cannot be determined directly using current methodologies. However, the existence of an IFM may be inferred based on how the production of inner speech suppresses the brain’s electrophysiological response to overt speech (Tian and Poeppel, 2010; Tian and Poeppel, 2015; Tian and Poeppel, 2013; Tian and Poeppel, 2012). A critical feature of the IFM associated with overt speech is that the efference copy (a) is time-locked to the onset of the action, and (b) contains specific predictions as to the expected sensory consequences of that action (i.e., is content-specific – Wolpert et al., 1995). Correspondingly, if inner speech were, in fact, associated with an IFM, then its associated efference copy would be expected to (a) be time-locked to the onset of the inner speech, and (b) contain information as to the specific content of the inner speech. In this case, engaging in inner speech would be expected to result in maximal N1-suppression to overt speech when two conditions were met: (1) the external sound was presented at precisely the same time as the inner speech was produced, and (2) the content of the external sound matched the content of the inner speech.

The present study introduces a new experimental procedure that allowed us to test whether inner speech produces N1-suppression to audible speech in the absence of any overt motor action. In this protocol, participants were instructed to produce a single phoneme in inner speech at a specific time, which was designated by means of a precise visual cue. At the same time, an audible phoneme was presented in participants’ headphones; the audible phoneme could be either the same as (Match condition) or different from (Mismatch condition) the inner phoneme. In the Passive condition, participants were instructed not to produce an inner phoneme. The results indicated that inner speech resulted in N1-suppression, but only if the content of the inner phoneme matched the content of the audible phoneme. These results suggest that inner speech production is associated with a time-locked and content-specific internal forward model, similar to the one that operates in the production of overt speech. Furthermore, these results suggest that inner speech, by itself, is able to elicit an efference copy and cause sensory attenuation, even in the absence of an overt motor action.

Results

Inner speech experiment

On each trial of the experiment, participants watched a short animation of approximately 5 s duration. As illustrated in Figure 1a, the animation depicted a red vertical line (the ‘fixation’ line) that remained in a fixed location in the middle of the screen. This fixation line was overlaid upon a thick green horizontal bar (the ‘ticker tape’). A second, green, vertical line (the ‘trigger’ line), which was embedded in the ticker tape, was initially presented at the far right-hand side of the screen. At the start of each trial, participants fixated their eyes on the fixation line. After an interval of 1–2 s, the green ticker tape and the green trigger line began to move leftwards across the screen towards, and ultimately beyond, the stationary fixation line. At the exact time at which the trigger line intersected the fixation line – the ‘sound-time’ – an audible phoneme was delivered to participants’ headphones (Figure 1c). The audible phoneme was a recording of a male speaker producing either the phoneme /BA/ or the phoneme /BI/. (Note: for clarity, audible phonemes are capitalized in text while inner phonemes are written in lower case. Our justification for choosing these two audible phonemes in particular is presented below). Participants were instructed to generate an inner phoneme at exactly the moment the fixation line intersected the trigger line (i.e., at the sound-time). The experimental manipulation was the content of the inner phoneme that participants were instructed to produce. There were three different types of trial blocks. In the first type of trial block, participants were asked to produce the inner phoneme /ba/ at the sound-time. The second type of trial block was identical, except that participants were asked to produce the inner phoneme /bi/. On each trial, participants were asked to imagine themselves moving their articulator organs (i.e., mouth, tongue, larynx, etc.) and vocalizing the inner phoneme, but not to actually make any movements. 
In the third type of trial block, participants were not instructed to produce an inner phoneme, but were instructed to simply listen to the sounds. Following the sound-time, the trigger line continued to move for an additional 1 s before a text-box was displayed and participants were asked to rate how successfully they followed the instructions on the trial.

Figure 1. A schematic of the experimental protocol.

Participants were instructed to fixate their eyes on the central red fixation line (Panel A). After a delay (1–2 s), the green trigger line, which was presented on the far right-hand side of the screen, and visible in participants’ peripheral vision, began to move smoothly across the screen in a leftwards direction at a speed of 6.5°/s (Panel B), such that after 3.75 s the trigger line overlapped with the fixation line. At this precise moment, dubbed the ‘sound-time’, two events occurred simultaneously (Panel C). Firstly, the participant was asked to imagine themselves producing a pre-defined phoneme in inner speech (either /ba/ or /bi/ or no inner phoneme). Secondly, an audible phoneme (either /BA/ or /BI/), produced by a male speaker, was delivered to the participant’s headphones. In Match trials (Panel D, top, blue), the inner phoneme was congruent with the audible phoneme (e.g., inner phoneme: /ba/; audible phoneme: /BA/). In Mismatch trials (Panel D, middle, red), the inner phoneme was incongruent with the audible phoneme (e.g., inner phoneme: /bi/; audible phoneme: /BA/). In Passive trials (Panel D, bottom, black), the participant did not produce an inner phoneme. Following the sound-time, the trigger line continued to move past the fixation line for an additional 1 s. The trial was then complete and the participant was asked to rate how successfully they managed to follow the instructions on that trial, on a scale from 1 (Not at all successful) to 5 (Completely successful).

https://doi.org/10.7554/eLife.28197.003
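The timing described in the Figure 1 caption can be laid out as a small calculation. The sketch below is illustrative only: the function and constant names are hypothetical, and the starting offset of the trigger line is derived here from the stated speed and travel time (assuming constant speed), not reported in the text.

```python
# Values taken from the Figure 1 caption.
SPEED_DEG_PER_S = 6.5   # speed of the trigger line
TRAVEL_TIME_S = 3.75    # time from movement onset to overlap with the fixation line
POST_SOUND_S = 1.0      # the line keeps moving for 1 s after the sound-time


def trial_timeline(delay_s):
    """Return key time points (in seconds from trial start) for one trial.

    delay_s is the initial stationary interval, stated as 1-2 s in the text.
    """
    if not 1.0 <= delay_s <= 2.0:
        raise ValueError("delay is expected to lie between 1 and 2 s")
    # Derived quantity (an assumption of constant speed): how far the trigger
    # line starts from the fixation line, in degrees of visual angle.
    start_offset_deg = SPEED_DEG_PER_S * TRAVEL_TIME_S
    sound_time_s = delay_s + TRAVEL_TIME_S
    return {
        "start_offset_deg": start_offset_deg,
        "sound_time_s": sound_time_s,
        "rating_prompt_s": sound_time_s + POST_SOUND_S,
    }


print(trial_timeline(1.5))
```

Under these assumptions the audible phoneme and the instructed inner phoneme both fall at `sound_time_s`, and the rating prompt follows one second later.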

The data were parsed into three discrete trial-types, which were analyzed as separate conditions: (1) Match trials, in which the inner phoneme matched the audible phoneme (i.e., inner phoneme: /ba/; audible phoneme: /BA/ OR inner phoneme: /bi/; audible phoneme: /BI/), (2) Mismatch trials, in which the inner phoneme did not match the audible phoneme (i.e., inner phoneme: /ba/; audible phoneme: /BI/ OR inner phoneme: /bi/; audible phoneme: /BA/), and (3) Passive trials, in which the participant was not instructed to imagine a phoneme (i.e., inner phoneme: none; audible phoneme: /BA/ OR inner phoneme: none; audible phoneme: /BI/).
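The parsing rule above amounts to a simple lookup on the pair (inner phoneme, audible phoneme). A minimal sketch follows; the function name and the string encoding of the phonemes are hypothetical and are not taken from the authors' analysis code.

```python
from typing import Optional


def classify_trial(inner: Optional[str], audible: str) -> str:
    """Label one trial as 'Match', 'Mismatch', or 'Passive'.

    inner   -- 'ba', 'bi', or None when no inner phoneme was instructed
    audible -- 'BA' or 'BI' (audible phonemes are written in capitals)
    """
    if audible not in ("BA", "BI"):
        raise ValueError(f"unexpected audible phoneme: {audible!r}")
    if inner is None:
        return "Passive"
    if inner not in ("ba", "bi"):
        raise ValueError(f"unexpected inner phoneme: {inner!r}")
    # The inner phoneme matches when it names the same syllable as the sound.
    return "Match" if inner.upper() == audible else "Mismatch"


trials = [("ba", "BA"), ("bi", "BA"), (None, "BI"), ("bi", "BI")]
print([classify_trial(inner, audible) for inner, audible in trials])
# ['Match', 'Mismatch', 'Passive', 'Match']
```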

Following pre-processing (see EEG Processing and Analysis for full details), EEG data epochs (time-locked to the onset of the audible phoneme) were averaged separately for each of the three conditions (Match, Mismatch, Passive). The dependent variable was the amplitude of the N1 component of the auditory-evoked potential elicited by the auditory phoneme.

The N1 peak was identified on each individual participant’s average waveform for each of the three conditions. Figure 2 shows the auditory-evoked potentials averaged across electrodes FCz, Fz, and Cz, as these were the electrodes at which N1 was maximal (see Figure 2a, voltage maps). Figure 2b shows a box-and-whiskers plot of raw N1 amplitudes for each condition (Match, Mismatch, Passive). Given that this experiment used a repeated measures design, Figure 2c shows a scatterplot of the magnitude of the within-subjects differences between conditions, which constitute the critical contrasts. These difference scores were approximately normally distributed with no clear outliers. Repeated measures ANOVA revealed a significant main effect of Condition (F(2,82) = 4.21, p = 0.018, ηp2 = 0.09) on the amplitude of the N1 peak. Analysis of simple effects revealed that N1-amplitude in the Match condition was significantly smaller than both the Mismatch (t(41) = 2.54, p = 0.015, dz = 0.39, CI(95%) = [0.187, 1.649]) and Passive (t(41) = 2.77, p = 0.008, dz = 0.43, CI(95%) = [0.278, 1.776]) conditions (Figure 2c). There was no difference in N1-amplitude between the Mismatch and Passive conditions (t(41) = 0.26, p = 0.800, dz = 0.04, CI(95%) = [−0.758, 0.977]).
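For readers who want to see how the simple-effects statistics relate to one another, the sketch below computes a paired-samples t statistic and Cohen's dz from within-subject difference scores. It assumes the standard definitions (dz is the mean difference divided by the standard deviation of the differences, and t equals dz multiplied by the square root of n); the amplitudes are made-up illustrative values, not data from the study.

```python
import math
import statistics


def paired_t_and_dz(cond_a, cond_b):
    """Return (t, dz, df) for a paired comparison of two conditions."""
    if len(cond_a) != len(cond_b):
        raise ValueError("both conditions need one value per participant")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    dz = mean_diff / sd_diff
    t = dz * math.sqrt(n)              # same as mean_diff / (sd_diff / sqrt(n))
    return t, dz, n - 1


# Hypothetical N1 amplitudes (microvolts) for five participants.
match = [-4.1, -3.0, -5.2, -2.8, -3.9]
mismatch = [-5.0, -3.6, -5.1, -4.0, -4.4]
t, dz, df = paired_t_and_dz(match, mismatch)
print(f"t({df}) = {t:.2f}, dz = {dz:.2f}")

# Consistency check on the reported Match vs Mismatch contrast:
# t(41) = 2.54 with n = 42 participants implies dz = t / sqrt(n).
print(round(2.54 / math.sqrt(42), 2))  # 0.39
```

The last line reproduces the reported dz of 0.39 from the reported t value, which is consistent with dz having been computed in this way.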

Figure 2. Inner speech experiment: N1 component analysis.

(A) Waveforms showing the auditory-evoked potentials elicited by the audible phonemes in the Match condition (blue line), Mismatch condition (red line) and Passive condition (black line). The N1-component is labelled; the waveforms were averaged across electrodes FCz, Fz, and Cz, as these were the electrodes at which the N1 component was maximal. The waveforms are shown collapsed across audible phoneme (audible /BA/ and /BI/), and the waveforms for the Match and Mismatch conditions are shown collapsed across inner phoneme (inner /ba/ and /bi/). Voltage maps are plotted separately for each condition; white dots illustrate the electrodes used in the analysis. (B) Box-and-whiskers plots showing the amplitude of the N1 component elicited by the audible phonemes in the Match, Mismatch and Passive conditions. The edges of the boxes represent the top and bottom quartiles, the horizontal stripe represents the median, the cross represents the mean, the whiskers represent the 9th and 91st percentiles, and the colored dots represent the participants whose data fell outside the range defined by the whiskers. (C) Scatterplots showing the within-subjects difference scores (in terms of N1-amplitude) for the three contrasts-of-interest in the inner speech experiment; namely Match minus Mismatch, Match minus Passive, and Mismatch minus Passive. These difference scores were approximately normally distributed with no clear outliers. Each dot represents a single participant’s difference score. The horizontal bars represent the mean, and the error bars represent the 95% confidence interval.

https://doi.org/10.7554/eLife.28197.004

As can be seen in Figure 2, while the topographies exhibited a fronto-central negativity in all three conditions, centered on electrode FCz, there was a hint of a leftward shift in the scalp distribution in the Match condition. Thus, in order to ensure the stability of the results, we performed a supplementary analysis with an expanded set of nine electrodes: specifically, Fz, FCz, Cz, F1, FC1, C1, F2, FC2, and C2. The pattern of results was identical to when the analysis was restricted to the three midline electrodes. Specifically, the main effect of Condition remained significant (F(2,82) = 3.61, p = 0.031, ηp2 = 0.08), as did the difference between the Match and Mismatch conditions (t(41) = 2.35, p = 0.024, dz = 0.36, CI(95%) = [0.122, 1.607]) and the Match and Passive conditions (t(41) = 2.57, p = 0.014, dz = 0.40, CI(95%) = [0.193, 1.624]). The difference between the Mismatch and Passive conditions remained non-significant (t(41) = 0.11, p = 0.916, dz = 0.02, CI(95%) = [−0.889, 0.800]). The Condition × Electrode interaction was also not significant (F(16,656) = 1.18, p = 0.323, ηp2 = 0.03).
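The reported partial eta squared values can be cross-checked against the F statistics. The sketch below assumes the standard identity for a single effect, ηp2 = (F × df_effect) / (F × df_effect + df_error); the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Main effect of Condition on N1 amplitude, three midline electrodes.
print(round(partial_eta_squared(4.21, 2, 82), 2))    # 0.09
# Main effect of Condition, expanded set of nine electrodes.
print(round(partial_eta_squared(3.61, 2, 82), 2))    # 0.08
# Condition x Electrode interaction.
print(round(partial_eta_squared(1.18, 16, 656), 2))  # 0.03
```

All three values agree with the effect sizes reported in the text to two decimal places.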

There was also a main effect of Condition on the latency of the N1 peak (F(2,82) = 5.03, p = 0.016, ηp2 = 0.11). Analysis of the simple effects revealed that the N1-peak occurred significantly earlier in the Match condition compared to the Passive condition (t(41) = 3.15, p = 0.003, dz = 0.49, CI(95%) = [−6.927, −1.518]). There was no significant difference in N1-latency between the Mismatch and Passive conditions (t(41) = 1.76, p = 0.087, dz = 0.27, CI(95%) = [−6.298, 0.441]), nor between the Match and Mismatch conditions (t(41) = 1.29, p = 0.204, dz = 0.20, CI(95%) = [−3.316, 0.729]).

A visual inspection of Figure 2 also suggested between-condition differences in the P2 (150–190 ms) and P3 (250–310 ms) components of the auditory-evoked potential. While these components were not directly relevant to our hypotheses, for completeness the data and analyses for these components are presented below.

The P2 component occurred around 150–190 ms post-sound (see Figure 3a), while the P3 component occurred around 250–310 ms post-sound (see Figure 4a). However, not all three conditions generated a distinct peak for the P2 and P3 components. Specifically, the Match and Mismatch conditions did not elicit a distinct P2, whereas the Passive condition did not exhibit a distinct P3. This meant that (unlike for analysis of the N1) it was not possible to use a peak-detection approach for these components. Instead, time-windows were identified for the P2 (150–190 ms) and P3 (250–310 ms) components, and the average voltage within each of these time-windows was analyzed (see EEG Processing and Analysis for more detail).
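The difference between the two measurement approaches can be illustrated with a short sketch. The P2 window (150–190 ms) comes from the text; the N1 search window used below (50–150 ms) is an assumption made purely for illustration, as are the function names and the toy waveform.

```python
def samples_in_window(times_ms, volts_uv, start_ms, end_ms):
    """Return (time, voltage) pairs whose time lies in [start_ms, end_ms]."""
    picked = [(t, v) for t, v in zip(times_ms, volts_uv) if start_ms <= t <= end_ms]
    if not picked:
        raise ValueError("no samples fall inside the requested window")
    return picked


def negative_peak(times_ms, volts_uv, start_ms, end_ms):
    """Peak detection: most negative voltage in the window and its latency."""
    latency, amplitude = min(
        samples_in_window(times_ms, volts_uv, start_ms, end_ms),
        key=lambda sample: sample[1],
    )
    return amplitude, latency


def mean_amplitude(times_ms, volts_uv, start_ms, end_ms):
    """Mean-window measure: average voltage across all samples in the window."""
    picked = samples_in_window(times_ms, volts_uv, start_ms, end_ms)
    return sum(v for _, v in picked) / len(picked)


# Toy waveform sampled every 10 ms: a sharp negative deflection at 100 ms
# and a flat positive plateau from 150 to 190 ms (no distinct peak).
times = list(range(0, 320, 10))
volts = [-5.0 if t == 100 else (2.0 if 150 <= t <= 190 else 0.0) for t in times]

print(negative_peak(times, volts, 50, 150))    # (-5.0, 100)
print(mean_amplitude(times, volts, 150, 190))  # 2.0
```

Peak detection needs a clear extremum to lock onto, whereas the mean-window measure remains well defined even when a condition produces only a plateau, which is the situation described for the P2 and P3.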

Figure 3a shows the average waveforms for the P2 component, averaged across the Cz, FCz, and CPz electrodes, and Figure 3b shows a box-and-whiskers plot of raw P2 amplitudes for each condition. One-way ANOVA revealed a main effect of Condition (F(2,82) = 6.60, p = 0.006, ηp2 = 0.14). Analysis of the simple effects revealed that amplitude of the P2 component was significantly smaller in the Mismatch condition, relative to both the Match (t(41) = 3.54, p = 0.001, dz = 0.55, CI(95%) = [0.555, 2.028]) and Passive (t(41) = 3.21, p = 0.003, dz = 0.50, CI(95%) = [−3.549, −0.810]) conditions (Figure 3c). The difference between the Match and Passive conditions was not significant (t(41) = 1.26, p = 0.216, dz = 0.19, CI(95%) = [−0.540, 2.316]).

Figure 3. Inner speech experiment: P2 component analysis.

(A) Waveforms showing the auditory-evoked potentials elicited by the audible phonemes in the Match condition (blue line), Mismatch condition (red line), and Passive condition (black line). The P2-component is labelled; P2 amplitude was calculated as the average voltage in the 150–190 ms time-window. The waveforms were averaged across electrodes Cz, FCz, and CPz, as these were the electrodes at which the P2 component was maximal. Voltage maps are plotted separately for each condition; white dots illustrate the electrodes used in the analysis. (B) Box-and-whiskers plots showing the amplitude of the P2 component elicited by the audible phonemes in the Match, Mismatch, and Passive conditions. The edges of the boxes represent the top and bottom quartiles, the horizontal stripe represents the median, the cross represents the mean, the whiskers represent the 9th and 91st percentiles, and the colored dots represent the participants whose raw data fell outside the range defined by the whiskers. (C) Scatterplots showing the within-subjects difference scores (in terms of P2-amplitude) for the three contrasts-of-interest in the inner speech experiment; namely Match minus Mismatch, Match minus Passive, and Mismatch minus Passive. These difference scores were approximately normally distributed with no clear outliers. Each dot represents a single participant’s difference score. The horizontal bars represent the mean, and the error bars represent the 95% confidence interval.

https://doi.org/10.7554/eLife.28197.006

Figure 4a shows the average waveforms for the P3 component, averaged across the CPz, Cz, and Pz electrodes, and Figure 4b shows a box-and-whiskers plot of raw P3 amplitudes for each condition. ANOVA revealed a main effect of Condition (F(2,82) = 5.86, p = 0.004, ηp2 = 0.13). Analysis of the simple effects revealed that the amplitude of the P3 component was significantly larger in the Match condition relative to both the Mismatch (t(41) = 2.23, p = 0.032, dz = 0.34, CI(95%) = [0.117, 2.433]) and Passive (t(41) = 3.26, p = 0.002, dz = 0.50, CI(95%) = [0.813, 3.444]) conditions (Figure 4c). There was no significant difference between the Passive and Mismatch conditions (t(41) = 1.31, p = 0.196, dz = 0.20, CI(95%) = [−0.458, 2.165]).

Figure 4. Inner speech experiment: P3 component analysis.

(A) Waveforms showing the auditory-evoked potentials elicited by the audible phonemes in the Match condition (blue line), Mismatch condition (red line), and Passive condition (black line). The P3-component is labelled; P3 amplitude was calculated as the average voltage in the 250–310 ms time-window. The waveforms were averaged across electrodes CPz, Cz, and Pz, as these were the electrodes at which the P3 component was maximal. Voltage maps are plotted separately for each condition; white dots illustrate the electrodes used in the analysis. (B) Box-and-whiskers plots showing the amplitude of the P3 component elicited by the audible phonemes in the Match, Mismatch, and Passive conditions. The edges of the boxes represent the top and bottom quartiles, the horizontal stripe represents the median, the cross represents the mean, the whiskers represent the 9th and 91st percentiles, and the colored dots represent the participants whose raw data fell outside the range defined by the whiskers. (C) Scatterplots showing the within-subjects difference scores (in terms of P3-amplitude) for the three contrasts-of-interest in the inner speech experiment; namely Match minus Mismatch, Match minus Passive, and Mismatch minus Passive. These difference scores were approximately normally distributed with no clear outliers. Each dot represents a single participant’s difference score. The horizontal bars represent the mean, and the error bars represent the 95% confidence interval.

https://doi.org/10.7554/eLife.28197.008

Overt speech experiment

To provide a point of reference for the inner speech experiment, we also conducted a follow-up ‘overt speech’ experiment. The overt speech experiment had an identical experimental procedure to the inner speech experiment, except that participants were instructed to overtly – as opposed to covertly – vocalize the phonemes at the sound-time. Just as in the inner speech experiment, an audible phoneme (i.e., /BA/ or /BI/) was delivered to participants’ headphones at the sound-time. Participants were instructed to vocalize the overt phonemes softly, so as to minimize the amount of bone conduction of the sound to the ear. An additional ‘Motor-Control’ condition was also included in which participants overtly vocalized the phonemes at the sound-time, but no audible phoneme was delivered. The purpose of this condition was to allow us to identify and correct for the electrophysiological activity generated by the motor act of producing the overt phoneme per se. This was done by subtracting participants’ activity in the Motor-Control condition from their waveforms in the active conditions (i.e., Match and Mismatch), as is common in sensory attenuation studies which compare motor-active and motor-passive conditions (Ford et al., 2014). The order of the conditions was randomized for each participant, with the caveat that the Motor-Control condition was always run last.
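The motor correction described above is a point-by-point subtraction of waveforms. A minimal sketch is given below, assuming each averaged waveform is stored as a list of voltages on a common time base; the function name and the sample values are hypothetical.

```python
def motor_correct(active_uv, motor_control_uv):
    """Subtract the Motor-Control waveform from an active-condition waveform.

    Both inputs are averaged waveforms (microvolts) for one participant,
    sampled at the same time points.
    """
    if len(active_uv) != len(motor_control_uv):
        raise ValueError("waveforms must have the same number of samples")
    return [a - m for a, m in zip(active_uv, motor_control_uv)]


# Hypothetical three-sample waveforms for one participant.
waveforms = {
    "Match": [0.0, -3.0, 1.0],
    "Mismatch": [0.5, -4.0, 1.5],
    "Passive": [0.0, -2.0, 1.0],
}
motor_control = [0.5, -1.0, 0.5]

# Only the active conditions involve a vocalization, so only they are corrected;
# the Passive waveform is carried over unchanged.
corrected = {
    name: motor_correct(wave, motor_control) if name in ("Match", "Mismatch") else list(wave)
    for name, wave in waveforms.items()
}
print(corrected["Match"])    # [-0.5, -2.0, 0.5]
print(corrected["Passive"])  # [0.0, -2.0, 1.0]
```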

Thirty individuals participated in the overt speech experiment. Participants’ mean age was 25.0 years (SD = 6.0) and 20 were female. Thirty-three participants were originally recruited for the study; however, three participants generated ≤60 usable epochs in one or more conditions (based on the exclusion criteria described for the inner speech experiment) and were excluded from further analysis. Figure 5 shows the auditory-evoked potentials averaged across electrodes FCz, Fz, and Cz for the uncorrected (Figure 5a) and motor-corrected data (Figure 5b). Voltage-maps are presented for the motor-corrected data. Raw N1 amplitudes for each motor-corrected condition are presented as box-and-whiskers plots in Figure 5c, and Figure 5d shows scatterplots of the (within-subjects) difference scores between the conditions.

Figure 5. Overt speech experiment: N1 component analysis.

The experimental protocol for the overt speech experiment was identical to the inner speech experiment except that participants were required to overtly (as opposed to covertly) vocalize the phoneme at the sound-time. (A) Uncorrected waveforms showing the auditory-evoked potentials elicited by the audible phonemes in the Match condition (blue line), Mismatch condition (red line), and Passive condition (black line) in the overt speech experiment. The waveform for the motor-control condition is also shown (green line: in this condition participants overtly vocalized a phoneme at the sound-time, but no audible phoneme was delivered). The N1-component is labelled; the waveforms were averaged across electrodes FCz, Fz, and Cz, as these were the electrodes at which the N1 component was maximal. The waveforms are shown collapsed across audible phoneme (audible /BA/ and /BI/), and the waveforms for the Match, Mismatch, and Motor Control conditions are shown collapsed across vocalized phoneme (overt /ba/ and /bi/). (B) Motor-corrected waveforms showing the auditory-evoked potentials elicited by the audible phonemes in the Match condition (blue line), Mismatch condition (red line), and Passive condition (black line) in the overt speech experiment. The motor-corrected waveforms were generated by subtracting the activity generated in the motor-control condition from each participant’s Match and Mismatch waveforms. Voltage maps are plotted separately for each condition; white dots illustrate the electrodes used in the analysis. (C) Box-and-whiskers plots showing the amplitude of the N1 component elicited by the audible phonemes in the Match, Mismatch, and Passive conditions in the overt speech experiment, using motor-corrected data for the Match and Mismatch conditions.
The edges of the boxes represent the top and bottom quartiles, the horizontal stripe represents the median, the cross represents the mean, the whiskers represent the 9th and 91st percentiles, and the colored dots represent the participants whose raw data fell outside the range defined by the whiskers. (D) Scatterplots showing the within-subject difference scores (in terms of N1-amplitude) for the three contrasts-of-interest in the overt speech experiment; namely Match minus Mismatch, Match minus Passive, and Mismatch minus Passive. These difference scores were approximately normally distributed with no clear outliers. Each dot represents a single participant’s difference score. The horizontal bars represent the mean, and the error bars represent the 95% confidence interval.

https://doi.org/10.7554/eLife.28197.010

For the uncorrected data, repeated-measures ANOVA revealed a significant main effect of Condition (F(2,58) = 10.99, p < 0.001, ηp2 = 0.28) on the amplitude of the N1 peak. Critically, analysis of simple effects revealed that N1-amplitude in the Match condition was significantly smaller than the Mismatch condition (t(29) = 2.61, p = 0.014, dz = 0.48, CI(95%) = [0.472, 3.897]), consistent with the results of the inner speech experiment. However, contrary to the results of the inner speech experiment, N1-amplitude in the Passive condition was significantly smaller than both the Match (t(29) = 2.63, p = 0.013, dz = 0.48, CI(95%) = [0.646, 5.137]) and Mismatch (t(29) = 3.97, p < 0.001, dz = 0.72, CI(95%) = [2.461, 7.691]) conditions in the overt speech experiment (see Figure 5a).

The pattern of results was identical for the motor-corrected data. Repeated-measures ANOVA revealed a significant main effect of Condition (F(2,58) = 8.43, p = 0.001, ηp2 = 0.23). Analysis of simple effects revealed that N1-amplitude in the Match condition was significantly smaller than the Mismatch condition (t(29) = 2.46, p = 0.020, dz = 0.45, CI(95%) = [0.384, 4.190]), and that N1-amplitude in the Passive condition was significantly smaller than both the Match (t(29) = 2.20, p = 0.036, dz = 0.40, CI(95%) = [0.150, 4.243]) and Mismatch (t(29) = 3.43, p = 0.002, dz = 0.63, CI(95%) = [1.808, 7.159]) conditions (see Figure 5b, c and d).
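The reported effect sizes can be related to the t-statistics with a short calculation. The sketch below is illustrative only and is not the authors' analysis code: it assumes that dz is the standardized mean of the within-subject difference scores, in which case dz = t/√n for a paired-samples t-test (e.g., 2.61/√30 ≈ 0.48). The function names and the example difference scores are hypothetical.

```python
import math
import statistics


def paired_stats(diffs):
    """Return (mean, sd, t, dz) for a list of within-subject difference scores."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)        # sample SD (n - 1 denominator)
    t = mean / (sd / math.sqrt(n))      # paired-samples t, df = n - 1
    dz = mean / sd                      # standardized mean difference
    return mean, sd, t, dz


def dz_from_t(t, n):
    """Recover dz from a paired-samples t-statistic and the sample size."""
    return t / math.sqrt(n)


# Made-up difference scores, used only to show the relationship t = dz * sqrt(n).
example_diffs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, sd, t, dz = paired_stats(example_diffs)
```

Calling `dz_from_t(2.61, 30)` gives a value that rounds to the 0.48 reported for the uncorrected Match versus Mismatch contrast.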

Selecting the audible phonemes for the inner and overt speech experiments

In order to select which two audible phonemes would be presented to participants in the inner and overt speech experiments, we presented nine phonemes to 10 participants (age = 18.7 years, SD = 1.1; seven female) while they listened passively. Each phoneme was presented 90 times, and the presentation order was randomized. The nine phonemes were: /BA/, /BI/, /DA/, /DI/, /GA/, /KI/, /PA/, /PI/, and /TI/. Each phoneme was ~200 ms in duration, presented at ~70 dB SPL, and was produced by the same male speaker.

Waveforms showing the auditory-evoked potentials elicited by the nine phonemes are presented in Figure 6. The waveforms are shown collapsed across electrodes FCz, Cz, and Fz.

Auditory-evoked potentials elicited by nine different phonemes; namely: /BA/, /BI/, /DA/, /DI/, /GA/, /KI/, /PA/, /PI/, and /TI/.

Each phoneme was ~200 ms in duration, presented at ~70 dB SPL, and was produced by the same male speaker. Each phoneme was presented 90 times; the presentation order was randomized. Participants were instructed to simply sit quietly and listen to the phonemes. Of the nine different phonemes, /BA/ and /BI/ were judged to be most similar in terms of their amplitude and overall shape, and hence these phonemes were chosen to be used as the audible phonemes in both the inner and overt speech experiments.

https://doi.org/10.7554/eLife.28197.012

Of the nine different phonemes, /BA/ and /BI/ were judged to be most similar in terms of their amplitude and overall shape, and hence these phonemes were chosen for use as the audible phonemes in both the inner and overt speech experiments.

Discussion

The present study used a novel experimental protocol to demonstrate that the production of an inner phoneme resulted in sensory attenuation of the auditory-evoked potential elicited by a simultaneously-presented audible phoneme, in the absence of any overt motor action. Crucially, the production of inner speech did not result in equal sensory attenuation to all sounds; sensory attenuation was dependent on the content of the inner phoneme matching the content of the audible phoneme. These results suggest that inner speech production is associated with a time-locked and content-specific internal forward model, similar to the one believed to operate in the production of overt speech (Hickok, 2012; Tourville and Guenther, 2011; Hickok and Poeppel, 2004; Houde et al., 2013). In short, the results of this study suggest that inner speech alone is able to elicit an efference copy and cause sensory attenuation of audible sounds. 

The key finding of the present study was that the production of inner speech, by itself, led to N1-suppression to an audible sound. N1-suppression has been reported many times previously in response to overt speech (Ford et al., 2007a; Curio et al., 2000; Heinks-Maldonado et al., 2005; Oestreich et al., 2015; Houde et al., 2002). There is strong evidence that the mechanistic basis of N1-suppression to overt speech involves an efference-copy mediated IFM (Eliades and Wang, 2003; Crapse and Sommer, 2008; Rauschecker and Scott, 2009). It thus seems reasonable to assume that a similar mechanism underlies sensory attenuation in the case of inner speech. The idea that inner speech shares mechanistic features with overt speech is consistent with the conceptualization of inner speech ‘as a kind of action’ (Jones and Fernyhough, 2007, p.396) and, more generally, with Hughlings Jackson’s belief that thinking is merely our most complex motor act: ‘sensori-motor processes… form the anatomical strata of mental states’ (Hughlings Jackson, 1958, p.49). In the words of Oppenheim and Dell (2010, p.1158), ‘inner speech cannot be independent of the movements that a person would use to express it’.

It was notable that inner speech production resulted in N1-suppression if – and only if – the content of the inner phoneme matched the content of the audible phoneme; that is, only in the Match condition. That said, we note that comparisons between the Match/Mismatch conditions on the one hand, and the Passive condition on the other, should be treated with a degree of caution. This is because data for these conditions came from different trial blocks, and (most importantly) differed in terms of the task that participants were required to perform (i.e., produce a covert/overt phoneme at the precise sound-time, versus passively listen to the audible phoneme). Notwithstanding the fact that the N1-suppression effect has previously been found to be robust to variations in attention (Timm et al., 2013; Saupe et al., 2013) and trial structure (Baess et al., 2011), this nevertheless raises the possibility that differences in participants’ attention or task-preparation may have contributed to the observed differences between the ‘active’ conditions and the passive condition. We note that this limitation is not restricted to the current study – it potentially applies to any procedure that attempts to measure sensory suppression by comparing active and passive conditions, which constitutes the vast majority of studies that have examined sensory suppression to overt speech. Critically, however, this limitation does not apply to the key contrast between the Match and Mismatch conditions. This is because data for these conditions came from the same trial blocks, in which participants were required to perform exactly the same task on each trial. 
For example, in blocks in which participants were required to produce the inner phoneme /ba/, they experienced both Match trials (in which the audible phoneme was /BA/) and Mismatch trials (in which the audible phoneme was /BI/), and there was no way for participants to predict whether the current trial would be a Match or Mismatch trial prior to the onset of the audible phoneme. Hence attention, task-preparation, etc., during the pre-stimulus period could not differ systematically between Match and Mismatch trials. Our factorial design also ensured that the differences between the Match and Mismatch conditions were not driven by differences in the inner phoneme (i.e., /ba/ vs /bi/) or differences in the audible phoneme (i.e., /BA/ vs /BI/). Consequently, we can conclude with confidence that the observed difference in N1 amplitude between Match and Mismatch trials reflects the impact of participants’ inner speech on their sensory response to the audible phoneme. It is this contrast that demonstrates that inner speech, like overt speech, is associated with a precise, content-specific efference copy.

The results of the inner speech experiment mirror those of previous studies which have found sensory attenuation to overt speech to be reduced or eliminated if auditory feedback deviates from what would normally be expected; e.g., by pitch-shifting the feedback or providing foreign-language feedback (Heinks-Maldonado et al., 2006; Behroozmand and Larson, 2011; Behroozmand et al., 2011; Behroozmand et al., 2016; Larson and Robin, 2016). To confirm this pattern using the current procedure, we performed a supplementary experiment in which participants were required to overtly vocalize the phoneme at the sound-time. The key result from the overt speech experiment was that N1-amplitude elicited by an audible phoneme was significantly smaller when participants simultaneously produced the same phoneme in their overt speech (Match condition), compared to when they produced a different phoneme in their overt speech (Mismatch condition). This result, which is consistent with numerous previous studies in the sensory attenuation literature (Heinks-Maldonado et al., 2006; Behroozmand and Larson, 2011; Behroozmand et al., 2011; Behroozmand et al., 2016; Larson and Robin, 2016), was identical to the corresponding contrast in the inner speech experiment, in which participants had to produce phonemes in inner, as opposed to overt, speech.

A second notable finding of the overt speech experiment was that N1-amplitude was significantly smaller in the Passive condition compared to both active conditions (i.e., Match and Mismatch). A potential ‘low-level’ explanation for this result lies in the fact that participants were required to make an overt motor action in the overt speech experiment but not the inner speech experiment. While the marked difference in pre-stimulus activity between the active and passive conditions (Figure 5a) is consistent with this explanation, the fact that between-condition differences remained when correcting for the motor-associated activity (Figure 5b) stands against this possibility (though see Horváth, 2015; Sams et al., 2005 for a discussion of the challenges associated with motor correction when comparing active and passive conditions in studies of sensory attenuation). An alternative explanation for why N1 was smallest in the Passive condition is based on the idea that the auditory N1-component is (in addition to sound intensity, discussed previously) also sensitive to stimulus predictability, with predictable sounds evoking a smaller N1 than unpredictable sounds (Behroozmand and Larson, 2011; Bäss et al., 2008; Lange, 2011). Our task differed from other willed vocalization tasks in the literature in that the audible phoneme delivered to the headphones was: (a) of a different person’s voice, and (b) much louder than the actual vocalization, as participants were instructed to vocalize quietly in order to minimize bone conduction. In other words, a substantial discrepancy between the predicted and actual sound existed, even in the Match condition. This sensory discrepancy was even larger in the Mismatch condition, as the content of the sound was also different. 
Consistent with the idea that N1-amplitude is sensitive to stimulus predictability, it is possible that the larger N1-amplitude in the active compared to the passive conditions was due to prediction-errors as to the expected quality of the audible phoneme. It is further possible that such detailed predictions as to phoneme quality do not occur in the context of inner speech (a suggestion for which there is some empirical support – Oppenheim and Dell, 2008), which may account for why N1-amplitude was not reduced in the passive condition in the inner speech experiment.

In summary, both the inner speech and overt speech experiments showed the same basic pattern of results with respect to the key contrast: N1-magnitude was smaller if the phoneme generated by the participant (either covertly or overtly) matched the audible phoneme than if it mismatched. These findings suggest that inner speech – like overt speech – is associated with a precise, content-specific efference copy, as opposed to a generic and non-specific prediction. Taken together, our results provide support for the contention that inner speech is a special case of overt speech, which does not have an associated motor act. The notion that inner speech generates an IFM in the absence of an overt motor act has been hypothesized previously across several different literatures (Jones and Fernyhough, 2007; Feinberg, 1978; Scott, 2013; Guenther and Hickok, 2015; Ford et al., 2001b; Sams et al., 2005; Kauramäki et al., 2010). However, this hypothesis has been notoriously difficult to test empirically, due to the covert nature of inner speech. Ford et al. (Ford and Mathalon, 2004) played participants repeated sentences over 30 s and asked them to reproduce the same sentences in inner speech. Ford et al. found that the sentences elicited a smaller N1-component when participants engaged in inner speech compared to when they did not, consistent with the results of the present study. More recently, Tian and Poeppel (Tian and Poeppel, 2010) used MEG to show that the auditory cortex was activated immediately following production of an inner phoneme in the absence of auditory feedback, which they took as evidence of an inner-speech-initiated efference copy. In a subsequent study, which was a strong influence on our own, Tian and Poeppel (Tian and Poeppel, 2013) asked participants to produce an inner phoneme within a 2.4 s window. This window was followed by an audible phoneme that could either match or mismatch the content of the inner phoneme. 
The authors found no difference in the amplitude of the M100 between the match and mismatch conditions, inconsistent with the results of the present study. However, given the width of the temporal window in which participants were asked to produce their inner phoneme (2.4 s), the efference copy and auditory feedback would not necessarily be expected to coincide under these conditions, in which case M100-suppression would not be expected to occur. Tian and Poeppel (Tian and Poeppel, 2015) asked participants to signal their production of an inner phoneme via a button-press, and measured the amplitude of the M100 component evoked by a pre-recorded audible phoneme of their own voice which matched the content of the inner phoneme, but which could be pitch-shifted or delayed. They found evidence of M100 suppression to unshifted, undelayed audible phonemes relative to a passive baseline condition, consistent with the results of the present study. Pitch-shifting or delaying the auditory phonemes was found to increase M100 amplitude above baseline levels. While this study’s design enabled the timing of the inner phoneme to be precisely specified, the fact that it was specified by means of an overt motor action (i.e., a button-press), which is known to be associated with N1-suppression per se (Hughes et al., 2013), raises the possibility of the motor action and inner speech being confounded. Finally, Ylinen et al. (Ylinen et al., 2015) asked participants to mentally rehearse tri-syllabic pseudowords in inner speech. After several mental repetitions, an audible pseudoword was played which had either matching or mismatching beginnings and endings to the rehearsed pseudoword. The results revealed that audible syllables that were concordant with participants’ inner speech elicited less MEG activity than discordant syllables, a result which is broadly consistent with the results of the present study.

The current experiment holds some methodological advantages over previous designs. Firstly, the experimental stimuli (animation, audible phonemes, rating-scale, etc.) were physically identical across all conditions, as was the nature of participants’ task (i.e., to fixate on the screen and produce an inner phoneme at a designated time). The only thing that differed between the different trial-types was the inner phoneme that participants were asked to produce. This meant that the observed differences in sensory attenuation could not have been due to any physical differences between the conditions (Luck, 2005). Secondly, the fact that it was impossible to predict which of the two audible phonemes would be presented on any given trial meant that it was impossible to distinguish Match from Mismatch trials a priori. This meant that the observed results could not have been due to between-condition differences in, for example, demand characteristics. Thirdly, the ‘ticker tape’ feature of the current protocol enabled participants to very accurately time-lock the onset of their inner phoneme to match the onset of the external sound. In the current protocol, the position of the trigger line refreshed every 8.3 ms, which presumably enabled participants to time the onset of their inner phoneme far more accurately than would be possible with a countdown sequence, mental rehearsal or open temporal window, such as have been used in previous studies (Tian and Poeppel, 2013; Ford et al., 2001b; Ylinen et al., 2015). Finally, the current protocol did not require participants to make an active movement to signal the onset of their inner phoneme, such as by pressing a button. 
This is a significant advantage over previous studies which have employed a motor condition to signal the onset of covert actions, as it avoids the potential confound associated with having temporally-overlapping auditory-evoked and motor-evoked potentials – see Horváth (Horváth, 2015) and Neszmélyi and Horváth (Neszmélyi and Horváth, 2017) for a discussion of the challenges associated with ‘correcting’ for motor activity in studies of sensory attenuation. In light of its methodological features, the present study provides arguably the strongest evidence to date that inner speech results in sensory attenuation of the N1-component of the auditory-evoked potential, even in the absence of an overt motor response. Perhaps the most important strength of this paradigm is that all the above issues were controlled within a single task, thereby removing any reliance on cross-experimental inferences.

This study’s focus on the N1 component is consistent with the majority of the existing literature on electrophysiological sensory attenuation. The rationale for focusing on N1 lies in the fact that the amplitude of this component is volume dependent; that is, other things being equal, loud sounds evoke N1-components of larger amplitude than do soft sounds (Näätänen and Picton, 1987; Hegerl and Juckel, 1993). In prior studies of N1-suppression, participants have typically generated sounds through overt actions such as overt speech, button-presses etc. The observation of N1-suppression in such studies thus implies that the brain processes self-generated sounds as though they were physically softer than identical external sounds. The N1-suppression demonstrated in the present study extends this idea by suggesting that the brain also processes imagined sounds as though they were physically softer than identical, unimagined sounds. In addition to providing evidence that inner speech is associated with an IFM of similar nature to overt speech, this finding provides evidence that mental state influences perception at a fundamental level (Gregory, 1997).

With regard to the question of what mechanism could underlie sensory attenuation to inner speech: a recent study by Niziolek, Nagarajan and Houde (Niziolek et al., 2013) on sensory attenuation in the context of overt speech production found that the degree of sensory attenuation was stronger when participants produced vowel sounds that were closer (in terms of their acoustic properties) to their median production of these sounds, compared to when they produced vowel sounds that were, for them, less typical. These results suggest that the efference copy associated with overt speech production represents a sensory goal (i.e., ‘a prototypical production at the center of a vowel’s formant distribution’, p. 16115), and that the distance (in formant space) of any given utterance from this ‘sensory prototype’ determines its degree of sensory attenuation. If inner speech is, in fact, a special case of overt speech (as we have suggested above), then this raises the question of the nature of the sensory goal in the context of inner speech production. One possibility, based on the results of Niziolek et al., is that in the present study (in ‘imagine /ba/’ trials, for example) the sensory goal was of a prototypical /ba/, which was presumably covertly ‘spoken’ in the participant’s own voice (though see below for a discussion of the validity of this assumption). In this case, the participant’s prototypical /ba/ would never match perfectly with the audible phoneme, as the audible phoneme would never be the participant’s own voice.

The fact that the present study observed N1-suppression in the Match condition but not the Mismatch condition is nevertheless consistent with the account of Niziolek et al. (2013), in that the distance, in formant space, between an inner /ba/ and an audible /BA/ would be smaller than the distance between an inner /bi/ and an audible /BA/, even though the inner phoneme did not match the audible phoneme perfectly in either case (as the audible phoneme was always produced by the same unknown speaker). Also consistent with this idea is the fact that, while Niziolek et al. (2013) observed maximal levels of sensory attenuation to prototypical vowel sounds, they still observed significant (i.e., non-zero) levels of sensory attenuation to atypical vowel sounds. A prediction of this account is that participants should show even greater levels of sensory attenuation in the Match condition if the audible phoneme is presented in their own voice rather than the voice of an unknown stranger; testing this prediction may be a worthwhile endeavor in future studies.

With regard to the assumption that the sensory goals of inner speech are the same as overt speech, and that a person’s inner voice is the same as their actual voice, there is some evidence in support of this conjecture: Filik and Barber (2011) provided evidence that people produce inner speech in the same regional accent as their overt speech. However, other studies have reported evidence suggesting that inner speech has impoverished acoustic properties relative to overt speech (Oppenheim and Dell, 2008). It is also possible that inner speech can consist of several distinct ‘voices’, with each having specific auditory properties; the fact that people with auditory-verbal hallucinations often report hearing multiple voices with different auditory properties (McCarthy-Jones et al., 2014) is consistent with this idea, if – as discussed further below – auditory-verbal hallucinations ultimately reflect inner speech being misperceived as overt speech. Finally, it is also possible that the acoustic properties of inner speech are not fixed. Specifically, in the context of the present study, it is possible that the acoustic properties of the audible phonemes began to influence the inner phonemes, such that after numerous repetitions, participants began to imagine themselves producing an inner phoneme with the acoustic properties of the audible phoneme. Testing these possibilities may also be worthwhile in future studies.

While the primary focus of the paper was on the N1-component of the auditory-evoked potential, between-condition differences were also observed in the amplitude of the P2 and P3 components (see Figures 3 and 4). A likely explanation for the observed results in these later components involves another ERP component, the N2, whose spatial and temporal distribution typically overlaps with that of the P2 (Griffiths et al., 2016). The N2 and P3 components are among the most heavily investigated components in the ERP literature (Näätänen and Picton, 1986; Polich, 2007), and are typically elicited by tasks – such as the auditory oddball and Go/NoGo tasks – in which the participant is asked to identify (by means of a button-press, for example) ‘target’ stimuli which are nested among ‘non-target’ stimuli (Smith et al., 2010; Spencer et al., 1999). Critically, the N2 and P3 can also be elicited by tasks in which a mental response is required, such as when target stimuli have to be mentally counted as opposed to signaled with a button-press (Mertens and Polich, 1997). We suggest that, in the two inner speech conditions of our study, participants made a mental response – possibly a ‘template-matching response’ along the lines of whether the audible phoneme matched their inner phoneme (Griffiths et al., 2016) – which they did not make in the Passive condition. In this case, the audible phoneme in the inner speech conditions might be expected to elicit an N2 and a P3 component, which would not be present in the Passive condition. The occurrence of a (negative-going) N2 in the inner speech condition would then interact and compete with the expression of a (positive-going) P2 component elicited by the audible phoneme. The result would be the absence of a distinct P2, but presence of a P3, in the inner speech conditions but not the Passive condition – as observed empirically. 
It is also worth noting that Tian and Poeppel (Tian and Poeppel, 2013) observed a larger M200 component in their match vs. mismatch comparison, consistent with the enhanced P2 observed in the Match vs. Mismatch comparison in the present study. Taken together, these results suggest that the M200/P2 component may index something other than a sensory prediction, possibly involving a cognitive ‘template matching’ process.

The implied existence of an efference copy to inner speech holds important implications for how to best understand some of the psychotic symptoms associated with schizophrenia. Some of the most characteristic of these symptoms seem to reflect the patient misattributing, to external agents, self-generated motor actions (e.g., delusions of control) and self-generated thoughts (e.g., delusions of thought insertion, auditory-verbal hallucinations – Feinberg, 1978; Frith, 1987). An influential account of these experiences argues that they arise because of an abnormality in the IFM associated with both physical and mental actions (Feinberg, 1978; Frith, 1992). This IFM abnormality leads to an inability to predict and suppress the consequences of self-generated actions, which leads to confusion as to their origins. This hypothesis has a strong theoretical foundation: for example, the distinctive symptom of thought echo, in which the patient hears their own thoughts spoken out loud by an external voice, can be well explained as the patient’s own inner speech being misattributed and misperceived as an external voice (Frith, 1992). However, while numerous studies have provided empirical evidence showing that schizophrenia patients exhibit subnormal levels of sensory attenuation to their own physical actions (Whitford et al., 2011; Blakemore et al., 2000b; Shergill et al., 2005), including subnormal levels of N1-suppression to overt speech (Ford et al., 2001a; Ford et al., 2007b; Ford et al., 2001c), there is little empirical evidence that schizophrenia patients show sensory attenuation deficits to self-generated mental actions such as inner speech. Furthermore, the few studies that did report sensory attenuation deficits to inner speech in patients with schizophrenia did not include a ‘mismatch’ condition, raising the possibility that these sensory attenuation deficits ultimately reflect attentional deficits in the patient group (Ford and Mathalon, 2004; Ford et al., 2001d). 
We suggest that the failure to identify electrophysiological sensory attenuation deficits to inner speech in schizophrenia patients is not because the deficits do not exist, but rather because previous experimental protocols have been insufficiently sensitive to detect them. In order to maximize the chances of detecting sensory attenuation deficits to inner speech in schizophrenia patients, we suggest that future experiments should: (a) ensure that the onset of inner speech is precisely time-locked to the audible sound, without reverting to using a willed action (e.g., a button-press) to signal inner speech-onset, as this could potentially lead to the auditory and motor responses being confounded; (b) limit the content of inner speech to phonemes rather than entire sentences as this enables the onset and content of the inner speech to be more tightly controlled; (c) investigate patients exhibiting those symptoms that seem to most clearly reflect misperceived inner speech (e.g., thought echo, other auditory-verbal hallucinations), rather than grouping patients with different symptom profiles into a clinically-heterogeneous ‘schizophrenia’ group. By providing an optimized protocol for quantifying N1-suppression to inner speech, our hope is that the present study can provide a methodological framework for identifying and assessing sensory attenuation deficits in inner speech in patients with schizophrenia.

In conclusion, the present study demonstrated that engaging in inner speech resulted in sensory attenuation (specifically, N1-suppression) of the electroencephalographic activity evoked by an audible phoneme, but only if the content of inner speech matched the content of the audible phoneme. These results suggest that inner speech evokes an efference-copy-mediated IFM that is both content-specific and time-locked to the onset of inner speech, consistent with the existing literature on sensory attenuation to overt speech. Cumulatively, this implies that inner speech may ultimately be ‘a kind of action’, and a special case of overt speech, as long suggested by prominent models of language. Accordingly, these findings not only provide insight into the nature of inner speech, but also provide an experimental framework for investigating sensory attenuation deficits in inner speech, such as have been proposed to underlie some psychotic symptoms in patients with schizophrenia.

Materials and methods

Participants

Forty-two healthy individuals participated in the inner speech experiment. Participants’ mean age was 23.4 years (SD = 7.3) and 24 were female. Fifty participants were originally recruited for the study; however, eight participants generated ≤ 60 usable epochs in one or more conditions and were excluded from further analysis – see EEG Processing and Analysis for further details. The study was conducted at the University of New South Wales (UNSW Sydney; Sydney, Australia), and approved by the UNSW Human Research Ethics Advisory Panel (Psychology).

Procedure

Participants were seated in a quiet, dimly-lit room, approximately 60 cm from a computer monitor (BenQ XL2420T, 1920 × 1080 pixels, 144 Hz), and were fitted with headphones (AKG K77 Perception) and an EEG recording cap. EEG was recorded with a BioSemi ActiveTwo system from 64 Ag/AgCl active electrodes placed according to the extended 10–20 system. A vertical electro-oculogram was calculated by recording from an electrode placed below the left eye, and subtracting its activity from that of electrode FP1; a horizontal EOG was recorded by placing an electrode on the outer canthus of each eye. We also placed an electrode on the tip of the nose, on the left and right mastoid, and on the masseter muscle to detect jaw movements. During data acquisition, the reference was composed of CMS and DRL sites, and the sampling rate was 2048 Hz.

In the animation that participants viewed on each experimental trial, the ticker tape moved at a constant velocity of 6.5°/s, which meant that it took 3.75 s until the trigger line intersected the fixation line. The ticker tape was marked with labels ‘3’, ‘2’ and ‘1’ that passed the fixation line 3 s, 2 s, and 1 s prior to the trigger line (see Figure 1b). The two audible phonemes, /BA/ and /BI/, were selected on the basis of a pilot study which indicated that amongst nine candidate audible phonemes (/BA/, /BI/, /DA/, /DI/, /GA/, /KI/, /PA/, /PI/, /TI/), the two audible phonemes /BA/ and /BI/ produced auditory-evoked potentials that were most similar in terms of their amplitude and overall shape (see Figure 6). The two audible phonemes were produced by the same male speaker, and were similar in terms of their loudness (~70 dB SPL) and duration (~200 ms).
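As a worked check of the animation timing, the distance covered by the tape follows directly from the stated velocity and duration. The symbols d, v, t, and d_k are introduced here for illustration only, and the calculation assumes the tape moved at constant velocity for the full 3.75 s:

```latex
d = v\,t = 6.5^{\circ}/\mathrm{s} \times 3.75\ \mathrm{s} \approx 24.4^{\circ},
\qquad
d_k = v\,k \;\Rightarrow\; d_3 = 19.5^{\circ},\quad d_2 = 13.0^{\circ},\quad d_1 = 6.5^{\circ}
```

Here d_k is the distance on the tape between the label ‘k’ and the trigger line, so that the label passes the fixation line k seconds before the trigger line does.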

There were 60 trials in each trial block. Participants were instructed to fix their gaze on the fixation line on every trial. At the start of each block, participants were told that on every trial of that block they should produce a particular inner phoneme (either /ba/ or /bi/) at the exact moment the trigger line intersected the fixation line, or (in Passive blocks) that they should simply listen to the audible sound and not try to imagine anything. Each audible phoneme (/BA/ and /BI/) was presented on 50% of trials within each trial block, and the order was randomized for each participant. This meant that, in those blocks in which participants were instructed to generate a particular inner phoneme (i.e., active blocks), on half of trials their inner phoneme matched the audible phoneme, while on half of trials it mismatched. Following each trial, participants were asked to rate their success in imagining the instructed inner phoneme at the sound-time (or in not imagining anything in the Passive condition). These ratings were made on a scale from 1 (‘Not at all successful’) to 5 (‘Completely successful’), and were reported using the computer keypad. Participants’ average ratings were 4.09 out of 5 (SD = 0.65) for Match trials, 4.01 (SD = 0.67) for Mismatch trials, and 4.87 (SD = 0.40) for Passive trials. The order of the trial blocks (imagine /ba/, imagine /bi/, or passive) was randomized for each participant. Each block took approximately 7 min to complete, and was repeated twice over the course of the experiment. Stimulus presentation was controlled by MATLAB (MathWorks, Natick, MA), using Psychophysics Toolbox extensions (Brainard, 1997; Kleiner et al., 2007).
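The block structure described above can be sketched in a few lines. This is a hypothetical illustration in Python, not the MATLAB/Psychophysics Toolbox code used in the study; the function names are invented, and the sketch assumes that all six blocks (three block types, each run twice) were shuffled together, which the text does not specify.

```python
import random

AUDIBLE = ("BA", "BI")


def make_block(inner_phoneme, n_trials=60, rng=None):
    """Build one block: each audible phoneme on 50% of trials, in random order.

    inner_phoneme is "ba", "bi", or None for a Passive block.
    Returns a list of (audible_phoneme, condition) tuples.
    """
    if rng is None:
        rng = random.Random()
    audible = [p for p in AUDIBLE for _ in range(n_trials // len(AUDIBLE))]
    rng.shuffle(audible)
    trials = []
    for sound in audible:
        if inner_phoneme is None:
            condition = "Passive"
        elif sound.lower() == inner_phoneme:
            condition = "Match"
        else:
            condition = "Mismatch"
        trials.append((sound, condition))
    return trials


def make_session(rng=None):
    """Six blocks: imagine /ba/, imagine /bi/, and passive, each run twice."""
    if rng is None:
        rng = random.Random()
    block_types = ["ba", "bi", None] * 2  # assumption: all six shuffled together
    rng.shuffle(block_types)
    return [make_block(b, rng=rng) for b in block_types]
```

Under these assumptions a session contains 120 Match, 120 Mismatch, and 120 Passive trials. Because the audible phoneme is drawn independently of the instructed inner phoneme, Match and Mismatch trials are interleaved unpredictably within each active block.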

EEG processing and analysis

Data pre-processing and analysis were performed in BrainVision Analyzer (Brain Products GmbH, Munich, Germany). The EEG data were re-referenced offline to the nose electrode. Data were first notch filtered (50 Hz) to remove mains artefact, and then band-pass filtered from 0.1 to 30 Hz using a phase-shift free Butterworth filter (48 dB/Oct slope). The filtered data were separated into 800 ms epochs (200 ms prior to sound onset, 600 ms post-onset), and baseline corrected to the mean voltage from –100 to 0 ms. The epochs were corrected for eye-movement artefacts using the technique described by Gratton et al. (1983), and any epoch with a signal exceeding a peak-to-peak amplitude of 200 µV for any channel was excluded. To ensure data quality, epochs were classified as unusable and excluded prior to analysis if they failed to meet the above criterion, or if the participant rated their success on the trial as ≤2 out of 5. The remaining usable epochs were included in the analysis and used to make the average waveforms for the three conditions. There were an average of 103.1 (SD = 14.8) usable epochs in the Match condition (of a max. possible 120), 97.0 (SD = 18.3) in the Mismatch condition, and 107.2 (SD = 17.6) in the Passive condition.
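The epoching, baseline-correction and exclusion rules above can be illustrated with a minimal sketch. The analysis itself was run in BrainVision Analyzer, so the code below is not that software's implementation: the function names are hypothetical, the 1000 Hz sampling rate is an assumption (no rate is stated in this section), and the filtering and eye-movement correction steps are omitted.

```python
def extract_epoch(samples, onset, fs, pre_ms=200, post_ms=600):
    """Cut one epoch around a sound onset (an index into `samples`)."""
    pre = int(round(pre_ms * fs / 1000))
    post = int(round(post_ms * fs / 1000))
    return samples[onset - pre: onset + post]

def baseline_correct(epoch, fs, pre_ms=200, base_ms=(-100, 0)):
    """Subtract the mean voltage of the baseline window from every sample."""
    start = int(round((base_ms[0] + pre_ms) * fs / 1000))
    stop = int(round((base_ms[1] + pre_ms) * fs / 1000))
    baseline = sum(epoch[start:stop]) / (stop - start)
    return [v - baseline for v in epoch]

def is_usable(channel_epochs, rating, max_ptp_uv=200.0, min_rating=3):
    """Apply both exclusion rules: self-rated success and peak-to-peak amplitude."""
    if rating < min_rating:  # ratings of 1 or 2 (i.e., <= 2) are excluded
        return False
    return all(max(ep) - min(ep) <= max_ptp_uv for ep in channel_epochs.values())

FS = 1000            # Hz; assumed sampling rate, not stated in this section
samples = [5.0] * 2000
samples[1100] = -20.0  # toy deflection 100 ms after the onset at index 1000
epoch = baseline_correct(extract_epoch(samples, 1000, FS), FS)

print(len(epoch))                            # 800
print(is_usable({"FCz": epoch}, rating=4))   # True
print(is_usable({"FCz": epoch}, rating=2))   # False
```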

The amplitude of the N1-component of the auditory-evoked potential was the primary dependent variable. The N1 peak was identified on each participant’s average waveform (Whitford et al., 2011; Ford et al., 2007b) as the most negative local minimum in the window 25–175 ms post-stimulus onset. The auditory N1-component typically has a fronto-central topography (Näätänen and Picton, 1987), which was verified in the current data: N1 was maximal at electrode FCz (see Figure 2a). Supplementary analyses were also performed on the P2 and P3 components in the inner speech experiment. As it was not possible to use a peak-detection approach for these components (as not all conditions exhibited a clear P2 and P3 peak), time-windows were identified for the P2 (150–190 ms) and P3 (250–310 ms) components. Average voltage within these time-windows was the dependent variable in these supplementary analyses.
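The two amplitude measures described above (peak picking for N1, mean voltage in a fixed window for P2 and P3) can be sketched as follows. The function names, the assumed 1000 Hz sampling rate and the synthetic waveform are all illustrative assumptions; this is not the routine used in BrainVision Analyzer.

```python
import math

def pick_n1(waveform, fs, pre_ms=200, window_ms=(25, 175)):
    """Return (latency_ms, amplitude) of the most negative local minimum
    within the search window, or None if the window has no local minimum."""
    start = int(round((window_ms[0] + pre_ms) * fs / 1000))
    stop = int(round((window_ms[1] + pre_ms) * fs / 1000))
    best = None
    for i in range(start, stop + 1):
        is_local_min = waveform[i] < waveform[i - 1] and waveform[i] <= waveform[i + 1]
        if is_local_min and (best is None or waveform[i] < waveform[best]):
            best = i
    if best is None:
        return None
    return (best * 1000 / fs - pre_ms, waveform[best])

def mean_amplitude(waveform, fs, window_ms, pre_ms=200):
    """Mean voltage in a fixed post-stimulus window (used for P2 and P3)."""
    start = int(round((window_ms[0] + pre_ms) * fs / 1000))
    stop = int(round((window_ms[1] + pre_ms) * fs / 1000))
    segment = waveform[start:stop]
    return sum(segment) / len(segment)

FS = 1000  # Hz; assumed sampling rate
# Synthetic average waveform: a -5 µV trough centred 100 ms after sound onset.
waveform = [-5.0 * math.exp(-(((i - 200) - 100) / 20.0) ** 2) for i in range(800)]

print(pick_n1(waveform, FS))  # (100.0, -5.0)
```

Restricting the search to local minima, rather than taking the lowest sample in the window, avoids selecting a window edge that merely sits on the slope of a neighbouring component.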

Statistical analysis

Data were analyzed using repeated-measures ANOVA, with one factor, Condition (three levels: Match, Mismatch and Passive). In the case of a main effect of Condition, contrasts were used to unpack the simple effects. The Greenhouse-Geisser correction was used in the case of a violation of the assumption of sphericity. N1 was maximal at electrode FCz for all three conditions; however, in order to improve the reliability of the analysis, the data were averaged across FCz and neighboring electrodes Fz and Cz (Näätänen and Picton, 1987; Woods, 1995). All of the relevant statistics remained significant when analysis was restricted to electrode FCz.
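For readers unfamiliar with the analysis, the sketch below shows how the F statistic and partial eta squared of a one-way repeated-measures ANOVA are computed, after first averaging across the three electrodes. It is a textbook calculation on toy numbers, not the authors' analysis script: the function names and data are hypothetical, and the p-value, the Greenhouse-Geisser correction and the follow-up contrasts are omitted.

```python
def average_electrodes(amplitudes, electrodes=("FCz", "Fz", "Cz")):
    """Average one participant's amplitudes across the chosen electrodes."""
    return sum(amplitudes[e] for e in electrodes) / len(electrodes)

def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA.

    data: one row per participant, one column per condition.
    Returns F, its degrees of freedom, and partial eta squared.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj  # condition-by-participant residual

    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    eta_p2 = ss_cond / (ss_cond + ss_error)
    return {"F": f_value, "df": (df_cond, df_error), "eta_p2": eta_p2}

# Toy data: three participants x three conditions (arbitrary units).
toy = [[1.0, 3.0, 5.0],
       [2.0, 3.0, 4.0],
       [3.0, 3.0, 3.0]]
result = rm_anova_oneway(toy)
print(result["df"], round(result["F"], 3), round(result["eta_p2"], 3))  # (2, 4) 3.0 0.6
```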

With regards to the supplementary analyses on the P2 and P3 components: the P2 component (150–190 ms) was maximal at electrode Cz; the data were collapsed across Cz and neighboring electrodes FCz and CPz for the statistical analysis. The P3 component (250–310 ms) was maximal at electrode CPz; the data were collapsed across CPz and neighboring electrodes Cz and Pz for the statistical analysis.

Due to the novelty of the paradigm, it was not possible to obtain precise estimates of the expected effect size. Thus we powered our study to detect a small effect size, based on the heuristic provided by Cohen (1969); our sample size of 42 provided adequate power (1 − β = 0.8) to detect a small effect size (ηp2 = 0.04) at α = 0.05. The power analysis was conducted with G*Power software (version 3.1.9.2; Faul et al., 2007). Each experiment was performed only once.
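G*Power takes Cohen's f as its effect-size input for ANOVA designs, so a partial eta squared must first be converted. The snippet below shows only that standard conversion; the function name is hypothetical, and the power calculation itself (which requires the noncentral F distribution) is not reproduced here.

```python
import math

def partial_eta_sq_to_f(eta_p2):
    """Convert partial eta squared to Cohen's f: f = sqrt(eta / (1 - eta))."""
    return math.sqrt(eta_p2 / (1.0 - eta_p2))

cohen_f = partial_eta_sq_to_f(0.04)
print(round(cohen_f, 3))  # 0.204
```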

References

  14. J Cohen (1969) Statistical Power Analysis for the Behavioural Sciences. New York: Academic Press.
  30. CD Frith (1992) The Cognitive Neuropsychology of Schizophrenia. Hove: Lawrence Erlbaum Associates.
  33. RL Gregory (1997) Knowledge in perception and illusion. Philosophical Transactions of the Royal Society B: Biological Sciences 352:1121–1127. https://doi.org/10.1098/rstb.1997.0095
  46. J Hughlings Jackson (1958) Selected Writings of John Hughlings Jackson. New York, NY: Basic Books.
  51. SJ Luck (2005) An Introduction to the Event-Related Potential Technique. Cambridge: MIT Press.
  57. R Näätänen, TW Picton (1986) N2 and automatic versus controlled processes. Electroencephalography and Clinical Neurophysiology. Supplement 38:169–186.
  71. AK Seth, KJ Friston (2016) Active interoceptive inference and the emotional brain. Philosophical Transactions of the Royal Society B: Biological Sciences 371:20160007. https://doi.org/10.1098/rstb.2016.0007
  85. DM Wolpert, RC Miall (1996) Forward Models for Physiological Motor Control. Neural Networks: The Official Journal of the International Neural Network Society 9:1265–1279. https://doi.org/10.1016/S0893-6080(96)00035-4
  86. DL Woods (1995) The component structure of the N1 wave of the human auditory evoked potential. Electroencephalography and Clinical Neurophysiology. Supplement 44:102–109.

Decision letter

  1. Richard Ivry
    Reviewing Editor; University of California, Berkeley, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Neurophysiological evidence of efference copies to inner speech" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Richard Ivry, serving here as both Senior Editor and Reviewing Editor. The following individual involved in review of your submission has agreed to reveal his identity: Chi-Ming Chen (Reviewer #1).

We have discussed the reviews with one another and I have drafted this decision to help you prepare a revised submission.

Summary:

The study employed electrophysiological methods to explore how the brain responds to covert actions, drawing on a substantial literature showing that the sensory response to self-produced movements is attenuated, presumably due to efference copy signals. The focus here is on an early ERP response, the N100. In a clever design, participants are presented with a visual cue that alerts them to an auditory presentation of a CV sound (/bi/ or /ba/). They are instructed to covertly produce either a /bi/ or a /ba/ (depending on the block) via a moving visual cue that enabled the temporal alignment of the inner speech with the external stimulus. As such, they create conditions in which the covert speech either matches or does not match the stimulus. The results show a reduced N1 amplitude for the match conditions compared to both the mismatch and a control, passive condition. A supplemental experiment using overt speech shows a reduced N1 amplitude for the match condition compared to the mismatch condition. These results make a compelling case for the operation of an efference copy process during inner speech.

Major comments for revisions:

1) There was concern about your use of a blocked design, with all of the reviewers noting that there are multiple advantages to a mixed design. In the end, we decided that this was not of critical concern, especially since the focus here is on the N1; factors such as a shift in attentional set might be more likely to impact the later components. This could, in part, explain the >200 ms results, but could also affect global amplitude/power differences, and these could potentially lead to spurious results in the N1 analysis. One way of perhaps mitigating these concerns would be to assess amplitude/power differences outside of the epochs of interest (e.g. during the baseline) and see if there are significant differences between conditions (i.e. the passive and the match/mismatch). This also provides a check on whether there may be any issues with the block design contributing to the N1 results, something we don't anticipate to be a problem, but would like to have assessed.

2) "Figure 2 shows the auditory-evoked potentials averaged across electrodes FCz, Fz and Cz, as these were the electrodes at which N1 was maximal (see Figure 2B, voltage maps)." The voltage montage maps clearly show the negative/blue/N1 maximal at the left hemisphere, not at the averaged midline electrodes, Fz, FCz, and Cz. This is especially evident in the Match condition, where the negative/blue/N1 maximal is clearly at the C3 location. Please clarify this issue.

3) Related to this, you have written, "The auditory N1-component typically has a fronto-central topography [Ford et al., 2014], which was verified in the current data: N1 was maximal at electrode FCz (see Figure 2C)." However, there is no Figure 2C in this version of the manuscript. It may be that the current Figure 2B's voltage maps were not updated from a previous version (e.g., these maps were from data that were not re-referenced by nose electrode)? However, if the maps are correct, the statements in the manuscript must be wrong and it would not justify using the averaged ERP from Fz, FCz, and Cz.

4) It's hard to tell to what extent this is testing the production of an efference copy since the presented stimuli are not the subjects' own voice. While you refer to "content of the audible phoneme" (Discussion, third paragraph), little effort is made to unpack this statement. We imagine you are arguing that the matching component of the efference copy works in some sort of abstract space (i.e. comparing the phonemic representation rather than, say, the motor or sensory goal), but this seems not to be the case in the literature (see Niziolek et al. 2013). This point requires further commentary. What is the actual mechanism at work here? What is the mapping between an external stimulus and an efference copy when you attempt to compare someone else's speech with your own self-produced speech?

5) While we appreciate the focus on the internal speech experiment, we felt that the control experiment with overt speech provided a nice point of contrast. As things now stand, this experiment is buried in the Discussion. Knowing that some readers never make it that far, we would like the overt speech experiment presented in the Results section.

https://doi.org/10.7554/eLife.28197.021

Author response

Major comments for revisions:

1) There was concern about your use of a blocked design, with all of the reviewers noting that there are multiple advantages to a mixed design. In the end, we decided that this was not of critical concern, especially since the focus here is on the N1; factors such as a shift in attentional set might be more likely to impact the later components. This could, in part, explain the >200 ms results, but could also affect global amplitude/power differences, and these could potentially lead to spurious results in the N1 analysis. One way of perhaps mitigating these concerns would be to assess amplitude/power differences outside of the epochs of interest (e.g. during the baseline) and see if there are significant differences between conditions (i.e. the passive and the match/mismatch). This also provides a check on whether there may be any issues with the block design contributing to the N1 results, something we don't anticipate to be a problem, but would like to have assessed.

Thank you for raising this point. We designed the experiment such that while participants were required to perform the same mental action on every trial in each (60-trial) block, the audible phoneme presented to their headphones (i.e., either /BA/ or /BI/) was unpredictable for any given trial. The upshot of this design is that, while the Passive trials were effectively ‘blocked’, the Match and Mismatch trials were ‘mixed’, and thus any differences between the Match and Mismatch conditions could not have been due to extraneous factors such as between-condition differences in attention (as highlighted in the previous version of the manuscript). This is critical, because the Match versus Mismatch comparison constituted the key contrast around which the vast majority of the Discussion was framed: it is this contrast which demonstrates most clearly that inner speech is associated with a precise, content-specific efference copy.

Our rationale for designing the experiment in this way was that we were concerned a fully-mixed design would be confusing to participants, as it would require them receiving different instructions as to what mental action they should perform on each individual trial. We were concerned that having instructions that changed every trial could lead to participants forgetting or ignoring the instructions, and/or limit participants’ ability to produce the inner phonemes clearly. However, we agree that our design could potentially have led to unanticipated differences between the Passive condition and the two inner speech conditions (i.e., Match and Mismatch), possibly due to factors such as attention.

As per your suggestion, one way of mitigating this concern would be to compare the Passive condition with the inner speech conditions across the pre-stimulus period. As illustrated by the overlapping waveforms in the baseline period in Figure 2A, there were no significant differences between the three conditions at any point in the original -200 to 0 ms baseline (which was baseline corrected to the average voltage in the period -100 to 0 ms). However, in order to gain a more global perspective of pre-stimulus activity, we extended the pre-stimulus period all the way out to -3000 ms (i.e., 3000 ms pre-stimulus), and baseline corrected to the average voltage in the period -3000 to -2500 ms. While the waveforms for the three conditions largely overlapped for the first ~1900 ms of the extended epoch, at around 1100 ms pre-stimulus, the waveforms for the two inner-speech conditions (i.e., Match and Mismatch) began to diverge from the Passive waveform, and began to exhibit a slow, negative-going deflection – see Author response image 1, left hand panel.

We believe that this deflection may reflect a readiness-potential (RP) that is associated with the participant preparing to generate the inner phoneme. The basis for this assertion is that the negative deflection we observed is similar – in terms of its global form, onset, and scalp distribution – to the negative deflection which has previously been reported in studies of overt speech production, and which has been previously argued to reflect an RP (e.g., see McArdle et al., 2014, Clinical Neurophysiology, 120, 275-284; Jansen et al., 2014, Journal of Neuroscience Methods, 232, 24-29; Wohlert, 1993, Journal of Speech and Hearing Research, 36, 897-905; Deecke et al., 1986, Experimental Brain Research, 65, 219-223). Specifically, the negative deflection we observed had the same basic form (i.e., a slow, monotonic, negative-going waveform), a similar onset (1000+ ms pre-stimulus), and a similar topography (a prominent central negativity) as the negative deflection reported in these previous studies of overt speech production. Furthermore, we also observed a similar negative deflection in the extended pre-stimulus period in our overt speech experiment, which directly replicated the results of the aforementioned studies. The negative deflection we observed in the overt speech experiment had a similar form and topography to the deflection observed in the inner speech experiment (compare the left and right hand panels in Author response image 1), though it had a slightly delayed onset relative to the inner speech experiment, and was of larger magnitude (note the different scales between the panels).

The identification of an RP to inner speech – a purely mental action – would be noteworthy, given that this component has traditionally been assumed to require an overt motor action. Identifying an RP to inner speech would also provide further evidence for our paper’s central tenet, which is that inner speech is ultimately processed by the brain “as a kind of action”, and may in fact reflect a special case of overt speech.

However, we urge caution in making this interpretation, as it is not possible to definitively determine whether the negative deflection observed in the current data is, in fact, an RP. Specifically, while the observed negative deflection could well reflect an RP, it could also potentially reflect a Contingent Negative Variation (CNV), or perhaps even a combination of these two components. The CNV is a slow, negative-going waveform that is elicited when “a warning stimulus (S1) announces that, within a few seconds, an imperative stimulus (S2) will arrive, asking for a quick response” (Brunia et al., 2012, p. 189). On the face of it, the ticker tape feature of the present design, which announced the impending arrival of the imperative trigger line, would seem to provide fertile ground for the generation of a CNV. Furthermore, the RP to overt speech and the CNV are known to have a similar global form (i.e., a slow, negative-going waveform in the pre-stimulus period), a similar onset (i.e., 1000+ ms, pre-stimulus), and a similar scalp distribution (i.e., a bilateral central topography). The upshot of this is that determining the relative contributions of the RP and CNV to the negative deflection observed in the present study is not feasible, post hoc. For this reason – and also so as not to distract from the paper’s central finding, namely N1-suppression to inner speech – we have decided not to include a description of the pre-stimulus activity in the extended baseline period in the revised manuscript. However, we would, of course, be happy to add these data to the revised manuscript, if you think they would be of interest to readers.

A crucial point to note is that regardless of whether the negative deflection reflected an RP, a CNV, or a combination of both components, it is almost certainly not responsible for the between-condition differences in N1-amplitude we observed. The reason we say this is that we were able to remove the negative deflection in the Match and Mismatch conditions by applying a 1 Hz high-pass filter (phase-shift-free Butterworth filter, 12 dB/octave slope), as can be seen below in Author response image 2, by comparing the left hand (unfiltered) and right hand (filtered) panels.

Critically, the high-pass filter did not alter N1 amplitude for any condition in any appreciable way, and nor did it alter the pattern of results, as can be seen in Author response image 3 by comparing the left and right hand panels. Thus, we are confident that the observed between-condition differences in N1 amplitude were not driven by between-condition differences in the pre-stimulus period.
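The logic of this check, namely that a 1 Hz high-pass filter removes a slow offset while leaving faster activity essentially unchanged, can be illustrated on synthetic data. The sketch below is an assumption-laden stand-in rather than the BrainVision Analyzer filter: it uses a second-order Butterworth section applied forwards and backwards to cancel phase shift (which steepens the effective roll-off beyond that of a single pass), an assumed 1000 Hz sampling rate, and hypothetical function names.

```python
import math

def butter2_highpass_coeffs(cutoff_hz, fs):
    """Second-order Butterworth high-pass biquad via the bilinear transform."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    q = 1.0 / math.sqrt(2.0)               # Butterworth quality factor
    alpha = math.sin(w0) / (2.0 * q)
    cosw = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 + cosw) / (2.0 * a0), -(1.0 + cosw) / a0, (1.0 + cosw) / (2.0 * a0)]
    a = [1.0, -2.0 * cosw / a0, (1.0 - alpha) / a0]
    return b, a

def lfilter(b, a, x):
    """Apply the biquad once, starting from zero initial conditions."""
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def zero_phase(b, a, x):
    """Filter forwards, then backwards, so that the phase shifts cancel."""
    forward = lfilter(b, a, x)
    backward = lfilter(b, a, forward[::-1])
    return backward[::-1]

FS = 1000.0  # Hz; assumed sampling rate
fast = [math.sin(2.0 * math.pi * 10.0 * i / FS) for i in range(6000)]  # 10 Hz activity
signal = [-5.0 + v for v in fast]            # plus a constant -5 µV offset

b, a = butter2_highpass_coeffs(1.0, FS)
filtered = zero_phase(b, a, signal)

# Away from the edges, the output tracks the 10 Hz component and the offset is gone.
max_err = max(abs(filtered[i] - fast[i]) for i in range(2500, 3500))
print(max_err < 0.01)  # True
```

The comparison is restricted to the middle of the signal because the filter starts from zero initial conditions, which produces a transient at each edge.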

An important point to note here, which is particularly relevant to the issues discussed in the opening paragraph, is that the negative deflection in the pre-stimulus period is far more likely to be due to an RP/CNV in the ‘active’ conditions, as opposed to between-condition differences in block-type. This is because the negative deflection in the pre-stimulus period was also observed in the Motor-Control condition in the overt speech experiment; that is, a condition in which participants vocalized a phoneme at the sound-time, but no audible phoneme was delivered. This was in contrast to the Passive condition in which, to reiterate, the negative deflection did not occur. The significance of this point is that both the Motor-Control and Passive conditions were fully blocked. The fact that an RP was present in the Motor-Control condition (which was blocked), but not present in the Passive condition (which was also blocked), suggests that the observed between-condition differences in pre-stimulus activity were not due to the differences in block-type, but were rather due to the electrophysiological consequences of action preparation (both mental [inner speech] and physical [overt speech]), and/or stimulus anticipation. Finally, we would again emphasize that the Passive condition is in some ways fundamentally different from both the Match and Mismatch conditions (for both the inner and overt speech experiments), in that the latter conditions involved the performance of a mental or physical ‘action’ (i.e., the production of an inner or overt phoneme), while the Passive condition did not. The Passive condition can thus only ever provide a rough ‘point-of-reference’ for the active conditions. The key comparison, we emphasize, was between the Match and Mismatch conditions, and these conditions were matched in terms of both the active processes involved (i.e., both involved the same mental or physical action), and how the trials were presented (i.e., a mixed design).

We now discuss the potential issues associated with between-condition differences in block-type explicitly, in the following additions to the manuscript text:

[Results, footnote]: The decision was taken to block trials in this way in order to make the task more ergonomic for participants, by having them perform the same task (imagine /ba/, imagine /bi/, or passively listen) on every trial of a block. […] Importantly, however, our procedure meant that the critical Match and Mismatch trials were fully and unpredictably intermixed in those blocks in which participants were required to produce an inner phoneme. We return to this issue in the Discussion.

[Discussion]: That said, we note that comparisons between the Match/Mismatch conditions on the one hand, and the Passive condition on the other, should be treated with a degree of caution. […] It is this contrast that demonstrates that inner speech, like overt speech, is associated with a precise, content-specific efference copy.

[Discussion footnote]: We note that this limitation is not restricted to the current study – it potentially applies to any procedure that attempts to measure sensory suppression by comparing active and passive conditions, which constitutes the vast majority of studies that have previously examined sensory suppression to overt speech.

2) "Figure 2 shows the auditory-evoked potentials averaged across electrodes FCz, Fz and Cz, as these were the electrodes at which N1 was maximal (see Figure 2B, voltage maps)." The voltage montage maps clearly show the negative/blue/N1 maximal at the left hemisphere, not at the averaged midline electrodes, Fz, FCz, and Cz. This is especially evident in the Match condition, where the negative/blue/N1 maximal is clearly at the C3 location. Please clarify this issue.

Thank you for identifying this apparent contradiction. When preparing our response to this point, we began by checking the N1-amplitudes at all of the electrodes, and found that the N1-amplitude was indeed maximal at electrode FCz for all three conditions; i.e., consistent with what we claimed in the text, but inconsistent with the voltage maps. However, we then realized the source of our error: while the N1 amplitude and latency data described in the manuscript (on which the statistics were based) were calculated from participants’ peak-picked data, the voltage maps were calculated based on a time-window around the N1-peak on the grand-average waveform. In other words, the data on which the statistics were calculated were not the same data from which the voltage maps were generated. We thus modified the N1 voltage maps so that they were calculated based on the same data that were analyzed statistically (i.e., participants’ peak-picked data). The revised voltage maps showed the expected pattern: that is, N1 was maximal at FCz in all three conditions – see the revised voltage maps in Figure 2.

We repeated this approach for the N1-analysis in the overt speech experiment and again, the results revealed a maximum at in all three conditions – see the revised voltage maps in Figure 3.

As can be seen in the revised Figure 2 (top), the revised topographies retained a slight leftwards-shift in the inner speech experiment, at least for the Match condition. Thus in order to ensure the stability of the results, we re-ran the analyses with an expanded set of electrodes to ensure that we captured the region in which N1 was maximal: specifically, in addition to the three midline electrodes (Fz, FCz, Cz), we entered the three electrodes that were immediately to the left (F1, FC1, C1) and right (F2, FC2, C2) of these midline electrodes. The observed pattern of results remained identical when using this expanded set of electrodes. This supplementary analysis is now included in the Results section of the revised manuscript:

[Results]: As can be seen in Figure 2, while the topographies exhibited a fronto-central negativity in all three conditions, centered on electrode FCz, there was a hint of a leftward shift in the scalp distribution in the Match condition. […] The difference between the Mismatch and Passive conditions remained non-significant (t(41) = 0.11, p = .916, dz = 0.02, CI(95%) = [-0.889, 0.800]). The Condition × Electrode interaction was also not significant (F(16,656) = 1.18, p = .323, ηp2 = 0.028).

3) Related to this, you have written, "The auditory N1-component typically has a fronto-central topography [Ford et al., 2014], which was verified in the current data: N1 was maximal at electrode FCz (see Figure 2C)." However, there is no Figure 2C in this version of the manuscript. It may be that the current Figure 2B's voltage maps were not updated from a previous version (e.g., these maps were from data that were not re-referenced by nose electrode)? However, if the maps are correct, the statements in the manuscript must be wrong and it would not justify using the averaged ERP from Fz, FCz, and Cz.

As discussed in our response to the previous point, we can confirm that the N1 component was indeed maximal at FCz for all three conditions. However, as discussed above, the voltage maps in Figure 2 were incorrect in the original manuscript, as they were based on a time-window taken from the grand-average waveform, as opposed to the peak-picked data on which the analyses were based. We have rectified this issue in the revised manuscript and now present the correct topographies.

4) It's hard to tell to what extent this is testing the production of an efference copy since the presented stimuli are not the subjects' own voice. While you refer to "content of the audible phoneme" (Discussion, third paragraph), little effort is made to unpack this statement. We imagine you are arguing that the matching component of the efference copy works in some sort of abstract space (i.e. comparing the phonemic representation rather than, say, the motor or sensory goal), but this seems not to be the case in the literature (see Niziolek et al. 2013). This point requires further commentary. What is the actual mechanism at work here? What is the mapping between an external stimulus and an efference copy when you attempt to compare someone else's speech with your own self-produced speech?

Thank you for raising this interesting and important issue. The study of Niziolek et al. (2013) provided evidence that the efference copy associated with the overt production of speech (in this case, a vowel sound) predicted “a prototypical production at the center of a vowel’s formant distribution” (p. 16115). Furthermore, they found that the degree of sensory attenuation was related to a spoken sound’s distance (i.e., in formant space) from this prototype; i.e., sounds closer to an individual’s median production were attenuated more than sounds further away.

If inner speech is, in fact, a special case of overt speech – as we have suggested in the manuscript – then this raises a question as to the nature of the sensory prototype to inner speech. Specifically, with regards to the present study, it raises the question as to why the Match condition showed sensory attenuation, given that the audible phoneme was not presented in the participant’s own voice, and thus could never perfectly match the sensory prototype. (This argument assumes that the sensory prototype of inner speech has the auditory properties of the participant’s own voice; we discuss this issue further below).

We have attempted to address these important theoretical issues in detail in the following addition to the Discussion section. In a nutshell, we suggest that if the sensory prototype is the participant’s own voice, then while the inner phoneme in the Match condition would not match this sensory prototype perfectly, it would be a better match than the inner phoneme in the Mismatch condition, and would thus be expected to be associated with greater – though not maximal – levels of sensory attenuation. We then go on to discuss the possibility that the assumption that the sensory prototype is of the participant’s own spoken voice may not, in fact, be correct (or at least may not be correct all of the time).

The full addition to the Discussion is as follows:

[Discussion]: With regards to the question of what mechanism could underlie sensory attenuation to inner speech: a recent study by Niziolek, Nagarajan and Houde [Niziolek, Nagarajan and Houde, 2013] on sensory attenuation in the context of overt speech production found that the degree of sensory attenuation was stronger when participants produced vowel sounds that were closer (in terms of their acoustic properties) to their median production of these sounds, compared to when they produced vowel sounds that were, for them, less typical. […] Testing these possibilities may also be worthwhile in future studies.

5) While we appreciate the focus on the internal speech experiment, we felt that the control experiment with overt speech provided a nice point of contrast. As things now stand, this experiment is buried in the Discussion. Knowing that some readers never make it that far, we would like the overt speech experiment presented in the Results section.

We agree, and as per your instructions have moved the results of the overt speech experiment from the Discussion to the Results section, and have modified the relevant section in the Discussion where we discuss the implications of the overt speech experiment.

https://doi.org/10.7554/eLife.28197.022

Article and author information

Author details

  1. Thomas J Whitford

    1. School of Psychology, University of New South Wales (UNSW Sydney), Sydney, Australia
    2. Brain Dynamics Centre, Westmead Institute for Medical Research, Sydney, Australia
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    t.whitford@unsw.edu.au
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-9187-3816
  2. Bradley N Jack

    1. School of Psychology, University of New South Wales (UNSW Sydney), Sydney, Australia
    2. Brain Dynamics Centre, Westmead Institute for Medical Research, Sydney, Australia
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-0523-6656
  3. Daniel Pearson

    1. School of Psychology, University of New South Wales (UNSW Sydney), Sydney, Australia
    2. Brain Dynamics Centre, Westmead Institute for Medical Research, Sydney, Australia
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-1903-4019
  4. Oren Griffiths

    1. School of Psychology, University of New South Wales (UNSW Sydney), Sydney, Australia
    2. Brain Dynamics Centre, Westmead Institute for Medical Research, Sydney, Australia
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-9833-9998
  5. David Luque

    1. School of Psychology, University of New South Wales (UNSW Sydney), Sydney, Australia
    2. Department of Basic Psychology, University of Malaga, Malaga, Spain
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-3457-9204
  6. Anthony WF Harris

    1. Brain Dynamics Centre, Westmead Institute for Medical Research, Sydney, Australia
    2. Discipline of Psychiatry, University of Sydney, Sydney, Australia
    Contribution
    Conceptualization, Funding acquisition, Investigation, Methodology, Writing—review and editing
    Competing interests
    Dr Harris has received consultancy fees from Janssen Australia and Lundbeck Australia. He has been on an advisory board for Sumitomo Dainippon Pharma. He has received payments for educational sessions run for Janssen Australia and Lundbeck Australia. He is the chair of One Door Mental Health.
    ORCID: 0000-0002-8617-4962
  7. Kevin M Spencer

    1. Veterans Affairs Boston Healthcare System, Boston, United States
    2. Department of Psychiatry, Harvard Medical School, Boston, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-5500-7627
  8. Mike E Le Pelley

    School of Psychology, University of New South Wales (UNSW Sydney), Sydney, Australia
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-5145-5502

Funding

Australian Research Council (DP140104394)

  • Thomas J Whitford
  • David Luque
  • Mike E Le Pelley

National Health and Medical Research Council (APP1069487)

  • Thomas J Whitford
  • Mike E Le Pelley

Australian Research Council (DP170103094)

  • Thomas J Whitford
  • Anthony WF Harris
  • Kevin M Spencer

Australian Research Council (DE150100667)

  • Oren Griffiths

National Health and Medical Research Council (APP1090507)

  • Thomas J Whitford

U.S. Department of Veterans Affairs (I01CX001443)

  • Kevin M Spencer

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

Human subjects: All participants gave written informed consent to participate and to have their data published in scientific journals. The study was conducted at UNSW Sydney (Sydney, Australia), and approved by the UNSW Human Research Ethics Advisory Panel (Psychology) (File # 2499).

Reviewing Editor

  1. Richard Ivry, University of California, Berkeley, United States

Publication history

  1. Received: April 28, 2017
  2. Accepted: October 28, 2017
  3. Version of Record published: December 4, 2017 (version 1)

Copyright

© 2017, Whitford et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

