Neuroscience

The auditory representation of speech sounds in human motor cortex

  1. Connie Cheung
  2. Liberty S Hamilton
  3. Keith Johnson
  4. Edward F Chang (corresponding author)
  1. University of California, Berkeley-University of California, San Francisco, United States
  2. University of California, San Francisco, United States
  3. University of California, Berkeley, United States
Research Article
Cite as: eLife 2016;5:e12577 doi: 10.7554/eLife.12577

Abstract

In humans, listening to speech evokes neural responses in the motor cortex. This has been controversially interpreted as evidence that speech sounds are processed as articulatory gestures. However, it is unclear what information is actually encoded by such neural activity. We used high-density direct human cortical recordings while participants spoke and listened to speech sounds. Motor cortex neural patterns during listening were substantially different than during articulation of the same sounds. During listening, we observed neural activity in the superior and inferior regions of ventral motor cortex. During speaking, responses were distributed throughout somatotopic representations of speech articulators in motor cortex. The structure of responses in motor cortex during listening was organized along acoustic features similar to auditory cortex, rather than along articulatory features as during speaking. Motor cortex does not contain articulatory representations of perceived actions in speech, but rather, represents auditory vocal information.

https://doi.org/10.7554/eLife.12577.001

eLife digest

When we speak, we force air out of our lungs so that it passes over the vocal cords and causes them to vibrate. Movements of the jaw, lips and tongue can then shape the resulting sound wave into speech sounds. The brain’s outer layer, which is called the cortex, controls this process. More precisely, neighboring areas in the so-called motor cortex trigger the movements in a specific order to produce different sounds.

Brain imaging experiments have also shown that the motor cortex is active when we listen to speech, as well as when we produce it. One theory is that when we hear a sound, such as the consonant ‘b’, the sound activates the same areas of motor cortex as those involved in producing that sound. This could help us to recognize and understand the sounds we hear.

To test this theory, Cheung, Hamilton et al. studied how speech sounds activate the motor cortex by recording electrical signals directly from the brain’s surface in nine human volunteers who were undergoing a clinical evaluation for epilepsy surgery. This revealed that speaking activates many different areas of motor cortex. However, listening to the same sounds activates only a small subset of these areas. Contrary to what was thought, brain activity patterns in motor cortex during listening do not match those during speaking. Instead, they depend on the properties of the sounds. Thus, sounds that have similar acoustic properties but which require different movements to produce them, such as ‘b’ and ‘d’, activate the motor cortex in similar ways during listening, but not during speaking.

Further research is now needed to work out why the motor cortex behaves differently when we hear as opposed to when we speak. Previous work has suggested that the region increases its activity during listening when the sounds heard are unclear, for example because of background noise. One testable idea therefore is that the motor cortex helps to enhance the processing of degraded sounds.

https://doi.org/10.7554/eLife.12577.002

Introduction

Our motor and sensory cortices are traditionally thought to be functionally separate systems. However, a growing body of studies has revealed their roles in action and perception to be highly integrated (Pulvermüller and Fadiga, 2010). For example, a number of studies have demonstrated that both sensory and motor cortices are engaged during perception (Gallese et al., 1996; Wilson et al., 2004; Tkach et al., 2007; Cogan et al., 2014). In humans, this phenomenon has been observed in the context of speech, where listening to speech sounds evokes robust neural activity in the motor cortex (Wilson et al., 2004; Pulvermüller et al., 2006; Edwards et al., 2010; Cogan et al., 2014). Over the past decade, this observation has re-ignited an intense scientific debate about the role of the motor system in speech perception (Lotto et al., 2009; Scott et al., 2009; Pulvermüller and Fadiga, 2010).

One interpretation of the observed motor activity during speech perception is that “the objects of speech perception are the intended phonetic gestures of the speaker”, as posited by Liberman’s motor theory of speech perception (Liberman et al., 1967; Liberman and Mattingly, 1985). The motor theory is a venerable and well-differentiated exemplar of a set of speech perception theories that we could call 'production-referencing' theories. Unlike motor theory, more modern production-referencing theories do not assume that sensorimotor circuits are necessarily referenced in order for speech to be recognized, but they allow for motor involvement in perception in certain phonetic modes. For example, Lindblom (1996) suggested that a direct link between spectrotemporal analysis and word recognition is the normal mode of speech perception (the 'what' mode of perception), but that in some cases listeners do use a route through sensorimotor circuits (the 'how' mode of perception), for example when the listener is attempting to imitate a new sound.

While demonstrations of motor cortex activity evoked by speech sounds strengthen production-referencing theories, it remains unclear what information is actually represented by such activity. Determining which phonetic properties are encoded in the motor cortex has significant implications for elucidating the role it may play in speech perception. To address this, we recorded direct neural activity from the peri-Sylvian speech cortex in nine human participants undergoing clinical monitoring for epilepsy surgery. Coverage included, among other areas, two regions of particular relevance: the supra-Sylvian ventral half of the lateral sensorimotor cortex (vSMC), which supports the motor control of articulation (Penfield and Boldrey, 1937), and the infra-Sylvian superior temporal gyrus (STG), which supports the auditory processing of speech sounds (Ojemann et al., 1989; Boatman et al., 1995). Since cortical processing of speech sounds is spatially discrete and temporally fast (Formisano et al., 2008; Chang et al., 2011; Steinschneider et al., 2011), we used customized high-density electrode grids (a four-fold increase in density over conventional recordings) (Bouchard et al., 2013; Mesgarani et al., 2014). Importantly, these recordings have simultaneously high spatial and temporal resolution, allowing us to study the detailed speech representations in the vSMC (Crone et al., 1998; Edwards et al., 2009). With this approach, we seek to address unanswered questions about the representation of speech sounds in motor cortex, including how the spatiotemporal patterns compare when speaking and listening, and whether auditory representations in motor cortex are organized along articulatory or acoustic dimensions.

Results

Participants first listened passively to consonant-vowel (CV) syllables (8 consonants followed by the /a/ vowel). In a separate trial block, they spoke aloud these same CV syllables. We measured the average evoked cortical activity during these listening and speaking CV tasks. We focused our analysis on high gamma (70–150 Hz) cortical surface local field potentials, which strongly correlate with extracellular multi-unit neuronal spiking (Steinschneider et al., 2008; Ray and Maunsell, 2011). We aligned neural responses to the onset of speech acoustics (t = 0) in listening and speaking tasks to provide a common reference point across speech sounds.
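As a concrete illustration of this measure, the high-gamma extraction and baseline z-scoring can be sketched in a few lines. This is a hedged reconstruction on synthetic data, not the authors' pipeline: the filter design, sampling rate, and baseline window are assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def high_gamma_z(lfp, fs, baseline_samps):
    """Band-pass 70-150 Hz, take the analytic amplitude (envelope),
    and z-score against a pre-stimulus baseline window."""
    b, a = butter(4, [70.0, 150.0], btype="bandpass", fs=fs)
    envelope = np.abs(hilbert(filtfilt(b, a, lfp)))
    base = envelope[:baseline_samps]          # pre-stimulus silence
    return (envelope - base.mean()) / base.std()

# synthetic single-trial demo: a high-gamma burst begins at t = 1 s
fs = 1000
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(0)
lfp = rng.standard_normal(t.size)
lfp[fs:] += 3 * np.sin(2 * np.pi * 110 * t[fs:])  # 110 Hz burst
hg = high_gamma_z(lfp, fs, baseline_samps=fs)
```

In the actual analyses, such envelopes would be averaged across trials and aligned to the acoustic onset (t = 0) before comparison across conditions.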

We first determined which peri-Sylvian cortical areas were activated during passive listening to speech sounds. Figure 1a and c show the locations of cortical areas that demonstrated evoked responses in a single representative subject during listening and speaking, respectively. During listening, evoked responses spanned middle and posterior STG as expected, with weaker responses in middle temporal gyrus (MTG) (Figure 1a). In the vSMC (composed of the pre- and post-central gyri), we found electrodes in the superior-most and inferior-most aspects (Figure 1a, Figure 1—figure supplement 1, 2) that demonstrated reliable and robust single-trial responses to speech sounds during passive listening (Figure 1b). Neural responses were also found at a few sites scattered across the supramarginal, inferior frontal, and middle frontal gyri, though these were not consistent across subjects (Figure 1—figure supplement 1). By performing spatial clustering analysis on the electrode positions in each subject, we found that 3/5 subjects showed significant clustering of regions responsive to auditory stimuli (Hartigan’s Dip statistic, p<0.05; see Materials and methods; Figure 1—figure supplement 1). Of these 3 subjects, k-means clustering revealed two subjects with k=2 electrode clusters (subjects 1 and 4, clusters in inferior and superior vSMC), and one subject with k=5 clusters. In contrast, when participants spoke the same CV syllables, articulatory movement-related cortical activity was distributed throughout the vSMC (Figure 1c), with auditory feedback-related cortical activity seen in the STG.

Figure 1 with 2 supplements see all
Speech sounds evoke responses in the human motor cortex.

(a) Magnetic resonance image surface reconstruction of one representative subject’s cerebrum (subject 1: S1). Individual electrodes are plotted as dots, and the average cortical response magnitude (z-scored high gamma activity) when listening to CV syllables is signified by the color opacity. CS denotes the central sulcus; SF denotes the Sylvian fissure. (b) Acoustic waveform, spectrogram, single-trial cortical activity (raster), and mean cortical activity (high gamma z-score, with standard error) from two vSMC sites and one STG site when a subject is listening to /da/. Time points significantly above a pre-stimulus silence period (p<0.01, bootstrap resampled, FDR corrected, alpha < 0.005) are marked along the horizontal axis. The vertical dashed line indicates the onset of the syllable acoustics (t=0). (c) Same subject as in (a); distributed vSMC cortical activity when speaking CV syllables (mean high gamma z-score). (d) Total number of significantly active sites in all subjects during listening, speaking, and both conditions (p<0.01, t-test, responses compared to silence and speech). Electrode sites are broken down by their anatomical locations. S denotes superior vSMC sites; I denotes inferior vSMC sites.

https://doi.org/10.7554/eLife.12577.003

Across all participants, we identified 115 electrodes that demonstrated significant neural activity in vSMC during listening (p<0.01, t-test, compared to pre-stimulus silent rest period; Figure 1d). When speaking, in contrast, a total of 362 electrodes in vSMC were found to be significantly active (Figure 1d, p<0.01, t-test, compared to pre-stimulus silent rest period). We compared the relative proportions of electrodes that were found in different supra-Sylvian anatomical regions. Critically, only a subset of sites in vSMC (98 out of 362, ~27%) was active during both listening and speaking (Figure 1d). These sites were primarily localized to the pre-central gyrus, whereas speaking evoked activity across both pre- and post-central gyri sites. Neural responses in the vSMC during listening were found in the superior (S in Figure 1d) pre-central gyrus and inferior, anterior aspect of the sub-central gyrus of the vSMC (I in Figure 1d).

We next compared the patterns of cortical activity to specific speech sounds during listening and speaking. During speaking, specific articulator representations have been identified in the somatotopically-organized vSMC (Bouchard et al., 2013). For example, the plosive consonants /b/, /d/, and /g/ are produced by the closure of the vocal tract at the lips, front tongue, and back tongue, respectively (Figure 2a, b; see Figure 2—figure supplement 1 for all syllable tokens) (Ladefoged and Johnson, 2010). The cortical representations for these articulators are laid out along a superior-to-inferior (medial-to-lateral) sequence in the vSMC (Penfield and Boldrey, 1937). We first examined average cortical activity at single electrode sites distributed along the vSMC axis during articulation of individual speech sounds. Figure 2c shows single-electrode activity from a representative subject (the same as in Figure 1) for speaking (blue lines) and listening (red lines) to three CV syllables with different places of articulation (/ba/, /da/, and /ga/). The exact locations of these electrodes on the vSMC are shown in Figure 2d. The production of labial consonants (/b/) is associated with activity in lip cortical representations, as evidenced by strong responses to the bilabial /ba/ (Figure 2c, electrodes 5–6, blue lines). These are located superior to the tongue representations associated with the /d/ and /g/ consonants, as shown previously (Bouchard et al., 2013). Those tongue sites were sub-specified by ‘coronal’ (i.e. anterior-based) tongue position for /d/ (electrodes 8–10, blue lines) superiorly, and ‘dorsal’ (i.e. posterior-based) tongue position for /g/ inferiorly (electrode 13, blue line). Other sites (electrodes 1–4, 11–12, blue lines) showed similar neural activity across all three syllables.

Figure 2 with 1 supplement see all
Site-by-site differences in vSMC neural activity when speaking and listening to CV syllables.

(a) Top, vocal tract schematics for three syllables (/ba/, /da/, /ga/) produced by occlusion at the lips, tongue tip, and tongue body, respectively (arrow). (b) Acoustic waveforms and spectrograms of spoken syllables. (c) Average neural activity at electrodes along the vSMC for speaking (blue) and listening (red) to the three syllables (high gamma z-score). Solid lines indicate activity was significantly different from pre-stimulus silence activity (p<0.01). Transparent lines indicate activity was not different from pre-stimulus silence activity (p>0.01). Vertical dashed line denotes the onset of the syllable acoustics (t=0). (d) Location of electrodes 1–13 in panel c, shown on whole brain and with inset detail. CS = central sulcus, SF = Sylvian fissure.

https://doi.org/10.7554/eLife.12577.006

We next examined those same vSMC electrodes during listening, and found that the majority of them were not active (p>0.01, t-test compared to silence; Figure 2c, transparent red lines). The few that were active (electrodes 1, 2, 4, 11–12, solid red lines) responded similarly to all three CV syllables, with activity increasing approximately 100 ms after the acoustic onset. Across the entire population of vSMC electrodes that were active during listening, onset latencies were generally longer than those in STG sites, with significant increases in both inferior vSMC (p<0.001) and superior vSMC (p<0.05) compared to STG (Figure 3a, Wilcoxon rank sum test; see Figure 3c for average responses to all syllables). The latency to the response peak was also significantly longer in superior vSMC compared to STG (Figure 3b, p<0.01, Wilcoxon rank sum test). A cross-correlation analysis between these vSMC electrodes and STG electrodes revealed a diverse array of relationships between these populations (Figure 3d–f), including STG electrode activity leading vSMC electrode activity and vice versa. In contrast to speaking, we did not observe somatotopic organization of cortical responses when listening to speech. Therefore, the pattern of raw evoked responses during listening shows critical differences from that during speaking.
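The lagged cross-correlation underlying this analysis can be sketched as follows. The paper's exact asymmetry-index formula lives in its Materials and methods (not reproduced in this excerpt); the version below, a signed share of squared correlation mass at positive versus negative lags, is an illustrative assumption, and the data are synthetic.

```python
import numpy as np

def lagged_xcorr(x, y, fs, max_lag_s=0.75):
    """Pearson correlation between x and y at each lag within +/- max_lag_s.
    A peak at a positive lag means x leads y."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(lags.size)
    for i, L in enumerate(lags):
        if L >= 0:
            a, b = x[: x.size - L], y[L:]
        else:
            a, b = x[-L:], y[: y.size + L]
        r[i] = np.corrcoef(a, b)[0, 1]
    return lags / fs, r

def asymmetry_index(lag_s, r):
    """Assumed form: signed fraction of squared correlation at positive
    vs. negative lags. Positive values indicate x leads y."""
    pos = np.sum(r[lag_s > 0] ** 2)
    neg = np.sum(r[lag_s < 0] ** 2)
    return (pos - neg) / (pos + neg)

# demo: y is x delayed by 50 ms, so x should lead (positive asymmetry)
fs = 1000
rng = np.random.default_rng(3)
x = rng.standard_normal(4000)
y = np.concatenate([np.zeros(50), x[:-50]])
lag_s, r = lagged_xcorr(x, y, fs)
ai = asymmetry_index(lag_s, r)
```

Applied to electrode pairs, negative-lag mass would correspond to STG leading a vSMC site, and positive-lag mass to the reverse, as in Figure 3d–f.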

Figure 3
Dynamics of responses during CV listening in STG, inferior vSMC, and superior vSMC.

(a) STG onset latencies were significantly lower than both inferior vSMC (p<0.001, Z = −4.03) and superior vSMC (p<0.05, Z = −2.28). (b) STG peak latencies were significantly lower than superior vSMC (p<0.01, Z = −2.93), but not significantly different from peak latencies in inferior vSMC (p>0.1). In (a) and (b), red bar indicates the median, boxes indicate 25th and 75th percentile, and error bars indicate the range. Response latencies were pooled across all subjects. All p-values in (a) and (b) are from the Wilcoxon rank sum test. (c) Average evoked responses to all syllable tokens across sites in superior vSMC (n=32), inferior vSMC (n=37), and STG. Responses were aligned to the syllable acoustic onset (t=0). A random subset of STG responses (n=52 out of the 273 that were used in the latency analysis in (a) and (b)) are shown here for ease of viewing. (d) Example cross-correlations between three vSMC electrodes and all STG electrodes in one patient, for a maximum lag of ± 0.75 s. More power in the negative lags indicates a faster response in the STG compared to the vSMC electrode, and more power in the positive lags indicates a faster response in vSMC compared to STG. We observe vSMC electrodes that tend to respond later than STG (e248, left panel), vSMC electrodes that tend to respond before STG (e136, middle panel), and vSMC electrodes that respond at similar times to some STG electrodes (e169, right panel). (e) Average evoked responses during CV listening for all STG electrodes from this patient and the three vSMC electrodes shown in panel (d). Responses were aligned to the syllable acoustic onset (t=0), as in panel (c). (f) Percentage of sites with STG leading, coactive, or vSMC leading as expressed by the asymmetry index (see Materials and methods). Both inferior and superior vSMC show leading and lagging responses compared to STG, as well as populations of coactive pairs.

https://doi.org/10.7554/eLife.12577.008

We next evaluated quantitatively whether the structure of distributed vSMC neural activity during listening was more similar to that of vSMC during speaking or the STG during listening. In previous studies, we demonstrated that the structure of evoked responses is primarily organized by different feature sensitivities: place of articulation in the vSMC (Bouchard et al., 2013), and manner of articulation in the STG (Mesgarani et al., 2014). We visualized the similarity of population activity evoked by different consonants using unsupervised multidimensional scaling (MDS), where the 2-dimensional Euclidean distances between stimuli correspond to the similarity of their neural responses. Visual inspection of MDS plots shows that, during speaking, evoked activity in vSMC clustered into place of articulation features (Figure 4a): labials (/b/, /p/), alveolars (/s/, /sh/, /t/, /d/), and velars (/g/, /k/) (Figure 4b). In contrast, neural responses during listening did not cluster into the same features (Figure 4c). To quantify the degree to which the evoked activity clustered into place of articulation features, we used unsupervised K-means clustering to assign the neural responses to clusters (k=3), and the adjusted Rand Index (RIadj) (Rand, 1971; Hubert and Arabie, 1985) to measure the degree to which the neural clustering agreed with linguistically defined place of articulation consonant clusters. The RIadj quantifies the degree of agreement between two clustering patterns, where RIadj = 1 denotes identical clustering patterns and RIadj = 0 denotes independent clustering patterns. We found that while evoked activity during speaking clustered by place of articulation features, activity during listening did not (Figure 4d; see Figure 4—figure supplement 1 for moving time window analysis).
Even when the analysis was restricted to short-latency vSMC electrodes whose activity led STG activity (as evidenced by a positive asymmetry index in Figure 3f), activity during listening did not cluster according to place of articulation features (Figure 4—figure supplement 2). Thus, responses in motor areas during speech perception do not show a spatially distributed representation of speech motor articulator features.
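The clustering pipeline described here (MDS for visualization, K-means with k=3, and the adjusted Rand Index scored against place labels) maps directly onto standard library routines. The sketch below runs on toy response vectors with a built-in place structure, so it illustrates the mechanics rather than reproducing the recorded data; in the real analysis each row would be a syllable's trial-averaged high-gamma pattern across electrodes.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
syllables = ["ba", "pa", "da", "ta", "sa", "sha", "ga", "ka"]
place = [0, 0, 1, 1, 1, 1, 2, 2]   # 0 = labial, 1 = alveolar, 2 = velar

# toy "syllable x electrode" responses: one well-separated center per
# place-of-articulation feature, plus small within-feature noise
centers = 4.0 * rng.standard_normal((3, 20))
X = np.stack([centers[p] + 0.3 * rng.standard_normal(20) for p in place])

# 2-D embedding for visualization (distance ~ neural dissimilarity)
embedding = MDS(n_components=2, random_state=0).fit_transform(X)

# unsupervised grouping, then agreement with linguistic place labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ri_adj = adjusted_rand_score(place, labels)   # 1 = identical clusterings
```

With structured toy data the RIadj is near 1, mirroring the speaking condition; shuffled or place-free responses would drive it toward 0, as observed during listening.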

Figure 4 with 3 supplements see all
Organization of motor cortex activity patterns.

(a) Consonants of all syllable tokens organized by place and manner of articulation. Where consonants appear in pairs, the right is a voiced consonant, and the left is a voiceless consonant. (b) Relational organization of vSMC patterns (similarity) using multidimensional scaling (MDS) during speaking. Neural pattern similarity is proportional to the Euclidean distance (that is, similar response patterns are grouped closely together, whereas dissimilar patterns are positioned far apart). Tokens are colored by the main place of articulation of the consonants (labial, velar, or alveolar). (c) Similarity of vSMC response patterns during listening. Same coloring by place of articulation. (d) Organization by motor articulators. K-means clustering was used to assign mean neural responses to 3 groups (labial, alveolar, velar) for both listening and speaking neural organizations (b,c). The similarity of the grouping to known major articulators was measured by the adjusted Rand Index. An index of 1 indicates neural responses group by place of articulation features. ***p<0.001, Wilcoxon rank-sum. (e) Organization of mean STG responses using MDS when listening. In contrast to (c) and (d), tokens are now colored by their main acoustic feature (fricative, voiced plosive, or voiceless plosive). (f) Organization of mean vSMC responses using MDS when listening, colored by their main acoustic feature (identical to (c), but recolored here by acoustic features). (g) Organization by manner of articulation acoustic features (fricative, voiced plosive, voiceless plosive) for both STG and vSMC organizations when listening (e, f). The similarity of the grouping to known acoustic feature groupings was measured by the adjusted Rand Index. ***p<0.001, Wilcoxon rank-sum. (h) During listening, responses in vSMC show significantly greater organization by acoustic manner features than by place features, as assessed by the adjusted Rand Index, indicating an acoustic rather than articulatory representation (***p<0.001, Wilcoxon rank-sum). Bars in this panel are the same as the red bars in (d) and (g). In (d), (g), and (h), bars indicate mean ± standard deviation.

https://doi.org/10.7554/eLife.12577.009

Finding no evidence that major articulator features are either locally or spatially distributed in the vSMC in response to speech sounds, we next compared vSMC responses to population responses in the STG. STG has an acoustic sensory representation of speech that best discriminates speech sounds by manner of articulation features with salient acoustic differences (Mesgarani et al., 2014). Using multidimensional scaling, STG spatial patterns during listening showed clustering according to three high-order acoustic features (Figure 4e): voiced plosives (/b/, /d/, /g/), unvoiced plosives (/p/, /t/, /k/), and fricatives (/s/, /sh/) (Ladefoged and Johnson, 2010). This is consistent with the relational organization derived by analysis of structure in the stimulus acoustics (Figure 4—figure supplement 3a), and the structure of STG during speaking (Figure 4—figure supplement 3b). With the same analyses, we observed that activity in motor cortex clustered into the same three acoustic features (Figure 4f, note this panel is identical to Figure 4c simply re-colored). Unsupervised K-means clustering analysis confirmed that vSMC activity, during listening, organized into these linguistically defined acoustic feature groups, but was significantly weaker than the organization of STG (p<0.001, Wilcoxon rank-sum, Figure 4g). Importantly, however, clustering by acoustic manner features was significantly stronger than clustering by place features in vSMC electrodes during listening (p<0.001, Wilcoxon rank-sum, Figure 4h). This organization suggests that motor cortex activity during speech perception reflects an acoustic sensory representation of speech in the vSMC that mirrors acoustic representations of speech in auditory cortex.

To further define the acoustic selectivity and tuning of vSMC motor electrodes, participants listened to natural, continuous speech samples from a corpus with a range of American English speakers (Garofolo et al., 1993). We fit spectrotemporal receptive field (STRF) models for each vSMC electrode using normalized reverse correlation (see Materials and methods); an STRF describes the spectrotemporal properties of speech acoustics that predict the activity of a single site in motor cortex. To compute the STRF, we calculate the correlation between the neural response at an electrode and the stimulus spectrogram at multiple time lags, and then normalize the result by the auto-correlation in the stimulus. This yields a linear filter for each electrode (the STRF), which, when convolved with the stimulus spectrogram, produces a predicted neural response to that stimulus. The prediction performance of each STRF was determined by calculating the correlation between the activity predicted by the STRF and the actual response on held-out data. A fraction of vSMC sites (16/98 sites total) were reasonably well predicted with a linear STRF (r ≥ 0.10 and p<0.01, permutation test) (Theunissen et al., 2001). STRFs with significant correlation coefficients were localized to superior and inferior vSMC (primarily precentral gyrus) in addition to STG (Figure 5a). Still, the prediction performance of STRFs in vSMC was generally lower than that of the STG (Figure 5b). Furthermore, the majority of STRFs in both regions showed strong low-frequency tuning (100–200 Hz) related to voicing (Figure 5c), though by visual inspection some also showed high-frequency tuning consistent with selectivity for fricatives and stop consonants (Mesgarani et al., 2014).
We also estimated the mean cortical response at each motor site to every phoneme in English and found a diverse set of responses (Figure 5—figure supplement 1a) that were notably weaker in magnitude compared to STG responses (Figure 5—figure supplement 1b). Weak selectivity to phonetic features measured by the Phoneme Selectivity Index (PSI) was also observed (Figure 5—figure supplement 1c) (Mesgarani et al., 2014). These findings reveal that individual sites in motor cortex reflect sensory responses to definable spectrotemporal features of speech acoustics, including voicing attributes. Presumably, this tuning gives rise to the acoustic organization found in the previous analysis of distributed spatial patterns of neural activity.
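Normalized reverse correlation, as described above, amounts to dividing the stimulus-response cross-correlation by the stimulus auto-correlation, which is the ridge-regression solution over a time-lagged spectrogram. The sketch below is a minimal illustration on synthetic data; the function name, regularization value, and dimensions are assumptions, not the paper's settings.

```python
import numpy as np

def fit_strf(spec, resp, n_lags, lam=1.0):
    """Ridge solution: stimulus-response cross-correlation normalized by
    the (regularized) stimulus auto-correlation."""
    T, F = spec.shape
    X = np.zeros((T, F * n_lags))      # design matrix of lagged spectrogram
    for L in range(n_lags):
        X[L:, L * F:(L + 1) * F] = spec[: T - L]
    w = np.linalg.solve(X.T @ X + lam * np.eye(F * n_lags), X.T @ resp)
    return w.reshape(n_lags, F)        # time lags x frequency bins

# synthetic check: the "response" is one spectrogram bin delayed by 3 frames
rng = np.random.default_rng(2)
T, F, n_lags = 5000, 16, 10
spec = rng.standard_normal((T, F))
resp = np.concatenate([np.zeros(3), spec[:-3, 5]]) + 0.1 * rng.standard_normal(T)
strf = fit_strf(spec, resp, n_lags)    # peak should land at lag 3, bin 5
```

As in the paper's evaluation, prediction performance would then be the correlation between the convolved STRF output and held-out responses.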

Figure 5 with 1 supplement see all
Acoustic spectrotemporal tuning in vSMC.

(a) All STRF correlations and locations are plotted with opacity signifying the strength of the correlation. CS denotes the central sulcus; SF denotes the Sylvian fissure. (b) Distribution of STRF prediction correlations for significantly active vSMC and STG sites. Cut-off at r = 0.1 is shown as a dashed line. (c) Individual STRFs from all subjects (S1-S5, STRF correlation>0.1) plotted as a function of distance from the central sulcus and Sylvian fissure, with opacity signifying the strength of the STRF correlation.

https://doi.org/10.7554/eLife.12577.013

Discussion

Our principal objective was to determine the vSMC motor cortex representation of auditory speech sounds. We used high-resolution cortical recordings and a wide array of speech sounds to determine how the vSMC structure of speech sounds compared to the structure of motor commands in vSMC and sensory processing in STG. We found evidence for both spatially local and distributed activity correlated to speech acoustics, which suggests an auditory representation of speech in motor cortex.

The proposal that the motor cortex critically integrates observations with motor commands largely stems from the discovery of mirror neurons (in area F5 of macaques) that fire both when a monkey produces an action and when it observes a similar action (di Pellegrino et al., 1992; Rizzolatti and Craighero, 2004; Pulvermüller and Fadiga, 2010). This 'integrative' view is reminiscent of linguistic production-referencing theories, including the motor theory of speech perception, which propose that motor circuits are involved in speech perception (Liberman et al., 1967; Liberman and Mattingly, 1985). In line with these theories, human neuroimaging studies have shown mirror activity in ventral premotor cortex during listening (Wilson et al., 2004; Pulvermüller et al., 2006; Edwards et al., 2010), and modulated premotor activity in phoneme categorization tasks (Alho et al., 2012; Chevillet et al., 2013). Our results extend these findings by detailing the representational selectivity and encoding of vSMC in perception. Consistent with previous findings, we demonstrated local ‘audiomotor’ responses to speech sounds in vSMC. When the responses were further examined for phonetic structure, we found that major articulatory place features, such as labial, alveolar, and velar, were not represented in either single-site activity or distributed spatial activity. This observation is in direct contrast with structural predictions made by the original motor theory of speech perception (Liberman et al., 1967; Liberman and Mattingly, 1985), while confirming that motor cortex plays a role in perception (Lindblom, 1996; Hickok and Poeppel, 2007).

We localized activity during speech perception to regions of the vSMC that have been implicated in phonation and laryngeal control (Penfield and Boldrey, 1937; Brown et al., 2008). When listening to speech, we observed that these regions reflected acoustic sensory properties of speech, with individual sites tuned to spectrotemporal acoustic properties. The tuning properties of responsive sites in vSMC are similar to those observed in STG during listening (Mesgarani et al., 2014) and appear to give rise to an acoustic sensory organization of speech sounds (rather than a purely motor organization) in motor cortex during listening.

There is an emerging consensus that frontal and motor regions are recruited during effortful listening (Du et al., 2014). For example, previous studies have demonstrated that frontal areas come online to process degraded speech for the attentional enhancement of auditory processing (Wild et al., 2012). Our results may complement this interpretation in that the audiomotor cortex enhancement is specific to an auditory representation, without transforming information to a motor articulatory representation. That being said, the auditory encoding that we observed in the motor cortex did not appear to be as strong as that observed in the STG, exhibiting comparatively weaker activity and weaker phoneme selectivity (Figure 5—figure supplement 1b and c; see also Mesgarani et al., 2014).

In addition to having implications for perceptual models, we speculate that these results have strong implications for speech production, as auditory feedback is potentially processed directly in the vSMC in addition to the canonical auditory cortex. Speech production models currently propose a complex role for sensory feedback, where pathways exist for the activation of auditory cortex from vSMC activation (the forward prediction of production consequences), and the activation of vSMC from auditory and somatosensory input (the error correction signal) (Guenther et al., 2006; Houde and Nagarajan, 2011). In the current study, it appears that the motor cortex contains both sensory and motor representations, where the sensory representations are active during passive listening, whereas motor representations dominate during speech production.

Analysis of the time course of vSMC and STG responses revealed a heterogeneous population of both short- and longer-latency responses in the inferior and superior vSMC that are generally slower than the STG (Figure 3a–c). Early responses in vSMC may reflect bidirectional connections from STG (Zatorre et al., 2007), primary auditory cortex (Nelson et al., 2013; Schneider et al., 2014), or auditory thalamus (Henschke et al., 2014), whereas later responses might reflect indirect connectivity in areas downstream from the STG (Rauschecker and Scott, 2009). Indeed, our cross-correlation analysis revealed bidirectional dynamical relationships between vSMC and STG responses, in which STG responses led vSMC responses and vice versa (Figure 3d–f). Still, this analysis was independent of the diverse tuning properties in the vSMC and STG electrode sets, so longer-latency responses likely reflect the later responses to vowels relative to consonants. Even so, we found a wide variety of tuning and dynamical profiles in the vSMC electrodes that responded during listening. Given these proposed functional connections, activity in vSMC evoked by speech sounds may be a consequence of sounds activating the sensory feedback circuit (Hickok et al., 2011). Alternatively, evoked responses in the motor cortex during passive listening may directly reflect auditory inputs arising from aggregated activity picked up by the electrode. We believe the latter scenario to be less likely, however, given that auditory responses were observed in dorsal vSMC on electrode contacts several centimeters away from auditory inputs in the STG. In addition, the spatial spread of neural signals in the high gamma range is substantially smaller than this distance: high gamma signal correlations at <2 mm spacing are only around r=0.5, and at distances of 1 cm they reach a noise floor (Chang, 2015; Muller et al., unpublished findings).
Given the acoustic rather than place selectivity observed during listening in the vSMC, our results suggest that motor theories of speech perception may need to be revised to incorporate a novel sensorimotor representation of sound in the vSMC.

Materials and methods

Participants

Nine human participants were implanted with high-density multi-electrode cortical surface arrays as part of their clinical evaluation for epilepsy surgery. The array contained 256 electrodes with 4 mm pitch. Arrays were implanted on the lateral left hemispheres over the peri-Sylvian cortex, but exact placement was determined entirely by clinical indications (Figure 1—figure supplement 1 and Figure 1—figure supplement 2). Using anatomic image fusion software from BrainLab (Munich, Germany), electrode positions were extracted from the computed tomography (CT) scan, co-registered with the patient’s MRI and then superimposed on the participant’s 3D MRI surface reconstruction image. All participants were left hemisphere language dominant, as assessed by the Wada test. Participants had self-reported normal hearing. The study protocol was approved by the UC San Francisco Committee on Human Research, and all participants provided written informed consent.

Task

Participants completed three separate tasks that were designed to sample a range of phonetic features. First, participants listened to eight consonant-vowel (CV) syllables (/ba/, /da/, /ga/, /pa/, /ta/, /ka/, /∫a/, /sa/) produced by a male speaker unknown to the participant. Stimuli were presented randomly, with 4–21 repetitions of each CV syllable for 5 out of the 9 subjects included in all subsequent analyses, and one repetition of each CV syllable for 4 subjects shown only in Figure 1—figure supplement 2. To remain alert, participants were asked to identify the syllable they heard by selecting from a multiple-choice question on a computer with their ipsilateral (left) hand. In the second task, participants spoke aloud the same CV syllables prompted by a visual cue on the laptop computer display. In the third task, participants passively listened to natural speech samples from a phonemically transcribed continuous speech corpus (TIMIT). We chose 499 unique sentences from 400 different male and female speakers. Each sentence was repeated two times. For the phoneme selectivity analysis, we chose a subset of TIMIT phonemes that occurred more than 30 times. This resulted in an analysis of 33 phonemes. For spectrotemporal receptive field analysis (see below), data from all sentences were used.

Data acquisition and preprocessing

Electrocorticographic (ECoG) signals were recorded with a multichannel PZ2 amplifier connected to an RZ2 digital signal acquisition system (Tucker-Davis Technologies, Alachua, FL, USA) sampling at 3,052 Hz. The produced speech was recorded with a microphone, digitized, and simultaneously recorded. The speech sound signals were presented monaurally from loudspeakers at a comfortable level, digitized, and also simultaneously recorded with the ECoG signals.

Line noise (60 Hz and harmonics at 120 and 180 Hz) was next removed from the signal with a notch filter. Each time series was visually and quantitatively inspected for excessive noise, and was excluded from further analyses if its periodogram deviated more than two standard deviations from the average periodogram of all other time series. The remaining time series were then common-average referenced (CAR) and used for analyses. The CAR was taken across 16-channel banks in order to remove non-neural electrical noise from shared inputs to the PZ2. We find that this method of CAR significantly reduces movement-related and other non-neural artifacts while not adversely affecting our signals of interest. The analytic amplitude of each time series was extracted with the Hilbert transform using eight bandpass filters (Gaussian filters with logarithmically increasing center frequencies (70–150 Hz) and semi-logarithmically increasing bandwidths). The high-gamma power was calculated by averaging the analytic amplitude across these eight bands and downsampling the signal to 100 Hz. The signal was finally z-scored relative to the mean and standard deviation of baseline rest data for each channel.
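As an illustration, the band-passed Hilbert envelope computation can be sketched in Python. This is a simplified stand-in, not the authors' pipeline: ordinary Butterworth band-pass filters replace the Gaussian filters with semi-logarithmic bandwidths described above, and the notch and CAR steps are reduced to the essentials.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, iirnotch

FS = 3052      # acquisition sampling rate (Hz)
FS_OUT = 100   # output rate for the high-gamma envelope

def remove_line_noise(x, fs=FS, freqs=(60, 120, 180), q=30):
    """Notch out 60 Hz line noise and its harmonics."""
    for f0 in freqs:
        b, a = iirnotch(f0, q, fs)
        x = filtfilt(b, a, x)
    return x

def high_gamma(x, fs=FS, n_bands=8):
    """Average analytic amplitude across 8 log-spaced bands in 70-150 Hz,
    then decimate to ~100 Hz (Butterworth filters stand in for the
    paper's Gaussian filters)."""
    edges = np.logspace(np.log10(70), np.log10(150), n_bands + 1)
    amps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        amps.append(np.abs(hilbert(filtfilt(b, a, x))))
    hg = np.mean(amps, axis=0)
    return hg[::int(round(fs / FS_OUT))]

def zscore_to_baseline(hg, baseline):
    """z-score the envelope to the mean and s.d. of baseline rest data."""
    return (hg - baseline.mean()) / baseline.std()
```

After these steps, each channel's activity is a 100 Hz time series in units of baseline standard deviations, which is the representation used in all subsequent analyses.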

Electrode selection

Supra-Sylvian cortical sites with robust evoked responses to both speech sounds and speech production were selected for this analysis. To determine whether a site was responsive to speech sounds, we implemented a bootstrap t-test comparing a site's responses randomly sampled over time during speech sound presentations to responses randomly sampled over time during pre-stimulus silent intervals (p<0.01). This resulted in 10, 22, 29, 27, and 27 sites for the five participants (n=115). Next, we implemented a bootstrap t-test comparing neural responses during speech production and pre-stimulus silence (p<0.01), resulting in 25, 74, 87, 92, and 84 sites (n=362). Finally, we took the intersection of these two groups to arrive at our final set of supra-Sylvian sites: 8, 16, 28, 22, and 24 sites active during both listening and speaking (n=98).
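The paper does not spell out the resampling scheme behind its bootstrap t-test, so the following Python sketch is a hypothetical implementation of the selection logic: resample both conditions with replacement and estimate a one-sided p-value for the response exceeding baseline.

```python
import numpy as np

def bootstrap_ttest(resp, baseline, n_boot=1000, rng=None):
    """One-sided bootstrap p-value that mean activity during a condition
    exceeds baseline: resample each condition with replacement and count
    how often the baseline mean matches or exceeds the response mean.
    (Hypothetical implementation; the paper's exact scheme is unstated.)"""
    rng = rng or np.random.default_rng(0)
    count = 0
    for _ in range(n_boot):
        r = rng.choice(resp, size=resp.size, replace=True).mean()
        b = rng.choice(baseline, size=baseline.size, replace=True).mean()
        if b >= r:
            count += 1
    return count / n_boot

def select_sites(hg, stim_mask, rest_mask, alpha=0.01):
    """Return indices of electrodes (rows of hg: electrodes x time) whose
    high-gamma during stimulation exceeds pre-stimulus silence."""
    return [i for i, ch in enumerate(hg)
            if bootstrap_ttest(ch[stim_mask], ch[rest_mask]) < alpha]
```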

To analyze the responses of the auditory cortex, we restricted the infra-Sylvian cortical sites to those that were reliably evoked by speech sounds (p<0.01, t-test between silence and speech sounds neural responses). This resulted in 73, 61, 40, 77, and 89 infra-Sylvian temporal cortical sites (n=340) responsive to speech sounds.

Spatial clustering analysis

To investigate the degree of spatial clustering in the vSMC electrodes responsive during listening, we used the Dip-means method (Kalogeratos and Likas, 2012), which allows us to test whether the data show any form of clustering. Importantly, unlike the silhouette index, this allows us to distinguish between k=1 and k>1 clusters. For each subject, the pairwise distances between the spatial locations of all electrodes were computed. Using each electrode in turn as a 'viewer' (Kalogeratos and Likas, 2012), we tested whether the distribution of distances to that electrode significantly deviated from unimodality (Hartigan and Hartigan, 1985). If one or more electrodes showed a significantly non-unimodal pairwise distance histogram, the data were considered to be clustered. Following this procedure, k-means clustering was performed with k=2 through k=6 clusters, and the silhouette index was used to determine the best number of clusters for a given subject. The silhouette index for a given data point is defined as

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

where b(i) is the lowest average distance from i to the points of any cluster of which i is not a member, and a(i) is the average distance between i and the other data points assigned to the same cluster. The silhouette index ranges from −1 to 1, with higher positive values indicating better clustering.
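A direct translation of this definition, given cluster labels and a pairwise distance matrix:

```python
import numpy as np

def silhouette(i, labels, D):
    """Silhouette index for point i: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    given integer cluster labels and a pairwise distance matrix D."""
    own = (labels == labels[i])
    own[i] = False  # exclude i itself from its own cluster
    a = D[i, own].mean()  # mean distance within i's own cluster
    b = min(D[i, labels == c].mean()  # lowest mean distance to another cluster
            for c in set(labels) - {labels[i]})
    return (b - a) / max(a, b)
```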

Average neural response and peak high-gamma measurement

For the speaking and listening CV syllable tasks, the start of the syllable acoustics was used to align the responses of each electrode site. For the phoneme responses, the TIMIT phonetic transcriptions were used to align responses to phoneme onset. Once responses were aligned to a stimulus, the average activity of each site for each stimulus was measured by taking the mean response over different trials of the same stimulus. The maxima of the mean responses to the different stimuli were then used to measure the peak high-gamma distributions across tasks and sites.

Response latency analysis

We measured the onset latencies of responses to listening in STG and vSMC by calculating the average z-scored high-gamma activity across all CV syllables, and then finding the first time at which activity was significantly higher than during the 500-ms pre-stimulus silent rest period (one-tailed Wilcoxon rank sum test, p<0.001). We also calculated the peak latency as the time at which the average z-scored response reached its maximum value. Differences in onset and peak latencies were compared across STG, inferior, and superior vSMC using a two-tailed Wilcoxon rank sum test at a significance level of p<0.05 (uncorrected).

Cross-correlation analysis

To measure the timing/dynamics between pairs of vSMC and STG sites during CV syllable listening, we performed a cross-correlation analysis between pairs of electrodes in these two regions. The cross-correlation measures the similarity of two time series at different time lags by taking pairs of electrode responses and calculating the correlation between one response and a time-shifted version of the second response. If the peak in the cross-correlation between an STG electrode and a vSMC electrode occurs at a negative lag, this indicates that the STG response leads (occurs earlier than) the vSMC response and that STG activity in the past is predictive of future activity in the vSMC. In contrast, if the peak in the cross-correlation between an STG electrode and a vSMC electrode occurs at a positive lag, this indicates that the vSMC response leads (occurs earlier than) the STG response. The cross-correlation at time lag τ is calculated between the response at an STG electrode (denoted x) and the response at a vSMC electrode (y) as follows:

(x ⋆ y)[τ] := Σ_{t = −0.5 s}^{1 s} x*[t] y[t + τ]

where the maximum lag τ was chosen to be 0.75 s. Cross-correlations were normalized by 1/(M − |τ|) (where M is the total number of time points in the response) to obtain an unbiased estimate at each time lag τ. The cross-correlation between vSMC and STG electrodes was calculated separately for each CV syllable trial, and then averaged across trials (see examples in Figure 3d).
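A minimal implementation of the unbiased cross-correlation; here the sum runs over all available samples rather than the −0.5 s to 1 s window, and in this sketch a peak at positive τ means y is a delayed copy of x:

```python
import numpy as np

def unbiased_xcorr(x, y, max_lag):
    """Cross-correlation c[tau] = sum_t x[t] * y[t + tau], normalized by
    1/(M - |tau|) to give an unbiased estimate at each lag (M = len(x)).
    A peak at positive tau means y is a delayed copy of x."""
    M = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    c = np.empty(lags.size)
    for k, tau in enumerate(lags):
        if tau >= 0:
            prod = x[:M - tau] * y[tau:]
        else:
            prod = x[-tau:] * y[:M + tau]
        c[k] = prod.sum() / (M - abs(tau))
    return lags, c
```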

To determine the incidence of relationships within our electrode population where STG leads vSMC, vSMC leads STG, or both are coactive, we calculated an asymmetry index. This index ranges from −1 to 1 and describes the relative power in the positive versus negative lags for each vSMC electrode. It is calculated for each vSMC electrode by taking the sum of the positive cross-correlations in the negative lags and the sum of the positive cross-correlations in the positive lags, and then computing the ratio:

asymmetry index = (Ppos − Pneg) / (Ppos + Pneg)

For a given vSMC electrode, an asymmetry index of −1 indicates that the cross-correlations lie fully in the negative lags (indicating that STG responses lead the vSMC response in that electrode). In contrast, a value of 1 indicates that the cross-correlations are in the positive lags only, indicating that the vSMC electrode leads all STG electrodes.
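Given a cross-correlation function and its lags, the asymmetry index reduces to a few lines:

```python
import numpy as np

def asymmetry_index(lags, xc):
    """Asymmetry index for one vSMC electrode: relative power of the
    positive cross-correlations at positive vs negative lags, in [-1, 1]."""
    p_pos = xc[(lags > 0) & (xc > 0)].sum()
    p_neg = xc[(lags < 0) & (xc > 0)].sum()
    return (p_pos - p_neg) / (p_pos + p_neg)
```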

Multidimensional scaling (MDS) analysis

To examine the relational organization of the neural responses to syllables, we applied unsupervised multidimensional scaling (MDS) to the distance matrix of the mean neural responses at the sites of interest described in Materials and methods: Electrode selection. For analysis of speaking and listening responses, the vSMC sites used were those identified as significantly active during both speech production and speech perception (n=98, Figure 4b,c,f). However, clustering results for speaking were similar when all vSMC sites identified as significantly active during speech production were included (n=362, Figure 4—figure supplement 3c). The STG sites used were those identified as significantly active during speech perception (n=340, Figure 4e, Figure 4—figure supplement 3b). Syllables placed closer together in MDS space elicited more similar neural response patterns, and those placed further apart elicited more dissimilar patterns. The distance between a pair of mean neural responses was computed as 1 minus the Pearson correlation coefficient between them.
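A sketch of this analysis using classical (Torgerson) MDS; the paper does not state which MDS variant was used, so this serves only to illustrate the 1 − correlation distance and the embedding step:

```python
import numpy as np

def mds_embed(D, n_components=2):
    """Classical (Torgerson) MDS: double-center the squared distance
    matrix and embed points with its top eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

def syllable_mds(responses, n_components=2):
    """responses: syllables x features (mean neural response patterns).
    Distance between two syllables is 1 minus the Pearson correlation
    of their mean responses."""
    D = 1.0 - np.corrcoef(responses)
    return mds_embed(D, n_components)
```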

Neural clustering analysis

We used unsupervised K-means clustering to examine the grouping of the mean neural activity to syllables at the electrodes of interest described in Materials and methods: Electrode selection. We clustered the mean activity into three distinct clusters. This number was chosen because there are three major places of articulation and three major manners of articulation in the syllable stimulus set (Figure 4a), both of which have been shown to play a major role in the neural organization of motor cortex during speech production and auditory cortex during speech perception.

After clustering the neural responses into three distinct groups, we measured the similarity of the grouping to the linguistically defined grouping of consonants by place of articulation and by acoustic features (Figure 4a and Figure 2—figure supplement 1) using the adjusted Rand Index (RIadj). The RIadj is frequently used in statistics for cluster validation. It measures the amount of agreement between two clustering schemes: one produced by a given clustering process (e.g. K-means), and the other by some external criterion, or gold standard (e.g. place of articulation linguistic features). The RIadj takes an intuitive approach to measuring cluster similarity by counting the number of pairs of objects classified in the same cluster under both clustering schemes, while controlling for chance (hence, 'adjusted' RI). It has an expected value of 0 for independent clusterings and a maximum value of 1 for identical clusterings. It is defined as follows:

Let S be a set of n objects, S = (o1, o2, …, on). Partitioning the objects in two different ways such that U = (U1, …, Ur) is a partition of S into r subsets, and V = (V1, …, Vt) is a partition of S into t subsets, let:

a = number of pairs of objects that are in the same set in U and in the same set in V,

b = number of pairs of objects that are in the same set in U and in different sets in V,

c = number of pairs of objects that are in different sets in U and in the same set in V,

d = number of pairs of objects that are in different sets in U and in different sets in V.

Without adjusting for chance, the RI is simply:

RI = (a + d) / (a + b + c + d) = (a + d) / C(n, 2),

where C(n, 2) = n(n − 1)/2 is the total number of object pairs.

Taking into account chance pairings, RIadj becomes:

RIadj = [C(n, 2)(a + d) − ((a + b)(a + c) + (c + d)(b + d))] / [C(n, 2)² − ((a + b)(a + c) + (c + d)(b + d))].
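The pair-counting formula above translates directly into code:

```python
from itertools import combinations

def pair_counts(U, V):
    """Count object pairs: a same/same, b same/diff, c diff/same,
    d diff/diff across two label assignments U and V."""
    a = b = c = d = 0
    for i, j in combinations(range(len(U)), 2):
        same_u, same_v = U[i] == U[j], V[i] == V[j]
        if same_u and same_v:
            a += 1
        elif same_u:
            b += 1
        elif same_v:
            c += 1
        else:
            d += 1
    return a, b, c, d

def adjusted_rand(U, V):
    """Chance-corrected Rand index per the RIadj formula above."""
    a, b, c, d = pair_counts(U, V)
    n_pairs = a + b + c + d  # equals C(n, 2)
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n_pairs
    return (a + d - expected) / (n_pairs - expected)
```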

To localize an unbiased time window for analysis, the ΔRIadj metric was derived for all time windows by subtracting the RIadj measured against the place of articulation gold standard from the RIadj measured against the acoustic feature gold standard (Figure 4—figure supplement 1). A ΔRIadj of 1 denotes organization by acoustic features, and a ΔRIadj of −1 denotes organization by place features. The significance of the ΔRIadj was computed by calculating the RIadj for a randomized labeling of neural responses compared to either acoustic feature or place feature clustering, taking the difference (ΔRIadj), and repeating this procedure 1000 times with different randomized labelings to create a null distribution of ΔRIadj values. The p-value was calculated as the proportion of times this random ΔRIadj exceeded the observed ΔRIadj, and was thresholded at an FDR-corrected p<0.05 using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995).

Electrode phoneme selectivity index (PSI)

To characterize the phoneme selectivity of each electrode site, we implemented the PSI calculation described by Mesgarani et al., 2014. In short, for a single site, we counted the number of phonemes whose responses were statistically different (Wilcoxon rank-sum test, p<0.01, corrected for multiple comparisons) from the response to a particular phoneme. This resulted in a PSI that ranges from 0 to 32, where a PSI of 32 indicates an extremely selective electrode and a PSI of 0 indicates no selectivity. A PSI describes an electrode's selectivity relative to one phoneme, and a vector of PSIs describes an electrode's selectivity profile across all phonemes.
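A sketch of the PSI computation for one electrode. Multiple-comparison correction is omitted here for brevity, and the trial representation (one scalar response per trial per phoneme) is an assumption:

```python
import numpy as np
from scipy.stats import ranksums

def psi_vector(resp_by_phoneme, alpha=0.01):
    """resp_by_phoneme: dict mapping phoneme -> 1-d array of single-trial
    responses at one electrode. PSI of a phoneme = number of other phonemes
    whose response distribution differs significantly (rank-sum test).
    Correction for multiple comparisons is omitted in this sketch."""
    phonemes = list(resp_by_phoneme)
    return {p: sum(ranksums(resp_by_phoneme[p], resp_by_phoneme[q]).pvalue < alpha
                   for q in phonemes if q != p)
            for p in phonemes}
```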

Spectrotemporal receptive field (STRF) estimation

The spectrotemporal representation of speech sounds was first estimated using a cochlear frequency model, consisting of a bank of logarithmically spaced constant Q asymmetric filters. The filter bank output was subjected to nonlinear compression, followed by a first order derivative along the spectral axis modeling a lateral inhibitory network, and an envelope estimation operation (Wang and Shamma, 1994). This resulted in a two dimensional spectrotemporal representation (spectrogram) of speech sounds simulating the pattern of activity on the auditory nerve.

We then estimated the spectrotemporal receptive fields (STRFs) of the sites from passive listening to TIMIT using normalized reverse correlation (Aertsen and Johannesma, 1981; Klein et al., 2000; Theunissen et al., 2001; Woolley et al., 2006) between the spectrotemporal representation of the sentences and the evoked neural activity (STRFLab software package: http://strflab.berkeley.edu, DirectFit routine). The STRF is a linear filter that describes which combinations of spectrotemporal features will elicit a neural response at a given electrode. The relationship between the STRF, H, the stimulus spectrogram, S (as estimated above), and the predicted response, r̂(t), of an electrode is given by the following equation:

r̂(t) = Σ_{f = 0}^{M − 1} Σ_{τ = 0}^{N − 1} H(τ, f) S(t − τ, f)

where N is the number of delays of length τ after which the STRF will be estimated (reflecting memory for the stimulus), and M is the number of frequency bands in the spectrogram. To estimate the STRF, we minimized the mean squared error between the predicted and observed responses. To prevent overfitting, we used an L2 regularization procedure in which a ridge hyperparameter and a sparseness hyperparameter were calculated for each electrode's STRF (details in [Woolley et al., 2006]). The ridge hyperparameter acts as a smoothing factor on the STRF, whereas the sparseness hyperparameter controls the number of non-zero weights in the STRF. These hyperparameters were optimized with a systematic grid search maximizing mutual information (bits/s). With the optimized hyperparameters, we calculated the final STRF and the correlation between the predicted and actual neural response using cross-validation. To do this, an STRF was derived using 9/10 of the stimulus–response pairs, and the Pearson correlation coefficient (indicating the STRF goodness-of-fit) was measured by predicting the remaining one-tenth of the responses. This was repeated 10 times with 10 non-overlapping stimulus–response pair sets. The final STRF and correlation were derived by averaging the 10 STRFs and correlation coefficients.
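The core of the STRF fit — a lagged design matrix and a ridge-regularized least-squares solve — can be sketched as follows. A single fixed ridge penalty stands in for the paper's hyperparameter grid search and DirectFit routine:

```python
import numpy as np

def lagged_design(S, n_delays):
    """Stack delayed copies of the spectrogram S (time x freq) so row t
    holds S(t - tau, :) for tau = 0..n_delays-1, realizing the convolution
    r(t) = sum_tau sum_f H(tau, f) S(t - tau, f)."""
    T, F = S.shape
    X = np.zeros((T, n_delays * F))
    for tau in range(n_delays):
        X[tau:, tau * F:(tau + 1) * F] = S[:T - tau]
    return X

def fit_strf(S, r, n_delays=10, ridge=1.0):
    """Ridge-regularized STRF estimate (normalized reverse correlation);
    the paper tunes ridge and sparseness hyperparameters by grid search,
    which this sketch replaces with one fixed penalty."""
    X = lagged_design(S, n_delays)
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ r)
    return w.reshape(n_delays, S.shape[1])  # H(tau, f)
```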

Note on statistical tests

To assess statistical differences, we used independent sample t-tests when the data were found not to deviate significantly from normality (KS test). When data were not normally distributed, we used the nonparametric Wilcoxon rank sum test. In some cases, a bootstrap t-test was used.

References

  1. Benjamini Y, Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57:289–300.
  2. Garofolo JS, Lamel LF, Fisher WF, Fiscus JG, Pallett DS, Dahlgren NL, Zue V. (1993) TIMIT Acoustic-Phonetic Continuous Speech Corpus. Philadelphia: Linguistic Data Consortium.
  3. Hartigan JA, Hartigan PM. (1985) The dip test of unimodality. Annals of Statistics 13:70–84.
  4. Henschke JU, Noesselt T, Scheich H, Budinger E. (2015) Possible anatomical pathways for short-latency multisensory integration processes in primary sensory cortices. Brain Structure & Function 220. doi: 10.1007/s00429-013-0694-4.
  5. Kalogeratos A, Likas A. (2012) Dip-means: an incremental clustering method for estimating the number of clusters. Advances in Neural Information Processing Systems. pp. 2402–2410.
  6. Ladefoged P, Johnson K. (2010) A Course in Phonetics. Boston, MA: Cengage Learning.

Decision letter

  1. Barbara G Shinn-Cunningham
    Reviewing Editor; Boston University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your work entitled "The Auditory Representation of Speech Sounds in Human Motor Cortex" for consideration by eLife. Your article has been reviewed by two peer reviewers, including Nicholas Hatsopoulos, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Reviewing Editor and David Van Essen as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This study examines evoked activity in speech motor cortex during passive listening to consonant-vowel (CV) syllables to determine the nature of these representations. High-density ECoG grids recorded local field potentials from the surface of motor and auditory cortices of epileptic patients who either produced CVs or passively listened to them. During speaking, evoked neural activity in ventral sensorimotor cortex (vSMC) was somatotopically organized according to place of articulation, whereas during listening, the organization seemed to be based on acoustic features. Both reviewers and the reviewing editor believe that the study could shed light on the important question of whether or not speech perception depends upon encoding spoken language in a motor-based neural code. The innovative experimental and analytical techniques add to our enthusiasm for the work. While the data suggest that language perception is encoded in acoustic features rather than speech-motor production features, all of us felt that there were some major issues, especially with the data analysis, that need to be addressed before the full impact of the study can be evaluated.

Essential revisions:

1) The methods state that the data were analyzed using a common-average reference. Following this, to estimate the high-γ power relative to baseline, the power (extracted using the Hilbert transform) in 8 bandpass frequency regions was computed and averaged, and then turned into a z score. Given this processing stream, the number and locations of the electrodes across which the common-average reference was computed might bias the findings, especially since oscillation power was ultimately analyzed (i.e., very strong high-γ power that was synchronous across a small number of nearby sites would end up looking like weak high-γ power across all sites). This issue needs to be discussed; perhaps some bootstrap analysis or other control could demonstrate that such artifacts are not an issue.

2) The authors argue that during listening "the tuning properties of responsive sites in vSMC.… appear to give rise to an acoustic sensory organization of speech sounds (rather than purely motor organization) in motor cortex during listening." The impact of the work rests on this claim, that the responses in vSMC during listening are organized by acoustic rather than motor properties of sound. However, proving this claim requires showing that statistically, the listening response in vSMC is quantitatively better described as "auditory" rather than "motor" (i.e., by a direct statistical comparison of results plotted in Figure 3c and f). Without this proof, the impact of this paper may be less than claimed. This should be addressed by a statistical comparison of the two fits of the vSMC listening results.

3) In the analyses that are provided, the authors select different electrodes when evaluating vSMC listening (98 electrodes) and vSMC speaking (362 electrodes). The electrodes evaluated in the vSMC speaking condition are not all active during listening. To allow direct comparison of results for speaking vs. listening, it makes more sense to select identical subsets for both analyses. This in turn means limiting the analysis to those sites that are active for both vSMC speaking and vSMC listening conditions. It seems problematic to claim that activity patterns are different across conditions when those patterns are being examined in different sets of electrodes. This can best be addressed by a reanalysis that directly compares vSMC speaking and vSMC listening responses in the more limited set of electrodes that respond in both cases.

4) Related to item 3, the MDS plots for vSMC listening (Figure 3c and f) are less clearly clustered than vSMC speaking. This may be an artifact of the 3x difference in sample sizes. Depending on how the authors decide to deal with item 2, this may not remain a problem; however, if there is a comparison of clustering results using different sample sizes, the authors need to verify that the results shown in Figure 3d and g are not an artifact of this difference (e.g., by showing that the same effect holds if the vSMC speaking and STG listening datasets are subsampled to equalize the sample size).

5) It may be that the evoked responses in motor cortex during passive listening reflect auditory inputs. While high-γ LFPs strongly correlate with multi-unit spiking (as the authors state), high-γ LFPs may also reflect the aggregate synaptic potentials near the electrode. The authors should discuss this possibility.

6) A motor representation in vSMC during listening may be present in a subset of the vSMC sites. The authors show that responses in some vSMC sites lead STG sites (Figure 4d-f). If the MDS analysis was restricted to sites that lead STG (i.e., vSMC signals with very short latencies), would these sites reveal a motor representation? Some vSMC sites may be generating motor-like responses that are triggered by the auditory stimulus, but that reflect motor, not sensory properties. Analysis of the short-latency vSMC responses would address this concern.

7) The authors report that 16 total sites in vSMC have activity that was "strongly" predicted by a linear STRF, with "strong" defined as r>0.10. The authors need to make clear where the subset of 16 electrodes with significant STRFs were located, and also whether or not this fraction of locations (16/X) is significantly above chance (e.g., correcting for false discovery rate). Most likely, X should include all auditory-responsive vSMC sites, but this is not stated clearly. In addition, the authors should explain by what criterion r>0.10 (i.e., a correlation in which 1% of the variance is explained by the STRF) is a good criterion for a "strong" prediction. For instance, how does this correlation strength compare to STRF predictions in STG?

https://doi.org/10.7554/eLife.12577.017

Author response

Essential revisions:

1) The methods state that the data were analyzed using a common-average reference. Following this, to estimate the high-γ power relative to baseline, the power (extracted using the Hilbert transform) in 8 bandpass frequency regions was computed and averaged, and then turned into a z score. Given this processing stream, the number and locations of the electrodes across which the common-average reference was computed might bias the findings, especially since oscillation power was ultimately analyzed (i.e., very strong high-γ power that was synchronous across a small number of nearby sites would end up looking like weak high-γ power across all sites). This issue needs to be discussed; perhaps some bootstrap analysis or other control could demonstrate that such artifacts are not an issue.

We agree that it is important to show that our use of the common average reference (CAR) does not adversely affect our results, and we have included more information in the manuscript to clarify what was done. We do not find that the CAR reduces the high-γ power, but rather serves to remove large artifacts across wide sensory and non-sensory areas that are likely non-neural in origin. In our study, the CAR was applied across 16 channel blocks running in columns across the ECoG grid before segmenting data into speech responsive or non-responsive. Blocks of 16 channels were used because when the ECoG grids are plugged into the amplifier, the 256 channels are separated into 16 channel sets, each of which is plugged into a separate bank (http://www.tdt.com/files/specs/PZ2.pdf). The grids are designed such that the 16 channel sets correspond to columns on the grid – for example, channels 1–16 share a common plug, 17–32 share the next plug, 33–48 the next, etc. Thus, when we subtract the CAR in these 16-channel banks, it removes noise that is shared across a group of wires plugged into the same bank on the amplifier. In practice, we find that this significantly reduces the amount of 60 Hz noise and removes some artifacts due to movement, but does not adversely affect signals in the high γ band. Because of the orientation of the grids, the columns of the grid generally run ventrodorsally and include electrodes over speech-selective and non-selective areas (STG and parietal cortex, or STG and motor cortex, for example). Thus, subtracting signals shared across this wide region does not generally affect the high-γ band power (that is, we do not find that strong high-γ power is weakened by the CAR). As a quantitative comparison, we compared the phoneme aligned high γ matrices with and without the CAR (shown for one subject in Author response image 1). There is no significant difference between the high γ data with and without the CAR (Wilcoxon sign rank test, p=1).

However, because of its ability to remove artifacts due to non-neural electrical noise (see Author response image 2 power spectrum showing power in different neural frequency bands with and without the CAR—error indicates standard deviation across channels), we chose to leave the results in the paper as is, using the CAR as described. We added the following sentences to the Methods section to clarify the use of the CAR: "The CAR was taken across 16 channel banks in order to remove non-neural electrical noise from shared inputs to the PZ2. We find that this method of CAR significantly reduces movement-related and other non-neural artifacts while not adversely affecting our signals of interest."

2) The authors argue that during listening "the tuning properties of responsive sites in vSMC.… appear to give rise to an acoustic sensory organization of speech sounds (rather than purely motor organization) in motor cortex during listening." The impact of the work rests on this claim, that the responses in vSMC during listening are organized by acoustic rather than motor properties of sound. However, proving this claim requires showing that statistically, the listening response in vSMC is quantitatively better described as "auditory" rather than "motor" (i.e., by a direct statistical comparison of results plotted in Figure 3c and f). Without this proof, the impact of this paper may be less than claimed. This should be addressed by a statistical comparison of the two fits of the vSMC listening results.

We agree that this direct statistical comparison is important for interpretation of our results and thank the reviewers for the suggestion. To this end, we have performed the statistical comparison of the data in Figure 3c and f (now Figure 4c and f) for vSMC electrodes during listening and found a significantly higher RIadj for acoustic (manner) features compared to place of articulation features (p<0.001, Wilcoxon rank sum test). We now show quantitatively that the listening responses in vSMC are indeed better characterized as "auditory" rather than "motor". This comparison is shown in Figure 4h and described in the figure legend. In addition, we have added the following text to the Results section: "Importantly, however, clustering by acoustic manner features was significantly better than clustering by place features in vSMC electrodes during listening (p<0.001, Wilcoxon rank-sum, Figure 4h)."

3) In the analyses that are provided, the authors select different electrodes when evaluating vSMC listening (98 electrodes) and vSMC speaking (362 electrodes). The electrodes evaluated in the vSMC speaking condition are not all active during listening. To allow direct comparison of results for speaking vs. listening, it makes more sense to select identical subsets for both analyses. This in turn means limiting the analysis to those sites that are active for both vSMC speaking and vSMC listening conditions. It seems problematic to claim that activity patterns are different across conditions when those patterns are being examined in different sets of electrodes. This can best be addressed by a reanalysis that directly compares vSMC speaking and vSMC listening responses in the more limited set of electrodes that respond in both cases.

We agree with the reviewers that our claim that areas of vSMC show an acoustic representation while listening and a place of articulation representation while speaking would be strengthened by analyzing the same subset of electrodes for both speaking and listening. We repeated the clustering analyses in Figure 4 using only the 98-electrode subset during speaking and listening, and found that our results still hold in this more limited set. Thus, we opted to replace panel b in Figure 4 with the results from the 98-electrode subset, and placed the former Figure 4b in Figure 4—figure supplement 2, panel c. When only the 98-electrode subset is analyzed, responses still appear to be similar for similar place of articulation – for example, /b/ and /p/, the bilabials, are close to one another in MDS space, as are the velars /k/ and /g/, and the alveolars /∫/, /s/, /t/, and /d/.
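For reference, an MDS embedding of this kind can be computed with classical (Torgerson) MDS from a matrix of pairwise distances between syllable response patterns. The sketch below is a generic numpy implementation, not our analysis code:

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: embed points from a pairwise distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]   # keep the largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

Feeding it, for example, a matrix of (1 − correlation) distances between mean high-γ response patterns for each CV syllable yields 2-D coordinates in which syllables with similar neural responses lie close together, which is how proximity of /b/ and /p/ in MDS space is read.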

The clustering trajectory analysis in Figure 4—figure supplement 1 has also been replaced using the same electrodes for speaking and listening. Responses are aligned according to the acoustic onset of the CV syllable (at t=0 s). Significant ∆RIadj values are indicated by the blue and red windows (FDR-corrected p<0.05, permutation test). Here it is clear that, just prior to and during speaking, vSMC responses cluster according to place of articulation. During listening, however, those same responses exhibit selectivity according to manner of articulation, reflecting an acoustic representation.

4) Related to item 3, the MDS plots for vSMC listening (Figure 3c and f) are less clearly clustered than for vSMC speaking. This may be an artifact of the 3x difference in sample sizes. Depending on how the authors decide to deal with item 2, this may not remain a problem; however, if there is a comparison of clustering results using different sample sizes, the authors need to verify that the results shown in Figure 3d and g are not an artifact of this difference (e.g., by showing that the same effect holds if the vSMC speaking and STG listening datasets are subsampled to equalize the sample size).

We agree with the reviewers that the strong clustering observed in the vSMC speaking data may be due to the larger sample size. However, in reanalyzing the speaking data as described in our response to point (3), we found that clustering according to place of articulation was still observed when the speaking data were limited to the same set of electrodes used in the vSMC listening analysis. Thus, we have chosen to present the clustering data using the same sample size for vSMC listening and speaking, and present for reference the original analysis with speaking data for all active electrodes in Figure 4—figure supplement 2c.

5) It may be that the evoked responses in motor cortex during passive listening reflect auditory inputs. While high-γ LFPs strongly correlate with multi-unit spiking (as the authors state), high-γ LFPs may also reflect the aggregate synaptic potentials near the electrode. The authors should discuss this possibility.

We thank the reviewers for the suggestion, and have added to the last paragraph of the Discussion section additional details on where such auditory signals may arise. While it is possible that some inferior vSMC electrodes are picking up on auditory inputs, we believe it to be unlikely based on other analyses we have performed in the lab on spatial correlations across the ECoG grid at different neural frequency bands. We have added the following text:

"Alternatively, evoked responses in the motor cortex during passive listening may directly reflect auditory inputs arising from aggregated activity picked up by the electrode. We believe the latter scenario to be less likely, however, given that auditory responses were observed in dorsal vSMC on electrode contacts several centimeters away from auditory inputs in the STG. In addition, the spatial spread of neural signals in the high γ range is substantially smaller than this distance – high γ signal correlations at <2 mm spacing are only around r=0.5, and at distances of 1 cm reach a noise floor (Chang EF, Neuron 2015; Muller et al., in revision). Given the acoustic rather than place selectivity observed during listening in the vSMC, our results suggest that motor theories of speech perception may need to be revised to incorporate a novel sensorimotor representation of sound in the vSMC."

6) A motor representation in vSMC during listening may be present in a subset of the vSMC sites. The authors show that responses in some vSMC sites lead STG sites (Figure 4D-F). If the MDS analysis was restricted to sites that lead STG (i.e., vSMC signals with very short latencies), would these sites reveal a motor representation? Some vSMC sites may be generating motor-like responses that are triggered by the auditory stimulus, but that reflect motor, not sensory, properties. Analysis of the short-latency vSMC responses would address this concern.

We agree that this is an interesting additional analysis to determine whether the short-latency sites in the vSMC show differences in their motor vs. sensory responses. We performed the MDS analysis as suggested by restricting the analysis to sites that lead the STG, as evidenced by a positive asymmetry index (see Figure 3f and the Methods section, Cross Correlation analysis). This analysis yielded 50 vSMC electrodes. We found that even in this subset of short-latency responders, responses during listening appeared to cluster according to acoustic rather than articulatory features. This suggests that vSMC does not generate motor-like responses during listening. These results are now shown in Figure 4—figure supplement 3.
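As a rough illustration of how a lead/lag asymmetry index can be computed from two response time series, the sketch below defines one plausible version (positive values mean the first series leads); the exact definition we used is given in the Methods section:

```python
import numpy as np

def asymmetry_index(x, y, max_lag):
    """Cross-correlation asymmetry between two response time series.

    Positive values indicate that x tends to lead y. This is one plausible
    definition for illustration, comparing squared cross-correlation mass at
    positive vs. negative lags.
    """
    n = len(x)
    xc = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:n - lag], y[lag:]     # corr(x[t], y[t + lag])
        else:
            a, b = x[-lag:], y[:n + lag]    # corr(x[t], y[t + lag]), lag < 0
        xc[lag] = np.corrcoef(a, b)[0, 1]
    pos = sum(xc[l] ** 2 for l in range(1, max_lag + 1))
    neg = sum(xc[l] ** 2 for l in range(-max_lag, 0))
    return (pos - neg) / (pos + neg)
```

Selecting electrodes with a positive index relative to STG responses is then a direct way of restricting an analysis to short-latency (leading) vSMC sites.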

7) The authors report that 16 total sites in vSMC have activity that was "strongly" predicted by a linear STRF, with "strong" defined as r>0.10. The authors need to make clear where the subset of 16 electrodes with significant STRFs were located, and also whether or not this fraction of locations (16/X) is significantly above chance (e.g., correcting for false discovery rate). Most likely, X should include all auditory-responsive vSMC sites, but this is not stated clearly. In addition, the authors should explain by what criterion r>0.10 (i.e., a correlation in which 1% of the variance is explained by the STRF) is a good criterion for a "strong" prediction. For instance, how does this correlation strength compare to STRF predictions in STG?

We appreciate the reviewers’ comments and have attempted to clarify the results of the STRF analysis in the new manuscript, as well as address the issue of correction for multiple comparisons. We recalculated our STRF correlations using a shuffled permutation test in which we take the predicted response from the STRF and shuffle it in time, using 500 ms chunks to maintain some of the temporal correlations inherent in the ECoG signal. We then calculate the correlation between the shuffled predictions and the actual response to get a sense of the magnitude of STRF correlations that would be observed due to chance. We perform this shuffling procedure 1000 times and calculate the p-value as the percentage of times that the shuffled time series correlations are higher than the correlations from our STRF prediction. We then perform FDR correction using the Benjamini-Hochberg procedure. As it turns out, there are many STRF correlations with r-values less than 0.1 that are still significantly higher than what would be expected due to chance. Still, we believe r = 0.1 to be a reasonable cutoff and have opted not to include more sites with lower correlation values. We have toned down the claim that these sites are "strongly" predicted by a linear STRF, instead writing “A fraction of vSMC sites (16/98 sites total) were reasonably well-predicted with a linear STRF (r>=0.10, p<0.01) (Theunissen et al., 2001).” Although this fraction is relatively small, these sites may encode auditory-related features that are not spectrotemporal in nature, or may respond nonlinearly to auditory features, and thus would not be captured appropriately by the STRF. In addition, more typical auditory areas including the anterior STG can show significant auditory responses that are not adequately captured by a linear STRF. 
The electrodes described here were located in both superior and inferior vSMC, as shown in Figure 5, which includes STRFs from all 5 subjects plotted as a function of the rostral-caudal distance from the central sulcus and the dorsal distance from the Sylvian fissure in millimeters.

On average, the correlation strength in the vSMC electrodes is lower than the correlation strength in STG electrodes, as shown by the new panel b added to Figure 5 and additional text in the Results section: “Still, the prediction performance of STRFs in vSMC was generally lower than that of the STG (Figure 5b).”
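The shuffling procedure and FDR correction described above can be sketched as follows. This is a simplified numpy reimplementation for illustration only; for example, the p-value here uses standard +1 smoothing rather than the raw percentage of shuffles.

```python
import numpy as np

def chunk_shuffle(x, chunk_len, rng):
    """Shuffle a time series in contiguous chunks, preserving local temporal correlations."""
    n_chunks = len(x) // chunk_len
    order = rng.permutation(n_chunks)
    return np.concatenate([x[i * chunk_len:(i + 1) * chunk_len] for i in order])

def permutation_pvalue(actual, predicted, chunk_len, n_perm=1000, seed=0):
    """Permutation p-value for an STRF prediction correlation.

    The predicted response is shuffled in chunks (e.g. 500 ms at the sampling
    rate in use) and re-correlated with the actual response to build a null
    distribution of chance correlations.
    """
    rng = np.random.default_rng(seed)
    n = (len(actual) // chunk_len) * chunk_len
    actual, predicted = actual[:n], predicted[:n]
    r_obs = np.corrcoef(actual, predicted)[0, 1]
    null = np.array([
        np.corrcoef(actual, chunk_shuffle(predicted, chunk_len, rng))[0, 1]
        for _ in range(n_perm)
    ])
    return r_obs, (np.sum(null >= r_obs) + 1) / (n_perm + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of p-values significant after Benjamini-Hochberg FDR correction."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    mask = np.zeros(len(p), dtype=bool)
    passed = np.where(ranked <= alpha)[0]
    if passed.size:
        mask[order[:passed.max() + 1]] = True  # all ranks up to the largest passing rank
    return mask
```

Running `permutation_pvalue` per electrode and passing the resulting p-values through `benjamini_hochberg` reproduces the overall logic of the significance test, with the chunk length chosen to correspond to 500 ms of signal.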

https://doi.org/10.7554/eLife.12577.018

Article and author information

Author details

  1. Connie Cheung

    1. Graduate Program in Bioengineering, University of California, Berkeley-University of California, San Francisco, San Francisco, United States
    2. Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
    3. Center for Integrative Neuroscience, University of California, San Francisco, San Francisco, United States
    4. Department of Physiology, University of California, San Francisco, San Francisco, United States
    Contribution
    CC, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Contributed equally with
    Liberty S Hamilton
    Competing interests
    The authors declare that no competing interests exist.
  2. Liberty S Hamilton

    1. Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
    2. Center for Integrative Neuroscience, University of California, San Francisco, San Francisco, United States
    3. Department of Physiology, University of California, San Francisco, San Francisco, United States
    Contribution
    LSH, Analysis and interpretation of data, Drafting or revising the article
    Contributed equally with
    Connie Cheung
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0003-0182-2500
  3. Keith Johnson

    Department of Linguistics, University of California, Berkeley, Berkeley, United States
    Contribution
    KJ, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  4. Edward F Chang

    1. Graduate Program in Bioengineering, University of California, Berkeley-University of California, San Francisco, San Francisco, United States
    2. Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
    3. Center for Integrative Neuroscience, University of California, San Francisco, San Francisco, United States
    4. Department of Physiology, University of California, San Francisco, San Francisco, United States
    Contribution
    EFC, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    edward.chang@ucsf.edu
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0003-2480-4700

Funding

National Institute on Deafness and Other Communication Disorders (1F32DC014192-01)

  • Liberty S Hamilton

NIH Office of the Director (OD00862)

  • Edward F Chang

McKnight Foundation

  • Edward F Chang

National Institute on Deafness and Other Communication Disorders (R01-DC012379)

  • Edward F Chang

National Institute of Neurological Disorders and Stroke (R00-NS065120)

  • Edward F Chang

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

N Mesgarani, M Leonard, and J Houde provided helpful comments on the manuscript. EFC was funded by the US National Institutes of Health grants R01-DC012379, R00-NS065120, DP2-OD00862, and the Ester A and Joseph Klingenstein Foundation. LSH was funded by a Ruth L Kirschstein National Research Service Award (1F32DC014192-01) through the NIH National Institute on Deafness and Other Communication Disorders.

Ethics

Human subjects: Written informed consent was obtained from all study participants. The study protocol was approved by the UCSF Committee on Human Research.

Reviewing Editor

  1. Barbara G Shinn-Cunningham, Reviewing Editor, Boston University, United States

Publication history

  1. Received: October 25, 2015
  2. Accepted: February 12, 2016
  3. Version of Record published: March 4, 2016 (version 1)
  4. Version of Record updated: April 27, 2016 (version 2)

Copyright

© 2016, Cheung et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
