Cortical encoding of acoustic and linguistic rhythms in spoken narratives

Abstract
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Speech contains rich acoustic and linguistic information. Using highly controlled speech materials, previous studies have demonstrated that cortical activity is synchronous to the rhythms of perceived linguistic units, for example, words and phrases, on top of basic acoustic features, for example, the speech envelope. When listening to natural speech, it remains unclear, however, how cortical activity jointly encodes acoustic and linguistic information. Here we investigate the neural encoding of words using electroencephalography and observe neural activity synchronous to multi-syllabic words when participants naturally listen to narratives. An amplitude modulation (AM) cue for word rhythm enhances the word-level response, but the effect is only observed during passive listening. Furthermore, words and the AM cue are encoded by spatially separable neural responses that are differentially modulated by attention. These results suggest that bottom-up acoustic cues and top-down linguistic knowledge separately contribute to cortical encoding of linguistic units in spoken narratives.

Introduction

When listening to speech, low-frequency cortical activity in the delta (<4 Hz) and theta (4–8 Hz) bands is synchronous to speech (Keitel et al., 2018; Luo and Poeppel, 2007). However, it remains debated what speech features are encoded in the low-frequency cortical response. A large number of studies have demonstrated that the low-frequency cortical response tracks low-level acoustic features in speech, for example, the speech envelope (Destoky et al., 2019; Ding and Simon, 2012; Koskinen and Seppä, 2014; Di Liberto et al., 2015; Nourski et al., 2009; Peelle et al., 2013). Since the theta-band speech envelope provides an important acoustic cue for syllable boundaries, it has been hypothesized that neural tracking of the theta-band speech envelope is a mechanism to segment continuous speech into discrete units of syllables (Giraud and Poeppel, 2012; Poeppel and Assaneo, 2020). In other words, the theta-band envelope-tracking response reflects an intermediate neural representation linking auditory representation of acoustic speech features and phonological representation of syllables. Consistent with this hypothesis, it has been found that neural tracking of speech envelope is related to both low-level speech features (Doelling et al., 2014) and perception. On one hand, it can occur when speech recognition fails (Etard and Reichenbach, 2019; Howard and Poeppel, 2010; Peña and Melloni, 2012; Zoefel and VanRullen, 2016; Zou et al., 2019). On the other hand, it is strongly modulated by attention (Zion Golumbic et al., 2013; Kerlin et al., 2010) and may be a prerequisite for successful speech recognition (Vanthornhout et al., 2018).

Speech comprehension, however, requires more than syllabic-level processing. Previous studies suggest that low-frequency cortical activity can also reflect neural processing of higher-level linguistic units, for example, words and phrases (Buiatti et al., 2009; Ding et al., 2016a; Keitel et al., 2018), and the prosodic cues related to these linguistic units, for example, delta-band speech envelope and pitch contour (Bourguignon et al., 2013; Li and Yang, 2009; Steinhauer et al., 1999). One line of research demonstrates that, when listening to natural speech, cortical responses can encode word onsets (Brodbeck et al., 2018) and capture semantic similarity between words (Broderick et al., 2018). It remains to be investigated, however, how bottom-up prosodic cues and top-down linguistic knowledge separately contribute to the generation of these word-related responses. Another line of research selectively focuses on top-down processing driven by linguistic knowledge. These studies demonstrate that cortical responses are synchronous to perceived linguistic units, for example, words, phrases, and sentences, even when acoustic correlates of these linguistic units are not available (Ding et al., 2016a; Ding et al., 2018; Jin et al., 2018; Makov et al., 2017). Based on these results, it has been hypothesized that low-frequency cortical activity can reflect linguistic-level neural representations that are constructed based on internal linguistic knowledge instead of acoustic cues (Ding et al., 2016a; Ding et al., 2018; Meyer et al., 2020). Nevertheless, to dissociate linguistic units with the related acoustic cues, most of these studies present speech as an isochronous sequence of synthesized syllables, which organizes into a sequence of unrelated words and sentences. Therefore, it remains unclear whether cortical activity can synchronize to linguistic units in natural spoken narratives, and how it is influenced by bottom-up acoustic cues and top-down linguistic knowledge.

Here we first asked whether cortical activity could reflect the rhythm of disyllabic words in semantically coherent stories. The story was either naturally read or synthesized as an isochronous sequence of syllables to remove acoustic cues for word boundaries (Ding et al., 2016a). We then asked how the neural response to disyllabic words was influenced by acoustic cues for words. To address this question, we amplitude modulated isochronous speech at the word rate and tested how this word-synchronized acoustic cue modulated the word response. Finally, since previous studies have shown that cortical tracking of speech strongly depended on the listeners’ task (Ding and Simon, 2012; Zion Golumbic et al., 2013; O'Sullivan et al., 2015), we designed two tasks during which the neural responses were recorded. One task required attentive listening to speech and answering of comprehension questions afterwards, while in the other task participants were engaged in watching a silent movie while passively listening to speech.

Results

Neural encoding of words in isochronously presented narratives

We first presented semantically coherent stories that were synthesized as an isochronous sequence of syllables (Figure 1A, left). To produce a metrical structure in stories, every other syllable was designed to be a word onset. More specifically, the odd terms in the metrical syllable sequence always corresponded to the initial syllable of a word, that is, word onset, while the even terms corresponded to either the second syllable of a disyllabic word (73% probability) or a monosyllabic word (23% probability). In the following, the odd terms of the syllable sequence were referred to as σ1, and the even terms as σ2. Since syllables were presented at a constant rate of 4 Hz, the neural response to syllables was frequency tagged at 4 Hz. Furthermore, since every other syllable in the sequence was the onset of a word, neural activity synchronous to word onsets was expected to show a regular rhythm at half of the syllabic rate, that is, 2 Hz (Figure 1A, right).

Figure 1

Download asset Open asset

Stimulus.

(**A and B**) Two types of stories are constructed: metrical stories and nonmetrical stories. (A) Metrical stories are composed of disyllabic words and pairs of monosyllabic words, so that the odd terms in the syllable sequence (referred to as σ1) must be the onset of a word. Here the onset syllable of each word is shown in bold. All syllables are presented at a constant rate of 4 Hz. A 500 ms gap is inserted at the position of any punctuation. The red curve illustrates cortical activity synchronous to word onsets, and it shows a 2-Hz rhythm, which can be clearly observed in the spectrum shown on the right. The stories are in Chinese and English examples are shown for illustrative purposes. (B) In the nonmetrical stories, word onsets are not regularly positioned, and activity that is synchronous to word onsets does not show 2-Hz rhythmicity. (C) Natural speech. The stories are naturally read by a human speaker and the duration of syllables is not controlled. (D) Amplitude-modulated isochronous speech is constructed by amplifying either σ1 or σ2 by a factor of 4, creating a 2-Hz amplitude modulation. The red and blue curves illustrate responses that are synchronous to word onsets and amplified syllables, respectively. The response synchronous to word onsets is identical for σ1- and σ2-amplified speech, that is, the phase difference was 0° at 2 Hz. In contrast, the response synchronous to amplified syllables is offset by 250 ms between conditions, that is, the phase difference was 180° at 2 Hz.

As a control condition, we also presented stories with a nonmetrical structure (Figure 1B, left). These stories were referred to as the nonmetrical stories in the following. In these stories, the word duration was not controlled and σ1 was not always a word onset. Given that the word onsets in these stories did not show rhythmicity at 2 Hz, neural activity synchronous to the word onsets was not frequency tagged at 2 Hz (Figure 1B, right).

When listening to the stories, one group of participants was asked to attend to the stories and answer comprehension questions presented at the end of each story. The task was referred to as story comprehension task, and the participants correctly answered 96 ± 9% and 94 ± 9% questions for metrical and nonmetrical stories, respectively. Another group of participants, however, were asked to watch a silent movie while passively listening to the same set of stories as those used in the story comprehension task. The silent movie was not related to the auditorily presented stories, and the participants did not have to answer any speech comprehension question. The task was referred to as movie watching task. The electroencephalogram (EEG) responses to isochronously presented stories are shown in Figure 2A and C. The response spectrum was averaged over participants and EEG electrodes.

Figure 2 with 1 supplement see all

Download asset Open asset

Electroencephalogram (EEG) response spectrum.

Response spectrum is averaged over participants and EEG electrodes. The shaded area indicates one standard error of the mean (SEM) across participants. Stars indicate significantly higher power at 2 or 4 Hz than the power averaged over four neighboring frequency bins (two on each side). *p<0.05, **p<0.001 (bootstrap, false discovery rate [FDR] corrected). The color of the star is the same as the color of the spectrum. The topography on the top of each plot shows the distribution of response power at 2 Hz and 4 Hz. The five black dots in the topography indicate the position of electrodes FCz, Fz, Cz, FC3, and FC4. (**A–D**) Response spectrum for isochronous speech and amplitude-modulated speech during two tasks. To facilitate the comparison between stimuli, the red curves in panels A and C are repeated in panels B and D, respectively. (E) Response spectrum when the participants listen to natural speech. In this analysis, the response to natural speech is time warped to simulate the response to isochronous speech, and then transformed into the frequency-domain.

Figure 2—source data 1 Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.: https://cdn.elifesciences.org/articles/60433/elife-60433-fig2-data1-v2.mat
Download elife-60433-fig2-data1-v2.mat

Figure 2A shows the responses from the participants attentively listening to the stories in story comprehension task. For metrical stories, two peaks were observed in the EEG spectrum, one at 4 Hz, that is, the syllable rate (p=0.0001, bootstrap, false discovery rate [FDR] corrected) and the other at 2 Hz, that is, the rate of disyllabic words (p=0.0001, bootstrap, FDR corrected). For nonmetrical stories, however, a single response peak was observed at 4 Hz (p=0.0001, bootstrap, FDR corrected), while no significant response peak was observed at 2 Hz (p=0.27, bootstrap, FDR corrected). A comparison of the responses to metrical and nonmetrical stories was performed, and a significant difference was observed at 2 Hz (p=0.0005, bootstrap, FDR corrected, Figure 3A) but not at 4 Hz (p=0.40, bootstrap, FDR corrected, Figure 3B). The response topography showed a centro-frontal distribution.

Figure 3 with 1 supplement see all

Download asset Open asset

Response power and phase.

(**A and B**) Response power at 2 and 4 Hz. Color of the bars indicates the stimulus. Black stars indicate significant differences between different types of speech stimuli while red stars indicate significant differences between tasks. *p<0.05, **p<0.01 (bootstrap, false discovery rate [FDR] corrected). Throughout the manuscript, in all bar graphs of response power, the response power at a target frequency is subtracted by the power averaged over four neighboring frequency bins (two on each side) to reduce the influence of background neural activity. (C) The difference in 2-Hz response phase between the σ1- and σ2-amplified conditions at 2 Hz. The phase difference is averaged across participants, and the polar histogram shows the distribution of phase difference across 64 electrodes.

Figure 3—source data 1 Preprocessed EEG data recorded in Experiment 1-3.: https://cdn.elifesciences.org/articles/60433/elife-60433-fig3-data1-v2.mat
Download elife-60433-fig3-data1-v2.mat

When participants watched a silent movie during story listening, however, a single response peak was observed at 4 Hz for both metrical (p=0.0002, bootstrap, FDR corrected) and nonmetrical stories (p=0.0002, bootstrap, FDR corrected) (Figure 2C). The response peak at 2 Hz was not significant for either kind of stories (p>0.074, bootstrap, FDR corrected). A comparison of the responses to metrical and nonmetrical stories did not find significant difference at either 2 Hz (p=0.22, bootstrap, FDR corrected, Figure 3A) or 4 Hz (p=0.39, bootstrap, FDR corrected, Figure 3B). Furthermore, the 2-Hz response was significantly stronger in the story comprehension task than in the movie watching task (p=0.0004, bootstrap, FDR corrected, Figure 3A). These results showed that cortical activity was synchronous to the word rhythm during attentive speech comprehension. When attention was diverted, however, the word-rate response was no longer detected.

Neural encoding of words in natural spoken narratives

Next, we asked whether cortical activity was synchronous to disyllabic words in natural speech processing. The same set of stories used in the isochronous speech condition was read in a natural manner by a human speaker and presented to participants. The participants correctly answered 95 ± 4% and 97 ± 6% comprehension questions for metrical and nonmetrical stories, respectively. In natural speech, syllables were not produced at a constant rate (Figure 1C), and therefore the syllable and word responses were not frequency tagged. Nevertheless, we time warped the response to natural speech and made the syllable and word responses periodic. Specifically, the neural response to each syllable in natural speech was extracted and realigned to a constant 4-Hz rhythm using a convolution-based procedure (Jin et al., 2018; see Materials and methods for details). After time-warping analysis, cortical activity synchronous to the word onsets was expected to show a 2-Hz rhythm, the same as the response to the isochronous speech (Figure 2E).

The spectrum of the time-warped response was averaged over participants and EEG electrodes (Figure 2E). For metrical stories, two peaks were observed in spectrum of the time-warped response, one at 4 Hz (p=0.0002, bootstrap, FDR corrected) and the other at 2 Hz (p=0.0007, bootstrap, FDR corrected). For nonmetrical stories, however, a single response peak was observed at 4 Hz (p=0.0002, bootstrap, FDR corrected), while no significant response peak was observed at 2 Hz (p=0.37, bootstrap, FDR corrected). A comparison of the responses to metrical and nonmetrical stories was performed and found a significant difference at 2 Hz (p=0.036, bootstrap, FDR corrected, Figure 3A) but not at 4 Hz (p=0.09, bootstrap, FDR corrected, Figure 3B). These results demonstrated that cortical activity was synchronous to the word rhythm in natural spoken narratives during attentive speech comprehension.

Neural responses to amplitude-modulated speech

To investigate potential interactions between neural encoding of linguistic units and relevant acoustic features, we amplitude modulated the isochronous speech at the word rate, that is, 2 Hz. The amplitude modulation (AM) was achieved by amplifying either σ1 or σ2 by a factor of 4 (Figure 1D), creating two conditions: σ1-amplified condition and σ2-amplified condition. Such 2-Hz AM provided an acoustic cue for the word rhythm. The speech was referred to as amplitude-modulated speech in the following. When listening to amplitude-modulated speech, the participants correctly answered 94 ± 12% and 97 ± 8% comprehension questions in the σ1- and σ2-amplified conditions, respectively.

The EEG responses to amplitude-modulated speech are shown in Figure 2B and D. When participants attended to the speech, significant response peaks were observed at 2 Hz (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0001, bootstrap, FDR corrected, Figure 2D). A comparison between the responses to isochronous speech and amplitude-modulated speech found that AM did not significantly influence the power of the 2-Hz neural response (σ1-amplified vs. σ2-amplified: p=0.56; σ1-amplified vs. isochronous: p=0.38, σ2-amplified vs. isochronous: p=0.44, bootstrap, FDR corrected, Figure 3A). These results showed that the 2-Hz response power was not significantly influenced by the 2- Hz AM during attentive speech comprehension.

Another group of participants passively listened to amplitude-modulated speech while attending to a silent movie. In their EEG responses, significant response peaks were also observed at 2 Hz (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0001, bootstrap, FDR corrected, Figure 2D). A comparison between the responses to isochronous speech and amplitude-modulated speech showed stronger 2-Hz response in the processing of amplitude-modulated speech than isochronous speech (σ1-amplified vs. isochronous: p=0.021, σ2-amplified vs. isochronous: p=0.0042, bootstrap, FDR corrected, Figure 3A). No significant difference was found between responses to σ1-amplified and σ2-amplified speech (p=0.069, bootstrap, FDR corrected, Figure 3A). Therefore, when attention was diverted, the 2-Hz response power was significantly increased by the 2-Hz AM.

The Fourier transform decomposes an arbitrary signal into sinusoids and each complex-valued Fourier coefficient captures the magnitude and phase of a sinusoid. The power spectrum reflects the response magnitude but ignores the response phase. The phase difference between the 2-Hz responses in the σ1- and σ2-amplified conditions, however, carried important information about whether the neural response was synchronous to the word onsets or amplified syllables: Neural activity synchronous to amplified syllables showed a 250 ms time lag between the σ1- and σ2-amplified conditions (Figure 1D), which corresponded to a 180° phase difference between conditions at 2 Hz. Neural activity synchronous to the word onsets, however, should be identical in the σ1- and σ2-amplified conditions.

The 2-Hz response phase difference between the σ1- and σ2-amplified conditions is shown in Figure 3C. The phase difference, when averaged across participants and EEG electrodes, was 41° (95% confidence interval: −25–91°) during the story comprehension task and increased to 132° (95% confidence interval: 102–164°) during the movie watching task (see Figure 3—figure supplement 1 for response phase in individual conditions).

Separate the responses to words and word-rate AM

In the previous section, we analyzed the difference in 2-Hz response phase difference between the σ1- and σ2-amplified conditions. A 0° phase difference indicated that the 2-Hz response was only synchronous to word onsets, while a 180° phase difference indicated that the 2-Hz response was only synchronous to amplified syllables. A phase difference between 0° and 180° indicated that neural response synchronous to word onsets and neural response synchronous to amplified syllables both exist, but the phase analysis could not reveal the strength of the two response components. Therefore, in the following, we extracted the neural responses to words and the 2-Hz AM by averaging the responses across the σ1- and σ2-amplified conditions in different manners.

To extract the neural response to words, we averaged the neural response waveforms across the σ1- and σ2-amplified conditions. The word onsets were aligned in these two conditions and therefore neural activity synchronous to the word onsets was preserved in the average. In contrast, cortical activity synchronous to the amplified syllables exhibited a 250 ms time lag between the σ1- and σ2-amplified conditions. Therefore, the 2-Hz response synchronous to the amplified syllables was 180° out of phase between the two conditions, and got canceled in the average across conditions. In sum, the average over the σ1- and σ2-amplified conditions preserved the response to words but canceled the response to amplified syllables (Figure 4A). This average was referred to as the word response in the following.

Figure 4

Download asset Open asset

Word and amplitude modulation (AM) responses to amplitude-modulated speech.

(**A–D**) Neural responses in σ1- and σ2-amplified conditions are aligned based on either word onsets (AB) or amplified syllables (CD), and averaged to extract the 2-Hz response synchronous to words or AM, respectively. Panels A and C illustrate the procedure. The red and blue curves illustrate the response components synchronous to word onsets and the AM respectively, which are mixed in the electroencephalogram (EEG) measurement and shown separately for illustrative purposes. The spectrum and topography in panels B and D are shown the same way as they are shown in Figure 2. *p<0.05, **p<0.001 (bootstrap, false discovery rate [FDR] corrected). (E) Response power at 2 Hz. Black stars indicate significant differences between the word and AM responses, while red stars indicate a significant difference between tasks. *p<0.05, **p<0.01 (bootstrap, FDR corrected). (F) The left panel shows the power difference between the word and AM responses in single electrodes. The right panel shows the difference in normalized topography, that is, topography divided by its maximal value. Black dots indicate electrodes showing a significant difference between the word and AM responses (p<0.05, bootstrap, FDR corrected).

Figure 4—source data 1 Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.: https://cdn.elifesciences.org/articles/60433/elife-60433-fig4-data1-v2.mat
Download elife-60433-fig4-data1-v2.mat

For the word response, a significant response peak was observed at 2 Hz during both story comprehension task (p=0.0001, bootstrap, FDR corrected) and movie watching task (p=0.011, bootstrap, FDR corrected, Figure 4B). The response power was significantly stronger in the story comprehension task than that in the movie watching task (p=0.0008, bootstrap, FDR corrected, Figure 4E). During the movie watching task, a significant 2-Hz word response was observed when amplitude-modulated speech was presented (p=0.011, bootstrap, FDR corrected, Figure 4B), while no significant 2-Hz word response was observed when isochronous speech was presented (p=0.074, bootstrap, FDR corrected, Figure 2C). This finding indicated that the 2-Hz AM facilitated word processing during passive listening.

Furthermore, during the story comprehension task, the power of the 2-Hz word response did not significantly differ between amplitude-modulated speech and isochronous speech (p=0.69, bootstrap, FDR corrected), suggesting that the 2-Hz AM did not significantly modulate the 2-Hz word response during attentive speech comprehension.

To extract the neural response to 2-Hz AM, we first aligned the responses in the σ1- and σ2-amplified conditions by adding a delay of 250 ms to the response in the σ1-amplified condition. We then averaged the response waveforms across conditions (Figure 4C). After the delay was added to the response to σ1-amplified speech, the 2-Hz AM was identical between the σ1- and σ2-amplified conditions. Therefore, the average across conditions preserved the response to 2-Hz AM while canceling the response to words. The averaged response was referred to as the AM response in the following.

For the AM response, significant 2-Hz response peaks were observed during both the story comprehension task (p=0.0021, bootstrap, FDR corrected) and the movie watching task (p=0.0001, bootstrap, FDR corrected). The 2-Hz response power did not significantly differ between tasks (p=0.39, bootstrap, FDR corrected), suggesting that the 2-Hz AM response was not strongly enhanced when participants attended to speech.

During the story comprehension task, the 2-Hz response power averaged across electrodes was significantly stronger for the word response than the AM response (p=0.021, bootstrap, FDR corrected, Figure 4E). During the movie watching task, however, the reverse was true, that is, the 2-Hz word response was significantly weaker than the 2-Hz AM response (p=0.013, bootstrap, FDR corrected, Figure 4E).

An analysis of individual electrodes showed the 2-Hz word response was significantly stronger than the 2-Hz AM response in centro-frontal electrodes during the story comprehension task (p<0.05, bootstrap, FDR corrected, Figure 4F, left). To further compare the spatial distribution of 2-Hz word and AM responses, on top of their power difference, we normalized the response topography by dividing its maximum value. Significant difference was found in the normalized topography between the 2-Hz word and AM responses in temporal electrodes (p<0.05, bootstrap, FDR corrected, Figure 4F, right). These results suggested that the 2-Hz word and AM responses had distinct neural sources.

Neural responses to amplitude-modulated speech: a replication

The responses to amplitude-modulated speech suggested that the brain encoded words and acoustic cues related to words in a different manner. Furthermore, the word-related acoustic cues seemed to facilitate word processing during passive listening, but not significantly enhance the word response during attentive listening. To further validate these findings, we conducted a replication experiment to measure the neural response to amplitude-modulated speech in a separate group of participants. These participants first performed the movie watching task and then the story comprehension task. During the story comprehension task, the participants correctly answered 96 ± 8% and 95 ± 9% questions in the σ1-amplified and σ2-amplified conditions, respectively.

We first analyzed the spectrum of response to σ1- and σ2-amplified speech. Consistent with previous results, a significant 2-Hz response peak was observed whether the participants attended to the speech (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0001, bootstrap, FDR corrected) or movie (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0004, bootstrap, FDR corrected, Figure 5—figure supplement 1A and B). When averaged across participants and electrodes, the 2-Hz response phase difference between the σ1- and σ2-amplified conditions was 7° (95% confidence interval: −26–47°) during the story comprehension task and 96° (95% confidence interval: 55–143°) during the movie watching task (Figure 5—figure supplement 1C and D).

We then separately extracted the word and AM responses following the procedures described in Figure 4A and C. Similar to previous results, in the spectrum (Figure 5A and B), a significant 2-Hz word response was observed during both tasks (story comprehension task: p=0.0001, bootstrap; movie watching task: p=0.0003, bootstrap, FDR corrected), and a significant 2-Hz AM response was also observed during both tasks (story comprehension task: p=0.0001, bootstrap; movie watching task: p=0.0001, bootstrap, FDR corrected). Furthermore, the 2-Hz word response exhibited significantly stronger power than the 2-Hz AM response during the story comprehension task (p=0.0036, bootstrap, FDR corrected, Figure 5C), and the two responses showed distinct topographical distribution (Figure 5D). In sum, the results obtained in the replication experiment were consistent with those from the original experiment. A follow-up comparison of the results from the original and replication experiments suggested that the 2-Hz word response exhibited significantly stronger power during the movie watching task in the replication experiment than the original experiment (p=0.0008, bootstrap, FDR corrected).

Figure 5 with 1 supplement see all

Download asset Open asset

Replication of the neural response to amplitude-modulated speech.

(**A and B**) Spectrum and topography for the word response (A) and amplitude modulation (AM) response (B). Colored stars indicate frequency bins with stronger power than the power averaged over four neighboring frequency bins (two on each side). *p<0.05, **p<0.001 (bootstrap, false discovery rate [FDR] corrected). The topography on the top of each plot shows the distribution of response power at 2 Hz and 4 Hz. (C) Response power at 2 Hz. Black stars indicate significant differences between the word and (AM) responses, while red stars indicate a significant difference between tasks. *p<0.05, **p<0.01 (bootstrap, FDR corrected). (D) Power difference between the word and AM responses in individual electrodes are shown in the left panel. To further illustrate the difference in topographical distribution instead of the response power, each response topography is normalized by dividing its maximum value. The difference in the normalized topography is shown in the right panel. Black dots indicate electrodes showing a significant difference between the word and AM responses (p<0.05, bootstrap, FDR corrected).

Figure 5—source data 1 Preprocessed electroencephalogram (EEG) data recorded in Experiment 4.: https://cdn.elifesciences.org/articles/60433/elife-60433-fig5-data1-v2.mat
Download elife-60433-fig5-data1-v2.mat

Time course of EEG responses to words

The event-related potential (ERP) responses evoked by σ1 and σ2 are separately shown in Figure 6. The ERP analysis was restricted to disyllabic words so that the responses to σ1 and σ2 represented the responses to the first and second syllables of disyllabic words respectively. When participants attended to speech, the ERP responses to σ1 and σ2 showed significant differences for both isochronous (Figure 6A) and natural speech (Figure 6B). When participants watched a silent movie, a smaller difference was also observed between the ERP responses to σ1 and σ2 (Figure 6A). The topography of the ERP difference showed a centro-frontal distribution. For isochronous speech, the ERP latency could not be unambiguously interpreted for isochronous speech, given that the stimulus was strictly periodic. For natural speech, the ERP responses to σ1 and σ2 differed in a time window around 300–500 ms.

Figure 6 with 1 supplement see all

Download asset Open asset

Event-related potential (ERP) responses evoked by disyllabic words.

The ERP responses evoked by σ1 and σ2 are shown in red and black, respectively. The ERP response is averaged over participants and electrodes. The shaded area indicates 1 SEM across participants. The gray lines on top denote the time intervals in which the two responses are significantly different from each other (p<0.05, cluster-based permutation test). The topography on top is averaged over all time intervals showing a significant difference between the two responses in each plot. Time 0 indicates syllable onset.

Figure 6—source data 1 Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.: https://cdn.elifesciences.org/articles/60433/elife-60433-fig6-data1-v2.mat
Download elife-60433-fig6-data1-v2.mat

The ERP results for amplitude-modulated speech are shown in Figure 6C and D. When participants attended to speech, a difference between the ERP responses to σ1 and σ2 was observed in both σ1- and σ2-amplified conditions (Figure 6C). During passive listening, however, a significant ERP difference was observed near the onset of the amplified syllable. These results were consistent with the results in the replication experiment (Figure 6—figure supplement 1).

Discussion

Speech comprehension is a complex process involving multiple stages, for example, encoding of acoustic features, extraction of phonetic features, and processing of higher-level linguistic units such as words, phrases, and sentences. Here, we investigate how low-frequency cortical activity encodes linguistic units and related acoustic features. When participants naturally listen to spoken narratives, we observe that cortical activity is synchronous to the rhythm of spoken words. The word synchronous response is observed whether participants listen to natural speech or synthesized isochronous speech that removes word-related acoustic cues. Furthermore, when introducing an AM cue to the word rhythm, neural responses to words and the AM cue are both observed and they show different spatial distribution. In addition, when participants are engaged in a story comprehension task, the word response exhibits stronger power than the AM response. The AM cue does not clearly modulate the word response during story comprehension task (Figure 2B), but can facilitate word processing when attention is diverted: The word response is not detected for isochronous speech (Figure 2C), but is detected for amplitude-modulated speech during passive listening (Figures 4B and 5A). In sum, these results show that both top-down linguistic knowledge and bottom-up acoustic cues separately contribute to word synchronous neural responses.

Neural encoding of linguistic units in natural speech

In speech, linguistic information is organized through a hierarchy of units, including phonemes, syllables, morphemes, words, phrases, sentences, and discourses. These units span a broad range of time scales, from tens of milliseconds for phonemes to a couple of seconds for sentences, and even longer for discourses. It is a challenging question to understand how the brain represents the hierarchy of linguistic units, and it is an appealing hypothesis that each level of linguistic unit is encoded by cortical activity on the relevant time scale (Ding et al., 2016a; Doumas and Martin, 2016; Giraud and Poeppel, 2012; Goswami, 2019; Keitel et al., 2018; Kiebel et al., 2008; Meyer and Gumbert, 2018). Previous fMRI studies have suggested that neural processing at different levels, for example, syllables, words, and sentences, engages different cortical networks (Blank and Fedorenko, 2020; Hasson et al., 2008; Lerner et al., 2011). Magnetoencephalography (MEG)/Electroencephalogram (EEG) studies have found reliable delta- and theta-band neural responses that are synchronous to speech (Ding and Simon, 2012; Luo and Poeppel, 2007; Peelle et al., 2013), and the time scales of such activity are consistent with the time scales of syllables and larger linguistic units.

Nevertheless, it remains unclear whether these MEG/EEG responses directly reflect neural encoding of hierarchical linguistic units, or simply encode acoustic features associated with these units (Daube et al., 2019; Kösem and van Wassenhove, 2017). On one hand, neural tracking of sound envelope is reliably observed in the absence of speech comprehension, for example, when participants listen to unintelligible speech (Howard and Poeppel, 2010; Zoefel and VanRullen, 2016) and non-speech sound (Lalor et al., 2009; Wang et al., 2012). The envelope tracking response is even weaker for sentences composed of real words than sentences composed of pseudowords (Mai et al., 2016), and is weaker for the native language than an unfamiliar language (Zou et al., 2019). Neural tracking of speech envelope can also be observed in animal primary auditory cortex (Ding et al., 2016b). Furthermore, a recent study shows that low-frequency cortical activity cannot reflect the perception of an ambiguous syllable sequence, for example, whether repetitions of a syllable is perceived as ‘flyflyfly’ or ‘lifelifelife’ (Kösem et al., 2016).

On the other hand, cortical activity synchronous to linguistic units, such as words and phrases, can be observed using well-controlled synthesized speech that removes relevant acoustic cues (Ding et al., 2016a; Jin et al., 2018; Makov et al., 2017). These studies, however, usually present semantically unrelated words or sentences at a constant pace, which creates a salient rhythm easily noticeable for listeners. In contrast, in the current study, we presented semantically coherent stories. It was found that for both synthesized isochronous speech and natural speech, cortical activity was synchronous to multi-syllabic words in metrical stories when participants were engaged in a story comprehension task. Furthermore, few listeners reported noticing any difference between the metrical and nonmetrical stories (see Supplementary materials for details), suggesting that the word rhythm was not salient and was barely noticeable to the listeners. Therefore, the word response is likely to reflect implicit word processing instead of the perception of an explicit rhythm.

A comparison of the responses to natural speech and isochronous speech showed that responses to word and syllable were weaker for natural speech, suggesting that strict periodicity in the stimulus could indeed boost rhythmic neural entrainment. Although the current study and previous studies (Ding et al., 2018; Makov et al., 2017) observe a word-rate neural response, the study conducted by Kösem et al., 2016 does not report observable neural activity synchronous to perceived word rhythm. A potential explanation for the mixed results is that Kösem et al., 2016 repeat the same word in each trial while the other studies present a large variety of words with no immediate repetitions in the stimuli. Therefore, it is possible that low-frequency word-rate neural response more strongly reflects neural processing of novel words, instead of the perception of a steady rhythm (see also Ostarek et al., 2020).

Mental processes reflected in neural activity synchronous to linguistic units

It remains elusive what kind of mental representations are reflected by cortical responses synchronous to linguistic units. For example, the response may reflect the phonological, syntactic, or semantic aspect of a perceived linguistic unit and it is difficult to tease apart these factors. Even if a sequence of sentences is constructed with independently synthesized monosyllabic words, the sequence does not sound like a stream of individual syllables delivered at a constant pace. Instead, listeners can clearly perceive each sentence as a prosodic unit. In this case, mental construction of sentences is driven by syntactic processing. Nevertheless, as long as the mental representation of a sentence is formed, it also has an associated phonological aspect. Previous psycholinguistic studies have already demonstrated that syntax has a significant impact on prosody perception (Buxó-Lugo and Watson, 2016; Garrett et al., 1966).

It is also possible that neural activity synchronous to linguistic units reflect more general cognitive processes that are engaged during linguistic processing. For example, within a word, later syllables are more predictable that earlier syllables. Therefore, neural processing associated with temporal prediction (Breska and Deouell, 2017; Lakatos et al., 2013; Stefanics et al., 2010) may appear to be synchronous to the perceived linguistic units. However, it has been demonstrated that when the predictability of syllables is controlled as a constant, cortical activity can still synchronize to perceived artificial words, suggesting that temporal prediction is not the only factor driving low-frequency neural activity either (Ding et al., 2016a). Nevertheless, in natural speech processing, temporal prediction may inevitably influence the low-frequency response. Similarly, temporal attention is known to affect low-frequency activity and attention certainly varies during speech perception (Astheimer and Sanders, 2009; Jin et al., 2018; Sanders and Neville, 2003). In addition, low-frequency neural activity has also been suggested to reflect the perception of high-level rhythms (Nozaradan et al., 2011) and general sequence chunking (Jin et al., 2020).

Since multiple factors can drive low-frequency neural responses, the low-frequency response to natural speech is likely to be a mixture of multiple components, including, for example, auditory responses to acoustic prosodic features, neural activity related to temporal prediction and temporal attention, and neural activity encoding phonological, syntactic, or semantic information. These processes are closely coupled and one process can trigger other processes. Here we do not think the word-rate response exclusively reflects a single process. It may well consist of a mixture of multiple response components, or provide a reference signal to bind together mental representations of multiple dimensions of the same word. Similar to the binding of color and shape information in the perception of a visual object (Treisman, 1998), the perception of a word requires the binding of, for example, phonological, syntactic, and semantic representations. It has been suggested that temporal coherence between neural responses to different features provides a strong cue to bind these features into a perceived object (Shamma et al., 2011). It is possible that the word-rate response reflects the temporal coherence between distinct mental representations of each word and functionally relates to the binding of these representations.

In speech processing, multiple factors contribute to the word response and these factors interact. For example, the current study suggested that prosodic cues such as AM had a facilitative effect on word processing: It was observed that, during passive listening, a word response was observed for amplitude-modulated speech but not for isochronous speech. In addition, the word response to amplitude-modulated speech during passive listening exhibited stronger power in the replication experiment that only presented amplitude-modulated speech, compared with the results obtained in the original experiment that presented amplitude-modulated speech with isochronous speech in a mixed manner. Consistent with this finding, a larger number of participants reported the noticing of stories during passive listening in the replication experiment (see Materials and methods). All these results suggest that the 2-Hz AM, which provides an acoustic cue for the word rhythm, facilitates word processing during passive listening. This result is consistent with the idea that prosodic cues have a facilitative effect on speech comprehension (Frazier et al., 2006; Ghitza, 2017; Ghitza, 2020; Giraud and Poeppel, 2012).

Finally, it should be mentioned that we employed AM to manipulate the speech envelope, given that the speech envelope is one of the strongest cues to drive stimulus-synchronous cortical response. However, the AM is only a weak prosodic cue compared with other variables such as timing and pitch contour (Shen, 1993). Furthermore, stress is strongly modulated by context and does not affect word recognition in Chinese (Duanmu, 2001). Future studies are needed to characterize the modulation of language processing by different prosodic cues and investigate the modulatory effect across different languages.

Attention modulation of cortical speech responses

It has been widely reported that cortical tracking of speech is strongly modulated by attention. Most previous studies demonstrate that in a complex auditory scene consisting of two speakers, attention can selectively enhance neural tracking of attended speech (Ding and Simon, 2012; Zion Golumbic et al., 2013). These results strongly suggest that the auditory cortex can parse a complex auditory scene into auditory objects, for example, speakers, and separately represent each auditory object (Shamma et al., 2011; Shinn-Cunningham, 2008). When only one speech stream was presented, cross-modal attention can also modulate neural tracking of the speech envelope, but the effect is much weaker (Ding et al., 2018; Kong et al., 2014).

Consistent with previous findings, in the current study, the 4-Hz syllable response was also enhanced by cross-modal attention (Figure 3B). The 2-Hz AM response power, however, was not significantly modulated by cross-modal attention (Figures 4D and 5B), suggesting that attention did not uniformly enhance the processing of all features within the same speech stream. Given that the 2-Hz AM does not carry linguistic information, the result suggests that attention selectively enhances speech features relevant to speech comprehension. This result extends previous findings by showing that attention can differentially modulate different features within a speech stream.

Time course of the neural response to words

The neurophysiological processes underlying speech perception has been extensively studied using ERPs (Friederici, 2002). Early ERP components, such as the N1, mostly reflect auditory encoding of acoustic features, while late components can reflect higher-level processing of lexical, semantic, or syntactic processing (Friederici, 2002; Friederici, 2012; Sanders and Neville, 2003). In the current study, for isochronous speech, response latency cannot be uniquely determined: Given that syllables are presented at a constant rate of 4 Hz, a response with latency T cannot be distinguished from responses with latency T ± 250 ms. The periodic pattern in the stimulus design enables accurate prediction of its timing in the brain, and therefore the responses observed could turn out to be predictive instead of reactive.

In natural speech, the responses to the two syllables in a disyllabic word differ in a late latency window of about 400 ms. This component is consistent with the latency of the N400 response, which can be observed when listening to either individual words or continuous speech (Broderick et al., 2018; Kutas and Federmeier, 2011; Kutas and Hillyard, 1980; Pylkkänen and Marantz, 2003; Pylkkänen et al., 2002). A previous study on the neural responses to naturally spoken sentences has also shown that the initial syllable of an English word elicits larger N1 and N200–300 components than the word-medial syllable (Sanders and Neville, 2003). A recent study also suggests that the word onset in natural speech elicits a response at ~100 ms latency (Brodbeck et al., 2018). The current study, however, does not observe this early effect, and language difference might offer a potential explanation: In Chinese, a syllable generally equals a morpheme, while in English single syllables do not carry meaning in most cases. The 400 ms latency response observed in the current study is consistent with the hypothesis that the N400 is related to lexical processing (Friederici, 2002; Kutas and Federmeier, 2011). Besides, it is also possible that the second syllable in a disyllabic word elicits weaker N400 since it is more predictable than the first syllable (Kuperberg et al., 2020; Lau et al., 2008).

The difference between the ERPs evoked by the first and second syllables in disyllabic words was amplified by attention (Figure 6A). Furthermore, the ERP difference remained significant when participants’ attention was diverted away from the speech in the movie watching task, while the 2-Hz response in the power spectrum was no longer significant (Figure 2C). These results are similar to the findings of a previous study (Ding et al., 2018), in which no word-rate response peak in the EEG spectrum is observed in contrast to coherent word-rate response phase across participants in unattended listening task. Taken together, these results suggest that word-level processing occurs during the movie watching task, but the word-tracking response is rather weak.

In sum, the current study suggests that bottom-up acoustic cues and top-down linguistic knowledge separately contribute to neural construction of linguistic units in the processing of spoken narratives.

Materials and methods

Participants

Sixty-eight participants (20–29 years old, mean age, 22.6 years; 37 females) took part in the EEG experiments. Thirty-four participants (19–26 years old, mean age, 22.5 years; 17 females) took part in a behavioral test to assess the naturalness of the stimuli. All participants were right-handed native Mandarin speakers, with no self-reported hearing loss or neurological disorders. The experimental procedures were approved by the Research Ethics Committee of the College of Medicine, Zhejiang University (2019–047). All participants provided written informed consent prior to the experiment and were paid.

Share this article

Cite this article

Stimulus.

Electroencephalogram (EEG) response spectrum.

Figure 2—source data 1

Response power and phase.

Figure 3—source data 1

Word and amplitude modulation (AM) responses to amplitude-modulated speech.

Figure 4—source data 1

Replication of the neural response to amplitude-modulated speech.

Figure 5—source data 1

Event-related potential (ERP) responses evoked by disyllabic words.

Figure 6—source data 1

Author details

Cheng Luo

Contribution

Competing interests

Nai Ding

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism