1. Neuroscience

Cortical encoding of acoustic and linguistic rhythms in spoken narratives

  1. Cheng Luo
  2. Nai Ding (corresponding author)
  1. Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, China
  2. Research Center for Advanced Artificial Intelligence Theory, Zhejiang Lab, China
Research Article
Cite this article as: eLife 2020;9:e60433 doi: 10.7554/eLife.60433

Abstract

Speech contains rich acoustic and linguistic information. Using highly controlled speech materials, previous studies have demonstrated that cortical activity is synchronous to the rhythms of perceived linguistic units, for example, words and phrases, on top of basic acoustic features, for example, the speech envelope. When listening to natural speech, it remains unclear, however, how cortical activity jointly encodes acoustic and linguistic information. Here we investigate the neural encoding of words using electroencephalography and observe neural activity synchronous to multi-syllabic words when participants naturally listen to narratives. An amplitude modulation (AM) cue for word rhythm enhances the word-level response, but the effect is only observed during passive listening. Furthermore, words and the AM cue are encoded by spatially separable neural responses that are differentially modulated by attention. These results suggest that bottom-up acoustic cues and top-down linguistic knowledge separately contribute to cortical encoding of linguistic units in spoken narratives.

Introduction

When listening to speech, low-frequency cortical activity in the delta (<4 Hz) and theta (4–8 Hz) bands is synchronous to speech (Keitel et al., 2018; Luo and Poeppel, 2007). However, it remains debated what speech features are encoded in the low-frequency cortical response. A large number of studies have demonstrated that the low-frequency cortical response tracks low-level acoustic features in speech, for example, the speech envelope (Destoky et al., 2019; Ding and Simon, 2012; Koskinen and Seppä, 2014; Di Liberto et al., 2015; Nourski et al., 2009; Peelle et al., 2013). Since the theta-band speech envelope provides an important acoustic cue for syllable boundaries, it has been hypothesized that neural tracking of the theta-band speech envelope is a mechanism to segment continuous speech into discrete units of syllables (Giraud and Poeppel, 2012; Poeppel and Assaneo, 2020). In other words, the theta-band envelope-tracking response reflects an intermediate neural representation linking auditory representation of acoustic speech features and phonological representation of syllables. Consistent with this hypothesis, it has been found that neural tracking of speech envelope is related to both low-level speech features (Doelling et al., 2014) and perception. On one hand, it can occur when speech recognition fails (Etard and Reichenbach, 2019; Howard and Poeppel, 2010; Peña and Melloni, 2012; Zoefel and VanRullen, 2016; Zou et al., 2019). On the other hand, it is strongly modulated by attention (Zion Golumbic et al., 2013; Kerlin et al., 2010) and may be a prerequisite for successful speech recognition (Vanthornhout et al., 2018).

Speech comprehension, however, requires more than syllabic-level processing. Previous studies suggest that low-frequency cortical activity can also reflect neural processing of higher-level linguistic units, for example, words and phrases (Buiatti et al., 2009; Ding et al., 2016a; Keitel et al., 2018), and the prosodic cues related to these linguistic units, for example, the delta-band speech envelope and pitch contour (Bourguignon et al., 2013; Li and Yang, 2009; Steinhauer et al., 1999). One line of research demonstrates that, when listening to natural speech, cortical responses can encode word onsets (Brodbeck et al., 2018) and capture semantic similarity between words (Broderick et al., 2018). It remains to be investigated, however, how bottom-up prosodic cues and top-down linguistic knowledge separately contribute to the generation of these word-related responses. Another line of research selectively focuses on top-down processing driven by linguistic knowledge. These studies demonstrate that cortical responses are synchronous to perceived linguistic units, for example, words, phrases, and sentences, even when acoustic correlates of these linguistic units are not available (Ding et al., 2016a; Ding et al., 2018; Jin et al., 2018; Makov et al., 2017). Based on these results, it has been hypothesized that low-frequency cortical activity can reflect linguistic-level neural representations that are constructed based on internal linguistic knowledge instead of acoustic cues (Ding et al., 2016a; Ding et al., 2018; Meyer et al., 2020). Nevertheless, to dissociate linguistic units from the related acoustic cues, most of these studies present speech as an isochronous sequence of synthesized syllables, which is organized into a sequence of unrelated words and sentences. Therefore, it remains unclear whether cortical activity can synchronize to linguistic units in natural spoken narratives, and how it is influenced by bottom-up acoustic cues and top-down linguistic knowledge.

Here we first asked whether cortical activity could reflect the rhythm of disyllabic words in semantically coherent stories. The story was either naturally read or synthesized as an isochronous sequence of syllables to remove acoustic cues for word boundaries (Ding et al., 2016a). We then asked how the neural response to disyllabic words was influenced by acoustic cues for words. To address this question, we amplitude modulated isochronous speech at the word rate and tested how this word-synchronized acoustic cue modulated the word response. Finally, since previous studies have shown that cortical tracking of speech strongly depends on the listeners’ task (Ding and Simon, 2012; Zion Golumbic et al., 2013; O'Sullivan et al., 2015), we designed two tasks during which the neural responses were recorded. One task required attentive listening to speech and answering comprehension questions afterwards; in the other task, participants watched a silent movie while passively listening to speech.

Results

Neural encoding of words in isochronously presented narratives

We first presented semantically coherent stories that were synthesized as an isochronous sequence of syllables (Figure 1A, left). To produce a metrical structure in stories, every other syllable was designed to be a word onset. More specifically, the odd terms in the metrical syllable sequence always corresponded to the initial syllable of a word, that is, word onset, while the even terms corresponded to either the second syllable of a disyllabic word (73% probability) or a monosyllabic word (27% probability). In the following, the odd terms of the syllable sequence were referred to as σ1, and the even terms as σ2. Since syllables were presented at a constant rate of 4 Hz, the neural response to syllables was frequency tagged at 4 Hz. Furthermore, since every other syllable in the sequence was the onset of a word, neural activity synchronous to word onsets was expected to show a regular rhythm at half of the syllabic rate, that is, 2 Hz (Figure 1A, right).
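The frequency-tagging logic can be sketched numerically: treating syllable onsets and word onsets as impulse trains, a response locked to both levels shows spectral peaks at 4 Hz and 2 Hz. The following is a toy illustration only (the sampling rate and duration are arbitrary choices, not taken from the paper):

```python
import numpy as np

fs = 100                       # Hz, simulation sampling rate (arbitrary)
n = 10 * fs                    # 10 s of simulated signal
t = np.arange(n) / fs

# Impulse trains: syllables every 250 ms (4 Hz), word onsets every 500 ms (2 Hz)
syll = np.zeros(n)
syll[::fs // 4] = 1
word = np.zeros(n)
word[::fs // 2] = 1

# A response locked to both levels carries spectral peaks at both rates
resp = syll + word
spec = np.abs(np.fft.rfft(resp)) / n
freqs = np.fft.rfftfreq(n, 1 / fs)

i2 = int(np.argmin(np.abs(freqs - 2.0)))   # 2-Hz (word-rate) bin
i4 = int(np.argmin(np.abs(freqs - 4.0)))   # 4-Hz (syllable-rate) bin
```

Removing the 2-Hz train, as for the nonmetrical stories in which word onsets are aperiodic, leaves only the 4-Hz peak.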

Figure 1. Stimulus.

(A and B) Two types of stories are constructed: metrical stories and nonmetrical stories. (A) Metrical stories are composed of disyllabic words and pairs of monosyllabic words, so that the odd terms in the syllable sequence (referred to as σ1) must be the onset of a word. Here the onset syllable of each word is shown in bold. All syllables are presented at a constant rate of 4 Hz. A 500 ms gap is inserted at the position of any punctuation. The red curve illustrates cortical activity synchronous to word onsets, and it shows a 2-Hz rhythm, which can be clearly observed in the spectrum shown on the right. The stories are in Chinese; English examples are shown for illustrative purposes. (B) In the nonmetrical stories, word onsets are not regularly positioned, and activity that is synchronous to word onsets does not show 2-Hz rhythmicity. (C) Natural speech. The stories are naturally read by a human speaker and the duration of syllables is not controlled. (D) Amplitude-modulated isochronous speech is constructed by amplifying either σ1 or σ2 by a factor of 4, creating a 2-Hz amplitude modulation. The red and blue curves illustrate responses that are synchronous to word onsets and amplified syllables, respectively. The response synchronous to word onsets is identical for σ1- and σ2-amplified speech, that is, the phase difference is 0° at 2 Hz. In contrast, the response synchronous to amplified syllables is offset by 250 ms between conditions, that is, the phase difference is 180° at 2 Hz.

As a control condition, we also presented stories with a nonmetrical structure (Figure 1B, left). These stories were referred to as the nonmetrical stories in the following. In these stories, the word duration was not controlled and σ1 was not always a word onset. Given that the word onsets in these stories did not show rhythmicity at 2 Hz, neural activity synchronous to the word onsets was not frequency tagged at 2 Hz (Figure 1B, right).

When listening to the stories, one group of participants was asked to attend to the stories and answer comprehension questions presented at the end of each story. This task was referred to as the story comprehension task, and the participants correctly answered 96 ± 9% and 94 ± 9% of the questions for metrical and nonmetrical stories, respectively. Another group of participants was asked to watch a silent movie while passively listening to the same set of stories as those used in the story comprehension task. The silent movie was not related to the auditorily presented stories, and the participants did not have to answer any speech comprehension question. This task was referred to as the movie watching task. The electroencephalogram (EEG) responses to isochronously presented stories are shown in Figure 2A and C. The response spectrum was averaged over participants and EEG electrodes.

Figure 2 with 1 supplement
Electroencephalogram (EEG) response spectrum.

Response spectrum is averaged over participants and EEG electrodes. The shaded area indicates one standard error of the mean (SEM) across participants. Stars indicate significantly higher power at 2 or 4 Hz than the power averaged over four neighboring frequency bins (two on each side). *p<0.05, **p<0.001 (bootstrap, false discovery rate [FDR] corrected). The color of the star is the same as the color of the spectrum. The topography on the top of each plot shows the distribution of response power at 2 Hz and 4 Hz. The five black dots in the topography indicate the positions of electrodes FCz, Fz, Cz, FC3, and FC4. (A–D) Response spectrum for isochronous speech and amplitude-modulated speech during two tasks. To facilitate the comparison between stimuli, the red curves in panels A and C are repeated in panels B and D, respectively. (E) Response spectrum when the participants listen to natural speech. In this analysis, the response to natural speech is time warped to simulate the response to isochronous speech, and then transformed into the frequency domain.

Figure 2—source data 1

Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.

https://cdn.elifesciences.org/articles/60433/elife-60433-fig2-data1-v2.mat

Figure 2A shows the responses from the participants attentively listening to the stories in the story comprehension task. For metrical stories, two peaks were observed in the EEG spectrum, one at 4 Hz, that is, the syllable rate (p=0.0001, bootstrap, false discovery rate [FDR] corrected), and the other at 2 Hz, that is, the rate of disyllabic words (p=0.0001, bootstrap, FDR corrected). For nonmetrical stories, however, a single response peak was observed at 4 Hz (p=0.0001, bootstrap, FDR corrected), while no significant response peak was observed at 2 Hz (p=0.27, bootstrap, FDR corrected). A comparison of the responses to metrical and nonmetrical stories revealed a significant difference at 2 Hz (p=0.0005, bootstrap, FDR corrected, Figure 3A) but not at 4 Hz (p=0.40, bootstrap, FDR corrected, Figure 3B). The response topography showed a centro-frontal distribution.

Figure 3 with 1 supplement
Response power and phase.

(A and B) Response power at 2 and 4 Hz. The color of the bars indicates the stimulus. Black stars indicate significant differences between different types of speech stimuli, while red stars indicate significant differences between tasks. *p<0.05, **p<0.01 (bootstrap, false discovery rate [FDR] corrected). Throughout the manuscript, in all bar graphs of response power, the response power at a target frequency is subtracted by the power averaged over four neighboring frequency bins (two on each side) to reduce the influence of background neural activity. (C) The difference in 2-Hz response phase between the σ1- and σ2-amplified conditions. The phase difference is averaged across participants, and the polar histogram shows the distribution of phase differences across 64 electrodes.
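The neighbor-bin normalization described in the caption (power at the target frequency minus the mean power over four neighboring bins, two on each side) can be sketched as a small helper; the function name is hypothetical, but the bin counts follow the text:

```python
import numpy as np

def peak_power(spec, freqs, f0, n_side=2):
    """Power at target frequency f0 minus the mean power over n_side
    neighboring frequency bins on each side (four bins total by default),
    reducing the influence of broadband background activity."""
    i0 = int(np.argmin(np.abs(freqs - f0)))
    neighbors = list(range(i0 - n_side, i0)) + list(range(i0 + 1, i0 + n_side + 1))
    return spec[i0] - np.mean(spec[neighbors])
```

For a flat background spectrum with a narrowband peak, this returns approximately the height of the peak above the background.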

Figure 3—source data 1

Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.

https://cdn.elifesciences.org/articles/60433/elife-60433-fig3-data1-v2.mat

When participants watched a silent movie during story listening, however, a single response peak was observed at 4 Hz for both metrical (p=0.0002, bootstrap, FDR corrected) and nonmetrical stories (p=0.0002, bootstrap, FDR corrected) (Figure 2C). The response peak at 2 Hz was not significant for either kind of story (p>0.074, bootstrap, FDR corrected). A comparison of the responses to metrical and nonmetrical stories did not find a significant difference at either 2 Hz (p=0.22, bootstrap, FDR corrected, Figure 3A) or 4 Hz (p=0.39, bootstrap, FDR corrected, Figure 3B). Furthermore, the 2-Hz response was significantly stronger in the story comprehension task than in the movie watching task (p=0.0004, bootstrap, FDR corrected, Figure 3A). These results showed that cortical activity was synchronous to the word rhythm during attentive speech comprehension. When attention was diverted, however, the word-rate response was no longer detected.

Neural encoding of words in natural spoken narratives

Next, we asked whether cortical activity was synchronous to disyllabic words in natural speech processing. The same set of stories used in the isochronous speech condition was read in a natural manner by a human speaker and presented to participants. The participants correctly answered 95 ± 4% and 97 ± 6% comprehension questions for metrical and nonmetrical stories, respectively. In natural speech, syllables were not produced at a constant rate (Figure 1C), and therefore the syllable and word responses were not frequency tagged. Nevertheless, we time warped the response to natural speech and made the syllable and word responses periodic. Specifically, the neural response to each syllable in natural speech was extracted and realigned to a constant 4-Hz rhythm using a convolution-based procedure (Jin et al., 2018; see Materials and methods for details). After time-warping analysis, cortical activity synchronous to the word onsets was expected to show a 2-Hz rhythm, the same as the response to the isochronous speech (Figure 2E).
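One way to realize the time-warping idea is to resample the response segment between successive syllable onsets to a fixed 250 ms duration, so that syllables become strictly periodic at 4 Hz. The sketch below uses simple linear interpolation; the paper's actual convolution-based procedure (Jin et al., 2018) differs in detail, and the function name is hypothetical:

```python
import numpy as np

def time_warp(resp, onsets, fs, target_rate=4.0):
    """Resample the response segment following each syllable onset (in s)
    to a fixed 1/target_rate duration, making syllables strictly periodic.
    A linear-interpolation sketch of the time-warping idea, not the paper's
    exact convolution-based procedure."""
    seg_len = int(round(fs / target_rate))        # samples per warped syllable
    bounds = list(onsets) + [len(resp) / fs]      # each segment ends at the next onset
    warped = []
    for t0, t1 in zip(bounds[:-1], bounds[1:]):
        seg = resp[int(t0 * fs):int(t1 * fs)]
        x_old = np.linspace(0, 1, len(seg))
        x_new = np.linspace(0, 1, seg_len)
        warped.append(np.interp(x_new, x_old, seg))
    return np.concatenate(warped)
```

After warping, the word-rate response is frequency tagged at 2 Hz and the syllable-rate response at 4 Hz, just as for the isochronous stimuli.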

The spectrum of the time-warped response was averaged over participants and EEG electrodes (Figure 2E). For metrical stories, two peaks were observed in the spectrum of the time-warped response, one at 4 Hz (p=0.0002, bootstrap, FDR corrected) and the other at 2 Hz (p=0.0007, bootstrap, FDR corrected). For nonmetrical stories, however, a single response peak was observed at 4 Hz (p=0.0002, bootstrap, FDR corrected), while no significant response peak was observed at 2 Hz (p=0.37, bootstrap, FDR corrected). A comparison of the responses to metrical and nonmetrical stories found a significant difference at 2 Hz (p=0.036, bootstrap, FDR corrected, Figure 3A) but not at 4 Hz (p=0.09, bootstrap, FDR corrected, Figure 3B). These results demonstrated that cortical activity was synchronous to the word rhythm in natural spoken narratives during attentive speech comprehension.

Neural responses to amplitude-modulated speech

To investigate potential interactions between neural encoding of linguistic units and relevant acoustic features, we amplitude modulated the isochronous speech at the word rate, that is, 2 Hz. The amplitude modulation (AM) was achieved by amplifying either σ1 or σ2 by a factor of 4 (Figure 1D), creating two conditions: σ1-amplified condition and σ2-amplified condition. Such 2-Hz AM provided an acoustic cue for the word rhythm. The speech was referred to as amplitude-modulated speech in the following. When listening to amplitude-modulated speech, the participants correctly answered 94 ± 12% and 97 ± 8% comprehension questions in the σ1- and σ2-amplified conditions, respectively.

The EEG responses to amplitude-modulated speech are shown in Figure 2B and D. When participants attended to the speech, significant response peaks were observed at 2 Hz (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0001, bootstrap, FDR corrected, Figure 2B). A comparison between the responses to isochronous speech and amplitude-modulated speech found that AM did not significantly influence the power of the 2-Hz neural response (σ1-amplified vs. σ2-amplified: p=0.56; σ1-amplified vs. isochronous: p=0.38; σ2-amplified vs. isochronous: p=0.44, bootstrap, FDR corrected, Figure 3A). These results showed that the 2-Hz response power was not significantly influenced by the 2-Hz AM during attentive speech comprehension.

Another group of participants passively listened to amplitude-modulated speech while attending to a silent movie. In their EEG responses, significant response peaks were also observed at 2 Hz (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0001, bootstrap, FDR corrected, Figure 2D). A comparison between the responses to isochronous speech and amplitude-modulated speech showed stronger 2-Hz response in the processing of amplitude-modulated speech than isochronous speech (σ1-amplified vs. isochronous: p=0.021, σ2-amplified vs. isochronous: p=0.0042, bootstrap, FDR corrected, Figure 3A). No significant difference was found between responses to σ1-amplified and σ2-amplified speech (p=0.069, bootstrap, FDR corrected, Figure 3A). Therefore, when attention was diverted, the 2-Hz response power was significantly increased by the 2-Hz AM.

The Fourier transform decomposes an arbitrary signal into sinusoids and each complex-valued Fourier coefficient captures the magnitude and phase of a sinusoid. The power spectrum reflects the response magnitude but ignores the response phase. The phase difference between the 2-Hz responses in the σ1- and σ2-amplified conditions, however, carried important information about whether the neural response was synchronous to the word onsets or amplified syllables: Neural activity synchronous to amplified syllables showed a 250 ms time lag between the σ1- and σ2-amplified conditions (Figure 1D), which corresponded to a 180° phase difference between conditions at 2 Hz. Neural activity synchronous to the word onsets, however, should be identical in the σ1- and σ2-amplified conditions.
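The correspondence between the 250 ms lag and the 180° phase difference follows from the relation phase = 360° × frequency × delay; at 2 Hz, 360 × 2 × 0.25 = 180°. This can be checked numerically with two delayed 2-Hz components (toy sinusoids, not EEG data):

```python
import numpy as np

fs, dur, f = 100, 2.0, 2.0
t = np.arange(int(fs * dur)) / fs
x1 = np.cos(2 * np.pi * f * t)            # component locked to amplified sigma1
x2 = np.cos(2 * np.pi * f * (t - 0.25))   # same component delayed by 250 ms

i = int(f * dur)                          # FFT bin corresponding to 2 Hz
dphi = np.angle(np.fft.rfft(x2)[i]) - np.angle(np.fft.rfft(x1)[i])
dphi_deg = np.degrees(dphi) % 360         # a 250 ms lag at 2 Hz gives 180 degrees
```

A component locked to the word onsets, which are identical across conditions, would instead show a 0° difference at this bin.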

The 2-Hz response phase difference between the σ1- and σ2-amplified conditions is shown in Figure 3C. The phase difference, when averaged across participants and EEG electrodes, was 41° (95% confidence interval: −25–91°) during the story comprehension task and increased to 132° (95% confidence interval: 102–164°) during the movie watching task (see Figure 3—figure supplement 1 for response phase in individual conditions).

Separating the responses to words and the word-rate AM

In the previous section, we analyzed the 2-Hz response phase difference between the σ1- and σ2-amplified conditions. A 0° phase difference indicated that the 2-Hz response was only synchronous to word onsets, while a 180° phase difference indicated that the 2-Hz response was only synchronous to amplified syllables. A phase difference between 0° and 180° indicated that responses synchronous to word onsets and responses synchronous to amplified syllables both existed, but the phase analysis could not reveal the strength of the two response components. Therefore, in the following, we extracted the neural responses to words and the 2-Hz AM by averaging the responses across the σ1- and σ2-amplified conditions in different manners.

To extract the neural response to words, we averaged the neural response waveforms across the σ1- and σ2-amplified conditions. The word onsets were aligned in these two conditions, and therefore neural activity synchronous to the word onsets was preserved in the average. In contrast, cortical activity synchronous to the amplified syllables exhibited a 250 ms time lag between the σ1- and σ2-amplified conditions. Therefore, the 2-Hz response synchronous to the amplified syllables was 180° out of phase between the two conditions and was canceled in the average across conditions. In sum, the average over the σ1- and σ2-amplified conditions preserved the response to words but canceled the response to amplified syllables (Figure 4A). This average was referred to as the word response in the following.
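The cancellation argument can be verified with toy signals: a shared word-locked component survives the average, while AM-locked components that are 180° out of phase cancel. A minimal sketch (synthetic sinusoids standing in for the two response components, not EEG):

```python
import numpy as np

fs, f = 100, 2.0
t = np.arange(int(2.0 * fs)) / fs
word = np.cos(2 * np.pi * f * t)                 # word-onset-locked component
am1 = 0.5 * np.cos(2 * np.pi * f * t)            # AM-locked, sigma1-amplified
am2 = 0.5 * np.cos(2 * np.pi * f * (t - 0.25))   # AM-locked, delayed 250 ms (180 deg)

resp1 = word + am1     # measured response, sigma1-amplified condition
resp2 = word + am2     # measured response, sigma2-amplified condition

word_resp = (resp1 + resp2) / 2   # AM parts cancel; the word part is preserved
```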

Word and amplitude modulation (AM) responses to amplitude-modulated speech.

(A–D) Neural responses in the σ1- and σ2-amplified conditions are aligned based on either word onsets (A and B) or amplified syllables (C and D), and averaged to extract the 2-Hz response synchronous to words or the AM, respectively. Panels A and C illustrate the procedure. The red and blue curves illustrate the response components synchronous to word onsets and the AM, respectively, which are mixed in the electroencephalogram (EEG) measurement and shown separately for illustrative purposes. The spectrum and topography in panels B and D are shown in the same way as in Figure 2. *p<0.05, **p<0.001 (bootstrap, false discovery rate [FDR] corrected). (E) Response power at 2 Hz. Black stars indicate significant differences between the word and AM responses, while red stars indicate a significant difference between tasks. *p<0.05, **p<0.01 (bootstrap, FDR corrected). (F) The left panel shows the power difference between the word and AM responses in single electrodes. The right panel shows the difference in normalized topography, that is, topography divided by its maximal value. Black dots indicate electrodes showing a significant difference between the word and AM responses (p<0.05, bootstrap, FDR corrected).

Figure 4—source data 1

Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.

https://cdn.elifesciences.org/articles/60433/elife-60433-fig4-data1-v2.mat

For the word response, a significant response peak was observed at 2 Hz during both the story comprehension task (p=0.0001, bootstrap, FDR corrected) and the movie watching task (p=0.011, bootstrap, FDR corrected, Figure 4B). The response power was significantly stronger in the story comprehension task than in the movie watching task (p=0.0008, bootstrap, FDR corrected, Figure 4E). During the movie watching task, a significant 2-Hz word response was observed when amplitude-modulated speech was presented (p=0.011, bootstrap, FDR corrected, Figure 4B), while no significant 2-Hz word response was observed when isochronous speech was presented (p=0.074, bootstrap, FDR corrected, Figure 2C). This finding indicated that the 2-Hz AM facilitated word processing during passive listening.

Furthermore, during the story comprehension task, the power of the 2-Hz word response did not significantly differ between amplitude-modulated speech and isochronous speech (p=0.69, bootstrap, FDR corrected), suggesting that the 2-Hz AM did not significantly modulate the 2-Hz word response during attentive speech comprehension.

To extract the neural response to 2-Hz AM, we first aligned the responses in the σ1- and σ2-amplified conditions by adding a delay of 250 ms to the response in the σ1-amplified condition. We then averaged the response waveforms across conditions (Figure 4C). After the delay was added to the response to σ1-amplified speech, the 2-Hz AM was identical between the σ1- and σ2-amplified conditions. Therefore, the average across conditions preserved the response to 2-Hz AM while canceling the response to words. The averaged response was referred to as the AM response in the following.
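Conversely, delaying the σ1-condition response by 250 ms aligns the AM-locked components across conditions and puts the 2-Hz word-locked components 180° out of phase, so averaging now cancels the word response. This can be checked with the same kind of toy signals (a circular shift is valid here only because the synthetic signal is exactly periodic):

```python
import numpy as np

fs, f, shift = 100, 2.0, 0.25
t = np.arange(int(2.0 * fs)) / fs
word = np.cos(2 * np.pi * f * t)                  # word-onset-locked component
am1 = 0.5 * np.cos(2 * np.pi * f * t)             # AM-locked, sigma1-amplified
am2 = 0.5 * np.cos(2 * np.pi * f * (t - shift))   # AM-locked, delayed 250 ms
resp1, resp2 = word + am1, word + am2

# Delay the sigma1-condition response by 250 ms; the AM is then aligned
# across conditions while the 2-Hz word component flips sign
resp1_delayed = np.roll(resp1, int(shift * fs))
am_resp = (resp1_delayed + resp2) / 2     # word parts cancel; the AM part survives
```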

For the AM response, significant 2-Hz response peaks were observed during both the story comprehension task (p=0.0021, bootstrap, FDR corrected) and the movie watching task (p=0.0001, bootstrap, FDR corrected). The 2-Hz response power did not significantly differ between tasks (p=0.39, bootstrap, FDR corrected), suggesting that the 2-Hz AM response was not strongly enhanced when participants attended to speech.

During the story comprehension task, the 2-Hz response power averaged across electrodes was significantly stronger for the word response than the AM response (p=0.021, bootstrap, FDR corrected, Figure 4E). During the movie watching task, however, the reverse was true, that is, the 2-Hz word response was significantly weaker than the 2-Hz AM response (p=0.013, bootstrap, FDR corrected, Figure 4E).

An analysis of individual electrodes showed that the 2-Hz word response was significantly stronger than the 2-Hz AM response in centro-frontal electrodes during the story comprehension task (p<0.05, bootstrap, FDR corrected, Figure 4F, left). To further compare the spatial distributions of the 2-Hz word and AM responses, on top of their power difference, we normalized each response topography by dividing it by its maximal value. A significant difference was found in the normalized topography between the 2-Hz word and AM responses in temporal electrodes (p<0.05, bootstrap, FDR corrected, Figure 4F, right). These results suggested that the 2-Hz word and AM responses had distinct neural sources.

Neural responses to amplitude-modulated speech: a replication

The responses to amplitude-modulated speech suggested that the brain encoded words and acoustic cues related to words in a different manner. Furthermore, the word-related acoustic cues seemed to facilitate word processing during passive listening, but not significantly enhance the word response during attentive listening. To further validate these findings, we conducted a replication experiment to measure the neural response to amplitude-modulated speech in a separate group of participants. These participants first performed the movie watching task and then the story comprehension task. During the story comprehension task, the participants correctly answered 96 ± 8% and 95 ± 9% questions in the σ1-amplified and σ2-amplified conditions, respectively.

We first analyzed the spectrum of response to σ1- and σ2-amplified speech. Consistent with previous results, a significant 2-Hz response peak was observed whether the participants attended to the speech (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0001, bootstrap, FDR corrected) or movie (σ1-amplified: p=0.0001, and σ2-amplified: p=0.0004, bootstrap, FDR corrected, Figure 5—figure supplement 1A and B). When averaged across participants and electrodes, the 2-Hz response phase difference between the σ1- and σ2-amplified conditions was 7° (95% confidence interval: −26–47°) during the story comprehension task and 96° (95% confidence interval: 55–143°) during the movie watching task (Figure 5—figure supplement 1C and D).

We then separately extracted the word and AM responses following the procedures described in Figure 4A and C. Similar to previous results, in the spectrum (Figure 5A and B), a significant 2-Hz word response was observed during both tasks (story comprehension task: p=0.0001, bootstrap; movie watching task: p=0.0003, bootstrap, FDR corrected), and a significant 2-Hz AM response was also observed during both tasks (story comprehension task: p=0.0001, bootstrap; movie watching task: p=0.0001, bootstrap, FDR corrected). Furthermore, the 2-Hz word response exhibited significantly stronger power than the 2-Hz AM response during the story comprehension task (p=0.0036, bootstrap, FDR corrected, Figure 5C), and the two responses showed distinct topographical distributions (Figure 5D). In sum, the results obtained in the replication experiment were consistent with those from the original experiment. A follow-up comparison of the results from the original and replication experiments suggested that the 2-Hz word response exhibited significantly stronger power during the movie watching task in the replication experiment than in the original experiment (p=0.0008, bootstrap, FDR corrected).

Figure 5 with 1 supplement
Replication of the neural response to amplitude-modulated speech.

(A and B) Spectrum and topography for the word response (A) and amplitude modulation (AM) response (B). Colored stars indicate frequency bins with stronger power than the power averaged over four neighboring frequency bins (two on each side). *p<0.05, **p<0.001 (bootstrap, false discovery rate [FDR] corrected). The topography on the top of each plot shows the distribution of response power at 2 Hz and 4 Hz. (C) Response power at 2 Hz. Black stars indicate significant differences between the word and AM responses, while red stars indicate a significant difference between tasks. *p<0.05, **p<0.01 (bootstrap, FDR corrected). (D) The power difference between the word and AM responses in individual electrodes is shown in the left panel. To further illustrate the difference in topographical distribution instead of the response power, each response topography is normalized by dividing it by its maximum value. The difference in the normalized topography is shown in the right panel. Black dots indicate electrodes showing a significant difference between the word and AM responses (p<0.05, bootstrap, FDR corrected).

Figure 5—source data 1

Preprocessed electroencephalogram (EEG) data recorded in Experiment 4.

https://cdn.elifesciences.org/articles/60433/elife-60433-fig5-data1-v2.mat

Time course of EEG responses to words

The event-related potential (ERP) responses evoked by σ1 and σ2 are separately shown in Figure 6. The ERP analysis was restricted to disyllabic words so that the responses to σ1 and σ2 represented the responses to the first and second syllables of disyllabic words, respectively. When participants attended to speech, the ERP responses to σ1 and σ2 showed significant differences for both isochronous (Figure 6A) and natural speech (Figure 6B). When participants watched a silent movie, a smaller difference was also observed between the ERP responses to σ1 and σ2 (Figure 6A). The topography of the ERP difference showed a centro-frontal distribution. For isochronous speech, the ERP latency could not be unambiguously interpreted, given that the stimulus was strictly periodic. For natural speech, the ERP responses to σ1 and σ2 differed in a time window around 300–500 ms.
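The ERP computation itself is standard epoch averaging time-locked to syllable onsets; a generic single-channel sketch (the window length and the absence of baseline correction are assumptions, not taken from the paper):

```python
import numpy as np

def erp(eeg, onsets, fs, win=(0.0, 0.6)):
    """Average single-channel epochs time-locked to the given onsets (in s).
    Epochs running past the end of the recording are discarded."""
    i0, i1 = int(win[0] * fs), int(win[1] * fs)
    epochs = [eeg[int(s * fs) + i0:int(s * fs) + i1]
              for s in onsets if int(s * fs) + i1 <= len(eeg)]
    return np.mean(epochs, axis=0)
```

Computing this separately for σ1 and σ2 onsets, restricted to disyllabic words as in the text, gives the two ERP waveforms whose difference is tested.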

Figure 6 with 1 supplement
Event-related potential (ERP) responses evoked by disyllabic words.

The ERP responses evoked by σ1 and σ2 are shown in red and black, respectively. The ERP response is averaged over participants and electrodes. The shaded area indicates 1 SEM across participants. The gray lines on top denote the time intervals in which the two responses are significantly different from each other (p<0.05, cluster-based permutation test). The topography on top is averaged over all time intervals showing a significant difference between the two responses in each plot. Time 0 indicates syllable onset.

Figure 6—source data 1

Preprocessed electroencephalogram (EEG) data recorded in Experiments 1–3.

https://cdn.elifesciences.org/articles/60433/elife-60433-fig6-data1-v2.mat

The ERP results for amplitude-modulated speech are shown in Figure 6C and D. When participants attended to speech, a difference between the ERP responses to σ1 and σ2 was observed in both the σ1- and σ2-amplified conditions (Figure 6C). During passive listening, however, a significant ERP difference was observed near the onset of the amplified syllable. These results were consistent with those of the replication experiment (Figure 6—figure supplement 1).

Discussion

Speech comprehension is a complex process involving multiple stages, for example, encoding of acoustic features, extraction of phonetic features, and processing of higher-level linguistic units such as words, phrases, and sentences. Here, we investigate how low-frequency cortical activity encodes linguistic units and related acoustic features. When participants naturally listen to spoken narratives, we observe that cortical activity is synchronous to the rhythm of spoken words. The word-synchronous response is observed whether participants listen to natural speech or to synthesized isochronous speech that removes word-related acoustic cues. Furthermore, when an AM cue to the word rhythm is introduced, neural responses to words and to the AM cue are both observed, and they show different spatial distributions. In addition, when participants are engaged in a story comprehension task, the word response exhibits stronger power than the AM response. The AM cue does not clearly modulate the word response during the story comprehension task (Figure 2B), but can facilitate word processing when attention is diverted: The word response is not detected for isochronous speech (Figure 2C), but is detected for amplitude-modulated speech during passive listening (Figures 4B and 5A). In sum, these results show that top-down linguistic knowledge and bottom-up acoustic cues separately contribute to word-synchronous neural responses.

Neural encoding of linguistic units in natural speech

In speech, linguistic information is organized through a hierarchy of units, including phonemes, syllables, morphemes, words, phrases, sentences, and discourses. These units span a broad range of time scales, from tens of milliseconds for phonemes to a couple of seconds for sentences, and even longer for discourses. It is a challenging question to understand how the brain represents the hierarchy of linguistic units, and it is an appealing hypothesis that each level of linguistic unit is encoded by cortical activity on the relevant time scale (Ding et al., 2016a; Doumas and Martin, 2016; Giraud and Poeppel, 2012; Goswami, 2019; Keitel et al., 2018; Kiebel et al., 2008; Meyer and Gumbert, 2018). Previous fMRI studies have suggested that neural processing at different levels, for example, syllables, words, and sentences, engages different cortical networks (Blank and Fedorenko, 2020; Hasson et al., 2008; Lerner et al., 2011). Magnetoencephalography (MEG)/Electroencephalogram (EEG) studies have found reliable delta- and theta-band neural responses that are synchronous to speech (Ding and Simon, 2012; Luo and Poeppel, 2007; Peelle et al., 2013), and the time scales of such activity are consistent with the time scales of syllables and larger linguistic units.

Nevertheless, it remains unclear whether these MEG/EEG responses directly reflect neural encoding of hierarchical linguistic units, or simply encode acoustic features associated with these units (Daube et al., 2019; Kösem and van Wassenhove, 2017). On one hand, neural tracking of the sound envelope is reliably observed in the absence of speech comprehension, for example, when participants listen to unintelligible speech (Howard and Poeppel, 2010; Zoefel and VanRullen, 2016) or non-speech sound (Lalor et al., 2009; Wang et al., 2012). The envelope-tracking response is even weaker for sentences composed of real words than for sentences composed of pseudowords (Mai et al., 2016), and is weaker for the native language than for an unfamiliar language (Zou et al., 2019). Neural tracking of the speech envelope can also be observed in animal primary auditory cortex (Ding et al., 2016b). Furthermore, a recent study shows that low-frequency cortical activity does not reflect the perception of an ambiguous syllable sequence, for example, whether repetitions of a syllable are perceived as ‘flyflyfly’ or ‘lifelifelife’ (Kösem et al., 2016).

On the other hand, cortical activity synchronous to linguistic units, such as words and phrases, can be observed using well-controlled synthesized speech that removes relevant acoustic cues (Ding et al., 2016a; Jin et al., 2018; Makov et al., 2017). These studies, however, usually present semantically unrelated words or sentences at a constant pace, which creates a salient rhythm easily noticeable for listeners. In contrast, in the current study, we presented semantically coherent stories. It was found that for both synthesized isochronous speech and natural speech, cortical activity was synchronous to multi-syllabic words in metrical stories when participants were engaged in a story comprehension task. Furthermore, few listeners reported noticing any difference between the metrical and nonmetrical stories (see Supplementary materials for details), suggesting that the word rhythm was not salient and was barely noticeable to the listeners. Therefore, the word response is likely to reflect implicit word processing instead of the perception of an explicit rhythm.

A comparison of the responses to natural speech and isochronous speech showed that the responses to words and syllables were weaker for natural speech, suggesting that strict periodicity in the stimulus can indeed boost rhythmic neural entrainment. Although the current study and previous studies (Ding et al., 2018; Makov et al., 2017) observe a word-rate neural response, the study by Kösem et al., 2016 does not report observable neural activity synchronous to the perceived word rhythm. A potential explanation for the mixed results is that Kösem et al., 2016 repeated the same word in each trial, while the other studies presented a large variety of words with no immediate repetitions. Therefore, it is possible that the low-frequency word-rate neural response more strongly reflects neural processing of novel words than the perception of a steady rhythm (see also Ostarek et al., 2020).

Mental processes reflected in neural activity synchronous to linguistic units

It remains elusive what kind of mental representations are reflected by cortical responses synchronous to linguistic units. For example, the response may reflect the phonological, syntactic, or semantic aspect of a perceived linguistic unit and it is difficult to tease apart these factors. Even if a sequence of sentences is constructed with independently synthesized monosyllabic words, the sequence does not sound like a stream of individual syllables delivered at a constant pace. Instead, listeners can clearly perceive each sentence as a prosodic unit. In this case, mental construction of sentences is driven by syntactic processing. Nevertheless, as long as the mental representation of a sentence is formed, it also has an associated phonological aspect. Previous psycholinguistic studies have already demonstrated that syntax has a significant impact on prosody perception (Buxó-Lugo and Watson, 2016; Garrett et al., 1966).

It is also possible that neural activity synchronous to linguistic units reflects more general cognitive processes that are engaged during linguistic processing. For example, within a word, later syllables are more predictable than earlier syllables. Therefore, neural processing associated with temporal prediction (Breska and Deouell, 2017; Lakatos et al., 2013; Stefanics et al., 2010) may appear to be synchronous to the perceived linguistic units. However, it has been demonstrated that when the predictability of syllables is held constant, cortical activity can still synchronize to perceived artificial words, suggesting that temporal prediction is not the only factor driving low-frequency neural activity (Ding et al., 2016a). Nevertheless, in natural speech processing, temporal prediction may inevitably influence the low-frequency response. Similarly, temporal attention is known to affect low-frequency activity, and attention certainly varies during speech perception (Astheimer and Sanders, 2009; Jin et al., 2018; Sanders and Neville, 2003). In addition, low-frequency neural activity has also been suggested to reflect the perception of high-level rhythms (Nozaradan et al., 2011) and general sequence chunking (Jin et al., 2020).

Since multiple factors can drive low-frequency neural responses, the low-frequency response to natural speech is likely to be a mixture of multiple components, including, for example, auditory responses to acoustic prosodic features, neural activity related to temporal prediction and temporal attention, and neural activity encoding phonological, syntactic, or semantic information. These processes are closely coupled and one process can trigger other processes. Here we do not think the word-rate response exclusively reflects a single process. It may well consist of a mixture of multiple response components, or provide a reference signal to bind together mental representations of multiple dimensions of the same word. Similar to the binding of color and shape information in the perception of a visual object (Treisman, 1998), the perception of a word requires the binding of, for example, phonological, syntactic, and semantic representations. It has been suggested that temporal coherence between neural responses to different features provides a strong cue to bind these features into a perceived object (Shamma et al., 2011). It is possible that the word-rate response reflects the temporal coherence between distinct mental representations of each word and functionally relates to the binding of these representations.

In speech processing, multiple factors contribute to the word response, and these factors interact. For example, the current study suggested that prosodic cues such as AM have a facilitative effect on word processing: During passive listening, a word response was observed for amplitude-modulated speech but not for isochronous speech. In addition, the word response to amplitude-modulated speech during passive listening exhibited stronger power in the replication experiment, which presented only amplitude-modulated speech, than in the original experiment, in which amplitude-modulated speech was mixed with isochronous speech. Consistent with this finding, a larger number of participants reported noticing the stories during passive listening in the replication experiment (see Materials and methods). All these results suggest that the 2-Hz AM, which provides an acoustic cue for the word rhythm, facilitates word processing during passive listening. This result is consistent with the idea that prosodic cues have a facilitative effect on speech comprehension (Frazier et al., 2006; Ghitza, 2017; Ghitza, 2020; Giraud and Poeppel, 2012).

Finally, it should be mentioned that we employed AM to manipulate the speech envelope, given that the speech envelope is one of the strongest cues to drive stimulus-synchronous cortical response. However, the AM is only a weak prosodic cue compared with other variables such as timing and pitch contour (Shen, 1993). Furthermore, stress is strongly modulated by context and does not affect word recognition in Chinese (Duanmu, 2001). Future studies are needed to characterize the modulation of language processing by different prosodic cues and investigate the modulatory effect across different languages.

Attention modulation of cortical speech responses

It has been widely reported that cortical tracking of speech is strongly modulated by attention. Most previous studies demonstrate that in a complex auditory scene consisting of two speakers, attention can selectively enhance neural tracking of the attended speech (Ding and Simon, 2012; Zion Golumbic et al., 2013). These results strongly suggest that the auditory cortex can parse a complex auditory scene into auditory objects, for example, speakers, and separately represent each auditory object (Shamma et al., 2011; Shinn-Cunningham, 2008). When only one speech stream is presented, cross-modal attention can also modulate neural tracking of the speech envelope, but the effect is much weaker (Ding et al., 2018; Kong et al., 2014).

Consistent with previous findings, in the current study, the 4-Hz syllable response was also enhanced by cross-modal attention (Figure 3B). The 2-Hz AM response power, however, was not significantly modulated by cross-modal attention (Figures 4D and 5B), suggesting that attention did not uniformly enhance the processing of all features within the same speech stream. Given that the 2-Hz AM does not carry linguistic information, the result suggests that attention selectively enhances speech features relevant to speech comprehension. This result extends previous findings by showing that attention can differentially modulate different features within a speech stream.

Time course of the neural response to words

The neurophysiological processes underlying speech perception have been extensively studied using ERPs (Friederici, 2002). Early ERP components, such as the N1, mostly reflect auditory encoding of acoustic features, while late components can reflect higher-level lexical, semantic, or syntactic processing (Friederici, 2002; Friederici, 2012; Sanders and Neville, 2003). In the current study, for isochronous speech, the response latency cannot be uniquely determined: Given that syllables are presented at a constant rate of 4 Hz, a response with latency T cannot be distinguished from responses with latency T ± 250 ms. Moreover, because the stimulus is periodic, the brain can accurately predict its timing, and the observed responses may therefore be predictive rather than reactive.

In natural speech, the responses to the two syllables in a disyllabic word differ in a late latency window of about 400 ms. This component is consistent with the latency of the N400 response, which can be observed when listening to either individual words or continuous speech (Broderick et al., 2018; Kutas and Federmeier, 2011; Kutas and Hillyard, 1980; Pylkkänen and Marantz, 2003; Pylkkänen et al., 2002). A previous study on the neural responses to naturally spoken sentences has also shown that the initial syllable of an English word elicits larger N1 and N200–300 components than the word-medial syllable (Sanders and Neville, 2003). A recent study also suggests that the word onset in natural speech elicits a response at ~100 ms latency (Brodbeck et al., 2018). The current study, however, does not observe this early effect, and language difference might offer a potential explanation: In Chinese, a syllable generally corresponds to a morpheme, while in English single syllables do not carry meaning in most cases. The 400 ms latency response observed in the current study is consistent with the hypothesis that the N400 is related to lexical processing (Friederici, 2002; Kutas and Federmeier, 2011). Alternatively, it is possible that the second syllable of a disyllabic word elicits a weaker N400 because it is more predictable than the first syllable (Kuperberg et al., 2020; Lau et al., 2008).

The difference between the ERPs evoked by the first and second syllables of disyllabic words was amplified by attention (Figure 6A). Furthermore, the ERP difference remained significant when participants’ attention was diverted away from speech in the movie watching task, while the 2-Hz response in the power spectrum was no longer significant (Figure 2C). These results are similar to the findings of a previous study (Ding et al., 2018), in which no word-rate peak is observed in the EEG power spectrum during an unattended listening task, even though the word-rate response phase is coherent across participants. Taken together, these results suggest that word-level processing occurs during the movie watching task, but the word-tracking response is rather weak.

In sum, the current study suggests that bottom-up acoustic cues and top-down linguistic knowledge separately contribute to neural construction of linguistic units in the processing of spoken narratives.

Materials and methods

Participants

Sixty-eight participants (20–29 years old, mean age, 22.6 years; 37 females) took part in the EEG experiments. Thirty-four participants (19–26 years old, mean age, 22.5 years; 17 females) took part in a behavioral test to assess the naturalness of the stimuli. All participants were right-handed native Mandarin speakers, with no self-reported hearing loss or neurological disorders. The experimental procedures were approved by the Research Ethics Committee of the College of Medicine, Zhejiang University (2019–047). All participants provided written informed consent prior to the experiment and were paid.

Stories

Request a detailed protocol

Twenty-eight short stories were constructed for the study. The stories were unrelated in content and ranged from 81 to 143 words (107 words on average). In 21 stories, word onset was metrically organized, that is, every other syllable was a word onset; these stories were referred to as metrical stories (Figure 1A). In the other seven stories, word onset was not metrically organized; these stories were referred to as nonmetrical stories (Figure 1B). In an ideal design, metrical stories would be constructed solely with disyllabic words to form a constant disyllabic word rhythm. Since it was difficult to construct such materials, however, the stories were constructed with disyllabic words and pairs of monosyllabic words. In other words, between two disyllabic words, there was always an even number of monosyllabic words. After the stories were composed, the word boundaries within the stories were further parsed using a Natural Language Processing (NLP) algorithm (Zhang and Shang, 2019). The parsing result confirmed that every other syllable in the story (referred to as σ1 in Figure 1A) was the onset of a word. Of the other syllables (referred to as σ2 in Figure 1A), 77% were the second syllable of a disyllabic word and 23% were monosyllabic words.

Similarly, nonmetrical stories, used as a control condition, were composed of sentences with an even number of syllables. Nevertheless, no specific constraint was applied to word duration, and the odd-numbered syllables in the sequence were not always word onsets. Furthermore, each sentence was required to contain at least one odd-numbered syllable that was not a word onset.

Speech

Each story was either synthesized as an isochronous sequence of syllables or naturally read by a human speaker.

Isochronous speech

Request a detailed protocol

All syllables were synthesized independently using the Neospeech synthesizer (http://www.neospeech.com/, the male voice, Liang). The synthesized syllables were 75–354 ms in duration (mean duration 224 ms). All syllables were adjusted to 250 ms by truncation or padding silence at the end, following the procedure in Ding et al., 2016a. The last 25 ms of each syllable were smoothed by a cosine window and all syllables were equalized in intensity. In this way, the syllables were presented at a constant rate of 4 Hz (Figure 1A and B). In addition, a silence gap lasting 500 ms, that is, the duration of two syllables, was inserted at the position of any punctuation, to facilitate story comprehension.

Natural speech

Request a detailed protocol

The stories were read in a natural manner by a female speaker, who was not aware of the purpose of the study. In natural speech, syllables were not produced at a constant rate and the boundaries between syllables were labeled by professionals (Figure 1C). The total duration of speech was 1122 s for the 21 metrical stories and 372 s for the seven nonmetrical stories. A behavioral test showed that most participants did not perceive any difference between the metrical and nonmetrical stories (Supplementary file 1).

Amplitude-modulated speech

Request a detailed protocol

Amplitude modulation (AM) was applied to isochronous speech to create a word-rate acoustic rhythm (Figure 1D). In the σ1-amplified condition, all σ1 syllables were amplified by a factor of 4; in the σ2-amplified condition, all σ2 syllables were amplified by a factor of 4. Such 2-Hz AM was clearly perceivable but did not affect speech intelligibility, since sound intensity is a weak cue for stress (Zhong et al., 2001) and stress does not affect word recognition in Mandarin Chinese (Duanmu, 2001). A behavioral test suggested that, when listening to amplitude-modulated speech, more participants perceived σ1-amplified speech as more natural than σ2-amplified speech (Supplementary file 1).

Experimental procedures and tasks

Behavioral test

Request a detailed protocol

A behavioral test was conducted to assess the naturalness of the stimuli. The test was divided into two blocks. In block 1, the participants listened to a metrical story and a nonmetrical story read by a human speaker, presented in a pseudorandom order. The stories were randomly selected from the story set and ranged from 53 to 66 s in duration. After listening to each story, the participants were asked to write a sentence summarizing the story and to fill out a questionnaire. Block 2 followed the same procedure as block 1, except that the metrical and nonmetrical stories were replaced with σ1- and σ2-amplified speech.

In block 1, the first question in the questionnaire asked whether the two types of stories, a metrical and a nonmetrical story, showed any noticeable difference regardless of their content. Thirty-one participants (91%) reported perceiving no difference, and the other three participants (9%) were asked to elaborate on the differences they detected. Two of them said the metrical story showed larger pitch variations, and the third said the stories were read in different tones. Therefore, the vast majority of the participants did not notice any difference between the two types of stories read by a human speaker. A few participants noticed some differences in the intonation pattern, but no participant reported a difference in word rhythm.

The second question asked about the naturalness of the read stories. Twenty-four participants (71%) reported that both types of stories were naturally read. Three participants (9%) commented that the metrical story was not naturally read, and all of them attributed it to intonation. The remaining seven participants (20%) thought neither type of story was naturally read. The reasons reported included (1) exaggerated intonation (N = 2); (2) a seemingly uniform speed and intonation pattern (N = 2); (3) lack of emotion (N = 2); and (4) the pitch rising at the end of each sentence (N = 1). In sum, most participants thought the stories were naturally read, and only two participants (6%) commented on the uniformity of pace.

In block 2, only one question was asked: Participants compared the naturalness and accessibility of σ1-amplified and σ2-amplified speech. Fifteen participants (44%) perceived the σ1-amplified speech as more natural, two participants (6%) perceived the σ2-amplified speech as more natural, and the remaining 17 participants (50%) thought there was no difference in naturalness between the two conditions. In sum, relatively more participants thought the σ1-amplified speech was more natural.

EEG experiment

The study consisted of four EEG experiments. Experiments 1–3 each involved 16 participants, and Experiment 4 involved 20 participants. Experiments 1 and 4 each consisted of two blocks with stories presented in a randomized order within each block. In Experiments 2 and 3, all stories were presented in a randomized order in a single block.

Experiment 1

Request a detailed protocol

Synthesized speech was presented in Experiment 1. The experiment was divided into two blocks. In block 1, participants listened to isochronous speech, including seven metrical stories and seven nonmetrical stories. In block 2, the participants listened to amplitude-modulated speech, including 7 σ1-amplified stories and 7 σ2-amplified stories. All 14 stories presented in block 2 were metrical stories and did not overlap with the stories used in block 1. Participants were asked to keep their eyes closed while listening to the stories. After listening to each story, participants were required to answer three comprehension questions by giving oral responses. An experimenter recorded the responses and then pressed a key to continue the experiment. The next story was presented after an interval randomized between 1 and 2 s (uniform distribution) after the key press. The participants took a break between blocks.

Experiment 2

Request a detailed protocol

Speech stimuli used in Experiment 2 were the same as those used in Experiment 1, but the task was different. The participants were asked to watch a silent movie (The Little Prince) with subtitles and ignore any sound during the experiment. The stories were presented ~5 min after the movie started, to make sure that participants were already engaged in the movie watching task. The interval between stories was randomized between 1 and 2 s (uniform distribution). The movie was stopped after all 28 stories had been presented. The experiment was followed up with questions on the awareness of stories being presented during the movie watching task, and 87.5% of participants (N = 14) reported that they did not notice any story.

Experiment 3

Request a detailed protocol

Experiment 3 used the same set of stories as those used in Experiment 1, but the stories were naturally read by a human speaker. The task was of the same design as that in Experiment 1. Participants listened to 21 metrical stories and seven nonmetrical stories. Participants took a break after every 14 stories.

Experiment 4

Request a detailed protocol

Experiment 4 was designed to test whether the results based on amplitude-modulated speech were replicable in a different group of participants. All stories used in Experiment 4 were metrical stories and each story was presented once. In block 1, participants were asked to watch a silent movie (The Little Prince) with subtitles and ignore any sound during the task. Amplitude-modulated speech (5 σ1-amplified and 5 σ2-amplified stories) was presented ~5 min after the movie started. The interval between stories was randomized between 1 and 2 s (uniform distribution). Block 1 was followed up with questions on the awareness of stories being presented during the movie watching task, and 15% of participants (N = 3) reported that they did not notice any story during the task. Note that the percentage of participants reporting no awareness of the stories was much lower in Experiment 4 than in Experiment 2. A potential explanation is that Experiment 4 only presented amplitude-modulated speech, and the consistent presence of the word-rate acoustic cue facilitated word recognition. In Experiment 2, in contrast, amplitude-modulated speech was mixed with isochronous speech, and the lack of a consistent AM cue diminished its effect. After block 1 was finished, the participants took a break.

In block 2, participants listened to amplitude-modulated speech (5 σ1-amplified stories and 5 σ2-amplified stories) with their eyes closed. At the end of each story, three comprehension questions were presented, and answers were to be given with oral responses. The experimenter recorded the responses and then pressed a key to continue the experiment. The next story was presented after an interval randomized between 1 and 2 s (uniform distribution) after the key press.

Data recording and preprocessing

Request a detailed protocol

Electroencephalogram (EEG) and electrooculogram (EOG) were recorded using a Biosemi ActiveTwo system. Sixty-four EEG electrodes were recorded. Two additional electrodes were placed at the left and right temples to record the horizontal EOG (right minus left), and two electrodes were placed above and below the right eye to record the vertical EOG (upper minus lower). Two additional electrodes were placed at the left and right mastoids and their average was used as the reference for EEG (Ding et al., 2018). The EEG/EOG recordings were low-pass filtered below 400 Hz and sampled at 2048 Hz.

All preprocessing and analysis in this study were performed using MATLAB (The MathWorks, Natick, MA). The EEG recordings were down-sampled to 128 Hz, referenced to the average of the mastoid recordings, and band-pass filtered between 0.8 Hz and 30 Hz using a linear-phase finite impulse response (FIR) filter (6 s Hamming window, −6 dB attenuation at the cut-off frequencies). A linear-phase FIR filter introduces a constant time delay of N/2 samples, where N is the window length of the filter (Oppenheim et al., 1997). The delay was compensated by removing the first N/2 samples of the filter output. To remove ocular artifacts in EEG, the horizontal and vertical EOG were regressed out using the least-squares method (Ding et al., 2018). Occasional large artifacts in EEG/EOG, that is, samples with magnitude >1 mV, were removed from the analysis (Jin et al., 2018).
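The filtering, delay compensation, and EOG regression steps can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code; the function names are our own, and the filter parameters (128 Hz sampling, 0.8–30 Hz passband, 6 s Hamming window) are taken from the description above.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def bandpass_filter_compensated(eeg, fs=128, lo=0.8, hi=30.0, win_sec=6.0):
    """Linear-phase FIR band-pass filter with group-delay compensation.

    eeg: (n_samples, n_channels). A linear-phase FIR filter delays its
    input by a constant number of samples, which is removed here by
    zero-padding the input and discarding the first samples of the output.
    """
    n_taps = int(win_sec * fs) + 1                 # 6 s Hamming window
    b = firwin(n_taps, [lo, hi], pass_zero=False, window="hamming", fs=fs)
    delay = n_taps // 2                            # constant group delay in samples
    padded = np.vstack([eeg, np.zeros((delay, eeg.shape[1]))])
    filtered = lfilter(b, 1.0, padded, axis=0)
    return filtered[delay:]                        # compensate the delay

def regress_out_eog(eeg, eog):
    """Remove ocular artifacts by least-squares regression of EOG channels.

    eeg: (n_samples, n_eeg), eog: (n_samples, n_eog). Returns the EEG
    residual after projecting out the EOG subspace.
    """
    beta, *_ = np.linalg.lstsq(eog, eeg, rcond=None)
    return eeg - eog @ beta
```

The least-squares residual is, by construction, orthogonal to the EOG channels, so any component of the EEG that is a linear mixture of the EOG signals is removed.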

Data analysis

Frequency-domain analysis

Request a detailed protocol

The EEG responses during the gaps between sentences and the responses to the first two syllables of each sentence were removed from the analysis to avoid responses to sound onsets. The EEG responses to the rest of each sentence, that is, from the third syllable to the last syllable, were then concatenated.

To further remove potential artifacts, the EEG responses were divided into 7 s trials (a total of 148 trials for Experiments 1–3, and 104 trials for Experiment 4), and we visually inspected all trials and removed those with identifiable artifacts. On average, 8.45% ± 3.20% of trials were rejected in Experiment 1, 15.20% ± 3.97% in Experiment 2, 10.35% ± 1.53% in Experiment 3, and 12.90% ± 4.46% in Experiment 4.

Then, the response averaged over trials was transformed into the frequency domain using the Discrete Fourier Transform (DFT) without any additional smoothing window. The frequency resolution of the DFT analysis was therefore 1/7 Hz. The response power, that is, the squared magnitude of the Fourier coefficients, was grand averaged over EEG electrodes and participants. The phase of the Fourier coefficients was averaged using the circular mean (Fisher, 1995). The 2-Hz phase difference between the σ1- and σ2-amplified conditions was averaged over participants for each electrode.
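A minimal Python sketch of this frequency-domain analysis follows (again not the authors' code; function names are our own). With 7 s trials sampled at 128 Hz, the DFT bins are spaced 1/7 Hz apart, so the 2-Hz and 4-Hz response frequencies fall exactly on bins 14 and 28.

```python
import numpy as np

def response_spectrum(trials, fs=128):
    """Average 7 s trials, then take the DFT without a smoothing window.

    trials: (n_trials, n_samples) with n_samples = 7 * fs, giving a
    frequency resolution of 1/7 Hz. Returns frequencies, power
    (squared magnitude of the Fourier coefficients), and phase.
    """
    avg = trials.mean(axis=0)
    coef = np.fft.rfft(avg)
    freqs = np.fft.rfftfreq(avg.size, d=1.0 / fs)
    return freqs, np.abs(coef) ** 2, np.angle(coef)

def circular_mean(phases):
    """Circular mean of phase angles (radians), e.g. across participants."""
    return np.angle(np.mean(np.exp(1j * np.asarray(phases))))
```

For example, a pure 2-Hz component in the averaged response produces a single spectral peak at bin 14 (2 Hz = 14 × 1/7 Hz).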

Time-warping analysis

Request a detailed protocol

In the natural speech used in Experiment 3, syllables were not produced at a constant rate, and therefore the responses to syllables and words were not frequency tagged. However, the neural response to natural speech could be time warped to simulate the response to isochronous speech (Jin et al., 2018). In the time-warping analysis, we first extracted the ERP response to each syllable (from 0 to 750 ms) and simulated the response to 4-Hz isochronous speech using the following convolution procedure: s(t) = Σj hj(t) ∗ δ(t − 0.25j), where s(t) was the time-warped response, δ(t) was the Dirac delta function, and hj(t) was the ERP evoked by the jth syllable. The syllable index j ranged from one to the total number of syllables in a story.

In the time-warping procedure, it was assumed that the syllable response was time-locked to the syllable onsets and the word response was time-locked to word onsets. The frequency-domain analysis was subsequently applied to the time-warped response, following the same procedure as adopted in the analysis of the response to isochronous speech.
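The convolution above amounts to re-placing each syllable's ERP at the onset time it would have in a 4-Hz isochronous sequence and summing the shifted responses. A rough Python sketch (not the authors' MATLAB implementation; the 128 Hz rate matches the preprocessing described above, and the function name is our own):

```python
import numpy as np

def time_warp_response(syllable_erps, fs=128, rate_hz=4.0):
    """Simulate the response to isochronous speech from syllable ERPs.

    Implements s(t) = sum_j h_j(t) * delta(t - 0.25 j): the ERP h_j
    evoked by the j-th syllable (extracted relative to its actual onset)
    is shifted to onset time j / rate_hz and all shifted ERPs are summed.

    syllable_erps: (n_syllables, erp_len) array of per-syllable ERPs.
    """
    n_syll, erp_len = syllable_erps.shape
    step = int(round(fs / rate_hz))               # 0.25 s = 32 samples at 128 Hz
    out = np.zeros((n_syll - 1) * step + erp_len)
    for j, h in enumerate(syllable_erps):         # delta shift = add at offset j*step
        out[j * step : j * step + erp_len] += h
    return out
```

Because the 750 ms ERP window is longer than the 250 ms syllable spacing, consecutive shifted ERPs overlap and sum, just as responses overlap in the real isochronous condition.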

Time-domain analysis

Request a detailed protocol

Time-domain analysis was applied only to the responses to disyllabic words; the responses to monosyllabic words were not analyzed. The ERP responses to the first and second syllables of each disyllabic word were separately extracted and averaged across all disyllabic words. The ERP response to each syllable was baseline corrected by subtracting the mean response in a 100 ms window before the syllable onset.
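The epoching and baseline correction can be sketched as follows (Python for illustration; the argument names are ours):

```python
import numpy as np

def syllable_erp(response, onset_idx, erp_len, baseline_len):
    """Extract an epoch beginning at a syllable onset and subtract the
    mean of the pre-onset baseline window (e.g., 100 ms of samples)."""
    epoch = response[onset_idx:onset_idx + erp_len].astype(float)
    baseline = response[onset_idx - baseline_len:onset_idx].mean()
    return epoch - baseline
```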

Statistical test

Request a detailed protocol

In the frequency-domain analysis, statistical tests were performed using the bias-corrected and accelerated bootstrap (Efron and Tibshirani, 1994). In the bootstrap procedure, the data of all participants were resampled with replacement 10,000 times. To test the significance of the 2-Hz and 4-Hz peaks in the response spectrum (Figures 2A–E, 3B and D, and 4A and B), the response power at the peak frequency was compared with the mean power of the four neighboring frequency bins (two bins on each side, one-sided comparison). If the response power at 2 Hz or 4 Hz was not stronger than the mean power of the neighboring bins in N of the 10,000 resamples, the significance level was (N + 1)/10,001 (Jin et al., 2018).
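The logic of the peak test can be sketched as follows (a plain percentile bootstrap in Python for illustration; the paper uses the bias-corrected and accelerated variant, and the array layout is our own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N_RESAMPLE = 10_000

def peak_p_value(power, peak_bin, n_neighbors=2):
    """power: array (n_participants, n_freq_bins). Resample participants
    with replacement and count the resamples in which the peak bin is
    not stronger than the mean of its 2 neighboring bins on each side."""
    neighbors = (list(range(peak_bin - n_neighbors, peak_bin)) +
                 list(range(peak_bin + 1, peak_bin + n_neighbors + 1)))
    n_part = power.shape[0]
    n_fail = 0
    for _ in range(N_RESAMPLE):
        idx = rng.integers(0, n_part, n_part)   # resample with replacement
        resampled = power[idx].mean(axis=0)
        if resampled[peak_bin] <= resampled[neighbors].mean():
            n_fail += 1
    return (n_fail + 1) / (N_RESAMPLE + 1)
```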

When comparing the response power between conditions, the power averaged over the four neighboring frequency bins (two on each side) was always subtracted from the response power to reduce the influence of background neural activity. A two-sided test was used for power differences between conditions within an experiment (solid black lines in Figure 3A and B and Figures 4E and 5C; topography in Figures 4F and 5D). If the resampled power difference between conditions had the opposite sign to the observed difference in N of the 10,000 resamples, the significance level was (2N + 1)/10,001. As to power differences between experiments (dotted red lines in Figures 3A and B and Figure 4E), the significance level was v if the sample mean in one experiment exceeded the (100 − v/2)th percentile (or fell below the (v/2)th percentile) of the distribution of the sample mean in the other experiment (Ding et al., 2018).

To test the phase difference between conditions, the V% confidence interval of the phase difference was measured as the smallest angle that covered V% of the 10,000 resampled phase differences (Jin et al., 2018). In the inter-participant phase coherence test (Figure 3—figure supplement 1), 10,000 phase coherence values were generated based on the null distribution, that is, a uniform distribution. If the actual phase coherence was smaller than N of the 10,000 phase coherence values generated from the null distribution, its significance level was (N + 1)/10,001 (Ding et al., 2018).

In the time-domain analysis (Figure 6), the significance of the ERP difference between conditions was determined by a cluster-based permutation test (Maris and Oostenveld, 2007), performed with the following steps: (1) The ERPs of each participant in the two conditions were pooled into one set. (2) The set was randomly partitioned into two equally sized subsets. (3) At each time point, the responses were compared between the two subsets using a paired t-test. (4) Significantly different time points were clustered based on temporal adjacency. (5) The cluster-level statistic was calculated as the sum of the t-values within each cluster. (6) Steps 2–5 were repeated 2000 times. The p-value was estimated as the proportion of random partitions that yielded a larger cluster-level statistic than the actual partition into the two conditions.

When multiple comparisons were performed, the p-value was adjusted using the false discovery rate (FDR) correction (Benjamini and Hochberg, 1995).
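For reference, the Benjamini–Hochberg step-up procedure can be sketched as follows (Python; the function name is ours):

```python
import numpy as np

def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg: sort the m p-values, find the largest rank k
    with p_(k) <= (k / m) * q, and reject the hypotheses with the k
    smallest p-values."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))   # largest passing rank
        reject[order[:k + 1]] = True
    return reject
```

Note the step-up behavior: a p-value that fails its own threshold is still rejected if a larger p-value at a higher rank passes.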

Post-hoc effect size calculation

Request a detailed protocol

On top of showing the 2-Hz response power from individual participants and individual electrodes in Figure 2—figure supplement 1, an effect size analysis was applied to validate that the sample size was appropriate for observing the 2-Hz response. To simplify the analysis, we calculated the effect size based on a paired t-test comparing the power at 2 Hz with the power averaged over the four neighboring frequency bins. Since the response power was not normally distributed, such a t-test had lower power than, for example, the bootstrap test. Nevertheless, based on the t-test, the 2-Hz response remained significantly stronger than the mean response over neighboring frequency bins in all conditions shown in Supplementary file 2. The effect size of the t-test was calculated using the G*Power software (version 3.1) (Faul et al., 2007). We calculated Cohen's d and statistical power based on the mean and standard deviation of the 2-Hz response (reported in Supplementary file 2). The power was above 0.8 for all conditions, suggesting that the sample size was sufficient even for the more conservative t-test.
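The quantities involved can be sketched as follows (Python; Cohen's d_z for a paired one-sided test, with a normal approximation to the noncentral t for the achieved power, whereas G*Power computes the exact value):

```python
from statistics import NormalDist
import numpy as np

def paired_effect_size(peak_power, neighbor_power, alpha=0.05):
    """Cohen's d_z for the paired comparison of 2-Hz power against the
    mean power of neighboring bins, plus approximate post-hoc power."""
    diff = np.asarray(peak_power, float) - np.asarray(neighbor_power, float)
    n = diff.size
    dz = diff.mean() / diff.std(ddof=1)
    nc = dz * np.sqrt(n)                        # noncentrality parameter
    z_crit = NormalDist().inv_cdf(1 - alpha)    # one-sided criterion
    power = 1 - NormalDist().cdf(z_crit - nc)   # normal approximation
    return dz, power
```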

Data availability

The EEG data and analysis code (in MATLAB) were uploaded as source data files.


Decision letter

  1. Virginie van Wassenhove
    Reviewing Editor; CEA, DRF/I2BM, NeuroSpin; INSERM, U992, Cognitive Neuroimaging Unit, France
  2. Andrew J King
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Delta-band Cortical Tracking of Acoustic and Linguistic Features in Natural Spoken Narratives" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Andrew King as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

The acoustics of human speech convey complex linguistic hierarchical structures. While several studies have demonstrated that cortical activity shows speech-following responses, the extent to which these responses track acoustic features or linguistic units remains highly debated. Delta-band activity has been reported to track prosody, syntactic phrasing, as well as temporal predictions. In this new study comprised of three EEG experiments, Luo and colleagues tested the cortical responses to spoken stories that could be metrically organized or not, and that were presented as synthesized isochronous speech, as delta (~2 Hz) amplitude-modulated isochronous speech, or read by a natural reader. The authors argue that delta synchronizes to multisyllabic compound words, favoring a word-based rather than a syllable-based speech tracking response.

Revisions for this paper:

1) The Materials and methods section and the figures need to be clearly documented and annotated. Reviewer 1 raises some concerns about the loose use of "phase" and some of the methods that are not clearly incorporated in the Results to help the more general audience follow the subtleties of the analyses. I also leave the full set of reviews to guide such revision – as well as edit the English when needed.

2) Tempering the claim that delta uniquely tracks word rhythm.

All reviewers provide detailed comments pointing to additional literature that could be considered and which can feed the discussion regarding syntactic vs. semantic vs. prosodic dissociations…

Clarifying the specificities reported in the literature should also help the authors address serious comments raised by reviewers 1 and 3, who are particularly hesitant regarding the strong claim on word-specificity of the delta response – including the lack of changes in the phase of delta oscillations. Reviewer 2 also stresses this point. The legitimacy of calling the low-frequency response "delta" is questioned, as opposed to a more cautious interpretation as a possible mixture of network activity, as suggested for instance by reviewer 3.

All three reviewers further question the description of delta phase considering the quantifications (e.g. well-known effect of the reference) and the observable topographical changes in Figure 3.

As a result of these major considerations, the authors should temper the strong claims that are currently being put forward.

Revisions expected in follow-up work:

3) While the study provides an interesting means for parametric manipulation of amplitude, reviewer 2 raised an interesting point regarding the effect of prosody in that frequency range. As expressed by reviewer 2: "not only the amplitude of pitch changes is critical to the perception of prosodic events, but also the timing and rise/fall slopes. All this is not linearly affected by increasing the volume at 2 Hz. To show that delta-band oscillations are not affected by stimulus prosody would require a more systematic dedication to the different linguistic dimensions of prosody, amplitude being only one."

Reviewer #1:

In a series of three EEG experiments, Luo and colleagues tested whether the low-frequency following response to auditory speech was elicited by linguistic or acoustic features. The authors modulated the task (story comprehension vs. passive video watching) and the acoustic structure of the speech signals using (non-)metrical, isochronous, and 2Hz amplitude modulated speech, or natural speech. The authors mainly used spectral quantification methods of the following responses and some phase analyses. The authors summarize their findings as illustrating that the delta band synchronizes to multisyllabic words, favoring a word-based following response as compared to a syllabic-based speech tracking response.

My main concerns about the study stem from the methods and a difficulty to straightforwardly understand the procedure.

For instance:

– Can the authors clarify the choice of band-pass filtering and the compensation that was applied as briefly stated "and the delay caused by the filter was compensated."

– How did the authors ensure that concatenated responses did not introduce low-frequency noise in their analysis?

– Subsection “Frequency domain analysis”: can the authors report the number of participants excluded by the analysis?

I also have substantial difficulties regarding the following methodological aspects:

– Bootstrap analyses artificially increase the number of samples while maintaining constant the variance of the samples. How did the authors take this into account in their final statistical comparisons?

– Figure 1D and predictions on the (absolute? relative? instantaneous?) phase-locked activity to word onset and syllable need to be explained better and more precisely. For instance, there is no a priori reason to believe that 0° and 180° ought to be the actual expected values, as clearly illustrated in Figure 3?

– While a time-warping analysis is performed for the data collected in Experiment 3, this approach (and its possible interpretational limits) is not explained in more detail in the actual Results section.

I am here particularly concerned about the phase analysis, which is not reported in sufficient detail to clearly understand the results and the claims of the authors.

In line with this, I highlight two additional issues:

– Subsection “Response phase at 2 Hz”, Figure 4: the authors argue that attention may explain the lack of phase differences. Additionally, the authors seem to assume in their working hypothesis that the phase response should be comparable across participants.

In a recent study (Kösem et al., 2016), no evidence for phase-related linguistic parsing in low-frequency activity was found when participants actively or passively listened to a stationary acoustic speech sequence whose perceptual interpretation could alternate. Are these findings compatible with those illustrated in Figure 4? Furthermore, some important inter-individual differences were reported, which could severely impact the primary assumption of phase-consistency across individuals.

– Have the authors considered the possibility that the choice of EEG reference may largely affect the reported mean instantaneous phase responses and if so, could they control for it?

Reviewer #2:

The authors report an n = 48 EEG study aiming at dissociating the previously proposed roles of delta-band oscillations in the processing of speech acoustics and symbolic linguistic information. The authors report that delta-band activity is related more strongly to linguistic rather than acoustic information.

This is a welcome study, given that by now, the literature contains claims of delta-band oscillations being involved in (1) prosody tracking, (2) generation of linguistic structures / syntactic phrasing, (3) processing of linguistic information content, (4) temporal prediction in- and outside of the linguistic domain.

The study is well-written and I am optimistic that it will provide a good contribution to the existing literature.

I find the framing and discussion a bit gappy and I missed multiple lines of related research. For instance, the literature on the relationship between delta-band oscillations and prosody is not mentioned in the Introduction; likewise, the literature on temporal prediction versus linguistic prediction is not discussed well. In addition, I also think that this study is not as definitive as the authors claim, resulting from the choice of unnatural (i.e., isochronous) stimuli that, strangely enough, the authors claim to be natural.

Below, I give some comments for the authors to further improve their interesting work.

1) False claim of studying natural speech: The authors artificially induce unnatural stimulus rhythms as a means to frequency-tag their stimuli and perform frequency-domain analysis. While this is an interesting approach, this results in stimuli that do not have ecological validity. Using unnatural stimuli with artificial periodicity to propose periodic electrophysiological processing of words is not scientifically sound. In natural speech, words do not occur periodically. Why assume then that they are processed by a periodic oscillatory mechanism? While the frequency-tagging approach inherently provides good SNR, it cannot claim ecological validity for anything above phonemes or syllables, as for these, there are no linguistic corpus studies out there that show periodicity of the constructs to start with. Specific comment:

Subsection “Neural tracking of words in natural spoken narratives”: "natural" The authors artificially make natural speech unnatural by making it isochronous. It is bizarre to still call the resulting stimulus "natural". Isochronous word sequences are certainly not natural speech. Hence, the authors can also not claim that they demonstrate cortical tracking of natural speech. To show this, they would have to show that there is speech-brain synchronicity for the non-isochronous condition. Certainly, to do so, the authors cannot use their standard spectral-peak analysis, but would have to show coherence or phase-locking against a different baseline (e.g., scrambled, autocorrelation-preserving speech). I strongly encourage the authors to do this analysis instead of pretending that isochronous word sequences are natural. The more general problem with this is that it adds to literature claiming that oscillations track words in human speech processing, which cannot possibly be true in this simplistic version: words are simply too variable in duration and timing.

2) Depth of framing and discussion: (a) Introduction is not specific enough to speech prosody versus the different types of results on linguistic information (e.g., syntactic structure / phrasing, syntactic and lexical entropy), including reference to prior work; also, relevant references that suggested delta-band oscillations to serve both bottom-up (i.e., prosody) and top-down (i.e., syntax, prediction) functionalities are not discussed in the Introduction; (b) interpretation of the results in terms of the different proposed functionalities of the delta-band lacks depth and differentiation.

Introduction: What is missing here is mentioning that delta-band oscillations have also been implied in prosody (e.g., Bourguignon et al., 2013). Relatedly, the idea that delta-band oscillations are related both to slow-fluctuating linguistic representations (e.g., phrases; Meyer and Gumbert, 2018; Meyer et al., 2016; Weissbart et al., 2019) and prosody has been put forward various times (e.g., Ghitza, 2020; Meyer et al., 2016). I think that mentioning this would help the framing of the current manuscript: (1) delta = symbolic linguistic units, (2) delta = prosody, (3) current study: Is delta reflecting symbolic processing or prosody?

Subsection “Time course of ERP response to words”: As to prediction and the delta-band, please consider discussing Weissbart et al., 2019, Meyer and Gumbert, 2018, Breska and Deouell, 2017, Lakatos et al., 2013, Roehm et al., 2009; in particular the relationship between linguistic predictions and the delta-band should be discussed shortly here. In general, there is need in the field to characterize the relationship between temporal prediction, prosody, linguistic units, and the delta-band. Just noting all this would be fine here.

"semantic processing" Why semantics? See my above comment on the relationship between the delta-band and temporal predictions (e.g., Roehm et al., 2009; Stefanics et al., 2010). In addition, please consider the recent reconceptualisation of the N400-P600 complex in terms of predictive coding (e.g., Kuperberg et al., 2019).

3) EEG preprocessing / report thereof does not accord to standards in the field. Apparently, the EEG data have not been cleaned from artifacts except for calculating the residual between the EEG and the EOG. Hence, I cannot assume that the data are clean enough to allow for statistical analysis. Specific point:

Materials and methods: The EEG-cleaning section is very sparse. With standard EEG and standard cleaning, rejection rates for adult EEG are on the order of 25-35%. I am highly doubtful that these EEG data have been cleaned in an appropriate way. There is no mention of artifact removal, the amount of removed data segments, or the time- and frequency-domain aspects that were considered for artifact definition. What toolboxes were used, what algorithms?

4) Terminology: Like many people in the field, the authors speak of “tracking” of linguistic information. To be clear: While it is fine that the field understands the problems of the term “entrainment”, now opting for “tracking”, it does not make much sense to use the combination of “tracking” and “linguistic units”. This is because linguistic units are not in the stimulus; instead they are only present in the brain as such. They cannot be tracked; instead, they need to be “associated”, “inferred”, or even “generated”. If one were to assume that the brain “tracks” linguistic units, this would mean that the brain tracks itself. Specific comments:

"cortical activity tracking of higher-level linguistic units" Linguistic units, such as "words and phrases" cannot be tracked, as they are not present in the stimulus. They are purely symbolic and they must be inferred. They cannot be sensed. Instead, they are present only in the mind and electrophysiology of the listener. Linguistic units are activated in memory or generated in some form. They are internal representations or states. If one were to say that the brain tracks these in some way, one would say that the brain tracks its own internal representations / states. Try and rephrase, please.

"linguistic information" cannot be extracted from acoustic features. Instead, linguistic information (e.g., distinctive features in phonology, semantic meaning, part-of-speech labels) must be associated with a given acoustic stimulus by an associative / inferential cognitive process. When we hear some acoustic stimulus, we associate with something linguistic. Just as in my above comment on the loose use of the term “tracking”, I suggest the authors to change this wording as well. A suggestion for a better phrasing would be "Linguistic information is retrieved / constructed / generated when specific acoustic information is encountered / recognized…". Or "Speech comprehension involves the active inference of linguistic structure from speech acoustics".

Discussion: "tracks" see my above comment.

5) Potential differences between the σ1 and σ2 conditions: There are apparent differences in the number of sensors that show sensitivity to the amplitude modulation. This is not captured in the current statistical analysis. Still, given these drastic differences, the claim that the acoustics of speech prosody do not affect delta-band amplitude is certainly much too strong and may just result from a lack of analysis (i.e., overall phase / ERP rather than sensor-by-sensor analysis). See specific comment next:

"or the phase of the 2-Hz amplitude modulation" I disagree with excitement and a happy suggestion that the authors must consider. Look, the plots (Figure 3, σ1/2) may not show differences in the phase angle. Yet, what they may point to is a dimension of the data that the authors have not assessed yet: topographical spread. From simply eyeballing it is clear that the number of significant electrodes differs drastically between the σ1 and σ2 conditions. This pattern is interesting and should be examined more closely. Is word-initial stress more natural in Chinese? If so, this could be interpreted in terms of some influence of native prosodic structure: Amplitude modulation does appear to affect the amount / breadth of EEG signal (i.e., number of sensors), but only if it conforms to the native stress pattern? Not only words / phrases are part of a speaker's linguistic knowledge, but also the stress pattern of their language. I urge the authors to look into this more closely.

"The response" see my above comment on the differences in the number of significant sensors in the σ1 and σ2 conditions.

Reviewer #3:

This manuscript presents research aimed at examining how cortical activity tracks the acoustic and linguistic features of speech. In particular, the authors are interested in how rhythms in cortical data track the timing of the acoustic envelope of speech and how that might reflect, and indeed dissociate from, the tracking of the linguistic content. The authors collect EEG from subjects as they listen to speech that is presented in a number of different ways. This includes conditions where: the speech is rhythmic and the first syllable of every word appears at regular intervals; the speech is rhythmic but the first syllable of every word does not appear at regular intervals; the speech appears with a more natural rhythm with some approximate regularity to the timing of syllables (which allows for an analysis based on time warping); and some rhythmic speech that is amplitude modulated to emphasize either the first or second syllables. And they also present these stimuli when subjects are tasked with attending to the speech and when subjects are engaged in watching a silent video. The authors analyze the EEG from these different conditions in the frequency domain (using time-warping where needed), in terms of the phase of the rhythmic EEG responses, and in the time domain. Ultimately, they conclude that, when subjects are attending to the speech, the EEG primarily tracks the linguistic content of speech. And, when subjects are not attending to the speech, the EEG primarily tracks the acoustics.

I enjoyed reading this manuscript, which tackles an interesting and topical question. And I would like to commend the authors on a very nice set of experiments – it's a lovely design.

However, I also had a number of fairly substantial concerns – primarily about the conclusions being drawn from the data.

1) My main concern with the manuscript is that it seems – a priori – very committed to discussing the results in terms of tracking by activity in a specific (i.e., delta) cortical band. I understand the reason to want to do this – there is a lot of literature on speech and language processing that argues for the central role of cortical oscillations in tracking units of speech/language with different timescales. However, in the present work, I think this framework is causing confusion. To argue that delta tracking is a unitary thing that might wax and wane like some kind of push/pull mechanism to track acoustics sometimes and linguistic features other times seems unlikely to be true and likely to cause confusion in the interpretation of a nice set of results. My personal bias here – and I think it could be added for balance to the manuscript – is that speech processing is carried out by a hierarchically organized, interconnected network with earlier stages/lower levels operating on acoustic features and later stages/higher levels operating on linguistic features (with lots of interaction between the stages/levels). In that framing, the data here would be interpreted as containing separable simultaneous evoked contributions from processing at different hierarchical levels that are time locked to the relevant features in the stimulus. Of course, because of the nature of the experiments here, these contributions are all coming at 2 Hz. But to then jump to saying a single "delta oscillation" is preferentially tracking different features of the speech given different tasks, seems to me to be very unlikely. Indeed, the authors seem sensitive to the idea of different evoked contributions as they include some discussion of the N400 later in their manuscript. But then they still hew to the idea that what they are seeing is a shift in what "delta-band" activity is doing. I am afraid I don't buy it. 
I think if you want to make this case – it would only be fair to discuss the alternatives and more clearly argue why you think this is the best way to interpret the data.

2) Following my previous point, I have to admit that I am struggling to figure out how, precisely, you are determining that "cortical activity primarily tracks the word rhythm during speech comprehension". There is no question that the cortical activity is tracking the word rhythm. But how are you arguing that it is primarily word rhythm. I am guessing it has something to do with Figure 3? But I am afraid I am not convinced of some of the claims you are making in Figure 3. You seem to want to embrace the idea that there is no difference in phase between the σ1-amplified and σ2-amplified conditions and that, therefore, the signal is primarily tracking the word rhythm. But I think there is an important flaw in this analysis – that stems from my concern in point number 1 above. In particular, I think it is pretty clear that there are differences in phase at some more frontal channel locations, and no difference in more posterior channels (Figure 3A, rightmost panel). So, going back to my issue in point number 1, I think it is very likely that word rhythm is being tracked posteriorly (maybe driven by N400 like evoked activity) – and that phase is similar (obviously). But it also seems likely that the 2 Hz rhythm over frontal channels (which likely reflect evoked activity based on acoustic features) are at a different phase. Now, cleanly disentangling these things is difficult because they are both at 2Hz and will contaminate each other. And I think when subjects are watching the silent video, the posterior (N400 like) contribution disappears and all you have left is the acoustic one. So, again, I think two separable things are going on here – not just one delta rhythm that selectively tracks different features.

3) I'm afraid I must take issue with the statements that "Nevertheless, to dissociate linguistic units with the related acoustic cues, the studies showing linguistic tracking responses mostly employ well-controlled synthesized speech that is presented as an isochronous sequence of syllables… Therefore, it remains unclear whether neural activity can track linguistic units in natural speech, which is semantically coherent but not periodic, containing both acoustic and linguistic information in the delta band." I know the (very nice) studies you are talking about. But you have just also cited several studies that try to disentangle linguistic and acoustic features using more natural speech (e.g., Brodbeck; Broderick). So I think your statement is just too strong.

4) I was confused with the presentation of the data in Figure 2F and G. Why are some bars plotted up and others down? They are all positive measures of power, and they would be much easier to compare if they were all pointed in the same direction. Unless I am missing the value of plotting them this way?

https://doi.org/10.7554/eLife.60433.sa1

Author response

Revisions for this paper:

1) The Materials and methods section and the figures need to be clearly documented and annotated. Reviewer 1 raises some concerns about the loose use of "phase" and some of the methods that are not clearly incorporated in the Results to help the more general audience follow the subtleties of the analyses. I also leave the full set of reviews to guide such revision – as well as edit the English when needed.

Based on the very constructive comments from the reviewers, we have clarified our analysis procedures in the revised manuscript. The response phase in the manuscript refers to the phase of the complex-valued Fourier coefficients, which we have now clearly defined. Furthermore, we have also added a new analysis that is easier to interpret and moved detailed results of phase analysis to Figure 3—figure supplement 1 (please see the reply to the next question for details).

2) Tempering the claim that delta uniquely tracks word rhythm.

All reviewers provide detailed comments pointing to additional literature that could be considered and which can feed the discussion regarding syntactic vs. semantic vs. prosodic dissociations…

We have added the suggested references to the manuscript and expanded the Introduction and Discussion. Please see our point-to-point reply for details.

Clarifying the specificities reported in the literature should also help the authors address serious comments raised by reviewer 1 and 3, who are particularly hesitant regarding the strong claim on word-specificity of the delta response – including the lack of changes in the phase of delta oscillations. Reviewer 2 also stresses this point. The legitimacy of calling the low-frequency response delta is questioned as opposed to a cautionary possible mixture of network activity as suggested for instance by reviewer 3.

In the previous manuscript, we referred to the 2-Hz neural response as a “delta-band” neural response to distinguish it from the 4-Hz response to the speech envelope, which is usually referred to as a “theta-band” envelope-tracking response. We now realize that this terminology was not appropriate, since it could mislead readers into believing that the 2-Hz response must relate to spontaneous delta oscillations. To avoid this potential confusion, we now use amplitude modulation (AM) and speech envelope to refer to the 2-Hz and 4-Hz acoustic rhythms, respectively, in the amplitude-modulated speech.

All three reviewers further question the description of delta phase considering the quantifications (e.g. well-known effect of the reference) and the observable topographical changes in Figure 3.

As a result of these major considerations, the authors should temper the strong claims that are currently being put forward.

The original conclusion that “delta-band activity primarily tracks the word rhythm” was made on the basis of response phase analysis, which showed that the response phase was barely influenced by the 2-Hz amplitude modulation (AM). The three reviewers, however, all raised concerns about the conclusion and/or the relevant analysis procedure. Following the constructive comments from the reviewers, we have now collected more data and performed new analyses to further investigate how acoustic and linguistic information is jointly encoded in cortical activity. We have updated our conclusions on the grounds of the new results, which are summarized in the following.

Phase difference analysis

Reviewers 2 and 3 both pointed out that the response phase was affected by the 2-Hz AM even during attentive story comprehension. To further investigate this important observation, we have now calculated the phase difference between the σ1- and σ2-amplified conditions for each participant and each EEG electrode, and tested whether the phase difference was significantly different from 0° in any electrode. The results indeed revealed a significant phase difference between conditions in some EEG electrodes, even during the story comprehension task (Author response image 1A, B), confirming the reviewers’ observation.

Author response image 1
Topographical distribution of the 2-Hz phase difference between the σ1- and σ2-amplified conditions.

The phase difference is calculated for each participant and each electrode, and then averaged over participants. The black dots indicate electrodes showing a significant phase difference between the σ1- and σ2-amplified conditions (P < 0.05, bootstrap, FDR corrected). (A, B) The 2-Hz phase difference in the original experiment. (C, D) The 2-Hz phase difference pooled over the original experiment and the replication experiment.

A replication experiment

To further validate the results of the phase difference analysis, we have now replicated the experiment with another group of participants. In this replication experiment, participants only listened to amplitude-modulated speech, and they participated in the video watching task before the story comprehension task. The replication experiment confirmed our original observation, finding a larger 2-Hz response phase difference between the σ1- and σ2-amplified conditions in the video watching task than in the story comprehension task. Nevertheless, the new observation illustrated in Author response image 1A was not replicated: after the data were pooled over the original experiment and the replication experiment, no EEG electrode showed a significant phase difference between conditions during the story comprehension task (Author response image 1C).

Therefore, the 2-Hz AM reliably modulated the response phase during passive listening (Author response image 1B, D), but not during active listening (Author response image 1A, C). Although the 2-Hz AM did not significantly modulate the response phase during active listening, the following new analysis showed that this null result was potentially attributable to the low statistical power of the response phase analysis.

Time-domain separation of AM and word responses

For amplitude-modulated speech, the neural response synchronous to the 2-Hz AM (referred to as the AM response) and the response synchronous to word onsets (referred to as the word response) can be dissociated based on response timing (Figure 1D). In the previous manuscript, we mainly employed the phase analysis to characterize response timing (previous Figure 3). Reviewer 1, however, raised the concern that the previous analysis was built on the assumption that the response phase was consistent across participants, which was neither a necessary nor a well-supported assumption. Therefore, we have now employed a different method to extract the AM and word responses based on response timing. This analysis was briefly reported in Supplementary Figure 2 of the previous manuscript. We have now expanded it and moved it to the main text.

In this new analysis, the AM and word responses were extracted by averaging the response over the σ1- and σ2-amplified conditions in different ways, as illustrated in Figure 4A and C. The spectra of the AM and word responses were calculated in the same way as for, e.g., the response to isochronous speech. In this new analysis, the power spectrum was calculated for each participant and then averaged; therefore, it did not assume phase consistency across participants. More importantly, this new method could estimate the strength of the AM and word responses, while the previous phase analysis could only prove the existence of the AM response.

The new analysis clearly revealed that the AM response was statistically significant during both attentive and passive listening, suggesting that the cortical response contained a component that reflected low-level auditory encoding. The new analysis confirmed that the word response was stronger than the AM response. Nevertheless, it showed that the word response was also statistically significant during passive listening for amplitude-modulated speech, but not for isochronous speech, suggesting that the AM cue could facilitate word processing during passive listening. Additionally, the AM response was not significantly modulated by attention, while the word response was. The AM and word responses also showed distinguishable spatial distributions, suggesting different neural sources. These key findings were all replicated in the replication experiment (Figure 5), demonstrating that the new spectral analysis method was robust.
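The cancellation logic behind this extraction can be illustrated with a toy simulation. All values below are assumptions for illustration (a 2-Hz word rate, a 250-ms offset between the AM in the σ1- and σ2-amplified conditions, a 128-Hz sampling rate); this is a sketch of the idea, not the exact procedure of Figure 4.

```python
import numpy as np

fs = 128                       # Hz, the down-sampled EEG rate (assumed)
t = np.arange(0, 2, 1 / fs)    # 2 s of steady-state response

# The word-locked component (2 Hz) is identical across conditions; the
# AM-locked component is shifted by one syllable (250 ms, i.e., half a
# 2-Hz cycle) between the sigma1- and sigma2-amplified conditions.
word = np.sin(2 * np.pi * 2 * t)
am1 = 0.5 * np.sin(2 * np.pi * 2 * t + 0.7)
am2 = -am1                     # a 250-ms shift = anti-phase at 2 Hz

r1 = word + am1                # response, sigma1-amplified condition
r2 = word + am2                # response, sigma2-amplified condition

# Plain averaging: the anti-phase AM components cancel -> word response.
word_est = (r1 + r2) / 2

# Shift one condition by 250 ms before averaging: the AM components now
# align while the word components cancel -> AM response.
shift = int(0.25 * fs)
am_est = (r1 + np.roll(r2, shift)) / 2
```

With these synthetic responses, `word_est` recovers the word component and `am_est` recovers the AM component.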

Conclusions

With findings from the new analysis, we have now updated the conclusions in the Abstract as follows.

“Our results indicate that an amplitude modulation (AM) cue for word rhythm enhances the word-level response, but the effect is only observed during passive listening. […] These results suggest that bottom-up acoustic cues and top-down linguistic knowledge separately contribute to cortical encoding of linguistic units in spoken narratives.”

Revisions expected in follow-up work:

3) While the study provides an interesting means for parametric manipulation of amplitude, reviewer 2 raised an interesting point regarding the effect of prosody in that frequency range. As expressed by reviewer 2: "not only the amplitude of pitch changes is critical to the perception of prosodic events, but also the timing and rise/fall slopes. All this is not linearly affected by increasing the volume at 2 Hz. To show that delta-band oscillations are not affected by stimulus prosody would require a more systematic dedication to the different linguistic dimensions of prosody, amplitude being only one."

Reviewer 2 raised a very important point, which was not discussed in the previous manuscript. We certainly agree that the 2-Hz word response in the current study could reflect prosodic processing. Furthermore, we would like to distinguish the perceived prosody from the acoustic cues for prosody. Previous literature has shown that high-level linguistic information can modulate prosodic and even auditory processing (e.g., Buxó-Lugo and Watson, 2016; Garrett et al., 1966). Therefore, even when the prosodic cues in speech are removed, e.g., in the isochronous speech sequence, listeners may still mentally recover the prosodic structure. In fact, the experiment was not designed to rule out the possibility that prosody is related to low-frequency neural activity. This important point was not discussed in the previous manuscript and caused confusion.

Furthermore, as the reviewer pointed out, the amplitude envelope is just one kind of prosodic cue. Moreover, in most languages, the amplitude envelope contributes less to prosodic processing than other cues, e.g., the pitch contour, rise/fall slopes, and timing. Therefore, the purpose of manipulating the speech envelope is not to modulate prosodic processing. Instead, we manipulate the amplitude envelope because it is one of the acoustic features that can most effectively drive cortical responses. Even amplitude-modulated noise, which has no pitch contour or linguistic content, can strongly drive an envelope-tracking response. Therefore, we would like to test how auditory envelope tracking and linguistic processing, which includes prosodic processing, separately contribute to speech-synchronous neural activity.

We have now added a new section of discussion about prosodic processing (see point-to-point reply for details) and revised the Discussion to explain why we manipulated the speech envelope.

Reviewer #1:

[…] My main concerns about the study stem from the methods and a difficulty to straightforwardly understand the procedure.

Thank you for pointing out these issues. We have thoroughly modified the Materials and methods and Results, and we believe there is now enough information to understand the procedure. Furthermore, we will make the data and analysis code publicly available after the manuscript is accepted.

For instance:

– Can the authors clarify the choice of band-pass filtering and the compensation that was applied as briefly stated "and the delay caused by the filter was compensated."

We have now clarified the filter we used and how the delay was compensated.

“The EEG recordings were down-sampled to 128 Hz, referenced to the average of mastoid recordings, and band-pass filtered between 0.8 Hz and 30 Hz using a linear-phase finite impulse response (FIR) filter (6 s Hamming window, -6 dB attenuation at the cut-off frequencies). […] The delay was compensated by removing the first N/2 samples in the filter output…”
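As an illustrative sketch of this filtering and compensation step (parameters chosen to match the quoted description; scipy's `firwin` is used here, which may not be the exact implementation in our pipeline):

```python
import numpy as np
from scipy import signal

fs = 128                                   # Hz, down-sampled EEG rate
numtaps = 6 * fs + 1                       # ~6-s Hamming-window FIR; odd length => linear phase
taps = signal.firwin(numtaps, [0.8, 30.0], pass_zero=False,
                     window="hamming", fs=fs)

t = np.arange(0, 60, 1 / fs)
x = np.sin(2 * np.pi * 5 * t)              # a test signal inside the passband
y = np.convolve(x, taps)                   # filtered output, longer than x

# A linear-phase FIR with N+1 taps delays its input by N/2 samples;
# dropping those samples realigns the output with the input.
delay = (numtaps - 1) // 2
y = y[delay:delay + len(x)]
```

After dropping the first N/2 samples, a passband signal lines up in time with the unfiltered input, which is the point of the compensation.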

– How did the authors insure that concatenated responses did not introduce low-frequency noise in their analysis?

The EEG responses to different sentences were concatenated in the frequency-domain analysis. The duration of the response to each sentence ranged from 1.5 to 7.5 s (mean duration: 2.5 s). Therefore, concatenation of the responses could only generate low-frequency noise below 1 Hz, which could barely interfere with the 2-Hz and 4-Hz responses that we analyzed.

To confirm that the 2-Hz and 4-Hz responses were not caused by data concatenation, we have now also analyzed the EEG response averaged over sentences. Considering the variation in sentence duration, this analysis was restricted to sentences that consisted of at least 10 syllables. Moreover, only the responses to the first 10 syllables in these sentences were averaged. In a procedure similar to that applied in the original spectral analysis, the responses during the first 0.5 s were removed to avoid the onset response and the remaining 2-second response was transformed into the frequency domain using the DFT. The results were consistent with those obtained from the original analysis and are illustrated in Author response image 2. The advantage of this new analysis was that it did not involve data concatenation, while the disadvantage was that it discarded a large amount of data to ensure equal duration in the response to each sentence.

Author response image 2
Spectrum of the EEG response averaged over sentences.

This analysis is restricted to sentences that had at least 10 syllables, and only the response to the first 10 syllables is analyzed. The response during the first two syllables is removed to avoid the onset response, and the remaining 2 seconds of the response are averaged over sentences. The averaged response is transformed into the frequency domain using the DFT. The response spectrum is averaged over participants and EEG electrodes. The shaded area indicates 1 standard error of the mean (SEM) across participants. Stars indicate significantly higher power at 2 Hz or 4 Hz than the power averaged over 4 neighboring frequency bins (2 on each side). The color of the star is the same as the color of the spectrum. **P < 0.01 (bootstrap, FDR corrected).
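The peak-versus-neighbor-bins comparison described in the caption can be sketched as follows (the synthetic 2-s response and its component amplitudes are invented for illustration; with a 2-s window, the frequency resolution is 0.5 Hz):

```python
import numpy as np

fs = 128
t = np.arange(0, 2, 1 / fs)                       # a 2-s averaged response
resp = np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 4 * t)

power = np.abs(np.fft.rfft(resp)) ** 2
freqs = np.fft.rfftfreq(len(resp), 1 / fs)        # 0.5-Hz bin spacing

def peak_vs_neighbors(power, k, n_side=2):
    """Power at bin k vs. the mean power of n_side bins on each side."""
    neigh = np.r_[power[k - n_side:k], power[k + 1:k + 1 + n_side]]
    return power[k], neigh.mean()

k2 = int(round(2 / freqs[1]))                     # bin index of 2 Hz
p2, n2 = peak_vs_neighbors(power, k2)
```

For a response with a genuine 2-Hz component, the power at the 2-Hz bin far exceeds the neighboring-bin average, which is what the stars in the figure test (via bootstrap across participants).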

– Subsection “Frequency domain analysis”: can the authors report the number of participants excluded by the analysis?

In the previous phase analysis, for each EEG electrode, participants not showing significant inter-trial phase coherence (P > 0.1) were excluded. Therefore, a different number of participants were excluded for each electrode. To simplify the procedure, we have now kept all participants in the analysis and the updated results were shown in Figure 3—figure supplement 1.

I also have substantial difficulties regarding the following methodological aspects:

– Bootstrap analyses artificially increase the number of samples while maintaining constant the variance of the samples. How did the authors take this into account in their final statistical comparisons?

In the bootstrap procedure, we estimated the distribution of the sample mean by resampling the data, and the p-value was determined based on the null distribution. We used a standard bootstrap procedure whose details were specified in the reference (Efron and Tibshirani, 1994); the code is publicly available (MATLAB scripts: bootstrap_for_vector.m, https://staticcontent.springer.com/esm/art%3A10.1038%2Fs41467-018-07773y/MediaObjects/41467_2018_7773_MOESM4_ESM.zip). The bootstrap procedure is well established and has been used in our previous studies (e.g., Jin et al., 2018) and in other studies (e.g., Bagherzadeh et al., 2020, and Norman-Haignere et al., 2019).

Bagherzadeh, Y., Baldauf, D., Pantazis, D., Desimone, R., 2020. Alpha synchrony and the neurofeedback control of spatial attention. Neuron 105, 577-587. e575.

Norman-Haignere, S.V., Kanwisher, N., McDermott, J.H., Conway, B.R., 2019. Divergence in the functional organization of human and macaque auditory cortex revealed by fMRI responses to harmonic tones. Nature Neuroscience 22, 1057-1060.
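For completeness, a generic version of this bootstrap test of a sample mean (resampling with replacement and reading the p-value off the resampled distribution) might look like the sketch below; this is not our exact MATLAB script, just the same logic in outline:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mean_p(x, n_boot=10_000):
    """One-sided bootstrap test of whether mean(x) exceeds zero.

    Resample x with replacement, form the distribution of the sample
    mean, and take the fraction of resampled means <= 0 (with the
    usual +1 correction) as the p-value.
    """
    x = np.asarray(x, dtype=float)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_means = x[idx].mean(axis=1)
    return (np.sum(boot_means <= 0) + 1) / (n_boot + 1)
```

Here `x` would be, e.g., per-participant differences between the power at the target frequency bin and the mean power of the neighboring bins.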

– Figure 1D and predictions on the (absolute? relative? instantaneous?) phase-locked activity to word onset and syllable need to be better and more precisely explained. For instance, there is no a priori reason to believe that 0° and 180° ought to be the actual expected values, as clearly illustrated in Figure 3?

Thank you for pointing out this issue. We have now modified the illustration.

Since response phase can be defined in different ways, we no longer illustrate the response phase in Figure 1. Instead, we now show the time lag between neural responses in Figure 1D. Correspondingly, in the main analysis, we have now separately extracted the word and AM responses by time shifting and averaging the responses across σ1- and σ2-amplified conditions (please see Figure 4).

– While a time-warping analysis is being performed for data collected in Experiment 3, no mention of this approach is being explained in more details (as to possible interpretational limits) in the actual Results section.

Thank you for pointing out this issue. We have now detailed the time-warping procedure in Materials and methods.

“Time-Warping Analysis: In natural speech used in Experiment 3, syllables were not produced at a constant rate, and therefore the responses to syllables and words were not frequency tagged. […] The word index j ranged from 1 to the total number of syllables in a story.”

We have also mentioned the assumption underlying the time-warping procedure.

“In the time-warping procedure, it was assumed that the syllable response was time-locked to the syllable onsets and the word response was time-locked to word onsets. The frequency-domain analysis was subsequently applied to the time-warped response, following the same procedure as adopted in the analysis of the response to isochronous speech.”

In the time-domain analysis (please see Figure 6), which does not involve time warping, it is further confirmed that there is a word response synchronous to the word onset.
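To make the warping assumption concrete, a simplified sketch is given below: each inter-onset interval of the response is linearly resampled to a fixed length so that irregular syllable onsets become isochronous (the interpolation details of the actual procedure may differ):

```python
import numpy as np

fs = 128                                       # Hz, EEG sampling rate

def time_warp(resp, onsets, target_rate=4.0, fs=fs):
    """Warp resp so that events at `onsets` (in s) become isochronous
    at `target_rate` Hz by resampling each inter-onset interval."""
    seg_len = int(fs / target_rate)            # samples per warped interval
    out = []
    for t0, t1 in zip(onsets[:-1], onsets[1:]):
        src = resp[int(t0 * fs):int(t1 * fs)]  # one inter-onset interval
        out.append(np.interp(np.linspace(0, len(src) - 1, seg_len),
                             np.arange(len(src)), src))
    return np.concatenate(out)

# Irregular syllable onsets (seconds), invented for illustration:
onsets = [0.0, 0.3, 0.55, 0.8]
resp = np.arange(fs) / fs                      # 1 s of dummy response
warped = time_warp(resp, onsets)
```

The warped response then has a fixed number of samples per syllable, so the frequency-domain analysis applies exactly as for isochronous speech.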

I am here particularly concerned about the phase analysis which is not being sufficiently reported in details to clearly understand the results and the claims of the authors.

The previous manuscript did not mention how the response phase was calculated. It was the phase of the complex-valued Fourier coefficient. Most of the phase results are now moved to Figure 3—figure supplement 1, and we have clearly defined the method for phase calculation. Since we analyzed the steady-state response, we did not calculate, e.g., the instantaneous phase.

In the Results, it is mentioned:

“The Fourier transform decomposes an arbitrary signal into sinusoids and each complex-valued Fourier coefficient captures the magnitude and phase of a sinusoid. […] Nevertheless, the response phase difference between the σ1- and σ2-amplified conditions carried important information about whether the neural response was synchronous to the word onsets or amplified syllables…”

In Materials and methods, it is mentioned:

“Then, the response averaged over trials was transformed into the frequency-domain using Discrete Fourier transform (DFT) without any additional smoothing window. […] The 2-Hz phase difference between the σ1- and σ2-amplified conditions was averaged over participants in each electrode.”
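Concretely, the phase used throughout is the angle of the complex DFT coefficient at the frequency of interest, and between-condition differences are wrapped circularly. A minimal sketch (the single-electrode responses and their 2-Hz phases below are synthetic):

```python
import numpy as np

fs = 128
t = np.arange(0, 2, 1 / fs)                     # 2 s of steady-state response

def phase_at(resp, freq, fs=fs):
    """Phase (radians) of the complex DFT coefficient at `freq` Hz."""
    spec = np.fft.rfft(resp)
    k = int(round(freq * len(resp) / fs))       # DFT bin index of `freq`
    return np.angle(spec[k])

# Synthetic responses in the two conditions, with arbitrary 2-Hz phases:
r_sigma1 = np.cos(2 * np.pi * 2 * t + 0.3)
r_sigma2 = np.cos(2 * np.pi * 2 * t + 1.1)

# Wrap the difference into (-pi, pi] via complex exponentials, the usual
# circular convention (also how differences are averaged over participants).
dphi = np.angle(np.exp(1j * (phase_at(r_sigma1, 2) - phase_at(r_sigma2, 2))))
```

Testing whether `dphi` deviates from 0° across participants is then an ordinary one-sample test on the wrapped differences.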

In line with this, I highlight two additional issues:

– Subsection “Response phase at 2 Hz”, Figure 4: the authors argue that attention may explain the lack of phase differences. Additionally, the authors seem to assume in their working hypothesis that the phase response should be comparable across participants.

In a recent study (Kösem et al., 2016), no evidence for phase-related linguistic parsing in low-frequency activity was found when participants actively or passively listened to a stationary acoustic speech sequence whose perceptual interpretation could alternate. Are these findings compatible with those illustrated in Figure 4? Furthermore, some important inter-individual differences were reported, which could severely impact the primary assumption of phase-consistency across individuals.

The reviewer raised two very important questions here. One was about the phase consistency across participants, and the other was about how the current results were related to the results in Kösem et al., 2016. In the following, we answered the two questions separately.

Inter-participant phase consistency

In terms of the experiment design, we do not assume that the response phase is consistent across individuals. We only hypothesize that there are two potential neural response components, which are separately synchronous to the speech envelope and the word rhythm. In the previous analysis, however, we did implicitly assume a consistent response phase across individuals, since the response phase appeared to be consistent across individuals in most conditions (previous Figure 3). However, as the reviewer pointed out, this assumption was not necessary to test our hypothesis. Therefore, we added a new analysis that did not assume inter-participant phase consistency. This analysis calculated the phase difference between the σ1- and σ2-amplified conditions for individuals and tested whether the phase difference significantly deviated from 0°. This analysis was illustrated in Figure 3—figure supplement 1E.

Relation to the results in Kösem et al., 2016

We actually have our own pilot data acquired using a paradigm very similar to that adopted in Kösem et al., 2016. In the pilot study, a single word was repeated in the stimuli, and we did not observe a word-related response either. It seems that a robust word-synchronous response is only observable when the sequence presents different words instead of repeating the same word. There are several potential reasons for this difference. First, repeating the same word generates an acoustic rhythm at the same rate as the word rhythm; neural tracking of this acoustic rhythm may potentially interact with the word response. Second, repetition of the same word may lead to neural adaptation and phenomena such as semantic satiation, which attenuate the neural response to words.

We have now added a discussion about the issue in Discussion.

“Furthermore, a recent study shows that low-frequency cortical activity cannot reflect the perception of an ambiguous syllable sequence, e.g., whether repetitions of a syllable are perceived as “flyflyfly” or “lifelifelife” (Kösem et al., 2016).”

“Although the current study and previous studies (Ding et al., 2018; Makov et al., 2017) observe a word-rate neural response, the study conducted by Kösem et al., 2016, does not report observable neural activity synchronous to perceived word rhythm. […] Therefore, it is possible that low-frequency word-rate neural response more strongly reflects neural processing of novel words, instead of the perception of a steady rhythm (see also Ostarek et al., 2020).”

– Have the authors considered the possibility that the choice of EEG reference may largely affect the reported mean instantaneous phase responses and if so, could they control for it?

We agree that the choice of EEG reference can influence the absolute phase. However, the hypothesis we would like to test is about the phase difference across conditions, which should not be sensitive to the choice of EEG reference.

To confirm that the phase difference between conditions is not strongly influenced by the choice of EEG reference, we have also analyzed the results using the average of the sixty-four electrodes as the reference. The phase difference results are illustrated in Author response image 3 and are consistent with the results obtained using the average of the two mastoid electrodes as the reference (Figure 3—figure supplement 1E).

Author response image 3
Topographical distribution of the 2-Hz phase difference between the σ1- and σ2-amplified conditions using the average of sixty-four electrodes as the reference.

The phase difference is calculated for each participant and each electrode, and then averaged over participants. The black dots indicate the electrodes showing a significant phase difference between the σ1- and σ2-amplified conditions (P < 0.05, bootstrap, FDR corrected).

Reviewer #2:

[…] I find the framing and discussion a bit gappy and I missed multiple lines of related research. For instance, the literature on the relationship between delta-band oscillations and prosody is not mentioned in the Introduction; likewise, the literature on temporal prediction versus linguistic prediction is not discussed well. In addition, I also think that this study is not as definitive as the authors claim, resulting from the choice of unnatural (i.e., isochronous) stimuli that, strangely enough, the authors claim to be natural.

Thank you for pointing out these issues. We have now made substantial modifications to the Introduction and Discussion sections, and have modified the conclusions based on new data and new analyses. Please see our responses in the following for details.

Below, I give some comments for the authors to further improve their interesting work.

1) False claim of studying natural speech: The authors artificially induce unnatural stimulus rhythms as a means to frequency-tag their stimuli and perform frequency-domain analysis. While this is an interesting approach, this results in stimuli that do not have ecological validity. Using unnatural stimuli with artificial periodicity to propose periodic electrophysiological processing of words is not scientifically sound. In natural speech, words do not occur periodically. Why assume then that they are processed by a periodic oscillatory mechanism? While the frequency-tagging approach inherently provides good SNR, it cannot claim ecological validity for anything above phonemes or syllables, as for these, there are no linguistic corpus studies out there that show periodicity of the constructs to start with. Specific comment:

Thank you for raising these important issues. First, we want to clarify that we did use natural speech in the “natural speech condition”. We time warp the neural response instead of the speech sound. The confusion is caused by the previous Figure 1C, which shows time-warped speech for illustrative purposes. We have now modified the figure. The updated Figure 1 only displays the stimuli and Figure 2E illustrates time-warped neural response. We have also included samples of the stimulus as Supplementary file 3.

We fully agree that words are not of equal duration in natural speech; syllables and phonemes are not of equal duration either. The motivation for including a natural speech condition is to test whether neural activity is synchronous to words when the word rhythm is not constant. Therefore, we never used the word “oscillation” in the manuscript. We used frequency-tagging as a high-SNR analysis method to extract the word-related response, but did not assume that words were encoded by periodic neural oscillations. We did refer to the word response as a “delta-band” response; however, we used the term only to distinguish it from the “theta-band” syllabic-level response. In the current manuscript, we have avoided using the term “delta-band” to denote the word response.

Subsection “Neural tracking of words in natural spoken narratives”: "natural" The authors artificially make natural speech unnatural by making it isochronous. It is bizarre to still call the resulting stimulus "natural". Isochronous word sequences are certainly not natural speech. Hence, the authors can also not claim that they demonstrate cortical tracking of natural speech. To show this, they would have to show that there is speech-brain synchronicity for the non-isochronous condition. Certainly, to do so, the authors cannot use their standard spectral-peak analysis, but would have to show coherence or phase-locking against a different baseline (e.g., scrambled, autocorrelation-preserving speech). I strongly encourage the authors to do this analysis instead of pretending that isochronous word sequences are natural. The more general problem with this is that it adds to literature claiming that oscillations track words in human speech processing, which cannot possibly be true in this simplistic version: words are simply too variable in duration and timing.

The “natural speech” used in the study was not time-warped, as explained in the response to the previous question. Nevertheless, the metrical stories had an implicit disyllabic word-level rhythm, which was unnatural. The speaker reading the stories, however, was unaware of the purpose of the study, and the listeners were not told that some stories had a metrical word rhythm. We have now conducted a behavioral experiment to test whether these stories sounded natural and whether listeners could easily hear the metrical word rhythm. The results were reported in Supplementary file 1 (see Materials and methods for details). In short, most participants did not detect any difference between the metrical and nonmetrical stories and perceived the speech materials as natural.

“Thirty-four participants (19-26 years old, mean age, 22.5 years; 17 females) took part in a behavioral test to assess the naturalness of the stimuli…”

“…The test was divided into 2 blocks. In block 1, the participants listened to a metrical story and a nonmetrical story read by a human speaker, which were presented in a pseudorandom order. The stories were randomly selected from the story set. Each story ranged from 53 to 66 seconds in duration. After listening to each story, the participants were asked to write a sentence to summarize the story and fill out a questionnaire…”

“…In block 1, the first question in the questionnaire asked whether the two types of stories, a metrical and a nonmetrical story, showed any noticeable difference regardless of their content. […] The reasons reported included (1) exaggerated intonation (N = 2); (2) the speed and intonation pattern seemed uniform (N = 2); (3) lack of emotion (N = 2); (4) the pitch went up at the end of each sentence (N = 1). In sum, most participants thought the stories were naturally read and only two participants (6%) commented on the uniformity of pace.”

2) Depth of framing and discussion: (a) Introduction is not specific enough to speech prosody versus the different types of results on linguistic information (e.g., syntactic structure / phrasing, syntactic and lexical entropy), including reference to prior work; also, relevant references that suggested delta-band oscillations to serve both bottom-up (i.e., prosody) and top-down (i.e., syntax, prediction) functionalities are not discussed in the Introduction; (b) interpretation of the results in terms of the different proposed functionalities of the delta-band lacks depth and differentiation.

Thank you for the useful suggestions, we have updated the Introduction and Discussion. Please see, e.g., the response to the following question.

Introduction: What is missing here is mentioning that delta-band oscillations have also been implied in prosody (e.g., Bourguignon et al., 2013). Relatedly, the idea that delta-band oscillations are related both to slow-fluctuating linguistic representations (e.g., phrases; Meyer and Gumbert, 2018; Meyer et al., 2016; Weissbart et al., 2019) and prosody has been put forward various times (e.g., Ghitza, 2020; Meyer et al., 2016). I think that mentioning this would help the framing of the current manuscript: (1) delta = symbolic linguistic units, (2) delta = prosody, (3) current study: Is delta reflecting symbolic processing or prosody?

The reviewer made a very good point here and we have added the following into Introduction.

“Speech comprehension, however, requires more than syllabic-level processing. […] Therefore, it remains unclear whether cortical activity can synchronize to linguistic units in natural spoken narratives, and how it is influenced by bottom-up acoustic cues and top-down linguistic knowledge”.

Furthermore, we have added the following into the Discussion.

“It remains elusive what kind of mental representations are reflected by cortical responses synchronous to linguistic units. [...] Previous psycholinguistic studies have already demonstrated that syntax has a significant impact on prosody perception (Buxó-Lugo and Watson, 2016; Garrett et al., 1966).”

“In speech processing, multiple factors contribute to the word response and these factors interact. […] This result is consistent with the idea that prosodic cues have a facilitative effect on speech comprehension (Frazier et al., 2006; Ghitza, 2017, 2020; Giraud and Poeppel, 2012).”

Subsection “Time course of ERP response to words”: As to prediction and the delta-band, please consider discussing Weissbart et al., 2019, Meyer and Gumbert, 2018, Breska and Deouell, 2017, Lakatos et al., 2013, Roehm et al., 2009; in particular the relationship between linguistic predictions and the delta-band should be discussed shortly here. In general, there is need in the field to characterize the relationship between temporal prediction, prosody, linguistic units, and the delta-band. Just noting all this would be fine here.

Thank you for the useful suggestions and we have added the following to Discussion.

“It is also possible that neural activity synchronous to linguistic units reflect more general cognitive processes that are engaged during linguistic processing. […] In addition, low-frequency neural activity has also been suggested to reflect the perception of high-level rhythms (Nozaradan et al., 2011) and general sequence chunking (Jin et al., 2020).”

"semantic processing" Why semantics? See my above comment on the relationship between the delta-band and temporal predictions (e.g., Roehm et al., 2009; Stefanics et al., 2010). In addition, please consider the recent reconceptualisation of the N400-P600 complex in terms of predictive coding (e.g., Kuperberg et al., 2019).

We have replaced the sentence about “semantic processing” with the following sentence:

“This component is consistent with the latency of the N400 response, which can be observed when listening to either individual words or continuous speech (Broderick et al., 2018; Kutas and Federmeier, 2011; Kutas and Hillyard, 1980; Pylkkänen and Marantz, 2003; Pylkkänen et al., 2002).”

We have also added the following into the Discussion.

“The 400-ms latency response observed in the current study is consistent with the hypothesis that the N400 is related to lexical processing (Friederici, 2002; Kutas and Federmeier, 2011). Besides, it is also possible that the second syllable in a disyllabic word elicits a weaker N400 since it is more predictable than the first syllable (Kuperberg et al., 2020; Lau et al., 2008).”

3) EEG preprocessing / report thereof does not accord with standards in the field. Apparently, the EEG data have not been cleaned of artifacts except for calculating the residual between the EEG and the EOG. Hence, I cannot assume that the data are clean enough to allow for statistical analysis. Specific point:

Materials and methods: The EEG-cleaning section is very sparse. With standard EEG and standard cleaning, rejection rates for adult EEG are on the order of 25-35%. I am highly doubtful that these EEG data have been cleaned in an appropriate way. There is no mention of artifact removal, the amount of removed data segments, or the time- and frequency-domain aspects that were considered for artifact definition. What toolboxes were used, what algorithms?

Thank you for pointing out this issue. We have now mentioned the software and added more details about the pre-processing procedures. We only used basic Matlab functions without any particular toolbox.

“All preprocessing and analysis in this study were performed using Matlab (The MathWorks, Natick, MA). […] Occasional large artifacts in EEG/EOG, i.e., samples with magnitude > 1 mV, were removed from the analysis (Jin et al., 2018).”

We have now removed the trials with obvious muscle artifacts after the preprocessing.

“To further remove potential artifacts, the EEG responses were divided into 7-s trials (a total of 148 trials for Experiments 1-3, and 104 trials for Experiment 4), and we visually inspected all the trials and removed trials with identifiable artifacts. On average, 8.45% ± 3.20% trials were rejected in Experiment 1, 15.20% ± 3.97% trials were rejected in Experiment 2, 10.35% ± 1.53% trials were rejected in Experiment 3, and 12.9% ± 4.46% trials were rejected in Experiment 4.”
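For concreteness, the trial segmentation and amplitude-threshold rejection described above can be sketched as follows. This is a hedged Python/NumPy illustration, not the authors' Matlab code; the function name, argument names, and the decision to drop whole trials rather than individual samples are assumptions for the sketch.

```python
import numpy as np

def reject_artifact_trials(eeg, fs, trial_len_s=7.0, threshold_v=1e-3):
    """Split a continuous recording into fixed-length trials and drop
    every trial containing any sample exceeding the amplitude threshold.

    eeg: (n_channels, n_samples) array in volts
    fs: sampling rate in Hz
    Returns retained trials, shape (n_trials_kept, n_channels, samples_per_trial).
    """
    samples = int(round(trial_len_s * fs))
    n_trials = eeg.shape[1] // samples
    # Trim the tail that does not fill a whole trial, then reshape
    trials = eeg[:, :n_trials * samples].reshape(eeg.shape[0], n_trials, samples)
    trials = np.transpose(trials, (1, 0, 2))  # (trial, channel, sample)
    # Keep trials whose peak absolute amplitude stays below threshold
    keep = np.max(np.abs(trials), axis=(1, 2)) <= threshold_v
    return trials[keep]
```

The visual inspection step the authors describe cannot, of course, be automated this way; the threshold here only mirrors the "> 1 mV" criterion quoted above.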

4) Terminology: Like many people in the field, the authors speak of “tracking” of linguistic information. To be clear: While it is fine that the field understands the problems of the term “entrainment”, now opting for “tracking”, it does not make much sense to use the combination of “tracking” and “linguistic units”. This is because linguistic units are not in the stimulus; instead they are only present in the brain as such. They cannot be tracked; instead, they need to be “associated”, “inferred”, or even “generated”. If one were to assume that the brain “tracks” linguistic units, this would mean that the brain tracks itself. Specific comments:

"cortical activity tracking of higher-level linguistic units" Linguistic units, such as "words and phrases" cannot be tracked, as they are not present in the stimulus. They are purely symbolic and they must be inferred. They cannot be sensed. Instead, they are present only in the mind and electrophysiology of the listener. Linguistic units are activated in memory or generated in some form. They are internal representations or states. If one were to say that the brain tracks these in some way, one would say that the brain tracks its own internal representations / states. Try and rephrase, please.

Thank you for pointing out this important terminology issue. We certainly agree that the word-related responses are internally constructed. The word “tracking” was used loosely in the previous manuscript. Now, we avoid calling the response a “word-tracking response”, and refer to it as a word response or a neural response synchronous to the word rhythm.

"linguistic information" cannot be extracted from acoustic features. Instead, linguistic information (e.g., distinctive features in phonology, semantic meaning, part-of-speech labels) must be associated with a given acoustic stimulus by an associative / inferential cognitive process. When we hear some acoustic stimulus, we associate it with something linguistic. Just as in my above comment on the loose use of the term “tracking”, I suggest the authors change this wording as well. A suggestion for a better phrasing would be "Linguistic information is retrieved / constructed / generated when specific acoustic information is encountered / recognized…". Or "Speech comprehension involves the active inference of linguistic structure from speech acoustics".

We have removed the description, and we no longer use the word “extract” in similar situations.

Discussion: "tracks" see my above comment.

Please see our reply to the previous question.

5) Potential differences between the σ1 and σ2 conditions: There are apparent differences in the number of sensors that show sensitivity to the amplitude modulation. This is not captured in the current statistical analysis. Still, given these drastic differences, the claim that the acoustics of speech prosody do not affect delta-band amplitude is certainly much too strong and may just result from a lack of analysis (i.e., overall phase / ERP rather than sensor-by-sensor analysis). See specific comment next:

We have updated our conclusions based on a new analysis. Please see our reply to the editorial comments.

"or the phase of the 2-Hz amplitude modulation" I disagree, with excitement, and have a suggestion that the authors must consider. Look, the plots (Figure 3, σ1/σ2) may not show differences in the phase angle. Yet, what they may point to is a dimension of the data that the authors have not assessed yet: topographical spread. From simply eyeballing, it is clear that the number of significant electrodes differs drastically between the σ1 and σ2 conditions. This pattern is interesting and should be examined more closely. Is word-initial stress more natural in Chinese? If so, this could be interpreted in terms of some influence of native prosodic structure: amplitude modulation does appear to affect the amount / breadth of EEG signal (i.e., number of sensors), but only if it conforms to the native stress pattern? Not only words / phrases are part of a speaker's linguistic knowledge, but also the stress pattern of their language. I urge the authors to look into this more closely.

The reviewer raised a very important point about the stress pattern in Chinese. According to the classic study by Yuen Ren Chao (Mandarin Primer, Harvard University Press, 1948), the stress of disyllabic Chinese words usually falls on the second syllable. Later studies more or less confirm that slightly more than 50% of the disyllabic words tend to be stressed on the second syllable. However, it is well established that, in Chinese, stress is highly dependent on the context and does not affect word recognition. For example, Shen, 1993, noted that “However, unlike English where lexical stress is fixed and can be predicted by the phonology, lexical stress in Mandarin varies socio-linguistically and idiosyncratically. Some disyllabic words can be uttered iambically or trochaically”.

Since linguistic analysis does not generate a strong prediction about whether σ1-amplified or σ2-amplified speech should sound more natural, we have now asked a group of participants (N = 34) to rate the naturalness of the speech designed for these two conditions (see Materials and methods for details). The results showed that half of the participants (N = 17) reported that speech in the two conditions was equally natural. Of the other half, most thought σ1-amplified speech sounded more natural (N = 15) while the rest thought σ2-amplified speech was more natural (N = 2).

In sum, while previous linguistic studies tend to suggest that σ2-amplified speech is slightly more natural, our behavioral assessment suggests that σ1-amplified speech is slightly more natural. Therefore, it is difficult to draw a solid conclusion about which condition is more consistent with natural speech.

Additionally, the response power does not differ between the σ1-amplified and σ2-amplified conditions, and only the inter-participant phase coherence differs between conditions, as mentioned by the reviewer. This phenomenon, however, can potentially be explained without making any assumption about the interaction between the AM and word responses: If the AM and word responses are independent, the measured neural response is the sum of the two responses and its phase is influenced by both components. If the phase of the AM response is more consistent with the phase of the word response in the σ1-amplified condition, the inter-participant phase coherence will be higher in the σ1-amplified condition on the assumption that the strengths of the AM and word responses vary across participants (illustrated in Author response image 4).

Author response image 4
Illustration of the response phase for individuals.

The red and blue arrows indicate the phases of the word response and the AM response, which are assumed to be consistent across individuals. The AM response is 180° out of phase between the σ1- and σ2-amplified conditions, while the word response phase is the same in both conditions. The measured response is the vector sum of the AM and word responses. The purple arrows indicate the phase of the measured response for individual participants. If the phase of the AM response is more consistent with the phase of the word response in the σ1-amplified condition, the inter-participant phase coherence is higher for the σ1-amplified condition than for the σ2-amplified condition.
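The vector-sum argument in this illustration can be reproduced with a small numerical simulation. This is a hedged sketch: the participant count, component phases, and amplitude ranges are arbitrary assumptions, not values fitted to the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32  # hypothetical number of participants

# Assumed fixed phases: word response at 0 rad; AM response at +30 deg
# in the sigma1 condition, flipped by 180 deg in the sigma2 condition.
word_phase = 0.0
am_phase = np.pi / 6

# Response strengths vary across participants (the key assumption above).
w = rng.uniform(0.5, 1.5, n)  # word-response amplitude per participant
a = rng.uniform(0.5, 1.5, n)  # AM-response amplitude per participant

def coherence(measured):
    """Inter-participant phase coherence: length of the mean unit phasor."""
    return np.abs(np.mean(measured / np.abs(measured)))

# Measured response = vector sum of the word and AM components
sigma1 = w * np.exp(1j * word_phase) + a * np.exp(1j * am_phase)
sigma2 = w * np.exp(1j * word_phase) + a * np.exp(1j * (am_phase + np.pi))

c1, c2 = coherence(sigma1), coherence(sigma2)
print(c1, c2)  # coherence is higher when the two component phases align
```

When the AM phase is close to the word phase (σ1), the summed phasors point in similar directions for all participants, so coherence is high; when the AM phase opposes it (σ2), the measured phase depends on each participant's amplitude ratio and coherence drops, exactly as the illustration argues.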

Therefore, in the current study, we integrate the σ1- and σ2-amplified conditions to separate the AM and word responses, and future studies are needed to establish whether the neural response is modulated by the naturalness of speech prosodic cues. We have now added discussions about this issue.

“Finally, it should be mentioned that we employed amplitude modulation to manipulate the speech envelope, given that the speech envelope is one of the strongest cues to drive stimulus-synchronous cortical response. […] Future studies are needed to characterize the modulation of language processing by different prosodic cues and investigate the modulatory effect across different languages.”

"The response" see my above comment on the differences in the number of significant sensors in the σ1 and σ2 conditions.

The sentence has been replaced with the following:

“Consistent with previous findings, in the current study, the 4-Hz syllable response was also enhanced by cross-modal attention (Figure 3B). The 2-Hz AM response power, however, was not significantly modulated by cross-modal attention (Figure 4D, and Figure 5B), suggesting that attention did not uniformly enhance the processing of all features within the same speech stream…”
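The response-power measures discussed in this reply (e.g., 2-Hz AM power, 4-Hz syllable power) amount to reading out spectral power at a target frequency from the trial-averaged response. The sketch below illustrates that readout; it is an illustrative assumption about the computation, not the authors' exact analysis pipeline, and the function name is invented here.

```python
import numpy as np

def power_at_freq(trials, fs, freq):
    """Average trials in the time domain, then read out spectral power
    at a target frequency from the DFT of the average.

    trials: (n_trials, n_samples) array; fs: sampling rate in Hz.
    Assumes `freq` falls exactly on a DFT bin, i.e., the trial duration
    is an integer number of cycles of `freq` (true for 2 Hz and 4 Hz
    with 7-s trials).
    """
    avg = trials.mean(axis=0)
    spec = np.fft.rfft(avg) / len(avg)  # amplitude-normalized spectrum
    k = int(round(freq * len(avg) / fs))  # DFT bin index of target frequency
    return np.abs(spec[k]) ** 2
```

With this normalization, a unit-amplitude sinusoid at the target frequency yields a power of 0.25 (amplitude 0.5 at the positive-frequency bin, squared).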

Reviewer #3:

[…] I enjoyed reading this manuscript, which tackles an interesting and topical question. And I would like to commend the authors on a very nice set of experiments – it's a lovely design.

However, I also had a number of fairly substantial concerns – primarily about the conclusions being drawn from the data.

1) My main concern with the manuscript is that it seems – a priori – very committed to discussing the results in terms of tracking by activity in a specific (i.e., delta) cortical band. I understand the reason to want to do this – there is a lot of literature on speech and language processing that argues for the central role of cortical oscillations in tracking units of speech/language with different timescales. However, in the present work, I think this framework is causing confusion. To argue that delta tracking is a unitary thing that might wax and wane like some kind of push/pull mechanism to track acoustics sometimes and linguistic features other times seems unlikely to be true and likely to cause confusion in the interpretation of a nice set of results. My personal bias here – and I think it could be added for balance to the manuscript – is that speech processing is carried out by a hierarchically organized, interconnected network with earlier stages/lower levels operating on acoustic features and later stages/higher levels operating on linguistic features (with lots of interaction between the stages/levels). In that framing, the data here would be interpreted as containing separable simultaneous evoked contributions from processing at different hierarchical levels that are time locked to the relevant features in the stimulus. Of course, because of the nature of the experiments here, these contributions are all coming at 2 Hz. But to then jump to saying a single "delta oscillation" is preferentially tracking different features of the speech given different tasks, seems to me to be very unlikely. Indeed, the authors seem sensitive to the idea of different evoked contributions as they include some discussion of the N400 later in their manuscript. But then they still hew to the idea that what they are seeing is a shift in what "delta-band" activity is doing. I am afraid I don't buy it. 
I think if you want to make this case – it would only be fair to discuss the alternatives and more clearly argue why you think this is the best way to interpret the data.

The reviewer raised a very important point about how to interpret the results. We actually agree with the reviewer’s interpretation. It might be our writing that caused confusion. We do not think there is a unitary delta oscillation, and the term “oscillation” is never used in the manuscript. In the current manuscript, we no longer refer to the results as “delta-band responses”, and we have revised the Introduction and Discussion sections to make it clear that responses from multiple processing stages can contribute to the measured EEG response.

Furthermore, a new analysis now separates the measured response into a word response and an AM response, which is also based on the assumption that multiple response components coexist in the measured response. Please see our reply to the editorial comments for details.

2) Following my previous point, I have to admit that I am struggling to figure out how, precisely, you are determining that "cortical activity primarily tracks the word rhythm during speech comprehension". There is no question that the cortical activity is tracking the word rhythm. But how are you arguing that it is primarily word rhythm? I am guessing it has something to do with Figure 3? But I am afraid I am not convinced of some of the claims you are making in Figure 3. You seem to want to embrace the idea that there is no difference in phase between the σ1-amplified and σ2-amplified conditions and that, therefore, the signal is primarily tracking the word rhythm. But I think there is an important flaw in this analysis – that stems from my concern in point number 1 above. In particular, I think it is pretty clear that there are differences in phase at some more frontal channel locations, and no difference in more posterior channels (Figure 3A, rightmost panel). So, going back to my issue in point number 1, I think it is very likely that word rhythm is being tracked posteriorly (maybe driven by N400-like evoked activity) – and that phase is similar (obviously). But it also seems likely that the 2 Hz rhythm over frontal channels (which likely reflects evoked activity based on acoustic features) is at a different phase. Now, cleanly disentangling these things is difficult because they are both at 2 Hz and will contaminate each other. And I think when subjects are watching the silent video, the posterior (N400-like) contribution disappears and all you have left is the acoustic one. So, again, I think two separable things are going on here – not just one delta rhythm that selectively tracks different features.

Thank you for the insightful comments and please see our reply to the editorial comments.

3) I'm afraid I must take issue with the statements that "Nevertheless, to dissociate linguistic units with the related acoustic cues, the studies showing linguistic tracking responses mostly employ well-controlled synthesized speech that is presented as an isochronous sequence of syllables… Therefore, it remains unclear whether neural activity can track linguistic units in natural speech, which is semantically coherent but not periodic, containing both acoustic and linguistic information in the delta-band." I know the (very nice) studies you are talking about. But you have just also cited several studies that try to disentangle linguistic and acoustic features using more natural speech (e.g., Brodbeck; Broderick). So I think your statement is just too strong.

Thank you for pointing out this issue. We have now revised the Introduction to make it more precise.

“Previous studies suggest that low-frequency cortical activity can also reflect neural processing of higher-level linguistic units, e.g., words and phrases (Buiatti et al., 2009; Ding et al., 2016a; Keitel et al., 2018), and the prosodic cues related to these linguistic units, e.g., delta-band speech envelope and pitch contour (Bourguignon et al., 2013; Li and Yang, 2009; Steinhauer et al., 1999). […] It remains to be investigated, however, how bottom-up prosodic cues and top-down linguistic knowledge separately contribute to the generation of these word-related responses”.

4) I was confused with the presentation of the data in Figure 2F and G. Why are some bars plotted up and others down? They are all positive measures of power, and they would be much easier to compare if they were all pointed in the same direction. Unless I am missing the value of plotting them this way?

In the previous manuscript, we thought it was easier to make comparisons if the bars were presented as mirror images. However, it was indeed confusing to have downward bars. We have now made all bars point in the same direction (Figure 3A, B).

https://doi.org/10.7554/eLife.60433.sa2

Article and author information

Author details

  1. Cheng Luo

    Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft
    Competing interests
    No competing interests declared
  2. Nai Ding

    1. Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China
    2. Research Center for Advanced Artificial Intelligence Theory, Zhejiang Lab, Hangzhou, China
    Contribution
    Conceptualization, Formal analysis, Supervision, Validation, Visualization, Methodology, Project administration, Writing - review and editing
    For correspondence
    ding_nai@zju.edu.cn
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-3428-2723

Funding

National Natural Science Foundation of China (31771248)

  • Nai Ding

Major Scientific Research Project of Zhejiang Lab (2019KB0AC02)

  • Nai Ding

Zhejiang Provincial Natural Science Foundation (LGF19H090020)

  • Cheng Luo

Fundamental Research Funds for the Central Universities (2020FZZX001-05)

  • Nai Ding

National Key R&D Program Of China (2019YFC0118200)

  • Nai Ding

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Dr. Virginie van Wassenhove and three anonymous reviewers for their constructive comments. We thank Dr. Xunyi Pan, Dr. Lang Qin, Jiajie Zou, and Yuhan Lu for thoughtful comments on previous versions of the manuscript. The research was supported by National Natural Science Foundation of China 31771248 (ND), Major Scientific Research Project of Zhejiang Lab 2019KB0AC02 (ND), National Key R and D Program of China 2019YFC0118200 (ND), Zhejiang Provincial Natural Science Foundation of China LGF19H090020 (CL), and Fundamental Research Funds for the Central Universities 2020FZZX001-05 (ND).

Ethics

Human subjects: The experimental procedures were approved by the Research Ethics Committee of the College of Medicine, Zhejiang University (2019-047). All participants provided written informed consent prior to the experiment and were paid.

Senior Editor

  1. Andrew J King, University of Oxford, United Kingdom

Reviewing Editor

  1. Virginie van Wassenhove, CEA, DRF/I2BM, NeuroSpin; INSERM, U992, Cognitive Neuroimaging Unit, France

Publication history

  1. Received: June 26, 2020
  2. Accepted: December 20, 2020
  3. Accepted Manuscript published: December 21, 2020 (version 1)
  4. Version of Record published: December 31, 2020 (version 2)

Copyright

© 2020, Luo and Ding

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


