Neural tracking of phrases in spoken language comprehension is automatic and task-dependent

  1. Sanne ten Oever
  2. Sara Carta
  3. Greta Kaufeld
  4. Andrea E Martin  Is a corresponding author
  1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Netherlands
  2. Language and Computation in Neural Systems group, Donders Centre for Cognitive Neuroimaging, Netherlands
  3. Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Netherlands
  4. ADAPT Centre, School of Computer Science and Statistics, University of Dublin, Trinity College, Ireland
  5. CIMeC - Center for Mind/Brain Sciences, University of Trento, Italy

Abstract

Linguistic phrases are tracked in sentences even though there is no one-to-one acoustic phrase marker in the physical signal. This phenomenon suggests an automatic tracking of abstract linguistic structure that is endogenously generated by the brain. However, all studies investigating linguistic tracking compare conditions where relevant information at linguistic timescales is either available or absent altogether (e.g., sentences versus word lists during passive listening). It is therefore unclear whether tracking at phrasal timescales is related to the content of language, or rather arises as a consequence of attending to the timescales that happen to match behaviourally relevant information. To investigate this question, we presented participants with sentences and word lists while recording their brain activity with magnetoencephalography (MEG). Participants performed passive, syllable, word, and word-combination tasks, corresponding to attending to four different rates: whatever rate they would naturally attend to, the syllable rate, the word rate, and the phrasal rate, respectively. We replicated overall findings of stronger phrasal-rate tracking, measured with mutual information, for sentences compared to word lists across the classical language network. However, in the inferior frontal gyrus (IFG) we found a task effect suggesting stronger phrasal-rate tracking during the word-combination task independent of the presence of linguistic structure, as well as stronger delta-band connectivity during this task. These results suggest that extracting linguistic information at phrasal rates occurs automatically, with or without an additional task, but also that the IFG might be important for temporal integration across various perceptual domains.

Editor's evaluation

This MEG study elegantly assesses human brain responses to spoken language at the syllable, word, and sentence level. Although prior studies have shown significant cortical tracking of the speech signal, the current work uses clever task manipulation to direct attention to different timescales of speech, thus demonstrating that automatic and task-dependent tracking mechanisms operate in tandem during spoken language comprehension.

https://doi.org/10.7554/eLife.77468.sa0

Introduction

Understanding spoken language likely requires a multitude of processes (Friederici, 2011; Martin, 2020; Halle and Stevens, 1962). Although not always an exclusively bottom-up affair, acoustic patterns must be segmented and mapped onto internally stored phonetic and syllabic representations (Halle and Stevens, 1962; Marslen-Wilson and Welsh, 1978; Martin, 2016). Phonemes must be combined and mapped onto words, which in turn form abstract linguistic structures such as phrases (e.g., Martin, 2020; Pinker and Jackendoff, 2005). In proficient speakers of a language, this process seems to happen so naturally that one might almost forget the complex parallel and hierarchical processing that occurs during natural speech and language comprehension.

It has been shown that it is essential to track the temporal dynamics of the speech signal in order to understand its meaning (e.g., Giraud and Poeppel, 2012; Peelle and Davis, 2012). In natural speech, syllables follow each other in the theta range (3–8 Hz; Rosen, 1992; Ding et al., 2017; Pellegrino et al., 2011), while higher-level linguistic features such as words and phrases tend to occur at lower rates (0.5–3 Hz; Rosen, 1992; Kaufeld et al., 2020; Keitel et al., 2018). Tracking of syllabic features is stronger when one understands a language (Luo and Poeppel, 2007; Zoefel et al., 2018a; Doelling et al., 2014) and tracking of phrasal rates is more prominent when the signal contains phrasal information (Kaufeld et al., 2020; Keitel et al., 2018; Ding et al., 2016; e.g., word lists versus sentences). Importantly, phrasal tracking even occurs when there are no distinct acoustic modulations at the phrasal rate (Kaufeld et al., 2020; Keitel et al., 2018; Ding et al., 2016). These results seem to suggest that tracking of relevant temporal timescales is critical for speech understanding.

An observation one could make regarding these findings is that tracking occurs only at the rates that are meaningful and thereby behaviourally relevant (Kaufeld et al., 2020; Ding et al., 2016). For example, in word lists the word rate is the slowest meaningful rate during natural listening. Modulations at slower phrasal rates might not be tracked, as they do not contain behaviourally relevant information. In contrast, in sentences phrasal rates do contain linguistic information, and therefore these slower rates are also tracked. Thus, when listening to speech one automatically tries to extract the meaning, which requires extracting information at the highest linguistic level (Halle and Stevens, 1962; Martin, 2016). However, it remains unclear whether tracking at these slower rates is a unique feature of language processing, or rather depends on attention to relevant temporal timescales.

As understanding language requires a multitude of processes, it is difficult to determine what participants are actually doing when listening to natural speech. Moreover, designing an experimental task that does justice to this multitude of processes is difficult. This is probably why tasks in language studies vary so widely. Tasks include passive listening (e.g., Kaufeld et al., 2020), answering comprehension questions (e.g., Keitel et al., 2018), rating intelligibility (e.g., Luo and Poeppel, 2007; Doelling et al., 2014), working memory tasks (e.g., Kayser et al., 2015), or even syllable counting (e.g., Ding et al., 2016). It is unclear whether outcomes depend on the specifics of the task. So far, no study has investigated whether task instructions that focus on extracting information at different temporal rates influence the tracking at those timescales. It is therefore not clear whether tracking at phrasal timescales is unique to language stimuli containing phrasal structure, or could also occur for other acoustic materials when participants are instructed to pay attention to information at these temporal rates.

To answer this question, we designed an experiment in which participants were instructed to pay attention to different temporal modulation rates while listening to the same stimuli. We presented participants with naturally spoken sentences and word lists and asked them to either passively listen, or perform a task on the temporal scales corresponding to syllables, words, or phrases. We recorded brain activity using magnetoencephalography (MEG) while participants performed these tasks and investigated tracking as well as power and connectivity at three nodes that are part of the language network: the superior temporal gyrus (STG), the middle temporal gyrus (MTG), and the inferior frontal gyrus (IFG). We hypothesized that if tracking is purely based on behavioural relevance, it should mostly depend on the task instructions, rather than the nature of the stimuli. In contrast, if there is something automatic and specific about language information, tracking should depend on the level of linguistic information available to the brain.

Results

Behaviour

Overall task performance was above chance and participants complied with task instructions (Figure 1; see Figure 1—figure supplement 1 for individual data). We found a significant interaction between condition and task (F(2,72.0) = 11.51, p < 0.001) as well as main effects of task (F(2,19.7) = 44.19, p < 0.001) and condition (F(2,72.0) = 29.0, p < 0.001). Only for the word-combination (phrasal-level) task did the sentence condition show significantly higher accuracy than the word list condition (t(54.0) = 6.97, p < 0.001). For the other two tasks, no significant condition effect was found (syllable: t(54.0) = 0.62, p = 1.000; word: t(54.0) = 1.74, p = 0.176). Follow-up of the main effect of task indicated differences between all tasks (phrase–syllable: t(18.0) = 3.71, p = 0.003; phrase–word: t(22.4) = −6.34, p < 0.001; syllable–word: t(19.2) = −8.67, p < 0.001).

Figure 1 with 1 supplement
Behavioural results.

Accuracy for the three different tasks. Double asterisks indicate significance at the 0.01 level using a paired samples t-test (n=19). Box edges indicate the standard error of the mean.

Mutual information

The overall time–frequency response in the three different regions of interest (ROIs), using the top 20 PCA components, was as expected, with an initial evoked response followed by a more sustained response to the ongoing speech (Figure 2). From these ROIs, we extracted mutual information (MI) in three different frequency bands (phrasal, word, and syllable). Here, we focus on the phrasal band, as this is the band that differentiates word lists from sentences and showed the strongest modulation for this contrast in our previous study (Kaufeld et al., 2020). MI results for all other bands are reported in the supplementary materials.

Figure 2

Anatomical regions of interest (ROIs).

(A) ROIs displayed on one exemplar participant surface. (B) Time–frequency response at each ROI. STG = superior temporal gyrus, MTG = middle temporal gyrus, IFG = inferior frontal gyrus.

For the phrasal timescale in STG, we found significantly higher MI in the sentence compared to the word list condition (F(3,126) = 67.39, p < 0.001; Figure 3; see Figure 3—figure supplement 1 for individual data). No other effects were significant (p > 0.1). This finding parallels the effect found in Kaufeld et al., 2020. For the MTG, we saw a different picture: besides the main effect of condition (F(3,126) = 50.24, p < 0.001), an interaction between task and condition was found (F(3,126) = 2.948, p = 0.035). We next investigated the effect of condition per task and found a significant effect of condition, with stronger MI for the sentence condition, for all tasks except the passive task (passive: t(126) = 1.07, p = 0.865; syllable: t(126) = 4.06, p = 0.003; word: t(126) = 5.033, p < 0.001; phrase: t(126) = 4.015, p = 0.003). For the IFG, we found a main effect of condition (F(3,108) = 21.89, p < 0.001) as well as a main effect of task (F(3,108) = 2.74, p = 0.047). The interaction was not significant (F(3,108) = 1.49, p = 0.220). Comparing the phrasal task with the other tasks indicated higher MI for the phrasal compared to the word task (t(111) = 2.50, p = 0.028). We also found trends for the comparison between the phrasal and syllable tasks (t(111) = 2.17, p = 0.064) and between the phrasal and passive tasks (t(111) = 2.25, p = 0.052).

Figure 3 with 2 supplements
Mutual information (MI) analysis at the phrasal band (0.8–1.1 Hz) for the three different regions of interest (ROIs).

Single and double asterisks indicate significance at the 0.05 and 0.01 level using a paired samples t-test (n=19). T indicates trend level significance (p < 0.1). Inset at the top left of the graph indicates whether a main effect of condition was present (with higher MI for sentences versus word lists; this inset does not reflect real data). Averages of conditions are only shown if there was a main task effect without an interaction. Box edges indicate the standard error of the mean.

For the word and syllable frequency bands no interactions were found (all p > 0.1; Figure 3—figure supplement 2). For all six models, there was a significant effect of condition, with stronger MI for word lists compared to sentences (all p < 0.001). The main effect of task was not significant in any of the models (p > 0.1; for the MTG syllable level there was a trend: F(3,126) = 2.40, p = 0.071).

A power control analysis showed no evidence that significant power differences (mostly main effects of condition; see the next section for power in generic bands) influenced our tracking results for any of the bands investigated.

Power

We repeated the linear mixed modelling using power instead of MI to investigate whether power changes paralleled the MI effects. For the delta band in the STG, we found a main effect of condition (F(1,18) = 6.11, p = 0.024; Figure 4; see Figure 4—figure supplement 1 for individual data) and task (F(3,108) = 3.069, p = 0.031), and a trend for the interaction (F(3,108) = 2.620, p = 0.054). Overall, sentences had stronger delta power than word lists. We found lower power for the phrase compared to the passive task (t(111) = 2.31, p = 0.045) and lower power for the phrase compared to the syllable task (t(111) = 2.43, p = 0.034). There was no significant difference between the phrase and word tasks (t(111) = 0.642, p = 1.00).

Figure 4 with 2 supplements
Power effects for the different regions of interest (ROIs).

Single and double asterisks indicate significance at the 0.05 and 0.01 level using a paired samples t-test (n=19). T indicates trend significance (p < 0.1). Inset at the top right of the graph indicates whether a main effect of condition was present (with higher activity for sentences versus word lists; this inset does not reflect real data). Averages of conditions are only shown if there was a main task effect. Box edges indicate the standard error of the mean.

The MTG delta power effects overall paralleled the STG effects, with a significant condition (F(1,124.94) = 12.339, p < 0.001) and task effect (F(3,124.94) = 4.326, p = 0.006). The interaction showed a trend (F(3,124.94) = 2.58, p = 0.056). Pairwise comparisons of the task effect showed significantly stronger power for the phrase compared to the passive task (t(128) = 2.98, p = 0.007) and lower power for the phrase compared to the syllable task (t(128) = 3.10, p = 0.024). The phrase–word comparison was not significant (t(128) = 2.577, p = 0.109). Finally, for the IFG we only found a trend for condition (F(1,123.27) = 4.15, p = 0.057), with stronger delta power in the sentence condition.

The results for all other bands can be found in the supplementary materials (Figure 4—figure supplement 2). In summary, no interaction effects were found for any of the models (all p > 0.1). In all bands, power was generally higher for sentences than for word lists. Any task effect generally showed stronger power for the lower hierarchical level (e.g., higher power for the passive versus the word-combination task).

Connectivity

Overall connectivity patterns showed the strongest connectivity in the delta and alpha frequency bands (Figure 5). In the delta band, we found a main effect of task for the STG–IFG connectivity (F(3,122.06) = 4.1078, p = 0.008; Figure 6; see Figure 6—figure supplement 1 for individual data). Follow-up analysis showed a significant difference between the phrasal and passive tasks, with higher connectivity in the phrasal task (t(125) = 3.254, p = 0.003). The other comparisons with the phrasal task were not significant. The effect of task remained significant even when correcting for power differences between the passive and phrasal tasks (F(1,53.02) = 12.39, p < 0.001; note the change in degrees of freedom, as only the passive and phrasal tasks were included in this mixed model because the power correction is done pairwise). Initially, we also found main effects of condition in the delta and beta bands for the MTG–IFG connectivity (stronger connectivity for the sentence compared to the word list condition); however, after controlling for power these effects did not remain significant (Figure 6—figure supplement 2).

Figure 5

Connectivity pattern between anatomical regions of interest (ROIs).

(A) ROI connections displayed on one exemplar participant surface. (B) Time–frequency weighted phase lag index (WPLI) response at each ROI.

Figure 6 with 2 supplements
Weighted phase lag index (WPLI) effects for the different regions of interest (ROIs).

Double asterisks indicate significance at the 0.01 level using a paired samples t-test (n=19) after correcting for power differences between the two conditions (we plot the original data, not corrected for power, as the power correction can only be performed pairwise and the data would consequently differ for each control). Averages of conditions are only shown if there was a main task effect. Box edges indicate the standard error of the mean.

MEG–behavioural performance relation

For the MI analysis, we found a significant effect of accuracy only in the MTG. Here, we found a three-way accuracy × task × condition interaction (F(2,91.9) = 3.459, p = 0.036). Splitting the analysis by task, we found a significant (uncorrected) condition × accuracy interaction only for the phrasal task (F(1,24.8) = 5.296, p = 0.03), and not for the other two tasks (p > 0.1). In the phrasal task, when accuracy was high there was a stronger difference between the sentence and word list conditions than when accuracy was low, with higher MI for the sentence condition (Figure 7A).

Figure 7 with 1 supplement
MEG–behavioural performance relation.

(A) Predicted values for the phrasal band MI in the middle temporal gyrus (MTG) for the word-combination task separately for the two conditions. (B) Predicted values for the delta-band weighted phase lag index (WPLI) in the superior temporal gyrus (STG)–MTG connection separately for the two conditions. Error bars indicate the 95% confidence interval of the fit. Coloured lines at the bottom indicate individual datapoints.

No relation between accuracy and power was found. For the connectivity analysis, we found a significant condition × accuracy interaction for the STG–MTG connection (F(1,80.23) = 5.19, p = 0.025; Figure 7B). Independent of task, when accuracy was low the difference between sentences and word lists was stronger, with higher weighted phase lag index (WPLI) fits for the sentence condition. After correcting for accuracy, there was also a significant task × condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was stronger WPLI for the sentence compared to the word list condition, the interaction indicated that this was especially the case during the word task (p = 0.005), but not during the other tasks (p > 0.1).

Age control

Adding age to the analysis did not change any of the original findings (all original effects remained significant). For the power analysis, however, we found age-specific interactions with condition and task. Specifically, for both the STG and the MTG we found an interaction between age and condition (F(1,28.87) = 6.156, p = 0.0192 and F(1,31) = 10.31, p = 0.003, respectively). In both ROIs, there was a stronger difference between sentences and word lists (higher delta power for sentences) for the younger compared to the older participants (Figure 7—figure supplement 1). In the MTG, there was also an interaction between task and age (F(1,31) = 5.020, p = 0.006). A follow-up showed a correlation between age and power only in the word task (p = 0.023, uncorrected), and not in the other tasks (p > 0.1).

Number of components control

Overall, the number of PCA components did not influence any of the qualitative condition differences. However, 10 PCA components did not seem sufficient to show all original effects with the same power. Specifically, the IFG task and MTG task × condition effects only reached trend level with 10 components (p = 0.06 and p = 0.1, respectively). The other effects remained significant with 10 components. Using 30 components made some of our effects stronger than with 20 components: here, the IFG task and MTG task × condition effects had p values of 0.034 and 0.006, respectively. We conclude that the number of PCA components did not qualitatively change any of our reported effects.

Discussion

In the current study, we investigated the effects of ‘additional’ tasks on the neural tracking of sentences and word lists at temporal modulations matching phrasal rates. Different nodes of the language network showed different tracking patterns. In STG, we found stronger tracking of phrase-timed dynamics in sentences compared to word lists, independent of task. However, in MTG we found this sentence-enhanced tracking only for active tasks. In IFG, we also found an overall increase of tracking for sentences compared to word lists. Additionally, stronger phrasal tracking was found for the phrasal-level word-combination task compared to the other tasks (independent of stimulus type; note that the comparisons with the syllable and passive tasks reached only trend level), which was paralleled by increased IFG–STG connectivity in the delta band during the word-combination task. Behavioural performance seemed to relate to MI tracking in the MTG and to STG–MTG connectivity. This suggests that tracking at phrasal timescales depends both on the linguistic information present in the signal and on the specific task that is performed.

The findings reported in this study are in line with previous results, with overall stronger tracking of low-frequency information in the sentence compared to the word list condition (Kaufeld et al., 2020). Crucially, for the stimuli used in our study it has been shown that the condition effects are not due to acoustic differences between the stimuli and also do not occur for reversed speech (Kaufeld et al., 2020). It is therefore most likely that our results reflect an automatic inference-based extraction of relevant phrase-level information in sentences, indicating automatic processing as participants understand the meaning of the speech they hear using stored, structural linguistic knowledge (Martin, 2020; Ding et al., 2016; Har-Shai Yahav and Zion Golumbic, 2021). Overall, making participants pay attention to temporal dynamics at the same hierarchical level through an additional task – instructing them to remember word combinations at the phrasal rate during word list presentation – did not counter this main effect of condition.

Even though there was an overall main effect of condition, task did influence neural responses. Interestingly, the task effects differed across the three ROIs. In the STG, we found no task effects, while in the MTG we found an interaction between task and condition. In the MTG, increased phrasal-level tracking for sentences only occurred when participants were specifically instructed to perform an active task on the materials. It therefore seems that in MTG all levels of linguistic information are used to perform an active language operation on the stimuli. Importantly, tracking at the phrasal rate in MTG seemed relevant for behavioural performance when attending to phrasal timescales (Figure 7A). This is in line with previous theoretical and empirical research suggesting a strong top-down modulatory response in speech processing, in which predictions flow from the highest hierarchical levels (e.g., syntax) down to lower levels (e.g., phonemes) to aid language understanding (Martin, 2016; Hagoort, 2017; Federmeier, 2007). As no linguistic information is present at the phrasal rate in the word list condition, this information cannot be used to provide useful feedback for processing lower-level linguistic information. Instead, one might have expected the same type of increased tracking at the word rate rather than the phrasal rate for word lists (i.e., stronger word-rate tracking for word lists in the active tasks versus the passive task). This effect was not found; this could be attributed either to different computational operations occurring at different hierarchical levels or to signal-to-noise/signal detection issues.

We found that, across participants, both the MI and the connectivity in temporal cortex related to behavioural performance. Specifically, MTG–STG connections were, independent of task, related to accuracy. There was higher connectivity between MTG and STG for sentences compared to word lists at low accuracies. At high accuracies, we found stronger MTG tracking at phrasal rates (measured with MI) for sentences compared to word lists during the word-combination task. These results suggest that tracking of phrasal structure in MTG is indeed relevant for understanding sentences compared to word lists. This was reflected in a general increase in delta connectivity differences when the task was difficult (Figure 7B). Participants might compensate for the difficulty by using the phrasal structure present in the sentence condition. When the phrasal structure in sentences is accurately tracked (as measured with MI), performance is better when these rates are relevant (Figure 7A). These results point to a role of phrasal tracking in accurately understanding the higher-order linguistic structure in sentences, though more research is needed to verify this. It is evident that the connectivity and tracking correlations with behaviour do not explain all variation in behavioural performance (compare Figure 1 with Figure 3). Plainly, temporal tracking does not explain everything in language processing. Besides tracking, there are many other components important for our designated tasks, such as memory load and semantic context, which are not captured by our current analyses.

It is interesting that MTG, but not STG, showed an interaction effect. Both MTG and STG are strong hubs for language processing and have been implicated in many studies contrasting pseudo-words and words (Hickok and Poeppel, 2007; Turken and Dronkers, 2011; Vouloumanos et al., 2001). STG likely performs the lower-level processing of the two regions, as it sits earlier in the cortical hierarchy, being more involved in initial segmentation and initial phonetic abstraction rather than serving as a lexical interface (Hickok and Poeppel, 2007). This could also explain why STG does not show task-specific tracking effects: STG could sit earlier in a workload bottleneck, receiving feedback independent of task, while MTG feedback is recruited only when active linguistic operations are required. Alternatively, it is possible that small differences in the acoustics are detected by STG (even though this effect was not previously found with the same stimuli; Kaufeld et al., 2020), or that our blocked design put participants in a sentence or word list ‘mode’, which could have influenced the state of these early hierarchical regions.

The IFG was the only region that showed an increase in phrasal-rate tracking specifically for the word-combination task. Note, however, that this was a weak effect, as the comparisons between the phrase task and the syllable and passive tasks only reached trend level. Nonetheless, this effect is interesting for understanding the role of IFG in language. Traditionally, IFG has been viewed as a hub for articulatory processing (Hickok and Poeppel, 2007), but its role during speech comprehension, specifically in syntactic processing, has also been acknowledged (Friederici, 2011; Hagoort, 2017; Nelson et al., 2017; Dehaene et al., 2015; Zaccarella et al., 2017). Integrating information across time and relative timing is essential for syntactic processing (Martin, 2020; Dehaene et al., 2015; Martin and Doumas, 2019), and IFG feedback has been shown to occur in temporal dynamics at lower (delta) rates during sentence processing (Park et al., 2015; Keitel and Gross, 2016). However, it has also been shown that syntax-independent verbal working memory chunking tasks recruit the IFG (Dehaene et al., 2015; Osaka et al., 2004; Fegen et al., 2015; Koelsch et al., 2009). This is in line with our findings, which show that IFG is involved when we need to integrate across temporal domains, whether in a language-specific setting (sentences versus word lists) or in language-unspecific tasks (word-combination versus other tasks). We also show increased delta connectivity with STG for the only temporal-integration task in our study (i.e., the word-combination task), independent of the linguistic features in the signal. Our results therefore support a role of the IFG as a combinatorial hub integrating information across time (Gelfand and Bookheimer, 2003; Schapiro et al., 2013; Skipper, 2015).

In the current study, we investigated power as a neural readout during language comprehension from speech. This was both to ensure that any tracking effects we found were not due to overall signal-to-noise (SNR) differences and to investigate task- and condition-dependent computations. SNR is better for conditions with higher power, which leads to more reliable phase estimations, critical for computing MI as well as connectivity (Zar, 1998). We therefore discuss the power differences as well as their consequences for the interpretation of the MI and connectivity results. Generally, there was stronger power in the sentence condition compared to the word list condition in the delta band. However, the pattern was very different from the MI pattern. For power, the word list-sentence difference was largest in the passive task. In contrast, for the MI there was either no task difference (in STG) or even a stronger effect for the active tasks (in MTG; note that the power interaction was trend significant in STG and MTG). We therefore think it unlikely that our MI effects were purely driven by SNR differences, and our power control analysis is consistent with this interpretation. Instead, power seems to reflect a different computation than the tracking, where more complex tasks generally lead to lower power across almost all tested frequency bands. As most of our frequency bands are on the low side of the spectrum (up to beta), it is expected that more complex tasks reduce low-frequency power (Jensen and Mazaheri, 2010; Klimesch, 1999). It is interesting that this did not reduce connectivity in the delta band between IFG and STG, but rather increased it. It has been suggested that low power can potentially increase the available computational space, as it increases the entropy in the signal (Hanslmayr et al., 2012; ten Oever and Sack, 2015). Note that even though we found increased connectivity, we did not see a clear power peak in the delta band. This suggests that we might not be looking at an endogenous oscillator, but rather at connections operating at that temporal scale (potentially non-oscillatory in nature). Finally, in the power comparisons for the theta, alpha, and beta bands we found stronger power for the sentence compared to the word list condition, which could reflect that listening to a natural sentence is generally less effortful than listening to a word list.

In the current manuscript, we describe tracking of ongoing temporal dynamics. However, the neural origin of this tracking is unknown. While we can be sure that modulations at the phrasal rate follow changes in the phrasal rate of the acoustic input, it is unclear what the mechanism behind this modulation is. It is possible that there is stronger alignment of neural oscillations with the acoustic input at the phrasal rate (Lakatos et al., 2008; Obleser and Kayser, 2019; Rimmele et al., 2021). However, it could equally be that an operation unfolding at the phrasal timescale or slower is happening while processing the incoming input (which de facto is at the same timescale as the phrasal structure inferred from the input). This operation, in response to stimulus input, could just as well induce the patterns we observe (Meyer et al., 2019; Zoefel et al., 2018b). Finally, it is possible that there are specific responses, occurring as discrete events at phrasal timescales, as a consequence of the syntactic structure, task, or statistical regularities (Obleser and Kayser, 2019; Ten Oever and Martin, 2021; Frank and Yang, 2018).

It is difficult to decide on the experimental task that best reflects how we use language in natural settings. This is probably why such a vast number of different tasks have been used in the literature. Our study (and many before it) indicates that during passive listening, we naturally attend to all levels of the linguistic hierarchy. This is consistent with the widely accepted notion that the meaning of a natural sentence requires composing words into a grammatical structure. For most research questions in language, it therefore is sensible to use a task that mimics this automatic, natural understanding of a sentence. Here, we show that the automatic understanding of linguistic information, and all the processing that this entails, cannot be countered enough to substantially change the neural readout, even when explicitly instructing participants to pay attention to particular timescales.

Materials and methods

Participants

In total, 20 Dutch native speakers (16 females; age range: 18–59; mean age = 39.5) participated in the study. All were right-handed, reported normal hearing, had normal or corrected-to-normal vision, and did not have any history of dyslexia or other language-related disorders. Participants were screened for MEG and MRI eligibility and gave written informed consent. The study was approved by the Ethical Commission for human research Arnhem/Nijmegen (project number CMO2014/288). Participants were reimbursed for their participation. One participant was excluded from the analysis as they did not finish the full session.

Materials and design

Materials were identical to the stimuli used in Kaufeld et al., 2020. They consisted of naturally spoken sentences and word lists, each 10 words long (see Table 1 for examples). The sentences contained two coordinate clauses with the following structure: [Adj N V N Conj Det Adj N V N]. All words were disyllabic except for the words ‘de’ (the) and ‘en’ (and). Word lists were word-scrambled versions of the original sentences and always followed the structure [V V Adj Adj Det Conj N N N N] or [N N N N Det Conj V V Adj Adj] to ensure that they were grammatically incorrect. In total, 60 sentences were used. All sentences were presented at a comfortable sound level.

Table 1
Stimuli and task examples.

Sentence: [bange helden] [plukken bloemen] en de [bruine vogels] [halen takken]
([timid heroes] [pluck flowers] and the [brown birds] [gather branches])

Word list: [helden bloemen] [vogels takken] de en [plukken halen] [bange bruine]
([heroes flowers] [birds branches] and the [pluck gather] [timid brown])

| Task | Sentence: correct | Sentence: incorrect | Word list: correct | Word list: incorrect |
| --- | --- | --- | --- | --- |
| Syllable | /bɑ/ | /lɑ/ | /bɑ/ | /lɑ/ |
| Word | bloemen [flowers] | vaders [fathers] | bloemen [flowers] | vaders [fathers] |
| Word combination | bange helden [timid heroes] | halen bloemen [gather flowers] | helden bloemen [heroes flowers] | vogels bloemen [birds flowers] |

  1. For each condition (sentence and word list), one example stimulus (top) and the corresponding task options (bottom) are shown.

Participants were asked to perform four different tasks on these stimuli: a passive task, a syllable task, a word task, and a word-combination task. For the passive task, participants did not need to perform any task other than comprehension – they only needed to press a button to go to the next trial. For the syllable task, participants heard two speech fragments after every sentence, each consisting of one syllable. The sound fragments were a randomly chosen syllable from the previously presented sentence and a random syllable from all other sentences. Participants’ task was to indicate via a button press which of the two sound fragments was part of the previous sentence. For the word task, two words were displayed on the screen after each trial (a random word from the just-presented sentence and one random word from all other sentences, excluding ‘de’ and ‘en’), and participants needed to indicate which of the two words was part of the preceding sentence. For the word-combination task, participants were presented with two word pairs on the screen. Each of the four words was part of the just-presented sentence, but only one of the pairs occurred in the correct order. Participants needed to indicate which of the two pairs was presented in the preceding sentence. Presented options for the sentence condition were always grammatically and semantically plausible combinations of words. See Table 1 for an example of the tasks for each condition (sentences and word lists). The three active tasks required participants to focus on the syllabic (syllable task), word (word task), or phrasal (word-combination, also called phrasal, task) timescale. Note that different trials within a task were not matched for difficulty. For example, in the syllable task, syllables that form a word are much easier to recognize than syllables that do not. Additionally, trials pertaining to the beginning of the sentence are more difficult than ones related to the end of the sentence due to recency effects.

Procedure

At the beginning of each trial, participants were instructed to look at a fixation cross presented at the middle of the screen on a grey background. Audio recordings were presented after a random interval of between 1.5 and 3 s; 1 s after the end of the audio, the task was presented. For the word and word-combination tasks, this was the presentation of visual stimuli. For the syllable task, this entailed presenting the sound fragments one after the other (with a delay of 0.5 s in between). For the passive task, this was the instruction to press a button to continue. In total, there were eight blocks (two conditions × four tasks), each lasting about 8 min. The order of the blocks was pseudo-randomized by independently randomizing the order of the tasks and the conditions (see the sketch below). For a single participant, we then always presented the same task twice in a row to avoid task-switching costs. As a consequence, condition always alternated (a possible order of blocks would be: passive-sentence, passive-word list, word-sentence, word-word list, syllable-sentence, syllable-word list, word-combination-sentence, word-combination-word list). Across participants, the starting condition was counterbalanced. After the main experiment, an auditory localizer was collected, which consisted of listening to 200-ms sine-wave and broadband sounds (centred at 0.5, 1, and 2 kHz; the broadband sounds had a 10% frequency bandwidth) at approximately equal loudness. Each sound had 50-ms linear on and off ramps and was presented 30 times (with a random inter-stimulus interval between 1 and 2 s).
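For illustration only, the block-ordering logic can be sketched in Python (a minimal sketch; list contents and variable names are ours, and the actual counterbalancing of the starting condition across participants was controlled rather than random):

```python
import random

tasks = ["passive", "syllable", "word", "word-combination"]
conditions = ["sentence", "word list"]

random.shuffle(tasks)        # task order randomized per participant
random.shuffle(conditions)   # starting condition counterbalanced across participants

# each task is presented twice in a row, so condition alternates within a task pair
blocks = [(task, cond) for task in tasks for cond in conditions]
# e.g. [('passive', 'sentence'), ('passive', 'word list'), ('word', 'sentence'), ...]
```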

On arrival, participants filled out a screening form. Electrodes to monitor eye movements and heartbeat were placed at impedances below 15 kΩ (the left mastoid was used as ground electrode). Participants wore metal-free clothes and fitted earmolds on which two of the three head localizer coils were placed (a final head localizer coil was placed at the nasion). They then performed the experiment in the MEG. MEG was recorded using a 275-channel axial gradiometer CTF MEG system at a sampling rate of 1.2 kHz. After every block, participants had a break, during which head position was corrected (Stolk et al., 2013). After the session, the headshape was collected using a Polhemus digitizer (with the nasion and the entrances of the ear canals, as positioned with the earmolds, as fiducials). For each participant, an MRI was collected with a 3T Siemens Skyra system using the MPRAGE sequence (1 mm isotropic). Participants also wore the earmolds, with vitamin pills inserted, during MRI acquisition to optimize alignment.

Behavioural analysis

We performed a linear mixed model analysis with fixed factors task (syllable, word, and word combination) and condition (sentence and word list), as implemented by lmer in R 4.1.0. The dependent variable was accuracy. First, outliers were removed (values more extreme than the median ± 2.5 IQR). Then, we determined the best random-effects structure, including a random intercept or a random slope for one or both factors. Models with different random-effects structures were compared using an analysis of variance; when models did not differ significantly, the one with the fewest factors was retained (with minimally a random intercept). Finally, lsmeans was used for follow-up tests, using the Kenward-Roger method to calculate the degrees of freedom from the linear mixed model. For significant interactions, we investigated the effect of condition per task. For main effects, we investigated pairwise comparisons. We corrected for multiple comparisons using adjusted Bonferroni corrections unless specified otherwise. For all further reported statistical analyses of the MEG data, we followed the same procedure (except that task had one more level, that is, the passive task). To keep the number of comparisons manageable, we decided a priori to compare only the individual tasks with the phrase task for any task effects in the MEG analysis.
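The authors fitted these models with lmer and lsmeans in R; as a rough illustration of the same model structure, a Python sketch using statsmodels is given below. The column names (`accuracy`, `task`, `condition`, `subject`) and the file name are our assumptions, and statsmodels offers no Kenward-Roger degrees of freedom, so the R pipeline remains the reference.

```python
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per participant x task x condition cell (hypothetical file name)
df = pd.read_csv("behaviour.csv")

# fixed effects: task, condition, and their interaction; random intercept per subject
intercept_model = smf.mixedlm("accuracy ~ task * condition", df, groups=df["subject"])
fit = intercept_model.fit(reml=True)
print(fit.summary())

# candidate random-slope model, to be compared against the intercept-only model
slope_model = smf.mixedlm("accuracy ~ task * condition", df,
                          groups=df["subject"], re_formula="~task")
slope_fit = slope_model.fit(reml=True)
```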

MEG pre-processing

First, source models were constructed from the MRI using a surface-based approach in which grid points were defined on the cortical sheet using the automatic segmentation of FreeSurfer 6.0 (Fischl, 2012) in combination with pre-processing tools from HCP Workbench 1.3.2 (Glasser et al., 2013) to down-sample the mesh to 4k vertices per hemisphere. The MRI was co-registered to the MEG using the previously defined fiducials as well as an automatic alignment of the MRI to the Polhemus headshape using FieldTrip, version 20211102 (Oostenveld et al., 2011).

Pre-processing involved epoching the data between −3 and +7.9 s (+3 s relative to the longest sentence of 4.9 s) around sentence onset. We applied a DFT filter at 50, 100, and 150 Hz to remove line noise, a Butterworth bandpass filter between 0.6 and 100 Hz, and performed baseline correction (−0.2 to 0 s baseline). Trials with excessive movements or SQUID jumps were removed via visual inspection (20.1 ± 18.5 trials removed; mean ± standard deviation). Then data were resampled to 300 Hz and we performed ICA decomposition to correct for eye blink/movement and heartbeat artefacts (4.7 ± 0.99 components removed; mean ± standard deviation). Trials with remaining artefacts were removed by visual inspection (11.3 ± 12.4 trials removed; mean ± standard deviation). Then we applied an LCMV filter to obtain single-trial source-space representations. A common filter across all trials was calculated using a fixed orientation and a lambda of 5%. We only extracted time courses for our ROIs – the STG, the middle temporal gyrus, and the inferior frontal cortex – defined by the corresponding labels of the aparc parcellation implemented in FreeSurfer. These time courses were baseline corrected (−0.2 to 0 s). To reduce computational load and to ensure that we used relevant data within the ROI, we extracted the top 20 PCA components per ROI for all following analyses, based on a PCA computed over the time window of interest (0.5–3.7 s; 0.5 s to exclude the initial evoked responses and 3.7 s because it corresponds to the shortest trials). All following analyses were done per ROI. With enough statistical power, one would add ROI as a separate factor in the analyses, but unfortunately we did not have enough power to detect a potential three-way interaction (ROI × condition × task). We therefore cannot make strong conclusions about one ROI having a stronger effect than another.
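As a hedged sketch of this pipeline, the steps could look as follows in MNE-Python (the authors used FieldTrip; the file name, trigger channel, the precomputed forward model `fwd`, the excluded ICA components, and `roi_vertices` are all assumptions, and `pick_ori="max-power"` approximates FieldTrip's fixed-orientation option):

```python
import mne
import numpy as np
from sklearn.decomposition import PCA

raw = mne.io.read_raw_ctf("sub-01.ds", preload=True)   # hypothetical file name
raw.notch_filter([50, 100, 150])                       # line noise (DFT filter in the paper)
raw.filter(0.6, 100.0)                                 # Butterworth band-pass in the original

events = mne.find_events(raw, stim_channel="UPPT001")  # assumed trigger channel
epochs = mne.Epochs(raw, events, tmin=-3.0, tmax=7.9,
                    baseline=(-0.2, 0.0), preload=True)
epochs.resample(300)

ica = mne.preprocessing.ICA(n_components=30).fit(epochs)
ica.exclude = [0, 1]   # eye/heart components chosen by inspection (indices hypothetical)
ica.apply(epochs)

# LCMV common filter across all trials, 5% regularisation
data_cov = mne.compute_covariance(epochs, tmin=0.0, tmax=None)
filters = mne.beamformer.make_lcmv(epochs.info, fwd, data_cov,
                                   reg=0.05, pick_ori="max-power")
stcs = mne.beamformer.apply_lcmv_epochs(epochs, filters)

# top-20 PCA components per ROI, fitted on the 0.5-3.7 s window (epoch starts at -3 s)
sfreq = 300
win = slice(int((0.5 + 3.0) * sfreq), int((3.7 + 3.0) * sfreq))
roi_data = np.stack([stc.data[roi_vertices] for stc in stcs])   # trials x verts x time
n_tr, n_v, n_t = roi_data.shape
pca = PCA(n_components=20).fit(
    roi_data[:, :, win].transpose(0, 2, 1).reshape(-1, n_v))
comps = pca.transform(roi_data.transpose(0, 2, 1).reshape(-1, n_v))
comps = comps.reshape(n_tr, n_t, 20).transpose(0, 2, 1)         # trials x comps x time
```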

MI analysis

First, we extracted the speech envelopes following previous procedures (Kaufeld et al., 2020; Keitel et al., 2018; Gross et al., 2013; Ince et al., 2017). The acoustic waveforms were filtered (third-order Butterworth) into eight frequency bands (100–8000 Hz) spaced equidistantly on the cochlear frequency map (Smith et al., 2002). We computed the absolute value of the Hilbert transform, low-pass filtered the result at 100 Hz (third-order Butterworth), down-sampled to 300 Hz (matching the MEG sampling rate), and then averaged across all bands.
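A minimal Python sketch of this envelope extraction is given below; the ERB-number spacing of the band edges is our approximation of the cochlear frequency map of Smith et al., 2002 (the exact map may differ), and `wav`/`fs` stand for the audio samples and their sampling rate:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def speech_envelope(wav, fs, n_bands=8, fmin=100.0, fmax=8000.0, fs_out=300):
    """Wide-band speech envelope: 8 cochlear-spaced bands, Hilbert, low-pass, average."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)        # Hz -> ERB number
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437  # ERB number -> Hz
    edges = erb_inv(np.linspace(erb(fmin), erb(fmax), n_bands + 1))
    lowpass = butter(3, 100.0, btype="lowpass", fs=fs, output="sos")
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = butter(3, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(band, wav)))    # per-band Hilbert envelope
        env = sosfiltfilt(lowpass, env)                  # low-pass at 100 Hz
        envs.append(resample_poly(env, fs_out, int(fs))) # down-sample to 300 Hz
    return np.mean(envs, axis=0)                         # average across bands
```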

MI was calculated between the filtered speech envelopes (third-order Butterworth filter) and the filtered MEG data in three different frequency bands corresponding to information content at different linguistic hierarchical levels: phrase (0.8–1.1 Hz), word (1.9–2.8 Hz), and syllable (3.5–5.0 Hz). These bands were chosen based on the rate of the linguistic information in the speech signal, where we hypothesized that tracking of relevant information should occur. While in our stimulus set the boundaries of the linguistic levels did not overlap, in natural speech the brain has an even more difficult task, as there is no one-to-one match between band and linguistic unit (Obleser et al., 2012). Our main analysis focuses on the phrasal band, as that is where our previous study found the strongest effects (Kaufeld et al., 2020), but for completeness we also report on the other bands. MI was estimated between the phase estimations of the envelopes and the MEG data, from after the evoked response (0.5 s) until the end of the stimulus, at five different delays (60, 80, 100, 120, and 140 ms), and averaged across delays. A single MI value was generated per condition per ROI by concatenating all trials (MEG and speech) before calculating the MI. Statistical analysis was performed per ROI per frequency band.
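The referenced procedures (Ince et al., 2017) use a Gaussian-copula MI estimator. Below is a minimal, bias-uncorrected sketch of that idea for one delay and one band; `envelope` and `meg_component` are assumed 1-D arrays at 300 Hz, and each phase enters as a (sin, cos) pair so circularity is respected:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert
from scipy.stats import rankdata
from scipy.special import ndtri

def copnorm(x):
    # rank-transform each row to a standard normal (the Gaussian-copula step)
    return ndtri(rankdata(x, axis=-1) / (x.shape[-1] + 1.0))

def gcmi(x, y):
    # MI in bits between two (dims x samples) arrays under the Gaussian-copula model
    x, y = copnorm(x), copnorm(y)
    cxy = np.cov(np.vstack([x, y]))
    cx, cy = cxy[:len(x), :len(x)], cxy[len(x):, len(x):]
    return 0.5 * np.log2(np.linalg.det(cx) * np.linalg.det(cy) / np.linalg.det(cxy))

sos = butter(3, [0.8, 1.1], btype="bandpass", fs=300, output="sos")  # phrasal band
phase = lambda s: np.angle(hilbert(sosfiltfilt(sos, s)))

lag = int(0.100 * 300)              # one of the five delays (100 ms); the paper averages over five
env_ph = phase(envelope)[:-lag]
meg_ph = phase(meg_component)[lag:]
mi = gcmi(np.array([np.sin(env_ph), np.cos(env_ph)]),
          np.array([np.sin(meg_ph), np.cos(meg_ph)]))
```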

Power analysis

Power analysis was performed to compare the MI results with absolute power changes. On the one hand, we did this analysis because MI differences could be a consequence of signal-to-noise differences in the original data (which would be reflected in power effects). On the other hand, generic delta power has been associated with language processing (Meyer, 2018; Kazanina and Tavano, 2021). We therefore chose to analyse classical frequency bands instead of the stimulus-informed ones used in the MI analysis, in order to compare these results with other studies. Moreover, as this analysis does not measure tracking of the stimulus (unlike the MI analysis), it did not seem appropriate to match the frequency content to the stimulus content. We first extracted the time–frequency representation for all conditions and ROIs separately. To do so, we performed a wavelet analysis with a width of 4 cycles, frequencies of interest between 1 and 30 Hz (step size of 1 Hz), and times of interest between −0.2 and 3.7 s (step size of 0.05 s). We extracted the logarithm of the power and baseline corrected the data in the frequency domain using a −0.3 to −0.1 s window. For four different frequency bands (delta: 0.5–3.0 Hz; theta: 3.0–8.0 Hz; alpha: 8.0–15.0 Hz; beta: 15.0–25.0 Hz), we extracted the mean power in the 0.5–3.7 s time window per task, condition, and ROI. Again, our main analysis focuses on the delta band, but we also report on the other bands for completeness. For each ROI, we performed the statistical analysis on power as described in the behavioural analysis.
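A sketch of this time–frequency pipeline using MNE-Python's array interface follows; the variable names (`data`, a trials × components × times array at 300 Hz, and `times`, its time axis) are assumptions, and with 1 Hz frequency steps the delta band effectively covers 1–3 Hz here:

```python
import numpy as np
from mne.time_frequency import tfr_array_morlet

freqs = np.arange(1, 31)                              # 1-30 Hz in steps of 1 Hz
power = tfr_array_morlet(data, sfreq=300.0, freqs=freqs,
                         n_cycles=4, output="power")  # wavelet width of 4 cycles
logp = np.log10(power)

base = (times >= -0.3) & (times <= -0.1)              # baseline window
logp -= logp[..., base].mean(axis=-1, keepdims=True)

bands = {"delta": (0.5, 3.0), "theta": (3.0, 8.0),
         "alpha": (8.0, 15.0), "beta": (15.0, 25.0)}
toi = (times >= 0.5) & (times <= 3.7)                 # analysis window
lo, hi = bands["delta"]
sel = (freqs >= lo) & (freqs <= hi)
delta_power = logp[:, :, sel][..., toi].mean(axis=(2, 3))  # per trial x component
```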

Connectivity analysis

For the connectivity analysis, we repeated all pre-processing as in the power analysis, but separately for the left and right hemispheres (so that PCA components would not mix signals across hemispheres), after which we averaged the connectivity measure across hemispheres (using the Fourier spectrum rather than the power spectrum). We used the debiased WPLI as our connectivity measure, which ensures that no zero-lag phase differences are included in the estimation (avoiding effects due to volume conduction). All connections between the three ROIs were investigated using the mean WPLI in the 0.5–3.7 s time window, for the four frequency bands also used in the power analysis. The same statistical analysis was applied. Note that we did not find a clear power peak for all frequency bands (only clearly so for the alpha band); this indicates that the connectivity likely does not reflect endogenous oscillatory activity (Donoghue et al., 2020), but might still pertain to connected regions operating at those timescales.
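The debiased (squared) WPLI can be computed from single-trial cross-spectra as below; this is a generic sketch of the estimator (the Vinck et al., 2011 formulation, as implemented, for example, in FieldTrip's 'wpli_debiased' method), not the authors' code:

```python
import numpy as np

def debiased_wpli_square(sx, sy):
    """Debiased squared WPLI between two signals at one frequency.

    sx, sy: complex spectral (Fourier/wavelet) estimates with trials on axis 0.
    """
    im = np.imag(sx * np.conj(sy))     # imaginary part of the cross-spectrum
    s = im.sum(axis=0)
    s_abs = np.abs(im).sum(axis=0)
    s_sq = (im ** 2).sum(axis=0)
    return (s ** 2 - s_sq) / (s_abs ** 2 - s_sq)
```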

MEG–behavioural performance analysis

To investigate the relation between the MEG measures and behavioural performance, we repeated the analyses (MI, power, and connectivity) with accuracy added as a factor (together with its interactions with the task and condition factors). As there is no accuracy for the passive task, we removed this task from the analysis. We then followed the same analysis steps as before. However, since this reduced our degrees of freedom, we could only fit random-intercept models, not random-slope models.

Power control analysis

The reliability of phase estimations is influenced by the signal-to-noise ratio of the signal (Zar, 1998). As a consequence, trials with generally high power have more reliable phase estimations than low-power trials. This could influence any measure relying on phase estimation, such as MI and connectivity (Ince et al., 2017; Bastos and Schoffelen, 2015). It is therefore possible that power differences between conditions lead to differences in connectivity or MI. To ensure that our reported effects were not due to signal-to-noise effects, we controlled for any significant power difference between conditions in the connectivity and MI analyses. To do this, we iteratively removed the highest-power trials from the condition with the higher mean power and the lowest-power trials from the condition with the lower mean power (either collapsing trials across tasks/conditions or using individual conditions; for the MI analysis we used power estimated within its respective frequency band). We repeated this until the condition that originally had the highest power had lower power than the other condition. We then repeated the analysis and statistics to investigate whether the effect of interest remained significant. The control analysis is reported alongside the main MI and connectivity sections.
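The trial-pruning logic can be sketched as follows (our reading of the procedure, not the authors' code; `p_high` and `p_low` are assumed numpy arrays of per-trial power for the condition with the higher and lower mean power, respectively):

```python
import numpy as np

def match_power(p_high, p_low):
    """Iteratively drop extreme-power trials until the originally stronger
    condition no longer has higher mean power; returns kept-trial indices."""
    keep_hi = list(np.argsort(p_high))   # trial indices, ascending power
    keep_lo = list(np.argsort(p_low))
    while np.mean(p_high[keep_hi]) > np.mean(p_low[keep_lo]) and len(keep_hi) > 1:
        keep_hi.pop()                    # drop the highest-power trial
        keep_lo.pop(0)                   # drop the lowest-power trial
    return np.array(keep_hi), np.array(keep_lo)
```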

Other control analysis

We performed two final control analyses. First, we investigated whether age had an influence on any of our primary outcome measures. Second, we repeated the analyses using either 10 or 30 PCA components instead of the original 20. These controls serve as a robustness check of the reported results. Note that this study was not designed to investigate age-related differences. We therefore only report interaction effects with our task and condition variables. Any main effect of age is difficult to interpret, as it is unclear whether such an effect pertains to overall age-related differences (or anatomical variation leading to differential MEG responses) or to language-related age differences.

Data availability

Data and analysis code are available at https://data.donders.ru.nl/collections/di/dccn/DSC_3027006.01_220 (doi: https://doi.org/10.34973/vjw9-0572).

The following data sets were generated
    1. ten Oever S
    2. Carta S
    3. Kaufeld G
    4. Martin AE
    (2022) Donders Repository
    Task relevant tracking of hierarchical linguistic structure.
    https://doi.org/10.34973/vjw9-0572

References

    1. Rosen S (1992) Temporal information in speech: acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 336:367–373. https://doi.org/10.1098/rstb.1992.0070
    2. Zar JH (1998) Biostatistical Analysis (4th edn). Englewood Cliffs, NJ: Prentice Hall.

Decision letter

  1. Jonathan Erik Peelle
    Reviewing Editor; Washington University in St. Louis, United States
  2. Barbara G Shinn-Cunningham
    Senior Editor; Carnegie Mellon University, United States
  3. Johanna Rimmele
    Reviewer; Max-Planck-Institute for Empirical Aesthetics, Germany

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Task-dependent and automatic tracking of hierarchical linguistic structure" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Johanna Rimmele (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1. The relationship of participants' behavior to the observed MEG responses. One clear concern is the degree to which the accuracy differences between word lists and sentences (Figure 1) explain MI differences.

2. A number of important details regarding analysis choices were missing and should be added, including: justification of the frequency bands used for the MI analysis, task order (were sentences always first, and word lists second), whether ROIs were defined across hemispheres, details on signal filtering, and details regarding the PCA analysis.

3. The use of "hierarchical linguistic structure" should be clarified or tempered in the context of the current results. Although the attention was directed to syllables, words, and phrases through instructions, the significant MEG results were in the phrasal band (Figure 3) and not the others (supplemental figures).

Reviewer #1 (Recommendations for the authors):

The participants ranged in age from 18-59, which is a rather broad range. Given age-related changes in both hearing (not assessed) and language processing, it was surprising to not at least see age included in the models (though perhaps there were not enough participants?). Some comment would be useful here.

Figures: The y axis label seems frequently oddly positioned at the top – maybe consider just centering it on the axis?

I liked the inset panels for the main effects, but they are rather small. If you can make them any larger that would improve readability.

In general, I really like showing individual subjects in addition to means. However, it also can make the plots a bit busy. I imagine you've tried other ways of plotting but might be worth exploring ways to highlight the means a bit more (or simplify the main paper figures, and keep these more detailed figures for supplemental)? Totally up to you, but I had a hard time seeing the trends as the plots currently stand.

On my copy of the PDF, Figures 6 and 4 (p. 14) were overlapping and so really impossible to see.

Figure 1: In the text, the condition is referred to as "word list", but in the figure "wordlist". Having these match isn't strictly necessary but would be a nice touch.

Regarding accuracy differences: including accuracy in the models are good, but alternately, restricting analyses to only correct responses might also help with this?

Reviewer #2 (Recommendations for the authors):

– l. 122: does that mean the sentence was always presented first and the word-list second? The word-list did not contain the same words as the sentence, right? This might have affected differences in tracking and power.

– l. 197: why were the linear mixed-models for the MEG data performed separately per ROI? Also, were ROIs computed across hemispheres?

– l. 198: was the procedure described by Ince et al., 2017 using gaussian copula used for the MI analysis, if yes could you cite the reference?

– l. 191: could you give the details of the filter, what kind of filter etc?

– l. 191/l. 206: why were the frequency bands for the power analysis (δ: 0.5-3.0 Hz; theta: 3.0-8.0 Hz; α: 8.0-15.0 Hz; β: 15.0-25.0 Hz) different than for the mutual information analysis (l. 191: phrase (0.8-1.1 Hz), word (1.9-2.8 Hz), and syllable (3.5- 5.0 Hz)), could you explain the choice of frequency band? Particularly, if the authors want to check signal-to-noise differences in the mutual information analysis by using the power analysis, it seems relevant to match frequency bands.

– l. 207: why was the data averaged across trials and not the single-trial data fed into the mixed-models, such an analysis might strengthen the findings?

– l. 213: averaged across hemispheres or across trials?

– l. 216: all previous analyses were conducted across the hemispheres?

– l. 217: "for the four different frequency bands" this is referring to the frequency bands chosen for the power analysis, not those in the MI analysis?

– Figure 1: were only correct trials analyzed in the MEG analysis? If not, this might be a problem as there were more correct trials in the sentence compared to the word-list condition at the phrasal scale task. Could you add a control analysis on the correct trials only, to make sure that this was not a confound?

– l. 252 ff./Figure 2: As no power peak is observed in the δ (and α) band, this might result in spurious connectivity findings.

– l. 304: with higher connectivity in the phrasal compared to the passive task? Could you add this info?

– Something went wrong with figure 6.

– l. 403: alignment of neural oscillations with acoustics at phrasal scale, this reference seems relevant here: Rimmele, Poeppel, Ghitza, 2021.

– Wording: l. 387 "in STG".

Reviewer #3 (Recommendations for the authors):

1. In procedure settings, I am not sure whether the sentence condition is always presented before the word list condition (the misunderstanding comes from lines 126-128). Please add the necessary details and avoid this misunderstanding.

2. In the MEG analysis, PCA was performed to reduce computational load and increase the data relevance. However, the PCA procedure is not clear. For example, the authors extracted the top 20 PCA components per region. Why 20 components? And why not 10 or 30 components? The details should be clarified.

https://doi.org/10.7554/eLife.77468.sa1

Author response

Essential revisions:

1. The relationship of participants' behavior to the observed MEG responses. One clear concern is the degree to which the accuracy differences between word lists and sentences (Figure 1) explain MI differences.

2. A number of important details regarding analysis choices were missing and should be added, including: justification of the frequency bands used for the MI analysis, task order (were sentences always first, and word lists second), whether ROIs were defined across hemispheres, details on signal filtering, and details regarding the PCA analysis.

3. The use of "hierarchical linguistic structure" should be clarified or tempered in the context of the current results. Although the attention was directed to syllables, words, and phrases through instructions, the significant MEG results were in the phrasal band (Figure 3) and not the others (supplemental figures).

We thank the Editor and the Reviewers for their helpful comments and constructive feedback. We feel that the reviews have substantially improved our manuscript. We have now addressed these three core concerns, and detail at length in individual responses to each reviewer how and what can now additionally be shown. To summarize briefly here:

1. We now include a new analysis in which we add accuracy as a factor. None of the original statistical patterns changed: we still find that MI/neural tracking is higher for phrases in sentences than in word lists, and that tracking during spoken language processing is a largely automatic response. Task-specific effects were still found in MTG and IFG. Thus, accuracy differences did not explain the MI differences, and we therefore stand by the core messages of the paper. We would like to note that we do not use tasks in the way they are typically used in cognitive neuroscience (say, in a working memory paradigm, where only neural activity on correct trials can be associated with the cognitive process in question); instead, we use tasks to direct our participants' attention to syllables, words, and phrases, and to the timescales at which they occur. Because of this difference in design, lower behavioral performance on the syllable task than on the word or phrase tasks does not mean that the sentence or word list was not heard or comprehended, and the inclusion or exclusion of incorrect trials therefore does not further isolate the cognitive process of spoken language comprehension. That said, we agree it is very important to show that MI differences do not stem from behavioral performance differences. To show this, we added the above-mentioned analysis to the manuscript; a sketch of the model structure is given below. As noted, this analysis did not change any of our main findings or conclusions, and rather strengthened the argument that tracking of phrases is stronger in sentences than in word lists. We now report these results in both our response and the manuscript.
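For concreteness, a minimal sketch of what adding a behavioral covariate to a mixed model can look like, here in Python with statsmodels (the original analysis may well have used different software; the file name, column names, and model structure below are illustrative assumptions, not the study's actual pipeline):

```python
# Hypothetical sketch: linear mixed model for phrasal-rate MI, with accuracy
# added as a fixed-effect covariate and a random intercept per participant.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mi_per_participant.csv")   # assumed long-format table with
                                             # columns: subject, condition, task, accuracy, mi
model = smf.mixedlm("mi ~ C(condition) * C(task) + accuracy",
                    data=df, groups=df["subject"])
result = model.fit()
print(result.summary())  # condition/task effects with accuracy controlled for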

2. We believe that we have fully addressed these queries and thank the reviewers for encouraging us to be more comprehensive and thorough. We note that, due to our 2-condition by 4-task design, we did not have enough power to include ROI as an independent factor. This is obviously not ideal, but it embodies a trade-off between testing these 8 conditions together, which we felt was crucial for the inferences we wanted to make about tracking, and discovering underlying sources. We therefore stuck to the well-known ROIs/sources widely used and defined in the speech and language processing literature.

3. We regret any confusion this wording might have caused; we meant to refer to the fact that our four tasks focus participants' attention on the different timescales at which linguistic representations occur across the linguistic hierarchy. We have changed the title to reflect this more clearly and have removed references to the linguistic hierarchy throughout the paper.

Reviewer #1 (Recommendations for the authors):

The participants ranged in age from 18-59, which is a rather broad range. Given age-related changes in both hearing (not assessed) and language processing, it was surprising to not at least see age included in the models (though perhaps there were not enough participants?). Some comment would be useful here.

We indeed had quite a wide range of ages in our study. This was partly due to constraints on MEG testing during covid times: it was difficult to find participants during the pandemic. On top of this, it is almost standard practice in the Netherlands to place dental wires behind the teeth after braces and leave them in place for life. These participants cannot take part in MEG studies, so we widened our age range to reach enough participants. However, we also believe it is good practice not to limit the age range to the restricted span that is common in most cognitive neuroscience research; otherwise, we end up only comparing the brains of student populations with each other while wanting to draw conclusions about a much wider population. We therefore see the wide age range as an asset rather than a drawback. Regarding hearing difficulties, we would like to note that we did ask participants whether they had any hearing problems (none of them reported any), but indeed did not assess hearing ourselves.

Nonetheless, we agree that it is valuable to investigate whether age influenced any of our results and therefore added age to the models in the main manuscript (i.e., for the lower frequencies). Again, we could only run fixed-intercept models due to the reduction in degrees of freedom.

For the MI we found no significant effect of age, nor any interactions, in any of our models. There was a trend-level task*age interaction in MTG (F(3,123) = 2.5874, p = 0.0561). However, the task*condition interaction remained significant, so this does not change any conclusions.

Author response image 1
Age effects on power estimates.

(A) Predicted values for δ power for the two conditions dependent on age in STG (left) and MTG (right). (B) Predicted values for δ power for the four tasks dependent on age in MTG. Error bars indicate the 95% confidence interval of the fit. Colored lines at the bottom indicate individual datapoints.

For power we found a significant interaction between condition and age in STG (F(1,28.87) = 6.156, p = 0.0192) and MTG (F(1,31) = 10.31, p = 0.003). In the MTG we additionally found an interaction between task and age (F(3,31) = 5.02, p = 0.006). In IFG we only found a main effect of age (F(1,43) = 5.067, p = 0.030). For connectivity we found a main effect of age for all connections (STG-MTG: F(1,17.12) = 10.09, p = 0.005; STG-IFG: F(1,16.96) = 17.42, p < 0.001; MTG-IFG: F(1,17.058) = 12.478, p = 0.002). The condition*age interactions in STG and MTG both suggested a change in power with age for the word list condition only, not for the sentence condition (follow-up correlation between age and power per condition; STG: p = 0.076 uncorrected, MTG: p = 0.023 uncorrected). The task*age interaction in MTG showed a significant effect of age only for the passive task (follow-up correlation between age and power per task; p = 0.028 uncorrected).

Generally, we found it difficult to draw strong conclusions about the main effects of age. Firstly, we did not have a baseline to assess whether the main effects are specific to our language tasks or reflect a more general age-related difference. Secondly, anatomical age-related differences could spuriously drive the main effects of age. We therefore only report the interaction effects. These analyses are now in the manuscript, with an additional figure in the supplementary materials.

The results now read:

“Age control. Adding age to the analysis did not change any of the original findings (all original effects were still significant). We did, however, find age-specific interactions with condition and task for the power analysis. Specifically, for both the STG and the MTG we found an interaction between age and condition (F(1,28.87) = 6.156, p = 0.0192 and F(1,31) = 10.31, p = 0.003). In both ROIs the difference between sentences and word lists (higher δ power for sentences) was stronger for younger compared to older participants (Supplementary Figure 4). In the MTG there was also an interaction between task and age (F(1,31) = 5.020, p = 0.006). In a follow-up we found a correlation between age and power only in the word task (p = 0.023 uncorrected), but not for the other tasks (p > 0.1).”

Figures: The y axis label seems frequently oddly positioned at the top – maybe consider just centering it on the axis?

We changed the y-axes of the figures accordingly and centered them all.

I liked the inset panels for the main effects, but they are rather small. If you can make them any larger that would improve readability.

We increased the size of the insets. Note, however, that as there is no space for y-labels, these insets do not show real data values but are only meant to quickly indicate the direction and presence of a main effect. We added this information to the figure legends to make this clear.

In general, I really like showing individual subjects in addition to means. However, it also can make the plots a bit busy. I imagine you've tried other ways of plotting but might be worth exploring ways to highlight the means a bit more (or simplify the main paper figures, and keep these more detailed figures for supplemental)? Totally up to you, but I had a hard time seeing the trends as the plots currently stand.

There is a difficult balance between showing everything and highlighting the important parts. We are glad the reviewer pointed out that the trends were difficult to see, and we have changed the figures to include only the mean and SEM. Additionally, where we found a main effect of task, we added the condition means to highlight the significant effect more. The original plots are now in the supplementary materials.

On my copy of the PDF, Figures 6 and 4 (p. 14) were overlapping and so really impossible to see.

We are very sorry about this! We realized this happened during the PDF conversion from our Word document. We have now double-checked that the figures no longer overlap.

Figure 1: In the text, the condition is referred to as "word list", but in the figure "wordlist". Having these match isn't strictly necessary but would be a nice touch.

We have changed the condition label to "word list" throughout.

Regarding accuracy differences: including accuracy in the models is good, but alternatively, restricting analyses to only correct responses might also help with this?

We have looked into this analysis, but there are some issues with it. First, there are clear differences in difficulty between trials within a condition. For example, if the target question related to the last part of the audio fragment, the task was much easier than when it related to the beginning. In the syllable task, if the syllables (by chance) also formed part of a word, the trial was likewise much easier. If we split trials into correct and incorrect, we would therefore not isolate processes related to accurately processing the speech fragments, but would also confound the analysis with the individual difficulty level of the trials. Second, we would end up with very few trials after restricting the analysis in this way; in the worst case, some participants would retain only 22 trials, which is too few to do much with. We think it is therefore fair to compare accuracy across participants (as done in the other analysis), but difficult to do so within participants.

To acknowledge this, we added this limitation to the methods. The methods now read:

“Note that different trials within a task were not matched for task difficulty. For example, in the syllable task syllables that make a word are much easier to recognize than syllables that do not make a word. Additionally, trials pertaining to the beginning of the sentence are more difficult than ones related to the end of the sentence due to recency effects.”

Reviewer #2 (Recommendations for the authors):

– l. 122: does that mean the sentence was always presented first and the word-list second? The word-list did not contain the same words as the sentence, right? This might have affected differences in tracking and power.

We regret the confusion. We randomized across participants whether the sentence or word list block came first, but kept the order constant within a participant across the four tasks. So an individual would always alternate between a sentence and a word list block, with which one came first counterbalanced across participants. To control for acoustic differences, the word lists contained the same words as the sentences (see e.g. Kaufeld et al., 2020). This cannot have affected the results, as the order is counterbalanced across participants. We have now made clearer in the text how we assigned the blocks.

– l. 197: why were the linear mixed-models for the MEG data performed separately per ROI? Also, were ROIs computed across hemispheres?

We had a clear a-priori motivation for choosing these ROIs and therefore analyzed them separately. Post-hoc, there is also the simple issue of having too little power to estimate a three-way interaction between ROI, task, and condition. This power situation arises because we needed to test 4 tasks * 2 conditions together in a single experiment to make the inferences we wanted about tracking. This unfortunately comes at the cost of the power we can allot to each condition, which limits our ability to build source models with ROI as a factor. Ideally this would have been done, but practically it is very difficult to achieve; we can do no more than acknowledge this in the main text (now in lines 195-199). It is simply very difficult to have a design with multiple factors on top of an anatomical constraint (most studies seem to limit themselves to two factors these days to avoid these issues). The ROIs were computed across hemispheres (we had no a-priori hypothesis about hemispheres). Note that the PCAs were therefore also calculated across hemispheres (except for the coherence analysis, as cross-hemisphere coherence seemed unlikely to us). Please note that we do not use ROIs that deviate from the established speech and language processing literature.

We now write:

“All following analyses were done per ROI. With enough statistical power one would add ROI as a separate factor in the analyses, but unfortunately, we did not have enough power to find a potential three-way interaction (ROI*condition*task). We therefore cannot make strong conclusions about one ROI having a stronger effect than another.”

– l. 198: was the procedure described by Ince et al., 2017 using gaussian copula used for the MI analysis, if yes could you cite the reference?

We regret not citing it in the main text and have now done so.

– l. 191: could you give the details of the filter, what kind of filter etc?

We used a third-order (bi-directional) Butterworth filter, applied separately to eight equidistant bands of the speech signal. This information is now added; a sketch of this filtering step follows below.
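As an illustration of this step, a minimal sketch under assumed parameters (the sampling rate and exact band edges below are placeholders, not the values used in the study):

```python
# Hypothetical sketch: third-order Butterworth band-pass, applied
# bi-directionally (forward-backward, so zero phase shift) to eight
# equidistant bands of a speech waveform.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, lo, hi, fs, order=3):
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)            # bi-directional: no phase distortion

fs = 44100.0                              # audio sampling rate (assumed)
speech = np.random.randn(int(2 * fs))     # stand-in for a speech waveform
edges = np.linspace(100.0, 8000.0, 9)     # nine edges -> eight equidistant bands (assumed)
bands = [bandpass(speech, lo, hi, fs) for lo, hi in zip(edges[:-1], edges[1:])]
```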

– l. 191/l. 206: why were the frequency bands for the power analysis (δ: 0.5-3.0 Hz; theta: 3.0-8.0 Hz; α: 8.0-15.0 Hz; β: 15.0-25.0 Hz) different than for the mutual information analysis (l. 191: phrase (0.8-1.1 Hz), word (1.9-2.8 Hz), and syllable (3.5- 5.0 Hz)), could you explain the choice of frequency band? Particularly, if the authors want to check signal-to-noise differences in the mutual information analysis by using the power analysis, it seems relevant to match frequency bands.

The frequency bands in the MI analyses were based on the stimuli: they reflect the syllabic, word, and phrasal rates (calculated in Kaufeld et al., 2020). The power analyses were based on generic frequency bands. We chose these two different splits because, for tracking, one expects phase alignment at exactly those frequency ranges in the signal that relate to linguistic content (Ding et al., 2016; Martin, 2020; Keitel et al., 2018). In contrast, even in the absence of alignment, oscillatory signals may still be important for processing language stimuli (Jensen et al., 2010; Benitez-Burraco and Murphy, 2019). As the latter hypotheses pertain to commonly known frequency bands in the brain, we think it is more appropriate there to use generic bands (also to promote comparison across studies). We therefore keep the stimulus-driven frequencies for the tracking hypothesis and use the generic bands as control frequencies (as we do not match them to anything in the stimulus). Note, however, that when controlling for power in both the MI and the connectivity analysis, we used the power of the respective band used for that specific analysis.

We now state this reasoning more clearly in the methods:

“Power analysis was performed to compare the MI results with absolute power changes. On the one hand, we ran this analysis because MI differences could be a consequence of signal-to-noise differences in the original data (which would be reflected in power effects). On the other hand, generic δ power has been associated with language processing [27, 28]. We therefore chose to analyse classical frequency bands instead of the stimulus-informed ones (as used in the MI analysis) in order to compare these results with other studies. Moreover, as this analysis does not measure tracking of the stimulus (as the MI analysis does), it did not seem appropriate to match the frequency content to the stimulus content.”
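For concreteness, a minimal sketch of band-limited power estimation for these classical bands (band edges from the text; the sampling rate, window length, and data are illustrative assumptions):

```python
# Hypothetical sketch: average PSD within each classical frequency band,
# estimated with Welch's method; long windows are needed for the low δ band.
import numpy as np
from scipy.signal import welch

bands = {"delta": (0.5, 3.0), "theta": (3.0, 8.0),
         "alpha": (8.0, 15.0), "beta": (15.0, 25.0)}

def band_power(x, fs, lo, hi):
    f, pxx = welch(x, fs=fs, nperseg=int(4 * fs))  # 4-s windows -> 0.25 Hz resolution
    return pxx[(f >= lo) & (f < hi)].mean()

fs = 300.0                                   # MEG sampling rate (assumed)
roi_signal = np.random.randn(int(60 * fs))   # stand-in for one ROI component
powers = {name: band_power(roi_signal, fs, lo, hi)
          for name, (lo, hi) in bands.items()}
print(powers)
```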

– l. 207: why was the data averaged across trials and not the single-trial data fed into the mixed-models, such an analysis might strengthen the findings?

We could do this for the power analysis, but not for the MI and connectivity analyses, which were calculated across all trials (for MI by concatenating all trials/speech signals; for connectivity because it is calculated across trials). This would mean running a different model for power than for the other two variables of interest, which we chose not to do. MI analyses have previously been done on concatenated data to improve the sensitivity of the measure (Keitel et al., 2020; Kaufeld et al., 2018). Theoretically one could calculate MI per trial, but this would unnecessarily reduce the sensitivity of the analysis.
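A minimal sketch of Gaussian-copula mutual information in the spirit of Ince et al. (2017), computed on trial-concatenated signals; the one-dimensional inputs and random data here are illustrative simplifications of the actual analysis:

```python
# Hypothetical sketch: copula-normalise both signals, then compute the
# parametric Gaussian MI (in bits) from their correlation.
import numpy as np
from scipy.stats import rankdata, norm

def copnorm(x):
    """Map samples to empirical CDF values, then through the inverse
    standard-normal CDF (the Gaussian-copula transform)."""
    return norm.ppf(rankdata(x) / (len(x) + 1))

def gcmi_1d(x, y):
    """Gaussian-copula MI (bits) between two 1-D signals."""
    r = np.corrcoef(copnorm(x), copnorm(y))[0, 1]
    return -0.5 * np.log2(1.0 - r ** 2)

# trials concatenated along time before computing MI, as described above
envelope = np.concatenate([np.random.randn(600) for _ in range(40)])
meg_comp = np.concatenate([np.random.randn(600) for _ in range(40)])
print(gcmi_1d(envelope, meg_comp))
```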

– l. 213: averaged across hemispheres or across trials?

Across hemispheres. The WPLI is calculated across trials and so cannot be averaged across trials. We have now clarified this.
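For reference, a minimal sketch of the weighted phase-lag index as defined across trials (Vinck et al., 2011); the complex Fourier coefficients below are random placeholders:

```python
# Hypothetical sketch: WPLI at one frequency from per-trial cross-spectra.
import numpy as np

def wpli(spec_a, spec_b):
    """spec_a, spec_b: complex spectra, shape (n_trials,), at one frequency."""
    im = np.imag(spec_a * np.conj(spec_b))        # imaginary cross-spectrum per trial
    return np.abs(im.mean()) / np.abs(im).mean()  # weighted by |Im| magnitudes

rng = np.random.default_rng(0)
a = rng.standard_normal(100) + 1j * rng.standard_normal(100)
b = rng.standard_normal(100) + 1j * rng.standard_normal(100)
print(wpli(a, b))  # near 0 for unrelated signals; approaches 1 for consistent lags
```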

– l. 216: all previous analyses were conducted across the hemispheres?

Yes. The MI and power analyses used principal components incorporating both hemispheres, but for the connectivity analysis we calculated the PCAs for each hemisphere separately and averaged over hemispheres only after calculating the connectivity.

– l. 217: "for the four different frequency bands" this is referring to the frequency bands chosen for the power analysis, not those in the MI analysis?

Yes. This is now clarified. As stated above, it makes more sense to look at generic frequency bands when the analysis is not directly linked to the speech signal.

– Figure 1: were only correct trials analyzed in the MEG analysis? If not, this might be a problem as there were more correct trials in the sentence compared to the word-list condition at the phrasal scale task. Could you add a control analysis on the correct trials only, to make sure that this was not a confound?

We included all trials in the analysis, for several reasons. First, we do not use the tasks in the same way as a traditional working memory task, where only correct trials contain the cognitive process under study. Second, it is not possible to include only the correct trials, as we would end up with too few trials per condition to reliably estimate our effects. Moreover, trials are not controlled for difficulty within a condition. However, to show that our MI effects are not driven by task performance, we included accuracy across participants as a factor; we refer to point 1 for the results of this analysis. We have added this information to the methods:

“Note that different trials within a task were not matched for task difficulty. For example, in the syllable task syllables that make a word are much easier to recognize than syllables that do not make a word. Additionally, trials pertaining to the beginning of the sentence are more difficult than ones related to the end of the sentence due to recency effects.”

– l. 252 ff./Figure 2: As no power peak is observed in the δ (and α) band, this might result in spurious connectivity findings.

Indeed, the absence of a peak makes connectivity findings difficult to interpret, as it is unknown whether we are truly investigating an endogenous oscillation or some other form of connection, which could for example be related to stimulus-evoked responses. If the latter, we could not speak of 'true' connectivity increases, but rather of differential processing of the stimulus. We acknowledge this, but still find this type of change worth reporting. We have therefore added a clear, cautious note to the Results section.

The results now read:

“Note that we did not find a clear peak in the power signal for all frequency bands (only clearly so for the α band); this indicates that the connectivity likely does not reflect endogenous oscillatory activity [29], but might still pertain to connected regions operating at those timescales.”

The discussion now reads:

“Note that even though we found increased connectivity, we did not see a clear power peak in the δ band. This suggests that we might not be looking at an endogenous oscillator, but rather at connections operating at that temporal scale (potentially non-oscillatory in nature).”

– l. 304: with higher connectivity in the phrasal compared to the passive task? Could you add this info?

We have added this info.

– Something went wrong with figure 6.

We have realized this and regret it. The new manuscript should have the figures placed correctly.

– l. 403: alignment of neural oscillations with acoustics at phrasal scale, this reference seems relevant here: Rimmele, Poeppel, Ghitza, 2021.

Agreed, and we have added this reference.

– Wording: l. 387 "in STG".

We have changed this wording accordingly.

Reviewer #3 (Recommendations for the authors):

1. In procedure settings, I am not sure whether the sentence condition is always presented before the word list condition (the misunderstanding comes from lines 126-128). Please add the necessary details and avoid this misunderstanding.

We regret the confusion. We randomized across participants whether the sentence or word list block came first, but kept the order constant within a participant across the four tasks. So an individual would always alternate between a sentence and a word list block, with which one came first counterbalanced across participants. To control for acoustic differences, the word lists contained the same words as the sentences (see e.g. Kaufeld et al., 2020). This cannot have affected the results, as the order is counterbalanced across participants. We have now made clearer in the text how we assigned the blocks.

2. In the MEG analysis, PCA was performed to reduce computational load and increase the data relevance. However, the PCA procedure is not clear. For example, the authors extracted the top 20 PCA components per region. Why 20 components? And why not 10 or 30 components? The details should be clarified.

As the reviewer mentions, our main aim was to reduce the dimensionality of the analysis (to reduce computational load) while keeping enough of the variance relevant to the analysis. We are not aware of an established way to determine the best number of PCA components and can therefore only look at the explained variance of the components. As our goal was to reduce computational load, we wanted to keep as much of the original variance in the analysis as possible. Author response image 2 shows the cumulative explained variance of the components: at around 20 components, over 99.9% of the variance is explained, which seems a sensible amount to keep.

Author response image 2
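As an aside, a minimal sketch of this selection criterion (keep the smallest number of components whose cumulative explained variance exceeds 99.9%); the data generation and shapes are illustrative assumptions:

```python
# Hypothetical sketch: choose the PCA dimensionality from cumulative variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
latent = rng.standard_normal((24000, 20))             # ~20 underlying sources (assumed)
mixing = rng.standard_normal((20, 120))
X = latent @ mixing + 0.01 * rng.standard_normal((24000, 120))  # low-rank data + noise

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.999) + 1)      # first count reaching 99.9%
print(n_keep, cumvar[n_keep - 1])
```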

To ensure that our results are robust to the exact number of components, we repeated the analysis using 10 and 30 components. No large qualitative differences were visible, although 10 components did not seem sufficient to show the original effects: the IFG task effect and the MTG task*condition effect were only trend-level significant with 10 components (p = 0.06 and p = 0.1, respectively). The condition effect remained significant for all ROIs. With 30 components, all effects (main and interaction; IFG task effect: p = 0.034; MTG task*condition interaction: p = 0.0064) remained significant. We therefore believe that the results are robust to the exact number of PCA components, though more than 10 components are required.

We now discuss the effect of the number of PCA components in the text. The text now reads:

“Overall, the number of PCA components did not influence any of the qualitative differences between conditions. It did seem, however, that 10 PCA components were not sufficient to show all original effects with the same power. Specifically, the IFG task and MTG task*condition effects were only trend-level significant with 10 components (p = 0.06 and p = 0.1, respectively). The other effects remained significant with 10 components. Using 30 components made some of our effects stronger than with 20 components; here, the IFG task and MTG task*condition effects had p-values of 0.034 and 0.006, respectively. We conclude that the number of PCA components did not qualitatively change any of our reported effects.”

https://doi.org/10.7554/eLife.77468.sa2

Article and author information

Author details

  1. Sanne ten Oever

    1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. Language and Computation in Neural Systems group, Donders Centre for Cognitive Neuroimaging, Nijmegen, Netherlands
    3. Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, Netherlands
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing - original draft, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID iD: 0000-0001-7547-5842
  2. Sara Carta

    1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. ADAPT Centre, School of Computer Science and Statistics, University of Dublin, Trinity College, Dublin, Ireland
    3. CIMeC - Center for Mind/Brain Sciences, University of Trento, Trento, Italy
    Contribution
    Data curation, Investigation, Methodology, Software, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Greta Kaufeld

    Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    Contribution
    Conceptualization, Data curation, Methodology, Software, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Andrea E Martin

    1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. Language and Computation in Neural Systems group, Donders Centre for Cognitive Neuroimaging, Nijmegen, Netherlands
    Contribution
    Conceptualization, Formal analysis, Funding acquisition, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – review and editing
    For correspondence
    Andrea.Martin@mpi.nl
    Competing interests
    Reviewing editor, eLife
ORCID iD: 0000-0002-3395-7234

Funding

Max-Planck-Gesellschaft (Lise Meitner Research Group "Language and Computation in Neural Systems")

  • Andrea E Martin

Max-Planck-Gesellschaft (Independent Research Group "Language and Computation in Neural Systems")

  • Andrea E Martin

Nederlandse Organisatie voor Wetenschappelijk Onderzoek (016.Vidi.188.029)

  • Andrea E Martin

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

Participants performed a screening for their eligibility in the MEG and MRI and gave written informed consent. The study was approved by the Ethical Commission for human research Arnhem/Nijmegen (project number CMO2014/288). Participants were reimbursed for their participation.

Senior Editor

  1. Barbara G Shinn-Cunningham, Carnegie Mellon University, United States

Reviewing Editor

  1. Jonathan Erik Peelle, Washington University in St. Louis, United States

Reviewer

  1. Johanna Rimmele, Max-Planck-Institute for Empirical Aesthetics, Germany

Publication history

  1. Received: January 31, 2022
  2. Preprint posted: February 10, 2022 (view preprint)
  3. Accepted: June 25, 2022
  4. Version of Record published: July 14, 2022 (version 1)
  5. Version of Record updated: July 15, 2022 (version 2)

Copyright

© 2022, ten Oever et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Sanne ten Oever
  2. Sara Carta
  3. Greta Kaufeld
  4. Andrea E Martin
(2022)
Neural tracking of phrases in spoken language comprehension is automatic and task-dependent
eLife 11:e77468.
https://doi.org/10.7554/eLife.77468
