A tradeoff between acoustic and linguistic feature encoding in spoken language comprehension

  1. Filiz Tezcan (corresponding author)
  2. Hugo Weissbart
  3. Andrea E Martin
  1. Language and Computation in Neural Systems Group, Max Planck Institute for Psycholinguistics, Netherlands
  2. Donders Centre for Cognitive Neuroimaging, Radboud University, Netherlands

Abstract

When we comprehend language from speech, the phase of the neural response aligns with particular features of the speech input, resulting in a phenomenon referred to as neural tracking. In recent years, a large body of work has demonstrated the tracking of the acoustic envelope and abstract linguistic units at the phoneme and word levels, and beyond. However, the degree to which speech tracking is driven by acoustic edges of the signal, or by internally-generated linguistic units, or by the interplay of both, remains contentious. In this study, we used naturalistic story-listening to investigate (1) whether phoneme-level features are tracked over and above acoustic edges, (2) whether word entropy, which can reflect sentence- and discourse-level constraints, impacted the encoding of acoustic and phoneme-level features, and (3) whether the tracking of acoustic edges was enhanced or suppressed during comprehension of a first language (Dutch) compared to a statistically familiar but uncomprehended language (French). We first show that encoding models with phoneme-level linguistic features, in addition to acoustic features, uncovered an increased neural tracking response; this signal was further amplified in a comprehended language, putatively reflecting the transformation of acoustic features into internally generated phoneme-level representations. Phonemes were tracked more strongly in a comprehended language, suggesting that language comprehension functions as a neural filter over acoustic edges of the speech signal as it transforms sensory signals into abstract linguistic units. We then show that word entropy enhances neural tracking of both acoustic and phonemic features when sentence- and discourse-context are less constraining. When language was not comprehended, acoustic features, but not phonemic ones, were more strongly modulated, but in contrast, when a native language is comprehended, phoneme features are more strongly modulated. 
Taken together, our findings highlight the flexible modulation of acoustic and phonemic features by sentence- and discourse-level constraint in language comprehension, and document the neural transformation from speech perception to language comprehension, consistent with an account of language processing as a neural filter from sensory to abstract representations.

Editor's evaluation

This study addresses a fundamental aspect of human speech processing: namely, how acoustic and linguistic features interact during comprehension. The authors present convincing evidence that helps elucidate the role of language experience on neural processing, re-weighting processing of speech based on whether a listener understands the language being spoken.

https://doi.org/10.7554/eLife.82386.sa0

Introduction

When we understand spoken language, we transform a continuous physical signal into structured meaning. In order to achieve this, we likely capitalize on stored, previously-learned linguistic representations and other forms of knowledge. During this process, systematic changes in the acoustic signal are categorized by the brain as phonemes, a certain ordering of consecutive phonemes creates words, and a certain ordering of words creates phrases and sentences. As these transformations occur, the phase of the neural signal is thought to align with temporal landmarks in both acoustic and linguistic dimensions; this phenomenon is referred to as neural tracking (Luo and Poeppel, 2007; Di Liberto et al., 2015; Brodbeck et al., 2018; Broderick et al., 2018; Daube et al., 2019; Keitel et al., 2018; Donhauser and Baillet, 2020; Gillis et al., 2021; Kaufeld et al., 2020a; Weissbart et al., 2020; Brodbeck et al., 2022; Heilbron et al., 2020; Coopmans et al., 2022; Slaats et al., 2023; Ten Oever et al., 2022a; Zioga et al., 2023). Neural tracking has been proposed as a mechanism to segment the acoustic input into linguistic units such as phonemes, syllables, and morphemes (Giraud and Poeppel, 2012; Ghitza, 2013; Ding et al., 2016) by aligning neural excitability cycles to the incoming input. One way to quantify the degree of neural tracking of speech features is to use linear encoding models, also known as Temporal Response Functions (TRFs). Adding sub-lexical, lexical, and phrase- and sentence-level predictors, quantified either with information-theoretic metrics (e.g. surprisal and entropy) or with syntactic annotations, to the encoding model improves the reconstruction accuracy of neural responses on distinct sources in the brain (see Brodbeck et al., 2022), and surprisal and entropy predictors have been shown to modulate the neural response at different time intervals (see Donhauser and Baillet, 2020). 
However, whether acoustic features and higher level linguistic units are encoded with similar veracity during comprehension, and how they impact each other’s encoding is still poorly understood. In this study, we investigated the effect of language comprehension in a first language (Dutch) on the tracking of acoustic and phonemic features, contrasted with a statistically familiar but uncomprehended language (French), during naturalistic story listening. Contrasting a comprehended language with a language that is familiar, but not understood, allows us to separate neural signals related to speech processing vs. those related to speech processing in the service of language comprehension. We asked three questions: (1) whether phoneme-level features contribute to neural encoding even when acoustic contributions are carefully controlled, as a function of language comprehension, (2) whether sentence- and discourse-level constraints on lexical information operationalized as word entropy impacted the encoding of acoustic and phoneme-level features, and (3) whether tracking of acoustic landmarks (viz., acoustic edges) was enhanced or suppressed as a function of comprehension. We found that acoustic features are enhanced when the spoken language stimulus is not understood, but that phonemic features are more strongly encoded when language is understood, consistent with an account where gain modulation, in the form of enhancement or suppression, expresses the behavioral goal of the brain (Martin, 2016; Martin, 2020). We found that both phonemic and acoustic features are more strongly encoded when the context is less constraining, as exemplified by word entropy. 
Finally, and in contrast with extant arguments in the literature about the pre-activation or enhancement of low-level representations by contextual information (DeLong et al., 2005; McClelland and Rumelhart, 1981; Rumelhart and McClelland, 1982; Nieuwland et al., 2018; Nieuwland, 2019), we found that acoustic-edge processing appears to be suppressed when language comprehension occurs, compared to when it does not, and that both acoustic and phonemic features are suppressed when entropy is low within a comprehended language (viz., the context is more constraining and a word is more expected). This pattern of results is consistent with acoustic edges being important for auditory and speech processing, but speaks against their functioning as computational landmarks during language comprehension, the behavioral and neural goal of speech processing.

During comprehension, contextual information from higher level linguistic units such as words, phrases, and sentences likely affects the neural representation and processing of lower level units like phonemes and acoustic features. A wealth of behavioral and neural data from both spoken and visual language comprehension contexts has been leveraged to illustrate the incremental processing of incoming perceptual input within existing sentence and discourse context. During spoken language comprehension, acoustic, phonemic, phonological, and prosodic information is dynamically integrated with morphemic, lexical, semantic, and syntactic information (e.g., Bai et al., 2022; Kaufeld et al., 2020b; Kaufeld et al., 2020c; Marslen-Wilson and Welsh, 1978; Friederici, 2002; Hagoort, 2013; Martin et al., 2017; Oganian et al., 2023; Martin, 2016; Martin, 2020). Models with top-down and bottom-up information flow have been invoked to account for how sensory information could be integrated with internally-generated knowledge or information, which different models express as statistical priors, distributional knowledge, or abstract structure (McClelland and Rumelhart, 1981; Rumelhart and McClelland, 1982; Rao and Ballard, 1999; Lee and Mumford, 2003; Friston, 2005; Martin, 2016; Martin and Doumas, 2019; Martin, 2020; Ten Oever and Martin, 2021; Ten Oever et al., 2022a; Ten Oever et al., 2022b). In word recognition models such as the Interactive Activation Model (McClelland and Rumelhart, 1981) and TRACE (Rumelhart and McClelland, 1982), feedback connections from the word level to the phoneme level enable faster activation of phonemes in words than of phonemes in nonwords, by either enhancing the representation of expected phonemes or suppressing competing phonemes. 
Even though these models can account for phenomena such as the word-superiority effect, that is, faster detection of a letter on a masked visual image when it is presented in a word than in an unpronounceable nonword or in isolation (Reicher, 1969; Wheeler, 1970; Mewhort, 1967), they do not incorporate effects of sentence- and discourse-level constraints. It has also been demonstrated that mispronounced or ambiguous phonemes are more likely to be missed by participants during a detection task when the context is more constraining, indicating that top-down processing constraints interact with bottom-up sensory information to reduce the number of possible word candidates as the acoustic input unfolds (Marslen-Wilson and Welsh, 1978; Martin et al., 2017; Martin and Doumas, 2017).

Even though the effects of sentence-level and word-level context on sub-lexical representations have been widely studied in behavioral experiments, studies that investigate the neural readout of perceptual modulation by context are still very limited for non-degraded or naturally produced spoken stimuli. Martin (2016, 2020) suggested that cues from each hierarchical level are weighted according to their reliabilities and integrated simultaneously by gain modulation in the form of selective amplification and inhibition. When the acoustic edges of the speech signal reach a threshold for the activation of a particular phoneme representation, the sensory representations of the acoustic edges are inhibited. On a slower timescale, the resulting representations of phonemes activate items in the lexicon which then, in turn, suppress the phonemic representations once lexical access is achieved. Intelligibility of the speech signal determines the reliability of the acoustic signal, and sentence- and discourse-level constraint determines the reliability of lexical access and structure formation. According to this theoretical model, when context is less constraining, sub-lexical representations are less inhibited than in a highly constraining context.

Linear encoding models have made it possible to investigate how certain features of phonemes and words affect the neural response during naturalistic language comprehension. However, one of the important challenges in this modeling method is the difficulty of determining whether the features tracked by the brain are only acoustic changes or changes in linguistic units as well (Daube et al., 2019). For example, the beginning of a word may coincide with the beginning of a phoneme, which is also a sudden change in the envelope of the acoustic signal. It has also been argued that neural tracking reflects the convolution of evoked responses to changes in the acoustic envelope, referred to as acoustic edges (Oganian et al., 2023). To overcome this, features that can cause changes in the brain signal should be modeled together in a multivariate model, and, to isolate the effect of each feature, features should be added to the models incrementally (Brodbeck et al., 2021). Most studies using linear encoding models have shown tracking of either the envelope of the acoustic signal or linguistic units at the word and phoneme levels; however, only a few of them controlled for acoustic features. In this study, we added each phoneme feature on top of the acoustic base model incrementally and compared the gain in reconstruction accuracy attributable to that feature.
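The incremental logic described above can be illustrated with a minimal sketch on simulated data. It is not the authors' pipeline: ridge regression is used here as a stand-in estimator, the signals, kernels, lag count, noise level, and single train/test split are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import random


def lagged_design(features, n_lags):
    """One column per (feature, lag); samples before t=0 are zero-padded."""
    n_times = len(features[0])
    return [[feat[t - lag] if t - lag >= 0 else 0.0
             for feat in features for lag in range(n_lags)]
            for t in range(n_times)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ridge(X, y, ridge):
    """TRF weights via ridge regression: w = (X'X + ridge * I)^-1 X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(X, w):
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


# Simulated stimulus features and response (all values are assumptions).
rng = random.Random(0)
n_times, n_lags = 600, 5
envelope = [rng.gauss(0, 1) for _ in range(n_times)]
onsets = [1.0 if rng.random() < 0.2 else 0.0 for _ in range(n_times)]
true_w = [0.0, 0.5, 1.0, 0.5, 0.0] + [0.0, 1.0, 0.5, 0.0, 0.0]
X_base = lagged_design([envelope], n_lags)          # acoustic base model
X_full = lagged_design([envelope, onsets], n_lags)  # base + phoneme onsets
y = [v + rng.gauss(0, 0.5) for v in predict(X_full, true_w)]

half = n_times // 2


def held_out_accuracy(X):
    """Fit on the first half, correlate prediction with data on the second."""
    w = fit_ridge(X[:half], y[:half], ridge=1.0)
    return pearson(predict(X[half:], w), y[half:])


acc_base = held_out_accuracy(X_base)
acc_full = held_out_accuracy(X_full)
gain = acc_full - acc_base  # accuracy attributable to the added feature
print(f"base={acc_base:.3f} full={acc_full:.3f} gain={gain:.3f}")
```

Because the added predictor is evaluated on held-out data against a model that already contains the acoustic feature, the gain reflects variance that the acoustic feature alone cannot explain.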

Another limitation of the linear encoding method is that only the linear relationship between changes in the features and the amplitude of the brain signal can be observed. When modeling the linear relationship between the conditional probability of phonemes in words and the brain signal, the effect of phoneme features on the brain signal is assumed to be temporally and topologically constant. However, previous studies have shown that the intensity of the sound stimulus and the constraints placed on words by the sentence context have an effect on the latency of the brain signal. Although it is not possible to model this dynamic relationship continuously with linear models, stimulus features can be separated according to certain properties and modeled separately (Drennan and Lalor, 2019). Thus, it is possible to model how sentence and discourse context affects the tracking of lower-level features by separating the predictors into high and low constraining context conditions, denoted by differences in word entropy. Figure 1 shows how predictors are separated into low and high word entropy conditions.

Schematic of TRF models and features used in models.

(A) All speech features are divided into high and low word entropy conditions (B) TRFs for each brain source are generated by the linear regression model that estimates the source localized MEG signal from speech features, then they are averaged over sources.
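As a rough illustration of the split in panel A, the sketch below divides one impulse predictor into a high-entropy and a low-entropy copy. The median split, the event format, and all names and values are assumptions made for this example; the section does not state the threshold that was actually used.

```python
from statistics import median


def split_by_word_entropy(events, word_entropy, n_samples):
    """Split one impulse predictor into high- and low-entropy copies.

    events: list of (sample_index, value, word_id) tuples, one per phoneme.
    word_entropy: dict mapping word_id to the entropy of that word.
    The median split is an assumption made for this sketch.
    """
    threshold = median(word_entropy.values())
    high = [0.0] * n_samples
    low = [0.0] * n_samples
    for sample, value, word_id in events:
        target = high if word_entropy[word_id] > threshold else low
        target[sample] = value
    return high, low


# Hypothetical toy input: four words, five phoneme events.
word_entropy = {0: 2.1, 1: 5.3, 2: 1.2, 3: 4.8}
events = [(2, 1.0, 0), (5, 1.0, 0), (9, 1.0, 1), (12, 1.0, 2), (15, 1.0, 3)]
high, low = split_by_word_entropy(events, word_entropy, n_samples=20)
```

Each event lands in exactly one of the two predictors, so their sum recovers the original predictor and the two conditions can be given separate TRFs within the same model.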

To this end, we made use of an MEG dataset of native Dutch speakers with no or minimal comprehension of French (i.e. self-report of no comprehension, <1 year of education in French, chance performance on comprehension questions) that was recorded while they listened to audiobooks in Dutch and French. We then used the linear encoding modeling method to predict the MEG signal from the acoustic and phoneme-level speech features, including the spectrogram, acoustic edges, phoneme onset, phoneme surprisal, and phoneme entropy, separately for words with high and low word entropy. Word entropy was used to quantify the sentence- and discourse-level constraint for each word given the preceding 30 words. Phoneme surprisal quantifies the conditional probability of each phoneme in a word, calculated according to the probability distribution over the lexicon weighted by the occurrence frequency of each word. Phoneme entropy reflects how constraining each phoneme is by quantifying the uncertainty about the next phoneme with each unfolding phoneme. If sentence comprehension modulates phoneme-level processing, then the reconstruction accuracies should differ in two important contrasts. First, there should be an effect of comprehended language on whether words with high and low entropy affect phoneme processing; second, within the comprehended language, high and low entropy words should differently affect phoneme features. For example, when participants are listening to an audiobook in a language they can understand, they can encounter a word like ‘interval’ or ‘internet’ in a high or low constraining sentence. The occurrence frequency of the word ‘internet’ is higher than that of ‘interval’, and the conditional probability of hearing the phoneme ‘n’ after ‘ɪnˈtɜː(r)’ is also higher than that of the phoneme ‘v’. That means the surprisal of the phoneme ‘n’, which is calculated as the negative log probability, is lower. 
Representations for phonemes might be different when the reliabilities of contextual cues are different. In a sentence such as ‘You can find anything on the internet’ phoneme features might be more important and more informative compared to a sentence like ‘On campus, you can connect to wireless internet’ because in the first sentence it is more difficult to predict which word is coming given the previous words before the word ‘internet’ in the sentence, so participants may rely more on the phoneme-level features.
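The ‘internet’/‘interval’ example can be made concrete with a small sketch of frequency-weighted phoneme surprisal and next-phoneme entropy. The lexicon and its counts are hypothetical, letters stand in for phonemes, and the function names are illustrative only; the real measures are computed over a full pronunciation lexicon.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon: word -> occurrence count (letters stand in for phonemes).
LEXICON = {"internet": 90, "interval": 10, "into": 100}


def next_phoneme_distribution(prefix, lexicon):
    """P(next phoneme | prefix), weighting each candidate word by its frequency."""
    mass = defaultdict(float)
    for word, freq in lexicon.items():
        if word.startswith(prefix) and len(word) > len(prefix):
            mass[word[len(prefix)]] += freq
    total = sum(mass.values())
    return {ph: m / total for ph, m in mass.items()}


def phoneme_surprisal(prefix, phoneme, lexicon):
    """Negative log2 probability of `phoneme` given the phonemes heard so far."""
    return -math.log2(next_phoneme_distribution(prefix, lexicon)[phoneme])


def phoneme_entropy(prefix, lexicon):
    """Uncertainty (in bits) about the next phoneme given the prefix."""
    dist = next_phoneme_distribution(prefix, lexicon)
    return -sum(p * math.log2(p) for p in dist.values())


s_n = phoneme_surprisal("inter", "n", LEXICON)  # continuation toward 'internet'
s_v = phoneme_surprisal("inter", "v", LEXICON)  # continuation toward 'interval'
h = phoneme_entropy("inter", LEXICON)
print(f"surprisal(n)={s_n:.3f} surprisal(v)={s_v:.3f} entropy={h:.3f}")
```

With these counts the more frequent continuation receives the lower surprisal, mirroring the relation described in the text. The sketch only handles prefixes that have at least one continuation in the lexicon.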

Our aim in this current study was to investigate the following questions: (1) Can we measure neural tracking of phoneme- and word-level features when acoustic features are properly controlled? (2) Can we demonstrate that products of language comprehension (viz., words in a sentence and story context) modulate the encoding of lower-level linguistic cues (acoustic and phoneme level features) under normal listening conditions? And (3) does the encoding of acoustic information change as a function of comprehension, reflecting perceptual modulation of neural encoding in the service of behavioral goals? In order to address these questions, we chose to compare the neural response between a comprehended first language (Dutch) and an uncomprehended but acoustically familiar one (French). Having a high-familiarity, statistically known control language allows us to model the neural response to acoustic and phoneme-level features and contrast it to a case where there is comprehension versus where there is not. This is stronger control than examining the neural response to an unknown, statistically unfamiliar language, whose modulation of the neural response may mask the degree to which acoustic and phoneme-level processing dominate the neural signal during language comprehension.

Results

Behavioral results

To evaluate whether the stories were comprehended by participants, we compared the percentage of correct answers to comprehension questions after each story part. Participants answered 88% (SD = 7%) of the questions about Dutch stories correctly, significantly above the chance level of 25% (t(46)=39.45, p<0.0001), and 25% (SD = 11%) of the questions about French stories correctly, which was not significantly different from the chance level (t(46)=0.44, p=0.66).

Tracking performance of linguistic features

First, to investigate whether linguistic units are tracked by the brain signal even when acoustic features are controlled, we compared the reconstruction accuracies averaged over all source points to see whether adding each feature increased the reconstruction accuracy. To assess the individual contribution of each feature toward reconstruction accuracy, we conducted a stepwise analysis by fitting various TRF models. We added each feature sequentially to an acoustic base model, and then calculated the difference in accuracy between the model with the feature of interest and the previous model without that feature. Detailed information can be found in Table 11 in the Materials and methods section. We then fitted a linear mixed model (LMM) for reconstruction accuracies with a random intercept for subjects and random slopes for the fixed effects. Independent variables were language (Dutch and French) and model (Acoustic, Phoneme Onsets, Phoneme Surprisal, Phoneme Entropy, and Word Frequency). The LMM with random slopes for both models and language did not converge, as the reconstruction accuracies were highly correlated, so we only fitted a random slope for language. We used backward difference coding for the model contrasts, so each contrast shows the difference between consecutive models (e.g., in Table 1, the Phoneme Onset row shows the contrast between the accuracies of the Phoneme Onset model and the Acoustic model). To evaluate whether adding language, models, and their interaction as fixed effects increased predictive accuracy, we compared LMMs with and without these effects using R’s anova() function.
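To show why backward difference coding yields coefficients that equal differences between consecutive models, the sketch below builds the standard coding matrix and checks that level means are recovered from the grand mean plus the coded successive differences. The accuracy values are hypothetical and the function name is illustrative.

```python
def backward_difference_matrix(k):
    """Backward (successive) difference coding for a factor with k levels.

    Row i is a level, column j codes the contrast level j+1 minus level j:
    levels up to j get -(k - j - 1) / k, later levels get (j + 1) / k.
    """
    return [[(j + 1) / k if i > j else -(k - j - 1) / k
             for j in range(k - 1)]
            for i in range(k)]


# Hypothetical mean accuracies for five nested models (arbitrary units).
means = [1.0, 1.5, 2.5, 2.7, 3.0]
k = len(means)
C = backward_difference_matrix(k)

grand_mean = sum(means) / k                            # plays the role of the intercept
diffs = [means[j + 1] - means[j] for j in range(k - 1)]  # contrast coefficients

# Each level mean equals the intercept plus the coded differences.
reconstructed = [grand_mean + sum(C[i][j] * diffs[j] for j in range(k - 1))
                 for i in range(k)]
```

Under this coding the intercept is the grand mean across models, and each remaining coefficient is read directly as the accuracy change from one model to the next.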

Table 1
LMM results of reconstruction accuracies for Dutch and French stories.
Term                                        Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)                                 2.86E-03    2.38E-04     12.00     2.16E-11 ***
Language (French – Dutch)                   –7.70E-04   1.57E-04     –4.91     5.64E-05 ***
Phon. Onset – Acoustic                      5.50E-05    1.14E-05     4.82      3.04E-06 ***
Phon. Surprisal – Phon. Onset               7.90E-05    1.14E-05     6.91      7.51E-11 ***
Phon. Entropy – Phon. Surprisal             1.33E-04    1.14E-05     11.63     2.00E-16 ***
Word Frequency – Phon. Entropy              1.48E-04    1.14E-05     12.95     2.00E-16 ***
Language: Phon. Onset – Acoustic            –3.64E-05   1.62E-05     –2.25     2.55E-02 *
Language: Phon. Surprisal – Phon. Onset     –5.97E-05   1.62E-05     –3.70     2.88E-04 ***
Language: Phon. Entropy – Phon. Surprisal   –1.04E-04   1.62E-05     –6.45     9.83E-10 ***
Language: Word Frequency – Phon. Entropy    –1.30E-04   1.62E-05     –8.07     9.12E-14 ***
Significance codes: ****<0.0001, ***<0.001, **<0.01, *<0.05.

The formulas used for the LMMs were then:

  • LMM1: Accuracy ~ Language * Models + (1 + Language | subject)

  • LMM2: Accuracy ~ Language + Models + (1 + Language | subject)

  • LMM3: Accuracy ~ Models + (1 | subject)

  • LMM4: Accuracy ~ 1 + (1 | subject)

LMM comparison showed that Models (LMM3 – LMM4: Δχ2 = 82.79, p<0.0001; LMM3 Bayesian Information Criterion (BIC): –3633.1, LMM4 BIC: –2787.5), Language (LMM2 – LMM3: Δχ2 = 862.02, p<0.0001; LMM2 BIC: –3693.9), and their interaction (LMM1 – LMM2: Δχ2 = 71.63, p<0.0001; LMM1 BIC: –3743.7) predicted the averaged reconstruction accuracies. We report the results of LMM1 in Table 1, as it was the LMM with the most predictive power and the lowest BIC.

Averaged reconstruction accuracies were significantly higher in Dutch stories. Each feature incrementally increased the averaged reconstruction accuracy compared to the previous model, and there was a significant interaction between Language and Models (Table 1).

We then ran two separate mixed-effect models for French and Dutch stories, with a random intercept for subjects and a random slope for models. Each feature incrementally increased the averaged reconstruction accuracy compared to the previous model for each language (Figure 2A, Figure 2B, Table 2 and Table 3).

Table 2
LMM results of reconstruction accuracies for Dutch stories.
Term                                  Estimate    Std. Error   df       t value   Pr(>|t|)
(Intercept)                           2.86E-03    2.38E-04     23.1     11.99     2.14E-11 ***
Phoneme Onset – Acoustic              5.50E-05    1.49E-05     92.0     3.70      3.63E-04 ***
Phoneme Surprisal – Phoneme Onset     7.90E-05    1.49E-05     92.0     5.32      7.41E-07 ***
Phoneme Entropy – Phoneme Surprisal   1.33E-04    1.49E-05     92.0     8.94      3.82E-14 ***
Word Frequency – Phoneme Entropy      1.48E-04    1.49E-05     92.0     9.96      2.84E-16 ***

Table 3
LMM results of reconstruction accuracies for French stories.
Term                                  Estimate    Std. Error   df       t value   Pr(>|t|)
(Intercept)                           2.09E-03    2.09E-04     23.0     10.00     7.54E-10 ***
Phoneme Onset – Acoustic              1.86E-05    6.34E-06     92.0     2.94      4.15E-03 **
Phoneme Surprisal – Phoneme Onset     1.93E-05    6.34E-06     92.0     3.04      3.09E-03 **
Phoneme Entropy – Phoneme Surprisal   2.88E-05    6.34E-06     92.0     4.54      1.70E-05 ***
Word Frequency – Phoneme Entropy      1.76E-05    6.34E-06     92.0     2.78      6.54E-03 **
Significance codes: ****<0.0001, ***<0.001, **<0.01, *<0.05.

Model accuracy comparison between Dutch and French stories (n=24).

(A) Accuracy improvement (averaged over the sources in the whole brain) by each feature for Dutch stories. (B) Accuracy improvement (averaged over the sources in the whole brain) by each feature for French stories. Braces in panels A and B show the significance values of the contrasts (difference between consecutive models, ****<0.0001, ***<0.001, **<0.01, *<0.05) in the linear mixed effect models (Tables 2 and 3). Error bars show within-subject standard errors. (C) Source points where accuracies of the base acoustic model were significantly different from 0. (D) Source points where reconstruction accuracies of the model were significantly different from those of the previous model. Accuracy values show how much each linguistic feature increased the reconstruction accuracy compared to the previous model.

Then, we identified the source points at which each feature changed the reconstruction accuracy compared to the previous model, using a mass-univariate two-tailed related-samples t-test with threshold-free cluster enhancement (TFCE). Figure 2C shows the sources where reconstruction accuracies of the base acoustic model were significantly different from zero. For both languages, acoustic features were tracked in both hemispheres around the language network. Figure 2D shows the sources where each feature incrementally increased the reconstruction accuracy compared to the previous model. To test for lateralization, we fitted a linear mixed effect model to the accuracy improvement by each linguistic feature, taking the average of the contrasts shown in Figure 2D for each hemisphere. We used the following formulas for the LMMs.

  • LMM1: Accuracy ~ Hemisphere + (1 | subject)

  • LMM2: Accuracy ~ 1 + (1 | subject)

We compared LMMs with and without the Hemisphere effect using R’s anova() function. LMM comparison showed that Hemisphere (LMM1 – LMM2: Δχ2 = 4.03, p<0.05; LMM1 Bayesian Information Criterion (BIC): –3331.5, LMM2 BIC: –3329.5) predicted the averaged reconstruction accuracies in Dutch stories but not in French stories. We report the results of LMM1 in Table 4 and Table 5.
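The model comparisons reported here are likelihood-ratio tests between nested models. The sketch below shows the arithmetic for the simplest case, assuming that the Hemisphere comparison adds a single parameter (one degree of freedom); that assumption, the function names, and the example log-likelihoods are not taken from the source.

```python
import math


def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood-ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For df = 1, P(X > x) = erfc(sqrt(x / 2)); other df need a different formula.
    """
    return math.erfc(math.sqrt(x / 2.0))


# Reported statistic for the Hemisphere effect in Dutch stories.
delta_chi2 = 4.03
p_value = chi2_sf_df1(delta_chi2)  # assumes one added parameter
print(f"delta chi2 = {delta_chi2}, p = {p_value:.4f}")

# Hypothetical log-likelihoods, only to show how the statistic is formed.
example_stat = lrt_statistic(loglik_reduced=-10.0, loglik_full=-8.0)
```

Under the one-degree-of-freedom assumption the resulting p-value falls just below 0.05, in line with the reported threshold.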

Table 4
LMM results of accuracy improvement by linguistic features for Dutch stories.
Term               Estimate    Std. Error   df       t value   Pr(>|t|)
(Intercept)        4.22E-05    6.83E-06     31.58    6.18      6.83E-07 ***
hemisphere_right   –1.06E-05   5.26E-06     167.00   –2.02     4.55E-02 *
Significance codes: ****<0.0001, ***<0.001, **<0.01, *<0.05.

Table 5
LMM results of accuracy improvement by linguistic features for French stories.
Term               Estimate    Std. Error   df       t value   Pr(>|t|)
(Intercept)        5.16E-06    3.05E-06     41.24    1.69      9.80E-02 .
hemisphere_right   –1.52E-06   3.10E-06     167.00   –0.49     6.26E-01
Significance codes: ****<0.0001, ***<0.001, **<0.01, *<0.05.

In Dutch stories, linguistic features increased the reconstruction accuracy mostly in the left hemisphere. In French stories, however, only Phoneme Onset and Entropy slightly increased the reconstruction accuracies, and we did not find any significant lateralization effect.

Effect of sentence context on neural tracking

To investigate the second question, namely how higher-level cues (sentence and discourse constraint, embodied by word entropy) interact with lower-level cues (acoustic- and phoneme-level features), we grouped the words in each story into low and high entropy conditions. TRFs including all features were estimated for each condition and language on each hemisphere. To isolate the effect of contextual constraint on phoneme features, we compared the reconstruction accuracies of high and low entropy words after subtracting the reconstruction accuracy of the model with all features except phoneme features from that of the full model. Similarly, to isolate the effect of contextual constraint on acoustic edges, we subtracted the reconstruction accuracy of the model with all features except acoustic edges from that of the full model. To compare the reconstruction accuracies averaged over all brain sources in each hemisphere, we fitted a linear mixed model with a random intercept for subjects and random slopes for word entropy, language, and hemisphere. The LMM with random slopes for all effects did not converge, so we fitted random slopes for language and hemisphere only. As we were interested in the interaction between language and word entropy, we evaluated whether adding this interaction increased predictive accuracy by comparing models with and without the interaction using ANOVA. The LMM formulas read:

LMM1: Accuracy ~ Language + Word Entropy + Hemisphere + Word Entropy:Language + (1 + Language + Hemisphere | Subject)
LMM2: Accuracy ~ Language + Word Entropy + Hemisphere + (1 + Language + Hemisphere | Subject)

Model comparison showed that the interaction between language and word entropy predicted the averaged reconstruction accuracies of phoneme features (LMM1 vs LMM2: Δχ2 = 15.315, p<0.00001) and also of acoustic edges (LMM1 vs LMM2: Δχ2 = 33.92, p<0.00001). We found significant main effects of Language (p<0.001) and Word Entropy (p<0.0001), and a significant interaction between them (p<0.0001). There was a significant main effect of Hemisphere for Acoustic Edges (p=0.045) but not for Phoneme Features. The interaction between language and word entropy for reconstruction accuracies averaged over the whole brain is shown in Figure 3A for acoustic edges and Figure 3B for phoneme features. LMM results are shown in Table 6 and Table 7.

Table 6
LMM results of reconstruction accuracies for Phoneme features.
Term                       Estimate    Std. Error   df       t value   Pr(>|t|)
(Intercept)                8.68E-04    8.17E-05     26.4     10.62     5.11E-11 ***
French                     –8.49E-04   6.75E-05     35.1     –12.58    1.46E-14 ***
Low Word Entropy           –2.69E-04   4.21E-05     118.0    –6.38     3.65E-09 ***
Right Hemisphere           1.22E-04    9.96E-05     23.0     1.23      2.33E-01
French: Low Word Entropy   3.74E-04    5.96E-05     118.0    6.29      5.69E-09 ***

Table 7
LMM results of reconstruction accuracies for Acoustic Edges.
Term                       Estimate    Std. Error   df       t value   Pr(>|t|)
(Intercept)                6.83E-05    1.66E-05     32.8     4.11      2.45E-04 ***
French                     8.85E-05    1.10E-05     141.0    8.03      3.48E-13 ***
Low Word Entropy           –6.08E-05   1.10E-05     141.0    –5.52     1.61E-07 ***
Right Hemisphere           4.11E-05    1.94E-05     23.0     2.12      4.49E-02 *
French: Low Word Entropy   –9.54E-05   1.56E-05     141.0    –6.12     8.76E-09 ***
Significance codes: ****<0.0001, ***<0.001, **<0.01, *<0.05.

Figure 3
First 4 Dutch Story Parts (n=24).

Light orange and light green represent the Low Word Entropy condition, and dark orange and dark green represent the High Word Entropy condition, for Dutch and French stories, respectively. (A) Reconstruction accuracy interaction between word entropy and language for acoustic features. (B) Reconstruction accuracy interaction between word entropy and language for phoneme features. (Braces in panels A and B indicate a significant difference between the high and low entropy word conditions, ****<0.0001, ***<0.001, **<0.01, *<0.05. Error bars show within-subject standard errors.) (C-D) Acoustic Edge TRFs on the left hemisphere (LH) and right hemisphere (RH). (E-F) Phoneme Features TRFs on LH and RH. Lines in panels C-F show the mean, and shaded areas show the standard error of the mean. (G) Sources where the main effects of Language and Word Entropy, and their interaction, were found.

To examine the contribution of each feature to tracking performance, we compared the TRFs of each feature by running a mass-univariate repeated measures ANOVA on source data to test the main effects of language and word entropy, while also modeling their interaction. Before the statistical analysis, we took the power of the TRF weights and smoothed them over 2 voxels (Gaussian window, SD = 14 mm) to compensate for head movements of participants. The multiple comparisons problem was handled with a cluster-level permutation test across space and time with 8000 permutations (Oostenveld et al., 2011). Figure 3C, D, E and F show the TRFs, averaged over all source points, of the acoustic edge and phoneme features (phoneme onset, phoneme surprisal, and phoneme entropy). For all TRF components, we found significant main effects of language and word entropy, and a significant interaction between them, on both hemispheres. Significant source clusters of main effects and interactions for each feature are shown in Figure 3G. To show each contrast on the same scale, the percentage of power of weights was calculated for each hemisphere.

The power of the weights for high entropy words is greater than for low entropy words in each speech feature TRF. In contrast to the phoneme features, in the acoustic edge TRF the power of the weights is greater in French stories than in Dutch stories. The interaction between language and word entropy ends earlier for acoustic edges than for phoneme features. The acoustic edge TRF peaks around 80ms and 200ms for both languages, whereas the phoneme feature TRFs peak around 80ms for both languages, with a second peak between 200ms and 600ms in Dutch stories.

We validated our analysis on the next 4 Dutch story parts and found the same effects with the reconstruction accuracies and weights of TRFs (Table 8, Table 9 and Figure 4).

Table 8
LMM results of reconstruction accuracies for Phoneme Features (Next 4 Dutch Story Parts).

| | Estimate | Std. Error | Df | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|---|
| (Intercept) | 8.14E-05 | 7.03E-06 | 39.1 | 11.57 | 3.36E-14 | *** |
| French | -8.00E-05 | 5.62E-06 | 141.0 | -14.23 | 2.00E-16 | *** |
| Low Word Entropy | -5.00E-05 | 5.62E-06 | 141.0 | -8.89 | 2.61E-15 | *** |
| Right Hemisphere | -5.32E-06 | 7.61E-06 | 23.0 | -0.70 | 4.91E-01 | |
| French: Low Word Entropy | 3.67E-05 | 7.95E-06 | 141.0 | 4.61 | 8.87E-06 | *** |

Table 9
LMM results of reconstruction accuracies for Acoustic Edges (Next 4 Dutch Story Parts).

| | Estimate | Std. Error | Df | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|---|
| (Intercept) | 8.38E-05 | 1.70E-05 | 32.0 | 4.93 | 2.44E-05 | *** |
| French | 7.70E-05 | 1.09E-05 | 141.0 | 7.06 | 7.15E-11 | *** |
| Low Word Entropy | -8.18E-05 | 1.09E-05 | 141.0 | -7.49 | 6.76E-12 | *** |
| Right Hemisphere | 3.30E-05 | 1.92E-05 | 23.0 | 1.72 | 9.89E-02 | |
| French: Low Word Entropy | -7.44E-05 | 1.54E-05 | 141.0 | -4.82 | 3.66E-06 | *** |

****<0.0001, ***<0.001, **<0.01, *<0.05.

Figure 4
Second 4 Dutch Story Parts (n=24).

Light orange and light green represent the Low Word Entropy condition, and dark orange and dark green represent the High Word Entropy condition, for Dutch and French stories, respectively. (A) Reconstruction accuracy interaction between word entropy and language for acoustic features. (B) Reconstruction accuracy interaction between word entropy and language for phoneme features. (Braces in panels A and B indicate a significant difference between high and low word entropy conditions, ****<0.0001, ***<0.001, **<0.01, *<0.05. Error bars show within-subject standard errors.) (C-D) Acoustic Edge TRFs on the left hemisphere (LH) and right hemisphere (RH). (E-F) Phoneme Features TRFs on LH and RH. Lines on the graphs in panels C-F show the mean and shaded areas show the standard error of the mean. (G) Sources where the main effects of Language and Word Entropy, and their interaction, are found.

We also ran new models with varying time lags and computed the improvement in model accuracy contributed by each feature (see Models with varying time lags in the Materials and methods section). Figures 5 and 6 show the reconstruction accuracy improvement by each feature for each time window. On the x axis, the accuracy of each time window is plotted at its center time; for example, the window between –100 ms and 0 ms is plotted at t = –0.05 s. The red bar below marks the time intervals in which there was a significant interaction between language and word entropy. The results show that word entropy modulated encoding more strongly in the comprehended language between 200 and 750 ms in both LH and RH for phoneme onset, between 150 and 650 ms in LH and between 0 and 600 ms in RH for phoneme surprisal, and between 550 and 650 ms in LH for phoneme entropy.

Accuracy improvement by each linguistic feature calculated by subtracting the model accuracy of previous model from the model which also has the feature of interest for each time window in High and Low Word Entropy Conditions (n=24).

(A) Phoneme Onset (Left Hemisphere – LH on the left, Right Hemisphere RH on the right). (B) Phoneme Surprisal. (C) Phoneme Entropy. Lines on the graphs in Figure A-C show the mean and shaded areas show the standard error of the mean.

Accuracy improvement by each linguistic feature calculated by subtracting the model accuracy of previous model from the model which also has the feature of interest for each time window (n=24).

High (Left) and Low (Right) Word Entropy Conditions for French (Below) and Dutch (Top) Stories.

Discussion

In this study, we investigated whether phoneme-level features are tracked over and above acoustic features and acoustic edges, and whether word entropy affects the tracking of acoustic- and phoneme-level features as a function of language comprehension. We used linear encoding models to quantify the degree of neural tracking while native Dutch speakers listened to audiobooks in Dutch and French. Participants were familiar with the prosodic and phonotactic statistical regularities of the French language, but did not comprehend the stories, allowing for the contrast of speech processing during and in the absence of comprehension in a statistically familiar auditory and perceptual context. First, we examined whether phoneme features are tracked by the brain signal even when acoustic features are controlled. Results showed that all phoneme features (i.e. phoneme onset, phoneme surprisal, phoneme entropy) and word frequency increased the averaged reconstruction accuracy for both Dutch and French stories when acoustic features were controlled, extending the findings of previous studies (Brodbeck et al., 2018; Gillis et al., 2021; Brodbeck et al., 2022). An interaction between language and encoding model showed that reconstruction accuracy improvement from the addition of phoneme-level features and word frequency was greater in Dutch stories, suggesting that when language is comprehended, internally-generated linguistic units, e.g., phonemes, are tracked over and above acoustic edges (Meyer et al., 2020). Our findings contradict the results of Daube et al., 2019, who used articulatory features and phoneme onsets as phonetic features and showed that phoneme-locked responses can be explained by the encoding of acoustic features. It is possible that information theoretic metrics, such as surprisal and entropy, are better suited to capture the transformation of acoustic features into abstract linguistic units.

Comparison of the reconstruction accuracies at the source level showed that, for Dutch stories, the accuracy improvement from the linguistic features was left lateralized. However, for uncomprehended French stories, we did not find a significant lateralization effect. The Dutch-speaking participants we selected were familiar with the acoustics and phonotactics of French, and thus with the statistical distribution of those features, even though they could not comprehend the language (as evidenced by their chance-level behavioral performance, educational background, and self-report). This comparison of neural signals in acoustically familiar contexts set the bar higher for detecting neural sensitivity to phoneme-level features in a comprehended language, over and above any general statistical sensitivity the brain might show during perception and perceptual adaptation. The slight increase in reconstruction accuracy at the phoneme level in French is likely due to the participants’ general familiarity with the phonotactic properties of the language. However, given the duration of the experiment, participants could also have learned the phonotactic distribution of the stories, and could have become sensitive to high-frequency words that they may recognize but that do not lead to sentence and discourse comprehension (e.g. the most frequent words in the stories were le, la, un, une, les, des, all forms of the definite and indefinite determiners the/a); statistical regularities can be acquired very quickly even by naive listeners (Saffran et al., 1996).

We performed a word-level analysis to examine the second question: do higher level linguistic constraints, as formed during comprehension, interact with the neural encoding of lower-level information (acoustic and phonemic features) under normal listening conditions? We operationalized constraint as word entropy, or the uncertainty around word predictability given the previous 30 words. When the word entropy is high, the next word is less reliably predictable given the previous words, so the context is less constraining. However, word frequency and predictability are known to be highly correlated (Cohen Priva and Jaeger, 2018); in our study, the correlation between entropy and word frequency was R=0.336, p<0.00001 for French stories and R=0.343, p<0.00001 for Dutch stories. To control for the responses correlated with word frequency, we also added word frequencies to our full model and compared the reconstruction accuracy improvement explained by acoustic and phoneme features. When we compared the reconstruction accuracies for low and high entropy words, we found that tracking performance of phoneme and acoustic features in high entropy words was higher in Dutch stories. This suggests that sentence and discourse constraint, as expressed in word entropy, modulates the encoding of sub-lexical representations, consistent with top-down models of language comprehension. Results showed that when context was more reliable (i.e. on low entropy words) the neural contribution of acoustic and phoneme-level features was downregulated, inhibited, or not enhanced. When sentence- and discourse-level context constrains lexical representations, sub-lexical representations are inhibited by gain modulation as they are no longer as important for thresholding (Martin, 2016; Martin, 2020).

We also found an interaction between language comprehension and word entropy that went in opposite directions for acoustic edges and phoneme features: the effect of sentence- and discourse-level context was larger on phoneme features for the comprehended language, and larger on acoustic features for the uncomprehended language. In the uncomprehended language, low entropy words were also the most frequent words presented, such as le, la, un, une. Modulation of acoustic edges by context in this situation could be related to statistical 'chunking' of the acoustic signal for frequent words, essentially reflecting recognition of those single function words in the absence of language comprehension. In contrast, when a language is understood, contextual information strongly modulates the tracking of phoneme features. Yet, when the next word is predicted from the context, phoneme features are not as informative as when the context is less constraining, and they are downregulated or suppressed such that their encoding does not contribute as much to the composition of the neural signal.

To investigate how language comprehension and word entropy modulated acoustic and phoneme-level linguistic features as they unfolded in time, we analyzed the weights of the TRFs. The amplitude of the power of weights was greater for phoneme-level features in Dutch stories than in French stories, whereas it was lower for acoustic edges. This suggests that when the language is not comprehended, acoustic features may be more dominant in the neural response, as linguistic features simply are not available. This pattern of results suggests that language comprehension might suppress neural tracking of acoustic edges, or that it suppresses the neural representation of edges. If this is so, then language comprehension can be seen as a neural filter over sensory and perceptual input in service of the transformation of that input into linguistic structures. A similar effect was also shown for increasing speech rate: tracking performance of linguistic features decreased and tracking performance of acoustic features increased with decreasing intelligibility (Verschueren et al., 2022).

We also found that phoneme features generated a peak at around 80 ms and another between 200 and 600 ms in the comprehended language, whereas in the uncomprehended language they only peaked at around 80 ms. This dual-peak pattern could be attributable to the internal generation of morphemic units or to lexical access in the comprehended language. Similar to the results we found for reconstruction accuracies, comparison of low and high entropy word TRFs showed that when the word entropy was higher (viz., in a low-constraining sentence and discourse context), the weights of acoustic edges and phoneme-level features were higher. This result suggests that when higher level cues are not available or reliable, lower-level cues at the phoneme and acoustic level may be upregulated or enhanced, or allowed to propagate in order to comprise more of the neural bandwidth. Our results are consistent with Molinaro et al., 2021, in that we provide support for a cost minimization perspective rather than the perception facilitation perspective discussed there. However, it is important to note that Molinaro et al. only examined the tracking of acoustic features, specifically the speech envelope, using the Phase Locking Value, and did not examine the contribution of lower-level linguistic features. Secondly, Molinaro et al. used a condition-based experimental design, in contrast to our naturalistic stimulus approach. In our study, our aim was to investigate the dynamics of encoding both acoustic and linguistic features, and we applied a multivariate linear regression method to low and high constraining words which ‘naturally’ occurred in our audiobook stimulus across languages. Our results revealed a trade-off between the encoding of acoustic and linguistic features that was dependent on the level of comprehension.
Specifically, in the comprehended language, the predictability of the following word had a greater influence on the tracking of phoneme features than on the tracking of acoustic features, while in the uncomprehended language, this trend was reversed. Similarly, Donhauser and Baillet, 2020 showed that when sentence context is not reliable, phonemic features (i.e. phoneme surprisal and entropy) are enhanced. Here we examined the effects of word entropy on acoustic and phonemic representations in comprehended and uncomprehended languages; we found that comprehension modulates acoustic edges for a shorter period of time than phoneme features when language is comprehended. The interaction between language comprehension and word entropy for acoustic edge encoding was located around auditory cortex in both hemispheres, whereas for phoneme features, it was located in left frontal cortex and right temporal cortex. The presence of a later interaction for phoneme onsets compared to acoustic edges could be due to the sustained representation of phonemes until lexical ambiguity is resolved (Gwilliams et al., 2020): when the sentence context is not very informative, phonemic representations transformed from acoustic features may persist or play a larger role in the neural response until the word is recognized or until a structure is built.

In this study, to dissociate the effects of word entropy and word frequency, we modeled word frequency under the assumption that entropy and frequency have linearly additive effects; however, previous studies have shown that they interact during late stages of word processing (Fruchter and Marantz, 2015; Huizeling et al., 2022). Although a naturalistic listening paradigm is an instrumental way to study language comprehension, it does not make it fully possible to control for word frequency or to completely dissociate it from sentence context effects.

Another limitation of our study is that linear regression modeling does not allow the dissociation of the weights of highly correlated features. When we compared the TRF weights of phoneme onset, surprisal and entropy separately between the first four and the second four Dutch story parts, we saw that the TRF weights of these features differed (Figure 3—figure supplement 1, Figure 4—figure supplement 1). However, acoustic-edge features and averaged phoneme-level features were highly consistent between different story parts, as acoustic and averaged phoneme features are not as correlated as phoneme onset, phoneme surprisal and phoneme entropy (Figure 3 and Figure 4). Thus, it would be problematic to interpret how language comprehension and sentence- and discourse-level constraint affect these features separately. As an alternative solution to this problem, we also fitted different models with varying time lags, separated by 50 ms with a 100 ms sliding window between –100 and 800 ms, and compared the reconstruction accuracies of these models instead of the TRF weights. Results showed that word entropy modulated the encoding of phoneme onset and phoneme surprisal for a longer time, and starting earlier, compared to phoneme entropy.

Conclusion

In this study, we show that modeling phoneme-level linguistic features in addition to acoustic features better reconstructs the neural tracking response to spoken language, and that this improvement is even more pronounced in the comprehended language, likely reflecting the transformation of acoustic features into phoneme-level representations when a language is fully comprehended. Although acoustic edges are important for speech tracking, internally generated phonemes are more strongly encoded in a comprehended language. This suggests that language comprehension can be seen as a neural filter over acoustic edges of the speech signal in the service of transforming sensory input into abstract linguistic units. We demonstrated that low sentence- and discourse-level constraints enhanced the neural encoding of both acoustic and phoneme features. When language was not comprehended, the neural encoding of acoustic features was stronger, likely due to the absence of comprehensive lexical access and interpretation, including higher level linguistic structure formation, in addition to the recognition of single, highly frequent function words (viz., the/a) in isolation. When a language was comprehended from speech, phoneme features were more strongly encoded in the neural response than when it was not comprehended. Only in the comprehended language did phoneme features align with the phase of the neural signal between 200 and 600 ms, and this effect was stronger when the context was less constraining and more information had to be extracted from the sensory signal. This pattern of results may reflect the persistence of phoneme-related neural activity before a word is recognized and phoneme-level information can be inhibited. Relying more on low-level features when high-level cues are not reliable, and suppressing low-level information when contextual cues are informative, could be a strategy for the brain to use its resources efficiently toward its behavioral goal.
In summary, our results support an account of language comprehension where the flexible modulation of the acoustic and phonemic features by lexical, sentential, and discourse-level information is instrumental in the transformation of sensory input into interpreted linguistic structure and meaning.

Materials and methods

Participants

We collected MEG data from 24 participants between 18 and 58 years old (average age: 31.17 years, 18 F and 6 M) while they were listening to audiobooks in Dutch and French. They were all right-handed native Dutch speakers with either no or very little French proficiency. Four of the participants reported that they could comprehend a full French sentence only if it was spoken very slowly or was a very simple sentence. The rest of the participants reported that they could not comprehend a full sentence. This study was approved by the ethical Commission for human research Arnhem/Nijmegen (project number CMO2014/288). Participants were reimbursed for their participation.

Stimuli

The stimuli (Table 10) consisted of one story by Hans Christian Andersen and two stories by the Brothers Grimm in Dutch, and one story by Grimm, one by E.A. Poe, and one by Andersen in French (Kearns, 2015; Hart, 1971). All stories were divided into 5–6 min story parts (9 story parts in Dutch, 4 story parts in French). The stories were presented to the participants in a randomized order. After each part, participants were asked to answer five multiple-choice comprehension questions. For our analysis, we used the first four story parts in each language to balance the length of data across languages. We then repeated the same analysis for the second four parts of the Dutch stories.

Table 10
Stimuli.

| Story Part | Language | Duration | Speaker | Parts used in analysis |
|---|---|---|---|---|
| Anderson_S01_P01 | NL | 4 min 58 s | Woman 1 | Dutch Part 1 |
| Anderson_S01_P02 | NL | 5 min 17 s | Woman 1 | Dutch Part 1 |
| Anderson_S01_P03 | NL | 4 min 49 s | Woman 1 | Dutch Part 1 |
| Anderson_S01_P04 | NL | 5 min 50 s | Woman 1 | Dutch Part 1 |
| Grimm_23_1 | NL | 5 min 3 s | Woman 2 | Dutch Part 2 |
| Grimm_23_2 | NL | 5 min 32 s | Woman 2 | Dutch Part 2 |
| Grimm_23_3 | NL | 5 min 2 s | Woman 2 | Dutch Part 2 |
| Grimm_20_1 | NL | 6 min 6 s | Woman 2 | Dutch Part 2 |
| ANGE_part1 | FR | 4 min 34 s | Woman 3 | French Part 1 |
| BALL_part1 | FR | 4 min 58 s | Woman 3 | French Part 1 |
| EAUV_part1 | FR | 5 min 43 s | Man 1 | French Part 1 |
| EAUV_part2 | FR | 6 min 1 s | Man 1 | French Part 1 |

Data acquisition

Brain activity of participants was recorded using magnetoencephalography (MEG) with a 275-sensor axial gradiometer system (CTF Systems Inc) in a magnetically shielded room. Before presenting each story part, 10 s of resting state data were recorded. All stimuli were presented auditorily using the Psychophysics Toolbox extensions of Matlab (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007) while participants fixated a cross in the middle of the presentation screen. MEG data were acquired at a sampling frequency of 1200 Hz. Head localization was monitored during the experiment using marker coils placed at the cardinal points of the head (nasion, left and right ear canal), and head position was corrected before each story part presentation to keep it at the same position as at the beginning of the experiment. Bipolar Ag/AgCl electrode pairs were used to record the electrooculogram (EOG) and electrocardiogram (ECG). In addition to MEG data, we also acquired T1-weighted structural MR images using a 3 T MAGNETOM Skyra scanner (Siemens Healthcare, Erlangen, Germany). Lastly, three-dimensional coordinates of each participant's head surface were measured using a digitizing pen system (Polhemus Isotrak system, Kaiser Aerospace Inc).

MEG data preprocessing

MEG data were analyzed using mne-python (version 0.23.1). First, data were annotated to exclude the response parts from the rest of the analysis, then filtered between 0.5 and 40 Hz with a one-pass, zero-phase, non-causal FIR filter using the default settings of mne-python. Bad channels were removed using the mne implementation of Maxwell filtering, and the removed channels were interpolated. The data were then resampled to 600 Hz, and ocular and cardiac artifacts were removed with independent component analysis. The segment for each story part was cropped from the preprocessed data, and source localization was done separately for each part. Before source localization, each part was low-pass filtered at 8 Hz with a one-pass, zero-phase, non-causal FIR filter.

Individual head models were created for each participant with their structural MR images with Freesurfer (surfer.nmr.mgh.harvard.edu) and were co-registered to the MEG coordinate system with mne coregistration utility. A surface-based source space was computed for each participant using fourfold icosahedral subdivision. Cortical sources of the MEG signals were estimated using noise-normalized minimum norm estimate method, called dynamic statistical parametric map (dSPM). Orientations of the dipoles were constrained to be perpendicular to the cortical surface. Resting state data before the presentation of each story part (130 s in total) were used to calculate the noise covariance matrix. Lastly, source time courses were resampled at 100 Hz.

Predictor variables

Acoustic features

Acoustic features (an 8-band gammatone spectrogram and an 8-band acoustic onset spectrogram, both covering frequencies from 20 to 5000 Hz in equivalent rectangular bandwidth (ERB) space) were generated for each story part using the Eelbrain toolbox (Brodbeck et al., 2021).

Phoneme onsets

Phoneme onsets were extracted from the audio files of the stories automatically using the forced alignment tool from WebMAUS Basic module of the BAS Web Services (Schiel, 1999; Strunk et al., 2014).

Phoneme surprisal and entropy

The probability of each phoneme in a given word was calculated according to the probability distribution over the lexicon of each language, weighted by the occurrence frequency of each word. As each phoneme unfolds in a word, it reduces the number of possible words in the cohort and generates a subset, $cohort_i$. The conditional probability of each phoneme $Ph_i$ given the previous phoneme equals the ratio of the total frequency of the words in the remaining cohort to that of the words in the previous cohort.

$$P(Ph_i \mid Ph_{i-1}) = \frac{\sum_{word \in cohort_i} freq_{word(i)}}{\sum_{word \in cohort_{i-1}} freq_{word(i-1)}}$$

The surprisal of phoneme $Ph_i$ is inversely related to the likelihood of that phoneme.

$$S(Ph_i \mid Cohort_{(i)}) = -\log_2 (P(Ph_i))$$

The entropy of phoneme $Ph_i$ quantifies the uncertainty about the next phoneme $Ph_{i+1}$. It is calculated as the expected surprisal over all possible phonemes.

$$E(Ph_i \mid Cohort_{(i)}) = -\sum_{Ph \in All\ phonemes} P(Ph \mid cohort_{(i-1)}) \times \log_2 (P(Ph \mid cohort_{(i-1)}))$$
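
The cohort computations above can be illustrated with a minimal sketch. The toy lexicon, its frequencies, the function names, and the `<end>` outcome for words that finish at the current phoneme are all assumptions made for illustration; they are not taken from the study's dictionaries or analysis code. The sketch also assumes that the cohort before the first phoneme is the whole lexicon, that every queried word is in the lexicon, and that entropy is taken over the next phoneme given the cohort that remains after the current one (how this lines up with the cohort index in the equation is an assumption).

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon: phonemic transcription -> word frequency.
LEXICON = {
    ("k", "a", "t"): 50.0,
    ("k", "a", "p"): 30.0,
    ("k", "o", "p"): 20.0,
    ("t", "a", "k"): 100.0,
}


def cohort(lexicon, prefix):
    """Words (with frequencies) whose transcription starts with `prefix`."""
    n = len(prefix)
    return {w: f for w, f in lexicon.items() if w[:n] == tuple(prefix)}


def phoneme_surprisal(lexicon, word, i):
    """Surprisal in bits of phoneme i (0-based) given the preceding phonemes."""
    previous = sum(cohort(lexicon, word[:i]).values())
    remaining = sum(cohort(lexicon, word[:i + 1]).values())
    return -math.log2(remaining / previous)


def phoneme_entropy(lexicon, word, i):
    """Entropy in bits over the next phoneme once phonemes 0..i are heard."""
    current = cohort(lexicon, word[:i + 1])
    total = sum(current.values())
    mass = defaultdict(float)
    for w, f in current.items():
        # Assumption: a word ending here counts as an end-of-word outcome.
        nxt = w[i + 1] if len(w) > i + 1 else "<end>"
        mass[nxt] += f
    return -sum((m / total) * math.log2(m / total) for m in mass.values())


word = ("k", "a", "t")
surprisals = [phoneme_surprisal(LEXICON, word, i) for i in range(len(word))]
entropies = [phoneme_entropy(LEXICON, word, i) for i in range(len(word))]
```

Because each conditional probability is a ratio of successive cohort frequencies, the surprisals of a word's phonemes sum to the negative log probability of the whole word in the lexicon, which is a convenient sanity check on any implementation.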

To calculate the probabilities of phonemes, we used the SUBTLEX-NL dictionary (Keuleers et al., 2010) for Dutch stories and the Lexique383 dictionary (9_freqfilms from subtitles) for French stories. Both dictionaries were filtered to eliminate words containing nonalphabetic characters. The French dictionary had 92,109 words, so the same number of most frequent words was also selected from the Dutch dictionary. Words in the dictionaries were transformed into phonemic transcriptions using the g2p module of the BAS Web Services (Schiel, 1999; Schiel, 2015).

Word frequency

Word frequency features were calculated by taking the negative logarithm of word frequencies, which vary between 0 and 1 and were calculated from word occurrences per 1,000,000 words in the SUBTLEX-NL dictionary (Keuleers et al., 2010) for Dutch stories and the Lexique383 dictionary (9_freqfilms from subtitles) for French stories (New et al., 2001).

$$Word\ Freq = -\log_2 (Freq_i)$$
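
As a small worked example of this definition, the sketch below converts a per-million count into a relative frequency between 0 and 1 and takes its negative base-2 logarithm. The function name and the counts are hypothetical and are not taken from SUBTLEX-NL or Lexique383.

```python
import math


def word_frequency_feature(occurrences_per_million):
    """Negative log2 of the relative frequency (per-million count rescaled to 0..1)."""
    relative = occurrences_per_million / 1_000_000
    return -math.log2(relative)


# Hypothetical counts for illustration only.
frequent = word_frequency_feature(31250.0)  # a word making up 1/32 of all tokens
rare = word_frequency_feature(1.0)          # a word seen once per million tokens
```

With this sign convention, rarer words receive larger feature values than frequent ones.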

Word entropy

To quantify the uncertainty about the next word given the previous 30 words, we used versions of the transformer-based language model GPT-2 that were fine-tuned for Dutch (de Vries and Nissim, 2020) and French (Louis, 2020). Entropy values for each word were calculated from the probability distribution generated by the GPT-2 language model.

$$E(Word_i \mid Context_{(i-30:i-1)}) = -\sum_{T \in All\ tokens} P(T \mid Context_{(i-30:i-1)}) \times \log_2 (P(T \mid Context_{(i-30:i-1)}))$$

Words in each story part were divided into two conditions, high and low word entropy, so that each condition has an equal length of signal in each story part (Figure 1).
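
The entropy computation and the split into two conditions can be sketched as follows. This is only an illustration under stated assumptions: the probability lists stand in for a language model's next-token distribution (no GPT-2 model is called here), and the splitting rule, which sorts words by entropy and assigns them to the low condition until half of the total duration is reached, is one plausible way to obtain equal signal length and is not confirmed as the procedure used in the study. All names and values are hypothetical.

```python
import math


def word_entropy(distribution):
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def split_by_entropy(words):
    """Split (entropy, duration) pairs into low/high index lists.

    Words are visited from lowest to highest entropy and go to the low
    condition until it holds half of the total duration (assumed rule).
    """
    ordered = sorted(range(len(words)), key=lambda i: words[i][0])
    half = sum(duration for _, duration in words) / 2
    low, high, elapsed = [], [], 0.0
    for i in ordered:
        if elapsed < half:
            low.append(i)
            elapsed += words[i][1]
        else:
            high.append(i)
    return low, high


# Hypothetical next-token distributions for two words.
peaked = word_entropy([0.5, 0.25, 0.25])          # constraining context
flat = word_entropy([0.25, 0.25, 0.25, 0.25])     # less constraining context

# Hypothetical (entropy in bits, duration in seconds) pairs for four words.
words = [(0.5, 0.25), (2.0, 0.25), (1.0, 0.25), (3.0, 0.25)]
low_idx, high_idx = split_by_entropy(words)
```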

Linear encoding models

Four different models were built by incrementally adding each phonemic feature on top of acoustic control features. Model names and features included in models are shown in Table 11.

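Table 11
Model names and speech features in models.

| Model | Spectrogram | Acoustic Edge | Phoneme Onset | Phoneme Surprisal | Phoneme Entropy | Word Frequency |
|---|---|---|---|---|---|---|
| Acoustic | x | x | | | | |
| Phoneme Onset | x | x | x | | | |
| Phoneme Surprisal | x | x | x | x | | |
| Phoneme Entropy | x | x | x | x | x | |
| Word Frequency | x | x | x | x | x | x |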

Temporal response functions (TRFs) were computed for each model, subject and source using the Eelbrain toolbox (Brodbeck et al., 2021). For each model, the corresponding speech features were shifted by T lags between –100 ms and 800 ms from the onset of each phoneme; with 50 ms wide Hamming windows at a 100 Hz sampling rate, this yields T = 90 time points. The MEG response at time t, $\{y_{ij}(t_n)\}_{j=1}^{N}$ (N = 5,124 virtual current sources, i: subject number, $t_n$: time point), was predicted by convolving the TRF with the predictor features shifted by T time delays, $\{x_f(t_n - \tau_k)\}_{f=1}^{F}$ (F: number of speech features in the model). $\beta_{ijf}(\tau_k)$ is the TRF of the ith subject, jth source point and fth speech feature at the kth latency.

$$y_{ij}(t_n) = \sum_{f=1}^{F} \sum_{k=1}^{T} \beta_{ijf}(\tau_k)\, x_f(t_n - \tau_k)$$
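
The prediction step of this equation can be written out for a single subject and source point as a plain double sum. The sketch below is not the Eelbrain implementation and does not estimate TRFs; it only evaluates the convolution for given weights. The function name, the zero-padding at the edges, and the choice of lag samples –10 to 79 (90 lags at 100 Hz, excluding the 800 ms endpoint) are assumptions made for illustration.

```python
def predict_response(trfs, predictors, lags):
    """Evaluate y(t_n) = sum_f sum_k beta_f(tau_k) * x_f(t_n - tau_k).

    trfs[f][k] is the weight of feature f at lag lags[k] (in samples) and
    predictors[f][n] is feature f at sample n. Samples that fall outside
    the predictor are treated as zero (an assumption of this sketch).
    """
    n_samples = len(predictors[0])
    y = [0.0] * n_samples
    for beta, x in zip(trfs, predictors):
        for weight, lag in zip(beta, lags):
            for n in range(n_samples):
                m = n - lag
                if 0 <= m < n_samples:
                    y[n] += weight * x[m]
    return y


# At 100 Hz, lags from -100 ms up to (not including) 800 ms: samples -10..79.
lags = list(range(-10, 80))

# Hypothetical single-feature example: one impulse and a TRF with one
# non-zero weight at a lag of 5 samples (50 ms).
trf = [0.0] * len(lags)
trf[lags.index(5)] = 2.0
impulse = [0.0] * 100
impulse[20] = 1.0
predicted = predict_response([trf], [impulse], lags)
```

In this toy case the predicted response is a copy of the impulse, scaled by the weight and delayed by the lag, which is the sense in which a TRF describes the response evoked by a unit change in a predictor.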

All predictors and MEG signals were normalized by dividing by the absolute mean value. To estimate the TRFs, the boosting algorithm of the Eelbrain toolbox was used to minimize the l1 error with a fivefold cross-validation procedure. We used the toolbox's early stopping, which uses a validation set distinct from the test set to stop training when the error starts to increase, in order to prevent overfitting (Brodbeck et al., 2021). As there were four French story parts, we used only the first four Dutch story parts. The total duration was 21 min 17 s for the French stories and 20 min 54 s for the Dutch stories. We then repeated the same analysis with the next four Dutch story parts, whose total duration was 21 min 43 s.
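
The normalization step can be sketched as below, reading "absolute mean value" as the mean of the absolute values of the signal. That reading, the function name, and the omission of any centering step are assumptions; the toolbox may scale data differently.

```python
def normalize_by_mean_abs(signal):
    """Divide a signal by the mean of its absolute values (assumed reading)."""
    scale = sum(abs(v) for v in signal) / len(signal)
    return [v / scale for v in signal]


# Hypothetical signal for illustration.
normalized = normalize_by_mean_abs([1.0, -3.0, 2.0, -2.0])
```

After this scaling, predictors and responses have a mean absolute value of 1, so their magnitudes are comparable when an l1 error is minimized.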

Model accuracy comparison

To evaluate the effect of adding each feature on top of the acoustic features on reconstruction accuracy, the proportion of explained variance at each source point was smoothed (Gaussian window, SD = 14 mm), and a linear mixed model with a random slope for subjects was fitted to the average reconstruction accuracy values of each model in each hemisphere using the lmer function of the lme4 package for R.

To identify the brain region where a specific feature increased model accuracy, model accuracies on each source point for each model were compared with the model accuracies of previous model that has all other speech features except the features of interest using a mass-univariate two-tailed related sample t-test with threshold-free cluster enhancement (TFCE) (Smith and Nichols, 2009).
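
The comparison of each model with the previous one amounts to a difference of accuracies along the incremental model order. The sketch below shows that bookkeeping only; it performs no statistics. The model order follows the rows of Table 11, while the function name and the accuracy values are hypothetical.

```python
MODEL_ORDER = [
    "Acoustic",
    "Phoneme Onset",
    "Phoneme Surprisal",
    "Phoneme Entropy",
    "Word Frequency",
]


def incremental_improvement(accuracy):
    """Accuracy of each model minus the accuracy of the model just before it."""
    return {
        current: accuracy[current] - accuracy[previous]
        for previous, current in zip(MODEL_ORDER, MODEL_ORDER[1:])
    }


# Hypothetical accuracies, chosen only to make the arithmetic easy to follow.
accuracy = {
    "Acoustic": 0.5,
    "Phoneme Onset": 0.75,
    "Phoneme Surprisal": 0.875,
    "Phoneme Entropy": 0.875,
    "Word Frequency": 1.0,
}
gains = incremental_improvement(accuracy)
```

In the study this difference is evaluated at every source point and then tested across participants; the dictionary here would correspond to a single source point of a single participant.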

Models with varying time lags

We also ran new models with varying time lags and computed the model accuracy improvement by each feature by subtracting the reconstruction accuracy of the previous model, which does not have the feature of interest, from that of the model that includes it, as was done in the first analysis where we showed the additional contribution of each feature to the reconstruction accuracy. We used 17 different time lags separated by 50 ms, with a 100 ms sliding window between –100 and 800 ms. We then compared the reconstruction accuracies, averaged over the whole brain, of the high and low word entropy conditions of each language (French and Dutch) with a cluster-level permutation test across time with 8000 permutations.
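
The window layout described above can be enumerated directly, which also confirms the count of 17. The function name and the tuple layout are assumptions made for illustration.

```python
def lag_windows(start_ms=-100, stop_ms=800, width_ms=100, step_ms=50):
    """Return (window_start, window_stop, center) in ms for each sliding window."""
    windows = []
    onset = start_ms
    while onset + width_ms <= stop_ms:
        windows.append((onset, onset + width_ms, onset + width_ms // 2))
        onset += step_ms
    return windows


windows = lag_windows()
centers_s = [center / 1000 for _, _, center in windows]
```

The first window spans –100 to 0 ms and the last spans 700 to 800 ms, so the window centers run from –0.05 s to 0.75 s in steps of 0.05 s.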

Data availability

The raw data used in this study are available from the Donders Institute Data Repository (https://doi.org/10.34973/a65x-p009). At the time of submission of the paper, the following subjects had both MRI and MEG data: sub-001, sub-003, sub-004, sub-008, sub-009, sub-010, sub-011, sub-013, sub-014, sub-015, sub-017, sub-018, sub-019, sub-020, sub-021, sub-023, sub-025, sub-026, sub-027, sub-028, sub-029, sub-030, sub-032, and sub-033. They were included in the analysis. Analysis code is shared at https://github.com/tezcanf/Scripts_for_publication.git (copy archived at swh:1:rev:23b3c7ad9dcabae84dca2a017ee8260081a9d18c).

The following data sets were generated
    1. Martin AE
    (2023) Donders Institute Data Repository
    Constructing sentence-level meaning: an MEG study of naturalistic language comprehension.
    https://doi.org/10.34973/a65x-p009

References

    Friston K (2005) A theory of cortical responses. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 360:815–836. https://doi.org/10.1098/rstb.2005.1622
    Kleiner M, Brainard D, Pelli D (2007) What’s new in Psychtoolbox-3. Psychology 36:1–16.
    Lee TS, Mumford D (2003) Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America. A, Optics, Image Science, and Vision 20:1434–1448. https://doi.org/10.1364/josaa.20.001434
    Oganian Y, Kojima K, Breska A, Cai C, Findlay A, Chang E, Nagarajan SS (2023) Phase alignment of low-frequency neural activity to the amplitude envelope of speech reflects evoked responses to acoustic edges, not oscillatory entrainment. Journal of Neuroscience 43:3909–3921.
    Pelli DG (1997) The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spatial Vision 10:437–442.
    Rumelhart DE, McClelland JL (1982) An interactive activation model of context effects in letter perception: part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review 89:60–94.
    Schiel F (1999) Automatic phonetic transcription of non-prompted speech. Conference: International Conference on Statistical Language and Speech Processing.
    Schiel F (2015) A Statistical Model for Predicting Pronunciation. Conference: International Congress of Phonetic Sciences.
    Strunk J, Schiel F, Seifart F (2014) Untrained forced alignment of transcriptions and audio for language documentation corpora using WebMAUS. Conference: International Conference on Language Resources and Evaluation.

Decision letter

  1. Jonathan Erik Peelle
    Reviewing Editor; Northeastern University, United States
  2. Andrew J King
    Senior Editor; University of Oxford, United Kingdom

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "A tradeoff between acoustic and linguistic feature encoding in spoken language comprehension" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Andrew King as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) The aspects of the current results that are consistent with vs. different to prior arguments should be better clarified. As it stands, the theoretical advance of the current paper relative to much current thinking in the field is not always clear.

2) The treatment of the high-entropy vs. low-entropy contexts in the models was unclear; in particular, if separate TRFs were constructed for these groups of contexts (line 260), a single model (in which entropy was included as a factor) would seem preferable (this was suggested by the model in line 275). Please clarify the model construction and results that were obtained and which underlie the main conclusions (including how words were divided into high vs. low entropy).

3) The conclusions about listener responses to understood (Dutch) vs. not-understood (French) speech need to be tempered to account for the acoustic differences across language (and possibly across speaker – it was unclear whether the same talker was used for both languages). These points should be explicitly considered and clarified.

Reviewer #1 (Recommendations for the authors):

Figure 4 was very useful and I thought should appear earlier in the paper – I would put this first, so that readers see it as that.

I could not find data or analysis scripts, which would be useful to share. There is a promise that it will be available through an institutional repository but it should be available as part of the review process, no?

Reviewer #2 (Recommendations for the authors):

TRF analysis. As the authors mention, the TRFs will be highly correlated for different features, which makes it problematic to infer the spatiotemporal distribution of each speech feature. Can the authors take this analysis further by assessing model accuracies as before, but this time changing the lags over which the regression is computed (e.g. in sliding time windows 0-50 ms, 25-75 ms, etc.)? This would give information on processing latencies unique to a particular speech feature, going beyond what is currently presented.

Could the authors discuss what might explain their finding of specific phoneme tracking independent of acoustics, which is contra the findings of Daube et al.

I am unclear about linear mixed models presented for the 'Tracking performance of linguistic features' section (some of the issues described here apply elsewhere too). The text states that forward difference coding was used to assess the difference between consecutive models. Which p-values correspond to these contrasts? Are they reported in the tables? But these look like the coefficients of a model, not model comparisons. Significance braces are shown in the figures but the captions do not indicate what these represent exactly. Then in line 224 there is a single p-value attributed to the effect of model, which again I am unsure how to interpret (is the combined effect of the forward difference contrasts?). Finally source plots in Figure 1D, do they show the forward difference contrasts or do they show pairwise comparisons between e.g. acoustic model vs phoneme onsets, phoneme onsets vs phoneme entropy etc. To make this section clearer, in addition to providing additional detail (e.g. in the figure captions), I would consider using more familiar pairwise comparisons instead of forward difference contrasts.

Figure 2, please fully label all plots. e.g. 2A the dark vs light green bars are not distinguished (same issue for TRF plots). Do they represent the left vs right hemispheres? If so, where are the data for the high vs low entropy comparison? There is an inset panel but I think it would be clearer to show all the 'conditions' in the main graphs i.e. high low entropy, left right hemisphere. As with Figure 1, caption needs more detail e.g. what are the significance braces indicating?

Please specify how words were grouped into high and low entropy. Median split?

Reviewer #3 (Recommendations for the authors):

Regarding point (2) of the public review, my suggestion would be to analyze the distinction by including both low and high entropy predictor sets in a single TRF model. This way, all words' responses would be accounted for, and overlapping responses from different words would be properly controlled for. Contributions from specific sets of predictors could still be assessed by estimating models without those predictors.

Regarding point (3), a stronger case could be made, for example, with a group of French speakers listening to the same stimuli – to make comprehension orthogonal to acoustic properties.

Errors/inconsistencies

Line 531 implies that acoustic edges were modeled in 8 bands ("8-band acoustic onset spectrogram") but Figure 4 only shows a single time series.

In Figure 4A, it looks to me like the spectrogram is shifted relative to the sound wave.

In Figure 4A, the third word has lower Word Entropy than the fourth, but is assigned to the high entropy condition, while the fourth is in the low entropy condition.

The Discussion says about the initial model (before adding entropy) "Phoneme surprisal, phoneme entropy, and word frequency reconstruction accuracies were left-lateralized" (348), but I don't think I saw a statistical test of lateralization. In fact, in the model with entropy the effect of Hemisphere for phoneme features was n.s. (281).

"To isolate the effect of sentence and discourse context and control for the frequency effect we also added word frequencies in our full model" (374): I don't believe this is doing what is intended (/described). The main question asked statistically is whether word entropy modulates other predictors' TRFs. Adding a word frequency predictor will control for responses correlated with word frequency, but it will not control for other predictors' TRF shapes being modulated by word frequency.

https://doi.org/10.7554/eLife.82386.sa1

Author response

Essential revisions:

1) The aspects of the current results that are consistent with vs. different to prior arguments should be better clarified. As it stands, the theoretical advance of the current paper relative to much current thinking in the field is not always clear.

No previous studies compared comprehended and uncomprehended speech in a naturalistic context – this is a crucial comparison, in our view, in order to best estimate how much the brain response is driven purely by the acoustic dynamics and physical statistical properties of a natural speech stimulus. Estimating this allows us to then determine the neural responses that are associated with language comprehension over and above the demands on the brain by speech processing.

Furthermore, previous studies did not estimate the contribution of word entropy to the encoding of both acoustic features (i.e., spectrogram and acoustic edges) and lower-level linguistic features (i.e., phoneme entropy). It has of course previously been shown that word entropy explains neural response variance, but prior to our study the dynamic and interactive relationship between word-level features and low-level features had been hypothesized, not directly shown. We show that word entropy affects the encoding of both acoustic and linguistic features, which has not been shown before. We discuss later in this response letter and in the manuscript more specifically how our findings differ from particular studies in the literature. In short, our study uses a naturalistic spoken stimulus (audiobook) and compares the effect of word entropy on both acoustic and linguistic features, whereas previous research only looked at acoustic features in highly controlled experimental settings.

Because of these two key manipulations – manipulating language comprehension apart from speech processing, and directly comparing acoustic and linguistic feature encoding as a function of word entropy – we show for the first time a dynamic tradeoff between acoustic and linguistic features, which previous work focused on separately, and which was not addressed within the same study or analysis. Because we did address these issues all together, we could show that (1) Linguistic features are encoded more strongly during language comprehension than when comprehension is absent, and (2) that high word entropy enhances the encoding of lower-level acoustic and linguistic features while low word entropy suppresses it.

2) The treatment of the high-entropy vs. low-entropy contexts in the models was unclear; in particular, if separate TRFs were constructed for these groups of contexts (line 260), a single model (in which entropy was included as a factor) would seem preferable (this was suggested by the model in line 275). Please clarify the model construction and results that were obtained and which underlie the main conclusions (including how words were divided into high vs. low entropy).

Following the Reviewers' suggestions, we carried out a new analysis in which the high and low entropy conditions were evaluated in a single model, and we found the same effects. The effect size was smaller than when the conditions were evaluated in separate models, but the overall pattern was robust.

3) The conclusions about listener responses to understood (Dutch) vs. not-understood (French) speech need to be tempered to account for the acoustic differences across language (and possibly across speaker – it was unclear whether the same talker was used for both languages). These points should be explicitly considered and clarified.

We appreciate the Reviewers' concerns here and want to explain better why acoustic differences are already accounted for in our model comparison approach. We are not comparing acoustic models directly; if we were, we agree with the Reviewers that this would be problematic. Instead, we compare how much additional variance in the neural response is explained when linguistic features are added to base models that already account for acoustic differences. By adding phoneme-level and word-level features to such a base model, we can assess the differences between languages over and above any acoustic differences. That said, we carried out additional analyses to control for speaker identity and gender differences (detailed below), and we found the same effects as in our previously reported findings.

Reviewer #1 (Recommendations for the authors):

Figure 4 was very useful and I thought should appear earlier in the paper – I would put this first, so that readers see it as that.

We thank the reviewer for this suggestion. We agree that it is helpful to put this figure first, and we have moved it to the Introduction.

I could not find data or analysis scripts, which would be useful to share. There is a promise that it will be available through an institutional repository but it should be available as part of the review process, no?

The analysis scripts are shared at the following link: https://github.com/tezcanf/Scripts_for_publication.git

Reviewer #2 (Recommendations for the authors):

TRF analysis. As the authors mention, the TRFs will be highly correlated for different features, which makes it problematic to infer the spatiotemporal distribution of each speech feature. Can the authors take this analysis further by assessing model accuracies as before, but this time changing the lags over which the regression is computed (e.g. in sliding time windows 0-50 ms, 25-75 ms, etc.)? This would give information on processing latencies unique to a particular speech feature, going beyond what is currently presented.

We thank the reviewer for this suggestion. We have now run new models with varying time lags and computed the accuracy improvement from each feature by subtracting the reconstruction accuracy of the previous model, which lacks the feature of interest, from that of the model that includes it. This is the same procedure as in the first analysis, where we showed the additional contribution of each feature to the reconstruction accuracy. Because of the long computation time of the boosting algorithm we used to compute the TRFs, we were limited to 17 time windows, each 100 ms wide and spaced 50 ms apart, between -100 and 800 ms. We then compared the reconstruction accuracies, averaged over the whole brain, of the high and low word entropy conditions of each language (French and Dutch) with a cluster-level permutation test across time with 8000 permutations. The graphs below show the reconstruction accuracy improvement from each feature for each time window. On the x axis, the accuracy of each time window is plotted at its center time (for example, the window between -100 ms and 0 ms is plotted at t = -0.05 seconds). The red bar below each graph marks the time intervals with a significant interaction between language and word entropy. The results show that word entropy modulated the encoding more strongly in the comprehended language between 200 and 750 ms in LH and RH for phoneme onset, between 150 and 650 ms in LH and between 0 and 600 ms in RH for phoneme surprisal, and between 550 and 650 ms in LH for phoneme entropy.
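
The window layout described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `sliding_windows` and its parameters are hypothetical, and only the window arithmetic (100 ms width, 50 ms step, -100 to 800 ms) is taken from the text.

```python
def sliding_windows(start_ms, stop_ms, width_ms, step_ms):
    """Return (onset, offset) pairs in ms for windows that fit inside [start_ms, stop_ms]."""
    windows = []
    onset = start_ms
    while onset + width_ms <= stop_ms:
        windows.append((onset, onset + width_ms))
        onset += step_ms
    return windows


# Values taken from the response text: 100 ms windows, 50 ms apart, -100 to 800 ms.
windows = sliding_windows(start_ms=-100, stop_ms=800, width_ms=100, step_ms=50)

# Center time of each window in seconds, as used on the x axis of the graphs.
centers_s = [(onset + offset) / 2 / 1000 for onset, offset in windows]

print(len(windows), windows[0], windows[-1])
```

Under these assumptions the layout yields 17 windows, with the first window (-100 to 0 ms) centered at -50 ms and the last (700 to 800 ms) centered at 750 ms.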

Could the authors discuss what might explain their finding of specific phoneme tracking independent of acoustics, which is contra the findings of Daube et al.

As suggested by the reviewer, we extended the Discussion of the encoding of phonemic features beyond acoustic features.

“An interaction between language and encoding model showed that reconstruction accuracy improvement from the addition of phoneme-level features and word frequency was greater in Dutch stories, suggesting that when language is comprehended, internally-generated linguistic units, e.g., phonemes, are tracked over and above acoustic edges (Meyer, Sun, and Martin, 2020). Our findings contradict the results of Daube et al. (2019), who used articulatory features and phoneme onsets as phonetic features and showed that phoneme-locked responses can be explained by the encoding of acoustic features. It is possible that information theoretic metrics, such as surprisal and entropy, are better suited to capture the transformation of acoustic features into abstract linguistic units.”

I am unclear about linear mixed models presented for the 'Tracking performance of linguistic features' section (some of the issues described here apply elsewhere too). The text states that forward difference coding was used to assess the difference between consecutive models. Which p-values correspond to these contrasts? Are they reported in the tables? But these look like the coefficients of a model, not model comparisons.

We thank the reviewer for pointing out this confusion. We noticed that using the word “model” both for the linear mixed effect models (LMMs) and for the fixed effect in those LMMs might be confusing, so we now refer to the linear mixed effect models as LMMs in Tables 1, 2 and 3. The p values correspond to the contrasts Phoneme Onset – Acoustic, Phoneme Surprisal – Phoneme Onset, Phoneme Entropy – Phoneme Surprisal, and Word Frequency – Phoneme Entropy. The names of the contrasts were revised in the manuscript too. We also made a mistake in naming the contrast coding: it should be backward difference rather than forward difference coding, and we have corrected this in the manuscript. We used backward difference coding because we want to compare the accuracy improvement from each additional feature on top of the previous model. For example, in the contrast Phoneme Onset – Acoustic, the Phoneme Onset model has both the acoustic features (spectrogram and acoustic edges) and the phoneme onset feature, whereas the Acoustic model has only the acoustic features, so this contrast gives the accuracy improvement from the phoneme onset feature. The revised tables are shown below.
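
To illustrate what backward difference coding does, the sketch below builds one common form of the contrast matrix and checks that, with an intercept equal to the grand mean, the contrast coefficients are the successive differences between consecutive models. It is an illustration only: the accuracy values are hypothetical placeholders in arbitrary units, the function names are invented here, and the authors' R code may use a different but equivalent parametrization.

```python
import math


def backward_difference_contrasts(k):
    """Return a k x (k - 1) backward difference contrast matrix as nested lists."""
    return [
        [-(k - j) / k if i < j else j / k for j in range(1, k)]
        for i in range(k)
    ]


def cell_means(intercept, coefficients, contrasts):
    """Level means implied by an intercept plus contrast coefficients."""
    return [
        intercept + sum(c * b for c, b in zip(row, coefficients))
        for row in contrasts
    ]


models = ["Acoustic", "Phoneme Onset", "Phoneme Surprisal",
          "Phoneme Entropy", "Word Frequency"]
accuracy = [10.0, 12.0, 15.0, 19.0, 20.0]  # hypothetical values, arbitrary units

contrasts = backward_difference_contrasts(len(models))
grand_mean = sum(accuracy) / len(accuracy)
# Each coefficient is "this model minus the previous model".
differences = [b - a for a, b in zip(accuracy, accuracy[1:])]

# The grand mean and the successive differences reproduce the level means.
reconstructed = cell_means(grand_mean, differences, contrasts)
assert all(math.isclose(r, a) for r, a in zip(reconstructed, accuracy))
```

Read this way, the coefficient labelled Phoneme Onset – Acoustic is the change in accuracy when the phoneme onset feature is added to the acoustic model, and so on up the sequence.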

Significance braces are shown in the figures but the captions do not indicate what these represent exactly.

Figure caption is revised as below.

“Figure 2. A) Accuracy improvement (averaged over the sources in whole brain) by each feature for Dutch Stories B) Accuracy improvement (averaged over the sources in whole brain) by each feature for French Stories Braces in Figure A and B shows the significance values of the contrasts (difference between consecutive models, **** <0.0001, *** <0.001, **<0.01, * < 0.05) in linear mixed effect models (Table 2 and 3)”

Then in line 224 there is a single p-value attributed to the effect of model, which again I am unsure how to interpret (is the combined effect of the forward difference contrasts?).

The p values in the text starting at line 224 show the significance tests for the LMM comparison. To evaluate whether adding language, models and their interaction as fixed effects increased the predictive accuracy of the LMM, we compared LMMs with and without these effects using R’s anova() function. The LMM comparison showed that adding each fixed effect and their interaction increased the predictive power, and that the LMM with both fixed effects and the interaction had the lowest Bayesian Information Criterion. Having established that these fixed effects increase the predictive power of the LMM, we present the results of the LMM with these fixed effects in Tables 1, 2, 3, 4 and 5. We also added the formulas of the compared LMMs to the text.

To evaluate whether adding language and models and their interaction as fixed effect increased predictive accuracy, we compared LMMs with and without these effects using R’s anova() function.

The formulas used for the LMMs were then:

LMM1: Accuracy ~ Language * Models + (1 + Language | subject)

LMM2: Accuracy ~ Language + Models + (1 + Language | subject)

LMM3: Accuracy ~ Models + (1 | subject)

LMM4: Accuracy ~ 1 + (1 | subject)

LMM comparison showed that Models (LMM3 – LMM4: Δχ2 = 82.79, p<0.0001; LMM3 Bayesian Information Criterion (BIC): -3633.1, LMM4 BIC: -2787.5), Language (LMM2 – LMM3: Δχ2 = 862.02, p<0.0001; LMM2 BIC: -3693.9), and their interaction (LMM1 – LMM2: Δχ2 = 71.63, p<0.0001; LMM1 BIC: -3743.7) predicted the averaged reconstruction accuracies.
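
For reference, the two quantities reported in this comparison can be written out under their usual textbook definitions. The symbols are introduced here for illustration and do not appear in the manuscript: ℓ is the maximized log-likelihood of a fitted model, k its number of estimated parameters, and n the number of observations.

```latex
\Delta\chi^2 = -2\left(\ell_{\mathrm{reduced}} - \ell_{\mathrm{full}}\right),
\qquad
\mathrm{BIC} = k \ln n - 2\,\ell
```

Under these definitions, Δχ2 for a pair of nested models is referred to a χ2 distribution whose degrees of freedom equal the difference in the number of parameters, and the model with the lower BIC is preferred.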

Finally source plots in Figure 1D, do they show the forward difference contrasts or do they show pairwise comparisons between e.g. acoustic model vs phoneme onsets, phoneme onsets vs phoneme entropy etc. To make this section clearer, in addition to providing additional detail (e.g. in the figure captions), I would consider using more familiar pairwise comparisons instead of forward difference contrasts.

We thank the reviewer for expressing this preference and helping us be more consistent. We used a backward difference contrast in our analysis. The source plots in Figure 2D (Figure 1D in the previous version of the manuscript) show the backward difference contrasts, i.e., the sources where a specific linguistic feature incrementally increased the model accuracy relative to the previous model. We used a backward difference contrast because we wanted to investigate whether adding each acoustic or linguistic feature increased the model accuracy relative to the previous model, which lacks that particular feature. We added a more detailed description to the figure caption to make this clearer.

“Figure 2. (D) Source points where reconstruction accuracies of the model were significantly different than previous model. Accuracy values shows how much each speech feature increased the reconstruction accuracy compared to the previous model.”

Figure 2, please fully label all plots. e.g. 2A the dark vs light green bars are not distinguished (same issue for TRF plots). Do they represent the left vs right hemispheres? If so, where are the data for the high vs low entropy comparison? There is an inset panel but I think it would be clearer to show all the 'conditions' in the main graphs i.e. high low entropy, left right hemisphere. As with Figure 1, caption needs more detail e.g. what are the significance braces indicating?

We apologize for the confusion. Dark and light bars represent the high and low word entropy conditions, respectively. Braces indicate a significant difference between the high and low word entropy conditions. In Figure 3A (Figure 2A in the previous version of the manuscript), reconstruction accuracies are averaged over the whole brain to show the interaction between word entropy and language. All figures are now labelled and the figure captions are updated in the revised version.

Please specify how words were grouped into high and low entropy. Median split?

Instead of splitting the words at the median entropy value, we picked the entropy threshold that divided the auditory signal into two parts of equal length, so that the predictors of both conditions contain the same amount of zero values. Words in each story part were thus divided into two conditions, high and low word entropy, such that each condition covers an equal length of signal and neither condition is favored by the amount of training data.
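
One way to picture such a split is sketched below. The response does not spell out the exact procedure, so this is an assumption-laden illustration: the function name, the word list and the rule "take the lowest threshold at which the low entropy words cover at least half of the total duration" are all hypothetical.

```python
def equal_duration_threshold(words):
    """Pick an entropy threshold that splits total word duration roughly in half.

    words: list of (entropy, duration_ms) tuples.
    Words with entropy <= threshold form the low entropy condition.
    """
    if not words:
        raise ValueError("need at least one word")
    ordered = sorted(words, key=lambda w: w[0])
    total = sum(duration for _, duration in ordered)
    running = 0
    for entropy, duration in ordered:
        running += duration
        if 2 * running >= total:
            return entropy
    return ordered[-1][0]


# Hypothetical words: (word entropy, duration in ms).
words = [(4.0, 400), (1.0, 200), (3.0, 100), (2.0, 300)]
threshold = equal_duration_threshold(words)

low_ms = sum(d for e, d in words if e <= threshold)
high_ms = sum(d for e, d in words if e > threshold)
print(threshold, low_ms, high_ms)
```

In this toy example a median split on word counts would also give two words per condition, but in general splitting on duration rather than on counts is what keeps the amount of non-zero predictor signal matched across conditions.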

Reviewer #3 (Recommendations for the authors):

Regarding point (2) of the public review, my suggestion would be to analyze the distinction by including both low and high entropy predictor sets in a single TRF model. This way, all words' responses would be accounted for, and overlapping responses from different words would be properly controlled for. Contributions from specific sets of predictors could still be assessed by estimating models without those predictors.

We thank the reviewer for pointing out this confusion. We did not bin the words as if they were in separate trials. After generating the acoustic and phonetic features for the continuous stimuli, we projected the predictors into a higher-dimensional space by multiplying them with a weight matrix that zeros out all predictors of high entropy words in the low entropy word condition, and vice versa. The MEG signal, however, was left intact. This creates sparser predictors for each condition and thereby a contrast between the two conditions that accounts for nonlinear changes in the response function (a similar approach is used for amplitude binning by Drennan and Lalor, 2019). We fitted two separate models, one per condition, to compare the accuracies of the conditions. As you also mentioned, in separate models, when a model predicts the response for high entropy words, it also predicts the absence of a response for low entropy words, which gives us the opportunity to contrast the effect of word entropy between conditions.
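
The zeroing-out step can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it treats the weight matrix as a per-sample 0/1 mask, and the predictor values, condition labels and function name are hypothetical.

```python
def mask_by_condition(predictor, labels, condition):
    """Keep predictor samples belonging to `condition`; set all others to zero."""
    return [
        value if label == condition else 0.0
        for value, label in zip(predictor, labels)
    ]


# Hypothetical predictor (e.g. phoneme surprisal impulses) and, for each
# sample, the entropy condition of the word that the sample falls in.
predictor = [0.0, 1.5, 0.0, 2.0, 0.7, 0.0]
labels = ["low", "low", "high", "high", "low", "high"]

predictor_low = mask_by_condition(predictor, labels, "low")
predictor_high = mask_by_condition(predictor, labels, "high")

# The two sparse predictors partition the original one.
assert [lo + hi for lo, hi in zip(predictor_low, predictor_high)] == predictor
```

Because the two masked predictors sum to the original, each one carries only the events of its own condition while the neural signal being predicted stays the same.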

Following your suggestion, we also fitted a single model with all of these predictors and computed the reconstruction accuracies of the high and low entropy word conditions by subtracting the accuracies of models without the relevant predictors from the accuracy of the full model, which has all predictors. For example, to calculate the accuracy of phoneme features in the low entropy word condition, we used the method below.

Full_model = [spectrogram_low, spectrogram_high, acoustic_edge_low, acoustic_edge_high, phoneme_onset_low, phoneme_onset_high, phoneme_surprisal_low, phoneme_surprisal_high, phoneme_entropy_low, phoneme_entropy_high, word_freq_low, word_freq_high]

Full_model_minus_low_word_entropy_phonemes = [spectrogram_low, spectrogram_high, acoustic_edge_low, acoustic_edge_high, phoneme_onset_high, phoneme_surprisal_high, phoneme_entropy_high, word_freq_low, word_freq_high]

R2 of phonetic features in low entropy word condition = R2 of Full_model – R2 of Full_model_minus_low_word_entropy_phonemes

With this method, we calculated the accuracies of phoneme features in low entropy condition, phoneme features in high entropy condition, acoustic features in low entropy condition and acoustic features in high entropy condition.
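
The bookkeeping behind this comparison can be sketched as follows. The predictor names are taken from the lists above; the helper names `drop_predictors` and `unique_contribution` are hypothetical, and no model is fitted here, so the R2 values would have to come from the actual TRF models.

```python
FEATURES = ["spectrogram", "acoustic_edge", "phoneme_onset",
            "phoneme_surprisal", "phoneme_entropy", "word_freq"]
CONDITIONS = ["low", "high"]

# Every feature appears once per word entropy condition.
full_model = [f"{feature}_{cond}" for feature in FEATURES for cond in CONDITIONS]


def drop_predictors(model, features, condition):
    """Return the model without the given features in the given condition."""
    removed = {f"{feature}_{condition}" for feature in features}
    return [name for name in model if name not in removed]


def unique_contribution(r2_full, r2_reduced):
    """Accuracy attributed to the dropped predictors."""
    return r2_full - r2_reduced


phoneme_features = ["phoneme_onset", "phoneme_surprisal", "phoneme_entropy"]
full_model_minus_low_word_entropy_phonemes = drop_predictors(
    full_model, phoneme_features, "low"
)
print(full_model_minus_low_word_entropy_phonemes)
```

The same helper, called with the other condition or with the acoustic features, gives the remaining three reduced models.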

To compare the reconstruction accuracies of each feature in each condition, we fitted a linear mixed effect model with the below formula to analyze the interaction between word entropy and language for acoustic and phonemic features.

Accuracy ~ Language + Word Entropy + Hemisphere + Word Entropy * Language + (1 + Language + Hemisphere | Subject)

Author response table 1
LMM results of reconstruction accuracies for Phoneme Features.
Term                        Estimate    Std. Error   df       t value   Pr(>|t|)   Sig.
(Intercept)                 3.29E-05    5.45E-06     43.07    6.03      3.34E-07   ****
French                      -2.48E-05   4.06E-06     164.00   -6.12     6.75E-09   ****
Low Word Entropy            -2.00E-05   4.06E-06     164.00   -4.92     2.10E-06   ****
Right Hemisphere            -3.73E-06   2.87E-06     164.00   -1.30     1.95E-01
French: Low Word Entropy    1.09E-05    5.74E-06     164.00   1.90      5.96E-02   .
Author response table 2
LMM results of reconstruction accuracies for Acoustic Edges.
Term                        Estimate    Std. Error   df       t value   Pr(>|t|)   Sig.
(Intercept)                 4.01E-04    3.82E-05     32.41    10.50     5.85E-12   ****
French                      1.48E-05    2.16E-05     140.00   0.69      4.93E-01
Low Word Entropy            -5.78E-06   2.64E-05     140.00   -0.22     8.27E-01
Right Hemisphere            9.76E-05    4.96E-05     28.02    1.97      5.91E-02   .
French: Low Word Entropy    -5.88E-05   3.05E-05     140.00   -1.93     5.60E-02   .

**** <0.0001, *** <0.001, ** <0.01, * <0.05, . <0.1

We found an interaction between language and word entropy for both acoustic and phonemic features, which was marginally significant in each case (p=0.0596 for phonemic features and p=0.056 for acoustic features). The effect was therefore stronger when we fitted two separate models, each predicting the response only for its own condition.

Author response image 1 shows the opposite interaction between language and word entropy for acoustic and phonemic features. We also compared the TRF weights of each condition in the full model. Results are very similar to the TRF weights obtained by two separate models shown in Figure 3 A and B in the manuscript.

Author response image 1
Full Model.

Light orange and light green represent the Low Word Entropy condition, dark orange and dark green represent the High Word Entropy condition, for Dutch and French stories, respectively. Braces indicate a significant difference between the high and low entropy word conditions (**** <0.0001, *** <0.001, ** <0.01, * <0.05). (A) Reconstruction accuracy interaction between word entropy and language for acoustic features. (B) Reconstruction accuracy interaction between word entropy and language for phoneme features. (C-D) Acoustic Edge TRFs on LH and RH. (E-F) Phoneme Features TRFs on LH and RH. (G) Sources where the main effects of Language and Word Entropy, and their interaction, are found.

Regarding point (3), a stronger case could be made, for example, with a group of French speakers listening to the same stimuli – to make comprehension orthogonal to acoustic properties.

We thank the reviewer for this comment. For our first question, we included acoustic features as a base model and compared how much phonetic features increased the model accuracy relative to this acoustic base model, rather than directly comparing the accuracies of acoustic models. For our second and third questions, we investigated how sentence context modulated acoustic and phonetic features. These interaction effects are therefore independent of the particular acoustic features that may be related to acoustic differences in the stimuli. We agree with the reviewer that if we were only comparing the model accuracies of acoustic features, then acoustic differences related to speaker, language, or gender would confound the model accuracy differences. We appreciate the Reviewer’s suggestion that comparing French speakers to Dutch speakers listening to our stimuli might shed light on the problem. However, comparing two different groups of participants (Dutch and French speakers) listening to the same Dutch stimuli would, rather than ameliorate the concern, lead to insurmountable challenges in comparing model accuracies between groups with different language experience. In this case, individual differences in the signal-to-noise ratio of the neural signal would be problematic when comparing the model accuracies of the two groups, and these differences are likely to be greater than the contribution of any acoustic differences, which are already modelled in the base models (Crosse et al., 2021).

We added a new table (Table 10) to the revised manuscript that shows which story part was read by which speaker. French stories were read by both a woman and a man, whereas Dutch stories were read by women only. In the first version of our manuscript, we compared the first 4 parts of the Dutch stories, read by the same woman, with the 4 parts of the French stories, read by a woman and a man, in order to have a similar amount of training data for both languages, because the amount of training data also affects model accuracy. We then repeated the same analysis with the second group of four Dutch story parts, which were read by another woman, and crucially, we found the same effects (see the results reported in Figure 3 and Figure 4 in the manuscript). To test whether speaker gender has an impact on the results for our first question, we repeated the analysis presented in Table 1 and compared the first 2 parts of the Dutch stories (read by woman 1 in Table 10 in the revised manuscript) with the first 2 parts of the French stories (read by woman 3). We then compared the first 2 parts of the Dutch stories (read by woman 1) with the second 2 parts of the French stories (read by man 1). The linear mixed effect model results are presented below. For each comparison we found the same interaction between language and the model accuracy improvement from phoneme-level features.

Author response table 3
LMM results of reconstruction accuracies for Dutch (first 2 parts) and French (first 2 parts) stories.
Term                                         Estimate    Std. Error   t value   Pr(>|t|)   Sig.
(Intercept)                                  2.92E-03    2.72E-04     10.72     1.99E-10   ***
Language (French – Dutch)                    -6.44E-04   1.52E-04     -4.23     3.13E-04   ***
Phon. Onset – Acoustic                       5.57E-05    1.23E-05     4.52      1.13E-05   ***
Phon. Surprisal – Phon. Onset                8.10E-05    1.23E-05     6.57      5.04E-10   ***
Phon. Entropy – Phon. Surprisal              1.25E-04    1.23E-05     10.10     2.00E-16   ***
Word Frequency – Phon. Entropy               1.22E-04    1.23E-05     9.88      2.00E-16   ***
Language: Phon. Onset – Acoustic             -4.17E-05   1.74E-05     -2.39     1.80E-02   *
Language: Phon. Surprisal – Phon. Onset      -6.73E-05   1.74E-05     -3.86     1.58E-04   ***
Language: Phon. Entropy – Phon. Surprisal    -1.06E-04   1.74E-05     -6.09     6.30E-09   ***
Language: Word Frequency – Phon. Entropy     -1.27E-04   1.74E-05     -7.29     8.80E-12   ***
Author response table 4
LMM results of reconstruction accuracies for Dutch (first 2 parts) and French (second 2 parts) stories.
EstimateStd. Errort valuePr(>|t|)
(Intercept)2.92E-032.72E-0410.722.00E-10***
Language (French- Dutch)-1.09E-031.75E-04-6.202.45E-06***
Phon. Onset – Acoustic5.57E-051.18E-054.714.79E-06***
Phon. Surprisal – Phon. Onset8.10E-051.18E-056.861.03E-10***
Phon. Entropy – Phon. Surprisal1.25E-041.18E-0510.552.00E-16***
Word Frequency – Phon. Entropy1.22E-041.18E-0510.322.00E-16***
Language: Phon. Onset – Acoustic-4.38E-051.67E-05-2.629.49E-03**
Language: Phon. Surprisal – Phon. Onset-7.34E-051.67E-05-4.401.87E-05***
Language: Phon. Entropy – Phon. Surprisal-1.17E-041.67E-05-7.024.25E-11***
Language:Word Frequency – Phon. Entropy-1.39E-041.67E-05-8.321.91E-14***
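
One way to read the interaction rows in Author response table 3 is to add each interaction estimate to the corresponding main effect. The following is a minimal sketch, assuming treatment coding with Dutch as the reference level (this coding is an assumption and is not stated in the response). The dictionary and function names are illustrative; only the estimates come from the table.

```python
# Estimates copied from Author response table 3.
# Main effects: accuracy improvement per added feature
# (assumed here to describe the Dutch reference level).
main_effects = {
    "Phon. Onset - Acoustic": 5.57e-05,
    "Phon. Surprisal - Phon. Onset": 8.10e-05,
    "Phon. Entropy - Phon. Surprisal": 1.25e-04,
    "Word Frequency - Phon. Entropy": 1.22e-04,
}
# Interaction terms: change in that improvement for French relative to Dutch.
interactions = {
    "Phon. Onset - Acoustic": -4.17e-05,
    "Phon. Surprisal - Phon. Onset": -6.73e-05,
    "Phon. Entropy - Phon. Surprisal": -1.06e-04,
    "Word Frequency - Phon. Entropy": -1.27e-04,
}


def implied_french_improvement(main, inter):
    """Sum main effect and interaction under the treatment-coding assumption."""
    return {name: main[name] + inter[name] for name in main}


french = implied_french_improvement(main_effects, interactions)
for name, value in french.items():
    print(f"{name}: Dutch {main_effects[name]:.2e}, French (implied) {value:.2e}")
```

Under this reading, every implied French improvement is smaller than its Dutch counterpart, which is what the negative interaction estimates express.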

Furthermore, to test whether speaker gender has an impact on the results for the second and third questions, we repeated the analyses presented in Figures 3 and 4 and Tables 6, 7, 8, and 9 of the manuscript for the story parts mentioned above. The tables below show the LMM results. As in the previous results in the manuscript, we found opposite interactions between language and word entropy for acoustic and phoneme features. These results suggest that the gender of the speakers does not affect the results.

Author response table 5
LMM results of reconstruction accuracies for Phoneme Features (first 2 parts of Dutch stories and first 2 parts of French stories).
Term                      Estimate   Std. Error  df      t value  Pr(>|t|)
(Intercept)               4.92E-05   7.45E-06    56.11   6.60     1.56E-08 ***
French                    -5.92E-05  6.46E-06    164.00  -9.17    2.00E-16 ***
Low Word Entropy          -4.89E-05  6.46E-06    164.00  -7.57    2.62E-12 ***
Right Hemisphere          -3.68E-06  4.57E-06    164.00  -0.81    4.21E-01
French: Low Word Entropy  4.13E-05   9.13E-06    164.00  4.52     1.19E-05 ***

Author response table 6
LMM results of reconstruction accuracies for Acoustic Edges (first 2 parts of Dutch stories and first 2 parts of French stories).
Term                      Estimate   Std. Error  df      t value  Pr(>|t|)
(Intercept)               1.45E-04   3.40E-05    45.16   4.27     1.00E-04 ***
French                    -2.16E-06  2.61E-05    164.00  -0.08    9.34E-01
Low Word Entropy          3.58E-05   2.61E-05    164.00  1.37     1.72E-01
Right Hemisphere          6.05E-05   1.85E-05    164.00  3.28     1.27E-03 **
French: Low Word Entropy  -1.20E-04  3.69E-05    164.00  -3.24    1.43E-03 **

**** <0.0001, *** <0.001, ** <0.01, * < 0.05

Author response table 7
LMM results of reconstruction accuracies for Phoneme Features (first 2 parts of Dutch stories and second 2 parts of French stories).
Term                      Estimate   Std. Error  df      t value  Pr(>|t|)
(Intercept)               4.97E-05   6.73E-06    69.14   7.39     2.55E-10 ***
French                    -6.57E-05  6.37E-06    164.00  -10.31   2.00E-16 ***
Low Word Entropy          -4.89E-05  6.37E-06    164.00  -7.67    1.45E-12 ***
Right Hemisphere          -4.77E-06  4.51E-06    164.00  -1.06    2.91E-01
French: Low Word Entropy  3.78E-05   9.01E-06    164.00  4.20     4.38E-05 ***

Author response table 8
LMM results of reconstruction accuracies for Acoustic Edges (first 2 parts of Dutch stories and second 2 parts of French stories).
Term                      Estimate   Std. Error  df      t value  Pr(>|t|)
(Intercept)               1.41E-04   3.06E-05    47.69   4.60     3.11E-05 ***
French                    -4.81E-06  2.43E-05    164.00  -0.20    8.44E-01
Low Word Entropy          3.58E-05   2.43E-05    164.00  1.47     1.43E-01
Right Hemisphere          6.87E-05   1.72E-05    164.00  4.00     9.69E-05 ***
French: Low Word Entropy  -1.33E-04  3.44E-05    164.00  -3.87    1.55E-04 ***

**** <0.0001, *** <0.001, ** <0.01, * < 0.05
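
The opposite interaction pattern for phoneme and acoustic features can be read from the sign of the "French: Low Word Entropy" rows in Author response tables 5 to 8. The following is a minimal sketch with illustrative names; only the estimates come from the tables. It checks signs only and says nothing about significance.

```python
# "French: Low Word Entropy" interaction estimates copied from
# Author response tables 5-8.
interaction_estimates = {
    "first 2 French parts": {"phoneme": 4.13e-05, "acoustic": -1.20e-04},
    "second 2 French parts": {"phoneme": 3.78e-05, "acoustic": -1.33e-04},
}


def has_opposite_signs(a: float, b: float) -> bool:
    """True when the two estimates point in opposite directions."""
    return a * b < 0


results = {
    comparison: has_opposite_signs(est["phoneme"], est["acoustic"])
    for comparison, est in interaction_estimates.items()
}
for comparison, opposite in results.items():
    print(f"{comparison}: opposite interaction signs = {opposite}")
```

In both comparisons the phoneme-feature interaction is positive and the acoustic-edge interaction is negative, regardless of which French speaker read the story parts.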

Errors/inconsistencies

Line 531 implies that acoustic edges were modeled in 8 bands ("8-band acoustic onset spectrogram") but Figure 4 only shows a single time series.

In the figure, we had averaged the 8 bands for acoustic edges. We have now revised it to show the 8 bands separately.

In Figure 4A, it looks to me like the spectrogram is shifted relative to the sound wave.

Yes, it was indeed shifted. We corrected it.

In Figure 4A, the third word has lower Word Entropy than the fourth, but is assigned to the high entropy condition, while the fourth is in the low entropy condition.

The coloring was done manually, and this was a mistake. We corrected it in the revised Figure 1.

The Discussion says about the initial model (before adding entropy) "Phoneme surprisal, phoneme entropy, and word frequency reconstruction accuracies were left-lateralized" (348), but I don't think I saw a statistical test of lateralization. In fact, in the model with entropy the effect of Hemisphere for phoneme features was n.s. (281).

We apologize for this mistake. The analysis in which we did not find any lateralization concerned the effect of word entropy on the encoding of acoustic features. The accuracy improvement by each linguistic feature is significantly lateralized for Dutch stories but not for French stories. Unfortunately, we had omitted this from the manuscript. We have added it to the Results section as shown below and have also corrected the Discussion.

Results:

“Figure 2-D shows the sources where each feature incrementally increased the reconstruction accuracy compared to the previous model. We fitted a linear mixed-effects model (LMM) to test the accuracy improvement by each linguistic feature for lateralization, taking the average of the contrasts shown in Figure 2-D for each hemisphere. We used the following formulas for the LMMs.

LMM1: Accuracy ~ Hemisphere + (1|subject)

LMM2: Accuracy ~ 1 + (1|subject)

We compared the LMMs with and without the Hemisphere effect using R’s anova() function. The comparison showed that Hemisphere (LMM1 – LMM2: Δχ² = 4.03, p < 0.05; LMM1 Bayesian Information Criterion (BIC): -3331.5; LMM2 BIC: -3329.5) predicted the averaged reconstruction accuracies in Dutch stories but not in French stories. We reported the results of LMM1 in Table 4 and Table 5.”
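
The model comparison quoted above can be checked by hand. The following is a minimal sketch, assuming the comparison is a likelihood-ratio test with one degree of freedom, since LMM1 adds a single fixed-effect Hemisphere term to LMM2 (the degrees of freedom are an assumption, not stated in the text). The function name `chi2_sf_df1` is hypothetical; only the Δχ² and BIC values come from the text.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For df = 1, chi-square is the square of a standard normal, so
    P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Values quoted in the text for the Dutch stories.
delta_chi2 = 4.03
bic_lmm1 = -3331.5  # model with Hemisphere
bic_lmm2 = -3329.5  # intercept-only model

# Assumption: one extra parameter (Hemisphere), hence df = 1.
p_value = chi2_sf_df1(delta_chi2)

# Lower BIC is better, so a positive difference favours LMM1.
bic_advantage_lmm1 = bic_lmm2 - bic_lmm1

print(f"p = {p_value:.4f}")
print(f"BIC advantage of LMM1 = {bic_advantage_lmm1:.1f}")
```

Under this assumption the p-value falls just below 0.05, consistent with the reported p < 0.05, and the BIC values differ by 2 in favour of the model that includes Hemisphere.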

In Dutch stories, linguistic features increased the reconstruction accuracy mostly in the left hemisphere. In French stories, however, only Phoneme Onset and Entropy slightly increased the reconstruction accuracies, and we could not find any significant lateralization effect.

Discussion:

“Comparison of the reconstruction accuracies on the source level showed that, for Dutch stories, the accuracy improvement by the linguistic features was left-lateralized. However, for uncomprehended French stories, we couldn’t find a significant lateralization effect.”

"To isolate the effect of sentence and discourse context and control for the frequency effect we also added word frequencies in our full model" (374): I don't believe this is doing what is intended (/described). The main question asked statistically is whether word entropy modulates other predictors' TRFs. Adding a word frequency predictor will control for responses correlated with word frequency, but it will not control for other predictors' TRF shapes being modulated by word frequency.

We thank the reviewer for pointing out this issue. We agree that adding word frequency as a predictor does not control for the modulation of other predictors by word frequency. As we also mention in the Discussion, we still see the modulation of acoustic features in French stories; however, in an uncomprehended language we do not expect to see an effect of context, so this modulation should be driven by the recognition of frequent words. We revised that part as follows: To control for the responses correlated with word frequency, we also added word frequencies to our full model and compared the reconstruction accuracy improvement explained by acoustic and phoneme features.

https://doi.org/10.7554/eLife.82386.sa2

Article and author information

Author details

  1. Filiz Tezcan

    Language and Computation in Neural Systems Group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    Filiz.TezcanSemerci@mpi.nl
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-3327-0181
  2. Hugo Weissbart

    Donders Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, Netherlands
    Contribution
    Conceptualization, Data curation, Supervision, Validation, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-2820-3865
  3. Andrea E Martin

    1. Language and Computation in Neural Systems Group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. Donders Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, Netherlands
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Methodology, Project administration, Writing - review and editing
    Competing interests
    Reviewing editor, eLife
    ORCID: 0000-0002-3395-7234

Funding

Max-Planck-Gesellschaft (Lise Meitner Research Group "Language and Computation in Neural Systems")

  • Andrea E Martin

Max-Planck-Gesellschaft (Independent Research Group "Language and Computation in Neural Systems")

  • Andrea E Martin

Nederlandse Organisatie voor Wetenschappelijk Onderzoek (016.Vidi.188.029)

  • Andrea E Martin

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. Open access funding provided by Max Planck Society.

Acknowledgements

We thank Sanne ten Oever for constructive feedback on the study design, and Ryan MC Law, Ioanna Zioga, Cas Coopmans, and Sophie Slaats for contributing to data acquisition.

Ethics

Human subjects: Participants performed a screening for their eligibility in the MEG and MRI and gave written informed consent. The study was approved by the Ethical Commission for human research Arnhem/Nijmegen (project number CMO2014/288). Participants were reimbursed for their participation.

Senior Editor

  1. Andrew J King, University of Oxford, United Kingdom

Reviewing Editor

  1. Jonathan Erik Peelle, Northeastern University, United States

Version history

  1. Received: August 2, 2022
  2. Preprint posted: August 18, 2022
  3. Accepted: June 18, 2023
  4. Version of Record published: July 7, 2023 (version 1)

Copyright

© 2023, Tezcan et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Filiz Tezcan
  2. Hugo Weissbart
  3. Andrea E Martin
(2023)
A tradeoff between acoustic and linguistic feature encoding in spoken language comprehension
eLife 12:e82386.
https://doi.org/10.7554/eLife.82386
