Stimulus dependencies—rather than next-word prediction—can explain pre-onset brain encoding in naturalistic listening designs

  1. Inés Schönmann (corresponding author)
  2. Jakub Szewczyk
  3. Floris P de Lange
  4. Micha Heilbron
  1. Donders Institute for Brain, Cognition and Behaviour, Netherlands
  2. Institute of Psychology, Jagiellonian University, Poland
  3. Amsterdam Brain and Cognition, University of Amsterdam, Netherlands
4 figures and 1 additional file

Figures

Figure 1 with 1 supplement
MEG encoding model and neural encoding results.

(A) Magnetoencephalography (MEG) encoding model. MEG data was epoched to word onset and averaged over a sliding window of 100 ms, moving with a step size of 25 ms. The model representation (GPT-2, GloVe, or arbitrary) of the word at t=0 was then used to predict the neural response for each channel and time point in a separate cross-validated Ridge regression. The actual and predicted responses were then correlated time point by time point, resulting in a time-resolved encoding plot. (B) Positive pre-onset encoding (subject 1) for GPT-2 (green), GloVe (blue), and arbitrary (grey) embeddings shows that it is possible to find ostensible neural signatures of pre-activation in MEG data. Lines show clusters of time points that are significantly different from zero (p<0.05 under the permutation distribution). (C) Encoding using GloVe embeddings demonstrates a slight advantage of the predictability of a word (top-one prediction by GPT-2-XL) for pre-onset encoding. The line indicates clusters of time points prior to word onset during which predictable words are significantly better encoded (p<0.05 under the permutation distribution).
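The caption describes a time-resolved encoding pipeline: the word embedding at t=0 predicts each (window-averaged) MEG time point via cross-validated Ridge regression, and predicted and actual responses are then correlated per channel and time point. A minimal sketch of that loop, assuming scikit-learn; the function name, ridge penalty, and fold count are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def time_resolved_encoding(embeddings, meg_epochs, alpha=1.0, n_splits=10):
    """Correlate actual and Ridge-predicted MEG responses per channel/time point.

    embeddings : (n_words, n_dims) model representation of the word at t=0
    meg_epochs : (n_words, n_channels, n_times) epoched, window-averaged MEG
    returns    : (n_channels, n_times) time-resolved encoding correlations
    """
    n_words, n_channels, n_times = meg_epochs.shape
    scores = np.zeros((n_channels, n_times))
    for t in range(n_times):
        y = meg_epochs[:, :, t]          # all channels at this time point
        y_pred = np.zeros_like(y)
        for train, test in KFold(n_splits).split(embeddings):
            # separate cross-validated Ridge regression per time point
            model = Ridge(alpha=alpha).fit(embeddings[train], y[train])
            y_pred[test] = model.predict(embeddings[test])
        for ch in range(n_channels):     # correlate actual and predicted responses
            scores[ch, t] = np.corrcoef(y[:, ch], y_pred[:, ch])[0, 1]
    return scores
```

Plotting `scores` against epoch time (word onset at t=0) yields the kind of time-resolved encoding curves shown in panel B.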

Figure 1—figure supplement 1
Hallmarks of prediction in the remaining two subjects of the few-subject dataset and the multi-subject dataset.

The first panel from the left shows the overall encoding performance of GPT-2, GloVe, and arbitrary vectors. Lines show clusters of time points for which encoding performance is significantly different from zero (p<0.05 under the permutation distribution). The middle panel shows the sensitivity to the predictability of the word for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity to GPT-2-XL’s top-five prediction. Lines show clusters of time points for which encoding performance is significantly larger for the predictable as opposed to unpredictable words prior to word onset (p<0.05 under the permutation distribution). Shaded areas show 95 % confidence intervals computed over sources and cross-validation splits in the single-subject analyses and over subjects in the multi-subject analysis. For the multi-subject dataset, we find no evidence for the second hallmark of prediction, i.e., no sensitivity to the predictability of a word for pre-onset encoding performance.

Figure 2 with 1 supplement
Control system 1: encoding model and results.

(A) Control system 1: word embeddings. For the first control system, we performed the same analysis as in Figure 1 but replaced the neural data at each time point with a vector representation (embedding) of the word presented at that time point. The word vector at t=0 was then used to predict the previous word vector for each dimension and time point in a separate cross-validated Ridge regression. The actual and predicted values were then correlated time point by time point, resulting in a time-resolved self-predictability plot. (B) Pre-onset encoding (self-predictability) for GPT-2 (green), GloVe (blue), and arbitrary (grey) embeddings. Shaded areas show 95 % confidence intervals computed over model dimensions. (C) Modulation of pre-onset encoding of static (GloVe) word embeddings by contextual predictability: prior word vectors are better predicted by successive word vectors if the subsequent word is highly predictable in context (i.e., top-one prediction by GPT-2-XL). Shaded areas show 95 % confidence intervals computed over model dimensions.
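Control system 1 runs the same regression with the neural data swapped for the previous word's embedding: how well does the word at t=0 predict its predecessor's vector? A minimal sketch, assuming scikit-learn; the function name, ridge penalty, and fold count are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def self_predictability(embeddings, alpha=1.0, n_splits=10):
    """Predict each word's *previous* embedding from the embedding at t=0,
    then correlate actual and predicted values per model dimension.

    embeddings : (n_words, n_dims), one row per word in presentation order
    returns    : (n_dims,) correlation per embedding dimension
    """
    X = embeddings[1:]    # word at t=0
    Y = embeddings[:-1]   # the preceding word's vector (the pre-onset target)
    Y_pred = np.zeros_like(Y)
    for train, test in KFold(n_splits).split(X):
        Y_pred[test] = Ridge(alpha=alpha).fit(X[train], Y[train]).predict(X[test])
    return np.array([np.corrcoef(Y[:, d], Y_pred[:, d])[0, 1]
                     for d in range(Y.shape[1])])
```

Averaging the returned correlations over dimensions, and repeating for embeddings further back in time, gives the self-predictability curves in panel B.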

Figure 2—figure supplement 1
Self-predictability in the multi-subject dataset and in Goldstein et al., 2022b data.

The first panel from the left shows self-predictability of the GPT-2, GloVe, and arbitrary models. The middle panel shows the sensitivity of GloVe's self-predictability to the predictability of the word as defined by GPT-2-XL's top-one prediction, and the right panel shows the sensitivity as defined by GPT-2-XL's top-five prediction. Shaded areas show 95 % confidence intervals computed over model dimensions. Self-predictability results are identical to those observed in the few-subject dataset. We find both ostensible hallmarks of prediction in the stimulus material itself, namely in the word embeddings of the material.

Figure 3 with 3 supplements
Control system 2: encoding model and results.

(A) For the second control system, we again performed the same analysis as in Figure 1 but replaced the neural data with the stimulus acoustics at each time point. The word embedding of the word at t=0 was then used to predict the prior acoustics. The actual and predicted values were then correlated time point by time point, resulting in a time-resolved correlation plot. (B) Pre-onset encoding of speech acoustics based on GPT-2 (green), GloVe (blue), and arbitrary (grey) embeddings. Shaded areas show 95 % confidence intervals computed over the nine dimensions (8 mels + envelope). (C) Modulation of pre-onset encoding of speech acoustics based on GloVe embeddings by word predictability (top-one prediction by GPT-2-XL). Shaded areas show 95 % confidence intervals computed over the nine dimensions (8 mels + envelope).

Figure 3—figure supplement 1
Predicting acoustics prior to word onset from original embedding vectors for both datasets.

The first panel from the left shows the overall encoding performance of original GPT-2, GloVe, and arbitrary vectors. Results closely mirror brain encoding results both in encoding time course and differences between models (see Figure 1B and C, Figure 1—figure supplement 1). The middle panel shows the sensitivity to the predictability of the word for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity to GPT-2-XL’s top-five prediction when using original GloVe embeddings. Shaded areas show 95 % confidence intervals computed over the nine dimensions (8 mels + envelope). Results mirror qualitative differences observed in the brain encoding studies.

Figure 3—figure supplement 2
Predicting acoustics from arbitrary embedding vectors for all three datasets after all reoccurring bigrams have been removed.

Results show that even when reoccurrences of bigrams are removed, it is still possible to find significant pre-onset encoding for arbitrary vectors, which is solely due to encoding temporal dependencies in the stimulus material. Shaded areas show 95 % confidence intervals computed over 10 cross-validation folds and the nine dimensions of our acoustic data (8 mels + envelope). Results mirror qualitative differences observed in the brain encoding studies.

Figure 3—figure supplement 3
Encoding data using GPT-2, GloVe, and arbitrary vectors after removing reoccurrences of bigrams, i.e., only retaining the first occurrence, in our few-subject dataset.

(A) shows reduced but significant pre-onset encoding for all three vectors of the magnetoencephalography (MEG) data (subject 1). (B) shows model self-predictability after removing reoccurring bigrams: removing bigrams had no effect on the self-predictability of any of the three models. Shaded areas show 95 % confidence intervals computed over the model dimensions. (C) shows the pre-onset predictability of the acoustics of our few-subject dataset after removing reoccurring bigrams. As for the self-predictability, removing bigrams led to almost identical encoding performance. This shows that removing reoccurring bigrams does not account for the dependencies in the stimulus material, and that pre-onset encoding of the neural data might still be driven by the predictability of stimulus or model features prior to word onset. Shaded areas show 95 % confidence intervals computed over 10 cross-validation folds and the dimensions (for the self-predictability and acoustics) or channels (for the MEG data). Results mirror qualitative differences observed in the brain encoding studies.

Figure 4 with 2 supplements
Controlling for self-predictability.

(A) In order to remove shared information between a word and its predecessor in the text, we residualised word embeddings by first fitting an OLS regression to predict the next word’s embedding from the previous word’s embedding, i.e., predicting ‘know’ from ‘You.’ This resulted in a predicted embedding x̂ (e.g., the predicted embedding of ‘know’), which contained the shared information between the two words. Finally, the predicted embedding x̂ was subtracted from the original embedding (e.g., ‘know’) to generate word representations from which the dependency between neighbouring words was removed. (B) Self-predictability after regressing out the previous embedding from the embedding at t=0 shows that the correlations between neighbouring model representations can be successfully removed. For brain encoding results when using these residualised embeddings, see Figure 4—figure supplement 1. (C) Predictability of prior word acoustics when using residualised GPT-2 (green), GloVe (blue), and arbitrary (grey) embeddings prior to word onset. Patterns closely mirror those observed in each model’s self-predictability and in the residual brain encoding results. Shaded areas show 95 % confidence intervals computed over the nine dimensions (8 mels + envelope). (D) Predictability of prior word acoustics using residualised GloVe embeddings demonstrates a clear advantage of the predictability of a word (top-one prediction by GPT-2-XL) for predicting its prior acoustic representations, and therefore the same qualitative difference as observed when encoding neural data. Shaded areas show 95 % confidence intervals computed over the nine dimensions (8 mels + envelope).
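The residualisation in panel A is a plain OLS regress-out: fit a linear map from each word's embedding to the next word's, then subtract the prediction. A minimal sketch, assuming scikit-learn; the function name is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def residualise_embeddings(embeddings):
    """Remove information shared with the preceding word from each embedding.

    embeddings : (n_words, n_dims), one row per word in presentation order
    returns    : (n_words - 1, n_dims) residualised embeddings for words 1..n
    """
    X = embeddings[:-1]   # previous word's embedding ('You')
    Y = embeddings[1:]    # current word's embedding ('know')
    # OLS prediction of the next embedding = the information shared with X
    Y_hat = LinearRegression().fit(X, Y).predict(X)
    return Y - Y_hat      # subtract it, leaving the residual representation
```

By construction, the OLS residuals are linearly uncorrelated with the previous word's embedding, which is the property panel B verifies via the flattened self-predictability curves.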

Figure 4—figure supplement 1
Predicting acoustics prior to word onset from residual word embeddings in the multi-subject dataset.

The first panel from the left shows the overall encoding performance of residualised GPT-2, GloVe, and arbitrary vectors. Results mirror brain encoding results both in encoding time courses and differences between models (see Figure 4—figure supplement 2C). The middle panel shows the sensitivity to the predictability of the word for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity to GPT-2-XL’s top-five prediction when using residualised GloVe embeddings. Shaded areas show 95 % confidence intervals computed over the nine dimensions (8 mels + envelope). Results mirror qualitative differences observed in the brain encoding study.

Figure 4—figure supplement 2
Ostensible hallmarks of prediction after removing model self-predictability through residualising word embeddings in the few-subject dataset and the multi-subject dataset.

The first panel from the left shows the overall encoding performance of residualised GPT-2, GloVe, and arbitrary vectors. Lines show clusters of time points for which encoding performance is significantly different from zero (p<0.05 under the permutation distribution). Encoding performance, as well as the differences between models, is reduced compared to encoding results with original vectors. The middle panel shows the sensitivity to the predictability of the word for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity to GPT-2-XL’s top-five prediction. Lines show clusters of time points for which encoding performance is significantly larger for the predictable as opposed to unpredictable words prior to word onset (p<0.05 under the permutation distribution). Shaded areas show 95 % confidence intervals computed over sources and cross-validation splits in the single-subject analyses and over subjects in the multi-subject analysis. For the multi-subject dataset, we find no evidence for the second hallmark of prediction, i.e., no sensitivity to the predictability of a word for pre-onset encoding performance.


Cite this article

  1. Inés Schönmann
  2. Jakub Szewczyk
  3. Floris P de Lange
  4. Micha Heilbron
(2026)
Stimulus dependencies—rather than next-word prediction—can explain pre-onset brain encoding in naturalistic listening designs
eLife 14:RP106543.
https://doi.org/10.7554/eLife.106543.3