Figures and data

A) MEG Encoding Model. MEG data were epoched to word onset and averaged over a 100 ms sliding window moving with a step size of 25 ms. The model representation (GPT-2, GloVe or arbitrary) of the word at t = 0 was then used to predict the neural response for each channel and time point in a separate cross-validated Ridge regression. The actual and predicted responses were then correlated time point by time point, resulting in a time-resolved encoding plot. B) Positive pre-onset encoding (subject 1) for GPT-2 (green), GloVe (blue) and arbitrary (grey) embeddings shows that it is possible to find ostensible neural signatures of pre-activation in MEG data. Lines show clusters of time points that differ significantly from zero (p < 0.05 under the permutation distribution). C) Encoding using GloVe embeddings demonstrates a slight pre-onset encoding advantage for predictable words (top-one prediction by GPT-2-XL). Lines indicate clusters of time points prior to word onset during which predictable words are encoded significantly better (p < 0.05 under the permutation distribution).
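A minimal sketch of this encoding pipeline (panel A), assuming `meg` holds the window-averaged epochs of shape (n_words, n_channels, n_times) and `X` holds one embedding per word; the names and the choice of scikit-learn's RidgeCV are illustrative, not the authors' exact implementation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def time_resolved_encoding(X, meg, n_splits=10):
    """Correlate actual and predicted responses per channel and time point."""
    n_words, n_channels, n_times = meg.shape
    scores = np.zeros((n_channels, n_times))
    for t in range(n_times):
        y = meg[:, :, t]                          # responses at this time point
        y_pred = np.zeros_like(y)
        for train, test in KFold(n_splits).split(X):
            model = RidgeCV(alphas=np.logspace(-2, 6, 9))
            model.fit(X[train], y[train])         # separate regression per time point
            y_pred[test] = model.predict(X[test])
        for ch in range(n_channels):              # Pearson r, channel by channel
            scores[ch, t] = np.corrcoef(y[:, ch], y_pred[:, ch])[0, 1]
    return scores
```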

A) Control system 1: word embeddings. For the first control system, we performed the same analysis as in Figure 1 but replaced the neural data at each time point with a vector representation (embedding) of the word presented at that time point. The word vector at t = 0 was then used to predict the previous word vector for each dimension and time point in a separate cross-validated Ridge regression. The actual and predicted values were then correlated time point by time point, resulting in a time-resolved self-predictability plot. B) Pre-onset encoding (self-predictability) for GPT-2 (green), GloVe (blue) and arbitrary (grey) embeddings. C) Modulation of pre-onset encoding of static (GloVe) word embeddings by contextual predictability: Prior word vectors are better predicted by successive word vectors if the subsequent word is highly predictable in context (i.e. top-one prediction by GPT-2-XL).
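Control system 1 amounts to reusing the same pipeline with the neural data swapped for lagged word embeddings. A sketch of how such a pseudo "response" array could be built (the helper name and lag convention are assumptions), so that the `time_resolved_encoding` sketch above yields a self-predictability curve:

```python
import numpy as np

def lagged_embedding_target(emb, lags):
    """Stack the embedding of the word at each lag as a pseudo 'response' of
    shape (n_words, n_dims, n_lags), analogous to (n_words, n_channels, n_times)."""
    n_words, n_dims = emb.shape
    target = np.zeros((n_words, n_dims, len(lags)))
    for i, lag in enumerate(lags):
        # lag = -1 places the previous word's embedding at this "time point";
        # np.roll wraps around at the edges, which a full analysis would trim.
        target[:, :, i] = np.roll(emb, -lag, axis=0)
    return target

# e.g.: self_pred = time_resolved_encoding(emb, lagged_embedding_target(emb, [-4, -3, -2, -1, 0]))
```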

A) For the second control system, we again performed the same analysis as in Figure 1 but replaced the neural data with the stimulus acoustics at each time point. The word embedding of the word at t = 0 was then used to predict the prior acoustics. The actual and predicted values were then correlated time point by time point, resulting in a time-resolved correlation plot. B) Pre-onset encoding of speech acoustics based on GPT-2 (green), GloVe (blue) and arbitrary (grey) embeddings. C) Modulation of pre-onset encoding of speech acoustics based on GloVe embeddings by word predictability (top-one prediction by GPT-2-XL).
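The captions below describe the acoustic target as 9-dimensional (8 mels + envelope). One way such features could be computed, as a sketch only; the librosa parameters and the Hilbert-envelope choice are assumptions, not the authors' exact preprocessing:

```python
import numpy as np
import librosa
from scipy.signal import hilbert

def acoustic_features(wav, sr, hop_length=512):
    """Return an (n_frames, 9) array: 8 log-mel bands plus amplitude envelope."""
    mels = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=8,
                                          hop_length=hop_length)
    log_mels = librosa.power_to_db(mels)                 # (8, n_frames)
    envelope = np.abs(hilbert(wav))                      # broadband amplitude
    env = librosa.util.frame(envelope, frame_length=hop_length,
                             hop_length=hop_length).mean(axis=0)
    n = min(log_mels.shape[1], env.shape[0])             # align frame counts
    return np.column_stack([log_mels[:, :n].T, env[:n]])
```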

A) In order to remove shared information between a word and its predecessor in the text, we residualised word embeddings by first fitting an OLS regression to predict the next word’s embedding based on the previous word’s embedding, i.e. predicting “know” based on “You”. This resulted in a predicted embedding x̂ for each word; the residual embedding, x − x̂, retains only the part of a word’s embedding that is not linearly predictable from its predecessor.
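A minimal sketch of this residualisation step, assuming `emb` is an (n_words, n_dims) array of word embeddings in presentation order (the function name is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def residualise(emb):
    """Remove the part of each embedding that is linearly predictable from
    the previous word's embedding."""
    X_prev, X_next = emb[:-1], emb[1:]
    ols = LinearRegression().fit(X_prev, X_next)  # e.g. predict "know" from "You"
    x_hat = ols.predict(X_prev)                   # predicted embeddings x̂
    return X_next - x_hat                         # residuals, aligned to words 2..n
```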


Hallmarks of prediction in the remaining two subjects of the few-subject dataset and the multi-subject dataset.
First panel from the left shows the overall encoding performance of GPT-2, GloVe and arbitrary vectors. Lines show clusters of time points for which encoding performance is significantly different from zero (p < 0.05 under the permutation distribution). Middle panel shows the sensitivity to word predictability as defined by GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity as defined by GPT-2-XL’s top-five prediction. Lines show clusters of time points prior to word onset for which encoding performance is significantly larger for predictable than for unpredictable words (p < 0.05 under the permutation distribution). Shaded areas show 95% confidence intervals computed over sources and cross-validation splits in the single-subject analyses and over subjects in the multi-subject analysis. For the multi-subject dataset we find no evidence for the second hallmark of prediction, i.e. no sensitivity of pre-onset encoding performance to word predictability.
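The cluster lines throughout these figures reflect a cluster-based permutation test of encoding scores against zero (or between conditions). A sketch of the one-sample variant using MNE-Python, whose use here is an assumption; `scores` would hold one encoding time course per source/split (or per subject):

```python
import numpy as np
from mne.stats import permutation_cluster_1samp_test

def significant_clusters(scores, alpha=0.05, n_permutations=1000):
    """scores: (n_observations, n_times); return clusters with p < alpha."""
    t_obs, clusters, cluster_pv, _ = permutation_cluster_1samp_test(
        scores, n_permutations=n_permutations, tail=1)
    return [clusters[i] for i in np.where(cluster_pv < alpha)[0]]
```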

Self-predictability in the multi-subject dataset and in Goldstein et al. (2022b) data.
First panel from the left shows the self-predictability of GPT-2, GloVe and arbitrary models. Middle panel shows the sensitivity of GloVe self-predictability to word predictability as defined by GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity as defined by GPT-2-XL’s top-five prediction. Shaded areas show 95% confidence intervals computed over model dimensions. Self-predictability results are identical to those observed in the few-subject dataset. We find both ostensible hallmarks of prediction in the stimulus material itself, namely in the word embeddings of the presented words.

Predicting Acoustics prior to word onset from original embedding vectors for both datasets.
First panel from the left shows the overall encoding performance of GPT-2, GloVe and arbitrary vectors. Results closely mirror the brain encoding results both in encoding time course and in differences between models (see Figure 1B and C and S1). Middle panel shows the sensitivity to word predictability for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity for GPT-2-XL’s top-five prediction when using original GloVe embeddings. Shaded areas show 95% confidence intervals computed over the 9 dimensions (8 mels + envelope). Results mirror the qualitative differences observed in the brain encoding studies.

Encoding data using GPT-2, GloVe and arbitrary vectors after removing re-occurrences of bigrams (i.e. only retaining the first occurrence) in our few-subject dataset.
Panel A) shows reduced but significant pre-onset encoding of the MEG data (subject 1) for all three vector types. Panel B) shows model self-predictability after removing re-occurring bigrams. Removing bigrams had no effect on the self-predictability of any of the three models. Panel C) shows the pre-onset predictability of the acoustics of our few-subject dataset after removing re-occurring bigrams. As for the self-predictability, removing bigrams led to almost identical encoding performance as in Figure S6A. This shows that removing re-occurring bigrams does not remove the dependencies in the stimulus material, and pre-onset encoding of the neural data might still be driven by the predictability of stimulus or model features prior to word onset. Shaded areas show 95% confidence intervals computed over 10 cross-validation folds and over dimensions (for the self-predictability and acoustics) or channels (for the MEG data). Results mirror the qualitative differences observed in the brain encoding studies.
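A sketch of the bigram filter described above: a word is retained only if its (predecessor, word) pair has not occurred earlier in the stimulus (names are illustrative):

```python
def first_occurrence_bigrams(words):
    """Return indices of words whose bigram with the preceding word occurs
    for the first time; re-occurring bigrams are dropped."""
    seen, keep = set(), []
    for i in range(1, len(words)):
        bigram = (words[i - 1].lower(), words[i].lower())
        if bigram not in seen:
            seen.add(bigram)
            keep.append(i)   # keep this word's epoch / embedding / acoustics
    return keep
```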

Predicting Acoustics from arbitrary embedding vectors for all three datasets after all reoccurring bigrams have been removed.
Results show that even when re-occurrences of bigrams are removed, it is still possible to find significant pre-onset encoding for arbitrary vectors, which is solely due to encoding temporal dependencies in the stimulus material. Shaded areas show 95% confidence intervals computed over 10 cross-validation folds and the 9 dimensions of our acoustic data (8 mels + envelope). Results mirror the qualitative differences observed in the brain encoding studies.

Predicting Acoustics prior to word onset from residual word embeddings in the multi-subject dataset.
First panel from the left shows the overall encoding performance of residualised GPT-2, GloVe and arbitrary vectors. Results mirror the brain encoding results both in encoding time courses and in differences between models (see Figure S7C). Middle panel shows the sensitivity to word predictability for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity for GPT-2-XL’s top-five prediction when using residualised GloVe embeddings. Shaded areas show 95% confidence intervals computed over the 9 dimensions (8 mels + envelope). Results mirror the qualitative differences observed in the brain encoding study.

Ostensible hallmarks of prediction after removing model self-predictability through residualising word embeddings in the few-subject dataset and the multi-subject dataset.
First panel from the left shows the overall encoding performance of residualised GPT-2, GloVe and arbitrary vectors. Lines show clusters of time points for which encoding performance is significantly different from zero (p < 0.05 under the permutation distribution). Encoding performance as well as differences between models are reduced compared to encoding results with the original vectors. Middle panel shows the sensitivity to word predictability for GPT-2-XL’s top-one prediction, and the right panel shows the sensitivity for GPT-2-XL’s top-five prediction. Lines show clusters of time points prior to word onset for which encoding performance is significantly larger for predictable than for unpredictable words (p < 0.05 under the permutation distribution). Shaded areas show 95% confidence intervals computed over sources and cross-validation splits in the single-subject analyses and over subjects in the multi-subject analysis. For the multi-subject dataset we find no evidence for the second hallmark of prediction, i.e. no sensitivity of pre-onset encoding performance to word predictability.