Figures and data

A) MEG Encoding Model. MEG data were epoched to word onset and averaged over a sliding window of 100 ms, moving with a step size of 25 ms. The model representation (GPT-2, GloVe or arbitrary) of the word at t = 0 was then used to predict the neural response for each channel and time point in a separate cross-validated Ridge regression. The actual and predicted responses were then correlated time point by time point, resulting in a time-resolved encoding plot. B) Positive pre-onset encoding (subject 1) for GPT-2 (green), GloVe (blue) and arbitrary (grey) embeddings shows that it is possible to find ostensible neural signatures of pre-activation in MEG data. Lines show clusters of time points that are significantly different from zero (p < 0.05 under the permutation distribution). C) Encoding using GloVe embeddings demonstrates a slight pre-onset encoding advantage for predictable words (top-1 prediction by GPT-2-XL). Line indicates clusters of time points prior to word onset during which predictable words are significantly better encoded (p < 0.05 under the permutation distribution).
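The analysis in panel A can be summarised in a few lines. Below is a minimal sketch of the time-resolved encoding model, not the authors' actual pipeline; the array shapes, the helper name `time_resolved_encoding`, and the fixed ridge penalty are illustrative assumptions.

```python
# Minimal sketch of the time-resolved encoding analysis (illustrative, not
# the authors' code). Assumed inputs: `X` of shape (n_words, n_dims) holding
# one embedding (GPT-2, GloVe or arbitrary) per epoched word, and `epochs`
# of shape (n_words, n_channels, n_times) holding the sliding-window-averaged
# MEG response (100 ms window, 25 ms step).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def time_resolved_encoding(X, epochs, alpha=1.0, n_splits=5):
    """Correlate actual and cross-validated Ridge predictions per channel/time."""
    n_words, n_channels, n_times = epochs.shape
    scores = np.zeros((n_channels, n_times))
    for t in range(n_times):
        Y = epochs[:, :, t]                       # response at one time point
        pred = np.zeros_like(Y)
        for train, test in KFold(n_splits).split(X):
            model = Ridge(alpha=alpha).fit(X[train], Y[train])
            pred[test] = model.predict(X[test])
        for ch in range(n_channels):              # time-point-wise correlation
            scores[ch, t] = np.corrcoef(Y[:, ch], pred[:, ch])[0, 1]
    return scores                                 # time-resolved encoding plot
```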

A) Self-predictability of Model Representations. To assess the predictability of neighbouring model representations in each MEG epoch, we replaced the neural data at each time point with the model representation of the word presented during that time point. The model representation of the word at t = 0 was then used to predict the previous model representations for each dimension and time point in a separate cross-validated Ridge regression. The actual and predicted values were then correlated time point by time point, resulting in a time-resolved self-predictability plot. B) Self-predictability for GPT-2 (green), GloVe (blue) and arbitrary (grey) embeddings prior to word onset mirrors patterns observed when encoding the neural data. C) Self-predictability of GloVe embeddings demonstrates a clear advantage for predictable words (top-1 prediction by GPT-2-XL), and therefore the same qualitative difference as observed when encoding neural data.
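Under the same assumptions as the encoding sketch above, the self-predictability analysis only swaps the target: the neural data are replaced by the model representations of the words presented at each time point. The alignment array `word_at_time` below is a hypothetical construct, not from the paper.

```python
# Sketch of the target substitution for the self-predictability analysis.
# Assumptions: `embeddings` is (n_words_total, n_dims); `word_at_time[i, t]`
# gives the index of the word presented at time point t of epoch i (a
# hypothetical alignment structure, not from the paper).
import numpy as np

def build_embedding_epochs(embeddings, word_at_time):
    """Stack model representations in place of the MEG data."""
    n_epochs, n_times = word_at_time.shape
    pseudo = np.zeros((n_epochs, embeddings.shape[1], n_times))
    for i in range(n_epochs):
        for t in range(n_times):
            pseudo[i, :, t] = embeddings[word_at_time[i, t]]
    return pseudo

# Embedding dimensions now play the role of channels:
# scores = time_resolved_encoding(X, build_embedding_epochs(embeddings, word_at_time))
```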

A) To remove shared information between a word and its predecessor in the text, we residualised word embeddings by first fitting an OLS regression to predict the next word's embedding from the previous word's embedding, i.e. predicting “know” based on “You”. This resulted in a predicted embedding, which was then subtracted from the actual embedding, leaving a residual embedding stripped of any information linearly predictable from the preceding word.
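The residualisation step itself reduces to one regression and one subtraction; here is a sketch under the assumption that `embeddings` holds one vector per word in text order:

```python
# Sketch of the residualisation described in panel A (illustrative).
# Assumption: `embeddings` is an (n_words, n_dims) array ordered as the
# words occur in the text.
import numpy as np
from sklearn.linear_model import LinearRegression

def residualise(embeddings):
    """Remove the part of each embedding linearly predictable from its predecessor."""
    prev, nxt = embeddings[:-1], embeddings[1:]
    ols = LinearRegression().fit(prev, nxt)   # e.g. predict "know" from "You"
    return nxt - ols.predict(prev)            # residual = actual - predicted
```

The residual vectors then take the place of the original embeddings in the encoding and self-predictability analyses.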

A) To test whether removing self-predictability (as in Fig. 3) can correct for stimulus dependencies more generally, we investigated the predictability of pre-onset acoustics using residualised word embeddings. We computed an 8-Mel spectrogram and the envelope for each word and replaced the neural data at each time point with the respective acoustic representation. The residualised GPT-2, GloVe or arbitrary embedding of the word at t = 0 was then used to predict previous acoustic representations. The actual and predicted values were then correlated, resulting in a time-resolved correlation plot. B) Predictability of prior word acoustics when using residualised GPT-2 (green), GloVe (blue) and arbitrary (grey) embeddings prior to word onset. Patterns closely mirror those observed in each model's self-predictability and in the residual brain encoding results. C) Predictability of prior word acoustics using residualised GloVe embeddings demonstrates a clear advantage for predictable words (top-1 prediction by GPT-2-XL) in predicting prior acoustic representations, and therefore the same qualitative difference as observed when encoding neural data.
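One way to obtain the 9-dimensional acoustic representation (8 Mel bands plus the envelope) is sketched below; librosa's Mel spectrogram and a Hilbert-transform envelope are assumptions for illustration, not necessarily the paper's exact feature extraction.

```python
# Sketch of a per-word acoustic representation (8 Mel bands + envelope).
# Assumptions: `audio` is a word-aligned waveform snippet at sampling rate
# `sr`; librosa and a Hilbert envelope stand in for the paper's exact method.
import numpy as np
import librosa
from scipy.signal import hilbert

def acoustic_features(audio, sr):
    """Return one 9-dimensional vector: means of 8 Mel bands + mean envelope."""
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=8)
    env = np.abs(hilbert(audio))              # amplitude envelope
    return np.concatenate([mel.mean(axis=1), [env.mean()]])

# These 9-dimensional vectors replace the MEG data at each time point, and the
# residualised embedding at t = 0 predicts them with the same Ridge pipeline.
```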

Hallmarks of prediction in the remaining two subjects of the few-subject dataset and in the multi-subject dataset. First panel from the left shows the overall encoding performance of GPT-2, GloVe and arbitrary vectors. Lines show clusters of time points for which encoding performance is significantly different from zero (p < 0.05 under the permutation distribution). Middle panel shows the sensitivity to word predictability as defined by GPT-2-XL's top-1 prediction, and the right panel shows the sensitivity as defined by GPT-2-XL's top-5 prediction. Lines show clusters of time points for which encoding performance is significantly larger for predictable as opposed to unpredictable words prior to word onset (p < 0.05 under the permutation distribution). Shaded areas show 95% confidence intervals, computed over sources and cross-validation splits in the single-subject analyses and over subjects in the multi-subject analysis. For the multi-subject dataset we find no evidence for the second hallmark of prediction, i.e. no sensitivity of pre-onset encoding performance to the predictability of a word.

Self-Predictability in the multi-subject dataset. First panel from the left shows self-predictability of the GPT-2, GloVe and arbitrary models. Middle panel shows the sensitivity of GloVe's self-predictability to word predictability as defined by GPT-2-XL's top-1 prediction, and the right panel shows the sensitivity as defined by GPT-2-XL's top-5 prediction. Shaded areas show 95% confidence intervals computed over model dimensions. Self-predictability results are identical to those observed in the few-subject dataset. We find both ostensible hallmarks of prediction in the stimulus material itself, namely in its word embeddings.

Ostensible hallmarks of prediction after removing model self-predictability through residualising word embeddings. Left panels: Overall encoding performance of residualised GPT-2, GloVe and arbitrary vectors. Lines show clusters of significant time points (p < 0.05). Middle panels: Sensitivity to GPT-2-XL's top-1 prediction. Right panels: Sensitivity to GPT-2-XL's top-5 prediction. Lines show clusters with significantly larger encoding for predictable vs. unpredictable words prior to onset (p < 0.05). Shaded areas: 95% confidence intervals. For the multi-subject dataset, no evidence of the second hallmark of prediction is found.

Predicting Acoustics prior to word onset from residualised word embeddings in the multi-subject dataset. First panel from the left shows the overall encoding performance of residualised GPT-2, GloVe and arbitrary vectors. Results mirror the brain encoding results both in encoding time courses and in differences between models (see Figure S3C). Middle panel shows the sensitivity to word predictability as defined by GPT-2-XL's top-1 prediction, and the right panel shows the sensitivity as defined by GPT-2-XL's top-5 prediction, when using residualised GloVe embeddings. Shaded areas show 95% confidence intervals computed over the 9 dimensions (8 Mels + envelope). Results mirror the qualitative differences observed in the brain encoding study.

Predicting Acoustics prior to word onset from original embedding vectors for both datasets. First panel from the left shows the overall encoding performance of the original GPT-2, GloVe and arbitrary vectors. Results closely mirror the brain encoding results both in encoding time course and in differences between models (see Figures 1B, 1C and S1). Middle panel shows the sensitivity to word predictability as defined by GPT-2-XL's top-1 prediction, and the right panel shows the sensitivity as defined by GPT-2-XL's top-5 prediction, when using original GloVe embeddings. Shaded areas show 95% confidence intervals computed over the 9 dimensions (8 Mels + envelope). Results mirror the qualitative differences observed in the brain encoding studies.