Figures and data

The encoding model finds a linear mapping between words in the narrative and corresponding brain responses.
a. The encoding model mimics the brain by taking each word and its context and learning to generate a brain-like response within a time window around the word’s onset. b. Word embeddings are extracted from the GPT-2 model and used to predict brain responses at each time point t through linear regression. c. The brain score represents the correlation between actual and predicted brain responses across multiple words, calculated at each time point t.
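For readers who want to follow the procedure concretely, the sketch below shows the core of such a per-time-point encoding model: a ridge regression from word embeddings to the neural response at each time point t, scored by the correlation between predicted and held-out responses. All variable names, shapes, and the train/test split are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a per-time-point linear encoding model (assumed setup).
import numpy as np
from sklearn.linear_model import RidgeCV
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_words, emb_dim, n_times = 500, 768, 100            # e.g. 768-d GPT-2 embeddings
embeddings = rng.standard_normal((n_words, emb_dim))  # X: one row per word
brain = rng.standard_normal((n_words, n_times))       # y: neural response per time point

train, test = np.arange(0, 400), np.arange(400, 500)  # simple held-out split
brain_score = np.empty(n_times)
for t in range(n_times):
    model = RidgeCV(alphas=np.logspace(-2, 4, 7))
    model.fit(embeddings[train], brain[train, t])      # linear map: embedding -> y(t)
    pred = model.predict(embeddings[test])
    brain_score[t] = pearsonr(pred, brain[test, t])[0]  # correlation = brain score
```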

Word embeddings explain brain responses before word onset
Left column: All brain regions show encoding of the word embedding, with peak values in the left hemisphere, especially in the temporal cortex, inferior frontal areas, and temporoparietal junction associated with language processing. Right column: A comparison of encoding models shows that removing correlations between neighboring word embeddings (decor_X) does not eliminate pre-onset encoding. Similarly, eliminating bi-grams in the narrative (decor_X noBigram) slightly reduces overall encoding performance, but the pre-onset effect persists. The shaded regions in the line plots indicate the standard error of the mean (SEM) across aggregated ECoG electrodes/MEG sources of all participants.
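One way to implement the decor_X decorrelation named above is to regress the embeddings of the preceding words out of each word's embedding (the appendix figure on nearby-word correlations describes removing correlations with the previous 8 words); the sketch below is a minimal, assumed version of that step, not the authors' exact code.

```python
# Sketch of building "decor_X" embeddings by regressing out preceding words.
import numpy as np

def decorrelate_embeddings(X, n_prev=8):
    """Remove the linear contribution of the previous n_prev embeddings."""
    n_words, dim = X.shape
    X_decor = X.copy()
    for i in range(n_prev, n_words):
        context = X[i - n_prev:i].T                    # (dim, n_prev) predictors
        beta, *_ = np.linalg.lstsq(context, X[i], rcond=None)
        X_decor[i] = X[i] - context @ beta             # keep the residual embedding
    return X_decor
```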

Temporal generalization of representations captured by the encoding model differs before and after word onset.
a. Temporal generalization (TG) matrix computed by training the encoding model with decorrelated embeddings at one time point and testing it at another. Positive values indicate successful generalization of representations across time. The diagonal pattern reflects temporally dynamic rather than stable representations. b. Generalization profiles for models trained at the peak encoding response (~ 300 ms; purple) and along the diagonal (blue). The divergence between the two curves indicates that pre-onset encoding does not reflect the same representations engaged during word processing.
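A minimal sketch of how such a TG matrix can be computed, assuming the encoding-model setup sketched above: fit the regression at one time point and evaluate its predictions at every other time point. Shapes, names, and the fixed ridge penalty are assumptions.

```python
# Sketch of a temporal generalization (TG) analysis for an encoding model.
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

def temporal_generalization(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Y arrays are (n_words, n_times); returns an (n_times, n_times) TG matrix."""
    n_times = Y_train.shape[1]
    tg = np.empty((n_times, n_times))
    for t_train in range(n_times):
        model = Ridge(alpha=alpha).fit(X_train, Y_train[:, t_train])
        pred = model.predict(X_test)                   # one prediction per word
        for t_test in range(n_times):
            tg[t_train, t_test] = pearsonr(pred, Y_test[:, t_test])[0]
    return tg                                          # diagonal = standard encoding
```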

Removing autocorrelation between pre- and post-onset activity does not eliminate pre-onset encoding.
a. Temporal generalization (TG) matrix computed using decorrelated embeddings after regressing out post-onset brain activity from the pre-onset signal. Removing correlations in the neural signal should, in principle, eliminate any trace of predictive pre-activation. Following this procedure, the small pre-onset generalization observed near word onset in Fig. 3 disappears. b. Generalization profiles for models trained at the peak encoding response (~ 300 ms; purple) and along the diagonal (blue). The persistence of pre-onset encoding (blue curve) despite the absence of pre-activation indicates that pre-onset encoding is not necessarily a signature of prediction.
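The regression-out step can be implemented as an ordinary least-squares cleanup of each pre-onset time point, using all post-onset time points as predictors; the sketch below is one assumed realization of the procedure described in the caption.

```python
# Sketch of regressing post-onset brain activity out of the pre-onset signal.
import numpy as np

def regress_out_post_onset(epochs, onset_idx):
    """epochs: (n_words, n_times); columns >= onset_idx are post-onset."""
    post = epochs[:, onset_idx:]                       # predictors (n_words, n_post)
    cleaned = epochs.copy()
    for t in range(onset_idx):                         # each pre-onset time point
        beta, *_ = np.linalg.lstsq(post, epochs[:, t], rcond=None)
        cleaned[:, t] = epochs[:, t] - post @ beta     # residual pre-onset signal
    return cleaned
```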

Encoding of the future and past words.
All curves represent averages across participants. The embedding vector is constructed by concatenating d future word embeddings (d > 0) or |d| past word embeddings (d < 0) along with the embedding of the current word wᵢ. a. Including the next word embedding in the encoding model (d = 1) enhances encoding only after that word is heard in the story, while including the previous word (d = −1) improves encoding even after the current word's onset. Encoding enhancement, Δℛ, is shown for b. positive and c. negative values of d. Vertical gray lines mark the median inter-word interval values. Adding each successive future word embedding improves encoding only after that word is heard in the narrative, while including previous words consistently improves encoding beyond their offset.
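The concatenated embedding vectors described in this caption can be built as in the following sketch; `concat_embeddings` is a hypothetical helper, and Δℛ is then the brain-score difference against the current-word-only model (d = 0).

```python
# Sketch of building the concatenated embedding vector for a given d:
# positive d appends the next d word embeddings, negative d the previous |d|.
import numpy as np

def concat_embeddings(X, d):
    """X: (n_words, dim). Returns (n_valid, dim * (|d| + 1)) concatenated vectors."""
    n_words = X.shape[0]
    if d >= 0:
        rows = [X[i : i + d + 1].ravel() for i in range(n_words - d)]
    else:
        rows = [X[i + d : i + 1].ravel() for i in range(-d, n_words)]
    return np.asarray(rows)

# Delta-R at each time point, reusing brain_score() from the sketch above:
# delta_R = brain_score(concat_embeddings(X, d)) - brain_score(concat_embeddings(X, 0))
```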

Encoding results for individual subjects in the MEG dataset
To examine whether predictable words are better encoded before word onset, we categorized words as predictable or unpredictable based on whether they appeared in the top-5 predictions of GPT-2 for that position (Goldstein et al., 2022b). a. Brain score values peak in the left hemisphere, especially in the temporal cortex and inferior frontal areas associated with language processing. b. Brain scores from single (gray) and all 10 (black) sessions show encoding before word onset, peaking at or shortly after it (see Methods for MEG source selection). c. The encoding model was trained separately to compute brain scores for predictable and unpredictable words. Predictable words show stronger encoding up to 1 second before word onset. The encoding of unpredictable words appears stronger around 400 ms post-onset, though these effects did not reach statistical significance for subjects 2 and 3. Star symbols mark significant differences between predictable and unpredictable words, calculated using a dependent t-test for paired samples on the brain scores of the two groups across the same MEG sources and corrected for multiple hypothesis testing with the Benjamini-Hochberg procedure; q values smaller than 0.05 were considered significant. A smoothing window of 200 ms was used in this analysis. All error bars were computed as the standard error across the MEG sources.
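A sketch of the predictability split and the statistical test, under the assumptions that top-5 membership is read off GPT-2's next-word distribution and that SciPy and statsmodels provide the paired t-test and Benjamini-Hochberg correction; all array names are illustrative.

```python
# Sketch of the top-5 predictability split and the paired test with FDR control.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

def is_predictable(logits, word_id, k=5):
    """True if word_id is among the model's top-k most probable next tokens."""
    return word_id in np.argsort(logits)[::-1][:k]

# scores_pred, scores_unpred: (n_sources, n_times) brain scores per group.
# t, p = ttest_rel(scores_pred, scores_unpred, axis=0)     # paired across sources
# reject, q, *_ = multipletests(p, alpha=0.05, method="fdr_bh")  # q < 0.05 significant
```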

Encoding values for individual subjects in the ECoG dataset.
Brain score values and temporal profiles vary across subjects in the ECoG dataset, partly due to differences in the number and locations of electrodes selected for each participant. Shaded regions in the line plots indicate the standard error of the mean (SEM) across ECoG electrodes included for each subject.

Shape of the FIR filter used for band-pass filtering MEG data between 0.1 and 40 Hz.
Shown is the finite impulse response (FIR) filter implemented in MNE and used for MEG preprocessing. Because this filter is non-causal and applied in both forward and reverse directions, it can, in principle, introduce temporal leakage. However, the filter amplitude converges to zero within the first 100 ms, making any temporal leakage negligible for the effects examined in this study.
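To reproduce a figure like this, one can build the same kind of filter with MNE's public filter-design API and inspect its impulse response; the sampling rate below is an assumed value, not taken from the paper.

```python
# Sketch of inspecting an MNE FIR band-pass filter (0.1-40 Hz, as in the caption).
import numpy as np
import mne

sfreq = 1000.0                                     # assumed sampling rate (Hz)
h = mne.filter.create_filter(
    data=None, sfreq=sfreq, l_freq=0.1, h_freq=40.0, fir_design="firwin"
)                                                  # FIR coefficients = impulse response
t = (np.arange(len(h)) - len(h) // 2) / sfreq      # time axis centered on 0
print(f"filter length: {len(h)} taps ({len(h) / sfreq:.1f} s)")
```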

Average correlation between nearby word embeddings in the 10-hour MEG dataset.
(left) GPT-2 embeddings show large, symmetric correlations between nearby words. (right) After removing the correlation of each word embedding in the story with its previous 8 word embeddings, most of the correlations between nearby word embeddings disappear. These figures were obtained by calculating the correlation matrix for each window of 10 consecutive words (in non-overlapping windows) and then averaging those matrices across the 10-hour MEG dataset.
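The averaging procedure described here reduces to computing a 10 × 10 word-by-word correlation matrix per non-overlapping window and averaging across windows, as in this assumed sketch:

```python
# Sketch of the averaged nearby-word correlation matrix over 10-word windows.
import numpy as np

def avg_window_correlation(X, win=10):
    """X: (n_words, dim) embeddings; returns the (win, win) mean correlation."""
    mats = []
    for start in range(0, X.shape[0] - win + 1, win):      # non-overlapping windows
        mats.append(np.corrcoef(X[start : start + win]))   # rows = words in the window
    return np.mean(mats, axis=0)
```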

Temporal correlation within the epochs in the MEG dataset.
a. Time-by-time correlation matrix of the MEG signal across epochs. b. A slice of the correlation matrix at t = 0. The autocorrelation decays rapidly and reaches zero within ~1 second. c. The correlation matrix after regressing out all post-onset time points from each pre-onset time point.
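Both panels a and b follow from a single time-by-time correlation of the epoch matrix; a minimal sketch, with assumed shapes and an assumed onset index:

```python
# Sketch of the time-by-time correlation matrix across epochs.
import numpy as np

def time_by_time_corr(epochs):
    """epochs: (n_epochs, n_times); returns (n_times, n_times) correlations."""
    return np.corrcoef(epochs.T)       # rows = time points, observations = epochs

# Panel b is the row of this matrix at word onset:
# corr = time_by_time_corr(epochs); slice_at_onset = corr[onset_idx]
```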

Negative areas in the TG matrix are not caused by a sign flip in brain activity.
a. Average model predictions for pre- and post-onset activity (averaged over [-600, 0] ms and [0, 600] ms, respectively). For each MEG source, a subset of words showing anticorrelated model predictions is highlighted in orange. This analysis was performed on data from the first participant in the MEG dataset after decorrelating pre- and post-onset signals, similar to Fig. 4. b. Correlation matrix computed for the selected subset of words at each MEG source and averaged across the 30 MEG sources used in the main analyses. The resulting matrix shows that even for this subset, there is no systematic sign reversal in brain activity before and after word onset; instead, the two signals remain largely uncorrelated.
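Selecting the anticorrelated subset in panel a amounts to comparing the sign of the window-averaged model predictions before and after onset; the sketch below is an assumed implementation of that selection, with illustrative names.

```python
# Sketch of selecting words with anticorrelated pre-/post-onset predictions.
import numpy as np

def anticorrelated_subset(pred, times):
    """pred: (n_words, n_times) model predictions for one MEG source."""
    pre = pred[:, (times >= -0.6) & (times < 0)].mean(axis=1)    # [-600, 0] ms mean
    post = pred[:, (times >= 0) & (times <= 0.6)].mean(axis=1)   # [0, 600] ms mean
    return np.flatnonzero(pre * post < 0)    # words with opposite-sign predictions
```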

Analysis with original GPT-2 embeddings also shows that future word embeddings do not improve encoding, while past word embeddings do.
This figure was obtained in the same way as Fig. 5, but using the original GPT-2 embeddings instead of decorrelated embeddings. All curves represent averages across participants. The embedding vector is constructed by concatenating d future word embeddings (d > 0) or |d| past word embeddings (d < 0) along with the embedding of the current word wᵢ. a. Including the next word embedding in the encoding model (d = 1) enhances encoding only after that word is heard in the story, while including the previous word (d = −1) improves encoding even after the current word's onset. Encoding enhancement, Δℛ, is shown for b. negative and c. positive values of d. Vertical gray lines mark the median inter-word interval values. Adding each successive future word embedding improves encoding only after that word is heard in the narrative, while including previous words consistently improves encoding beyond their offset.

Lack of evidence for future-word encoding is not due to MEG source selection.
To test whether the absence of an effect observed in Fig. 5 could be explained by our MEG source selection procedure, we repeated the analysis using only sources that, on average, showed enhanced encoding within the [0, 230] ms window for d > 0. Even with this intentionally biased selection, encoding did not noticeably improve within the [0, 230] ms interval. In contrast, including embeddings of previous words resulted in a measurable increase in encoding performance.
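The biased selection described here can be expressed as a simple mask over sources, keeping those with a positive mean enhancement in the [0, 230] ms window; the sketch below is an assumption, and `delta_R` is a hypothetical array of enhancement values.

```python
# Sketch of the intentionally biased MEG source selection.
import numpy as np

def biased_sources(delta_R, times):
    """delta_R: (n_sources, n_times) encoding enhancement for some d > 0."""
    win = (times >= 0) & (times <= 0.230)              # [0, 230] ms window
    return np.flatnonzero(delta_R[:, win].mean(axis=1) > 0)
```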

No evidence for predictive pre-activation in the IFG region of the ECoG dataset.
To test whether the absence of effects was due to the broad selection of ECoG electrodes, we repeated the analysis using only electrodes located in the inferior frontal gyrus (IFG), previously reported to show robust pre-onset encoding of upcoming stimuli (Goldstein et al., 2022b). The same qualitative patterns were observed in the IFG as in the broader selection of electrodes in the main text. a. Top: Location of the IFG electrodes included in this analysis. Bottom: A comparison of encoding models shows that removing correlations between neighboring word embeddings (decor_X) does not eliminate pre-onset encoding. Similarly, eliminating bi-grams in the narrative (decor_X noBigram) slightly reduces overall encoding performance, but the pre-onset effect persists. The shaded regions in the line plots indicate the standard error of the mean (SEM) across the aggregated IFG electrodes of all participants. b. Temporal generalization of representations captured by the encoding model differs before and after word onset. Top: Temporal generalization (TG) matrix computed with decorrelated embeddings. Positive values indicate successful generalization of representations across time. Bottom: Generalization profiles for models trained at the peak encoding response (~300 ms; purple) and along the diagonal (blue). The divergence between the two curves indicates that pre-onset encoding does not reflect the same representations engaged during word processing. c. Removing autocorrelation between pre- and post-onset activity does not eliminate pre-onset encoding. Top: Temporal generalization (TG) matrix computed using decorrelated embeddings after regressing out post-onset brain activity from the pre-onset signal. Removing correlations in the neural signal should, in principle, eliminate any trace of predictive pre-activation. Following this procedure, the small pre-onset generalization observed near word onset in panel b disappears. Bottom: Generalization profiles for models trained at the peak encoding response (~300 ms; purple) and along the diagonal (blue). The persistence of pre-onset encoding (blue curve) despite the absence of pre-activation indicates that pre-onset encoding is not necessarily a signature of prediction. d. Encoding of the future and past words. All curves represent averages across participants. The embedding vector is constructed by concatenating d future word embeddings (d > 0) or |d| past word embeddings (d < 0) along with the embedding of the current word wᵢ. Left: Including the next word embedding in the encoding model (d = 1) enhances encoding only after that word is heard in the story, while including the previous word (d = −1) improves encoding even after the current word's onset. Encoding enhancement, Δℛ, is shown for positive (top right) and negative (bottom right) values of d. Vertical gray lines mark the median inter-word interval values. Adding each successive future word embedding improves encoding only after that word is heard in the narrative, while including previous words consistently improves encoding beyond their offset.

No evidence for future-word encoding replicated using GloVe embeddings in MEG data.
To test whether the absence of future-word encoding observed in Fig. 5 could be attributed to the contextual nature of GPT-2 embeddings, we repeated the analysis using static GloVe embeddings. a. The variable d represents the number of word embeddings concatenated with the current word embedding to form the final vector. Positive values of d signify the inclusion of future word embeddings, and negative values of d signify the inclusion of past word embeddings. b. and c. The same qualitative pattern was observed: including future-word embeddings did not enhance model performance within the [0, 230] ms window after word onset, whereas including past-word embeddings clearly improved it.
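Swapping static GloVe vectors into the same pipeline only changes the embedding matrix; one common way to obtain them is via gensim's downloader, as in this assumed sketch (the model name and tokens are illustrative, and this is not necessarily the authors' loading route).

```python
# Sketch of substituting static GloVe vectors for contextual GPT-2 embeddings.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")          # 300-d static word vectors
words = ["once", "upon", "a", "time"]                # narrative tokens (illustrative)
X = np.stack([glove[w] for w in words if w in glove])
# X then replaces the GPT-2 embedding matrix, e.g. in concat_embeddings(X, d).
```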