Executive resources shape the impact of language predictability across the adult lifespan

  1. Merle Marie Schuckart (corresponding author)
  2. Sandra Martin
  3. Sarah Tune
  4. Lea-Maria Schmitt
  5. Gesa Hartwigsen
  6. Jonas Obleser
  1. Department of Psychology, University of Lübeck, Germany
  2. Center of Brain, Behavior and Metabolism, University of Lübeck, Germany
  3. Research Group Cognition and Plasticity, Max Planck Institute for Human Cognitive and Brain Sciences, Germany
  4. Donders Institute for Brain, Cognition and Behaviour, Radboud University, Netherlands
  5. Wilhelm Wundt Institute for Psychology, Leipzig University, Germany

eLife Assessment

This study presents a valuable finding on whether executive resources mediate the impact of language predictability in reading in the context of aging. The evidence is solid in the investigation of prediction in reading, with one caveat that the text materials used could be biased against the aging population. The work will be of interest to cognitive neuroscientists working on reading, language comprehension, and executive control.

https://doi.org/10.7554/eLife.108176.3.sa0

Abstract

Humans routinely anticipate upcoming language, but whether such predictions come at a cognitive cost remains debated. In this study, we demonstrate the resource-dependent nature of predictive mechanisms in language comprehension across the lifespan: Experimentally limiting executive resources through a concurrent task reduces the effect of language predictability on reading time. Participants (N = 175, replication N = 96) read short articles presented word-by-word while completing a secondary font colour n-back task, thus varying cognitive demand. Language predictability was indexed by word surprisal as derived from a pre-trained large language model (GPT-2). Across two independent samples, our findings reveal that language predictions are not cost-free: They draw on executive control resources, and this dependency becomes more pronounced with age (18–85 years). These results help resolve the debate over cognitive demands in language comprehension and highlight prediction as a dynamic, resource-dependent process across the lifespan.

eLife digest

Understanding language requires more than simply recognizing individual words. Instead, readers and listeners continuously generate predictions about what might come next. For instance, after reading “She walked her…”, most individuals would probably anticipate the word “dog.”

Such predictive processes facilitate comprehension, making it faster and more efficient. Yet, a long-standing debate concerns whether these predictions emerge effortlessly and automatically, or whether they depend on domain-general executive resources, including working memory, attention, inhibitory control and goal-directed behaviour. Importantly, these resources are finite: they can be temporarily depleted when performing additional non-linguistic tasks and generally decline with age.

Schuckart et al. investigated whether language prediction during reading relies on executive resources and how language prediction might change with age as executive functioning declines. Employing a dual task reading paradigm, they examined how imposing a cognitively demanding secondary task influenced the effect of word predictability on reading speed in adults aged 18 to 85. This approach addressed a fundamental question in psycholinguistics – whether prediction is automatic or resource-dependent – and shed light on why language comprehension remains robust despite age-related cognitive decline.

The study involved 175 participants, with an independent replication sample of 96, who read texts word by word while sometimes performing a concurrent working-memory task. Word predictability was quantified using ‘surprisal’, a lexical score derived from the large language model GPT-2, which provides information about how unpredictable a word is.

The results revealed that less predictable words were read more slowly, confirming that readers actively generated linguistic predictions. The effect of predictability diminished under increased cognitive load, demonstrating that language prediction draws on executive resources. Older adults exhibited stronger predictability effects overall, but these were also more susceptible to reduction when executive resources were taxed. Together, the results show that language prediction is not cost-free and changes systematically across the adult lifespan.

These findings advance our understanding of how predictive language processing relies on executive resources and may have implications for interventions in conditions where these resources are compromised, such as following a stroke.

Introduction

We constantly rely on our ability to swiftly yet accurately process linguistic input. When reading a book, watching television, or navigating a car through busy traffic while following instructions, language prediction is considered a catalyst that enhances the efficiency of linguistic processing (Clark, 2013; Pickering and Garrod, 2007; Onnis et al., 2022; Rao and Ballard, 1999; Ryskin and Nieuwland, 2023). However, due to the inherent flexibility and richness of natural language, upcoming words can rarely be predicted from context with complete certainty (Rubenstein and Aborn, 1958). Instead, linguistic features are thought to be pre-activated broadly rather than following an all-or-nothing principle, as there is evidence for predictive processing even in moderately or weakly constraining contexts (Boston et al., 2008; Roland et al., 2012; Schmitt et al., 2021; Smith and Levy, 2013). This generation of graded predictions is sometimes described as being passive and cost-free (Luke and Christianson, 2016). However, it is still under debate whether maintaining such an elaborate process really incurs no cognitive cost.

Graded language predictions necessitate the active generation of hypotheses on upcoming words as well as the integration of prediction errors to inform future predictions (Clark, 2013; Ryskin and Nieuwland, 2023). Supporting this, recent evidence suggests that language predictions may indeed impose processing demands. Shain et al. (2024) found that reading time increases with decreasing word predictability, with even small drops in predictability of highly expected words leading to significant processing costs. These findings suggest that language predictions are not entirely automatic or effortless. This aligns with numerous neuroimaging studies arguing for a strong interaction between language-specific and domain-general executive brain regions (Geranmayeh et al., 2017; Martin et al., 2022; Sliwinska et al., 2017; Wingfield and Grossman, 2006), particularly in situations that are cognitively demanding (Erb et al., 2013; Vaden et al., 2013; Vaden et al., 2015; Vaden et al., 2016; Peelle, 2018).

In this context, domain-general executive resources refer to higher-level cognitive control processes, such as working memory, inhibitory control, and cognitive flexibility, that are crucial for managing and coordinating behaviour across a wide range of tasks and modalities (Alvarez and Emory, 2006; Duncan, 2010; Friedman and Miyake, 2017). These processes are supported by the multiple-demand system, a fronto-parietal network that is recruited in various cognitively challenging situations (Duncan, 2010). However, while numerous studies support the claim that language prediction relies on such domain-general executive resources, a body of research suggests the opposite (e.g., Ryskin et al., 2020; Shain et al., 2020; Wehbe et al., 2021; Diachek et al., 2020). This raises the unresolved question to what extent such resources are taxed by predictive processes (Ryskin and Nieuwland, 2023; MacGregor et al., 2022; Shain et al., 2022; Xie et al., 2023).

An interesting test case for the dynamic interplay between language-specific and domain-general systems is cognitive ageing, as advancing age has been shown to be associated with sensory and executive decline (Idowu and Szameitat, 2023; Salthouse et al., 2003; Verhaeghen et al., 2003). In the current study, we thus ask: How does language prediction change when executive resources are limited – both intrinsically due to advanced age, and extrinsically through increased task demands?

The age-related change in cognitive resources is reflected by longer linguistic processing times, especially in situations with high cognitive effort, such as dual-task processing (Liu et al., 2016; Smiler et al., 2003). However, previous research presents conflicting results on how cognitive ageing affects the use of linguistic predictions during language comprehension. Behavioural studies suggest a decline in using context to make semantic predictions with age (Häuser et al., 2019) while EEG studies present a more nuanced picture. Some studies indicate heightened neural sensitivity to unexpected information in older adults (Cheimariou et al., 2019), while others report no significant age-related differences in neural response (Dave et al., 2018; Payne and Federmeier, 2018).

Furthermore, it is unclear how the use of language predictions across the adult lifespan might be affected by increasing cognitive demands. According to the Compensation-Related Utilisation of Neural Circuits Hypothesis (CRUNCH; Reuter-Lorenz and Cappell, 2008), older adults might reach the point at which their cognitive capacity is fully exhausted earlier than young adults, leading to a performance decline. If language prediction draws on executive resources, its effect on reading time might thus diminish with increasing cognitive load due to shared cognitive resources. However, this effect might re-emerge once capacity limits are reached, causing tasks to be processed sequentially, which would result in an increased effect of language prediction paralleled by slower performance.

Here, we explored the role of executive control in language prediction in two large cohorts across the adult lifespan. Using a novel dual-task paradigm that couples natural reading with an n-back task that taxes executive resources, we tested the following hypotheses: First, both increased cognitive load and reduced word predictability (i.e., increased surprisal) should be reflected by longer reading time (Figure 1a, b), alongside a decrease in text comprehension and n-back task performance. Second, we expected that the formation of language predictions should be contingent upon the availability of executive resources. Most importantly, a gradual limitation of these resources due to increased task demands should result in diminished effects of language predictability on reading time (interaction between cognitive load and surprisal; Figure 1c).

Figure 1
Visualisation of hypotheses.

We expected main effects on reading time of (a) cognitive load and (b) surprisal, as well as (c) an interaction of surprisal and cognitive load. Additionally, (d) we explored how these effects are modulated by age.

Lastly, we explored how these previously described effects would be modulated by age (Figure 1d). Note that the literature allows for contradicting hypotheses: On the one hand, if language predictions are impaired under limited executive resources, older individuals should rely less on language predictions due to overall decreased executive resources (Idowu and Szameitat, 2023; Salthouse et al., 2003). This should be reflected by diminished predictability effects. On the other hand, given the presumed stability of language comprehension across the lifespan (Shafto and Tyler, 2014), older adults might instead rely more heavily on language predictions, thus fully compensating for any impairments in reading comprehension caused by sensory and executive decline (Idowu and Szameitat, 2023; Salthouse et al., 2003; Verhaeghen et al., 2003). In this case, we should see strong surprisal effects independent of age, or even stronger surprisal effects in older than younger adults.

The two large-sample reading-time studies presented here help resolve the ongoing discussion on the role of executive resources in language predictions with three key findings: First, across our large age range, we found a general increase in reading time with both low language predictability and high task demands. Second, higher task demands reduce the influence of language predictability on reading time, indicating that linguistic prediction relies on executive control resources. Third, predictability had a more pronounced effect on reading time in older adults compared to younger ones. These findings highlight the dynamic interaction between language predictability and executive resources.

Results

We report data from 175 participants (M = 44.9 ± 17.9, 18–85 years, 51% female) who were either tested online (Nonline = 80) or in the laboratory (Nlaboratory = 95) in one session. Moreover, we conducted an internal, pre-registered replication study involving another 96 participants (M = 39.8 ± 14.0, 18–70 years, 51% female) tested online to replicate our main findings (see Figure 2—figure supplement 1 for age distributions of all samples). During the experiment, participants engaged in a self-paced reading task. They read 300-word newspaper articles, presented word-by-word in various font colours (Figure 2a). The task was either performed in isolation (Reading Only) or paired with a competing n-back task on the words' font colour (1-back and 2-back Dual Task). Participants were instructed to read the texts carefully, as content-related multiple-choice questions were asked after each text (see Methods for details).
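To make the secondary task concrete, the following is a minimal sketch of how targets in a font colour n-back task can be identified: a word is a target when its colour matches the colour shown n words earlier. The function name and the colour sequence are hypothetical and chosen for illustration only; they are not taken from the experiment code.

```python
from typing import List


def nback_targets(colours: List[str], n: int) -> List[bool]:
    """Mark each position whose font colour matches the colour n words earlier."""
    # Positions with fewer than n predecessors can never be targets.
    return [i >= n and colours[i] == colours[i - n] for i in range(len(colours))]


# Hypothetical font colours for six consecutive words.
colours = ["red", "blue", "red", "red", "green", "red"]

one_back = nback_targets(colours, 1)  # compare with the previous word
two_back = nback_targets(colours, 2)  # compare with the word two positions back
```

In this toy sequence the 1-back rule yields a single target (the fourth word), whereas the 2-back rule yields two targets (the third and sixth words), which illustrates why the 2-back condition requires holding more items in working memory.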

Figure 2 with 1 supplement
Experimental design and quantification of predictability as word surprisal using a large language model (GPT-2).

(a) Participants were asked to perform a self-paced reading task (Reading Only) which was complemented in some blocks by a secondary n-back task on the font colour of the words (Reading + 1-back, and Reading + 2-back). The order of the blocks was pseudo-randomised, with Reading Only always being the first condition to be presented, followed by the two dual-task conditions, and another main block for each of the three conditions. Both dual-task paradigms (Reading + 1-back and Reading + 2-back) were first introduced in short single-task training sets. (b) We generated one surprisal score for each word in the reading material by using context chunks of two words as prompts for next-word predictions in GPT-2. The resulting probability for the actual next word in the text (here: ‘mail’, marked in teal) was then transformed into a surprisal score, which reflected how predictable the respective word was given the context. Additionally, based on the distribution of probabilities for all possible continuations, we computed an entropy score, which reflects the uncertainty in predicting the next word. Please note that the example sentence used here has been translated to English for better comprehensibility, while the original text materials were in German.

Increased cognitive load and older age reduce task performance

Results from a generalised mixed model for text comprehension accuracy showed that participants read the texts carefully and answered most comprehension questions correctly, with accuracies of 93% ± 14% (mean ± SD) in the Reading Only condition, and 80% ± 25% in the 1-back and 73% ± 28% in the 2-back Dual Task conditions (Figure 3—figure supplement 1b). Increases in cognitive load (OR1-back vs. BL = 0.253, z1-back vs. BL = –8.928, p1-back vs. BL < 0.001, OR2-back vs. BL = 0.156, z2-back vs. BL = –12.403, p2-back vs. BL < 0.001) and in age (OR = 0.986, z = –2.676, p = 0.013) were associated with poorer performance (see Methods for model details; Figure 3a, Appendix 1—table 1).
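As a rough aid for reading the odds ratios, the sketch below converts a baseline accuracy into odds, scales the odds by an odds ratio, and converts back to a probability. Using the raw Reading Only mean (93%) as the baseline is an assumption made purely for illustration; the model-based baseline is conditional on random effects and covariates, so the resulting values are not expected to equal the observed condition means.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained after scaling the baseline odds by an odds ratio."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Assumption: raw Reading Only accuracy stands in for the model's baseline.
baseline = 0.93

acc_1back = apply_odds_ratio(baseline, 0.253)  # OR for 1-back vs. Reading Only
acc_2back = apply_odds_ratio(baseline, 0.156)  # OR for 2-back vs. Reading Only
```

Under this simplification, the implied accuracies fall to roughly 77% and 67%, which is in the neighbourhood of, but not identical to, the descriptive means reported above.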

Figure 3 with 2 supplements
Estimated marginal effects of predictors age, cognitive load, and surprisal on task performance and reading time.

Main effects of cognitive load and age on accuracy in the comprehension question task (a) and on n-back task performance (d-primes; b). Please note that we do not show d-primes for the Reading Only task as there was no n-back task in this condition. Reading time increased with increasing age and word surprisal (c, left: results from linear mixed model, LMM, right: results from generalised additive model, GAM – for an explanation see section Modelling potential non-linear contributions). In panel (d), we show the two-way interaction of cognitive load and surprisal (left) and cognitive load and age (middle). In both cases, effects were strongest in the Reading Only condition (see bar plot insets). Additionally, we show how age modulates the effect of surprisal on reading time (d, right). For raw and predicted individual trajectories, please see Figure 3—figure supplements 1 and 2 in the Supplementary Material. Estimated marginal effects were adjusted for ‘Reading Only’ as the reference level. N = 175.

Similarly, d′ values demonstrated good performance in the n-back task, with mean d′ of 3.77 ± 0.8 in the 1-back and 2.12 ± 0.87 in the 2-back Dual Task condition (Figure 3—figure supplement 1a). The overall high d′ values observed here can be attributed to the low target ratio in the experiment, resulting in high correct rejection and low false alarm rates. A linear mixed-effects model revealed that n-back performance declined with cognitive load (β = –1.636, t(173.13) = –26.120, p < 0.001), with more pronounced effects with advancing age (β = –0.014, t(169.77) = –3.931, p < 0.001; Figure 3b, Appendix 1—table 1).
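For readers unfamiliar with the sensitivity index, d′ is the difference between the z-transformed hit rate and the z-transformed false alarm rate. The sketch below shows one common way to compute it. The log-linear correction (adding 0.5 to each count) and the trial counts are assumptions for illustration; the correction actually applied in the study is not stated in this section.

```python
from statistics import NormalDist


def d_prime(hits: int, misses: int, false_alarms: int, correct_rejections: int) -> float:
    """Sensitivity index d' = z(hit rate) - z(false alarm rate)."""
    # Assumption: log-linear correction keeps rates away from exactly 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical block with few targets: 20 targets among 300 words.
example = d_prime(hits=18, misses=2, false_alarms=3, correct_rejections=277)

# A participant who cannot discriminate targets from non-targets scores zero.
chance = d_prime(hits=10, misses=10, false_alarms=10, correct_rejections=10)
```

With a low target ratio, even a handful of false alarms corresponds to a very small false alarm rate, which pushes d′ upwards and is consistent with the high values noted above.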

Reading time increases with age, surprisal, and cognitive load

To operationalise language predictability, we calculated word surprisal scores using a 12-layered generative pre-trained transformer model (GPT-2; Radford et al., 2019; Figure 2b). Word surprisal quantifies the predictability of the current word given its preceding context (Smith and Levy, 2013; Hale, 2001; Levy, 2008). We chose a context length of two words, as constraining the context has been shown to increase GPT-2’s psychometric predictive power, making its next-word predictions more human-like (Kuribayashi et al., 2022). In addition to each word’s surprisal, we also computed word entropy, which reflects the uncertainty in predicting the next word. Thus, while word surprisal indicates the predictability of each word, word entropy reflects the uncertainty underlying its prediction.
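The two measures can be illustrated with a toy next-word distribution. Surprisal is the negative log probability of the word that actually occurred, and entropy is the expected surprisal over all possible continuations. The probabilities below are hypothetical rather than GPT-2 output, and the base-2 logarithm (bits) is an assumption, as the unit is not specified in this section.

```python
import math
from typing import Dict


def surprisal(prob: float) -> float:
    """Surprisal of an observed word given its probability in context (in bits)."""
    return -math.log2(prob)


def entropy(dist: Dict[str, float]) -> float:
    """Entropy of a next-word distribution, i.e. the expected surprisal (in bits)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical probabilities for the word following a two-word context.
next_word = {"mail": 0.5, "door": 0.25, "news": 0.125, "dog": 0.125}

s_mail = surprisal(next_word["mail"])  # surprisal of the word that actually occurred
s_dog = surprisal(next_word["dog"])    # a less likely continuation is more surprising
h = entropy(next_word)                 # uncertainty before the word is seen
```

Note that entropy depends only on the context and is therefore available before the word appears, whereas surprisal can only be computed once the actual word is known.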

We used linear mixed-effects regression (LMM) to assess the effect of word surprisal on reading time and its interaction with cognitive load and age (see Methods for model details). Our model as a whole was able to explain 64% of variance (marginal R2) in single-word reading time from surprisal, n-back load, n-back performance, and other linguistic and demographic variables, and up to 81% when additionally considering the variability across subjects (conditional R2; see Table 1 and Appendix 1—table 2).

As hypothesised, we observed significantly longer reading time with advancing age, independent of cognitive load condition (β = 0.009, t(178.46) = 9.199, p < 0.001). For illustration, each additional year of age increased reading time by roughly 1% (see Figure 3c, top; Table 1; Figure 3—figure supplement 2a).

Table 1
Main results for model for reading time (N = 175).
| Predictors | Estimate | Std. error | CI | t | df | p |
|---|---|---|---|---|---|---|
| Main effects | | | | | | |
| Surprisal | 0.001707 | 0.000151 | 0.001411 to 0.002002 | 11.320677 | 2361.37 | 1.368 × 10^–28* |
| Age | 0.009113 | 0.000991 | 0.007158 to 0.011068 | 9.199100 | 178.46 | 1.751 × 10^–16* |
| Cognitive load [1-back vs. Reading Only] | 0.473800 | 0.013916 | 0.446336 to 0.501264 | 34.046321 | 176.18 | 8.399 × 10^–79* |
| Cognitive load [2-back vs. Reading Only] | 0.791540 | 0.026090 | 0.740046 to 0.843034 | 30.338989 | 173.76 | 7.320 × 10^–71* |
| Two-way interactions | | | | | | |
| Surprisal × age | 0.000035 | 0.000004 | 0.000027 to 0.000042 | 9.287151 | 287,771.27 | 3.481 × 10^–20* |
| Surprisal × cognitive load [1-back vs. Reading Only] | –0.001093 | 0.000161 | –0.001409 to –0.000776 | –6.771521 | 287,959.11 | 2.043 × 10^–11* |
| Surprisal × cognitive load [2-back vs. Reading Only] | –0.001255 | 0.000163 | –0.001575 to –0.000935 | –7.681261 | 288,294.96 | 2.709 × 10^–14* |
| Age × cognitive load [1-back vs. Reading Only] | –0.002798 | 0.000776 | –0.004330 to –0.001267 | –3.606479 | 171.99 | 5.135 × 10^–4* |
| Age × cognitive load [2-back vs. Reading Only] | –0.002458 | 0.001454 | –0.005329 to 0.000412 | –1.690400 | 170.79 | 9.681 × 10^–2 |
| Three-way interactions | | | | | | |
| Surprisal × age × cognitive load [1-back vs. Reading Only] | –0.000111 | 0.000009 | –0.000129 to –0.000094 | –12.266076 | 287,807.34 | 3.748 × 10^–34* |
| Surprisal × age × cognitive load [2-back vs. Reading Only] | –0.000078 | 0.000009 | –0.000096 to –0.000060 | –8.483676 | 287,771.65 | 4.384 × 10^–17* |
| Model fit | | | | | | |
| Intra-class correlation (ICC) | 0.46 | | | | | |
| Marginal R2/conditional R2 | 0.643/0.807 | | | | | |
  1. All continuous predictors were centred. Degrees of freedom for p-values, standard errors and confidence intervals (CIs) were computed using Satterthwaite’s approximation. All p-values reported here are FDR-corrected and were computed using ANOVAs with type III sum of squares. Results that are significant on an alpha-level of 0.05 are marked with a star.
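For orientation, the sketch below shows how a false discovery rate correction adjusts a set of p-values. It assumes the Benjamini–Hochberg procedure, which is the most common FDR method; the specific procedure used for Table 1 is not named in this section, and the raw p-values in the example are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as its raw counterpart, so a result that remains below 0.05 after adjustment is significant while controlling the expected proportion of false discoveries.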

To account for the potential influence of verbal intelligence and education on the observed age effects, we compared three additional LMMs: a baseline model without these predictors, and two additional models, one including a verbal intelligence predictor and the other including an education predictor (please see Methods section Control analysis: Assessing potential effects of verbal intelligence and education).

Adding verbal intelligence as a predictor to our model did not significantly improve the model fit (χ²(1) = 0.769, p = 0.381). This implies that verbal intelligence cannot account meaningfully for the differences in reading time found between younger and older participants.

When including years of education as a predictor, we observed a significant effect on reading time (β = 0.009, t = 9.199, p < 0.001), which led to a modest yet significant improvement in model fit (χ²(1) = 4.209, p = 0.0402, AICmodel_baseline = 37,580, AICmodel_education = 37,578). However, the three-way interaction of age, surprisal and cognitive load we found in our original model remained significant (β1-back = –0.00007, t1-back(88.61) = –5.229, p1-back < 0.001, β2-back = –0.00004, t2-back(88.90) = –2.784, p2-back = 0.007), suggesting that while education has a significant effect on reading time, it cannot account for the age-related effects observed in our models.
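The model comparisons above are likelihood ratio tests with one degree of freedom, since each extended model adds a single predictor. As a sanity check, the p-value for such a test can be recovered from the χ² statistic with the standard library alone. The function name is hypothetical; the closed form used here holds only for one degree of freedom.

```python
import math


def lrt_p_value_df1(chi_square: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2.0))


p_verbal_iq = lrt_p_value_df1(0.769)  # adding verbal intelligence
p_education = lrt_p_value_df1(4.209)  # adding years of education

# With one extra parameter, the AIC improvement equals chi-square minus 2.
delta_aic = 4.209 - 2 * 1
```

The resulting values agree with the reported p = 0.381 and p = 0.0402 to rounding, and the implied AIC improvement of about 2.2 is consistent with the reported AIC values of 37,580 and 37,578.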

Analogous to the effect of age, increased cognitive load was associated with significantly longer reading time (β1-back = 0.474, t1-back(176.18) = 34.046, p1-back < 0.001, β2-back = 0.792, t2-back(173.76) = 30.339, p2-back < 0.001; Table 1 and Figure 3—figure supplements 1 and 2), indicating participants read more slowly when faced with a more challenging task. Even after excluding the Reading Only condition from the model and comparing only the two equally attention-demanding dual-task conditions 1-back and 2-back (control analysis), thereby controlling for attentional switching costs, this effect still held true (β = 0.339, t(178.98) = 16.221, p < 0.001; Appendix 1—table 4).

Moreover, in line with our hypothesis, we observed a consistent increase in reading time with higher surprisal (β = 0.002, t(2361.37) = 11.321, p < 0.001; Table 1 and Figure 3c, bottom). Highly predictable words were read more quickly. Specifically, for a change in surprisal by one standard deviation, reading time increased by about 2.1%. For words and individuals matched in all other regards, this translates to a mean reading-time difference of approximately 118 ms between words with the minimum and maximum surprisal values in our dataset (range of surprisal values: 3.56–72.19).

Cognitive load reduces the impact of surprisal

In line with our hypotheses, cognitive load significantly modulated the effect of surprisal on reading time (β1-back = –0.001, t1-back(287,959.11) = –6.772, p1-back < 0.001, β2-back = –0.001, t2-back(288,294.96) = –7.681, p2-back < 0.001). While we observed a clear increase in reading time in the Reading Only condition when surprisal was high, this effect was mitigated in both the 1-back and 2-back Dual Task conditions, where cognitive load was increased (Table 1 and Figure 3d, left).

Age modulates the effect of cognitive load

While increased task demands were associated with prolonged reading time across participants, we found this effect became less pronounced with advancing age (β1-back = –0.003, t1-back(171.99) = –3.606, p1-back < 0.001; β2-back = –0.002, t2-back(170.79) = –1.690, p2-back = 0.097; Table 1 and Figure 3d, middle). To illustrate, when comparing the reading time in the 2-back Dual Task and the Reading Only condition, we found an increase of 130.48% in an average young (i.e., 27 years old) and an increase of 111.30% in an average older (i.e., 63 years old) participant.
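The percentages in this illustration can be approximated directly from the coefficients in Table 1. The sketch below assumes that reading time was modelled on a log scale (so that coefficients translate into multiplicative changes) and that age was centred at the sample mean of 44.9 years; both are assumptions inferred from the reported values rather than stated in this section, and all other predictors are held at their reference or mean values.

```python
import math

# Coefficients taken from Table 1 (2-back vs. Reading Only).
BETA_LOAD_2BACK = 0.791540    # main effect of cognitive load
BETA_AGE_X_2BACK = -0.002458  # age x cognitive load interaction
MEAN_AGE = 44.9               # assumption: age centred at the sample mean


def percent_increase_2back(age: float) -> float:
    """Approximate % increase in reading time from Reading Only to 2-back at a given age."""
    # Assumption: log-scale outcome, so the summed coefficients form a log ratio.
    log_ratio = BETA_LOAD_2BACK + BETA_AGE_X_2BACK * (age - MEAN_AGE)
    return (math.exp(log_ratio) - 1.0) * 100.0


young = percent_increase_2back(27)  # average young participant
older = percent_increase_2back(63)  # average older participant
```

Under these assumptions, the computed increases come out close to the 130.48% and 111.30% reported above; small deviations are expected because the ages and coefficients used here are rounded.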

Age modulates the effect of surprisal

Older age was associated with stronger surprisal effects (β = 0.00004, t(287,771.27) = 9.287, p < 0.001; Table 1 and Figure 3d, right), indicating that highly unpredictable (i.e., surprising) words were associated with a significantly longer reading time, especially in older adults. This suggests that, as individuals age, the effect of word predictability on their reading time becomes increasingly pronounced instead of remaining constant or diminishing.

Age modulates the interaction of surprisal and cognitive load

Finally, we tested whether the observed interaction between cognitive load and surprisal changes with advancing age: We found that age indeed modulated the joint influence of cognitive load and surprisal on reading time, as reflected by a significant three-way interaction of surprisal, cognitive load and age (β1-back–Reading Only = –0.00011, t1-back–Reading Only(287,807.34) = –12.266, p1-back–Reading Only < 0.001, β2-back–Reading Only = –0.00008, t2-back–Reading Only(287,771.65) = –8.484, p2-back–Reading Only < 0.001; Table 1).

Even after excluding the Reading Only condition from the model and contrasting only the dual-task conditions 1-back and 2-back (control analysis), results still showed a significant three-way interaction of age, cognitive load, and surprisal (β2-back–1-back = 0.00003, t(188,203.53) = 3.373, p = 0.001; Appendix 1—table 4). Please note the change in reference level caused by the exclusion of the Reading Only condition, which causes a reversal in the direction of the three-way interaction between age, cognitive load, and surprisal.

To get a more nuanced understanding of age-related differences in the effect of cognitive load and surprisal on reading time, we conducted a simple slopes analysis for our original model (Figure 4, Figure 4—figure supplement 1): Under low cognitive load (condition Reading Only), surprisal significantly influenced reading time in all but the youngest participants. Put simply, when participants read a text without having to perform an additional n-back task, predictable (i.e., low surprisal) words yielded significantly shorter reading time than unpredictable (i.e., high surprisal) words (Figure 4a, left panel, first plot). This effect was most pronounced in the oldest participants (β85–18 years = 0.006, p < 0.001).

Figure 4 with 3 supplements
Results of the simple slopes analysis and exemplary marginal effects plots for three different ages.

In the Johnson–Neyman plot (Johnson and Neyman, 1936) on the left side of panel (a), we show the effect of surprisal on reading time across the whole age range separated by cognitive load condition: Reading Only (top; blue), 1-back Dual Task (middle; yellow), and 2-back Dual Task (bottom; red). The stronger the surprisal effect for a certain age, the higher the value on the y-axis. Grey areas indicate age ranges for which we did not find an effect of surprisal on reading time in the respective condition, whereas blue areas indicate a significant surprisal effect (see inset on the right for a visualisation of a non-significant effect in a younger participant and a significant effect in an older participant). In panel (b), we show the predicted surprisal effect in each cognitive load for an average young (average age −1 SD), middle-aged (average age) and older participant (average age +1 SD). The bar plots illustrate the predicted effects of surprisal on reading time (Estimates ± 95% CI) across the three cognitive load conditions for those three average participants. N = 175.

As task demands increased, we observed a reversal of this pattern, with younger participants exhibiting stronger surprisal effects than older participants (condition 1-back; Figure 4a, left panel, second plot). Notably, under increased cognitive load, even the youngest participants started showing significant surprisal effects (β85–18 years = −0.0009, p = 0.04).

Finally, in the most demanding condition (2-back), the pattern shifted again, with older adults again showing stronger surprisal effects than their younger counterparts (β85–18 years = 0.001, p = 0.003; Figure 4a, left panel, third plot).

Comparing surprisal effects between cognitive load conditions revealed that older adults showed the most pronounced reduction in surprisal effects as cognitive load increased (Figure 4b), which suggests they were more vulnerable to increased task demands than younger participants (18-year-old: β1-back–Reading Only = –0.003, p < 0.001, β2-back–Reading Only = –0.003, p < 0.001, β2-back–1-back = 0.0004, p = 0.06; 85-year-old: β1-back–Reading Only = –0.011, p < 0.001, β2-back–Reading Only = –0.008, p < 0.001, β2-back–1-back = 0.003, p < 0.001).

Modelling potential non-linear contributions

To account for the possibility that age, surprisal, and their interaction with cognitive load might demonstrate non-linear effects on reading time, we fitted a generalised additive mixed-effects model (GAM) to our data, using the same model structure as for the linear regression model (see Methods for model details). Results from the GAM revealed overall similar effects relative to the LMM. The effective degrees of freedom (EDF) of all continuous predictors were estimated above 1, confirming their non-linearity. Similar to the LMM, both our predictors of interest, surprisal and age, demonstrated significant effects on reading time (Figure 3c, right), with surprisal showing a more non-linear trajectory than age (EDF for surprisal: 4.107, EDF for age: 3.028, both p < 0.001). The smoothing splines for the three-way interaction of age, surprisal, and cognitive load showed significant effects for all levels of cognitive load (Figure 4—figure supplements 1 and 2). Similar to the LMM, we found the strongest effect for the Reading Only condition (EDF for Reading Only: 10.248, p < 0.001, EDF for 1-back: 2.017, p = 0.036, EDF for 2-back: 4.89, p = 0.024). The full model results are reported in Appendix 1—table 5 and Figure 4—figure supplement 1.

Disentangling the effect of cognitive load on pre- and post-stimulus predictive processing

The present study focuses primarily on surprisal, an information-theoretic measure of how unexpected an encountered word is given its preceding context. As such, it reflects the post-stimulus integration of predictions. In an additional control analysis, we considered entropy, which reflects the uncertainty (and thus the inverse precision) of pre-stimulus predictions, or – put simply – the expected surprisal of an upcoming word (Pimentel et al., 2022). To examine whether limitations in executive resources constrain prediction generation before word onset or prediction integration after word onset, we modelled the effects of entropy on reading times, using the same approach as for modelling the effect of surprisal. Analogous to the effects of surprisal, we expected reading time to increase with higher entropy and anticipated that increased cognitive load would attenuate the effect of entropy on reading time.

Contrary to our hypothesis, results indicated that in the Reading Only condition, with minimal cognitive load, participants benefitted from higher entropy, as reflected by shorter reading times. As the reading materials were designed to be as easy to understand as possible, entropy was overall very low, ranging from 0.708 to 6.763 (see Bentz and Alikaniotis, 2016 for an overview of entropy in natural texts in different languages). The observed facilitated processing for segments with slightly increased entropy is thus consistent with previous findings that entropy can have facilitating effects on language comprehension due to pre-activation of semantic features (Karimi et al., 2024).

Interestingly, as cognitive load increased, this effect reversed: participants showed longer reading times when entropy was high (β1-back = 0.0066, t1-back(287,500.04) = 5.214, p1-back < 0.001, β2-back = 0.0065, t2-back(287,757.94) = 5.026, p2-back < 0.001), which is in line with our hypotheses. While we found this pattern – beneficial effect of entropy under minimal cognitive load and detrimental effect under increased cognitive load – in our younger participants, older participants showed a detrimental effect of increased entropy across conditions. Moreover, paralleling the results for surprisal, older adults were more sensitive to variations in entropy than their younger counterparts (β1-back = −0.00004, t1-back(287,440.36) = –5.547, p1-back < 0.001, β2-back = −0.00002, t2-back(287,488.32) = –2.578, p2-back < 0.001; see Figure 4—figure supplement 3 and Appendix 1—table 6 for full results).

Interaction of surprisal and cognitive load generalises to new sample

To probe the reliability of our findings, we conducted an exact online replication of the original experiment. The replication model included the main effects of age, cognitive load, and surprisal, and the two-way interaction between cognitive load and surprisal; the three-way interaction with age was not modelled. This streamlined model ensured adequate statistical power and yielded stable, interpretable estimates given the available sample size. When comparing the results of the online participants from the original and replication samples, we found highly consistent effects (see Figure 5). Again, a significant interaction of cognitive load and surprisal emerged (β₁-back = 0.001499, t₁-back(161,262.31) = 7.377, p₁-back < 0.001; β₂-back = 0.001365, t₂-back(161,923.06) = 6.721, p₂-back < 0.001; see Appendix 1—table 3 and Figure 4—figure supplement 2), despite the smaller sample size (N = 96 vs. 175).

Figure 5
Results of the internal online replication in comparison with the results of the online sample of the original study.

Estimates ±CI for the main effects of age, surprisal, and cognitive load as well as the two-way interaction of surprisal and cognitive load are visualised. RO: Reading only. Full results are provided in Appendix 1—table 3. For a comparison of age distributions in the original online and lab sample and the online replication sample, please see Figure 2—figure supplement 1. Please note that effects are grouped by their magnitude.

Modelling cumulative effects of surprisal on reading time

As noted above (see section Reading time increases with age, surprisal, and cognitive load), reading time increased as a function of word surprisal, with a mean difference of 118 ms between the most and least predictable word in our text material. This corresponds to approximately 22.91% of the average per-word reading time in the baseline (Reading Only) condition (M = 522.04 ± 275.927 ms), highlighting a substantial effect of surprisal – especially considering that all other predictors were held constant when estimating the effect of surprisal. This effect is particularly notable given that the texts were edited for ease of comprehension and contained relatively low surprisal values overall (M = 18.165 ± 7.523), indicating that the observed reading time differences between high- and low-surprisal words likely underestimate the potential effect size in more complex texts with higher surprisal variability.

Notably, in everyday life, we typically encounter sequences of words, ranging from short phrases to texts of hundreds or thousands of words. Consequently, small word-level effects can accumulate and yield substantial processing differences over time (cf. Funder and Ozer, 2019). Thus, to quantify this cumulative effect of surprisal, we predicted reading time for two average participants aged 27 (M − 1 SD) and 63 (M + 1 SD) years for a short example sentence. Predictions were conducted for the easiest cognitive load condition Reading Only, in which surprisal effects were most pronounced, and for the most challenging condition 2-back Dual Task, where surprisal effects were diminished. The example sentence comprised 14 words and had relatively low surprisal values (M = 15.24 ± 7.65; even slightly lower than in our original text material), implying that the cumulative effects of surprisal shown here are rather conservative estimations.

In the condition with the largest surprisal effects (Reading Only), surprisal led to a total cumulative increase in reading time of 73.6 ms for younger and 648 ms for older participants over the course of the sentence. In the more challenging 2-back Dual Task condition, we observed a total increase of 199 ms in younger and 485 ms in older participants (see Figure 6b for cumulated effects of surprisal; see Figure 6a for predicted reading time incorporating all effects for a comparison).
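As a rough illustration of how such cumulative effects can be obtained, the following Python sketch sums per-word surprisal effects over a sentence. The intercept, slope, and surprisal values are hypothetical placeholders, not estimates from the models reported here. The sketch only assumes that reading time was modelled on the log scale, so the surprisal effect for each word is the difference between back-transformed predictions with and without the surprisal term.

```python
import math
from itertools import accumulate

def cumulative_surprisal_effect(surprisals, intercept, slope):
    """Cumulative reading-time cost (ms) attributable to surprisal.

    Assumes log(RT) = intercept + slope * surprisal, with all other
    predictors held constant (a hypothetical, simplified model).
    """
    baseline_ms = math.exp(intercept)  # predicted RT at surprisal 0
    per_word = [math.exp(intercept + slope * s) - baseline_ms for s in surprisals]
    return list(accumulate(per_word))

# Hypothetical values: surprisal is 0 for the first two words,
# mirroring the two-word context window.
example_surprisals = [0.0, 0.0, 12.5, 20.1, 8.3]
cumulative = cumulative_surprisal_effect(example_surprisals, intercept=6.2, slope=0.002)
print([round(ms, 1) for ms in cumulative])
```

Because each per-word effect is non-negative whenever the slope is positive, the running total can only grow with sentence length, which is why small word-level effects add up over longer texts.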

Estimates ± 95% CI for the cumulative effect of surprisal on reading time.

To illustrate the cumulative effect of surprisal on reading time over the course of a text, we predicted reading times for an average younger (27 years, M − 1 SD) and average older (63 years, M + 1 SD) participant in the easy Reading Only condition (blue) and the most challenging condition 2-back (Dual Task; red) and computed the cumulative sum for a short example sentence. Panel a illustrates how reading time gradually increases in total over the course of the sentence, with all predictors being held constant at their average, except for the predictors age, cognitive load, and word length. In panel b, we again show cumulative reading times, this time isolating the effect of surprisal. Please note that surprisal values are zero for the first two words, as our GPT-2 model estimates surprisal based on the two preceding words, which are unavailable at the beginning of the sentence. The example sentence used in both panels is the German translation of the opening line of Anna Karenina, ‘Happy families are all alike, every unhappy family is unhappy in its own way’ (Tolstoy, 1878). N = 175.

Discussion

Linguistic predictions are a powerful feature of language comprehension. But do they really come ‘for free’, or how much do they draw on executive resources? With the present study, we asked how language predictions change with increasing cognitive load, and how this interaction is modulated by age. To do so, we paired a self-paced reading task with a secondary n-back task on the font colour of the words.

First, as expected, and validating our overall approach, high cognitive load was associated with an increase in reading time as well as a decrease in task performance across age groups. This is consistent with previous studies using n-back tasks (Lamichhane et al., 2020; Kwong See and Ryan, 1995).

Next, as hypothesised, higher word surprisal slowed down reading, even when controlling for word length and frequency as well as prediction entropy. This effect of surprisal replicates findings from previous studies showing that highly unpredictable words are generally associated with a longer reading time (Schmitt et al., 2021; Smith and Levy, 2013; Heilbron et al., 2022; Monsalve et al., 2012; Shain et al., 2024; Wilcox et al., 2023).

Finally, we explored the relationship between surprisal and cognitive load across the (cross-sectional) adult lifespan. We hypothesised that increasing cognitive load should gradually impair the building of language predictions, which should surface in a diminished surprisal effect on reading time in conditions with high cognitive load. In line with our hypothesis, we found that the effect of word surprisal on reading time was modulated by cognitive load. Specifically, when cognitive load was high, the effect of surprisal on reading time was significantly diminished.

Interestingly, this interaction between surprisal and cognitive load was modulated by age. While age generally increased the reliance on language prediction, it also increased the susceptibility of this strategy to changes in available executive resources: Older adults showed the strongest relative reduction of the surprisal effect with increasing cognitive load. However, under high load, older adults still showed the strongest surprisal effect in absolute terms (Figure 4b). In a direct replication of the original experiment, we reproduced this finding, further confirming the reliability of our results.

Disentangling effects of attention versus executive resources

When investigating the interaction of cognitive load and language predictability on reading time, we found that increased cognitive load diminished the effect of word predictability. We take this as evidence that executive resources are involved in the generation of language predictions.

Predictive processing has not only been suggested to be foundational to language comprehension (Ryskin and Nieuwland, 2023; Schmitt et al., 2021; Heilbron et al., 2022), but it is also thought to be a core mechanism of the human brain (Rao and Ballard, 1999; Bubic et al., 2010; Friston, 2010). Drawing parallels between domains can thus offer valuable insights into the mechanisms of language prediction: For instance, there is evidence from the visual domain showing that attending to the stimulus material is a prerequisite for predictive processing (Larsson and Smith, 2012; Richter and de Lange, 2019). This observation from the visual domain can potentially be extended to our findings, wherein language predictions were diminished if attention had to be divided between the reading material and a challenging non-linguistic secondary task.

However, attention alone cannot fully account for the differences in sensitivity to word predictability shown here. The predictability effect did not only diminish when a secondary task was introduced but also decreased with increasing cognitive load even when attentional switching costs were held constant between conditions (i.e., when comparing the 1-back to 2-back load conditions). In line with previous literature (Huettig and Janse, 2016; Ito et al., 2018; Fricke and Zirnstein, 2022), our results thus suggest that executive processes beyond attention, such as updating and maintenance of context information, shifting between tasks, and inhibition of irrelevant information (Alvarez and Emory, 2006; Duncan, 2010), are integral to language prediction.

Language prediction as a compensatory mechanism in older age

Further examining the effect of language predictability across the adult lifespan revealed interesting age differences. Namely, older adults showed stronger language predictability effects than younger participants. This effect held true even when controlling for potential differences in verbal intelligence and education between participants. Our result aligns with previous work on age-related changes in linguistic predictions, indicating heightened sensitivity to unexpected lexical input in older adults (Cheimariou et al., 2019).

This finding may reflect a commonly reported pattern of greater reliance on intact vocabulary and world knowledge with age in the face of declining executive functions (Martin et al., 2022; Salthouse et al., 2003). Here, we show that even under increased cognitive load, older adults still rely heavily on their refined system of linguistic predictions driven by their lifelong experience. This allows them to make more fine-tuned predictions but also renders them more vulnerable to unexpected information (i.e., high-surprisal words) than their younger counterparts.

Previous studies have found a larger vocabulary size associated with more rapid processing of language and improved language comprehension (Borovsky et al., 2012; Huettig and Pickering, 2019; Mainz et al., 2017; Matthews, 2018; Stæhr, 2009; Stæhr, 2008). In line with this, older adults exhibit more advanced language processing abilities compared to children or younger adults due to their accumulated years of exposure to language and their increased vocabulary (Brysbaert et al., 2016; Ito and Sakai, 2021). This accumulated skill is thought to serve as a compensatory mechanism for decline in working memory capacities or reduced executive functioning with age (Bunzeck et al., 2024; Reuter-Lorenz et al., 2021). Indeed, speech comprehension appears to remain largely intact in older adults (Shafto and Tyler, 2014; Burke and Mackay, 1997).

As a caveat, one should not disregard the possibility of age-related differences between younger and older adults in utilising formed predictions. While we assumed that formed predictions are utilised automatically, and that the observed differences in the effect of language predictability on reading time between individuals might thus be attributable to a difference in executive resources, one could also argue that individuals might simply weigh their formed predictions differently.

Akin to the longer exposure to language in older individuals, it is reasonable to assume that older individuals have also had more time to accumulate experience regarding the accuracy of their predictions and to refine their predictions through prediction errors. Consequently, older adults might rely more heavily on their predictions than younger adults (Chan et al., 2021; Moran et al., 2014). Additionally, due to age-related sensory decline (Verhaeghen et al., 2003), older adults might exhibit a stronger dependence on context-based predictions to process language, as incoming sensory information might be less informative (Rogers, 2017; Wingfield et al., 2015; Wolpe et al., 2016). A stronger reliance on language predictions could thus serve as a compensatory mechanism to facilitate language comprehension despite sensory decline in older adults.

How can limited executive resources affect language prediction?

As shown in Figure 4, the relationship between word predictability and age depends on cognitive load. Under low load, predictability effects on reading time increased with age; older adults showed robust effects, while younger adults showed none. Under intermediate load (1-back), this pattern reversed, with younger adults showing stronger predictability effects than older adults.

This reversal begs for an explanation, and we deem it most likely to reflect differences in how executive resources are deployed. In low-load settings, young adults may process both predictable and unpredictable words efficiently, minimising observable surprisal effects. The absence of a predictability effect in this group should therefore not be taken as evidence against predictive processing. Rather, it may indicate that prediction is less necessary when processing is fast and flexible. Under intermediate load, executive resources are partially taxed, revealing underlying prediction processes in young adults. At higher loads, predictability effects diminish again, suggesting resource constraints impair predictive processing.

In older adults, however, predictability effects decline already at intermediate load, consistent with the CRUNCH model (Reuter-Lorenz and Cappell, 2008), which posits that cognitive capacity limits are reached earlier in ageing. At this point, resources are insufficient to maintain predictive processing while also performing the secondary task. Behaviourally, this may result in fluctuating performance or trial-wise switching between the two tasks. As load increases further, older adults continue to show reduced, though still measurable, predictability effects – indicating sustained but strained processing.

Taken together, the data suggest that both age groups experience a reduction in predictive processing when executive resources are limited, but the ‘crunch point’ is reached earlier in older adults.

Limitations and future directions

One intriguing question that remains is how n-back performance and language surprisal interrelate. It is plausible that when texts are highly predictable (i.e., when surprisal is very low on average), the cognitive load associated with language processing is reduced. This reduction could free up domain-general executive resources, thereby enhancing n-back performance. Conversely, when surprisal is high, the increased demands of processing less predictable language may compromise working memory updating and lead to poorer n-back performance. As we assessed n-back performance at block-level and maintained equal predictability across texts, analysing trial-level effects of surprisal on n-back task performance was not possible. Future studies could address this limitation by systematically examining the relationship between surprisal and n-back performance at a more granular level.

Another important area for future research involves exploring potential age-related differences in task strategies within dual-task designs. We previously hypothesised that the differing effects of cognitive load on surprisal-driven reading times across age groups may reflect compensatory strategies in older adults. Given declines in executive control and working memory (Idowu and Szameitat, 2023; Salthouse et al., 2003; Verhaeghen et al., 2003), older adults may prioritise language processing over multitasking under high load, whereas younger adults might distribute resources more flexibly. Future studies should thus examine how age affects our response strategies in dual tasks.

Finally, it is important to note that GPT-2 next-word predictions are not tailored to individual participants. Large language models like GPT-2 are trained on extensive internet corpora (Radford et al., 2019). This raises the question of whether linguistic content from the internet – and, by extension, GPT-2’s language generation – disproportionately reflects the linguistic patterns of specific demographic or cognitive groups (Gustilo and Dino, 2017; Haller et al., 2024; Hardy and Friginal, 2012; Tan and Celis, 2019; Venkit et al., 2023). There is only very limited evidence addressing these questions directly. However, it has been shown that GPT-3.5 – a later-generation model building on the same transformer architecture as GPT-2 – shows performance comparable to children aged 6–15 in a language task requiring recollection and inference abilities (Sicilia et al., 2023). Additionally, another study shows that predictions generated by a German GPT-2 Large model (our studies used the smaller GPT-2 Small version) align more closely with the language patterns of individuals with low verbal intelligence (Haller et al., 2024), complementing the findings by Sicilia et al., 2023. Despite the limited number of studies in this area, the findings from both studies suggest that the language predictions produced by GPT-2 may reflect certain demographic or cognitive traits, such as a very young age or a low verbal intelligence. As a result, language predictions generated by GPT-2 may not be equally representative across different groups, which must be considered when interpreting between-group effects in studies that rely on GPT-2-generated stimuli. In our study, this potential bias is unlikely to affect the results. We did not compare adults with children, and in the lab sample where verbal intelligence was measured (N = 90), it was only mildly correlated with age (ρ = 0.279, p = 0.007). Including verbal intelligence in our models did not improve model fits, and all main effects and interactions of age, cognitive load, and surprisal remained robust. Nevertheless, developing methods to generate individualised next-word predictions is an interesting direction for future research.

Conclusion

In summary, the present study contributes to resolving the debate about the cognitive cost of predictive language processing. The data offer the following key insights:

First, low language predictability as well as high task demands both have a detrimental effect on reading time. This holds true across a large age range. Second, we find that higher task demands diminish the effects of language predictability on reading time, replicably demonstrating that language prediction draws on resources of executive control. Third, the data reveal age-related differences in the use of linguistic predictions: High predictability had more leverage on reading efficiency in older than in younger adults but was also more sensitive to available executive resources.

Materials and methods

Participants

We recorded data from 178 participants, who were either tested online (N = 83) or in a controlled lab environment (N = 95). We excluded data from three participants from the online sample from further analysis, either due to technical issues (N = 1) or because they reported having been distracted during the task (N = 2). The resulting final sample comprised 175 participants aged 18–85 years (M = 44.9 ± 17.9 years) with a balanced gender distribution of 51% female, 47% male, and 2% non-binary identifying participants. All participants were native German speakers with normal or corrected-to-normal vision, and intact colour vision (assessed by a screening test; Ishihara, 1987). Exclusion criteria were a history of psychiatric or neurological disorders, drug abuse, dyslexia, illiteracy, or any impairments in language processing. Individuals who had consumed drugs or alcohol immediately prior to the study were not eligible for participation.

All participants from the online sample were recruited through the platform Prolific, whereas lab participants were recruited via an existing database of the Max Planck Institute. Lab participants above the age of 40 performed the Mini-Mental State Examination (Folstein et al., 2014) to screen for cognitive impairments. Middle-aged participants (40–59 years) had a mean score of 29.57 ± 0.84, whereas older participants (60–85 years) scored 28.23 ± 1.62 points.

Study design

During the experiment, participants were asked to read short newspaper articles which were adapted from articles from the news archive of a well-known German magazine (Der Spiegel). The nine selected articles were edited to be easy to understand, neutral in tone and non-emotional in content to avoid any influences of text content on reading time. Additionally, all texts were limited to a length of 300 words (trials). Participants were instructed to read the texts carefully, as content-related multiple-choice questions were asked after each one. The comprehension questions served both as a measure of reading comprehension as well as a motivator to pay close attention to the reading material.

During this self-paced reading task (Reading Only task), each word was presented individually on screen and participants proceeded to the next word by pressing the space bar on their keyboard (Figure 2a). To ensure that each word was displayed at least briefly, the response window started after the word had been shown for a fixed period of 50 ms. Each word was presented centred on screen in one of four font colours (Hex codes: #D292F3 [lilac], #F989A2 [muted pink], #2AB7EF [cerulean blue], and #88BA3F [leaf green]) against a white background.

In four of the six main blocks, the self-paced reading task was complemented by a secondary n-back task (see Figure 2a), in which participants were instructed to press a target button on the keyboard (‘C’ for right-handed, ‘M’ for left-handed participants) whenever the font colour of the current word matched that of the previous (1-back Dual Task) or the penultimate word (2-back Dual Task). Participants were still required to press the spacebar to advance to the next word after pressing the target button. Reaction times were recorded for both kinds of responses. Neither the Reading Only task nor the 1-back and 2-back Dual Task blocks were speeded, allowing participants to complete the experiment at their own pace.
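The target rule of the colour n-back task can be sketched in a few lines of Python. This is only an illustration of the matching logic described above, not the lab.js implementation used in the experiment; the function name and the short colour sequence are hypothetical.

```python
def nback_targets(colours, n):
    """Return one boolean per trial: True if the colour matches the colour n trials back."""
    return [i >= n and colours[i] == colours[i - n] for i in range(len(colours))]

# Hypothetical sequence of font colours (hex codes taken from the task description)
colours = ["#D292F3", "#D292F3", "#2AB7EF", "#D292F3", "#88BA3F"]
print(nback_targets(colours, 1))  # → [False, True, False, False, False]
print(nback_targets(colours, 2))  # → [False, False, False, True, False]
```

The first n trials of a block can never be targets, because there is no colour n positions back to compare with.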

Before being introduced to this combination of reading and n-back task (see Figure 2a), participants could familiarise themselves with the n-back paradigm in short, non-linguistic, single-task blocks comprising coloured rectangles as stimuli. These blocks served both to introduce the n-back task and to quantify participants’ working memory abilities. For all main blocks with an n-back task, a target ratio of 16.667% was used. The low target ratio was chosen to prevent an excessive number of n-back reactions during the dual-task blocks.

Taken together, the experiment comprised three cognitive load conditions: A baseline condition, comprising a reading task without additional n-back task (Reading Only), a reading task with an additional 1-back task (1-back Dual Task), and a reading task with an additional 2-back task (2-back Dual Task). Each condition was presented in two blocks, each comprising 300 trials. For each block, one of nine texts was randomly selected, with no text occurring more than once.
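The block structure can be summarised in a short Python sketch: six main blocks (two per condition), each paired with a different text drawn at random from the pool of nine. The function name and the seed are hypothetical, and the sketch keeps the conditions in a fixed order, which is an assumption made for brevity rather than a description of the actual block order.

```python
import random

CONDITIONS = ["Reading Only", "1-back Dual Task", "2-back Dual Task"]

def assign_texts(n_texts=9, blocks_per_condition=2, seed=None):
    """Pair each main block with a distinct, randomly drawn text number."""
    rng = random.Random(seed)
    blocks = [cond for cond in CONDITIONS for _ in range(blocks_per_condition)]
    texts = rng.sample(range(1, n_texts + 1), k=len(blocks))  # sampling without repeats
    return list(zip(blocks, texts))

design = assign_texts(seed=1)
for condition, text in design:
    print(condition, text)
```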

All experiments were implemented using lab.js (Henninger et al., 2024). Online studies were hosted on OpenLab – a server-side platform designed specifically for lab.js experiments (Shevchenko, 2022) – and data were saved on OSF (Foster and Deardorff, 2017).

Generation of word surprisal and entropy scores

The predictability of each word was operationalised via word surprisal (Figure 2b), which reflects the predictability of the current word given its preceding context (Smith and Levy, 2013; Hale, 2001; Levy, 2008). A word’s surprisal score is defined as the negative logarithm of the word’s probability given its context (Hale, 2001):

$$\mathrm{surprisal}(w_n) = -\log P(w_n \mid w_1, \ldots, w_{n-1})$$

If a word $w_n$ has a high surprisal score, its occurrence given its preceding context $w_1, w_2, \ldots, w_{n-1}$ has a low probability, rendering it highly unpredictable (i.e., surprising).

In addition to each word’s surprisal, we also computed the entropy of the probability distribution for each predicted word given its context (Figure 2b), which reflects the uncertainty in predicting the next word. The entropy is defined as the negative sum, over all words in the vocabulary, of the product of each word’s probability and its log probability (Shannon, 1948), or – put simply – as the probability-weighted average surprisal of all possible continuations in the vocabulary (Slaats and Martin, 2023):

$$\mathrm{entropy}(p) = -\sum_{w_n \in W} P(w_n \mid w_1, \ldots, w_{n-1}) \, \log P(w_n \mid w_1, \ldots, w_{n-1})$$

If entropy is low, only one or a few possible words in the vocabulary are assigned high probabilities of being the actual next word, indicating low uncertainty about which word will come next. This is usually the case if the preceding context is highly constraining. Conversely, if a large number of words in the vocabulary would be suitable continuations for the given context, and the probability distribution across words is fairly uniform, word entropy – and thus uncertainty about which word will come next – is high. Taken together, word surprisal signifies the predictability of each word whereas word entropy signifies the uncertainty underlying its prediction.
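The two measures can be illustrated with a minimal Python sketch on toy next-word distributions. The probabilities and candidate words are hypothetical, and the natural logarithm is an assumption, since the definitions above do not fix a base. In practice, the probabilities would come from the language model’s output distribution rather than from a hand-written dictionary.

```python
import math

def surprisal(p_word):
    """Negative log probability of the word that actually occurred."""
    return -math.log(p_word)

def entropy(distribution):
    """Expected surprisal over all candidate continuations."""
    return -sum(p * math.log(p) for p in distribution.values() if p > 0)

# Toy next-word distributions (hypothetical probabilities)
constraining = {"Kaffee": 0.90, "Tee": 0.05, "Wasser": 0.05}
uniform = {"Kaffee": 0.25, "Tee": 0.25, "Wasser": 0.25, "Saft": 0.25}

print(surprisal(constraining["Kaffee"]) < surprisal(constraining["Tee"]))  # → True
print(entropy(constraining) < entropy(uniform))  # → True
```

The constraining context yields low entropy before the word appears, but a high surprisal if one of the unlikely continuations is then encountered, which is why the two measures capture different stages of prediction.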

In the current study, we computed surprisal scores as well as one entropy score for each word in the experimental texts (mean surprisal: 18.165 ± 7.523, mean entropy: 4.067 ± 0.932; see further information in the Appendix).

Entropy and surprisal scores were estimated using a two-word context window. While short contexts have been shown to enhance GPT-2’s psychometric alignment with human predictions, making next-word predictions more human-like (Kuribayashi et al., 2022), other work suggests that longer contexts can also increase model–human similarity (Goldstein et al., 2022). To reconcile these findings in our stimuli and guide the choice of context length, we tested longer windows and found surprisal scores were highly correlated with the 2-word context (e.g., 10-word vs. 2-word context: Spearman’s ρ = 0.976), with the overall pattern of results unchanged. Additionally, employing longer context windows would have also reduced the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window. Crucially, any additional noise introduced by the short context biases effect estimates towards zero, making our analyses conservative rather than inflating them.
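The trade-off between context length and the number of analysable trials can be made concrete with a small Python sketch. It pairs each word with its k preceding words and skips the first k words, for which no full context exists. Splitting on whitespace and the example word list are simplifying assumptions for illustration; the actual scores were computed with the model’s own tokeniser.

```python
def context_windows(words, k=2):
    """Yield (context, target) pairs; the first k words have no full context."""
    for i in range(k, len(words)):
        yield tuple(words[i - k:i]), words[i]

# Illustrative word list (not taken from the experimental texts)
words = "Alle glücklichen Familien gleichen einander".split()
pairs = list(context_windows(words, k=2))
print(pairs[0])    # → (('Alle', 'glücklichen'), 'Familien')
print(len(pairs))  # → 3
```

With a 10-word window, the same five-word input would yield no scorable words at all, which illustrates why longer windows reduce the number of usable trials per text.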

To generate word entropy and surprisal scores, we used a 12-layered GPT-2 model (Radford et al., 2019), which was pre-trained on German texts by the MDZ Digital Library team (dbmdz) at the Bavarian State Library, and the corresponding tokeniser, both available from the Hugging Face model hub (Schweter, 2021). Scores were calculated using Python version 3.10.12 (Rossum and Drake, 2009).

Analysis

Preprocessing

To gauge participants’ response accuracy in the n-back task, we computed the detection-prime (d') index. This measure quantifies the ability to distinguish between target and non-target stimuli, in our case trials (i.e., words in dual- and rectangles in single-task blocks) with colour repetitions and trials where the current colour does not match the colour from the nth previous trial. A d' value of 0 signifies an inability to discriminate between signal and noise stimuli, suggesting that participants indicated they saw a target in either no or all trials. Thus, we excluded all dual-task blocks with d-primes of 0 from further analyses, which affected only five participants. In total, we excluded three main blocks in the online sample and two main blocks in the lab sample.
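For readers unfamiliar with the measure, d' can be sketched in Python as the difference between the z-transformed hit rate and false-alarm rate. The clipping of extreme rates to a fixed range (here 0.01 to 0.99) is an assumption made so that the z-transform stays finite; the correction actually used is not specified above. The trial counts are hypothetical and assume 50 targets among 300 trials.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections, eps=0.01):
    """d' = z(hit rate) - z(false-alarm rate), with rates clipped to [eps, 1 - eps]."""
    z = NormalDist().inv_cdf
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    clip = lambda rate: min(max(rate, eps), 1 - eps)  # avoid infinite z-scores
    return z(clip(hit_rate)) - z(clip(fa_rate))

# Never responding, or always responding, gives identical clipped rates
print(d_prime(hits=0, misses=50, false_alarms=0, correct_rejections=250))   # → 0.0
print(d_prime(hits=50, misses=0, false_alarms=250, correct_rejections=0))   # → 0.0
print(d_prime(hits=40, misses=10, false_alarms=25, correct_rejections=225) > 0)  # → True
```

Under this clipping rule, a block in which a participant pressed the target button on no trial or on every trial yields a d' of exactly 0, matching the exclusion criterion described above.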

After each text block, participants were asked to answer three multiple-choice questions as a measure of reading comprehension. For each question, we showed four response options, with only one of them being correct. To ensure participants performed the tasks as intended and read the words on screen, we excluded all blocks where none of the questions were answered correctly. In total, we excluded one 1-back and seven 2-back Dual Task blocks from the online sample, as well as four 1-back and eleven 2-back Dual Task blocks from the lab sample, from datasets of 6 and 14 participants, respectively.

Lastly, we preprocessed the reading time data: First, any trial exhibiting a raw reading time exceeding 5000 ms was considered an extended break and subsequently excluded from further analyses. This cutoff was chosen pragmatically, based on the observation that participants tested in the lab did not exhibit trial durations exceeding 5000 ms. Therefore, we assume that participants tested online may have been distracted and less focused on the experiment during trials of this length.

To further remove outliers, we followed the procedure recommended by Berger and Kiefer, 2021 to ensure comparable exclusion criteria for long and short outliers in typically skewed reading time: First, reading times were transformed using the Proportion of Maximum Scaling method (POMS; Little, 2013). We POMS-transformed the data on block level to account for potential differences in reading time distributions between blocks. The square root of each value was then taken to ensure a symmetric distribution. Following this, we z-transformed the data and excluded all trials from further analysis where z-scores fell outside a range of –2 to 2 (Berger and Kiefer, 2021; Cousineau and Chartier, 2010). Taken together, we excluded 12,968 trials (4.117% of all trials from the main blocks) with an average of 74.103 ± 12.654 excluded trials per participant.
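The outlier procedure for a single block can be sketched in Python as follows. This is a minimal illustration of the three steps described above (POMS scaling, square-root transform, z-based exclusion), not the original analysis code; the function name, the use of the sample standard deviation, and the example reading times are assumptions.

```python
import math
from statistics import mean, stdev

def remove_rt_outliers(rts, z_cutoff=2.0):
    """Keep reading times whose transformed z-score lies within +/- z_cutoff."""
    lo, hi = min(rts), max(rts)
    if hi == lo:               # no variability, nothing to exclude
        return list(rts)
    poms = [(rt - lo) / (hi - lo) for rt in rts]   # proportion of maximum scaling
    roots = [math.sqrt(p) for p in poms]           # reduce right skew
    m, sd = mean(roots), stdev(roots)
    return [rt for rt, r in zip(rts, roots) if abs((r - m) / sd) <= z_cutoff]

block_rts = [300, 320, 350, 400, 420, 450, 500, 4000]  # hypothetical block, in ms
print(remove_rt_outliers(block_rts))  # → [300, 320, 350, 400, 420, 450, 500]
```

Applying the function separately to each block mirrors the block-level scaling described above, so that a generally slow block does not have all of its trials flagged as long outliers.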

To facilitate interpretability of units in the results, we subsequently continued working with the raw reading times, which had been cleaned of outliers at this stage, and log-transformed them for statistical analysis.

Statistical analysis of n-back responses and comprehension questions

To ensure the validity of our cognitive load manipulation in the dual-task blocks, we examined whether increased cognitive load induced a decline both in n-back task performance – as indicated by reduced d-primes – and in comprehension question accuracy, as reflected by a lower number of correct answers. We employed a linear mixed-effects model (LMM) for d-primes and a generalised linear mixed-effects model (GLMM, logit link function) for comprehension question accuracy.

In both models, we included recording location (online vs. lab), cognitive load (1-back and 2-back Dual Task vs. Reading Only as the reference level), and continuously measured age (centred), as well as the interaction of age and cognitive load, as fixed effects. In the model for the d-primes, we additionally included measures of comprehension question accuracy (on participant and block level) and the block number as fixed effects to control for different response strategies and tiredness effects, respectively. Moreover, we included the mean d-primes from the 1-back and 2-back Single Tasks as a working memory measure.

Please note that we did not control for trial-level stimulus colour here. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour – or sequence of colours – from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions.

We applied simple coding schemes to the factors recording location and cognitive load. Both models included random intercepts for participants and texts; the model for d-primes additionally included by-participant random slopes for cognitive load.

d-prime ~ mean d-prime from single tasks +
mean comprehension question accuracy +
block-level deviation from mean comprehension question accuracy +
recording location + block number +
<bold>age * cognitive load</bold> +
(1 + cognitive load | ID) +
(1 | text number)

Note. Structure of the model for d-primes in the n-back task in dual-task blocks. The variable age was centred and the variable cognitive load encompassed only two levels (1-back and 2-back) as there was no n-back task in the Reading Only condition.

comprehension question accuracy ~ recording location +
<bold>age * cognitive load</bold> +
(1 | ID) + (1 | text number)

Note. Structure of the model for accuracy in the comprehension question task. The variable age was centred. We used a binomial family distribution with a logit link function for modelling the comprehension question accuracies.

Statistical analysis of reading times

We explored the effects of cognitive load, age, and surprisal as well as their two- and three-way interactions on log-transformed reading times using an LMM. The model included an interaction of age, surprisal, and cognitive load motivated by our hypotheses, as well as additional fixed effects to control for nuisance effects. The final fixed and random effects structure was selected based on the highest R² values. As we only modelled reading times from trials for which surprisal scores were available, the first two trials of each block were not included in the statistical analyses.

We included the reading time of the previous word as a fixed effect to control for potential nuisance effects such as post-error slowing following a missed n-back target in the previous trial, or sequential modulation effects if the previous trial was ended prematurely, leading to an extended reading time carried over to the current trial. Additionally, it is important to consider that reading times, like many sequential behavioural measures, exhibit strong autocorrelation (Schuckart et al., 2025), meaning that the reading time of a given word is partially predictable from the reading time of the previous word. Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the reading time of the preceding trial as a covariate.
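As a minimal sketch of how such a lagged covariate could be constructed, the Python snippet below pairs each trial with the log-transformed reading time of the preceding trial in the same block. The field names (`block`, `rt`) and the decision to drop the first trial of each block are illustrative assumptions, not details taken from the authors' pipeline.

```python
import math


def add_previous_rt(trials):
    """Attach log RT and log RT of the preceding trial within the same block.

    `trials` is a list of dicts with the hypothetical keys 'block' and 'rt' (ms),
    sorted in presentation order. Trials without a predecessor in the same block
    are dropped because the covariate is undefined for them.
    """
    out = []
    prev = None
    for trial in trials:
        if prev is not None and prev["block"] == trial["block"]:
            out.append({
                **trial,
                "log_rt": math.log(trial["rt"]),
                "log_rt_prev": math.log(prev["rt"]),
            })
        prev = trial
    return out


toy_trials = [
    {"block": 1, "rt": 400}, {"block": 1, "rt": 450}, {"block": 1, "rt": 380},
    {"block": 2, "rt": 500}, {"block": 2, "rt": 520},
]
lagged = add_previous_rt(toy_trials)
```

Note that the lag never crosses a block boundary, so the first trial of block 2 is not paired with the last trial of block 1.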

As response strategies may differ between individuals, but also within an individual from block to block, we included two different regressors representing these distinct between- vs. within-participant effects on reading time. Between-participant effects were modelled by the individual mean comprehension performance whereas within-participant effects were modelled by the block-level deviation from this mean (cf. Tune et al., 2021; Bell et al., 2019). We further included block-wise d-primes and participant-wise mean single-task d-primes as a proxy of each participant’s working memory capacity. By incorporating block- and participant-level performance measures, which are designed to be sensitive to task difficulty, we accounted for the potential variation in perceived task load between age groups or samples. For instance, the 2-back task might present a greater challenge for an older individual compared to their younger counterpart, therefore rendering the tasks not entirely comparable between age groups if not appropriately controlled for.
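The between- vs. within-participant decomposition can be sketched as follows. This is an illustrative Python example with hypothetical field names (`id`, `accuracy`) and made-up values; it only shows how a block-level measure is split into a participant mean and a block-level deviation from that mean.

```python
from collections import defaultdict
from statistics import fmean


def split_between_within(rows, key="accuracy"):
    """Split a block-level measure into a participant mean and a block deviation."""
    per_participant = defaultdict(list)
    for row in rows:
        per_participant[row["id"]].append(row[key])
    means = {pid: fmean(values) for pid, values in per_participant.items()}
    return [
        {
            **row,
            "mean_" + key: means[row["id"]],              # between-participant regressor
            "dev_" + key: row[key] - means[row["id"]],    # within-participant regressor
        }
        for row in rows
    ]


toy_rows = [
    {"id": "p1", "block": 1, "accuracy": 100.0},
    {"id": "p1", "block": 2, "accuracy": 50.0},
    {"id": "p2", "block": 1, "accuracy": 60.0},
    {"id": "p2", "block": 2, "accuracy": 80.0},
]
decomposed = split_between_within(toy_rows)
```

By construction, the deviations sum to zero within each participant, so the two regressors carry non-overlapping information.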

The remaining fixed effects entailed the recording location (online vs. lab), word frequency (as estimated using Python’s wordfreq package; Speer et al., 2018), word length, as well as the position of block and trial in the course of the experiment as main effects. Furthermore, we included entropy as a fixed effect to account for the uncertainty in the prediction of the next word. Surprisal and entropy values were weakly correlated (r = 0.29, p < 0.001). To account for the delay in reaction time associated with n-back responses, we included n-back reaction (reaction vs. no reaction) as a binary predictor in our models.

Finally, we included the three-way interaction of age (as a continuous predictor), cognitive load (on three levels: Reading Only, 1-back Dual Task, and 2-back Dual Task; contrasted using a simple coding scheme with the Reading Only condition as the reference level), and surprisal score (continuous predictor). This entails the implicit inclusion of all two-way interactions of age, cognitive load, and surprisal, as well as the main effects of those variables.
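To make the simple coding scheme concrete, the following Python sketch builds the contrast codes for a three-level factor with Reading Only as the reference level. It assumes the common definition of simple coding (treatment coding with 1/k subtracted from every entry, where k is the number of levels); the function name is hypothetical.

```python
from fractions import Fraction


def simple_coding(levels, reference):
    """Return (contrast names, {level: codes}) for simple coding.

    Each contrast compares one non-reference level with the reference level,
    while the intercept corresponds to the mean across all levels.
    """
    k = len(levels)
    contrasts = [level for level in levels if level != reference]
    codes = {
        level: [
            (Fraction(1) if level == contrast else Fraction(0)) - Fraction(1, k)
            for contrast in contrasts
        ]
        for level in levels
    }
    return contrasts, codes


levels = ["Reading Only", "1-back", "2-back"]
contrasts, codes = simple_coding(levels, reference="Reading Only")
# codes["Reading Only"] -> [-1/3, -1/3]
# codes["1-back"]       -> [ 2/3, -1/3]
# codes["2-back"]       -> [-1/3,  2/3]
```

Because every contrast column sums to zero across levels, main effects of the other predictors are evaluated at the average over load conditions rather than at the reference level.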

Random effects included random intercepts for participants, texts, words, and word colours, as well as by-participant random slopes for cognitive load. All continuous predictors were centred.

log(RT) ~ RT of previous word +
block-level d-prime + mean d-prime from single tasks +
mean comprehension question performance +
block-level deviation from mean comprehension question performance +
recording location + entropy +
word frequency + word length (without punctuation) +
n-back reaction + block number + trial number +
<bold>surprisal * age * cognitive load</bold> +
(1 + cognitive load | ID) + (1 | text number) + (1 | word) + (1 | colour)

Note. Model structure. RT = Reading Time, ID = participant.

To gain a more nuanced understanding of the three-way interaction of age, cognitive load, and surprisal, we performed a subsequent simple slopes analysis. This analysis probes the interaction of two continuous predictors within each level of a categorical predictor: in our case, it quantifies the slope of the surprisal effect in each of the three cognitive load conditions as a function of age. In this way, we determined for which age range and cognitive load condition we observed a significant effect of surprisal on reading time.
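The logic of a simple slope can be sketched in Python as follows. The coefficients are the fixed-effect estimates reported in Appendix 1—table 2; the contrast codes assume simple coding with Reading Only as the reference level, and age is entered as a centred value whose unit (years) is an assumption. The sketch only computes point estimates: testing whether a slope differs from zero additionally requires the coefficient covariance matrix, which is not reproduced here.

```python
# Fixed-effect estimates involving surprisal (from Appendix 1, table 2)
COEF = {
    "surprisal": 0.001707,
    "surprisal:age": 0.000035,
    "surprisal:load1": -0.001093,      # 1-back vs. Reading Only
    "surprisal:load2": -0.001255,      # 2-back vs. Reading Only
    "surprisal:age:load1": -0.000111,
    "surprisal:age:load2": -0.000078,
}

# Simple coding for a three-level factor (assumed; reference = Reading Only)
CODES = {
    "Reading Only": (-1 / 3, -1 / 3),
    "1-back": (2 / 3, -1 / 3),
    "2-back": (-1 / 3, 2 / 3),
}


def surprisal_slope(age_centred, condition, coef=COEF):
    """Estimated change in log RT per unit of surprisal at a given age and load."""
    c1, c2 = CODES[condition]
    return (
        coef["surprisal"]
        + coef["surprisal:age"] * age_centred
        + coef["surprisal:load1"] * c1
        + coef["surprisal:load2"] * c2
        + coef["surprisal:age:load1"] * age_centred * c1
        + coef["surprisal:age:load2"] * age_centred * c2
    )


slopes_at_mean_age = {cond: surprisal_slope(0.0, cond) for cond in CODES}
```

Evaluating `surprisal_slope` over a grid of centred ages for each condition yields the slope-by-age profiles that a simple slopes analysis summarises.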

For all models, including control analyses and the internal replication, p-values were obtained using ANOVAs with type III sums of squares. Degrees of freedom for p-values and standard errors were estimated using Satterthwaite’s approximation for all LMMs, and Wald’s approximation for the GLMM and GAM (Luke, 2017; Satterthwaite, 1946). All effects reported as significant are significant at an alpha level of 0.05 after FDR correction for multiple comparisons (Benjamini and Hochberg, 1995).
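For readers unfamiliar with the Benjamini-Hochberg procedure, the following Python sketch computes FDR-adjusted p-values. It is a generic textbook implementation with made-up input values, not the authors' code, and it does not specify which families of tests were corrected together.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]
```

An effect is then reported as significant if its adjusted p-value does not exceed the alpha level of 0.05.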

All analyses were carried out in R version 4.2.2 (R Development Core Team, 2023) using the packages gratia, interactions, lmerTest, lme4, mgcv, modelbased, and sjPlot (Gratia, 2024; Bates et al., 2015; Kuznetsova et al., 2017; Wood, 2011; Makowski et al., 2020; Long, 2024).

Control analysis: Dissociating cognitive control from attention

To disentangle attentional and cognitive load effects, we modelled reading times using an additional linear mixed model of the same structure as described before, but contrasting only the 2-back Dual Task condition with the less demanding 1-back Dual Task condition. The two Dual Task conditions only differ in cognitive load, but not attentional switching costs, which means any effects of cognitive load can be attributed to the cognitive load manipulation, with attentional demands held constant.

Control analysis: Assessing potential effects of verbal intelligence and education

To ensure potential effects of verbal intelligence or education did not unduly influence our findings, we analysed data from 95 lab participants who reported their formal education in years (M = 18.0 ± 3.343, range = 11–30) and completed a lexical decision task – the Spot-the-Word test (Baddeley et al., 1993) – where they were asked to identify the word in pairs of words and non-words. Each participant’s score on this test (M = 32.021 ± 2.993, range = 21–37) provided a measure of their verbal intelligence.

To assess the potential effect of education and verbal intelligence on reading time, we fitted three additional LMMs: The first model mirrored the structure of the original LMM used to analyse log-transformed reading times (see section Statistical analysis of reading times for the model structure), with one key modification: The predictor for recording location was excluded, as all participants were tested in a single location. The remaining two models followed the same structure, with the inclusion of centred education scores and centred Spot-the-Word test scores as additional predictors to account for education and verbal intelligence, respectively. We then statistically compared the baseline LMM with the two other LMMs using an ANOVA to determine whether verbal intelligence or education significantly improved the model fits.
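A comparison of nested models of this kind is commonly carried out as a likelihood ratio test. The Python sketch below shows the computation for the case where the extended model adds exactly one predictor (one degree of freedom), which matches adding either the education score or the Spot-the-Word score to the baseline model. The log-likelihood values are hypothetical, and the assumption that the comparison was a likelihood ratio test on models fitted with maximum likelihood is ours.

```python
import math


def likelihood_ratio_test_1df(loglik_baseline, loglik_extended):
    """Likelihood ratio test for nested models that differ by one parameter.

    Returns (chi-square statistic, p-value). For one degree of freedom, the
    chi-square survival function equals erfc(sqrt(x / 2)).
    """
    chi_sq = max(0.0, 2.0 * (loglik_extended - loglik_baseline))
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value


# Hypothetical log-likelihoods of the baseline and the extended model
chi_sq, p_value = likelihood_ratio_test_1df(-100.0, -98.0792705)
```

A non-significant test indicates that the added predictor does not improve model fit beyond what the baseline model already explains.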

Control analysis: Modelling non-linear effects of age

We also fitted a generalised additive model (GAM) to our data to allow for non-linear relationships of the predictor variables with reading time, as it has been shown that reaction time is oftentimes modulated by predictor variables in a non-linear way (Miwa and Baayen, 2021; Wood, 2017).

Specifically, we employed a GAM to account for a potential non-linear relationship of age and surprisal with reading time. To this end, all continuous predictors were fitted with thin-plate regression splines, and the interaction of surprisal, age, and cognitive load was fitted via a tensor product smooth with individual curves for each level of cognitive load. The number of basis dimensions for each smoothing spline was checked via model diagnostics available in mgcv after the first model set-up and updated where necessary to reach a k-index > 1.01 and p > 0.05 to avoid oversmoothing. The random effect structure was set up similarly to the LMM.

It is important to note that the outcomes of GAMs and LMMs can differ: GAMs are particularly adept at identifying localised, non-linear changes in predictor effects on reading time that may be overlooked by LMMs. As a result, the effects obtained from LMMs and those derived from GAMs are based on distinct metrics, which complicates direct comparisons between the two approaches.

Control analysis: Disentangling the effect of cognitive load on pre- and post-stimulus predictive processing

Predictive processing can be conceptualised in terms of two complementary information-theoretic constructs: surprisal and entropy. While surprisal captures the cognitive cost associated with the post-stimulus integration of a generated prediction and the actual percept, entropy reflects the uncertainty underlying a next-word prediction (Smith and Levy, 2013; Hale, 2001; Levy, 2008; Pimentel et al., 2022; Shannon, 1948; Slaats and Martin, 2023). As such, entropy mirrors the estimated surprisal, suggesting entropy and surprisal are somewhat related, but do not represent the same construct (Pimentel et al., 2022).
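The two constructs can be illustrated with a toy next-word distribution. In the Python sketch below, the probabilities are made up (they are not GPT-2 output), and the use of base-2 logarithms (bits) is an assumption about units.

```python
import math


def surprisal(prob_of_actual_word):
    """Surprisal of the word that actually occurred: -log2 p(word | context)."""
    return -math.log2(prob_of_actual_word)


def entropy(next_word_probs):
    """Uncertainty about the upcoming word before it is seen: -sum p * log2 p."""
    return -sum(p * math.log2(p) for p in next_word_probs.values() if p > 0)


# Hypothetical next-word distribution after some context
next_word = {"coffee": 0.5, "tea": 0.25, "water": 0.25}

h = entropy(next_word)                      # computed before the word is shown
s_expected = surprisal(next_word["coffee"])  # low surprisal for the likely word
s_unexpected = surprisal(next_word["tea"])   # higher surprisal for a less likely word
```

Entropy equals the probability-weighted average of the surprisal values of all candidate words, which is why the two measures are related but not identical: entropy is fixed by the context alone, whereas surprisal depends on which word actually follows.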

While the primary focus of the present study is on predictive processing as indexed by word surprisal, it is also possible to partially dissociate pre- and post-stimulus predictive mechanisms by examining entropy. The empirical literature on the effect of entropy on reading time is scarce and somewhat contradictory, with evidence for both facilitative and inhibitory effects, or even no effects of next-word entropy over and above surprisal (Pimentel et al., 2022; Karimi et al., 2024; Schijndel and Schuler, 2017; Aurnhammer and Frank, 2019; Roark et al., 2009). Accordingly, we sought to investigate the effect of increased entropy in our dataset, asking whether higher entropy is associated with longer reading times, and whether this relationship is modulated by cognitive load, analogous to the effects observed for surprisal.

To this end, we fitted a linear mixed-effects model structurally analogous to that used for surprisal, substituting surprisal with entropy. Consistent with our approach for surprisal, we also included a three-way interaction of entropy with age and cognitive load to capture potential age-related changes in the relationship between entropy and cognitive load.

log(RT) ~ RT of previous word +
  block-level d-prime + mean d-prime from single tasks +
  mean comprehension question performance +
  block-level deviation from mean comprehension question performance +
  recording location + surprisal +
  word frequency + word length (without punctuation) +
  n-back reaction + block number + trial number +
  <bold>entropy * age * cognitive load</bold> +
  (1 + cognitive load | ID) + (1 | text number) + (1 | word) + (1 | colour)

Note. Model structure. RT = Reading Time, ID = participant.

Internal replication

To ensure the reliability of our findings, we conducted an internal replication of the previously described experiment. This internal replication was preregistered on OSF (doi: 10.17605/OSF.IO/SU6VX).

Replication sample

As outlined in the preregistration, we conducted an online study with a sample of 100 participants. We excluded data from four participants from further analysis, either due to technical issues (N = 2) or because they reported having been distracted during the task (N = 2). The resulting final sample comprised 96 participants aged 18–70 years (M = 39.750 ± 13.996 years) with a balanced gender distribution of 51% female, 48% male, and 1% non-binary identifying participants. As in the original experiment, all participants were native German speakers with normal or corrected-to-normal vision and intact colour vision without dyslexia, illiteracy, a history of psychiatric or neurological disorders or drug abuse. Individuals who had consumed drugs or alcohol immediately prior to the study were not eligible for participation.

Replication analyses

Analogous to the original experiment, we first cleaned the reading time data of trials exceeding a duration of 5000 ms as well as outliers (see section Preprocessing), which affected 4.036% of all trials from the main blocks with an average of 72.656 ± 11.089 excluded trials per participant. The structure of the statistical model for the replication analysis was analogous to the model for the original analysis of reading times (see section Statistical analysis of reading times), except for the predictor of recording location, which was excluded as we only analysed data collected online. In addition, we modelled the two-way interaction between cognitive load and surprisal rather than the higher-order three-way interaction with age. This more parsimonious model structure was chosen to ensure adequate statistical power and to yield stable, interpretable estimates given the available sample sizes.

To compare results from the original experiment and the replication, we fitted the model twice: once using the data of the online sample from the original experiment and once using the new online replication dataset.

Given that we simplified the analysis approach in the original study after having preregistered the replication, we deviated from the analysis plan described in the preregistration and made the same modifications here, resulting in the use of word surprisal for only one context length instead of four and, consequently, only one LMM instead of several.

Appendix 1

Supplementary methods

Demographics

We recruited participants via both the participant database of the Max Planck Institute in Leipzig and the online participant recruitment platform Prolific. Due to the predominantly younger demographic on Prolific, the online samples comprised mainly younger and middle-aged participants, whereas the lab sample spanned a broader age range (see Figure 2—figure supplement 1).

Study design

Text stimuli

During the experiment, participants were asked to read short newspaper articles on emotionally neutral topics such as literature, history, geography, and biology. All texts were edited to be easy to understand without excessive simplification. By doing so, we aimed to keep the cognitive load in the reading task as low as possible, while still maintaining a balance between text clarity and the avoidance of consistently high word predictability.

All texts had a Wiener Sachtextformel (WSTF4; Bamberger and Lesen, 1984) score below or equal to 10, which corresponds to a reading level suitable for students below 10th grade (mean WSTF4 score: 7.9 ± 0.623). As the WSTF4 only measures the syntactic complexity of a text, participants rated the subjective text difficulty as well as their subjective interest for two of the texts presented to them. All texts used in this study yielded a mean difficulty rating of 21.983 ± 2.799 (on a scale from 0 to 100, with 100 being ‘extremely difficult’) and a mean interest rating of 70.684 ± 5.657 (on a scale from 0 to 100, with 100 being ‘extremely interesting’), confirming that the texts used in this study were both easy to understand and interesting to read.

Structure of the experiment

At the outset of the experiment, participants were first presented with a training block of the Reading Only condition to familiarise themselves with the task. This was followed by the first main block of the Reading Only task (300 trials). After this, in a short training block (20 trials, repeating it was optional) followed by a longer main block (60 trials in the online experiment, 90 trials in the lab experiment), the participant was introduced to either the 1-back or the 2-back task as a non-linguistic single task comprising coloured rectangles as stimuli. Which n-back task was introduced first was randomised. This was then followed by the first dual-task block (300 trials) where the previously practised n-back task was performed together with the reading task. After having completed the first dual-task block in one of the two n-back conditions, the participant was then introduced to the other n-back task in the same fashion as before. Having performed each of the conditions once, the participant was subsequently presented with three main blocks of each of the three conditions in random order (300 trials each). After each block comprising a reading task, participants were asked to answer three multiple-choice comprehension questions on the content of the text.

Appendix 1—table 1
Results from models for task performance measures (N = 175).
LMM for d-primes
Predictor | Estimate | Std. error | t | df | p
Mean d-prime single-tasks | 0.469 | 0.0549 | 8.550 | 166.51 | 2.294 × 10^–14*
Mean comprehension question performance | 0.011 | 0.0036 | 3.053 | 171.653 | 3.943 × 10^–3*
De-meaned comprehension question performance | –0.001 | 0.0012 | –0.475 | 372.688 | 6.350 × 10^–1
Block number | –0.006 | 0.0085 | –0.676 | 349.333 | 5.618 × 10^–1
Recording location [online] | –0.506 | 0.0902 | –5.609 | 163.182 | 1.920 × 10^–7*
Age | –0.005 | 0.0027 | –2.057 | 164.038 | 5.003 × 10^–2
Cognitive load [2-back vs. 1-back] | –1.636 | 0.0626 | –26.120 | 173.125 | 2.672 × 10^–61*
Age * cognitive load [2-back vs. 1-back] | –0.014 | 0.0035 | –3.931 | 169.766 | 2.210 × 10^–4*
Model fit
Conditional/marginal R² | 0.822/0.634
ICC | 0.512

GLMM for comprehension question accuracy
Predictor | OR | Std. error | z | p
Recording location [online] | 0.980 | 0.1876 | –0.106 | 9.155 × 10^–1
Age | 0.986 | 0.0053 | –2.676 | 1.304 × 10^–2*
Cognitive load [1-back vs. Reading Only] | 0.253 | 0.0390 | –8.928 | 1.011 × 10^–18*
Cognitive load [2-back vs. Reading Only] | 0.156 | 0.0234 | –12.403 | 1.753 × 10^–34*
Age * cognitive load [1-back vs. Reading Only] | 0.990 | 0.0081 | –1.183 | 3.313 × 10^–1
Age * cognitive load [2-back vs. Reading Only] | 1.003 | 0.0080 | 0.395 | 8.081 × 10^–1
Model fit
Conditional/marginal R² | 0.304/0.146
ICC | 0.185
  1. Note. All continuous predictors were centred. Degrees of freedom for p-values, standard errors and confidence intervals (CI) were computed using Satterthwaite’s approximation (LMM for d-primes) and Wald’s approximation (GLMM for comprehension question accuracy). All p-values reported here are FDR-corrected and were computed using ANOVAs with type III sums of squares. Results that are significant on an alpha-level of 0.05 are marked with a star. OR = Odds Ratio.

Appendix 1—table 2
Results from the model for reading times for full original sample (N = 175).
LMM for full original sample (N = 175)
Predictor | Estimate | Std. error | CI | t | df | p
Main effects
Reading time of previous trial (log-transformed) | 0.110235 | 0.001409 | 0.107472 to 0.112997 | 78.210829 | 287,711.33 | <1.33 × 10^–322*
d-prime | –0.006293 | 0.001767 | –0.009755 to –0.002831 | –3.562277 | 224,716.70 | 4.903 × 10^–4*
Mean d-prime single-tasks | 0.084636 | 0.019585 | 0.045974 to 0.123298 | 4.321418 | 169.86 | 3.717 × 10^–5*
Mean comprehension question performance | 0.002841 | 0.001276 | 0.000321 to 0.005360 | 2.225728 | 169.26 | 3.282 × 10^–2*
De-meaned comprehension question performance | 0.000053 | 0.000038 | –0.000023 to 0.000128 | 1.374475 | 257,871.34 | 1.693 × 10^–1
Word frequency | 0.643570 | 0.362057 | –0.067231 to 1.354371 | 1.777538 | 727.78 | 8.280 × 10^–2
Word length | 0.007839 | 0.000407 | 0.007039 to 0.008638 | 19.240951 | 1400.80 | 7.407 × 10^–73*
Word entropy | 0.001410 | 0.000785 | –0.000129 to 0.002949 | 1.796533 | 7594.27 | 8.280 × 10^–2
n-back reaction [reaction vs. no reaction] | 0.317464 | 0.001799 | 0.313937 to 0.320990 | 176.444743 | 287,862.05 | <1.33 × 10^–322*
Block number | –0.006495 | 0.000164 | –0.006816 to –0.006173 | –39.637620 | 287,253.02 | <1.33 × 10^–322*
Trial number | –0.000456 | 0.000007 | –0.000470 to –0.000442 | –64.085235 | 18,673.68 | <1.33 × 10^–322*
Recording location [online vs. lab] | –0.219339 | 0.032284 | –0.283072 to –0.155606 | –6.793975 | 168.69 | 2.686 × 10^–10*
Surprisal | 0.001707 | 0.000151 | 0.001411 to 0.002002 | 11.320677 | 2361.37 | 1.368 × 10^–28*
Age | 0.009113 | 0.000991 | 0.007158 to 0.011068 | 9.199100 | 178.46 | 1.751 × 10^–16*
Cognitive load [1-back vs. Reading Only] | 0.473800 | 0.013916 | 0.446336 to 0.501264 | 34.046321 | 176.18 | 8.399 × 10^–79*
Cognitive load [2-back vs. Reading Only] | 0.791540 | 0.026090 | 0.740046 to 0.843034 | 30.338989 | 173.76 | 7.320 × 10^–71*
Two-way interactions
Surprisal x age | 0.000035 | 0.000004 | 0.000027 to 0.000042 | 9.287151 | 287,771.27 | 3.481 × 10^–20*
Surprisal x cognitive load [1-back vs. Reading Only] | –0.001093 | 0.000161 | –0.001409 to –0.000776 | –6.771521 | 287,959.11 | 2.043 × 10^–11*
Surprisal x cognitive load [2-back vs. Reading Only] | –0.001255 | 0.000163 | –0.001575 to –0.000935 | –7.681261 | 288,294.96 | 2.709 × 10^–14*
Age x cognitive load [1-back vs. Reading Only] | –0.002798 | 0.000776 | –0.004330 to –0.001267 | –3.606479 | 171.99 | 5.135 × 10^–4*
Age x cognitive load [2-back vs. Reading Only] | –0.002458 | 0.001454 | –0.005329 to 0.000412 | –1.690400 | 170.79 | 9.681 × 10^–2
Three-way interactions
Surprisal x age x cognitive load [1-back vs. Reading Only] | –0.000111 | 0.000009 | –0.000129 to –0.000094 | –12.266076 | 287,807.34 | 3.748 × 10^–34*
Surprisal x age x cognitive load [2-back vs. Reading Only] | –0.000078 | 0.000009 | –0.000096 to –0.000060 | –8.483676 | 287,771.65 | 4.384 × 10^–17*
Model fit
Intra-class correlation (ICC) | 0.46
Marginal R²/conditional R² | 0.643/0.807
  1. Note. All continuous predictors were centred. Degrees of freedom for p-values, standard errors and confidence intervals (CI) were computed using Satterthwaite’s approximation. All p-values reported here are FDR-corrected and were computed using ANOVAs with type III sum of squares. Results that are significant on an alpha-level of 0.05 are marked with a star.

Appendix 1—table 3
Results from models for reading times for original online sample and online replication sample (N = 80 and N = 96, respectively).
LMM for online original sample (N = 80)
Predictor | Estimate | Std. error | CI | t | df | p
Main effects
Reading time of previous trial (log-transformed) | 0.076157 | 0.002060 | 0.072120 to 0.080193 | 36.978112 | 133,340.940 | 5.049 × 10^–297*
d-prime | 0.058086 | 0.003309 | 0.051601 to 0.064571 | 17.555499 | 58,135.052 | 2.440 × 10^–68*
Mean d-prime single-tasks | 0.093533 | 0.027543 | 0.038674 to 0.148391 | 3.395825 | 75.930447 | 1.403 × 10^–3*
Mean comprehension question performance | 0.003366 | 0.001733 | –0.000085 to 0.006816 | 1.942677 | 76.169 | 6.690 × 10^–2
De-meaned comprehension question performance | –0.000481 | 0.000059 | –0.000597 to –0.000365 | –8.123202 | 118,882.498 | 8.250 × 10^–16*
Word frequency | 0.255248 | 0.300091 | –0.335305 to 0.845800 | 0.850567 | 299.741 | 4.190 × 10^–1
Word length | 0.006329 | 0.000414 | 0.005517 to 0.007141 | 15.297105 | 1292.834 | 2.874 × 10^–48*
Word entropy | –0.000656 | 0.000933 | –0.002485 to 0.001173 | –0.702998 | 3838.651 | 4.821 × 10^–1
n-back reaction [reaction vs. no reaction] | 0.366351 | 0.002603 | 0.361250 to 0.371452 | 140.757916 | 133,345.032 | <5.049 × 10^–297*
Block number | –0.008486 | 0.000247 | –0.008970 to –0.008002 | –34.367684 | 131,940.893 | 4.797 × 10^–257*
Trial number | –0.000407 | 0.000009 | –0.000425 to –0.000390 | –45.720722 | 8566.066 | <5.049 × 10^–297*
Surprisal | 0.001145 | 0.000162 | 0.000826 to 0.001463 | 7.046510 | 1889.625 | 4.190 × 10^–12*
Age | 0.005382 | 0.001535 | 0.002326 to 0.008438 | 3.507190 | 76.056 | 1.057 × 10^–3*
Cognitive load [1-back vs. Reading Only] | 0.468989 | 0.018722 | 0.431749 to 0.506229 | 25.050620 | 82.526 | 5.650 × 10^–40*
Cognitive load [2-back vs. Reading Only] | 0.824086 | 0.034653 | 0.755102 to 0.893071 | 23.781266 | 78.282 | 2.909 × 10^–37*
Two-way interactions
Surprisal x cognitive load [1-back vs. Reading Only] | 0.000786 | 0.000223 | 0.000349 to 0.001223 | 3.522647 | 133,382.112 | 6.411 × 10^–4*
Surprisal x cognitive load [2-back vs. Reading Only] | 0.000375 | 0.000225 | –0.000067 to 0.000816 | 1.661967 | 133,507.349 | 1.086 × 10^–1
Model fit
Intra-class correlation (ICC) | 0.47
Marginal R²/conditional R² | 0.587/0.781

LMM for online replication sample (N = 96)
Predictor | Estimate | Std. error | CI | t | df | p
Main effects
Reading time of previous trial (log-transformed) | 0.149035 | 0.001818 | 0.145472 to 0.152597 | 81.985 | 161,495.845 | <3.442 × 10^–281*
d-prime | –0.010767 | 0.002175 | –0.015029 to –0.006504 | –4.950607 | 139,351.725 | 8.888 × 10^–7*
Mean d-prime single-tasks | 0.111742 | 0.020097 | 0.071832 to 0.151652 | 5.560231 | 92.614 | 3.324 × 10^–7*
Mean comprehension question performance | 0.003459 | 0.001659 | 0.000163 to 0.006754 | 2.084567 | 91.938 | 4.487 × 10^–2*
De-meaned comprehension question performance | –0.000715 | 0.000047 | –0.000807 to –0.000624 | –15.286171 | 152,148.746 | 2.661 × 10^–52*
Word frequency | 0.277510 | 0.253935 | –0.222530 to 0.777550 | 1.092838 | 258.972 | 2.917 × 10^–1
Word length | 0.006341 | 0.000362 | 0.005631 to 0.007052 | 17.512071 | 1287.953 | 2.794 × 10^–61*
Word entropy | 0.000337 | 0.000826 | –0.001283 to 0.001957 | 0.407433 | 3601.355 | 6.837 × 10^–1
n-back reaction [reaction vs. no reaction] | 0.335449 | 0.002293 | 0.330955 to 0.339943 | 146.303524 | 161,785.798 | <3.442 × 10^–281*
Block number | –0.008042 | 0.000224 | –0.008481 to –0.007604 | –35.946825 | 159,855.434 | 3.442 × 10^–281*
Trial number | –0.000386 | 0.000008 | –0.000402 to –0.000371 | –48.601047 | 8175.168 | <3.442 × 10^–281*
Surprisal | 0.001375 | 0.000144 | 0.001093 to 0.001656 | 9.578258 | 1886.753 | 5.358 × 10^–21*
Age | 0.008953 | 0.001318 | 0.006336 to 0.011570 | 6.795181 | 91.979 | 1.460 × 10^–9*
Cognitive load [1-back vs. Reading Only] | 0.507557 | 0.022113 | 0.463669 to 0.551446 | 22.953170 | 96.841 | 1.343 × 10^–40*
Cognitive load [2-back vs. Reading Only] | 0.722423 | 0.031793 | 0.659316 to 0.785531 | 22.722673 | 96.183 | 3.806 × 10^–40*
Two-way interactions
Surprisal x cognitive load [1-back vs. Reading Only] | 0.001499 | 0.000203 | 0.001101 to 0.001897 | 7.376731 | 161,262.305 | 2.667 × 10^–13*
Surprisal x cognitive load [2-back vs. Reading Only] | 0.001365 | 0.000203 | 0.000967 to 0.001763 | 6.721136 | 161,923.055 | 2.714 × 10^–11*
Model fit
Intra-class correlation (ICC) | 0.50
Marginal R²/conditional R² | 0.615/0.809
  1. Note. All continuous predictors were centred. Degrees of freedom for p-values, standard errors and confidence intervals (CI) were computed using Satterthwaite’s approximation. All p-values reported here are FDR-corrected and were computed using ANOVAs with type III sum of squares. Results that are significant on an alpha-level of 0.05 are marked with a star.

Appendix 1—table 4
Results from models for control analysis (1-back vs. 2-back) of reading times for full original sample (N = 175).
Control analysis 2-back vs. 1-back: LMM for full original sample (N = 175)
Predictor | Estimate | Std. error | CI | t | df | p
Main effects
Reading time of previous trial (log-transformed) | 0.062181 | 0.001755 | 0.058740 to 0.065622 | 35.420630 | 188,475.92 | 3.296 × 10^–273*
d-prime | –0.006667 | 0.001898 | –0.010388 to –0.002947 | –3.512595 | 168,020.99 | 7.398 × 10^–4*
Mean d-prime single-tasks | 0.094749 | 0.020772 | 0.053746 to 0.135752 | 4.561318 | 171.05 | 1.756 × 10^–5*
Mean comprehension question performance | 0.003146 | 0.001352 | 0.000478 to 0.005815 | 2.327238 | 170.38 | 3.018 × 10^–2*
De-meaned comprehension question performance | 0.000038 | 0.000047 | –0.000053 to 0.00013 | 0.821932 | 150,546.36 | 4.837 × 10^–1
Word frequency | –0.128276 | 0.319365 | –0.756126 to 0.499574 | –0.401659 | 398.70 | 6.882 × 10^–1
Word length | 0.005220 | 0.000414 | 0.004408 to 0.006031 | 12.616091 | 1351.55 | 4.020 × 10^–34*
Word entropy | 0.001711 | 0.000914 | –0.000081 to 0.003502 | 1.871897 | 4521.92 | 8.171 × 10^–2
n-back reaction [reaction vs. no reaction] | 0.316007 | 0.001917 | 0.312250 to 0.319764 | 164.836212 | 188,099.24 | <1.33 × 10^–322*
Block number | –0.011363 | 0.000295 | –0.011941 to –0.01079 | –38.535522 | 185,049.49 | 1.33 × 10^–322*
Trial number | –0.000450 | 0.000009 | –0.000467 to –0.000433 | –52.053811 | 10,059.10 | <1.33 × 10^–322*
Recording location [online vs. lab] | –0.226062 | 0.034131 | –0.293438 to –0.158686 | –6.623323 | 169.71 | 8.894 × 10^–10*
Surprisal | 0.001848 | 0.000161 | 0.001532 to 0.002164 | 11.467327 | 2010.37 | 3.901 × 10^–29*
Age | 0.008762 | 0.001101 | 0.006591 to 0.010934 | 7.961804 | 184.31 | 3.751 × 10^–13*
Cognitive load [2-back vs. 1-back] | 0.338896 | 0.020892 | 0.297669 to 0.380123 | 16.221045 | 178.98 | 1.838 × 10^–36*
Two-way interactions
Surprisal x age | 0.000003 | 0.000005 | –0.000007 to 0.000012 | 0.545912 | 187,940.82 | 6.159 × 10^–1
Surprisal x cognitive load [2-back vs. 1-back] | –0.000148 | 0.000173 | –0.000486 to 0.000191 | –0.855146 | 187,910.28 | 4.837 × 10^–1
Age x cognitive load [2-back vs. 1-back] | 0.000689 | 0.001156 | –0.001593 to 0.00297 | 0.595977 | 172.03 | 6.133 × 10^–1
Three-way interaction
Surprisal x age x cognitive load [2-back vs. 1-back] | 0.000033 | 0.000010 | 0.000014 to 0.000052 | 3.372931 | 188,203.53 | 1.144 × 10^–3*
Model fit
Intra-class correlation (ICC) | 0.44
Marginal R²/conditional R² | 0.442/0.690
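  1. Note. All continuous predictors were centred. Degrees of freedom for p-values, standard errors and confidence intervals (CI) were computed using Satterthwaite’s approximation. All p-values reported here are FDR-corrected and were computed using ANOVAs with type III sum of squares. Results that are significant on an alpha-level of 0.05 are marked with a star.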

Appendix 1—table 5
Results from GAM for control analysis of reading times for full original sample (N = 175).
Control analysis: GAM for full original sample (N = 175)
Smooth terms
Predictor | F | EDF | p
Main effects
Reading time of previous trial (log-transformed) | 204.591 | 33.664 | <2 × 10^–16*
d-prime | 38.774 | 28.042 | <2 × 10^–16*
Mean d-prime single-tasks | 24.344 | 1.892 | <2 × 10^–16*
Mean comprehension question performance | 5.348 | 2.305 | 3.23 × 10^–3*
De-meaned comprehension question performance | 39.408 | 7.477 | <2 × 10^–16*
Word frequency | 6.837 | 7.571 | <2 × 10^–16*
Word length | 73.076 | 4.038 | <2 × 10^–16*
Word entropy | 4.027 | 3.704 | 2.44 × 10^–3*
Surprisal | 9.547 | 4.107 | <2 × 10^–16*
Age | 51.783 | 3.028 | <2 × 10^–16*
Two-way interactions
Surprisal x cognitive load | 13.962 | 13.849 | <2 × 10^–16*
Three-way interactions
Surprisal x age x cognitive load [Reading Only] | 23.946 | 10.248 | <2 × 10^–16*
Surprisal x age x cognitive load [1-back] | 2.874 | 2.017 | 3.616 × 10^–2*
Surprisal x age x cognitive load [2-back] | 2.392 | 4.877 | 2.375 × 10^–2*
Random effects
Cognitive load | ID | 255.250 | 508.610 | <2 × 10^–16*
Text Nr. | 14,053.220 | 7.840 | <2 × 10^–16*
Word | 1.870 | 804.600 | <2 × 10^–16*
Colour | 12.530 | 2.770 | <2 × 10^–16*

Parametric terms
Predictor | Estimate | Std. error | t | p
n-back reaction [reaction vs. no reaction] | 0.3167 | 0.00179 | 177.06 | <2 × 10^–16*
Block number | –0.0061 | 0.00017 | –36.89 | <2 × 10^–16*
Trial number | –0.0004 | 0.00001 | –64.47 | <2 × 10^–16*
Recording location (online vs. lab) | –0.2514 | 0.02602 | –9.66 | <2 × 10^–16*
Cognitive load [1-back vs. Reading Only] | 0.4318 | 0.02514 | 17.17 | <2 × 10^–16*
Cognitive load [2-back vs. Reading Only] | 0.7819 | 0.02526 | 30.95 | <2 × 10^–16*

Model fit
R² | 0.815
  1. Note. p-values were computed using Wald's approximation as implemented in the package mgcv. Results that are significant on an alpha-level of 0.05 are marked with a star. EDF: Effective degrees of freedom.

Appendix 1—table 6
Results from the model for reading times for full original sample (N = 175) for the effects of entropy, cognitive load, and age on reading time.
LMM for full original sample (N = 175)
| Predictors | Estimate | Std. error | CI | t | df | p |
|---|---|---|---|---|---|---|
| Main effects | | | | | | |
| Reading time of previous trial (log-transformed) | 0.110066 | 0.001410 | 0.107302 to 0.112830 | 78.048894 | 287,686.846 | <8.542 × 10^–79* |
| d-prime | –0.006248 | 0.001767 | –0.009712 to –0.002784 | –3.535537 | 224,708.066 | 6.105 × 10^–4* |
| Mean d-prime single-tasks | 0.084628 | 0.019587 | 0.045963 to 0.123292 | 4.320683 | 169.899 | 4.225 × 10^–5* |
| Mean comprehension question performance | 0.002842 | 0.001276 | 0.000322 to 0.005362 | 2.226655 | 169.295 | 3.275 × 10^–2* |
| De-meaned comprehension question performance | 0.000052 | 0.000039 | –0.000023 to 0.000127 | 1.350491 | 257,990.129 | 1.769 × 10^–1 |
| Word frequency | 0.608299 | 0.360822 | –0.100080 to 1.316678 | 1.685873 | 725.615 | 9.929 × 10^–2 |
| Word length | 0.007812 | 0.000406 | 0.007015 to 0.008609 | 19.220944 | 1402.270 | 9.864 × 10^–73* |
| Surprisal | 0.001682 | 0.000150 | 0.001387 to 0.001977 | 11.177024 | 2359.191 | 7.143 × 10^–28* |
| n-back reaction [reaction vs. no reaction] | 0.317360 | 0.001800 | 0.313832 to 0.320888 | 176.311387 | 287,866.290 | <8.542 × 10^–79* |
| Block number | –0.006481 | 0.000164 | –0.006803 to –0.006160 | –39.539875 | 287,255.082 | <8.542 × 10^–79* |
| Trial number | –0.000456 | 0.000007 | –0.000470 to –0.000442 | –64.065173 | 18,557.632 | <8.542 × 10^–79* |
| Recording location [online vs. lab] | –0.219370 | 0.032287 | –0.283107 to –0.155632 | –6.794462 | 168.720 | 3.895 × 10^–10* |
| Entropy | 0.001412 | 0.000785 | –0.000126 to 0.002950 | 1.800222 | 7570.659 | 8.213 × 10^–2 |
| Age | 0.009110 | 0.000991 | 0.007155 to 0.011065 | 9.195551 | 178.495 | 2.325 × 10^–16* |
| Cognitive load [1-back vs. Reading Only] | 0.473980 | 0.013924 | 0.446501 to 0.501458 | 34.041427 | 176.188 | 8.542 × 10^–79* |
| Cognitive load [2-back vs. Reading Only] | 0.791850 | 0.026098 | 0.740340 to 0.843361 | 30.341059 | 173.750 | 7.294 × 10^–71* |
| Two-way interactions | | | | | | |
| Entropy x age | 0.000090 | 0.000030 | 0.000032 to 0.000148 | 3.030333 | 287,391.572 | 3.257 × 10^–3* |
| Entropy x cognitive load [1-back vs. Reading Only] | 0.006638 | 0.001273 | 0.004142 to 0.009133 | 5.213908 | 287,500.035 | 3.416 × 10^–7* |
| Entropy x cognitive load [2-back vs. Reading Only] | 0.006490 | 0.001291 | 0.003959 to 0.009021 | 5.025891 | 287,757.942 | 8.595 × 10^–7* |
| Age x cognitive load [1-back vs. Reading Only] | –0.002785 | 0.000776 | –0.004317 to –0.001253 | –3.587447 | 171.994 | 6.142 × 10^–4* |
| Age x cognitive load [2-back vs. Reading Only] | –0.002441 | 0.001455 | –0.005313 to 0.000430 | –1.678106 | 170.772 | 9.929 × 10^–2 |
| Three-way interactions | | | | | | |
| Entropy x age x cognitive load [1-back vs. Reading Only] | –0.000399 | 0.000072 | –0.000540 to –0.000258 | –5.546582 | 287,440.357 | 5.831 × 10^–8* |
| Entropy x age x cognitive load [2-back vs. Reading Only] | –0.000188 | 0.000073 | –0.000331 to –0.000045 | –2.577310 | 287,488.317 | 1.258 × 10^–2* |

Model fit: Intra-class correlation (ICC) = 0.46; marginal R2/conditional R2 = 0.643/0.807
Note. All continuous predictors were centred. Degrees of freedom for p-values, standard errors and confidence intervals (CI) were computed using Satterthwaite’s approximation. All p-values reported here are FDR-corrected and were computed using ANOVAs with type III sum of squares. Results that are significant at an alpha level of 0.05 are marked with an asterisk (*).
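To make the three-way interaction in the table above easier to read, the following sketch combines the reported fixed-effect estimates into the implied entropy slope for a given load condition and age. It rests on two assumptions that this section does not confirm: cognitive load is treatment-coded with Reading Only as the reference level, and age is centred and measured in years. The function name `entropy_slope` is hypothetical, and the slopes are on the scale of the model's response variable.

```python
# Fixed-effect estimates copied from Appendix 1, table 6.
B_ENTROPY = 0.001412       # Entropy (main effect)
B_ENTROPY_AGE = 0.000090   # Entropy x age

# Assumption: treatment coding, so the reference level (Reading Only)
# contributes no additional interaction term.
B_ENTROPY_LOAD = {"reading_only": 0.0, "1-back": 0.006638, "2-back": 0.006490}
B_ENTROPY_AGE_LOAD = {"reading_only": 0.0, "1-back": -0.000399, "2-back": -0.000188}


def entropy_slope(load, age_centred):
    """Implied entropy slope for one load level and one centred age.

    age_centred is the deviation from the sample mean age, assumed to be
    in years (the section only states that predictors were centred).
    """
    intercept_part = B_ENTROPY + B_ENTROPY_LOAD[load]
    age_part = (B_ENTROPY_AGE + B_ENTROPY_AGE_LOAD[load]) * age_centred
    return intercept_part + age_part


# Compare readers 20 years below and above the sample mean age.
for load in ("reading_only", "1-back", "2-back"):
    younger = entropy_slope(load, -20)
    older = entropy_slope(load, 20)
    print(f"{load}: younger={younger:.6f}, older={older:.6f}")
```

Under these assumptions, the negative three-way estimates mean that the increase in the entropy slope under load shrinks with age. The sketch ignores the uncertainty of the estimates and is meant only as a reading aid for the coefficients.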

Data availability

Experimental and analysis scripts as well as preprocessed data are publicly available in the project's OSF repository: https://osf.io/2hczy/.

The following data sets were generated
    Schuckart MM, Martin S, Tune S, Schmitt LM, Hartwigsen G, Obleser J (2025) Open Science Framework. ID 2hczy. Executive Resources Shape the Impact of Language Predictability Across the Adult Lifespan.

Article and author information

Author details

  1. Merle Marie Schuckart

    1. Department of Psychology, University of Lübeck, Lübeck, Germany
    2. Center of Brain, Behavior and Metabolism, University of Lübeck, Lübeck, Germany
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing, shared first authorship with Sandra Martin
    Contributed equally with
    Sandra Martin
    For correspondence
    merle.schuckart@uni-luebeck.de
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-7178-7360
  2. Sandra Martin

    Research Group Cognition and Plasticity, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Project administration, Writing – review and editing, shared first authorship with Merle Schuckart
    Contributed equally with
    Merle Marie Schuckart
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-6687-5278
  3. Sarah Tune

    1. Department of Psychology, University of Lübeck, Lübeck, Germany
    2. Center of Brain, Behavior and Metabolism, University of Lübeck, Lübeck, Germany
    Contribution
    Formal analysis, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-9022-9965
  4. Lea-Maria Schmitt

    Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
    Contribution
    Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-9356-2234
  5. Gesa Hartwigsen

    1. Research Group Cognition and Plasticity, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
    2. Wilhelm Wundt Institute for Psychology, Leipzig University, Leipzig, Germany
    Contribution
    Conceptualization, Supervision, Funding acquisition, Writing – review and editing, shared last authorship with Jonas Obleser
    Contributed equally with
    Jonas Obleser
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-8084-1330
  6. Jonas Obleser

    1. Department of Psychology, University of Lübeck, Lübeck, Germany
    2. Center of Brain, Behavior and Metabolism, University of Lübeck, Lübeck, Germany
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing – review and editing, shared last authorship with Gesa Hartwigsen
    Contributed equally with
    Gesa Hartwigsen
    Competing interests
    Reviewing editor, eLife
    ORCID: 0000-0002-7619-0459

Funding

DFG (OB 352/2-2)

  • Jonas Obleser

DFG (HA 6314/4-2; Research Unit 5429/1 (467143400): HA 6314/10-1)

  • Gesa Hartwigsen

ERC (ERC-2021-COG 101043747)

  • Gesa Hartwigsen

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Christian Koblitz and Marcel Blumenthal for their help with data acquisition in the lab. This work was supported by the German Research Foundation (DFG, OB 352/2-2 to JO, and HA 6314/4-2 to GH). GH was supported by Lise Meitner Excellence funding from the Max Planck Society and the European Research Council (ERC-2021-COG 101043747). Open access funding provided by the Max Planck Society. We acknowledge financial support by Land Schleswig-Holstein within the funding programme Open Access Publikationsfond.

Ethics

The study was conducted in accordance with the Declaration of Helsinki and was approved by the local ethics committees of the University of Lübeck and Leipzig University, respectively. Prior to participation, participants provided their written informed consent to participate and to have their data published, and received financial compensation (12€/h).

Version history

  1. Preprint posted:
  2. Sent for peer review:
  3. Reviewed Preprint version 1:
  4. Reviewed Preprint version 2:
  5. Version of Record published:

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.108176. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2025, Schuckart, Martin et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Merle Marie Schuckart
  2. Sandra Martin
  3. Sarah Tune
  4. Lea-Maria Schmitt
  5. Gesa Hartwigsen
  6. Jonas Obleser
(2026)
Executive resources shape the impact of language predictability across the adult lifespan
eLife 14:RP108176.
https://doi.org/10.7554/eLife.108176.3
