Large-scale replication study reveals a limit on probabilistic prediction in language comprehension

  1. Mante S Nieuwland  Is a corresponding author
  2. Stephen Politzer-Ahles
  3. Evelien Heyselaar
  4. Katrien Segaert
  5. Emily Darley
  6. Nina Kazanina
  7. Sarah Von Grebmer Zu Wolfsthurn
  8. Federica Bartolozzi
  9. Vita Kogan
  10. Aine Ito
  11. Diane Mézière
  12. Dale J Barr
  13. Guillaume A Rousselet
  14. Heather J Ferguson
  15. Simon Busch-Moreno
  16. Xiao Fu
  17. Jyrki Tuomainen
  18. Eugenia Kulakova
  19. E Matthew Husband
  20. David I Donaldson
  21. Zdenko Kohút
  22. Shirley-Ann Rueschemeyer
  23. Falk Huettig
  1. Max Planck Institute for Psycholinguistics, Netherlands
  2. University of Edinburgh, United Kingdom
  3. The Hong Kong Polytechnic University, Hong Kong
  4. University of Oxford, United Kingdom
  5. University of Birmingham, United Kingdom
  6. University of Bristol, United Kingdom
  7. University of Glasgow, United Kingdom
  8. University of Kent, United Kingdom
  9. University College London, United Kingdom
  10. University of Stirling, United Kingdom
  11. University of York, United Kingdom

Abstract

Do people routinely pre-activate the meaning and even the phonological form of upcoming words? The most acclaimed evidence for phonological prediction comes from a 2005 Nature Neuroscience publication by DeLong, Urbach and Kutas, who observed a graded modulation of electrical brain potentials (N400) to nouns and preceding articles by the probability that people use a word to continue the sentence fragment (‘cloze’). In our direct replication study spanning 9 laboratories (N=334), pre-registered replication-analyses and exploratory Bayes factor analyses successfully replicated the noun-results but, crucially, not the article-results. Pre-registered single-trial analyses also yielded a statistically significant effect for the nouns but not the articles. Exploratory Bayesian single-trial analyses showed that the article-effect may be non-zero but is likely far smaller than originally reported and too small to observe without very large sample sizes. Our results do not support the view that readers routinely pre-activate the phonological form of predictable words.

https://doi.org/10.7554/eLife.33468.001

Introduction

In the last decades, the idea that people routinely and implicitly predict upcoming words during language comprehension turned from a highly controversial hypothesis to a widely accepted assumption. Initial objections to prediction in language were based on a lack of empirical support (e.g. Zwitserlood, 1989), incompatibility with traditional bottom-up models and contemporary interactive models of language comprehension (e.g. Kintsch, 1988; Marslen-Wilson and Tyler, 1980), and the purported futility of prediction in a generative system where sentences can continue in infinitely many different ways (Jackendoff, 2002). Current theories of language comprehension, however, reject such objections and posit prediction as an integral and inevitable mechanism by which comprehension proceeds quickly and incrementally (e.g. Altmann and Mirković, 2009; Dell and Chang, 2014; Pickering and Garrod, 2013). Prediction, that is, context-based pre-activation of an upcoming linguistic input, is thought to occur at all levels of linguistic representation (semantic, morpho-syntactic and phonological/orthographic) and serves to facilitate the integration of newly available bottom-up information into the unfolding sentence- or discourse-representation. In this line of thought, language is yet another domain in which the brain acts as a prediction machine (Clark, 2013; Van Berkum, 2010; see also Friston, 2005, 2010; Summerfield and de Lange, 2014), hard-wired to continuously match sensory inputs with top-down, grammatical or probabilistic expectations based on context and memory.

What promoted linguistic prediction from outlandish and deeply contentious to ubiquitous and somewhat anodyne? One of the key and most acclaimed pieces of empirical evidence for linguistic prediction to date comes from a landmark Nature Neuroscience publication by DeLong et al. (2005), whose approach exploited a phonological rule of English whereby the indefinite article is realized as ‘a’ before consonant-initial words and as ‘an’ before vowel-initial words. In their experiment, participants read sentences with varying degrees of contextual constraint that led to expectations for a particular consonant- or vowel-initial noun. This expectation was operationalized as a word’s cloze probability (cloze), calculated in a separate, non-speeded sentence completion task as the percentage of continuations of a sentence fragment with that word (Taylor, 1953). For example, the sentence fragment "The day was breezy so the boy went outside to fly..." is continued with ‘a’ by 86% of participants, and "The day was breezy so the boy went outside to fly a..." is continued with ‘kite’ by 89% of participants. In the main experiment, word-by-word sentence presentation enabled DeLong and colleagues to examine electrical brain activity elicited by articles that were concordant with the highly expected but as yet unseen noun (‘a’, followed by ‘kite’), or by articles that were incompatible with the highly expected noun and heralded a less expected one (‘an’, followed by ‘airplane’). Of note, an unexpected article like ‘an’ does not rule out that the expected noun appears, just that it appears as the immediately following word (e.g., ‘an old kite’); we return to this issue in the Discussion. The dependent measure was the amplitude of the N400 event-related potential (ERP), a negative ERP deflection that peaks approximately 400 ms after word onset and is maximal at centroparietal electrodes (Kutas and Hillyard, 1980). The N400 is elicited by every word of an unfolding sentence, and its amplitude is smaller (less negative) with increasing ease of semantic processing (Kutas and Hillyard, 1984). In this article, we use ‘N400 amplitude’ as a shorthand for ‘ERP amplitude in the time window associated with the N400’; this ERP amplitude is actually a sum of the N400 ERP component and other ERP components (reflecting other aspects of cognition) that overlap with it in time and space. DeLong et al. found that the N400 amplitude for a given word decreased as a function of increasing cloze probability, both for nouns and, critically, for articles. DeLong et al. presented the systematic and graded N400 modulation by article-cloze as strong evidence that participants activated the nouns and articles in advance of their appearance, and that the disconfirmation of this prediction by the less-expected articles resulted in processing difficulty (higher N400 amplitude at the article).

The results obtained with this elegant design warranted a much stronger conclusion than related results available at the time. Previous studies that employed a visual-world paradigm had revealed listeners’ anticipatory eye-movements toward visual objects on the basis of probabilistic or grammatical considerations (e.g. Altmann and Kamide, 1999). However, predictions in such studies are scaffolded onto already-available visual context, and therefore may not measure pre-activation alone, but perhaps re-activation of word information previously activated by the visual object itself (Huettig, 2015). DeLong and colleagues examined brain responses to information associated with concepts that were not pre-specified and had to be retrieved from long-term memory ‘on-the-fly’. Furthermore, DeLong and colleagues were the first to muster evidence for highly specific pre-activation of a word’s phonological form, rather than merely its semantic (e.g. Federmeier and Kutas, 1999) or morpho-syntactic features (e.g. Van Berkum et al., 2005; Wicha et al., 2004). Crucially, as their demonstration involved semantically identical articles (function words) rather than nouns or adjectives (content words) that are rich in meaning, the observed N400 modulation by article-cloze is unlikely to reflect difficulty interpreting the articles themselves. Most notably, DeLong and colleagues were the first to examine brain activity elicited by a range of more- or less-predictable articles, not simply most- versus least-expected. Based on the observed correlation, they argued that pre-activation is not all-or-none and limited to highly constraining contexts, but occurs in a graded, probabilistic fashion, with the strength of a word’s pre-activation proportional to its cloze probability. Moreover, they concluded that prediction is an integral part of real-time language processing and, most likely, a mechanism for propelling the comprehension system to keep up with the rapid pace of natural language.

DeLong et al.’s study has had an immense impact on the fields of psycholinguistics, neurolinguistics and beyond. It is cited by authoritative reviews (e.g. Altmann and Mirković, 2009; Hagoort, 2017; Lau et al., 2008; Pickering and Clark, 2014; Pickering and Garrod, 2007) as delivering decisive evidence for probabilistic prediction of words all the way up to their phonological form. Moreover, as a demonstration of pre-activation of phonological form (sound) during reading, it is sometimes cited as evidence for ‘prediction through production’ (e.g. Pickering and Garrod, 2013), the hypothesis that linguistic predictions are implicitly generated by the language production system. To date, DeLong et al. has received a total of 766 citations (Google Scholar), averaging more than one citation per week over the past decade, with an increasing number of citations in each subsequent year. The results also played an important role in settling an ongoing debate in the neuroscience of language. They provided the clearest evidence that the N400 component, which some researchers had long taken to directly index the high-level compositional processes by which people integrate a word’s meaning with its context (Brown and Hagoort, 1993; Chwilla et al., 1995; Connolly and Phillips, 1994; Friederici et al., 1999; van Berkum et al., 1999; Van Petten et al., 1999), actually reflected non-compositional processes by which word information is accessed as a function of context (e.g. Kutas and Hillyard, 1984).

But how robust are gradient effects of form prediction? In the more than a decade that has passed since the publication by DeLong and colleagues, there is still no published study that directly replicates their graded pattern of results (for an overview, see Ito et al., 2017b). DeLong and colleagues also performed an alternative analysis of the same data, using cloze as a categorical variable instead of a continuous variable. This analysis did not yield a statistically significant result (p.59 in DeLong, 2009) and was not mentioned in the published report. In at least three other unpublished data sets (DeLong, 2009; Miyamoto, 2016), DeLong and colleagues did not find a significant correlation between article-N400 and cloze probability. Martin et al. (2013) claimed a successful conceptual replication in native speakers of English but not in bilinguals. However, their study did not test for a graded effect of cloze, and differed from the original in many crucial aspects of the experimental design, data-preprocessing and statistical analysis, clouding both a qualitative and a quantitative comparison to the original results. Moreover, two attempts to replicate the Martin et al. results in English monolinguals failed to yield a reliable effect of cloze on article-ERPs (Ito et al., 2017b; for results that combined data from monolinguals and bilinguals, see Ito et al., 2017a).

As the tremendous scientific impact of the DeLong et al. findings is at odds with the apparent lack of replication attempts, we report here a direct replication study. Inspired by recent demonstrations of the need for large subject-samples in psychology and neuroscience research (Button et al., 2013; Open Science Collaboration, 2015), our replication spanned nine laboratories, each with a sample size equal to or greater than that of the original. In addition to duplicating the original analysis, our replication attempt also seeks to improve upon DeLong et al.’s data analysis. DeLong et al.’s original analysis reduced an initial pool of 2560 data points (32 subjects who each read 80 sentences) to 10 grand-average values, by averaging N400 responses over trials within 10 cloze probability decile-bins (cloze 0–10, 11–20, etc.) per participant and then averaging over participants. It did so even though these bins held greatly different numbers of observations: for example, the 0–10 cloze bin contained 37.5% of all data, whereas the 90–100 cloze bin contained only about 4%. The reliability of the estimates therefore differs greatly between bins, which increases the likelihood of obtaining spurious results (for additional discussion see Ito et al., 2017b). These 10 values were correlated with the average cloze value per bin, yielding numerically high correlation coefficients with large confidence intervals (e.g., the Cz electrode showed a statistically significant r-value of 0.68 with a 95% confidence interval ranging from 0.09 to 0.92). However, this analysis potentially compromises power by discretizing cloze probability into deciles and not distinguishing various sources of subject-, item-, bin-, and trial-level variation. Furthermore, treating subjects as a fixed rather than random factor potentially inflates false positive rates, since the overall cloze effect is confounded with by-subject variation in the effect (Barr et al., 2013; Clark, 1973).
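The width of the confidence interval quoted above follows largely from the fact that the correlation was computed over only 10 bin averages. As a rough illustration, the sketch below assumes the interval was obtained with the standard Fisher z-transformation (the original report does not state the method, so this is an assumption), and the function name `pearson_ci` is ours.

```python
import math
from statistics import NormalDist


def pearson_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson r via the Fisher z-transformation."""
    z = math.atanh(r)                                # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)                      # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# r = 0.68 at Cz, computed over n = 10 decile-bin averages (values from the text)
low, high = pearson_ci(0.68, 10)
print(round(low, 2), round(high, 2))
```

Under this assumption the interval comes out at roughly 0.09 to 0.92, in line with the values quoted above: with 10 observations, even a numerically high r is compatible with a very weak underlying relationship.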

In our replication study, we followed two pre-registered analysis routes: a replication analysis that duplicated the DeLong et al. analysis, and a single-trial analysis that modelled variance at the level of item and subject (with a linear mixed-effects model), which offers better control over false-positives than the replication analysis when analyzing effects of the continuous predictor cloze probability. The effect of cloze on noun-elicited N400s (DeLong et al., 2005; Kutas and Hillyard, 1984) is necessary but not sufficient evidence for the claim of pre-activation in language processing (as it is also compatible with the view that the noun’s cloze probability correlates with the ease of integration of that noun into the context). It serves as a manipulation check to ensure that the experiment is able to successfully detect graded variation in N400 amplitude, but does not provide strong evidence for the prediction of phonological form. That evidence would come from the ERPs elicited by articles. Observing a reliable effect of cloze on article-elicited N400s in the replication analysis and, in particular, in the single-trial analysis, would constitute powerful evidence for the pre-activation of phonological form during reading.

Results

We first obtained offline cloze probabilities for all target articles and nouns from a group of native English speakers. These values closely resembled those of the original study (see Methods for details). In the subsequent ERP experiment, a different group of participants (N = 334) read the sentences word-by-word from a computer display at a rate of 2 words per second, while we recorded their electrical brain activity at the scalp. The replication analysis and single-trial analysis described below were each pre-registered at https://osf.io/eyzaq/.

Replication analysis

We sorted the articles and nouns into 10 bins based on each word’s cloze probability (e.g. items with 0–10% cloze were put in one bin, 11–20% in another, etc.). For each laboratory, we averaged ERPs per bin first within, then across, participants. No baseline correction was used, following the procedure described in the Methods section in DeLong et al. (2005). We then correlated the averaged cloze values per bin with mean ERP amplitude in the N400 time window (200–500 ms) elicited by the nouns (for the noun analysis) or articles (for the article analysis) from the corresponding bin, yielding a Pearson correlation coefficient (r-value) per EEG channel. This analysis yielded a very different pattern than DeLong et al. observed (Figure 1). In no laboratory did article-N400 amplitude at centro-parietal sites become significantly smaller (less negative) as article-cloze probability increased (in fact, in most laboratories the pattern went in the opposite direction). Only in one laboratory (Lab 2) did the correlation coefficient have a p-value below .05 in the predicted direction (positive) at any electrode (uncorrected for multiple comparisons), but this effect was observed at a few left-frontal electrodes, not at the central-parietal electrodes where DeLong et al. found their N400 effects. Moreover, in two laboratories (Labs 3 and 5), a statistically significant effect was observed in the opposite direction: larger (more negative) article-N400 amplitude for articles with increasing cloze probability. For the nouns, the pattern was more similar to the DeLong et al. results. In six laboratories (Labs 2, 3, 4, 6, 7, and 9), noun-N400 amplitude at central-parietal or parietal-occipital electrodes became significantly smaller with increasing noun-cloze, and in two other laboratories (Labs 5 and 8) the effects clearly went in the expected direction without reaching statistical significance.
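The binning-and-correlation procedure described above can be sketched for a single channel as follows. This is an illustrative reconstruction, not the pre-registered analysis code: the function names, the `(subject, cloze_percent, amplitude)` input layout, and the choice to average cloze over trials within each bin are all assumptions.

```python
import math
from collections import defaultdict


def decile_bin(cloze: float) -> int:
    """Map a cloze percentage (0-100) to a bin index 0..9 (0-10, 11-20, ..., 91-100)."""
    return 0 if cloze <= 10 else min(9, math.ceil(cloze / 10) - 1)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def binned_correlation(trials):
    """trials: iterable of (subject, cloze_percent, amplitude_uV) for one channel."""
    per_subject = defaultdict(lambda: defaultdict(list))
    cloze_by_bin = defaultdict(list)
    for subject, cloze, amp in trials:
        b = decile_bin(cloze)
        per_subject[subject][b].append(amp)
        cloze_by_bin[b].append(cloze)
    # Step 1: average amplitude per bin within each participant.
    # Step 2: average those participant means across participants.
    bin_means = defaultdict(list)
    for bins in per_subject.values():
        for b, amps in bins.items():
            bin_means[b].append(sum(amps) / len(amps))
    order = sorted(bin_means)
    x = [sum(cloze_by_bin[b]) / len(cloze_by_bin[b]) for b in order]
    y = [sum(bin_means[b]) / len(bin_means[b]) for b in order]
    # Step 3: correlate mean cloze per bin with grand-average amplitude per bin.
    return pearson_r(x, y)
```

Note that every bin contributes one point to the correlation regardless of how many trials it holds, which is the property that the single-trial analysis is designed to avoid.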

Figure 1. Replication analysis.

Correlations between N400 amplitude and article/noun cloze probability per laboratory. N400 amplitude is the mean voltage in the 200–500 ms time window after word onset. A positive value corresponds to the canonical finding that N400 amplitude became smaller (less negative, i.e. more positive) with increasing cloze probability. Here and in all further plots, negative voltages are plotted upwards. Upper graph: Scatter plots showing the correlation between cloze and N400 activity at electrode Cz, for each lab. The position of Cz and the other electrodes is displayed in the head plot in between the upper and lower graph. Lower graph: Scalp distribution of the r-values for each lab. Asterisks (*) indicate electrodes that showed a statistically significant correlation (two-tailed p<0.05, not corrected for multiple comparisons). Exact r- and p-values for each laboratory and EEG channel are available as source data (Figure 1—source data 1–4) and on https://osf.io/eyzaq.

https://doi.org/10.7554/eLife.33468.002
Figure 1—source data 1

r-values for the articles for each laboratory and each channel

https://doi.org/10.7554/eLife.33468.003
Figure 1—source data 2

p-values for the articles for each laboratory and each channel.

https://doi.org/10.7554/eLife.33468.004
Figure 1—source data 3

r-values for the nouns for each laboratory and each channel.

https://doi.org/10.7554/eLife.33468.005
Figure 1—source data 4

p-values for the nouns for each laboratory and each channel.

https://doi.org/10.7554/eLife.33468.006

DeLong et al. recently reported using a 500 ms baseline correction procedure that was not described in the published study (personal communication by DeLong, March 2017). In an exploratory analysis, we therefore recomputed the correlations based on data pooled from all laboratories using this baseline correction procedure (Figure 2). This analysis also showed a lack of statistically significant positive correlations for the articles, but statistically significant positive correlations for the nouns. In the exploratory Bayesian analyses reported below, we establish whether these results are consistent with the size and direction of the effects reported by DeLong et al., regardless of statistical significance.

Figure 2. Replication analysis.

Scalp distribution and r-values at each channel based on data pooled from all laboratories, using a 500 ms baseline correction procedure as used by DeLong et al. (2005). Data were pooled after computing bin-averages per laboratory as in the original study, treating the laboratories as multiple observations of each bin-average. Asterisks (*) indicate electrodes that showed a statistically significant correlation (two-tailed, not corrected for multiple comparisons). Exact r- and p-values for each EEG channel are available as source data (Figure 2—source data 1–4).

https://doi.org/10.7554/eLife.33468.007
Figure 2—source data 1

r-values for the articles for each channel, computed across laboratories. 

https://doi.org/10.7554/eLife.33468.008
Figure 2—source data 2

p-values for the articles for each channel, computed across laboratories.

https://doi.org/10.7554/eLife.33468.009
Figure 2—source data 3

r-values for the nouns for each channel, computed across laboratories. 

https://doi.org/10.7554/eLife.33468.010
Figure 2—source data 4

p-values for the nouns for each channel, computed across laboratories. 

https://doi.org/10.7554/eLife.33468.011

Single-trial analysis

We first performed baseline correction by subtracting the average amplitude in the 100 ms time window before word onset. Baseline-corrected ERPs for relatively expected and unexpected words and difference waveforms are shown in Figure 3. Then, for the data pooled across all laboratories, we used linear mixed-effects models to regress the N400 amplitude (in a spatiotemporal region of interest selected a priori based on the DeLong et al. results) on cloze probability. For the articles, the effect of cloze was not statistically significant at the α = 0.05 level, β = 0.29, CI [−0.08, 0.67], χ2(1)=2.31, p=0.13 (see Figure 4, left panel), with β referring to the N400 difference in microvolts associated with stepping from 0% to 100% cloze. Unless otherwise indicated, p-values are two-tailed, and CIs are two-tailed 95% confidence intervals. The effect of cloze on N400 amplitude at the article did not significantly differ between laboratories, χ2(8)=7.90, p=0.44. For the nouns, however, higher cloze values were strongly associated with smaller N400s, β = 2.22, CI [1.76, 2.69], χ2(1)=56.50, p<0.001 (see Figure 4, right panel). This pattern did not significantly differ between laboratories, χ2(8)=11.59, p=0.17. The effect of cloze on noun-N400s was statistically different from its effect on article-N400s, χ2(1)=31.38, p<0.001.
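For readers who want to check how the reported χ2 statistics map onto p-values, the sketch below converts a χ2 value into an upper-tail probability using closed-form expressions that hold for 1 degree of freedom and for even degrees of freedom. It assumes the statistics are referred to a χ2 distribution with the stated degrees of freedom, as is usual for likelihood-ratio model comparisons; the function names are ours.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability for a positive even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    # Closed form: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


print(round(chi2_sf_df1(2.31), 2))         # article cloze effect, chi2(1) = 2.31
print(round(chi2_sf_even_df(7.90, 8), 2))  # cloze-by-laboratory test, chi2(8) = 7.90
```

The two printed values are 0.13 and 0.44, matching the p-values reported above for the article analysis.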

Figure 3. Single-trial analysis.

Grand-average ERPs elicited by relatively expected and unexpected words (cloze higher/lower than 50%) and the associated difference waveforms (low minus high cloze) at electrode Cz. Dotted lines indicate one standard deviation above or below the grand average.

https://doi.org/10.7554/eLife.33468.012
Figure 4. Single-trial analysis.

Relationship between cloze and ERP amplitude for articles and nouns in the N400 spatiotemporal window, as illustrated by the mean ERP values per cloze value (number of observations reflected in circle size), along with the regression line and 95% confidence interval. A change in article-cloze from 0 to 100 is associated with a change in amplitude of 0.296 µV (95% confidence interval: −0.08 to 0.67). A change in noun-cloze from 0 to 100 is associated with a change in amplitude of 2.22 µV (95% confidence interval: 1.75 to 2.69). The data for these analyses were pooled across all nine labs.

https://doi.org/10.7554/eLife.33468.013

Exploratory (i.e. not pre-registered) single-trial analyses

The effect of article-cloze did not significantly vary as a function of subject comprehension question accuracy, χ2(1)=0.45, p=0.50. In addition, the effect of article-cloze was also not statistically significant when subject comprehension accuracy was included in the analysis (100 ms baseline: β = 0.24, CI [−0.17, 0.64], χ2(1)=1.27, p=0.26).

In our dataset, an analysis in the 500 to 100 ms time window before article-onset revealed a non-significant effect of cloze that resembled the pattern observed after article-onset, β = 0.16, CI [−0.07, 0.39], χ2(1)=1.82, p=0.18 (Figure 5). Because the sentence context of each item was identical for the expected and unexpected article, effects in the pre-article window cannot be meaningfully related to the appearance of the article. Effects in this window must therefore be due to a spurious mix of ‘residual EEG background noise’ (activity that differed between expected and unexpected conditions but was unrelated to actual expectancy) with EEG activity associated with the specific word appearing before the article (which varied between items in terms of lexical characteristics, contextual constraint, and sentence position). The observed result in this time window therefore suggests that a 500 ms baseline correction procedure, which was used but not reported in DeLong et al. (2005), would better correct for pre-article voltage-levels. We repeated our analysis with the 500 ms baseline correction procedure. Compared to the article-cloze effect observed in the pre-registered analysis, the observed effect with the new baseline procedure (Figure 5) was numerically smaller and yielded a higher p-value (β = 0.14, CI [−0.25, 0.53], χ2(1)=0.46, p=0.50).
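The difference between the 100 ms and 500 ms baseline procedures can be made concrete with a minimal sketch for one trial at one channel. The data layout (parallel lists of voltages and sample times relative to word onset) and the function names are assumptions for illustration; only the window lengths and the 200–500 ms N400 window come from the text.

```python
def baseline_correct(epoch, times_ms, baseline_ms):
    """Subtract the mean voltage in the pre-onset window from every sample.

    epoch: voltages (microvolts) for one trial and channel
    times_ms: sample times relative to word onset (negative = before onset)
    baseline_ms: length of the pre-onset baseline window, e.g. 100 or 500
    """
    window = [v for v, t in zip(epoch, times_ms) if -baseline_ms <= t < 0]
    if not window:
        raise ValueError("no samples fall inside the baseline window")
    base = sum(window) / len(window)
    return [v - base for v in epoch]


def mean_amplitude(epoch, times_ms, start_ms=200, end_ms=500):
    """Mean voltage in a time window; defaults to the 200-500 ms N400 window."""
    window = [v for v, t in zip(epoch, times_ms) if start_ms <= t <= end_ms]
    return sum(window) / len(window)
```

Because the longer window averages over more pre-onset samples, any slow voltage difference that builds up before the article is subtracted out more completely, which is why the choice of baseline can change the estimated post-onset effect.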

Figure 5. Exploratory single-trial analyses.

The relationship between cloze and ERP amplitude as illustrated by the mean ERP values per cloze value (number of observations reflected in circle size), along with the regression line and 95% confidence interval, from two exploratory analyses. We performed a test which used a longer baseline time window (500 ms, left panel) to better control for pre-article voltage levels. This test reduced the initially observed effect of article-cloze (β = 0.14, CI [−0.25, 0.53], χ2(1)=0.46, p=0.50). An analysis in the 500 to 100 ms time window before article-onset (right panel) revealed a non-significant effect of cloze that resembled the pattern observed after article-onset, β = 0.16, CI [−0.07, 0.39], χ2(1)=1.82, p=0.18, casting doubt on the conclusion that the observed results are due to the presentation of the articles.

https://doi.org/10.7554/eLife.33468.014

Upon request of reviewers for this journal, we also performed an additional exploratory analysis with cloze as a dichotomous variable (based on a median split, thus disregarding the known variability in cloze values). We note that this type of analysis was not reported in DeLong et al. (2005), although it was reported in the corresponding thesis chapter (DeLong, 2009) and did not yield a statistically significant effect of cloze on article-elicited ERPs. We performed this analysis for articles (100 and 500 ms baseline correction) and nouns. The results did not change substantially, and, in fact, each analysis yielded a lower χ2 value (and higher p-value) for the cloze variable than the corresponding analysis with cloze as a continuous predictor. The results can be reproduced from our online dataset and code.

Exploratory Bayesian analyses

For the articles, our pre-registered replication analyses yielded non-significant p-values, indicating failure to reject the null-hypothesis that cloze has no effect on N400 activity. To better adjudicate between the null-hypothesis (H0) and an alternative hypothesis (Hr), we performed an exploratory replication Bayes factor analysis for correlations (Wagenmakers et al., 2016). The obtained replication Bayes factor quantifies the evidence that there is an effect in the size and direction reported by DeLong et al. (see Figure 6). For the articles, this yielded strong to extremely strong evidence for the null hypothesis that the effect of cloze is zero, with BF0r values up to 154 (at the Cz electrode depicted by DeLong et al., BF0r = 77), and strongest evidence at the posterior channels. For the nouns, we obtained extremely strong evidence for the alternative hypothesis that the effect is consistent with the original effect, particularly at posterior channels, with BFr0 values up to 9,163,515 (at Cz, BFr0 = 10,725). The pattern of results was similar when the 500 ms pre-stimulus baseline correction was applied.

Figure 6. Exploratory replication Bayes factor analysis.

This analysis quantifies the obtained evidence for the null hypothesis (H0) that N400 is not impacted by cloze, or for the alternative hypothesis (H1) that N400 is impacted by cloze with the direction and size of effect reported by DeLong et al. Scalp maps show the common logarithm of the replication Bayes factor for each electrode, capped at log(100) for presentation purposes. Electrodes that yielded at least moderate evidence for or against the null hypothesis (Bayes factor of ≥3) are marked by an asterisk. At posterior electrodes where DeLong et al. found their effects, our article data yielded strong to extremely strong evidence for the null hypothesis, whereas our noun data yielded extremely strong evidence for the alternative hypothesis (upper graphs). These results were obtained with the procedure described in DeLong et al. (no baseline correction), and with a 500 ms pre-word baseline correction (lower graphs), the procedure later described by DeLong and colleagues.

https://doi.org/10.7554/eLife.33468.015
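The scalp maps in the figure above display log-transformed, capped Bayes factors and flag electrodes with at least moderate evidence. A minimal sketch of that presentation step is given below; the function names are ours, and clipping symmetrically in both directions is an assumption, since the caption only states that values are capped at log(100).

```python
import math


def capped_log10_bf(bf: float, cap: float = 100.0) -> float:
    """Common logarithm of a Bayes factor, clipped to +/- log10(cap) for plotting."""
    limit = math.log10(cap)
    return max(-limit, min(limit, math.log10(bf)))


def at_least_moderate(bf: float, threshold: float = 3.0) -> bool:
    """True if the Bayes factor favours either hypothesis by a factor >= threshold."""
    return bf >= threshold or bf <= 1.0 / threshold


# Article Bayes factors quoted in the text: the maximum (154) and the value at Cz (77)
for bf0r in (154.0, 77.0):
    print(capped_log10_bf(bf0r), at_least_moderate(bf0r))
```

With this capping, the largest article Bayes factor (154) and any larger value are drawn at the same colour-scale limit, so the maps show where the evidence is strong rather than how extreme it is.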

Next, we computed Bayesian mixed-effects model estimates (β) and 95% credible intervals (CrI) for our single-trial analyses, using priors based on the results from DeLong et al. In both of our article-analyses, credible intervals included zero (100 ms baseline: β = 0.31, CrI [−0.06, 0.69]; 500 ms baseline: β = 0.17, CrI [−0.22, 0.55]). For the nouns, zero was not within the credible interval: β = 2.24, CrI [1.77, 2.70]. The analyses suggest that the data (combined with prior assumptions about the effect) are not very consistent with the hypothesis that the article-effect is zero (further information and posterior summaries are available in Figure 7), but also are extremely inconsistent with the hypothesis that the article-effect is as big as that observed by DeLong et al. (2005). The data are most consistent with an effect that is more likely to be positive than zero or negative but is very small (so small that it was not detected at traditional significance levels in this large-scale experiment with substantially higher power than previous experiments).

Figure 7. Exploratory Bayesian mixed-effects model analyses.

Posterior density distributions for the effect of cloze on ERP amplitudes in the N400 window. The x-axis shows cloze effect sizes (i.e. changes in microvolts associated with an increase from 0% cloze probability to 100% cloze probability). The black line indicates the posterior distribution of effects; higher values of the posterior density at a given effect size indicate higher probability that this is the true effect size in the population. The peak of the posterior distribution roughly corresponds to the point estimate of the effect size (the regression coefficient) fitted from the Bayesian mixed-effects model, i.e., the most likely value of the true effect size. The middle 95% of the posterior distribution, shaded in orange, corresponds to a two-tailed 95% credible interval for the effect size, i.e., an interval that, given the model and prior, contains the true effect with 95% probability. The green dotted line indicates the prior distribution (i.e., our expectation about where the true effect would lie before the data were collected). For the articles, this prior is centred on 1.25 μV, an approximation of the effect observed by DeLong et al. (2005), and for the nouns it is centred on 3.5 μV. The black connected dots illustrate the ratio between the posterior and prior distribution (i.e. the Bayes factor) at the effect size of 0 μV; for example, a Bayes factor of 4 suggests we can be four times more certain that the true effect is zero after having conducted this experiment than before, or, in other words, that the data increased our confidence in the null effect of zero fourfold. We computed these posteriors for each of our linear mixed-effects model analyses. We note that in all the article-analyses, the posterior probability of the estimated effect being greater than zero is around 80 or 90%, although this is also true for the pre-stimulus variable, casting doubt on the conclusion that the observed results are due to presentation of the articles. In none of our article-analyses did zero lie outside the obtained credible interval, whereas for the nouns, zero lay outside the credible interval. These results are consistent with a failure to replicate the size of the article-effect reported by DeLong et al. and a successful replication of the noun-effect.

https://doi.org/10.7554/eLife.33468.016
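
The Bayes factor described in the caption above is the ratio of posterior to prior density at an effect size of 0 μV. The following is a minimal sketch of that ratio, assuming for illustration that both distributions are normal. Only the prior centre of 1.25 μV comes from the caption; the prior standard deviation and the posterior mean and standard deviation are hypothetical values, not the fitted ones.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor_at_zero(prior_mean, prior_sd, post_mean, post_sd):
    """Ratio of posterior to prior density at an effect size of 0 µV."""
    return normal_pdf(0.0, post_mean, post_sd) / normal_pdf(0.0, prior_mean, prior_sd)

# Prior centred on 1.25 µV as in the caption; every other number is hypothetical.
bf_zero = bayes_factor_at_zero(prior_mean=1.25, prior_sd=0.5,
                               post_mean=0.3, post_sd=0.25)
print(f"Bayes factor in favour of a zero effect: {bf_zero:.2f}")
```

A value above 1 means that the data increased the credibility of a zero effect relative to the prior; a value below 1 means that they decreased it.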

Control experiment

Lack of a statistically significant, article-elicited prediction effect could reflect a general insensitivity of our participants to the phonologically conditioned variation of the English indefinite article, that is, a/an alternation. We ruled out this alternative explanation in an additional experiment that followed the replication experiment as part of the same experimental session. Participants read 80 short sentences containing the same nouns as the replication experiment, preceded by a phonologically licit or illicit article (e.g. ‘David found a/an apple...’), presented in the same manner as before. In each laboratory, nouns following illicit articles elicited a late positive-going waveform compared to nouns following licit articles (see Figure 8), starting at about 500 ms after word onset and strongest at parietal electrodes. This standard P600 effect (Osterhout and Holcomb, 1992) was confirmed in a single-trial analysis, χ2(1)=83.09, p<0.001, and did not significantly differ between labs, χ2(8)=8.98, p=0.35.

Control experiment.

P600 effects at electrode Pz per lab associated with flouting of the English a/an rule. Plotted ERPs show the grand-average difference waveform and standard deviation for ERPs elicited by ungrammatical expressions (‘an kite’) minus those elicited by grammatical expressions (‘a kite’).

https://doi.org/10.7554/eLife.33468.017

Discussion

In a landmark study, DeLong, Urbach and Kutas observed a statistically significant, graded modulation of article- and noun-elicited electrical brain potentials (N400) by the pre-determined probability that people continue a sentence fragment with that word (cloze). They concluded that people routinely and probabilistically pre-activate upcoming words to a high level of detail, including whether a word starts with a consonant or vowel. Our direct replication study spanning nine laboratories found a statistically significant effect of cloze on noun-elicited N400 activity but, critically, no significant effect of cloze on article-elicited N400 activity. This pattern was observed in a pre-registered replication analysis that duplicated the original study’s analysis, and a pre-registered single-trial analysis that modeled variance at the level of item and subject. Exploratory replication Bayes factor analyses confirmed that we successfully replicated the direction and size of the correlations reported by DeLong et al. for the nouns, but not for the articles. Exploratory Bayesian mixed-effects model analyses suggested that, while there is some evidence that the true population-level effect may be in the direction reported by DeLong and colleagues, the effect is likely far smaller than what they reported. In fact, the effect is likely too small to be meaningfully observed without very large sample sizes. Finally, a control experiment confirmed that our participants did respect the phonological alternation a/an of the article with nouns used in the replication experiment.

Our findings thus challenge one empirical cornerstone of the ‘strong prediction view’ held by current theories of language comprehension (e.g. Altmann and Mirković, 2009; Pickering and Garrod, 2013). The strong prediction view entails two key claims. The first is that people pre-activate words at all levels of representation in a routine and implicit (i.e. non-strategic) fashion. Pre-activation is not limited to a word’s meaning, but includes its grammatical features and even its orthographic and/or phonological form. This would put language on a par with other cognitive systems such as visual perception, wherein higher level brain regions attempt to predict lower level inputs (Friston, 2005; 2010; Summerfield and de Lange, 2014). The second claim is that pre-activation occurs at all levels of contextual support and gradually increases in strength with the level of contextual support. When contextual support for a specific word is high, like at a 100% cloze value, the word’s form and meaning is strongly pre-activated. When contextual support for a word is low, like when it is one amongst 20 words each with a 5% cloze value, pre-activation is distributed across multiple potential continuations. However, even then, a word’s form and meaning are pre-activated, just weakly so. The strength of pre-activation is probabilistic, that is, linked to estimated probability of occurrence.

DeLong and colleagues, and subsequently other scientists (e.g. Dell and Chang, 2014; Pickering and Garrod, 2013), took their results as evidence for both these claims. DeLong et al. (2005) was, and still is, the only study to date that measured pre-activation at the prenominal articles a and an, which do not differ in their semantic or grammatical content, and that observed a graded relationship between cloze and N400 activity across a range of low- and high-cloze words, rather than merely a difference between low- and high-cloze words. Given that the use of these articles depends on whether the next word starts with a vowel or consonant, their results were considered powerful evidence that participants probabilistically pre-activated the initial sound of upcoming nouns.

However, we show that there is no statistically significant effect of cloze on article-elicited N400 activity, using a sample size more than ten times that of the original, and a statistical analysis that better accounts for sources of non-independence than the original averaging-based correlation approach. If an effect of cloze on article-N400s exists at all, its true effect size is so small that it cannot be reliably detected even in an expansive multi-laboratory approach, let alone in the typical sample size in psycholinguistic and neurolinguistic experiments (roughly, N = 30). This means that even if article-cloze is associated with a graded modulation of N400 amplitudes, this effect seems to be so small that it cannot be reliably measured with small samples, and thus the previous studies may not have contributed much reliable information to our understanding of this effect. Moreover, it is also possible that the effect is sensitive to specifics of the experimental procedure and context such that it lacks generalizability. Current theoretical positions thus either require new strong evidence for phonological pre-activation or require revision. In particular, one claim from the strong prediction view, namely that pre-activation routinely occurs across all – including phonological – levels (Pickering and Garrod, 2013), can no longer be viewed as having strong empirical support. Our work impels the field to think differently about what constitutes strong evidence within a theory, but also highlights the need for a theory of linguistic prediction to formulate quantitative predictions about the effect-size of to-be-observed effects (for discussion, see also Vasishth et al., 2018).

By contrast, we observed a strong and statistically significant effect of cloze on noun-elicited activity in the majority of our analyses. Although three of the nine laboratories did not show statistically significant correlations between noun-cloze and N400s, data pooled across all laboratories showed a strong and statistically significant noun-cloze effect, and our replication Bayes factor analysis overwhelmingly replicated the direction and size of the noun-cloze effect of DeLong et al. Moreover, our single-trial analysis revealed a significant noun-cloze effect in each of the laboratories, further demonstrating that our single-trial analysis is a more powerful approach than the average-based correlation approach of DeLong et al. These results are therefore consistent with the handful of studies that reported a graded relationship between noun-cloze and noun-N400s (DeLong et al., 2005; Kutas and Hillyard, 1984; Wlotko and Federmeier, 2012).

Where do our results leave the strong prediction view? Following the experimental logic of DeLong et al., we do not have sufficient evidence to conclude that people routinely pre-activate the initial phoneme of an upcoming noun, or perhaps any other word form information. Without pre-activation of the initial phoneme, the specific instantiation of the article does not cause people to revise their prediction about the meaning of the upcoming noun, thus lacking any impact on processing. Crucially, this conclusion is incompatible with the strong prediction view, because it suggests that pre-activation does not occur at the level of detail that is often assumed. Our results are also incompatible with an alternative interpretation of the DeLong et al. findings that people predict the article itself together with the noun (Ito et al., 2017a; Van Petten and Luka, 2012), and they pose a serious challenge to the theory that comprehenders predict upcoming words, including their initial phonemes, through implicit production (Pickering and Garrod, 2013). Moreover, the idea that prediction is probabilistic, rather than all-or-none, is now questionable, given that there is no other published report of a pre-activation gradient (also, see Van Petten and Luka, 2012, for a critique of the DeLong et al. conclusions that graded effects evidence graded pre-activation). Although other studies have claimed prediction of form (Ito et al., 2016) or a prediction gradient (Smith and Levy, 2013), no study has indisputably demonstrated graded pre-activation, that is, graded effects occurring before the noun. Effects that are observed upon, rather than before, the noun do not purely index pre-activation but can index a mixture of memory retrieval and semantic integration processes instigated by the noun itself (Baggio and Hagoort, 2011; Lau et al., 2016; Nieuwland et al., 2018; Otten and Van Berkum, 2008; Steinhauer et al., 2017). Therefore, there is currently no clear evidence to support routine probabilistic pre-activation of a noun’s phonological form during sentence comprehension.

Our results, however, should not be taken as evidence against prediction in language processing more generally, and we believe that prediction could play an important role in language comprehension. In addition, our results do not necessarily exclude phonological form pre-activation, and we temper our conclusion with a caveat stemming from the a/an manipulation. For this manipulation to ‘work’, people must specifically predict the initial phoneme of the next word, and revise this prediction when faced with an unexpected article. However, because articles are only diagnostic about the next word within the noun phrase, rather than about the head noun itself, an unexpected article does not refute the upcoming noun; it merely signals that another word would come first (e.g., ‘an old kite’). This opens up explanations for why the a/an manipulation ‘fails’ (see also Ito et al., 2017a, 2017b). One possibility is that comprehenders do not predict the noun to follow immediately, but at a later point; the unexpected article then does not evoke a change in prediction. Predictions about a specific position may be disconfirmed too often in natural language to be viable. This idea is supported by corpus data (Corpus of Contemporary American English and British National Corpus), showing a mere 33% probability that a/an is directly followed by a noun. Alternatively, people predict the noun to come next, but only revise their prediction about its linear position while retaining the prediction about its meaning. So perhaps a revision of the predicted meaning, not the position, is required to trigger differential ERPs. In both of these hypothetical scenarios, people do not revise their prediction about the upcoming noun’s meaning unless they must.

Our results can be straightforwardly reconciled with effects reported for other pre-nominal manipulations, such as those of Dutch or Spanish article-gender (e.g. Van Berkum et al., 2005; Otten and Van Berkum, 2008; Otten and Van Berkum, 2009; Wicha et al., 2004). Unlike a/an articles, gender-marked articles can immediately disconfirm the noun, because article- and noun-gender agree regardless of intervening words (e.g. the Spanish article ‘el’ heralds a masculine noun). Revising the prediction about the noun presumably results in a semantic processing cost, thereby modulating N400 activity (e.g. Kochari and Flecken, 2017; Otten and Van Berkum, 2009). Although gender-marked articles do not consistently incur the exact same type of effect (for a recent review, see Kochari and Flecken, 2017) and such effects have only been observed at very high cloze values, previous studies suggest that a noun’s grammatical gender can be pre-activated along with its meaning. Compared to this gender-manipulation, DeLong et al.’s study based on the English a/an manipulation claimed a stronger version of the prediction view, namely that people predict which word comes next down to its phonological form, and make backward predictions about the phonological form of the preceding linguistic material even on the basis of probabilistic, graded information.

What do our results say about prediction during natural language processing? Like the conclusions by DeLong et al., ours are limited by the generalization from comprehension of single sentences in a laboratory setting. On one hand, a rich conversational or story context may enhance predictions of upcoming words, and listeners may be more likely to pre-activate the phonological form of upcoming words than readers. On the other hand, our laboratory setting offered particularly good conditions for prediction of the next word’s initial sound to occur. Each article was always immediately followed by a noun, unlike in natural language. Moreover, our word presentation rate was slow compared to natural reading rates, which may facilitate predictive processing (Ito et al., 2016; Wlotko and Federmeier, 2015). In natural reading, articles are hardly fixated and often skipped (e.g. O’Regan 1979). In short, arguments can be made both for and against phonological form prediction in natural language settings, and novel avenues of experimentation are needed to settle this issue.

DeLong and colleagues recently reported an omission in the description of their data analysis, namely that a baseline procedure was applied to the data but inadvertently omitted from the description (DeLong et al., 2005). We have shown that our conclusions hold regardless of the baseline procedure. In a recent commentary, DeLong et al. (2017) also described filler-sentences in their experiment, which were omitted from their original report, and were neither provided nor mentioned to us by the authors upon our request for the stimuli. DeLong et al. used the existence of these filler-sentences to dismiss an alternative explanation of their original findings, namely that an unusual experimental context wherein every sentence contains an article-noun combination leads participants to strategically predict upcoming nouns. Following this logic, our results were obtained despite an experimental context that could inadvertently encourage strategic prediction (for demonstrations of experimental context boosting predictive processing, see Brothers et al., 2017; Lau et al., 2013). Therefore, the presence of fillers in their experiment versus absence in ours cannot straightforwardly explain the different results, and may even strengthen our conclusions.

Since becoming publicly available as a pre-print (Nieuwland et al., 2017), our study has been simultaneously criticized for not being a sufficiently direct replication (due to the differences in fillers and baseline procedure; DeLong et al., 2017; Yan et al., 2017) and for being too direct a replication (because we base our analysis on the same theoretical assumptions as the original study, rather than applying an ad-hoc transformation or different kind of analysis that might 'reveal' the effect; e.g. Yan et al., 2017). As an example of the latter, an unpublished commentary by Yan et al. (2017) raises an interesting point that cloze probability should be log-transformed to better approximate their suggested index of probabilistic semantic prediction, the Bayesian surprise over the noun semantics upon encountering the article. Yan et al. describe a number of exploratory reanalyses of our single-trial data with the log-transform, and one of those exploratory analyses yields a small but statistically significant effect of article-cloze (p=0.015). Ultimately, however, their conclusion is not that different from ours, namely that there is some evidence in our data that the effect is non-zero. More importantly, their commentary demonstrates that our dataset, like any complex EEG dataset, can be analyzed in many different ways, which can lead to different outcomes. However, even if alternative analyses are well-motivated after the fact, the problem remains that they are contingent on the data, and the accompanying researcher degrees of freedom lead to a multiple comparison problem (e.g. Gelman and Loken, 2013; Luck and Gaspelin, 2017). We pre-registered our main analyses and none of these allowed us to conclude that the DeLong et al. study replicated. Yan et al. present an alternative analysis that is exploratory and that itself requires further replication. Moreover, their analysis also raises a novel set of important concerns. For example, log-transformation of cloze also boosts the effect in the pre-article time window (p=0.058), where there cannot be a meaningful effect, possibly because it amplifies ‘noise’ (between-item differences at the low end of the cloze-scale that have nothing to do with prediction of the article). Furthermore, log-transformation does not yield a significant effect with the original baseline procedure of DeLong et al., and it strongly boosts the impact of items with zero cloze, that is, the items that are problematic because their predictability cannot be accurately estimated (of note, without zero-cloze values in their analysis, higher cloze leads to more negative, not positive voltage). Yan et al. report that log-transformation yields somewhat higher t-values of cloze in this dataset and changes our non-significant effect into a significant effect, but it remains unclear whether log-transformation is indeed ‘better’. Crucially, the difference between significant and not-significant itself may not be significant (Gelman and Stern, 2006). In addition, log-transformation does not yield higher t-values consistently across laboratories, does not necessarily improve model fit, and does not yield higher t-values or improve model fit in another large dataset (collapsed data from Ito et al., 2017a; Nieuwland, 2016; Nieuwland and Martin, 2012; total N = 124). Finally, it is unknown whether log-transformation weakens rather than strengthens the effect of the original study. Details of these and further concerns are available on https://osf.io/mb2ud. In sum, these concerns merely add to our main point, namely that even if analysis decisions are justifiable in retrospect, a flexible analysis practice can lead researchers to capitalize on noise (Gelman and Loken, 2013).

To conclude, we failed to replicate the main result of DeLong et al., a landmark study published more than 10 years ago that has not been directly replicated since. Our results suggest that, if there is an effect of article-cloze probability on the amplitude of the N400, it is too small and/or too sensitive to unknown experimental design factors to have been meaningfully measured in previous small-sample-size experiments. Our findings thus do not lend clear support to the ‘strong prediction view’ in which people routinely and probabilistically pre-activate information at all levels of linguistic representation, including phonological form information such as the initial phoneme of an upcoming noun. Consequently, there is currently no convincing evidence that people routinely pre-activate the phonological form of an upcoming noun during written sentence comprehension. In addition, our findings further highlight the importance of direct replication, large sample size studies, transparent reporting and pre-registration to advance reproducibility and replicability in the neurosciences.

Materials and methods

Experimental design and materials


Nieuwland requested all original materials from DeLong et al., including the questions and norms, with the stated purpose of direct replication (personal communication, November 4 and 19, 2015), upon which DeLong et al. made available the 80 sentences described in the original study. These sentences were then adapted from American to British spelling and underwent a few minor changes to ensure their suitability for British participants. The complete set of materials and the list of changes to the original materials are available online (Supplementary file 1). The materials were 80 sentence contexts with two possible continuations each: a more or less expected indefinite article + noun combination. The noun was followed by at least one subsequent word. All article + noun continuations were grammatically correct. Each article + noun combination served once as the more expected continuation and the other time as the less expected continuation, in different contexts. We divided the 160 items into two lists of 80 sentences such that each list contained each noun only once. Each participant was presented with only one list (thus, each context was seen only once). One in four sentences was followed by a yes/no comprehension question, which yielded a mean response accuracy of 95% (after taking into account ambiguity in three of the questions, see Supplementary file 1). While this percentage is very similar to that reported by DeLong et al., we note that this cannot be directly compared to the accuracy reported in DeLong et al., because we had to create new comprehension questions in the absence of the original ones. Regardless, because DeLong et al. suggested that our results were due to poor language comprehension (DeLong et al., 2017), we describe an exploratory analysis in which we attempt to account for variation in response accuracy in the statistical model.

We obtained article cloze and noun cloze ratings from a separate group of native speakers of English who were students at the University of Edinburgh and did not participate in the ERP experiment. They were instructed to complete the sentence fragment with the best continuation that comes to mind (Taylor, 1953). We obtained article cloze ratings from 44 participants for 80 sentence contexts truncated before the critical article. Noun cloze ratings were obtained by first truncating the sentences after the critical articles, and presenting two different, counterbalanced lists of 80 sentences to 30 participants each, such that a given participant only saw each sentence context with the expected or the unexpected article. The obtained values closely resemble those of the original study, with the same range (0–100% for articles and nouns), slightly lower median values (for articles and nouns, 29% and 40%, compared to 31% and 46% in the original study), but slightly higher mean values (for articles and nouns, 41% and 46%, compared to 36% and 44%). Because the sentence materials we used describe common situations that can be understood by any English speaker, and because students at the University of Edinburgh come from across the whole of the UK, we had no a priori expectation that cloze ratings would differ substantially across laboratories, and thus we did not obtain cloze norms from other sites. Consistent with this assumption, nothing in our results suggests stronger cloze effects in University of Edinburgh students compared to other students, suggesting that our cloze norms are sufficiently representative for the other universities. The raw cloze responses are available on our OSF page.
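
As an illustration of how a cloze value is derived from such completions, the following minimal sketch computes the percentage of respondents who produced each continuation for one sentence context. The function name and the example responses are hypothetical; the actual raw responses are those on the OSF page.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Return {completion: percentage of respondents who gave it} for one context."""
    normalised = [r.strip().lower() for r in responses]
    counts = Counter(normalised)
    total = len(normalised)
    return {word: 100.0 * n / total for word, n in counts.items()}

# Hypothetical completions from 10 respondents for a context truncated before the article.
responses = ["a", "a", "an", "a", "the", "a", "a", "an", "a", "a"]
article_cloze = cloze_probabilities(responses)
print(article_cloze.get("a", 0.0), article_cloze.get("an", 0.0))
```

A continuation that no respondent produced receives a cloze value of zero, which is why `get` is called with a default of 0.0.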

Participants

Participants were students from the University of Birmingham, Bristol, Edinburgh, Glasgow, Kent, Oxford, Stirling, York, or volunteers from the participant pool of University College London or Oxford University, who received cash or course credit for taking part in the ERP experiment. Participant information and EEG recording information per laboratory is available online (Supplementary file 1). We pre-registered a target sample size of 40 participants per laboratory, which was thought to give at least 32 participants (the sample size of DeLong et al.) per laboratory after accounting for data loss, as was later confirmed. Due to logistic constraints, not all laboratories reached an N of 40. Because in two labs corruption of data was incorrectly assumed before computing trial loss, these laboratories tested slightly more than 40 participants. All participants (N = 356; 222 women) were right-handed, native English speakers with normal or corrected-to-normal vision, between 18 and 35 years (mean, 19.8 years), free from any known language or learning disorder. Eighty-nine participants reported a left-handed parent or sibling.

Procedure


After giving written informed consent, participants were tested in a single session. Sentences were presented visually in the center of a computer display, one word at a time (200 ms duration, followed by a blank screen of 300 ms duration). Due to a programming error, in four labs (1, 3, 5 and 8, which used E-prime scripts) the critical articles and nouns, but not other words, were followed by a 380 ms blank instead of the intended 300 ms. This delay is unlikely to have affected the results because if it was noticed at all, which is unlikely, it could only be noticed 500 ms after the article, that is, after the N400 window associated with the article. Of note, the pattern of the results from the pre-registered single-trial analysis did not change when we removed these labs from the analysis. Participants were instructed to read sentences for comprehension and answer yes/no comprehension questions by pressing hand-held buttons. The electroencephalogram (EEG) was recorded from at least 32 electrodes.

The replication experiment was followed by a control experiment, which served to detect sensitivity to the correct use of the a/an rule in our participants. Participants read 80 relatively short sentences (average length eight words, range 5–11) that contained the same critical words as the replication experiment, preceded by a correct or incorrect article. As in the replication experiment, each critical word was presented only once, and was followed by at least one more word. All words were presented at the same rate as the replication experiment. There were no comprehension questions in this experiment. After the control experiment, participants performed a Verbal Fluency Test and a Reading Span test; the results from these tests are not discussed here. All stimulus presentation scripts are publicly available in two different software packages (E-Prime and Presentation) on https://osf.io/eyzaq.

Data processing


Data processing was performed in BrainVision Analyzer 2.1 (Brain Products, Germany). We performed one pre-registered replication analysis that followed the DeLong et al. analysis as closely as possible and one pre-registered single-trial analysis (Open Science Framework, https://osf.io/eyzaq). All non-pre-registered analyses are considered as exploratory. First, we interpolated bad channels from surrounding channels, and downsampled to a common set of 22 EEG channels per laboratory which were similar in scalp location to those used by DeLong et al. One laboratory did not have 12 of the selected 22 channels in its EEG channel montage, and we matched the full 22-channel layout used for other laboratories by creating 12 virtual channels from neighbouring channels using topographic interpolation by spherical splines. We then applied a 0.01–100 Hz digital band-pass filter (including 50 Hz Notch filter), re-referenced all channels to the average of the left and right mastoid channels (in a few participants with a noisy mastoid channel, only one mastoid channel was used), and segmented the continuous data into epochs from 500 ms before to 1000 ms after word onset. We then performed visual inspection of all data segments and rejected data with amplifier blocking, movement artifacts, or excessive muscle activity. Subsequently, we performed independent component analysis (Jung et al., 2000) on a 1 Hz high-pass filtered version of the data, and applied the obtained weightings to the original data to correct for blinks, eye movements or steady muscle artefacts. After this, we automatically rejected segments containing a voltage difference of over 120 µV in a time window of 150 ms or containing a voltage step of over 50 µV/ms. Participants with fewer than 60/80 article trials or 60/80 noun trials were removed from the analysis, leaving a total of 334 participants (range across laboratories 32–42, and therefore each lab had a sample size at least as large as DeLong et al.). 
On average, participants had 77 article trials and 77 noun trials. All raw data and pre-processed data are available on https://osf.io/eyzaq.
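
The automatic rejection criteria and the participant exclusion rule can be sketched as follows. This is an illustrative reimplementation in plain Python, not the BrainVision Analyzer procedure itself; the function names are hypothetical, the sampling rate is passed in as a parameter, and the 150 ms window is approximated by a whole number of samples.

```python
def exceeds_thresholds(samples_uv, srate_hz, max_diff_uv=120.0,
                       window_ms=150.0, max_step_uv_per_ms=50.0):
    """Return True if one channel's epoch violates either rejection criterion."""
    ms_per_sample = 1000.0 / srate_hz
    # Criterion 1: voltage step between successive samples, in µV per ms.
    for prev, curr in zip(samples_uv, samples_uv[1:]):
        if abs(curr - prev) / ms_per_sample > max_step_uv_per_ms:
            return True
    # Criterion 2: max-min difference within any sliding window of about window_ms.
    win = max(2, int(round(window_ms / ms_per_sample)))
    for start in range(0, max(1, len(samples_uv) - win + 1)):
        chunk = samples_uv[start:start + win]
        if max(chunk) - min(chunk) > max_diff_uv:
            return True
    return False

def keep_participant(n_article_trials, n_noun_trials, minimum=60):
    """Participants with fewer than 60 of 80 article or noun trials were removed."""
    return n_article_trials >= minimum and n_noun_trials >= minimum

# Hypothetical epochs at an assumed 500 Hz sampling rate.
flat = [0.0] * 750
drift = [2.0 * i for i in range(750)]   # 1 µV/ms ramp: fails the 120 µV criterion
jump = [0.0] * 10 + [150.0] * 10        # 75 µV/ms step: fails the step criterion
print(exceeds_thresholds(flat, 500), exceeds_thresholds(drift, 500),
      exceeds_thresholds(jump, 500), keep_participant(77, 77))
```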

Pre-registered replication analysis


We applied a 4th-order Butterworth band-pass filter at 0.2–15 Hz to the segmented data, averaged trials per participant within 10% cloze bins (0–10, 11–20, etc. until 91–100), and then averaged the participant-wise averages separately for each laboratory. Because the bins did not contain equal numbers of trials (the intermediate bins contained fewest trials), like in DeLong et al., not all participants contributed a value for each bin to the grand average per laboratory. For nouns and articles separately, and for each EEG channel, we computed the correlation between ERP amplitude in the 200–500 ms time window per bin with the average cloze probability per bin.
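
The binning and correlation steps for one laboratory and one channel can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the handling of non-integer cloze values at bin boundaries is our own choice, and the example data are simulated rather than taken from the experiment.

```python
import math

def bin_index(cloze_pct):
    """Map cloze (0-100%) to one of ten bins: 0-10, 11-20, ..., 91-100."""
    return 0 if cloze_pct <= 10 else min(9, int(math.ceil(cloze_pct / 10.0)) - 1)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def binned_correlation(trials):
    """trials: iterable of (participant, cloze_pct, amplitude_uv) tuples."""
    per_participant = {}  # (participant, bin) -> amplitudes
    cloze_in_bin = {}     # bin -> cloze values
    for participant, cloze, amp in trials:
        b = bin_index(cloze)
        per_participant.setdefault((participant, b), []).append(amp)
        cloze_in_bin.setdefault(b, []).append(cloze)
    # Step 1: average trials within each participant and bin.
    participant_means = {}
    for (_, b), amps in per_participant.items():
        participant_means.setdefault(b, []).append(sum(amps) / len(amps))
    # Step 2: average participant-wise means per bin; a participant without
    # trials in a bin simply does not contribute to that bin.
    bins = sorted(participant_means)
    grand_amp = [sum(participant_means[b]) / len(participant_means[b]) for b in bins]
    mean_cloze = [sum(cloze_in_bin[b]) / len(cloze_in_bin[b]) for b in bins]
    return pearson_r(mean_cloze, grand_amp)

# Simulated data: two participants, amplitude rising linearly with cloze.
trials = [(p, c, 0.03 * c + offset)
          for p, offset in (("s1", -1.0), ("s2", 0.5))
          for c in range(5, 100, 10)]
r = binned_correlation(trials)
print(round(r, 3))
```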

Pre-registered single-trial analysis


In this analysis, we did not apply the 0.2–15 Hz band-pass filter, which carries the risk of inducing data distortions (Luck, 2014; Tanner et al., 2015). However, we deemed it necessary to perform a baseline correction of the data. This procedure, which is standard in ERP research (Luck, 2014), corrects for spurious voltage differences before word onset, increasing confidence that observed effects are elicited by the word rather than by differences in brain activity that already existed before the word. DeLong et al. (2005) did not report a baseline correction, nor did any of the related work from DeLong and colleagues that was reported in DeLong, 2009. Yet baseline correction has been used in many other publications from the Kutas Cognitive Electrophysiology Lab. We chose a 100 ms pre-stimulus baseline as the most frequently used one both in other studies from the Kutas lab and in similar studies from other labs. For each trial, we performed baseline correction by subtracting the mean voltage of the −100 to 0 ms time window from each data point in the epoch.
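
The per-trial baseline correction can be sketched in a few lines. The function name, the 2 ms sampling interval and the example epoch are hypothetical, and whether the sample at 0 ms belongs to the baseline window is an assumption of this sketch (here it is excluded).

```python
def baseline_correct(epoch_uv, times_ms, start_ms=-100.0, end_ms=0.0):
    """Subtract the mean voltage in [start_ms, end_ms) from every sample of one epoch."""
    baseline = [v for v, t in zip(epoch_uv, times_ms) if start_ms <= t < end_ms]
    mean_baseline = sum(baseline) / len(baseline)
    return [v - mean_baseline for v in epoch_uv]

# Hypothetical epoch sampled every 2 ms from -500 ms, with a constant 3 µV offset
# and an additional -2 µV deflection after word onset.
times_ms = [-500.0 + 2.0 * i for i in range(750)]
epoch_uv = [3.0 + (-2.0 if t >= 200.0 else 0.0) for t in times_ms]
corrected = baseline_correct(epoch_uv, times_ms)
print(corrected[0], corrected[-1])
```

After correction the pre-stimulus offset is removed, so only the post-onset deflection remains in the epoch.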

Instead of averaging N400 data across trials and participants for subsequent statistical analysis, we performed linear mixed-effects model analysis (Baayen et al., 2008) of the single-trial N400 data, using the ‘lme4’ package (Bates et al., 2014) in the R software (R Core Team, 2014). This approach simultaneously models variance associated with each subject and with each item. Especially when analyzing effects of a continuous predictor variable such as cloze probability, linear mixed-effects regression offers better control over false-positive results than the average-based correlation analysis of the original study. Using a spatiotemporal region-of-interest approach based on the DeLong et al. results, our dependent measure (N400 amplitude) was the average voltage across six centro-parietal channels (Cz/C3/C4/Pz/P3/P4) in the 200–500 ms window for each trial. Analysis scripts and data to run these scripts are publicly available on https://osf.io/eyzaq.
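
The dependent measure described above can be illustrated with a short sketch that averages the voltage over the six region-of-interest channels and the 200–500 ms window for a single trial. The data layout (a mapping from channel name to samples), the function name and the example trial are hypothetical, and the inclusion of both window endpoints is an assumption.

```python
ROI_CHANNELS = ("Cz", "C3", "C4", "Pz", "P3", "P4")

def n400_amplitude(trial_uv, times_ms, start_ms=200.0, end_ms=500.0):
    """Mean voltage across the ROI channels in the N400 window for one trial.

    trial_uv maps channel name -> list of samples aligned with times_ms.
    """
    values = [v
              for ch in ROI_CHANNELS
              for v, t in zip(trial_uv[ch], times_ms)
              if start_ms <= t <= end_ms]
    return sum(values) / len(values)

# Hypothetical trial: every ROI channel sits at -2 µV inside the window, 0 µV elsewhere.
times_ms = [-500.0 + 2.0 * i for i in range(750)]
trial_uv = {ch: [-2.0 if 200.0 <= t <= 500.0 else 0.0 for t in times_ms]
            for ch in ROI_CHANNELS}
print(n400_amplitude(trial_uv, times_ms))
```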

For articles and nouns separately, we used a maximal random effects structure as justified by the design (Barr et al., 2013), which did not include random effects for ‘laboratory’ as there were only nine laboratories. Z-scored cloze was entered in the model as a continuous variable, and laboratory was entered as a deviation-coded nuisance predictor. We tested the effects of ‘laboratory’ and ‘cloze’ through model comparison with a χ² log-likelihood test, that is, we tested whether the inclusion of a given fixed effect led to a significantly better model fit. The first model comparison examined laboratory effects, namely whether the cloze effect varied across laboratories (cloze-by-laboratory interaction) or whether the N400 magnitudes varied over laboratory (laboratory main effect). Because laboratory effects were not significant and were not of theoretical interest, we dropped them from the analysis. For the articles and nouns separately, we compared the models below. Each model included the random effects associated with the fixed effect ‘cloze’ (see Barr et al., 2013). All output β estimates and 95% confidence intervals (CI) were transformed from z-scores back to raw scores, and then back to the 0–100% cloze range, so that the voltage estimates represent the change in voltage associated with a change in cloze probability from 0 to 100%.

Model 1: N400 ~ cloze * laboratory + (cloze | subject) + (cloze | item)

Model 2: N400 ~ cloze + laboratory + (cloze | subject) + (cloze | item)

Model 3: N400 ~ cloze + (cloze | subject) + (cloze | item)

Model 4: N400 ~ (cloze | subject) + (cloze | item)
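The χ² log-likelihood comparison between two nested models that differ in a single fixed effect (for example Model 3 versus Model 4, which differ only in the fixed effect of cloze) can be sketched as below. The log-likelihood values are invented for illustration, and the function covers only the one-degree-of-freedom case; comparisons involving the multi-level ‘laboratory’ factor have more degrees of freedom and are not handled here.

```python
import math


def likelihood_ratio_test_1df(loglik_full, loglik_reduced):
    """Chi-square likelihood-ratio test for nested models differing by one parameter."""
    chi2 = 2.0 * (loglik_full - loglik_reduced)
    chi2 = max(chi2, 0.0)  # guard against tiny negative values from rounding
    # With 1 df, P(X > x) equals P(|Z| > sqrt(x)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up log-likelihoods for a model with and without the cloze fixed effect.
chi2, p = likelihood_ratio_test_1df(loglik_full=-1000.0, loglik_reduced=-1001.92)
```

A small p-value indicates that including the fixed effect improves model fit beyond what the extra parameter alone would be expected to achieve.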

In an analysis that itself was not pre-registered but that included the data from the pre-registered analysis of both articles and nouns, we tested the differential effect of cloze on article ERPs and on noun ERPs by comparing models with and without an interaction between cloze and the deviation-coded factor ‘wordtype’ (article/noun). Random correlations were removed for the models to converge.

Model 1: N400 ~ cloze * wordtype + (cloze * wordtype || subject) + (cloze * wordtype || item)

Model 2: N400 ~ cloze + wordtype + (cloze * wordtype || subject) + (cloze * wordtype || item)

Exploratory correlation analysis

Of note, DeLong et al. have recently described using a 500 ms baseline correction procedure that they failed to mention in DeLong et al. (2005). Using this baseline correction procedure, we recomputed the correlations that we obtained in our Replication analysis (Figure 2). To compare our results most directly with those reported in Figure 1C of DeLong et al. (2005), we pooled data from all the laboratories to obtain a single r-value for each EEG-channel. Data were pooled after computing bin-averages per laboratory as in the original study, treating the laboratories as multiple observations of each bin-average.
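The pooling step can be sketched as follows: each laboratory contributes one mean voltage per cloze bin, and all laboratory-by-bin points enter a single Pearson correlation for a given channel. The data layout, function names and numbers are hypothetical and serve only to illustrate the computation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pooled_bin_correlation(lab_bin_means):
    """Correlate cloze with voltage, treating labs as repeated observations.

    lab_bin_means : dict mapping lab name -> list of
                    (cloze bin midpoint, mean voltage in that bin) pairs
    """
    cloze, voltage = [], []
    for pairs in lab_bin_means.values():
        for bin_midpoint, mean_voltage in pairs:
            cloze.append(bin_midpoint)
            voltage.append(mean_voltage)
    return pearson_r(cloze, voltage)


# Two hypothetical labs with opposite trends at one channel.
example = {
    "lab_A": [(5, 0.1), (15, 0.3), (25, 0.5)],
    "lab_B": [(5, 0.5), (15, 0.3), (25, 0.1)],
}
r_pooled = pooled_bin_correlation(example)
```

In this toy case the two laboratories cancel out, so the pooled r is approximately zero even though each laboratory on its own shows a perfect correlation.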

Exploratory single-trial analyses

We performed an exploratory analysis in the 500 to 100 ms time window before the article, using the originally (−100 to 0 ms) baselined data, using Models 3 and 4 from the article analysis. This window covers the first 400 ms of the word that preceded the article. Analysis in this window yielded a similar pattern as in the pre-registered analysis, which indicates that a baseline correction procedure covering the entire 500 ms pre-stimulus window would account better for pre-article voltage levels. We performed this additional analysis, the results of which did not change our conclusions and are shown in Figure 5.

We also performed an exploratory analysis in which we controlled for a potential influence of response accuracy, taken as a proxy for the subject’s attention to the task, on predictive processing of linguistic input. We entered the (z-transformed) average response accuracy of each subject in our model, and compared the models below. Comparison of Models 1 and 2 tested whether the effect of cloze on the article-N400s depended on subject accuracy. Comparison of Models 2 and 3 tested whether there was a significant effect of cloze on article-N400s when subject accuracy was included in the model.

Model 1: N400 ~ accuracy * cloze + (cloze | subject) + (cloze | item)

Model 2: N400 ~ accuracy + cloze + (cloze | subject) + (cloze | item)

Model 3: N400 ~ accuracy + (cloze | subject) + (cloze | item)

Exploratory Bayesian analyses

Supplementing the Replication analysis, we performed a replication Bayes factor analysis for correlations (Wagenmakers et al., 2016) using as prior the size and direction of the effect reported in the original study. We performed this test for each electrode separately, after collapsing the data points from the different laboratories. Because we had no articles in the 40–50% cloze bin, there was a total of 9 and 10 data points per laboratory for the articles and nouns, respectively. Our analysis used priors estimated from the DeLong et al. results, matched as closely as possible to our electrode locations. A Bayes factor between 3 and 10 is considered moderate evidence, between 10 and 30 is considered strong evidence, 30–100 is very strong evidence, and values over 100 are considered extremely strong evidence (Jeffreys, 1961). In addition to using a 100 ms pre-stimulus baseline, we also computed the replication Bayes factors using the 500 ms pre-stimulus time window for baseline correction. Results are shown in Figure 6.
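The evidence categories listed above can be written as a small lookup. This is only a sketch: the text does not name a category for Bayes factors below 3, and it does not say how exact boundary values (3, 10, 30, 100) are assigned, so both choices here are assumptions. Interpreting a Bayes factor below 1 through its reciprocal, as evidence for the null hypothesis, is a common convention that is also assumed rather than taken from the text.

```python
def evidence_label(bayes_factor):
    """Map a Bayes factor of at least 1 onto the categories named in the text."""
    if bayes_factor > 100:
        return "extremely strong"
    if bayes_factor >= 30:
        return "very strong"
    if bayes_factor >= 10:
        return "strong"
    if bayes_factor >= 3:
        return "moderate"
    return "below moderate"  # category not named in the text


def describe_replication_bf(bf_original_vs_null):
    """Return (favoured hypothesis, evidence label) for a replication Bayes factor."""
    if bf_original_vs_null >= 1:
        return "original effect", evidence_label(bf_original_vs_null)
    # Below 1, the reciprocal quantifies evidence for the null (assumed convention).
    return "null", evidence_label(1.0 / bf_original_vs_null)


summary_for = describe_replication_bf(12.0)
summary_against = describe_replication_bf(0.02)
```

Here a Bayes factor of 12 counts as strong evidence for the original effect, whereas 0.02 (a reciprocal of 50) counts as very strong evidence for the null.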

Supplementing the pre-registered single-trial analyses, we performed an exploratory Bayesian mixed-effects model analysis using the brms package for R (Buerkner, 2016), which fits Bayesian multilevel models using the Stan programming language (Stan Development Team, 2016). Nieuwland requested to use the results of a mixed-effects model reanalysis of the DeLong et al. data as an appropriate prior (personal communication from Nieuwland, November 14 and 22, 2017); this request was declined by DeLong and colleagues. We were therefore limited to using a prior centered on a point estimate based on the DeLong et al. correlation analysis, namely our estimate of the observed effect size at Cz for a difference between 0% cloze and 100% cloze (1.25 μV and 3.75 μV for articles and nouns, respectively, based on visual inspection of the graphs) and a prior centered on zero for the intercept. Both priors had a normal distribution and a standard deviation of 0.5 (given the a priori expectation that average ERP voltages in this window generally fluctuate on the order of a few microvolts; note that these units are expressed in terms of the z-scored cloze values, rather than the original cloze values, such that μ for the cloze prior was 0.45, which corresponds to a raw cloze effect of 1.25). We computed estimates and 95% credible intervals for each of the mixed-effects models we tested, and transformed these back into raw cloze units. The credible interval is the range of values such that one can be 95% certain that it contains the true effect, given the data, priors and the model. The results from these analyses are shown in Figure 7; the analyses suggest that, while there may be a small positive association between article cloze and ERP amplitude elicited by the articles, the effect is substantially smaller than that estimated by DeLong et al. (2005) and likely is too small to be observed without very large sample sizes.
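The conversion between z-scored and raw cloze units can be sketched as below. The standard deviation of article cloze is not reported in the text; the value of 36 percentage points used here is a hypothetical figure, chosen only because it reproduces the stated correspondence between a prior mean of 0.45 (per standard deviation of cloze) and a raw effect of 1.25 μV (per 0 to 100% change in cloze) under a simple linear rescaling.

```python
def raw_to_z_effect(raw_effect, cloze_sd_percent, span_percent=100.0):
    """Convert a voltage change per full cloze span into a change per 1 SD of cloze."""
    return raw_effect * cloze_sd_percent / span_percent


def z_to_raw_effect(beta_z, cloze_sd_percent, span_percent=100.0):
    """Convert a slope per 1 SD of cloze into the voltage change from 0 to 100% cloze."""
    return beta_z / cloze_sd_percent * span_percent


ASSUMED_CLOZE_SD = 36.0  # hypothetical value, not reported in the text

prior_mean_z = raw_to_z_effect(1.25, ASSUMED_CLOZE_SD)
raw_again = z_to_raw_effect(prior_mean_z, ASSUMED_CLOZE_SD)
```

The same rescaling applies to the bounds of a confidence or credible interval, since it is a multiplication by a positive constant.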

Control experiment

Analysis of the control experiment involved a comparison between a model with the categorical factor ‘grammaticality’ (grammatical/ungrammatical) and a model without. Our dependent measure (P600 amplitude; Osterhout and Holcomb, 1992; Wicha et al., 2004) was the average voltage across six centro-parietal channels (Cz/C3/C4/Pz/P3/P4) in the 500–800 ms window for each trial. Results are shown in Figure 8.

Model 1: P600 ~ grammaticality + (grammaticality | subject) + (grammaticality | item)

Model 2: P600 ~ (grammaticality | subject) + (grammaticality | item)

Data availability

The following data sets were generated
    1. Nieuwland M
    (2018) Replication Recipe Analysis plan
    Available at the Open Science Framework.

References

  1. DeLong KA, Urbach TP, Kutas M (2017) Concerns with Nieuwland et al. Multi-Lab Study (2017). San Diego: University of California.
  2. DeLong KA (2009) Electrophysiological Explorations of Linguistic Pre-Activation and Its Consequences During Online Sentence Processing. San Diego: University of California.
  3. Friston K (2005) A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences 360:815–836. https://doi.org/10.1098/rstb.2005.1622
  4. Miyamoto K (2016) Hemispheric differences in linguistic prediction given high constraint contexts. Presentation Given at the Kutas Cognitive Electrophysiology Lab. Accessed April 23, 2016.
  5. Van Berkum JJ (2010) The brain is a prediction machine that cares about good and bad - any implications for neuropragmatics? Italian Journal of Linguistics 22:181–208.

Decision letter

  1. Barbara G Shinn-Cunningham
    Reviewing Editor; Boston University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Large-scale replication study reveals a limit on probabilistic prediction in language comprehension" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Sabine Kastner as the Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: Matt Davis (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. The three reviewers were unanimous in recognizing the importance of this large-scale replication (with pre-registered hypotheses, across multiple labs) study, and of publishing this kind of work. They all agreed that the study is appropriate for eLife, but also all agreed that the presentation requires some polishing before publication.

The positives here are clear. While no experimental evidence can "prove" the null hypothesis, this work highlights that the field has adopted a very strong assumption about mechanisms of language comprehension even though the landmark evidence for this idea is not very robust. The reviewers all applaud the effort that went into this large-scale, multi-site study. All hope that the publication of this study encourages others to undertake this kind of effort, which not only is principled, but provides a large dataset that can fuel additional statistical analysis, computational modelling, and theoretical debate. The reviewers felt that emphasizing these issues even more in the manuscript would make clear how this kind of work contributes to the literature.

Each of the three reviewers provided extraordinarily long and detailed comments that were thoughtful and constructive. While we have tried to consolidate these comments, the result remains a quite long discussion, for which I apologize; however, given the care that went into these reviews, I decided it was helpful to preserve much of the content.

Essential revisions:

1) At the highest level, the theoretical importance of the original paper (and of the current replication) needs to be even more clearly laid out. The interpretation of the original results is stated in only vague terms, which leads to confusion about what one should conclude. Crucially, the authors never really spell out what mechanism would lead to effects on the N400 in response to an article preceding the putatively predicted noun (as opposed to the N1, P2, LAN, P600, or any other ERP component).

The current paper seems to argue for the "high-level compositional" (Introduction) view of the N400. The "N400 is elicited by every word of an unfolding sentence, and its amplitude is smaller (less negative) with increasing ease of semantic processing" (Introduction). Therefore, if a word does not fit semantically well within a context, N400 effects are expected. But DeLong et al., (2005) argued that the N400 reflects lexico-semantic prediction. The mixing of these two views is problematic here, as the DeLong et al., (2005) design "involved semantically identical articles (function words) rather than nouns or adjectives (content words) that are rich in meaning" (Introduction), which in turn would make "the observed N400 modulation by article-cloze.… unlikely to reflect difficulty interpreting the articles themselves" (Introduction). Unfortunately, this feature of the study makes it hard to make a precise prediction about what to expect at the level of processing the definite articles, especially if one is espousing a semantic integration view of the N400, for several reasons:

a) N400 effects are normally investigated in content words, not closed class words like articles.

b) As closed class words, articles elicit different ERP morphologies than open class words, including within the N400 time window.

c) Previous studies contrasting ERP responses to articles that are compatible or incompatible in upcoming word gender features elicited differences in late positivities, not N400.

d) The control study in the current manuscript shows that the incorrect pairing of a definite article allomorph with a subsequent word induces later positivities, not N400 effects. This calls into question why we should expect an N400 effect in the case of the predicted upcoming word.

e) There is a long history of research tying the N400 to general lexico-semantic processing, but not compositional semantics. At different places, the authors seem to argue both of these views. Under the compositional semantic integration view, the N400 may be affected by the phonological form of articles. Both forms contribute similarly to the semantics of the sentence (they are allomorphs after all). Semantically reversible sentences do not show N400 effects even when one order of the arguments is semantically odd (e.g., "the hunters chased the fox" vs. "the fox chased the hunters"), which argues against a compositional view of the N400. Given all this, why should one expect the N400 to be affected for allomorphic variations of the definite article when no obvious semantic integration effects are involved?

f) The fact that the a/an alternation is not triggered by the predicted head noun, but rather by the word immediately following the article, also confuses things quite a bit: why should there be, under any interpretation of the N400, an N400 effect for the alternating articles on the basis of the predicted head noun if the head noun is not necessarily (or even statistically) the trigger for the article form alternation? While the authors discuss this idea, it casts serious doubt on the original interpretation of the DeLong et al., findings.

These issues call into question the theoretical importance of the original study (and thus the replication). Contrary to what the authors state in the paper, that original paper offers only weak evidence for a "strong prediction view" precisely because the theoretical implications are vague and unspecified.

2) Throughout the paper, the authors emphasize the failure to find graded neural effects of cloze probability on the indefinite article preceding a noun. The reviewers do not believe this is as critical as it is played up in the manuscript. For instance, the current study also provides some evidence that a mismatch between articles (a/an) and subsequent nouns, conducted in the same participants tested in a subsequent experiment, leads to different neural activity. Moreover, a gradient in response at the level of group data (as in the original study) could nonetheless arise from discrete, all-or-none prediction at the single-trial level. It seems a bit disingenuous to argue that the study should call into question the importance of prediction on theories of language comprehension so broadly.

Relatedly, it would be good if results from studies of other "predictive" effects were considered and reconciled with the present findings. The authors list several of these in the Discussion section, but omit others; the authors should provide a balanced discussion of other evidence of different prediction effects, such as

- gender predictions (Wicha, Moreno and Kutas, 2004; Otten, Nieuwland and van Berkum, 2008; Van Berkum et al., 2005).

- predictions of initial sounds of upcoming words (Connolly and Phillips, 1994).

- expectations of the form of words with specific syntactic classes (Dikker et al., 2010).

- processes affected pre-N400 ERP responses driven by form-level characteristics (Lau et al., 2006).

- effects on early sensory responses like the M100 (Dikker et al., 2010; Dikker, Rabagliati, and Pylkkanen, 2009).

- behavioral evidence by close-shadowing (Marslen-Wilson, 1985).

The authors argue that the present null results for articles can "be straightforwardly reconciled with effects reported for other pre-nominal manipulations" (Discussion section), but studies like those above provide evidence for predictive processing. The preceding paragraph (Discussion section) offers a number of caveats, but readers of this paragraph might take away the message that the current results "do not necessarily exclude phonological form pre-activation" rather than the message from the previous page (Discussion section) that "there is currently no clear evidence to support routine probabilistic pre-activation of a noun's phonological form during sentence comprehension."

In general, the wording of the manuscript should be revised throughout so as to not overstate the implications of the findings, while at the same time giving a consistent message about what the findings do mean.

3) The correlation analysis in the original paper analyzed only 10 data points, each of which is a mean across 32 participants who each did ~80 trials (binned into deciles based on cloze probability). This analysis excludes several critical sources of variance between trials, items, and participants, which increases the likelihood of type I error. On the other hand, the analysis is also underpowered, since with only 10 data points and 8 degrees of freedom, statistical significance requires an exceptionally strong correlation; thus, the likelihood of type II error is great (making it not very surprising that results fail to replicate). This is an important lesson for the field and should be brought out more clearly in the present manuscript. While adding additional analysis is not necessary for the manuscript to be publishable, this point could be strengthened, for instance, by running simulations to assess the power of the original DUK study, varying the number of participants and items and exploring whether a graded effect of cloze probability on article and noun processing can be detected.

4) On a related note, the DUK study also reports a reliable difference in N400 magnitude between high and low cloze probability articles and nouns (Figure 1A of DUK, left panel); their analysis included by-participant variation as a random effect. Given their counterbalanced design and arguments from Raaijmakers et al., (1999), this analysis is less prone to errors than their correlational analysis. The categorical analysis provides additional evidence for a role of prediction – albeit not a graded one. Figure 3 of the present manuscript presents similar categorical results, but there is no statistical analysis reported. Do these new results also show significant evidence for a non-graded prediction of articles? If so, this would substantially modify the conclusions drawn from the current study. This should be considered and discussed.

5) The reviewers noted that there is an online critique and re-analysis posted on BioRxiv by Yan, Kuperberg and Jaeger (https://www.biorxiv.org/content/early/2017/05/30/143750) that endorses the importance of the issues raised and the value of the data collected here. It's not clear whether and how the authors anticipated this commentary in the present manuscript.

For instance, this commentary suggests that Bayesian surprisal, KL divergence between pre- and post-article word probability distributions, or other variables might show a closer correspondence to the neural data than does cloze probability; both the current authors and the original study make an implicit assumption that cloze probability is the best predictor.

Another example is the idea raised in the commentary that there is a chain of cognitive processes mediating between meaning predictions and form-based predictions for indefinite articles. A failure, probabilistic limitation, or probability mis-estimation (e.g., cloze tests overestimating the probability of an indefinite article) in these mediating processes may explain why the strongly constraining lexical/semantic predictions set-up by the sentence contexts don't always influence the processing of indefinite articles. The sentence contexts set up semantic expectations, which constrain the choice of lemmas in critical sentence positions. Some predicted lemmas have phonological forms that have specific consequences for articles; however, form-based predictions may not follow even if a semantically predicted lemma is chosen (e.g., "a kite" vs "an aeroplane", but also "an old kite" or "a plane"). The authors acknowledge this point (Discussion section), but do not carry this line of argument through the rest of the paper. It's not the case that a failure to observe predictions at the level of the phonological form of indefinite articles necessarily implies that all prediction is absent; yet in a number of places they seem to argue along these lines.

The manuscript's impact would be heightened by incorporating comments that address points like these that were raised in this online critique.

6) It is great that the authors included Bayes factor analyses to gain theoretical insights from null results. These analyses provide evidence in favor of the null hypothesis that there are no graded predictions for articles, despite numerical effects in the expected direction. However, these analyses depend on assumptions about the expected effect size and other details that are not clearly spelled out. These priors should be provided. For instance, Figure 1A/B DUK suggests that the expected effect size of the cloze probability effect on the article is considerably smaller than the effect on the noun itself. It seems that this was taken into account in Bayes factor computations (given the graphs in SI Figure 2), but what other assumptions are made in this analysis? Was the variance expected to be equivalent for articles and nouns? Can the authors justify the choice of using a Gaussian? What does the estimated effect size being greater than zero mean for the "pre article" period (in SI Figure 2)? If predictive processing occurs, then predictions should have been computed prior to the onset of the article.

7) The use of a pre-article baseline correction may be problematic, since neural responses prior to the onset of the article will include processing of linguistic information that contributes to the generation of predictions or (equivalently) constrains the likely form and meaning of upcoming words. For example, recent data from Grisoni et al., (2017) suggests a "semantic readiness potential" reflecting pre-activation of specific semantic representations. Similar semantic pre-activation is likely present in the sentences used in the current study. By doing a pre-article baseline, such differences could corrupt or conflate with differences ascribed to neural responses to the article. The authors consider but do not describe or discuss these effects in Supplementary figure 2. While the authors are not responsible for the fact that the original study used an unconventional baseline, in subsection “Single-trial analysis” the manuscript implies that there is a reason why it is appropriate, because the current authors observed non-significant cloze effects immediately prior to the critical noun.

At a minimum, this potential confound should be raised and discussed. Better still, the authors could run an analysis in which they either use a pre-sentence baseline or analyze raw ERPs without baseline correction and compare to the pre-article baseline analysis. They could also analyze whether there are differences to highly vs weakly constraining sentences prior to the onset of the article (building on the analysis in SI Figure 2), an approach that might benefit from using a different predictor variable that captures the strength of prediction (e.g. the entropy of the cloze probability distribution) rather than whether or not the prediction is subsequently confirmed or violated. Inclusion of these types of additional analyses could support a more nuanced discussion of whether or not participants compute the likely meaning of upcoming words.

8) The distinction between form and meaning predictions may influence differences or similarities between N400 effects seen on articles and on nouns. The original DUK paper reports similar magnitude and topography of N400 effects on both the article and noun. It would be instructive to compare the location, timing, and effect size for cloze probability correlations between the article and the noun in the current, larger dataset. Additional tests for difference between the significant and null correlations would be valuable. For instance, is the reliable effect of graded cloze probability on nouns significantly different from the null effect of cloze probability on articles? This difference should be reliable if semantic integration is the key driver of N400 effects on nouns.

In a similar vein, what is the assumed model of the N400 generator? There are at least two published computational models that seek to explain the large body of N400 effects in the existing ERP literature. One proposes something akin to integration demands (Cheyette and Plaut, 2016), another proposes a form of (semantic) prediction error (Rabovsky and McRae, 2014). Such models provide a helpful context in which to discuss whether or not prediction violations at the level of form will lead to an N400 in the absence of any ongoing semantic integration demands. Ultimately, questions about the replicability or otherwise of the data presented by DUK are only informative to the extent that they are, or are not, theoretically constraining. A discussion of these and other computational theories of the N400 is important for interpreting the present results, and how these results should be considered in the field of language comprehension going forward.

Other general comments for the authors to consider:

1) The authors conducted a single-trial analysis using linear mixed effects regression (LMER) analysis. This is the most appropriate analysis method since it includes both between-participant and between-item variance and combines all of the usable data from individual participants and single trials into one analysis. While LMER analysis is now preferred to correlation analysis used in the original study, it only came to prominence after the publication of the original paper. Pointing out the advantages of LMER analyses would enhance the current paper's impact. For instance, especially when analyzing effects of continuous predictor variables, LMER may offer better control over false-positives than conventional analyses. This is an important message for researchers studying the neuroscience of language, who generally do not undertake this approach.

2) The flip side of using single-trial LMER, however, is that it is not clear whether or not the link between cloze probability and neural responses is linear or logistic. The independent measure (cloze probability) used as a predictor variable is summarizing a binary outcome variable (i.e., did individual participants in the cloze test generate the article "a" or "an") as a probability. It is at least plausible that neural outcomes could be similarly binary: participants predict "a" or "an" and then do (or don't) generate a neural prediction error when an unexpected word is presented. Would reliable effects of cloze probability be observed with a logistic or binomial linking function? While this is not the form of graded prediction tested by DUK, it seems quite likely that averaging over many participants and many trials (as in the DUK correlation analysis) would generate a graded, linear relationship between cloze probability and neural outcomes, even if the underlying relationship was logistic or non-linear. This possibility should be considered, assessed and discussed.

3) Previous work has explored the presence/absence of fillers, which is another difference between the DUK study and the current one. While it is unlikely to explain the failure to replicate results, some additional discussion seems appropriate. Given the large scale of the present dataset there should be many opportunities for determining whether cloze probability effects (for nouns, if not for articles) are enhanced once participants have already encountered a number of distinctive, high/low cloze probability sentences.

4) The fact that the nine labs inadvertently manipulated the timing of stimulus presentations of the critical articles and nouns (see footnote 3, Materials and methods section) is unlikely to be critical to their findings/conclusions. However, it would be useful to know whether the timing of early, visual form responses for written words that are delayed and non-delayed differ. Simply arguing that the change in timing is "unlikely to be noticed" and "after the N400 window" is making assumptions that can and should be tested with their ERP data.

5) The use of different baselines was confusing to multiple reviewers. A table that summarizes details of all the current analyses and how these correspond to the procedures used in the original DUK study would help the reader keep straight the different results.

6) Both the original and current studies present a high percentage of a/an continuations in the cloze test. Could this procedure bias participants towards using a/an rather than "his", "hers", "their," etc. that aren't marked for the phonological form of the subsequent noun? After encountering several sentences in which the indefinite article is appropriate, participants may routinely complete the remaining sentences with an indefinite article continuation due to this sort of syntactic priming. A comment about this would be welcome.

7) In multiple places, the authors use phrasing about how the study argues for "a more limited role for prediction during language comprehension." This wording seems overly broad, conflating very different conceptions of prediction across different levels of representation. There is a deep theoretical division between proposed mechanisms of conceptual pre-activation (e.g., the concept of 'readiness' in the discourse literature, linked to the N400 literature beautifully in Jos Van Berkum's work) and mechanisms of predicting the form of the upcoming input (commitment to a particular linguistic formulation of the message). It seems likely that the former, long argued for in the discourse literature, might play a central role in language comprehension, while contributions from the prediction-of-form mechanism may be quite limited. It would be unfortunate if readers take the current study to cast doubt on both extremely different mechanisms, when it actually only speaks to the latter mechanism. Please consider rewording this throughout.

https://doi.org/10.7554/eLife.33468.023

Author response

Essential revisions:

1) […] These issues call into question the theoretical importance of the original study (and thus the replication). Contrary to what the authors state in the paper, that original paper offers only weak evidence for a "strong prediction view" precisely because the theoretical implications are vague and unspecified.

We thank the reviewer for these comments. In the Introduction we now state “DeLong et al., presented the systematic, graded N400 modulation by article-cloze as strong evidence that participants activated the nouns and articles in advance of their appearance, and that the disconfirmation of this prediction by the less-expected articles resulted in processing difficulty (higher N400 amplitude at the article)”

This reviewer questions the theoretical relevance offered by DeLong et al., about their own findings and points out different interpretations of the N400 that may have different repercussions for interpreting the DUK results. However, that interpretation exercise assumes that the original effects were reliable and replicable. The motivation of our study was to test whether the effects can be replicated in the first place, after which a theoretical interpretation becomes relevant, as we offer in our discussion. This reviewer suggests that we are advocating a particular interpretation of the N400. This is not the case. Our group does not take the N400 as a pure measure of high-level semantic composition. We purposefully used the neutral term “ease of semantic processing”, which can include both memory access and integration processes. Our only assumption regarding the N400 followed the logic of DeLong et al., namely that noun-elicited N400s themselves do not yield unambiguous evidence for prediction.

We certainly agree that DUK offers only weak evidence for the “strong prediction view”, in particular because their effects are weak, and they tried multiple analyses. We now refer to the original results throughout the paper as “the most acclaimed” instead of “the strongest evidence”. This is because while DUK may offer only weak evidence, that is not how their results have been and are still received in the wider literature: indeed, several impactful review papers literally cite DUK as ‘strong evidence’. This reviewer may call into question the theoretical relevance of the original study, but, as we describe in our text, the original study has had a major impact on the field and is cited in all the major theoretical review papers on linguistic prediction. Moreover, questions about the replicability or interpretation of the DUK findings have hardly ever been raised (we cite the few exceptions in the text).

This reviewer also states that all pre-nominal gender manipulations elicited effects other than N400 effects. This is not the case. Prenominal gender manipulations have elicited a range of effects, including N400 effects (e.g. Otten and Berkum, 2009; Wicha et al., 2003; Kochari and Flecken, under review) and sometimes P600-like effects or frontal negative ERP effects. We mentioned this in the previous submission and have now expanded this section. However, why different effects are observed for those gender-manipulations is unknown and beyond the scope of our paper on a different manipulation.

2) Throughout the paper, the authors emphasize the failure to find graded neural effects of cloze probability on the indefinite article preceding a noun. The reviewers do not believe this is as critical as it is played up in the manuscript. For instance, the current study also provides some evidence that a mismatch between articles (a/an) and subsequent nouns, conducted in the same participants tested in a subsequent experiment, leads to different neural activity. Moreover, a gradient in response at the level of group data (as in the original study) could nonetheless arise from discrete, all-or-none prediction at the single-trial level. It seems a bit disingenuous to argue that the study should call into question the importance of prediction on theories of language comprehension so broadly.

We added comments to clarify that we do not question the general importance of prediction in theories of language (see also our response to the next parts of comment 2). Here and in M4, the reviewers bring up the use of cloze as a dichotomous variable. We have performed this analysis with a median-split approach (100 and 500 ms baseline for the articles, 100 ms baseline for the nouns). The new code is also available on our OSF page. Importantly, the results are not affected substantially. Moreover, these analyses all yielded lower chi-sq values (and higher p-values) than the continuous approach. We mention these analyses and their results in the Results section.
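The median-split approach mentioned above can be illustrated with a minimal sketch. It assumes cloze is expressed as a percentage per item and that values tied with the median are labelled 'low'; the function name and the example values are hypothetical and are not taken from the authors' OSF code.

```python
from statistics import median

def median_split(cloze_values):
    """Label each cloze value 'high' or 'low' relative to the sample median."""
    cutoff = median(cloze_values)
    # Values strictly above the median are 'high'; ties fall into 'low'
    # (an arbitrary choice made for this illustration).
    return ["high" if c > cutoff else "low" for c in cloze_values]

# Hypothetical article-cloze percentages, for illustration only.
example_cloze = [0, 3, 12, 25, 40, 58, 77, 96]
labels = median_split(example_cloze)
print(list(zip(example_cloze, labels)))
```

A dichotomized predictor of this kind discards the within-group spread of cloze values, which is one plausible reason for such an analysis to yield weaker test statistics than the continuous predictor.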

Relatedly, it would be good if results from studies of other "predictive" effects were considered and reconciled with the present findings. The authors list several of these in the Discussion section, but omit others; the authors should provide a balanced discussion of other evidence of different prediction effects, such as

- gender predictions (Wicha, Moreno and Kutas, 2004; Otten, Nieuwland and van Berkum, 2008; Van Berkum, et al., 2005).

- predictions of initial sounds of upcoming words (Connolly and Phillips, 1994).

- expectations of the form of words with specific syntactic classes (Dikker, et al., 2010).

- processes affected pre-N400 ERP responses driven by form-level characteristics (Lau et al., 2006).

- effects on early sensory responses like the M100 (Dikker et al., 2010; Dikker, Rabagliati, and Pylkkanen, 2009).

- behavioral evidence by close-shadowing (Marslen-Wilson, 1985).

We thank the reviewer for the suggestion. We already cite many of the relevant papers on gender manipulations, which are the most relevant because they involve prenominal manipulations like DUK. We mentioned that these studies have not elicited a consistent pattern of results and have sometimes demonstrated non-replicability. However, providing a review of all ERP effects associated with prediction is beyond the scope of our work, and there are already many other papers that review this literature. This reviewer appears to take all the listed studies as clear-cut evidence for some form of prediction. But many of these studies are problematic for all sorts of reasons, including the issue of replicability and data-contingent analysis. For example, the Connolly and Phillips results are intriguing but rely entirely on the distinction between the N200 and N400, which is poorly defined in the original study, and not observed in many other studies (Van Petten et al., 1999; Van den Brink et al.; Boudewyn et al., 2015; Diaz and Swaab, 2007). The study by Lau et al. reported ELAN-results that were not necessarily pre-N400 (200-400 ms), could only be obtained with a common average procedure that no other ELAN study has ever used, showed clear evidence of a ‘polar average reference effect’, and has failed to replicate (Kaan, Kirkham and Wijnen, 2016). The Dikker et al. studies are low-N (~12) studies with p-values in the .01-.05 range and many researcher degrees of freedom due to the methodology, and they do not show any M1 modulation by predictability. We also note that all the studies mentioned by the reviewer are single studies that, like DUK, have not been directly replicated, hence the robustness of the reported findings needs to be verified, as well as their generalizability to other contexts, labs, etc. In other words, all these observations are worthy of a dedicated, thorough and critical review, but that is for a different paper.
Our aim is not to question the general role of prediction in language comprehension (see next point), but to illustrate the importance of replication.

The authors argue that the present null results for articles can "be straightforwardly reconciled with effects reported for other pre-nominal manipulations" (Discussion section), but studies like those above provide evidence for predictive processing. The preceding paragraph (Discussion section) offers a number of caveats, but readers of this paragraph might take away the message that the current results "do not necessarily exclude phonological form pre-activation" rather than the message from the previous page (Discussion section) that "there is currently no clear evidence to support routine probabilistic pre-activation of a noun's phonological form during sentence comprehension".

In general, the wording of the manuscript should be revised throughout so as to not overstate the implications of the findings, while at the same time giving a consistent message about what the findings do mean.

We do not question the existence of predictive processing in general, and we now make that more explicit in the revised Discussion section (“Our results, however, should not be taken as evidence against prediction in language processing more generally, and we believe that prediction could play an important role in language comprehension”). We also note that each of these studies argues for a slightly different aspect of prediction, and none of them clearly supports "routine probabilistic pre-activation of a noun's phonological form during written sentence comprehension". Furthermore, as pointed out above, these studies are not as strong as they may first appear, for all sorts of reasons. As for our own results, they are not necessarily evidence against phonological form pre-activation, just not very clear evidence for it; that is what we would like readers to take away from our discussion.

3) The correlation analysis in the original paper analyzed only 10 data points, each of which is a mean across 32 participants who each did ~80 trials (binned into deciles based on cloze probability). This analysis excludes several critical sources of variance between trials, items, and participants, which increases the likelihood of type I error. On the other hand, the analysis is also underpowered, since with only 10 data points and 9 degrees of freedom, statistical significance requires an exceptionally strong correlation; thus, the likelihood of type II error is great (making it not very surprising that results fail to replicate). This is an important lesson for the field and should be brought out more clearly in the present manuscript. While adding additional analysis is not necessary for the manuscript to be publishable, this point could be strengthened, for instance, by running simulations to assess the power of the original DUK study, varying the number of participants and items and exploring whether a graded effect of cloze probability on article and noun processing can be detected.

We appreciate the suggestion to emphasize the potential problems with the correlation approach. The problems with this approach are clear. In our study, for example, while a statistically significant effect at the noun was found in each participating lab using a more powerful single-trial analysis, the effect across labs in the correlation analysis was mixed. We have now added a statement to that effect in the Discussion section (“Moreover, our single-trial analysis revealed a significant noun-cloze effect in each of the laboratories, further demonstrating that our single-trial analysis is a more powerful approach than the averaged-based correlation approach of DeLong et al.”). Given the obvious problems with the correlation-analysis, we feel that focusing too much on that analysis would detract from our own single-trial results. In addition, we feel that running simulations to assess the power of DUK with varying participant and item numbers is unnecessarily burdensome, if all that is available to us is the correlation coefficients: running such simulations using the actual DUK data would be a lot more meaningful, but they have refused to share their data so far. In addition, we know that DUK have analyzed their own data with a single-trial analysis, which is a more informative analysis given the weaknesses of the original analysis, and which also did not yield a statistically significant article-effect. We assume that DeLong et al., will publish those new results in the near future, so that power-analyses can be applied to the improved analyses.
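To make the contrast between the averaged correlation approach and a trial-level analysis concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data are randomly generated, the slope and noise level are arbitrary, the bins are ten equal-width cloze ranges, and a plain Pearson correlation stands in for the mixed-effects models used in the study.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def bin_means(cloze, amplitude, n_bins=10):
    """Average amplitude within equal-width cloze bins (0-10%, 10-20%, ...)."""
    width = 100 / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for c, a in zip(cloze, amplitude):
        idx = min(int(c // width), n_bins - 1)  # 100% goes into the top bin
        sums[idx] += a
        counts[idx] += 1
    centers = [(i + 0.5) * width for i in range(n_bins) if counts[i]]
    means = [sums[i] / counts[i] for i in range(n_bins) if counts[i]]
    return centers, means

random.seed(1)  # fixed seed so the illustration is repeatable
# Hypothetical single-trial data: a small true slope buried in large trial noise.
cloze = [random.uniform(0, 100) for _ in range(2000)]
amplitude = [0.01 * c + random.gauss(0, 5) for c in cloze]

centers, means = bin_means(cloze, amplitude)
r_trials = pearson_r(cloze, amplitude)
r_binned = pearson_r(centers, means)
print(f"single-trial r = {r_trials:.2f} (n = {len(cloze)})")
print(f"binned r       = {r_binned:.2f} (n = {len(means)})")
```

The binned coefficient rests on ten points however many trials went into them, so its significance test ignores trial, item and participant variability, whereas the trial-level analysis retains it.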

4) On a related note, the DUK study also reports a reliable difference in N400 magnitude between high and low cloze probability articles and nouns (Figure 1A of DUK, left panel); their analysis included by-participant variation as a random effect. Given their counterbalanced design and arguments from Raaijmakers et al., (1999), this analysis is less prone to errors than their correlational analysis. The categorical analysis provides additional evidence for a role of prediction – albeit not a graded one. Figure 3 of the present manuscript presents similar categorical results, but there is no statistical analysis reported. Do these new results also show significant evidence for a non-graded prediction of articles? If so, this would substantially modify the conclusions drawn from the current study. This should be considered and discussed.

DUK figure 1a showed the difference between high and low only as “illustrative ERPs”; the original report did NOT report any categorical statistical analysis. However, DeLong et al., (2009) DID report such an analysis and the effect of cloze on article-elicited ERPs was not significant. We have also performed these analyses, as described in M2.

5) The reviewers noted that there is an online critique and re-analysis posted on BioRxiv by Yan, Kuperberg and Jaeger (https://www.biorxiv.org/content/early/2017/05/30/143750) that endorses the importance of the issues raised and the value of the data collected here. It's not clear whether and how the authors anticipated this commentary in the present manuscript.

For instance, this commentary suggests that Bayesian surprisal, KL divergence between pre- and post-article word probability distributions, or other variables might show a closer correspondence to the neural data than does cloze probability; both the current authors and the original study make an implicit assumption that cloze probability is the best predictor.

We are aware of this unpublished critique and re-analysis by YKJ, which is a comment on a previous draft of our manuscript that became available on Biorxiv in February 2017. We did not incorporate discussion of this work because we felt that a separate, dedicated response was more appropriate, in which we could extensively comment on some of the details of their analysis and their argumentation (after all, their commentary is 60 pages long). YKJ bring up an interesting point, namely that the cloze variable should be log-transformed, based on their account in terms of Bayesian surprisal, which itself is not a new argument (e.g. Smith and Levy, 2013). We have now added a brief discussion of the Yan et al. commentary. To retain a balance between succinctness and completeness in the main text, we moved our fuller response to an online supplement on our OSF page with a link in the text (https://osf.io/mb2ud/).

In essence, the YKJ conclusion is not different from what we conclude, namely that there is some evidence for a non-zero effect. However, YKJ ignore our strong evidence that the effect is not like that of DUK, and misrepresent our conclusions to argue their case (e.g., “their conclusion that their data provides no evidence for prediction on the article”, while we in fact purposefully refrained from using terms like “no evidence”, we used terms like “no convincing evidence” or “no clear evidence”). The YKJ argument about how cloze probabilities should be analyzed is ultimately one for further testing, because their exploratory analysis was conditional on our data. YKJ applaud us for “leading by example in pre-registering their study” but have yet to follow our example. It is one year since we published our pre-print, and about 10 months after Yan et al., performed their analyses. As far as we are aware, they have not proceeded to test their hypothesis in a pre-registered study, nor have they attempted to reanalyze the wealth of N400 data from their own labs. YKJ report no independent data to support the conclusion that Bayesian surprisal is a better predictor of N400 activity than cloze probability. Unlike YKJ, we have run their transformed analysis on other cloze datasets. The results speak against a generally better fit of log-cloze, and we have also made those data available online.

The YKJ analysis did yield a significant article-cloze effect in our data, and a model with log-transformed cloze had a lower AIC (an indication of model fit) than a model with regular cloze. For the nouns, the transformation yielded a slightly higher T value for cloze. According to one of the current reviewers (S11): “No valid conclusion can be drawn from a change in numerical magnitude or a change in the observed p-value”, and if that is true the same concern should apply to YKJ as well. The article-effect becomes significant whereas it was not-significant before, but the difference between significant and not-significant itself may not be significant (Gelman and Stern, 2006). Moreover, their analysis could be problematic because between-item differences in the lower cloze scale are not matched on relevant variables. This is also suggested by the fact that their analysis also amplifies the effect in the pre-article time window (a marginally significant effect at p=0.058), so it might just be ‘amplifying noise’ (by amplifying the between-item differences at the lower end of the cloze scale), and does not yield a significant effect when we used the original DUK baseline. Their results also hinge entirely on the zero-cloze values which they argue cannot be reliably estimated. Jaeger mentioned in a private message that “all the signal is in the zero-cloze values”, and this is correct; in fact, if one repeats their analysis without zero-cloze values the effect size changes direction (higher cloze leads to more negative voltage). In other words, the transformation seems to matter very little, and what it does seem to do is boost the impact of zero-cloze values which are problematic because they are indeterminate, so the evidence for their case is not very convincing.
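The concern that a log transform mainly boosts the influence of zero-cloze values can be illustrated with a small sketch. It assumes a simple floor-based treatment of zero cloze, which is a stand-in chosen for illustration and not necessarily the smoothing that YKJ applied; the function name, the floor values and the example percentages are all hypothetical.

```python
from math import log2

def log_cloze(percent, floor):
    """Log2 of the cloze proportion, with values below `floor` replaced by `floor`."""
    # The floor stands in for whatever smoothing is used to avoid log(0);
    # the choice of floor is arbitrary, which is the point of the illustration.
    p = max(percent / 100.0, floor)
    return log2(p)

cloze_percentages = [0, 1, 10, 50, 100]
for floor in (0.01, 0.001):
    transformed = [round(log_cloze(c, floor), 2) for c in cloze_percentages]
    print(f"floor = {floor}: {transformed}")
```

Changing the floor leaves the transformed values for the non-zero items untouched but moves the zero-cloze items by several log units, so any fitted slope depends heavily on a quantity that cannot be estimated from the cloze data themselves.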

YKJ present the significant effect of article-cloze in their main text, and they put the analyses that do not yield as clear support in a footnote, and do not report analysis of the pre-article window. This seems somewhat disingenuous to us, given that we reported all analyses in the main text, significant or non-significant. YKJ also report their analysis as “the only one we performed”, although there is no record of this and it cannot be verified. In fact, private twitter messages by Jaeger to Nieuwland stated the opposite: that multiple analyses (to deal with the zero-cloze smoothing issue) were tried, and although they were said to lead to similar results, those are not mentioned in YKJ, and the reported ‘correction for multiple comparisons’ can therefore not be complete. A similar concern can be raised against the other analyses that were tried: for instance, YKJ analyzed the data with and without lab as random effect and they analyzed the data to test effects in each individual lab (with and without log transformation, so 2*9=18 comparisons). In none of those analyses did they find a statistically significant effect, despite what they state (“we find a significant effect on both the noun and the article in 8 out of 9 of the labs that participated in the replication attempt”). In addition, none of these analyses were taken into account during the correction for multiple comparisons. All this merely demonstrates the need for YKJ to specify and pre-register their analysis in advance and run a confirmatory study to test their hypothesis.

Finally, what strikes us as odd is that YKJ have many things to say about why our analysis is sub-optimal and our conclusions should not be accepted. But when it comes to the DUK data, all they say is “For the purpose of this discussion, we take the effect that they report at face value”, despite the many problems with the DUK analysis that we had identified. The importance of the log-transform as suggested by YKJ would also need to apply to the DeLong study, and it's not clear that the transform there would strengthen their effect rather than weaken it.

Another example is the idea raised in the commentary that there is a chain of cognitive processes mediating between meaning predictions and form-based predictions for indefinite articles. A failure, probabilistic limitation, or probability mis-estimation (e.g., cloze tests overestimating the probability of an indefinite article) in these mediating processes may explain why the strongly constraining lexical/semantic predictions set-up by the sentence contexts don't always influence the processing of indefinite articles. The sentence contexts set up semantic expectations, which constrain the choice of lemmas in critical sentence positions. Some predicted lemmas have phonological forms that have specific consequences for articles; however, form-based predictions may not follow even if a semantically predicted lemma is chosen (e.g., "a kite" vs "an aeroplane", but also "an old kite" or "a plane"). The authors acknowledge this point (Discussion section), but do not carry this line of argument through the rest of the paper. It's not the case that a failure to observe predictions at the level of the phonological form of indefinite articles necessarily implies that all prediction is absent; yet in a number of places they seem to argue along these lines.

The manuscript's impact would be heightened by incorporating comments that address points like these that were raised in this online critique.

We have added an explicit statement that our results are not to be taken as evidence against prediction in general (see response to Essential revision #2); we never said that our results suggest ‘prediction is absent’ and it was not our intention to imply this. As this reviewer writes, our submission already mentioned that the article-form does not rule out the expected meaning altogether. This aspect of the a/an rule in part triggered our interest in the DUK study, which we take up in the Discussion section. We have added a mention of how the a/an rule works on page 4 but we do not see why this would have to be brought up “throughout the rest of the paper”. We wrote it as a potential explanation for the discrepant findings, and this was available in our initial pre-print (available online in Feb 2017), and was also discussed earlier in Ito, Martin and Nieuwland (2017; available online in May 2016), so all in advance of the Yan et al. commentary.

6) It is great that the authors included Bayes factor analyses to gain theoretical insights from null results. These analyses provide evidence in favor of the null hypothesis that there are no graded predictions for articles, despite numerical effects in the expected direction. However, these analyses depend on assumptions about the expected effect size and other details that are not clearly spelled out. These priors should be provided. For instance, Figure 1A/B DUK suggests that the expected effect size of the cloze probability effect on the article is considerably smaller than the effect on the noun itself. It seems that this was taken into account in Bayes Factor computations (given the graphs in SI Figure 2), but what other assumptions are made in this analysis? Was the variance expected to be equivalent for articles and nouns? Can the authors justify the choice of using a Gaussian?

This reviewer may have missed our description of the priors in the methods section. Yes, they differed for articles and nouns because they were based on the DeLong correlation results. We have seen the LMEM results of the DUK data, but DeLong et al. refused to share their LMEM based priors for our analysis. We can inform the reviewer that those priors pulled our estimates closer to zero. Yes, variance was assumed to be equivalent. There is no ground to suspect a Gaussian would not be appropriate. Of note, these are all exploratory, non-preregistered analyses and the code and data are online so that anyone can explore our data further.

What does the estimated effect size being greater than zero mean for the "pre article" period (in SI Figure 2)? If predictive processing occurs, then predictions should have been computed prior to the onset of the article.

In short, we don’t really know. We analyzed this window because we found out, after publishing our pre-print, that the 500 ms window was used for baseline correction by DeLong. Their choice differed from most of the other studies in their lab, so maybe they picked this window to correct for pre-article differences, we do not know. We also saw what looked like slow drift effects in the ERPs in some labs in this window. The effects in this window cannot be meaningfully interpreted. This is because the two versions of each item are identical before the article. So, any effect there must be a mix of noise and effects due to differences between the sentence contexts (e.g., the predictability or integrability of the pre-article word, sentence position, lexical effects; all of these were not matched between the sentence contexts). It is possible that the distributions of the unexpected and expected articles along the cloze scale are somehow interacting with some of these between-item sentence context differences. We agree that predictions should be generated before the article, but this particular design does not allow for a meaningful analysis.

7) The use of a pre-article baseline correction may be problematic, since neural responses prior to the onset of the article will include processing of linguistic information that contributes to the generation of predictions or (equivalently) constrains the likely form and meaning of upcoming words. For example, recent data from Grisoni et al., (2017) suggests a "semantic readiness potential" reflecting pre-activation of specific semantic representations. Similar semantic pre-activation is likely present in the sentences used in the current study. By doing a pre-article baseline, such differences could corrupt or conflate with differences ascribed to neural responses to the article. The authors consider but do not describe or discuss these effects in Supplementary figure 2. While the authors are not responsible for the fact that the original study used an unconventional baseline, in subsection “Single-trial analysis” the manuscript implies that there is a reason why it is appropriate, because the current authors observed non-significant cloze effects immediately prior to the critical noun.

At a minimum, this potential confound should be raised and discussed. Better still, the authors could run an analysis in which they either use a pre-sentence baseline or analyze raw ERPs without baseline correction and compare to the pre-article baseline analysis. They could also analyze whether there are differences between highly and weakly constraining sentences prior to the onset of the article (building on the analysis in SI Figure 2), an approach that might benefit from using a different predictor variable that captures the strength of prediction (e.g. the entropy of the cloze probability distribution) rather than whether or not the prediction is subsequently confirmed or violated. Inclusion of these types of additional analyses could support a more nuanced discussion of whether or not participants compute the likely meaning of upcoming words.

We appreciate this concern about using a baseline window period in which predictive processing takes place. The reviewer correctly points out an inherent problem with choosing an appropriate baseline for the article in a study that searches for the presence of prediction for the article. We performed the requested analysis without baseline correction. This did not change the observed pattern (below), and it introduced a large amount of variance into the data that a baseline procedure is supposed to remove.

Fixed effects:

              Estimate  Std. Error        df  t value  Pr(>|t|)
cloze-effect    0.1351      0.3897  474.5000    0.347     0.729
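As a quick consistency check on the fixed-effect row above, the t value is the estimate divided by its standard error, and the p-value can be approximated from it. The sketch below uses a normal approximation for the two-sided p-value, which is an assumption made to keep the example dependency-free; it is reasonable here because the reported degrees of freedom are large, but it is not the degrees-of-freedom method that produced the reported output.

```python
from math import erfc, sqrt

# Values copied from the fixed-effects row for the cloze effect.
estimate = 0.1351
std_error = 0.3897

t_value = estimate / std_error
# Two-sided p-value under a standard normal approximation (assumption:
# with roughly 474.5 degrees of freedom the t distribution is close to normal).
p_approx = erfc(abs(t_value) / sqrt(2))

print(f"t = {t_value:.3f}, approximate p = {p_approx:.3f}")
```

The recomputed t value matches the reported one to three decimals, and the approximate p-value lands close to the reported 0.729.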

We emphasize that we are not arguing against some form of semantic prediction, and our revised draft makes this more clear. While this reviewer wants additional exploratory analyses to support a discussion of whether or not word meaning was anticipated, that was never the question of our study, and there is already quite a lot of evidence that people predict the meaning of upcoming words. Furthermore, the 80 sentence contexts are not matched on relevant variables known to impact EEG activity, so the experimental design is not suitable for addressing the question of pre-article constraint. We considered it important to make this clearer to the readers. In subsection “Exploratory (i.e., not pre-registered) single-trial analyses” we added “Because the sentence context of each item was identical for the expected and unexpected article, effects in the pre-article window cannot be meaningfully related to the appearance of the article. Effects in this window must therefore be due to a spurious mix of ‘residual EEG background noise’ (activity that differed between expected and unexpected conditions but was unrelated to actual expectancy) with EEG activity associated with the specific word appearing before the article (which varied between items in terms of lexical characteristics, contextual constraint, and sentence position).”

8) The distinction between form and meaning predictions may influence differences or similarities between N400 effects seen on articles and on nouns. The original DUK paper reports similar magnitude and topography of N400 effects on both the article and noun. It would be instructive to compare the location, timing, and effect size for cloze probability correlations between the article and the noun in the current, larger dataset. Additional tests for difference between the significant and null correlations would be valuable. For instance, is the reliable effect of graded cloze probability on nouns significantly different from the null effect of cloze probability on articles? This difference should be reliable if semantic integration is the key driver of N400 effects on nouns.

We already compare the effects of articles and nouns in our more powerful single-trial analysis, which shows a massive difference. We have also seen a single-trial re-analysis of the DUK data (by DUK), which shows the same pattern. We do not see the added value of the suggested analysis. We don’t think it is particularly instructive to know whether the correlations differ if this analysis is clearly problematic. Moreover, the noun-correlations are significantly different from zero, whereas the article-correlations go in the opposite direction at the channels where the noun-N400 effects are strong.

In a similar vein, what is the assumed model of the N400 generator? There are at least two published computational models that seek to explain the large body of N400 effects in the existing ERP literature. One proposes something akin to integration demands (Cheyette and Plaut, 2016), another proposes a form of (semantic) prediction error (Rabovsky and McRae, 2014). Such models provide a helpful context in which to discuss whether or not prediction violations at the level of form will lead to an N400 in the absence of any ongoing semantic integration demands. Ultimately, questions about the replicability or otherwise of the data presented by DUK are only informative to the extent that they are, or are not, theoretically constraining. A discussion of these and other computational theories of the N400 is important for interpreting the present results, and how these results should be considered in the field of language comprehension going forward.

The two cited models are not models of sentence-level comprehension effects in N400 activity but of word-level comprehension, so it's difficult to say what they would predict on DUK-type violations (if anything). This reviewer states that “these and other computational theories of the N400 are important for interpreting the present results”, but none of these models has ever said anything about ERP effects of pre-nominal manipulations in prediction-studies like DUK, and these studies have nothing or very little to say about form-prediction (Rabovsky et al., 2018), probably because they all involve purely semantic representations. It is not at all clear to us how these theories help us understand whether the true population-level effect size is more like that reported by DUK or more like that reported here.

Cheyette and Plaut assume that the N400 ERP component reflects difficulty of semantic access, essentially the viewpoint of Kutas et al. Rabovsky and McRae argue that the N400 reflects semantic surprise, and a later model about sentence-level comprehension by Rabovsky, Hansen and McClelland “treats N400 amplitudes as indexing the change induced by an incoming word in an implicit probabilistic representation of meaning”, and does not assume separate stages of access and integration. All these accounts assume the N400 component reflects a unitary process, which we think is unlikely, and, in fact, our own new results suggest that EEG activity in the N400 window is sensitive to different types of information at different times (Nieuwland et al., 2018). The only computational model that is accompanied by discussion of DUK-like effects is from Fitz and Chang (https://psyarxiv.com/frx2w/), which takes the N400 as a signal of error propagation. Fitz and Chang specifically discuss the fact that their model does not capture DUK-like N400 effects as a limitation of their model, and state that DUK-like effects would require a different set of mechanisms beyond their own model.

Other general comments for the authors to consider:

1) The authors conducted a single-trial analysis using linear mixed effects regression (LMER) analysis. This is the most appropriate analysis method since it includes both between-participant and between-item variance and combines all of the usable data from individual participants and single trials into one analysis. While LMER analysis is now preferred to correlation analysis used in the original study, it only came to prominence after the publication of the original paper. Pointing out the advantages of LMER analyses would enhance the current paper's impact. For instance, especially when analyzing effects of continuous predictor variables, LMER may offer better control over false-positives than conventional analyses. This is an important message for researchers studying the neuroscience of language, who generally do not undertake this approach.

We agree. In the Materials and methods section on our single-trial analysis, we added “Especially when analyzing effects of a continuous predictor variable such as cloze probability, LMER offers better control over false-positive results than the averaged-based correlation analysis of the original.” In the last paragraph of the Introduction, we added “and a single-trial analysis that modelled variance at the level of item and subject (with a linear mixed-effects model, which offers better control over false-positives than the replication analysis when analyzing effects of the continuous predictor cloze probability).”

2) The flip side of using single-trial LMER, however, is that it is not clear whether or not the link between cloze probability and neural responses is linear or logistic. The independent measure (cloze probability) used as a predictor variable is summarizing a binary outcome variable (i.e., did individual participants in the cloze test generate the article "a" or "an") as a probability. It is at least plausible that neural outcomes could be similarly binary: participants predict "a" or "an" and then do (or don't) generate a neural prediction error when an unexpected word is presented. Would reliable effects of cloze probability be observed with a logistic or binomial linking function? While this is not the form of graded prediction tested by DUK, it seems quite likely that averaging over many participants and many trials (as in the DUK correlation analysis) would generate a graded, linear relationship between cloze probability and neural outcomes, even if the underlying relationship was logistic or non-linear. This possibility should be considered, assessed and discussed.

To our knowledge, these analyses would require a binomial dependent variable, and it is unclear how this would be applied to our data. Regardless, this could be an interesting alternative analysis that others could use to explore our data (and later confirm its results in subsequent testing). We have discussed the issue of alternative analyses in our revised Discussion section.
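
As an illustration of the scenario raised in point 2, the following minimal sketch simulates an all-or-none account in which a reader either predicts the presented article or does not, and a fixed-size response occurs only when the prediction fails. It is not an analysis of the present data: the function names, the noise-free fixed amplitude (in arbitrary units) and the number of trials are all assumptions made for illustration. Under these assumptions, item-averaged responses are approximately linear in cloze probability even though each simulated trial is binary.

```python
import random


def simulate_item_means(cloze_values, n_trials=2000, error_amplitude=-1.0, seed=1):
    """Item-averaged responses under a hypothetical all-or-none account.

    On each trial the presented article is predicted with probability equal
    to its cloze value; a fixed response (error_amplitude, arbitrary units)
    occurs only when the article was not predicted.
    """
    rng = random.Random(seed)
    means = []
    for cloze in cloze_values:
        total = 0.0
        for _ in range(n_trials):
            predicted = rng.random() < cloze  # binary trial-level outcome
            total += 0.0 if predicted else error_amplitude
        means.append(total / n_trials)
    return means


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


cloze_values = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
item_means = simulate_item_means(cloze_values)
r = pearson_r(cloze_values, item_means)
print(round(r, 3))
```

Because the sketch contains no measurement noise, it only shows that averaging can turn binary trial-level outcomes into a graded item-level pattern; it says nothing about whether such a pattern would be detectable in real EEG data.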

3) Previous work has explored the presence/absence of fillers, which is another difference between the DUK study and the current one. While it is unlikely to explain the failure to replicate results, some additional discussion seems appropriate. Given the large scale of the present dataset there should be many opportunities for determining whether cloze probability effects (for nouns, if not for articles) are enhanced once participants have already encountered a number of distinctive, high/low cloze probability sentences.

We appreciate the interest in this issue of fillers. It is unclear what additional discussion this reviewer would like to see. We know that DeLong et al. have also run other experiments with the a/an manipulation with different kinds of fillers. Publication of those results will offer a better opportunity to determine the importance of fillers. In our own dataset, there are opportunities to examine effects of experiment-position, and, in fact, when we posted our BioRxiv pre-print about a year ago, Florian Jaeger already notified us of his intentions to perform such an analysis using the data we made available. Our data is available to anyone with an interest in pursuing such an exploratory analysis.

4) The fact that the nine labs inadvertently manipulated the timing of stimulus presentations of the critical articles and nouns (see footnote 3, Materials and methods section) is unlikely to be critical to their findings/conclusions. However, it would be useful to know whether the timing of early, visual form responses for written words that are delayed and non-delayed differ. Simply arguing that the change in timing is "unlikely to be noticed" and "after the N400 window" is making assumptions that can and should be tested with their ERP data.

We have rephrased the description in this footnote to make it more clear. First, as the original footnote stated, the timing was slightly off only in 4 labs, not all labs, as this reviewer writes. We have also analyzed only the data where the timing was as intended, which did not change the results, and we had mentioned this already in the footnote. Second, the articles were never themselves delayed, but the blank screen that followed them was 80 ms longer than intended. This is crucial, because it means that the first moment at which the timing is different from the intended timing is at 500 ms, which therefore cannot have impacted the article-N400s. This is what we meant by “after the N400 window”: this is a calculation, not an assumption. We have removed the references about slower reading times leading to more predictive processing to avoid further confusion.
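
The timing argument above can be written out as simple arithmetic. The sketch below is illustrative only: the 500 ms point of divergence and the 80 ms extra blank come from the response above, whereas the 200-500 ms N400 analysis window (relative to article onset) and all variable names are assumptions introduced here.

```python
ARTICLE_ONSET_MS = 0
INTENDED_NEXT_ONSET_MS = 500   # from the response: timing first diverges at 500 ms
EXTRA_BLANK_MS = 80            # from the response: blank screen was 80 ms too long
N400_WINDOW_MS = (200, 500)    # assumption: analysis window after article onset

# The article itself is shown as intended; only the following blank is longer.
actual_next_onset_ms = INTENDED_NEXT_ONSET_MS + EXTRA_BLANK_MS

# The display first differs from the intended display at the moment the next
# word should have appeared but the screen stayed blank.
first_divergence_ms = INTENDED_NEXT_ONSET_MS

window_start_ms, window_end_ms = N400_WINDOW_MS
# True when the divergence starts no earlier than the end of the window.
divergence_after_window = first_divergence_ms >= window_end_ms

print(actual_next_onset_ms, first_divergence_ms, divergence_after_window)
```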

5) The use of different baselines was confusing to multiple reviewers. A table that summarizes details of all the current analyses and how these correspond to the procedures used in the original DUK study would help the reader keep straight the different results.

There are only three different baseline procedures: none, 100 ms or 500 ms. We used different baselines because DUK did not correctly report their procedure. We have explained and motivated our pre-registered baseline procedures and motivated changes to the baseline procedure in exploratory analyses. We have also changed “Original baseline procedure” in Figure 6 to avoid confusion, because this is ambiguous (it could mean what was reported or what was done).

All the procedures are summarized in Table 1. We do not think it is necessary to include it in the main text but will include it if the editor deems it important enough to include.

Table 1. Reported or used baseline correction procedures in DeLong, Urbach and Kutas (2005) and in the current study
| | Original study | Current study |
| --- | --- | --- |
| Correlation analysis | Reported no baseline correction | Pre-registered no baseline correction (Figures 1 and 5) |
| | Later acknowledged using a 500 ms baseline correction | Re-analysis with a 500 ms baseline correction (Figures 2 and 6) |
| Single-trial analysis | | Pre-registered a 100 ms baseline correction (Figures 3 and 4) |
| | | Re-analysis with a 500 ms baseline correction (Figures 5 and 7) |
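
To make the three baseline procedures concrete (none, 100 ms, 500 ms), here is a minimal sketch of what baseline correction does to a single epoch: the mean voltage in the pre-stimulus interval is subtracted from every sample. The function name, the toy epoch and its 100 ms sampling step are hypothetical and chosen only for readability; this is not the processing pipeline used in either study.

```python
def baseline_correct(samples, times_ms, baseline_ms):
    """Subtract the mean pre-stimulus voltage from every sample of one epoch.

    baseline_ms is the length of the pre-stimulus interval (e.g. 100 or 500);
    None or 0 means no baseline correction.
    """
    if not baseline_ms:
        return list(samples)
    # Samples falling in the interval [-baseline_ms, 0) before stimulus onset.
    window = [v for v, t in zip(samples, times_ms) if -baseline_ms <= t < 0]
    if not window:
        raise ValueError("epoch has no samples in the baseline interval")
    baseline_mean = sum(window) / len(window)
    return [v - baseline_mean for v in samples]


# Toy epoch: one sample every 100 ms from -500 ms to 400 ms (arbitrary units).
times_ms = list(range(-500, 500, 100))
samples = [4.0, 2.0, 2.0, 1.0, 1.0, 0.0, -1.0, -3.0, -4.0, -2.0]

no_baseline = baseline_correct(samples, times_ms, None)
baseline_100 = baseline_correct(samples, times_ms, 100)
baseline_500 = baseline_correct(samples, times_ms, 500)

# Compare the same post-stimulus sample (300 ms) under the three procedures.
index_300 = times_ms.index(300)
print(no_baseline[index_300], baseline_100[index_300], baseline_500[index_300])
```

In this toy epoch the pre-stimulus voltage drifts, so the choice of baseline interval shifts every post-stimulus value by a different constant, which is why the reported and the actually used procedure need to be distinguished.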

6) Both the original and current studies present a high percentage of a/an continuations in the cloze test. Could this procedure bias participants towards using a/an rather than "his", "hers", "their," etc. that aren't marked for the phonological form of the subsequent noun? After encountering several sentences in which the indefinite article is appropriate, participants may routinely complete the remaining sentences with an indefinite article continuation due to this sort of syntactic priming. A comment about this would be welcome.

Inspection of the raw cloze results suggests that this is a very unlikely scenario. We suspect that the original items were designed to elicit a/an responses instead of possessive pronouns, which would make sense given the goal of the experiment, but we cannot be certain, and, to our knowledge, the raw cloze data of DUK are not publicly available. In our own cloze data, there are two items where several participants use possessive pronouns (the famous kite/airplane example, item number 8, and another item with ‘her finger’, which is number 50 in the list). So, our participants did use possessive pronouns when those pronouns came to mind as best completions, even more than halfway through the list.

7) In multiple places, the authors use phrasing about how the study argues for "a more limited role for prediction during language comprehension." This wording seems overly broad, conflating very different conceptions of prediction across different levels of representation. There is a deep theoretical division between proposed mechanisms of conceptual pre-activation (e.g., the concept of 'readiness' in the discourse literature, linked to the N400 literature beautifully in Jos Van Berkum's work) and mechanisms of predicting the form of the upcoming input (commitment to a particular linguistic formulation of the message). It seems likely that the former, long argued for in the discourse literature, might play a central role in language comprehension, while contributions from the prediction-of-form mechanism may be quite limited. It would be unfortunate if readers take the current study to cast doubt on both extremely different mechanisms, when it actually only speaks to the latter mechanism. Please consider rewording this throughout.

We have removed or reworded that phrase throughout the text.

https://doi.org/10.7554/eLife.33468.024

Article and author information

Author details

  1. Mante S Nieuwland

    1. Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    mante.nieuwland@mpi.nl
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-4001-6608
  2. Stephen Politzer-Ahles

    1. Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
    2. Faculty of Linguistics, Philology and Phonetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    stephen.politzerahles@polyu.edu.hk
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-5474-7930
  3. Evelien Heyselaar

    School of Psychology, University of Birmingham, Birmingham, United Kingdom
    Contribution
    Investigation, Writing—review and editing
    For correspondence
    evelien.heyselaar@mpi.nl
    Competing interests
    No competing interests declared
  4. Katrien Segaert

    School of Psychology, University of Birmingham, Birmingham, United Kingdom
    Contribution
    Resources, Supervision, Writing—review and editing
    For correspondence
    k.segaert@bham.ac.uk
    Competing interests
    No competing interests declared
  5. Emily Darley

    School of Experimental Psychology, University of Bristol, Bristol, United Kingdom
    Contribution
    Investigation
    For correspondence
    d12634@my.bristol.ac.uk
    Competing interests
    No competing interests declared
  6. Nina Kazanina

    School of Experimental Psychology, University of Bristol, Bristol, United Kingdom
    Contribution
    Resources, Supervision, Writing—original draft, Writing—review and editing
    For correspondence
    nina.kazanina@bristol.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-7737-4279
  7. Sarah Von Grebmer Zu Wolfsthurn

    School of Experimental Psychology, University of Bristol, Bristol, United Kingdom
    Contribution
    Investigation
    For correspondence
    sv13691@my.bristol.ac.uk
    Competing interests
    No competing interests declared
  8. Federica Bartolozzi

    School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Investigation
    For correspondence
    Federica.Bartolozzi@mpi.nl
    Competing interests
    No competing interests declared
  9. Vita Kogan

    School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Investigation
    For correspondence
    vitavkogan@gmail.com
    Competing interests
    No competing interests declared
  10. Aine Ito

    1. School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, United Kingdom
    2. Faculty of Linguistics, Philology and Phonetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Investigation, Writing—review and editing
    For correspondence
    aineito@gmail.com
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-4408-8801
  11. Diane Mézière

    School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Investigation
    For correspondence
    d.c.meziere.1@student.rug.nl
    Competing interests
    No competing interests declared
  12. Dale J Barr

    Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
    Contribution
    Software, Formal analysis, Writing—review and editing
    For correspondence
    Dale.Barr@glasgow.ac.uk
    Competing interests
    No competing interests declared
  13. Guillaume A Rousselet

    Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
    Contribution
    Resources, Formal analysis, Supervision, Methodology, Writing—review and editing
    For correspondence
    guillaume.rousselet@glasgow.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-0006-8729
  14. Heather J Ferguson

    School of Psychology, University of Kent, Canterbury, United Kingdom
    Contribution
    Resources, Supervision, Funding acquisition, Writing—review and editing
    For correspondence
    H.Ferguson@kent.ac.uk
    Competing interests
    No competing interests declared
  15. Simon Busch-Moreno

    Division of Psychology and Language Sciences, University College London, London, United Kingdom
    Contribution
    Software, Formal analysis, Investigation
    For correspondence
    simon.busch.15@ucl.ac.uk
    Competing interests
    No competing interests declared
  16. Xiao Fu

    Division of Psychology and Language Sciences, University College London, London, United Kingdom
    Contribution
    Investigation
    For correspondence
    xiao.fu.15@ucl.ac.uk
    Competing interests
    No competing interests declared
  17. Jyrki Tuomainen

    Division of Psychology and Language Sciences, University College London, London, United Kingdom
    Contribution
    Resources, Supervision
    For correspondence
    j.tuomainen@ucl.ac.uk
    Competing interests
    No competing interests declared
  18. Eugenia Kulakova

    Institute of Cognitive Neuroscience, University College London, London, United Kingdom
    Contribution
    Software, Formal analysis, Investigation
    For correspondence
    e.kulakova@ucl.ac.uk
    Competing interests
    No competing interests declared
  19. E Matthew Husband

    Faculty of Linguistics, Philology and Phonetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Resources, Supervision, Writing—review and editing
    For correspondence
    matthew.husband@ling-phil.ox.ac.uk
    Competing interests
    No competing interests declared
  20. David I Donaldson

    Psychology, Faculty of Natural Sciences, University of Stirling, Stirling, United Kingdom
    Contribution
    Resources, Supervision, Writing—review and editing
    For correspondence
    d.i.donaldson@stir.ac.uk
    Competing interests
    No competing interests declared
  21. Zdenko Kohút

    Department of Psychology, University of York, York, United Kingdom
    Contribution
    Investigation
    For correspondence
    zk578@york.ac.uk
    Competing interests
    No competing interests declared
  22. Shirley-Ann Rueschemeyer

    Department of Psychology, University of York, York, United Kingdom
    Contribution
    Supervision, Writing—review and editing
    For correspondence
    shirley-ann.rueschemeyer@york.ac.uk
    Competing interests
    No competing interests declared
  23. Falk Huettig

    Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    Contribution
    Conceptualization, Funding acquisition, Writing—review and editing
    For correspondence
    falk.huettig@mpi.nl
    Competing interests
    No competing interests declared

Funding

European Research Council (ERC Starting grant 636458)

  • Heather J Ferguson

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was partly funded by ERC Starting grant 636458 to HJF. We thank Matt Davis for his comments on a previous draft of this work. We thank Alexander Ly and Eric-Jan Wagenmakers for their support in computing the replication Bayes factors.

Ethics

Human subjects: All participants were informed about the procedure of the experiment and then gave informed consent to use the data for research and dissemination/publication purposes. Ethical approval for EEG experimentation was obtained at each involved institution, according to custom guidelines of the ethics committee at each institution.

Reviewing Editor

  1. Barbara G Shinn-Cunningham, Boston University, United States

Version history

  1. Received: November 10, 2017
  2. Accepted: March 19, 2018
  3. Version of Record published: April 3, 2018 (version 1)
  4. Version of Record updated: April 19, 2018 (version 2)

Copyright

© 2018, Nieuwland et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

Mante S Nieuwland, Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, Federica Bartolozzi, Vita Kogan, Aine Ito, Diane Mézière, Dale J Barr, Guillaume A Rousselet, Heather J Ferguson, Simon Busch-Moreno, Xiao Fu, Jyrki Tuomainen, Eugenia Kulakova, E Matthew Husband, David I Donaldson, Zdenko Kohút, Shirley-Ann Rueschemeyer, Falk Huettig (2018)
Large-scale replication study reveals a limit on probabilistic prediction in language comprehension
eLife 7:e33468.
https://doi.org/10.7554/eLife.33468
