1. Neuroscience
Download icon

Brain signatures of a multiscale process of sequence learning in humans

  1. Maxime Maheu  Is a corresponding author
  2. Stanislas Dehaene
  3. Florent Meyniel  Is a corresponding author
  1. CEA DRF/JOLIOT, INSERM, Université Paris-Sud, Université Paris-Saclay, NeuroSpin center, France
  2. Université Paris Descartes, Sorbonne Paris Cité, France
  3. Collège de France, France
Research Article
  • Cited 2
  • Views 1,519
  • Annotations
Cite this article as: eLife 2019;8:e41541 doi: 10.7554/eLife.41541

Abstract

Extracting the temporal structure of sequences of events is crucial for perception, decision-making, and language processing. Here, we investigate the mechanisms by which the brain acquires knowledge of sequences and the possibility that successive brain responses reflect the progressive extraction of sequence statistics at different timescales. We measured brain activity using magnetoencephalography in humans exposed to auditory sequences with various statistical regularities, and we modeled this activity as theoretical surprise levels using several learning models. Successive brain waves related to different types of statistical inferences. Early post-stimulus brain waves denoted a sensitivity to a simple statistic, the frequency of items estimated over a long timescale (habituation). Mid-latency and late brain waves conformed qualitatively and quantitatively to the computational properties of a more complex inference: the learning of recent transition probabilities. Our findings thus support the existence of multiple computational systems for sequence processing involving statistical inferences at multiple scales.

https://doi.org/10.7554/eLife.41541.001

Introduction

From seasons’ cycle to speech, events in the environment are rarely independent from one another; instead they are often structured in time. For instance, dark clouds often precede rain. Detecting those hidden temporal structures is useful for many cognitive functions (Lashley, 1951). For instance, by building correct expectations of the future, one can then process sensory information more quickly and/or more efficiently (Summerfield and de Lange, 2014; Melloni et al., 2011; Ekman et al., 2017) or adapt behavior by selecting the best course of actions (Skinner, 1953). In humans, the extraction of structures over time is also crucial in language acquisition (Saffran et al., 1996; Endress and Mehler, 2009).

Sequences of observations can be described at different levels of abstraction (Dehaene et al., 2015). The most basic level of coding used to represent sequential inputs is the encoding of statistical regularities, such as item frequency and transition probabilities. Many experiments demonstrate that the brain possesses powerful statistical learning mechanisms that extract such regularities from sequential inputs (Armstrong et al., 2017; Santolin and Saffran, 2018). Importantly, the learnt statistics can serve as building blocks for more elaborate codes. For instance, in infants, learning the transition probabilities between syllables seems to be a building block on top of which words and syntactic tree structures are built (Saffran et al., 1996; Chomsky and Ronat, 1998). However, the exact manner through which the brain extracts such statistics remains unknown. Here, we ask two basic questions: What statistics are estimated by the brain? Over which timescale are such computations performed, that is which observations, from recent to remote ones, are incorporated in the learning process? As we shall see, there is no single answer to those questions because different brain processes estimate different statistics computed over different timescales. However, we show how learning models, fitted to brain responses, can help tease apart those processes.

Several statistics can indeed be used to describe a given sequence. To illustrate this point, consider the following sequence: AAAAABBBBBBAA. A first level of description is the frequency of each item (A and B). In this example: p(A) = 1 – p(B) = 7/13. The brain is sensitive to the frequency of items in both visual (Grill-Spector et al., 2006) and auditory sequences (Näätänen et al., 2007; Garrido et al., 2009). But note that item frequency (IF) ignores the order in which observations are received. By contrast, alternation frequency (AF) inspects whether successive observations are identical or different, a statistic to which the brain is also sensitive in visual (Summerfield et al., 2008; Teinonen et al., 2009; Bulf et al., 2011; Summerfield et al., 2011) and auditory sequences (Todorovic et al., 2011; Todorovic and de Lange, 2012). In our example, p(alternation) = 1 – p(repetition) = 2/12. Note that alternation frequency ignores the nature of repetitions and alternations, as it treats similarly the repetitions A → A and B → B and similarly the alternations A → B and B → A. Such dependencies can only be captured by transition probabilities (TP), which specify what stimulus is likely to be observed (an A or a B) given the context. In our example, p(A|B) = 1 – p(B|B) = 1/6 and p(B|A) = 1 – p(A|A) = 1/6. Importantly, TP entail information regarding the frequency of items (IF), their co-occurence (AF), and their serial order (e.g. p(A|B) ≠ p(B|A)). This makes TP the simplest statistical information that genuinely reflects a sequential regularity. There is evidence that TP are monitored by human adults (Domenech and Dreher, 2010; Meyniel et al., 2015; Mittag et al., 2016; Meyniel and Dehaene, 2017; Higashi et al., 2017) and infants (Marcovitch and Lewkowicz, 2009; Lipkind et al., 2013), macaque monkeys (Meyer and Olson, 2011; Ramachandran et al., 2016), rodents (Yaron et al., 2012) and even songbirds (Takahasi et al., 2010; Markowitz et al., 2013; Lipkind et al., 2013; Chen and Ten Cate, 2015). Those three different statistics (IF, AF, TP) are, however, often studied separately. This may cause confusion, because some (IF and AF) are embedded in others (TP). In addition, the correct identification of those statistics depends on the timescale over which they are computed, which may vary between studies. For instance, if the following elements are presented sequentially: AAAAABBBBBBAA…, the underlined B will appear surprising with respect to both IF (a B after many As) and AF (an alternation after many repetitions), whereas the underlined A will not appear surprising to AF (it follows the tendency to repeat), and comparatively more surprising to IF since at the scale of a few observations this A is quite rare relative to the many preceding Bs.

Our second question is therefore about timescales: which history of observations is used to inform the estimated statistics? Previous studies showed that the brain is particularly sensitive to the recent history of observations both in visual (Huettel et al., 2002) and auditory sequences (Squires et al., 1976; Kolossa et al., 2012). Moreover, brain responses elicited by each observation indicate the existence of different statistical expectations arising in parallel from distinct observations’ histories (Ulanovsky et al., 2004; Kiebel et al., 2008; Bernacchia et al., 2011; Ossmy et al., 2013; Meder et al., 2017; Scott et al., 2017; Runyan et al., 2017). There exists a continuum of timescales, but for clarity and following previous works (Squires et al., 1976; Kolossa et al., 2012; Meyniel et al., 2016), we will refer to statistics being estimated using the most recent observations only, as local statistics, and we term global statistics the ones whose estimation takes into account all observations.

Taken altogether, the prior results quoted so far suggest that the brain is sensitive to different statistics (IF, AF, TP) and different timescales of integration (from local to global). Yet those conclusions remain unclear because those effects were investigated using different techniques (e.g. reaction times, EEG, fMRI), in different modalities (using sounds, faces, geometrical shapes, …) and with only a subset of all possible conditions (e.g. by manipulating IF but not AF). We recently proposed that those seemingly disparate findings might in fact all result from an estimation of TP on a local timescale (Meyniel et al., 2016). In this view, (i) sensitivity to global, block-level, statistics results from an average of local expectations, and (ii) sensitivity to simpler statistics (IF and AF) is explained by the fact that they are embedded in the space of TP. Here, we will explore whether different brain signatures reflect computations of different statistics and different timescales, and the extent to which they correspond to our proposed local TP model.

Results

Experimental design

In order to address those questions, we designed a new MEG experiment in which binary auditory sequences comprising two sounds, denoted A and B, were presented to human subjects (see Figure 1A). Different statistics were used to generate these stochastic sequences (see Figure 1B). In the frequency-biased condition, one item (B) was overall more frequent than the other. In the alternation-biased condition, alternation (i.e. pairs AB and BA) was overall more frequent than repetition (i.e. AA and BB). In the repetition-biased condition, the reverse was true. Finally, in the fully stochastic condition, all the events were equally likely to happen (i.e. as many As than Bs and as many AA, than AB, BA, BB).

Figure 1 with 2 supplements see all
Experimental design: transition probabilities induce orthogonal variations of item frequency and alternation frequency.

(A) Human subjects were presented with binary sequences of auditory stimuli. Each sequence was composed of a unique set of syllables (e.g. A = /ka/ and B = /pi/) presented at a relatively slow rhythm (every 1.4 s) and occasionally interrupted by questions asking subjects to predict the next stimulus. (B) Stimuli were drawn from fixed generative transition probabilities p(A|B) and p(B|A) that varied across four blocks. Those transition probabilities, in turn, determined item frequency p(A) and alternation frequency p(alt.). (C) Example sequences derived from generative transition probabilities used in each condition. The resulting sequences were either fully stochastic, biased toward one of the stimuli (here Bs), or biased toward repetitions or alternations. (D) Learning models were applied to these sequences: a model learning transition probabilities (TP), one learning the frequency of items (IF) and one learning the frequency of alternations (AF). Note that only the TP model can discriminate all four conditions; the IF model is blind to biases toward repetitions or alternations, while the AF model is blind to biases in the balance between stimuli.

https://doi.org/10.7554/eLife.41541.002

A quantitative account of learning

It is important to distinguish the statistics that are used to generate the sequences and the statistics that are inferred by the brain. In order to characterize the inference process, we designed three families of learning models, each corresponding to the inference of a different statistics: IF, AF and TP. Each of the sequences presented to subjects corresponds to a different bias in generative TP, and it can be described from the view point of each model (IF, AF, TP). Figure 1D shows an example for each of the 12 possibilities (three models times four sequences generated with a different bias).

Within each family of model (one per statistics that is estimated), we further distinguished between models on the basis of their timescale of integration. An extreme case is global integration: all observations are used equally in the inference process (see Figure 1D). A more realistic possibility is local integration by means of leaky accumulation (Gerstner and Kistler, 2002): previous observations are progressively discounted, giving a larger weight to the most recent observations in the inference process.

All models estimate the current statistics based on a (local or global) count of observations by means of standard Bayesian inference (see Materials and methods). Using Bayesian inference (again), the estimated statistics are turned into a prediction about the next stimulus. In that framework, surprise results from the discrepancy between the prediction and the actual observation that is received. More formally, surprise is quantified as the negative logarithm of the (estimated) probability of the actual observation (Shannon, 1948): observations that are deemed less probable produce larger surprise.

Many stimulus-evoked MEG responses are typically thought of as surprise signals, or equivalently, prediction error signals. According to the predictive coding hypothesis (Rao and Ballard, 1999; Friston, 2005; Spratling, 2017), prediction error is an important quantity used to revise our beliefs (Friston and Kiebel, 2009) or simply to perceive the world (Rao and Ballard, 1999; Lee and Mumford, 2003). Such surprise signals dominate evoked responses measured with MEG and EEG, in particular in the auditory domain (Heilbron and Chait, 2018), and are thus fairly easy to quantify. We therefore adopted a reverse-engineering approach, comparing MEG signals to theoretical surprise levels computed from different learning models. Based on this comparison, we aimed to determine the most likely inference performed by the brain among those that we have hypothesized. We compared different learning models that differ both in the statistics (IF, AF and TP) and the timescale of integration (parameterized by ω, we tested 55 values in total). We adopted a design that allows the identification of both the statistics that are inferred by the brain, and the timescale of integration (see Figure 1—figure supplement 1).

Description of auditory-evoked responses

We first verified that our paradigm elicited typical auditory responses. The grand-averaged MEG responses evoked by the sounds revealed four distinct canonical components (see Figure 1—figure supplement 2): three early responses (M50, M150 and M250 components) characterized by topographies plausibly arising from bilateral auditory cortex (Chait et al., 2007; Strauss et al., 2015) and a late response associated with the classical P300/slow-wave topography (Wacongne et al., 2011).

In order to investigate whether those brain responses were modulated by the statistical regularities of the auditory sequences, we explored two timescales at which those regularities may emerge in our experiment: global and local. Our general approach proceeds as follows: (1) searching effects of global statistical regularities, (2) investigating the interplay between local and global statistical regularities in order to follow up and expand previous historical studies (in particular: Squires et al., 1976) that focused on late evoked responses and (3) performing a systematic analysis, independent of any prior assumption regarding the timing (what post-stimulus latency), spatial location (which sensors) and properties (which effect, local or global, and of which statistics) of the effects.

Sensitivity to global, block-level, statistics

We first examined how these auditory-evoked responses were modulated by global, block-level statistics within each of the four conditions. By design, three out of these four experimental conditions induce a bias in the global statistics, such that some events are overall more likely to occur than others. To investigate the brain sensitivity to such global biases, we contrasted MEG responses evoked by rare events with frequent ones. We report results corrected for multiple comparison across time points and sensors (the spatio-temporal clusters that survive correction at < 0.05 at the test level and < 0.05 at the cluster level, see Materials and methods and Figure 2—Source data 1).

In the frequency-biased condition, rare A items were overall less frequent than frequent B items and, as expected, rare items elicited larger responses than frequent ones (significant between 227 and 910 ms post-sound onset, see Figure 2A and Figure 2—figure supplement 1A). In the repetition-biased condition, repetitions, that is XX = AA or BB, were overall more frequent than alternations, that is XY = AB or BA; brain signals also reflected this bias, showing increased activity for rare alternations as compared to frequent repetitions (significant between 391 and 633 ms post-sound onset, see Figure 2B and Figure 2—figure supplement 1B). In the alternation-biased condition, in which alternations are overall more frequent than repetitions, we found a reversal of the effect compared to the repetition-biased condition. Brain signals were larger for rare repetitions than frequent alternations (significant between 176 and 371 ms post-sound onset, see Figure 2B and Figure 2—figure supplement 1B). Finally, in the fully stochastic condition, there was no global bias regarding item or alternation frequencies: there were as many As as Bs and as many repetitions as alternations. Accordingly, no significant difference was found, neither for the A – B contrast (see first row in Figure 2A) nor for the repetition – alternation one (see first row in Figure 2B). Between-conditions comparisons show that differences between items or between alternation and repetition are indeed stronger when there exists a global statistical bias than in the fully stochastic condition (see Figure 2—figure supplement 2).

Figure 2 with 2 supplements see all
Brain responses are modulated by global statistics.

(A) Difference in brain activity evoked by the two different items in the fully stochastic (in which both stimuli are equiprobable) and frequency-biased (in which one of the two stimuli is more frequent than the other) conditions. (B) Difference in brain activity evoked by alternations and repetitions in the fully stochastic (in which repetitions and alternations are equiprobable), repetition-biased (in which repetitions are more frequent than alternations) and alternation-biased (in which alternations are more frequent than repetitions) conditions. X/Y denotes a pooling of both stimuli together, XX (or YY) thus denotes repetitions of A or B (i.e. AA and BB pooled) while XY (or YX) denotes alternations between A and B (i.e. AB and BA pooled). We report the difference in the MEG responses elicited by a given item (X = A or B) when it is preceded by the same (XX = AA or BB) versus a different (YX = BA or AB) item. Sensors marked with a black dot showed a significant difference (p < 0.05 at the test level and p < 0.05 at the cluster level).

https://doi.org/10.7554/eLife.41541.005

Sensitivity to local, history-dependent, statistics

Beyond and above those block effects, due to their stochastic nature, the sequences also contain biases in the local statistics (e.g. a sudden surge of A sounds). Previous findings have shown that some brain signals, notably late ones, reflect a sensitivity to such local regularities on top, or even in the absence of global regularities (Squires et al., 1976; Kolossa et al., 2012). In a seminal publication, Squires et al. (1976) introduced an analysis that allows a clear visualization of those effects. They constructed a spatial filter to project sensor data in a specific (late) time window onto one dimension that maximized the difference between globally rare and frequent items. They then used this same filter to plot the brain response to all possible local patterns of stimuli, thereby revealing prominent effects of local statistics, such as differences between locally rare/frequent items, and between locally rare/frequent alternations and repetitions. Here, we extended their analysis: we also inspected local effects in the repetition- and alternation-biased conditions, while previous studies (Squires et al., 1976; Kolossa et al., 2012) only had frequency-biased and fully stochastic conditions. We used the group-averaged topographical difference between continuing and violating streaks as spatial filters in a late time window from 500 to 730 ms (see Figure 3—figure supplement 1 for the exact topographies used) but unlike previous studies, we defined and applied these filters in a cross-validated manner to avoid circular analysis (Kriegeskorte et al., 2009). Following Squires et al. (1976), we then plotted the spatially filtered response to all possible local patterns up to length 3, in the same late time window (from 500 to 730 ms) for all conditions, in the form of a tree whose structure reflects the effect of local statistics on the amplitude of MEG signals (see Figure 3).

Figure 3 with 1 supplement see all
Brain responses are modulated by local statistics.

The brain response to a given observation is plotted as a function of the recent history of events. We connected the possible extensions from shorter to longer patterns, such that the data are represented as a ‘tree’. Each of these trees corresponds to an experimental condition. At each node of a tree, the circles should be read from left to right and denote the corresponding patterns; for instance, the pattern AAAB shows the activity level elicited by the item B when it was preceded by three As. X and Y denote the pooling of both stimuli in conditions in which both stimuli are equiprobable. In the frequency-biased condition, we report the activity evoked by item B, the most frequent stimulus. Activity levels across sensors were averaged using a topographical filter within a late time window (from 500 to 730 ms) post-stimulus onset. The topographical filters were obtained by contrasting rare and frequent patterns (e.g. XYXX – XYXY in the alternation-biased condition). The filters are shown in Figure 3—figure supplement 1; the small, circled and colored numbers at the bottom of each tree serve as identifiers. We defined and applied the filters using a cross-validation approach to ensure statistical independence (see Materials and methods).

https://doi.org/10.7554/eLife.41541.009

Note that this late time window (from 500 to 730 ms) was selected in order to follow up and expand the seminal study of Squires et al. (1976) which focused on late (EEG) brain responses. Here, the exact time window was not optimized to show the clear tree structure seen in Figure 3, or the strong similarity between observed data and the model reported in Figure 4, but selected a priori (see Figure 3—figure supplement 1).

Figure 4 with 1 supplement see all
The local transition probability model accounts for modulations of late MEG signals.

Observed MEG signals and theoretical surprise levels predicted by the transition probability model (with a local integration, ω = 6) in response to (A) globally rare events and to (B) violation or continuation of local patterns. MEG signals correspond to activity levels across sensors were averaged using a topographical filter (see Figure 3—figure supplement 1) within a late time window (from 500 to 730 ms) post-stimulus onset. The patterns reported here correspond to the diagnostic ones that are highlighted in Figure 3. See Figure 4—figure supplement 1 for theoretical predictions by other models.

https://doi.org/10.7554/eLife.41541.011

This graphical representation summarizes several interesting aspects of the data, of which we provide a statistical analysis below together with the predictions of our model. First, alternating pairs (i.e. XY) elicited larger signals than repeating ones (i.e. XX) in all conditions except the alternation-biased condition, in which this order was reversed. Second, the violation of a repeating streak (i.e. XXXY) always produced large activity, no matter of the experimental condition. This is particularly interesting in the alternation-biased condition, since such violations are actually expected in this case. Last, the continuation of an alternating pattern (i.e. XYXY) produced intermediate activity in both the fully stochastic and frequency-biased conditions, but low activity in the alternation-biased condition and high activity in the repetition-biased condition. Altogether these results confirm and extend what has been observed in the past: the effect of local statistic onto brain responses are modulated by the global frequency of items in the block, but also by the global frequency of alternations.

Comparison of MEG signals with proposed learning models

We now turn to a quantitative analysis of late brain responses (from 500 to 730 ms, as in Figure 3) with cross-validated topographical filter (see Figure 3—figure supplement 1) and we compare the most diagnostic effects with the predictions of different learning models. We show that a parsimonious model, the estimation of local TP (p(A|B) = 1 − p(B|B) and p(B|A) = 1 − p(A|A), TP model) predicts the effects of local and global statistics observed in brain responses (see Figure 4) and that those effects are incompatible with two other models (see Figure 4—figure supplement 1), learning either the local frequency of items (i.e. p(A) = 1 − p(B), IF model) or the local frequency of alternations between these items (i.e. p(alt.) = 1 − p(rep.), AF model).

We begin with the effects of global, block-level statistics. This analysis supplements findings reported in Figure 2, but now with cross-validated topographical filters in order to facilitate comparison with the models. Figure 4A shows the late MEG responses evoked by rare and frequent events in the three biased conditions. First, in the frequency-biased condition, rare items evoke significantly stronger MEG signals than frequent ones (t17 = 2.57, = 0.020), similarly to the surprise of the TP model. Note in Figure 4—figure supplement 1A that the alternative IF model also captures this difference and that the AF model shows by comparison a weaker effect. Indeed, in the frequency-biased condition, there exists a weak bias in favor of repetitions, p(rep.) = 5/9, which is smaller than the frequency bias, p(B) = 2/3.

Second, in the repetition-biased condition, rare alternations evoked stronger signals than frequent repetitions (t17 = 2.97, p = 8.63 ⋅ 10–3). Conversely, in the alternation-biased condition, rare repetitions elicit stronger signals than frequent alternations; however, this difference is expectedly smaller than in the repetition-biased condition and it was indeed not significant (t17 = 1.33, p = 0.20). This reversal of repetition/alternation effect in alternation-/repetition-biased conditions is significant when tested as an interaction in an ANOVA (F1,17 = 17.26, = 6.64 ⋅ 10−4). Such an interaction is fully predicted by the TP model (see Figure 4A). Note that the AF model, but not the IF model, shows such interaction (see Figure 4—figure supplement 1A). To summarize, the TP model is the only model that predicts the observed effects of global statistics onto MEG responses across all conditions.

We now turn to the effects of local statistics in short series of stimuli. Figure 4B shows several effects of the violation or continuation of short patterns of stimuli on MEG responses in the different conditions. First, we compared MEG responses elicited by the continuation and violation of a repeating streak. The violation of repeating streaks evoked significantly larger signals than their continuation in the fully-stochastic (t17 = 3.12, p = 6.21 ⋅ 10–3) as well as in the frequency-biased condition (t17 = 2.34, = 0.032). As expected, they also induced significantly stronger brain signals in the repetition-biased condition (t17 = 3.49, p = 2.79 ⋅ 10–3). There was a similar trend in the alternation-biased condition, but non-significant (t17 = 0.63, = 0.54) probably because of the conflict between global and local statistics. The ANOVA did not reveal a significant interaction (F1,17 = 1.60, p = 0.22) between repetition-/alternation-biased conditions and the continuation/violation of a repeating streak. These local effects are accounted for by all models (see Figure 4B  and Figure 4—figure supplement 1B).

Second, we compared MEG responses elicited by the continuation and violation of an alternating streak. The violation and continuation evoked similar brain signals in both the fully-stochastic (t17 = 0.27, = 0.79) and the frequency-biased (t17 = 0.47, = 0.65) conditions. The difference was not significant in the repetition-biased (t17 = –1.39, = 0.18) and alternation-biased (t17 = 1.62, = 0.13) conditions, but interestingly, it appeared with opposite signs. The ANOVA indeed revealed a trend regarding the interaction between repetition-/alternation-biased conditions and the continuation/violation of an alternating streak (F1,17 = 3.69, = 0.07). This local effect of alternations is predicted by the TP and AF models, but not by the IF model (see Figure 4B and Figure 4—figure supplement 1B).

Third, we compared MEG responses elicited by the same observations received in different orders. Since a local integration assigns gradually smaller weights to more remote observations, it indeed predicts an effect of the order in which observations are received. We compared four local patterns all composed of 3 Xs and 1 Y but differing in the position of the Y (i.e. first, second, third or last position). The profile of difference in MEG activity evoked by these different patterns was qualitatively similar to what was predicted by the TP model, but not the IF nor the AF ones (see Figure 4B and Figure 4—figure supplement 1B).

Computational characterization of MEG signals

The previous analyses, like previous studies, focused on the effect of very short patterns (here, up to 3 or 4 observations) and specific brain responses (here, a late component). In principle, however, we can analyze the effect of any timescale of integration onto brain responses of any given latency. To extend the results obtained so far, we therefore run a more systematic analysis aiming at characterizing quantitatively, from an information-theoretic viewpoint, the trial-by-trial fluctuations in MEG responses recorded at any latency post-stimulus onset (up to 1 s) and any sensor.

To achieve this goal, we performed trial-by-trial regressions for each subject between MEG responses (separately for each sensor and each time point between –250 ms to 1 s post-sound) recorded in all conditions and the theoretical surprise levels of various learning models described above. An example regression is shown in Figure 5A. Note that a similar model-based approach to brain electrophysiological signals has already been adopted in the past (Strange et al., 2005; Lieder et al., 2013a; Lieder et al., 2013b; Mars et al., 2008).

Figure 5 with 3 supplements see all
Theoretical surprise levels from learning models fitted to MEG signals reveal a multiscale inference process.

Mass-univariate (across sensors and time points) trial-by-trial regressions of MEG signals against theoretical surprise levels from learning models learning different statistics with different timescale of integration (ω). (A) Example trial-by-trial regression (the colors represent the different conditions), for one subject, at one particular time point, in one particular sensor in the case of transition probability learning with ω = 16. (B) The maximum R2 over models and parameters was averaged across sensors, and then across subjects. The resulting time course reveals three distinct time windows in which regressions yield a high proportion of explained variance. For reference, the thin grey line represents the global field power evoked by all the sounds (see Figure 1—figure supplement 2A). (C) Average topography of maximum R2 values in each time window. (D) Bayesian model comparison reveals that different models best explain MEG signals in these three time windows: item frequency (IF) in an early time window, and transition probabilities (TP) in later time windows. (E) Bayesian model averaging further reveals that different timescales of integration are involved: slow, global integration for the early time window, and increasingly local integration for the later time windows. These posterior distributions give the probability of ω given the MEG data. The error shading shows the inter-subject s.e.m.

https://doi.org/10.7554/eLife.41541.013

The maximum proportion of variance over models and parameter was averaged across sensors (see Figure 5B and Figure 5—video 1) and it reveals three distinct time windows of interest: an early one (60 to 130 ms, peak at 70 ms), an intermediate one (160 to 320 ms, peak at 250 ms) and a late one (460 to 625 ms, peak at 500 ms). The R2 topography in each time window showed peaks over fronto-temporal sensors (see Figure 5C) as expected for auditory stimuli. This topography was highly bilateral for the first window, slightly left lateralized for the second one and strongly left lateralized for the last one. Note that these topographies are similar to the ones obtained in previous model-free analyses, indicating that the model-based analysis identified similar effects but without any a priori about their spatial or temporal properties.

We then performed a Bayesian model selection (BMS, see Materials and methods; Stephan et al., 2009) to identify the learning model that best corresponds to each of the identified time windows. In the first time window, the model learning IF was deemed the best with high posterior probability (p(MIF|y) = 0.978) and high confidence (φ = 1); thus, this early response was best characterized as an adaptation to individual frequent stimuli. By contrast, in the second (p(MTP|y) = 0.781, φ = 0.994) and the third (p(MTP|y) = 0.993, φ = 1) time windows, the observer learning TP best explained the MEG signals (see Figure 5D). We also performed a hierarchical Bayesian model selection (Rigoux et al., 2014) over the four different conditions that found high confidence (φ = 1) that the same models are likely to underlie all conditions, indicating that the results are not driven by one condition only.

Finally, in order to draw inference about the timescale of integration of each of these three signatures, we performed a Bayesian model averaging (BMA, see Materials and methods). This analyses optimally combines the different possible models, weighted by their uncertainty in order to obtain subject-specific posterior distributions over parameter values. The result shows that the later the MEG time windows, the shorter the timescale of integration (linear fit using the mean of subject-specific posterior distributions of ω: t17 = –3.00, = 0.008). At the group level (see Figure 5E), the first window was best characterized by a global integration (ω = ∞), the second and third by much more local integration (ω = 13 and 6, respectively, maximum a posteriori values; that is a half-life of 9.01 and 4.16 observations, respectively).

For completeness, Figure 5—figure supplement 1 shows the posterior probability of each model and timescale of integration in continuous time rather than in specific time windows.

Discussion

In this study, we found a series of brain signals that covary with the online inference of the statistical properties of auditory sequences of stimuli. Late brain signals qualitatively and quantitatively conformed to the inference of transition probabilities, while earlier brain waves denoted a sensitivity to simpler statistics, such as the item frequency. Furthermore, those different signals were characterized by different timescales of integration: late signals were best explained by a very local timescale that progressively discards past observations, while early signals denoted a much longer timescale of integration.

Below, we discuss successively both aspects: the types of statistics that might be inferred by the brain when it is presented with a sequence of observations and the timescales of integration involved in those inferences. But first, we highlight an important distinction between sequence learning experiments. Some studies use sequences structured into chunks of stimuli: the sequences are interrupted by short pauses that define chunks of stimuli, thereby cueing subjects about the type of regularities to look for and their timescale. Those chunks are often pairs (Todorovic et al., 2011; Todorovic et al., 2011; Meyer and Olson, 2011; Meyer et al., 2014) or longer patterns, for example of 5 stimuli (Bekinschtein et al., 2009; Wacongne et al., 2011; Strauss et al., 2015). In contrast, we used continuous sequences, devoid of chunks of stimuli, very much like in the oddball literature (Squires et al., 1976; Näätänen et al., 2007). The presence of chunks is crucial for the processing of sequences in the brain (Dehaene et al., 2015), as exemplified by the strikingly different results obtained by Sussman et al. (1998), Sussman and Gumenyuk (2005) and by Bekinschtein et al. (2009) using similar sequences that were, respectively, either continuous or structured into chunks with pauses. We will therefore discuss experiments with chunks and continuous sequences separately.

Sensitivity to item frequency

Many previous studies have manipulated the frequency of items within continuous sequences presented to subjects and reported larger brain responses for rare items compared to frequent ones. In oddball paradigms for instance, globally rare (deviants) items systematically evoke stronger brain signals than frequent (standard) ones (Näätänen et al., 2007; Garrido et al., 2009; Garrido et al., 2013; Hsu et al., 2015; Lecaignard et al., 2015). On top of the effect of the global frequency of items, the seminal work of Squires et al. (1976) as well as more recent findings (Ulanovsky et al., 2004; Kolossa et al., 2012) showed that local regularities also modulate brain signals. In particular Squires et al. (1976) showed that the late P300 response evoked by a particular item was modulated by the preceding history of stimuli: the violations of repeating (XXXY) or alternating (XYXX) streaks evoked stronger brain signals than their continuation (XXXX and XYXY). The modulatory effects of local continuation/violation of local patterns were observed even in a fully random sequence, in which there is no global bias in the frequency of items.

Here, we replicated and extended these findings. First, the MEG responses that we recorded were strongly modulated by the global frequency of items. Second, despite the limited number of observations in some cases (e.g. p(AAAB) = (1/3)3 ⋅ (2/3) = 2.5% of all patterns of length four in the frequency-biased condition), we found effects of local patterns both in the fully stochastic (where p(A) = p(B) = 1/2) and frequency-biased (where p(A) = 1 – p(B) = 1/3) conditions. Furthermore, we unify the effects of both local and global statistics into the single framework of Bayesian inference (Meyniel et al., 2016). In that framework the brain is seen as an inference device that seeks to predict future observations, and recorded brain signals are related to theoretical surprise levels elicited by the comparison between predictions and actual observations (Friston, 2005). Here, we provide new data in support of our hypothesis that several brain signals, in particular those characterized by a late latency, can be best accounted for as surprise signals corresponding to the learning of transition probabilities.

Sensitivity to alternation frequency

Our hypothesis about an inference of transition probabilities makes an additional important prediction, rarely tested in several previous studies of mismatch responses in continuous sequences. Indeed, not only item frequency, but also the frequency of alternations is embedded in the space of transition probabilities. It follows that rare alternations should be more surprising than frequent repetitions and, conversely, that rare repetitions should be more surprising than frequent alternations. Recent studies manipulated the frequency of alternations while keeping the item frequency fixed and reported that rare alternations indeed induce stronger brain signals than frequent repetitions (using continuous sequences: Summerfield et al., 2011; de Gardelle et al., 2013; using pairs: Summerfield et al., 2008; Grotheer and Kovács, 2014, Grotheer and Kovács, 2015; Tang et al., 2018; Feuerriegel et al., 2018). However, in continuous sequence paradigms, the reverse is rarely observed: rare repetitions often fail to induce stronger brain responses than frequent alternations (but see in pair learning paradigms: Todorovic and de Lange, 2012; and in pattern learning paradigms: Strauss et al., 2015). In most cases, there is only a reduced difference (slightly more signal for frequent alternations than for rare repetitions) rather than a genuine reversal (more signal for rare repetitions than for frequent alternations).

On the contrary, one important finding of our study, in line with predictive coding (Spratling, 2017; Heilbron and Chait, 2018) and the learning of transition probabilities, is that we observe a genuine reversal of brain activity evoked by repetitions and alternations depending on which one is more frequent (by comparing the repetition-biased and alternation-biased conditions). We also extended those results by demonstrating effects of local statistics on brain signals even when there is a bias in the frequency of alternations. We report several findings that are predicted by the transition probability model. For instance, compared to other patterns, the repeating streak XXXX elicits more signal in the alternation-biased condition than in the repetition-biased condition. Conversely, the alternating pattern XYXY elicits more signal in the repetition-biased condition than in the alternation-biased condition. On the contrary, the violated repeating pattern XXXY always elicits stronger signal than other patterns, no matter the condition. Those subtle effects in the ordering of patterns result from the sometimes conflicting effects of local and global regularities, and critically, they are all predicted by a model according to which the brain attempts to infer transition probabilities.

A general account in terms of transition probability

Previous studies that manipulated either the frequency of items or the frequency of alternations provide substantial evidence that the brain is sensitive to both types of statistics. However, they left unclear whether their results could be attributed, and possibly better explained, by a sensitivity to a higher-order statistics like the transition probabilities that subsume item and alternation frequencies. Studies that manipulate those two statistics are actually rare. In some of them, item and alternation frequencies are not manipulated independently of one another (Wang et al., 2017; Higashi et al., 2017) and, in some others, systematic analyses are not reported (Tueting et al., 1970); thereby precluding strong conclusions about the inference of transition probabilities. Here, by manipulating distinct dimensions of the space of transition probabilities (i.e. item frequency and alternation frequency) within the same experiment, we were able to demonstrate that mid-latency (~250 ms) and late (~500 ms) brain responses are sensitive to both item and alternation frequencies in a way that is best predicted by the inference of transition probabilities. Future studies whose goal will be to investigate transition probability learning should therefore systematically manipulate both item and alternation frequencies orthogonally within subjects.

Several computational models relying on predictive coding have been proposed to account for the effect of global and local item frequency, using (1) information theory (Mars et al., 2008), (2) the free-energy principle (Lieder et al., 2013a; Lieder et al., 2013b) or (3) constraints on the complexity of the representation used to generate predictions (Rubin et al., 2016). Our proposed modeling approach is in line with each of these influential proposals. As in (1), we show a correlation between brain responses and an information-theoretic measure (Shannon surprise). As in (2) we compute surprise using Bayesian inference. As in (3), we find that brain responses are well explained by the use of very recent observations to predict the upcoming one, and an attempt to summarize this recent history into a simple representation (here, we used transition probabilities between successive item, but we did not explore more complex possibilities).

Previous modeling studies often focused on a single brainwave, either the mismatch negativity (Lieder et al., 2013a; Lieder et al., 2013b) or the P300 component (Squires et al., 1976; Mars et al., 2008; Kolossa et al., 2012). Here, however, we conducted a systematic search across all post-stimulus latencies and sensors in order to identify brain responses that are modulated by various statistical properties of the input sequence. Using a similar approach, recent studies also showed that different computational aspects of learning could be mapped onto different electrophysiological signals (Sedley et al., 2016; Diaconescu et al., 2017). Here, this systematic approach combined with qualitative and quantitative model comparison (Forstmann and Wagenmakers, 2015; Palminteri et al., 2017) allowed the identification of a succession of computational processes in the brain. Early responses were best explained by the learning of the frequency of items over a long timescale, while late responses were best explained by the learning of transition probabilities over an increasingly shorter timescale.

Characterising timescales of integration

We modeled these timescales of integration using ‘leaky’ integrators, which have been shown to be biologically plausible and are therefore widely used in the modeling of many brain functions (Rescorla and Wagner, 1972; Glaze et al., 2015; Farashahi et al., 2017). We should, however, acknowledge a limitation of the present study: since we did not vary the timing between stimuli (unlike, for instance, Pegado et al., 2010), the timescale of integration here confounds elapsed time and the number of items, which cannot be separated by our experimental design.

We found that both long-term and short-term sequence learning mechanisms coexist in the brain, at different moments in the course of stimulus processing, with the timescale of integration decreasing progressively with response latency. Thus, previous studies that focused only on short-term (Huettel et al., 2002) or longer term (Näätänen et al., 2007) regularities may have missed some brain responses related to sequence learning. Our finding is, however, much in line with other studies suggesting that the brain may entertain several statistical estimates computed over different timescales (Ulanovsky et al., 2004; Kiebel et al., 2008; Bernacchia et al., 2011; Ossmy et al., 2013; Meder et al., 2017; Scott et al., 2017; Runyan et al., 2017). These multiple estimates might have an important computational role in the sense that they would allow the brain to detect regularities occurring at many different timescales while keeping the computations fairly easy and tractable, that is without relying on hierarchical Bayesian inference (Collins and Koechlin, 2012; Wilson et al., 2013).

Altogether, the different types of statistics that are learnt and the different timescales of integration that we identified in this study suggest an interesting dissociation, which has already been put forward in the past (May and Tiitinen, 2010; Todorovic and de Lange, 2012; Strauss et al., 2015; Collins and Frank, 2018), regarding the neural mechanisms at stake: early responses would reflect a habituation mechanism while later responses would be best explained by a predictive coding mechanism. In the first case, stronger signals to rare events come from the fact that neurons tuned to properties of rare items gets less habituated (i.e. they fatigue less) than those tuned to properties of the frequent items. In the second case, stronger signals evoked by rare events follow from violated expectations. In line with this dichotomy, we found that early responses are better explained by sensitivity to item frequency only, computed over a long history of stimuli. This has two consequences: first, only rare items (not rare repetitions or alternations) evoke strong signals at these latencies; second, this occurs at a very long timescale. Both aspects have been hypothesized as being cornerstones of habituation processes (Kandel and Tauc, 1965). On the other hand, mid-latency and late responses reflect a sensitivity to transition probabilities, computed over a much less protracted history. This translates into the fact that not only rare items but also rare transitions (be them alternations or repetitions) evoke stronger signals at these latencies. The simple fact that pairs of items, whose item frequency is matched, modulate brain signals at those latencies preclude an interpretation based on habituation mechanisms and are therefore only compatible with predictive coding (Wacongne et al., 2012).

Multiple processings: from transition probabilities to chunks and patterns

Our findings also support the existence of multiple computational systems for sequence learning (Dehaene et al., 2015). The mid-latency response, with an integration characterized by a half-life of ~9 items, is compatible with a neuro-computational model of predictive coding, according to which the brain uses a window over the recent past in order to predict the transition probability from the current to the next item (Wacongne et al., 2012). The late response, on the other hand, is compatible with a conscious search for patterns (Bekinschtein et al., 2009; Dehaene et al., 2015; Wang et al., 2015) as shown by its extreme sensitivity to very recent observations (a half-life of ~4 items). The sensitivity to local transition probabilities identified here may be a mere consequence of this search for local patterns. We are indeed not claiming that the late brain signals we measured reflect solely transition probability learning. In fact, several studies have shown that late brain responses like the P300 reflect complex sequential dependencies (Donchin and Coles, 1988) such as short melodic patterns (Bekinschtein et al., 2009; Wacongne et al., 2011; Strauss et al., 2015) and their abstract numerical and grammatical organization (Wang et al., 2015). Local transition probability learning, as modeled here, is probably just a minimal statistical approximation of what the brain systems generating those responses genuinely estimate.

One of such studies, the so-called ‘local-global’ paradigm (Bekinschtein et al., 2009; Wacongne et al., 2011), argues in favor of such a dissociation between brain systems characterized by an early latency which learn very simple sequential properties (like the item frequency or alternation frequency), and brain systems characterized by a late latency that track more complex sequential dependencies involving patterns of stimuli. In this paradigm, sequences are structured into chunks of five tones separated by a pause. Each chunk is a pattern which can be a series of repeated items (XXXXX) or a series of four repeated items followed by a deviant (XXXXY). In a modified version (Strauss et al., 2015), those patterns could also be a series of alternation (XYXYX) and a series followed by a deviant (XYXYY). In different experimental conditions, either pattern can itself be presented frequently (80% of patterns), thereby making the other pattern a rare deviant (20% of patterns). This experimental design thus manipulates independently within-chunk deviance (e.g. XXXXY and XYXYY) and between-chunk deviance (rare pattern). In this paradigm, early and mid-latency responses (the mismatch negativity, occuring at ~250 ms) were found sensitive to the item frequency (original version) and the alternation frequency (modified version): there is increased activity for the last item (Y) in both the XXXXY pattern and the XYXYY pattern. This first finding is fully compatible with the present observations and the hypothesis of a brain system learning transition probabilities because item frequency and alternation frequency are embedded in the space of transition probabilities. Furthermore, this increased activity for the last item (Y) is actually larger when the XXXXX pattern is more frequent (thus inducing a very low global frequency of Ys) compared to when the XXXXY pattern is more frequent (inducing a larger global frequency of Ys), suggesting that the corresponding timescale of integration is longer than five elements, compatible with the one we found in the present study for mid-latency responses (a half-life of ~9 items for the component occurring at ~250 ms).

A second finding from the ‘local-global’ paradigm is that late brain responses (the P300) are sensitive to the rarity of patterns over the entire block. While this could seem to contradict our present result that late brain responses are characterized by a very short timescale (a half-life of ~4 items for the component occurring at ~500 ms), they can be reconciled if one keeps in mind the critical difference between continuous sequences (the present study) and sequences structured into chunks (like the ‘local-global’ paradigm). With this distinction in mind, our current results and previous ones indeed support altogether the same hypothesis: late brain responses correspond to a brain system that operates over different levels of abstraction and within the space of working memory. This system would first process the (4 or 5) recent items and search for regularities among them. This system would then abstract those regularities in the form of a pattern, especially when the sequence is already segmented into chunks, such as ‘the current pattern is made of alternating notes’, or ‘the current pattern comprises four identical notes followed by a different one’. Those patterns would serve as a new unit of processing: the brain would keep track of the patterns recently encountered and be surprised when an observation violates the pattern that recently prevailed. Such characteristics are typical of a rule-based search in conscious working memory, thus limited to a small number of elements (Baddeley, 1992). If this hypothesis is true, we can speculate that late brain responses observed here in the context of our prediction task will not be observed if participants are not attending to the sequence of stimuli. Those late brain waves are typically not observed in patients with impaired consciousness (Faugeras et al., 2012), during sleep (Strauss et al., 2015), or if the subject in not attending (Bekinschtein et al., 2009; Chennu et al., 2013). We used a prediction task here to maximize our chance of evoking late responses, and to be able to compare them with previous reports (Squires et al., 1976). On the contrary, earlier brain waves may arise from brain processes that automatically extract environmental statistics independently of task-relevance (Bekinschtein et al., 2009; Wacongne et al., 2011; Strauss et al., 2015). Testing the automaticity and task-dependence of our results will require further experiments.

Finally, while the present modeling effort was limited to the modeling of transition probability learning, it is likely that such learning constitutes only a building block on top of which higher-order representations are built, including the hierarchical structures that are thought to underpin language processing (Saffran et al., 1996; Dehaene et al., 2015; Leonard et al., 2015). Future studies will be needed to better understand whether and how lower-order properties like transition probabilities can serve as a scaffold for more complex representation. A practical implication of our results is that such future studies should carefully control for possible biases in transition probabilities, otherwise effects attributed to higher-order representation could actually be driven by simple statistical biases.

Materials and methods

Subjects

Twenty participants (11 females), aged between 18 and 25 (mean 21.24, s.e.m.: 0.44) were recruited for this study. The study was approved by the local ethics committee (CPP 08–021 Ile-de-France VII), and participants gave their informed written consent before participating. Data from two subjects was discarded because of excessive head motion.

Stimuli

Request a detailed protocol

Auditory and visual stimulation was delivered via Psychtoolbox (Brainard, 1997) running on MATLAB (Mathworks). The experiment was split into four different conditions. In each of these conditions, two different sounds were presented. These sounds were French syllables and were reported as being perceived without ambiguity by the subjects. The pair of syllables was changed at the beginning of each block: /ka/ and /pi/ in the first run, /ky/ and /te/ in the second one, /ku/ and /tɛ̃/ in the third one, and /pø/ and /to/ in the fourth one. Each syllable was pseudo-randomly labelled as A or B and lasted about 200 milliseconds, with a SOA between 2 syllables of 1.4 s (see Figure 1A). A fixation dot was displayed at the center of the screen while syllables were presented. To ensure subjects’ attentional focus on the auditory stimulation, they were occasionally asked (every 12 to 18 sounds) to report an explicit discrete prediction (i.e. A or B) about the forthcoming stimulus using one of two buttons. Each block presented 400 to 409 stimuli and lasted for about 10 min. Subjects were allowed to rest in between.

Experimental design

Request a detailed protocol

The four different conditions were characterized by different first-order transition probabilities (see Figure 1B): a condition with a frequency bias (i.e. p(A|B) = 1/3 and p(B|A) = 2/3 which led to p(A) = 1/3 and p(alt.) = 4/9), a condition with a repetition bias (i.e. p(A|B) = p(B|A) = 1/3 which led to p(A) = 1/2 and p(alt.) = 1/3), a condition with an alternation bias (i.e. p(A|B) = p(B|A) = 2/3 which led to p(A) = 1/2 and p(alt.) = 2/3) and a fully stochastic condition devoted of any bias (i.e. p(A|B) = p(B|A) = 1/2 which led to p(A) = 1/2 and p(alt.) = 1/2). The order of conditions was pseudo-randomized across subjects.

Data acquisition

Request a detailed protocol

Subjects’ brain activity was recorded using a 306 channels (102 triplets of sensors each composed of 1 magnetometer and two orthogonal planar gradiometers) whole-head Elekta Neuromag MEG system with an acquisition sampling rate of 1 kHz and hardware bandpass filtering between 0.1 and 330 Hz. Prior to installing the subjects in the MEG room, we digitized three head landmarks (nasion and pre-auricular points), four head position indicator (HPI) coils placed over frontal and mastoïdian skull areas, as well as about 60 additional locations over subjects’ skull using a three-dimensional Fastrak system (Polhemus). Thanks to HPI landmarks, head position within the MEG helmet was measured at the beginning of each run. Electrocardiogram, vertical and horizontal oculogram were recorded in order to monitor heart rate and eye movements during the experiment.

Pre-processing

Request a detailed protocol

Raw MEG signals were first corrected for between-session head movements and bad channels using MaxFilter (Elekta). Following pre-processing steps were performed using Fieldtrip in MATLAB (Mathworks). First, MEG data was epoched between −250 ms to 1 s with respect to sound onset. Power line artefacts were then removed by filtering out the 50 Hz frequency and its 100 and 150 Hz harmonics. Trials exhibiting muscle and/or other movements’ artefacts and/or SQUIDs’ jumps that were detected based on semi-automatic methods (using the variance of MEG signals over sensors and the first order derivative of MEG signals over time) were then completely removed from the dataset. MEG signals were then low-pass filtered below 30 Hz and downsampled to 250 Hz. An ICA was then performed and components exhibiting topographical and time courses signatures of eye blinks or cardiac artefacts were subsequently removed from the data. The remaining trials were eventually monitored visually and individually to ensure that all bad trials were removed from the subject’s dataset. Last, data were baseline-corrected by removing trial-wise averaged activity in the 250 ms time window preceding sound onset. Throughout the paper results are reported only for magnetometers.

Sources localizations

Request a detailed protocol

Anatomical T1-weighted MRIs (with 1 × 1 × 1.1 mm voxel size) were acquired for each subject using a 3T Tim Trio (Siemens) MRI scanner. For each subject, grey matter was first segmented from white matter using FreeSurfer (Fischl et al., 1999). Segmented tissues were then imported into BrainStorm (Tadel et al., 2011) in order to reconstruct subjects’ head and cortical surfaces using a cortical mesh composed of 150,002 vertices. Models of the cortex and the head were used to estimate the current-source density distribution over the cortical surface. To do so, subject-specific noise covariance matrices were estimated in each condition on the baseline from the 250 ms time window preceding sound onset. Individual sources were computed using weighted minimum-norm estimates (depth weighting factor: 0.8, losing factor for dipole orientation: 0.2), separately for each condition, and the norm of each tridimensional vector obtained was finally computed for each vertex. The resulting sources were projected on a standard anatomical template and averaged across subjects. The group average was finally interpolated to a higher resolution brain template composed of 306716 vertices that was inflated to 35% to ensure high-quality figures (see Figure 1—figure supplement 2B).

Statistical analyses

Request a detailed protocol

We searched for differences between rare and frequent observations within each experimental condition. To correct for multiple comparison over sensors and time points, we used spatiotemporal clusters and non-parametric paired t-tests estimated from 1000 permutations (Maris and Oostenveld, 2007). A cluster was defined by adjacent time points and neighbouring sensors (neighbours sensors are less than 15 cm apart; there is an average of 8.3 neighbours per sensor). The cluster-level statistic was the sum of the sample-specific t-statistics that belong to a given cluster. The alpha level of the sample-specific test statistic was 0.05 and the alpha level of the cluster-specific test statistic was 0.05.

Topographical filtering

Request a detailed protocol

Group averaged topographical ‘surprise’ responses (see Figure 3—figure supplement 1) were computed by contrasting, in a late time window (from 500 to 730 ms), MEG signals evoked by local streaks that were violated (e.g. XXXY), both given local (e.g. a local series of repetitions) and global (e.g. alternations are globally less frequent than alternations in the alternation-biased condition) statistics, against the same streaks that were continued (e.g. XXXX). Those group-level topographies were then used to filter, at the subject level, MEG signals obtained in the same time window and the same sequence, but evoked by all the possible patterns up to length 4 (see Figure 3). Importantly, we used a cross-validation approach to ensure statistical validity and preclude double dipping (Kriegeskorte et al., 2009). More precisely, topographical filters were estimated at the group level on odd trials and applied at the subject level on even trials. We also did it the other way around and pooled both together. In order to facilitate visual comparison, in each condition, the averaged activity was subtracted from the activity evoked by each pattern (see Figure 4).

Bayesian models of the learning process

Request a detailed protocol

We designed ‘ideal observers’ that infer the hidden value of a statistic θ of stimuli given a particular sequence u of binary observations. For the sake of simplicity, we first consider here the learning of IF (MIF). This ideal observer infers the frequency of As (θA), and updates it following each observation, in an optimal manner following Bayes’ rule.

p(θA|u1:k)=p(u1:k|θA)p(θA)p(u1:k)

Because observations are binary and our prior about θA is uniform, the posterior distribution follows a Beta distribution (Gelman et al., 2013).

p(θA|u1:k)Beta(NkA+1,NkB+1)

The mean of this distribution, which is also the predictive likelihood that the next observation will be an A, can be expressed analytically as follows.

E(θkA)=NkA+1NkA+NkB+2=p(uk+1=A|u1:k,IF)

Where NA and NB denote the number of A and B items that have been observed up to the k-th observation of the sequence.

Following Shannon (1948), the theoretical surprise levels can then be computed as the predictive likelihood of the observation that is actually received.

Ik+1={log2(E(θkA))ifuk+1=Alog2(1E(θkA))ifuk+1=B

Because integration over long periods of time and a large number of observations is often compromised by memory constraints, we designed models with different timescales of integration. In those models, the weight of a given observation decays exponentially with the number of observations. The count of observations NA and NB is therefore ‘leaky’. For mathematical convenience, let us assume that the sequence of observation u1:n is coded numerically as 0 s and 1 s representing respectively Bs and As. In that case, the leaky count of the occurrences of item A is:

NωA=k=1nunkexp(kω)

Where ω is a free parameter that characterizes the strength of the leak and therefore controls the timescale of integration (see Figure 5—figure supplement 2): smaller values of ω correspond to a more local integration that favours more recent observations. When ω = ∞, the integration becomes perfect, that is devoid of any forgetting, and thus provides an optimal estimate of the true generative statistics that are fixed within each condition of our experiment. Importantly, the timescale of integration determines the stability of beliefs, and therefore influences the dynamics of surprise levels (see Figure 2—figure supplement 1B for an example): the smaller the ω, the more local the integration and the more erratic the surprise levels.

Now that we have presented the derivation for the ideal observer learning the frequency of items IF, we can extend it to other types of observers that learn AF or TP. The derivation for the AF model is most similar to the IF model: while the IF model counts, with a leak, the occurrence of As (and Bs), the AF model counts, with a leak, the occurrence of alternations (and repetitions). For the AF model, one therefore simply needs to recode the input sequence into repetitions and alternations (rather than As and Bs) and apply the exact same logic as for the IF model. The derivation of the TP model is slightly different, since it requires to estimate two statistics (instead of one, in IF and AF): two transition probabilities, one for the probability of observing an A after a B θA|B = 1 – θB|B and one for the probability of observing a B after an A, θB|A = 1 – θA|A. Note that θB|B and θA|A can be deduced from θA|B and θB|A, such that two probabilities suffice to describe all transitions. Each TP is estimated essentially like the frequency of items, but by counting transitions. The estimation of θA|B and θB|A therefore requires to entertain (leaky) counts corresponding to all possible transitions (AA, AB, BA, BB). More details are provided in Meyniel et al. (2016) and the MATLAB code for all ideal observers is available (Meyniel and Maheu, 2018; copy archived at https://github.com/elifesciences-publications/MinimalTransitionProbsModel).

It is important to notice that IF and AF can be analytically derived from TP:

θA=θA|BθA|B+θB|A and θalt.=2θA|BθB|AθA|B+θB|A

Thus, IF and AF are subsumed in the space of TP (see Figure 1A).

Trial-by-trial regressions

Request a detailed protocol

Following previous work (Mars et al., 2008), we hypothesize that, for each trial k, brain signals yk linearly scale with the theoretical level of surprise Ik. Here, different ideal observer models M learning a different statistics i with a difference timescale of integration ω have different surprise levels. We thus estimated regression coefficients β from the following linear regression model (see Figure 5).

yk=β0b+β1Ik(i,ω)+ϵwithϵ𝒩(0,σϵ2)

We estimated regressions for each subject, at each time point around sound onset and each MEG sensor. We also varied, in separate regressions, the ideal observer model (IF, AF, TP) and the timescale of integration ω. We tested 55 values of ω (see Figure 5E for the exact grid used) so as to cover the entire number of observations in the sequence (including the case of a perfect integration: ω = ∞) while minimizing correlations between surprise levels obtained with different values of ω.

Bayesian model comparison

Request a detailed protocol

In order to compare models, we adopted a Bayesian approach. We approximated the marginal likelihood (or model evidence) of each model, a key metric for model comparison, using the Bayesian information criterion (BIC; Schwarz, 1978). Under independent and identically distributed model errors, the BIC can be expressed as follows.

BIC=nlogσ^ϵ2+klogn
σ^ϵ2=min1nk1n(yky^k,ω)2

where n is the number of observations, κ the number parameters: that is here 4 β0s (one per condition), β1 and ω. Note that all models here have the same number of free parameters. Therefore, other approximations for the model evidence such as the Akaike Information Criterion (Akaike, 1998), or simply the proportion of explained variance, would lead to the same conclusions.

We then used model evidence to compare models with a random-effect approach. An alternative is to use a fixed-effect approach, by summing log-ratios of model evidence (i.e. Bayes Factors). However, such an approach is valid only for pairs of models and its results can easily be driven by outlier subjects. We performed the random-effect analysis described in Stephan et al. (2009); Rigoux et al. (2014) and implemented in the VBA toolbox (Daunizeau et al., 2014). This analysis returns several key estimates for each model included in the comparison set: the expected model frequency in the general population as well as the probability for each model to be more frequent than any other model in the general population (the ‘exceedance probability’, noted φ).

Bayesian model averaging

Request a detailed protocol

In addition to identifying the model that best explain the MEG signals, one can also draw inference about the best integration parameter ω. To do so, the BIC approximation of the model evidence is estimated for each value of the timescale integration parameter ω.

p(y|,ω)exp(BIC(ω)2)

Note that the equation above depends on the use of a particular model. Since we do not know for sure which model is used by the brain, we can combine all models following probability calculus in order to infer the value of ω that best explain the MEG signals. Under a uniform prior over M and ω, it boils down to:

p(ω|y)ip(y|i,ω)

Note that those equations apply to one subject, such that we obtain an estimate of ω per subject.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
    On Language: Chomsky's Classic Works Language and Responsibility and Reflections on Language in One Volume
    1. N Chomsky
    2. M Ronat
    (1998)
    New Press.
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
    A theory of cortical responses
    1. K Friston
    (2005)
    Philosophical Transactions of the Royal Society B: Biological Sciences 360:815–836.
    https://doi.org/10.1098/rstb.2005.1622
  28. 28
    Predictive coding under the free-energy principle
    1. K Friston
    2. S Kiebel
    (2009)
    Philosophical Transactions of the Royal Society B: Biological Sciences 364:1211–1221.
    https://doi.org/10.1098/rstb.2008.0300
  29. 29
  30. 30
  31. 31
     Bayesian Data Analysis
    1. A Gelman
    2. HS Stern
    3. JB Carlin
    4. DB Dunson
    5. A Vehtari
    6. DB Rubin
    (2013)
     Chapman and Hall/CRC.
  32. 32
  33. 33
  34. 34
    A Bias–Variance Trade-Off Governs Individual Differences in on-Line Learning in an Unpredictable Environment
    1. CM Glaze
    2. ALS Filipowicz
    3. JW Kable
    4. V Balasubramanian
    5. JI Gold
    (2018)
    Nature Publishing Group.
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
    Variation in Event-Related potentials by state transitions
    1. H Higashi
    2. T Minami
    3. S Nakauchi
    (2017)
    Frontiers in Human Neuroscience, 11, 10.3389/fnhum.2017.00075, 28289380.
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
    MinimalTransitionProbsModel, version https://github.com/florentmeyniel/MinimalTransitionProbsModel
    1. F Meyniel
    2. M Maheu
    (2018)
    GitHub.
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
    A theory of pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement
    1. RA Rescorla
    2. AR Wagner
    (1972)
    Classical Conditioning II: Current Research and Theory 2:64–99.
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
    Science and human behavior (No. 92904)
    1. BF Skinner
    (1953)
    Simon and Schuster.
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
  106. 106
  107. 107

Decision letter

  1. Floris de Lange
    Reviewing Editor; Donders Institute for Brain, Cognition and Behaviour, Netherlands
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Brain Signatures of a Multiscale Process of Sequence Learning in Humans" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Floris de Lange as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Marta Garrido (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Maheu et al. perform a systematic and thorough investigation of the effects of item frequency (IF), alternation frequency (AF) and transition probability (TP) of auditory events on the MEG response in different time windows. They find that IF, estimated over long temporal windows, explains the brain response in very early auditory sensory processing, whereas TP, estimated over progressively shorter time windows, explains the brain response at progressively later time points.

Essential revisions:

1) Task-dependency of results?

The participants were explicitly required to judge the predictability of the patterns. It is therefore very likely that the observed brain responses (especially the later ones) reflect this perceptual state and therefore cannot be interpreted as automatic processes reflecting the brain's modus operandi as a statistical learning machine. It would have been far better, if the task performed by participants was unrelated to the sequence dimension being investigated, e.g. some sort of omission detection task? Or an entirely unrelated (e.g. visual) task. The present choice limits the interpretation of the results. It will be important to better qualify what inferences can and cannot be drawn on the basis of these data.

Related, the major findings that (1) brain responses reflect the statistics of sound sequences (2) early responses reflect adaptation-like effects whereas later effects reflect increasingly complex models is not new. The authors themselves review lots of relevant literature. Notably, they fail to relate their results to the modelling work of Rubin et al., 2006. Though the paper is cited in the present report in the context of the claim that brain responses are larger for deviants than standards, the actual work has employed extensive modelling to understand how neural responses to sequences similar to those used here are affected by the preceding history. Notably, they have concluded that responses are affected by quite a lengthy context (more than 10 tones) but that the representation of the transition probabilities is coarse. Is it possible that the different effects observed here are driven by the behavioral task (see my comment no. 1, above)?

2) Choice of time window:

The spatial filter analysis is focused on the late time window. The one that is most contaminated by the behavioral task. What is the justification of using that time window and not and earlier one?

In previous and later figures, IF and AF seem to be doing all the work at much earlier time windows.

3) Various methodological concerns: a) Figure 5 the time course of max R2 nicely aligns with the three MEG peaks. Could it be that the appearance of "3 distinct time windows" is just a consequence of MEG power?

b) Figure 2 shows significant differences between conditions (B vs. A, and Rep vs. Alt) overlaid on MEG topography for the 4 different sequence conditions. While it is interesting to observe the differences in topography for the MEG difference wave across sequence conditions, this remains a qualitative observation. Statistical maps would be required to really pin down whether those patterns are indeed significantly different between conditions, and if yes, where in space-time. For example Figure 2B, it appears that from 400 ms onwards the contrast of Rep vs. Alt is a little different between fully stochastic and repetition biased sequences, but, is this difference (interaction) really significant?

c) Figure 4 shows a remarkable similarity between empirical MEG data and the surprise levels predicted by the transition probability model for 500-730 ms, which is very impressive and pleasing. It was unclear, however, whether this time window was a priori chosen or whether the whole window was fully explored with appropriate multiple testing correction. If a priori chosen then more explanation for the rationale is needed given that a) strong evidence exists for MMN window 100-250 ms and b) Figure 3A shows significant differences for earlier periods – in fact for the alternation biased condition no significant differences between rep and alt exist beyond 400 ms so that's difficult to reconcile with the later time window. If based on Figure 5D then why not from 160ms from which point TP wins?

d) Still related to the findings in Figure 4 that related to a local integration window parameterised by ω =6. My understanding is that this parameter was optimised for each subject and that 55 values (as many as the max of observations made) were explored. How consistent was this value across participants (ω =16 is also mentioned)? It appears this was also optimised for each of the 3 times windows – can this be made more explicit and also again whether that was consistent over subjects. Did it correlate with their task accuracy (more accurate prediction judgments on upcoming stimuli for greater ω)? This is not necessary but could strengthen the interpretation.

e) The fact that global statistic violations such as item frequency violation modulates earlier (not late) responses matches nicely Garrido 2013 narrow/broad Gaussian paper showing that variance (a global statistic) modulates early latency responses. However, it appears at odds with Bekinschtein's 2009 local-global paper showing that global violations modulate P300 (not MMN). Can we reconcile these findings? The Discussion appears to point to chunk vs. continuous sequences as potentially explaining the divergence (which would again be consistent with Garrido 2013), but it still remains unclear why would chunking make such a difference?

https://doi.org/10.7554/eLife.41541.022

Author response

Essential revisions:

1) Task-dependency of results?

The participants were explicitly required to judge the predictability of the patterns. It is therefore very likely that the observed brain responses (especially the later ones) reflect this perceptual state and therefore cannot be interpreted as automatic processes reflecting the brain's modus operandi as a statistical learning machine.

We completely agree with the reviewers that some of the observed brain responses, in particular late ones, may depend upon the fact that the sequence of stimulus is task-relevant, as previously reported by e.g. Pitts et al., 2014, and Strauss et al., 2015. However, it was not our intention to suggest that our results reflect an automatic and unconscious extraction of statistical regularities. We have screened our text searching such a misleading interpretation, but could not find any. We are happy to take suggestions. In order to clarify this point, we now add a paragraph in the Discussion, indicating that several aspects of our conclusions may be task-dependent, and that it is unclear whether the same effects would be observed if the subject were engaged in a distracting task, not paying attention to stimuli, or even asleep (see next point).

It would have been far better, if the task performed by participants was unrelated to the sequence dimension being investigated, e.g. some sort of omission detection task? Or an entirely unrelated (e.g. visual) task. The present choice limits the interpretation of the results. It will be important to better qualify what inferences can and cannot be drawn on the basis of these data.

The fact that the behavioural task was related to the sequence dimension was actually a deliberate choice. We wanted to maximize the number and variety of brain processes we would record. In particular, we wanted to relate our study to different previous studies, including those on P300/slow-wave components, which are rarely seen for unattended or irrelevant stimuli. Indeed, an important goal of our study was to extend the seminal results by Squires et al., 1976, who reported a modulation of the P300/slow-wave EEG component by the local and global item frequency. We did so by including in our study a new computational model as well as new experimental conditions (with repetition and alternation biases). In Squires et al.’s 1976 study, subjects counted the high pitch tone, so that the stimulus was task-relevant. Using an entirely unrelated task would have jeopardized the study of late brain responses and comparability with previous studies. To avoid possible misunderstandings, we have added a new paragraph to the Discussion in order to clarify this point:

“[…] we can speculate that late brain responses observed here in the context of our prediction task will not be observed if participants are not attending to the sequence of stimuli. Those late brain waves are typically not observed in patients with impaired consciousness (Faugeras et al., 2012), during sleep (Strauss et al., 2015), or if the subject in not attending (Chennu et al., 2013). […] Testing the automaticity and task-dependence of our results will require further experiments.”

Related, the major findings that (1) brain responses reflect the statistics of sound sequences (2) early responses reflect adaptation-like effects whereas later effects reflect increasingly complex models is not new. The authors themselves review lots of relevant literature. Notably, they fail to relate their results to the modelling work of Rubin et al., 2006. Though the paper is cited in the present report in the context of the claim that brain responses are larger for deviants than standards, the actual work has employed extensive modelling to understand how neural responses to sequences similar to those used here are affected by the preceding history. Notably, they have concluded that responses are affected by quite a lengthy context (more than 10 tones) but that the representation of the transition probabilities is coarse. Is it possible that the different effects observed here are driven by the behavioral task (see my comment no. 1, above)?

The modelling and experimental work that we quoted by Rubin et al., 2016,) is indeed highly relevant to our work (we assumed that the reviewer mistyped 2006 instead of 2016). Their theoretical account proposes a new way of predicting future observations while constraining the complexity of the representation of the past that is used for the prediction. They tested the model’s predictions on data obtained in an oddball task, and show that the model indeed predicts higher activity for the deviant tones (rare items) compared to the standard ones (frequent items). They also show that their model is crucially characterized by a local inference and a coarse representation of the past.

Important differences with this prior work should nevertheless be stressed: it presents single-unit recordings in anesthetized cats (while we have MEG data in behaving humans), it uses only two of our four experimental conditions (the frequency-bias condition and the no-bias condition), it does not explore the dynamics of the stimulus-evoked response. The model is also different: we restricted ourselves to transition probabilities of order 1 while Rubin et al., 2016, go to order 50. Our choice was motivated by our experimental design, which is tailored for testing the order 1 with the inclusion of a repetition-bias and alternation-bias condition, and the fact that investigating higher order is difficult due to combinatorial explosion and the limited amount of data available in typical experiments (sequences have 430 stimuli in our experiment). Differences between our study and Rubin et al., 2016, should therefore be interpreted with care.

Despite those differences, there are two major points of convergence between Rubin et al., 2016, and our study. First, prediction relies on a local inference: Rubin et al., 2016, found that a history of more than 10 items is used; we found an exponential decay with ω = 13 and 6 for the intermediate and late time window; and we found ω = 16 for Squires et al.’s 1976 dataset. Second, in order to make predictions, the brain uses a compressed representation of the past; in our case, it is in the form of transition probabilities; in Rubin et al., 2016, their explore more complex representations.

In the revised version, we have added a new paragraph in the Discussion, see below) that lists the influential theoretical accounts that have been proposed regarding the modeling of brain responses to sequences characterized by a bias in the frequency of items (e.g. oddball sequences), and in particular, discuss more the study by Rubin et al., 2016.

“Several computational models relying on predictive coding have been proposed to account for the effect of global and local item frequency, using (1) information theory (Mars et al., 2008), (2) the free-energy principle (Lieder et al., 2013a, 2013b) or (3) constraints on the complexity of the representation used to generate predictions (Rubin et al., 2016). […] As in (3), we find that brain responses are well explained by the use of very recent observations to predict the upcoming one, and an attempt to summarize this recent history into a simple representation (here, we use transition probabilities between successive item, but we did not explore more complex possibilities).”

2) Choice of time window:

The spatial filter analysis is focused on the late time window. The one that is most contaminated by the behavioral task. What is the justification of using that time window and not and earlier one?

In previous and later figures, IF and AF seem to be doing all the work at much earlier time windows.

First, we would like to clarify that none of our data are not contaminated by behavior: motor responses are restricted to ~1/15 of the trials, which are not included in the analysis. We added a sentence in the Materials and methods to clarify. If the reviewer is referring to a potential task-dependency of our results, please see our response above.

Second, the analysis that involves the spatial filter is intended as an extension of the seminal and influential paper by Squires et al., 1976, in which the authors reported a modulation of P300 and late slow-wave by the sequence statistics (both local and global) in an experiment that contained two of our four conditions. We therefore selected a similar late time-window, in order to allow comparison with this study and subsequent ones on the same topic (Bekinschtein et al., 2009; Donchin and Coles, 1988; Feuerriegel et al., 2018; Kolossa et al., 2012; Mars et al., 2008; Melloni et al., 2011; Meyniel, Maheu and Dehaene, 2016; Pegado et al., 2010; Squires et al., 1976; Strauss et al., 2015; Summerfield et al., 2011; Wacongne et al., 2011).

Third, note that while we provide a Squires-style, detailed analysis of the signal in an (arbitrary or “historical”) late time window, the remainder of our analysis (Figures 2, 5, Figure 2—figure supplement 1, Figure 2—figure supplement 2, Figure 2—source data 1 and even more in the new Figure 2—figure supplement 1 and Figure 5—figure supplement 1 that is also presented below) probes the MEG signal in a systematic manner, independently from any prior assumption regarding the timing of effects (which post-stimulus latency), their spatial location (which sensors) and their tuning properties (modulatory effect of local/global statistics, and of which statistics).

In the revised version, we have added a sentence in the Results section that explain our general approach and the fact that the late time window was chosen for historical reasons (subsection “Description of auditory-evoked responses”, last paragraph, see below).

3) Various methodological concerns:a) Figure 5 the time course of max R2 nicely aligns with the three MEG peaks. Could it be that the appearance of "3 distinct time windows" is just a consequence of MEG power?

Theoretically speaking, it could have been the case that our model explains the signal (R2) mostly when evoked responses are strong. However in practice, it is not the case: the (3) peaks in R2 do not really align with the (4) peaks observed for MEG (global field) power (see Author response image 1). In particular, the second and fourth peaks of the MEG power (blue) fall exactly in between two peaks of R2 (blue).

We have amended our Figure 5 so as to now include the global field power for reference.

b) Figure 2 shows significant differences between conditions (B vs. A, and Rep vs. Alt) overlaid on MEG topography for the 4 different sequence conditions. While it is interesting to observe the differences in topography for the MEG difference wave across sequence conditions, this remains a qualitative observation. Statistical maps would be required to really pin down whether those patterns are indeed significantly different between conditions, and if yes, where in space-time. For example Figure 2B, it appears that from 400 ms onwards the contrast of Rep vs. Alt is a little different between fully stochastic and repetition biased sequences, but, is this difference (interaction) really significant?

We should distinguish two types of comparisons: within and between conditions. Our statistical analysis, which is corrected for multiple comparisons over sensors and time points, addresses within-condition comparisons: Do A and B produce different signals when there exist a frequency bias? Do alternation and repetition produce different signals within the no-bias condition? And similarly, within the repetition-bias condition, and within the alternation-bias condition? We revised our text to avoid any ambiguity.

We nevertheless agree with the reviewers that the between-condition comparisons are also interesting, albeit different. We partially address between-conditions effects later in the paper, when we focus on (late) brain responses that reflected a sensitivity to all types of (local) statistics (i.e. in all conditions).

In order to assess it systematically, we now provide, as suggested by the reviewers, the statistical maps corresponding to pairwise differences between frequent – rare events across conditions.

As anticipated by the reviewers, the statistical analysis indeed reveals a significant difference between the Rep – Alt contrast between fully-stochastic and repetition-biased conditions starting around 500 ms which we included as a new Figure 2—figure supplement 2.

c) Figure 4 shows a remarkable similarity between empirical MEG data and the surprise levels predicted by the transition probability model for 500-730 ms, which is very impressive and pleasing. It was unclear, however, whether this time window was a priori chosen or whether the whole window was fully explored with appropriate multiple testing correction. If a priori chosen then more explanation for the rationale is needed given that a) strong evidence exists for MMN window 100-250 ms and b) Figure 3A shows significant differences for earlier periods – in fact for the alternation biased condition no significant differences between rep and alt exist beyond 400 ms so that's difficult to reconcile with the later time window. If based on Figure 5D then why not from 160ms from which point TP wins?

We thank the reviewers for raising this important question.

First, note that Figure 4 is an extension of Figure 3: we therefore used the same time-window. The choice of this late time window is motivated to allow comparison with the study by Squires et al., 1976, see point 2 above. This analysis relies on the use of spatial filters, the choice of the exact time window was done a priori based on the filters’ profiles (see Figure 3—figure supplement 1): we selected manually a single late time window in which all filters showed strong signals. This choice was not further optimized to show similarity with the local transition probability model (which would have required correction for multiple testing, as pointed by the reviewers).

Also note that Figure 4 is mostly for illustration purpose: to show that the local transition probability model qualitatively accounts for the late evoked responses which were of interest in previous studies (Bekinschtein et al., 2009; Donchin and Coles, 1988; Feuerriegel et al., 2018; Kolossa et al., 2012; Mars et al., 2008; Melloni et al., 2011; Meyniel, Maheu and Dehaene, 2016; Pegado et al., 2010; Squires et al., 1976; Strauss et al., 2015; Summerfield et al., 2011; Wacongne et al., 2011). A more quantitative and exhaustive analysis (for all time points and sensors) is provided in Figure 5.

We have added two sentences explaining the choice of that late time-window in the Results section (see below), and we amended Figure 3—figure supplement 1 with a new panel, in order to clarify that the time window is selected a priori based on the filters, and not post-hoc based on the cross-validated analyses reported in Figure 3 and Figure 4.

“Note that this late time window (from 500 to 730 ms) was selected in order to follow up and expand the seminal study of Squires et al., 1976, which focused on late (EEG) brain responses. Here, the exact time window was not optimized to show the clear tree structure seen in Figure 3, or the strong similarity between observed data and the model reported in Figure 4, but selected a priori (see Figure 3—figure supplement 1).”

Regarding the specific question about effects in the alternation-biased condition beyond 400 ms: Figure 2 shows no significant effect when correcting for multiple comparisons over time and sensors, but this is of course conservative and dependent on a p-value threshold, the topographies nevertheless show (below-threshold) effects. Those effects are pooled across sensors by the spatial filter in a time window in Figure 4, which further reveals the effect: there is no contradiction in the data. In actuality, the Alt. vs. Rep. effect is not significantly stronger in the alternation-bias than in the repetition-bias condition (see response to 3b).

d) Still related to the findings in Figure 4 that related to a local integration window parameterised by ω =6. My understanding is that this parameter was optimised for each subject and that 55 values (as many as the max of observations made) were explored. How consistent was this value across participants (ω =16 is also mentioned)? It appears this was also optimised for each of the 3 times windows – can this be made more explicit and also again whether that was consistent over subjects.

Figure 4 presents theoretical predictions, which can be computed only after parameterizing our model (ω, the time scale of integration, is the only free parameter). We have not optimized ω for Figure 4, actually we took the value that we identified in the exhaustive analysis presented in Figure 5 for the late time window (because Figure 4 is about late responses).

In order to obtain this value of ɷ in Figure 5, we computed for each subject the posterior distribution p(ω|MEG) which gives, for each of the 55 values of ωthat were tested (see vertical grid on Figure 5E), how likely that particular value of ω is given the observed MEG activity in that subject (see Materials and methods). This distribution is obtained through Bayesian Model Averaging (BMA) which takes into account the uncertainty at the level of model inference (IF, AF and TP). We then averaged over subjects, and reported the mean distribution in Figure 5E. The maximum a posteriori value was defined on the group-level distribution for each time window of interest. We rephrased the legend of Figure 5 and the main text to clarify.

“Finally, in order to draw inference about the timescale of integration of each of these three signatures, we performed a Bayesian Model Averaging (BMA, see Materials and methods). […] At the group level (see Figure 5E and Figure 5—figure supplement 2), the first window was best characterized by a global integration (ω = ∞), the second and third by much more local integration (ω = 13 and 6 respectively, maximum a posteriori values; i.e. a half-life of 9.01 and 4.16 observations respectively).”

Those estimates of ωshould be taken with caution. First, the posterior distributions in Figure 5 shows that actually the range of probable values is rather large. The reviewers mention ω= 16, which is the value that we found in our previous paper (Meyniel, Maheu and Dehaene, 2016) when fitting the dataset by Squires et al., 1976. We also found ω = 19 for another EEG dataset (Kolossa et al., 2013). All those values have a very high posterior probability in our current data (see Figure 5E). Differences between those values may also reflect that the evoked responses is taken at different (late) latencies and specific aspects of each data set. Second, the exact values for each participant are rather spread out, especially maximum a posteriori values. Mean values are more consistent, but this is artifactually due to a regression to the (average) prior value. This is why we report group-level estimates (rather than a mean of subject-level estimates) to achieve a more stable result. Note that while individual exact values are spread out, the decrease of those values from early to late latencies is very consistent across subjects (t17 = –3.00, p = 0.008, already reported in the text).

Did it correlate with their task accuracy (more accurate prediction judgments on upcoming stimuli for greater ω)? This is not necessary but could strengthen the interpretation.

The answer to this question depends on how accuracy is measured. An objective measure is to consider the average likelihood of subjects’ choices based on the generative transition probabilities. For instance, in the repetition-biased condition, reporting expecting a A after a A comes with a higher likelihood given the generative transition probabilities than expecting a B after the same A. With such a measure, subjects would indeed achieve a better performance if they were using a longer integration (i.e. greater ω). However, this measure of accuracy did not correlated significantly with the subject’s maximum a posteriori ω parameter (with mean ω: ρ = –0.11, p = 0.66; with maximum a posteriori ω: ρ = –0.16, p = 0.54). We are reluctant to include this non-significant result in the paper because our study is most likely underpowered for testing between-subject correlations (N = 18 subjects).

e) The fact that global statistic violations such as item frequency violation modulates earlier (not late) responses matches nicely Garrido 2013 narrow/broad Gaussian paper showing that variance (a global statistic) modulates early latency responses. However, it appears at odds with Bekinschtein's 2009 local-global paper showing that global violations modulate P300 (not MMN). Can we reconcile these findings? The Discussion appears to point to chunk vs. continuous sequences as potentially explaining the divergence (which would again be consistent with Garrido 2013), but it still remains unclear why would chunking make such a difference?

First, we would like to stress that late responses are also modulated by global statistics (see e.g. Figure 2A and Figure 4A), including the violation of item frequency.

Second, we acknowledge that terminology plays against us. In this paper, local/global refers to the timescale of integration (i.e. number of items processed) following previous publications (Donchin and Coles, 1988; Feuerriegel et al., 2018; Kolossa et al., 2012; Mars et al., 2008; Melloni et al., 2011; Meyniel, Maheu and Dehaene, 2016; Squires et al., 1976; Summerfield et al., 2011), whereas in the Bekinschtein et al.’s (2009) study, local vs. global effects refer to within- vs. between-chunks effects. The Bekinschtein et al.’s 2009 study says nothing about how many items or chunks are necessary to observe between-chunk effects.

Third, we previously discussed the link with Bekinschtein et al.’s 2009 study in two distinct paragraphs, which was not ideal. We have now assembled this point into only one paragraph as follows:

“A second finding from the “local-global” paradigm is that late brain responses (the P300) are sensitive to the rarity of patterns over the entire block. […] Testing the automaticity and task-dependence of our results will require further experiments.”

Last, thank you for pointing to us the paper by Garrido, Sahani and Dolan, 2012 which is highly relevant to our work. We now cite it (Garrido, Sahani and Dolan, 2013). Our results are indeed fully compatible with their findings as we indeed also find that early responses are modulated by a global statistic. However, we cannot speculate about the effect of the variance of the distribution in our case as, with Bernoulli processes (governing the generation of discrete events), the variance is confounded with the strength of the probabilistic bias (extreme probabilities induce lower variance).

https://doi.org/10.7554/eLife.41541.023

Article and author information

Author details

  1. Maxime Maheu

    1. Cognitive Neuroimaging Unit, CEA DRF/JOLIOT, INSERM, Université Paris-Sud, Université Paris-Saclay, NeuroSpin center, Gif-sur-Yvette, France
    2. Université Paris Descartes, Sorbonne Paris Cité, Paris, France
    Contribution
    Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft
    For correspondence
    maheu.mp@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6851-4927
  2. Stanislas Dehaene

    1. Cognitive Neuroimaging Unit, CEA DRF/JOLIOT, INSERM, Université Paris-Sud, Université Paris-Saclay, NeuroSpin center, Gif-sur-Yvette, France
    2. Collège de France, Paris, France
    Contribution
    Conceptualization, Supervision, Funding acquisition, Validation, Writing—review and editing
    Competing interests
    No competing interests declared
  3. Florent Meyniel

    Cognitive Neuroimaging Unit, CEA DRF/JOLIOT, INSERM, Université Paris-Sud, Université Paris-Saclay, NeuroSpin center, Gif-sur-Yvette, France
    Contribution
    Conceptualization, Supervision, Validation, Writing—review and editing
    For correspondence
    florent.meyniel@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6992-678X

Funding

Fondation Bettencourt Schueller ("Frontières du Vivant" doctoral fellowship)

  • Maxime Maheu

Ministère de l'Enseignement supérieur, de la Recherche et de l'Innovation ("Frontières du Vivant" doctoral fellowship)

  • Maxime Maheu

Institut National de la Santé et de la Recherche Médicale

  • Stanislas Dehaene

Commissariat à l'Énergie Atomique et aux Énergies Alternatives

  • Stanislas Dehaene
  • Florent Meyniel

Collège de France

  • Stanislas Dehaene

European Research Council (Grant NeuroSyntax)

  • Stanislas Dehaene

European Union Seventh Framework Programme (604102 - Human Brain Project)

  • Stanislas Dehaene
  • Florent Meyniel

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank all the volunteers for their participation. We thank Sébastien Marti, Emmanuel Procyk and Valentin Wyart for useful discussions throughout the project. We are grateful to the nurses Véronique Joly-Testault, Laurence Laurier, and Gaëlle Médiouni for their help in recruiting subjects and data acquisition.

Ethics

Human subjects: The study was approved by the local ethics committee (CPP 08-021 Ile-de-France VII), and participants gave their informed written consent before participating.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Floris de Lange, Donders Institute for Brain, Cognition and Behaviour, Netherlands

Publication history

  1. Received: August 29, 2018
  2. Accepted: January 18, 2019
  3. Version of Record published: February 4, 2019 (version 1)

Copyright

© 2019, Maheu et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,519
    Page views
  • 243
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)