The roles of online and offline replay in planning

Abstract
eLife digest
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Animals and humans replay neural patterns encoding trajectories through their environment, both whilst they solve decision-making tasks and during rest. Both on-task and off-task replay are believed to contribute to flexible decision making, though how their relative contributions differ remains unclear. We investigated this question by using magnetoencephalography (MEG) to study human subjects while they performed a decision-making task that was designed to reveal the decision algorithms employed. We characterised subjects in terms of how flexibly each adjusted their choices to changes in temporal, spatial and reward structure. The more flexible a subject, the more they replayed trajectories during task performance, and this replay was coupled with re-planning of the encoded trajectories. The less flexible a subject, the more they replayed previously preferred trajectories during rest periods between task epochs. The data suggest that online and offline replay both participate in planning but support distinct decision strategies.

eLife digest

Studies show that humans and animals replay past experiences in their brain. To do this, the brain creates a pattern of electrical activity for each part of a multistep experience and then plays them back in order. Humans and other animals can replay scenarios either while the experience is still happening (i.e. online replay) or later when they are resting or sleeping (i.e. offline replay). Being able to replay an experience and its outcome may help a person or animal plan a better course of action in the future. However, it is poorly understood how online and offline replay each contribute to such planning.

To answer this question, Eldar et al. used a brain imaging tool called magnetoencephalography (MEG for short) to measure the electrical activity inside the brain. This technique was able to detect replays in the brain of individuals performing a particular task, and later whilst they were resting.

In the experiments, 40 healthy volunteers played a game in which each location in a space was associated with an image, for example a frog or a traffic sign, and each image was given a value. Participants got paid for moving to more valuable images in one or two steps. Eldar et al. found that people who replay their steps during a task are able to adjust their choices on the fly, whereas individuals who replay their choices during rests tend to approach a task with a less flexible, more preformed plan.

Eldar et al. suggest that replaying an experience too much during rest and not enough in real-time might contribute to more rigid behaviors, a theory that could shed light on the mechanisms behind certain behavioral disorders such as obsessive compulsive disorder. However, more studies are needed to determine if these two different replay strategies play a causal role in human behavior.

Introduction

Online and offline replay are both suggested to contribute to decision making (Behrens et al., 2018; Diba and Buzsáki, 2007; Foster, 2017; Foster and Wilson, 2006; Gupta et al., 2010; Ji and Wilson, 2007; Kurth-Nelson et al., 2016; Louie and Wilson, 2001; Ólafsdóttir et al., 2017; Pezzulo et al., 2014; Skaggs and McNaughton, 1996; Ólafsdóttir et al., 2015; Stachenfeld et al., 2017), but their precise contributions remain unclear. Replay of experienced and expected state transitions during a task, either immediately before choice or following outcome feedback, is particularly well suited to mediate on-the-fly planning, where choices are evaluated based on the states to which they lead (this is known as model-based planning). Off-task replay might serve a complementary role of consolidating a model of a state space, which specifies how each state can be reached from other states as well as the values of each state. According to this perspective, both types of replay help subjects make choices that are flexibly adapted to current circumstances.

However, an alternative possibility is that off-task replay also directly participates in planning, by calculating and storing a (so-called model-free) decision policy that specifies in advance what to do in each state (Gershman et al., 2014; Mattar and Daw, 2018; Momennejad et al., 2018; Sutton, 1991). Such a pre-formulated policy is inherently less flexible than a policy that is constructed on the fly, but at the same time it decreases a need for subsequent online planning when time itself might be limited. Thus, rather than online and offline replay both supporting the same form of planning, this latter perspective suggests a trade-off between them. In other words, online replay promotes on-the-fly (model-based) flexibility, whereas offline replay establishes a stable (model-free) policy.

Despite the wide-ranging behavioural implications of a distinction between model-based and model-free planning (Crockett, 2013; Everitt and Robbins, 2005; Gillan et al., 2017; Kurdi et al., 2019), and much theorising on the role of replay in one or the other form of planning, to date there is little data indicating whether online and offline replay have complementary or contrasting impacts in this regard. However, recent advances in magnetoencephalography (MEG) analysis have made it possible to study replay in human subjects in relation to learning and decision-making behaviour (Eldar et al., 2018; Kurth-Nelson et al., 2016; Liu et al., 2019). This methodology involves three key steps. First, MEG signals are recorded before, during, and after the subject performs a task of interest. Second, the MEG time series are decoded so as to estimate a moment-by-moment probability a subject is neurally representing each task element, even when the sensory processing associated with that element has ceased. This is the minimal requirement for replay. However, a special feature of replay is the coordination between the non-sensory neural representations of multiple, related, task elements. Thus, finally, the relationships between elements’ representational probability time series are examined to determine whether pairs of elements tended to be represented sequentially, one after the other.

Here, we use this methodology to test the relationship between both online and offline replay and key aspects of decision flexibility that dissociate model-free and model-based forms of planning (Daw et al., 2011). For this purpose, we first recorded MEG signals from human subjects during rest and while they navigated a specially designed state space. We next characterised each individual subject’s decision-making flexibility based on their task behaviour, and we then analysed their MEG signals to look for evidence of on-task (Eldar et al., 2018) and off-task (Kurth-Nelson et al., 2016; Liu et al., 2019) sequences of state representations.

Results

Individual differences in decision flexibility

We used distinct visual images to represent eight unique states, where occupancy of each state provided a different amount of reward (Figure 1a). Subjects started each trial at a random state and had to choose a movement direction in order to collect reward from subsequent states (Figure 1b). Subjects learnt beforehand how much reward was associated with each state (Figure 1—figure supplement 1), but they did not know initially where states were in relation to one another. The latter aspect of task structure had to be acquired through trial and error learning in order for subjects to be able to implement subsequent moves that delivered the maximal amount of reward. We assessed subjects’ flexibility in three ways. First, after the initial two blocks of trials, we changed the reward associated with each state (Figure 1a; grey numbers) such that persisting with previously optimal moves would result in below-chance performance. Second, after two additional blocks of trials, we informed subjects that two specified pairs of states had switched positions (Figure 1a; ‘switch’), again rendering the previously optimal policy now suboptimal. A flexible model-based planner would be capable of re-planning their moves perfectly following each of these instructed changes, since such a planner has acquired knowledge as to how each state can be reached. Conversely, a pure model-free planner would need to learn a new policy from scratch via trial and error each time there is a change, since such an agent only possesses a now outdated policy that specifies where to move from each state.

Figure 1 with 4 supplements see all

Download asset Open asset

Subjects differed in decision flexibility.

(a) Experimental task space. Before performing the main task, subjects learned state-reward associations (numbers in black circles) and they were then gradually introduced to the state space in a training session. After performing the main task for two blocks of trials, subjects learned new state-reward associations (numbers in dark grey circles) and then returned to the main task. Before a final block of trials, subjects were informed of a structural task change such that ‘House’ switched position with ‘Tomato’, and ‘Traffic sign’ switched position with ‘Frog’. The bird’s eye view shown in the figure was never seen by subjects. Subjects only saw where they started from on each trial and, after completing a move, the state to which their move led. The map was connected as a torus (e.g., starting from ‘Tomato’, moving right led to ‘Traffic sign’, and moving up or down from the tomato led to ‘Pond’). (b) Each trial started from a pseudorandom location from whence subjects were allowed either one (‘1-move trial’) or two (‘2-move trial’) consecutive moves (signalled at the start of each set of six trials), before continuing to the next trial. Outcomes were presented as images alone, and the associated reward points were not shown. A key design feature of the map was that in 5 out of 6 trials the optimal (first) move was different depending on whether the trial allowed for one or two moves. For instance, given the initial image-reward associations (black) and image positions, the best *single* move from ‘Face’ is LEFT (9 points), but when two moves are allowed the best moves are RIGHT and then DOWN (5+9 giving 15 total points). Note that the optimal moves differed also given the second set of image-reward associations. On ‘no-feedback’ trials (which started all but the first block), outcome images were also not shown (i.e., in the depicted trials, the ‘Wrench’, ‘Tomato’ and ‘Pond’ would appear as empty circles). (c) The proportion of obtainable reward points collected by the experimental subjects, and by three simulated learning algorithms. Each data point corresponds to 18 trials (six 1-move and twelve 2-move trials), with 54 trials per block. The images to which subjects moved were not shown to subjects for the first 12 trials of Blocks II to V (the corresponding ‘Without feedback’ data points also include data from 6 initial trials with feedback wherein starting locations had not yet repeated, and thus, subjects’ choices still reflected little new information). All algorithms were allowed to forget information so as to account for post-change performance drops as best fitted subjects’ choices (see Materials and methods for details). Black dashed line: chance performance. Shaded area: SEM. (d) Proportion of first choices that would have allowed collecting maximal reward where one (‘1-optimal’) or two (‘2-optimal’) consecutive moves were allowed. Choices are shown separately for what were in actuality 1-move and 2-move trials. Subjects are colour coded from lowest (gold) to highest (red) degree of flexibility in adjusting to one vs. two moves (see text). Dashed line: chance performance (33%, since up and down choices always lead to the same outcome). (**e,f**) Decrease in collected reward following a reward-contingency (e) and spatial (f) change, as a function of the index of flexibility (IF) computed from panel d. Measures are corrected for the impact of pre-change performance level using linear regression. $p$ value derived using a premutation test.

Examining how subjects’ overall performance altered immediately following these changes revealed a decrement in average performance (Figure 1c). However, there were substantial individual differences in this regard, with some subjects seamlessly adapting to reward and position changes, and others showing drops in performance to chance levels. Subjects whose performance showed a strong decline following a reward change tended to cope poorly also with the position change ( $ρ = 0.50$ , partial correlation controlling for performance levels before the changes; $p = 0.001$ , Permutation test).

As a third, more continuous, test of a different aspect of decision flexibility, we interleaved sets of six trials in which only a single move was allowed (‘1-move trials’) with trials which allowed two consecutive moves (‘2-move trials’; Figure 1b). In 2-move trials, subjects were rewarded for both states they visited, and thus, an optimal course of action often required subjects to move first to an initial low-reward state in order to gain access to a high-reward state with their second move. Thus, we defined an individual index (IF) of decision flexibility as the difference between the proportion of moves that were optimal given the actual number of allotted moves and the proportion of moves that would have been optimal given a different number of allotted moves (i.e., had 1-move trials instead involved two moves and 2-move trials involved one move). An IF value of zero implies no net adjustment, while positive IF values imply advantageous flexibility.

The results indicate subjects adjusted their choices advantageously to the number of allotted moves (+0.21, SEM 0.05, p < 0.001, Bootstrap test), though there was evidence again of substantial individual differences (Figure 1d). Importantly, IF correlated with how well a subject coped with the reward-contingency (Figure 1e) and position (Figure 1f) changes as well with how accurately they could sketch maps of the state space at the end of the experiment ( $r = 0.51$ , $p < 0.001$ , Permutation test; Figure 1—figure supplement 2), indicating these subjects acquired and utilised a model of the state space.

Planning two steps into the future

Having a cognitive model that specifies how states are spatially organised makes it possible to plan several steps into the future. To test whether subjects were able to do that, we challenged subjects with 12 ‘without-feedback’ trials at the beginning of each of the last four blocks, during which outcome images were not shown. This meant that in 2-move trials subjects had to choose their second move ‘blindly’, without having seen the image to which their previous move had led (e.g., the tomato in Figure 1b). We found that subjects performed above chance on these blind second moves (proportion of optimal choices: 0.56, SEM 0.03; chance = 0.45; p<0.001, Bootstrap test), and this was the case even immediately following position and reward changes, when subjects could not have relied on previously tested 2-move sequences (0.52, SEM 0.03; p=0.01, Bootstrap test). Most importantly, such blind-move success was correlated with IF (Figure 1—figure supplement 3a).

This result indicates that more flexible subjects were better able to plan two steps into the future when required. Examining response times suggested flexibility was associated with advance planning also when it was not required. Thus, we found that IF correlated with quicker execution of second moves in general (Spearman correlation with median reaction time: $r = - 0.61$ , $p < 0.001$ , Permutation test). To determine whether advance planning was indeed generally associated with flexibility, we examined at what point during a trial their choices became decodeable from MEG signals. For this purpose, we trained a decoder to decode chosen moves from MEG signals recorded outside of the main task (see Materials and methods for details). Validating the decoder on MEG data from the main task showed that chosen moves became gradually more evident over the course of the trial, their decodability peaking 140 ms before a choice was made (Figure 1—figure supplement 3b).

Thus, we used the move decoder to test whether second-move choices began to materialise in the MEG signal even before subjects observed the outcomes of their first moves. We found that chosen second moves were indeed decodeable already during first-move choices (decodability: $M =$ 0.006, 95% Credible Interval = 0.004 to 0.008, Bayesian Gaussian Process analysis; Figure 1—figure supplement 3c) and prior to the appearance of the first outcome (decodability: $M = 0.004$ , 95% Credible Interval = 0.002 to 0.006; Figure 1—figure supplement 3d). Importantly, this early decodability was correlated with IF ( $β$ : $M = 0.29$ , 95% Credible Interval = 0.24 to 0.34). By contrast, later decodability, following the onset of the second image, did not correlate with IF ( $β$ : $M = 0.02$ , 95% Credible Interval = −0.02 to 0.05). Thus, neural and behavioural evidence concur with the notion that flexibility was associated with planning second moves prospectively.

Individual flexibility reflected MF-MB balance

These convergent results suggest that IF reflected deployment of a model-based planning strategy. To test this formally, we compared how well different model-free (MF) and model-based (MB) decision algorithms, as well as a combination of both, explained subjects’ choices. Importantly, we enhanced these algorithms to maximise their ability to mimic one another (see Materials and methods for details). Thus, for instance, the MF algorithm included separate 1-move and 2-move policies, which allow it to achieve optimal adjustment to trial type given sufficient experience.

We found that a hybrid of MF and MB algorithms outperformed substantially either of them alone (Bayesian Information Criterion [Bishop, 2006]: MF = 40821, MB = 43249, MF-MB hybrid = 39908), suggesting that subjects employed a mix of model-free and model-based planning strategies. Simulating task performance using the hybrid algorithm showed it captured adequately behavioural differences evident between subjects (correlation between real and simulated IF: $r = 0.92$ , $p < 0.001$ , Permutation test; Figure 1—figure supplement 4a). When we examined each subject’s best-fitting parameter values, to determine which of these covaried with IF, we found 84% of inter-individual variance in IF was explained by three parameters that control a balance between model-based and model-free planning (Figure 1—figure supplement 4b). Importantly, less flexible subjects had comparable learning rates and a higher model-free inverse temperature parameter (in 2-move trials), indicating that lower flexibility did not reflect a non-specific impairment, but rather, it was associated with enhanced deployment of a model-free algorithm. Thus, our index of flexibility specifically reflected the influence of model-based, as compared to model-free, planning.

On-task replay is induced by prediction errors and associated with high flexibility

In rodents, reinstatement of past states, potentially in the service of planning, is evident both prior to choices (Pfeiffer and Foster, 2013) and following observation of outcomes (Pezzulo et al., 2014). Thus, we determined firstly at what point states were neurally reinstated during our task. For this purpose, we trained MEG decoders to identify the images subjects were processing (Figure 2a). Such decoders robustly reveal stimulus representations that are reinstated from memory and contribute to decision processes (Eldar et al., 2018; Kurth-Nelson et al., 2015; Bishop, 2006; Pfeiffer and Foster, 2013). Crucially, image decoders were trained on MEG data collected prior to subjects having any knowledge about the task, ensuring that the decoding was free of confounds related to other task variables (Figure 2—figure supplement 1). Applying these decoders to MEG signals from the main task, we found no evidence of prospective representation of outcome states (images) to which subjects will transition at choice (Figure 2—figure supplement 2a). Instead, we found strong evidence that following outcomes (corresponding to new states to which subjects had transitioned), subjects represented the states from which they had just moved ( $\bar{t} = 3.4$ , $p = 0.001$ , Permutation test; Figure 2—figure supplement 2b). Consequently, we examined in detail the MEG data recorded following each outcome for evidence of replay of state sequences that subjects had just traversed.

Figure 2 with 2 supplements see all

Download asset Open asset

Sequenceness analysis.

(a) Validation of the image MEG decoder used for the sequenceness analyses. $n$ = 40 subjects. The plot shows the decodability of starting images from MEG data recorded during the main task at trial onset. Decodability was computed as the probability assigned to the starting image by an 8-way classifier based on each timepoint’s spatial MEG pattern, minus chance probability (0.125). (b) Schematic depiction of the sequenceness analysis used to determine whether representational probabilities of pairs of elements followed one another in time. Sequenceness is computed as the difference between the cross-correlation of two time series with positive and negative time lags. Since it focuses on asymmetries in the cross correlation function, this measure is useful for detecting sequential relationships even between closely correlated (or anti-correlated) time series. Negative sequenceness indicates signals are ordered in reverse.

To test for evidence of replay, we applied a measure of ‘sequenceness’ to the decoded MEG time series, a metric we previously showed is sensitive in detecting replay of experienced and decision-related sequences of states (Eldar et al., 2018; Kurth-Nelson et al., 2016; Liu et al., 2019; Figure 2b). Importantly, sequenceness is not sensitive to simultaneous covariation, and thus, it is only found if stimulus representations follow one another in time (Eldar et al., 2018; as in previous work, we allowed for inter-stimulus lags of up to 200 ms). Thus, following each outcome, we computed sequenceness between the decoded representations of the preceding and the outcome state (Figure 3a). Additionally, MEG signals recorded following the second outcome in 2-move trials were also tested for sequenceness reflecting the trial’s first transition (i.e., between the starting state and first outcome; Figure 3b).

Figure 3

Download asset Open asset

On-task replay of state-to-state trajectories as a function of individual flexibility.

$n$ = 40 subjects. (a) Sequenceness corresponding to a transition from the image the subject had just left (‘Start image’; in the cartoon at the bottom, the face) to the image to which they arrived (‘outcome image’; the tomato) following highly surprising outcomes (i.e., above-mean state prediction error). In the cartoon, the white arrow indicates the actual action taken on the trial; the blue arrow indicates the sequence that is being decoded. For display purposes alone, mean time series are shown separately for subjects with high (above median) and low (below median) IF. Positive sequenceness values indicate forward replay and negative values indicate backward replay. As in previous work (Eldar et al., 2018), sequenceness was averaged over all inter-image time lags from 10 ms to 200 ms, and each timepoint reflects a moving time window of 600 ms centred at the given time (e.g., the 1 s timepoint reflects MEG data from 0.7 s to 1.3 s following outcome). Dashed lines show mean data generated by a Bayesian Gaussian Process analysis, and the dark gray bars indicate timepoints where the 95% Credible Interval excludes zero and Cohen’s d > 0.1. The top plot shows IF as a function of sequenceness for the timepoint where the average over all subjects was maximal. $p$ value derived using a premutation test. Dot colours denote flexibility rank. (b) Sequenceness following the conclusion of 2-move trials corresponding to a transition from the starting image to the first outcome image. (c) Difference in the probability of subsequently choosing a different transition as a function of sequenceness recorded at the transition’s conclusion. For display purposes only, sequenceness is divided into high (i.e., above mean) and low (i.e., below mean). A correlation analysis between sequenceness and probability of policy change showed a similar relationship (Spearman correlation: $M = - 0.04$ , $S E M = 0.02$ , $p = 0.04$ , Bootstrap test). Sequenceness was averaged over the first cluster of significant timepoints from panels b) and c), in subjects with non-negligible inferred sequenceness (more than the standard deviation divided by 10; $n = 25$ ), for the first time the subject chose each trajectory. Probability of changing policy was computed as the frequency of choosing a different move when occupying precisely the same state again. 0 corresponds to the average probability of change (51%). Error bars: s.e.m.

Using an hierarchical Bayesian Gaussian Process approach (see Materials and methods for details) we tested for timepoints at which sequenceness was evident and correlated with individual flexibility. This method directly corrects for comparison across multiple timepoints by accounting for the dependency between them (Kruschke, 2014). Since replay is thought to be induced by surprising observations (Mattar and Daw, 2018; Momennejad et al., 2018; Moore and Atkeson, 1993; Peng, 1993), we also included surprise about the outcome (i.e. the state prediction error inferred by the hybrid algorithm) as a predictor of sequenceness. We found significant sequenceness encoding the last experienced state transition (from 50 to 330 ms and from 820 to 950 ms following outcome onset; Figure 3a; note that the median split is for display purposes alone; actual analyses depended on the continuous flexibility index) and, at the conclusion of 2-move trials, also the penultimate transition (from 130 ms before to 350 ms following outcome onset; Figure 3b). These sequences were accelerated in time, with an estimated lag of 130 ms between the images, and were encoded in a ‘forward’ direction corresponding to the order actually visited. Moreover, later in the post-outcome epoch, the penultimate transition was also replayed backwards (from 440 to 940 ms following outcome onset). As would be expected, the finding of sequenceness was associated with enhanced decoding of previously visited states during the post-outcome epoch (Figure 2—figure supplement 2c).

More importantly, we found this evidence of replay, across all timepoints, was correlated with IF (mean $β = 0.17$ , 95% Credible Interval = 0.13 to 0.20), with surprise about the outcome (mean $β = 0.06$ , CI = 0.03 to 0.10), and with the interaction of these two factors (mean $β = 0.19$ , CI = 0.15 to 0.22). Thus, sequenceness was predominantly evident following surprising outcomes in subjects with high index of flexibility. This result is consistent with online replay contributing to post-outcome model-based planning.

On-task replay is associated with changes of policy

Recent theorising regarding the role of replay in planning argues that replay is preferentially induced when there is benefit to updating one’s policy (Mattar and Daw, 2018). Although policy updates can either reinforce or inhibit a chosen move, in our experiment, subsequently avoided choices in our experiment were more likely followed by a policy update than subsequently repeated choices, since a substantial proportion of the latter already reflected a well-informed policy (as evidenced by subjects’ above chance performance in Figure 1c). Thus, a role for replay in planning predicts that subjects should be more disposed to replay trajectories that they might not want to choose again, rather than trajectories whose choice reflects a firm policy. To determine whether decodable on-task replay was associated with behavioural change, we tested the relationship between sequenceness corresponding to each move that subjects chose, and the probability of making a different choice when occupying the same state later on. We found that moves after which high forward sequenceness was evident corresponded to moves that were less likely to be re-chosen subsequently (Figure 3c). These policy changes increased the proportion of obtained reward ( $M = + 11.1 %$ , $S E M = 1.5 %$ , $p = 0.001$ ). Thus, we found evidence of online replay for a chosen trajectory was coupled with advantageous re-evaluation of one’s choice.

Off-task replay is induced by prediction errors and associated with low flexibility

We next studied off-task replay, examining MEG data recorded during the 2-minute rest period that preceded each experimental block. Since each block included five frequently repeating starting states, we computed sequenceness for the five most frequent image-to-image transitions subjects chose before and after each rest period (mean choice frequency = 8.4 repetitions per block). As a control analysis, we also examined sequenceness for the five least frequently chosen transitions from the same starting states (mean choice frequency = 1.0 repetitions per block). We found significant evidence for sequenceness throughout the rest periods for frequent transitions ( $M$ = 0.002, $S E M$ = 0.001, $p = 0.01$ , Bootstrap test). By contrast, we found no evidence of sequenceness for the infrequent transitions ( $M$ < 0.001, $S E M$ = 0.001, $p = 0.47$ , Bootstrap test). Frequent transitions were replayed in a forward direction, with an estimated time lag of 180 ms between images, and prioritised trajectories that induced more reward prediction errors in the previous block (correlation of sequenceness with sum of absolute model-free reward prediction errors inferred by the hybrid algorithm: $M =$ 0.04, $S E M = 0.018$ , $p = 0.03$ , Bootstrap test). Most importantly, off-task sequenceness correlated negatively with IF (Figure 4). This association of sequenceness during rest with low flexibility is consistent with a proposed role for offline replay in establishing model-free policies (Gershman et al., 2014; Mattar and Daw, 2018; Momennejad et al., 2018; Sutton, 1991).

Figure 4

Download asset Open asset

Off-task replay of past and future trajectories.

$n$ = 40 subjects. Individual flexibility as a function of sequenceness in rest MEG data for the five most frequently experienced image-to-image transitions. For each rest period, sequenceness was averaged over transitions from both the preceding and following blocks of trials. $p$ value derived using a premutation test. Dot colours denote flexibility rank.

Off-task replay can predict subsequently chosen sequences

If offline replay is involved in planning, then its content should predict subjects’ subsequent choices. To test this, we dissociated the replay of experienced trajectories from that of planned trajectories, by focusing on the third rest period after which the optimal image-to-image transitions changed entirely (due to a change in state-reward associations; see Figure 1). As subjects had been taught about the reward change before this rest period, the rest afforded subjects an opportunity to re-plan their choices accordingly.

We first examined the behavioural effect of the state-reward change in more detail. The most frequently chosen transitions in the block that followed the third rest period differed from the transitions most frequently chosen in the preceding block (overlap: $M = 14 %$ , $S E M = 3 %$ ), and this policy change was substantially greater than for the other rest periods (overlap: $M = 53 %$ , $S E M = 2 %$ ). As expected, the newly chosen transitions from the following block were advantageous given the new state-reward associations (reward collected: $M = 71 %$ , $S E M = 2 %$ ; chance = 60%) and disadvantageous given the state-reward associations that had so far applied ( $M = 52 %$ , $S E M = 2 %$ ).

Given the behavioural change, we focused our examination of the MEG data on evidence for sequenceness during this crucial third rest period. We found that subjects indeed replayed the transitions they subsequently chose ( $M = 0.004$ , $S E M = 0.002$ , $p = 0.02$ , Bootstrap test). This replay of subsequently chosen moves indicates subjects utilised a model of the task to re-plan their moves offline (Momennejad et al., 2018; Gershman et al., 2014; Mattar and Daw, 2018). Our reasoning here is that re-planning in light of the new reward associations, before subjects experienced them in practice, requires a model that specifies how to navigate from one state to another. Indeed, multiple regression analysis showed that low IF was only associated with sequenceness encoding previously chosen transitions ( $β = - 0.35$ , $t_{37} = 2.25$ , $p = 0.03$ ), whereas the replay of subsequently chosen transitions did not correlate with IF ( $β = - 0.004$ , $t_{37} = 0.03$ , $p = 0.97$ ). On the other hand, the lack of a flexibility enhancement associated with prospective offline replay might indicate that, as might be expected, offline planning is ill-suited for enhancing trial-to-trial flexibility.

Discussion

We find substantial differences in the behaviour of individual subjects in a simple state-based sequential decision-making task that corresponds also to a distinction in the nature, and apparent effects, of MEG-recorded on- and off-task replay of state trajectories (Figure 5). These results bolster important behavioural dissociations, as well as provide substantial new insights into the control algorithms that subjects employ. The findings fit comfortably with an evolving literature that addresses human replay and preplay (Eldar et al., 2018; Kurth-Nelson et al., 2015; Kurth-Nelson et al., 2016; Liu et al., 2019; Schuck and Niv, 2019).

Figure 5

Download asset Open asset

Individual flexibility and evidence of replay.

The figure illustrates typical results for individuals with low (i.e., below median) and high (i.e., above median) flexibility. More flexible subjects advantageously adjusted their choices to the number of allotted moves. Such flexibility was associated with evidence of primarily forward replay following surprising outcomes, encoding the chosen transitions that led to those outcomes, and coupled with a reevaluation of those choices. Less flexible subjects showed evidence of forward replay during rest, encoding previously and subsequently chosen transitions.

The distinction between model-based and model-free reasoning has intuitive appeal, and close associations with many well-established psychological distinctions (Kahneman, 2011; Stanovich and West, 2000). However, popular tasks for investigating this distinction (Daw et al., 2011; Decker et al., 2016; Gillan et al., 2017; Gläscher et al., 2010) have been criticised for offering a better grasp on model-based compared to model-free reasoning processes (Gillan et al., 2015; da Silva and Hare, 2019); for rewarding model-based reasoning indifferently (Kool et al., 2016); and for admitting complex model-free strategies that can masquerade as being model-based (Akam et al., 2017). In our new task, we show a convergence between superficially divergent methods for distinguishing model-based and model-free methods – flexibility to immediate task demands (one-step versus two-step control), preserved performance in the face of changes in the location of rewards or structure, and an ability to reproduce explicitly, after the fact, the transition structure. Furthermore, the task effectively incentivises flexible model-based reasoning, as this type of reasoning alone allows collection of substantial additional reward (93%) compared to our most successful MF algorithm (80%). These convergent observations suggest that the model-based and model-free distinction we infer from our task rests on solid behavioural grounds.

In human subjects, there is a growing number of observations of replay and/or preplay of potential trajectories of states that are associated with the structure of tasks that subjects are performing (Kurth-Nelson et al., 2015; Kurth-Nelson et al., 2016; Schuck and Niv, 2019). However, it has been relatively hard to relate these replay events to ongoing performance. By contrast, there is evidence that rodent preplay has at least some immediate behavioural function (Gupta et al., 2010; Pfeiffer and Foster, 2013), and there are elegant theories for how replay should be optimally sequenced and structured in the service of planning. In particular, a recent normative model of replay, which aims to account for both online and offline events, suggests forward replay should prioritise trajectories on which the agent might soon re-embark, and backward replay should prioritise trajectories for which one’s policy can be improved (Mattar and Daw, 2018). Our results are broadly consistent with this normative perspective, showing that evidence of forward replay prioritises frequently chosen trajectories, whereas evidence of forward and backward replay follows new observations that can inform one’s policy, and indeed predicts appropriate changes in policy.

Two main aspects of our results deviate from this normative perspective. First, we primarily find forward sequenceness following outcomes. Second, rather than preplay immediately prior to choice, we found evidence of on-task replay following feedback alone. One possible explanation of the latter result is that replay before choice is more variable in speed and direction, which would make it more difficult to detect. However, this result might also indicate that upon observing an outcome, subjects immediately decided what moves to make next time they start from the same state. Such a strategy could be computationally expedient in that it minimises the need for retrieval and computation later on, during choice time, when a subject might be under time pressure. This suggests a third potential factor impacting on the timing and content of replay – the need to minimise memory load by embedding new information in ones’ policies as soon as it is received.

Critically, the timing and content of replay differed across individuals in a manner that links with their dominant mode of planning. More model-based subjects tended to replay trajectories during learning, predominantly reflecting choices they were likely to reconsider. There have been reports of preferential replay of deprecated trajectories in rodents (Gupta et al., 2010; Carey et al., 2019). However, those studies are consistent with a more general function for replay (e.g., maintaining the integrity of a map given a biased experience), whereas in our case, replay was closely related to future behaviour.

By contrast, the decodeable replay of more model-free subjects reinstated previously experienced trajectories during rest periods, when DYNA-like mechanisms (Sutton, 1991) are hypothesised to compile information about the environment to create an effective model-free policy. This replay of state-to-state transitions suggests that despite a general inability at the end of the task to draw a map accurately, model-free subjects do have implicit access to some form of model, though likely an incomplete one. In any case, the lack of association here between offline replay and ultimate winnings indicates that generating a policy offline might not be a good strategy for a task that requires trial-to-trial flexibility.

This aspect of the task might explain an apparent discrepancy with previous work on retrospective revaluation (Gershman et al., 2014; Momennejad et al., 2018), which indicates offline replay is associated with a greater degree of behavioural change. In the retrospective revaluation paradigm, behavioural change manifests between experimental phases that are separated in time by several minutes. The only way an algorithm like Dyna would be able to afford the trial-to-trial flexibility required in our task would be to furnish different stimulus-response mappings that can be summoned at will. It is possible such flexibility goes beyond the capabilities of any forms of DYNA that might be implemented in the brain.

Our work has a number of limitations. First, although our experimental design probes various facets of decision flexibility, it tests flexibility most extensively by interleaving 1- and 2-move trials. Modelling subjects’ choices shows this measure of flexibility captures individual differences in model-based and model-free planning. However, it is important to keep in mind that this measure does not capture all types of decision flexibility, as exemplified, for instance, by the different sort of flexibility that manifests in a retrospective revaluation paradigm. Second, our experiment was not ideally suited for inducing compound representations that link states with those that succeed them, since succession here changed frequently both within and between blocks. However, algorithms that utilise such representations mimic both model-free and model-based behaviour, and future work could utilise our methods to investigate whether and how these algorithms are aided by online and offline forms of replay (Russek et al., 2017). Third, the sequenceness measure that we use to determine replay suffers from a restriction of comparing forwards to backwards sequences. There is every reason to expect both forwards and backwards sequences co-exist, so focusing on a relative predominance of one or the other is likely to provide an incomplete picture. The problem measuring forwards and backwards replay against an absolute standard is the large autocorrelation in the neural decoding, and better ways of correcting for this are desirable in future studies. Nevertheless, despite these shortcomings the work we report is a further step towards revealing the rich and divergent structure of human choice in sequential decision making tasks.

Materials and methods

Subjects

40 human subjects, aged 18–33 years, 25 female, were recruited from a subject pool at University College London. Exclusion criteria included age (younger than 18 or older than 35), neurological or psychiatric illness, and current psychoactive drug use. To allow sufficient statistical power for comparisons between subjects, we set the sample size to roughly double that used in recent magnetoencephalography (MEG) studies on dynamics of neural representations (Hunt et al., 2012; Kurth-Nelson et al., 2015), and in line with our previous study of individual differences using similar measurements (including ‘sequenceness’; Eldar et al., 2018). Subjects received monetary compensation for their time (£20) in addition to a bonus (between £10 and £20) reflecting how many reward points subjects earned in the experiment task. The experimental protocol was approved by the University of College London local research ethics committee, and informed consent was obtained from all subjects.

Share this article

Cite this article

Subjects differed in decision flexibility.

Sequenceness analysis.

On-task replay of state-to-state trajectories as a function of individual flexibility.

Off-task replay of past and future trajectories.

Individual flexibility and evidence of replay.

Author details

Eran Eldar

Contribution

For correspondence

Competing interests

Gaëlle Lièvre

Contribution

Competing interests

Peter Dayan

Contribution

Contributed equally with

Competing interests

Raymond J Dolan

Contribution

Contributed equally with

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism