1. Neuroscience
Download icon

Offline replay supports planning in human reinforcement learning

  1. Ida Momennejad  Is a corresponding author
  2. A Ross Otto
  3. Nathaniel D Daw
  4. Kenneth A Norman
  1. Princeton University, United States
  2. McGill University, Canada
Research Article
  • Cited 5
  • Views 2,175
  • Annotations
Cite this article as: eLife 2018;7:e32548 doi: 10.7554/eLife.32548

Abstract

Making decisions in sequentially structured tasks requires integrating distally acquired information. The extensive computational cost of such integration challenges planning methods that integrate online, at decision time. Furthermore, it remains unclear whether ‘offline’ integration during replay supports planning, and if so which memories should be replayed. Inspired by machine learning, we propose that (a) offline replay of trajectories facilitates integrating representations that guide decisions, and (b) unsigned prediction errors (uncertainty) trigger such integrative replay. We designed a 2-step revaluation task for fMRI, whereby participants needed to integrate changes in rewards with past knowledge to optimally replan decisions. As predicted, we found that (a) multi-voxel pattern evidence for off-task replay predicts subsequent replanning; (b) neural sensitivity to uncertainty predicts subsequent replay and replanning; (c) off-task hippocampus and anterior cingulate activity increase when revaluation is required. These findings elucidate how the brain leverages offline mechanisms in planning and goal-directed behavior under uncertainty.

https://doi.org/10.7554/eLife.32548.001

Introduction

To make choices in structured environments, human and non-human animals must forecast the long-term consequences of candidate actions, which often are delayed and play out serially over multiple steps. Predicting these outcomes often requires integrating information from disparate learning episodes, which occurred at different times in the past. In other words, successful planning requires both (a) accessing memories of multiple previous events and (b) integrating these memories to make useful predictions about an action’s consequences: a type of integrative memory replay (Lengyel and Dayan, 2008). Indeed, both human and non-human animals can solve reward learning tasks designed to require this sort of integration, such as latent learning tasks (Tolman, 1948) in which information about the outcome contingencies of actions (e.g., the spatial layout of a maze) is presented in a separate stage of training from the associated rewards (e.g. receipt of food).

Little is known, however, about how and when such integrative planning occurs. A longstanding assumption, dating back to Tolman’s proposal in 1948 (Tolman, 1948), is that planning unfolds at decision time, for example during mental simulation of the candidate future action trajectories. More recent proposals suggest that such planning need not occur only when the animal is presented with a choice, but can also occur earlier, during encoding of choice-relevant information (Wimmer and Shohamy, 2012; Shohamy and Daw, 2015). In particular, when new information is first learned (e.g. food discovered in some location), its consequences for distal actions (e.g. a high value for routes leading there) can be inferred and immediately encoded. Indeed, human fMRI work provides support for both mechanisms: Studies have found that people’s ability to solve integrative evaluation problems depends on memories of distal task states accessed either at the time of choice (Doll et al., 2015) or encoding (Wimmer and Shohamy, 2012).

What the encoding- and choice-time mechanisms share is that both occur during active task performance, which artificial intelligence refers to as learning online (Sutton, 1991). However, during both encoding/learning and decision-making periods, people often have limited time for exploring the set of interrelationships among decisions and outcomes, which may be prohibitive in complex environments. A third, more general possibility, which we examine here, is that integration of memories to support action evaluation can also occur offline, such as during off-task rest periods or even sleep. An advantage of offline planning is that it offloads computational costs to periods that are not already occupied by task behavior. This way, before the organism faces a problem that requires a decision (but after it has initially encoded the constituent events), it can update action valuations or choice policies offline, thereby accomplishing at least part of the evaluation required for later choice (Gershman et al., 2014).

Here we investigate integrative memory processes in humans that help piece together information offline, thus supporting planning and improving decision-making behavior. In the formalism of reinforcement learning in artificial intelligence, the Dyna family of algorithms implements offline planning, which is accomplished by learning from replay of experiences that were initially encoded during online behavior (Van Seijen and Sutton, 2015; Sutton, 1991). Inspired by this work, and together with behavioral evidence of offline revaluation (Gershman et al., 2014; Momennejad et al., 2017), computational models suggest that Dyna-like offline replay mechanisms might be implemented in the brain (Momennejad et al., 2017; Johnson and Redish, 2005; Ludvig et al., 2017; Russek et al., 2017). Indeed, this hypothesis fits closely with a substantial rodent literature revealing that hippocampal replay of previously experienced spatial trajectories is observed during both awake behavior and sleep (Takahashi, 2015; Ólafsdóttir et al., 2015).

It has also been shown that hippocampal replay (Pfeiffer and Foster, 2013) captures the topological structure of the task environment (Wu and Foster, 2014), and hippocampal reverse replay of trajectories occurs offline when the animal experiences changes in reward (prediction errors) at the goal state (Ambrose et al., 2016a). In parallel, human fMRI research work reveals how memories are reactivated offline, such as during periods of rest that follow rewarding episodes (Gruber et al., 2016; Murty et al., 2016). This offline reactivation could potentially aid in memory consolidation and integration, evidenced by enhanced subsequent memory performance (Takashima et al., 2009; Staresina et al., 2013; Tambini and Davachi, 2013). Consistent with the hypothesis that memory integration can be accomplished via offline replay, there is behavioral evidence that manipulations affecting rest periods influence humans’ ability to solve integrative decision tasks (Gershman et al., 2014). However, so far, no animal or human study has provided neural evidence that integration during offline replay supports planning.

Our main question in this study is: does replay predict a change in choice behavior? In this study, we use fMRI and multivariate pattern analysis (MVPA) to test the hypothesis, outlined above, that offline replay drives integrative re-evaluation and update of past policies. We designed a paradigm where such integration would be reflected in the behavioral choices participants make. It has been shown that replay of a particular episode can lead to the reinstatement of neural activity in sensory cortical regions (Ji and Wilson, 2007; Polyn et al., 2005). Accordingly, we use MVPA here to track cortical reinstatement of a particular state during off-task rest periods, as a measure of offline replay of that state. We then examine the relationship between neural replay evidence (from MVPA) and the change in subsequent choice behavior, to test whether replay facilitates the brain’s ability to integrate information and update planning policies accordingly.

Having found evidence for this relationship between replay and replanning, our secondary question in this study was to explore the antecedents of this process: What determines which memories are replayed? The idea that integrative replay occurs offline (rather than being triggered by particular task events) raises the question of prioritization: Out of all the episodes stored, which memories should the brain replay to ensure integrative and goal-directed learning? What is it about an experience that privileges it for later replay? Here we draw insight again from artificial intelligence, where the standard approach for this problem is an algorithm known as prioritized sweeping (Moore and Atkeson, 1993; Peng and Williams, 1993). The key insight behind this algorithm is that when surprising information is encountered at some state, it may have consequences for the value of states or actions that could have led there, and therefore those predecessors should be prioritized for replay. One representative and standard signal of such surprise or uncertainty, which is leveraged as a heuristic to trigger replay in prioritized sweeping, is the experience of prediction errors—that is the deviation between the agent’s predictions and experiences of the world. To this end, both negative and positive prediction errors (i.e. the unsigned error) are meaningful as they signify the need for improved value estimates. To test this hypothesis, we use model-based fMRI analysis to evaluate whether, across blocks and subjects, larger neural correlates of prediction errors are associated with greater replay during subsequent rest periods, and ultimately greater change in behavior.

In short, our major proposal is that offline neural replay drives the integration of distal memories to update previously learned representations and support future planning. Our secondary question is which earlier events predict replay and ultimately replanning. We hypothesize that episodes marked with large prediction errors (PEs), whether positive or negative, signal uncertainty about the optimal reward trajectory and tag related memories for future replay. This replay, in turn, enables the integration of separate memories in the service of updating optimal reward policy (action policies that maximize reward). We combine model-based fMRI, MVPA, and a sequential decision-making task to test our hypotheses.

Results

To test the hypothesis that offline replay supports integration of memories to serve future decisions, we operationalized planning using a reward revaluation paradigm (Gershman et al., 2014; Momennejad et al., 2017). In reward revaluation, participants first learn the multi-step sequential decisions that lead them to reward from a starting point; later, they experience a local change to later stages of the decision trajectory. Upon return to the beginning of the trajectory, we can measure the extent to which a participant’s choices adapt to the new local changes they experienced.

Here we used a 2-stage sequential decision task with a within-subject design that allowed us to manipulate whether or not revaluation was required for reward-maximizing choice behavior (Gershman et al., 2014; Momennejad et al., 2017). In both the revaluation and control conditions, participants (n = 24) repeatedly visited a starting stimulus (Stage I) and chose between two actions that would lead them to one of two subsequent stimuli depending on their choice (Stage II), as displayed in Figure 1. Once a participant reached either of those Stage II stimuli (or states), they selected one of two actions to obtain reward. We will refer to the Stage I stimulus as state one and the two Stage II stimuli as state 2 and state 3. As a cover story, participants were told they were playing a game as a thief, exploring different buildings to steal the most money. Participants were given four separate blocks of this task, each with a different reward structure; half these blocks were randomly assigned to the revaluation condition, and half the blocks were assigned to the control condition, as described below. To ensure that participants did not transfer what they learned from one block to another, each block had a different colored background that indicated they were in a different city, and different stimuli for the Stage I and Stage II actions (Figure 1—figure supplement 1).

Figure 1 with 1 supplement see all
(A) Task design Each block had three phases.

During the Learning phase, participants explored a 2-stage environment via button presses to learn optimal reward policy. During the Relearning phase, they only explored the Stage II states: on revaluation trials the optimally rewarding Stage II state changed during the Relearning phase; on control blocks the mean rewards did not change. The final phase was the Test phase: participants were asked to choose an action from the starting (Stage I) state that would lead to maximal reward. The red arrow denotes the action that provides access to the highest reward in the learning phase. The optimal policy during the Learning phase remains optimal in the control condition but is the suboptimal choice in the revaluation condition. (B) The time course of an example block. Stimuli images were downloaded from the public domain via Google images.

https://doi.org/10.7554/eLife.32548.002

The task in both revaluation and control blocks consisted of three phases: a Learning phase (where the participant explored the entire decision tree), a Relearning phase (in which the participant visited Stage II but never Stage I), and a Test phase (where participants revisited Stage I). During the Learning phase, participants learned the structure of states and rewards.

During the Relearning phase, the rewards associated with Stage II stimuli changed for the revaluation blocks (but not for the control blocks). The new reward amounts in revaluation blocks were chosen so as to require a change in the decision policy leading to optimal reward in Stage I (e.g., during the Learning phase, the optimal decision policy might be to go left at Stage I, but after taking into account the Stage II reward changes experienced during Relearning, the new optimal decision policy might be to go right at Stage I).

The Test phase assayed whether participants reversed their initial preference according to the Stage II changes they experienced in the Relearning phase (in control blocks we did not expect any change in preference to occur). Note that a change in Stage I policy preference would require integrating information about the action-stimulus contingencies from the Learning phase with the re-learned reward amounts associated with Stage II states from the Relearning phase. The Test phase choice was repeated four times; we computed the average proportion correct across these four choices. Importantly (to prevent any influence of new learning during the test phase), participants were not given any feedback about the consequent Stage II stimulus and reward after these choices.

Participants were incentivized to search for and choose higher-value states. They were explicitly instructed prior to the learning phase to find the highest monetary reward and navigate there in order to increase their bonus money. The instructions below appeared prior to the learning phase of every city/MDP during the experiment in order to reiterate this incentive (the names of the city and location were adjusted for each session):

"In this game you will be robbing a theater in Berlin.

By learning how to get around the theater, and learning where the most money is, you can steal the most money possible.

And the more money you steal in the game, the more REAL BONUS MONEY you will be paid on top of the normal experiment payment. You could earn an extra $2–3!’

Note that the critical difference between the control and revaluation conditions was in the Relearning phase: in the control condition there was no change in the rewards associated with Stage II choices (such that integrative consideration of this new information should not result in systematic change in Stage I preferences), whereas in the revaluation condition Stage II rewards changed. Hence, during the Test phase we could test whether decisions made in Stage I were affected by information learned about Stage II in the Relearning phase (see Figure 1 and Materials and methods). The inclusion of the control condition helps to rule out effects related to any nonspecific change in preferences (as opposed to integrative replanning), such as forgetting or inattention. As noted above, participants completed four separate blocks with this structure but different stimuli and rewards (two revaluation and two control). They did not know to which condition each block belonged (or about the logic of the manipulation), and the order of revaluation vs. control blocks was randomized.

In addition to the revaluation vs. control manipulation, we also manipulated the level of noise (variance) associated with rewards experienced following an action in a given state. This led to a 2 × 2 factorial design with revaluation and noise manipulations. For both revaluation and control blocks, half the blocks had variable (noisy) rewards selected from a distribution with a fixed mean (noisy rewards condition) leading to a persistent level of prediction errors even in the control condition (e.g. choosing left at state two would lead to different values that average to $5), and half of the blocks had no noise in rewards (noiseless rewards condition) leading to no prediction errors after learning unless there was a revaluation (e.g. choosing left at state two would always lead to $5).

This noise manipulation—that is the noiseless-rewards versus noisy-rewards conditions—ensured that the difference between revaluation and control condition blocks was not limited to the absence or presence of PEs or surprise in general. In particular, the inclusion of a noisy control block (in which there is elevated surprise but no actionable new information during the Relearning phase) further helps to control for nonspecific changes in preference (e.g., mediated by attention or motivation) that might be associated with surprise per se, rather than surprise enabling integrative planning. Note that under the Dyna and prioritized sweeping algorithms that underlie our hypotheses, replay by itself is not sufficient to produce change in preference. Our hypothesis, instead, is that surprise leads to replay later on (to investigate whether the surprising information has consequences for choice), but whether this replay leads to replanning/revaluation depends on whether the replayed experiences in fact support a change in preference. Accordingly, the inclusion of reward noise in the noisy-rewards condition (even in control blocks) allowed us to rule out the possibility that change in choice preferences is brought about by ‘any’ experience of uncertainty, whether or not experience actually supported replanning. However, given that our primary interest is in replanning per se, in what follows we mainly report revaluation vs. control effects, collapsing noiseless and noisy reward conditions, while reporting statistical differences and further fMRI analyses of the noise manipulation in the Supplementary material.

Instructions

Below is an example of instructions for one trial. Each trial had a separate city context (Berlin, Paris, London, Kyoto) and place (Theater, Museum, Subway, Garden).

“In this game you will be robbing a GARDEN in KYOTO.

By learning how to get around the GARDEN, and learning where the most money is, you can steal the most money possible.

And the more money you steal in the game, the more REAL BONUS MONEY you will be paid on top of the normal experiment payment. You could earn an extra $2–3!’

When you are in the Main KYOTO GARDEN, you either go to the LEFT or the RIGHT garden to find money. The MAIN GARDEN and the LEFT and RIGHT gardens all have different PHOTOS, so you will always know where you are.

Pay attention! Sometimes you will directly begin in the LEFT or in the RIGHT garden!

Once you enter the LEFT or RIGHT garden, you will have the choice of 2 cash registers to look for money in: TOP or BOTTOM machines.

Each machine contains between 1$ and 50$.

You need to figure out whether the TOP or the BOTTOM machine pays better in each garden.

Stealing as much money as possible requires that you successfully navigate from the MAIN garden LEFT or RIGHT to the GARDEN with the best cash register.'...

Beware!! Press THUMB if you find yourself in a suspicious corridor where you may get caught!’

Forced choice catch stimuli

It is worth noting that every trial consisted of a number of catch stimuli. These ‘catch states’ were included to control for attention and learning. There were two kinds of catch states: forced choice and police states. In forced choice catch states, participants faced the same choice but one of the actions was highlighted, thus they were forced to select the indicated response. Forced choices were included in the design in order to ensure (a) that they sample all parts of the state space and (b) that they are paying attention. On ‘police’ states, upon choosing an action for State 1, participants saw a ‘police face’ stimulus (if the second state stimuli were faces) or a ‘suspicious corridor’ stimulus (if the second state stimuli were scenes), which implied that they should not press the usual response and they need to press an extra key (to avoid being ‘caught’). Participants were trained and could distinguish police faces and suspicious corridors beforehand.

The number of forced choice changed depending on the performance of the participant. Including forced-choice trials ensured that the distributions were adequately sampled during the experiment. Each participant’s choices were tracked, and actions that were not adequately sampled were presented in forced choice trials in which the participant had to take those actions. We included 4 ‘police’ catch trials and, on average, 3 ‘forced choice’ catch trials in phase 1 of each run. We included at least 1 of each less explored option. If the participant did not adequately explore options within a set period of time, more catch stimuli were inserted just to ensure participants encountered the value of those states. As a result of forced choice stimuli, participants sampled all options. As a result of police trials, we ensured that they did not merely rely on stimulus-independent motor sequence memory when making decisions (e.g. memorizing that pressing L-index then R-middle is rewarding). Instead, participants were directed to pay attention to the stimuli/states they observed in order to select the optimal action. They were told that these catch choices are important for their bonus. Note that we ensured all varieties of catch trials were category-congruent with the second stage stimuli.

Attention engagement during the relearning (control trials)

Participants were instructed at the beginning of each run that the rewarding choices may change and they needed to pay attention to changes. Furthermore, both the instructions as well as forced choice stimuli incentivized them to pay attention during the relearning phase: (a) The relearning phase had ‘forced choice’ trials where participants were instructed to select the option already highlighted by the experimenter. We used these trials to ensure they were paying attention and provided verbal feedback between the runs to pay attention to these forced choices. (b) Participants were told their choices during the relearning phase were also related to reward at the end of the experiment (maximizing rewards required maximizing performance as a ‘thief’ in the cover story). (c) Participants were sometimes given catch trials with ‘police’ stimuli that required a different response, as noted above. Taken together, all of these factors helped to ensure that participants were paying attention in both the revaluation and the control condition. Note also that participants were not directly told whether a particular block was a revaluation block or control block – they had to pay attention to experienced rewards in all blocks during relearning to assess whether they were changing.

Moreover, we designed the noisy control condition to control for the possibility that participants were not paying enough attention to details and simply followed a rule of thumb: ‘any change means revaluation’. Our observations ruled out this possibility: the mere experience of PEs did not lead participants to change their choice at the end of noisy control trials. Note that even in the relearning phase of the noisy control condition there were still non zero PEs, indicating that participants were still learning and improving their predictions. They did not know they were in the control condition and were incentivized to pay attention.

Our main neural hypothesis was that offline replay would facilitate the integration of Stage I and Stage II information, which will in turn support subsequent planning behavior, that is, change in choice preferences. In the control condition, participants should not change their choices from learning to test regardless of experienced PEs, while in the revaluation condition, replay should help them update previous optimal policy and replan their choice. Note that updating or replanning the optimal Stage I choice required integration of past information, because the transition from Stage I to Stage II, on the one hand, and the new Stage II rewards, on the other, were originally experienced at separate distal points in time during online task performance.

In our study, the Relearning phase of each block was interleaved with three separate rest periods of 30 s each. Thus, offline replay could be assessed using brain measurements during these off-task rest periods. To allow us to measure replay, we used image categories for Stage I and Stage II stimuli that were easy to decode with fMRI: All stimuli were either faces or scenes (Polyn et al., 2005). In particular, the Stage I (root) stimulus belonged to a different category than the two Stage II stimuli. This manipulation allowed us to operationalize ‘offline replay’ using classifier evidence of Stage I in separate category-selective regions of interest (ROI) during rest periods, as described below.

Replanning (revaluation) behavior

To measure revaluation behavior, we computed a replanning magnitude score for each subject and block (Gershman et al., 2014; Momennejad et al., 2017; Russek et al., 2017), defined as the proportion of optimal Stage I choices during Test minus the proportion of choices for that same option during the Learning phase (see Equation 1 in Materials and methods). We use a score that indexes change in preferences here (Gershman et al., 2014), rather than raw Test phase choice behavior for two reasons. First, this is what our hypotheses concern: that replay updates pre-existing choice policies, driving a change. Second, this score controls for any confounding effects of initial learning performance on Test phase behavior. (For instance, subjects paying different amounts of attention might learn to varying degrees in the initial phases, leading to differences in both replay and Test phase behavior, even if replay itself doesn't affect later preferences.) In revaluation blocks, this replanning magnitude quantifies successful reversal of preference from Learning to Test, because the initially suboptimal option becomes optimal during the Test phase. The difference in preference captured by the replanning magnitude compares the frequency of taking the newly optimal action during the Test phase to the frequency of taking this action at the end of the Learning phase (when it was suboptimal), which adjusts for choice noisiness or failure to acquire the initial associations during the Learning phase. Thus, for revaluation blocks, a large replanning magnitude indicates consistent, successful reversal of an initially consistent, correct preference. Conversely, in control blocks (where the initially suboptimal option remains suboptimal at test) a large replanning magnitude indicates an unwarranted and suboptimal reversal of the initially trained preference, allowing us to control for preference change due to some other, nonspecific source of error such as decision noise or forgetting.

Behavioral results

Comparison of replanning magnitudes revealed that participants showed significant revaluation behavior (i.e., reversed their Stage I choice from Learning to Test, as indicated by positive replanning magnitudes) in revaluation blocks (t(23) = 4.89, p < 0.0001), but not during control blocks (t(23) = −0.10, p = 0.92), and the difference in magnitudes between the two conditions was significant (t(23) = 4.333, p = 0.0002) (Figure 2B). The comparison with the control condition verifies that this apparent revaluation behavior is not simply attributable to general forgetting or increased memory noise from Learning to the Test phase; rather, it is due to revaluation of past policies in the face of novel information in the revaluation condition—but not the control condition. A 2-way analysis of variance confirmed the main effect of the reward vs. control factor but revealed no significant main effect of the reward stability vs. noise manipulation (see Figure 2—figure supplement 1).

Figure 2 with 2 supplements see all
Offline replay of distal past states supports planning.

(A) Behavioral results (n = 24). Participants significantly changed their choice from the Learning phase to the Test phase in the revaluation condition, but not the control condition. (B) Replay supports planning. Correlation between Stage I replay during off-task rest periods and replanning magnitude during the subsequent Test phase for Control (left) and Revaluation (right) blocks. Stage I replay was operationalized as MVPA evidence for the Stage I stimulus category, in category-selective regions of interest, during all rest periods of control and revaluation conditions (n = 24). The correlation was significant in the Revaluation blocks (Spearman rho = 0.54, p = 0.0068) but not Control blocks (Spearman rho = −0.13, p = 0.55), and it was significantly larger in Revaluation than Control blocks (p = 0.0230, computed using a bootstrap, computing Spearman rho 1000 times with replacement). Regression lines are provided for visualization purposes, but statistics were done on Spearman rho values. The bottom figures display both fitted lines as well as Spearman’s rho values for clarity.

https://doi.org/10.7554/eLife.32548.004

Replay-behavior relationships

Next, we turned to the neuroimaging data to test our main hypothesis: whether neural evidence for offline replay during rest periods predicts subsequent replanning behavior. As noted above, we used MVPA to track replay of Stage I, leveraging the fact that Stage I and Stage II stimuli were always taken from different categories (faces and scenes). For example, in a trial where the Stage I stimulus was a face and Stage II stimuli were scenes, we used MVPA measures of face activity during rest periods to measure spontaneous replay of Stage I. Previous evidence suggests that multivariate pattern analysis (MVPA) can be used to measure offline replay and memory reactivation during post-task rest periods (Staresina et al., 2013; Cichy et al., 2012; Gruber et al., 2016). Specifically, it has been shown that imagery, reinstatement, and perception of faces and scenes elicit patterns in the temporal cortical regions associated with visual categories, roughly corresponding to the fusiform gyrus for the face category and the parahippocampal gyrus for the scene category (Polyn et al., 2005; Eichenbaum, 2004). Therefore, we first identified these category-selective regions using general linear models (GLMs) on a separate localizer run before the main task. The localizer run consisted of blocks of faces, scenes, objects, and rest. The faces and scenes, but not objects, were used later during the experiment. We then trained a classifier (using logistic regression) on face, scene, and object stimuli appearing during the separate localizer run. The main goal of the classifier was to classify the reinstatement of cortical patterns associated with the Stage I stimulus category during the rest periods. As a cortical measure of replay, for each participant, each block, and each of the 30 s rest periods, we used this logistic regression classification method to identify TR-by-TR classifier evidence for the cortical reinstatement of faces and scenes in category ROIs (regions of interest). To ensure that we could independently assess positive evidence for both face and scene categories (rather than competitively detect one at the expense of the other), we assessed classifier evidence for each category from an ROI that was selective for that category (see Methods for further discussion of this point).

To test whether offline replay during intermittent rest periods in the Relearning phase predicted successful revaluation in the later Test phase, we averaged MVPA evidence for reinstatement of the Stage I category during all rest periods (Figure 2, x-axis), and computed the correlation with mean revaluation behavior in each condition (Figure 2, y-axis), across subjects. Even though the Stage I stimulus category never appeared during the Relearning phase, it was important to ensure that we were not simply observing residual signals evoked by the Stage II stimuli observed prior to each rest phase. Therefore, we excluded the first 5 TRs (corresponding to 10 s) of the beginning of each rest period from analysis.

In revaluation trials, there was a significant correlation between MVPA evidence for Stage I replay during the three off-task rest periods and revaluation behavior during the subsequent Test phase (rho = 0.54, p = 0.0068) but not for control trials (rho = −0.13, p = 0.55, Figure 2). The correlation coefficients for the revaluation vs. the control trials were significantly different from one another (p = 0.0230; bootstrap, see Methods). We also conducted the same analyses, separately, for the noisy rewards versus the noiseless rewards conditions. The revaluation vs. control difference was trending in the noisy condition (p = 0.066) but not the noiseless condition (p = 0.16), and there was no significant interaction between revaluation/control and noisy/noiseless (p = 0.34) (Figure 2—figure supplement 1; see Discussion section for further treatment of these results).

Note that the crucial step for revaluation is retrieval of memory for the Stage-1 state, which was not presented in the Relearning phase. In contrast, there may be neural activity for the Stage-2 stimulus (which was presented during the Relearning phase) for many reasons unrelated to planning, such as simple working memory. Accordingly we expect the relationship between replay and revaluation to be most clearly expressed for the Stage one stimulus. We thus conducted a similar analysis for Stage two replay, looking at whether MVPA evidence for Stage two category replay during the rest period was correlated with subsequent replanning behavior. As expected, we observed no significant correlation between Stage two replay and revaluation behavior (Figure 2—figure supplement 2) in the revaluation condition (rho = 0.2, p = 0.17) or the control condition (rho = -0.25, p = 0.22).

In short, we found that MVPA evidence of Stage I replay during rest correlates with subsequent replanning behavior. This fits with our hypothesis that offline memory processes can support planning. Next, we test the hypothesis that this replay is driven by the experience of prediction error.

The brain’s response to unsigned prediction error tags memories for replay

A key question for any account of learning from replay is how the brain selects which memories to replay. We thus explored what events favor replay, and ultimately successful revaluation. A general, appealing possibility is that events with some particular property when first experienced, like surprising ones, are privileged for later offline replay. (Mechanistically, this might occur either because they are ‘tagged’ for replay when first experienced, or because they are singled out later at retrieval.) We tested this idea in terms of one representative notion of surprise – neural sensitivity to unsigned PEs (i.e., the absolute value of PEs), as in the prioritized sweeping algorithm (Moore and Atkeson, 1993). Offline replay in turn would support updating past policies and hence predict future revaluation behavior (Figure 3, note the relationships indicated in the diagram, top left). We have shown in previous sections that offline replay during rest is correlated with revaluation during a later Test phase. To examine evidence for the hypothesized relationship between the brain’s response to PE and offline replay, as well as subsequent replanning behavior (Figure 3, top left), we ran parametric modulation analyses as follows.

Figure 3 with 1 supplement see all
(Top left) Schematic of a theoretical account of revaluation.

We propose that neural sensitivity to reward prediction errors during learning ‘tags’ or ‘prioritizes’ memories for replay during later rest periods. Replay during rest, in turn, allows the comparison of past policy with new simulated policy and updating of the past policy when needed. (Top right) Regions where sensitivity to unsigned PE in Revaluation blocks correlates with subsequent replay during rest (extent threshold p < 0.005, cluster family-wise error [FWE] corrected, p < 0.05, shown in purple) and revaluation behavior (extent threshold p < 0.005, cluster FWE corrected, p < 0.05, shown in blue). (Bottom) Green reveals the conjunction of regions where sensitivity to unsigned PE in Revaluation blocks correlates with subsequent replay during rest (extent threshold p < 0.005, cluster FWE corrected, p < 0.05) and revaluation behavior (threshold p < 0.005, cluster FWE corrected, p < 0.05) in those blocks; the conjunction is shown at a p < 0.05 threshold. We found that the sensitivity of broad regions in the basal ganglia, the cingulate cortex (including the ACC), and the posterior medial cortex (precuneus) to unsigned prediction errors (signaling increase in uncertainty) correlated with both future replay during rest as well as subsequent revaluation behavior. See Tables 1 and 2 for coordinates.

https://doi.org/10.7554/eLife.32548.007

For each participant, we conducted a parametric regression analysis within the Revaluation condition to identify regions with sensitivity to stimulus-by-stimulus unsigned PEs. Prediction errors were estimated from a Temporal Difference (TD) learning model (see Methods). We then tested, for each subject and each Revaluation block, the relationship between the magnitude of the brain’s PE response (in that block) and two different covariates: subsequent replay and subsequent replanning score (Figure 3, top right, purple and blue maps respectively). This allowed us to determine whether (and in which brain areas) the magnitude of activity in response to unsigned PE was related to off-line replay (Figure 3, regions indicated in purple, see Table 1 for coordinates) and subsequent replanning (Figure 3, regions indicated in Blue, see Table 2 for coordinates), as hypothesized (Figure 3, top left). In order to identify the combined effect and given the correlation between the two covariates, we used a conjunction analysis over the two resulting correlation brain maps (Figure 3, regions indicated in green). This conjunction analysis yielded regions where the brain’s response to unsigned PE on each block correlated with both (a) subsequent mean replay evidence (for Stage I stimuli) during rest, and (b) subsequent replanning magnitude of the choice behavior during that block’s Test phase (Figure 3).

Table 1
Coordinates of voxels where parametric modulation with unsigned PEs during learning predicted future replay of Stage I during rest periods, extent threshold p < 0.005, corrected at cluster level family-wise error p < 0.05 (these correspond to purple regions in Figure 3).
https://doi.org/10.7554/eLife.32548.009
RegionZ scoreXYZK voxelsP
Right Cuneus
Left Cuneus
Right Lingual
Left Lingual
Calcarine
3.31
3.29
3.21
3.15
3.14
-2
4
−14
16
6
−68
−66
−68
−46
−70
4
22
22
-8
4
1898.000488
Table 2
Coordinates of voxels where parametric modulation with unsigned PEs during learning predicted future replanning behavior (revaluation magnitude), extent threshold p < 0.005, corrected at cluster level family-wise error p < 0.05 (these correspond to blue regions in Figure 3).
https://doi.org/10.7554/eLife.32548.010
RegionZ scoreXYZK voxelsP
Right precentral
Orbitofrontal cortex (OFC)
Supplementary motor area
Right superior frontal cortex
ACC
Right OFC
3.67
3.62
3.59
3.49
3.49
3.42
26
0
-8
18
−14
6
−20
58
4
32
50
58
68
−12
48
38
0
−18
52401.36e-08
Right superior temporal
Right supra-marginal
Inferior parietal
Right putamen
Right superior temporal pole
3.67
3.6
3.58
3.39
3.35
56
34
36
26
46
−30
−36
−42
8
18
14
44
44
2
−16
51651.67e-08
Left superior temporal pole
Left putamen
3.4
3.35
−46
−20
−10
18
0
8
2161.00021
Table 3
.Coordinates of peak voxels of regions with higher off-task activity during rest periods of revaluation > control condition.

The clusters were selected with threshold p < 0.005, corrected at cluster level family-wise error p < 0.05 (these correspond to red regions in Figure 4).

https://doi.org/10.7554/eLife.32548.011
Region (cluster)Z scoreXYZK voxelsP
 Left hippocampus4.74−20−3810297.014
Left insula
Left superior temporal
4.57
4.57
−36
−42
-8
−24
6
6
1959.000
Right superior temporal4.1144−1222249.000
Right anterior cingulate, ventromedial PFC4.652386238.048
Left mid cingulate
Left Supplementary motor area
3.31
3.33
−10
-6
−10
−12
46
62
258.030

This conjunction analysis yielded multiple regions where neural sensitivity to unsigned PEs significantly predicted both future offline replay and revaluation magnitudes: the anterior cingulate cortex (ACC), the mid cingulate, posterior medial cortex (PMC) including dorsal and ventral precuneus, and the basal ganglia, especially the putamen (Figure 3, bottom). These results are consistent with the hypothesized relationship between sensitivity to prediction errors during learning, offline replay during rest, and revaluation behavior at test, and suggest an involvement of ACC, PMC, and basal ganglia in signaling the relevant PEs (see Figure 3—figure supplement 1 for breakdown of this analysis for the noisy-rewards and noiseless-rewards blocks).

Brain activity during revaluation vs. control rest

Next, we report a number of observations further unpacking the key results discussed above. Our hypothesis is that learning and updating take place during offline replay periods—indeed, we revealed above that replay in category-sensitive brain regions during rest periods of revaluation blocks (but not control blocks) predicted subsequent planning behavior. This hypothesis also implies that brain regions related to learning and memory should show more activity during the rest periods of revaluation trials, where engaging in offline replay could result in better choices, compared to control trials in which the most learning has already occurred during the Learning phase. To test this, we used a GLM to compare overall differences in univariate activity during the rest periods of revaluation vs. control trials. Contrasting rest periods from revaluation trials versus control trials (Restrevaluation> Restcontrol) revealed higher activity in the hippocampus/medial temporal lobe and the anterior cingulate cortex (p < 0.005, cluster corrected at p < 0.05, Figure 4, see Table 3 for coordinates, and Figure 4—figure supplement 1 for breakdown of this analysis for the noisy-rewards and noiseless-rewards blocks).

Figure 4 with 1 supplement see all
Univariate general linear model contrasts comparing activity rest periods of revaluation vs. control blocks, Restrevaluation> Restcontrol.

The contrast reveals higher activity in the hippocampus, the anterior cingulate cortex, mid cingulate (shown above), as well as bilateral insula and superior temporal cortices (extent threshold p < 0.005, cluster level family-wise error corrected at p < 0.05). See Table 3 for coordinates, cluster size, and p values.

https://doi.org/10.7554/eLife.32548.012

Learning during rest

So far, we have shown that evidence for offline replay, across subjects and blocks, correlated with subsequent replanning behavior, and that regions broadly involved in memory and evaluation were more active during the rest period of revaluation trials. Finally, we broke down these effects over the course of the three rest periods in the Relearning phase to examine how they evolved over the progression of relearning and seek evidence of learning dynamics.

A property of the learning model is that PEs during the Relearning period should decrease from the period preceding Rest 1 (when participants are first encountering the new rewards in the revaluation condition) to the period preceding Rest 3 (after participants have had more time to learn these new rewards); this was the case for the PE values estimated by the learning model (Figure 5, top left). Given the decreasing pattern in unsigned PE during re-learning, we predicted that this within-block variation might be echoed in patterns of replay across rest periods – reduced PE should lead to less replay. To focus the analysis on differences among rest periods within each block, we mean-centered MVPA evidence for the Stage I stimulus during the three rest periods for each block of each subject. This allowed us to compare patterns of within-block differences between replay in the three rest periods of revaluation and control trials. In control blocks, we found no differences in MVPA evidence for S1 replay during the three rest periods (Figure 5). During revaluation blocks, however, evidence for offline replay significantly decreased from Rest one to Rest 3 (t(23) = 2.9, p = 0.007), similar to the decreasing pattern of unsigned prediction errors (Figure 5). Furthermore, offline replay was higher during Rest 1 of revaluation blocks compared to the Rest 1 of control blocks (t(23) = 2.17, p = 0.039). Thus, consistent with our hypothesis, prediction errors and replay track each other over the course of the block, as well as across blocks (as previously shown in Figure 3).

Figure 5 with 1 supplement see all
Dynamics of replay and activity during rest periods.

Dynamics of prediction errors prior to each rest period (left), MVPA evidence for Stage I replay during each rest period (middle), and replay-behavior correlation across the three rest periods of revaluation runs (right).

https://doi.org/10.7554/eLife.32548.014

Analyzing the rest periods of revaluation trials in the noisy-rewards and noiseless-rewards conditions separately (Figure 5—figure supplement 1), we found a similar trend: In the noisy-rewards condition, replay was marginally higher during Rest one than both Rest 2 (t(23) = 2.07, p = 0.049) and Rest 3 (t(23) = 2.02, p = 0.05) – in the noiseless-rewards condition, replay levels were low overall and did not vary across rest periods.

Finally, we considered how this replay related to subsequent replanning behavior in the Test phase. We compared the replay-replanning correlation across the different rest periods (Figure 5; unpacking over time the correlation from Figure 2B). We found that, for the average of all revaluation blocks, this correlation was numerically stronger (and only statistically significant) during later rest periods (however, there was no significant difference between the correlation coefficient for Rest one versus the correlation coefficient for Rest 3, Fisher’s z = −1.01, p = 0.31). We then analyzed this effect separately for revaluation blocks in the noisy rewards and revaluation blocks in the noiseless-rewards condition (Figure 5—figure supplement 1), and found again that the replay-behavior correlation was mainly driven by the noise condition, where the most significant replay-behavior correlation is observed during Rest 2.

Discussion

We present fMRI evidence for the hypotheses that uncertainty-driven replay can help integrate past episodes offline, and in turn, update representations and policies that support planning (Van Seijen and Sutton, 2015). We address two questions. Our main question was: does off-task replay support planning? Inspired by the Dyna architecture in reinforcement learning, our main contribution is to demonstrate a clear and selective association between offline replay and change in choice preferences, that is replanning. We found that MVPA evidence for offline replay during rest correlates with subsequent choice behavior in reward revaluation. We suggest that during offline replay the brain integrates and pieces together simulated trajectories of information acquired during multiple learning episodes and updates planning policies accordingly.

Our secondary question was: what predicts such offline replay? Inspired by prioritized sweeping in reinforcement learning, our second hypothesis was that certain classes of events – such as surprising ones – privilege trajectories for future replay that helps the brain to learn whether they have consequences for choice preferences. We examined this question using a representative index of surprise, the brain’s response to unsigned prediction error. Whether this subsequent replay leads to revaluation or not will depend on whether the replayed experiences themselves support a change in behavior. Therefore, not all prediction errors will lead to revaluation, even if they increase replay. We found that the brain’s sensitivity to unsigned prediction errors in the ACC, basal ganglia, and posterior medial cortex (the precuneus) across subjects and blocks predicted both the extent of subsequent offline replay as well as subsequent replanning behavior. We suggest that the brain’s sensitivity to surprising information and increased uncertainty, here signaled by unsigned prediction errors, may tag certain episodes and trajectories of experience for selective offline replay later on (Moore and Atkeson, 1993; Peng and Williams, 1993).

Finally, we observed a general increase in hippocampal, posterior medial, anterior cingulate (extending to ventromedial prefrontal cortex upon relaxing the threshold), and superior temporal cortex activity during the rest periods of revaluation trials compared to control. The hippocampus has long been identified as the locus of episodic memory (Ranganath and Ritchey, 2012), posterior medial cortex is shown to be involved in recall and integration of memories over long time-scales (Chen et al., 2017; Shenhav et al., 2013), the anterior cingulate cortex is known to enable comparisons and signal conflicts the resolution of which may be valuable enough to recruit controlled processes (van Kesteren et al., 2013), and the ventromedial prefrontal cortex has been established as a region involved in memory retrieval of relevant information (Preston and Eichenbaum, 2013; Gläscher et al., 2009), value representation (Fellows and Farah, 2007), and decision making (Anderson et al., 2016) respectively. Hippocampal-prefrontal interactions have been shown to mediate control of memory in memory inhibition (Kalisch et al., 2006) and updating memory in extinction (Behrens et al., 2007). Thus, this finding supports an increased role for memory and evaluation mechanisms during the rest periods of revaluation blocks.

In short, inspired by work on the Dyna architecture in reinforcement learning (Sutton, 1991) and prioritized sweeping (Moore and Atkeson, 1993; Peng and Williams, 1993) in machine learning, as well as behavioral paradigms of reward revaluation (Gershman et al., 2014; Momennejad et al., 2017), these results are the first to show a functional role for offline replay in planning under uncertainty. The key novel aspect of our results is our finding that offline replay can lead to replanning by integrating multiple, temporally distal learning episodes (here, computing the implication of Stage II experiences on choice policy at Stage I). The predominant assumption, going back to Tolman, is that humans and animals solve tasks of this sort by online computation, e.g. forward planning when a choice is required during the Test phase (Tolman, 1948), or earlier during acquisition. Accordingly, previous human studies have shown benefits of online reinstatement of past episodes on decision-making in a sensory preconditioning paradigm (Wimmer and Shohamy, 2012), and that goal-directed behavior in sequential decision-making is accompanied by the online reinstatement of prospective states (Doll et al., 2015). Other studies have explored properties of offline reinstatement, showing that it favors past episodes associated with higher reward (Gruber et al., 2016) and supports upcoming learning of related material (Cichy et al., 2012). However, to our knowledge, no human or rodent study has tested the role of offline replay in piecing together distal episodes to support planning. Tolman famously designed revaluation tasks that require the animal to integrate information experienced at various points in time (Tolman, 1948). Thus, we used a simplified Tolman-style revaluation task to test the role of offline replay in integration of distant memories and address the crucial question of which memories the brain would preferentially replay to improve planning. While the rodent literature has shown in various paradigms that offline replay is associated with better task performance (Johnson and Redish, 2005; Ólafsdóttir et al., 2015; Ambrose et al., 2016a), none verifies that these improvements are specifically due to the offline integration of memory. Our present findings further these previous efforts and elucidate the role of offline replay in goal-directed memory integration that supports planning.

It is important to consider alternative hypotheses and interpretations of our results. An important general point is that our results are correlational, and we of course can’t directly draw causal conclusions from the associations we report. Our main result is the association between offline neural reinstatement and subsequent preference change, supporting our hypothesis of a role for the former in replanning. In general, alternative hypotheses in the literature for how replanning occurs tend to focus on online computation. However, evidence for the importance of online computation is not mutually exclusive with the mechanism discussed here, and it cannot itself explain the reported association between planning and offline replay. In order to rule out alternative interpretations, we included a number of controls in the design of our study. As a general matter, the factorial design of our experiment, which includes no-revaluation controls and noise manipulations, and our use of a relative replanning score to control for initial learning, all help to verify that the association we report between replay and behavior is specific to replanning. For instance, if the brain-behavior relationship was confounded by many sorts of nonspecific factors such as variation in overall attention or motivation, or by forgetting rather than strategic replanning, similar brain-behavior relationships should be present in the control blocks as well. Accordingly, we can rule out the alternative interpretation that our results are due to behavioral change driven by ‘any’ surprising event regardless of whether it warrants a change in behavior. To control for this possibility, we included a noisy control condition where participants experienced prediction errors in the experienced rewards but the global structure of the task, and therefore the most rewarding choice, did not change. We did not observe a relationship between replay and revaluation behavior in this control condition. A second alternative interpretation is that any replay, regardless of the category, is associated with revaluation of choice behavior. Such a relationship between nonspecific replay and behavior could reflect a more generic index of task engagement, e.g., higher attention or motivation in participants who are better at the task. This interpretation would predict a relationship between Stage two category replay and replanning as well. However, we found no significant correlation between Stage two replay and replanning (Figure 2—figure supplement 2). Altogether, the specificity of the findings is consistent with the hypothesis that learning from simulated experience during replay determines whether revaluation occurs.

Our second key result concerns a further pair of associations, between earlier signatures of PE and subsequent replay and behavioral planning. This is consistent with a general family of explanations, in which events with certain characteristics (e.g., surprising one) are privileged for later replay. We operationalize this in terms of the neural response to unsigned prediction error during encoding, as suggested by prioritized sweeping (the standard embodiment of this general idea in computer science), but this is meant to be representative of a broader class of heuristics. Our task design does not definitively allow us to distinguish this particular heuristic measure of surprise from other, correlated ones that could support similar effects. For instance, in more elaborate models, overall variability in outcomes can be decomposed into contributions of volatility and stochasticity, and PE can be scaled relative to these (Ambrose et al., 2016b). Similarly, there is evidence that neural replay is sensitive to the sign of prediction errors (Gläscher et al., 2010). However, in the present task, signed and unsigned PEs are highly correlated (in particular, most PEs are positive, because subjects are incentivized to make rewarding choices and not return to disappointing states). Thus, the present design does not have sufficient power to dissociate the role of signed vs. unsigned prediction errors in replay.

Furthermore, although prioritized sweeping works by marking high-PE states when they are experienced – and this echoes many notions of ‘tagging’ in the cognitive neuroscience of memory – given the correlational design we can’t rule out other mechanisms by which the decision which events to replay happens later. For instance, in principle, even without tagging, the brain could retrieve memories offline in an initially unbiased way to search for high-PE ones to focus on. Our data do not directly dissociate the two hypotheses, and both are consistent with our core contribution: that offline replay of surprising events supports planning. Having said this, we noted that, in realistic scenarios (as well as task designs with larger decision trees) such unguided search would be computationally expensive and potentially intractable. We experience many events and form many memories on a daily basis. In order to enhance learning and planning, it would be useful for the brain to have a mechanism to focus simulated experience on those that may lead to policy updates. Thus, while acknowledging other possibilities, we suggest that some form of prioritized replay is more likely.

Given that our interest is primarily in the relationships between prediction errors, replay, and revaluation behavior, we focused our analysis on the difference between revaluation and control conditions, collapsing over blocks with noisy vs. noiseless rewards. We did, however, also consider all results separately across those conditions. Perhaps unsurprisingly (given lower power), we did not find significant differences between the conditions in any of our measures, but we note a few findings. First, we only observed a significant correlation between replay and replanning behavior in revaluation blocks in the noisy reward condition (Figure 2—figure supplement 1). These results may be due to a ceiling effect in the noiseless reward conditions, where updating past policies was sufficiently easy for online processes to adequately do the job. Second, we only found significant conjunction results for the PE analyses in the combined condition, but neither in the noisy-rewards condition nor the noiseless-rewards conditions alone. This is most likely due to the fact that the data in the noisy and noiseless conditions were half the number of runs included in the combined analysis reported in the main text (Figure 3). Future experiments should focus on the differential effect of the noisiness vs. stability of reward structures, and the volatility of event and reward structures more broadly, on brain regions recruited in offline processes that support planning. One possibility is that the brain is more likely to benefit from offline replay for planning when the reward structure in the world is noisier, in which case changes in mean reward may be neglected as variance, making online integration more difficult. While the present study has focused on reward prediction errors, future studies are required to study the role of other forms of prediction errors, e.g. state prediction errors (Dunsmoor and Murphy, 2015).

We also examined the dynamics of our results over the progression of the relearning phase. The key difference between the three rest periods was in the amount of unsigned prediction error (PE, or uncertainty) experienced prior to each (Figure 5). Here, unsigned prediction errors are how the brain notices something is not right, i.e. some prediction does not match the observation, and needs to determine whether the surprising experience may have further consequences for the brain’s stored representations and policies (its model of the world). Before Rest one during revaluation trials, participants had just noticed that rewards have changed from what they expected, leading to the experience of high unsigned PEs while making Stage II decisions. Importantly, they did not visit the Stage I state during the Relearning phase, but only Stage II stimuli, therefore they did not experience any PEs after choosing a policy from Stage I. Before Rest 3 of revaluation trials, participants experienced smaller prediction errors because they had already had the chance to learn the new reward structures for Stage II stimuli; correspondingly, overall levels of replay were lower after Rest three than after Rest 1. However, crucially, while levels of replay were lower after Rest 3, there was still a robust correlation between Rest three replay and behavior (the size of the replay-behavior correlation did not significantly differ across rest periods).

Taken together, these findings suggest a potential two-stage functional role for replay in the present study. First, replay of trajectories marked with unsigned PE could allow identifying distal parts of the state space that may need to be updated. When the brain encounters prediction errors, initial replay can enable search through a graph of memory states to identify other states affected by the experienced PE: this is the ‘sample to tag with priority’ function of replay. The present experiment consists of one such past state, but it is possible that in larger decision trees this search for relevant past states leads to ‘reverse replay’ (Wu and Foster, 2014; Ambrose et al., 2016a). As such key past states that lead to the present state in which high PE was experienced may be tagged with priority over other memories to be replayed later on. Second, once the new structure of the environment is adequately learned, and high PEs are no longer experienced; the tagged relevant past states and their corresponding trajectories can be replayed, simulated if never directly experienced, and updated. A similar tagging idea had been proposed in the domain of fear learning, where memory traces could be retroactively altered in accordance with new behavioral relevance of fear associations (Kurth-Nelson et al., 2016), whereas here we propose a more general case where unsigned PEs tag memories with priority for subsequent integrative replay. As such, the second function of replay may be ‘learning from replay’: to actually enable the brain to piece together and simulate trajectories in order to update past representations and planning policies offline (Pfeiffer and Foster, 2013).

It is worth commenting on how the notion of replay used here relates to the notion of sequential replay in the rodent literature – in this literature, the term ‘replay’ often refers to the replay of sequences at sub-second speed. The temporal resolution of fMRI poses a challenge to providing direct neural evidence of sequential replay. That said, we designed the study such that replanning the optimal Stage I choice required integration of past information by piecing together different memories sequentially: The transition from Stage I to Stage II, on the one hand, and the new Stage II rewards, on the other, were originally experienced at separate distal points in time during online task performance. Unlike animal studies, but similar to previous human fMRI studies of offline replay (Staresina et al., 2013; Cichy et al., 2012; Gruber et al., 2016), here we leverage a controlled design together with evidence of Stage I reinstatement to investigate the role of offline replay in planning. Future work using methods with higher temporal resolution, such as MEG and direct recordings from the hippocampus in patients, is required to shed light on the details of forward and reverse sequential replay in updating representations and memories that serve planning.

In future work, we plan to conduct more detailed explorations of the functional role of replay in planning, using paradigms with larger decision trees and higher temporal resolutions that make it possible to track replay trajectories in more detail; e.g., a recent human study used MEG to identify a role for reverse replay of trajectories (Dayan, 1993). We also plan to explore how offline replay might contribute to generalization and temporal abstraction, e.g. via multi-step predictive representations of upcoming states (or the successor representation, (Russek et al., 2017; Momennejad et al., 2017). Caching of multi-step predictive representations during replay can in turn enable generalization (Botvinick and Weinstein, 2014), faster planning especially under time pressure, and sub-goal discovery (Beckmann et al., 2003) across longer scales. After consolidation, the generalized multi-step representations could lead to abstract representations of states and policies such as schemas. Using generalized representations, both discovery of relevant past states (as in the first function of replay above) and piecing together episodes to simulate trajectories to goal (as the second function of replay above) could also take place in a multi-step fashion consistent with the successor representation (Russek et al., 2017), enabling more efficiency simulation of past and future trajectories.

To summarize, we have provided neural and behavioral evidence for the hypothesis that offline replay supports planning. We further suggest that the brain’s sensitivity to uncertainty, operationalized as unsigned prediction errors, mediates the amount of offline replay and, through this, revaluation behavior. Our findings further our understanding of how the brain leverages offline memory mechanisms in planning and goal-directed behavior.

Materials and methods

Subjects

We recruited 26 volunteers using the Princeton University recruitment system (SONA) to participate in the fMRI study. Two participants were excluded due to running out of time and not having sufficient number of runs. Please note that, logically, the analysis required each participant to learn the local dependencies of the task in the learning phase (indexed by their choice behavior) and there was one run per condition. If a participant did not learn the task structure during the learning phase, neither their replay during rest nor their subsequent choice behavior could be interpretable. That is, for any ‘replanning’ to occur later on, participants needed to learn sufficiently to be able to show ‘planning’ behavior in the learning phase, indexed by their choices. Therefore, based on this a priori requirement, any participant who did not learn the task structure during the learning phase was logically excluded. The remaining 24 participants’ data were further analyzed. The Princeton University Institutional Review Board approved the study (Protocol #6014, ‘Cognitive Control: Functional Imaging of Memory and Mental Representations’). All participants gave informed to participate in the fMRI study and signed a screening form that ensured they had normal or corrected to normal vision, had no metal in their body, and had no history of psychiatric or neurological disorders.

Stimuli and task

Request a detailed protocol

Eight face, eight scene, and eight object stimuli were used in the face/scene/object functional localizers. The faces and scenes were subsequently used in the main experimental task. Before the beginning of the main experiment, participants were subjected to blocked functional localizer runs. In the localizer participants were exposed to separate blocks of scenes, faces, and objects. Each category was presented for four blocks that each lasted for 12 s, separated by rest periods of 12 s. In each block, every stimulus was presented for 3.5 s with an ISI of. 5 s. To ensure that participants were paying attention to the stimuli during the localizer, they were instructed to give responses to a cover task. In half of the localizer blocks participants were instructed to give responses with the left hand and in the other half of the blocks with the right hand.

The main experimental paradigm consisted of 4 runs, each of which corresponded to one Markov Decision Process (MDP) with three phases. Two of 4 runs were assigned to the revaluation condition and the rest were control runs (in which no revaluation took place). The order of conditions was randomized for each participant. The task was explained to the participants in terms of a cover task with a robbery scenario: They were told the experiment was a game where they were to explore different locations in different cities, each corresponding to each run’s simple 2-step maze to find out which state they could steal more money from. Participants were instructed that they would receive a bonus compensation for their performance on the final Test phase, which they received for all runs at the end of the fMRI session. Each MDP was presented as a new ‘city’ in which the participants would explore a building in search of money. Different cities were randomly assigned to different conditions across participants. Each MDP consisted of 3 main states: Stage I and two Stage II states. The same stimulus signaled Stage I in each city, and the same Stage II stimuli signaled states 2 and 3. Importantly, Stage I and Stage II stimuli belonged to different categories. In 2 out of 4 cities state one was a face and states 2 and 3 were scenes, and in the remaining two cities the opposite was the case.

Additionally, we manipulated the variance in the rewards observed at each state. We wanted to ensure that, after learning, participants did not only experience prediction errors in the revaluation condition, but that the control condition also contained some degree of uncertainty with respect to rewards. To this end, the rewards in half of the runs, in other words one revaluation run and one control run, were sampled from a normal distribution with a fixed mean and variance, and in the other half of runs the rewards were completely deterministic.

In each city, during the Learning phase participants were exposed to the stimuli associated with the three main states of the MDP about 70 times. The last 10 choices they made from the Stage I state were used to calculate their probability of choosing the optimal response (e.g. if the optimal policy was left: p(left) = #left choices/10). Following the Learning phase, participants entered the Relearning phase where they only visited Stage II stimuli and never the Stage I stimulus. During this phase in revaluation trials, the rewards associated with Stage II (state 2 and 3) choices changed – but not during control trials. After every six Stage II episodes each requiring a choice, a rest period of 30 s length followed for which no instructions were given. Each participant experienced three rest periods of 30 s each during the Relearning phase. Finally, during the Test phase, participants were once again exposed to state 1, four times. Replanning magnitudes were computed as the probability of choosing the other valid action in those four trials, minus the probability of choosing the same response during the Learning phase – thus detecting the actual change in performance by subtracting noisiness in decision at the end of the Learning phase. For instance, if the optimal policy at test was right: revaluation magnitude = (#right choices at test/4) - (#right during last 10 choices of learning/10).

(1) replanning magnitude=nstestntest- nslearningnlearning

Here ntest denotes the total number of valid responses (i.e., excluding missed responses or invalid button presses) during the Test phase, and nstestis the number of times that the optimal choice was made during the test phase; nlearningdenotes the total number of valid responses during the last 10 trials of the Learning phase, and nslearningis the number of times that the choice that was optimal at Test was selected during the last 10 trials of the learning phase (note that, in revaluation blocks, the response that is optimal during the Test phase was suboptimal during the Learning phase).

fMRI imaging protocol and preprocessing

Request a detailed protocol

Imaging data were acquired on a 3 Tesla Skyra scanner (Siemens, Erlangen, Germany) using a 20-channel head coil. Functional runs were acquired using T2*-weighted echo-planar sequence with 37 slices (interleaved order, ascending) with 3 mm thickness, an in-plane resolution of 3 × 3 m, TR = 2.08 s, echo time of 70 ms, anterior-to-posterior phase encoding direction, and no gaps. Slices were tilted for each participant by roughly 30 degrees to avoid the sinuses and optimize signal in the orbitofrontal cortex but avoid losing temporal cortex voxels. A T1-weighted structural volume was acquired with 1 × 1×1 mm resolution. Functional volumes were slice-time corrected, realigned to the mean of each run, and motion corrected. FMRI data processing was carried out using FEAT (FMRI Expert Analysis Tool) Version 6.00, part of FSL (FMRIB's Software Library, www.fmrib.ox.ac.uk/fsl). Higher-level analysis was carried out using FLAME (FMRIB's Local Analysis of Mixed Effects) stage 1 (Woolrich et al., 2004; Woolrich, 2008; Kuhl et al., 2011).

Behavioral and neuroimaging data analysis

Request a detailed protocol

Functional localization We designed a general linear model (GLM) to identify functional regions for each participant. We used regressors corresponding to face, scene, objects, and rest conditions, and applied Scene > Face and Face > Scene contrasts to identify face-selective and scene-selective regions for each participant (Eichenbaum, 2004); for these contrasts, we used a liberal voxel-wise threshold of p < 0.005, uncorrected, to ensure that we did not miss informative voxels. These regions of interest were used as a means of dimensionality reduction, allowing us to confine the multivariate analysis of experimental runs to these regions.

Multivariate pattern analysis

Request a detailed protocol

As noted, Stage I stimuli were of a different category than Stage II stimuli. For each participant, Stage I stimuli were faces in half of the blocks and scenes in the other half. This ensured that identifying Stage I evidence was not confounded with simply a replay bias for faces and scenes, which might have been the case if the Stage I stimuli were from the same category across all trials. We used L2-regularized logistic regression (penalty = 1) to train a classifier on TRs when participants observed images of faces, objects, and scenes during the independent functional localizer run acquired prior to experiment (Polyn et al., 2005). Importantly, to detect face activation, we trained a face vs. other (scene, object) classifier only on voxels from the face-selective ROI identified in the localizer analysis above; conversely, to detect scene activation, we trained a scene vs. other (face, object) classifier only on voxels from the scene-selective ROI identified in the localizer analysis above. Our use of this approach (where we used separate sets of voxels for face detection and scene detection) was driven by our desire to obtain, to the greatest extent possible, independent readouts of face and scene activity. It is a well known issue with MVPA that, when face and scene classifiers are applied to the same voxels, face and scene classifier readouts tend to be artifactually anti-correlated because the face and scene labels are anti-correlated in the training data (i.e., the classifier learns that, in the training data, the presence of scene predicts the absence of face, and vice-versa) (Wilson and Niv, 2015). This problem can be mitigated to some degree by adding additional categories at training (this is why we added the object category), but we found that it can be mitigated even further by applying the classifiers to distinct sets of voxels – this carries a potential cost of losing sensitivity (by restricting the voxels going into a particular classifier) but we were willing to pay this price to address the anti-correlation problem described above: In our study, the correlation between the time course of face evidence (from face ROIs) and scene evidence (from scene ROIs) during the rest periods was positive (on average,. 36), not negative.

To measure replay of Stage I category information, we took the trained face and scene classifiers and applied them to the TR-by-TR volumes from each run’s rest periods, focusing specifically on the last 10 TRs of each rest period (we omitted the first 5 TRs from each rest period to minimize ‘spillover’ from participants perceiving Stage II stimuli right before the rest period started). As noted above, if the Stage I stimulus was a face, we used the face classifier to detect replay; if the Stage I stimulus was a scene, we used the scene classifier. For each TR, the classifier outputs an evidence value for the selected category between zero and one; we averaged these evidence values across TRs to get a neural replay score for each block (since there were three rest periods per block and we used 10 TRs per rest period, the replay score for each block was based on 30 TRs).

The main goal of this analysis was to test whether evidence for offline replay of the root state (State 1) during the rest periods correlated with subsequent revaluation behavior during the Test phase. Therefore, we computed correlations between mean classification accuracy (evidence for offline replay of Stage I) and subsequent revaluation behavior separately for revaluation condition trials and control trials (Figure 2B).

Bootstrap comparison of replay-replanning correlation across conditions

Request a detailed protocol

To compare the size of the correlation between replay and behavior across conditions (revaluation vs. control), we used a bootstrap approach in which we resampled participants with replacement 1000 times; for each bootstrap sample, we computed the difference in the size of the replay-behavior correlation across conditions, thereby giving us a bootstrap distribution of correlation differences.

Univariate analysis of rest periods

Model-based analysis

Request a detailed protocol

In order to identify regions that were sensitive to reward prediction errors (RPE), we ran a parametric modulation analysis in which we regressed out mean signal change when participants received reward, and identified regions where variance around the mean was modulated by the magnitude of absolute RPEs experienced while receiving the reward. Unsigned prediction errors were calculated for each subject and each stimulus using Equation (2).

(2) | δ =rs'+ γQs', a'- Q(s, a)|

Here s refers to the current state and s’ to the next state, a is the current action and a’ the next action, and γis the temporal discounting parameter (here γ=1,since the temporal horizon in a 2-step Markov decision process (MDP) is limited to one step (42)). The reward prediction error is in turn used to update the state-action value as equation (3) below, where αdenotes the learning rate. We estimated a learning rate of. 7 based on analysis of previous pilot behavioral data (n=24) and used this learning rate for all participants.

(3) Qs, a= Qs, a+ α δ

We were interested in identifying regions where sensitivity to unsigned reward PEs, a measure of uncertainty used in the reinforcement learning literature, would correlate with future replay during rest periods and revaluation magnitude during the final Test phase.

Estimating learning rates

Request a detailed protocol

We estimated learning rates from a separate sample that performed the same task in a pilot study. The structure and rewards used in this pilot study were virtually identical to the fMRI task. As model-based fMRI analyses are typically performed using group-level (rather than individual level) parameter estimates, we estimated this learning rate from the pooled subject data using Maximum Likelihood Estimation, resulting in one best-fitting learning rate for the entire population (learning rate of. 7, n = 24). We confirmed a similar best-fitting learning rate in the fMRI sample (.78). With regard to the generalizability of the neural results to different values of the learning rate, it is worth noting that a previous systematic evaluation of this issue (51) found that substantive differences in learning rates in the model-based analysis of similar tasks lead to only minute changes in neural results.

The effect of volatility on prediction errors

Request a detailed protocol

Here we have estimated prediction errors regardless of the noise conditions. However, theoretically the statistical volatility of, and hence the overall uncertainty about, the environment or context may change the extent to which prediction errors tag memories for offline replay. For instance in environments with high volatility or uncertainty, the brain may become less sensitive to PEs that are not larger than some threshold. One possibility to give a computational account of the effect of such contextual uncertainty (or reward volatility) is to divide the PE by the variance (e.g. in rewards) in future models. Such a simple computational step, and similar varieties using the hazard rate, would lead to lower prediction errors in contexts with higher variance. Our data only partially speaks to this point: the noisy revaluation condition (that includes uncertainty about the precise reward outcome) leads to worse replanning performance than the noiseless revaluation condition (that has no uncertainty about outcomes outside of the revaluation). A systematic investigation of the role of hazard rate or volatility of rewards on tagging of surprising memories for future replay merits a separate dedicated study. Such a study would benefit from a design where prediction errors are parametrically changed across conditions or groups.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
    Model-based hierarchical reinforcement learning and human action control
    1. M Botvinick
    2. A Weinstein
    (2014)
    Philosophical Transactions of the Royal Society B: Biological Sciences 369:20130480.
    https://doi.org/10.1098/rstb.2013.0480
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
    Hippocampal Contributions to Control: The Third Way
    1. M Lengyel
    2. P Dayan
    (2008)
    Advances in Neural Information Processing Systems  20:889–896.
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
    A deeper look at planning as learning from replay
    1. H Van Seijen
    2. RS Sutton
    (2015)
    Proceedings of the 32nd International Conference on International Conference on Machine Learning. pp. 2314–2322.
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50

Decision letter

  1. David Badre
    Reviewing Editor; Brown University, United States
  2. Michael J Frank
    Senior Editor; Brown University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Offline Replay Supports Planning: fMRI Evidence from Reward Revaluation" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: Fiery Cushman (Reviewer #4).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This manuscript investigates how people engage in offline replay to update their value estimates decisions during a sequential task. Specifically, the experiment constructs a situation in which participants assign value to early decisions based on late outcomes, and then acquire new information about late consequences outside the context of early decisions. This provides a putative opportunity to compute the new expected value of the early decisions through offline replay. The paper reports behavioral and MVPA evidence that offline replay aids in this type of re-planning.

Essential revisions:

The editors and reviewers agreed that this paper investigates a fundamental topic. Further, the approach that combines MVPA estimates of task state with a two-step task is innovative and clever. However, both reviewers also raised significant concerns regarding the design and analysis. The most serious of these questioned what conclusions can be drawn from these results and whether there is specific evidence for planning versus other alternatives. In discussion among the reviewers and editors, it was agreed that several essential revisions are critical to shore up the specific contribution here beyond a basic association of replay with performance.

1) Alternative accounts of these results.

The authors consider the possibility that unsigned reward prediction errors experienced during the "relearning" phase prompt revaluation. This is plausible, but the authors don't consider any alternative hypotheses. In the absence of an alternative, it's hard to evaluate whether the evidence they consider in favor of this model is actually decisive – i.e., does their evidence uniquely favor their model, or is it consistent with other models?

Moreover, they undercut their own model when they write: "the inclusion of reward noise in the noisy-rewards condition (even in control blocks) allowed us to rule out the possibility that replanning is brought about by 'any' experience of uncertainty, whether or not replanning was optimal". This is a good point and an elegant feature of their design, but they don't seem to address how their model of revaluation assesses the "optimality" of replanning beyond unsigned PEs. In other words, they include a control condition specifically to show that their model is incomplete, and then don't address how it could be made complete.

Just to sharpen this point, here is one alternative account of what prompts revaluation. Possibly, during the rest period, people start engaging in value iteration from random points in their task model, which includes Stage 1. They then continue the process of value iteration at all points where value iteration results in high unsigned PEs during update. In other words, when performing value iteration on Stage 1 actions, the mismatch between the existing Stage 1 value estimates and the newly updated Stage 2 value estimates generates high unsigned PEs. These remain high until sufficient value iteration occurs that Stage 1 actions are appropriately revalued.

In summary, perhaps what drives the total amount of revaluation is not the magnitude of Stage 2 PEs during online relearning, but rather the magnitude (or persistence) of Stage 1 PEs during offline revaluation. (It should be clear how this stands in contrast to the "tagging" idea presented in the general Discussion. Incidentally, there is no reason that both things couldn't happen).

This alternative model addresses the question of why there is no reevaluation for high variance rewards in the control condition: the variance of the terminal rewards does not affect the Stage 1 PEs during value iteration.

It also seems to be consistent with all the data the authors present, because (except in the control condition) high unsigned Stage 2 PEs during relearning will be correlated with high unsigned Stage 1 PEs during revaluation.

But this correlation need not always hold – the difference between the two models can be tested. For instance, imagine that during the revaluation phase, values are swapped on the right/left dimension within each state. i.e., if formerly State 2/left was $1 and State 2/right was $8, then now State 2/left is $8 and State 2/right is $1. On the authors' model, this will result in high unsigned PEs during relearning and therefore also high levels of revaluation. On the model sketched above, however, it would not result in high levels of revaluation because max(Q) for each Stage 2 state does not change, and so there is no change to the Q value of each Stage 1 state, and thus no PEs during value iteration.

(A conceptually similar experiment would involve drifting rewards away from their original values but then right back to their original values all within a single relearning phase. This results in high PEs during the phase, but the model sketched above would not predict much revaluation for reasons similar to those described above).

The purpose of detailing this alternative is not to argue that this model is right or the only alternative, but to illustrate a much broader point. By sketching only one model of what prompts revaluation and failing to explicitly consider alternatives, the authors leave the reader in a poor position to evaluate whether the analyses that are consistent with their model are in fact good evidence for it – we don't know whether those analyses are also consistent with all plausible alternatives. Could the authors show us what models of revaluation-prompting are ruled out by their analyses? This would be a big help, whether or not the authors decide to discuss the specific alternative mentioned here.

2) Alternatives raised by the design and logic

Akin to the above point, there were several points regarding the design that raise other alternatives, including basic issues of engagement. Specifically:

- What incentive did people have to go to the higher-valued states in the learning phase? It seems this part of the task was not incentivized, so people had no particular reason to select higher numbers over lower numbers (except for learning for later). The replanning magnitude metric then becomes potentially problematic in that it subtracts away performance during the learning phase, which was not incentivized. This issue becomes especially problematic when trying to assess the impact of the brain replay (e.g., Figure 2). Here replay correlates with replanning, but which aspect of the replanning is not clear. Perhaps this correlation is due to behaviour in the learning phases-i.e., those people who paid more attention and learned more initially show more replay later on. That's not so surprising or interesting. The interesting thing is to predict later behaviour, but subtracting away a baseline that precedes the replay means that the replay can just be correlating with something that occurred earlier (learning/attention/task engagement, not replanning). It would be more convincing to show that the replay correlates solely with the test behaviour not with this replanning metric.

- The above point is compounded somewhat with the choice of the control condition here. Unlike the revaluation condition, there is no learning necessary at all in the second phase. There is also no incentive to pay attention or be engaged at all during this second phase. As a result, the differences in brain activity may be due to differential engagement, rather than specific to the replay-replanning connection as claimed here.

- The test phase gave no feedback or guidance. How were participants supposed to know which experience to draw upon when making those decisions? The assumption here is that the optimal thing here is to integrate the information in the two phases by adjusting the values of the terminal states in phases 2 and keeping the same phase 1 knowledge of task structure. A smart participant, however, might infer that the recurrence of the Stage 1 stimulus (which never appeared in the relearning phase) meant that the Learning phase reward-contingency rules were back in effect. It's not obvious what is optimal here. Maybe they have learned that the Stage 2 stimuli give different values when they are (or are not) preceded by the Stage 1 stimuli (i.e., a type of occasion setting).

- What were the means/variances of the experienced distributions? Were the actual means of the experienced distributions controlled or only the generative process? Were the experienced distributions close to the planned ones? How was it ensured that all conditions were experienced? Could people have avoided one of the outcomes e.g., due to getting initially unlucky (hot-stove effect)?

- It was not clear is why non-specific replaying the Stage 1 stimuli during the replanning task would help. In some ways, it says that people who are more engaged with the task do better on the task. That's interesting, but not groundbreaking. The data here do not seem to show that category-specific replay predicts later performance. Does replay of other categories (not the Stage 1 stimulus) not correlate with replanning behaviour? That would be a good additional piece of evidence beyond simple task engagement.

3) Concerns over the robustness of the analysis and approach to the computational modeling. It should be noted in light of the first comment below that eLife does not require that all hypothesis, predictions, or analyses be preregistered. However, the reviewers agreed that there should be explicit discussion of investigator degrees of freedom and what safeguards against flexibility in the analysis were put in place. Further, any checks that can be provided for robustness would be helpful. Here are the specific comments raised regarding the analysis and modeling.

- The article is pitched as confirmatory in the sense that it makes predictions and then confirms those predictions with evidence. Were any of these predictions pre-registered anywhere? Were there any competing hypotheses? What sort of data might have gone against these predictions? The evidence in the paper could use to be bolstered to establish that it was truly a confirmatory study. Similarly, were any of the metrics or analyses preregistered? If they were indeed hypothesis-driven, that should have been possible. It would seem that there are many, many ways that these data could have been analyzed (both behavioural and neural), and some of the metrics seem very arbitrary (e.g., number of trials in learning phase). Were other analysis pathways attempted? Evaluating the import of the inferential statistics requires more knowledge of the researcher degrees of freedom.

- Two participants were excluded post-hoc due to "not having sufficient number of runs". Were there any pre-defined criteria on this front? Or was this exclusion made after the data were examined? Additionally, the sample size seems small (though not inordinately so for an fMRI study), leaving many of the statistical tests underpowered (and indeed many of the brain-behaviour correlations fell in the marginal zone, making interpretation difficult).

- The idea that high prediction error should be preferentially replayed seems like a rule that would only work in very narrow (static) environments. If there is any variance in the rewards, then this rule could lead to perverse replay of unreducible uncertainty.

- The selection of learning rate requires further justification. How was the.7 estimated? What sort of task? If the task were as simple as the current task, with such a wide gap in rewards, I wonder how reliable an estimate this number represents. Moreover,.7 seems extremely high for a task with variable rewards. That would make people extremely influenced by the most recent outcome, much more so than a typical task with variable or probabilistic rewards. Is it reasonable to use the learning rate estimated in a different sample for analyzing the new sample? How robust are the prediction error analysis to this selection of learning rate?

https://doi.org/10.7554/eLife.32548.024

Author response

Essential revisions:

The editors and reviewers agreed that this paper investigates a fundamental topic. Further, the approach that combines MVPA estimates of task state with a two-step task is innovative and clever. However, both reviewers also raised significant concerns regarding the design and analysis. The most serious of these questioned what conclusions can be drawn from these results and whether there is specific evidence for planning versus other alternatives. In discussion among the reviewers and editors, it was agreed that several essential revisions are critical to shore up the specific contribution here beyond a basic association of replay with performance.

We thank the editor and reviewers for their constructive comments – these comments have enabled us to substantially improve the manuscript. We have edited the manuscript accordingly. Below we address the reviewer’s points and quote edited excerpts from the manuscript in our response.

We will first summarize what we see as the main contributions of our paper, including alternative hypotheses that we have ruled out. In the revised manuscript, we have sought to be much clearer about what can and cannot be concluded from our results.After this overall summary, we respond to the reviewers’ concerns in a point-by-point fashion.

1) What are the consequences of replay?

Inspired by the Dyna architecture in RL: Our main contribution is showing that replay is correlated with successful replanning in the revaluation condition.

Our core hypothesis was that, in situations where the reward structure of the world has changed, offline replay would drive learning and revaluation (replanning). This hypothesis was inspired by the Dyna architecture proposed by Rich Sutton in machine learning (R. S. Sutton, 1991, Dyna, an Integrated Architecture for Learning, Planning, and Reacting), in which offline replay of experience trains an actor/learner. We designed our experiments to find evidence for Dynalike algorithms in the brain. In support of our hypothesis, we found that fMRI-replay of the Stage 1 category during the relearning phase was correlated with later replanning behavior in the revaluation condition. This relationship between neural replay and behavior was specific to the revaluation condition, not the control condition.

The finding that replay was associated with behavioral change in the revaluation condition but not the control condition rules out the possibility that replay is always associated with behavioral changeand also a number of other potential confounds involving attention or forgetting. Rather, it appears that (as we hypothesized) replay is only associated with behavioral change in situations where (prior to rest) the reward structure of the world changed. We regard this specific relationship between replay of an antecedent state and subsequent behavioral change as the most important contribution of the present manuscript. In the revised manuscript, we discuss these alternative hypotheses (and why they are ruled out by the data) in a much more direct fashion.

2) What predicts replay?

Inspired by prioritized sweeping in RL: Our second hypothesis was that surprise during encoding drives later replay.

The second contribution of our paper is to explore what sorts of events are favored for the replay that drives behavioral change. We approach this question by looking for factors at encoding that predict subsequent replay. Our hypothesis about this was inspired by prioritized sweeping in machine learning, which uses how surprising an experience is to prioritize its later replay. Surprise is operationalized here as the unsigned magnitude of a prediction error. As predicted, we found that the brain’s sensitivity to unsigned prediction errors in the ACC, basal ganglia, and posterior medial cortex (the precuneus) during initial encoding was correlated with the extent of subsequent offline replay. We also found a relationship between neural sensitivity to prediction errors and subsequent replanning in the revaluation condition.

However, as with replay, it is crucial to emphasize that according to our hypothesis there will not always be a relationship between surprise and revaluation. Surprise will drive replay, but whether or not this replay drives behavioral change will depend – as noted above – on whether the world has changed. Again, the lack of behavioral effects in the control condition speaks to the specificity of the mechanisms we are observing, and against more nonspecific alternatives in which behavioral and neural measures might generally covary due, for instance, to subject differences in task attention or motivation. In the revised manuscript, we clarify our predictions on these points, and in the remainder of this letter we show how these points address various concerns raised by the reviewers.

We have now edited the text to reflect these issues more clearly:

“Our main question in this study is: does replay predict a change in choice behavior? […] This replay, in turn, enables the integration of separate memories in the service of updating optimal reward policy (action policies that maximize reward). We combine model-based fMRI, MVPA, and a sequential decisionmaking task to test our hypotheses.”

And we added the following:

“We thus explored what events favor replay, and ultimately successful revaluation. […] We tested this idea in terms of one representative notion of surprise – neural sensitivity to unsigned PEs (i.e., the absolute value of PEs), as in the prioritized sweeping algorithm (Moore and Atkeson, 1993).”

We have also edited the first two paragraphs of the Discussion to make sure that the above points are clear to readers:

“We present fMRI evidence for the hypotheses that uncertainty-driven replay can help integrate past episodes offline, and in turn, update representations and policies that support planning (Van Seijen and Sutton, 2015). […] We suggest that the brain’s sensitivity to surprising information and increased uncertainty, here signaled by unsigned prediction errors, may tag certain episodes and trajectories of experience for selective offline replay later on (Moore and Atkeson, 1993; Peng and Williams, 1993).”

Furthermore, it is important to note that there are alternative hypotheses that are not specifically ruled out by our data. In the current revision, we attempt to more clearly delineate these possibilities. The first and most important point is that of course in a correlational study, we cannot establish that the neural events we observe are causal for behavior. In particular, based on prioritized replay we hypothesize that a surprising event is tagged for offline replay, but the reviewers suggest an alternative mechanism in which the decision of which experiences to replay occurs later (during rest). However, because this decision is based on the same criterion we posit (surprise), it predicts the same neural correlations. Note that this alternative hypothesis shares with our account the core computational idea that surprise triggers replay; the only difference is whether it is surprise during relearning or surprise during replay that actually boosts (subsequent) replay. In the revised manuscript, we now discuss this alternative hypothesis, along with some other, computational reasons to favor our hypothesis over this alternative. See our response below, c.f.1) Alternative accounts of these results.

A second caveat, which we also now state clearly in the revised manuscript, is that it is possible that these results might reflect an effect on replay of some quantity other than absolute prediction error, but which is correlated with it. One possibility here is the signed prediction error (which would favor replay of better – but not worse-than-expected events). There are some computational (Mattar and Daw BioRxiv, 2017) and empirical (Ambrose et al., 2016) suggestions supporting this alternative hypothesis. Unfortunately, as we now state in the manuscript, our experiment is not designed to distinguish these quite similar alternatives. Please see our response below for how we address this issue in the manuscript.

In what follows we provide our point-by-point attempt to address the reviewers’ questions.

1) Alternative accounts of these results.

The authors consider the possibility that unsigned reward prediction errors experienced during the "relearning" phase prompt revaluation. This is plausible, but the authors don't consider any alternative hypotheses. In the absence of an alternative, it's hard to evaluate whether the evidence they consider in favor of this model is actually decisive – i.e., does their evidence uniquely favor their model, or is it consistent with other models?

As discussed above, we hope that the new manuscript does a better job situating our results within the space of possible accounts. We would first note that the reviewers’ comments in this section appear to center around the interpretation of our secondary (and admittedly more exploratory) result, about which events predict replay. As mentioned above and elaborated below, we now discuss alternative accounts of our results in the fifth paragraph of the Discussion, involving the possibility that a different quantity (e.g. signed rather than unsigned surprise) or surprise at a different time drives replay.

As for our primary result, on the relationship between replay and behavioral change, the only substantive alternative hypothesis we know of here is that behavioral change might also be produced by computations accomplished online. As discussed in the manuscript, these mechanisms are not mutually exclusive, but they do not by themselves explain the results we report.

Moreover, they undercut their own model when they write: "the inclusion of reward noise in the noisy-rewards condition (even in control blocks) allowed us to rule out the possibility that replanning is brought about by 'any' experience of uncertainty, whether or not replanning was optimal". This is a good point and an elegant feature of their design, but they don't seem to address how their model of revaluation assesses the "optimality" of replanning beyond unsigned PEs. In other words, they include a control condition specifically to show that their model is incomplete, and then don't address how it could be made complete.

We have revised the quoted section about “optimality” in the text. We agree with the reviewers that surprise is an imperfect signal as to whether effective replanning is possible: it is, in effect, a necessary but not sufficient condition, since prediction errors may just reflect noise. (This point also comes up later in the review.) As we now say in the paper, to some extent this might be mitigated by, for instance, more carefully attempting to track variance/uncertainty so as to recognize irreducible stochasticity. But the big picture here is that there is no perfect heuristic: it’s not really possible to decide with certainty whether planning will succeed without actually doing it, so we don’t think the theory is “incomplete” in this sense.

Also, we have tried to be clearer about our hypothesis and our interpretation of the noisy rewards condition. As noted above, our hypothesis is that Stage 2 unsigned PEs prompt later replay – but this will not necessarily result in effective replanning (meaning, changed behavior at Stage 1). Whether or not this replay causes revaluation will depend on whether the world has changed (and hence there is something new to learn from replay). Putting these pieces together, we expected there to be a link between stage 2 unsigned PEs and replanning in the revaluation condition (where the reward changed during relearning) but not in the control condition (even under noise, where the reward structure did not change during relearning). This is what we found, and this pattern of results speaks against alternatives in which, for instance, neural and behavioral signatures of decision variables and replanning covary more generally due to nonspecific changes in, for instance, motivation or attention.

See the Discussion:

“It is important to consider alternative hypotheses and interpretations of our results. […] Altogether, the specificity of the findings is consistent with the hypothesis that learning from simulated experience during replay determines whether revaluation occurs.”

Also see Results:

“This noise manipulation – i.e. the noiseless-rewards versus noisy-rewards conditions – ensured that the difference between revaluation and control condition blocks was not limited to the absence or presence of PEs or surprise in general. […] Accordingly, the inclusion of reward noise in the noisy-rewards condition (even in control blocks) allowed us to rule out the possibility that change in choice preferences is brought about by ‘any’ experience of uncertainty, whether or not experience actually supported replanning.”

Just to sharpen this point, here is one alternative account of what prompts revaluation. Possibly, during the rest period, people start engaging in value iteration from random points in their task model, which includes Stage 1. They then continue the process of value iteration at all points where value iteration results in high unsigned PEs during update. In other words, when performing value iteration on Stage 1 actions, the mismatch between the existing Stage 1 value estimates and the newly updated Stage 2 value estimates generates high unsigned PEs. These remain high until sufficient value iteration occurs that Stage 1 actions are appropriately revalued.

In summary, perhaps what drives the total amount of revaluation is not the magnitude of Stage 2 PEs during online relearning, but rather the magnitude (or persistence) of Stage 1 PEs during offline revaluation. (It should be clear how this stands in contrast to the "tagging" idea presented in the general Discussion. Incidentally, there is no reason that both things couldn't happen).

[…] The purpose of detailing this alternative is not to argue that this model is right or the only alternative, but to illustrate a much broader point. By sketching only one model of what prompts revaluation and failing to explicitly consider alternatives, the authors leave the reader in a poor position to evaluate whether the analyses that are consistent with their model are in fact good evidence for it – we don't know whether those analyses are also consistent with all plausible alternatives. Could the authors show us what models of revaluation-prompting are ruled out by their analyses? This would be a big help, whether or not the authors decide to discuss the specific alternative mentioned here.

We thank the reviewers for proposing this alternative hypothesis. In general, we view this as a friendly amendment, in that it actually shares the core computational claim of our favored (“tagging”) hypothesis: the idea that surprise triggers replay. The only difference is whether it is surprise during relearning or surprise during replay that boosts (subsequent) replay. We agree that the data do not discriminate between these two (highly related) accounts and we have updated the manuscript to reflect this. We also note in the paper that there are a priori computational reasons to favor the tagging hypothesis over the random-replay hypothesis: for tasks with larger trees and longer trajectories the random strategy is not tenable. Prioritized replay might seem equally effective in small state spaces, like the one in the present paradigm, but it outperforms the random replay strategy for large state spaces, and especially realistic state spaces with high dimensionality. We have now edited the Discussion to reflect these thoughts:

“Our second key result concerns a further pair of associations, between earlier signatures of PE and subsequent replay and behavioral planning. […] Thus, while acknowledging other possibilities, we suggest that some form of prioritized replay is more likely.”

2) Alternatives raised by the design and logic

Akin to the above point, there were several points regarding the design that raise other alternatives, including basic issues of engagement. Specifically:

- What incentive did people have to go to the higher-valued states in the learning phase? It seems this part of the task was not incentivized, so people had no particular reason to select higher numbers over lower numbers (except for learning for later). The replanning magnitude metric then becomes potentially problematic in that it subtracts away performance during the learning phase, which was not incentivized.

Thank you for this question. We have added the following to the revised manuscript:

“Participants were incentivized to search for and choose higher-value states. […] And the more money you steal in the game, the more REAL BONUS MONEY you will be paid on top of the normal experiment payment. You could earn an extra $2-3!”

This issue becomes especially problematic when trying to assess the impact of the brain replay (e.g., Figure 2). Here replay correlates with replanning, but which aspect of the replanning is not clear. Perhaps this correlation is due to behaviour in the learning phases-i.e., those people who paid more attention and learned more initially show more replay later on. That's not so surprising or interesting. The interesting thing is to predict later behaviour, but subtracting away a baseline that precedes the replay means that the replay can just be correlating with something that occurred earlier (learning/attention/task engagement, not replanning). It would be more convincing to show that the replay correlates solely with the test behaviour not with this replanning metric.

To be clear, our hypotheses do not concern raw test phase performance; the prediction of the theory (as expressed in the DYNA learning equations) is that change in the behavioral policy (not the absolute endpoint policy) is driven by replay. Relatedly, and in response to the reviewer’s particular concern, we think it is important to test a replanning score rather than raw test phase performance precisely to control for initial performance during the learning phase. Otherwise, a positive correlation between raw test phase performance and neural activity could in principle be driven by a relationship between the policy learned initially (carried over unchanged into the test phase) and the neural measure. Previous behavioral work based on similar logic (Gershman et al., 2014) used a relative revaluation score for the same reason. We add the following to our explanation of the replanning score:

“We use a score that indexes change in preferences here (Gershman et al., 2014), rather than raw Test phase choice behavior for two reasons. First, this is what our hypotheses concern: that replay updates pre-existing choice policies, driving a change. Second, this score controls for any confounding effects of initial learning performance on Test phase behavior. (For instance, subjects paying different amounts of attention might learn to varying degrees in the initial phases, leading to differences in both replay and Test phase behavior, even if replay itself doesn't affect later preferences.)”

Secondly, we would repeat the point made earlier, that our use of the control condition helps to rule out concerns of this sort, since incidental relationships between replay and behavior should be the same in this condition if they are actually driven by attention, motivation, forgetting or other nonspecific factors. Instead we report that the relationship is significantly attenuated in this case, speaking to the specificity of the relationship to the hypothesized condition.

Nevertheless, we examined learning phase behavior separately to verify that participants’ learning-phase behavior was not by itself driving the correlation, we computed a correlation between the participant’s performance on the 10 last trials of the learning phase (i.e. the prescore) and replay and revaluation scores later on. Recall that the key result was a brain-behavior correlation specific to the revaluation condition. But in the revaluation condition (red plots in Author response image 1) there were no positive correlations between replay and learning phase behavior (Revaluation, no Noise: rho=.19, p=.38; Revaluation, Noisy: rho=.34, p=.1):

We also did not find a significant correlation in the control condition with no noise (rho = -0.16, p = 0.47). The only just significant correlation we saw was in the noisy control condition (grey plots in Author response image 2), where we observed a marginally significant negative correlation between prescores (learning phase behavior) and replay (Spearman’s Rho = -0.41, p = 0.045). This indicates that lower performance was correlated with more Stage 1 replay during subsequent rest, but this could be also due to uncertainty and does not confound our main results.

- The above point is compounded somewhat with the choice of the control condition here. Unlike the revaluation condition, there is no learning necessary at all in the second phase. There is also no incentive to pay attention or be engaged at all during this second phase. As a result, the differences in brain activity may be due to differential engagement, rather than specific to the replay-replanning connection as claimed here.

In the revision, we have clarified that attention was required during the relearning phase, for in both the revaluation condition and the control condition. To address this, we have added a paragraph to the main text:

“Attention engagement during the relearning (control trials)Participants were instructed at the beginning of each run that the rewarding choices may change and they needed to pay attention to changes. […] They did not know they were in the control condition and were incentivized to pay attention.”

- The test phase gave no feedback or guidance. How were participants supposed to know which experience to draw upon when making those decisions? The assumption here is that the optimal thing here is to integrate the information in the two phases by adjusting the values of the terminal states in phases 2 and keeping the same phase 1 knowledge of task structure. A smart participant, however, might infer that the recurrence of the Stage 1 stimulus (which never appeared in the relearning phase) meant that the Learning phase reward-contingency rules were back in effect. It's not obvious what is optimal here. Maybe they have learned that the Stage 2 stimuli give different values when they are (or are not) preceded by the Stage 1 stimuli (i.e., a type of occasion setting).

We agree that it is theoretically possible that participants would learn two separate sets of reward-contingency rules in the revaluation condition (one for learning, one for relearning) and then access the initial set of rules during the final test. However, empirically speaking, this account is not correct – it predicts that participants would not show a behavioral replanning effect in the revaluation condition, but they did.

- What were the means/variances of the experienced distributions? Were the actual means of the experienced distributions controlled or only the generative process?

In order to address the reviewer’s first question, we have generated subject-level scatter plots of mean generative reward (x-axis) vs. the rewards the participant observed (y-axis). In response to the second question: we did not fix the realized distribution of rewards but instead generated outcomes pseudorandomly according to the generative distributions. That said, we ensured that the distributions were adequately sampled during the experiment by including forced choice trials. Each participant’s choices were tracked, and actions that were less explored were presented in forced-choice trials in which the participant had to take those actions. We included on average, 3 “forced choice” catch trials in phase 1 of each run. We included at least 1 of each less explored option (out of 4 total options to explore). As a result of this forced-choice manipulation, they sampled all options. In the no-noise conditions this of course ensured that they had repeated experience with the exact means. In the noisy conditions the precise means could slightly vary across participants because the number of visits to each option and their random realizations were not matched across all participants. That said, the rewards were distributed such that this potential variation in the experienced means of the noisy conditions did not change the globally highest mean reward. Therefore, the optimal policy in the global structure remained the same.

Were the experienced distributions close to the planned ones? How was it ensured that all conditions were experienced? Could people have avoided one of the outcomes e.g., due to getting initially unlucky (hot-stove effect)?

That is an important point that we have now added to the manuscript (subsection “Forced choice catch stimuli”).

In short, we took the possibility of avoiding outcomes into consideration when designing the experiment. To address this possibility, we included controlled forced choice trials in which one of the choices on the screen was highlighted and participants had been instructed to choose the highlighted option each time they encountered one. These control forced choices ensured that they explored every branch, and served a number of purposes. (a) They matched the encounters of the different options to ensure participants had exposure to the different conditions. This ensured that the experienced distributions were close to the planned ones, although there were fewer negative PEs, as participants are less likely to return to a choice with negative PE. (b) As mentioned in our response above, these forced choice trials served to keep participants’ attention engaged, as they were told that it is important to choose the highlighted action when an option is highlighted.

If the reviewers see fit, we would be happy to include this in the main text.

- It was not clear is why non-specific replaying the Stage 1 stimuli during the replanning task would help. In some ways, it says that people who are more engaged with the task do better on the task. That's interesting, but not groundbreaking. The data here do not seem to show that category-specific replay predicts later performance. Does replay of other categories (not the Stage 1 stimulus) not correlate with replanning behaviour? That would be a good additional piece of evidence beyond simple task engagement.

We are grateful for the comment. To be clear, we do think that the result concerning Stage 1 stimuli is rather groundbreaking. Importantly, unlike other related work like that of Shohamy and Wimmer, 2012, and Doll et al., 2015, this is activity during a rest period: it is thus a bit of a stretch to describe it as reflecting “task engagement.” Perhaps one might still find it unsurprising that thinking about the task even when resting promotes better on-task performance, but this has not to our knowledge been shown previously, and indeed as we discuss in the paper the predominant hypothesis going back to Tolman has been that this sort of task is solved not by this sort of off-line contemplation but by on-line planning at choice time, corresponding to our test phase.

Second, our general perspective is that the Stage 1 stimulus (rather than the Stage 2 stimulus) is the most diagnostic signal of the sort of offline replay with hypothesize. The Stage 2 stimulus may also be involved in replay for planning. However, since the Stage 2 stimuli were also visible onscreen during the relearning phase immediately preceding the rest periods where we measure replay, there is likely to be Stage-2 related activity for all sorts of reasons not specific to planning, like simple working memory for recent event. Instead, retrieval of memory for the Stage-1 category (which was not presented in the relearning phase) is the crucial step hypothesized for binding the new information with the Stage 1 actions, i.e. for planning. (We have used similar logic in other studies from our lab, e.g., Gershman et al., 2013, Journal of Neuroscience and Manning et al., 2016, Psychonomic Bulletin and Review.) Overall, we expect the relationship between replay and revaluation to be most clearly expressed for the Stage 1 stimulus.

In light of the reviewer’s comment, we examined the category specificity of the relationship between rest-period category activity and replanning behavior. As expected, the relationship for the Stage-2 stimulus was weaker and not significant. Specifically, we tested whether Stage 2 replay was also correlated with subsequent planning (revaluation) behavior, and we found that the correlation was non-significant for both conditions (revaluation: rho=.2, p=.17; control: rho=.25, p=.22). Multiple regression did not reveal any significant effect of Stage 2 replay (ß=.84, p=.44) but only a significant effect of the Stage 1 replay on the revaluation scores (ß=1.89, p=.04).

The relatively stronger effect for Stage-1 rather than Stage-2 stimuli also again suggests that our neural effects are driven by the specific planning-related computations we hypothesize, rather than more generic attention-related confounds.

To address these points in the paper, we have made the following changes.

We added the following to the revised manuscript:

“Note that the crucial step for revaluation is retrieval of memory for the Stage-1 state, which was not presented in the Relearning phase. […] As expected, we observed no significant correlation between Stage 2 replay and revaluation behavior (Figure 2—figure supplement 2) in the revaluation condition (rho=.2, p=.17) or the control condition (rho=.25, p=.22).

And included figure 2—figure supplement 2 with the following caption:

“Figure 2—figure supplement 2. We tested whether Stage 2 replay was also correlated with subsequent planning (revaluation) behavior. We found no significant correlation. […] Multiple regression did not reveal any significant effect of Stage 2 replay (ß=.84, p=.44) but only a significant effect of the Stage 1 replay on the revaluation scores (ß=1.89, p=.04).”

3) Concerns over the robustness of the analysis and approach to the computational modeling. It should be noted in light of the first comment below that eLife does not require that all hypothesis, predictions, or analyses be preregistered. However, the reviewers agreed that there should be explicit discussion of investigator degrees of freedom and what safeguards against flexibility in the analysis were put in place. Further, any checks that can be provided for robustness would be helpful. Here are the specific comments raised regarding the analysis and modeling.

- The article is pitched as confirmatory in the sense that it makes predictions and then confirms those predictions with evidence. Were any of these predictions pre-registered anywhere? Were there any competing hypotheses? What sort of data might have gone against these predictions? The evidence in the paper could use to be bolstered to establish that it was truly a confirmatory study. Similarly, were any of the metrics or analyses preregistered? If they were indeed hypothesis-driven, that should have been possible. It would seem that there are many, many ways that these data could have been analyzed (both behavioural and neural), and some of the metrics seem very arbitrary (e.g., number of trials in learning phase). Were other analysis pathways attempted? Evaluating the import of the inferential statistics requires more knowledge of the researcher degrees of freedom.

While we did not preregister any of the predictions or analyses/metric, our main predictions were derived from well-established computational models that preceded our experiment by decades, and our analyses were indeed envisioned ahead of time based on other studies that we have conducted in our lab and additional studies from the literature. As discussed in the paper, our prediction linking replay to replanning comes from Sutton’s Dyna architecture (Sutton, 1991), in which action values are updated via offline replay. Our prediction linking surprise during learning to replay comes from prioritized sweeping algorithms (Moore and Atkeson, 1993; Peng and Williams, 1993). The combination of the two was again proposed in Sutton et al., 2012 where the authors combined Dyna-style planning with prioritized sweeping. With respect to the experimental and analytic approaches, Gershman et al., 2014, had tested a similarly structured behavioral task (and analyzed it according to a comparative revaluation index) and Momennejad et al., 2017, use similar reward revaluation paradigms and show that human behavior is consistent with the behavior of an algorithm that employs offline replay. The logic of testing across-subject variation in revaluation success against neural reinstatement measures was derived from Wimmer and Shohamy, 2012, and, in our group, Doll et al., 2015. Backed by these theoretical proposals in reinforcement learning and previous findings in humans, we hypothesized that the brain employs algorithms that use offline replay to update planning policies offline, and that trajectories related to states with local PE would have prioritized replay. Our fMRI replay analyses were based closely on other studies from our lab that used scene/face MVPA to track thoughts related to stimuli not presently onscreen (Gershman et al., 2013; Manning et al., 2016). The method for testing the hypothesis that |PE| drives offline replay and replanning was tested using the common approach in the RL-fMRI literature, which is to estimate the prediction errors using a learning rate from the model, and then use these trial by trial prediction error estimates in a parametric modulation analysis.

Our main predictions could easily have been falsified if our MVPA metric of Stage 1 replay had failed to show a correlation with behavior, or if we had failed to observe a relationship between neural sensitivity to prediction errors and replay. These primary analyses happened to work “out of the box” with minimal adjustment of parameters on our part. Also, as described throughout this response, we have run several control analyses that each could have cast doubt on our interpretation if they turned out a particular way (but they did not). We would be happy to share the data and analyses after publication.

Finally we note that we ran this study in summer 2016 and our lab did not “flip the switch” on preregistering our fMRI studies until 2017. A lot has changed with regard to the prevalence of preregistration in the last 2 years!

- Two participants were excluded post-hoc due to "not having sufficient number of runs". Were there any pre-defined criteria on this front? Or was this exclusion made after the data were examined? Additionally, the sample size seems small (though not inordinately so for an fMRI study), leaving many of the statistical tests underpowered (and indeed many of the brain-behaviour correlations fell in the marginal zone, making interpretation difficult).

We could only include participants who successfully completed trials in all four experimental conditions, and there was 1 run per condition. Thus, participants were excluded on a priori grounds if they failed to learn the initial MDP in any of the four runs. We have now included the following lines to clarify this point in the manuscript:

“Please note that, logically, the analysis required each participants to learn the local dependencies of the task in the learning phase (indexed by their choice behavior) and there was 1 run per condition. […] Therefore, based on this a priori requirement, any participant who did not learn the task structure during the learning phase (for any of the four runs) was excluded."

- The idea that high prediction error should be preferentially replayed seems like a rule that would only work in very narrow (static) environments. If there is any variance in the rewards, then this rule could lead to perverse replay of unreducible uncertainty.

That is a great comment and we appreciate that the reviewer brought this up. As also discussed earlier, we agree that in a more sophisticated model tracking variances of various sorts, priority could be scaled to down-prioritize replaying irreducibly uncertain values. (Though we also note that given that part of the goal of replay is to average out stochasticity in computing net action values we don’t actually think it’s entirely pointless to replay those either.) We have now included the following lines to the manuscript to address this point.

“The effect of volatility on prediction errorsHere we have estimated prediction errors regardless of the noise conditions. […] Such a study would benefit from a design where prediction errors are parametrically changed across conditions or groups.”

- The selection of learning rate requires further justification. How was the.7 estimated? What sort of task? If the task were as simple as the current task, with such a wide gap in rewards, I wonder how reliable an estimate this number represents. Moreover,.7 seems extremely high for a task with variable rewards. That would make people extremely influenced by the most recent outcome, much more so than a typical task with variable or probabilistic rewards. Is it reasonable to use the learning rate estimated in a different sample for analyzing the new sample? How robust are the prediction error analysis to this selection of learning rate?

We would like to first stress that while the learning rate sounds high, it is actually pretty typical in our experience for human performance in tasks with stochastic reward. For instance, the learning rates we have estimated in two-step MDPs across many samples exceed 0.5. Here, we estimated this learning rate based on a separate sample of subjects who performed the same task in a behavioral pilot study. The structure (and rewards) used in this pilot study were virtually identical to the fMRI task. As model-based fMRI analyses are typically performed using group-level (rather than individual-level) parameter estimates, we estimated this learning rate from the pooled subject data using Maximum Likelihood Estimation, resulting in one best-fitting learning rate for the entire population. Though this learning rate was estimated from a different sample (a behavioral pilot study), we confirmed, based on this comment, a similar best-fitting learning rate in the fMRI sample (0.78). In this regard, we can confidently say that this learning rate generalizes well across samples. With respect to the generalizability of the neural results to other values of the learning rate, it is worth pointing out that a previous systematic evaluation of this issue found that substantive differences in learning rates in the model-based analysis of similar tasks lead to only minute changes in the neural results (Wilson and Niv, 2015). We have now included the following text in the manuscript to address this concern:

“Estimating learning ratesWe estimated learning rates from a separate sample that performed the same task in a pilot study. […] With regard to the generalizability of the neural results to different values of the learning rate, it is worth noting that a previous systematic evaluation of this issue (Wilson and Niv 2015) found that substantive differences in learning rates in the model-based analysis of similar tasks lead to only minute changes in neural results.”

https://doi.org/10.7554/eLife.32548.025

Article and author information

Author details

  1. Ida Momennejad

    Princeton Neuroscience Institute, Princeton University, New Jersey, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    idam@princeton.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0830-3973
  2. A Ross Otto

    Department of Psychology, McGill University, Montreal, Canada
    Contribution
    Conceptualization, Data curation, Formal analysis, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9997-1901
  3. Nathaniel D Daw

    Princeton Neuroscience Institute, Princeton University, New Jersey, United States
    Contribution
    Conceptualization, Resources, Software, Supervision, Funding acquisition, Methodology, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5029-1430
  4. Kenneth A Norman

    Princeton Neuroscience Institute, Princeton University, New Jersey, United States
    Contribution
    Conceptualization, Resources, Software, Supervision, Funding acquisition, Methodology, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5887-9682

Funding

John Templeton Foundation (57876)

  • Ida Momennejad
  • Kenneth A Norman

National Institute of Mental Health (R01MH109177)

  • Nathaniel D Daw

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Matthew Botvinick and Sam Gershman for helpful conversations, and Sarah Dubrow, Dylan Rich, and James Antony for helpful comments on manuscript drafts. We acknowledge and thank Eeh Pyoung Rhee, who ran control analyses to compare classifier performance for his senior thesis in computer science at Princeton University. This project and publication was made possible through the support of a grant from the John Templeton Foundation, grant 57876, and NIMH grant R01MH109177, part of the CRCNS program. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation or other funding bodies.

Ethics

Human subjects: The Princeton University Institutional Review Board approved the study. All participants gave informed consent to participate in the fMRI study and signed a screening form that ensured they had normal or corrected to normal vision, had no metal in their body, and had no history of psychiatric or neurological disorders.(Protocol#6014).

Senior Editor

  1. Michael J Frank, Brown University, United States

Reviewing Editor

  1. David Badre, Brown University, United States

Publication history

  1. Received: October 5, 2017
  2. Accepted: December 4, 2018
  3. Accepted Manuscript published: December 14, 2018 (version 1)
  4. Version of Record published: December 21, 2018 (version 2)

Copyright

© 2018, Momennejad et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,175
    Page views
  • 356
    Downloads
  • 5
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)