Research Article

Neuroscience

Neural signatures of model-based and model-free reinforcement learning across prefrontal cortex and striatum

Institute of Neurology, Department of Clinical and Movement Neurosciences, University College London, United Kingdom
Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Portugal
Department of Experimental Psychology, University of Oxford, United Kingdom
Wellcome Centre for Integrative Neuroimaging, University of Oxford, FMRIB, John Radcliffe Hospital, United Kingdom
Sainsbury Wellcome Centre for Neural Circuits and Behaviour College, University College London, United Kingdom
Max Planck Institute for Biological Cybernetics, Germany
University of Tübingen, Germany

Jun 22, 2026

https://doi.org/10.7554/eLife.106032.3

Open access

eLife Assessment

This important study presents single-unit activity collected during model-based (MB) and model-free (MF) reinforcement learning in non-human primates. The dataset was carefully collected, and the statistical analyses, including the modeling, are rigorous. The evidence convincingly supports different roles for particular cortical and subcortical areas in representing key variables during reinforcement learning.

https://doi.org/10.7554/eLife.106032.3.sa0

Significance of the findings:

Important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence:

Convincing: Appropriate and validated methodology in line with current state-of-the-art

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Animals integrate knowledge about how the state of the environment evolves to choose actions that maximise reward. Such goal-directed behaviour – or model-based (MB) reinforcement learning (RL) – can flexibly adapt choice to changes, being thus distinct from simpler habitual – or model-free (MF) RL – strategies. Previous inactivation and neuroimaging work implicates prefrontal cortex (PFC) and the caudate striatal region in MB-RL; however, details are scarce about its implementation at the single-neuron level. Here, we recorded from two PFC regions – the dorsal anterior cingulate cortex (ACC) and dorsolateral PFC (DLPFC), and two striatal regions, caudate and putamen – while two rhesus macaques performed a sequential decision-making (two-step) task in which MB-RL involves knowledge about the statistics of reward and state transitions. All four regions, but particularly the ACC, encoded the rewards received and tracked the probabilistic state transitions that occurred. However, ACC (and to a lesser extent caudate) encoded the key variables of the task – namely the interaction between reward, transition, and choice – which underlies MB decision-making. ACC and caudate neurons also encoded MB-derived estimates of choice values. Moreover, caudate value estimates of the choice options flipped when a rare transition occurred, demonstrating value update based on structural knowledge of the task. The striatal regions were unique (relative to PFC) in encoding the current and previous rewards with opposing polarities, reminiscent of dopaminergic neurons, and indicative of an MF prediction error. Our findings provide a deeper understanding of selective and temporally dissociable neural mechanisms underlying goal-directed behaviour.

Introduction

Animals use at least two major reinforcement learning (RL) systems for behavioural control in sequential decision-making: a goal-directed or model-based (MB) system and a habitual or model-free (MF) system (Dickinson and Balleine, 1994; Daw et al., 2005; Dolan and Dayan, 2013; Dickinson, 1985). Both approaches rely on previous experience and converge to the same behaviour given enough practice in a stable environment, but they differ in how this experience is used to infer the values of choices. MB-RL computes estimates prospectively by integrating reward information with knowledge about the state-transition function, which specifies how the state of the world evolves probabilistically given particular choices (Tolman, 1948; Daw and Dayan, 2014). MF-RL, a less flexible but simpler approach, learns without any model of the environment by bootstrapping sampled experience to train cached predictions of long-run rewards via reward prediction errors (RPEs) (Thorndike, 1911; Sutton, 1988).

Neural substrates of reward-based learning and decision-making involve complex anatomical connections between the basal ganglia and the prefrontal cortex (PFC) (Balleine and O’Doherty, 2010; Haber and Behrens, 2014). Lesion studies and human neuroimaging suggest an involvement of the primate putamen in MF-RL (Yin et al., 2004; Tricomi et al., 2009; Wunderlich et al., 2012), whereas more anterior regions of the caudate (Wunderlich et al., 2012; Yin et al., 2005; Tanaka et al., 2008), PFC (Valentin et al., 2007; Gläscher et al., 2010; Daw et al., 2011; Miller et al., 2017; Schuck et al., 2016; Chan et al., 2021; Kolling et al., 2016; Behrens et al., 2018), and hippocampus (Miller et al., 2017; Johnson and Redish, 2007; Boorman et al., 2016) have been implicated in MB-RL. However, there is also much work on the online (Morris and Cushman, 2019; Moran et al., 2021; Sezener et al., 2019) and offline (Antonov et al., 2022; Eldar et al., 2020; Liu et al., 2021; Mattar and Daw, 2018) integration of MF and MB quantities, emphasising a closer interplay between both learning systems. At the single-neuron level, midbrain dopaminergic cells report an RPE that could drive the updating of MF-RL predictions (Schultz et al., 1997; Bayer and Glimcher, 2005). Furthermore, neurons in the striatum have been found to encode cached MF-RL action values at choice time (Lau and Glimcher, 2008), but the paradigms used did not elicit differences between MF and MB computations.

One task that does dissociate MB and MF computations is the ‘two-step’ task of Daw et al., 2011. It contains a probabilistic transition between task states to uncouple MF learners (who would assign credit to whichever state was rewarded, regardless of the transition) from MB learners (who would appropriately assign credit based on the reward and transition that occurred). Rodents (Miller et al., 2017), monkeys (Miranda et al., 2020), and humans (Daw et al., 2011) all use MB-like behaviour to solve the task. Evidence in rodents suggests dorsal anterior cingulate cortex (ACC) tracks rewards, states, and the probabilistic transition structure, and that ACC is essential in implementing an MB strategy (Akam et al., 2021). Here, we compare primate single-neuron activity in four regions implicated in reward-based learning and choice (ACC, dorsolateral PFC [DLPFC], caudate, and putamen) during performance of the classic two-step task, and demonstrate signatures of MB-RL primarily in ACC, and MF-RL signatures most notably in putamen.

Results

Two male rhesus macaques (Macaca mulatta) completed a variation of the classic two-step task (Daw et al., 2011). After an initial eye fixation period, subjects first chose between two first-stage options (A, B, ‘choice 1’), each leading to a different second-stage state (indicated by background colour) (Figure 1A and B). Crucially, the transition between these two stages was probabilistic, with each option transitioning to its preferred second-stage state in 70% (‘common’) of trials and transitioning to the alternative second-stage state in 30% (‘rare’) of trials (Figure 1B). This probabilistic state transition allows the dissociation of MF-RL strategies (that estimate state values using bootstrapped sampling of experienced states) from MB-RL strategies (which use the transition structure to assign rewards prospectively to states based on the probability of that state occurring). Both second-stage states then required another choice (‘choice 2’) between two options (C and D, or E and F), leading to one of three possible rewarding outcomes (high, medium, low outcome) indicated by a secondary reinforcer cue (‘feedback’). In our version of the task, the outcome value of each second-stage option could change randomly and independently after every five to nine trials, thus requiring subjects to continue sampling the different second-stage states to determine where the highest rewards were located across trials (Figure 1B). This reward schedule gave an MB policy an advantage over an MF policy: in simulations, the former consistently gained more reward per trial than the latter (MB gained 1.2979±0.0071 units of reward per trial [n=57 recording sessions, the three reward levels were counted as 0, 1, or 2 AU]; MF 1.2416±0.0069 reward per trial; difference between them: p<0.0001, paired t-test; Figure 1—figure supplement 1). Both strategies were also better than a random policy (0.9953±0.007 reward per trial, vs MF/MB: p<0.001).
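The transition structure described above can be illustrated with a minimal sketch. It is not the simulation code used in the study: the mapping of option A to state 0 and option B to state 1, the function names, and the example state values are assumptions made for illustration. Only the 70%/30% transition probabilities come from the text.

```python
import random

P_COMMON = 0.7  # probability of the common transition, as stated in the text


def second_stage_state(choice, rng):
    """Sample the second-stage state (0 or 1) reached after choosing 'A' or 'B'.

    Assumption: 'A' commonly leads to state 0 and 'B' commonly leads to state 1.
    """
    preferred = 0 if choice == "A" else 1
    return preferred if rng.random() < P_COMMON else 1 - preferred


def model_based_values(state_values):
    """First-stage values computed prospectively from the transition structure.

    state_values[s] is the best outcome currently expected in second-stage state s.
    """
    v0, v1 = state_values
    return {
        "A": P_COMMON * v0 + (1 - P_COMMON) * v1,
        "B": (1 - P_COMMON) * v0 + P_COMMON * v1,
    }


rng = random.Random(0)
states = [second_stage_state("A", rng) for _ in range(10_000)]
common_fraction = states.count(0) / len(states)  # close to 0.7

# Hypothetical situation: state 0 currently offers the high outcome (2 AU).
q_mb = model_based_values([2.0, 0.0])
```

In this toy case an MB agent values option A at about 1.4 and option B at about 0.6, because it weights each second-stage value by the probability of reaching that state.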

Figure 1 (with 1 supplement)

Two-stage decision task performance.

(A) Timeline of events. Eye fixation was required while a red fixation cue was shown; otherwise subjects could saccade freely and indicate their decision (arrow as an example) by moving a manual joystick in the direction of the chosen stimulus. Once the second-stage choice had been made, the nature of the outcome was revealed by a secondary reinforcer cue (here, the pause symbol represents high outcome). Once the latter cue was off the screen, there was a fixed 500 ms delay and the possibility of a further delay (for both medium and low outcomes) before juice was provided (for both high and medium outcomes). (B) The state-transition structure (kept fixed throughout the experiment). Each second-stage stimulus had an independent reward structure: the outcome level (defined by the magnitude of the reward and the delay to its delivery) remained the same for a minimum number of trials (a uniformly distributed pseudorandom integer between 5 and 9) and then either stayed at the same level (with one-third probability) or changed randomly to one of the other two possible outcome levels. (C) Likelihood of first-stage choice repetition, averaged across sessions, as a function of reward and transition on the previous trial. (D–E) Logistic regression results on first-stage choice with the contributions of the reward main effect (D) and reward×transition (E) from the five previous trials. *, **, p<0.05, 0.01, respectively. In C–E, error bars depict standard error of the mean (n=30 and 27 for subject C and J, respectively).

In a prior comprehensive behavioural analysis (Miranda et al., 2020), we demonstrated that both subjects estimated the value of each choice 1 option using a combination of MF and MB-RL algorithms, but with MB dominating (Figure 1C–E). MF-RL does not exploit information about task structure, so it predicts no difference in the probability of repeating choice 1 dependent on common/rare transition, whereas a key signature of MB-RL is just such a difference. We found that both subjects were significantly more likely to repeat choice 1 when a high reward was obtained through a common (rather than rare) transition, with the opposite pattern evident following low rewards (Figure 1C), indicating a strong MB influence on behaviour. Having established the influence of MF- and MB-RL strategies on behaviour (Miranda et al., 2020), we next examined whether signatures of MF- and MB-RL were evident in neuronal activity in four key brain regions implicated in learning and choice: we recorded single-neuron activity from two regions in PFC – dorsal ACC (n=240) and DLPFC (n=187), and two regions in the striatum – caudate (n=115) and putamen (n=119) (see Figures 8 and 9 for recording locations).
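The behavioural signature described above (choice repetition depending on the interaction of reward and transition) can be computed with a short sketch. This is an illustration only: the tuple layout is hypothetical, and reward is simplified to a binary high/not-high flag, whereas the task used three outcome levels.

```python
from collections import defaultdict


def stay_probabilities(trials):
    """P(repeat first-stage choice) split by previous reward and transition.

    trials: list of (choice, rewarded, common) tuples in session order, where
    `rewarded` is True for a high outcome and `common` is True for a common
    transition (hypothetical field layout).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_stay, n_total]
    for prev, cur in zip(trials, trials[1:]):
        key = (prev[1], prev[2])
        counts[key][0] += int(cur[0] == prev[0])
        counts[key][1] += 1
    return {key: stay / total for key, (stay, total) in counts.items()}


# Toy session following a purely model-based pattern:
# stay after rewarded-common or unrewarded-rare trials, switch otherwise.
session = [
    ("A", True, True),    # rewarded, common    -> next choice stays
    ("A", True, False),   # rewarded, rare      -> next choice switches
    ("B", False, True),   # unrewarded, common  -> next choice switches
    ("A", False, False),  # unrewarded, rare    -> next choice stays
    ("A", True, True),
]
p_stay = stay_probabilities(session)
```

A purely MF agent would instead show stay probabilities that depend on reward alone, with no difference between common and rare transitions.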

As parameters were correlated over trials (e.g. the same choice was made repeatedly when it led to a high reward), there was a risk that intrinsic temporal autocorrelations in firing rate across trials would confound the analysis (Shin et al., 2021; Elber-Dorozko and Loewenstein, 2018; Harris, 2020). We therefore incorporated several nuisance regressors to account for temporal autocorrelations in neuronal activity that may have been present (see Materials and methods, Tables 1–3). To check this was an adequate control, we analysed neural firing rates with respect to a different session’s trial data (i.e. the dependent and independent variables for the regression were from different sessions) (Shin et al., 2021; Elber-Dorozko and Loewenstein, 2018; Harris, 2020). No significant encoding of any of our parameters of interest was observed at either the population- or single-neuron level in such null controls (Figure 2—figure supplement 1). Therefore, temporal autocorrelations in neuronal activity were adequately controlled, and we next turned to examining neural encoding of task parameters.

Table 1

The contents of GLM1.

Far-right column indicates which figure panels each regressor was used for.

Regressor	Description	Figure
Constant	To account for the mean effect
Linear effects	A ramping regressor to account for any monotonic autocorrelations across the session
(Linear effects) ^ 2	To account for any polynomial autocorrelations across the session
(Linear effects) ^ 3
(Linear effects) ^ 4
Sin 2	Sin curves of varying frequency (approximately 2, 4, 6, 9 cycles per session) to account for any non-monotonic autocorrelations across the session
Sin 4
Sin 6
Sin 9
Previous reward (t–3)	Value of the secondary reinforcer shown three trials ago
Previous reward (t–2)	Value of the secondary reinforcer shown two trials ago	Figure 2F and G
Previous reward (t–1)	Value of the secondary reinforcer shown on the previous trial	Figure 2A–C; Figure 2F and G
Previous transition	Transition that occurred on the previous trial
Previous choice 1	The first choice from the previous trial
Previous reward * previous transition	Pairwise interactions between the events from the previous trial
Previous reward * previous choice 1
Previous transition * previous choice 1
Previous reward * previous transition * previous choice 1	Three-way interaction of the events from the previous trial (equivalent to previous second-stage state * previous reward, as choice 1 * transition = second-stage state)	Figure 6B
Reward	Value of the secondary reinforcer (current trial)	Figure 2D and E; Figure 3E; Figure 5E and F
Transition	Transition that occurred (current trial)	Figure 3A–E
Choice 1	The cue the subjects chose (current trial)	Figure 6A
Previous reward * transition	Pairwise interactions between the current trial’s events with reward from the previous trial	Figure 3C and D
Previous reward * choice 1
Transition * choice 1
Previous reward * transition * choice 1	Three-way interaction of the current trial’s events with the previous trial’s reward
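The nuisance regressors at the top of Table 1 (constant, polynomial ramps, and sine curves) could be built along the following lines. This is a sketch under stated assumptions, not the authors' code: the ramp is assumed to run linearly from 0 to 1 across the session, and each sine is assumed to complete exactly the listed number of cycles per session (the table says "approximately").

```python
import math


def nuisance_regressors(n_trials, sin_cycles=(2, 4, 6, 9), max_power=4):
    """Return one row per trial: constant, ramp^1..ramp^max_power, sine terms.

    Assumptions: the ramp runs linearly from 0 to 1 across the session, and each
    sine completes the stated number of cycles over the session.
    Requires n_trials >= 2.
    """
    rows = []
    for t in range(n_trials):
        ramp = t / (n_trials - 1)
        row = [1.0]                                            # constant
        row += [ramp ** p for p in range(1, max_power + 1)]    # linear to quartic
        row += [math.sin(2 * math.pi * c * ramp) for c in sin_cycles]
        rows.append(row)
    return rows


design = nuisance_regressors(n_trials=201)  # hypothetical session length
```

Each row has nine columns, matching the nine nuisance rows of the table; the task regressors would be appended to these columns before fitting.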

ACC encoded reward across task events; striatal regions encoded an MF RPE signal

We first examined the extent of reward encoding in each region by examining the response of neurons to the secondary reinforcer cue. At the population level, all regions strongly encoded the value of the secondary reinforcer cue at feedback; furthermore, this encoding persisted throughout much of the following trial, and was significantly stronger in ACC during this period (Figure 2A; GLM1). The same was true at the single-neuron level, with large populations of neurons within each region encoding the reward during feedback and long into the following trial (Figure 2B). ACC’s encoding of reward was stronger than that of all other regions at both the population level (p<0.05, cluster-based permutation test; Figure 2A) and the single-neuron level (p<0.05, chi-square test; Figure 2B). Encoding of the value of the secondary reinforcer peaked earlier in striatal neurons (507.7±19.16 ms) than in prefrontal neurons (588.5±13.9 ms; p=0.005, independent t-test; Figure 2C). Therefore, all four regions, but especially ACC, encoded the previous trial’s reward throughout the next trial.
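Population-level encoding strength in Figure 2A is expressed as a coefficient of partial determination (CPD). Under the standard definition, the CPD of a regressor is the fraction of the reduced model's residual variance that is removed when that regressor is added back: CPD = (SSE_reduced − SSE_full) / SSE_reduced. The sketch below applies this definition to toy data; the firing rates, regressor values, and function names are hypothetical, and the least-squares solver is a small pure-Python stand-in for whatever routine was actually used.

```python
def ols_sse(X, y):
    """Sum of squared errors of the least-squares fit of y on the columns of X."""
    n, k = len(X), len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k+1) matrix.
    M = [
        [sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
        + [sum(X[i][p] * y[i] for i in range(n))]
        for p in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[p][k] / M[p][p] for p in range(k)]
    return sum(
        (y[i] - sum(X[i][p] * beta[p] for p in range(k))) ** 2 for i in range(n)
    )


def cpd(X, y, col):
    """CPD of regressor `col`: (SSE_reduced - SSE_full) / SSE_reduced."""
    sse_full = ols_sse(X, y)
    X_reduced = [[v for j, v in enumerate(row) if j != col] for row in X]
    sse_reduced = ols_sse(X_reduced, y)
    return (sse_reduced - sse_full) / sse_reduced


# Toy neuron whose rate depends on reward (0, 1, 2 AU) but not on transition.
reward = [0, 1, 2] * 4
transition = [1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1, -1]  # +1 common, -1 rare
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(12)]
rate = [1 + 2 * r + e for r, e in zip(reward, noise)]
X = [[1.0, r, t] for r, t in zip(reward, transition)]

cpd_reward = cpd(X, rate, col=1)
cpd_transition = cpd(X, rate, col=2)
```

In this toy example the reward regressor has a CPD close to 1 and the transition regressor a CPD close to 0. Multiplying by 100 gives the percentage scale used in the figures.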

Figure 2 (with 1 supplement)

Anterior cingulate cortex (ACC) is the strongest reward coding region.

(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the reward received on the previous trial. Feedback is shown both for the previous trial (left) and the current trial (right). Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, cluster-based permutation test). Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. All epochs are 0–500 ms with the exception of 0–1000 ms for the first feedback epoch. (B) The percentage of neurons within each region that significantly encoded reward during the feedback epoch and epochs of the subsequent trial (p<0.05 cluster-based permutation test). Solid horizontal line represents period where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, chi-squared test). (C) The time (relative to feedback onset) at which the peak encoding of the secondary reinforcer occurred, across significant encoding neurons. ***, 0.001, independent t-test. (D) Average coefficients for encoding the value of the secondary reinforcer. Solid horizontal lines represent periods where the coefficients in a particular region differed from 0 (p<0.05, one-sample t-test). (E) Same as in D but for the percentage of significant neurons. Dashed lines indicate the net percentage of significant neurons (positive – negative percentages). (F) Average coefficients across ACC neurons for the reward received on the current trial and the previous two trials. Inset, the average activity for each of the three conditions at the time point of peak reward encoding. **, ***, p<0.01, 0.001, one-sample t-test against 0. (G–I) Same as F but for dorsolateral prefrontal cortex (DLPFC), caudate, and putamen neurons, respectively. In all cases, error bars depict standard error of the mean across neurons (n=240, 187, 115, and 119 for ACC, DLPFC, caudate, and putamen, respectively).

The direction of this reward coding differed between PFC and striatum. The striatal regions encoded the predicted reward (i.e. from the secondary reinforcer) positively (i.e. firing rate positively correlated with reward; p<0.05 from 190 and 160 ms for caudate and putamen, respectively, one-sample t-test of coefficients against 0, Figure 2D and E), whereas ACC encoded reward with a negative polarity and a slower response time (p<0.05 from 410 ms onwards, Figure 2D and E). Additionally, both caudate and putamen encoded the reward from the previous trial negatively during the feedback period of the current trial (Figure 2F and G). Such features in both striatal areas (i.e. positive coding of current reward and negative encoding of past rewards) are typical of a dopaminergic RPE signal (Schultz et al., 1997; Bayer and Glimcher, 2005; Dolan and Dayan, 2013).
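Why an RPE carries the current reward positively and past rewards negatively can be seen from a simple delta-rule learner. The sketch below is an illustration, not the model fitted in the study; the learning rate of 0.5 and the reward sequence are arbitrary assumptions. With prediction V updated as V ← V + α(r − V), the error on trial t equals r_t minus an exponentially weighted sum of earlier rewards.

```python
def prediction_errors(rewards, alpha=0.5, v0=0.0):
    """Delta-rule RPEs: delta_t = r_t - V_t, followed by V <- V + alpha * delta_t."""
    v, deltas = v0, []
    for r in rewards:
        delta = r - v
        deltas.append(delta)
        v += alpha * delta
    return deltas


def reward_weights(alpha, n_back):
    """Weight of each reward in the current RPE: lag 0 first, then lags 1..n_back."""
    return [1.0] + [-alpha * (1 - alpha) ** (k - 1) for k in range(1, n_back + 1)]


deltas = prediction_errors([2, 0, 2, 2])         # hypothetical reward sequence (AU)
weights = reward_weights(alpha=0.5, n_back=2)    # weights for lags 0, 1, 2
```

The weights are positive for the current reward and negative, with shrinking magnitude, for rewards one and two trials back, which is the qualitative pattern described for the striatal regions.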

ACC encoded MB state-transition information

Tracking the state-transition structure of the task is imperative for solving the task as an MB learner. All four regions encoded whether the current trial’s first-stage choice transitioned to the common or rare second-stage state (which could be inferred by a change in background colour immediately after choice indicating which second-stage state they had just entered, Figure 1A). This signal was significantly stronger in ACC – where state-transition information was maintained from transition until the feedback epoch (CPD >0.3%, p<0.002, cluster-based permutation test; Figure 3A; GLM1). A similar pattern was also observed at the individual neuron level (Figure 3B).

Figure 3 (with 1 supplement)

Anterior cingulate cortex (ACC) encodes transition information until feedback.

(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the transition that occurred. Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, cluster-based permutation test). Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. All epochs are 0–1000 ms. Error bars depict standard error of the mean. (B) The percentage of neurons within each region that significantly encoded transition (p<0.05, cluster-based permutation test). (C, D) Same as A, B but for encoding of the interaction of previous reward and transition. (E) ACC coefficients for transition and feedback were correlated (coefficients taken from 300 ms post onset of each epoch). In A, B, and D, solid horizontal line indicates periods where ACC was significantly greater than all three other regions (A: permutation test, B and D: chi-squared test, p<0.05/3).

As reward expectations would change on rare transitions if using an MB strategy, we next examined whether the encoding of previous trial reward was modulated by current trial’s transition. While all four regions exhibited some selectivity for the interaction between the previous trial’s reward and the current trial’s transition, a larger proportion of ACC neurons encoded this parameter compared to all other regions (p<0.05, chi-squared test, Figure 3C and D). Thus, ACC reward expectancy signals were modulated by the state-transition structure of the task.

All regions, but particularly ACC, encoded a common transition (at the time of transition) similar to a high reward (at the time of feedback), as there was a positive correlation between the coefficients for reward and transition (the transition parameter was signed such that common and rare transitions were equivalent to high and low rewards, respectively) (ACC r=0.4963, DLPFC r=0.3273, caudate r=0.4712, putamen r=0.5052; all p<0.002 except DLPFC, where p=0.006, circular permutation test; Figure 3E, Figure 3—figure supplement 1). As the reward expectation will be higher on common compared to rare trials, this demonstrates that the brain encodes being diverted to an area with a lower reward expectation equivalent to actually receiving a low reward (and vice versa). This signal also reflects awareness of the (MB-like) state-transition structure.

ACC and caudate encoded MB value estimates

Due to this encoding of the state-transition and its interaction with other variables, we next tested whether neurons encoded MB- or MF-derived chosen values of the choice 1 options (using RL computational modelling to estimate these values in a subject- and trial-specific manner; Miranda et al., 2020). At the single-neuron level, ACC was unique in having a large population of neurons that encoded both the MB and MF value of the choice 1 options (Figure 4A and B; GLM2). The encoding of MB emerged 280 ms before the cues were displayed, indicating the subjects were anticipating the upcoming choice, whereas MF-derived chosen value coding only emerged after the cues were presented (170 ms post cue presentation, p<0.05, binomial test). Caudate also had a population of neurons encoding the MB-derived value of choice 1, which emerged 430 ms after choice onset (Figure 4A). At the population level all four regions encoded the MB value estimate of choice 1, with ACC and caudate encoding it more than 500 ms before the options were displayed (Figure 4C). All four regions also encoded MF-RL estimates of choice 1’s value, again with ACC encoding emerging more than 500 ms before choice onset (Figure 4D).

Figure 4 (with 1 supplement)

Value estimate encoding was predominantly model-based (MB).

(A) Percentage of cells in each region that encoded the MB-derived estimates of the value of each of the choice 1 options. Solid horizontal lines indicate periods where the percentage of neurons was significant (p<0.05, binomial test). (B) Same as in A but for model-free (MF)-derived estimates of each option’s value. (C, D) Same as in A, B but for the average coefficient of partial determination (CPD) across neurons in a region. Dashed lines represent the 95% confidence interval determined by permutation testing. Solid lines indicate periods where the strength of encoding was significant (p<0.05, cluster-based permutation test). Error bars depict standard error of the mean. (E) Distribution of the peak CPD values (i.e. the highest CPD value observed over the epoch shown in A–D for either PicA or PicB cues) for each neuron during the epoch shown in C (MB, blue) and D (MF, orange). Horizontal lines indicate median and extrema. *, ***, p<0.05, 0.001, paired t-test. (F) The percentage of neurons in each region that significantly encoded an MB estimate of the chosen option’s value (left), an MF estimate (middle), or both (right), assessed using cluster-length permutation testing. Coloured asterisks indicate that the population is significantly greater than 10% (blue and orange) or 1% (green), binomial test. Asterisks between bars indicate a difference in size between the two populations (*, ***, p<0.05, 0.001, chi-square test).

To further compare this encoding, we examined the peak strength of encoding of each neuron for either of the two choice 1 options (i.e. MB and MF value estimates for PicA/PicB; Table 2) at any point from 500 ms before to 1000 ms after choice 1 was presented. In all regions except DLPFC, peak CPD values across neurons were higher for MB compared to MF estimates (ACC: t=19.8812, p<0.0001, paired t-test; DLPFC: t=−2.1474, p=0.1319; caudate: t=14.659, p<0.0001; putamen: t=6.1911, p<0.0001; Figure 4E). ACC’s peak MB and MF encoding was significantly higher than in all other regions (p<0.001 independent t-test).

Table 2

The contents of GLM2.

Far-right column indicates which figure panels each regressor was used for.

Regressor	Description	Figure
Constant	To account for the mean effect
Linear effects	A ramping regressor to account for any monotonic autocorrelations across the session
(Linear effects) ^ 2	To account for any polynomial autocorrelations across the session
(Linear effects) ^ 3
(Linear effects) ^ 4
Sin 2	Sin curves of varying frequency (approximately 2, 4, 6, 9 cycles per session) to account for any non-monotonic autocorrelations across the session
Sin 4
Sin 6
Sin 9
Model-based PicA	Model-based derived estimates of the value of each option at choice 1	Figure 4
Model-based PicB
Model-free PicA	Model-free-derived estimates of the value of each option at choice 1
Model-free PicB
Choice 1 picture chosen	Which cue the subject chose on each trial

Examination of the strongest signal observed, ACC encoding of MB Q-values, showed a dynamic pattern, with different neurons encoding the signal during different parts of the epoch (Figure 4—figure supplement 1). When aggregating the number of significant coders throughout the epoch, and examining the specificity of MB versus MF coding, we found that all regions had a significant population of neurons that encoded MB-, but not MF-, derived value (30%, 19%, 23%, and 24% of neurons in ACC, DLPFC, caudate, and putamen, respectively; all p<0.0014, binomial test against 10% [as the strongest response to either of the two options was used]; Figure 4F). All regions had a population of neurons that encoded both MB- and MF-derived option value (18%, 4%, 8%, 5% of neurons in ACC, DLPFC, caudate, and putamen, respectively; all p<0.012, binomial test against 1%; Figure 4F), but ACC’s population was significantly larger than that of all other regions (p<0.05, chi-square test, Figure 4F). This dominance of MB encoding in caudate but not putamen at the population level may explain similar dissociations in these two striatal regions observed in human neuroimaging studies (Wunderlich et al., 2012).
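The binomial tests against a chance proportion can be illustrated with an exact tail probability. This is a sketch only: the neuron count of 72 is an assumption obtained by rounding 30% of the 240 ACC neurons, and the one-sided form of the test is also assumed, since the sidedness is not stated in this section.

```python
from math import comb


def binomial_tail(k, n, p):
    """One-sided exact binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical count: 30% of the 240 ACC neurons is about 72 neurons,
# tested against the 10% chance level quoted in the text.
p_value = binomial_tail(k=72, n=240, p=0.10)
```

With a chance level of 10%, only about 24 of 240 neurons would be expected to reach significance, so a count near 72 yields a vanishingly small tail probability.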

Caudate value estimates remapped following a rare transition

The observed prominence of transition-related selectivity, in addition to strong encoding of MB-derived value estimates, led us to next examine how transition type altered the value estimation of the chosen first-stage option. To explore this, we used a combination of MB- and MF-value estimations (the hybrid estimate) derived from the animals’ behaviour, which explained their choices during the task better than either of the two models in isolation (Miranda et al., 2020) (GLM3). In common trials, caudate was unique in encoding the value of the chosen option more positively than the unchosen option’s value (p<0.025, 280–320 ms after the transition was revealed, paired t-test; Figure 5). After a rare transition, caudate was again unique in encoding the chosen and unchosen option values differently, but this time it was the unchosen option that was encoded positively and the chosen option negatively (p<0.025, 170–370 ms, paired t-test; Figure 5). Thus, caudate’s encoding of an option’s value also reflected the availability of the option.

Figure 5

Direction of the encoding of chosen and unchosen choice 1 options, depending on the transition that occurred.

(A) Average coefficients across neurons in the anterior cingulate cortex (ACC) with respect to the value of chosen (blue) and unchosen (orange) options of choice 1 in common trials. Value estimates were calculated using a hybrid of model-based (MB) and model-free (MF) estimates derived from each monkey’s behaviour. Red horizontal line indicates portions where the coefficients in the two conditions differed significantly from one another (p<0.05, paired t-test). Error bars depict standard error of the mean (n=240). (B) Same as in A but for trials where a rare transition occurred. (C–H) Same as in A–B but for dorsolateral prefrontal cortex (DLPFC, n=187), caudate (n=115), and putamen (n=119), respectively.

Reward-modulated encoding of first-stage choice

We next examined the encoding of the choice 1 option (i.e. the stimulus identity of what the subjects chose). All regions encoded it to an equal extent around the time of choice (difference between regions all p>0.05, cluster-based permutation test; Figure 6A; GLM1). Strikingly, all regions also encoded the interaction of choice 1, transition, and reward from the previous trial during the choice 1 epoch (Figure 6B), with ACC encoding it more strongly than DLPFC and putamen (ACC vs DLPFC p=0.006 between –180 and 190 ms, vs putamen p<0.002 between –150 and 450 ms; cluster-based permutation test; Figure 6B). Importantly, none of the intermediate pairwise interactions were encoded to a similar extent (Figure 6—figure supplement 1). Therefore, both ACC and caudate performed a very specific MB computation of integrating the reward, transition, and choice from the previous trial into a signal which could inform which option to choose on the current trial.

Figure 6 (with 1 supplement)

Choice 1 was encoded by neurons sensitive to reward and transition.

(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of choice 1. Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. (B) Same as A but for encoding of the interaction of reward, transition, and choice 1, all from the previous trial. (C) A support vector machine was used to decode from each neural population which cue the monkeys would choose at choice 1 on each trial. Dashed horizontal line indicates the 95% confidence interval (permutation test) and solid horizontal lines indicate periods of significant decoding (p<0.05, cluster-based permutation test). Dashed vertical line indicates the time at which the subjects made their choice (0 ms). (D, E) Same as in C but neurons were median split into two groups depending on the strength with which they encoded reward at feedback and transition at transition. (F) Difference in decoder strength between A and B. Solid horizontal line indicates periods of significant difference assessed using permutation test. In all cases, error bars depict standard error of the mean (n=240, 187, 115, and 119 for ACC, DLPFC, caudate, and putamen, respectively).

Furthermore, we also found that choice 1 stimulus identity was encoded by the same neurons that encoded reward and transition. Choice 1 could be decoded from all four neural populations using a support vector machine, with it being most strongly represented in ACC and caudate (Figure 6C). We then performed a median split of neurons within each region depending on the strength with which they encoded reward at feedback and transition at transition (taking the average of the two coefficients). In ACC only, the neurons that were more sensitive to reward and transition encoded choice 1 more strongly than those neurons that were less sensitive to these attributes (Figure 6D–F). Therefore, it was the same subpopulation of neurons within ACC that tracked the different task parameters relevant for guiding MB choice.

Explore/exploit strategy modulates encoding of first-stage choice

In our task, the outcome level (high, medium, low) of each second-stage stimulus remained the same for five to nine trials before potentially changing. This design naturally created periods where subjects could ‘exploit’ the same choice 1 to maximise reward for several trials, and other periods where they had to ‘explore’ different second-stage stimuli to optimise reward (as contingencies shifted). In classical MB-RL, the transition between reward states can be learned by keeping counts of observed transitions from a current state-action pair to a subsequent state, yielding a maximum-likelihood estimate of the environment’s dynamics (Sutton et al., 2018). In fact, knowledge of the reward contingency schedule could support decision-making in both exploitation – by enabling efficient choice when rewards are stable – and exploration – by guiding alternative behaviour most likely to yield improved outcomes (this is different from MF learning, where exploration is more random since the agent lacks explicit state-transition knowledge).
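The counting scheme described above can be illustrated with a minimal sketch. The function name, the state and action labels, and the example counts are hypothetical and are not taken from the authors' code; the sketch only shows how tallying observed transitions yields a maximum-likelihood estimate of the transition probabilities.

```python
from collections import defaultdict


def estimate_transitions(observations):
    """Maximum-likelihood transition estimate from (state, action, next_state) triples."""
    counts = defaultdict(lambda: defaultdict(int))
    for state, action, next_state in observations:
        counts[(state, action)][next_state] += 1

    probabilities = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        # Relative frequency of each next state is the maximum-likelihood estimate.
        probabilities[key] = {s2: n / total for s2, n in next_counts.items()}
    return probabilities


# Hypothetical history: option "A" led to the brown state 7 times and the violet state 3 times.
observed = [("first_stage", "A", "brown")] * 7 + [("first_stage", "A", "violet")] * 3
model = estimate_transitions(observed)
print(model[("first_stage", "A")])
```

With enough observations, the estimates would approach the fixed common and rare transition probabilities of the task.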

We thus repeated our decoding analysis of choice 1 stimulus identity, but this time limited trials to those where the subjects had not received a high reward for the previous two trials (‘explore’ trials) and those where the previous two rewards had been the highest level (‘exploit’ trials). All regions encoded choice 1 for some duration of the choice epoch for both explore (p<0.002 in all cases, permutation test; Figure 7A) and exploit (p<0.002 in all cases; Figure 7B) conditions, but decoding accuracy was strongest in ACC. Choice 1 was less strongly decoded – particularly in ACC – in explore than in exploit trials (p<0.002 in all cases, permutation test on differences observed; Figure 7C). In addition, during exploitation, the ACC encoded choice 1 before the choice was even presented to the subject (Figure 7—figure supplement 1). This pre-choice ACC encoding in exploit trials may reflect the need to allocate cognitive (or attentive) resources to features – i.e., choice 1 stimulus identity – that are the most certain predictors of important outcomes. As a control, we also decoded the direction of choice 1 (where choice was indicated via joystick movement), which was randomised each trial and therefore orthogonal to the stimulus that was chosen. Again, all four regions encoded its direction in both explore (p<0.002 in all cases; Figure 7D) and exploit (p<0.002 in all cases; Figure 7E) conditions. However, there were minimal differences in the strength of the representation between explore and exploit conditions (ACC, p=0.088, cluster-based permutation test; DLPFC p=0.016; caudate p=0.32; putamen p=1; Figure 7F). Therefore, exploit behaviour specifically upregulated the representation of relevant task parameters that were worth remembering across trials.
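The trial split described above could be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the function name and outcome labels are hypothetical, and trials with a mixed two-trial history are assumed to be left out of both conditions.

```python
def label_trials(outcomes):
    """Label each trial from the outcomes of the two preceding trials.

    outcomes: list of "high", "medium" or "low", one entry per trial.
    Returns "exploit", "explore" or None for every trial.
    """
    labels = []
    for t in range(len(outcomes)):
        if t < 2:
            labels.append(None)  # not enough history yet
            continue
        previous_two = outcomes[t - 2:t]
        if all(o == "high" for o in previous_two):
            labels.append("exploit")
        elif all(o != "high" for o in previous_two):
            labels.append("explore")
        else:
            labels.append(None)  # mixed history: assumed to be excluded
    return labels


labels = label_trials(["high", "high", "low", "medium", "high", "high", "high"])
print(labels)
```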

Figure 7 with 1 supplement

Encoding of choice 1 was modulated by reward stability (explore/exploit strategy).

(A) Trials were split into those following two consecutive high rewards (‘exploit’) and those following two consecutive low/medium rewards (‘explore’). A support vector machine was used to decode from each region which cue the monkeys would choose at choice 1 on explore trials. Shaded error bars represent standard error of the mean over the different permutations of the data (n=20). Horizontal dashed lines indicate the 95% confidence interval (permutation test) and solid horizontal lines indicate periods of significant coding (p<0.05, permutation test). (B) Same as in A but only for exploit trials. (C) Difference in decoder strength between A and B. Solid horizontal line indicates periods of significant difference assessed using permutation test. (D–F) Same as in A–C but for decoding the direction of choice 1, which was randomised across trials and therefore orthogonal to the cue that was chosen. Note that a choice was indicated via manual joystick movement.

Discussion

Despite the extensive work in both MF-RL (or habitual) and MB-RL (or goal-directed) behaviour (Dolan and Dayan, 2013), few studies have shown simultaneous single-neuron signatures of both learning strategies (Akam et al., 2021), let alone across a number of relevant cortical and subcortical regions, and in the primate brain. Using a decision task which formally distinguishes MF and MB values, and by adopting an analytical approach that considered both observable variables and computational measures, we showed that key RL elements underlying choice behaviour were encoded in neurons in the primate PFC and striatum at different time points. The observed widespread and simultaneous representations of MF- and MB-related computations are consistent with the view that these controllers operate in parallel (Daw et al., 2005; Dolan and Dayan, 2013; Daw et al., 2011), but their associated signals are more richly intertwined than had originally been expected. Additionally, we found convincing evidence supporting a prominent role of ACC, and to a lesser extent, the caudate, in MB-RL. The ability of single neurons to encode reward-related parameters has been described throughout the brain (Apicella et al., 1991; Platt and Glimcher, 1999; Roesch and Olson, 2003; Wallis and Kennerley, 2010; Hunt et al., 2018; Muller et al., 2024). Here, we found reward coding at feedback (of MF and MB relevance) to be significantly stronger in ACC. However, at feedback an RPE (a hallmark of MF-RL) can also be computed to support learning. The midbrain dopamine neurons have been strongly implicated in encoding such RPE compared to other areas with similar (but not identical) feedback-related activity (Schultz et al., 1997; Bayer and Glimcher, 2005). 
Signals in other brain regions, such as PFC, have been found to report differences between received and expected outcome (Apicella et al., 2009; Seo and Lee, 2007; Matsumoto et al., 2007; Kennerley et al., 2011), but in most of these cases either the definition of RPE did not conform to the normative RL theory or the neurons did not exhibit the quantitative properties and negative aspect of the error. In our study, both caudate and putamen neurons clearly responded at feedback with the parametric features of a dopamine-like RPE (Schultz et al., 1997; Bayer and Glimcher, 2005). It is important to underline that, despite being seen as a signal that drives MF-RL, both MF and MB approaches coincide at feedback and use this RPE to update their valuations (Miranda et al., 2020). The data presented here confirm that basal ganglia structures (both dopaminergic and striatal neurons) are unique (relative to PFC) in computing RPE-like signals.

Knowledge of the state-transition structure differentiates MB from MF-RL, and hence we investigated neurons that significantly discriminated a common from a rare transition. Our findings are in line with the neural changes seen in the rat ACC when an animal’s belief is modified after an environmental change (Karlsson et al., 2012), and the observation that inhibition of mouse ACC impairs the use of the transition, but not rewards (Akam et al., 2021). Additionally, the selectivity of ACC neurons for detecting state-transition fits well with the neuroimaging signal observed in ACC on a saccadic planning task when an internal model required update (O’Reilly et al., 2013), endorsing its role in dynamically updating behavioural policies. As all regions adjusted their value estimates as a function of the transition type that occurred, our simultaneous recordings extend previous findings by supporting the view that the state-transition information used ubiquitously across fronto-striatal circuits is likely propagated from (or via) ACC to other (reward-focused) regions.

It is important to highlight that MB-like behaviour can arise from sophisticated MF strategies, such as tracking state-transition probabilities and the reward function (Akam et al., 2021; Akam et al., 2015; Blanco-Pozo et al., 2024). Further work should be undertaken to carefully evaluate whether the MB-like signals seen here are truly derived from an MB-RL mechanism or instead reflect predictions about hidden states that can be used as inputs for learning by MF systems (Blanco-Pozo et al., 2024).

It is well established that ACC neurons can multiplex task variables during decision-making tasks (Kennerley et al., 2011; Hayden et al., 2011; Kennerley and Wallis, 2009; Hunt et al., 2018). The interaction between reward and transition coding is particularly critical for MB-RL valuation and choice in the two-step task. Here, we demonstrate that ACC was the primary region to simultaneously encode key variables of the task structure for MB choice, namely the interaction between reward, transition, and choice. This was also borne out in the computationally derived MB estimates of the first-stage choice, where ACC activity (and to a lesser extent caudate) strongly encoded this information. Another clear distinction between ACC and the other regions when coding either reward or transition information was the timescale of responses. ACC neurons have some of the slowest intrinsic timescales (i.e. the rate of their autocorrelation decay) across the brain (Murray et al., 2014; Cavanagh et al., 2018), which likely underlies the persistent encoding of these task variables. Maintaining these representations over long time periods likely potentiates the ability to contextualise outcomes by state, as well as the monitoring of action-outcome relationships relevant for behavioural adjustment (Hadland et al., 2003; Kennerley et al., 2006; Rudebeck et al., 2008). In fact, the encoding of the next trial’s chosen cue and its expected value was evident in ACC long before the choice epoch even started. Thus, our data provide novel correlates of prospective planning in single-neuron activity (Doll et al., 2015) and contribute to our understanding of the mechanisms underlying MB-RL.

Along with ACC, caudate appears to have an important role in MB behaviour. Choice 1 decoding was most strongly represented by ACC and caudate, and specifically by neurons that tracked both reward and transition information. Furthermore, the distinctive caudate signal of updating (flipping) the value estimates of the currently experienced option on rare trials implies this signal is not a general temporal-difference RPE, but instead more reflective of a state PE (Gläscher et al., 2010) and further supports the role of caudate in MB valuation. Our recordings were predominantly in anterior parts of caudate, which is where others have described flexible value coding (Kim and Hikosaka, 2013), representations of sequences (Seo et al., 2012), and preference for early phases of learning (Pasupathy and Miller, 2005) – features more often embraced by MB-RL. Anatomically, it is also the part of the caudate with the highest afferent projections from ACC – the area we observed to support MB computations (Griggs et al., 2017).

On the other hand, MF-based estimates were neither as striking nor as specific to striatal regions as expected and observed in previous studies (Daw et al., 2011). The monkeys were extensively trained on the task before recordings commenced, which may have caused a shift towards both MB behaviour and MB value representation within the striatum. Alternatively, this training may have allowed more sophisticated representations to occur, such as using latent states to expand the task space (Akam et al., 2015). Despite this, ACC was the only region that significantly encoded both MB and MF values. Having access to both MF and MB values at the time of choice implicates ACC in the arbitration process between the two learning strategies (Lee et al., 2014) that has been considered crucial to guide optimal behaviour. It is also in line with alternative theoretical proposals for ACC that propose a specific role in monitoring the error likelihoods of all possible expected outcomes (Alexander and Brown, 2011). In fact, the use of an RL framework could indeed unify other roles attributed to ACC, in particular conflict resolution (van Veen et al., 2001) and cognitive control (Shenhav et al., 2013).

Materials and methods

Subjects and neurophysiological procedures

Subjects C and J were two male rhesus monkeys (M. mulatta) that were 5–6 years of age and weighed 8–10 kg at the time of neural recordings. Surgical procedures were performed using aseptic techniques and under general anaesthesia. Subjects were implanted with a titanium head positioner for restraint, then implanted with two recording chambers that were located based on 3T MRI and stereotactic measurements. The centre of each chamber along the anterior-posterior (AP) coordinate plane was as follows: left hemisphere at AP = 38(C)/37(J) mm, right hemisphere at AP = 27(C)/27.5(J) mm. The chambers were angled along the medial-lateral plane to target different regions. Craniotomies were then performed inside each chamber to allow for neuronal recordings. We used gadolinium-attenuated MRI and electrophysiological mapping of gyri and sulci to confirm chamber placement and electrode trajectories within our recording grid. A custom-built MATLAB (version R2014b, MathWorks, MA, USA) algorithm was used to project each recording location (using grid position and depth from dura penetration) onto the MRI images (Figures 8 and 9).

Figure 8

Locations of each neuron recorded from subject C.

Figure 9

Locations of each neuron recorded from subject J.

Neuronal activity was measured with epoxy-coated (FHC Instruments, Bowdoin, USA) or glass-coated (AlphaOmega Engineering, Nazareth, Israel) tungsten microelectrodes inserted through a guide tube mounted in a custom-designed grid with 1 mm spacing between adjacent grid locations. Electrodes were slowly advanced through the dura each recording session using either custom-built manually controlled microdrives that lowered electrodes in pairs or triplets, or motorised microdrives (Flex MT and EPS, Alpha Omega Engineering, Nazareth, Israel) with individual control of electrodes. During a typical recording session, 8–28 electrodes were lowered bilaterally into multiple target regions until well-isolated neurons were found. Neuronal signals were recorded at 40 kHz (OmniPlex System, Plexon Instruments, Dallas, TX, USA). Single-unit isolation was achieved with manual spike sorting (Offline Sorter by Plexon Instruments, Dallas, TX, USA). Neurons were randomly sampled; no attempt was made to select neurons based on responsiveness or specific cortical layer. After each recording session, the microelectrodes were retracted and the microdrive assemblies were removed.

We recorded neuronal data from four target regions: ACC, DLPFC, caudate, and putamen (Figures 8 and 9). In subject C, we recorded simultaneously from the ACC (dorsal bank of the ACC sulcus, primarily area 9/32) and the DLPFC (dorsal bank of the principal sulcus, area 46d) in both the left and right hemispheres, and from the dorsal caudate and the dorsal putamen in the right hemisphere. In subject J, we recorded from the ACC (dorsal bank of the ACC sulcus, primarily area 9/32) and the DLPFC (dorsal bank of the principal sulcus, area 46d) in the left hemisphere, and from the dorsal caudate and the dorsal putamen in the right hemisphere. In total, we recorded single-unit activity from 661 neurons (C: 508 and J: 153) in 57 recording sessions (C: 30 and J: 27) across all four investigated regions: ACC, 240 neurons; DLPFC, 187 neurons; caudate, 115 neurons; putamen, 119 neurons. In some sessions, neurons were recorded from all four regions simultaneously, whereas in other sessions, only two or three regions were sampled.

Task

Full task details are reported elsewhere (Miranda et al., 2020). We monitored eye position and pupil dilation during the task using an infrared system (ISCAN ETL-200). In brief, subjects initiated a trial by fixating a central red cue, and once extinguished, subjects were free to view the stimuli and indicate their decision by moving a manual joystick in the direction of the chosen stimulus (Figure 1). Two decisions were required (first-stage, second-stage states) on each trial to obtain reward. The first-stage state was indicated by a grey background and the choice was between two options indicated by pictures. Each first-stage choice could lead to either a common (70% transition probability) or rare (30% transition probability) second-stage state, represented by different background colours (brown and violet). This state-transition structure was fixed throughout the experiment. In the second-stage state, another two-option choice between pictures was required, which could lead to three different outcome levels: high (big reward and no delay), medium (small reward and small delay), or low (no reward and big delay). To promote learning and updating of stimulus values, each second-stage stimulus had an independent reward structure: the outcome level (high, medium, low) remained the same for five to nine trials, and then either stayed the same level (with one-third probability) or changed randomly to one of the other two possible outcome levels. Rather than fixed reward amounts, small Gaussian drifts of reward (mean/standard deviation of 0/200 ms for high reward and 0/100 ms for medium reward) were also added to promote constant valuation of the reward amounts. Fifteen per cent of the trials were forced (i.e. without allowing a choice as only one option was presented), which could be at either the first or second stage. The trial type sequence was randomly generated at the start of the session and was followed even after error trials. 
Trials with either no choice, no eye fixation, break of eye fixation, early joystick response, or the joystick not centred before choice resulted in time-outs for the subjects, and were excluded from the data analysis (C: M=5%; J: M=8%). In both decision stages, the choice options were randomised to one of three possible locations, and subjects indicated their choice by moving a joystick towards the stimulus (C: left, right and down; J: left, right and up). The reward (C: diluted cranberry juice; J: diluted apple juice) was provided by a spout positioned in front of the subject’s mouth and delivered at a constant flowrate using a peristaltic pump (Ismatec IPC). We used Monkeylogic software (https://monkeylogic.nimh.nih.gov/) to control the presentation of stimuli and task contingencies and acquire joystick and eye data. All visual stimuli used were the same across sessions for both subjects. All experimental procedures were approved by the UCL Local Ethical Procedures Committee and the UK Home Office (PPL Number 70/7231) and carried out in accordance with the UK Animals (Scientific Procedures) Act.

Neural analysis

All data analysis was conducted using Python 3.7.6 (Python Software Foundation).

Raster generation

Each cell’s spike raster was smoothed with a Gaussian kernel (σ=50 ms) and epoched relative to the time at which either an epoch started (choice 1, choice 2, feedback) or an action was taken (initial fixation, choice 1 made, choice 2 made). Rasters were then downsampled to 100 Hz resolution by averaging every 10 points and standardised using the mean and standard deviation of each epoch across all trials.
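A minimal sketch of this preprocessing, written with NumPy only, is given below. It is not the authors' code: the function names are hypothetical, a 1 ms input resolution is assumed (so that averaging every 10 points gives 100 Hz), the kernel is truncated at four standard deviations, and standardisation is assumed to use a single mean and standard deviation computed over all trials and time points of the epoch.

```python
import numpy as np


def gaussian_kernel(sigma_ms, dt_ms=1.0, width_sd=4):
    """Unit-area Gaussian kernel sampled every dt_ms, truncated at width_sd sigmas."""
    half = int(round(width_sd * sigma_ms / dt_ms))
    t = np.arange(-half, half + 1) * dt_ms
    kernel = np.exp(-0.5 * (t / sigma_ms) ** 2)
    return kernel / kernel.sum()


def preprocess_epoch(spikes, sigma_ms=50.0, factor=10):
    """Smooth, downsample and z-score one neuron's epoched raster.

    spikes: (n_trials, n_timepoints) array of spike counts per 1 ms bin (assumed);
    n_timepoints should exceed the kernel length (401 samples by default).
    """
    kernel = gaussian_kernel(sigma_ms)
    smoothed = np.array([np.convolve(trial, kernel, mode="same") for trial in spikes])
    # Average every `factor` consecutive samples (1 kHz -> 100 Hz when factor=10).
    n_keep = (smoothed.shape[1] // factor) * factor
    down = smoothed[:, :n_keep].reshape(smoothed.shape[0], -1, factor).mean(axis=2)
    # Standardise with one mean and SD taken over all trials and time points.
    return (down - down.mean()) / down.std()


rng = np.random.default_rng(0)
spikes = (rng.random((20, 1000)) < 0.02).astype(float)  # synthetic raster
z = preprocess_epoch(spikes)
print(z.shape)
```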

Multiple linear regression models

Multiple linear regression was used to index encoding of different task parameters. Three different general linear models (GLMs) were constructed. The first design matrix (GLM1) included task parameters without considering any algorithmically derived MF or MB computations (Table 1). The second design matrix (GLM2) included estimates of the value of choice 1 and choice 2 derived from the best-fitting computational models, which generate MB and MF Q-value estimates on each trial (Miranda et al., 2020; Table 2).

The third design matrix (GLM3) combined the MB and MF Q-value estimates into a single ‘hybrid’ value based on the animal’s behaviour; this hybrid model was shown to predict the animal’s choices better than any tested MB or MF model in isolation (Miranda et al., 2020; Table 3).

Q_{Hyb}(s_{1,t}, a_{1,t}) = (1 - \omega)\, Q_{MF}(s_{1,t}, a_{1,t}) + \omega\, Q_{MB}(s_{1,t}, a_{1,t})

Table 3

The contents of GLM3.

Far-right column indicates which figure panels each regressor was used for.

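Regressor	Description	Figure
Constant	To account for the mean effect	-
Linear effects	A ramping regressor to account for any monotonic autocorrelations across the session	-
(Linear effects)^2	To account for any polynomial autocorrelations across the session	-
(Linear effects)^3	To account for any polynomial autocorrelations across the session	-
(Linear effects)^4	To account for any polynomial autocorrelations across the session	-
Sin 2	Sine curve (approximately 2 cycles per session) to account for any non-monotonic autocorrelations across the session	-
Sin 4	Sine curve (approximately 4 cycles per session) to account for any non-monotonic autocorrelations across the session	-
Sin 6	Sine curve (approximately 6 cycles per session) to account for any non-monotonic autocorrelations across the session	-
Sin 9	Sine curve (approximately 9 cycles per session) to account for any non-monotonic autocorrelations across the session	-
Hybrid-based chosen option	Hybrid-based derived estimate of the value of the chosen option at choice 1	Figure 5
Hybrid-based unchosen option	Hybrid-based derived estimate of the value of the unchosen option at choice 1	Figure 5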

In Equation 1, the hybrid model assumes that first-stage choices are computed as a weighted sum of the state-action values from the MF and MB learning systems. Here, a denotes the action, s the stage (choice 1 or 2), t the trial, and ω a fitted hyperparameter that weights the two systems.
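As a worked illustration of the weighted sum, a short sketch is given below. The function name and the numerical values are hypothetical and chosen only for the example; they are not fitted values from the study.

```python
def hybrid_q(q_mf, q_mb, omega):
    """Weighted sum of model-free and model-based first-stage values."""
    return (1.0 - omega) * q_mf + omega * q_mb


# Hypothetical values: omega = 0.75 weights the model-based estimate more heavily.
q_hyb = hybrid_q(q_mf=0.2, q_mb=0.8, omega=0.75)
print(round(q_hyb, 3))
```

Setting ω to 0 recovers the purely MF value and setting it to 1 recovers the purely MB value.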

The GLMs also included several control regressors to account for any neural drift that occurred over time. These included linear and higher-order polynomial time functions, and sine curves of various frequencies (see Tables 1–3). Examination of the false positive rates using session-shuffled data (Shin et al., 2021; Elber-Dorozko and Loewenstein, 2018; Harris, 2020) demonstrated that these adequately controlled for neural drift and autocorrelations (Figure 2—figure supplement 1).

Population-level encoding

To assess the presence of coding for a particular explanatory variable in multiple linear regression, we used the coefficient of partial determination (partial R², CPD):

CPD = \left(1 - \frac{SSE_{Full}}{SSE_{Red}}\right) \times 100

where SSE is the sum of squared error using either the full design matrix (SSE_Full) or the design matrix with the relevant regressor omitted (SSE_Red). This represents the percentage of additional variance that the full model explains compared to the reduced model. This measure was then averaged across all cells from a region and the mean and standard error used as summary statistics.
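The CPD for a single regressor could be computed along the following lines. This is a sketch using ordinary least squares from NumPy on synthetic data; the function names and the example design matrix are assumptions for illustration and do not reproduce the authors' GLMs.

```python
import numpy as np


def sum_squared_error(X, y):
    """Residual sum of squares of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return float(residuals @ residuals)


def cpd(X, y, column):
    """Coefficient of partial determination (in %) for one column of X."""
    sse_full = sum_squared_error(X, y)
    sse_reduced = sum_squared_error(np.delete(X, column, axis=1), y)
    return (1.0 - sse_full / sse_reduced) * 100.0


# Synthetic example: the response depends on x1 but not on x2.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 * x1 + rng.normal(size=n)

cpd_x1 = cpd(X, y, column=1)
cpd_x2 = cpd(X, y, column=2)
print(round(cpd_x1, 1), round(cpd_x2, 1))
```

Because the reduced model is nested in the full model, the CPD of an irrelevant regressor stays close to zero, while that of a regressor that drives the response is large.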

To determine the significance of our CPD measure, we computed the null distribution by shuffling the dependent variable 500 times and repeating the analysis. Several control regressors were also included to capture autocorrelation in the data (Tables 1–3). Rare cases where NaNs were generated due to a lack of variance in certain time points from low firing rate neurons were discarded. Periods of encoding that crossed the 95th percentile of this distribution were then considered as putatively significant. A second null distribution was then computed from the lengths of periods of significance that occurred in the null distribution. Significant events in the true data had to remain above threshold for longer than the Bonferroni corrected (by 4 as there were 4 areas) 95th percentile of the null distribution of lengths to be counted as significant. This resulted in an acceptable false positive rate of approximately 5% (Figure 2—figure supplement 1).
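The two-stage thresholding described above can be sketched as follows. The details are assumptions rather than the authors' implementation: the function names are hypothetical, the pointwise threshold is taken at each time point separately, and the null distribution of lengths is built from the longest supra-threshold run in each permutation.

```python
import numpy as np


def longest_run(mask):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in mask:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def significant_runs(observed, null, alpha=0.05, n_comparisons=4):
    """Mark time points in runs longer than the corrected null run length.

    observed: (n_time,) statistic from the real data.
    null: (n_permutations, n_time) statistic from shuffled data.
    """
    threshold = np.percentile(null, 100 * (1 - alpha), axis=0)
    null_lengths = [longest_run(row > threshold) for row in null]
    # Bonferroni-corrected percentile of the null distribution of run lengths.
    length_cutoff = np.percentile(null_lengths, 100 * (1 - alpha / n_comparisons))

    above = observed > threshold
    significant = np.zeros(above.shape, dtype=bool)
    start = None
    for i, flag in enumerate(np.append(above, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > length_cutoff:
                significant[start:i] = True
            start = None
    return significant


rng = np.random.default_rng(0)
null = rng.normal(size=(500, 100))   # synthetic null statistics
observed = np.zeros(100)
observed[40:60] = 5.0                # synthetic sustained effect
significant = significant_runs(observed, null)
print(np.flatnonzero(significant))
```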

For comparing correlations between weights for different features (i.e. between transition and reward coding, Figure 3E), the null distribution of correlations observed in circularly shifted data was compared to the correlation seen in the actual data. This accounts for any correlations between features that existed in the task by preserving the structure of the design matrices.
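One way to picture this control is the simplified sketch below. It assumes, purely for illustration, that all neurons share the same trials and design matrix and that the neural data are rolled across trials by a random offset; the function names, regressor layout, and synthetic data are hypothetical and do not reproduce the authors' analysis.

```python
import numpy as np


def weight_correlation(X, Y, col_a, col_b):
    """Correlate, across neurons, the regression weights of two regressors."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (n_regressors, n_neurons)
    return float(np.corrcoef(beta[col_a], beta[col_b])[0, 1])


def circular_shift_null(X, Y, col_a, col_b, n_shifts=200, min_shift=10, seed=0):
    """Null correlations obtained by circularly shifting trials of Y relative to X."""
    rng = np.random.default_rng(seed)
    n_trials = Y.shape[0]
    null = np.empty(n_shifts)
    for i in range(n_shifts):
        shift = int(rng.integers(min_shift, n_trials - min_shift))
        null[i] = weight_correlation(X, np.roll(Y, shift, axis=0), col_a, col_b)
    return null


# Synthetic data: neurons whose reward and transition weights are correlated.
rng = np.random.default_rng(1)
n_trials, n_neurons = 300, 50
reward = rng.choice([-1.0, 1.0], n_trials)
transition = rng.choice([-1.0, 1.0], n_trials)
X = np.column_stack([np.ones(n_trials), reward, transition])
w_reward = rng.normal(size=n_neurons)
w_transition = w_reward + 0.3 * rng.normal(size=n_neurons)
W = np.vstack([np.zeros(n_neurons), w_reward, w_transition])
Y = X @ W + 3.0 * rng.normal(size=(n_trials, n_neurons))

observed = weight_correlation(X, Y, 1, 2)
null = circular_shift_null(X, Y, 1, 2)
p_value = float(np.mean(null >= observed))
print(round(observed, 2), p_value)
```

Shifting rather than shuffling keeps the design matrix, and hence any correlations between regressors, intact while breaking the trial-by-trial link to neural activity.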

To compare differences in CPD between regions, an independent-samples t-test was initially used to find periods where CPDs differed significantly. A permutation test was then conducted by shuffling neurons between the two regions and repeating the t-test over time for 500 permutations. Any observed runs of significance in the real data had to be longer than the 95th percentile of runs of significance in the null distribution.

Single-neuron encoding

An equivalent cluster-based permutation test was used to quantify the number of individual neurons that significantly encoded a particular parameter. Each neuron’s firing rate in the relevant epoch was shuffled across trials and the null distribution of the CPD computed for 500 permutations. The neuron’s CPD then had to exceed the 95th percentile of the null distribution, for longer than 95% of runs of spurious significance in the null distribution, to be considered significant. This resulted in an acceptable false positive rate of approximately 5% (Figure 2—figure supplement 1).

To test whether the proportion of significant neurons was larger than expected by chance, a binomial test against 5% was used. To test whether there were significant regional differences in the proportion of significant neurons observed, a chi-square test was used.
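As a hedged illustration of the test against a 5% chance level, the sketch below computes a one-sided exact binomial tail probability using only the standard library. The one-sided form, the function name, and the neuron counts are assumptions made for the example and are not results from the study.

```python
def binomial_tail(k, n, p=0.05):
    """P(X >= k) for X ~ Binomial(n, p), i.e. a one-sided exact binomial test."""
    pmf = (1.0 - p) ** n  # P(X = 0)
    tail = 0.0
    for i in range(n + 1):
        if i >= k:
            tail += pmf
        # Recurrence giving P(X = i + 1) from P(X = i).
        pmf *= (n - i) / (i + 1) * p / (1.0 - p)
    return tail


# Hypothetical count: 30 of 240 neurons significant, where 12 would be expected by chance.
p_value = binomial_tail(30, 240)
print(p_value < 0.001)
```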

Pattern of encoding

Similarities in the pattern of neural encoding between different parameters were quantified using Pearson’s r. Significance was determined by computing the null distribution of correlations by shuffling coefficients. Any observed runs of significance had to be longer than 95% of any spurious runs of significance that occurred in this null distribution.

Decoding analysis

A support vector machine (SVM) was used to decode the first-stage choice (choice 1). First, a pseudopopulation was generated by collapsing across all neurons recorded from a region. The session with the fewest number of trials for each sample was determined, and all other sessions randomly subsampled to this length to create a dataset with an equal number of samples for each neuron and each condition. The data were then split using stratified fivefold cross-validation, whereby 80% of trials were used to train an SVM (using sklearn.svm.LinearSVC from the Python Scikit-learn package), which was then tested on the remaining 20% of trials. This process was repeated for the remaining four folds of the data and decoder accuracies averaged over the five folds. As this process involved leaving out a random selection of trials each time, it was repeated 20 times and the average and standard error of these 20 permutations logged as the indicator of the decoder’s performance.
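The pseudopopulation step could look roughly like the sketch below. It is an illustration under assumptions, not the authors' code: the function name and data layout are hypothetical, each neuron is reduced to one firing-rate value per trial, and every neuron is subsampled to the smallest per-condition trial count found across neurons.

```python
import numpy as np


def build_pseudopopulation(rates, labels, seed=0):
    """Stack separately recorded neurons into one trials-by-neurons matrix.

    rates: list of (n_trials_i,) arrays, one firing-rate value per trial per neuron.
    labels: list of (n_trials_i,) arrays giving the condition of each trial.
    Returns X with shape (n_samples, n_neurons) and y with shape (n_samples,).
    """
    rng = np.random.default_rng(seed)
    conditions = np.unique(np.concatenate(labels))
    # Smallest number of trials available for any neuron in any condition.
    n_per_condition = min(int(np.sum(lab == c)) for lab in labels for c in conditions)

    columns = []
    for rate, lab in zip(rates, labels):
        picked = [
            rng.choice(np.flatnonzero(lab == c), size=n_per_condition, replace=False)
            for c in conditions
        ]
        columns.append(rate[np.concatenate(picked)])
    X = np.column_stack(columns)
    y = np.repeat(conditions, n_per_condition)
    return X, y


# Synthetic example: three neurons recorded with 40, 55 and 70 trials.
rng = np.random.default_rng(0)
trial_counts = (40, 55, 70)
rates = [rng.normal(size=n) for n in trial_counts]
labels = [np.arange(n) % 2 for n in trial_counts]
X, y = build_pseudopopulation(rates, labels)
print(X.shape, y.shape)
```

The resulting X and y have the samples-by-features layout expected by classifiers such as `sklearn.svm.LinearSVC`, which the text names, so they could then be passed to a stratified fivefold cross-validation loop.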

To determine whether the decoder was performing significantly better than chance, the above procedure was repeated 500 times while shuffling the trial labels on each iteration to ascertain the null distribution of the decoder’s performance. The observed data were then compared to this distribution to calculate the reported p-values, and significance was defined as an accuracy exceeding 95% of the null distribution. To test whether decoding in one condition was significantly better than in another, a null distribution of differences was created by shuffling condition labels and calculating the difference in decoder performance between the randomly split conditions 500 times. The observed difference in accuracy between conditions was then compared to this null distribution of differences.

Data availability

All data and code to reproduce figures are available at https://github.com/jamesbutler01/TwoStepExperiment, copy archived at Butler, 2026.

References

1. Akam T
2. Costa R
3. Dayan P
(2015) Simple plans or sophisticated habits? State, transition and learning interactions in the two-step task
PLOS Computational Biology 11:e1004648.

https://doi.org/10.1371/journal.pcbi.1004648
- PubMed
- Google Scholar
1. Akam T
2. Rodrigues-Vaz I
3. Marcelo I
4. Zhang X
5. Pereira M
6. Oliveira RF
7. Dayan P
8. Costa RM
(2021) The anterior cingulate cortex predicts future states to mediate model-based action selection
Neuron 109:149–163.

https://doi.org/10.1016/j.neuron.2020.10.013
- PubMed
- Google Scholar
1. Alexander WH
2. Brown JW
(2011) Medial prefrontal cortex as an action-outcome predictor
Nature Neuroscience 14:1338–1344.

https://doi.org/10.1038/nn.2921
- PubMed
- Google Scholar
1. Antonov G
2. Gagne C
3. Eldar E
4. Dayan P
(2022) Optimism and pessimism in optimised replay
PLOS Computational Biology 18:e1009634.

https://doi.org/10.1371/journal.pcbi.1009634
- PubMed
- Google Scholar
(1991) Responses to reward in monkey dorsal and ventral striatum
Experimental Brain Research 85:491–500.

https://doi.org/10.1007/BF00231732
- PubMed
- Google Scholar
(2009) Tonically active neurons in the striatum differentiate between delivery and omission of expected reward in a probabilistic task context
The European Journal of Neuroscience 30:515–526.

https://doi.org/10.1111/j.1460-9568.2009.06872.x
- PubMed
- Google Scholar
1. Balleine BW
2. O’Doherty JP
(2010) Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action
Neuropsychopharmacology 35:48–69.

https://doi.org/10.1038/npp.2009.131
- PubMed
- Google Scholar
1. Bayer HM
2. Glimcher PW
(2005) Midbrain dopamine neurons encode a quantitative reward prediction error signal
Neuron 47:129–141.

https://doi.org/10.1016/j.neuron.2005.05.020
- PubMed
- Google Scholar
(2018) What is a cognitive map? Organizing knowledge for flexible behavior
Neuron 100:490–509.

https://doi.org/10.1016/j.neuron.2018.10.002
- PubMed
- Google Scholar
(2024) Dopamine-independent effect of rewards on choices through hidden-state inference
Nature Neuroscience 27:286–297.

https://doi.org/10.1038/s41593-023-01542-x
- PubMed
- Google Scholar
(2016) Two anatomically and computationally distinct learning signals predict changes to stimulus-outcome associations in hippocampus
Neuron 89:1343–1354.

https://doi.org/10.1016/j.neuron.2016.02.014
- PubMed
- Google Scholar
Software
1. Butler JL
(2026) TwoStepExperiment, version swh:1:rev:ef32a40742af500dd55584218007d8f515d6c107
Software Heritage.

https://archive.softwareheritage.org/swh:1:dir:8571db215b2de578d656d7726b304d71c6744ad8;origin=https://github.com/jamesbutler01/TwoStepExperiment;visit=swh:1:snp:070f83ab9f17424b006d0452efd29e2f84806742;anchor=swh:1:rev:ef32a40742af500dd55584218007d8f515d6c107
(2018) Reconciling persistent and dynamic hypotheses of working memory coding in prefrontal cortex
Nature Communications 9:3498.

https://doi.org/10.1038/s41467-018-05873-3
- PubMed
- Google Scholar
1. Chan SCY
2. Schuck NW
3. Lopatina N
4. Schoenbaum G
5. Niv Y
(2021) Orbitofrontal cortex and learning predictions of state transitions
Behavioral Neuroscience 135:487–497.

https://doi.org/10.1037/bne0000461
- PubMed
- Google Scholar
1. Daw ND
2. Niv Y
3. Dayan P
(2005) Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control
Nature Neuroscience 8:1704–1711.

https://doi.org/10.1038/nn1560
- PubMed
- Google Scholar
1. Daw ND
2. Gershman SJ
3. Seymour B
4. Dayan P
5. Dolan RJ
(2011) Model-based influences on humans’ choices and striatal prediction errors
Neuron 69:1204–1215.

https://doi.org/10.1016/j.neuron.2011.02.027
1. Daw ND
2. Dayan P
(2014) The algorithmic anatomy of model-based evaluation
Philosophical Transactions of the Royal Society B 369:20130478.

https://doi.org/10.1098/rstb.2013.0478
1. Dickinson A
(1985) Actions and habits: the development of behavioural autonomy
Philosophical Transactions of the Royal Society of London. B, Biological Sciences 308:67–78.

https://doi.org/10.1098/rstb.1985.0010
1. Dickinson A
2. Balleine B
(1994) Motivational control of goal-directed action
Animal Learning & Behavior 22:1–18.

https://doi.org/10.3758/BF03199951
1. Dolan RJ
2. Dayan P
(2013) Goals and habits in the brain
Neuron 80:312–325.

https://doi.org/10.1016/j.neuron.2013.09.007
1. Doll BB
2. Duncan KD
3. Simon DA
4. Shohamy D
5. Daw ND
(2015) Model-based choices involve prospective neural activity
Nature Neuroscience 18:767–772.

https://doi.org/10.1038/nn.3981
1. Elber-Dorozko L
2. Loewenstein Y
(2018) Striatal action-value neurons reconsidered
eLife 7:e34248.

https://doi.org/10.7554/eLife.34248
1. Eldar E
2. Lièvre G
3. Dayan P
4. Dolan RJ
(2020) The roles of online and offline replay in planning
eLife 9:e56911.

https://doi.org/10.7554/eLife.56911
(2010) States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning
Neuron 66:585–595.

https://doi.org/10.1016/j.neuron.2010.04.016
1. Griggs WS
2. Kim HF
3. Ghazizadeh A
4. Costello MG
5. Wall KM
6. Hikosaka O
(2017) Flexible and stable value coding areas in caudate head and tail receive anatomically distinct cortical and subcortical inputs
Frontiers in Neuroanatomy 11:106.

https://doi.org/10.3389/fnana.2017.00106
1. Haber SN
2. Behrens TEJ
(2014) The neural network underlying incentive-based learning: implications for interpreting circuit disruptions in psychiatric disorders
Neuron 83:1019–1039.

https://doi.org/10.1016/j.neuron.2014.08.031
(2003) The anterior cingulate and reward-guided selection of actions
Journal of Neurophysiology 89:1161–1164.

https://doi.org/10.1152/jn.00634.2002
Preprint
1. Harris KD
(2020) Nonsense correlations in neuroscience
bioRxiv.

https://doi.org/10.1101/2020.11.29.402719
(2011) Neuronal basis of sequential foraging decisions in a patchy environment
Nature Neuroscience 14:933–939.

https://doi.org/10.1038/nn.2856
(2018) Triple dissociation of attention and decision computations across prefrontal cortex
Nature Neuroscience 21:1471–1481.

https://doi.org/10.1038/s41593-018-0239-5
1. Johnson A
2. Redish AD
(2007) Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point
The Journal of Neuroscience 27:12176–12189.

https://doi.org/10.1523/JNEUROSCI.3761-07.2007
(2012) Network resets in medial prefrontal cortex mark the onset of behavioral uncertainty
Science 338:135–139.

https://doi.org/10.1126/science.1226518
(2006) Optimal decision making and the anterior cingulate cortex
Nature Neuroscience 9:940–947.

https://doi.org/10.1038/nn1724
1. Kennerley SW
2. Wallis JD
(2009) Encoding of reward and space during a working memory task in the orbitofrontal cortex and anterior cingulate sulcus
Journal of Neurophysiology 102:3352–3364.

https://doi.org/10.1152/jn.00273.2009
(2011) Double dissociation of value computations in orbitofrontal and anterior cingulate neurons
Nature Neuroscience 14:1581–1589.

https://doi.org/10.1038/nn.2961
1. Kim HF
2. Hikosaka O
(2013) Distinct basal ganglia circuits controlling behaviors guided by flexible and stable values
Neuron 79:1001–1010.

https://doi.org/10.1016/j.neuron.2013.06.044
(2016) Value, search, persistence and model updating in anterior cingulate cortex
Nature Neuroscience 19:1280–1285.

https://doi.org/10.1038/nn.4382
1. Lau B
2. Glimcher PW
(2008) Value representations in the primate striatum during matching behavior
Neuron 58:451–463.

https://doi.org/10.1016/j.neuron.2008.02.021
(2014) Neural computations underlying arbitration between model-based and model-free learning
Neuron 81:687–699.

https://doi.org/10.1016/j.neuron.2013.11.028
1. Liu Y
2. Mattar MG
3. Behrens TEJ
4. Daw ND
5. Dolan RJ
(2021) Experience replay is associated with efficient nonlocal learning
Science 372:abf1357.

https://doi.org/10.1126/science.abf1357
(2007) Medial prefrontal cell activity signaling prediction errors of action values
Nature Neuroscience 10:647–656.

https://doi.org/10.1038/nn1890
1. Mattar MG
2. Daw N
(2018) Prioritized memory access explains planning and hippocampal replay
Nature Neuroscience 21:1609–1617.

https://doi.org/10.1038/s41593-018-0232-z
(2017) Dorsal hippocampus contributes to model-based planning
Nature Neuroscience 20:1269–1276.

https://doi.org/10.1038/nn.4613
(2020) Combined model-free and model-sensitive reinforcement learning in non-human primates
PLOS Computational Biology 16:e1007944.

https://doi.org/10.1371/journal.pcbi.1007944
(2021) Model based planners reflect on their model-free propensities
PLOS Computational Biology 17:e1008552.

https://doi.org/10.1371/journal.pcbi.1008552
1. Morris A
2. Cushman F
(2019) Model-Free RL or action sequences?
Frontiers in Psychology 10:2892.

https://doi.org/10.3389/fpsyg.2019.02892
1. Muller TH
2. Butler JL
3. Veselic S
4. Miranda B
5. Wallis JD
6. Dayan P
7. Behrens TEJ
8. Kurth-Nelson Z
9. Kennerley SW
(2024) Distributional reinforcement learning in prefrontal cortex
Nature Neuroscience 27:403–408.

https://doi.org/10.1038/s41593-023-01535-w
1. Murray JD
2. Bernacchia A
3. Freedman DJ
4. Romo R
5. Wallis JD
6. Cai X
7. Padoa-Schioppa C
8. Pasternak T
9. Seo H
10. Lee D
11. Wang X-J
(2014) A hierarchy of intrinsic timescales across primate cortex
Nature Neuroscience 17:1661–1663.

https://doi.org/10.1038/nn.3862
(2013) Dissociable effects of surprise and model update in parietal and anterior cingulate cortex
PNAS 110:1305373110.

https://doi.org/10.1073/pnas.1305373110
1. Pasupathy A
2. Miller EK
(2005) Different time courses of learning-related activity in the prefrontal cortex and striatum
Nature 433:873–876.

https://doi.org/10.1038/nature03287
1. Platt ML
2. Glimcher PW
(1999) Neural correlates of decision variables in parietal cortex
Nature 400:233–238.

https://doi.org/10.1038/22268
1. Roesch MR
2. Olson CR
(2003) Impact of expected reward on neuronal activity in prefrontal cortex, frontal and supplementary eye fields and premotor cortex
Journal of Neurophysiology 90:1766–1789.

https://doi.org/10.1152/jn.00019.2003
(2008) Frontal cortex subregions play distinct roles in choices between actions and stimuli
The Journal of Neuroscience 28:13775–13785.

https://doi.org/10.1523/JNEUROSCI.3541-08.2008
1. Schuck NW
2. Cai MB
3. Wilson RC
4. Niv Y
(2016) Human orbitofrontal cortex represents a cognitive map of state space
Neuron 91:1402–1412.

https://doi.org/10.1016/j.neuron.2016.08.019
(1997) A neural substrate of prediction and reward
Science 275:1593–1599.

https://doi.org/10.1126/science.275.5306.1593
1. Seo H
2. Lee D
(2007) Temporal filtering of reward signals in the dorsal anterior cingulate cortex during a mixed-strategy game
The Journal of Neuroscience 27:8366–8377.

https://doi.org/10.1523/JNEUROSCI.2369-07.2007
1. Seo M
2. Lee E
3. Averbeck BB
(2012) Action selection and action value in frontal-striatal circuits
Neuron 74:947–960.

https://doi.org/10.1016/j.neuron.2012.03.037
(2019) Optimizing the depth and the direction of prospective planning using information values
PLOS Computational Biology 15:e1006827.

https://doi.org/10.1371/journal.pcbi.1006827
(2013) The expected value of control: an integrative theory of anterior cingulate cortex function
Neuron 79:217–240.

https://doi.org/10.1016/j.neuron.2013.07.007
1. Shin EJ
2. Jang Y
3. Kim S
4. Kim H
5. Cai X
6. Lee H
7. Sul JH
8. Lee S-H
9. Chung Y
10. Lee D
11. Jung MW
(2021) Robust and distributed neural representation of action values
eLife 10:e53045.

https://doi.org/10.7554/eLife.53045
1. Sutton R
(1988) Learning to predict by the methods of temporal differences
Machine Learning 3:9–44.

https://doi.org/10.1023/A:1022633531479
Book
(2018)
Reinforcement Learning: An Introduction

The MIT Press.
(2008) Calculating consequences: brain systems that encode the causal effects of actions
The Journal of Neuroscience 28:6750–6755.

https://doi.org/10.1523/JNEUROSCI.1808-08.2008
Book
1. Thorndike E
(1911) Animal Intelligence: Experimental Studies
Macmillan Company.

https://doi.org/10.5962/bhl.title.55072
1. Tolman EC
(1948) Cognitive maps in rats and men
Psychological Review 55:189–208.

https://doi.org/10.1037/h0061626
(2009) A specific role for posterior dorsolateral striatum in human habit learning
The European Journal of Neuroscience 29:2225–2232.

https://doi.org/10.1111/j.1460-9568.2009.06796.x
(2007) Determining the neural substrates of goal-directed learning in the human brain
The Journal of Neuroscience 27:4019–4026.

https://doi.org/10.1523/JNEUROSCI.0564-07.2007
(2001) Anterior cingulate cortex, conflict monitoring, and levels of processing
NeuroImage 14:1302–1308.

https://doi.org/10.1006/nimg.2001.0923
1. Wallis JD
2. Kennerley SW
(2010) Heterogeneous reward signals in prefrontal cortex
Current Opinion in Neurobiology 20:191–198.

https://doi.org/10.1016/j.conb.2010.02.009
(2012) Mapping value based planning and extensively trained choice in the human brain
Nature Neuroscience 15:786–791.

https://doi.org/10.1038/nn.3068
(2004) Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning
The European Journal of Neuroscience 19:181–189.

https://doi.org/10.1111/j.1460-9568.2004.03095.x
(2005) The role of the dorsomedial striatum in instrumental conditioning
The European Journal of Neuroscience 22:513–523.

https://doi.org/10.1111/j.1460-9568.2005.04218.x

Article and author information

Author details

Bruno Miranda
1. Institute of Neurology, Department of Clinical and Movement Neurosciences, University College London, London, United Kingdom
2. Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
Contribution
Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing

Contributed equally with
James L Butler

Competing interests
No competing interests declared

ORCID iD: 0000-0003-4660-6051
James L Butler

Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom

Contribution
Formal analysis, Visualization, Writing – original draft, Writing – review and editing

Contributed equally with
Bruno Miranda

For correspondence
james.butler@psy.ox.ac.uk

Competing interests
No competing interests declared

ORCID iD: 0000-0002-5017-6472
WM Nishantha Malalasekera

Institute of Neurology, Department of Clinical and Movement Neurosciences, University College London, London, United Kingdom

Contribution
Data curation

Competing interests
No competing interests declared
Timothy EJ Behrens
1. Wellcome Centre for Integrative Neuroimaging, University of Oxford, FMRIB, John Radcliffe Hospital, Oxford, United Kingdom
2. Sainsbury Wellcome Centre for Neural Circuits and Behaviour College, University College London, London, United Kingdom
Contribution
Conceptualization, Supervision, Funding acquisition, Writing – review and editing

Competing interests
Editor-in-Chief, eLife

ORCID iD: 0000-0003-0048-1177
Peter Dayan
1. Max Planck Institute for Biological Cybernetics, Tübingen, Germany
2. University of Tübingen, Tübingen, Germany
Contribution
Conceptualization, Supervision, Funding acquisition, Methodology, Writing – review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0003-3476-1839
Steven W Kennerley
1. Institute of Neurology, Department of Clinical and Movement Neurosciences, University College London, London, United Kingdom
2. Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom
Contribution
Conceptualization, Data curation, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0002-5696-7507

Funding

Wellcome Trust

https://doi.org/10.35802/096689

James L Butler
Steven W Kennerley

Wellcome Trust

https://doi.org/10.35802/220296

James L Butler
Steven W Kennerley

Fundação para a Ciência e a Tecnologia (SFRH/BD/51711/2011)

Bruno Miranda

Santa Casa da Misericórdia de Lisboa (Prémio João Lobo Antunes 2017)

Bruno Miranda

Astor Foundation

WM Nishantha Malalasekera

Wellcome Trust

https://doi.org/10.35802/219525

Timothy EJ Behrens

Wellcome Trust

https://doi.org/10.35802/214314

Timothy EJ Behrens

Gatsby Charitable Foundation (GAT3955)

Timothy EJ Behrens

Jean Francois and Marie-Laure de Clermont Tonerre Foundation

Timothy EJ Behrens

Max Planck Society

Peter Dayan

Alexander von Humboldt Foundation

Peter Dayan

Biotechnology and Biological Sciences Research Council (BB/W003392/1)

Steven W Kennerley

Rosetrees Trust

WM Nishantha Malalasekera

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Acknowledgements

We would like to thank Thomas Akam for helpful comments on the manuscript. JLB and SWK were supported by Wellcome Trust Investigator Awards (096689/Z/11/Z, 220296/Z/20/Z). BM was supported by the Fundação para a Ciência e a Tecnologia (scholarship SFRH/BD/51711/2011) and the Prémio João Lobo Antunes 2017 – Santa Casa da Misericórdia de Lisboa. NM was supported by the Astor Foundation and the Rosetrees Charitable Trust. TEJB is supported by a Wellcome Principal Research Fellowship (219525/Z/19/Z), a Wellcome Trust Collaborator Award (214314/Z/18/Z), the Gatsby Initiative for Brain Development and Psychiatry (GAT3955), and the Jean Francois and Marie-Laure de Clermont Tonerre Foundation. PD was supported by the Max Planck Society and the Alexander von Humboldt Foundation. SWK is supported by a BBSRC Strategic Longer and Larger Grant (BB/W003392/1).

Ethics

All experimental procedures were approved by the UCL Local Ethical Procedures Committee and the UK Home Office (PPL Number 70/7231), and carried out in accordance with the UK Animals (Scientific Procedures) Act.

Version history

Preprint posted: January 12, 2025
Sent for peer review: January 18, 2025
Reviewed Preprint version 1: April 15, 2025
Reviewed Preprint version 2: February 27, 2026
Version of Record published: June 22, 2026

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.106032. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.