Dopamine ramps as a normative consequence of dual-process control

Luke Priestley; Thomas Akam

doi:10.7554/eLife.111458.1

eLife Assessment

This important study developed a novel theory to account for various aspects of dopamine signals, particularly dopamine ramps. The authors propose that dopamine reward prediction error (RPE) signals are generated by a dual-process learning system in which values inferred by a model-based system enter the RPE asymmetrically, contributing to the update target but not the prediction. The results are well-presented and convincing, and make a contribution that is of importance to the field. This work will be of interest to those studying dopamine specifically or brain learning computations and systems more broadly.

https://doi.org/10.7554/eLife.111458.1.sa3

Abstract

Midbrain dopamine neurons are thought to implement a temporal difference (TD) reward prediction error (RPE) that updates cached values stored in striatum. This has been challenged by evidence that dopamine “ramps up” to predictable rewards during goal-directed behaviour. Here, we propose that dopamine ramps are RPEs generated by a dual-process learning system in which values inferred using a world model train cached values via the RPE. Ramps arise because efficient training of cached values requires that inferred values contribute to the update target but not the prediction component of the RPE. The model reproduces key dopamine ramp phenomena, including learning dynamics on fast and slow timescales, global updates following changes in reward expectation, transient responses during unexpected state transitions, and sensitivity to state uncertainty manipulations. We therefore argue that dopamine ramps are a signature of interactions between inferred and cached values that revise the traditional dichotomy between model-based and model-free learning.

2 Introduction

Dopamine is widely thought to implement a temporal-difference (TD) reward prediction error (RPE) signal that updates cached values stored at striatal synapses (Montague et al., 1996; Schultz, 2006; Schultz et al., 1997). However, there are features of dopamine activity that this theory struggles to explain. Perhaps most striking is the fact that dopamine in ventral striatum “ramps up” in anticipation of predictable rewards (Howe et al., 2013), particularly in spatial paradigms where distinct locations or sensory states indicate reward proximity (Farrell et al., 2022; Guru et al., 2020; Hamid et al., 2016; Kim et al., 2020; Krausz et al., 2023; Mikhael et al., 2022; Mohebi et al., 2019). The tension with the RPE hypothesis is clear: if dopamine implements an RPE, why do dopamine signals progressively increase during goal approach in well-learned tasks when the value of each state is known?

Theoretical accounts of dopamine ramps fall broadly into two camps. The first proposes that dopamine conveys a value signal rather than an RPE (Hamid et al., 2016; Howe et al., 2013; Mohebi et al., 2019). Although this explains why dopamine might ramp in spatial tasks, where value increases with reward proximity, it is difficult to reconcile with evidence that dopamine exhibits key properties of an RPE in many species and settings (Blanco-Pozo et al., 2024; Eshel et al., 2015; Kim et al., 2020; O’Doherty et al., 2003; Pessiglione et al., 2006; Schultz et al., 1997; Steinberg et al., 2013; Witten et al., 2011). The second proposes that dopamine ramps are a specific case of RPE that arises under special conditions – for example, when state-uncertainty prevents accurate value estimation (Mikhael et al., 2022), when synaptic decay induces forgetting (Kato and Morita, 2016), or when action-timing is uncertain (Lloyd and Dayan, 2015).

Two striking, recently reported features of dopamine ramp dynamics are not captured by existing theories. First, new information about expected reward at navigational goals rapidly and globally updates ramp amplitude in a manner that appears inconsistent with TD learning (Guru et al., 2020; Krausz et al., 2023). Second, ramps diminish with experience when animals navigate to the same goal in a stable environment (Guru et al., 2020), but on a much slower timescale than that over which behaviour converges. Guru et al. (2020) and Krausz et al. (2023) propose that model-based value computations underlie the rapid effect of new reward information on dopamine ramps, but a theoretical account explaining why model-based computations give rise to dopamine ramps, and how this explains observed ramp dynamics, is lacking.

Here, we suggest that dopamine ramps are a consequence of a dual-process learning architecture that combines model-based and TD learning systems. This builds on the longstanding idea that the brain has multiple complementary learning systems: an efficient but slow TD system for learning cached values, implemented in the basal ganglia, and a flexible but constrained model-based system for inferring value using a world-model, putatively implemented in frontal cortex (Balleine and Dickinson, 1998; Daw et al., 2005; Dolan and Dayan, 2013). Our key claim is that if the brain can infer values independently of the TD system, then for these to train cached values efficiently they must enter the RPE computation in a specific way that necessarily generates ramps. Specifically: inferred values should contribute to the update target towards which cached values are incremented, but not to the prediction against which outcomes are compared to compute the RPE, which should be determined by cached values alone.

We show that incorporating inferred values into the RPE in this way is normative in the sense that it accelerates learning compared to alternative approaches. We further show that the model generates RPE ramps analogous to dopamine ramps observed in experiments, and reproduces diverse experimental findings on dopamine ramp dynamics. We therefore argue that dopamine ramps are RPEs generated by a normative dual-process learning system, a view that revises the traditional dichotomy between model-free and model-based evaluation.

3 Results

We first review the standard TD learning algorithm, its putative implementation in the brain, and its inconsistency with dopamine ramps. The TD algorithm aims to learn a value function that reflects expected cumulative future reward given a starting state and policy (Sutton and Barto, 2014). Formally:
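In the standard notation of reinforcement learning, and assuming rewards are indexed so that r_{t+1} is the reward that follows state s_t (the indexing used in the RPE definition later in this section), the value function can be written as:

```latex
\[
V^{\pi}(s_t) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t \right]
\]
```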

Where r_t is a reward received at time t, γ is a discount factor, and π is a policy specifying a probabilistic mapping between states of the environment and actions by the agent. For simplicity, we henceforth omit the π superscript from all notation.

If the environment is Markovian, V(s_t) can be expressed recursively as:
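Assuming rewards are indexed so that r_{t+1} is received on leaving s_t, the recursion takes the standard Bellman form:

```latex
\[
V(s_t) = \mathbb{E}\!\left[\, r_{t+1} + \gamma\, V(s_{t+1}) \,\middle|\, s_t \right]
\]
```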

The TD algorithm exploits this recursion using an online learning rule where value estimates are updated in light of immediate rewards, and differences in value between successive states. This is formalised in a teaching signal called the reward prediction error δ_t, defined as:
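Using the two terms named in the next paragraph (the prediction and the update target), the standard TD error can be written as follows. The braces are annotations added for this sketch:

```latex
\[
\delta_t = \underbrace{r_{t+1} + \gamma\, V(s_{t+1})}_{\text{update target}} \;-\; \underbrace{V(s_t)}_{\text{prediction}}
\]
```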

In this equation, the estimate of the old state’s value V(s_t) is a prediction – i.e., it is the current ‘best guess’ about expected future reward – while the immediate reward r_{t+1} plus the discounted value estimate for the new state V(s_{t+1}) is an update target – i.e., the value toward which the prediction is adjusted. Value estimates are updated using RPEs as:
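In the standard TD(0) form, with α the learning rate defined in the next paragraph, the update reads:

```latex
\[
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
\]
```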

Where α ∈ [0, 1] is a learning rate controlling how value estimates change in response to RPEs.

The TD algorithm is the basis of an influential account of value-learning in cortico-striatal circuits (Montague et al., 1996; Schultz et al., 1997). This account has three main components (fig. 1A-i): (i) the cortex constructs a state representation from sensory experience and communicates it to the striatum (Chang and Tsao, 2017; Liu et al., 2016; Yamins and DiCarlo, 2016); (ii) the striatum estimates value using cortico-striatal synaptic weights, which reflect the relationship between state-features and value (Samejima et al., 2005; Van Der Meer, 2009); and (iii) VTA dopaminergic neurons compute reward prediction errors that induce plasticity at cortico-striatal synapses (Pawlak and Kerr, 2008), enabling stored values to be updated.

Figure 1. Inferred values train cached values in a dual process architecture.
**(a)** Diagram of neural circuits for value-learning showing: (i) The standard model of TD learning in cortico-striatal circuits. (ii) The proposed dual-process model. Note the connection from a model-based evaluation system in frontal cortex to dopamine neurons that bypasses striatum. **(b)** Difference between cached values and the true value function (i.e., value error) during learning in the dual-process model, a standard TD agent, and an alternative dual-process model where V_NET contributes to both the RPE update target and prediction. **(c)** RPEs during approach to a reward at different stages of learning (early, mid, late) in: (i) a standard TD agent; and (ii) the dual-process model. **(d)** Value functions in different components of the dual-process model during approach to a reward early in learning. The RPE (δ) is approximately the difference between V_TD and V_NET.

Dopamine ramps challenge this account because the TD algorithm does not, in general, produce RPE ramps in settings where dopamine ramps occur. This is because cached values are assumed to control the agent’s policy. In stable and deterministic environments, cached values converge on the true value function with learning, implying that the policy converges when RPEs disappear. This is inconsistent with experimental observations of dopamine ramps, which persist long after animals exhibit expert task performance (Howe et al., 2013; Krausz et al., 2023).
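The convergence argument can be checked with a minimal tabular sketch. The track length, discount factor, learning rate, and the convention that reward arrives on the transition into the absorbing goal are illustrative assumptions, not the settings used for the figures.

```python
def run_standard_td(n_states=10, reward=1.0, gamma=0.9, alpha=0.1, n_trials=2000):
    """Tabular TD(0) on a linear track with a deterministic run to the goal.

    States 0..n_states-1 are track positions; index n_states is the absorbing
    goal, whose value stays at 0. Reward is delivered on entering the goal
    (an assumed convention). Returns the learned values and the RPEs from
    the final trial.
    """
    values = [0.0] * (n_states + 1)
    last_trial_rpes = []
    for _ in range(n_trials):
        last_trial_rpes = []
        for s in range(n_states):
            s_next = s + 1
            r = reward if s_next == n_states else 0.0
            delta = r + gamma * values[s_next] - values[s]  # update target minus prediction
            values[s] += alpha * delta
            last_trial_rpes.append(delta)
    return values, last_trial_rpes


values, rpes = run_standard_td()
# Largest RPE magnitude on the final trial: effectively zero after convergence.
print(max(abs(d) for d in rpes))
```

Under these assumptions the RPE at every position decays towards zero with training, so a standard TD agent shows no ramp once its values have converged.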

3.1 Inferred values train cached values in a dual process architecture

We propose a dual-process account of dopamine ramps with two core assumptions: (i) that the brain possesses a model-based system that can infer value independently of the basal ganglia TD system in tasks where dopamine ramps occur; and (ii) that inferred value estimates contribute to training the TD system (fig. 1A-ii). There are many proposals for how the brain might implement model-based value computations (Akam and Walton, 2021; Dolan and Dayan, 2013; Mattar and Lengyel, 2022), including roll-out-based planning, successor or geodesic representations (Dayan, 1993; Sagiv et al., 2025), and inference mechanisms based on attractor dynamics (Donnarumma et al., 2025; Jensen et al., 2025). We do not provide a substantive account of model-based evaluation in this paper. For simulation purposes, we assume a model-based system that infers a goal-conditioned value function using shortest-path distances between states (see below).

If the model-based system can predict future rewards that are not yet reflected in cached values, it can be used to train cached values via the RPE. We argue that to accomplish this, inferred values should contribute to the RPE as:
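A form consistent with the description that follows is sketched below. Two points are assumptions rather than statements taken from the text: that k weights the inferred value (with 1 − k on the cached value), and that the equation numbers correspond to the in-text references to eqs. 5–7.

```latex
\begin{align}
V_{\mathrm{NET}}(s_{t+1}) &= k\, V_{\mathrm{MB}}(s_{t+1}) + (1-k)\, V_{\mathrm{TD}}(s_{t+1}) \tag{5} \\
\delta_t &= r_{t+1} + \gamma\, V_{\mathrm{NET}}(s_{t+1}) - V_{\mathrm{TD}}(s_t) \tag{6}
\end{align}
```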

Here, the estimated value of the new state V_NET(s_{t+1}) used in the update target combines an inferred value estimate V_MB and a cached value estimate V_TD using a mixing parameter k. Although our model is agnostic as to how k is determined, it should in principle ensure that V_NET reflects the agent’s best estimate of the true value given the reliability of V_TD and V_MB, e.g., using uncertainty- or confidence-based arbitration (Daw et al., 2005; Lee et al., 2014). We do not provide a substantive account of arbitration in this paper and our simulations use a fixed value of k = 0.5 for simplicity (see Methods). Cached values are then updated using the RPE as:
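Assuming the same incremental form as the standard TD update, and with the equation number inferred from the in-text references, this update can be written as:

```latex
\[
V_{\mathrm{TD}}(s_t) \leftarrow V_{\mathrm{TD}}(s_t) + \alpha_{\mathrm{TD}}\, \delta_t \tag{7}
\]
```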

Where α_TD is a learning-rate for cached values.

The critical feature of the RPE computation in eq. (6) is that inferred values contribute only to the update target, as V_MB(s_{t+1}), whereas cached values contribute to both the update target, as V_TD(s_{t+1}), and the prediction, as V_TD(s_t). The rationale is that because the RPE functions to update cached values, the prediction against which outcomes are compared should reflect cached values alone. However, the update target towards which cached predictions are updated should represent the best available estimate of future reward, and hence should incorporate inferred value estimates if they are available. Computing RPEs in this way updates cached values towards V_NET, which is desirable when V_MB contains accurate value information that is not yet consolidated into cached values.

How might this dual process architecture be implemented in the brain? A key assumption is that inferred values arise independently of striatal cached values but contribute to RPE computations in the VTA. Since frontal cortex is strongly implicated in model-based evaluation (Akam et al., 2021; Daw et al., 2011; Huang et al., 2020; Jones et al., 2012; Killcross and Coutureau, 2003; Niedringhaus and West, 2022; Stalnaker et al., 2014), we propose that monosynaptic projections from frontal cortex to VTA (Babiczky and Matyas, 2022; Beier et al., 2015; Gao et al., 2022; Wang et al., 2020) communicate the inferred value information used in the RPE computation (fig. 1A-ii, see Discussion).

3.2 Dual process learning generates ramping RPEs

We first evaluated the dual-process model on a 1D tabular environment with a single, absorbing, rewarding goal-state. This mimics the trial-based structure of experiments where dopamine ramps occur, which involve moving through a series of locations to a rewarding goal. Environments with a single, terminal reward yield a special case of the value function where value reduces to the reward available in the goal state discounted by its shortest-path distance from the current state:
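One way to write this special case is shown below. The symbol \hat{r} for the estimated goal reward, and the exponent convention in which the goal itself has d = 0, are assumptions made here for illustration:

```latex
\[
V_{\mathrm{MB}}(s) = \gamma^{\,d(s)}\, \hat{r}
\]
```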

Where d(s) is the distance between state s and the goal-state, and the remaining term is an estimate of the reward in the goal-state. We assume that distances between states d are known a priori. In spatial navigation, this assumption is motivated by entorhinal grid-cells which represent 2D space in a manner that generalises between environments and, in principle, permits distance estimation between locations (Bush et al., 2015; Hafting et al., 2005; Whittington et al., 2020). The estimated reward in the goal-state is updated using a delta rule:
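With the trial index n, reward r_n and learning rate α_MB defined in the next paragraph, a standard delta rule has the following form, where \hat{r}_n is an assumed symbol for the reward estimate held on trial n:

```latex
\[
\hat{r}_{n+1} = \hat{r}_n + \alpha_{\mathrm{MB}}\,\bigl(r_n - \hat{r}_n\bigr)
\]
```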

Where n indexes a trial of experience, r_n is the reward received on trial n, and α_MB is a learning-rate for reward estimate updates. We assume that the learning-rate in the model-based system is greater than the learning-rate for the TD system (α_MB > α_TD).

We tested the dual process model’s learning performance by comparing time-to-convergence for V_TD with respect to the true value function in three different cases: (i) the dual process model proposed in eqs. 5–7, where V_MB contributes to the RPE update target but not the prediction; (ii) an alternative dual process model where V_MB contributes equally to the RPE update target and prediction; and (iii) the standard TD learning algorithm, where V_MB does not appear (fig. 1B). Cached values converged to the true value function more efficiently in our proposed model compared to both alternatives. Strikingly, the alternative dual-process model, where V_MB contributed to both the RPE update target and prediction, learned less efficiently than standard TD. This demonstrates that if inferred values are available, it is normative to use them only in the RPE update target.
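The comparison can be illustrated with a small tabular sketch. It is not the simulation code used for the figures: the track length, learning rates, number of trials, the weighting of V_MB by k, and the convention that reward arrives on the transition into the absorbing goal are all assumptions chosen for illustration.

```python
def value_error(mode, n_states=10, reward=1.0, gamma=0.9,
                alpha_td=0.05, alpha_mb=0.5, k=0.5, n_trials=50):
    """Mean |V_TD - V*| over track states after training.

    mode: "td"        standard TD, no inferred values
          "dual"      V_NET in the update target only
          "symmetric" V_NET in both update target and prediction
    """
    goal = n_states                  # absorbing goal; its value stays 0
    v_td = [0.0] * (n_states + 1)    # cached values
    r_hat = 0.0                      # inferred reward at the goal

    def v_mb(s):
        # Inferred value: estimated reward discounted by the steps remaining
        # before the rewarded transition (assumed convention).
        return 0.0 if s == goal else gamma ** (goal - 1 - s) * r_hat

    def v_net(s):
        return k * v_mb(s) + (1 - k) * v_td[s]

    for _ in range(n_trials):
        for s in range(n_states):
            s_next = s + 1
            r = reward if s_next == goal else 0.0
            if mode == "td":
                delta = r + gamma * v_td[s_next] - v_td[s]
            elif mode == "dual":
                delta = r + gamma * v_net(s_next) - v_td[s]
            elif mode == "symmetric":
                delta = r + gamma * v_net(s_next) - v_net(s)
            else:
                raise ValueError(f"unknown mode: {mode}")
            v_td[s] += alpha_td * delta
        r_hat += alpha_mb * (reward - r_hat)  # delta rule, once per trial

    true_v = [gamma ** (goal - 1 - s) * reward for s in range(n_states)]
    return sum(abs(v_td[s] - true_v[s]) for s in range(n_states)) / n_states


errors = {mode: value_error(mode) for mode in ("dual", "td", "symmetric")}
print(errors)
```

In this sketch, once the inferred values are accurate the symmetric variant reduces to standard TD with its effective learning rate scaled by 1 − k, which is one way to see why it learns more slowly than TD.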

We next compared RPEs from the dual-process model and standard TD learning during navigation of the linear track environment (fig. 1C). Ramping RPEs occurred in the dual-process model when inferred values predicted future reward that was not yet captured in cached values. We illustrate this in fig. 1D by displaying V_TD, V_MB, and V_NET during early learning. Inferred values emerged rapidly during initial encounters with the environment, whereas cached values emerged incrementally. Consequently, inferred values exceeded cached values in all states during early learning. Given that V_MB contributes only to the update target, and the prediction is given only by V_TD, the RPE is shaped by differences between V_MB and V_TD for successive states, which increase as the agent approaches reward – in other words, RPEs ramp. Dopamine ramps are therefore consistent with inferred values training cached values via the RPE.
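A minimal sketch of this effect is given below, assuming a 10-state track, k = 0.5, illustrative learning rates, and reward delivered on the transition into the absorbing goal. None of these settings are taken from the simulations behind the figures.

```python
def dual_process_rpes(trial, n_states=10, reward=1.0, gamma=0.9,
                      alpha_td=0.05, alpha_mb=0.5, k=0.5):
    """RPE at each step of the given trial (1-indexed) in the dual-process model."""
    goal = n_states                  # absorbing goal; its value stays 0
    v_td = [0.0] * (n_states + 1)    # cached values
    r_hat = 0.0                      # inferred reward at the goal
    rpes = []
    for _ in range(trial):
        rpes = []
        for s in range(n_states):
            s_next = s + 1
            r = reward if s_next == goal else 0.0
            # Inferred value of the next state (assumed distance convention).
            v_mb_next = 0.0 if s_next == goal else gamma ** (goal - 1 - s_next) * r_hat
            v_net_next = k * v_mb_next + (1 - k) * v_td[s_next]
            delta = r + gamma * v_net_next - v_td[s]  # V_MB enters the target only
            v_td[s] += alpha_td * delta
            rpes.append(delta)
        r_hat += alpha_mb * (reward - r_hat)
    return rpes


early = dual_process_rpes(trial=3)     # inferred values lead cached values
late = dual_process_rpes(trial=3000)   # both have converged
print([round(d, 3) for d in early])
print(max(abs(d) for d in late))
```

With these settings the early-trial RPEs increase from one position to the next along the track, and the final entry is the response at reward delivery. After extensive training the RPEs are negligible at every position.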

3.3 Dopamine ramp dynamics at short and long timescales

A key prediction of the dual process model is that dopamine ramps will diminish over time in stable environments. This is because ramps arise from the difference between cached V_TD and inferred V_MB values (fig. 1) which, in stable environments, reduces with experience as V_TD and V_MB converge to the true value function (fig. 2C). Consequently, RPEs – and therefore ramps – should also reduce with experience. Consistent with this, Guru et al. (2020) report that dopamine ramps in mice gradually diminish with extensive training in a spatial task involving navigation between rewards at alternate ends of a linear track (fig. 2A–B). Dopamine ramps then re-emerged when place-reward contingencies changed, suggesting that ramps implement a learning-related computation consistent with an RPE. RPE ramps in the dual process model reproduced these patterns when simulated on an analogous task (fig. 2D). The dual-process model is thus consistent with long-timescale changes in dopamine ramps reported by Guru et al. (2020).

Figure 2. Dopamine ramp dynamics at different timescales.
**(a)** Diagram of behavioural task in Guru et al. (2020). **(b)** Experimental data from Guru et al. (2020) showing dopamine ramps at different training stages during runs to big (red) and small (black) reward locations. **(c)** Evolution of value estimates during training in the dual-process model; **(i)** value estimates during early training; **(ii)** value estimates during late training. **(d)** Evolution of dual process model RPEs during training on a task that replicates Guru et al. (2020). **(e)** Experimental data from Guru et al. (2020) showing rapid development of dopamine ramps after initial encounters with rewarding goals in a novel environment. **(f)** Evolution of RPEs in the dual-process model during initial encounters with rewarding goals.

Guru et al. (2020) further report that dopamine ramps emerge rapidly when rewards are first encountered in a novel environment (fig. 2E). This is consistent with our model under the assumption that distances between spatial states can be estimated with minimal experience (see above and Discussion), and that the model-based system employs a high learning-rate (α_MB) for the reward function. Simulating the dual process model in a novel environment confirmed that RPE ramps were absent on the first trial, when the reward at the goal was unknown. Ramps then rapidly developed over subsequent trials as rewards drove learning of inferred values (fig. 2F). The rapid development of dopamine ramps in novel environments is thus consistent with RPE dynamics in the dual-process model.

3.4 Rapid global updates to ramp amplitude by reward

When rewards at goal locations are dynamic, dopamine ramp amplitudes are rapidly updated by reward outcomes. Specifically, Krausz et al. (2023) demonstrate that an outcome at a goal location modulates ramp amplitude on the subsequent visit, with rewards increasing amplitude and omissions decreasing amplitude (fig. 3A–B). Importantly, amplitude changes occur even if the goal is reached by a different route on the subsequent visit, suggesting that they reflect global updates in reward expectation. The dual process model captures these patterns under the assumption that changes in the expected reward at a goal location globally modulate inferred values. This is consistent with inferred values that are computed by combining an estimate of the immediate reward at a goal state with a representation of the distances between states, e.g., via Euclidean distances computed by grid cells (Bush et al., 2015) or cached shortest-path distances as in the geodesic representation (Sagiv et al., 2025). Implementing the dual process model in a 2D gridworld environment with multiple paths to goal locations recapitulated the patterns reported by Krausz et al. (2023) (fig. 3C–D), suggesting that it is consistent with rapid, global changes in dopamine ramps in environments with dynamic rewards.
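The global character of this update can be sketched with shortest-path distances on a small grid. The maze layout, the breadth-first search, the parameter values, and the convention that the goal has distance 0 are hypothetical choices for illustration and are not the gridworld used for fig. 3.

```python
from collections import deque


def shortest_path_distances(grid, goal):
    """Breadth-first-search distances (in steps) from every open cell to the goal.

    grid is a list of equal-length strings; '#' marks a wall and any other
    character is an open cell.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        row, col = queue.popleft()
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols
            if in_bounds and grid[nxt[0]][nxt[1]] != "#" and nxt not in dist:
                dist[nxt] = dist[(row, col)] + 1
                queue.append(nxt)
    return dist


def inferred_values(dist, r_hat, gamma=0.9):
    """Inferred value of each cell: estimated goal reward discounted by distance."""
    return {cell: gamma ** d * r_hat for cell, d in dist.items()}


# Left and right corridors both lead from the bottom row to the goal G.
grid = ["..G..",
        ".###.",
        ".###.",
        "....."]
goal = (0, 2)
dist = shortest_path_distances(grid, goal)

r_hat, alpha_mb = 0.5, 0.5
before = inferred_values(dist, r_hat)
r_hat += alpha_mb * (1.0 - r_hat)  # one rewarded visit raises the reward estimate
after = inferred_values(dist, r_hat)

left_cell, right_cell = (2, 0), (2, 4)  # one cell on each route
print(after[left_cell] / before[left_cell], after[right_cell] / before[right_cell])
```

Because the reward estimate multiplies every cell's discounted distance, a single outcome at the goal rescales inferred values by the same factor on both routes, including the route that was not taken.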

Reward globally updates dopamine ramps.
**(a)** Experimental design from Krausz et al. (2023) in which animals can reach rewarding goal locations via multiple routes. **(b)** Experimental data from Krausz et al. (2023) showing that outcomes at goal locations globally update dopamine ramps regardless of subsequent route. **(c)** Diagram showing global updating of inferred values V_MB by rewards in the dual process model. **(d)** Effect of reward on trial t on RPEs on trial *t+1* in the dual process model when goal is reached via different routes on t and *t+1*.

3.5 RPE-like dopamine responses to unexpected state transitions

We next tested whether the dual process model reproduces RPE-like dopamine responses during experimental manipulations in spatial tasks. Previous work has shown that when animals progress toward a reward location in a spatial virtual-reality (VR) environment, dopamine signals are modulated by unexpected changes in spatial position (Kim et al., 2020). For example, teleports between non-adjacent states cause dopamine transients that superimpose on dopamine ramps, with magnitudes that are proportional to teleport end-state (fig. 4A) and teleport-distance (fig. 4B). Similarly, the speed at which animals progress towards the goal modulates ramp slopes, with faster speeds producing stronger slopes (fig. 4C). These patterns favour an RPE interpretation of ramps in which dopamine represents changes in value between timepoints, rather than value itself. Dual process model RPEs reproduced dopamine responses during teleport and speed manipulations because RPEs are intrinsically modulated by changes in value (fig. 4A-C). By equating dopamine with an RPE, the dual process model is thus consistent with the results in Kim et al. (2020).
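These manipulations can be sketched with a closed-form RPE for a single unrewarded transition. The sketch assumes that inferred values equal the true values and that cached values are a fixed fraction of them (`cached_fraction`, a hypothetical stand-in for an intermediate stage of learning). The track length, discount factor and k are likewise illustrative.

```python
def transition_rpe(start, end, goal=20, gamma=0.95, reward=1.0,
                   k=0.5, cached_fraction=0.5):
    """Dual-process RPE for an unrewarded jump from position start to position end."""
    def v_mb(s):
        # Inferred value, assumed accurate: reward discounted by distance to goal.
        return gamma ** (goal - s) * reward

    def v_td(s):
        # Cached value, assumed to lag inferred value by a fixed fraction.
        return cached_fraction * v_mb(s)

    v_net_end = k * v_mb(end) + (1 - k) * v_td(end)
    return gamma * v_net_end - v_td(start)  # update target minus cached prediction


# (a) Constant teleport distance, different end-states.
same_distance = [transition_rpe(s, s + 5) for s in (2, 6, 10)]
# (b) Different teleport distances, common end-state.
same_end = [transition_rpe(15 - d, 15) for d in (2, 5, 8)]
# (c) Traversal speed: one versus two positions per time step.
slow_step, fast_step = transition_rpe(5, 6), transition_rpe(5, 7)

print(same_distance, same_end, slow_step, fast_step)
```

Under these assumptions the RPE grows with later end-states, with longer teleports to a common end-state, and with faster traversal, because each manipulation increases the change in value across the transition.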

Figure 4. Dopamine responses to unexpected state transitions.
Experimental conditions (left), dopamine recordings (middle), and RPEs in simulations of the dual-process model (right) for key experimental conditions from Kim et al. (2020) in which unexpected state transitions occurred during reward approach in a VR environment. **(a)** Teleport end-state manipulation, where teleports of constant distance were aligned with different end-states. **(b)** Teleport distance manipulation, where teleports varied in distance but ended at a common state. **(c)** Traversal speed manipulation.

3.6 Dopamine ramp dynamics under state-uncertainty manipulations

Finally, we consider experiments motivated by the proposal that dopamine ramps arise from distortions in learning caused by state uncertainty (Mikhael et al., 2022). As in Kim et al. (2020), mice approached a reward location within a VR corridor (fig. 5A). On some trials, the VR environment progressively darkened during goal approach, thereby increasing the animal’s uncertainty about its location (fig. 5A). Dopamine signals on these trials resembled ‘bumps’ rather than ramps – they initially increased more rapidly than the standard trial signal, before subsequently decreasing below it (fig. 5B).

Figure 5. Dopamine ramp dynamics under state-uncertainty manipulations.
**(a)** Experimental paradigm from Mikhael et al. (2022) where animals approached a rewarded location in a VR environment that differed in movement speed (fast-vs-slow) and luminance (bright-vs-darkening) across trials. **(b)** Dopamine signals from Mikhael et al. (2022), where progressively increasing state uncertainty causes dopamine bumps rather than dopamine ramps. **(c)** Effect of progressively increasing state uncertainty on RPEs in the dual process model.

Following Mikhael et al. (2022), we assume the subject does not know the true current state s_t of the environment, but instead maintains a probability distribution p(s | x_t) over possible states s given sensory input x_t. This probability distribution is assumed to be a Gaussian centred on s_t with standard deviation σ_t:
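For a one-dimensional state variable this belief can be written as follows. Treating s as continuous, and leaving any truncation at the ends of the track implicit, are simplifying assumptions of this sketch:

```latex
\[
p(s \mid x_t) = \mathcal{N}\!\left(s;\, s_t,\, \sigma_t^{2}\right)
= \frac{1}{\sqrt{2\pi\sigma_t^{2}}}\,
\exp\!\left(-\frac{(s - s_t)^{2}}{2\sigma_t^{2}}\right)
\]
```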

Value estimates with respect to x_t are computed as a probability-weighted sum over state value estimates:
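For discrete states this is the expectation of the state values under the belief distribution. In the sketch below, \hat{V} is an assumed generic symbol for whichever value estimate is being read out:

```latex
\[
\hat{V}(x_t) = \sum_{s} p(s \mid x_t)\, \hat{V}(s)
\]
```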

Simulating this version of the model on standard trials with constant state uncertainty produced ramping RPEs consistent with those observed without state uncertainty (fig. 5C). Following Mikhael et al. (2022), darkening trials were modelled by gradually increasing the width of the state uncertainty kernel across the trial. This generated RPE bumps similar to the dopamine bumps seen in the experimental data (fig. 5C).

RPE bumps occur because state uncertainty distorts the inferred value estimates that drive RPEs. Greater uncertainty assigns more weight to states away from the true state. Early in the trial, this increases inferred value estimates: although the uncertainty kernel is symmetric around the true state, the slope of the value function is steeper on the higher-value side. Late in the trial, uncertainty adds probability mass primarily behind the true state, as the uncertainty kernel cannot extend beyond the reward location if the reward has not been reached. This decreases inferred value estimates and hence the RPE.
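Both effects can be seen in a short sketch of the probability-weighted value readout. The 50-state track, the discount factor, the kernel width and the renormalised, truncated Gaussian are illustrative assumptions rather than the settings used for fig. 5.

```python
import math


def exact_value(state, n_states=50, gamma=0.95):
    """Value of a known position: unit reward discounted by distance to the track end."""
    return gamma ** (n_states - 1 - state)


def blurred_value(true_state, sigma, n_states=50, gamma=0.95):
    """Probability-weighted value under a Gaussian belief centred on true_state.

    The belief is restricted to positions 0..n_states-1 and renormalised, so it
    cannot extend beyond the reward location at the end of the track.
    """
    weights = [math.exp(-0.5 * ((s - true_state) / sigma) ** 2) for s in range(n_states)]
    total = sum(weights)
    return sum(w * exact_value(s, n_states, gamma)
               for s, w in zip(range(n_states), weights)) / total


mid_track, track_end = 25, 49
# Mid-track: the value function is convex, so a symmetric kernel raises the estimate.
print(blurred_value(mid_track, sigma=5.0) > exact_value(mid_track))
# Near the reward: the kernel can only add mass behind the true state, lowering it.
print(blurred_value(track_end, sigma=5.0) < exact_value(track_end))
```

Widening the kernel as the trial progresses therefore first inflates and then deflates the inferred value relative to the no-uncertainty case, which is the ingredient behind the bump-shaped RPE described above.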

Although state uncertainty manipulations have similar effects in both our model and Mikhael et al. (2022), the underlying mechanism of RPE ramps is fundamentally different. Mikhael et al. (2022) propose that RPE ramps arise from a correction term in the value update which counteracts biases that arise from state uncertainty. Learning therefore converges not when the RPE is zero but rather when it is cancelled out by the correction term. Since the required correction is proportional to state value, this generates RPE ramps. This account rests on the critical assumption that state uncertainty is systematically larger when estimating the value of the new state V_{x_{t+1}} compared to the value of the old state V_{x_t} in the RPE computation – the rationale being that sensory feedback reduces uncertainty for the old state s_t relative to the new state s_{t+1}. It is this assumption that causes state uncertainty to systematically bias learning, and necessitates the correction term that generates RPE ramps. Importantly, recent experimental data measuring the effect of striatal stimulation on dopamine signals (Campbell et al., 2025) suggest that temporal difference value comparison is implemented by synaptic delays in striatum-VTA circuitry, such that V_{x_t} is simply a delayed copy of V_{x_{t+1}}, and hence inherits the same state uncertainty.

In contrast to Mikhael et al. (2022), our model reproduces dopamine dynamics under uncertainty manipulations without requiring systematic differences in state uncertainty between terms in the RPE computation.

4 Discussion

Dopamine ramps have attracted widespread interest because they appear to contradict the theory that dopamine implements a temporal difference reward prediction error (Berke, 2018; Hamid et al., 2016; Howe et al., 2013; Niv, 2013). Here, we propose that dopamine ramps are RPEs generated by a dual-process learning architecture in which values inferred from a world model train cached values via the update target of the RPE. We show that this architecture accelerates cached value learning, and reproduces the key empirical features of dopamine ramps within a unified explanatory framework.

The dual process model generates ramping RPEs because inferred values contribute to the update target but not to the prediction component of the RPE. The rationale is that the update target should represent the best estimate of future reward, and inferred values should therefore contribute to it if they are accurate. The prediction, by contrast, should be determined by cached values alone, because cached values are the quantities that the RPE must update. This asymmetric use of inferred values is normative in the sense that, given accurate inferred values, it accelerates convergence of cached values to the true value function. Strikingly, incorporating inferred values symmetrically in both the update target and prediction slows learning relative to not using inferred values at all (fig. 1B).

The model accounts for key experimental findings on dopamine ramp dynamics via the interplay of two sources of value information (figs. 2 to 4): (i) fast-evolving inferred values, which explain rapid global updates following individual rewards, and (ii) slow-evolving cached values, which explain why ramps diminish with experience as cached values incrementally converge. This clarifies why ramps persist after expert behaviour has developed, as policy can be guided by inferred values long before cached values converge.

Our theory implies that the brain has mechanisms for inferring value online during behaviour, independently of the striatum. This contrasts with offline replay mechanisms that refine cached value estimates guiding subsequent behaviour (Mattar and Daw, 2018; Sutton, 1991). Since model-based evaluation is generally computationally intensive (Sutton and Barto, 2014), this raises the question of how online inferred value estimation is tractable. We suggest that fast, online inferred value estimation is possible only in specific situations that permit efficient solution methods. Specifically, RL problems characterised by absorbing goal-states reduce goal-conditioned value functions to the immediate reward at the goal, discounted by the distance or cost to reach it. This enables value to be estimated using learned distances between locations (Piray and Daw, 2021; Sagiv et al., 2025). Although real-world behaviour continues after goals are reached, the brain might use the solution methods afforded by absorbing goal states as a heuristic, or as a component of a hierarchical control architecture (Ringstrom et al., 2025).

These considerations suggest that dopamine ramps will emerge when: (i) the brain has an internal model of distances between states; and (ii) behaviour is organised by discrete, known, and rewarding goal states, rather than random foraging. Goal-directed spatial navigation exemplifies these conditions, explaining the prominence of ramps in navigation tasks. Grid cells facilitate distance estimation in physical space and carry representations that generalise across environments (Bush et al., 2015; Hafting et al., 2005; Whittington et al., 2020), consistent with the rapid onset of dopamine ramps in spatial tasks (Guru et al., 2020). Hippocampal-entorhinal circuits for spatial cognition also represent position in sensory or abstract state spaces (Aronov et al., 2017; Constantinescu et al., 2016), which may explain ramps in tasks where sensory cues indicate reward proximity (Kim et al., 2020). By contrast, classical conditioning tasks lack distinct sensory states indicating reward proximity. Dopamine responses in these settings resemble a backpropagating TD error rather than a ramp, consistent with inferred values playing no role in the update target (Cohen et al., 2012; Schultz et al., 1997).

Neural implementation of the dual process architecture requires that dopamine neurons receive inferred value information through a non-striatal pathway. Although the neural basis of model-based evaluation remains poorly understood, it has been linked to orbitofrontal (OFC) and medial frontal cortex (mFC) in both humans and non-human animals (Akam et al., 2021; Daw et al., 2011; Huang et al., 2020; Jones et al., 2012; Killcross and Coutureau, 2003; Niedringhaus and West, 2022; Stalnaker et al., 2014). Notably, mFC and OFC have monosynaptic projections to VTA dopamine neurons (Babiczky and Matyas, 2022; Beier et al., 2015; Gao et al., 2022; Wang et al., 2020), and stimulating the frontal cortex to VTA (FC-VTA) pathway induces conditioned place preference via the nucleus accumbens (Beier et al., 2015). VTA-projecting FC neurons are therefore a plausible source of the inferred value information that generates dopamine ramps (Guru et al., 2020). Our model further implies that VTA-projecting and striatum-projecting subpopulations of FC neurons should encode different signals: inferred values in the former, and state features in the latter. Consistent with this, these subpopulations are largely anatomically separate, although their coding properties remain uncharacterised (Babiczky and Matyas, 2022; Gao et al., 2022).

Our model makes several testable experimental predictions. (1) VTA-projecting frontal cortex neurons will encode inferred value signals (i.e., expected discounted future reward) in settings where dopamine ramps occur. (2) Silencing the FC-to-VTA pathway, or the components of the world model necessary for inferred value estimation, will abolish dopamine ramps. (3) Abolishing dopamine ramps will slow the development of striatal state-value signals during goal-directed navigation (Van Der Meer, 2009), and alter the development of state value representations during learning. (4) Transiently stimulating FC-to-VTA and striatum-to-VTA pathways will evoke distinct patterns of ventral striatal dopamine release. Stimulating NAc D1 neurons initially excites then subsequently inhibits VTA neurons (Campbell et al., 2025), consistent with the dual role of cached values in the RPE update target and prediction. Stimulating the FC-VTA pathway should, by contrast, evoke VTA excitation alone, since inferred values contribute only to the update target.

Finally, together with Mattar and Daw (2018), our work suggests that the traditional dichotomy between model-based and model-free evaluation should be revised. Each account emphasises that model-based mechanisms have a profound influence on the striatal cached value system – online through dopamine ramps, and offline through replay. This implies that the cached value system should not be viewed as model-free, but rather as a long-term memory system for value that is shaped by both temporal-difference learning and model-based evaluation.

6 Methods

6.1 General simulation details

All simulations were implemented in Python v3.14. For simplicity, the policy for all agents in all simulations was deterministic and involved moving directly to the rewarding goal location. Dual-process agents were simulated according to eqs. 5–7. Task-specific environments and parameter choices are described below.

Code for replicating the simulations and generating the manuscript figures is available at: https://github.com/lpriestley/da-ramps

6.2 Comparison of value-learning algorithms

To characterise whether the dual-process model accelerated value-learning (fig. 1B), we implemented (i) a dual-process agent, (ii) a standard TD agent, and (iii) an alternative dual-process agent on a linear track environment. The standard TD agent was implemented according to eqs. 3–4. The alternative dual-process agent was implemented according to:

The key difference compared to the dual-process agent defined in eqs. 6–7, therefore, is that inferred values appear in both the RPE update target and the prediction, instead of the update target alone. This alternative dual-process agent learned more slowly than the standard TD agent (fig. 1B) and did not generate ramping RPEs. Parameters for the agents were: α_TD = 0.01, α_MB = 0.50, γ = 0.93, k = 0.50.

The linear track was formalised as a tabular environment with N = 10 states. There was a goal state at one end of the track with a scalar reward r = 1.0. Agents started each trial at the end of the track opposite the goal state. All agents performed the task for T = 5000 trials. Value error was calculated on each trial as the average discrepancy between cached values V_TD and the true value function V.
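The three-agent comparison can be illustrated with a short sketch. This is not the authors' implementation: the update equations and the value-error formula are not reproduced in this section, so the code makes explicit assumptions. It assumes that net values are a k-weighted mixture, V_NET = k·V_MB + (1 − k)·V_TD, that inferred values take the form V_MB(s) = r̂·γ^d, where d is the distance to the goal and r̂ is a reward estimate learned at rate α_MB, and that value error is the mean absolute difference between V_TD and V. All function and variable names are hypothetical.

```python
# Sketch only: the update rules below are assumptions, not the paper's equations.

def simulate(agent, n_states=10, n_trials=5000, alpha_td=0.01, alpha_mb=0.50,
             gamma=0.93, k=0.50, reward=1.0):
    """Return the per-trial value error for agent in {"td", "dual", "alt"}."""
    goal = n_states - 1
    v_true = [reward * gamma ** (goal - s) for s in range(n_states)]
    v_td = [0.0] * n_states   # cached values
    r_hat = 0.0               # model-based reward estimate (assumed form)
    errors = []

    def v_mb(s):
        # Inferred value: reward estimate discounted by distance to goal.
        return r_hat * gamma ** (goal - s)

    def v_net(s):
        # Assumed mixture of inferred and cached values.
        return k * v_mb(s) + (1.0 - k) * v_td[s]

    for _ in range(n_trials):
        for s in range(n_states):          # deterministic walk to the goal
            terminal = s == goal
            r = reward if terminal else 0.0
            if agent == "td":
                nxt = 0.0 if terminal else v_td[s + 1]
                prediction = v_td[s]
            else:
                nxt = 0.0 if terminal else v_net(s + 1)
                # "dual": inferred values enter the target only.
                # "alt": inferred values enter target and prediction.
                prediction = v_td[s] if agent == "dual" else v_net(s)
            delta = r + gamma * nxt - prediction
            v_td[s] += alpha_td * delta
        r_hat += alpha_mb * (reward - r_hat)
        errors.append(sum(abs(a - b) for a, b in zip(v_td, v_true)) / n_states)
    return errors
```

Calling `simulate("dual")`, `simulate("td")` and `simulate("alt")` with the parameters above gives three error curves that can be plotted against trial number. Under the stated assumptions, the "alt" agent's RPE at non-goal states is the standard TD error scaled by (1 − k), which is one way to see why it would learn more slowly than standard TD.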

6.3 Characterising dual-process learning dynamics

To demonstrate why the dual-process model produces ramping RPEs, a dual-process agent was simulated on a linear track, formalised as a 1D tabular environment with N = 20 states, and a goal state at one end of the track with a scalar reward r = 1.0. Agents started each trial at the end of the track opposite the goal state. The parameters for the agent were: α_TD = 0.01, α_MB = 0.50, γ = 0.85, k = 0.50. The value-functions V_MB, V_TD and V_NET were extracted from the agent on trial t = 100 of learning and displayed in fig. 1D. RPEs δ at early, intermediate and late stages of learning were further extracted and graphed in fig. 1E-ii. This was compared to RPEs from a standard TD agent simulated with parameters α_TD = 0.01, γ = 0.85.
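A minimal sketch of how RPE profiles could be extracted at different stages of learning is given here. It is an illustration under assumptions rather than the code used for fig. 1: the RPE is taken to be δ = r + γ·V_NET(s') − V_TD(s) with V_NET = k·V_MB + (1 − k)·V_TD, and inferred values are taken to be V_MB(s) = r̂·γ^d with r̂ learned at rate α_MB. The function name `rpes_by_trial` is hypothetical.

```python
# Sketch only: assumed update rule, not the paper's equations.

def rpes_by_trial(n_trials, dual=True, n_states=20, alpha_td=0.01,
                  alpha_mb=0.50, gamma=0.85, k=0.50, reward=1.0):
    """Return one RPE profile (a list over states) for each trial."""
    goal = n_states - 1
    v_td = [0.0] * n_states   # cached values
    r_hat = 0.0               # reward estimate used for inferred values
    history = []
    for _ in range(n_trials):
        trial_rpes = []
        for s in range(n_states):
            if s == goal:
                target = reward                      # terminal state
            else:
                nxt = v_td[s + 1]
                if dual:
                    # Inferred value of the next state enters the target only.
                    v_mb_next = r_hat * gamma ** (goal - (s + 1))
                    nxt = k * v_mb_next + (1.0 - k) * nxt
                target = gamma * nxt
            delta = target - v_td[s]                 # prediction is cached only
            v_td[s] += alpha_td * delta
            trial_rpes.append(delta)
        r_hat += alpha_mb * (reward - r_hat)
        history.append(trial_rpes)
    return history
```

In this sketch the dual-process RPE on the second trial already increases monotonically toward the goal, because the target contains discounted inferred values while the cached prediction is still near zero. With `dual=False`, the RPE on the same trial is zero everywhere except in the last two states of the track.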

6.4 Guru et al. (2020) simulation

We compared dual-process RPEs to dopamine signals in Guru et al. (2020). The Guru et al. (2020) study involved recording dopamine signals whilst mice ran between alternate ends of a linear track. One end of the track had a large reward (2 µL), and the other end of the track had a small reward (1 µL).

To simulate the dual-process model on this task, we treated navigation towards each end of the track as a separate state space, consistent with hippocampal units having strong movement direction tuning on linear tracks. Each agent was simulated with the following parameters: α_TD = 0.005, α_MB = 0.50, γ = 0.93, k = 0.50. Each linear track was formalised as a 1D tabular environment with N = 35 states and a goal state at one end. Agents started each trial at the end of the track opposite the goal state. In the large-reward track, the initial reward value was r = 2.0, and in the small-reward track, the initial reward value was r = 1.0.

Agents performed S = 18 sessions of learning, where each session involved T = 100 trials. On session s = 17, the reward values at the end of high-reward and low-reward tracks were swapped. The training regime was designed to replicate the Guru et al. (2020) experiment.

In fig. 2C, V_MB, V_TD and V_NET were extracted from trial t = 100 on sessions s ∈{1, 4}. In fig. 2D, the evolution of RPEs over learning was visualised by computing, for each state, the mean RPE over trials within a session. In fig. 2F, RPEs and V_MB were extracted for trials t ∈{1, 2, 3}.

6.5 Krausz et al. (2023) simulation

We compared dual-process RPEs to dopamine signals in Krausz et al. (2023). In Krausz et al. (2023), rats performed a maze navigation task, in which a series of goal locations delivered probabilistic rewards. The task took place in a complex maze environment with multiple pathways to each goal location.

To simulate the task, a dual-process agent was implemented on a 2D 10 × 10 gridworld. There was a goal location g = (1, 1) which, when visited, delivered reward stochastically with r = 1.0 and p(reward) = 0.5. The agent was alternately started on odd and even trials from s_odd = (10, 1) and s_even = (1, 10) and followed a trajectory directly to the reward location. This allowed us to test how specific rewards and omissions at the goal location influenced RPEs on the subsequent trial, even when the trajectory to the goal location was different. The agent was simulated with the following parameters: α_TD = 0.01, α_MB = 0.10, γ = 0.85, k = 0.50. It performed T = 500 trials. In fig. 3C, the effect of rewards on inferred values was visualised by extracting V_MB for consecutive trials t − 1 and t, where t − 1 was rewarded. In fig. 3D, the effect of rewards on RPEs was visualised by comparing RPEs on consecutive trials t − 1 and t, where t − 1 was rewarded.
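The trial structure of this task can be sketched as follows. The code is illustrative and rests on assumptions not stated in this section: inferred values are taken to be a reward estimate r̂ discounted by Manhattan distance to the goal, and r̂ is updated from each outcome at rate α_MB. The names `trajectory`, `inferred_value`, `update_reward_estimate` and `run` are hypothetical.

```python
import random

GOAL = (1, 1)

def trajectory(start, goal=GOAL):
    """Direct path to the goal: move along x first, then along y."""
    x, y = start
    path = [(x, y)]
    while (x, y) != goal:
        if x != goal[0]:
            x += 1 if goal[0] > x else -1
        else:
            y += 1 if goal[1] > y else -1
        path.append((x, y))
    return path

def inferred_value(state, r_hat, gamma=0.85, goal=GOAL):
    """Assumed form: reward estimate discounted by Manhattan distance."""
    distance = abs(state[0] - goal[0]) + abs(state[1] - goal[1])
    return r_hat * gamma ** distance

def update_reward_estimate(r_hat, outcome, alpha_mb=0.10):
    """Assumed delta-rule update of the goal's reward estimate."""
    return r_hat + alpha_mb * (outcome - r_hat)

def run(n_trials=500, p_reward=0.5, seed=0):
    """Alternate start locations and log inferred values along each path."""
    rng = random.Random(seed)
    r_hat, log = 0.0, []
    for t in range(1, n_trials + 1):
        start = (10, 1) if t % 2 == 1 else (1, 10)   # odd vs even trials
        path = trajectory(start)
        v_mb = [inferred_value(s, r_hat) for s in path]
        outcome = 1.0 if rng.random() < p_reward else 0.0
        r_hat = update_reward_estimate(r_hat, outcome)
        log.append((start, outcome, v_mb))
    return log
```

Because the two start locations share only the goal state, any change in inferred values on trial t after a reward on trial t − 1 reflects the updated reward estimate rather than experience of the states on the current path.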

6.6 Kim et al. (2020) simulation

We compared dual-process RPEs with dopamine signals in Kim et al. (2020). In the Kim et al. (2020) experiment, subjects viewed a VR track environment with a terminal reward at the end of the track. The environment was manipulated using teleports between non-adjacent states, and speed modulations that controlled how quickly subjects moved through the environment.

We simulated the dual-process agent on these experiments using linear tracks, which were formalised as 1D tabular environments with a goal state that delivered a reward r = 1.0 at one end of the track. The agent always started at the end opposite the goal. In the teleport experiments (fig. 4A and fig. 4B), the track had N = 32 states. In the speed-manipulation experiment (fig. 4C), the track had N = 40 states.

In the teleport-distance experiment, all teleports ended at state s_{teleport−destination} = 24, where short teleports had distance d_short = 2 and long teleports had distance d_long = 10. In the pause condition, the agent remained in s_{teleport−destination} for an arbitrary number of timepoints. We abolished the effect of inferred values on RPEs during the pause period under the assumption that inferred values predict temporally discounted future reward. We assume that such predictions are null in situations when the agent is static in a non-rewarding state. In the teleport end-state experiment, teleports had a constant distance d = 10 and were initiated from either early, intermediate, or late start locations where s_early = 2, s_intermediate = 10, s_late = 14.

In the speed-manipulation experiment, slow, normal and fast speeds were implemented by modulating the step-size with which the agent moved through the environment, where stepsize_small = 1, stepsize_normal = 2, stepsize_fast = 4. Training in the speed-manipulation experiment was performed with stepsize_normal.

In teleport experiments, the agent was trained on T = 200 trials before experiencing the teleport manipulation. In the speed experiment, the agent was trained on T = 500 trials before experiencing the speed manipulation. Cached values were clamped during test trials to prevent learning. Agents were simulated with the following parameters: α_TD = 0.01, α_MB = 0.50, γ = 0.93, k = 0.50. RPEs were extracted on test trials and visualised in fig. 4.

6.7 Mikhael et al. (2022) simulation

Finally, we compared dual-process RPEs with dopamine signals in Mikhael et al. (2022). In the Mikhael et al. (2022) experiment, subjects viewed a VR track akin to Kim et al. (2020) except that the sensory features were progressively darkened on a subset of trials. The experiment further incorporated speed-manipulations.

Environments were formalised with feature-based function approximation. Each state was initially encoded as a one-hot feature vector. To generate state uncertainty, feature vectors were passed through a Gaussian filter parameterised by 𝒩(s_t, σ_t). The mean of the Gaussian s_t was always the true state at time t, while the standard deviation σ_t was time dependent, and set differently in each experimental condition (see below). Lost probability mass (i.e. mass that was pushed beyond the boundaries of the environment due to Gaussian filtering) was reassigned to the nearest boundary state to ensure simplex feature distributions.
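One way to realise this filtering step is sketched below. The discretisation is an assumption: each state is treated as a unit-width bin centred on its index, bin probabilities are taken from the Gaussian CDF, and the mass falling outside the track is added to the nearest boundary state. The function names are hypothetical, and the authors' implementation may differ.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def blurred_features(true_state, sigma, n_states):
    """Return a probability vector over states centred on true_state."""
    # Bin edges at i - 0.5, so that state i covers [i - 0.5, i + 0.5].
    edges = [i - 0.5 for i in range(n_states + 1)]
    cdf = [normal_cdf(e, true_state, sigma) for e in edges]
    probs = [cdf[i + 1] - cdf[i] for i in range(n_states)]
    # Reassign mass pushed beyond the track to the nearest boundary state.
    probs[0] += cdf[0]
    probs[-1] += 1.0 - cdf[-1]
    return probs
```

For example, `blurred_features(43, 4.0, 87)` gives a narrow distribution centred on the true state, whereas `blurred_features(86, 24.0, 87)` places roughly half of its mass on the final state because the upper tail is folded back onto the boundary.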

In fig. 5C, we simulated a dual-process agent on the VR track experiment in Mikhael et al. (2022). The agent was simulated according to eqs. 5–7, but with value estimates constructed according to eq. (11) and eq. (12) to account for state uncertainty. The agent was tested on a linear track with N = 87 states with a goal state that delivered a reward r = 1.0 at one end of the track. In the bright condition, the standard deviation in the Gaussian filter was constant at σ_t = 4. In the darkening condition, the standard deviation was drawn from a rescaled exponential function with a minimum value σ_min = 4 and a maximum value σ_max = 24, which reproduced the assumptions about state uncertainty during sensory darkening in Mikhael et al. (2022). In the standard-speed condition, the agent moved with the stepsize stepsize_std = 1, whereas in the fast-speed condition, it moved with the stepsize stepsize_fast = 2. The agent was simulated with the following parameters: α_TD = 0.01, α_MB = 0.50, γ = 0.93, k = 0.50. It was first trained on the task in the bright, standard-speed condition for T = 150 trials. It then performed one test trial in each combination of brightness (bright-vs-dark) and speed (standard-vs-fast) conditions. The RPE in each state and each test condition was extracted and visualised in fig. 5C.
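A rescaled exponential schedule for σ_t could look like the sketch below. The exact functional form is not given in this section, so the code assumes that uncertainty grows over the trial from σ_min to σ_max and introduces a hypothetical `rate` parameter controlling the curvature. Neither assumption is taken from the source.

```python
import math

def sigma_schedule(n_steps, sigma_min=4.0, sigma_max=24.0, rate=3.0):
    """Assumed form: exponential growth rescaled to [sigma_min, sigma_max]."""
    if n_steps < 2:
        return [sigma_min] * n_steps
    span = math.expm1(rate)                  # exp(rate) - 1
    schedule = []
    for t in range(n_steps):
        frac = math.expm1(rate * t / (n_steps - 1)) / span   # 0 at start, 1 at end
        schedule.append(sigma_min + (sigma_max - sigma_min) * frac)
    return schedule
```

The bright condition corresponds to a constant schedule, `[4.0] * n_steps`. In this sketch a fast-speed trial would simply use a shorter schedule, since the agent takes fewer steps to traverse the track.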

Data availability

Code to reproduce the results in the manuscript is available at: https://github.com/lpriestley/dopamine_ramps

Acknowledgements

We are grateful to Kris Jensen, Eleanor Spens and Marta Blanco-Pozo for helpful feedback on the manuscript. The work was supported by Wellcome Trust Career Development Award 225926/Z/22/Z. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Additional information

Funding

Wellcome Trust (WT)

https://doi.org/10.35802/225926

Thomas Akam

References

Akam T., Walton M. E. (2021). What is dopamine doing in model-based reinforcement learning? Current Opinion in Behavioral Sciences 38:74–82. https://doi.org/10.1016/j.cobeha.2020.10.010
Akam T., Rodrigues-Vaz I., Marcelo I., Zhang X., Pereira M., Oliveira R. F., Dayan P., Costa R. M. (2021). The Anterior Cingulate Cortex Predicts Future States to Mediate Model-Based Action Selection. Neuron 109:149–163. https://doi.org/10.1016/j.neuron.2020.10.013
Aronov D., Nevers R., Tank D. W. (2017). Mapping of a non-spatial dimension by the hippocampal–entorhinal circuit. Nature 543:719–722. https://doi.org/10.1038/nature21692
Babiczky Á., Matyas F. (2022). Molecular characteristics and laminar distribution of prefrontal neurons projecting to the mesolimbic system. eLife 11:e78813. https://doi.org/10.7554/eLife.78813
Balleine B. W., Dickinson A. (1998). Goal-directed instrumental action: Contingency and incentive learning and their cortical substrates. Neuropharmacology 37:407–419. https://doi.org/10.1016/S0028-3908(98)00033-1
Beier K. T., Steinberg E. E., DeLoach K. E., Xie S., Miyamichi K., Schwarz L., Gao X. J., Kremer E. J., Malenka R. C., Luo L. (2015). Circuit Architecture of VTA Dopamine Neurons Revealed by Systematic Input-Output Mapping. Cell 162:622–634. https://doi.org/10.1016/j.cell.2015.07.015
Berke J. D. (2018). What does dopamine mean? Nature Neuroscience 21:787–793. https://doi.org/10.1038/s41593-018-0152-y
Blanco-Pozo M., Akam T., Walton M. E. (2024). Dopamine-independent effect of rewards on choices through hidden-state inference. Nature Neuroscience 27:286–297. https://doi.org/10.1038/s41593-023-01542-x
Bush D., Barry C., Manson D., Burgess N. (2015). Using Grid Cells for Navigation. Neuron 87:507–520. https://doi.org/10.1016/j.neuron.2015.07.006
Campbell M. G., Ra Y., Chen Z., Xu S., Burrell M., Matias S., Watabe-Uchida M., Uchida N. (2025). A hardwired neural circuit for temporal difference learning. bioRxiv 2025.09.18.677203. https://doi.org/10.1101/2025.09.18.677203
Chang L., Tsao D. Y. (2017). The Code for Facial Identity in the Primate Brain. Cell 169:1013–1028. https://doi.org/10.1016/j.cell.2017.05.011
Cohen J. Y., Haesler S., Vong L., Lowell B. B., Uchida N. (2012). Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482:85–88. https://doi.org/10.1038/nature10754
Constantinescu O., O’Reilly J. X., Behrens T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science 352:1464–1468. https://doi.org/10.1126/science.aaf0941
Daw N. D., Niv Y., Dayan P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience 8:1704–1711. https://doi.org/10.1038/nn1560
Daw N. D., Gershman S. J., Seymour B., Dayan P., Dolan R. J. (2011). Model-based influences on humans’ choices and striatal prediction errors. Neuron 69:1204–1215. https://doi.org/10.1016/j.neuron.2011.02.027
Dayan P. (1993). Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation 5:613–624. https://doi.org/10.1162/neco.1993.5.4.613
Dolan R. J., Dayan P. (2013). Goals and Habits in the Brain. Neuron 80:312–325. https://doi.org/10.1016/j.neuron.2013.09.007
Donnarumma F., Parr T., Friston K., Whittington J., Pezzulo G. (2025). Inferential planning in the frontal cortex. bioRxiv 2025.11.26.690672. https://doi.org/10.1101/2025.11.26.690672
Eshel N., Bukwich M., Rao V., Hemmelder V., Tian J., Uchida N. (2015). Arithmetic and local circuitry underlying dopamine prediction errors. Nature 525:243–246. https://doi.org/10.1038/nature14855
Farrell K., Lak A., Saleem A. B. (2022). Midbrain dopamine neurons signal phasic and ramping reward prediction error during goal-directed navigation. Cell Reports 41:111470. https://doi.org/10.1016/j.celrep.2022.111470
Gao L., Liu S., Gou L., Hu Y., Liu Y., Deng L., Ma D., Wang H., Yang Q., Chen Z., Liu D., Qiu S., Wang X., Wang D., Wang X., Ren B., Liu Q., Chen T., Shi X., Yao H., Xu C., Li T., Sun Y., Li A., Luo Q., Gong H., Xu N., Yan J. (2022). Single-neuron projectome of mouse prefrontal cortex. Nature Neuroscience 25:515–529. https://doi.org/10.1038/s41593-022-01041-5
Guru C. Seo, Post R. J., Kullakanda D. S., Schaffer J. A., Warden M. R. (2020). Ramping activity in midbrain dopamine neurons signifies the use of a cognitive map. bioRxiv. https://doi.org/10.1101/2020.05.21.108886
Hafting T., Fyhn M., Molden S., Moser M.-B., Moser E. I. (2005). Microstructure of a spatial map in the entorhinal cortex. Nature 436:801–806. https://doi.org/10.1038/nature03721
Hamid A., Pettibone J. R., Mabrouk O. S., Hetrick V. L., Schmidt R., Vander Weele C. M., Kennedy R. T., Aragona B. J., Berke J. D. (2016). Mesolimbic dopamine signals the value of work. Nature Neuroscience 19:117–126. https://doi.org/10.1038/nn.4173
Howe M. W., Tierney P. L., Sandberg S. G., Phillips P. E. M., Graybiel A. M. (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature 500:575–579. https://doi.org/10.1038/nature12475
Huang Y., Yaple Z. A., Yu R. (2020). Goal-oriented and habitual decisions: Neural signatures of model-based and model-free learning. NeuroImage 215. https://doi.org/10.1016/j.neuroimage.2020.116834
Jensen K. T., Doohan P., Sablé-Meyer M., Reinert S., Baram A., Akam T., Behrens T. E. J. (2026). A mechanistic theory of planning in prefrontal cortex. eLife 15:RP109757. https://doi.org/10.7554/eLife.109757.1
Jones J. L., Esber G. R., McDannald M. A., Gruber A. J., Hernandez A., Mirenzi A., Schoenbaum G. (2012). Orbitofrontal Cortex Supports Behavior and Learning Using Inferred But Not Cached Values. Science 338:953–956. https://doi.org/10.1126/science.1227489
Kato A., Morita K. (2016). Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation. PLOS Computational Biology 12:e1005145. https://doi.org/10.1371/journal.pcbi.1005145
Killcross S., Coutureau E. (2003). Coordination of Actions and Habits in the Medial Prefrontal Cortex of Rats. Cerebral Cortex 13:400–408. https://doi.org/10.1093/cercor/13.4.400
Kim H. R., Malik A. N., Mikhael J. G., Bech P., Tsutsui-Kimura I., Sun F., Zhang Y., Li Y., Watabe-Uchida M., Gershman S. J., Uchida N. (2020). A Unified Framework for Dopamine Signals across Timescales. Cell 183:1600–1616. https://doi.org/10.1016/j.cell.2020.11.013
Krausz T. A., Comrie A. E., Kahn A. E., Frank L. M., Daw N. D., Berke J. D. (2023). Dual credit assignment processes underlie dopamine signals in a complex spatial environment. Neuron 111:3465–3478. https://doi.org/10.1016/j.neuron.2023.07.017
Lee S. W., Shimojo S., O’Doherty J. P. (2014). Neural Computations Underlying Arbitration between Model-Based and Model-free Learning. Neuron 81:687–699. https://doi.org/10.1016/j.neuron.2013.11.028
Liu L., She L., Chen M., Liu T., Lu H. D., Dan Y., Poo M.-m. (2016). Spatial structure of neuronal receptive field in awake monkey secondary visual cortex (V2). Proceedings of the National Academy of Sciences 113:1913–1918. https://doi.org/10.1073/pnas.1525505113
Lloyd K., Dayan P. (2015). Tamping Ramping: Algorithmic, Implementational, and Computational Explanations of Phasic Dopamine Signals in the Accumbens. PLOS Computational Biology 11:e1004622. https://doi.org/10.1371/journal.pcbi.1004622
Mattar M. G., Daw N. D. (2018). Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience 21:1609–1617. https://doi.org/10.1038/s41593-018-0232-z
Mattar M. G., Lengyel M. (2022). Planning in the brain. Neuron 110:914–934. https://doi.org/10.1016/j.neuron.2021.12.018
Mikhael J. G., Kim H. R., Uchida N., Gershman S. J. (2022). The role of state uncertainty in the dynamics of dopamine. Current Biology 32:1077–1087. https://doi.org/10.1016/j.cub.2022.01.025
Mohebi A., Pettibone J. R., Hamid A. A., Wong J.-M. T., Vinson L. T., Patriarchi T., Tian L., Kennedy R. T., Berke J. D. (2019). Dissociable dopamine dynamics for learning and motivation. Nature 570:65–70. https://doi.org/10.1038/s41586-019-1235-y
Montague P., Dayan P., Sejnowski T. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience 16:1936–1947. https://doi.org/10.1523/JNEUROSCI.16-05-01936.1996
Niedringhaus M., West E. A. (2022). Prelimbic cortex neural encoding dynamically tracks expected outcome value. Physiology & Behavior 256. https://doi.org/10.1016/j.physbeh.2022.113938
Niv Y. (2013). Dopamine ramps up. Nature 500:533–535. https://doi.org/10.1038/500533a
O’Doherty J. P., Dayan P., Friston K., Critchley H., Dolan R. J. (2003). Temporal Difference Models and Reward-Related Learning in the Human Brain. Neuron 38:329–337. https://doi.org/10.1016/S0896-6273(03)00169-7
Pawlak V., Kerr J. N. D. (2008). Dopamine Receptor Activation Is Required for Corticostriatal Spike-Timing-Dependent Plasticity. The Journal of Neuroscience 28:2435–2446. https://doi.org/10.1523/JNEUROSCI.4402-07.2008
Pessiglione M., Seymour B., Flandin G., Dolan R. J., Frith C. D. (2006). Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442:1042–1045. https://doi.org/10.1038/nature05051
Piray P., Daw N. D. (2021). Linear reinforcement learning in planning, grid fields, and cognitive control. Nature Communications 12:4942. https://doi.org/10.1038/s41467-021-25123-3
Ringstrom T. J., Hasanbeig M., Abate A. (2025). Goal Kernel Planning: Linearly-Solvable Non-Markovian Policies for Logical Tasks with Goal-Conditioned Options. arXiv. https://doi.org/10.48550/arXiv.2007.02527
Sagiv Y., Akam T., Witten I. B., Daw N. D. (2025). Prioritizing replay when future goals are unknown. Neuron 113:4278–4292. https://doi.org/10.1016/j.neuron.2025.09.021
Samejima K., Ueda Y., Doya K., Kimura M. (2005). Representation of Action-Specific Reward Values in the Striatum. Science 310:1337–1340. https://doi.org/10.1126/science.1115270
Schultz W. (2006). Behavioral Theories and the Neurophysiology of Reward. Annual Review of Psychology 57:87–115. https://doi.org/10.1146/annurev.psych.56.091103.070229
Schultz W., Dayan P., Montague P. R. (1997). A Neural Substrate of Prediction and Reward. Science 275:1593–1599. https://doi.org/10.1126/science.275.5306.1593
Stalnaker T. A., Cooch N. K., McDannald M. A., Liu T.-L., Wied H., Schoenbaum G. (2014). Orbitofrontal neurons infer the value and identity of predicted outcomes. Nature Communications 5:3926. https://doi.org/10.1038/ncomms4926
Steinberg E. E., Keiflin R., Boivin J. R., Witten I. B., Deisseroth K., Janak P. H. (2013). A causal link between prediction errors, dopamine neurons and learning. Nature Neuroscience 16:966–973. https://doi.org/10.1038/nn.3413
Sutton R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2:160–163. https://doi.org/10.1145/122344.122377
Sutton R. S., Barto A. (2014). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. Cambridge, Massachusetts: The MIT Press.
Van Der Meer M. A. A. (2009). Covert expectation-of-reward in rat ventral striatum at decision points. Frontiers in Integrative Neuroscience 3. https://doi.org/10.3389/neuro.07.001.2009
Wang Q., Ding S.-L., Li Y., Royall J., Feng D., Lesnar P., Graddis N., Naeemi M., Facer B., Ho A., Dolbeare T., Blanchard B., Dee N., Wakeman W., Hirokawa K. E., Szafer A., Sunkin S. M., Oh S. W., Bernard A., Phillips J. W., Hawrylycz M., Koch C., Zeng H., Harris J. A., Ng L. (2020). The Allen Mouse Brain Common Coordinate Framework: A 3D Reference Atlas. Cell 181:936–953. https://doi.org/10.1016/j.cell.2020.04.007
Whittington J. C., Muller T. H., Mark S., Chen G., Barry C., Burgess N., Behrens T. E. (2020). The Tolman-Eichenbaum Machine: Unifying Space and Relational Memory through Generalization in the Hippocampal Formation. Cell 183:1249–1263. https://doi.org/10.1016/j.cell.2020.10.024
Witten B., Steinberg E. E., Lee S. Y., Davidson T. J., Zalocusky K. A., Brodsky M., Yizhar O., Cho S. L., Gong S., Ramakrishnan C., Stuber G. D., Tye K. M., Janak P. H., Deisseroth K. (2011). Recombinase-Driver Rat Lines: Tools, Techniques, and Optogenetic Application to Dopamine-Mediated Reinforcement. Neuron 72:721–733. https://doi.org/10.1016/j.neuron.2011.10.028
Yamins L. K., DiCarlo J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience 19:356–365. https://doi.org/10.1038/nn.4244

Article and author information

Author information

Luke Priestley
Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom
ORCID iD: 0009-0003-9125-7265
- For correspondence: luke.priestley@psy.ox.ac.uk
Thomas Akam
Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom, Sainsbury Wellcome Centre, University College London, London, United Kingdom
ORCID iD: 0000-0002-1810-0494
- For correspondence: thomas.akam@psy.ox.ac.uk

Author Notes

Competing interests: No competing interests declared

Version history

Preprint posted: February 19, 2026
Sent for peer review: April 21, 2026
Reviewed Preprint version 1: June 22, 2026

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.111458. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
