Figures and data

Background.
(A) Recent work has identified explicit sequence representations in prefrontal cortex during working memory (Xie et al., 2022; El-Gaby et al., 2023). When an animal has to execute a behavioural sequence (left), individual neurons represent conjunctions of location and sequence element (centre). Separate populations (planes) therefore represent the expected location at different times in the future. The entire sequence is represented concurrently by the simultaneous firing of different neurons that encode the expected location at each time in the future (right). (B) An example V1 cell fires when its receptive field is aligned with an inferred line (blue), but not for a control stimulus with no inference (orange). Such visual inference is mediated by structural priors embedded in the circuit connectivity, where neurons representing consistent visual features excite each other (Iacaruso et al., 2017; Shin et al., 2023). Figure adapted from Lee and Nguyen (2001). (C) Structural knowledge is embedded in the synapses of the head direction system (centre; Turner-Evans et al., 2020), which constrains the network to represent single angles (Kim et al., 2017). Visual and proprioceptive inputs (left) determine which angle should be represented (right). We suggest that a mechanism like (B) and (C) infers the representation in (A). (D) Prefrontal cortex is particularly important in dynamic environments. Stimulus-response associations and repeated choices are robust to prefrontal lesions (i; ii), and acortical mice can solve spatial memory tasks (iii; Zheng et al., 2024). However, PFC is needed for reversal learning (iv; Walton et al., 2010) and when different goals are important, or ‘rewarding’, at different times (v; Shallice and Burgess, 1991). In multiplayer board games (vi), different resources are valuable at different stages, and opponents can dynamically change their strategy.

The spacetime attractor.
(A) Left: in a one-dimensional ring attractor, neurons representing similar directions excite each other (top, red), and neurons representing different directions inhibit each other (blue). Connections are shown for the green example cell. In stable network states (fixed points), neurons encoding a particular angle are most active (bottom). Right: grid cells (bottom) emerge from a two-dimensional attractor network (top), where cells with similar preferred locations excite each other (red) and cells at intermediate distances inhibit each other (blue). (B) The spacetime attractor is a three-dimensional generalisation, where neurons have both a preferred location (planes) and a preferred delay (horizontal axis). This resembles the PFC representation in Figure 1A. Cells that represent adjacent locations in both space and time excite each other (top, red), and cells representing different locations at the same time inhibit each other (blue). Given inputs indicating the current location (mouse) and future reward (cheese), the fixed points represent reward-maximising paths through space and time (bottom). While this example has a single static goal, reward inputs can also differ between subspaces if the reward is dynamic. (C) Dynamics of the spacetime attractor. Initial activity is diffuse (left), followed by convergence to a stable representation of a plan (centre). At convergence, a policy is read out from the subspace that represents the next location along the desired trajectory. When the agent moves to the next state, neural activity updates to represent the remaining plan-to-go (right; El-Gaby et al., 2023).
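To make the connectivity in (B) and the dynamics in (C) concrete, the sketch below builds a toy spacetime attractor on a small grid world and relaxes it towards a fixed point. The weight values, the divisive normalisation added for numerical stability, and all variable names are illustrative assumptions rather than the parameters used for the figures.

```python
import numpy as np

# Toy spacetime attractor: neurons indexed by (location, delay) on an n x n grid.
# All weight values are illustrative; per-plane normalisation is added for stability.
n, T = 5, 5                    # grid side length and number of delay planes
n_loc = n * n

def neighbours(loc):
    """Locations reachable in one action (4-neighbourhood plus staying put)."""
    r, c = divmod(loc, n)
    steps = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    return [(r + dr) * n + (c + dc) for dr, dc in steps
            if 0 <= r + dr < n and 0 <= c + dc < n]

W = np.zeros((n_loc * T, n_loc * T))
for d in range(T):
    for loc in range(n_loc):
        i = d * n_loc + loc
        W[i, d * n_loc:(d + 1) * n_loc] -= 0.2        # within-plane inhibition
        if d + 1 < T:
            for nb in neighbours(loc):                # excitation between adjacent
                j = (d + 1) * n_loc + nb              # locations in adjacent planes
                W[i, j] += 1.0
                W[j, i] += 1.0

# Inputs: the current location (mouse) drives the delta = 0 plane,
# and reward (cheese) drives the goal location in the final plane.
current_loc, goal_loc = 0, 12
inp = np.zeros(n_loc * T)
inp[current_loc] = 2.0
inp[(T - 1) * n_loc + goal_loc] = 2.0

rates = np.zeros(n_loc * T)
for _ in range(200):
    drive = np.maximum(0.0, W @ rates + inp)          # ReLU rate dynamics
    drive = drive.reshape(T, n_loc)
    drive /= 1e-9 + drive.sum(axis=1, keepdims=True)  # normalise each delay plane
    rates = 0.9 * rates + 0.1 * drive.ravel()

# Read out the most active location in each delay plane
# (an approximate planned trajectory in this toy).
print([int(np.argmax(rates[d * n_loc:(d + 1) * n_loc])) for d in range(T)])
```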

Model comparisons in static and dynamic tasks.
(A) Example tasks with different degrees of dynamic reward. Colours indicate reward at each location (white to green), and red arrows indicate the optimal policy. In the ‘moving goal’ task, the agent intercepts a target that moves along a different trajectory in each trial (arrow). In the ‘reward landscape’ task, the reward is sampled independently in space and time. The agent has to maximise cumulative reward, and it can be optimal to forgo immediate reward for a later payoff. (B) Representations and performance in the static goal task, where the goal location is fixed across trials. Left: value functions learned by TD and SR agents. Centre: STA representation at convergence, which encodes a path through space and time. The final plane is a max projection that summarises the path. Right: performance of each model in the task (grey: random baseline). (C) As in (B), now for a version of the task where the goal location changes between trials. (D) Representations and performance in the moving goal task. Left: the SR computes a value function that averages reward across time-within-trial. Centre: the STA takes into account the moving goal and computes a path that intercepts it. Right: performance of each agent. (E) Performance in the reward landscape task. All error bars indicate 1 standard deviation across 20 agents and environments (dots).
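As a reference point for the SR agent in (B) and (D), the sketch below computes a successor-representation value function on a grid world, and the time-averaged value that an SR without a notion of time-within-trial would assign in a moving goal task. The grid size, discount factor, goal trajectory, and random-walk policy are illustrative assumptions, not the agents used for the figures.

```python
import numpy as np

# Illustrative successor representation (SR) value computation on an n x n grid.
n = 5
n_loc = n * n
gamma = 0.9

# Transition matrix under a uniform random policy over 4-neighbour moves.
P = np.zeros((n_loc, n_loc))
for loc in range(n_loc):
    r, c = divmod(loc, n)
    nbs = [(r + dr) * n + (c + dc)
           for dr, dc in [(1, 0), (-1, 0), (0, 1), (0, -1)]
           if 0 <= r + dr < n and 0 <= c + dc < n]
    P[loc, nbs] = 1.0 / len(nbs)

# SR: expected discounted future state occupancy, M = (I - gamma * P)^(-1).
M = np.linalg.inv(np.eye(n_loc) - gamma * P)

# Static goal: reward only at one location; value is a single matrix product.
reward = np.zeros(n_loc)
reward[12] = 1.0
V_static = M @ reward

# Moving goal: the SR has no representation of time-within-trial, so the
# closest analogue is to average the reward over the trial before applying M.
goal_trajectory = [2, 7, 12, 17]                   # illustrative moving goal
reward_over_time = np.zeros((len(goal_trajectory), n_loc))
for t, goal in enumerate(goal_trajectory):
    reward_over_time[t, goal] = 1.0
V_moving = M @ reward_over_time.mean(axis=0)       # time-averaged value function

print(V_static.reshape(n, n).round(2))
print(V_moving.reshape(n, n).round(2))
```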

RNNs learn spacetime representations in the reward landscape task.
(A) An STA infers the entire future during planning and represents the ‘future-to-go’ during execution (left). The representation in each subspace generalises across all trajectories passing through that location in spacetime (right). (B) We compare a spacetime attractor; an RNN trained on the reward landscape task; and an agent that computes an exact value function in space and time. The value-based agent computes an optimal policy from ‘neural activity’ containing (i) the value function, (ii) the agent location, and (iii) the time-within-trial (Methods). (C) Decoding accuracy at the end of planning. Left: agent location at each time in the future. Right: the time at which the agent will be at a given location, plotted against the actual time the location was visited. Decoders were trained in crossvalidation across the current agent location (Methods). (D) Decoding accuracy during execution of location at each time in the past or future. (E) We trained a single decoder to predict location at time 3 from neural activity at time 1 (black circle). The same decoder predicted location at time t + 2 from neural activity at any other t, demonstrating ‘conveyor belt’ dynamics. (F) Performance in the static goal (left), moving goal (centre), and reward landscape (right) tasks for RNNs trained on each task (x-labels; colours). (G) Normalised parameter magnitudes of the three RNNs (left) and average firing rates in the static goal task (right). All error bars indicate 1 standard deviation across 5 RNNs (dots).
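The decoding analyses in (C) through (E) rely on linear decoders held out across the current agent location; a minimal sketch of this cross-validation scheme is shown below. The use of logistic regression and the synthetic stand-in data are assumptions for illustration; the actual decoders and data are described in the Methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def decode_future_location(hidden, current_loc, future_loc):
    """Decode a future location from activity, cross-validated across the
    current location: the decoder is always tested on trials whose starting
    location it never saw during training."""
    accuracies = []
    for held_out in np.unique(current_loc):
        train = current_loc != held_out
        test = current_loc == held_out
        clf = LogisticRegression(max_iter=1000)
        clf.fit(hidden[train], future_loc[train])
        accuracies.append(clf.score(hidden[test], future_loc[test]))
    return float(np.mean(accuracies))

# Synthetic stand-in for RNN hidden states at the end of planning: each trial's
# activity carries a noisy code for the location the agent will visit later.
rng = np.random.default_rng(0)
n_trials, n_units, n_loc = 500, 64, 25
current_loc = rng.integers(0, n_loc, n_trials)
future_loc = rng.integers(0, n_loc, n_trials)
loc_code = rng.standard_normal((n_loc, n_units))
hidden = loc_code[future_loc] + 0.5 * rng.standard_normal((n_trials, n_units))

print(decode_future_location(hidden, current_loc, future_loc))
```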

RNNs learn a world model.
(A) The weights of RNNs trained on the reward landscape task are projected into an orthonormal coordinate system with axes that predict different points in spacetime (top). The projected weights are interactions between points in spacetime (bottom). (B) The average recurrent weights between subspaces separated by a single action resemble the environment adjacency matrix (bottom, ‘RNN’). ‘Empirical STA’ is the same analysis performed on approximate subspaces estimated from neural activity in the handcrafted STA. ‘True STA’ indicates weights between the ground truth future-coding subspaces in the handcrafted STA. Green box in ‘True STA’ indicates weights between (i) the single location denoted by a green circle in ‘Prediction’, and (ii) all locations in the subspace indicated by a green square. (C) Input weights to the ‘current’ (δ = 0; left) and a ‘future’ (δ = 2; right) subspace. Consistent with an STA, the current subspace of the RNN receives location input, and future subspaces receive reward information. (D) Average recurrent weights between an example location (green arrow) and other locations at increasing time differences (Δ; planes). Locations in subspaces Δ apart are more strongly connected if they can be reached in Δ actions. Light green circles indicate these Δth order adjacency matrices (Δ = 0 corresponds to the identity). (E) Correlation between (i) the average connectivity between subspaces separated by Δ actions (lines; legend), and (ii) different order adjacency matrices (x-axis). Shading indicates standard deviation across 5 RNNs. (F) Example environment with a high-value path and a lower-value path. (G) Weak stimulation does not affect the spacetime representation of the RNN, but strong stimulation switches it to the lower-value path. (H) Magnitude of representational change over time across stimulation strengths (Methods).
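A minimal sketch of the weight-projection analysis in (A) and (B) is given below: recurrent weights are projected onto orthonormal axes that predict location at different delays, and the resulting between-subspace interactions are compared with powers of the environment adjacency matrix, as in (E). The random axes and ring-world adjacency matrix are illustrative stand-ins, not the quantities estimated from the trained networks (Methods).

```python
import numpy as np

def subspace_interactions(W_rec, axes):
    """Project recurrent weights into location-by-delay coordinates.

    W_rec: (n_units, n_units) recurrent weight matrix.
    axes:  dict mapping delay -> (n_units, n_loc) orthonormal axes that
           predict the location the agent will occupy `delay` steps ahead.
    Returns (delay_i, delay_j) -> (n_loc, n_loc) effective weights from
    locations in the delay_j subspace to locations in the delay_i subspace."""
    return {(di, dj): axes[di].T @ W_rec @ axes[dj]
            for di in axes for dj in axes}

def adjacency_correlation(between, adjacency, order):
    """Correlate between-subspace weights with the given power of the adjacency matrix."""
    A_k = (np.linalg.matrix_power(adjacency, order) > 0).astype(float)
    return float(np.corrcoef(between.ravel(), A_k.ravel())[0, 1])

# Illustrative stand-ins: random orthonormal axes and a ring-world adjacency.
rng = np.random.default_rng(0)
n_units, n_loc, n_delays = 200, 10, 4
W_rec = rng.standard_normal((n_units, n_units)) / np.sqrt(n_units)
axes = {d: np.linalg.qr(rng.standard_normal((n_units, n_loc)))[0]
        for d in range(n_delays)}
A = np.roll(np.eye(n_loc), 1, axis=1) + np.roll(np.eye(n_loc), -1, axis=1)

between = subspace_interactions(W_rec, axes)
# For an RNN that has learned a world model, weights between subspaces one
# action apart should correlate with the first-order adjacency matrix
# (here they will not, because the example weights are random).
print(adjacency_correlation(between[(1, 0)], A, order=1))
```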

Spacetime attractors can adapt to changing structure.
(A) Changing structure requires adaptation. (B) Performance of RNNs trained on the reward landscape task in a single maze or with a different structure on every trial, when evaluated in either a single maze (top) or across changing mazes (bottom). (C) Two example mazes (left) and their corresponding adjacency matrices (centre). The effective connectivity between adjacent subspaces in a single RNN (right) reflects the structure of each maze. (D) Average correlation between subspace connectivity in a maze and the structure of either the same (blue) or a random (grey) maze. (E) The spacetime representation is more similar across distinct trials from the same maze than trials from different mazes (left). By using slightly different subspaces in different environments, the network can match their connectivity to the environment structure (right; orange vs. blue). (F) Putative mechanism for structural generalisation. Instead of representing each future location, neurons encode expected future transitions (black arrows in example state). Each transition 

RNN performance during and after training.
Each line in this figure corresponds to one of the five RNNs that were used for analyses in the main text. (A) Performance over the course of training, averaged over all actions within each trial. (B) Performance at the end of training as a function of the action number within the trial. When assessing performance at time t, only trials in which all choices up to time t − 1 were optimal were included. (C) Probability of choosing the action with highest value as a function of the value difference between the two actions with highest value. This analysis shows that errors are only made when the optimal action is close in value to the second-best action. (D) Probability of choosing the action with highest reward as a function of the reward difference between the two actions with highest reward. Reward is less predictive of behaviour than value, confirming that the RNNs compute long-term value rather than greedily following immediate reward.

Additional analyses of learned RNN representations.
(A) We compare a spacetime attractor; an RNN trained on the reward landscape task; and an agent that analytically computes a full spacetime value function. The value-based agent computes an optimal policy from ‘neural activity’ containing (i) the value function, (ii) the current location, and (iii) the time-within-trial (Methods). (B) Decoding accuracy of agent location at different times (x-axis) from neural activity at every other time (lines; legend). All decoders were trained in crossvalidation across the current agent location (Methods). Because decoders never see the held-out current location during training, the accuracy is zero when decoding location from activity at the same time. (C) We trained a single decoder to predict location at time 3 from neural activity at time 1 (green circle). The same decoder predicts location at time t + 2 (x-axis) from neural activity at any other t (lines). (D) Similarity of decoding patterns to idealised representations of future location in ‘relative’ or ‘absolute’ time (schematics).
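For panel (D), one way to construct the idealised ‘relative’ and ‘absolute’ time templates and compare them to a decoder-generalisation matrix is sketched below. The specific template definitions and the synthetic example matrix are assumptions for illustration; the schematics in the figure define the comparison actually used.

```python
import numpy as np

def template_similarity(generalisation, trained_at=(1, 3)):
    """Compare a decoder-generalisation matrix to idealised templates.

    generalisation[t, t2]: accuracy of a decoder trained to predict location
    at time trained_at[1] from activity at time trained_at[0], when applied
    to activity at time t and scored against the location at time t2."""
    T = generalisation.shape[0]
    offset = trained_at[1] - trained_at[0]

    # 'Relative time': the decoder generalises along a fixed offset, t2 = t + offset.
    relative = np.zeros((T, T))
    for t in range(T - offset):
        relative[t, t + offset] = 1.0

    # 'Absolute time': the decoder always reads out location at the trained time.
    absolute = np.zeros((T, T))
    absolute[:, trained_at[1]] = 1.0

    def corr(template):
        return float(np.corrcoef(generalisation.ravel(), template.ravel())[0, 1])

    return {"relative": corr(relative), "absolute": corr(absolute)}

# Synthetic example consistent with 'conveyor belt' (relative time) dynamics.
T = 6
example = np.full((T, T), 0.05)
for t in range(T - 2):
    example[t, t + 2] = 0.9
print(template_similarity(example))
```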

RNNs trained on simpler tasks do not learn spacetime representations.
In this figure, we analyse the representations of four RNNs trained on all combinations of (i) the static goal or the moving goal task, and (ii) reward input throughout the task (‘continual’) or reward input only during the planning phase (‘working memory’). For all analyses in this figure, we only included trials where an optimal agent would intercept the goal in 3 to 6 actions. (A) Decoding accuracy for agent location at different times (x-axis) from neural activity at the end of the planning period. Decoders were trained in crossvalidation across the current agent location. Only the RNN trained on the moving goal task in a working memory setting seems to learn a generalisable representation of the future. This network is also unlikely to have learned a full STA, since it fails catastrophically on the reward landscape task (Figure 4F). Note that the decoding accuracy generally increases slightly for the true STA as a function of time-within-trial. This is because the navigation tasks have stronger correlations between consecutive positions, which leads to some degree of overfitting on the training data. This overfitting is less prominent later in a trial, where the space of possible locations conditioned on the current location is larger. (B) In this analysis, we trained a decoder to predict whether the agent would be at a particular location at any time in the trial from neural activity at the end of planning. Binary decoders were trained for each possible future location in crossvalidation across the current agent location. Bars indicate the average predictive accuracy across all binary decoders and current locations. The simpler networks seem to learn a representation of whether they will be at a given location at some point in the future.

Parameters learned by the reward landscape RNN.
Network weights are projected into an orthonormal coordinate system with axes that maximally predict future locations. All weight matrices are for a single example RNN, since the environment differs between networks, and the connectivity is therefore slightly different. (A) Structure of the environment that the RNN was trained in, illustrated as the 0th order adjacency matrix (the identity matrix), the 1st order adjacency matrix, and the 2nd order adjacency matrix. (B) Recurrent weight matrix estimated during the planning period, which shows structure resembling the environment adjacency matrix in the off-diagonal blocks. (C) Recurrent weight matrix estimated during the execution period, which shows an additional ‘feedforward’ component that copies information from later to earlier subspaces. We posit that this component of the connectivity matrix helps implement the conveyor belt dynamics identified in Figure 4E (Supplementary Note). (D) Input weight matrix estimated during the planning period. The ‘current’ subspace receives location input, and future subspaces receive reward corresponding to the appropriate time in the future. (E) Output weight matrix estimated during the execution period. The policy is read out from the ‘immediate future’ subspace as expected in a spacetime attractor. (F) Overlap between subspaces estimated during the planning and execution periods. This analysis was performed both for the standard RNN (‘WM’), and for a network trained with continual reward input throughout the trial instead of only during the planning phase (‘continual’; Figure S6). The WM RNN uses separate subspaces for computation of the plan and subsequent execution, consistent with the different connectivity patterns in (B) and (C). The continual RNN can use the same subspaces for planning and execution since it always receives the same type of input.
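For panel (F), a standard way to quantify the overlap between the planning and execution subspaces is via the principal angles between them; a sketch under that assumption is shown below, with random orthonormal bases standing in for the subspaces estimated from the RNN.

```python
import numpy as np

def subspace_overlap(P, Q):
    """Overlap between subspaces spanned by the orthonormal columns of P and Q.

    Returns the mean squared cosine of the principal angles: 1 for identical
    subspaces, roughly dim / n_units for unrelated random subspaces, and 0
    for orthogonal subspaces."""
    cosines = np.linalg.svd(P.T @ Q, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Illustrative stand-ins for subspaces estimated during planning and execution.
rng = np.random.default_rng(0)
n_units, dim = 200, 25
P_plan = np.linalg.qr(rng.standard_normal((n_units, dim)))[0]
P_exec_same = P_plan                                                 # identical
P_exec_diff = np.linalg.qr(rng.standard_normal((n_units, dim)))[0]   # unrelated

print(subspace_overlap(P_plan, P_exec_same))   # 1.0
print(subspace_overlap(P_plan, P_exec_diff))   # approximately dim / n_units
```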

Additional analyses of attractor dynamics.
(A) Change in implied spacetime representation over time in the trained RNN for different perturbation strengths (reproduced from Figure 5H). (B) Change in spacetime representation over time in the handcrafted spacetime attractor for different perturbation strengths. In contrast to the trained RNN, the ‘low value’ path is a fixed point of the perturbation-free network dynamics in the handcrafted network. At the end of a sufficiently strong perturbation, the representation can therefore stay in this new fixed point. (C) Change in RNN firing rates for different perturbation strengths. While the change in implied spacetime representation completely saturates with perturbation strength, the change in firing rates continues to increase with perturbation strength. This is expected because the network has a non-saturating ReLU nonlinearity. Small external perturbations are still quenched when quantifying the change in representation using the raw firing rates instead of the implied spacetime representation.
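One possible way to quantify the two quantities compared in (A) through (C), the change in the implied spacetime representation versus the change in raw firing rates, is sketched below. Reading out the implied representation as a softmax through subspace axes is an assumption for illustration; the definition used for the figures is given in the Methods.

```python
import numpy as np

def representation_change(h_before, h_after, axes):
    """Change in the implied spacetime representation vs. raw firing rates.

    h_before, h_after: (n_units,) hidden states before and after a perturbation.
    axes: dict mapping delay -> (n_units, n_loc) axes used here to read out a
          softmax over future locations (an assumed definition of the
          'implied' spacetime representation)."""
    def implied(h):
        logits = np.stack([axes[d].T @ h for d in sorted(axes)])
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    rep_change = float(np.linalg.norm(implied(h_after) - implied(h_before)))
    rate_change = float(np.linalg.norm(h_after - h_before))
    return rep_change, rate_change

# Illustrative call with random axes and hidden states, just to show the usage.
rng = np.random.default_rng(0)
n_units, n_loc, n_delays = 200, 25, 4
axes = {d: np.linalg.qr(rng.standard_normal((n_units, n_loc)))[0]
        for d in range(n_delays)}
h = rng.standard_normal(n_units)
print(representation_change(h, h + 0.1 * rng.standard_normal(n_units), axes))
```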

RNNs trained with continual reward input also learn spacetime attractors.
In the main text, we focused on an RNN that was trained with reward input provided during an initial ‘planning phase’, while no information was given about the reward function during subsequent ‘execution’. In this figure, we perform some of the same analyses on an RNN that receives reward input throughout the entire task. In this setting, the task could in theory be solved using a ‘feedforward’ strategy that does not rely on recurrent dynamics at all. However, the RNNs still learn a spacetime attractor-like solution. (A) Decoding accuracy of agent location at different times (x-axis) from neural activity at every other time. Each line corresponds to predictions from neural activity at a different time in the trial from t = 0 (yellow) to t = 4 (blue). Decoders were trained in crossvalidation across the current agent location. (B) We trained a single decoder to predict location at time 3 from neural activity at time 1 (green circle). The same decoder predicts location at time t + 2 (x-axis) from neural activity at any other t (lines). (C) Similarity of decoding patterns to idealised representations of future location in ‘relative’ or ‘absolute’ time (Figure S2). (D) The average recurrent weights between subspaces separated by a single action resemble the adjacency matrix of the environment. (E) Correlation between (i) the average connectivity between subspaces separated by Δ actions (lines; legend), and (ii) different order adjacency matrices (x-axis). (F) Input weights to the ‘current’ subspace (δ = 0). (G) Input weights to a ‘future’ subspace (δ = 2).

RNNs with a local action space also learn spacetime attractors.
In the main text, we analysed an RNN that generated a global allocentric policy, consisting of a probability distribution over all locations that was renormalised over ‘adjacent’ locations before sampling an action. In this figure, we perform some of the same analyses on a network that outputs a ‘local’ policy in an action space consisting of ‘north’, ‘south’, ‘east’, and ‘west’. This RNN also learns a spacetime attractor. (A) Decoding accuracy of agent location at different times (x-axis) from neural activity at every other time. Each line corresponds to predictions from neural activity at a different time in the trial from t = 0 (yellow) to t = 4 (blue). Decoders were trained in crossvalidation across the current agent location. (B) We trained a single decoder to predict location at time 3 from neural activity at time 1 (green circle). The same decoder predicts location at time t + 2 (x-axis) from neural activity at any other t (lines). (C) Similarity of decoding patterns to idealised representations of future location in ‘relative’ or ‘absolute’ time (Figure S2). (D) The average recurrent weights between subspaces separated by a single action resemble the adjacency matrix of the environment. (E) Correlation between (i) the average connectivity between subspaces separated by Δ actions (lines; legend), and (ii) different order adjacency matrices (x-axis). (F) Input weights to the ‘current’ subspace (δ = 0). (G) Input weights to a ‘future’ subspace (δ = 2).

RNN representations and performance across network sizes.
We trained a series of RNNs with different network sizes ranging from 50 (dark blue) to 800 (yellow) hidden units. (A) Networks with approximately 300 or more units learned a spacetime representation, and the future could be decoded from the hidden state of the network at the end of the planning period. (B) Task performance saturated as a function of network size at approximately 300 hidden units. (C) Task performance increased with the ability of the network to represent the entire future explicitly. These results mirror the findings of Whittington et al. (2023) that RNNs trained on working memory tasks learn a similar ‘slot-like’ solution only if the network is large enough.

Additional analyses of RNNs trained with changing environment structure.
(A) Learning curves of RNNs trained on the reward landscape task in a single maze (‘fixed maze’; orange) or with a different structure on every trial (‘changing maze’; blue). (B) We took the RNNs trained across many mazes and evaluated them in a single maze (blue). Future states could be decoded almost as well as in the RNNs trained in a single maze (orange). Additionally, the representations in each maze were more similar to idealised representations of future location in ‘relative’ than ‘absolute’ time (right). (C) When considering data across many mazes, future transitions could be decoded in a way that generalised across current transition and maze structure (blue). Such a representation did not exist in RNNs trained in a single maze (orange).