Rat Two-Step Task Description and Behavior

A Schematic of a single trial of the rat two-step task, where yellow suns indicate the active port. The rat initiates a trial by poking into the top-center port (i), after which it is prompted to choose between the adjacent side ports (ii). After making a choice, the rat initiates the second step (iii), during which a probabilistic transition determines which outcome port becomes active (iv), and the rat is instructed to enter that port (v). Liquid reward is delivered with probability Pr at one outcome port and with the complementary probability, 1 − Pr, at the other (vi). Throughout the session, the reward probability Pr reverses unpredictably (vii). B Trial-history regressions fit to data simulated by each individual RL agent, each with learning rate 0.5 (5000 trials split between 20 sessions for each simulation). Each agent shows a distinct effect of trial type on choices that decays across past trials. (i) The model-based reward (MBr) agent tends to repeat choices (positive weight) following common-rewarded (blue-solid) and rare-omission (red-dashed) trials, and to switch choices (negative weight) following rare-rewarded (blue-dashed) and common-omission (red-solid) trials, showing an effect of both transition and reward. (ii) A model-based choice (MBc) agent with positive weight tends to repeat choices following common-transition trials (solid) and switch away from choices following rare-transition trials (dashed), capturing transition-dependent outcome-port perseveration. (iii) The model-free reward (MFr) agent tends to repeat choices after rewarded trials (blue) and switch away from choices following omission trials (red), ignoring the observed transition. (iv) A positive model-free choice (MFc) agent captures choice perseveration, tending to repeat choices regardless of observed reward or transition. C Trial-history regressions fit to animal behavioral data (n=20 rats), where dark lines are mean regression values across all rats and shaded error bars are 95% confidence intervals around the mean. Rats look most similar to the MBr agent, but show additional modulation that can be accounted for by the influence of the other agents.
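Panel A describes the task's generative structure, which the following minimal sketch makes concrete. The function name, the 0.8 common-transition probability, Pr = 0.8, and the fixed per-trial reversal hazard are all illustrative assumptions for this sketch, not the paper's exact protocol (in particular, real reversals need not follow a constant hazard rate).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trial(choice, reward_side, p_common=0.8, p_reward=0.8):
    """One trial: the chosen side leads to its 'common' outcome port
    with probability p_common, otherwise to the 'rare' port (iii-v).
    Reward is delivered with probability p_reward at the currently
    good port and 1 - p_reward at the other (vi)."""
    common = rng.random() < p_common
    outcome = choice if common else 1 - choice
    pr = p_reward if outcome == reward_side else 1 - p_reward
    reward = rng.random() < pr
    return outcome, common, reward

reward_side = 0
for t in range(200):
    choice = int(rng.integers(2))        # stand-in for the agent's policy
    outcome, common, reward = simulate_trial(choice, reward_side)
    if rng.random() < 0.02:              # unpredictable reversal (vii);
        reward_side = 1 - reward_side    # this flip rate is illustrative
```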

Model diagrams for a generic reward learning task.

A Diagram of a static MoA model. At the start of a session, values for each agent, Q1 = [QMBr, QMFc], are initialized and fed into a MoA described by agent weights β = [βMBr, βMFc]. This weighted sum is passed through a softmax to produce a probability distribution P(c1) from which to draw choice c1. An outcome port o1 and reward r1 are observed, and all task variables are used to update agent values according to their respective update rules (see methods). These updated values are used by the MoA on the next trial, and the process repeats until the session terminates. B Diagram of a MoA-HMM model. Values are initialized and updated in the same manner as in A, but the MoA used to generate choice probability now depends on a hidden state, where each hidden state's MoA has its own set of agent weights βz = [β1, β2, β3]. The state at the beginning of the session is determined by an initial state probability distribution, and the conditional probability of each subsequent hidden state is governed by a transition matrix, each row corresponding to the previous state and each column corresponding to the upcoming state.
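A minimal generative loop rendering both diagrams is sketched below. The agent-specific update rules live in the paper's methods, so a generic delta rule stands in here, and all numbers (weights, learning rate, initial distribution, transition matrix) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Static MoA (panel A): Q has one row per agent, one column per choice,
# and a single weight vector beta mixes the agents on every trial.
n_agents, n_choices = 2, 2
Q = np.zeros((n_agents, n_choices))      # initialized values Q1
alpha = 0.5                              # illustrative learning rate
# static case: p_choice = softmax(beta @ Q) with beta = [1.5, 0.5]

# MoA-HMM (panel B): one weight vector per hidden state, an initial
# distribution pi, and a transition matrix T (rows: previous state).
betas = np.array([[2.0, 0.1],
                  [0.1, 2.0],
                  [0.5, 0.5]])
pi = np.array([0.8, 0.1, 0.1])
T = np.array([[0.95, 0.04, 0.01],
              [0.02, 0.95, 0.03],
              [0.01, 0.04, 0.95]])

z = rng.choice(3, p=pi)                  # state at session start
for t in range(100):
    p_choice = softmax(betas[z] @ Q)     # weighted sum -> softmax
    c = rng.choice(n_choices, p=p_choice)
    r = float(rng.random() < 0.5)        # placeholder task feedback
    Q[:, c] += alpha * (r - Q[:, c])     # generic delta-rule stand-in for
                                         # each agent's own update rule
    z = rng.choice(3, p=T[z])            # state for the next trial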

Example 3-state MoA-HMM with single-state and 3-state model recovery.

A 3-state MoA-HMM parameters used to generate example data during the two-step task. Three agents are included: MBr, MFc, and Bias. (i) Agent weights for each hidden state. The first state (blue) is dominated by the MBr agent, the second state (orange) by the MFc agent, and the third state (green) by the Bias agent. (ii) Learning rates for the MBr and MFc agents. (iii) Non-uniform initial state probability. (iv) Asymmetric transition probability matrix. B Effective (weighted) agent values during an example session. The dominant value in the weighted sum (black line) changes depending on the active hidden state (background color). C Parameters recovered by a single-state MoA. (i) The recovered agent weights erroneously identify behavior as a mixture of the three agents. (ii) Agent learning rates recovered by the single-state MoA. D Inferred values from the recovered single-state MoA. The weighting on each agent is fixed across the session. E Recovered 3-state MoA-HMM parameters. (i) Each generative state, with a single dominant strategy per state, is recovered. (ii) Learning rates are similarly recovered. F Inferred values from the recovered 3-state MoA-HMM vary similarly to the generative values. The recovered hidden state also closely approximates the generative state sequence.

Figure 3—figure supplement 1. Simulated 3-state MoA-HMM parameter recovery for the five agents used in behavioral fits: model-based reward, model-based choice, model-free reward, model-free choice, and bias. Each simulation contained 5000 trials evenly split between 20 sessions. Parameters for each state are pooled.

Figure 3—figure supplement 2. 3-state MoA-HMM parameter recovery from data simulated from each rat's behavioral model fit. Each rat's model (n=20) was used to generate 5 independent data sets, where each data set contained the same number of trials and sessions as the corresponding rat's behavioral data set used to fit the behavioral model, giving a total of 100 simulations.
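Panel B above plots effective (weighted) agent values. A minimal sketch of that computation follows, assuming Q_diff holds each agent's per-trial value difference between the two choices (an assumption of this sketch; effective_values is a hypothetical helper).

```python
import numpy as np

def effective_values(Q_diff, betas, z_seq):
    """Per-trial weighted agent values as in panel B: the contribution
    of agent a on trial t is betas[z_t, a] * Q_diff[t, a], so the
    dominant term tracks the active hidden state. Q_diff has shape
    (n_trials, n_agents); z_seq is the integer state sequence."""
    contrib = betas[z_seq] * Q_diff    # per-agent weighted values
    total = contrib.sum(axis=1)        # the black weighted-sum line
    return contrib, total
```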

Changes in model fit quality across rats.

A Change in normalized, cross-validated likelihood from the full model when removing one of the agents, computed for both a single-state MoA (gray, left) and a 3-state MoA-HMM (green, right). All agents, except MFr in the single-state MoA, show a significant decrease in likelihood across rats. B (left) Normalized, cross-validated likelihood of a single-state MoA. (right) Increase in normalized, cross-validated likelihood when adding hidden states to the MoA-HMM, relative to the model with one fewer state. Significant increases are observed up through 4 states. A-B Each colored dot is an individual rat, black dots correspond to the median across rats, and error bars are bootstrapped 95% confidence intervals around the median. *:p<0.02, **:p<0.001, ***:p<1E-4
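One common definition of the normalized likelihood is the geometric-mean per-trial likelihood; the sketch below uses that convention (see the paper's methods for the exact normalization used here), with illustrative held-out values.

```python
import numpy as np

def normalized_likelihood(trial_log_liks):
    """Geometric-mean per-trial likelihood, exp(mean log-likelihood).
    For a binary choice, chance level is 0.5 and higher is better."""
    return float(np.exp(np.mean(trial_log_liks)))

# Agent-removal comparison as in panel A (numbers are illustrative):
full    = normalized_likelihood(np.log([0.70, 0.66, 0.72]))
reduced = normalized_likelihood(np.log([0.64, 0.61, 0.66]))
delta = reduced - full    # negative => the removed agent mattered
```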

Figure 4—figure supplement 1. New RL learning rules significantly improve fit to behavior and capture much of the variance explained by the Novelty Preference and Perseveration agents of the original model from Miller et al. (2017).

Example and summary 3-state MoA-HMMs fit to rats in the two-step task using a population prior (see methods). The states are ordered manually based on three properties: the first state is the one with the highest initial state probability (blue diamond); the remaining two states are ordered second and third (orange and green diamonds, respectively) by the weight they give to MBr (orange higher). A Example 3-state MoA-HMM for a single rat. (i) Agent weights split by hidden state. (ii) Initial state probability. (iii) State transition probability matrix. (iv) The expected hidden state calculated from the forward-backward algorithm, averaged (mean) across sessions, with error bars as 95% confidence intervals around the mean. B Summary of all 3-state MoA-HMM fits across the population of 20 rats. States were sorted as in (A). (i) Agent weights split by hidden state, where each dot represents a single rat, light grey lines connect weights within a rat, bars represent the median weight over rats, and error bars are bootstrapped 95% confidence intervals around the median. (ii) Distribution of initial state probabilities, with each dot an individual rat. (iii) Distribution of state transition probabilities, with each panel representing a single (t − 1) → (t) transition. (iv) Session-averaged expected state computed as in (Aiv), where light lines are average probabilities for individual rats and dark solid lines are the population mean with 95% confidence intervals around the mean. *:p<0.05, ***:p<1E-4
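Panel (Aiv) computes expected state probabilities with the forward-backward algorithm. A standard scaled implementation is sketched below; the per-trial observation likelihoods would come from each state's MoA softmax, but here they are simply taken as an input array.

```python
import numpy as np

def forward_backward(log_obs, pi, T):
    """Expected state probabilities gamma[t, k] = P(z_t = k | data).
    log_obs[t, k] is the log-likelihood of trial t's choice under
    state k's MoA; pi is the initial state distribution, T the
    transition matrix (rows: previous state). Alpha-beta recursion
    with per-step rescaling for numerical stability."""
    n, K = log_obs.shape
    obs = np.exp(log_obs - log_obs.max(axis=1, keepdims=True))
    alpha = np.zeros((n, K))
    beta = np.ones((n, K))
    alpha[0] = pi * obs[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                      # forward pass
        alpha[t] = (alpha[t - 1] @ T) * obs[t]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):             # backward pass
        beta[t] = T @ (obs[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```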

Figure 5—figure supplement 1. (A) Learning rates fit for each agent (i) corresponding to the example rat shown in Figure 5A and (ii) summarized over the population of rats. Each dot is an individual rat, bars represent the median, and error bars are bootstrapped 95% confidence intervals around the median. (B) Three example sessions showing the inferred state likelihood on each trial for the example rat shown in Figure 5A. (C) Cross-correlation between left choices and reward probabilities for the common outcome port given that choice (gray). Left choices are highly correlated with left-outcome reward blocks, with the peak correlation at a slight lag (vertical dashed line) indicating the trial at which the rat detects the reward probability flip. To test whether the latent states track reward flips, the cross-correlation is also shown between left-outcome reward probability and the likelihood of each state: the initial state (blue), the remaining state with a more rightward choice bias (orange), and the remaining state with a more leftward bias (green). These correspond directly to states 1-3 in the example rat whose model is shown in Figure 5A, while other rats had states 2 and 3 assigned according to their individual choice biases.
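The lagged cross-correlation in (C) can be computed as below. This is a sketch with a hypothetical xcorr_peak_lag helper: both series are z-scored, a window of lags is scanned, and the lag of peak correlation is returned.

```python
import numpy as np

def xcorr_peak_lag(x, y, max_lag=20):
    """Correlate x[t] with y[t + lag] for lags -max_lag..max_lag and
    return (peak lag, correlations). A positive peak lag means x
    leads y; both series are z-scored first."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    lags = list(range(-max_lag, max_lag + 1))
    cc = []
    for lag in lags:
        xs = x[max(0, -lag):len(x) - max(0, lag)]
        ys = y[max(0, lag):len(y) - max(0, -lag)]
        cc.append(float(np.corrcoef(xs, ys)[0, 1]))
    return lags[int(np.argmax(cc))], cc
```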

Figure 5—figure supplement 2. A second 3-state MoA-HMM example and comparison to a GLM-HMM. States identified by the MoA-HMM and GLM-HMM are highly similar. A Example 3-state MoA-HMM model parameters with (i) agent weights split by state, (ii) each agent's learning rate, (iii) the initial state probability, and (iv) the state transition matrix. B 3-state GLM-HMM fit to the same rat as (A). (i) GLM-HMM regression weights for each state. Each state is described by four types of regressors indicating four possible trial types – common-reward, common-omission, rare-reward and rare-omission – and choice direction for up to 5 previous trials, giving 20 parameters per state. Each state additionally has a bias term (ii), leading to a total of 63 model weights for a 3-state model. (iii) GLM-HMM initial state probabilities also identify a prominent initial state. (iv) The GLM-HMM transition matrix closely matches that of the MoA-HMM. C Expected state probabilities for the MoA-HMM (i) averaged across all sessions and (ii) for highlighted example sessions. D Expected state probabilities for the GLM-HMM (i) averaged across all sessions and (ii) for highlighted example sessions. The temporal structure is highly similar to the MoA-HMM. E Cross-correlation between the expected state probabilities inferred by the MoA-HMM and GLM-HMM (i.e. panels Cii and Dii) across all sessions. Each dot is an individual rat, black circles are medians, and error bars are 95% confidence intervals around the median.
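One way to build the 20 trial-history regressors per state described in (Bi) is sketched below, under the assumption that each regressor is a trial-type indicator signed by the previous choice direction; the paper's exact coding may differ, and glm_design is a hypothetical helper.

```python
import numpy as np

def glm_design(choices, rewards, commons, n_back=5):
    """Design matrix with 4 * n_back columns: for each of the n_back
    previous trials, four indicators (common-reward, common-omission,
    rare-reward, rare-omission), each signed by the previous choice
    (+1 / -1). Inputs are boolean arrays of length n_trials; the
    per-state bias term is handled separately."""
    n = len(choices)
    X = np.zeros((n, 4 * n_back))
    sign = np.where(choices, 1.0, -1.0)      # assumed choice coding
    types = [commons & rewards, commons & ~rewards,
             ~commons & rewards, ~commons & ~rewards]
    for lag in range(1, n_back + 1):
        for k, mask in enumerate(types):
            col = 4 * (lag - 1) + k
            X[lag:, col] = sign[:-lag] * mask[:-lag]
    return X
```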

Figure 5—figure supplement 3. Two example 4-state MoA-HMM fits corresponding to the 3-state fits from (A) Figure 5A and (B) Figure 5—figure supplement 2. States are ordered according to the initial state probability (Aii and Bii) and the transition probabilities to the most-likely states that follow (Aiii and Biii). Initial states are generally consistent with the 3-state fits, while the way the remaining two states split into three is more idiosyncratic. For example, (A) suggests that state 3 from the smaller model splits into two states (iv) that differ by bias (i), while (B) suggests that the additional state 4 draws from both of the smaller model's states 2 and 3 (iv), and the state with the largest MBr weight no longer directly follows the initial state (i).

Changes in inter-trial interval (ITI) duration predicted by latent state.

A Task diagram of the response time of interest: ITI duration. (i) The ITI duration is measured as the difference in time (Δtime) between outcome port entry during step 2 of trial t and initiation port entry during step 1 of the next trial t + 1. (ii) Due to a long right tail from the longest ITIs, we instead model the inverse of ITI duration, or ITI rate, in our multilevel linear regression model. B Example prediction from a multilevel linear regression with fixed effects of reward, time, state, time × reward, and state × reward, and random effects of session on every predictor. (i, top) Scatter plot of ITI rate by time in session (in minutes) within an example session, colored by whether the ITI followed a reward (gray) or an omission (gold). Overlaid lines are the model-predicted rate split for reward (dashed, black) and omission (solid, dark gold) trials. Discrete shifts in ITI rate align with shifts in inferred state (background color), with the highest rate during state 2. (i, bottom) Components of the prediction in the top panel due to time predictors (blue) and state predictors (red), split for reward (dashed) and omission (solid) trials. The state component, and its interaction with reward, capture the discrete shift observed in the top panel. (ii) Grouped dot plot of ITI rate split by inferred state, after subtracting changes in rate due to time (the time component from the bottom panel of i). State 2 shows a significantly higher rate than state 3 for ITIs following both rewarded and omission trials. C Coefficient of partial determination (CPD, or partial R-squared) for both time and state decouples the variance explained by each group of predictors. Time predicts significantly more than state, but state still significantly increases the variance explained by the model. **:p<0.01, ***:p<1E-4
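The CPD in panel C is the standard partial R-squared: CPD = (SSE_reduced − SSE_full) / SSE_reduced, where the reduced model omits the predictor group of interest. A minimal sketch using ordinary least squares for clarity (the paper's model is a multilevel regression, so this is illustrative only):

```python
import numpy as np

def cpd(y, X_full, drop_cols):
    """Coefficient of partial determination for a group of predictors:
    fraction of the reduced model's residual variance explained by
    adding back the columns in drop_cols."""
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)
    X_reduced = np.delete(X_full, drop_cols, axis=1)
    sse_full, sse_reduced = sse(X_full), sse(X_reduced)
    return (sse_reduced - sse_full) / sse_reduced
```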

Orbitofrontal cortex (OFC) neural encoding modulated by state.

A Modulation of model-based (MBr) outcome value encoding by state. (i) PSTHs of an example OFC unit time-locked to outcome for each state, where outcome time is measured as the time at which the rat pokes the second-step outcome port. To differentiate between levels of expected outcome value, trials were split into terciles by value. Shaded areas are 95% confidence intervals around the mean. The greatest modulation by outcome value is seen in state 2. (ii) Average firing rate within the outcome window (defined from 1 second before until 3 seconds after outcome port entry) for each trial with a high expected outcome value throughout the session. Background shading corresponds to inferred state. Shifts in response magnitude align with shifts in inferred state. B Population CPD of expected outcome value throughout a trial, centered around the ITI, computed for each state. Dots above each panel correspond to time points where the true CPD was greater than 95% of CPDs computed from circularly permuted datasets. As for the example unit in (Ai), state 2 contains the greatest modulation by outcome value, and no significant outcome value encoding is measured in state 3. C Modulation of reward/omission encoding by state. (i) PSTHs of an example OFC unit time-locked to outcome for each state. To differentiate between responses to rewards and omissions, each PSTH is split by trials following omissions (dashed) and trials following rewards (solid). Shaded areas correspond to 95% confidence intervals around the mean. This example unit shows the highest response to reward in state 1 and the lowest in state 3. (ii) Average firing rate within the outcome window (defined from 1 second before until 3 seconds after outcome port entry) for each rewarded trial throughout the session. Background shading corresponds to inferred state. Shifts in response magnitude align with shifts in inferred state, on top of a slow negative drift across the session. D The CPD of reward across all units throughout a trial, centered around the ITI, computed for each state. As for the example unit in (Ci), state 1 contains the greatest modulation by reward.
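Panels B and D assess significance against circularly permuted datasets: shifting the trial sequence by a random offset preserves its autocorrelation while breaking its alignment with the neural data. A sketch of that permutation scheme (the helper name and interface are hypothetical, not the paper's exact pipeline):

```python
import numpy as np

def circular_permutation_null(values, stat_fn, n_perm=1000, rng=None):
    """Null distribution for a trial-indexed statistic: circularly
    shift the trial labels by a random offset and recompute the
    statistic (e.g., a CPD) on each shifted copy."""
    rng = rng or np.random.default_rng()
    null = np.empty(n_perm)
    for i in range(n_perm):
        shift = int(rng.integers(1, len(values)))
        null[i] = stat_fn(np.roll(values, shift))
    return null

# A time point is marked significant when the true statistic exceeds
# the 95th percentile of the null: true_cpd > np.quantile(null, 0.95)
```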

Figure 7—figure supplement 1. Population CPD computed around the inter-trial interval (ITI) reveals significant encoding of state (red) even after accounting for time (blue). CPDs were measured as the median CPD across all units, with error bars corresponding to bootstrapped 95% confidence intervals.

Figure 7—figure supplement 2. Example and summary 3-state MoA-HMM fits for rats with electrophysiological recordings in OFC.
