Rat Two-Step Task Description and Behavior

A Schematic of a single trial of the rat two-step task, where yellow suns indicate the active port. The rat initiates a trial by poking into the top-center port (i), after which the rat is prompted to choose between the adjacent side ports (ii). After making a choice, the rat initiates the second step (iii). During the second step, the probabilistic transition determines which outcome port becomes active (iv), and the rat is instructed to enter the active outcome port (v). One outcome port delivers liquid reward with probability Pr, and the other outcome port delivers reward with the complementary probability, 1 – Pr (vi). Throughout the session, the reward probability Pr reverses unpredictably (vii). B Trial-history regressions fit to data simulated by each individual RL agent, each with learning rate 0.5 (5000 trials split between 20 sessions for each simulation). Each agent demonstrates differential effects of trial type on choices, and these effects decay across past trials. (i) The model-free reward (MFr) agent tends to repeat choices (positive weight) after rewarded (blue) trials and switch away from choices (negative weight) following omission trials (red), ignoring the observed transition. (ii) A positive model-free choice (MFc) agent captures choice perseveration, tending to repeat choices regardless of observed reward or transition. (iii) The model-based reward (MBr) agent tends to repeat choices following common-rewarded (blue-solid) and rare-omission (red-dashed) trials, and switches choices following rare-rewarded (blue-dashed) and common-omission (red-solid) trials, showing an effect of both transition and reward. (iv) The model-based choice (MBc) agent with a positive weight tends to repeat choices following common transitions (solid) and switch away from choices following rare-transition trials (dashed), capturing transition-dependent outcome-port perseveration. C Trial-history regressions fit to animal behavioral data (n=20 rats), where dark lines are mean regression values across all rats and shaded error bars are 95% confidence intervals around the mean. Rats look most similar to the MBr agent, but show additional modulation that can be accounted for by the influence of the other agents.
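
For concreteness, a minimal Python sketch of the kind of trial-history regression used in (B) and (C): one signed regressor per trial type (common-reward, common-omission, rare-reward, rare-omission) at each of several past-trial lags, fit with logistic regression so that positive weights indicate a tendency to repeat the lagged choice. The ±1 coding, lag depth, and use of scikit-learn are illustrative assumptions; the exact specification is in the methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def trial_history_design(choice, reward, common, n_back=5):
    """Lagged trial-type regressors for predicting the current choice.
    choice: 1 = left, 0 = right; reward and common are boolean arrays.
    Each regressor is +1/-1 by the lagged choice when that lagged trial was of
    the given type, so a positive fitted weight means "repeat that choice"."""
    T = len(choice)
    signed = np.where(choice == 1, 1, -1)
    trial_types = [common & reward,        # common-reward
                   common & ~reward,       # common-omission
                   ~common & reward,       # rare-reward
                   ~common & ~reward]      # rare-omission
    cols = []
    for mask in trial_types:
        for lag in range(1, n_back + 1):
            col = np.zeros(T)
            col[lag:] = signed[:-lag] * mask[:-lag]
            cols.append(col)
    X = np.column_stack(cols)[n_back:]     # drop trials without a full history
    y = choice[n_back:]
    return X, y

# usage on placeholder simulated data
rng = np.random.default_rng(0)
choice = rng.integers(0, 2, 5000)
reward = rng.integers(0, 2, 5000).astype(bool)
common = rng.random(5000) < 0.8
X, y = trial_history_design(choice, reward, common)
weights = LogisticRegression().fit(X, y).coef_.reshape(4, -1)  # trial type x lag
```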

Model diagrams during a generic reward learning task. A Diagram of a static MoA model. At the start of a session, values for each agent, Q1 = [QMBr, QMFc], are initialized (i) and fed into a MoA described by agent weights β = [βMBr, βMFc]. This weighted sum is pushed through a softmax to produce a probability distribution P(y1) (ii) from which to draw choice y1 (iii). An outcome port o1 and reward r1 are observed (iv), and all task variables may be used to update agent values according to their respective update rules (v). These updated values are used by the MoA on the next trial, and this process repeats until the session terminates. B Diagram of a MoA-HMM model. Values are initialized and updated in the same manner as (A), but now the MoA used to generate choice probability is determined by an underlying hidden state, where each hidden state has its own set of agent weights βz = [β1, β2, β3]. The state at the beginning of the session is determined by an initial state probability distribution (i), and the conditional probability of each subsequent hidden state is governed by a transition matrix (ii), each row corresponding to the previous state and each column corresponding to the upcoming state.
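
The choice rule in (A) reduces to a softmax over the β-weighted sum of agent values; a minimal sketch with two agents and illustrative numbers (the agent update rules themselves are omitted).

```python
import numpy as np

def moa_choice_prob(Q, beta):
    """Mixture-of-agents choice probability: softmax over the beta-weighted sum
    of each agent's action values. Q has shape (n_agents, n_actions)."""
    logits = beta @ Q                      # weighted sum across agents
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# one trial of the loop in panel A (illustrative values)
Q = np.array([[0.6, 0.4],    # e.g. QMBr for left/right
              [0.5, 0.5]])   # e.g. QMFc
beta = np.array([1.5, 0.8])  # agent weights
p_choice = moa_choice_prob(Q, beta)
choice = np.random.default_rng(1).choice(2, p=p_choice)
# ...observe outcome port o_t and reward r_t, then apply each agent's update rule
```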

Example 3-state MoA-HMM with single-state and 3-state model recovery. A 3-state MoA-HMM parameters used to generate example data during the two-step task. Three agents are included: MBr, MFc, and Bias. (i) Agent weights for each hidden state. The first state (blue) is dominated by the MBr agent, the second state (orange) is dominated by the MFc agent, and the third state (green) is dominated by the Bias agent. (ii) Learning rates for the MBr and MFc agents. (iii) Non-uniform initial state probability. (iv) Asymmetric transition probability matrix. B Effective (weighted) agent values during an example session. The dominant value in the weighted sum (black line) changes depending on the active hidden state (background color). C Parameters recovered by a single-state MoA. (i) The recovered agent weights erroneously identify behavior as a mixture of the three agents. (ii) Agent learning rates recovered by the single-state MoA. D Inferred values from the recovered single-state MoA. The weighting on each agent is fixed across the session. E Recovered 3-state MoA-HMM parameters. (i) Each generative state, with a single dominating strategy per state, is recaptured. (ii) Recovered learning rates. (iii) Recovered initial state probability. (iv) Recovered transition probability matrix. F Inferred values from the recovered 3-state MoA-HMM vary similarly to the generative values. The generative hidden states are also closely approximated by the recovered states.
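
The hidden-state sequence underlying (B) can be simulated directly from the initial state probability and transition matrix; a short sketch with illustrative (not the generative) parameter values.

```python
import numpy as np

def sample_state_sequence(pi0, A, n_trials, seed=0):
    """Sample a hidden-state path: first state from pi0, then each next state
    from the row of transition matrix A indexed by the current state."""
    rng = np.random.default_rng(seed)
    states = np.empty(n_trials, dtype=int)
    states[0] = rng.choice(len(pi0), p=pi0)
    for t in range(1, n_trials):
        states[t] = rng.choice(A.shape[1], p=A[states[t - 1]])
    return states

pi0 = np.array([0.8, 0.15, 0.05])          # non-uniform initial state probability
A = np.array([[0.95, 0.04, 0.01],          # asymmetric transition matrix (rows sum to 1)
              [0.02, 0.95, 0.03],
              [0.01, 0.02, 0.97]])
states = sample_state_sequence(pi0, A, n_trials=300)
```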

Simulated 3-state MoA-HMM parameter recovery for the five agents used in behavioral fits: model-based reward, model-based choice, model-free reward, model-free choice, and bias. Simulated model parameters were randomly drawn (see methods). Each simulation contained 5000 trials evenly split between 20 sessions. Parameters in each panel are pooled across states. “r” is the Pearson correlation coefficient between the simulated and recovered parameters, and “R2” is the coefficient of determination, quantifying how well the simulated parameters predict the recovered parameters. Due to interactions between model parameters (e.g. a small β weight affects the recoverability of that agent’s learning rate α), a number of recovery “failures” can be seen.
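
A brief sketch of how the two recovery metrics might be computed for one pooled parameter, treating the simulated (generative) values as predictions of the recovered values, as described above.

```python
import numpy as np
from scipy.stats import pearsonr

def recovery_metrics(simulated, recovered):
    """Pearson r and coefficient of determination R^2, treating the simulated
    (generative) parameters as predictions of the recovered ones."""
    r, _ = pearsonr(simulated, recovered)
    ss_res = np.sum((recovered - simulated) ** 2)
    ss_tot = np.sum((recovered - recovered.mean()) ** 2)
    return r, 1.0 - ss_res / ss_tot

# usage with placeholder pooled parameters
rng = np.random.default_rng(2)
simulated = rng.normal(size=200)
recovered = simulated + rng.normal(scale=0.3, size=200)  # imperfect recovery
r, r2 = recovery_metrics(simulated, recovered)
```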

3-state MoA-HMM parameter recovery from data simulated from each rat’s behavioral model fit. Each rat’s model (n=20) was used to generate 5 independent data sets, where each data set contained the same number of trials and sessions as the corresponding rat’s behavioral data set used to fit the behavioral model, giving a total of 100 simulations. “r” is the Pearson correlation coefficient between the simulated and recovered parameters, and “R2” is the coefficient of determination, quantifying how well the simulated parameters predict the recovered parameters. Parameters inferred from real data are recovered more reliably.

Changes in model fit quality across rats. A Change in normalized, cross-validated likelihood from the full model when removing one of the agents, computed for both a single-state MoA (gray, left) and a 3-state MoA-HMM (green, right). All agents, excluding MFr for the single-state MoA, show a significant (across rats) decrease in likelihood when removed. B (left) Normalized, cross-validated likelihood of a single-state MoA. (right) Change in normalized, cross-validated likelihood when adding additional hidden states to the MoA-HMM, relative to the single-state model. Significant changes are computed with respect to the model with one fewer state (e.g. 2-state vs 1-state, 3-state vs 2-state); significant increases are observed through 4 states. A-B Each colored dot is an individual rat, black dots correspond to the median across rats, and error bars are bootstrapped 95% confidence intervals around the median. *:p<0.02, **:p<0.001, ***:p<1E-4
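
As a rough sketch of the comparison metric, assuming "normalized likelihood" denotes the geometric-mean per-trial likelihood exp(LL/n) of held-out sessions (the exact definition is given in the methods):

```python
import numpy as np

def normalized_likelihood(log_lik_heldout, n_trials):
    """Per-trial (geometric-mean) likelihood from a held-out log likelihood.
    This is one common normalization; the paper's definition is in the methods."""
    return np.exp(log_lik_heldout / n_trials)

# change in fit quality when removing an agent from the full model (placeholder numbers)
full = normalized_likelihood(log_lik_heldout=-2100.0, n_trials=5000)
reduced = normalized_likelihood(log_lik_heldout=-2150.0, n_trials=5000)
delta = reduced - full   # negative -> the removed agent was contributing to the fit
```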

New RL learning rules significantly improve the fit to behavior and capture much of the variance explained by the Novelty Preference and Perseveration agents of the original model from Miller et al. (2017). *:p<0.02, **:p<0.001, ***:p<1E-4

Example and summary 3-state MoA-HMMs fit to rats in the two-step task using a population prior (see methods). The states are manually ordered as follows: the first state is chosen as the one with the highest initial state probability (blue diamond), and the remaining two states are then ordered second and third (orange and green diamonds, respectively) by the weight they give to MBr (orange = higher MBr). A Example 3-state MoA-HMM on a single rat. (i) Agent weights split by hidden state. (ii) Initial state probability. (iii) State transition probability matrix. (iv) The expected hidden state calculated from the forward-backward algorithm, averaged (mean) across sessions, with error bars as 95% confidence intervals around the mean. B Summary of all 3-state MoA-HMM fits across the population of 20 rats. States were sorted in the same way as (A). (i) Agent weights split by hidden state, where individual dots represent a single rat, light grey lines connect weights within a rat, bars represent the median weight over rats, and error bars are bootstrapped 95% confidence intervals around the median. (ii) Distribution of initial state probabilities, with each dot as an individual rat. (iii) Distribution of state transition probabilities, with each panel representing a single (t − 1) → (t) transition. (iv) Session-averaged expected state computed as in (Aiv), where light lines are average probabilities for individual rats and the dark solid lines are the population mean with 95% confidence intervals around the mean. *:p<0.05, ***:p<1E-4
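
The manual state-ordering rule can be written explicitly; a sketch assuming the fitted initial state probabilities and agent weights are stored as arrays (names are illustrative).

```python
import numpy as np

def order_states(init_prob, beta, mbr_index):
    """Return a permutation of state indices: the state with the highest initial
    probability first, then the remaining two states by descending MBr weight."""
    first = int(np.argmax(init_prob))
    rest = [s for s in range(len(init_prob)) if s != first]
    rest.sort(key=lambda s: beta[mbr_index, s], reverse=True)  # higher MBr -> second
    return [first] + rest

# beta has shape (n_agents, n_states); row mbr_index holds the MBr weights
init_prob = np.array([0.7, 0.1, 0.2])
beta = np.array([[0.4, 1.8, 0.3],    # MBr
                 [0.9, 0.2, 0.5]])   # e.g. MFc
order = order_states(init_prob, beta, mbr_index=0)   # -> [0, 1, 2]
```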

(A) Learning rates fit for each agent (i) corresponding to the example rat shown in Figure 5A and (ii) summarized over the population of rats. Each dot is an individual rat, bars represent the median, and error bars are bootstrapped 95% confidence intervals around the median. (B) Three example sessions showing the inferred state likelihood on each trial from the example rat shown in Figure 5A. (C) Cross-correlation between left choices and the reward probability of the common outcome port for that choice (gray). Left choices are highly correlated with left-outcome reward blocks, with the peak correlation at a slight lag (vertical dashed line) indicating the trial lag at which the rat detects the reward probability flip. To test whether the latent states track reward flips, the cross-correlation is also shown between the left-outcome reward probability and the likelihood of each state: the initial state (blue), the remaining state with a more rightward choice bias (orange), and the remaining state with a more leftward bias (green). These correspond directly to states 1-3 in the example rat (i), whose model is shown in Figure 5A, while other rats had states 2 and 3 assigned according to their individual choice biases.
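
A sketch of the trial-by-trial cross-correlation in (C), assuming z-scored signals; a positive peak lag indicates that choices (or state likelihoods) lag the reward-probability flips. The block structure and delay in the placeholder data are illustrative.

```python
import numpy as np

def cross_correlation(x, y, max_lag=20):
    """Normalized cross-correlation of two trial-by-trial signals.
    corr[k] pairs x on trial t with y on trial t + k, for lags -max_lag..max_lag."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.array([np.mean(x[max(0, -k):len(x) - max(0, k)] *
                             y[max(0, k):len(y) - max(0, -k)]) for k in lags])
    return lags, corr

# placeholder block structure: reward probability flips every 50 trials, and
# choices track it with a few-trial delay
rng = np.random.default_rng(3)
left_reward_prob = np.repeat(rng.choice([0.2, 0.8], size=20), 50)
left_choice = (rng.random(1000) < np.roll(left_reward_prob, 3)).astype(float)
lags, corr = cross_correlation(left_reward_prob, left_choice)
peak_lag = lags[np.argmax(corr)]   # positive lag: choices lag the reward flip
```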

A second 3-state MoA-HMM example and comparison to a GLM-HMM. States identified by the MoA-HMM and GLM-HMM are highly similar. A Example 3-state MoA-HMM model parameters with (i) agent weights split by state, (ii) each agent’s learning rate, (iii) the initial state probability, and (iv) the state transition matrix. B 3-state GLM-HMM fit to the same rat as (A). (i) GLM-HMM regression weights for each state. Each state is described by four types of regressors indicating the four possible trial types – common-reward, common-omission, rare-reward and rare-omission – and choice direction for up to 5 previous trials, giving 20 parameters per state. Each state additionally has a bias term (ii), leading to a total of 63 model weights for a 3-state model. (iii) GLM-HMM initial state probabilities also identify a prominent initial state. (iv) The GLM-HMM transition matrix closely matches that of the MoA-HMM. C Expected state probabilities for the MoA-HMM (i) averaged across all sessions and (ii) for highlighted example sessions. D Expected state probabilities for the GLM-HMM (i) averaged across all sessions and (ii) for highlighted example sessions. The temporal structure is highly similar to the MoA-HMM. E Cross-correlation between the expected state probabilities inferred from the MoA-HMM and GLM-HMM (i.e. panels Cii and Dii) across all sessions. Each dot is an individual rat, black circles are medians, and error bars are 95% confidence intervals around the median.
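
Both sets of expected state probabilities (panels C and D) come from the forward-backward algorithm; a generic sketch given per-trial, per-state likelihoods of the observed choice (the MoA-HMM and GLM-HMM differ only in how those likelihoods are computed). This is a standard scaled implementation, not the authors' code.

```python
import numpy as np

def forward_backward(pi0, A, lik):
    """Posterior state probabilities gamma[t, k] = P(state_t = k | all choices).
    lik[t, k] is the likelihood of the observed choice on trial t under state k
    (from that state's agent weights for the MoA-HMM, or its GLM weights for the
    GLM-HMM). Uses the standard scaled forward-backward recursions."""
    T, K = lik.shape
    fwd = np.zeros((T, K)); bwd = np.zeros((T, K)); c = np.zeros(T)
    fwd[0] = pi0 * lik[0]
    c[0] = fwd[0].sum(); fwd[0] /= c[0]
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ A) * lik[t]        # A rows index the previous state
        c[t] = fwd[t].sum(); fwd[t] /= c[t]
    bwd[-1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = (A @ (lik[t + 1] * bwd[t + 1])) / c[t + 1]
    gamma = fwd * bwd
    return gamma / gamma.sum(axis=1, keepdims=True)

# usage: expected state probabilities for a 300-trial session with placeholder likelihoods
rng = np.random.default_rng(7)
lik = rng.uniform(0.3, 0.7, size=(300, 3))
gamma = forward_backward(np.array([0.8, 0.1, 0.1]),
                         np.array([[0.95, 0.04, 0.01],
                                   [0.02, 0.95, 0.03],
                                   [0.01, 0.02, 0.97]]), lik)
```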

Two example 4-state MoA-HMM fits corresponding to the 3-state fits from (A) Figure 5A and (B) Figure 5-figure supplement 2. States are ordered according to the initial state probability (Aii and Bii) and the transition probabilities to the most likely states that follow (Aiii and Biii). Initial states are generally consistent with the 3-state fits, while the way the remaining two states split into three states is more idiosyncratic. For example, (A) suggests state 3 from the smaller model is split into two states (iv) that differ by bias (i), while (B) suggests the additional state 4 draws from both the smaller model’s states 2 and 3 (iv), and the state with the largest MBr weight no longer directly follows the initial state (i).

Changes in inter-trial interval (ITI) duration predicted by latent state

A Task diagram of the response time of interest: ITI duration. (i) The ITI duration is measured as the difference in time (Δtime) between outcome port entry during step 2 of trial t and initiation port entry during step one of the next trial t + 1. (ii) Due to a large right tail from the longest ITIs, we instead model the inverse of ITI duration, or ITI rate, in our multilevel linear regression model. B Example prediction from a multilevel linear regression with fixed effects of reward, time, state, time × reward, and state × reward, and random effects of session on every predictor. (top) Scatter plot of ITI rate by time in session (in minutes) within an example session, colored by whether the ITI followed a reward (blue) or omission (red). Overlaid lines are the model-predicted rate split for reward (solid) and omission (dashed) trials. Discrete shifts in ITI rate align with shifts in inferred state (background color), with the highest rate during state 2. (bottom) Components of the prediction in the top panel due to time predictors (gray) and state predictors (gold), split for reward (solid) and omission (dashed) trials. The state component, and its interaction with reward, capture the discrete shift observed in the top panel. C Grouped dot plot of ITI rate split by inferred state, after subtracting changes in rate due to time (the time component from (B), bottom panel). State 2 shows a significantly higher rate than state 3 for ITIs following both rewarded and omission trials. D Coefficient of partial determination (CPD, or partial R-squared) for both time and state predictors, decoupling the variance explained by each group of predictors. Time explains significantly more variance than state, but state still significantly increases the variance explained by the model. **:p<0.01, ***:p<1E-4
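
A minimal sketch of the CPD calculation in (D), using ordinary least squares on illustrative predictors rather than the multilevel model used in the paper; the CPD of a predictor group is the fractional reduction in residual error it provides over the model without it.

```python
import numpy as np

def cpd(y, X_full, drop_cols):
    """Coefficient of partial determination for a group of predictors:
    CPD = (SSE_reduced - SSE_full) / SSE_reduced, where the reduced model
    omits the columns in drop_cols."""
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return np.sum(resid ** 2)
    X_reduced = np.delete(X_full, drop_cols, axis=1)
    sse_full, sse_red = sse(X_full), sse(X_reduced)
    return (sse_red - sse_full) / sse_red

# illustrative: ITI rate (1 / ITI duration) regressed on intercept, time, and state dummies
rng = np.random.default_rng(4)
n = 500
time_in_session = np.linspace(0, 60, n)
state = rng.integers(0, 3, n)
iti_rate = 0.5 - 0.003 * time_in_session + 0.1 * (state == 1) + rng.normal(0, 0.05, n)
X = np.column_stack([np.ones(n), time_in_session, state == 1, state == 2])
cpd_state = cpd(iti_rate, X, drop_cols=[2, 3])   # variance uniquely explained by state
```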

Orbitofrontal cortex (OFC) neural encoding modulated by state.

A Modulation of model-based (MBr) outcome value encoding by state. (i) Peristimulus time histograms (PSTHs) of an example OFC unit time-locked to outcome for each state, where outcome time is measured as the time at which the rat enters the second-step outcome port. To differentiate between levels of expected outcome value, trials were split into terciles by value. Shaded areas are 95% confidence intervals around the mean. The greatest modulation by outcome value is seen in state 2. (ii) Average firing rate within the outcome window (defined from 1 second before until 3 seconds after outcome port entry) for each trial with a high expected outcome value throughout the session. Background shading corresponds to the inferred state. Shifts in response magnitude align with shifts in inferred state. B Population CPD of expected outcome value throughout a trial, centered around the ITI, computed for each state. CPDs were baseline-corrected by subtracting the mean CPD computed from circularly permuted datasets. Dots above each panel correspond to time points where a state’s CPD was significantly greater than the other states’ CPDs (blue: state 1 greater than states 2 and 3; orange: state 2 greater than states 1 and 3), i.e. when the CPD difference was greater than 95% of CPD differences computed from circularly permuted datasets. C Modulation of reward/omission encoding by state. (i) PSTHs of an example OFC unit time-locked to outcome for each state. To differentiate between responses to rewards and omissions, each PSTH is split by trials following omissions (dashed) and trials following rewards (solid). Shaded areas correspond to 95% confidence intervals around the mean. This unit shows the highest response to reward in state 1 and the lowest in state 3. (ii) Average firing rate within the outcome window (defined from 1 second before until 3 seconds after outcome port entry) for each rewarded trial throughout the session. Background shading corresponds to the inferred state. Shifts in response magnitude align with shifts in inferred state above a slow negative drift across the session. D The CPD of reward across all units throughout a trial, centered around the ITI, computed for each state and baseline-corrected as in (B). Dots above panels similarly represent time points where a state’s CPD is significantly greater than the other states’ (blue: state 1 greater than states 2 and 3; orange: state 2 greater than state 3).
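
A sketch of the circular-permutation baseline correction described in (B) and (D), under the assumption that the permutation is implemented as a random circular shift of the regressor of interest; the CPD itself is computed by ordinary least squares here purely for illustration.

```python
import numpy as np

def baseline_corrected_cpd(y, X, col, n_shifts=200, seed=0):
    """CPD of column `col` of design matrix X for response y, minus the mean CPD
    obtained after circularly shifting that column by random offsets (the
    permutation baseline), so slow drifts alone produce a corrected CPD near zero."""
    def cpd_of(x_col):
        Xf = X.copy(); Xf[:, col] = x_col
        Xr = np.delete(Xf, col, axis=1)
        sse = lambda M: np.sum((y - M @ np.linalg.lstsq(M, y, rcond=None)[0]) ** 2)
        sf, sr = sse(Xf), sse(Xr)
        return (sr - sf) / sr
    observed = cpd_of(X[:, col])
    rng = np.random.default_rng(seed)
    baseline = [cpd_of(np.roll(X[:, col], rng.integers(1, len(y)))) for _ in range(n_shifts)]
    return observed - np.mean(baseline)

# usage with placeholder outcome-window firing rates modulated by reward
rng = np.random.default_rng(5)
firing = rng.normal(size=400)                      # firing rate in the outcome window
reward = rng.integers(0, 2, 400).astype(float)
firing += 0.5 * reward                             # reward modulation
X = np.column_stack([np.ones(400), reward])
corrected = baseline_corrected_cpd(firing, X, col=1)
```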

Population CPD computed around the inter-trial interval (ITI) reveals significant encoding of state (gold) even after accounting for time (gray). CPDs were measured as the median CPD across all units, with error bars corresponding to bootstrapped 95% confidence intervals.
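
The bootstrapped confidence interval around the median (used here and in earlier figures) can be computed as follows; the placeholder values are illustrative.

```python
import numpy as np

def bootstrap_median_ci(values, n_boot=10000, ci=95, seed=0):
    """Bootstrapped confidence interval around the median: resample with
    replacement, take the median of each resample, and report percentiles."""
    rng = np.random.default_rng(seed)
    meds = np.median(rng.choice(values, size=(n_boot, len(values))), axis=1)
    lo, hi = np.percentile(meds, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return np.median(values), (lo, hi)

# usage with placeholder per-unit CPD values
unit_cpds = np.random.default_rng(6).gamma(2.0, 0.01, size=80)
med, (lo, hi) = bootstrap_median_ci(unit_cpds)
```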

Example and summary 3-state MoA-HMM fits for rats with electrophysiological recordings in OFC.