Dynamic reinforcement learning reveals time-dependent shifts in strategy during reward learning

  1. Princeton University
  2. Google DeepMind
  3. University College London
  4. Howard Hughes Medical Institute

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Editors

  • Reviewing Editor
    Naoshige Uchida
    Harvard University, Cambridge, United States of America
  • Senior Editor
    Michael Frank
    Brown University, Providence, United States of America

Reviewer #1 (Public Review):

Summary:

Motivated by the existence of different behavioral strategies (e.g. model-based vs. model-free), and potentially different neural circuits that underlie them, Venditto et al. introduce a new approach for inferring which strategies animals are using from data. In particular, they extend the mixture of agents (MoA) framework to accommodate the possibility that the weighting among different strategies might change over time. These temporal dynamics are introduced via a hidden Markov model (HMM), i.e. with discrete state transitions. The state transition probabilities and initial state probabilities are fit simultaneously with the MoA parameters, which include decay/learning rates and mixture weightings, using the EM algorithm. The authors test their model on data from Miller et al., 2017, 2022, arguing that this formulation leads to (1) better fits and (2) improved interpretability over their original model, which did not include the HMM portion. Lastly, they claim that certain aspects of OFC firing are modulated by the internal state as identified by the MoA-HMM.
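
To make the model structure concrete for readers unfamiliar with the MoA framework, the sketch below evaluates the kind of likelihood such a model would optimize. It is a minimal Python illustration under assumed simplifications (two hypothetical value-learning agents standing in for the model-based/model-free agents, two hidden states, invented names such as `agent_values` and `moa_hmm_loglik`), not the authors' implementation, and the EM updates that actually fit the parameters are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def agent_values(choices, rewards, alpha):
    """Toy value-learning 'agent': a running estimate for each of two actions,
    updated with learning rate alpha (a stand-in for MB/MF/perseveration agents)."""
    Q = np.zeros((len(choices) + 1, 2))
    for t, (c, r) in enumerate(zip(choices, rewards)):
        Q[t + 1] = Q[t]
        Q[t + 1, c] += alpha * (r - Q[t, c])
    return Q[:-1]  # values available at the time of each choice

def moa_hmm_loglik(choices, rewards, alphas, W, A, pi0):
    """Log-likelihood of a choice sequence under a mixture-of-agents HMM.
    alphas : per-agent learning rates, shape (n_agents,)
    W      : per-state mixture weights, shape (n_states, n_agents)
    A      : hidden-state transition matrix, shape (n_states, n_states)
    pi0    : initial state probabilities, shape (n_states,)"""
    n_states, n_agents = W.shape
    # Agent values are computed once; only their weighting is state-dependent.
    Qs = np.stack([agent_values(choices, rewards, a) for a in alphas])  # (n_agents, T, 2)
    log_fwd = np.log(pi0)
    for t, c in enumerate(choices):
        # Per-state choice log-probability: softmax of the weighted agent values.
        logp = np.array([np.log(softmax(W[k] @ Qs[:, t, :])[c])
                         for k in range(n_states)])
        if t == 0:
            log_fwd = log_fwd + logp
        else:  # forward recursion in log space
            log_fwd = np.logaddexp.reduce(log_fwd[:, None] + np.log(A), axis=0) + logp
    return np.logaddexp.reduce(log_fwd)

# Toy data and parameters: 200 trials, two agents, two hidden states.
choices = rng.integers(0, 2, 200)
rewards = rng.integers(0, 2, 200).astype(float)
alphas = np.array([0.1, 0.6])
W = np.array([[2.0, 0.0],      # state 1 leans on agent 1
              [0.0, 2.0]])     # state 2 leans on agent 2
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])   # "sticky" state transitions
pi0 = np.array([0.9, 0.1])
print(moa_hmm_loglik(choices, rewards, alphas, W, A, pi0))
```

The key structural point is that the agents' values are shared across states, while the per-state weights W and the transition matrix A govern how each agent's influence shifts over a session.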

Strengths:

The paper is very well written and easy to follow, especially for one with a significant modeling component. Furthermore, the authors do an excellent job explaining and then disentangling many threads that are often knotted together in discussions of animal behavior and RL: model-free vs. model-based choice, outcome vs. choice-focused, exploration vs. exploitation, bias, perseveration. Each of these concepts is quantified by particular parameters of their models. Model recovery (Fig. 3) is mostly convincing and licenses their fits to animal behavior later (although see below). While the specific claims made about behavior and neural activity are not especially surprising (e.g. the animals begin a session, in which rare vs. common transitions are not yet known, in a more exploratory mode), the MoA-HMM framework seems broadly applicable to other tasks in the field and useful for the purpose of quantification here.

Weaknesses:

The authors sometimes seem to equivocate about the extent to which they view their model as a neural (as opposed to merely behavioral) description. For example, they introduce their paper by citing work that views heterogeneity in strategy as the result of "relatively independent, separable circuits that are conceptualized as supporting distinct strategies, each potentially competing for control." The HMM, of course, also relates to internal states of the animal. Therefore, the reader might come away with the impression that the MoA-HMM is literally trying to model dynamic, competing controllers in the brain (e.g. basal ganglia vs. frontal cortex), as opposed to giving a descriptive account of their emergent behavior. If the former is really the intended interpretation, the authors should say more about how they think the weighting/arbitration mechanism between alternative strategies is implemented, and how it can be modulated over time. If not, they should make this clearer.

Second, while the authors demonstrate that model recovery recapitulates the weight dynamics and action values (Fig. 3), the actual parameters that are recovered are less precise (Fig. 3 Supplement 1). The authors should comment on how this might affect their later inferences from behavioral data. Furthermore, it would be better to quantify recovery using the R^2 score between simulated and recovered parameters, rather than the Pearson correlation (r), which doesn't enforce unity slope and zero intercept (i.e. the line that is plotted) and so will tend to exaggerate the strength of parameter recovery.
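
To illustrate the point about r versus an identity-line R^2, here is a small hypothetical example (invented data, not from the paper): a recovered parameter that is shrunk and offset relative to its simulated value can still correlate almost perfectly with it, while an R^2 computed about the identity line penalizes the mismatch.

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.uniform(0, 1, 100)                             # simulated parameters
recovered = 0.5 * true + 0.4 + rng.normal(0, 0.02, 100)   # shrinkage plus a constant bias

r = np.corrcoef(true, recovered)[0, 1]
# R^2 evaluated against the identity line (recovered == true), not the best-fit line
ss_res = np.sum((recovered - true) ** 2)
ss_tot = np.sum((true - true.mean()) ** 2)
r2_identity = 1 - ss_res / ss_tot

print(f"Pearson r         = {r:.3f}")             # ~0.99: recovery looks excellent
print(f"identity-line R^2 = {r2_identity:.3f}")   # far lower: slope and offset errors are penalized
```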

Finally, the authors are very aware of the difficulties associated with long-timescale (minutes) correlations with neural activity, including both satiety and electrode drift, so they do attempt to control for this using a third-order polynomial as a time regressor as well as interaction terms (Fig. 7 Supplement 1). However, on net there does not appear to be any significant difference between the permutation-corrected CPDs computed for states 2 and 3 across all neurons (Fig. 7D). This stands in contrast to the claim that "the modulation of the reward effect can also be seen between states 2 and 3 - state 2, on average, sees a higher modulation to reward that lasts significantly longer than modulation in state 3," which might be true for the neuron in Fig. 7C, but is never quantified. Thus, while I am convinced state modulation exists for model-based (MBr) outcome value (Fig. 7A-B), I'm not convinced that these more gradual shifts can be isolated by the MoA-HMM model, which is important to keep in mind for anyone looking to apply this model to their own data.
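
For readers who want to see how such a control behaves, the toy sketch below regresses a drifting firing rate onto third-order polynomial time terms, a state indicator, and their interactions, then computes a permutation-corrected coefficient of partial determination (CPD) for the state terms. All names and data are invented for illustration; this is not the authors' analysis pipeline, which may use a different regression model or permutation scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials = 400
time = np.linspace(0, 1, n_trials)
state = (time > 0.5).astype(float)                       # toy late-session state indicator
rate = 2.0 + 1.5 * time + rng.normal(0, 1.0, n_trials)   # drifting rate, no true state effect

def design(state, time):
    poly = np.column_stack([time, time**2, time**3])     # third-order time regressors
    return np.column_stack([np.ones_like(time), poly, state, poly * state[:, None]])

def sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def cpd(state, time, y):
    """CPD for the state terms, over and above the polynomial time terms."""
    full = design(state, time)
    reduced = np.column_stack([np.ones_like(time), time, time**2, time**3])
    return 1 - sse(full, y) / sse(reduced, y)

observed = cpd(state, time, rate)
# crude permutation correction: shuffle the state labels to build a null CPD distribution
null = np.array([cpd(rng.permutation(state), time, rate) for _ in range(500)])
print(f"CPD = {observed:.4f}, permutation mean = {null.mean():.4f}")
```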

Reviewer #2 (Public Review):

Summary:

This is an interesting and well-performed study that develops a new modeling approach (MoA-HMM) to simultaneously characterize reinforcement learning parameters of different RL agents, as well as latent behavioral states that differ in the relative contributions of those agents to the animal's choices. They performed this study in rats trained to perform the two-step task. While the major advance of the paper is developing and rigorously validating this novel technical approach, there are also a number of interesting conceptual advances. For instance, humans performing the two-step task are thought to exhibit a trade-off between model-free and model-based strategies. However, the MoA-HMM did not reveal such a trade-off in the rats, but instead suggested a trade-off between model-based exploratory vs. exploitative strategies. Additionally, the firing rates of neurons in the orbitofrontal cortex (OFC) reflected latent behavioral states estimated from the HMM, suggesting that (1) characterizing dynamic behavioral strategies might help elucidate neural dynamics supporting behavior, and (2) OFC might reflect the contributions of one or a subset of RL agents that are preferentially active or engaged in particular states identified by the HMM.

Strengths:

The claims were generally well-supported by the data. The model was validated rigorously and was used to generate and test novel predictions about behavior and neural activity in OFC. The approach is likely to be generally useful for characterizing dynamic behavioral strategies.

Weaknesses:

There were a lot of typos and some figures were mis-referenced in the text and figure legends.
