Dynamic reinforcement learning reveals time-dependent shifts in strategy during reward learning

  1. Princeton University
  2. Google DeepMind
  3. University College London
  4. Howard Hughes Medical Institute

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Editors

  • Reviewing Editor
    Naoshige Uchida
    Harvard University, Cambridge, United States of America
  • Senior Editor
    Michael Frank
    Brown University, Providence, United States of America

Reviewer #1 (Public Review):

Summary:

Motivated by the existence of different behavioral strategies (e.g. model-based vs. model-free), and potentially different neural circuits that underlie them, Venditto et al. introduce a new approach for inferring which strategies animals are using from data. In particular, they extend the mixture of agents (MoA) framework to accommodate the possibility that the weighting among different strategies might change over time. These temporal dynamics are introduced via a hidden Markov model (HMM), i.e. with discrete state transitions. The state transition probabilities and initial state probabilities are fit simultaneously with the MoA parameters, which include decay/learning rates and mixture weightings, using the EM algorithm. The authors test their model on data from Miller et al., 2017, 2022, arguing that this formulation leads to (1) better fits and (2) improved interpretability over their original model, which did not include the HMM portion. Lastly, they claim that certain aspects of OFC firing are modulated by the internal state as identified by the MoA-HMM.
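
To make the model structure concrete for readers unfamiliar with the MoA framework, the sketch below evaluates the kind of likelihood such a model would optimize. It is a minimal Python illustration under assumed simplifications (two hypothetical value-learning agents standing in for the model-based/model-free agents, two hidden states, invented names such as `agent_values` and `moa_hmm_loglik`), not the authors' implementation, and the EM updates that actually fit the parameters are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def agent_values(choices, rewards, alpha):
    """Toy value-learning 'agent': a running estimate for each of two actions,
    updated with learning rate alpha (a stand-in for MB/MF/perseveration agents)."""
    Q = np.zeros((len(choices) + 1, 2))
    for t, (c, r) in enumerate(zip(choices, rewards)):
        Q[t + 1] = Q[t]
        Q[t + 1, c] += alpha * (r - Q[t, c])
    return Q[:-1]  # values available at the time of each choice

def moa_hmm_loglik(choices, rewards, alphas, W, A, pi0):
    """Log-likelihood of a choice sequence under a mixture-of-agents HMM.
    alphas : per-agent learning rates, shape (n_agents,)
    W      : per-state mixture weights, shape (n_states, n_agents)
    A      : hidden-state transition matrix, shape (n_states, n_states)
    pi0    : initial state probabilities, shape (n_states,)"""
    n_states, n_agents = W.shape
    # Agent values are computed once; only their weighting is state-dependent.
    Qs = np.stack([agent_values(choices, rewards, a) for a in alphas])  # (n_agents, T, 2)
    log_fwd = np.log(pi0)
    for t, c in enumerate(choices):
        # Per-state choice log-probability: softmax of the weighted agent values.
        logp = np.array([np.log(softmax(W[k] @ Qs[:, t, :])[c])
                         for k in range(n_states)])
        if t == 0:
            log_fwd = log_fwd + logp
        else:  # forward recursion in log space
            log_fwd = np.logaddexp.reduce(log_fwd[:, None] + np.log(A), axis=0) + logp
    return np.logaddexp.reduce(log_fwd)

# Toy data and parameters: 200 trials, two agents, two hidden states.
choices = rng.integers(0, 2, 200)
rewards = rng.integers(0, 2, 200).astype(float)
alphas = np.array([0.1, 0.6])
W = np.array([[2.0, 0.0],      # state 1 leans on agent 1
              [0.0, 2.0]])     # state 2 leans on agent 2
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])   # "sticky" state transitions
pi0 = np.array([0.9, 0.1])
print(moa_hmm_loglik(choices, rewards, alphas, W, A, pi0))
```

The key structural point is that the agents' values are shared across states, while the per-state weights W and the transition matrix A govern how each agent's influence shifts over a session.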

Strengths:

The paper is very well written and easy to follow, especially for one with a significant modeling component. Furthermore, the authors do an excellent job explaining and then disentangling many threads that are often knotted together in discussions of animal behavior and RL: model-free vs. model-based choice, outcome vs. choice-focused, exploration vs. exploitation, bias, perseveration. Each of these concepts is quantified by particular parameters of their models. Model recovery (Fig. 3) is mostly convincing and licenses their fits to animal behavior later (although see below). While the specific claims made about behavior and neural activity are not especially surprising (e.g. the animals begin a session, in which rare vs. common transitions are not yet known, in a more exploratory mode), the MoA-HMM framework seems broadly applicable to other tasks in the field and useful for the purpose of quantification here.

Weaknesses:

The authors sometimes seem to equivocate about the extent to which they view their model as a neural (as opposed to merely behavioral) description. For example, they introduce their paper by citing work that views heterogeneity in strategy as the result of "relatively independent, separable circuits that are conceptualized as supporting distinct strategies, each potentially competing for control." The HMM, of course, also relates to internal states of the animal. Therefore, the reader might come away with the impression that the MoA-HMM is literally trying to model dynamic, competing controllers in the brain (e.g. basal ganglia vs. frontal cortex), as opposed to giving a descriptive account of their emergent behavior. If the former is really the intended interpretation, the authors should say more about how they think the weighting/arbitration mechanism between alternative strategies is implemented, and how it can be modulated over time. If not, they should make this clearer.

Second, while the authors demonstrate that model recovery recapitulates the weight dynamics and action values (Fig. 3), the actual parameters that are recovered are less precise (Fig. 3 Supplement 1). The authors should comment on how this might affect their later inferences from behavioral data. Furthermore, it would be better to quantify recovery using the R^2 score between simulated and recovered parameters, rather than the Pearson correlation (r), which doesn't enforce unity slope and zero intercept (i.e. the line that is plotted) and so will tend to exaggerate the strength of parameter recovery.
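
To illustrate the point about r versus an identity-line R^2, here is a small hypothetical example (invented data, not from the paper): a recovered parameter that is shrunk and offset relative to its simulated value can still correlate almost perfectly with it, while an R^2 computed about the identity line penalizes the mismatch.

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.uniform(0, 1, 100)                             # simulated parameters
recovered = 0.5 * true + 0.4 + rng.normal(0, 0.02, 100)   # shrinkage plus a constant bias

r = np.corrcoef(true, recovered)[0, 1]
# R^2 evaluated against the identity line (recovered == true), not the best-fit line
ss_res = np.sum((recovered - true) ** 2)
ss_tot = np.sum((true - true.mean()) ** 2)
r2_identity = 1 - ss_res / ss_tot

print(f"Pearson r         = {r:.3f}")             # ~0.99: recovery looks excellent
print(f"identity-line R^2 = {r2_identity:.3f}")   # far lower: slope and offset errors are penalized
```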

Finally, the authors are very aware of the difficulties associated with long-timescale (minutes) correlations with neural activity, including both satiety and electrode drift, so they do attempt to control for this using a third-order polynomial as a time regressor as well as interaction terms (Fig. 7 Supplement 1). However, on net there does not appear to be any significant difference between the permutation-corrected CPDs computed for states 2 and 3 across all neurons (Fig. 7D). This stands in contrast to the claim that "the modulation of the reward effect can also be seen between states 2 and 3 - state 2, on average, sees a higher modulation to reward that lasts significantly longer than modulation in state 3," which might be true for the neuron in Fig. 7C, but is never quantified. Thus, while I am convinced state modulation exists for model-based (MBr) outcome value (Fig. 7A-B), I'm not convinced that these more gradual shifts can be isolated by the MoA-HMM model, which is important to keep in mind for anyone looking to apply this model to their own data.
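
For readers who want to see how such a control behaves, the toy sketch below regresses a drifting firing rate onto third-order polynomial time terms, a state indicator, and their interactions, then computes a permutation-corrected coefficient of partial determination (CPD) for the state terms. All names and data are invented for illustration; this is not the authors' analysis pipeline, which may use a different regression model or permutation scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials = 400
time = np.linspace(0, 1, n_trials)
state = (time > 0.5).astype(float)                       # toy late-session state indicator
rate = 2.0 + 1.5 * time + rng.normal(0, 1.0, n_trials)   # drifting rate, no true state effect

def design(state, time):
    poly = np.column_stack([time, time**2, time**3])     # third-order time regressors
    return np.column_stack([np.ones_like(time), poly, state, poly * state[:, None]])

def sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def cpd(state, time, y):
    """CPD for the state terms, over and above the polynomial time terms."""
    full = design(state, time)
    reduced = np.column_stack([np.ones_like(time), time, time**2, time**3])
    return 1 - sse(full, y) / sse(reduced, y)

observed = cpd(state, time, rate)
# crude permutation correction: shuffle the state labels to build a null CPD distribution
null = np.array([cpd(rng.permutation(state), time, rate) for _ in range(500)])
print(f"CPD = {observed:.4f}, permutation mean = {null.mean():.4f}")
```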

Reviewer #2 (Public Review):

Summary:

This is an interesting and well-performed study that develops a new modeling approach (MoA-HMM) to simultaneously characterize reinforcement learning parameters of different RL agents, as well as latent behavioral states that differ in the relative contributions of those agents to the animal's choices. They performed this study in rats trained to perform the two-step task. While the major advance of the paper is developing and rigorously validating this novel technical approach, there are also a number of interesting conceptual advances. For instance, humans performing the two-step task are thought to exhibit a trade-off between model-free and model-based strategies. However, the MoA-HMM did not reveal such a trade-off in the rats, but instead suggested a trade-off between model-based exploratory vs. exploitative strategies. Additionally, the firing rates of neurons in the orbitofrontal cortex (OFC) reflected latent behavioral states estimated from the HMM, suggesting that (1) characterizing dynamic behavioral strategies might help elucidate neural dynamics supporting behavior, and (2) OFC might reflect the contributions of one or a subset of RL agents that are preferentially active or engaged in particular states identified by the HMM.

Strengths:

The claims were generally well-supported by the data. The model was validated rigorously and was used to generate and test novel predictions about behavior and neural activity in OFC. The approach is likely to be generally useful for characterizing dynamic behavioral strategies.

Weaknesses:

There were a lot of typos and some figures were mis-referenced in the text and figure legends.
