On the computational principles underlying human exploration

  1. The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem
  2. Yale School of Medicine
  3. Department of Cognitive Sciences, The Alexander Silberman Institute of Life Sciences, and The Federmann Center for the Study of Rationality

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Timothy Verstynen
    Carnegie Mellon University, Pittsburgh, United States of America
  • Senior Editor
    Timothy Behrens
    University of Oxford, Oxford, United Kingdom

Reviewer #1 (Public Review):

Summary:

Fox, Dan, and Loewenstein investigated how people explored six maze-like environments. They show that roughly one-third of their participants make choices now that increase the potential for future information gain, and that these participants temporally discount that potential gain according to how far in the future it lies. The authors argue that rather than valuing exploration in its own right, participant behavior is most consistent with using exploration as a way to reduce uncertainty. They then propose a reinforcement learning (RL) model in which agents estimate an "exploration value" (the expected cumulative information gained by taking a given action in a given state) using standard RL techniques for estimating value (expected cumulative reward). They find that this model exhibits several qualitative similarities with human behavior and that it best captures the temporal dynamics of human exploration when propagating information through the entire history of a behavioral episode (as opposed to propagating it only a single step, as some of the simplest RL models do).

While the core insight and basic method of the paper are compelling, the way in which both the behavioral experiment and the computational modeling were conducted raises concerns that mean that, in their present form, the results do not fully justify the conclusions. After resolving these issues, the work would demonstrate how human exploration is sensitive to long-range dependencies in information gain, and would offer valuable insights about how best to characterize this behavior computationally. I am not particularly well-versed in the literature on exploration, so I cannot comment on novelty here.

Strengths:
The entire paper is logically well-motivated. It builds on a valuable basic insight, namely that while bandit tasks are an ideally minimal platform for testing certain questions about decision-making and exploration, richer paradigms are needed to capture the long-range informational dependencies distinguishing between various approaches to exploration.

Even so, the maze navigation paradigm explored here remains simple. Participants navigate a maze with two main branches that are identical except for minimal, theoretically motivated differences, and those differences are designed to clearly and explicitly probe well-identified questions. The task, and really the entire paper, is clearly organized, and each component is logically connected to a larger argument.

The proposed model is also simple, clearly presented, and a clever way of applying ideas typically used to reason about reward-motivated behavior to reason here about information-motivated behavior.

One other strength of this work is that it combines behavioral experiments with computational modeling. This approach pairs a detailed, objectively specified theory (i.e., the model) with novel data specifically designed to test that theory, and thus in principle presents a particularly strong test of the authors' hypotheses.

Weaknesses:
Despite many strengths in the underlying logic of the paper, the presented evidence does not provide compelling support for the conclusions. In particular:

- The main claims are based on the behavior of 452 participants classed as good explorers, out of 1,052 participants included in the analyses and 1,336 participants who completed the study. That is, the authors' broad claims about human exploration rest on a third of their total sample; the other two-thirds displayed very different behavior, including 20% who performed at or below chance levels. So while a significant sub-population may demonstrate the claimed abilities, it is far from clear that these abilities are universal.

- While the experimental manipulations are elegant, the behavioral study seems underpowered. In each of the primary manipulations, key theoretical predictions are not statistically validated. For example, in Experiment 1, the preference for the right door increases from the 4:3 condition to the 5:2 condition, but, contrary to the prediction, not from the 5:2 condition to the 6:1 condition (Figure 1c). Similar patterns can be seen in the analyses in Figures 3b and 4b. Relatedly, the experiments comprised just 20 episodes, and it is unclear whether that was long enough for participants to reach asymptotic behavior (e.g. Figure 5b). Resolving this concern would require either more participants or larger differences between conditions (e.g. testing 9:8, 12:5, and 15:2 conditions in a revised Experiment 1), as well as a greater number of total episodes; a rough power calculation of the kind sketched after this list would indicate the sample sizes involved.

- The model is presented after the behavioral results, giving the impression that it was perhaps constructed to fit the data. No attempt is made to fit the model to a subset of the data and validate it on the rest, nor is any clear indication given as to how the model parameters were set. Moreover, as noted, even where the model is successful, it only explains the behavior of a minority of the total participants. No modeling work is done to explain the behavior of the other two-thirds of the participants.

- The authors helpfully discuss several meaningful alternative models of exploration, such as visit-counting and incorporating an objective function sensitive to information gain. They do not, however, compare their model against these or any other meaningful baselines. Moreover, the comparison between model and human participants is qualitative rather than quantitative. These issues could be resolved by a more rigorous analysis that quantitatively compares a variety of theoretically relevant models as explanations of the human data.
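Regarding the power concern above, a back-of-the-envelope two-proportion power analysis is easy to run; the door-choice proportions below are purely hypothetical, chosen only to illustrate the calculation, and do not come from the paper.

```python
# Rough power calculation for a difference in door-choice proportions between
# two conditions. The proportions are illustrative assumptions only.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.60, 0.52)  # hypothetical preferences in two conditions
n_per_condition = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative='two-sided')
print(round(n_per_condition))  # participants needed per condition
```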

Reviewer #2 (Public Review):

Summary:
In this article, the authors develop an algorithm for exploration inspired by the classic state-action-reward-state-action (SARSA) reinforcement learning algorithm. Designed to account for exploration in multi-state environments, the algorithm computes the expected discounted return from selecting an action in a state and uses that value to update the cached value of taking that action in that state. This value represents the uncertainty associated with the state, and the backed-up value is computed from the discounted future return plus the immediate reduction in uncertainty regarding the state.

Strengths:
The article is ambitious and seeks to characterize human exploration in a novel task in which no rewards are delivered. That characterization is useful.

Weaknesses:
The paper suffers from many problems. Here, I will mention three. First, the algorithm is very poorly motivated: exploration is central to many behaviors, but the algorithm computes the value of exploration independently of any long-run considerations of exploitation. Second, the article attempts to recover the observed exploratory behavior in two different multi-state choice tasks, but the algorithm does not explain that behavior, and there is no performance metric on the model, nor a comparison to other models. Third, the article frames the algorithm in terms of uncertainty, but there is no measure of uncertainty.

In short, in many ways this manuscript is 'half an article', and the authors have much work to do. They could decide to dive into the convergence proofs and other theoretical properties of the model; however, as far as I understand it, the model is literally an optimistic SARSA, whose characteristics are well understood. Or, they could compare the model's performance to a number of other exploration models (UCB, Thompson sampling, infomax, infotaxis; there are so many!). The authors need to choose one or the other, and I urge them to properly compare their model to other models.
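To be concrete about what such a comparison could involve, a generic visit-count UCB baseline is sketched below; the functional form and the bonus constant are my own illustrative choices, not anything taken from the manuscript.

```python
import numpy as np

def ucb_action(q, visits, t, c=1.0):
    # Generic UCB rule: value estimate plus an exploration bonus that shrinks
    # as an action is visited more often. Intended only as one of the baseline
    # models the authors could fit alongside their own.
    bonus = c * np.sqrt(np.log(t + 1.0) / (visits + 1.0))
    return int(np.argmax(q + bonus))

# With equal value estimates, the least-visited action is selected.
print(ucb_action(q=np.zeros(3), visits=np.array([5, 1, 3]), t=9))  # prints 1
```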

1. Motivation
The algorithm is poorly motivated. Exploration is valuable for a time but quickly becomes less valuable as more is learned about the environment. The algorithm attempts to account for this through the nature of the backup: the immediate outcome is -E(s,a), which represents a reduction in uncertainty, so in the long run the exploratory value decreases to zero. But this is ad hoc; why not add E(s,a)? In addition, exploration values are initialized to 1. This too is ad hoc; why should E(s,a) start at 1? The authors have cherry-picked their starting values and the nature of the backup to yield exploratory behavior.
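To make my reading of the backup concrete, a minimal sketch follows; the notation and the exact form of the update are my own reconstruction from the text, not necessarily the authors' implementation.

```python
import numpy as np

def update_exploration_value(E, s, a, s_next, a_next, alpha=0.1, gamma=0.9):
    # SARSA-style backup as I understand it: the 'immediate outcome' of taking
    # (s, a) is -E[s, a] (the putative reduction in uncertainty), plus the
    # discounted exploration value of the next state-action pair.
    target = -E[s, a] + gamma * E[s_next, a_next]
    E[s, a] += alpha * (target - E[s, a])
    return E

n_states, n_actions = 4, 2
E = np.ones((n_states, n_actions))  # the optimistic initialization at 1 questioned above
E = update_exploration_value(E, s=0, a=1, s_next=2, a_next=0)
```

Written this way, the drive to explore is plainly the optimistic starting value being eroded toward zero, which is the basis of my concern below about whether this constitutes a measure of uncertainty at all.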

2. Performance
The authors wish to compare the model's performance to observed exploration behavior. However, their model does a poor job of explaining that behavior, and, confusingly, the authors themselves note the ways in which the model deviates. There are two principal deviations. First, the model exhibits an exploratory transient, but it is too wide to match the human data. Second, the model fails to exhibit the low-level, persistent exploration characteristic of humans in their task.

The next natural step would be to augment the model in different ways to attempt to describe the behavior. The authors do attempt to import TD(λ) aspects into their exploration model, and they determine that this importation fails to capture the observed behavior. But why stop there? Why not follow through and change the model in a way that can capture the dynamics of exploration?

In addition, a natural complement would be to compare the model's ability to describe human performance with that of other models. This would require model fitting, recovery, and validation; however, the authors don't engage in that model-fitting exercise.

They note that a model-based learning strategy could account for the speed of learning in humans. However, they don't comment generally on how model-based strategies could explain their findings or how they relate to their model, and they should. In particular, the participants are likely learning a model of their environment, and this can be done using non-parametric Bayesian inference (along the lines of Gershman's or Collins's work). The authors should model their task using such approaches and compare them to their algorithm.

The authors state that there was no reward. Were subjects paid for their time? The lack of reward is also unusual, and participants may have been engaged in reward-seeking even if only implicitly. This matters especially from the perspective of computational RL: on that theory, the only object 'in' the agent is the policy, and everything else is considered 'in' the environment, so rewards need not come from environmental returns but could also originate inside the organism (even if modeled as 'outside' the agent in the RL framework). The authors should therefore model the behavior using pseudo-rewards to see whether that can account for their findings. Finally, though trivially, a reward of 0 is technically a reward, and the model's exploratory drive comes from settling on the true values of the states (i.e., 0).

3. Uncertainty
The authors frame their model in terms of uncertainty, but their model does not measure uncertainty at all. The model makes choices on the basis of optimistic initial Q-values and then searches on that basis, backing up the 0 rewards until the true values are more or less hit upon. But that is not a measure of uncertainty in any sense; rather, it is an optimistic Q-value bias that drives exploration. However, I may simply fail to understand their model.

Reviewer #3 (Public Review):

Summary:
In this article, Fox and colleagues describe the results of a novel and innovative task, coupled with a modified computational model, to explore pure directed exploration (not quite a pun, but intended nonetheless). In their task, participants make a series of discrete choices, importantly with no reward feedback, to navigate a nested series of rooms in a virtual environment. The initial two-door choice is used as the primary probe, and the complexity of the series of rooms behind each choice is the critical independent variable. The authors find that, as the number of follow-up options behind a door increases, "good" participants are more likely to choose the door that leads to the more complex choices. As the depth of the search increased (i.e. the room with the most doors was presented "farther" down the search), these same participants were less likely to choose the door leading to the more complex route. Finally, these same "good" participants showed an initial boost in preference for the more complex exploration option after a few learning episodes, which settled after about 10 episodes into a modest but reliable preference for the more complex route, reflecting the fact that information value decays over time in stable situations. Using an adaptation of standard Q-learning, with a proxy for information value substituted for reward value, the authors show how their model can qualitatively capture most of the observed experimental effects, although with some critical differences in the temporal dynamics of learning, suggesting that the memory horizon for humans is longer than in the adapted Q-learning model.

Strengths:
1. Clever experimental design
The novel task is really clever and gets around many of the limitations that have plagued prior research on directed exploration (which typically involves some use of reward feedback). Finding a way to provide direct information that can be experimentally manipulated, without needing to provide any explicit reward feedback, makes this one of the few pure exploration tasks that I am aware of.

2. Compelling results
The effect of manipulating choice complexity and depth on initial choice probability for "good" directed learners seems fairly strong, as do the learning dynamics. The heterogeneity in exploration style across participants is also interesting and brings up more questions that are useful for follow-up research.

3. Simple model
The computational model used is a simple adaptation of standard reinforcement learning models, specifically Q-learning. This is elegant because it doesn't require major changes in the dynamics of learning, simply a revision of the variables going into the update. The simplicity of this change, coupled with the ability to capture the results of the "good" directed explorers, makes a strong case that information-seeking and reward-seeking may share common underlying mechanisms (as shown previously by Kobayashi, K., & Hsu, M. (2019). Common neural code for reward and information value. Proceedings of the National Academy of Sciences, 116(26), 13061-13066).

Weaknesses:

1. "Good" vs. "poor"
There is an odd circularity, and an implicit value judgment, in the classification of participants into "good" and "poor" directed explorers. The logic, based on the visit-counter model of directed exploration, is that the probability of repeating a choice (at the initial decision trial) should be lower for directed explorers than for random explorers. The median split on repetition probability seems intuitively fine here, but it raises two issues. First, the labels "good" vs. "poor" seem arbitrarily judgmental; after all, random exploration is a viable exploration strategy in many contexts. Would "directed" vs. "random" be more appropriate labels, given how the categorization decision was made? Second, how much of the "good" participants' performance is driven by the extreme non-repeaters? For example, if a tertile split were performed instead of a binary median split, would the middle group show a weaker version of the effects seen in the "good" group, or would it appear more like the "poor" group?
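The tertile analysis I am suggesting is straightforward to run; the sketch below uses placeholder repetition probabilities rather than the authors' data.

```python
import numpy as np

rng = np.random.default_rng(0)
p_repeat = rng.uniform(0, 1, size=1052)  # placeholder per-participant repetition probabilities

# Current analysis: binary median split.
median_group = np.where(p_repeat < np.median(p_repeat), "directed", "random")

# Suggested analysis: tertile split, exposing the middle group whose behavior
# is the open question raised above (0 = lowest, 2 = highest repetition).
tertile_group = np.digitize(p_repeat, np.quantile(p_repeat, [1/3, 2/3]))
```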

2. Characterization of information value
The authors primarily discuss methods that can be summarized by visit counters as a description of all directed exploration models. However, that doesn't seem to be a good summary of the overall literature in this space. There are also entropy-based approaches that quantify information value based on the statistics of the feedback; in machine learning, for example, measures like the KL divergence are often used to represent the information value of a channel. A few such papers are highlighted below, followed by a minimal sketch of what such an entropy-based measure could look like. Now, it is entirely possible that these approaches can be extrapolated to simple visit-count approaches, but I am unaware of anything showing this. I think it would be good to broaden the discussion on directed exploration models beyond visit-counter methods like UCB, highlighting the other methods used to promote directed exploration.

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., & Abbeel, P. (2016). Vime: Variational information maximizing exploration. Advances in neural information processing systems, 29.

Eysenbach, B., Gupta, A., Ibarz, J., & Levine, S. (2018). Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070.

Hazan, E., Kakade, S., Singh, K., & Van Soest, A. (2019, May). Provably efficient maximum entropy exploration. In International Conference on Machine Learning (pp. 2681-2691). PMLR.
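For concreteness, one entropy-based measure of this kind scores each observed transition by how far it moves a Dirichlet-categorical belief over next states; the sketch below is purely illustrative, and all modeling choices in it are my own assumptions rather than anything proposed in the paper.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha_post, alpha_prior):
    # KL( Dir(alpha_post) || Dir(alpha_prior) ), standard closed form.
    a, b = np.asarray(alpha_post, float), np.asarray(alpha_prior, float)
    return (gammaln(a.sum()) - gammaln(a).sum()
            - gammaln(b.sum()) + gammaln(b).sum()
            + np.dot(a - b, digamma(a) - digamma(a.sum())))

def info_gain(counts, observed_next_state):
    # Information value of one observed transition: how far the posterior
    # belief over next states moves from the prior.
    prior = np.asarray(counts, dtype=float)
    post = prior.copy()
    post[observed_next_state] += 1.0
    return dirichlet_kl(post, prior)

print(info_gain([1.0, 1.0, 1.0], 2))  # larger when the current belief is weaker
```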

3. Model vetting
The model used to simulate the behavioral results is interesting and intuitive. However, some things seem to be left on the table and unresolved. First, the definition of information value (E) being maximized is assumed to satisfy the same constraints that typical reward does in the Bellman solution for reinforcement learning; this is the only way it can be substituted into the standard Q-learning update. Is that true here?

Second, the advantage of these simpler computational-level models is that they can be effectively fit to behavior. The model outlined in the paper has only a few free parameters (some of which can be fixed for convenience). Was there an attempt to fit the model to each participant's data? This would be a powerful way of highlighting exactly where the differences between the "good" and "poor" participants arise.
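As an illustration of what such a fit could look like, each participant's initial-door choices could be fit by maximum likelihood under a softmax over the model's exploration values; everything in the sketch below (the single inverse-temperature parameter, the variable names, and the placeholder data) is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_log_likelihood(params, choices, door_values):
    # choices[t] is the door chosen on episode t; door_values[t] are the
    # model-derived exploration values of the two doors on that episode.
    beta = params[0]                      # inverse temperature (free parameter)
    logits = beta * door_values
    log_p = logits - logsumexp(logits, axis=1, keepdims=True)
    return -log_p[np.arange(len(choices)), choices].sum()

# Placeholder data for one participant: 20 episodes, 2 doors.
rng = np.random.default_rng(1)
door_values = rng.uniform(0, 1, size=(20, 2))
choices = rng.integers(0, 2, size=20)

fit = minimize(neg_log_likelihood, x0=[1.0], args=(choices, door_values),
               bounds=[(0.0, 20.0)])
print(fit.x)  # best-fitting inverse temperature for this participant
```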
