When faced with a novel setting, animals and humans explore their environment. Such exploration is essential for learning which actions are beneficial for the organism and which should be avoided. The speed of learning, and even the learning outcome, crucially depends on the “quality” of that exploration: for example, if as a result of poor exploration some actions are never chosen, their effects are never observed, and hence cannot be learned. More generally, a fundamental difference between learning by trial and error and Supervised Learning scenarios is that in the latter, the distribution of examples is controlled by the “teacher”, whereas in the former, the distribution of examples that the agent gets to observe depends on the agent’s own behavioral policy. Therefore, in order to successfully learn a good policy by trial and error, agents need to take into account uncertainty when choosing actions, reflecting the fact that the observations collected so far might misrepresent the actual quality of the different actions.

Learning by trial and error is often abstracted in the framework of the computational problem of Reinforcement Learning (RL) (Sutton and Barto, 2018): An agent makes sequential decisions in an unknown environment; at each time-step, it observes the current state of the environment, and chooses an action from a set of possible actions. In response to this action, the environment transfers the agent to the next state, and provides a reward signal (which can also be zero or negative). The ultimate goal of the agent is to learn how to choose actions – i.e., learn a policy – so as to maximize some performance metric, typically the expected cumulative reward.

Exploration algorithms in RL differ in the particular way they address uncertainties. Random exploration, in which a random component is added to the policy (e.g., a policy otherwise maximizing based on current estimates) is, arguably, the simplest way of incorporating exploration. By adding randomness, the agent is bound to eventually accumulate information about all states and actions. More sophisticated exploration methods, referred to as directed exploration (Thrun, 1992), attempt to identify and actively choose the specific actions that will be more effective in reducing uncertainty. To do that, the agent needs to track and update some estimate or measures of uncertainty associated with different actions. For example, the agent can use visit-counters: keep track of the number of times each action was chosen in each state, and prioritize those actions that have previously been neglected (Auer et al., 2002; Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017).

The intuition behind counter-based methods can be made precise in the important case of Multi-Armed Bandit problems (or bandit problems, for short). In a k-armed bandit, the environment is characterized by a single state and k actions (“arms”), each associated with a reward distribution. Because these distributions are unknown, and feedback (i.e., a sample from the distribution) is given only for the chosen arm at each trial, exploration is needed to guarantee that the best arm (i.e., the one associated with the highest expected reward) is identified. Bandit problems are theoretically well-understood, with various algorithms having optimality guarantees, under some statistical assumptions (for a comprehensive review see Lattimore and Szepesvári, 2020). Particularly, counter-based methods (e.g., UCB, Auer et al., 2002) can be shown to explore optimally in bandit tasks, in the online-learning sense of minimizing regret.
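As an illustration of how a counter-based rule turns visit counts into an exploration bonus, the following is a minimal sketch of the UCB1 selection step (Auer et al., 2002). The function name and argument layout are illustrative assumptions, not taken from this study.

```python
import math

def ucb1_choice(counts, means, t):
    """Pick an arm by the UCB1 rule.

    counts[i]: number of times arm i was chosen so far
    means[i]:  empirical mean reward of arm i
    t:         total number of trials so far (t >= 1)
    """
    # Any arm that has never been tried is chosen first.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Otherwise add a bonus that shrinks as the arm's counter grows.
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal empirical means, the arm with the smaller counter receives the larger bonus and is selected, which is the sense in which the counter acts as a measure of uncertainty.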

Human exploration has been studied extensively in bandit and bandit-like problems (Shteingart et al., 2013; Wilson et al., 2014; Mehlhorn et al., 2015; Gershman, 2018; Schulz et al., 2020). Because these are arguably the simplest form of RL problems, they offer a clean and potentially well-controlled framework for experiments (Fox et al., 2020). The strong theoretical foundations are another appeal for experimental work, because behavior can be compared with well-defined algorithms, and, potentially, also with an optimal solution.

However, generalizing conclusions about human exploration from behavior in bandit tasks to behavior in more complex environments is not trivial. In a bandit task, an action that was chosen fewer times is, everything else being equal, more valuable from an exploratory standpoint than one that was chosen more often. By contrast, visit-counters alone might be a poor measure of uncertainty in complex environments, because they completely ignore future consequences of the actions (Figure 1a). Indeed, the limitations of naive counter-based exploration in structured and complex environments have been discussed in the machine learning literature, and different exploration schemes that take into account the long-term exploratory consequences of actions have been proposed (Storck et al., 1995; Meuleau and Bourgine, 1999; Osband et al., 2016a,b; Chen et al., 2017; Fox et al., 2018).

Directed exploration in complex environments.

(a) In a bandit problem (left), actions have no long-term consequences. In complex environments (right), actions have long-term consequences as particular actions might lead, in the future, to different parts of the state-space. In this example, these parts (shaded areas) are of different size. As a result, the local visit-counters are no longer a good measure of uncertainty. In this example, a2 should be, in general, chosen more often compared to a1 in order to exhaust the larger uncertainty associated with it. (b) Participants were instructed to navigate through a maze of rooms. Each room was identified by a unique background image and a title. To move to the next room, participants chose between the available doors by mouse-clicking. Background images and room titles (Armenian letters) were randomized between participants, and were devoid of any clear semantic or spatial structure. (c) The three maze structures in Experiment 1 (Top) have a root state S (highlighted in yellow) with two doors. They differ in the imbalance between the number of doors available in future rooms MR and ML (nR : nL – 4:3, 5:2, 6:1). Consistent with models of directed exploration that take into account long-term consequences of actions, and unlike counter-based models, participants exhibited bias towards room MR, deviating from a uniform policy (Bottom, bars and error-bars denote mean and 95% confidence interval of pR; number of participants: n = 161; 120; 137. Statistical significance, here and in following figures: * : p < 0.05, ** : p < 0.01; *** : p < 0.001).

Our goal here is to study the extent to which human exploration is sensitive to long-term consequences of actions, as opposed to counter-based exploration. Crucially, this question cannot be addressed in the common bandit problems paradigm, because general exploration algorithms are reduced to counter-based methods when they are faced with a bandit problem. Thus, even if humans do (approximately) use some general, beyond visit-counters, directed exploration strategies, they will likely manifest as counter-based strategies in bandit tasks. Therefore, we set out to study exploration in a novel task that addresses these issues. First, we show that humans take into account the long-term exploratory consequences of their actions when exploring complex environments (Experimental results). Next, we model this exploration using an RL-like algorithm, in which agents learn exploratory “action-values” and use these values to guide their exploration (Computational modeling).


Experimental results

Sensitivity to future consequences of actions

To test the hypothesis that human exploration is sensitive to the long-term consequences of actions, we conducted an experiment that formalizes the intuition presented in the Introduction (see Figure 1a). In the experiment (denoted as “Experiment 1”), participants were instructed to explore a novel environment, a maze of rooms, by navigating through the doors connecting those rooms (Figure 1b). Each room was identified by a unique background, a title, and the number of doors in that room. No reward was given in this task, but participants were instructed to “understand how the rooms are connected” (see Methods). Testing participants in a task devoid of a clear goal and of rewards is somewhat unorthodox. We return to this point in the Discussion section.

Three groups of participants were tested, each in a different maze, as described in Figure 1c (top): In all mazes, there was a start room (S) with two doors, each leading to a different room. One of these rooms, a multi-action room (MR), was endowed with nR doors, while the other, denoted as ML, was endowed with nL doors. All three mazes were unbalanced, in the sense that nR > nL. Between the different mazes, we varied nR − nL, while keeping nR + nL = 7 constant. The locations of the doors leading to MR and ML were counterbalanced across participants. For clarity of notation, we refer to them as “right” and “left”, respectively. All remaining rooms were endowed with only a single door. After going through these single-door rooms, a participant would reach a common terminal room (T). There, they were informed that they had reached the end of the maze, and they were then transported back to S. Overall, each participant visited S (of the one particular environment they were assigned to) 20 times.

Since there was no reward, all choices in this task are exploratory. If participants’ exploration is driven by visit-counters, then we expect that the frequencies with which they choose each of the doors in S, denoted pR and pL, would be equal. By contrast, if they take into consideration the long-term consequences of their actions, then we would expect them to choose the right door more often (resulting in pR > pL). In line with the hypothesis that participants are sensitive to the long-term consequences of their actions, we found that averaged over all participants in the three conditions, pR > pL (pR = 0.54, 95% confidence interval: pR ∈ [0.518, 0.563]). Considering each group of participants separately, a significant bias in favor of pR was observed in the 6:1 (pR = 0.572, n = 137, 95% CI: [0.528, 0.617]) and the 5:2 groups (pR = 0.549, n = 120, 95% CI: [0.506, 0.592]), but not in the 4:3 group (pR = 0.507, n = 161, 95% CI: [0.472, 0.541]).

We hypothesized that the larger the imbalance (nR − nL), the stronger will be the bias towards MR (larger pR). To test this hypothesis, we compared the biases of participants in the different groups (Figure 1c). As expected, the average pR in the 5:2 and 6:1 groups was significantly larger than that of the 4:3 group (p < 0.05 and p < 0.01 respectively, permutation test, see Methods). The average pR in the 6:1 group was larger than that of the 5:2 group. However, this difference was not statistically significant (p = 0.17).
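For readers unfamiliar with permutation tests, the comparison between two groups can be sketched generically as follows. This is a one-sided test on the difference of group means, written under the assumption that each participant contributes a single pR value; the exact procedure used in the study is the one described in its Methods, and the function name and defaults here are hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """One-sided permutation p-value for mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        # Shuffle group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_permutations + 1)
```

Here `group_a` and `group_b` would hold the per-participant pR values of the two conditions being compared (for example, 6:1 versus 4:3).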

The results depicted in Figure 1c indicate that on average, human participants are sensitive to the exploratory long-term consequences of their actions. Considering individual participants, however, there was substantial heterogeneity in the biases exhibited by the different participants. While some chose the right door almost exclusively, others favored the left door. We next asked whether some of this heterogeneity across participants reflects more general individual differences in exploratory strategies, which would also manifest in their exploration in other states. To test this hypothesis, we focused on state MR. In this state, exploration is also required because there are nR different alternatives to choose from. However, unlike in state S, these alternatives do not, effectively, have long-term consequences. As such, choosing an action in MR is a bandit-like task. Therefore, directed exploration in MR is expected to be driven by visit-counters, such that participants would equalize the number of times each door in MR is selected. Note that this is not a strong prediction, because random exploration will, on average, also equalize the number of choices of each door. Yet, directed and random exploration have diverging predictions with respect to the temporal pattern of choices in MR. Specifically, with pure directed exploration (that is driven by visit-counters), participants are expected to avoid choosing the same door that they chose the last time that they visited MR. Consequently, the probability of repeating the same choice in consecutive visits of MR, which we denote by prepeat, is expected to vanish. By contrast, random exploration predicts that prepeat = 1/nR. Figure 2 (Top) depicts the histograms (over participants) of prepeat in the three experimental conditions, demonstrating that participants exhibited substantial variability in prepeat.
While for some participants prepeat was close to 0, as predicted by pure directed exploration, for others it was similar to 1/nR, as predicted by random exploration. Many other participants exhibited prepeat that was even larger than 1/nR, indicating that, potentially, choice bias and / or momentum also influenced choices in the task. Based on the predictions of directed and random exploration, we divided participants into two groups, depending on the quality of exploration in MR: “good” directed explorers, in which prepeat < 1/nR, and “poor” directed explorers, in which prepeat ≥ 1/nR (Figure 2 Top, dots and diagonal stripes, respectively).
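The repetition measure and the resulting split can be sketched as follows. This is a minimal illustration assuming that a participant's choices at MR are available as an ordered list of door indices; the function names are hypothetical, and the handling of participants with fewer than two visits is an assumption of the sketch.

```python
def p_repeat(choices):
    """Fraction of consecutive visits to MR in which the previous door was chosen again."""
    if len(choices) < 2:
        return None  # not defined without at least two visits
    repeats = sum(1 for prev, cur in zip(choices, choices[1:]) if prev == cur)
    return repeats / (len(choices) - 1)

def classify_explorer(choices, n_doors):
    """Label a participant 'good' if p_repeat < 1/n_doors, else 'poor'."""
    value = p_repeat(choices)
    if value is None:
        return None
    return "good" if value < 1.0 / n_doors else "poor"
```

For example, a participant who cycles through four doors without repeating has prepeat = 0 and is labeled “good”, whereas one who alternates in pairs (door 0, 0, 1, 1) has prepeat = 2/3 and is labeled “poor”.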

Heterogeneity in exploration strategies.

Top: Histograms of prepeat at state MR (highlighted in yellow) for participants in the three conditions of Experiment 1 (left to right: nR = 4, 5, 6). Dashed vertical line represents the value expected by chance, 1/nR. Based on their prepeat values, we divided participants into “good” and “poor” directed explorers (dotted and striped patterns, respectively; “good” explorers proportion: 40%, 44%, 51%). Bottom: Histograms of pR at state S (highlighted in yellow), for the “good” and “poor” directed explorers groups.

Is the quality of directed exploration in the bandit-like task of state MR informative about directed exploration in S? To address this question, we computed the histograms of pR separately for the “good” and “poor” directed explorers (Figure 2 Bottom). Averaging within each group we found that indeed, pR among the “poor” explorers was not significantly different from chance in any of the three conditions (Figure 3a), consistent with the predictions of random exploration. By contrast, among “good” explorers, there was a significant bias in the 5:2 (pR = 0.597, n = 53, 95% CI: [0.537, 0.652]) and the 6:1 (pR = 0.612, n = 71, 95% CI: [0.544, 0.678]) groups (Figure 3b). These findings show that participants that avoid repetition in the bandit task are also more sensitive to the long-term exploratory consequences of their actions. We conclude that those participants who tend to perform good directed exploration in MR also perform good directed exploration in S. Crucially, the implementation of directed exploration in the two states is rather different. In MR, where different actions have no long-term consequences, “good” explorers rely on visit-counters that are the relevant measure of uncertainty, resulting in an overall uniform choice. By contrast in S, actions do have long-term consequences, and “good” explorers go beyond the visit-counters, biasing their choices in favor of the action associated with more future uncertainty.

“Poor” and “good” directed explorers.

Choice biases at state S (pR) analyzed separately for “poor” and “good” explorers (striped and dotted patterns; divided based on their exploration in MR, see Figure 2) in the 3 conditions of Experiment 1. While behavior of the “poor” explorers was not significantly different from chance (consistent with the prediction of random exploration), “good” explorers in the nR = 5, 6 conditions exhibited significant bias towards “right”. Bars and error bars denote mean and 95% confidence interval of pR; number of participants n = 95; 66, 67; 53, 66; 71 (“poor”; “good”).

Temporal discounting

In the previous section we showed that if the future exploratory consequences of the actions are one trial ahead, humans are sensitive to these consequences. It is well known that in humans and animals, the value of a reward is discounted with its delay (Vanderveldt et al., 2016). We hypothesized that similar temporal discounting will manifest in evaluating the exploratory “usefulness” of actions. To test this prediction, we conducted Experiment 2 on a new set of participants. Similar to Experiment 1, Experiment 2 consisted of 3 different maze structures. The imbalance between the number of possible outcomes was kept fixed across the 3 mazes, at nR = 5 and nL = 2. However, the depth at which these outcomes occur, relative to the root state S, varied from 1 (as in Experiment 1) to 3 (Figure 4, Top). The depth of MR determines the delay between the choice made at S and its exploratory benefit. In the presence of temporal discounting of exploration, we therefore expect pR to decrease with the depth of MR.

Temporal discounting of exploratory consequences.

The three mazes in Experiment 2 (Top) had the same imbalance (nR = 5, nL = 2), however we varied the depth of MR (and ML) relative to the root state S (left to right: depth = 1, 2, 3). “Poor” and “good” directed explorers (striped and dotted patterns, respectively) were divided by their prepeat value at MR (same as in Experiment 1, see Figure 2). Bars and error-bars denote mean and 95% confidence interval of pR. Number of participants n = 99; 92, 121; 84, 153; 85 (“poor”; “good”).

To test this prediction, we divided participants into “good” and “poor” directed explorers, as in Experiment 1, based on the value of prepeat in MR. As depicted in Figure 4, both the “poor” and “good” explorers exhibited a bias in favor of “right” in S. For the “good” explorers, a larger delay was also associated with a smaller bias.

The dynamics of exploration

So far, we have demonstrated that human participants exhibit directed exploration in which they take into consideration the future exploratory consequences of their actions. To better understand the computational principles underlying this directed exploration, we revisit the question of why explore in the first place. One possible answer to this question is that exploration is required for learning. According to this view, actions are favorable from an exploratory point of view when they are associated with, or lead to other actions associated with, high uncertainty, missing knowledge, and other related quantities (Schmidhuber, 1991; Still and Precup, 2012; Little and Sommer, 2014; Houthooft et al., 2016; Pathak et al., 2017; Burda et al., 2019). An alternative, which has received some attention in the machine learning literature, is that exploration could be driven by its own normative objective (Machado and Bowling, 2016; Hazan et al., 2019; Zhang et al., 2020; Zahavy et al., 2021). For example, such an objective could be to maximize the entropy of the discounted distribution of visited states and chosen actions (Hazan et al., 2019). Experimentally, the difference between the two approaches will be particularly pronounced towards the end of a long experiment. When all states and actions have been visited sufficiently many times, everything that can be learned has already been learned. Thus, if the goal of exploration is to facilitate learning, then exploratory behavior is expected to fade over time. By contrast, if exploration is driven by a normative objective, then we generally expect behavior to converge to one that (approximately) maximizes this objective, and hence to remain exploratory asymptotically.

Considering Experiments 1 and 2 specifically, we do not expect any bias in S (pR = 0.5) at the beginning of the task, because participants are naive and are unaware of the different long-term consequences of the two actions. With time and learning, we expect participants to favor MR over ML (pR > 0.5). This prediction holds whether participants are driven by the goal of reducing the (long-term) uncertainty associated with MR, or by the goal of optimizing some exploration objective, such as matching the number of choices per door in MR and ML. In other words, both approaches predict that with time, pR will increase. With more time elapsing, however, the predictions of the two approaches diverge. As uncertainty decreases, uncertainty-driven exploration predicts a decay of pR to its baseline value. By contrast, the normative approach predicts that pR will converge to a steady-state.

Figure 5 depicts the temporal dynamics of pR (t), as a function of the number of times t that S was visited (defined as “episodes”). The learning curves are shown separately for the “poor” (Figure 5a) and “good” (Figure 5b) explorers, averaged over all 6 conditions of Experiments 1 and 2. As expected, there was no preference in the first episodes. However, with time, the participants developed a bias in favor of MR, which was more pronounced in the “good” directed explorers group. In this group, participants exhibited a significant bias, pR (t) > 0.5 from the 3rd episode. Notably, this increased bias was followed by a decrease to a steady state bias value (episodes 10 − 20). This steady state value was lower than its peak transient value (consistent with uncertainty-driven exploration), but was higher than baseline level before learning (consistent with a normative exploration objective).

Learning dynamics.

Bias towards MR as a function of training episode (pR (t)), averaged over participants in all 6 conditions (Experiments 1 & 2), shown for the “poor” (a) and “good” (b) groups. The “good” explorers exhibited a transient peak in pR (t), consistent with models of uncertainty-driven exploration. However, the steady-state value was still slightly larger than chance, consistent with an objective-driven exploration component. Dots and shaded areas denote mean and 95% confidence interval of pR (t).

Computational modeling

The model

Together, the two experiments of the previous sections provide us with the following insights: (1) Human exploration is affected by long-term consequences of actions (Figure 1c); (2) Both the number of future states and their depth affect this exploration (Figure 3 and Figure 4); and finally, (3) Exploration dynamics peak transiently and then decay, consistent with uncertainty-driven exploration (Figure 5).

In theorizing about effective exploration we have alluded to concepts such as “exploratory value” or “usefulness” of particular actions, but did not provide a precise working definition for it. In this section we consider a specific computational model for directed exploration, and test this model in view of these experimental findings. The model is a general-purpose algorithm for directed exploration, which formalizes the intuition that the challenge of exploration in complex environments is analogous to the standard credit-assignment problem in RL (in the reward-maximization sense).

According to the model, the agent observes the current state of the environment s at each time-step and chooses an action a from the set of possible actions. In response to this action, the environment transfers the agent to the next state s′, at which the agent chooses action a′. Each state-action pair (s, a) is associated with an exploration value, denoted E (s, a) (Fox et al., 2018). These exploration values represent a current estimate of “missing knowledge”, such that a high value indicates that further exploration of that action is beneficial. At the beginning of the process, E-values are initialized to a positive constant (specifically E = 1), representing the largest possible missing knowledge. Each transition from s, a to s′, a′ triggers an update to E (s, a) according to the following update rule:

$$\Delta E(s, a) = \eta \left( -E(s, a) + \gamma E(s', a') \right) \tag{1}$$

In words, the change in E (s, a) is a sum of two contributions. The first, −E (s, a), is the immediate reduction in the uncertainty regarding state s and action a due to the current visit of that state-action. The second, γE (s′, a′) represents future uncertainty propagating back to (s, a). This second part is weighted by a discount-factor parameter, 0 ≤ γ ≤ 1. The overall update magnitude is controlled by a learning-rate parameter 0 < η < 1. In the particular case that s′ is a terminal state, its exploration value is always defined as 0.
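A minimal sketch of this update, reading the verbal description above as E (s, a) ← E (s, a) + η (−E (s, a) + γE (s′, a′)). The table layout, function names, and default parameter values (taken from the simulations reported later in the text) are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def make_e_table(initial_value=1.0):
    """E-values for all state-action pairs, initialized to 1 (maximal missing knowledge)."""
    return defaultdict(lambda: initial_value)

def update_e_value(E, s, a, s_next, a_next, eta=0.9, gamma=0.6, terminal=False):
    """Apply one 1-step update to E[(s, a)] after the transition (s, a) -> (s_next, a_next)."""
    # The exploration value of a terminal state is defined as 0.
    future = 0.0 if terminal else E[(s_next, a_next)]
    # Immediate reduction (-E) plus discounted future uncertainty (gamma * E').
    E[(s, a)] += eta * (-E[(s, a)] + gamma * future)
    return E[(s, a)]
```

For example, with η = 0.5 and γ = 0.5, a first visit to a state-action whose successor still has E = 1 lowers its E-value from 1 to 0.75, whereas a first visit that leads directly to the terminal state lowers it to 0.5.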

To complete the model specification, we define the policy as derived directly from these exploration values. We use a standard softmax policy, in which the probability of choosing an action a in state s is given by:

$$\pi(a \mid s) = \frac{e^{\beta E(s, a)}}{\sum_{\tilde{a}} e^{\beta E(s, \tilde{a})}} \tag{2}$$

where β ≥ 0 is a gain parameter. A gain value of β = 0 corresponds to random exploration, with all actions chosen at equal probability, while a positive gain corresponds to (stochastically) preferring actions associated with a larger E-value (and hence higher uncertainty).
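The softmax rule over E-values can be sketched as follows. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math
import random

def softmax_policy(e_values, beta):
    """Choice probabilities proportional to exp(beta * E) for the actions of one state."""
    m = max(e_values)
    weights = [math.exp(beta * (e - m)) for e in e_values]
    total = sum(weights)
    return [w / total for w in weights]

def choose_action(e_values, beta, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_policy(e_values, beta)
    return rng.choices(range(len(e_values)), weights=probs, k=1)[0]
```

With β = 0 every action receives the same probability regardless of its E-value, while with β > 0 the action with the larger E-value is preferred.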

Conceptually, this model is similar to standard RL algorithms (specifically the SARSA algorithm, Rummery and Niranjan, 1994) that are used to account for operant learning in animals and humans. There, a similar update rule is used to learn the expected discounted sum of future rewards (and a similar rule is assumed for action-selection). Therefore, cognitive mechanisms similar to those that account for operant learning can account for this type of directed exploration (at least to the extent that standard RL models are indeed a good description of operant learning; see Mongillo et al., 2014; Fox et al., 2020).

To gain insight into the properties of the E-values, we consider first the case of “infinite” discounting, namely γ = 0. In that case, the update rule of Equation 1 becomes:

$$\Delta E(s, a) = -\eta E(s, a)$$

and hence, after n visits of (s, a), the associated E-value is E (s, a) = (1 − η)^n, such that − log E ∝ n. In other words, when γ = 0, and long-term consequences are completely ignored, the E-value is effectively a visit-counter.
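The visit-counter reading can be made explicit with a short worked derivation. It assumes, as stated in the text, that E-values start at 1 and that with γ = 0 each visit multiplies E (s, a) by (1 − η); the subscript n, counting visits, is notation introduced here for illustration.

```latex
\begin{align*}
E_{0}(s,a) &= 1, \\
E_{n}(s,a) &= (1-\eta)\,E_{n-1}(s,a) = (1-\eta)^{n}, \\
-\log E_{n}(s,a) &= n \cdot \bigl(-\log(1-\eta)\bigr).
\end{align*}
```

Since 0 < η < 1, the factor −log(1 − η) is a positive constant, so −log E grows linearly with the number of visits. For instance, with η = 0.9, E = 0.1 after one visit and 0.01 after two.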

When γ > 0, the change in the value of E (s, a) following a visit of (s, a) is more complex. In addition to the decay term, a term that is proportional to E (s′, a′) is added to E (s, a). Notably, E (s′, a′) depends on the number of past visits of (s′, a′) (as well as on those of its own future state-actions (s″, a″), and so on). Consequently, the number of actual visits that is required to reduce the E-values by a given amount is larger in state-actions leading to many future states than in state-actions leading to fewer future states. In that sense, the E-values are a generalization of visit-counters.

Finally (and regardless of the value of γ), the softmax policy of Equation 2 favors actions associated with larger E-values. Because choosing these actions will generally lead to a reduction in their associated E-values, the result will be a policy that effectively attempts to equalize the E-values of all available actions (within a given state). In the case of γ = 0, this will result in a preference toward those actions that were chosen less often. In the case of γ > 0, it will result in a preference that is also sensitive to (the number of) future potential states reachable through the different actions.

To conclude, the model therefore encapsulates the three principles identified in human behavior – it propagates information to track long-term uncertainties associated with individual state-actions, it temporally discounts future exploratory consequences, and it uses estimated uncertainties to derive a behavioral policy.

Directed-exploration in the maze task

We now return to the maze task and study the behavior of the model there. In state MR, where the E-values correspond to visit-counters, the attempt to equalize the E-values will result in a bias against repeating the same action, yielding a low prepeat value and, on average, a uniform policy. To demonstrate this, we simulated behavior of the model in the 3 conditions of Experiment 1. Indeed, as depicted in Figure 6a, the values of prepeat in the simulations were smaller than chance-level. Unlike the population of human participants, simulated agents are more homogeneous, as reflected in the narrower histograms of prepeat. This is due to the fact that the model is designed to perform directed exploration, that is, to model the behavior of the “good” directed explorers. Nevertheless, the model can also produce random exploration if the gain parameter is set to β = 0 (see also Discussion).

Simulation results.

Simulating behavior of the E-values model (Equations 1 and 2) reproduces the main findings of directed exploration in the maze task. (a) In MR, the model exhibits directed exploration which manifests in low values of prepeat (shown for the 3 conditions of Experiment 1; dashed line denotes chance-level expected for random exploration, 1/nR). (b) In the environments of Experiment 1, agents exhibited bias towards MR that increased with imbalance of nR : nL, reflecting the propagation of long-term uncertainties over states. (c) In the environments of Experiment 2, the bias decreased with depth, reflecting temporal discounting. (d) Bias towards MR peaks transiently, followed by a decay to baseline at steady-state, as expected from uncertainty-driven exploration (average results over all 6 environments). Results are based on 3,000 simulations in each environment. Bars and histograms in (a)-(c) are shown for the first 20 episodes for comparison with the behavioral experiments. Error bars are negligible and therefore are not shown. Model parameters: η = 0.9, β = 5, γ = 0.6.

More interesting is the behavior of the model in state S. The larger nR, the smaller will be the decay of E (s = S, a = right) per a single visit of (s = S, a = right). Therefore, the model will tend to choose “right” more often (pR > 0.5), a bias that is expected to increase with nR. Indeed, similar to the behavior of the “good” human explorers, the simulated agents exhibited a preference towards “right” in S, a preference that increased with nR − nL (Figure 6b).

The model is sensitive to long-term consequences because it propagates future uncertainty, from the next visited state-action back to the current state-action. This future uncertainty, however, is weighted by γ < 1, such that the effect of further away states on E (s, a) is expected to decrease with distance. In the environments of Experiment 2, where we manipulated the depth of MR (relative to S), this will result in a decrease of the bias (pR) at S, as demonstrated in Figure 6c.

Because the policy in the model is derived from the E-values, the temporal pattern of exploration is expected to be transient. In the first episodes, when E (s = S, a = right) = E (s = S, a = left), the result is pR = 0.5. With sufficient learning, exploration values of all visited state-actions decay to 0 and in this limit, pR = 0.5 as well. Therefore, we expect the learning dynamics to exhibit a transient increase in bias, followed by a decay back to chance level. This is demonstrated in Figure 6d where we plot pR (t), averaged over the simulations of the model in all six conditions of Experiments 1 and 2.
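The following sketch simulates the model at state S in an Experiment 1-style maze. It is an illustration under stated assumptions, not the authors' simulation code: each door in MR or ML is assumed to lead to exactly one single-door room and then to the terminal room T, the 1-step update is read as E ← E + η (−E + γE′), and all function and variable names are hypothetical. Default parameters follow the values reported in the text (η = 0.9, β = 5, γ = 0.6).

```python
import math
import random
from collections import defaultdict

def softmax_choice(values, beta, rng):
    """Sample an index with probability proportional to exp(beta * value)."""
    m = max(values)
    weights = [math.exp(beta * (v - m)) for v in values]
    return rng.choices(range(len(values)), weights=weights, k=1)[0]

def simulate_agent(n_right, n_left, episodes=20, eta=0.9, beta=5.0,
                   gamma=0.6, rng=None):
    """Return a list of booleans: did the agent choose 'right' at S in each episode?"""
    if rng is None:
        rng = random.Random()
    E = defaultdict(lambda: 1.0)          # all E-values start at 1
    n_doors = {"MR": n_right, "ML": n_left}
    chose_right = []

    def update(sa, next_sa):
        # next_sa is None when the transition leads to the terminal room (E = 0).
        future = 0.0 if next_sa is None else E[next_sa]
        E[sa] += eta * (-E[sa] + gamma * future)

    for _ in range(episodes):
        a0 = softmax_choice([E[("S", "right")], E[("S", "left")]], beta, rng)
        side = "right" if a0 == 0 else "left"
        room = "MR" if side == "right" else "ML"
        door = softmax_choice([E[(room, d)] for d in range(n_doors[room])],
                              beta, rng)
        leaf = (room, door, "leaf")       # assumed single-door room behind each door
        update(("S", side), (room, door))
        update((room, door), (leaf, 0))
        update((leaf, 0), None)
        chose_right.append(a0 == 0)
    return chose_right

def mean_p_right(n_right, n_left, n_agents=2000, seed=0, **kwargs):
    """Average probability of choosing 'right' at S, per episode, over many agents."""
    rng = random.Random(seed)
    runs = [simulate_agent(n_right, n_left, rng=rng, **kwargs)
            for _ in range(n_agents)]
    return [sum(run[t] for run in runs) / n_agents for t in range(len(runs[0]))]
```

Calling, for example, `mean_p_right(6, 1)` gives the per-episode curve for a 6:1 maze, and increasing `episodes` allows inspecting the later part of the dynamics. Quantitative values depend on the assumed maze depth and need not match Figure 6.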

Qualitatively, the transient dynamics resemble the experimental results (Figure 5b). However, there are two important differences. First, while the human participants exhibited what seems like a steady-state bias even at the end of the experiment, pR in the model decays to chance level. As discussed above, the decay to chance in the simulations is expected because exploration in the model is uncertainty-driven. In the framework of this model, steady-state exploration can be achieved if we assume that β is not stationary, but rather increases over episodes. However, we hypothesize that to capture this aspect of humans’ exploration, we may need to go beyond this class of uncertainty-driven models. Second, the transient dynamics of the model are longer than those of the human participants. While the learning speed in the model is largely controlled by the learning-rate parameter η, the value of η cannot by itself explain this gap. This is because in the model η < 1, and the dynamics cannot be arbitrarily fast. In particular, in the simulations of Figure 6d we used a large learning-rate of η = 0.9, but learning was still considerably slower compared to human participants. We further discuss the issue of learning speed in the next section.

Learning dynamics: 1-step updates and trajectory-based updates

To learn to prefer “right” in S, the agent needs to learn that this action leads, in the future, to MR, which from an exploratory point of view is superior to ML. This kind of learning of delayed outcomes is typical of RL problems, in which the agent needs to learn that the value of a particular action stems from its consequences, which can be delayed. For example, an action may be valuable because it leads to a large reward, even if this reward is delayed. In the RL literature this is known as the credit assignment problem, because during learning, upon observing a desired outcome (in “standard” RL, getting a large reward; here, arriving at MR), the agent needs to properly assign credit to the past actions that have led to this outcome.

RL algorithms typically address the credit assignment problem by propagating information about the reward backwards through sequences of visited states and actions (Sutton, 1988; Watkins and Dayan, 1992; Dayan, 1992). According to some RL algorithms, the information about the reward propagates backwards one state at a time. By contrast, in other algorithms, a trace of the entire trajectory is maintained, allowing the information to “jump” backwards over a large number of states and actions. We refer to these alternatives as 1-step and trajectory-based updates, respectively.

The E-values model can be understood as an RL algorithm that propagates visitation information (rather than reward information). Specifically, it uses 1-step updates (Equation 1) such that with each observation (a transition of the form s, a, s′, a′) only immediate information, from (s′, a′), is used to update the exploration value of (s, a). With 1-step updates it takes time (episodes) for information from MR to reach back to S. We hypothesized that this reliance on 1-step updates might be an important source of the difference in learning speed between the model and humans, who might use more temporally-extended learning rules. To test this, we considered an extension to the exploration model in which E-values are learned using a trajectory-based update rule. Technically, this corresponds to changing the TD algorithm of Equation 1 to a TD (λ) algorithm (see Methods, Algorithm 1). Simulating this extended model we found that, similar to the original model, it reproduces the main experimental findings (Figure S1, compare with Figure 6). Moreover, as predicted, learning is faster than in the original model (Figure S1d, compare with Figure 6d). Nevertheless, even this faster learning is still slower than the rapid learning observed in human participants, suggesting further components of human learning that are not captured by either of the models (we return to this point in the Discussion).

Another way of distinguishing between 1-step and trajectory-based updates is to consider the predictions they make in Experiment 2. Recall that the three conditions in Experiment 2 differ in the delay (in the sense of number of states) between S and MR. If information (about the exploratory “value” of MR) propagates one step at a time, then the time it takes to learn that “right” is preferable in S will increase with the delay: it will be shortest in Condition 1, in which MR and ML are merely one step ahead of S, and longest in Condition 3, in which MR and ML are three steps away from S (Figure 7, top left). By contrast, if information about MR and ML can “jump” directly to S within each episode, as in trajectory-based updates, learning speed will be comparable in all three conditions (Figure 7, top right). A more thorough analysis of the model dependence on the parameters γ and λ is depicted in Figure S2. Finally, Figure 7 (bottom) depicts the learning dynamics of the “good” human explorers, analyzed separately in the three conditions of Experiment 2. We did not find evidence supporting the hypothesis that learning time increases with depth. These results further support the hypothesis that human learning relies on more global, temporally-extended update rules in which information can “jump” backwards over several states and actions.
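The contrast between the two update rules can be illustrated with a deliberately simplified sketch. It is not the simulation used for Figure 7: each branch is traversed in isolation, doors at the final room are chosen round-robin rather than by the model's policy, the terminal state is assumed to have an E-value of 0, and the door counts (3 versus 1) are arbitrary. The function names are hypothetical. The sketch reports the first traversal after which the E-value of the first action differs between the two branches.

```python
def branch_e_values(depth, n_doors, n_visits, eta=0.9, gamma=0.6, lam=0.0):
    """E-value of the first action of a branch after each of n_visits traversals.

    The branch consists of `depth` single-action steps (the first one taken at S)
    leading to a final room M with `n_doors` doors, all ending the episode.
    """
    chain = [1.0] * depth        # E-values along the way to M, initialized at 1
    doors = [1.0] * n_doors      # E-values of the doors at M, initialized at 1
    history = []
    for visit in range(n_visits):
        door = visit % n_doors   # round-robin door choice (a simplification)
        path = [(chain, i) for i in range(depth)] + [(doors, door)]
        traces = [0.0] * len(path)          # eligibility traces, reset per episode
        for t, (vals, idx) in enumerate(path):
            if t + 1 < len(path):
                nxt_vals, nxt_idx = path[t + 1]
                next_e = nxt_vals[nxt_idx]
            else:
                next_e = 0.0                # assumed E-value of the terminal state
            delta = gamma * next_e - vals[idx]           # TD error with r = 0
            traces = [gamma * lam * z for z in traces]   # decay older traces
            traces[t] = 1.0
            for k in range(t + 1):          # with lam = 0 only step t is updated
                v, i = path[k]
                v[i] += eta * delta * traces[k]
        history.append(chain[0])
    return history

def first_divergence(depth, lam, n_visits=10):
    """First traversal after which the two branches differ at their first action."""
    right = branch_e_values(depth, n_doors=3, n_visits=n_visits, lam=lam)
    left = branch_e_values(depth, n_doors=1, n_visits=n_visits, lam=lam)
    for visit, (r, l) in enumerate(zip(right, left), start=1):
        if abs(r - l) > 1e-12:
            return visit
    return None

for depth in (1, 2, 3):
    print(depth, first_divergence(depth, lam=0.0), first_divergence(depth, lam=0.6))
```

Under these simplifications, with λ = 0 the difference reaches the first action only at traversal depth + 1, whereas with λ = 0.6 it appears at the second traversal for all three depths.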

1-step backups and trajectory-based updates.

Learning dynamics simulated by the E-values model using the 1-step backup learning rule of TD (0) (Equation 1; top left) and the trajectory-based learning rule TD (λ) (Methods, Algorithm 1; top right) in the 3 environments of Experiment 2. With TD (0), the depth of MR relative to S (depth = 1, 2, 3) affects both the peak value of pR (t) (due to temporal discounting) and the time it takes the model to learn (due to the longer sequence of states over which the information has to be propagated). By contrast, with TD (λ), different depths result in a different maximum bias (due to temporal discounting), but the learning time is comparable (because information is propagated over multiple steps in each update). For the same reason, learning is overall faster with TD (λ). In humans (bottom), peak bias decreased with depth (consistent with temporal discounting), but there was no noticeable difference in learning speed (consistent with trajectory-based updates). Learning curves of human participants are shown with a moving average of 3 episodes. Dots and shaded areas denote means and 70% confidence intervals of pR (t). Model results are averaged over 30,000 simulations; model parameters: η = 0.9, β = 5, γ = 0.6, and λ = 0.6 (for the TD (λ) model).


Exploration is a wide phenomenon that has been linked to different aspects of behavior, including foraging (Mobbs et al., 2018; Kolling and Akam, 2017), curiosity (Gottlieb and Oudeyer, 2018), and creativity (Hart et al., 2018). In this study, we focused on exploration as part of learning. For that, we use the framework of RL, in which exploration is an essential component. Particularly, we study the computational principles underlying human exploration in complex environments – sufficiently complex such that exploration per se requires learning, due to delayed and long-term consequences of actions. Our approach builds on the analogy between the challenges of learning to explore, and the challenges of learning to maximize reward – the latter being the standard RL scenario. In both cases, the agent needs to represent information, propagate it, and use it to choose actions. In the former case it is information about uncertainty and in the latter it is information about expected reward.

We found that while exploring in complex environments, humans are sensitive to long-term consequences of actions and not only to local measures of uncertainty. Moreover, such long-term exploratory consequences are temporally discounted, similar to the discounting of future rewards. Finally, the dynamics of exploration are consistent with the predictions of uncertainty-driven exploration, in which directed exploratory behavior peaks transiently, and then decays to a more random exploration (presumably when most of the uncertainty has been resolved). To account for these experimental results, we introduce a computational model that uses an RL-like learning rule implementing the aforementioned principles. In the model, information about state-action visits, rather than about reward as in standard RL algorithms, is propagated (and discounted) over sequences of visited state-actions. This results in a set of “exploration values” (analogous to reward-based values) which are then used to choose actions.

Directed exploration beyond bandit tasks   Previous studies have identified some components of directed exploration in human behavior using bandit tasks (Wilson et al., 2014; Gershman, 2018, 2019), particularly, the use of counter-based methods such as Upper Confidence Bounds (UCB, Auer et al., 2002). Going beyond the bandit, we were able to show that these counter-based strategies might be a special-case implementation (appropriate for bandit tasks) of more general principles. To study and identify these principles, it is therefore necessary to test human exploration in environments that are more complex than the bandit task. Indeed, a more recent study has shown that more general principles might underlie human exploration, both random and directed, in sequential tasks (Wilson et al., 2020). However, unlike our experiments, in that study actions did not have long-term consequences in the sense of state transitions. Finally, the necessity of going beyond simple bandit tasks is not unique to the study of exploration. It is also present when studying other components of RL algorithms underlying operant learning. For example, it is impossible to distinguish in a bandit task between model-based and model-free RL, because there is no “model” to be learned in those tasks (Daw et al., 2011).

Non-stationary aspects of exploration   While the analogy between learning to explore and learning to maximize rewards is a useful one, there are some important differences. One difference is that while in RL, rewards (more precisely, the distribution thereof) are typically assumed to be Markovian and stationary, exploration has a fundamental non-stationary nature. This is due to the fact that if exploration is interpreted as part of the learning process, or is uncertainty driven, then the exploratory “reward” from a given state-action will decrease over time, because uncertainty will reduce with visits of that state-action. This non-stationarity poses a challenge for exploration algorithms. The E-values model circumvents that by assuming a stationary (and constant) zero fictitious “reward”, combined with an optimism bias at initialization (Fox et al., 2018).

A different solution to the challenge of non-stationarity is to posit an exploration objective function which is by itself independent of learning. The predictions of the two classes of models differ with respect to the expected steady-state behavior. In the former, exploration will diminish over time while in the latter, it will be sustained. The observation that human participants maintain a preference (albeit relatively small) for “right” even at the end of the experiment suggests that human exploration is driven, at least in part, by more than just uncertainty. A more complete characterization of these two components will be an interesting topic for future work.

Pure-exploration and the role of reward   It has long been argued that at least part of human and animal behavior is driven by intrinsic motivation, which is largely independent of external rewards (Oudeyer and Kaplan, 2009; Barto, 2013). Pure exploration tasks can be used to characterize aspects of such intrinsic motivation. In this study, the “desire” to visit less-visited states is one such intrinsic motivation factor. Additional factors that are based on information-theoretic quantities (Still and Precup, 2012; Little and Sommer, 2014; Houthooft et al., 2016) or prediction errors of non-reward signals (Pathak et al., 2017; Burda et al., 2019) have also been proposed in the literature. While many of these will, in general, be correlated, and hence difficult to identify experimentally, we believe that future studies of pure-exploration in complex environments will allow us to better relate these concepts, mostly discussed in the theoretical and computational literature, to the learning and behavior of humans and animals.

To dissect the exploratory component of behavior, we focused on a pure-exploration, reward-free task. This allowed us to neutralize the exploration-exploitation dilemma, focusing on the unique challenges for exploration itself. More generally, we expect the identified exploration principles to be relevant also in the reward maximization scenario. Indeed, it has been shown theoretically and empirically that the naive use of counter-based methods (or other “local” exploration techniques) can be highly sub-optimal for learning an optimal policy (in the reward maximization sense) in complex environments (Osband et al., 2016a,b; Chen et al., 2017; Fox et al., 2018; Oh and Iyengar, 2018). How humans deal with the exploration-exploitation dilemma in complex environments is an important open question.

Implications for neuroscience   Algorithms such as TD-learning hold considerable sway in neuroscience. For example, it is generally believed that dopaminergic neurons encode reward prediction errors, which are used for learning the “values” of states and actions (Schultz et al., 1997; Glimcher, 2011, but see also Elber-Dorozko and Loewenstein, 2018). More recent studies suggest that, in fact, the brain maintains separate representations of different reward dimensions (Smith et al., 2011; Grove et al., 2022). Given that our formalism of uncertainty (E-values) is identical to that of other types of value, it would be interesting to test whether the representation of uncertainty in the brain is similar to that of other reward types. For example, one could ask whether dopaminergic neurons also represent the equivalent of a TD error for E-values. Along the same lines, it would be interesting to check whether the finding that dopaminergic neurons encode what seem to be reward-independent features of the task (Engelhard et al., 2019) can be better understood assuming that uncertainty is a reward-like measure.

Heterogeneity   There was substantial heterogeneity among participants in both Experiments 1 and 2. We used this heterogeneity to divide participants into “good” and “poor” explorers in terms of the “directedness” of their exploration. However, this division is somewhat crude. For example, while the bias in favor of MR was smaller in the “poor” explorers, it was still larger than the baseline level of 0.5 predicted by truly random exploration (Figure 5a). This separation can be understood as a first approximation, highlighting the more prominent source of exploratory behavior at the level of the individual participant. Moreover, even within the “good” explorers, there was considerable variability. Heterogeneity in the parameters of the computational model can, perhaps, explain some of the heterogeneity, but parameter variability alone (within the E-values model) certainly cannot explain all of the heterogeneity in participants’ behavior. For example, consider again the division into “poor” and “good” directed explorers. In principle, such a division could be modeled through the gain parameter β, with random explorers having a value of β = 0 (and directed explorers a value of β > 0). Even with random exploration, the model prediction for prepeat is 1/nR. By contrast, many participants exhibited values of prepeat larger than this chance level, all the way up to prepeat = 1. Similarly, considering behavior at S as measured by pR, no combination of model parameters predicts pR values that are smaller than 0.5. This is because even random exploration will result in pR = 0.5. Values of pR that are close to 1 are also impossible in the model, because they imply under-exploration of the left-hand side of the maze. Yet some human participants exhibited extreme (close to 0 or 1) values of pR. Other factors, such as (task-independent) choice bias (Baum, 1974; Laquitaine et al., 2013; Lebovich et al., 2019) and a tendency to repeat actions (Urai et al., 2019), are likely to contribute to participants’ choices.

Learning speed   Another limitation of the model is the gap between the learning speed of human participants and the learning speed of the model. Overall, humans learned considerably faster than the model, even with a large learning rate. On average, participants exhibited a bias as soon as the 3rd episode, which is faster than the theoretical limit possible for the TD(0) model in this task. While some of this discrepancy can be attributed to the model’s reliance on 1-step backups, it is noteworthy that even in comparison with TD(λ), humans’ learning is faster than that of the model. The rapid learning in humans suggests mechanisms that go beyond simple model-free learning as implemented in our models. In our model, the fact that “right” is favorable can only be learned implicitly, by actually visiting more unique states following MR (compared to ML). This is because the only information that is available to the agent is the identity of states and actions. By contrast, a single visit to each of MR and ML is likely sufficient for humans to learn that the number of doors in MR is larger than in ML, a fact which can by itself bias their subsequent choices in favor of “right”. Indeed, by using this (possibly salient) feature, the number of doors, as an explicit part of the state representation, one could infer that MR is more favorable than ML already after 2 episodes, even with model-free learning. While such a strategy is not as general as the computational principles encapsulated by our models, in the specific task at hand it will be rather effective. The ability of humans to rapidly form and utilize such heuristics and generalizations is likely an important part of their ability to rapidly adapt and learn in novel situations. The interplay between basic, more general-purpose, computational principles, and heuristic, more ad-hoc, principles remains an important challenge for computational modeling in the cognitive sciences.

Generalization, priors, and “natural” exploration   The goal of this study was to identify computational principles underlying exploration in a “general” setting. To that end, we used a task in which the semantic content attached to states was minimal, with no a-priori indication of any structure (temporal, geometric, spatial, etc.) of the state-space. The motivation behind this design was to de-emphasize, as much as possible, behavior components stemming from participants’ prior knowledge and generalization abilities, and focus on core exploratory strategies. This also justified the models that we used: general-purpose, simplistic, learning models that operate on an abstract notion of states and actions. On the other hand, the abstract design of the task limits its applicability to more realistic tasks and natural behavior. Indeed, in complex environments, it has been demonstrated that humans rely largely on both priors and generalizations to achieve efficient learning and exploration (Dubey et al., 2018; Schulz et al., 2020). How such priors, semantic knowledge, and generalization interact with more abstract and general principles of exploration and decision-making is an important open question. Notably, we have found that humans are capable of performing directed exploration of complex environments even in the absence of a readily-available semantic structure to guide their exploration. This is in contrast to the recent work of Brändle et al. (2022), who demonstrated directed exploration (interpreted as driven by the information-theoretic quantity of empowerment) in complex environments with available semantic structure, but did not observe it in a structurally identical task in which the semantic structure was masked.


Online experiments and data collection

The study was approved by the Hebrew University Committee for the Use of Human Subjects in Research. Participants were recruited using the Amazon Mechanical Turk online platform, and were randomly assigned to one of the conditions in each experiment. Participants were instructed to “understand how the rooms are connected”, and were informed regarding the test phase: “At the end of the task, a test will check how quickly can you get from one specific room to a different one.”. The training phase of the experiment consisted of 120 trials, corresponding to 20 episodes. Between 20% and 30% of participants (depending on the experiment and condition) performed a longer experiment of 250 trials, corresponding to 42 episodes, but for these participants only the first 20 episodes were analyzed. The end of each episode (reaching the terminal state T) was signaled by a message screen (“Youv’e reached a dead-end room, and will be moved back to the first room”). After the training episodes, there was a test phase in which participants were asked to navigate to a target room in the minimal number of steps possible, starting from a particular start room (which was not the initial state S). An online working copy of the experiment can be accessed at:

For each participant, we recorded the sequence of visited rooms (states) and chosen doors (actions), in the training and test phases. No other details (including demographic details, questionnaires, or comments about the experiment) were collected from participants. Test performance was used as a criterion for filtering. Out of the participants who finished the experiment (i.e., finished both training and test phases), we rejected those who did not finish the test phase in a number of steps smaller than expected by chance (i.e., the expected number of steps it would take to reach the target by a random walk). We also rejected participants who, during training, did not choose both “right” and “left” at least twice. The test start and target rooms were identical for all participants, and were chosen so as to maximize the difference between the performance (i.e., number of steps) expected by chance and that of the optimal (shortest-path) policy. The number of participants in each experiment is given in Table 1, and their division into “Good” and “Poor” explorers is given in Table 2.

Number of participants in Experiments 1 and 2.

Participant groups in Experiments 1 and 2

Estimating policy from behavior

For the average results, we computed for each participant their pR value as the number of “right” choices divided by the total (and fixed) number of visits to S. Similarly, prepeat was calculated for individual participants as the number of visits to MR in which the chosen action was identical to the one chosen in their previous visit to MR, divided by the total number of visits to MR minus one. Note that the total number of visits to MR was different for different participants, as it depended on their policy at S. For consistency, we used the same measurements for the results of the model simulations. Note that, in principle, the model allows measuring the policy of individual agents (at individual time-points) directly, without the need to estimate it from behavior (i.e., the generated stochastic choices). To estimate learning dynamics, we can no longer estimate pR (t) on an individual level, because each participant only made one binary choice in a given episode. Therefore, we computed pR (t) at the population level, as the number of participants who chose “right” in the tth episode divided by the total number of participants (possibly within a particular group, for example only “good” explorers). Alternatively, when considering specific experimental conditions, we estimated pR (t) for individual participants using a moving average over a window of 3 consecutive episodes.
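The estimators described above can be sketched as follows. This is an illustrative sketch rather than the analysis code used in the study: the function names and the data layout (one list of choices per participant) are assumptions, and the alignment of the 3-episode moving-average window is not specified in the text, so a simple sliding window is used here.

```python
def p_right(choices_at_s):
    """Fraction of "right" choices among a participant's visits to S."""
    return sum(c == "right" for c in choices_at_s) / len(choices_at_s)

def p_repeat(choices_at_mr):
    """Fraction of visits to MR (after the first) that repeat the previous choice there."""
    if len(choices_at_mr) < 2:
        return float("nan")  # undefined with fewer than two visits
    repeats = sum(a == b for a, b in zip(choices_at_mr, choices_at_mr[1:]))
    return repeats / (len(choices_at_mr) - 1)

def population_p_right(all_choices_at_s):
    """pR(t): fraction of participants who chose "right" in each episode t."""
    n = len(all_choices_at_s)
    return [sum(c == "right" for c in episode) / n
            for episode in zip(*all_choices_at_s)]

def moving_average(values, window=3):
    """Sliding-window mean over `window` consecutive episodes."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Toy data, invented for illustration only.
print(p_right(["right", "left", "right", "right"]))
print(p_repeat([0, 0, 1, 1, 2]))
print(population_p_right([["right", "left"], ["right", "right"]]))
```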

Statistical analysis

Confidence Intervals (CI) for pR were computed using bootstrapping, by resampling participants and choices. Comparisons between different conditions were computed using a permutation test, by shuffling all participants of the two groups being compared, and resampling under the null hypothesis of no group difference. With this resampling we computed the distribution of pR (A) − pR (B) for two randomly shuffled groups of participants A and B. The reported p-value is the CDF of this distribution evaluated at the difference observed between the real (unshuffled) groups.
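The two resampling procedures can be sketched as follows, using only the Python standard library. This is a sketch under stated assumptions, not the analysis code of the study: the function names, the number of resamples, the default CI level, and the encoding of choices as 0/1 (1 for “right”) are all illustrative choices.

```python
import random

def bootstrap_ci(choices_by_participant, n_boot=2000, level=0.95, seed=0):
    """Two-level bootstrap CI for mean pR: resample participants, then their choices."""
    rng = random.Random(seed)
    n = len(choices_by_participant)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(choices_by_participant) for _ in range(n)]
        p_values = []
        for choices in sample:
            resampled = [rng.choice(choices) for _ in choices]
            p_values.append(sum(resampled) / len(resampled))
        stats.append(sum(p_values) / n)
    stats.sort()
    lo_idx = max(0, int((1 - level) / 2 * n_boot))
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]

def permutation_p_value(p_a, p_b, n_perm=10000, seed=0):
    """CDF of the shuffled-group difference in mean pR, evaluated at the observed difference."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(p_a) - mean(p_b)
    pooled = list(p_a) + list(p_b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # null hypothesis: group labels are exchangeable
        diff = mean(pooled[:len(p_a)]) - mean(pooled[len(p_a):])
        if diff <= observed:
            count += 1
    return count / n_perm
```

Because the p-value is defined here as a CDF value, it is one-sided: values near 0 indicate that group A has a lower pR than expected under the null, and values near 1 indicate the opposite.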

TD (λ) learning for E-values

We start by providing a short, non-technical description of the TD and TD (λ) value-learning algorithms. The value of a state-action (denoted Q (s, a)) is defined as the expected sum of (discounted) rewards achieved following that state-action. The goal of the algorithms is to learn these values. To that end, the agent maintains and updates estimates Q̂ (s, a) of the true state-action values Q (s, a). In TD-learning, upon observing a transition (s, a, r, s′, a′), the estimated value Q̂ (s, a) is updated towards r + γQ̂ (s′, a′). Crucially, Q̂ (s′, a′) is also, on its own, an estimated value. This usage of (a part of) the current estimator to form the target for updating the same estimator is known as bootstrapping. TD learning therefore breaks the estimation of value – the sum of rewards – into two parts: the first reward, which is taken from the environment, and the rest of the sum, which is bootstrapped.

It is possible, however, to estimate the values while breaking the sum of rewards in other ways. For example, one could sum the first two rewards based on observations, and bootstrap the rest, that is, from time-step 3 onwards. Importantly, this would result in information (about the rewards) propagating backwards 2 steps in a single update, rather than 1 step. More generally, breaking the sum after n steps will result in an n-step backup learning rule. It is also possible to average multiple n-step backups in a single update. The TD (λ) algorithm is a particularly popular scheme to do that: it can be understood as combining all possible n-step backups, with a weighting function that decays exponentially with n (i.e., the weight given to the n-step backup is proportional to λ^(n−1), where λ is a parameter). With λ = 0 the algorithm recovers the standard 1-step backup algorithm, or in other words, TD (0) is simply TD. A value of λ = 1 corresponds to no bootstrapping at all, relying instead on Monte Carlo estimates of the action value by collecting direct samples (sum of rewards over complete trajectories).
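For reference, the n-step and λ-weighted targets described above can be written explicitly. The following uses the standard textbook formulation (Sutton and Barto, 2018) rather than notation taken from this paper; the time indices and the symbols G are assumptions of this sketch.

```latex
% n-step target: n observed rewards, then bootstrap from the current estimate
G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k+1} + \gamma^{n}\, \hat{Q}(s_{t+n}, a_{t+n})

% lambda-return: exponentially weighted average of all n-step targets
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}
```

The factor (1 − λ) normalizes the weights so that they sum to one. Setting λ = 0 leaves only the 1-step target, r_{t+1} + γQ̂ (s_{t+1}, a_{t+1}), which is the TD target described in the previous paragraph.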

Equation 1 can be understood as a TD algorithm (specifically, the SARSA algorithm; Rummery and Niranjan, 1994; Sutton and Barto, 2018) in the particular case in which all reward signals are assumed to be r = 0 and the estimates are initialized at 1. The extended model (Algorithm 1) is a direct generalization of that correspondence to the TD (λ) case.

Algorithm 1 TD (λ) learning for E-values
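As an illustration of the kind of update rule the text describes, the following is a minimal sketch of SARSA(λ) with zero rewards and E-values initialized at 1. It is not a reproduction of Algorithm 1: the class name, the use of accumulating eligibility traces, the per-episode trace reset, and the assumption that the terminal state has an E-value of 0 are all choices made for this sketch.

```python
from collections import defaultdict

class EValueLearner:
    """Sketch of TD(lambda) learning of E-values (SARSA(lambda) with r = 0)."""

    def __init__(self, eta=0.9, gamma=0.6, lam=0.6):
        self.eta, self.gamma, self.lam = eta, gamma, lam
        self.e = defaultdict(lambda: 1.0)   # E-values, initialized at 1
        self.traces = defaultdict(float)    # eligibility traces

    def start_episode(self):
        self.traces.clear()

    def update(self, s, a, s_next=None, a_next=None):
        """Process one transition; pass s_next=None if the next state is terminal."""
        next_e = 0.0 if s_next is None else self.e[(s_next, a_next)]
        delta = self.gamma * next_e - self.e[(s, a)]   # TD error with r = 0
        self.traces[(s, a)] += 1.0                     # accumulating trace
        for key in list(self.traces):
            self.e[key] += self.eta * delta * self.traces[key]
            self.traces[key] *= self.gamma * self.lam  # decay for the next step

# Example: one episode S -> MR -> terminal, with hypothetical state and action labels.
learner = EValueLearner()
learner.start_episode()
learner.update("S", "right", "MR", "door_1")
learner.update("MR", "door_1")
print(learner.e[("S", "right")], learner.e[("MR", "door_1")])
```

With `lam=0.0` every trace decays to zero after a single step, so only the most recent state-action is updated and the sketch reduces to the 1-step rule.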

Supplementary material

Simulation results of TD (λ).

Simulating behavior of the E-values model with the TD (λ) learning rule (Methods, Algorithm 1) reproduces the main findings of directed exploration in the maze task. (a) In MR, the model exhibits directed exploration, which manifests in low values of prepeat (shown for the 3 conditions of Experiment 1; the dashed line denotes the chance level expected for random exploration, 1/nR). (b) In the environments of Experiment 1, agents exhibited a bias towards MR that increased with the imbalance of nR : nL, reflecting the propagation of long-term uncertainties over states. (c) In the environments of Experiment 2, the bias decreased with depth, reflecting temporal discounting. (d) Bias towards MR peaks transiently, followed by a decay to baseline at steady state, as expected from uncertainty-driven exploration (average results over all 6 environments). The learning dynamics are faster than those of the 1-step update model. Results are based on 3,000 simulations in each environment. Bars and histograms in (a)-(c) are shown for the first 20 episodes to match the behavioral experiments. Model parameters: η = 0.9, β = 5, γ = 0.6, λ = 0.6.

Model parameters.

Learning curves of the TD (λ) model in the 3 environments of Experiment 2 for different values of γ and λ (with fixed η = 0.9, β = 5). With infinite discounting (γ = 0), future consequences are neglected, resulting in a uniform (counter-based like) policy with no bias. With no discounting (γ = 1), information from the terminal state T dominates, resulting in a bias towards “right” (since there are more routes to the terminal states via the “right” branch) that is independent of the depth of MR. For intermediate values of γ, transient exploration opportunities (i.e., in MR) become important, resulting in a bias towards MR that decreases with depth, reflecting temporal discounting. In this regime, the one-step backup learning rule (λ = 0) results in different learning speeds for different depths, while for trajectory-based learning rules (λ > 0) learning speed is comparable for the different depths. Each learning curve is the average of 30,000 simulations.