Directed exploration in complex environments.

(a) In a bandit problem (left), actions have no long-term consequences. In complex environments (right), actions have long-term consequences, as particular actions may lead, in the future, to different parts of the state space. In this example, these parts (shaded areas) differ in size. As a result, local visit counters are no longer a good measure of uncertainty: a_2 should, in general, be chosen more often than a_1 in order to exhaust the larger uncertainty associated with it. (b) Participants were instructed to navigate through a maze of rooms. Each room was identified by a unique background image and a title. To move to the next room, participants chose between the available doors by mouse-clicking. Background images and room titles (Armenian letters) were randomized between participants and were devoid of any clear semantic or spatial structure. (c) The three maze structures in Experiment 1 (Top) have a root state S (highlighted in yellow) with two doors. They differ in the imbalance between the number of doors available in the future rooms M_R and M_L (n_R : n_L = 4:3, 5:2, 6:1). Consistent with models of directed exploration that take into account the long-term consequences of actions, and unlike counter-based models, participants exhibited a bias towards room M_R, deviating from a uniform policy (Bottom; bars and error bars denote mean and 95% confidence interval of p_R; number of participants: n = 161; 120; 137. Statistical significance, here and in the following figures: *: p < 0.05, **: p < 0.01, ***: p < 0.001).
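To make the intuition in (a) concrete, the following toy computation (an illustration only, not the model used in the paper) compares a purely local visit counter with a score that also counts discounted, unexplored options one step ahead; the door counts and the discount value are arbitrary choices for the example.

    # Toy illustration (not the experimental model): why a local visit counter
    # undervalues an action that opens onto a larger part of the state space.
    # Both actions look identical to a counter (each tried once), but a_2 leads
    # to a region with more unexplored doors.

    GAMMA = 0.9  # assumed discount over future exploration opportunities

    def propagated_uncertainty(local_unexplored, future_unexplored, gamma=GAMMA):
        """Unexplored options here plus discounted unexplored options one step ahead."""
        return local_unexplored + gamma * future_unexplored

    # a_1 leads to a small shaded region, a_2 to a large one (counts are arbitrary).
    score_a1 = propagated_uncertainty(local_unexplored=1, future_unexplored=2)
    score_a2 = propagated_uncertainty(local_unexplored=1, future_unexplored=6)

    print(score_a1, score_a2)  # 2.8 vs 6.4: a_2 carries more long-term uncertainty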

Heterogeneity in exploration strategies.

Top: Histograms of p_repeat at state M_R (highlighted in yellow) for participants in the three conditions of Experiment 1 (left to right: n_R = 4, 5, 6). The dashed vertical line represents the value expected by chance, 1/n_R. Based on their p_repeat values, we divided participants into “good” and “poor” directed explorers (dotted and striped patterns, respectively; proportion of “good” explorers: 40%, 44%, 51%). Bottom: Histograms of p_R at state S (highlighted in yellow) for the “good” and “poor” directed-explorer groups.
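As a rough sketch of how such a split could be computed (an assumed reconstruction, not the authors' analysis code; the exact definition of p_repeat and the classification criterion are given in the Methods), a repeat probability can be estimated from a participant's choice sequence at M_R and compared with the chance level 1/n_R.

    # Assumed sketch, not the authors' analysis code: estimate a repeat probability
    # from a participant's choice sequence at a state and compare it with chance.

    def p_repeat(choices):
        """Fraction of visits (from the second on) that repeat the previous choice."""
        repeats = sum(1 for prev, cur in zip(choices, choices[1:]) if prev == cur)
        return repeats / (len(choices) - 1)

    def classify(choices, n_doors):
        """Below-chance repetition suggests directed (non-random) exploration."""
        chance = 1.0 / n_doors  # p_repeat expected under a uniform random policy
        return "good" if p_repeat(choices) < chance else "poor"

    # Example: six visits to M_R with n_R = 4 doors, labelled 0..3.
    print(classify([0, 1, 2, 3, 0, 2], n_doors=4))  # -> "good" (no immediate repeats)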

“Poor” and “good” directed explorers.

Choice biases at state S (p_R) analyzed separately for “poor” and “good” explorers (striped and dotted patterns; divided based on their exploration in M_R, see Figure 2) in the three conditions of Experiment 1. While the behavior of the “poor” explorers was not significantly different from chance (consistent with the prediction of random exploration), “good” explorers in the n_R = 5, 6 conditions exhibited a significant bias towards “right”. Bars and error bars denote mean and 95% confidence interval of p_R; number of participants: n = 95; 66, 67; 53, 66; 71 (“poor”; “good”).

Temporal discounting of exploratory consequences.

The three mazes in Experiment 2 (Top) had the same imbalance (n_R = 5, n_L = 2), but we varied the depth of M_R (and M_L) relative to the root state S (left to right: depth = 1, 2, 3). “Poor” and “good” directed explorers (striped and dotted patterns, respectively) were divided by their p_repeat value at M_R (as in Experiment 1, see Figure 2). Bars and error bars denote mean and 95% confidence interval of p_R. Number of participants: n = 99; 92, 121; 84, 153; 85 (“poor”; “good”).
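The depth effect can be illustrated with a back-of-the-envelope computation (an assumed simplification, not the fitted model): if the extra exploration value of the richer branch is discounted geometrically with the depth of M_R below the root, the predicted bias shrinks with depth.

    # Assumed simplification: geometric discounting of the branch imbalance with depth.
    GAMMA = 0.6          # discount factor; same value as in the simulations below
    EXTRA_DOORS = 5 - 2  # imbalance n_R - n_L in Experiment 2

    for depth in (1, 2, 3):
        advantage = (GAMMA ** depth) * EXTRA_DOORS
        print(f"depth {depth}: discounted advantage of 'right' = {advantage:.2f}")
    # depth 1: 1.80, depth 2: 1.08, depth 3: 0.65 -- a monotonically weaker bias.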

Learning dynamics.

Bias towards M_R as a function of training episode (p_R(t)), averaged over participants in all six conditions (Experiments 1 & 2), shown for the “poor” (a) and “good” (b) groups. The “good” explorers exhibited a transient peak in p_R(t), consistent with models of uncertainty-driven exploration. However, the steady-state value was still slightly larger than chance, consistent with an objective-driven exploration component. Dots and shaded areas denote mean and 95% confidence interval of p_R(t).

Simulation results.

Simulating the behavior of the E-values model (Equation 12) reproduces the main findings of directed exploration in the maze task. (a) In M_R, the model exhibits directed exploration, which manifests in low values of p_repeat (shown for the three conditions of Experiment 1; the dashed line denotes the chance level expected for random exploration, 1/n_R). (b) In the environments of Experiment 1, agents exhibited a bias towards M_R that increased with the imbalance of n_R : n_L, reflecting the propagation of long-term uncertainties over states. (c) In the environments of Experiment 2, the bias decreased with depth, reflecting temporal discounting. (d) The bias towards M_R peaks transiently and then decays to baseline at steady state, as expected from uncertainty-driven exploration (results averaged over all six environments). Results are based on 3,000 simulations in each environment. Bars and histograms in (a)-(c) are shown for the first 20 episodes for comparison with the behavioral experiments. Error bars are negligible and are therefore not shown. Model parameters: η = 0.9, β = 5, γ = 0.6.
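For readers who prefer code, the following is a minimal sketch of an E-value-style directed explorer with a 1-step (TD(0)) backup. It uses only what the caption states (learning rate η, inverse temperature β, discount γ); the initialization to 1, the max over next-state E-values, and the softmax form are assumptions, since Equation 12 is not reproduced here.

    # Minimal sketch of an E-value-style directed explorer (assumed form; the exact
    # update in Equation 12 may differ).  Assumptions: E-values start at 1, shrink
    # toward 0 via a TD(0) backup carrying no reward, and actions are drawn from a
    # softmax that favours high (i.e. less-explored) E-values.
    import math
    import random
    from collections import defaultdict

    ETA, BETA, GAMMA = 0.9, 5.0, 0.6  # parameters quoted in the caption

    E = defaultdict(lambda: 1.0)      # E[(state, action)]: remaining "exploration value"

    def act(state, actions):
        weights = [math.exp(BETA * E[(state, a)]) for a in actions]
        return random.choices(actions, weights=weights)[0]

    def update(state, action, next_state, next_actions):
        # 1-step backup with zero reward: uncertainty propagates one state per visit.
        target = GAMMA * max((E[(next_state, a)] for a in next_actions), default=0.0)
        E[(state, action)] += ETA * (target - E[(state, action)])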

1-step backups and trajectory-based updates.

Learning dynamics simulated by the E-values model using the 1-step backup learning rule of TD(0) (Equation 12; top left) and the trajectory-based learning rule TD(λ) (Methods, Algorithm 1; top right) in the three environments of Experiment 2. With TD(0), the depth of M_R relative to S (depth = 1, 2, 3) affects both the peak value of p_R(t) (due to temporal discounting) and the time it takes the model to learn (due to the longer sequence of states over which the information has to be propagated). By contrast, with TD(λ), different depths result in a different maximum bias (due to temporal discounting), but the learning time is comparable (because information is propagated over multiple steps in each update). For the same reason, learning is overall faster with TD(λ). In humans (bottom), the peak bias decreased with depth (consistent with temporal discounting), but there was no noticeable difference in learning speed (consistent with trajectory-based updates). Learning curves of human participants are shown with a moving average of 3 episodes. Dots and shaded areas denote means and 70% confidence intervals of p_R(t). Model results are averaged over 30,000 simulations; model parameters: η = 0.9, β = 5, γ = 0.6, and λ = 0.6 (for the TD(λ) model).
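The trajectory-based variant can be sketched with an eligibility trace (a generic TD(λ) form with replacing traces; the paper's Algorithm 1 may differ in its details). Reusing the E dictionary from the sketch above, a single transition now updates every state-action pair visited earlier in the episode, which is why learning time becomes largely depth-independent.

    # Generic eligibility-trace sketch (assumed form; see Methods, Algorithm 1 for
    # the actual rule).  `E` is the defaultdict from the previous sketch; `trace`
    # is an ordinary dict, reset to {} at the start of each episode.

    LAM = 0.6  # trace-decay parameter, as in the caption

    def update_with_traces(E, trace, state, action, next_state, next_actions,
                           eta=0.9, gamma=0.6, lam=LAM):
        target = gamma * max((E[(next_state, a)] for a in next_actions), default=0.0)
        delta = target - E[(state, action)]
        trace[(state, action)] = 1.0      # replacing trace: mark the current pair eligible
        for key, e in list(trace.items()):
            E[key] += eta * delta * e     # credit every visited pair, weighted by its trace
            trace[key] = gamma * lam * e  # older pairs receive geometrically less credit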

Number of participants in Experiments 1 and 2.

Participant groups in Experiments 1 and 2.

Simulation results of TD(λ).

Simulating the behavior of the E-values model with the TD(λ) learning rule (Methods, Algorithm 1) reproduces the main findings of directed exploration in the maze task. (a) In M_R, the model exhibits directed exploration, which manifests in low values of p_repeat (shown for the three conditions of Experiment 1; the dashed line denotes the chance level expected for random exploration, 1/n_R). (b) In the environments of Experiment 1, agents exhibited a bias towards M_R that increased with the imbalance of n_R : n_L, reflecting the propagation of long-term uncertainties over states. (c) In the environments of Experiment 2, the bias decreased with depth, reflecting temporal discounting. (d) The bias towards M_R peaks transiently and then decays to baseline at steady state, as expected from uncertainty-driven exploration (results averaged over all six environments). The learning dynamics are faster than those of the 1-step update model. Results are based on 3,000 simulations in each environment. Bars and histograms in (a)-(c) are shown for the first 20 episodes to match the behavioral experiments. Model parameters: η = 0.9, β = 5, γ = 0.6, λ = 0.6.

Model parameters.

Learning curves of the TD(λ) model in the three environments of Experiment 2 for different values of γ and λ (with fixed η = 0.9, β = 5). With infinite discounting (γ = 0), future consequences are neglected, resulting in a uniform (counter-based-like) policy with no bias. With no discounting (γ = 1), information from the terminal state T dominates, resulting in a bias towards “right” (since there are more routes to the terminal state via the “right” branch) that does not depend on the depth of M_R. For intermediate values of γ, transient exploration opportunities (i.e., in M_R) become important, resulting in a bias towards M_R that decreases with depth, reflecting temporal discounting. In this regime, the 1-step backup learning rule (λ = 0) results in different learning speeds for different depths, whereas for trajectory-based learning rules (λ > 0) the learning speed is comparable across depths. Each learning curve is the average of 30,000 simulations.
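The claim about learning speed can be checked with a stripped-down toy (an assumed simplification with a linear value chain and no discounting, meant only to isolate the propagation effect): with 1-step backups, a signal from a state d steps below the root needs on the order of d episodes to reach the root, whereas with λ > 0 a single episode carries it all the way up.

    # Toy check (assumed simplification, not the maze simulation): episodes needed
    # for information at depth d to reach the root, with and without traces.

    def episodes_until_root_informed(depth, lam, lr=0.5):
        V = [0.0] * (depth + 1)     # V[0] is the root; V[depth] holds the new information
        V[depth] = 1.0
        for episode in range(1, 100):
            trace = [0.0] * (depth + 1)
            for s in range(depth):  # walk root -> deep state, updating along the way
                trace[s] = 1.0
                delta = V[s + 1] - V[s]
                for k in range(s + 1):
                    V[k] += lr * delta * trace[k]
                    trace[k] *= lam
            if V[0] > 0.01:
                return episode
        return None

    for depth in (1, 2, 3):
        print(depth, episodes_until_root_informed(depth, lam=0.0),  # 1-step backups
                     episodes_until_root_informed(depth, lam=0.6))  # trajectory-based
    # prints: 1 1 1 / 2 2 1 / 3 3 1  (depth, episodes with lam = 0, episodes with lam = 0.6)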