Directed exploration in complex environments.

(a) In a bandit problem (left), actions have no long-term consequences. In complex environments (right), actions have long-term consequences, as particular actions may lead, in the future, to different parts of the state space. In this example, these parts (shaded areas) differ in size. As a result, local visit counters are no longer a good measure of uncertainty: a_2 should, in general, be chosen more often than a_1 in order to exhaust the larger uncertainty associated with it. (b) Participants were instructed to navigate through a maze of rooms. Each room was identified by a unique background image and a title. To move to the next room, participants chose between the available doors by mouse-clicking. Background images and room titles (Armenian letters) were randomized between participants and were devoid of any clear semantic or spatial structure. (c) The three maze structures in Experiment 1 (Top) have a root state S (highlighted in yellow) with two doors. They differ in the imbalance between the number of doors available in the future rooms M_R and M_L (n_R : n_L = 4:3, 5:2, 6:1). Consistent with models of directed exploration that take into account the long-term consequences of actions, and unlike counter-based models, participants exhibited a bias towards room M_R, deviating from a uniform policy (Bottom; bars and error bars denote mean and 95% confidence interval of p_R; number of participants: n = 161; 120; 137. Statistical significance, here and in the following figures: *: p < 0.05, **: p < 0.01, ***: p < 0.001).
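To make the intuition in (a) concrete, the following toy computation (an illustration only, not the model used in the paper) compares a purely local visit counter with a score that also counts discounted, unexplored options one step ahead; the door counts and the discount value are arbitrary choices for the example.

    # Toy illustration (not the experimental model): why a local visit counter
    # undervalues an action that opens onto a larger part of the state space.
    # Both actions look identical to a counter (each tried once), but a_2 leads
    # to a region with more unexplored doors.

    GAMMA = 0.9  # assumed discount over future exploration opportunities

    def propagated_uncertainty(local_unexplored, future_unexplored, gamma=GAMMA):
        """Unexplored options here plus discounted unexplored options one step ahead."""
        return local_unexplored + gamma * future_unexplored

    # a_1 leads to a small shaded region, a_2 to a large one (counts are arbitrary).
    score_a1 = propagated_uncertainty(local_unexplored=1, future_unexplored=2)
    score_a2 = propagated_uncertainty(local_unexplored=1, future_unexplored=6)

    print(score_a1, score_a2)  # 2.8 vs 6.4: a_2 carries more long-term uncertainty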

Heterogeneity in exploration strategies.

Top: Histograms of p_repeat at state M_R (highlighted in yellow) for participants in the three conditions of Experiment 1 (left to right: n_R = 4, 5, 6). The dashed vertical line represents the value expected by chance, 1/n_R. Based on their p_repeat values, we divided participants into “good” and “poor” directed explorers (dotted and striped patterns, respectively; proportion of “good” explorers: 40%, 44%, 51%). Bottom: Histograms of p_R at state S (highlighted in yellow) for the “good” and “poor” directed-explorer groups.
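As a rough sketch of how such a split could be computed (an assumed reconstruction, not the authors' analysis code; the exact definition of p_repeat and the classification criterion are given in the Methods), a repeat probability can be estimated from a participant's choice sequence at M_R and compared with the chance level 1/n_R.

    # Assumed sketch, not the authors' analysis code: estimate a repeat probability
    # from a participant's choice sequence at a state and compare it with chance.

    def p_repeat(choices):
        """Fraction of visits (from the second on) that repeat the previous choice."""
        repeats = sum(1 for prev, cur in zip(choices, choices[1:]) if prev == cur)
        return repeats / (len(choices) - 1)

    def classify(choices, n_doors):
        """Below-chance repetition suggests directed (non-random) exploration."""
        chance = 1.0 / n_doors  # p_repeat expected under a uniform random policy
        return "good" if p_repeat(choices) < chance else "poor"

    # Example: six visits to M_R with n_R = 4 doors, labelled 0..3.
    print(classify([0, 1, 2, 3, 0, 2], n_doors=4))  # -> "good" (no immediate repeats)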

“Poor” and “good” directed explorers.

Choice biases at state S (p_R) analyzed separately for “poor” and “good” explorers (striped and dotted patterns; divided based on their exploration in M_R, see Figure 2) in the three conditions of Experiment 1. While the behavior of the “poor” explorers was not significantly different from chance (consistent with the prediction of random exploration), “good” explorers in the n_R = 5, 6 conditions exhibited a significant bias towards “right”. Bars and error bars denote mean and 95% confidence interval of p_R; number of participants: n = 95; 66, 67; 53, 66; 71 (“poor”; “good”).

Temporal discounting of exploratory consequences.

The three mazes in Experiment 2 (Top) had the same imbalance (n_R = 5, n_L = 2), but we varied the depth of M_R (and M_L) relative to the root state S (left to right: depth = 1, 2, 3). “Poor” and “good” directed explorers (striped and dotted patterns, respectively) were divided by their p_repeat value at M_R (as in Experiment 1, see Figure 2). Bars and error bars denote mean and 95% confidence interval of p_R. Number of participants: n = 99; 92, 121; 84, 153; 85 (“poor”; “good”).
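The depth effect can be illustrated with a back-of-the-envelope computation (an assumed simplification, not the fitted model): if the extra exploration value of the richer branch is discounted geometrically with the depth of M_R below the root, the predicted bias shrinks with depth.

    # Assumed simplification: geometric discounting of the branch imbalance with depth.
    GAMMA = 0.6          # discount factor; same value as in the simulations below
    EXTRA_DOORS = 5 - 2  # imbalance n_R - n_L in Experiment 2

    for depth in (1, 2, 3):
        advantage = (GAMMA ** depth) * EXTRA_DOORS
        print(f"depth {depth}: discounted advantage of 'right' = {advantage:.2f}")
    # depth 1: 1.80, depth 2: 1.08, depth 3: 0.65 -- a monotonically weaker bias.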

Learning dynamics.

Bias towards M_R as a function of training episode (p_R(t)), averaged over participants in all six conditions (Experiments 1 & 2), shown for the “poor” (a) and “good” (b) groups. The “good” explorers exhibited a transient peak in p_R(t), consistent with models of uncertainty-driven exploration. However, the steady-state value was still slightly larger than chance, consistent with an objective-driven exploration component. Dots and shaded areas denote mean and 95% confidence interval of p_R(t).

Simulation results.

Simulating the behavior of the E-values model (Equation 12) reproduces the main findings of directed exploration in the maze task. (a) In M_R, the model exhibits directed exploration, which manifests in low values of p_repeat (shown for the three conditions of Experiment 1; the dashed line denotes the chance level expected for random exploration, 1/n_R). (b) In the environments of Experiment 1, agents exhibited a bias towards M_R that increased with the imbalance of n_R : n_L, reflecting the propagation of long-term uncertainties over states. (c) In the environments of Experiment 2, the bias decreased with depth, reflecting temporal discounting. (d) The bias towards M_R peaks transiently and then decays to baseline at steady state, as expected from uncertainty-driven exploration (results averaged over all six environments). Results are based on 3,000 simulations in each environment. Bars and histograms in (a)-(c) are shown for the first 20 episodes for comparison with the behavioral experiments. Error bars are negligible and are therefore not shown. Model parameters: η = 0.9, β = 5, γ = 0.6.
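For readers who prefer code, the following is a minimal sketch of an E-value-style directed explorer with a 1-step (TD(0)) backup. It uses only what the caption states (learning rate η, inverse temperature β, discount γ); the initialization to 1, the max over next-state E-values, and the softmax form are assumptions, since Equation 12 is not reproduced here.

    # Minimal sketch of an E-value-style directed explorer (assumed form; the exact
    # update in Equation 12 may differ).  Assumptions: E-values start at 1, shrink
    # toward 0 via a TD(0) backup carrying no reward, and actions are drawn from a
    # softmax that favours high (i.e. less-explored) E-values.
    import math
    import random
    from collections import defaultdict

    ETA, BETA, GAMMA = 0.9, 5.0, 0.6  # parameters quoted in the caption

    E = defaultdict(lambda: 1.0)      # E[(state, action)]: remaining "exploration value"

    def act(state, actions):
        weights = [math.exp(BETA * E[(state, a)]) for a in actions]
        return random.choices(actions, weights=weights)[0]

    def update(state, action, next_state, next_actions):
        # 1-step backup with zero reward: uncertainty propagates one state per visit.
        target = GAMMA * max((E[(next_state, a)] for a in next_actions), default=0.0)
        E[(state, action)] += ETA * (target - E[(state, action)])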

1-step backups and trajectory-based updates.

Learning dynamics simulated by the E-values model using the 1-step backup learning rule of TD(0) (Equation 12; top left) and the trajectory-based learning rule TD(λ) (Methods, Algorithm 1; top right) in the three environments of Experiment 2. With TD(0), the depth of M_R relative to S (depth = 1, 2, 3) affects both the peak value of p_R(t) (due to temporal discounting) and the time it takes the model to learn (due to the longer sequence of states over which the information has to be propagated). By contrast, with TD(λ), different depths result in a different maximum bias (due to temporal discounting), but the learning time is comparable (because information is propagated over multiple steps in each update). For the same reason, learning is overall faster with TD(λ). In humans (bottom), the peak bias decreased with depth (consistent with temporal discounting), but there was no noticeable difference in learning speed (consistent with trajectory-based updates). Learning curves of human participants are shown with a moving average of 3 episodes. Dots and shaded areas denote means and 70% confidence intervals of p_R(t). Model results are averaged over 30,000 simulations; model parameters: η = 0.9, β = 5, γ = 0.6, and λ = 0.6 (for the TD(λ) model).
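The trajectory-based variant can be sketched with an eligibility trace (a generic TD(λ) form with replacing traces; the paper's Algorithm 1 may differ in its details). Reusing the E dictionary from the sketch above, a single transition now updates every state-action pair visited earlier in the episode, which is why learning time becomes largely depth-independent.

    # Generic eligibility-trace sketch (assumed form; see Methods, Algorithm 1 for
    # the actual rule).  `E` is the defaultdict from the previous sketch; `trace`
    # is an ordinary dict, reset to {} at the start of each episode.

    LAM = 0.6  # trace-decay parameter, as in the caption

    def update_with_traces(E, trace, state, action, next_state, next_actions,
                           eta=0.9, gamma=0.6, lam=LAM):
        target = gamma * max((E[(next_state, a)] for a in next_actions), default=0.0)
        delta = target - E[(state, action)]
        trace[(state, action)] = 1.0      # replacing trace: mark the current pair eligible
        for key, e in list(trace.items()):
            E[key] += eta * delta * e     # credit every visited pair, weighted by its trace
            trace[key] = gamma * lam * e  # older pairs receive geometrically less credit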

Number of participants in Experiments 1 and 2.

Participant groups in Experiments 1 and 2.

Simulation results of TD(λ).

Simulating the behavior of the E-values model with the TD(λ) learning rule (Methods, Algorithm 1) reproduces the main findings of directed exploration in the maze task. (a) In M_R, the model exhibits directed exploration, which manifests in low values of p_repeat (shown for the three conditions of Experiment 1; the dashed line denotes the chance level expected for random exploration, 1/n_R). (b) In the environments of Experiment 1, agents exhibited a bias towards M_R that increased with the imbalance of n_R : n_L, reflecting the propagation of long-term uncertainties over states. (c) In the environments of Experiment 2, the bias decreased with depth, reflecting temporal discounting. (d) The bias towards M_R peaks transiently and then decays to baseline at steady state, as expected from uncertainty-driven exploration (results averaged over all six environments). The learning dynamics are faster than those of the 1-step update model. Results are based on 3,000 simulations in each environment. Bars and histograms in (a)-(c) are shown for the first 20 episodes to match the behavioral experiments. Model parameters: η = 0.9, β = 5, γ = 0.6, λ = 0.6.

Model parameters.

Learning curves of the TD(λ) model in the three environments of Experiment 2 for different values of γ and λ (with fixed η = 0.9, β = 5). With infinite discounting (γ = 0), future consequences are neglected, resulting in a uniform (counter-based-like) policy with no bias. With no discounting (γ = 1), information from the terminal state T dominates, resulting in a bias towards “right” (since there are more routes to the terminal state via the “right” branch) that does not depend on the depth of M_R. For intermediate values of γ, transient exploration opportunities (i.e., in M_R) become important, resulting in a bias towards M_R that decreases with depth, reflecting temporal discounting. In this regime, the 1-step backup learning rule (λ = 0) results in different learning speeds for different depths, whereas for trajectory-based learning rules (λ > 0) the learning speed is comparable across depths. Each learning curve is the average of 30,000 simulations.
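The claim about learning speed can be checked with a stripped-down toy (an assumed simplification with a linear value chain and no discounting, meant only to isolate the propagation effect): with 1-step backups, a signal from a state d steps below the root needs on the order of d episodes to reach the root, whereas with λ > 0 a single episode carries it all the way up.

    # Toy check (assumed simplification, not the maze simulation): episodes needed
    # for information at depth d to reach the root, with and without traces.

    def episodes_until_root_informed(depth, lam, lr=0.5):
        V = [0.0] * (depth + 1)     # V[0] is the root; V[depth] holds the new information
        V[depth] = 1.0
        for episode in range(1, 100):
            trace = [0.0] * (depth + 1)
            for s in range(depth):  # walk root -> deep state, updating along the way
                trace[s] = 1.0
                delta = V[s + 1] - V[s]
                for k in range(s + 1):
                    V[k] += lr * delta * trace[k]
                    trace[k] *= lam
            if V[0] > 0.01:
                return episode
        return None

    for depth in (1, 2, 3):
        print(depth, episodes_until_root_informed(depth, lam=0.0),  # 1-step backups
                     episodes_until_root_informed(depth, lam=0.6))  # trajectory-based
    # prints: 1 1 1 / 2 2 1 / 3 3 1  (depth, episodes with lam = 0, episodes with lam = 0.6)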