Dopaminergic modulation of the exploration/exploitation trade-off in human decision-making

  1. Karima Chakroun
  2. David Mathar (corresponding author)
  3. Antonius Wiehler
  4. Florian Ganzer
  5. Jan Peters
  1. Department of Systems Neuroscience, University Medical Center Hamburg-Eppendorf, Germany
  2. Department of Psychology, Biological Psychology, University of Cologne, Germany
  3. Institut du Cerveau et de la Moelle épinière - ICM, Centre de NeuroImagerie de Recherche - CENIR, Sorbonne Universités, Groupe Hospitalier Pitié-Salpêtrière, France
  4. German Center for Addiction Research in Childhood and Adolescence, University Medical Center Hamburg-Eppendorf, Germany
17 figures, 9 tables and 1 additional file

Figures

Task design of the restless four-armed bandit task (Daw et al., 2006).

(a) Illustration of the timeline within a trial. At trial onset, four colored squares (bandits) are presented. The participant selects one bandit within 1.5 s, which is then highlighted and, after a waiting period of 3 s, the payoff is revealed for 1 s. After that, the screen is cleared and the next trial starts after a fixed trial length of 6 s plus a variable intertrial interval (not shown) with a mean of 2 s. (b) Example of the underlying reward structure. Each colored line shows the payoffs of one bandit (mean payoff plus Gaussian noise) that would be received by choosing that bandit on each trial.
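The reward structure in panel (b) can be illustrated with a short simulation. This is a minimal sketch, assuming the decaying Gaussian random walk described by Daw et al. (2006). The function name, the numeric defaults (decay, decay center, diffusion and observation standard deviations), the starting values, and the rounding to a 1–100 point range are assumptions for illustration and are not stated in this caption.

```python
import numpy as np

def simulate_restless_bandit(n_trials=300, n_bandits=4, decay=0.9836,
                             decay_center=50.0, sd_diffusion=2.8,
                             sd_observation=4.0, seed=0):
    """Return (means, payoffs), each of shape (n_trials, n_bandits).

    All default values are assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    means = np.empty((n_trials, n_bandits))
    means[0] = rng.uniform(20.0, 80.0, size=n_bandits)  # hypothetical start values
    for t in range(1, n_trials):
        # Each mean decays toward the center and receives Gaussian diffusion noise.
        means[t] = (decay * means[t - 1]
                    + (1.0 - decay) * decay_center
                    + rng.normal(0.0, sd_diffusion, size=n_bandits))
    # Payoff = mean payoff plus Gaussian observation noise,
    # rounded and clipped to an assumed 1..100 point range.
    noisy = means + rng.normal(0.0, sd_observation, size=means.shape)
    payoffs = np.clip(np.rint(noisy), 1, 100)
    return means, payoffs

means, payoffs = simulate_restless_bandit(seed=1)
```

Each column of `payoffs` corresponds to one colored line in panel (b): the payoff that would be received by choosing that bandit on each trial.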

Percentage of optimal choices (highest payoff) throughout the task.

Shown is the mean percentage of trials on which the best bandit was chosen in trials 1–10 and in six subsequent task blocks (trials 11–50 as block 1, then trials 51–300 in five blocks of 50 trials each), over all participants and for each drug session separately. Participants started by choosing one bandit at random (~25%) in trial 1 (21.5% ± 7.49%, M ± SE). After five trials, participants already chose the most valuable bandit in 49.03% ± 4.98% of cases (M ± SE).

Results of the cognitive model comparison.

Leave-one-out (LOO) log-likelihood estimates were calculated over all drug conditions (n = 31 subjects with t = 3*300 trials) and once separately for each drug condition (n = 31 with t = 300). All LOO estimates were divided by the total number of data points in the sample (n*t) for better comparability across the different approaches. Note that the relative order of LOO estimates is invariant to linear transformations. Delta: simple delta learning rule; Bayes: Bayesian learner; SM: softmax (random exploration); E: directed exploration; R: total uncertainty-based random exploration; P: perseveration.
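The normalization described above amounts to a single division. The sketch below uses a hypothetical helper name and placeholder numbers (not results from the study) to show how estimates from the pooled and the per-condition analyses end up on the same per-data-point scale.

```python
def loo_per_data_point(loo_total, n_subjects, n_trials):
    """Divide a summed LOO log-likelihood estimate by the number of data points (n*t)."""
    return loo_total / (n_subjects * n_trials)

# Placeholder totals for illustration only (not values from the study).
pooled = loo_per_data_point(-27900.0, n_subjects=31, n_trials=3 * 300)
single = loo_per_data_point(-9300.0, n_subjects=31, n_trials=300)
```

Because the divisor is the same positive constant for every model within one analysis, the ranking of models is unchanged by this step.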

Trial-by-trial variables of the best-fitting Bayesian model (Bayes-SMEP).

Trial-by-trial estimates are shown for the placebo data of one representative subject with posterior medians: β=0.29, φ = 1.34, and ρ=4.11 (random exploration, directed exploration, and perseveration). (a) Colored lines depict the expected values (μ^pre) of the four bandits, whereas colored dots denote actual payoffs. Vertical black lines mark trials classified as exploratory (Daw et al., 2006). (b) Exploration bonus (φσ^pre) and uncertainty (σ^pre) for each bandit. (c) Perseveration bonus (Iρ). This bonus is a fixed value added only to the bandit chosen in the previous trial, shown here for one bandit. (d) Choice probability (P). Each colored line represents one bandit. (e) Reward prediction error (δ). (f) The subject’s overall uncertainty (Σσ^pre), that is the summed uncertainty over all four bandits.
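The quantities in panels (a) to (e) can be tied together in a few lines. This is a minimal sketch, assuming a softmax over the sum of expected value, exploration bonus (φσ^pre) and perseveration bonus (Iρ), and a standard Kalman-filter update for the chosen bandit. The exact functional form, the function names, and the observation standard deviation are assumptions, not taken from this caption. The between-trial diffusion step is omitted.

```python
import numpy as np

def choice_probabilities(mu_pre, sigma_pre, prev_choice, beta, phi, rho):
    """Softmax choice rule with exploration and perseveration bonuses (assumed form)."""
    mu_pre = np.asarray(mu_pre, dtype=float)
    sigma_pre = np.asarray(sigma_pre, dtype=float)
    indicator = np.zeros_like(mu_pre)      # I: 1 for the previously chosen bandit
    if prev_choice is not None:
        indicator[prev_choice] = 1.0
    utility = beta * (mu_pre + phi * sigma_pre + rho * indicator)
    utility -= utility.max()               # subtract the maximum for numerical stability
    p = np.exp(utility)
    return p / p.sum()

def kalman_update(mu_pre, sigma_pre, choice, reward, sd_observation=4.0):
    """Update the chosen bandit only; returns (mu_post, sigma_post, delta)."""
    mu_post = np.array(mu_pre, dtype=float)
    sigma_post = np.array(sigma_pre, dtype=float)
    delta = reward - mu_post[choice]       # reward prediction error
    var_pre = sigma_post[choice] ** 2
    gain = var_pre / (var_pre + sd_observation ** 2)
    mu_post[choice] += gain * delta
    sigma_post[choice] = np.sqrt((1.0 - gain) * var_pre)
    return mu_post, sigma_post, delta

# Posterior medians of the representative subject from the caption.
p = choice_probabilities([50, 50, 50, 50], [4, 4, 4, 4], prev_choice=2,
                         beta=0.29, phi=1.34, rho=4.11)
mu_post, sigma_post, delta = kalman_update([50, 50, 50, 50], [4, 4, 4, 4],
                                           choice=0, reward=60.0)
```

With equal expected values and uncertainties, only the perseveration bonus separates the bandits, so the previously chosen bandit receives the highest choice probability.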

Drug effects on the percentage of exploitations and explorations (bandit with highest uncertainty is chosen).

Shown is the mean percentage of directed explorations for each drug session over six blocks of 50 trials each (error bars indicate the standard error of the mean).

Drug effects for the group-level parameter estimates of the best-fitting Bayesian model.

Shown are posterior distributions of the group-level mean (M) of all choice parameters (β, φ, ρ), separately for each drug condition. Each plot shows the median (vertical black line), the 80% central interval (blue/grey area), and the 95% central interval (black contours); β: random exploration; φ: directed exploration; ρ: perseveration parameter. For drug effects on the group-level standard deviation (Λ) of β, φ and ρ, see Appendix 1—figure 1a. See Appendix 1—figure 1b and c for pairwise drug-related differences of the group-level mean (M) and standard deviation (Λ) of φ, respectively.
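The summary statistics drawn in each plot can be computed directly from posterior samples. A minimal sketch, assuming the samples are available as a one-dimensional array; the function name and the toy input are hypothetical.

```python
import numpy as np

def central_interval_summary(samples):
    """Median plus 80% and 95% central (equal-tailed) intervals of posterior samples."""
    samples = np.asarray(samples, dtype=float)
    q = np.percentile(samples, [2.5, 10.0, 50.0, 90.0, 97.5])
    return {"median": q[2], "ci80": (q[1], q[3]), "ci95": (q[0], q[4])}

# Toy input standing in for posterior samples of a group-level mean.
summary = central_interval_summary(np.arange(0, 101))
```

A central interval cuts equal probability mass from both tails, which differs from the highest density interval used for the posterior differences in Appendix 1—table 2.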

Brain regions differentially activated by exploratory and exploitative choices.

Shown are overlays of statistical parametric maps (SPMs) for (a) the parametric regressor expected value (μ^pre) of the chosen bandit (in blue) and the contrast exploit > explore from the binary trial classification (‘exploit’ in red), and (b) the parametric regressor uncertainty (σ^pre) (in blue) and the contrast explore > exploit (‘explore’ in red), over all drug conditions. For visualization purposes, maps are thresholded at p<0.001, uncorrected. R: right.

L-dopa effects on neural coding of overall uncertainty.

(a) Regions in which activity correlated positively with the overall uncertainty in the placebo condition included the dorsal anterior cingulate cortex (dACC) and left posterior insula (PI). (b) Regions in which the correlation with overall uncertainty was reduced under L-dopa compared to placebo included the dACC and left anterior insula (AI). Thresholded at p<0.001, uncorrected. R: right.

Graphical description of the hierarchical Bayesian modeling scheme.

In this graphical scheme, nodes represent variables of interest (squares: discrete variables; circles: continuous variables) and arrows indicate dependencies between these variables. Shaded nodes represent observed variables, here rewards (r) and choices (ch) for each trial (t), subject (s), and drug condition (d). For each subject and drug condition, the observed rewards until trial t-1 determine (deterministically) choice probabilities (P) on trial t, which in turn determine (stochastically) the choice on that trial. The exact dependencies between previous rewards and choice probabilities are specified by the different cognitive models and their model parameters (x). Note that the double-bordered node indicates that the choice probability is fully determined by its parent nodes, that is, the reward history and the model parameters. As the model parameters differ between the applied cognitive models, they are indicated here by an x as a placeholder for one or more model parameter(s). Still, the general modeling scheme was the same for all models: model parameters were estimated for each subject and drug condition and were assumed to be drawn from a group-level normal distribution with mean Mx and standard deviation Λx for any parameter x. Note that group-level parameters were estimated separately for each drug condition. Each group-level mean (Mx) was assigned a non-informative (uniform) prior between the limits xmin and xmax as listed above. Each group-level standard deviation (Λx) was assigned a half-Cauchy prior with location parameter 0 and scale 1. Subject-level parameters included α, β, φ, ρ, and γ, depending on the cognitive model (see Table 1).
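The group-level part of this scheme can be sketched as a forward draw from the priors. This is an illustration only, not the estimation procedure: the function name and the limits passed in the example are hypothetical, and any bounds on subject-level parameters are ignored.

```python
import numpy as np

def sample_subject_parameters(n_subjects, x_min, x_max, seed=0):
    """Draw one group-level mean and SD from their priors, then subject-level values."""
    rng = np.random.default_rng(seed)
    group_mean = rng.uniform(x_min, x_max)      # uniform prior on Mx
    group_sd = abs(rng.standard_cauchy())       # half-Cauchy(0, 1) prior on Lambda_x
    subject_values = rng.normal(group_mean, group_sd, size=n_subjects)
    return group_mean, group_sd, subject_values

# Hypothetical limits for one parameter in one drug condition.
group_mean, group_sd, subject_values = sample_subject_parameters(31, 0.0, 3.0, seed=2)
```

In the scheme described above, one such pair (Mx, Λx) exists per parameter and drug condition, and each subject-level value then feeds the choice probabilities of the chosen cognitive model.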

Appendix 1—figure 1
Group-level parameter estimates of the winning model.

Shown are the posterior distributions of (a) the group-level standard deviation (Λ) for all choice parameters (β, φ, ρ) of the winning model, separately for each drug condition, and of the pairwise drug-related differences of (b) the group-level mean (M) and (c) the group-level standard deviation (Λ) of φ. For each posterior distribution, the plot shows the median (vertical black line), the 80% central interval (blue/grey area), and the 95% central interval (black contours). β: softmax parameter; φ: exploration bonus parameter; ρ: perseveration bonus parameter.

Appendix 1—figure 2
Drug effects for the subject-level parameter estimates of the directed exploration parameter φ.

Shown are posterior distributions of the subject-level parameter φ from the best-fitting Bayesian model, separately for each drug condition. Each plot shows the median (black dot), the 80% central interval (blue area), and the 95% central interval (black contours). For the L-dopa and haloperidol conditions, posterior distributions (in blue) are overlaid on the posterior distributions of the placebo condition (in white) for better comparison.

Appendix 1—figure 3
Test for an inverted-U relationship between DA baseline proxy measures (spontaneous eye blink rate (sEBR) & working memory capacity (WMCPCA)) and the posterior medians of the three choice parameters (β,φ,ρ) of the winning (Bayes-SMEP) model.

Model parameters: β: softmax parameter; φ: exploration bonus parameter; ρ: perseveration bonus parameter.

Appendix 1—figure 4
Test for an inverted-U relationship between choice behavior and DA baseline.

Choice behavior was assessed by four model-free choice variables (payout, %bestbandit, meanrank, %switches). DA baseline function was assessed by the two DA proxies spontaneous eyeblink rate (sEBR) and working memory capacity (WMC). For the latter, the first principal component across three different WMC tasks was used, denoted by WMCPCA. Each plot shows two regression lines that were fitted to the data, one for the “linear model” (red line) and one for the “quadratic model” (blue line). Note that data from a pilot study and the placebo condition of the main study were combined for this analysis to increase the sample size to n=47. β: softmax parameter; φ: exploration bonus parameter; ρ: perseveration bonus parameter.

Appendix 1—figure 5
Brain regions differentially activated by exploratory and exploitative choices.

Shown are statistical parametric maps (SPMs) for (a) the contrast explore > exploit and (b) the contrast exploit > explore over all drug conditions. AG: angular gyrus; AI: anterior insula; Cb: cerebellum; dACC: dorsal anterior cingulate cortex; FPC: frontopolar cortex; HC: hippocampus; IPS: intraparietal sulcus; vmPFC: ventromedial prefrontal cortex; OFC: orbitofrontal cortex; PCC: posterior cingulate cortex; SMA: supplementary motor area; T: thalamus. For visualization purposes thresholded at p<0.001, uncorrected. R: right.

Appendix 1—figure 6
Brain activation patterns for different types of explorations.

Shown are pairwise overlays of the statistical parametric maps for the contrasts explore > exploit (‘overall’ in green), directed > exploit (‘directed’ in red), and random > exploit (‘random’ in blue) over all drug conditions. While the first contrast is based on a binary choice classification according to which all choices not following the highest expected value are explorations, the other two contrasts are based on a trinary choice classification, which further subdivides explorations into choices following the highest exploration bonus (directed) and choices not following the highest exploration bonus (random). All activation maps are thresholded at p<0.05, uncorrected, for display purposes. R: right.
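The binary and trinary classifications described above reduce to two comparisons per trial. A minimal sketch with a hypothetical function name; ties are resolved in favor of the first maximal bandit, which is an assumption.

```python
import numpy as np

def classify_choice(choice, mu_pre, sigma_pre, phi):
    """Return 'exploit', 'directed', or 'random' for one trial (trinary classification)."""
    mu_pre = np.asarray(mu_pre, dtype=float)
    sigma_pre = np.asarray(sigma_pre, dtype=float)
    if choice == int(np.argmax(mu_pre)):
        return "exploit"                    # highest expected value was chosen
    if choice == int(np.argmax(phi * sigma_pre)):
        return "directed"                   # highest exploration bonus was chosen
    return "random"                         # any other exploratory choice

# Toy values: bandit 0 has the highest expected value, bandit 2 the highest bonus.
mu, sigma = [60, 40, 50, 30], [2, 3, 8, 5]
labels = [classify_choice(c, mu, sigma, phi=1.34) for c in range(4)]
```

The binary classification follows by merging the two exploratory labels: every trial that is not labeled `exploit` counts as an exploration.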

Appendix 1—figure 7
Striatal coding of the model-based prediction error (PE).

Activity in the bilateral ventral striatum correlated positively with the PE signal. For visualization purposes thresholded at p<0.001, uncorrected. R: right.

Author response image 1
Shown are leave-one-out (LOO) log-likelihood estimates calculated for our winning model (BAYES-SMEP), the model with an additional term capturing uncertainty-based random exploration (BAYES-SMERP), and the respective alternative model formulations (‘shift’) over all drug conditions (n=31 subjects with t=3*300 trials) and once separately for each drug condition (n=31 with t=300).

All LOO estimates were divided by the total number of data points in the sample (n*t) for better comparability across the different approaches. Bayes: Bayesian learner; SM: softmax (random exploration); E: directed exploration; R: total uncertainty-based random exploration; P: perseveration.

Tables

Table 1
Free and fixed parameters of all six computational models.
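| | Delta rule | Bayes learner rule |
|---|---|---|
| Choice rule 1 | free: α, β; fixed: v1 | free: β; fixed: λ^, ϑ^, σ^o2, σ^d2, μ^1pre, σ^1pre |
| Choice rule 2 | free: α, β, φ; fixed: v1 | free: β, φ; fixed: λ^, ϑ^, σ^o2, σ^d2, μ^1pre, σ^1pre |
| Choice rule 3 | free: α, β, φ, ρ; fixed: v1 | free: β, φ, ρ; fixed: λ^, ϑ^, σ^o2, σ^d2, μ^1pre, σ^1pre |
| Choice rule 4 | free: α, β, φ, ρ, γ; fixed: v1 | free: β, φ, ρ, γ; fixed: λ^, ϑ^, σ^o2, σ^d2, μ^1pre, σ^1pre |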
  1. Note: Free parameters are listed only for the subject level. For each free subject-level parameter x, hierarchical models contained two additional free parameters (Μx, Λx) on the group level (Figure 9). Choice rule 1: softmax; Choice rule 2: softmax with exploration bonus; Choice rule 3: softmax with exploration bonus and perseveration bonus; α: learning rate; β: softmax parameter; φ: exploration bonus parameter; ρ: perseveration bonus parameter; γ: uncertainty-based random exploration parameter; v1: initial expected reward values for all bandits; λ^: decay parameter; ϑ^: decay center; σ^o2: observation variance; σ^d2: diffusion variance; μ^1pre: initial mean of prior expected rewards for all bandits; σ^1pre: initial standard deviation of prior expected rewards for all bandits.

Table 2
Brain regions in which activity was significantly correlated with the overall uncertainty (fourth GLM), shown for the placebo condition and for pairwise comparison with L-dopa.
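| Region | MNI x | MNI y | MNI z | peak z-value | cluster extent (k) |
|---|---|---|---|---|---|
| Placebo | | | | | |
| L posterior insula | −34 | −20 | 8 | 4.63 | 198 |
| R supplementary motor cortex | 8 | 10 | 52 | 3.98 | 92 |
| R/L dorsal anterior cingulate cortex, L supplementary motor cortex | −3 | 21 | 39 | 3.96 | 176 |
| R anterior insula | 42 | 15 | −6 | 3.46 | 38 |
| R thalamus | 8 | −10 | 2 | 3.41 | 18 |
| Placebo > L-dopa | | | | | |
| L posterior insula | −34 | −20 | 8 | 5.05* | 82 |
| L anterior insula, L frontal operculum | −38 | 6 | 14 | 4.88 | 222 |
| L opercular part of the inferior frontal gyrus | −42 | 9 | 26 | 4.01 | 80 |
| L precentral gyrus | −54 | 3 | 12 | 3.47 | 23 |
| R dorsal anterior cingulate cortex | 4 | 14 | 28 | 3.41 | 32 |
| R precentral gyrus | 39 | −9 | 44 | 3.39 | 16 |
| L dorsal anterior cingulate cortex | −2 | 36 | 33 | 3.32 | 17 |
| L-dopa > placebo | | | | | |
| no suprathreshold activation | | | | | |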
  1. Note: Thresholded at p<0.001, uncorrected, with k ≥ 10 voxels; L: left; R: right.

    *p=0.031, FWE-corrected for whole-brain volume.

Appendix 1—table 1
Correspondence between model parameters and fraction of random exploration, directed exploration and exploitation trials.
| % explorations | β | φ | ρ |
|---|---|---|---|
| overall | −.65*** | .30* | 0.18 |
| random | −.68*** | 0.09 | −.22 |
| directed | 0.28 | .64*** | 0.09 |
  1. Note that overall explorations were defined according to the binary choice classification, while directed and random explorations were defined according to the trinary choice classification. β: RE parameter; φ: DE parameter; ρ: CP parameter. *p<0.05. ***p<0.001.

Appendix 1—table 2
Drug effects on the exploration bonus parameter (φ) on the group-level.
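| Comparison | Μφ: % above 0 | Μφ: 90% HDI | Λφ: % above 0 | Λφ: 90% HDI |
|---|---|---|---|---|
| placebo - L-dopa | 97.5 | [0.05, 0.69] | 47.5 | [−0.18, 0.16] |
| placebo - haloperidol | 49.3 | [−0.30, 0.27] | 90.0 | [−0.04, 0.29] |
| L-dopa - haloperidol | 1.7 | [−0.70, −0.10] | 90.8 | [−0.02, 0.31] |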
  1. Note: Results refer to the posterior drug differences of the group-level mean (Μφ) and standard deviation (Λφ) for the φ parameter of the winning model. For each posterior difference, the table shows the percentage of samples with values above zero (column: % above 0) and the 90% highest density interval (column: 90%HDI).
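Both table columns can be derived from samples of a posterior difference. The sketch below is one common way to do this, assuming the difference samples are available as a one-dimensional array. The function names are hypothetical, and the HDI is taken as the narrowest interval that contains the requested share of sorted samples.

```python
import numpy as np

def percent_above_zero(diff_samples):
    """Percentage of posterior difference samples with values above zero."""
    diff = np.asarray(diff_samples, dtype=float)
    return 100.0 * np.mean(diff > 0)

def hdi(samples, mass=0.90):
    """Narrowest interval containing `mass` of the samples (highest density interval)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = max(1, int(round(mass * n)))        # number of samples inside the interval
    widths = x[k - 1:] - x[:n - k + 1]      # width of every window of k samples
    start = int(np.argmin(widths))
    return x[start], x[start + k - 1]

# Toy samples: a skewed set where the HDI excludes the single outlier.
pct = percent_above_zero([-1.0, 1.0, 2.0, 3.0])
low, high = hdi([0, 1, 2, 3, 4, 5, 6, 7, 8, 100], mass=0.90)
```

For a difference such as placebo minus L-dopa, a percentage close to 100 or close to 0 and an HDI that excludes zero both point to a directional drug effect.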

Appendix 1—table 3
Test for an inverted-U relationship between choice behavior and DA baseline.
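| | LOOLM - LOOQM (sEBR) | LOOLM - LOOQM (WMC) | β2 estimate (sEBR) | β2 estimate (WMC) | β2 p-value (sEBR) | β2 p-value (WMC) |
|---|---|---|---|---|---|---|
| model-based: | | | | | | |
| β | −0.06 | −0.04 | −2.09e−04 | 2.98e−04 | .132 | .949 |
| φ | −3.60 | −2.57 | −1.13e−03 | 1.27e−02 | .470 | .809 |
| ρ | −53.09 | −49.68 | 1.69e−03 | 1.20e−01 | .869 | .726 |
| model-free: | | | | | | |
| payout | −0.95 | −1.05 | −6.04e−04 | 1.37e−02 | .582 | .710 |
| %bestbandit | 198.06 | −245.78 | −2.45e−02 | 7.40e−02 | .149 | .897 |
| meanrank | 0.06 | −0.10 | −5.29e−04 | −3.45e−03 | .080 | .733 |
| %switches | −484.04 | −700.67 | −2.09e−04 | 6.58e−01 | .222 | .509 |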
  1. Note. Choice behavior was assessed by the three choice parameters of the winning (Bayes-SM+E+P) model (upper part) and four model-free choice variables (lower part). Baseline dopamine (DA) function was assessed by the two behavioral DA proxies spontaneous eye blink rate (sEBR) and working memory capacity (WMC). For the latter, the first principal component across three different WMC tasks was used, denoted by WMCPCA. The column “LOOLM-LOOQM” denotes the difference of the squared distances for the linear model (LM) minus the quadratic model (QM) from the leave-one-out (LOO) model comparison. Note that negative values for LOOLM - LOOQM indicate better predictive accuracy of the LM. The columns “β2 estimate” and “β2 p-value” show for each quadratic model the estimated value and p-value of the β2 regression coefficient, respectively. Note that data from a pilot study (n=16) and the placebo condition of the main study were combined for this analysis to increase the sample size to n=47. β: softmax parameter; φ: exploration bonus parameter; ρ: perseveration bonus parameter.
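The comparison in this table can be sketched with ordinary polynomial fits. This is a minimal illustration, assuming that the "squared distances" are summed squared leave-one-out prediction errors. The function names and the toy data are hypothetical, and the p-value of β2 is not computed here.

```python
import numpy as np

def loo_squared_error(x, y, degree):
    """Summed squared leave-one-out prediction error of a polynomial fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    total = 0.0
    for i in range(x.size):
        keep = np.arange(x.size) != i                  # leave observation i out
        coefs = np.polyfit(x[keep], y[keep], degree)
        total += (y[i] - np.polyval(coefs, x[i])) ** 2
    return total

def compare_linear_quadratic(x, y):
    """Return (LOO_LM - LOO_QM, beta2), with beta2 the quadratic coefficient."""
    loo_lm = loo_squared_error(x, y, 1)
    loo_qm = loo_squared_error(x, y, 2)
    beta2 = np.polyfit(np.asarray(x, float), np.asarray(y, float), 2)[0]
    return loo_lm - loo_qm, beta2

# Toy data with an exact inverted-U shape (not data from the study).
x = np.linspace(-3, 3, 13)
diff, beta2 = compare_linear_quadratic(x, -x ** 2)
```

On the toy inverted-U data the difference is positive and β2 is negative, the pattern an inverted-U relationship would produce. The negative differences reported in the table instead favor the linear model.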

Appendix 1—table 4
Test for a linear relationship between drug-related effects on model-parameters and DA baseline.
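| | R² (sEBR) | R² (WMC) | β1 estimate (sEBR) | β1 estimate (WMC) | β1 p-value (sEBR) | β1 p-value (WMC) |
|---|---|---|---|---|---|---|
| β (P-D) | 2.13e−5 | 1.52e−3 | 4.75e−05 | 2.25e−03 | .98 | .84 |
| φ (P-D) | 1.87e−2 | 4.25e−2 | 1.40e−02 | −1.19e−01 | .46 | .27 |
| ρ (P-D) | 5.91e−3 | 2.42e−3 | −4.89e−03 | −1.76e−01 | .68 | .79 |
| β (P-H) | 2.47e−2 | 2.93e−2 | 1.64e−03 | 1.01e−02 | .40 | .36 |
| φ (P-H) | 9.58e−3 | 3.36e−2 | −1.01e−02 | −1.06e−01 | .60 | .32 |
| ρ (P-H) | 4.61e−2 | 1.18e−2 | −9.01e−03 | −2.57e−01 | .25 | .56 |
| β (D-H) | 1.57e−3 | 7.83e−3 | 1.57e−03 | 7.83e−03 | .43 | .49 |
| φ (D-H) | 1.02e−3 | 2.95e−4 | −3.93e−03 | 1.20e−02 | .86 | .93 |
| ρ (D-H) | 4.11e−3 | 5.69e−4 | −4.07e−02 | −8.54e−02 | .73 | .90 |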
  1. Note. Drug-related differences (P: placebo, D: L-dopa, H: haloperidol) of model parameters for all participants (n = 31). Baseline dopamine (DA) function was assessed by the two behavioral DA proxies spontaneous eye blink rate (sEBR) and working memory capacity (WMC). For the latter, the first principal component across three different WMC tasks was used, denoted by WMCPCA. The column ‘R²’ denotes the R²-values of the linear regressions. The columns ‘β1 estimate’ and ‘β1 p-value’ show for each linear model the estimated value and p-value of the β1 regression coefficient, respectively. β: softmax parameter; φ: exploration bonus parameter; ρ: perseveration bonus parameter.

Appendix 1—table 5
Regions used for small volume correction.
| Region of small volume correction | peak voxel x (mm) | peak voxel y (mm) | peak voxel z (mm) | Reference for peak voxel |
|---|---|---|---|---|
| rFPC (right frontopolar cortex) | 27 | 57 | 6 | Daw et al., 2006 |
| lFPC (left frontopolar cortex) | −27 | 48 | 4 | Daw et al., 2006 |
| rIPS (right intraparietal sulcus) | 39 | −36 | 42 | Daw et al., 2006 |
| lIPS (left intraparietal sulcus) | −29 | −33 | 45 | Daw et al., 2006 |
| rAIns (right anterior insula) | 32 | 22 | −8 | Blanchard and Gershman, 2018 |
| lAIns (left anterior insula) | −30 | 16 | −8 | Blanchard and Gershman, 2018 |
| dACC (dorsal anterior cingulate cortex) | 8 | 16 | 46 | Blanchard and Gershman, 2018 |
  1. Note: Each small volume correction used a 10-mm-radius sphere around the listed voxel coordinates, which mark brain regions that have previously been associated with exploratory choices.
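A spherical search volume of this kind can be built as a boolean voxel mask. The sketch below is an illustration under stated assumptions: the function name, the image dimensions, and the 2-mm voxel-to-millimeter affine are hypothetical and do not describe the imaging pipeline used in the study.

```python
import numpy as np

def sphere_mask(shape, affine, center_mm, radius_mm=10.0):
    """Boolean mask of voxels whose centers lie within radius_mm of center_mm."""
    i, j, k = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    voxels = np.stack([i, j, k, np.ones_like(i)], axis=-1)   # homogeneous voxel indices
    mm = voxels @ affine.T                                    # voxel indices -> mm coordinates
    dist = np.linalg.norm(mm[..., :3] - np.asarray(center_mm, dtype=float), axis=-1)
    return dist <= radius_mm

# Hypothetical 2-mm isotropic grid; the affine is an assumption for illustration.
affine = np.array([[2.0, 0.0, 0.0, -90.0],
                   [0.0, 2.0, 0.0, -126.0],
                   [0.0, 0.0, 2.0, -72.0],
                   [0.0, 0.0, 0.0, 1.0]])
mask = sphere_mask((91, 109, 91), affine, center_mm=(8, 16, 46), radius_mm=10.0)
```

The example centers the sphere on the dACC coordinate from the table; the other six regions follow by swapping in their peak voxel coordinates.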

Appendix 1—table 6
Brain regions showing higher activity for exploratory than exploitative choices (first GLM).
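| Region | MNI x | MNI y | MNI z | peak z-value | cluster extent (k) |
|---|---|---|---|---|---|
| R/L intraparietal sulcus, R/L postcentral gyrus, R/L precuneus, L precentral gyrus | −48 | −33 | 52 | 10.45 | 15606 |
| R precentral gyrus | 26 | −8 | 50 | 9.32 | 2297 |
| R/L supplementary motor cortex, R/L dorsal anterior cingulate cortex | 8 | 12 | 45 | 8.47 | 2552 |
| R cerebellum/fusiform gyrus | 18 | −51 | −22 | 8.09 | 2574 |
| R middle frontal gyrus (FPC) | 39 | 34 | 28 | 7.56 | 1291 |
| R cerebellum | 24 | −57 | −54 | 7.35 | 128 |
| L precentral gyrus | −51 | 0 | 34 | 7.31 | 430 |
| L cerebellum, L fusiform gyrus | −40 | −54 | −32 | 7.28 | 1419 |
| L thalamus | −10 | −20 | 6 | 6.96 | 556 |
| R/L calcarine cortex | −8 | −74 | 14 | 6.90 | 1222 |
| R anterior insula | 36 | 20 | 3 | 6.87 | 511 |
| L anterior insula | −36 | 15 | 3 | 6.69 | 557 |
| R precentral gyrus | 51 | 8 | 24 | 6.49 | 434 |
| R thalamus | 10 | −18 | 8 | 6.32 | 331 |
| R cerebellum | 30 | −44 | −48 | 6.24 | 28 |
| L middle frontal gyrus (FPC) | −42 | 27 | 27 | 6.07 | 97 |
| R cerebellum | 14 | −62 | −45 | 5.88 | 61 |
| R pallidum | 15 | 6 | −4 | 5.83 | 25 |
| R calcarine cortex | 9 | −94 | 6 | 5.74 | 104 |
| vermis | 3 | −75 | −34 | 5.70 | 52 |
| R supramarginal gyrus | 51 | −42 | 28 | 5.69 | 46 |
| L middle frontal gyrus (FPC) | −30 | 46 | 15 | 5.67 | 47 |
| L pallidum | −10 | 6 | −4 | 5.64 | 51 |
| R anterior orbital gyrus | 24 | 54 | −9 | 5.60 | 33 |
| L posterior cingulate cortex | −3 | −32 | 26 | 5.51 | 21 |
| L caudate nucleus | −16 | −14 | 18 | 5.33 | 28 |
| R caudate nucleus | 12 | −8 | 16 | 5.24 | 16 |
| L lingual gyrus | −16 | −84 | −12 | 5.21 | 10 |
| R anterior cingulate cortex | 10 | 27 | 21 | 5.13 | 10 |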
  1. Note: Thresholded at p<0.05, FWE-corrected for whole-brain volume, with k ≥ 10 voxels; L: left; R: right.

Appendix 1—table 7
Brain regions showing higher activity for exploitative than exploratory choices (first GLM).
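| Region | MNI x | MNI y | MNI z | peak z-value | cluster extent (k) |
|---|---|---|---|---|---|
| L angular gyrus | −42 | −74 | 34 | 8.04 | 2530 |
| L posterior cingulate cortex/precuneus | −6 | −52 | 15 | 7.40 | 1087 |
| R angular gyrus | 52 | −68 | 28 | 7.02 | 185 |
| R postcentral gyrus | 33 | −26 | 54 | 6.80 | 503 |
| R cerebellum | 27 | −78 | −38 | 6.28 | 452 |
| R rostral anterior cingulate cortex | 4 | 18 | −14 | 5.90 | 125 |
| L superior temporal gyrus | −62 | −36 | 3 | 5.89 | 70 |
| L lateral orbital gyrus | −38 | 34 | −14 | 5.81 | 102 |
| R central operculum | 45 | −14 | 20 | 5.73 | 83 |
| L middle temporal gyrus | −62 | −4 | −22 | 5.67 | 193 |
| R/L medial frontal cortex (vmPFC) | −2 | 40 | −10 | 5.67 | 233 |
| L superior frontal gyrus | −10 | 54 | 30 | 5.54 | 20 |
| L superior frontal gyrus | −10 | 51 | 36 | 5.45 | 10 |
| L middle temporal gyrus | −60 | −51 | −2 | 5.38 | 61 |
| R superior temporal gyrus | 52 | −12 | −9 | 5.35 | 25 |
| R middle temporal gyrus | 62 | 4 | −21 | 5.30 | 10 |
| L rostral anterior cingulate cortex | −6 | 46 | 4 | 5.17 | 13 |
| L inferior frontal gyrus | −50 | 27 | 2 | 5.16 | 20 |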
  1. Note: Thresholded at p<0.05, FWE-corrected for whole-brain volume, with k ≥ 10 voxels; L: left; R: right.

Karima Chakroun, David Mathar, Antonius Wiehler, Florian Ganzer, Jan Peters (2020)
Dopaminergic modulation of the exploration/exploitation trade-off in human decision-making
eLife 9:e51260.
https://doi.org/10.7554/eLife.51260