Dopaminergic modulation of the exploration/exploitation trade-off in human decision-making
Decision letter

Samuel J Gershman, Reviewing Editor; Harvard University, United States

Michael J Frank, Senior Editor; Brown University, United States

Samuel J Gershman, Reviewer; Harvard University, United States

Bruno B Averbeck, Reviewer; NIH/NIMH, United States

John Pearson, Reviewer; Duke University, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
The reviewers and I agree that this work offers important new data bearing upon the role of dopamine in exploration. We appreciated the thoroughness of the modeling and data analysis.
Decision letter after peer review:
Thank you for submitting your article "Dopaminergic modulation of the exploration/exploitation trade-off in human decision-making" for consideration by eLife. Your article has been reviewed by Michael Frank as the Senior Editor, a Reviewing Editor, and three reviewers. The following individuals involved in the review of your submission have agreed to reveal their identity: Samuel J Gershman (Reviewer #1); Bruno B Averbeck (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This paper reports a new drug and neuroimaging dataset using a multi-armed bandit task previously studied by Daw et al. (2006), who also used fMRI (albeit with a smaller sample size). The main new results here come from the drug treatment (L-dopa and haloperidol), since the models studied here are nominally the same (though the authors use a more sophisticated hierarchical Bayesian estimation scheme). Consistent with (some) earlier work, models incorporating both directed exploration (in the form of an exploration bonus tied to bandit arm uncertainty) and perseveration (a tendency to repeat a previous choice) outperform models lacking each of these effects. The main novel finding is that L-dopa reduces the strength of the uncertainty bonus in exploratory choice. While fMRI contrasts between explore and exploit trials did reveal differences in line with previous literature, there were no drug effects that reached statistical significance, save for one exploratory analysis at a reduced threshold. The authors should be commended for comprehensively reporting both positive and null effects. The reviewers also agreed that the paper is for the most part technically well-executed and provides much-needed data on an important set of questions in the reinforcement learning literature.
Essential revisions:
1) The model comparison includes sensible models (it's also nice to see consistent model ordering across conditions). One thing that's missing (as noted in the Discussion section) is a model of random exploration dependent on "total uncertainty" (see Gershman, 2018, 2019), which should theoretically act as a gain modulator of the values. In fact, what's called the "overall uncertainty" is almost identical to this quantity, and is used in later analyses but apparently not in the choice rule.
2) It doesn't quite make sense to link the neural representation of "overall uncertainty" to the uncertainty-based exploration strategy formalized in this paper, because that strategy uses an uncertainty bonus, which means that the critical quantity is *relative* not *overall* (or "total" in the terminology of Gershman, 2018) uncertainty (relative and total uncertainty have been distinguished by previous imaging studies: Badre et al., 2012; Tomov et al., 2019). One approach, as suggested above, would be to build uncertainty-based random exploration into the behavioral model, and then connect the quantities in that model to the neural and drug data. Another (not mutually exclusive) approach is to look for neural correlates of relative uncertainty (i.e., the difference between uncertainties). For example, using the uncertainty of the chosen option relative to the average uncertainty of the unchosen options.
3) The manuscript is quite long and has a lot of detail that could be relegated to supplemental material or summarized. Each subtopic received at least a page in the Discussion, comprising one paragraph of discussion of results followed by lengthy literature review and speculation about its relation to other findings. All of the detail is interesting to a small percentage of expert readers. However, for most readers it makes it hard to get at the main results. As many of the outcomes are null findings, this is even more of an issue. For example, Figure 3 could be removed and replaced with some text indicating means and 95% CIs. Figure 4 is too much detail. Also, the paragraph that discusses how the inclusion of the perseverative coefficient in the model increases sensitivity for directed exploration could be in supplemental (this is worth pointing out, since the original Daw study did not find evidence for this, but it will be of interest to a small number of people, especially since several other studies have now found evidence for an uncertainty bonus). Also, the paragraph that shows no evidence for U-shaped dopamine effects, and similarly the figure showing RPE effects in the ventral striatum, could go to supplemental. And correspondingly, a bit more behavioral data might be useful. For Figure 2, is there a way to show a summary figure, across subjects? One of the issues with the original Daw study was that it was very hard for the participants to learn. How well are the participants able to track the values, and find the best option? Is there a way (might be very hard) to illustrate the effect of L-dopa on directed exploration by directly plotting behavioral data?
https://doi.org/10.7554/eLife.51260.sa1

Author response
Summary:
This paper reports a new drug and neuroimaging dataset using a multi-armed bandit task previously studied by Daw et al. (2006), who also used fMRI (albeit with a smaller sample size). The main new results here come from the drug treatment (L-dopa and haloperidol), since the models studied here are nominally the same (though the authors use a more sophisticated hierarchical Bayesian estimation scheme). Consistent with (some) earlier work, models incorporating both directed exploration (in the form of an exploration bonus tied to bandit arm uncertainty) and perseveration (a tendency to repeat a previous choice) outperform models lacking each of these effects. The main novel finding is that L-dopa reduces the strength of the uncertainty bonus in exploratory choice. While fMRI contrasts between explore and exploit trials did reveal differences in line with previous literature, there were no drug effects that reached statistical significance, save for one exploratory analysis at a reduced threshold. The authors should be commended for comprehensively reporting both positive and null effects. The reviewers also agreed that the paper is for the most part technically well-executed and provides much-needed data on an important set of questions in the reinforcement learning literature.
Essential revisions:
1) The model comparison includes sensible models (it's also nice to see consistent model ordering across conditions). One thing that's missing (as noted in the Discussion section) is a model of random exploration dependent on "total uncertainty" (see Gershman, 2018, 2019), which should theoretically act as a gain modulator of the values. In fact, what's called the "overall uncertainty" is almost identical to this quantity, and is used in later analyses but apparently not in the choice rule.
As suggested by the reviewers, we now examined an additional model that includes a random exploration term depending on "total uncertainty" (Gershman, 2018), that is, the summed uncertainty across all bandits. This additional model was again combined with both the delta rule (fixed learning rate) and the Kalman filter (uncertainty-dependent learning rate), yielding a total model space of now 8 models.
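For concreteness, one Kalman-filter update for the restless bandit, including the "total uncertainty" quantity, can be sketched as follows. This is an illustrative Python sketch, not the authors' code; the decay/diffusion parameter values mirror those reported by Daw et al. (2006) for this task and should be treated as assumptions here.

```python
import numpy as np

def kalman_step(m, v, chosen, reward,
                lam=0.9836, theta=50.0, sd_diff=2.8, sd_obs=4.0):
    """One trial of Kalman-filter learning for a restless four-armed bandit.

    m, v: posterior means and standard deviations per bandit.
    Parameter values are illustrative assumptions (cf. Daw et al., 2006).
    """
    m, v = m.copy(), v.copy()
    # Update step: only the chosen bandit's estimate is corrected.
    gain = v[chosen] ** 2 / (v[chosen] ** 2 + sd_obs ** 2)
    m[chosen] += gain * (reward - m[chosen])
    v[chosen] *= np.sqrt(1.0 - gain)
    # Prediction step: all means decay toward theta, uncertainty diffuses.
    m = lam * m + (1.0 - lam) * theta
    v = np.sqrt(lam ** 2 * v ** 2 + sd_diff ** 2)
    # "Total uncertainty" is the summed uncertainty across all bandits.
    return m, v, v.sum()
```

Under the additional model, this summed uncertainty would enter the choice rule as a gain modulator on random exploration, in the spirit of Gershman (2018).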
Based on this expanded model space, we again ran a model comparison via leave-one-out cross-validation (Vehtari, Gelman and Gabry, 2017). In short, including an additional term capturing total-uncertainty-based random exploration did not improve model fit compared to the previously best-fitting model with perseveration and directed exploration terms. The revised Figure 3, illustrating the results of the model comparison, now reads as follows:
We now describe the results of this expanded model comparison in the Results section, where we now write: "These learning rules were combined with four different choice rules that were all based on a softmax action selection rule (Sutton and Barto, 1998; Daw et al., 2006). Choice rule 1 was a standard softmax with a single inverse temperature parameter (β) modeling random exploration. Choice rule 2 included an additional free parameter (φ) modeling an exploration bonus that scaled with the estimated uncertainty of the chosen bandit (directed exploration). Choice rule 3 included an additional free parameter (ρ) modeling a perseveration bonus for the bandit chosen on the previous trial. Finally, choice rule 4 included an additional term to capture random exploration scaling with total uncertainty across all bandits (Gershman, 2018). Leave-one-out (LOO) cross-validation estimates (Vehtari, Gelman and Gabry, 2017) were computed over all drug conditions, and for each condition separately, to assess the models' predictive accuracies. The Bayesian learning model with terms for directed exploration and perseveration (Bayes-SMEP) showed the highest predictive accuracy in each drug condition and overall (Figure 3). The most complex model, which included an additional total-uncertainty-dependent term, provided a slightly inferior account of the data compared to the model without this term (LOO log-likelihood: Bayes-SMEP: 0.5983; Bayes-SMERP: 0.59989)."
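To illustrate how these choice rules nest within one another, here is a minimal Python sketch of choice rule 3 (softmax with directed exploration and perseveration bonuses). The function name and parameter values are hypothetical, not taken from the manuscript:

```python
import numpy as np

def choice_probs(q, sigma, prev_choice, beta=0.2, phi=1.0, rho=1.0):
    """Softmax over bandit values with an uncertainty (exploration)
    bonus and a perseveration bonus; parameter values are illustrative.

    q: estimated bandit values; sigma: their estimated uncertainties.
    Setting phi = rho = 0 recovers choice rule 1 (plain softmax).
    """
    persev = np.zeros_like(q)
    persev[prev_choice] = rho            # bonus for the previous choice
    logits = beta * (q + phi * sigma + persev)
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

With equal values, a more uncertain bandit receives a higher choice probability (directed exploration), and the previously chosen bandit is likewise boosted (perseveration).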
However, we then had an additional concern regarding our modeling approach. Specifically, we were concerned that our results might be influenced by our model formulation with respect to the implementation of the separate drug sessions. Recall that the original models implemented the three drug sessions via three separate group-level distributions for each parameter (β, φ, ρ; mean and standard deviation) from which the single-subject parameters for each drug condition were drawn. Alternatively, one could model the placebo condition as the baseline, and the drug effects as within-subject additive changes from that baseline (see e.g. Pedersen et al., 2017). We therefore re-ran all models and model comparisons using this alternative implementation of the drug effects.
Specifically, in the alternative model formulation, we modeled potential L-dopa- and haloperidol-related deviations ("shifts") from the placebo condition in explore/exploit behavior. Separate group-level distributions modeled drug-related deviations from the parameters' group-level distribution under placebo. On the single-subject level, L-dopa- and haloperidol-associated deviations from placebo were then implemented as parameters drawn from these group-level "shift" distributions (which were modeled with Gaussian priors centered at zero). These per-drug "shift" parameters are then added to the placebo parameter estimates via dummy-coded indicator variables coding for the distinct drug conditions (e.g. for the directed exploration parameter φ:
φ[subject] + I_HAL[drug] · φ_HAL[subject] + I_LD[drug] · φ_LD[subject]).
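The dummy-coding can be illustrated with a minimal sketch (hypothetical function and condition labels; in the actual models the shifts are drawn from Gaussian group-level priors centered at zero):

```python
def effective_phi(phi_placebo, shift_hal, shift_ld, condition):
    """Directed-exploration parameter for one subject in one session:
    placebo baseline plus dummy-coded, drug-specific shifts.
    Condition labels are hypothetical."""
    i_hal = 1.0 if condition == "haloperidol" else 0.0
    i_ld = 1.0 if condition == "ldopa" else 0.0
    return phi_placebo + i_hal * shift_hal + i_ld * shift_ld
```

Under placebo both indicators are zero, so the baseline parameter is recovered; under each drug exactly one shift is switched on.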
Importantly, using this alternative model formulation, we replicated the model ranking observed for the original formulation. Likewise, the magnitude and directionality of drug effects were highly similar in the alternative model formulation. In Author response image 1, we plot the LOO log-likelihood estimates for four models: the best-fitting model (Bayes-SMEP), the model with an additional term for uncertainty-based random exploration (Bayes-SMERP), and both models as "shift" versions with drug effects coded in the manner described above. Regardless of how drug effects were modeled, the Bayes-SMEP model provided a superior account of the data:
Taken together, both model comparisons suggest that including an additional term to capture total-uncertainty-based random exploration reduces the predictive accuracy of the model. We therefore refrained from examining this model in greater detail with respect to the neural and behavioral effects, and retained the better-fitting original model.
2) It doesn't quite make sense to link the neural representation of "overall uncertainty" to the uncertainty-based exploration strategy formalized in this paper, because that strategy uses an uncertainty bonus, which means that the critical quantity is *relative* not *overall* (or "total" in the terminology of Gershman, 2018) uncertainty (relative and total uncertainty have been distinguished by previous imaging studies: Badre et al., 2012; Tomov et al., 2019). One approach, as suggested above, would be to build uncertainty-based random exploration into the behavioral model, and then connect the quantities in that model to the neural and drug data. Another (not mutually exclusive) approach is to look for neural correlates of relative uncertainty (i.e., the difference between uncertainties). For example, using the uncertainty of the chosen option relative to the average uncertainty of the unchosen options.
We agree that improved visualization of the behavioral effects would help readers understand the nature of the drug effect. To this end, we included two additional figures to visualize the drug effects on behavior.
In the new Figure 2, we now plot the proportion of choices of the best bandit over trials, separately for each drug condition:
With respect to the new Figure 2, we now write [see Results section]: "Overall, participants' choice behavior indicated that they understood the task structure and tracked the most valuable bandit throughout the task. On trial 1, participants randomly selected one of the four bandits (probability of choosing the best bandit: 21.5% ± 7.49%, M ± SE). After 5 trials, participants already selected the most valuable option in 57.76% (± 4.89%; M ± SE) of trials, which was significantly above the chance level of 25% (t(30) = 5.79, p = 2.52 × 10⁻⁶, Figure 2), and they consistently kept choosing the bandit with the highest payoff, on average in 67.89% (± 2.78%) of trials. Thus, participants continuously adjusted their choices to the fluctuating outcomes of the four bandits."
In the new Figure 5, we now plot the proportion of exploitation, random exploration and directed exploration trials over time, separately for each drug condition:
With respect to the new Figure 5, we now write [see Results section]: "Next, we tested for possible drug effects on the percentage of exploitation and exploration trials (overall, random and directed) per subject. Separate rmANOVAs with within-subject factors drug and trial block (6 blocks of 50 trials each) were computed for each of the following four dependent variables: the percentage of (a) exploitation trials, (b) overall exploration trials, (c) random exploration trials, and (d) directed exploration trials. We found a significant drug effect only for the percentage of directed explorations (F(1.66, 49.91) = 7.18, p = .003; Figure 5c). The fractions of random explorations (F(2, 60) = 0.55, p = .58; Figure 5b), overall explorations (F(2, 60) = 0.97, p = .39), and exploitations (F(2, 60) = 1.57, p = .22; Figure 5a) were not modulated by drug. None of the drug × block interactions were significant (all p >= .19). Post-hoc paired t-tests showed a significant reduction in the percentage of directed explorations under L-dopa compared to placebo (mean difference P-D = 2.82, t(30) = 4.69, p < .001) and haloperidol (mean difference H-D = 2.42, t(30) = 2.76, p = .010), but not between placebo and haloperidol (mean difference P-H = 0.39, t(30) = 0.43, p = .667). Notably, an exploratory t-test revealed that the percentage of exploitations was marginally increased under L-dopa compared to placebo (mean difference P-D = 2.61, t(30) = 1.92, p = .065)."
3) The manuscript is quite long and has a lot of detail that could be relegated to supplemental material or summarized. Each subtopic received at least a page in the Discussion, comprising one paragraph of discussion of results followed by lengthy literature review and speculation about its relation to other findings. All of the detail is interesting to a small percentage of expert readers. However, for most readers it makes it hard to get at the main results. As many of the outcomes are null findings, this is even more of an issue. For example, Figure 3 could be removed and replaced with some text indicating means and 95% CIs. Figure 4 is too much detail. Also, the paragraph that discusses how the inclusion of the perseverative coefficient in the model increases sensitivity for directed exploration could be in supplemental (this is worth pointing out, since the original Daw study did not find evidence for this, but it will be of interest to a small number of people, especially since several other studies have now found evidence for an uncertainty bonus). Also, the paragraph that shows no evidence for U-shaped dopamine effects, and similarly the figure showing RPE effects in the ventral striatum, could go to supplemental. And correspondingly, a bit more behavioral data might be useful. For Figure 2, is there a way to show a summary figure, across subjects? One of the issues with the original Daw study was that it was very hard for the participants to learn. How well are the participants able to track the values, and find the best option? Is there a way (might be very hard) to illustrate the effect of L-dopa on directed exploration by directly plotting behavioral data?
We substantially shortened the manuscript and relegated significant parts of the Results and Discussion sections to the supplement ("Appendix 1"). Specifically, the following sections have been moved to Appendix 1:
1) The delineation of model-based/model-free reinforcement learning and its relation to explore/exploit behavior in the Introduction;
2) The description of the assessment of dopamine proxy measures to test dopamine-baseline-dependent effects within the Introduction;
3) The entire section "Accounting for perseveration boosts estimates of directed exploration" within the Results;
4) The subsection showing the drug effects on the single-subject parameter posterior distributions within the Results section "L-dopa reduces directed exploration";
5) The section on explore/exploit-related brain activation and the respective drug effects has been condensed, and the figures depicting the explore vs. exploit contrasts, as well as PE-related activation, have been moved to Appendix 1;
6) All subsections within the Discussion have been substantially shortened; specifically, the sections on drug effects on behavior and neural activation have been significantly condensed.
https://doi.org/10.7554/eLife.51260.sa2