Abstract
The explore-exploit dilemma occurs anytime we must choose between exploring unknown options for information and exploiting known resources for reward. Previous work suggests that people use two different strategies to solve the explore-exploit dilemma: directed exploration, driven by information seeking, and random exploration, driven by decision noise. Here, we show that these two strategies rely on different neural systems. Using transcranial magnetic stimulation to inhibit the right frontopolar cortex, we were able to selectively inhibit directed exploration while leaving random exploration intact. This suggests a causal role for right frontopolar cortex in directed, but not random, exploration and that directed and random exploration rely on (at least partially) dissociable neural systems.
https://doi.org/10.7554/eLife.27430.001
Introduction
In an uncertain world, adaptive behavior requires us to carefully balance the exploration of new opportunities with the exploitation of known resources. Finding the optimal balance between exploration and exploitation is a hard computational problem and there is considerable interest in understanding how humans and animals strike this balance in practice (Badre et al., 2012; Cavanagh et al., 2011; Cohen et al., 2007; Daw et al., 2006; Frank et al., 2009; Hills et al., 2015; Mehlhorn et al., 2015; Wilson et al., 2014). Recent work has suggested that humans use two distinct strategies to solve the explore-exploit dilemma: directed exploration, based on information seeking, and random exploration, based on decision noise (Wilson et al., 2014). Even though both of these strategies serve the same purpose, that is, balancing exploration and exploitation, it is likely they rely on different cognitive mechanisms. Directed exploration is driven by information and is thought to be computationally complex (Gittins and Jones, 1979; Auer et al., 2002; Gittins, 1974). On the other hand, random exploration can be implemented in a simpler fashion by using neural or environmental noise to randomize choice (Thompson, 1933).
A key question is whether these dissociable behavioral strategies rely on dissociable neural systems. Of particular interest is the frontopolar cortex (FPC) – an area that has been associated with a number of functions, such as tracking pending and/or alternate options (Koechlin and Hyafil, 2007; Boorman et al., 2009), strategies (Domenech and Koechlin, 2015) and goals (Pollmann, 2016) and that has been implicated in exploration itself (Badre et al., 2012; Cavanagh et al., 2011; Daw et al., 2006). Importantly, however, the exact role that FPC plays in exploration is unknown because the definition of exploration varies from paper to paper. In one line of work, exploration is defined as information seeking. Understood this way, exploration correlates with right FPC (RFPC) activity measured via fMRI (Badre et al., 2012) and a frontal theta component in EEG (Cavanagh et al., 2011), suggesting a role for RFPC in directed exploration. However, in another line of work, exploration is operationalized differently, as choosing the low value option rather than the most informative one. Such a measure of exploration is more consistent with random exploration, where decision noise drives the sampling of low value options by chance. Defined in this way, exploratory choice correlates with lateral FPC activation (Daw et al., 2006), and stimulation and inhibition of RFPC with direct current (tDCS) can increase and decrease the frequency with which such exploratory choices occur (Raja Beharelle et al., 2015).
Taken together, these two sets of findings suggest that RFPC plays a crucial role in both directed and random exploration. However, we believe that such a conclusion is premature because of a subtle confound that arises between reward and information in most explore-exploit tasks. This confound arises because participants only gain information from the options they choose, yet are incentivized to choose more rewarding options. Thus, over many trials, participants gain more information about more rewarding options and the two ways of defining exploration, that is, choosing high information or low reward options, become confounded (Wilson et al., 2014). This makes it impossible to tell whether the link between RFPC and exploration is specific to either directed or random exploration, or whether it is general to both.
To distinguish these interpretations and investigate the causal role of FPC in directed and random exploration, we used continuous theta-burst TMS (Huang et al., 2005) to selectively inhibit right frontopolar cortex (RFPC) in participants performing the ‘Horizon Task’, an explore-exploit task specifically designed to separate directed and random exploration (Wilson et al., 2014). Using this task we find evidence that inhibition of RFPC selectively inhibits directed exploration while leaving random exploration intact.
Results
We used our previously published ‘Horizon Task’ (Figure 1) to measure the effects of TMS of RFPC on directed and random exploration. In this task, participants play a set of games in which they make choices between two slot machines (one-armed bandits) that pay out rewards from different Gaussian distributions. To maximize their rewards in each game, participants need to exploit the slot machine with the highest mean, but they cannot identify this best option without exploring both options first.
The Horizon Task has two key manipulations that allow us to measure directed and random exploration. The first manipulation is the horizon itself, i.e. the number of decisions remaining in each game. The idea behind this manipulation is that when the horizon is long (6 trials), participants should explore more frequently, because any information they acquire from exploring can be used to make better choices later on. In contrast, when the horizon is short (1 trial), participants should exploit the option they believe to be best. Thus, this task allows us to quantify directed and random exploration as changes in information seeking and behavioral variability that occur with horizon.
The second manipulation is the amount of information participants have about each option before making their first choice. This information manipulation is achieved by using four forced-choice trials, in which participants are told which option to pick, at the start of each game. We use these forced-choice trials to set up one of two information conditions: an unequal, or [1 3], condition, in which participants see 1 play from one option and 3 plays from the other option, and an equal, or [2 2], condition, in which participants see two outcomes from both options. By varying the amount of information participants have about each option independent of the mean payout of that option, this information manipulation allows us to remove the reward-information confound, at least on the first free-choice trial (Figure 2). After the first free-choice trial, however, participants tend to choose more rewarding options more frequently and reward and information are rapidly confounded. For this reason the bulk of our analyses are focussed on the first free-choice trial, where the confound has been removed.
RFPC stimulation selectively inhibits directed exploration on the first free-choice
In this section we analyze behavior on the first free-choice trial in each game. This way we are able to remove any effect of the reward-information confound and fairly compare behavior between horizon conditions. We analyze the data with both a model-free approach, using simple statistics of the data to quantify directed and random exploration, as well as a model-based approach, using a cognitive model of the behavior to draw more precise conclusions. Both analyses point to the same conclusion that RFPC stimulation selectively inhibits directed, but not random, exploration.
Model-free analysis
The two information conditions in the Horizon Task allow us to quantify directed and random exploration in a model-free way. In particular, directed exploration, which involves information seeking, can be quantified as the probability of choosing the high information option, $p(\text{high info})$, in the [1 3] condition, while random exploration, which involves decision noise, can be quantified as the probability of making a mistake, or choosing the low mean reward option, $p(\text{low mean})$, in the [2 2] condition.
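The two model-free measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`FirstFreeChoice` and its field names) is hypothetical, the "low mean" option is taken to be the one with the lower observed forced-choice mean, and games with tied observed means are skipped.

```python
from dataclasses import dataclass


@dataclass
class FirstFreeChoice:
    """One first free-choice trial (hypothetical record layout)."""
    condition: str       # "[1 3]" (unequal) or "[2 2]" (equal information)
    n_forced_left: int   # number of forced-choice plays of the left option
    mean_left: float     # observed mean of the left forced-choice outcomes
    mean_right: float    # observed mean of the right forced-choice outcomes
    choice: str          # "left" or "right"


def p_high_info(trials):
    """Fraction of [1 3] games in which the less-sampled option was chosen."""
    unequal = [t for t in trials if t.condition == "[1 3]"]
    hits = 0
    for t in unequal:
        high_info_side = "left" if t.n_forced_left == 1 else "right"
        hits += t.choice == high_info_side
    return hits / len(unequal)


def p_low_mean(trials):
    """Fraction of [2 2] games in which the lower-mean option was chosen."""
    equal = [t for t in trials
             if t.condition == "[2 2]" and t.mean_left != t.mean_right]
    hits = 0
    for t in equal:
        low_mean_side = "left" if t.mean_left < t.mean_right else "right"
        hits += t.choice == low_mean_side
    return hits / len(equal)
```

Computing each measure separately for horizon 1 and horizon 6 games, and taking the difference, would give the horizon-dependent change that the analyses below rely on.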
Using these measures of exploration, we found that inhibiting the RFPC had a significant effect on directed exploration but not random exploration (Figure 3A,B). In particular, for directed exploration, a repeated measures ANOVA with horizon, TMS condition and order as factors revealed a significant interaction between stimulation condition and horizon ($F(1,24)=4.96$, $p=0.036$). Conversely, a similar analysis for random exploration revealed no effects of stimulation condition (main effect of stimulation condition, $F(1,24)=0.88$, $p=0.36$; interaction of stimulation condition with horizon, $F(1,24)=1.24$, $p=0.28$). Post hoc analyses revealed that the change in directed exploration was driven by changes in information seeking in horizon 6 (one-sided t-test, $t(24)=2.62$, $p=0.008$) and not in horizon 1 (two-sided t-test, $t(24)=0.30$).
Model-based analysis
While the model-free analyses are intuitive, the model-free statistics, $p(\text{high info})$ and $p(\text{low mean})$, are not pure reflections of information seeking and behavioral variability and could be influenced by other factors such as spatial bias and learning. To account for these possibilities we performed a model-based analysis using a model that extends our earlier work (Wilson et al., 2014; Somerville et al., 2017; Krueger et al., 2017); see Materials and methods for a complete description. In this model, the level of directed and random exploration is captured by two parameters: an information bonus for directed exploration, and decision noise for random exploration. In addition, the model includes terms for the spatial bias and to describe learning.
Overview of model
Before presenting the results of the model-based analysis we begin with a brief overview of the most salient points of the model. A full description of the model can be found in the Materials and methods, and code to implement the model can be found in the Supplementary Material.
Conceptually, the model breaks the explore-exploit choice down into two components: a learning component, in which participants estimate the mean payoff of each option from the rewards they see, and a decision component, in which participants use this estimated payoff to guide their choice. The learning component assumes that participants compute an estimate of the average payoff for each slot machine, ${R}_{t}^{i}$, using a simple delta rule update equation (based on a Kalman filter (Kalman, 1960), see Materials and methods):
$${R}_{t+1}^{i}={R}_{t}^{i}+{\alpha}_{t}^{i}\left({r}_{t}-{R}_{t}^{i}\right)$$
where ${r}_{t}$ is the reward on trial $t$ and ${\alpha}_{t}^{i}$ is the time-varying learning rate that determines the extent to which the prediction error, $({r}_{t}-{R}_{t}^{i})$, updates the estimate of the mean of bandit $i$. The learning process is described by three free parameters: the initial value of the estimated payoff, ${R}_{0}$, and two learning rates, the initial learning rate, ${\alpha}_{1}$, and the asymptotic learning rate, ${\alpha}_{\mathrm{\infty}}$, which together describe the evolution of the actual learning rate, ${\alpha}_{t}$, over time. For simplicity, we assume that these parameters are independent of horizon and uncertainty condition (Table 1).
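As a minimal sketch of the delta rule described here, the function below applies the update once per observed reward. The learning-rate schedule is passed in by the caller; the $1/t$ schedule used in the example is an illustrative assumption, not the schedule fitted in the paper (that one is defined in Materials and methods), and the function name and reward values are hypothetical.

```python
def delta_rule(rewards, r0, learning_rates):
    """Update an estimated payoff once per observed reward.

    rewards        -- outcomes seen for one bandit, in order
    r0             -- initial estimate (the prior mean R_0)
    learning_rates -- one learning rate per reward (assumed supplied by caller)
    """
    estimate = r0
    for reward, alpha in zip(rewards, learning_rates):
        prediction_error = reward - estimate
        estimate += alpha * prediction_error
    return estimate


rewards = [40.0, 60.0, 50.0]

# A learning rate of 1/t starts at 1 and decays toward 0; with this schedule
# the delta rule reproduces the running mean and ignores the prior.
running_mean_rates = [1.0 / (t + 1) for t in range(len(rewards))]
print(delta_rule(rewards, r0=50.0, learning_rates=running_mean_rates))

# A constant learning rate weights recent rewards more heavily and keeps
# some influence of the prior.
print(delta_rule(rewards, r0=50.0, learning_rates=[0.5, 0.5, 0.5]))
```

With an initial rate below 1, the estimate retains some weight on ${R}_{0}$; with a rate that does not decay to 0, later rewards count for more than earlier ones.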
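The decision component of the model assumes that participants choose between the two options (left and right) probabilistically according to
$$p(\text{choose left})=\frac{1}{1+\mathrm{exp}\left(-\frac{\mathrm{\Delta}R+A\mathrm{\Delta}I+B}{\sigma}\right)}$$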
where $\mathrm{\Delta}R$ ( $={R}_{t}^{left}-{R}_{t}^{right}$ ) is the difference in expected reward between left and right options and $\mathrm{\Delta}I$ is the difference in information between left and right options (which we define as +1 when left is more informative, −1 when right is more informative, and 0 when both options convey equal information in the [2 2] condition). The decision process is described by three free parameters: the information bonus $A$, the spatial bias $B$, and the decision noise $\sigma $. We estimate separate values of the decision parameters for each horizon and (since the information bonus is only used in the [1 3] condition) separate values of only the bias and decision noise for each uncertainty condition.
Overall, each subject’s behavior in each session (vertex vs RFPC stimulation) is described by 13 free parameters (Table 1): three describing learning (${R}_{0}$, ${\alpha}_{1}$ and ${\alpha}_{\mathrm{\infty}}$) and 10 describing the decision process ($A$ in the two horizon conditions, $B$ and $\sigma $ in the four horizon × uncertainty conditions). These 13 parameters were fit to each subject in each stimulation condition using a hierarchical Bayesian approach (Lee and Wagenmakers, 2014) (see Materials and methods).
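To make the roles of the three decision parameters concrete, here is a small sketch of a logistic choice rule built from the quantities defined above. The exact functional form is an assumption (a standard logistic in which positive $\mathrm{\Delta}R$, a positive information bonus on the more informative side, and a positive bias all favor the left option); the function and argument names are hypothetical.

```python
import math


def p_choose_left(delta_r, delta_i, info_bonus, spatial_bias, noise):
    """Probability of choosing the left option under a logistic choice rule.

    delta_r      -- estimated reward of left minus right
    delta_i      -- +1 if left is more informative, -1 if right, 0 if equal
    info_bonus   -- A, drives directed exploration
    spatial_bias -- B, a side preference (positive favors left here)
    noise        -- sigma > 0, drives random exploration
    """
    drive = (delta_r + info_bonus * delta_i + spatial_bias) / noise
    return 1.0 / (1.0 + math.exp(-drive))


# Right looks 4 points better, but left is the more informative option.
print(p_choose_left(delta_r=-4.0, delta_i=1, info_bonus=6.0,
                    spatial_bias=0.0, noise=5.0))

# Same reward difference with no information difference and more noise:
# choices move closer to chance.
print(p_choose_left(delta_r=-4.0, delta_i=0, info_bonus=6.0,
                    spatial_bias=0.0, noise=20.0))
```

In this form a larger $A$ shifts choices toward the more informative option, while a larger $\sigma$ flattens the choice curve toward 0.5 regardless of which option is favored.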
Model fitting results
Posterior distributions over the group-level means are shown in the left column of Figure 4, while posteriors over the TMS-related change in parameters are shown in the right column. Both columns suggest a selective effect of RFPC stimulation on the information bonus in horizon 6.
Focussing on the left column first, overall the parameter values seem reasonable. The prior mean is close to the generative mean of 50 used in the actual experiment, and the decision parameters are comparable to those found in our previous work (Wilson et al., 2014). The learning rate parameters, ${\alpha}_{1}$ and ${\alpha}_{\mathrm{\infty}}$, were not included in our previous models and are worth discussing in more detail. As expected for Bayesian learning (Kalman, 1960; Nassar et al., 2010), the initial learning rate is higher than the asymptotic learning rate (95% of samples in the vertex condition, 94% in the RFPC condition). However, the actual values of the learning rates are quite far from their ‘optimal’ settings of ${\alpha}_{1}=1$ and ${\alpha}_{\mathrm{\infty}}=0$ that would correspond to perfectly computing the mean reward. This suggests a greater than optimal reliance on the prior (${\alpha}_{1}<1$) and a pronounced recency bias (${\alpha}_{\mathrm{\infty}}>0$) such that the most recent rewards are weighted more heavily in the computation of expected reward, ${R}_{t}^{i}$. Both of these findings are likely due to the fact that the version of the task we employed did not keep the outcomes of the forced trials on screen and instead relied on people’s memories to compute the expected value.
Turning to the right-hand column of Figure 4, we can see that the model-based analysis yields a similar result to the model-free analysis. In particular, we see a reduction (of about 4.8 points) in the information bonus in horizon 6 (with 99% of samples showing a reduced information bonus in the RFPC stimulation condition) and no effect on decision noise in either horizon in either the [2 2] or [1 3] uncertainty conditions (with between 40% and 63% of samples below zero).
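The percentages quoted here summarize posterior samples of a parameter change. As a rough sketch of how such a summary is obtained, the helper below counts the fraction of samples that fall below zero; the function name and the sample values are hypothetical and are not data from the study.

```python
def fraction_below_zero(samples):
    """Fraction of posterior samples of a parameter change that are negative."""
    values = list(samples)
    if not values:
        raise ValueError("need at least one posterior sample")
    return sum(1 for v in values if v < 0) / len(values)


# Hypothetical draws of a TMS-related change in a parameter.
toy_samples = [-6.1, -4.8, -5.3, 0.4, -2.9, -7.0, -3.3, -4.1, 1.2, -5.5]
print(fraction_below_zero(toy_samples))  # 8 of the 10 toy values are negative
```

A fraction near 0.5 indicates no evidence for a change in either direction, whereas a fraction near 0 or 1 indicates that almost all of the posterior mass lies on one side of zero.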
In addition to the effect on the information bonus in horizon 6, there is also a hint of an effect on the information bonus in horizon 1 (85% samples less than zero) and on the prior mean ${R}_{0}$ (88% samples above zero). While these results may suggest that RFPC stimulation affects more than just information bonus in horizon 6, they more likely reflect an inherent tradeoff between prior mean and information bonus that is peculiar to this task. In particular, because the prior mean has a stronger effect on the more uncertain option, an increase in ${R}_{0}$ increases the value of the more informative option in much the same way as an information bonus. Thus, when applied to this task, the model has a built in tradeoff between prior mean and information bonus that can muddy the interpretation of both. Note that this tradeoff is not a general feature of the model and could be removed with a different task design that employed more forced choice trials and hence more time for the effects of the prior to be removed.
Figure 5 exposes the tradeoff between ${R}_{0}$ and $A$ in more detail. Panels A and B plot samples from the posterior over the TMS-related change in information bonus, $A(\mathrm{vertex})-A(\mathrm{RFPC})$, against the TMS-related change in prior mean, ${R}_{0}(\mathrm{vertex})-{R}_{0}(\mathrm{RFPC})$. For both horizon conditions we see a strong negative correlation such that increasing ${R}_{0}$ decreases $A$. This negative correlation is especially problematic for the interpretation of the horizon 1 change in information bonus, where a sizable fraction of the posterior centers on no change in either variable. In contrast, the negative correlation between $A$ and ${R}_{0}$ does not affect our interpretation of the horizon 6 result, where the TMS-related change in $A$ is negative regardless of the change in ${R}_{0}$.
Finally we asked whether the horizon-dependent change in information seeking, i.e. $\mathrm{\Delta}A=A(h=6)-A(h=1)$, was different in each TMS condition. As shown in Figure 5C, the TMS-related change in $\mathrm{\Delta}A$ is about −3.1 points (94% samples below 0) and is uncorrelated with the TMS-related change in ${R}_{0}$. Taken together, this suggests that we can be fairly confident in our claim that RFPC stimulation has a selective effect on directed exploration.
The effect of RFPC stimulation on later trials
Our analyses so far have focussed on just the first free choice and have ignored the remaining five choices in the horizon six games. The reason for this is the reward-information confound, illustrated in Figure 2, which makes interpretation of the later trials more difficult. Despite this difficulty, we note that in Figure 2 the size of the confound is almost identical in the two stimulation conditions and so we proceed, with caution, to present a model-free analysis of the later trials below.
In Figure 6 we plot the model-free measures, $p(\text{high info})$ and $p(\text{low mean})$, as a function of trial number. Both measures show a decrease over the course of the horizon six games although, because of the confound, it is difficult to say whether these changes reflect a reduction in directed exploration, random exploration, or both. Interestingly, the differences in $p(\text{high info})$ between vertex and RFPC conditions on the first free-choice trial appear to persist into the second, a result that becomes more apparent when we plot the TMS-related change, that is, $p(\text{high info},\mathrm{RFPC})-p(\text{high info},\mathrm{vertex})$ (Figure 6C,D). More formally, a repeated measures ANOVA with trial number and TMS condition as factors reveals a significant main effect of trial number ($F(5,120)=126$, $p<{10}^{-45}$), no main effect of TMS condition ($F(1,120)=1.17$, $p=0.29$) and a near significant interaction between trial number and TMS condition ($F(5,120)=2.26$, $p=0.053$). A post hoc, one-sided t-test reveals a marginally significant reduction in $p(\text{high info})$ on the second trial ($t(24)=1.61$).
In contrast, a similar analysis for random exploration shows no evidence for any effect of TMS condition on $p(\text{low mean})$ (main effect of TMS, $F(1,120)=0.16$, $p=0.69$; TMS x trial number, $F(5,120)=0.69$, $p=0.63$) although the main effect of trial number persists ($F(5,120)=13.7$, $p<{10}^{-9}$). Thus, the analysis of later trials provides additional, albeit modest, support for the idea that RFPC stimulation selectively disrupts directed but not random exploration at long horizons.
Discussion
In this work we used continuous theta-burst transcranial magnetic stimulation (cTBS) to investigate whether right frontopolar cortex (RFPC) is causally involved in directed and random exploration. Using a task that is able to behaviorally dissociate these two types of exploration, we found that inhibition of RFPC caused a selective reduction in directed, but not random, exploration. To the best of our knowledge, this finding represents the first causal evidence that directed and random exploration rely on dissociable neural systems and is consistent with our recent findings showing that directed and random exploration have different developmental profiles (Somerville et al., 2017). This suggests that, contrary to the assumption underlying many contemporary studies (Daw et al., 2006; Badre et al., 2012), exploration is not a unitary process, but a dual process in which the distinct strategies of information seeking and choice randomization are implemented via distinct neural systems.
Such a dual-process view of exploration is consistent with the classical idea that there are multiple types of exploration (Berlyne, 1966). In particular, Berlyne’s constructs of ‘specific exploration’, involving a drive for information, and ‘diversive exploration’, involving a drive for variety, bear a striking resemblance to our definitions of directed and random exploration. Despite the importance of Berlyne’s work, more modern views of exploration tend not to make the distinction between different types of exploration, considering instead a single exploratory state or exploratory drive that controls information seeking across a wide range of tasks (Berlyne, 1966; Aston-Jones and Cohen, 2005; Hills et al., 2015; Kidd and Hayden, 2015). At face value, such unitary accounts seem at odds with a dual-process view of exploration. However, these two viewpoints can be reconciled if we allow for the possibility that, while directed and random exploration are implemented by different systems, their levels are set by a common exploratory drive.
Intriguingly, individual differences in behavior on the Horizon Task provide some support for the idea that directed and random exploration are driven by a common source. In particular, in a large behavioral data set of 277 people performing the Horizon Task, we find a positive correlation between the levels of directed and random exploration such that people with high levels of directed exploration also tend to have high levels of random exploration ($r(275)=0.29$, $p<{10}^{-5}$; Figure 7). This is consistent with the idea that the levels of directed and random exploration are set by the strength of an exploratory drive that varies as an individual difference between people.
While the present study does allow us to conclude that directed and random exploration rely on different neural systems, the limited spatial specificity of TMS limits our ability to say exactly what those systems are. In particular, because the spatial extent of TMS is quite large, stimulation aimed at frontal pole may directly affect activity in nearby areas such as ventromedial prefrontal cortex (vmPFC) and orbitofrontal cortex (OFC), both areas that have been implicated in exploratory decision making and that may be contributing to our effect (Daw et al., 2006). In addition to such direct effects of TMS on nearby regions, indirect changes in areas that are connected to the frontal pole could also be driving our effect. For example, cTBS of left frontal pole has been associated with changes in blood perfusion in areas such as amygdala, fusiform gyrus and posterior parietal cortex (Volman et al., 2011) and with changes in the fMRI BOLD signal in OFC, insula and striatum (Hanlon et al., 2017). In addition, Volman et al. (2011) showed that unilateral cTBS of left frontal pole is associated with changes in blood perfusion to the right frontal pole. Indeed, such a bilateral effect of cTBS may explain why our intervention was effective at all given that a number of neuroimaging studies have shown bilateral activation of the frontal pole associated with exploration (Daw et al., 2006; Badre et al., 2012). Future work combining cTBS with neuroimaging will be necessary to shed light on these issues.
With the above caveats that our results may not be entirely due to disruption of frontal pole, the interpretation that RFPC plays a role in directed, but not random, exploration is consistent with a number of previous findings. For example, frontal pole has been associated with tracking the value of the best unchosen option (Boorman et al., 2009), inferring the reliability of alternate strategies (Boorman et al., 2009; Domenech and Koechlin, 2015), arbitrating between old and new strategies (Donoso et al., 2014; Mansouri et al., 2015), and reallocating cognitive resources among potential goals in underspecified situations (Pollmann, 2016). Taken together, these findings suggest a role for frontal pole in model-based decisions (Daw et al., 2006) that involve long-term planning and the consideration of alternative actions. From this perspective, it is perhaps not surprising that directed exploration relies on RFPC, since computing an information bonus relies heavily on an internal model of the world. It is also perhaps not surprising that random exploration is independent of RFPC, as this simpler strategy could be implemented without reference to an internal model. Indeed, the ability to explore effectively in a model-free manner may be an important function of random exploration, as it allows us to explore even when our model of the world is wrong.
More generally, it is unlikely that frontal pole is the only area involved in directed exploration, and more work will be needed to map out the areas involved in directed and random exploration and expose their causal relationship to explore-exploit behavior.
Materials and methods
Participants
Thirty-one healthy, right-handed adult volunteers took part in the study (19 female, 12 male; ages 19–32). An initial sample size of 16 was chosen based on two studies using a very similar cTBS design that stimulated lateral FPC (Costa et al., 2011; Costa et al., 2013) and this was augmented to 31 on the basis of feedback from reviewers. Five participants (5 female, 0 male) were excluded from the analysis due to chance-level performance in both experimental sessions. One (female) participant failed to return for the second (vertex stimulation condition) session and is excluded from the model-free analyses but not the model-based analyses, which can handle missing data more gracefully. Thus our final data set consisted of 25 participants (13 female, 12 male, ages 19–32) with complete data and one participant (female, aged 20) with data from the RFPC session only.
All participants were informed about potential risks connected to TMS and signed a written consent. The study was approved by University of Social Sciences and Humanities ethics committee.
Procedure
There were two experimental TMS sessions and a preceding MRI session. On the first session T1 structural images were acquired using a 3T Siemens TRIO scanner. The scanning session lasted up to 10 min. Before the first two sessions, participants filled in standard safety questionnaires regarding MRI scanning and TMS. During the experimental sessions, prior to the stimulation participants went through 16 training games to get accustomed to the task. Afterwards, resting motor thresholds were obtained and the stimulation took place. Participants began the main task immediately after stimulation. The two experimental sessions were performed with an intersession interval of at least 5 days. The order of stimulation conditions was counterbalanced across subjects. All sessions took place at Nencki Institute of Experimental Biology in Warsaw.
Stimulation site
The RFPC peak was defined as [x,y,z]= [35,50,15] in MNI (Montreal Neurological Institute) space. The coordinates were based on a number of fMRI findings that indicated RFPC involvement in exploration (Badre et al., 2012; Boorman et al., 2009; Daw et al., 2006) and constrained by the plausibility of stimulation (e.g. defining ‘z’ coordinate lower would result in the coil being placed uncomfortably close to the eyes). Vertex corresponded to the Cz position of the 10–20 EEG system. In order to locate the stimulation sites we used a frameless neuronavigation system (Brainsight software, Rogue Research, Montreal, Canada) with a Polaris Vicra infrared camera (Northern Digital, Waterloo, Ontario, Canada).
TMS protocol
We used continuous theta burst stimulation (cTBS) (Huang et al., 2005). cTBS requires 50 Hz stimulation at 80% resting motor threshold. 40 s stimulation is equivalent to 600 pulses and can decrease cortical excitability for up to 50 min (Wischnewski and Schutter, 2015).
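The pulse count follows from the standard theta-burst pattern, which is assumed here rather than stated in the text above: bursts of three pulses at 50 Hz, with bursts repeated every 200 ms (5 Hz). Under that assumption the arithmetic is:

$$3\ \tfrac{\text{pulses}}{\text{burst}}\times 5\ \tfrac{\text{bursts}}{\text{s}}\times 40\ \text{s}=600\ \text{pulses}$$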
Individual resting motor thresholds were assessed by stimulating the right motor knob and inspecting if the stimulation caused an involuntary hand twitch in 50% of the cases. We used a MagPro X100 stimulator (MagVenture, Hueckelhoven, Germany) with a 70 mm figure-eight coil. The TMS was delivered in line with established safety guidelines (Rossi et al., 2009).
Limitations
Defining the stimulation target by peak coordinates based on findings from previous studies did not allow us to account for individual differences in either brain anatomy or the impact of TMS on brain networks (Gratton et al., 2013). However, a study by Volman and colleagues (Volman et al., 2011) that used the same theta-burst protocol on the left frontopolar cortex has shown bilateral inhibitory effects on blood perfusion in the frontal pole. This suggests that both right and left parts of the frontopolar cortex might have been inhibited in our experiment, which is consistent with imaging results indicating bilateral involvement of the frontal pole in exploratory decisions.
Task
The task was a modified version of the Horizon Task (Wilson et al., 2014). As in the original paper, the distributions of payoffs tied to bandits were independent between games and drawn from a Gaussian distribution with variable means and fixed standard deviation of 8 points. Participants were informed that in every game one of the bandits was objectively ‘better’ (had a higher payoff mean). Differences between the mean payouts of the two slot machines were set to either 4, 8, 12 or 20. One of the means was always equal to either 40 or 60 and the second was set accordingly. The order of games was randomized. Mean sizes and order of presentation were counterbalanced. Participants played 160 games and the whole task lasted between 39 and 50 min (mean 43.4 min).
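The payoff structure described in this paragraph can be illustrated with a short simulation sketch. It is not the PsychoPy task code: the function name is hypothetical, and several details are assumptions made for illustration only (the sign of the mean difference is drawn at random, the side of the anchored bandit is shuffled, payoffs are left unrounded, and no counterbalancing is implemented).

```python
import random


def make_game(rng, n_choices=10, sd=8.0):
    """Draw bandit means and per-choice payoffs for one simulated game.

    rng       -- a random.Random instance, so that games are reproducible
    n_choices -- total choices in the game (assumed parameter)
    sd        -- fixed standard deviation of the payoff distributions
    """
    anchor_mean = rng.choice([40, 60])
    difference = rng.choice([4, 8, 12, 20])
    sign = rng.choice([-1, 1])            # assumption: either bandit may be better
    means = [anchor_mean, anchor_mean + sign * difference]
    rng.shuffle(means)                    # assumption: anchored side is random
    payoffs = [[rng.gauss(mean, sd) for _ in range(n_choices)]
               for mean in means]
    return means, payoffs


rng = random.Random(0)
means, payoffs = make_game(rng)
print(means)             # [left mean, right mean]
print(len(payoffs[0]))   # one potential payoff per choice for the left bandit
```

Passing a seeded `random.Random` instance keeps each simulated game reproducible, which makes it easy to replay the same sequence of payoffs across simulated sessions.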
Each game consisted of 5 or 10 choices. Every game started with a screen saying ‘New game’ and information about whether it was a long or short horizon, followed by sequentially presented choices. Every choice was presented on a separate screen, so that participants had to keep the previous scores in memory. There was no time limit for decisions. During forced choices participants had to press the prompted key to move to the next choice. During free choices they could press either ‘z’ or ‘m’ to indicate their choice of left or right bandit. The decision could not be made in a time shorter than 200 ms, preventing participants from accidentally responding too soon. The score feedback was presented for 500 ms. A counter at the bottom of the screen indicated the number of choices left in a given game. The task was programmed using PsychoPy software v1.86 (Peirce, 2007).
Participants were rewarded based on points scored in two sessions. The payoff bounds were set between 50 and 80 zl (equivalent to approximately 12 and 19 euro). Participants were informed about their score and monetary reward after the second session.
Finally, the random seeds were not perfectly controlled between subjects. The first 16 subjects ran the task with identical random seeds and thus all 16 saw the same sequence of forced-choice trials in both vertex and RFPC sessions. For the remaining subjects the random seed was unique for each subject and each session, thus these subjects had a unique series of forced-choice trials for each session. Despite this limitation we saw no evidence of different behavior across the two groups.
Data and code
Behavioral data as well as Matlab code to recreate the main figures from this paper can be found on the Dataverse website at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CZT6EE.
Model-based analysis
We modeled behavior on the first free choice of the Horizon Task using a version of the logistic choice model in Wilson et al. (2014) that was modified to include a learning component. In particular, we assume that participants use the outcomes of the forced-choice trials to learn an estimate of the mean reward of each option, before inputting that mean reward into a decision function that includes terms for directed and random exploration. This model naturally decomposes into a learning component and a decision component and we consider each of these components in turn.
Learning component
The learning component of the model assumes that participants use a Kalman filter to learn a value for the mean reward of each option. The Kalman filter (Kalman, 1960) has been used to model learning in other explore-exploit tasks (Daw et al., 2006) and is a popular model of Bayesian learning as it is both analytically tractable and easily relatable to the delta-rule update equations of reinforcement learning.
More specifically, the Kalman filter assumes a generative model in which the rewards from each bandit, ${r}_{t}$, are generated from a Gaussian distribution with a fixed standard deviation, ${\sigma}_{r}$, and a mean, ${m}_{t}^{i}$, that is different for each bandit and can vary over time. The time dependence of the mean is determined by a Gaussian random walk with mean 0 and standard deviation ${\sigma}_{d}$. Note that this generative model, assumed by the Kalman filter, is slightly different from the true generative model used in the Horizon Task, which assumes that the mean of each bandit is constant over time, that is, ${\sigma}_{d}=0$. This mismatch between the assumed and actual generative models is quite deliberate and allows us to account for the suboptimal learning of the subjects. In particular, this mismatch introduces the possibility of a recency bias (when ${\sigma}_{d}>0$) whereby more recent rewards are overweighted in the computation of ${R}_{t}^{i}$.
The actual equations of the Kalman filter model are straightforward. The model keeps track of an estimate of both the mean reward, ${R}_{t}^{i}$, of each option, $i$, and the uncertainty in that estimate, ${\sigma}_{t}^{i}$. When option $i$ is played on trial $t$, these two parameters update according to
and
When option $i$ is not played on trial $t$, we assume that the estimate of the mean stays the same, but that the uncertainty in this estimate grows, as the generative model assumes the mean drifts over time. Thus for unchosen option $j$ we have
When the option is played, the update Equation 3 for ${R}_{t}^{i}$ is essentially just a ‘delta rule’ (Rescorla and Wagner, 1972; Schultz et al., 1997), with the estimate of the mean being updated in proportion to the prediction error, ${r}_{t}-{R}_{t}^{i}$. This relationship to the reinforcement learning literature is made clearer by rewriting the learning equations in terms of the time-varying learning rate,
Written in terms of this learning rate, Equations 3 and 4 become
and
where
The learning model has four free parameters: the noise variance, ${\sigma}_{r}^{2}$, the drift variance, ${\sigma}_{d}^{2}$, and the initial values of the estimated reward, ${R}_{0}$, and of the uncertainty in that estimate, ${\sigma}_{0}^{2}$. In practice, only three of these parameters are identifiable from behavioral data, and we will find it useful to reparameterize the learning model in terms of ${R}_{0}$ and an initial, ${\alpha}_{1}$, and asymptotic, ${\alpha}_{\mathrm{\infty}}$, learning rate. In particular, the initial value of the learning rate relates to ${\sigma}_{0}$, ${\sigma}_{r}$ and ${\sigma}_{d}$ as
While the asymptotic value of the learning rate, which corresponds to the steady state value of ${\alpha}_{t}^{i}$ if option $i$ is played forever, relates to ${\alpha}_{d}$ (and hence ${\sigma}_{d}$ and ${\sigma}_{r}$) as
While this choice to parameterize the learning equations in terms of ${\alpha}_{1}$ and ${\alpha}_{\mathrm{\infty}}$ is somewhat arbitrary, we feel that the learning rate parameterization has the advantage of being slightly more intuitive and leads to parameter values between 0 and 1 which are (at least for us) easier to interpret.
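For reference, one standard way of writing a scalar Kalman filter that is consistent with the definitions in this section is sketched below. The timing convention is an assumption: the drift variance is added before each update, so that the first learning rate depends on ${\sigma}_{0}$, ${\sigma}_{r}$ and ${\sigma}_{d}$ as stated in the text. The published equations may be arranged or indexed differently.

```latex
% Assumed convention: \sigma_t^i is the uncertainty carried into trial t,
% and the drift variance \sigma_d^2 is added before the update.
\begin{align}
\alpha_t^i &= \frac{(\sigma_t^i)^2 + \sigma_d^2}{(\sigma_t^i)^2 + \sigma_d^2 + \sigma_r^2}
  && \text{(learning rate for played option } i\text{)} \\
R_{t+1}^i &= R_t^i + \alpha_t^i \left( r_t - R_t^i \right)
  && \text{(delta-rule update of the mean)} \\
\frac{1}{(\sigma_{t+1}^i)^2} &= \frac{1}{(\sigma_t^i)^2 + \sigma_d^2} + \frac{1}{\sigma_r^2}
  && \text{(uncertainty shrinks after an observation)} \\
R_{t+1}^j &= R_t^j, \qquad (\sigma_{t+1}^j)^2 = (\sigma_t^j)^2 + \sigma_d^2
  && \text{(unchosen option } j\text{)} \\
\alpha_{t+1}^i &= \frac{\alpha_t^i + \alpha_d}{\alpha_t^i + \alpha_d + 1},
  \qquad \alpha_d = \frac{\sigma_d^2}{\sigma_r^2}
  && \text{(option } i \text{ played on consecutive trials)} \\
\alpha_1 &= \frac{\sigma_0^2 + \sigma_d^2}{\sigma_0^2 + \sigma_d^2 + \sigma_r^2}
  && \text{(initial learning rate)} \\
\alpha_\infty &= \tfrac{1}{2}\left( -\alpha_d + \sqrt{\alpha_d^2 + 4\alpha_d} \right)
  && \text{(fixed point of the recursion)}
\end{align}
```

The recursion for the learning rate follows because the updated uncertainty satisfies $(\sigma_{t+1}^{i})^{2}/\sigma_{r}^{2}=\alpha_{t}^{i}$, and the asymptotic value is the positive root of $\alpha^{2}+\alpha_{d}\alpha-\alpha_{d}=0$. When ${\sigma}_{d}=0$ the learning rate decays toward zero and the estimate approaches a simple running average.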
Decision component
Once the payoffs of each option, ${R}_{t}^{i}$, have been estimated from the outcomes of the forced-choice trials, the model makes a decision using a simple logistic choice rule:
where $\mathrm{\Delta}R$ ($={R}_{t}^{left}-{R}_{t}^{right}$) is the difference in expected reward between left and right options and $\mathrm{\Delta}I$ is the difference in information between left and right options (which we define as +1 when left is more informative, −1 when right is more informative, and 0 when both options convey equal information in the [2 2] condition). The three free parameters of the decision process are: the information bonus, $A$, the spatial bias, $B$, and the decision noise, $\sigma $. We assume that these three decision parameters can take on different values in the different horizon and uncertainty conditions (with the proviso that $A$ is undefined in the [2 2] information condition since $\mathrm{\Delta}I=0$). Thus the decision component of the model has 10 free parameters ($A$ in the two horizon conditions, and $B$ and $\sigma $ in the 4 horizon x uncertainty conditions). Directed exploration is then quantified as the change in information bonus with horizon, while random exploration is quantified as the change in decision noise with horizon.
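To make the two components concrete, the following is a minimal sketch of how learning over the forced-choice trials could feed into the choice rule. It is not the authors' Matlab code. The function names, the timing of the drift step, and the sign convention of the logistic (positive $\mathrm{\Delta}R$, $A\mathrm{\Delta}I$ and $B$ all favoring the left option) are assumptions made for illustration.

```python
import math


def kalman_learn(forced, r0, var0, var_r, var_d):
    """Learn mean estimates from forced-choice trials.

    forced: list of (option, reward) pairs, option in {"left", "right"}.
    Assumed convention: every option's uncertainty grows by var_d on each
    trial, then the played option is updated with a delta rule.
    """
    est = {"left": r0, "right": r0}
    var = {"left": var0, "right": var0}
    for option, reward in forced:
        for name in var:
            var[name] += var_d                    # drift inflates uncertainty
        gain = var[option] / (var[option] + var_r)  # time-varying learning rate
        est[option] += gain * (reward - est[option])  # prediction-error update
        var[option] *= 1.0 - gain                 # uncertainty shrinks when played
    return est, var


def p_choose_left(delta_r, delta_i, info_bonus, bias, noise):
    """Logistic choice rule (sign convention is an assumption)."""
    return 1.0 / (1.0 + math.exp(-(delta_r + info_bonus * delta_i + bias) / noise))


# Example: a [1 3] game in which left was played three times and right once,
# so right is the more informative option (delta_i = -1).
forced = [("left", 50), ("left", 60), ("right", 40), ("left", 70)]
est, _ = kalman_learn(forced, r0=50.0, var0=1e9, var_r=64.0, var_d=0.0)
p_left = p_choose_left(est["left"] - est["right"], -1, info_bonus=5.0,
                       bias=0.0, noise=8.0)
```

With `var_d = 0` and a very broad prior the estimates reduce to running averages of the observed rewards; setting `var_d > 0` keeps the learning rate from decaying and so weights recent rewards more heavily, which is the recency bias the learning component is meant to capture.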
Model fitting
Hierarchical Bayesian model
Between the learning and decision components of the model, each subject’s behavior is described by 13 free parameters, all of which are allowed to vary between TMS conditions. These parameters are: the initial mean, ${R}_{0}$, the initial learning rate, ${\alpha}_{1}$, the asymptotic learning rate, ${\alpha}_{\mathrm{\infty}}$, the information bonus, $A$, in both horizon conditions, the spatial bias, $B$, in the four horizon x uncertainty conditions, and the decision noise, $\sigma $, in the four horizon x uncertainty conditions (Table 2, Figure 8).
Each of the free parameters is fit to the behavior of each subject using a hierarchical Bayesian approach (Lee and Wagenmakers, 2014). In this approach to model fitting, each parameter for each subject is assumed to be sampled from a group-level prior distribution whose parameters, the so-called ‘hyperparameters’, are estimated using a Markov Chain Monte Carlo (MCMC) sampling procedure. The hyperparameters themselves are assumed to be sampled from ‘hyperprior’ distributions whose parameters are defined such that these hyperpriors are broad. For notational convenience, we refer to the hyperparameters that define the prior for variable X as ${\theta}^{X}$. In addition we use subscripts to refer to the dependence of both parameters and hyperparameters on TMS stimulation condition, $\tau $, horizon condition, $h$, uncertainty condition, $u$, subject, $s$, and game, $g$.
The particular priors and hyperpriors for each parameter are shown in Table 2. For example, we assume that the prior mean, ${R}_{0}^{\tau s}$, for each stimulation condition $\tau $ and subject $s$, is sampled from a Gaussian prior with mean ${\mu}_{{R}_{0}}^{\tau}$ and standard deviation ${\sigma}_{{R}_{0}}^{\tau}$. These prior parameters are sampled in turn from their respective hyperpriors: ${\mu}_{{R}_{0}}^{\tau}$ from a Gaussian distribution with mean 50 and standard deviation 14, and ${\sigma}_{{R}_{0}}^{\tau}$ from a Gamma distribution with shape parameter 1 and rate parameter 0.001.
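The two-level structure in this example can be illustrated by sampling forward from the hyperprior to the subject-level parameters. This is only a sketch of the generative hierarchy for ${R}_{0}$ in a single stimulation condition, not the fitting procedure; the function name `sample_r0_prior` is hypothetical. Python's `random.gammavariate` takes a scale argument, so the rate of 0.001 is passed as a scale of 1/0.001.

```python
import random


def sample_r0_prior(n_subjects, rng):
    """Forward-sample initial means R_0 for one stimulation condition."""
    # Hyperpriors (values from the text)
    mu = rng.gauss(50.0, 14.0)                 # group-level mean
    sd = rng.gammavariate(1.0, 1.0 / 0.001)    # group-level SD; shape 1, rate 0.001
    # Subject-level parameters drawn from the group-level prior
    r0 = [rng.gauss(mu, sd) for _ in range(n_subjects)]
    return mu, sd, r0


rng = random.Random(0)
mu, sd, r0 = sample_r0_prior(31, rng)
```

A Gamma distribution with shape 1 and rate 0.001 has a mean of 1000, which is why the resulting prior on the group-level standard deviation is very broad relative to the 1 to 100 range of plausible payoffs.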
Model fitting using MCMC
The model was fit to the data using a Markov Chain Monte Carlo approach implemented in the JAGS package (Plummer, 2003) via the MATJAGS interface (psiexp.ss.uci.edu/research/programs_data/jags/). This package approximates the posterior distribution over model parameters by generating samples from this posterior distribution given the observed behavioral data.
In particular, we used 4 independent Markov chains to generate 4000 samples from the posterior distribution over parameters (1000 samples per chain). Each chain had a burn-in period of 500 samples, which were discarded to reduce the effects of initial conditions, and posterior samples were acquired at a thinning rate of 1. Convergence of the Markov chains was confirmed post hoc by eye. Code and data to replicate our analysis and reproduce our figures are provided as part of the Supplementary Materials.
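The sampling bookkeeping above can be sketched as follows, together with a numerical convergence diagnostic. Note that the Gelman-Rubin statistic shown here is a commonly used alternative to visual inspection and was not part of the reported analysis; the function names are hypothetical and the chains are simulated stand-ins rather than JAGS output.

```python
import random
import statistics


def discard_burn_in(chain, burn_in=500, thin=1):
    """Drop the burn-in samples and keep every `thin`-th sample after that."""
    return chain[burn_in::thin]


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Stand-in for sampler output: 4 chains of 1500 draws each.
rng = random.Random(0)
raw = [[rng.gauss(0.0, 1.0) for _ in range(1500)] for _ in range(4)]
kept = [discard_burn_in(c) for c in raw]   # 1000 samples per chain
n_samples = sum(len(c) for c in kept)      # 4000 posterior samples in total
r_hat = gelman_rubin(kept)
```

Values of the statistic close to 1 indicate that the chains are sampling the same distribution, whereas values well above 1 suggest that at least one chain has not converged.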
Data availability

A causal role for right frontopolar cortex in directed, but not random, exploration. Publicly accessible via the Harvard Dataverse website (https://dx.doi.org/10.7910/DVN/CZT6EE).
References

Adaptive gain and the role of the locus coeruleus-norepinephrine system in optimal performance. The Journal of Comparative Neurology 493:99–110. https://doi.org/10.1002/cne.20723

Finite-time analysis of the multiarmed bandit problem. Machine Learning 47:235–256. https://doi.org/10.1023/A:1013689704352

Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society B: Biological Sciences 362:933–942. https://doi.org/10.1098/rstb.2007.2098

Executive control and decision-making in the prefrontal cortex. Current Opinion in Behavioral Sciences 1:101–106. https://doi.org/10.1016/j.cobeha.2014.10.007

Resource allocation in speculative chemical research. Journal of Applied Probability 11:255. https://doi.org/10.1017/S0021900200036718

The effect of theta-burst TMS on cognitive control networks measured with resting state fMRI. Frontiers in Systems Neuroscience 7:124. https://doi.org/10.3389/fnsys.2013.00124

Exploration versus exploitation in space, mind, and society. Trends in Cognitive Sciences 19:46–54. https://doi.org/10.1016/j.tics.2014.10.004

A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering 82:35–45. https://doi.org/10.1115/1.3662552

Strategies for exploration in the domain of losses. Judgment and Decision Making 12:104–117.

PsychoPy - Psychophysics software in Python. Journal of Neuroscience Methods 162:8–13. https://doi.org/10.1016/j.jneumeth.2006.11.017

JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing 124:125.

Frontopolar resource allocation in human and nonhuman primates. Trends in Cognitive Sciences 20:84–86. https://doi.org/10.1016/j.tics.2015.11.006

A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory 2:64–99.

Charting the expansion of strategic exploratory behavior during adolescence. Journal of Experimental Psychology: General 146:155–164. https://doi.org/10.1037/xge0000250

Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General 143:2074–2081. https://doi.org/10.1037/a0038199
Decision letter

Michael J Frank, Reviewing Editor; Brown University, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]
Thank you for submitting your work entitled "A causal role for right frontopolar cortex in directed, but not random, exploration" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The reviewers have opted to remain anonymous.
Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work in the current state will not be considered further for publication in eLife.
All involved found the work to have great merit and to contribute to the literature on RLPFC and exploration. In our view this is perhaps the clearest demonstration to date that the RLPFC is involved in directed, uncertainty-guided exploration, in that it is the first to imply causality. However, given the state of the literature with other studies (cited in your manuscript) that show RLPFC activation during exploration, that it codes for uncertainty and/or the value of alternative actions, together with an existing TDCS study manipulating it and affecting exploration (albeit not in a way that clearly implicates uncertainty), we felt that the bar for establishing causality in your study needs to be quite high. The reviewers agreed that given the small sample and somewhat marginal statistics, it would be more reassuring if the results held up in a larger N study (or a separate independent replication). Moreover, while the findings here are compelling (e.g. the selectivity to horizon 6 directed exploration), they would be more so especially if you had a control site of stimulation (e.g. DLPFC or IFG) to establish specificity of the RLPFC site. (One of the reviewers noted in the consultation session that RLPFC stimulation may cause discomfort relative to vertex stimulation, which could differentially impact conditions that may require differences in effort.)
Reviewer 1 also had concerns regarding potential power differences to detect effects in directed vs. random exploration.
If you feel strongly that you can address these concerns we could consider a resubmission. But because the nature of the concerns requires new data collection, and it is unclear whether the results of new studies will provide more clarity, we are rejecting the paper as it stands. We would understand if you chose to submit this study as it is elsewhere.
Reviewer #1:
This manuscript reports the results of a TMS study in which participants are stimulated with theta-burst TMS while participating in a one-armed bandit gambling task aimed at distinguishing directed from random exploration. The authors hypothesize that frontopolar cortex is involved in directed but not random exploration. Using both model-based and model-free analyses the authors report that frontopolar cortex inhibition impacts on directed but not random exploration, allowing the authors to conclude that this structure plays a specific role in directed exploration.
Overall, the study is an interesting one that identifies a potentially important finding. The notion that frontopolar cortex is especially involved in directed exploration is highly plausible, and the results do indeed provide some indication of this possibility. However, I do have several major concerns which I detail below.
1) One major concern is the possibility that there is a substantial difference in power to detect the effects of TMS on the two forms of exploration due to perhaps a big difference in the number of behavioral choice that index these two forms of exploration on the first trial of each block. While directed exploration in the vertex treatment in the [1 3] condition perhaps occurs frequently (the number of trials in which this behavior is found are not reported, but I am inferring this from the high probability of directed exploration reported in the horizon 6 condition), it seems natural to expect that there would be much fewer instances of "random" exploration as defined by choice of the lower valued option in the [2 2] condition – this appears to be reflected in the much lower reported probabilities of random exploration in that condition. If there are many fewer trials of random exploration in the first place this ought to make an effect of random exploration following TMS stimulation much harder to detect. Therefore, one trivial account for the authors' double dissociation is that it occurs as a result of a difference in the experimental power to detect these two effects in the paradigm. The claim the authors have about an effect of TMS on direct exploration per se seems well supported in my opinion, but the claim for the specificity of the effect to random exploration seems a lot weaker.
2) Another concern is that for a behavioral and TMS study the use of such a small sample size of only 15 participants seems hard to justify, especially given that the authors are reporting effects that are just barely reaching significance at p<0.05. Given the concerns raised above about power to detect effects on random exploration, and given that there are a very small number of trials per subject enabling the authors to test their claimed effects (as they are throwing away most of the trials per block and focusing only on the first), suggest that it would not be unreasonable to expect the authors to obtain a larger sample size.
3) A more generic concern with TMS over frontopolar cortex is that it is unclear with this stimulation protocol how diffuse the effect of the TMS stimulation has been, and to what extent the stimulation protocol has also impacted adjacent regions of frontal cortex. This is an inherent limitation of this technique of course, but there are ways to ameliorate concerns in this regard such as by measuring effects of the stimulation protocol with fMRI. The authors could discuss this limitation and ideally bolster their claims about the degree to which these effects can be specifically attributed to effects of stimulation on frontopolar cortex per se.
4) Could the apparent effect on directed exploration be driven by other more prosaic possibilities such as an impairment in the ability to flexibly change task set (e.g. from a short to long horizon) across blocks or alterations in the capacity to attend to the task cues indicating the horizon length or even the capacity to incorporate knowledge of task instructions could be impacted instead of directed exploration per se.
5) Can the authors discriminate between different ways in which directed exploration could be implemented computationally on this task? For instance one could imagine a Bayesian implementation in which a representation of uncertainty over the options is computed and used to direct exploration toward the more uncertain options, or else one could simply use a heuristic strategy of just counting the number of samples of each option to try to ensure each option has been sampled an equivalent number of times.
6) Although the authors cite Wilson et al. (2014) to describe their modeling strategies, it would be important to reproduce details of exactly how they implemented the model fitting etc. in the current paradigm, as these analyses are central to the current paper and the reader shouldn't be required to go searching for another paper to understand precisely what was done.
7) I wonder whether more use can be made of the subsequent trials in each block. It seems a shame to throw these trials away, even if the utility of the trials for distinguishing these constructs drops off over repeated trials within a block it seems plausible to me that the 2nd and 3rd trials at the very least would contain useful information.
Reviewer #2:
Zajkowski and colleagues present a study showing that continuous theta burst stimulation to right frontopolar cortex, but not the vertex, selectively reduces directed exploration, but not random exploration. I commend the authors for their experimental approach, combining a carefully designed experimental paradigm and computational modeling of behavior with a transient causal manipulation, such as cTBS. While the results look straightforward, and I do believe they represent an advance on current knowledge in the field, I do not think they represent such a significant advance to merit publication in eLife (or a similar high impact journal of broad interest), but would be appropriate for a more specialized journal in the field. Rather than advancing thinking on this topic in some new way, developing a new methodology, or resolving a debate, I believe the results essentially confirm what could be inferred to be likely from the existing fMRI (Daw et al., 2006; Badre et al., 2012) and stimulation (TDCS) literature (Beharelle et al., 2015) on the RFPC and exploration/exploitation. Furthermore, the experimental paradigm and modeling results have been published (Wilson, et al., 2014). I do not mean to discourage the authors, who I think have conducted a genuinely interesting study by combining approaches in an unusual way, and confirming their main hypothesis. I simply do not believe the paper is best suited for a journal of the caliber of eLife, but I of course leave this up to the editor's discretion. I have added a few comments below that I hope will be helpful to the authors.
In the Introduction random exploration is framed as simply increasing decision noise, and directed exploration as information seeking. But is that really the critical distinction? In the real world random exploration is likely to occur when the environmental statistics have changed very rapidly and/or the animal has inferred (for whatever reason) their prior causal model (or even set of models) is (are) no longer tenable. In these circumstances their exploratory behavior is likely still characterized as information seeking, even if it manifests formally as an increase in decision noise. It seems to me, therefore, that the key distinction between directed and undirected exploration is that animals no longer know which options to explore. Can the authors clarify their view, and perhaps modify the Introduction and/or Discussion as needed?
What were the instructions to participants? Do they necessarily understand that the bandit means are constant and independent? Is there any evidence they weight more recent past samples more strongly than more distant samples? Would this change the estimates of the means in any meaningful way?
Given the demonstrated effects of cTBS on the hemodynamic signal measured in control networks (Gratton et al., 2013), how specific is the effect of stimulation to RFPC? To address this question, I would have liked to see the investigators target another frontal comparison brain region, in addition or instead of the vertex.
[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]
Thank you for resubmitting your work entitled "A causal role for right frontopolar cortex in directed, but not random, exploration" for further consideration at eLife. Your revised article has been favorably evaluated by Sabine Kastner as Senior Editor, Michael Frank as Reviewing editor and two reviewers.
The manuscript has been improved, especially given the doubled sample size, and the model-based and model-free analyses are sophisticated, comprehensive, and generally compelling. However, there are some remaining issues that need to be addressed before acceptance, as outlined below:
1) Why do the authors binarize relative information such that it is coded as +1 when the left gamble is more informative and −1 when the right gamble is? Based on Badre, Doll, Frank, et al. I would have thought that the estimated relative uncertainty between options would be more appropriate to quantitatively test the impact of stimulation on directed exploration. Or is variance in this quantity negligible across the critical choices in this task? Related to this question, is this quantity matched across conditions and do all subjects see identical or different schedules?
2) Although I am not requesting the authors conduct another experiment, a second stimulation site within prefrontal cortex would make for an important comparison for future studies. My suggestion is in part due to the quite severe discomfort frequently caused by TMS stimulation to FPC and neighboring regions due to the underlying facial musculature, as compared to say the vertex. Any differences between stimulation sites could in theory be due to differences in discomfort or subsequent distraction produced by the stimulation sites. Here, this difference could conceivably interact with the comparison between horizon 6 and horizon 1 in the unequal condition if this horizon 6 condition is in fact more cognitively demanding. Note this is not a concern in the cited tDCS study by Raja Beharelle et al. because tDCS does not stimulate the facial muscles and because the excitation and inhibition respectively following anodal and cathodal tDCS provides for an internal control. Can the authors provide some evidence that horizon 6 in the unequal condition is not the most cognitively demanding for their subjects, for instance by analysing RTs? Are there existing data that address this concern by comparing stimulation of FPC and other PFC regions using cTBS?
3) The trend of an effect of RFPC stimulation on the information bonus for horizon 1, although smaller than that of horizon 6, seems problematic for an interpretation purely based on directed exploration, since there is no opportunity to exploit the newly acquired information for horizon 1. The authors suggest subjects may become less information-seeking in both conditions (consistent with risk or ambiguity aversion in horizon 1 and reduced directed exploration in horizon 6), but this begs the question of what process or mechanism underlies this decrease in both horizons. Given the broader literature on the role of FPC, one interpretation would be that stimulation has disrupted the FPC's ability to faithfully encode the parameters of a "pending" option that they may choose in the future (e.g. Koechlin and Hyafil, Science, 2007) – in this task this could be seen as the option that has not been selected as frequently or attended to recently during forced choices. However I am sure there are other plausible interpretations. How do the authors interpret this effect across horizons in the unequal condition with respect to the broader literature on FPC?
https://doi.org/10.7554/eLife.27430.015
Author response
[Editors’ note: the author responses to the first round of peer review follow.]
Reviewer #1:
[…] 1) One major concern is the possibility that there is a substantial difference in power to detect the effects of TMS on the two forms of exploration due to perhaps a big difference in the number of behavioral choice that index these two forms of exploration on the first trial of each block. While directed exploration in the vertex treatment in the [1 3] condition perhaps occurs frequently (the number of trials in which this behavior is found are not reported, but I am inferring this from the high probability of directed exploration reported in the horizon 6 condition), it seems natural to expect that there would be much fewer instances of "random" exploration as defined by choice of the lower valued option in the [2 2] condition – this appears to be reflected in the much lower reported probabilities of random exploration in that condition. If there are many fewer trials of random exploration in the first place this ought to make an effect of random exploration following TMS stimulation much harder to detect. Therefore, one trivial account for the authors' double dissociation is that it occurs as a result of a difference in the experimental power to detect these two effects in the paradigm. The claim the authors have about an effect of TMS on direct exploration per se seems well supported in my opinion, but the claim for the specificity of the effect to random exploration seems a lot weaker.
This is an important point. Put simply, do we find no effect on random exploration because our experiment is underpowered to detect effects on random exploration? We believe that we do have sufficient power to detect an effect on random exploration (if it were there) and we try to show this using both a model-free and model-based approach.
For the model-free approach we consider the size of the horizon effect for directed and random exploration in the control condition. This horizon effect is essentially the effect we are trying to remove with TMS and the idea is that, if the horizon effect size is smaller for random than directed exploration, there would be a difference in power to detect changes to the horizon effect. Fortunately the horizon effects are of equal size in this study (in the vertex condition Cohen’s d for directed = 0.71; for random = 0.68). These numbers are largely in line with pure behavioral subjects (the 60 undergraduates from Somerville et al. 2016) where we find d = 0.75 for directed and, a slightly larger, d = 1.18 for random. Thus, if TMS were to reduce the horizon effect by 50% we would have essentially equal power to detect both effects (note we have the same number of trials in the [2 2] condition, for measuring p(low mean) and random exploration, and [1 3] condition, for measuring p(high info) and directed exploration).
For the model-based approach, we can fit the decision noise in the [1 3] uncertainty condition in addition to the [2 2] condition. This gives us an independent estimate of decision noise and gives us another chance to see an effect of TMS on random exploration. In addition, in our new model-based analysis, we use hierarchical Bayesian model fitting to compute posterior distributions over all model parameters given the data (see reviewer #1 response #6 for more details on this model). As shown by the posterior distributions (Figure 4, main text) we see no effect of TMS on decision noise in any of the four uncertainty x horizon conditions, but we do see an effect on information bonus in horizon 6.
2) Another concern is that for a behavioral and TMS study the use of such a small sample size of only 15 participants seems hard to justify, especially given that the authors are reporting effects that are just barely reaching significance at p<0.05. Given the concerns raised above about power to detect effects on random exploration, and given that there are a very small number of trials per subject enabling the authors to test their claimed effects (as they are throwing away most of the trials per block and focusing only on the first), suggest that it would not be unreasonable to expect the authors to obtain a larger sample size.
We agree that N = 15 was not ideal. We have now run an additional 16 subjects and our results hold (see Author response image 1).
In addition, we have included two new analyses: a model-based Bayesian analysis (results of which are shown in Figure 4), as well as a model-free analysis of later trials. Both of these analyses point to the same conclusion: inhibition of RFPC leads to selective inhibition of directed exploration in horizon 6.
The model-free analysis of later trials is presented in the main paper in Figure 6 in its own section. In this analysis we compute p(high info) and p(low mean) for all trials in the horizon 6 game to see whether behavior on the later trials is affected by stimulation of frontal pole. For directed exploration we find some evidence that the reduction in p(high info) on the first trial continues into the second (post hoc, one-sided t-test on the second trial, t(24) = 1.61; p = 0.06; Figure 6, panels A and C). While this is a marginal result, it is consistent with our hypothesis and provides more support for frontal pole playing a role in directed exploration. For random exploration we find no effect of RFPC stimulation on any trial. This is consistent with the idea that frontal pole is not involved in random exploration.
For completeness we reproduce the particular section of text here:
“The effect of RFPC stimulation on later trials
Our analyses so far have focused on just the first free choice and have ignored the remaining five choices in the horizon 6 games. […] Thus, the analysis of later trials provides additional, albeit modest, support for the idea that RFPC stimulation selectively disrupts directed but not random exploration at long horizons.”
3) A more generic concern with TMS over frontopolar cortex is that it is unclear with this stimulation protocol how diffuse the effect of the TMS stimulation has been, and to what extent the stimulation protocol has also impacted adjacent regions of frontal cortex. This is an inherent limitation of this technique of course, but there are ways to ameliorate concerns in this regard such as by measuring effects of the stimulation protocol with fMRI. The authors could discuss this limitation and ideally bolster their claims about the degree to which these effects can be specifically attributed to effects of stimulation on frontopolar cortex per se.
We agree that this is an important point and would be an important follow-up study. We have added the following to the Discussion to address this point:
“While the present study does allow us to conclude that directed and random exploration rely on different neural systems, the limited spatial specificity of TMS limits our ability to say exactly what those systems are. […] Future work combining cTBS with neuroimaging will be necessary to shed light on these issues.”
4) Could the apparent effect on directed exploration be driven by other more prosaic possibilities such as an impairment in the ability to flexibly change task set (e.g. from a short to long horizon) across blocks or alterations in the capacity to attend to the task cues indicating the horizon length or even the capacity to incorporate knowledge of task instructions could be impacted instead of directed exploration per se.
This is an interesting idea that we believe we can rule out. To paraphrase, the idea is that RFPC stimulation inhibits the ability to adapt to horizon in general (e.g. by causing subjects to ignore relevant task cues) rather than causing a specific deficit in directed exploration. Such a general deficit would predict that the horizon effect on random exploration would also be abolished with RFPC stimulation and this is something we do not see at all in three separate analyses.
First, in the model-free analysis (Figure 3B) we see that p(low mean) increases with horizon even in the RFPC condition and that RFPC stimulation has no effect on this measure of random exploration.
Second, this model-free result also holds for the later trials in which we see no stimulation-based change in p(low mean) over the course of horizon 6 games (Figure 6B, D). Of course, these later trial results are subject to the reward-information confound and so should not be overinterpreted, but they do at least point to the same conclusion that RFPC stimulation does not change the horizon dependence of random exploration.
Third, our model-based analysis points to the same conclusion that there is no change in decision noise with stimulation condition (Figure 4).
5) Can the authors discriminate between different ways in which directed exploration could be implemented computationally on this task? For instance one could imagine a Bayesian implementation in which a representation of uncertainty over the options is computed and used to direct exploration toward the more uncertain options, or else one could simply use a heuristic strategy of just counting the number of samples of each option to try to ensure each option has been sampled an equivalent number of times.
Unfortunately the vanilla Horizon Task used here is not well suited to addressing this question. The reason is that uncertainty on the first free choice is not parametrically modulated – there either is a difference in uncertainty (in the [1 3] condition) or else there is no difference in uncertainty (in the [2 2] condition). While one could try to look at this with a model-based analysis of the later trials, such an analysis is deeply affected by the reward-information confound which makes interpreting results of such an analysis difficult.
In an ongoing set of experiments, we have performed a (purely behavioral) version of the task with parametric modulation of uncertainty. This reveals that the information bonus does appear to scale with uncertainty in a more Bayesian manner, but more analysis needs to be done to be sure and the result requires internal replication (much easier with pure behavior than TMS!) before we publish.
6) Although the authors cite Wilson et al. (2014) to describe their modeling strategies, it would be important to reproduce details of exactly how they implemented the model fitting etc. in the current paradigm, as these analyses are central to the current paper and the reader shouldn't be required to go searching for another paper to understand precisely what was done.
This is a fair point and we have now included much more detail on the model. In addition we have expanded the model to include a learning component and fit the model in a different (and more rigorous) hierarchical Bayesian manner. We describe the model at two different points in the text and provide code to implement the model in the Supplementary Material. In the Results section, we highlight the salient points to try to convey the main intuition in the subsection “RFPC stimulation selectively inhibits directed exploration on the first free-choice”. In the Materials and methods section, we go into all the gory details. As this text is extensive, we do not quote it here.
7) I wonder whether more use can be made of the subsequent trials in each block. It seems a shame to throw these trials away, even if the utility of the trials for distinguishing these constructs drops off over repeated trials within a block it seems plausible to me that the 2nd and 3rd trials at the very least would contain useful information.
I wonder this too and have been for quite a while! In trying to model the later trials, it quickly becomes apparent that the reward-information confound is very real and introduces very strong correlations between the fitted parameter values that makes interpretation of the results essentially impossible.
Despite this difficulty in interpreting the model-based parameters, the model-free statistics (while still being confounded) are at least more straightforward. As mentioned above (response #2), we include this model-free analysis of later trials in a separate section of the Results, along with appropriate health warnings about the reward-information confound.
Reviewer #2:
Zajkowski and colleagues present a study showing that continuous theta burst stimulation to right frontopolar cortex, but not the vertex, selectively reduces directed exploration, but not random exploration. I commend the authors for their experimental approach, combining a carefully designed experimental paradigm and computational modeling of behavior with a transient causal manipulation, such as cTBS.
We thank the reviewer for the positive comments and helpful feedback. We hope this revision will change your mind about the “importance” of the findings, but regardless of whether the paper is accepted to eLife, your comments have greatly improved the paper!
While the results look straightforward, and I do believe they represent an advance on current knowledge in the field, I do not think they represent such a significant advance to merit publication in eLife (or a similar high impact journal of broad interest), but would be appropriate for a more specialized journal in the field. Rather than advancing thinking on this topic in some new way, developing a new methodology, or resolving a debate, I believe the results essentially confirm what could be inferred to be likely from the existing fMRI (Daw et al., 2006; Badre et al., 2012) and stimulation (TDCS) literature (Beharelle et al., 2015) on the RFPC and exploration/exploitation. Furthermore, the experimental paradigm and modeling results have been published (Wilson, et al., 2014). I do not mean to discourage the authors, who I think have conducted a genuinely interesting study by combining approaches in an unusual way, and confirming their main hypothesis. I simply do not believe the paper is best suited for a journal of the caliber of eLife, but I of course leave this up to the editor's discretion. I have added a few comments below that I hope will be helpful to the authors.
While we acknowledge that such judgments of “importance” are often a matter of taste and perspective (all things look big when viewed up close!), we respectfully disagree with this point and believe our study represents a major update to current thinking. In particular, by showing that RFPC stimulation selectively inhibits directed exploration we show that “exploration” is not a unitary process; it is a dual process in which directed and random exploration rely on (at least partially) dissociable neural systems.
That exploration is a dual process is absolutely not something one would have concluded from previous work. For example, Daw and Badre see similar activations despite defining exploration in very different ways (choosing low value option for Daw and (loosely) choosing high information options for Badre). The reason the activations are similar is that both tasks have a reward-information confound and after making just a few free choices, the high information options are the low value options. This means that every single exploration-related activation in those studies now has a big question mark on it – is it an activation related to directed exploration, random exploration or both? The same can be said of the Beharelle finding, which is beautiful in how it shows opposite effects for anodal and cathodal stimulation, but which cannot dissociate directed and random exploration because of the nature of the behavioral task. To be clear, we do not mean to attack previous work here – these are all incredibly important studies. However, our findings do open them up to reinterpretation.
We have tried to emphasize this dual-process interpretation in the Discussion:
“In this work we used continuous theta-burst transcranial magnetic stimulation (cTBS) to investigate whether right frontopolar cortex (RFPC) is causally involved in directed and random exploration. […] This is consistent with the idea that the levels of directed and random exploration are set by the strength of an exploratory drive that varies as an individual difference between people.”
In the Introduction random exploration is framed as simply increasing decision noise, and directed exploration as information seeking. But is that really the critical distinction? In the real world random exploration is likely to occur when the environmental statistics have changed very rapidly and/or the animal has inferred (for whatever reason) their prior causal model (or even set of models) is (are) no longer tenable. In these circumstances their exploratory behavior is likely still characterized as information seeking, even if it manifests formally as an increase in decision noise. It seems to me, therefore, that the key distinction between directed and undirected exploration is that animals no longer know which options to explore. Can the authors clarify their view, and perhaps modify the Introduction and/or Discussion as needed?
This is a really interesting idea and one that would be worth investigating in its own right. We have added a few sentences to the Discussion suggesting that random exploration may be a “model-free” method of exploration that works especially well when the model is unknown.
“With the above caveats that our results may not be entirely due to disruption of frontal pole, the interpretation that RFPC plays a role in directed, but not random, exploration is consistent with a number of previous findings. […] Indeed, the ability to explore effectively in a model-free manner, may be an important function of random exploration as it allows us to explore even when our model of the world is wrong.”
What were the instructions to participants? Do they necessarily understand that the bandit means are constant and independent?
The instructions were a direct Polish translation of the original instructions used by Wilson et al. (2014). These instructions clearly state that the average reward from each bandit is constant in each game and that the variability is constant over the entire game. For reference see the supplementary material of the original paper. If you feel it would be important for this paper, we would be happy to include them as Supplementary Material.
Is there any evidence they weight more recent past samples more strongly than more distant samples? Would this change the estimates of the means in any meaningful way?
This is a great question and one that has pushed us to update the model. In particular, we have now modeled the learning process (i.e. the process by which participants infer the mean of each option from the forced trials) using a Kalman filter. This model assumes that participants learn the mean reward for each option using a delta-rule update equation
R^{i}_{t+1} = R^{i}_{t} + α^{i}_{t} (r_{t} - R^{i}_{t})    (*)
Where α^{i}_{t} is the time varying learning rate. The time dependence of the learning rate is determined by the Kalman filter equations (see Materials and methods for full description of the model) and can be parameterized by two parameters: the initial learning rate α_{0} and the asymptotic learning rate α_{inf}. Crucially, equation (*) allows for potentially uneven weighting of the reward depending on the values of α_{0} and α_{inf}. Our previous model, with equal weighting given to all points, corresponds to the case of α_{0} = 1, α_{inf} = 0. Models with α_{0} < 1 and α_{inf} > 0 have a recency bias, weighting more recent rewards more strongly.
The posterior distributions over the group average values of α_{0} and α_{inf} are shown in Figure 3 in the main paper. In particular α_{0} ~ 0.6 and α_{inf} ~ 0.45, suggesting quite a pronounced recency effect. Importantly, however, neither of these parameters changes between stimulation conditions, and including this learning term in the model does not change the effect of TMS on directed exploration (information bonus in horizon 6).
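To illustrate how equation (*) produces a recency bias, the following minimal sketch implements the delta rule with a learning rate that decays from α_{0} toward α_{inf}. The recursion used here is that of a standard scalar Kalman filter with variances expressed in units of the observation noise; it is an assumption of this sketch, chosen because it reproduces the special case stated above (α_{0} = 1, α_{inf} = 0 gives equal weighting), and it may differ in detail from the equations in Materials and methods. The function names and the default prior mean `r_0=50.0` are also assumptions.

```python
def learning_rates(alpha_0, alpha_inf, n_obs):
    """Learning rates for n_obs successive observations of one option.

    Assumes a scalar Kalman filter whose variances are measured in units of
    the observation noise. Requires 0 <= alpha_inf < 1.
    """
    # Drift that makes alpha_inf the fixed point of the recursion below.
    drift = alpha_inf ** 2 / (1.0 - alpha_inf)
    rates = []
    alpha = alpha_0
    for _ in range(n_obs):
        rates.append(alpha)
        prior_var = alpha + drift  # posterior variance equals alpha in these units
        alpha = prior_var / (prior_var + 1.0)
    return rates


def estimate_mean(rewards, alpha_0, alpha_inf, r_0=50.0):
    """Apply the delta rule (*) to a sequence of rewards from one option."""
    estimate = r_0
    for alpha, reward in zip(learning_rates(alpha_0, alpha_inf, len(rewards)), rewards):
        estimate += alpha * (reward - estimate)
    return estimate
```

With α_{0} = 1 and α_{inf} = 0 the learning rates are 1, 1/2, 1/3, ..., so the estimate is the running average of the observed rewards. With values near the group averages reported above (α_{0} of about 0.6, α_{inf} of about 0.45) the most recent reward carries more weight than earlier ones.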
Given the demonstrated effects of cTBS on the hemodynamic signal measured in control networks (Gratton et al., 2013), how specific is the effect of stimulation to RFPC? To address this question, I would have liked to see the investigators target another frontal comparison brain region, in addition or instead of the vertex.
We agree that our inability to nail down the specificity of the effect is an important limitation of this work. Unfortunately we currently lack the resources to run a study looking at stimulation of other areas and have instead focused our efforts on increasing the sample size of the current study. Likewise, combining TMS with fMRI will be important in future work to more precisely characterize the effects of the perturbation. We have acknowledged both of these limitations in the Discussion as follows:
“While the present study does allow us to conclude that directed and random exploration rely on different neural systems, the limited spatial specificity of TMS limits our ability to say exactly what those systems are. […] Future work combining cTBS with neuroimaging will be necessary to shed light on these issues.”
[Editors' note: the author responses to the rereview follow.]
1) Why do the authors binarize relative information such that it is coded as +1 when the left gamble is more informative and -1 when the right gamble is? Based on Badre, Doll, Frank, et al. I would have thought that the estimated relative uncertainty between options would be more appropriate to quantitatively test the impact of stimulation on directed exploration. Or is variance in this quantity negligible across the critical choices in this task? Related to this question, is this quantity matched across conditions and do all subjects see identical or different schedules?
There are a few thoughts behind binarizing information. First, binary information matches the task design in which there is only one unequal information condition and no gradations in uncertainty from a normative perspective. Related to this, and as the reviewer rightly intuits, the single unequal uncertainty condition means that the variance in relative uncertainty between options is relatively small meaning that there is very little difference between the binarized vs continuous definition of information. Because of this we have decided to stick with the binarized version in the paper so as to avoid over interpreting the data.
More generally, the parametric effect of uncertainty in this task is a key question and is something we are looking at behaviorally in ongoing experiments with different numbers of forced trials. Such explicit manipulation of information leads to much more variance in the uncertainties allowing us to compute parametric effects of uncertainty with more confidence. In brief, these results do suggest a linear effect of uncertainty as seen in previous work.
Of course, it is possible to fit the continuous model to the data in this paper and when we do so we come to the same conclusions as the binarized model – a selective effect of RFPC stimulation on directed exploration (see Author response image 2 and 3).
Finally, as to the question of whether participants receive exactly the same schedule of trials or not, unfortunately this was not perfectly controlled in either direction. The first 16 subjects (the initial group) were run with the same random seed while the remaining subjects (the replication group) were run with unique random seeds. Given the results replicate between groups we do not think this is a major issue although we now include the following text in the Materials and methods section:
“Finally, the random seeds were not perfectly controlled between subjects. […] Despite this limitation we saw no evidence of different behavior across the two groups.”
2) Although I am not requesting the authors conduct another experiment, a second stimulation site within prefrontal cortex would make for an important comparison for future studies. My suggestion is in part due to the quite severe discomfort frequently caused by TMS stimulation to FPC and neighboring regions due to the underlying facial musculature, as compared to say the vertex. Any differences between stimulation sites could in theory be due to differences in discomfort or subsequent distraction produced by the stimulation sites. Here, this difference could conceivably interact with the comparison between horizon 6 and horizon 1 in the unequal condition if this horizon 6 condition is in fact more cognitively demanding. Note this is not a concern in the cited tDCS study by Raja Beharelle et al. because tDCS does not stimulate the facial muscles and because the excitation and inhibition respectively following anodal and cathodal tDCS provides for an internal control. Can the authors provide some evidence that horizon 6 in the unequal condition is not the most cognitively demanding for their subjects, for instance by analysing RTs?
We agree that other types and locations of stimulation will be an important avenue for future work and is something that I (RCW) am planning once TMS becomes available at UA.
The point about pain is also important. As we understand it, the idea is that RFPC stimulation can be painful. Pain is distracting which leads to worse performance, especially when a task is cognitively demanding. Thus if directed exploration in horizon 6 is the most cognitively demanding component of the task, then distraction from pain could cause the effect.
While we cannot rule this interpretation out entirely, two results suggest that simple distraction is likely not to blame.
First, one prediction of the distraction hypothesis is that people should perform worse overall when distracted by pain. In the model-free analysis this should show up as increased p(low mean) with stimulation of frontal pole. In the model-based analysis, distraction should manifest as increased decision noise in both [1 3] and [2 2] conditions. In both analyses we see no effect of RFPC stimulation (Figures 3B and 4). This effectively puts an upper bound on how distracting the pain could be – the distraction effect must be small enough to cause no change in the ability to pick out the high reward option.
Of course, the above analysis says nothing about the lower bound and it could still be the case that, while the pain is not distracting enough to affect computing the mean reward, it is distracting enough to affect the computations of the information bonus. This could be the case if computing the information bonus were harder than computing the mean. Evidence for this increased computational load could come from reaction times. Specifically if computing the bonus is difficult, then RTs should be longer in the [1 3] condition in horizon 6 than in horizon 1. As shown in Author Response Image 4 this is not the case and there is no effect of horizon on RT for the first free choice (F = 1.32, p = 0.26). Thus computing the information bonus is not a time consuming process, suggesting it is not any more taxing than computing the difference in means between options.
Together with the null effect on p(low mean) we believe that these results provide good evidence that our effects are driven by neural changes (presumably in RFPC – although this is impossible to verify without neuroimaging) not as a response to pain.
Are there existing data that address this concern by comparing stimulation of FPC and other PFC regions using cTBS?
A Google Scholar search for “cTBS frontal pole” found only one paper that reported pain measures. None that we could find directly compared pain from stimulation to FPC and other areas of PFC.
Hanlon, C. A., Dowdle, L. T., Correia, B., Mithoefer, O., Kearney-Ramos, T., Lench, D., […] and George, M. S. (2017). Left frontal pole theta burst stimulation decreases orbitofrontal and insula activity in cocaine users and alcohol users. Drug and Alcohol Dependence.
This study compared cTBS to frontal pole to a sham stimulation of muscles with electrodes. The study found that participants could not distinguish TMS from sham stimulation. More importantly for our purposes they also found that pain subsided quickly: “Subjective reports indicated that the painfulness of the protocol subsided after the first 15-30 s”.
The following other studies uncovered by the same search did not report measures of pain / discomfort.
Costa, A., Oliveri, M., Barban, F., Torriero, S., Salerno, S., Lo Gerfo, E., […] and Carlesimo, G. A. (2011). Keeping memory for intentions: a cTBS investigation of the frontopolar cortex. Cerebral Cortex, 21(12), 2696-2703.
Costa, A., Oliveri, M., Barban, F., Bonnì, S., Koch, G., Caltagirone, C., and Carlesimo, G. A. (2013). The right frontopolar cortex is involved in visual-spatial prospective memory. PLoS One, 8(2), e56039.
Rahnev, D., Nee, D. E., Riddle, J., Larson, A. S., and D’Esposito, M. (2016). Causal evidence for frontal cortex organization for perceptual decision making. Proceedings of the National Academy of Sciences, 113(21), 6059-6064.
3) The trend of an effect of RFPC stimulation on the information bonus for horizon 1, although smaller than that of horizon 6, seems problematic for an interpretation purely based on directed exploration, since there is no opportunity to exploit the newly acquired information for horizon 1. The authors suggest subjects may become less information-seeking in both conditions (consistent with risk or ambiguity aversion in horizon 1 and reduced directed exploration in horizon 6), but this begs the question of what process or mechanism underlies this decrease in both horizons. Given the broader literature on the role of FPC, one interpretation would be that stimulation has disrupted the FPC's ability to faithfully encode the parameters of a "pending" option that they may choose in the future (e.g. Koechlin and Hyafil, Science, 2007) – in this task this could be seen as the option that has not been selected as frequently or attended to recently during forced choices. However I am sure there are other plausible interpretations. How do the authors interpret this effect across horizons in the unequal condition with respect to the broader literature on FPC?
We have dug into this point more and can now include more detail. What we believe is going on here is a tradeoff between the mean of the prior, R_{0}, and the information bonus A. While this tradeoff does not affect our conclusions that RFPC stimulation selectively affects directed exploration, we believe that the tradeoff does suggest caution when interpreting the horizon 1 result.
In particular, note that in Figure 4, in addition to the information bonus going down in both horizons, the prior mean goes up suggesting a possible tradeoff between the information bonus parameter and the mean of the prior. Such a tradeoff is to be expected in this task because the prior has a larger effect on the more uncertain option – i.e. the option chosen once in the [1 3] condition. This larger effect of the prior means that increasing R_{0} can have a similar effect to an information bonus in the task by increasing the relative value of the uncertain option (in RL terms, this would be exploration by optimistic initialization). Thus, in the context of this task, the model contains an inherent tradeoff between the information bonus and mean of the prior.
In practice, the tradeoff between R_{0} and A shows up as correlations in the posteriors. This is shown in the updated Figure 5 in the manuscript where we plot samples from the posterior over the change in R_{0} between stimulation conditions (R_{0} (vertex) – R_{0} (RFPC)) against the change in information bonus (A(vertex) – A(RFPC)). In both horizon 1 (panel A) and horizon 6 (panel B) there is a tradeoff between the two parameters. However, while the tradeoff affects the interpretation of the horizon 1 and horizon 6 result alone, it does not affect the interpretation of the horizon-based change in information bonus (panel C).
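The origin of this tradeoff can be illustrated with a short worked sketch. It computes how much weight the prior mean R_{0} retains after one versus three observations under a delta rule with a decaying learning rate, and from that the value advantage of the once-sampled option. The learning-rate recursion is a standard scalar Kalman filter and is an assumption of this sketch rather than a restatement of the fitted model; the default values 0.6 and 0.45 are the approximate group averages of α_{0} and α_{inf} quoted in the response to reviewer #2, and the function names are hypothetical.

```python
def prior_weight(n_obs, alpha_0, alpha_inf):
    """Weight that the prior mean R_0 retains after n_obs delta-rule updates.

    Assumes scalar Kalman-filter learning rates with variances in units of
    the observation noise. Requires 0 <= alpha_inf < 1.
    """
    drift = alpha_inf ** 2 / (1.0 - alpha_inf)
    weight, alpha = 1.0, alpha_0
    for _ in range(n_obs):
        weight *= 1.0 - alpha  # each update shrinks the prior's influence
        prior_var = alpha + drift
        alpha = prior_var / (prior_var + 1.0)
    return weight


def value_gap(r_0, observed, alpha_0=0.6, alpha_inf=0.45):
    """Estimated value of the option seen once minus the option seen three
    times, when every observed reward on both options equals `observed`."""
    w_once = prior_weight(1, alpha_0, alpha_inf)
    w_thrice = prior_weight(3, alpha_0, alpha_inf)
    return (w_once - w_thrice) * (r_0 - observed)
```

Under these assumptions the once-sampled option keeps a weight of 0.4 on the prior while the thrice-sampled option keeps much less, so raising R_{0} above the observed rewards favors the uncertain option in the same direction as a positive information bonus. With α_{0} = 1 the prior is overwritten by the first observation and the gap vanishes, which is why the tradeoff only appears once the learning component is included.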
In addition to including this new figure, we have addressed this point in the manuscript with the following text:
“In addition to the effect on the information bonus in horizon 6, there is also a hint of an effect on the information bonus in horizon 1 (85% samples less than zero) and on the prior mean R0 (88% samples above zero). […] Taken together, this suggests that we can be fairly confident in our claim that RFPC stimulation has a selective effect on directed exploration.”
https://doi.org/10.7554/eLife.27430.016
Article and author information
Author details
Funding
No external funding was received for this work.
Ethics
Human subjects: All participants were informed about potential risks connected to TMS and signed a written consent. The study was approved by University of Social Sciences and Humanities ethics committee.
Reviewing Editor
 Michael J Frank, Brown University, United States
Publication history
 Received: April 4, 2017
 Accepted: September 14, 2017
 Accepted Manuscript published: September 15, 2017 (version 1)
 Version of Record published: October 4, 2017 (version 2)
Copyright
© 2017, Zajkowski et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.