Dynamic decision policy reconfiguration under outcome uncertainty
Abstract
In uncertain or unstable environments, sometimes the best decision is to change your mind. To shed light on this flexibility, we evaluated how the underlying decision policy adapts when the most rewarding action changes. Human participants performed a dynamic two-armed bandit task that manipulated the certainty in relative reward (conflict) and the reliability of action outcomes (volatility). Continuous estimates of conflict and volatility contributed to shifts in exploratory states by changing the rate of evidence accumulation (drift rate) and the amount of evidence needed to make a decision (boundary height), respectively. At the trial-wise level, following a switch in the optimal choice, the drift rate plummets and the boundary height weakly spikes, leading to a slow exploratory state. We find that the drift rate drives most of this response, with an unreliable contribution of boundary height across experiments. Surprisingly, we find no evidence that pupillary responses are associated with decision policy changes. We conclude that humans show a stereotypical shift in their decision policies in response to environmental changes.
Editor's evaluation
The authors conducted an impressive study investigating dynamic adjustments in decision policies as a function of two types of uncertainty: decision conflict and volatility (change point probability). They combine learning model parameters with drift diffusion modeling to assess how the policy (as a combination of drift rate and threshold) varies with uncertainty and also test how these adjustments relate to the LC-NE system via pupil diameter. This work is impressive and will certainly be of interest to many.
https://doi.org/10.7554/eLife.65540.sa0
Introduction
‘Should I stay or should I go?’ refers not only to an iconic 1980s punk anthem but also to the fundamental dilemma all animals face in uncertain or unstable environments. Should someone buy coffee from the cafe that serves their favorite roast or try the new cafe that opened down the street? If their favorite drink is bitter one day, is that a sign to switch to a new blend or is one subpar experience inadequate to prompt a switch? These decisions converge on a single predicament: whether we choose an action that we believe is likely to produce desirable results (i.e. exploit) or risk choosing another action that is less certain, on the chance that it will produce a more positive outcome (i.e. explore) (O’Reilly, 2013). Ultimately, this is the problem of knowing when to change your mind.
The shift of a decision policy from exploratory to exploitative states is driven by environmental context. To illustrate this, Figure 1A shows what happens when a simple reinforcement learning (RL) agent tries to maximize reward in a dynamic variant of the two-armed bandit task (Sutton and Barto, 1998; see Materials and methods). Here, the relative difference in reward probability for the two actions (conflict) and the frequency of a change in the optimal action (volatility) were independently manipulated. For each level of conflict and volatility, a set of tabular Q-learning (Sutton and Barto, 1998) agents played the task with learning rate held constant while the degree of randomness of the selection policy ($\beta $ in a Softmax function) varied. The agent that returned the most rewards was identified as the agent with the best exploration-exploitation balance. Increasing either form of uncertainty led to selecting agents with more random or exploratory selection policies (Figure 1A). As the value of the optimal choice decreases relative to the value of a suboptimal choice (conflict increases), the learner exploits what she already knows. When action values grow unstable (volatility increases) while the clarity of the optimal choice is held constant (constant conflict), the learner is biased toward exploration (Bland and Schaefer, 2012). As these two forms of uncertainty change together, the gradient of action selection strategy also changes.
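The agent comparison described above can be sketched as follows. This is a minimal illustration, not the simulation code behind Figure 1A: the function names, the binary reward, the default parameter values, and the treatment of $\beta $ as an inverse temperature (smaller values give more random choices) are all assumptions made for the example.

```python
import numpy as np

def run_agent(beta, alpha=0.1, p_optimal=0.75, lam=20, n_trials=1000, seed=0):
    """Total reward earned by a tabular Q-learning agent with softmax selection.

    Assumptions (not from the source): beta is an inverse temperature, so a
    smaller beta means more random choices; rewards are binary; alpha is fixed.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                           # action values for the two arms
    optimal = 0                               # index of the currently optimal arm
    next_switch = max(1, rng.poisson(lam))    # trial of the next change point
    total = 0.0
    for t in range(n_trials):
        if t == next_switch:                  # reward probabilities swap
            optimal = 1 - optimal
            next_switch = t + max(1, rng.poisson(lam))
        # A two-option softmax reduces to a logistic function of the value difference.
        p_arm1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        choice = int(rng.random() < p_arm1)
        p_reward = p_optimal if choice == optimal else 1.0 - p_optimal
        reward = float(rng.random() < p_reward)
        q[choice] += alpha * (reward - q[choice])   # delta-rule value update
        total += reward
    return total

def best_beta(betas, n_seeds=5, **task):
    """Return the beta with the highest mean reward across seeds."""
    scores = [np.mean([run_agent(b, seed=s, **task) for s in range(n_seeds)])
              for b in betas]
    return betas[int(np.argmax(scores))]
```

Calling `best_beta` over a grid of `beta` values for each combination of `p_optimal` (conflict) and `lam` (volatility) would give one selected policy per condition, which is the kind of comparison the paragraph describes.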
Knowing how decision policies shift in the face of dynamic environments requires looking at the algorithmic properties of the policy itself. One popular set of algorithms for describing the dynamics of decision making is accumulation-to-bound processes like the drift-diffusion model (DDM; Ratcliff, 1978). The normative form of the DDM proposes that a decision between two choices is described by a noisy accumulation process that drifts toward one of two decision boundaries at a specific rate (Figure 1B). Two parameters of this model are critical in determining the degree of randomness of a selection policy: the rate of evidence accumulation (drift rate; $v$) and the amount of information required to make a decision (boundary height; $a$). For example, decreasing the drift rate and increasing the boundary height leads to more random decisions (Figure 1C), with the speed of these decisions depending on the ratio of the two parameters (Figure 1D). Thus, exploratory policies can result in either fast or slow decisions, depending on the relative configuration of drift rate and boundary height.
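To make the accumulation process concrete, a single DDM trial can be simulated with a simple Euler-Maruyama scheme. The sketch below is illustrative only: the function name, the unit noise scale `sigma`, the step size `dt`, the timeout `max_t`, and the default non-decision time `t0` are assumptions, not values taken from this study.

```python
import numpy as np

def simulate_ddm(v, a, z=0.5, t0=0.2, dt=0.001, sigma=1.0, max_t=5.0, rng=None):
    """Simulate one drift-diffusion trial.

    v: drift rate; a: boundary height (the lower bound is 0, the upper bound is a);
    z: starting point as a proportion of a; t0: non-decision time in seconds.
    Returns (choice, rt), where choice is 1 for the upper bound and 0 for the lower.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = z * a            # accumulated evidence
    t = 0.0              # elapsed decision time
    while 0.0 < x < a and t < max_t:
        # Deterministic drift plus Gaussian diffusion noise for one time step.
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    # If the timeout is reached, the choice is assigned to the nearer boundary.
    choice = int(x >= a / 2)
    return choice, t + t0
```

Repeating the simulation over many trials shows how the parameters shape the policy: with `a` fixed, a drift rate near zero yields choices close to chance, while a large drift rate yields fast, consistent choices of the upper bound.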
Are the parameters that govern accumulation of evidence for decision making modifiable? Previous modeling work has shown that the parameters of a DDM process can be modulated by feedback signals and choice history (Pedersen et al., 2017; Ratcliff and Frank, 2012; Dunovan and Verstynen, 2019; Dunovan et al., 2019; Mendonça et al., 2020; Urai et al., 2018) with different mechanisms for adapting the drift rate and the boundary height. In value-based decision-making tasks where the statistics of sensory signals are equivalent for all actions, drift rate fluctuations appear to track the relative value of an action or the value difference between actions (Dunovan et al., 2019; Mikhael and Bogacz, 2016; Bariselli et al., 2019; Rubin et al., 2021). In contrast to value estimation, selection errors in this context have been linked to changes in the boundary height (Forstmann et al., 2008; Forstmann et al., 2010; Bogacz et al., 2010; Herz et al., 2016; Herz et al., 2017; Dunovan et al., 2019; Dunovan and Verstynen, 2019) and internal estimates of environmental change (Nassar et al., 2010; Wilson and Niv, 2011; Nassar et al., 2012; Behrens et al., 2007).
Given the adaptive sensitivity of the drift rate and the boundary height to value estimation and selection, respectively, these decision parameters define unique states on a surface of fast or slow and exploratory or exploitative decision policies. These policies, in turn, adaptively reconfigure based on current environmental feedback signals by modulating value estimation and the rate of selection errors (Figure 1E). Agents can move along the surface of decision policies, from exploitative states (bright colors, Figure 1E) to different types of exploratory states (darker colors, Figure 1E), as they commit a greater number of selection errors prompted by change in action-outcome contingencies. As the system relearns properties of the environment, the decision policy migrates along the surface to return to an exploitative state until a change occurs again.
One plausible neural mechanism for this migration along the surface of selection policies is the locus coeruleus norepinephrine (LC-NE) system, which has been linked to adaptive behavioral variability in response to uncertainty (Urai et al., 2017; Dayan and Yu, 2006; Bouret and Sara, 2005). The LC-NE system has two distinct modes (Aston-Jones and Bloom, 1981) that map onto distinct decision states (Aston-Jones and Cohen, 2005). In the phasic mode, a burst of LC activity results in a global, temporally precise release of NE. This increases the gain on cortical processing and encourages exploitation. In the tonic mode, NE is released without the temporal precision of the phasic mode, increasing baseline NE (Aston-Jones and Bloom, 1981). This encourages disengagement from the current task and facilitates exploration. The dynamic fluctuation of these two modes is thought to optimize the trade-off between the exploitation of stable sources of reward and the exploration of potentially better options (Aston-Jones and Cohen, 2005). Thus the LC-NE system, which can be indirectly measured by fluctuations in pupil diameter (Aston-Jones and Cohen, 2005; Jepma and Nieuwenhuis, 2011), may be a central mechanism for modulating selection policies.
We investigated the malleability of decision policies as the environment necessitates a change of mind as to what constitutes the ‘best’ decision. To control environmental uncertainty, we manipulated the volatility of changes in action-outcome contingencies (i.e. which of two targets returns the most rewards), as well as ambiguity in optimal choice (conflict), while human participants performed a dynamic variant of the two-armed bandit task (Sutton and Barto, 2018). We predicted that, in response to suspected changes in action-outcome contingencies, humans would exhibit a stereotyped adjustment in the drift rate and boundary height that pushes decisions from certain, exploitative states to uncertain, exploratory states and back again (Figure 1E). In addition, using pupillary data, we explored whether the LC-NE system covaries with shifts of the boundary height in response to a change in action outcomes to facilitate exploration, consistent with prior studies (Keung et al., 2019; Murphy et al., 2014; Cavanagh et al., 2014).
Results
Across two experiments, we used a dynamic two-armed bandit task with equivalent sensory reliability across arms to independently manipulate the reward conflict and the volatility of action outcomes in order to measure how underlying decision processes respond to changes in action-outcome contingencies (see Stimuli and Procedure). Both of these experiments shared a common feedback structure. Participants were asked to select either the left or right target presented on the screen using the corresponding key on a response box. Rewards were probabilistically determined for each target and, if a reward was delivered, it was sampled from a Gaussian distribution. The optimally rewarding target delivered reward with a predetermined probability ($P(optimal)$) and the suboptimal target gave reward with the complementary probability ($1-P(optimal)$). After a delay determined by the rate parameter of a Poisson distribution ($\lambda $), the reward probabilities for the optimal and suboptimal targets would switch.
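The feedback structure just described can be sketched as a reward schedule generator. This is a hedged illustration rather than the task code: the function name, the Gaussian mean and standard deviation (`reward_mean`, `reward_sd`), and the choice to draw each epoch length directly from a Poisson distribution with rate `lam` are assumptions for the example.

```python
import numpy as np

def make_schedule(n_trials, p_optimal, lam, reward_mean=3.0, reward_sd=1.0, seed=0):
    """Build a dynamic two-armed bandit schedule.

    Returns (optimal, rewards): optimal[t] is the index of the optimal target on
    trial t, and rewards[t, k] is the payout target k would give on trial t
    (0.0 when no reward is delivered).
    """
    rng = np.random.default_rng(seed)
    optimal = np.empty(n_trials, dtype=int)
    current, t = 0, 0
    while t < n_trials:
        epoch = max(1, rng.poisson(lam))      # trials until the next change point
        optimal[t:t + epoch] = current
        current = 1 - current                 # reward probabilities swap
        t += epoch
    # Reward probability is p_optimal for the optimal target, 1 - p_optimal otherwise.
    is_optimal = np.arange(2)[None, :] == optimal[:, None]
    p_reward = np.where(is_optimal, p_optimal, 1.0 - p_optimal)
    delivered = rng.random((n_trials, 2)) < p_reward
    magnitude = rng.normal(reward_mean, reward_sd, size=(n_trials, 2))
    return optimal, np.where(delivered, magnitude, 0.0)
```

Lowering `p_optimal` toward 0.5 raises conflict, and lowering `lam` raises volatility, so the two forms of uncertainty can be varied independently in this sketch.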
In Experiment 1, 24 participants completed four sessions (high and low conflict; high and low volatility) each composed of 600 trials. During each session, they were asked to select one of two coin boxes (Exp. 1: Figure 2A). The levels of conflict and volatility for all four conditions in Experiment 1 are shown as gray dots in Figure 2C. Experiment 2 was a replication of Experiment 1 with more extensive within-subject sampling of conflict and volatility, as well as the inclusion of pupillometry as a proxy for measuring LC-NE dynamics. In Experiment 2, participants were asked to choose between one of two Greebles (one male, one female). Each Greeble probabilistically delivered a monetary reward (Exp. 2: Figure 2B). Participants were trained to discriminate between male and female Greebles prior to testing to prevent errors in perceptual discrimination from interfering with selection on the basis of value estimation. Four participants completed nine sessions composed of 400 trials each, generating 3600 trials in total per subject. The levels of conflict and volatility for all nine conditions in Experiment 2 are shown as black dots in Figure 2C. Importantly, Experiment 2 manipulated the same forms of uncertainty as Experiment 1, but had different perceptual features and more expansively sampled the space of conflict and volatility. Given the similarity in design, the behavioral results for both of these experiments are presented together below.
The influence of ambiguity and instability on speed and accuracy
We first looked at overall speed and accuracy effects in both Experiments 1 and 2. In Experiment 1, accuracy (i.e. optimal choice selection) suffered as the optimal choice grew more ambiguous, with accuracy in the low conflict condition being 1.2 times higher than what is observed in the high conflict condition (Figure 3A; $\beta =1.213$, 95% CI: 1.192, 1.235, z = 21.36, p < 2e-16). In contrast, increasing conflict had no observable impact on overall reaction times (Figure 3A; $\beta =-6.902\mathrm{e}{-5}$, 95% CI: –0.002, 0.002, t = −0.06, p = 0.951). As expected, participants also became less accurate as the instability of action outcomes (i.e. volatility) grew (Figure 3B; $\beta =0.092$, 95% CI: 0.077, 0.111, z = 10.36, p < 2e-16). Under volatile conditions, participants also took slightly longer to make a decision ($\beta =-0.012$, 95% CI: –0.015, –0.010, t = −10.80, p < 2e-16); however, while this effect on reaction times was statistically reliable, the impact of volatility on reaction times was weak (increasing volatility increased reaction time by ∼13 ms on average; Figure 3B).
Experiment 2 served as a high-powered test of whether the effects we observed in Experiment 1 were replicable at the within-subject level. Because Experiment 2 independently manipulated conflict and volatility, we were able to test whether conflict and volatility interacted to affect behavior. We found similar effects of conflict and volatility on accuracy as we observed in Experiment 1 (Figure 3C). Accuracy increased as conflict decreased (i.e. as the probability of reward increased; $\beta $ = 0.223, 95% CI = 0.189, 0.256, z = 12.757, p < 2e-16). As the environment grew less volatile, accuracy increased ($\beta $ = 0.101, 95% CI = 0.066, 0.14, z = 5.828, p = 5.6e-09). We did not observe an interaction of conflict and volatility on accuracy ($\beta $ = 0.024, 95% CI = −0.013, 0.058, z = 1.364, p = 0.173).
However, we did find that conflict and volatility interacted to affect reaction time (RT; $\beta $ = −0.002, 95% CI = −0.004, –0.001, t = −3.084, p = 0.002), with a linear increase in reaction time as the environment grew less volatile and conflict was highest (when $p(r)=0.65$, $\overline{RT}=0.472, 0.480, 0.493$ as a function of $\lambda $; see Figure 3D for RT distributions). When conflict was moderate ($p(r)=0.75$) or low ($p(r)=0.85$), volatility had a nonlinear effect on RTs. Here, reaction times decreased when volatility was moderate ($\overline{RT}=0.483$ when $\lambda $=20 and $p(r)=0.75$ when $\lambda $=20 and $p(r)=0.85$). Reaction times increased to approximately the same extent within moderate or low conflict conditions when volatility was high ($\overline{RT}=0.499$ when $\lambda $=10 and $p(r)=0.75$ when $\lambda $=10 and $p(r)=0.85$) and when volatility was low ($\overline{RT}=0.506$ when $\lambda =30$ and $p(r)=0.75$ when $\lambda =30$ and $p(r)=0.85$), with an increase in baseline reaction times when conflict was low relative to moderate ($\overline{RT}=0.527$ when $p(r)=0.85$ when $p(r)=0.75$; see Appendix 1—figure 1 for interaction visualization).
At the gross level, over all trials within an experimental condition, increasing the ambiguity of the optimal choice (conflict) and increasing the instability of action outcomes (volatility) decreases the probability of selecting the optimal choice. Reaction time effects were inconsistent, with a negligible effect of volatility in Experiment 1. Experiment 2 revealed that volatility and conflict interact to influence reaction times in complex ways. However, because trials where action-outcome contingencies change are so infrequent, even under high volatility conditions, these overall effects on speed and accuracy may be masking more subtle behavioral dynamics in response to feedback changes. We adopt a more focal, model-based analysis in the next section to clarify these peri-change-point dynamics.
Tracking estimates of action value and environmental volatility
We calculated trial-by-trial estimates of two ideal observer parameters of environmental states (see Cognitive model for calculation details; Nassar et al., 2010; Vaghi et al., 2017). Belief in the value difference ($\mathrm{\Delta}B$) reflects the difference between the learned values of the optimal and suboptimal targets. For ease of interpretation, we refer to the converse of belief as doubt, such that when belief decreases doubt increases. $\mathrm{\Delta}B$ thus reflects a local estimate of uncertainty regarding the choices themselves. To capture the estimated probability of fundamental shifts in action values, we calculated how often the same action gave a different reward (change point probability; $\mathrm{\Omega}$). Here, $\mathrm{\Omega}$ reflects a global estimate of uncertainty in the environment, specifically the uncertainty in response contingencies. We used the data from Experiment 1 to assess how well these learning estimates captured our imposed manipulations, and observed similar results in Experiment 2 (Appendix 1—figure 4).
In Experiment 1, we observed a sharp decrease in $\mathrm{\Delta}B$ after a switch in action outcomes and a gradual return to asymptotic values (Figure 4A) with a decreased difference in reward probability resulting in increased doubt (Figure 4B; $\beta =0.216$, 95% CI: 0.206, 0.224, t = 46.24, p < 2e-16). As expected, less volatile conditions allowed the learner to more fully update her belief in the value of the optimal choice over all trials ($\beta =0.058$, 95% CI: 0.050, 0.068, t = 12.32, p < 2e-16), though to a smaller degree than low conflict conditions allowed (see Figure 4B). Increasing volatility resulted in a sharp increase in the estimate of $\mathrm{\Omega}$ at the onset of a change point with a quick return to a baseline estimate of change (Figure 4C). Notably, this estimate of $\mathrm{\Omega}$ was more sensitive to change points when conditions were relatively volatile, with a more pronounced peak in response to a change under high volatility conditions than under low volatility conditions (Figure 4C). Correspondingly, over all trials, $\mathrm{\Omega}$ was higher under more volatile conditions (Figure 4D, $\beta =-0.022$, 95% CI: −0.023, –0.020, t = −30.74, p < 2e-16) indicating sensitivity to the increased frequency of action outcome switches in the reward schedule.
When the identity of the optimal choice was clear (i.e. when conflict was low), the estimate of $\mathrm{\Omega}$ was more sensitive to the presence of a true change point than when the optimal choice was ambiguous (i.e. when conflict was high) (Figure 4C and D). This observation is consistent with the idea that increasing the difficulty of value estimation and, thereby, the assignment of value to a given choice also impairs change point sensitivity. Interestingly, increasing conflict nevertheless resulted in a net increase in $\mathrm{\Omega}$ calculated over all trials (Figure 4D; $\beta $ = −0.006, 95% CI: −0.007, –0.004, t = −8.64, p < 2e-16), likely because higher conflict conditions increased the baseline estimate of change instead of enhancing sensitivity to true change points (see change point response and relative baseline values for the high conflict condition in Figure 4C). Here, the system conservatively overestimates the volatility of action outcomes, assuming a slightly greater frequency of changes in the probability of reward for the optimal choice than we imposed (actual proportion of change points for high conflict condition: $0.041\pm 0.004$; estimated $\mathrm{\Omega}$).
Reassuringly, net change point probability was much greater when change points were more frequent (see increased $\mathrm{\Omega}$ estimates for high volatility conditions over high conflict conditions in Figure 4D). These results suggest that our formulation of these ideal observer estimates adequately captures our manipulation of volatility and conflict at a continuous level.
Thus, these ideal observer parameters show a reliable response to a change in action-outcome contingencies. The difference in value belief decreases, or doubt increases, when a change point occurs and slowly recovers over the course of six to eight trials as participants learn new action-outcome contingencies. The initial drop in belief difference is deeper and the recovery time after a change point is slower in conditions with greater overall uncertainty (i.e. under high conflict and high volatility). In contrast, internal estimates that a change has occurred briefly spike at a change point, indicating that participants can reliably detect that something has changed, and quickly settle after a few trials. Interestingly, net change point probability estimates are higher in the conditions with higher uncertainty (high conflict, high volatility), likely reflecting increased vigilance for changes in those conditions. In the next section, we explore how the underlying parameters of the decision process itself respond to local changes in action-outcome contingencies.
Different forms of uncertainty impact distinct decision processes
Our next goal was to test which decision parameters were sensitive to a change point. To this end, we estimated the change-point-evoked response of the boundary height $a$, drift rate $v$, nondecision time $t$, starting bias $z$, and drift criterion $dc$ for each trial surrounding the change point. To detect changes in the change-point-evoked distributions for each decision parameter, we evaluated whether the sequential distributions evoked by each trial were significantly different, beginning with the trial preceding the change point and ending three trials after the change point. For example, if the 95% CI of the $z$ distribution evoked on the trial prior to the change point overlapped with the 95% CI of the distribution evoked on the change point and so on for all successive trials considered, then we would conclude that $z$ failed to show change point sensitivity (see Hierarchical drift diffusion modeling for details). To select the model that best accounted for the data, we compared the deviance information criterion (DIC) scores (Spiegelhalter et al., 2002) for these models. DIC scores provide a measure of model fit adjusted for model complexity and quantify information loss. A lower DIC score indicates a model that loses less information. Here, a difference of ≤ 2 points from the lowest-scoring model cannot rule out the higher scoring model; a difference of 3–7 points suggests that the higher scoring model has considerably less support; and a difference of more than 10 points suggests essentially no support for the higher scoring model (Spiegelhalter et al., 2002; Burnham and Anderson, 1998).
Under this analysis, we found that only the boundary height and drift rate showed change point sensitivity as defined above. The drift rate showed a clear, persistent separation between trial-specific distributions, with a rapid decrease at the onset of the change point (${t}_{-1}$ 95% CI = 1.021, 1.218; ${t}_{0}$ = −0.972, –0.779) and a return to baseline values thereafter (${t}_{1}$ = −0.656, –0.46; ${t}_{2}$ = 0.039, 0.241; ${t}_{3}$ = 0.411, 0.616; Figure 5A). The boundary height showed a transient response to the change point, spiking (${t}_{-1}$ 95% CI = 0.792, 0.819; ${t}_{0}$ = 0.820, 0.847) and then dropping to baseline levels (${t}_{1}$ = 0.789, 0.815; ${t}_{2}$ = 0.783, 0.811; ${t}_{3}$ = 0.780, 0.808; Figure 5B).
The remainder of the decision parameters showed no change point sensitivity. Nondecision time showed no clear response (${t}_{-1}$ 95% CI = 0.183, 0.188; ${t}_{0}$ = 0.183, 0.186; ${t}_{1}$ = 0.184, 0.191; ${t}_{2}$ = 0.181, 0.186; ${t}_{3}$ = 0.183, 0.186; Figure 5C) along with the starting bias (${t}_{-1}$ 95% CI = −0.112, –0.01; ${t}_{0}$ = −0.098, 0.002; ${t}_{1}$ = −0.090, 0.008; ${t}_{2}$ = −0.069, 0.032; ${t}_{3}$ = −0.055, 0.045; Figure 5D) and the drift criterion (${t}_{-1}$ 95% CI = 0.229, 0.439; ${t}_{0}$ = 0.175, 0.374; ${t}_{1}$ = 0.223, 0.435; ${t}_{2}$ = 0.245, 0.458; ${t}_{3}$ = 0.244, 0.464; Figure 5E).
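The overlap criterion used to define change point sensitivity can be illustrated with the reported intervals. The sketch below is one reading of that criterion, in which a parameter counts as sensitive when at least one pair of sequential 95% CIs fails to overlap; the function names are hypothetical, and the intervals are copied from the values reported in the text, ordered from the trial before the change point to three trials after it.

```python
def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def change_point_sensitive(cis):
    """True if any pair of sequential intervals fails to overlap."""
    return any(not intervals_overlap(a, b) for a, b in zip(cis, cis[1:]))

# 95% CIs for successive trials around the change point, as reported in the text.
reported = {
    "drift rate": [(1.021, 1.218), (-0.972, -0.779), (-0.656, -0.46),
                   (0.039, 0.241), (0.411, 0.616)],
    "boundary height": [(0.792, 0.819), (0.820, 0.847), (0.789, 0.815),
                        (0.783, 0.811), (0.780, 0.808)],
    "nondecision time": [(0.183, 0.188), (0.183, 0.186), (0.184, 0.191),
                         (0.181, 0.186), (0.183, 0.186)],
    "starting bias": [(-0.112, -0.01), (-0.098, 0.002), (-0.090, 0.008),
                      (-0.069, 0.032), (-0.055, 0.045)],
    "drift criterion": [(0.229, 0.439), (0.175, 0.374), (0.223, 0.435),
                        (0.245, 0.458), (0.244, 0.464)],
}

sensitive = {name: change_point_sensitive(cis) for name, cis in reported.items()}
```

Under this reading, only the drift rate and boundary height entries of `sensitive` come out true, matching the conclusion stated in the text.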
Further, models fitting drift rate and boundary height lost the least null-model-adjusted information relative to models of the change-point-evoked response for the other parameters, showing that a change-point-evoked decrease in drift rate and spike in the boundary height best accounted for our observational data in comparison to all alternatives ($\mathrm{\Delta}DI{C}_{null}$ for $v$ = –978 and $\mathrm{\Delta}DI{C}_{null}$ for $a$ = –13.7; see Figure 5F).
Given that only the drift rate and boundary height showed change point sensitivity, we next focused on how those two parameters related to internal estimates of change and conflict in both experiments. Recall that we used the ideal observer parameters $\mathrm{\Delta}B$ and $\mathrm{\Omega}$ as proxies for internal estimates of belief in the difference in learned target values and change point probability, respectively. This provided a continuous quantification of our manipulation of conflict and volatility (see Tracking estimates of action value and environmental volatility). Experiment 2 provided an intensively sampled within-subject test of the change-point-evoked mapping between decision processes and these ideal observer estimates.
In order to determine the nature of the mapping between the ideal observer parameters and the change-point-sensitive decision parameters, we estimated single- and dual-parameter models mapping $\mathrm{\Delta}B$ and $\mathrm{\Omega}$ to the change-point-sensitive decision parameters, drift rate and boundary height, and examined the fit of these models to our data. We found that the model mapping $\mathrm{\Delta}B$ to drift rate and $\mathrm{\Omega}$ to boundary height provided the best fit in Experiment 1 ($\mathrm{\Delta}DI{C}_{null}=-2698.0$; left panel of Table 1).
To test whether this mapping was preserved in an independent data set, we performed the same model comparison procedure for Experiment 2. Because Experiment 2 followed a replication-based design, we fit a separate model to each subject to assess the replicability of the best fitting model from Experiment 1. While we found support for the model mapping $\mathrm{\Delta}B$ to drift rate and $\mathrm{\Omega}$ to boundary height, we also found that the single-parameter model mapping $\mathrm{\Delta}B$ to $v$ alone fit the data equally well according to its DIC scores (see bottom panel of Table 1 for summary statistics and Appendix 3—table 1). Altogether, this suggests that we have strong evidentiary support for a mapping between value-driven belief and drift rate (Figure 6A, blue). However, the support for a mapping between change point probability and boundary height (Figure 6A, red), while robustly present in Experiment 1, fails to appear when tested in an independent data set.
For a more granular assessment of how drift rate and boundary height respond to a change point, we quantified the change-point-evoked effect of $\mathrm{\Delta}B$ and $\mathrm{\Omega}$ on drift rate and boundary height, respectively, for both experiments (see Hierarchical drift diffusion modeling for details). In Experiment 1, we found that the rate of evidence accumulation, $v$, increased with the belief in the value of the optimal choice relative to a change point (${\beta}_{v\sim \mathrm{\Delta}B}=0.576$, 95% CI: 0.544, 0.609, empirical $p=0.000$; Figure 6B, left panel). The boundary height increased with change point probability (${\beta}_{a\sim \mathrm{\Omega}}=0.046$, 95% CI: 0.005, 0.088, empirical $p=0.001$; Figure 6B, right panel).
Experiment 2 showed similar, but attenuated, results, with drift rate increasing with $\mathrm{\Delta}B$ (${\beta}_{v\sim \mathrm{\Delta}B}=0.112$, 95% CI: 0.016, 0.227, empirical $p=0.004$; Figure 6B, inset panel on left) and an unreliable effect of $\mathrm{\Omega}$ on boundary height (${\beta}_{a\sim \mathrm{\Omega}}=0.036$, 95% CI: –0.155, 0.097, empirical $p=0.282$; Figure 6B, inset panel on right). Therefore, as the belief in the value of the optimal choice approaches the reward value for the optimal choice, the rate of information accumulation increases. An internal estimate of change point probability weakly increases the amount of information required to make a decision, although this latter effect is less reliable.
Altogether, these results suggest a drift rate mechanism for adaptation to change that may also combine with boundary height dynamics (Figure 6A). However, the strength of the drift rate response weakened and the boundary height response was statistically unreliable in Experiment 2 (Figure 6B inset panels). When a change point is detected and the threshold for committing to a choice ($a$) responds, it shows a weak, transient increase. At the same time, the drift rate approaches zero, allowing time for the decision process to diffuse and encouraging a random selection. As the learner accrues information about the new optimal choice, the rate of information accumulation slowly recovers to asymptotic levels, with the decision process assuming a more directed path toward the choice that has accrued evidence for reward. Together, the changes in these underlying decision processes, largely driven by drift rate dynamics, point to a mechanism for gathering information in a relatively slow, unbiased manner shortly after the learner suspects she should update her valuation. We now explore these dynamics in more detail in the next section.
Environmental instability prompts a stereotyped decision trajectory
So far, we have established that both the drift rate and the boundary height can be independently manipulated by two different estimates of environmental uncertainty with different temporal dynamics, although this effect reduces to drift rate dynamics in Experiment 2. This suggests that a change in action-outcome contingencies prompts a unique trajectory through the space of possible decision policies (Figure 1E).
To visualize this trajectory, we plot the temporal relationship between drift rate and boundary height beginning with the trial prior to the change point and ending three trials after the change point (Figure 7A). To clearly visualize the distribution of the change-point-driven response in the relationship between drift rate and boundary height over time, we also represent the trial-wise shift in these two decision variables as vectors. The trial-by-trial estimates of drift rate and boundary height were taken from the best model of the fitted change-point-evoked response and z-scored (see Different forms of uncertainty impact distinct decision processes for model selection). Then the difference between each sequential set of boundary height and drift rate coordinates, $(a,v)$, was calculated to produce a vector length. The arc tangent between these differenced values was computed to yield an angle in radians between sequential decision vectors, concisely representing the overall decision dynamics ($\theta $, Figure 7B; see Decision vector representation for methodological details).
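The vector representation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name is hypothetical, boundary height is placed on the horizontal axis and drift rate on the vertical axis to match the $(a,v)$ ordering, and angles are wrapped to the range from 0 to $2\pi$; the source does not state these conventions explicitly.

```python
import numpy as np

def decision_vectors(a, v):
    """Length and angle of the step between successive (a, v) coordinates.

    a, v: 1-D arrays of trial-wise boundary height and drift rate estimates.
    Returns (length, theta), with theta in radians in the range [0, 2*pi).
    """
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    a_z = (a - a.mean()) / a.std()          # z-score each parameter
    v_z = (v - v.mean()) / v.std()
    da, dv = np.diff(a_z), np.diff(v_z)     # trial-to-trial differences
    length = np.hypot(da, dv)               # vector length
    theta = np.mod(np.arctan2(dv, da), 2 * np.pi)   # vector angle
    return length, theta
```

With this convention, a step in which the boundary height rises while the drift rate falls points into the fourth quadrant (between 270° and 360°), and `np.degrees(theta)` converts the output to degrees.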
For Experiment 1, following a shift in response contingencies, the navigation of this decision surface follows a stereotyped pattern. The boundary height spikes and drift rate decreases rapidly, gradually recovering and stabilizing over time (see the trial prior to the change point in Figure 7A). This decision trajectory is robust in Experiment 1 (Figure 7B, top panel).
Here, we find that the distribution of $\theta $ prior to a change point averages to ∼300°, sharply changes in response to the observation of a change point (∼165°) and steadily returns to values prior to the onset of a change (main panels in Figure 7B). One trial after the change point, drift rate sharply decreases and boundary height spikes, after which boundary height quickly recovers and drift rate steadily progresses toward its baseline value.
However, this trajectory is substantially more variable in Experiment 2, with most of the response restricted to the drift rate dimension and inconsistent trajectories along the boundary height dimension (Figure 7B, lower panel). Here, the distribution of $\theta $ prior to a change point averages to ∼270° and shifts to ∼90° with the observation of a change. In both experiments, we find that the decision trajectory quickly responds to a shift in action outcomes and also quickly recovers and stabilizes.
Having characterized the changepointevoked trajectory through the range of decision policies, we next asked whether conditions of increased volatility and increased conflict might modify its path. To this end, we conducted a comparison of a null model with models specifying the changepoint evoked response alone and this evoked response as a function of conflict and volatility. To estimate this relationship between drift rate and boundary height, we used Bayesian circular regression (Mulder and Klugkist, 2017). First, we tested the null hypothesis that the decision dynamics (the relationship between drift rate and boundary height; $\theta $) were solely a function of the intercept, or the average of the decision dynamics $\theta $:
We call this the null model.
To test the hypothesis that decision dynamics varied solely as a function of time after a switch in actionoutcome contingencies, we estimated the change in $(a,v)$ coordinates ($\theta $) relative to a change point, with the time scale of consideration determined by the results of a stability analysis from Experiment 1 (see Model proposals and evaluation; Appendix 1—figure 5):
We call this the evoked response model.
Our model comparison logic was as follows. We first evaluated whether the posterior probability of the evoked response model was greater than that for the null model. This would suggest that time relative to a change point alone is a better predictor of decision dynamics than the average response. If the posterior probability of the evoked response model reliably exceeded the posterior probability of the null model, we then quantified the evidence for alternative models relative to the evoked response model. The sole effect of time relative to a change point was then framed as the new null hypothesis.
We used Bayes Factors to quantify the ratio of evidence for two competing hypotheses. If the ratio is close to 1, then the evidence is equivocal. As the ratio grows more positive, there is greater evidence for the model specified in the numerator, and if the ratio is less than 1, then there is evidence for the model specified in the denominator (Jeffreys, 1998). Evidence for the null hypothesis is denoted $B{F}_{01}$ and evidence for the alternative hypothesis is denoted $B{F}_{10}$. Because Experiment 2 took a withinsubject approach, a separate model was fit for each participant for all proposed models.
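As a worked illustration of these conventions, the sketch below computes Bayes factors and posterior model probabilities from log marginal likelihoods. The numerical values are invented for the example, the function names are hypothetical, and equal prior probabilities across models are assumed; it is not the authors' model-fitting code.

```python
import math


def bayes_factor(log_ml_num, log_ml_den):
    """Bayes factor for the numerator model over the denominator model."""
    return math.exp(log_ml_num - log_ml_den)


def posterior_model_probs(log_mls):
    """Posterior probability of each model, assuming equal prior model probabilities."""
    top = max(log_mls)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_mls]
    total = sum(weights)
    return [w / total for w in weights]


# Invented log marginal likelihoods for a null and an alternative model.
log_ml_null, log_ml_alt = -105.0, -100.0
bf_10 = bayes_factor(log_ml_alt, log_ml_null)  # evidence for the alternative
bf_01 = bayes_factor(log_ml_null, log_ml_alt)  # evidence for the null
probs = posterior_model_probs([log_ml_null, log_ml_alt])
```

With these invented values, $B{F}_{10}={e}^{5}\approx 148$ and $B{F}_{01}$ is its reciprocal, so the two quantities always carry the same information with the numerator and denominator exchanged.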
To determine whether volatility and conflict affected these perichange decision dynamics, we modeled changes in decision policy on the drift rate and boundary height surface as a function of $\lambda $ and $p$, where $\lambda $ corresponds to the average period of stability and $p$ corresponds to the mean probability of reward for the optimal choice (see Figure 8 for the full set of models considered). We explored the potential influence of volatility and conflict on the relationship between drift rate and boundary height by examining the posterior probability for each hypothesized model given the set of alternative hypotheses (Model proposals and evaluation; Figure 8A). We found that the evoked response model describing the relationship between shifts in decision parameters and time relative to a change point was more probable than the null model (see Figure 8A).
We also present the evidence for the null model against each alternative model as a Bayes Factor ($B{F}_{01}$) (Figure 8B). The 95% confidence interval for the $B{F}_{01}$ comparing the ratio of evidence for the null model and the evoked response model specifying timedependent effects of volatility included 1, suggesting inconclusive evidence for either of these models. Likewise, the 95% confidence interval for the $B{F}_{01}$ comparing the evidence for the null model against the model specifying changepointevoked effects of conflict included 1, suggesting no substantive difference between them. Given the equivocal evidence for these two models we excluded them from further comparison with the evoked response model.
The remainder of the models had substantially negative $B{F}_{01}$ values (Figure 8B), suggesting that they better fit the data than the null model and allowing them to survive to the next stage of analysis. To evaluate the hypothesis that time alone best accounted for the data, we computed the $B{F}_{01}$ for the evoked response model against the surviving models from the null model analysis. We find that, for all the remaining models, the $B{F}_{01}$ is substantially positive (Figure 8C), indicating that the evoked response model best accounted for the data (posterior probability of evoked response model given the set of models considered: $0.76\pm 0.473$; posterior prob. for 3/4 participants > 0.99).
These analyses suggest that the relationship between the rate of evidence accumulation and the boundary height is only related to the change point itself. We find no evidence to suggest that changing the degree of volatility or changing the degree of conflict changes the path of the decision policy following a change point. Thus, the stereotyped response of the decision policy is solely dependent on the presence of a change point rather than either the history of change point frequency or the history of optimal choice ambiguity. Note that while the ideal observer estimates respond to our conditional manipulations of volatility and conflict, the decision dynamics $\theta $ we observe do not reflect these effects. This is due to the noisy, imperfect correspondence between the ideal observer signals and $a$ and $v$. This suggests that adaptation to environmental changes in actionoutcome contingencies involves a rapid, coordinated increase in the amount of information needed to make a decision and a decrease in the rate of information accumulation, with a stereotyped return to a stable baseline soon thereafter until another change occurs.
No evidence for locus coeruleus norepinephrine (LCNE) system contribution to the decision trajectory
The LCNE system is known to modulate exploration states under uncertainty and pupil diameter shows a tight correspondence with LC neuron firing rate (AstonJones and Cohen, 2005; Rajkowski et al., 1994), with changes in pupil diameter indexing the exploreexploit decision state (Jepma and Nieuwenhuis, 2011). Similar to the classic YerkesDodson curve relating arousal to performance (Yerkes and Dodson, 1908), performance is optimal when tonic LC activity is moderate and phasic LC activity increases following a goalrelated stimulus (AstonJones et al., 1999, but see Joshi et al., 2016 for an exception). Because of this link between LCNE and the regulation of behavioral variability in response to uncertainty, we expected that LCNE system responses, as recorded by pupil diameter, would associate with environmental uncertainty and the trajectory through decision policy space following a change in actioncontingencies. Specifically, if the LCNE system were sensitive to a change in the optimal choice then we should observe a moderate spike in phasic activity following a change in actionoutcome contingencies. Note that we do not observe previously established links between exploratory choice behavior and the pupillary response (Jepma and Nieuwenhuis, 2011; Murphy et al., 2011; van Kempen et al., 2019). We ask the reader to titrate their interpretation of these pupillary data accordingly.
We characterized the evoked pupillary response on each trial in Experiment 2 using seven metrics: the mean of the pupil data over each trial interval, the latency to the peak onset and offset, the latency to peak amplitude, the peak amplitude, and the area under the curve of the pupillary response (see Pupil data preprocessing; Figure 9A). From a computational perspective, reducing the dimensionality of this set of pupillary response metrics expands the set of models we can consider while keeping computation time reasonable. Further, dimensionality reduction of the pupillary response allows us to capture separable sources of variance relating to timing and amplitude effects without restricting the data to a smaller set of metrics and possibly discarding information (e.g. timing effects may not be constrained to peak latency or onset latency; amplitude effects may not be constrained to peak dilation amplitude). Therefore, we submitted these metrics to principal component analysis to reduce their dimensionality while capturing maximum variance.
Evoked response characterization and principal component analyses were conducted for each session and for each subject in Experiment 2. The 95% CI for the number of principal components needed to explain 95% of the variance in the data was calculated over subjects and sessions to determine the number of principal components to keep for further analysis. To aid in interpreting subsequent analysis using the selected principal components, the feature importance of each pupil metric was calculated for each principal component and aggregated across subjects as a mean and bootstrapped 95% CI (Figure 9). We found that the first two principal components explained 95% of the variance in the pupillary data. Peak onset, peak offset, and latency to peak amplitude had the greatest feature importance for the first principal component (Figure 9B, upper panel). Mean pupil diameter and peak amplitude had the greatest feature importance for the second principal component (Figure 9B, lower panel). Thus, for interpretability, we refer to the first and second principal components as timing and magnitude components, respectively (Figure 9B). Note that we also conduct this analysis using more conventional methods of pupillary analysis and continue to observe a null effect (see Pupil data preprocessing for details).
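The component-selection step can be illustrated with a short sketch. It assumes that the eigenvalues (explained variances) of the principal components have already been obtained for one subject and session; the function name `n_components_for_variance` and the eigenvalues themselves are hypothetical and are not data from the study.

```python
def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative explained variance reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Invented eigenvalues for one subject and session (not data from the study).
n_keep = n_components_for_variance([4.2, 2.5, 0.2, 0.1])
```

Repeating this calculation for every subject and session yields the distribution of component counts over which a confidence interval can be computed.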
To test for the possibility that fluctuations in norepinephrine covaried with changes in the drift rate and the boundary height, we evaluated a set of models exploring the relationship between the timing and magnitude components of the changepointevoked pupillary response and shifts in $\theta $. As in our previous model comparison (Figure 8; see Environmental instability prompts a stereotyped decision trajectory), we found that the model describing the relationship between decision policy shift and time relative to a change point had the highest posterior probability given the set of models considered (Figure 10A). To further evaluate the extent of the evidence for the evoked response hypothesis, we present the evidence for the evoked response model against the original model set as $B{F}_{01}$ (Figure 10B). We find unambiguous evidence in favor of the evoked response model relative to the models specifying the modulation of $\theta $ via the timing and magnitude features of the changepointevoked pupillary response (posterior probability of timenull model given the set of models considered: $0.997\pm 0.002$), with substantially positive $B{F}_{01}$ values. We find no evidence that the pupillary response associates with the dynamics of the decision policy changes in response to a change in actionoutcome contingencies.
Discussion
We investigated how decision policies change when the rules of the environment change. In two separate experiments, we characterized how decision processes adapted in response to a change in actionoutcome contingencies as a trajectory through the space of possible types of exploratory and exploitative decision policies. Our findings highlight how, in the context of two choice paradigms, when faced with a possible change in outcomes, humans rapidly shift to a slow exploratory strategy by reducing the drift rate and, sometimes, increasing the boundary height in a stereotyped manner. Using pupillary data, we were unable to detect a relationship between the LCNE system and the dynamics of adaptive decision policies in unstable environments. Our findings show how the underlying decision algorithm adapts to different forms of uncertainty.
Exploration and exploitation states are not discrete, but exist along a continuum (Addicott et al., 2017). Instead of switching between binary states, humans manage environmental instability by adjusting the greediness of their decision policies (Sadeghiyeh et al., 2020; PratCarrabin et al., 2020; Feng et al., 2020; Wilson et al., 2014; PayzanLeNestour and Bossaerts, 2011; PayzanLenestour and Bossaerts, 2012; Wilson et al., 2021). Depending on the relative configuration of parameters in the accumulation to bound process, this adjustment can manifest as either speeded or slowed decisions (Figure 1E; Alexandrowicz, 2020; Ratcliff, 1978). Our results suggest that, in the context of volatile twochoice decisions, humans adopt a mechanism that simultaneously changes the rate of evidence accumulation and, sometimes, the threshold of evidence needed to trigger a decision, so as to adapt to an environmental change (Figure 6A). As soon as a shift in action outcomes is suspected, an internal estimate of change point probability increases and an estimate of the belief in the value of the optimal target plummets (Figure 7A). The rapid increase in change point probability causes a rapid rise in the boundary height on the subsequent trial, thereby increasing the criterion for selecting a new action and allowing variability in the accumulation process to have a greater influence on choice (Figure 7B), although this latter effect is inconsistent across experiments. These changes lead to slow exploratory decisions that facilitate discovery of the new optimal action and result in a quick recovery of the original threshold value over the course of a few trials. In parallel, the rate of evidence accumulation for the optimal choice decreases, with an immediate drop that gradually returns to its asymptotic value as the belief in the value of the optimal choice stabilizes. 
These results show that when a learner confronts a change point, the decision policy becomes more exploratory by simultaneously increasing the amount of evidence needed to make a decision and slowing the integration of evidence over time. Together, these decision dynamics form a mechanism for gathering information in an unbiased manner that slows the decision at the decision process level but responds quickly relative to a suspected change in trial time.
Critically, our finding that underlying decision policies can reconfigure multiple underlying decision parameters closely parallels recent work in the domain of informationseeking. Information seeking has been decomposed into random and directed components (Wilson et al., 2014). Random exploration refers to inherent behavioral variability that leads us to explore other options, while directed exploration refers to the volitional pursuit of new information. Feng and colleagues recently found that random exploration is driven by changes in the drift rate and the boundary height, with drift rate changes dominating the policy shift (Feng et al., 2020). When environmental conditions encouraged exploration, the drift rate slowed, reducing the signaltonoise ratio of the reward representation. This finding clearly aligns with our current observations showing that the drift rate sharply decreases in response to a change point and that this change in drift rate dominates the reconfiguration of decision processes, although our experiments were not designed to isolate the directed and random elements of exploration.
Our results are also broadly consistent with a growing body of research converging on the idea that decision policies are not static, but sensitive to changes in environmental dynamics (Dunovan and Verstynen, 2019; Urai et al., 2018). Previous work by our lab (Dunovan and Verstynen, 2019) has shown how, during a modified reactive inhibitory control task, different feedback signals target different parts of the accumulationtobound process. Specifically, errors in response timing drove rapid changes in the drift rate on subsequent trials, while selection errors (i.e. making a response on trials where the response should be inhibited) changed the boundary height. Further, there is new evidence that the drift rate adapts on the basis of previous choices, independent of the feedback given for those choices. Urai and colleagues have convincingly demonstrated that choice history signals sculpt the dynamics of the accumulation process by biasing the rate of evidence accumulation (Urai et al., 2018). Our current findings and these previous observations (Pedersen et al., 2017; Ratcliff and Frank, 2012) all highlight how sensitive the parameters of accumulationtobound processes are to immediate experience.
Previous literature has shown a conflictinduced spike in reaction time (e.g. Jahfari et al., 2019). However, our complex reaction time results depart from this. One reason for this departure may relate to the demands of the task we are asking participants to perform. While increased cognitive demand should increase reaction times across conditions, we observe a linear decrease in reaction time as a function of volatility when conflict is highest, and we also see that a net increase in conflict decreases reaction times (Figure 3D). We suspect that the presence of both conflict and volatility blurs the distinction between these two sources of uncertainty, especially under high volatility and high conflict conditions. We also see this effect in our formulation of change point probability (CPP), with a bias to overestimate CPP when conflict is high (Figure 4D). It is possible that participants also exhibit this bias to overestimate volatility when conflict is high, which could muddle the effect of conflict on reaction times. Future research should explore the interaction of change point and conflict estimation on the speedaccuracy tradeoff.
We hypothesized that any shift in decision policy in response to a change in actionoutcome contingencies would be linked to changes in phasic responses of the LCNE pathways (AstonJones and Cohen, 2005). However, we failed to find any evidence of this link using pupillary responses as a proxy of LCNE dynamics. It should be noted, however, that our experimental design cannot distinguish between pupillary dynamics driven by other catecholamines, such as dopamine, and those dynamics driven by the LCNE system (Spiers and Calne, 1969; McClure et al., 2005; Gershman and Tzovaras, 2018; Gershman and Uchida, 2019). Thus, it is possible that the LCNE system may still be playing a role in the shift of decision policies, and the pupil responses we collected were insensitive to the underlying dynamics. Nonetheless, this null association suggests that an alternative neural mechanism drives the adaptive changes that we observed behaviorally.
One possible alternative mechanism for resetting decision policies is dopaminergic changes to the corticobasal gangliathalamic (CBGT) pathways, or ‘loops’. Both recent experimental (Yartsev et al., 2018; Dunovan et al., 2015) and theoretical (Bogacz and Larsen, 2011; Caballero et al., 2018; Wei et al., 2015) studies have pointed to the CBGT loops as being a crucial pathway for accumulating evidence during decision making, with the wiring architecture of these pathways ideal for implementing the sequential probability ratio test (Bogacz and Gurney, 2007; Bogacz, 2007), the statistically optimal algorithm for evidence accumulation decisions and the basis for the DDM itself (Ratcliff, 1978). Further, multiple lines of theoretical work have suggested that, within the CBGT pathways, the difference in direct pathway activity between action channels covaries with the rate of evidence accumulation for individual decisions (Mikhael and Bogacz, 2016; Bariselli et al., 2019; Dunovan et al., 2019; Rubin et al., 2021), while the indirect pathways are linked to control of the boundary height (Wei et al., 2015; Herz et al., 2016; Bogacz, 2007; Ratcliff and Frank, 2012). This suggests that changes in the direct and indirect pathways, both within and between representations of different actions, may regulate shifts in decision policies.
Critically, the CBGT pathways are a target of the dopaminergic signaling that drives reinforcement learning (Schultz et al., 1992), suggesting that changes in relative actionvalue should drive trialbytrial changes in the drift rate. Indeed, previous work relating dopaminergic circuitry to decision policy adaptation suggests that dopamine may play a critical role in modulating decision policies. Dopamine has substantial links to exploration (Kakade and Dayan, 2002) and recent pharmacological evidence suggests a role for dopaminergic regulation of exploration in humans (Chakroun et al., 2020). More explicitly, both directed and random exploration have been linked to variations in genes that affect dopamine levels in prefrontal cortex and striatum, respectively (Gershman and Tzovaras, 2018). Physiologically, previous work has found that a dopaminecontrolled spiketimingdependent plasticity rule alters the ratio of direct to indirect pathway efficacy in a simulated corticostriatal network (Vich et al., 2020), with overall indirect pathway activity (i.e. predecision firing rates) linked to the modulation of the boundary height in a DDM and the difference in direct pathway activation across action channels associating with changes in the drift rate (Dunovan et al., 2019; Rubin et al., 2021). Moreover, recent optogenetic work in mice suggests that activating the subthalamic nucleus, a key node in the indirect pathway, not only halts the motoric response but also interrupts cognitive processes related to action selection (Heston et al., 2020). 
Our current observations, combined with this previous work, suggest that the decision policy reconfiguration that we observe may associate with similar underlying corticostriatal dynamics, with beliefdriven changes to drift rate varying with the difference in direct pathway firing rates across action channels (Dunovan et al., 2019), and changepointprobabilitydriven changes to the boundary height varying with overall indirect pathway activity (Dunovan et al., 2019; Vich et al., 2020). Future physiological studies should focus on validating this predicted relationship between decision policy reconfiguration and CBGT pathways.
The current study raises many more questions about the dynamics of adaptive decision policies than it answers. For example, we only sparsely sampled the space of possible states of value conflict and volatility. Future work would benefit from a more complete sampling of the conflict and volatility space. A psychophysical characterization of how decision states shift in response to varying forms of uncertainty will expose potential nonlinear relationships between the decision policy and feedback uncertainty. Moreover, the decisions that we have modeled here are simple two choice decisions, constrained mostly by the normative form of the traditional DDM framework (Ratcliff, 1978). Scaling the complexity of the task will allow for a more complete assessment of how these relationships change with more complex decisions that better approximate the choices that we make outside the lab. This could be done by moving the cognitive model to frameworks that can fit processes for decisions involving more than two alternatives (e.g. Tajima et al., 2019). Finally, because our estimate of the relationship between our ideal observer estimates of uncertainty and human estimates of uncertainty was indirect, this work would benefit from online approximations of ideal observer estimates, as has been done previously (Wilson et al., 2010). Indeed, there can be substantive individual differences in the detection of change points (Wilson et al., 2010). Thus, an approximation of how well the estimates of change point probability from our ideal observer correspond to estimates that human observers hold is needed. This approximation would validate the fidelity of the relationship between the ideal observer estimates of uncertainty and the decision parameters that we observed.
Together, our results suggest that when humans are forced to change their mind about the best action to take, the underlying decision policy adapts in a specific way. When a change in actionoutcome contingency is suspected, the rate of evidence accumulation decreases and more evidence may briefly be required to commit to a response, allowing variability inherent to the decision process to play a greater role in response selection and resulting in a slow exploratory state. As the environment becomes stable, the system gradually adapts to an exploitative state. Importantly, we find no evidence that norepinephrine pathways associate with this response. This suggests that other pathways may be engaged in this adaptive reconfiguration of decision policies. These results reveal the multifaceted underlying decision processes that can adapt action selection policy under multiple forms of environmental uncertainty.
Materials and methods
Participants
Neurologically healthy adults were recruited from the local university population. All procedures were approved by the Carnegie Mellon University Institutional Review Board (Approval Code: 2018_00000195; Funding: Air Force Research Laboratory, Grant Office ID: 180119). All research participants provided informed consent to participate in the study and consent to publish any research findings based on their provided data.
Twentyfour participants (19 female, 22 righthanded, 19–31 years old) were recruited for Experiment 1 and paid $20 at the end of four sessions. Four participants (two female, 4 righthanded, 21–28 years old) were recruited for Experiment 2 and paid $10 for each of nine sessions, in addition to a performance bonus.
Processed data and code are available within a Github repository for this publication (copy archived at swh:1:rev:0486705db0f004a5e1365759f5f5a391790771f8, Bond, 2021). Hypotheses were registered prior to the completion of data collection using the Open Science Framework (Foster, MSLS and Deardorff, MLIS, 2017).
Stimuli and procedure
Experiment 1
To begin the task, each participant read the following instructions:
“You’re going on a treasure hunt! You will start with 600 coins in your treasure chest, and you’ll be able to pay a coin to open either a purple or an orange box. When you open one of those boxes, you will get a certain number of coins, depending on the color of the box. However, opening the same box will not always give you the same number of coins, and each choice costs one coin. After making your choice, you will receive feedback about how much money you have. Your goal is to make as much money as possible. Press the green button when you’re ready to continue. Choose the left box by pressing the left button with your left index finger and choose the right box by pressing the right button with your right index finger. Note that if you choose too slowly or too quickly, you won’t earn any coins. Finally, remember to make your choice based on the color of the box. Press the green button when you’re ready to begin the hunt!”.
On each trial, participants chose between one of two ‘mystery boxes’ presented sidebyside on the computer screen (Figure 2A). Participants selected one of the two boxes by pressing either a left button (left box selection) or right button (right box selection) on a button box (Black Box ToolKit USB Response Pad, URP48). Reaction time (RT) was defined as the time elapsed from stimulus presentation to stimulus selection. Reaction time was constrained so that participants had to respond within 100 ms to 1000 ms from stimulus presentation. If participants responded too quickly, the trial was followed by a 5 s pause and they were informed that they were too fast and asked to slow down. If participants responded too slowly, they received a message saying that they were too slow, and were asked to choose quickly on the next trial. In both of these cases, participants did not receive any reward feedback or earn any points, and the trial was repeated so that 600 trials met these reaction time constraints. In order to avoid fatigue, a small break was given midway through each session (break time: 0.72 ± 1.42 m). Participants began each condition with 600 points and lost one point for each incorrect decision.
Feedback was given after each rewarded choice in the form of points drawn from the normal distribution $N(\mu =3,\sigma =1)$ and converted to an integer. If the choice was unrewarded, then participants received 0 points. These points were displayed above the selected mystery box for 0.9 s. To prevent stereotyped responses, the intertrial interval was sampled from a uniform distribution with a lower limit of 250 ms and an upper limit of 750 ms ($U(250,750)$). The relative leftright position of each target was pseudorandomized on each trial to prevent incidental learning based on the spatial position of either the mystery box or the responding hand.
To induce decisionconflict, the probability of reward for the optimal target ($P$) was manipulated across two conditions. We imposed a relatively low probability of reward for the high conflict condition ($P$ = 0.65). Conversely, we imposed a relatively high probability of reward for the low conflict condition ($P$ = 0.85). For all conditions, the probability of reward for the lowvalue target was $1-P$.
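The feedback rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the task code: the function names are hypothetical, and because the text does not say how draws were converted to integers, rounding to the nearest integer is assumed here.

```python
import random


def trial_feedback(chose_optimal, p_optimal, rng):
    """Points earned on one trial: an integer draw from N(3, 1) if rewarded, else 0."""
    # The lowvalue target is rewarded with probability 1 - P.
    p_reward = p_optimal if chose_optimal else 1.0 - p_optimal
    if rng.random() < p_reward:
        return int(round(rng.gauss(3.0, 1.0)))  # rounding is an assumption
    return 0


def intertrial_interval_ms(rng):
    """Intertrial interval in milliseconds, sampled from U(250, 750)."""
    return rng.uniform(250.0, 750.0)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Ten choices of the optimal target under the high conflict condition (P = 0.65).
points = [trial_feedback(True, 0.65, rng) for _ in range(10)]
```

Lowering `p_optimal` toward 0.5 makes the two targets harder to tell apart from feedback alone, which is the sense in which $P$ controls conflict.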
Along with these reward manipulations, we also introduced volatility in the actionoutcome contingencies. After a prespecified number of trials, the identity of the optimal target switched periodically. The point at which the optimal target switched identities was termed a change point. Each period of mean contingency stability was defined as an epoch. Consequently, each session was composed of multiple change points and multiple epochs. Epoch lengths, in trials, were drawn from a Poisson distribution. The lambda parameter was held constant for both high conflict and low conflict conditions ($\lambda =25$).
To manipulate volatility, epoch lengths were manipulated across two conditions. The high volatility condition drew epoch lengths from a Poisson distribution where $\lambda =15$ and the low volatility condition drew epoch lengths from a distribution where $\lambda =35$. In these conditions manipulating volatility, the probability of reward was held constant ($P$ = 0.75).
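A change point schedule of this kind can be generated with a short sketch. The function names are hypothetical, the Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson draw, and the guard against zero-length epochs is an assumption rather than a documented feature of the task.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def optimal_target_schedule(n_trials, lam, rng):
    """Per-trial index (0 or 1) of the optimal target, switching at each change point."""
    schedule, target = [], 0
    while len(schedule) < n_trials:
        epoch_length = max(1, poisson_sample(lam, rng))  # avoid empty epochs (assumption)
        schedule.extend([target] * epoch_length)
        target = 1 - target  # change point: the optimal target switches identity
    return schedule[:n_trials]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
high_volatility = optimal_target_schedule(600, 15, rng)
low_volatility = optimal_target_schedule(600, 35, rng)
```

A smaller $\lambda $ produces shorter epochs and therefore more change points within the same 600 trials.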
Each participant was tested under four experimental conditions: high conflict, low conflict, high volatility, and low volatility. Each condition was completed in a unique experimental session and each session consisted of 600 trials. Each participant completed the entire experiment over two testing days. To eliminate the effect of timing and its correlates on reward learning (Byrne et al., 2017; Murray et al., 2009), the order of conditions was counterbalanced across participants.
Experiment 2
Experiment 2 used male and female Greebles (Gauthier and Tarr, 1997) as selection targets (Figure 2B). Participants were first trained to discriminate between male and female Greebles to prevent errors in perceptual discrimination from interfering with selection on the basis of value. Using a twoalternative forced choice task, participants were presented with a male and female Greeble and asked to select the female, with the male and female Greeble identities resampled on each trial. Participants received binary feedback regarding their selection (correct or incorrect). This criterion task ended after participants reached 95% accuracy (mean number of trials to reach criterion: 31.29, standard deviation over means for subjects: 9.99).
After reaching perceptual discrimination criterion for each session, each participant was tested under nine reinforcement learning conditions composed of 400 trials each, generating 3600 trials per subject in total. Data were collected from four participants in accordance with a replicationbased design, with each participant serving as a replication experiment. Participants completed these sessions across three weeks in randomized order. Each trial presented a male and female Greeble (Gauthier and Tarr, 1997), with the goal of selecting the sex identity of the Greeble that was most profitable (Figure 2B). Individual Greeble identities were resampled on each trial; thus, the task of the participant was to choose the sex identity rather than the individual identity of the Greeble which was most rewarding. Probabilistic reward feedback was given in the form of points drawn from the normal distribution $N(\mu =3,\sigma =1)$ and converted to an integer, as in Experiment 1. These points were displayed at the center of the screen. Participants began with 200 points and lost one point for each incorrect decision. To promote incentive compatibility (Hurwicz, 1972; Ledyard, 1989), participants earned a cent for every point earned. Reaction time was constrained such that participants were required to respond between 0.1 and 0.75 s from stimulus presentation. If participants responded in $\le 0.1$ s, $\ge 0.75$ s, or failed to respond altogether, the point total turned red and decreased by five points. Each trial lasted 1.5 s and reward feedback for a given trial was displayed from the time of the participant’s response to the end of the trial.
To manipulate change point probability, the sex identity of the most rewarding Greeble was switched probabilistically, with a change occurring every 10, 20, or 30 trials, on average. To manipulate the belief in the value of the optimal target, the probability of reward for the optimal target was manipulated, with $P$ set to 0.65, 0.75, or 0.85. Each session combined one value of $P$ with one level of change point probability, such that all combinations of change point frequency and reward probability were imposed across the nine sessions (Figure 2C). As in Experiment 1, the position of the high-value target was pseudorandomized on each trial to prevent prepotent response selections on the basis of location.
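The 3 × 3 design above can be sketched in a few lines of Python. This is an illustrative sketch only, not the task code used in the study. It assumes that epoch lengths are Poisson-distributed around the stated mean (the text says only "on average") and that the suboptimal target pays out with probability $1-P$; the function names are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw from a Poisson distribution (Knuth's method; fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def optimal_target_sequence(n_trials, mean_epoch_length, seed=0):
    """Per-trial identity (0 or 1) of the most rewarding target.

    ASSUMPTION: epoch lengths are Poisson draws with the given mean.
    """
    rng = random.Random(seed)
    optimal = rng.randrange(2)
    remaining = max(1, poisson(mean_epoch_length, rng))
    sequence = []
    for _ in range(n_trials):
        if remaining == 0:  # change point: the optimal identity flips
            optimal = 1 - optimal
            remaining = max(1, poisson(mean_epoch_length, rng))
        sequence.append(optimal)
        remaining -= 1
    return sequence

def reward_delivered(choice, optimal, p_optimal, rng):
    """ASSUMPTION: the suboptimal target is rewarded with probability 1 - P."""
    p = p_optimal if choice == optimal else 1.0 - p_optimal
    return rng.random() < p

# One session per combination of change point frequency and reward probability.
conditions = [(lam, p) for lam in (10, 20, 30) for p in (0.65, 0.75, 0.85)]
```

Calling `optimal_target_sequence(400, 20)` yields one 400-trial session at the intermediate volatility level.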
Throughout the task, the head-stabilized diameter and gaze position of the left pupil were measured with an Eyelink 1000 desktop mount at 1000 Hz. Participants viewed stimuli from within a custom-built booth designed to eliminate the influence of ambient sources of luminance. Because the extent of the pupillary response is known to be highly sensitive to a variety of influences (Sirois and Brisson, 2014), we established the dynamic range of the pupillary response for each session by exposing participants to a sinusoidal variation in luminance prior to the reward-learning task. During the reward-learning task, all stimuli were rendered isoluminant with the background of the display to further prevent luminance-related confounds of the task-evoked pupillary response. To obtain as clean a trial-evoked pupillary response as possible and minimize the overlap of the pupillary response between trials, the inter-trial interval was sampled from a truncated exponential distribution with a minimum of 4 s, a maximum of 16 s, and a rate parameter of 2. The eye tracker was calibrated and the calibration was validated at the beginning of each session. See Pupil data preprocessing for pupil data preprocessing steps.
Models and simulations
Q-learning simulations
A simple, tabular Q-learning agent (Sutton and Barto, 1998) was used to simulate action selection in contexts of varying degrees of conflict and volatility. On each trial, $t$, the agent chooses which of two actions to take according to the softmax policy
$$P({a}_{i})=\frac{{e}^{\beta {Q}_{t}({a}_{i})}}{\sum _{j}{e}^{\beta {Q}_{t}({a}_{j})}}$$
Here, $\beta $ is the inverse temperature parameter, $1/\tau $, reflecting the greediness of the selection policy and ${Q}_{t}$ is the estimated state-action value vector on that trial. Higher values of $\beta $ reflect more exploitative decision policies.
After selection, a binary reward, ${r}_{t}$, was returned. This was used to update the $Q$ table for the selected action, $a$, according to a simple update rule
$${Q}_{t+1}(a)={Q}_{t}(a)+\alpha [{r}_{t}-{Q}_{t}(a)]$$
where $\alpha $ is the learning rate for the model.
On each simulation an agent was initialized with a specific $\beta $ value, ranging from 0.1 to 3. On each run the agent completed 500 trials at a specific conflict and volatility level, according to the experimental procedures described in Stimuli and Procedure. The total returned reward was tallied after each run, which was repeated for 200 iterations to provide a stable estimate of return for each agent and condition. The agent was tested on a range of pairwise conflict ($P(optimal)=0.55$ to $0.90$) and volatility ($\lambda =10$ to $100$) conditions.
After all agents were tested on all conditions, the $\beta $ value for the agent that returned the greatest average reward across runs was identified as the optimal agent for that experimental condition.
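The simulation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used in the study: the learning rate value, the per-trial switching rule (a flip with probability $1/\lambda $ on each trial), the reward probability of $1-P$ for the suboptimal action, and all function names are assumptions introduced here.

```python
import math
import random

def softmax_choice(q, beta, rng):
    """Two-action softmax: P(action 0) = 1 / (1 + exp(-beta * (Q0 - Q1)))."""
    p0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
    return 0 if rng.random() < p0 else 1

def run_agent(beta, alpha, p_optimal, mean_epoch, n_trials=500, seed=0):
    """Total reward earned by one tabular Q-learning agent in one run."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    optimal, total = 0, 0.0
    for t in range(n_trials):
        # ASSUMPTION: change points occur with probability 1 / mean_epoch per trial.
        if t > 0 and rng.random() < 1.0 / mean_epoch:
            optimal = 1 - optimal
        action = softmax_choice(q, beta, rng)
        # ASSUMPTION: the suboptimal action is rewarded with probability 1 - P.
        p = p_optimal if action == optimal else 1.0 - p_optimal
        reward = 1.0 if rng.random() < p else 0.0
        q[action] += alpha * (reward - q[action])  # delta-rule update
        total += reward
    return total

def best_beta(betas, alpha, p_optimal, mean_epoch, n_runs=200):
    """Return the beta with the greatest mean return, plus all mean returns."""
    mean_return = {
        b: sum(run_agent(b, alpha, p_optimal, mean_epoch, seed=s)
               for s in range(n_runs)) / n_runs
        for b in betas
    }
    return max(mean_return, key=mean_return.get), mean_return
```

For example, `best_beta([0.1, 1.0, 3.0], alpha=0.2, p_optimal=0.75, mean_epoch=20)` compares three agents in a single condition; the value `alpha=0.2` is arbitrary and chosen only for illustration.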
Drift diffusion model simulations
A normative drift-diffusion model (DDM) process (Ratcliff, 1978) was used to simulate the outcomes of agents with different drift rates and boundary heights. The DDM assumes that evidence is stochastically accumulated as the log-likelihood ratio of evidence for two competing decision outcomes. Evidence is tracked by a single decision variable $\theta $ until reaching one of two boundary heights, representing the evidence criterion for committing to a choice. The dynamics of $\theta $ are given by
$$d\theta =v\,dt+\sigma \,dW$$
where $v$ is the mean strength of the evidence and $\sigma $ is the standard deviation of a white noise process $W$, representing the degree of noise in the accumulation process. The choice and reaction time (RT) on each trial are determined by the first passage of $\theta $ through one of the two decision boundaries $\{a,0\}$. In this formulation, $\theta $ remains fixed at a predefined starting point $z/a\in [0,1]$ until time $tr$, resulting in an unbiased evidence accumulation process when $z=a/2$. In perceptual decision tasks, $v$ reflects the signal-to-noise ratio of the stimulus. However, in a value-based decision task, $v$ can be taken to reflect the difference between Q-values for the left and right actions. Thus, an increase (decrease) in ${Q}_{L}-{Q}_{R}$ from 0 would correspond to a proportional increase (decrease) in $v$, leading to more rapid and frequent terminations of $\theta $ at the upper (lower) boundary $a$ (0).
Using this DDM framework, we simulated a set of agents with different configurations of $a$ and $v$. Each agent completed 1,500 trials of a ‘left’ (upper bound) or ‘right’ (lower bound) choice task, with $tr=0.26$ and $z=\frac{a}{2}$. The values for $a$ were sampled between 0.05 and 0.2 in intervals of 0.005. The values for $v$ were sampled from 0 to 0.3 in 0.005 intervals. At the end of each agent run, the probability of selecting the left target, $P(L)$, and the mean RT were recorded.
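A single agent from this simulation set can be sketched with a simple Euler-Maruyama discretization of the accumulation process. The sketch below is illustrative rather than the simulation code used in the study. The noise level ($\sigma =0.1$), the time step, the cap on trial duration, and the function names are assumptions introduced here; only $tr=0.26$, $z=a/2$, and the ranges of $a$ and $v$ come from the text.

```python
import math
import random

def simulate_trial(v, a, tr=0.26, sigma=0.1, dt=0.001, rng=None, max_steps=20000):
    """Simulate one DDM trial; returns (choice, reaction time in seconds).

    ASSUMPTIONS: sigma = 0.1 and dt = 1 ms are illustrative values.
    """
    rng = rng or random.Random()
    theta = a / 2.0                      # unbiased starting point, z = a / 2
    step_sd = sigma * math.sqrt(dt)      # noise scale for one time step
    for step in range(1, max_steps + 1):
        theta += v * dt + rng.gauss(0.0, step_sd)
        if theta >= a:
            return "left", tr + step * dt    # upper boundary
        if theta <= 0.0:
            return "right", tr + step * dt   # lower boundary
    return None, None                    # no boundary crossed within the cap

def simulate_agent(v, a, n_trials=1500, seed=0):
    """Return P(L) and mean RT over completed trials for one (a, v) agent."""
    rng = random.Random(seed)
    outcomes = [simulate_trial(v, a, rng=rng) for _ in range(n_trials)]
    finished = [(c, rt) for c, rt in outcomes if c is not None]
    p_left = sum(c == "left" for c, _ in finished) / len(finished)
    mean_rt = sum(rt for _, rt in finished) / len(finished)
    return p_left, mean_rt
```

Looping `simulate_agent` over a grid of $a$ and $v$ values produces a map of choice probability and mean RT across decision policies.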
Cognitive model
Our a priori hypothesis was that the drift rate ($v$) and the boundary height ($a$) should change on a trial-by-trial basis according to two estimates of uncertainty from an ideal observer (Bond et al., 2018). We adapted the ideal observer calculations below from a previous study (Vaghi et al., 2017; for the original formulation of this reduced ideal observer model and its derivation, see Nassar et al., 2010).
First, we assumed that reward feedback drove the belief in the reward associated with an action. We called the belief in the reward attributable to a given action $B$. This reward belief is learned separately for each action target. Given the chosen target ($c$) and the unchosen target ($u$), the belief in the mean reward for the chosen and unchosen targets on the next trial (trial $t+1$) was calculated as:
where ${\alpha}_{t}$ denotes the learning rate, ${\delta}_{t}$ the prediction error, and ${\mathrm{\Omega}}_{t}$ the change point probability on the current trial $t$, as discussed below. $E(r)$ refers to the pooled expected value of both targets:
with ${\overline{r}}_{t0},{\overline{r}}_{t1}$ fixed based on the imposed target reward probabilities.
The prediction error, ${\delta}_{t}$, was the difference between the reward obtained for the target chosen, ${r}_{t}$, and the model belief:
$${\delta}_{t}={r}_{t}-{B}_{c(t)}$$
The signed belief in the reward difference between optimal and suboptimal targets ($\mathrm{\Delta}B$) was calculated as the difference in reward value belief between target identities:
Model confidence ($\varphi $) was defined as a function of change point probability ($\mathrm{\Omega}$) and the variance of the generative distribution of points (${\sigma}_{n}^{2}$), both of which formed an estimate of relative uncertainty ($RU$):
Thus $\varphi $ is calculated as:
An estimate of the variance of the reward distribution, ${\sigma}_{t}^{2}$, was calculated as:
where ${\sigma}_{n}^{2}$ is the fixed variance of the generative reward distribution.
The learning rate of the model ($\alpha $) was determined by the change point probability ($\mathrm{\Omega}$) and the model confidence ($\varphi $). Here, the learning rate was high if either (1) a change in the mean of the distribution of the difference in expected values was likely ($\mathrm{\Omega}$ is high) or (2) the estimate of the mean was highly imprecise (${\sigma}_{t}^{2}$ was high):
To model how learners update action values, we calculated an estimate of how often the same action gave a different reward (Vaghi et al., 2017). This estimate gave our representation of change point probability, $\mathrm{\Omega}$. The change point probability approached one from below as the probability of a sample coming from a uniform distribution, relative to a Gaussian distribution, increased:
In equation (12), $H$ refers to the hazard rate, or the global probability of a change point over trials:
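The verbal description of the change point probability can be illustrated with a short sketch. It follows the general form used in reduced Bayesian observer models (Nassar et al., 2010), in which the likelihood of a reward under a uniform distribution is weighed against its likelihood under a Gaussian centered on the current belief. The exact expression, the reward range of the uniform distribution, and the function names are assumptions made for illustration and may differ from the formulation used here.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def change_point_probability(reward, belief, sd, hazard,
                             reward_min=0.0, reward_max=6.0):
    """Probability that the observed reward signals a change point.

    ASSUMPTIONS: the uniform distribution spans [reward_min, reward_max],
    and the two likelihoods are weighted by the hazard rate H and 1 - H.
    """
    uniform = 1.0 / (reward_max - reward_min)
    gaussian = normal_pdf(reward, belief, sd)
    return (uniform * hazard) / (uniform * hazard + gaussian * (1.0 - hazard))

# An expected reward yields a low estimate; a surprising one a high estimate.
expected = change_point_probability(reward=3.0, belief=3.0, sd=1.0, hazard=0.05)
surprising = change_point_probability(reward=0.0, belief=3.0, sd=1.0, hazard=0.05)
```

In this sketch the estimate stays strictly below one for any hazard rate below one, consistent with the description of $\mathrm{\Omega}$ approaching one from below.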
Our preregistered expectation was that the belief in the value of a given action and an estimate of environmental stability would target different parameters of the DDM model. Specifically, we hypothesized that the belief in the relative reward for the two choices, $\mathrm{\Delta}B$, would update the drift rate, $v$, or the rate of evidence accumulation:
while the change point probability, $\mathrm{\Omega}$, would increase the boundary height, $a$, or the amount of evidence needed to make a decision:
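The hypothesized mapping from ideal observer estimates to decision parameters can be sketched as a pair of linear relationships. The linear form, the intercepts (`v0`, `a0`), the weights (`beta_v`, `beta_a`), and the function name are assumptions introduced for illustration; they are not taken from the equations of this paper.

```python
def decision_parameters(delta_b, omega, v0, beta_v, a0, beta_a):
    """Map belief and change point probability onto drift rate and boundary.

    ASSUMPTION: both relationships are linear with free intercepts and slopes.
    """
    drift_rate = v0 + beta_v * delta_b        # belief updates the drift rate
    boundary_height = a0 + beta_a * omega     # volatility updates the boundary
    return drift_rate, boundary_height

# Hypothetical values: a stronger belief raises v; a likely change point raises a.
v_t, a_t = decision_parameters(delta_b=0.5, omega=0.2,
                               v0=0.1, beta_v=2.0, a0=0.1, beta_a=0.5)
```

Under this sketch, a positive `beta_a` expresses the hypothesis that a higher change point probability increases the amount of evidence required before committing to a choice.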
Hierarchical drift diffusion modeling
First, to identify which decision parameters were sensitive to the onset of a change point, we estimated the posterior distribution of drift rate ($v$), boundary height ($a$), drift criterion ($dc$), starting point ($z$), and non-decision time ($t$) for the trial preceding the change point and the following three trials using stimulus-coded fitting methods for Experiment 1. We then looked for change-point-evoked effects in these parameters by comparing the overlap of the distributions for each decision parameter for each of these trials. If less than 5% of the mass of the trial-wise posterior distributions for a given decision parameter overlapped, we considered those distributions to exhibit change point sensitivity.
To identify the fits that best accounted for the data, we conducted a model selection process using Deviance Information Criterion (DIC) scores. We compared the set of fitted models (Table 1) to an intercept-only regression model ($DI{C}_{i}-DI{C}_{intercept}$). A lower DIC score indicates a model that loses less information. Here, a difference of ≤ 2 points from the lowest-scoring model cannot rule out the higher-scoring model; a difference of 3–7 points suggests that the higher-scoring model has considerably less support; and a difference of 10 points suggests essentially no support for the higher-scoring model (Spiegelhalter et al., 2002; Burnham and Anderson, 1998).
We used these complementary model ‘pruning’ methods (i.e. distributional overlap and information loss) as an out-of-set filtering method to determine which decision parameters to include for the subsequent HDDM regression analyses in Experiment 2.
The best parameter fits, evaluated as above, were used to plot the decision trajectory (Decision vector representation) and to estimate the change-point-evoked relationship between those winning parameters (Model proposals and evaluation).
For Experiment 2, to assess whether and how much the ideal observer estimates of change point probability ($\mathrm{\Omega}$) and the belief in the value of the optimal target ($\mathrm{\Delta}\text{B}$) updated the rate of evidence accumulation ($v$) and the amount of evidence needed to make a decision ($a$), we regressed the change-point-evoked ideal observer estimates onto the decision parameters using hierarchical drift diffusion model (HDDM) regression (Wiecki et al., 2013). These ideal observer estimates of environmental uncertainty served as a more direct and continuous measure of the uncertainty we sought to induce with our experimental conditions (see Figure 4 for how the experimental conditions impacted these estimates). Considering this more direct approach, we pooled change point probability and belief across all conditions and used these values as our predictors of drift rate and boundary height. Responses were accuracy-coded, and the belief in the difference between target values was transformed to the belief in the value of the optimal target ($\mathrm{\Delta}{B}_{\text{optimal(t)}}={B}_{\text{optimal(t)}}-{B}_{\text{suboptimal(t)}}$). This approach allowed us to estimate trial-by-trial covariation between the ideal observer estimates and the decision parameters relative to the onset of a change point.
For both the HDDM fits for Experiment 1 and the regression analyses for Experiment 2, Markov chain Monte Carlo methods were used to sample the posterior distributions of the regression coefficients. Twenty thousand samples were drawn from the posterior distributions of the coefficients for each model, with 5000 samples discarded as burn-in and a thinning factor of five. We chose this number of samples to optimize the trade-off between computation time and the precision of parameter estimates, and all model parameters converged to stability. This method generates a distributional estimate of the regression coefficients instead of a single best fit.
To test our hypotheses regarding these HDDM regression estimates, we again used the posterior distributions of the regression parameters. To quantify the reliability of each regression coefficient, we computed the probability of the regression coefficient being greater than or less than 0 over the posterior distribution. We considered a regression coefficient to be reliable if the estimated coefficient maintained the same sign over at least 95% of the mass of the posterior distribution.
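The reliability criterion described above amounts to counting the share of posterior samples on each side of zero. A minimal sketch follows, assuming the posterior is available as a plain list of sampled coefficient values; the function names are hypothetical.

```python
def sign_probabilities(samples):
    """Return (P(coefficient > 0), P(coefficient < 0)) over posterior samples."""
    n = len(samples)
    p_positive = sum(s > 0 for s in samples) / n
    p_negative = sum(s < 0 for s in samples) / n
    return p_positive, p_negative

def is_reliable(samples, threshold=0.95):
    """True if at least `threshold` of the posterior mass shares one sign."""
    return max(sign_probabilities(samples)) >= threshold

# Toy posterior: 96% of the samples are positive, so the coefficient is reliable.
toy_posterior = [0.4] * 96 + [-0.1] * 4
```

Here `is_reliable(toy_posterior)` returns `True`, whereas a posterior with only 90% of its mass above zero would not meet the criterion.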
Analyses
General statistical analysis
Statistical analyses and data visualization were conducted using custom scripts written in R (R Foundation for Statistical Computing, version 3.4.3) and Python (Python Software Foundation, version 3.5.5).
To determine how many trials would be needed to detect proposed condition effects, we conducted a power analysis by way of parameter recovery. For this, we simulated accuracy and reaction time data using our hypothesized model (Cognitive model) and calculated the generative or “true” mean drift rate and boundary height parameters across trials. Then we conducted hierarchical parameter estimation given 200, 400, 600, 800, or 1000 simulated trials. The mean squared error of parameter estimates was stable at 600 trials for all decision parameters. Additionally, as a validation measure, we estimated parameters using component models (drift rate alone, boundary height alone) and a combined model (drift rate and boundary height). We found that the Deviance Information Criterion (DIC) scores among competing models were clearly separable at 600 trials, and in favor of the hypothesized model from which we generated the data, as expected (Acknowledgments). Based on these results, we used 600 trials per condition for each participant for our first experiment. We chose to recruit 24 participants for this experiment to fully counterbalance the four conditions (4! = 24).
Binary accuracy data were submitted to a mixed effects logistic regression analysis with either the degree of conflict (the probability of reward for the optimal target) or the degree of volatility (mean change point frequency) as predictors. The resulting log-likelihood estimates were transformed to likelihood for interpretability. RT data were log-transformed and submitted to a mixed effects linear regression analysis with the same predictors as in the previous analysis. To determine if participants used ideal observer estimates to update their behavior, two more mixed effects regression analyses were performed. Estimates of change point probability and the belief in the value of the optimal target served as predictors of reaction time and accuracy across groups. As before, we used a mixed logistic regression for accuracy data and a mixed linear regression for reaction time data.
Because we adopted a within-subjects design, all regression analyses of behavior modeled the non-independence of the data as constantly correlated data within participants (random intercepts). Unless otherwise specified, we report bootstrapped 95% confidence intervals for behavioral regression estimates. To prevent any bias in the regression estimates emerging from collinearity between predictors and to aid interpretation, all predictors for these regressions were mean-centered and standardized prior to analysis. The Satterthwaite approximation was used to estimate p-values for mixed effects models (Satterthwaite, 1946; Luke, 2017).
Decision vector representation
To concisely capture the change-point-driven response in the relationship between the boundary height and the drift rate over time, we represented the relationship between these two decision variables in vector space. Trial-by-trial estimates of drift rate and boundary height were calculated from the winning HDDM regression equation and z-scored. Then the difference between each sequential set of $(a,v)$ coordinates was calculated to produce a vector length. The arctangent between these subtracted values was computed to yield an angle in radians between sequential decision vectors (Figure 7B).
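The vector representation can be sketched as follows, assuming trial-by-trial estimates of boundary height and drift rate are available as two equal-length lists. Placing boundary height on the horizontal axis and drift rate on the vertical axis when taking the arctangent is an assumption based on the $(a,v)$ ordering of the coordinates; the function names are hypothetical.

```python
import math

def zscore(values):
    """Standardize a sequence using the population standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / sd for x in values]

def decision_vectors(boundary, drift):
    """Return (length, angle in radians) between sequential (a, v) points.

    ASSUMPTION: the angle is measured from the boundary height axis.
    """
    za, zv = zscore(boundary), zscore(drift)
    vectors = []
    for i in range(1, len(za)):
        da, dv = za[i] - za[i - 1], zv[i] - zv[i - 1]
        vectors.append((math.hypot(da, dv), math.atan2(dv, da)))
    return vectors

# Toy trajectory in which boundary height and drift rate rise together.
toy = decision_vectors([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In the toy trajectory both parameters increase by the same standardized amount on each step, so every angle equals $\pi /4$.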
For Experiment 1, these computations were performed from the trial prior to the onset of the change point to eight trials after the change point. The initial window of nine trials was selected to maximize the overlap of stable data between high and low volatility conditions (see Supp. Fig. References). This resulted in a sequence of angles formed between trials –1 and 0 ($\mathrm{\Delta}{t}_{1}$ yielding ${\theta}_{1}$), 0 and 1 ($\mathrm{\Delta}{t}_{2}$ yielding ${\theta}_{2}$), and so on. To observe the timescale of these dynamics, a circular regression (Mulder and Klugkist, 2017) was performed to determine how $\theta $ changed as a function of the number of trials after the change point:
To quantitatively assess the number of trials needed for $\theta $ to stabilize, we calculated the probability that the posterior distributions of the regression estimates (Appendix 1—figure 6) for sequential pairs of trials had equal means (${\theta}_{{\mathrm{\Delta}}_{t}}={\theta}_{{\mathrm{\Delta}}_{t+1}}$). This result (Appendix 1—figure 7) provided an out-of-set constraint on the timescale of the decision response to consider for analogous analyses in Experiment 2.
Experiment 2 used the stability convergence analysis from Experiment 1 to guide the timescale of further circular analyses and, thus, placed a constraint on the complexity of the models proposed (Model proposals and evaluation). Because Experiment 2 took a replication-based approach, a separate model was fit for each participant for all proposed models. We report the mean and 95% CI of the posterior distributions of regression parameter estimates and the mean and standard deviation of estimates across subjects.
The circular regression analyses used Markov chain Monte Carlo (MCMC) methods to sample the posterior distributions of the regression coefficients. For both experiments, 10,000 effective samples were drawn from the posterior distributions of the coefficients for each model (Kruschke and Vanpaemel, 2015). Traces were plotted against MCMC iteration for a visual assessment of equilibrium, the autocorrelation function was calculated to verify independence of MCMC steps, trace distributions were visually evaluated for normality, and point estimates of the mean value were verified to be contained within the 95% credible interval of the posterior distribution for the estimated coefficients.
Pupil data preprocessing
Pupil diameter data were segmented to capture the interval from 500 ms prior to trial onset to the end of the 1500 ms trial, for a total of 2000 ms of data per trial. While the latency in the phasic component of the task-evoked pupillary response ranges from 100 to 200 ms on average (Beatty, 1982), suggesting that our segmentation should end 200 ms after the trial ending, participants tended to blink after the offset of the stimulus and during the inter-trial interval (see Appendix 1—figure 8 for a representative sample of blink timing). Because of this, we ended the analysis window with the offset of the stimulus. Following segmentation, pupil diameter samples marked as blinks by the Eyelink 1000 default blink detection algorithm and zero or negative-valued samples were replaced by linearly interpolating between adjacent valid samples. Pupil diameter samples with values exceeding three standard deviations of the mean value for that session were likewise removed and interpolated. Interpolated data were band-pass filtered using a 0.01–5 Hz second-order Butterworth filter. Median pupil diameter calculated over the 500 ms prior to the onset of the stimulus was subtracted from the trial data. Finally, processed data were z-scored by session.
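The interpolation and baseline-correction steps can be sketched for a single trial segment as follows. This is an illustrative sketch, not the preprocessing code used in the study: the band-pass filter and the session-wise z-scoring are deliberately omitted, blink flags are assumed to arrive as a list of booleans, and all function names are hypothetical.

```python
import statistics
from bisect import bisect_left

def interpolate_invalid(samples, invalid):
    """Linearly interpolate flagged samples between adjacent valid samples.

    Flagged samples at either edge take the nearest valid value.
    """
    valid_idx = [i for i, bad in enumerate(invalid) if not bad]
    if not valid_idx:
        raise ValueError("no valid samples to interpolate from")
    out = list(samples)
    for i, bad in enumerate(invalid):
        if not bad:
            continue
        pos = bisect_left(valid_idx, i)
        left = valid_idx[pos - 1] if pos > 0 else None
        right = valid_idx[pos] if pos < len(valid_idx) else None
        if left is None:
            out[i] = samples[right]
        elif right is None:
            out[i] = samples[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = samples[left] + weight * (samples[right] - samples[left])
    return out

def preprocess_trial(samples, blink_flags, session_mean, session_sd,
                     baseline_n=500):
    """Interpolate blinks, non-positive values, and outliers; subtract baseline.

    baseline_n = 500 corresponds to 500 ms of pre-stimulus data at 1000 Hz.
    The filtering and z-scoring steps described in the text are omitted here.
    """
    invalid = [blink or x <= 0 or abs(x - session_mean) > 3 * session_sd
               for x, blink in zip(samples, blink_flags)]
    clean = interpolate_invalid(samples, invalid)
    baseline = statistics.median(clean[:baseline_n])
    return [x - baseline for x in clean]
```

In practice the band-pass filter would be applied between the interpolation and the baseline subtraction, matching the order of steps given in the text.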
For each trial interval, we characterized the evoked response as the mean of the pupil data over that interval, the latency to peak onset and offset, the latency to peak amplitude, the peak amplitude, and the area under the curve of the phasic pupillary response (Figure 9A). We then submitted these metrics to principal component analysis to reduce their dimensionality while capturing maximum variance. Evoked response characterization and principal component analysis were conducted for each session and for each subject.
The 95% CI for the number of principal components needed to explain 95% of the variance in the data was calculated over subjects and sessions to determine the number of principal components to keep for further analysis.
To aid in interpreting further analysis using the selected principal components, the feature importance of each pupil metric was calculated for each principal component and aggregated across subjects as a mean and bootstrapped 95% CI (Figure 9B).
Note that we also conducted a similar analysis using more conventional methods to assess the task-evoked pupillary response and observed another null effect. Specifically, if we take the derivative of the evoked pupillary response with respect to time (Reimer et al., 2016) and then characterize the pupillary response with the above metrics and conduct principal component analysis, we again see no evidence for a relationship between the pupillary response and the decision trajectory. Additionally, we observe no relationship between our experimental manipulations of conflict and volatility and these metrics, or a change-point-evoked shift in pre-stimulus pupillary response (Gilzenrat et al., 2010, Appendix 1—figure 13). As such, we caution the reader to view our pupillary results in light of this lack of replication of pre-established exploration-driven pupillary responses.
Model proposals and evaluation
To assess the hypothesized influences on $\theta $ in Experiment 2, we began our model set proposal with a null hypothesis. Our null model estimates decision dynamics as a function of the intercept, or the average of $\theta $:
Next, we estimated decision dynamics solely as a function of time relative to a change point, with the timescale of consideration determined by the results of the stability convergence analysis from Experiment 1. We call this the evoked response model:
We first evaluated whether the posterior probability of the evoked response model given the data was greater than the posterior probability for the absolute null model. If the lower bound of the 95% CI of the posterior probability for the time-null model exceeded the upper bound of the 95% CI for the absolute null model (i.e., the posterior probability was greater for the evoked response model and the CIs were non-overlapping), we proceeded to evaluate the evidence for alternative models relative to this evoked response model. We evaluated the statistical reliability of the posterior probabilities using a bootstrapped 95% CI computed over subjects.
We considered an explicit set of hypotheses regarding the effect of the change-point-evoked pupillary response on boundary height and drift rate dynamics (see Figure 10 for the full set of models considered). The first two principal components of the set of pupil metrics, which we term the timing and magnitude components, respectively, were included in this model set to evaluate the effect of the timing and magnitude of noradrenergic dynamics on the change-point-evoked decision manifold. Under the assumption of a neuromodulatory effect on decision dynamics, these principal components were shifted forward by one trial to match the expected timing of the response to neuromodulation.
To determine whether perturbations of volatility and conflict affected changepointevoked decision dynamics, we estimated the evoked decision dynamics as a function of $\lambda $ and $p$, where $\lambda $ corresponds to the average length of an epoch and $p$ corresponds to the mean probability of reward for optimal target selection (see Table 1 for the full set of models considered).
We used Bayes Factors to quantify the ratio of evidence for competing hypotheses (Wagenmakers, 2007). To estimate whether these models accounted for decision dynamics beyond the effect of time relative to a change point alone, we calculate the Bayes Factor for the evoked response model relative to each candidate model ($B{F}_{01}$). Finally, we calculate the posterior probability of the null model given the full set of alternative models (Wagenmakers, 2007). Note that this approach assumes that each model has equal a priori plausibility.
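Under the stated assumption of equal prior plausibility, the posterior probability of the null model follows directly from the Bayes Factors. The sketch below illustrates that arithmetic only; it assumes each $B{F}_{01}$ is expressed as the evidence for the null (evoked response) model divided by the evidence for one candidate model, and the function name is hypothetical.

```python
def posterior_null_probability(bf01_values):
    """Posterior probability of the null model given equal model priors.

    Each entry is BF01 = p(data | null) / p(data | candidate). With equal
    priors, P(null | data) = 1 / (1 + sum of 1 / BF01 over candidates).
    """
    return 1.0 / (1.0 + sum(1.0 / bf for bf in bf01_values))

# One candidate that is three times less supported than the null.
single = posterior_null_probability([3.0])       # 0.75
# Two candidates that fit exactly as well as the null.
tied = posterior_null_probability([1.0, 1.0])    # one third
```

A value near one indicates that none of the candidate models accounts for the data better than time relative to the change point alone.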
Bayes Factor visualizations represent the mean and bootstrapped 95% CI with 1000 bootstrap iterations.
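A percentile bootstrap of the mean with 1000 iterations can be sketched as follows. The text does not specify the bootstrap variant, so the percentile method, resampling of individual values with replacement, and the function name are assumptions made for illustration.

```python
import random

def bootstrap_mean_ci(values, n_boot=1000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean.

    ASSUMPTION: values are resampled with replacement and the CI bounds are
    read off the sorted bootstrap means.
    """
    rng = random.Random(seed)
    n = len(values)
    boot_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_boot)
    )
    lower = boot_means[int((1.0 - level) / 2.0 * n_boot)]
    upper = boot_means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return sum(values) / n, lower, upper

# Hypothetical Bayes Factors, one per subject.
mean_bf, ci_low, ci_high = bootstrap_mean_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

With only a handful of subjects, as in a replication-based design, the resulting interval is necessarily coarse and is best read as a descriptive summary.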
Appendix 1
Appendix 2
Appendix 3
Data availability
Behavioral data and their computational derivatives are available at https://github.com/kmbond/dynamic_decision_policy_reconfiguration (copy archived at swh:1:rev:0486705db0f004a5e1365759f5f5a391790771f8). Code used to generate figures can be found here. Raw pupillometry data (DOI: 10.1184/R1/13543133), the features of the task-evoked pupillometry response (DOI: 10.1184/R1/13543067.v1), and the principal components calculated from those features (DOI: 10.1184/R1/13543160.v1) are available here.

KiltHub. Principal components of task-evoked pupillary response. https://doi.org/10.1184/R1/13543160.v1
References

A primer on foraging and the explore/exploit tradeoff for psychiatry researchNeuropsychopharmacology: Official Publication of the American College of Neuropsychopharmacology 42:1931–1939.https://doi.org/10.1038/npp.2017.108

The diffusion model visualizer: an interactive tool to understand the diffusion model parametersPsychological Research 84:1157–1165.https://doi.org/10.1007/s0042601811126

Role of locus coeruleus in attention and behavioral flexibilityBiological Psychiatry 46:1309–1320.https://doi.org/10.1016/s00063223(99)001407

An integrative theory of locus coeruleusnorepinephrine function: adaptive gain and optimal performanceAnnual Review of Neuroscience 28:403–450.https://doi.org/10.1146/annurev.neuro.28.061604.135709

A competitive model for striatal action selectionBrain Research 1713:70–79.https://doi.org/10.1016/j.brainres.2018.10.009

Learning the value of information in an uncertain worldNature Neuroscience 10:1214–1221.https://doi.org/10.1038/nn1954

Different varieties of uncertainty in human decisionmakingFrontiers in Neuroscience 6:85.https://doi.org/10.3389/fnins.2012.00085

Optimal decisionmaking theories: linking neurobiology with behaviourTrends in Cognitive Sciences 11:118–125.https://doi.org/10.1016/j.tics.2006.12.006

The neural basis of the speedaccuracy tradeoffTrends in Neurosciences 33:10–16.https://doi.org/10.1016/j.tins.2009.09.002

Network reset: a simplified overarching theory of locus coeruleus noradrenaline functionTrends in Neurosciences 28:574–582.https://doi.org/10.1016/j.tins.2005.09.002

BookPractical use of the informationtheoretic approachIn: Burnham KP, editors. Model Selection and Inference. New York, NY: Springer. pp. 75–117.https://doi.org/10.1007/9781475729177

Time of day differences in neural reward functioning in healthy young menThe Journal of Neuroscience 37:8895–8900.https://doi.org/10.1523/JNEUROSCI.091817.2017

A probabilistic, distributed, recursive mechanism for decisionmaking in the brainPLOS Computational Biology 14:e1006033.https://doi.org/10.1371/journal.pcbi.1006033

Eye tracking and pupillometry are indicators of dissociable latent decision processesJournal of Experimental Psychology. General 143:1476–1488.https://doi.org/10.1037/a0035813

Rewarddriven changes in striatal pathway competition shape evidence evaluation in decisionmakingPLOS Computational Biology 15:e1006998.https://doi.org/10.1371/journal.pcbi.1006998

Open Science Framework (OSF)Journal of the Medical Library Association 105:88.https://doi.org/10.5195/JMLA.2017.88

Believing in dopamineNature Reviews. Neuroscience 20:703–714.https://doi.org/10.1038/s4158301902207

Pupil diameter tracks changes in control state predicted by the adaptive gain theory of locus coeruleus functionCognitive, Affective & Behavioral Neuroscience 10:252–269.https://doi.org/10.3758/CABN.10.2.252

Pupil diameter predicts changes in the explorationexploitation tradeoff: evidence for the adaptive gain theoryJournal of Cognitive Neuroscience 23:1587–1596.https://doi.org/10.1162/jocn.2010.21548

Dopamine: generalization and bonusesNeural Networks 15:549–559.https://doi.org/10.1016/s08936080(02)000485

Regulation of evidence accumulation by pupillinked arousal processesNature Human Behaviour 3:636–645.https://doi.org/10.1038/s4156201905514

BookBayesian estimation in hierarchical modelsIn: Busemeyer JR, Wang Z, Townsend JT, Eidels A, editors. The Oxford Handbook of Computational and Mathematical Psychology. Oxford University Press. pp. 279–299.https://doi.org/10.1093/oxfordhb/9780199957996.013.13

BookIncentive compatibilityIn: Eatwell J, Milgate M, Newman P, editors. Allocation, Information and Markets. Springer. pp. 1–2.https://doi.org/10.1007/9781349202157

Evaluating significance in linear mixedeffects models in RBehavior Research Methods 49:1494–1502.https://doi.org/10.3758/s134280160809y

An exploration-exploitation model based on norepinepherine and dopamine activity. Conference: Advances in Neural Information Processing Systems. pp. 867–874.

Learning Reward Uncertainty in the Basal Ganglia. PLOS Computational Biology 12:e1005062. https://doi.org/10.1371/journal.pcbi.1005062

Bayesian estimation and hypothesis tests for a circular Generalized Linear Model. Journal of Mathematical Psychology 80:4–14. https://doi.org/10.1016/j.jmp.2017.07.001

Pupil-linked arousal determines variability in perceptual decision making. PLOS Computational Biology 10:e1003854. https://doi.org/10.1371/journal.pcbi.1003854

An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment. The Journal of Neuroscience 30:12366–12378. https://doi.org/10.1523/JNEUROSCI.082210.2010

Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience 15:1040–1046. https://doi.org/10.1038/nn.3130

Making predictions in a changing world: inference, uncertainty, and learning. Frontiers in Neuroscience 7:105. https://doi.org/10.3389/fnins.2013.00105

Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLOS Computational Biology 7:e1001048. https://doi.org/10.1371/journal.pcbi.1001048

The drift diffusion model as the choice rule in reinforcement learning. Psychonomic Bulletin & Review 24:1234–1251. https://doi.org/10.3758/s134230161199y

A theory of memory retrieval. Psychological Review 85:59–108. https://doi.org/10.1037/0033295X.85.2.59

The credit assignment problem in cortico-basal ganglia-thalamic networks: A review, a problem and a possible solution. The European Journal of Neuroscience 53:2234–2253. https://doi.org/10.1111/ejn.14745

Neuronal activity in monkey ventral striatum related to the expectation of reward. The Journal of Neuroscience 12:4595–4610. https://doi.org/10.1523/JNEUROSCI.121204595.1992

Pupillometry. Wiley Interdisciplinary Reviews. Cognitive Science 5:679–692. https://doi.org/10.1002/wcs.1323

Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society 64:583–639. https://doi.org/10.1111/14679868.00353

Action of dopamine on the human iris. British Medical Journal 4:333–335. https://doi.org/10.1136/bmj.4.5679.333

Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks 9:1054. https://doi.org/10.1109/TNN.1998.712192

Optimal policy for multi-alternative decisions. Nature Neuroscience 22:1503–1511. https://doi.org/10.1038/s4159301904539

Choice History Biases Subsequent Evidence Accumulation. Conference: 2018 Conference on Cognitive Computational Neuroscience. https://doi.org/10.32470/CCN.2018.11920

Corticostriatal synaptic weight evolution in a two-alternative forced choice task: a computational study. Communications in Nonlinear Science and Numerical Simulation 82:105048. https://doi.org/10.1016/j.cnsns.2019.105048

A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review 14:779–804. https://doi.org/10.3758/bf03194105

Role of the indirect pathway of the basal ganglia in perceptual decision making. The Journal of Neuroscience 35:4052–4064. https://doi.org/10.1523/JNEUROSCI.361114.2015

HDDM: Hierarchical Bayesian estimation of the Drift-Diffusion Model in Python. Frontiers in Neuroinformatics 7:14. https://doi.org/10.3389/fninf.2013.00014

Bayesian online learning of the hazard rate in change-point problems. Neural Computation 22:2452–2476. https://doi.org/10.1162/NECO_a_00007

Inferring relevance in a changing world. Frontiers in Human Neuroscience 5:189. https://doi.org/10.3389/fnhum.2011.00189

Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology. General 143:2074–2081. https://doi.org/10.1037/a0038199

Balancing exploration and exploitation with information and randomization. Current Opinion in Behavioral Sciences 38:49–56. https://doi.org/10.1016/j.cobeha.2020.10.001

The relation of strength of stimulus to rapidity of habit-formation. Journal of Comparative Neurology and Psychology 18:459–482. https://doi.org/10.1002/cne.920180503
Decision letter

Redmond G O'Connell, Reviewing Editor; Trinity College Dublin, Ireland

Floris P de Lange, Senior Editor; Radboud University, Netherlands

Niels A Kloosterman, Reviewer; Max Planck Institute for Human Development, Germany
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Decision letter after peer review:
Thank you for submitting your article "Dynamic decision policy reconfiguration under outcome uncertainty" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Floris de Lange as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Redmond O'Connell (Reviewer #1), Niels A Kloosterman (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) The authors should rewrite sections of the manuscript to more clearly articulate for the reader the core theoretical questions that this study sets out to address. Beyond linking elements of choice uncertainty to parameters of an influential decision model, what were the study goals and why are they important/interesting? Similarly, the authors should clarify the broader implications of their findings for current theory and future research. In addition, the Results section is very long and at times it is hard to follow the rationale for each analysis step or how it relates to the study goals. The authors might consider moving certain details that are not essential to the Methods or Supplemental Materials.
2) The reviewers agreed that the DDM analyses seem overly restricted, being limited to only two parameters when there are other parameters of this model that can plausibly mediate the observed uncertainty effects. The authors make strong claims about the specific nature of the decision policy adjustments observed here but this interpretation would be greatly strengthened if the authors examined model variants that leave other parameters free (e.g. non-decision time, drift bias). Please also provide some illustration of the fits to individual data.
3) Several concerns were raised regarding the pupillometry analyses.
a) First, there is a concern that any underlying relationships with choice strategy may have been obscured if the task stimuli evoked strong pupil light reflexes (constrictions) within the measurement window. This cannot be properly assessed without the authors providing plots of the average pupil diameter time courses along with details regarding the stimuli (e.g. brightness).
b) Second, the authors appear to overlook several relevant metrics. This includes unbaselined pre-stimulus pupil diameter, which has been linked to variations in choice performance in a number of investigations and also to exploration/exploitation switches in at least one report (Jepma and Nieuwenhuis 2011). Previous work has also shown that the LC-NE system is in fact better tracked by using the first time derivative of the pupil signal (Reimer et al., Nat Comm, 2016). The authors should consider looking at the pupil derivative to see if this reveals a link to their experimental manipulations. Importantly, using the derivative instead of the actual pupil time series attenuates the pupil light reflex since only the slope is taken.
c) Third, the authors should check whether they can replicate the relationships between uncertainty and pupil diameter that have been reported in the previous literature. This would go a long way toward confirming the validity both of their pupil data and the null result in the relationship with exploration/exploitation.
The authors should also take note of the individual comments of the reviewers provided below as many additional suggestions are provided that should help further improve the manuscript.
Reviewer #1:
In this paper the authors seek to uncover the decision policy adjustments that underpin participants' choices on a two-armed bandit task in which the relative reward and probability of reward associated with each alternative varied unpredictably over time. In an extensive modelling analysis the authors examined whether decision dynamics on this task could be understood in terms of adjustments to the bound and/or drift rate parameters of the drift diffusion model. The results indicate that when participants detect a change in which of two choice alternatives is most rewarding they switch from an exploitative to an exploratory decision strategy by lowering both the bound and drift rate of the decision process. My sense is that these findings are supported by extensive analyses reported in the paper.
1. I note that the authors allowed only one decision model parameter to map on to volatility and conflict. It would be interesting to know whether or not allowing either bound or drift rate to account for both volatility and conflict would improve the model fits.
2. The paper also reports a failure to detect any relationship between pupillary responses (here used as a proxy for noradrenergic arousal) during decision making and the aforementioned decision policy adjustments. Here the authors should cite the work of Joshi et al. (2016, Neuron) which, to my knowledge, was the first peer-reviewed paper to report a relationship between locus coeruleus activity and pupil diameter. This paper is also important to consider because they report a failure to observe the tonic/phasic firing modes originally reported by Aston-Jones and colleagues that form the basis for the present hypotheses.
3. The prior literature suggests that there are important functional distinctions between average absolute (i.e. unbaselined) pupil diameter measured in a given time window and the pupil dilation responses that are elicited during decision making. The authors do not consider the former which has been linked to exploration/exploitation strategies at least once in the prior literature (Jepma and Nieuwenhuis, 2011, J Cog Neuro) and which has been linked to variations in choice performance on several occasions (e.g. Murphy et al. 2011, Psychophysiology; Van Kempen et al. 2019, eLife). The authors then conducted a principal component analysis on 7 different metrics extracted from the stimulus-evoked pupil dilation response.
4. Aside from the desire to reduce dimensionality a clear rationale for this approach is not provided. This limits the ability to compare the present results to those in the related literature since most previous studies have investigated relationships between choice behaviour and metrics like pupil dilation amplitude, peak latency and onset latency individually.
5. The manuscript would benefit from more clearly articulating the theoretical advances/insights that it provides. It could be argued that the modelling work results in a situation where one set of cognitive constructs (change point detection and value estimation) are swapped for another set (bound and drift rate) but it is not clear what new understanding is gained from this.
Reviewer #2:
In the present study the authors investigated decision-policy adjustments as a function of two distinct forms of uncertainty, conflict and volatility. They extend previous studies on explore-exploit dynamics by investigating specifically how choice parameters in an evidence accumulation framework vary with uncertainty.
To that aim they experimentally manipulated conflict and volatility in a two-armed bandit task and combined Bayesian models of learning with evidence accumulation models of decision-making. Then they quantify the degree to which the relationship between estimated choice parameters – indexing the choice policy – varies as a function of the context in which the choice is made. They further recorded pupil diameters to index LC-NE activity and test whether policy adjustments are related to this activity.
Choice conflict modulated the rate of evidence accumulation and change points – while also increasing conflict – decreased the boundary height, leading to fast exploratory choices. The authors show that choice dynamics are linked to uncertainty dynamics by driving systematic changes in the decision-policy characterized by the combination of drift rate and threshold. Fast increases in uncertainty following change points drove fast exploratory choices (low drift rate, low threshold) that gradually recovered to exploitative choices as the new reward contingencies were learned (high drift rate, high threshold).
They further find no evidence that these changes relate to fluctuations in the LC-NE system as indexed with pupil diameter.
This work is methodologically very impressive and theoretically very interesting. It expands the space of decision-policies beyond the explore-exploit dichotomy and characterizes elegantly how decision-policies should dynamically vary along a two-dimensional policy space as the environment is found to change.
Overall, I really liked the study. I do however have some questions and suggestions for additional analyses and edits.
Questions:
1. Typically more similar options/conflict leads to longer RTs. In Exp 1 the authors find no effect and in Exp 2 the opposite. That's usually a pretty robust effect. Any idea why that's not the case here?
2. Also how does this square with the positive effect of ΔB on drift rate? I think that might be worth picking up and unpacking a bit, if only to show the superiority of model-based analyses to raw behavior given that the behavioral findings are super confusing and the parameter results make perfect sense.
3. Could that be some power issue (re number of subjects, not observations)? Is maybe one subject weird in exp. 2 and can't detect change points so well or track the values?
4. What is meant by the similar time course of belief updating for high and low volatility? (p 5 line 178) Shouldn't people update more under higher volatility? Is that what's captured in the change point probability parameter? Maybe that sentence could be clarified so it doesn't confuse readers familiar with work linking volatility and the α parameter in RL (as in more learning/updating under high volatility).
5. If I understand correctly, when change point probability goes up, ΔB always goes down. What's the correlation between the two and if there is a correlation, what does that mean for the impact of those learning parameters on decision parameters? Can you assess conflict effects independent of change point effects (are there periods where values are stable and super similar)?
Suggestions:
6. Provide a clearer rationale for the model comparisons to test hypotheses about policy adjustments.
I honestly found it a bit difficult to follow the model comparison for theta. I also think that when the pupil is added, the rationale could be explained a bit more clearly to allow the reader to follow. Specifically I got confused about the intercept and time null models (is that just time or time relative to change point if the latter, why is that a null model?).
7. Replicate relationship between uncertainty parameters and pupil measures before linking it to policy adjustments.
As I understand, you find that the adjustments in the decision-policy are unrelated to pupil diameter. As a sanity check, have the authors looked at pupil diameter as a function of the uncertainty parameters? It would be good to show that earlier effects of uncertainty and their temporal dynamics are replicated (have a plot with the betas over time pre-choice and post-feedback). I think the conclusion can be stronger if the authors show that pupil dilation tracks both forms of uncertainty and the anticipation and response dynamics associated with that, but that the subsequent adjustments are not mediated by this system.
Right now something could be wrong with the pupil data and the reader has no way to know. It would also be important just to see that these earlier findings replicate.
8. Reconsider causal language/interpretation of drift rate effects in the discussion.
You say in the discussion that people reduce the drift rate. Isn't the drift rate here determined by the consistency of the evidence? Sure, people can focus more (i.e. in cognitive control tasks, where the response rules are well known and errors are primarily driven by early incorrect activations of prepotent responses or attentional lapses), but I can focus all I want when there's no evidence (when I just don't know what the relative values are because I currently have zero [or little] valid experience to draw from after I detected a change point) and my drift rate will still be low. No? Couldn't it be that participants have a sense that their drift rate is low (because they have no idea what's going on) and because taking time to sample would be useless (because uncertainty is not reducible other than through action), dropping the threshold is the right thing to do? In that sense the (expected) drift rate would dictate the optimal boundary height. I'm thinking of work by Tajima et al.
9. Reconsider reinterpretation of previous findings in the discussion – add nuance where nuance is due.
I have a bit of a problem with the authors’ assertion that previous findings relating boundary height and conflict could be a misattribution of volatility effects (Frank and colleagues). These previous studies did not have change points. So that is an unlikely explanation of that finding. What is more likely is that the choice dynamics were different because the choices were not temporally dependent, i.e. participants made choices between different options on each trial, meaning that the conflict and thus the optimal decision strategy differed on every trial (in addition to any learning-related uncertainty, but importantly, the true values associated with stimuli never changed). That is not the same as a change point/volatility. Further, in the present study, conflict is anticipated, except in the case of change points. So that could equally be the difference between expected and unexpected uncertainty that leads to dissociable effects on decision strategies. In both cases, what drives the threshold adjustment is probably some form of surprise (unexpected conflict). As it stands, the statement in the discussion is inaccurate/misleading. That’s an easy fix though.
Reviewer #3:
Shifting between more explorative and more exploitative modes of decision making is crucial for adaptive human behavior. Therefore, the authors' attempt to investigate the internal processes that allow these modes is important to begin to understand this remarkable ability. In addition, investigating the proposed link to the LC-NE system is sensible and establishing its role in these processes would help the field forward. The authors present a thorough, modelling-heavy set of analyses on two interesting datasets aimed at revealing the underlying mechanisms.
1. Despite these strong points, the manuscript in its current version falls somewhat short of answering the questions that it poses. For one, the DDM analyses are restricted to only two parameters, which begs the question whether other established parameters might be better able to explain the results and thereby shed more light on the underlying mechanisms. Also, regarding the role of the pupil-linked LC-NE system, no strong conclusions can be drawn from the data, since the visual stimulus design likely resulted in strong pupil light reflexes, which might well have overshadowed subtler, more interesting modulations of the pupil. Despite the manuscript's innovative and clever use of Bayesian modelling and PCA, these two shortcomings might limit the impact of the manuscript in its current form on the field.
Shifting between more explorative and more exploitative modes of decision making is crucial for adaptive human behavior. Therefore, the authors' attempt to investigate the internal processes that allow these modes is important to begin to understand this remarkable faculty. In addition, investigating the proposed link to the LC-NE system is sensible and establishing its role in these processes would help the field forward. However, although in general the presented analyses seem thorough, I have two main concerns that in my opinion should be addressed before conclusions can be drawn from the data.
First, the DDM modelling is too restrictive, only focusing on the bound and drift parameters. Besides these two main parameters, another main parameter of the standard DDM is non-decision time, which to my surprise is not mentioned at all in the manuscript. Moreover, recent work has shown that two further parameters can capture internal processes possibly related to explore/exploit policies: starting point (z) and drift bias (called drift criterion or dc by Ratcliff and McKoon (2008)). Including these latter two parameters possibly can explain the RTs better than drift only and shed more light on the components underlying conflict and volatility. In addition, non-decision time might also be affected by the experimental manipulations, and should at least be reported in the manuscript (I assume that the authors did include it in their currently reported DDMs). In my mind, investigating all these further parameters is crucial before the conclusion that bound and drift rate best capture conflict and volatility is warranted.
My second point concerns the pupil analysis.
a) Although I could not find information about visual stimulus size and brightness in the methods, Figure 2A-B suggests that there were strong visual transients at trial onset (black screen → stimulus), which presumably resulted in strong pupil constrictions due to the pupil light reflex (PLR).
b) I would have liked to see pupil time courses in this manuscript. The first components of the PCA, as employed by the authors (which in principle I think is a great idea), are likely to capture exactly these PLR dynamics given the large variance due to PLR.
c) Now, previous work has shown that the LC-NE system is in fact better tracked by using the first time derivative of the pupil signal (Reimer et al., Nat Comm, 2016). The authors should consider looking at the pupil derivative to see if this reveals a link to their experimental manipulations. Importantly, using the derivative instead of the actual pupil time series attenuates the PLR since only the slope is taken. Hence, when using the derivative, the PCA might pick up more interesting, cognitive drivers of pupil dynamics, since the PLR dynamics are suppressed. It would be interesting to see if this would reveal a link to the experimental manipulations.
d) Further, please note that the pupil is likely linked not only to the LC-NE system, but generally to catecholamines, which includes dopamine (Joshi et al. Neuron 2015). Therefore, I would recommend not linking the pupil exclusively to LC-NE in the manuscript while interpreting the pupil results.
e) In any case, the authors should show raw pupil as well as pupil derivative time courses for the different conditions to give insight into their data.
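The point about the derivative can be made concrete with a toy calculation. The following Python sketch is an illustration only and is not the analysis used in the study: the function name `first_derivative`, the synthetic ramp, the 3.0-unit offset, and the 1000 Hz sampling rate are all assumed values. It shows only that a constant offset in pupil size leaves the derivative unchanged, because the derivative depends on the slope alone.

```python
def first_derivative(trace, fs):
    """Finite-difference derivative of a trace sampled at fs Hz.

    Uses central differences in the interior and one-sided
    differences at the two ends. Output is in trace units per second.
    """
    n = len(trace)
    if n < 2:
        raise ValueError("need at least two samples")
    deriv = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        deriv.append((trace[hi] - trace[lo]) * fs / (hi - lo))
    return deriv


fs = 1000.0  # assumed sampling rate in Hz (hypothetical value)

# Hypothetical synthetic trace: pupil size rising at 0.5 units per second.
ramp = [0.5 * (i / fs) for i in range(100)]

# The same trace with a constant offset added (e.g. a larger baseline).
shifted = [x + 3.0 for x in ramp]

d_ramp = first_derivative(ramp, fs)
d_shifted = first_derivative(shifted, fs)
```

Both derivative traces recover the same slope, so any analysis run on `d_ramp` or `d_shifted` is insensitive to the baseline difference between the two raw traces.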
https://doi.org/10.7554/eLife.65540.sa1
Author response
Essential revisions:
1) The authors should rewrite sections of the manuscript to more clearly articulate for the reader the core theoretical questions that this study sets out to address. Beyond linking elements of choice uncertainty to parameters of an influential decision model, what were the study goals and why are they important/interesting? Similarly, the authors should clarify the broader implications of their findings for current theory and future research. In addition, the Results section is very long and at times it is hard to follow the rationale for each analysis step or how it relates to the study goals. The authors might consider moving certain details that are not essential to the Methods or Supplemental Materials.
We have made substantial changes to the front end of the manuscript (Abstract and Introduction) to more explicitly clarify the core theoretical questions that our study set out to address. We have also edited the Results and Discussion in targeted ways to better align the framing of our findings to the context of our theoretical questions. We hope that this better articulates the goals of our work.
2) The reviewers agreed that the DDM analyses seem overly restricted, being limited to only two parameters when there are other parameters of this model that can plausibly mediate the observed uncertainty effects. The authors make strong claims about the specific nature of the decision policy adjustments observed here but this interpretation would be greatly strengthened if the authors examined model variants that leave other parameters free (e.g. non-decision time, drift bias). Please also provide some illustration of the fits to individual data.
The reviewers raise crucial points regarding the restricted set of model comparisons. Our original focus on these parameters was driven by prior work showing adaptation of the drift rate and boundary height terms under uncertainty, including findings linking cortico-basal ganglia dynamics with these drift-diffusion parameters under uncertain conditions (Dunovan et al. 2019; Dunovan and Verstynen 2019; Rubin et al. 2021). As we detail below, we have now expanded the set of models considered to include other plausible decision parameters (non-decision time (tr), drift criterion / drift bias (dc), and the starting point (z), in addition to the drift rate (v) and boundary height (a)).
These more rigorous analyses have resulted in an update to the change-point-evoked effect on boundary height (a). Instead of decreasing in response to a suspected change in reward contingencies, the boundary height a increases in response to a suspected change. Ironically, this is consistent with our preregistered hypothesis as well as with prior experimental results in our lab (and others). We have updated both our key figures and the text of the Results and Discussion to reflect this change. (See response below).
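As a rough illustration of how these five parameters enter the model, the sketch below simulates single trials of a drift diffusion process with a simple Euler scheme. It is not the hierarchical fitting procedure used in the study. The conventions (lower bound at 0, upper bound at a, starting point z expressed as a fraction of a, unit noise), the function names `simulate_ddm_trial` and `summarize`, and every parameter value are assumptions chosen for illustration.

```python
import random


def simulate_ddm_trial(v, a, z=0.5, tr=0.3, dc=0.0,
                       dt=0.001, max_t=10.0, rng=None):
    """Simulate one trial; return (choice, reaction_time).

    v  : drift rate            a  : boundary height
    z  : starting point as a fraction of a (0.5 = unbiased)
    tr : non-decision time     dc : drift criterion (added to v)
    choice is 1 (upper bound), 0 (lower bound), or None on timeout.
    """
    rng = rng or random.Random()
    x = z * a                 # accumulated evidence
    drift = v + dc
    sqrt_dt = dt ** 0.5
    t = 0.0
    while 0.0 < x < a and t < max_t:
        # Euler step: deterministic drift plus unit-variance diffusion noise.
        x += drift * dt + rng.gauss(0.0, 1.0) * sqrt_dt
        t += dt
    if x >= a:
        choice = 1
    elif x <= 0.0:
        choice = 0
    else:
        choice = None         # no bound reached within max_t
    return choice, tr + t


def summarize(v, a, n_trials=200, seed=0, **kwargs):
    """Return (mean reaction time, proportion of upper-bound choices)."""
    rng = random.Random(seed)
    trials = [simulate_ddm_trial(v, a, rng=rng, **kwargs)
              for _ in range(n_trials)]
    mean_rt = sum(rt for _, rt in trials) / n_trials
    p_upper = sum(1 for c, _ in trials if c == 1) / n_trials
    return mean_rt, p_upper


# Hypothetical parameter settings, not estimates from the study.
mean_rt_fast, p_upper_fast = summarize(v=2.0, a=1.0)   # high drift, low bound
mean_rt_slow, p_upper_slow = summarize(v=0.2, a=2.0)   # low drift, high bound
```

Under these assumed values, the low-drift, high-boundary setting yields slower responses on average than the high-drift, low-boundary setting, which is the qualitative signature of a slow decision state.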
3) Several concerns were raised regarding the pupillometry analyses.
a) First, there is a concern that any underlying relationships with choice strategy may have been obscured if the task stimuli evoked strong pupil light reflexes (constrictions) within the measurement window. This cannot be properly assessed without the authors providing plots of the average pupil diameter time courses along with details regarding the stimuli (e.g. brightness).
We were careful to control the effect of light on the pupillary response during data collection. This included the construction of a specific rig around the testing computer so as to reduce ambient reflection of light from walls and ceiling.
To control luminance we also used a Derrington-Krauskopf-Lennie (DKL) color space that allows for direct luminance control. As a result, the stimulus presentation display was rendered isoluminant throughout the task. In addition, the lead author built a booth to isolate the participant from ambient sources of light during data collection.
We now mention these details in the Methods:
“Throughout the task, the head-stabilized diameter and gaze position of the left pupil were measured with an Eyelink 1000 desktop mount at 1000 Hz. […] During the reward-learning task, we used this method to isolate the task-evoked pupillary response.”
Additionally, we include a reminder as part of the caption for Figure 2B to aid the reader in interpreting the results as they progress through the paper:
“(B) In Experiment 2, participants were asked to choose between one of two Greebles (one male, one female). The total number of points earned was displayed at the center of the screen. The stimulus display was rendered isoluminant throughout the task.”
b) Second, the authors appear to overlook several relevant metrics. This includes unbaselined pre-stimulus pupil diameter, which has been linked to variations in choice performance in a number of investigations and also to exploration/exploitation switches in at least one report (Jepma and Nieuwenhuis 2011). Previous work has also shown that the LC-NE system is in fact better tracked by using the first time derivative of the pupil signal (Reimer et al., Nat Comm, 2016). The authors should consider looking at the pupil derivative to see if this reveals a link to their experimental manipulations. Importantly, using the derivative instead of the actual pupil time series attenuates the pupil light reflex since only the slope is taken.
We have reanalyzed the data using unbaselined pre-stimulus pupil diameter and the first time derivative of the pupillary response. We again observe a null result. These analyses are detailed as part of our point-by-point reply to reviewers below. Figures visualizing these time courses have been added to the supplementary section of the manuscript (Supplementary Figures 10 and 11).
c) Third, the authors should check whether they can replicate the relationships between uncertainty and pupil diameter that have been reported in the previous literature. This would go a long way toward confirming the validity both of their pupil data and the null result in the relationship with exploration/exploitation.
We have now conducted analyses to check for the previously reported links between exploratory choice behavior and the pupillary response. However, we observe no clear evidence for these pre-established links. Therefore, we qualify the use of the pupillary data. The Results section now states this lack of replication to titrate the reader’s confidence in the pupillary results and the corresponding inferences relating to LC-NE system influence on the progression through the decision manifold:
“Specifically, if the LC-NE system were sensitive to a change in the optimal choice, then we should observe a moderate spike in phasic activity following a change in action-outcome contingencies. […] We ask the reader to titrate their interpretation of these pupillary data accordingly.”
We include a similar addition to the pupillometry subsection of the Methods:
“Note that we also conducted a similar analysis using more conventional methods to assess the task-evoked pupillary response and observed another null effect. […] As such, we caution the reader to view our pupillary results in light of this lack of replication of pre-established exploration-driven pupillary responses.”
Reviewer #1:
In this paper the authors seek to uncover the decision policy adjustments that underpin participants' choices on a two-armed bandit task in which the relative reward and probability of reward associated with each alternative varied unpredictably over time. In an extensive modelling analysis the authors examined whether decision dynamics on this task could be understood in terms of adjustments to the bound and/or drift rate parameters of the drift diffusion model. The results indicate that when participants detect a change in which of two choice alternatives is most rewarding they switch from an exploitative to an exploratory decision strategy by lowering both the bound and drift rate of the decision process. My sense is that these findings are supported by extensive analyses reported in the paper.
1. I note that the authors allowed only one decision model parameter to map on to volatility and conflict. It would be interesting to know whether or not allowing either bound or drift rate to account for both volatility and conflict would improve the model fits.
The choice to map distinct ideal observer estimates to distinct decision parameters reflects the theoretical motivation underlying the development of our hypotheses tested here. We now expand on this in the introduction to make clear the reasoning for our narrow focus, as below:
“Are the parameters that govern accumulation of evidence for decision making modifiable? […] These policies, in turn, adaptively reconfigure based on current environmental feedback signals by modulating value estimation and the rate of selection errors (Figure 1E).”
A complete, exhaustive sweep of decision parameters as proposed would be computationally inefficient. This would require evaluation of all possible single-parameter, dual-parameter (n-parameter) pairings, with both many-to-one ideal observer to DDM parameter mappings and many-to-one DDM parameter to ideal observer mappings. As model complexity increases in these hierarchical DDM fits, the convergence or stability of the fits is more difficult to achieve. We therefore opted for parsimony in the set of model fits to avoid a data mining expedition that would very likely lead to a set of inconclusive model fits on overly complex models. This sort of restricted set test is quite common when using hierarchical DDM (and, indeed, many hierarchical models in general). Further, estimating pairwise ideal observer to DDM parameter mappings alone keeps model complexity constant, allowing us to make clear comparisons in information loss scores between candidate models.
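The scale of the search space can be sketched with a short count. The enumeration below is an illustration under assumptions, not the authors' actual model set: it takes two ideal observer estimates (labeled here `delta_B` and `change_point_probability`, both hypothetical names) and five DDM parameters, and compares a pairwise scheme, in which each estimate maps onto exactly one parameter, with an exhaustive scheme in which every estimate-to-parameter link may be present or absent.

```python
from itertools import product

# Hypothetical labels for the two ideal observer estimates.
estimates = ["delta_B", "change_point_probability"]

# DDM parameters: boundary height, drift rate, non-decision time,
# starting point, drift criterion.
ddm_params = ["a", "v", "tr", "z", "dc"]

# Pairwise scheme: each estimate is assigned to exactly one DDM parameter,
# so every candidate model carries the same number of links (two).
pairwise_models = list(product(ddm_params, repeat=len(estimates)))
links_per_pairwise_model = {len(model) for model in pairwise_models}

# Exhaustive scheme: each of the estimate-to-parameter links is either
# included or excluded, so model complexity varies across candidates.
n_links = len(estimates) * len(ddm_params)
n_exhaustive_models = 2 ** n_links

print(len(pairwise_models), n_exhaustive_models)
```

With these assumed inputs the pairwise scheme gives 25 candidate models of equal complexity, whereas the exhaustive scheme gives 1024 candidates ranging from zero to ten links each.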
2. The paper also reports a failure to detect any relationship between pupillary responses (here used as a proxy for noradrenergic arousal) during decision making and the aforementioned decision policy adjustments. Here the authors should cite the work of Joshi et al. (2016, Neuron) which, to my knowledge, was the first peer-reviewed paper to report a relationship between locus coeruleus activity and pupil diameter. This paper is also important to consider because they report a failure to observe the tonic/phasic firing modes originally reported by Aston-Jones and colleagues that form the basis for the present hypotheses.
We thank the reviewer for pointing out this omitted reference. Indeed, Joshi et al. 2016 provides clear evidence of a link between locus coeruleus activity and pupil diameter. To our knowledge, this observation extends back to the work of Rajkowski, Kubiak, and Aston-Jones 1994, showing the phasic and tonic modes of the locus coeruleus system in relation to exploratory behavior. All of this prior work is clearly important to consider, and we thank the reviewer for bringing this to our attention.
We now cite both studies in our revised manuscript:
“The LC-NE system is known to modulate exploration states under uncertainty and pupil diameter shows a tight correspondence with LC neuron firing rate (Aston-Jones and Cohen, 2005; Rajkowski et al., 1994), with changes in pupil diameter indexing the explore-exploit decision state (Jepma and Nieuwenhuis, 2011). Similar to the classic Yerkes-Dodson curve relating arousal to performance (Yerkes et al., 1908), performance is optimal when tonic LC activity is moderate and phasic LC activity increases following a goal-related stimulus (Aston-Jones et al. (1999), but see Joshi et al. (2016) for an exception).”
3. The prior literature suggests that there are important functional distinctions between average absolute (i.e. unbaselined) pupil diameter measured in a given time window and the pupil dilation responses that are elicited during decision making. The authors do not consider the former which has been linked to exploration/exploitation strategies at least once in the prior literature (Jepma and Nieuwenhuis, 2011, J Cog Neuro) and which has been linked to variations in choice performance on several occasions (e.g. Murphy et al. 2011, Psychophysiology; Van Kempen et al. 2019, eLife).
Thank you for bringing this important functional distinction between unbaselined pupil diameter and the dilation response to our attention. In our data, if baseline pupil diameter were sensitive to shifts from exploitation to exploration, then we should observe a change in baseline pupil diameter proximal to a change point. However, we do not observe a change-point-evoked shift in baseline pupil diameter in our data, as we might expect given the previous links to exploratory behavior. We now mention our lack of support for these validation analyses in the Results section and visualize the pupillary time courses and results in the Supplementary section (Supp. Figures 10-13).
“Specifically, if the LC-NE system were sensitive to a change in the optimal choice, then we should observe a moderate spike in phasic activity following a change in action-outcome contingencies. Note that we do not observe previously established links between exploratory choice behavior and the pupillary response (Jepma and Nieuwenhuis, 2011; Murphy et al., 2011; van Kempen et al., 2019). We ask the reader to titrate their interpretation of these pupillary data accordingly.”
4. The authors then conducted a principal component analysis on 7 different metrics extracted from the stimulus-evoked pupil dilation response. Aside from the desire to reduce dimensionality a clear rationale for this approach is not provided. This limits the ability to compare the present results to those in the related literature since most previous studies have investigated relationships between choice behaviour and metrics like pupil dilation amplitude, peak latency and onset latency individually.
From a computational perspective, reducing the dimensionality of this set of pupillary response metrics expands the set of models we can fit in a reasonable amount of time without overtaxing computational resources.
Further, our original PCA method was intended to maximize the variability of the pupillary response linked to the decision manifold. This allowed us to capture separable sources of variance relating to timing and amplitude effects without restricting the data to a smaller set of metrics and possibly discarding information (e.g. timing effects may not be constrained to peak latency or onset latency; amplitude effects may not be constrained to peak dilation amplitude).
We have edited the Results section with the motivation for our PCA approach:
“We characterized the evoked pupillary response on each trial using seven metrics: the mean of the pupil data over each trial interval, the latency to the peak onset and offset, the latency to peak amplitude, the peak amplitude, and the area under the curve of the pupillary response. […] Therefore, we submitted these metrics to principal component analysis to reduce their dimensionality while capturing maximum variance.”
We have also reanalyzed these pupillary data using conventional analysis methods and continue to observe a null effect (see points 3C and 7 for Reviewer 3). We have edited the Results section to reflect this:
“Thus, for interpretability, we refer to the first and second principal components as timing and magnitude components, respectively (Figure 9B). Note that we also conduct this analysis using more conventional methods of pupillary analysis and continue to observe a null effect (see the Pupil data preprocessing for details).”
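The dimensionality reduction described in this response can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `pca_scores`, the z-scoring step, and the simulated trial-by-metric matrix are all assumptions made for illustration, and the data are random numbers rather than pupillary measurements.

```python
import numpy as np

def pca_scores(metrics, n_components=2):
    """Project a (trials x metrics) array onto its leading principal components."""
    X = np.asarray(metrics, dtype=float)
    # Z-score each metric so latencies and amplitudes share a common scale
    # (an assumption of this sketch, not a detail stated in the text).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data; rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)          # variance ratio per component
    scores = Z @ Vt[:n_components].T         # trial-wise component scores
    return scores, Vt[:n_components], explained

# Hypothetical data: 500 trials x 7 metrics generated from two latent sources.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 7))
metrics = latent @ loadings + 0.3 * rng.normal(size=(500, 7))

scores, components, explained = pca_scores(metrics)
```

Inspecting the rows of `components` shows which metrics load on each component, which is the kind of inspection that would support labels such as "timing" and "magnitude".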
5. The manuscript would benefit from more clearly articulating the theoretical advances/insights that it provides. It could be argued that the modelling work results in a situation where one set of cognitive constructs (change point detection and value estimation) are swapped for another set (bound and drift rate) but it is not clear what new understanding is gained from this.
This is an excellent point and it reflects our somewhat opaque framing in the Introduction, which we have now fixed (see our response to the first point raised by the Review Editor). Our primary goal was to understand how the evidence accumulation dynamics change when the environment requires reassessing learned state-action values. Rather than think of these accumulation dynamics, driven by drift rate, boundary height, etc., as a static process, we make the case for thinking of them as points on a continuum of possible states (see manifold in Figure 1E). So our primary focus is on the algorithms of information processing. However, if these are dynamic processes, e.g., drift rate fluctuates over time, then there has to be some learning signal that drives these fluctuations. We chose the ideal observer parameters as likely learning signals that drive plasticity in the decision policy state. We acknowledge that these may not be the only signals that drive adjustments in decision policy dynamics, but they are ideal in that they reflect two correlated, but separate estimates of environmental state.
In line with our response to point 1 from Reviewer 1, we now make this clearer in the Introduction:
“Knowing how decision policies shift in the face of dynamic environments requires looking at the algorithmic properties of the policy itself. […] We predicted that, in response to suspected changes in action-outcome contingencies, humans would exhibit a stereotyped adjustment in the drift rate and boundary height that pushes decisions from certain, exploitative states to uncertain, exploratory states and back again (Figure 1E).”
Reviewer #2:
[…] Overall, I really liked the study. I do however have some questions and suggestions for additional analyses and edits.
Questions:
1. Typically more similar options/conflict leads to longer RTs. In Exp 1 the authors find no effect and in Exp 2 the opposite. That's usually a pretty robust effect. Any idea why that's not the case here?
The reviewer is correct: we do observe different effects of conflict on reaction time in Experiments 1 and 2. The effect is absent in Experiment 1 and small enough in Experiment 2 that it can effectively be considered a null finding. One possible reason for attenuated effects of conflict on reaction time is that participants were overtrained (Experiment 1 was 2 hours per participant, Experiment 2 was nine hours per participant). This may result in participants having developed an expectation of change point frequency and/or conflict manipulation. If this were the case, then we might expect to see an effect on reaction times in Experiment 1, where participants undergo four sessions of training, but not in Experiment 2, where participants undergo nine sessions of training each. However, we instead see negligible effects on reaction times in Experiment 2 and no effect of conflict on reaction times in Experiment 1.
A second possibility relates to the complexity of our manipulations. Here we impose a range of conflict levels that also vary with degrees of volatility, meaning that our observed reaction time effects are a mix of responses to different extents of conflict and volatility together. This contrasts with previous reports measuring a more restricted range of conflict and without the influence of volatility. Thus, participants are tracking two sources of uncertainty, which may attenuate the overall observed RT response to conflict alone.
We now acknowledge this discrepancy in the Discussion.
“Previous literature has shown a conflict-induced spike in reaction time (e.g. Jahfari et al. 2019). […] Future research should explore the interaction of change point and conflict estimation on the speed-accuracy tradeoff.”
Either way, this discrepancy between our results and other studies, as well as the lack of internal replication of the change point probability and boundary height association in Experiment 2 of our results, is an interesting avenue of exploratory research that we are currently following up on.
2. Also how does this square with the positive effect of ΔB on drift rate? I think that might be worth picking up and unpacking a bit, if only to show the superiority of model-based analyses to raw behavior given that the behavioral findings are super confusing and the parameter results make perfect sense.
The reviewer is absolutely correct. The lack of simple main effects on RT across experiments obscures meaningful behavioral patterns that can be detected with a model-based approach. We now reference the value of a model-based analysis after reviewing the ambiguous behavioral results:
“At the gross level, across all trials within an experimental condition, increasing the ambiguity of the optimal choice (conflict) and increasing the instability of action outcomes (volatility) decreases the probability of selecting the optimal choice. […] We adopt a more focal, model-based analysis in the next section to clarify these peri-change point dynamics.”
3. Could that be some power issue (re number of subjects, not observations)? Is maybe one subject weird in exp. 2 and can't detect change points so well or track the values?
We see fairly consistent sensitivity to change points across subjects in Experiment 2, with accuracy plummeting and recovering with similar time courses, suggesting that all participants track the value of the optimal choice in a consistent manner (see Supplementary Figures 8 and 9 for evoked response profiles by subject). In addition, we conducted a power analysis prior to data collection (preregistration) and the within-session, within-subject power for Experiment 2 is still high enough to detect our hypothesized effects.
As in our response to the previous comment, we think that the RT effects are masking compensatory changes in two different parameters in the accumulation process, rather than outlier participants or sessions.
That said, it may be the case that the reaction time effects simply require more power to detect than accuracy effects, and both of our experiments fail to detect them for that reason. We plan to conduct a replication experiment to recover the effects we observed using a high-powered experimental design both within and across subjects. We hope that the results of this replication address this question. However, given that this is a tangential focus to our original research question, it is more suitable to address this as a follow-up paper.
4. What is meant by the similar time course of belief updating for high and low volatility? (p 5 line 178) Shouldn't people update more under higher volatility? Is that what's captured in the change point probability parameter? Maybe that sentence could be clarified so it doesn't confuse readers familiar with work linking volatility and the α parameter in RL (as in more learning/updating under high volatility).
This was an awkwardly worded sentence. We apologize. What we meant here is that the rate of change in relative reward value (ΔB) after a change point is qualitatively similar under both low and high volatility conditions. In other words, the slope of the lines for the two volatility manipulations in Figure 4A is approximately the same. However, the change-point-evoked response belies the main effect of volatility on overall estimates of relative reward value, as shown in Figure 4B. Therefore, we removed this sentence to maintain clarity.
5. If I understand correctly, when change point probability goes up, ΔB always goes down. What's the correlation between the two and if there is a correlation, what does that mean for the impact of those learning parameters on decision parameters? Can you assess conflict effects independent of change point effects (are there period where values are stable and super similar)?
These two parameters are indeed correlated. In fact, ΔB is included in the calculation of change point probability and vice versa. But this interdependence is expected: your certainty in value is going to decrease if you live in a chaotic and changing world. The real question is whether they are too correlated to impact our model interpretability. The correlation between change point probability and ΔB is small but reliable, with an increase in belief as change point probability decreases (Spearman’s rho = 0.234 ± 0.029). The Variance Inflation Factor, a collinearity metric, is within an acceptable range for both experiments (Experiment 1: 1.100 ± 0.013; Experiment 2: 1.058 ± 0.017; values greater than 10 cause concern; Chatterjee and Simonoff 2013, pp. 28-29, notebook showing these results). Thus, the degree of correlation between change point probability and ΔB should have a minimal effect on the estimation of the decision parameters. This suggests that we can safely estimate independent effects of volatility and conflict using change point probability and belief.
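The collinearity check described here can be sketched as follows. This is an illustrative reconstruction, not the authors' notebook: the helper names (`spearman_rho`, `vif`) and the simulated predictors are hypothetical, and the rank helper assumes continuous data without ties. With only two predictors, the Variance Inflation Factor reduces to 1 / (1 - r²), so a linear correlation of magnitude 0.234 would give a VIF of roughly 1.06, assuming Spearman's rho is a reasonable stand-in for the linear correlation.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def vif(X):
    """Variance Inflation Factor for each column of X (n_samples x n_predictors)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Hypothetical stand-ins for trial-wise change point probability and belief.
rng = np.random.default_rng(0)
cpp = rng.normal(size=2000)
belief = -0.25 * cpp + rng.normal(size=2000)

rho = spearman_rho(cpp, belief)
vifs = vif(np.column_stack([cpp, belief]))
```

Values of `vifs` near 1 indicate that each predictor is nearly unexplained by the other, which is the condition under which separate effects can be estimated.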
Suggestions:
6. Provide a clearer rationale for the model comparisons to test hypotheses about policy adjustments.
I honestly found it a bit difficult to follow the model comparison for theta. I also think that when the pupil is added, the rationale could be explained a bit more clearly to allow the reader to follow. Specifically I got confused about the intercept and time null models (is that just time or time relative to change point if the latter, why is that a null model?).
We thank the reviewer for bringing this lack of clarity to our attention. The time-null model tested for an impact of time relative to a change point, separate from conditional influences of volatility and conflict and the influence of the pupillary response.
We have renamed these models for clarity, expanded on our selection logic for models specifying an impact on decision policy adjustment, and updated our naming convention in the Methods and Results sections:
“First, we tested the null hypothesis that the decision dynamics was solely a function of the intercept, or the average of the decision dynamics. […] We call this the evoked response model.”
7. Replicate relationship between uncertainty parameters and pupil measures before linking it to policy adjustments.
As I understand, you find that the adjustments in the decision policy are unrelated to pupil diameter. As a sanity check, have the authors looked at pupil diameter as a function of the uncertainty parameters? It would be good to show that earlier effects of uncertainty and their temporal dynamics are replicated (have a plot with the betas over time pre-choice and post feedback). I think the conclusion can be stronger if the authors show that pupil dilation tracks both forms of uncertainty and the anticipation and response dynamics associate with that, but that the subsequent adjustments are not mediated by this system.
Right now something could be wrong with the pupil data and the reader has no way to know. It would also be important just to see that these earlier findings replicate.
The reviewer raises an excellent point. We did look into this relationship and yet failed to observe evidence for a relationship between our uncertainty parameters – belief and change point probability – and the pupillary response, as measured by both the metrics we calculated and the principal components derived from those metrics.
We have now stated this lack of replication in the Results section to caution the reader:
“Specifically, if the LC-NE system were sensitive to a change in the optimal choice, then we should observe a moderate spike in phasic activity following a change in action-outcome contingencies. […] We ask the reader to titrate their interpretation of these pupillary data accordingly and to view the corresponding inferences relating noradrenergic and catecholaminergic systems to decision policy adjustment in this light.”
8. Reconsider causal language/interpretation of drift rate effects in the discussion.
You say in the discussion that people reduce the drift rate. Isn't the drift rate here determined by the consistency of the evidence? Sure, people can focus more (i.e. in cognitive control tasks, where the response rules are well known and errors are primarily driven by early incorrect activations of prepotent responses or attentional lapses), but I can focus all I want when there's no evidence (when I just don't know what the relative values are because I currently have zero [or little] valid experience to draw from after I detected a change point ) and my drift rate will still be low. No? Couldn't it be that participants have a sense that their drift rate is low (because they have no idea what's going on) and because taking time to sample would be useless (because uncertainty is not reducible other than through action), dropping the threshold is the right thing to do? In that sense the (expected) drift rate would dictate the optimal boundary height. I'm thinking of work by Tajima et al.
The reviewer brings up two points in this comment. We will address each separately.
First, there appears to be a conflation of intention with causation. While we fully agree that we can reduce the causal certainty of the language used in the Discussion (something we now do in the revised text), the fact remains that in our data, drift rate reliably changes in response to a change point in a stereotypic fashion. Given the nature of the experimental design, we are careful not to make any assumptions as to whether this change is driven by explicit or intentional mechanisms (e.g., increased focus) versus implicit or automatic mechanisms.
The second point raised regards alignment with the work by Tajima and colleagues. It is very likely that the drift rate and boundary height are changing in a cooperative, adaptive manner, at least insofar as the trials immediately surrounding a change point are concerned. The temporal profile of the two parameter changes (at least in Exp. 1 given our new analysis) is quite different, with boundary height changes being brief and drift rate adaptation requiring more time. So, if a change in the boundary height is dictating drift rate changes it is only happening briefly. Therefore we think that a majority of the changes seen in response to a change point are occurring through independent means (consistent with our prior computational models of these pathways (Dunovan et al. 2019; Dunovan and Verstynen 2019; Rubin et al. 2021)).
9. Reconsider reinterpretation of previous findings in the discussion – add nuance where nuance is due.
I have a bit of a problem with the authors' assertion that previous findings relating boundary height and conflict could be a misattribution of volatility effects (Frank and colleagues). These previous studies did not have change points. So that is an unlikely explanation of that finding. What is more likely is that the choice dynamics were different because the choices were not temporally dependent, i.e. participants made choices between different options on each trial, meaning that the conflict and thus the optimal decision strategy differed on every trial (in addition to any learning related uncertainty, but importantly, the true values associated with stimuli never changed). That is not the same as a change point/volatility. Further in the present study, conflict is anticipated, except in the case of change points. So that could equally be the difference between expected and unexpected uncertainty that leads to dissociable effects on decision strategies. In both cases, what drives the threshold adjustment is probably some form of surprise (unexpected conflict). As it stands, the statement in the discussion is inaccurate/misleading. That's an easy fix though.
Thank you for this careful reading of our critique. Given the update to our results after the more thorough set of model comparisons requested, we no longer include this point in the Discussion. Further, we have made sure to qualify our interpretation of how our findings integrate with the broader literature where necessary. We hope this reflects a more nuanced view of the prior literature.
Reviewer #3:
Shifting between more explorative and more exploitative modes of decision making is crucial for adaptive human behavior. Therefore, the authors' attempt to investigate the internal processes that allow these modes is important to begin to understand this remarkable ability. In addition, investigating the proposed link to the LC-NE system is sensible and establishing its role in these processes would help the field forward. The authors present a thorough, modelling-heavy set of analyses on two interesting datasets aimed at revealing the underlying mechanisms.
1. Despite these strong points, the manuscript in its current version falls somewhat short of answering the questions that it poses. For one, the DDM analyses are restricted to only two parameters, which begs the question whether other established parameters might be better able to explain the results and thereby shed more light on the underlying mechanisms. Also, regarding the role of the pupil-linked LC-NE system, no strong conclusions can be drawn from the data, since the visual stimulus design likely resulted in strong pupil light reflexes, which might well have overshadowed subtler, more interesting modulations of the pupil. Despite the manuscript's innovative and clever use of Bayesian modelling and PCA, these two shortcomings might limit the impact of the manuscript in its current form on the field.
Shifting between more explorative and more exploitative modes of decision making is crucial for adaptive human behavior. Therefore, the authors' attempt to investigate the internal processes that allow these modes is important to begin to understand this remarkable faculty. In addition, investigating the proposed link to the LC-NE system is sensible and establishing its role in these processes would help the field forward. However, although in general the presented analyses seem thorough, I have two main concerns that in my opinion should be addressed before conclusions can be drawn from the data.
First, the DDM modelling is too restrictive, only focusing on the bound and drift parameters. Besides these two main parameters, another main parameter of the standard DDM is non-decision time, which to my surprise is not mentioned at all in the manuscript. Moreover, recent work has shown that two further parameters can capture internal processes possibly related to explore/exploit policies: starting point (z) and drift bias (called drift criterion or dc by Ratcliff and McKoon (2008)). Including these latter two parameters possibly can explain the RTs better than drift only and shed more light on the components underlying conflict and volatility. In addition, non-decision time might also be affected by the experimental manipulations, and should at least be reported in the manuscript (I assume that the authors did include it in their currently reported DDMs). In my mind, investigating all these further parameters is crucial before the conclusion that bound and drift rate best capture conflict and volatility is warranted.
The reviewer is correct. Investigating the remaining DDM parameters is crucial to substantiate our claim that the drift rate and boundary height respond in a coordinated fashion to promote exploration in response to a suspected change. This was something that we did in our initial model evaluations but was left out for the sake of concision. We now include a more thorough test of the set of DDM parameters (a, v, t, z, dc) that could respond to a change point. This broader first-level test confirms our initial results showing that the t, z, and dc parameters do not reliably change in response to a change point (Figure 5).
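To make the roles of the five parameters concrete, the following is a minimal simulation sketch of a single drift diffusion trial. It is not the hierarchical fitting procedure used in the paper. The function `simulate_ddm`, the step size, the noise scale, and the parameter values are assumptions chosen for illustration, using a common convention in which a is the boundary height, v the drift rate, t the non-decision time, z the starting point as a fraction of a, and dc a constant added to the drift.

```python
import numpy as np

def simulate_ddm(a, v, t, z=0.5, dc=0.0, dt=0.001, sigma=1.0,
                 max_time=10.0, rng=None):
    """Simulate one trial; returns (reaction_time, choice).

    choice is 1 for the upper bound, 0 for the lower bound,
    and -1 if no bound is reached within max_time.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = z * a                      # evidence starts between 0 and a
    steps = 0
    max_steps = int(max_time / dt)
    while 0.0 < x < a and steps < max_steps:
        # Euler-Maruyama step: deterministic drift plus Gaussian noise.
        x += (v + dc) * dt + sigma * np.sqrt(dt) * rng.normal()
        steps += 1
    choice = 1 if x >= a else (0 if x <= 0.0 else -1)
    return t + steps * dt, choice

def summarize(v, n_trials=200, seed=0):
    """Mean reaction time and upper-bound choice rate for a given drift rate."""
    rng = np.random.default_rng(seed)
    trials = [simulate_ddm(a=2.0, v=v, t=0.3, rng=rng) for _ in range(n_trials)]
    rts = np.array([rt for rt, _ in trials])
    p_upper = np.mean([choice == 1 for _, choice in trials])
    return rts.mean(), p_upper

mean_rt_high, p_high = summarize(v=2.0)   # hypothetical high drift rate
mean_rt_low, p_low = summarize(v=0.2)     # hypothetical low drift rate
```

In this toy setting, lowering v while holding a fixed slows responses and makes choices less consistent, whereas t only shifts reaction times by a constant.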
My second point concerns the pupil analysis. Regarding the role of the pupil-linked LC-NE system, no strong conclusions can be drawn from the data, since the visual stimulus design likely resulted in strong pupil light reflexes, which might well have overshadowed subtler, more interesting modulations of the pupil.
a) Although I could not find information about visual stimulus size and brightness in the methods, Figure 2A-B suggests that there were strong visual transients at trial onset (black screen → stimulus), which presumably resulted in strong pupil constrictions due to the pupil light reflex (PLR).
The representation of the display depicted in Figures 2A and B does not show the actual luminance of the stimulus display for Experiment 2. As we state in our general response to the Editor above and in point B of this response, we carefully controlled task-related luminance and ambient sources of light in order to maximize our capacity to detect subtle pupillary effects. See point 3A in our response to the Editor and the revised language included in that response.
b) I would have liked to see pupil time courses in this manuscript. The first components of the PCA, as employed by the authors (which in principle I think is a great idea) is likely to capture exactly these PLR dynamics given the large variance due to PLR.
It is unlikely that we are capturing pupillary light reflexes given our control of luminance in the experimental testing rig. However, we have now added average time courses of the pupillary response to the Supplementary section (Supp. Figures 9-13). We hope these are useful for readers who share this concern. See point 3C of our response to the editor for cautionary language added to the Results and Methods section.
c) Now, previous work has shown that the LC-NE system is in fact better tracked by using the first time derivative of the pupil signal (Reimer et al., Nat Comm, 2016). The authors should consider looking at the pupil derivative to see if this reveals a link to their experimental manipulations. Importantly, using the derivative instead of the actual pupil time series attenuates the PLR since only the slope is taken. Hence, when using the derivative, the PCA might pick up more interesting, cognitive drivers of pupil dynamics, since the PLR dynamics are suppressed. It would be interesting to see if this would reveal a link to the experimental manipulations.
We have now calculated the first time derivative of the pupil signal and reanalyzed our data on this measure. Using the first time derivative of the pupil signal, we recalculated the principal components of the pupillary response as with the first order signal. Using these recalculated principal components, we reassessed the relationship between the pupillary data and our conditional manipulations and retested the link between these principal components and theta, the relationship between a and v. We continue to observe null effects. See point 4 of our response to Reviewer 1.
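The first step of this reanalysis, taking the first time derivative of the pupil trace, can be sketched as below. The function name `pupil_derivative`, the sampling rate, and the synthetic trace are assumptions for illustration only; the actual preprocessing (filtering, blink handling, and so on) is not reproduced here.

```python
import numpy as np

def pupil_derivative(pupil, sampling_rate_hz):
    """First time derivative of a pupil trace via finite differences.

    Uses central differences in the interior and one-sided differences at
    the edges; the last axis is treated as time.
    """
    pupil = np.asarray(pupil, dtype=float)
    return np.gradient(pupil, 1.0 / sampling_rate_hz, axis=-1)

# Hypothetical trace: a slow oscillation riding on a constant baseline.
fs = 1000.0                              # assumed sampling rate in Hz
t = np.arange(0.0, 2.0, 1.0 / fs)
trace = 3.0 + np.sin(np.pi * t)

d_trace = pupil_derivative(trace, fs)
d_shifted = pupil_derivative(trace + 1.5, fs)   # same shape, higher baseline
```

Because a constant offset differentiates to zero, `d_trace` and `d_shifted` agree, which is why the derivative is insensitive to differences in baseline diameter.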
d) Further, please note that the pupil likely not only is linked to the LC-NE system, but generally to catecholamines, which includes dopamine (Joshi et al. Neuron 2015). Therefore, I would recommend to not exclusively link pupil to LC-NE in the manuscript while interpreting the pupil results.
We appreciate the point that other catecholamines, such as dopamine, also contribute to the task-evoked pupillary response. We now acknowledge this lack of specificity in the Discussion:
“We hypothesized that these shifts in decision policies would be linked to changes in phasic responses of the LC-NE pathways, although we should note our experimental design does not distinguish between pupillary dynamics driven by other catecholamines, such as dopamine, and those dynamics driven by the LC-NE system.”
e) In any case, the author should show raw pupil as well as pupil derivative time courses for the different conditions to give insight in their data.
We now include subject-wise visualizations of the evoked pupillary response and the time derivative of that response for all combinations of conflict and volatility in the Supplementary section (Supp. Figures 11 and 13).
https://doi.org/10.7554/eLife.65540.sa2
Article and author information
Funding
Air Force Research Laboratory (FA95501810251)
 Krista Bond
 Timothy Verstynen
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank all members of the Cognitive Axon Lab for their feedback during the development of this work. We also thank Marlene Behrmann and Michael Granovetter for their help with eye-tracking and pupillometry data collection and Chris Wordingham for his programming and engineering consultation in the early phases of this project.
Ethics
Human subjects: Neurologically healthy adults were recruited from the local university population. All procedures were approved by the Carnegie Mellon University Institutional Review Board (Approval Code: 2018_00000195; Funding: Air Force Research Laboratory, Grant Office ID: 180119). All research participants provided informed consent to participate in the study and consent to publish any research findings based on their provided data.
Senior Editor
 Floris P de Lange, Radboud University, Netherlands
Reviewing Editor
 Redmond G O'Connell, Trinity College Dublin, Ireland
Reviewer
 Niels A Kloosterman, Max Planck Institute for Human Development, Germany
Publication history
 Received: December 7, 2020
 Accepted: December 23, 2021
 Accepted Manuscript published: December 24, 2021 (version 1)
 Version of Record published: February 1, 2022 (version 2)
Copyright
© 2021, Bond et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.