Normative decision rules in changing environments
Abstract
Models based on normative principles have played a major role in our understanding of how the brain forms decisions. However, these models have typically been derived for simple, stable conditions, and their relevance to decisions formed under more naturalistic, dynamic conditions is unclear. We previously derived a normative decision model in which evidence accumulation is adapted to fluctuations in the evidence-generating process that occur during a single decision (Glaze et al., 2015), but the evolution of commitment rules (e.g. thresholds on the accumulated evidence) under dynamic conditions is not fully understood. Here, we derive a normative model for decisions based on changing contexts, which we define as changes in evidence quality or reward, over the course of a single decision. In these cases, performance (reward rate) is maximized using decision thresholds that respond to and even anticipate these changes, in contrast to the static thresholds used in many decision models. We show that these adaptive thresholds exhibit several distinct temporal motifs that depend on the specific predicted and experienced context changes and that adaptive models perform robustly even when implemented imperfectly (noisily). We further show that decision models with adaptive thresholds outperform those with constant or urgency-gated thresholds in accounting for human response times on a task with time-varying evidence quality and average reward. These results further link normative and neural decision-making while expanding our view of both as dynamic, adaptive processes that update and use expectations to govern both deliberation and commitment.
Editor's evaluation
This paper makes an important contribution to the study of decision-making under time pressure. The authors provide convincing evidence that decision boundaries can be highly nontrivial, even reaching infinity in realistic regimes. This paper will be of broad interest to both experimentalists and theorists working on decision-making under time pressure.
https://doi.org/10.7554/eLife.79824.sa0
eLife digest
How do we make good choices? Should I have cake or yoghurt for breakfast? The strategies we use to make decisions are important not just for our daily lives, but also for learning more about how the brain works.
Decision-making strategies have two components: first, a deliberation period (when we gather information to determine which choice is ‘best’); and second, a decision ‘rule’ (which tells us when to stop deliberating and commit to a choice). Although deliberation is relatively well understood, less is known about the decision rules people use, or how those rules produce different outcomes.
Another issue is that even the simplest decisions must sometimes adapt to a changing world. For example, if it starts raining while you are deciding which route to walk into town, you would probably choose the driest route, even if it did not initially look the best. However, most studies of decision strategies have assumed that the decision-maker’s environment does not change during the decision process.
In other words, we know much less about the decision rules used in real-life situations, where the environment changes. Barendregt et al. therefore wanted to extend the approaches previously used to study decisions in static environments, to determine which decision rules might be best suited to more realistic environments that change over time.
First, Barendregt et al. constructed a computer simulation of decision-making with environmental changes built in. These changes were either alterations in the quality of evidence for or against a particular choice, or the ‘reward’ from a choice, i.e., feedback on how good the decision was. They then used the computer simulation to model single decisions where these changes took place.
These virtual experiments showed that the best performance (for example, the most accurate decisions) resulted when the threshold for moving from deliberation (i.e., considering the evidence) to selecting an option could respond to, or even anticipate, the changing situations. Importantly, the simulations’ results also predicted real-world choices made by human participants when given a decision-making task with similar variations in evidence and reward over time. In other words, the virtual decision-making rules could explain real behavior.
This study sheds new light on how we make decisions in a changing environment. In the future, Barendregt et al. hope that this will contribute to a broader understanding of decision-making and behavior in a wide range of contexts, from psychology to economics and even ecology.
Introduction
Even simple decisions can require us to adapt to a changing world. Should you go through the park or through town on your walk? The answer can depend on conditions that could be changing while you deliberate, such as an unexpected shower that would send you hurrying down the faster route (Figure 1A) or a predictable sunrise that would nudge you toward the route with better views. Despite the ubiquity of such dynamics in the real world, they are often neglected in models used to understand how the brain makes decisions. For example, many commonly used models assume that decision commitment occurs when the accumulated evidence for an option reaches a fixed, predefined value or threshold (Wald, 1945; Ratcliff, 1978; Bogacz et al., 2006; Gold and Shadlen, 2007; Kilpatrick et al., 2019). The value of this threshold can account for inherent tradeoffs between decision speed and accuracy found in many tasks: lower thresholds generate faster, but less accurate decisions, whereas higher thresholds generate slower, but more accurate decisions (Gold and Shadlen, 2007; Chittka et al., 2009; Bogacz et al., 2010). However, these classical models do not adequately describe decisions made in environments with potentially changing contexts (Thura et al., 2014; Thura and Cisek, 2016; Palestro et al., 2018; Cisek et al., 2009; Drugowitsch et al., 2012; Thura et al., 2012; Tajima et al., 2019; Glickman et al., 2022). Efforts to model decision-making thresholds under dynamic conditions have focused largely on heuristic strategies that aim to account for contexts that change between each decision. For instance, a common class of heuristic models is ‘urgency-gating models’ (UGMs). UGMs filter accumulated evidence through a low-pass filter and use thresholds that collapse monotonically over time (equivalent to dilating the belief in time) to explain decisions based on time-varying evidence quality (Cisek et al., 2009; Carland et al., 2015; Evans et al., 2020).
Although collapsing decision thresholds are optimal in some cases, they do not always account for changes that occur during decision deliberation, and they are sometimes implemented ad hoc without a proper derivation from first principles. Such derivations typically assume that individuals set decision thresholds to maximize trial-averaged reward rate (Simen et al., 2009; Balci et al., 2011; Drugowitsch et al., 2012; Tajima et al., 2016; Malhotra et al., 2018; Boehm et al., 2020), which can result in adaptive, time-varying thresholds similar to those assumed by heuristic UGMs. However, as in fixed-threshold models, these time-varying thresholds are typically defined before the evidence is accumulated, preceding the formative stages of the decision, and thus cannot account for environmental changes that may occur during deliberation.
To identify how environmental changes during the course of a single deliberative decision impact optimal decision rules, we developed normative models of decision-making that adapt to and anticipate two specific types of context changes: changes in reward expectation and changes in evidence quality. Specifically, we used Bellman’s equation (Bellman, 1957; Mahadevan, 1996; Sutton and Barto, 1998; Bertsekas, 2012; Drugowitsch, 2015) to identify decision strategies that maximize trial-averaged reward rate when conditions can change during decision deliberation. We show that for simple tasks that include sudden, expected within-trial changes in the reward or the quality of observed evidence, these normative decision strategies involve nontrivial, time-dependent changes in decision thresholds. These rules take several different forms that outperform their heuristic counterparts, are identifiable from behavior, and have performance that is robust to noisy implementations. We also show that, compared to fixed-threshold models or UGMs, these normative, adaptive threshold models provide a better account of human behavior on a ‘tokens task’, in which the value of commitment changes gradually at predictable times and the quality of evidence changes unpredictably within each trial (Cisek et al., 2009; Thura et al., 2014). These results provide new insights into the behavioral relevance of a diverse set of adaptive decision thresholds in dynamic environments and tightly link the details of such environmental changes to threshold adaptations.
Results
Normative theory for dynamic context 2AFC tasks
To determine potential deliberation and commitment strategies used by human subjects, we begin by identifying normative decision rules for two-alternative forced-choice (2AFC) tasks with dynamic contexts. Normative decision rules that maximize trial-averaged reward rate can be obtained by solving an optimization problem using dynamic programming (Bellman, 1957; Sutton and Barto, 1998; Drugowitsch et al., 2012; Tajima et al., 2016). We define this trial-averaged reward rate, $\rho$, as (Gold and Shadlen, 2002; Drugowitsch et al., 2012)
(1) $\rho = \frac{\langle R\rangle - \langle C(T_d)\rangle}{\langle T_t\rangle + \langle t_i\rangle},$
where $\langle R\rangle$ is the average reward for a decision, $T_d$ is the decision time, $\langle C(T_d)\rangle = \langle \int_0^{T_d} c(t)\,dt \rangle$ is the average total accumulated cost given an incremental cost function $c(t)$, $\langle T_t\rangle$ is the average trial length, and $\langle t_i\rangle$ is the average inter-trial interval (Drugowitsch, 2015). Note that all averages in Equation 1 are taken over trials. To find the normative decision thresholds that maximize $\rho$, we assign specific values (i.e., economic utilities) to correct and incorrect choices (reward and/or punishment) and the time required to arrive at each choice (i.e., evidence cost). The incremental cost function $c(t)$ represents both explicit time costs, such as a price for gathering evidence, and implicit costs, such as opportunity cost. While there are many forms of this cost function, we make the simplifying assumption that it is constant, $c(t)=c$. Because more complex cost functions can influence decision threshold dynamics (Drugowitsch et al., 2012), restricting the cost function to a constant ensures that the threshold dynamics we identify are governed purely by changes in the (external) task conditions and not the (internal) cost function.
To represent the structure of 2AFC tasks, we assume a decision environment for an observer with an initially unknown environmental state, $s\in\{s_+, s_-\}$, that uniquely determines which of two alternatives is correct. To infer the environmental state, this observer makes measurements, $\xi$, that follow a distribution $f_\pm(\xi) = f(\xi \mid s_\pm)$ that depends on the state. Determining the correct choice is thus equivalent to determining the generating distribution, $f_\pm$. An ideal Bayesian observer uses the log-likelihood ratio (LLR), $y$, to track their ‘belief’ over the correct choice (Wald, 1945; Bogacz et al., 2006; Veliz-Cuba et al., 2016).
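After $n$ discrete observations $\xi_{1:n}$ that are independent across time, the discrete-time LLR belief $y_n$ is given by:
(2) $y_n = \ln \frac{\Pr(s_+ \mid \xi_{1:n})}{\Pr(s_- \mid \xi_{1:n})} = y_0 + \sum_{i=1}^{n} \ln \frac{f_+(\xi_i)}{f_-(\xi_i)}.$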
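The accumulation of log-likelihood ratios over independent observations can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes symmetric Gaussian generating distributions $\mathcal{N}(\pm\mu, \sigma^2)$, for which each increment reduces to $2\mu\xi/\sigma^2$, and a flat prior ($y_0 = 0$). The function names and the parameter values are hypothetical.

```python
import math
import random

def gaussian_llr_increment(xi, mu, sigma):
    """ln f_+(xi) / f_-(xi) for f_± = N(±mu, sigma^2); equals 2*mu*xi / sigma^2."""
    return 2.0 * mu * xi / sigma ** 2

def accumulate_llr(observations, mu, sigma, y0=0.0):
    """Return the running beliefs y_1, ..., y_n given independent observations."""
    y = y0
    beliefs = []
    for xi in observations:
        y += gaussian_llr_increment(xi, mu, sigma)
        beliefs.append(y)
    return beliefs

def state_likelihood(y):
    """Posterior probability of s_+ implied by a log-likelihood ratio y."""
    return 1.0 / (1.0 + math.exp(-y))

# Illustrative values (assumptions): true state s_+, mu = 0.5, sigma = 1.
rng = random.Random(0)
mu, sigma = 0.5, 1.0
observations = [rng.gauss(mu, sigma) for _ in range(20)]
beliefs = accumulate_llr(observations, mu, sigma)
print(state_likelihood(beliefs[-1]))
```

Because the increments simply add, the belief can be updated online with constant memory, which is what makes a threshold on $y_n$ a natural commitment rule.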
Given this defined task structure, we discretize the time during which the decision is formed and define the observer’s actions during each timestep. The observer gathers evidence (measurements) during each timestep prior to a decision and uses each increment of evidence to update their belief about the correct choice. Then, the observer has the option to either commit to a choice or make another measurement at the next timestep. By assigning a utility to each of these actions (i.e., a value $V_+$ for choosing $s_+$, a value $V_-$ for choosing $s_-$, and a value $V_w$ for sampling again), we can construct the value function for the observer given their current belief:
For a full derivation of this equation, see Materials and methods. In Equation 3, $p_n = \Pr(s_+ \mid \xi_{1:n}) = \frac{1}{1+e^{-y_n}}$ is the state likelihood at time $t_n$, $R_c$ is the reward for a correct choice, $R_i$ is the reward for an incorrect choice, and $\delta t$ is the timestep between observations. We choose generating distributions to be symmetric Gaussian distributions $f_\pm(\xi) \sim \mathcal{N}\left(\pm\mu, \sigma^2\right)$ to allow us to compute the conditional distribution function $f_p(p_{n+1} \mid p_n)$ needed for the average future value explicitly:
In Equation 4, $f_p(p_{n+1} \mid p_n)$ is the conditional probability of the future state likelihood $p_{n+1}$ given the current state likelihood $p_n$. For the case of Gaussian-distributed evidence, this conditional probability is given by Equation 16 in Materials and methods. Using Equation 3, we find the specific belief values where the optimal action changes from gathering evidence to commitment, defining thresholds on the ideal observer’s belief that trigger decisions. Figure 1B shows a schematic of this process.
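The search for belief values at which committing becomes better than sampling again can be sketched as a backward induction. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: it works on a grid of LLR values, takes the value of committing to be $\max\{R_c p + R_i(1-p),\; R_c(1-p) + R_i p\}$ and the value of waiting to be the expected next-step value minus $(c + \rho)\,\delta t$, holds $\rho$ fixed instead of solving for it, forces a choice at a finite horizon, and draws LLR increments from $\mathcal{N}(\pm m\,\delta t, 2m\,\delta t)$. The name `normative_thresholds`, the grid and quadrature sizes, and all parameter values are hypothetical.

```python
import math

def sigmoid(y):
    """State likelihood p = 1 / (1 + exp(-y)) for a log-likelihood ratio y."""
    return 1.0 / (1.0 + math.exp(-y))

def interp(y_grid, values, y):
    """Linear interpolation on a uniform grid, clamped to the end values."""
    if y <= y_grid[0]:
        return values[0]
    if y >= y_grid[-1]:
        return values[-1]
    pos = (y - y_grid[0]) / (y_grid[1] - y_grid[0])
    i = min(int(pos), len(y_grid) - 2)
    frac = pos - i
    return (1.0 - frac) * values[i] + frac * values[i + 1]

def normative_thresholds(reward_fn, m, cost, rho, dt, n_steps,
                         r_incorrect=0.0, y_max=8.0, n_grid=161, n_quad=21):
    """Backward induction over a finite horizon; returns one threshold per step.

    Assumptions (not from the source): forced choice at step n_steps, a fixed
    reward rate rho, and LLR increments distributed as N(+/- m*dt, 2*m*dt).
    """
    y_grid = [-y_max + 2.0 * y_max * i / (n_grid - 1) for i in range(n_grid)]
    p_grid = [sigmoid(y) for y in y_grid]
    # Midpoint quadrature nodes and normalized weights for a standard normal.
    z = [-4.0 + 8.0 * (k + 0.5) / n_quad for k in range(n_quad)]
    w = [math.exp(-0.5 * zk * zk) for zk in z]
    total = sum(w)
    w = [wk / total for wk in w]
    drift, spread = m * dt, math.sqrt(2.0 * m * dt)

    def commit_value(p, t):
        r_c = reward_fn(t)
        return max(r_c * p + r_incorrect * (1.0 - p),
                   r_c * (1.0 - p) + r_incorrect * p)

    # Value at the horizon: the observer must commit.
    value = [commit_value(p, n_steps * dt) for p in p_grid]
    thresholds = [0.0] * n_steps
    for n in range(n_steps - 1, -1, -1):
        new_value, theta = [], math.inf
        for y, p in zip(y_grid, p_grid):
            # Expected future value: mixture over the two possible states.
            future = sum(
                wk * (p * interp(y_grid, value, y + drift + spread * zk)
                      + (1.0 - p) * interp(y_grid, value, y - drift + spread * zk))
                for zk, wk in zip(z, w))
            wait = future - (cost + rho) * dt
            commit = commit_value(p, n * dt)
            new_value.append(max(commit, wait))
            if y >= 0.0 and commit >= wait and y < theta:
                theta = y  # smallest non-negative belief at which committing wins
        value, thresholds[n] = new_value, theta
    return thresholds

# Illustrative low-to-high reward switch at t = 0.5 (R_1 = 1, R_2 = 10).
thresholds = normative_thresholds(lambda t: 1.0 if t < 0.5 else 10.0,
                                  m=5.0, cost=0.1, rho=1.0, dt=0.02, n_steps=50)
print(thresholds[0], thresholds[-1])
```

A threshold of `math.inf` at a given step means that waiting is preferred at every belief on the grid. A complete treatment would also need to determine $\rho$ itself, which this sketch does not attempt.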
To understand how normative decision thresholds adapt to changing conditions, we derived them for several different forms of two-alternative forced-choice (2AFC) tasks in which we controlled changes in evidence or reward. Even for such simple tasks, there is a broad set of possible changing contexts. In the next section, we analyze a task in which context changes gradually (the tokens task). Here, we focus on tasks in which the context changes abruptly. For each task, an ideal observer was shown evidence generated from a Gaussian distribution $f_\pm(\xi) = \mathcal{N}\left(\pm\mu, \sigma^2\right)$ with signal-to-noise ratio (SNR) $m = \frac{2\mu^2}{\sigma^2}$ (Figure 2—figure supplement 1). The SNR measures evidence quality: a smaller (larger) $m$ implies that evidence is of lower (higher) quality, resulting in harder (easier) decisions. The observer’s goal was to determine which of the two means (i.e., which distribution, $f_+$ or $f_-$) was used to generate the observations. We introduced changes in the reward for a correct decision (‘reward-change task’) or the SNR (‘SNR-change task’) within a single decision, where the time and magnitude of the changes are known in advance to the observer (Figure 1A, Figure 2—figure supplement 2). For example, changes in SNR arise naturally throughout a day as animals choose when to forage and hunt given variations in light levels and therefore target-acquisition difficulty (Combes et al., 2012; Einfalt et al., 2012).
Under these dynamic conditions, dynamic programming produces normative thresholds with rich nonmonotonic dynamics (Figure 2A and B, Figure 2—figure supplement 2). Environments with multiple reward changes during a single decision lead to complex threshold dynamics that we summarize in terms of several threshold change ‘motifs’. These motifs occur on shorter intervals and tend to emerge from simple monotonic changes in context parameters (Figure 2—figure supplement 2). To better understand the range of possible threshold motifs, we focused on environments with single changes in task parameters. For the reward-change task, we set punishment $R_i = 0$ and assumed reward $R_c$ changes abruptly, so that its dynamics are described by a Heaviside function $H$:
(5) $R_c(t) = (R_2 - R_1)\,H(t - 0.5) + R_1.$
Thus, the reward switches from the pre-change reward $R_1$ to the post-change reward $R_2$ at $t = 0.5$.
For this single-change task, normative threshold dynamics exhibited several motifs that in some cases resembled fixed or collapsing thresholds characteristic of previous decision models but in other cases exhibited novel dynamics. Specifically, we characterized five different dynamic motifs in response to single, expected changes in reward contingencies for different combinations of pre- and post-change reward values (Figure 2C and i–v). For tasks in which reward is initially very low, thresholds are infinite until the reward increases, ensuring that the observer waits for the larger payout regardless of how strong their belief is (Figure 2i). The region where thresholds are infinite corresponds to when $V_w(p_n;\rho)$ in Equation 3, which is the value associated with waiting to gather more information, is maximal for all values of $p_n$. In contrast, when reward is initially very high, thresholds collapse to zero just before the reward decreases, ensuring that all responses occur while payout is high (Figure 2v). Between these two extremes, optimal thresholds exhibit rich, nonmonotonic dynamics (Figure 2ii,iv), promoting early decisions in the high-reward regime, or preventing early, inaccurate decisions in the low-reward regime. Figure 2C shows the regions in pre- and post-change reward space where each motif is optimal, including broad regions with nonmonotonic thresholds. Thus, even simple context dynamics can evoke complex decision strategies in ideal observers that differ from those predicted by constant decision thresholds and heuristic UGMs.
We also formulated an ‘inferred reward-change task’, in which reward fluctuates between a high value $R_H$ and low value $R_L$ governed by a two-state Markov process with known transition rate $h$ and state $R(t)\in\{R_H, R_L\}$ that the observer must infer online. For this task, the observer receives two independent sets of evidence: the evidence of the state $\xi \mid s_\pm \sim \mathcal{N}\left(\pm\mu, \sigma^2\right)$ and the evidence of the current reward $\eta \mid R_{H/L} \sim \mathcal{N}\left(\pm\mu_R, \sigma_R^2\right)$. The observer must then track their beliefs about both the state and the current reward and take both sources of information into account when determining the optimal decision thresholds. We found that these thresholds always changed monotonically with monotonic shifts in expected reward (see Figure 2—figure supplement 3). These results contrast with our findings from the reward-change task in which changes can be anticipated and monotonic changes in reward can produce nonmonotonic changes in decision thresholds.
For the SNR-change task, optimal strategies for environments with multiple changes in evidence quality are characterized by threshold dynamics that adapt to these changes in a way similar to how they adapt to changes in reward (Figure 3—figure supplement 1). To study the range of possible threshold motifs, we again considered environments with single changes in the evidence quality $m = \frac{2\mu^2}{\sigma^2}$ by taking $\mu$ to be a Heaviside function:
For this single-change task, we again found similar threshold motifs to those in the reward-change task (Figure 3A and B). However, in this case monotonic changes in evidence quality always produce monotonic changes in response behavior. This observation holds across all of parameter space for evidence-quality schedules with single change points (Figure 3C), with only three optimal behavioral motifs (Figure 3i–iii). This contrasts with our findings in the reward-change task, where monotonic changes in reward can produce nonmonotonic changes in decision thresholds. Strategies arising from known dynamical changes in context tend to produce sharper response distributions around reward changes than around quality changes, which may be measurable in psychophysical studies. These findings suggest that changes in reward can have a larger impact on the normative strategy thresholds than changes in evidence quality.
Performance and robustness of nonmonotonic normative thresholds
The normative solutions that we derived for dynamic-context tasks by definition maximize reward rate. This maximization assumes that the normative solutions are implemented perfectly. However, a perfect implementation may not be possible, given the complexity of the underlying computations, biological constraints on computation time and energy (Louie et al., 2015), and the synaptic and neural variability of cortical circuits (Ma and Jazayeri, 2014; Faisal et al., 2008). Given these constraints, subjects may employ heuristic strategies like the UGM over the normative model if noisy or mistuned versions of both models result in similar reward rates. We used synthetic data to better understand the relative benefits of different imperfectly implemented strategies. Specifically, we corrupted the internal belief state and simulated response times with additive Gaussian noise with zero mean and variance $\sigma_{mn}^2$ (see Figure 4—figure supplement 1C) for three models:
The continuous-time normative model with time-varying thresholds $\pm\theta(t)$ from Equation 3 and belief that evolves according to the stochastic differential equation
$d\tilde{y} = \underbrace{\pm m\,dt}_{\text{drift}} + \underbrace{\sqrt{2m}\,dW_t}_{\text{sample noise}} + \underbrace{\sigma_y\,dW_t'}_{\text{sensory noise}},$
where $dW_t$ is a standard increment of a Wiener process, the sign of the drift $\pm m\,dt$ is given by the correct choice $s_\pm$, and $dW_t'$ is an independent Wiener process with strength $\sigma_y$. Adding the noise process $dW_t'$ makes this a noisy Bayesian (NB) model.
A constant-threshold (Const) model, which uses the same belief $\tilde{y}$ as the normative model but a constant, non-adaptive decision threshold $\pm\theta(t) = \pm\theta_0$ (Figure 4—figure supplement 1A).
The UGM, which uses the output of a low-pass filter as the belief,
(7) $\tau\,dE = \underbrace{\left(-E + \frac{1}{1+e^{-y}} - \frac{1}{2}\right)dt}_{\text{drift \& sample noise}} + \underbrace{\sigma_y\,dW_t}_{\text{sensory noise}},$
and commits to a decision when this output crosses a hyperbolically collapsing threshold $\pm\theta(t) = \pm\frac{\theta_0}{at}$ (Figure 4—figure supplement 1B). In Equation 7, $E$ is the filter’s output that serves as the UGM’s belief, $\tau$ is a relaxation time constant, and the optimal observer’s belief $y$ is the filter’s input. Note that the filter’s input can also be written in terms of the state likelihood $p$,
$\tau\,dE = \left(-E + p - \frac{1}{2}\right)dt + \sigma_y\,dW_t,$
which is the form first proposed by Cisek et al. (2009).
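The belief dynamics of the three models can be simulated with a simple Euler-Maruyama scheme, sketched below in Python. The discretization scheme is our choice for illustration and is not stated in the text. The sketch assumes that $s_+$ is the correct choice, that a choice is forced at a deadline `t_max`, and that the UGM's input $y$ is the noise-free optimal belief. The function names and every parameter value are hypothetical.

```python
import math
import random

def simulate_noisy_bayesian(threshold_fn, m, sigma_y, dt, t_max, rng):
    """Euler-Maruyama simulation of dy = m dt + sqrt(2m) dW + sigma_y dW'.

    Assumes s_+ is the correct choice, so the drift is +m.
    Returns (response_time, chose_correct).
    """
    y, t = 0.0, 0.0
    root_dt = math.sqrt(dt)
    while t < t_max:
        sample_noise = math.sqrt(2.0 * m) * rng.gauss(0.0, root_dt)
        sensory_noise = sigma_y * rng.gauss(0.0, root_dt)
        y += m * dt + sample_noise + sensory_noise
        t += dt
        if abs(y) >= threshold_fn(t):
            return t, y > 0.0
    return t_max, y > 0.0  # forced choice at the deadline (an assumption)

def simulate_ugm(theta0, a, tau, m, sigma_y, dt, t_max, rng):
    """Low-pass filter tau dE = (-E + p - 1/2) dt + sigma_y dW with the
    hyperbolically collapsing threshold theta0 / (a * t)."""
    y, E, t = 0.0, 0.0, 0.0
    root_dt = math.sqrt(dt)
    while t < t_max:
        # Optimal observer's belief, which serves as the filter's input.
        y += m * dt + math.sqrt(2.0 * m) * rng.gauss(0.0, root_dt)
        p = 1.0 / (1.0 + math.exp(-y))
        E += ((-E + p - 0.5) * dt + sigma_y * rng.gauss(0.0, root_dt)) / tau
        t += dt
        if abs(E) >= theta0 / (a * t):
            return t, E > 0.0
    return t_max, E > 0.0  # forced choice at the deadline (an assumption)

def adaptive_threshold(t):
    """Illustrative threshold: infinite before t = 0.5, finite afterwards."""
    return math.inf if t < 0.5 else 1.0

def constant_threshold(t):
    return 1.0

rng = random.Random(1)
n_trials = 100
for name, fn in [("NB", adaptive_threshold), ("Const", constant_threshold)]:
    trials = [simulate_noisy_bayesian(fn, 5.0, 0.5, 0.001, 5.0, rng)
              for _ in range(n_trials)]
    print(name, sum(rt for rt, _ in trials) / n_trials)
ugm_trials = [simulate_ugm(0.3, 1.0, 0.2, 5.0, 0.1, 0.001, 5.0, rng)
              for _ in range(n_trials)]
print("UGM", sum(rt for rt, _ in ugm_trials) / n_trials)
```

With the infinite early threshold, no response of the adaptive model can occur before $t = 0.5$, whatever the noise level, whereas the constant-threshold and UGM simulations can respond as soon as their finite thresholds are crossed.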
For more details about these three models, see Materials and methods. We compared their performance in terms of reward rate achieved on the same set of reward-change tasks shown in Figure 2. To ensure the average total reward in each trial was the same, we restricted the pre-change reward $R_1$ and post-change reward $R_2$ so that $R_1 + R_2 = 11$.
When all three models were implemented without additional noise, the relative benefits of the normative model depended on the exact task condition. The performance differential between models was highest when reward changed from low to high values (Figure 4A, dotted line; Figure 4). Under these conditions, normative thresholds are initially infinite and become finite after the reward increases, ensuring that most responses occur immediately once the high reward becomes available (Figure 4D). In contrast, response times generated by the constant-threshold and UGM models tend not to follow this pattern. For the constant-threshold model, many responses occur early, when the reward is low (Figure 4E). For the UGM, a substantial fraction of responses are late, leading to higher time costs; however, it is possible to tune the rate of collapse of the UGM’s thresholds to prevent any early responses while the reward is low (Figure 4F). In contrast, when the reward changes from high to low values, all models exhibit similar response distributions and reward rates (Figure 4A, dashed line; Figure 4—figure supplement 2). This result is not surprising, given that the constant-threshold model produces early peaks in the reaction time distribution, and the UGM was designed to mimic collapsing bounds that hasten decisions in response to imminent decreases in reward (Cisek et al., 2009). We therefore focused on the robustness of each strategy when corrupted by noise and responding to low-to-high reward switches, the regime differentiating strategy performance in ways that could be identified in subject behavior.
Adding noise to the internal belief state (which tends to trigger earlier responses) and simulated response distributions (which tends to smooth out the distributions) without retuning the models to account for the additional noise does not alter the advantage of the normative model: across a range of added noise strengths, which we define as $\frac{\sigma_y + \sigma_{mn}}{\overline{\sigma}_y + \overline{\sigma}_{mn}}$, where $\overline{\sigma}_y$ and $\overline{\sigma}_{mn}$ are the maximum possible strengths of sensory and motor noise, respectively, the normative model outperforms the other two when encountering low-to-high reward switches (Figure 4C). This robustness arises because, prior to the reward change, the normative model uses infinite decision thresholds that prevent early noise-triggered responses when reward is low (Figure 4D). In contrast, the heuristic models have finite collapsing or constant thresholds and thus produce more suboptimal early responses as belief noise is increased (Figure 4E and F). Thus, adaptive decision strategies can result in considerably higher reward rates than heuristic alternatives even when implemented imperfectly, suggesting subjects may be motivated to learn such strategies.
Adaptive normative strategies in the tokens task
To determine the relevance of the normative model to human decision-making, we analyzed previously collected data from a ‘tokens task’ (Cisek et al., 2009). For this task, human subjects were shown 15 tokens inside a center target flanked by two empty targets (see Figure 5A for a schematic). Every 200 ms, a token moved from the center target to one of the neighboring targets with equal probability. Subjects were tasked with predicting which flanking target would contain more tokens by the time all 15 moved from the center. Subjects could respond at any time before all 15 tokens had moved. Once the subject made the prediction, the remaining tokens would finish their movements to indicate the correct alternative. Given this task structure, one can show using a combinatorial argument (Cisek et al., 2009) that the state likelihood function $p_n = \Pr(\text{top} \mid \xi_{1:n})$, the probability the top target will hold more tokens at the end of the trial, is given by:
(8) $p_n = p(U_n, L_n, C_n) = \frac{C_n!}{2^{C_n}} \sum_{k=0}^{\min(C_n,\,7-L_n)} \frac{1}{k!\,(C_n-k)!},$
where $U_n$, $L_n$, and $C_n$ are the number of tokens in the upper, lower, and center targets after token movement $n$, respectively. The token movements are Markovian because each token has an equal chance of moving to the upper/lower target. However, the probability that a target will contain more tokens at the end of the trial is history dependent, and the evolution of these probabilities is thus non-Markovian. As such, the quality of evidence possible from each token draw changes dynamically and gradually. In addition, the task included two different post-decision token movement speeds, ‘slow’ and ‘fast’: once the subject committed to a choice, the tokens finished out their animation, moving either once every 170 ms (slow task) or once every 20 ms (fast task). This post-decision movement acceleration changed the value associated with commitment by making the average inter-trial interval ($\langle t_i\rangle$ in Equation 1) decrease over time. Because of this modulation, we can interpret the tokens task as a multi-change reward task, where commitment value is controlled through $\langle t_i\rangle$ rather than through reward $R_c$. Our dynamic-programming framework for generating adaptive decision rules can handle the gradual changes in task context emerging in the tokens task. Given that costs and rewards can be subjective, we quantified how normative decision thresholds change with different combinations of rewards $R_c$ and costs $c(t)=c$ for fixed punishment $R_i=-1$, for both the slow (Figure 5B) and fast (Figure 5C) versions of the task.
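The combinatorial argument for the state likelihood in the tokens task can be checked numerically. The Python sketch below is an illustration, not code from the study: it counts the ways in which the remaining center tokens can still give the upper target a majority, with each remaining token moving up or down with probability 1/2. The function name `prob_top_wins` and the example counts are hypothetical.

```python
from math import comb

def prob_top_wins(upper, lower, center, n_total=15):
    """Probability that the upper target ends the trial with more tokens,
    given the current counts in the upper, lower, and center targets."""
    if upper + lower + center != n_total:
        raise ValueError("token counts must sum to n_total")
    majority = n_total // 2 + 1  # 8 of 15 tokens
    # The upper target wins if at least this many remaining tokens move up.
    need = max(majority - upper, 0)
    favorable = sum(comb(center, k) for k in range(need, center + 1))
    return favorable / 2 ** center

print(prob_top_wins(0, 0, 15))  # start of the trial: both targets equally likely
print(prob_top_wins(1, 0, 14))  # after one token moves up
print(prob_top_wins(8, 0, 7))   # upper target already holds a majority
```

Counting the tokens that must still move up is equivalent to bounding the number that may move down by $7 - L_n$. The size of each belief update depends on how many tokens remain, which is one way to see why evidence quality changes within a trial.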
We identified four distinct motifs of normative decision threshold dynamics for the tokens task (Figure 5i–iv). Some combinations of rewards and costs produced collapsing thresholds (Figure 5ii) similar to the UGM developed by Cisek et al. (2009) for this task. In contrast, large regions of task parameter space produced rich nonmonotonic threshold dynamics (Figure 5iii,iv) that differed from any found in the UGM. In particular, as in the case of reward-change tasks, normative thresholds were often infinite for the first several token movements, preventing early and weakly informed responses. These motifs are similar to those produced by low-to-high reward switches in the reward-change task, but here resulting from the low relative cost of early observations. These nonmonotonic dynamics also appear if we measure belief in terms of the difference in tokens between the top and bottom target, which we call ‘token lead space’ (see Figure 5—figure supplement 1).
Adaptive normative strategies best fit subject response data
To determine the relevance of these adaptive decision strategies to human behavior, we fit discrete-time versions of the noisy Bayesian (four free parameters), constant-threshold (three free parameters), and urgency-gating (five free parameters) models to response-time data from the tokens task collected by Cisek et al. (2009); see Table 1 in Materials and methods for a table of parameters for each model. All models included belief and motor noise, as in our analysis of the dynamic-context tasks (Figure 4—figure supplement 1C). The normative model tended to fit the data better than the heuristic models (see Figure 6—figure supplement 1), based on three primary analyses. First, both corrected AIC (AICc), which accounts for goodness-of-fit and model degrees-of-freedom, and average root-mean-squared error (RMSE) between the predicted and actual trial-by-trial response times, favored the noisy Bayesian model for most subjects for both the slow (Figure 6A) and fast (Figure 6D) versions of the task. Second, when considering only the best-fitting model for each subject and task condition, the noisy Bayesian model tended to better predict subjects’ response times (Figure 6B and E). Third, most subjects whose data were best described by the noisy Bayesian model had best-fit parameters that corresponded to nonmonotonic decision thresholds, which cannot be produced by either of the other two models (Figure 6C and F). This result also shows that, assuming subjects used a normative model, they used distinct model parameters, and thus different strategies, for both the fast and slow task conditions. This finding is clearer when looking at the posterior parameter distribution for each subject and model parameter (see Figure 6—figure supplement 1 for an example). We speculate that the higher estimated value of reward in the slow task may arise due to subjects valuing frequent rewards more favorably.
Together, our results strongly suggest that these human subjects tended to use an adaptive, normative strategy instead of the kinds of heuristic strategies often used to model response data from dynamic-context tasks.
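The two comparison metrics named above, AICc and RMSE, can be computed as in the following Python sketch. It uses the standard small-sample correction to AIC and the parameter counts quoted in the text (four, three, and five free parameters). The log-likelihood values and the number of trials are made-up placeholders for illustration only and are not results from the study; the function names are hypothetical.

```python
import math

def aicc(log_likelihood, n_params, n_obs):
    """Corrected Akaike information criterion (lower is better)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("need n_obs > n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return aic + 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)

def rmse(predicted, observed):
    """Root-mean-squared error between predicted and observed response times."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Placeholder log-likelihoods (NOT from the paper) for a single subject.
n_trials = 200
models = {
    "noisy Bayesian": (-410.0, 4),
    "constant-threshold": (-430.0, 3),
    "urgency-gating": (-425.0, 5),
}
scores = {name: aicc(ll, k, n_trials) for name, (ll, k) in models.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The correction term grows with the number of parameters relative to the number of trials, so a model with more free parameters must achieve a correspondingly higher likelihood to be preferred.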
Discussion
The goal of this study was to build on previous work showing that in dynamic environments, the most effective decision processes do not necessarily use relatively simple, predefined computations as in many decision models (Bogacz et al., 2006; Cisek et al., 2009; Drugowitsch et al., 2012), but instead adapt to learned or predicted features of the environmental dynamics (Drugowitsch et al., 2014a). Specifically, we used new ‘dynamic context’ task structures to demonstrate that normative decision commitment rules (i.e., decision thresholds, or bounds, in ‘accumulate-to-bound’ models) adapt to reward and evidence-quality switches in complex, but predictable, ways. Comparing the performance of these normative decision strategies to the performance of classic heuristic models, we found that the advantage of normative models is maintained when computations are noisy. We extended these modeling results to include the ‘tokens task’, in which evidence quality changes in a way that depends on stimulus history and the utility of commitment increases over time. We found that the normative decision thresholds for the tokens task are also non-monotonic and robust to noise. By reanalyzing human subject data from this task, we found most subjects’ response times were best explained by a noisy normative model with non-monotonic decision thresholds. Taken collectively, these results show that ideal observers and human subjects use adaptive and robust normative decision strategies in relatively simple decision environments.
Our results can aid experimentalists investigating the nuances of complex decision-making in several ways. First, we demonstrated that normative behavior varies substantially across task parameters for relatively simple tasks. For example, the reward-change task structure produces five distinct behavioral motifs, such as waiting until reward increases (Figure 2i) and responding before reward decreases unless the accumulated evidence is ambiguous (Figure 2iv). Using these kinds of modeling results to inform experimental design can help us understand the possible behaviors to expect in subject data. Second, extending our work and considering the sensitivity of performance to both model choice and task parameters (Barendregt et al., 2019; Radillo et al., 2019) will help to identify regions of task parameter space where models are most identifiable from observables like response time and choice. Third, and more generally, our work provides evidence that for tasks with gradual changes in evidence quality and reward, human behavior is more consistent with normative principles than with previously proposed heuristic models. However, more work is needed to determine if and how people follow normative principles for other dynamic-context tasks, such as those involving abrupt changes in evidence or reward contingencies, by using normative theory to determine which subject strategies are plausible, the nature of tasks needed to identify them, and the relationship between task dynamics and decision rules.
Model-driven experimental design can aid in identification of adaptive decision rules in practice. People commonly encounter unpredictable (e.g. an abrupt thunderstorm) and predictable (e.g. sunset) context changes when making decisions. Natural extensions of common perceptual decision tasks (e.g. random-dot motion discrimination [Gold and Shadlen, 2002]) could include within-trial changes in stimulus signal-to-noise ratio (evidence quality) or anticipated reward payout. Task-relevant variability can also arise from internal sources, including noise in neural processing of sensory input and motor output (Ma and Jazayeri, 2014; Faisal et al., 2008). We assumed subjects do not have precise knowledge of the strength or nature of these noise sources, and thus they could not optimize their strategy accordingly. However, people may be capable of rapidly estimating performance error that results from such internal noise processes and adjusting online (Bonnen et al., 2015). To extend the models we considered, we could therefore assume that subjects can estimate the magnitude of their own sensory and motor noise, and use this information to adapt their decision strategies to improve performance.
Real subjects likely do not rely on a single strategy when performing a sequence of trials (Ashwood et al., 2022) and instead rely on a mix of near-normative, sub-normative, and heuristic strategies. In fitting subject data, experimentalists are thus presented with the difficult task of constructing a library of possible models to use in their analysis. More general approaches have been developed for fitting response data to a broad class of models (Shinn et al., 2020), but these model libraries are typically built on pre-existing assumptions of how subjects accumulate evidence and make decisions. Because the potential library of decision strategies is theoretically limitless, a normative analysis can both expand and provide insights into the range of possible subject behaviors in a systematic and principled way. Understanding this scope will assist in developing a well-groomed candidate list of near-normative and heuristic models. For example, if a normative analysis of performance on a dynamic reward task produces threshold dynamics similar to those in Figure 2B, then the fitting library should include a piecewise-constant threshold (or urgency signal) model. Combining these model-based investigations with model-free approaches, such as rate-distortion theory (Berger, 2003; Eissa et al., 2021), can also aid in identifying commonalities in performance and resource usage within and across model classes without the need for pilot experiments.
Our work complements the existing literature on optimal decision thresholds by demonstrating the diversity of forms those thresholds can take under different dynamic task conditions. Several early normative theories were, like ours, based on dynamic programming (Rapoport and Burkheimer, 1971; Busemeyer and Rapoport, 1988) and, in some cases, on models fit to experimental data (Ditterich, 2006). For example, dynamic programming was used to show that certain optimal decisions can require non-constant decision boundaries similar to those of our normative models in dynamic reward tasks (Frazier and Yu, 2007; Figure 2). More recently, dynamic programming (Drugowitsch et al., 2012; Drugowitsch et al., 2014b; Tajima et al., 2016) or policy iteration (Malhotra et al., 2017; Malhotra et al., 2018) have been used to identify normative strategies in dynamic environments that can have monotonically collapsing decision thresholds that in some cases can be implemented using an urgency signal (Tajima et al., 2019). These strategies include dynamically changing decision thresholds when signal-to-noise ratios of evidence streams vary according to a Cox-Ingersoll-Ross process (Drugowitsch et al., 2014a) and non-monotonic thresholds when the evidence quality varies unpredictably across trials but is fixed within each trial (Malhotra et al., 2018). Other recent work has started to generalize notions of urgency-gating behavior (Trueblood et al., 2021). However, these previous studies tended to focus on environments with a fixed structure, in which dynamic decision thresholds are adapted as the observer acquires knowledge of the environment. Here we have characterized in more detail how both expected and unexpected changes in context within trials relate to changes in decision thresholds over time.
Perceptual decision-making tasks provide a readily accessible route for validating our normative theory, especially considering the ease with which task difficulty can be parameterized to identify parameter ranges in which strategies can best be differentiated (Philiastides et al., 2006). There is ample evidence already that people can tune the timescale of leaky evidence accumulation processes to the switching rate of an unpredictably changing state governing the statistics of a visual stimulus, to efficiently integrate observations and make a decision about the state (Ossmy et al., 2013; Glaze et al., 2015). We thus speculate that adaptive decision rules could be identified similarly in the strategies people use to make decisions about perceptual stimuli in dynamic contexts.
The neural mechanisms responsible for implementing and controlling decision thresholds are not well understood. Recent work has identified several cortical regions that may contribute to threshold formation, such as prefrontal cortex (Hanks et al., 2015), dorsal premotor area (Thura and Cisek, 2020), and superior colliculus (Crapse et al., 2018; Jun et al., 2021). Urgency signals are a complementary way of dynamically changing decision thresholds via a commensurate scale in belief, which Thura and Cisek, 2017 suggest are detectable in recordings from basal ganglia. The normative decision thresholds we derived do not employ urgency signals, but analogous UGMs may involve non-monotonic signals. For example, the switch from an infinite-to-constant decision threshold typical of low-to-high reward switches would correspond to a signal that suppresses responses until a reward change. Measurable signals predicted by our normative models would therefore correspond to zero mean activity during low reward, followed by constant mean activity during high reward. While more experimental work is needed to test this hypothesis, our work has expanded the view of normative and neural decision making as dynamic processes for both deliberation and commitment.
Materials and methods
Normative decision thresholds from dynamic programming
Here we detail the dynamic programming tools required to find normative decision thresholds. For the free-response tasks we consider, an observer gathers a sample of evidence $\xi$, uses the log-likelihood ratio (LLR) $y=\ln\frac{\mathrm{Pr}(s_{+}|\xi)}{\mathrm{Pr}(s_{-}|\xi)}$ as their ‘belief’, and sets potentially time-dependent decision thresholds, $\theta_{\pm}(t)$, that determine when they will stop accumulating evidence and commit to a choice. When $y\ge\theta_{+}(t)$ ($y\le\theta_{-}(t)$), the observer chooses the state $s_{+}$ ($s_{-}$). In general, an observer is free to set $\theta_{\pm}(t)$ any way they wish. However, a normative observer sets these thresholds to optimize an objective function, which we assume throughout this study to be the trial-averaged reward rate, $\rho$, which is given by Equation 1. In this definition of reward rate, the incremental cost function $c(t)$ accounts for both explicit costs (e.g. paying for observed evidence, metabolic costs of storing belief in working memory) and implicit costs (e.g. opportunity cost). We assume symmetry in the problem (in terms of prior, rewards, etc.) that guarantees the thresholds are symmetric about $y=0$, so that $\theta_{\pm}(t)=\pm\theta(t)$. We derive the optimal threshold policy for a general incremental cost function $c(t)$, but in our results we consider only constant cost functions $c(t)=c$. Although the space of possible cost functions is large, restricting to a constant value ensures that threshold dynamics are governed purely by task and reward structure and not by an arbitrary evidence cost function.
To find the thresholds $\pm\theta(t)$ that optimize the reward rate given by Equation 1, we start with a discrete-time task where observations are made every $\delta t$ time units, and we simplify the problem so the length of each trial is fixed and independent of the decision time $T_{d}$. This simplification makes the denominator of $\rho$ constant with respect to trial-to-trial variability, meaning we can optimize reward rate by maximizing the numerator $\langle R\rangle-\langle C(T_{d})\rangle$. Under this simplified task structure, we suppose the observer has just drawn a sample $\xi_{n}$ and updated their state likelihood to $p_{n}=\frac{1}{1+e^{-y_{n}}}$, where $y_{n}=\ln\frac{\mathrm{Pr}(s_{+}|\xi_{1:n})}{\mathrm{Pr}(s_{-}|\xi_{1:n})}$ is the discrete-time LLR given by Equation 2. At this moment, the observer takes one of three possible actions:
Stop accumulating evidence and commit to choice $s_{+}$. This action has value equal to the average reward for choosing $s_{+}$, which is given by:
(9) $V_{+}(p_{n})=R_{c}p_{n}+R_{i}(1-p_{n}),$
where $R_{c}$ is the value for a correct choice and $R_{i}$ is the value for an incorrect choice.
Stop accumulating evidence and commit to choice $s_{-}$. By assuming the reward for correctly (or incorrectly) choosing $s_{+}$ is the same as choosing $s_{-}$, the value of this action is obtained by symmetry from:
(10) $V_{-}(p_{n})=R_{c}(1-p_{n})+R_{i}p_{n}.$
Wait to commit to a choice and draw an additional piece of evidence. Choosing this action means the observer expects their future overall value $V$ to be greater than their current value, less the cost incurred by waiting for additional evidence. Therefore, the value of this choice is given by:
(11) $V_{w}(p_{n})=\langle V(p_{n+1})\,|\,p_{n}\rangle_{p_{n+1}}-c(t)\delta t,$
where $c$ is the incremental evidence cost function; because we assume that the incremental cost is constant, this simplifies to $c(t)\delta t=c\delta t$.
Given the action values from Equations 9–11, the observer takes the action with maximal value, resulting in their overall value function
(12) $V(p_{n})=\max\{V_{+}(p_{n}),\,V_{-}(p_{n}),\,V_{w}(p_{n})\}.$
Because the value-maximizing action depends on the state likelihood, $p_{n}$, the regions of likelihood space where each action is optimal divide the space into three disjoint regions. The boundaries of these regions are exactly the optimal decision thresholds, which can be mapped to LLR-space to obtain $\pm\theta(t)$. To find these thresholds numerically, we started by discretizing the state likelihood space $p_{n}$. Because the state likelihood $p_{n}$ is restricted to values between 0 and 1, whereas the log-likelihood ratio $y_{n}=\ln\frac{p_{n}}{1-p_{n}}$ is unbounded, we chose to formulate all the components of Bellman’s equation in terms of $p_{n}$ to minimize truncation errors. We then proceeded by using backward induction in time, starting at the total trial length $t=T_{t}$. At this moment in time, it is impossible to wait for more evidence, so the value function in Equation 12 does not depend on the future. This approach implies that the value function is:
(13) $V(p_{n})=\max\{V_{+}(p_{n}),\,V_{-}(p_{n})\}\quad\text{at }t=T_{t}.$
Once the value is calculated at this time point, it can be used as the future value at time point $t=T_{t}-\delta t$.
To find the decision thresholds for the desired tasks where $T_{t}$ is not fixed, we must optimize both the numerator and denominator of Equation 1. To account for the variable trial length, we adopt techniques from average-reward reinforcement learning (Mahadevan, 1996) and penalize each action by its associated waiting time scaled by the reward rate $\rho$ (i.e., $\langle t_{i}\rangle\rho$ for committing to $s_{+}$ or $s_{-}$ and $\rho\delta t$ for waiting). This modification makes all trials effectively the same length and allows us to use the same approach used to derive Equation 12 (Drugowitsch et al., 2012). The new overall value function is given by Equation 3:
(3) $V(p_{n};\rho)=\max\{V_{+}(p_{n})-\langle t_{i}\rangle\rho,\;V_{-}(p_{n})-\langle t_{i}\rangle\rho,\;\langle V(p_{n+1};\rho)\,|\,p_{n}\rangle_{p_{n+1}}-c(t)\delta t-\rho\delta t\}.$
To use this new value function to numerically find the decision thresholds, we must note two new complications that arise from moving away from fixed-length trials. First, we no longer have a natural end time from which to start backward induction. We remedy this issue by following the approach of Drugowitsch et al., 2012 and artificially setting a final trial time $T_{f}$ that is far enough in the future so that decision times of this length are highly unlikely and do not impact the response distributions. If we desire accurate thresholds up to a time $t$, we set $T_{f}=5t$, which produces an accurate solution while avoiding the large numerical overhead incurred by a longer simulation time. In our simulations, we set $t$ based on when we expect most decisions to be made. Second, the value function now depends on the unknown quantity $\rho$, resulting in a co-optimization problem. To address this complication, note that when $\rho$ is maximized, our derivation requires $V\left(p_{0}=\frac{1}{2};\rho\right)=0$ for a consistent Bellman’s equation (Drugowitsch et al., 2012). We exploit this consistency requirement by fixing an initial reward rate $\rho_{0}$, solving the value function through backward induction, calculating $V\left(p_{0}=\frac{1}{2};\rho_{0}\right)$, and updating the value of $\rho$ via a root-finding scheme. For more details on numerical implementation, see https://github.com/nwbarendregt/AdaptNormThresh; Thresh, 2022.
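The backward induction and root-finding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the linked repository: it assumes binary observations that are correct with probability `q` (so the belief moves on a regular LLR lattice and no interpolation is needed, whereas the text discretizes the state likelihood), a constant cost, $R_c\ge R_i$, and illustrative parameter values. The names `start_value` and `solve_reward_rate` are hypothetical.

```python
import math

def start_value(rho, q=0.6, R_c=1.0, R_i=0.0, c=0.0, dt=0.05,
                t_iti=1.0, n_steps=200, K=40):
    """Backward induction for a fixed reward rate rho.

    Returns V(y=0, t=0; rho) and the upper threshold (LLR units) at each step.
    Assumption: binary evidence with reliability q, so the LLR jumps by +/- step.
    """
    step = math.log(q / (1.0 - q))
    ys = [k * step for k in range(-K, K + 1)]        # LLR lattice
    ps = [1.0 / (1.0 + math.exp(-y)) for y in ys]    # state likelihoods
    # Value of committing to the better option, penalised by rho * inter-trial time.
    commit = [max(p, 1 - p) * R_c + min(p, 1 - p) * R_i - rho * t_iti for p in ps]
    V = commit[:]            # at the artificial final time the observer must commit
    thresholds = [math.inf] * n_steps
    for n in range(n_steps - 1, -1, -1):
        wait = []
        for k, p in enumerate(ps):
            p_up = p * q + (1 - p) * (1 - q)         # predictive prob. of a "+" sample
            v_up = V[min(k + 1, 2 * K)]              # clamp at the lattice edges
            v_dn = V[max(k - 1, 0)]
            wait.append(p_up * v_up + (1 - p_up) * v_dn - (c + rho) * dt)
        V = [max(a, w) for a, w in zip(commit, wait)]
        # Upper threshold: smallest non-negative LLR where committing is at least as good.
        for k in range(K, 2 * K + 1):
            if commit[k] >= wait[k]:
                thresholds[n] = ys[k]
                break
    return V[K], thresholds

def solve_reward_rate(tol=1e-6):
    """Bisection on rho until V(y=0, t=0; rho) = 0 (the consistency condition)."""
    lo, hi = 0.0, 1.0        # with R_c = 1 and t_iti = 1: V0(lo) > 0 >= V0(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        v0, _ = start_value(mid)
        if v0 > 0.0:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return rho, start_value(rho)[1]

rho_star, theta = solve_reward_rate()
print(f"reward rate ~ {rho_star:.3f}, threshold at t=0 ~ {theta[0]:.3f}")
```

Bisection is used here only because the starting value decreases monotonically in $\rho$; any bracketing root finder would serve the same purpose. Thresholds near the artificial final time collapse to zero and should be discarded, in line with the $T_f=5t$ rule above.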
Dynamic context 2AFC tasks
For all dynamic context tasks, we assume that observations follow a Gaussian distribution, so that $\xi|s_{\pm}\sim\mathcal{N}(\pm\mu,\sigma^{2})$. Using the Functional Central Limit Theorem, one can show (Bogacz et al., 2006) that in the continuous-time limit, the belief $y$ evolves according to a stochastic differential equation:
(14) $dy=\pm m\,dt+\sqrt{2m}\,dW_{t}.$
In Equation 14, $m=\frac{2\mu^{2}}{\sigma^{2}}$ is the scaled signal-to-noise ratio (SNR) given by the observation distribution $\xi|s_{\pm}\sim\mathcal{N}(\pm\mu,\sigma^{2})$, $dW_{t}$ is a standard increment of a Wiener process, and the sign of the drift $\pm m\,dt$ is given by the sign of the correct choice $s_{\pm}$. To construct Bellman’s equation for this task, we start by discretizing time $t_{1:n}$ and determine the average value gained by waiting and collecting another observation given by Equation 4:
(4) $\langle V(p_{n+1})\,|\,p_{n}\rangle_{p_{n+1}}=\int_{0}^{1}V(p_{n+1})\,f_{p}(p_{n+1}\,|\,p_{n})\,dp_{n+1},$
where $p_{n}=\mathrm{Pr}(s_{+}|\xi_{1:n})$ is the probability the environment is in state $s_{+}$ given $n$ pieces of evidence. The main difficulty in computing this expectation is computing the conditional probability distribution $f_{p}(p_{n+1}\,|\,p_{n})$, which we call the likelihood transfer function. Once we construct the likelihood transfer function, we can use our discretization of the state likelihood space $p_{n}$ to evaluate the integral in Equation 4 using any standard numerical quadrature scheme. To compute this transfer function, we can start by using the definition of the LLR $y_{n}$ and leveraging the relationship between $p_{n}$ and $y_{n}$ to find $p_{n+1}$ as a function of the observation $\xi_{n+1}$:
(15) $p_{n+1}=\frac{1}{1+e^{-y_{n+1}}}=\frac{1}{1+\frac{1-p_{n}}{p_{n}}\exp\left(-\frac{2\mu}{\sigma^{2}}\xi_{n+1}\right)}.$
Note that we used the fact that in discrete time with a time step $\delta t$, the observations are distributed as $\xi|s_{\pm}\sim\mathcal{N}(\pm\mu\delta t,\sigma^{2}\delta t)$. The relationship between $\xi_{n+1}$ and $p_{n+1}$ in Equation 15 can be inverted to obtain:
$\xi_{n+1}=\frac{\sigma^{2}}{2\mu}\ln\frac{p_{n+1}(1-p_{n})}{p_{n}(1-p_{n+1})}.$
With this relationship established, we can find the likelihood transfer function $f_{p}(p(\xi_{1:n+1})\,|\,p(\xi_{1:n}))$ by finding the observation transfer function $f_{\xi}(\xi(p_{n+1})\,|\,\xi(p_{n}))$, which by independence of the samples is simply $f_{\xi}(\xi_{n+1})$, and performing a change of variables. With probability $p_{n}$, $\xi_{n+1}$ will be drawn from the normal distribution $\mathcal{N}(+\mu\delta t,\sigma^{2}\delta t)$, and with probability $1-p_{n}$, $\xi_{n+1}$ will be drawn from the normal distribution $\mathcal{N}(-\mu\delta t,\sigma^{2}\delta t)$. This immediately provides the observation transfer function by marginalizing:
$f_{\xi}(\xi_{n+1})=\frac{1}{\sqrt{2\pi\sigma^{2}\delta t}}\left[p_{n}\exp\left(-\frac{(\xi_{n+1}-\mu\delta t)^{2}}{2\sigma^{2}\delta t}\right)+(1-p_{n})\exp\left(-\frac{(\xi_{n+1}+\mu\delta t)^{2}}{2\sigma^{2}\delta t}\right)\right].$
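Performing the change of variables using the derivative $\frac{d\xi_{n+1}}{dp_{n+1}}=\frac{\sigma^{2}}{2p_{n+1}\mu-2p_{n+1}^{2}\mu}>0$ yields the transfer function
(16) $f_{p}(p_{n+1}\,|\,p_{n})=\frac{\sigma^{2}}{2\mu\,p_{n+1}(1-p_{n+1})}\cdot\frac{1}{\sqrt{2\pi\sigma^{2}\delta t}}\left[p_{n}\exp\left(-\frac{(\xi(p_{n+1})-\mu\delta t)^{2}}{2\sigma^{2}\delta t}\right)+(1-p_{n})\exp\left(-\frac{(\xi(p_{n+1})+\mu\delta t)^{2}}{2\sigma^{2}\delta t}\right)\right],$
where $\xi(p_{n+1})=\frac{\sigma^{2}}{2\mu}\ln\frac{p_{n+1}(1-p_{n})}{p_{n}(1-p_{n+1})}$.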
Note that Equation 16 is equivalent to the likelihood transfer function given by Equation 16 in Drugowitsch et al., 2012 for the case of $m=1$. Combining Equation 14 and Equation 16, we can construct Bellman’s equation for any dynamic context task.
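As a sanity check on the change of variables, the sketch below evaluates a likelihood transfer function for Gaussian evidence and verifies numerically that it integrates to one and that the expected next likelihood equals the current one (the martingale property of a Bayesian posterior). It assumes observations distributed as $\mathcal{N}(\pm\mu\delta t,\sigma^{2}\delta t)$ and the inverse map $\xi=\frac{\sigma^{2}}{2\mu}\ln\frac{p_{n+1}(1-p_{n})}{p_{n}(1-p_{n+1})}$; the function names and parameter values are illustrative, and the midpoint rule stands in for whichever quadrature scheme is preferred.

```python
import math

def xi_from_p(p_next, p, mu, sigma):
    """Observation that moves the state likelihood from p to p_next."""
    return sigma**2 / (2 * mu) * math.log(p_next * (1 - p) / (p * (1 - p_next)))

def transfer(p_next, p, mu=1.0, sigma=1.0, dt=0.05):
    """Likelihood transfer function f_p(p_next | p) for Gaussian evidence."""
    xi = xi_from_p(p_next, p, mu, sigma)
    var = sigma**2 * dt
    norm = 1.0 / math.sqrt(2 * math.pi * var)
    # Mixture of the two observation densities, weighted by the current belief.
    f_xi = norm * (p * math.exp(-(xi - mu * dt)**2 / (2 * var))
                   + (1 - p) * math.exp(-(xi + mu * dt)**2 / (2 * var)))
    # Jacobian d(xi)/d(p_next) of the change of variables.
    jacobian = sigma**2 / (2 * mu * p_next * (1 - p_next))
    return jacobian * f_xi

def expect(g, p, n=4000, **kw):
    """Midpoint-rule estimate of E[g(p_next) | p] over p_next in (0, 1)."""
    h = 1.0 / n
    return sum(g((i + 0.5) * h) * transfer((i + 0.5) * h, p, **kw)
               for i in range(n)) * h

total_mass = expect(lambda x: 1.0, 0.7)   # should be close to 1
mean_next = expect(lambda x: x, 0.7)      # should be close to 0.7
print(total_mass, mean_next)
```

The same `expect` helper, applied to a tabulated value function instead of the test functions used here, gives the value of waiting needed for backward induction.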
Reward-change task thresholds
For the reward-change task, we fixed punishment $R_{i}=0$ and allowed the reward $R_{c}$ to be a Heaviside function given by Equation 5:
(5) $R_{c}(t)=\begin{cases}R_{1},&t<0.5,\\R_{2},&t\ge 0.5.\end{cases}$
In Equation 5, there is a single switch in rewards between pre-change reward $R_{1}$ and post-change reward $R_{2}$. This change occurs at $t=0.5$. Substituting this reward function into Equation 3 allows us to find the normative thresholds for this task as a function of $R_{1}$ and $R_{2}$.
For the inferred reward change task, we allowed the reward value $R(t)\in\{R_{H},R_{L}\}$ to be controlled by a continuous-time two-state Markov process with transition (hazard) rate $h$ between rewards $R_{H}\ge R_{L}$. The hazard rate $h$ governs the probability of switching between $R_{H}$ and $R_{L}$:
$\mathrm{Pr}(R(t+\delta t)=R_{H}\,|\,R(t)=R_{L})=\mathrm{Pr}(R(t+\delta t)=R_{L}\,|\,R(t)=R_{H})=h\,\delta t+o(\delta t),$
where $o(\delta t)$ represents a function $g(\delta t)$ with the property $\lim_{\delta t\downarrow 0}\frac{g(\delta t)}{\delta t}=0$ (i.e., all other terms are of smaller order than $\delta t$). In addition, the state of this Markov process must be inferred from evidence $\eta$ that is independent of the environment’s state evidence $\xi$ (i.e., the correct choice). For simplicity, we assume that the reward-evidence source is also Gaussian-distributed such that $\eta|R_{H/L}\sim\mathcal{N}(\pm\mu_{R},\sigma_{R}^{2})$ with quality $m_{R}=\frac{2\mu_{R}^{2}}{\sigma_{R}^{2}}$. Glaze et al., 2015; Veliz-Cuba et al., 2016; Barendregt et al., 2019 have shown that the belief $y_{R}=\ln\frac{\mathrm{Pr}(R(t)=R_{H}|\eta)}{\mathrm{Pr}(R(t)=R_{L}|\eta)}$ for such a dynamic state inference process is given by the modified DDM
$dy_{R}=x(t)\,m_{R}\,dt-2h\sinh(y_{R})\,dt+\sqrt{2m_{R}}\,dW_{t},$
where $x(t)\in\{\pm 1\}$ is a telegraph process that mirrors the state of the reward process (i.e., $x(t)=1$ when $R(t)=R_{H}$ and $x(t)=-1$ when $R(t)=R_{L}$). With this belief over reward state, we must also modify the values $V_{+}(p_{n};\rho)$ and $V_{-}(p_{n};\rho)$ to account for the uncertainty in $R_{c}$. Defining $q=\frac{e^{y_{R}}}{1+e^{y_{R}}}$ as the reward likelihood gives
$V_{+}(p_{n};\rho)=\left[R_{H}q+R_{L}(1-q)\right]p_{n}-\langle t_{i}\rangle\rho,\qquad V_{-}(p_{n};\rho)=\left[R_{H}q+R_{L}(1-q)\right](1-p_{n})-\langle t_{i}\rangle\rho,$
where we have fixed ${R}_{i}=0$ for simplicity.
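To make the inferred reward process concrete, the following sketch simulates a two-state telegraph process with hazard rate $h$ and integrates a belief $y_R$ with the Euler-Maruyama method. It assumes the nonlinear update $dy_{R}=[x(t)m_{R}-2h\sinh(y_{R})]dt+\sqrt{2m_{R}}\,dW_{t}$, which is my reading of the cited dynamic-inference models rather than a form confirmed here; the function name, parameter values, and the summary statistic returned are all illustrative.

```python
import math
import random

def simulate_reward_belief(h=0.5, m_R=5.0, dt=1e-3, T=20.0, seed=0):
    """Simulate the reward state x(t) and the observer's belief y_R(t).

    Returns the fraction of time steps on which sign(y_R) matches x(t)
    and the final reward likelihood q = e^y / (1 + e^y).
    """
    rng = random.Random(seed)
    x, y = 1, 0.0                      # start in the high-reward state, flat belief
    n_steps = int(T / dt)
    matches = 0
    for _ in range(n_steps):
        if rng.random() < h * dt:      # switch with probability h*dt (+ o(dt))
            x = -x
        dW = rng.gauss(0.0, math.sqrt(dt))
        # Drift toward the current state, discounted by the expected switching.
        y += (x * m_R - 2.0 * h * math.sinh(y)) * dt + math.sqrt(2.0 * m_R) * dW
        if (y > 0) == (x > 0):
            matches += 1
    q = 1.0 / (1.0 + math.exp(-y))
    return matches / n_steps, q

frac_correct, q_final = simulate_reward_belief()
R_H, R_L = 10.0, 1.0                   # illustrative reward values
expected_reward = q_final * R_H + (1.0 - q_final) * R_L
print(frac_correct, q_final, expected_reward)
```

The last line shows how the reward likelihood $q$ enters the commitment values: the expected reward for a correct choice is the $q$-weighted average of $R_H$ and $R_L$.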
SNR-change task thresholds
For the SNR-change task, we allowed the task difficulty $m=\frac{2\mu^{2}}{\sigma^{2}}$ to vary over a single trial by making $\mu(t)$ a time-dependent step function given by Equation 6:
(6) $\mu(t)=\begin{cases}\mu_{1},&t<0.5,\\\mu_{2},&t\ge 0.5.\end{cases}$
In Equation 6, there is a single switch in evidence quality between pre-change quality $\mu_{1}$ and post-change quality $\mu_{2}$. This change occurs at $t=0.5$. Substituting this quality time series into the likelihood transfer function in Equation 16 allows us to find the normative thresholds for this task as a function of $\mu_{1}$ and $\mu_{2}$. This modification necessitates that the transfer function $f_{p}$ also be a function of time; however, because the quality change points are known in advance to the observer, we can simply change between different transfer functions at the specified quality changes.
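Switching between transfer functions at a known change point amounts to looking up a piecewise-constant schedule at each backward-induction step. A minimal sketch, assuming a single change at $t=0.5$ with the post-change value applying from the change point onward (the convention at exactly $t=0.5$ is an assumption), and with hypothetical names and step size:

```python
def quality_schedule(mu1, mu2, dt=0.005, T=1.0, t_change=0.5):
    """Evidence strength mu to use in the transfer function at each time step."""
    n_steps = round(T / dt)
    k_change = round(t_change / dt)    # integer index avoids float comparisons
    return [mu1 if k < k_change else mu2 for k in range(n_steps)]

# Low-to-high evidence quality: mu = 1 before the change, mu = 2 afterwards.
schedule = quality_schedule(1.0, 2.0)
print(schedule[99], schedule[100])
```

During backward induction, step `k` would then build its transfer function from `schedule[k]`. The same lookup works for a step change in the reward $R_c(t)$.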
Reward-change task model performance
Here we detail the three models used to compare observer performance in the reward-change task, as well as the noise filtering process used to generate synthetic data. For the noisy Bayesian model, the observer uses the thresholds $\pm\theta(t)$ obtained via dynamic programming, thus making the observer a noisy ideal observer. For the constant-threshold model, the observer uses a constant threshold $\pm\theta(t)=\pm\theta_{0}$, which is predicted to be optimal only in very simple, static decision environments with only two states $s$. Both the noisy Bayesian and constant-threshold models also use a noisy perturbation of the LLR $\tilde{y}=y+\sigma_{y}Z$ as their belief, where $\sigma_{y}$ is the strength of the noise and $Z$ is a sample from a standard normal distribution. In continuous time, this perturbation involves adding an independent Wiener process to Equation 14:
where $dW_{t}^{\prime}$ is an independent Wiener process with strength $\sigma_{y}$. The UGM, being a phenomenological model, behaves differently from the other models. The UGM belief $E$ is the output of the noisy low-pass filter given by Equation 7.
To add additional noise to the UGM’s belief variable $E$, we simply allowed $\sigma_{y}>0$ in the low-pass filter in Equation 7.
In addition to the inference noise with strength $\sigma_{y}$, we also filtered each process through a Gaussian response-time filter with zero mean and standard deviation $\sigma_{mn}$. Under this response-time filter, if the model predicted a response time $T$, the measured response time $\tilde{T}$ was drawn from a normal distribution centered at $T$ with standard deviation $\sigma_{mn}$. If the response time $\tilde{T}$ was drawn outside of the simulation’s time discretization (i.e., if $\tilde{T}<0$ or $\tilde{T}>\frac{T_{f}}{5}$), we redrew $\tilde{T}$ until it fell within the discretization. This filter was chosen to represent both ‘early responses’ caused by attentional lapses and ‘late responses’ caused by motor processing delays between the formation of a choice in the brain and the physical response. We chose to add these two sources of noise after optimizing each model to maximize average reward rate, rather than re-optimizing each model after adding these additional noise sources. Although we could have re-optimized each model to maximize performance across noise realizations, we were interested in how the models responded to perturbations that drove their performance to be sub-optimal (but possibly near-optimal).
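The response-time filter with redrawing is a rejection sampler for a truncated normal distribution. A minimal sketch, with the hypothetical name `filter_response_time` and illustrative values; `t_max` plays the role of $T_f/5$:

```python
import random

def filter_response_time(T, sigma_mn, t_max, rng=random):
    """Add zero-mean Gaussian motor noise to T, redrawing until 0 <= result <= t_max."""
    if not 0.0 <= T <= t_max:
        raise ValueError("model response time must lie inside the discretization")
    if sigma_mn == 0:
        return T                       # no motor noise: response time is unchanged
    while True:
        T_noisy = rng.gauss(T, sigma_mn)
        if 0.0 <= T_noisy <= t_max:
            return T_noisy

rng = random.Random(1)
samples = [filter_response_time(0.1, 0.25, 2.0, rng) for _ in range(1000)]
print(min(samples), max(samples), sum(samples) / len(samples))
```

One consequence of redrawing is worth keeping in mind: for predicted response times close to zero, the truncation shifts the mean of the measured response times upward, so the filter is not exactly zero-mean near the boundaries.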
To compare model performance on the reward-change task, we first fixed the value of the pre-change reward $R_{1}$ (and set $R_{1}+R_{2}=11$ to find the post-change reward) and tuned each model to achieve optimal reward rate with no additional noise in both the inference and response processes. Bellman’s equation outputs both the optimal normative thresholds and reward rate. For the constant-threshold model and the UGM, we approximated the maximal performance of each model by using a grid search over each model’s parameters to find the model tuning that yielded the highest average reward rate. After tuning all models for a given reward structure, we filtered them through both the sensory ($\sigma_{y}$) and motor ($\sigma_{mn}$) noise sources without retuning the models to account for this additional noise. When generating noisy synthetic data from these models, we generated 100 synthetic subjects, each with sampled values of $\sigma_{y}$ and $\sigma_{mn}$. For each synthetic subject with noise parameter sample ($\sigma_{y}$, $\sigma_{mn}$), we defined the ‘noise strength’ of that subject’s noise to be the ratio
where $\overline{\sigma}_{y}=5$ and $\overline{\sigma}_{mn}=0.25$ are the maximum values of belief noise and motor noise considered, respectively. Using this metric, noise strength is defined between 0 and 1. Additionally, the maximum noise levels $\overline{\sigma}_{y}$ and $\overline{\sigma}_{mn}$ were chosen such that a noise strength of 0.5 is approximately equivalent to the fitted noise strength obtained from tokens task subject data. We plot the response distributions using noise strengths of 0, 0.5, and 1 in our results. To compare the performance of each model after being corrupted by noise, we then generated 1000 trials for each subject and had each simulated subject repeat the same block of trials three times, once for each model. This process ensured that the only difference between model performance would come from their distinct threshold behaviors, because each model was taken to be equally noisy and was run using the same stimuli.
Tokens task
Normative model for the tokens task
For the tokens task, observations in the form of token movements are Bernoulli distributed with parameter $p=0.5$ and occur every 200 ms. Once a subject committed to a decision, the token movements continued at a faster rate until the entire animation had finished. This post-decision token acceleration was 170 ms per movement in the ‘slow’ version of the task and 20 ms per movement for the ‘fast’ version of the task. Because of the stimulus structure, one can show using a combinatorial argument (Cisek et al., 2009) that the likelihood function $p_{n}$ is given by Equation 8. Constructing the likelihood transfer function $f_{p}$ required for Bellman’s equation is also simplified from the Gaussian 2AFC tasks, as there are only two possible likelihoods that one can transition to after observing a token movement:
Combining Equation 8 and Equation 17, we can fully construct Bellman’s equation for the tokens task. While the timings of the token movements, post-decision token acceleration, and inter-trial interval are fixed, we let the reward $R_{c}$ and cost function $c$ be free parameters to control the different threshold dynamics of the model.
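The combinatorial structure of the tokens task can be illustrated with a short sketch. It assumes 15 tokens in total (the count used in the task of Cisek et al., 2009; it is not stated in this section), each jumping to either side with probability 0.5, and computes the probability that the ‘+’ (here, right) target ends with the majority directly from a binomial tail rather than from the closed form of Equation 8. The function names are hypothetical.

```python
from math import comb

N_TOKENS = 15  # assumption: total number of tokens in the task

def prob_right(n_right, n_left, n_tokens=N_TOKENS):
    """Probability that the right target ends with the majority of tokens."""
    remaining = n_tokens - n_right - n_left
    need = n_tokens // 2 + 1 - n_right   # further rightward jumps needed to win
    if need <= 0:
        return 1.0                       # majority already reached
    if need > remaining:
        return 0.0                       # majority no longer reachable
    # Binomial tail: at least `need` of the remaining fair jumps go right.
    return sum(comb(remaining, k) for k in range(need, remaining + 1)) / 2 ** remaining

def next_likelihoods(n_right, n_left):
    """The only two likelihoods reachable after one more token movement."""
    return prob_right(n_right + 1, n_left), prob_right(n_right, n_left + 1)

p_now = prob_right(3, 2)
p_up, p_down = next_likelihoods(3, 2)
print(p_now, p_up, p_down)
```

Because each jump is fair, the current likelihood is the average of the two reachable likelihoods, which is what makes the expectation in Bellman’s equation a two-term sum in this task.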
Model fitting and comparison
We used three models to fit the subject response data provided by Cisek et al., 2009: the noisy Bayesian model ($k=4$ parameters), the constant-threshold model ($k=3$ parameters), and the UGM ($k=5$ parameters) (Table 1). To adapt the continuous-time models to this discrete-time task, we simply changed the time step to match the time between token movements ($\delta t=200$ ms). To fit each model, we took the subject response time distributions as our objective function and used Markov Chain Monte Carlo (MCMC) with a standard Gaussian proposal distribution to generate an approximate posterior made up of 10,000 samples. For more details as to our specific implementation of MCMC for this data, see the MATLAB code available at https://github.com/nwbarendregt/AdaptNormThresh (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2; Thresh, 2022). We held out 2 of the 22 subjects to use as training data when tuning the covariance matrix of the proposal distribution for each model, and performed the model fitting and comparison analysis on the remaining 20 subjects. Using the approximate posterior obtained via MCMC for each subject and model, we calculated AICc using the formula
(18) $\mathrm{AICc}=2k-2\ln(\widehat{L})+\frac{2k^{2}+2k}{n-k-1}.$
In Equation 18, $k$ is the number of parameters of the model, $\widehat{L}$ is the likelihood of the model evaluated at the maximum-likelihood parameters, and $n$ is the number of responses in the subject data (Cavanaugh, 1997; Burnham and Anderson, 2002). Because each subject performed different numbers of trials, using AICc allowed us to normalize results to account for the different data sizes; note that for many responses (i.e., for large $n$), AICc converges to the standard definition of AIC. For the second model selection metric, we measured how well each fitted model predicted the trial-by-trial responses of the data by calculating the average RMSE between the response times from the data and the response times predicted by each model. To measure the difference between a subject’s response time distribution and the fitted model’s distribution (Figure 6—figure supplement 1), we used Kullback-Leibler (KL) divergence:
In Equation 19, $i$ is a time index representing the number of observed token movements, ${\text{RT}}_{D}(i)$ is the probability of responding after $i$ token movements from the subject data, and ${\text{RT}}_{M}(i)$ is the probability of responding after $i$ token movements from the model’s response distribution. Smaller values of KL divergence indicate that the model’s response distribution is more similar to the subject data.
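The fitting and comparison steps can be sketched in a few lines. This is an illustration under stated assumptions, not the MATLAB implementation referenced above: `metropolis` is a generic random-walk Metropolis sampler with independent Gaussian proposals (the text tunes a full proposal covariance), `aicc` uses the standard corrected-AIC formula $2k-2\ln\widehat{L}+\frac{2k^{2}+2k}{n-k-1}$, and `kl_divergence` takes the subject distribution as the reference, which is an assumption about the direction of the divergence. The toy log-posterior and data are hypothetical.

```python
import math
import random

def metropolis(log_post, theta0, prop_sd, n_samples=10000, seed=0):
    """Random-walk Metropolis with independent Gaussian proposals."""
    rng = random.Random(seed)
    theta, lp = list(theta0), log_post(theta0)
    chain = []
    for _ in range(n_samples):
        cand = [t + rng.gauss(0.0, s) for t, s in zip(theta, prop_sd)]
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            theta, lp = cand, lp_cand
        chain.append(list(theta))
    return chain

def aicc(log_lik_max, k, n):
    """Corrected AIC from the maximised log-likelihood, k parameters, n responses."""
    return 2 * k - 2 * log_lik_max + (2 * k**2 + 2 * k) / (n - k - 1)

def rmse(rt_data, rt_model):
    """Root-mean-squared error between observed and predicted response times."""
    return math.sqrt(sum((d - m)**2 for d, m in zip(rt_data, rt_model)) / len(rt_data))

def kl_divergence(rt_data, rt_model):
    """KL divergence from the model's response distribution to the subject's."""
    return sum(d * math.log(d / m) for d, m in zip(rt_data, rt_model) if d > 0)

# Toy example: infer the mean of unit-variance Gaussian data.
data = [1.0, 2.0, 3.0, 2.0]
log_post = lambda th: -0.5 * sum((x - th[0])**2 for x in data)
chain = metropolis(log_post, [0.0], [0.5], n_samples=5000)
post_mean = sum(s[0] for s in chain[1000:]) / len(chain[1000:])
print(post_mean, aicc(-100.0, 4, 50), kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

In practice the model response distribution should assign nonzero probability wherever the subject responded; otherwise the divergence as written here is undefined.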
Code availability
See https://github.com/nwbarendregt/AdaptNormThresh (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2; Thresh, 2022) for the MATLAB code used to generate all results and figures.
Data availability
MATLAB code used to generate all results and figures is available at https://github.com/nwbarendregt/AdaptNormThresh (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2).
References

Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience 25:201–212. https://doi.org/10.1038/s41593-021-01007-z

Acquisition of decision making criteria: reward rate ultimately beats accuracy. Attention, Perception & Psychophysics 73:640–657. https://doi.org/10.3758/s13414-010-0049-7

Analyzing dynamic decision-making models using Chapman-Kolmogorov equations. Journal of Computational Neuroscience 47:205–222. https://doi.org/10.1007/s10827-019-00733-5

A theoretical analysis of the reward rate optimality of collapsing decision criteria. Attention, Perception, & Psychophysics 82:1520–1534. https://doi.org/10.3758/s13414-019-01806-4

The neural basis of the speed-accuracy tradeoff. Trends in Neurosciences 33:10–16. https://doi.org/10.1016/j.tins.2009.09.002

Book: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York Inc: Springer.

Psychological models of deferred decision making. Journal of Mathematical Psychology 32:91–134. https://doi.org/10.1016/0022-2496(88)90042-9

The urgency-gating model can explain the effects of early evidence. Psychonomic Bulletin & Review 22:1830–1838. https://doi.org/10.3758/s13423-015-0851-2

Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters 33:201–208. https://doi.org/10.1016/S0167-7152(96)00128-9

Speed-accuracy tradeoffs in animal decision making. Trends in Ecology & Evolution 24:400–407. https://doi.org/10.1016/j.tree.2009.02.010

Decisions in changing conditions: the urgency-gating model. The Journal of Neuroscience 29:11560–11571. https://doi.org/10.1523/JNEUROSCI.1844-09.2009

Linking biomechanics and ecology through predator-prey interactions: flight performance of dragonflies and their prey. The Journal of Experimental Biology 215:903–913. https://doi.org/10.1242/jeb.059394

Evidence for time-variant decision making. The European Journal of Neuroscience 24:3628–3641. https://doi.org/10.1111/j.1460-9568.2006.05221.x

The cost of accumulating evidence in perceptual decision making. The Journal of Neuroscience 32:3612–3628. https://doi.org/10.1523/JNEUROSCI.4010-11.2012

Conference: Optimal decision-making with time-varying evidence reliability. In Advances in neural information processing systems. pp. 748–756.

Report: Notes on normative solutions to the speed-accuracy tradeoff in perceptual decision-making. FENS-Hertie Winter School.

A parameter recovery assessment of time-variant models of decision-making. Behavior Research Methods 52:193–206. https://doi.org/10.3758/s13428-019-01218-0

Conference: Sequential Hypothesis Testing under Stochastic Deadlines. Advances in Neural Information Processing Systems.

Evidence integration and decision confidence are modulated by stimulus consistency. Nature Human Behaviour 6:988–999. https://doi.org/10.1038/s41562-022-01318-6

The neural basis of decision making. Annual Review of Neuroscience 30:535–574. https://doi.org/10.1146/annurev.neuro.29.051605.113038

Optimal models of decision-making in dynamic environments. Current Opinion in Neurobiology 58:54–60. https://doi.org/10.1016/j.conb.2019.06.006

Adaptive neural coding: from biological to behavioral decision-making. Current Opinion in Behavioral Sciences 5:91–99. https://doi.org/10.1016/j.cobeha.2015.08.008

Neural coding of uncertainty and probability. Annual Review of Neuroscience 37:205–220. https://doi.org/10.1146/annurev-neuro-071013-014017

Overcoming indecision by changing the decision boundary. Journal of Experimental Psychology. General 146:776–805. https://doi.org/10.1037/xge0000286

Time-varying decision boundaries: insights from optimality analysis. Psychonomic Bulletin & Review 25:971–996. https://doi.org/10.3758/s13423-017-1340-6

Some task demands induce collapsing bounds: evidence from a behavioral analysis. Psychonomic Bulletin & Review 25:1225–1248. https://doi.org/10.3758/s13423-018-1479-9

Neural representation of task difficulty and decision making during perceptual categorization: a timing diagram. The Journal of Neuroscience 26:8965–8975. https://doi.org/10.1523/JNEUROSCI.1655-06.2006

Models for deferred decision making. Journal of Mathematical Psychology 8:508–538. https://doi.org/10.1016/0022-2496(71)90005-8

A theory of memory retrieval. Psychological Review 85:59–108. https://doi.org/10.1037/0033-295X.85.2.59

Reward rate optimization in two-alternative decision making: empirical tests of theoretical predictions. Journal of Experimental Psychology. Human Perception and Performance 35:1865–1897. https://doi.org/10.1037/a0016926

Reinforcement learning: an introduction. IEEE Transactions on Neural Networks 9:1054. https://doi.org/10.1109/TNN.1998.712192

Optimal policy for value-based decision-making. Nature Communications 7:12400. https://doi.org/10.1038/ncomms12400

Optimal policy for multi-alternative decisions. Nature Neuroscience 22:1503–1511. https://doi.org/10.1038/s41593-019-0453-9

Software: AdaptNormThresh, version swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2. Software Heritage.

Decision making by urgency gating: theory and experimental support. Journal of Neurophysiology 108:2912–2930. https://doi.org/10.1152/jn.01071.2011

Context-dependent urgency influences speed-accuracy tradeoffs in decision-making and movement execution. The Journal of Neuroscience 34:16442–16454. https://doi.org/10.1523/JNEUROSCI.0162-14.2014

Microstimulation of dorsal premotor and primary motor cortex delays the volitional commitment to an action choice. Journal of Neurophysiology 123:927–935. https://doi.org/10.1152/jn.00682.2019

Urgency, leakage, and the relative nature of information processing in decision-making. Psychological Review 128:160–186. https://doi.org/10.1037/rev0000255

Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16:117–186. https://doi.org/10.1214/aoms/1177731118
Decision letter

Peter Latham, Reviewing Editor; University College London, United Kingdom

Timothy E Behrens, Senior Editor; University of Oxford, United Kingdom

Gaurav Malhotra, Reviewer; University of Bristol, United Kingdom
Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Normative Decision Rules in Changing Environments" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Gaurav Malhotra (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
All three reviewers very much liked the paper. It was nice to see the formalism used to solve these problems take a central place in the manuscript, and the huge variability in bounds is something we haven't seen before.
But that didn't stop us from making a huge number of comments – you have either the good luck or the bad luck, depending on your point of view, of being reviewed by experts. Most comments have to do with the presentation: important information was missing (or at least we couldn't find it), and there were even places where we got lost (not a good sign, given that all three of us work in the field). Details follow.
I. The tokens task can be analyzed using the formalism introduced in Equation (3), but it seems pretty far from the "dynamic context" examples emphasized in the bulk of the paper. That doesn't mean the tokens task shouldn't be included. But it does mean we have no evidence one way or the other whether subjects would adopt the highly idiosyncratic boundaries found in simulations (for instance, the infinite threshold boundaries in Figures 2i, 2ii).
You need to be clear about this. The way the paper reads, it sounds like you have provided evidence for the dynamic context setup, when in fact that's not the case. It should be crystal clear that dynamic context problems have not been explored, at least not in the lab. Instead, what you showed is that a normative model can beat a particular heuristic model.
II. The following are technical but important.
1. Adding noise: you are currently optimizing the model parameters/policy before adding sensory and motor noise. Decision-makers could be aware of sensory noise and so could try to optimize their decision processes with that knowledge. Would you be able to also compare the models' performances if they have been optimized to maximize performance in the presence of all noise? From our understanding, this should be feasible for the Const and UGM model, but might be harder for the normative model. Sensory noise could be included by finding Equation (13) that includes such noise, but finding the optimal thresholds once RT noise is included might be prohibitive. This is just a suggestion, not an essential inclusion. However, it might be worth at least discussing the difference between what you do, and being clear on the scenario you considered.
2. Fitting token task data: according to Cisek et al. (2009), the same participants performed both the slow and the fast version of the task. However, their fitted reward magnitudes differ by an order of magnitude between the two conditions (your Figure 6C/F). Is it just that the fitting objective didn't well-constrain these parameters? Given that you use MCMC for model fits, you could compare the parameter posteriors across conditions. Furthermore, how much worse would the model fits become if you would fit both conditions simultaneously and share all parameters that can be meaningfully shared across conditions? In any case, an explanation for this difference should be provided in the manuscript.
3. The dynamic context examples would seem to apply only when subjects take many seconds to make a decision. This would seem to rule out perceptual decision-making tasks. Is this true? If so, you should be upfront about this – so that those who work on perceptual decision-making will know what they're getting into.
4. A known and predictable change in the middle of a task seems somewhat unrealistic. Given that it plays such a central role, concrete examples where this comes up would be very helpful. Or at least you should make a proposal for laboratory experiments where it could come up. The examples in the introduction ("Some of these factors can change quickly and affect our deliberations in real time; e.g., an unexpected shower will send us hurrying down the faster route (Figure 1A), whereas spotting a new ice cream store can make the longer route more attractive.") don't quite fall into the "known and predictable change" category.
III. Better contact with existing literature needs to be made. For instance:
1. Drugowitsch, Moreno-Bote and Pouget (2014) already computed normative decision policies for time-varying SNR, with the difference that they assumed the SNR to follow a stochastic process rather than a known, deterministic time course. Thus, the work is closely related, but not equivalent.
2. Some early models to predict dynamic decision boundaries were proposed by Busemeyer and Rapoport (1988) and Rapoport and Burkheimer (1971) in the context of a deferred decision-making task.
3. One of the earliest models to use dynamic programming to predict nonconstant decision boundaries was Frazier and Yu (2007). Indeed some boundaries predicted by the authors (e.g. Figure 2v) are very similar to boundaries predicted by this model. In fact, the switch from high to low reward used to propose boundaries in Figure 2v can be seen as a "softer" version of the deadline task in Frazier and Yu (2007).
4. Another early observation that time-varying boundaries can account for empirical data was made by Ditterich (2006). Seems highly relevant to the authors' predictions, but is not cited.
5. The authors seem to imply that their results are the first results showing nonmonotonic thresholds. This is not true. See, for example, Malhotra et al. (2018). What is novel here is the specific shape of these nonmonotonic boundaries.
IV. Clarity could be massively improved. If you want to write an unclear paper that is your prerogative. However, if you do, you can't say "Our results can aid experimentalists investigating the nuances of complex decision-making in several ways". It would be difficult to aid experimentalists if they have to struggle to understand the paper.
Below are comments collected from the three reviewers, and more or less collated (so it's possible there's some overlap, and the order isn't exactly optimized). You can, in fact, almost ignore them, if you take into account the main message: all information should be easily accessible, in the main text, and the figures should be easy to make sense of.
As authors, we are aware that the length of replies can sometimes exceed the paper, which is not a good use of anybody's time. Please use your judgment as to which ones you reply to. For instance, if you're going to implement our suggestions, no reason to tell us. Maybe comment only if you have a major objection? Or use some other scheme? What we really care about is that the revised paper is easy to read!
1. When the UGM was introduced, all you say is "urgency-gating models (UGMs) use thresholds that collapse monotonically over time". You include some references, but for the casual reader, it looks like you're considering a generic collapsing bound model. In fact, you're considering a particular shape for the collapsing bound and particular filtering of the evidence. This should be clear. It also needs to be justified. For instance, Voskuilen et al. (J. Math. Psych. 73:59-79, 2016) use a different functional form for the collapsing bound, and they don't filter the evidence. Why use one model over another?
And while we're on the topic of the UGM: Equation (4) low-pass filters the noise-free observer's belief y that reflects all accumulated evidence up to current time t. According to our reading of Cisek et al. (2009), the UGM low-pass filters the momentary internal estimate of sensory information (the Ei(tau) defined below Equation (1); Equations (17)-(19) for the low-pass filter in Cisek et al.) rather than the accumulated estimate of sensory information. Are we misinterpreting Cisek et al. (2009) or your Equation (4)? Either way, please clarify.
In Equation 4 it would be clearer to put E + 0.5*tanh(y) on the RHS. What's the justification for tanh? Why not just filter y? Do you use tanh because the original paper did? If so, you should point that out.
Also, what's y in that equation?
2. Important inline equations need to be displayed. There's nothing more annoying than having to crawl through text to look for the definition of an important symbol. To take a few (hardly exhaustive) examples: $f_{\pm}(\xi)$, $y$, $p_n$, $f_p(p_{n+1} \mid p_n)$. The actual list is much longer. If any symbol is going to be used again, please make it easy to find! This in itself is a reason for displayed equations: you can refer to equation numbers when introducing variables that you haven't used for several pages.
3. A lot of the lines don't have line numbers, which is relevant mainly for us, since it's hard to refer to things without line numbers. This is a bug, but there's a way to fix it. I think (but I'm not sure) that in your latex file you need to leave a space between equations and surrounding text. (Or maybe no space? It's been a while). Although I believe there's a more elegant fix.
4. Not all equations were numbered. We know, in some conventions only equations one refers to are numbered (that's what one of us grew up with), but it turns out to be not so convenient for us as reviewers when we want to refer to an unnumbered equation.
5. Lines 43-6: "Efforts to model decision-making thresholds under dynamic conditions have focused largely on heuristic strategies. For instance, "urgency-gating models" (UGMs) use thresholds that collapse monotonically over time (equivalent to dilating the belief in time) to explain decisions based on time-varying evidence quality".
In fact, a collapsing bound is not necessarily a heuristic; it can be optimal, although the exact shape of the collapsing bound has to be found by dynamic programming. Please reword to reflect this.
6. Line 76: c(t) is barely motivated at all here. It's better motivated in Methods, but its value is very hard to justify. Why not stick with optimizing average reward, for which c=0? And I don't think you ever told us what c(t) was; only that it was constant (although we could have missed its value).
7. Figure 2C would be easier to make sense of if it were square.
8. In general, information is scattered all over the place, and much of it seems to be missing. Each task should be described succinctly in the main text, with enough information to actually figure out what's going on. In addition, there should be a table listing _all_ the parameters; right now the reader has to go to Methods, and even then it seems that many are missing. For instance, we don't think we were ever told the value of tau in Equation 4.
9. Lots of questions/comments about Figure 4:
a. It would be very helpful to include the optimal model. I think NB is the optimal model when σ_y = 0, but I also believe that in most panels σ_y ≠ 0.
b. It would be helpful to emphasize, in the figure caption, that NB with σ_y = 0 is the optimal model. Assuming that's true.
c. Figure 4A: What's the post-reward rate? And please indicate the pre-reward rate at which pre-reward = post-reward. Also, if pre- and post-reward rates sum to 11 (as mentioned in Methods, line 411), why are the curves' minima at around 5 rather than 5.5?
d. Figure 4B: horizontal axis label missing (presumably "pre-reward"?). And we assume you used the following color code: Orange: reward(NB) - reward(Const); violet: reward(NB) - reward(UGM). Correct? Either way, this should be stated in the figure caption.
e. Figure 4C: what are the pre- and post-rewards? And presumably noise strength = σ_y? This should be stated clearly. And more explanation, in the main text, of what "noise strength" is would help.
f. Figure 4F: It is not clear to us why UGM in 0 noise condition have RTs aligned to the time reward increases from $R_1$ to $R_2$. Surely, this model does not take RR into account to compute the thresholds, does it? In fact, looking at Figure 4B, Supplement 1, the thresholds are always highest at t=0. Please clarify.
10. Lines 207-9: "Because the total number of tokens was finite and known to the subject, token movements varied in their informativeness within a trial, yielding a dynamic and history-dependent evidence quality that, in principle, could benefit from adaptive decision processes".
To us, "history-dependent" implies non-Markov, whereas the tokens task is Markov. But maybe that's not what history-dependent means here? This should be clarified.
11. We assume the y-axis in Figure 5i-iv is the difference between the number of tokens on the top and the number on the bottom. This should be stated (if it's true). And please explain how you differentiate between motifs iii and iv. We believe it's the presence of two threshold increases (rather than just one) in motif iv, but we're not sure.
12. What's the reward/punishment structure for the tokens task? It seemed that this was only half explained.
13. Lines 229-232: "To determine the relevance of these adaptive decision strategies to human behavior, we fit discrete-time versions of the noisy Bayesian (four free parameters), constant-threshold (three free parameters), and urgency-gating (five free parameters) models to response-time data from the tokens task collected by Cisek et al. (2009)."
As mentioned above, the parameters should go in a table.
14. You should tell us what V(T_final) is, and why. We believe it's the same as V(0), but we could be wrong.
15. After Equation 11: it says m = 2 mu^2/σ^2. Are these mu and σ different than the ones on line 383? If so, that should be clear. (If not, we're lost.)
16. We looked, but couldn't find, the definition of f_p. We believe it's just a conditional probability,
f_p(p_{n+1} | p_n) = P(p_{n+1} | p_n).
If so, why not use that notation? It would be a lot easier to remember. In any case, when this is used, please tell us what it is, or where it was originally defined (which should be in a displayed equation!).
17. State space is parameterized by p_n, and that needs to be discretized, right? If so, that's worth mentioning. If not, we're lost.
18. Analysis (in particular Equation 13) would be a lot easier if you used y_n instead of p_n. y_n is what is generally accumulated in DDMs, and it's what you generally plot on the y-axis. So why use p_n?
19. Equations 14 and 15 should really be in the main text. They're simple and important.
20. We didn't understand the inferred reward change task, in the text starting after line 393. We might have been able to guess, but please put in equations so it's crystal clear.
21. Somewhere below line 404: "a constant threshold … is predicted to be optimal only in simple, static decision environments." It's worth pointing out that the decision environments have to be _very_ simple. Even adding one more mean induces a nonconstant (and typically collapsing) bound.
22. Equation above line 405: why repeat that equation, and not repeat Equation 4? Just curious.
23. Lines 409-11: Couldn't parse.
24. After line 411, we find out that R1+R2=11. This is important and simple; you should tell us in the main text.
25. After line 411: we couldn't parse "allowing us to find the exact tuning of the normative model."
26. In fact, we're lost in pretty much everything between line 411 and the tokens task.
27. Line 429: what's "post-decision token acceleration"?
28. Line 433: "We used three models to fit the subject response data …". As far as we could tell, the three models are continuous time models. How were they adapted to this task, which runs in discrete time? Is it just a matter of making the time step larger?
29. Lines 432-434: please be more clear about parameter counting, by listing parameters.
30. Lines 437-8: "For more details as to our specific implementation of MCMC for this data, see the MATLAB code available at https://github.com/nwbarendregt/AdaptNormThresh".
We shouldn't have to look at code to get details; all important details should be in the paper.
31. Figure 2—figure supplement 2 and Figure 3—figure supplement 1: we thought the reward changed only once. But it's changing a lot in panel A. What's going on?
32. The Abstract / Introduction isn't clear enough about what you refer to as a "changing / dynamic environment". In particular, there is a rich history of research on environments whose state changes across decisions rather than within individual decisions. Making this distinction explicit, and clarifying that you care about the latter rather than the former should make Abstract / Intro clearer.
33. In the text around Equation (2), you should mention that you're assuming independence across time.
34. Equation (3): should c(dt) really be c(t)dt? Its dependence on only the time step size seems incompatible with its initial definition in line 77, where it depends on time t since trial onset. Although eventually, it does become a constant.
35. Below Equation (3): "We choose generating distributions f_± that allow us to explicitly compute the average future value […]" – can you compute the average future value explicitly, or just f_p(p_{n+1} | p_n)? Methods only discuss the latter.
36. Figures 2 and 3: the assumed reward/cost magnitudes should be mentioned in the main text, and also if the results were qualitatively dependent on these magnitudes (we assume not?).
37. Figure 2B: "belief" in Bayesian statistics usually refers to a posterior probability, whereas you seem to be using it to refer to log-posterior odds (or log-odds). Please clarify in the text what you mean by "belief" (if you haven't done so already and we missed it). This also refers to Figure 3B and clarifies what the thresholds are on in Figures 3/4/5.
38. Figures 2C/3C: the letter placements are slightly unclear. In particular, in Figure 2C it is hard to see where exactly 'iv' is placed. Maybe using labeled dots instead would increase placement precision?
39. Line 130: "[…] in which reward fluctuations are governed by a two-state Markov process […]". We couldn't figure out from the description in the main text what setup you are referring to and how to interpret Figure 2 – suppl 3. Please provide more detail (not just in Methods) on the reward switching process: what information is provided to the decision-maker to infer its state, etc.
40. Below line 156: we got lost in the notation for the different noisy / noiseless accumulator models. y_tilde appears to be accumulation with added sensory noise but is in the second point referred to as the "belief y_tilde [of the] normative model", which, being normative, presumably wouldn't have sensory noise. Furthermore, the UGM model seems to use the "noise-free observer's belief y". Is that the belief as defined in Equation (2) which still includes the sample noise, such that calling it "noise-free" might be confusing?
41. Starting on line 169: the text is unclear on how the models are tuned to cope with the noise, if at all. How the model parameters of the Const and UGM are chosen should also be mentioned in the main text, not just Methods – in particular, that they are tuned to maximize decision performance.
42. Line 332: "+ theta" – missing "(t)"?
43. Line 333: "where observations every dt time units" – fragment?
44. Equation (10): shouldn't V+ / V- / Vw also be functions of rho?
45. The equation above Equation (12): how is the expected future value computed? I assume that this can only be done numerically? Either way, please specify the details of how you do so. Referring to a Github repo isn't sufficient.
45. The evidence setup that leads to Equation (13) appears to be equivalent to the one leading to Equation (16) in Drugowitsch et al. (2012) for M=1. Is this correct? If yes, is the result equivalent? Either way, the relationship would be worth pointing out.
46. Line 411: "the measured response time T_tilde was drawn from a normal distribution […]" – what happened for predicted response times <0? Did you truncate the normal distribution at 0?
47. Line 432: what was the objective function for the MCMC fits? The joint likelihood of RTs and choices?
48. One of the more realistic scenarios is presented in Figure 2—figure supplement 3, where reward doesn't switch at a fixed time, but uses instead a Markov process. But you do not provide enough details of the task or the results. Is m_R = R_H / R_L? Is it the dark line that corresponds to m_R = ∞ (as indicated by legend) or the dotted line (as indicated by caption)? For what value of drift are these thresholds derived? These details should be included.
https://doi.org/10.7554/eLife.79824.sa1
Author response
I. The tokens task can be analyzed using the formalism introduced in Equation (3), but it seems pretty far from the "dynamic context" examples emphasized in the bulk of the paper. That doesn't mean the tokens task shouldn't be included. But it does mean we have no evidence one way or the other whether subjects would adopt the highly idiosyncratic boundaries found in simulations (for instance, the infinite threshold boundaries in Figures 2i, 2ii).
You need to be clear about this. The way the paper reads, it sounds like you have provided evidence for the dynamic context setup, when in fact that's not the case. It should be crystal clear that dynamic context problems have not been explored, at least not in the lab. Instead, what you showed is that a normative model can beat a particular heuristic model.
We have revised the text substantially to clarify and expand upon these important points. Specifically, we:
a. More clearly define the broad set of possible “dynamic context” conditions, including changes in outcome expectations or evidence quality while the evidence is being processed, where the changes can be either: (1) abrupt, as in the reward-change and SNR-change tasks we introduce, which we analyze only theoretically, or (2) gradual, as in the evidence quality changes in the tokens task, which we analyze theoretically and experimentally (e.g., in Results: “Even for such simple tasks, there is a broad set of possible dynamic contexts. In the next section, we will analyze a task where context changes gradually (the tokens task). Here we focus on tasks where the context changes abruptly.”).
b. Explain that our theoretical framework is general enough to account for both abrupt and gradual changes, and clarify that our analysis of data from the tokens task shows that the behavior of subjects is better described by a noisy normative model than by previously considered alternatives applied to that particular form of a dynamic-context task. We also state explicitly that more work is needed to determine if and how people follow normative principles for other dynamic-context tasks, …
II. The following are technical but important.
1. Adding noise: you are currently optimizing the model parameters/policy before adding sensory and motor noise. Decision-makers could be aware of sensory noise and so could try to optimize their decision processes with that knowledge. Would you be able to also compare the models' performances if they have been optimized to maximize performance in the presence of all noise? From our understanding, this should be feasible for the Const and UGM model, but might be harder for the normative model. Sensory noise could be included by finding Equation (13) that includes such noise, but finding the optimal thresholds once RT noise is included might be prohibitive. This is just a suggestion, not an essential inclusion. However, it might be worth at least discussing the difference between what you do, and being clear on the scenario you considered.
We appreciate these important points and now consider them in the revised Discussion. However, we have chosen not to extend our analyses, for several reasons: (1) An optimal observer without internal sensory and motor noise gives the best possible responses, and thus provides a useful benchmark; and (2) we fear that adding results that define optimality with respect to internal sensory and motor noise would, because of the assumptions we would have to make about both the nature and knowledge of those noise sources, be distracting as well as much more speculative and thus make the paper harder to follow.
We have updated the Methods section to highlight these points:
“We have chosen to add these two sources of noise after optimizing each model to maximize average reward rate, rather than reoptimizing each model after adding these additional noise sources. Although we could have reoptimized each model to maximize performance across noise realizations, we were interested in how the models responded to perturbations that drove their performance to be suboptimal (but possibly near-optimal).”
as well as the Discussion:
“Task-relevant variability can also arise from internal sources, including noise in neural processing of sensory input and motor output (Ma and Jazayeri, 2014; Faisal et al., 2008). We assumed subjects do not have precise knowledge of the strength or nature of these noise sources, and thus they could not optimize their strategy accordingly. However, people may be capable of rapidly estimating performance error that results from such internal noise processes and adjusting online (Bonnen et al., 2015). To extend the models we considered, we could therefore assume that subjects can estimate the magnitude of such sensory and motor noise, and use this information to adapt their decision strategies to improve performance.”
2. Fitting token task data: according to Cisek et al. (2009), the same participants performed both the slow and the fast version of the task. However, their fitted reward magnitudes differ by an order of magnitude between the two conditions (your Figure 6C/F). Is it just that the fitting objective didn't well-constrain these parameters? Given that you use MCMC for model fits, you could compare the parameter posteriors across conditions. Furthermore, how much worse would the model fits become if you would fit both conditions simultaneously and share all parameters that can be meaningfully shared across conditions? In any case, an explanation for this difference should be provided in the manuscript.
We now include a supplementary figure (Figure 6—figure supplement 2) comparing the posteriors across conditions as well as reward magnitudes in the slow and fast versions of the tokens task for a representative subject. The maximum likelihood estimate of the reward magnitude tended to be much higher in the slow task than in the fast task. It appears that subjects thus use distinct strategies in the two contexts, which we do not find surprising. We therefore do not expect to obtain fits of the same quality if we assume that subjective reward magnitude is the same across conditions. We speculate that subjects may value reward more in the slow task because it is obtained less frequently. Related effects have been attributed to amplified dopamine responses when rewards are rare (Rothenhoefer et al. 2021 Nat Neurosci). We added text to the Results section to point out this interesting finding:
“This result also shows that, assuming subjects used a normative model, they used distinct model parameters, and thus different strategies, for both the fast and slow task conditions. This finding is clearer when looking at the posterior parameter distribution for each subject and model parameter (see Figure 6—figure supplement 1 for an example). We speculate that the higher estimated value of reward in the slow task may arise due to subjects valuing frequent rewards more favorably.”
3. The dynamic context examples would seem to apply only when subjects take many seconds to make a decision. This would seem to rule out perceptual decision-making tasks. Is this true? If so, you should be upfront about this – so that those who work on perceptual decision-making will know what they're getting into.
We disagree. Normative decision rules are relevant even on shorter timescales, including those of perceptual decisions (e.g., on the order of 100 ms). Figure 2—figure supplement 2 and Figure 3—figure supplement 1 demonstrate that even though normative decision rules may invoke plans across multiple context changepoints, decisions are often made within the first or second changepoint, and the corresponding reaction time distributions would have a character distinct from those emerging from strategies with flat decision thresholds. Moreover, there is ample evidence that subjects are capable of adapting perceptual evidence integration to subsecond timescales (Ossmy et al. 2013; Glaze et al. 2015). We thus speculate that perceptual decision rules could adapt on similar timescales as predicted by our normative models.
We have updated the Discussion to clarify these points:
“Perceptual decision-making tasks provide a readily accessible route for validating this theory, especially considering the ease with which task difficulty can be parameterized to identify parameter ranges in which strategies can best be differentiated (Philiastides et al. 2006). There is ample evidence already that people can tune the timescale of leaky evidence accumulation processes to the switching rate of an unpredictably changing state governing the statistics of a visual stimulus, to efficiently integrate observations and make a decision about the state (Ossmy et al. 2013; Glaze et al. 2015). We thus speculate that adaptive decision rules could be identified similarly in the strategies people use to make decisions about perceptual stimuli in dynamic contexts.”
4. A known and predictable change in the middle of a task seems somewhat unrealistic. Given that it plays such a central role, concrete examples where this comes up would be very helpful. Or at least you should make a proposal for laboratory experiments where it could come up. The examples in the introduction ("Some of these factors can change quickly and affect our deliberations in real time; e.g., an unexpected shower will send us hurrying down the faster route (Figure 1A), whereas spotting a new ice cream store can make the longer route more attractive.") don't quite fall into the "known and predictable change" category.
Foraging animals must often deal with unpredictable changes in light and visibility conditions, but they also adjust to predictable changes in light brought about by the variation in sunlight with time of day. Sunrise and sunset represent stereotyped changes in foraging conditions as well as necessary escape conditions for prey animals. On shorter timescales, birds and other animals seeking mates, parents, or offspring must often discriminate between two or more calls with known amplitude modulations over time. Financial traders make decisions in markets with fixed open and closing times that strongly shape trading context. Dutch auctions are structured so that an item’s cost is successively lowered until a bidder agrees to pay that amount, reflecting a predictable stairstepping procedure for cost changes. In all these examples the quality of evidence changes in a predictable way, while the evidence remains noisy.
Concerning laboratory experiments, the first half of the paper already proposes a visual decision-making task. The experiment we analyzed could be implemented as a switching context random dot motion discrimination task with either changes in signal-to-noise (coherence) levels, or changes in reward amounts. Such changes could be signaled or consistently implemented at the same time each trial, so as to be known.
We now have added a sentence in the Introduction:
“People and other animals thus must cope with unpredictable changes in context, such as breaks in the weather (Grubb, 1975), as well as predictable changes that affect their observations, like the daily sunrise and sunset (McNamara et al., 1994).”
as well as a note in the Discussion to indicate the relevance of such task structures, and describe how they can be implemented in a laboratory setting:
“Model-driven experimental design can aid in identification of adaptive decision rules in practice. People commonly encounter unpredictable (e.g., an abrupt thunderstorm) and predictable (e.g., sunset) context changes when making decisions. Natural extensions of common perceptual decision tasks (e.g., random-dot motion discrimination; Gold and Shadlen, 2002) could include within-trial changes in stimulus signal-to-noise ratio (evidence quality) or anticipated reward payout.”
III. Better contact with existing literature needs to be made. For instance:
1. Drugowitsch, Moreno-Bote and Pouget (2014) already computed normative decision policies for time-varying SNR, with the difference that they assumed the SNR to follow a stochastic process rather than a known, deterministic time course. Thus, the work is closely related, but not equivalent.
Indeed we had not explained in detail the differences between their work and ours. We have now added the following sentence to the Discussion to make this clear:
“These strategies include dynamically changing decision thresholds when signal-to-noise ratios of evidence streams vary according to a Cox-Ingersoll-Ross process (Drugowitsch et al., 2014a)”
2. Some early models to predict dynamic decision boundaries were proposed by Busemeyer and Rapoport (1988) and Rapoport and Burkheimer (1971) in the context of a deferred decision-making task.
Thanks very much for pointing out these seminal references, which we now include in the Discussion:
“Several early normative theories were, like ours, based on dynamic programming (Rapoport and Burkheimer, 1971; Busemeyer and Rapoport, 1988), and in some cases models fit to experimental data (Ditterich, 2006).”
3. One of the earliest models to use dynamic programming to predict nonconstant decision boundaries was Frazier and Yu (2007). Indeed some boundaries predicted by the authors (e.g. Figure 2v) are very similar to boundaries predicted by this model. In fact, the switch from high to low reward used to propose boundaries in Figure 2v can be seen as a "softer" version of the deadline task in Frazier and Yu (2007).
Again, we very much appreciate the pointer to the very relevant reference, which we include in the Discussion:
“For example, dynamic programming was used to show that certain optimal decisions can require nonconstant decision boundaries similar to those of our normative models in dynamic reward tasks (Frazier and Yu, 2007) (Figure 2).”
4. Another early observation that time-varying boundaries can account for empirical data was made by Ditterich (2006). Seems highly relevant to the authors' predictions, but is not cited.
We agree and regret the oversight. We now reference that paper.
5. The authors seem to imply that their results are the first results showing nonmonotonic thresholds. This is not true. See, for example, Malhotra et al. (2018). What is novel here is the specific shape of these nonmonotonic boundaries.
As with the work by Drugowitsch et al. (2014), this work demonstrates the emergence of nonmonotonic boundaries, but in tasks and settings distinct from the ones we consider (which specifically employ dynamic context). We have clarified these points in the manuscript.
IV. Clarity could be massively improved. If you want to write an unclear paper that is your prerogative. However, if you do, you can't say "Our results can aid experimentalists investigating the nuances of complex decision-making in several ways". It would be difficult to aid experimentalists if they have to struggle to understand the paper.
Below are comments collected from the three reviewers, and more or less collated (so it's possible there's some overlap, and the order isn't exactly optimized). You can, in fact, almost ignore them, if you take into account the main message: all information should be easily accessible, in the main text, and the figures should be easy to make sense of.
As authors, we are aware that the length of replies can sometimes exceed the paper, which is not a good use of anybody's time. Please use your judgment as to which ones you reply to. For instance, if you're going to implement our suggestions, no reason to tell us. Maybe comment only if you have a major objection? Or use some other scheme? What we really care about is that the revised paper is easy to read!
Thanks for providing us with flexibility in how and to what we respond. Generally, we found all comments helpful, and so we have endeavored to make edits that address everything the reviewers brought to our attention. To simplify this letter, we include below only those points that require additional explanation. Otherwise all changes can be found in red in the revised manuscript.
6. Line 76: c(t) is barely motivated at all here. It's better motivated in Methods, but its value is very hard to justify. Why not stick with optimizing average reward, for which c=0? And I don't think you ever told us what c(t) was; only that it was constant (although we could have missed its value).
We have added the following motivation of the cost function c(t) to the main text:
“The incremental evidence function c(t) represents both explicit time costs, such as a price for gathering evidence, and implicit costs, such as the opportunity cost. While there are many forms of this cost function, we will make the simplifying assumption that it is constant, c(t)=c. Because more complex cost functions can influence decision threshold dynamics (Drugowitsch et al., 2012), restricting the cost function to a constant ensures that threshold dynamics are governed purely by changes in the (external) task conditions and not the (internal) cost function.”
We also specified the cost function c(t) = 1 that we used in Figures 2–4 in the figure captions. We revised the caption of Figure 5 to make it clearer that we are finding decision threshold motifs as a function of the cost function c:
“… B: Colormap of normative threshold dynamics for the “slow” version of the tokens task in reward-evidence cost parameter space (i.e., as a function of R_{c} and c(t) = c from Equation 3, with punishment R_{i} set to 1). Distinct …”
We also added in more clarification to the caption of Figure 6C,F to emphasize that we are fitting the cost function c(t) = c.
10. Lines 207–9: "Because the total number of tokens was finite and known to the subject, token movements varied in their informativeness within a trial, yielding a dynamic and history-dependent evidence quality that, in principle, could benefit from adaptive decision processes".
To us, "history-dependent" implies non-Markov, whereas the tokens task is Markov. But maybe that's not what history-dependent means here? This should be clarified.
Yes, the token count differential is driven by a Markov process, since there is always a 50/50 chance of the token being moved to the top or bottom target. However, the log likelihood ratio associated with either target having more tokens at the end is a non-Markovian, history-dependent process, because the possible LLR increments on each token movement are determined by the token movements so far. This subtlety does make this a dynamic context task, where the evidence quality is the context that changes gradually throughout a trial. We addressed this in our response to the major comments above as we describe the temporal dynamics of the tokens task.
“In addition, the task included two different post-decision token movement speeds, “slow” and “fast”: once the subject committed to a choice, the tokens finished out their animation, moving either once every 170 ms (slow task) or once every 20 ms (fast task). This post-decision movement acceleration changed the value associated with commitment by making the average inter-trial interval (t_{i} in Equation 1) decrease over time. Because of this modulation, we can interpret the tokens task as a multi-change reward task, where commitment value is controlled through t_{i} rather than through reward R_{c}.”
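The history dependence of the LLR increments described in the response to comment 10 can be illustrated with a short sketch. This is not the authors' code: the total of 15 tokens, the function names, and the use of posterior odds with equal priors as the LLR are all assumptions made for illustration. Each token moves to the top or bottom target with probability 0.5, and the same single move toward the top target shifts the LLR by a different amount depending on how many tokens have already moved.

```python
from math import comb, log

N_TOKENS = 15  # assumed total token count (odd); not stated in the text above


def p_top_wins(n_top, n_bottom, n_total=N_TOKENS):
    """P(top target ends with the majority) given counts so far, with 50/50 moves."""
    remaining = n_total - n_top - n_bottom
    needed = n_total // 2 + 1 - n_top  # further top moves required for a majority
    if needed <= 0:
        return 1.0
    if needed > remaining:
        return 0.0
    wins = sum(comb(remaining, k) for k in range(needed, remaining + 1))
    return wins / 2 ** remaining


def llr(n_top, n_bottom, n_total=N_TOKENS):
    """Log odds that the top target wins; infinite once the outcome is decided."""
    p = p_top_wins(n_top, n_bottom, n_total)
    if p == 1.0:
        return float("inf")
    if p == 0.0:
        return float("-inf")
    return log(p / (1.0 - p))


def llr_increment(n_top, n_bottom):
    """Change in LLR caused by one more token moving to the top target."""
    return llr(n_top + 1, n_bottom) - llr(n_top, n_bottom)


early = llr_increment(0, 0)  # first token of the trial
late = llr_increment(5, 5)   # same move after ten balanced token movements
print(f"early increment: {early:.3f}, late increment: {late:.3f}")
```

Under these assumptions the later move carries more evidence than the first one, even though the token count itself follows a simple Markov process.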
19. Equations 14 and 15 should really be in the main text. They're simple and important.
We added the following text to include these Heaviside functions in the main text and to better motivate our investigation into singlechange environments for reward:
“Environments with multiple fluctuations during a single decision lead to complex threshold dynamics, but are comprised of threshold change “motifs.” These motifs occur on shorter intervals and tend to emerge from simple monotonic changes in context parameters (Figure 2—figure supplement 2). To better understand the range of possible threshold motifs, we focused on environments with single changes in task parameters. For the reward-change task, we set punishment to R_{i} = 0, and assumed reward R_{c} changes abruptly, so that its dynamics are described by a Heaviside function
$R_{c}(t)=(R_{2}-R_{1})H_{\theta}(t-0.5)+R_{1}.$ Thus, the reward switches from a pre-change value of $R_{1}$ to a post-change value of $R_{2}$ at t=0.5. For this single-change task, …”
and quality:
“In the SNR-change task, optimal strategies for environments with multiple fluctuations are characterized by threshold dynamics adapted to changes in evidence quality in a way similar to changes in reward (Figure 3—figure supplement 1). To study the range of possible threshold motifs, we again considered environments with single changes in evidence quality $m=\frac{2\mu^{2}}{\sigma^{2}}$ by taking μ to be a Heaviside function: $\mu(t)=(\mu_{2}-\mu_{1})H_{\theta}(t-0.5)+\mu_{1}.$ For this single-change task, we again found similar threshold motifs to those in the reward-change task (Figure 3A,B).”
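The two single-change contexts quoted above share one form: a step from a pre-change to a post-change value at t = 0.5. The sketch below is illustrative only; the function names, the parameter values, and the convention that the post-change value already applies exactly at the change time are assumptions, not details taken from the paper.

```python
import numpy as np


def heaviside_switch(t, pre, post, t_change=0.5):
    """Step context: (post - pre) * H(t - t_change) + pre."""
    t = np.asarray(t, dtype=float)
    # Second argument sets H(0); using 1.0 is an assumed convention.
    return (post - pre) * np.heaviside(t - t_change, 1.0) + pre


def evidence_quality(mu, sigma):
    """Evidence quality m = 2 * mu**2 / sigma**2."""
    return 2.0 * np.asarray(mu, dtype=float) ** 2 / sigma ** 2


t = np.linspace(0.0, 1.0, 5)                      # 0, 0.25, 0.5, 0.75, 1
reward = heaviside_switch(t, pre=1.0, post=3.0)   # hypothetical R_1 = 1, R_2 = 3
mu = heaviside_switch(t, pre=0.5, post=1.0)       # hypothetical mu_1, mu_2
m = evidence_quality(mu, sigma=1.0)               # hypothetical sigma = 1

print(reward)  # pre-change value before t = 0.5, post-change value from t = 0.5
print(m)
```

Writing both contexts through the same helper makes it easy to check that each takes its pre-change value before t = 0.5 and its post-change value afterwards.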
23. Lines 409–11: Couldn't parse.
We have revised this paragraph for clarity and to include more details and motivation:
“In addition to the inference noise with strength $\sigma_{y}$, we also filtered each process through a Gaussian response-time filter with zero mean and standard deviation $\sigma_{\text{mn}}$. Under this response-time filter, if the model predicted a response time T, the measured response time $\tilde{T}$ was drawn from a normal distribution centered at T with standard deviation $\sigma_{\text{mn}}$. If the response time $\tilde{T}$ was drawn outside of the simulation's time discretization (i.e., if $\tilde{T}<0$ or $\tilde{T}>\frac{T_{f}}{5}$), we redrew $\tilde{T}$ until it fell within the discretization. This filter was chosen to represent both “early responses” caused by attentional lapses, as well as “late responses” caused by motor processing delays between the formation of a choice in the brain and the physical response.”
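The redraw procedure in the quoted paragraph amounts to rejection sampling from a normal distribution restricted to the simulated time window. A minimal sketch under stated assumptions follows: the function name, the values of T_f and σ_mn, and the cap on the number of redraws are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def filter_response_time(T, sigma_mn, t_max, rng, max_tries=10_000):
    """Draw a measured RT ~ Normal(T, sigma_mn), redrawing until 0 <= RT <= t_max."""
    for _ in range(max_tries):
        t_noisy = rng.normal(loc=T, scale=sigma_mn)
        if 0.0 <= t_noisy <= t_max:
            return t_noisy
    # The cap is a safeguard added for this sketch, not part of the description.
    raise RuntimeError("no valid response time drawn; check T, sigma_mn and t_max")


rng = np.random.default_rng(0)
T_f = 5.0             # hypothetical simulation horizon
t_max = T_f / 5       # upper limit of the time discretization, as in the text
predicted_rt = 0.05   # hypothetical model-predicted response time
sigma_mn = 0.2        # hypothetical motor-noise standard deviation

measured = [filter_response_time(predicted_rt, sigma_mn, t_max, rng) for _ in range(1000)]
print(min(measured), max(measured))
```

Because out-of-range draws are discarded rather than clipped, the measured response times follow a truncated normal distribution and do not pile up at the window edges.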
https://doi.org/10.7554/eLife.79824.sa2
Article and author information
Author details
Funding
National Institutes of Health (R01MH115557)
 Nicholas W Barendregt
 Joshua I Gold
 Krešimir Josić
 Zachary P Kilpatrick
National Institutes of Health (R01EB02984701)
 Nicholas W Barendregt
 Zachary P Kilpatrick
National Science Foundation (NSFDMS1853630)
 Nicholas W Barendregt
 Zachary P Kilpatrick
National Science Foundation (NSFDBI1707400)
 Krešimir Josić
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Paul Cisek for providing response data from the tokens task used in our analysis.
Senior Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Reviewing Editor
 Peter Latham, University College London, United Kingdom
Reviewer
 Gaurav Malhotra, University of Bristol, United Kingdom
Version history
 Preprint posted: April 29, 2022 (view preprint)
 Received: May 3, 2022
 Accepted: October 20, 2022
 Accepted Manuscript published: October 25, 2022 (version 1)
 Version of Record published: December 15, 2022 (version 2)
Copyright
© 2022, Barendregt et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.