Normative decision rules in changing environments
Abstract
Models based on normative principles have played a major role in our understanding of how the brain forms decisions. However, these models have typically been derived for simple, stable conditions, and their relevance to decisions formed under more naturalistic, dynamic conditions is unclear. We previously derived a normative decision model in which evidence accumulation is adapted to fluctuations in the evidence-generating process that occur during a single decision (Glaze et al., 2015), but the evolution of commitment rules (e.g. thresholds on the accumulated evidence) under dynamic conditions is not fully understood. Here, we derive a normative model for decisions based on changing contexts, which we define as changes in evidence quality or reward, over the course of a single decision. In these cases, performance (reward rate) is maximized using decision thresholds that respond to and even anticipate these changes, in contrast to the static thresholds used in many decision models. We show that these adaptive thresholds exhibit several distinct temporal motifs that depend on the specific predicted and experienced context changes and that adaptive models perform robustly even when implemented imperfectly (noisily). We further show that decision models with adaptive thresholds outperform those with constant or urgency-gated thresholds in accounting for human response times on a task with time-varying evidence quality and average reward. These results further link normative and neural decision-making while expanding our view of both as dynamic, adaptive processes that update and use expectations to govern both deliberation and commitment.
Editor's evaluation
This paper makes an important contribution to the study of decision-making under time pressure. The authors provide convincing evidence that decision boundaries can be highly nontrivial – even reaching infinity in realistic regimes. This paper will be of broad interest to both experimentalists and theorists working on decision-making under time pressure.
https://doi.org/10.7554/eLife.79824.sa0
eLife digest
How do we make good choices? Should I have cake or yoghurt for breakfast? The strategies we use to make decisions are important not just for our daily lives, but also for learning more about how the brain works.
Decision-making strategies have two components: first, a deliberation period (when we gather information to determine which choice is ‘best’); and second, a decision ‘rule’ (which tells us when to stop deliberating and commit to a choice). Although deliberation is relatively well understood, less is known about the decision rules people use, or how those rules produce different outcomes.
Another issue is that even the simplest decisions must sometimes adapt to a changing world. For example, if it starts raining while you are deciding which route to walk into town, you would probably choose the driest route – even if it did not initially look the best. However, most studies of decision strategies have assumed that the decision-maker’s environment does not change during the decision process.
In other words, we know much less about the decision rules used in real-life situations, where the environment changes. Barendregt et al. therefore wanted to extend the approaches previously used to study decisions in static environments, to determine which decision rules might be best suited to more realistic environments that change over time.
First, Barendregt et al. constructed a computer simulation of decision-making with environmental changes built in. These changes were either alterations in the quality of evidence for or against a particular choice, or the ‘reward’ from a choice, i.e., feedback on how good the decision was. They then used the computer simulation to model single decisions where these changes took place.
These virtual experiments showed that the best performance – for example, the most accurate decisions – resulted when the threshold for moving from deliberation (i.e., considering the evidence) to selecting an option could respond to, or even anticipate, the changing situations. Importantly, the simulations’ results also predicted real-world choices made by human participants when given a decision-making task with similar variations in evidence and reward over time. In other words, the virtual decision-making rules could explain real behavior.
This study sheds new light on how we make decisions in a changing environment. In the future, Barendregt et al. hope that this will contribute to a broader understanding of decision-making and behavior in a wide range of contexts, from psychology to economics and even ecology.
Introduction
Even simple decisions can require us to adapt to a changing world. Should you go through the park or through town on your walk? The answer can depend on conditions that could be changing while you deliberate, such as an unexpected shower that would send you hurrying down the faster route (Figure 1A) or a predictable sunrise that would nudge you toward the route with better views. Despite the ubiquity of such dynamics in the real world, they are often neglected in models used to understand how the brain makes decisions. For example, many commonly used models assume that decision commitment occurs when the accumulated evidence for an option reaches a fixed, predefined value or threshold (Wald, 1945; Ratcliff, 1978; Bogacz et al., 2006; Gold and Shadlen, 2007; Kilpatrick et al., 2019). The value of this threshold can account for inherent trade-offs between decision speed and accuracy found in many tasks: lower thresholds generate faster, but less accurate decisions, whereas higher thresholds generate slower, but more accurate decisions (Gold and Shadlen, 2007; Chittka et al., 2009; Bogacz et al., 2010). However, these classical models do not adequately describe decisions made in environments with potentially changing contexts (Thura et al., 2014; Thura and Cisek, 2016; Palestro et al., 2018; Cisek et al., 2009; Drugowitsch et al., 2012; Thura et al., 2012; Tajima et al., 2019; Glickman et al., 2022). Efforts to model decision-making thresholds under dynamic conditions have focused largely on heuristic strategies that aim to account for contexts that change between each decision. For instance, a common class of heuristic models is ‘urgency-gating models’ (UGMs). UGMs filter accumulated evidence through a low-pass filter and use thresholds that collapse monotonically over time (equivalent to dilating the belief in time) to explain decisions based on time-varying evidence quality (Cisek et al., 2009; Carland et al., 2015; Evans et al., 2020).
Although collapsing decision thresholds are optimal in some cases, they do not always account for changes that occur during decision deliberation, and they are sometimes implemented ad hoc without a proper derivation from first principles. Such derivations typically assume that individuals set decision thresholds to maximize trial-averaged reward rate (Simen et al., 2009; Balci et al., 2011; Drugowitsch et al., 2012; Tajima et al., 2016; Malhotra et al., 2018; Boehm et al., 2020), which can result in adaptive, time-varying thresholds similar to those assumed by heuristic UGMs. However, as in fixed-threshold models, these time-varying thresholds are typically defined before the evidence is accumulated, preceding the formative stages of the decision, and thus cannot account for environmental changes that may occur during deliberation.
To identify how environmental changes during the course of a single deliberative decision impact optimal decision rules, we developed normative models of decision-making that adapt to and anticipate two specific types of context changes: changes in reward expectation and changes in evidence quality. Specifically, we used Bellman’s equation (Bellman, 1957; Mahadevan, 1996; Sutton and Barto, 1998; Bertsekas, 2012; Drugowitsch, 2015) to identify decision strategies that maximize trial-averaged reward rate when conditions can change during decision deliberation. We show that for simple tasks that include sudden, expected within-trial changes in the reward or the quality of observed evidence, these normative decision strategies involve nontrivial, time-dependent changes in decision thresholds. These rules take several different forms that outperform their heuristic counterparts, are identifiable from behavior, and have performance that is robust to noisy implementations. We also show that, compared to fixed-threshold models or UGMs, these normative, adaptive threshold models provide a better account of human behavior on a ‘tokens task’, in which the value of commitment changes gradually at predictable times and the quality of evidence changes unpredictably within each trial (Cisek et al., 2009; Thura et al., 2014). These results provide new insights into the behavioral relevance of a diverse set of adaptive decision thresholds in dynamic environments and tightly link the details of such environmental changes to threshold adaptations.
Results
Normative theory for dynamic context 2AFC tasks
To determine potential deliberation and commitment strategies used by human subjects, we begin by identifying normative decision rules for two-alternative forced choice (2AFC) tasks with dynamic contexts. Normative decision rules that maximize trial-averaged reward rate can be obtained by solving an optimization problem using dynamic programming (Bellman, 1957; Sutton and Barto, 1998; Drugowitsch et al., 2012; Tajima et al., 2016). We define this trial-averaged reward rate, $\rho $, as (Gold and Shadlen, 2002; Drugowitsch et al., 2012)
(1) $\rho =\frac{\langle R\rangle -\langle C({T}_{d})\rangle }{\langle {T}_{t}\rangle +\langle {t}_{i}\rangle },$
where $\langle R\rangle $ is the average reward for a decision, ${T}_{d}$ is the decision time, $\langle C({T}_{d})\rangle =\langle {\int}_{0}^{{T}_{d}}c(t)\,dt\rangle $ is the average total accumulated cost given an incremental cost function $c(t)$, $\langle {T}_{t}\rangle $ is the average trial length, and $\langle {t}_{i}\rangle $ is the average inter-trial interval (Drugowitsch, 2015). Note that all averages in Equation 1 are taken over trials. To find the normative decision thresholds that maximize $\rho $, we assign specific values (i.e., economic utilities) to correct and incorrect choices (reward and/or punishment) and the time required to arrive at each choice (i.e., evidence cost). The incremental cost function $c(t)$ represents both explicit time costs, such as a price for gathering evidence, and implicit costs, such as opportunity cost. While there are many forms of this cost function, we make the simplifying assumption that it is constant, $c(t)=c$. Because more complex cost functions can influence decision threshold dynamics (Drugowitsch et al., 2012), restricting the cost function to a constant ensures that the threshold dynamics we identify are governed purely by changes in the (external) task conditions and not the (internal) cost function. To represent the structure of a 2AFC task, we assume a decision environment for an observer with an initially unknown environmental state, $s\in \{{s}_{+},{s}_{-}\}$, that uniquely determines which of two alternatives is correct. To infer the environmental state, this observer makes measurements, $\xi $, that follow a distribution ${f}_{\pm}(\xi )=f(\xi \mid {s}_{\pm})$ that depends on the state. Determining the correct choice is thus equivalent to determining the generating distribution, ${f}_{\pm}$. An ideal Bayesian observer uses the log-likelihood ratio (LLR), $y$, to track their ‘belief’ over the correct choice (Wald, 1945; Bogacz et al., 2006; Veliz-Cuba et al., 2016).
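After $n$ discrete observations ${\xi}_{1:n}$ that are independent across time, the discrete-time LLR belief ${y}_{n}$ is given by:
(2) ${y}_{n}=\mathrm{ln}\frac{\mathrm{Pr}({s}_{+}\mid {\xi}_{1:n})}{\mathrm{Pr}({s}_{-}\mid {\xi}_{1:n})}=\mathrm{ln}\frac{{f}_{+}({\xi}_{n})}{{f}_{-}({\xi}_{n})}+{y}_{n-1}.$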
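As an illustration of this belief update, the sketch below accumulates the LLR for the symmetric Gaussian observations $\mathcal{N}(\pm \mu ,{\sigma}^{2})$ used later in this section, for which each observation contributes $2\mu \xi /{\sigma}^{2}$. It assumes equal prior odds (so the belief starts at zero); the function names `llr_trajectory` and `state_likelihood` and all parameter values are hypothetical choices made for this example, not part of the original analysis.

```python
import numpy as np

def llr_trajectory(xi, mu, sigma):
    """Cumulative LLR y_n for observations drawn from N(+mu, sigma^2) or N(-mu, sigma^2).

    Assumes equal prior odds, so the belief starts at y_0 = 0.
    """
    # For symmetric Gaussians, ln f_+(xi)/f_-(xi) simplifies to 2*mu*xi/sigma^2.
    increments = 2.0 * mu * np.asarray(xi, dtype=float) / sigma**2
    return np.cumsum(increments)

def state_likelihood(y):
    """Posterior probability that the state is s_+ given the LLR y."""
    return 1.0 / (1.0 + np.exp(-np.asarray(y, dtype=float)))

# Example: 200 observations generated under s_+ (illustrative parameters).
rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0
xi = rng.normal(mu, sigma, size=200)
y = llr_trajectory(xi, mu, sigma)
p = state_likelihood(y)
```

Working in LLR space keeps the update additive, which is why the accumulation reduces to a cumulative sum.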
Given this defined task structure, we discretize the time during which the decision is formed and define the observer’s actions during each timestep. The observer gathers evidence (measurements) during each timestep prior to a decision and uses each increment of evidence to update their belief about the correct choice. Then, the observer has the option to either commit to a choice or make another measurement at the next timestep. By assigning a utility to each of these actions (i.e., a value ${V}_{+}$ for choosing ${s}_{+}$, a value ${V}_{-}$ for choosing ${s}_{-}$, and a value ${V}_{w}$ for sampling again), we can construct the value function for the observer given their current belief:
(3) $V({p}_{n};\rho )=\max \begin{cases}{V}_{+}({p}_{n};\rho )={R}_{c}{p}_{n}+{R}_{i}(1-{p}_{n})-\rho \langle {t}_{i}\rangle ,\\ {V}_{-}({p}_{n};\rho )={R}_{c}(1-{p}_{n})+{R}_{i}{p}_{n}-\rho \langle {t}_{i}\rangle ,\\ {V}_{w}({p}_{n};\rho )=\langle V({p}_{n+1};\rho )\mid {p}_{n}\rangle -c\,\delta t-\rho \,\delta t.\end{cases}$
For a full derivation of this equation, see Materials and methods. In Equation 3, ${p}_{n}=\mathrm{Pr}({s}_{+}\mid {\xi}_{1:n})=\frac{1}{1+{e}^{-{y}_{n}}}$ is the state likelihood at time ${t}_{n}$, ${R}_{c}$ is the reward for a correct choice, ${R}_{i}$ is the reward for an incorrect choice, and $\delta t$ is the timestep between observations. We choose generating distributions to be symmetric Gaussian distributions ${f}_{\pm}(\xi )\sim \mathcal{N}\left(\pm \mu ,{\sigma}^{2}\right)$ to allow us to compute the conditional distribution function ${f}_{p}({p}_{n+1}\mid {p}_{n})$ needed for the average future value explicitly:
(4) $\langle V({p}_{n+1};\rho )\mid {p}_{n}\rangle ={\int}_{0}^{1}V({p}_{n+1};\rho )\,{f}_{p}({p}_{n+1}\mid {p}_{n})\,d{p}_{n+1}.$
In Equation 4, ${f}_{p}({p}_{n+1}\mid {p}_{n})$ is the conditional probability of the future state likelihood ${p}_{n+1}$ given the current state likelihood ${p}_{n}$. For the case of Gaussian-distributed evidence, this conditional probability is given by Equation 16 in Materials and methods. Using Equation 3, we find the specific belief values where the optimal action changes from gathering evidence to commitment, defining thresholds on the ideal observer’s belief that trigger decisions. Figure 1B shows a schematic of this process.
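The threshold-finding procedure described above can be sketched with finite-horizon backward induction. This is a simplified illustration under stated assumptions, not the authors' implementation: the reward rate `rho` is treated as a given constant (a full treatment would need to determine it self-consistently), the belief is discretized on a truncated LLR grid, the commitment values are taken to be $R_c p + R_i(1-p) - \rho t_i$ and its mirror image, and the observer is forced to commit at the final timestep. The function name `normative_thresholds` and all default parameter values are hypothetical.

```python
import numpy as np

def normative_thresholds(R_c, R_i=0.0, m=0.5, c=0.1, rho=0.0, dt=1.0,
                         t_i=0.0, y_max=8.0, n_grid=401):
    """Backward induction for decision thresholds in LLR space.

    R_c : sequence of rewards for a correct choice, one per timestep.
    m   : per-step SNR, so LLR increments are N(+/-m, 2m) given the state.
    Returns one threshold per timestep (np.inf where waiting is always best).
    """
    R_c = np.asarray(R_c, dtype=float)
    N = len(R_c)
    y = np.linspace(-y_max, y_max, n_grid)
    p = 1.0 / (1.0 + np.exp(-y))            # state likelihood on the grid

    # Transition matrix Pr(y_{n+1} | y_n): a mixture over the unknown state.
    # Normalization constants cancel because rows are renormalized on the grid.
    diff = y[None, :] - y[:, None]
    var = 2.0 * m
    gauss = lambda mean: np.exp(-(diff - mean) ** 2 / (2.0 * var))
    T = p[:, None] * gauss(+m) + (1.0 - p[:, None]) * gauss(-m)
    T /= T.sum(axis=1, keepdims=True)

    def commit_value(n):
        v_plus = R_c[n] * p + R_i * (1.0 - p) - rho * t_i
        v_minus = R_c[n] * (1.0 - p) + R_i * p - rho * t_i
        return np.maximum(v_plus, v_minus)

    thresholds = np.full(N, np.inf)
    V = commit_value(N - 1)                 # assumption: must commit at the end
    thresholds[N - 1] = 0.0
    for n in range(N - 2, -1, -1):
        v_commit = commit_value(n)
        v_wait = T @ V - (c + rho) * dt     # expected future value minus costs
        V = np.maximum(v_commit, v_wait)
        stop = (v_commit >= v_wait) & (y >= 0.0)
        if stop.any():
            thresholds[n] = y[stop][0]      # smallest belief at which committing wins
    return thresholds

# Example: reward switches from 0 to 10 halfway through a 10-step trial.
theta = normative_thresholds([0.0] * 5 + [10.0] * 5)
```

In this toy setting the thresholds are infinite while the reward is zero, because waiting for the larger payout is worth more than any immediate commitment, and they become finite once the reward increases.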
To understand how normative decision thresholds adapt to changing conditions, we derived them for several different forms of two-alternative forced-choice (2AFC) tasks in which we controlled changes in evidence or reward. Even for such simple tasks, there is a broad set of possible changing contexts. In the next section, we analyze a task in which context changes gradually (the tokens task). Here, we focus on tasks in which the context changes abruptly. For each task, an ideal observer was shown evidence generated from a Gaussian distribution ${f}_{\pm}(\xi )=\mathcal{N}\left(\pm \mu ,{\sigma}^{2}\right)$ with signal-to-noise ratio (SNR) $m=\frac{2{\mu}^{2}}{{\sigma}^{2}}$ (Figure 2—figure supplement 1). The SNR measures evidence quality: a smaller (larger) $m$ implies that evidence is of lower (higher) quality, resulting in harder (easier) decisions. The observer’s goal was to determine which of the two means (i.e., which distribution, ${f}_{+}$ or ${f}_{-}$) was used to generate the observations. We introduced changes in the reward for a correct decision (‘reward-change task’) or the SNR (‘SNR-change task’) within a single decision, where the time and magnitude of the changes are known in advance to the observer (Figure 1A, Figure 2—figure supplement 2). For example, changes in SNR arise naturally throughout a day as animals choose when to forage and hunt given variations in light levels and therefore target-acquisition difficulty (Combes et al., 2012; Einfalt et al., 2012).
Under these dynamic conditions, dynamic programming produces normative thresholds with rich nonmonotonic dynamics (Figure 2A and B, Figure 2—figure supplement 2). Environments with multiple reward changes during a single decision lead to complex threshold dynamics that we summarize in terms of several threshold change “motifs.” These motifs occur on shorter intervals and tend to emerge from simple monotonic changes in context parameters (Figure 2—figure supplement 2). To better understand the range of possible threshold motifs, we focused on environments with single changes in task parameters. For the reward-change task, we set punishment ${R}_{i}=0$ and assumed reward ${R}_{c}$ changes abruptly, so that its dynamics are described by a Heaviside function:
(5) ${R}_{c}(t)={R}_{1}+({R}_{2}-{R}_{1})\,H(t-0.5),$
where $H$ is the Heaviside step function. Thus, the reward switches from the pre-change reward ${R}_{1}$ to the post-change reward ${R}_{2}$ at $t=0.5$.
For this single-change task, normative threshold dynamics exhibited several motifs that in some cases resembled fixed or collapsing thresholds characteristic of previous decision models but in other cases exhibited novel dynamics. Specifically, we characterized five different dynamic motifs in response to single, expected changes in reward contingencies for different combinations of pre- and post-change reward values (Figure 2C and i–v). For tasks in which reward is initially very low, thresholds are infinite until the reward increases, ensuring that the observer waits for the larger payout regardless of how strong their belief is (Figure 2i). The region where thresholds are infinite corresponds to when ${V}_{w}({p}_{n};\rho )$ in Equation 3, which is the value associated with waiting to gather more information, is maximal for all values of ${p}_{n}$. In contrast, when reward is initially very high, thresholds collapse to zero just before the reward decreases, ensuring that all responses occur while payout is high (Figure 2v). Between these two extremes, optimal thresholds exhibit rich, nonmonotonic dynamics (Figure 2ii,iv), promoting early decisions in the high-reward regime, or preventing early, inaccurate decisions in the low-reward regime. Figure 2C shows the regions in pre- and post-change reward space where each motif is optimal, including broad regions with nonmonotonic thresholds. Thus, even simple context dynamics can evoke complex decision strategies in ideal observers that differ from those predicted by constant decision thresholds and heuristic UGMs.
We also formulated an ‘inferred reward-change task’, in which reward fluctuates between a high value ${R}_{H}$ and low value ${R}_{L}$ governed by a two-state Markov process with known transition rate $h$ and state $R(t)\in \{{R}_{H},{R}_{L}\}$ that the observer must infer online. For this task, the observer receives two independent sets of evidence: the evidence of the state $\xi \mid {s}_{\pm}\sim \mathcal{N}\left(\pm \mu ,{\sigma}^{2}\right)$ and the evidence of the current reward $\eta \mid {R}_{H/L}\sim \mathcal{N}\left(\pm {\mu}_{R},{\sigma}_{R}^{2}\right)$. The observer must then track their beliefs about both the state and the current reward and take both sources of information into account when determining the optimal decision thresholds. We found that these thresholds always changed monotonically with monotonic shifts in expected reward (see Figure 2—figure supplement 3). These results contrast with our findings from the reward-change task in which changes can be anticipated and monotonic changes in reward can produce nonmonotonic changes in decision thresholds.
For the SNR-change task, optimal strategies for environments with multiple changes in evidence quality are characterized by threshold dynamics that adapt to these changes in a way similar to how they adapt to changes in reward (Figure 3—figure supplement 1). To study the range of possible threshold motifs, we again considered environments with single changes in the evidence quality $m=\frac{2{\mu}^{2}}{{\sigma}^{2}}$ by taking $\mu $ to be a Heaviside function:
For this single-change task, we again found similar threshold motifs to those in the reward-change task (Figure 3A and B). However, in this case monotonic changes in evidence quality always produce monotonic changes in response behavior. This observation holds across all of parameter space for evidence-quality schedules with single change points (Figure 3C), with only three optimal behavioral motifs (Figure 3i–iii). This contrasts with our findings in the reward-change task, where monotonic changes in reward can produce nonmonotonic changes in decision thresholds. Strategies arising from known dynamical changes in context tend to produce sharper response distributions around reward changes than around quality changes, which may be measurable in psychophysical studies. These findings suggest that changes in reward can have a larger impact on the normative strategy thresholds than changes in evidence quality.
Performance and robustness of nonmonotonic normative thresholds
The normative solutions that we derived for dynamic-context tasks by definition maximize reward rate. This maximization assumes that the normative solutions are implemented perfectly. However, a perfect implementation may not be possible, given the complexity of the underlying computations, biological constraints on computation time and energy (Louie et al., 2015), and the synaptic and neural variability of cortical circuits (Ma and Jazayeri, 2014; Faisal et al., 2008). Given these constraints, subjects may employ heuristic strategies like the UGM over the normative model if noisy or mistuned versions of both models result in similar reward rates. We used synthetic data to better understand the relative benefits of different imperfectly implemented strategies. Specifically, we corrupted the internal belief state and simulated response times with additive Gaussian noise with zero mean and variance ${\sigma}_{mn}^{2}$ (see Figure 4—figure supplement 1C) for three models:
1. The continuous-time normative model with time-varying thresholds $\pm \theta (t)$ from Equation 3 and belief that evolves according to the stochastic differential equation
$d\tilde{y}=\underbrace{\pm m\,dt}_{\text{drift}}+\underbrace{\sqrt{2m}\,d{W}_{t}}_{\text{sample noise}}+\underbrace{{\sigma}_{y}\,d{W}_{t}^{\prime}}_{\text{sensory noise}},$
where $d{W}_{t}$ is a standard increment of a Wiener process, the sign of the drift $\pm m\,dt$ is given by the correct choice ${s}_{\pm}$, and $d{W}_{t}^{\prime}$ is the increment of an independent Wiener process with strength ${\sigma}_{y}$. Including the additional noise process $d{W}_{t}^{\prime}$ makes this a noisy Bayesian (NB) model.
2. A constant-threshold (Const) model, which uses the same belief $\tilde{y}$ as the normative model but a constant, non-adaptive decision threshold $\pm \theta (t)=\pm {\theta}_{0}$ (Figure 4—figure supplement 1A).
3. The UGM, which uses the output of a low-pass filter as the belief,
(7) $\tau \,dE=\underbrace{\left(-E+\frac{1}{1+{e}^{-y}}-\frac{1}{2}\right)dt}_{\text{drift \& sample noise}}+\underbrace{{\sigma}_{y}\,d{W}_{t}}_{\text{sensory noise}},$
and commits to a decision when this output crosses a hyperbolically collapsing threshold $\pm \theta (t)=\pm \frac{{\theta}_{0}}{at}$ (Figure 4—figure supplement 1B). In Equation 7, $E$ is the filter’s output that serves as the UGM’s belief, $\tau $ is a relaxation time constant, and the optimal observer’s belief $y$ is the filter’s input. Note that the filter’s input can also be written in terms of the state likelihood $p$,
$\tau \,dE=\left(-E+p-\frac{1}{2}\right)dt+{\sigma}_{y}\,d{W}_{t},$
which is the form first proposed by Cisek et al., 2009.
For more details about these three models, see Materials and methods. We compared their performance in terms of reward rate achieved on the same set of reward-change tasks shown in Figure 2. To ensure the average total reward in each trial was the same, we restricted the pre-change reward ${R}_{1}$ and post-change reward ${R}_{2}$ so that ${R}_{1}+{R}_{2}=11$.
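To make the noisy belief dynamics concrete, the following sketch simulates a single trial of a noisy belief with drift $\pm m$, sample noise $\sqrt{2m}$, and sensory noise ${\sigma}_{y}$ using the Euler–Maruyama method, and stops when the belief reaches a user-supplied threshold function. It is an illustration only: the function name `simulate_noisy_bayesian`, the step size, the trial duration, and the parameter values are all assumptions for this example, and motor noise on the response time is omitted.

```python
import numpy as np

def simulate_noisy_bayesian(theta, m=5.0, sigma_y=0.5, dt=1e-3, t_max=1.0,
                            correct_sign=+1, rng=None):
    """Simulate one trial; return (choice, decision_time).

    theta : callable mapping time t to the (positive) threshold, possibly np.inf.
    choice is +1 or -1; if no threshold is reached by t_max, the sign of the
    final belief is returned as a forced choice (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(round(t_max / dt))
    y = 0.0
    for k in range(n_steps):
        t = k * dt
        if abs(y) >= theta(t):
            return (1 if y >= 0 else -1), t
        # Two independent Wiener increments: sample noise and sensory noise.
        dW, dW_prime = rng.normal(0.0, np.sqrt(dt), size=2)
        y += correct_sign * m * dt + np.sqrt(2.0 * m) * dW + sigma_y * dW_prime
    return (1 if y >= 0 else -1), t_max

# Adaptive threshold: infinite before a reward increase at t = 0.5, finite after.
adaptive = lambda t: np.inf if t < 0.5 else 1.0
# Constant threshold, as in the Const model.
constant = lambda t: 1.0

rng = np.random.default_rng(1)
adaptive_trials = [simulate_noisy_bayesian(adaptive, rng=rng) for _ in range(50)]
constant_trials = [simulate_noisy_bayesian(constant, rng=rng) for _ in range(50)]
```

With the infinite early threshold, no response can occur before the change time regardless of how much noise is added to the belief, whereas the constant threshold permits early, noise-triggered responses.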
When all three models were implemented without additional noise, the relative benefits of the normative model depended on the exact task condition. The performance differential between models was highest when reward changed from low to high values (Figure 4A, dotted line; Figure 4). Under these conditions, normative thresholds are initially infinite and become finite after the reward increases, ensuring that most responses occur immediately once the high reward becomes available (Figure 4D). In contrast, response times generated by the constant-threshold and UGM models tend to not follow this pattern. For the constant-threshold model, many responses occur early, when the reward is low (Figure 4E). For the UGM, a substantial fraction of responses are late, leading to higher time costs; however, it is possible to tune the rate of collapse of the UGM’s thresholds to prevent any early responses while the reward is low (Figure 4F). In contrast, when the reward changes from high to low values, all models exhibit similar response distributions and reward rates (Figure 4A, dashed line; Figure 4—figure supplement 2). This result is not surprising, given that the constant-threshold model produces early peaks in the reaction time distribution, and the UGM was designed to mimic collapsing bounds that hasten decisions in response to imminent decreases in reward (Cisek et al., 2009). We therefore focused on the robustness of each strategy when corrupted by noise and responding to low-to-high reward switches – the regime differentiating strategy performance in ways that could be identified in subject behavior.
Adding noise to the internal belief state (which tends to trigger earlier responses) and simulated response distributions (which tends to smooth out the distributions) without re-tuning the models to account for the additional noise does not alter the advantage of the normative model: across a range of added noise strengths, which we define as $\frac{{\sigma}_{y}+{\sigma}_{mn}}{{\overline{\sigma}}_{y}+{\overline{\sigma}}_{mn}}$, where ${\overline{\sigma}}_{y}$ and ${\overline{\sigma}}_{mn}$ are the maximum possible strengths of sensory and motor noise, respectively, the normative model outperforms the other two when encountering low-to-high reward switches (Figure 4C). This robustness arises because, prior to the reward change, the normative model uses infinite decision thresholds that prevent early noise-triggered responses when reward is low (Figure 4D). In contrast, the heuristic models have finite collapsing or constant thresholds and thus produce more suboptimal early responses as belief noise is increased (Figure 4E and F). Thus, adaptive decision strategies can result in considerably higher reward rates than heuristic alternatives even when implemented imperfectly, suggesting subjects may be motivated to learn such strategies.
Adaptive normative strategies in the tokens task
To determine the relevance of the normative model to human decision-making, we analyzed previously collected data from a ‘tokens task’ (Cisek et al., 2009). For this task, human subjects were shown 15 tokens inside a center target flanked by two empty targets (see Figure 5A for a schematic). Every 200 ms, a token moved from the center target to one of the neighboring targets with equal probability. Subjects were tasked with predicting which flanking target would contain more tokens by the time all 15 moved from the center. Subjects could respond at any time before all 15 tokens had moved. Once the subject made the prediction, the remaining tokens would finish their movements to indicate the correct alternative. Given this task structure, one can show using a combinatorial argument (Cisek et al., 2009) that the state likelihood function ${p}_{n}=\mathrm{Pr}(\text{top}\mid {\xi}_{1:n})$, the probability the top target will hold more tokens at the end of the trial, is given by:
(8) ${p}_{n}=\frac{{C}_{n}!}{{2}^{{C}_{n}}}\sum _{k=0}^{\min \{{C}_{n},\,7-{L}_{n}\}}\frac{1}{k!\,({C}_{n}-k)!},$
where ${U}_{n}$, ${L}_{n}$, and ${C}_{n}$ are the number of tokens in the upper, lower, and center targets after token movement $n$, respectively. The token movements are Markovian because each token has an equal chance of moving to the upper/lower target. However, the probability that a target will contain more tokens at the end of the trial is history dependent, and the evolution of these probabilities is thus non-Markovian. As such, the quality of evidence possible from each token draw changes dynamically and gradually. In addition, the task included two different post-decision token movement speeds, ‘slow’ and ‘fast’: once the subject committed to a choice, the tokens finished out their animation, moving either once every 170 ms (slow task) or once every 20 ms (fast task). This post-decision movement acceleration changed the value associated with commitment by making the average inter-trial interval ($\langle {t}_{i}\rangle $ in Equation 1) decrease over time. Because of this modulation, we can interpret the tokens task as a multi-change reward task, where commitment value is controlled through $\langle {t}_{i}\rangle $ rather than through reward ${R}_{c}$. Our dynamic-programming framework for generating adaptive decision rules can handle the gradual changes in task context emerging in the tokens task. Given that costs and rewards can be subjective, we quantified how normative decision thresholds change with different combinations of rewards ${R}_{c}$ and costs $c(t)=c$ for fixed punishment ${R}_{i}=-1$, for both the slow (Figure 5B) and fast (Figure 5C) versions of the task.
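The state likelihood for the tokens task can be computed by direct counting. With 15 tokens, the upper target wins if the lower target ends with at most 7 tokens, so the probability is the fraction of the ${2}^{{C}_{n}}$ equally likely outcomes for the remaining center tokens in which at most $7-{L}_{n}$ of them move down. The sketch below implements this counting; the function name `prob_top_wins` and its argument names are hypothetical.

```python
from math import comb

def prob_top_wins(U, L, C, total=15):
    """Probability that the upper target ends the trial with more tokens.

    U, L, C : tokens currently in the upper, lower, and center targets.
    Assumes each remaining center token moves up or down with probability 1/2
    and that `total` is odd, so ties cannot occur.
    """
    if U + L + C != total:
        raise ValueError("U + L + C must equal the total number of tokens")
    max_lower = total // 2          # upper wins if lower ends with at most 7
    k_max = min(C, max_lower - L)   # most center tokens that may still move down
    if k_max < 0:
        return 0.0                  # lower target has already won
    favorable = sum(comb(C, k) for k in range(k_max + 1))
    return favorable / 2**C

# Probability after the first token has moved to the upper target.
p_after_first = prob_top_wins(U=1, L=0, C=14)
```

Because each successive token resolves a larger share of the remaining uncertainty, the change in this probability per token movement grows over the trial, which is the sense in which evidence quality changes gradually.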
We identified four distinct motifs of normative decision threshold dynamics for the tokens task (Figure 5i–iv). Some combinations of rewards and costs produced collapsing thresholds (Figure 5ii) similar to the UGM developed by Cisek et al., 2009 for this task. In contrast, large regions of task parameter space produced rich nonmonotonic threshold dynamics (Figure 5iii,iv) that differed from any found in the UGM. In particular, as in the case of reward-change tasks, normative thresholds were often infinite for the first several token movements, preventing early and weakly informed responses. These motifs are similar to those produced by low-to-high reward switches in the reward-change task, but here resulting from the low relative cost of early observations. These nonmonotonic dynamics also appear if we measure belief in terms of the difference in tokens between the top and bottom target, which we call ‘token lead space’ (see Figure 5—figure supplement 1).
Adaptive normative strategies best fit subject response data
To determine the relevance of these adaptive decision strategies to human behavior, we fit discrete-time versions of the noisy Bayesian (four free parameters), constant-threshold (three free parameters), and urgency-gating (five free parameters) models to response-time data from the tokens task collected by Cisek et al., 2009; see Table 1 in Materials and methods for a table of parameters for each model. All models included belief and motor noise, as in our analysis of the dynamic-context tasks (Figure 4—figure supplement 1C). The normative model tended to fit the data better than the heuristic models (see Figure 6—figure supplement 1), based on three primary analyses. First, both corrected AIC (AICc), which accounts for goodness of fit and model degrees of freedom, and average root-mean-squared error (RMSE) between the predicted and actual trial-by-trial response times, favored the noisy Bayesian model for most subjects for both the slow (Figure 6A) and fast (Figure 6D) versions of the task. Second, when considering only the best-fitting model for each subject and task condition, the noisy Bayesian model tended to better predict subjects’ response times (Figure 6B and E). Third, most subjects whose data were best described by the noisy Bayesian model had best-fit parameters that corresponded to nonmonotonic decision thresholds, which cannot be produced by either of the other two models (Figure 6C and F). This result also shows that, assuming subjects used a normative model, they used distinct model parameters, and thus different strategies, for the fast and slow task conditions. This finding is clearer when looking at the posterior parameter distribution for each subject and model parameter (see Figure 6—figure supplement 1 for an example). We speculate that the higher estimated value of reward in the slow task may arise due to subjects valuing frequent rewards more favorably.
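For readers unfamiliar with the two comparison metrics, the sketch below shows the standard small-sample corrected AIC, $\mathrm{AICc}=2k-2\ln \hat{L}+\frac{2k(k+1)}{n-k-1}$, and the RMSE between predicted and observed response times. The log-likelihoods and trial count in the usage example are made-up numbers for illustration only; the function names are hypothetical and the fitting procedure itself is not reproduced here.

```python
import numpy as np

def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k free parameters fit to n data points."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2.0 * k - 2.0 * log_likelihood
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

def rmse(predicted, observed):
    """Root-mean-squared error between predicted and observed response times."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

# Illustrative comparison using the parameter counts stated in the text
# (NB: 4, Const: 3, UGM: 5) and hypothetical log-likelihoods on 300 trials.
scores = {
    "NB": aicc(-410.0, k=4, n=300),
    "Const": aicc(-425.0, k=3, n=300),
    "UGM": aicc(-412.0, k=5, n=300),
}
best_model = min(scores, key=scores.get)
```

Lower AICc indicates a better trade-off between fit and complexity, so a model with more parameters must improve the likelihood enough to offset its larger penalty.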
Together, our results strongly suggest that these human subjects tended to use an adaptive, normative strategy instead of the kinds of heuristic strategies often used to model response data from dynamic context tasks.
Discussion
The goal of this study was to build on previous work showing that in dynamic environments, the most effective decision processes do not necessarily use relatively simple, predefined computations as in many decision models (Bogacz et al., 2006; Cisek et al., 2009; Drugowitsch et al., 2012), but instead adapt to learned or predicted features of the environmental dynamics (Drugowitsch et al., 2014a). Specifically, we used new ‘dynamic context’ task structures to demonstrate that normative decision commitment rules (i.e., decision thresholds, or bounds, in ‘accumulate-to-bound’ models) adapt to reward and evidence-quality switches in complex, but predictable, ways. Comparing the performance of these normative decision strategies to the performance of classic heuristic models, we found that the advantage of normative models is maintained when computations are noisy. We extended these modeling results to include the ‘tokens task’, in which evidence quality changes in a way that depends on stimulus history and the utility of commitment increases over time. We found that the normative decision thresholds for the tokens task are also nonmonotonic and robust to noise. By reanalyzing human subject data from this task, we found most subjects’ response times were best explained by a noisy normative model with nonmonotonic decision thresholds. Taken collectively, these results show that ideal observers and human subjects use adaptive and robust normative decision strategies in relatively simple decision environments.
Our results can aid experimentalists investigating the nuances of complex decision-making in several ways. First, we demonstrated that normative behavior varies substantially across task parameters for relatively simple tasks. For example, the reward-change task structure produces five distinct behavioral motifs, such as waiting until reward increases (Figure 2i) and responding before reward decreases unless the accumulated evidence is ambiguous (Figure 2iv). Using these kinds of modeling results to inform experimental design can help us understand the possible behaviors to expect in subject data. Second, extending our work and considering the sensitivity of performance to both model choice and task parameters (Barendregt et al., 2019; Radillo et al., 2019) will help to identify regions of task parameter space where models are most identifiable from observables like response time and choice. Third, and more generally, our work provides evidence that for tasks with gradual changes in evidence quality and reward, human behavior is more consistent with normative principles than with previously proposed heuristic models. However, more work is needed to determine if and how people follow normative principles for other dynamic-context tasks, such as those involving abrupt changes in evidence or reward contingencies, by using normative theory to determine which subject strategies are plausible, the nature of tasks needed to identify them, and the relationship between task dynamics and decision rules.
Model-driven experimental design can aid in identification of adaptive decision rules in practice. People commonly encounter unpredictable (e.g. an abrupt thunderstorm) and predictable (e.g. sunset) context changes when making decisions. Natural extensions of common perceptual decision tasks (e.g. random-dot motion discrimination [Gold and Shadlen, 2002]) could include within-trial changes in stimulus signal-to-noise ratio (evidence quality) or anticipated reward payout. Task-relevant variability can also arise from internal sources, including noise in neural processing of sensory input and motor output (Ma and Jazayeri, 2014; Faisal et al., 2008). We assumed subjects do not have precise knowledge of the strength or nature of these noise sources, and thus they could not optimize their strategy accordingly. However, people may be capable of rapidly estimating performance error that results from such internal noise processes and adjusting online (Bonnen et al., 2015). To extend the models we considered, we could therefore assume that subjects can estimate the magnitude of their own sensory and motor noise, and use this information to adapt their decision strategies to improve performance.
Real subjects likely do not rely on a single strategy when performing a sequence of trials (Ashwood et al., 2022) and instead rely on a mix of near-normative, sub-normative, and heuristic strategies. In fitting subject data, experimentalists are thus presented with the difficult task of constructing a library of possible models to use in their analysis. More general approaches have been developed for fitting response data to a broad class of models (Shinn et al., 2020), but these model libraries are typically built on pre-existing assumptions of how subjects accumulate evidence and make decisions. Because the potential library of decision strategies is theoretically limitless, a normative analysis can both expand and provide insight into the range of possible subject behaviors in a systematic and principled way. Understanding this scope will assist in developing a well-groomed candidate list of near-normative and heuristic models. For example, if a normative analysis of performance on a dynamic reward task produces threshold dynamics similar to those in Figure 2B, then the fitting library should include a piecewise-constant threshold (or urgency signal) model. Combining these model-based investigations with model-free approaches, such as rate-distortion theory (Berger, 2003; Eissa et al., 2021), can also aid in identifying commonalities in performance and resource usage within and across model classes without the need for pilot experiments.
Our work complements the existing literature on optimal decision thresholds by demonstrating the diversity of forms those thresholds can take under different dynamic task conditions. Several early normative theories were, like ours, based on dynamic programming (Rapoport and Burkheimer, 1971; Busemeyer and Rapoport, 1988) and in some cases models fit to experimental data (Ditterich, 2006). For example, dynamic programming was used to show that certain optimal decisions can require nonconstant decision boundaries similar to those of our normative models in dynamic reward tasks (Frazier and Yu, 2007; Figure 2). More recently, dynamic programming (Drugowitsch et al., 2012; Drugowitsch et al., 2014b; Tajima et al., 2016) or policy iteration (Malhotra et al., 2017; Malhotra et al., 2018) have been used to identify normative strategies in dynamic environments that can have monotonically collapsing decision thresholds that in some cases can be implemented using an urgency signal (Tajima et al., 2019). These strategies include dynamically changing decision thresholds when signal-to-noise ratios of evidence streams vary according to a Cox-Ingersoll-Ross process (Drugowitsch et al., 2014a) and nonmonotonic thresholds when the evidence quality varies unpredictably across trials but is fixed within each trial (Malhotra et al., 2018). Other recent work has started to generalize notions of urgency-gating behavior (Trueblood et al., 2021). However, these previous studies tended to focus on environments with a fixed structure, in which dynamic decision thresholds are adapted as the observer acquires knowledge of the environment. Here we have characterized in more detail how both expected and unexpected changes in context within trials relate to changes in decision thresholds over time.
Perceptual decision-making tasks provide a readily accessible route for validating our normative theory, especially considering the ease with which task difficulty can be parameterized to identify parameter ranges in which strategies can best be differentiated (Philiastides et al., 2006). There is ample evidence already that people can tune the timescale of leaky evidence accumulation processes to the switching rate of an unpredictably changing state governing the statistics of a visual stimulus, to efficiently integrate observations and make a decision about the state (Ossmy et al., 2013; Glaze et al., 2015). We thus speculate that adaptive decision rules could be identified similarly in the strategies people use to make decisions about perceptual stimuli in dynamic contexts.
The neural mechanisms responsible for implementing and controlling decision thresholds are not well understood. Recent work has identified several cortical regions that may contribute to threshold formation, such as prefrontal cortex (Hanks et al., 2015), dorsal premotor area (Thura and Cisek, 2020), and superior colliculus (Crapse et al., 2018; Jun et al., 2021). Urgency signals are a complementary way of dynamically changing decision thresholds via a commensurate scale in belief, which Thura and Cisek, 2017 suggest are detectable in recordings from basal ganglia. The normative decision thresholds we derived do not employ urgency signals, but analogous UGMs may involve nonmonotonic signals. For example, the switch from an infinite-to-constant decision threshold typical of low-to-high reward switches would correspond to a signal that suppresses responses until a reward change. Measurable signals predicted by our normative models would therefore correspond to zero mean activity during low reward, followed by constant mean activity during high reward. While more experimental work is needed to test this hypothesis, our work has expanded the view of normative and neural decision making as dynamic processes for both deliberation and commitment.
Materials and methods
Normative decision thresholds from dynamic programming
Here we detail the dynamic programming tools required to find normative decision thresholds. For the free-response tasks we consider, an observer gathers a sample of evidence $\xi$, uses the log-likelihood ratio (LLR) $y=\ln\frac{\mathrm{Pr}(s_{+}|\xi)}{\mathrm{Pr}(s_{-}|\xi)}$ as their ‘belief’, and sets potentially time-dependent decision thresholds, $\theta_{\pm}(t)$, that determine when they will stop accumulating evidence and commit to a choice. When $y\ge\theta_{+}(t)$ ($y\le\theta_{-}(t)$), the observer chooses the state $s_{+}$ ($s_{-}$). In general, an observer is free to set $\theta_{\pm}(t)$ any way they wish. However, a normative observer sets these thresholds to optimize an objective function, which we assume throughout this study to be the trial-averaged reward rate, $\rho$, given by Equation 1. In this definition of reward rate, the incremental cost function $c(t)$ accounts for both explicit costs (e.g. paying for observed evidence, metabolic costs of storing belief in working memory) and implicit costs (e.g. opportunity cost). We assume symmetry in the problem (in terms of prior, rewards, etc.) that guarantees the thresholds are symmetric about $y=0$, so that $\theta_{\pm}(t)=\pm\theta(t)$. We derive the optimal threshold policy for a general incremental cost function $c(t)$, but in our results we consider only constant cost functions $c$. Although the space of possible cost functions is large, restricting to a constant value ensures that threshold dynamics are governed purely by task and reward structure and not by an arbitrary evidence cost function.
To find the thresholds $\pm\theta(t)$ that optimize the reward rate given by Equation 1, we start with a discrete-time task where observations are made every $\delta t$ time units, and we simplify the problem so the length of each trial is fixed and independent of the decision time $T_{d}$. This simplification makes the denominator of $\rho$ constant with respect to trial-to-trial variability, meaning we can optimize reward rate by maximizing the numerator $\langle R\rangle-\langle C(T_{d})\rangle$. Under this simplified task structure, we suppose the observer has just drawn a sample $\xi_{n}$ and updated their state likelihood to $p_{n}=\frac{1}{1+e^{-y_{n}}}$, where $y_{n}=\ln\frac{\mathrm{Pr}(s_{+}|\xi_{1:n})}{\mathrm{Pr}(s_{-}|\xi_{1:n})}$ is the discrete-time LLR given by Equation 2. At this moment, the observer takes one of three possible actions:
Stop accumulating evidence and commit to choice $s_{+}$. This action has value equal to the average reward for choosing $s_{+}$, which is given by:
(9) $V_{+}(p_{n})=R_{c}\,p_{n}+R_{i}\,(1-p_{n}),$
where $R_{c}$ is the value for a correct choice and $R_{i}$ is the value for an incorrect choice.
Stop accumulating evidence and commit to choice $s_{-}$. By assuming the reward for correctly (or incorrectly) choosing $s_{+}$ is the same as choosing $s_{-}$, the value of this action is obtained by symmetry from:
(10) $V_{-}(p_{n})=R_{c}\,(1-p_{n})+R_{i}\,p_{n}.$
Wait to commit to a choice and draw an additional piece of evidence. Choosing this action means the observer expects their future overall value $V$ to be greater than their current value, less the cost incurred by waiting for additional evidence. Therefore, the value of this choice is given by:
(11) $V_{w}(p_{n})=\langle V(p_{n+1})\,|\,p_{n}\rangle_{p_{n+1}}-c(t)\,\delta t,$
where $c$ is the incremental evidence cost function; because we assume that the incremental cost is constant, this simplifies to $c(t)\,\delta t=c\,\delta t$.
Given the action values from Equations 9–11, the observer takes the action with maximal value, resulting in their overall value function
(12) $V(p_{n})=\max\{V_{+}(p_{n}),\,V_{-}(p_{n}),\,V_{w}(p_{n})\}.$
Because the value-maximizing action depends on the state likelihood, $p_{n}$, the regions of likelihood space where each action is optimal divide the space into three disjoint regions. The boundaries of these regions are exactly the optimal decision thresholds, which can be mapped to LLR-space to obtain $\pm\theta(t)$. To find these thresholds numerically, we started by discretizing the state likelihood space $p_{n}$. Because the state likelihood $p_{n}$ is restricted to values between 0 and 1, whereas the log-likelihood ratio $y_{n}=\ln\frac{p_{n}}{1-p_{n}}$ is unbounded, we chose to formulate all the components of Bellman’s equation in terms of $p_{n}$ to minimize truncation errors. We then proceeded by using backward induction in time, starting at the total trial length $t=T_{t}$. At this moment in time, it is impossible to wait for more evidence, so the value function in Equation 12 does not depend on the future. This approach implies that the value function is:
(13) $\left.V(p_{n})\right|_{t=T_{t}}=\max\{V_{+}(p_{n}),\,V_{-}(p_{n})\}.$
Once the value is calculated at this time point, it can be used as the future value at time point $t=T_{t}-\delta t$.
To find the decision thresholds for the desired tasks where $T_{t}$ is not fixed, we must optimize both the numerator and denominator of Equation 1. To account for the variable trial length, we adopt techniques from average reward reinforcement learning (Mahadevan, 1996) and penalize the waiting time associated with each action by the waiting time itself scaled by the reward rate $\rho$ (i.e., $\langle t_{i}\rangle\rho$ for committing to $s_{+}$ or $s_{-}$ and $\rho\,\delta t$ for waiting). This modification makes all trials effectively the same length and allows us to use the same approach used to derive Equation 12 (Drugowitsch et al., 2012). The new overall value function is given by Equation 3:
(3) $V(p_{n};\rho)=\max\{R_{c}\,p_{n}+R_{i}\,(1-p_{n})-\langle t_{i}\rangle\rho,\; R_{c}\,(1-p_{n})+R_{i}\,p_{n}-\langle t_{i}\rangle\rho,\; \langle V(p_{n+1};\rho)\,|\,p_{n}\rangle_{p_{n+1}}-c(t)\,\delta t-\rho\,\delta t\}.$
To use this new value function to numerically find the decision thresholds, we must note two new complications that arise from moving away from fixed-length trials. First, we no longer have a natural end time from which to start backward induction. We remedy this issue by following the approach of Drugowitsch et al., 2012 and artificially setting a final trial time $T_{f}$ that is far enough in the future so that decision times of this length are highly unlikely and do not impact the response distributions. If we desire accurate thresholds up to a time $t$, we set $T_{f}=5t$, which produces an accurate solution while avoiding the large numerical overhead incurred by a longer simulation time. In our simulations, we set $t$ based on when we expect most decisions to be made. Second, the value function now depends on the unknown quantity $\rho$, resulting in a co-optimization problem. To address this complication, note that when $\rho$ is maximized, our derivation requires $V\left(p_{0}=\frac{1}{2};\rho\right)=0$ for a consistent Bellman’s equation (Drugowitsch et al., 2012). We exploit this consistency requirement by fixing an initial reward rate $\rho_{0}$, solving the value function through backward induction, calculating $V\left(p_{0}=\frac{1}{2};\rho_{0}\right)$, and updating the value of $\rho$ via a root-finding scheme. For more details on numerical implementation, see https://github.com/nwbarendregt/AdaptNormThresh; Thresh, 2022.
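The co-optimization described above can be sketched in a few lines. The following is a minimal illustration, not the authors' MATLAB implementation: it assumes a static Gaussian 2AFC task with $R_{i}=0$ and constant cost, treats $\langle t_{i}\rangle$ as a fixed inter-trial interval (the hypothetical parameter `t_i`), evaluates the expectation over the next observation by a simple quadrature in $\xi$, and uses bisection as one possible root-finding scheme. All function and parameter names are assumptions made for this example.

```python
import numpy as np

def value_at_half(rho, mu=1.0, sigma=1.0, dt=0.05, T_f=5.0,
                  R_c=1.0, c=0.0, t_i=1.0, n_p=201, n_xi=81):
    """Backward induction for V(p; rho); returns V(p_0 = 1/2) and LLR thresholds."""
    p = np.linspace(1e-6, 1 - 1e-6, n_p)            # state-likelihood grid
    y = np.log(p / (1 - p))                         # LLR at each grid point
    sd = sigma * np.sqrt(dt)                        # std of one observation
    xi = np.linspace(-mu * dt - 6 * sd, mu * dt + 6 * sd, n_xi)
    f_plus = np.exp(-(xi - mu * dt) ** 2 / (2 * sd**2))
    f_minus = np.exp(-(xi + mu * dt) ** 2 / (2 * sd**2))
    # weights[i, j]: probability of observation xi[j] given current belief p[i]
    weights = p[:, None] * f_plus + (1 - p[:, None]) * f_minus
    weights /= weights.sum(axis=1, keepdims=True)
    # Belief after observing xi[j] from p[i]: the LLR grows by 2*mu*xi/sigma^2
    p_next = 1.0 / (1.0 + np.exp(-(y[:, None] + 2 * mu * xi / sigma**2)))
    V_stop = R_c * np.maximum(p, 1 - p) - t_i * rho  # best commitment, R_i = 0
    V = V_stop.copy()                                # at T_f the observer must commit
    n_steps = int(round(T_f / dt))
    mid = n_p // 2                                   # grid index of p = 1/2
    theta = np.zeros(n_steps + 1)
    for k in range(n_steps - 1, -1, -1):
        V_wait = (weights * np.interp(p_next, p, V)).sum(axis=1) - (c + rho) * dt
        V = np.maximum(V_stop, V_wait)
        # Upper threshold: first belief above 1/2 where committing beats waiting
        theta[k] = y[mid + np.argmax(V_stop[mid:] >= V_wait[mid:])]
    return V[mid], theta

def solve_reward_rate(tol=1e-6, **task):
    """Bisection on rho until V(p_0 = 1/2; rho) is approximately zero."""
    lo, hi = 0.0, task.get("R_c", 1.0) / task.get("t_i", 1.0)
    while hi - lo > tol:
        rho = 0.5 * (lo + hi)
        if value_at_half(rho, **task)[0] > 0:
            lo = rho          # value still positive: reward rate is underestimated
        else:
            hi = rho
    rho = 0.5 * (lo + hi)
    v_half, theta = value_at_half(rho, **task)
    return rho, v_half, theta

rho, v_half, theta = solve_reward_rate()
print(f"rho = {rho:.4f}, V(1/2; rho) = {v_half:.2e}, theta(0) = {theta[0]:.3f}")
```

Bisection works here because $V\left(p_{0}=\frac{1}{2};\rho\right)$ decreases monotonically in $\rho$; any bracketing root finder could be substituted. With `T_f=5.0`, the thresholds are intended to be trusted only up to roughly $t=1$, in line with the $T_{f}=5t$ rule above.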
Dynamic context 2AFC tasks
For all dynamic context tasks, we assume that observations follow a Gaussian distribution, so that $\xi\,|\,s_{\pm}\sim\mathcal{N}(\pm\mu,\sigma^{2})$. Using the Functional Central Limit Theorem, one can show (Bogacz et al., 2006) that in the continuous-time limit, the belief $y$ evolves according to a stochastic differential equation:
(14) $dy=\pm m\,dt+\sqrt{2m}\,dW_{t}.$
In Equation 14, $m=\frac{2\mu^{2}}{\sigma^{2}}$ is the scaled signal-to-noise ratio (SNR) given by the observation distribution function $\xi\,|\,s_{\pm}\sim\mathcal{N}(\pm\mu,\sigma^{2})$, $dW_{t}$ is a standard increment of a Wiener process, and the sign of the drift $\pm m\,dt$ is given by the sign of the correct choice $s_{\pm}$. To construct Bellman’s equation for this task, we start by discretizing time $t_{1:n}$ and determine the average value gained by waiting and collecting another observation given by Equation 4:
(4) $\langle V(p_{n+1})\,|\,p_{n}\rangle_{p_{n+1}}=\int_{0}^{1}V(p_{n+1})\,f_{p}(p_{n+1}\,|\,p_{n})\,dp_{n+1},$
where $p_{n}=\mathrm{Pr}(s_{+}|\xi_{1:n})$ is the probability the environment is in state $s_{+}$ given $n$ pieces of evidence. The main difficulty in computing this expectation is computing the conditional probability distribution $f_{p}(p_{n+1}\,|\,p_{n})$, which we call the likelihood transfer function. Once we construct the likelihood transfer function, we can use our discretization of the state likelihood space $p_{n}$ to evaluate the integral in Equation 4 using any standard numerical quadrature scheme. To compute this transfer function, we can start by using the definition of the LLR $y_{n}$ and leveraging the relationship between $p_{n}$ and $y_{n}$ to find $p_{n+1}$ as a function of the observation $\xi_{n+1}$:
(15) $p_{n+1}=\frac{1}{1+e^{-y_{n+1}}}=\frac{1}{1+\frac{1-p_{n}}{p_{n}}\exp\left(-\frac{2\mu}{\sigma^{2}}\xi_{n+1}\right)}.$
Note that we used the fact that in discrete time with a time step $\delta t$, the observations $\xi\,|\,s_{\pm}\sim\mathcal{N}(\pm\mu\,\delta t,\sigma^{2}\delta t)$. The relationship between $\xi_{n+1}$ and $p_{n+1}$ in Equation 15 can be inverted to obtain:
$\xi_{n+1}=\frac{\sigma^{2}}{2\mu}\ln\frac{p_{n+1}\,(1-p_{n})}{p_{n}\,(1-p_{n+1})}.$
With this relationship established, we can find the likelihood transfer function $f_{p}(p(\xi_{1:n+1})\,|\,p(\xi_{1:n}))$ by finding the observation transfer function $f_{\xi}(\xi(p_{n+1})\,|\,\xi(p_{n}))$, which by independence of the samples is simply $f_{\xi}(\xi_{n+1})$, and performing a change of variables. With probability $p_{n}$, $\xi_{n+1}$ will be drawn from the normal distribution $\mathcal{N}(+\mu\,\delta t,\sigma^{2}\delta t)$, and with probability $1-p_{n}$, $\xi_{n+1}$ will be drawn from the normal distribution $\mathcal{N}(-\mu\,\delta t,\sigma^{2}\delta t)$. This immediately provides the observation transfer function by marginalizing:
$f_{\xi}(\xi_{n+1})=\frac{1}{\sqrt{2\pi\sigma^{2}\delta t}}\left[p_{n}\exp\left(-\frac{(\xi_{n+1}-\mu\,\delta t)^{2}}{2\sigma^{2}\delta t}\right)+(1-p_{n})\exp\left(-\frac{(\xi_{n+1}+\mu\,\delta t)^{2}}{2\sigma^{2}\delta t}\right)\right].$
Performing the change of variables using the derivative $\frac{d\xi_{n+1}}{dp_{n+1}}=\frac{\sigma^{2}}{2p_{n+1}\mu-2p_{n+1}^{2}\mu}>0$ yields the transfer function
(16) $f_{p}(p_{n+1}\,|\,p_{n})=\frac{\sigma^{2}}{2\mu\,p_{n+1}(1-p_{n+1})}\cdot\frac{1}{\sqrt{2\pi\sigma^{2}\delta t}}\left[p_{n}\exp\left(-\frac{(\xi(p_{n+1})-\mu\,\delta t)^{2}}{2\sigma^{2}\delta t}\right)+(1-p_{n})\exp\left(-\frac{(\xi(p_{n+1})+\mu\,\delta t)^{2}}{2\sigma^{2}\delta t}\right)\right],$
where $\xi(p_{n+1})=\frac{\sigma^{2}}{2\mu}\ln\frac{p_{n+1}\,(1-p_{n})}{p_{n}\,(1-p_{n+1})}$.
Note that Equation 16 is equivalent to the likelihood transfer function given by Equation 16 in Drugowitsch et al., 2012 for the case of $m=1$. Combining Equation 14 and Equation 16, we can construct Bellman’s equation for any dynamic context task.
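As a sanity check on the change of variables, the sketch below evaluates a likelihood transfer function built from the stated ingredients (the Gaussian mixture over $\xi_{n+1}$, the inverse map from $p_{n+1}$ to $\xi_{n+1}$, and the derivative $\frac{d\xi_{n+1}}{dp_{n+1}}$) and verifies numerically that it integrates to one and has mean $p_{n}$, as a Bayesian posterior should under the observer's own predictive distribution. The function name, grid, and parameter values are illustrative assumptions.

```python
import numpy as np

def transfer_density(p_next, p_now, mu=1.0, sigma=1.0, dt=0.01):
    """Density of the next state likelihood p_next given the current one p_now."""
    # Observation that would move the belief from p_now to p_next
    xi = sigma**2 / (2 * mu) * np.log(p_next * (1 - p_now) / (p_now * (1 - p_next)))
    var = sigma**2 * dt
    def gauss(x, mean):
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    # Marginal density of the observation: mixture over the two states
    f_xi = p_now * gauss(xi, mu * dt) + (1 - p_now) * gauss(xi, -mu * dt)
    # Jacobian d(xi)/d(p_next) of the change of variables
    jacobian = sigma**2 / (2 * mu * p_next * (1 - p_next))
    return f_xi * jacobian

p_now = 0.7
grid = np.linspace(1e-9, 1 - 1e-9, 20_001)
f = transfer_density(grid, p_now)
widths = np.diff(grid)
# Trapezoidal quadrature for the total mass and the mean of p_next
total = np.sum(0.5 * (f[1:] + f[:-1]) * widths)
mean = np.sum(0.5 * (grid[1:] * f[1:] + grid[:-1] * f[:-1]) * widths)
print(total, mean)
```

The mean check reflects the martingale property of the posterior; a transfer function that fails it would bias the value of waiting in Bellman's equation.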
Reward-change task thresholds
For the reward-change task, we fixed punishment $R_{i}=0$ and allowed the reward $R_{c}$ to be a Heaviside function given by Equation 5:
(5) $R_{c}(t)=R_{1}+(R_{2}-R_{1})\,H(t-0.5),$
where $H$ is the Heaviside step function. In Equation 5, there is a single switch in rewards between pre-change reward $R_{1}$ and post-change reward $R_{2}$. This change occurs at $t=0.5$. Substituting this reward function into Equation 3 allows us to find the normative thresholds for this task as a function of $R_{1}$ and $R_{2}$.
For the inferred reward change task, we allowed the reward value $R(t)\in\{R_{H},R_{L}\}$ to be controlled by a continuous-time two-state Markov process with transition (hazard) rate $h$ between rewards $R_{H}\ge R_{L}$. The hazard rate $h$ governs the probability of switching between $R_{H}$ and $R_{L}$:
$\mathrm{Pr}(R(t+\delta t)=R_{L}\,|\,R(t)=R_{H})=\mathrm{Pr}(R(t+\delta t)=R_{H}\,|\,R(t)=R_{L})=h\,\delta t+o(\delta t),$
where $o(\delta t)$ represents a function $g(\delta t)$ with the property $\lim_{\delta t\downarrow 0}\frac{g(\delta t)}{\delta t}=0$ (i.e., all other terms are of smaller order than $\delta t$). In addition, the state of this Markov process must be inferred from evidence $\eta$ that is independent of the environment’s state evidence $\xi$ (i.e., the correct choice). For simplicity, we assume that the reward-evidence source is also Gaussian-distributed such that $\eta\,|\,R_{H/L}\sim\mathcal{N}(\pm\mu_{R},\sigma_{R}^{2})$ with quality $m_{R}=\frac{2\mu_{R}^{2}}{\sigma_{R}^{2}}$. Glaze et al., 2015; Veliz-Cuba et al., 2016; Barendregt et al., 2019 have shown that the belief $y_{R}=\ln\frac{\mathrm{Pr}(R(t)=R_{H}|\eta)}{\mathrm{Pr}(R(t)=R_{L}|\eta)}$ for such a dynamic state inference process is given by the modified DDM
$dy_{R}=x(t)\,m_{R}\,dt+\sqrt{2m_{R}}\,dW_{t}-2h\sinh(y_{R})\,dt,$
where $x(t)\in\{\pm 1\}$ is a telegraph process that mirrors the state of the reward process (i.e., $x(t)=1$ when $R(t)=R_{H}$ and $x(t)=-1$ when $R(t)=R_{L}$). With this belief over reward state, we must also modify the values $V_{+}(p_{n};\rho)$ and $V_{-}(p_{n};\rho)$ to account for the uncertainty in $R_{c}$. Defining $q=\frac{e^{y_{R}}}{1+e^{y_{R}}}$ as the reward likelihood gives
$V_{+}(p_{n},q;\rho)=[q\,R_{H}+(1-q)\,R_{L}]\,p_{n}-\langle t_{i}\rangle\rho,\qquad V_{-}(p_{n},q;\rho)=[q\,R_{H}+(1-q)\,R_{L}]\,(1-p_{n})-\langle t_{i}\rangle\rho,$
where we have fixed ${R}_{i}=0$ for simplicity.
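To make the reward-inference process concrete, the following sketch simulates a telegraph reward state and the observer's belief about it with an Euler-Maruyama scheme. It assumes the modified DDM takes the nonlinear-leak form $dy_{R}=x(t)\,m_{R}\,dt+\sqrt{2m_{R}}\,dW_{t}-2h\sinh(y_{R})\,dt$ associated with the models cited above; this form, the parameter values, the reward magnitudes, and all names are assumptions for illustration only.

```python
import numpy as np

def simulate_reward_belief(h=0.5, m_R=10.0, T=20.0, dt=1e-3, seed=0):
    """Simulate the reward telegraph process x(t) and the belief y_R(t) about it."""
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    switches = rng.random(n) < h * dt            # switch with probability h*dt per step
    noise = rng.normal(0.0, np.sqrt(dt), n)      # Wiener increments
    x = np.empty(n, dtype=int)
    y = np.empty(n)
    x_now, y_now = 1, 0.0                        # start in the high-reward state
    for k in range(n):
        if switches[k]:
            x_now = -x_now
        # Drift toward the true state, diffusion, and hazard-dependent nonlinear leak
        y_now += (x_now * m_R - 2 * h * np.sinh(y_now)) * dt + np.sqrt(2 * m_R) * noise[k]
        x[k], y[k] = x_now, y_now
    q = 1.0 / (1.0 + np.exp(-y))                 # reward likelihood Pr(R = R_H | evidence)
    return x, y, q

R_H, R_L = 10.0, 1.0                             # hypothetical reward magnitudes
x, y_R, q = simulate_reward_belief()
expected_reward = q * R_H + (1 - q) * R_L        # reward expected for a correct choice
tracking = np.mean(np.sign(y_R) == x)            # fraction of time the belief matches the state
print(f"belief matches reward state {tracking:.0%} of the time")
```

The expected reward `q * R_H + (1 - q) * R_L` is the quantity that stands in for $R_{c}$ when the reward state is uncertain.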
SNR-change task thresholds
For the SNR-change task, we allowed the task difficulty $m=\frac{2\mu^{2}}{\sigma^{2}}$ to vary over a single trial by making $\mu(t)$ a time-dependent step function given by Equation 6:
(6) $\mu(t)=\mu_{1}+(\mu_{2}-\mu_{1})\,H(t-0.5),$
where $H$ is the Heaviside step function. In Equation 6, there is a single switch in evidence quality between pre-change quality $\mu_{1}$ and post-change quality $\mu_{2}$. This change occurs at $t=0.5$. Substituting this quality time series into the likelihood transfer function in Equation 16 allows us to find the normative thresholds for this task as a function of $\mu_{1}$ and $\mu_{2}$. This modification necessitates that the transfer function $f_{p}$ also be a function of time; however, because the quality change points are known in advance to the observer, we can simply change between different transfer functions at the specified quality changes.
Reward-change task model performance
Here we detail the three models used to compare observer performance in the reward-change task, as well as the noise filtering process used to generate synthetic data. For the noisy Bayesian model, the observer uses the thresholds $\pm\theta(t)$ obtained via dynamic programming, thus making the observer a noisy ideal observer. For the constant-threshold model, the observer uses a constant threshold $\pm\theta(t)=\pm\theta_{0}$, which is predicted to be optimal only in very simple, static decision environments with only two states $s$. Both the noisy Bayesian and constant-threshold models also use a noisy perturbation of the LLR $\tilde{y}=y+\sigma_{y}Z$ as their belief, where $\sigma_{y}$ is the strength of the noise and $Z$ is a sample from a standard normal distribution. In continuous time, this perturbation involves adding an independent Wiener process to Equation 14:
$d\tilde{y}=\pm m\,dt+\sqrt{2m}\,dW_{t}+\sigma_{y}\,dW_{t}^{\prime},$
where $dW_{t}^{\prime}$ is the increment of an independent Wiener process, scaled by the noise strength $\sigma_{y}$. The UGM, being a phenomenological model, behaves differently from the other models. The UGM belief $E$ is the output of the noisy low-pass filter given by Equation 7.
To add additional noise to the UGM’s belief variable $E$, we simply allowed $\sigma_{y}>0$ in the low-pass filter in Equation 7.
In addition to the inference noise with strength $\sigma_{y}$, we also filtered each process through a Gaussian response-time filter with zero mean and standard deviation $\sigma_{mn}$. Under this response-time filter, if the model predicted a response time $T$, the measured response time $\tilde{T}$ was drawn from a normal distribution centered at $T$ with standard deviation $\sigma_{mn}$. If the response time $\tilde{T}$ was drawn outside of the simulation’s time discretization (i.e., if $\tilde{T}<0$ or $\tilde{T}>\frac{T_{f}}{5}$), we redrew $\tilde{T}$ until it fell within the discretization. This filter was chosen to represent both ‘early responses’ caused by attentional lapses, as well as ‘late responses’ caused by motor processing delays between the formation of a choice in the brain and the physical response. We have chosen to add these two sources of noise after optimizing each model to maximize average reward rate, rather than re-optimizing each model after adding these additional noise sources. Although we could have re-optimized each model to maximize performance across noise realizations, we were interested in how the models responded to perturbations that drove their performance to be suboptimal (but possibly near-optimal).
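The response-time filter amounts to sampling from a normal distribution truncated to the simulated time window by rejection. A minimal sketch, with hypothetical function and argument names, is:

```python
import numpy as np

def noisy_response_time(T, sigma_mn, T_max, rng):
    """Draw a measured response time around the model's prediction T.

    Samples from N(T, sigma_mn^2) and redraws until the sample lies in
    [0, T_max], where T_max plays the role of T_f / 5 in the text.
    Assumes 0 <= T <= T_max so that the loop terminates quickly.
    """
    if sigma_mn == 0:
        return float(T)                  # no motor noise: response time unchanged
    while True:
        T_tilde = rng.normal(T, sigma_mn)
        if 0.0 <= T_tilde <= T_max:
            return T_tilde

rng = np.random.default_rng(0)
# Example: predicted response at t = 0.1 with the largest motor noise considered
samples = np.array([noisy_response_time(0.1, 0.25, 1.0, rng) for _ in range(1000)])
print(samples.min(), samples.max())
```

Because rejected draws are discarded, predictions near the edges of the window yield measured response times that are skewed toward the interior.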
To compare model performance on the reward-change task, we first fixed the value of pre-change reward $R_{1}$ (and set $R_{1}+R_{2}=11$) to find the post-change reward and tuned each model to achieve optimal reward rate with no additional noise in both the inference and response processes. Bellman’s equation outputs both the optimal normative thresholds and reward rate. For the constant threshold model and the UGM, we approximated the maximal performance of each model by using a grid search over each model’s parameters to find the model tuning that yielded the highest average reward rate. After tuning all models for a given reward structure, we filtered them through both the sensory ($\sigma_{y}$) and motor ($\sigma_{mn}$) noise sources without retuning the models to account for this additional noise. When generating noisy synthetic data from these models, we generated 100 synthetic subjects, each with sampled values of $\sigma_{y}$ and $\sigma_{mn}$. For each synthetic subject with noise parameter sample ($\sigma_{y}$, $\sigma_{mn}$), we defined the ‘noise strength’ of that subject’s noise to be the ratio
where $\overline{\sigma}_{y}=5$ and $\overline{\sigma}_{mn}=0.25$ are the maximum values of belief noise and motor noise considered, respectively. Using this metric, noise strength is defined between 0 and 1. Additionally, the maximum noise levels $\overline{\sigma}_{y}$ and $\overline{\sigma}_{mn}$ were chosen such that a noise strength of 0.5 is approximately equivalent to the fitted noise strength obtained from tokens task subject data. We plot the response distributions using noise strengths of 0, 0.5, and 1 in our results. To compare the performance of each model after being corrupted by noise, we then generated 1000 trials for each subject and had each simulated subject repeat the same block of trials three times, once for each model. This process ensured that the only difference between model performance would come from their distinct threshold behaviors, because each model was taken to be equally noisy and was run using the same stimuli.
Tokens task
Normative model for the tokens task
For the tokens task, observations take the form of token movements that occur every 200 ms and are Bernoulli distributed with parameter $p=0.5$. Once a subject committed to a decision, the token movements continued at a faster rate until the entire animation had finished. This post-decision token acceleration was 170 ms per movement in the ‘slow’ version of the task and 20 ms per movement for the ‘fast’ version of the task. Because of the stimulus structure, one can show using a combinatorial argument (Cisek et al., 2009) that the likelihood function $p_{n}$ is given by Equation 8. Constructing the likelihood transfer function $f_{p}$ required for Bellman’s equation is also simplified from the Gaussian 2AFC tasks, as there are only two possible likelihoods that one can transition to after observing a token movement:
Combining Equation 8 and Equation 17, we can fully construct Bellman’s equation for the tokens task. While the timings of the token movements, post-decision token acceleration, and inter-trial interval are fixed, we let the reward $R_{c}$ and cost function $c$ be free parameters to control the different threshold dynamics of the model.
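The two-point structure of the tokens-task transfer function can be illustrated directly. The sketch below does not reproduce Equation 8; instead it computes the same kind of quantity, the probability that the $s_{+}$ target ends with the majority of tokens, as a binomial tail sum over the remaining movements. It assumes 15 tokens per trial, a design detail of the tokens task that is not stated in this section, and all function names are hypothetical.

```python
from math import comb

N_TOKENS = 15  # assumed number of tokens per trial (not stated in this section)

def p_plus(n_plus, n_minus, n_tokens=N_TOKENS):
    """Probability that the + target ends with the majority of tokens."""
    remaining = n_tokens - n_plus - n_minus
    needed = n_tokens // 2 + 1 - n_plus          # further + movements required to win
    # Count equally likely completions with at least `needed` movements toward +
    wins = sum(comb(remaining, k) for k in range(max(needed, 0), remaining + 1))
    return wins / 2**remaining

def transitions(n_plus, n_minus):
    """The two reachable likelihoods after one more token movement, each with prob. 1/2."""
    return [(0.5, p_plus(n_plus + 1, n_minus)), (0.5, p_plus(n_plus, n_minus + 1))]

def expected_next(n_plus, n_minus):
    """Expected likelihood after the next movement; equals the current likelihood."""
    return sum(prob * p_next for prob, p_next in transitions(n_plus, n_minus))

print(p_plus(0, 0), transitions(3, 1), expected_next(3, 1), p_plus(3, 1))
```

Indexing the state by token counts rather than by $p_{n}$ keeps the backward induction exact, since each state has only two successors and no interpolation on a likelihood grid is required.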
Model fitting and comparison
We used three models to fit the subject response data provided by Cisek et al., 2009: the noisy Bayesian model ($k=4$ parameters), the constant threshold model ($k=3$ parameters), and the UGM ($k=5$ parameters) (Table 1). To adapt the continuous-time models to this discrete-time task, we simply changed the time step to match the time between token movements ($\delta t=200$ ms). To fit each model, we took the subject response time distributions as our objective function and used Markov Chain Monte Carlo (MCMC) with a standard Gaussian proposal distribution to generate an approximate posterior made up of 10,000 samples. For more details as to our specific implementation of MCMC for this data, see the MATLAB code available at https://github.com/nwbarendregt/AdaptNormThresh, (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2; Thresh, 2022). We held out 2 of the 22 subjects to use as training data when tuning the covariance matrix of the proposal distribution for each model, and performed the model fitting and comparison analysis on the remaining 20 subjects. Using the approximate posterior obtained via MCMC for each subject and model, we calculated AICc using the formula
(18) $\mathrm{AICc}=2k-2\ln(\widehat{L})+\frac{2k^{2}+2k}{n-k-1}.$
In Equation 18, $k$ is the number of parameters of the model, $\widehat{L}$ is the likelihood of the model evaluated at the maximum-likelihood parameters, and $n$ is the number of responses in the subject data (Cavanaugh, 1997; Burnham and Anderson, 2002). Because each subject performed different numbers of trials, using AICc allowed us to normalize results to account for the different data sizes; note that for many responses (i.e., for large $n$), AICc converges to the standard definition of AIC. For the second model selection metric, we measured how well each fitted model predicted the trial-by-trial responses of the data by calculating the average RMSE between the response times from the data and the response times predicted by each model. To measure the difference between a subject’s response time distribution and the fitted model’s distribution (Figure 6—figure supplement 1), we used Kullback-Leibler (KL) divergence:
(19) $\mathrm{KL}=\sum_{i}{\text{RT}}_{D}(i)\,\ln\frac{{\text{RT}}_{D}(i)}{{\text{RT}}_{M}(i)}.$
In Equation 19, $i$ is a time index representing the number of observed token movements, ${\text{RT}}_{D}(i)$ is the probability of responding after $i$ token movements from the subject data, and ${\text{RT}}_{M}(i)$ is the probability of responding after $i$ token movements from the model’s response distribution. Smaller values of KL divergence indicate that the model’s response distribution is more similar to the subject data.
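The three comparison metrics are simple to compute once a model's maximum log-likelihood and response-time predictions are available. The sketch below uses the standard small-sample AICc formula and takes the KL divergence from the data distribution to the model distribution; that direction, the small floor `eps` guarding against empty model bins, and all names are assumptions of this example rather than details of the authors' code.

```python
import numpy as np

def aicc(log_likelihood, k, n):
    """Corrected AIC from the maximized log-likelihood, k parameters, n responses."""
    return 2 * k - 2 * log_likelihood + (2 * k**2 + 2 * k) / (n - k - 1)

def rmse(rt_data, rt_model):
    """Root-mean-squared error between observed and predicted response times."""
    diff = np.asarray(rt_data, dtype=float) - np.asarray(rt_model, dtype=float)
    return float(np.sqrt(np.mean(diff**2)))

def kl_divergence(rt_data, rt_model, eps=1e-12):
    """KL divergence between response distributions over token-movement counts."""
    d = np.asarray(rt_data, dtype=float)
    m = np.asarray(rt_model, dtype=float)
    d, m = d / d.sum(), m / m.sum()              # normalize histograms to probabilities
    mask = d > 0                                 # terms with RT_D(i) = 0 contribute zero
    return float(np.sum(d[mask] * np.log(d[mask] / np.maximum(m[mask], eps))))

# Hypothetical response histograms over 15 token movements
data_hist = np.array([0, 0, 1, 2, 5, 9, 14, 18, 16, 12, 8, 5, 3, 1, 1])
model_hist = np.array([0, 1, 1, 3, 6, 10, 13, 16, 15, 12, 9, 6, 3, 2, 1])
print(aicc(-250.0, k=4, n=int(data_hist.sum())), kl_divergence(data_hist, model_hist))
```

For a fixed data set, lower AICc, lower RMSE, and lower KL divergence all indicate a better-fitting model; only AICc penalizes the number of parameters.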
Code availability
See https://github.com/nwbarendregt/AdaptNormThresh; (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2; Thresh, 2022) for the MATLAB code used to generate all results and figures.
Data availability
MATLAB code used to generate all results and figures is available at https://github.com/nwbarendregt/AdaptNormThresh, (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2).
References

Mice alternate between discrete strategies during perceptual decisionmakingNature Neuroscience 25:201–212.https://doi.org/10.1038/s4159302101007z

Acquisition of decision making criteria: reward rate ultimately beats accuracyAttention, Perception & Psychophysics 73:640–657.https://doi.org/10.3758/s1341401000497

Analyzing dynamic decisionmaking models using chapmankolmogorov equationsJournal of Computational Neuroscience 47:205–222.https://doi.org/10.1007/s10827019007335

A theoretical analysis of the reward rate optimality of collapsing decision criteria. Attention, Perception, & Psychophysics 82:1520–1534. https://doi.org/10.3758/s13414-019-01806-4

The neural basis of the speed-accuracy tradeoff. Trends in Neurosciences 33:10–16. https://doi.org/10.1016/j.tins.2009.09.002

Book: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York Inc: Springer.

Psychological models of deferred decision making. Journal of Mathematical Psychology 32:91–134. https://doi.org/10.1016/0022-2496(88)90042-9

The urgency-gating model can explain the effects of early evidence. Psychonomic Bulletin & Review 22:1830–1838. https://doi.org/10.3758/s13423-015-0851-2

Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters 33:201–208. https://doi.org/10.1016/S0167-7152(96)00128-9

Speed-accuracy tradeoffs in animal decision making. Trends in Ecology & Evolution 24:400–407. https://doi.org/10.1016/j.tree.2009.02.010

Decisions in changing conditions: the urgency-gating model. The Journal of Neuroscience 29:11560–11571. https://doi.org/10.1523/JNEUROSCI.1844-09.2009

Linking biomechanics and ecology through predator-prey interactions: flight performance of dragonflies and their prey. The Journal of Experimental Biology 215:903–913. https://doi.org/10.1242/jeb.059394

Evidence for time-variant decision making. The European Journal of Neuroscience 24:3628–3641. https://doi.org/10.1111/j.1460-9568.2006.05221.x

The cost of accumulating evidence in perceptual decision making. The Journal of Neuroscience 32:3612–3628. https://doi.org/10.1523/JNEUROSCI.4010-11.2012

Conference: Optimal decision-making with time-varying evidence reliability. In Advances in Neural Information Processing Systems. pp. 748–756.

Report: Notes on normative solutions to the speed-accuracy tradeoff in perceptual decision-making. FENS-Hertie Winter School.

A parameter recovery assessment of time-variant models of decision-making. Behavior Research Methods 52:193–206. https://doi.org/10.3758/s13428-019-01218-0

Conference: Sequential Hypothesis Testing under Stochastic Deadlines. Advances in Neural Information Processing Systems.

Evidence integration and decision confidence are modulated by stimulus consistency. Nature Human Behaviour 6:988–999. https://doi.org/10.1038/s41562-022-01318-6

The neural basis of decision making. Annual Review of Neuroscience 30:535–574. https://doi.org/10.1146/annurev.neuro.29.051605.113038

Optimal models of decision-making in dynamic environments. Current Opinion in Neurobiology 58:54–60. https://doi.org/10.1016/j.conb.2019.06.006

Adaptive neural coding: from biological to behavioral decision-making. Current Opinion in Behavioral Sciences 5:91–99. https://doi.org/10.1016/j.cobeha.2015.08.008

Neural coding of uncertainty and probability. Annual Review of Neuroscience 37:205–220. https://doi.org/10.1146/annurev-neuro-071013-014017

Overcoming indecision by changing the decision boundary. Journal of Experimental Psychology: General 146:776–805. https://doi.org/10.1037/xge0000286

Time-varying decision boundaries: insights from optimality analysis. Psychonomic Bulletin & Review 25:971–996. https://doi.org/10.3758/s13423-017-1340-6

Some task demands induce collapsing bounds: evidence from a behavioral analysis. Psychonomic Bulletin & Review 25:1225–1248. https://doi.org/10.3758/s13423-018-1479-9

Neural representation of task difficulty and decision making during perceptual categorization: a timing diagram. The Journal of Neuroscience 26:8965–8975. https://doi.org/10.1523/JNEUROSCI.1655-06.2006

Models for deferred decision making. Journal of Mathematical Psychology 8:508–538. https://doi.org/10.1016/0022-2496(71)90005-8

A theory of memory retrieval. Psychological Review 85:59–108. https://doi.org/10.1037/0033-295X.85.2.59

Reward rate optimization in two-alternative decision making: empirical tests of theoretical predictions. Journal of Experimental Psychology: Human Perception and Performance 35:1865–1897. https://doi.org/10.1037/a0016926

Reinforcement learning: an introduction. IEEE Transactions on Neural Networks 9:1054. https://doi.org/10.1109/TNN.1998.712192

Optimal policy for value-based decision-making. Nature Communications 7:12400. https://doi.org/10.1038/ncomms12400

Optimal policy for multi-alternative decisions. Nature Neuroscience 22:1503–1511. https://doi.org/10.1038/s41593-019-0453-9

Software: AdaptNormThresh, version swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2. Software Heritage.

Decision making by urgency gating: theory and experimental support. Journal of Neurophysiology 108:2912–2930. https://doi.org/10.1152/jn.01071.2011

Context-dependent urgency influences speed-accuracy tradeoffs in decision-making and movement execution. The Journal of Neuroscience 34:16442–16454. https://doi.org/10.1523/JNEUROSCI.0162-14.2014

Microstimulation of dorsal premotor and primary motor cortex delays the volitional commitment to an action choice. Journal of Neurophysiology 123:927–935. https://doi.org/10.1152/jn.00682.2019

Urgency, leakage, and the relative nature of information processing in decision-making. Psychological Review 128:160–186. https://doi.org/10.1037/rev0000255

Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16:117–186. https://doi.org/10.1214/aoms/1177731118
Article and author information
Funding
National Institutes of Health (R01MH115557)
 Nicholas W Barendregt
 Joshua I Gold
 Krešimir Josić
 Zachary P Kilpatrick
National Institutes of Health (R01EB02984701)
 Nicholas W Barendregt
 Zachary P Kilpatrick
National Science Foundation (NSF-DMS-1853630)
 Nicholas W Barendregt
 Zachary P Kilpatrick
National Science Foundation (NSF-DBI-1707400)
 Krešimir Josić
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Paul Cisek for providing response data from the tokens task used in our analysis.
Version history
 Preprint posted: April 29, 2022
 Received: May 3, 2022
 Accepted: October 20, 2022
 Accepted Manuscript published: October 25, 2022 (version 1)
 Version of Record published: December 15, 2022 (version 2)
Copyright
© 2022, Barendregt et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.