Normative decision rules in changing environments

  1. Nicholas W Barendregt  Is a corresponding author
  2. Joshua I Gold
  3. Krešimir Josić
  4. Zachary P Kilpatrick
  1. Department of Applied Mathematics, University of Colorado Boulder, United States
  2. Department of Neuroscience, University of Pennsylvania, United States
  3. Department of Mathematics, University of Houston, United States

Abstract

Models based on normative principles have played a major role in our understanding of how the brain forms decisions. However, these models have typically been derived for simple, stable conditions, and their relevance to decisions formed under more naturalistic, dynamic conditions is unclear. We previously derived a normative decision model in which evidence accumulation is adapted to fluctuations in the evidence-generating process that occur during a single decision (Glaze et al., 2015), but the evolution of commitment rules (e.g. thresholds on the accumulated evidence) under dynamic conditions is not fully understood. Here, we derive a normative model for decisions based on changing contexts, which we define as changes in evidence quality or reward, over the course of a single decision. In these cases, performance (reward rate) is maximized using decision thresholds that respond to and even anticipate these changes, in contrast to the static thresholds used in many decision models. We show that these adaptive thresholds exhibit several distinct temporal motifs that depend on the specific predicted and experienced context changes and that adaptive models perform robustly even when implemented imperfectly (noisily). We further show that decision models with adaptive thresholds outperform those with constant or urgency-gated thresholds in accounting for human response times on a task with time-varying evidence quality and average reward. These results further link normative and neural decision-making while expanding our view of both as dynamic, adaptive processes that update and use expectations to govern both deliberation and commitment.

Editor's evaluation

This paper makes an important contribution to the study of decision-making under time pressure. The authors provide convincing evidence that decision boundaries can be highly nontrivial – even reaching infinity in realistic regimes. This paper will be of broad interest to both experimentalists and theorists working on decision-making under time pressure.

https://doi.org/10.7554/eLife.79824.sa0

eLife digest

How do we make good choices? Should I have cake or yoghurt for breakfast? The strategies we use to make decisions are important not just for our daily lives, but also for learning more about how the brain works.

Decision-making strategies have two components: first, a deliberation period (when we gather information to determine which choice is ‘best’); and second, a decision ‘rule’ (which tells us when to stop deliberating and commit to a choice). Although deliberation is relatively well-understood, less is known about the decision rules people use, or how those rules produce different outcomes.

Another issue is that even the simplest decisions must sometimes adapt to a changing world. For example, if it starts raining while you are deciding which route to walk into town, you would probably choose the driest route – even if it did not initially look the best. However, most studies of decision strategies have assumed that the decision-maker’s environment does not change during the decision process.

In other words, we know much less about the decision rules used in real-life situations, where the environment changes. Barendregt et al. therefore wanted to extend the approaches previously used to study decisions in static environments, to determine which decision rules might be best suited to more realistic environments that change over time.

First, Barendregt et al. constructed a computer simulation of decision-making with environmental changes built in. These changes were either alterations in the quality of evidence for or against a particular choice, or the ‘reward’ from a choice, i.e., feedback on how good the decision was. They then used the computer simulation to model single decisions where these changes took place.

These virtual experiments showed that the best performance – for example, the most accurate decisions – resulted when the threshold for moving from deliberation (i.e., considering the evidence) to selecting an option could respond to, or even anticipate, the changing situations. Importantly, the simulations’ results also predicted real-world choices made by human participants when given a decision-making task with similar variations in evidence and reward over time. In other words, the virtual decision-making rules could explain real behavior.

This study sheds new light on how we make decisions in a changing environment. In the future, Barendregt et al. hope that this will contribute to a broader understanding of decision-making and behavior in a wide range of contexts, from psychology to economics and even ecology.

Introduction

Even simple decisions can require us to adapt to a changing world. Should you go through the park or through town on your walk? The answer can depend on conditions that could be changing while you deliberate, such as an unexpected shower that would send you hurrying down the faster route (Figure 1A) or a predictable sunrise that would nudge you toward the route with better views. Despite the ubiquity of such dynamics in the real world, they are often neglected in models used to understand how the brain makes decisions. For example, many commonly used models assume that decision commitment occurs when the accumulated evidence for an option reaches a fixed, predefined value or threshold (Wald, 1945; Ratcliff, 1978; Bogacz et al., 2006; Gold and Shadlen, 2007; Kilpatrick et al., 2019). The value of this threshold can account for inherent trade-offs between decision speed and accuracy found in many tasks: lower thresholds generate faster, but less accurate decisions, whereas higher thresholds generate slower, but more accurate decisions (Gold and Shadlen, 2007; Chittka et al., 2009; Bogacz et al., 2010). However, these classical models do not adequately describe decisions made in environments with potentially changing contexts (Thura et al., 2014; Thura and Cisek, 2016; Palestro et al., 2018; Cisek et al., 2009; Drugowitsch et al., 2012; Thura et al., 2012; Tajima et al., 2019; Glickman et al., 2022). Efforts to model decision-making thresholds under dynamic conditions have focused largely on heuristic strategies that aim to account for contexts that change between each decision. For instance, a common class of heuristic models is ‘urgency-gating models’ (UGMs). UGMs filter accumulated evidence through a low-pass filter and use thresholds that collapse monotonically over time (equivalent to dilating the belief in time) to explain decisions based on time-varying evidence quality (Cisek et al., 2009; Carland et al., 2015; Evans et al., 2020). Although collapsing decision thresholds are optimal in some cases, they do not always account for changes that occur during decision deliberation, and they are sometimes implemented ad-hoc without a proper derivation from first principles. Such derivations typically assume that individuals set decision thresholds to maximize trial-averaged reward rate (Simen et al., 2009; Balci et al., 2011; Drugowitsch et al., 2012; Tajima et al., 2016; Malhotra et al., 2018; Boehm et al., 2020), which can result in adaptive, time-varying thresholds similar to those assumed by heuristic UGMs. However, as in fixed-threshold models, these time-varying thresholds are typically defined before the evidence is accumulated, preceding the formative stages of the decision, and thus cannot account for environmental changes that may occur during deliberation.

Simple decisions may require complex strategies.

(A) When choosing where to walk, environmental fluctuations (e.g., weather changes) may necessitate changes in decision bounds (black line) adapted to changes in the conditions (cloudy to sunny). (B) Schematic of the dynamic programming procedure. By assigning the best action to each moment in time, dynamic programming optimizes trial-averaged reward rate to produce the normative thresholds for a given decision.

To identify how environmental changes during the course of a single deliberative decision impact optimal decision rules, we developed normative models of decision-making that adapt to and anticipate two specific types of context changes: changes in reward expectation and changes in evidence quality. Specifically, we used Bellman’s equation (Bellman, 1957; Mahadevan, 1996; Sutton and Barto, 1998; Bertsekas, 2012; Drugowitsch, 2015) to identify decision strategies that maximize trial-averaged reward rate when conditions can change during decision deliberation. We show that for simple tasks that include sudden, expected within-trial changes in the reward or the quality of observed evidence, these normative decision strategies involve non-trivial, time-dependent changes in decision thresholds. These rules take several different forms that outperform their heuristic counterparts, are identifiable from behavior, and have performance that is robust to noisy implementations. We also show that, compared to fixed-threshold models or UGMs, these normative, adaptive threshold models provide a better account of human behavior on a ‘tokens task’, in which the value of commitment changes gradually at predictable times and the quality of evidence changes unpredictably within each trial (Cisek et al., 2009; Thura et al., 2014). These results provide new insights into the behavioral relevance of a diverse set of adaptive decision thresholds in dynamic environments and tightly link the details of such environmental changes to threshold adaptations.

Results

Normative theory for dynamic context 2AFC tasks

To determine potential deliberation and commitment strategies used by human subjects, we begin by identifying normative decision rules for two-alternative forced choice (2AFC) tasks with dynamic contexts. Normative decision rules that maximize trial-averaged reward rate can be obtained by solving an optimization problem using dynamic programming (Bellman, 1957; Sutton and Barto, 1998; Drugowitsch et al., 2012; Tajima et al., 2016). We define this trial-averaged reward rate, ρ, as (Gold and Shadlen, 2002; Drugowitsch et al., 2012)

(1) $\rho = \dfrac{R - C(T_d)}{T_t + t_i},$

where $R$ is the average reward for a decision, $T_d$ is the decision time, $C(T_d)=\int_0^{T_d} c(t)\,dt$ is the average total accumulated cost given an incremental cost function $c(t)$, $T_t$ is the average trial length, and $t_i$ is the average inter-trial interval (Drugowitsch, 2015). Note that all averages in Equation 1 are taken over trials. To find the normative decision thresholds that maximize $\rho$, we assign specific values (i.e., economic utilities) to correct and incorrect choices (reward and/or punishment) and the time required to arrive at each choice (i.e., evidence cost). The incremental cost function $c(t)$ represents both explicit time costs, such as a price for gathering evidence, and implicit costs, such as opportunity cost. While there are many forms of this cost function, we make the simplifying assumption that it is constant, $c(t)=c$. Because more complex cost functions can influence decision threshold dynamics (Drugowitsch et al., 2012), restricting the cost function to a constant ensures that the threshold dynamics we identify are governed purely by changes in the (external) task conditions and not the (internal) cost function. To represent the structure of a 2AFC task, we assume a decision environment for an observer with an initially unknown environmental state, $s\in\{s_+,s_-\}$, that uniquely determines which of two alternatives is correct. To infer the environmental state, this observer makes measurements, $\xi$, that follow a distribution $f_\pm(\xi)=f(\xi\,|\,s_\pm)$ that depends on the state. Determining the correct choice is thus equivalent to determining the generating distribution, $f_\pm$. An ideal Bayesian observer uses the log-likelihood ratio (LLR), $y$, to track their ‘belief’ about the correct choice (Wald, 1945; Bogacz et al., 2006; Veliz-Cuba et al., 2016). After $n$ discrete observations $\xi_{1:n}$ that are independent across time, the discrete-time LLR belief $y_n$ is given by:

(2) $y_n = \ln\dfrac{\Pr(s_+\,|\,\xi_{1:n})}{\Pr(s_-\,|\,\xi_{1:n})} = \ln\dfrac{f_+(\xi_n)}{f_-(\xi_n)} + y_{n-1}.$
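As a concrete illustration of Equation 2 (a minimal sketch assuming symmetric Gaussian likelihoods; the function names are ours, not from the accompanying repository), the per-sample LLR increment for $f_\pm(\xi)=\mathcal{N}(\pm\mu,\sigma^2)$ reduces to $2\mu\xi/\sigma^2$, so the belief can be accumulated recursively:

```python
import numpy as np

def llr_update(y_prev, xi, mu, sigma):
    """One discrete-time LLR update (Equation 2) for symmetric Gaussian
    evidence f±(xi) = N(±mu, sigma^2); ln[f+(xi)/f-(xi)] = 2*mu*xi/sigma^2."""
    return y_prev + 2.0 * mu * xi / sigma**2

# Accumulate ten observations drawn from the s+ distribution
rng = np.random.default_rng(1)
mu, sigma, y = 1.0, 2.0, 0.0
for xi in rng.normal(mu, sigma, size=10):
    y = llr_update(y, xi, mu, sigma)
p = 1.0 / (1.0 + np.exp(-y))   # state likelihood p_n = Pr(s+ | xi_1:n)
```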

Given this defined task structure, we discretize the time during which the decision is formed and define the observer’s actions during each timestep. The observer gathers evidence (measurements) during each timestep prior to a decision and uses each increment of evidence to update their belief about the correct choice. Then, the observer has the option to either commit to a choice or make another measurement at the next timestep. By assigning a utility to each of these actions (i.e., a value $V_+$ for choosing $s_+$, a value $V_-$ for choosing $s_-$, and a value $V_w$ for sampling again), we can construct the value function for the observer given their current belief:

(3) $V(p_n;\rho) = \max\{V_+(p_n;\rho),\,V_-(p_n;\rho),\,V_w(p_n;\rho)\} = \max\begin{cases} R_c\,p_n + R_i(1-p_n) - t_i\rho, & \text{choose } s_+\\ R_c(1-p_n) + R_i\,p_n - t_i\rho, & \text{choose } s_-\\ \langle V(p_{n+1};\rho)\,|\,p_n\rangle_{p_{n+1}} - c(t)\,\delta t - \rho\,\delta t, & \text{sample again.}\end{cases}$

For a full derivation of this equation, see Materials and methods. In Equation 3, $p_n=\Pr(s_+\,|\,\xi_{1:n})=\frac{1}{1+e^{-y_n}}$ is the state likelihood at time $t_n$, $R_c$ is the reward for a correct choice, $R_i$ is the reward for an incorrect choice, and $\delta t$ is the timestep between observations. We choose the generating distributions to be symmetric Gaussian distributions $f_\pm(\xi)\sim\mathcal{N}(\pm\mu,\sigma^2)$ to allow us to explicitly compute the conditional distribution function $f_p(p_{n+1}\,|\,p_n)$ needed for the average future value:

(4) $\langle V(p_{n+1};\rho)\,|\,p_n\rangle_{p_{n+1}} = \int_0^1 V(p_{n+1};\rho)\,f_p(p_{n+1}\,|\,p_n)\,dp_{n+1}.$

In Equation 4, fp(pn+1|pn) is the conditional probability of the future state likelihood pn+1 given the current state likelihood pn. For the case of Gaussian-distributed evidence, this conditional probability is given by Equation 16 in Materials and methods. Using Equation 3, we find the specific belief values where the optimal action changes from gathering evidence to commitment, defining thresholds on the ideal observer’s belief that trigger decisions. Figure 1B shows a schematic of this process.
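Schematically, and assuming the three action values of Equation 3 have already been evaluated on a discretized belief grid (a sketch of this step only, not the authors' implementation), the thresholds at a given time step are the boundaries of the region in which sampling again is the most valuable action:

```python
import numpy as np

def thresholds_from_values(p_grid, V_plus, V_minus, V_wait):
    """Return the endpoints of the belief interval where waiting is optimal;
    mapped to LLR space, these endpoints are the decision thresholds
    -theta(t) and +theta(t). If waiting is optimal over the entire grid, the
    returned endpoints are the grid limits, corresponding to effectively
    infinite thresholds in LLR space."""
    wait_best = (V_wait >= V_plus) & (V_wait >= V_minus)
    if not wait_best.any():
        return None, None          # committing is optimal for every belief
    idx = np.where(wait_best)[0]
    return p_grid[idx[0]], p_grid[idx[-1]]
```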

To understand how normative decision thresholds adapt to changing conditions, we derived them for several different forms of two-alternative forced-choice (2AFC) tasks in which we controlled changes in evidence or reward. Even for such simple tasks, there is a broad set of possible changing contexts. In the next section, we analyze a task in which context changes gradually (the tokens task). Here, we focus on tasks in which the context changes abruptly. For each task, an ideal observer was shown evidence generated from a Gaussian distribution $f_\pm(\xi)=\mathcal{N}(\pm\mu,\sigma^2)$ with signal-to-noise ratio (SNR) $m=2\mu^2/\sigma^2$ (Figure 2—figure supplement 1). The SNR measures evidence quality: a smaller (larger) $m$ implies that evidence is of lower (higher) quality, resulting in harder (easier) decisions. The observer’s goal was to determine which of the two means (i.e., which distribution, $f_+$ or $f_-$) was used to generate the observations. We introduced changes in the reward for a correct decision (‘reward-change task’) or the SNR (‘SNR-change task’) within a single decision, where the time and magnitude of the changes are known in advance to the observer (Figure 1A, Figure 2—figure supplement 2). For example, changes in SNR arise naturally throughout a day as animals choose when to forage and hunt given variations in light levels and therefore target-acquisition difficulty (Combes et al., 2012; Einfalt et al., 2012).

Under these dynamic conditions, dynamic programming produces normative thresholds with rich non-monotonic dynamics (Figure 2A and B, Figure 2—figure supplement 2). Environments with multiple reward changes during a single decision lead to complex threshold dynamics that we summarize in terms of several threshold change “motifs.” These motifs occur on shorter intervals and tend to emerge from simple monotonic changes in context parameters (Figure 2—figure supplement 2). To better understand the range of possible threshold motifs, we focused on environments with single changes in task parameters. For the reward-change task, we set punishment Ri=0 and assumed reward Rc changes abruptly, so that its dynamics are described by a Heaviside function:

(5) $R_c(t) = (R_2 - R_1)\,H(t-0.5) + R_1.$

Thus, the reward switches from the pre-change reward R1 to the post-change reward R2 at t=0.5.

Figure 2 with 3 supplements see all
Normative decision rules are characterized by non-monotonic task-dependent motifs.

(A,B) Example reward time series for a reward-change task (black lines in A), with corresponding thresholds found by dynamic programming (black lines in B). The colored lines in B show sample realizations of the observer’s belief. (C) To understand the diversity of threshold dynamics, we consider the simple case of a single change in the reward schedule. The panel shows a colormap of normative threshold dynamics for these conditions. Distinct threshold motifs are color-coded, corresponding to examples shown in panels i-v. (i-v): Representative thresholds (top) and empirical response distributions (bottom) from each region in C. During times at which thresholds in the upper panels are not shown (e.g., $t\in[0,0.5]$ in i), the thresholds are infinite and the observer will never respond. For all simulations, we take the incremental cost function $c(t)=1$, punishment $R_i=0$, evidence quality $m=5$, and inter-trial interval $t_i=1$.

For this single-change task, normative threshold dynamics exhibited several motifs that in some cases resembled fixed or collapsing thresholds characteristic of previous decision models but in other cases exhibited novel dynamics. Specifically, we characterized five different dynamic motifs in response to single, expected changes in reward contingencies for different combinations of pre- and post-change reward values (Figure 2C and i–v). For tasks in which reward is initially very low, thresholds are infinite until the reward increases, ensuring that the observer waits for the larger payout regardless of how strong their belief is (Figure 2i). The region where thresholds are infinite corresponds to when Vw(pn;ρ) in Equation 3, which is the value associated with waiting to gather more information, is maximal for all values of pn. In contrast, when reward is initially very high, thresholds collapse to zero just before the reward decreases, ensuring that all responses occur while payout is high (Figure 2v). Between these two extremes, optimal thresholds exhibit rich, non-monotonic dynamics (Figure 2ii,iv), promoting early decisions in the high-reward regime, or preventing early, inaccurate decisions in the low-reward regime. Figure 2C shows the regions in pre- and post-change reward space where each motif is optimal, including broad regions with non-monotonic thresholds. Thus, even simple context dynamics can evoke complex decision strategies in ideal observers that differ from those predicted by constant decision-thresholds and heuristic UGMs.

We also formulated an ‘inferred reward-change task’, in which reward fluctuates between a high value $R_H$ and a low value $R_L$ governed by a two-state Markov process with known transition rate $h$ and state $R(t)\in\{R_H,R_L\}$ that the observer must infer on-line. For this task, the observer receives two independent sets of evidence: evidence about the state, $\xi\,|\,s_\pm\sim\mathcal{N}(\pm\mu,\sigma^2)$, and evidence about the current reward, $\eta\,|\,R_{H/L}\sim\mathcal{N}(\pm\mu_R,\sigma_R^2)$. The observer must then track their beliefs about both the state and the current reward and take both sources of information into account when determining the optimal decision thresholds. We found that these thresholds always changed monotonically with monotonic shifts in expected reward (see Figure 2—figure supplement 3). These results contrast with our findings from the reward-change task, in which changes can be anticipated and monotonic changes in reward can produce non-monotonic changes in decision thresholds.

For the SNR-change task, optimal strategies for environments with multiple changes in evidence quality are characterized by threshold dynamics that adapt to these changes in a way similar to how they adapt to changes in reward (Figure 3—figure supplement 1). To study the range of possible threshold motifs, we again considered environments with single changes in the evidence quality $m=2\mu^2/\sigma^2$ by taking $\mu$ to be a Heaviside function:

(6) $\mu(t) = (\mu_2 - \mu_1)\,H(t-0.5) + \mu_1.$

For this single-change task, we again found similar threshold motifs to those in the reward-change task (Figure 3A and B). However, in this case monotonic changes in evidence quality always produce monotonic changes in response behavior. This observation holds across all of parameter space for evidence-quality schedules with single change points (Figure 3C), with only three optimal behavioral motifs (Figure 3i–iii). This contrasts with our findings in the reward-change task, where monotonic changes in reward can produce non-monotonic changes in decision thresholds. Strategies arising from known dynamical changes in context tend to produce sharper response distributions around reward changes than around quality changes, which may be measurable in psychophysical studies. These findings suggest that changes in reward can have a larger impact on the normative strategy thresholds than changes in evidence quality.

Figure 3 with 1 supplement see all
Dynamic-quality task does not exhibit non-monotonic motifs.

(A,B) Example quality time series for the SNR-change task (A), with corresponding thresholds found by dynamic programming (B). Colored lines in B show sample realizations of the observer’s belief. As in Figure 2, we characterize motifs in the threshold dynamics and response distributions based on single changes in SNR. (C) Colormap of normative threshold dynamics for a known reward schedule task with a single quality change. Distinct dynamics are color-coded, corresponding to examples shown in panels i-iii. (i-iii): Representative thresholds (top) and empirical response distributions (bottom) from each region in C. For all simulations, we take the incremental cost function c(t)=1, reward Rc=5, punishment Ri=0, and inter-trial interval ti=1.

Performance and robustness of non-monotonic normative thresholds

The normative solutions that we derived for dynamic-context tasks by definition maximize reward rate. This maximization assumes that the normative solutions are implemented perfectly. However, a perfect implementation may not be possible, given the complexity of the underlying computations, biological constraints on computation time and energy (Louie et al., 2015), and the synaptic and neural variability of cortical circuits (Ma and Jazayeri, 2014; Faisal et al., 2008). Given these constraints, subjects may employ heuristic strategies like the UGM over the normative model if noisy or mistuned versions of both models result in similar reward rates. We used synthetic data to better understand the relative benefits of different imperfectly implemented strategies. Specifically, we corrupted the internal belief state and simulated response times with additive Gaussian noise with zero mean and variance $\sigma_{mn}^2$ (see Figure 4—figure supplement 1C) for three models:

  1. The continuous-time normative model with time-varying thresholds ±θ(t) from Equation 3 and belief that evolves according to the stochastic differential equation

     $d\tilde{y} = \underbrace{\pm m\,dt}_{\text{drift}} + \underbrace{\sqrt{2m}\,dW_t}_{\text{sample noise}} + \underbrace{\sigma_y\,d\tilde{W}_t}_{\text{sensory noise}},$

    where $dW_t$ is a standard increment of a Wiener process, the sign of the drift $\pm m\,dt$ is given by the correct choice $s_\pm$, and $d\tilde{W}_t$ is an increment of an independent Wiener process with strength $\sigma_y$. The addition of this second noise process makes this a noisy Bayesian (NB) model.

  2. A constant-threshold (Const) model, which uses the same belief y~ as the normative model but a constant, non-adaptive decision threshold ±θ(t)=±θ0 (Figure 4—figure supplement 1A).

  3. The UGM, which uses the output of a low-pass filter as the belief,

    (7) $\tau\,dE = \underbrace{\left(-E + \frac{1}{1+e^{-y}} - \frac{1}{2}\right)dt}_{\text{drift \& sample noise}} + \underbrace{\sigma_y\,d\tilde{W}_t}_{\text{sensory noise}},$

    and commits to a decision when this output crosses a hyperbolically collapsing threshold $\pm\theta(t)=\pm\theta_0\,a/t$ (Figure 4—figure supplement 1B). In Equation 7, $E$ is the filter’s output that serves as the UGM’s belief, $\tau$ is a relaxation time constant, and the optimal observer’s belief $y$ is the filter’s input. Note that the filter’s input can also be written in terms of the state likelihood $p$,

    $\tau\,dE = \left(-E + p - \frac{1}{2}\right)dt + \sigma_y\,d\tilde{W}_t,$

    which is the form first proposed by Cisek et al., 2009.

For more details about these three models, see Materials and methods. We compared their performance in terms of reward rate achieved on the same set of reward-change tasks shown in Figure 2. To ensure the average total reward in each trial was the same, we restricted the pre-change reward R1 and post-change reward R2 so that R1+R2=11.
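The comparison can be reproduced in outline with a simple Euler–Maruyama simulation; the sketch below is our illustration, with arbitrary parameter values and threshold functions rather than the tuned values used here, and generates a single trial for each model:

```python
import numpy as np

def simulate_trial(theta_fn, model, m=5.0, sigma_y=0.5, tau=0.2,
                   dt=1e-3, t_max=2.0, seed=None):
    """Euler-Maruyama sketch of one trial (true state s+) for the noisy
    Bayesian ('NB'), constant-threshold ('Const'), or urgency-gating ('UGM')
    model. theta_fn(t) returns the threshold at time t. Returns
    (response time, choice), or (None, 0) if no threshold is crossed."""
    rng = np.random.default_rng(seed)
    y = 0.0        # ideal-observer belief (LLR); input to the UGM filter
    y_noisy = 0.0  # belief corrupted by sensory noise (NB and Const models)
    E = 0.0        # UGM low-pass filter output (Equation 7)
    for step in range(1, int(t_max / dt) + 1):
        t = step * dt
        dW = rng.normal(0.0, np.sqrt(dt))      # evidence (sample) noise
        dW_y = rng.normal(0.0, np.sqrt(dt))    # independent sensory noise
        dy = m * dt + np.sqrt(2.0 * m) * dW    # drift + sample noise for s+
        y += dy
        y_noisy += dy + sigma_y * dW_y
        if model == "UGM":
            E += ((-E + 1.0 / (1.0 + np.exp(-y)) - 0.5) * dt
                  + sigma_y * dW_y) / tau
            belief = E
        else:
            belief = y_noisy
        if abs(belief) >= theta_fn(t):
            return t, int(np.sign(belief))
    return None, 0

# Constant threshold vs. a hyperbolically collapsing UGM threshold
rt_const, _ = simulate_trial(lambda t: 2.0, "Const", seed=3)
rt_ugm, _ = simulate_trial(lambda t: 0.2 / max(t, 1e-3), "UGM", seed=3)
```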

When all three models were implemented without additional noise, the relative benefits of the normative model depended on the exact task condition. The performance differential between models was highest when reward changed from low to high values (Figure 4A, dotted line; Figure 4). Under these conditions, normative thresholds are initially infinite and become finite after the reward increases, ensuring that most responses occur immediately once the high reward becomes available (Figure 4D). In contrast, response times generated by the constant-threshold and UGM models tend not to follow this pattern. For the constant-threshold model, many responses occur early, when the reward is low (Figure 4E). For the UGM, a substantial fraction of responses are late, leading to higher time costs; however, it is possible to tune the rate of collapse of the UGM’s thresholds to prevent any early responses while the reward is low (Figure 4F). In contrast, when the reward changes from high to low values, all models exhibit similar response distributions and reward rates (Figure 4A, dashed line; Figure 4—figure supplement 2). This result is not surprising, given that the constant-threshold model produces early peaks in the reaction time distribution, and the UGM was designed to mimic collapsing bounds that hasten decisions in response to imminent decreases in reward (Cisek et al., 2009). We therefore focused on the robustness of each strategy when corrupted by noise and responding to low-to-high reward switches – the regime that differentiates strategy performance in ways that could be identified in subject behavior.

Adding noise to the internal belief state (which tends to trigger earlier responses) and to the simulated response distributions (which tends to smooth out the distributions), without re-tuning the models to account for the additional noise, does not alter the advantage of the normative model: across a range of added noise strengths, which we define as $\frac{\sigma_y+\sigma_{mn}}{\bar{\sigma}_y+\bar{\sigma}_{mn}}$, where $\bar{\sigma}_y$ and $\bar{\sigma}_{mn}$ are the maximum possible strengths of sensory and motor noise, respectively, the normative model outperforms the other two when encountering low-to-high reward switches (Figure 4C). This robustness arises because, prior to the reward change, the normative model uses infinite decision thresholds that prevent early noise-triggered responses when reward is low (Figure 4D). In contrast, the heuristic models have finite collapsing or constant thresholds and thus produce more suboptimal early responses as belief noise is increased (Figure 4E and F). Thus, adaptive decision strategies can result in considerably higher reward rates than heuristic alternatives even when implemented imperfectly, suggesting subjects may be motivated to learn such strategies.

Figure 4 with 3 supplements see all
Benefits of adaptive normative thresholds compared to heuristics.

(A) Reward rate $\rho$ for the noisy Bayesian (NB) model, constant-threshold (Const) model, and UGM for the reward-change task, where all models are tuned to maximize performance with zero sensory noise ($\sigma_y=0$) and zero motor noise ($\sigma_{mn}=0$); in this case, the NB model is equivalent to the optimal normative model. Model reward rates are shown for different pre-change rewards $R_1$, with post-change reward $R_2$ set so that $R_1+R_2=11$ to keep the average total reward fixed (see Materials and methods for details). Low-to-high reward changes (dotted line) produce larger performance differentials than high-to-low changes (dashed line). (B) Absolute reward rate differential between NB and alternative models, given by $\rho(\mathrm{NB})-\rho(\mathrm{alt})$, for different pre-change rewards. Legend shows which alternate model was used to produce each curve. (C) Reward rates of all models for the reward-change task with $(R_1,R_2)=(3,8)$ as both observation and response-time noise are increased. Noise strength for each model is given by $\frac{\sigma_y+\sigma_{mn}}{\bar{\sigma}_y+\bar{\sigma}_{mn}}$, where $\bar{\sigma}_y=5$ and $\bar{\sigma}_{mn}=0.25$ are the maximum strengths of $\sigma_y$ and $\sigma_{mn}$ we considered (see Figure 4—figure supplement 1C for reference). Filled markers correspond to no-noise, moderate-noise, and high-noise strengths. (D,E,F) Response distributions for the (D) NB, (E) Const, and (F) UGM models in a low-to-high reward environment with $(R_1,R_2)=(3,8)$. In each panel, results derived for several noise strengths, corresponding to the filled markers in C, are superimposed, with lighter distributions denoting higher noise. Inset in D shows normative thresholds obtained from dynamic programming. Dashed line shows time of reward increase. For all simulations, we take the incremental cost function $c(t)=1$, punishment $R_i=0$, and evidence quality $m=5$.

Adaptive normative strategies in the tokens task

To determine the relevance of the normative model to human decision-making, we analyzed previously collected data from a ‘tokens task’ (Cisek et al., 2009). For this task, human subjects were shown 15 tokens inside a center target flanked by two empty targets (see Figure 5A for a schematic). Every 200ms, a token moved from the center target to one of the neighboring targets with equal probability. Subjects were tasked with predicting which flanking target would contain more tokens by the time all 15 moved from the center. Subjects could respond at any time before all 15 tokens had moved. Once the subject made the prediction, the remaining tokens would finish their movements to indicate the correct alternative. Given this task structure, one can show using a combinatorial argument (Cisek et al., 2009) that the state likelihood function pn=Pr(top|ξ1:n), the probability the top target will hold more tokens at the end of the trial, is given by:

(8) $p_n = \Pr(\text{top}\,|\,U_n,L_n,C_n) = \dfrac{C_n!}{2^{C_n}} \sum_{k=0}^{\min\{C_n,\,7-L_n\}} \dfrac{1}{k!\,(C_n-k)!},$

where Un, Ln, and Cn are the number of tokens in the upper, lower, and center targets after token movement n, respectively. The token movements are Markovian because each token has an equal chance of moving to the upper/lower target. However, the probability that a target will contain more tokens at the end of the trial is history dependent, and the evolution of these probabilities is thus non-Markovian. As such, the quality of evidence possible from each token draw changes dynamically and gradually. In addition, the task included two different post-decision token movement speeds, ‘slow’ and ‘fast’: once the subject committed to a choice, the tokens finished out their animation, moving either once every 170ms (slow task) or once every 20ms (fast task). This post-decision movement acceleration changed the value associated with commitment by making the average inter-trial interval (ti in Equation 1) decrease over time. Because of this modulation, we can interpret the tokens task as a multi-change reward task, where commitment value is controlled through ti rather than through reward Rc. Our dynamic-programming framework for generating adaptive decision rules can handle the gradual changes in task context emerging in the tokens task. Given that costs and rewards can be subjective, we quantified how normative decision thresholds change with different combinations of rewards Rc and costs c(t)=c for fixed punishment Ri=-1, for both the slow (Figure 5B) and fast (Figure 5C) versions of the task.
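Equation 8 can be transcribed directly (a sketch; the function and argument names are ours), summing over the number of remaining tokens that could still move to the bottom target without giving it a majority:

```python
from math import comb

def p_top(U_n, L_n, C_n, total=15):
    """Probability that the top target ends with more tokens (Equation 8),
    given U_n and L_n tokens already in the top and bottom targets and C_n
    tokens remaining in the center (odd total, so there are no ties)."""
    majority = total // 2 + 1                 # 8 of 15 tokens wins
    k_max = min(C_n, (majority - 1) - L_n)    # max tokens the bottom may still get
    if k_max < 0:
        return 0.0                            # bottom already has a majority
    return sum(comb(C_n, k) for k in range(k_max + 1)) / 2**C_n

p_top(5, 3, 7)   # e.g., 5 tokens up, 3 down, 7 still in the center
```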

Figure 5 with 1 supplement see all
Normative strategies for the tokens task exhibit various distinct decision threshold motifs with sharp, non-monotonic changes.

(A) Schematic of the tokens task. The subject must predict which target (top or bottom) will have the most tokens once all tokens have left the center target (see text for details). (B) Colormap of normative threshold dynamics for the ‘slow’ version of the tokens task in reward-evidence cost parameter space (i.e., as a function of $R_c$ and $c(t)=c$ from Equation 3, with punishment $R_i$ set to –1). Distinct dynamics are color-coded, with different motifs shown in i-iv. (C) Same as B, but for the ‘fast’ version of the tokens task. (i-iv): Representative thresholds (top) and empirical response distributions (bottom) from each region in (B,C). Thresholds are plotted in the LLR-belief space $y_n=\ln\frac{p_n}{1-p_n}$, where $p_n$ is the state likelihood given by Equation 8. Note that we distinguish iii and iv by the presence of either one (iii) or multiple (iv) consecutive threshold increases. In regions where thresholds are not displayed (e.g., $N_t\in\{0,\ldots,7\}$ in ii), the thresholds are infinite.

We identified four distinct motifs of normative decision threshold dynamics for the tokens task (Figure 5i-iv). Some combinations of rewards and costs produced collapsing thresholds (Figure 5ii) similar to the UGM developed by Cisek et al., 2009 for this task. In contrast, large regions of task parameter space produced rich non-monotonic threshold dynamics (Figure 5iii,iv) that differed from any found in the UGM. In particular, as in the case of reward-change tasks, normative thresholds were often infinite for the first several token movements, preventing early and weakly informed responses. These motifs are similar to those produced by low-to-high reward switches in the reward-change task, but here resulting from the low relative cost of early observations. These non-monotonic dynamics also appear if we measure belief in terms of the difference in tokens between the top and bottom target, which we call ‘token lead space’ (see Figure 5—figure supplement 1).

Adaptive normative strategies best fit subject response data

To determine the relevance of these adaptive decision strategies to human behavior, we fit discrete-time versions of the noisy Bayesian (four free parameters), constant-threshold (three free parameters), and urgency-gating (five free parameters) models to response-time data from the tokens task collected by Cisek et al., 2009; see Table 1 in Materials and methods for a table of parameters for each model. All models included belief and motor noise, as in our analysis of the dynamic-context tasks (Figure 4—figure supplement 1C). The normative model tended to fit the data better than the heuristic models (see Figure 6—figure supplement 1), based on three primary analyses. First, both corrected AIC (AICc), which accounts for goodness-of-fit and model degrees-of-freedom, and average root-mean-squared error (RMSE) between the predicted and actual trial-by-trial response times, favored the noisy Bayesian model for most subjects for both the slow (Figure 6A) and fast (Figure 6D) versions of the task. Second, when considering only the best-fitting model for each subject and task condition, the noisy Bayesian model tended to better predict subjects’ response times (Figure 6B and E). Third, most subjects whose data were best described by the noisy Bayesian model had best-fit parameters that corresponded to non-monotonic decision thresholds, which cannot be produced by either of the other two models (Figure 6C and F). This result also shows that, assuming subjects used a normative model, they used distinct model parameters, and thus different strategies, for the fast and slow task conditions. This finding is clearer when looking at the posterior parameter distribution for each subject and model parameter (see Figure 6—figure supplement 1 for an example). We speculate that the higher estimated value of reward in the slow task may arise due to subjects valuing frequent rewards more favorably. Together, our results strongly suggest that these human subjects tended to use an adaptive, normative strategy instead of the kinds of heuristic strategies often used to model response data from dynamic context tasks.
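For reference, the two comparison metrics take their standard forms; the sketch below is our transcription (the paper's exact averaging conventions across subjects and trials may differ):

```python
import numpy as np

def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k free parameters fit to n trials;
    smaller values indicate a better fit after penalizing complexity."""
    return 2 * k - 2 * log_likelihood + 2 * k * (k + 1) / (n - k - 1)

def rmse(rt_observed, rt_predicted):
    """Root-mean-squared error between observed and model-predicted
    trial-by-trial response times."""
    err = np.asarray(rt_observed) - np.asarray(rt_predicted)
    return np.sqrt(np.mean(err**2))
```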

Figure 6 with 2 supplements see all
Adaptive normative strategies provide the best fit to subject behavior in the tokens task.

(A) Number of subjects from the slow version of the tokens task whose responses were best described by each model (legend) identified using corrected AIC (left) and average trial-by-trial RMSE (right). (B) Comparison of mean RT from subject data in the slow version of the tokens task (x-axis) to mean RT of each fit model (y-axis) at maximum-likelihood parameters. Each symbol is color-coded to agree with its associated model. Darker symbols correspond to the model that best describes the responses of a subject selected using corrected AIC. The NB model had the lowest variance in the difference between predicted and measured mean RT (NB var: 0.13, Const var: 3.11, UGM var: 5.39). (C) Scatter plot of maximum-likelihood parameters $R_c$ and $c(t)=c$ for the noisy Bayesian model for each subject in the slow version of the task. Each symbol is color-coded to match the threshold dynamics heatmap from Figure 5B. Darker symbols correspond to subjects whose responses were best described by the noisy Bayesian model using corrected AIC. (D-F) Same as A-C, but for the fast version of the tokens task. The NB model had the lowest variance in the difference between predicted and measured mean RT in this version of the task (NB var: 0.22, Const var: 0.82, UGM var: 5.32).

Discussion

The goal of this study was to build on previous work showing that in dynamic environments, the most effective decision processes do not necessarily use relatively simple, pre-defined computations as in many decision models (Bogacz et al., 2006; Cisek et al., 2009; Drugowitsch et al., 2012), but instead adapt to learned or predicted features of the environmental dynamics (Drugowitsch et al., 2014a). Specifically, we used new ‘dynamic context’ task structures to demonstrate that normative decision commitment rules (i.e., decision thresholds, or bounds, in ‘accumulate-to-bound’ models) adapt to reward and evidence-quality switches in complex, but predictable, ways. Comparing the performance of these normative decision strategies to the performance of classic heuristic models, we found that the advantage of normative models is maintained when computations are noisy. We extended these modeling results to include the ‘tokens task’, in which evidence quality changes in a way that depends on stimulus history and the utility of commitment increases over time. We found that the normative decision thresholds for the tokens task are also non-monotonic and robust to noise. By reanalyzing human subject data from this task, we found most subjects’ response times were best-explained by a noisy normative model with non-monotonic decision thresholds. Taken collectively, these results show that ideal observers and human subjects use adaptive and robust normative decision strategies in relatively simple decision environments.

Our results can aid experimentalists investigating the nuances of complex decision-making in several ways. First, we demonstrated that normative behavior varies substantially across task parameters for relatively simple tasks. For example, the reward-change task structure produces five distinct behavioral motifs, such as waiting until reward increases (Figure 2i) and responding before reward decreases unless the accumulated evidence is ambiguous (Figure 2iv). Using these kinds of modeling results to inform experimental design can help us understand the possible behaviors to expect in subject data. Second, extending our work and considering the sensitivity of performance to both model choice and task parameters (Barendregt et al., 2019; Radillo et al., 2019) will help to identify regions of task parameter space where models are most identifiable from observables like response time and choice. Third, and more generally, our work provides evidence that for tasks with gradual changes in evidence quality and reward, human behavior is more consistent with normative principles than with previously proposed heuristic models. However, more work is needed to determine if and how people follow normative principles for other dynamic-context tasks, such as those involving abrupt changes in evidence or reward contingencies, by using normative theory to determine which subject strategies are plausible, the nature of tasks needed to identify them, and the relationship between task dynamics and decision rules.

Model-driven experimental design can aid in identification of adaptive decision rules in practice. People commonly encounter unpredictable (e.g. an abrupt thunderstorm) and predictable (e.g. sunset) context changes when making decisions. Natural extensions of common perceptual decision tasks (e.g. random-dot motion discrimination [Gold and Shadlen, 2002]) could include within-trial changes in stimulus signal-to-noise ratio (evidence quality) or anticipated reward payout. Task-relevant variability can also arise from internal sources, including noise in neural processing of sensory input and motor output (Ma and Jazayeri, 2014; Faisal et al., 2008). We assumed subjects do not have precise knowledge of the strength or nature of these noise sources, and thus they could not optimize their strategy accordingly. However, people may be capable of rapidly estimating performance error that results from such internal noise processes and adjusting on-line (Bonnen et al., 2015). To extend the models we considered, we could therefore assume that subjects can estimate the magnitude of their own sensory and motor noise, and use this information to adapt their decision strategies to improve performance.

Real subjects likely do not rely on a single strategy when performing a sequence of trials (Ashwood et al., 2022) and instead rely on a mix of near-normative, sub-normative, and heuristic strategies. In fitting subject data, experimentalists are thus presented with the difficult task of constructing a library of possible models to use in their analysis. More general approaches have been developed for fitting response data to a broad class of models (Shinn et al., 2020), but these model libraries are typically built on pre-existing assumptions of how subjects accumulate evidence and make decisions. Because the potential library of decision strategies is theoretically limitless, a normative analyses can both expand and provide insights into the range of possible subject behaviors in a systematic and principled way. Understanding this scope will assist in developing a well-groomed candidate list of near-normative and heuristic models. For example, if a normative analysis of performance on a dynamic reward task produces threshold dynamics similar to those in Figure 2B, then the fitting library should include a piecewise-constant threshold (or urgency signal) model. Combining these model-based investigations with model-free approaches, such as rate-distortion theory (Berger, 2003; Eissa et al., 2021), can also aid in identifying commonalities in performance and resource usage within and across model classes without the need for pilot experiments.

Our work complements the existing literature on optimal decision thresholds by demonstrating the diversity of forms those thresholds can take under different dynamic task conditions. Several early normative theories were, like ours, based on dynamic programming (Rapoport and Burkheimer, 1971; Busemeyer and Rapoport, 1988), and in some cases the resulting models were fit to experimental data (Ditterich, 2006). For example, dynamic programming was used to show that certain optimal decisions can require non-constant decision boundaries similar to those of our normative models in dynamic reward tasks (Frazier and Yu, 2007; Figure 2). More recently, dynamic programming (Drugowitsch et al., 2012; Drugowitsch et al., 2014b; Tajima et al., 2016) or policy iteration (Malhotra et al., 2017; Malhotra et al., 2018) has been used to identify normative strategies in dynamic environments that can have monotonically collapsing decision thresholds, which in some cases can be implemented using an urgency signal (Tajima et al., 2019). These strategies include dynamically changing decision thresholds when signal-to-noise ratios of evidence streams vary according to a Cox-Ingersoll-Ross process (Drugowitsch et al., 2014a) and non-monotonic thresholds when the evidence quality varies unpredictably across trials but is fixed within each trial (Malhotra et al., 2018). Other recent work has started to generalize notions of urgency-gating behavior (Trueblood et al., 2021). However, these previous studies tended to focus on environments with a fixed structure, in which dynamic decision thresholds are adapted as the observer acquires knowledge of the environment. Here we have characterized in more detail how both expected and unexpected changes in context within trials relate to changes in decision thresholds over time.

Perceptual decision-making tasks provide a readily accessible route for validating our normative theory, especially considering the ease with which task difficulty can be parameterized to identify parameter ranges in which strategies can best be differentiated (Philiastides et al., 2006). There is ample evidence already that people can tune the timescale of leaky evidence accumulation processes to the switching rate of an unpredictably changing state governing the statistics of a visual stimulus, to efficiently integrate observations and make a decision about the state (Ossmy et al., 2013; Glaze et al., 2015). We thus speculate that adaptive decision rules could be identified similarly in the strategies people use to make decisions about perceptual stimuli in dynamic contexts.

The neural mechanisms responsible for implementing and controlling decision thresholds are not well understood. Recent work has identified several cortical regions that may contribute to threshold formation, such as prefrontal cortex (Hanks et al., 2015), dorsal premotor area (Thura and Cisek, 2020), and superior colliculus (Crapse et al., 2018; Jun et al., 2021). Urgency signals are a complementary way of dynamically changing decision thresholds via a commensurate scale in belief, which Thura and Cisek, 2017 suggest are detectable in recordings from basal ganglia. The normative decision thresholds we derived do not employ urgency signals, but analogous UGMs may involve non-monotonic signals. For example, the switch from an infinite-to-constant decision threshold typical of low-to-high reward switches would correspond to a signal that suppresses responses until a reward change. Measurable signals predicted by our normative models would therefore correspond to zero mean activity during low reward, followed by constant mean activity during high reward. While more experimental work is needed to test this hypothesis, our work has expanded the view of normative and neural decision making as dynamic processes for both deliberation and commitment.

Materials and methods

Normative decision thresholds from dynamic programming

Here we detail the dynamic programming tools required to find normative decision thresholds. For the free-response tasks we consider, an observer gathers a sample of evidence $\xi$, uses the log-likelihood ratio (LLR) $y=\ln\frac{\Pr(s_+|\xi)}{\Pr(s_-|\xi)}$ as their ‘belief’, and sets potentially time-dependent decision thresholds, $\theta_\pm(t)$, that determine when they will stop accumulating evidence and commit to a choice. When $y\geq\theta_+(t)$ ($y\leq\theta_-(t)$), the observer chooses the state $s_+$ ($s_-$). In general, an observer is free to set $\theta_\pm(t)$ any way they wish. However, a normative observer sets these thresholds to optimize an objective function, which we assume throughout this study to be the trial-averaged reward rate, $\rho$, given by Equation 1. In this definition of reward rate, the incremental cost function $c(t)$ accounts for both explicit costs (e.g. paying for observed evidence, metabolic costs of storing belief in working memory) and implicit costs (e.g. opportunity cost). We assume symmetry in the problem (in terms of prior, rewards, etc.) that guarantees the thresholds are symmetric about $y=0$, so that $\theta_\pm(t)=\pm\theta(t)$. We derive the optimal threshold policy for a general incremental cost function $c(t)$, but in our results we consider only constant cost functions $c$. Although the space of possible cost functions is large, restricting to a constant value ensures that threshold dynamics are governed purely by task and reward structure and not by an arbitrary evidence cost function.

To find the thresholds $\pm\theta(t)$ that optimize the reward rate given by Equation 1, we start with a discrete-time task where observations are made every $\delta t$ time units, and we simplify the problem so the length of each trial is fixed and independent of the decision time $T_d$. This simplification makes the denominator of $\rho$ constant with respect to trial-to-trial variability, meaning we can optimize reward rate by maximizing the numerator $R-C(T_d)$. Under this simplified task structure, we suppose the observer has just drawn a sample $\xi_n$ and updated their state likelihood to $p_n=\frac{1}{1+e^{-y_n}}$, where $y_n=\ln\frac{\Pr(s_+|\xi_{1:n})}{\Pr(s_-|\xi_{1:n})}$ is the discrete-time LLR given by Equation 2. At this moment, the observer takes one of three possible actions:

  1. Stop accumulating evidence and commit to choice s+. This action has value equal to the average reward for choosing s+, which is given by:

    (9) $V_+(p_n) = R_c\,p_n + R_i\,(1-p_n),$
    where $R_c$ is the value for a correct choice and $R_i$ is the value for an incorrect choice.

  2. Stop accumulating evidence and commit to choice $s_-$. By assuming the reward for correctly (or incorrectly) choosing $s_+$ is the same as for choosing $s_-$, the value of this action is obtained by symmetry:

    (10) $V_-(p_n) = R_c\,(1-p_n) + R_i\,p_n.$
  3. Wait to commit to a choice and draw an additional piece of evidence. Choosing this action means the observer expects their future overall value $V$ to be greater than their current value, less the cost incurred by waiting for additional evidence. Therefore, the value of this choice is given by:

    (11) $V_w(p_n) = \langle V(p_{n+1})\,|\,p_n\rangle_{p_{n+1}} - c(t)\,\delta t,$
    where $c(t)$ is the incremental evidence cost function; because we assume that the incremental cost is constant, this simplifies to $c(t)\,\delta t = c\,\delta t$.

Given the action values from Equations 9–11, the observer takes the action with maximal value, resulting in their overall value function

(12) $V(p_n) = \max\{V_+(p_n),\,V_-(p_n),\,V_w(p_n)\} = \max\begin{cases} R_c\,p_n + R_i(1-p_n), & \text{choose } s_+\\ R_c(1-p_n) + R_i\,p_n, & \text{choose } s_-\\ \langle V(p_{n+1})\,|\,p_n\rangle_{p_{n+1}} - c\,\delta t, & \text{sample again.}\end{cases}$

Because the value-maximizing action depends on the state likelihood, $p_n$, the regions of likelihood space where each action is optimal divide the space into three disjoint regions. The boundaries of these regions are exactly the optimal decision thresholds, which can be mapped to LLR-space to obtain $\pm\theta(t)$. To find these thresholds numerically, we started by discretizing the state likelihood space $p_n$. Because the state likelihood $p_n$ is restricted to values between 0 and 1, whereas the log-likelihood ratio $y_n=\ln\frac{p_n}{1-p_n}$ is unbounded, we chose to formulate all the components of Bellman’s equation in terms of $p_n$ to minimize truncation errors. We then proceeded by using backward induction in time, starting at the total trial length $t=T_t$. At this moment in time, it is impossible to wait for more evidence, so the value function in Equation 12 does not depend on the future. This approach implies that the value function is:

$V(p_n) = \max\{V_+(p_n),\,V_-(p_n)\} = \max\begin{cases} R_c\,p_n + R_i(1-p_n), & \text{choose } s_+\\ R_c(1-p_n) + R_i\,p_n, & \text{choose } s_-.\end{cases}$

Once the value is calculated at this time point, it can be used as the future value at time point t=Tt-δt.

To find the decision thresholds for the desired tasks where $T_t$ is not fixed, we must optimize both the numerator and denominator of Equation 1. To account for the variable trial length, we adopt techniques from average reward reinforcement learning (Mahadevan, 1996) and penalize the waiting time associated with each action by the waiting time itself scaled by the reward rate $\rho$ (i.e., $t_i\rho$ for committing to $s_+$ or $s_-$ and $\rho\,\delta t$ for waiting). This modification makes all trials effectively the same length and allows us to use the same approach used to derive Equation 12 (Drugowitsch et al., 2012). The new overall value function is given by Equation 3:

(13) $V(p_n;\rho) = \max\{V_+(p_n;\rho),\,V_-(p_n;\rho),\,V_w(p_n;\rho)\} = \max\begin{cases} R_c\,p_n + R_i(1-p_n) - t_i\rho, & \text{choose } s_+\\ R_c(1-p_n) + R_i\,p_n - t_i\rho, & \text{choose } s_-\\ \langle V(p_{n+1};\rho)\,|\,p_n\rangle_{p_{n+1}} - c(t)\,\delta t - \rho\,\delta t, & \text{sample again.}\end{cases}$

To use this new value function to numerically find the decision thresholds, we must note two new complications that arise from moving away from fixed-length trials. First, we no longer have a natural end time from which to start backward induction. We remedy this issue by following the approach of Drugowitsch et al., 2012 and artificially setting a final trial time $T_f$ that is far enough in the future that decision times of this length are highly unlikely and do not impact the response distributions. If we desire accurate thresholds up to a time $t$, we set $T_f=5t$, which produces an accurate solution while avoiding the large numerical overhead incurred by a longer simulation time. In our simulations, we set $t$ based on when we expect most decisions to be made. Second, the value function now depends on the unknown quantity $\rho$, resulting in a co-optimization problem. To address this complication, note that when $\rho$ is maximized, our derivation requires $V(p_0=\tfrac{1}{2};\rho)=0$ for a consistent Bellman’s equation (Drugowitsch et al., 2012). We exploit this consistency requirement by fixing an initial reward rate $\rho_0$, solving the value function through backward induction, calculating $V(p_0=\tfrac{1}{2};\rho_0)$, and updating the value of $\rho$ via a root-finding scheme. For more details on numerical implementation, see https://github.com/nwbarendregt/AdaptNormThresh; Thresh, 2022.
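The co-optimization loop can be sketched as follows (an illustration only; `value_at_prior` stands in for a backward-induction routine, which we do not reproduce here, and is assumed to return $V(p_0=\tfrac{1}{2};\rho)$ for a candidate reward rate):

```python
def optimal_reward_rate(value_at_prior, lo=1e-3, hi=10.0, tol=1e-6):
    """Bisection on the reward rate rho. `value_at_prior(rho)` is assumed to
    solve Equation 13 backward from the artificial horizon T_f for fixed rho
    and return V(p0 = 1/2; rho). At the optimal rho this value is zero and it
    decreases as rho grows, so bisection recovers the consistent reward rate."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if value_at_prior(mid) > 0.0:
            lo = mid    # rho too small: a trial is still worth starting
        else:
            hi = mid    # rho too large
    return 0.5 * (lo + hi)
```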

Dynamic context 2AFC tasks

For all dynamic context tasks, we assume that observations follow a Gaussian distribution, so that $\xi\,|\,s_\pm\sim\mathcal{N}(\pm\mu,\sigma^2)$. Using the Functional Central Limit Theorem, one can show (Bogacz et al., 2006) that in the continuous-time limit, the belief $y$ evolves according to a stochastic differential equation:

(14) $dy = \pm m\,dt + \sqrt{2m}\,dW_t.$

In Equation 14, $m=2\mu^2/\sigma^2$ is the scaled signal-to-noise ratio (SNR) given by the observation distribution $\xi\,|\,s_\pm\sim\mathcal{N}(\pm\mu,\sigma^2)$, $dW_t$ is a standard increment of a Wiener process, and the sign of the drift $\pm m\,dt$ is given by the sign of the correct choice $s_\pm$. To construct Bellman’s equation for this task, we start by discretizing time $t_{1:n}$ and determine the average value gained by waiting and collecting another observation, given by Equation 4:

$\langle V(p_{n+1};\rho)\,|\,p_n\rangle_{p_{n+1}} = \int_0^1 V(p_{n+1};\rho)\,f_p(p_{n+1}\,|\,p_n)\,dp_{n+1},$

where $p_n=\Pr(s_+\,|\,\xi_{1:n})$ is the probability the environment is in state $s_+$ given $n$ pieces of evidence. The main difficulty in computing this expectation is computing the conditional probability distribution $f_p(p_{n+1}\,|\,p_n)$, which we call the likelihood transfer function. Once we construct the likelihood transfer function, we can use our discretization of the state likelihood space $p_n$ to evaluate the integral in Equation 4 using any standard numerical quadrature scheme. To compute this transfer function, we start from the definition of the LLR $y_n$ and leverage the relationship between $p_n$ and $y_n$ to express $p_{n+1}$ in terms of $p_n$ and the observation $\xi_{n+1}$:

(15) $p_{n+1} = \dfrac{1}{1+e^{-y_{n+1}}} = \dfrac{1}{1+e^{-\ln\frac{f_+(\xi_{n+1})}{f_-(\xi_{n+1})}}e^{-y_n}} = \dfrac{1}{1+\frac{f_-(\xi_{n+1})}{f_+(\xi_{n+1})}\left(\frac{1-p_n}{p_n}\right)} = \dfrac{p_n}{p_n+(1-p_n)e^{-2\xi_{n+1}\mu/\sigma^2}}.$

Note that we used the fact that in discrete time with a time step $\delta t$, the observations satisfy $\xi\,|\,s_\pm\sim\mathcal{N}(\pm\mu\,\delta t,\sigma^2\delta t)$. The relationship between $\xi_{n+1}$ and $p_{n+1}$ in Equation 15 can be inverted to obtain:

$\xi_{n+1} = \dfrac{\sigma^2}{2\mu}\ln\dfrac{(p_n-1)\,p_{n+1}}{p_n\,(p_{n+1}-1)}.$

With this relationship established, we can find the likelihood transfer function $f_p(p(\xi_{1:n+1})\,|\,p(\xi_{1:n}))$ by finding the observation transfer function $f_\xi(\xi(p_{n+1})\,|\,\xi(p_n))$ and performing a change of variables; by independence of the samples, this observation transfer function is simply $f_\xi(\xi_{n+1})$. With probability $p_n$, $\xi_{n+1}$ will be drawn from the normal distribution $\mathcal{N}(+\mu\,\delta t,\sigma^2\delta t)$, and with probability $1-p_n$, $\xi_{n+1}$ will be drawn from the normal distribution $\mathcal{N}(-\mu\,\delta t,\sigma^2\delta t)$. Marginalizing immediately provides the observation transfer function:

$f_\xi(\xi_{n+1}\,|\,\xi_{1:n}) = p_n\left\{\dfrac{1}{\sqrt{2\pi\delta t}\,\sigma}e^{-\frac{(\xi_{n+1}-\mu\delta t)^2}{2\sigma^2\delta t}}\right\} + (1-p_n)\left\{\dfrac{1}{\sqrt{2\pi\delta t}\,\sigma}e^{-\frac{(\xi_{n+1}+\mu\delta t)^2}{2\sigma^2\delta t}}\right\}.$

Performing the change of variables using the derivative $\dfrac{d\xi_{n+1}}{dp_{n+1}} = \dfrac{\sigma^2}{2p_{n+1}\mu - 2p_{n+1}^2\mu} = \dfrac{\sigma^2}{2\mu\,p_{n+1}(1-p_{n+1})} > 0$ yields the transfer function

(16) $f_p(p_{n+1}\,|\,p_n) = \dfrac{\sigma}{2\mu\,p_{n+1}(1-p_{n+1})\sqrt{2\pi\delta t}}\left[p_n\exp\left\{-\dfrac{1}{2\sigma^2\delta t}\left(\dfrac{\sigma^2}{2\mu}\ln\dfrac{(p_n-1)\,p_{n+1}}{p_n\,(p_{n+1}-1)}-\delta t\,\mu\right)^2\right\}+(1-p_n)\exp\left\{-\dfrac{1}{2\sigma^2\delta t}\left(\dfrac{\sigma^2}{2\mu}\ln\dfrac{(p_n-1)\,p_{n+1}}{p_n\,(p_{n+1}-1)}+\delta t\,\mu\right)^2\right\}\right].$

Note that Equation 16 is equivalent to the likelihood transfer function given by Equation 16 in Drugowitsch et al., 2012 for the case of m=1. Combining Equation 14 and Equation 16, we can construct Bellman’s equation for any dynamic context task.
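
As a concrete illustration, the Gaussian likelihood transfer function in Equation 16 can be evaluated directly. The MATLAB sketch below is one possible implementation; the function name and argument order are our own, not those of the archived code.

    % Sketch of the likelihood transfer function in Equation 16 for the Gaussian
    % observation model; p_next may be a vector (e.g., the belief grid).
    function f = likelihood_transfer(p_next, p_n, mu, sigma, dt)
      % invert Equation 15 to recover the observation that maps p_n to p_next
      xi = (sigma^2 / (2 * mu)) .* log(((p_n - 1) .* p_next) ./ (p_n .* (p_next - 1)));
      % Jacobian of the change of variables, d(xi_{n+1}) / d(p_{n+1})
      J  = sigma^2 ./ (2 * mu .* p_next .* (1 - p_next));
      % mixture of the two Gaussian observation densities, weighted by p_n
      g  = @(m) exp(-(xi - m * dt).^2 ./ (2 * sigma^2 * dt)) ./ (sigma * sqrt(2 * pi * dt));
      f  = J .* (p_n .* g(mu) + (1 - p_n) .* g(-mu));
    end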

Reward-change task thresholds

For the reward-change task, we fixed punishment Ri=0 and allowed the reward Rc to be a Heaviside function given by Equation 5:

$R_c(t) = (R_2 - R_1)\,H_\theta(t-0.5) + R_1.$

In Equation 5, there is a single switch in rewards between pre-change reward R1 and post-change reward R2. This change occurs at t=0.5. Substituting this reward function into Equation 3 allows us to find the normative thresholds for this task as a function of R1 and R2.
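
In code, this step reward reduces to a one-line function handle; for example (with placeholder reward values), it can be passed directly to the backward-induction sketch above:

    % Step reward for the reward-change task; R1 and R2 are placeholder values.
    R1 = 3; R2 = 8;
    Rc = @(t) (R2 - R1) * (t >= 0.5) + R1;   % switches from R1 to R2 at t = 0.5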

For the inferred reward change task, we allowed the reward value R(t) ∈ {RH, RL} to be controlled by a continuous-time two-state Markov process with transition (hazard) rate h between the rewards RH and RL. The hazard rate h governs the probability of switching between RH and RL:

$\Pr\!\left(R(t+\delta t)=R_{H/L}\mid R(t)=R_{L/H}\right) = h\,\delta t + o(\delta t),\quad \delta t\to 0,$
$\Pr\!\left(R(t+\delta t)=R_{H/L}\mid R(t)=R_{H/L}\right) = 1 - h\,\delta t + o(\delta t),\quad \delta t\to 0,$

where o(δt) represents a function g(δt) with the property $\lim_{\delta t\to 0} g(\delta t)/\delta t = 0$ (i.e., all such terms are of smaller order than δt). In addition, the state of this Markov process must be inferred from evidence η that is independent of the evidence ξ about the environment's state (i.e., the correct choice). For simplicity, we assume that the reward-evidence source is also Gaussian-distributed, such that η|RH/L ∼ N(±μR, σR²) with quality mR = 2μR²/σR². Glaze et al., 2015; Veliz-Cuba et al., 2016; Barendregt et al., 2019 have shown that the belief $y_R = \ln\frac{\Pr(R(t)=R_H\mid\eta)}{\Pr(R(t)=R_L\mid\eta)}$ for such a dynamic state-inference process is given by the modified DDM

$dy_R = x(t)\,m_R\,dt - 2h\sinh(y_R)\,dt + \sqrt{2m_R}\,dW_t,$

where x(t) ∈ {±1} is a telegraph process that mirrors the state of the reward process (i.e., x(t)=1 when R(t)=RH and x(t)=−1 when R(t)=RL). With this belief over the reward state, we must also modify the values V+(pn;ρ) and V−(pn;ρ) to account for the uncertainty in Rc. Defining $q = \frac{e^{y_R}}{1+e^{y_R}}$ as the reward likelihood gives

$V_+(p_n;\rho) = \big(R_H q_n + R_L(1-q_n)\big)\,p_n - t_i\,\rho,\qquad V_-(p_n;\rho) = \big(R_H q_n + R_L(1-q_n)\big)\,(1-p_n) - t_i\,\rho,$

where we have fixed Ri=0 for simplicity.
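
To illustrate the inferred reward dynamics, the following MATLAB sketch simulates the reward-state belief yR with a simple Euler–Maruyama scheme; the parameter values and variable names are placeholders, and the telegraph process x(t) is generated alongside the belief.

    % Euler-Maruyama sketch of the reward-state belief y_R driven by a telegraph
    % process x(t); parameters below are arbitrary placeholder values.
    m_R = 2; h = 1; dt = 1e-3; T = 5;
    n_steps = round(T / dt);
    x   = 1;                                   % telegraph state mirroring R(t) (+1 <-> R_H)
    y_R = 0;                                   % log odds that R(t) = R_H
    q   = zeros(1, n_steps);                   % reward likelihood used in V+ and V-
    for k = 1:n_steps
      if rand < h * dt, x = -x; end            % reward state switches at hazard rate h
      y_R  = y_R + x * m_R * dt - 2 * h * sinh(y_R) * dt + sqrt(2 * m_R * dt) * randn;
      q(k) = exp(y_R) / (1 + exp(y_R));
    end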

SNR-change task thresholds

For the SNR-change task, we allowed the task difficulty $m = 2\mu^2/\sigma^2$ to vary over a single trial by making μ(t) a time-dependent step function given by Equation 6:

$\mu(t) = (\mu_2 - \mu_1)\,H_\theta(t-0.5) + \mu_1.$

In Equation 6, there is a single switch in evidence quality between pre-change quality μ1 and post-change quality μ2. This change occurs at t=0.5. Substituting this quality time series into the likelihood transfer function in Equation 16 allows us to find the normative thresholds for this task as a function of μ1 and μ2. This modification necessitates that the transfer function fp also be a function of time; however, because the quality change points are known in advance to the observer, we can simply switch between the corresponding transfer functions at the specified change times.
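
Numerically, the quality change amounts to switching the transfer function at the change time; for example, reusing the likelihood_transfer sketch above (μ1, μ2, σ, and δt below are placeholder values):

    % Time-dependent evidence quality for the SNR-change task (placeholder values).
    mu1 = 0.5; mu2 = 2; sigma = 1; dt = 1e-2;
    mu_t        = @(t) (mu2 - mu1) * (t >= 0.5) + mu1;           % Equation 6
    transfer_fn = @(p_next, p_n, t) likelihood_transfer(p_next, p_n, mu_t(t), sigma, dt);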

Reward-change task model performance

Here we detail the three models used to compare observer performance in the reward-change task, as well as the noise filtering process used to generate synthetic data. For the noisy Bayesian model, the observer uses the thresholds ±θ(t) obtained via dynamic programming, making the observer a noisy ideal observer. For the constant-threshold model, the observer uses a constant threshold ±θ(t) = ±θ0, which is predicted to be optimal only in very simple, static decision environments with only two states s±. Both the noisy Bayesian and constant-threshold models also use a noisy perturbation of the LLR, ỹ = y + σyZ, as their belief, where σy is the strength of the noise and Z is a sample from a standard normal distribution. In continuous time, this perturbation corresponds to adding an independent Wiener process to Equation 14:

$d\tilde{y} = \pm m\,dt + \sqrt{2m}\,dW_t + \sigma_y\,d\tilde{W}_t,$

where dW̃t is an independent Wiener process with strength σy. The UGM, being a phenomenological model, behaves differently from the other models. The UGM belief E is the output of the noisy low-pass filter given by Equation 7:

$\tau\,dE = \left(-E + \dfrac{1}{1+e^{-y}} - \dfrac{1}{2}\right)dt + \sigma_y\,dW_t.$

To add additional noise to the UGM’s belief variable E, we simply allowed σy>0 in the low-pass filter in Equation 7.

In addition to the inference noise with strength σy, we also filtered each process through a Gaussian response-time filter with zero mean and standard deviation σmn. Under this response-time filter, if the model predicted a response time T, the measured response time T̃ was drawn from a normal distribution centered at T with standard deviation σmn. If T̃ fell outside of the simulation's time discretization (i.e., if T̃ < 0 or T̃ > Tf), we redrew T̃ until it fell within the discretization. This filter was chosen to represent both “early responses” caused by attentional lapses, as well as “late responses” caused by motor processing delays between the formation of a choice in the brain and the physical response. We have chosen to add these two sources of noise after optimizing each model to maximize average reward rate, rather than reoptimizing each model after adding these additional noise sources. Although we could have reoptimized each model to maximize performance across noise realizations, we were interested in how the models responded to perturbations that drove their performance to be sub-optimal (but possibly near-optimal).
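
A minimal sketch of this response-time filter, with the redraw step made explicit (the function name is ours, and the bounds are simply the simulation window):

    % Gaussian response-time filter: redraw until the measured RT lies in [0, Tf].
    function T_meas = motor_filter(T_model, sigma_mn, Tf)
      T_meas = T_model + sigma_mn * randn;
      while T_meas < 0 || T_meas > Tf
        T_meas = T_model + sigma_mn * randn;   % redraw out-of-range samples
      end
    end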

To compare model performance on the reward-change task, we first fixed the value of the pre-change reward R1 (with R1 + R2 = 11, which determines the post-change reward) and tuned each model to achieve the optimal reward rate with no additional noise in either the inference or the response process. Bellman's equation yields both the optimal normative thresholds and the corresponding reward rate. For the constant-threshold model and the UGM, we approximated the maximal performance of each model using a grid search over each model's parameters to find the tuning that yielded the highest average reward rate. After tuning all models for a given reward structure, we filtered them through both the sensory (σy) and motor (σmn) noise sources without re-tuning the models to account for this additional noise. When generating noisy synthetic data from these models, we generated 100 synthetic subjects, each with sampled values of σy and σmn. For each synthetic subject with noise parameter sample (σy, σmn), we defined the “noise strength” of that subject to be the ratio

$\dfrac{\sigma_y+\sigma_{mn}}{\bar{\sigma}_y+\bar{\sigma}_{mn}},$

where σ̄y = 5 and σ̄mn = 0.25 are the maximum values of belief noise and motor noise considered, respectively. Under this metric, noise strength lies between 0 and 1. Additionally, the maximum noise levels σ̄y and σ̄mn were chosen such that a noise strength of 0.5 is approximately equivalent to the fitted noise strength obtained from the tokens task subject data. We plot the response distributions using noise strengths of 0, 0.5, and 1 in our results. To compare the performance of each model after being corrupted by noise, we then generated 1000 trials for each subject and had each simulated subject repeat the same block of trials three times, once for each model. This process ensured that the only differences in model performance would come from their distinct threshold behaviors, because each model was taken to be equally noisy and was run on the same stimuli.
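
The sketch below shows how the noise-strength metric can be computed for a population of synthetic subjects; note that the text above does not specify the sampling distribution for (σy, σmn), so the uniform sampling here is purely an assumption for illustration.

    % Noise-strength metric for synthetic subjects (uniform sampling is an assumption).
    sigma_y_bar = 5; sigma_mn_bar = 0.25;            % maximum noise levels from the text
    n_subjects  = 100;
    sigma_y  = sigma_y_bar  * rand(1, n_subjects);   % sampled belief-noise levels
    sigma_mn = sigma_mn_bar * rand(1, n_subjects);   % sampled motor-noise levels
    noise_strength = (sigma_y + sigma_mn) ./ (sigma_y_bar + sigma_mn_bar);  % lies in [0, 1]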

Tokens task

Normative model for the tokens task

For the tokens task, observations take the form of token movements that occur every 200 ms and are Bernoulli distributed with parameter p = 0.5. Once a subject committed to a decision, the token movements continued at a faster rate until the entire animation had finished. This post-decision token acceleration was 170 ms per movement in the ‘slow’ version of the task and 20 ms per movement in the ‘fast’ version. Because of the stimulus structure, one can show using a combinatorial argument (Cisek et al., 2009) that the likelihood function pn is given by Equation 8. Constructing the likelihood transfer function fp required for Bellman's equation is also simpler than for the Gaussian 2AFC tasks, as there are only two possible likelihoods that one can transition to after observing a token movement:

(17) $f_p\big(p(\mathrm{top}\mid U_{n+1},L_{n+1},C_{n+1})\,\big|\,p(\mathrm{top}\mid U_n,L_n,C_n)\big) = \begin{cases}\tfrac{1}{2}, & (U_{n+1},L_{n+1},C_{n+1})=(U_n+1,\,L_n,\,C_n-1)\\ \tfrac{1}{2}, & (U_{n+1},L_{n+1},C_{n+1})=(U_n,\,L_n+1,\,C_n-1)\\ 0, & \text{otherwise.}\end{cases}$

Combining Equation 8 and Equation 17, we can fully construct Bellman’s equation for the tokens task. While the timings of the token movements, post-decision token acceleration, and inter-trial interval are fixed, we let the reward Rc and cost function c be free parameters to control the different threshold dynamics of the model.
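
For reference, the tokens-task state likelihood can be evaluated by directly enumerating the remaining token movements (this enumeration is equivalent in spirit to the combinatorial expression in Equation 8, though written in our own notation); the transfer rule of Equation 17 then simply moves one token up or down with equal probability.

    % Probability that the top target is correct, given U tokens up, L tokens down,
    % and C tokens remaining, by enumerating the remaining (fair) token movements.
    function p_top = tokens_likelihood(U, L, C)
      j     = 0:C;                              % possible numbers of remaining up-moves
      wins  = (U + j) > (L + C - j);            % outcomes in which the top target wins
      p_top = sum(arrayfun(@(jj) nchoosek(C, jj), j(wins))) / 2^C;
    end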

Table 1
List of model parameters used for analyzing tokens task response time data.
  Model                       Parameters
  Noisy Bayesian (NB)         Reward Rc; Cost c(t) = c; Sensory noise σy; Motor noise σmn
  Constant threshold (Const)  Threshold θ0; Sensory noise σy; Motor noise σmn
  UGM                         Threshold scale θ0; Gain a; Time constant τ; Sensory noise σy; Motor noise σmn

Model fitting and comparison

We used three models to fit the subject response data provided by Cisek et al., 2009: the noisy Bayesian model (k = 4 parameters), the constant-threshold model (k = 3 parameters), and the UGM (k = 5 parameters) (Table 1). To adapt the continuous-time models to this discrete-time task, we simply set the time step to match the time between token movements (δt = 200 ms). To fit each model, we took the subject response-time distributions as our objective function and used Markov chain Monte Carlo (MCMC) with a standard Gaussian proposal distribution to generate an approximate posterior made up of 10,000 samples. For more details on our specific implementation of MCMC for these data, see the MATLAB code available at https://github.com/nwbarendregt/AdaptNormThresh (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2; Thresh, 2022). We held out 2 of the 22 subjects to use as training data when tuning the covariance matrix of the proposal distribution for each model, and performed the model fitting and comparison analysis on the remaining 20 subjects. Using the approximate posterior obtained via MCMC for each subject and model, we calculated AICc using the formula

(18) $\mathrm{AIC}_c = 2k - 2\ln(\hat{L}) + \dfrac{2k^2 + 2k}{n - k - 1}.$

In Equation 18, k is the number of parameters of the model, L̂ is the likelihood of the model evaluated at the maximum-likelihood parameters, and n is the number of responses in the subject data (Cavanaugh, 1997; Burnham and Anderson, 2002). Because each subject performed a different number of trials, using AICc allowed us to normalize results to account for the different data sizes; note that for many responses (i.e., for large n), AICc converges to the standard definition of AIC. For the second model-selection metric, we measured how well each fitted model predicted the trial-by-trial responses in the data by calculating the average RMSE between the response times from the data and the response times predicted by each model. To measure the difference between a subject's response-time distribution and the fitted model's distribution (Figure 6—figure supplement 1), we used the Kullback-Leibler (KL) divergence:

(19) $\mathrm{KL} = \displaystyle\sum_{i=0}^{15} RT_D(i)\,\ln\!\left(\dfrac{RT_D(i)}{RT_M(i)}\right).$

In Equation 19, i is a time index representing the number of observed token movements, RTD(i) is the probability of responding after i token movements from the subject data, and RTM(i) is the probability of responding after i token movements from the model’s response distribution. Smaller values of KL divergence indicate that the model’s response distribution is more similar to the subject data.
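
The MCMC scheme described above can be summarized by a standard Metropolis–Hastings loop with a Gaussian random-walk proposal; the MATLAB sketch below is generic, with the log-likelihood handle, starting point, and proposal covariance left as inputs (it is not the archived implementation).

    % Metropolis-Hastings sketch with a Gaussian random-walk proposal.
    function samples = mh_sample(loglik, theta0, Sigma, n_samples)
      d       = numel(theta0);
      samples = zeros(n_samples, d);
      theta   = theta0(:)';                    % current parameter vector (row)
      ll      = loglik(theta);
      L       = chol(Sigma, 'lower');          % factor of the proposal covariance
      for s = 1:n_samples
        prop    = theta + (L * randn(d, 1))';  % Gaussian proposal step
        ll_prop = loglik(prop);
        if log(rand) < ll_prop - ll            % Metropolis acceptance rule
          theta = prop; ll = ll_prop;
        end
        samples(s, :) = theta;
      end
    end

The two comparison metrics are likewise short to compute; in the KL sketch, bins with zero probability in the data are skipped, following the usual 0·ln 0 = 0 convention.

    % AICc (Equation 18) and KL divergence (Equation 19) as anonymous functions;
    % RT_D and RT_M are response probability vectors over 0..15 token movements.
    aicc = @(k, logL, n) 2*k - 2*logL + (2*k^2 + 2*k) / (n - k - 1);
    kl   = @(RT_D, RT_M) sum(RT_D(RT_D > 0) .* log(RT_D(RT_D > 0) ./ RT_M(RT_D > 0)));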

Code availability

See https://github.com/nwbarendregt/AdaptNormThresh (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2; Thresh, 2022) for the MATLAB code used to generate all results and figures.

Data availability

MATLAB code used to generate all results and figures is available at https://github.com/nwbarendregt/AdaptNormThresh (copy archived at swh:1:rev:2878a3d9f5a3b9b89a0084a897bef3414e9de4a2).

References

  1. Book
    1. Bellman R
    (1957)
    Dynamic Programming
    Princeton University Press.
  2. Book
    1. Berger T
    (2003)
    Rate-Distortion Theory
    Wiley Encyclopedia of Telecommunications.
  3. Book
    1. Bertsekas D
    (2012)
    Dynamic Programming and Optimal Control
    Athena Scientific.
  4. Book
    1. Burnham K
    2. Anderson D
    (2002)
    Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach
    Springer New York.
  5. Conference
    1. Drugowitsch J
    2. Moreno-Bote R
    3. Pouget A
    (2014a)
    Optimal decision-making with time-varying evidence reliability
    Advances in Neural Information Processing Systems. pp. 748–756.
  6. Report
    1. Drugowitsch J
    (2015)
    Notes on normative solutions to the speed-accuracy trade-off in perceptual decision-making
    FENS-Hertie Winter School.
  7. Conference
    1. Frazier P
    2. Yu AJ
    (2007)
    Sequential Hypothesis Testing under Stochastic Deadlines
    Advances in Neural Information Processing Systems.

Decision letter

  1. Peter Latham
    Reviewing Editor; University College London, United Kingdom
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom
  3. Gaurav Malhotra
    Reviewer; University of Bristol, United Kingdom

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Normative Decision Rules in Changing Environments" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Gaurav Malhotra (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

All three reviewers very much liked the paper. It was nice to see the formalism used to solve these problems take a central place in the manuscript, and the huge variability in bounds is something we haven't seen before.

But that didn't stop us from making a huge number of comments – you have either the good luck or the bad luck, depending on your point of view, of being reviewed by experts. Most comments have to do with the presentation: important information was missing (or at least we couldn't find it), and there were even places where we got lost (not a good sign, given that all three of us work in the field). Details follow.

I. The tokens task can be analyzed using the formalism introduced in Equation (3), but it seems pretty far from the "dynamic context" examples emphasized in the bulk of the paper. That doesn't mean the tokens task shouldn't be included. But it does mean we have no evidence one way or the other whether subjects would adopt the highly idiosyncratic boundaries found in simulations (for instance, the infinite threshold boundaries in Figures 2i, 2ii).

You need to be clear about this. The way the paper reads, it sounds like you have provided evidence for the dynamic context setup, when in fact that's not the case. It should be crystal clear that dynamic context problems have not been explored, at least not in the lab. Instead, what you showed is that a normative model can beat a particular heuristic model.

II. The following are technical but important.

1. Adding noise: you are currently optimizing the model parameters/policy before adding sensory and motor noise. Decision-makers could be aware of sensory noise and so could try to optimize their decision processes with that knowledge. Would you be able to also compare the models' performances if they have been optimized to maximize performance in the presence of all noise? From our understanding, this should be feasible for the Const and UGM model, but might be harder for the normative model. Sensory noise could be included by finding Equation (13) that includes such noise, but finding the optimal thresholds once RT noise is included might be prohibitive. This is just a suggestion, not an essential inclusion. However, it might be worth at least discussing the difference between what you do, and being clear on the scenario you considered.

2. Fitting token task data: according to Cisek et al. (2009), the same participants performed both the slow and the fast version of the task. However, their fitted reward magnitudes differ by an order of magnitude between the two conditions (your Figure 6C/F). Is it just that the fitting objective didn't well-constrain these parameters? Given that you use MCMC for model fits, you could compare the parameter posteriors across conditions. Furthermore, how much worse would the model fits become if you would fit both conditions simultaneously and share all parameters that can be meaningfully shared across conditions? In any case, an explanation for this difference should be provided in the manuscript.

3. The dynamic context examples would seem to apply only when subjects take many seconds to make a decision. This would seem to rule out perceptual decision-making tasks. Is this true? If so, you should be upfront about this – so that those who work on perceptual decision-making will know what they're getting into.

4. A known and predictable change in the middle of a task seems somewhat unrealistic. Given that it plays such a central role, concrete examples where this comes up would be very helpful. Or at least you should make a proposal for laboratory experiments where it could come up. The examples in the introduction ("Some of these factors can change quickly and affect our deliberations in real time; e.g., an unexpected shower will send us hurrying down the faster route (Figure 1A), whereas spotting a new ice cream store can make the longer route more attractive.") don't quite fall into the "known and predictable change" category.

III. Better contact with existing literature needs to be made. For instance:

1. Drugowitsch, Moreno-Bote and Pouget (2014) already computed normative decision policies for time-varying SNR, with the difference that they assumed the SNR to follow a stochastic process rather than a known, deterministic time course. Thus, the work is closely related, but not equivalent.

2. Some early models to predict dynamic decision boundaries were proposed by Busemeyer and Rapoport (1988) and Rapoport and Burkheimer (1971) in the context of a deferred decision-making task.

3. One of the earliest models to use dynamic programming to predict non-constant decision boundaries was Frazier and Yu (2007). Indeed some boundaries predicted by the authors (e.g. Figure 2v) are very similar to boundaries predicted by this model. In fact, the switch from high to low reward used to propose boundaries in Figure 2v can be seen as a "softer" version of the deadline task in Frazier and Yu (2007).

4. Another early observation that time-varying boundaries can account for empirical data was made by Ditterich (2006). Seems highly relevant to the authors' predictions, but is not cited.

5. The authors seem to imply that their results are the first results showing non-monotonic thresholds. This is not true. See, for example, Malhotra et al. (2018). What is novel here is the specific shape of these non-monotonic boundaries.

IV. Clarity could be massively improved. If you want to write an unclear paper that is your prerogative. However, if you do, you can't say "Our results can aid experimentalists investigating the nuances of complex decision-making in several ways". It would be difficult to aid experimentalists if they have to struggle to understand the paper.

Below are comments collected from the three reviewers, and more or less collated (so it's possible there's some overlap, and the order isn't exactly optimized). You can, in fact, almost ignore them, if you take into account the main message: all information should be easily accessible, in the main text, and the figures should be easy to make sense of.

As authors, we are aware that the length of replies can sometimes exceed the paper, which is not a good use of anybody's time. Please use your judgment as to which ones you reply to. For instance, if you're going to implement our suggestions, no reason to tell us. Maybe comment only if you have a major objection? Or use some other scheme? What we really care about is that the revised paper is easy to read!

1. When the UGM was introduced, all you say is "urgency-gating models (UGMs) use thresholds that collapse monotonically over time". You include some references, but for the casual reader, it looks like you're considering a generic collapsing bound model. In fact, you're considering a particular shape for the collapsing bound and particular filtering of the evidence. This should be clear. It also needs to be justified. For instance, Voskuilen et al. (J. Math. Psych. 73:59-79, 2016) use a different functional form for the collapsing bound, and they don't filter the evidence. Why use one model over another?

And while we're on the topic of the UGM: Equation (4) low-pass filters the noise-free observer's belief y that reflects all accumulated evidence up to current time t. According to our reading of Cisek et al. (2009), the UGM low-pass filters the momentary internal estimate of sensory information (the Ei(tau) defined below Equation (1); Equations. (17)-(19) for the low-pass filter in Cisek et al.) rather than the accumulated estimate of sensory information. Are we misinterpreting Cisek et al. (2009) or your Equation. (4)? Either way, please clarify.

In Equation. 4 it would be more clear to put -E + 0.5*tanh(y) on the RHS. What's the justification for tanh? Why not just filter y? Do you use tanh because the original paper did? If so, you should point that out.

Also, what's y in that equation?

2. Important inline equations need to be displayed. There's nothing more annoying than having to crawl through text to look for the definition of an important symbol. To take a few (hardly exhaustive) examples: f±(ξ),y,pn,fp(pn+1|pn). The actual list is much longer. If any symbol is going to be used again, please make it easy to find! This in itself is a reason for displayed equations: you can refer to equation numbers when introducing variables that you haven't used for several pages.

3. A lot of the lines don't have line numbers, which is relevant mainly for us, since it's hard to refer to things without line numbers. This is a bug, but there's a way to fix it. I think (but I'm not sure) that in your latex file you need to leave a space between equations and surrounding text. (Or maybe no space? It's been a while). Although I believe there's a more elegant fix.

4. Not all equations were numbered. We know, in some conventions only equations one refers to are numbered (that's what one of us grew up with), but it turns out to be not so convenient for us as reviewers when we want to refer to an un-numbered equation.

5. Lines 43-6: "Efforts to model decision-making thresholds under dynamic conditions have focused largely on heuristic strategies. For instance, "urgency-gating models" (UGMs) use thresholds that collapse monotonically over time (equivalent to dilating the belief in time) to explain decisions based on time-varying evidence quality".

In fact, a collapsing bound is not necessarily a heuristic; it can be optimal, although the exact shape of the collapsing bound has to be found by dynamic programming. Please reword to reflect this.

6. Line 76: c(t) is barely motivated at all here. It's better motivated in Methods, but its value is very hard to justify. Why not stick with optimizing average reward, for which c=0? And I don't think you ever told us what c(t) was; only that it was constant (although we could have missed its value).

7. Figure 2C would be easier to make sense of if it were square.

8. In general, information is scattered all over the place, and much of it seems to be missing. Each task should be described succinctly in the main text, with enough information to actually figure out what's going on. In addition, there should be a table listing _all_ the parameters; right now the reader has to go to Methods, and even then it seems that many are missing. For instance, we don't think we were ever told the value of tau in Equation. 4.

9. Lots of questions/comments about Figure 4:

a. It would be very helpful to include the optimal model. I think NB is the optimal model when σ_y=0, but I also believe that in most panels σ_y \ne 0.

b. It would be helpful to emphasize, in the figure caption, that NB with σ_y = 0 is the optimal model. Assuming that's true.

c. Figure 4A: What's the post-reward rate? And please indicate the pre-reward rate at which pre-reward = post-reward. Also, if pre- and post-reward rates sum to 11 (as mentioned in Methods, line 411), why are the curves' minima at around 5 rather than 5.5?

d. Figure 4B: horizontal axis label missing (presumably "pre reward"?). And we assume you used the following color code: Orange: reward(NB)-reward(Const); violet: reward(NB)-reward(UGM). Correct? Either way, this should be stated in the figure caption.

e. Figure 4C: what are the pre and post-rewards? And presumably noise strength = σ_y? This should be stated clearly. And more explanation, in the main text, of what "noise strength" is would help.

f. Figure 4F: It is not clear to us why UGM in 0 noise condition have RTs aligned to the time reward increases from R1 to R2. Surely, this model does not take RR into account to compute the thresholds, does it? In fact, looking at Figure 4B, Supplement 1, the thresholds are always highest at t=0. Please clarify.

10. Lines 207-9: "Because the total number of tokens was finite and known to the subject, token movements varied in their informativeness within a trial, yielding a dynamic and history-dependent evidence quality that, in principle, could benefit from adaptive decision processes".

To us, "history-dependent" implies non-Markov, whereas the tokens task is Markov. But maybe that's not what history-dependent means here? This should be clarified.

11. We assume the y-axis in Figure 5i-iv is the difference between the number of tokens on the top and the number on the bottom. This should be stated (if it's true). And please explain how you differentiate between motifs iii and iv. We believe it's the presence of two threshold increases (rather than just one) in motif iv, but we're not sure.

12. What's the reward/punishment structure for the tokens task? It seemed that this was only half explained.

13. Lines 229-232: "To determine the relevance of these adaptive decision strategies to human behavior, we fit discrete-time versions of the noisy Bayesian (four free parameters), constant-threshold (three free parameters), and urgency-gating (five free parameters) models to response-time data from the tokens task collected by Cisek et al. (2009)."

As mentioned above, the parameters should go in a table.

14. You should tell us what V(T_final) is, and why. We believe it's the same as V(0), but we could be wrong.

15. After Equation 11: it says m = 2μ²/σ². Are these μ and σ different than the ones on line 383? If so, that should be clear. (If not, we're lost.)

16. We looked, but couldn't find, the definition of f_p. We believe it's just a conditional probability,

f_p(p_{n+1}|p_n) = P(p_{n+1}|p_n).

If so, why not use that notation? It would be a lot easier to remember. In any case, when this is used, please tell us what it is, or where it was originally defined (which should be in a displayed equation!).

17. State space is parameterized by p_n, and that needs to be discretized, right? If so, that's worth mentioning. If not, we're lost.

18. Analysis (in particular Equation. 13) would be a lot easier if you used y_n instead of p_n. y_n is what is generally accumulated in DDMs, and it's what you generally plot on the y-axis. So why use p_n?

19. Equations. 14 and 15 should really be in the main text. They're simple and important.

20. We didn't understand the inferred reward change task, in the text starting after line 393. We might have been able to guess, but please put in equations so it's crystal clear.

21. Somewhere below line 404: "a constant threshold … is predicted to be optimal only in simple, static decision environments." It's worth pointing out that the decision environments have to be _very_ simple. Even adding one more mean induces a non-constant (and typically collapsing) bound.

22. Equation above line 405: why repeat that equation, and not repeat Equation. 4? Just curious.

23. Lines 409-11: Couldn't parse.

24. After line 411, we find out that R1+R2=11. This is important and simple; you should tell us in the main text.

25. After line 411: we couldn't parse "allowing us to find the exact tuning of the normative model."

26. In fact, we're lost in pretty much everything between line 411 and the tokens task.

27. Line 429: what's "post-decision token acceleration"?

28. Line 433: "We used three models to fit the subject response data …". As far as we could tell, the three models are continuous time models. How were they adapted to this task, which runs in discrete time? Is it just a matter of making the time step larger?

29. Lines 432-434: please be more clear about parameter counting -- by listing parameters.

30. Lines 437-8: "For more details as to our specific implementation of MCMC for this data, see the MATLAB code available at https://github.com/nwbarendregt/AdaptNormThresh".

We shouldn't have to look at code to get details; all important details should be in the paper.

31. Figure 2—figure supplement 2 and Figure 3—figure supplement 1: we thought the reward changed only once. But it's changing a lot in panel A. What's going on?

32. The Abstract / Introduction isn't clear enough about what you refer to as a "changing / dynamic environment". In particular, there is a rich history of research on environments whose state changes across decisions rather than within individual decisions. Making this distinction explicit, and clarifying that you care about the latter rather than the former should make Abstract / Intro clearer.

33. In the text around Equation. (2), you should mention that you're assuming independence across time.

34. Equation. (3): should c(dt) really be c(t)dt? Its dependence on only the time step size seems incompatible with its initial definition in line 77, where it depends on time t since trial onset. Although eventually, it does become a constant.

35. Below Equation. (3): "We choose generating distributions f_+/- that allow us to explicitly compute the average future value […]" – can you compute the average future value explicitly, or just f_p(p_n+1 | p_n)? Methods only discuss the latter.

36. Figures 2 and 3: the assumed reward/cost magnitudes should be mentioned in the main text, and also if the results were qualitatively dependent on these magnitudes (we assume not?).

37. Figure 2B: "belief" in Bayesian statistics usually refers to a posterior probability, whereas you seem to be using it to refer to log-posterior odds (or log-odds). Please clarify in the text what you mean by "belief" (if you haven't done so already and we missed it). This also refers to Figure 3B and clarifies what the thresholds are on in Figures 3/4/5.

38. Figures 2C/3C: the letter placements are slightly unclear. In particular, in Figure 2C it is hard to see where exactly 'iv' is placed. Maybe using labeled dots instead would increase placement precision?

39. Line 130: "[…] in which reward fluctuations are governed by a two-state Markov process […]". We couldn't figure out from the description in the main text what setup you are referring to and how to interpret Figure 2 – suppl 3. Please provide more detail (not just in Methods) on the reward switching process: what information is provided to the decision-maker to infer its state, etc.

40. below Line 156: we got lost in the notation for the different noisy / noise-less accumulator models. y_tilde appears to be accumulation with added sensory noise but is in the second point referred to as the "belief y_tilde [of the] normative model", which, being normative, presumably wouldn't have sensory noise. Furthermore, the UGM model seems to use the "noise-free observer's belief y". Is that the belief as defined in Equation. (2) which still includes the sample noise, such that calling it "noise-free" might be confusing?

41. Starting on line 169: the text is unclear on how the models are tuned to cope with the noise, if at all. How the model parameters of the Const and UGM are chosen should also be mentioned in the main text, not just Methods – in particular, that they are tuned to maximize decision performance.

42. Line 332: "+- theta" – missing "(t)"?

43. Line 333: "where observations every dt time units" – fragment?

44. Equation. (10): shouldn't V+ / V- / Vw also be functions of rho?

45. The equation above Equation. (12): how is the expected future value computed? I assume that this can only be done numerically? Either way, please specify the details of how you do so. Referring to a Github repo isn't sufficient.

45. The evidence setup that leads to Equation. (13) appears to be equivalent to the one leading to Equation. (16) in Drugowitsch et al. (2012) for M=1. Is this correct? If yes, is the result equivalent? Either way, the relationship would be worth pointing out.

46. Line 411: "the measured response time T_tilde was drawn from a normal distribution […]" – what happened for predicted response times <0? Did you truncate the normal distribution at 0?

47. Line 432: what was the objective function for the MCMC fits? The joint likelihood of RTs and choices?

48. One of the more realistic scenarios is presented in Figure 2—figure supplement 3, where reward doesn't switch at a fixed time, but uses instead a Markov process. But you do not provide enough details of the task or the results. Is m_R = R_H / R_L? Is it the dark line that corresponds to m_R=\inf (as indicated by legend) or the dotted line (as indicated by caption)? For what value of drift are these thresholds derived? These details should be included.

https://doi.org/10.7554/eLife.79824.sa1

Author response

I. The tokens task can be analyzed using the formalism introduced in Equation. (3), but it seems pretty far from the "dynamic context" examples emphasized in the bulk of the paper. That doesn't mean the tokens task shouldn't be included. But it does mean we have no evidence one way or the other whether subjects would adopt the highly idiosyncratic boundaries found in simulations (for instance, the infinite threshold boundaries in Figures 2i, 2ii).

You need to be clear about this. The way the paper reads, it sounds like you have provided evidence for the dynamic context setup, when in fact that's not the case. It should be crystal clear that dynamic context problems have not been explored, at least not in the lab. Instead, what you showed is that a normative model can beat a particular heuristic model.

We have revised the text substantially to clarify and expand upon these important points. Specifically, we:

a. More clearly define the broad set of possible “dynamic context” conditions, including changes in outcome expectations or evidence quality while the evidence is being processed, where the changes can be either: (1) abrupt, as in the reward-change and SNR-change tasks we introduce, which we analyze only theoretically, or (2) gradual, as in the evidence quality changes in the tokens task, which we analyze theoretically and experimentally (e.g., in Results: “Even for such simple tasks, there is a broad set of possible dynamic contexts. In the next section, we will analyze a task where context changes gradually (the tokens task). Here we focus on tasks where the context changes abruptly.”).

b. Explain that our theoretical framework is general enough to account for both abrupt and gradual changes, and clarify that our analysis of data from the tokens task shows that the behavior of subjects is better described by a noisy normative model than by previously considered alternatives applied to that particular form of a dynamic-context task. We also state explicitly that more work is needed to determine if and how people follow normative principles for other dynamic-context tasks, …

II. The following are technical but important.

1. Adding noise: you are currently optimizing the model parameters/policy before adding sensory and motor noise. Decision-makers could be aware of sensory noise and so could try to optimize their decision processes with that knowledge. Would you be able to also compare the models' performances if they have been optimized to maximize performance in the presence of all noise? From our understanding, this should be feasible for the Const and UGM model, but might be harder for the normative model. Sensory noise could be included by finding Equation. (13) that includes such noise, but finding the optimal thresholds once RT noise is included might be prohibitive. This is just a suggestion, not an essential inclusion. However, it might be worth at least discussing the difference between what you do, and being clear on the scenario you considered.

We appreciate these important points and now consider them in the revised Discussion. However, we have chosen not to extend our analyses, for several reasons: (1) An optimal observer without internal sensory and motor noise gives the best possible responses, and thus provides a useful benchmark; and (2) we fear that adding results that define optimality with respect to internal sensory and motor noise would, because of the assumptions we would have to make about both the nature and knowledge of those noise sources, be distracting as well as much more speculative and thus make the paper harder to follow.

We have updated the Methods section to highlight these points:

“We have chosen to add these two sources of noise after optimizing each model to maximize average reward rate, rather than reoptimizing each model after adding these additional noise sources. Although we could have reoptimized each model to maximize performance across noise realizations, we were interested in how the models responded to perturbations that drove their performance to be sub-optimal (but possibly near-optimal).”

as well as the Discussion:

“Task-relevant variability can also arise from internal sources, including noise in neural processing of sensory input and motor output (Ma and Jazayeri, 2014; Faisal et al., 2008). We assumed subjects do not have precise knowledge of the strength or nature of these noise sources, and thus they could not optimize their strategy accordingly. However, people may be capable of rapidly estimating performance error that results from such internal noise processes and adjusting on-line (Bonnen et al., 2015). To extend the models we considered, we could therefore assume that subjects can estimate the magnitude of such sensory and motor noise, and use this information to adapt their decision strategies to improve performance.”

2. Fitting token task data: according to Cisek et al. (2009), the same participants performed both the slow and the fast version of the task. However, their fitted reward magnitudes differ by an order of magnitude between the two conditions (your Figure 6C/F). Is it just that the fitting objective didn't well-constrain these parameters? Given that you use MCMC for model fits, you could compare the parameter posteriors across conditions. Furthermore, how much worse would the model fits become if you would fit both conditions simultaneously and share all parameters that can be meaningfully shared across conditions? In any case, an explanation for this difference should be provided in the manuscript.

We now include a supplementary figure (Figure 6—figure supplement 2) comparing the posteriors across conditions as well as reward magnitudes in the slow and fast versions of the tokens task for a representative subject. The maximum likelihood estimate of the reward magnitude tended to be much higher in the slow task than in the fast task. It appears that subjects thus use distinct strategies in the two contexts, which we do not find surprising. We therefore do not expect to obtain fits of the same quality if we assume that subjective reward magnitude is the same across conditions. We speculate that subjects may value reward more in the slow task because it is obtained less frequently. Related effects have been attributed to amplified dopamine responses when rewards are rare (Rothenhoefer et al. 2021 Nat Neurosci). We added text to the Results section to point out this interesting finding:

“This result also shows that, assuming subjects used a normative model, they used distinct model parameters, and thus different strategies, for both the fast and slow task conditions. This finding is clearer when looking at the posterior parameter distribution for each subject and model parameter (see Figure 6—figure supplement 1 for an example). We speculate that the higher estimated value of reward in the slow task may arise due to subjects valuing frequent rewards more favorably.”

3. The dynamic context examples would seem to apply only when subjects take many seconds to make a decision. This would seem to rule out perceptual decision-making tasks. Is this true? If so, you should be upfront about this – so that those who work on perceptual decision-making will know what they're getting into.

We disagree. The impact of normative decision rules is relevant even on shorter timescales, including those relevant to perceptual decisions (e.g., on the order of 100 ms). Figure 2—figure supplement 2 and Figure 3—figure supplement 1 demonstrate that even though normative decision rules may invoke plans across multiple context changepoints, often decisions are made within the 1st or 2nd changepoint, and the corresponding reaction time distributions would have a character distinct from those emerging from strategies with flat decision thresholds. Moreover, there is ample evidence that subjects are capable of adapting perceptual evidence integration to sub-second timescales (Ossmy et al. 2013; Glaze et al. 2015). We thus speculate that perceptual decision rules could adapt on similar timescales as predicted by our normative models.

We have updated the Discussion to clarify these points:

“Perceptual decision-making tasks provide a readily accessible route for validating this theory, especially considering the ease with which task difficulty can be parameterized to identify parameter ranges in which strategies can best be differentiated (Philiastides et al. 2006). There is ample evidence already that people can tune the timescale of leaky evidence accumulation processes to the switching rate of an unpredictably changing state governing the statistics of a visual stimulus, to efficiently integrate observations and make a decision about the state (Ossmy et al. 2013; Glaze et al. 2015). We thus speculate that adaptive decision rules could be identified similarly in the strategies people use to make decisions about perceptual stimuli in dynamic contexts.”

4. A known and predictable change in the middle of a task seems somewhat unrealistic. Given that it plays such a central role, concrete examples where this comes up would be very helpful. Or at least you should make a proposal for laboratory experiments where it could come up. The examples in the introduction ("Some of these factors can change quickly and affect our deliberations in real time; e.g., an unexpected shower will send us hurrying down the faster route (Figure 1A), whereas spotting a new ice cream store can make the longer route more attractive.") don't quite fall into the "known and predictable change" category.

Foraging animals must often deal with unpredictable changes in light and visibility conditions, but they also adjust to predictable changes in light brought about by the variation in sunlight with time of day. Sunrise and sunset represent stereotyped changes in foraging conditions as well as necessary escape conditions for prey animals. On shorter timescales, birds and other animals seeking mates, parents, or offspring must often discriminate between two or more calls with known amplitude modulations over time. Financial traders make decisions in markets with fixed open and closing times that strongly shape trading context. Dutch auctions are structured so that an item’s cost is successively lowered until a bidder agrees to pay that amount, reflecting a predictable stair-stepping procedure for cost changes. In all these examples the quality of evidence changes in a predictable way, while the evidence remains noisy.

Concerning laboratory experiments, the first half of the paper already proposes a visual decision-making task. The experiment we analyzed could be implemented as a switching context random dot motion discrimination task with either changes in signal-to-noise (coherence) levels, or changes in reward amounts. Such changes could be signaled or consistently implemented at the same time each trial, so as to be known.

We now have added a sentence in the Introduction:

“People and other animals thus must cope with unpredictable changes in context, such as breaks in the weather (Grubb, 1975), as well as predictable changes that affect their observations, like the daily sunrise and sunset (McNamara et al., 1994).”

as well as a note in the Discussion to indicate the relevance of such task structures, and describe how they can be implemented in a laboratory setting:

“Model-driven experimental design can aid in identification of adaptive decision rules in practice. People commonly encounter unpredictable (e.g., an abrupt thunderstorm) and predictable (e.g., sunset) context changes when making decisions. Natural extensions of common perceptual decision tasks (e.g., random-dot motion discrimination Gold and Shadlen 2002) could include within-trial changes in stimulus signal-to-noise ratio (evidence quality) or anticipated reward payout.”

III. Better contact with existing literature needs to be made. For instance:

1. Drugowitsch, Moreno-Bote and Pouget (2014) already computed normative decision policies for time-varying SNR, with the difference that they assumed the SNR to follow a stochastic process rather than a known, deterministic time course. Thus, the work is closely related, but not equivalent.

Indeed we had not explained in detail the differences between their work and ours. We have now added the following sentence to the Discussion to make this clear:

“These strategies include dynamically changing decision thresholds when signal-to-noise ratios of evidence streams vary according to a Cox-Ingersoll-Ross process (Drugowitsch et al., 2014a)”

2. Some early models to predict dynamic decision boundaries were proposed by Busemeyer and Rapoport (1988) and Rapoport and Burkheimer (1971) in the context of a deferred decision-making task.

Thanks very much for pointing out these seminal references, which we now include in the Discussion:

“Several early normative theories were, like ours, based on dynamic programming (Rapoport and Burkheimer, 1971; Busemeyer and Rapoport, 1988), and in some cases models fit to experimental data (Ditterich, 2006).”

3. One of the earliest models to use dynamic programming to predict non-constant decision boundaries was Frazier and Yu (2007). Indeed some boundaries predicted by the authors (e.g. Figure 2v) are very similar to boundaries predicted by this model. In fact, the switch from high to low reward used to propose boundaries in Figure 2v can be seen as a "softer" version of the deadline task in Frazier and Yu (2007).

Again, we very much appreciate the pointer to the very relevant reference, which we include in the Discussion:

“For example, dynamic programming was used to show that certain optimal decisions can require non-constant decision boundaries similar to those of our normative models in dynamic reward tasks (Frazier and Yu, 2007) (Figure 2).”

4. Another early observation that time-varying boundaries can account for empirical data was made by Ditterich (2006). Seems highly relevant to the authors' predictions, but is not cited.

We agree and regret the oversight. We now reference that paper.

5. The authors seem to imply that their results are the first results showing non-monotonic thresholds. This is not true. See, for example, Malhotra et al. (2018). What is novel here is the specific shape of these non-monotonic boundaries.

As with the work by Drugowitsch et al. (2014), this work demonstrates the emergence of non-monotonic boundaries, but in tasks and settings distinct from the ones we consider (which specifically employ dynamic context). We have clarified these points in the manuscript.

IV. Clarity could be massively improved. If you want to write an unclear paper that is your prerogative. However, if you do, you can't say "Our results can aid experimentalists investigating the nuances of complex decision-making in several ways". It would be difficult to aid experimentalists if they have to struggle to understand the paper.

Below are comments collected from the three reviewers, and more or less collated (so it's possible there's some overlap, and the order isn't exactly optimized). You can, in fact, almost ignore them, if you take into account the main message: all information should be easily accessible, in the main text, and the figures should be easy to make sense of.

As authors, we are aware that the length of replies can sometimes exceed the paper, which is not a good use of anybody's time. Please use your judgment as to which ones you reply to. For instance, if you're going to implement our suggestions, no reason to tell us. Maybe comment only if you have a major objection? Or use some other scheme? What we really care about is that the revised paper is easy to read!

Thanks for providing us with flexibility in how and to what we respond. Generally, we found all comments helpful, and so we have endeavored to make edits that address everything the reviewers brought to our attention. To simplify this letter, we include below only those points that require additional explanation. Otherwise all changes can be found in red in the revised manuscript.

6. Line 76: c(t) is barely motivated at all here. It's better motivated in Methods, but its value is very hard to justify. Why not stick with optimizing average reward, for which c=0? And I don't think you ever told us what c(t) was; only that it was constant (although we could have missed its value).

We have added the following motivation of the cost function c(t) to the main text:

“The incremental evidence cost function c(t) represents both explicit time costs, such as a price for gathering evidence, and implicit costs, such as the opportunity cost. While there are many forms of this cost function, we will make the simplifying assumption that it is constant, c(t)=c. Because more complex cost functions can influence decision threshold dynamics (Drugowitsch et al., 2012), restricting the cost function to a constant ensures that threshold dynamics are governed purely by changes in the (external) task conditions and not the (internal) cost function.”

We also specified the cost function c(t) = 1 that we used in Figure 2-4 in the figure captions. We revised the caption of Figure 5 to make it more clear that we are finding decision threshold motifs as a function of the cost function c:

“… B: Colormap of normative threshold dynamics for the “slow” version of the tokens task in reward-evidence cost parameter space (i.e., as a function of Rc and c(t) = c from Equation 3, with punishment Ri set to -1). Distinct …”

We also added in more clarification to the caption of Figure 6C,F to emphasize that we are fitting the cost function c(t) = c.

10. Lines 207-9: "Because the total number of tokens was finite and known to the subject, token movements varied in their informativeness within a trial, yielding a dynamic and history-dependent evidence quality that, in principle, could benefit from adaptive decision processes".

To us, "history-dependent" implies non-Markov, whereas the tokens task is Markov. But maybe that's not what history-dependent means here? This should be clarified.

Yes, the token count differential is driven by a Markov process, since there is always a 50/50 chance of the token being moved to the top or bottom target. However, the log likelihood ratio associated with either target having more tokens at the end is a non-Markovian, history-dependent process, because the possible LLR increments on each token movement are determined by the token movements so far. This subtlety does make this a dynamic context task, where the evidence quality is the context that changes gradually throughout a trial. We addressed this in our response to the major comments above as we describe the temporal dynamics of the tokens task.

“In addition, the task included two different post-decision token movement speeds, “slow” and “fast”: once the subject committed to a choice, the tokens finished out their animation, moving either once every 170 ms (slow task) or once every 20 ms (fast task). This post-decision movement acceleration changed the value associated with commitment by making the average inter-trial interval (ti in Equation 1) decrease over time. Because of this modulation, we can interpret the tokens task as a multi-change reward task, where commitment value is controlled through ti rather than through reward Rc.”

19. Equations. 14 and 15 should really be in the main text. They're simple and important.

We added the following text to include these Heaviside functions in the main text and to better motivate our investigation into single-change environments for reward:

“Environments with multiple fluctuations during a single decision lead to complex threshold dynamics, but are comprised of threshold change “motifs.” These motifs occur on shorter intervals and tend to emerge from simple monotonic changes in context parameters (Figure 2—figure supplement 2). To better understand the range of possible threshold motifs, we focused on environments with single changes in task parameters. For the reward-change task, we set punishment to Ri = 0, and assumed reward Rc changes abruptly, so that its dynamics are described by a Heaviside function

$R_c(t) = (R_2 - R_1)H_\theta(t-0.5) + R_1$. Thus, the reward switches from a pre-change value of R1 to a post-change value of R2 at t=0.5. For this single-change task, …”

and quality:

“In the SNR-change task, optimal strategies for environments with multiple fluctuations are characterized by threshold dynamics adapted to changes in evidence quality in a way similar to changes in reward (Figure 3—figure supplement 1). To study the range of possible threshold motifs, we again considered environments with single changes in evidence quality m = 2μ²/σ² by taking μ to be a Heaviside function: $\mu(t) = (\mu_2 - \mu_1)H_\theta(t-0.5) + \mu_1$. For this single-change task, we again found similar threshold motifs to those in the reward-change task (Figure 3A,B).”

23. Lines 409-11: Couldn't parse.

We have revised this paragraph for clarity and to include more details and motivation:

“In addition to the inference noise with strength σy, we also filtered each process through a Gaussian response-time filter with zero mean and standard deviation σmn. Under this response-time filter, if the model predicted a response time T, the measured response time T̃ was drawn from a normal distribution centered at T with standard deviation σmn. If T̃ fell outside of the simulation's time discretization (i.e., if T̃ < 0 or T̃ > Tf), we redrew T̃ until it fell within the discretization. This filter was chosen to represent both “early responses” caused by attentional lapses, as well as “late responses” caused by motor processing delays between the formation of a choice in the brain and the physical response.”

https://doi.org/10.7554/eLife.79824.sa2

Article and author information

Author details

  1. Nicholas W Barendregt

    Department of Applied Mathematics, University of Colorado Boulder, Boulder, United States
    Contribution
    Conceptualization, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    nicholas.barendregt@colorado.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-3268-9426
  2. Joshua I Gold

    Department of Neuroscience, University of Pennsylvania, Philadelphia, United States
    Contribution
    Supervision, Funding acquisition, Writing - review and editing
    Competing interests
    Senior editor, eLife
    ORCID iD: 0000-0002-6018-0483
  3. Krešimir Josić

    Department of Mathematics, University of Houston, Houston, United States
    Contribution
    Supervision, Funding acquisition, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-1975-3913
  4. Zachary P Kilpatrick

    Department of Applied Mathematics, University of Colorado Boulder, Boulder, United States
    Contribution
    Supervision, Funding acquisition, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-2835-9416

Funding

National Institutes of Health (R01-MH-115557)

  • Nicholas W Barendregt
  • Joshua I Gold
  • Krešimir Josić
  • Zachary P Kilpatrick

National Institutes of Health (R01-EB029847-01)

  • Nicholas W Barendregt
  • Zachary P Kilpatrick

National Science Foundation (NSF-DMS-1853630)

  • Nicholas W Barendregt
  • Zachary P Kilpatrick

National Science Foundation (NSF-DBI-1707400)

  • Krešimir Josić

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Paul Cisek for providing response data from the tokens task used in our analysis.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Peter Latham, University College London, United Kingdom

Reviewer

  1. Gaurav Malhotra, University of Bristol, United Kingdom

Version history

  1. Preprint posted: April 29, 2022
  2. Received: May 3, 2022
  3. Accepted: October 20, 2022
  4. Accepted Manuscript published: October 25, 2022 (version 1)
  5. Version of Record published: December 15, 2022 (version 2)

Copyright

© 2022, Barendregt et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

  1. Nicholas W Barendregt
  2. Joshua I Gold
  3. Krešimir Josić
  4. Zachary P Kilpatrick
(2022)
Normative decision rules in changing environments
eLife 11:e79824.
https://doi.org/10.7554/eLife.79824
