Abstract
Many tasks used to study decision-making encourage subjects to integrate evidence over time. Such tasks are useful to understand how the brain operates on multiple samples of information over prolonged timescales, but only if subjects actually integrate evidence to form their decisions. We explored the behavioral observations that corroborate evidence integration in a number of task designs. Several commonly accepted signs of integration were also predicted by nonintegration strategies. Furthermore, an integration model could fit data generated by nonintegration models. We identified the features of nonintegration models that allowed them to mimic integration and used these insights to design a motion discrimination task that disentangled the models. In human subjects performing the task, we falsified a nonintegration strategy in each subject and confirmed prolonged integration in all but one. The findings illustrate the difficulty of identifying a decision-maker’s strategy and support solutions to achieve this goal.
Introduction
Unlike reflexive behaviors and simple sensorimotor response associations, cognitive functions are not beholden to fleeting sensory information or the real-time control of motor systems. They incorporate many samples of information, spanning timescales of tenths of seconds to years. A fruitful approach to study how the brain operates on information over prolonged timescales has been to record and perturb neural activity while subjects make decisions about noisy sensory stimuli (Shadlen and Kiani, 2013). The presence of this noise encourages subjects to reduce it by integrating the sensory information over time. If the timescale of integration is long enough, then the neural mechanisms of this decision process are likely to provide insight into those that allow cognitive processes to operate over long timescales.
One concern is that a subject may not integrate sensory evidence when making these decisions, even though they are encouraged to do so. The concern is exacerbated in animal subjects because they must learn the task without explicit instructions. Experimentalists therefore face a challenge: they must infer a subject’s decision strategy from behavioral measurements like choice accuracy and reaction time (RT). Correct identification of a subject’s strategy is especially important for neuroscience, which aims to elucidate biological mechanisms of evidence integration. Results cannot bear on the neural mechanisms of evidence integration if subjects were not integrating evidence.
Mathematical models of evidence integration provide a framework for using behavioral measurements to infer a subject’s decision strategy. These models posit that samples of noisy evidence are integrated over time until a threshold is exceeded or until the stream of evidence (e.g. the stimulus) is extinguished, at which point a decision is made. Examples include drift-diffusion, race, and attractor models (Gold and Shadlen, 2007; Deco et al., 2013). Integration models predict several observations that are common in behavioral data: in fixed stimulus duration (FSD) tasks, subjects’ decisions appear to incorporate evidence presented throughout the entire stimulus presentation epoch (e.g. Odoemene et al., 2018; Deverett et al., 2018; Morcos and Harvey, 2016; Katz et al., 2016; Yates et al., 2017; Wyart et al., 2012); in variable stimulus duration (VSD) tasks, accuracy improves with increasing stimulus duration (e.g. Britten et al., 1992; Gold and Shadlen, 2000; Kiani et al., 2008; de Lafuente et al., 2015; Kiani and Shadlen, 2009; Bowman et al., 2012; Brunton et al., 2013; Robertson et al., 2012); and, in free response (FR) tasks, RTs for the most difficult stimulus conditions are longer than those for easier stimulus conditions (e.g. Roitman and Shadlen, 2002; see Ratcliff and McKoon, 2008 for review). Indeed, the fits of integration models to behavioral data are often remarkably precise. These observations are commonly adduced to conclude that a subject integrated evidence.
Yet, it is unclear whether these observations reveal a subject’s actual decision strategy. For example, previous work has shown that models that posit little to no integration can also fit data from FR tasks (Ditterich, 2006; Thura et al., 2012). This raises a critical question: which behavioral observations corroborate an integration strategy? In cases where integration cannot be differentiated from strategies that lack integration, it will be important to identify why, as doing so may aid the design of experiments that encourage integration.
We compared the predictions of evidence integration to those of strategies that lack integration in a number of task designs. We found that many signatures of evidence integration are also predicted by nonintegration strategies, and we identified the critical features that allowed them to mimic integration. We used these insights to design a novel variant of the random-dot-motion discrimination task that disentangles evidence integration from nonintegration strategies. With this task design, we ruled out a nonintegration strategy in each subject and confirmed prolonged integration times in all but one. Our results underscore the difficulty of inferring subjects’ strategies in perceptual decision-making tasks, offer an approach for doing so, and illustrate the importance of evaluating strategies at the level of individual subjects.
Results
We explored the observations predicted by an evidence integration strategy and those predicted by two nonintegration strategies in binary perceptual decision-making tasks. The main model representing an integration strategy was a variant of the drift-diffusion model (Ratcliff and McKoon, 2008) that has been used extensively to explain behavioral data from random-dot-motion discrimination tasks (Figure 1A). For simplicity, we refer to this model as integration. It posits that noisy evidence is sequentially sampled from a stationary distribution of random values representing the noisy momentary evidence from the stimulus and is perfectly integrated (i.e. with no leak). The decision can be terminated in two ways: either the integrated evidence exceeds one of two decision bounds (the timing of which determines the decision’s duration) or the stimulus is extinguished. In both cases, the choice is determined by the sign of the integrated evidence.
The first nonintegration model we considered was extrema detection (Figure 1B). The model was inspired by probability summation over time (Watson, 1979), which was proposed as an explanation of yes-no decisions in a detection task. Extrema detection is also similar to other previously proposed models (Cartwright and Festinger, 1943; Ditterich, 2006; Cisek et al., 2009; Brunton et al., 2013; Glickman and Usher, 2019). In the extrema detection model, evidence is sampled sequentially from a stationary distribution until a sample exceeds one of two decision bounds, which terminates the decision. Crucially, however, the sampled evidence is not integrated. Evidence that does not exceed a decision bound has no effect on the decision; it is ignored and forgotten. If an extremum is detected, the process is terminated and no additional evidence is considered. If the evidence stream extinguishes before an extremum is detected, then the choice is made independent of the evidence (i.e. random guessing).
The second nonintegration model, which we term snapshot, lacks not only integration but also sequential sampling. The decision-maker acquires a single sample of evidence at a random time during the stimulus presentation. This single sample of evidence is compared to a decision criterion in order to resolve the choice. The distribution of sampling times is not constrained by any mechanism and can thus be inferred to best match data. Similar to extrema detection, if the evidence stream extinguishes before a sample is acquired, then the choice is determined by a random guess. To facilitate comparison, we parameterized the three models as similarly as possible, such that they were conceptually nested. In other words, extrema detection only differed from integration in its lack of integration, and snapshot only differed from extrema detection in its lack of sequential sampling. We assumed flat decision bounds in the integration and extrema detection models, unless stated otherwise.
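The nesting of the three models can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used for the simulations reported here: the function name `simulate_trial`, the unit-variance Gaussian samples, the zero decision criterion for snapshot, and the choice to always place the snapshot inside the stimulus epoch (so that it never guesses) are all assumptions made for the example.

```python
import random

def simulate_trial(model, drift, bound, n_samples, rng, snapshot_index=None):
    """Return (choice, samples_used) for one trial; choice is +1 or -1.

    Assumption: each momentary evidence sample is drawn from Normal(drift, 1).
    """
    if model == "snapshot":
        # A single sample, taken at one time point, decides the choice.
        idx = rng.randrange(n_samples) if snapshot_index is None else snapshot_index
        sample = rng.gauss(drift, 1.0)
        return (1 if sample > 0 else -1), idx + 1

    total = 0.0
    for t in range(1, n_samples + 1):
        sample = rng.gauss(drift, 1.0)
        if model == "integration":
            total += sample               # perfect accumulation, no leak
            decision_variable = total
        elif model == "extrema":
            decision_variable = sample    # earlier samples are forgotten
        else:
            raise ValueError(f"unknown model: {model}")
        if abs(decision_variable) >= bound:
            return (1 if decision_variable > 0 else -1), t

    # The stimulus ended before a bound was reached.
    if model == "integration":
        return (1 if total > 0 else -1), n_samples   # sign of integrated evidence
    return rng.choice([1, -1]), n_samples            # extrema detection guesses


rng = random.Random(1)
for name in ("integration", "extrema", "snapshot"):
    print(name, simulate_trial(name, drift=0.2, bound=3.0, n_samples=50, rng=rng))
```

The only line that differs between integration and extrema detection is the one that defines the decision variable, which mirrors the statement that the models are conceptually nested.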
In the first part of this paper, we simulated each model in fixed stimulus duration, variable stimulus duration, and free response task designs to identify observations that differentiate between integration and nonintegration strategies. On each simulated trial, the models specified a positive or negative choice based on noisy sensory evidence. We also tested whether an integration model can fit simulated data generated by nonintegration models. Our primary focus with this complementary approach was not to validate (or invalidate) that a model comparison metric favors the nonintegration model, but to ask whether the integration fit could lead to erroneous conclusions had the nonintegration models not been considered. In the second part of the paper, we used the insights gained from the first part to identify the decision strategies of subjects performing a random-dot-motion discrimination task.
Integration and nonintegration strategies are difficult to differentiate
Fixed stimulus duration tasks
In FSD task designs, the sensory stimulus is presented for a fixed, experimenter-controlled duration on every trial and subjects report their choice after viewing the stimulus. The stimulus strength typically varies across trials such that the correct choice ranges from obvious to ambiguous. Therefore, if a subject performs well and integrates evidence, they should exhibit a sigmoidal psychometric function that saturates at near-perfect accuracy when the stimulus is strong.
To infer integration, the experimenter also exploits the stochastic aspects of the stimulus and attempts to ascertain when, during the stimulus presentation, brief fluctuations in stimulus strength about the mean exerted leverage on the decision. The strength of evidence is represented as a time series, and the experimenter correlates the variability in this value, across trials, with the decision. This is achieved by averaging across the trials ending in either decision and subtracting the averages or by performing logistic regression at discrete time points spanning the stimulus duration. We refer to the outcome of either approach as a psychophysical kernel (Figure 2B). It is a function of time that putatively reflects a subject’s temporal weighting profile, or the average weight a subject places on different time points throughout the stimulus presentation epoch (cf. Okazawa et al., 2018). The shape of the psychophysical kernel is thought to be informative of decision strategy because a given strategy often predicts a specific temporal weighting profile. For example, a subject that perfectly integrates evidence weights every time point in the trial equally, and so they ought to have a flat psychophysical kernel. In an FSD task design, the observations of a flat psychophysical kernel and successful task performance (i.e. a sigmoidal psychometric curve) are commonly cited as evidence for an integration strategy (e.g. Shadlen and Newsome, 1996).
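The first of the two approaches, subtracting choice-conditioned averages, can be illustrated with a short sketch. The function name `psychophysical_kernel` and the data layout (one list of evidence fluctuations per trial, choices coded as +1 or -1) are assumptions for the example; the kernels reported here were instead obtained by logistic regression on a stimulus pulse.

```python
def psychophysical_kernel(evidence, choices):
    """Difference of choice-conditioned mean evidence at each time point.

    evidence: list of trials, each a list of evidence fluctuations over time
    choices:  list of +1 / -1 choices, one per trial
    """
    n_time = len(evidence[0])
    pos = [trial for trial, c in zip(evidence, choices) if c > 0]
    neg = [trial for trial, c in zip(evidence, choices) if c < 0]
    if not pos or not neg:
        raise ValueError("need at least one trial for each choice")

    def mean_at(trials, t):
        return sum(trial[t] for trial in trials) / len(trials)

    # Positive values mean that evidence at time t pushed choices in its own direction.
    return [mean_at(pos, t) - mean_at(neg, t) for t in range(n_time)]


# Toy data: only the first time point covaries with the choice.
toy_evidence = [[1.0, 0.0], [3.0, 0.0], [-1.0, 2.0], [-3.0, 0.0]]
toy_choices = [1, 1, -1, -1]
print(psychophysical_kernel(toy_evidence, toy_choices))
```

A kernel computed this way is an average over trials, so a flat kernel only shows that every time point mattered on some trials, not that every time point mattered on each trial.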
We found that these observations also arise from nonintegration models. We simulated integration, extrema detection, and snapshot (with a uniform sampling distribution) in an FSD task design and asked whether extrema detection and snapshot can generate data that mimic data generated by integration. As shown in Figure 2A, the extrema detection and snapshot models can produce sigmoidal psychometric curves whose slopes match that of the integration model.
Given the simulated choice data, all three models produced psychophysical kernels that are effectively indistinguishable (Figure 2B). To calculate a psychophysical kernel for each model, the simulations included a small, 100 ms stimulus pulse whose sign and timing were random on each trial (see Materials and methods for details). The kernel was calculated by determining the pulse’s effect on choices as a function of time, as defined by a logistic regression (Figure 2B). The nonintegration models posit that only a very short time period during the stimulus epoch contributes to the choice on each trial. Yet, their psychophysical kernels misleadingly suggest that evidence presented throughout the entire stimulus epoch contributed to choices. These results held for a range of generating parameters (Figure 2—figure supplement 1). We thus conclude that the observations of high choice accuracy and a flat psychophysical kernel are not, on their own, evidence for or against any particular decision-making strategy.
Why are extrema detection and snapshot able to mimic integration in an FSD task? First, they can match the choice accuracy of the integration model because of the lack of constraints on how the sensory stimulus is transformed into a signal-to-noise ratio (SNR) of the momentary evidence. In each model, this transformation is determined by a single parameter, $\kappa$. In many cases, the SNR cannot be measured directly and thus the true SNR is generally unknown. Each model’s $\kappa$ parameter is therefore free to take on any value. This bestows extrema detection and snapshot with the ability to produce choice accuracy that matches that of integration; while these models are highly suboptimal compared to the integration model, they can compensate by adopting a higher SNR (Figure 2A, inset).
Nevertheless, this tradeoff does not explain why extrema detection and snapshot can produce flat psychophysical kernels. Snapshot can produce a flat kernel—and theoretically any other kernel shape—because the data analyst is free to assert any distribution of sampling times. The shape of the desired psychophysical kernel can thus be used to infer the shape of the distribution of sampling times. To generate a flat kernel, we used a uniform distribution for the sampling times.
It is less intuitive why extrema detection can predict a flat kernel. Indeed, in extrema detection, the sample of evidence that determines the choice is exponentially distributed in time (see Materials and methods). This implies that the model should produce early temporal weighting that decays toward zero. The degree of early weighting is governed by the $\kappa$ and decision-bound parameters, the combination of which determines the probability of detecting an extremum on each sample for a given stimulus strength. If this probability is high, then it is very likely that an extremum will be detected early in the trial. In contrast, if this probability is low enough, then detecting an extremum late in the trial will only be slightly less likely than detecting it early in the trial. A low detection probability also leads to more trials that end before an extremum is detected, in which case the choice is determined by random guessing. These guess trials effectively add noise centered at zero weighting to the kernel. With this noise, it is exceedingly difficult to distinguish the very slight, early weighting from flat weighting, even with tens of thousands of trials.
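The argument can be made concrete with a discrete-time sketch. It assumes, for illustration only, that each sample is drawn from a unit-variance Gaussian, so that the per-sample detection probability $p$ follows from the Gaussian tails and the first detection falls on sample $k$ with probability $(1-p)^{k-1}p$. The function names and parameter values below are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def detection_prob(mean, bound):
    """P(sample >= bound or sample <= -bound) for sample ~ Normal(mean, 1)."""
    p_upper = 1.0 - normal_cdf(bound - mean)
    p_lower = normal_cdf(-bound - mean)
    return p_upper + p_lower

def detection_profile(mean, bound, n_samples):
    """Probability that the first extremum falls on each sample, plus P(guess)."""
    p = detection_prob(mean, bound)
    first_detect = [(1.0 - p) ** (k - 1) * p for k in range(1, n_samples + 1)]
    p_guess = (1.0 - p) ** n_samples
    return first_detect, p_guess


for bound in (1.0, 3.5):
    profile, p_guess = detection_profile(mean=0.1, bound=bound, n_samples=100)
    print(f"bound={bound}: late/early ratio={profile[-1] / profile[0]:.3g}, "
          f"P(guess)={p_guess:.3g}")
```

With a low bound the ratio of late to early detection probability is close to zero, which corresponds to strong early weighting. With a high bound the ratio approaches one while most trials end in a guess, the regime in which the kernel is hard to tell apart from a flat one.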
Variable stimulus duration tasks
A major benefit of integrating information over time is that a decision-maker can reduce noise through averaging. This leads to a prediction: a subject’s sensitivity (i.e. accuracy) should improve if they are given more time to view the stimulus, and thus average out more noise. More precisely, if a subject is perfectly integrating independent samples of evidence, then the improvement in sensitivity should be governed by the square root of the stimulus duration. This prediction can be tested with a VSD task, in which the experimenter-controlled stimulus duration on each trial varies randomly. Indeed, sensitivity often improves with increasing stimulus duration and often does so at the rate predicted by perfect integration (Kiani et al., 2008; Brunton et al., 2013). These observations are complemented by the fact that integration models fit VSD data well. If these observations can be relied on to conclude that a subject was integrating evidence, then they should be absent for data generated by nonintegration models.
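The square-root prediction follows from a short calculation. As an illustration, assume that the $n$ samples collected during the stimulus are independent and identically distributed with mean $\mu$ and standard deviation $\sigma$ (the symbols $n$, $\mu$, $\sigma$, $S_n$ and $d'_n$ are notation introduced for this example). The integrated evidence $S_n$ then satisfies

$$\mathrm{E}[S_n] = n\mu, \qquad \mathrm{SD}[S_n] = \sigma\sqrt{n}, \qquad d'_n = \frac{\mathrm{E}[S_n]}{\mathrm{SD}[S_n]} = \frac{\mu}{\sigma}\sqrt{n}.$$

Under these assumptions, quadrupling the stimulus duration doubles sensitivity, provided that no decision bound terminates integration early.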
Yet, simulations reveal that these observations are also predicted by nonintegration models. We simulated VSD data with either an extrema detection model or a snapshot model (with an exponential sampling distribution). The sensitivity of both models improved with increasing stimulus duration, although it plateaued at the longer stimulus durations (data points in Figure 3A and B, respectively).
Unlike integration, the extrema detection and snapshot models do not improve their sensitivity by averaging out noise. Instead, the improvement of sensitivity is attributed to the guessing mechanisms posited by the models. If the stimulus extinguishes before an extremum is detected or a sample is acquired, then the models’ decisions are based on a coin flip. Thus, the simulated data result from two types of trials: ones in which decisions are based on sampled evidence, and ones in which decisions result from random guessing. The models predict that sensitivity should improve with time because guesses are less likely with longer stimulus durations (Figure 3C–D).
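The mixture of evidence-based and guess trials can be written down directly for extrema detection. The sketch below assumes unit-variance Gaussian samples and a stimulus with a positive mean, so that crossing the upper bound is the correct outcome; the function name `extrema_accuracy` and the parameter values are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def extrema_accuracy(mean, bound, n_samples):
    """Expected accuracy of extrema detection after n_samples (mean > 0 assumed)."""
    p_correct = 1.0 - normal_cdf(bound - mean)   # sample crosses the correct bound
    p_error = normal_cdf(-bound - mean)          # sample crosses the wrong bound
    p_detect = p_correct + p_error
    p_guess = (1.0 - p_detect) ** n_samples      # no extremum before stimulus offset
    acc_detect = p_correct / p_detect            # accuracy on evidence-based trials
    return (1.0 - p_guess) * acc_detect + p_guess * 0.5


for n in (0, 10, 100, 1000):
    print(n, round(extrema_accuracy(mean=0.5, bound=3.0, n_samples=n), 3))
```

Accuracy on evidence-based trials does not depend on duration in this sketch. All of the improvement comes from the shrinking guess probability, which is also why accuracy plateaus once guesses become rare.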
Given this integration-like behavior of extrema detection and snapshot, we wondered whether an integration model could successfully, and hence misleadingly, fit the simulated datasets. Figure 3A and B show fits of the integration model (magenta curves) to the simulated extrema detection and snapshot data, respectively. Note that the model fits deviate from perfect integration at the longest stimulus durations because of the decision-bound parameter, which allows a decision to be made before the stimulus extinguishes (see Kiani et al., 2008). Qualitatively, integration provided an excellent account of the simulated extrema detection dataset. Indeed, the agreement between the integration model and the data might lead an experimenter to erroneously conclude that the data were generated by an integration strategy. The fit of the integration model to the simulated snapshot dataset was noticeably worse (Figure 3B). We also compared the fit of the integration model to that of the corresponding data-generating model (Figure 3A–B, dashed curves) using a standard model comparison metric. For both datasets, a model comparison unsurprisingly favored the data-generating model over the integration model ($\Delta\mathrm{BIC}=70.34$ when extrema detection generated the data; $\Delta\mathrm{BIC}=182.79$ when snapshot generated the data). We found similar results when integration served as the data-generating model (Figure 3—figure supplement 1A); the extrema detection model fit the simulated data well, whereas the snapshot model did so considerably worse.
In our implementations of extrema detection and snapshot, choices are determined by random guessing if the stimulus extinguishes before an extremum is detected or a sample is acquired. We considered whether a different rule for choices in this condition would lead to different conclusions. One alternative is a ‘last-sample’ rule, in which the sign of the final evidence sample determines the choice. Similar rules are often implicit in models that implement a nonintegration strategy with high levels of leak (e.g. the ‘burst detector’ model from Brunton et al., 2013). For snapshot, a last-sample rule eliminated the model’s ability to improve its sensitivity with time because expected performance is independent of when the snapshot is acquired. However, this was not the case for the extrema detection model. Figure 3—figure supplement 1B shows the fit of the last-sample model to the simulated data in Figure 3A, which was generated by an extrema detection model that used the guess rule. The last-sample rule still predicts sensitivity that increases with stimulus duration, but with a shallower slope. Interestingly, the fit was worse than that of the integration model ($\Delta\mathrm{BIC}=8.73$), even though the last-sample model is conceptually similar to the data-generating model. As is the case in an FSD task design, the guessing mechanism is essential to extrema detection’s mimicry of integration in a VSD task design.
Free response tasks
In a FR task, subjects are free to report their decision as soon as they are ready, thus furnishing two measurements on each trial: the subject’s choice and reaction time (RT; relative to stimulus onset). Note that any model designed to explain data from a FR task must prescribe how a decision is terminated (e.g. a decision bound). Additionally, models generally posit that the measured RT is the sum of the duration of two processes: (i) the decision process—the evaluation of evidence up to termination—and (ii) a set of operations, unrelated to the decision, comprising sensory and motor delays. The durations i and ii are termed the decision time and the nondecision time, respectively. Bounded evidence integration explains why decisions based on weaker evidence are less accurate and take longer to make, and it can often explain the precise, quantitative relationship between a decision’s speed and accuracy (Gold and Shadlen, 2007; Ratcliff and McKoon, 2008).
Could FR data generated from a nonintegration model be mistakenly attributed to integration? Several analyses and model fitting exercises on simulated data demonstrate that this is indeed possible. First, extrema detection also predicts that RT should depend on the stimulus strength (Figure 4A, top; see also Ditterich, 2006): the weaker the stimulus strength, the more samples it takes for an extremum to be detected. In contrast, the snapshot model does not predict this—and hence does not mimic integration—because the time at which a sample is acquired is independent of stimulus strength. For this reason, we do not include snapshot in subsequent analyses.
Second, it is possible for integration to successfully predict choice accuracy from mean RTs, even though the data were generated by an extrema detection model. The integration model with flat bounds has separate, closed-form equations that describe the RT and choice functions given a set of model parameters (see Materials and methods). This allows us to estimate the model parameters that best fit the mean RT data, which can then be plugged into the equation for the choice function to generate predictions for the choice data (Shadlen and Kiani, 2013; Kang et al., 2017; Shushruth et al., 2018). Conveniently, the same procedure can be performed using the equations derived for the extrema detection model (also with flat bounds; see Materials and methods). The top panel in Figure 4A shows the fits of both models to the simulated mean RTs (solid curves), and the bottom panel displays the resulting choice-accuracy predictions (dashed curves). The predictions of both models are remarkably accurate, and the models are indistinguishable on the basis of either the RT fits or the choice-accuracy predictions. Further, the similarity of the models’ behavior does not depend on which is the data-generating model; extrema detection can fit RT means and predict choice accuracy when integration serves as the data-generating model (Figure 4—figure supplement 1). Thus, although an integration model might accurately fit, and even predict, data from an FR task design, strong conclusions may not be warranted. Later, however, we will show that choice-accuracy predictions can be informative with just one additional constraint.
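To make the fit-then-predict logic tangible, the sketch below writes out one common parameterization of such closed-form expressions. It is an assumption made for illustration and may differ from the equations in Materials and methods: the integration model is treated as a unit-variance diffusion with drift $\kappa C$ (where $C$ is stimulus strength) and symmetric bounds at $\pm B$, and extrema detection is treated in discrete time with per-sample probabilities of crossing the correct or incorrect bound. All function names are hypothetical.

```python
import math

def integration_p_correct(kappa, bound, coh):
    """P(correct) for a unit-variance diffusion with drift kappa*coh, bounds +/-bound."""
    return 1.0 / (1.0 + math.exp(-2.0 * kappa * abs(coh) * bound))

def integration_mean_decision_time(kappa, bound, coh):
    """Mean decision time of the same diffusion (in the time unit of the variance)."""
    drift = kappa * abs(coh)
    if drift == 0.0:
        return bound ** 2                     # limit of the expression below
    return (bound / drift) * math.tanh(drift * bound)

def extrema_p_correct(p_correct_sample, p_error_sample):
    """P(correct), given the per-sample probabilities of crossing each bound."""
    return p_correct_sample / (p_correct_sample + p_error_sample)

def extrema_mean_decision_time(p_correct_sample, p_error_sample, dt):
    """Mean decision time when one sample is drawn every dt time units."""
    return dt / (p_correct_sample + p_error_sample)


# Parameters that would be estimated from mean RTs also fix the choice function.
kappa, bound, t_nd = 10.0, 1.0, 0.35          # illustrative values only
for coh in (0.0, 0.064, 0.256):
    mean_rt = integration_mean_decision_time(kappa, bound, coh) + t_nd
    print(coh, round(mean_rt, 3), round(integration_p_correct(kappa, bound, coh), 3))
```

In both models the same small set of parameters determines the mean RT function and the choice function, which is what allows parameters fit to RTs alone to generate a choice prediction.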
Finally, integration and extrema detection are not distinguishable on the basis of the shapes of the RT distributions. Because the models assume that RTs are determined by the sum of the decision time and the nondecision time, the predicted distribution for RT is the convolution of the decision time distribution with the nondecision time distribution (here, assumed to be Gaussian). As described earlier, the extrema detection model posits decision times that are exponentially distributed. Thus, one might be tempted to rule out this model because RTs do not typically conform to an exponential distribution. However, after convolving the exponential decision time distribution with the Gaussian nondecision time distribution, the resulting RT distribution is similar to that of the integration prediction and to what is typically observed in data (Figure 4B and C). Indeed, the ex-Gaussian distribution, which is precisely what the extrema detection model predicts, is often used as a descriptive model of RT distributions (Luce, 1986; Ratcliff, 1993; Whelan, 2008). We fit the integration and extrema detection models to the RT distributions simulated by extrema detection. Both models fit these RT distributions reasonably well (by eye), although a model comparison favored extrema detection ($\Delta\mathrm{BIC}=316.41$).
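The convolution can be illustrated by sampling: adding an exponential decision time to a Gaussian nondecision time on each trial yields ex-Gaussian RTs. The function name `simulate_rts` and the parameter values (in seconds) are assumptions chosen for the example, not estimates from data.

```python
import random
import statistics

def simulate_rts(n_trials, mean_decision_time, nd_mean, nd_sd, seed=0):
    """Sample RTs as exponential decision time plus Gaussian nondecision time."""
    rng = random.Random(seed)
    rts = []
    for _ in range(n_trials):
        # expovariate takes a rate, i.e. the reciprocal of the mean.
        decision_time = rng.expovariate(1.0 / mean_decision_time)
        non_decision_time = rng.gauss(nd_mean, nd_sd)
        rts.append(decision_time + non_decision_time)
    return rts


rts = simulate_rts(20000, mean_decision_time=0.2, nd_mean=0.35, nd_sd=0.05)
print("mean:", round(statistics.mean(rts), 3))
print("median:", round(statistics.median(rts), 3))
```

The simulated distribution has a rising leading edge contributed by the Gaussian component and a long right tail contributed by the exponential component, so its mean exceeds its median, as is typical of empirical RT distributions.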
A more systematic model comparison further illustrates that the models are difficult to disentangle in an FR task, especially when there is a limited number of trials. We simulated 600 datasets, half of which were generated by integration and the other half of which were generated by extrema detection. Each dataset comprised 10, 100, or 1000 trials per signed stimulus strength and the generating parameters were chosen pseudorandomly within a range of plausible parameter values (see Materials and methods). We then fit the full RT distributions of each simulated dataset with the integration and extrema detection models and calculated a $\Delta\mathrm{BIC}$ statistic for each comparison (Figure 4—figure supplement 2). With 10 trials per stimulus strength (120 trials in total), the large majority of $\Delta\mathrm{BIC}$ values did not offer strong support for either model ($\Delta\mathrm{BIC}<10$ in 181 of 200 datasets) and a large proportion of datasets with 100 trials per stimulus strength (1200 trials in total) still did not yield strong support for either model ($\Delta\mathrm{BIC}<10$ in 63 out of 200 datasets). In contrast, all but one dataset with 1000 trials per stimulus strength (12,000 trials in total) yielded strong support for the data-generating model.
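For reference, the comparison statistic can be computed as follows. This sketch uses the standard definition of the Bayesian information criterion; the function names, the sign convention for the difference, and the example log-likelihoods are assumptions for illustration rather than values from the fits.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def delta_bic(log_lik_a, k_a, log_lik_b, k_b, n_obs):
    """BIC of model A minus BIC of model B; positive values favor model B."""
    return bic(log_lik_a, k_a, n_obs) - bic(log_lik_b, k_b, n_obs)


# Hypothetical maximized log-likelihoods for two models with equal parameter counts.
print(delta_bic(log_lik_a=-1010.0, k_a=4, log_lik_b=-1000.0, k_b=4, n_obs=1200))
```

When the two models have the same number of free parameters, the penalty terms cancel and the difference reduces to twice the difference in log-likelihood.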
While extrema detection and integration predict similar RT distributions, extrema detection predicts shorter decision times and longer nondecision times. Because decision times are exponentially distributed in the extrema detection model, they are skewed toward shorter times compared to integration (Figure 4B and C, left). The model also predicts shorter decision times because it requires high SNR; the probability of detecting an extremum on each sample is exceedingly high when the stimulus is strong, such that the decision is made within a few samples. Given shorter decision times, extrema detection must predict longer nondecision times in order to produce RT distributions that are similar to those produced by integration (Figure 4B and C, middle). Importantly, the difference in nondecision time is robust and typically ranges from 50 to 150 ms. Therefore, empirical constraints on the nondecision time should, in theory, disentangle the models.
Our results thus far show that many observations commonly taken as evidence for an integration strategy can also be explained by nonintegration strategies. However, we also identified the factors that allow nonintegration models to mimic integration. In the next section, we leverage these insights to design a task that disentangles the models and test whether subjects used an integration strategy to perform the task.
A motion discrimination task that disentangles the models
The modeling exercises described above illustrate that integration and extrema detection differ in their predictions for SNR (determined by the $\kappa$ parameter) and nondecision time (in FR task designs). This suggests that constraining the estimates of these parameters should cause the models to make predictions that are substantially different and hence testable. How can this be achieved experimentally? It is generally not possible to measure the SNR of the momentary evidence used by a subject, and an estimation of SNR from neural data relies upon assumptions about how sensory information is encoded and decoded (e.g. Ditterich, 2006). Instead, we reasoned that SNR in an FR task ought to be closely matched to that in a VSD task, so long as properties of the sensory stimulus are unchanged. Therefore, if a subject performs trials in both a VSD and an FR task design, a model that accurately estimates SNR should be able to parsimoniously explain data from both trial types with a single, common $\kappa$ parameter. A model that does not accurately estimate SNR would require two separate $\kappa$ parameters to explain data from both trial types. To be even more stringent, a successful model should be able to fit data from one trial type and use the estimated $\kappa$ parameter to accurately predict data from the other trial type. We also reasoned that the nondecision time can be constrained and/or empirically estimated through conditions that minimize the decision time. If a subject makes decisions that are so automatic that the decision times are negligible, then the resulting RTs would reflect only the nondecision time and hence confer an empirical estimate of the nondecision time distribution. Additionally, because decision time generally decreases as the stimulus strength increases, the nondecision time can be constrained by including sufficiently strong stimulus strengths.
We incorporated these constraints into a random-dot-motion (RDM) discrimination task. The task requires subjects to judge the direction of motion in the RDM stimulus and report their decision by making an eye movement to a corresponding choice target (Figure 5A). We included blocks of trials that switched between an FR design and a VSD design and forced the models to fit data from the VSD trials using the $\kappa$ parameter derived from fits to FR trials. We constrained the nondecision time in two ways: first, we interleaved a small proportion of 100% coherence trials, which contain completely unambiguous motion and thus require minimal decision time to render a correct choice. Second, we conducted a supplementary experiment in which subjects received trials that included only 100% coherent motion and were instructed to respond as quickly as possible while maintaining perfect accuracy. We refer to these trials as speeded decision trials. In these trials, decision time is minimized as much as possible, thereby giving rise to an empirical estimate of the nondecision time mean and standard deviation. We forced the models to adopt these nondecision time estimates when fitting data.
Before describing the experimental results, we first verify, through simulations, that this task design disentangles the models. Figure 5B–E shows model fits to simulated data generated by extrema detection and integration performing the task described above. The models can be clearly distinguished. First, only the data-generating model successfully predicted choice data from mean RTs (Figure 5B, D). Interestingly, each model failed in a systematic way when it was not the data-generating model: the integration model produced an overly narrow RT function and overestimated the slope of the choice function; the extrema detection model produced an overly broad RT function and underestimated the slope of the choice function. Second, only the data-generating model successfully predicted sensitivity in the VSD trials when forced to use the $\kappa$ parameter derived from fits to the FR data (Figure 5C and E). Finally, a model comparison heavily favored the data-generating model when comparing fits to the full RT distributions ($\Delta\mathrm{BIC}=1172.0$ when extrema detection generated the data; $\Delta\mathrm{BIC}=1080.1$ when integration generated the data). We will use these analyses as benchmarks when we analyze data from the human subjects.
The decision strategies of human subjects performing a motion discrimination task
Six human subjects performed the motion discrimination task. As expected, stronger motion led to more accurate and faster choices in FR trials (Figure 6A). In VSD trials, longer stimulus durations were associated with greater sensitivity (Figure 6B). The speeded decision trials gave rise to similar nondecision time estimates across the six subjects (black arrows, top row of Figure 6A; Table 1), and they succeeded in minimizing decision time, as RTs in this experiment were substantially faster than those for the 100% coherence trials that were interleaved in the main experiment (Figure 6—figure supplement 1). As predicted by our results in the first part of the paper, if we did not include the constraints on the nondecision time and SNR, integration and extrema detection could not be clearly differentiated on the basis of their fits to, and predictions of, the subjects’ data (Figure 6—figure supplement 2).
With constraints on the nondecision time and $\kappa$ parameter, extrema detection was incompatible with each subject’s dataset. Figure 6A shows the mean RT fits (solid curves) and the corresponding choice-data predictions (dashed curves) for both models. We also fit a logistic function to the choice data alone (black curves), which approximates an upper limit on the quality of the choice-accuracy predictions given binomial variability and the assumption that choices are explained by a logistic function of motion coherence. Extrema detection produced visibly poor fits to the mean RTs in many cases, and in all cases fit the mean RTs worse than integration (Table 1). Additionally, the model systematically underestimated the slope of the choice function when predicting choice data. In other words, the subjects’ choice accuracy was too high to be explained by an extrema detection model, given their pattern of decision times.
We next asked whether the extrema detection model can accurately fit the VSD data using the $\kappa$ parameter estimated from fits to the FR data. While the variants of the models we used to fit the mean RTs are parsimonious in that they use only four free parameters, they may not yield the most accurate estimates of the $\kappa$ parameter. Therefore, we estimated $\kappa$ with an elaborated version of each model in which the decision bounds can symmetrically collapse toward zero as a function of time. The collapsing decision bounds allow the models to fit the full RT distributions while accounting for some features in the data that are not accounted for by the parsimonious model (e.g. longer RTs on error trials; see Ditterich, 2006). These estimates of $\kappa$ are shown in Table 2 (top row), and were generally similar to the values estimated by the parsimonious model (Table 1, top row). With $\kappa$ constrained, the extrema detection model’s fits to the VSD trials were visibly poor for four of the six subjects (dashed cyan curves in Figure 6B). In the remaining two subjects (S3 and S5), extrema detection produced a reasonable fit to the VSD data. Nevertheless, for every subject, extrema detection failed at least one of the benchmarks described above.
In contrast, we found strong evidence in favor of the integration model for some of the subjects. For subjects 1–3, the model’s ability to fit and predict data despite rigid constraints was remarkable. First, the predicted choice function closely resembled the fitted logistic function (Figure 6A), and the log-likelihood of the prediction was statistically indistinguishable from that of the logistic fit in two of these subjects (S1: $p=0.012$; S2: $p=0.052$; S3: $p=0.058$, bootstrap). Second, integration could accurately fit data from VSD trials using the $\kappa$ parameter derived from fits to the FR data (Figure 6B). Finally, integration was heavily favored over extrema detection when fitting the full RT distributions (Table 2). We thus find clear support for an integration strategy in three of the human subjects.
For subjects 4–6, the evidence in favor of the integration model was less compelling. The model overestimated the slopes of the choice functions for subjects 4 and 5 (dashed curves, bottom row of Figure 6A), and these predictions were worse than those of the extrema detection model (Table 1). In subject 6, integration offered a reasonable prediction for the slope of the psychometric function, but inaccurately predicted the subject’s choice bias. The fit of the integration model to these subjects’ VSD data also produced mixed results (Figure 6B). The integration model could fit VSD data from subject 5 using the $\kappa$ parameter derived from the fits to the FR data, but, as mentioned above, so could the extrema detection model. And, in subjects 4 and 6, constraining the $\kappa$ parameter caused integration to underestimate sensitivity as a function of stimulus duration. Despite these shortcomings, integration was heavily favored over extrema detection for these subjects based on fits to the full RT distributions (Table 2).
Thus far, we have primarily drawn conclusions based on how well the data conform to predictions made by integration and extrema detection, but this approach has its drawbacks. First, it implies a potentially false dichotomy between the two models. Second, the approach requires us to arbitrarily determine whether a model’s predictions are good enough, because no reasonable model will ever perfectly predict all of the idiosyncrasies associated with real data. Finally, it is unclear what to conclude when neither model makes accurate predictions. Our results are a case in point: for subjects 1–3, the integration predictions were good but not always perfect; and, in subjects 4–6, the predictions of both models were mediocre. This invites a more nuanced approach.
Integration and extrema detection can be thought of as two ends of a continuum of sequential sampling models that differ in their degree of leaky integration. Integration posits a time constant of infinity (i.e. no leak or perfect integration) and extrema detection posits a time constant of zero (i.e. infinite leak). In a leaky integration model, the time constant is a free parameter that lies between zero and infinity (Busemeyer and Townsend, 1993; Usher and McClelland, 2001). The time constant determines the rate at which the decision variable decays to zero with a lack of sensory input; that is, it determines how much information is lost as new information is acquired. It thus gives the model some flexibility over the relationship between decision time and choice accuracy. Our results give rise to two hypotheses in the context of the leaky integration model: (1) The model should support negligible information loss for the subjects who were well explained by perfect integration. (2) The model should support non-negligible information loss for the subjects who could not be explained by either perfect integration or extrema detection.
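The continuum described above can be illustrated with a minimal sketch of a discrete leaky accumulator. This is not the fitting code used in the study: the function name, the update rule (decay factor $e^{-\Delta t/\tau}$ applied at each step), and the parameter values are assumptions chosen for illustration. With an infinite time constant the accumulator sums every sample (perfect integration); with a time constant of zero it holds only the current sample, so a bound crossing is equivalent to detecting an extremum.

```python
import numpy as np


def simulate_leaky_trial(mu, tau, bound, dt=0.0005, t_max=5.0, rng=None):
    """Simulate one free-response trial of a discrete leaky accumulator.

    mu: mean of the momentary evidence per second; tau: time constant in
    seconds (float("inf") for no leak, 0 for infinite leak).
    Returns (choice, decision_time); choice is +1, -1, or 0 if no bound
    was reached before t_max.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Fraction of the accumulated evidence retained from one step to the next.
    decay = 0.0 if tau == 0 else float(np.exp(-dt / tau))
    n_steps = int(round(t_max / dt))
    # Momentary evidence: mean mu*dt and variance dt per time step.
    samples = mu * dt + np.sqrt(dt) * rng.standard_normal(n_steps)
    x = 0.0
    for i, sample in enumerate(samples):
        x = decay * x + sample
        if abs(x) >= bound:
            return (1 if x > 0 else -1), (i + 1) * dt
    return 0, t_max
```

For the same bound, shortening `tau` makes the bound harder to reach, which is one way to see why models with more leak need a lower bound or a higher SNR to produce comparable decision times.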
To obtain our best estimate of the integration time constant, we fit the leaky integration model to the FR and VSD data simultaneously, thereby forcing the model to fit both datasets with a common $\kappa$ and leak parameter. The decision bounds were allowed to collapse toward zero over time. As before, the model was also forced to adopt the empirical estimates of the nondecision time mean and standard deviation. All other parameters were allowed to take on any value and were allowed to vary across the two trial types. With this fitting protocol, we could faithfully recover the parameters that generated simulated datasets (Figure 7—figure supplement 1).
Figure 7A shows the model’s estimated time constant for each subject. For subjects 2 and 3, the time constants were effectively infinite. As such, the leaky integration model was functionally equivalent to the perfect integration model. Consistent with this conclusion, a model comparison indicated that the addition of a leak parameter was not justified for these subjects ($\Delta\mathrm{BIC}>10$; Figure 7B). The estimated time constant for subject 1 was shorter than the time constants for subjects 2 and 3. However, the ΔBIC indicates that the leak parameter was not strongly justified. Note the large Bayesian credible intervals for the estimated time constants (Figure 7A, thick orange lines for the interquartile range, thin orange lines for the 95% credible interval). This is because the time constant becomes less identifiable as its value approaches and exceeds the longest decision times. The estimated lower bounds for the time constants are close to the decision times at the most difficult stimulus conditions, again suggesting that these subjects made decisions by integrating motion information with little to no information loss.
We found evidence for leaky integration in two of the remaining three subjects. In subjects 4 and 5, the model produced time constants that were just below 1 s (0.80 s for subject 4, 0.84 s for subject 5; Figure 7A) and the addition of the leak parameter substantially improved the quality of the model fit (Figure 7B). Figure 7C and D (orange curves) shows these fits, which capture the main features of both datasets. Note, however, that the model still slightly underestimated the sensitivity of subject 4 for the shortest stimulus durations. Finally, the leaky integration model failed to account for data from subject 6. The fitted time constant was indistinguishable from infinity (Figure 7A) and thus the failure of the perfect integration model (Figure 6, subject 6) could not be accounted for by leak.
Discussion
We considered three general classes of decision-making strategies: sequential sampling with integration (e.g. drift diffusion), sequential sampling without integration (e.g. extrema detection), and no sequential sampling and no integration (e.g. snapshot). We found that disentangling integration and nonintegration strategies is more difficult than previously appreciated. Simulations of these models in different task designs showed that several observations interpreted as conclusive evidence for integration were also predicted by nonintegration strategies. Additionally, the integration model consistently fit simulated data well, even when these data were generated by nonintegration models. Together, these results demonstrate the ease with which behavioral data could be misattributed to an integration mechanism.
We are not the first to propose that nonintegration mechanisms can explain behavior in perceptual decision-making tasks. In fact, the first model that attempted to reconcile accuracy with reaction time resembles our extrema detection model (Cartwright and Festinger, 1943). Similar nonintegration mechanisms, known as probability summation, have long been used to explain the detection of static stimuli as a function of their intensity, duration, and spatial properties (e.g. Sachs et al., 1971; Watson, 1979; Robson and Graham, 1981). A parallel line of research showed that bounded evidence integration is optimal for decisions based on sequences of independent samples (Barnard, 1946; Wald, 1947; Good, 1979). Such integration models also reconcile accuracy with reaction time (e.g. Stone, 1960; Ratcliff, 1978). Given these insights, one might naturally assume that subjects integrate evidence when making decisions about stochastic stimuli, which comprise independent samples of evidence. This assumption can be problematic, however, if untested. For as we show here, integration and nonintegration models can behave similarly.
In the first part of the paper, we identified several factors that allow nonintegration models to mimic integration. A crucial factor was the freedom of the models to fit the SNR of the momentary evidence to the data. Nonintegration models are highly suboptimal compared to integration and therefore require higher SNR to produce the same level of performance. They are free to adopt this relatively large SNR because the true SNR cannot be measured directly. In other words, there is a tradeoff between the SNR of the momentary evidence and the efficiency with which evidence samples are combined. Integration and nonintegration models account for the same data by assuming different balances between the two. Of course, only one balance between SNR and efficiency holds in reality for a given dataset, and this is why the models can be disentangled if the SNR is adequately constrained. We demonstrate that the SNR can be adequately constrained if its estimate is derived from a separate task design (see also Drugowitsch et al., 2016).
In FSD and VSD task designs, we found that nonintegration models mimicked integration in part because of a guessing mechanism. If the stimulus extinguished before an extremum was detected or a sample was acquired, then the decision was based on a coin flip. This guessing rule allowed extrema detection to produce a range of psychophysical kernel shapes in a FSD task and to improve sensitivity with increasing stimulus duration in a VSD task. An alternative to guessing is to base the choice on the final sample, and we show that this variant of the model does not mimic integration. Therefore, it is possible to rule out this variant based on data that conform to integration, but the same may not hold for extrema detection with a guessing rule.
In agreement with Ditterich (2006), we found that extrema detection mimics integration in a FR task in part because the nondecision time is unconstrained. Given fits to the same dataset, extrema detection predicts longer nondecision time than integration. This observation is a consequence of its exponentially distributed decision times and the aforementioned requirement of higher SNR, both of which require the nondecision time to compensate for the relatively short decision times. The difference in predicted nondecision time between the two models manifests at strong stimulus strengths, that is, in conditions that minimize decision time. If these conditions are excluded from experimental designs, then the models can evade punishment for inaccurate estimates of the nondecision time.
The results from the first part of the paper illustrate the difficulty of ruling out nonintegration strategies. There are several implications. At the very least, experimenters should not assume subjects integrate evidence just because integration is the optimal strategy. The results also imply that a given behavioral observation should be adduced as conclusive evidence for an integration strategy only if it is not reproduced by alternative strategies. Notably, nonintegration models mimicked integration for reasons that were often counterintuitive, which stresses the importance of testing the predictions of alternative models through simulations (see also Palminteri et al., 2017; Wilson and Collins, 2019). Similarly, our findings discourage experimenters from drawing strong conclusions about decision strategy or an underlying neural mechanism based on the quality of a model fit, without first verifying that the model fails to fit data from conceptually opposing mechanisms. The practices that our results caution against are relatively standard in the field. Indeed, our own group has used these practices in previous work to support claims about subjects’ decision strategies. It would be prudent to consider potential alternative strategies when designing experiments in order to ensure that behavioral data is not misattributed to an integration mechanism.
Such misattribution could lead to a variety of errors when neural data is used to make inferences about the neural mechanisms of integration. For example, if data generated by a nonintegration strategy were misattributed to an integration strategy, an experimenter might mistake short bursts of neural activity for a mechanism of integration, or they might conclude that a brain area does not integrate because its activity does not reflect an integration process. In cases where neural activity is perturbed, brain areas that are essential to evidence integration might not be identified as causal. This is not to say that neural activity cannot, in its own right, inform models of the decision process. Ditterich (2006) used this approach to show that neural responses in the lateral intraparietal area are most consistent with an integration process that includes a time-variant gain. More broadly, if neural responses clearly reflect integrated evidence (e.g. Huk and Shadlen, 2005), then it would be reasonable to presume that the subjects’ decisions were based on integrated evidence.
The misidentification of decision strategy could also be problematic when model fits to behavioral data are used to make inferences about underlying mechanisms. This approach is widely applied in mathematical psychology (Ratcliff and McKoon, 2008) and computational psychiatry (Montague et al., 2012). We showed that the misidentification of a subject’s decision strategy leads to parameter fits that are systematically misleading. For example, fits of the integration model to simulated data generated by extrema detection led to an underestimate of both SNR and nondecision time. Therefore, it is possible that differences in best-fitting model parameters between experimental groups (e.g. patients vs. controls) do not actually reflect a difference in these parameters but in the strategies deployed by the two groups. A more explicit consideration of alternative strategies will add to our ability to link specific model components to underlying mechanisms. Indeed, it might reveal differences between experimental groups or conditions that would have otherwise gone undetected.
We wish to emphasize that integration models are often useful even in the absence of evidence for an integration strategy. They can be used as descriptive models of behavioral data. For example, they allow experimenters to estimate a subject’s sensitivity while controlling for RT or characterize a subject’s speed-accuracy tradeoff. The model’s use in these ways is similar to the use of signal detection theory models to derive a criterion-free estimate of sensitivity (Green and Swets, 1966). Furthermore, many studies on decision-making are ambivalent about whether subjects integrated evidence but can still use integration models to draw conclusions about other components of the decision process. For example, in Kang et al. (2017), subjects performed a motion discrimination task and, after viewing the stimulus, adjusted the setting of a clock to indicate the moment they felt they had reached a decision. The authors used an integration model to conclude that aspects of these subjective decision times accurately reflected the time at which a decision bound was reached. They fit the model to the subjective decision times and the resulting parameters could accurately predict subjects’ choices. Another germane example is work from Evans et al. (2018), who used an integration framework to study the neural mechanism of the decision bound in mice making escape decisions. They identified a synaptic mechanism that produced an all-or-nothing response in the dorsal periaqueductal grey when it received a critical level of input from the superior colliculus, which caused the mice to flee. While our results suggest that neither study produced definitive behavioral evidence for integration, substituting extrema detection for integration would not change the studies’ main conclusions.
In the second part of the paper, we used the insights derived from our simulations to design a version of the RDM discrimination task that constrained the SNR and nondecision time. The constraints allowed us to rule out a nonintegration strategy for each subject we tested, which is consistent with the idea that SNR in visual cortex would have to be implausibly high in order for a nonintegration strategy to be viable in a RDM task (Ditterich, 2006). We found strong evidence for effectively perfect integration in half of our subjects. In these subjects, the predictions of the perfect integration model were remarkably accurate, even with strong parameter constraints. Data from two of the remaining three subjects appeared to lie somewhere between the predictions of perfect integration and no integration, and fits of a leaky integration model were consistent with this observation. The time constants estimated by the leaky integration model suggested that these subjects integrated with only minimal information loss. Surprisingly, no model we tested offered a satisfactory explanation of the data from subject 6. The failure of the models in this subject nonetheless reinforces the fact that the models were not guaranteed to succeed in the other subjects.
We accounted for some of the failures of the perfect integration model with a leaky integration model; however, we suspect some other models that do not posit leaky integration could do so as well. Examples include models that only integrate strong evidence samples (Cain et al., 2013), competing accumulator models with mutual inhibition but no leak (Usher and McClelland, 2001), and models that posit noise in the integration process itself (Drugowitsch et al., 2016). The shared feature of these models and leaky integration is that they involve information loss. We focused on leaky integration because, for those who seek to understand how information is maintained and manipulated over long timescales, substantial leakage would be most problematic. With this in mind, the fact that the ‘leaky’ subjects yielded time constants on the order of half a second to a second is encouraging. At the very least, they were integrating information over timescales that are substantially longer than the time constants of sensory neurons and the autocorrelation times of the visual stimulus. Furthermore, the time constants are likely underestimated. Our estimates of the nondecision time are biased toward longer times because we assume decision time is negligible in the speeded decision experiment, and an overestimate of the nondecision time would lead to an underestimate of the time constant.
While the small number of subjects we used prevents us from making sweeping claims, the apparent variability in decision strategy across our subjects underscores the importance of analyzing data at the level of individuals. Many of our findings would be obfuscated had we not analyzed each subject separately. Our insights are also relevant to an ongoing debate about whether subjects’ decisions are better explained by an urgency-gating model (Cisek et al., 2009; Thura et al., 2012; Carland et al., 2015a; Carland et al., 2015b), which posits little to no integration, or a drift-diffusion model (Winkel et al., 2014; Hawkins et al., 2015; Evans et al., 2017). A subject’s strategy could lie somewhere between no integration and perfect integration or in a completely different space of models. A subject may also change their strategy depending on several factors, including the task structure, the nature of the stimulus, and the subject’s training history (Brown and Heathcote, 2005; Evans and Hawkins, 2019; Tsetsos et al., 2012; Glaze et al., 2015; Ossmy et al., 2013). Further characterization of the factors that affect decision strategy will be an important direction for future work.
Of course, our approach is not the only one available to rule out nonintegration strategies. For example, Pinto et al. (2018) tasked mice with counting the number of visual ‘pulses’ that appeared as the mice ran through a virtual corridor. The authors showed that a snapshot strategy predicted a linear psychometric function under certain conditions in their task, which did not match the mice’s behavioral data. Additionally, Waskom and Kiani (2018) were able to rule out a nonintegration strategy for humans performing a contrast discrimination task. Discrete evidence samples were drawn from one of two possible distributions, and subjects chose which generating distribution was most likely. Because the distributions overlapped, there was an upper bound on the performance of any strategy that utilized only a single sample, and subjects performed better than this upper bound. This approach could be used in similar task designs (e.g. Drugowitsch et al., 2016). Finally, as mentioned above, Ditterich (2006) showed that the SNR in visual cortex would have to be implausibly high in order for a nonintegration strategy to explain data from a RDM discrimination task. These examples and others (e.g. Glickman and Usher, 2019) illustrate that the best approach for ruling out nonintegration strategies will likely depend on the specifics of the stimulus and the task design.
Nevertheless, any attempt to differentiate integration from nonintegration strategies requires that the latter be considered in the first place. Here, we demonstrated the importance of such consideration, identified why nonintegration strategies can mimic integration, and developed an approach to rule them out. The general approach should be widely applicable to many evidence integration tasks, although it will likely require modifications. It explicitly mitigates the factors that allow nonintegration strategies to mimic integration and allow integration models to fit data generated by alternative mechanisms. By doing so, nonintegration strategies can be ruled out, and the predictions of evidence integration models can be tested in a regime where they can reasonably fail. We hope that our insights help lead to more precise descriptions of the processes that underlie decisions and, by extension, cognitive processes that involve decisions. Such descriptions will enhance our understanding of these processes at the level of neural mechanism.
Materials and methods
Description of the models
We explored four decision-making models. A shared feature of all four models is that they render decisions from samples of noisy momentary evidence. We model the momentary evidence as random values drawn from a Normal distribution with mean $\mu = \kappa (C - C_0)$ and unit variance per second, where $\kappa$ is a constant, $C$ is the stimulus strength (e.g. coherence), and $C_0$ is a bias term. We implement bias as an offset to the momentary evidence because the method approximates the normative solution under conditions in which a range of stimulus strengths are interleaved, such that decision time confers information about stimulus strength (see Hanks et al., 2011). Note that each model receives its own set of parameter values. Each strategy differs in how it uses momentary evidence to form a decision.
Integration
We formalized an integration strategy with a drift-diffusion model. The model posits that samples of momentary evidence are perfectly integrated over time. The expectation of the momentary evidence distribution is termed the drift rate, and its standard deviation is termed the diffusion coefficient. The decision can be terminated in two ways: (i) The integrated evidence reaches an upper or lower bound ($\pm B$), whose sign determines the choice; (ii) The stream of evidence is extinguished, in which case the sign of the integrated evidence at that time determines the choice. Note that only (i) applies in a FR task design because the stimulus never extinguishes before a decision bound is reached.
To estimate the predicted proportion of positive (rightward) choices as a function of stimulus strength and duration in a VSD task, we used Chang and Cooper’s finite difference method (1970) to numerically solve the Fokker-Planck equation associated with the drift-diffusion process. We derive the probability density of the integrated evidence ($x$) as a function of time ($t$), using a $\Delta t$ of 0.5 ms. We assume that $x=0$ at $t=0$ (i.e., the probability density function is given by a delta function, $p(x,t=0)=\delta(x)$). At each time step, we remove (and keep track of) the probability density that is absorbed at either bound. The proportion of positive (rightward) choices for each stimulus duration is therefore equal to the density at $x>0$ at the corresponding time point. We fit the model parameters ($\kappa, B, C_0$) to VSD data by maximizing the likelihood of observing each choice given the model parameters, the signed stimulus strength, and the stimulus duration. Unless otherwise stated, we used Bayesian adaptive direct search (BADS; Acerbi and Ji, 2017) to optimize the model parameters. All model fits were confirmed using multiple sets of starting parameters for the optimization. Unless stated otherwise, when fitting VSD data from the 'constrained' RDM task (see below; e.g. Figure 6), instead of fitting the $\kappa$ parameter, we used the $\kappa$ parameter estimated from fits of an elaborated, collapsing bound model to FR data (see below).
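The idea of propagating the density of integrated evidence and tracking the mass absorbed at the bounds can be sketched as follows. This is deliberately not the Chang and Cooper scheme: as a simpler stand-in, it applies a discrete-time Gaussian transition matrix on a grid, which is an assumption made for brevity. The function name, grid spacing, and default time step are likewise illustrative. Here the proportion of positive choices is taken as the mass absorbed at the upper bound plus the unabsorbed mass above zero.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def _norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def vsd_choice_prob(mu, bound, duration, dt=0.001, dx=0.01):
    """Approximate P(positive choice) for bounded drift diffusion in a VSD trial.

    Each step has mean mu*dt and variance dt; mass that crosses +/-bound is
    absorbed. Unabsorbed mass above zero at stimulus offset counts as positive.
    """
    n_cells = 2 * int(round(bound / dx)) + 1      # odd, so one cell is centered on 0
    edges = np.linspace(-bound, bound, n_cells + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    sd = math.sqrt(dt)
    # cdf[j, i]: probability that one step from centers[j] lands below edges[i].
    cdf = _norm_cdf((edges[None, :] - centers[:, None] - mu * dt) / sd)
    trans = np.diff(cdf, axis=1)                  # transitions between interior cells
    hit_up = 1.0 - cdf[:, -1]                     # one-step mass absorbed at +bound

    mid = n_cells // 2
    p = np.zeros(n_cells)
    p[mid] = 1.0                                  # all mass starts at x = 0
    absorbed_up = 0.0
    for _ in range(int(round(duration / dt))):
        absorbed_up += p @ hit_up
        p = p @ trans
    # Split the cell centered on zero evenly between the two choices.
    above_zero = p[centers > 1e-12].sum() + 0.5 * p[mid]
    return float(absorbed_up + above_zero)
```

With zero drift the result is 0.5 by symmetry, and for long durations it approaches the flat-bound free-response choice probability, up to a small discretization bias.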
For the FR task, we used two variants of the model. The first is more parsimonious, and assumes that the decision bounds are flat (as above). This variant of the model provides analytical equations for the predicted proportion of positive choices and mean RT:
$$P_{+} = \frac{1}{1 + e^{-2\kappa (C - C_0) B}} \tag{1}$$
$$\bar{T} = \frac{B}{\kappa (C - C_0)} \tanh\left(\kappa (C - C_0) B\right) + t_{ND} \tag{2}$$
where $t_{ND}$ is the mean nondecision time, which summarizes sensory and motor delays. Equation 2 allowed us to fit the model’s parameters to the mean RTs, which we then used to predict the psychometric function with Equation 1. Note that Equation 2 explains the mean RT only when the sign of the choice matches the sign of the drift rate (for nonzero stimulus strengths). To account for this, we identified trials to be included in the calculation of mean RTs by first finding the point of subjective equality (PSE), given by a logistic function fit to choices. We then only included trials whose choice matched the sign of the stimulus strength, adjusted by the PSE. The PSE was not taken into account when fitting the $C_0$ parameter. The parameters in Equation 2 ($\kappa, B, C_0, t_{ND}$) were fit by maximizing the log-likelihood of the mean RTs given the model parameters, assuming Gaussian noise with standard deviation equal to the standard error of the mean RTs. Optimization was performed using MATLAB’s fmincon function.
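As a concrete illustration, the standard closed-form results for a drift-diffusion process with flat symmetric bounds and unit variance can be written in a few lines. The choice probability is a logistic function of $2\mu B$ and the mean decision time is $(B/\mu)\tanh(\mu B)$, with $\mu = \kappa (C - C_0)$. The function names and the parameter values in the usage lines are hypothetical, and this sketch is in Python rather than the MATLAB used for the fits.

```python
import math


def ddm_choice_prob(coh, kappa, bound, c0=0.0):
    """Probability of a positive choice for flat bounds at +/-bound."""
    mu = kappa * (coh - c0)
    return 1.0 / (1.0 + math.exp(-2.0 * mu * bound))


def ddm_mean_rt(coh, kappa, bound, c0=0.0, t_nd=0.0):
    """Mean RT (decision time plus mean nondecision time) for flat bounds."""
    mu = kappa * (coh - c0)
    if abs(mu) < 1e-12:
        # Limit of (B / mu) * tanh(mu * B) as the drift rate goes to zero.
        return bound ** 2 + t_nd
    return (bound / mu) * math.tanh(mu * bound) + t_nd


# Hypothetical parameter values, for illustration only.
p_right = ddm_choice_prob(coh=0.128, kappa=10.0, bound=1.0)
mean_rt = ddm_mean_rt(coh=0.128, kappa=10.0, bound=1.0, t_nd=0.3)
```

Because both functions share $\kappa$, $B$, and $C_0$, parameters fit to mean RTs alone determine a full prediction for the choice function, which is what makes the choice data a test of the model rather than a second fit.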
We also used a more elaborate variant of the model that allowed the decision bound to collapse toward zero over time in order to explain full, choice-conditioned RT distributions. In principle, the elaborated model should provide a more precise estimate of $\kappa$ because it takes into account all the data instead of just the mean RTs. The model also explains features of the data that are not explained by the flat-bounds model (e.g. longer RTs on error trials). In the elaborated model, the bounds remained symmetric around zero but were a logistic function of time:
where $a$ is constrained to be nonnegative. The predicted decision time distribution for each stimulus strength and choice was derived by computing the probability density of the integrated evidence that exceeded each decision bound. We convolved these decision time distributions with a Gaussian nondecision time distribution with mean $\mu_{\mathrm{tnd}}$ and standard deviation $\sigma_{\mathrm{tnd}}$ in order to generate the predicted RT distributions. The model parameters ($\kappa, B_0, a, d, C_0, \mu_{\mathrm{tnd}}, \sigma_{\mathrm{tnd}}$) were fit by maximizing the log-likelihood of the observed RT on each trial given the model parameters, the stimulus strength, and the choice.
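For orientation, one logistic form that is consistent with this description and with the parameter names $B_0$, $a$, and $d$ is given below. It is an assumed parameterization offered as an illustration and may differ from the exact expression used for the fits.

$$\pm B(t) = \pm \frac{B_0}{1 + e^{a (t - d)}}$$

Under this form, a nonnegative $a$ makes the bound decrease monotonically toward zero, $d$ is the time at which the bound has fallen to $B_0/2$, and $a = 0$ yields a flat bound.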
Extrema detection
In the extrema detection model, each independent sample of momentary evidence is compared to a positive and negative detection threshold or bound ($\pm B$). As with integration, the decision can be terminated in two ways: (i) A sample of momentary evidence exceeds $\pm B$, in which case the decision is terminated and the choice is determined by the sign of $\pm B$. (ii) The stream of evidence is extinguished. In the latter case, we implemented two different rules for determining the choice. The main rule we implemented posits that the choice is randomly determined with equal probability for each option. We believe this 'guess' rule is most appropriate because the essence of the model is that evidence is ignored if it does not exceed a detection threshold. We also explored a rule in which the choice is determined by the sign of the last sample of evidence acquired before the stimulus extinguished, which we termed a 'last sample' rule (Figure 3—figure supplement 1B). Note that extrema detection and integration had the same set of free parameters and the procedures for fitting the two models were identical.
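The decision rule described above can be sketched as a small Monte Carlo simulation of VSD trials under the guess rule. The function name, the default number of trials, and the parameter values used with it are illustrative assumptions. Each sample has mean $\mu \Delta t$ and variance $\Delta t$, matching the momentary evidence defined earlier.

```python
import numpy as np


def simulate_extrema_vsd(mu, bound, duration, n_trials=2000, dt=0.0005, rng=None):
    """Proportion of positive choices under extrema detection with a guess rule.

    A trial ends at the first sample whose magnitude reaches `bound`, and the
    sign of that sample sets the choice. If no sample reaches the bound before
    the stimulus ends, the choice is a fair coin flip.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_samples = int(round(duration / dt))
    samples = mu * dt + np.sqrt(dt) * rng.standard_normal((n_trials, n_samples))
    exceeded = np.abs(samples) >= bound
    detected = exceeded.any(axis=1)
    first = exceeded.argmax(axis=1)        # index of the first extremum (0 if none)
    first_sign = np.sign(samples[np.arange(n_trials), first])
    guesses = rng.choice([-1.0, 1.0], size=n_trials)
    choices = np.where(detected, first_sign, guesses)
    return float(np.mean(choices > 0))
```

Because undetected trials are resolved by guessing, longer stimulus durations leave fewer trials to chance, which is how this model can improve its accuracy with viewing duration without integrating.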
The behavior of the model is primarily governed by the probability of exceeding $\pm B$ on each sample. These probabilities are described by,
where erfc is the complementary error function. The equations represent the density of the momentary evidence distribution that exists beyond $\pm B$. We assumed $\mathrm{\Delta}t=0.5$ ms and adopted the same variance per timestep as in the integration model. The probability of a positive choice, conditional on an extremum being detected, is therefore,
where $\mathbb{E}$ signifies an extremum was detected. Note that we use uppercase $P$ to represent probabilities associated with trial outcomes and lowercase $p$ to represent probabilities associated with single samples of evidence. The probability of a positive choice, conditional on the stimulus extinguishing before an extremum is detected, depends on the choice rule. For the guess rule this probability is 0.5. For the last sample rule it is the probability that the sample was greater than zero and did not exceed $\pm B$:
where $\mathbb{M}$ is the momentary evidence distribution. Note that none of the equations above depend on the passage of time (i.e., the number of samples). However, the cumulative probability of detecting an extremum does increase with the number of samples and is described by a cumulative geometric distribution:
where $N$ is the number of samples ($N=\lceil t/\Delta t\rceil $). The fact that Equation 9 depends on the number of samples, as opposed to time, is potentially important. It means that the cumulative probability of detecting an extremum depends not only on ${p}_{\pm B}$, but also on $\mathrm{\Delta}t$. We used $\mathrm{\Delta}t=0.5$ ms in order to match that used in the integration model, and the results were unchanged when we used $\mathrm{\Delta}t=1$ ms. However, the behavior of the model could change with changes to the sampling rate. Combining Equations 6 through 9 gives us the probability of a positive choice as a function of stimulus duration in a VSD experiment:
In a FR experiment, the decision can only be terminated if an extremum is detected. Therefore, the predicted proportion of positive choices is given by Equation 6. The predicted mean RT, in seconds, is described by,
Similar to the procedure for fitting the integration model, Equation 11 was used to fit mean RT data and the resulting model parameters were plugged into Equation 6 to generate a predicted choice function.
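The extrema detection quantities described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' implementation. It assumes that each sample of momentary evidence is Gaussian with mean $\kappa (C-{C}_{0})\mathrm{\Delta}t$ and variance $\mathrm{\Delta}t$, and it uses the guess rule; the function names are hypothetical.

```python
import math

def exceed_probs(coh, kappa, B, dt=0.0005, c0=0.0):
    """Per-sample probabilities of exceeding +B and -B (Gaussian tail areas)."""
    mu = kappa * (coh - c0) * dt      # assumed mean of the momentary evidence
    sd = math.sqrt(dt)                # assumed SD (variance dt per time step)
    p_pos = 0.5 * math.erfc((B - mu) / (sd * math.sqrt(2)))
    p_neg = 0.5 * math.erfc((B + mu) / (sd * math.sqrt(2)))
    return p_pos, p_neg

def p_positive_vsd(coh, kappa, B, duration, dt=0.0005):
    """P(positive choice) for a given stimulus duration under the guess rule."""
    p_pos, p_neg = exceed_probs(coh, kappa, B, dt)
    n = math.ceil(duration / dt)
    # Cumulative geometric distribution: an extremum on at least one of n samples
    p_detect = 1.0 - (1.0 - (p_pos + p_neg)) ** n
    p_pos_given_detect = p_pos / (p_pos + p_neg)
    return p_detect * p_pos_given_detect + (1.0 - p_detect) * 0.5

def mean_rt_fr(coh, kappa, B, mu_tnd, dt=0.0005):
    """Mean RT in seconds: mean of a geometric waiting time plus non-decision time."""
    p_pos, p_neg = exceed_probs(coh, kappa, B, dt)
    return dt / (p_pos + p_neg) + mu_tnd
```

For example, `p_positive_vsd(0.0, 100.0, 0.075, 0.5)` returns 0.5, because the two tail probabilities are equal at zero stimulus strength. The parameter values here are placeholders chosen from within the ranges used for the extrema detection simulations.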
As with the integration model, we used an elaborated model with collapsing decision bounds to explain the full, choice-conditioned RT distributions. We used Equation 3 to parameterize the collapsing bounds. The collapsing bounds cause the probability of detecting an extremum on each sample to increase with time. The probability of the decision terminating after $N$ samples (i.e. the decision time distribution), regardless of choice, is described by,
where ${p}_{\pm B\left(n\right)}$ is the probability of exceeding a decision bound on sample $n$. Intuitively, the cumulative product represents the probability that an extremum is not detected after $N-1$ samples. Note that Equation 12 simplifies to the geometric distribution when the decision bounds are flat. The decision time density function conditional on choice is the product of the unconditioned decision time density function and the probability that a decision at each time point will result in a positive (or negative) choice. These density functions are then convolved with a truncated Gaussian non-decision time probability density function to produce the predicted RT distributions (the truncation ensures that non-decision times are always positive).
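The termination probability described above (the per-sample detection probability multiplied by the cumulative product of the preceding non-detection probabilities) can be sketched as follows. The function names are hypothetical, and the per-sample probabilities are passed in as arrays so that no particular bound shape is assumed.

```python
import numpy as np

def termination_pmf(p_exceed):
    """P(decision terminates on sample N) given per-sample detection probabilities."""
    p_exceed = np.asarray(p_exceed, dtype=float)
    # Probability that no extremum was detected on samples 1..N-1
    survive = np.concatenate(([1.0], np.cumprod(1.0 - p_exceed[:-1])))
    return p_exceed * survive

def choice_conditioned_pmfs(p_pos, p_neg):
    """Split the termination distribution by choice (positive, negative)."""
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    total = termination_pmf(p_pos + p_neg)
    frac_pos = p_pos / (p_pos + p_neg)   # P(positive | terminated on this sample)
    return total * frac_pos, total * (1.0 - frac_pos)
```

With a constant detection probability `p`, `termination_pmf` reduces to the geometric distribution `p * (1 - p) ** (N - 1)`, which matches the flat-bounds case noted in the text.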
Snapshot
In the snapshot model, only a single sample of momentary evidence is acquired on each trial. We do not consider mechanisms that would determine when the sample is acquired. We instead assume that the time at which the sample is acquired is a predetermined random variable and is independent of the stimulus strength. The distribution that describes this random variable can be chosen arbitrarily. For simplicity, the sampling times were assumed to be uniformly distributed in the FSD task. We used an exponential distribution for the sampling times in the VSD task, although several other distributions produced similar results. If the sample is acquired before the stimulus extinguishes, then the choice is determined by the sample’s sign. Otherwise, the choice is randomly assigned with equal probability for each option. The probability of a positive choice when a sample is acquired is described by,
The overall probability of a positive choice as a function of viewing duration is then,
where ${P}_{\mathbb{S}}$ is the probability of acquiring a sample. It is a function of viewing duration and depends on the distribution of sampling times. While Equation 14 resembles what is described for extrema detection (Equation 10), there is a crucial difference: unlike extrema detection, the probability that a choice is based on evidence is independent of the stimulus strength. In a FR task, the probability of a positive choice is governed by Equation 13. The predicted RT distribution is simply the distribution of sampling times convolved with the nondecision time distribution, and it is independent of the stimulus strength.
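A short sketch of the snapshot choice probability in a VSD task follows. It assumes Gaussian momentary evidence with mean $\kappa C\mathrm{\Delta}t$ and variance $\mathrm{\Delta}t$ and exponentially distributed sampling times; the function name and the sampling-time constant `tau_s` are hypothetical names introduced for the example.

```python
import math

def p_positive_snapshot(coh, kappa, duration, tau_s, dt=0.0005):
    """P(positive choice) when at most one sample is acquired per trial."""
    mu = kappa * coh * dt             # assumed mean of the momentary evidence
    sd = math.sqrt(dt)                # assumed SD (variance dt per time step)
    # Probability that a single sample is greater than zero
    p_pos_given_sample = 0.5 * math.erfc(-mu / (sd * math.sqrt(2)))
    # Probability that the sample is acquired before the stimulus extinguishes
    p_sample = 1.0 - math.exp(-duration / tau_s)
    # Guess with probability 0.5 when no sample was acquired
    return p_sample * p_pos_given_sample + (1.0 - p_sample) * 0.5
```

Note that `p_sample` depends only on the viewing duration, not on `coh`, which is the difference from extrema detection that the text highlights.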
Leaky integration
The leaky integration model is a simple extension of the (perfect) integration model. The model posits that the rate of change of the decision variable depends on both the momentary evidence and its current value, the latter of which causes it to decay exponentially or 'leak' toward zero if input is withdrawn. The decay’s half-life is determined by a single parameter, which is termed the integration time constant, $\tau $. The shorter the time constant, the more the decision variable 'leaks.' Perfect integration and extrema detection can be thought of as special cases of the leaky integration model, in which the time constant is infinite and zero, respectively. The decision variable, $x$, is modeled as an Ornstein-Uhlenbeck (OU) process, such that,
where $\lambda ={\tau}^{-1}$ and $\xi $ is the standard Wiener process.
We developed a method to derive the probability density function of the integrated evidence for the leaky integration model. As for the perfect integration model, we assume that $x=0$ at $t=0$ (i.e. the probability density function of the integrated evidence is given by a delta function, $p(x,t=0)=\delta \left(x\right)$). First, we propagate the probability density function of the integrated evidence, $p(x,t)$, for one small time step ($\Delta t=0.5$ ms). We use Chang and Cooper’s implicit integration method (Chang and Cooper, 1970; Kiani and Shadlen, 2009), assuming perfect integration from $t$ to $t+\Delta t$ (i.e., $\lambda =0$). We then add the influence of the leak through a linear transformation that maps the probability of the integrated evidence being $x$ at time $t+\Delta t$ to a new value of integrated evidence, ${x}^{\prime}$, where ${x}^{\prime}=x{e}^{-\lambda \mathrm{\Delta}t}$. This shrinks the probability density function toward zero by an amount that increases with the leak parameter, $\lambda $. We iterate the two-step process until the motion stimulus is turned off or until the probability mass that has not been absorbed at either bound becomes negligibly small. As described for the integration model, at each time step we remove (and keep track of) the probability mass that is absorbed at either bound.
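For intuition, the leaky integration process can also be simulated trial by trial. The sketch below is a Monte Carlo stand-in, not the density-propagation method described above: it uses a simple Euler-Maruyama update and assumes the OU dynamics take the form $dx=(\kappa C-\lambda x)dt+d\xi $ with flat bounds at $\pm B$. The function name and default arguments are hypothetical.

```python
import numpy as np

def simulate_leaky(coh, kappa, B, tau, max_t=2.0, dt=0.0005, n_trials=2000, seed=0):
    """Simulate bounded leaky integration; returns choices and decision times."""
    rng = np.random.default_rng(seed)
    lam = 1.0 / tau                              # leak rate, lambda = 1 / tau
    n_steps = int(round(max_t / dt))
    x = np.zeros(n_trials)                       # decision variable starts at zero
    choice = np.zeros(n_trials, dtype=int)       # +1 or -1; 0 means no bound was hit
    dec_time = np.full(n_trials, np.nan)
    active = np.ones(n_trials, dtype=bool)
    for step in range(1, n_steps + 1):
        noise = rng.standard_normal(active.sum()) * np.sqrt(dt)
        # Drift toward the evidence, leak toward zero, plus diffusion noise
        x[active] += (kappa * coh - lam * x[active]) * dt + noise
        hit = active & (np.abs(x) >= B)
        choice[hit] = np.sign(x[hit]).astype(int)
        dec_time[hit] = step * dt
        active &= ~hit                           # absorbed trials stop evolving
        if not active.any():
            break
    return choice, dec_time
```

With a short time constant the decision variable rarely reaches the bound at zero stimulus strength, whereas with a long time constant the behavior approaches that of perfect integration.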
We estimated the posterior of the model parameters in order to determine the range of time constants that best explain each subject’s data (Figure 7A). Numerically calculating the posterior is computationally expensive. Instead, we calculated approximate posterior distributions of our model parameters with Variational Bayesian Monte Carlo (VBMC; Acerbi, 2018), which uses variational inference and active-sampling Bayesian quadrature to approximate the posterior distributions. We used highly conservative priors over the parameters when estimating the posteriors, and changes to the priors had negligible effects within a large range.
We also performed a parameter recovery analysis to verify that our fitting procedure and the VBMC method accurately estimated ground-truth parameters used to generate data from the leaky integration model. We simulated the constrained RDM task to produce nine datasets, each of which was generated with a unique combination of $\kappa $, $B$, and $\tau $ (see Table 3). Each simulation contained ~3000 FR trials and ~3000 VSD trials. We found that the approximate posteriors of the parameters, obtained through VBMC, accurately reflect these parameters (Figure 7—figure supplement 1).
Model simulations
We simulated the integration, extrema detection, and snapshot models to compare the predictions they made in different task designs. Each trial had a randomly chosen stimulus strength that remained constant throughout the trial’s duration. We used stimulus strengths that mimicked those commonly used in a RDM discrimination task: ±0.512, ±0.256, ±0.128, ±0.064, ±0.032, and 0. This allowed us to calibrate our model parameters such that the simulated data resembled real data from RDM tasks.
In the simulations of FSD experiments, each trial contained a transient stimulus pulse, which we used to calculate a psychophysical kernel for each dataset. The pulse added or subtracted 0.1 units of stimulus strength for 100 ms, thereby shifting the mean of the momentary evidence distribution for that duration. After the 100 ms pulse, the stimulus strength returned to its original value. The sign of the pulse was random and the timing of its onset was uniformly distributed in steps of 100 ms starting at $t=0$. The psychophysical kernel is described by the relationship between the time of the pulse and its effect on choice across all trials, which we estimated with a logistic regression such that
where $\mathit{X}$ is a design matrix. The design matrix included a column for each pulse-onset time, which took the form of a signed indicator variable (${\mathit{X}}_{pulse}\in \{-1,0,1\}$). We also included a column for the trial’s stimulus strength, although results were similar if this was not included. Each row of the design matrix therefore summarizes the stimulus strength and the pulse-onset time for a given trial. The ordinate in Figure 2B is the value of $\beta $ associated with each pulse-onset time.
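The construction of the design matrix and the kernel estimate can be sketched as follows. This is an illustration on toy data, not the authors' analysis code: the choices are generated from a hypothetical logistic observer with made-up weights (`true_kernel`, a stimulus-strength weight of 5), and the regression is fit with a small Newton's-method routine written for the example.

```python
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Maximum-likelihood logistic regression via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n_trials, n_bins = 20000, 5                      # toy sizes, chosen for illustration
strengths = [-0.512, -0.256, -0.128, -0.064, -0.032, 0.0,
             0.032, 0.064, 0.128, 0.256, 0.512]
coh = rng.choice(strengths, n_trials)
onset_bin = rng.integers(0, n_bins, n_trials)    # which 100 ms bin holds the pulse
pulse_sign = rng.choice([-1.0, 1.0], n_trials)

# One signed indicator column per pulse-onset time: -1, 0, or +1
X_pulse = np.zeros((n_trials, n_bins))
X_pulse[np.arange(n_trials), onset_bin] = pulse_sign
X = np.column_stack([np.ones(n_trials), coh, X_pulse])

# Hypothetical observer used only to generate toy choices
true_kernel = np.array([0.6, 0.5, 0.4, 0.3, 0.2])
logit = 5.0 * coh + X_pulse @ true_kernel
choice = (rng.random(n_trials) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

beta = fit_logistic(X, choice)
kernel = beta[2:]                                # one weight per pulse-onset time
```

The entries of `kernel` play the role of the $\beta $ values plotted against pulse-onset time; in this toy example they recover the weights of the generating observer up to sampling noise.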
The simulations of VSD experiments were identical to the FSD simulations, except there were no pulses and the stimulus duration on each trial was randomly chosen from a list of 12 possible, logarithmically spaced durations between 0.07 s and 1.0 s. Each simulation therefore yielded 12 psychometric functions, one for each stimulus duration. To calculate a measure of sensitivity for each stimulus duration, we fit each psychometric function with a logistic function and used the fitted slope parameter to summarize sensitivity.
In the FR simulations, the stimulus remained on until a decision bound was reached. We did not simulate the snapshot model in a FR task. For the model comparisons in Figure 4—figure supplement 2, we generated 600 FR datasets, half of which were generated by integration and the other half of which were generated by extrema detection. We varied the number of simulated trials among 10, 100, and 1000 trials per stimulus condition, such that there were 100 datasets per model per trial count. For each model, we pseudorandomly chose 100 sets of generating parameters within a range of plausible parameter values (Integration: $5<\kappa <25,0.6<B<1.2,0.3<{\mu}_{\mathrm{t}\mathrm{n}\mathrm{d}}<0.4,0.02<{\sigma}_{\mathrm{t}\mathrm{n}\mathrm{d}}<0.08$; Extrema detection: $50<\kappa <215,0.07<B<0.08,0.46<{\mu}_{\mathrm{t}\mathrm{n}\mathrm{d}}<0.56,0.09<{\sigma}_{\mathrm{t}\mathrm{n}\mathrm{d}}<0.11$). The same 100 sets of generating parameters were used across all three trial groups.
For the graphs in Figure 5, we simulated each model in a FR design and a VSD design using the same $\kappa $ parameter for the two designs. The number of simulated trials in each design was similar to that collected for the human subjects (~2000 total FR trials; ~3000 total VSD trials). The FR simulation also included stimulus strengths of ±0.99 in addition to the stimulus strengths listed above.
Random dot motion task
We explored the decision strategies of human subjects with a ‘constrained’ random-dot-motion (RDM) discrimination task. The subjects were required to make a binary choice about the direction of motion of randomly moving dots. The RDM movies were generated using methods described previously (Roitman and Shadlen, 2002). Three interleaved sets of dots were presented on successive video frames (75 Hz refresh rate). Each dot was redrawn three video frames later at a random location within the stimulus aperture or at a location consistent with the direction of motion; the motion coherence is the probability of the latter occurring, and it remained constant throughout the duration of the trial. Note that even though the coherence does not fluctuate within a trial, the effective motion strength (e.g. motion energy) at each time point does fluctuate due to the stochastic nature of the stimulus (see Zylberberg et al., 2016). The stimulus aperture subtended 5° of visual angle, the dot density was 16.7 dots/deg$^{2}$/s, and the size of the coherent dot displacement was consistent with apparent motion of 5 deg/s. Stimuli were presented on a CRT monitor with the Psychophysics toolbox (Brainard, 1997). Subjects’ eye positions were monitored with a video tracking system (Eyelink 1000; SR Research, Ottawa, Canada).
Six subjects (five male and one female) performed the task. One subject (S3) is an author on this paper. Another subject (S1) had previous experience with RDM stimuli but was naive to the purpose of the experiment. The remaining four subjects were naive to the purpose of the experiment and did not have previous experience with RDM stimuli. Each of these four subjects received at least one training session (~1000 trials) before beginning the main experiment to achieve familiarity with the task and to ensure adequate and stable task performance.
The main experiment consisted of two trial types, VSD and FR, which were presented in blocks of 150 and 100 trials, respectively. Each subject performed ~4800 trials in total across 4–6 sessions, yielding ~3000 VSD trials (S1: 2946 trials; S2: 3007; S3: 2656; S4: 3086; S5: 3095; S6: 3069) and ~1800 FR trials (S1: 1833 trials; S2: 1866; S3: 1814; S4: 1766; S5: 1831; S6: 1914). Subjects initiated a trial by fixating on a centrally located fixation point (0.33° diameter), the color of which indicated the trial type (red for VSD trials, blue for FR trials). Two choice targets then appeared on the horizontal meridian at 9° eccentricity, one corresponding to leftward motion and one corresponding to rightward motion. After 0.1 to 1 s (sampled from a truncated exponential distribution with $\tau =0.3$ s), the RDM stimulus appeared, centered over the fixation point. In VSD trials, subjects were required to maintain fixation throughout the stimulus presentation epoch. Once the stimulus extinguished, subjects reported their choice via an eye movement to the corresponding choice target. Fixation breaks before this point resulted in an aborted trial. In order to ensure that subjects could not predict the time of stimulus offset, the stimulus duration on each trial was randomly drawn from a truncated exponential distribution (0.07–1.3 s, $\tau =0.4$ s). To account for the fact that the first three video frames contain effectively 0% coherent motion (see above), we subtracted 40 ms from the stimulus durations when modeling the VSD data (Figure 6B; Figure 7D). Doing so generally led to better model predictions; our conclusions are unchanged if we do not subtract the 40 ms. We assume that this 40 ms duration is accounted for by the non-decision time in FR trials. In FR trials, subjects were free to indicate their choice at any point after stimulus onset, and RT was defined as the time spanning the stimulus onset and the indication of the choice.
Additionally, ~7% of FR trials contained 100% coherent motion. Subjects received auditory feedback about their decision on every trial, regardless of trial type, and errors resulted in a timeout of 1 s. Choices on 0% coherence trials were assigned as correct with probability 0.5.
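Draws from a truncated exponential distribution, as used above for the stimulus durations and the pre-stimulus delay, can be generated by inverting the cumulative distribution function. The sketch below is one way to do this under the stated range and time constant; the function name is hypothetical and the sampling code used in the experiment is not described in the text.

```python
import numpy as np

def truncated_exponential(tau, low, high, size, rng):
    """Sample an exponential distribution restricted to the interval [low, high)."""
    u = rng.random(size)
    # Total probability mass of the untruncated exponential within the interval
    span = 1.0 - np.exp(-(high - low) / tau)
    # Inverse CDF of the truncated distribution
    return low - tau * np.log(1.0 - u * span)

rng = np.random.default_rng(0)
durations = truncated_exponential(tau=0.4, low=0.07, high=1.3, size=100000, rng=rng)
```

An exponential distribution of durations gives an approximately flat hazard rate over most of the range, which is consistent with the stated aim of making the stimulus offset hard to predict.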
At the end of their final session, subjects also performed a block of 300 to 400 FR trials, comprising only 100% coherent motion. Subjects were instructed to respond as fast as possible while maintaining perfect performance. This supplemental experiment was designed to reduce decision times as much as possible. If decision times were negligible, the resulting RTs would approximate each subject’s non-decision time distribution. We used the mean and standard deviation of this distribution as the non-decision time parameters when fitting the models to data from the main experiment (see above). In practice, the decisions presumably take a very short, but non-negligible, amount of time. Thus, this ‘empirical’ non-decision time distribution probably overestimates the mean of the non-decision time, albeit slightly. Note that an overestimate of the non-decision time would induce an underestimate of the integration time constant. As such, its use is conservative with respect to a claim that a subject is integrating over prolonged timescales.
Statistical analysis
We quantified the quality of a model fit using the Bayesian information criterion (BIC), which takes into account the complexity of the model. The BIC is defined as
where $n$ is the number of observations, $k$ is the number of free parameters, and $\hat{L}$ is the log-likelihood of the data given the best-fitting model parameters. To compare the fits of two models, we report the difference of the BICs. Note that because integration and extrema detection have the same number of parameters, their ∆BIC is equivalent to the difference of the deviance of the models. We treated ‘pure’ model predictions (e.g., predicting the choice data from mean RT fits) as model fits with zero free parameters.
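As a brief worked example, the standard form of the BIC, $k\mathrm{ln}\left(n\right)-2\hat{L}$ with $\hat{L}$ the maximized log-likelihood, can be computed as follows. The function name and the numbers are illustrative only.

```python
import math

def bic(log_likelihood, n_obs, n_params):
    """Bayesian information criterion from a maximized log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Two hypothetical models with the same number of free parameters
delta_bic = bic(-100.0, 1000, 4) - bic(-110.0, 1000, 4)

# A 'pure' prediction is scored with zero free parameters
bic_prediction = bic(-120.0, 1000, 0)
```

When the two models have the same number of parameters, the penalty terms cancel and `delta_bic` equals minus twice the difference in log-likelihood, that is, the difference in deviance.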
To evaluate the slope of a psychometric function and a reasonable upper limit on the quality of the model-predicted psychometric functions, we fit the choice data with a logistic function, in which the proportion of rightward choices is given by
where ${\beta}_{0}$ determines the left-right bias and ${\beta}_{1}$ determines the slope of the psychometric function. This function represents an upper limit under the assumption that choices are governed by a logistic function of coherence and binomial noise. To test whether the model prediction is significantly worse than this upper limit, we used a bootstrap analysis. For each subject, we generated 10,000 bootstrapped choice datasets and fit each bootstrapped dataset with the logistic function above. We then compared the resulting distribution of log-likelihood values with the log-likelihood of the model prediction. The quality of the prediction was deemed significantly worse than that of the logistic fit if at least 95% of the bootstrapped log-likelihoods were greater than the log-likelihood produced by the model prediction.
Data availability
The data generated during this study are included in the source data file for Figure 6.
References

Variational Bayesian Monte Carlo. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 8213–8223.

Practical Bayesian Optimization for Model Fitting with Bayesian Adaptive Direct Search. In: Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 1836–1846.

Sequential tests in industrial statistics. Supplement to the Journal of the Royal Statistical Society 8:1–21. https://doi.org/10.2307/2983610

The analysis of visual motion: a comparison of neuronal and psychophysical performance. The Journal of Neuroscience 12:4745–4765. https://doi.org/10.1523/JNEUROSCI.12-12-04745.1992

Practice increases the efficiency of evidence accumulation in perceptual choice. Journal of Experimental Psychology: Human Perception and Performance 31:289–298. https://doi.org/10.1037/0096-1523.31.2.289

Neural integrators for decision making: a favorable tradeoff between robustness and sensitivity. Journal of Neurophysiology 109:2542–2559. https://doi.org/10.1152/jn.00976.2012

Evidence against perfect integration of sensory information during perceptual decision making. Journal of Neurophysiology 115:915–930. https://doi.org/10.1152/jn.00264.2015

The urgency-gating model can explain the effects of early evidence. Psychonomic Bulletin & Review 22:1830–1838. https://doi.org/10.3758/s13423-015-0851-2

A practical difference scheme for Fokker-Planck equations. Journal of Computational Physics 6:1–16. https://doi.org/10.1016/0021-9991(70)90001-X

Decisions in changing conditions: the urgency-gating model. Journal of Neuroscience 29:11560–11571. https://doi.org/10.1523/JNEUROSCI.1844-09.2009

Representation of accumulating evidence for a decision in two parietal areas. The Journal of Neuroscience 35:4306–4318. https://doi.org/10.1523/JNEUROSCI.2451-14.2015

Brain mechanisms for perceptual and reward-related decision-making. Progress in Neurobiology 103:194–213. https://doi.org/10.1016/j.pneurobio.2012.01.010

The neural basis of decision making. Annual Review of Neuroscience 30:535–574. https://doi.org/10.1146/annurev.neuro.29.051605.113038

Elapsed decision time affects the weighting of prior probability in a perceptual decision task. Journal of Neuroscience 31:6339–6352. https://doi.org/10.1523/JNEUROSCI.5613-10.2011

Discriminating evidence accumulation from urgency signals in speeded decision making. Journal of Neurophysiology 114:40–47. https://doi.org/10.1152/jn.00088.2015

Piercing of consciousness as a threshold-crossing operation. Current Biology 27:2285–2295. https://doi.org/10.1016/j.cub.2017.06.047

Computational psychiatry. Trends in Cognitive Sciences 16:72–80. https://doi.org/10.1016/j.tics.2011.11.018

History-dependent variability in population dynamics during evidence accumulation in cortex. Nature Neuroscience 19:1672–1681. https://doi.org/10.1038/nn.4403

Visual evidence accumulation guides decision-making in unrestrained mice. The Journal of Neuroscience 38:10143–10155. https://doi.org/10.1523/JNEUROSCI.3478-17.2018

The importance of falsification in computational cognitive modeling. Trends in Cognitive Sciences 21:425–433. https://doi.org/10.1016/j.tics.2017.03.011

An accumulation-of-evidence task using visual pulses for mice navigating in virtual reality. Frontiers in Behavioral Neuroscience 12:36. https://doi.org/10.3389/fnbeh.2018.00036

A theory of memory retrieval. Psychological Review 85:59–108. https://doi.org/10.1037/0033-295X.85.2.59

Methods for dealing with reaction time outliers. Psychological Bulletin 114:510–532. https://doi.org/10.1037/0033-2909.114.3.510

Spatial-frequency channels in human vision. Journal of the Optical Society of America 61:1176–1186. https://doi.org/10.1364/JOSA.61.001176

Comparison of decision-related signals in sensory and motor preparatory responses of neurons in area LIP. The Journal of Neuroscience 38:6350–6365. https://doi.org/10.1523/JNEUROSCI.0668-18.2018

Decision making by urgency gating: theory and experimental support. Journal of Neurophysiology 108:2912–2930. https://doi.org/10.1152/jn.01071.2011

The time course of perceptual choice: the leaky, competing accumulator model. Psychological Review 108:550–592. https://doi.org/10.1037/0033-295X.108.3.550

Probability summation over time. Vision Research 19:515–522. https://doi.org/10.1016/0042-6989(79)90136-6

Effective analysis of reaction time data. The Psychological Record 58:475–482. https://doi.org/10.1007/BF03395630

Early evidence affects later decisions: why evidence accumulation is required to explain response time data. Psychonomic Bulletin & Review 21:777–784. https://doi.org/10.3758/s13423-013-0551-8

Functional dissection of signal and noise in MT and LIP during decision-making. Nature Neuroscience 20:1285–1292. https://doi.org/10.1038/nn.4611
Decision letter

Valentin Wyart, Reviewing Editor; École normale supérieure, PSL University, INSERM, France

Michael J Frank, Senior Editor; Brown University, United States

Valentin Wyart, Reviewer; École normale supérieure, PSL University, INSERM, France

Marius Usher, Reviewer
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Your work addresses an important issue with the standard methodology for modeling perceptual decisions. Your careful simulations of non-integration models show very clearly that non-integration strategies can produce fits that are qualitatively similar to evidence integration (and therefore mimic evidence integration) in widely used paradigms. The novel methodology you propose to distinguish integration from non-integration strategies represents a timely achievement. Altogether, your work provides an important cautionary tale for the existing literature on modeling perceptual decisions. We congratulate you again for this work, which we are happy to publish in eLife.
Decision letter after peer review:
Thank you for submitting your article "Differentiating between integration and non-integration strategies in perceptual decision making" for consideration by eLife. Your article has been reviewed by two peer reviewers, including Valentin Wyart as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Marius Usher (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
This manuscript describes a behavioral modeling study which aims at differentiating integration from non-integration strategies during perceptual decision-making. For this purpose, the authors rely first on a simulation-based approach by comparing an evidence integration model to two non-integration models (extrema detection and snapshot) in different paradigms used to study perceptual decision-making: fixed stimulus duration (FSD), variable stimulus duration (VSD) and free response (FR). The authors show that non-integration models display qualitative features commonly used as signatures of evidence integration. Based on the results of these simulations, the authors then propose a paradigm combining FR and VSD trials which makes it possible to better distinguish integration from non-integration strategies. The authors report that six human subjects tested in their paradigm are better fitted (both quantitatively and qualitatively) by an evidence integration model.
Both reviewers found that your article addresses an important issue with the standard methodology for modeling perceptual decisions. Indeed, because evidence integration is an optimal (or at least adequate) cognitive strategy for such tasks, it is typically assumed that perceptual decisions rely on evidence integration. Your simulations of non-integration models clearly show that non-integration strategies (e.g., extrema detection) can produce behavior and fits that are qualitatively similar to evidence integration. Furthermore, the paradigm and methodology you propose to distinguish integration from non-integration strategies represent a timely achievement. Both reviewers found your article to be clearly written, and to provide an important cautionary tale for the existing literature on modeling perceptual decision-making. The fact that it emphasizes the importance of simulating competing models of behavior and performing parameter recovery analyses (as proposed by Palminteri et al., 2017 and Wilson and Collins, 2019) is also something very valuable for the field.
Although the reviewers do not have significant reservations that would require essential revisions of your article, they have identified different points (listed below) that would benefit from clarifications in a revised version of your article.
1) Origin of the behavioral similarity of integration and nonintegration strategies:
The critical reason why non-integration strategies can mimic integration when fitted to behavioral data in random-dot motion paradigms could be brought up more explicitly in the manuscript. It appears to be that the motion evidence SNR is not measurable directly, and thus the SNR parameter in non-integration models can be set to wildly implausible values to fit the behavioral data. The authors rightfully mention at the end of the Discussion that paradigms such as the ones used by Waskom and Kiani, 2019, but also Drugowitsch et al., 2016, make it possible to measure the sensory SNR and thus put an upper bound on the performance of any non-integration model. The issue of the non-measurability of the motion evidence SNR in random-dot motion paradigms could be stated earlier and more explicitly in the manuscript. It would make even clearer why non-integration models can be tweaked to fit behavioral data simulated using evidence integration.
2) Distinction between different sources of behavioral variability:
Another related point of discussion could be the definition of SNR in the model. Noise in perceptual decision-making can arise from at least three different sources – as laid out in Drugowitsch et al., 2016: 1. noise in sensory processing (here, motion processing), 2. noise during evidence integration, and 3. noise during response selection. As emphasized in the previous point, the core issue put forward by the authors – that non-integration models can be fitted to data by tweaking a parameter (SNR) that is not measurable independently – illustrates the danger of not characterizing and quantifying the different sources of decision errors in these tasks. It could be useful to state explicitly in the Discussion that an alternative strategy for ruling out non-integration models is to measure the sensory SNR for the evidence used in the perceptual decision-making task. While it is not (at least not easily) possible for random-dot motion stimuli, it is clearly possible for pulsed evidence in terms of gratings (e.g., Drugowitsch et al., 2016), but also contrast (Waskom and Kiani, 2019).
3) Additional discussion of existing literature on integration and nonintegration strategies:
i) The mechanism referred to as “extrema detection” may precede integration as an account of perceptual decisions, under the name of Probability Summation over Time (PST; Watson, 1979). While this is cited, it would be helpful to discuss in more detail why PST was historically used in psychophysical tasks, while evidence integration is preferred in studies with stochastic evidence that extends over longer intervals. Unless the authors believe that the two mechanisms were confounded in these older psychophysical studies and that, in fact, evidence integration was misidentified as evidence for PST.
ii) There is at least one recent study that provides a complementary method for distinguishing evidence integration and extrema detection (Glickman and Usher, 2019). The idea was to plot the "integrated evidence" until response in a FR task. As shown in Figure 4D of this paper, extrema detection predicts that the integrated evidence should increase with time. This is quite different from how the integrated evidence varies with time under integration mechanisms (Figure 4A, 4B), where it is either constant or decreasing. If the authors can record the time-varying motion evidence within each trial, they could compare their method to this one. Even without such comparison, a discussion of this complementary approach would be helpful.
4) Decision bounds:
There is some confusion due to the swap between fixed or collapsing bounds within the manuscript. The original comparison is presented using a fixed boundary model, but then suddenly the collapsing boundary is preferred. If the use of collapsing boundary is necessary for the demonstration, the article should mention this model in the first part. Also, one should provide more details about how it fares when compared to the fixed boundary model in terms of model evidence (e.g., BIC).
Another important point for existing research concerns how much model misspecification might impact parameter recovery with respect to decision bounds? Many applications of the drift-diffusion model are used for asking questions about how threshold is adjusted, e.g. to speed-accuracy manipulations, and how that might be altered in different groups (aging, ADHD, OCD, etc.). It would be very important to know, based on additional simulation-based analyses, whether the conclusions about decision thresholds/bounds are somewhat less impacted by model misspecification (i.e., whether there is true integration or not) than nondecision times.
5) Leaky integration:
The authors state that "leakiness" can be seen as a spectrum that links evidence integration with extrema detection. I agree that it is the case in FR paradigms which are modeled with an accumulation-to-bound model. But I can't see how it would be the case for bound-free evidence integration in FSD or VSD paradigms. Could the authors clarify this statement?
Also, the classification of S1 as nonleaky is not particularly convincing, especially since the difference in BIC seems to support leak for this particular subject (and its time constant is similar to that of S4 and S5).
Also, the authors should mention somewhere in the Discussion that random-dot motion stimuli do not afford to distinguish between a time-dependent leak (as a function of time) and a stimulus-dependent leak (as a function of the presentation of evidence samples). Indeed, recent results obtained by Waskom and Kiani suggest that the "leakiness" observed during perceptual decision-making is stimulus-dependent, not time-dependent.
https://doi.org/10.7554/eLife.55365.sa1
Author response
1) Origin of the behavioral similarity of integration and nonintegration strategies:
The critical reason why nonintegration strategies can mimic integration when fitted to behavioral data in random-dot motion paradigms could be brought up more explicitly in the manuscript. It appears to be that the motion evidence SNR is not measurable directly, and thus the SNR parameter in nonintegration models can be set to widely implausible values to fit the behavioral data. The authors rightfully mention at the end of the Discussion that paradigms such as the ones used by Waskom and Kiani, 2018, but also Drugowitsch et al., 2016, afford to measure the sensory SNR and thus put an upper bound on the performance of any nonintegration model. The issue of the nonmeasurability of the motion evidence SNR in random-dot motion paradigms could be stated earlier and more explicitly in the manuscript. It would make even clearer why nonintegration models can be tweaked to fit behavioral data simulated using evidence integration.
The reviewers are correct that the SNR, which cannot be measured directly, can be set to allow nonintegration strategies to mimic integration. We stress, however, that this is not the only reason why nonintegration models can mimic integration. High SNR is necessary for mimicry, but it is not the only factor required to reproduce most of the behavioral observations discussed in the paper.
The reviewers are also correct that the SNR needed to explain data from the RDM task with a nonintegration model is likely implausible (as suggested by Ditterich, 2006). However, we would like to clarify that we can only make this judgement of implausibility because of the decades of work on how motion in RDM stimuli is represented in the primate visual cortex. There are many cases in which an experimenter cannot reasonably determine a plausible range for the SNR; for example, when using novel stimuli or less-studied animal models.
We agree that the points raised about SNR could be more explicit in the paper. To this end, we have made the following changes:
1) We added an inset to Figure 2A, which displays the SNR parameter for each model simulation. We hope this makes clear that the nonintegration models require much higher SNR to produce the same choice behavior as integration.
2) We added the text, “In many cases, the SNR cannot be measured directly and thus the true SNR is generally unknown.”
3) When we introduce the motion discrimination task, we added the text, “It is generally not possible to measure the SNR of the momentary evidence used by a subject, and an estimation of SNR from neural data relies upon assumptions about how sensory information is encoded and decoded (e.g. Ditterich, 2006).”
4) In the Discussion, we added, “Nonintegration models […] are free to adopt this relatively large SNR because the true SNR cannot be measured directly.”
We would like to clarify that the upper bound in performance that can be calculated in Waskom and Kiani, 2018 and Drugowitsch et al., 2016, is due to the fact that the distributions that generate the evidence samples overlap. It is an upper bound because it is calculated assuming no sensory (i.e., internal) noise, and the upper bound will depend on the degree to which the distributions overlap. We agree that this is an appealing approach to rule-out nonintegration strategies. We clarified the approach in the Discussion:
“Waskom and Kiani, 2018, were able to rule-out a nonintegration strategy for humans performing a contrast discrimination task. […] This approach could be used in similar task-designs (e.g., Drugowitsch et al., 2016).”
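As a rough illustration of how overlapping generating distributions cap performance, the sketch below computes the best possible accuracy for the simplest nonintegration strategy, one that bases its choice on a single evidence sample. It assumes two equal-variance Gaussian generating distributions, equal priors, and no internal noise; these assumptions and all function names are ours for illustration and are not taken from Waskom and Kiani, 2018, or Drugowitsch et al., 2016.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def single_sample_accuracy_bound(mean_a, mean_b, sd):
    """Best accuracy when classifying ONE sample drawn from N(mean_a, sd)
    or N(mean_b, sd) with equal priors (criterion at the midpoint).
    Assumes no internal noise, so this is an upper bound."""
    d_prime = abs(mean_a - mean_b) / sd
    return normal_cdf(d_prime / 2.0)


def ideal_integrator_accuracy(mean_a, mean_b, sd, n_samples):
    """Accuracy of an ideal observer that averages n independent samples
    (same hypothetical Gaussian assumptions as above)."""
    d_prime = abs(mean_a - mean_b) / sd
    return normal_cdf(d_prime * math.sqrt(n_samples) / 2.0)
```

If a subject's observed accuracy exceeds `single_sample_accuracy_bound` for the stimulus distributions used, a single-sample strategy cannot account for it under these assumptions; `ideal_integrator_accuracy` shows how much headroom integration over several samples would provide.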
2) Distinction between different sources of behavioral variability:
Another related point of discussion could be the definition of SNR in the model. Noise in perceptual decision-making can arise from at least three different sources – as laid out in Drugowitsch et al., 2016: 1. noise in sensory processing (here, motion processing), 2. noise during evidence integration, and 3. noise during response selection. As emphasized in the previous point, the core issue put forward by the authors – that nonintegration models can be fitted to data by tweaking a parameter (SNR) that is not measurable independently – illustrates the danger of not characterizing and quantifying the different sources of decision errors in these tasks. It could be useful to state explicitly in the Discussion that an alternative strategy for ruling out nonintegration models is to measure the sensory SNR for the evidence used in the perceptual decision-making task. While it is not (at least not easily) possible for random-dot motion stimuli, it is clearly possible for pulsed evidence in terms of gratings (e.g., Drugowitsch et al., 2016), but also contrast (Waskom and Kiani, 2018).
Again, we do not fully agree that unconstrained SNR is the “core issue” put forward in our paper (see the response above).
In our paper, we focus only on noise in sensory processing (i.e., the SNR of the momentary evidence distribution). We agree in general that noise can affect processing at several different stages, but for our purposes, the alternative forms of noise are not as relevant. First, limiting our comparison to only sensory noise makes sense because noise in the integration process does not apply for nonintegration strategies. Obviously, characterizing and quantifying the noise during the integration process would be circular, as it would require that we assume there is an integration process to begin with. Second, because our subjects did not make errors in the easiest conditions, we can rule-out noise in response selection.
It is nonetheless possible that noise in the integration process could explain why the perfect integration model failed for some of our subjects. We show that this failure can be explained by leaky integration (for some subjects), but we list other forms of information loss that might also explain the failure of the perfect integration model. In this paragraph, we have added noisy integration to the list of alternative models:
“We accounted for some of the failures of the perfect integration model with a leaky integration model; however, we suspect some other models that do not posit leaky integration could do so as well. […] The shared feature of these models and leaky integration is that they involve information loss.”
Concerning the point about SNR in RDM tasks and pulsed tasks, it is important to distinguish between the SNR of the stimulus (i.e., external noise) and the SNR of the momentary evidence distribution (i.e., the internal representation of the stimulus). The former can be measured directly and sometimes used to calculate an upper bound on performance with a nonintegration strategy (as described in the previous response). Indeed, this analysis might be easier in paradigms where evidence is pulsed. The SNR of the momentary evidence distribution cannot be measured directly, even if evidence is pulsed. It must be estimated. We show that nonintegration models can mimic integration when this estimate is poorly constrained. The SNR of the momentary evidence distribution can be adequately constrained if its estimate is derived from a separate task-paradigm. We demonstrate this for RDM stimuli. Drugowitsch et al., 2016, use a similar approach for “pulsed” grating stimuli. We have edited the relevant part of the paragraph to clarify these points. It now reads:
“Nonintegration models […] are free to adopt this relatively large SNR because the true SNR cannot be measured directly. […] We demonstrate that the SNR can be adequately constrained if its estimate is derived from a separate task-paradigm (see also Drugowitsch et al., 2016).”
3) Additional discussion of existing literature on integration and nonintegration strategies:
i) The mechanism referred to as “extrema detection” may precede integration as an account of perceptual decisions, under the name of Probability Summation over Time (PST; Watson, 1979). While this is cited, it would be helpful to discuss in more detail why PST was historically used in psychophysical tasks, while evidence integration is preferred in studies with stochastic evidence that extends over longer intervals. Unless the authors believe that the two mechanisms were confounded in these older psychophysical studies and that, in fact, evidence integration was misidentified as evidence for PST.
Extrema detection is conceptually similar to PST, but there are important differences. Watson’s PST model was meant to explain a yes/no detection task and extrema detection is meant to explain discrimination tasks. Furthermore, PST assumes that it is effectively impossible to detect an extremum when the stimulus strength is equal to zero. Of course, we do not make this assumption in the extrema detection model, and this difference causes the behavior of the models to diverge drastically. Finally, extrema detection is most similar to the model proposed by Cartwright and Festinger, 1943, which precedes PST by 30 years.
We agree with the reviewers that extrema detection was heavily inspired by PST. We have amended the paragraph, in which we introduce the extrema detection model, to make this clearer. The beginning of the paragraph now reads:
“The first nonintegration model we considered was extrema detection (Figure 1B). The model was inspired by probability summation over time (Watson, 1979), which was proposed as an explanation of yes-no decisions in a detection task. Extrema detection is also similar to other previously proposed models (Cartwright and Festinger, 1943; Ditterich, 2006; Cisek et al., 2009; Brunton et al., 2012; Glickman and Usher, 2019).”
We have no reason to believe that any specific older psychophysical study misattributed evidence integration to a nonintegration process (e.g., PST), or vice versa. Our point is that one cannot directly assume that subjects integrate evidence over time just because the task demands it; this assumption has to be tested against nonintegration strategies.
We agree that a discussion of the history of nonintegration and integration models would be helpful. We have added the following paragraph to the Discussion:
“We are not the first to propose that nonintegration mechanisms can explain behavior in perceptual decision-making tasks. […] For as we show here, integration and nonintegration models can behave similarly.”
ii) There is at least one recent study that provides a complementary method for distinguishing evidence integration and extrema detection (Glickman and Usher, 2019; Cognition). The idea was to plot the "integrated evidence" until response in a FR task. As shown in Figure 4D of this paper, extrema detection predicts that the integrated evidence should increase with time. This is quite different from how the integrated evidence varies with time under integration mechanisms (Figure 4A, 4B), where it is either constant or decreasing. If the authors can record the time-varying motion evidence within each trial, they could compare their method to this one. Even without such comparison, a discussion of this complementary approach would be helpful.
We thank the reviewers for pointing us toward this paper. It is an interesting approach that we had not considered. To determine whether this approach would work for RDM stimuli, we simulated the stimulus with statistics that resembled those reported by Zylberberg et al., 2016, for the motion energy in RDM stimuli. We simulated 10,000 trials with both models. We ensured that the models produced similar choice and RT-functions and that these functions resembled those we observed in our data. We then calculated the integrated evidence at the time of the decision (with a sliding window) for both models, using only the external evidence signal and ignoring internal noise, as in Glickman and Usher, 2019. Author response image 1 shows the results of this analysis. We found that the analysis does not distinguish the models in our case. Both models predict that the integrated evidence should increase with decision time. The integration model predicts this because it is in a “high internal noise” regime (Glickman and Usher, 2019), which is the relevant regime for RDM tasks. We now refer to Glickman and Usher, 2019: “These examples and others (e.g., Glickman and Usher, 2019) illustrate that the best approach for ruling-out nonintegration strategies will likely depend on the specifics of the stimulus and the task-design.”
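The logic of this analysis can be sketched in a few lines. The code below is a minimal, discrete-time illustration and not the simulation reported above: it does not use the motion-energy statistics of Zylberberg et al., 2016, or a sliding window, and every parameter value (drift, bound, noise levels) and function name is a hypothetical placeholder. It simulates free-response trials under an integration model and an extrema detection model, sums only the external evidence up to the decision, and compares fast and slow decisions.

```python
import random


def simulate_trial(model, drift, bound, rng, ext_sd=1.0, int_sd=1.0, max_steps=5000):
    """Simulate one free-response trial in discrete time steps.

    model: "integration" (accumulate to bound) or "extrema" (a single
    momentary sample must exceed the bound).
    Returns (choice, n_steps, integrated external evidence signed by
    choice), or None if no decision is reached within max_steps.
    """
    dv = 0.0                   # accumulated evidence (integration model only)
    integrated_external = 0.0  # running sum of the external signal alone
    for step in range(1, max_steps + 1):
        external = drift + rng.gauss(0.0, ext_sd)      # stimulus fluctuation
        momentary = external + rng.gauss(0.0, int_sd)  # plus internal noise
        integrated_external += external
        if model == "integration":
            dv += momentary
            signal = dv
        else:
            signal = momentary
        if abs(signal) >= bound:
            choice = 1 if signal > 0 else -1
            return choice, step, choice * integrated_external
    return None


def mean_integrated_evidence_by_time(model, drift, bound, n_trials=2000, seed=0):
    """Mean integrated external evidence for the faster and slower halves
    of the simulated decisions (a crude stand-in for binning by time)."""
    rng = random.Random(seed)
    trials = [simulate_trial(model, drift, bound, rng) for _ in range(n_trials)]
    trials = sorted((t for t in trials if t is not None), key=lambda t: t[1])
    half = len(trials) // 2
    fast, slow = trials[:half], trials[half:]
    mean_fast = sum(t[2] for t in fast) / len(fast)
    mean_slow = sum(t[2] for t in slow) / len(slow)
    return mean_fast, mean_slow
```

Calling `mean_integrated_evidence_by_time("integration", 0.1, 10.0)` and `mean_integrated_evidence_by_time("extrema", 0.1, 3.5)` gives the two quantities to compare for each model. Whether the models diverge depends on the ratio of internal to external noise (`int_sd` and `ext_sd` here), which is the regime dependence noted in the response.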
4) Decision bounds:
There is some confusion due to the swap between fixed or collapsing bounds within the manuscript. The original comparison is presented using a fixed boundary model, but then suddenly the collapsing boundary is preferred. If the use of collapsing boundary is necessary for the demonstration, the article should mention this model in the first part. Also, one should provide more details about how it fares when compared to the fixed boundary model in terms of model evidence (e.g., BIC).
We agree that the rationale behind switching to the collapsing bounds model could be clearer. Indeed, its introduction in the original manuscript was sudden and did not provide a rationale: “Note that, when estimating $\kappa $ from fits to the [free response (FR)] data and comparing the quality of the fits, we used an elaborated variant of the models in which the decision-bounds can collapse symmetrically toward zero as a function of time (see below).” We have removed this sentence and have elected to introduce the collapsing bounds model in the next section. The first mention of the collapsing bounds model now reads:
“While the variants of the models we used to fit the mean RTs are parsimonious in that they use only four free parameters, they may not yield the most accurate estimates of the $\kappa $ parameter. […] These estimates of $\kappa $ are shown in Table 2 (top row), and were generally similar to the values estimated by the parsimonious model (Table 1, top row).”
We have also added text to the Materials and methods that explains the rationale for the collapsing bounds model:
“We also used a more elaborate variant of the model that allowed the decision-bound to collapse toward zero over time in order to explain full, choice-conditioned RT distributions. In principle, the elaborated model should provide a more precise estimate of $\kappa $ because it takes into account all of the data instead of just the mean RTs. The model also explains features of the data that are not explained by the flat-bounds model (e.g., longer RTs on error trials). In the elaborated model, the bounds remained symmetric around zero but were a logistic function of time…”
The primary reason for the collapsing bounds model was to estimate $\kappa $ for the VSD predictions. The fact that estimates of $\kappa $ were similar across the collapsing bounds model and the flat bounds model suggests that the VSD predictions would also be similar across the two variants of the models. Furthermore, none of the results in the first part of the paper depend on collapsing bounds. We state early on that, “we assumed flat decision-bounds in the integration and extrema detection models, unless stated otherwise.” In the manuscript, the collapsing bounds constitute a minor technical issue – not a point of comparison between integration and nonintegration models. The nature of the decision bound is of general interest to us and many others in the field, but a formal comparison of the collapsing bounds model and a flat bounds model is not germane to the paper and would distract readers from its message.
Another important point for existing research concerns how much model misspecification might impact parameter recovery with respect to decision bounds? Many applications of the drift-diffusion model are used for asking questions about how threshold is adjusted, e.g. to speed-accuracy manipulations, and how that might be altered in different groups (aging, ADHD, OCD, etc.). It would be very important to know, based on additional simulation-based analyses, whether the conclusions about decision thresholds/bounds are somewhat less impacted by model misspecification (i.e., whether there is true integration or not) than nondecision times.
We thank the reviewers for bringing up this point. Indeed, parameter fits of the drift-diffusion model are often used to interpret the effects of speed-accuracy manipulations. Both models change the speed-accuracy tradeoff by changing the decision bounds. The effect of this change on the RT and choice functions is similar for both models. In theory, this means that an increase (or decrease) of the decision-bound across conditions should be identified as such, even with model misspecification. We make the more general point that integration models can be useful even in the absence of evidence for an integration strategy, so long as integration is not critical to the questions being asked, and we provide examples of such cases. We believe that within-subject analyses concerning changes in the speed-accuracy tradeoff generally fit this description as well. We changed the beginning of the aforementioned paragraph so it now reads:
“We wish to emphasize that integration models are often useful even in the absence of evidence for an integration strategy. They can be used as descriptive models of behavioral data. For example, they allow experimenters to estimate a subject’s sensitivity while controlling for RT or characterize a subject’s speed-accuracy tradeoff. The model's use in these ways is similar to the use of signal detection theory models to derive a criterion-free estimate of sensitivity (Green and Swets, 1966).”
Across-group comparisons are more complex. Typical approaches involve fitting the drift-diffusion model to data from both groups (e.g., patients and controls) and comparing some aspect of the model fits. As we explain in the paper, such comparison could be misleading if the two groups used different strategies to perform the task. We suspect that conclusions about decision bounds are not immune to this issue, but we also suspect that all of this will strongly depend on the nature of the comparison.
5) Leaky integration:
The authors state that "leakiness" can be seen as a spectrum that links evidence integration with extrema detection. I agree that it is the case in FR paradigms which are modeled with an accumulation-to-bound model. But I can't see how it would be the case for bound-free evidence integration in FSD or VSD paradigms. Could the authors clarify this statement?
We agree that a spectrum of leakiness does not perfectly link the two models, particularly for FSD and VSD paradigms. We did not mean to imply a perfect link in the text. We have revised the text as follows: “Integration and extrema detection models can be thought of as two ends of a continuum of sequential sampling models that differ in their degree of leaky integration.”
Also, the classification of S1 as nonleaky is not particularly convincing, especially since the difference in BIC seems to support leak for this particular subject (and its time constant is similar to that of S4 and S5).
We classified this subject as “nonleaky” because the ∆BIC was less than 10, which is a commonly used criterion for BIC comparisons. We have added a dashed line to Figure 7B to denote this criterion. In addition, the estimated time-constant rivaled the longest decision times for this subject. We nonetheless acknowledge that the classification is somewhat arbitrary. We have softened the conclusions about S1 in the text:
“Figure 7A shows the model’s estimated time-constant for each subject. For subjects 2 and 3, the time-constants were effectively infinite. […] However, the ∆BIC indicates that the leak parameter was not strongly justified.”
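The ∆BIC criterion mentioned above can be made concrete with a short sketch. It uses the standard definition BIC = k ln(n) − 2 ln(L); the sign convention (no-leak minus leak, so that positive values favor the leak model), the function names, and the numbers in the usage note are illustrative assumptions, not values from the paper.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def leak_is_justified(ll_no_leak, k_no_leak, ll_leak, k_leak, n_obs, criterion=10.0):
    """Return (delta_bic, justified), where delta_bic is BIC(no leak) minus
    BIC(leak). The extra leak parameter is treated as strongly justified
    only when delta_bic exceeds the criterion (10 here)."""
    delta = bic(ll_no_leak, k_no_leak, n_obs) - bic(ll_leak, k_leak, n_obs)
    return delta, delta > criterion
```

For example, with hypothetical log-likelihoods of -1520 (four parameters) and -1514 (five parameters) over 2000 trials, `leak_is_justified(-1520.0, 4, -1514.0, 5, 2000)` yields a ∆BIC of about 4.4: the leak model fits better, but not by enough to clear the criterion.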
Also, the authors should mention somewhere in the Discussion that random-dot motion stimuli do not afford to distinguish between a time-dependent leak (as a function of time) and a stimulus-dependent leak (as a function of the presentation of evidence samples). Indeed, recent results obtained by Waskom and Kiani suggest that the "leakiness" observed during perceptual decision-making is stimulus-dependent, not time-dependent.
We do not fully understand the distinction between “stimulus-dependent” and “time-dependent” leak. To our understanding, Waskom and Kiani, 2018, did not mention this distinction explicitly. We accounted for some of the subjects with a leaky integration model, and we agree with the reviewers that we cannot identify the cause of the leak. However, to reiterate our response in (2) and the text, we could likely substitute the leaky integration model with many other models that incorporate information loss. Given that this caveat is explicit in the paper, we believe a discussion of the caveat raised by the reviewers would be redundant.
https://doi.org/10.7554/eLife.55365.sa2
Article and author information
Author details
Funding
Howard Hughes Medical Institute
 Ariel Zylberberg
 Michael N Shadlen
National Eye Institute (EY011378)
 Gabriel M Stine
 Ariel Zylberberg
 Michael N Shadlen
National Eye Institute (EY013933)
 Gabriel M Stine
National Institute of Neurological Disorders and Stroke (NS113113)
 Gabriel M Stine
 Ariel Zylberberg
 Michael N Shadlen
Israel Institute for Advanced Studies
 Michael N Shadlen
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This research was supported by the Howard Hughes Medical Institute, the National Eye Institute (R01 EY011378, EY013933), the National Institute of Neurological Disorders and Stroke (NS113113), and the Israel Institute for Advanced Studies. We thank the members of the Shadlen lab for helpful discussions.
Ethics
Human subjects: The institutional review board of Columbia University (protocol #IRBAAAL0658) approved the experimental protocol, and subjects gave written informed consent.
Senior Editor
 Michael J Frank, Brown University, United States
Reviewing Editor
 Valentin Wyart, École normale supérieure, PSL University, INSERM, France
Reviewers
 Valentin Wyart, École normale supérieure, PSL University, INSERM, France
 Marius Usher
Publication history
 Received: January 21, 2020
 Accepted: April 24, 2020
 Accepted Manuscript published: April 27, 2020 (version 1)
 Version of Record published: May 12, 2020 (version 2)
Copyright
© 2020, Stine et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.