Abstract
When external feedback about decision outcomes is lacking, agents need to adapt their decision policies based on an internal estimate of the correctness of their choices (i.e., decision confidence). We hypothesized that agents use confidence to continuously update the tradeoff between the speed and accuracy of their decisions: When confidence is low in one decision, the agent needs more evidence before committing to a choice in the next decision, leading to slower but more accurate decisions. We tested this hypothesis by fitting a bounded accumulation decision model to behavioral data from three different perceptual choice tasks. Decision bounds indeed depended on the reported confidence on the previous trial, independent of objective accuracy. This increase in decision bound was predicted by a centroparietal EEG component sensitive to confidence. We conclude that internally computed neural signals of confidence predict the ongoing adjustment of decision policies.
https://doi.org/10.7554/eLife.43499.001
Introduction
Every day humans have to make numerous choices. These range from small and trivial (which shirt to wear) to complex and important (which house to buy). Such decisions are often based on ambiguous or noisy information about the state of the world. Human decision-makers are remarkably good at estimating their own accuracy, commonly reporting higher confidence for correct than for incorrect choices. Decision confidence can be conceptualized as the probability of a choice being correct, given the available evidence (Pouget et al., 2016; Sanders et al., 2016; Urai et al., 2017). A number of studies have investigated neural correlates of decision confidence (Fleming et al., 2010; Kepecs et al., 2008; Kiani and Shadlen, 2009), including attempts to dissociate subjective reports of decision confidence from objective decision accuracy (Desender et al., 2018; Odegaard et al., 2018; Zylberberg et al., 2012). An important open question is whether and how this subjective ‘sense of confidence’ is used to regulate subsequent behavior (Meyniel et al., 2015; Yeung and Summerfield, 2012). Theoretical treatments posit that, when information is sampled sequentially, confidence can be used to regulate how much information should be sampled before committing to a choice (Meyniel et al., 2015).
Here, we tested this prediction within the context of bounded accumulation models of decision-making (see Figure 1 for illustration). Such models for two-choice tasks postulate the temporal accumulation of noisy sensory evidence towards one of two decision bounds, crossing of which determines commitment to one alternative (Bogacz et al., 2006; Gold and Shadlen, 2007; Usher and McClelland, 2001). The drift diffusion model (DDM; Ratcliff and McKoon, 2008) is a widely used instance of these models. Here, the mean drift rate quantifies the efficiency of evidence accumulation. Large signal-to-noise ratio (SNR) of the sensory evidence yields a large drift rate and, consequently, high accuracy and rapid decisions (conversely for low SNR; see Figure 1A). When evidence SNR is constant over trials, this sequential sampling process achieves an intended level of decision accuracy with the shortest decision time, or, conversely, an intended decision time at the highest accuracy level (Gold and Shadlen, 2007; Moran, 2015). The separation of the two decision bounds determines response caution, that is, the tradeoff between decision time and accuracy: The larger the bound separation, the more evidence is required before committing to a choice, increasing accuracy at the cost of slower decisions (Figure 1B). Controlling this decision bound separation, therefore, enables the decision-maker to prioritize either speed or accuracy (Bogacz et al., 2010b).
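The role of bound separation described above can be illustrated with a minimal simulation sketch. This is not the fitting procedure used in the study; the function name `simulate_ddm`, the time step, the noise level, and the non-decision time are all illustrative assumptions, and the starting point is assumed to be unbiased (midway between the bounds).

```python
import random
import statistics


def simulate_ddm(drift, bound_separation, n_trials=1000, dt=0.005,
                 noise_sd=1.0, non_decision=0.3, seed=0):
    """Simulate a basic two-bound drift diffusion process.

    Returns (mean RT in seconds, proportion of upper-bound choices).
    With a positive drift, the upper bound is the correct choice.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    step_sd = noise_sd * dt ** 0.5  # noise scales with sqrt(dt)
    rts, correct = [], []
    for _ in range(n_trials):
        x = bound_separation / 2.0  # unbiased starting point
        t = 0.0
        # Accumulate noisy evidence until one of the two bounds is crossed.
        while 0.0 < x < bound_separation:
            x += drift * dt + rng.gauss(0.0, step_sd)
            t += dt
        rts.append(t + non_decision)
        correct.append(x >= bound_separation)
    return statistics.mean(rts), sum(correct) / n_trials
```

Calling `simulate_ddm(1.0, 1.0)` and `simulate_ddm(1.0, 2.5)` should show that the wider bound separation yields slower but more accurate simulated decisions, which is the speed-accuracy tradeoff at stake here.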
Several studies have shown that decision-makers can change their decision bounds as a function of external manipulations. For example, instructions to adhere to a liberal or conservative response strategy (Forstmann et al., 2008; Hanks et al., 2014; Palmer et al., 2005), or environments that reward fast or accurate responses (Bogacz et al., 2010a) all lead to changes in bound separation. Such manipulations typically rely on providing external instructions, or feedback, to the agent. In real-life decisions, explicit feedback about choice outcomes is often delayed or withheld. We hypothesized that, in the absence of external feedback, decision-makers set their decision bounds depending on internal signals encoding their decision confidence: in this way, low confidence about one choice gives rise to more cautious decision-making in the next (Meyniel et al., 2015; Yeung and Summerfield, 2012).
We found that decision confidence predicted decision bounds on the subsequent trial across three different perceptual choice tasks. A centroparietal EEG component that tracked confidence on the current trial was linearly related to the subsequent-trial decision bound.
Results
Human confidence ratings exhibit signatures of statistical decision confidence
Twenty-eight human participants performed a task that has been used widely in computational and neurophysiological analyses of perceptual decision-making: discrimination of the net motion direction in dynamic random dot displays (Bogacz et al., 2006; Gold and Shadlen, 2007; Siegel et al., 2011). We asked participants to decide, as quickly and accurately as possible, whether a subset of dots was moving coherently towards the left or right side of the screen. Decision difficulty was manipulated by varying the proportion of coherently moving dots. There were five different levels of coherence, ranging from 0 up to 0.4, that were randomly intermixed in each block. After their choice, a 1 s blank screen or 1 s of continued motion (same coherence and direction as the initial stimulus) was shown, so as to allow for post-decisional accumulation of extra evidence, either from the sensory buffer (in the blank condition; Resulaj et al., 2009) or from the external stimulus (continued motion condition; Fleming et al., 2018). After this additional second, participants indicated how confident they felt about having made the correct choice (see Figure 2A).
As expected, RTs on correct trials and choice accuracy scaled with motion coherence level (Figure 2B; RTs: F(4, 45.33)=30.61, p<0.001, error rates: χ²(4)=1285.6, p<0.001). Correspondingly, drift rates estimated from DDM fits (see Materials and methods) also increased monotonically with coherence level (Figure 2B; Friedman χ²(5)=140, p<0.001). In these model fits, decision bound separation was not allowed to vary as a function of coherence; its average estimate across participants was 2.09 (SD = 0.33). Similarly, non-decision time was held constant across levels of coherence; its average was 0.38 (SD = 0.09). Model fits closely captured the patterns seen in behavior (i.e., green crosses in Figure 2B), indicating that the DDM fitted the behavioral data well.
Participants’ confidence ratings exhibited a key signature of statistical decision confidence established in previous work (Sanders et al., 2016; Urai et al., 2017): an opposite-sign relation between evidence strength and confidence for correct and incorrect choices (Figure 2C). The scaling of confidence judgments with coherence level (F(4,52.6) = 4.56, p=0.003) depended on choice accuracy (F(4,6824.6) = 154.52, p<0.001), with confidence increasing with coherence levels for correct trials (linear contrast: p<0.001) and decreasing for error trials (linear contrast: p<0.001). This pattern was highly similar in blocks with continued evidence following the choice and blocks in which choices were followed by a blank screen (see Figure 2—figure supplement 1). A Bayesian ANOVA confirmed that the interactive effects of coherence and accuracy on confidence were similar for both conditions, BF = 0.05 (i.e., the null hypothesis was 20 times more likely than the alternative). Correspondingly, confidence ratings were closely linked to choice accuracy (Figure 2D). Confidence ratings were monotonically related to choice accuracy on a trial-by-trial basis, even after factoring out motion coherence (logistic regression of confidence on accuracy with coherence as a covariate: positive slopes for all observers, 23 ps < 0.025, five non-significant).
Notably, and different from previous work (Sanders et al., 2016), confidence ratings predicted accuracy from below maximum uncertainty (i.e., 50%) to about 100%: Rating-predicted accuracy ranged from 23% (certain error) up to 94% (certain correct), both significantly different from chance level (p<0.001). By contrast, RTs for the initial choice (pooled across difficulty levels), while also monotonically related to accuracy (b = −0.03, t(27) = −9.41, p<0.001), predicted accuracy variations only from about 60% to 90% correct (Figure 2E), similar to previous results (Sanders et al., 2016; Urai et al., 2017). This shows that human confidence ratings can lawfully account for certainty about errors (i.e., accuracy levels below 50%) when such certainty is enabled by the experimental protocol, due to post-decisional evidence accumulation (see also Fleming et al., 2018). This generalizes the signatures of decision confidence (as defined above) reported by previous analyses of reaction times or confidence reports (Sanders et al., 2016; Urai et al., 2017) to the domain of error detection. We next sought to pinpoint the consequences of trial-to-trial variations in participants’ confidence ratings on the subsequent decision process.
Decision confidence influences subsequent decision bound
We hypothesized that response caution would increase after low-confidence decisions (in particular after ‘perceived errors’), a change in speed-accuracy tradeoff mediated by an increase in decision bound. A bound increase will increase both RT and accuracy, a trend that was evident in the data (Figure 3A, left). We multiplied median RT and mean accuracy, separately for each level of confidence, to combine both effects into a single, model-free measure of response caution (Figure 3A, right). This aggregate measure of response caution was predicted by the confidence rating from the previous decision, F(2,81) = 3.13, p=0.049. Post-hoc contrasts showed increased caution after perceived errors compared to after both high confidence, z = 2.27, p=0.032, and low confidence, z = 2.05, p=0.041. There was no difference in caution following high and low confidence ratings, p=0.823.
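The model-free caution measure described above (median RT multiplied by mean accuracy, split by the confidence rating on the preceding trial) can be sketched as follows. The function name, the dictionary keys, and the confidence labels are hypothetical, and the sketch ignores block boundaries and the control for slow performance drifts applied in the reported analyses.

```python
from collections import defaultdict
from statistics import mean, median


def response_caution_by_previous_confidence(trials):
    """Compute median RT * mean accuracy, grouped by previous-trial confidence.

    `trials` is a list of dicts in trial order with the (assumed) keys
    'rt' (seconds), 'correct' (bool) and 'confidence' (a label).
    """
    groups = defaultdict(list)
    # Pair each trial with the one that follows it.
    for previous, current in zip(trials, trials[1:]):
        groups[previous["confidence"]].append(current)
    caution = {}
    for level, following in groups.items():
        rts = [t["rt"] for t in following]
        accuracy = [1.0 if t["correct"] else 0.0 for t in following]
        caution[level] = median(rts) * mean(accuracy)
    return caution
```

Higher values of this product indicate slower and more accurate responding on the trial that follows a given confidence rating.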
The confidence-dependent change in subsequent response caution was explained by a DDM, in which decision bounds and drift rate could vary as a function of previous confidence (Figure 3A, green crosses; see Materials and methods). We used a hierarchical DDM regression approach to fit this model (Wiecki et al., 2013; see Materials and methods). In this fitting approach, parameter estimates for individual participants are constrained by the group prior, whereby the contribution of each individual to the group prior depends on the number of trials available from that participant in the corresponding condition (Materials and methods). One important consequence of this approach is that individual parameter estimates are not independent and statistical inferences are only meaningful at the group level. We treated high confidence trials as the reference category in the regression, so parameter values reflect deviations (i.e. delta scores) from the parameter estimate for high confidence. All summary statistics of the observed data fell within the 95% credibility interval of the fitted RTs.
As predicted, the subsequent-trial separation of decision bounds scaled monotonically with the complement of decision confidence (i.e., uncertainty; Figure 3B). Subsequent decision bound increased after low compared to high confidence decisions (M = 0.083, SD = 0.047, p=0.037) and even further after participants perceived an error (M = 0.262, SD = 0.078, p<0.001). Decision bound after perceived errors was also larger compared to after low confidence decisions (p=0.021). The posterior distribution for subsequent decision bounds for perceived errors (in blue) overlapped only slightly with the distribution for low confidence trials (in red), and barely overlapped with that for high confidence trials (zero; the reference category, Figure 3B). The p-values presented in the figure directly reflect this overlap. This pattern of results was similar in blocks with and blocks without post-decisional evidence presentation (see Figure 3—figure supplement 1). In sum, these results demonstrate that decision bounds are increased following perceived errors. Decision confidence had no effect on subsequent drift rate (Figure 3B).
The results in Figure 3 were obtained by fitting the regression model as a function of current-trial confidence, ignoring current-trial choice accuracy. So, for example, trials labeled as ‘perceived errors’ contained a mixture of correct and error trials. Part of the results could thus reflect previously established effects of post-error slowing in decision-making (Purcell and Kiani, 2016). When estimating the effects of current confidence separately for current correct and error trials, we found that the confidence rating-dependence of subsequent decision bound holds even for variations of confidence ratings within both correct (Figure 4A) and error trials (Figure 4B). This result shows that the modulation of decision bound is specifically due to trial-by-trial variations in internal confidence signals, rather than the objective accuracy of the choice, thus going beyond the previous findings.
The effects of confidence ratings on subsequent decision bounds are unlikely to be caused by our systematic manipulation of evidence strength (i.e., motion coherence) or evidence volatility (Materials and methods). Confidence ratings were reliably affected by both evidence strength (as reported before; Figure 2C) and evidence volatility (data not shown, F(1, 26.7)=47.10, p<0.001, reflecting higher confidence with high evidence volatility). However, evidence strength and volatility, in turn, did not affect subsequent decision bound, both ps > 0.133. For each of these, zero (i.e., no effect) was included in the 95% highest density interval of the posterior distribution (−0.167 to 0.026, and −0.020 to 0.039, respectively), suggesting it is likely that the true parameter value was close to zero.
The analyses on the behavioral measure of response caution and on the model fits (shown in Figure 3) control for slow drifts in performance over the course of the experiment (Dutilh et al., 2012b; Gilden, 2003; Palva et al., 2013), which likely reflect slow and non-specific fluctuations in behavioral state factors (e.g. motivation, arousal, attention). Indeed, slow (‘scale-free’) fluctuations similar to those reported previously (Gilden, 2003; Palva et al., 2013) were present in the current RT and confidence rating (slopes of linear fits to log-log spectra of the corresponding time series were significant for both RTs, b = −0.42, t(23) = −11.21, p<0.001, and confidence, b = −0.60, t(23) = −13.15, p<0.001). To appreciate this result, consider a streak of trials during which arousal level declines monotonically. This will cause a monotonic increase in the ‘noise’ of sensory responses (McGinley et al., 2015; Reimer et al., 2014), which, in turn, will translate into a monotonic decrease of drift, decision accuracy, and decision confidence (Kepecs et al., 2008), accompanied by an increase in RT. Such effects would cause changes in all behavioral variables and DDM parameter estimates from trial n to trial n+1, without reflecting the rapid and strategic, confidence-dependent adjustments of decision policy, which were of interest for the current study. We hypothesized that the latter adjustments were superimposed onto the former, slow performance drifts. To isolate the confidence-dependent trial-by-trial adjustments, we removed the influence of slow performance drifts: we subtracted the effect of decision confidence on trial_{n+2} on the dependent variables on trial_{n+1} from the effect of confidence on trial_{n} in Figure 3 (see Materials and methods and Figure 3—figure supplement 2 and Figure 3—figure supplement 3 for the ‘raw’ effects for trial_{n} and trial_{n+2}).
A possible concern is that the decision bound on trial_{n+1} was correlated with confidence ratings on trial_{n+2}, which would confound our measure of the effect of confidence on trial_{n} on decision bound on trial_{n+1}. Two observations indicate that this does not explain our findings. First, the observed association between confidence ratings on trial_{n} and decision bound on trial_{n+1} was also evident in the ‘raw’ parameter values for the bound modulation, that is, without removing the effects of slow performance drift (Figure 3—figure supplement 3). Second, when using a complementary approach adopted from the post-error slowing literature (Dutilh et al., 2012b; see Materials and methods), we observed largely similar results (see Figure 3—figure supplement 4).
Confidence-dependent modulation of decision bound generalizes to other tasks
Having established a robust effect of confidence ratings on subsequent response caution and decision bound in the dot motion discrimination task, we tested for the generalization of the effect to other perceptual choice tasks. First, we reanalyzed previously published data (Boldt and Yeung, 2015) from an experiment in which sixteen participants performed a speeded decision task, in which they decided as quickly as possible which of two boxes contained more dots (Experiment 2; Figure 5A, left). Different from Experiment 1, in this dataset only a single level of difficulty was used, thus allowing us to test whether the findings of Experiment 1 generalize to internal variations of confidence occurring at a fixed evidence SNR. Similar to Experiment 1, both RTs and confidence judgments predicted choice accuracy (see Figure 5—figure supplement 1). As in Experiment 1, our model-free measure of response caution (RT*accuracy) was modulated by confidence ratings on the previous trial, F(2,45) = 3.21, p=0.050 (perceived errors vs. high confidence: z = 2.53, p=0.011; no significant differences for other comparisons, ps > 0.178; Figure 5B; see Figure 5—figure supplement 2A for the ‘raw’ effects of trial_{n} and trial_{n+2}).
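The subtraction logic described above can be sketched on a simple behavioral variable. In the study, the correction was applied to the caution measure and to the DDM regression estimates; the sketch below instead uses a plain difference of means between two confidence levels, and the function names and the 'low'/'high' labels are illustrative assumptions.

```python
from statistics import mean


def low_minus_high(dv, confidence, shift):
    """Mean of dv on trial i for 'low' minus 'high' confidence on trial i + shift.

    Only trials with both a preceding and a following trial are used.
    Raises statistics.StatisticsError if either group is empty.
    """
    low, high = [], []
    for i in range(1, len(dv) - 1):
        label = confidence[i + shift]
        if label == "low":
            low.append(dv[i])
        elif label == "high":
            high.append(dv[i])
    return mean(low) - mean(high)


def drift_corrected_effect(dv, confidence):
    """Effect of confidence on trial n minus effect of confidence on trial n+2,
    both measured on the dependent variable of trial n+1."""
    effect_prev = low_minus_high(dv, confidence, shift=-1)  # trial n
    effect_next = low_minus_high(dv, confidence, shift=+1)  # trial n+2
    return effect_prev - effect_next
```

The rationale is that a slow drift should relate the dependent variable on trial n+1 to confidence on both neighbouring trials about equally, so the difference retains mainly the directional influence of trial n on trial n+1.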
Second, we analyzed data from an experiment in which twenty-three participants performed a visual color categorization task, designed after de Gardelle and Summerfield (2011), deciding as fast as possible whether the mean color of eight elements was red or blue (Experiment 3; Figure 5A, right). Task difficulty was manipulated by independently varying the distance of the color mean from the category bound and the standard deviation across the elements’ colors. Both variables together determined the SNR of the sensory evidence (i.e., mean distance from category boundary/variance). Similar to Experiment 1, RTs, accuracy and drift rate scaled monotonically with SNR, and both RTs and confidence judgments predicted choice accuracy (see Figure 5—figure supplement 5). Our model-free measure of response caution was again affected by previous confidence ratings, F(2,66) = 14.43, p<0.001 (perceived errors vs. high confidence: z = 4.61, p<0.001; perceived errors vs. low confidence: z = 4.69, p<0.001; high vs. low confidence: p=0.938; Figure 5B; see Figure 5—figure supplement 2B for the ‘raw’ effects of trial_{n} and trial_{n+2}).
Also for Experiments 2 and 3, the modulation of behavior by confidence was captured by the DDM fits to the data (Figure 5B; green crosses). All summary statistics of the observed data fell within the 95% credibility interval of the fitted RTs. In both datasets, we again found that subsequent decision bounds were modulated by decision confidence (see Figure 5C). When participants perceived that they had committed an error, subsequent decision bounds were increased (Exp2: M = 0.110, SD = 0.046; Exp3: M = 0.117, SD = 0.046) compared to having high confidence (Exp2: p=0.007; Exp3: p=0.004) or low confidence (Exp 2: M = 0.059, SD = 0.038, p=0.059; Exp 3: M = −0.046, SD = 0.027, p<0.001). In Experiment 3, subsequent decision bounds were unexpectedly lower following low confidence trials compared to high confidence (p=0.043). Again, the effects of confidence ratings on subsequent decision bounds were present separately for confidence ratings on correct (Figure 6A) and error trials (Figure 6B). As in Experiment 1, the systematic trial-to-trial variations of evidence strength (SNR) did not influence subsequent decision bound in Experiment 3 (p=0.220), and zero was included in the 95% highest density interval of the posterior (−0.010 to 0.031). Finally, we again observed a robust effect of confidence ratings on subsequent decision bound when using the above-described alternative procedure to control for slow performance drift (Figure 5—figure supplement 4 and Figure 5—figure supplement 7). In Experiment 3 (Figure 5—figure supplement 6) but not in Experiment 2 (Figure 5—figure supplement 3), this effect was also present without controlling for slow performance drifts.
Both datasets also showed a small modulation of subsequent drift rate by decision confidence, an effect not present in Experiment 1, but consistent with recent studies of post-error slowing (Purcell and Kiani, 2016): When participants had low confidence in their choice, mean drift rate on the subsequent trial was lower (Exp2: M = −0.216, SD = 0.121; Exp3: M = −0.149, SD = 0.085) relative to high confidence (Exp2: p=0.039; Exp3: p=0.039) and trials perceived as errors (Exp2: p=0.122; Exp3: p=0.034). The latter two were not different (Exp2: p=0.378; Exp3: p=0.169).
A neural marker of confidence predicts subsequent decision bound
The results from the previous sections indicated that confidence modulates the separation of the bounds for subsequent decisions. Which internal signals are used to transform confidence estimates into changes in subsequent decision bound? Two components of the human EEG evoked potential are established neurophysiological markers of confidence and error processing: (i) the error-related negativity (ERN), a frontocentral signal peaking around the time of the response; and (ii) the Pe, a centroparietal signal that follows the ERN in time. The ERN originates from midfrontal cortex (Dehaene et al., 1994; Van Veen and Carter, 2002) and has been implicated in error processing. Different accounts postulate that the ERN reflects a mismatch between the intended and the actual response (Charles et al., 2013; Nieuwenhuis et al., 2001), the detection of conflict (Yeung et al., 2004), or a negative prediction error (Holroyd and Coles, 2002). The Pe was initially linked to error perception (hence its name; Nieuwenhuis et al., 2001) and more recently to post-decisional evidence accumulation (Murphy et al., 2015) as well as fine-grained variations in decision confidence (Boldt and Yeung, 2015).
We here used the EEG data that were collected in Experiment 2 to test if the confidence-dependent modulation of subsequent decision bound was linked to one, or both, of these confidence-related neural signals. We reasoned that these neural data may provide a more veridical measure of the internal confidence signals governing subsequent behavior than the overt ratings provided by participants, which require additional transformations (and thus additional noise sources) and are likely biased by inter-individual differences in scale use and calibration. Furthermore, quantifying the unique contribution of both confidence-related neural signals to bound adjustment allowed for testing for the specificity of their functional roles, an important issue given their distinct latencies and neuroanatomical sources.
Both the Pe and the ERN were modulated by decision confidence (Figure 7A), as already shown in the original report of these data (Boldt and Yeung, 2015). ERN amplitudes at electrode FCz were monotonically modulated by confidence, F(2,14) = 18.89, p<0.001, and all conditions differed from each other, ts > 3.62 and ps <0.003. Likewise, Pe amplitude at electrode Pz was monotonically modulated by confidence, F(2,14) = 19.19, p<0.001, and all conditions differed from each other, ts > 2.19, ps <0.045.
We then fitted the DDM to the data with both EEG markers as (trial-to-trial) covariates (i.e., ignoring decision confidence) to test if either or both post-decisional EEG signals predicted subsequent decision bound (see Figure 8A). The regression coefficient relating the Pe to subsequent decision bound was significantly different from zero (p<0.001) whereas the regression coefficient for the ERN was not (p=0.327). Regression coefficients of the Pe and the ERN were significantly different from each other (p<0.001). The sign of the Pe regression coefficient was positive, indicating that more positive Pe amplitudes predicted an increase of subsequent decision bound. Concerning the drift rate, the coefficient relating the Pe to subsequent drift rate differed from zero, p<0.001, whereas the ERN was unrelated to subsequent drift rate, p=0.340. Regression coefficients of the Pe and ERN were significantly different, p<0.001. The sign of the Pe regression coefficient was negative, indicating that more positive Pe amplitudes were related to smaller drift rates on the subsequent trial. The isolated effects uncontrolled for trial_{n+2} are shown in Figure 8—figure supplement 1.
The error positivity (Pe) linearly scales with subsequent decision bound
The previous section showed that the Pe was related to subsequent decision bound and drift rate, but that analysis could not reveal potential nonlinearities in this relationship. We therefore divided the Pe into five equal-sized bins based on its amplitude, separately for each participant. In the previous section, both components were entered in a single model, thus in that fit the regression coefficients capture unique variance of each signal. The trial-to-trial variations in Pe amplitudes were correlated with those of the ERN amplitudes, mean r = 0.180, t(15) = 6.92, p<0.001. In order to again capture the effect unique to the Pe, bins were created after the ERN was regressed out of the Pe, separately per participant, which did not affect the confidence scaling of the Pe (inset of Figure 7A). Note that the results described below remain largely unchanged when creating bins based on raw Pe amplitude (Figure 8—figure supplement 2). Figure 7B–C show the distribution of confidence judgments over the different bins.
Subsequent decision bounds increased monotonically (and approximately linearly) as a function of binned Pe amplitude quantile, Friedman χ²(4)=61.00, p<0.001 (Figure 8B), with all adjacent bins differing significantly from each other, all ps < 0.010. Likewise, subsequent drift rates linearly decreased as a function of binned Pe amplitude quantile, Friedman χ²(4)=61.75, p<0.001, and all adjacent bins were significantly different from each other, all ps < 0.055. The simple effects, uncontrolled for slow performance drifts, are shown in Figure 8—figure supplement 3. Similar findings were obtained using our alternative approach to control for slow performance drifts (Figure 8—figure supplement 4). Finally, fitting the same model selectively on correct trials (Figure 8C) or error trials (Figure 8D) provided highly similar results. In sum, there was an approximately linear relationship between the amplitude of the error positivity (Pe) and subsequent decision bound separation as well as subsequent drift rate. Thus, the Pe qualifies as a neural marker of decision confidence predicting flexible, trial-to-trial adaptation of the decision bounds.
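The binning step described above, regressing the ERN out of the Pe within a participant and then splitting the residuals into five equal-sized bins, can be sketched as follows. This assumes an ordinary least-squares regression with an intercept and rank-based quantile bins; the function names are hypothetical and the exact procedure in the study may differ.

```python
from statistics import mean


def residualize(y, x):
    """Return the residuals of y after a least-squares regression on x
    (with intercept). Here y would be single-trial Pe amplitudes and x the
    single-trial ERN amplitudes of one participant."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def quantile_bins(values, n_bins=5):
    """Assign each value to one of n_bins (0 = lowest) of near-equal size,
    based on its rank among all values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins
```

Applying `quantile_bins(residualize(pe, ern))` per participant would give a bin label for each trial that could then enter the model as a categorical predictor.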
Discussion
Accumulation-to-bound models of decision-making assume that choices are formed once the integration of noisy evidence reaches a bound. This decision bound is commonly assumed to be fixed within a block of constant external task conditions (Ratcliff and McKoon, 2008). Here, we show that this decision bound, in fact, dynamically changes from trial to trial, dependent on the confidence about the previous decision: In three independent datasets the separation between decision bounds increased after participants sensed they had made an error. Importantly, this was observed independent of the objective accuracy of a trial. A post-decisional brain signal, the so-called Pe component, scaled with decision confidence and linearly predicted the decision bound on the subsequent trial. These findings indicate that, in the absence of external feedback about choice outcome, decision-makers use internal confidence signals to continuously update their decision policies.
Decision confidence modulates subsequent decision bound
Choice behavior exhibits substantial intrinsic variability (for review, see Wyart and Koechlin, 2016). Current models of decision-making account for this behavioral variability in terms of parameters quantifying random ‘noise’ in the decision process (e.g., within the DDM: drift rate variability; Ratcliff and McKoon, 2008). Recent evidence shows that some of this variability is not actually noise, but rather due to dynamic variations in systematic decision biases due to choice history (Urai et al., 2017) or arousal (de Gee et al., 2017). The current work extends these insights by demonstrating that across-trial variations in decision bound are governed by decision confidence. A key factor of the current work was that observers did not receive direct feedback about their accuracy. Consequently, observers rely on an internal estimate of accuracy to generate a speed-accuracy tradeoff policy for the subsequent trial.
The model fits in Figures 3 and 5 suggest that the effect is rather consistent across participants. For example, the increased decision bound following perceived errors in Experiments 1, 2 and 3 is found for all but one, two, and four participants, respectively. However, these model fits are realized by relying on a hierarchical Bayesian version of the DDM (Wiecki et al., 2013). One advantage of this method is that participants with low trial counts in specific conditions, due to the idiosyncratic nature of confidence judgments, can contribute to the analysis: Data are pooled across participants to estimate posterior parameter estimates, whereby the data of a participant with low trial counts in a specific condition will contribute less to the posterior distribution of the respective condition. Individual-subject estimates are constrained by the group posterior (assumed to be normally distributed), and estimates with low trial counts are pulled towards the group average. A limitation of this procedure is that it precludes strong conclusions about the parameter estimates from individual participants. Future studies should collect extensive data from individual participants in order to shed light on individual differences in confidence-induced bound changes.
Trial-to-trial variations in decision confidence likely result from several factors. For example, confidence might be low because of low internal evidence quality (i.e., low drift rate) or because insufficient evidence has been accumulated before committing to a choice (i.e., low decision bound). When the bound is low and results in low confidence, it is straightforward to increase the bound for the subsequent decision in order to improve performance. When drift rate is low, increasing the subsequent bound might increase accuracy only slightly, but at a vast cost in terms of response speed. Future work should aim to unravel to what extent strategic changes in decision bound differ between conditions in which variations in confidence are driven by a lack of accumulated evidence or by a lack of instantaneous evidence quality.
In all three datasets, several trials were characterized by high certainty about errors, which indeed predicted significant below-chance levels of accuracy. This observation suggests an important role for post-decisional processes, as perception of an error by definition can only occur following the commitment to a choice. Within the framework of sequential sampling models of decision-making, changes-of-mind about the perceived correct response have been explained by allowing post-decisional accumulation of evidence, coming from a sensory buffer (Resulaj et al., 2009) or from additional sensory input (Fleming et al., 2018). After the integrated evidence has hit a decision bound, and a choice is made, the evidence continues to accumulate, and so the decision variable can eventually favor the unchosen option. Such post-decisional evidence accumulation can naturally account for dissociations between confidence ratings and choice accuracy (Moran et al., 2015; Navajas et al., 2016; Pleskac and Busemeyer, 2010). Indeed, recent work using a protocol similar to our Experiment 1 likewise showed low confidence judgments predicting close to 0% accuracy, which was attributed to the integration of post-decisional evidence into confidence judgments (Fleming et al., 2018). That previous study also showed near-perfect integration of pre-decisional and post-decisional stimulus information into confidence judgments. By contrast, in our Experiment 1, we found that post-decisional sensory stimuli did not have a larger impact on confidence than a post-decisional delay with just a blank screen. The fact that the post-decisional blank and the post-decisional evidence conditions showed indistinguishable confidence judgments indicates that post-decisional evidence was accumulated from a buffer, whereas extra sensory information was not used for the confidence judgment, different from Fleming et al. (2018).
This difference might be explained by a number of differences between the experimental protocols – most importantly, the fact that Fleming et al., unlike us, rewarded their participants based on the accuracy of their confidence judgments, which might have motivated their participants to actively process the post-decisional stimulus information. This evidence for post-decisional contributions to confidence ratings (and in particular, certainty about errors) does not rule out the contribution of pre- and intra-decisional computations to confidence (e.g., Gherman and Philiastides, 2018; Kiani and Shadlen, 2009).
Previous work has indicated that the error positivity (Pe) tracks post-decisional evidence accumulation (Murphy et al., 2015) and reflects variations in decision confidence (Boldt and Yeung, 2015). We here demonstrated that the Pe predicted increases in subsequent decision bound. Interestingly, this relation was specific for the Pe, and not evident for another signal reflecting confidence and error processing, the ERN. Other work has linked frontal theta oscillations, which have been proposed to drive the ERN (Cavanagh and Frank, 2014; Yeung et al., 2007; but see Cohen and Donner, 2013), to slowed reaction times following an error (Cavanagh et al., 2009). Although this is typically observed in flanker tasks, where there is no ambiguity concerning choice accuracy, a similar process of post-decision evidence accumulation has been proposed to underlie both error awareness (Murphy et al., 2015) and graded levels of confidence (Pleskac and Busemeyer, 2010). A head-to-head comparison of participants who perform both tasks seems necessary to resolve this discrepancy.
The relation between decision confidence and drift rate
The main focus of the current work was to unravel influences of decision confidence on subsequent decision bound; we had no predefined hypothesis about whether confidence also affects subsequent drift rate. In Experiments 2 and 3, we observed a small reduction in drift rate following low-confidence trials. This non-monotonic reduction in drift rate driven by low confidence seems hard to reconcile with the clear monotonic relation between current-trial Pe amplitude and subsequent drift rate seen in the EEG data of Experiment 2. One explanation for this discrepancy might be that neural recordings provide a more veridical measure of the internal evaluation of accuracy than explicit confidence reports, which are subject to differences in scale use and differences in calibration. Indeed, when we fitted a model in which subsequent drift rate was allowed to vary as a function of both decision confidence and binned Pe amplitude, both the non-monotonic relation with decision confidence and the monotonic relation with Pe amplitude were replicated. Previous work has observed similar reductions of subsequent drift rate after errors (Notebaert et al., 2009; Purcell and Kiani, 2016), possibly reflecting distraction of attention from the main task due to error processing. Thus, in addition to affecting subsequent decision bounds, internal confidence (in particular: error) signals might also affect attentional focus on subsequent trials. However, given that this finding was not consistently observed across the three experiments, in contrast with the modulation of decision bound, conclusions about the modulation of drift rate should be made with caution and warrant further investigation.
Relation to previous work on error-dependent behavioral adjustments
Human observers slow down following incorrect choices, a phenomenon referred to as post-error slowing (Rabbitt, 1966). The underlying mechanism has been a matter of debate. Post-error slowing has been interpreted as a strategic increase in decision bound in order to avoid future errors (Dutilh et al., 2012a; Goldfarb et al., 2012; Holroyd et al., 2005) or an involuntary decrease in attentional focus (e.g., reduced drift rate) following an unexpected event (Notebaert et al., 2009; Purcell and Kiani, 2016). A key observation of the current work is that similar adjustments can also be observed based on internally computed and graded confidence signals. Our results also go beyond established effects of post-error slowing in that we establish them for trial-to-trial variations in internally computed, graded confidence signals within the ‘correct’ and ‘error’ condition. This aspect sets our work apart from previous model-based investigations of post-error slowing (e.g., Dutilh et al., 2012a; Goldfarb et al., 2012; Purcell and Kiani, 2016) and is important from an ecological perspective: internal, graded confidence signals enable the adjustment of decision parameters also in the absence of external feedback, and even after decisions that happened to be correct but were made with low confidence.
Another important novel aspect of our work is the observation of a neural confidence-encoding signal measured over parietal cortex predictive of changes in decision bound on the next trial. This observation differs from the results of a previous study of post-error slowing in monkey lateral intraparietal area (LIP; Purcell and Kiani, 2016) in a critical way: Purcell and Kiani (2016) found that errors are followed by changes in LIP dynamics on the subsequent trial, which explained the subsequent changes in drift rate and bound; in other words, the LIP effects reported by Purcell and Kiani (2016) reflected the consequences of post-error adjustments. By contrast, the current work uncovered a putative neural source of adaptive adjustments of decision-making overlying parietal cortex. While the neural generators of the Pe are unknown and potentially widespread, our finding implicates parietal cortex (along with possibly other brain regions) in the neural pathway controlling ongoing top-down adjustments of decision-making.
Modulating the speed-accuracy tradeoff by decision confidence can be thought of as an adaptive way to achieve a certain level of accuracy. Indeed, normative models prescribe that uncertainty (i.e., the inverse of confidence) should determine how much information needs to be sampled (Bach and Dolan, 2012; Meyniel et al., 2015). The current findings help bridge between studies of top-down control and perceptual decision-making (Shea et al., 2014; Shimamura, 2008; Yeung and Summerfield, 2012). Decision confidence has been shown to guide study choices (Metcalfe and Finn, 2008) and act as an internal teaching signal that supports learning (Guggenmos et al., 2016). Moreover, the current findings bear close resemblance to previous work showing that participants request more information when having low confidence in an impending choice (Desender et al., 2018). Conceptually, both the previous study and the current work demonstrate that participants sample more information when they are uncertain, which depending on the task context is achieved by actively requesting more information or by increasing the decision bound, respectively. Further evidence linking both lines of work comes from the observation that the same post-decisional neural signature of confidence, the Pe, predicts increases in decision bound (current work) and information-seeking (Desender et al., 2019). Interestingly, decision confidence seems to have no direct influence on top-down controlled processes such as response inhibition (Koizumi et al., 2015) or working memory (Samaha et al., 2016). Of direct relevance for the current work is a recent study by van den Berg et al. (2016), who showed that confidence acts as a bridge in multi-step decision-making. In their work, reward was obtained only when two choices in a trial sequence were correct. The results showed a linear increase in decision bound with increasing confidence in the first decision of a sequence.
The sign of this relation was opposite to what we observed in the current work. Given the multi-step nature of the task, observers likely sacrificed performance on the second choice (by decreasing the decision bound) when having low confidence in the first choice, given that both choices needed to be correct in order to obtain a reward. Contrary to this, in our current work observers were motivated to perform well on each trial, and thus adaptively varied the height of the decision bound in order to achieve optimal performance.
In sum, we have shown that decision confidence affects subsequent decision bounds on a trial-by-trial level. A post-decisional brain signal sensitive to decision confidence predicted this adaptive modulation of the decision bound at a single-trial level.
Materials and methods
Participants
Thirty participants (two men; age: M = 18.5, SD = 0.78, range 18–21) took part in Experiment 1 (two excluded due to a lack of data in one of the confidence judgments). Sixteen participants (eight females, age: M = 23.9, range 21–30) took part in Experiment 2. ERPs and non-overlapping analyses from Experiment 2 have been published earlier (Boldt and Yeung, 2015). Experiment 3 was a combination of two very similar datasets (see below) that are reported as one in the main text. Twelve participants (three men, mean age: 20.6 years, range 18–42) took part in Experiment 3a (one excluded due to a lack of data in one of the confidence judgments) and twelve participants (all female, mean age: 19.1 years, range 18–22) in Experiment 3b, all in return for course credit. All participants provided written informed consent before participation. All reported normal or corrected-to-normal vision and were naive with respect to the hypothesis. All procedures were approved by the local ethics committees.
Stimuli and apparatus
In all experiments, stimuli were presented on a gray background on a 20-inch CRT monitor with a 75 Hz refresh rate, using the MATLAB toolbox Psychtoolbox-3 (Brainard, 1997). Responses were made using a standard QWERTY keyboard.
In Experiment 1, random moving white dots were drawn in a circular aperture centered on the fixation point. The experiment was based on code provided by Kiani et al. (2013), and parameter details can be found there.
In Experiment 2, two fields were presented with one field containing 45 dots in a 10-by-10 matrix, the other containing 55 dots. Within this constraint, the displays were randomly generated for each new trial.
In Experiment 3, each stimulus consisted of eight colored shapes spaced regularly around a fixation point (radius 2.8° visual arc). To influence trial difficulty, both color mean and color variance were manipulated. The mean color of the eight shapes was determined by the variable C; the variance across the eight shapes by the variable V. The mean color of the stimuli varied between red ([1, 0, 0]) and blue ([0, 0, 1]) along a linear path in RGB space ([C, 0, 1 − C]). In Experiment 3a, C could take four different values: 0.425, 0.4625, 0.5375 and 0.575 (from blue to red, with 0.5 being the category boundary), and V could take three different values: 0.0333, 0.1000 and 0.2000 (low, medium and high variance, respectively). In Experiment 3b, C could take four different values: 0.450, 0.474, 0.526 and 0.550, and V could take two different values: 0.0333 and 0.1000. On every trial, the color of each individual element was pseudo-randomly selected with the constraint that the mean and variance of the eight elements closely matched (criterion value = 0.001) the mean C and the variance V, respectively. Across trials, each combination of C and V values occurred equally often. The individual elements did not vary in shape.
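The constraint on the eight element colors can be illustrated with a short Python sketch. This is not the authors' stimulus code: the function name `sample_element_colors` is hypothetical, and instead of repeated pseudo-random selection against the 0.001 criterion it rescales one Gaussian draw so that the sample mean and (population) variance match C and V exactly. It also assumes V is a variance on the C scale, as the text states, and does not handle channel values that fall outside [0, 1].

```python
import random
import statistics


def sample_element_colors(c_mean, v_var, n_elements=8, rng=None):
    """Return n_elements RGB triples whose C values have mean c_mean and variance v_var."""
    rng = rng or random.Random()
    raw = [rng.gauss(0.0, 1.0) for _ in range(n_elements)]
    mu = statistics.fmean(raw)
    sd = statistics.pstdev(raw)  # population SD (assumption; the text does not specify)
    target_sd = v_var ** 0.5
    # Standardize the draw, then shift and scale it to the requested mean and variance.
    c_values = [c_mean + target_sd * (x - mu) / sd for x in raw]
    # Map each C value onto the linear red-blue path [C, 0, 1 - C].
    return [(c, 0.0, 1.0 - c) for c in c_values]


colors = sample_element_colors(0.575, 0.0333, rng=random.Random(1))
red_channel = [r for r, _, _ in colors]
```

Because the draw is rescaled rather than rejected, every call satisfies the criterion; a rejection-based scheme would instead resample until both deviations fall below 0.001.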
Procedure
Experiment 1
After a fixation cross shown for 1000 ms, randomly moving dots were shown on the screen until a response was made or 3 s passed. On each trial, the proportion of dots moving coherently towards the left or right side of the screen was either 0%, 5%, 10%, 20% or 40%. In each block, there was an equal number of leftward and rightward movement trials. Participants were instructed to respond as quickly as possible, deciding whether the majority of dots were moving left or right, by pressing ‘c’ or ‘n’ with the thumbs of their left and right hand, respectively (counterbalanced between participants). When participants failed to respond within 3 s, the trial terminated with the message ‘too slow, press any key to continue’. When participants responded in time, either a blank screen was shown for 1 s or random motion continued for 1 s (sampled from the same parameters as the pre-decisional motion). Whether a blank screen or continued motion was shown depended on the block that participants were in. Subsequently, a 6-point confidence scale appeared with labels ‘certainly wrong’, ‘probably wrong’, ‘maybe wrong’, ‘maybe correct’, ‘probably correct’, and ‘certainly correct’ (reversed order for half of the participants). Participants had unlimited time to indicate their confidence by pressing one of six numerical keys at the top of their keyboard (1, 2, 3, 8, 9 or 0), which mapped onto the six confidence levels. On half of the trials, the coherence value on each time frame was sampled from a normal distribution (SD = 25.6%) around the generative coherence (cf. Zylberberg et al., 2016). This manipulation was irrelevant for the current purpose, however, and was ignored in the analysis. Apart from the blocks with a 1 s blank screen and 1 s continued evidence following the response, there was a third block type in which participants jointly indicated their choice (left or right) and level of confidence (low, medium, or high) in a single response.
Because perceived errors cannot be indicated using this procedure, these data were omitted from all further analysis. The block order of these three conditions was counterbalanced using a Latin square. The main part of Experiment 1 comprised 9 blocks of 60 trials. The experiment started with one practice block (60 trials) without confidence judgments (only 20% and 40% coherence) that was repeated until participants reached 75% accuracy. Feedback about the accuracy of the choice was shown for 750 ms. The second practice block (60 trials) was identical to the first, except that now the full range of coherence values was used. This block was repeated until participants reached 60% accuracy. The third practice block (60 trials) was identical to the main experiment (i.e., with confidence judgments and without feedback).
Experiment 2
On each trial, participants judged which of two simultaneously flashed fields (160 ms) contained more dots, using the same response keys as in Experiment 1 (counterbalanced across participants). After their response, a blank screen was presented for 600 ms, after which confidence in the decision was queried using the same labels and response layout as in Experiment 1. The inter-trial interval lasted 1 s. Each participant performed 18 blocks of 48 trials. The experiment started with one practice block without confidence judgments but with performance feedback (48 trials), and one practice block with confidence judgments but without feedback (48 trials).
Experiment 3
This experiment was a combination of two highly similar datasets. Because both datasets show highly similar results (Figure 5—figure supplement 5), they are discussed as one experiment here. In both experiments, after a fixation point shown for 200 ms, the stimulus was flashed for 200 ms, followed again by the fixation point. Participants were instructed to respond as quickly as possible, deciding whether the average of the eight elements was blue or red, using the same response layout as in Experiment 1 (counterbalanced between participants). When participants failed to respond within 1500 ms, the trial terminated with the message ‘too slow, press any key to continue’. When participants responded in time, a fixation point was shown for 200 ms. Then, participants were queried for a confidence judgment using the same scale and response layout as in Experiment 1. The inter-trial interval lasted 1000 ms. The main part of Experiment 3a comprised 8 blocks of 60 trials. To maintain a stable color criterion over the course of the experiment, each block started with 12 additional practice trials with auditory performance feedback in which the confidence judgment was omitted. The experiment started with one practice block (60 trials) without confidence judgments but with auditory performance feedback and one practice block (60 trials) with confidence judgments but without feedback. The main part of Experiment 3b comprised 8 blocks of 64 trials. Each block started with 16 additional practice trials with auditory performance feedback in which the confidence judgment was omitted. The experiment started with one practice block (64 trials) without confidence judgments but with auditory performance feedback, and one practice block (64 trials) with confidence judgments but without feedback. In even blocks of Experiment 3b, participants did not provide a confidence judgment; these data are excluded here.
Behavioral analyses
Behavioral data were analyzed using mixed regression modeling. This method allows analyzing data at the single-trial level. We fitted random intercepts for each participant; error variance caused by between-subject differences was accounted for by adding random slopes to the model. The latter was done only when this increased the model fit, as assessed by model comparison using BIC scores. RTs and confidence were analyzed using linear mixed models, for which F statistics are reported and the degrees of freedom were estimated by Satterthwaite’s approximation (Kuznetsova et al., 2014). Accuracy was analyzed using logistic linear mixed models, for which χ² statistics are reported. Model fitting was done in R (R Development Core Team, 2008) using the lme4 package (Bates et al., 2015).
EEG data preparation
Precise details about the EEG collection have been described in Boldt and Yeung (2015) and are not reiterated here. From the data presented in that work, we extracted raw-data single-trial amplitudes using the specified time windows and electrodes. Raw data were low-pass filtered at 10 Hz. Afterwards, single-trial ERN amplitudes were extracted at electrode FCz during the window from 10 ms before until 90 ms after the response. Single-trial Pe amplitudes were extracted at electrode Pz during a window from 250 ms to 350 ms post-response.
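As an illustration of the window-based extraction described above, the following Python sketch averages a response-locked epoch over a time window. Several details are assumptions not given in the text: that the single-trial amplitude is the mean voltage in the window (rather than, say, a peak), the 500 Hz sampling rate, the epoch onset at −100 ms, and the helper name `window_mean`. The epoch itself is synthetic.

```python
def window_mean(epoch, sfreq, epoch_start_ms, win_start_ms, win_end_ms):
    """Mean amplitude of one response-locked epoch within [win_start_ms, win_end_ms]."""
    # Convert window edges (ms relative to the response) to sample indices.
    first = round((win_start_ms - epoch_start_ms) * sfreq / 1000)
    last = round((win_end_ms - epoch_start_ms) * sfreq / 1000)
    segment = epoch[first:last + 1]  # inclusive of both window edges
    return sum(segment) / len(segment)


# Synthetic single-channel epoch: flat at 0, with a 2.0 plateau from 250 to 350 ms.
SFREQ = 500            # Hz (assumed)
EPOCH_START_MS = -100  # epoch onset relative to the response (assumed)
epoch = [0.0] * 175 + [2.0] * 51 + [0.0] * 74

pe_amplitude = window_mean(epoch, SFREQ, EPOCH_START_MS, 250, 350)   # Pe window
ern_amplitude = window_mean(epoch, SFREQ, EPOCH_START_MS, -10, 90)   # ERN window
```

In practice the same function would be applied to the Pz epoch for the Pe and to the FCz epoch for the ERN, once per trial.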
Drift diffusion modeling
We fitted the drift diffusion model (DDM) to behavioral data (choices and reaction times). The DDM is a popular variant of sequential sampling models of two-choice tasks (Ratcliff and McKoon, 2008). We used the hierarchical Bayesian model fitting procedure implemented in the HDDM toolbox (Wiecki et al., 2013). The HDDM uses Markov-chain Monte Carlo (MCMC) sampling for generating posterior distributions over model parameters. The Bayesian MCMC generates full posterior distributions over parameter estimates, quantifying not only the most likely parameter value but also uncertainty associated with each estimate. Due to the hierarchical nature of the HDDM, estimates for individual subjects are constrained by group-level prior distributions. In practice, this results in more stable estimates for individual subjects, allowing the model to be fit even with unbalanced data, as is typically the case with confidence judgments.
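To make the roles of drift rate and decision bound concrete, here is a minimal Python simulation of a single-trial DDM. It is a sketch, not the HDDM fitting procedure: the function names, the parameter values, the unbiased starting point halfway between the bounds, the unit noise scale, the 1 ms time step and the 0.3 s non-decision time are all illustrative assumptions.

```python
import random


def simulate_ddm_trial(drift, bound, rng, non_decision=0.3, dt=0.001):
    """Simulate one trial; returns (correct, reaction_time_in_seconds)."""
    x = bound / 2.0        # unbiased starting point between 0 and `bound`
    t = 0.0
    step_sd = dt ** 0.5    # noise SD per step for unit diffusion noise
    while 0.0 < x < bound:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    # Upper bound counts as the correct response (accuracy coding).
    return x >= bound, t + non_decision


def summarize(drift, bound, n_trials=1000, seed=0):
    """Return (accuracy, mean RT) over n_trials simulated trials."""
    rng = random.Random(seed)
    trials = [simulate_ddm_trial(drift, bound, rng) for _ in range(n_trials)]
    accuracy = sum(correct for correct, _ in trials) / n_trials
    mean_rt = sum(rt for _, rt in trials) / n_trials
    return accuracy, mean_rt


low_acc, low_rt = summarize(drift=1.0, bound=1.0)
high_acc, high_rt = summarize(drift=1.0, bound=2.0)
```

With drift held constant, the higher bound yields slower but more accurate simulated decisions, which is the speed-accuracy tradeoff that the bound parameter is meant to capture.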
For each variant of the model, we ran 10 separate Markov chains with 10000 samples each. The first half of these samples were discarded as burn-in and every second sample was discarded for thinning, reducing autocorrelation in the chains. All chains of a model were then concatenated. Group-level chains were visually inspected to ensure convergence. Additionally, Gelman-Rubin R hat statistics were computed (comparing within-chain and between-chain variance) and it was checked that all group-level parameters had an R hat between 0.98 and 1.02. Because individual parameter estimates are constrained by group-level priors, the data are not independent and frequentist statistics are therefore inappropriate. Instead, the probability that a condition differs from another (or from the baseline) can be computed by calculating the overlap in posterior distributions. Linear relations were assessed using Friedman’s χ² test, a non-parametric rank-order test suited for repeated measures designs.
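The chain handling described above (burn-in, thinning, and the Gelman-Rubin check) can be sketched in plain Python. This is the basic, non-split form of the R hat statistic and may differ in detail from the implementation shipped with HDDM; the function names are hypothetical and the chains are synthetic independent draws standing in for real MCMC traces.

```python
import random
import statistics


def burn_and_thin(chain, burn_fraction=0.5, thin=2):
    """Drop the first burn_fraction of samples, then keep every thin-th sample."""
    start = int(len(chain) * burn_fraction)
    return chain[start::thin]


def gelman_rubin(chains):
    """Basic R hat: compares pooled (between + within) variance to within-chain variance."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(42)
# Ten synthetic "chains" of 10000 samples for one group-level parameter.
raw_chains = [[rng.gauss(1.5, 0.1) for _ in range(10000)] for _ in range(10)]
kept = [burn_and_thin(chain) for chain in raw_chains]
r_hat = gelman_rubin(kept)
converged = 0.98 < r_hat < 1.02
# Concatenate the retained samples, as described for the final posterior.
posterior = [sample for chain in kept for sample in chain]
```

Each chain retains 2500 of its 10000 samples, so the concatenated posterior holds 25000 samples per parameter.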
To compute statistics, we subtracted group posterior distributions of confidence on trial_{n+2} from confidence on trial_{n}, and computed p-values from these difference distributions. To compare these models against simpler ones, we additionally fitted models in which bound, drift or both were fixed rather than free. We used the Deviance Information Criterion (DIC) to compare different models to each other. Lower DIC values indicate that a model explains the data better, while taking model complexity into account. A DIC difference of 10 is generally taken as a meaningful difference in model fit.
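The subtraction of posterior distributions can be sketched as follows. The exact definition of the p-value is not given in the text; this sketch assumes it is the proportion of the difference distribution at or below zero (a one-sided posterior probability). The function name is hypothetical and the two traces are synthetic stand-ins for group-level posterior samples.

```python
import random


def posterior_difference(samples_trial_n, samples_trial_n_plus_2):
    """Return the sample-wise difference distribution and P(difference <= 0)."""
    diffs = [a - b for a, b in zip(samples_trial_n, samples_trial_n_plus_2)]
    p_value = sum(d <= 0 for d in diffs) / len(diffs)
    return diffs, p_value


rng = random.Random(7)
# Synthetic posteriors for the bound effect of low confidence (relative to high).
effect_trial_n = [rng.gauss(0.20, 0.05) for _ in range(5000)]         # confidence on trial n
effect_trial_n_plus_2 = [rng.gauss(0.00, 0.05) for _ in range(5000)]  # proxy of slow drift
diffs, p_value = posterior_difference(effect_trial_n, effect_trial_n_plus_2)
mean_difference = sum(diffs) / len(diffs)
```

A small p-value here means that nearly all of the posterior mass of the drift-corrected effect lies above zero.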
Modeling the link between confidence ratings and subsequent behavior
In Experiment 1, we first used the default accuracy coding scheme to fit a model where drift rate depended on the coherence level. All other parameters were not allowed to vary. This fit produced lower DIC values compared to a fit in which the drift rate was fixed (ΔDIC = −3882). Next, we used the regression coding scheme and allowed both the decision bound and drift rate to vary as a function of confidence on trial_{n} and confidence on trial_{n+2} (both of which were treated as factors). To obtain reliable and robust estimates, we combined trials labeled as ‘certainly correct’ and ‘probably correct’ into a ‘high confidence’ bin, trials labeled as ‘maybe correct’ and ‘maybe wrong’ into a ‘low confidence’ bin, and trials labeled as ‘probably wrong’ and ‘certainly wrong’ into a ‘perceived error’ bin. This ensured a sufficient number of trials for each level of confidence for all individual participants (high confidence: M = 191.7, range 47–311; low confidence: M = 114.5, range 4–228; perceived error: M = 23.8, range 1–83). The hierarchical Bayesian approach does not fit the model to each individual subject’s data separately, but rather jointly fits the data of the entire group. Therefore, data from participants with low trial counts in certain conditions do not contribute much to the posteriors for the respective condition. At the same time, participant-level estimates are obtained, but these are constrained by the group-level estimate. One obvious advantage of this approach is that participants with unequal trial numbers across conditions can contribute to the analysis, whereas in traditional approaches their data would be lost.
Trials with high confidence were always treated as reference category (i.e., fixed to zero). In addition, the drift rate was allowed to vary as a function of coherence, which was treated as a covariate (because we were not interested in the parameter estimate but solely wanted to capture variance in the data accounted for by signal-to-noise ratio). To quantify the influence of confidence on the subsequent decision bound and drift rate, we subtracted estimates of subsequent bound and drift by confidence on trial_{n+2} from estimates of subsequent bound and drift by confidence on trial_{n}. Statistics of the simple effects of confidence on trial_{n} and confidence on trial_{n+2} are reported in the figure supplements. Relative to the null model without confidence, the full model (presented in Figure 3) provides the best fit (ΔDIC = −288), explaining the data better than simpler models in which only the bound (ΔDIC = −234) or the drift (ΔDIC = −94) were allowed to vary.
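The collapsing of the six-point scale into three bins, with high confidence as the reference category, can be written down as a small Python sketch. The label strings follow the scale listed in the Procedure; the names `CONFIDENCE_BINS`, `bin_confidence` and `dummy_code` are hypothetical, and the example ratings are made up.

```python
from collections import Counter

# Six scale labels collapsed into three confidence bins.
CONFIDENCE_BINS = {
    "certainly correct": "high confidence",
    "probably correct": "high confidence",
    "maybe correct": "low confidence",
    "maybe wrong": "low confidence",
    "probably wrong": "perceived error",
    "certainly wrong": "perceived error",
}


def bin_confidence(label):
    """Map one scale label onto its confidence bin."""
    return CONFIDENCE_BINS[label]


def dummy_code(bin_name):
    """Treatment coding with 'high confidence' as the reference (all zeros)."""
    return {level: int(bin_name == level)
            for level in ("low confidence", "perceived error")}


ratings = ["certainly correct", "maybe wrong", "probably wrong",
           "probably correct", "maybe correct"]
binned = [bin_confidence(r) for r in ratings]
counts = Counter(binned)              # trial counts per bin, e.g. to check for sparse cells
coded = [dummy_code(b) for b in binned]
```

Counting trials per bin and participant in this way is how sparse cells, such as the perceived-error bin, would be spotted before fitting.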
The data of Experiment 2 were analyzed in the same way, except that difficulty was fixed and thus trial difficulty (i.e., coherence or signal-to-noise ratio) did not need to be accounted for within the model. Relative to the null model (presented in Figure 5), allowing both drift and bound to vary as a function of confidence provides the best fit (ΔDIC = −677), which explained the data better than simpler models in which only the bound (ΔDIC = −302) or the drift (ΔDIC = −170) were allowed to vary. Trial counts for this experiment were relatively high (high confidence: M = 516, range 132–705; low confidence: M = 257, range 77–674; perceived error: M = 80, range 18–197).
The data of Experiment 3 were analyzed in the same way as Experiment 1, except that the variable coherence was replaced by signal-to-noise ratio. For both Experiment 3a and 3b, a model in which only the drift was allowed to vary as a function of signal-to-noise ratio produced lower DIC values compared to a fit in which the drift rate was fixed (Experiment 3a: ΔDIC = −311; Experiment 3b: ΔDIC = −88). For the confidence-dependent fitting, a single model was fit to the data of Experiments 3a and 3b simultaneously. Relative to the null model without confidence, the full model (presented in Figure 5) provides the best fit (ΔDIC = −634), explaining the data better than simpler models in which only the bound (ΔDIC = −58) or the drift (ΔDIC = −260) were allowed to vary. Trial counts were relatively high (high confidence: M = 166, range 92–302; low confidence: M = 92, range 41–175; perceived error: M = 38, range 10–41).
Modeling the link between ERP components and subsequent behavior
Because single-trial EEG contains substantial noise, a robust measure was computed by rank ordering all trials per participant, and then using rank as a predictor rather than the raw EEG signal. A hierarchical DDM regression model was then fit in which subsequent bound and drift were allowed to vary as a function of the Pe and the ERN, both on trial_{n} and trial_{n+2}.
To examine potential non-linear effects, the Pe was divided into five bins, separately for each participant. This was done after regressing out the effect of the ERN, separately for each participant. Then, a hierarchical DDM regression model was run in which subsequent bound and drift were allowed to vary as a function of binned Pe on trial_{n} and trial_{n+2}. The bin with the lowest amplitudes was always treated as the reference category. Model comparison revealed that, relative to a model without the Pe, the full model provides the best fit (ΔDIC = −6524), explaining the data much better than simpler models in which only the bound (ΔDIC = −1816) or only the drift (ΔDIC = −1787) were allowed to vary as a function of Pe. When applying the same binned analysis to the ERN, model comparison revealed that both the full model (ΔDIC = 24), and a model in which only drift (ΔDIC = 171) or bound (ΔDIC = 182) were allowed to vary, provided a worse fit than the null model. Thus, the ERN had no explanatory power for either drift rate or decision bound.
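The per-participant preprocessing of the Pe (regressing out the ERN, then splitting into five bins) can be sketched in Python as follows. The source does not specify the regression beyond "regressing out", so ordinary least squares with a single predictor is assumed; the function names are hypothetical and the amplitudes are synthetic data for one participant.

```python
import random
from collections import Counter


def residualize(y, x):
    """Residuals of y after a least-squares regression of y on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return [yi - (mean_y + slope * (xi - mean_x)) for xi, yi in zip(x, y)]


def quantile_bins(values, n_bins=5):
    """Assign each value to an equal-sized rank bin; bin 0 holds the lowest values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


rng = random.Random(3)
ern = [rng.gauss(0.0, 1.0) for _ in range(100)]   # synthetic single-trial ERN
pe = [0.5 * e + rng.gauss(0.0, 1.0) for e in ern]  # synthetic Pe, correlated with ERN
pe_residual = residualize(pe, ern)
pe_bin = quantile_bins(pe_residual)               # bin 0 is the reference category
bin_sizes = Counter(pe_bin)
```

Binning on ranks rather than raw amplitudes keeps the bins equally populated within a participant, which matches the rank-based treatment of the noisy single-trial EEG described above.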
Controlling for autocorrelation in performance
The relation between decision confidence and decision bound on the subsequent trial might be confounded by autocorrelations in performance. During the course of an experiment, autocorrelation is typically observed in RTs, accuracy (Dutilh et al., 2012b; Gilden, 2003; Palva et al., 2013), and confidence (Rahnev et al., 2015). This could be due to slow drifts in behavioral state factors (e.g., motivation, arousal, attention). When observers report high confidence in ‘fast periods’ and low confidence in ‘slow periods’ of the experiment (cf. the link between response speed and confidence; Kiani et al., 2014), this can artificially induce a negative relation between decision confidence on trial_{n} and reaction time on trial_{n+1}. We reasoned that one way to control for such effects of slow drift is to use confidence on trial_{n+2}: confidence on trial_{n+2} cannot causally affect decision bound on trial_{n+1} (because of temporal sequence) and might thus be used as a proxy of the effects of slow performance drifts. We therefore isolated the impact of rapid (trial-by-trial) and (as we hypothesize) causally mediated, confidence-dependent changes in decision bound from slow performance drifts as follows: we estimated the confidence-dependent change in decision bound once with confidence evaluated on trial_{n} and once with confidence evaluated on trial_{n+2}, and subtracted the latter (proxy of drift) from the former.
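In practice, this control requires that every modeled trial (trial_{n+1}) carries two predictors: confidence on the preceding trial (trial_{n}) and confidence on the following trial (trial_{n+2}). The Python sketch below builds these columns. The field names, the function name and the choice to drop trials whose neighbors fall in a different block are assumptions for illustration; the data are made up.

```python
def add_neighbor_confidence(trials):
    """Attach confidence of the previous and next trial to each modeled trial."""
    rows = []
    for i in range(1, len(trials) - 1):
        previous, current, following = trials[i - 1], trials[i], trials[i + 1]
        # Skip trials whose neighbors belong to another block (assumption).
        if not (previous["block"] == current["block"] == following["block"]):
            continue
        rows.append({**current,
                     "conf_prev": previous["confidence"],    # confidence on trial n
                     "conf_next": following["confidence"]})  # confidence on trial n+2
    return rows


trials = [
    {"block": 1, "rt": 0.61, "confidence": "high"},
    {"block": 1, "rt": 0.85, "confidence": "low"},
    {"block": 1, "rt": 0.92, "confidence": "high"},
    {"block": 2, "rt": 0.70, "confidence": "error"},
    {"block": 2, "rt": 0.66, "confidence": "high"},
    {"block": 2, "rt": 0.74, "confidence": "low"},
]
rows = add_neighbor_confidence(trials)
```

A regression of bound on `conf_prev` then estimates the effect of interest, while the corresponding estimate for `conf_next` serves as the proxy of slow drift that is subtracted from it.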
A possible concern is that the decision bound on trial_{n+1} affected confidence ratings on trial_{n+2}, which would complicate the interpretation of the results of our approach. Thus, we also used a complementary approach controlling for slow drifts in performance, which is analogous to an approach established in the post-error slowing literature (Dutilh et al., 2012b; Purcell and Kiani, 2016). In that approach, post-error trials are compared to post-correct trials that are also pre-error. As a consequence, both trial types appear adjacent to an error, and therefore likely stem from the same time period in the experiment. We adapted this approach to confidence ratings as follows: decision bound and drift rate on trial_{n+1} were fitted in separate models where i) we compared the effect of low confidence on trial_{n} to high confidence on trial_{n} for which trial_{n+2} was a low confidence trial, and ii) we compared the effect of perceived errors on trial_{n} to high confidence trials on trial_{n} for which trial_{n+2} was a perceived error. This ensured that the two trial types that were compared to each other stemmed from statistically similar environments. For the EEG data, we fitted a new model estimating decision bound and drift rate on trial_{n+1} when trial_{n} stemmed from the lowest Pe amplitude quantile, compared to when trial_{n} stemmed from the highest Pe amplitude quantile and trial_{n+2} stemmed from the lowest Pe amplitude quantile.
References

1
Knowing how much you don't know: a neural organization of uncertainty estimatesNature Reviews Neuroscience 13:572–586.https://doi.org/10.1038/nrn3289

2
Fitting linear MixedEffects models using lme4Journal of Statistical Software 67:1–48.https://doi.org/10.18637/jss.v067.i01
 3

4
Do humans produce the speed–accuracy tradeoff that maximizes reward rate?Quarterly Journal of Experimental Psychology 63:863–891.https://doi.org/10.1080/17470210903091643

5
The neural basis of the speedaccuracy tradeoffTrends in Neurosciences 33:10–16.https://doi.org/10.1016/j.tins.2009.09.002

6
Shared neural markers of decision confidence and error detectionJournal of Neuroscience 35:3478–3484.https://doi.org/10.1523/JNEUROSCI.079714.2015
 7

8
Frontal theta as a mechanism for cognitive controlTrends in Cognitive Sciences 18:414–421.https://doi.org/10.1016/j.tics.2014.04.012
 9

10
Midfrontal conflictrelated thetaband power reflects neural oscillations that predict behaviorJournal of Neurophysiology 110:2752–2763.https://doi.org/10.1152/jn.00479.2013
 11
 12

13
Localization of a neural system for error detection and compensationPsychological Science 5:303–305.https://doi.org/10.1111/j.14679280.1994.tb00630.x

14
Subjective confidence predicts information seeking in decision makingPsychological Science 29:761–778.https://doi.org/10.1177/0956797617744771

15
A postdecisional neural marker of confidence predicts InformationSeeking in DecisionMakingThe Journal of Neuroscience 39:3309–3319.https://doi.org/10.1523/JNEUROSCI.262018.2019

16
Testing theories of posterror slowingAttention, Perception, & Psychophysics 74:454–465.https://doi.org/10.3758/s1341401102432

17
How to measure posterror slowing: a confound and a simple solutionJournal of Mathematical Psychology 56:208–216.https://doi.org/10.1016/j.jmp.2012.04.001
 18

19
Neural mediators of changes of mind about perceptual decisionsNature Neuroscience 21:617–624.https://doi.org/10.1038/s4159301801046
 20
 21

22
Cognitive emissions of 1/f noisePsychological Review 108:33–56.https://doi.org/10.1037/0033295X.108.1.33

23
The Neural Basis of Decision MakingAnnual Review of Neuroscience 30:535–574.https://doi.org/10.1146/annurev.neuro.29.051605.113038

24
Can Post-Error dynamics explain sequential reaction time patterns? Frontiers in Psychology 3:213. https://doi.org/10.3389/fpsyg.2012.00213
 25
 26

27
A mechanism for error detection in speeded response time tasks. Journal of Experimental Psychology: General 134:163–191. https://doi.org/10.1037/00963445.134.2.163
 28
 29

30
Integration of direction cues is invariant to the temporal gap between them. Journal of Neuroscience 33:16483–16489. https://doi.org/10.1523/JNEUROSCI.209413.2013
 31
 32

33
Does perceptual confidence facilitate cognitive control? Attention, Perception, & Psychophysics 77:1295–1306. https://doi.org/10.3758/s1341401508433

34
lmerTest: test in linear mixed effect models. Journal of Statistical Software 20:1–25.
 35

36
Evidence that judgments of learning are causally related to study choice. Psychonomic Bulletin & Review 15:174–179. https://doi.org/10.3758/PBR.15.1.174
 37

38
Optimal decision making in heterogeneous and biased environments. Psychonomic Bulletin & Review 22:38–53. https://doi.org/10.3758/s1342301406693
 39
 40

41
Post-decisional accounts of biases in confidence. Current Opinion in Behavioral Sciences 11:55–60. https://doi.org/10.1016/j.cobeha.2016.05.005
 42
 43
 44
 45
 46

47
Two-stage dynamic signal detection: a theory of choice, decision time, and confidence. Psychological Review 117:864–901. https://doi.org/10.1037/a0019737

48
Confidence and certainty: distinct probabilistic quantities for different goals. Nature Neuroscience 19:366–374. https://doi.org/10.1038/nn.4240
 49

50
Errors and error correction in choice-response tasks. Journal of Experimental Psychology 71:264–272. https://doi.org/10.1037/h0022853

51
Confidence leak in perceptual decision making. Psychological Science 26:1664–1680. https://doi.org/10.1177/0956797615595037
 52
 53
 54
 55
 56

57
Supra-personal cognitive control and metacognition. Trends in Cognitive Sciences 18:186–193. https://doi.org/10.1016/j.tics.2014.01.006

58
A neurocognitive approach to metacognitive monitoring and control. Handbook of Metamemory and Memory, pp. 373–390. https://doi.org/10.4324/9780203805503.ch19

59
Cortical network dynamics of perceptual decision-making in the human brain. Frontiers in Human Neuroscience 5:21. https://doi.org/10.3389/fnhum.2011.00021
 60

61
The time course of perceptual choice: the leaky, competing accumulator model. Psychological Review 108:550–592. https://doi.org/10.1037/0033295X.108.3.550

62
Confidence is the bridge between Multi-stage decisions. Current Biology 26:3157–3168. https://doi.org/10.1016/j.cub.2016.10.021

63
The timing of action-monitoring processes in the anterior cingulate cortex. Journal of Cognitive Neuroscience 14:593–602. https://doi.org/10.1162/08989290260045837

64
HDDM: hierarchical bayesian estimation of the Drift-Diffusion model in Python. Frontiers in Neuroinformatics 7:14. https://doi.org/10.3389/fninf.2013.00014

65
Choice variability and suboptimality in uncertain environments. Current Opinion in Behavioral Sciences 11:109–115. https://doi.org/10.1016/j.cobeha.2016.07.003
 66
 67

68
Metacognition in human decision-making: confidence and error monitoring. Philosophical Transactions of the Royal Society B: Biological Sciences 367:1310–1321. https://doi.org/10.1098/rstb.2011.0416

69
The construction of confidence in a perceptual decision. Frontiers in Integrative Neuroscience 6:79. https://doi.org/10.3389/fnint.2012.00079
 70
Decision letter

Roozbeh Kiani, Reviewing Editor; New York University, United States

Michael J Frank, Senior Editor; Brown University, United States

Simon P Kelly, Reviewer; University College Dublin, Ireland
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Post-decisional sense of confidence shapes speed-accuracy tradeoff for subsequent choices" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Simon P. Kelly (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This paper tackles an interesting topic concerning how trial-to-trial adjustments to decision policy are made based on confidence. These policy adjustments are well studied for categorical (error/correct) outcomes of past trials but less understood for graded levels of reported confidence for each outcome. The authors have compiled a substantial dataset to study changes of decision bound with confidence associated with past decisions. Their dataset consists of three behavioral experiments that used different perceptual decision tasks. One experiment also included EEG data. They report that perceived error trials are associated with a higher subsequent decision bound than high confidence trials. They also report a strong, linear association between the Pe ERP component and subsequent decision bound. The paper is quite interesting and well written. However, as explained below, there are substantial conceptual and technical complexities that should be addressed before publication.
Essential revisions:
1) What supports the claim that it is only the "post-decisional" component of confidence (as opposed to the entirety of confidence) that influences subsequent choices? As the authors show in Figure 2 and supplementary figures, confidence of the current choice is influenced by processing the evidence both before and after the choice is reported. However, the pre-decision component and its potential effects are largely ignored. The authors seem to rely on the late EEG correlate for their claim, but that point is a bit unclear from the manuscript. Also, a late correlate does not necessarily imply only post-decisional contributions. Because the title and Abstract of the manuscript emphasize post-decisional effects, this claim should be properly supported through reanalysis of existing EEG and behavioral data (or a new behavioral experiment, if needed). Alternatively, the authors could focus on the effects of confidence as a whole. The reviewers find the paper interesting enough even if the post-decisional claims are dropped.
2) The claim that speed-accuracy tradeoff is altered by confidence needs clarification and support. As shown in Figures 3 and 5, changes of accuracy are quite small (<5%) and it is unclear if they reach statistical significance. Similarly, changes of RT are quite small and their statistical significance remains unclear. Because the DDM is only an approximation to the neural computations that underlie the behavior, it is critical that claims about speed-accuracy tradeoff are supported by changes in speed and/or accuracy, instead of only changes in model parameters. Also, the data suggest that behavioral effects of confidence on subsequent decisions may be more complex than a simple trade off of speed and accuracy. For example, it seems that effects of confidence on subsequent decision bound were stronger for error trials than for correct trials. Is that right, and if so what is the interpretation?
3) It is critical to show that apparent effects of confidence on behavior are not mediated by other factors. The reviewers wonder whether there is something that varies from trial to trial, such as coherence, coherence volatility, or dot-clustering which could both affect confidence on the current trial and also prime the subject to adjust on the next trial, without a direct impact of confidence on the adjustment. Take volatility in experiment 1 for example: On half of trials coherence varied randomly from frame to frame, which would increase variance in the decision variable and thus plausibly affect confidence and also the setting of bound on the next trial. In the Materials and methods the authors simply state that this volatility manipulation doesn't matter, but the main relationship between confidence and bound on next trial could be mediated through this variation. Coherence itself could also be a driver of the relationship – to the extent that subjects can tell high from low coherence trials, they may lower their bound after low coherence trials and increase it after high coherence trials – this might not be beneficial but neither are many typical priming effects. Taking the error trials for example, trials in which the subjects most thought they were correct would be the low coherence trials, which on the basis of coherence itself induce a bound lowering, not necessarily mediated through confidence. Support for this interpretation comes from experiment 2, where stimulus difficulty was fixed. In this experiment, there was not a robust difference between high and low confidence decision bounds in Figures 5C or 6A/B. To support their conclusion, the authors should carefully consider mediating factors. They may be able to conduct mediation analyses to better test for confidence's direct impact despite such other factors, and may be able to explore the factors themselves to see if they have any such impact in the first place, e.g. testing for differences across coherence levels, volatility levels, and even within-condition motion energy or dot distribution.
4) The composition of the dataset raises concerns. Three issues require clarification:
a) Did participants receive adequate practice prior to data collection to learn the tasks and show stable performance? In experiment 1, data collection started following only 180 practice trials. It is unclear that the practice included feedback and was sufficient for subjects to learn how to optimally use the sensory evidence or how to generate reliable confidence reports. Was the behavior stable throughout data collection? More important, how good were the data? Reaction times of a relatively large fraction of subjects seem to vary minimally across stimulus strengths in Figure 2. Also, many subjects seem to have accuracies far below 100% for the strongest stimuli in the experiment. Figure 2D suggests that ~20% of subjects have almost chance level accuracy even when they report they are quite sure to be correct! Similar problems – shortage of practice trials and unclear quality and stability of performance – are present in the other experiments too.
b) What fraction of subjects support the main results? The analysis methods of the paper prevent a clear answer to this question. Consequently, we cannot tell that the average trends in the population are not generated by a mixture of diverse (or even opposing) trends across participants. For example, Figures 3 or 5 could include subjects that go against the population average for changes of RT and accuracy, or changes of decision bound. Can the authors clarify if individual participants' results match the population average and whether there are subjects that deviate from the average trends?
c) Can the authors clarify the trial numbers in subsection “Decision confidence influences subsequent decision bound”? Does 831 mean between 1 and 83 trials? If yes, how could 1 trial be sufficient for fitting the model to individual subject's data?
5) The reviewers appreciate the authors' attempt to remove confounds caused by slow fluctuations of RT, accuracy and confidence. However, it is unclear that the correction procedure is adequate. Three related concerns are raised:
a) It is unclear that the underlying assumption of the analysis is correct. Of course confidence on trial n+2 cannot influence the decision bound on n+1, but could the decision bound on n+1 influence confidence on n+2? In both directions, these are just correlations. What support is there for the causal claim the authors are making of confidence on trial n changing bound on n+1?
b) The motivation of the analysis is to remove generic autocorrelations from the data (due to, e.g., long periods of high/low confidence correlated with low/high RT), under the assumption that trial n+2 confidence cannot be causally associated with trial n+1 decision bound. However, based on the unnormalized, "simple effects of confidence on decision bound," the effects seem to be mainly coming from trial n+2, rather than trial n (e.g., Figure 5—figure supplement 1, Figure 8—figure supplement 4, and Figure 5—figure supplement 3). This is fairly consistent in the behavioral data and is especially striking for the relation between Pe amplitude and decision bound. How does it affect whether we believe trial n+2 is an appropriate baseline measure? More important, these observations appear to challenge the main claim that confidence-dependent bound changes are shaped by trial n. Can the authors clarify why this challenge is not critical?
c) The authors use the same model to quantify the effect of trial n or n+2 on trial n+1 and subtract the two effects to filter out slow fluctuations. This method would be successful only to the extent that the applied models are good at fitting the data. If a model fails to adequately capture the data, the residual error could contain slow fluctuations of behavior asymmetrically for the n/n+1 and n+2/n+1 analyses, which could contribute to results. There are reasons to be concerned because we do not know how well the DDM fits the behavior of individual participants, especially their RT distributions (not just the mean RT across the population; also consider large lapses and shallow RT functions). Showing residual errors of the model for RT and the autocorrelation of these residual errors would be useful to alleviate this concern.
6) A point that deserves discussion and possibly looking into is the role of RT in determining the relationship between confidence and subsequent adjustment. From past papers, in varying-difficulty contexts, lower confidence can result from a decision reaching commitment at too low a level of cumulative evidence, OR alternatively from a crossing made later in the trial. The latter is more an indication that the current trial had weaker evidence than that the bound was not high enough. Setting the bound higher for the latter kind of trial would bring an increase in accuracy but at a vast cost in terms of time. Bound increases for low confidence do not clearly seem to be universally beneficial, and it would seem to depend on the cost of time, presence of deadlines etc. This kind of territory may be covered in papers like Meyniel et al., but either way, it seems a matter worth discussing in the current paper.
https://doi.org/10.7554/eLife.43499.030
Author response
Essential revisions:
1) What supports the claim that it is only the "post-decisional" component of confidence (as opposed to the entirety of confidence) that influences subsequent choices? As the authors show in Figure 2 and supplementary figures, confidence of the current choice is influenced by processing the evidence both before and after the choice is reported. However, the pre-decision component and its potential effects are largely ignored. The authors seem to rely on the late EEG correlate for their claim, but that point is a bit unclear from the manuscript. Also, a late correlate does not necessarily imply only post-decisional contributions. Because the title and Abstract of the manuscript emphasize post-decisional effects, this claim should be properly supported through reanalysis of existing EEG and behavioral data (or a new behavioral experiment, if needed). Alternatively, the authors could focus on the effects of confidence as a whole. The reviewers find the paper interesting enough even if the post-decisional claims are dropped.
We agree that our previous focus on the post-decisional component of confidence was overly specific and not necessary for our main conclusion. We have dropped the corresponding claims from the title, Abstract, and Results section and keep this as a point for the Discussion section.
Our focus was motivated by the observation that perceived errors played a critical role in the bound adjustments. Assuming that subjects base both their choices and confidence ratings on (some transformation of) the same decision variable, perceived errors are very likely to result from post-decisional processing. However, we acknowledge that this does not rule out an influence of pre- and intra-decisional computations on confidence ratings and subsequent bound adjustments. We now acknowledge this point explicitly in the Discussion:
“This evidence for post-decisional contributions to confidence ratings (and in particular, certainty about errors) does not rule out the contribution of pre- and intra-decisional computations to confidence (e.g., Gherman and Philiastides, 2018; Kiani and Shadlen, 2009).”
2) The claim that speed-accuracy tradeoff is altered by confidence needs clarification and support. As shown in Figures 3 and 5, changes of accuracy are quite small (<5%) and it is unclear if they reach statistical significance. Similarly, changes of RT are quite small and their statistical significance remains unclear. Because the DDM is only an approximation to the neural computations that underlie the behavior, it is critical that claims about speed-accuracy tradeoff are supported by changes in speed and/or accuracy, instead of only changes in model parameters.
The effects of confidence ratings on subsequent RTs and accuracies are qualitatively in line with a change in speed-accuracy tradeoff (i.e., increase in response caution after perceived errors) for all data sets. This is statistically significant for Experiment 2 (RTs: F(2,45) = 4.96, p = .011; accuracy: p = .362) and Experiment 3 (RTs: F(2,65.99) = 3.80, p = .027; accuracy: F(2,66) = 8.09, p < .001), but not quite for Experiment 1 (RTs: p = .10; accuracy: p = .357), which showed overall subtler effects.
We devised an aggregate, model-free behavioral measure of speed-accuracy tradeoff by multiplying median RT and mean accuracy at trial_{n+1}, for each level of confidence at trial_{n}. An increase in bound separation predicts larger values on this model-free measure of response caution. This is conceptually similar to the so-called “inverse efficiency score” (RT/accuracy), which is often used as a model-free index of drift rate (Bruyer and Brysbaert, 2011). This measure is now also presented, along with RT and accuracy separately, in Figure 3 and Figure 5, alongside the model predictions (green crosses). This measure shows a statistically significant effect of confidence ratings on subsequent speed-accuracy tradeoff for all three experiments (Experiment 1: F(2,81) = 3.13, p = .049; Experiment 2: F(2,45) = 3.21, p = .050; Experiment 3: F(2,66) = 14.43, p < .001). This is explained and reported in the Results section:
“A bound increase will increase both RT and accuracy, a trend that was evident in the data (Figure 3A, left). We multiplied median RT and mean accuracy as a function of confidence to combine both effects into a single, model-free measure of response caution (Figure 3A, right). This aggregate measure of response caution was predicted by the confidence rating from the previous decision, F(2,81) = 3.13, p = .049. Post-hoc contrasts showed increased caution after perceived errors compared to after both high confidence, z = 2.27, p = .032, and low confidence, z = 2.05, p = .041. There was no difference in caution following high and low confidence ratings, p = .823. The confidence-dependent change in subsequent response caution was explained by a DDM, in which decision bounds and drift rate could vary as a function of previous confidence (Figure 3A, green crosses; see Materials and methods).”
“As in Experiment 1, our model-free measure of response caution (RT*accuracy) was modulated by confidence ratings on the previous trial, F(2,45) = 3.21, p = .050 (perceived errors vs. high confidence: z = 2.53, p = .011; no significant differences for other comparisons ps > .178; Figure 5B).”
“Our model-free measure of response caution was again affected by previous confidence ratings, F(2,66) = 14.43, p < .001 (perceived errors vs. high confidence: z = 4.61, p < .001; perceived errors vs. low confidence: z = 4.69, p < .001; high vs. low confidence: p = .938; Figure 5B).”
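The model-free measure of response caution described in this response (median RT multiplied by mean accuracy on trial n+1, grouped by confidence on trial n) can be illustrated with a small sketch. This is not the authors' analysis code: the function name `response_caution`, the trial keys `rt`, `correct` and `confidence`, and the toy data are all assumptions made for illustration, and the sketch covers a single participant rather than the group-level tests reported above.

```python
from collections import defaultdict
from statistics import mean, median


def response_caution(trials):
    """Return {confidence on trial n: median RT * mean accuracy on trial n+1}.

    `trials` is a list of dicts in presentation order. The keys "rt" (seconds),
    "correct" (0 or 1) and "confidence" (a label) are hypothetical names.
    """
    rts = defaultdict(list)
    accs = defaultdict(list)
    # Pair each trial n with trial n+1 and group the latter by confidence on n.
    for prev, nxt in zip(trials, trials[1:]):
        rts[prev["confidence"]].append(nxt["rt"])
        accs[prev["confidence"]].append(nxt["correct"])
    return {level: median(rts[level]) * mean(accs[level]) for level in rts}


# Made-up toy data for one participant (not taken from the experiments).
trials = [
    {"rt": 0.60, "correct": 1, "confidence": "high"},
    {"rt": 0.50, "correct": 0, "confidence": "error"},
    {"rt": 0.90, "correct": 1, "confidence": "high"},
    {"rt": 0.55, "correct": 1, "confidence": "low"},
    {"rt": 0.70, "correct": 1, "confidence": "error"},
    {"rt": 1.00, "correct": 1, "confidence": "high"},
]
caution = response_caution(trials)
```

In this toy example the value after perceived errors is larger than after high confidence, which is the direction a bound increase would produce; the real analysis would compute the measure per participant and then test it across participants.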
Also, the data suggest that behavioral effects of confidence on subsequent decisions may be more complex than a simple trade off of speed and accuracy. For example, it seems that effects of confidence on subsequent decision bound were stronger for error trials than for correct trials. Is that right, and if so what is the interpretation?
We agree with the reviewers that it appears as if the effect of confidence on subsequent decision bound is stronger for error trials than for correct trials. However, note that the main issue here is that the effect holds when current-trial accuracy is held constant (i.e., either being correct or incorrect), so as to rule out that the increase in decision bound actually reflects post-error slowing. Because of the limited number of trials when separately fitting corrects and errors, the uncertainty in each of these estimates is much larger, as visible in the width of the posteriors. This in turn has a large influence on the estimates of individual participants, which are therefore much more constrained by the group posterior and hence much more homogeneous (particularly the case for current error trials).
It is not straightforward to directly compare the estimates of current correct and current error to each other, because they come from different models. Within a single model fit, parameters can be directly compared to each other by pairwise comparison of the traces from the Monte-Carlo Markov-Chain, but comparing different models to each other cannot be done because these models generate different chains. We agree that it is a very interesting question whether confidence is associated with adjustments of decision computation more complex than changes in speed-accuracy tradeoff (and relatedly: bound separation and drift). But we feel that answering this question will require different modeling approaches (e.g. more biophysically inspired models) that are beyond the scope of the current work.
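To make the idea of a pairwise comparison of traces within one model fit concrete, here is a minimal sketch. It assumes two posterior traces of equal length drawn from the same chain; the function name `prob_greater`, the variable names and the synthetic Gaussian traces are hypothetical and do not reproduce the fitted models.

```python
import random


def prob_greater(trace_a, trace_b):
    """Fraction of paired posterior samples in which parameter a exceeds b."""
    if len(trace_a) != len(trace_b):
        raise ValueError("traces must have the same length (same chain)")
    return sum(a > b for a, b in zip(trace_a, trace_b)) / len(trace_a)


# Synthetic traces standing in for two bound parameters of one model fit.
rng = random.Random(0)
bound_after_error = [rng.gauss(0.30, 0.05) for _ in range(5000)]
bound_after_high = [rng.gauss(0.00, 0.05) for _ in range(5000)]

p_error_above_high = prob_greater(bound_after_error, bound_after_high)
```

The pairing only makes sense sample by sample within one chain, which is why traces from separately fitted models (such as the current-correct and current-error fits) cannot be compared this way.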
3) It is critical to show that apparent effects of confidence on behavior are not mediated by other factors. The reviewers wonder whether there is something that varies from trial to trial, such as coherence, coherence volatility, or dot-clustering which could both affect confidence on the current trial and also prime the subject to adjust on the next trial, without a direct impact of confidence on the adjustment. Take volatility in experiment 1 for example: On half of trials coherence varied randomly from frame to frame, which would increase variance in the decision variable and thus plausibly affect confidence and also the setting of bound on the next trial. In the Materials and methods the authors simply state that this volatility manipulation doesn't matter, but the main relationship between confidence and bound on next trial could be mediated through this variation. Coherence itself could also be a driver of the relationship – to the extent that subjects can tell high from low coherence trials, they may lower their bound after low coherence trials and increase it after high coherence trials – this might not be beneficial but neither are many typical priming effects. Taking the error trials for example, trials in which the subjects most thought they were correct would be the low coherence trials, which on the basis of coherence itself induce a bound lowering, not necessarily mediated through confidence. Support for this interpretation comes from experiment 2, where stimulus difficulty was fixed. In this experiment, there was not a robust difference between high and low confidence decision bounds in Figures 5C or 6A/B. To support their conclusion, the authors should carefully consider mediating factors. They may be able to conduct mediation analyses to better test for confidence's direct impact despite such other factors, and may be able to explore the factors themselves to see if they have any such impact in the first place, e.g. testing for differences across coherence levels, volatility levels, and even within-condition motion energy or dot distribution.
Thank you for raising this important point. We ran additional analyses to test whether the experimental variables that we manipulated from trial to trial affected subsequent decision bound. For Experiment 1, we fitted a model that quantified the effect of current evidence strength (coherence) and current evidence volatility on subsequent decision bound, while allowing for subsequent drift rate to depend on subsequent coherence. This analysis did not show an effect of current volatility, p = .242, or current coherence, p = .133. For each of these, zero (i.e., no effect) was included in the 95% highest density interval of the posterior distribution (-.167 to .026, and -.020 to .039, respectively), suggesting it is likely that the actual parameter value is close to zero.
Likewise, we fitted a model to the data of Experiment 3 that tested the influence of current evidence strength on subsequent decision bound, while allowing for subsequent drift rate to depend on subsequent signal-to-noise ratio. Again, there was no effect of current signal-to-noise ratio, p = .220, and zero was included in the 95% highest density interval of the posterior (-.010 to .031). Given that neither of these variables affected subsequent decision bound, it seems unlikely that the effect of confidence on subsequent bound is mediated by one of these factors.
These additional control analyses are now reported in the corresponding places of the Results section:
“The effects of confidence ratings on subsequent decision bounds are unlikely to be caused by our systematic manipulation of evidence strength (i.e., motion coherence) or evidence volatility (Materials and methods). Confidence ratings were reliably affected by both evidence strength (Figure 2C) and evidence volatility (data not shown, F(1, 26.7) = 47.10, p < .001). However, evidence strength and volatility, in turn, did not affect subsequent decision bound, both ps > .133. For each of these, zero (i.e., no effect) was included in the 95% highest density interval of the posterior distribution (-.167 to .026, and -.020 to .039, respectively), suggesting it is likely that the true parameter value was close to zero.”
“As in Experiment 1, the systematic trial-to-trial variations of evidence strength (SNR) did not influence subsequent decision bound in Experiment 3 (p = .220), and zero was included in the 95% highest density interval of the posterior (-.010 to .031).”
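The check that zero lies inside a 95% highest density interval can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure used for the reported fits: the function `hdi` takes the narrowest window containing the requested share of sorted posterior samples, and the synthetic posterior below is made up.

```python
import math
import random


def hdi(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the posterior samples."""
    xs = sorted(samples)
    n = len(xs)
    k = math.ceil(mass * n)  # number of samples the interval must contain
    if not 1 <= k <= n:
        raise ValueError("need a non-empty sample and 0 < mass <= 1")
    # Slide a window of k consecutive sorted samples; keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Synthetic posterior for a regression weight (values are illustrative only).
rng = random.Random(1)
posterior = [rng.gauss(-0.07, 0.05) for _ in range(4000)]

low, high = hdi(posterior)
includes_zero = low <= 0.0 <= high
```

When `includes_zero` is true, a null effect remains among the most credible parameter values, which is how the intervals quoted above are interpreted.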
4) The composition of the dataset raises concerns. Three issues require clarification:
a) Did participants receive adequate practice prior to data collection to learn the tasks and show stable performance? In experiment 1, data collection started following only 180 practice trials. It is unclear that the practice included feedback and was sufficient for subjects to learn how to optimally use the sensory evidence or how to generate reliable confidence reports. Was the behavior stable throughout data collection? More important, how good were the data? Reaction times of a relatively large fraction of subjects seem to vary minimally across stimulus strengths in Figure 2. Also, many subjects seem to have accuracies far below 100% for the strongest stimuli in the experiment. Figure 2D suggests that ~20% of subjects have almost chance level accuracy even when they report they are quite sure to be correct! Similar problems – shortage of practice trials and unclear quality and stability of performance – are present in the other experiments too.
We agree with the reviewers that the Materials and methods should be expanded to fully unpack the procedures, and we apologize for failing to do so in our initial submission.
Several points were raised by the reviewers: a) nature of task practice, b) stability of performance, and c) the relation between confidence and accuracy (for Experiment 1 only). In what follows, we address each of these points, separately for each Experiment.
Experiment 1:
We now further clarify the details of the training blocks in the Materials and methods: “The experiment started with one practice block (60 trials) without confidence judgments (only 20% and 40% coherence) that was repeated until participants reached 75% accuracy. Feedback about the accuracy of the choice was shown for 750 ms. The second practice block (60 trials) was identical to the first, except that now the full range of coherence values was used. This block was repeated until participants reached 60% accuracy. The third practice block (60 trials) was identical to the main experiment (i.e., with confidence judgments and without feedback).”
To shed light on the quality of the data, we performed additional linear regression analyses on the raw data of each participant. These showed that coherence significantly affected RTs (p < .05) for 27 out of 28 participants (with a negative slope for all 28), whereas coherence significantly affected both accuracy and confidence (p < .05) in the expected direction for all 28 participants. Next, we added the variable experiment half (first half vs. second half) to these regressions, and tested for an interaction between coherence and experiment half. None of the 28 participants showed an interaction between experiment half and coherence in predicting accuracy, whereas this was the case for 5 participants when predicting RTs (one participant became more sensitive to coherence in the second half) and for 6 participants when predicting confidence (one participant became more sensitive to coherence in the second half). Finally, in Author response image 1 we included raw RTs and accuracy per block for the first six participants of Experiment 1, to demonstrate the stability of performance over the course of the experiment (note that this was highly similar for the other participants).
As becomes clear from Author response image 1, the data of the majority of participants was of high quality, and rather stable throughout the entire experiment. In order to keep the manuscript concise and to the point, we decided not to include these additional quality checks in the manuscript, but we are glad to do so if the reviewers deem this to be helpful. Finally, to fully satisfy the concern raised about data quality, we reran the main analysis on the subset of 17 participants who showed significant scaling of RTs, accuracy and confidence with coherence, and who did not show a significant interaction between coherence and experiment half on any of these measures. This analysis showed that subsequent decision bound separation was numerically higher when participants had low confidence in their choice (M = .04, p = .156) and significantly increased when participants perceived an error (M = .30, p < .001); the latter two were also different from each other (p < .001). There were no significant effects on drift rate, ps > .308.
Also, many subjects seem to have accuracies far below 100% for the strongest stimuli in the experiment.
Note that perfect accuracy is not to be expected here because participants were explicitly motivated to perform the experiment as fast and accurately as possible. Therefore, errors are expected even at the highest coherence level, probably because participants put their decision bound too low. Evidence for this comes from the observation that, on average, 65% of the errors committed on trials with 40% coherence (26 out of 51) were judged as perceived errors, suggesting that these errors do not result from poor data quality, but rather reflect genuine premature responses.
Figure 2D suggests that ~20% of subjects have almost chance level accuracy even when they report they are quite sure to be correct!
Thanks for spotting this. This figure was based on a subset of blocks in which participants jointly indicated their level of confidence and choice. There were four participants who misunderstood the instructions (i.e., they used the response scale as if it ranged from certainly correct to certainly wrong, rather than from certainly left to certainly right), and consequently performed at chance level. While these participants were included in the main analysis (i.e., where they did use the scale correctly and performed well in the separated choice-confidence blocks), they should obviously not have been included in these analyses. Note that reviewer 2 (comment 5) commented that these data are not a fair comparison because only half the range of the confidence scale is available (compared to the whole confidence scale in the separated choice and confidence condition), and so we decided to drop these data, given that they were not critical for our conclusion.
Experiment 2:
We now further clarify the details of the training blocks: “The experiment started with one practice block with feedback without confidence judgments but with performance feedback (48 trials), and one practice block with confidence judgments but without feedback (48 trials)”.
Because only a single level of difficulty was used throughout the entire experiment, the quality checks reported above cannot be carried out on these data.
Experiment 3:
We now further clarify the details of the training blocks in the Materials and Methods. To shed light on the quality of the data, the same linear regression analyses were run as for Experiment 1. These showed that signal-to-noise ratio significantly affected RTs (p <.05) for 7 out of 23 participants (with the expected negative slope for 19 out of 23 participants), and it significantly affected accuracy for 21 out of 23 participants (22 in the expected direction), and confidence for 18 out of 23 participants (20 in the expected direction). Note that this low number of significant effects of signal-to-noise ratio on RTs does not necessarily imply low signal quality; because of the limited number of trials per participant, it could well be that the effect is present (i.e., the slopes are in the correct direction for most participants) but statistical power is simply too low to detect it at the individual participant level. Next, as before, we added experiment half to these regressions. None of the 28 participants showed an interaction between experiment half and coherence in predicting RTs or in predicting confidence, whereas this was the case for one participant when predicting accuracy (s/he became more sensitive to SNR in the second half). Nevertheless, to fully satisfy the concern raised about data quality, we reran the main analysis on the subset of 7 participants who showed significant scaling of RTs, accuracy and confidence with coherence, and who did not show a significant interaction between coherence and experiment half on any of these measures. This analysis showed that subsequent decision bound separation was decreased when participants had low confidence in their choice (M = .08, p =.041) and nonsignificantly increased when participants perceived an error (M =.07, p =.112); the latter two were different from each other (p =.010). There were no significant effects on drift rate, ps >.124.
b) What fraction of subjects support the main results? The analysis methods of the paper prevent a clear answer to this question. Consequently, we cannot tell that the average trends in the population are not generated by a mixture of diverse (or even opposing) trends across participants. For example, Figures 3 or 5 could include subjects that go against the population average for changes of RT and accuracy, or changes of decision bound. Can the authors clarify if individual participant's results match the population average and whether there are subjects that deviate from the average trends?
Unfortunately, this question is difficult to answer given the approach we have opted to pursue for this particular paper. Upfront, we would like to clarify that there are two possible approaches in behavioral modeling studies: i) gathering a lot of data from a relatively small (N<=10) sample of participants and fitting each participant at the individual level, using independent fits; or ii) gathering fewer trials per participant from a larger sample of participants, and then resorting to hierarchical Bayesian model fitting approaches, which pool trials across participants to estimate parameters at the group level. We feel that there is no generally “right” or “wrong” approach; rather, the optimal choice of approach depends on the specific question at hand. Indeed, in our previous work, we have used both of these approaches.
In the present study, we opted for the second approach, because we expected substantial interindividual differences, both in the way the confidence ratings are used and in how they are translated into adjustments of subsequent decision processing.
Confidence ratings are often unevenly distributed. Although the Bayesian fitting method is very powerful – especially when trial counts are low and/or unevenly distributed – it comes at the cost that parameters are not readily interpretable at the participant level, because individual estimates are constrained by the group-level prior.
We now acknowledge this limitation explicitly in the Discussion:
“The model fits in Figure 3 and 5 suggest that the effect is rather consistent across participants. For example, the increased decision bound following perceived errors in Experiments 1, 2 and 3 is found for all but one, two, and four participants, respectively. However, these model fits are realized by relying on a hierarchical Bayesian version of the DDM (Wiecki, Sofer, and Frank, 2013a). One advantage of this method is that participants with low trial counts in specific conditions due to the idiosyncratic nature of confidence judgments, can contribute to the analysis: Data are pooled across participants to estimate posterior parameter estimates, whereby the data of a participant with low trial counts in a specific condition will contribute less to the posterior distribution of the respective condition. Individual-subject estimates are constrained by the group posterior (assumed to be normally distributed), and estimates with low trial counts are pulled towards the group average. A limitation of this procedure is that it precludes strong conclusions about the parameter estimates from individual participants. Future studies should collect extensive data from individual participants in order to shed light on individual differences in confidence-induced bound changes.”
c) Can the authors clarify the trial numbers in subsection “Decision confidence influences subsequent decision bound”? Does 831 mean between 1 and 83 trials?
Yes, that is indeed the case, and it is a consequence of the approach we pursued here (see previous point). The combination of relatively few trials per participant and an uneven distribution of individual confidence ratings could lead, in a few cases, to very low trial counts for certain subjects in certain confidence conditions. The hierarchical Bayesian model fitting procedure is specifically designed for situations like this (Wiecki et al., 2012).
Histograms of the trial counts for the three confidence bins are shown in Author response images 2, 3 and 4 (note, blue = high confidence, red = low confidence, green = perceived errors). Overall, trial counts smaller than 10 per subject were rare in these data sets (note that Experiment 3 comprised Exp 3a and 3b, see Methods and Materials, causing the difference in overall trial counts seen).
If yes, how could 1 trial be sufficient for fitting the model to individual subject's data?
The hierarchical Bayesian approach does not fit the model to individual subject’s data, but rather it jointly fits the data of the entire group. Therefore, data from participants with low trial counts in a certain condition do not contribute much to the posteriors for the respective condition. Participant-level estimates are estimated, but these are constrained by the group-level estimate. The influence of the group-level estimate on the participant-level estimate is inversely related to the uncertainty in the estimate: the smaller the uncertainty in the group estimate, the larger its influence on the participant-level estimates. Importantly, the same holds for the other direction: the less data available for a certain participant in a certain condition, the more that estimate is going to be constrained by the group parameter. Therefore, it is possible to estimate the model with only a single trial in a condition for a certain participant, although for this specific participant this specific estimate is going to be (almost) entirely determined by the group estimate. This can, for example, be best appreciated in Figure 4, where the participant-level estimates (i.e., the dots) for error trials (4B) are much more homogeneous than those for correct trials (4A). This (likely) reflects the fact that far fewer error trials are available for each participant, and although it is still possible to estimate a group-level estimate (because data are pooled across participants), the participant-level estimates are very much influenced by the group-level estimate. We now added the following in the Materials and methods:
“The hierarchical Bayesian approach does not fit the model to individual subject’s data, but rather it jointly fits the data of the entire group. Therefore, data from participants with low trial counts in certain conditions does not contribute much to the posteriors for the respective condition. At the same time, participant-level estimates are estimated, but these are constrained by the group-level estimate. One obvious advantage of this approach is that participants with unequal trial numbers across conditions can contribute to the analysis, whereas in traditional approaches their data would be lost.”
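The pull of participant-level estimates towards the group estimate can be illustrated with the textbook normal-normal case, where the posterior mean is a precision-weighted average of the participant's own mean and the group mean. This is a deliberately simplified sketch, not the sampler used by HDDM; the function name and all numerical values are assumptions chosen for illustration.

```python
def shrunken_estimate(participant_mean, n_trials, trial_sd, group_mean, group_sd):
    """Posterior mean of a participant parameter under a normal group prior.

    Returns (posterior_mean, weight_on_own_data).
    """
    data_precision = n_trials / trial_sd ** 2    # grows with trial count
    prior_precision = 1.0 / group_sd ** 2        # grows as group uncertainty shrinks
    w = data_precision / (data_precision + prior_precision)
    return w * participant_mean + (1.0 - w) * group_mean, w

# Hypothetical participant whose raw estimate (3.0) is far from the group mean (1.5).
one_trial, w_one = shrunken_estimate(3.0, n_trials=1, trial_sd=1.0,
                                     group_mean=1.5, group_sd=0.1)
many_trials, w_many = shrunken_estimate(3.0, n_trials=83, trial_sd=1.0,
                                        group_mean=1.5, group_sd=0.1)
```

With a single trial the weight on the participant's own data is close to zero, so the estimate sits almost on the group mean; with 83 trials the same participant's data move the estimate noticeably away from it.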
5) The reviewers appreciate the authors' attempt to remove confounds caused by slow fluctuations of RT, accuracy and confidence. However, it is unclear that the correction procedure is adequate. Three related concerns are raised:
We thank the reviewers for raising this very important point. The comment suggests to us that the reviewers appreciate the necessity of factoring out slow drifts in overall performance. Note that in the previous version of our manuscript, we reported correlations between RT, confidence and accuracy on trial_{n-1} and trial_{n}, which – we realized – can result from slow drifts in performance and/or strategic trial-by-trial effects. Therefore, these have been removed and we now report the following:
“Indeed, slow (‘scale-free’) fluctuations similar to those reported previously (Gilden, 2003; Palva et al., 2013) were present in the current RT and confidence rating time series as assessed by spectral analysis. Slopes of linear fits to log-log spectra were significant for both RTs, b = .42, t(23) = 11.21, p <.001, and confidence, b = .60, t(23) = 13.15, p <.001 (data not shown)”.
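One common way to obtain the slope of a log-log spectrum is to regress log power on log frequency for a single time series. The sketch below assumes a plain periodogram and an unweighted fit over all non-zero frequencies; the response does not state which estimator or frequency range was used, so these choices and the function name are assumptions.

```python
import numpy as np

def loglog_spectral_slope(series):
    """Slope of a straight-line fit to log10(power) versus log10(frequency)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                              # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2           # raw periodogram
    freqs = np.fft.rfftfreq(len(x))               # cycles per trial
    keep = (freqs > 0) & (power > 0)              # avoid log10(0)
    slope, _intercept = np.polyfit(np.log10(freqs[keep]), np.log10(power[keep]), 1)
    return slope

# Synthetic check: white noise has a flat spectrum, a random walk a steep one.
rng = np.random.default_rng(0)
white = rng.normal(size=4096)
slope_white = loglog_spectral_slope(white)
slope_walk = loglog_spectral_slope(np.cumsum(white))
```

A series dominated by slow fluctuations yields a clearly negative slope, whereas a series with no temporal structure yields a slope near zero.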
Upfront, we would like to solidify the intuition that some correction of slow performance drifts is, in fact, critical to assess the rapid, confidence-dependent bound updating effects we set out to test here. In a simulation shown in Author response image 5, we took the data from participant 1 in Experiment 1 and updated the bound based on the participant’s actual confidence ratings (high confidence: .05, medium confidence: +.08, perceived error: +.2; note that these values were directly based on the fit of Experiment 1). Thus, all variations in decision bound are driven by confidence (left panel of Author response image 5).
When conditioning the decision bound from trial n+1 on confidence rating from trial n, this shows a dip in subsequent decision bound following low confidence (right upper panel: “uncorrected analysis”; high confidence M = 1.67, medium confidence M = 1.63, perceived errors M = 1.87), despite the fact that the bounds were forced to increase (by our simulation design) following medium confidence. In contrast, when using our approach of correcting for slow performance drifts (i.e. subtracting the bound conditioned on confidence on trial n+2), the resulting values closely match the ground truth (right lower panel: “correcting for n+2”; high confidence M = .05, low confidence M =.08, perceived error M =.32). This indicates that the approach is appropriate for dealing with slow performance drifts.
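The logic of the n+2 correction can be illustrated with a toy simulation. This is not the authors' simulation: here a slow sinusoidal state (an assumption) shifts both the confidence distribution and the bound, and the generating effects of confidence on the next bound are illustrative values loosely modeled on the magnitudes quoted above, with the sign for high confidence assumed to be negative.

```python
import numpy as np

TRUE_DELTA = np.array([-0.05, 0.08, 0.20])   # assumed effects: high, low, perceived error

def simulate(n_trials=200_000, period=500, seed=0):
    """Confidence codes (0=high, 1=low, 2=error) and bounds with a slow shared drift."""
    rng = np.random.default_rng(seed)
    slow = np.sin(2 * np.pi * np.arange(n_trials) / period)
    p_high = 0.5 + 0.3 * slow                 # high confidence is likelier in "good" periods
    p_low = 0.3 - 0.2 * slow
    u = rng.random(n_trials)
    conf = np.where(u < p_high, 0, np.where(u < p_high + p_low, 1, 2))
    bound = 1.7 + 0.3 * slow + rng.normal(0, 0.05, n_trials)
    bound[1:] += TRUE_DELTA[conf[:-1]]        # confidence on trial n shifts bound on n+1
    return conf, bound

def conditional_bounds(conf, bound):
    """Mean bound on trial n+1 given confidence on trial n, raw and n+2-corrected."""
    nxt = bound[1:-1]                         # bound on trial n+1
    conf_n, conf_n2 = conf[:-2], conf[2:]     # confidence on trials n and n+2
    raw = np.array([nxt[conf_n == c].mean() for c in range(3)])
    baseline = np.array([nxt[conf_n2 == c].mean() for c in range(3)])
    return raw, raw - baseline

conf, bound = simulate()
raw, corrected = conditional_bounds(conf, bound)
```

In this toy setting the raw contrast between low and high confidence comes out with the wrong sign because of the shared drift, while the corrected contrasts land much closer to the generating values, although they need not match them exactly.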
However, we acknowledge that it comes at the cost of some ambiguity due to possible effects of decision bounds on subsequent confidence ratings (although there is no theoretical rationale for such an effect). We therefore now show that an alternative procedure for removing the slow performance drifts yields largely the same results as the approach used in the main paper and the above simulation (see our response to part b. of this comment).
We now add:
“If the decision bound is not fixed throughout the experiment (i.e., our data suggest that it is dynamically modulated by confidence), it is theoretically possible that specific levels of confidence judgments appear more frequently at specific levels of decision bound (e.g., everything else equal, high confidence trials are expected mostly in periods with a high decision bound).”
a) It is unclear that the underlying assumption of the analysis is correct. Of course confidence on trial n+2 cannot influence the decision bound on n+1, but could the decision bound on n+1 influence confidence on n+2? In both directions, these are just correlations. What support is there for the causal claim the authors are making of confidence on trial n changing bound on n+1?
We fully agree that claims about causality are not warranted given the correlative nature of our approach. We have now rephrased or toned down all statements that could be read as implying causality. This includes even a change in the title. That said, we would like to highlight that (i) our main conclusion does not depend on our specific approach for correcting for slow performance drifts (see below), and (ii) the temporal ordering of effects (confidence rating on trial n predicts bound on trial n+1) satisfies a commonly used (weak) criterion for causality.
b) The motivation of the analysis is to remove generic autocorrelations from the data (due to, e.g., long periods of high/low confidence correlated with low/high RT), under the assumption that trial n+2 confidence cannot be causally associated with trial n+1 decision bound. However, based on the unnormalized, "simple effects of confidence on decision bound," the effects seem to be mainly coming from trial n+2, rather than trial n (e.g., Figure 5—figure supplement 1, Figure 8—figure supplement 4, and Figure 5—figure supplement 3). This is fairly consistent in the behavioral data and is especially striking for the relation between Pe amplitude and decision bound. How does it affect whether we believe trial n+2 is an appropriate baseline measure? More important, these observations appear to challenge the main claim that confidence-dependent bound changes are shaped by trial n. Can the authors clarify why this challenge is not critical?
We have clarified in two independent ways that this challenge is not critical. First, for Experiments 1 and 3, the main results hold in the ‘raw measures’ without subtraction of trial n+2 confidence effects (Figure 3—figure supplement 2 and Figure 5—figure supplement 5). Only for Experiment 2 is the effect not quite significant (Figure 5—figure supplement 2; but note that such an absence of effect is hard to interpret given the simulation reported above). Second, we now show that our main conclusions remain unchanged when using a complementary approach to control for slow performance drifts: namely one that is analogous to the approach established in the post-error slowing literature. Here, the concern is that post-error and post-correct trials are not evenly distributed across the course of an experiment. Errors typically appear in the context of errors, a period also characterized by slow reaction times, hence creating an artificial link between errors on trial n and slow RTs on trial n+1. To eliminate this confound, Dutilh et al. (2012) proposed to compare post-error trials to post-correct trials that originate from the same locations in the time series. To achieve this, they examined performance (i.e., RT, accuracy) on trial n+1 when trial n was an error, compared to when trial n was correct and trial n+2 was an error (thus controlling for attentional state). We adapted this approach to our case (trial-to-trial variations in confidence ratings) as follows: as in our previous approach, we fitted an HDDM regression model to the data, estimating subsequent decision bound as a function of decision confidence. However, we no longer added confidence on trial n+2 as a regressor; instead, we only selected trials that originate from the same (“attentional state”) locations in the time series.
Specifically, when comparing subsequent decision bound as a function of high confidence versus subjective errors, we compared the decision bound on trial n+1 when trial n was a perceived error, compared to when trial n was judged with high confidence and trial n+2 was a perceived error. In order to compare high confidence and low confidence, we compared the decision bound on trial n+1 when trial n was judged with low confidence, compared to when trial n was judged with high confidence and trial n+2 was judged with low confidence. The results are shown in Figure 3—figure supplement 3, Figure 5—figure supplement 3 and 6, and Figure 8—figure supplement 4. In brief, using this approach we obtained largely the same findings as reported before, even when only selecting correct trials (there were not enough trials to only select errors).
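The trial selection described above can be sketched as a pair of index masks over the confidence time series. The integer coding of the confidence categories and the function name are hypothetical, and the sketch covers only the selection step, not the subsequent HDDM fit.

```python
import numpy as np

HIGH, LOW, ERROR = 0, 1, 2   # hypothetical coding of the three confidence categories

def matched_trial_sets(conf, target):
    """Indices of trial n+1 for the two trial types that are compared.

    First array: trial n was `target` (LOW or ERROR).
    Second array: trial n was HIGH and trial n+2 was `target`.
    Only trials n for which trial n+2 exists are considered.
    """
    conf = np.asarray(conf)
    n = np.arange(len(conf) - 2)
    post_target = n[conf[n] == target]
    matched_high = n[(conf[n] == HIGH) & (conf[n + 2] == target)]
    return post_target + 1, matched_high + 1

example = [HIGH, LOW, ERROR, HIGH, HIGH, ERROR, LOW, HIGH, LOW]
post_error, high_pre_error = matched_trial_sets(example, ERROR)
post_low, high_pre_low = matched_trial_sets(example, LOW)
```

Both sets of n+1 trials sit next to a trial of the target category, which is what makes them comparable with respect to slow drifts; the price is that many trials are discarded.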
The fact that we obtained largely the same findings using a different control further strengthens our confidence in our findings. After careful deliberation, we decided to keep the original analyses in the main paper and show the results from this complementary approach as controls in the supplementary materials. This is because our initial analysis allows us to use all of the data and to fit these using a single model, making it a much more statistically powerful approach. By contrast, in the new approach a heavy subselection of trials is required, and a separate model needs to be fitted for each comparison. If the editors and reviewers prefer to see the results from the alternative approach in the main figures, we would be glad to move them there.
We now added the full explanation of this in the Materials and methods section:
“A possible concern is that the decision bound on trial_{n+1} affected confidence ratings on trial_{n+2}, which would complicate the interpretation of the results of our approach. Thus, we also used a complementary approach controlling for slow drifts in performance, which is analogous to an approach established in the post-error slowing literature (Dutilh et al., 2012; Purcell and Kiani, 2016). In that approach, post-error trials are compared to post-correct trials that are also pre-error. As a consequence, both trial types appear adjacent to an error, and therefore likely stem from the same location in the experiment. We adapted this approach to confidence ratings as follows: decision bound and drift rate on trial_{n+1} were fitted in separate models where i) we compared the effect of low confidence on trial_{n} to high confidence on trial_{n} for which trial_{n+2} was a low confidence trial, and ii) we compared the effect of perceived errors on trial_{n} to high confidence trials on trial_{n} for which trial_{n+2} was a perceived error. Thus, this ensured that the two trial types that were compared to each other stemmed from statistically similar environments. For the EEG data, we fitted a new model estimating decision bound and drift rate on trial_{n+1} when trial_{n} stemmed from the lowest Pe amplitude quantile, compared to when trial_{n} stemmed from the highest Pe amplitude quantile and trial_{n+2} stemmed from the lowest Pe amplitude quantile.”
And the results are referred to in the Results section:
“A possible concern is that the decision bound on trial_{n+1} affected confidence ratings on trial_{n+2}, which would confound our measure of the effect of confidence on trial_{n} on decision bound on trial_{n+1}. Two observations indicate that this does not explain our findings. First, the observed association between confidence ratings on trial_{n} and decision bound on trial_{n+1} was also evident in the “raw” parameter values for the bound modulation, that is, without removing the effects of slow performance drift (Figure 3—figure supplement 2). Second, when using a complementary approach adopted from the post-error slowing literature (Dutilh, Ravenzwaaij, et al., 2012; see Materials and methods), we observed largely similar results (see Figure 3—figure supplement 3).”
“Finally, we again observed a robust effect of confidence ratings on subsequent decision bound when using the above described alternative procedure to control for slow performance drift (Figure 5—figure supplement 3 and Figure 5—figure supplement 6). In Experiment 3 (Figure 5—figure supplement 5) but not in Experiment 2 (Figure 5—figure supplement 2), this effect was also present without controlling for slow performance drifts.”
“Similar findings were obtained using our alternative approach to control for slow performance drifts (Figure 8—figure supplement 4)”
c) The authors use the same model to quantify the effect of trial n or n+2 on trial n+1 and subtract the two effects to filter out slow fluctuations. This method would be successful only to the extent that the applied models are good at fitting the data. If a model fails to adequately capture the data, the residual error could contain slow fluctuations of behavior asymmetrically for the n/n+1 and n+2/n+1 analyses, which could contribute to results. There are reasons to be concerned because we do not know how well the DDM fits the behavior of individual participants, especially their RT distributions (not just the mean RT across the population; also consider large lapses and shallow RT functions). Showing residual errors of the model for RT and the autocorrelation of these residual errors would be useful to alleviate this concern.
As suggested, for each experiment we simulated data from the model and then subtracted the predicted RTs from the observed RTs in order to obtain residuals from the model. In Author response image 6 we show for each experiment that the residuals are not different between the different conditions (left) and that there is no autocorrelation in these residuals (right).
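A residual check of this kind can be sketched as follows, assuming the residuals are simply observed minus model-predicted RTs kept in trial order. The function name is hypothetical and the synthetic series stand in for real residuals; the exact procedure behind Author response image 6 is not specified in the text.

```python
import numpy as np

def autocorrelation(residuals, max_lag=10):
    """Sample autocorrelation of a residual series at lags 1..max_lag."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    denom = r @ r
    return np.array([(r[:-k] @ r[k:]) / denom for k in range(1, max_lag + 1)])

# Synthetic residuals: unstructured noise versus a slowly drifting series.
rng = np.random.default_rng(2)
noise = rng.normal(size=5000)
acf_noise = autocorrelation(noise)
acf_drift = autocorrelation(np.cumsum(noise))
```

If the model captured the slow fluctuations, the residual autocorrelations should hover around zero at all lags; slowly drifting residuals would instead show large positive values at short lags.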
6) A point that deserves discussion and possibly looking into is the role of RT in determining the relationship between confidence and subsequent adjustment. From past papers, in varying-difficulty contexts, lower confidence can result from a decision reaching commitment at too low a level of cumulative evidence, OR alternatively from a crossing made later in the trial. The latter is more an indication that the current trial had weaker evidence than that the bound was not high enough. Setting the bound higher for the latter kind of trial would bring an increase in accuracy but at a vast cost in terms of time. Bound increases for low confidence do not clearly seem to be universally beneficial, and it would seem to depend on the cost of time, presence of deadlines etc. This kind of territory may be covered in papers like Meyniel et al., but either way, it seems a matter worth discussing in the current paper.
We agree this is an interesting point. We now touch upon this issue in the Discussion:
“Trial-to-trial variations in decision confidence likely result from several factors. For example, confidence might be low because of low internal evidence quality (i.e., low drift rate) or because insufficient evidence has been accumulated before committing to a choice (i.e., low decision bound). When the bound is low and results in low confidence, it is straightforward to increase the bound for the subsequent decision in order to improve performance. When drift rate is low, increasing the subsequent bound might increase accuracy only slightly, but at a vast cost in terms of response speed. Future work should aim to unravel to what extent strategic changes in decision bound differ between conditions in which variations in confidence are driven by a lack of accumulated evidence or by a lack of instantaneous evidence quality.”
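The tradeoff described in this passage can be made concrete with the standard closed-form expressions for a simple drift diffusion process with an unbiased starting point and no across-trial variability. This is a simplification of the fitted model; the drift and bound values, the noise scaling of 1, and the function name are all assumptions for illustration.

```python
import math

def ddm_accuracy_and_mean_dt(drift, bound, noise_sd=1.0):
    """Accuracy and mean decision time for bounds at 0 and `bound`, start at bound/2.

    Mean decision time excludes any non-decision time.
    """
    k = drift * bound / noise_sd ** 2
    accuracy = 1.0 / (1.0 + math.exp(-k))
    if drift == 0:
        mean_dt = bound ** 2 / (4.0 * noise_sd ** 2)   # limit as drift goes to zero
    else:
        mean_dt = bound / (2.0 * drift) * math.tanh(k / 2.0)
    return accuracy, mean_dt

# Raising the bound from 1.5 to 2.0 under a low and a high drift rate.
acc_lo_a, dt_lo_a = ddm_accuracy_and_mean_dt(drift=0.2, bound=1.5)
acc_lo_b, dt_lo_b = ddm_accuracy_and_mean_dt(drift=0.2, bound=2.0)
acc_hi_a, dt_hi_a = ddm_accuracy_and_mean_dt(drift=2.0, bound=1.5)
acc_hi_b, dt_hi_b = ddm_accuracy_and_mean_dt(drift=2.0, bound=2.0)
```

With the low drift rate, the same bound increase buys only a small gain in accuracy while adding considerably more decision time than it does with the high drift rate, which is the asymmetry the quoted paragraph points to.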
https://doi.org/10.7554/eLife.43499.031
Article and author information
Author details
Funding
Fonds Wetenschappelijk Onderzoek (FWO [PEGASUS]² Marie Skłodowska-Curie fellow)
 Kobe Desender
Economic and Social Research Council (PhD studentship)
 Annika Boldt
Wellcome (Sir Henry Wellcome Postdoctoral Fellowship)
 Annika Boldt
Deutsche Forschungsgemeinschaft (DO 1240/21)
 Tobias H Donner
Deutsche Forschungsgemeinschaft (DO 1240/31)
 Tobias H Donner
Fonds Wetenschappelijk Onderzoek (G010419N)
 Kobe Desender
 Tom Verguts
Deutsche Forschungsgemeinschaft (SFB 936/A7)
 Tobias H Donner
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
Thanks to Jan-Willem de Gee for help with the drift diffusion modeling and to Peter Murphy, Konstantinos Tsetsos, Niklas Wilming, Anne Urai and Cristian Buc Calderon for useful comments on an earlier version of the manuscript.
Ethics
Human subjects: Written informed consent and consent to publish was obtained prior to participation. All procedures were approved by the local ethics committee of the University Medical Center, Hamburg-Eppendorf (PV5512).
Senior Editor
 Michael J Frank, Brown University, United States
Reviewing Editor
 Roozbeh Kiani, New York University, United States
Reviewer
 Simon P Kelly, University College Dublin, Ireland
Publication history
 Received: November 8, 2018
 Accepted: August 16, 2019
 Accepted Manuscript published: August 20, 2019 (version 1)
 Version of Record published: August 27, 2019 (version 2)
Copyright
© 2019, Desender et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.