Confidence phenotypes: a unified computational account of value and decision certainty in reinforcement learning

Nicolás A Comay; Guillermo Solovey; Pablo Barttfeld

doi:10.7554/eLife.111820.1

Introduction

Humans routinely face uncertainty, about the state of the world, the outcomes of their actions, and the intentions of others. To navigate this uncertainty, the brain constructs internal beliefs and attaches to them a sense of confidence: a graded estimate of how reliable those beliefs are (Meyniel, Sigman, et al., 2015). Confidence, in this broad sense, permeates virtually every domain of cognition. For instance, it shapes how we learn from feedback (Meyniel, Schlunegger, et al., 2015), monitor and adjust decisions (Yeung & Summerfield, 2012), seek information (Desender et al., 2018), and coordinate with others (Bang et al., 2017). Its pervasive influence has led to the view that confidence is a central component of behavioural control (Schulz et al., 2023), and a key construct in transdiagnostic models of mental health (Hoven et al., 2019).

Despite its ubiquity, confidence is typically treated as a summary of how certain the brain is about its own states. This assumption has been highly productive, primarily through the use of perceptual paradigms and the study of decision confidence—the subjective probability of having made a correct choice (Figure 1; Fleming, 2024; Pouget et al., 2016). In this framework, confidence can bias subsequent choices (Lisi et al., 2020) or determine when to stop accumulating evidence (Balsdon et al., 2020; Balsdon & Philiastides, 2024), yet it is usually conceived as a more reflective quantity—a metacognitive evaluation of one’s accuracy—rather than as a variable exerting a direct or instrumental influence on the decision process. In contrast, research in human reinforcement learning (RL) has centered on value confidence—the confidence or certainty in the values or expected outcomes associated with available options (Figure 1)—which plays a pivotal role in adaptive behavior by, for instance, governing the rate of belief updating, modulating the exploration-exploitation trade-off, and determining how new information is weighted against prior expectations (Behrens et al., 2007; Boldt et al., 2019; Nassar et al., 2010; Payzan-LeNestour & Bossaerts, 2011).

Different levels of confidence in reinforcement learning contexts.
Value confidence (top) represents the certainty about latent states of the environment, such as the value of one option in multi-armed bandit tasks. Under a Bayesian framework, value confidence can be straightforwardly modelled as the inverse of the standard deviation of a posterior distribution over the option’s estimated value (represented by 0 in the figure). By observing subsequent rewards, not only the estimation of the option’s value gets more precise but the posterior shrinks, which is reflected in an increase in value confidence. Decision confidence (bottom) reflects the certainty of having made a correct choice. Under a Bayesian formulation, it can be understood as the probability of being correct, illustrated here by the area under the decision variable (the distribution obtained by subtracting the unchosen option’s posterior from the chosen option’s posterior) that it is greater than zero. As trials progress and certainty about the environment increases, decision confidence better distinguishes between correct and incorrect responses. This increase in correct trials and decrease in incorrect trials is known as the “folded-x pattern” of confidence (Hangya et al., 2016; Sanders et al., 2016).

The contrast between the approaches used in perceptual and reinforcement-learning paradigms exposes a fundamental tension in how these two traditions conceptualize confidence: while perceptual models define it as a reflective readout of certainty, learning models reveal it as a generative signal that drives behaviour. How these roles coexist within a single computational architecture remains unknown—especially given that confidence itself is highly idiosyncratic (Navajas et al., 2017). Indeed, while individuals tend to report confidence consistently across time (Ais et al., 2016), their mappings between internal uncertainty and reported confidence vary widely across the population, suggesting that distinct computational strategies may underlie how uncertainty is evaluated and used to guide behaviour. Only a handful of studies have begun to address these questions in human RL (Boldt et al., 2019; Salem-Garcia et al., 2023; Ting et al., 2023), leaving open how the brain integrates confidence across the representational hierarchy—from beliefs about the world to confidence in the choices derived from them—and how it shapes decisions under uncertainty across individuals.

Here, we develop a unified framework that distinguishes multiple computational forms of confidence within reinforcement learning. Using new and previously published datasets, we show that value confidence is captured by the precision of Bayesian value estimates, whereas decision confidence reflects a hybrid computation combining probability of being correct with a more global estimate of value certainty. The relative weighting of these components characterises individual computational profiles—’confidence phenotypes’—that predict performance, exploration strategies, and metacognitive accuracy. Together, these findings suggest that confidence is best understood not as a single scalar quantity, but as a set of related computations that jointly regulate learning and choice under uncertainty.

Results

Value confidence emerges from Bayesian inference

We began by modeling value confidence, i.e.: the confidence on the estimated values of the options. In the value confidence task from Boldt et al. (2019), on each trial participants (N=21) observed a reward randomly sampled from one of two arm-bandits. Then, they reported their belief on their mean value of that option and their confidence on this estimation. We tested several models to fit these confidence ratings: one model set was based on extensions of the Rescorla-Wagner (RW) algorithms, and the others were based on Bayesian inference (see Methods for details). Figure 2a depicts model fitting results of the best two models to this dataset (see Supplementary Figure 1 for the predictions of all tested models). We found that value confidence ratings were best explained by a Bayesian model (Model frequency (p_model) = 0.60, exceedance probability (p_exc.) > .99, protected exceedance probability (p_p.exc.) > .99; Figure 2c), in which value confidence reflected the inverse of the standard deviation of the posterior distribution over the estimated value of the option at play.

Value confidence is captured by a Bayesian model.
(a) Model fitting results for the two best models in the Boldt et al. (2019) data: the Bayesian model, where value confidence reflects the inverse of the standard deviation of the posterior distribution over an option’s inferred value, and the Rescorla-Wagner model, where value confidence equals to the square root of the number of rewards observed of a specific option. First row: models’ fits to all the data. As blocks had different lengths, we divided blocks in four quantiles with respect to the trials to be able to pool all blocks together (x axis). Second row: example of models’ fits in one block only. (b) Model fitting results to the Quandt et al. (2022) data. First row: models’ fits to Exp. 1 data. Second row: models’ fits to Exp. 2 data. Models that are a function of the number of rewards experienced (such as the RW sqrt(n) model) cannot account for the negative effect that the spread of the reward distribution has on value confidence. (c) Model comparison results. The Bayesian model was the best fitting model in Boldt et al. (2019) data. (d) Model comparison results. The Bayesian model was the best fitting model in both experiments in Quandt et al. (2022) data. In panels (a) and (b) error bars represent the standard error of the mean (SEM) of the behavioral data, shaded regions represent the SEM of models’ predictions and dots represent individual averages.

To test whether the Bayesian account of value confidence generalizes beyond the specific conditions of the Boldt et al. (2019) task, we next applied our models to data published by to Quandt et al. (2022) (N=62 for their Experiment 1; N=60 for their Experiment 2). In their task, participants saw a sequence of a hundred rewards from one option in rapid presentation and then, as in Boldt et al. (2019), they had to report their estimated mean value of the option and their confidence on this estimate. A key manipulation was present in this dataset, namely that the variance of the reward distributions were different across alternatives, thus generating different levels of certainty across options which significantly affected value confidence judgments. This key manipulation allowed us to even better differentiate between our models, as some of the candidate models do not take the variance of the reward distributions into account, thus predicting a constant level of confidence regardless of the variance of the rewards (indeed, the inclusion of this dataset improved model recovery results, see Supplementary Figure 2). Specifically, we tested three models on Quandt et al. (2022) data: the mentioned Bayesian model, a RW model where value confidence reflected the square root of the number of rewards seen of the particular option at play (as it was the best non-Bayesian model in Boldt et al. dataset; also note that, in this dataset, this model make the same predictions as the other models that were a function of the number of rewards seen) and a RW model with a separate learning algorithm that tracked the variance of the rewards (as this one was the only RW model that had information about the variance of the rewards). Note that we did not include any of the models which involved a surprise term (nor the model where value confidence explicitly reflected the surprise of the reward seen, see Methods) as these models could not account for the data in the Boldt et al. (2019) dataset. We again found that the Bayesian model was the best fitting model (Exp. 1: p_model = .94; p_exc. > .99; p_p.exc. > .99; Exp. 2: p_model = .96; p_exc. > .99; p_p.exc. > .99; Figure 2b and Figure 2d), as it was able to capture the decrease in value confidence levels as the standard deviation of the options reward distributions increased (Figure 2b).

Value confidence modulates the exploration-exploitation trade-off

A central functional role proposed for confidence is to regulate the balance between exploration and exploitation during learning (Boldt et al., 2019). Indeed, it is reasonable for an agent learning from the environment to increase exploitation as it becomes more certain about the values of the options at play. To test whether value confidence serves this adaptive control function, we extended the Bayesian model to include a parameter, b₁, that modulates decision noise as a function of value confidence (see Methods section and Figure 3a). A positive b₁ indicates that decisions become more deterministic—that is, more exploitative—as value confidence increases. For testing this idea, we leveraged on the decision data from Boldt et al. (2019) experiment 2 study (N=30), and in two new studies conducted in our laboratory (N=29 and N=30; the latter a pre-registered replication). In all cases, participants performed a classic two-armed bandit task in which they had to choose between two options and rate their confidence on having selected the best one. After that, they saw the reward obtained.

To assess whether value confidence modulates the exploration-exploitation balance, we compared the Bayesian model including the the b₁ parameter against the same model not including the parameter i.e., a model with constant decision noise. We found that the model including b₁ better explained the decision data in all datasets (Boldt et al. dataset: p_model = .96; p_exc. > .99; p_p.exc. > .99; Study 1: p_model = .82; p_exc. > .99; P_p.exc. = .98; Study 2: p_model = .85; p_exc. > .99; p_p.exc. > .99). Note that the main divergence between models emerged toward the end of each block, in line with the increase of exploitation as value confidence accumulates (Figure 3b, first and second rows). We also found that the b₁ parameter was significantly greater than zero in the three datasets employed (Boldt et al. dataset: t₂₉ = 8.76; p < .001, d = 1.60; Study 1: t₂₈ = 3.95; p < .001, d = 0.73; Study 2: t₂₉ = 5.58; p < .001, d = 1.02; Figure 3b, third row), further validating the proposed mechanism.

Decision confidence deviates from the probability of being correct

Having established that value confidence and choice behavior are well described by Bayesian inference, we next evaluated whether the same Bayesian model could also explain decision confidence—the subjective certainty that a chosen option is correct. According to the Bayesian confidence hypothesis (Meyniel, Sigman, et al., 2015; Sanders et al., 2016), confidence should reflect the probability of being correct. Interestingly, while this model was able to capture the pattern of confidence in correct trials (r₃₅₄ = 0.87; p < .001), it failed to account for confidence in incorrect trials (r₃₅₄ = 0.74; p < .001; Figure 4b and Figure 4c, top row). Indeed, the Bayesian model predicts that confidence should progressively increase across trials for correct responses but decrease for the incorrect ones, a signature known as the “folded-x pattern” of Bayesian confidence (Hangya et al., 2016; Sanders et al., 2016). Inspired by Navajas et al. (2017), in which a similar deviation of Bayesian confidence in incorrect trials was found but in categorical decisions, we constructed an Bayesian-hybrid model, where confidence reflects a weighted combination of the probability of being correct (i.e.: the Bayesian confidence computation) and the overall certainty of value estimates (i.e.: the overall value confidence). This model was able to capture decision confidence data both in correct (r₃₅₄ = 0.93; p < .001) and incorrect trials (r₃₅₄ = 0.90; p < .001; Figure 4b and Figure 4c, bottom row), and provided the best fit in two out of the three evaluated datasets (Boldt et al.: p_model = 0.5; p_exc. = 0.5; p_p.exc. = 0.5; Study 1: p_model = 0.9; p_exc. > 0.99; p_p.exc. > 0.99; Study 2: p_model = 0.81; p_exc. > 0.99; p_p.exc. > 0.99; Figure 4d). Across the three studies, in 65 out of 89 participants the hybrid model was preferred. We also evaluated a family of models that ignore option uncertainty and rely solely on estimated option values (as in Desender & Verguts, 2024; Salem-Garcia et al., 2023). A confidence model incorporating chosen option’s value emerged as a reasonable alternative, but it failed to capture behaviour in bandit tasks where both the mean and variance of reward distributions are manipulated (see Methods & Supplementary material).

Individual signatures of decision confidence reveal distinct behavioural profiles

Having established a population-level deviation from Bayesian confidence, we next examined whether individuals differ systematically in how they combine the probability of being correct and overall value confidence when reporting confidence. Following Navajas et al. (2017) we ran ordered linear regressions for each participant predicting decision confidence using the mentioned latent variables from the Bayesian-hybrid model. This yielded two regression coefficients—one for p(correct), β_p(correct), and one for overall value confidence, β_{value conf.}—quantifying each individual’s reliance on these sources of information (Fig. 5a).

Interindividual differences on confidence judgments predict task performance.
(a) We found substantial individual variation in the way decision confidence was reported. By running a regression using the latent variables from the model, that is the probability of being correct and the overall value confidence, we were able to quantify this variation. Here we plot the beta values associated with the mentioned latent variables: β_p(correct) and β_{value conf.}. Participants that had both betas with positive values showed a pattern where confidence tended to increase in both correct and incorrect decisions (Participant A and Participant B), or not decrease in incorrect decisions (Participant C). Participants with near zero beta β_{value conf.} values had a more Bayesian style of confidence reporting (illustrated by Participant D and Participant E). Interestingly, several participants had negative β_{value conf.} values. At first value, this is surprising as an increase in certainty should intuitively lead to greater confidence. However, the Bayesian model was the best model for virtually all of these participants (lightblue dots; note for instance the pattern of confidence judgments for Participant E), pointing out that the overall value confidence variable here should not be contributing and therefore its negative value suggests overfitting of the regression model. Note, however, that for some of these participants, like Participant F, a negative value appears to be justified as indeed her confidence decreased throughout the trials, that is, as certainty increases. Finally, some participants had negative β_p(correct) values, which consequently led to a poor metacognitive ability as incorrect decisions were associated with higher confidence (Participant G). (b) Increasing values of the β_p(correct) predicted higher task performance. (c) As with task performance, the β_p(correct) values predicted the values of the b₁ parameter (the parameter that controlled the exploration-exploitation trade-off with increasing value confidence levels). (d) As in the b₁ parameter case, only β_p(correct) values were associated with the metacognitive ability of the participants, that is, their ability to distinguish their correct and incorrect decisions with their confidence judgments. Note, however, that in this case a negative interaction between the two betas were found, meaning that increasing values of β_{value conf.} negatively affected the ability of β_p(correct) to predict metacognition.

Importantly, participants assigned significantly greater weight to the probability of being correct than to overall value confidence (paired t-test among beta values: t₈₈ = 9.37, p < .001, d = 0.99), indicating that Bayesian estimates of being correct constituted the primary driver of confidence reports, whereas value confidence contributed comparatively less. This dominance is particularly important given that the overall value confidence does not necessarily track the discriminability between the options (i.e.: the difficulty of a decision). Consequently, relying more heavily on this signal could, in principle, give rise to confidence patterns that are misaligned with decision difficulty. By contrast, simulations showed that when probability of being correct is weighted more strongly—as in the empirical data—the model yields well-calibrated confidence, with higher confidence for easier decisions even when overall uncertainty remains high (see Supplementary Material).

We then asked whether these computational weights were related to task and metacognitive performances. First, we ran a linear regression predicting task performance using both β_p(correct) and β_{value conf.} as well as their interaction. We found that higher β_p(correct) values were predictive of higher performance (β = 0.053; p < .001; Figure 5b, left), but β_{value conf.} values were not (β = −0.034; p = .233; Figure 5b, right). No interaction was found between the two predictors (β = −0.009; p = .538). Second, using a linear regression again, we evaluated whether these beta values were associated with the b₁ parameter values (i.e., the parameter that controlled the modulation by value confidence of the exploration-exploitation trade-off). We found that higher β_p(correct) values were associated with higher b₁ parameter values (β = 0.313; p < .001; Figure 5c, left), while β_{value conf.} values were not (β = 0.082; p = .564; Figure 5c, right), and no interaction was found between the two predictors (β = −0.056; p = .442). Finally, for evaluating metacognition—the ability to distinguish between correct and incorrect decisions—we run logistic regressions on each participant predicting accuracy from confidence judgments. We found that, as expected, larger β_p(correct) values were associated with higher metacognition (β = 1.056; p < .001; Figure 5d, left), but no relationship was found between metacognition and β_{value conf.} (β = 0.338; p = .104; Figure 5d, right). This time, however, a negative interaction was found, suggesting that at higher values of β_{value conf.} the positive effect of β_p(correct) diminished (β = −0.224; p = .035). Together, these results indicate that variability in confidence computation reflects distinct behavioral phenotypes rather than mere idiosyncrasies in reporting, suggesting that Bayesian estimates of correctness provide the dominant normative signal for confidence, while reliance on value confidence may reflect an additional, more idiosyncratic component that biases confidence reports away from this primary source. Indeed, participants who relied more heavily on Bayesian computations of confidence tended to make more accurate decisions (Figure 5b), used their value confidence more effectively to guide exploitative choices (Figure 5c), and showed greater accuracy in evaluating the correctness of their decisions (Figure 5d). In contrast, greater reliance on value confidence computation proved suboptimal: it impaired metacognitive accuracy while exerting no detectable influence on performance exploration-exploitation strategies.

Discussion

Overall, our results reveal how value and decision confidence arise and interact in human reinforcement learning. By quantifying inter-individual differences in these two forms of confidence computations we uncovered different behavioral patterns, where participants whose confidence computations more closely matched the Bayesian model showed higher task performance and metacognitive accuracy—in line with normative principles of decision-making.

Value confidence—the certainty in the estimated values of available options—was best captured by Bayesian computations, reflecting the precision of a posterior distribution over those values. This aligns with several findings in the literature where Bayesian models capture human learning remarkably well (Bounmy et al., 2023; Gershman, 2018; Kang et al., 2024; Meyniel, 2020; Meyniel, Schlunegger, et al., 2015). Our work further strengthens this notion by validating the predictions of these models not only against choice behavior but also against participants’ explicit value confidence ratings, providing a more direct test of the model’s internal variables. It should be noted, however, that human reinforcement learning is not always strictly normative. Perhaps the most prominent deviations from normative behaviour in human reinforcement learning are the positivity and the confirmation biases, weighting new evidence more heavily when it is associated with a positive valence or consistent with prior beliefs (Chambon et al., 2020; Lefebvre et al., 2017; Palminteri et al., 2017). These effects are usually modelled using separate learning rates for positive or confirmatory evidence, allowing for a greater update of positive and confirmatory information (Palminteri & Lebreton, 2022). Given that we focused on the computations underlying the uncertainty around the estimated values—rather than the estimated values per se—it remains to be tested whether including separate learning rates would further improve model fits. Interestingly, recent work has proposed that these apparent biases can themselves emerge under Bayesian principles (Godara, 2025)—which may align with their evolutionary value (Hoxha et al., 2025)—therefore being a possibility that Bayesian updating can suffice to explain these phenomena.

Extending the Bayesian framework to decision data, we found that value confidence acts as a key variable modulating the exploration-exploitation trade-off. Indeed, as certainty increased over trials, decision noise systematically decreased, resulting in more exploitative decisions (see Desender & Verguts, 2024, for a similar result but using decision confidence rather than value confidence as a modulatory signal). This aligns closely with Thompson sampling algorithms where an agent explores more if uncertainty is higher (Gershman, 2018). It should be noted, however, that human behavior sometimes departs from this principle. In directed exploration cases, for instance, humans tend to deliberately select more uncertain options (Abir et al., 2024; Gershman, 2018). For other choice scenarios where direct and random exploration are likely to be at play, hybrid algorithms that combine directed and random exploration better explain human behavior (Gershman, 2018; E. Schulz & Gershman, 2019).

Although our model does not aim to capture all possible effects of (un)certainty on learning and decision-making, it highlights how Bayesian frameworks—by naturally representing uncertainty through precision metrics—can be extended to encompass a wide range of certainty-driven influences on behavior. For instance, Bayesian models have been used to explain how balancing the approach and avoidance of uncertainty can reduce the cognitive costs of exploration in a resource-rational manner (Abir et al., 2024), and how uncertainty modulates the use of simple decision heuristics that imperfectly exploit immediate rewards (Paunov et al., 2024). It is also worth mentioning that, outside human RL and Bayesian modelling, work in value-based decisions has shown that, by fitting certainty-informed drift-diffusion models to human choices, certainty about options’ values modulates not only choices but response times (Lee & Usher, 2023), and that confidence provides a benefit signal for allocating cognitive resources during decision formation (Bénon et al., 2024; Lee & Daunizeau, 2021). Including such models that account for decision-level dynamics as a decision rule in RL contexts could provide even more insight into the mechanisms by which confidence guides decisions in uncertain environments.

Decision confidence, on the other hand, deviated from the Bayesian computations that accounted for value confidence ratings and decisions. This is inline with previous work showing deviations of normative principles in decision confidence in the perceptual domain (Comay et al., 2023; Li & Ma, 2020; Lisi et al., 2020; Miyoshi & Sakamoto, 2025; Xue et al., 2024) as well as in reinforcement learning tasks (Salem-Garcia et al., 2023; Ting et al., 2023). Specifically, confidence departed from Bayesian predictions specially on incorrect trials, violating the well-known “folded-x pattern” of confidence: with increasing evidence, confidence should increase for correct trials but decrease for incorrect ones (Hangya et al., 2016; Sanders et al., 2016)—although it should be noted that this pattern is not always a prediction of normative models of confidence (Adler & Ma, 2018; Rausch & Zehetleitner, 2019). Decision confidence was better captured by a weighted combination of two sources: the (Bayesian) probability of being correct and the certainty of the value estimates (which reflects the overall level of value confidence). Following the work of Navajas et al. (2017) in categorical decisions, we showed that the relative contributions of these two sources of information characterized the inter-individual patterns of decision confidence, reinforcing the notion that confidence is constructed in an idiosyncratic manner.

An important clarification concerns the computational role of overall value confidence in the model. This quantity does not uniquely determine option discriminability and therefore does not map directly onto decision difficulty. For example, a choice may be easy when the posterior value distributions of two options are well separated, even if both value estimates remain highly uncertain. Conversely, a choice may be maximally difficult when the two options have the same value, even if those values are known with high certainty. If confidence relied heavily on overall value confidence alone, this dissociation could produce counterintuitive patterns. For instance, higher confidence in the objectively harder case. However, both the empirical data and our simulations show that the probability of being correct is the dominant determinant of confidence, ensuring that confidence remains appropriately calibrated to decision difficulty (Supplementary Figure 6). In this framework, overall value confidence is better understood as a secondary biasing signal that modulates confidence without overriding the primary Bayesian estimate. This component is nevertheless important: it captures idiosyncratic individual differences and helps explain deviations from normative confidence, particularly on incorrect trials, where reliance on global uncertainty can produce systematic departures from the expected relationship between evidence and confidence.

Furthermore, the idea that the certainty about the options’ values has an influence in decision confidence has a parallelism in influential models of confidence in perceptual decisions where stimulus reliability plays a key role in confidence computations (Boldt et al., 2017; Hellmann et al., 2023; Rausch et al., 2018; Shekhar & Rahnev, 2024) pointing to a possible cross-domains mechanism (although see Brus et al., 2021; Quandt et al., 2022). Extending this view, recent work has proposed that confidence itself reflects a noisy estimate of decision reliability constrained by a higher-order “meta-uncertainty” about the precision of the decision variable (Boundy-Singer et al., 2022). Incorporating such second-order uncertainty into learning frameworks could provide a fruitful direction for future research, offering a computational account of how confidence variability arises not only from task-related uncertainty but also from uncertainty about one’s own inferential precision.

Importantly, these idiosyncratic patterns were not merely descriptive variations in confidence computation, but reflected distinct behavioral profiles. Participants whose confidence computations more closely followed the Bayesian ideal—weighting the probability of being correct more heavily—achieved higher overall accuracy, relied more strongly on value confidence to regulate their exploration-exploitation balance, and exhibited superior metacognitive insight into their own decisions. This suggests that the extent to which individuals approximate Bayesian confidence principles is not just a matter of internal calibration, but a signature of more adaptive learning and decision-making strategies. In this line, given that alterations in confidence are predictable of symptoms across multiple psychiatric dimensions (Hoven et al., 2019, 2023), these results may point to a continuum between optimal, Bayesian-like confidence computations and the kinds of suboptimality that characterize clinical populations.

Beyond idiosyncratic biases, recent work in human RL has shown that confidence judgments are also shaped by systematic biases (Salem-Garcia et al., 2023). Humans tend to overestimate their accuracy—the well-known overconfidence bias (Baranski & Petrusic, 1994)—and to report higher confidence when seeking gains than when avoiding losses, a phenomenon termed the valence-induced confidence bias (Lebreton et al., 2018). These effects can be formally accounted for by an overweighting of the learned value of the chosen option in the computation of confidence, suggesting that confidence judgments partially inherit biases originating in the learning process. Moreover, individual differences in the learning parameters underlying these value biases—such as confirmatory updating and outcome-context dependency—predict the magnitude of metacognitive biases, establishing a computational bridge between biased learning and biased confidence (Salem-Garcia et al., 2023). While the Bayesian model proposed here does not explicitly capture such value- and valence-related biases (although see Godara, 2025), extended Bayesian models—incorporating for instance reward-context dependencies—could provide a fruitful avenue for future research, offering a unified computational account of how learned values, value certainty, and decision confidence jointly give rise to both adaptive and biased behavior.

Finally, given that this recent research has modelled decision confidence using estimated option values but, unlike our approach, without incorporating uncertainty in those estimates (Desender & Verguts, 2024; Salem-Garcia et al., 2023), we therefore tested this kind of models on our datasets, as they offer a plausible alternative account of confidence behaviour. Consistent with these studies, a model driven by the value of the chosen option—akin to the “positive evidence bias” widely reported in perceptual decision-making (Maniscalco et al., 2016; Peters et al., 2017; Zylberberg et al., 2012)—provided a reasonable fit (see Supplementary material for model fitting results). However, it deviated from human behaviour mainly on incorrect trials, similarly to the Bayesian model (although over-estimating rather than under-estimating confidence in those trials). In addition, this model struggled to explain confidence patterns in a bandit task where both the mean and the variance of the reward distributions were manipulated (see Supplementary Material). Together, these findings support the idea that uncertainty in value estimates plays a central role in shaping confidence in learning settings. More broadly, we believe that future research could exploit systematic manipulations of reward distributions to further dissociate the respective contributions of estimated uncertainty and chosen-option value to decision confidence.

By linking Bayesian principles of learning with metacognitive evaluations of choice, our results bridge two traditionally separate literatures—on reinforcement learning and decision confidence—into a unified account of behavior under uncertainty. Beyond their theoretical relevance, these insights may inform how confidence guides learning and exploration across domains, and how its miscalibration contributes to suboptimal decision-making strategies.

Methods

All data & scripts to reproduce the reported results, as well as the Supplementary material, are available online at: osf.io/8tex5.

Datasets description

Here we describe the nature of the datasets employed in our study. For datasets that have been previously published, we refer the reader to the original publications for a more detailed description of the tasks employed. For the newly collected datasets, participants were university students who provided informed consent prior to participation. These new experimental protocols were approved by the Ethics Committee of the Instituto de Investigaciones Psicológicas (UNC-CONICET).

Value confidence

For the value confidence analysis, we employed datasets from two previous studies. First, we used the data from the experiment 1 of Boldt et al. (2019) study. Participants (N=21) performed a two-armed bandit task in which, on each trial, they observed a reward randomly sampled from one of the options. After seeing the reward, participants reported their belief on the mean value of this alternative and their confidence on this estimate on a continuous square scale. Rewards were drawn from Beta distributions with the following parameters:

{α = 1; β = 3}; {α = 2; β = 3}; {α = 3; β = 3}; {α = 3; β = 2}; {α = 3; β = 1}.

Participants performed 600 trials divided in 15 blocks of different lengths, ranging from 20 trials to 60 trials. Twenty five % of the trials were “decision trials” where participants had to choose one of the two options and rate their confidence on having chosen the best one.

The second dataset comprised the two experiments carried by Quandt et al. (2022). Both experiments followed identical procedures, with 62 participants in the first one and 60 in the second. Participants observed a hundred rewards from one alternative and then rated their belief on the expected value of the option and their confidence in this rating. They performed five trials with each alternative, for a total of six alternatives. The rewards associated each alternative were drawn from Normal distributions with the following parameters:

{μ = 80, σ = 15}; {μ = 100, σ = 20}; {μ = 120, σ = 30}; {μ = 130, σ = 40}; {μ = 150, σ = 10}; {μ = 160, σ = 5}.

Decision confidence

For the decision confidence analysis we used four datasets. First, we used the data from experiment 2 of Boldt et al. (2019) study. Participants (N=30) performed a two-armed bandit task in which they had to choose, on each trial, one of the options with the aim to maximize their rewards. After choosing, they had to report on a continuous scale their confidence in having selected the best alternative. The alternatives’ rewards distributions were sampled from the same Beta distributions of the value confidence experiment. Participants performed 600 trials, divided in 15 blocks of different lengths. Block length ranged from 20 trials to 60 trials. In 25% of the trials participants did not make a choice but instead reported their estimated mean value of the options (“rating trials”). These trials were excluded from the analysis.

The second dataset corresponded to a study conducted by us in our laboratory (“ Study 1 “). Participants (N=29) performed 20 blocks of 30 trials of a two armed bandit task. After choosing, they had to report their confidence in having selected the best option on a 1 (not sure at all) to 4 (completely sure) scale. Rewards were sampled from the same Beta distributions of the Boldt et al. (2019) study. The third dataset (“Study 2”) was a pre-registered replication of this protocol (N=30). Pre-registration is available at: https://doi.org/10.17605/OSF.IO/Z5TCA.

We employed a fourth dataset (“Study 3”) to differentiate between models with and without including certainty in estimated values (results are reported in the Supplementary material, see also Computational models section for the description of the models). This dataset was similar as the previous ones, with participants (N=15) performing 20 blocks of 30 trials of a two armed bandit task and reporting confidence on a 1 to 4 scale after each choice. However, the reward distributions were manipulated differently in this task. Inspired by Hertz et al. (2018) design, rewards from each option were sampled from Gaussian distributions, one with higher mean than the other, thus constituting one “correct” and one “incorrect” option. The variances of each one could independently be high (H = 25² = 625) or low (L = 10² = 100). This resulted in a design with four experimental conditions (thus participants completing 5 blocks with each condition, randomly ordered in the experimental session): H-H, H-L, L-H, L-L, where the first letter indicates the variance of the correct (higher expected value) and the second indicates the variance of the incorrect (lower expected value) option. The mean of the correct option was sampled uniformly from the range {40; 60} and the mean of the incorrect option was the mean of the correct option minus 30. The difference of value between options, therefore, was always the same.

Computational models

Values and value confidence

Two main classes of models were tested to account for the pattern of the value confidence data. One class consisted of models based on Rescorla-Wagner (RW) algorithms (R. A. Rescorla & A. R. Wagner, 1972), and the other was based on Bayesian inference. For the RW models, expected values for the options were updated as follows:

Were V represents the value of the option, t indexes the trial number, δ is the learning rate parameter and r is the reward obtained.

For the Bayesian models, posterior probability distributions of the options’ expected values were constructed using Bayesian inference. As mentioned, in the Boldt et al. datasets, rewards were drawn from Beta distributions. Therefore, a Bayesian observer computes, on each trial, a posterior distribution over the mean of a Beta distribution, represented by θ, by integrating the likelihood of the rewards given θ and the prior probability of θ, which we modelled as uniform. Formally:

The maximum-a-posteriori (MAP) value of this posterior is therefore the predicted expected value of the option on trial t, i.e., V_t. These Bayesian models were implemented using Stan (Stan, 2025; MCMC diagnostics are reported in the Supplementary material). A re-parametrization was used to infer the mean of a Beta distribution. Specifically, we estimated the parameters that govern the shape of a Beta distribution, α and β, and then obtained the mean of the Beta as:

Given that in Quandt et al. (2022) datasets rewards came from Normal distributions, we inferred the options’ expected values (θ) by leveraging on the Normal-Normal conjugate family (Johnson, 2022). The posterior of θ, therefore, was therefore computed as follows:

Where represents the mean of the prior (which we set at zero), π² represents prior variance (which we set at 100, thus constructing an uninformative prior), is the mean of the rewards seen and σ² is the variance of the rewards seen (i.e.: the likelihood variance). The subscript 1:100 represents that subjects faced a hundred rewards for each option.

Finally, for the dataset of Study 3, reported in the Supplementary material, Bayesian inference was implemented using a Kalman-filter following Gershman (2018) to recursively infer posteriors’ mean (θ) and variances (σ²) for each option. Formally:

where the learning rate δ_t is given by:

where τ² was set to the specific option’s reward distribution variance. For all options initial means (θ) were set at zero and initial σ² at 100 (uninformative prior).

Given these two modeling frameworks for value updating (i.e.: the Bayesian inference framework and the RW framework), we compared several models for explaining value confidence. The Bayesian framework naturally offers a measure for the certainty associated to the inferred values of the options, which is the inverse of the standard deviation of the posterior distribution over the expected value of the option at play. Thus, under the Bayesian model, value confidence can be expressed as:

Here σ represents the standard deviation of the posterior distribution over θ, the estimated value of the specific option at play.

In contrast, as RW algorithms do not have an explicit measure of certainty in the values associated with each alternative, we extended this approach with several possible mechanisms for computing value confidence. The first of these extensions, the RW surprise model, defines value confidence as the inverse of the absolute surprise of the outcome. Formally:

The second RW model comprised a separated RW rule to track the variance of the outcomes (RW tracking variance model, Hertz et al., 2018):

Here τ refers to the computed variance of the rewards observed, and γ is the variance learning rate. Value confidence under this model reflected the inverse of the tracked variance, i.e.:

For the third RW model, value confidence reflected the number of rewards observed of the specific option sampled (RW n model). Let n refers to that number; value confidence is represented as:

Next, based on the RW n model, we constructed two additional variants: In this models value confidence reflects the logarithm of the number of rewards experienced with the specific option at play (RW log(n) model) or the square root of the same number (RW sqrt(n) model).

Finally, we tested a weighted combination of the Bayesian model, the RW sqrt(n) model and the RW log(n) model with the surprise generated by the outcome. The Bayesian-surprise model can be expressed as follows:

The RW sqrt(n)-surprise model can be then stated as:

And the RW log(n)-surprise model is the same as above but with the logarithm of n instead of the square root. Note that the weights (w) were independent parameters for each model.

All the RW models were tested in the Boldt et al. (2019) dataset. In the Quandt et al. (2022) dataset we only tested the RW sqrt(n) model and the RW tracking variance model given that 1) the RW sqrt(n) model was the best non-Bayesian model in Boldt et al. (2019) data; 2) the RW tracking variance model was the only non-bayesian model including information of the reward distribution variance for value confidence, the key manipulation in Quandt et al. (2022) dataset. The Bayesian model was included in all datasets.

Decisions and decision confidence

The decision confidence experiments involved classic two-armed bandit tasks where participants chose between two alternatives (which we will represent as A and B) and then had to rate their confidence on having chosen the best option (see Datasets description section). Given that we found that value confidence was best explained by a Bayesian model, we used Bayesian inference to update the values of the options on each trial as described for the value confidence data. Bayesian inference for options’ posterior means and variances on each trial was implemented in Stan for the Boldt et al. (2019), Study 1 and Study 2 datasets and with a Kalman-filter for Study 3 dataset, as described in the Values and value confidence section above.

For fitting decisions, we used a classic softmax equation:

Here d represents the decision, DV represents the decision variable (which is , note that V here is the maximum-a-posteriori value of each option), and A is the slope of the sigmoid that controls the noise in the decision process as an inverse temperature parameter. We tested two variations for fitting decisions. In one of them λ was treated as a constant. In the other, λ dynamically varied across trials, being modulated by value confidence in the following way:

Where σ represent the standard deviation of the posteriors over the options’ expected values. The term multiplying b₁ can be interpreted as a global value confidence, as it reflects the inverse of the combined uncertainty associated with the value estimates of both options. As long as b₁ is positive, value confidence will modulate the exploration-exploitation trade-off by reducing decision noise with increasing values of value confidence. Conversely, if b₁ is indistinguishable from zero, decision noise is constant across trials and participants do not take into account the options’ uncertainty for making decisions.

For decision confidence we tested several models. In the first one, confidence followed the Bayesian confidence hypothesis (Meyniel et al., 2015; Sanders et al., 2016), i.e.: confidence reflects the probability of being correct (Bayesian model). As the posteriors over options’ expected values have Gaussian shape, this probability can be computed as follows:

Where Φ is the standard cumulative Gaussian function. In the second model, the Bayesian-hybrid model, decision confidence reflected a weighted sum of the probability of being correct and the overall certainty of the estimated values (i.e.: the overall value confidence; Navajas et al., 2017). Formally:

In the main manuscript we report the results of these two primary models. However, because prior work on confidence in bandit tasks has relied on models that do not incorporate uncertainty in value estimates (Desender & Verguts, 2024; Salem-Garcia et al., 2023), we sought to assess whether including the certainty of the decision variable in the Bayesian-hybrid model was indeed warranted. To do so, we compared its fit to that of models that rely solely on the estimated option values. Specifically, we tested two extra models. For one of them, confidence reflected the absolute difference between the estimated values of the options (Mean difference model), i.e.:

For the other extra model, and given that the value of the chosen option have been identified as important in explaining decision confidence both in bandit tasks and in perceptual decision making (i.e.: the “positive evidence bias”, Zylberberg et al., 2012), confidence reflected a weighted sum of the absolute difference between the estimated values of the options and the value of the chosen option (Mean difference + PEB model). Formally:

Model fitting results of these two extra models are reported in the Supplementary material.

Model fitting and model comparison procedure

We fitted all models by maximizing the log-likelihood of the parameters given each subject data using the optim function in R with the simulated annealing method. Ten optimization routines with different starting points were run per subject. The probability of a specific decision was given by the softmax equations detailed above. For computing the probability of confidence ratings (both value and decision confidence) in the case of Boldt et al. (2019) datasets and our datasets, we used a set of confidence criteria, {c₁,…, c_k−1}, where k is the number of confidence ratings, that divided a Gaussian distribution with mean at the confidence predicted by the model and a standard deviation that was fitted to each subject (akin as observational noise, Nunez et al., 2024). The probability of a specific level of confidence is then the area under this Gaussian that is restricted by corresponding confidence criteria. Boldt et al. (2019) confidence data was discretized from 0 to 10 to be able to apply this procedure. For Quandt et al. (2022) data we applied a linear transformation for mapping models’ confidence to participants’ confidence. We computed the likelihood of the parameters by determining the proportion of transformed confidence predictions from the model that matched the reported confidence divided by the total number of predictions (we used this approach to reduce the number of parameters as there were only 30 trials per subject).

We compared all models using Bayesian model selection at the group level (Stephan et al., 2009). Specifically, we computed Bayesian Information Criterion weights (Wagenmakers & Farrell, 2004) for each subject and for each model as a proxy for model evidence (i.e., the belief that model m generated data from subject i) and then used the bmsR R package to compute exceedance and protected exceedance probabilities (the probabilities that participants were more likely to use a certain model to generate behavior rather than any other alternative model) which were used as the metric for model comparison. We report in the Supplementary material parameter and model recovery results.

Statistical analyses

We also carried out several model free statistical tests. We used one-sample t-tests against zero (two-sided) for testing whether the bl parameter was greater than zero in the three datasets employed (Figure 3b). For each subject we run ordinal regressions predicting decision confidence on each trial with the two model-derived variables: the probability of being correct and the overall value confidence. Each of these two terms were therefore associated with two beta values, β_p(correct) for the first term and β_{value conf.} for the second term, which allowed us to characterize individual differences in confidence reporting (Figure 5a). We then used these beta values (as well as their interaction) to predict task performance (the proportion of correct choices), the b₁ parameter and the metacognitive ability of each participant using the following linear regression models (Figure 5b, 5c, 5d):

Metacognition was measured using a logistic regression per subject predicting a correct (1) or incorrect (0) response using the confidence level reported by the participant on each trial:

Where x = β₀ + β₁ * confidence.

Supplementary material

Value confidence predictions for all models

Here we illustrate the predictions in the Boldt et al. (2019) dataset of all tested models for value confidence. Supplementary Figure 1a shows models’ predictions for all the trials, while Supplementary Figure 1b shows models’ predictions for one specific block as an example.

Model fits for value confidence data, all models.
(a) Models’ predictions considering all the trials in Boldt et al. (2019) dataset. (b) Models’ predictions considering only one block, for illustrative purposes. Error bars represent the standard error of the mean (SEM) of the behavioural data and shaded regions represent the SEM of the models’ predictions.

MCMC diagnostics

For the datasets of Boldt et al. (2019) and our 2 main studies rewards came from Beta distributions. As part of the Bayesian modelling, we used Stan to numerically compute trial-by-trial posterior estimates of the options’ values. Two chains were run with 8000 iterations each, using a burn-in of 4000 iterations and a thinning factor of 1.

To assess the reliability of this procedure, we examined standard MCMC diagnostics—R-hat, effective sample sizes ratios (Neff/N), divergent transitions, trace plots and checks of the posterior samples—using two simulation scenarios. In the first, we performed inference after observing only a single reward, illustrating sampler behaviour under minimal information (i.e., at the start of a block). This procedure was repeated 1000 times and we extracted the mean and range of the diagnostic metrics. The second simulation repeated the same procedure but after observing twenty rewards, providing an example of sampler performance with more informative data.

We report the results in the Supplementary Table 1 below. Across both scenarios, all diagnostics indicated proper convergence of the Stan sampler, supporting the validity of the numerical posterior estimates used in our modelling.

We also include representative posterior samples from these two scenarios, which showed an approximately Gaussian shape, as well as trace plots confirming stable and well-mixed chains in Supplementary Figure 2 below.

Illustration of posterior distributions and trace plots.
(a) Scenario 1: only one reward (r = 0.5) was observed. (b) Scenario 2: twenty rewards (mean r = 0.54) were observed. Theta represents the inference over the mean value of a Beta distribution, i.e.: . The vertical dashed lines at the posteriors represent the mean of the posterior distribution.

Model and parameter recovery results. Value confidence

To evaluate our ability to differentiate between the models tested for value confidence data, we conducted model recovery analysis. We simulated 30 subjects with each model both with the Boldt et al. (2019) experiment design and then fitted all the models to the simulated data with the same procedure described in the model fitting section of the main manuscript. We then checked the proportion of times that the best fitting model (according to the Bayesian Information Criterion) was equal to the generative model among all the simulated subjects with that specific generative model. Supplementary Figure 3a below depicts the results obtained.

Model and parameter recovery results for value confidence data.
(a) Model recovery results for the Boldt et al. (2019) experiment design. We found that the RW n model is mistakenly considered to be the model that generated the data for several generative models, with the most prominent examples being the RW log(n) model and the RW sqrt(n) model. (b) Model recovery results for Quandt et al. (2022) experiment design. While this result is uninformative to distinguish between the models that are a function of the number of rewards seen, it helps to differentiate that group of models and the models that take the variance of the reward distribution into account. (c) Parameter recovery results for the parameter that weighted the contribution of the surprise term in the three models that included it.

We also carried out model recovery with the Quandt et al. (2022) experiment design. For this second model recovery analysis, we only tested the RW tracking variance model, the Bayesian model, and the RW n model due to two reasons: first, models that included a surprise term were worse at fitting the Boldt et al. (2019) data (see Supplementary Figure 1); second, in the Quandt et al (2022) design all models that are a function of the number of trials experienced (such as the RW sqrt(n) model or the RW log(n) model) make the same predictions as the RW n model. Due to computation time, having less models also allowed us to simulate more subjects per model: 100. Once data which each model was simulated, we then followed the same procedure as described above for the Boldt et al. (2019) experiment design. We found that including the Quandt et al (2022) dataset improves model recovery results (Supplementary Figure 3b), as we could better distinguish between models that take (the RW track variance and the Bayesian model) and do not take (the RW n family of models) the reward distribution variance into account. Note, however, that the RW tracking variance model was confused with the Bayesian model in 81% of the cases.

Furthermore, with the same simulated datasets we also checked parameter recoverability (Supplementary Figure 3c). We especially focused the analysis on the parameters related to value confidence, that is, the weights applied to the surprise term in the Bayesian-suprise model, the RW sqrt(n)-surprise model and the RW log(n)-surprise model. We found significant correlations between the generative and the recovered weights for the surprise term in the three models (Bayesian-surprise model: r = 0.52, p = .003; RW log(n)-surprise model: r = 0.38, p = .040; RW sqrt(n)-surprise model: r = 0.44, p = .015; Overall correlation: r = 0.41, p < .001).

Model and parameter recovery results. Decision confidence.

Using the same procedure described above (i.e.: simulating data with all models, then fitting the simulated data with all models and finally checking the proportion of times that the recovered model equals the generative model) we also carried out model and parameter recovery analyses in the models tested for decision confidence data. First, we checked the recoverability of the model including the modulation of decision noise by value confidence via the b₁ parameter (dynamic decision noise model) in contrast with the model without the decision noise modulation (constant decision noise model) (Supplementary Figure 4a). We used this simulated data to also check for parameter recoverability of the b₀ and b₁ parameters that modulated the softmax we implemented as a decision policy. We found strong correlations between the generative and the recovered parameters for both parameters (b₀ : r = 0.55, p = .002 [constant decision noise model]; b₀: r = 0.43, p < .001 [dynamic decision noise model]; b₁ : r = 0.99, p < .001; Supplementary Figure 4b). This is important as we used these parameters (especially the b₁ parameter) to characterize individual differences in the way decisions were made, that we then also correlated with individual differences in confidence computations (main manuscript Figure 5c).

Model and parameter recovery results for decision confidence data.
(a) Model recovery results regarding the models with and without the b₁ parameter. (b) Parameter recovery results for the b₀ and b₁ parameters. (c) Model recovery results for the decision confidence models. (d) Parameter recovery results for the ω parameter that weighted the sources of information for decision confidence in the Hybrid model: the probability of being correct and the overall value confidence.

Moving to confidence data, we checked model recoverability on the decision confidence models, that is, the Bayesian and the Bayesian-hybrid model. The recoverability of the two models were practically perfect (Supplementary Figure 4c). Finally, we checked whether we can recover the weight associated with the overall value confidence in the Hybrid model. We found a significant correlation between the generative and the recovered model for this particular parameter (r = 0.62; p < .001; Supplementary Figure 4d).

Evaluating confidence models that ignore uncertainty in estimated values

The Bayesian and Bayesian-hybrid models incorporate posterior uncertainty into their confidence computations. However, previous work has shown that confidence can also be captured using only estimated option values, without explicitly modelling their uncertainty (Salem-García et al., 2023; Desender & Verguts, 2024). For this reason, we tested two additional models: one in which confidence reflected the absolute difference between estimated option values (Mean Difference model), and another in which confidence was a weighted combination of this difference and the value of the chosen option (Mean Difference + PEB model). Further details are provided in the Methods section of the main manuscript. We first fitted these models to the three datasets reported in the main manuscript. Supplementary Figure 5a illustrates the predictions of the Mean Difference model and Supplementary Figure 5b illustrates the predictions of the Mean Difference + PEB model. Although the Bayesian-hybrid model remained slightly superior to the other models (Boldt et al.: p_model = 0.41; p_exc. = 0.53; p_pexc. = 0.53; Study 1: p_model = 0.48; p_exc. = 0.55; p_p.exc. = 0.55; Study 2: p_model = 0.55; p_exc. = 0.84; p_p.exc. = 0.84; Supplementary Figure 5c) and enabled meaningful characterisation of inter-individual profiles, the Mean Difference + PEB model emerged as a competitive alternative across these datasets.

Evaluating the predictions of models that do not incorporate posterior uncertainty.
(a) “Mean difference” model predictions to the three main datasets reported in the main manuscript (first figure: Boldt et al. data; second figure: study 1 data; third figure: study 2 data). (b) “Mean difference + PEB” model predictions. Figures are ordered in the same way as in (a). (c) Model comparison results. Figures are ordered in the same way as (a) and (b). (d) Model predictions to the dataset manipulating the mean and variance of the reward distributions of the options plus model comparison results. First figure: Hybrid model; second figure: Mean difference + PEB model; third figure Mean difference + PEB model including a scale parameter for the estimated values of the options; fourth figure: model comparison results. The conventions of the whole figure are the same as in the main manuscript, namely: green and red represent correct and incorrect responses respectively, error bars and shaded lines represent empirical data and model fits respectively, dots are individuals’ average values, and pink dots and purple asterisks in model comparison plots are exceedance and protected exceedance probabilities respectively.

Therefore, we next sought to arbitrate between this model and the Bayesian-hybrid model, and to further assess whether incorporating value-estimate uncertainty is necessary for modelling confidence. To this end, we fitted both models to an additional dataset in which we manipulated both the means and variances of the reward distributions (details are given in Methods). This manipulation was intended to differentiate the computations of the two models, as only the Bayesian-hybrid model explicitly integrates uncertainty.

The Mean Difference + PEB model failed to generalise to this dataset with more complex mean/variance manipulations (Supplementary Figure 5d). One possible explanation is that changes in the reward distributions across blocks modified the scale of option values in ways that the model—which relies heavily on those values—struggles to accommodate. We therefore implemented a variant in which estimated values were rescaled via an additional parameter to reduce across-block variability, which indeed improved model fit (Supplementary Figure 5d). Nonetheless, the Bayesian-hybrid model was clearly preferred in this dataset (p_model = 0.83; p_exc. > 0.99; p_p.exc. > 0.99). This finding reinforces the main conclusion of the manuscript: incorporating uncertainty in estimated values appears essential for explaining confidence in bandit tasks. Finally, we again compared the constant and dynamic decision-noise formulations and found that the dynamic model was superior (p_model = 0.64; p_exc. = 0.89; p_p.exc. = 0.62) with b₁ values, on average, greater than zero (t₁₄ = 3.36; p = .005, d = 0.87).

Implications of dissociating overall uncertainty from decision difficulty in the hybrid model

In the hybrid model, confidence is computed as a weighted combination of the probability of being correct and overall value confidence, the latter defined as the inverse of the total uncertainty (i.e., the sum of the variances associated with the value representations of the options). While this formulation captures an intuitive notion of global uncertainty, it does not necessarily track the discriminability of the options, and thus does not map directly onto decision difficulty.

As a consequence, these two quantities can, in principle, diverge. For instance, overall uncertainty can be high even when the difference between options is large (i.e., an easy decision), and conversely low when options are similar (i.e., a difficult decision). This dissociation raises a potential concern: if confidence were to rely heavily on overall uncertainty, the model could, in principle, give rise to counterintuitive and poorly calibrated confidence patterns that are misaligned with decision difficulty.

To evaluate whether this issue arises in practice, we simulated model predictions in two extreme scenarios that explicitly dissociate these quantities (see Supplementary Figure 6a below): (i) an easy decision under low certainty, and (ii) an “impossible” decision (i.e.: the two options having the same expected value) under high certainty. For the first case, we constructed posterior estimates of two options (A and B) as normal distributions with the following parameters: {μ_A = −10, σ_A = 3}; {μ_B = 10, σ_B = 3}. For the second case, normal distributions had the following parameters: {μ_A = 0, σ_A = 0.4}; {μ_B = 0, σ_B = 0.4}. Confidence in these two scenarios was computed as follows:

Crucially, we varied the relative weight assigned to the probability of being correct and to overall value confidence, ω = {0, 0.25, 0. 5, 0. 75, 1}, and additionally examined the confidence predictions using the weights observed empirically in our data (M=0.95, SD=0.08, range=[0.26;0.99]).

Simulations (Supplementary Figure 6b below) show that only when overall value confidence is given a large weight do counterintuitive patterns emerge. However, when the probability of being correct receives a dominant weight (as consistently observed in our participants; Supplementary Figure 6b and 6c) the model produces well-calibrated confidence estimates, with higher confidence assigned to easier decisions despite elevated uncertainty. This supports the interpretation that overall value confidence acts as a secondary, biasing component in confidence computation, modulating but not overriding the primary Bayesian estimate of correctness.

Dissociating overall uncertainty from decision difficulty in the hybrid model.
(a) The two proposed scenarios of decisions: an easy choice (estimated expected values are very different) with high uncertainty (expected values are not precisely estimated) and an impossible choice (estimated expected values are equal) with low uncertainty (expected values are precisely estimated). Note that in the right panel the cyan distribution cannot be seen as it is behind the orange one. (b) Model predictions varying the w value. (c) Model predictions using the empirical ω values found by fitting the model to the data. (d) A comparison of the beta values obtained from the regression shows that the contribution of the probability of being correct is significantly higher than the contribution of overall value confidence to confidence judgments. We opted for comparing the regression beta values instead of the ω values as we normalised the predictors for the regression, thus beta values are directly comparable between them.

Additional information

Funding

Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) (2018-03614)

Pablo Barttfeld

Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) (2021-0083)

Pablo Barttfeld

Universidad de Buenos Aires (UBA) (UBACyT 20020170100330BA)

Guillermo Solovey

Center for Open Science, Templeton World Charity Foundation and the Association for the Scientific Studies of Consciousness (Funding Consciousness Research with Registered Reports)

Pablo Barttfeld

Significance of findings

Strength of evidence

Abstract

Introduction

Different levels of confidence in reinforcement learning contexts.

Results

Value confidence emerges from Bayesian inference

Value confidence is captured by a Bayesian model.

Value confidence modulates the exploration-exploitation trade-off

Value confidence modulates the exploration-exploitation trade-off.

Decision confidence deviates from the probability of being correct

Decision confidence deviates from the probability of being correct.

Individual signatures of decision confidence reveal distinct behavioural profiles

Interindividual differences on confidence judgments predict task performance.

Discussion

Methods

Datasets description

Value confidence

Decision confidence

Computational models

Values and value confidence

Decisions and decision confidence

Model fitting and model comparison procedure

Statistical analyses

Supplementary material

Value confidence predictions for all models

Model fits for value confidence data, all models.

MCMC diagnostics

MCMC diagnostics summary.

Illustration of posterior distributions and trace plots.

Model and parameter recovery results. Value confidence

Model and parameter recovery results for value confidence data.

Model and parameter recovery results. Decision confidence.

Model and parameter recovery results for decision confidence data.

Evaluating confidence models that ignore uncertainty in estimated values

Evaluating the predictions of models that do not incorporate posterior uncertainty.

Implications of dissociating overall uncertainty from decision difficulty in the hybrid model

Dissociating overall uncertainty from decision difficulty in the hybrid model.

Additional information

Funding

References

Article and author information

Author information

Nicolás A Comay

Guillermo Solovey*

Pablo Barttfeld*

Version history

Cite all versions

Copyright

Metrics

Guillermo Solovey

Pablo Barttfeld