Different levels of confidence in reinforcement learning contexts.

Value confidence (top) represents the certainty about latent states of the environment, such as the value of one option in a multi-armed bandit task. Under a Bayesian framework, value confidence can be modelled straightforwardly as the inverse of the standard deviation of the posterior distribution over the option’s estimated value (represented by 0 in the figure). As subsequent rewards are observed, the estimate of the option’s value becomes more precise and the posterior narrows, which is reflected in an increase in value confidence. Decision confidence (bottom) reflects the certainty of having made a correct choice. Under a Bayesian formulation, it can be understood as the probability of being correct, illustrated here by the area of the decision variable (the distribution obtained by subtracting the unchosen option’s posterior from the chosen option’s posterior) that lies above zero. As trials progress and certainty about the environment increases, decision confidence better distinguishes between correct and incorrect responses. This increase in confidence on correct trials and decrease on incorrect trials is known as the “folded-x pattern” of confidence (Hangya et al., 2016; Sanders et al., 2016).
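The folded-x pattern described above can be illustrated with a minimal sketch. It assumes independent Gaussian posteriors for the two options, which is a simplification chosen for illustration and not necessarily the form used in the fitted models. The function names and the numerical values are hypothetical, and an "incorrect" choice is taken here to mean that the chosen option has the lower posterior mean.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def decision_confidence(mu_chosen, sd_chosen, mu_unchosen, sd_unchosen):
    """P(chosen value > unchosen value), assuming independent Gaussian posteriors.

    The decision variable is the difference between the two posteriors, so its
    mean is the difference of means and its variance is the sum of variances.
    """
    dv_mean = mu_chosen - mu_unchosen
    dv_sd = sqrt(sd_chosen ** 2 + sd_unchosen ** 2)
    return normal_cdf(dv_mean / dv_sd)


# Hypothetical posterior SDs that shrink as more rewards are observed.
sds = (0.4, 0.2, 0.1)
# "Correct": the chosen option has the higher posterior mean (0.6 vs 0.5).
correct = [decision_confidence(0.6, sd, 0.5, sd) for sd in sds]
# "Incorrect": the chosen option has the lower posterior mean (0.5 vs 0.6).
incorrect = [decision_confidence(0.5, sd, 0.6, sd) for sd in sds]

for sd, c, i in zip(sds, correct, incorrect):
    print(f"posterior SD={sd:.1f}  correct={c:.3f}  incorrect={i:.3f}")
```

Under these assumptions, confidence rises on correct choices and falls on incorrect ones as the posteriors narrow, which is the divergence the caption refers to.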

Value confidence is captured by a Bayesian model.

(a) Model fitting results for the two best models in the Boldt et al. (2019) data: the Bayesian model, where value confidence reflects the inverse of the standard deviation of the posterior distribution over an option’s inferred value, and the Rescorla-Wagner model, where value confidence equals the square root of the number of rewards observed for a specific option. First row: models’ fits to all the data. As blocks had different lengths, we divided each block into four quantiles of trials so that all blocks could be pooled together (x axis). Second row: example of models’ fits in one block only. (b) Model fitting results for the Quandt et al. (2022) data. First row: models’ fits to Exp. 1 data. Second row: models’ fits to Exp. 2 data. Models that are a function of the number of rewards experienced (such as the RW sqrt(n) model) cannot account for the negative effect that the spread of the reward distribution has on value confidence. (c) Model comparison results. The Bayesian model was the best-fitting model in the Boldt et al. (2019) data. (d) Model comparison results. The Bayesian model was the best-fitting model in both experiments in the Quandt et al. (2022) data. In panels (a) and (b), error bars represent the standard error of the mean (SEM) of the behavioural data, shaded regions represent the SEM of the models’ predictions and dots represent individual averages.
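The contrast between the two accounts of value confidence can be sketched as follows. This is an illustration only: it assumes a conjugate Gaussian update with known reward spread, which stands in for whatever posterior the fitted Bayesian model actually uses, and the function names, `prior_sd` and the example values are hypothetical.

```python
from math import sqrt


def bayesian_value_confidence(n_rewards, reward_sd, prior_sd=1.0):
    """Inverse posterior SD under an assumed Gaussian prior and Gaussian rewards.

    Posterior precision = prior precision + n / reward variance, so a wider
    reward distribution yields a wider posterior and lower confidence.
    """
    posterior_precision = 1.0 / prior_sd ** 2 + n_rewards / reward_sd ** 2
    posterior_sd = 1.0 / sqrt(posterior_precision)
    return 1.0 / posterior_sd


def count_based_value_confidence(n_rewards):
    """RW sqrt(n)-style proxy: depends only on how many rewards were seen."""
    return sqrt(n_rewards)


for reward_sd in (0.1, 0.3):  # hypothetical narrow and wide reward distributions
    bayes = bayesian_value_confidence(10, reward_sd)
    count = count_based_value_confidence(10)
    print(f"reward SD={reward_sd}: Bayesian={bayes:.2f}, sqrt(n)={count:.2f}")
```

In this sketch the count-based proxy returns the same value for both reward spreads, whereas the posterior-based measure drops when rewards are more variable.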

Value confidence modulates the exploration-exploitation trade-off.

(a) Illustration of an agent’s behaviour when the b1 parameter is positive. In this case, decision noise decreases as value confidence increases, because the slope of the softmax increases. This results in more exploitative decisions as value confidence increases over trials. (b) Model fitting results for the models without (first row) and with (second row) dynamic decision noise (i.e., without and with the b1 parameter). Dashed lines in these two rows represent chance-level performance (i.e., a proportion of correct choices equal to .5). The third row depicts the distribution of the b1 parameter across the three datasets analysed. The first column shows the Boldt et al. data, the second column the Study 1 data and the third column the Study 2 data. (c) Model comparison results. The model that includes the modulation of decision noise by value confidence was the best-fitting model across the three datasets. The conventions in this figure are the same as in Figure 2.
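A minimal sketch of a two-option softmax with confidence-dependent decision noise is given below. The linear form of the slope, b0 + b1 × value confidence, is an assumption made for illustration; the caption only states that a positive b1 makes the slope increase with value confidence. The function name and example values are hypothetical.

```python
from math import exp


def p_choose_a(value_a, value_b, value_confidence, b0, b1):
    """Probability of choosing option A under a two-option softmax.

    Assumed slope: b0 + b1 * value_confidence. With b1 > 0, the slope grows
    (decision noise shrinks) as value confidence increases.
    """
    slope = b0 + b1 * value_confidence
    return 1.0 / (1.0 + exp(-slope * (value_a - value_b)))


# Option A has the higher estimated value; confidence rises across trials.
for vc in (1.0, 3.0, 6.0):
    static = p_choose_a(0.7, 0.5, vc, b0=2.0, b1=0.0)
    dynamic = p_choose_a(0.7, 0.5, vc, b0=2.0, b1=1.5)
    print(f"value confidence={vc}: static={static:.3f}, dynamic={dynamic:.3f}")
```

With b1 set to zero the choice probability stays constant across confidence levels, while a positive b1 pushes choices toward the higher-valued option as confidence grows.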

Decision confidence deviates from the probability of being correct.

(a) Schematic representation of the models. Subtracting the posterior distributions over the two options’ expected values (i.e., chosen posterior - unchosen posterior) creates a decision variable. The probability of being correct is the area of this distribution that lies above zero. This probability of being correct represents the Bayesian confidence (top panel). The Bayesian-hybrid model (lower panel) combines this probability with the overall certainty about value estimates (i.e., the overall value confidence). (b) Model fitting results. The Bayesian model can capture correct trials, but its predictions deviate from the data, especially in incorrect trials (top row). The Bayesian-hybrid model, on the other hand, can account for both correct and incorrect trials (bottom row). (c) Correlations between models’ predictions and the data. Both the models’ predicted confidence and the empirical confidence data were rescaled to a range from 0 to 1 before computing the correlations (as confidence scales differed between datasets). The same pattern is found: the Bayesian model’s predictions deviate from the data, especially in incorrect trials, where the correlation is considerably diminished. (d) Model comparison results. The hybrid model was the best-fitting model in two out of three datasets. Considering all the data, the hybrid model was the preferred model for 65 out of 89 participants.
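One way to write the two models in panel (a) is sketched below. The notation is assumed for illustration: the chosen and unchosen posteriors are taken to be independent and Gaussian with means μ_c, μ_u and standard deviations σ_c, σ_u, and the hybrid model is written as a convex combination weighted by ω. The captions state only that ω weights the two sources of information, so the exact functional form may differ.

```latex
\begin{aligned}
\mathrm{DV} &\sim \mathcal{N}\!\left(\mu_c - \mu_u,\; \sigma_c^2 + \sigma_u^2\right), \\
p(\mathrm{correct}) &= P(\mathrm{DV} > 0)
  = \Phi\!\left(\frac{\mu_c - \mu_u}{\sqrt{\sigma_c^2 + \sigma_u^2}}\right), \\
\mathrm{confidence}_{\mathrm{hybrid}} &= \omega \, p(\mathrm{correct})
  + (1 - \omega)\, \mathrm{VC}_{\mathrm{overall}}.
\end{aligned}
```

Here Φ is the standard normal cumulative distribution function and VC_overall denotes the overall value confidence, assumed to be expressed on the same 0 to 1 scale as the probability of being correct.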

Interindividual differences in confidence judgments predict task performance.

(a) We found substantial individual variation in the way decision confidence was reported. By running a regression using the latent variables from the model, that is, the probability of being correct and the overall value confidence, we were able to quantify this variation. Here we plot the beta values associated with these latent variables: βp(correct) and βvalue conf. Participants for whom both betas were positive showed a pattern where confidence tended to increase in both correct and incorrect decisions (Participant A and Participant B), or not to decrease in incorrect decisions (Participant C). Participants with near-zero βvalue conf. values had a more Bayesian style of confidence reporting (illustrated by Participant D and Participant E). Interestingly, several participants had negative βvalue conf. values. At first glance, this is surprising, as an increase in certainty should intuitively lead to greater confidence. However, the Bayesian model was the best model for virtually all of these participants (light blue dots; note for instance the pattern of confidence judgments for Participant E), indicating that the overall value confidence variable should not be contributing here and that its negative value therefore suggests overfitting of the regression model. Note, however, that for some of these participants, like Participant F, a negative value appears to be justified, as her confidence indeed decreased throughout the trials, that is, as certainty increased. Finally, some participants had negative βp(correct) values, which led to poor metacognitive ability, as incorrect decisions were associated with higher confidence (Participant G). (b) Increasing values of βp(correct) predicted higher task performance. (c) As with task performance, the βp(correct) values predicted the values of the b1 parameter (the parameter that controlled the exploration-exploitation trade-off with increasing value confidence levels). (d) As in the b1 parameter case, only βp(correct) values were associated with the metacognitive ability of the participants, that is, their ability to distinguish their correct from their incorrect decisions with their confidence judgments. Note, however, that in this case a negative interaction between the two betas was found, meaning that increasing values of βvalue conf. negatively affected the ability of βp(correct) to predict metacognition.
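The per-participant regression in panel (a) can be sketched as an ordinary least-squares fit of confidence on the two latent variables after z-scoring the predictors. This is a minimal illustration under assumptions: the actual regression specification is not given in the caption, and the function names and example data are hypothetical. With standardised predictors and an intercept, the two betas have a closed form in terms of correlations and covariances.

```python
from statistics import fmean, pstdev


def zscore(xs):
    """Standardise a sequence to zero mean and unit (population) SD."""
    m, s = fmean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def confidence_betas(confidence, p_correct, overall_value_conf):
    """OLS betas of confidence on two z-scored predictors (intercept included).

    Returns (beta_p_correct, beta_value_conf). Requires the two predictors
    not to be perfectly correlated.
    """
    z1, z2 = zscore(p_correct), zscore(overall_value_conf)
    y_mean = fmean(confidence)
    y = [c - y_mean for c in confidence]  # centring absorbs the intercept
    r = fmean([a * b for a, b in zip(z1, z2)])   # predictor correlation
    c1 = fmean([a * b for a, b in zip(z1, y)])   # covariance of z1 with y
    c2 = fmean([a * b for a, b in zip(z2, y)])   # covariance of z2 with y
    denom = 1.0 - r * r
    return (c1 - r * c2) / denom, (c2 - r * c1) / denom


# Hypothetical trial-wise latent variables and confidence reports.
p = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
vc = [0.10, 0.40, 0.20, 0.80, 0.50, 0.90]
conf = [1.0 + 2.0 * a - 0.5 * b for a, b in zip(zscore(p), zscore(vc))]
beta_p_correct, beta_value_conf = confidence_betas(conf, p, vc)
print(beta_p_correct, beta_value_conf)
```

Because both predictors are on the same standardised scale, the two betas can be compared directly within a participant.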

Model fits for value confidence data, all models.

(a) Models’ predictions considering all the trials in the Boldt et al. (2019) dataset. (b) Models’ predictions considering only one block, for illustrative purposes. Error bars represent the standard error of the mean (SEM) of the behavioural data and shaded regions represent the SEM of the models’ predictions.

MCMC diagnostics summary.

Illustration of posterior distributions and trace plots.

(a) Scenario 1: only one reward (r = 0.5) was observed. (b) Scenario 2: twenty rewards (mean r = 0.54) were observed. Theta represents the inference over the mean value of a Beta distribution, i.e.: . The vertical dashed lines at the posteriors represent the mean of the posterior distribution.

Model and parameter recovery results for value confidence data.

(a) Model recovery results for the Boldt et al. (2019) experimental design. We found that the RW n model was mistakenly identified as the generating model for data produced by several other models, most prominently the RW log(n) model and the RW sqrt(n) model. (b) Model recovery results for the Quandt et al. (2022) experimental design. While this result does not help to distinguish among the models that are a function of the number of rewards seen, it does help to separate that group of models from the models that take the variance of the reward distribution into account. (c) Parameter recovery results for the parameter that weighted the contribution of the surprise term in the three models that included it.

Model and parameter recovery results for decision confidence data.

(a) Model recovery results regarding the models with and without the b1 parameter. (b) Parameter recovery results for the b0 and b1 parameters. (c) Model recovery results for the decision confidence models. (d) Parameter recovery results for the ω parameter that weighted the sources of information for decision confidence in the Hybrid model: the probability of being correct and the overall value confidence.

Evaluating the predictions of models that do not incorporate posterior uncertainty.

(a) “Mean difference” model predictions for the three main datasets reported in the main manuscript (first figure: Boldt et al. data; second figure: Study 1 data; third figure: Study 2 data). (b) “Mean difference + PEB” model predictions. Figures are ordered in the same way as in (a). (c) Model comparison results. Figures are ordered in the same way as in (a) and (b). (d) Model predictions for the dataset manipulating the mean and variance of the reward distributions of the options, plus model comparison results. First figure: Hybrid model; second figure: Mean difference + PEB model; third figure: Mean difference + PEB model including a scale parameter for the estimated values of the options; fourth figure: model comparison results. The conventions of the whole figure are the same as in the main manuscript, namely: green and red represent correct and incorrect responses respectively, error bars and shaded lines represent empirical data and model fits respectively, dots are individuals’ average values, and pink dots and purple asterisks in model comparison plots are exceedance and protected exceedance probabilities respectively.

Dissociating overall uncertainty from decision difficulty in the hybrid model.

(a) The two proposed decision scenarios: an easy choice (estimated expected values are very different) with high uncertainty (expected values are not precisely estimated) and an impossible choice (estimated expected values are equal) with low uncertainty (expected values are precisely estimated). Note that in the right panel the cyan distribution cannot be seen because it lies behind the orange one. (b) Model predictions when varying the ω value. (c) Model predictions using the empirical ω values found by fitting the model to the data. (d) A comparison of the beta values obtained from the regression shows that the contribution of the probability of being correct to confidence judgments is significantly higher than the contribution of overall value confidence. We opted to compare the regression beta values instead of the ω values because we normalised the predictors for the regression, so the beta values are directly comparable with each other.
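The dissociation between the two scenarios in panel (a) can be sketched by sweeping ω in a hybrid rule. The sketch rests on several assumptions: independent Gaussian posteriors, a hybrid written as a convex combination weighted by ω, and overall value confidence already expressed on a 0 to 1 scale. All names and numerical values are hypothetical and chosen only to mimic the two scenarios.

```python
from math import erf, sqrt


def p_correct(mu_chosen, sd_chosen, mu_unchosen, sd_unchosen):
    """P(chosen > unchosen), assuming independent Gaussian posteriors."""
    z = (mu_chosen - mu_unchosen) / sqrt(sd_chosen ** 2 + sd_unchosen ** 2)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def hybrid_confidence(p, overall_value_conf, omega):
    """Assumed hybrid rule: omega weights p(correct) against value confidence."""
    return omega * p + (1.0 - omega) * overall_value_conf


# Hypothetical scenarios; overall value confidence is given on a 0-1 scale.
scenarios = {
    "easy choice, high uncertainty": {
        "p": p_correct(0.8, 0.3, 0.2, 0.3),
        "overall_value_conf": 0.1,
    },
    "impossible choice, low uncertainty": {
        "p": p_correct(0.5, 0.02, 0.5, 0.02),
        "overall_value_conf": 0.9,
    },
}

for omega in (0.0, 0.5, 1.0):
    for name, s in scenarios.items():
        conf = hybrid_confidence(s["p"], s["overall_value_conf"], omega)
        print(f"omega={omega:.1f}  {name}: confidence={conf:.3f}")
```

In this sketch, a high ω favours the easy choice, whereas a low ω favours the impossible but precisely estimated choice, which is the separation the two scenarios are designed to expose.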