Introduction

Disinformation is a pervasive and pernicious feature of the modern world (1). It is linked to negative social impacts that include public-health risks (2–4), political radicalization (5,6), violence (6–8) and adherence to conspiracy theories (8,9). Consequently, there is a growing interest in comprehending how false information propagates across social networks (10–12), including an interest in designing strategies to curb its impact (13–16), albeit with limited success to date (17). However, there is also a considerable knowledge lacuna regarding how individuals learn and update their beliefs when exposed to potential disinformation. Addressing this gap is crucial, as it has been suggested that disinformation propagates by exploiting cognitive biases (18–22). Thus, discerning which aspects of learning from potential disinformation are normative versus biased has the potential to better enable targeted interventions aimed at countering its harmful effects.

We start with an assessment of a normative, Bayesian, prediction that individuals should modulate their learning as a function of the credibility of an information source, and learn more from credible, truthful, sources. This prediction is supported by previous findings showing that individuals flexibly and adaptively adjust their learning rates in response to key statistical features of the environment. For example, learning is more rapid when observation-uncertainty (“noise”) decreases, and in volatile, changing environments compared to stable ones, particularly following detection of change-points that render pre-change knowledge obsolete (23–25). Moreover, human choice is strongly influenced by social information of high (as opposed to low) credibility, such as majority opinions, more confident judgments (26) and large group consensus (27). Additionally, people are disposed to follow trustworthy advisors (28), including those who have recommended optimal actions in the past (29,30).

We hypothesised that in a disinformation context individuals would show significant deviations from normative learning, reflecting a diversity of biases. First, filtering non-credible information is likely to be cognitively demanding (31), which predicts that such information would impact belief updating even if individuals are aware it is untrustworthy. An additional consideration is that humans tend to learn more from positive, self-confirming information (32–34), which presents one in a positive light. We conjectured, influenced by ideas from motivated cognition (35), that low-credibility information provides a pathway for amplification of such a bias, as uncertainty regarding information-veracity might dispose individuals to self-servingly interpret positive information as true and explain away negative information as false. A final consideration is how exposure to potential disinformation impacts learning from trusted sources. One possibility is that disinformation serves as a background context against which credible information appears more salient. Alternatively, it might lead individuals to strategically reduce their overall learning in disinformation-rich environments, resulting in diminished learning from credible sources.

To address these questions, we adopt a novel approach within the disinformation literature by exploiting a Reinforcement Learning (RL) experimental framework (36). This has the advantage of providing a suite of behavioural tasks, and computationally tractable models, that enable estimation of latent aspects of learning processes, such as belief updating. Moreover, RL enables an examination of the dynamics of belief updates over short timescales, reflecting real-life engagements with disinformation, such as deciding whether to share a post on social media. Furthermore, RL has proven successful in characterizing key decision-making biases (e.g., positivity bias (37–39)), albeit in scenarios where learners receive accurate information. Within RL, we can also establish benchmarks of Bayesian learning that allow a characterization of biases and deviations from normative standards. Finally, previous literature has suggested a role for reinforcement in the dissemination of disinformation, where individuals may receive positive reinforcement (likes, shares) for spreading sensationalized or misleading information on social media platforms, inadvertently reinforcing such behaviours and contributing to the proliferation of disinformation (15,40,41).

We developed a novel “disinformation” version of the classical two-armed bandit task to test the effects of potential disinformation on learning. In the standard task (36,37,42), participants choose repeatedly between two unfamiliar bandits (i.e., slot machines) that provide rewards with different probabilities, in order to learn which bandit is more rewarding. Critically, in our disinformation-variant, true choice outcomes (reward or non-reward) were latent, i.e., unobservable. Instead, participants were informed about choice-outcomes by computer-programmed “feedback agents”, who were disposed to occasionally disseminate disinformation by lying (reporting a reward when the true outcome was non-reward or vice versa). As these feedback-agents varied in truthfulness, this allowed us to test the effects of source-credibility on learning. We show, across two studies, that the extent of belief-updating increases as a function of source-credibility. However, there were striking deviations from normative Bayesian learning, and we identify several sources of bias related to processing potential disinformation. These include learning from non-credible information, an amplified positivity bias for non-credible sources, and increased learning from trustworthy information when it is preceded by non-credible information.

Results

Disinformation two-armed bandit task

We conducted a discovery (n=104) and follow-up study (n=204). In both studies the learning tasks had the same basic structure but with a few subtle differences between them (see Discovery study and SI Discovery study methods). To anticipate, the results of both studies support similar conclusions, and, in the results section, we focus on the main study, with the final results section detailing similarities and differences in findings across the two studies.

In the main study, participants (n=204) completed the disinformation two-armed bandit task. In the traditional two-armed bandit task (36,37,42), participants choose between two slot-machines (i.e., bandits) differing in their reward probability. Participants are not instructed about bandit reward probabilities but instead are provided with veridical choice feedback (e.g., reward or non-reward), allowing participants to learn which bandit is more rewarding. By contrast, in our disinformation version true choice-outcomes were latent (i.e., unobserved) and participants were informed about these outcomes via three computerized feedback-agents, who had privileged access to the true outcomes.

Before commencing the task, participants were instructed that feedback agents could disseminate disinformation, meaning that they were disposed to lie on a random minority of trials, reporting a reward when the true outcome was a non-reward, or vice versa (Fig. 1a). Participants were explicitly instructed about the credibility of each agent (i.e., the proportion of truth-telling trials), indicated by a “star system”: the 3-star agent was always truthful, the 2-star agent told the truth on 75% of the trials, while the 1-star agent did so on 50% of the trials (Fig. 1b). Note that while the 1-star agent’s feedback was statistically equivalent to random feedback, participants were not explicitly instructed about this equivalence. Each experimental block encompassed 3 bandit pairs, each presented over 15 trials in a randomly interleaved manner. Each agent provided feedback for 5 trials for each bandit pair (with the agent order interleaved within the bandit pair). Thus, on every trial, participants were presented with one of the bandit pairs and the feedback agent associated with that trial. Upon selecting a bandit, they then received feedback from the agent (Fig. 1c). Importantly, at the end of the experiment participants received a performance-based bonus based on true bandit outcomes, which could differ from agent-provided feedback. Within each bandit-pair one bandit provided a (true) reward on 75% of the trials and the other on 25% of trials. Choice accuracy, i.e., the probability of selecting the more rewarding bandit (within each pair), was significantly above chance (mean accuracy = 0.62, t(203) = 19.94, p<.001) and improved as a function of increasing experience with each bandit-pair (average overall improvement over 15 trials = 0.22, t(203)=19.95, p<0.001) (Fig. 1d).

Task design and performance.

a, Illustration of agent-feedback. Each selected bandit generated a true outcome, either a reward or a non-reward. Participants did not see this true outcome but instead were informed about it via a computerised feedback agent (reward: dollar sign; non-reward: sad emoji). Agents told the truth on most trials (left panel). However, on a random minority of trials they lied, reporting a reward when the true outcome was a non-reward or vice versa (right panel). b, Participants received feedback from 3 distinct feedback agents of variable credibility (i.e., truth-telling probability). Credibility was represented using a star-based system: a 3-star agent always reported the truth (and never lied), a 2-star agent reported the truth on 75% of trials (lying on the remaining 25%), and a 1-star agent reported the truth half of the time (lying on the other half). Participants were explicitly instructed and quizzed about the credibility of each agent prior to the task. c, Trial-structure: On each trial participants were first presented with the feedback agent for that trial (here, the 2-star agent) and next offered a choice between a pair of bandits (represented by identicons) (for 2 sec). Next, choice-feedback was provided by the agent. d, Learning curves. Average choice accuracy as a function of trial number (within a bandit-pair). Thin lines: individual participants; thick line: group mean, with thickness representing the group standard error of the mean for each trial.

Credible feedback promotes greater learning

A hallmark of RL value-learning is that participants are more likely to repeat a choice following positive compared to negative reward-feedback (henceforth, “feedback effect on choice repetition”). We tested a hypothesis, based on Bayesian-normative reasoning, that this tendency would increase as a function of agent-credibility (Fig. 3a). Thus, in a binomial mixed-effects model we regressed choice-repetition (i.e., whether participants repeated their choice from the most recent trial featuring the same bandit pair; 0-switch; 1-repeat) on feedback-valence (negative or positive) and agent-credibility (1-, 2-, or 3-star), where these are taken from the last trial featuring the same bandit pair (Methods for model-specification). Feedback valence exerted a positive effect on choice-repetition (b=0.72, F(1,2436)=1369.6, p<0.001) and interacted with agent-credibility (F(2,2436)=307.11, p<0.001), with the feedback effect being greater for more credible agents (3-star vs. 2-star: b=0.91, F(1,2436)=351.17; 3-star vs. 1-star: b=1.15, t(2436)=24.02; and 2-star vs. 1-star: b=0.24, t(2436)=5.34, all p’s<0.001). Additionally, we found a positive feedback-effect for the 3-star agent (b=1.41, F(1,2436)=1470.2, p<0.001), and a smaller feedback-effect for the 2-star agent (b=0.49, F(1,2436)=230.0, p<0.001). These results support our hypothesis that learning increases as a function of information credibility (note that the feedback effect for the 1-star agent is examined below; see “Non-credible feedback elicits learning”).
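To make the structure of this analysis concrete, the sketch below shows a simplified version in Python; the column names and file name are hypothetical, and it fits a plain logistic regression, whereas the analysis reported here was a binomial mixed-effects model with participant-level random effects (see Methods).

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data: one row per trial with columns
#   repeat      - 1 if the choice from the last same-pair trial was repeated, 0 otherwise
#   valence     - feedback valence on that last same-pair trial (0 = negative, 1 = positive)
#   credibility - feedback agent on that trial ("1-star", "2-star", "3-star")
df = pd.read_csv("trial_data.csv")  # placeholder file name

# Fixed-effects approximation of the repetition analysis: does the effect of
# feedback valence on choice-repetition depend on agent credibility?
model = smf.logit("repeat ~ valence * C(credibility)", data=df).fit()
print(model.summary())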

To confirm that increased learning based on information credibility is expected under an assumption that subjects adhere to normative Bayesian reasoning, we formulated two Bayesian models whereby the latent value of each bandit is represented as a distribution over the probability that a bandit is truly rewarding. During the feedback stage of each trial, the value of the chosen bandit is updated (based on feedback valence and credibility) according to Bayes rule (Fig. 2a, top panel; Fig. S5c for an illustration of the model; for full model descriptions, see Methods). In the instructed-credibility Bayesian model, belief-updates are based on the instructed credibility of feedback-sources. In contrast, a free-credibility Bayesian model allows for the possibility that value inference is Bayesian but based on “distorted probabilities” (43), attributing non-instructed degrees of credibility to sources of false information (despite our explicit instructions on the credibility of different agents). In this variant, we fixed the credibility of the 3-star agent to 1 and estimated the credibility of the 2- and 1-star agents as free parameters (which were highly recoverable; see Methods and SI 3.3). Simulations based on both Bayesian models (see Methods) predicted increased learning as a function of feedback credibility (Fig. 3b; top panels; SI 3.1.1.1 Tables S3 and S4 for statistical analysis).
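For illustration, a minimal sketch of such a Bayesian update, assuming a discretized belief grid over the latent reward probability (the grid, variable names and example values are ours, not the implementation detailed in Methods):

import numpy as np

def bayes_update(belief, r_grid, feedback, credibility):
    """Update the belief over a bandit's latent reward probability r.

    belief      : probability vector over r_grid (sums to 1)
    r_grid      : candidate reward probabilities, e.g. np.linspace(0.01, 0.99, 99)
    feedback    : 1 if the agent reported a reward, 0 otherwise
    credibility : probability that the agent tells the truth (1.0, 0.75 or 0.5)
    """
    # The agent reports a reward if the true outcome is a reward and it is truthful,
    # or if the true outcome is a non-reward and it lies.
    p_report_reward = r_grid * credibility + (1 - r_grid) * (1 - credibility)
    likelihood = p_report_reward if feedback == 1 else 1 - p_report_reward
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Example: flat prior, positive feedback from the 75%-credibility (2-star) agent.
r_grid = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(r_grid) / len(r_grid)
posterior = bayes_update(prior, r_grid, feedback=1, credibility=0.75)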

Next, we formulated a family of non-Bayesian computational RL models. Importantly, these models can flexibly express non-Bayesian learning patterns and, as we show in following sections, can serve to identify non-normative learning biases. Here, an assumption is that during feedback, the value of a chosen bandit (which here is represented by a point estimate, “Q value”, rather than a distribution) either increases or decreases (for positive or negative feedback, respectively) according to a magnitude quantified by the free “Credit-Assignment (CA)” model parameters (44) (Fig. 2a, bottom panel; Fig. S5b; Methods). Different model variants varied as to how task-variables influenced CA parameters, with the “null” model attributing the same CA to all feedback-agents (regardless of their credibility, i.e., a single free CA-parameter), whereas the “credibility-CA” model availed of three separate CA parameters, one for each feedback agent, thereby allowing us to test how learning was modulated by feedback-credibility. Using a bootstrap generalized-likelihood ratio test for model comparison (Methods) we rejected the null model (group level: p<0.001), in favour of the credibility-CA model. Furthermore, model-simulations based on participants’ best-fitting parameters (Methods) falsified the null model, as it failed to predict credibility-modulated learning, showing instead equal learning from all feedback sources (Fig. 3b; bottom-left panel). In contrast, the credibility-CA model successfully predicted increased learning as a function of credibility (Fig. 3b, bottom-right panel) (see SI 3.1.1.1 Tables S5 and S6).
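A minimal sketch of the credit-assignment update follows; the delta-rule form and the parameter values here are illustrative assumptions, not the exact parametrization given in Methods.

def ca_update(q, feedback, ca):
    """Update the point value Q of the chosen bandit.

    q        : current value of the chosen bandit
    feedback : 1 for reported reward, 0 for reported non-reward
    ca       : credit-assignment parameter for the current feedback agent
    """
    target = 1.0 if feedback == 1 else 0.0
    # Value moves toward 1 after positive feedback and toward 0 after negative feedback,
    # by an amount scaled by the CA parameter.
    return q + ca * (target - q)

# The null model uses a single CA for all agents, whereas the credibility-CA model
# uses one CA per agent (example values only).
ca_null = {"3-star": 0.4, "2-star": 0.4, "1-star": 0.4}
ca_credibility = {"3-star": 0.6, "2-star": 0.3, "1-star": 0.1}
q = ca_update(0.5, feedback=1, ca=ca_credibility["2-star"])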

After confirming CA parameters are highly recoverable (see Methods and SI 3.3), we examined how the Maximum Likelihood (ML) CA parameters from the credibility-CA model differed as a function of feedback credibility (Fig. 3c). Using a mixed effects model (Methods), we regressed the CA parameters on their associated agents, finding that CA differed across the agents (F(2,609)=212.65, p<0.001), increasing as a function of agent-credibility (3-star vs. 2-star: b=1.02, F(1,609)=253.73; 3-star vs. 1-star: b=1.24, t(609)=19.31; and 2-star vs. 1-star: b=0.22, t(609)=3.38, all p’s<0.001). We found similar results in the discovery study (see SI 1.2.1). Thus, these results provide convergent support for our conjecture that feedback from more credible sources leads to more pronounced learning.

Substantial deviations from Bayesian learning

We next implemented a model comparison between each of the Bayesian models and the credibility-CA model, using a parametric bootstrap cross-fitting method (Methods). We found that the credibility-CA model provided a superior fit for 71% of participants (sign test; p<0.001) when compared to the instructed-credibility Bayesian model (Fig. 2b), and for 53.9% of participants (p=0.29) when compared to the free-credibility Bayesian model (Fig. 2c). The discovery study revealed even stronger results supporting a conclusion that the credibility-CA model was superior to both Bayesian models for most subjects (see SI 1.2.2), suggesting pervasive deviations from normative learning.

To further characterise these deviations, we used a “cross-fitting” method. We simulated synthetic data based on Bayesian agents (using participants’ best-fitting parameters), but fitted these data using the CA-models, obtaining what we term “Bayesian-CA parameters” (Fig. 2d; Methods). A comparison of these Bayesian-CA parameters with empirical-CA parameters, obtained by fitting CA models to empirical data, allowed us to uncover patterns consistent with, or deviating from, normative Bayesian value-based inference. Using this approach, we found that both the instructed-credibility and free-credibility Bayesian models predicted increased Bayesian-CA parameters as a function of agent credibility (Fig. 3c; see SI 3.1.1.2 Tables S8 and S9). However, an in-depth comparison between Bayesian and empirical CA parameters revealed discrepancies from normative Bayesian learning.
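In outline, the cross-fitting procedure can be summarized as follows (schematic code; fit_model and simulate_model are placeholders standing in for the fitting and simulation routines described in Methods):

def cross_fit(fit_model, simulate_model, participant_data, trial_sequences):
    """Schematic of the cross-fitting procedure (placeholder function arguments)."""
    # 1. Fit a Bayesian model to the participant's empirical choices (ML estimation).
    bayes_params = fit_model("bayesian", participant_data)
    # 2. Simulate synthetic choice data from the fitted Bayesian agent on the same trial sequences.
    synthetic_data = simulate_model("bayesian", bayes_params, trial_sequences)
    # 3. Fit the CA model to the synthetic data, yielding "Bayesian-CA parameters".
    bayesian_ca = fit_model("credibility_ca", synthetic_data)
    # 4. Fit the CA model to the empirical data, yielding "empirical CA parameters".
    empirical_ca = fit_model("credibility_ca", participant_data)
    # 5. Discrepancies between the two parameter sets index deviations from Bayesian learning.
    return empirical_ca, bayesian_ca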

Computational models and cross-fitting method.

a, Summary of the two model families. Bayesian models (top panel) represent a benchmark for normative learning. In these models, the observer maintains a belief-distribution over the probability a bandit is truly rewarding (denoted r). On each trial, this distribution is updated for the selected bandit according to Bayes rule, based on the valence (i.e., rewarding/non-rewarding; denoted f) and credibility of the trial’s reward feedback (denoted c). Credit-assignment models (bottom panel) are used to test deviations from Bayesian learning. Here, the observer maintains a subjective point-value (denoted Q) for each bandit. On each trial the value of the chosen bandit is updated based on a free CA parameter, quantifying the extent of value increase/decrease following positive/negative feedback. CA parameters can be modulated by the valence and credibility of feedback. b,c, Model selection between the credibility-CA model and the two variants of Bayesian models. Most participants were best fitted by the credibility-CA model, compared to the instructed-credibility Bayesian model (b) or the free-credibility Bayesian model (c). d, Cross-fitting method: Firstly, we fit a Bayesian model to empirical data, to estimate its (ML) parameters. This yields the Bayesian learner that comes closest to accounting for a participant’s choices. Secondly, we simulate synthetic data based on the Bayesian model, using its ML parameters, to obtain instances of how a Bayesian learner would behave in our task. Thirdly, we fit these synthetic data with a CA model, thus estimating “Bayesian CA parameters”, i.e., CA parameters capturing the performance of a Bayesian model. Finally, we fit the CA model directly to empirical data to obtain “empirical CA parameters”. A comparison of Bayesian and empirical CA parameters allows us to identify which aspects of behaviour are consistent with Bayesian belief updating, as well as to characterize biases in behaviour that deviate from normative Bayesian learning.

Learning adaptations to credibility.

a, Probability of repeating a choice as a function of feedback valence and agent-credibility on the previous trial for the same bandit pair. The effect of feedback-valence on repetition increases as feedback credibility increases, indicating that more credible feedback has a greater effect on behaviour. b, Similar analysis as in panel a, but for synthetic data obtained by simulating the main models. Simulations were computed using the ML parameters of participants for each model. The null model (bottom left) attributes a single CA to all credibility levels, hence feedback exerts a constant effect on repetition (independently of its credibility). The credibility-CA model (bottom right) allowed credit assignment to change as a function of source credibility, predicting varying effects of feedback at different credibility levels. The instructed-credibility Bayesian model (top left) updated beliefs normatively based on the true credibility of the feedback, and therefore predicted an increasing effect of feedback on repetition as credibility increased. Finally, the free-credibility Bayesian model (top right) allowed for the possibility that participants use distorted credibilities for the 1-star and 2-star agents while following a Bayesian strategy, also predicting an increase in the effect of feedback as credibility increased. c, ML credit-assignment parameters for the credibility-CA model. Participants show a CA increase as a function of agent-credibility, as predicted by Bayesian-CA parameters for both the instructed-credibility and free-credibility Bayesian models. Moreover, participants showed a positive CA for the 1-star agent (which essentially provides random feedback), which is only predicted by cross-fitting parameters for the free-credibility Bayesian model. d, ML credibility parameters for a free-credibility Bayesian model attributing credibility 1 to the 3-star agent but estimating credibility for the two lying agents as free parameters. Small dots represent results for individual participants/simulations, big circles represent the group mean (a,b,d) or median (c) of participants’ behaviour. Results of the synthetic model simulations are represented by diamonds (instructed-credibility Bayesian model), squares (free-credibility Bayesian model), upward-pointing triangles (null-CA model) and downward-pointing triangles (credibility-CA model). Error bars show the standard error of the mean. (*) p<.05, (**) p<0.01, (***) p<.001.

Non-credible feedback elicits learning

While our task instructions framed the 1-star agent as highly deceptive, lying 50% of the time, its feedback is statistically equivalent to entirely non-informative, i.e., random, feedback. Thus, normatively, participants should ignore and filter out such feedback from their belief updates. Indeed, for the 1-star agent, simulations based on the instructed-credibility Bayesian model provided no evidence for either a positive feedback-effect on choice-repetition (mixed effects model described above; b=-0.01, t(2436)=-0.41, p=0.68; Fig. 3b, top-left) or a positive Bayesian-CA (b=-0.01, t(609)=-0.31, p=0.76; Fig. 3c). However, contrary to this, we hypothesized that participants would struggle to entirely disregard non-credible feedback. Indeed, we found a positive feedback-effect on choice-repetition for the 1-star agent (mixed effects model, delta(M)=0.049, b=0.25, t(2436)=8.05, p<0.001), indicating that participants were more likely to repeat a bandit selection after receiving positive feedback from this agent (Fig. 3a). Similarly, the CA parameter for the 1-star agent in the credibility-CA model was positive (b=0.23, t(609)=4.54, p<0.001) (Fig. 3c). The upshot of this empirical finding is that participants updated their beliefs based on essentially random feedback (see Fig. S7 for analysis showing that this resulted in decreased accuracy rates).
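The normative irrelevance of the 1-star feedback follows directly from the likelihood used by the Bayesian models: for a bandit with latent reward probability r and an agent with credibility c,

P(\text{report}=\text{reward} \mid r, c) \;=\; r\,c + (1-r)(1-c),

which equals 0.5 for every r when c = 0.5. The likelihood is therefore flat, the posterior over r is identical to the prior, and a Bayesian learner leaves its beliefs unchanged.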

A potential explanation for this finding is that participants do rely on a Bayesian strategy but “distort probabilities”, attributing non-instructed degrees of credibility to lying sources (despite our explicit instructions on the credibility of different agents). Consistent with this, the ML-estimated credibility of the 1-star agent (Fig. 3d) was significantly greater than 0.5 (Wilcoxon signed-rank test, median=0.08, z=5.50, p<0.001), allowing the free-credibility Bayesian model to predict a positive feedback effect on choice-repetition (mixed-effects model: b=0.12, t(2436)=9.48, p<0.001; Fig. 3b, top-right) and a positive Bayesian-CA (b=0.08, t(609)=3.32, p<0.001; Fig. 3c) for the 1-star agent. For corresponding results in the discovery study see SI 1.2.3. In our Discussion we elaborate on why it might be difficult to filter out this feedback even if one can explicitly infer its randomness.

Increased learning from fully credible feedback when it follows non-informative feedback

A comparison of empirical and Bayesian credit-assignment parameters revealed a further deviation from normative learning: both Bayesian models predicted an attenuated credit-assignment for the 3-star agent [Wilcoxon signed-rank test, instructed-credibility Bayesian model (median difference=0.74, z=11.14); free-credibility Bayesian model (median difference=0.62, z=10.71), all p’s<0.001] (Fig. 3a). One explanation for enhanced learning for the 3-star agent is a contrast effect, whereby credible information looms larger against a backdrop of non-credible information. To test this hypothesis, we examined whether the impact of feedback from the 3-star agent is modulated by the credibility of the agent in the trial immediately preceding it. More specifically, we reasoned that the impact of a 3-star agent would be amplified by a “low credibility context” (i.e., when it is preceded by a low credibility trial), even when this context is entirely irrelevant for current learning. In a binomial mixed effects model, we regressed choice-repetition on feedback valence from the last trial featuring the same bandit pair, and on the feedback agent on the trial immediately preceding that last trial (i.e., the contextual credibility; Methods for model-specification). This analysis included only trials for which the last same-pair trial featured the 3-star agent and in which the context trial featured a different bandit pair (Fig. 4a). We found that feedback valence interacted with contextual credibility (F(2,2419)=5.06, p=0.006) such that a feedback-effect (from the 3-star agent) was greater when preceded by the temporal-context of the 1-star agent, compared to a context involving a 2-star (b=0.20, t(2419)=-2.42, p=0.016) or a 3-star agent (b=0.24, t(2419)=-3.01, p=0.003) (Fig. 4b). There was no difference between 2-star and 3-star agent contexts (b=0.051, F(1,2419)=0.39, p=0.53). Thus, these results support an interpretation that credible feedback exerts a greater impact on participants’ learning when it follows non-credible feedback.

Contextual effects and learning.

a, Trials contributing to the analysis of effects of credibility-context on learning from the fully credible agent. We included only “current trials (n)” for which: 1) the last trial (trial n-k) offering the same bandit pair was associated with the 3-star agent, and 2) the immediately preceding context trial (n-k-1) featured a different bandit pair (providing a learning context irrelevant to current choice). We examined how choice-repetition (from n-k to n) was modulated by feedback valence on the last same-pair trial and by the feedback agent on the context trial (i.e., the credibility context). Note the greyed-out star-rating on the current trial indicates the identity of the current agent and was not included in the analysis. b, Difference in probability of repeating a choice after receiving positive vs negative feedback (i.e., feedback effect) from the 3-star agent, as a function of the credibility context. The 3-star agent feedback-effect is greater when preceded by a low-credibility context (i.e., 1-star agent in preceding trial), than when preceded by a higher credibility context (i.e., 2-star or 3-star agent in preceding trial). Big circles represent the group mean, and error bars show the standard error of the mean. (*) p<.05, (**) p<0.01.

Positivity bias in learning and credibility

Previous research has shown that reinforcement learning is characterized by a positivity bias, wherein subjects systematically learn more from positive than from negative feedback (37,39). One account is that this bias results from motivated-cognition influences on learning, whereby participants favour positive feedback that reflects well on their choices. We conjectured that feedback of ambiguous veracity (i.e., from the 1-star and 2-star agents) would promote this bias by allowing participants to explain away negative feedback as a case of agent lying, while choosing to believe positive feedback. Previous research has quantified positivity bias in two ways: 1) as the absolute difference between credit-assignment based on positive or negative feedback, and 2) as the same difference but relative to the overall extent of learning.

To investigate this bias across different levels of feedback credibility we formulated a more detailed variant of the CA model. To quantify the extent of a chosen bandit’s value increase or decrease (following positive or negative feedback, respectively), the “credibility-valence-CA” variant included separate CA parameters for positive (CA+) and negative (CA-) feedback for each feedback agent. In effect, this model variant enabled us to test whether different levels of feedback credibility elicited a positivity bias (i.e., CA+ > CA-). Using a bootstrap generalized-likelihood ratio test for model comparison (Methods), we rejected, in favour of the credibility-valence-CA model, the null-CA model, the credibility-CA model and a “constant feedback-valence bias” CA model, which attributed a common valence bias (CA+ minus CA-) to all agents (group level: all p’s<0.001). This test supported our choice of a flexible CA parametrization as a factorial function of agent and feedback-valence.
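As an illustration of this parametrization (reusing the simplified delta-rule form of the earlier CA sketch; the parameter values are made up, not fitted estimates):

# Separate credit-assignment parameters for positive (CA+) and negative (CA-) feedback,
# one pair per agent (illustrative values only).
ca_plus  = {"3-star": 0.7, "2-star": 0.4, "1-star": 0.2}
ca_minus = {"3-star": 0.5, "2-star": 0.05, "1-star": 0.0}

def cv_ca_update(q, feedback, agent):
    """Credibility-valence-CA update of the chosen bandit's value."""
    if feedback == 1:
        return q + ca_plus[agent] * (1.0 - q)   # positive feedback: move value toward 1 by CA+
    return q + ca_minus[agent] * (0.0 - q)      # negative feedback: move value toward 0 by CA-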

After confirming the parameters of this model were highly recoverable (see Methods and SI 3.3), we used a mixed effects model to regress the ML parameters (Fig. 5a) on their associated agent-credibility and valence (see Methods). This revealed that participants attributed a greater CA to positive feedback than to negative feedback (b=0.64, F(1,1218)=37.39, p<0.001). Strikingly, for lying agents, participants selectively assigned credit based on positive feedback (1-star: b=0.61, F(1,1218)=22.81, p<0.001; 2-star: b=0.85, F(1,1218)=43.5, p<0.001), with no evidence for significant credit-assignment based on negative feedback (1-star: b=-0.03, F(1,1218)=0.07, p=0.79; 2-star: b=0.14, F(1,1218)=1.28, p=0.25). Only for the 3-star agent was credit-assignment positive for both positive (b=1.83, F(1,1218)=203.1, p<0.001) and negative (b=1.25, F(1,1218)=95.7, p<0.001) feedback. We found no significant interaction effect between feedback valence and credibility on CA (F(2,1218)=0.12, p=0.88; Fig. 5a-b).

However, we found evidence for agent-based modulation of positivity bias when this bias was measured in relative terms. Here we calculated, for each participant and agent, a relative Valence Bias Index (rVBI) as the difference between the credit assignment for positive feedback (CA+) and negative feedback (CA-), relative to the overall magnitude of CA (i.e., |CA+| + |CA-|) (Fig. 5c). rVBI was significantly positive for all credibility levels [Wilcoxon signed-rank test, 50% credibility (median=0.92, z=6.04), 75% credibility (median=0.73, z=6.69) and 100% credibility (median=0.21, z=4.96), all p’s<0.001]. Critically, the rVBI varied depending on the credibility of feedback (Friedman test, χ2(2) = 62.39, p<0.001), such that the rVBI for the 3-star agent was lower than that for both the 1-star (Wilcoxon signed rank tests, median difference = -0.42, z = -5.40, p<0.001) and the 2-star agent (median difference = -0.22, z = -5.63, p<0.001). Feedback with 50% and 75% credibility yielded similar rVBI values (median difference (75%-50%) = -0.08, z = -1.91, p = 0.055). Results from the discovery study support similar conclusions (Fig. S3; see SI 1.2.5). Finally, a positivity bias could not stem from a Bayesian strategy, as both Bayesian models predicted a negativity bias (Fig. 5b-c; Fig. S8; and SI 3.1.1.3 Table S11-S12, 3.2.1.1, and 3.2.1.2). The upshot is that positivity bias, relative to the overall extent of CA, was greater for lying than for fully credible agents, a pattern deviating from Bayesian normativity.
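Written explicitly, the absolute and relative indices are

\mathrm{VBI} = \mathrm{CA}^{+} - \mathrm{CA}^{-}, \qquad \mathrm{rVBI} = \frac{\mathrm{CA}^{+} - \mathrm{CA}^{-}}{\lvert\mathrm{CA}^{+}\rvert + \lvert\mathrm{CA}^{-}\rvert},

so that positive values indicate a positivity bias and negative values a negativity bias.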

Positivity bias as a function of agent-credibility.

a, ML parameters from the credibility-valence-CA model. CA+ and CA- are free parameters representing credit assignments for positive and negative feedback respectively (for each credibility level). Our data revealed a positivity bias (CA+ > CA-) for all credibility levels. b, Absolute valence bias index (defined as CA+ - CA-) based on the ML parameters from the credibility-valence-CA model. Positive values indicate a positivity bias, while negative values represent a negativity bias. c, Relative valence bias index (defined as (CA+ - CA-)/(|CA+| + |CA-|)) based on the ML parameters from the credibility-valence-CA model. Positive values indicate a positivity bias, while negative values represent a negativity bias. Small dots represent fitted parameters for individual participants and big circles represent the group median (a,b) or mean (c) (both of participants’ behavior), while squares are the median or mean of the fitted parameters of the free-credibility Bayesian model simulations. Error bars show the standard error of the mean. (***) p<.001 for ML fits of participants’ behavior.

True feedback elicits greater learning

Our findings are consistent with participants modulating the extent of credit-assignment based solely on cued task-variables, such as feedback-credibility and valence. However, we also considered another possibility: that participants might infer, on a trial-by-trial basis, whether the feedback they received was true or false, and adjust their credit assignment based on this inference. For example, for a given feedback-agent, participants might boost the credit assigned to a chosen bandit as a function of the degree to which they believe the feedback was true. Notably, Bayesian inference supports a trial-level calculation of a posterior probability that feedback is true, based on its credibility, its valence and a prior belief (based on experiences in previous trials) regarding the probability that the chosen bandit is truly rewarding (Fig. 6a). These beliefs can partially discriminate between truthful and false feedback. To demonstrate this, we calculated a Bayesian posterior feedback-truthfulness belief for each participant and each trial featuring the 1- or 2-star agents (Methods; recall that for the 3-star agent feedback is always true). On testing whether these posterior-truthfulness beliefs vary as a function of objective feedback truthfulness (true vs. lie), we found beliefs are stronger for truthful than for untruthful trials for both agents (1-star agent: mean difference=0.10, t(203)=39.47, p<0.001; 2-star agent: mean difference=0.08, t(203)=34.43, p<0.001) (Fig. 6b and Fig. S9a). Note that this calculation was feasible because, as experimenters, we had privileged access to the objective truth of the choice-feedback: when designing the experimental sessions, we generated latent true choice outcomes, which could be compared to agent-reported feedback.
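Writing p for the prior probability (before feedback) that the chosen bandit's current true outcome is a reward, and c for the agent's credibility, this posterior is a direct application of Bayes rule:

P(\text{true} \mid \text{reward reported}) = \frac{c\,p}{c\,p + (1-c)(1-p)}, \qquad P(\text{true} \mid \text{non-reward reported}) = \frac{c\,(1-p)}{c\,(1-p) + (1-c)\,p}.

For example, with c = 0.75 and p = 0.75, a reported reward is judged true with probability 0.9, whereas a reported non-reward is judged true with probability 0.5, which is why posterior-truthfulness beliefs depend on both feedback valence and the prior (Fig. 6a).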

To formally address whether inference about feedback truthfulness modulates credit assignment, we fitted a new variant of the CA model (the “Truth-CA” model) to the data. This variant features two separate CA parameters for objectively true (CAtrue) and false (CAlie) feedback for each of the lying agents (i.e., the 1-star and 2-star agents) and a single CA parameter for the 3-star agent. We acknowledge that this model falls short of providing a mechanistically plausible description of the credit assignment process, because, unlike experimenters, participants cannot determine with certainty whether feedback from the 1- and 2-star agents is objectively true. Nonetheless, we use this ‘oracle model’ as a measurement tool to glean rough estimates for the average levels of credit assignment for true and false-feedback trials for each agent.

If participants rely on a (partial) insight regarding feedback truthfulness to amplify CA for feedback inferred as true, compared to false, then this predicts elevated average levels of CA on objectively true compared to lie feedback trials. In a mixed-effects model, we regressed the ML parameters for the non-credible agents on agent-credibility (1-star or 2-star) and on the truthfulness of their associated feedback (Methods). We found a main effect of truthfulness (b=0.084, t(812)=2.23, p=0.026; Fig. 6c and 6e), which was not qualified by an interaction between agent and truthfulness (b=-0.03, t(812)=-0.43, p=0.67), consistent with participants assigning greater credit for objectively true compared to false feedback. Strikingly, model-simulations (Methods) showed this pattern is not predicted by any of our other models (Fig. 6d and 6e) (see SI 3.1.1.4 Tables S14-S17). In agreement with previous results, we found a main effect of agent (b=0.20, t(812)=5.30, p<0.001), consistent with individuals assigning greater credit for feedback from the 2-star compared to the 1-star agent. Similar conclusions were found in our discovery study (see SI 1.2.6), and the overall pattern suggests that participants infer the truthfulness of feedback but modulate their credit assignment in a non-normative manner.

Credit assignment is higher on true-feedback trials.

a, Posterior belief that feedback is true (y-axis) as a function of prior belief, i.e., during choice and before feedback receipt, that the selected bandit is rewarding (x-axis), feedback valence (dashed vs solid lines), and agent credibility (different colors). b, Distribution of posterior belief probability that feedback is true, calculated separately for each agent (1 or 2 star) and objective feedback-truthfulness (true or lie). These probabilities were computed based on trial-sequences and feedback participants experienced, indicating belief probabilities that feedback is true are higher in truth compared to lie trials. For illustration, plotted distributions pool trials across participants. The black line within each box represents the median; upper and lower bounds represent the third and first quartile, respectively. The width of each half-violin plot corresponds to the density of each posterior belief value among all trials for a given condition. c, ML parameters for the “Truth-CA” model. Credit assignment parameters (y-axes) are shown as a function of agent-credibility and feedback-truthfulness (x-axes). These data show credit assignment was enhanced for true compared to false feedback (CAtrue>CAlie). Small dots represent fitted parameters for individual participants, big circles represent the group median, and error bars show the standard error of the mean. d, As in c, but here CA parameters were obtained by fitting the Truth-CA model not to empirical data but to synthetic data generated from simulations of our alternative models (based on participants’ best-fitting parameters). e, Effect of feedback-truthfulness on empirical Truth-CA parameters and on Truth-CA parameters based on synthetic simulations of our alternative models (obtained as in d). Effects were estimated by regressing CA parameters from the Truth-CA model on the agent (1-star or 2-star) and on feedback-truthfulness. None of our models predicted higher credit assignment for true compared to false feedback. Lines represent 95% confidence intervals around the estimated effect coefficient. Small dots represent fitted parameters for individual simulations, diamonds represent the median value, and error bars show the standard error of the mean. (*) p<.05 and (***) p<.001

Discovery study

The discovery study (n=104) used a disinformation task structurally similar to that used in our main study, but with three notable differences: 1) it included 4 feedback agents, with credibilities of 50%, 70%, 85% and 100%, represented by 1, 2, 3, and 4 stars, respectively; 2) each experimental block consisted of a single bandit pair, presented over 16 trials (with 4 trials for each feedback agent); and 3) in certain blocks, unbeknownst to participants, the two bandits within a pair were equally rewarding (see SI section 1.1). Overall, the results from this study support the same conclusions (see SI section 1.2), with one difference. In the discovery study, we found no evidence for learning based on 50%-credibility feedback when examining either the feedback effect on choice repetition or CA in the credibility-CA model (SI 1.2.3). However, this does not mean that participants fully filtered out this feedback because, importantly, feedback from the 1-star agent still elicited a positivity bias.

Discussion

Accurate information enables individuals to adapt effectively to their environment (45,46). Indeed, it has been suggested that the importance and utility of information elevate its status to that of a secondary reinforcer, imbuing it with intrinsic value beyond its immediate usefulness (47,48). However, a significant societal challenge arises from the fact that, as social animals, much information we receive is mediated by others, entailing it can be inaccurate, biased or purposefully misleading. Here, using a novel variant of the two-armed bandit task, we asked how we update our beliefs in the presence of potential disinformation, wherein true choice outcomes are latent and feedback is provided by potentially disinformative agents.

We acknowledge that several factors may limit the external validity of our task, including the fact that participants were explicitly instructed about the credibility of information sources. By contrast, in many real-life scenarios, individuals need to learn the credibility of information sources from their own experience of the world, or may even hold false beliefs regarding source credibility. Moreover, in our task, the experimenter fully controlled the credibility of the information source on every trial, whereas in many real-life situations people can exercise a degree of control over the credibility of information they receive. For example, search engines allow an exercise of choice regarding the credibility of sources. Finally, in our task, feedback agents served as rudimentary representations of social agents, who lied randomly and arbitrarily, in a motivation-free manner. Conversely, in real life, others may strategically attempt to mislead us, and we can exploit knowledge of their motivation to lie, such as when we assume that a used-car seller is more likely to portray a clapped-out car as excellent than to state the unfiltered truth. Nevertheless, our results attest to the utility of our task in discerning normative from biased aspects of learning in the face of disinformation, even in a simplified scenario.

Consistent with normative Bayesian principles, we show that individuals increased their learning as a function of feedback credibility. This aligns with previous studies demonstrating an impressive human ability to flexibly increase learning rates when environmental changes render prior knowledge obsolete (23,49,50), and when there is reduced inherent uncertainty, such as “observation noise” (24,51,52). However, as hypothesized, when facing potential disinformation, we also find that individuals deviate from a standard of optimal Bayesian learning in several important ways.

We show that participants revised their beliefs based on entirely non-credible feedback, whereas a Bayesian strategy dictates such feedback should be ignored. One possible explanation is that some participants failed to infer that feedback from the 1-star agent was statistically void of information content, essentially random (e.g., the group-level credibility of this agent was estimated by our free-credibility Bayesian model as higher than 50%). Participants were instructed that this feedback would be “a lie 50% of the time” but were not explicitly told that this meant it was random and should therefore be disregarded. However, we argue that even if one explicitly infers the randomness of this feedback, it may still be difficult to filter it out. Indeed, truth bias (53), the cognitive tendency to assume information is truthful unless strong evidence suggests otherwise, may have led participants to implicitly attribute some credibility to the 1-star feedback. Additionally, an individual’s ability to filter out random information might have been limited due to a high cognitive load induced by the task, which required participants to track the values of three bandit pairs and juggle between three interleaved feedback agents. Importantly, most trials in our task provided feedback from mostly truthful sources, requiring frequent belief updates. As a result, filtering out random feedback may require significant engagement of cognitive control processes. Future studies could explore whether this filtering process is more effective in environments where ignoring feedback is the default policy, such as when random feedback is presented on a majority of trials. Another possibility is that random-feedback influences on choice may stem from reliance on episodic memory (54,55), e.g., a recollection of past choice outcomes (positive or negative feedback) accompanied by a failure to recall corresponding feedback sources. It is entirely plausible that rapid information flow on social media platforms, featuring a considerable number of information-sources, is both cognitively demanding (56,57) and taxing for source recollection, hindering efficient filtering out of non-credible information (58), such as posts from bots or unfamiliar users. Soft moderation policies in social media (13,59), or indeed fact-checking (60), may enable better filtering of disinformation, mitigating the cognitive load associated with discerning credible information from a profusion of noise. Future studies should investigate conditions that enhance an ability to discard disinformation, such as providing explicit instructions to ignore misleading feedback, manipulations that increase the time available for evaluating information, or interventions that strengthen source memory.

In support of our a priori hypothesis, and in violation of Bayesian-normative learning, participants assigned greater credit based on positive than negative feedback across all feedback-credibility levels. A similar bias has been reported in previous reinforcement learning studies, albeit only in the context of veridical feedback (38,39,61). Here, we show that this positivity bias is amplified (relative to the overall extent of CA) for information of low and intermediate credibility. Of note, previous literature has interpreted enhanced learning for positive outcomes in reinforcement learning as indicative of a confirmation bias (37,39). For example, given that participants predominantly receive rewards (and positive feedback) in our task, positive feedback may confirm, to a greater extent than negative feedback, one’s choice-outcome expectations (e.g., “I expected a positive outcome”). Additionally, positive feedback confirms one’s choice as superior (e.g., “I chose the better of the two options”). Leveraging the framework of motivated cognition (35), we posited that feedback of uncertain veracity (e.g., low credibility) amplifies this bias by incentivising individuals to self-servingly accept positive feedback as true (either because it confers positive, desirable outcomes or because it confirms one’s choice or outcome expectations), and explain away undesirable, choice-disconfirming, negative feedback as false. Alternative “informational” (motivation-independent) accounts of positivity and confirmation bias predict a contrasting trend (i.e., reduced bias in low- and medium-credibility conditions) because in these contexts it is more ambiguous whether feedback confirms one’s choice or outcome expectations, as compared to a full-credibility condition. Our findings of bias exacerbation hint that previous estimates of the extent of confirmation bias may represent a lower bound, and that negative effects of confirmation bias are augmented in the presence of disinformation. This could imply an amplified confirmation bias on social media, where content from sources of uncertain credibility, such as unknown or unverified users, is more easily interpreted in a self-serving manner, disproportionately reinforcing existing beliefs (62). In turn, this could contribute to an exacerbation of the negative social outcomes previously linked to confirmation bias such as polarization (63,64), the formation of ‘echo chambers’ (19), and the persistence of misbelief regarding contemporary issues of importance such as vaccination (65,66) and climate change (67–70).

A striking finding in our study was that for a fully credible feedback agent, credit assignment was exaggerated (i.e., higher than predicted by a Bayesian strategy). Furthermore, the effect of fully credible feedback on choice was further boosted when it was preceded by a low-credibility context, even when this context was entirely unrelated to current learning. We interpret this in terms of a “contrast effect”, whereby veridical information looms larger against a backdrop of disinformation (21). One upshot is that exaggerated learning might entail a risk of jumping to premature conclusions based on limited credible evidence. To illustrate, consider the example wherein a revision of one’s opinion regarding the potential risk and benefits posed by AI, based on information provided by a credible tech-source, is greater after reading a low (compared to high) credibility news item regarding climate-change. An intriguing possibility, that could be tested in future studies, is that participants strategically amplify the extent of learning from credible feedback to dilute the impact of learning from non-credible feedback. For example, a person scrolling through a social media feed, encountering copious amounts of disinformation, might amplify the weight they assign to credible feedback in order to dilute effects of ‘fake news’. Whereas such a strategy would backfire in our task, where bandits were randomly interleaved between trials, it could be beneficial in situations where the content of consumed information is temporally autocorrelated (for example, one reads several social-media items posted by members of the “AI group” and only then continues to items from a “climate-change group”). Ironically, these results also suggest that public campaigns might be more effective when embedding their messages in low-credibility contexts, which may boost their impact.

Our study suggests that individuals’ learning is modulated based on a trial-by-trial latent-state inference, amplifying learning for feedback deemed true as opposed to false. Strikingly, this non-normative belief-updating strategy was not predicted by any of our Bayesian (or CA) models. One possibility is that this strategy is more efficient in ecological environments providing richer cues, beyond average source-credibility, as to whether an information source should be trusted in specific situations (e.g., when information sources have interests and motives that can be considered). Hence, our finding of increased learning for truthful feedback may stem from a failure to appreciate the inadequacy of this strategy in our relatively impoverished task. We note that the use of this strategy is consistent with our finding of exaggerated learning for fully credible, always-true, feedback. Taken together, these findings show that participants exaggerate learning from feedback they either know (3-star agent) or infer (1- and 2-star agents) to be true.

An important question arises as to the psychological locus of the biases we uncovered. Because we were interested in how individuals process disinformation (deliberately false or misleading information intended to deceive or manipulate), we framed the feedback agents in our study as deceptive agents who would occasionally “lie” about the true choice outcome. However, statistically (though not necessarily psychologically), these agents are equivalent to agents who mix truth-telling with random “guessing” or “noise”, where inaccuracies may arise from factors such as occasionally lacking access to true outcomes, simple laziness, or mistakes, rather than from an intent to deceive. For example, our “50% credibility agent” is statistically identical to a “100% guessing (fully random)” agent, and our 75% credibility agent corresponds to an agent who mixes truth-telling and guessing on half the trials. While information from guessing agents would constitute misinformation due to its inaccuracy, it lacks the intentionality required to qualify as disinformation. It is possible that participants in our task represented the agents as varying in randomness or noisiness rather than as intentionally deceitful. This raises the question of whether the biases we observed are driven by the perception of potential disinformation as deceitful per se or simply as deviating from the truth. Future studies could address this question by directly comparing learning from statistically equivalent sources framed as either lying or noisy. We have begun exploring this question in a new study by comparing learning from agents who lie on a minority of trials (e.g., 25% of the time) with agents who lie on a majority of trials (e.g., 75% of the time). Although both agents provide statistically equivalent information (via flipping the feedback of the mostly lying agent), the latter is perceived as more deceitful.

Our study has bearing on prior research involving observational learning, which examined how individuals learn from the actions or advice of social partners (71,72). This body of work has demonstrated that individuals integrate learning from their private experiences with learning based on others’ actions or advice, whether by inferring the value others attribute to different options or by mimicking their behavior (73). Here, individuals modulate the extent to which they rely on social partners based on the partner’s accuracy (74). However, our task differs from traditional observational learning paradigms in several key ways. Firstly, in our study, feedback agents do not demonstrate or recommend actions; instead, they interpret the outcomes of actions on behalf of participants by indicating whether these actions generated a latent reward. Secondly, participants in our task lack a private set of experiences unmediated by feedback sources, unlike many reported observational learning paradigms. Finally, while observational learning tasks often involve actions or advice that are not always accurate (e.g., recommending or demonstrating a suboptimal choice), research to date has not systematically addressed scenarios that involve deliberately misleading social partners. Future studies could incorporate deceptive social partners into observational learning paradigms, offering an opportunity to bridge the mechanisms underlying observational learning with those that are operative in our task. Developing unified models for these processes could provide valuable insights into how individuals integrate social information, particularly when the credibility of that information is critical for decision-making.

Although our findings and interpretations are grounded in a reinforcement learning (RL) framework, we acknowledge parallels with other approaches that have explored adaptive and biased learning. For instance, previous studies utilizing change-point inference (23,49,50,74) have addressed how individuals determine whether unexpected perceptual observations stem from “noise” as opposed to representing a genuine latent change in the underlying cause of their observations. In such tasks, incorrect assumptions about the rate of these changes (e.g., the “hazard rate”) can lead to deviations from normative statistical optimality. Similarly, in our task, learning bandit-values might involve inferring the extent to which variable choice feedback (for a bandit across trials) reflects inherent stochasticity in latent outcomes as opposed to agent deception. As in the change-point inference framework, incorrect assumptions in our task also produce biases. For example, a mistaken belief that the 1-star, random-feedback, agent is truthful on most trials would lead participants to erroneously learn from that agent’s feedback. Nevertheless, despite such similarities, there remain key differences between our task and change-point inference paradigms. Notably, choices in our task are value-based, and we consider it likely that this setup introduces biases (e.g., positivity) driven by a motivational preference for one outcome (reward) over another (non-reward). This motivational aspect may also influence whether individuals are inclined to trust or doubt feedback, depending on its valence. Additionally, explaining some of the biases we observed—such as the amplified learning from a credible source after exposure to non-credible sources in independent learning contexts—would require hierarchical inference frameworks that incorporate assumptions about the breakdown of learning-context independence. Future research could usefully investigate whether shared mechanisms underlie the biases identified here and those observed in other paradigms, potentially offering a unified account for inference problems across these approaches.

We conclude by noting that previous research has often attributed the negative impacts of disinformation, such as polarization and the formation of echo chambers, to intricate processes facilitated by external curation or self-selection of information (75–77). These processes include algorithms tailoring information to align with users' attitudes (78) or individuals consciously opting to engage with like-minded peers (79). However, our study reveals a more profound effect of disinformation, namely that even in minimal conditions, when low-credibility information is explicitly identified as such, disinformation significantly impacts individuals' beliefs and decision-making processes. This occurs even when the decision at hand lacks emotional engagement or pertinence to deep, identity-related issues. A critical next step is to deepen our understanding of these biases, particularly within complex social environments, not least to enable the development of effective prospective interventions capable of mitigating the potentially pernicious impacts of disinformation.

Materials and methods

Participants

We recruited 246 participants (mean age 39.33 ± 12.65 years; 112 female) from the Prolific participant pool (www.prolific.co), who went on to perform the task on the Gorilla platform (80). All participants were fluent English speakers with normal or corrected-to-normal vision and a Prolific approval rate of 95% or higher. The UCL Research Ethics Committee approved the study (Project ID 6649/004), and all participants provided prior informed consent.

Experimental protocol

Traditional two-armed bandit task

At the beginning of the experiment, participants completed a traditional version of the two-armed bandit task. Participants performed 45 trials, each featuring one of three randomly interleaved bandit pairs (such that each pair was presented on 15 trials). On each trial, participants chose between the two bandits of the pair, each represented by a distinct identicon. Once a bandit was selected, it generated a true outcome (converted to bonus monetary compensation) corresponding to either a reward or nothing. Within each bandit-pair, one bandit provided rewards on 75% of trials (and no reward on the remaining 25%), while the other bandit rewarded on 25% of trials (no reward on 75%). Participants were uninformed about the reward probabilities of each bandit and had to learn these from experience.

At the onset of each trial, the two bandits were presented, one on each side of the screen, and participants were asked to indicate their choice within 3 seconds by pressing the left/right arrow keys. If the 3 seconds elapsed with no choice, participants were shown a “too slow” message and proceeded to the next trial. Following a choice, the unselected bandit disappeared and participants were presented with the outcome of the selected bandit for 1200 ms, followed by a 250 ms inter-stimulus interval before the start of the next trial. Rewards were represented by a green dollar symbol and non-rewards by a red sad face (both at the center of the screen). At the end of the task, participants were informed about the number of rewards they had earned.

Disinformation task

This involved a modified, disinformation version of the same two-armed bandit task. Participants performed 8 blocks, each consisting of 45 trials. Each block followed the structure of the traditional two-armed bandit task, but with a critical difference: true choice-outcomes were withheld from participants, who instead received reward-feedback from a feedback agent. Participants were instructed prior to the task that feedback agents mostly provide accurate feedback (i.e., the true outcome) but could lie on a random minority of trials by reporting a reward in the case of a true non-reward, or vice versa. The task featured three feedback agents varying in their credibility (i.e., probability of truth-telling), as indicated by a “star-rating” system about which participants were instructed prior to the task. The 3-star agent always told the truth, whereas the other two agents were only partially credible, reporting the truth on 75% (2-star) or 50% (1-star) of trials. Feedback agents were randomly interleaved across trials, subject to the constraint that each agent appeared on 5 trials for each bandit pair.

At the onset of each trial, participants were presented with the feedback agent for that trial (screen center) and with the two bandits, one on each side of the screen. Participants made a choice within a 2-second time limit by pressing the left/right arrow keys. Following a choice, the unselected bandit disappeared, and participants were then presented with the agent's feedback for 1200 ms (represented by either a rewarding green dollar sign or a non-rewarding red sad face at the center of the screen). All stimuli then disappeared for 250 ms, followed by the start of the next trial. At the end of each block, participants were informed about the number of true rewards they had earned. They then received a 30-second break before the next block started with 3 new bandit pairs.

General protocol

At the beginning of the experiment, participants were presented with instructions for the traditional two-armed bandit task. The instructions were interleaved with four multiple-choice questions. When participants answered a question incorrectly, they could re-read the instructions and re-attempt it. If participants answered a question incorrectly twice, they were compensated for their time but could not continue to the next stage. Upon completing the instructions, participants proceeded to the traditional two-armed bandit task.

After the two-armed bandit task, participants were presented with instructions regarding the disinformation task. Again, these were interleaved with six questions, wherein participants had two attempts to answer each question correctly. If they answered a question incorrectly twice, they were rejected and received partial compensation for their participation. Participants then proceeded to the disinformation task. After completing the disinformation task, participants completed three psychiatric questionnaires (presented in random order): 1) the Obsessional Compulsive Inventory - Revised (OCI-R) (81), assessing symptoms of obsessive-compulsive disorder (OCD); 2) the Revised Green et al. Paranoid Thoughts Scale (R-GPTS) (82), measuring paranoid ideation; and 3) the DOG scale (83), evaluating dogmatism.

The participants took on average 43 minutes to complete the experiment. They received a fixed compensation of 5.48 GBP and variable compensation between 0 and 2 GBP based on their performance on the disinformation task.

Attention checks

The two tasks included randomly interleaved catch trials wherein participants were cued to press a given key within a 3-second limit. None of the participants failed more than one of these attention checks.

Data analysis

Exclusion criteria

Participants were excluded if: 1) they repeated or alternated key presses in more than 70% of trials, and/or 2) their reaction time was below 150 ms in more than 5% of trials. Based on these criteria, 42 participants were excluded and 204 were retained for the analyses.

Accuracy

Accuracy rates were calculated as the probability of choosing, within a given pair, the bandit with the higher reward probability. For figure 1d, we calculated, for each participant and for each trial position within a bandit-pair, accuracy averaged across all bandit-pairs. We then averaged accuracy at the trial level across participants. Overall improvement for each participant was calculated as the average, across bandit-pairs, of the difference in accuracy between the last and first trials of each pair.

Computational models

RL Models

We formulated a family of RL models to account for participant choices. In these models, a tendency to choose each bandit is captured by a Q-value. After reward-feedback the Q-value of the chosen bandit was updated conditional on the agent and on whether the feedback was positive or negative according to the following rule:
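A plausible form of this update, written under the assumption that forgetting decays Q-values multiplicatively towards zero, is:

\[ Q_{\mathrm{chosen}} \leftarrow (1 - f_Q)\, Q_{\mathrm{chosen}} + CA \cdot F \]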

where CA is a free credit-assignment parameter representing the magnitude of the value increase/decrease following receipt of feedback F from the agents (coded as 1 for reward feedback and -1 for non-reward feedback), while fQ (∈ [0,1]) is a free parameter representing the forgetting rate of the Q-values. Additionally, the values of each of the other bandits (i.e., the unchosen bandit in the presented pair and all the bandits from the other, not-shown, pairs) were forgotten as per the following:
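Under the same assumption about the form of forgetting, this rule can be written as:

\[ Q_{\mathrm{other}} \leftarrow (1 - f_Q)\, Q_{\mathrm{other}} \]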

Alternative model-variants differed based on whether the CA parameter(s) were influenced by agents and/or feedback valence (see Table 1 below), allowing us to test how these variables impacted learning.

  1. The “Null” model included a single CA parameter, conveying an assumption that learning is modulated neither by agent-credibility nor by feedback valence.

  2. The “Credibility-CA” model included a dedicated CA parameter for each agent, allowing for the possibility that learning was selectively modulated by agent credibility (but not by feedback valence).

  3. The “Credibility-Valence-CA” model included distinct CA parameters for rewarding (CA+) and non-rewarding (CA-) feedback for each agent, allowing CA to be influenced by both feedback valence and credibility.

  4. The “constant feedback-valence bias” CA model included separate CA parameters for each agent, but a single valence-bias parameter (VB) common to all agents, such that the CA+ parameter for each agent corresponded to the sum of its CA parameter and the common VB parameter.

Additionally, we formulated a “Truth-CA” model where CA parameters were influenced by agent-credibility and by whether the feedback was objectively true. This model included distinct CA parameters for truthful (CAtrue) and non-truthful (CAlie) feedback for each non-credible agent (i.e., 1-star and 2-star) and a single CA parameter for the 3-star agent, since it never lied.

Table 1. Summary of free parameters for each of the CA models.

All models also included gradual perseveration for each bandit. On each trial, the perseveration values (P) were updated according to:
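A plausible form of this update, assuming the chosen bandit's perseveration value decays at rate fP and is then incremented by PERS, is:

\[ P_{\mathrm{chosen}} \leftarrow (1 - f_P)\, P_{\mathrm{chosen}} + PERS \]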

where PERS is a free parameter representing the P-value change for the chosen bandit, and fP (∈ [0,1]) is a free parameter denoting the forgetting rate applied to the P-values. Additionally, the P-values of all non-chosen bandits (i.e., again, the unchosen bandit of the current pair and all bandits from the not-shown pairs) were forgotten as follows:
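A corresponding plausible form of this forgetting rule is:

\[ P_{\mathrm{other}} \leftarrow (1 - f_P)\, P_{\mathrm{other}} \]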

We modelled choices using a softmax decision rule, representing the probability that a participant chooses a given bandit over the alternative:
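One plausible form of this rule, in which the Q- and P-values of a bandit combine additively (whether a separate inverse-temperature parameter was included here is left open, since the scales of CA and PERS can absorb its role), is:

\[ \Pr(\text{choose } a) = \frac{\exp\{Q(a) + P(a)\}}{\exp\{Q(a) + P(a)\} + \exp\{Q(b) + P(b)\}} \]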

Bayesian Models

We also formulated a Bayesian model corresponding to a normative, ideal, belief-updating strategy. In this model, beliefs about each bandit were represented by a density distribution g(p) over the probability p that the bandit provides a true reward (see full derivation in SI 4.1). During learning, following reward-feedback, the distribution for the chosen bandit was updated based on the agent's feedback (F) and its associated credibility (C):
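A plausible form of this update, which follows from the stated generative assumptions (an agent with credibility C reports the true outcome with probability C and the opposite outcome otherwise), is, for reward feedback (F = 1):

\[ g(p \mid F = 1, C) \propto \big[\, C\,p + (1 - C)(1 - p) \,\big]\, g(p) \]

and analogously for non-reward feedback (F = 0), with likelihood \( C\,(1 - p) + (1 - C)\,p \).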

At the beginning of each block priors for each bandit were initialized to uniform distributions (g(p)=U[0,1]). In the instructed-credibility Bayesian model, we fixed the credibilities to their true values (i.e., 0.5, 0.75 and 1).

We also formulated a free-credibility Bayesian model, in which we fixed only the 3-star agent's credibility to 1 and estimated the credibilities of the two lying agents as free parameters. This model allowed for the possibility that participants follow a Bayesian strategy while using distorted versions of the instructed credibilities.

For both versions, we modelled choice using a softmax function with a free inverse-temperature parameter (β):
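A plausible form of this choice rule is:

\[ \Pr(\text{choose } a) = \frac{\exp\{\beta\, Q(a)\}}{\exp\{\beta\, Q(a)\} + \exp\{\beta\, Q(b)\}} \]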

where Q(bandit) is the expected probability that the bandit provides a true reward.

Parameter optimization, model selection and synthetic model simulations

For each participant, we estimated the free parameter values that maximized the summed log-likelihood of the observed choices across all games. Trials where participants showed a response time below 150 ms were excluded from the log-likelihood calculations. To minimise the chance of finding local minima, we ran the fitting procedure 10 times for each participant, using random initializations for the parameters (CA~U[-10,10], PERS~U[-5,5], fQ~U[0,1], fP~U[0,1], β~U[0,30], C~U[0,1]). Our Truth-CA model showed poorer convergence, so for this model we ran the fitting procedure 100 times per participant.

We performed model comparison between the Bayesian and CA models using the parametric bootstrap cross-fitting method (PBCM) (84,85). In brief, this method relies on generating, for each participant, synthetic datasets (we used 201) based on maximal likelihood parameters under each model variant (i.e., the Bayesian model and the CA model), and fitting each dataset with both models. We then calculated the log-likelihood difference between the two fits for each dataset, obtaining two log-likelihood difference distributions, one for each generative model. We determined a log-likelihood difference threshold that yields the best model classification (i.e., maximizing the proportion of true positives and true negatives). Finally, we fit the empirical data from each participant with the two model variants, calculating an empirical log-likelihood difference. Comparing this empirical likelihood difference to the classification threshold determines which model provides a better fit for a participant's data (see Fig. S6 for more information). We used this procedure to compare our Bayesian models (instructed-credibility and free-credibility) with a simplified version of the credibility-CA model that did not include perseveration (PERS, fP = 0).
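As an illustration of the classification step, the following sketch (the function and variable names are ours, not taken from the study's code) selects the log-likelihood difference threshold that best separates the two generative models and then classifies a participant's empirical data:

    function bestModel = pbcm_classify(dBayes, dCA, dEmp)
    % Hedged sketch of the PBCM classification step for one participant.
    % dBayes, dCA: log-likelihood differences (CA fit minus Bayesian fit) computed
    %              on synthetic datasets generated from the Bayesian and CA models.
    % dEmp:        the same difference computed on the participant's empirical data.
    cands = sort([dBayes(:); dCA(:)]);
    % classification accuracy for each candidate threshold: datasets below the
    % threshold are labelled Bayesian-generated, those at or above it CA-generated
    acc = arrayfun(@(t) (mean(dBayes < t) + mean(dCA >= t)) / 2, cands);
    [~, best] = max(acc);
    threshold = cands(best);
    if dEmp >= threshold
        bestModel = 'CA';
    else
        bestModel = 'Bayesian';
    end
    end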

We also performed model comparisons for nested CA models using generalized likelihood-ratio tests, where the null distribution for rejecting a nested model (in favour of a nesting model) was based on a bootstrapping method (BGLRT) (44,86).

To assess the mechanistic predictions of each model, we generated synthetic simulations based on the ML parameters of participants. Unless stated otherwise, we generated 5 simulations for each participant (1020 total simulations), with a new sequence of trials generated as in the actual data. We analysed these data in the same way as the empirical data, after pooling the 5 simulated datasets per participant.

Parameter recovery

For each model of interest, we generated 201 synthetic simulations based on parameters sampled from uniform distributions (CA~U[-10,10], PERS~U[-5,5], fQ~U[0,1], fP~U[0,1], β~U[0,30], C~U[0,1]). We fitted each simulated dataset with its generative model and calculated the Spearman’s correlation between the generative and fitted parameters.

Mixed effects models

Model-agnostic analysis of agent-credibility effects on choice-repetition

We used a mixed-effects binomial regression model to assess whether, and how, value-learning was modulated by agent-credibility, with participants serving as random effects. The regressed variable REPEAT indicated whether the current trial repeated the choice from the previous trial featuring the same bandit-pair (repeated choice=1, non-repeated choice=0) and was regressed on the following regressors: FEEDBACK coded whether feedback received in the previous trial with the same bandit pair was positive or negative (coded as 0.5 and -0.5, respectively); BETTER coded whether the bandit chosen in that previous trial was the better (mostly rewarding) or the worse (mostly unrewarding) bandit within the pair (coded as 0.5 and -0.5, respectively); AGENT2-star indicated whether feedback received in the previous trial (featuring the same bandit pair) came from the 2-star agent (previous feedback from 2-star agent=1, otherwise=0); and AGENT3-star indicated whether the feedback in that previous trial came from the 3-star agent. The model in Wilkinson's notation was:
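A plausible specification, consistent with the regressors described above (the exact random-effects structure shown here is an assumption), is:

REPEAT ~ 1 + FEEDBACK*BETTER*AGENT2-star + FEEDBACK*BETTER*AGENT3-star + (1 + FEEDBACK | participant)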

In figures 2a and 2b, we plot the choice-repeat probability based on feedback-valence and agent-credibility from the preceding trial with the same bandit pair. We independently calculated the repeat probability for the better (mostly rewarding) and worse (mostly non-rewarding) bandits and averaged across them. This calculation was done at the participant level, and finally averaged across participants.

Model-agnostic analysis of contextual credibility effects on choice-repetition

We used a different mixed-effects binomial regression model to test whether value learning from the 3-star agent was modulated by contextual credibility. We focused this analysis on instances where the previous trial with the same bandit pair featured the 3-star agent. We regressed the variable REPEAT, which indicated whether the current trial repeated the choice from the previous trial featuring the same bandit-pair (repeated choice=1, non-repeated choice=0), on the following regressors: FEEDBACK, coding the valence of feedback in the previous trial with the same bandit pair (positive=0.5, negative=-0.5); CONTEXT2-star, indicating whether the trial immediately preceding the previous trial with the same bandit pair (the context trial) featured the 2-star agent (feedback from 2-star agent=1, otherwise=0); and CONTEXT3-star, indicating whether the context trial featured the 3-star agent. We included in this analysis only current trials where the context trial featured a different bandit pair. The model in Wilkinson's notation was:
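One specification consistent with the regressors described above (again, the random-effects structure is an assumption) is:

REPEAT ~ 1 + FEEDBACK*CONTEXT2-star + FEEDBACK*CONTEXT3-star + (1 + FEEDBACK | participant)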

We originally included an additional regressor (BETTER) coding whether the bandit chosen in that previous trial was the better (mostly rewarding) or the worse (mostly unrewarding) bandit within the pair. Since we did not find any significant interactions between BETTER and the other regressors, we omitted it from the model formulation.

In figure 4c, we independently calculated the repeat probability difference for the better (mostly rewarding) and worse (mostly non-rewarding) bandits and averaged across them. This calculation was done at the participant level, and finally averaged across participants.

Effects of agent-credibility on CA parameters from credibility-CA model

We used a mixed-effects linear regression model to assess whether, and how, credit assignment was modulated by feedback-agent, with participants serving as random effects (data from Fig. 2c). We regressed the maximal likelihood CA parameters from the credibility-CA model on the regressors AGENT2-star and AGENT3-star, which indicated, respectively, whether the CA parameter was attributed to the 2-star or the 3-star agent. The model in Wilkinson's notation was:
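Given one CA estimate per agent and participant, a plausible specification (with a by-participant random intercept assumed) is:

CA ~ 1 + AGENT2-star + AGENT3-star + (1 | participant)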

Effects of agent-credibility and feedback valence on CA parameters from credibility-valence-CA model

We used a second mixed-effects linear regression model to test for a valence bias in learning, and how such a bias was modulated by feedback credibility, again with participants serving as random effects (data from Fig. 3a). The maximal likelihood CA parameters from the credibility-valence-CA model served as the regressed variable, which was regressed on: AGENT2-star and AGENT3-star (defined as in the previous model), and VALENCE, coding whether the CA parameter was attributed to positive (coded as 0.5) or negative (coded as -0.5) feedback. The model in Wilkinson's notation was:
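A specification consistent with these regressors (random-effects structure assumed) is:

CA ~ 1 + AGENT2-star*VALENCE + AGENT3-star*VALENCE + (1 | participant)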

Effects of agent-credibility and feedback truthfulness on CA parameters from truth-CA model

Finally, we used another mixed-effects linear regression model to test whether the average levels of credit participants assigned to chosen bandits varied between objectively true and false feedback (data from figures 4b and 4c). We regressed the maximal likelihood CA parameters for the 1-star and 2-star agents from the truth-CA model on the regressors: CREDIBILITY, coding whether the CA came from the 1-star or the 2-star agent (coded as -0.5 and 0.5, respectively); and TRUTH, coding whether the CA parameter was attributed to trials where the agent told the truth (coded as 0.5) or lied (coded as -0.5). The model in Wilkinson's notation was:
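A corresponding plausible specification (random-effects structure assumed) is:

CA ~ 1 + CREDIBILITY*TRUTH + (1 | participant)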

We fitted these mixed-effects models using the fitglme function in Matlab. Follow-up analyses were based on testing contrasts from these models.
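For illustration, a trial-level binomial model of the form described above could be fitted with a call such as the following (the table and variable names are hypothetical, and the random-effects structure is assumed rather than taken from the study):

    glme = fitglme(tbl, ...
        'REPEAT ~ 1 + FEEDBACK*BETTER*AGENT2star + FEEDBACK*BETTER*AGENT3star + (1 + FEEDBACK|participant)', ...
        'Distribution', 'Binomial', 'Link', 'logit');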

Bayesian estimation of posterior belief that feedback is true

We calculated the Bayesian posterior conditional probability of feedback truthfulness (Fig. 4a and 4b) as follows. First, we calculated the probability of each true outcome r (0: non-reward, 1: reward), conditional on the feedback f (0: non-reward, 1: reward), the credibility of the agent reporting the feedback (C), and the history of experiences from past trials (H):
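A plausible form of this relation, following from the definitions above, is:

\[ P(r \mid f, C, H) \propto P(f \mid r, C)\, P(r \mid H), \qquad P(f \mid r, C) = \begin{cases} C, & f = r \\ 1 - C, & f \neq r \end{cases} \]

with \( P(r = 1 \mid H) = \int_0^1 p\, g(p \mid H)\, dp \).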

where the proportionality omits terms independent of r; P(r=1|H) is the expected probability that the chosen bandit is rewarding (conditional on past-trial history), and g(p|H) is the density over the probability that the chosen bandit is rewarded (conditional on the history of previous trials).

Next, we normalized the two terms (for r=0,1) to sum to 1 (to correct for the proportionality in (14)). Finally, the posterior belief in truthfulness was taken as P(r=f | f,C,H).

In Fig. 4b, we calculated for each participant the mean posterior belief of truthfulness separately for trials where each agent told the truth or lied, and we compared these mean beliefs between the two kinds of trials using paired t-tests (one test per agent).

Code and data availability

All code and data used to generate the results and figures in this paper will be made available on GitHub upon publication.

Acknowledgements

We thank Bastien Blain, Lucie Charles and Stephano Palminteri for helpful discussions. We thank Nira Liberman, Keiji Ota, Nitzan Shahar, Konstantinos Tsetsos and Tali Sharot for providing feedback on earlier versions of the manuscript. We additionally thank the members of the Max Planck UCL Centre for Computational Psychiatry and Ageing Research for insightful discussions. The Max Planck UCL Centre is a joint initiative supported by UCL and the Max Planck Society.

J.V.P. is a pre-doctoral fellow of the International Max Planck Research School on Computational Methods in Psychiatry and Ageing Research (IMPRS COMP2PSYCH). We acknowledge funding from the Max Planck research school to J.V.P. (577749-D-CON 186534), and funding from the Max Planck Society to R.J.D. (549771-D.CON 177814). The project that gave rise to these results received the support of a fellowship from “la Caixa” Foundation (ID 100010434), with the fellowship code LCF/BQ/EU21/11890109.

J.V.P. contributed to the study design, data collection, data coding, data analyses, and writing of the manuscript. R.M. contributed to the study design, data analyses, and writing of the manuscript. R.J.D. contributed to the writing of the manuscript.