Abstract
In open societies disinformation is often considered a threat to the very fabric of democracy. However, we know little about how disinformation exerts its impact, especially its influence on individual learning processes. Guided by the notion that disinformation exerts its pernicious effects by capitalizing on learning biases, we ask which aspects of learning from potential disinformation align with normative “Bayesian” principles, and which exhibit biases deviating from these standards. To this end, we harnessed a reinforcement learning framework, offering computationally tractable models capable of estimating latent aspects of a learning process as well as identifying biases in learning. Across two experiments, computational modelling indicated that learning increased in tandem with source credibility, consistent with normative Bayesian principles. However, we also observed striking biases reflecting divergence from normative learning patterns. Notably, individuals learned from sources that should have been ignored, as these were known to be fully unreliable. Additionally, the presence of disinformation elicited exaggerated learning from trustworthy information (akin to jumping to conclusions) and exacerbated a “positivity bias” whereby individuals self-servingly boost their learning from positive, compared to negative, choice-feedback. Thus, in the face of disinformation we identify specific cognitive mechanisms underlying learning biases, with potential implications for societal strategies aimed at mitigating its harmful impacts.
Introduction
Disinformation is a pervasive and pernicious feature of the modern world (1). It is linked to negative social impacts that include public-health risks (2–4), political radicalization (5,6), violence (6–8) and adherence to conspiracy theories (8,9). Consequently, there is a growing interest in comprehending how false information propagates across social networks (10–12), including an interest in designing strategies to curb its impact (13–16) albeit with limited success to date (17). However, there is also a considerable knowledge lacuna regarding how individuals learn and update their beliefs when exposed to potential disinformation. Addressing this gap is crucial, as it has been suggested that disinformation propagates by exploiting cognitive biases (18–22). Thus, discerning which aspects of learning from potential disinformation are normative versus biased has the potential to better enable targeted interventions aimed at countering its harmful effects.
We start with an assessment of a normative, Bayesian, prediction that individuals should modulate their learning as a function of the credibility of an information source, and learn more from credible, truthful, sources. This prediction is supported by previous findings showing that individuals flexibly and adaptively adjust their learning rates in response to key statistical features of the environment. For example, learning is more rapid when observation-uncertainty (“noise”) decreases, and in volatile, changing, compared to stable, environments, particularly following detection of change-points that render pre-change knowledge obsolete (23–25). Moreover, human choice is strongly influenced by social information of high (as opposed to low) credibility, such as majority opinions, more confident judgments (26), and large group consensus (27). Additionally, people are disposed to follow trustworthy advisors (28), including those who have recommended optimal actions in the past (29,30).
We hypothesised that in a disinformation context individuals would show significant deviations from normative learning, reflecting a diversity of biases. First, filtering non-credible information is likely to be cognitively demanding (31), and this predicts such information would impact belief updating, even if individuals are aware it is untrustworthy. An additional consideration is that humans tend to learn more from positive self-confirming information (32–34), which presents one in a positive light. We conjectured, influenced by ideas from motivated-cognition (35), that low-credibility information provides a pathway for amplification of such a bias, as uncertainty regarding information-veracity might dispose individuals to self-servingly interpret positive information as true and explain away negative information as false. A final consideration is how exposure to potential disinformation impacts learning from trusted sources. One possibility is that disinformation serves as a background context against which credible information appears more salient. Alternatively, it might lead individuals to strategically reduce their overall learning in disinformation-rich environments, resulting in diminished learning from credible sources.
To address these questions, we adopt a novel approach within the disinformation literature by exploiting a Reinforcement Learning (RL) experimental framework (36). This has the advantage of providing a suite of behavioural tasks, and computationally tractable models, that enable estimation of latent aspects of learning processes, such as belief updating. RL also enables an examination of the dynamics of belief updates over short timescales reflecting real-life engagements with disinformation, such as deciding whether to share a post on social media. Moreover, RL has proven successful in characterizing key decision-making biases (e.g., positivity bias (37–39)), albeit in scenarios where learners receive accurate information. Within RL, we can also establish benchmarks of Bayesian learning that allow a characterization of biases and deviations from normative standards. Finally, previous literature has suggested a role for reinforcement in the dissemination of disinformation, where individuals may receive positive reinforcement (likes, shares) for spreading sensationalized or misleading information on social media platforms, inadvertently reinforcing such behaviours and contributing to the proliferation of disinformation (15,40,41).
We developed a novel “disinformation” version of the classical two-armed bandit task to test the effects of potential disinformation on learning. In the classical two-armed bandit task (36,37,42), participants choose repeatedly between two unfamiliar bandits (i.e., slot machines), which provide rewards with different probabilities, to learn which bandit is more rewarding. Critically, in our disinformation-variant, true choice outcomes (reward or non-reward) were latent, i.e., unobservable. Instead, participants were informed about choice-outcomes by computer-programmed “feedback agents”, who were disposed to occasionally disseminate disinformation by lying (reporting a reward when the true outcome was non-reward or vice versa). As these feedback-agents varied in truthfulness, this allowed us to test the effects of source-credibility on learning. We show across two studies that the extent of belief-updates increases as a function of source-credibility. However, there were striking deviations from normative Bayesian learning, where we identify several sources of bias related to processing potential disinformation. These included learning from non-credible information, an amplified positivity bias for non-credible sources, and increased learning from trustworthy information when it was preceded by non-credible information.
Results
Disinformation two-armed bandit task
We conducted a discovery study (n=104) and a follow-up study (n=204). In both studies the learning tasks had the same basic structure, with a few subtle differences between them (see Discovery study and SI Discovery study methods). To anticipate, the results of both studies support similar conclusions; in the Results we focus on the main study, with the final results section detailing similarities and differences in findings across the two studies.
In the main study, participants (n=204) completed the disinformation two-armed bandit task. In the traditional two-armed bandit task (36,37,42), participants choose between two slot-machines (i.e., bandits) differing in their reward probability. Participants are not instructed about bandit reward-probabilities but instead they are provided with veridical choice feedback (e.g., reward or non-reward), allowing participants to learn which bandit is more rewarding. By contrast, in our disinformation version true choice-outcomes were latent (i.e., unobserved) and participants were informed about these outcomes via three computerized feedback-agents, who had privileged access to the true outcomes.
Before commencing the task, participants were instructed that feedback agents could disseminate disinformation, meaning that they were disposed to lie on a random minority of trials, reporting a reward when the true outcome was a non-reward, or vice versa (Fig. 1a). Participants were explicitly instructed about the credibility of each agent (i.e., based on the proportion of truth-telling trials), indicated by a “star system”: the 3-star agent was always truthful, the 2-star agent told the truth on 75% of the trials while the 1-star agent did so on 50% of the trials (Fig. 1b). Note that while the 1-star agent’s feedback was statistically equivalent to random feedback, participants were not explicitly instructed about this equivalence. Each experimental block encompassed 3 bandit pairs, each presented over 15 trials in a randomly interleaved manner. Each agent provided feedback for 5 trials for each bandit pair (with the agent order interleaved within the bandit pair). Thus, in every trial, participants were presented with one of the bandit pairs and the feedback agent associated with that trial. Upon selecting a bandit, they then received feedback from the agent (Fig. 1c). Importantly, at the end of the experiment participants received a performance-based bonus based on true bandit outcomes, which could differ from agent-provided feedback. Within each bandit-pair one bandit provided a (true) reward on 75% of the trials and the other on 25% of trials. Choice accuracy, i.e., the probability of selecting the more rewarding bandit (within each pair), was significantly above chance (mean accuracy = 0.62, t(203) = 19.94, p <.001) and improved as a function of increasing experience with each bandit-pair (average overall improvement over 15 trials = 0.22, t(203)=19.95, p<0.001) (Fig. 1d).
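To make the task’s generative structure concrete, the following minimal sketch simulates a single trial as described above; the reward probabilities and star-to-credibility mapping come from the task description, while all function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

REWARD_PROBS = (0.75, 0.25)                # true reward probability of each bandit in a pair
CREDIBILITY = {3: 1.0, 2: 0.75, 1: 0.50}   # agent star-rating -> P(truth-telling)

def play_trial(chosen_bandit: int, agent_stars: int) -> tuple[bool, bool]:
    """Draw the latent true outcome, then the agent's (possibly false) report.

    The participant observes only the report; the true outcome determines the bonus.
    """
    true_outcome = rng.random() < REWARD_PROBS[chosen_bandit]
    tells_truth = rng.random() < CREDIBILITY[agent_stars]
    reported_outcome = true_outcome if tells_truth else not true_outcome
    return true_outcome, reported_outcome
```

Note that for the 1-star agent the report is a fair coin flip regardless of the true outcome (P(report = reward) = 0.5·p + 0.5·(1 − p) = 0.5), which is why its feedback is statistically uninformative.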

Figure 1. Task design and performance.
a, Illustration of agent-feedback. Each selected bandit generated a true outcome, either a reward or a non-reward. Participants did not see this true outcome but instead were informed about it via a computerised feedback agent (reward: dollar sign; non-reward: sad emoji). Agents told the truth on most trials (left panel). However, on a random minority of trials they lied, reporting a reward when the true outcome was a non-reward or vice versa (right panel). b, Participants received feedback from 3 distinct feedback agents of variable credibility (i.e., truth-telling probability). Credibility was represented using a star-based system: a 3-star agent always reported the truth (and never lied), a 2-star agent reported the truth on 75% of trials (lying on the remaining 25%), and a 1-star agent reported the truth half of the time (lying on the other half). Participants were explicitly instructed and quizzed about the credibility of each agent prior to the task. c, Trial-structure: On each trial participants were first presented with the feedback agent for that trial (here, the 2-star agent) and next offered a choice between a pair of bandits (represented by identicons) (for 2 s). Next, choice-feedback was provided by the agent. d, Learning curves. Average choice accuracy as a function of trial number (within a bandit-pair). Thin lines: individual participants; thick line: group mean with thickness representing the group standard error of the mean for each trial.
Credible feedback promotes greater learning
A hallmark of RL value-learning is that participants are more likely to repeat a choice following positive compared to negative reward-feedback (henceforth, “feedback effect on choice repetition”). We tested a hypothesis, based on Bayesian-normative reasoning, that this tendency would increase as a function of agent-credibility (Fig. 3a). Thus, in a binomial mixed-effects model we regressed choice-repetition (i.e., whether participants repeated their choice from the most recent trial featuring the same bandit pair; 0-switch; 1-repeat) on feedback-valence (negative or positive) and agent-credibility (1-, 2-, or 3-star), where these are taken from the last trial featuring the same bandit pair (Methods for model-specification). Feedback valence exerted a positive effect on choice-repetition (b=0.72, F(1,2436)=1369.6, p<0.001) and interacted with agent-credibility (F(2,2436)=307.11, p<0.001), with the feedback effect being greater for more credible agents (3-star vs. 2-star: b=0.91, F(1,2436)=351.17; 3-star vs. 1-star: b=1.15, t(2436)=24.02; and 2-star vs. 1-star: b=0.24, t(2436)=5.34, all p’s<0.001). Additionally, we found a positive feedback-effect for the 3-star agent (b=1.41, F(1,2436)=1470.2, p<0.001), and a smaller feedback-effect for the 2-star agent (b=0.49, F(1,2436)=230.0, p<0.001). These results support our hypothesis that learning increases as a function of information credibility (note that the feedback effect for the 1-star agent is examined below; see “Non-credible feedback elicits learning”).
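For illustration, a simplified fixed-effects analogue of this regression can be sketched as follows (the paper’s actual model is a binomial mixed-effects model with participant-level random effects; the toy data and column names here are hypothetical stand-ins).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Toy stand-in data: one row per trial, recording whether the choice from the
# last same-pair trial was repeated, plus that trial's feedback valence and agent.
n = 2000
df = pd.DataFrame({
    "valence": rng.integers(0, 2, n),                      # 0 = negative, 1 = positive
    "agent": rng.choice(["1-star", "2-star", "3-star"], n),
})
slope = df["agent"].map({"1-star": 0.2, "2-star": 0.6, "3-star": 1.2})
p_repeat = 1 / (1 + np.exp(-slope * (2 * df["valence"] - 1)))  # stronger feedback effect with credibility
df["repeat"] = (rng.random(n) < p_repeat).astype(int)

# The valence x agent interaction captures credibility-dependent feedback effects.
fit = smf.logit("repeat ~ valence * C(agent)", data=df).fit()
print(fit.summary())
```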
To confirm that increased learning based on information credibility is expected under an assumption that subjects adhere to normative Bayesian reasoning, we formulated two Bayesian models whereby the latent value of each bandit is represented as a distribution over the probability that the bandit is truly rewarding. During the feedback stage of each trial, the value of the chosen bandit is updated (based on feedback valence and credibility) according to Bayes’ rule (Fig. 2a, top panel; Fig. S5c for an illustration of the model; for full model descriptions, see Methods). In the instructed-credibility Bayesian model, belief-updates are based on the instructed credibility of feedback-sources. In contrast, a free-credibility Bayesian model allows for the possibility that value inference is Bayesian but based on “distorted probabilities” (43), attributing non-instructed degrees of credibility to sources of false information (despite our explicit instructions on the credibility of different agents). In this variant, we fixed the credibility of the 3-star agent to 1 and estimated the credibility of the 2- and 1-star agents as free parameters (which were highly recoverable; see Methods and SI 3.3). Simulations based on both Bayesian models (see Methods) predicted increased learning as a function of feedback credibility (Fig. 3b, top panels; SI 3.1.1.1 Tables S3 and S4 for statistical analysis).
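A minimal sketch of the belief update at the heart of these models, assuming a discretized belief over the reward rate r (the grid resolution and naming are illustrative; the paper’s implementation may differ in detail, see Methods):

```python
import numpy as np

GRID = np.linspace(0.01, 0.99, 99)   # support for r = P(bandit is truly rewarding)

def bayes_update(belief: np.ndarray, feedback_positive: bool, credibility: float) -> np.ndarray:
    """One Bayes'-rule update of the belief over a chosen bandit's reward rate r.

    An agent with credibility c reports the true outcome with probability c and
    the flipped outcome otherwise, so P(positive feedback | r) = c*r + (1-c)*(1-r).
    """
    lik_pos = credibility * GRID + (1 - credibility) * (1 - GRID)
    lik = lik_pos if feedback_positive else 1 - lik_pos
    posterior = belief * lik
    return posterior / posterior.sum()

belief = np.ones_like(GRID) / GRID.size      # uniform prior
belief = bayes_update(belief, True, 0.75)    # positive feedback from the 2-star agent
print((GRID * belief).sum())                 # posterior mean rises above 0.5

# With credibility 0.5 the likelihood is flat, so the belief does not change:
assert np.allclose(bayes_update(belief, True, 0.5), belief)
```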
Next, we formulated a family of non-Bayesian computational RL models. Importantly, these models can flexibly express non-Bayesian learning patterns and, as we show in following sections, can serve to identify non-normative learning biases. Here, an assumption is that during feedback, the value of a chosen bandit (which here is represented by a point estimate, “Q value”, rather than a distribution) either increases or decreases (for positive or negative feedback, respectively) according to a magnitude quantified by the free “Credit-Assignment (CA)” model parameters (44) (Fig. 2a, bottom panel; Fig. S5b; Methods). Model variants differed in how task-variables influenced CA parameters: the “null” model attributed the same CA to all feedback-agents (regardless of their credibility, i.e., a single free CA-parameter), whereas the “credibility-CA” model included three separate CA parameters, one for each feedback agent, thereby allowing us to test how learning was modulated by feedback-credibility. Using a bootstrap generalized-likelihood ratio test for model comparison (Methods) we rejected the null model (group level: p<0.001), in favour of the credibility-CA model. Furthermore, model-simulations based on participants’ best-fitting parameters (Methods) falsified the null model as it failed to predict credibility-modulated learning, showing instead equal learning from all feedback sources (Fig. 3b, bottom-left panel). In contrast, the credibility-CA model successfully predicted increased learning as a function of credibility (Fig. 3b, bottom-right panel) (see SI 3.1.1.1 Tables S5 and S6).
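A corresponding sketch of the credit-assignment update, under one simple parameterization (the exact functional form in the paper follows (44) and may differ, e.g., in decay or normalization terms; the CA values below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax_choice(q_pair: np.ndarray, beta: float = 3.0) -> int:
    """Choose between the two bandits of a pair via a softmax over Q values."""
    z = beta * q_pair - np.max(beta * q_pair)   # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(2, p=p))

def ca_update(q: np.ndarray, chosen: int, feedback_positive: bool, ca: float) -> np.ndarray:
    """Shift the chosen bandit's value up (positive feedback) or down (negative
    feedback) by the agent-specific credit-assignment amount."""
    q[chosen] += ca if feedback_positive else -ca
    return q

CA = {"1-star": 0.2, "2-star": 0.5, "3-star": 1.0}   # credibility-CA model: one CA per agent
q = np.zeros(2)
chosen = softmax_choice(q)
q = ca_update(q, chosen, feedback_positive=True, ca=CA["2-star"])
```

In the null model the CA dictionary collapses to a single shared value; in the credibility-valence-CA variant introduced later, each agent has separate CA+ and CA- entries.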
After confirming CA parameters are highly recoverable (see Methods and SI 3.3), we examined how the Maximum Likelihood (ML) CA parameters from the credibility-CA model differed as a function of feedback credibility (Fig. 3c). Using a mixed effects model (Methods), we regressed the CA parameters on their associated agents, finding that CA differed across the agents (F(2,609)=212.65, p<0.001), increasing as a function of agent-credibility (3-star vs. 2-star: b=1.02, F(1,609)=253.73; 3-star vs. 1-star: b=1.24, t(609)=19.31; and 2-star vs. 1-star: b=0.22, t(609)=3.38, all p’s<0.001). We found similar results in the discovery study (see SI 1.2.1). Thus, these results provide convergent support for our conjecture that feedback from more credible sources leads to more pronounced learning.
Substantial deviations from Bayesian learning
We next implemented a model comparison between each of the Bayesian models and the credibility-CA model, using a parametric bootstrap cross-fitting method (Methods). We found that the credibility-CA model provided a superior fit for 71% of participants (sign test; p<0.001) when compared to the instructed-credibility Bayesian model (Fig. 2b), and for 53.9% of participants (p=0.29) when compared to the free-credibility Bayesian model (Fig. 2c). The discovery study revealed even stronger results supporting a conclusion that the credibility-CA model was superior to both Bayesian models for most subjects (see SI 1.2.2), suggesting pervasive deviations from normative learning.
To further characterise these deviations, we used a “cross-fitting” method. We simulated synthetic data based on Bayesian agents (using participants’ best-fitting parameters), but fitted these data using the CA models, obtaining what we term “Bayesian-CA parameters” (Fig. 2d; Methods). A comparison of these Bayesian-CA parameters with empirical-CA parameters, obtained by fitting CA models to empirical data, allowed us to uncover patterns consistent with, or deviating from, normative Bayesian value-based inference. Using this approach, we found that both the instructed-credibility and free-credibility Bayesian models predicted increased Bayesian-CA parameters as a function of agent credibility (Fig. 3c; see SI 3.1.1.2 Tables S8 and S9). However, an in-depth comparison between Bayesian and empirical CA parameters revealed discrepancies from normative Bayesian learning.
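Schematically, the cross-fitting procedure amounts to the following pipeline, where fit_model and simulate are hypothetical stand-ins for a maximum-likelihood fitting routine and a task simulator:

```python
def cross_fit(participant_data, bayesian_model, ca_model):
    # 1) Fit the Bayesian model to a participant's choices (ML parameters).
    bayes_params = fit_model(bayesian_model, participant_data)
    # 2) Simulate how a Bayesian learner with those parameters would behave.
    synthetic_data = simulate(bayesian_model, bayes_params)
    # 3) Fit the CA model to the synthetic data -> "Bayesian-CA parameters".
    bayesian_ca = fit_model(ca_model, synthetic_data)
    # 4) Fit the CA model to the empirical data -> "empirical CA parameters".
    empirical_ca = fit_model(ca_model, participant_data)
    # Discrepancies between the two parameter sets flag deviations from
    # normative Bayesian learning.
    return bayesian_ca, empirical_ca
```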

Figure 2. Computational models and cross-fitting method.
a, Summary of the two model families. Bayesian models (top panel) represent a benchmark for normative learning. In these models, the observer maintains a belief-distribution over the probability a bandit is truly rewarding (denoted r). On each trial, this distribution is updated for the selected bandit according to Bayes’ rule, based on the valence (i.e., rewarding/non-rewarding; denoted f) and credibility of the trial’s reward feedback (denoted c). Credit-assignment models (bottom panel) are used to test deviations from Bayesian learning. Here, the observer maintains a subjective point-value (denoted Q) for each bandit. On each trial the value of the chosen bandit is updated based on a free CA parameter, quantifying the extent of value increase/decrease following positive/negative feedback. CA parameters can be modulated by the valence and credibility of feedback. b,c, Model selection between the credibility-CA model and the two variants of Bayesian models. Most participants were best fitted by the credibility-CA model, compared to the instructed-credibility Bayesian (b) or free-credibility Bayesian (c) models. d, Cross-fitting method: Firstly, we fit a Bayesian model to empirical data, to estimate its (ML) parameters. This yields the Bayesian learner that comes closest to accounting for a participant’s choices. Secondly, we simulate synthetic data based on the Bayesian model, using its ML parameters, to obtain instances of how a Bayesian learner would behave in our task. Thirdly, we fit these synthetic data with a CA model, thus estimating “Bayesian CA parameters”, i.e., CA parameters capturing the performance of a Bayesian model. Finally, we fit the CA model directly to empirical data to obtain “empirical CA parameters”. A comparison of Bayesian and empirical CA parameters allows us to identify which aspects of behaviour are consistent with Bayesian belief updating, as well as characterize biases in behaviour that deviate from normative Bayesian learning.

Figure 3. Learning adaptations to credibility.
a, Probability of repeating a choice as a function of feedback-valence and agent-credibility on the previous trial for the same bandit pair. The effect of feedback-valence on repetition increases as feedback credibility increases, indicating that more credible feedback has a greater effect on behaviour. b, Similar analysis as in panel a, but for synthetic data obtained by simulating the main models. Simulations were computed using the ML parameters of participants for each model. The null model (bottom left) attributes a single CA to all credibility-levels, hence feedback exerts a constant effect on repetition (independently of its credibility). The credibility-CA model (bottom right) allowed credit assignment to change as a function of source credibility, predicting varying effects of feedback with different credibility levels. The instructed-credibility Bayesian model (top left) updated beliefs normatively based on the true credibility of the feedback, and therefore predicted an increased effect of feedback on repetition as credibility increased. Finally, the free-credibility Bayesian model (top right) allowed for a possibility that participants use distorted credibilities for the 1-star and 2-star agents when following a Bayesian strategy, also predicting an increase in the effect of feedback as credibility increased. c, ML credit assignment parameters for the credibility-CA model. Participants show a CA increase as a function of agent-credibility, as predicted by Bayesian-CA parameters for both the instructed-credibility and free-credibility Bayesian models. Moreover, participants showed a positive CA for the 1-star agent (which essentially provides random feedback), which is only predicted by cross-fitting parameters for the free-credibility Bayesian model. d, ML credibility parameters for a free-credibility Bayesian model attributing credibility 1 to the 3-star agent but estimating credibility for the two lying agents as free parameters. Small dots represent results for individual participants/simulations, big circles represent the group mean (a,b,d) or median (c) of participants’ behaviour. Results of the synthetic model simulations are represented by diamonds (instructed-credibility Bayesian model), squares (free-credibility Bayesian model), upward-pointing triangles (null-CA model) and downward-pointing triangles (credibility-CA model). Error bars show the standard error of the mean. (*) p<.05, (**) p<.01, (***) p<.001.
Non-credible feedback elicits learning
While our task instructions framed the 1-star agent as highly deceptive, lying 50% of the time, its feedback is statistically equivalent to entirely non-informative, i.e., random, feedback. Thus, normatively, participants should ignore and filter out such feedback from their belief updates. Indeed, for the 1-star agent, simulations based on the instructed-credibility Bayesian model provided no evidence for either a positive feedback-effect on choice-repetition (mixed effects model described above; b=-0.01, t(2436)=-0.41, p=0.68; Fig. 3b, top left) or a positive Bayesian-CA (b=-0.01, t(609)=-0.31, p=0.76; Fig. 3c). However, contrary to this, we hypothesized that participants would struggle to entirely disregard non-credible feedback. Indeed, we found a positive feedback-effect on choice-repetition for the 1-star agent (mixed effects model, delta(M)=0.049, b=0.25, t(2436)=8.05, p<0.001), indicating participants were more likely to repeat a bandit selection after receiving positive feedback from this agent (Fig. 3a). Similarly, the CA parameter for the 1-star agent in the credibility-CA model was positive (b=0.23, t(609)=4.54, p<0.001) (Fig. 3c). The upshot of this empirical finding is that participants updated their beliefs based on essentially random feedback (see Fig. S7 for analysis showing that this resulted in decreased accuracy rates).
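To make the normative point explicit: with source credibility c, the likelihood of positive feedback given a bandit’s true reward rate r is

```latex
P(f^{+} \mid r, c) = c\,r + (1 - c)(1 - r),
\qquad
P(f^{+} \mid r, 0.5) = 0.5\,r + 0.5\,(1 - r) = 0.5 .
```

Because the c = 0.5 likelihood does not depend on r, the posterior over r equals the prior, and 1-star feedback should, normatively, produce no belief update.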
A potential explanation for this finding is that participants do rely on a Bayesian strategy but “distort probabilities”, attributing non-instructed degrees of credibility to lying sources (despite our explicit instructions on the credibility of different agents). Consistent with this, the ML-estimated credibility of the 1-star agent (Fig. 3d) was significantly greater than 0.5 (Wilcoxon signed-rank test, median difference from 0.5 = 0.08, z=5.50, p<0.001), allowing the free-credibility Bayesian model to predict a positive feedback effect on choice-repetition (mixed-effects model: b=0.12, t(2436)=9.48, p<0.001; Fig. 3b, top right) and a positive Bayesian-CA (b=0.08, t(609)=3.32, p<0.001; Fig. 3c) for the 1-star agent. For corresponding results in the discovery study see SI 1.2.3. In our Discussion we elaborate on why it might be difficult to filter out this feedback even if one can explicitly infer its randomness.
Increased learning from fully credible feedback when it follows non-informative feedback
A comparison of empirical and Bayesian credit-assignment parameters revealed a further deviation from normative learning: both Bayesian models predicted an attenuated credit-assignment for the 3-star agent [Wilcoxon signed-rank test, instructed-credibility Bayesian model (median difference=0.74, z=11.14); free-credibility Bayesian model (median difference=0.62, z=10.71), all p’s<0.001] (Fig. 3c). One explanation for enhanced learning from the 3-star agent is a contrast effect, whereby credible information looms larger against a backdrop of non-credible information. To test this hypothesis, we examined whether the impact of feedback from the 3-star agent is modulated by the credibility of the agent in the trial immediately preceding it. More specifically, we reasoned that the impact of a 3-star agent would be amplified by a “low credibility context” (i.e., when it is preceded by a low credibility trial), even when this context is entirely irrelevant for current learning. In a binomial mixed effects model, we regressed choice-repetition on feedback valence from the last trial featuring the same bandit pair, and on the feedback agent on the trial immediately preceding that last trial (i.e., the contextual credibility; Methods for model-specification). This analysis included only trials for which the last same-pair trial featured the 3-star agent and in which the context trial featured a different bandit pair (Fig. 4a). We found that feedback valence interacted with contextual credibility (F(2,2419)=5.06, p=0.006) such that the feedback-effect (from the 3-star agent) was greater when preceded by the temporal context of the 1-star agent, compared to a context involving a 2-star (b=0.20, t(2419)=-2.42, p=0.016) or a 3-star agent (b=0.24, t(2419)=-3.01, p=0.003) (Fig. 4b). There was no difference between 2-star and 3-star agent contexts (b=0.051, F(1,2419)=0.39, p=0.53). Thus, these results support an interpretation that credible feedback exerts a greater impact on participants’ learning when it follows non-credible feedback.
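The trial-selection logic of this analysis can be sketched in pandas as follows (the column names pair/agent/valence/choice are hypothetical; rows are assumed to be in trial order with a default integer index):

```python
import pandas as pd

def context_analysis_trials(df: pd.DataFrame) -> pd.DataFrame:
    """Select trials where the last same-pair trial featured the 3-star agent
    and the immediately preceding (context) trial featured a different pair."""
    rows = []
    for n in range(1, len(df)):
        same_pair = df.index[(df["pair"] == df.loc[n, "pair"]) & (df.index < n)]
        if len(same_pair) == 0:
            continue
        last = same_pair[-1]   # trial n-k: last trial featuring the same pair
        ctx = last - 1         # trial n-k-1: the context trial
        if ctx < 0:
            continue
        if df.loc[last, "agent"] == "3-star" and df.loc[ctx, "pair"] != df.loc[n, "pair"]:
            rows.append({
                "repeat": int(df.loc[n, "choice"] == df.loc[last, "choice"]),
                "valence": df.loc[last, "valence"],       # feedback on trial n-k
                "context_agent": df.loc[ctx, "agent"],    # credibility context
            })
    return pd.DataFrame(rows)
```

The resulting data frame feeds the binomial regression of repeat on valence and context_agent described above.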

Figure 4. Contextual effects and learning.
a, Trials contributing to the analysis of effects of credibility-context on learning from the fully credible agent. We included only “current trials (n)” for which: 1) the last trial (trial n-k) offering the same bandit pair was associated with the 3-star agent, and 2) the immediately preceding context trial (n-k-1) featured a different bandit pair (providing a learning context irrelevant to current choice). We examined how choice-repetition (from n-k to n) was modulated by feedback valence on the last same-pair trial, and by the feedback agent on the context trial (i.e., the credibility context). Note the greyed-out star-rating on the current trial indicates the identity of the current agent and was not included in the analysis. b, Difference in probability of repeating a choice after receiving positive vs negative feedback (i.e., feedback effect) from the 3-star agent, as a function of the credibility context. The 3-star agent feedback-effect is greater when preceded by a low-credibility context (i.e., 1-star agent in preceding trial), than when preceded by a higher credibility context (i.e., 2-star or 3-star agent in preceding trial). Big circles represent the group mean, and error bars show the standard error of the mean. (*) p<.05, (**) p<.01.
Positivity bias in learning and credibility
Previous research has shown that reinforcement learning is characterized by a positivity bias, wherein subjects systematically learn more from positive than from negative feedback (37,39). One account is that this bias might result from motivated cognition influences on learning, whereby participants favour positive feedback that reflects well on their choices. We conjectured that feedback of ambiguous veracity (i.e., from the 1-star and 2-star agents) would promote this bias by allowing participants to explain away negative feedback as a case of agent-lying, while choosing to believe positive feedback. Previous research has quantified positivity bias in two ways: 1) as the absolute difference between credit-assignment based on positive versus negative feedback, and 2) as the same difference but relative to the overall extent of learning.
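Using the CA+ / CA- notation introduced below, these two measures can be written as

```latex
\mathrm{VBI} = CA^{+} - CA^{-},
\qquad
\mathrm{rVBI} = \frac{CA^{+} - CA^{-}}{\lvert CA^{+} \rvert + \lvert CA^{-} \rvert},
```

with positive values of either index indicating a positivity bias.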
To investigate this bias across different levels of feedback credibility we formulated a more detailed variant of the CA model. To quantify the extent of a chosen bandit’s value increase or decrease (following positive or negative feedback, respectively), the “credibility-valence-CA” variant included separate CA parameters for positive (CA+) and negative (CA-) feedback for each feedback agent. In effect, this model variant enabled us to test whether different levels of feedback credibility elicited a positivity bias (i.e., CA+ > CA-). Using a bootstrap generalized-likelihood ratio test for model comparison (Methods), we rejected, in favour of the credibility-valence-CA model, the null-CA model, the credibility-CA model and a “constant feedback-valence bias” CA model, which attributed a common valence bias (CA+ minus CA-) to all agents (group level: all p’s<0.001). This test supported our choice of flexible CA parametrization as a factorial function of agent and feedback-valence.
After confirming the parameters of this model were highly recoverable (see Methods and SI 3.3), we used a mixed effects model to regress the ML parameters (Fig. 5a) on their associated agent-credibility and valence (see Methods). This revealed participants attributed a greater CA to positive feedback than to negative feedback (b=0.64, F(1,1218)=37.39, p<0.001). Strikingly, for lying agents, participants selectively assigned credit based on positive feedback (1-star: b=0.61, F(1,1218)=22.81, p<0.001; 2-star: b=0.85, F(1,1218)=43.5, p<0.001), with no evidence for significant credit-assignment based on negative feedback (1-star: b=-0.03, F(1,1218)=0.07, p=0.79; 2-star: b=0.14, F(1,1218)=1.28, p=0.25). Only for the 3-star agent was credit-assignment significant for both positive (b=1.83, F(1,1218)=203.1, p<0.001) and negative (b=1.25, F(1,1218)=95.7, p<0.001) feedback. We found no significant interaction effect between feedback valence and credibility on CA (F(2,1218)=0.12, p=0.88; Fig. 5a-b).
However, we found evidence for agent-based modulation of positivity bias when this bias was measured in relative terms. Here we calculated, for each participant and agent, a relative Valence Bias Index (rVBI) as the difference between the credit assignment for positive feedback (CA+) and negative feedback (CA-), divided by the overall magnitude of CA (i.e., |CA+| + |CA-|) (Fig. 5c). rVBI was significantly positive for all credibility levels [Wilcoxon signed-rank test, 50% credibility (median=0.92, z=6.04), 75% credibility (median=0.73, z=6.69) and 100% credibility (median=0.21, z=4.96), all p’s<0.001]. Critically, the rVBI varied depending on the credibility of feedback (Friedman test, χ2(2)=62.39, p<0.001), such that the rVBI for the 3-star agent was lower than that for both the 1-star (Wilcoxon signed-rank tests, median difference=-0.42, z=-5.40, p<0.001) and 2-star agent (median difference=-0.22, z=-5.63, p<0.001). Feedback with 50% and 75% credibility yielded similar rVBI values (median difference (75%-50%)=-0.08, z=-1.91, p=0.055). Results from the discovery study indicate similar conclusions (Fig. S3; see SI 1.2.5). Finally, a positivity bias could not stem from a Bayesian strategy, as both Bayesian models predicted a negativity bias (Fig. 5b-c; Fig. S8; and SI 3.1.1.3 Tables S11-S12, 3.2.1.1, and 3.2.1.2). The upshot is that positivity bias, relative to the overall extent of CA, was greater for lying than for fully credible agents, a pattern deviating from Bayesian normativity.

Figure 5. Positivity bias as a function of agent-credibility.
a, ML parameters from the credibility-valence-CA model. CA+ and CA- are free parameters representing credit assignments for positive and negative feedback respectively (for each credibility level). Our data revealed a positivity bias (CA+ > CA-) for all credibility levels. b, Absolute valence bias index (defined as CA+ - CA-) based on the ML parameters from the credibility-valence-CA model. Positive values indicate a positivity bias, while negative values represent a negativity bias. c, Relative valence bias index (defined as (CA+ - CA-)/(|CA+| + |CA-|)) based on the ML parameters from the credibility-valence-CA model. Positive values indicate a positivity bias, while negative values represent a negativity bias. Small dots represent fitted parameters for individual participants and big circles represent the group median (a,b) or mean (c) (both of participants’ behaviour), while squares are the median or mean of the fitted parameters of the free-credibility Bayesian model simulations. Error bars show the standard error of the mean. (***) p<.001 for ML fits of participants’ behaviour.
True feedback elicits greater learning
Our findings are consistent with participants modulating the extent of credit-assignment based solely on cued task-variables, such as feedback-credibility and valence. However, we also considered another possibility: that participants might infer, on a trial-by-trial basis, whether the feedback they received was true or false and adjust their credit assignment based on this inference. For example, for a given feedback-agent, participants might boost the credit assigned to a chosen bandit as a function of the degree to which they believe feedback was true. Notably, Bayesian inference can support a trial-level calculation of a posterior probability that feedback is true, based on its credibility, valence and a prior belief (based on experiences in previous trials) regarding the probability that the chosen bandit is truly rewarding (Fig. 6a). These beliefs can partially discriminate between truthful and false feedback. To demonstrate this, we calculated a Bayesian posterior feedback-truthfulness belief for each participant and each trial featuring the 1- or 2-star agents (Methods; recall that feedback from the 3-star agent is always true). On testing whether these posterior-truthfulness beliefs vary as a function of objective feedback truthfulness (true vs. lie), we found beliefs were stronger for truthful trials than for untruthful trials for both agents (1-star agent: mean difference=0.10, t(203)=39.47, p<0.001; 2-star agent: mean difference=0.08, t(203)=34.43, p<0.001) (Fig. 6b and Fig. S9a). Note that this calculation was feasible because, as experimenters, we had privileged access to the objective truth of the choice-feedback: when designing the experimental sessions, we generated latent true choice outcomes which could be compared to agent-reported feedback.
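The posterior feedback-truthfulness belief follows directly from Bayes’ rule; a minimal sketch (names illustrative), where prior_reward is the prior belief that the chosen bandit’s latent outcome on this trial is a reward:

```python
def p_feedback_true(prior_reward: float, feedback_positive: bool, credibility: float) -> float:
    """Posterior probability that the current feedback is truthful."""
    p_match = prior_reward if feedback_positive else 1 - prior_reward  # P(feedback | truth)
    p_flip = 1 - p_match                                               # P(feedback | lie)
    return credibility * p_match / (credibility * p_match + (1 - credibility) * p_flip)

# Positive feedback from the 2-star agent (c = 0.75) with a prior belief of 0.7
# that the chosen bandit truly rewards:
print(round(p_feedback_true(0.7, True, 0.75), 3))   # -> 0.875
```

Intuitively, expected feedback (e.g., positive feedback for a bandit believed to be good) is more likely to be judged truthful, which is what allows these posteriors to partially track objective truthfulness.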
To formally address whether inference about feedback truthfulness modulates credit assignment, we fitted a new variant of the CA model (the “Truth-CA” model) to the data. This variant features two separate CA parameters for objectively true (CAtrue) and false (CAlie) feedback for each of the lying agents (i.e., the 1-star and 2-star agents) and a single CA parameter for the 3-star agent. We acknowledge that this model falls short of providing a mechanistically plausible description of the credit assignment process, because, unlike experimenters, participants cannot determine with certainty whether feedback from the 1- and 2-star agents is objectively true. Nonetheless, we use this ‘oracle model’ as a measurement tool to glean rough estimates for the average levels of credit assignment for true and false-feedback trials for each agent.
If participants rely on a (partial) insight regarding feedback truthfulness to amplify CA for feedback inferred as true, compared to false, then this predicts elevated average levels of CA on objectively true compared to lie feedback trials. In a mixed-effects model, we regressed the ML parameters for the non-credible agents on agent-credibility (1-star or 2-star) and the truthfulness of their associated feedback (Methods). We found a main effect of truthfulness (b=0.084, t(812)=2.23, p=0.026; Fig. 6c and 6e), which was not qualified by an interaction between agent and truthfulness (b=-0.03, t(812)=-0.43, p=0.67), consistent with participants assigning greater credit for objectively true compared to false feedback. Strikingly, model-simulations (Methods) showed this pattern is not predicted by any of our other models (Fig. 6d and 6e) (see SI 3.1.1.4 Tables S14-S17). In agreement with previous results, we found a main effect of agent (b=0.20, t(812)=5.30, p<0.001), consistent with individuals assigning greater credit for feedback from the 2-star compared to the 1-star agent. Note that similar conclusions were found in our discovery study (see SI 1.2.6), and the overall pattern suggests that participants infer the truthfulness of feedback but modulate their credit assignment in a non-normative manner.

Figure 6. Credit assignment is higher on true-feedback trials.
a, Posterior belief that feedback is true (y-axis) as a function of prior belief, i.e., during choice and before feedback receipt, that the selected bandit is rewarding (x-axis), feedback valence (dashed vs solid lines), and agent credibility (different colors). b, Distribution of posterior belief probability that feedback is true, calculated separately for each agent (1 or 2 star) and objective feedback-truthfulness (true or lie). These probabilities were computed based on the trial-sequences and feedback participants experienced, indicating belief probabilities that feedback is true are higher in truth compared to lie trials. For illustration, plotted distributions pool trials across participants. The black line within each box represents the median; upper and lower bounds represent the third and first quartile respectively. The width of each half-violin plot corresponds to the density of each posterior belief value among all trials for a given condition. c, ML parameters for the “Truth-CA” model. Credit assignment parameters (y-axes) are shown as a function of agent-credibility and feedback-truthfulness (x-axes). These data show credit assignment was enhanced for true compared to false feedback (CAtrue > CAlie). Small dots represent fitted parameters for individual participants, big circles represent the group median, and error bars show the standard error of the mean. d, As in c, but here CA parameters were obtained by fitting the Truth-CA model not to empirical data but rather to synthetic data generated from simulations of our alternative models (based on participants’ best-fitting parameters). e, Effect of feedback-truthfulness on empirical Truth-CA parameters and on Truth-CA parameters based on synthetic simulations of our alternative models (obtained as in d). Effects were estimated by regressing CA parameters from the Truth-CA model on the agent (1-star or 2-star) and on feedback-truthfulness. None of our models predicted higher credit assignment for true compared to false feedback. Lines represent 95% confidence intervals around the estimated effect coefficient. Small dots represent fitted parameters for individual simulations, diamonds represent the median value, and error bars show the standard error of the mean. (*) p<.05 and (***) p<.001.
Discovery study
The discovery study (n=104) used a disinformation task structurally similar to that used in our main study, but with three notable differences: 1) it included 4 feedback agents, with credibilities of 50%, 70%, 85% and 100%, represented by 1, 2, 3, and 4 stars, respectively; 2) each experimental block consisted of a single bandit pair, presented over 16 trials (with 4 trials for each feedback agent); and 3) in certain blocks, unbeknownst to participants, the two bandits within a pair were equally rewarding (see SI section 1.1). Overall, the results from this study support the same conclusions (see SI section 1.2), but with one difference. In the discovery study, we found no evidence for learning based on 50%-credibility feedback when examining either the feedback effect on choice repetition or CA in the credibility-CA model (SI 1.2.3). However, this does not mean that participants fully filtered this feedback because, importantly, feedback from the 1-star agent elicited a positivity bias.
Discussion
Accurate information enables individuals to adapt effectively to their environment (45,46). Indeed, it has been suggested that the importance and utility of information elevate its status to that of a secondary reinforcer, imbuing it with intrinsic value beyond its immediate usefulness (47,48). However, a significant societal challenge arises from the fact that, as social animals, much of the information we receive is mediated by others, meaning it can be inaccurate, biased or purposefully misleading. Here, using a novel variant of the two-armed bandit task, we asked how we update our beliefs in the presence of potential disinformation, wherein true choice outcomes are latent and feedback is provided by potentially disinformative agents.
We acknowledge that several factors may limit the external validity of our task, including the fact that participants were explicitly instructed about the credibility of information sources. In contrast, in many real-life scenarios, individuals need to learn the credibility of information sources based on their own experience of the world, or may even hold false beliefs regarding the credibility of sources. Moreover, in our task, the experimenter fully controlled the credibility of the information source on every trial, whereas in many real-life situations people can exercise a degree of control over the credibility of information they receive; search engines, for example, allow users to choose among sources of varying credibility. Finally, in our task, feedback agents served as rudimentary representations of social agents, who lied randomly and arbitrarily, in a motivation-free manner. Conversely, in real life, others may strategically attempt to mislead us, and we can exploit knowledge of their motivation to lie, such as when we assume that a used-car seller is more likely to portray a clapped-out car as excellent than to state the unfiltered truth. Nevertheless, our results attest to the utility of our task in discerning normative from biased aspects of learning in the face of disinformation, even in a simplified scenario.
Consistent with normative Bayesian principles, we show that individuals increased their learning as a function of feedback credibility. This aligns with previous studies demonstrating an impressive human ability to flexibly increase learning rates when environmental changes render prior knowledge obsolete (23,49,50), and when there is reduced inherent uncertainty, such as “observation noise” (24,51,52). However, as hypothesized, when facing potential disinformation, we also find that individuals deviate from a standard of optimal Bayesian learning in several important ways.
We show that participants revised their beliefs based on entirely non-credible feedback, whereas a Bayesian strategy dictates such feedback should be ignored. One possible explanation is that some participants failed to infer that feedback from the 1-star agent was statistically void of information content, essentially random (e.g., the group-level credibility of this agent was estimated by our free-credibility Bayesian model as higher than 50%). Participants were instructed that this feedback would be “a lie 50% of the time” but were not explicitly told that this meant it was random and should therefore be disregarded. However, we argue that even if one explicitly infers the randomness of this feedback, it may still be difficult to filter it out. Indeed, truth bias (53), the cognitive tendency to assume information is truthful unless strong evidence suggests otherwise, may have led participants to implicitly attribute some credibility to the 1-star feedback. Additionally, an individual’s ability to filter out random information might have been limited by the high cognitive load induced by the task, which required participants to track the values of three bandit pairs and juggle three interleaved feedback agents. Importantly, most trials in our task provided feedback from mostly truthful sources, requiring frequent belief updates. As a result, filtering out random feedback may require significant engagement of cognitive control processes. Future studies could explore whether this filtering process is more effective in environments where ignoring feedback is the default policy, such as when random feedback is presented on a majority of trials. Another possibility is that random-feedback influences on choice may stem from reliance on episodic memory (54,55), e.g., a recollection of past choice outcomes (positive or negative feedback) accompanied by a failure to recall the corresponding feedback sources. It is entirely plausible that rapid information flow on social media platforms, featuring a considerable number of information-sources, is both cognitively demanding (56,57) and taxing for source recollection, hindering efficient filtering out of non-credible information (58), such as posts from bots or unfamiliar users. Soft moderation policies in social media (13,59), or indeed fact-checking (60), may enable better filtering of disinformation, mitigating the cognitive load associated with discerning credible information from a profusion of noise. Future studies should investigate conditions that enhance an ability to discard disinformation, such as providing explicit instructions to ignore misleading feedback, manipulations that increase the time available for evaluating information, or interventions that strengthen source memory.
In support of our a priori hypothesis, and in violation of Bayesian-normative learning, participants assigned greater credit based on positive than negative feedback across all feedback-credibility levels. A similar bias has been reported in previous reinforcement learning studies, albeit only in the context of veridical feedback (38,39,61). Here, we show that this positivity bias is amplified (relative to the overall extent of CA) for information of low and intermediate credibility. Of note, previous literature has interpreted enhanced learning from positive outcomes in reinforcement learning as indicative of a confirmation bias (37,39). For example, given that participants predominantly receive rewards (and positive feedback) in our task, positive feedback may confirm, to a greater extent than negative feedback, one’s choice-outcome expectations (e.g., “I expected a positive outcome”). Additionally, positive feedback confirms one’s choice as superior (e.g., “I chose the better of the two options”). Leveraging the framework of motivated cognition (35), we posited that feedback of uncertain veracity (e.g., low credibility) amplifies this bias by incentivising individuals to self-servingly accept positive feedback as true (either because it confers positive, desirable outcomes or because it confirms one’s choice or outcome expectations), and explain away undesirable, choice-disconfirming, negative feedback as false. Alternative “informational” (motivation-independent) accounts of positivity and confirmation bias predict a contrasting trend (i.e., reduced bias in low- and medium-credibility conditions) because in these contexts it is more ambiguous whether feedback confirms one’s choice or outcome expectations, as compared to a full-credibility condition. Our findings of bias exacerbation hint that previous estimates of the extent of confirmation bias may represent a lower bound, and that negative effects of confirmation bias are augmented in the presence of disinformation. This could imply an amplified confirmation bias on social media, where content from sources of uncertain credibility, such as unknown or unverified users, is more easily interpreted in a self-serving manner, disproportionately reinforcing existing beliefs (62). In turn, this could contribute to an exacerbation of the negative social outcomes previously linked to confirmation bias, such as polarization (63,64), the formation of ‘echo chambers’ (19), and the persistence of misbelief regarding contemporary issues of importance such as vaccination (65,66) and climate change (67–70).
A striking finding in our study was that, for a fully credible feedback agent, credit assignment was exaggerated (i.e., higher than predicted by a Bayesian strategy). Furthermore, the effect of fully credible feedback on choice was further boosted when it was preceded by a low-credibility context, even when this context was entirely unrelated to current learning. We interpret this in terms of a “contrast effect”, whereby veridical information looms larger against a backdrop of disinformation (21). One upshot is that exaggerated learning might entail a risk of jumping to premature conclusions based on limited credible evidence. To illustrate, one might revise one’s opinion regarding the potential risks and benefits of AI, based on information provided by a credible tech-source, to a greater extent after reading a low-credibility (compared to a high-credibility) news item regarding climate change. An intriguing possibility, which could be tested in future studies, is that participants strategically amplify the extent of learning from credible feedback to dilute the impact of learning from non-credible feedback. For example, a person scrolling through a social media feed, encountering copious amounts of disinformation, might amplify the weight they assign to credible feedback in order to dilute effects of ‘fake news’. Whereas such a strategy would backfire in our task, where bandits were randomly interleaved between trials, it could be beneficial in situations where the content of consumed information is temporally autocorrelated (for example, one reads several social-media items posted by members of an “AI group” and only then continues to items from a “climate-change group”). Ironically, these results also suggest that public campaigns might be more effective when embedding their messages in low-credibility contexts, which may boost their impact.
Our study suggests that individuals’ learning is modulated by a trial-by-trial latent-state inference, amplifying learning for feedback deemed true as opposed to false. Strikingly, this non-normative belief-updating strategy was not predicted by any of our Bayesian (or CA) models. One possibility is that this strategy is more efficient in ecological environments providing richer cues, beyond average source-credibility, as to whether an information source should be trusted in specific situations (e.g., when information sources have interests and motives that can be considered). Hence, our finding of increased learning for truthful feedback may stem from a failure to appreciate the inadequacy of this strategy in our relatively impoverished task. We note that the use of this strategy is consistent with our finding of exaggerated learning for fully credible, always-true, feedback. Taken together, these findings show that participants exaggerate learning from feedback they either know (3-star agent) or infer (1- and 2-star agents) to be true.
An important question arises as to the psychological locus of the biases we uncovered. Because we were interested in how individuals process disinformation (deliberately false or misleading information intended to deceive or manipulate), we framed the feedback agents in our study as deceptive agents who would occasionally “lie” about the true choice outcome. However, statistically (though not necessarily psychologically), these agents are equivalent to agents who mix truth-telling with random “guessing” or “noise”, where inaccuracies may arise from factors such as occasionally lacking access to true outcomes, simple laziness, or mistakes, rather than an intent to deceive. For example, our “50% credibility agent” is statistically identical to a “100% guessing (fully random)” agent, and our 75% credibility agent corresponds to an agent who mixes truth-telling and guessing on half the trials. While information from guessing agents would constitute misinformation due to its inaccuracy, it lacks the intentionality required to qualify as disinformation. It is possible that participants in our task represented the agents as varying in randomness or noisiness rather than as intentionally deceitful. This raises the question of whether the biases we observed are driven by the perception of potential disinformation as deceitful per se or simply as deviating from the truth. Future studies could address this question by directly comparing learning from statistically equivalent sources framed as either lying or noisy. We have begun exploring this question in a new study by comparing learning from agents who lie on a minority of trials (e.g., 25% of the time) with agents who lie on a majority of trials (e.g., 75% of the time). Although both agents provide statistically equivalent information (via flipping the feedback of the mostly lying agent), the latter is perceived as more deceitful.
Our study has a bearing on prior research involving observational learning, which examined how individuals learn from the actions or advice of social partners (71,72). This body of work has demonstrated that individuals integrate learning from their private experiences with learning based on others’ actions or advice, whether by inferring the value others attribute to different options or by mimicking their behaviour (73). In such settings, individuals modulate the extent to which they rely on social partners based on the partner’s accuracy (74). However, our task differs from traditional observational learning paradigms in several key ways. Firstly, in our study, feedback agents do not demonstrate or recommend actions; instead, they interpret the outcomes of actions on behalf of participants by indicating whether these actions generated a latent reward. Secondly, participants in our task lack a private set of experiences unmediated by feedback sources, unlike in many reported observational learning paradigms. Finally, although observational learning tasks often involve actions or advice that are not always accurate (e.g., recommending or demonstrating a suboptimal choice), research to date has not systematically addressed scenarios that involve deliberately misleading social partners. Future studies could incorporate deceptive social partners into observational learning paradigms, offering an opportunity to bridge the mechanisms underlying observational learning with those operative in our task. Developing unified models for these processes could provide valuable insights into how individuals integrate social information, particularly when the credibility of that information is critical for decision-making.
Although our findings and interpretations are grounded in a reinforcement learning (RL) framework, we acknowledge parallels with other approaches that have explored adaptive and biased learning. For instance, previous studies utilizing change-point inference (23,49,50,74) have addressed how individuals determine whether unexpected perceptual observations stem from “noise” as opposed to representing a genuine latent change in the underlying cause of their observations. In such tasks, incorrect assumptions about the rate of these changes (e.g., the “hazard rate”) can lead to deviations from normative statistical optimality. Similarly, in our task, learning bandit-values might involve inferring the extent to which variable choice feedback (for a bandit across trials) reflects inherent stochasticity in latent outcomes as opposed to agent deception. As in the change-point inference framework, incorrect assumptions in our task also produce biases. For example, a mistaken belief that the 1-star, random-feedback, agent is truthful on most trials would lead participants to erroneously learn from that agent’s feedback. Nevertheless, despite such similarities, there remain key differences between our task and change-point inference paradigms. Notably, choices in our task are value-based, and we consider it likely that this setup introduces biases (e.g., positivity) driven by a motivational preference for one outcome (reward) over another (non-reward). This motivational aspect may also influence whether individuals are inclined to trust or doubt feedback, depending on its valence. Additionally, explaining some of the biases we observed—such as the amplified learning from a credible source after exposure to non-credible sources in independent learning contexts—would require hierarchical inference frameworks that incorporate assumptions about the breakdown of learning-context independence. Future research could usefully investigate whether shared mechanisms underlie the biases identified here and those observed in other paradigms, potentially offering a unified account for inference problems across these approaches.
We conclude by noting that previous research has often attributed the negative impacts of disinformation, such as polarization and the formation of echo chambers, to intricate processes facilitated by external or self-selection of information (75–77). These processes include algorithms tailoring information to align with users’ attitudes (78) and individuals consciously opting to engage with like-minded peers (79). However, our study reveals a more profound effect of disinformation: even in minimal conditions, when low-credibility information is explicitly identified as such, disinformation significantly impacts individuals’ beliefs and decision-making. This occurs even when the decision at hand lacks emotional engagement or pertinence to deep, identity-related, issues. A critical next step is to deepen our understanding of these biases, particularly within complex social environments, not least to enable the development of effective prospective interventions capable of mitigating the potentially pernicious impacts of disinformation.
Materials and methods
Participants
We recruited 246 participants (mean age 39.33 ± 12.65 years; 112 female) from the Prolific participant pool (www.prolific.co), who went on to perform the task on the Gorilla platform (80). All participants were fluent English speakers with normal or corrected-to-normal vision and a Prolific approval rate of 95% or higher. The UCL Research Ethics Committee approved the study (Project ID 6649/004), and all participants provided prior informed consent.
Experimental protocol
Traditional two-armed bandit task
At the beginning of the experiment participants completed a traditional version of the two-armed bandit task. Participants performed 45 trials, each featuring one of three randomly interleaved bandit pairs (such that each pair was presented on 15 trials). On each trial, participants chose between the two bandits in a pair, with each bandit represented by a distinct identicon. Once a bandit was selected, it generated a true outcome (converted to bonus monetary compensation) corresponding to either a reward or nothing. Within each bandit-pair, one bandit provided rewards on 75% of trials (and no reward on the remaining 25%), while the other bandit rewarded on 25% of trials (75% non-reward trials). Participants were uninformed about the reward probabilities of each bandit and had to learn these from experience.
At the onset of each trial, the two bandits were presented, one on each side of the screen, and participants were asked to indicate their choice within 3 seconds by pressing the left/right arrow-keys. If the 3 seconds elapsed with no choice, participants were shown a “too slow” message and proceeded to the next trial. Following choice, the unselected bandit disappeared, and participants were presented with the outcome of the selected bandit for 1200 ms, followed by a 250 ms ISI before the start of the next trial. Rewards were represented by a green dollar symbol and non-rewards by a red sad face (both in the center of the screen). At the end of the task, participants were informed about the number of rewards they had earned.
Disinformation task
This involved a modified, disinformation, version of the same two-armed bandit task. Participants performed 8 blocks, each consisting of 45 trials. Each block followed the structure of the traditional two-armed bandit task, but with a critical difference: true choice-outcomes were withheld from participants, and instead they received reward-feedback from a feedback agent. Participants were instructed prior to the task that feedback agents mostly provide accurate feedback (i.e., the true outcome) but could lie on a random minority of trials by reporting a reward in case of a true non-reward, or vice versa. The task featured three feedback agents varying in their credibility (i.e., probability of truth-telling), as indicated by a “star-rating” system, about which participants were instructed prior to the task. The 3-star agent always told the truth, whereas the other 2 agents were partially credible, reporting the truth on 75% (2-star) or 50% (1-star) of trials. Feedback agents were randomly interleaved across trials, subject to the constraint that each agent appeared on 5 trials for each bandit pair.
At the onset of each trial, participants were presented with the feedback agent for the trial (screen center) and with the two bandits, one on each side of the screen. Participants made a 2-second time-limited choice by pressing the left/right arrow-keys. Following choice, the unselected bandit disappeared, and participants were then presented with the agent feedback for 1200 ms (represented by either a rewarding green dollar sign or a non-rewarding red sad face in the center of the screen). All stimuli then disappeared for 250 ms, followed by the start of the next trial. At the end of each block, participants were informed about the number of true rewards they had earned. They then received a 30-second break before the next block started with 3 new bandit pairs.
General protocol
At the beginning of the experiment, participants were presented with instructions for the traditional two-armed bandit task. The instructions were interleaved with four multiple-choice questions. When participants answered a question incorrectly, they could re-read the instructions and re-attempt. If participants answered a question incorrectly twice, they were compensated for the time but could not continue to the next stage. Upon completing the instructions participants proceeded to the traditional two-armed bandit task.
After the two-armed bandit task, participants were presented with instructions regarding the disinformation task. Again, these were interleaved with six questions, wherein participants had two attempts to answer each question correctly. If they answered a question incorrectly twice, they were rejected and received partial compensation for their participation. Participants then proceeded to the disinformation task. After completing the disinformation task, participants completed three psychiatric questionnaires (presented in random order): 1) the Obsessional Compulsive Inventory - Revised (OCI-R) (81), assessing symptoms of obsessive-compulsive disorder (OCD); 2) the Revised Green et al. Paranoid Thoughts Scale (R-GPTS) (82), measuring paranoid ideation; and 3) the DOG scale, evaluating dogmatism (83).
The participants took on average 43 minutes to complete the experiment. They received a fixed compensation of 5.48 GBP and variable compensation between 0 and 2 GBP based on their performance on the disinformation task.
Attention checks
The two tasks included randomly interleaved catch trials wherein participants were cued to press a given key within a 3-second limit. None of the participants failed more than one of these attention checks.
Data analysis
Exclusion criteria
Participants were excluded if: 1) they repeated or alternated key presses in more than 70% of trials, and/or 2) their reaction time was below 150 ms in more than 5% of trials. Based on these criteria 42 participants were excluded, and 204 participants were retained for the analyses.
Accuracy
Accuracy rates were calculated as the probability of choosing, within a given pair, the bandit with the higher reward probability. For Figure 1d, we calculated, for each participant and each trial (within a bandit-pair), accuracy averaged across all bandit-pairs. We then averaged accuracy at the trial level across participants. Overall improvement for each participant was calculated as the average accuracy difference between the last and first trials across the bandit-pairs.
Computational models
RL Models
We formulated a family of RL models to account for participant choices. In these models, a tendency to choose each bandit is captured by a Q-value. After reward-feedback the Q-value of the chosen bandit was updated conditional on the agent and on whether the feedback was positive or negative according to the following rule:
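In a minimal sketch, assuming forgetting is applied to the chosen bandit alongside the feedback-driven increment (consistent with the parameter definitions that follow), the rule takes the form:

$$Q_{chosen} \leftarrow (1 - f_Q) \cdot Q_{chosen} + CA \cdot F$$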
where CA is a free credit assignment parameter representing the magnitude of the value increase/decrease following feedback receipt F from the agents (coded as 1 for reward feedback and -1 for non-reward feedback), while fQ (∈ [0,1]) is a free parameter representing the forgetting rate of the Q-value. Additionally, the values of each of the other bandits (i.e., the unchosen bandit in the presented pair and all the bandits from the other, not-shown, pairs) were forgotten as per the following:
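A minimal sketch of this forgetting rule, following the fQ definition above:

$$Q_{other} \leftarrow (1 - f_Q) \cdot Q_{other}$$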
Alternative model-variants differed based on whether the CA parameter(s) were influenced by agents and/or feedback valence (see Table 1 below), allowing us to test how these variables impacted learning.
The “Null” model included a unique CA parameter, conveying an assumption that learning is modulated neither by agent-credibility nor by feedback valence.
The “Credibility-CA” models included a dedicated CA parameter for each agent, allowing for the possibility that learning was selectively modulated by agent credibility (but not by feedback valence).
The “Credibility-Valence-CA” model included distinct CA parameters for rewarding (CA+) and non-rewarding (CA-) feedback for each agent, allowing CA to be influenced by both feedback valence and credibility.
The “constant feedback-valence bias” CA model included separate CA-parameters for each agent, but a single valence bias parameter (VB) common to all agents, such that the CA+ parameter for each agent corresponded to the sum of its CA-parameter and the common VB parameter.
Additionally, we formulated a “Truth-CA” model where CA parameters were influenced by agent credibility and whether the feedback was objectively true. This model included distinct CA parameters for truthful (CAtrue) and non-truthful (CAlie) feedback for each non-credible agent (i.e., 1-star and 2-star) and a single CA parameter for the 3-star agent, since it never lied.

Table 1. Summary of free parameters for each of the CA models.
All models also included gradual perseveration for each bandit. On each trial, the perseveration value (P) of the chosen bandit was updated according to:
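A minimal sketch, assuming the same forgetting-plus-increment form as the Q-value update:

$$P_{chosen} \leftarrow (1 - f_P) \cdot P_{chosen} + PERS$$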
where PERS is a free parameter representing the P-value change for the chosen bandit, and fP (∈ [0,1]) is a free parameter denoting the forgetting rate applied to the P-value. Additionally, the P-values of all the non-chosen bandits (i.e., again, the unchosen bandit of the current pair, and all the bandits from the not-shown pairs) were forgotten as follows:
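A minimal sketch, mirroring the Q-value forgetting rule:

$$P_{other} \leftarrow (1 - f_P) \cdot P_{other}$$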
We modelled choices using a softmax decision rule, representing the probability that the participant chooses a given bandit over the alternative:
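A minimal sketch of this rule, assuming no separate inverse-temperature parameter for the CA models (the unbounded CA and PERS parameters can absorb the scale of choice stochasticity):

$$P(\text{choose } b_1) = \frac{e^{\,Q(b_1) + P(b_1)}}{e^{\,Q(b_1) + P(b_1)} + e^{\,Q(b_2) + P(b_2)}}$$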
Bayesian Models
We also formulated a Bayesian model corresponding to a normative, ideal, belief-updating strategy. In this model, beliefs about each bandit were represented by a density distribution g(p) over p, the probability that the bandit provides a true reward (see full derivation in SI 4.1). During learning, following reward-feedback, the distribution for the chosen bandit was updated based on the agent’s feedback (F) and its associated credibility (C):
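A minimal sketch of this update, assuming the agent reports the true outcome with probability C:

$$g(p \mid F, C) \propto g(p) \cdot P(F \mid p, C),$$

where $P(F{=}\mathrm{reward} \mid p, C) = C\,p + (1 - C)(1 - p)$ and $P(F{=}\mathrm{nonreward} \mid p, C) = C\,(1 - p) + (1 - C)\,p$.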
At the beginning of each block priors for each bandit were initialized to uniform distributions (g(p)=U[0,1]). In the instructed-credibility Bayesian model, we fixed the credibilities to their true values (i.e., 0.5, 0.75 and 1).
We also formulated a free-credibility Bayesian model, where we fixed only the 3-star agent’s credibility to 1 and estimated the credibilities of the two lying agents as free parameters. This model allowed for the possibility that participants follow a Bayesian strategy while relying on distorted versions of the instructed credibilities.
For both versions, we modelled choice using a softmax function with a free inverse temperature parameter (β):
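A minimal sketch of this choice rule:

$$P(\text{choose } b_1) = \frac{e^{\,\beta Q(b_1)}}{e^{\,\beta Q(b_1)} + e^{\,\beta Q(b_2)}}$$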
where Q(bandit) is the expected probability that the bandit provides a true reward.
Parameter optimization, model selection and synthetic model simulations
For each participant, we estimated the free parameter values that maximized the summed log-likelihood of the observed choices across all games. Trials where participants showed a response time below 150 ms were excluded from the log-likelihood calculations. To minimise the chances of finding local minima, we ran the fitting procedure 10 times for each participant, using random initializations for the parameters (CA~U[-10,10], PERS~U[-5,5], fQ~U[0,1], fP~U[0,1], β~U[0,30], C~U[0,1]). Our Truth-CA model showed poorer convergence, so for this model we ran the fitting procedure 100 times per participant.
We performed model comparison between Bayesian and CA models using the parametric bootstrap cross-fitting method (PBCM) (84,85). In brief, this method relies on generating, for each participant, synthetic datasets (we used 201) based on the maximal likelihood parameters of each model variant (i.e., the Bayesian model and the CA model), and fitting each dataset with the two models. We then calculated the log-likelihood difference between the two fits for each dataset, obtaining two log-likelihood difference distributions, one for each generative model. We determined the log-likelihood difference threshold that leads to the best model-classification (i.e., maximizing the proportion of true positives and true negatives). Finally, we fit the empirical data from each participant with the two model variants, calculating an empirical log-likelihood difference. A comparison of this empirical likelihood difference to the classification threshold determines which model provides a better fit for a participant’s data (see Fig. S6 for more information). We used this procedure to compare our Bayesian models (instructed-credibility and free-credibility Bayesian) with a simplified version of the credibility-CA model that did not include perseveration (PERS, fP = 0).
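For illustration, a minimal Matlab sketch of the threshold-selection and classification step, assuming vectors dA and dB (hypothetical names) hold the log-likelihood differences for datasets generated by models A and B, and llA_emp and llB_emp hold the empirical fits:

% dA(i): LL(fit of model A) - LL(fit of model B) on the i-th dataset generated by A
% dB(i): the same difference on the i-th dataset generated by B
thresholds = sort([dA(:); dB(:)]);            % candidate classification thresholds
acc = arrayfun(@(t) (mean(dA > t) + mean(dB <= t)) / 2, thresholds);
[~, idx] = max(acc);                          % threshold maximizing correct classification
tStar = thresholds(idx);
empDiff = llA_emp - llB_emp;                  % empirical log-likelihood difference
if empDiff > tStar
    disp('Model A provides the better fit for this participant');
else
    disp('Model B provides the better fit for this participant');
end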
We also performed model-comparisons for nested CA models using generalized likelihood-ratio tests, where the null distribution for rejecting a nested model (in favour of a nesting model) was based on a bootstrapping method (BGLRT) (44,86).
To assess the mechanistic predictions of each model, we generated synthetic simulations based on the maximal likelihood parameters of participants. Unless stated otherwise, we generated 5 simulations per participant (1020 total simulations), each with a new sequence of trials generated as in the actual data. We analysed these data in the same way as the empirical data, after pooling the 5 simulated data sets per participant.
Parameter recovery
For each model of interest, we generated 201 synthetic simulations based on parameters sampled from uniform distributions (CA~U[-10,10], PERS~U[-5,5], fQ~U[0,1], fP~U[0,1], β~U[0,30], C~U[0,1]). We fitted each simulated dataset with its generative model and calculated the Spearman’s correlation between the generative and fitted parameters.
Mixed effects models
Model-agnostic analysis of agent-credibility effects on choice-repetition
We used a mixed-effects binomial regression model to assess whether, and how, value-learning was modulated by agent-credibility, with participants serving as random effects. The regressed variable REPEAT indicated whether the current trial repeated the choice from the previous trial featuring the same bandit-pair (repeated choice=1, non-repeated choice=0) and was regressed on the following regressors: FEEDBACK coded whether feedback received in the previous trial with the same bandit pair was positive or negative (coded as 0.5 and -0.5, respectively); BETTER coded whether the bandit chosen in that previous trial was the better (mostly rewarding) or the worse (mostly unrewarding) bandit within the pair (coded as 0.5 and -0.5, respectively); AGENT2-star indicated whether feedback received in the previous trial (featuring the same bandit pair) came from the 2-star agent (previous feedback from 2-star agent=1, otherwise=0); and AGENT3-star indicated whether the feedback in the previous trial came from the 3-star agent. The model in Wilkinson’s notation was:
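One plausible form, reconstructed from the regressor definitions above (the random-effects structure is an assumption):

REPEAT ~ 1 + FEEDBACK * BETTER * (AGENT2star + AGENT3star) + (1 | participant)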
In Figures 2a and 2b, we plot the choice-repeat probability based on feedback-valence and agent-credibility from the preceding trial with the same bandit pair. We independently calculated the repeat probability for the better (mostly rewarding) and worse (mostly non-rewarding) bandits and averaged across them. This calculation was done at the participant level, and finally averaged across participants.
Model-agnostic analysis of contextual credibility effects on choice-repetition
We used a different mixed-effects binomial regression model to test whether value learning from the 3-star agent was modulated by contextual credibility. We focused this analysis on instances where the previous trial with the same bandit pair featured the 3-star agent. We regressed the variable REPEAT, which indicated whether the current trial repeated the choice from the previous trial featuring the same bandit-pair (repeated choice=1, non-repeated choice=0). We included the following regressors: FEEDBACK coding the valence of feedback in the previous trial with the same bandit pair (positive=0.5, negative=-0.5), CONTEXT2-star indicating whether the trial immediately preceding the previous trial with the same bandit pair (context trial) featured the 2-star agent (feedback from 2-star agent=1, otherwise=0), and CONTEXT3-star indicating whether the trial immediately preceding the previous trial with the same bandit pair featured the 3-star agent. We included in this analysis only current trials where the context trial featured a different bandit pair. The model in Wilkinson’s notation was:
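One plausible form, under the same assumptions as above:

REPEAT ~ 1 + FEEDBACK * (CONTEXT2star + CONTEXT3star) + (1 | participant)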
We originally included another regressor (BETTER) coding whether the bandit chosen in that previous trial was the better (mostly rewarding) or the worse (mostly unrewarding) bandit within the pair. Since we did not find any significant interactions between BETTER and the other regressors, we omitted it from the model formulation.
In Figure 4c, we independently calculated the repeat probability difference for the better (mostly rewarding) and worse (mostly non-rewarding) bandits and averaged across them. This calculation was done at the participant level, and finally averaged across participants.
Effects of agent-credibility on CA parameters from credibility-CA model
We used a mixed-effects linear regression model to assess whether, and how, credit assignment was modulated by feedback-agent, with participants serving as random effects (data from Fig. 2c). We regressed the maximal likelihood CA parameters from the credibility-CA model. The regressors AGENT2-star and AGENT3-star indicated, respectively, whether the CA parameter was attributed to the 2-star or the 3-star agent. The model in Wilkinson’s notation was:
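One plausible form (random-effects structure assumed):

CA ~ 1 + AGENT2star + AGENT3star + (1 | participant)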
Effects of agent-credibility and feedback valence on CA parameters from credibility-valence-CA model
We used a second mixed-effects linear regression model to test for a valence bias in learning, and how such a bias was modulated by feedback credibility, again with participants serving as random effects (data from Fig. 3a). The maximal likelihood CA parameters from the credibility-valence-CA model served as the regressed variable, regressed on: AGENT2-star and AGENT3-star (defined as in the previous model), and VALENCE, coding whether the CA parameter was attributed to positive (coded as 0.5) or negative (coded as -0.5) feedback. The model in Wilkinson’s notation was:
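One plausible form (random-effects structure assumed):

CA ~ 1 + (AGENT2star + AGENT3star) * VALENCE + (1 | participant)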
Effects of agent-credibility and feedback truthfulness on CA parameters from truth-CA model
Finally, we used another mixed-effects linear regression model to test whether the average levels of credit participants assigned to chosen bandits varied between objectively true and false feedback (data from Figures 4b and 4c). We regressed the maximal likelihood CA parameters for the 1-star and 2-star agents from the truth-CA model on the regressors: CREDIBILITY, coding whether the CA came from the 1-star or the 2-star agent (coded as -0.5 and 0.5, respectively); and TRUTH, coding whether the CA parameter was attributed to trials where the agents told the truth (coded as 0.5) or lied (coded as -0.5). The model in Wilkinson’s notation was:
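One plausible form (random-effects structure assumed):

CA ~ 1 + CREDIBILITY * TRUTH + (1 | participant)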
We fitted these mixed-effects models using the fitglme function in Matlab. Follow-up analyses were based on testing contrasts from these models.
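For illustration, a minimal sketch of such a call (the table and variable names are hypothetical):

% tbl: one row per observation, with columns named after the regressors above
mdl = fitglme(tbl, ...
    'REPEAT ~ 1 + FEEDBACK*BETTER*(AGENT2star + AGENT3star) + (1 | participant)', ...
    'Distribution', 'Binomial', 'Link', 'logit');
disp(mdl.Coefficients)   % fixed-effects estimates, standard errors and p-values

For the linear models over CA parameters, the analogous call uses 'Distribution', 'Normal' with the corresponding formula.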
Bayesian estimation of posterior belief that feedback is true
We calculated the Bayesian posterior conditional probability of feedback truthfulness (Fig. 4a and 4b) as follows. First, we calculated the probability of each true outcome r (0: non-reward; 1: reward) conditional on the feedback f (0: non-reward; 1: reward), the credibility of the agent reporting the feedback (C), and the history of experiences from past trials (H):
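A minimal sketch of this computation, assuming the agent reports the true outcome with probability C, so that P(f | r, C) = C if f = r and 1 − C otherwise, with P(r | H) derived from the Bayesian posterior over the chosen bandit’s reward probability:

$$P(r \mid f, C, H) \propto P(f \mid r, C) \cdot P(r \mid H)$$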
where the proportionality omits terms independent of r.
Next, we normalized the two terms (for r=0,1) to sum to 1 (to correct for the proportionality above). Finally, the posterior belief in truthfulness was taken as P(r=f | f,C,H).
In Fig. 4b, we calculated for each participant the mean posterior belief of truthfulness separately for trials where each agent told the truth or lied, and we compared these mean beliefs between the two kinds of trials using paired t-tests (one test per agent).
Code and data availability
All code and data used to generate the results and figures in this paper will be made available on GitHub upon publication.
Acknowledgements
We thank Bastien Blain, Lucie Charles and Stephano Palminteri for helpful discussions. We thank Nira Liberman, Keiji Ota, Nitzan Shahar, Konstantinos Tsetsos and Tali Sharot for providing feedback on earlier versions of the manuscript. We additionally thank the members of the Max Planck UCL Centre for Computational Psychiatry and Ageing Research for insightful discussions. The Max Planck UCL Centre is a joint initiative supported by UCL and the Max Planck Society.
J.V.P. is a pre-doctoral fellow of the International Max Planck Research School on Computational Methods in Psychiatry and Ageing Research (IMPRS COMP2PSYCH). We acknowledge funding from the Max Planck research school to J.V.P. (577749-D-CON 186534), and funding from the Max Planck Society to R.J.D. (549771-D.CON 177814). The project that gave rise to these results received the support of a fellowship from “la Caixa” Foundation (ID 100010434), with the fellowship code LCF/BQ/EU21/11890109.
J.V.P. contributed to the study design, data collection, data coding, data analyses, and writing of the manuscript. R.M. contributed to the study design, data analyses, and writing of the manuscript. R.J.D. contributed to the writing of the manuscript.
References
- 1. Global Risks Report. https://www.weforum.org/publications/global-risks-report-2024/
- 2. Vaccine hesitancy and (fake) news: Quasi-experimental evidence from Italy. Health Econ 28:1377–82
- 3. The impact of fake news on social media and its influence on health during the COVID-19 pandemic: a systematic review. J Public Health 31:1007–16
- 4. Why Japan’s HPV vaccine rates dropped from 70% to near zero. https://www.vox.com/science-and-health/2017/12/1/16723912/japan-hpv-vaccine
- 5. “Everything I Disagree With is #FakeNews”: Correlating Political Polarization and Spread of Misinformation. arXiv. https://doi.org/10.48550/arXiv.1706.05924
- 6. Fake news: the effects of social media disinformation on domestic terrorism. Dyn Asymmetric Confl 15:55–77
- 7. Sociological perspectives of social media, rumors, and attacks on minorities: Evidence from Bangladesh. Front Sociol 8:1067726
- 8. BBC News. https://www.bbc.com/news/blogs-trending-38156985
- 9. The Relationship Between Social Media Use and Beliefs in Conspiracy Theories and Misinformation. Polit Behav 45:781–804
- 10. Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Sci Adv 5:eaau4586
- 11. The spreading of misinformation online. Proc Natl Acad Sci 113:554–9
- 12. The spread of low-credibility content by social bots. Nat Commun 9:4787
- 13. Misinformation warnings: Twitter’s soft moderation effects on COVID-19 vaccine belief echoes. Comput Secur 114:102577
- 14. How to unring the bell: A meta-analytic approach to correction of misinformation. Commun Monogr 85:423–41
- 15. Changing the Incentive Structure of Social Media Platforms to Halt the Spread of Misinformation. PsyArXiv. https://doi.org/10.31234/osf.io/26j8w
- 16. Fake news game confers psychological resistance against online misinformation. Palgrave Commun 5:1–10
- 17. The efficacy of interventions in reducing belief in conspiracy theories: A systematic review. PLOS One 18:e0280902
- 18. The spread of true and false news online. Science 359:1146–51
- 19. A Confirmation Bias View on Social Media Induced Polarisation During Covid-19. Inf Syst Front 26:417–41. https://doi.org/10.1007/s10796-021-10222-9
- 20. Misinformation and biases infect social media, both intentionally and accidentally. The Conversation. http://theconversation.com/misinformation-and-biases-infect-social-media-both-intentionally-and-accidentally-97148
- 21. The Implied Truth Effect: Attaching Warnings to a Subset of Fake News Headlines Increases Perceived Accuracy of Headlines Without Warnings. Manag Sci 66:4944–57
- 22. Processing political misinformation: comprehending the Trump phenomenon. R Soc Open Sci 4:160802
- 23. Learning the value of information in an uncertain world. Nat Neurosci 10:1214–21
- 24. An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment. J Neurosci 30:12366–78
- 25. Scaling prediction errors to reward variability benefits error-driven learning in humans. J Neurophysiol 114:1628
- 26. Independent Neural Computation of Value from Other People’s Confidence. J Neurosci 37:673–84
- 27. Social Information Is Integrated into Value and Confidence Judgments According to Its Reliability. J Neurosci 37:6066–74
- 28. The neural underpinnings of an optimal exploitation of social information under uncertainty. Soc Cogn Affect Neurosci 9:1746–53
- 29. Computational models for the combination of advice and individual learning. Cogn Sci 33:206–42
- 30. Integrating Incomplete Information With Imperfect Advice. Top Cogn Sci 11:299–315
- 31. Exposure to misleading and unreliable information reduces active information-seeking. PsyArXiv. https://doi.org/10.31234/osf.io/4zkxw
- 32. The optimism bias. Curr Biol 21:R941–5
- 33. Forming Beliefs: Why Valence Matters. Trends Cogn Sci 20:25–33
- 34. How unrealistic optimism is maintained in the face of reality. Nat Neurosci 14:1475–9
- 35. The neuroscience of motivated cognition. Trends Cogn Sci 19:62–4
- 36. Reinforcement Learning: An Introduction. The MIT Press
- 37. Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing. PLOS Comput Biol 13:e1005684
- 38. Behavioural and neural characterization of optimistic reinforcement learning. Nat Hum Behav 1:1–9
- 39. The computational roots of positivity and confirmation biases in reinforcement learning. Trends Cogn Sci 26:607–21
- 40. A computational reward learning account of social media engagement. Nat Commun 12:1311
- 41. How social learning amplifies moral outrage expression in online social networks. Sci Adv 7:eabe5641
- 42. Ten simple rules for the computational modeling of behavioral data. eLife 8:e49547. https://doi.org/10.7554/eLife.49547
- 43. Numeracy, gist, literal thinking and the value of nothing in decision making. Nat Rev Psychol 2:421–39
- 44. Human subjects exploit a cognitive map for credit assignment. Proc Natl Acad Sci 118:e2016884118
- 45. Humans Use Directed and Random Exploration to Solve the Explore–Exploit Dilemma. J Exp Psychol Gen 143:2074–81
- 46. Reinforcement learning in the brain. J Math Psychol 53:139–54
- 47. Intrinsic Valuation of Information in Decision Making under Uncertainty. PLOS Comput Biol 12:e1005020
- 48. Neural circuitry of information seeking. Curr Opin Behav Sci 35:62–70
- 49. Normative evidence accumulation in unpredictable environments. eLife
- 50. A bias–variance trade-off governs individual differences in on-line learning in an unpredictable environment. Nat Hum Behav 2:213–24
- 51. The Misestimation of Uncertainty in Affective Disorders. Trends Cogn Sci 23:865–75
- 52. The human as delta-rule learner. Decision 7:55–66
- 53. Truth-Default Theory (TDT): A Theory of Human Deception and Deception Detection. J Lang Soc Psychol 33:378–92
- 54. Reinstated episodic context guides sampling-based decisions for reward. Nat Neurosci 20:997–1003
- 55. Reminders of past choices bias decisions for reward in humans. Nat Commun 8:15958
- 56. Cognitive Load and Social Media Advertising. J Interact Advert 23:33–54
- 57. Quantifying Information Overload in Social Media and Its Impact on Social Contagions. Proc Int AAAI Conf Web Soc Media 8:170–9
- 58. Personality and perspicacity: Role of personality traits and cognitive ability in political misinformation discernment and sharing behavior. Personal Individ Differ 196:111747
- 59. A Nudge to Credible Information as a Countermeasure to Misinformation: Evidence from Twitter. Inf Syst Res. https://pubsonline.informs.org/doi/full/10.1287/isre.2021.0491
- 60. Fact-Checking: A Meta-Analysis of What Works and for Whom. Polit Commun 37:350–75
- 61. Information about action outcomes differentially affects learning from self-determined versus imposed choices. Nat Hum Behav 4:1067–79
- 62. Peers Versus Pros: Confirmation Bias in Selective Exposure to User-Generated Versus Professional Media Messages and Its Consequences. Mass Commun Soc 23:510–36
- 63. Social Networks, Confirmation Bias and Shock Elections. https://www.repository.cam.ac.uk/handle/1810/315203
- 64. The roots of polarization in the individual reward system. Proc R Soc B Biol Sci 291:20232011
- 65. “I was Right about Vaccination”: Confirmation Bias and Health Literacy in Online Health Information Seeking. J Health Commun 24:129–40
- 66. Confirmation bias and vaccine-related beliefs in the time of COVID-19. J Public Health 45:523–8
- 67. Overcoming Confirmation Bias in Misinformation Correction: Effects of Processing Motive and Jargon on Climate Change Policy Support. Sci Commun :10755470241229452
- 68. How People Update Beliefs about Climate Change: Good News and Bad News. Cornell Law Rev 102:1
- 69. Confirmation Bias and the Persistence of Misinformation on Climate Change. Commun Res 49:500–23
- 70. Boomerang Effects in Science Communication: How Motivated Reasoning and Identity Cues Amplify Opinion Polarization About Climate Mitigation Policies. Commun Res 39:701–23
- 71. A brain network supporting social influences in human decision-making. Sci Adv 6:eabb4159
- 72. Neural mechanisms of observational learning. Proc Natl Acad Sci U S A 107:14431–6
- 73. A Neuro-computational Account of Arbitration between Choice Imitation and Goal Emulation during Human Observational Learning. Neuron 106:687–699
- 74. Associative learning of social value. Nature 456:245–9
- 75. Echo chambers online?: Politically motivated selective exposure among Internet news users. J Comput-Mediat Commun 14:265–85
- 76. Echo chambers, filter bubbles, and polarisation: a literature review. Reuters Institute for the Study of Journalism. https://ora.ox.ac.uk/objects/uuid:6e357e97-7b16-450a-a827-a92c93729a08
- 77. Digital Technologies and Selective Exposure: How Choice and Filter Bubbles Shape News Media Exposure. Int J Press 24:465–86
- 78. Algorithm-mediated social learning in online social networks. Trends Cogn Sci 27:947–60
- 79. Exposure to ideologically diverse news and opinion on Facebook. Science 348:1130–2
- 80. Gorilla in our midst: An online behavioral experiment builder. Behav Res Methods 52:388–407
- 81. The Obsessive-Compulsive Inventory: Development and validation of a short version. Psychol Assess 14:485–96
- 82. The revised Green et al. Paranoid Thoughts Scale (R-GPTS): psychometric properties, severity ranges, and clinical cutoffs. Psychol Med 51:244–53
- 83. Dogmatic behavior among students: testing a new measure of dogmatism. J Soc Psychol 142:713–21
- 84. Assessing model mimicry using the parametric bootstrap. J Math Psychol 48:28–50
- 85. Retrospective model-based inference guides model-free credit assignment. Nat Commun 10:750
- 86. Old processes, new perspectives: Familiarity is correlated with (not independent of) recollection and is more (not equally) variable for targets than for lures. Cognit Psychol 79:40–67
Article and author information
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.106073. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Vidal-Perez et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.