Introduction

Goal-directed behavior requires organisms to continually update predictions about the world to select actions in the light of new information. In environments that include discontinuities (changepoints) and noise (probabilistic errors), optimal learning requires increased weighting of surprising information during periods of change and ignoring surprising events during periods of stability. A burgeoning literature suggests that humans are able to calibrate learning rates according to the statistical content of new information (Behrens et al., 2007; Cook et al., 2019; Diederen et al., 2016; Nassar et al., 2019; Nassar et al., 2010; Razmi & Nassar, 2022), albeit to varying degrees (Kirschner et al., 2022; Kirschner et al., 2023; Nassar et al., 2016; Nassar et al., 2012; Nassar et al., 2021).

Although the exact neural mechanisms guiding dynamic learning adjustments are unclear, several neuro-computational models have been put forward to characterize adaptive learning. While these models differ in their precise computational mechanisms, they share the hypothesis that catecholamines play a critical role in adjusting the degree to which we use new information over time. For example, one class of models assumes that striatal dopaminergic prediction errors act as a teaching signal in cortico–striatal circuits to learn task structure and rules (Badre & Frank, 2012; Collins & Frank, 2013; Collins & Frank, 2016; Lieder et al., 2018; Pasupathy & Miller, 2005; Schultz et al., 1997). Another line of research highlights the role of dopamine in tracking reward history with multiple learning rates (Doya, 2002; Kolling et al., 2016; Meder et al., 2017; Schweighofer & Doya, 2003). This integration of reward history over multiple time scales enables people to estimate trends in the environment from past and recent experiences and to adjust actions accordingly (Wilson et al., 2013). Within the broader cognitive control literature, it has been suggested that dopamine in the prefrontal cortex and basal ganglia is involved in modulating computational tradeoffs such as the cognitive stability–flexibility balance (Cools, 2008; Dreisbach et al., 2005; Floresco, 2013; Goschke, 2013; Goschke & Bolte, 2014; Goschke & Bolte, 2018). In particular, it has been proposed that dopamine plays a crucial role in the regulation of meta-control parameters that facilitate dynamic switching between complementary control modes (i.e., shielding goals from distracting information vs. switching goals in response to significant changes in the environment) (Goschke, 2013; Goschke & Bolte, 2014; Goschke & Bolte, 2018). Finally, other theories highlight the importance of the locus coeruleus/norepinephrine system in facilitating adaptive learning and structure learning (Razmi & Nassar, 2022; Silvetti et al., 2018; Yu et al., 2021). Consistent with these neuro-computational models, catecholaminergic drugs are known to affect cognitive performance, including probabilistic reversal learning (Cook et al., 2019; Dodds et al., 2008; Repantis et al., 2010; Rostami Kandroodi et al., 2021; van den Bosch et al., 2022; Westbrook et al., 2020). Indeed, psychostimulants such as methamphetamine, which increase extracellular catecholamine availability, can enhance cognition (Arria et al., 2017; Husain & Mehta, 2011; Smith & Farah, 2011) and are used to remediate cognitive deficits in attention deficit hyperactivity disorder (ADHD) (Arnsten & Pliszka, 2011; Prince, 2008). However, these cognitive enhancements vary across tasks and individuals (Bowman et al., 2023; Cook et al., 2019; Cools & D’Esposito, 2011; Garrett et al., 2015; Rostami Kandroodi et al., 2021; van den Bosch et al., 2022; van der Schaaf et al., 2013), and the mechanisms underlying this variability remain poorly understood.

There is evidence that the effects of catecholaminergic drugs depend on an individual’s baseline dopamine levels in the prefrontal cortex (PFC) and striatum (Cohen & Servan-Schreiber, 1992; Cools & D’Esposito, 2011; Dodds et al., 2008; Durstewitz & Seamans, 2008; Rostami Kandroodi et al., 2021; van den Bosch et al., 2022). Depending on individual baseline dopamine levels, the administration of catecholaminergic drugs can promote states of cognitive flexibility or stability. For example, pushing dopamine from low to optimal (medium) levels may increase update thresholds in the light of new information (i.e., facilitating shielding/stability), whereas pushing dopamine either too high or too low may decrease update thresholds (i.e., facilitating shifting/flexibility) (Durstewitz & Seamans, 2008; Goschke & Bolte, 2018).

Here, we argue that baseline performance should be considered when studying the behavioral effects of catecholaminergic drugs. To investigate the role of baseline performance in drug challenge studies, it is important to control for several factors. First, the order of drug and placebo sessions must be balanced to control for practice effects (Bartels et al., 2010; Garrett et al., 2015; MacRae et al., 1988; Servan-Schreiber et al., 1998). Second, it is desirable to obtain an independent measure of baseline performance that is not confounded with the drug vs. placebo comparison. Thus, participants may be stratified based on their performance in an independent session.

In the present study, we examined the effects of methamphetamine, a stimulant that increases monoaminergic transmission, on probabilistic reversal learning dynamics in a within-subject, double-blind, randomized design. The effects of the drug on a reversal learning task were assessed in relation to participants’ baseline level of performance, which was determined during an initial drug-free session. Participants then completed the task during two sessions after receiving placebo (PL) and 20 mg of methamphetamine (MA; order counterbalanced).

The task used to study adaptive learning dynamics was a reversal variant of an established probabilistic learning task (Fischer & Ullsperger, 2013; Jocham et al., 2014; Kirschner et al., 2022; Kirschner et al., 2023). On each trial, subjects chose either to gamble or to avoid gambling on a probabilistic outcome, in response to a stimulus presented in the middle of the screen (see Figure 2A). A gamble could result in a gain or loss of 10 points, depending on the reward contingency associated with that stimulus. By choosing not to gamble, subjects avoided losing or winning points, but they were still informed what would have happened had they gambled. The reward contingency changed every 30-35 trials. By learning which symbols to choose and which to avoid, participants could maximize their total points. A novel feature of this modified version of the task is that we introduced different levels of noise (probability) to the reward contingencies. Here, reward probabilities could be less predictable (30% or 70%) or more certain (20% or 80%). This manipulation allowed us to study the effect of MA on the dynamic balancing of updating and shielding beliefs about reward contingencies under different levels of noise in the task environment. To estimate learning rate adjustments, we fit a nested set of reinforcement learning models that allowed for trial-by-trial learning rate adjustments.

We found that MA improved participants’ performance in the task, but this effect was driven mainly by a greater improvement in those participants who performed poorly during the baseline session. Modeling results suggested that MA helps performance by adaptively shifting the relative weighting of surprising outcomes based on their statistical context. Specifically, MA facilitated down-weighting of probabilistic errors in stages of less predictable reward contingencies. Together, these results reveal novel insights into the role of catecholamines in adaptive learning behavior and highlight the importance of considering individual differences at baseline.

Results

Ninety-seven healthy young adults completed the probabilistic learning task (Figure 2) (Fischer & Ullsperger, 2013; Jocham et al., 2014; Kirschner et al., 2022; Kirschner et al., 2023) in three separate sessions: an initial drug-free session, and sessions after PL and MA. The study followed a double-blind crossover design, whereby 50% of participants received MA first and 50% received PL first. Table 1 shows the demographic characteristics of the participants grouped by their task performance during the baseline session. The groups did not differ significantly on any of the variables measured. In a first analysis, we checked for general practice effects across the three task completions based on the total points earned in the task. We found a strong practice effect (F(2,186) = 14.53, p < .001), with better performance in sessions two and three compared to session one (baseline). There was no difference in total scores between sessions two and three (see Figure 2B). These results suggest that the baseline session may have minimized order effects between the MA and PL sessions (see also results and discussion below). The key findings detailed below are summarized in a schematic figure presented in the discussion section (Figure 7).

Demographics and drug use characteristics of study participants (n = 94)

Subjective drug effects

MA administration significantly increased ‘feel drug effect’ ratings compared to PL, at 30, 50, 135, 180, and 210 min post-capsule administration (see Figure 1; Drug x Time interaction F(5,555) = 38.46, p < 0.001).

Subjective drug effects post-capsule administration.

MA administration significantly increased ‘feel drug effect’ ratings compared to placebo. The scale for ratings of feeling a drug effect ranges from 0 to 100. The vertical black line indicates the time at which the task was completed. Asterisks indicate a significant on-/off-drug difference.

Methamphetamine improved performance in a modified probabilistic reversal learning task only in participants who performed the task poorly at baseline.

(A) Schematic of the learning task. Each trial began with a random jitter of 300-500 ms. Thereafter, a fixation cross was presented together with two response options (choose – green tick mark; avoid – red no-parking sign). After the fixation cross, the stimulus was shown centrally until the participant responded, or for a maximum duration of 2000 ms. Participants’ choices were then confirmed by a white rectangle surrounding the chosen option for 500 ms. Finally, the outcome was presented for 750 ms. If subjects chose to gamble on the presented stimulus, they received either a green smiling face and a reward of 10 points or a red frowning face and a loss of 10 points. When subjects avoided a symbol, they received the same feedback but in a slightly paler color, and the points that could have been received were crossed out to indicate that the feedback was fictive and had no effect on the total score. A novel feature of this modified version of the task is that we introduced different levels of noise (probability) to the reward contingencies. Here, reward probabilities could be less predictable (30% or 70%), more certain (20% or 80%), or random (50%). (B) Total points earned in the task, split by session (baseline, drug sessions 1 and 2) and drug condition (PL vs. MA). Results show practice effects but no differences between the two drug sessions (baseline vs. drug session 1: 595.85 (39.81) vs. 708.62 (36.93); t(93) = –4.21, p = 5.95e-05, d = 0.30; baseline vs. drug session 2: 595.85 (39.81) vs. 730.00 (38.53); t(93) = –4.77, p = 6.66e-06, d = 0.35; session 1 vs. session 2: t(93) = –0.85, p = 0.399, d = 0.05). Dashed gray indicates no significant difference on/off drug (Δ∼35 points). (C) Interestingly, when we stratified drug effects by baseline performance (using a median split on total points at baseline), we found a trend towards better performance under MA in the low baseline performance group (n=47, p = .07). (D) Overall performance in drug sessions 1 and 2 stratified by baseline performance. Here, baseline performance appears not to affect performance in drug session 1 or 2. Note. IQR = interquartile range; PL = placebo; MA = methamphetamine.

Drug effects on overall performance and RT

In general, participants learned the task well, based on the observation that their choice behavior largely followed the underlying reward probabilities of the stimuli across the sessions (see Figure 4D-F). When all subjects were considered together, we did not find a performance benefit under MA as quantified by the total points scored in the task (MA: 736.59 (37.11) vs. PL: 702.02 (38.305); t(93) = 1.38, p = 0.17, d = 0.10). When participants were stratified by their baseline performance (median split on total points at baseline), we found a marginally significant Drug x Baseline Performance Group interaction (F(1,92) = 3.20, p = 0.07; see Figure 2C and Figure 7A). Post hoc t tests revealed that, compared to PL, MA marginally improved performance in participants with poor baseline performance (total points MA: 522.55 (53.79) vs. PL: 443.61 (47.81); t(46) = 1.85, p = 0.071, d = 0.23). MA did not, however, improve performance in the high baseline performance group (total points MA: 950.63 (26.15) vs. PL: 960.42 (27.26); t(46) = –0.38, p = 0.698, d = 0.05). Control analyses confirmed that these effects were not driven by session-order effects (see also the section on session control analyses below). Results showed no effect of Session (F(1,92) = 0.71, p = 0.40) and no Session x Baseline Performance Group interaction (F(1,92) = 0.59, p = 0.44; see Figure 2D). There was a trend for slightly faster RTs under MA (PL: 544.67 ms (9.87) vs. MA: 533.84 ms (11.51); t(93) = 1.75, p = 0.08, d = 0.10). This speed effect appeared to be independent of baseline performance (Drug x Baseline Performance Group interaction: F(1,92) = 0.45, p = 0.50). Moreover, MA was associated with reduced RT variability (average individual SD of RTs: PL: 193.74 (6.44) vs. MA: 178.98 (5.47); t(93) = 2.54, p = 0.012, d = 0.25). Reduced RT variability has previously been associated with increased attention and performance (Esterman et al., 2012; Karamacoska et al., 2018). A two-way ANOVA on RT variability revealed an effect of baseline performance (F(1,92) = 4.52, p = 0.03), with increased RT variability in low baseline performers across the drug sessions (low baseline performance: 197.27 (6.48) vs. high baseline performance: 175.45 (5.29)). Moreover, there was an effect of Drug (F(1,92) = 6.87, p = 0.01) and a Drug x Baseline Performance Group interaction (F(1,92) = 6.97, p = 0.009). Post hoc t tests indicated that the MA-related reduction in RT variability was specific to low baseline performers (PL: 212.07 (9.84) vs. MA: 182.46 (7.98); t(46) = 3.04, p = 0.003, d = 0.48), whereas MA did not affect high baseline performers’ RT variability (PL: 175.40 (7.51) vs. MA: 175.50 (7.55); t(46) = –0.02, p = 0.98, d < 0.01).

Methamphetamine improves learning performance when reward contingencies are less predictable

Next, to better understand how MA affects learning dynamics, we investigated the probability of correct choice (i.e., choosing advantageous stimuli and avoiding disadvantageous stimuli) across successive reversals. As shown in Figure 3, the drug did not affect initial learning. However, the drug improved performance later in learning, particularly for stimuli with less predictable reward probabilities (see Figure 3B) and in subjects with low baseline performance. To quantify this observation, we first applied the Bai-Perron multiple break point test (see Methods) to find systematic breaks in the learning curves, allowing us to divide learning into early and late stages. We applied the test to the reversal learning data across subjects. One break point was identified at 10 trials after a reversal (indexed by the vertical lines in Figure 3). We did not find drug differences when considering all reversals (PL: 0.84 (0.01) vs. MA: 0.85 (0.01); t(93) = –1.14, p = 0.25, d = 0.07) or reversals to stimuli with high reward probability certainty (PL: 0.86 (0.01) vs. MA: 0.87 (0.01); t(93) = –0.25, p = 0.80, d = 0.02). Interestingly, we found a trend for increased learning under MA for stimuli with less predictable rewards (PL: 0.80 (0.01) vs. MA: 0.82 (0.01); t(93) = –1.80, p = 0.07, d = 0.14). A two-way ANOVA on the averaged probability of correct choice during the late stage of learning revealed a Drug x Baseline Performance Group interaction (F(1,92) = 4.85, p = 0.03; see Figure 7B). Post hoc t tests revealed that subjects who performed poorly at baseline benefited from MA (average accuracy late learning PL: 0.69 (0.02) vs. MA: 0.74 (0.02); t(46) = –2.59, p = 0.01, d = 0.32), whereas there was no difference between MA and PL in the high baseline performance group (PL: 0.91 (0.01) vs. MA: 0.91 (0.01); t(46) = 0.29, p = 0.77, d = 0.04). We did not find other differences in reversal learning (all p > 0.1). In control analyses, we split the learning curves into other possible learning situations in the task (i.e., acquisition, first reversal learning, etc.). No drug-related effects emerged here (see Supplementary Figure 1).
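
For illustration, the logic of the break point analysis can be sketched in MATLAB as a single-break search that minimizes the residual sum of squares of a two-mean (early/late) fit to the group learning curve. This is a simplification of the Bai-Perron procedure, which allows multiple breaks and provides formal tests; pCorrect is an assumed subjects-by-trials accuracy matrix, not a variable from the authors’ code.

    % curve: group-average probability of correct choice, trials 1..T after a reversal
    curve = mean(pCorrect, 1);
    T   = numel(curve);
    sse = inf(1, T);                        % residual sum of squares per candidate break
    for k = 3:T-3                           % require a few trials in each segment
        sse(k) = sum((curve(1:k)   - mean(curve(1:k))).^2) + ...
                 sum((curve(k+1:T) - mean(curve(k+1:T))).^2);
    end
    [~, breakTrial] = min(sse);             % in the data, the break fell 10 trials after a reversal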

Learning curves after reversals suggest that methamphetamine improves learning performance in phases of less predictable reward contingencies in low baseline performers.

The top panel shows learning curves after all reversals (A), reversals to stimuli with less predictable reward contingencies (B), and reversals to stimuli with high reward probability certainty (C). The bottom panel displays the learning curves stratified by baseline performance for all reversals (D), reversals to stimuli with less predictable reward probabilities (E), and reversals to stimuli with high reward probability certainty (F). Vertical black lines divide learning into early and late stages, as suggested by the Bai-Perron multiple break point test. Results suggest no clear differences in initial learning between MA and PL. However, learning curves diverged later in learning, particularly for stimuli with less predictable rewards (B) and in subjects with low baseline performance (E). Note. PL = placebo; MA = methamphetamine; Mean/SEM = line/shading.

Computational modeling results reveal that methamphetamine affects the model parameter controlling dynamic adjustments of learning rate.

(A) Model comparison. Bayesian model selection was performed using –0.5*BIC as a proxy for model evidence (Stephan et al., 2009). The best-fitting mixture model assigned proportions to each model based on the frequency with which they provided the “best” fit to the observed participant data (mixture proportion; blue bars) and estimated the probability with which the true population mixture proportion for a given model exceeded that of all others (exceedance probability; black bars). The hybrid model with learning rate modulation by feedback confirmation (model 3) provided the best fit for the majority of participants and had an exceedance probability near one in our model set. (B-C) Comparison of parameter estimates from the winning model on/off drug. Stars indicate a significant difference for the respective parameter. Results suggest that only eta, the parameter controlling dynamic adjustments of learning rate according to recent prediction errors, was affected by our pharmacological manipulation. (D-F) Modeled and observed choice behavior of the participants in the task, stretched out for all stimuli. Note that in the task the different animal stimuli were presented in an intermixed and randomized fashion, but this visualization shows that participants’ choices followed the reward probabilities of the stimuli. Data plots are smoothed with a running average (+/− 2 trials). Ground truth corresponds to the reward probability of the respective stimuli (good: 70/80%; neutral: 50%; bad: 20/30%). Dashed black lines represent 95% confidence intervals derived from 1000 simulated agents with parameters best fit to participants in each group. Model predictions appear to capture the transitions in choice behavior well. Mean/SEM = line/shading. Note. IQR = interquartile range; PL = placebo; MA = methamphetamine.

Computational modeling results

To gain a better mechanistic understanding of the trial-to-trial learning dynamics, we constructed a nested model set built from RL models (see Methods) that included the following features: (1) a temperature parameter of the softmax function used to convert trial expected values to action probabilities (β), (2) a play bias term that indicates a tendency to attribute higher value to gambling behavior, and (3) an intercept term for the effect of learning rate on choice behavior. Additional parameters controlled trial-by-trial modulations of the learning rate, including feedback confirmation (confirmatory feedback was defined as factual wins and counterfactual losses; disconfirmatory feedback was defined as factual losses and counterfactual wins), feedback modality (factual vs. counterfactual), and weighting of the learning rate as a function of the absolute value of the previous prediction error (the parameter eta, determining the influence of surprise about the outcome on learning; Li et al., 2011). The winning model (as measured by lowest BIC, achieving a protected exceedance probability of 100%) was one that allowed the learning rate to vary based on whether the feedback was confirmatory and on the level of surprise of the outcome (see Figure 4A). Sufficiency of the model was evaluated through posterior predictive checks that matched behavioral choice data (see Figure 4D-F) and model validation analyses (see Supplementary Figure 2). We did not find evidence for differences in model fit between the drug conditions (avg. BIC PL: 596.77 (21.63) vs. MA: 599.66 (19.85); t(93) = –0.25, p = 0.80, d = 0.01).
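
To make the surprise-weighted learning-rate mechanism concrete, the following MATLAB sketch illustrates a hybrid update in the spirit of Li et al. (2011). It is a minimal illustration, not the exact implementation of the winning model: the confirmatory-feedback and intercept terms are omitted, and the variable names (outcomes, kappa, assoc) and the outcome scaling are our assumptions.

    % Hybrid (Pearce-Hall-style) value update for a single stimulus.
    % outcomes: assumed vector of experienced outcomes in {+10, -10}.
    eta   = 0.30;   % weight of recent surprise on the learning rate (the drug-sensitive parameter)
    kappa = 0.40;   % fixed learning-rate scale
    Q     = 0;      % expected value of the stimulus
    assoc = 1;      % associability: running, normalized surprise trace
    for t = 1:numel(outcomes)
        delta = outcomes(t) - Q;                          % reward prediction error
        Q     = Q + kappa * assoc * delta;                % effective learning rate = kappa * assoc
        assoc = eta * abs(delta)/20 + (1 - eta) * assoc;  % |PE| scaled to [0,1] by the outcome range
    end

With a high eta, a single surprising outcome sharply raises the learning rate on subsequent trials; with a low eta, the learning rate stays close to its baseline, which is advantageous late in a block, when surprises mostly reflect probabilistic noise.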

Next, we compared MA’s effects on the best-fitting parameters of the winning model (see Figure 4B-C). We found that eta (the parameter controlling dynamic adjustments of learning rate according to recent absolute prediction errors) was reduced under MA (eta MA: 0.24 (0.01) vs. PL: 0.30 (0.01); t(93) = –3.005, p = 0.003, d = 0.43). When we stratified drug effects by baseline performance, we found a marginally significant Drug x Baseline Performance Group interaction (F(1,92) = 3.09, p = 0.08; see Figure 7C). Post hoc t tests revealed that, compared to PL, MA affected eta depending on baseline performance in the task. Subjects who performed less well at baseline showed smaller etas (MA: 0.24 (0.01) vs. PL: 0.33 (0.02); t(46) = –3.06, p = 0.003, d = 0.67), whereas there was no difference between MA and PL in the high baseline performance group (MA: 0.23 (0.01) vs. PL: 0.26 (0.01); t(46) = –1.03, p = 0.31, d = 0.18). We did not find drug-related differences in any other model parameters (all p > 0.1).

Methamphetamine affects learning rate dynamics

Next, we investigated how the model parameters fit with trial-by-trial modulations of the learning rate. Learning rates in our best-fitting model were dynamic and affected by both model parameters and their interaction with feedback. Learning rate trajectories after reversals are depicted in Figure 5. As suggested by lower eta scores, MA appears to be associated with reduced learning rate dynamics in low baseline performers. In contrast, low baseline performers in the PL condition exhibited greater variability in learning rate (and a higher average learning rate throughout), rendering their choices more erratic. Consistent with this, on many trials their choices were driven by the most recent feedback, as their learning rates on a large subset of trials in later learning stages (on average 9 out of 11; Figure 5H) were greater than 0.5. Specifically, variability in learning rate (average individual SD of learning rate) was reduced under MA in both early and late stages of learning across all reversals (early PL: 0.20 (0.01) vs. MA: 0.17 (0.01); t(93) = 2.72, p = 0.007, d = 0.36; late PL: 0.18 (0.01) vs. MA: 0.15 (0.01); t(93) = 2.51, p = 0.01, d = 0.33), as well as for reversals to stimuli with less predictable rewards (early PL: 0.19 (0.01) vs. MA: 0.16 (0.01); t(93) = 2.98, p = 0.003, d = 0.39; late PL: 0.18 (0.01) vs. MA: 0.16 (0.01); t(93) = 2.66, p = 0.009, d = 0.35). Reversals to stimuli with high outcome certainty were also associated with decreased learning rate variability after MA administration (early PL: 0.18 (0.01) vs. MA: 0.15 (0.01); t(93) = 2.57, p = 0.01, d = 0.34; late PL: 0.18 (0.01) vs. MA: 0.15 (0.01); t(93) = 2.63, p = 0.009, d = 0.35). Two-way ANOVAs revealed that this effect depended on baseline performance across all reversals (Drug x Baseline Performance: F(1,92) = 3.47, p = 0.06), reversals to stimuli with less predictable rewards (Drug x Baseline Performance: F(1,92) = 4.97, p = 0.02), and stimuli with high outcome certainty (Drug x Baseline Performance: F(1,92) = 5.26, p = 0.03). Reduced variability under MA was observed in low baseline performers (all p < .006, all d > .51) but not in high baseline performers (all p > .1). Together, this pattern of results suggests that people with high baseline performance show a large difference between learning rates after true reversals and during the rest of the task, including after misleading feedback. Specifically, they show a peak in learning after reversals and reduced learning rates in later periods of a learning block, when choice preferences should ideally be stabilized (see Figure 5C). This results in a better signal-to-noise ratio (SNR) between real reversals and misleading feedback (i.e., surprising outcomes in the late learning stage). In low baseline performers, the SNR was improved after the administration of MA. This effect was particularly visible in stages of the task where rewards were less predictable. To quantify the SNR for less predictable reward contingencies in low baseline performers, we computed the difference between learning rate peaks on true reversals (signal) and learning rate peaks after probabilistic feedback later in learning (noise; SNR = signal − noise). The results of this analysis revealed that MA significantly increased the SNR for low baseline performers (PL: 0.01 (0.01) vs. MA: 0.04 (0.01); t(46) = –2.81, p = 0.007, d = 0.49). Moreover, in low baseline performers, learning rates in later stages of learning, when choice preferences should ideally have stabilized, were higher under PL than under MA (avg. learning rate during late learning for less predictable rewards: PL: 0.48 (0.01) vs. MA: 0.42 (0.01); t(46) = 3.36, p = 0.001, d = 0.56).
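
The SNR measure can be expressed in a few lines of MATLAB (a schematic sketch; lr, isReversal, and isMisleadLate are assumed variables holding the model-derived learning rates and trial labels):

    % lr: [nSubjects x nTrials] model-derived learning rates
    % isReversal: logical index of true reversal trials
    % isMisleadLate: logical index of misleading-feedback trials late in learning
    signal = mean(lr(:, isReversal), 2);      % learning-rate response at true change points
    noise  = mean(lr(:, isMisleadLate), 2);   % learning-rate response after probabilistic errors
    SNR    = signal - noise;                  % per-subject SNR, as defined in the text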

Methamphetamine boosts signal-to-noise ratio between real reversals and misleading feedback in late learning stages.

Learning rate trajectories after reversals derived from the computational model. The first column depicts learning rates across all subjects for all reversals (A), reversals to stimuli with high reward probability certainty (D), and reversals to stimuli with noisy outcomes (G). The middle and right columns show learning rate trajectories for subjects stratified by baseline performance (B, E, H – low baseline performance; C, F, I – high baseline performance). Results suggest that people with high baseline performance show a large difference between learning rates after true reversals and during the rest of the task, including after misleading feedback. Specifically, they show a peak in learning after reversals and reduced learning rates in later periods of a learning block, when choice preferences should ideally be stabilized (C). This results in a better signal-to-noise ratio (SNR) between real reversals and misleading feedback (i.e., surprising outcomes in the late learning stage). In low baseline performers, the SNR was improved after the administration of MA. This effect was particularly visible in stages of the task where rewards were less predictable (H). The bottom row (J) shows the association between receiving misleading feedback later in learning (i.e., rewards or losses that do not align with a stimulus’ underlying reward probability) and the probability of making the correct choice during the next encounter with the same stimulus. Results indicate a negative correlation between the probability of a correct choice after double-misleading feedback and eta (scatter plot on the right): the probability of a correct choice after double-misleading feedback decreases with increasing eta. There was a trend (p = .06) for subjects under MA to be more likely to make the correct choice after two misleading feedbacks compared to PL (middle plot). This effect appeared to depend on baseline performance, whereby only subjects with low baseline performance seemed to benefit from MA (p = 0.02; right plot). Note. IQR = interquartile range; PL = placebo; MA = methamphetamine; MFB = misleading feedback.

Thus far, our results suggest that (1) MA improved performance in subjects who performed poorly at baseline, and (2) MA reduced learning rate variability in subjects with low baseline performance (driven by significantly lower eta parameter estimates, which improved the SNR between true reversals and misleading feedback, particularly for less predictable rewards). Next, we aimed to test how these differences relate to each other. Given that eta causes increased learning after surprising feedback, and that we found the biggest drug differences in later stages of learning for stimuli with less predictable rewards, we tested the association between eta and the probability of making the correct choice after two consecutive probabilistic errors (wins for bad stimuli and losses for good stimuli; in total this happened 8 times in the late learning stage for stimuli with 30/70% reward probability). We found a significant correlation across participants (see Figure 5J), whereby higher eta scores were associated with fewer correct choices (r = .29, p < .001). There was a trend toward a drug effect, with subjects in the MA condition being more likely to make the correct choice after two misleading feedbacks (PL: 0.82 (0.02) vs. MA: 0.84 (0.01); t(93) = –1.92, p = 0.06, d = 0.13). A two-way ANOVA revealed that this effect depended on baseline performance (Drug x Baseline Performance: F(1,92) = 4.27, p = 0.04). Post hoc t tests indicated higher correct choice probabilities under MA in low baseline performers (PL: 0.70 (0.02) vs. MA: 0.75 (0.02); t(46) = –2.41, p = 0.02, d = 0.30) but not in high baseline performers (PL: 0.92 (0.01) vs. MA: 0.92 (0.01); t(46) = 0.11, p = 0.91, d = 0.01).
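
The following sketch shows how such trials can be scored (illustrative only; the data fields stim, late, misleading, and correct are our assumptions, not the authors’ code):

    % Count late-learning trials on which the same stimulus received two
    % consecutive misleading outcomes, and score the next choice of that stimulus.
    nCorrect = 0;  nDouble = 0;
    for s = reshape(unique(data.stim), 1, [])
        idx = find(data.stim == s & data.late);          % late-stage trials of stimulus s
        for i = 2:numel(idx)-1
            if data.misleading(idx(i-1)) && data.misleading(idx(i))
                nDouble  = nDouble + 1;
                nCorrect = nCorrect + data.correct(idx(i+1));
            end
        end
    end
    pCorrectAfterDoubleMFB = nCorrect / nDouble;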

Methamphetamine shifts learning rate dynamics closer to the optimum for low baseline performers

To better understand the computational mechanism through which MA improved performance in low baseline performers, we first examined how performance in the task related to the model parameters from our fits. To do so, we regressed task performance onto an explanatory matrix containing model parameter estimates across all conditions (see Figure 6A). The results of this analysis revealed that variability in several of the parameters was related to overall task performance: the overall learning rate, feedback confirmation LR adjustments, and inverse temperature all positively predicted performance, whereas eta and the play bias term negatively predicted it.
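
In MATLAB, this regression amounts to the following (a minimal sketch; paramEstimates and totalPoints are assumed per-subject arrays):

    % paramEstimates: [nSubjects x nParams] fitted parameters (e.g., learning-rate
    % intercept, confirmatory-feedback adjustment, inverse temperature, play bias, eta)
    % totalPoints: [nSubjects x 1] points scored in the task
    X   = zscore(paramEstimates);   % standardize so coefficients are comparable
    mdl = fitlm(X, totalPoints);    % OLS regression of performance on parameters
    disp(mdl.Coefficients)          % coefficient signs correspond to Figure 6A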

Changes in learning rate adjustment explain drug induced performance benefits in low baseline performers.

(A) Regression coefficients and 95% confidence intervals (points and lines; sorted by value) stipulating the contribution of each model parameter estimate to participants’ overall task performance (i.e., points scored in the task). Play bias and eta (the parameter governing the influence of surprise on learning rate) both made a significant negative contribution to overall task performance, whereas inverse temperature and learning rates were positively related to performance. (B) Differences in parameter values for on- and off-drug sessions, as quantified by regression coefficients and 95% confidence intervals, plotted separately for high (red) and low (yellow) baseline performers. Note that the drug predominantly affected the eta parameter and did so to a greater extent in low baseline performers. (C) Eta estimates on-drug (y-axis) plotted against eta estimates off-drug (x-axis) for high baseline performers (yellow points) and low baseline performers (red points). Note that a majority of subjects showed a reduction in eta on-drug vs. off-drug (67.02%). This effect was more pronounced in low baseline performers (low baseline performers: 74.47%; high baseline performers: 59.57%). (D) To better understand how changes in eta might have affected overall performance, we conducted a set of simulations using the parameters best fit to human subjects, except that we equipped the model with a range of randomly chosen eta values to examine how altering that parameter might affect performance (n=1000 agents). The results revealed that simulated agents with low to intermediate levels of eta achieved the best task performance, with models equipped with the highest etas performing particularly poorly. To illustrate how this relationship between eta and performance could have driven improved performance for some participants under the methamphetamine condition, we highlight four participants with low-moderate eta values under methamphetamine, but who differ dramatically in their eta values in the placebo condition (D, inset). (E) To test whether the simulations correspond to actual performance differences across conditions, we calculated the predicted improvement for each participant based on their eta in each condition, using a polynomial function that best described the relationship between simulated eta values and scored points (red line in D; fitted with MATLAB’s polyfit.m function; f(x) = –2.35e+03*x^4 + 5.64e+03*x^3 – 4.71e+03*x^2 + 1.29e+03*x + 692.08). We found that actual performance differences were positively correlated with the predicted ones (high baseline performers: Pearson’s rho(47) = .31, p = .03; low baseline performers: Spearman’s rho(47) = .34, p = .02). These results indicate that the individuals who showed the greatest task benefit from methamphetamine were those who underwent the most advantageous adjustments of eta in response to it. Note that we used rank-order statistics for low baseline performers because the distribution is skewed by an outlier (upper left corner). PL = placebo; MA = methamphetamine.

While each of these parameters explained unique variance in overall performance levels, only eta, the parameter controlling dynamic adjustments of learning rate according to recent prediction errors, was affected by our pharmacological manipulation (Figure 6B). In particular, eta was reduced in the MA condition, specifically in the low baseline group, albeit to an extent that differed across individuals (Figure 6C). To better understand how changes in eta might have affected overall performance, we conducted a set of simulations using the parameters best fit to human subjects, except that we equipped the model with a range of randomly chosen eta values to examine how altering that parameter might affect performance. The results revealed that simulated agents with low to intermediate levels of eta achieved the best task performance, with models equipped with the highest etas performing particularly poorly (Figure 6D). To illustrate how this relationship between eta and performance could have driven improved performance for some participants under the methamphetamine condition, we highlight four participants with low-moderate eta values under methamphetamine, but who differ dramatically in their eta values in the placebo condition (Figure 6D, inset). Note that the participants who had the largest decreases in eta under methamphetamine, resulting from the highest placebo levels of eta, would be expected to have the largest improvements in performance. To test whether these simulations correspond to actual performance differences across conditions, we calculated the predicted improvement for each participant based on their eta in each condition, using the function in Figure 6D. We found that actual performance differences were positively correlated with the predicted ones (Figure 6E), indicating that the individuals who showed the greatest task benefit from methamphetamine were those who underwent the most advantageous adjustments of eta in response to it. This result was specific to eta: taking a similar approach to explaining conditional performance differences in terms of the other model parameters, including those that were quite strongly related to performance (Figure 6A), yielded negative results (all p > .1; see Supplementary Figure S3). It is noteworthy that low baseline performers tended to have particularly high values of eta under the baseline condition (low baseline performers: 0.33 (0.02) vs. high baseline performers: 0.25 (0.01); t(46) = 2.59, p = 0.01, d = 0.53), explaining why these individuals saw the largest improvements under the methamphetamine condition. Taken together, these results suggest that MA alters performance by changing the degree to which learning rates are adjusted according to recent prediction errors (eta), in particular by reducing the strength of such adjustments in low baseline performers to push them closer to task-specific optimal values.
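
The simulation-based prediction described above can be sketched as follows. Here, simulateTask stands in for the generative task simulation with a given eta (all other parameters fixed at their fitted values) and is hypothetical, as are etaPL, etaMA, pointsPL, and pointsMA.

    % 1) Simulate agents with random eta values and fit a 4th-order polynomial
    %    relating eta to points scored (cf. Figure 6D).
    nAgents = 1000;
    etas = rand(nAgents, 1);
    pts  = zeros(nAgents, 1);
    for i = 1:nAgents
        pts(i) = simulateTask(etas(i));   % hypothetical simulation function
    end
    p = polyfit(etas, pts, 4);            % polynomial of the kind reported in the Figure 6 caption

    % 2) Predict each participant's drug-related gain from their fitted etas
    %    and compare it with the observed change in points.
    predGain   = polyval(p, etaMA) - polyval(p, etaPL);
    actualGain = pointsMA - pointsPL;
    [rho, pval] = corr(predGain, actualGain, 'Type', 'Spearman');  % rank-order, as used for low performers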

While eta seemed to account for the differences in the effects of MA on performance in our low and high performance groups, it did not fully account for performance differences across the two groups (see Figure 2C and Figure 7A/B). When comparing other model parameters between low and high baseline performers across drug sessions, we found that high baseline performers displayed higher overall inverse temperatures (2.97 (0.05) vs. 2.11 (0.08); t(93) = 7.94, p < .001, d = 1.33). This suggests that high baseline performers displayed a higher transfer of stimulus values to actions, leading to better performance (as also indicated by the positive contribution of this parameter to overall performance in the GLM). Moreover, they tended to show a reduced play bias (–0.01 (0.01) vs. 0.04 (0.03); t(93) = –1.77, p = 0.08, d = 0.26) and increased intercepts in their learning rate term (–2.38 (0.364) vs. –6.48 (0.70); t(93) = 5.03, p < .001, d = 0.76). Both of these parameters have been associated with overall performance (see Figure 6A). Thus, the overall performance difference between high and low baseline performers can be attributed to differences in model parameters other than eta. However, as described in the previous paragraph, the differential effects of MA on performance in the two groups were driven by eta.

Summary of key findings.

Mean (SEM) scores on three measures of task performance after PL and MA, in participants stratified into low or high baseline performance. (A) There was a trend toward a drug effect, with boosted task performance (total points scored in the task) in low baseline performers (subjects were stratified via a median split on baseline performance) after methamphetamine (20 mg) administration. (B) Follow-up analyses revealed that on-drug performance benefits were mainly driven by significantly better choices (i.e., choosing advantageous stimuli and avoiding disadvantageous stimuli) at later stages after reversals for less predictable reward contingencies (30/70% reward probability). (C) To understand the computational mechanism through which methamphetamine improved performance in low baseline performers, we investigated how task performance related to the model parameters from our fits. Our results suggest that methamphetamine alters performance by changing the degree to which learning rates are adjusted according to recent prediction errors (eta), in particular by reducing the strength of such adjustments in low baseline performers to push them closer to task-specific optimal values.

Control analyses

To control for the potentially confounding factor of session order (i.e., PL first vs. MA first), we repeated the two-way mixed ANOVAs that had shown significant Drug x Baseline Performance Group interactions, adding session order as a between-subjects factor. Including session order did not alter the significance of the observed effects and did not interact with the effects of interest (all p > .24).

Discussion

To study learning dynamics, participants completed a reversal variant of an established probabilistic learning task (Fischer & Ullsperger, 2013; Jocham et al., 2014; Kirschner et al., 2022; Kirschner et al., 2023). Participants completed the task three times: in a baseline session without drug, and after PL and oral MA (20 mg) administration. We observed a trend towards a drug effect on overall performance, with improved task performance (total points scored in the task) selectively in low baseline performers. Follow-up analyses revealed that MA performance benefits were mainly driven by significantly better choices (i.e., choosing advantageous stimuli and avoiding disadvantageous stimuli) at later stages after reversals for less predictable reward contingencies. Modeling results suggest that MA helped performance by adaptively shifting the relative weighting of surprising outcomes based on their statistical context. Specifically, MA facilitated down-weighting of probabilistic errors in phases of less predictable reward contingencies. In other words, in low baseline performers the SNR between true reversals and misleading feedback was improved after the administration of MA. Our results advance the existing literature, which has to date overlooked baseline performance effects. Moreover, although the existing literature has linked catecholamines to volatility-based learning rate adjustments (Cook et al., 2019), we show that these adjustments extend to other context-dependent factors, such as the level of probabilistic noise. The key findings of this study are summarized in Figure 7.

Methamphetamine affects the relative weighting of reward prediction errors

A key finding of the current study is that MA affected the relative weighting of reward prediction errors. In our model, adjustments in learning rate are afforded by weighting the learning rate as a function of the absolute value of the previous prediction error (Li et al., 2011). This associability-gated learning mechanism is empirically well supported (Le Pelley, 2004) and facilitates decreasing learning rates in periods of stability and increasing learning rates in periods of change. MA was associated with lower weighting of prediction errors (quantified by lower eta parameters under MA). Our results comprise an important next step in understanding the neurochemical underpinnings of learning rate adjustments.

Neuro-computational models suggest that catecholamines play a critical role in adjusting the degree to which we use new information. One class of models highlights the role of striatal dopaminergic prediction errors as a teaching signal in cortico–striatal circuits for learning task structure and rules (Badre & Frank, 2012; Collins & Frank, 2013; Collins & Frank, 2016; Lieder et al., 2018; Pasupathy & Miller, 2005; Schultz et al., 1997). The implication of such models is that learning the structure of a task results in appropriate adjustments in learning rates. Optimal learning in our task, with a high level of noise in reward probabilities combined with changing reward contingencies, required increased learning from surprising events during periods of change (reversals) and reduced learning from probabilistic errors. Thus, neither too-low learning adjustments after surprising outcomes (low eta) nor too-high learning adjustments after surprising outcomes (high eta) are beneficial given our task structure. Interestingly, MA appears to shift eta closer to the optimum. In terms of the neurobiological implementation of this effect, MA may prolong the impact of phasic dopamine signals, which in turn facilitates better learning of the task structure and learning rate adjustments (Cook et al., 2019; Marshall et al., 2016; Volkow et al., 2002). Our data are, in broad strokes, consistent with the idea that dopamine in the prefrontal cortex and basal ganglia is involved in modulating meta-control parameters that facilitate dynamic switching between complementary control modes (i.e., shielding goals from distracting information vs. shifting goals in response to significant changes in the environment) (Cools, 2008; Dreisbach et al., 2005; Floresco, 2013; Goschke, 2013; Goschke & Bolte, 2014; Goschke & Bolte, 2018). A key challenge in our task is differentiating real reward reversals from probabilistic misleading feedback, which is a clear instance of the shielding/shifting dilemma described in the meta-control literature. Our data suggest that MA might improve meta-control of when to shield and when to shift beliefs in low baseline performers.

Moreover, it is possible that MA’s effect on learning rate adjustments is driven by its influence on the noradrenaline system. Indeed, a line of research highlights the importance of the locus coeruleus/norepinephrine system in facilitating adaptive learning and structure learning (Razmi & Nassar, 2022; Silvetti et al., 2018; Yu et al., 2021). In particular, evidence from experimental studies, together with pharmacological manipulations and lesion studies of the noradrenergic system, suggests that noradrenaline is important for change detection (Muller et al., 2019; Nassar et al., 2012; Preuschoff et al., 2011; Set et al., 2014). Thus, the administration of MA may have increased participants’ synaptic noradrenaline levels and thereby increased their sensitivity to salient events indicating true change points in the task.

It should be noted that other neuromodulators, such as acetylcholine (Marshall et al., 2016; Yu & Dayan, 2005) and serotonin (Grossman et al., 2022; Iigaya et al., 2018), have also been associated with dynamic learning rate adjustment. Future studies should compare the effects of neuromodulator-specific drugs (for example, a dopaminergic, a noradrenergic, a cholinergic, and a serotonergic modulator) to support neuromodulator-specific claims (see, for example, Marshall et al., 2016). Taken together, it is likely that the MA effects on learning rate adjustments in our study were driven by multiple processes that perhaps also work in concert. Moreover, because we administered only a single pharmacological agent, our results could reflect general effects of neuromodulation.

Our results are in line with recent studies showing that methylphenidate (MPH) improves performance by making learning more robust against misleading information. For example, Fallon et al. (2017) showed that MPH helped participants to ignore irrelevant information but impaired their ability to flexibly update items held in working memory. Another study showed that MPH improved performance by adaptively reducing the effective learning rate in participants with higher working memory capacity (Rostami Kandroodi et al., 2021). These studies highlight the complex effects of MPH on working memory and the role of working memory in reinforcement learning (Collins & Frank, 2012; Collins & Frank, 2018). It could be that the effect of MA on learning rate dynamics reflects a modulation of interactions between working memory and reinforcement learning strategies. However, it should be acknowledged that our task was not designed to parse out the specific contributions of the reinforcement learning system and working memory to performance.

Methamphetamine selectively boosts performance in participants with poor initial task performance

Another key finding of the current study is that the benefits of MA on performance depend on baseline task performance. Specifically, we found that MA selectively improved performance in participants who performed poorly in the baseline session. It is important to note that MA did not bring the performance of low baseline performers up to the level of high baseline performers. We speculate that high performers gained a good representation of the task structure during the orientation session, taking specific features of the task into account (change point probabilities, noise in the reward probabilities). This is reflected in a large signal-to-noise ratio between real reversals and misleading feedback. Because the high performers already performed the task at a near-optimal level, MA may not have been able to further enhance performance.

These results have several interesting implications. First, a novel aspect of our design is that, in contrast to most pharmacological studies, participants completed the task during a baseline session before they took part in the two drug sessions. Drug order and practice effects are typical nuisance regressors in pharmacological imaging research. Yet, although practice effects are well acknowledged in the broader neuromodulator and cognitive literature (Bartels et al., 2010; MacRae et al., 1988; Servan-Schreiber et al., 1998), our understanding of these effects is limited. One of the few studies that reported on drug administration order effects showed that d-amphetamine (AMPH)-driven increases in functional-MRI-based blood oxygen level-dependent (BOLD) signal variability (SDBOLD) and performance depended greatly on drug administration order (Garrett et al., 2015). In that study, only older subjects who received AMPH first improved in performance and SDBOLD. Based on research in rats demonstrating that dopamine release increases linearly with reward-based lever press practice (Owesson-White et al., 2008), the authors speculated that practice may have shifted participants along an inverted-U-shaped dopamine performance curve (Cools & D’Esposito, 2011) by increasing baseline dopamine release (Garrett et al., 2015). Interestingly, we did not see a modulation of the MA effects by drug session order (PL first vs. MA first). Thus, the inclusion of an orientation session might be a good strategy to control for practice and drug order effects.

Our results also illustrate the large interindividual variability of MA effects. Recently, a large pharmacological fMRI/PET study (n=100) presented strong evidence that interindividual differences in striatal dopamine synthesis capacity explain variability in the effects of methylphenidate on reversal learning (van den Bosch et al., 2022). The authors demonstrated that methylphenidate improved reversal learning performance to a greater degree in participants with higher dopamine synthesis capacity, thus establishing the baseline-dependency principle for methylphenidate. These results are in line with previous research showing that methylphenidate improved reversal learning to a greater degree in participants with higher baseline working memory capacity, an index that is commonly used as an indirect proxy of dopamine synthesis capacity (Rostami Kandroodi et al., 2021; van der Schaaf et al., 2013; van der Schaaf et al., 2014). In the current study, we did not collect information related to working memory capacity. However, our finding that initial task performance strongly modulated the effect of MA fits this pattern of results, showing that individual baseline differences strongly influence drug effects and thus should be considered in pharmacological studies (Cools & D’Esposito, 2011; Durstewitz & Seamans, 2008; van den Bosch et al., 2022). Indeed, there is evidence from the broader literature on the effects of psychostimulants on cognitive performance suggesting that stimulants improve performance only in low performers (Ilieva et al., 2013). Consistent with this, there is evidence in rats that poor baseline performance was associated with a greater response to amphetamine and increased performance in a signal detection task (Turner et al., 2017).

Conclusion

The current data provide evidence that, relative to placebo, methamphetamine facilitates the ability to dynamically adjust learning from prediction errors. This effect was greater in those participants who performed poorly at baseline. These results advance the existing literature by presenting evidence for a causal link between catecholaminergic modulation and learning flexibility, and they further highlight a baseline-dependency principle for catecholaminergic modulation.

Materials and methods

Design

The results presented here were obtained from the first two sessions of a larger four-session study (clinicaltrials.gov ID number NCT04642820). During the two 4-h laboratory sessions, healthy adults ingested capsules containing methamphetamine (20 mg; MA) or placebo (PL), in mixed order under double-blind conditions. One hour after ingesting the capsule they completed the 30-min reinforcement reversal learning task. The primary comparisons were on acquisition and reversal learning parameters of reinforcement learning after MA vs PL. Secondary measures included subjective and cardiovascular responses to the drug.

Subjects

Healthy men and women aged 18-35 years were recruited with flyers and on-line advertisements. Initial eligibility was ascertained in a telephone interview (age, current drug use, medical conditions), and appropriate candidates attended an in-person interview with a physical examination, EKG and a structured clinical psychiatric interview (First et al., 2015). Inclusion criteria were a high school education, fluency in English, body mass index between 19 and 26, and good physical and mental health. Exclusion criteria were serious psychiatric disorders (e.g., psychosis, severe PTSD, depression, history of Substance Use Disorder), any regular prescription medication, history of cardiac disease, high blood pressure, consuming >4 alcoholic or caffeinated beverages a day, or working night shifts. A total of 113 healthy young adults took part in the study. We excluded four subjects because of excessive misses on at least one session. Grubbs’ test for outlier detection with a one-sided alpha of 0.001 identified a cut-off of > 40 missed trials.

Orientation session

Participants attended an initial orientation session to provide informed consent and to complete personality questionnaires. They were told that the purpose of the study was to investigate the effects of psychoactive drugs on mood, brain, and behavior. To reduce expectancies, they were told that they might receive a placebo, stimulant, or sedative/tranquilizer. They agreed not to use any drugs except their normal amounts of caffeine for 24 hours before and 6 hours following each session. Women who were not on oral contraceptives were tested only during the follicular phase (1-12 days from menstruation) because responses to stimulant drugs are dampened during the luteal phase of the cycle (White et al., 2002). Most participants (N=97 out of 113) completed the reinforcement learning task during the orientation session as a baseline measurement. This measure was added after the study began. Participants who did not complete the baseline measurement were omitted from the analyses presented in the main text. We also ran the key analyses on the full sample (n=109), which included participants who completed the task only during the drug sessions. When controlling for session order and number (two vs. three sessions) effects, we found no drug effect on overall performance and learning. However, eta was also reduced under MA in the full sample, which likewise resulted in reduced variability in the learning rate (see supplementary results for more details).

Drug sessions

The two drug sessions were conducted in a comfortable laboratory environment, from 9 am to 1 pm, at least 72 hours apart. Upon arrival, participants provided breath and urine samples to test for recent alcohol or drug use and pregnancy (Alcosensor III, Intoximeters; AimStick PBD, hCG professional, Craig Medical Distribution; CLIAwaived Inc., Carlsbad, CA). Positive tests led to rescheduling or dismissal from the study. After drug testing, subjects completed baseline mood measures, and heart rate and blood pressure were measured. At 9:30 am they ingested capsules (PL or MA 20 mg, in color-coded capsules) under double-blind conditions. Oral MA (Desoxyn, 5 mg per tablet) was placed in opaque size 00 capsules with dextrose filler. PL capsules contained only dextrose. Subjects completed the reinforcement learning task 60 minutes after capsule ingestion. Drug effects questionnaires were administered at multiple intervals during the session. Participants also completed four other cognitive tasks not reported here. They were tested individually and were permitted to relax, read, or watch neutral movies when they were not completing study measures.

Dependent measures

Reinforcement Learning Task

Participants performed a reversal variant of an established probabilistic learning task (Fischer & Ullsperger, 2013; Jocham et al., 2014; Kirschner et al., 2022; Kirschner et al., 2023). On each trial, participants were presented with one of three different stimuli and decided either to gamble or to avoid gambling on that stimulus, with the goal of maximizing the total reward (see Figure 1A). A gamble resulted in winning or losing points, depending on the reward contingencies associated with the particular stimulus. If participants decided not to gamble, they avoided any consequences but could still observe what would have happened had they gambled, via counterfactual feedback. The three stimuli (white line drawings of animals on a black background) were presented in a pseudorandom series that was the same for all participants. The reward contingency for each stimulus could be 20%, 30%, 50%, 70%, or 80% and stayed constant within a block of 30-35 trials; after every block, the reward contingency changed without notice. The experiment consisted of 7 blocks per stimulus, yielding 18 reversals and 714 trials in total. Presentation 22.0 (Neurobehavioral Systems) was used for task presentation. Every trial began with a central fixation cross, presented for a variable duration between 300 and 500 ms. After fixation, the stimulus was presented together with the two choice alternatives (a green checkmark for gambling and a red no-go sign for avoiding, sides counterbalanced across subjects) for a maximum of 2000 ms or until a response was given. If participants failed to respond in time, a question mark was shown and the trial was repeated at the end of the block. When a response was made, the stimulus stayed on screen and feedback was given after 500 ms. The outcome was then presented for 750 ms, depending on the subject's choice. Choosing to gamble led to either a green smiley face and a reward of 10 points or a red frowning face and a loss of 10 points, according to the reward probability of the stimulus. An avoided gamble had no monetary consequences: the outcome was always 0. Counterfactual/fictive outcomes, indicating what would have happened had the participant chosen to gamble, were shown using the same smileys in a paler color, with the reward or punishment crossed out to indicate that the outcome was fictive.
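
To make the trial logic concrete, the following MATLAB sketch implements the outcome rules described above; all variable names are illustrative assumptions, not identifiers from the original task code.

```matlab
% Minimal sketch of the outcome logic for a single trial; pReward,
% gamble, and points are illustrative names (assumptions).
pReward = 0.8;              % current reward contingency of the stimulus
gamble  = true;             % participant chose to gamble on this trial
win     = rand < pReward;   % probabilistic outcome draw

if gamble
    points = 10*win - 10*(~win);  % factual feedback: +10 or -10 points
else
    points = 0;                   % avoided gamble: no monetary consequence;
                                  % the counterfactual outcome (win) is shown
                                  % crossed out and does not affect the score
end
```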

Drug Effects Questionnaire (DEQ)

The DEQ (Morean et al., 2013) consists of five questions in total. In this paper we report only ratings on the question "Do you feel any drug effect?", which was rated on a 100-mm visual analog scale. Participants completed the DEQ at regular intervals throughout the session.

Reinforcement learning model fitting

We fit variants of reinforcement learning models to participants' choice behavior using a constrained search algorithm (fmincon in MATLAB 2021b), which computed the set of parameters that maximized the total log posterior probability of choice behavior. The base model (M1) was a standard Q-learning model with three parameters: (1) a temperature parameter of the softmax function used to convert trial expected values to action probabilities, (2) a play bias term that indicates a tendency to attribute higher value to gambling, and (3) an intercept term for the learning rate, which was constant across trials in this model. On each trial, the expected value (Qt) of a stimulus (Xt) was updated according to the following formulas:

δt = Rt − Qt(Xt)

Qt+1(Xt) = Qt(Xt) + α · δt

Here, Qt(Xt) represents the expected value of gambling on stimulus Xt at trial t, α reflects the learning rate, and δt represents the prediction error, with Rt being the reward magnitude of that trial. On each trial, this value term was transformed into a "biased" value term, VB(Xt) = Bplay + Qt(Xt), where Bplay is the play bias term, and converted into action probabilities, P(play|VB(Xt)) and P(pass|VB(Xt)), using a softmax function. This was our base model (M1).
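
A minimal MATLAB sketch of one trial of this model, under stated assumptions: variable names are illustrative, beta denotes the (inverse) temperature, and because the value of passing is fixed at zero, the two-option softmax reduces to a logistic function.

```matlab
% One trial of the base model (M1); Q, alpha, Bplay, and beta are
% free parameters with illustrative names (assumptions).
VB    = Bplay + Q(x);               % biased value of gambling on stimulus x
pPlay = 1 / (1 + exp(-beta * VB));  % choice rule: P(play); P(pass) = 1 - pPlay

delta = R - Q(x);                   % prediction error after feedback R
Q(x)  = Q(x) + alpha * delta;       % Q-learning value update
```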

Next, we fit further reinforcement learning models by complementing the base model with additional parameters that controlled trial-by-trial modulations of the learning rate. Note that our base model treats the learning rate for value updates as a constant; however, previous studies have shown that people adjust their learning rate according to the volatility of the environment (Behrens et al., 2007; Nassar et al., 2010). In the Pearce-Hall hybrid model, learning rate adjustments are achieved by weighting the learning rate as a function of the absolute value of the previous prediction error (Li et al., 2011). This associability-gated learning mechanism is empirically well supported (Le Pelley, 2004) and produces decreasing learning rates in periods of stability and increasing learning rates in periods of change. Previous work has shown that the hybrid model can approximate normative learning rate adjustments (Li et al., 2011; Piray et al., 2019). In this hybrid model, the learning rate is updated as follows:

αt = κ · At

At+1 = η · |δt| + (1 − η) · At

Here, κ scales the learning rate (αt) and η determines the step size for updating the associability (At) as a function of the absolute RPE (|δt|). On each trial, the learning rate (αt) thus depends on the absolute RPE of the previous trial. Note that the initial learning rate is defined by κ, whereby κ is determined by a logistic function of a weighted predictor matrix that could include an intercept term (Pearce-Hall hybrid model, M2) and task variables that may have additionally affected trial-by-trial learning rate adjustments. In the Pearce-Hall hybrid feedback confirmatory model (M3), the predictor matrix included an intercept term and feedback confirmation information (i.e., whether the feedback on a given trial was confirmatory (factual wins and counterfactual losses) or disconfirmatory (factual losses and counterfactual wins)). Finally, in the Pearce-Hall hybrid feedback confirmatory and modality model (M4), the predictor matrix included an intercept term, feedback confirmation information, and feedback modality (factual vs. counterfactual feedback) information. The best-fitting model was determined by computing the Bayesian Information Criterion (BIC) for each model (Schwarz, 1978). Moreover, we computed protected exceedance probabilities, which give the probability that a given model is more likely than any other model in the model space (Rigoux et al., 2014). To compare participant behavior to model-predicted behavior, we simulated choice behavior using the best-fitting model (Pearce-Hall hybrid feedback confirmatory model; see Figure 3A). For each trial, we used the expected trial value (Qt(Xt)) computed above and the parameter estimates of the temperature variable as inputs to a softmax function to generate choices. Validation of model selection and parameter recovery is reported in the supplementary materials (Figure S1).
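
A minimal MATLAB sketch of the hybrid update for M2-M4, under stated assumptions: associability A starts at 1 (so the first learning rate equals κ), X is a 1-by-k predictor row (an intercept for M2, plus feedback-confirmation and modality regressors for M3/M4), and w is the corresponding k-by-1 weight vector; all names are illustrative.

```matlab
% One trial of the Pearce-Hall hybrid model (illustrative names).
kappa = 1 / (1 + exp(-(X * w)));       % logistic map of weighted predictors
alpha = kappa * A;                     % trial learning rate: alpha_t = kappa * A_t

delta = R - Q(x);                      % prediction error
Q(x)  = Q(x) + alpha * delta;          % value update with dynamic learning rate
A     = eta * abs(delta) + (1-eta)*A;  % associability update: A_{t+1}
```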

Data analysis

We analyzed drug effects on behavioral performance and model parameters using paired t tests. Given the effects of initial performance and practice in pharmacological imaging research (Garrett et al., 2015), we additionally stratified MA effects by task performance during the orientation session using a median split. These data were analyzed with a two-way repeated-measures ANOVA with the factors Drug (two levels) and Baseline Performance (two levels); paired t tests were used as post hoc tests. Moreover, we investigated reversal learning by calculating learning curves. Post hoc, we observed that drug effects on learning became apparent only in the second phase of learning. We therefore used the Bai-Perron multiple break point test (Bai & Perron, 2003) to identify the number and location of structural breaks in the learning curves. In broad strokes, the test detects whether breaks in a curve exist, and if so, how many there are, based on the regression slopes in predefined segments (here, we set the segment length to 5 trials). In our case, the test could reveal between 0 and 5 breaks (number of trials / segment length − 1). We ran this test using data from all subjects and all sessions. The test detected one break, dividing the learning curves into two segments (see results). We then calculated an index of learning performance after reversals by averaging the number of correct choices over the second learning phase. This index was then subjected to a two-way repeated-measures ANOVA with the factors Drug (two levels) and Baseline Performance (two levels).
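
As a rough illustration of the break-point logic (not the full Bai-Perron procedure, which searches over all admissible partitions), the following MATLAB sketch compares a zero-break fit of a group learning curve against single-break alternatives at 5-trial segment boundaries via BIC; curve is a hypothetical trials-by-1 vector.

```matlab
% Compare 0-break vs. 1-break piecewise-linear fits of a learning curve.
n = numel(curve); t = (1:n)'; segLen = 5;
breaks = segLen:segLen:n-segLen;        % candidate break locations

ssr = sum((curve - polyval(polyfit(t, curve, 1), t)).^2);
bic = n*log(ssr/n) + 2*log(n);          % 0-break model: one slope + intercept

for b = breaks                          % 1-break models: two separate lines
    i1 = t <= b;  i2 = ~i1;
    ssr = sum((curve(i1) - polyval(polyfit(t(i1), curve(i1), 1), t(i1))).^2) + ...
          sum((curve(i2) - polyval(polyfit(t(i2), curve(i2), 1), t(i2))).^2);
    bic(end+1) = n*log(ssr/n) + 5*log(n);   % 4 line parameters + break location
end
[~, best] = min(bic);                   % best == 1 means no break supported
```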

Data Availability Statement

All raw data and analysis scripts can be accessed at the Open Science Framework data repository: [insert after acceptance].

Acknowledgements

We thank all our participants for the generosity of their time and commitment. This research was supported by the National Institute on Drug Abuse (DA02812). HM was supported by the National Institutes of Health (T32 GM07019). MU was supported by the Deutsche Forschungsgemeinschaft (Grant/Award Number: SFB 1436) and the European Research Council (Grant/Award Number: 101018805).

Competing interests

HdW is on the Board of Directors of PharmAla Biotech, and on scientific advisory committees of Gilgamesh Pharmaceuticals and MIND Foundation. These activities are unrelated to the present study. The other authors report no competing interests.

Supplementary Information

Learning curves

The top part shows learning curves, quantified as the probability of making the correct choice (choosing advantageous stimuli and avoiding disadvantageous stimuli), stratified by orientation performance. Two-way ANOVAs with the factors Drug (two levels) and Baseline Performance (two levels) on the averaged probability of correct choice during the early and late stages of learning were used to investigate drug effects. (A) No differences in the learning curves between MA and PL emerged when considering all reversals (all p > .1). (B) There was no drug-related difference in the acquisition phase of the task (all p > .05) or (C) during the first reversal (all p > .1). In the bottom part of the figure, learning curves are defined as the probability of selecting a stimulus. (D) No drug effect emerged for reversal learning from a bad stimulus to a good stimulus (all p > .09) or (E) from good to bad stimuli (all p > .09). Moreover, there was no difference in reversal learning for neutral stimuli (F and G). Note. PL = Placebo; MA = methamphetamine.

Validation of model selection and parameter recovery.

After model fitting, each model was used to simulate data for each participant using the best-fitting parameters for that participant. Each model was then fit to each synthetic dataset, and BIC was used to determine which model provided the best fit to the synthetic data. (A) Inverse confusion matrix. The frequency with which a recovered model (abscissa, determined by lowest BIC) corresponded to a given simulation model (ordinate) is depicted in color. Recovered models correspond to the same models labeled on the ordinate, with recovered model 1 corresponding to the base model, and so on. The results of the model recovery analyses suggest that the recovered model typically corresponded to the synthetic dataset produced by that model. (B) Parameter values used to simulate data from the hybrid model with additional modulation of the learning rate by feedback confirmation (ordinate) tended to correlate (color) with the parameter values best fit to those synthetic datasets (abscissa). Recovered parameter values correspond to the labels on the ordinate, with parameter 1 reflecting the temperature parameter of the softmax function, and so on.
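
The recovery loop can be sketched in MATLAB as follows; simulateModel and fitModel are hypothetical stand-ins for the simulation and fitting routines, not functions from the study code.

```matlab
% Model-recovery sketch: simulate from each model, refit all models,
% and tally which model wins by BIC (simulateModel/fitModel hypothetical).
nModels = 4;  confMat = zeros(nModels);
for m = 1:nModels
    for s = 1:nSubjects
        synth = simulateModel(m, bestParams{m, s});            % synthetic data
        bics  = arrayfun(@(k) fitModel(k, synth), 1:nModels);  % BIC per model
        [~, rec] = min(bics);                                  % recovered model
        confMat(m, rec) = confMat(m, rec) + 1;
    end
end
confMat = confMat ./ sum(confMat, 2);  % rows: simulated; columns: recovered
```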

Relationships between model parameters not affected by the drug and task performance (measured by the total points scored in the task).

To better understand how changes in model parameters not affected by methamphetamine might have influenced overall performance, we conducted a set of simulations using the parameters best fit to human subjects, except that we equipped the model with a range of randomly chosen values of the temperature parameter of the softmax function (A), the play bias term (B), the intercept term of the learning rate (C), and the feedback confirmation term of the learning rate (D), to examine how altering these parameters might affect performance. For each model we drew 1000 values of the respective parameter from a uniform distribution spanning the fitted parameter space. The results revealed that simulated agents with higher temperature parameters achieved the best task performance (A). Moreover, agents with a play bias (B), intercept term of the learning rate (C), and feedback confirmation term of the learning rate (D) centered around zero achieved the best task performance. To test whether the simulations correspond to actual performance differences across conditions, we calculated the predicted performance difference for each participant based on their on-/off-drug parameter difference, using a polynomial function that best described the relationship between simulated parameter values and points scored (red lines, fitted with MATLAB's polyfit.m function). Results are shown next to the simulations and suggest that predicted performance differences were unrelated to actual performance differences for changes in the temperature parameter of the softmax function (A; r(188) = 0.16, p = 0.10), play bias term (B; r(188) = 0.12, p = 0.22), intercept term of the learning rate (C; r(188) = 0.09, p = 0.34), and feedback confirmation term of the learning rate (D; r(188) = 0.08, p = 0.39).
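
A MATLAB sketch of this simulation-prediction step, under stated assumptions: simulateTask is a hypothetical wrapper that runs the fitted model through the task with one parameter replaced, lo/hi bound the fitted parameter range, and pOn/pOff/actualDiff are illustrative per-subject vectors.

```matlab
% Map simulated parameter values to points, then predict each subject's
% on-/off-drug performance difference from their parameter difference.
vals = lo + (hi - lo) .* rand(1000, 1);        % uniform draw over fitted range
pts  = arrayfun(@(v) simulateTask(v), vals);   % hypothetical task simulator
coef = polyfit(vals, pts, 2);                  % e.g., quadratic summary (red line)

predDiff = polyval(coef, pOn) - polyval(coef, pOff);  % predicted on-off difference
[r, p]   = corr(predDiff, actualDiff);                % compare with observed
```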

Overall points full sample.

When comparing overall points in the whole sample (n = 109), we see no difference between MA and PL (705.68 (36.27) vs. 685.77 (35.78); t(108) = 0.81, p = 0.42, d = 0.05). Mixed repeated-measures ANOVAs suggested that drug effects did not depend on session order (MA first vs. PL first) or on whether subjects performed the orientation session. However, participants who completed the orientation session tended to perform better during the drug sessions (F(1,107) = 3.09, p = 0.08; 719.31 (26.63) vs. 548.00 (75.09)). Note. PL = Placebo; MA = methamphetamine.

Learning curves after reversals full sample.

The figure shows learning curves after all reversals (A), reversals to high reward probability uncertainty (B), and reversals to low reward probability uncertainty (C) for the whole sample. Vertical black lines divide learning into early and late stages, as suggested by the Bai-Perron multiple break point test. Paired-sample t tests revealed no drug-related difference across all reversals during early learning (0.72 (0.01) vs. 0.72 (0.01); t(108) = –0.02, p = 0.98, d < 0.01) or late learning (0.83 (0.01) vs. 0.84 (0.01); t(108) = –0.80, p = 0.42, d = 0.04). Similarly, there were no significant differences in either learning stage for reversals to high reward probability uncertainty stimuli (early learning PL vs. MA: 0.68 (0.01) vs. 0.69 (0.01); t(108) = –0.92, p = 0.35, d = 0.08; late learning PL vs. MA: 0.80 (0.01) vs. 0.81 (0.01); t(108) = –1.48, p = 0.14, d = 0.10) or to low reward probability uncertainty stimuli (early learning PL vs. MA: 0.74 (0.01) vs. 0.73 (0.01); t(108) = 0.87, p = 0.38, d = 0.06; late learning PL vs. MA: 0.85 (0.01) vs. 0.85 (0.01); t(108) = –0.02, p = 0.97, d < 0.01). Mixed-effects ANOVAs controlling for session order effects and for whether participants performed the orientation session revealed no significant effects (all p > .06). Note. PL = Placebo; MA = methamphetamine.

Drug effects on model parameters in the full sample.

(A) Here we compare MA's effect on the best-fitting parameters of the winning model in the full sample (n = 109). We found that eta (i.e., the weight of the absolute reward prediction error in updating the learning rate) was reduced under MA (MA: 0.23 (0.01) vs. PL: 0.29 (0.01); t(108) = –3.05, p = 0.002, d = 0.40). Mixed-effects ANOVAs controlling for session order effects and for whether participants performed the orientation session revealed that this effect did not depend on these confounds. No other condition differences emerged. (B) Learning rate trajectories after reversals, derived from the computational model. As in the reduced sample, MA was associated with reduced learning rate dynamics in the full sample. Specifically, variability in the learning rate (the average individual SD of the learning rate) tended to be reduced in the MA condition, both during early and late stages of learning across all reversals (early PL: 0.19 (0.01) vs. MA: 0.18 (0.01); t(108) = 1.89, p = 0.06, d = 0.24; late PL: 0.17 (0.01) vs. MA: 0.16 (0.01); t(108) = 1.77, p = 0.08, d = 0.23) and after reversals to high reward probability uncertainty (early PL: 0.18 (0.01) vs. MA: 0.16 (0.01); t(108) = 1.74, p = 0.08, d = 0.22; late PL: 0.18 (0.01) vs. MA: 0.16 (0.01); t(108) = 1.82, p = 0.07, d = 0.24). Condition differences were most evident after reversals to low reward probability uncertainty (early PL: 0.19 (0.01) vs. MA: 0.16 (0.01); t(108) = 2.18, p = 0.03, d = 0.28; late PL: 0.18 (0.01) vs. MA: 0.16 (0.01); t(108) = 1.93, p = 0.05, d = 0.24). Control analyses revealed that these effects were independent of session order and the orientation session. Note. PL = Placebo; MA = methamphetamine.