Humans perseverate on punishment avoidance goals in multigoal reinforcement learning

  1. Paul B Sharp (corresponding author)
  2. Evan M Russek
  3. Quentin JM Huys
  4. Raymond J Dolan
  5. Eran Eldar
  1. The Hebrew University of Jerusalem, Israel
  2. Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, United Kingdom
  3. Wellcome Centre for Human Neuroimaging, University College London, United Kingdom
  4. Division of Psychiatry, University College London, United Kingdom

Abstract

Managing multiple goals is essential to adaptation, yet we are only beginning to understand the computations by which we navigate the resource demands entailed in so doing. Here, we sought to elucidate how humans balance reward seeking and punishment avoidance goals, and to relate this balance to its expression in anxious individuals. To do so, we developed a novel multigoal pursuit task that includes trial-specific instructed goals to either pursue reward (without risk of punishment) or avoid punishment (without the opportunity for reward). We constructed a computational model of multigoal pursuit to quantify the degree to which participants could disengage from goal pursuit when instructed to, as well as devote fewer model-based resources toward goals that were less abundant. In general, participants (n = 192) were less flexible in avoiding punishment than in pursuing reward. Thus, when instructed to pursue reward, participants often persisted in avoiding features that had previously been associated with punishment, even though at decision time these features were unambiguously benign. In a similar vein, participants showed no significant downregulation of avoidance when punishment avoidance goals were less abundant in the task. Importantly, we show preliminary evidence that individuals with chronic worry may have difficulty disengaging from punishment avoidance when instructed to seek reward. Taken together, the findings demonstrate that people avoid punishment less flexibly than they pursue reward. Future studies should test in larger samples whether a difficulty disengaging from punishment avoidance contributes to chronic worry.

Introduction

Adaptive behavior demands we flexibly shift between the pursuit of multiple goals, but disengaging from one goal in order to pursue another is often challenging. Switching between different goals is computationally demanding, as it requires us to disengage processing relevant to prior goals and recruit the knowledge necessary to determine the best action for pursuing new goals. Consider a teenager about to play for the championship of her basketball league, a coveted prize she is poised to attain. As the game begins, she suddenly remembers that earlier that day she again forgot to show up for a school exam, and consequently might end up getting expelled from school. Although the current task demands she reallocate attention towards the basketball game, she persists in worrying about the looming disaster awaiting when the game ends.

One possibility is that managing multiple goals is influenced by the valence of goal outcomes (i.e., goal valence) (Guitart-Masip et al., 2012). Thus, people might devote more resources to pursuing goals that involve potential punishment than to goals that involve potential reward because of a tendency for losses to loom larger than objectively equivalent gains (Novemsky and Kahneman, 2018). At the same time, people may adapt to their present environment, such that a tendency to prioritize punishment avoidance might be attenuated if reward seeking goals are more frequently encountered than punishment avoidance goals. Thus, our first aim was to determine whether computational strategies for multigoal pursuit differ as a function of goal valence. Specifically, we investigated the degree to which individuals engage, and subsequently disengage, reward seeking and punishment avoidance goals under instruction, and how goal engagement and disengagement are impacted by the frequency with which the goals are encountered.

A striking example of a maladaptive preference for punishment avoidance manifests in individuals with pathological anxiety (Bar-Haim et al., 2007; Berenbaum, 2010; Gagne and Dayan, 2021; Sharp and Eldar, 2019; Warren et al., 2021). Such individuals tend to learn more quickly from punishment than reward (Aylward et al., 2019), and this can lead to avoidance of even moderately risky situations (Charpentier et al., 2017). Furthermore, evidence suggests that anxiety is associated with failing to terminate planning in relation to potential threats (Berenbaum et al., 2018; Hunter et al., 2022). However, anxiety-associated failures to effectively disengage punishment avoidance goals have not been examined in a task that tests people’s ability to engage or disengage from punishment avoidance goals at will. Such a test is required to disambiguate between underlying computational mechanisms explaining how these failures occur (Browning et al., 2015; Korn and Bach, 2019).

On the one hand, it is possible that in naturalistic settings anxious individuals allocate more resources toward punishment avoidance because they believe the environment demands it, and thus, if given explicit safety signals they would effectively disengage punishment avoidance, perhaps even more so than less anxious individuals (Wise and Dolan, 2020). On the other hand, anxious individuals might fail to disengage punishment avoidance even in the presence of explicit safety signals, evincing a more fundamental failure in exercising executive control. Importantly, both hypotheses are consistent with anxious individuals opting for avoidance behavior in approach–avoidance conflict tasks (Loh et al., 2017), but diverge in settings where punishment avoidance and reward seeking goals are unambiguously separated in time and space. Thus, our second aim was to explore potential computations involved in disengagement of punishment avoidance goals in anxiety.

We developed a novel multigoal pursuit task that required participants to learn by trial and error the probabilities that different actions lead to different task features. Learning was incentivized by occasionally coupling certain features with monetary punishment and other features with monetary reward (Figure 1). Critically, on each trial, participants were instructed either to avoid the punishment feature or to seek the reward feature, and these goals switched frequently, requiring participants to continuously adjust their behavioral policy. Unbeknownst to participants, we manipulated how frequently certain goals were encountered in each task block, allowing us to determine whether more costly decision-making resources are devoted to pursuing more frequent, and thus more reward-impacting, goals in a resource rational (RR) manner (Lieder and Griffiths, 2019).

Multigoal pursuit task.

(A) Key task components. Participants were instructed to learn the likelihood of observing two features (gold and black circles) after taking each of two actions (pressing ‘g’ or ‘j’ on the keyboard), and integrate this knowledge with instructed trial-specific goals denoting the present reward or punishment value of each feature. There were two possible goals: in one participants were instructed to seek the reward feature (reward feature = +1 point, punishment feature = 0) and in the other to avoid the punishment feature (reward feature = 0, punishment feature = −1 point). Thus, if the goal was to seek reward, participants should have selected the action most likely to lead to the reward feature (gold circle), irrespective of whether the action led to the punishment feature (as the value of the latter is 0). Critically, whether each of the two features was present was determined independently, that is, for each action there were four different possible outcome configurations (both features present/reward feature only/punishment feature only/both features absent). To pursue goals, participants had to learn via experience four probabilities (left panel, all p(feature|action)) comprising the likelihood of observing each feature following each action (i.e., they were never instructed about these probabilities). Continued learning was required because the true probabilities of observing features for different actions drifted across trials according to semi-independent random walks (bottom left). Although participants were instructed with a more neutral narrative (see Methods), here we refer to the gold circle as the reward feature and the black circle as the punishment feature. However, the gold circle was rewarding only during reward seeking goal trials (and of no value during punishment avoidance goal trials), whereas the black circle was punishing only during punishment avoidance goal trials (and of no value during reward seeking goal trials).
In the actual task implementation, the color for the reward and punishment features, and the random walks each feature took, were counterbalanced across participants. (B) Phases of a single trial. First, participants were shown both fractals and the current goal, and asked to select an action (‘Decision’). After they took an action (here, clicking ‘j’, denoted by the red outline), participants were shown feedback, which comprised the feature outcomes, the present reward value of each feature, and the total points gained (possible total points were: (1) ‘You lost 1’, (2) 0, or (3) ‘You won 1’). Finally, participants were shown the feature outcomes they would have seen had they chosen the other action (‘Counterfactual’), which could be any of four possible feature combinations. (C) Goal abundance manipulation. A total of 160 trials was divided into two equal-length blocks, constituting reward- and punishment-rich contexts. In a reward-rich context, reward seeking trials outnumbered punishment avoidance trials, and the converse was true in a punishment-rich context. Note, both the sequence and order of blocks were counterbalanced across goal types to ensure neither factor could account for prioritization of a specific goal.

We report evidence that participants relied to varying degrees on three strategies. Whereas a model-based (MB) strategy was employed to learn the probabilities by which actions led to features for the purpose of flexibly pursuing instructed goals, there was also evidence for a model-free strategy that disregarded instructed goals and relied on points won or lost to reinforce actions (Lai and Gershman, 2021). Most interestingly, we find evidence for use of a novel strategy we term ‘goal perseveration’ (GP), whereby participants learn feature probabilities akin to an MB strategy but utilize this knowledge in a less costly and less flexible way, so as to always avoid punishment (even when instructed to seek reward) and to always seek reward (even when instructed to avoid punishment). Strikingly, this GP strategy was used to a greater extent for punishment avoidance, suggesting that disengaging punishment avoidance is harder, perhaps for evolutionarily sensible reasons (Woody and Szechtman, 2011). By contrast, the more flexible MB strategy was leveraged to a greater degree during reward seeking. Furthermore, participants flexibly increased MB control toward reward seeking goals when they were more abundant.

Finally, in a series of exploratory analyses, we sought to determine whether and how anxious individuals express a preference for punishment avoidance goals. In so doing, we found preliminary evidence that the degree of reliance on a GP strategy to avoid punishment was positively associated with dispositional worry, which appears to be unique to those expressing worry and not to individuals with obsessive–compulsive (OC) or somatic anxiety symptoms.

Results

Task description

We recruited a large online sample of participants (N = 192, excluding 56 who did not meet performance criteria; excluded participants did not differ significantly on any psychopathology measure from the retained sample; see Methods) to play an online version of a novel multigoal pursuit task (Figure 1). On each trial, participants could take one of two possible actions, defined by fractal images, to seek or avoid certain features. The trial’s goal was defined by the effective reinforcement value of the features, which varied from trial to trial as instructed to participants explicitly at the beginning of each trial. Thus, in reward seeking trials, encountering a ‘reward’ feature (gold circle) earned the participant one point whereas the ‘punishment’ feature (black circle) was inconsequential (value = 0). By contrast, in punishment avoidance trials, the punishment feature took away one point whereas the reward feature had no value. Note that the reward value of either feature was continuously presented throughout the choice deliberation time (Figure 1B), ensuring that there should be no reason for participants to forget the present trial’s goal. To determine whether participants adapted their decision-making toward more frequently encountered goals, we designed our task such that one goal was more abundant in each half of the task.

After participants made a decision, they were shown the chosen action’s feature outcomes, followed by the counterfactual feature outcomes for the choice not made (Figure 1B). The probabilities linking features to actions varied over time, and participants could estimate these continuously drifting probabilities from experience by observing which features actions led to. Presenting both actual and counterfactual outcomes removed the need for participants to explore less-visited actions to gain information, thus ruling out information seeking as a normative explanation for deviations from optimal choice. Of note, this task design differs from influential two-factor learning paradigms (Mowrer, 1951) extensively used to study anxiety, in that in our task both action-feature and feature-value associations changed throughout the experiment, mandating continued learning and flexible decision-making.
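The drifting feature probabilities can be illustrated with bounded Gaussian random walks. This is a minimal sketch, not the task's generative code: the step size, bounds, and seed are placeholders, and for simplicity the four walks here are fully independent, whereas the task used semi-independent walks.

```python
import numpy as np

def simulate_feature_walks(n_trials=160, start=0.5, step_sd=0.05,
                           lo=0.2, hi=0.8, seed=0):
    """Illustrative random walks for the four p(feature | action) values
    (reward/punishment feature crossed with the two actions). Values are
    clipped to stay within [lo, hi]; all parameters are placeholders."""
    rng = np.random.default_rng(seed)
    p = np.full(4, start)
    walks = np.empty((n_trials, 4))
    for t in range(n_trials):
        p = np.clip(p + rng.normal(0.0, step_sd, size=4), lo, hi)
        walks[t] = p
    return walks
```

Because the walks drift continuously, an estimate of p(feature|action) learned early in the task becomes stale, which is what mandates continued learning.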

Three computational strategies

Model based

We sought to identify the computations individuals employed to learn and enact decisions in our task. A suitable computational strategy for this task is to continuously learn which task features follow each action irrespective of the instructed goal, and, when deciding which action to take, to rely specifically on knowledge about features relevant to the presently instructed goal. This strategy is an instance of a broader family of ‘model-based’ strategies that flexibly use knowledge about which actions lead to which states (Dolan and Dayan, 2013). By simulating an artificial MB agent, we show that a unique signature of MB control in our task manifests on trials where current and previous goals differ (henceforth, ‘goal-switch’ trials): the current goal determines whether the features observed on the last trial impact the subsequent action. For example, an MB agent will avoid an action that led to a punishment feature on the last trial only when the current instructed goal is to avoid punishment (Figure 2A, top row). Such behavior cannot be produced by the other strategies discussed subsequently unless the current and previous goals are the same.
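The MB strategy described above can be sketched as follows. This is an illustrative implementation, not the authors' fitted model: the delta-rule update, the function names, and the learning rate are our assumptions.

```python
import numpy as np

def mb_update(p_hat, outcomes, alpha=0.3):
    """Delta-rule update of estimated p(feature | action).

    `p_hat` and `outcomes` are 2x2 arrays indexed [action, feature],
    with feature 0 = reward (gold) and feature 1 = punishment (black).
    Because the task shows both factual and counterfactual outcomes,
    both actions' estimates can be updated on every trial. The learning
    rate `alpha` is a placeholder, not a fitted value.
    """
    return p_hat + alpha * (outcomes - p_hat)

def mb_action_values(p_hat, goal):
    """Value each action using only the goal-relevant feature:
    +1 for the reward feature under a reward-seeking goal,
    -1 for the punishment feature under a punishment-avoidance goal."""
    if goal == "reward":
        return p_hat[:, 0]    # p(reward | action) * (+1)
    return -p_hat[:, 1]       # p(punish | action) * (-1)
```

An MB agent then picks (e.g., via softmax) the action with the higher goal-conditional value, which is why its choices flip when the instructed goal flips: the same learned probabilities yield different action values under different goals.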

Figure 2 with 3 supplements
Behavioral signatures of computational strategies in simulated and real data.

(A) Last outcome effects in simulated data. Each row comprises data generated by simulating one of the candidate computational strategies used to enact decisions in the present task (see Methods for parameters used). Each plot depicts the proportion of times the simulated agent takes a different action than that taken on the last trial (‘switch probability’), as a function of features experienced on the last trial for the chosen action (gold/black circles; a gray bar indicates the feature was absent), the previous goal (left vs. right plots), and the current goal (light vs. dark bars). (B) Last outcome effects in empirical data. Real participants’ switch probabilities as a function of last trial’s feature outcomes, and current and previous goals. For illustration, we overlay repeated measures t-tests of specific MB (difference between blue and black bars) and GP (green bars) predictions, broken down by goal valence. A more thorough analysis of strategies used by participants is shown in panel C. *p < 0.05, **p < 0.01, *****p < 10⁻⁵, *******p < 10⁻⁷. (C) Empirical evidence for each strategy. Posterior distributions derived from fitting a Bayesian linear mixed-effects logistic regression evince main effects for MB (blue), GP (green), and MF (red) strategies. Evidence reflects MB and MF were leveraged for punishment avoidance and reward seeking goals whereas GP was leveraged for punishment avoidance goals, with only trending evidence it was used for reward seeking. (D) Effect of goal valence on strategy utilization. We estimated goal valence effects by examining the posterior distribution of differences between the parameters in panel C and found evidence indicating model-based utilization was greater for reward seeking, whereas goal-perseveration utilization was greater for punishment avoidance.

Model-free

An MB strategy can be highly effective in our task, but it demands constant adaptation of one’s actions to frequently changing goals. Thus, we expect participants might resort to less costly, approximate strategies (i.e., heuristics). One common heuristic simplifies MB strategies by learning which actions have higher expected value purely based on experienced rewards and punishments. This so-called ‘model-free’ (MF) reinforcement learning strategy is ubiquitously deployed in single-goal tasks (Daw et al., 2011; Sutton and Barto, 2018). In the present multigoal setting, this would entail forgoing adaptation to the current goal and instead simply learning the overall expected values of the two available actions. Since the previous goal is what determines the value of the last observed features, a unique signature of an MF strategy is that the previous goal determines the impact of the last observed features on subsequent action, regardless of whether the goal has switched. For example, if an action led to a punishment on the last trial, then that action will tend to be avoided irrespective of the current goal (Figure 2A, bottom row).
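The MF heuristic reduces to a standard delta-rule value update on the points received. The sketch below is illustrative only; the function name and learning rate are our assumptions, not the fitted model.

```python
def mf_update(q, action, points, alpha=0.3):
    """Model-free update: reinforce the chosen action with the points
    actually received (+1, 0, or -1), ignoring which features appeared
    and which goal was instructed. `alpha` is a placeholder learning rate.
    `q` holds one value per action."""
    q = list(q)
    q[action] += alpha * (points - q[action])
    return q
```

Because the points were determined by the previous trial's goal, the learned values carry the previous goal's valuation into subsequent trials regardless of the current goal, producing the signature described above.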

Goal perseveration

An MF strategy is relatively simple to implement but not particularly effective since it does not utilize the information provided by feature outcomes that currently have no reward or punishment value (i.e., a feature that is irrelevant given the trial’s goal or that is a counterfactual outcome of the unchosen action). An alternative strategy, that we term ‘goal perseveration’, might strike a better balance between simplicity and effectiveness. This strategy inherits the exact same knowledge of feature probabilities acquired by MB learning, but simplifies action selection by persistently avoiding punishment and seeking reward, simultaneously, regardless of the instructed goal. This, in principle, eliminates effortful goal switching while utilizing all available information about the changing action-feature mapping. Thus, rather than constituting a separate decision system in its own right, GP is best thought of as a behavior produced by a strategic cost-saving MB agent. In goal-switch trials, a GP strategy would manifest in the observed features having the same impact on subsequent actions regardless of the current or previous trial’s instructed goal. For example, a GP agent will avoid an action that led to a punishment feature on the last trial even if both previous and current goals were to seek reward (Figure 2A, middle row).
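Expressed as code, GP differs from the MB sketch only at action selection: it reuses the same learned feature probabilities but collapses the two goals into one fixed valuation. This is an illustrative sketch under our own naming conventions.

```python
import numpy as np

def gp_action_values(p_hat):
    """Goal perseveration: reuse the MB feature-probability estimates
    (`p_hat`, a 2x2 array indexed [action, feature] with feature 0 =
    reward and feature 1 = punishment), but always seek the reward
    feature and avoid the punishment feature, regardless of the
    instructed goal."""
    return p_hat[:, 0] - p_hat[:, 1]
```

Because the valuation never changes, a GP agent's response to last trial's features is identical across all four combinations of previous and current goal, which is the signature tested on goal-switch trials.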

The benefits and costs of each strategy

MB strategies typically harvest more reward than heuristic strategies but are computationally costly, hence individuals will tend to use them to a greater extent when properly incentivized (Konovalov and Krajbich, 2020; Kool et al., 2017; Patzelt et al., 2019). To determine whether our task properly incentivized the use of an MB strategy, we simulated agents playing the task many times and computed the average amount of reward earned and punishment avoided with each computational strategy. This showed that an MB strategy in our task led to significantly more reward than the other strategies (Figure 3A; e.g., around 40% more than a GP agent), and performed only 15% worse than an idealized model that has access to the true feature probabilities for each action. The advantage of the MB strategy was due in large part to the task involving frequent goal switching (41.8% of trials). Finally, the least costly MF strategy also earned the least reward in the present task (Figure 3A).

Task performance of distinct strategies.

(A) Average total points gained by computational strategies. Punishment, reward, and total points (i.e., reward minus punishment) were averaged over 2000 simulations for each strategy. Strategies included model based (MB), model free (MF), and three versions of goal perseveration (GP reward seeking with MB punishment avoidance [GP-R], GP punishment avoidance with MB reward seeking [GP-P], and GP for both reward and punishment goals [GP]). Parameters and models for each simulated agent are detailed in Methods. Each agent played the task 2000 times. Measures are range normalized such that 0 corresponds to the performance of an agent that guesses at random and 1 corresponds to the performance of the best-performing agent. (B) Punishment avoided by computational strategies. Here, the plot tallies successful attempts by agents to avoid punishment. The results illustrate that a hybrid agent that employs the goal-perseveration punishment avoidance strategy, and utilizes model-based control for reward seeking, avoids punishment as successfully as a fully model-based agent. (C) Reward earned by computational strategies. Here, the plot tallies successful attempts by agents to seek reward. This highlights that a hybrid agent that employs the goal-perseveration punishment avoidance strategy gains less reward than a model-based agent.

Empirical evidence of each computational strategy

Evidence of MB learning

To estimate whether participants leveraged each of the three strategies, we fit a Bayesian linear mixed-effects logistic regression to participant choices on goal-switch trials, wherein unique signatures of each strategy are detectable. Besides accounting for each strategy’s signature, the regression controlled for the main effect of goal. The MB regression parameter predicted whether a participant switched to a new action on the current trial as a function of the interaction between the features observed last trial for chosen and unchosen actions and the instructed goal on the current trial (see Methods). Indeed, we found a strong main effect of MB behavior (MB main effect mode = 0.59, credible interval [CI] = [0.43, 0.76], pd = 1; Figure 2).

Examination of the data prior to regression analysis suggested a difference in utilization of MB control for reward seeking relative to punishment avoidance (Figure 2B). To determine whether an MB effect was present for both punishment avoidance and reward seeking goals, we enhanced the regression with separate MB parameters for the two goals. Posterior estimates showed that individuals engaged MB control for reward seeking (mode = −0.52, CI = [−0.69, −0.38], pd = 1), whereas evidence for punishment avoidance goals was only trending (mode = 0.12, CI = [−0.01, 0.28], pd = 0.96). Moreover, we found a larger MB effect for reward seeking than for punishment avoidance (mode = 0.41, CI = [0.20, 0.62], pd = 1).

Evidence of MF learning

We next determined whether participants also used an MF strategy, as captured by a regression parameter reflecting an interaction between the features observed at the last trial for chosen actions and the instructed goal on the last trial (this interaction represents the reward or punishment incurred last trial). Posterior estimates showed an MF strategy was employed by participants (MF main effect mode = 0.24, CI = [0.14, 0.36], pd = 1), both in response to reward (mode = −0.14, CI = [−0.23, −0.04]; Figure 2C, bottom row) and punishment (mode = 0.22, CI = [0.12, 0.31]). We found no evidence that the valence of the feedback differentially impacted MF behavior (mode = −0.08, CI = [−0.21, 0.06], pd = 0.87).

Evidence of a GP strategy

Finally, we determined whether participants used a GP strategy, as captured by a regression parameter reflecting effects of reward and punishment features observed last trial irrespective of goal. We observed a strong GP effect (GP main effect mode = 0.26, CI = [0.14,0.40], pd = 1). Breaking the GP effect down by valence showed that GP was utilized for punishment avoidance (mode = 0.33, CI = [0.20,0.45], pd = 1), significantly more so than for reward seeking (mode = −0.11, CI = [−0.23,0.02], pd = 0.95; difference between goals: mode = −0.20, CI = [−0.37,−0.04], pd = 1).

Quantifying the contribution of each strategy to decision-making

The presence of unique signatures of MB, MF, and GP decision strategies in the empirical data presents strong evidence for the use of these strategies, but the signature measures are limited to examining goal-switch trials and, within those trials, examining the impact of features observed on the very last trial. To comprehensively quantify the extent to which participants utilized each strategy for reward seeking and punishment avoidance, we next developed a series of computational models that aim to explain all participant choices given the features observed on all preceding trials.

We first sought to determine whether each strategy explained unique variance in participants’ choices. To do so, we implemented a stepwise model comparison (see Methods for full details of the models) that began with a null model comprising only action perseveration (AP). Specifically, an AP strategy reflects the tendency of participants to stay with the action taken on the last trial, which has been found in various prior studies of single-goal reinforcement learning (Daw et al., 2011). This null model yielded an integrated Bayesian information criterion (iBIC) of 38,273.53.
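For reference, the (non-integrated) BIC that underlies such comparisons is straightforward to compute. Note that the paper's iBIC additionally integrates the likelihood over group-level parameter distributions, which this sketch does not attempt.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k*ln(n) - 2*ln(L).
    Lower values indicate better fit after penalizing complexity;
    stepwise comparison keeps an added strategy if it lowers the score."""
    return n_params * math.log(n_obs) - 2.0 * log_lik
```

In the stepwise procedure, a negative difference in (i)BIC between an enhanced model and its predecessor (e.g., Δ iBIC = −3900.94 for adding MB) indicates the added strategy explains enough variance to justify its extra parameters.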

We subsequently investigated whether an MB strategy explained unique variance in participants’ choices above the null. To do so, we compared the null model to a similar model where we added an MB strategy. We found that the MB model explained significantly more variance than the null model (Δ iBIC = −3900.94), a finding that coheres with our expectation that participants would utilize an MB strategy to make choices.

Before considering additional strategies, we asked whether individuals adjusted how they utilized an MB strategy for reward seeking and punishment avoidance in a ‘resource rational’ (RR) fashion, based on how abundant each goal was in the task block (MBRR; Figure 2—figure supplement 1). Allowing the model to adjust in this way significantly improved model fit (Δ iBIC = −175.95; Figure 4A), providing evidence that individuals reallocated MB resources toward goals that were more abundant.

Figure 4 with 4 supplements
Results from computational modeling.

*p < 0.05, ***p < 0.001. (A) Stepwise model comparison. The plot compares all models to the winning model ‘MBRR + GP + MF’. (B) Model-based utilization is greater for reward seeking than for punishment avoidance. Here and in panel C, distributions are compared in terms of their medians due to a heavy positive skew. (C) Goal-perseveration utilization is greater for punishment avoidance than for reward seeking. Panels B and C show the distributions of utilization weights that best fitted each individual participant’s choices.

We next tested whether an MF strategy explains unique variance in choice data beyond the MBRR model. To do so, we compared the MBRR model to a similar model that combined both MBRR and MF strategies. In controlling choice, the two strategies were combined via a weighted sum of the values they each assigned to a given action. Thus, a strategy’s weight quantifies the degree to which it was utilized. The MBRR + MF model’s fit was superior to the MBRR model (Δ iBIC = −896.72; Figure 4A), providing evidence that individuals used both MF and MBRR strategies to inform their decisions.
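The weighted-sum combination can be sketched as follows. This is a minimal illustration assuming softmax action selection; the function name and the specific weights are ours, not the paper's implementation.

```python
import numpy as np

def choice_probabilities(values_by_strategy, weights):
    """Combine each strategy's action values via a weighted sum, then
    pass the result through a softmax. Each weight acts like a
    per-strategy inverse temperature: larger weights mean the strategy
    exerts more control over choice."""
    combined = sum(w * np.asarray(v)
                   for w, v in zip(weights, values_by_strategy))
    e = np.exp(combined - combined.max())  # subtract max for stability
    return e / e.sum()
```

Fitting such weights separately per participant (and per goal valence) is what allows the model to quantify how much each individual relied on MB, MF, and GP control.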

Finally, we tested whether a GP strategy might explain additional variance beyond the MBRR and MF strategies. Enhancing the MBRR + MF model with the addition of a GP strategy significantly improved model fit (Δ iBIC = −243.73; Figure 4A), indicating participants also used a GP strategy. Henceforth, we refer to this final MBRR + MF + GP model as the ‘winning model’ (see Methods for full model formalism). To further validate this winning model, we compared it to several alternative models that proved inferior in fitting the data, including models where we removed GP, MF, and MBRR processes to ensure the order in which we added each strategy did not impact the final result, and an alternative MF account wherein the goal was encoded as part of the state representation (see Figure 4—figure supplement 3 for model specifications and model comparison results). Ultimately, we showed that generating data from the winning model, using best-fitting participant parameters, could account for all but one of the mechanism-agnostic results reported in Figure 2C and D (Figure 2—figure supplement 3).

Punishment avoidance is less flexible than reward seeking

To investigate algorithmic differences between reward seeking and punishment avoidance, we used the winning model to extract the parameter values that best fit each participant. We focused our analysis on parameters quantifying the degree to which individuals utilized a particular strategy to pursue reward or avoid punishment goals. We validated that MB, GP, and MF inverse temperature parameters were recoverable from simulated experimental data, and that the degree of recoverability (i.e., the correlations of true and recovered parameter values, which were between 0.76 and 0.91; Figure 4—figure supplement 2) was in line with extant reinforcement learning modeling studies (Palminteri et al., 2017; Haines et al., 2020). Similarly, low correlations between estimated parameters (all weaker than 0.16) demonstrate our experimental design and model-fitting procedure successfully dissociated between model parameters (Wilson and Collins, 2019).

Comparing these utilization parameters between goals, we found that individuals relied significantly more on an MB strategy for reward seeking relative to punishment avoidance (two-tailed p < 0.001, nonparametric test [see Methods]; Figure 4B). By contrast, individuals relied more heavily on GP for punishment avoidance relative to reward seeking (two-tailed p = 0.026, nonparametric test; Figure 4C). These results suggest participants did not adaptively ‘turn off’ the goal to avoid punishment to the same extent as they did the goal to pursue reward.

Finally, we examined whether individuals prioritized punishment avoidance and reward seeking goals based on their relative abundance. To do so, we extracted computational parameters controlling a shift in MB utilization across task blocks for both goal types. Each of these utilization change parameters was compared to a null value of 0 using a nonparametric permutation test to derive valid p values (see Methods). This analysis revealed that individuals were sensitive to reward goal abundance (mean = 0.50, p < 0.001) but not to punishment goal abundance (mean = −0.13, p = 0.13). This result comports with the previous results, which highlighted a difficulty disengaging from punishment avoidance. Moreover, this result points to why our winning model, which allowed MB utilization weights to change across task blocks, explained participant data better than a model that kept MB utilization weights constant.

Preliminary evidence chronic worry is associated with greater perseveration of punishment avoidance

In a set of exploratory analyses, we sought to investigate how anxiety might be related to a prioritization of punishment avoidance goals. To do so, we assayed participants using self-report measures of chronic worry (Meyer et al., 1990), somatic anxiety (Casillas and Clark, 2000), and OC symptoms (Foa et al., 2002; Figure 4—figure supplement 1). For each regression model, we computed p values using a nonparametric permutation test wherein we shuffled the task data with respect to the psychopathology scores, repeating the analysis on each of 10,000 shuffled datasets to derive an empirical null distribution of the relevant t-statistics.
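The shuffling procedure described above can be sketched as follows. This is an illustrative reimplementation, not the study's analysis code: the function name, the bivariate (single-predictor) setup, and the two-tailed counting rule are our assumptions.

```python
import numpy as np

def permutation_pvalue(predictor, symptom_scores, n_perm=10_000, seed=0):
    """Two-tailed permutation p value for a bivariate regression slope.

    Shuffles the symptom scores relative to a task-derived predictor to
    build an empirical null distribution of the slope's t-statistic.
    """
    rng = np.random.default_rng(seed)

    def t_stat(x, y):
        # OLS with intercept; return t-statistic of the slope
        X = np.column_stack([np.ones_like(x), x])
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        df = len(y) - 2
        sigma2 = resid @ resid / df
        se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
        return beta[1] / se

    observed = t_stat(predictor, symptom_scores)
    null = np.array([t_stat(predictor, rng.permutation(symptom_scores))
                     for _ in range(n_perm)])
    # add-one correction keeps the p value strictly positive
    return (np.sum(np.abs(null) >= np.abs(observed)) + 1) / (n_perm + 1)
```

The same logic extends to the multiple-regression models reported below by permuting the psychopathology scores while holding the full design matrix fixed.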

We first report the bivariate relations between each form of psychopathology and inverse temperature parameters reflecting tendencies to utilize MB and GP punishment avoidance. Given that individuals with OCD and anxiety symptoms may overprioritize threat detection, it is conceivable that there is a relationship between all three forms of psychopathology and MB punishment avoidance. However, we found no significant or trending relationships between any form of psychopathology and MB control for punishment avoidance (Figure 5A, left column). An alternative possibility is that individuals with anxiety suffer from a dysregulation in goal pursuit, reflecting a failure to disengage from punishment avoidance when instructed to do so. On this basis, we explored whether worry and somatic anxiety are positively associated with GP for punishment avoidance. In so doing we found initial evidence of a potential relationship between the tendency to worry and punishment avoidance perseveration (β = 2.15, t = 1.4, p = 0.16; Figure 5A, right column).

Exploratory relationships between threat-related psychopathology and goal-directed control for punishment avoidance.

Each row reflects a different regression model, where the score for each psychopathology measure in the left column is the dependent variable, and inverse temperature parameters reflecting model-based (‘MB Punish’) and goal-perseveration (‘GP Punish’) punishment avoidance are the regressors. Each effect is presented in the following format: β (standard error), p value. (A) Bivariate relationships without control covariates. (B) Regression coefficients when controlling for co-occurring levels of psychopathology as well as for general valence-independent levels of utilization of MB (inverse temperature and learning rate) and non-MB (AP, MF, and GP inverse temperatures) strategies. In all tables, p values are uncorrected for multiple comparisons.

To provide a more specific test of our key hypotheses, we removed variance of noninterest in order to sensitize our analyses to the unique relationships between forms of psychopathology and types of punishment avoidance. Firstly, generalized worry, as opposed to specific obsessive worry, is thought to be particularly associated with difficulty in disengaging from worry (Berenbaum, 2010), since it lasts significantly longer in both clinical (Dar and Iqbal, 2015) and community samples (Langlois et al., 2000). Thus, we dissociated generalized from obsessive worry using the same approach taken in previous studies (Doron et al., 2012; Stein et al., 2010), namely, by including a measure of OCD symptoms as a control covariate. Controlling for OCD symptoms has the additional benefit of accounting for known relations between OCD and poor learning of task structure, reduced MB control, and perseverative tendencies (Gillan et al., 2016; Seow et al., 2021; Sharp et al., 2021). Secondly, another potentially confounding relationship exists between worry and somatic anxiety (Sharp et al., 2015), likely reflecting a general anxiety factor. Thus, we isolated worry by controlling for somatic anxiety, as is commonly done in studies seeking to quantify distinct relationships of worry and somatic anxiety with cognitive performance (Warren et al., 2021) or associated neural mechanisms (Silton et al., 2011). Finally, we controlled for covariance between computational strategies that might reflect general task competencies. This included the utilization of MB (including learning rates and inverse temperatures), since the anticorrelations between GP and MB observed in the empirical data (Figure 4—figure supplement 4) may derive from causal factors such as attention or IQ, as well as a general tendency to mitigate cognitive effort by using less costly strategies (AP, MF, and GP inverse temperatures; Figure 4—figure supplement 4).

This analysis showed a stronger relationship between worry and punishment perseveration (β = 3.14 (1.38), t = 2.27, p = 0.04, Figure 5B). No other significant relationship was observed between punishment perseveration or MB punishment avoidance and psychopathology (Figure 5C). Of note, we additionally found no association between the parameter governing how MB punishment was modulated by task block and levels of worry, both when including worry alone (β = 2.5 (1.91), t = 1.31, p = 0.19) and when controlling for the same covariates as detailed above (β = 1.46 (1.65), t = 0.88, p = 0.38). Ultimately, we validated the full model using a fivefold cross-validation procedure, which showed that regressing worry onto the aforementioned covariates (using a ridge regression implementation) explains significantly more variance in left-out test-set data (R2 = 0.24) relative to the models of the bivariate relationships between worry and GP Punishment (R2 = 0.01) and MB Punishment (R2 = 0.00).
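The cross-validated comparison can be sketched with a small NumPy-only helper. This is a minimal illustration, not the study's pipeline: the simulated data, fold count handling, and the ridge penalty value are all assumptions.

```python
import numpy as np

def cv_r2(X, y, alpha=1.0, k=5, seed=0):
    """Mean held-out R^2 from k-fold cross-validated ridge regression.

    X: 2-D design matrix (no intercept column); y: outcome vector.
    Setting alpha=0 reduces to ordinary least squares.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(f)), X[f]])
        # ridge solution; the intercept is left unpenalized
        P = alpha * np.eye(Xtr.shape[1])
        P[0, 0] = 0.0
        beta = np.linalg.solve(Xtr.T @ Xtr + P, Xtr.T @ y[train])
        pred = Xte @ beta
        ss_res = np.sum((y[f] - pred) ** 2)
        ss_tot = np.sum((y[f] - y[f].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))
```

Comparing `cv_r2` on the full covariate design against the single-predictor (bivariate) design mirrors the R² comparison reported above.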

It is important to note that all aforementioned p values testing our key hypotheses (Figure 5B) are corrected for multiple comparisons using a correction procedure designed for exploratory research (Rubin, 2017), which only controls for the number of statistical tests within each hypothesis. Using a more conservative Bonferroni error correction for all four regression models, as typically employed in hypothesis-driven confirmatory work (Frane, 2020), resulted in a p value for the key effect of worry and punishment perseveration that no longer passed a conventional significance threshold (p = 0.08). Thus, future work with a more targeted, hypothesis-driven approach needs to be conducted to ensure our tentative inferences regarding worry are valid and robust.

To illustrate the consequences of GP punishment avoidance on pursuit of reward in anxious participants, we simulated a GP + MB agent that adaptively engages reward-relevant information when instructed to, but perseverates in avoiding punishment during reward seeking. We show that such a strategy is as good as an MB agent in avoiding punishment, but comes with the cost of suboptimal reward seeking (Figure 3B and C). This trade-off mirrors the negative consequence of real-world threat avoidance in trait anxious individuals (Aderka et al., 2012). Moreover, this gives a potential normative explanation of punishment perseveration in anxious individuals; namely, if anxious individuals prioritize avoiding threat, they can do so just as well using punishment perseveration as using an MB strategy while expending fewer resources.

Discussion

Using a novel multigoal pursuit task, we investigated computational strategies humans leverage to navigate environments necessitating punishment avoidance and reward seeking. Our findings indicate the use of a strategy that avoids goal switching, wherein individuals learn a model of the task but use it in a goal-insensitive manner, failing to deactivate goals when they are irrelevant. This less flexible, but computationally less costly, strategy was leveraged more for punishment avoidance than for reward pursuit. Beyond trial-to-trial perseveration, inflexibility in punishment avoidance manifested in a lack of blockwise adjustment to the abundance of punishment goals. By contrast, we found that a flexible MB strategy was used more for reward seeking, and was modulated in a resource-rational (RR) way in response to the abundance of reward seeking goals changing between task blocks. Finally, we demonstrate preliminary evidence of a greater GP reliance for punishment avoidance in individuals with greater chronic worry.

The strategic deployment of GP primarily toward punishment avoidance indicates such behavior is not merely a reflection of a noisy or forgetful MB system. Our finding that humans use less flexible computational strategies to avoid punishment, than to seek reward, aligns with the idea of distinct neural mechanisms supporting avoidance and approach behavior (McNaughton and Gray, 2000; Lang et al., 1998). Moreover, comparative ethology and evolutionary psychology (Pinker, 1997) suggest there are good reasons why punishment avoidance might be less flexible than reward seeking. Woody and Szechtman, 2011 opined that ‘to reduce the potentially deadly occurrence of false negative errors (failure to prepare for upcoming danger), it is adaptive for the system to tolerate a high rate of false positive errors (false alarms).’ Indeed, we demonstrate that in the presence of multiple shifting goals, perseverance in punishment avoidance results in false positives during reward seeking (Figure 3B), but avoids ‘missing’ punishment avoidance opportunities because of lapses in goal switching (Figure 3C). Future work could further test these ideas, as well as potential alternative explanations (Dayan and Huys, 2009).

GP may thus in fact constitute an RR strategy (Lieder and Griffiths, 2019) for approximating MB control. To illustrate this, consider that MB learning is computationally demanding in our task specifically because goals switch between trials. When the goals switch, an MB agent must retrieve and use predictions concerning a different feature. Additionally, the agent needs to continuously update its predictions concerning features even when they are not presently relevant for planning. GP avoids these computationally costly operations by pursuing goals persistently, thus avoiding switching and ensuring that features are equally relevant for planning and learning. In this way, GP saves substantial computational resources compared to MB yet is able to perform relatively well on the task, achieving better performance than MF. Additionally, if a participant selectively cares about avoiding losses (for instance, due to loss aversion), GP can perform as well as MB. Thus, we propose the GP heuristic reflects a strategic choice, which can achieve good performance while avoiding the substantial resource requirements associated with MB control. In this sense it fulfils a similar role as other proposed approximations to MB evaluation including MF RL (Sutton and Barto, 2018), the successor representation (Dayan, 1993; Momennejad et al., 2017), mixing MB and MF evaluation (Keramati et al., 2016), habitual goal selection (Cushman and Morris, 2015), and other identified heuristics in action evaluation (Daw and Dayan, 2014).

Our exploratory finding that inflexibility in punishment avoidance is more pronounced in individuals with chronic worry is suggestive of a computational deficit that may serve to unify several known effects relating trait worry to failure to terminate threat processing. For example, in paradigms that explicitly instruct participants to ignore threat-irrelevant information, such as the dot-probe (Asmundson and Stein, 1994) and modified emotional Stroop (Dalgleish, 1995; van den Hout et al., 1995) tasks, individuals with trait worry have difficulty inhibiting processing of threat (Bar-Haim et al., 2007). Moreover, increased GP punishment avoidance may be involved in the overactivity of threat-related circuitry in anxious individuals during tasks where threat is not present (Grupe and Nitschke, 2013; Nitschke et al., 2009). However, we note that there was a significant positive skew in the somatic arousal measure, which, although likely due to random sampling error (given that other symptom measures were highly representative of population distributions), may nonetheless limit our ability to generalize findings from the present sample to the population.

Our findings go beyond previous findings that report, in single-goal reinforcement learning tasks, that anxiety is associated with altered MF but intact MB control (Gillan et al., 2019). Our findings suggest a conflict between punishment avoidance and reward seeking may be necessary to uncover how knowledge of task structure is used in anxiety. Indeed, prior approach–avoidance conflict paradigms have found that trait anxiety is positively associated with neural correlates of punishment avoidance (rejected gambles that could result in loss) (Loh et al., 2017) and avoidant behavior (Bach, 2015).

A limitation of our task is that differences in strategy utilization for reward seeking and punishment avoidance (see Methods) could in part reflect differences in sensitivity to reward versus punishment. However, reward and punishment sensitivity cannot account for the effects we observe across strategies: GP utilization was greater for punishment avoidance, whereas MB utilization was greater for reward seeking. Greater punishment sensitivity relative to reward sensitivity would predict the same direction of valence effect for both behavioral strategies. Moreover, knowledge of reward features had a greater net impact on choice across both goal-oriented strategies (the sum of weights across both MB and GP strategies is greater for reward seeking). That said, we recognize that differences in outcome sensitivity, which are algorithmically equivalent to differences in the magnitude of external incentives, may cause a shift from one strategy to another (Kool et al., 2017). Thus, an open question relates to how reward and punishment sensitivity might impact flexibility in goal pursuit.

Future work can further address how humans manage multigoal learning in the context of larger decision trees with multiple stages of decisions. In such environments, it is thought people employ a successor feature learning strategy, whereby the long-run multistep choice features are stored and updated following feedback (Tomov et al., 2021). Such multistep tasks can be enhanced with shifting reward seeking and punishment avoidance goals to determine how altered strategies we identify with pathological worry might relate to trade-offs between MB and successor feature strategies for prediction. Another possibility is that punishment-related features capture involuntary attention in our task because they are tagged by a Pavlovian system, and this interacts with an MB system that learns task structure. Indeed, prior work (Dayan and Berridge, 2014) has discussed possibilities of MB Pavlovian hybrid strategies.

In relation to why GP punishment avoidance may specifically be associated with chronic worry, we suggest that failures to disengage from punishment avoidance may serve to explain so-called ‘termination’ failures in chronic worry (Berenbaum, 2010). A causal role of GP in failures to terminate worry episodes is consistent with the fact that such failures are dissociable from a tendency to suffer ‘initiation’ failures, which involve worrying too easily in response to many stimuli (Berenbaum et al., 2018). Although the perseveration of worry may appear relevant to obsessions in OC symptoms, punishment avoidance in OC has been empirically demonstrated to be specific to idiographic domains of potential threat (e.g., sanitation; Amir et al., 2009), an issue the present study did not investigate. Additionally, we did not find that GP was associated with somatic anxiety, perhaps due to random sampling error, as we had an unusually low percentage of participants meeting a threshold for mild symptomatology (4.7%; convenience samples typically fall in the range of 12–18%; Sharp et al., 2015; Telch et al., 1989). More importantly, somatic anxiety is thought to involve lower-order cognitive processes than those likely involved in multigoal pursuit (Sharp et al., 2015). Given that the present results are preliminary in nature, future studies will need to test the prediction that chronic worry is associated with punishment perseveration in a larger sample. This should also include testing whether this association holds in a clinical population, as variation in symptoms in a clinical population may relate to punishment perseveration differently (Imperiale et al., 2021; Groen et al., 2019). Additionally, doing so may be enhanced by including parameters relating worry to punishment perseveration within the model-fitting procedure itself, so as to better account for uncertainty in the estimation of such covariance (Haines et al., 2020).

In conclusion, we show humans are less flexible in avoiding punishment relative to pursuing reward, relying on a GP strategy that persists in punishment avoidance even when it is irrelevant to do so, and failing to deprioritize punishment avoidance goals when they are less abundant. Moreover, we show that GP is augmented in individuals with chronic worry, hinting at a candidate computational explanation for a persistent overprioritization of threat in anxiety.

Materials and methods

Sample and piloting

Request a detailed protocol

Prior to disseminating the task, we conducted a pilot study varying the number of features and actions participants could choose. We first found that fewer than half of the individuals we recruited performed above chance when the task involved three actions and nine feature probabilities. We thus reduced the complexity of the task and found that including only two actions and four features allowed most participants to leverage an MB strategy.

Two hundred and forty-eight participants were recruited online through the Prolific recruitment service (https://www.prolific.co/) using the final task design, from English-speaking countries to ensure participants understood task instructions. After expected data-scrubbing, our sample size had >99% power to detect valence differences in reinforcement learning parameters, conservatively assuming a medium effect size relative to large effects found previously (to account for differences between multigoal and single-goal settings; e.g., Palminteri et al., 2017; Lefebvre et al., 2017). Moreover, our sample had >80% power to detect small-medium effect sizes relating computational parameters and individual differences in anxiety (Sharp et al., 2021). Participants gave written informed consent before taking part in the study, which was approved by the university’s ethics review board (project ID number 16639/001). The final sample was 37% male, with a mean age of 33.92 years (standard deviation [SD] = 11.97). Rates of mild but clinically relevant levels of OCD (45%) and worry (40%) comported with prior studies (Sharp et al., 2021), indicating good representation of individual variation in at least some psychopathological symptoms.

Data preprocessing

Request a detailed protocol

Eleven participants completed less than 90% of the trials and were removed. We next sought to identify participants who did not understand task instructions. To do so, we computed the proportion of trials on which participants switched away from the last action taken, as a function of that action’s feature outcomes and the current goal. We used these proportions to define four learning signals, two for each goal. Note, the term ‘average’ henceforth is shorthand for ‘average switching behavior across all trials’. Facing reward goals, participants should (1) switch less than average if they observed a reward feature last trial and (2) switch more than average if they did not. Facing punishment goals, participants should (1) switch more than average if they observed a punishment feature last trial and (2) switch less than average if they did not. We removed six additional participants whose switch behavior was the exact opposite of what it should be for each of these four learning signals: when facing a reward goal, they switched more than average having observed a reward feature last trial and switched less than average having not; when facing a punishment goal, they switched less than average having observed a punishment feature last trial and switched more than average having not. We additionally removed participants who (1) treated a punishment feature as a reward feature (i.e., showed the opposite of the two learning signals for punishment; 13 participants) or (2) treated a reward feature as a punishment feature (showed the opposite of the two learning signals for reward; 26 participants). Excluded subjects performed significantly worse in terms of choice accuracy.
To derive accuracy, we computed the percentage of choices subjects made in line with an ideal observer that experienced the same outcome history (in terms of features) as each participant. On average, excluded subjects chose correctly 49.6% of the time, whereas included subjects chose correctly 63.5% of the time (difference between groups: t(190) = 9.66, p < 0.00001).
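The four learning signals can be sketched as follows. This is an illustrative reimplementation of the criteria described above; the function name, argument names, and data layout are our assumptions.

```python
import numpy as np

def learning_signals(choices, goals, rew_obs, pun_obs):
    """Four switch-rate learning signals used to screen task comprehension.

    choices: action taken on each trial; goals: 'reward' or 'punish' per
    trial; rew_obs / pun_obs: whether the chosen action yielded that
    feature on each trial. Returns whether each signal goes the
    instructed direction relative to the average switch rate.
    """
    switched = np.array(choices[1:]) != np.array(choices[:-1])
    base = switched.mean()  # average switch rate across all trials
    g = np.array(goals[1:])        # goal on the current trial
    r = np.array(rew_obs[:-1])     # reward feature observed last trial
    p = np.array(pun_obs[:-1])     # punishment feature observed last trial
    return {
        # facing a reward goal: stay after reward, switch after no reward
        "rew_stay":   switched[(g == "reward") & r].mean() < base,
        "rew_switch": switched[(g == "reward") & ~r].mean() > base,
        # facing a punishment goal: switch after punishment, stay otherwise
        "pun_switch": switched[(g == "punish") & p].mean() > base,
        "pun_stay":   switched[(g == "punish") & ~p].mean() < base,
    }
```

A participant who shows the opposite pattern on all four signals (all entries False) would fall under the ‘complete reversal’ exclusion above.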

Indeed, these errors in following task structure are fundamental failures that result in average performance that is as poor as or worse than an agent that purely guesses which action to take at each trial (Figure 6). These exclusions resulted in our final sample of n = 192, with a low percentage of removed participants (22%) compared to similar online computational studies (Tomov et al., 2021). Importantly, removed participants were no different in terms of mean scores on key measures of psychopathology (greatest mean difference found in OCD; Welch’s t = −0.96, p = 0.33).

Excluded participants’ strategies perform similar to or worse than purely guessing.

To motivate our exclusion criteria, we simulated task performance by agents that falsify these criteria and calculated their average winnings over 5000 simulations each. The guessing agent chooses according to a coin flip. The models instantiating strategies used by excluded subjects comprise those that treat reward features as punishment features (‘Mistake Reward for Punishment’), treat punishment features as if they were reward features (‘Mistake Punishment for Reward’), or incorrectly reverse the treatment of feature types (‘Complete Reversal of Features’). Each model performed as poorly as, or significantly worse than, a model that purely guesses, demonstrating a fundamental failure of these strategies for the present task. By contrast, a GP-only agent (‘GP’) that ignores goal-switching instructions does significantly better than a guessing model, and only a little worse than a pure model-based agent (‘MB as instructed’).

Additionally, including such subjects would reduce our sensitivity to estimating differences in the utilization of GP and MB for goals of differing valence, as such subjects treated the task as if there was only a single goal, or as if the goals were the opposite of their instructed nature. Moreover, given their model of the task, such subjects could approach the task optimally using an MF strategy, and thus were not incentivized to use goal-directed strategies at all.

To determine whether our relatively strict subject exclusion policy might have affected the results, we conducted a sensitivity analysis on a larger sample (n = 248; 98% retention) including subjects that mistreated the instructed value of certain features. To account for these subjects’ behavior, we used normal priors to allow negative inverse temperature parameters. Fitting these revised models to our data, we again demonstrate that our winning model was the best-fitting model compared to all other models. Second, we show that the GP valence effect held and even came out stronger in this larger sample. Thus, the mean difference in GP utilization for punishment and reward goals was 0.24 in our original sample and 0.50 in the larger sample (p < 0.0001). Finally, we show the MB valence effect also held in this larger sample (original sample mean difference between MB reward and MB punishment = 2.10; larger sample mean difference = 1.27, both p values <0.0001).

Symptom questionnaires

Request a detailed protocol

Before participants completed the online task, they filled out three questionnaires covering transdiagnostic types of psychopathology. Chronic worry was measured via the 16-item Penn State Worry Questionnaire (Meyer et al., 1990). Anxious arousal was measured with the 10-question mini version of the Mood and Anxiety Symptom Questionnaire – Anxious Arousal subscale (Casillas and Clark, 2000). Obsessive compulsiveness was measured with the Obsessive Compulsive Inventory – Revised (Foa et al., 2002).

Multigoal pursuit task

Request a detailed protocol

To examine how people learn and utilize knowledge about the outcomes of their actions in order to seek reward and avoid punishment, we designed a novel multigoal pursuit task. The task consisted of 160 trials. On each trial, participants had 4 s to make a choice between two choice options (represented as fractal images; Figure 1B). Choosing a fractal could then lead to two independent outcome features, a gold feature and a black feature. Any given choice could thus lead to both features, neither feature, or one of the two features. The chances that a choice led to a certain feature varied according to slowly changing probabilities (Figure 1A). These probabilities were partially independent of one another (i.e., the rank correlation between any pair of probability sequences did not exceed 0.66 in magnitude; full set: [0.02, −0.17, 0.24, 0.28, −0.66, 0.43]). The same sequences of feature probabilities (probability of encountering a given feature conditioned on a given choice) were used for all participants. These sequences were generated by starting each at a random value between 0.25 and 0.75, and adding independent noise (normally distributed, mean = 0.0, SD = 0.04) at each trial to each sequence, while bounding the probabilities between 0.2 and 0.8. To incentivize choice based on feature probabilities, we ensured that in the resultant sequences the probability of reaching a given feature differed between the two choices by at least 0.27 for each feature, on average across the 160 trials.
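The generative recipe above can be sketched as a bounded Gaussian random walk. This is a minimal illustration: the screening steps described above (the minimum average choice difference of 0.27 and the bound on rank correlations, which would require resampling) are omitted, and the function name is ours.

```python
import numpy as np

def feature_probability_walks(n_trials=160, n_seq=4, sd=0.04,
                              lo=0.2, hi=0.8, seed=0):
    """Random walks for the four feature-transition probabilities
    (2 actions x 2 features). Start points are drawn uniformly from
    [0.25, 0.75]; each step adds independent N(0, sd) noise and is
    clipped to [lo, hi]."""
    rng = np.random.default_rng(seed)
    walks = np.empty((n_seq, n_trials))
    walks[:, 0] = rng.uniform(0.25, 0.75, size=n_seq)
    for t in range(1, n_trials):
        step = walks[:, t - 1] + rng.normal(0.0, sd, size=n_seq)
        walks[:, t] = np.clip(step, lo, hi)
    return walks
```

In practice, one would regenerate candidate sequences until they satisfy the incentive and independence constraints stated above.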

To manipulate participants’ goals, throughout the task, one of the two outcome features (which we refer to as the reward feature) provided either 1 or 0 points, and the other outcome feature (which we refer to as the punishment feature) provided either 0 points or took away one point. The number of points that a given outcome feature provided on a given trial was determined by trial-specific instructed goals (on the screen on which choice options were presented, Figure 1B). A punishment avoidance goal meant the punishment feature took on the value of −1 and the reward feature took on the value of 0, whereas a reward seeking goal meant the punishment feature took on a value of 0 and the reward feature took on the value of +1. This information was presented in a color (gold or silver) matching the corresponding outcome feature’s color. Importantly, the color each feature took on and the probability trajectory for either feature were counterbalanced across participants. Participants were instructed in advance that one feature in the task might tend to provide points and the other feature might tend to take away points, but they were not told which features these would be.

To manipulate goal abundance, the frequency of which feature was set to the nonzero outcome varied in the first half versus the second half of the experiment (Figure 1C). In one half of the experiment (punishment context), the punishment feature took away a point (and the reward feature did not provide a point) on 65% of trials, and in the other half (reward context) the reward feature provided a point on 65% of trials. Which context occurred in the first versus second half of the experiment was counterbalanced across participants.

After the participant observed which outcome features of a choice they received (2 s), they observed the number of points they received (1 s), including the individual reward value of each feature received, in white, as well as a sentence summarizing their total earnings from that trial (e.g., ‘You lost 1’). Following this, in order to eliminate the need for exploratory choices, the participant was shown the features they would have received, had they chosen the alternative choice option (2 s). There was then a 1-s intertrial interval.

Clinical analyses

Request a detailed protocol

Although the computational parameters were nonindependently estimated by our hierarchical model-fitting procedure, it is vital to note this does not compromise the validity of the least-squares solution to the regressions we ran. Indeed, Hastie et al., 2009 show that, ‘Even if the independent variables were not drawn randomly, the criterion is still valid if the dependent variables are conditionally independent given the [independent variable] inputs’ (p. 44). However, we note that it is in practice difficult to determine whether such conditional independence is met. In each regression, we excluded the learning rate for counterfactual feedback, as well as the learning rate for value in the MF system, due to high collinearity with other parameters (see Methods). We verified low multicollinearity among included parameters (variance inflation factor <5 for independent variables; Akinwande et al., 2015). We report all bivariate correlations between fitted parameters in Figure 4—figure supplement 4.
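The multicollinearity check can be sketched as follows; a minimal NumPy implementation of the variance inflation factor, assuming the standard definition (the function name and design-matrix layout are ours).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X:
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (an intercept is included)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

A column with VIF above the threshold of 5 used here would be flagged for exclusion from the regression.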

Algorithms defining MB, MF, and GP strategies

We first describe how each learning system in the winning model derived action values. Then, we describe how action values were integrated into a single decision process. Together, these comprise the best-fitting model that we report on in Results.

MB system

Request a detailed protocol

An MB agent learns each of the four semi-independent transition probabilities of reward and punishment features given either of the two actions one can choose. On each trial, this agent incrementally updates its estimate of the probability of observing a feature given an action (either ‘press g’ or ‘press j’) via a feature prediction error and a learning rate, αchosen. In the example below, the agent pressed ‘g’, observed a punishment feature, and updated the probability of observing a punishment feature conditional on choosing to press ‘g’:

(1) $$P(f=\mathrm{punish}\mid \mathrm{press}=g)_{t+1} = P(f=\mathrm{punish}\mid \mathrm{press}=g)_{t} + \alpha_{\mathrm{chosen}}\left(1 - P(f=\mathrm{punish}\mid \mathrm{press}=g)_{t}\right)$$

Here, the ‘t’ subscript refers to the trial, and the ‘1’ in the parentheses means that the participant observed a punishment feature. If the feature was not present, the absence would be coded as a ‘0’. This same coding (1 for feature observation, 0 if absent) was also used to encode the presence or absence of a reward feature.

The model learns similarly from counterfactual feedback, albeit at a different rate. Thus, at each trial, MB agents update feature probabilities for the action they did not choose via the same equation as above but with learning rate αunchosen. If the agent pressed ‘g’, the counterfactual update would be:

(2) $$P(f=\mathrm{punish}\mid \mathrm{press}=j)_{t+1} = P(f=\mathrm{punish}\mid \mathrm{press}=j)_{t} + \alpha_{\mathrm{unchosen}}\left(1 - P(f=\mathrm{punish}\mid \mathrm{press}=j)_{t}\right)$$
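Equations 1 and 2 amount to a delta-rule update applied to both actions. A minimal sketch (function name and data layout are our assumptions):

```python
import numpy as np

def mb_update(P, chosen, feats, alpha_chosen, alpha_unchosen):
    """Delta-rule update of estimated feature probabilities (Equations 1-2).

    P: 2x2 array, rows = actions, columns = (reward, punish) features;
    chosen: index of the action taken; feats: 2x2 array of 0/1 feature
    observations, one row per action (the unchosen row is available
    because the task reveals counterfactual feedback).
    """
    P = P.copy()  # do not mutate the caller's estimates
    unchosen = 1 - chosen
    P[chosen] += alpha_chosen * (feats[chosen] - P[chosen])
    P[unchosen] += alpha_unchosen * (feats[unchosen] - P[unchosen])
    return P
```

When a feature is observed (coded 1), the estimate moves toward 1 at the relevant learning rate; when absent (coded 0), it decays toward 0.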

Each of the four probabilities an MB agent learns is stored in a matrix, where rows are defined by actions and columns by feature type (i.e., reward or punishment). These stored feature probabilities are multiplied by ‘utilization weights’ (βMBpunish and βMBreward) that reflect the degree to which an agent utilizes an MB strategy to pursue reward or avoid punishment. No additional parameter controls utilization of an MB strategy (e.g., there is no additional overall βMB).

Each trial, the agent computes the expected value of each outcome by multiplying stored feature probabilities given each action with the values of the features that are defined by the trial-specific goal. Here, the agent is facing an avoid punishment trial, for which the presence of a punishment feature results in taking away one point (i.e., a value of −1; below we abbreviate press as ‘p’, reward as ‘rew’, and punishment as ‘pun’):

(3a)
  [ QMB,t+1(p=g) ]   [ βMBreward·P(f=rew|p=g)t   βMBpunish·P(f=pun|p=g)t ]   [  0 ]
  [ QMB,t+1(p=j) ] = [ βMBreward·P(f=rew|p=j)t   βMBpunish·P(f=pun|p=j)t ] · [ −1 ]

Via this computation, an MB agent disregards the irrelevant goal (here, reward seeking). If the agent were instead facing a reward-seeking goal, the corresponding computation (Equation 3b) would be:

(3b)
  [ QMB,t+1(p=g) ]   [ βMBreward·P(f=rew|p=g)t   βMBpunish·P(f=pun|p=g)t ]   [ 1 ]
  [ QMB,t+1(p=j) ] = [ βMBreward·P(f=rew|p=j)t   βMBpunish·P(f=pun|p=j)t ] · [ 0 ]
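The goal-conditioned value computation can be sketched as a weighted matrix-vector product, where the goal selects which feature column contributes: [0, −1] on punishment-avoidance trials (a punishment feature removes one point), [1, 0] on reward-seeking trials. This is our illustrative sketch with hypothetical parameter values, not the authors' implementation.

```python
# Sketch of the MB value computation: stored feature probabilities, scaled
# by their utilization weights, are combined with the trial's goal-defined
# feature values.

def mb_q_values(P, beta_mb_reward, beta_mb_punish, goal):
    # goal-defined feature values: reward worth +1 only on reward trials,
    # punishment worth -1 only on avoidance trials
    values = ({'reward': 1, 'punish': 0} if goal == 'reward'
              else {'reward': 0, 'punish': -1})
    return {action: beta_mb_reward * P[(action, 'reward')] * values['reward']
                    + beta_mb_punish * P[(action, 'punish')] * values['punish']
            for action in ('g', 'j')}

P = {('g', 'reward'): 0.8, ('g', 'punish'): 0.6,
     ('j', 'reward'): 0.2, ('j', 'punish'): 0.1}
q_avoid = mb_q_values(P, beta_mb_reward=2.0, beta_mb_punish=3.0, goal='punish')
q_seek = mb_q_values(P, beta_mb_reward=2.0, beta_mb_punish=3.0, goal='reward')
# On the avoidance trial only punishment probabilities matter, so 'j'
# (lower punishment probability) is preferred; on the reward trial only
# reward probabilities matter, so 'g' is preferred.
```

This makes concrete how a pure MB agent fully disregards the goal-irrelevant feature at decision time.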

Resource reallocation

Request a detailed protocol

Within the MB system, utilization weights changed across blocks according to a change parameter. Below is an example of how this reallocation occurred for βMBreward:

(4) βMBreward,block = βMBreward + βchangeReward   if reward-rich block
    βMBreward,block = βMBreward − βchangeReward   if punishment-rich block

Note that negative βchange values were allowed, and thus the model did not assume a priori that, for instance, individuals would have increased MB control for reward in the rewarding block (it could be the opposite). Thus, if the data are nevertheless consistent with a positive βchange , this is an indication that, although participants were not explicitly told which block they were in, they tended to prioritize the more abundant goal in each block (see ‘Alternative models’ for an attempt to model how participants inferred goal frequency).
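The block-wise reallocation of Equation 4 amounts to shifting the utilization weight up in the goal-congruent block and down in the other. A minimal sketch, with hypothetical parameter values (and recalling that a fitted βchange may be negative):

```python
# Sketch of Equation 4: block-dependent adjustment of an MB utilization weight.

def reallocate(beta_mb_reward, beta_change_reward, block):
    sign = 1 if block == 'reward_rich' else -1
    return beta_mb_reward + sign * beta_change_reward

beta_in_reward_block = reallocate(2.0, 0.5, 'reward_rich')      # 2.5
beta_in_punish_block = reallocate(2.0, 0.5, 'punishment_rich')  # 1.5
```

A positive βchange thus means MB resources are concentrated on the goal that is more abundant in the current block.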

MF system

Request a detailed protocol

An MF agent learns the value of either action directly from the reward and punishment received. In our task, outcomes took on values of {−1, 0, 1}. Action values were updated incrementally via a value prediction error and a learning rate for value, η. Below is an example updating the action value for press = j (which we omit from the right side of the equation for concision):

(5) QMF,t+1(press = j) = QMF,t + η · (OutcomeValuet − QMF,t)
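Equation 5 is a standard delta rule over net outcome value; a minimal sketch with an illustrative learning rate:

```python
# Sketch of the MF update (Equation 5): the action value moves toward the
# experienced outcome value (-1, 0, or 1) by a fraction eta of the error.

def mf_update(q, outcome_value, eta):
    return q + eta * (outcome_value - q)

q = mf_update(0.0, 1, eta=0.5)   # rewarded trial: value rises to 0.5
q = mf_update(q, -1, eta=0.5)    # punished trial: value falls to -0.25
```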

Goal perseveration

Request a detailed protocol

A GP agent uses the same matrix of estimated feature probabilities as the MB system, but multiplies it by a static goal-value vector, [1, −1], which means the system always engages both goals regardless of instructions. This is the only way in which the GP agent differs from the MB agent. Having its own utilization weights (βGPreward and βGPpunish) allows the system to vary across individuals in the degree to which the ‘avoid punishment’ and ‘seek reward’ goals are each pursued when they are irrelevant:

(6)
  [ QGP,t+1(p=g) ]   [ βGPreward·P(f=rew|p=g)t   βGPpunish·P(f=pun|p=g)t ]   [  1 ]
  [ QGP,t+1(p=j) ] = [ βGPreward·P(f=rew|p=j)t   βGPpunish·P(f=pun|p=j)t ] · [ −1 ]

Note, we also fit a model in which the GP strategy learns its own matrix of estimated feature probabilities, separate from that learned by the MB strategy (i.e., with a different learning rate), but this did not fit participants’ choices as well (Figure 4—figure supplement 3).
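Because the GP system applies the fixed goal-value vector [1, −1] on every trial, its value computation collapses to a single expression regardless of the instructed goal. An illustrative sketch (hypothetical weights, not the authors' code):

```python
# Sketch of Equation 6: the GP system reuses the MB feature probabilities
# but always values reward at +1 and punishment at -1, ignoring the
# trial's instructed goal.

def gp_q_values(P, beta_gp_reward, beta_gp_punish):
    return {a: beta_gp_reward * P[(a, 'reward')] - beta_gp_punish * P[(a, 'punish')]
            for a in ('g', 'j')}

P = {('g', 'reward'): 0.8, ('g', 'punish'): 0.6,
     ('j', 'reward'): 0.2, ('j', 'punish'): 0.1}
q_gp = gp_q_values(P, beta_gp_reward=1.5, beta_gp_punish=2.0)
# Both goals contribute to every trial's values, so a benign punishment
# history still drags down an action's GP value on reward-seeking trials.
```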

Action perseveration

Request a detailed protocol

The action taken on the last trial was represented by a one-hot vector (e.g., [1, 0]), stored in a variable, QLastTrial, which was multiplied by its own utilization parameter, βAP.

Stochastic decision process

Request a detailed protocol

All action values derived by each system were integrated via their utilization weights. Below we show the integrated action value for press = j (which we omit from the right side of the equation for concision):

(7) QIntegrated(press = j) = QMB + QGP + βMF·QMF + βAP·QLastTrial

Note there are no utilization weights in the above equation for the MB and GP Q values because they were already incorporated into these Q values in Equations 3a, 3b and 6. The integrated action value was then input into a softmax function to generate the policy, which can be described by the probability of pressing ‘j’ (the probability of pressing ‘g’ is one minus that):

(8) P(a = press j) = e^QIntegrated(press=j) / (e^QIntegrated(press=j) + e^QIntegrated(press=g))
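The integration and softmax steps (Equations 7 and 8) can be sketched together. This is our illustrative code with hypothetical values; as in the text, the MB and GP inputs already carry their utilization weights.

```python
import math

# Sketch of Equations 7-8: sum the system-specific action values (weighting
# the MF and perseveration terms) and pass the result through a softmax.

def choice_prob_j(q_mb, q_gp, q_mf, q_last, beta_mf, beta_ap):
    q = {a: q_mb[a] + q_gp[a] + beta_mf * q_mf[a] + beta_ap * q_last[a]
         for a in ('g', 'j')}
    return math.exp(q['j']) / (math.exp(q['j']) + math.exp(q['g']))

zero = {'g': 0.0, 'j': 0.0}
# Here only the MF system discriminates between actions:
p_j = choice_prob_j(zero, zero, {'g': 0.0, 'j': 1.0}, zero,
                    beta_mf=1.0, beta_ap=0.0)
# p_j equals the logistic of the value difference, sigmoid(1)
```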

Alternative models tested

Request a detailed protocol

We tested several alternative models that did not explain the data as well as the winning model described above (full details in Figure 4—figure supplement 3). First, we tested models that included only one of the strategies described above (i.e., only MB, only GP, and only MF). We then tested models in a stepwise fashion detailed in Figure 2, which demonstrated that adding each strategy contributed to explaining unique variance in the data.

We additionally tested a model in which differences between reward seeking and punishment avoidance were captured by the learning process rather than by the utilization of the learned knowledge. To do so, we endowed the model with different MB and GP learning rates for punishment and reward features (αMBreward, αMBpunish, αGPreward, αGPpunish) and a single utilization weight per system (βMB and βGP).

With regard to the MB strategy, we additionally tested a model where learning from counterfactual outcomes was implemented with the same learning rate as learning from the outcomes of one’s actions.

With regard to resource reallocation, we additionally tested models where it occurred in just the GP utilization weights, or in both GP and MB utilization weights (in the same fashion described in Equation 4). After finding that the data were best explained by the model where resource reallocation occurred only in the MB system, we tested whether resource reallocation changed from trial to trial as a function of recently experienced goals. That is, we examined whether individuals recruit more resources toward the most recently experienced goal, which could differ within a given task block.

With regard to the MF strategy, we tested a model where goals were encoded as part of its state representation (G-MF). Specifically, action values were learned separately for trials with punishment avoidance goals (QGMFpunish, press j and QGMFpunish, press g) and reward seeking goals (QGMFreward, press j and QGMFreward, press g). In this version of an MF strategy, experienced outcomes only influence decision-making under the goal in which they were experienced. The main way it differs from an MB strategy is that learning relevant to a particular goal occurs only when that goal is active. Thus, Q values cannot track feature probabilities changing during nonmatched goal trials (e.g., how reward feature probabilities might shift during punishment avoidance trials). This may be one reason why it was inferior to the best-fitting model. Similar to the best-fitting model, this model included separate utilization weights (βG-MFreward and βG-MFpunish) and a single learning rate. GP and perseveration strategies were included as in the best-fitting model, and resource reallocation was applied to the G-MF strategy in the same way as described in Equation 4.

Model fitting

Request a detailed protocol

Models were fit with a hierarchical variant of the expectation–maximization algorithm known as iterative importance sampling (Bishop, 2006), which has been shown to provide high parameter and model recoverability (Michely et al., 2020b; Michely et al., 2020a). The priors for this model-fitting procedure largely do not affect the results, because the procedure iteratively updates priors via likelihood-weighted resampling in order to converge on the parameter distributions that maximize the integrated BIC, an approximation of model evidence. Note, all parameters had weakly informative priors (see Figure 2—figure supplement 2). Specifically, the fitting procedure works by (1) drawing 100,000 samples from all group-level parameter distributions for each participant, (2) deriving participant-level likelihoods for each sampled parameter, (3) resampling parameters after weighting each sample by its likelihood, and (4) fitting new hyperparameters to the overall distribution of resampled parameter values. This process continues iteratively until the integrated BIC of the new parameter settings does not exceed that of the previous iteration’s parameter settings.
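The core of step (3), likelihood-weighted resampling, can be illustrated with a toy sketch (ours, not the authors' implementation): samples drawn from a prior are reweighted by their likelihood and resampled, concentrating the sample set around high-likelihood parameter values.

```python
import math
import random

# Toy sketch of one likelihood-weighted resampling step: draw parameters
# from a prior, weight each sample by its data likelihood, and resample
# with replacement to approximate the posterior.

def resample_step(prior_samples, likelihood_fn, seed=0):
    rng = random.Random(seed)
    weights = [likelihood_fn(s) for s in prior_samples]
    return rng.choices(prior_samples, weights=weights, k=len(prior_samples))

# Uniform prior over [0, 1]; a (hypothetical) likelihood peaked at 0.7:
prior = [i / 100 for i in range(101)]
posterior = resample_step(prior, lambda x: math.exp(-(x - 0.7) ** 2 / 0.01),
                          seed=1)
mean_post = sum(posterior) / len(posterior)
# the resampled mean shifts from the prior mean (0.5) toward 0.7
```

In the full procedure, this step is applied per participant and the resampled values are then used to refit group-level hyperparameters.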

Model and parameter recoverability

Request a detailed protocol

To verify that the experiment was appropriately designed to dissociate between the tested models and their parameter values, we simulated experimental data from the best-fitting and reduced models and successfully recovered key inverse temperature parameters (all correlations above 0.58; average correlation = 0.79). The data-generating model was recovered 10 out of 10 times against the next best-fitting model (see Figure 4—figure supplement 2).

Simulating mechanism-agnostic stay-switch behavior

Request a detailed protocol

In order to examine model predictions (Figure 2), we used each given model to simulate experimental data from 400 participants, each time generating a new dataset by setting model-relevant beta parameters to 5, learning-rate parameters to 0.2, and all other parameters to 0. We then computed the proportion of trials in which the model chose a different action than on the previous trial. This ‘switch probability’ was computed for each combination of the previous and current trials’ goals and the features observed on the previous trial. To verify that the direction and significance of key effects did not differ across task versions, we separately fit our Bayesian logistic regression noted above to the subset of subjects that performed each task version; all effects held, and to a remarkably similar degree, in both task versions (see full results in Supplementary file 1).
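The basic switch-probability statistic is straightforward to compute; a minimal sketch (without the goal/feature conditioning used in the paper):

```python
# Sketch of the mechanism-agnostic 'switch probability': the fraction of
# transitions on which the chosen action differs from the previous trial's.

def switch_probability(choices):
    switches = sum(a != b for a, b in zip(choices, choices[1:]))
    return switches / (len(choices) - 1)

p_switch = switch_probability(['g', 'g', 'j', 'j', 'g'])
# 2 switches over 4 transitions
```

In the analysis proper, this proportion is computed separately within each bin defined by the previous and current goals and the previously observed features.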

Simulating the optimality of each computational strategy

Request a detailed protocol

We simulated artificial agents playing the exact task 2000 times and plotted the mean reward earned. Each artificial agent was also endowed with a learning rate for feature probabilities, sampled from a grid of values over the 0–1 range in 0.02 increments. For each type of agent, we set the utilization weights of the relevant strategy to 5 and the utilization weights of unused strategies to 0.

Testing differences between reward seeking and punishment avoidance parameters

Request a detailed protocol

As a consequence of the iterative nature of the model-fitting procedure, parameters for individual participants are not independently estimated, precluding the use of Bayesian or frequentist parametric tests. We therefore used nonparametric tests to compute unbiased p values. Due to a heavy positive skew in the group-level distributions of utilization weight parameters (Figure 4B and C), we compared their median levels. We note that the skew in inverse temperature parameters is to be expected given that their Gamma prior distributions are inherently skewed (Gillan et al., 2016; Sharp et al., 2021). We thus generated null distributions of median differences in utilization weights for both MB and GP strategies. To do so, we ran our hierarchical model-fitting procedure 300 times on 300 simulated datasets that assumed reward and punishment utilization weights were sampled from the same distribution (the null hypothesis). The utilization weights that best fitted the simulated data were used to generate the null distribution of median differences, and we computed p values for the median differences in the empirical data according to how extreme the empirical value was with respect to this null distribution. Each simulated dataset comprised the same number of participants as the original sample (n = 192), with both parameters sampled with replacement from a joint distribution representing the null hypothesis that the two parameters are equal. This joint distribution was derived by running our model-fitting procedure on the empirical data for one iteration to obtain true posteriors at the participant level, and pooling the participant-level median estimates of both parameters of interest (e.g., the utilization parameters for MB reward and MB punishment). All other parameters were drawn from the group-level distributions derived by fitting the winning model to participants’ data.

Testing significance of resource reallocation parameters

Request a detailed protocol

We tested the difference of resource reallocation parameters (βchangeReward and βchangePunishment in Equation 4) from zero using a permutation test, wherein we generated a null distribution by shuffling the labels denoting task block within each participant and recording the mean for each change parameter. We then generated p-values for the empirical means of change parameters by computing the proportion of the null distribution exceeding the empirical value.
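The logic of this within-participant label-shuffling test can be sketched as follows. Everything below is illustrative (our own function names and toy data), not the authors' code.

```python
import random

# Sketch of a permutation test for a block effect: shuffle the two block
# labels within each participant and rebuild a null distribution of the
# mean within-participant difference.

def permutation_p_value(per_block_values, n_perms=2000, seed=0):
    """per_block_values: list of (reward_rich_value, punishment_rich_value)
    pairs, one per participant."""
    rng = random.Random(seed)
    n = len(per_block_values)
    observed = sum(a - b for a, b in per_block_values) / n
    null = []
    for _ in range(n_perms):
        total = 0.0
        for a, b in per_block_values:
            if rng.random() < 0.5:  # randomly swap the block labels
                a, b = b, a
            total += a - b
        null.append(total / n)
    # one-sided p value: fraction of null statistics at least as extreme
    return sum(x >= observed for x in null) / n_perms

# 20 simulated participants with a consistent block difference:
p = permutation_p_value([(1.0, 0.0)] * 20, n_perms=500, seed=1)
```

With a consistent within-participant difference, almost no shuffled dataset reaches the observed statistic, so p is near zero.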

Behavioral signatures: Bayesian logistic regression

Request a detailed protocol

The regression sought to explain whether participants stayed (coded as 1) or switched (coded as 0) on trial ‘t’, which we refer to as ‘choicet’ in Equation 9, for the subset of trials where the current goal differed from the goal encountered on the previous trial. Since current and previous goals are perfectly anticorrelated in such trials, the main effect of goal was simply encoded as:

goalt−1 =
   1   if last goal = reward seeking
  −1   if last goal = punishment avoidance

GP effects were modeled by variables that encoded whether features were observed for both chosen and unchosen actions last trial with the following encoding scheme (here for reward):

rwdt−1 =
   1   if reward feature observed only for chosen action
   0   if outcomes were the same for chosen and unchosen actions
  −1   if reward feature observed only for unchosen action

MB effects were modeled by the interaction of the GP terms and the current goal as follows (here again for the MB reward effect):

rwdt−1 × goalt =
   1   if rwdt−1 = 1 and goalt−1 = −1
   0   if goalt−1 = 1
  −1   if rwdt−1 = −1 and goalt−1 = −1

Last, we modeled MF effects as the interaction between reward and punishment features observed for chosen actions and the last goal faced (here, for MF reward effects):

rwdchosen,t−1 × goalt−1 =
   1   if rwdchosen,t−1 = 1 and goalt−1 = 1
   0   if goalt−1 = −1
  −1   if rwdchosen,t−1 = −1 and goalt−1 = 1

The dissociation between this MF signature and the MB signature described above relies on the insensitivity of the MF strategy to counterfactual outcomes, which possess no present value.
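The goal and feature coding for these goal-switch trials can be sketched compactly. This is an illustrative helper of our own (hypothetical names), covering the main-effect and MB regressors for reward; on these trials the current goal is simply the flip of the previous one, so the MB reward term is nonzero only when the previous goal was punishment avoidance (goalt−1 = −1).

```python
# Sketch of the regressor coding used in the logistic regression, for
# trials where the current goal differs from the previous trial's goal.

def encode_regressors(last_goal, rwd_chosen, rwd_unchosen):
    """last_goal: 'reward' or 'punish'; rwd_chosen / rwd_unchosen: 1 if a
    reward feature was observed for that action last trial, else 0."""
    goal_t1 = 1 if last_goal == 'reward' else -1
    if rwd_chosen == rwd_unchosen:
        rwd_t1 = 0
    else:
        rwd_t1 = 1 if rwd_chosen else -1
    # MB reward term: counts only when the *current* goal is reward seeking,
    # i.e. (on goal-switch trials) when goal_t1 == -1
    mb_rwd = rwd_t1 if goal_t1 == -1 else 0
    return {'goal_t1': goal_t1, 'rwd_t1': rwd_t1, 'rwd_x_goal_t': mb_rwd}

regs = encode_regressors('punish', rwd_chosen=1, rwd_unchosen=0)
# last goal was punishment avoidance, reward seen for the chosen action:
# goal_t1 = -1, rwd_t1 = 1, and the MB term is active (1)
```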

We included all independent variables in a Bayesian mixed-effects logistic regression as follows:

(9) p(choicet = 1) = logistic( β1·intercept + β2·goalt−1
      + β3·rwdt−1 + β4·punt−1                                   [goal perseveration]
      + β5·rwdt−1×goalt + β6·punt−1×goalt                       [model-based]
      + β7·rwdchosen,t−1×goalt−1 + β8·punchosen,t−1×goalt−1 )   [model-free]

Posterior probability distributions of each effect were estimated using a sampling procedure in the BAyesian Model-Building Interface (Bambi) in Python (Capretto et al., 2020), which is a high-level interface to the PyMC3 Bayesian modeling software. The default sampler in Bambi is an adaptive dynamic Hamiltonian Monte Carlo algorithm, an instance of a Markov chain Monte Carlo sampler. In all models, all estimated effects had good indicators of reliable sampling from the posterior, including R-hat below 1.1 and effective sample size above 1000 for all parameters. Note, Equation 9 is written at the participant level. Each effect was drawn from a normal group-level distribution whose mean and variance were drawn from prior distributions, estimated by Bambi’s default algorithm, which is informed by implied partial correlations between the dependent and independent variables and has been demonstrated to produce weakly informative but reasonable priors (Capretto et al., 2020). For hypothesis testing, we compared the 95% most credible parameter values (i.e., the 95% highest density intervals) to a null value of 0.

Data availability

All data are available in the main text or the supplementary materials. All code and analyses can be found at: https://github.com/psharp1289/multigoal_RL, copy archived at swh:1:rev:1cf24428da17e8bcb2fab6d0ff9a7a59ee1586f7.

References

    1. Book
    1. Bishop CM
    (2006)
    Pattern Recognition and Machine Learning
    Springer.
  1. Conference
    1. Casillas A
    2. Clark LA
    (2000)
    The Mini mood and anxiety symptom questionnaire (Mini-MASQ)
    72nd annual Meeting of the Midwestern Psychological Association.
  2. Book
    1. Pinker S
    (1997)
    How the Mind Works
    New York: Norton.
  3. Book
    1. Sutton RS
    2. Barto AG
    (2018)
    Reinforcement Learning: An Introduction
    MIT press.

Decision letter

  1. Claire M Gillan
    Reviewing Editor; Trinity College Dublin, Ireland
  2. Christian Büchel
    Senior Editor; University Medical Center Hamburg-Eppendorf, Germany
  3. Claire M Gillan
    Reviewer; Trinity College Dublin, Ireland

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Humans perseverate on punishment avoidance goals in multigoal reinforcement learning" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Claire M Gillan as Reviewing Editor and Reviewer #3, and the evaluation has been overseen by Christian Büchel as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

The reviewers were overall favourably disposed to this manuscript – it was felt that the paper extends previous work on how humans use a mix of learning strategies to make decisions in cognitively demanding environments to show that humans flexibly adjust goal-directed learning based on statistics of the environment but inflexibly engage in punishment avoidance regardless of current goals. The task design and model are novel and interesting. However, all 3 reviewers had substantial concerns (i) aspects of the behavioural analysis (modelling), (ii) the interpretations of behavioural effects, and perhaps most substantially, (iii) the extension to psychopathology. I have outlined these essential revisions below by way of summary. Individual reviewers made further suggestions that can be considered optional, though we encourage that they also be addressed as they will strengthen the paper.

Essential revisions:

1) All three reviewers felt that the clinical analyses, albeit interesting, were not approached with the same rigour and clarity as the analyses of behaviour preceding them (which were excellent). A revision must address this comprehensively, including (i) how non-independence of parameter estimates influence clinical analyses, (ii) the inclusion of clinical covariates without first reporting bivariate effects, (iii) correction for multiple comparisons should be more extensive, (iv) visualisation of all clinical effects not just select ones, (v) more complete reporting with respect to all clinical measures gathered (including analysis of somatic anxiety). Related to this, the authors may wish to consider in the discussion that there are potential difference between non-clinical and clinical cohorts in terms of correlation of clinical measures, though the data are still a bit mixed:

Imperiale, M. N., Lieb, R., Calkins, M. E., and Meinlschmidt, G. (2021). Multimorbidity networks of mental disorder symptom domains across psychopathology severity levels in community youth. Journal of psychiatric research, 141, 267-275.

Groen, R. N., Wichers, M., Wigman, J. T., and Hartman, C. A. (2019). Specificity of psychopathology across levels of severity: a transdiagnostic network analysis. Scientific reports, 9(1), 1-10.

2) With respect to the modelling of behaviour, (i) more information about parameter recovery is requested, and (ii) concerns were raised about parameter estimation more generally given the skew evident in the inverse temperatures, coupled with the multicollinearity. More information should be provided here, including proportion of recovered parameters that can capture the main parameter-based results of the paper.

3) The interpretation of the key behavioural effects: could the authors defend the interpretation that GP is not an actual strategy, but rather a noisy or distracted attempt at MB learning. It was felt that it is rather a big claim to say that the GP approach is a reasonable third strategy on top of MF/MB learning and if this interpretation is to be maintained, the authors need to back this up by showing it is something more than just an imprecise MB approach. This ties in with additional concerns about the exclusion of participants that did not obey certain rules of the task. It was not clear how their exclusion was justified given a valid interpretation of some of these key effects may be that (i) people find it hard to keep instructions in mind in complex tasks, (ii) people may be utilising strategies that you have not defined and are not well understood but are nonetheless real – e.g. belief that reward probabilities / 'luck' switches from trial to trial. Other concerns regarding exclusion were that they appear to be asymmetric with respect to reward/punishment conditions, suggesting these data are meaningful.

Reviewer #1 (Recommendations for the authors):

1. Psychopathology analyses.

a. If the authors wish to make a connection to psychopathology, reporting the relationship between worry alone – rather than controlling for other overlapping symptom measures and model parameters – would be more appropriate. The recommendations for testing multiple, partially overlapping psychopathology measures in this paper may be helpful: DOI: 10.1177/21677026211017834

b. An alternate approach would be to focus this paper on the main findings about learning strategies and to save relationships to psychopathology for a future paper with a more appropriate sample.

2. Parameter-based analyses.

a. Providing more information on parameter recovery is needed. In particular, showing the proportion of recovered parameters that can capture the main parameter-based results of the paper (Figure 2C/D) would show that these findings reflect true underlying parameter differences rather than artifacts of model estimation.

b. If the authors retain the psychopathology analyses, they should be conducted in a way that does not assume independence of parameter estimates.

c. Alternatively, the analyses using relative model fits and trialwise regressions provide most of the information needed for the conclusions of the paper. The parameter-based analyses could be omitted with the focus instead on these other kinds of analyses.

Reviewer #2 (Recommendations for the authors):

The authors used the term "predicted" quite a bit to describe associations. I don't think this is justified (they haven't really done any predictive analyses).

If I understand correctly, the same 4 random walks were used for all participants (randomised between the 4 associations). Of the two shown, one looks much more stable than the other. It would be useful to see all 4 walks to see how comparable they are (if I am correct that the same 4 are used for all participants). If the walks are very different, should their allocation to the associations be controlled for in the analysis?

It would be useful to report the relationship between worry and the block effect (i.e. you suggest high worry is associated with higher GP/lower MB for losses-do worried people adapt to changes in the base rates of the outcomes?).

Reviewer #3 (Recommendations for the authors):

Well done on an interesting read and a contribution that will be informative for a lot of researchers. I have some suggestions to improve the paper.

All analyses with the 3 clinical factors should be presented in full; including supplementary figures if possible. Simple associations should be carried out before adding covariates to assist the reader in interpreting these findings and in generating hypotheses based on them. OCD is said to be not related to parameters at p=.08, while worry is at p=0.04 (uncorrected i guess more like p=0.02 for the latter), these are not likely to be different from one-another. And they may depend on the inclusion of these variables in the same model. Reader needs more transparency around these effects and any claims of specificity need more support. The data presented actually suggests the opposite.

Relatedly, the result in relation to worry, the effect is marginal at p=.04. While 2 multiple comparisons are controlled for, this is a fairly liberal decision given several tests were conducted and reported (i.e. GP MB and MF for punishment/reward = 6 at least; plus the 3 clinical scales = 18 etc). I'd encourage the authors to report all of the associations in a table, correct for multiple comparisons. This will serve the same purpose of suggesting the most interesting avenue for future research but also give the reader a fuller view on specificity of this to worry. This exploratory framing for the clinical effects does not detract from the main contribution of the paper or the potential for this to be especially interesting for 'worry' – it would just make them clearer and let the reader decide that for themselves a bit more.

There needs to be a bit more done with respect to relating the clinical variables to the model parameters. I would have thought this would be best placed within the hierarchical model itself. Alternatively, I wonder if there is a point-estimate that could be generated that is more flexible and less dependent on the overall group effects and other parameter values.

The authors describe issues with collinearity of the parameter values. Can a correlation matrix in the supplement be included that reports these (I think currently you can sort of see it based on simulated vs real data, but this is not the same as correlating real vs real across params).

I strongly encourage all subjects are retained (though i feel less strongly about excluding those not completing enough trials, 90% even seems a bit harsh/wasteful of data). If not, then a clear justification for why the strategy or approach of these subjects is not an accurate reflection of potentially the decision making preferences of 22% of the population. More standard indicators of inattentive responding focus on RTs, overly rigid responding that renders modelling choice impossible. Not clear why these were not used here as they seem better justified indicators of inattentive subjects. At the risk of belabouring the point(!), defining these subjects as 'not understanding instructions' could be applied to many of the key findings of this paper (i.e. avoidance perseveration suggests they don't pay attention to the current goals etc). So I think this practice is not ideal.

https://doi.org/10.7554/eLife.74402.sa1

Author response

Essential revisions:

1) All three reviewers felt that the clinical analyses, albeit interesting, were not approached with the same rigour and clarity as the analyses of behaviour preceding them (which were excellent). A revision must address this comprehensively, including (i) how non-independence of parameter estimates influence clinical analyses, (ii) the inclusion of clinical covariates without first reporting bivariate effects, (iii) correction for multiple comparisons should be more extensive, (iv) visualisation of all clinical effects not just select ones, (v) more complete reporting with respect to all clinical measures gathered (including analysis of somatic anxiety). Related to this, the authors may wish to consider in the discussion that there are potential difference between non-clinical and clinical cohorts in terms of correlation of clinical measures, though the data are still a bit mixed:

Imperiale, M. N., Lieb, R., Calkins, M. E., and Meinlschmidt, G. (2021). Multimorbidity networks of mental disorder symptom domains across psychopathology severity levels in community youth. Journal of psychiatric research, 141, 267-275.

Groen, R. N., Wichers, M., Wigman, J. T., and Hartman, C. A. (2019). Specificity of psychopathology across levels of severity: a transdiagnostic network analysis. Scientific reports, 9(1), 1-10.

We thank the reviewers and editor for thoughtful comments, including suggestions for improving the clarity and comprehensiveness of our clinical analyses. Our approach to addressing this set of concerns is twofold: Firstly, we followed Reviewer 3’s suggestion (see Reviewer comment 3.7) to re-frame the clinical analyses as exploratory, requiring further testing before definitive conclusions can be drawn. Secondly, in accordance with the reviewers’ suggestions, we substantially expanded the breadth and clarity of the reported analyses. We next detail these two sets of modifications.

A. Re-framing clinical analyses as exploratory and preliminary

To clarify that present clinical analyses are exploratory, and further investigation is required to test the validity of the findings, we implemented the following changes to the text:

I. A revision of the Abstract to make clear that inferences about worry and punishment perseveration are preliminary:

“Importantly, we show preliminary evidence that individuals with chronic worry may have difficulty disengaging from punishment avoidance when instructed to seek reward. Taken together, the findings demonstrate that people avoid punishment less flexibly than they pursue reward. Future studies should test in larger samples whether a difficulty to disengage from punishment avoidance contributes to chronic worry.”

II. A revision of the Introduction, in which we now state that our analyses on psychopathology should be regarded as exploratory, and require further testing in larger samples with more targeted hypotheses (p.4):

“Finally, in a series of exploratory analyses, we determined whether and how anxious individuals express a preference for punishment avoidance goals. In so doing, we found preliminary evidence that the degree of reliance on a goal-perseveration strategy to avoid punishment was positively associated with dispositional worry, which appears unique to those expressing worry and not to individuals with obsessive-compulsive or somatic anxiety symptoms.”

III. A revision of the Discussion to emphasize the tentative nature of conclusions we can draw regarding our analyses on worry, and also to consider that relationships between symptoms and cognitive indices may differ in a clinical population (p.18):

“Given that present results are preliminary in nature, future studies will need to test a prediction that chronic worry is associated with punishment perseveration in a larger sample. This should also include testing whether this association holds in a clinical population, as variation in symptoms in a clinical population may relate to punishment perseveration differently (Imperiale et al., 2021; Groen et al., 2019).”

B. More comprehensive reporting of clinical analyses

To further address the reviewers’ concerns, we expanded the reporting of the clinical analyses as follows:

I. We now report the adjusted p-values using a family-wise error correction approach for exploratory research (Rubin, 2017), and we explicitly note that correction typically employed in confirmatory research would render the results insignificant. The cited exploratory approach defines the family of tests by the number of tests within a given exploratory hypothesis. In the present set of analyses, we consider two exploratory hypotheses: (1) that anxiety is associated with greater GP punishment and (2) that anxiety is associated with greater MB punishment. We multiplied α level by 2 for both hypotheses because we explored whether these processes were associated with either somatic anxiety or chronic worry, controlling for co-occurring psychopathology and traits of non-interest (IQ, attention and effort). Our approach, and its associated caveat, are now noted in Results (p.15):

“It is important to note that all aforementioned p-values testing our key hypotheses (Table 2B) are corrected for multiple comparisons using a correction procedure designed for exploratory research (Rubin, 2017), which only controls for number of statistical tests within each hypothesis. Using a more conservative Bonferroni error correction for all 4 regression models, as typically employed in hypothesis-driven confirmatory work (Frane, 2020), resulted in a p-value for the key effect of worry and punishment perseveration that no longer passed a conventional significance thresholds (p=0.08). Thus, future work with a more targeted, hypothesis-driven approach needs to be conducted to ensure our tentative inferences regarding worry are valid and robust.”

II. To make transparent the associations between key variables in the model and forms of psychopathology, we now present in the main text and in new Figure 5 all relevant bivariate relationships, including with somatic anxiety. We have thus added the following to Results (p.14):

“We first report the bivariate relations between each form of psychopathology and inverse temperature parameters reflecting tendencies to utilize MB and GP punishment avoidance. Given that individuals with OCD and anxiety symptoms may over-prioritize threat detection, it is conceivable that there is a relationship between all three forms of psychopathology and model-based punishment avoidance. However, we found no significant or trending relationships between any form of psychopathology and model-based control for punishment avoidance (Figure 5A, left column). An alternative possibility is that individuals with anxiety suffer from a dysregulation in goal pursuit, reflecting a failure to disengage punishment avoidance when instructed to do so. On this basis, we explored whether worry and somatic anxiety are positively associated with goal-perseveration for punishment avoidance. In so doing we found initial evidence of a trending relationship between the tendency to worry and punishment avoidance perseveration (B=2.15, t = 1.4, p=0.16; Figure 5A, right column).”

III. The reviewers correctly point out that individual parameters were non-independently estimated. However, these only serve as predictors in the reported regression analyses, and thus, these analyses make no assumption that they were independently sampled (Hastie et al., 2009). We now clarify this issue in Methods (p.21):

“Although the computational parameters were non-independently estimated by our hierarchical model-fitting procedure, it is vital to note this does not compromise the validity of the least-squares solution to the regressions we ran. Indeed, Friedman, Hastie, and Tibshirani (2009) show that, ‘Even if the independent variables were not drawn randomly, the criterion is still valid if the dependent variables are conditionally independent given the [independent variable] inputs’ (p.44).”

Nevertheless, to mitigate doubt regarding the results of our regression analyses, we conducted nonparametric permutation tests, wherein we shuffled the task data with respect to the psychopathology scores. We include these nonparametric analyses in Results (p.14):

“For each regression model, we computed p-values using a nonparametric permutation test wherein we shuffled the task data with respect to the psychopathology scores, repeating the analysis on each of 10,000 shuffled datasets to derive an empirical null distribution of the relevant t-statistics.”
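As an illustration, the shuffling procedure can be sketched as follows; this is our own minimal reimplementation (shuffling the psychopathology scores y relative to the task-derived predictors X is equivalent to shuffling the task data with respect to the scores), not the analysis code itself:

```python
import numpy as np

def permutation_p(X, y, j, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the t-statistic of predictor j
    in an OLS regression of y on X (with intercept)."""
    def t_stat(y_):
        Xd = np.column_stack([np.ones(len(y_)), X])
        beta, *_ = np.linalg.lstsq(Xd, y_, rcond=None)
        resid = y_ - Xd @ beta
        dof = len(y_) - Xd.shape[1]
        var_b = resid @ resid / dof * np.linalg.inv(Xd.T @ Xd)[j + 1, j + 1]
        return beta[j + 1] / np.sqrt(var_b)

    rng = np.random.default_rng(seed)
    t_obs = t_stat(y)
    # Empirical null: refit after shuffling y relative to X, n_perm times.
    null = np.array([t_stat(rng.permutation(y)) for _ in range(n_perm)])
    return float(np.mean(np.abs(null) >= np.abs(t_obs)))
```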

IV. To motivate use of covariates in the regression analyses, we now expound in Results on why we included each covariate. Additionally, we now validate the inclusion of covariates by showing that a regression with the covariates predicts left-out data better than regression without covariates (p.14-15):

“To provide a more specific test of our key hypotheses, we removed variance of non-interest in order to sensitize our analyses to unique relationships between forms of psychopathology and types of punishment avoidance. Firstly, generalized worry, as opposed to obsessive worry, is thought to be particularly associated with difficulty in disengaging from worry (Berenbaum, 2010), since it lasts significantly longer in both clinical (Dar and Iqbal, 2015) and community samples (Langlois, Freeston, and Ladouceur, 2000). Thus, we dissociated generalized from obsessive worry using the same approach taken in previous studies (Doron et al., 2013; Stein et al., 2010), namely, by including a measure of OCD symptoms as a control covariate. Controlling for OCD symptoms has the additional benefit of accounting for known relations between OCD and poor learning of task structure, reduced model-based control, and perseverative tendencies (Gillan et al., 2016; Seow et al., 2021; Sharp et al., 2021). Secondly, another potentially confounding relationship exists between worry and somatic anxiety (e.g., Sharp, Miller and Heller, 2015), likely reflecting a general anxiety factor. Thus, we isolated worry by controlling for somatic anxiety, as commonly done in studies seeking to quantify distinct relationships of worry and somatic anxiety with cognitive performance (e.g., Warren, Miller and Heller, 2021) or associated neural mechanisms (e.g., Silton et al., 2011). Finally, we controlled for covariance between computational strategies that might reflect general task competencies. This included the utilization of MB (including learning rates and inverse temperatures), since observed anticorrelations in the empirical data (Figure S7) between GP and MB may derive from causal factors such as attention or IQ, as well as a general tendency to mitigate cognitive effort by using less costly strategies (AP, MF and GP inverse temperatures; Figure S7).
This analysis showed a stronger relationship between worry and punishment perseveration (β=3.14 (1.38), t = 2.27, p=0.04, Figure 5C). No other significant relationship was observed between punishment perseveration or model-based punishment avoidance and psychopathology (Figure 5C). Ultimately, we validated the full model using a 5-fold cross-validation procedure, which showed that regressing worry onto the aforementioned covariates (using a ridge regression implementation) explains significantly more variance in left-out test-set data (R2=0.24) relative to the models of the bivariate relationships between worry and GP Punishment (R2=0.01) and MB Punishment (R2=0.00).”
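The cross-validation logic can be sketched with a closed-form ridge fit; the fold count follows the text, while the regularization strength and the simulated data in the usage example stand in for the actual implementation and participant data:

```python
import numpy as np

def cv_r2_ridge(X, y, alpha=1.0, n_folds=5, seed=0):
    """Mean test-set R^2 from k-fold cross-validated ridge regression."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Xtr, ytr = X[train], y[train]
        mu, sd = Xtr.mean(0), Xtr.std(0) + 1e-12   # train-set statistics only
        Ztr, Zte = (Xtr - mu) / sd, (X[test] - mu) / sd
        # Closed-form ridge solution; intercept = training mean of y.
        w = np.linalg.solve(Ztr.T @ Ztr + alpha * np.eye(X.shape[1]),
                            Ztr.T @ (ytr - ytr.mean()))
        pred = Zte @ w + ytr.mean()
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```

Comparing the full covariate model against the bivariate models then amounts to comparing their returned mean test-set R² values.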

2) With respect to the modelling of behaviour, (i) more information about parameter recovery is requested, and (ii) concerns were raised about parameter estimation more generally given the skew evident in the inverse temperatures, coupled with the multicollinearity. More information should be provided here, including proportion of recovered parameters that can capture the main parameter-based results of the paper.

We thank the reviewers for raising these concerns regarding parameter multicollinearity, skew, and recoverability. To address these, we now show that multicollinearity involving the key inverse temperature parameters is low, and clarify that skew is expected given the inherently skewed γ priors that we (and many others) use to model these parameters. Additionally, we demonstrate that these parameters are highly recoverable, in line with levels reported in extant computational modelling literature. The changes to the manuscript include:

I. Multicollinearity. We show that multicollinearity involving key parameters of interest is relatively low, within acceptable levels with respect to prior studies. To report levels of multicollinearity more comprehensively, we now include a full heatmap of correlations between fitted parameters in Figure 4—figure supplement 4. The only large correlation was between the learning rates for factual and counterfactual feature outcomes (r=0.68). Modest correlations were also observed between model-based inverse temperatures (r=0.37) and between model-based change parameters (r=0.34).

Moreover, we clarify the levels of multicollinearity between fitted parameters in Methods (p.21):

“We report all bivariate correlations between fitted parameters in Figure 4—figure supplement 4.”

II. Skew. We clarify that skew is expected for key inverse temperature parameters, given that they were modelled with γ prior distributions that are inherently skewed. Indeed, extant studies show similar levels of skew in parameter distributions modelled using γ priors (Gillan et al., 2016; Sharp, Dolan and Eldar, 2021). We now clarify this issue in Methods (p.25):

“We note that the skew in inverse temperature parameters is to be expected given that their γ prior distributions are inherently skewed (Gillan et al., 2016; Sharp, Dolan and Eldar, 2021).”

III. Recoverability. As the Reviewers point out, an inability to dissociate between certain parameters is a common problem in reinforcement learning modelling (e.g., Palminteri et al., 2017). Thus, we now provide recoverability levels for all model parameters, and how they trade off against each other, in new Figure S4. Recoverability of the four parameters of interest, MB and GP for reward and punishment, spans the correlation range of 0.76 to 0.91, consistent with levels of parameter recovery in extant studies (Haines, Vasilleva and Ahn, 2018; Palminteri et al., 2017). Additionally, all between-parameter (i.e., off-diagonal) correlations involving the key model parameters were low (i.e., weaker than 0.16), showing that the experimental design and model-fitting procedure were capable of successfully dissociating between these parameters (Wilson and Collins, 2019). This is now reported in Results (p.12):

“We validated that MB, GP and MF inverse temperature parameters were recoverable from simulated experimental data, and that the degree of recoverability (i.e., the correlations of true and recovered parameter values, which were between 0.76 and 0.91; Figure S4) was in line with extant reinforcement learning modelling studies (Haines, Vasilleva and Ahn, 2018; Palminteri et al., 2017). Similarly, low correlations between estimated parameters (all weaker than 0.16) demonstrate that our experimental design and model-fitting procedure successfully dissociated between model parameters (Wilson and Collins, 2019).”
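The recovery procedure (simulate agents with known parameters, refit the model, and correlate true with recovered values) can be illustrated with a deliberately simplified single-parameter example; this sketch uses a bare softmax choice rule, not the full task model, and the agent counts and grid are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_trials = 50, 300
values = rng.normal(size=(n_trials, 2))          # per-trial values of two actions
true_betas = rng.gamma(2.0, 2.0, size=n_agents)  # skewed, as with gamma priors

def simulate(beta):
    """Choices sampled from a softmax over the two action values."""
    p1 = 1.0 / (1.0 + np.exp(-beta * (values[:, 1] - values[:, 0])))
    return (rng.random(n_trials) < p1).astype(int)

def fit(choices):
    """Maximum-likelihood inverse temperature via a simple grid search."""
    grid = np.linspace(0.05, 25.0, 500)
    logits = grid[:, None] * (values[:, 1] - values[:, 0])[None, :]
    p1 = 1.0 / (1.0 + np.exp(-logits))
    p_choice = np.where(choices[None, :] == 1, p1, 1.0 - p1)
    return grid[np.argmax(np.log(p_choice + 1e-12).sum(axis=1))]

# Recoverability = correlation between true and recovered parameter values.
recovered = np.array([fit(simulate(b)) for b in true_betas])
recovery_r = np.corrcoef(true_betas, recovered)[0, 1]
```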

3) The interpretation of the key behavioural effects: could the authors defend the interpretation that GP is not an actual strategy, but rather a noisy or distracted attempt at MB learning. It was felt that it is rather a big claim to say that the GP approach is a reasonable third strategy on top of MF/MB learning and if this interpretation is to be maintained, the authors need to back this up by showing it is something more than just an imprecise MB approach.

We thank the reviewers for an opportunity to both clarify our interpretation of how GP relates to MB and MF, as well as expand our argument that GP is a strategic heuristic, and not simply a noisy or distracted attempt at MB learning. Below, we begin by clarifying that, since it shares its data structures with MB, GP is best thought of as a heuristic alteration to MB learning. We then argue that this heuristic is indeed a resource rational strategy, not simply a goal-forgetting MB agent.

Firstly, a purely forgetful agent would not demonstrate the observed valence effect, which is rational for agents wishing to avoid potentially fatal punishment. Secondly, our experiment was specifically designed to prevent forgetting, in that the instructed goal was presented on the screen for the entire duration that participants deliberated over their decisions. Thirdly, by formalizing a model of MB forgetting, we show that a key prediction of that model – namely, that GP and MB utilization should be highly positively correlated – does not hold in our empirical data. Finally, we expound on the various ways that GP computations save costly resources while producing high-performance (in terms of reward earned) policies.

I. The GP system is a strategically modified model-based strategy.

We first clarify that GP corresponds to a MB agent that strategically avoids goal-switching, and therefore is not as distinct from MB as might have been suggested in the previous version of our manuscript. To emphasize this in the text, we have added the following to Results (p.6):

“An alternative strategy, that we term “goal-perseveration” (GP), might strike a better balance between simplicity and effectiveness. This strategy inherits the exact same knowledge of feature probabilities acquired by model-based learning, but simplifies action selection by persistently avoiding punishment and seeking reward, simultaneously, regardless of instructed goal. This, in principle, eliminates effortful goal-switching while utilizing all available information about the changing action-feature mapping. Thus, rather than constituting a separate decision system in its own right, GP is best thought of as a behavior produced by a strategic cost-saving MB agent.”

II. Forgetting the reward function would not predict a valence effect in GP.

A non-strategic forgetful MB agent would be just as likely to pursue the wrong goal regardless of which goal is presented on screen. This would be inconsistent with the significant valence effect that we observed, where participants tended to pursue the uninstructed goal predominantly during reward trials. The observed valence effect is consistent, instead, with a rational strategy proposed in prior work (Woody and Szechtman, 2011). Specifically, a punishment avoidance system should be far more attuned to false negatives (failing to detect a true threat) than its reward system counterpart because such missed attempts to avoid punishment could be fatal. We now highlight this point further in the Discussion (p.16):

“The strategic deployment of GP primarily towards punishment avoidance indicates such behavior is not merely a reflection of a noisy or forgetful MB system. Indeed, our finding that humans use less flexible computational strategies to avoid punishment, than to seek reward, aligns with the idea of distinct neural mechanisms supporting avoidance and approach behavior (McNaughton and Gray, 2000; Lang, Bradley and Cuthbert, 1998). Moreover, comparative ethology and evolutionary psychology (Pinker, 1997) suggest there are good reasons why punishment avoidance might be less flexible than reward seeking. Woody and Szechtman (2011) opined that “to reduce the potentially deadly occurrence of false negative errors (failure to prepare for upcoming danger), it is adaptive for the system to tolerate a high rate of false positive errors (false alarms).” Indeed, we demonstrate that in the presence of multiple shifting goals, perseverance in punishment avoidance results in false positives during reward seeking (Figure 3B), but avoids ‘missing’ punishment avoidance opportunities because of lapses in goal switching (Figure 3C). Future work could further test these ideas, as well as potential alternative explanations (Dayan and Huys, 2009).”

III. A model-based strategy that forgets the reward provides a worse account of participant choices than a combination of MB and GP strategies.

Reviewer 2 (comment 2.1) additionally suggested that a forgetful MB system could potentially explain the valenced effect of GP if forgetting of the goal set occurred more often on either reward or punishment trials. Although we argue that a dependence of forgetting on goal valence suggests that such forgetting would itself be strategic, in the interest of determining whether such an account could in principle explain our results, we now fit a new model (Forgetful + MB + MF + AP) that addresses this potential mechanism. This model included all components of the winning model, except for the GP system, and extended the model-based system to include two additional parameters, fR and fP, which determine the probability of forgetting the presented reward vector (which defines the goal) on reward trials and punishment trials, respectively. Specifically, according to the model, on each trial the participant replaced the instructed goal with the opposite goal with a fixed probability (either fR or fP, depending on whether the trial had a reward or punishment goal); for example, if the actual goal was punishment avoidance, the participant used the reward pursuit goal.

We found that this model fit the data worse than a model which included separate MB and GP controllers with no goal forgetting (MB + MF + GP + AP), thereby confirming that a model where the model-based controller forgets the current goal with different rates on reward and punishment trials does not account for our results supporting GP as well. We have added this modelling result in Figure 4 —figure supplement 3 and amended the figure to include the new BIC for this model.

In order to investigate what aspects of the data were better captured by the winning model, we used the forgetful model (Forgetful + MB + MF + AP) to simulate a new dataset, using the best-fit parameters for each participant. We predicted that because in this model both MB and (apparent) GP-like effects on choice emerge from a single MB strategy (that at times forgets the rewards), such effects should be correlated in the simulated dataset. We confirmed this prediction by fitting the GP model (MB + MF + GP + AP) to this simulated data, and showing that MB and GP parameters, within a valence, were indeed correlated. In contrast, when fit to the actual data, no correlation between MB and GP effects is detected, suggesting that these do not emerge from a single MB system, but instead reflect distinct task strategies. We present in Author response image 1 the empirical GP correlations from our data (left) and the simulated GP correlations generated by forgetful-MB agents (right):

Author response image 1
A comparison of GP-MB correlations for empirical data (left) and for data simulated using forgetful-MB agents (right).

We have added this text to our model comparison results detailed in Figure 4—figure supplement 4:

“Finally, we tested an alternative model where GP behavior may derive from a MB strategy that occasionally forgets the reward function (Forgetful-MB+MF+AP), allowing this forgetting to occur at different rates during reward and punishment goals. This model-based agent includes two additional parameters, fR and fP, which govern the probability of forgetting the presented reward function on reward pursuit trials and punishment avoidance trials, respectively. Thus, on each trial, the model replaces the instructed goal with the opposite goal (e.g., if the actual goal was [-1, 0], the participant used [0, 1]) with some fixed probability (either fR or fP, depending on the trial type). We again found that this model fit worse than the winning model, confirming that a model where the model-based controller forgets the current reward at different rates on reward and punishment trials does not account for our results supporting GP as well.”
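For concreteness, the goal-substitution mechanism can be sketched as follows (our own function and constant names, not the model code):

```python
import numpy as np

# Reward vectors over [punishment feature, reward feature], as in the text.
PUNISH_GOAL = (-1, 0)   # punishment-avoidance trials
REWARD_GOAL = (0, 1)    # reward-pursuit trials

def effective_goal(instructed, f_R, f_P, rng):
    """Goal the forgetful-MB agent actually plans with: with probability
    f_R (reward trials) or f_P (punishment trials), the instructed goal
    is swapped for the opposite goal."""
    if instructed == REWARD_GOAL:
        return PUNISH_GOAL if rng.random() < f_R else REWARD_GOAL
    return REWARD_GOAL if rng.random() < f_P else PUNISH_GOAL
```

With f_R = f_P = 0 this reduces to a standard MB agent; valence-dependent forgetting corresponds to f_R ≠ f_P.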

IV. Availability of reward information on screen makes forgetting of reward information implausible. In addition to the aforementioned empirical evidence for why a forgetful MB agent is unlikely to explain our results, we also clarify here that we designed the task so as to reduce this specific kind of forgetting. We did so by continuously presenting the instructed goal on the screen while participants made their choice (presented in Author response image 2). The instructed goal only disappeared once a choice was made.

Author response image 2
A depiction of what participants saw on the screen for the entire decision period in our multigoal reinforcement learning task.

Thus, any forgetting could be easily remedied by glancing at the screen. We now highlight in Results (p. 5) this additional reason to discount the possibility the participants forgot the current reward instructions:

“Note that the reward value of either feature was continuously presented throughout the choice deliberation time (Figure 1b), ensuring that there should be no reason for participants to forget the present trial’s goal.”

V. GP is a resource rational task strategy.

A necessary criterion for considering GP a heuristic strategy, as opposed to simply reflecting noise or error, is that it needs to fulfil a function that would cause it to be selected. This stands in contrast to errors, which do not fulfil such a function and which an agent would be better off avoiding. We suggest that GP fulfils the function of bypassing computational costs that make model-based RL computationally demanding, while still achieving good performance in the task. Our perspective here is inspired by the resource rationality framework, which argues that many heuristic deviations from seemingly rational strategies (in this task, model-based reinforcement learning), rather than reflecting errors, instead reflect a strategic balancing of task performance against the computational costs of implementation (Gigerenzer and Goldstein, 2011; Lieder and Griffiths, 2020).

To see how GP may constitute a resource rational strategy, consider that the key feature of this task which makes model-based RL difficult to utilize is that the goals switch between trials. When the goals switch, the model-based agent is required to change which feature predictions it uses to derive action values, which takes time. At the same time, feature predictions need to be constantly updated even on trials for which a feature was not relevant, creating a burdensome dissociation between information used for decisions and information used for learning. Like model-free RL, a GP agent entirely avoids these goal-switching costs, while obtaining substantially greater rewards than a model-free agent would. It does so by constantly using the same model-based feature predictions. Additionally, as noted above with regard to the observed valence effect, a GP agent that prioritizes avoiding losses can achieve this as well as model-based learning, while avoiding the relevant switching costs.

We have now elaborated in the Discussion (p. 14-15) how GP avoids the costs of MB control, and how it can constitute a resource rational heuristic that approximates model-based evaluation:

“GP may thus in fact constitute a resource-rational strategy (Lieder and Griffiths, 2020) for approximating MB control. To illustrate this, consider that model-based learning is computationally demanding in our task specifically because goals switch between trials. When the goals switch, a model-based agent must retrieve and use predictions concerning a different feature. Additionally, the agent needs to continuously update its predictions concerning features even when they are not presently relevant for planning. GP avoids these computationally costly operations by pursuing goals persistently, thus avoiding switching and ensuring that features are equally relevant for planning and learning. In this way, GP saves substantial computational resources compared to MB yet is able to perform relatively well on the task, achieving better performance than MF. Additionally, if a participant selectively cares about avoiding losses (for instance, due to loss aversion), GP can perform as well as MB. Thus, we propose the GP heuristic reflects a strategic choice, which can achieve good performance while avoiding the substantial resource requirements associated with model-based control. In this sense it fulfils a similar role as other proposed approximations to model-based evaluation including model-free RL (Sutton and Barto, 2018), the successor representation (Dayan, 1993; Momennejad et al., 2017), mixing model-based and model-free evaluation (Keramati, Smittenaar, Dolan, and Dayan, 2016), habitual goal selection (Cushman and Morris, 2015) and other identified heuristics in action evaluation (Daw and Dayan, 2014).”

This ties in with additional concerns about the exclusion of participants that did not obey certain rules of the task. It was not clear how their exclusion was justified given a valid interpretation of some of these key effects may be that (i) people find it hard to keep instructions in mind in complex tasks, (ii) people may be utilising strategies that you have not defined and are not well understood but are nonetheless real – e.g. belief that reward probabilities / 'luck' switches from trial to trial. Other concerns regarding exclusion were that they appear to be asymmetric with respect to reward/punishment conditions, suggesting these data are meaningful.

We thank the reviewers for pointing out the need to explain the rationale underlying our criteria for excluding participants. We first note that it is the norm in studies of learning and decision making to exclude participants whose performance is indistinguishable from pure guessing (e.g., Bornstein and Daw, 2013; Otto et al., 2013). Equivalently, in the present study our approach was to exclude only participants whose strategy was tantamount to performing at chance-level or below. We now show this by simulating each of the excluded strategies and measuring its performance. Furthermore, we measure the excluded participants’ actual accuracy and show that it was indeed indistinguishable from chance. In addition, we clarify that these participants’ fundamentally different model of the task makes it impossible to estimate the effects of interest for these participants. Finally, we conduct a sensitivity analysis in which we include these excluded participants and demonstrate that the key effects in our paper all hold.

I. Chance-level performance or worse among excluded participants and their strategies

We first verified through simulation that the strategies that we used as exclusion criteria yield equal or worse reward than purely guessing. Thus, we simulated the performance in our task of agents that either (i) treated reward features as punishment features, (ii) treated punishment features as reward features, or (iii) both (i.e., reversed the meaning of feature types), and compared them to agents that (iv) purely guess, (v) use both feature types in a model-based way, and finally (vi) use both feature types as intended but using a purely goal-perseveration strategy. We thus show that the strategies of the participants we excluded (i, ii, and iii) perform as poorly as, or worse than, purely guessing. This motivation for the exclusion criteria is now reported in Methods (p.19):

“These errors in following task structure are fundamental failures that result in average performance as poor as, or worse than, that of an agent that purely guesses which action to take at each trial (Figure 6).”

Secondly, we examined participants’ actual accuracy by comparing their choices to those an ideal observer model would make given the participants’ observations. In the excluded sample, participants on average chose correctly 49.65% of the time, whereas in the included group participants chose correctly 63.5% of the time (difference between groups: t = 9.66, p < 0.00001). We now add the following text to the Methods (p.19):

“Excluded subjects performed significantly worse in terms of choice accuracy. To derive accuracy, we computed the percentage of choices subjects made in line with an ideal observer that experienced the same outcome history (in terms of features) as each participant. On average, excluded subjects chose correctly 49.6% of the time, whereas included subjects chose correctly 63.5% of the time (difference between groups: t(190) = 9.66, p < 0.00001).”
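The accuracy measure and group comparison follow standard formulas; as a minimal sketch (the function names are ours, not the analysis code):

```python
import numpy as np

def accuracy(choices, ideal_choices):
    """Proportion of a participant's choices that match those an ideal
    observer would make given the same outcome history."""
    return float(np.mean(np.asarray(choices) == np.asarray(ideal_choices)))

def pooled_t(a, b):
    """Independent-samples t-statistic (pooled variance, df = n_a + n_b - 2)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb)))
```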

II. Including subjects that misunderstand feature types would add noise to the hypothesis tests

Including participants we previously excluded would inject considerable noise into our estimation of how reward vs. punishment feature types were differentially utilized. For example, estimates of how individuals avoided punishment features would include participants that treated these features as if they were rewards. Therefore, statistical tests of MB and GP goal valence differences would be contaminated by such subjects. In fact, testing whether punishment and reward goals were pursued to different degrees with a given strategy (which is the hypothesis of the study) is a nonsensical test for subjects that treated the task as if there was only one goal or that confused reward and punishment. Furthermore, given their distorted model of the task (where the task was reduced to one goal), these subjects had no incentive to use a more complex strategy than simple model-free learning, and thus their inclusion corrupts our ability to infer how subjects recruit different goal-directed systems for decision making. We now expand on these reasons in Methods (p.19):

“Additionally, including such subjects would reduce our sensitivity to estimating differences in the utilization of GP and MB for goals of differing valence, as such subjects treated the task as if there was only a single goal, or that the goals were opposite to their instructed nature. Moreover, given their model of the task, such subjects could approach the task optimally using a MF strategy, and thus would not be incentivized to use goal-directed strategies at all.”

III. Sensitivity Analysis

To mitigate any remaining concern about subject exclusion, we conducted a sensitivity analysis that aligns with Reviewer 3’s suggestion to determine if our key effects hold in the larger sample that includes subjects who incorrectly treated certain feature types as if they were another type. Modelling the larger sample (n=242; 98% retention) required us to allow negative inverse temperature parameters, which are needed to account for subjects who, for instance, treated a reward feature as if it were a punishment feature, and the converse. We thus replaced the γ priors for the inverse temperature parameters with normal distributions. In this larger sample, all key results in the main paper held, and in some cases strengthened. We now note in the Methods (p.20):

“To determine whether our relatively strict subject exclusion policy might have affected the results, we conducted a sensitivity analysis on a larger sample (n=248; 98% retention) including subjects that mistreated the instructed value of certain features. To account for these subjects’ behavior, we used normal priors to allow negative inverse temperature parameters. Fitting these revised models to our data, we first demonstrate that our winning model remained the best-fitting model among all models considered. Second, we show that the GP valence effect held and even came out stronger in this larger sample. Thus, the mean difference in GP utilization for punishment and reward goals was 0.24 in our original sample and 0.50 in the larger sample (p < 0.0001). Finally, we show the MB valence effect also held in this larger sample (original sample mean difference between MB reward and MB punishment = 2.10; larger sample mean difference = 1.27, both p-values < 0.0001).”

Reviewer #1 (Recommendations for the authors):

1. Psychopathology analyses.

a. If the authors wish to make a connection to psychopathology, reporting the relationship between worry alone – rather than controlling for other overlapping symptom measures and model parameters – would be more appropriate. The recommendations for testing multiple, partially overlapping psychopathology measures in this paper may be helpful: DOI: 10.1177/21677026211017834

We agree with the reviewer that presenting bivariate relationships between psychopathology and task strategies is important and now present in full bivariate relations between all dimensions of psychopathology and the computational parameters of interest, which we detail in response to Editor comment 1, Points II and IV. However, as noted above, we have principled reasons for controlling for co-occurring forms of psychopathology, and thus have chosen to present both analyses.

b. An alternate approach would be to focus this paper on the main findings about learning strategies and to save relationships to psychopathology for a future paper with a more appropriate sample.

We thank the reviewer for the suggestion, which speaks to the tentative nature of the psychopathology findings, which we agree are preliminary. However, we believe a more transparent alternative is to include these analyses and re-frame them as exploratory, per R3’s suggestion. Doing so will enable future studies to target worry and goal-perseveration a priori. We have thus amended our framing of the psychopathology analyses as exploratory in the Abstract, Introduction and Discussion. Each of the changes to the main text is detailed in response to Editor Comment 1, Re-framing clinical analyses as exploratory and preliminary.

2. Parameter-based analyses.

a. Providing more information on parameter recovery is needed. In particular, showing the proportion of recovered parameters that can capture the main parameter-based results of the paper (Figure 2C/D) would show that these findings reflect true underlying parameter differences rather than artifacts of model estimation.

b. If the authors retain the psychopathology analyses, they should be conducted in a way that does not assume independence of parameter estimates.

c. Alternatively, the analyses using relative model fits and trialwise regressions provide most of the information needed for the conclusions of the paper. The parameter-based analyses could be omitted with the focus instead on these other kinds of analyses.

Although we agree with the reviewer that there is substantial overlap between the model-agnostic regression and parameter-based analyses, we opted to retain both sets of analyses because only the computational modelling explains subjects’ choices beyond what can be explained by the previous trial’s observations alone. We therefore now emphasize the added value of analyzing the fitted parameters of the computational model in Results (p. 11):

“The presence of unique signatures of MB, MF, and GP decision strategies in the empirical data presents strong evidence for the use of these strategies, but the signature measures are limited to examining goal-switch trials and, within those trials, examining the impact of features observed on the very last trial. To comprehensively quantify the extent to which participants utilized each strategy for reward seeking and punishment avoidance, we next developed a series of computational models that aim to explain all participant choices given the features observed on all preceding trials.”
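As a rough illustration of the kind of model this describes (a schematic sketch only; the weights and value terms below are hypothetical, not the paper's actual parameterization), MB, MF, and GP action values can each enter a shared softmax with their own weight:

```python
import math

def hybrid_choice_prob(q_mb, q_mf, q_gp, b_mb, b_mf, b_gp):
    """Schematic mixture: each strategy's action values enter a shared
    softmax with its own inverse-temperature weight. A large b_gp relative
    to b_mb means choices lean on cached goal-perseveration values."""
    logits = [b_mb * mb + b_mf * mf + b_gp * gp
              for mb, mf, gp in zip(q_mb, q_mf, q_gp)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical punishment-avoidance trial where GP values favor action 1
probs = hybrid_choice_prob(q_mb=[0.6, 0.4], q_mf=[0.5, 0.5],
                           q_gp=[0.1, 0.9], b_mb=1.0, b_mf=0.5, b_gp=3.0)
```

Fitting the per-strategy, per-valence weights to all trials is what lets the model quantify strategy use beyond the last-trial signature measures.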

Reviewer #2 (Recommendations for the authors):

The authors used the term "predicted" quite a bit to describe associations. I don't think this is justified (they haven't really done any predictive analyses).

We apologize for the loose use of “predicted” to describe regression results. We have now changed this terminology to “positively/negatively associated with” throughout the article.

If I understand correctly, the same 4 random walks were used for all participants (randomised between the 4 associations). Of the two shown, one looks much more stable than the other. It would be useful to see all 4 walks to see how comparable they are (if I am correct that the same 4 are used for all participants). If the walks are very different, should their allocation to the associations be controlled for in the analysis?

To clarify, the task included two types of random walks: the first was more volatile (the best bandit switched once per block), while the second had more irreducible uncertainty (its probabilities remained significantly closer to 0.5 throughout), both of which make learning more difficult. Importantly, random walks were counterbalanced across subjects: in version 1 of the task, the reward feature followed the first type of random walk (i.e., more volatility) and the punishment feature followed the second type (i.e., more irreducible uncertainty). In task version 2, the feature:random-walk mapping was flipped.

Author response image 3
Random walks from task version 1.

Here, the reward feature took a more volatile walk, whereas the punishment feature had greater irreducible uncertainty. In task version 2 (given to the other half of participants) the feature:random walk mapping was flipped.
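The two walk types can be sketched as follows (illustrative parameters only; the schedules actually used in the task differ):

```python
import random

def noisy_walk(p0=0.5, sd=0.03, lo=0.35, hi=0.65, n=160, seed=0):
    """High irreducible-uncertainty walk: probabilities drift near 0.5,
    clipped at bounds so neither option ever becomes clearly good or bad."""
    rng = random.Random(seed)
    p, out = p0, []
    for _ in range(n):
        p += rng.gauss(0.0, sd)
        p = min(max(p, lo), hi)   # clip to bounds, keeping p near chance
        out.append(p)
    return out

def block_switch_walk(p_good=0.8, p_bad=0.2, n=160, switch_at=80):
    """High-volatility walk: the better option reverses once per block."""
    return [p_good if t < switch_at else p_bad for t in range(n)]

walk_uncertain = noisy_walk()
walk_volatile = block_switch_walk()
```

Both schedules impede learning, but for different reasons: the first keeps outcome probabilities ambiguous, the second makes learned values go stale at the switch point.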

To test whether the GP and MB valence effects differed as a function of which feature type was paired with a given random walk, we fitted the hierarchical logistic regressions quantifying the reported model-agnostic signatures (Figure 2C in text) twice, analyzing data from each task version (i.e., each feature:random-walk mapping) independently. This allowed us to compare the GP and MB valence effects across the two mappings. Across both task versions, the MB Reward > MB Punishment and GP Punishment > GP Reward signatures held. Moreover, the estimates were strikingly consistent across task versions, with substantial overlap in the estimated HDIs for each effect (Table S1; ‘HDI’ refers to the 94% highest density interval of the posterior, bounded by the 3% and 97% quantiles). We now report this new validation analysis in Methods (p. 25):

“We ensured there were no significant differences in the direction and significance of key effects across task versions by separately fitting our Bayesian logistic regression noted above to the subset of subjects that performed each task version. Doing so showed that all effects held and to a remarkably similar degree in both task versions (see full results in Supplemental Table 1).”
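Given the authors' stated definition of the 94% HDI as the interval bounded by the 3% and 97% posterior quantiles, the overlap check can be sketched as follows (with made-up posterior draws; this equal-tailed quantile interval follows their stated definition rather than a density-based HDI):

```python
import random
import statistics

def interval_94(samples):
    """94% credible interval bounded by the 3% and 97% quantiles
    (the definition used for 'HDI' in the authors' Table S1)."""
    cuts = statistics.quantiles(samples, n=100, method='inclusive')
    return cuts[2], cuts[96]   # 3rd and 97th percentiles

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any common region."""
    return a[0] <= b[1] and b[0] <= a[1]

rng = random.Random(1)
# Hypothetical posterior draws for the same effect in the two task versions
post_v1 = [rng.gauss(0.50, 0.10) for _ in range(4000)]
post_v2 = [rng.gauss(0.55, 0.10) for _ in range(4000)]
hdi_v1, hdi_v2 = interval_94(post_v1), interval_94(post_v2)
```

Substantially overlapping intervals across task versions, as reported, indicate the feature:random-walk mapping did not drive the valence effects.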

It would be useful to report the relationship between worry and the block effect (i.e., you suggest high worry is associated with higher GP/lower MB for losses – do worried people adapt to changes in the base rates of the outcomes?).

We thank the reviewer for this suggestion. We have now tested for a possible interaction between worry and the block effect, and the results did not support such an interaction. This analysis is now reported in Results (p. 15):

“Of note, we additionally found no association between the parameter governing how MB punishment was modulated by task block and levels of worry, both when including worry alone (β=2.5 (1.91), t=1.31, p=0.19) and when controlling for the same covariates as detailed above (β=1.46 (1.65), t=0.88, p=0.38).”

Reviewer #3 (Recommendations for the authors):

Well done on an interesting read and a contribution that will be informative for a lot of researchers. I have some suggestions to improve the paper.

All analyses with the 3 clinical factors should be presented in full, including supplementary figures if possible. Simple associations should be carried out before adding covariates, to assist the reader in interpreting these findings and in generating hypotheses based on them. OCD is said to be not related to parameters at p=.08, while worry is at p=.04 (uncorrected; I guess more like p=.02 for the latter); these are not likely to be different from one another. And they may depend on the inclusion of these variables in the same model. The reader needs more transparency around these effects, and any claims of specificity need more support. The data presented actually suggest the opposite.

We thank the reviewer for these helpful suggestions to improve the clarity and transparency of our clinical analyses. We now present all our analyses in full, as detailed in response to Editor comment 1. Of note here, the trend-level negative relationship between OCD and punishment perseveration was in the opposite direction to the relationship between worry and punishment perseveration. This is now clearly highlighted in an updated Figure 5.

Relatedly, regarding the result in relation to worry, the effect is marginal at p=.04. While 2 multiple comparisons are controlled for, this is a fairly liberal decision given several tests were conducted and reported (i.e., GP, MB and MF for punishment/reward = 6 at least; plus the 3 clinical scales = 18, etc.). I'd encourage the authors to report all of the associations in a table, corrected for multiple comparisons. This will serve the same purpose of suggesting the most interesting avenue for future research, but also give the reader a fuller view on the specificity of this to worry. This exploratory framing for the clinical effects does not detract from the main contribution of the paper or the potential for this to be especially interesting for 'worry' – it would just make them clearer and let the reader decide that for themselves a bit more.

We have changed the language about our hypotheses at the article’s outset, now present all results in full, and report both corrected and uncorrected p-values to be transparent about our correction for multiple comparisons. We temper our claims about the relation between worry and punishment perseveration in the Abstract, Introduction and Discussion, as detailed above in response to Editor comment 1, ‘Re-framing clinical analyses as exploratory and preliminary’.
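One standard way to implement the table-wide correction the reviewer requests is a Holm-Bonferroni step-down procedure, sketched here with hypothetical p-values (the authors' actual correction method may differ):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm correction: sort p-values ascending, multiply the
    i-th smallest by (m - i), enforce monotonicity, cap at 1, and return
    the adjusted values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, adj)   # adjusted values never decrease
        adjusted[idx] = running_max
    return adjusted

# Hypothetical uncorrected p-values for a family of strategy-valence tests
raw = [0.001, 0.02, 0.04, 0.08, 0.20, 0.50]
adj = holm_bonferroni(raw)
```

Holm's procedure controls the familywise error rate over the whole family while being uniformly more powerful than plain Bonferroni, which suits reporting all 18 associations in one table.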

There needs to be a bit more done with respect to relating the clinical variables to the model parameters. I would have thought this would be best placed within the hierarchical model itself. Alternatively, I wonder if there is a point estimate that could be generated that is more flexible and less dependent on the overall group effects and other parameter values.

The authors describe issues with collinearity of the parameter values. Can a correlation matrix in the supplement be included that reports these (I think currently you can sort of see it based on simulated vs real data, but this is not the same as correlating real vs real across params).

We now report this heatmap in Figure 4 —figure supplement 4, detailed in response to Editor comment 2.

I strongly encourage that all subjects are retained (though I feel less strongly about excluding those not completing enough trials; 90% even seems a bit harsh/wasteful of data). If not, then a clear justification is needed for why the strategy or approach of these subjects is not an accurate reflection of the decision-making preferences of potentially 22% of the population. More standard indicators of inattentive responding focus on RTs, or overly rigid responding that renders modelling choice impossible. It is not clear why these were not used here, as they seem better-justified indicators of inattentive subjects. At the risk of belabouring the point(!), defining these subjects as 'not understanding instructions' could be applied to many of the key findings of this paper (i.e., avoidance perseveration suggests they don't pay attention to the current goals, etc.). So I think this practice is not ideal.

We agree with the reviewer that a more comprehensive justification for, and scrutiny of, our exclusion criteria is warranted. We first demonstrate, by simulating agents that would have been excluded in our study for misunderstanding feature types, that the excluded strategies perform, in terms of points won, as badly as or worse than agents that purely guess which action to take on each trial. By contrast, simulating a goal-perseveration strategy shows that it is far more adaptive, both in reward earned and in reduced computational cost. We additionally show that excluded subjects did, as a group, perform at chance level on average. Finally, we show in a sensitivity analysis that if we include such subjects, all the major effects hold, and in some cases become even stronger. We address this concern thoroughly in response to Editor comment 4.
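The chance-level check for excluded subjects can be sketched with an exact two-sided binomial test (made-up counts, not the study's data; assumes a 0.5 chance rate per trial):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-12))

# Hypothetical excluded subject: 52 value-consistent choices out of 100 trials
p_chance = binom_two_sided_p(52, 100)   # consistent with pure guessing
# Hypothetical retained subject: 70 out of 100
p_above = binom_two_sided_p(70, 100)    # reliably above chance
```

A subject whose performance cannot be distinguished from guessing provides no evidence of having understood the feature values, which is the logic behind the exclusion and the group-level chance comparison.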


https://doi.org/10.7554/eLife.74402.sa2

Article and author information

Author details

  1. Paul B Sharp

    1. The Hebrew University of Jerusalem, Jerusalem, Israel
    2. Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, United Kingdom
    3. Wellcome Centre for Human Neuroimaging, University College London, London, United Kingdom
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Evan M Russek
    For correspondence
    paul.sharp@mail.huji.ac.il
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-4949-1501
  2. Evan M Russek

    1. Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, United Kingdom
    2. Wellcome Centre for Human Neuroimaging, University College London, London, United Kingdom
    Contribution
    Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review and editing
    Contributed equally with
    Paul B Sharp
    Competing interests
    No competing interests declared
  3. Quentin JM Huys

    1. Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, United Kingdom
    2. Division of Psychiatry, University College London, London, United Kingdom
    Contribution
    Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Raymond J Dolan

    1. Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, United Kingdom
    2. Wellcome Centre for Human Neuroimaging, University College London, London, United Kingdom
    Contribution
    Funding acquisition, Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-9356-761X
  5. Eran Eldar

    The Hebrew University of Jerusalem, Jerusalem, Israel
    Contribution
    Conceptualization, Formal analysis, Supervision, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared

Funding

Fulbright Association (PS00318453)

  • Paul B Sharp

Israel Science Foundation (1094/20)

  • Eran Eldar

Wellcome Trust (098362/Z/12/Z)

  • Raymond J Dolan

National Institutes of Health (R01MH124092)

  • Eran Eldar

National Institutes of Health (R01MH125564)

  • Eran Eldar

Israel Binational Science Foundation (2019801)

  • Eran Eldar

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

PBS is supported by a Fulbright postdoctoral fellowship. EE is supported by NIH grants R01MH124092 and R01MH125564, ISF grant 1094/20 and US Israel BSF grant 2019801. RJD holds a Wellcome Trust Investigator award (098362/Z/12/Z). The Max Planck UCL Centre for Computational Psychiatry and Ageing Research is a joint initiative supported by the Max Planck Society and University College London.

Ethics

Human subjects: Participants gave written informed consent before taking part in the study, which was approved by the university's ethics review board (project ID number 16639/001).

Senior Editor

  1. Christian Büchel, University Medical Center Hamburg-Eppendorf, Germany

Reviewing Editor

  1. Claire M Gillan, Trinity College Dublin, Ireland

Reviewer

  1. Claire M Gillan, Trinity College Dublin, Ireland

Publication history

  1. Received: October 3, 2021
  2. Accepted: February 21, 2022
  3. Accepted Manuscript published: February 24, 2022 (version 1)
  4. Version of Record published: March 10, 2022 (version 2)

Copyright

© 2022, Sharp et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.



Cite this article

  1. Paul B Sharp
  2. Evan M Russek
  3. Quentin JM Huys
  4. Raymond J Dolan
  5. Eran Eldar
(2022)
Humans perseverate on punishment avoidance goals in multigoal reinforcement learning
eLife 11:e74402.
https://doi.org/10.7554/eLife.74402
