Impaired adaptation of learning to contingency volatility in internalizing psychopathology
Abstract
Using a contingency volatility manipulation, we tested the hypothesis that difficulty adapting probabilistic decisionmaking to secondorder uncertainty might reflect a core deficit that cuts across anxiety and depression and holds regardless of whether outcomes are aversive or involve reward gain or loss. We used bifactor modeling of internalizing symptoms to separate symptom variance common to both anxiety and depression from that unique to each. Across two experiments, we modeled performance on a probabilistic decisionmaking under volatility task using a hierarchical Bayesian framework. Elevated scores on the common internalizing factor, with high loadings across anxiety and depression items, were linked to impoverished adjustment of learning to volatility regardless of whether outcomes involved reward gain, electrical stimulation, or reward loss. In particular, high common factor scores were linked to dampened learning following betterthanexpected outcomes in volatile environments. No such relationships were observed for anxiety or depressionspecific symptom factors.
Introduction
Many of the situations we encounter in daily life are characterized by uncertainty. From small choices to large lifetime decisions, we rarely can know for sure the consequences that will stem from our actions. The uncertainty that pervades our daily decisionmaking is a source of greater distress, and even dysfunction, for some individuals than for others. In recognition of this, intolerance of uncertainty has been proposed as a core feature of generalized anxiety disorder (Freeston et al., 1994; Dugas et al., 1998; Dugas et al., 2001) and, more recently, a transdiagnostic marker of internalizing psychopathology more broadly (Gentes and Ruscio, 2011; Carleton et al., 2012; Boswell et al., 2013). However, these accounts are largely based on patients’ selfreported behavioral and emotional responses to exposure to uncertainty rather than experimental and computational investigations of the cognitive processes involved in choice under uncertainty. We address this in the current study. Here, we test the possibility that a deficit in adapting decisionmaking to uncertainty might be linked to internalizing psychopathology, in general, and observed across both aversive and rewardbased learning.
Within the computational literature, advances have been made in formalizing different sources of uncertainty and their effects on decision making (Yu and Dayan, 2005; Behrens et al., 2007; Nassar et al., 2012; PayzanLeNestour et al., 2013). One source of uncertainty is noise in the relationship between actions and outcomes, for instance when an action only leads to a given outcome on a proportion of the occasions it is performed. A second source of uncertainty is nonstationarity, or volatility, in the underlying causal structure; for example, when actionoutcome contingencies switch and an action that was primarily associated with a given outcome becomes predominantly associated with another. Such volatility leads to uncertainty around the estimate of outcome probability; this can be conceived of as secondorder uncertainty (Bach et al., 2011). Crucially, the level of secondorder uncertainty determines the optimal response to unexpected outcomes. When contingencies are volatile, and hence secondorder uncertainty is high, participants should adjust their probability estimates more rapidly than when contingencies are stable and secondorder uncertainty is low. Behrens et al., 2007 reported that healthy adult participants are indeed able to adjust their rate of updating, that is, their learning rate, to match contingency volatility in this manner.
Failure to adapt correctly to the source of uncertainty present in a given situation may result in inaccurate predictions and suboptimal decisions. This might well be a source of distress and dysfunction in everyday life. In a prior study, we examined the relationship between trait anxiety and adaptation of decisionmaking to contingency volatility (Browning et al., 2015). Modeling participants’ performance on an aversive version of Behrens and colleagues’ probabilistic decisionmaking task, using electrical stimulation as outcomes, we found that trait anxiety was associated with reduced adaptation of learning rate to volatility. This finding raises the possibility that poor adjustment of probabilistic learning to volatility might reflect an inability to differentially respond to different forms of uncertainty and that this, in turn, might be an important marker of psychopathology.
Critically, this initial work leaves open the question of whether impaired adaptation of learning rate to volatility is specific to anxiety or a marker of internalizing psychopathology more broadly. Scores on the Spielberger Trait Anxiety Inventory (STAI, Spielberger et al., 1983), the measure of anxiety used in Browning et al., 2015, are typically elevated across patients with both anxiety and depressive disorders and correlate highly with scores on measures of depression such as the Beck Depression Inventory (BDI, Beck et al., 1961). Establishing whether impairment in adapting decisionmaking to volatility is specific to anxiety, or more broadly linked to internalizing psychopathology, requires a way of partitioning symptom variance into components that disentangle variance common to both anxiety and depression from that specific to anxiety or to depression. Bifactor analysis provides a principled method for achieving this goal. It has been used extensively to study the structure of internalizing symptomology (Clark et al., 1994; Steer et al., 1995; Zinbarg and Barlow, 1996; Steer et al., 1999; Simms et al., 2008; Steer et al., 2008; Brodbeck et al., 2011) and has consistently revealed a substantial amount of shared variance, often termed ‘general distress’ or ‘negative affect’ (Clark and Watson, 1991; Clark et al., 1994). In addition, separate specific factors for depression and anxiety are consistently observed, with the depressionspecific factor tapping symptoms of anhedonia (Clark et al., 1994; Steer et al., 1999; Steer et al., 2008) and anxietyspecific factors tapping symptoms of anxious arousal (Clark et al., 1994; Steer et al., 1999; Steer et al., 2008) and worry (Brodbeck et al., 2011). Although bifactor modeling of internalizing symptoms is well established, it has not, to date, been used to inform studies of anxiety and depressionrelated deficits in decisionmaking. Using bifactor analysis to estimate scores for each participant on latent dimensions of internalizing symptoms, we can investigate whether impoverished adjustment of learning rate to volatility is primarily linked to general factor scores (i.e. to symptom variance common to both anxiety and depression) or to anxietyspecific or depressionspecific factor scores.
A second question of specificity pertains to the breadth of impairment in adjusting learning rate to volatility. There is evidence that the neural substrate, and affective consequences, of learning from unexpected outcomes may vary both with domain (reward versus punishment; Boureau and Dayan, 2011) and with whether outcomes are better or worse than expected (Frank et al., 2007; Cox et al., 2015; Eldar et al., 2016; Palminteri and Pessiglione, 2017). If impoverished adaptation of decisionmaking to volatility is potentially a core feature of internalizing psychopathology, it is important to establish if such impairment is, or is not, observed regardless of whether choice involves potential aversive outcomes (e.g. shock), reward gain or reward loss and both when outcomes exceed and fall short of our expectations.
Our aims were hence as follows. First, to use bifactor analysis of itemlevel responses to measures of internalizing symptoms to determine whether impaired adjustment of learning rate to volatility is linked to symptom variance common to both anxiety and depression (i.e. to general factor scores) or to symptom variance specific to anxiety or to depression. Second, to determine whether any such impairment is domain general (i.e. observed both when actions involves the pursuit of rewarding outcomes and the avoidance of punishing outcomes) or domain specific. And third, to determine whether this impairment is observed equally for learning from both positive and negative prediction errors (i.e. betterthanexpected and worsethanexpected outcomes) or is differentially linked to one or the other.
Results
Outline
We addressed these aims through two experiments. The first experiment was conducted in our onsite testing facilities. The participant sample comprised patients diagnosed with major depressive disorder (MDD) or generalized anxiety disorder (GAD), healthy control participants screened to be free of any psychiatric diagnoses, and a community sample with naturally varying levels of symptoms (for further details see Materials and methods: SelfReport Measures). Participants completed two versions of the probabilistic decisionmaking under volatility task (Behrens et al., 2007; Browning et al., 2015). In one version, they chose between two shapes that were associated with potential receipt of electrical stimulation of varying magnitude. In the other version, the outcomes involved possible receipt of financial rewards of varying magnitude. In each case, the probability that a given shape would result in receipt of shock or reward had to be learned across trials. Actionoutcome contingencies were stable in one period of the task and volatile in the other (see Materials and methods).
The second experiment was conducted online using Amazon’s Mechanical Turk platform. Here, we used reward gain and reward loss versions of the probabilistic decisionmaking under volatility task. As detailed further below, we used the same bifactor decomposition of internalizing symptoms and the same computational model of task performance across both experiments and all three versions of the volatility task. A hierarchical Bayesian framework was used for model estimation with participants’ scores on the latent factors of internalizing symptoms entered as predictors of behavioral model parameter values.
Experiment 1
Estimating latent factors of internalizing symptoms
Eightyeight participants (51 females, mean age = 27 ± 8 years) took part in experiment 1. This was conducted at the Wellcome Centre for Integrative Neuroimaging (WIN) at the John Radcliffe Hospital with ethical approval obtained from the Oxford Central University Research Ethics Committee (CUREC). Twenty participants had a primary diagnosis of major depressive disorder (MDD), 12 participants had a primary diagnosis of generalized anxiety disorder (GAD), 26 ‘healthy control’ participants of approximately the same age and sex ratio were screened to ensure they were free of any psychiatric diagnosis, and 30 participants were members of the local community with a natural range of internalizing symptoms (see Appendix 1—table 1 for participant details). Participants who met criteria for any psychiatric diagnoses apart from anxiety or depressive disorders were excluded, as were participants with neurological conditions and those currently taking psychoactive medications or recreational drugs (see Materials and methods). Data from two control participants was excluded; one was a result of equipment failure; debriefing indicated the second misunderstood the task. This left 86 participants in total.
Participants completed a battery of standardized questionnaires comprising the Spielberger StateTrait Anxiety Inventory (STAI form Y; Spielberger et al., 1983), the Beck Depression Inventory (BDI; Beck et al., 1961), the Mood and Anxiety Symptoms Questionnaire (MASQ; Clark and Watson, 1995; Watson and Clark, 1991), the Penn State Worry Questionnaire (Meyer et al., 1990), the Center for Epidemiologic Studies Depression Scale (CESD; Radloff, 1977), and the 90item Eysenck Personality Questionnaire (EPQ; Eysenck and Eysenck, 1975). These questionnaires were selected to measure a range of depressive and anxiety symptoms (e.g. anhedonia, negative mood, negative cognitive biases, worry, somatic symptoms) and to assess trait negative affect more broadly (via inclusion of the EPQ Neuroticism subscale; items from other subscales of the EPQ were not included in the bifactor analysis described below).
We sought to separate symptom variance common to both anxiety and depression from that specific to depression or to anxiety. Bifactor modeling of item level responses provides a simple approach to achieve this aim, as demonstrated previously within the internalizing literature (Clark et al., 1994; Steer et al., 1995; Steer et al., 1999; Simms et al., 2008; Brodbeck et al., 2011). Bifactor models decompose the itemlevel covariance matrix into a general factor and two or more specific factors. Here, we specified a model with one general and two specific factors. This decision drew on our theoretical aim, namely to separate symptom variance common to both anxiety and depression from that specific to depression and that specific to anxiety, and was informed by prior tripartite models of internalizing psychopathology (e.g. Clark and Watson, 1991). It was also supported by the results of eigenvalue decomposition of the covariance matrix. Only the first three eigenvalues were reliably distinguishable from noise—this was determined by comparison of the eigenvalues in descending order against eigenvalues obtained from a random normal matrix of equivalent size (Humphreys and Montanelli, 1975; Floyd and Widaman, 1995), see Figure 1—figure supplement 1 and Materials and methods for more details.
The SchmidLeiman (SL) procedure was used to estimate the loadings of individual questionnaire items on each factor (Schmid and Leiman, 1957). This procedure performs oblique factor analysis followed by a higher order factor analysis on the lower order factor correlations to extract a general factor. All three factors are forced to be orthogonal to one another, which allows for easier interpretability. In line with previous findings (Clark et al., 1994; Steer et al., 1995; Zinbarg and Barlow, 1996; Steer et al., 1999; Simms et al., 2008; Steer et al., 2008; Brodbeck et al., 2011), the general factor had high loadings (>0.4) for multiple anxietyrelated and depressionrelated items and moderately high loadings (>0.2) across almost all items. One specific factor had high loadings (>0.4) for questions related to anhedonia and depressed mood. The other specific factor had high loadings (>0.4) for questions related to worry and anxiety.
We validated this factor structure by conducting a confirmatory bifactor analysis on itemlevel responses to the same set of questionnaires completed by an independent online sample (n = 199). Participants were students at UC Berkeley (120 females, mean age = 20 ± 4). This group was fairly distinct from our first sample, being more homogenous in age and educational status and less homogenous in ethnicity, and not including individuals recruited to meet diagnosis for either GAD or MDD. Evaluating the fit of the factor structure obtained from experiment 1 in this second dataset is a strong test of its generalizability. In the confirmatory bifactor analysis, we used diagonally weighted least squares estimation and constrained the factor structure so that items were only allowed to load on a factor for which they had a loading of >0.2 in experiment 1 (see Materials and methods: Exploratory Bifactor Analysis). This constrained model showed a good fit to the data from this new participant sample, comparative fit index (CFI) = 0.962.
As a convergent analysis, we conducted an unconstrained (i.e. exploratory, not confirmatory) bifactor analysis in this second participant sample to see if a similar factor structure would emerge to that obtained in experiment 1. We again specified one common and two specific factors. The factor loadings obtained were highly congruent with the factor loadings obtained from the bifactor analysis in experiment 1 (cosinesimilarity was 0.96 for the general factor loadings, 0.81 for the depressionspecific factor loadings, and 0.77 for the anxietyspecific factor loadings). Congruence in factor loadings was assessed after matching the two specific factors according to the similarity of their loading content.
Factor loadings from experiment 1 were used to calculate factor scores for all participants from experiment 1 (n = 86) and the confirmatory factor analysis sample (n = 199). The resultant scores are plotted in Figure 1a–b. As an additional check of construct validity, participants’ scores on these factors were correlated with summary (scale or subscale) scores for the standardized questionnaires administered, Figure 1c. The questionnaires to which items belonged were not specified during the fitting of the bifactor model, hence these summary scores provide an independent measure of the construct validity of the latent factors extracted from the bifactor analysis. As can be seen in Figure 1c, participants’ scores on the general factor correlated strongly (r > 0.60) with summary scores for all the questionnaire measures, indicating that the general factor is indeed tapping variance linked to both anxiety and depressive symptoms. Scores on the depressionspecific factor correlated strongly with scores for the MASQ anhedonia subscale (MASQAD; r = 0.72) and the STAI depression subscale (STAIdep; r = 0.53), and scores on the anxietyspecific factor correlated strongly with scores for the Penn State Worry Questionnaire (PSWQ; r = 0.76). This indicates that the two specific factors extracted from the bifactor analysis do indeed capture anxiety and depressionrelated symptoms, respectively, as intended, and that these factors explain variance above that explained by the general factor. The latter conclusion can be drawn since the specific factors are orthogonal to the general factor and therefore their correlations with scale and subscale scores reflect independently explained variance. This can be further demonstrated by regressing variance explained by scores on the general factor out of scale and subscale scores and then examining the relationship between residual scores for each scale with scores on the two specific factors. As shown in Figure 1—figure supplement 2, after removing variance explained by general factor scores, nearly all the remaining variance in PSWQ scores could be captured by the anxietyspecific factor and nearly all the remaining variance in MASQAD scores could be captured by scores on the depressionspecific factor.
As outlined earlier, we applied a bifactor model to itemlevel symptom responses as we sought to tease apart symptom variance common to anxiety and depression versus unique to anxiety or depression. With model selection, both the extent to which a given model can address the aim of the experiment and the fit of the given model to the data are important considerations. In addition to assessing the absolute fit of the bifactor solution in the confirmatory factor analysis (CFA) dataset, we can also consider its fit relative to that of alternate models. The bifactor model reported here showed a better fit to the novel (CFA) dataset than a ‘flat’ correlated twofactor model, a hierarchical three factor model with the higher order factor constrained to act via the lower level factors, and a unifactor model created by retaining only the general factor and neither of the specific factors (see Appendix 2: Additional Factor Analyses for further details). We note that none of these alternate models would enable us to separate symptom variance common to anxiety and depression versus unique to anxiety and depression, as desired.
Computational modeling of task performance: measuring the impact of block type (volatile, stable), task version (reward, aversive), and relative outcome value (good, bad) on learning rate
Participants completed both reward gain and aversive versions of a probabilistic decisionmaking under volatility task (Behrens et al., 2007; Browning et al., 2015). Full task details are provided in Figure 2 and in the Materials and methods. In short, participants were asked to choose between the same two shapes repeatedly across trials. On each trial, one of the two shapes resulted in reward receipt or shock receipt; the nature of the outcome depended on the version of the task. When making their choice, participants were instructed to consider both the magnitude of the reward or shock associated with each shape, which was shown to participants inside each shape and varied across trials, and the probability that each shape would result in reward or shock receipt. The outcome probability could be learned across trials, using the outcomes received. During the stable task period, the outcome probability was fixed such that one shape had a 75% probability of resulting in reward or shock receipt if chosen and the other 25%. During the volatile task period, the shape with the higher probability of shock or reward receipt switched every 20 trials (see Materials and methods for further details).
We fitted participants’ choice behavior using alternate versions of simple reinforcement learning models. We focused on models that were parameterized in a sufficiently flexible manner to capture differences in behavior between experimental conditions (block type: volatile versus stable; task version: reward gain versus aversive) and differences in learning from better or worse than expected outcomes. We used a hierarchical Bayesian approach to estimate distributions over model parameters at an individual and populationlevel with the latter capturing variation as a function of general, anxietyspecific, and depressionspecific internalizing symptoms. Given our a priori interest in assessing the variance in choice behavior explained by the three internalizing factors, participants’ scores for the three factors were included in the estimation procedure for all models. We compared the fits of different models using Pareto smoothed importance sampling to approximate leaveoneout crossvalidation accuracy (PSISLOO; Vehtari et al., 2017). PSISLOO is a popular method with which to estimate out of sample prediction accuracy as it is less computationally expensive than evaluating exact leaveoneout or kfold crossvalidation accuracy (Gelman et al., 2013). We note that comparing models using WAIC (Watanabe and Opper, 2010), an alternative penalizationbased criterion for hierarchical Bayesian models, resulted in identical model rankings for our dataset. We elaborate briefly on model comparison here. Full details are provided in the Materials and methods; a summary of model comparison results is also provided in Appendix 3—table 1.
The models considered were informed by prior work (Lau and Glimcher, 2005; Behrens et al., 2007; Ito and Doya, 2009; Li and Daw, 2011; Berns and Bell, 2012; Akaishi et al., 2014; Browning et al., 2015; Donahue and Lee, 2015; Mkrtchian et al., 2017; Aylward et al., 2019). Each of the models included a parameter to allow for individual differences in the weighting of outcome probabilities against outcome magnitudes, an inverse temperature parameter to allow for differences in how noisily participants made choices as a function of this weighted combination of probability and magnitude, and a learning rate parameter that captured the extent to which participants adjusted probability estimates given the unexpectedness of the previous trial’s outcome. We parameterized dependence on experimental conditions by additive and interactive factors. As one example, the learning rate $\alpha $ was divided into a baseline learning rate (${\alpha}_{baseline}$), a difference in learning rates between the volatile and stable blocks (${\alpha}_{volatilestable}$), a difference in learning rates between the reward gain and aversive versions of the volatility task $\left({\alpha}_{rewardaversive}\right)$, and the twoway interaction of those differences.
Parametrizing the effects of our experimental conditions in the manner described above allowed us to test how these effects varied as a function of betweenparticipant differences in internalizing symptomatology. We parameterized the dependence on internalizing symptoms by adjusting the mean of the Bayesian hierarchical prior according to the general, depressionspecific, and anxietyspecific factor scores of each participant, using populationlevel weights $\left\{{\beta}_{g},{\beta}_{d},{\beta}_{a}\right\}$, respectively. These weights differed for each parameter component (e.g. ${\alpha}_{volatilestable}$), but we hide this specificity for notational ease. Including participants’ scores on the three latent internalizing factors in the estimation procedure in this manner enables us to separate variance linked to internalizing symptoms from noise in participants’ choices when estimating model parameters (see Materials and methods: Hierarchical Bayesian Estimation for more details).
We used a form of stagewise model construction to investigate the manner in which participants integrated outcome probability and outcome magnitude (additive or multiplicative) and the extent to which task performance could be better captured by inclusion of additional model parameters. At each stage of the model construction process, we added or modified a particular component of the model and compared the enriched model to the best model from the previous stage using leaveoneout crossvalidation error approximated by Pareto smoothed importance sampling (PSISLOO; Vehtari et al., 2017).
In the first stage, we compared a model (#1) that combined outcome probability and outcome magnitude multiplicatively (i.e. by calculating expected value, similarly to Browning et al., 2015) with a model (#2) that combined outcome probability and outcome magnitude additively; see Materials and methods for full model details. We observed that the additive model fit participants’ choice behavior better (model #2 PSISLOO = 26,164 versus model #1 PSISLOO = 27,801; difference in PSISLOO = −1637; std. error of difference = 241; lower PSISLOO is better). This finding is consistent with observations of separate striatal representations for outcome magnitude and probability (Berns and Bell, 2012), as well as findings from work with nonhuman primates where additive models have also been reported to fit choice behavior better than expected value models (Donahue and Lee, 2015).
In both of the models in the first stage, all the parameters were divided into a baseline component, a component for the difference between volatile and stable blocks, a component for the difference between reward and aversive task versions, and a component for the interaction of these two experimental factors. During the second stage, we investigated whether task performance was better captured by additionally allowing for differences in learning as a result of positive and negative prediction errors. Specifically, we added a component for relative outcome value (good, bad) and two further components that captured the interaction of relative outcome value with block type (volatile, stable) and with task version (reward, aversive). We added these components for learning rate alone in one model (#3) and for learning rate, mixture weight, and inverse temperature in another model (#5); see Materials and methods for full model details. We defined a good outcome to be the receipt of a reward or the receipt of no shock and a bad outcome to be the receipt of no reward or the receipt of shock. Including effects of relative outcome value for all three model parameters, including the twoway interactions with block type and task version, improved PSISLOO (model #5 PSISLOO = 25,462 versus model #2 PSISLOO = 26,164; difference in PSISLOO = −702; std. error of difference = 142; lower PSISLOO is better). Adding the threeway interaction of block type, task version and relative outcome value worsened PSISLOO slightly (model #6 PSISLOO = 25,486 versus model #5 PSISLOO = 25,462; difference in PSISLOO = 24; std. error of difference = 9), indicating that twoway interactions were sufficient to capture important aspects of behavioral variation.
Additional stages of model comparison revealed that allowing subjective weighting of magnitude differences improved model fit (model #7 PSISLOO = 25,154 versus model #5 PSISLOO = 25,462; difference in PSISLOO = −308; std. error of difference = 104) as did the addition of a choice kernel that captures participants’ predisposition to repeating prior choices (model #11 PSISLOO = 25,037 versus model #7 PSISLOO = 25,154; difference in PSISLOO = −117; std. error of difference = 42). Both the subjective magnitude parameter and choice kernel inverse temperature were broken down by task version (reward, aversive); a single choice kernel update rate was used across conditions; see Materials and methods: Stagewise Model Construction. In contrast to the above parameters, adding a lapse term did not improve model fit nor did allowing outcome probabilities to be separately updated for each shape; see Materials and methods for further details.
The best fitting model (#11) is presented in Equation 1ad. The probability (${p}_{t}$) that shape 1 and not shape 2 would result in reward or shock receipt if chosen is updated on each trial using a prediction error (the difference between the most recent outcome ${O}_{t1}$ and the previous estimate ${p}_{t1}$) scaled by the learning rate ($\alpha $) (Equation 1a).
Next, the estimate of the difference between the two shapes in outcome probability is combined additively with the difference in outcome magnitude using a mixture weight ($\lambda $) (Equation 1b). Here, the difference in outcome magnitude is nonlinearly scaled using $r$ to account for potential differences in subjective valuation ($M{1}_{t}$ and $M{2}_{t}$ denote the magnitude for shapes 1 and shape 2, respectively; note that the sign for the difference in this equation is removed before exponentiating and then restored).
A choice kernel is also updated on each trial using the difference between the previous choice (${C}_{t1}$) and the choice kernel on the previous trial ${k}_{t1}$, scaled by an update rate ($\eta $) (Equation 1c).
Finally, the outcome value and the choice kernel determine the probability that shape 1 was chosen on that trial using a softmax choice rule with two separate inverse temperatures ($\omega $ and ${\omega}_{k}$) (Equation 1d).
To validate the model estimation procedure, we treated participants’ estimated parameter values as the ground truth and used them to simulate 10 new datasets for each of the 86 participants. By fitting the model to these simulated datasets, we could compare the ground truth parameter values with the parameters estimated (i.e. recovered) from the simulated data. The recovered parameters from each dataset strongly correlated with the ground truth parameter values; the mean correlation across simulated datasets and across parameters was r = 0.76 (std = 0.15). For learning rate components, the average correlation was r = 0.88 (std = 0.13) (for more methodological details see Materials and methods: Parameter Recovery; for parameter recovery results see Appendix 4—figure 1 and Appendix 4—figure 2). This analysis indicates that individual model parameters were recoverable as desired. For estimates of noise in populationlevel parameters, see Appendix 4—figure 3 and Appendix 4—figure 4.
Crossgroup results: participants adjust learning rate to contingency volatility
Having selected model #11, we fit this model to participants’ choice behavior and estimated distributions over model parameters at an individual and populationlevel (as described above and detailed further in the Materials and methods). This included estimating populationlevel weights $\left\{{\beta}_{g},{\beta}_{d},{\beta}_{a}\right\}$ that captured the effect of internalizing factor scores upon each parameter component (e.g. ${\alpha}_{volatilestable}$).
In this section, we report crossgroup effects (i.e. across all participants). Here, we used the posterior distributions over the group mean for each learning rate component to examine whether learning rate varied as a function of block type (volatile or stable), task version (reward or aversive), or relative outcome value (i.e. trials following good or bad outcomes). Figure 3a shows the posterior means, along with their 95% highest posterior density intervals (HDIs), for each parameter component. If the 95% HDI for a given effect does not cross zero, the effect is considered statistically credible.
The effects of block type, task version and relative outcome value upon learning rate were statistically credible; that is, their HDIs did not cross zero: block type (${\alpha}_{volatilestable}$), µ = 0.16, 95%HDI = [0.04,0.28]; task version ($\alpha}_{rewardaversive$), µ = −0.21, 95%HDI = [−0.37, –0.03]; relative outcome value ($\alpha}_{goodbad$), µ = 0.50, 95%HDI = [0.34,0.67]. Participants had higher learning rates during the volatile block than the stable block, higher learning rates during the aversive task than the reward gain task, and higher learning rates on trials following good versus bad outcomes. None of the twoway interactions were statistically credible: block type by task version (${\alpha}_{\left(volatilestable\right)x\left(rewardaversive\right)}$), µ = 0.09, 95%HDI [−0.02, 0.21]; block type by relative outcome value (${\alpha}_{\left(volatilestable\right)x\left(goodbad\right)}$), µ = 0.04, 95%HDI [−0.05, 0.13]; task version by relative outcome value (${\alpha}_{\left(rewardaversive\right)x\left(goodbad\right)}$), µ = 0.0, 95%HDI [−0.12, 0.12]. Figure 3b illustrates differences in learning rates between experimental conditions for the reward and aversive versions of the task, respectively.
Elevated general factor scores are linked to reduced adjustment of learning rate to volatility, especially following betterthanexpected outcomes, across both reward and aversive tasks
To address our first main research question—that is, whether impaired adjustment of learning rate to volatility is linked to symptom variance common to both anxiety and depression or to symptom variance specific to anxiety or to depression—we looked at whether the difference in learning rate between the volatile and stable blocks varied as a function of general factor scores or as a function of anxiety or depressionspecific factor scores. Examining learning rate difference between blocks, ${\alpha}_{volatilestable}$, the 95% HDI for the general factor regression coefficient excluded zero, ${\beta}_{g}$ = −0.18, 95%HDI=[−0.32,–0.05], Figure 4a. Individuals with low scores on the general factor adjusted learning rate between the stable and volatile task blocks to a greater extent than individuals with high general factor scores. Neither anxietyspecific factor scores nor depressionspecific factor scores credibly modulated learning rate difference between blocks, ${\alpha}_{volatilestable}$, ${\beta}_{a}$ = −0.03, 95%HDI = [−0.16, 0.09], ${\beta}_{d}$ = 0.06, 95%HDI=[−0.08, 0.19], respectively, Figure 4—figure supplement 1. This suggests that the ability to appropriately adjust learning rate to volatility, previously linked to trait anxiety (Browning et al., 2015), is actually common to both anxiety and depression and not specific to one or the other.
To address our second research question—that is, whether the relationship between internalizing symptoms and adjustment of learning rate to volatility is domain general (i.e. holds across both aversive and reward task versions) or domain specific—we looked at whether there was an interaction between internalizing factor scores, block type and task version upon learning rate. Estimates for $\alpha}_{\left(rewardaversive\right)x\left(volatilestable\right)$ were not credibly modulated by scores on any of the three internalizing factors (${\beta}_{g}$ = 0.01, 95%HDI = [−0.12,0.15]; ${\beta}_{a}$ = 0.05, 95%HDI = [−0.07,0.17], ${\beta}_{d}$ = 0.06, 95%HDI = [−0.06, 0.17]), see Figure 4a and Figure 4—figure supplement 1. Estimates for the main effect of task version on learning rate ($\alpha}_{rewardaversive$) also did not vary credibly as a function of internalizing factor scores, ${\beta}_{g}$ = 0.08, 95%HDI = [−0.11; 0.28], ${\beta}_{a}$ = −0.11, 95%HDI = [−0.29,0.06], $\beta}_{d$ = −0.17, 95%HDI = [−0.38,0.01], see Figure 4a and Figure 4—figure supplement 1.
To address our third research question—that is, whether relative outcome value (good or bad) modulates the relationship between internalizing symptoms and learning rate adjustment to volatility—we looked at whether there was an interaction between internalizing factor scores, block type (volatile, stable) and relative outcome value (good, bad) upon learning rate. Estimates for ${\alpha}_{\left(goodbad\right)x\left(volatilestable\right)}$ were credibly modulated by scores on the general factor (${\beta}_{g}$ = −0.19, 95%HDI = [−0.3,–0.1]), Figure 4a, but not by scores on the anxietyspecific factor or the depressionspecific factor (${\beta}_{a}$ = −0.02, 95%HDI = [−0.11, 0.07], ${\beta}_{d}$ = 0.07, 95%HDI = [−0.04, 0.16]), Figure 4—figure supplement 1. In addition, scores on the general factor, but not the anxietyspecific or depressionspecific factors, also credibly modulated the main effect of relative outcome value upon learning rate (${\alpha}_{goodbad}),{\beta}_{g}$ = −0.21, 95%HDI = [−0.37, –0.04], see Figure 4a and Figure 4—figure supplement 1.
To illustrate these results, we calculated the expected learning rate for each withinsubject condition associated with scores one standard deviation above or below the mean on the general factor, Figure 4b. Low general factor scores (shown in blue) were associated with higher learning rates following good versus bad outcomes, both when outcomes were rewardrelated (here a good outcome was a reward) and when they were aversive (here a good outcome was no shock delivery). Low general factor scores were also associated with a more pronounced difference in learning rate between volatile and stable task blocks following good outcomes. This was observed both when outcomes were rewardrelated and aversive. In contrast, high scores on the general factor (shown in red) were associated with smaller differences in learning rate between volatile and stable blocks and following good versus bad outcomes; this held across both reward and aversive versions of the task. In particular, the boost in learning from positive predictions errors under volatile conditions (i.e. from good relative outcomes in the volatile block) shown by individuals with low general factor scores was absent in individuals with high general factor scores; if anything individuals with high general factor scores’ learning after positive outcomes was reduced under volatile relative to stable conditions.
To confirm that the relationship between general factor scores and the interaction of block type by relative outcome value on learning rate did not vary as a function of task version, we fit an additional model that parametrized the threeway interaction of block type (volatile, stable), relative outcome value (good, bad) and task version (reward, aversive) for learning rate. Fitting this model to participants’ choice behavior confirmed that this threeway interaction was not statistically credible, nor was its modulation by general factor scores (${\beta}_{g}$ = 0.06, 95%HDI = [−0.06; 0.18]). The effect of general factor scores on the interaction between block type (volatile, stable) and relative outcome value (good, bad) on learning rate remained credible (${\beta}_{g}$ = −0.21, 95%HDI = [−0.34; −0.1]). We note that a parameter recovery analysis revealed successful recovery of the parameter representing the threeway interaction of block type, relative outcome value and task type on learning rate (Appendix 4—figure 5). This suggests that we did not simply fail to observe a threeway interaction due to lack of experimental power. Together, these findings support the conclusion that the negative relationship between general factor scores and increased learning rate following relative good outcomes in volatile environments did not differ credibly as function of task version (reward, aversive).
Experiment 2
We fit the behavioral model from experiment 1 to an independent online sample of participants, in order to test whether the reward gain findings replicated, and whether the findings for punishment would be replicated using loss of reward in place of primary aversive outcomes (shock). Specifically, we predicted that general factor scores would inversely correlate with adaptation of learning rate to volatility, that this would be observed to a greater extent following outcomes of positive relative value (reward gain or no reward loss) and that this would hold across task version (reward gain or reward loss).
Onehundred and seventytwo participants were recruited from Amazon’s Mechanical Turk platform to participate in Experiment 2 (see Materials and methods: Experiment 2). We excluded participants who missed 10 or more responses in either the reward gain or reward loss version of the volatility task (the task is detailed further below); this left 147 participants (n = 147; 65 females) whose data were included in the analyses. Participants filled out the Spielberger StateTrait Anxiety Inventory (STAI), the Mood and Anxiety Symptoms Questionnaire (MASQ), and the Beck Depression Inventory (BDI), showing similar distributions of scores across these three measures to participants in Experiment 1 (see Appendix 1—table 1 and Appendix 1—table 3). Participants’ scores on the three internalizing latent factors were calculated using the factor loadings estimated in experiment 1 for items from this subset of measures. As can be seen in Figure 1, participants in experiment 2 (denoted by dark gray os) showed a similar range of scores on the three factors to participants in experiment 1 despite no attempt being made to specifically recruit individuals who met diagnostic criteria for anxiety or depressive disorders.
As noted above, participants only completed three of the original questionnaires to reduce fatigue, incomplete datasets and participant dropout. To check that factor scores could be reliably estimated using only items from this subset of the original questionnaires, we calculated factor scores using factor loadings estimated in experiment 1 and either the full or reduced set of questionnaire items. For this analysis, we pooled the questionnaire data from experiment 1 (n = 86) together with the online dataset used for the confirmatory factor analysis (n = 199). We correlated the factor scores calculated using either the full or reduced item set. This revealed little change in scores for the general factor (r = 0.97) or the depressionspecific factor (r = 0.97). Scores on the anxietyspecific factor were only moderately correlated when using loadings from the full versus reduced set of items (r = 0.41); this likely reflects the omission of the PSWQ from the reduced questionnaire set. These differences reflect those seen between prior studies in the literature, where loadings on the general and depression factors are fairly consistent, but the anxiety factor can either reflect symptoms of anxious arousal, worry or both, with this varying as a function of measures included (Clark et al., 1994; Steer et al., 1999; Steer et al., 2008; Brodbeck et al., 2011). For the general factor and depressionspecific factor, the high correlations between factor scores obtained using the full and reduced item set puts us in a strong position to draw conclusions across experiments 1 and 2. This is less true for the anxietyspecific factor. Our hypotheses center on the general factor; however, we include all scores on all three factors in the hierarchical model fitting procedure for consistency with experiment 1.
Each participant completed two versions of the probabilistic decisionmaking under volatility task (Browning et al., 2015): a reward gain and a reward loss task. The reward gain task was closely based on the inlab reward task, differing only in the timing of intra and intertrial intervals, which were shortened slightly for the online version, and in the calculation of monetary bonuses (see Materials and methods). The reward loss task was parallel to the reward gain task, except that participants were allocated points at the start of the task and tried to avoid losing their points throughout the task. In this version of the task, outcome magnitudes corresponded the amount of points that would be subtracted from the point total. Participants’ general level of performance on these online versions of the volatility tasks, as indexed by average magnitude of outcomes received across trials, was broadly similar to that observed in experiment 1 (see Appendix 5—figure 1).
Experiment 2 results: general factor scores are linked to reduced adjustment of learning rate to volatility across both reward loss and reward gain. As in experiment 1, this was primarily observed following betterthanexpected outcomes
At a group level, learning rates were credibly higher following good relative outcomes (reward, no loss) versus bad relative outcomes (no reward, loss) (${\alpha}_{goodbad}),\mu $ = 0.52, 95%HDI = [0.42,0.71]. There was no effect of task version on learning rate $\left({\alpha}_{gainloss}\right),\mu $ = 0.13, 95%HDI = [−0.02,0.28]. In contrast to experiment 1, there was also no credible grouplevel effect of block type on learning rate (${\alpha}_{volatilestable}),\mu $ = 0.04, 95%HDI = [−0.1,0.17], see Figure 5—figure supplement 1.
Although there was no effect of block type at the group level, in line with predictions, there was a credible interaction of block type by scores on the general factor: that is, the difference in learning rate between volatile and stable blocks, ${\alpha}_{volatilestable,}$ was inversely correlated with scores on the general factor (${\beta}_{g}$ = −0.16, 95%HDI = [−0.32,–0.01]), Figure 5a. As in experiment 1, individuals with low general factor scores showed greater adjustment of learning rate to volatility, learning faster in the volatile block than the stable block, than participants with high general factor scores. Also as in experiment 1, neither depression nor anxietyspecific factor scores credibly modulated adjustment of learning rate to volatility (${\alpha}_{volatilestable}$): ${\beta}_{a}$ = −0.03, 95%HDI = [−0.14,0.08]; ${\beta}_{d}$ = 0.01, 95%HDI = [−0.14,0.12], Figure 5—figure supplement 2.
Once again, our second question of interest was whether deficits in adaptation of learning rate to volatility linked to elevated internalizing symptomatology would be domain general or domain specific. We examined if the relationship between scores on the three internalizing factors and the effect of block type (volatile, stable) on learning rate varied as a function of task version (reward gain, reward loss). As in experiment 1, none of the corresponding threeway interactions were statistically credible: general factor by block type by task version: ${\beta}_{g}$ = −0.06, 95%HDI = [−0.17, 0.05]; anxietyspecific factor by block type by task version: ${\beta}_{a}$ = 0.01, 95%HDI = [−0.09,0.1]; or depressionspecific factor by block type by task version: ${\beta}_{d}$ = −0.03, 95%HDI = [−0.13,0.09]. In addition, scores on the three internalizing factors did not credibly modulate the main effect of task version (reward gain, reward loss) upon learning rate: ${\beta}_{g}$ = −0.04, 95%HDI = [−0.14,0.20]; ${\beta}_{a}$ = −0.15, 95%HDI = [−0.28,0.0]; ${\beta}_{d}$ = 0.01, 95%HDI = [−0.14,0.16].
Our third question of interest was whether deficits in adaptation of learning rate to volatility linked to elevated internalizing symptoms would vary depending on whether outcomes were better or worse than expected. Here, we replicated the finding from experiment 1. Specifically, general factor scores modulated the interaction of block type (volatile, stable) and relative outcome value (good, bad) upon learning rate, ${\alpha}_{\left(volatilestable\right)x\left(goodbad\right)}$, ${\beta}_{g}$ = −0.14, 95%HDI = [−0.27,–0.02], Figure 5a. As in experiment 1, individuals with low scores on the general internalizing factor showed greater increases in learning rates in the volatile relative to stable task block following betterthanexpected versus worsethanexpected outcomes. This boost in learning from positive predictions errors under volatile conditions was absent in individuals with high general factor scores; as in experiment 1, if anything, individuals with high general factor scores’ learning after positive outcomes was reduced under volatile relative to stable conditions, Figure 5b. In contrast to these findings, neither scores on the anxietyspecific factor nor on the depressionspecific factor credibly modulated the interaction of block type by relative outcome value on learning rate (${\beta}_{d}$ = 0.08, 95%HDI = [−0.02,0.18]; ${\beta}_{a}$ = 0.01, 95%HDI = [−0.09,0.1]), Figure 5—figure supplement 2.
These findings, alongside those from experiment 1, indicate that a failure to boost learning when contingencies are volatile, especially following a better than expected outcome, is a shared characteristic of both anxiety and depression. We note that, unlike in experiment 1, scores on the general factor were not associated with increased overall learning for good versus bad outcomes, ${\alpha}_{goodbad}$ (i.e. a main effect of relative outcome value, independent of its interaction with block type), ${\beta}_{g}$ = −0.04, 95%HDI = [−0.20, 0.13], Figure 5a.
Other model parameters
The behavioral model contained five other parameters (inverse temperature $\omega $, mixture weight for probability versus magnitude $\lambda $, choice kernel update rate $\eta $, choice kernel inverse temperature ${\omega}_{k}$, subjective magnitude $r$). We did not have a priori hypotheses pertaining to the relationship between these parameters and scores on the three internalizing factors. Hence, we looked for statistically credible effects that replicated across both experiments 1 and 2.
Participants as a group relied on outcome probability more during attempts to obtain reward (reward gain) than during attempts to avoid punishment (shock in experiment 1, reward loss in experiment 2); the 95% posterior intervals excluded zero for the difference in mixture weight by task version for both experiment 1 (mixture weight; ${\lambda}_{rewardaversive},$ µ = 0.26, 95%HDI = [0.01,0.49]) and for experiment 2 (mixture weight; ${\lambda}_{gainloss},$ µ = 0.29, 95%HDI = [0.02,0.55]). Excluding the learning rate results already presented, we did not observe any statistically credible associations between parameter values and scores on the three internalizing factors that replicated across both experiments 1 and 2.
Discussion
We examined how adaptation of probabilistic decisionmaking to contingency volatility varied as a function of internalizing symptomology. Through bifactor analysis of itemlevel responses to standardized questionnaire measures of anxiety and depression, we estimated participants’ scores on a general latent factor of internalizing symptomatology as well as their scores on anxietyspecific and depressionspecific factors. We modeled participants’ performance on alternate versions of a probabilistic decisionmaking under volatility task using a hierarchical Bayesian framework and populationlevel parameters that captured variance attributable to each of the three internalizing factors. In experiment 1, participants including individuals diagnosed with GAD and MDD performed reward gain and aversive versions of the task. In experiment 2, participants recruited online from Amazon Mechanical Turk performed reward gain and reward loss versions of the task. Across both experiments, we observed that high scores on the general factor were associated with reduced adjustment of learning rate to volatility. No parallel relationship was observed for scores on the anxiety or depressionspecific factors. These findings are in line with the contention that impoverished adaptation of learning to contingency volatility is linked to the common component of anxiety and depressive symptomatology.
An important, logical, next question concerns the generality of this impairment. In the current study, we addressed this by examining whether impoverished adaptation of learning to volatility in individuals with high scores on the general factor holds for both rewarding and punishing outcomes. In experiment 1, we examined performance on parallel versions of the task using financial reward gain or electrical stimulation (shock) as outcomes; in experiment 2, we compared task performance when outcomes entailed financial reward gain or financial reward loss. Neither experiment found any evidence to suggest that the relationship between impoverished adaptation of learning rate to volatility and scores on the general internalizing factor was modulated by outcome domain. Across all three versions of the task, individuals with low general factor scores showed greater adaptation of learning rate to volatility than individuals with high general factor scores. This suggests that the impairment in adjusting learning to volatility linked to elevated internalizing symptoms generalizes across both reward and punishment learning.
It has previously been suggested that anxiety might be especially associated with altered learning regarding punishment (Lissek et al., 2005), whereas depression might be especially associated with altered learning regarding reward (Elliott et al., 1997; Steele et al., 2007). However, recent reviews have suggested that evidence for the latter, at least, is far from clear cut (Robinson and Chase, 2017). In the case of the volatility task used here, two previous studies found elevated trait anxiety to be linked to impoverished learning rate adjustment when outcomes involved punishment (electric stimulation or financial loss; Browning et al., 2015; Pulcu and Browning, 2017). For outcomes involving reward gain, the correlations between trait anxiety and learning rate adaptation to volatility were not credibly different from zero but also not credibly different to that observed with punishing outcomes (Browning et al., 2015; Pulcu and Browning, 2017). The small sample size of these prior studies means that only effects of moderate size were detectable. In addition, as touched on in the Introduction, the measure of anxiety used (the STAI trait scale) does not provide a particularly pure measure of anxietyspecific affect, also showing moderatetohigh correlations with measures of depression. In the current study, our larger sample size and bifactor analysis of internalizing symptoms enables us to better clarify whether anxietyspecific or depressionspecific symptoms are linked to differential adaptation of learning rate to volatility as a function of outcome domain.
Considering anxiety first, we found no evidence to support a relationship between anxietyspecific symptoms and enhanced learning following punishing versus rewarding outcomes or to suggest a differential relationship between anxietyspecific symptoms and adaptation of learning to volatility when outcomes are punishing versus rewarding. We note that omission of PSWQ administration in experiment 2 limited the range of anxiety symptoms that could be captured in that experiment. However, it was experiment 1 that critically used electrical stimulation as punishing outcomes (this, as opposed to reward loss, is arguably the version of punishment where anxietyspecific effects are most expected) and this revealed no credible relationship between anxietyspecific symptoms and differences in learning rate as a function of task version (reward, aversive) or volatility by task version. Further, supplementary analyses using residual scores on the PSWQ or MASQAA, controlling for variance explained by the general factor, in place of scores on the anxietyspecific factor also reveal no credible relationship between anxietyspecific symptoms and differences in learning rate as a function of task version (reward, aversive) or volatility by task version (see Appendix 6—figure 1 and Appendix 6—figure 2).
There was also no evidence to support a relationship between depressionspecific symptoms and differential adaptation of learning to volatility as a function of outcome domain. These findings, together with those for the general factor reviewed above, suggest that the deficit in adapting learning rate to volatility is linked to symptom variance common to anxiety and depression and is general to both learning about reward and learning about punishment. We note that, in experiment 1, there was a trend toward depressionspecific factor scores being linked to higher learning rates, in general, in the aversive versus reward version of the task. This did not reach significance and no equivalent effect was apparent in experiment 2. This might however be worth further exploration in future work.
Our third question of interest was whether impairment in adapting learning to volatility might differentially impact learning from betterthanexpected versus worsethan expected outcomes. In the task used here, one shape predicts receipt of reward gain, loss or shock at 75–80% and the other at 20–25%; these contingencies are either stable or volatile depending on the block (following Behrens et al., 2007 and Browning et al., 2015, we used slightly more extreme probabilities in the volatile block to balance performance between conditions). Since outcome probability is always greater than 0 and less than 1, outcomes of relative positive value (reward gain, no shock, no reward loss) will be better than expected, to a varying extent, and should generate a positive prediction error. Similarly, outcomes of relative negative value (no reward gain, shock, reward loss) will be worse than expected, to a varying extent, and should generate a negative prediction error. It is widely accepted that both positive and negative prediction errors are used to inform our estimates of outcome probability—that is, we learn from both betterthanexpected and worsethanexpected outcomes. However, learning need not be symmetric. Here, across both experiments, we observed that individuals with low scores on the general internalizing factor mainly showed a volatilityrelated boost to learning from positive prediction errors. This selective boost to learning from positive prediction errors when contingencies were volatile was not shown by individuals with high scores on the general internalizing factor.
In the current task, asymmetry in learning from positive versus negative prediction errors does not confer a performance advantage given that both outcome types are equally fully informative—receiving a good outcome following selection of shape 1 means shape 2 would have resulted in a bad outcome and vice versa. However, asymmetric learning from good versus bad outcomes under volatility might be valuable in many realworld situations. Conceivably, when the world changes, there might normally be fewer ways of performing well than performing badly, so learning more substantially from the former could be advantageous (Huys et al., 2015). As such, enhanced learning from positive prediction errors when the world is volatile might reflect a learning style acquired over time by individuals with low general factor scores and applied by default in new situations, such as the current tasks.
Given the ubiquity of volatility in the realworld, our findings suggest that the development of interventions aimed at assisting participants to better distinguish contingency noise from contingency volatility, and to adjust decisions accordingly, might be useful in treatment of both anxiety and depression. An important question is whether impairment in adapting learning rate to volatility is specific to internalizing psychopathology or might represent an even more general deficit, perhaps one than also characterizes individuals with high levels of externalizing or psychotic symptomatology. We hope to address this in future work. Understanding this will enable us to more broadly identify patients likely to benefit from cognitive interventions aimed at improving decisionmaking under different sources of uncertainty, in particular contingency noise versus contingency volatility. A second important future avenue entails research aimed at increasing our understanding of, and development of interventions for, deficits in decisionmaking uniquely linked to anxiety or depressionspecific symptoms. The hope is that such interventions might valuably complement ones targeted at deficits linked to the general component of internalizing psychopathology.
In summary, our findings provide support for the contention that impoverished adjustment of learning rate to volatility is broadly linked to internalizing psychopathology. Bifactor modeling of internalizing symptoms and hierarchical Bayesian modeling of task performance enabled us to establish that this deficit is linked to symptoms shared by both anxious and depressed individuals, that is, to scores on a general internalizing factor, with no additional relationship being observed for anxiety or depression specific symptoms. The relationship between general factor scores and impaired adaptation of learning to volatility was observed in a sample including participants with anxiety and depressive disorders as well as in an online sample where no diagnostic interviews or prescreening on symptoms was conducted. Further, this relationship held regardless of whether outcomes entailed receipt of reward, shock or reward loss. This speaks to the generality of the deficit in question. Intriguingly, individuals with low general factor scores were found to predominantly boost learning under volatility following betterthanexpected outcomes. In other words, they learnt more from positive than from negative prediction errors under conditions of high contingency volatility. It is possible that in everyday life, a generally adaptive strategy when the world is rapidly changing is to learn from successful actions. The ability to do this could potentially confer resilience when faced with environments characterized by secondorder uncertainty. In contrast, for individuals with high levels of internalizing symptoms, the inability to adjust behavior depending on whether unexpected outcomes are likely due to contingency noise or to contingency volatility might well comprise decisionmaking across multiple settings; this could explain findings that selfreported intolerance of uncertainty is elevated in individuals with both anxiety and depressive disorders. More generally, we believe that greater specification of the manner in which deficits in decisionmaking, and other cognitive functions, map onto latent dimensions of psychopathology will play an important role in the advancement of personalized medicine, allowing for empirically based stratification of individuals who participate in treatment trials, prediction of individuals at risk of clinically significant psychopathology, and development of interventions aimed at remediating core cognitive and computational deficits.
Materials and methods
Experiment 1
Participants
Potential participants of both sexes, between the ages of 18 and 55, were recruited continuously from the local community for a period of one year and 6 months between March 2015 and August 2016. Advertisements in local newspapers and on local mailing lists (e.g. Oxford University mailing lists) together with flyers at local primary care practices and geographically constrained Facebook advertisements were used to recruit participants with generalized anxiety disorder (GAD) and major depressive disorder (MDD). Diagnoses were determined using the research version of the Structured Clinical Interview for DSMIVTR (SCID) administered by trained staff and supervised by an experienced clinical psychologist. We excluded participants if they were currently receiving pharmacological treatment or had been prescribed psychotropic medication or taken nonprescribed psychoactive drugs (i.e. street drugs) within the past 3 months. Participants reporting a history of neurological disease or meeting diagnostic criteria for PTSD, OCD, bipolar disorder, schizophrenia or other psychotic disorders, or substance abuse were also excluded. In parallel, we recruited a healthy control group screened using the SCID to ensure they did not meet diagnostic criteria for any DSMIVTR Axis I disorder. Here, participants were also excluded if they reported a history of neurological disease or usage of psychoactive drugs (legal or illegal) within the last 3 months.
Onehundred and eight individuals came in for the SCID screening session. Of these, 42 individuals did not meet inclusion criteria and 8 individuals declined to participate in the subsequent experimental sessions. Our final participant sample (n = 58) comprised 12 participants who met diagnostic criteria for GAD, 20 participants who met diagnostic criteria for MDD (three of whom had a secondary diagnosis of GAD), and 26 healthy control participants.
We also included within our final sample an additional 30 participants who had been recruited from the local community to perform the same tasks in the context of an fMRI study. These participants showed broadly the same agerange and sex ratio as the patient and control groups (see Appendix 1—table 1). Here, potential participants were excluded if they reported a prior diagnosis of neurological or psychiatric illness other than GAD or MDD. In addition, participants reporting usage of psychoactive medication or streetdrugs were also excluded.
We excluded data from either the reward gain or aversive version of the probabilistic decisionmaking under volatility task if there was equipment malfunction or if a participant reported after the session that they did not understand the task. In experiment 1, we excluded data from the aversive version of the volatility task from eight participants (three participants with MDD, one participant with GAD, and four control participants). Data from the reward gain version of the volatility task were excluded for six participants (two participants with GAD, one participant with MDD, two control participants, and one community member participant). Only two participants (both control subjects) had data excluded from both tasks. These exclusions left 86 participants in total. Power calculations indicated a sample size of 75 or higher would give 95% power to obtain effect sizes similar to that observed in our earlier work relating adaptation of learning in the aversive version of the volatility task to trait anxiety (Browning et al., 2015). See Appendix 1—table 1 for participant details by task.
Experimental procedure
Request a detailed protocolExperiment 1 was approved by the Oxford Central University Research Ethics Committee (CUREC) and carried out at the Wellcome Centre for Integrative Neuroimaging (WIN) within the John Radcliffe Hospital in compliance with CUREC guidelines. Written informed consent was obtained from each participant prior to participation. Participants recruited by community advertisement into the GAD and MDD patient groups or into the healthy control group were initially screened by phone. This was followed by an inperson screening session during which informed consent was obtained and the Structured Clinical Interview for DSMIVTR (SCID) was administered. Individuals meeting criteria for inclusion in the study were invited back for two additional sessions. During the second session, participants completed standardized selfreport measures of anxiety and depression and then completed the aversive (shock) version of the volatility task. During the third session, participants completed the reward gain version of the volatility task. The second and third sessions were separated by at least 1 day and no more than 1 week. All three sessions were conducted within the Nuffield Department for Clinical Neurosciences at the John Radcliff Hospital. Participants were paid at a fixed rate of £20 per session and were also given a bonus of up to £10 based on their performance in the reward gain version of the volatility task.
To increase the number of participants and to fill in the spread of symptoms, 30 additional communityrecruited participants (aged between 18 and 40 years, 14 females) were included in the sample for experiment 1. These participants were not administered the SCID, but any individuals reporting a history of psychiatric or neurological conditions were excluded as were individuals on psychotropic medication or taking illegal psychotropic agents. These participants completed the aversive and reward gain versions of the volatility task during two fMRI scanning sessions conducted a week apart in the Wellcome Centre for Integrative Neuroimaging at the John Radcliffe Hospital. Questionnaires were administered at the beginning of each of these sessions.
Selfreport measures of internalizing symptoms
Request a detailed protocolParticipants completed standardized selfreport measures of anxiety and depression. Measures included the Spielberger StateTrait Anxiety Inventory (STAI form Y; Spielberger et al., 1983), the Beck Depression Inventory (BDI; Beck et al., 1961), the Mood and Anxiety Symptoms Questionnaire (MASQ; Clark and Watson, 1995; Watson and Clark, 1991), the Penn State Worry Questionnaire (Meyer et al., 1990), and the Center for Epidemiologic Studies Depression Scale (CESD; Radloff, 1977). In addition, we administered the 80item Eysenck Personality Questionnaire (EPQ; Eysenck and Eysenck, 1975) to be able to include items from the Neuroticism subscale in our bifactor analysis.
Exploratory bifactor analysis
View detailed protocolA bifactor analysis was conducted on item level responses (n = 128) to the MASQ anhedonia subscale, the MASQ anxious arousal subscale, the STAI trait subscale, the BDI, the CESD, the PSWQ, and the EPQ neuroticism (N) subscale. Item responses were either binary (0–1), quaternary (0–4), or quinary (0–5). Response categories that were endorsed by fewer than 2% of participants were collapsed into the adjacent category to mitigate the effects of extreme skewness. Reversescoring was implemented prior to inclusion of items in the bifactor analysis to facilitate interpretation of factor loadings. Polychoric correlations were used to adjust for the fact that categorical variables cannot have correlations across the full range of −1 to 1 (Jöreskog, 1994).
We determined the number of dimensions to use in the bifactor analysis based on theoretical considerations, visually inspecting the scree plot and conducting a parallel analysis, which compares the sequence of eigenvalues from the data to their corresponding eigenvalues from a random normal matrix of equivalent size (Horn, 1965; Humphreys and Montanelli, 1975; Floyd and Widaman, 1995). Parallel analysis was conducted using the ‘fa.parallel’ function from the Psych package in R. This procedure simulates a correlation matrix out of random variables drawn from a normal distribution, with the number of variables and the number of samples matched to the actual dataset. Eigenvalue decomposition is then applied to this simulated correlation matrix to estimate the magnitudes of eigenvalues that would be expected due to chance alone. These eigenvalues are plotted as a dotted line labeled ‘random data’ in Figure 1—figure supplement 1. Eigenvalue decomposition is also applied to the correlation matrix from the actual data and these eigenvalues are also plotted in Figure 1—figure supplement 1 and labeled ‘actual data’. Only three eigenvalues obtained from actual data lie above random data line and are hence reliably distinguishable from noise. Parallel analysis can also be conducted by randomizing the rows of the actual data matrix rather than drawing new random variables from a normal distribution. This procedure also supports a three factor solution for our dataset.
We first conducted an oblique factor analysis; following this we conducted a higher order factor analysis on the lower order factor correlations to extract a single higher order factor (i.e. a general factor). This procedure is known as SchmidLeiman (SL) orthogonalization (Schmid and Leiman, 1957). Both steps were conducted using the ‘omega’ function from the Psych package in R. An alternative exploratory bifactor method, JennrichBentler Analytic Rotations (Jennrich and Bentler, 2011), was also applied to the data. The factor scores calculated by the two methods were highly correlated (r = 0.96 for the general factor, r = 0.91 for the depression factor, and r = 0.96 for the anxiety factor). We chose to use SL orthogonalization, as the JennrichBentler rotations attributed less unique variance to depression symptoms, potentially resulting in less power to detect depressionspecific effects.
Factor scores for each participant were calculated using the AndersonRubin method (Anderson and Rubin, 1956), which is a weightedleast squares solution that maintains the orthogonality of the general and specific factor scores. A confirmatory bifactor analysis was conducted in a separate online participant sample (see below).
The probabilistic decisionmaking under volatility task
Request a detailed protocolEach participant completed both a reward gain version and an aversive version of the probabilistic decisionmaking under volatility task. These two tasks were as previously described in Browning et al., 2015. The two versions of the task had parallel structures. On each trial, participants were presented with two shapes on either side of the computer screen and were instructed to choose one. Each shape was probabilistically associated with either receipt of reward (in the reward gain version of the task) or receipt of shock (in the aversive version of the task). Participants were instructed that, on each trial, one of the two shapes would result in receipt of reward (or shock). They were also instructed that in making their decision whether to choose one shape or the other, they should consider both the probability of receipt of reward (or shock) and the reward (or shock) magnitude should it be received. Outcome magnitude (199) was randomly selected for each shape and changed from trial to trial. The same random sequence of magnitudes was used for all participants who had the same block order (i.e. stable block or volatile block first) for a given version of the task. The sequence of magnitudes differed between the reward and aversive task versions.
For the reward gain version of the volatility task, reward points gained were converted to a monetary bonus that was paid out at the end of the experiment. The bonus ranged from £0 to £10 and was calculated by dividing the total sum of outcome magnitudes received by the maximum amount possible. For the aversive version of the volatility task, shock magnitude (199) corresponded to different intensities of electric stimulation. This mapping, or calibration, was conducted for each participant prior to performance of the task using the same procedure as reported by Browning et al., 2015. During calibration, participants reported their subjective experience of pain using a 10point scale, on which 1 was defined as ‘minimal pain’ and 10 as ‘worst possible pain’. Participants were told that the highest level of stimulation they would receive was a ‘7’ on this scale and that for this they should select the highest level that they were willing to tolerate receiving up to 20 times during the task. The amplitude of a single 2 ms pulse of electrical stimulation was increased from zero until participants reported feeling a sensation that they rated as 1/10. The amplitude of the single pulse was then kept at this level while the number of 2 ms pulses delivered in a train was increased. The number of pulses was increased until the participant reported a subjective intensity of 7/10. If the participants reported a 7/10 before the number of pulses reached 8 or did not report a 7/10 by 48 pulses, the amplitude of the pulse corresponding to the 1/10 level was adjusted. Participants also completed 14 trials during which the intensity of electrical shock was randomly varied by changing the number of pulses delivered in a train between one and the number required to produce a report of 7/10. Participants’ subjective pain ratings of these different levels of shock were fitted to a sigmoid curve. The single pulse reported as a 1/10 and the train of multiple pulses reported as a 7/10 formed the lowest (1) and highest (99) magnitudes for an outcome that a participant could receive during the task. The sigmoid curve was used map the outcome magnitudes in between 1 and 99 to numbers of pulses.
Each task was divided into a stable and volatile block of trials, each 90 trials long. In the stable block, one shape was associated with reward (or shock) receipt 75% of the time and the other shape was associated with reward (or shock) receipt 25% of the time. In the volatile block, the shape with a higher probability (80%) of resulting in reward (or shock) receipt switched every twenty trials; we used 80% as opposed to 75% to balance difficulty between the volatile and stable blocks (following Behrens et al., 2007 and Browning et al., 2015). The order of task blocks by task version was counterbalanced across participants such that an equal number of participants received the volatile block of each task first, the stable block of each task first, the stable block of the reward task and the volatile block of the aversive task first, or the stable block of the aversive task and the volatile block of the reward task first. Participants were not told that the task was divided into two blocks.
Computational modeling of task performance
Request a detailed protocolDecomposing parameters into parameter components
Request a detailed protocolWe fitted participants’ choice behavior using alternate versions of simple reinforcement learning models. We sought to parsimoniously yet flexibly capture differences in choice behavior associated with task version (reward gain or aversive) and block type (volatile, stable) and to be able to address if learning rate varied as a function of relative outcome value. For the reward gain version of the task, good outcomes corresponded to the receipt of a reward and bad outcomes corresponded to its omission. For the aversive task, good outcomes corresponded to the omission of electric stimulation and bad outcomes corresponded to its delivery. While the magnitude of prediction errors varied across trials, good outcomes were always associated with positive prediction errors and bad outcomes with negative prediction errors.
We decomposed core model parameters, in particular learning rate, into components that captured the effects of block type (volatile, stable), task version (aversive, reward) and relative outcome value (good, bad) as well as the twoway interactions of these effects. As an example, in Equation 2, we show the division of learning rate $\alpha $ into a baseline learning rate ${\alpha}_{baseline}$, a difference in learning rates between the volatile and stable blocks ${\alpha}_{volatilestable}$, a difference in learning rates between the reward gain and aversive versions of the volatility tasks ${\alpha}_{rewardaversive}$, a difference in learning rates between trials following good and bad outcomes ${\alpha}_{goodbad}$, and the twoway interactions of those differences. We also explored the benefit of including threeway interactions, see model comparison.
We use the term ‘parameter components’ to distinguish these elements from composite parameters, such as the overall learning rate ($\alpha $). The variable ${\chi}_{\left(volatilestable\right)}$ takes on a value of 1 when the trial is in the volatile block and a value of −1 when the trial is in the stable block. A logistic transform was applied to constrain the learning rate to be between [0,1]. This transform was also applied to other parameters that were constrained to be between 0 and 1. A logarithmic transform was used for parameters that were constrained to be positive, such as inverse temperature. The parameter ranges are specified for each parameter in the model construction section.
Hierarchical bayesian estimation of parameters
Request a detailed protocolA hierarchical Bayesian procedure was used to estimate distributions over parameters for each reinforcement learning model, and to estimate how the potential dependences of these parameters on task version (reward versus aversive), block type (volatile versus stable) and relative outcome value (good versus bad) varied as a function of individual differences in internalizing symptomology. Data from all participants was used to fit each model. Specifically, each parameter component, such as the difference in learning rates between volatile and stable blocks ($\alpha}_{volatilestable$, was assigned an independent populationlevel prior distribution that was shared across participants. The mean for each populationlevel distribution was specified as a linear model with an intercept (µ), to represent the overall mean of the parameter component across participants, along with weights ($\beta}_{g},{\beta}_{d},{\beta}_{a$) for the participants’ scores on each of the three factors ($X}_{g},{X}_{d},{X}_{a$) (g = general factor, d = depressionspecific factor, a = anxietyspecific factor). The variance (${\sigma}^{2}$) for the populationlevel distribution was also estimated separately for each parameter component. Note that there are different populationlevel parameters ($\mu ,{\beta}_{g},{\beta}_{d},{\beta}_{a},{\sigma}^{2}$) for each parameter component, but we omit this detail for notational ease. As an example, the populationlevel model for the learning rate difference between volatile and stable blocks (${\alpha}_{volatilestable}$) is given by Equation 3.
Models were fit using PyMC3 (Salvatier et al., 2016), a Python Bayesian statistical modeling software package. The hyperpriors assigned to these populationlevel parameters ($\mu ,{\beta}_{g},{\beta}_{d},{\beta}_{a}$) were uninformative Normal(0,10). The hyperpriors for the population variances, ${\sigma}^{2}$, were Cauchy(2.5). A Hamiltonian MonteCarlo method was used to sample from the full posterior. Four chains were run with 200 tuning steps and 2000 samples each. Visual inspection of the traces as well as Gelman−Rubin statistics ($\widehat{R}$) were used to assess convergence (Gelman and Rubin, 1992). There were no populationlevel parameters with $\widehat{R}$ values greater than 1.1 (most were below 1.01). There were only 8 out of the 2236 participantlevel parameters (from two participants) with $\widehat{R}$ values greater than 1.1, and these were for $\eta$ and $\omega}_{k$, which were not the focus of the main analysis. The marginal posterior distributions for the populationlevel parameters ($\mu ,{\beta}_{g},{\beta}_{d},{\beta}_{a}$) were used to assess statistical significance of populationlevel effects. Populationlevel parameters with a 95% highest posterior density (HDI) intervals that did not contain zero were deemed to be statistically credible.
Stagewise model construction
Request a detailed protocolTo find a model that was sufficiently flexible to capture important aspects of participants’ behavior yet was not overfitted to the data, we performed a stagewise model construction procedure. At each stage, we either added a computational component to the model or modified an existing component. We compared the enriched model to the best model from the previous stage using an approximate form of leaveoneout crossvalidation, which uses Pareto smoothed importance sampling (PSISLOO; Vehtari et al., 2017). Our baseline model, as well as the potential modifications, were informed by prior work (Lau and Glimcher, 2005; Behrens et al., 2007; Ito and Doya, 2009; Li and Daw, 2011; Berns and Bell, 2012; Akaishi et al., 2014; Browning et al., 2015; Donahue and Lee, 2015; Mkrtchian et al., 2017; Aylward et al., 2019). In total, 13 alternative models were assessed. The model with the lowest PSISLOO was selected as the winning model and used to make inferences about the relationships between task performance and internalizing symptoms. Each model was estimated according to the same hierarchical Bayesian procedure (detailed in the preceding section). In each case, participants’ scores on the three internalizing factors were included in the estimation procedure.
Stage 1: Additive versus multiplicative influence of outcome probability and magnitude (model #1 versus model #2)
Request a detailed protocolIn the first stage, we compared a model (#1) that assumes that participants combine outcome probability and outcome magnitude multiplicatively during decision making to a model (#2) that assumes that participants combine them additively. The multiplicative model #1 is similar to the model used in our previous work (Browning et al., 2015), and the additive model #2 is similar to models that have been found to fit well by other groups (e.g. Donahue and Lee, 2015).
In both models, the probability (${p}_{t}$) that a good outcome would result from choosing shape 1 and not shape 2, is updated on a trialbytrial basis using a RescorlaWagner rule (Equation 4a and 5a). The learning rate ($\alpha \in \left[0,1\right])$ determines how much the estimate is revised by the prediction error (i.e. the difference between the previous estimate ${p}_{t1}$ and the most recent outcome ${O}_{t1}$). The outcome is coded such that ${O}_{t1}=1$ if shape 1 is chosen and followed by a good outcome (i.e. delivery of reward or absence of electric stimulation) or if shape 2 is chosen and followed by a bad outcome (i.e. absence of reward or delivery of electric stimulation).${O}_{t1}=0$ codes for the opposite cases in each task.
For model #1, the outcome probability estimate is adjusted using a risk parameter ($\gamma \in \left[0.1,10\right]$) to capture the relative importance of magnitude versus probability to choice. The expected value is calculated, multiplying the outcome probability and outcome magnitude for each shape separately ($M{1}_{t}$ and $M{2}_{t}$ for shapes 1 and 2, respectively), before taking the difference in expected value between shapes (Equation 4c). In contrast, in model #2, the differences in magnitude and probability are calculated separately for each shape first and are then combined as an additive mixture (Equation 5b). The mixture parameter ($\lambda \in \left[0,1\right]$) in this model determines the relative importance of outcome magnitude and outcome probability, albeit in a different way to $\gamma $ in model #1. In both models, a softmax choice rule with an inverse temperature parameter ($\omega \in {R}^{+}$) is then used to specify how deterministically (or noisily) participants made choices as a function of the combined value of probability and magnitude on each trial (Equation 4d and 5c).
Equations 4ad (Model #1).
(a)(updating probability estimates).
(b)(risk adjusted probabilities).
(c)(difference in expected value).
(d)(softmax action selection).
Equations 5ac (Model #2).
(a)(updating probability estimates).
(b)(mixture of probability and magnitude).
(c)(softmax action selection).
The three parameters for model #1 ($\alpha $, $\gamma $, $\omega $) and the three parameters for model #2 ($\alpha $, $\lambda $, $\omega$) were each divided into four components: a baseline, a difference between volatile and stable blocks, a difference between the reward and aversive task, and the interaction of these two factors. We used these components in all the models considered given the importance of the experimental factors of block type (volatile, stable) and task version (reward, aversive) to our a priori hypotheses. For models #1 and #2, this resulted in twelve parameterized components estimated for each participant for each model.
PSISLOO was substantially lower for the additive model #2 than for the multiplicative model #1 (difference in PSISLOO = −1,637; std. error of difference = 241). Therefore, model #2 was carried onto the next stage of model comparison. This finding is in line with observations of separate striatal representations for outcome magnitude and probability (Berns and Bell, 2012). Additive models, as opposed to expected value models, have also been found to provide a better fit than expected value models in a probabilistic reversal learning task in nonhuman primates (Donahue and Lee, 2015).
Stage 2: Influence of relative outcome value (model #2 versus models #3#6)
Request a detailed protocolWe next investigated decomposing parameters according to whether the outcome on a given trial was better than or worse than expected, that is whether there was a positive or negative prediction error (Frank et al., 2007; Cox et al., 2015; Eldar et al., 2016). Specifically, we allowed the parameters in model #2 to differ on trials following good versus bad outcomes and for this difference in relative outcome value to interact with task version (reward versus aversive) and block type (volatile versus stable). As a reminder, good outcomes were defined to be the receipt of a reward in the reward gain version of the volatility task and an omission of electric stimulation in the aversive version of the volatility task.
In model #3, we used this additional decomposition for the learning rate parameter alone. This substantially improved PSISLOO relative to model #2 (difference in PSISLOO = −614; std. error of difference = 126). In model #4, we allowed inverse temperature $\omega $ and the mixture weight $\lambda $, instead of learning rate, to differ for good and bad outcomes; this also improved PSISLOO relative to model #2 (difference in PSISLOO = −122; std. error of difference = 50), but to a lesser extent than observed for model #3. In model #5, we allowed all three parameters to differ between good and bad outcomes and this achieved an even better PSISLOO relative to model #2 (difference in PSISLOO = −701; std. error of difference = 142). In model #6, we added the triple interaction between block type, task version, and relative outcome value; this resulted in a slightly higher (worse) PSISLOO relative to model #5 (difference in PSISLOO = 23; std. error of difference = 9). Therefore, model #5 was carried onto the next stage of model comparison. (Note that the equations for model #5 are identical to model #2, because the breakdown of composite parameters into parameter components is omitted from the equations to reduce notational clutter).
Stage 3: Nonlinear effects of magnitude difference (model #5 versus model #7)
Request a detailed protocolWe next considered the possibility that participants might treat differences in outcome magnitudes nonlinearly (i.e. that they make decisions on the basis of subjective rather than objective outcomes magnitudes), and that the degree of this nonlinearity might differ between rewarding and aversive outcomes. Model #7 uses a scaling parameter ($r\in \left[0.1,10\right]$) to capture this potential nonlinearity (Equation 6b; note that the sign for the difference in this equation was temporally removed before exponentiating and then added back again). The scaling parameter ($r$) was divided into a baseline component and a component for the difference between the reward and aversive versions of the volatility task. This decomposition was chosen given the likely possibility that participants treat differences in reward magnitudes and differences in shock magnitudes differently. (Further division of this parameter was not performed to reduce the number of model comparisons). Adding this parameter improved PSISLOO relative to model #5 (difference in PSISLOO = −308; std. error of difference = 104). Therefore, model #7 was carried forward.
Equations 6ac (Model #7).
(a) (updating probability estimates).
(b) (mixture of probability and magnitude).
(c) (softmax action selection).
Stage 4: Accounting for lapses in attention (model #7 versus model #8)
Request a detailed protocolWe next tested the inclusion of a lapse parameter ($\u03f5\in \left[0,1\right]$; model #8; Equation 7c) to allow for the possibility that participants occasionally make unintended choices due to lapses of attention or motor errors. The lapse parameter was divided into a baseline component and a component for the difference between reward and aversive versions of the volatility task. Further divisions by experimental condition were not tested given that both the lapse parameter and the inverse temperature parameter similarly aim to capture noise in participants’ choices and the inverse temperature parameter was already divided into seven different components. Model #8, which included the lapse parameter, had a slightly worse PSISLOO than Model #7 (difference in PSISLOO = 31; std. error of difference = 43). Therefore, Model #7 was retained as the best model and carried forward.
Equations 7ac (Model #8).
(a) (updating probability estimates).
(b) (mixture of probability and magnitude).
(c) (softmax action selection with lapse).
Stage 5: Separate probability estimates for each shape (model #7 versus models #9#10)
Request a detailed protocolWe next allowed for the possibility that participants maintain two separate probability estimates, one for shape 1 and one for shape 2. Although participants were instructed that these two probabilities were yoked (i.e. p and 1p), it was possible that they treated them independently. Stimulusspecific (or actionspecific) probability estimates are commonly found in models, such as Qlearning (Li and Daw, 2011; Mkrtchian et al., 2017; Aylward et al., 2019). Hence, we compared model #7 to two different models that update stimulusspecific probability estimates.
Model #9 updates the two probability estimates using a RescorlaWagner rule, similarly to model #7. On each trial, only one of these two estimates is updated, that is, the one corresponding to the shape chosen by the participant on that trial. In this model, $O{1}_{t1}=1$ if shape 1 is followed by a good outcome, whereas $O{2}_{t1}=1$ if shape 2 is followed by a good outcome. When bad outcomes occur, these variables take on a value of zero. After the update, both estimates decay toward 50% (Equation 8cd) using a decay rate ($\delta \in \left[0,1\right]$). The decay parameter consisted of a baseline component and a component for the difference between reward and aversive versions of the volatility task. The decay and learning rate act in opposite directions on probability estimates, which can make it difficult to separately estimate their effects. To mitigate this estimation issue, we did not allow decay to differ by block type (volatile versus stable) or by relative outcome value (good versus bad).
Equations 8ac (Model #9).
(a) (updating probability estimates, shape 1).
(b) (updating probability estimates, shape 2).
(c) (decay probability estimate, shape 1).
(d) (decay probability estimate, shape 2).
(e) (mixture of probability and magnitude).
(f) (softmax action selection).
Model #10 uses a BetaBernoulli Bayesian model to update the outcome probability estimates for each shape. Outcome probabilities are estimated by updating the counts of good or bad outcomes that followed the choice of each shape (Equations 9a–d). The update parameter ($\alpha \in \left[0,10\right]$) acts similarly to a learning rate. Only one of these counts is updated on each trial, that is, the one for the chosen shape and the outcome received. Otherwise the values of these counts decay towards zero (Equations 9a–d). The outcome probabilities for each shape are calculated as the mean of the Beta distribution using Equations 9e–f.
Equations 9ac (Model #10).
(a) (updating good outcome counts, shape 1).
(b) (updating bad outcome counts, shape 1).
(c) (updating good outcome counts, shape 2).
(d) (updating bad outcome counts, shape 2).
(e) (calculating probability estimates, shape 1).
(f) calculating probability estimates, shape 2).
(g) (mixture of probability and magnitude).
(h) (softmax action selection).
Neither model #9 nor model #10 improved PSISLOO relative to model #7 (difference in PSISLOO = 223 for model #9; std. error of difference = 72; and difference in PSISLOO = 171 for model #10; std. error of difference = 71; respectively). Therefore, model #7 was retained as the best model.
Stage 6: Including a choice kernel (model #7 versus model #11)
Request a detailed protocolIn the sixth stage, we allowed for the possibility that participants tended to repeat (or avoid repeating) past choices independently of the recent outcomes received (Lau and Glimcher, 2005; Ito and Doya, 2009; Akaishi et al., 2014). Model #7 was augmented to include a choice kernel (${k}_{t}$), which acts like a moving average for the participant’s recent choices. The choice kernel is updated using an update rate ($\eta \in \left[0,1\right]$; Equation 10c), which effectively determines how many previous choices are contained in the value of the choice kernel on the current trial. The impact of the choice kernel on choice was determined by an additional inverse temperature parameter (${\omega}_{k}\in {R}^{+}$; Equation 10d). The update rate ($\eta )$ consisted of only the baseline component and the inverse temperature parameter (${\omega}_{k}$) was divided into a baseline component and a component for the difference between reward and aversive versions of the volatility task. Further division of these two parameters was not performed, because the MonteCarlo Markov chains used for parameter estimation failed to converge when allowing further dependences on experimental condition (i.e. task version for the update rate or block type for update rate and inverse temperature) or relative outcome value (good versus bad); this suggests that there is insufficient information in the data for discriminating different values of these parameters across these components.
Equations 10ad (Model #11).
(a) (updating probability estimates).
(b) (mixture of probability and magnitude).
(c) (updating choice kernel).
(d) (softmax action selection).
Model #11 improved PSISLOO over model #7 (difference in PSISLOO = −117; std. error of difference = 42) and was therefore carried forward to the next stage.
Stage 7: Adding a choice kernel to models from Stage 5 (model #11 versus models #12 and #13)
Request a detailed protocolFinally, we examined adding the choice kernel to models from stage 5 (model #9 and model #10). Model #12 and model #13 were identical to models #9 and model #10, respectively, except for the inclusion of a choice kernel. The equations for model #12 and #13 are omitted because they can be directly obtained by replacing either Equation 8f or Equation 9h with Equations 10cd. Neither of these two models improved PSISLOO over model #11 (difference in PSISLOO = 179 for model #12; std. error for difference = 71; and difference in PSISLOO = 144 for model #13; std. error for difference = 63); therefore, model #11 was retained as the best fitting model.
Winning model
Request a detailed protocolAs the winning model, model #11 is used in the main manuscript (see Equations 10ad). For ease of reference, we provide the full list of parameter components again here. The learning rate ($\alpha )$, the mixture parameter ($\lambda $), and the inverse temperature parameter ($\omega )$ were each broken down into seven components: a baseline, three main effects: block type (volatile versus stable), task version (reward versus aversive), relative outcome value (good versus bad), and the three twoway interactions of these effects. The inverse temperature (${\omega}_{k}$) for the choice kernel and the subjective magnitude parameter ($r$) each had a baseline component and were allowed to differ between the reward and aversive versions of the task, but not between volatile and stable blocks or between trials following good or bad outcomes. The update rate ($\eta $) for the choice kernel had a single baseline component that did not vary by task version, block type or relative outcome value.
The results of model comparison for experiment 1 are summarized in Appendix 3—table 1.
Model parameter recovery
Request a detailed protocolWe conducted a parameter recovery analysis to check that model #11’s parameters were identifiable. Subjectspecific posterior means for each component of each parameter from model #11 (e.g. ${\alpha}_{baseline}$, ${\alpha}_{volatilestable}$, etc.) were used to simulate new choice data. The winning model was then refit to each of these simulated datasets. The original parameter components estimated from the actual dataset (referred to as ‘ground truth’ parameters) were correlated with the newly estimated parameter components (referred to as ‘recovered’ parameters) for each simulated dataset. An example of one simulated dataset is given in Appendix 4—figure 1 for learning rates and in Appendix 4—figure 2 for the other parameters. This procedure was repeated for 10 simulated datasets.
We also examined the robustness of estimates for populationlevel parameters ($\mu ,{\beta}_{g},{\beta}_{a},{\beta}_{d}$) by looking at the variability in their values across the simulated datasets. Variability across datasets reflects sensitivity to noise in participants’ choices, which is estimated by the fitted values for the two inverse temperatures in the model. The populationlevel parameters for each simulated dataset are shown in Appendix 4—figure 3 for learning rate and in Appendix 4—figure 4 for the other model parameters.
We also fit an additional model that parametrized the threeway interaction of block type (volatile, stable), relative outcome value (good, bad), and task version (reward, aversive) for learning rate. We conducted a parameter recovery analysis for this model that paralleled that described above. Specifically, 10 additional datasets were simulated using the subjectspecific parameter components from this model, the model was refit to these simulated data, and the recovered parameters were correlated with the ground truth parameters. An example of one simulated dataset for this analysis is given in Appendix 4—figure 5.
Model reproduction of basic features of choice behavior
Request a detailed protocolAs a final check, we simulated data from the winning model to see if it would reproduce basic qualitative features of participants’ actual choice behavior. The number of trials on which a participant stays with the same choice or switches to the other choice is one qualitative feature that our model should be able to reproduce even though it was not optimized to do so. In each of the simulated datasets described in the previous section, we summed the number of trials on which each simulated participant switched from one choice to the other. Each simulated participant corresponded to an actual participant whose estimated parameter values were used to generate the data for that simulated participant. Therefore, we can examine the correlation between the number of switch trials made by the simulated participants and those made by the actual participants. The actual number and the simulated number of switch trials was highly correlated across participants; see Appendix 7.
Confirmatory bifactor analysis
A new sample of participants was recruited and administered the battery of anxiety and depression questionnaires used in experiment 1 to test the reproducibility of the factor structure estimated in experiment 1.
Participants
Three hundred and twentyseven participants (203 women; mean age = 21.1 years) were recruited from UC Berkeley’s psychology research pool and asked to fill out the same battery of anxiety and depression questionnaire measures as used in Experiment 1. This questionnaire session was completed online in a single session, using Qualtrics (Qualtrics, 2014). Experimental procedures were approved by the University of CaliforniaBerkeley Committee for the Protection of Human Subjects and carried out in compliance with their guidelines. Participants checked a box online to indicate consent prior to participation. Participants received course credit for participation. Participants were excluded from the confirmatory bifactor analysis if their dataset contained one or more missing responses; 199 complete datasets were obtained (see Appendix 1—table 2 for participant details).
Measures of internalizing symptoms
Request a detailed protocolParticipants completed the same standardized questionnaire measures of anxiety, depression and neuroticism as listed under experiment 1.
Confirmatory bifactor analysis
Request a detailed protocolFor the confirmatory bifactor analysis, all of the individual itemlevel questions were allowed to load on the general factor. Additionally, each item was also allowed to load onto either the anxietyspecific factor or the depressionspecific factor. This assignment was determined by whether the item had a loading greater than 0.2 on that specific factor in the exploratory bifactor analysis conducted within experiment 1. Fiftyfive items had loadings greater than 0.2 on the depressionspecific factor, 35 items had loadings greater than 0.2 on the anxietyspecific factor, and the remaining items had loadings less than 0.2 for both the depression and anxietyspecific factors; no items had loadings greater than 0.2 on both the depression and anxietyspecific factors. After item assignment, factor loadings were reestimated. Diagonally weighted least squares estimation was used, because it is more appropriate for ordinal data and less sensitive to deviations from normality than maximum likelihood estimation (Li, 2016). This procedure was conducted using the Lavaan package in R and specifying that an orthogonal solution should be obtained. Quality of fit was determined by the comparative fit index (CFI) and the root mean square error of approximation (RMSEA).
Experiment 2
Participants
One hundred and seventytwo participants (74 females) were recruited from Amazon’s Mechanical Turk. Using qualifications on the Mechanical Turk platform, participants were restricted to be from the United States and were required to be 18 years of age or older. Participants’ specific ages were not recorded. Twentyfive participants were excluded from the online dataset for having greater than 10 missed responses in each task, leaving 147 participants for analysis.
Experimental procedure
Request a detailed protocolThis experiment was approved by the University of CaliforniaBerkeley Committee for the Protection of Human Subjects and carried out in compliance with their guidelines. One hundred and seventytwo participants were recruited using Amazon’s Mechanical Turk platform. Participants viewed an online version of the consent form and were asked to click a box to indicate their consent. They were then redirected to an externally hosted website to take part in the experiment. On the externally hosted website, participants first completed standardized selfreport measures of anxiety and depression and then completed two alternate versions of the volatility task: a reward gain and a reward loss task. These alternate versions were modified from those in experiment 1 to be suitable for online administration. The reward gain task and reward loss task were completed in the same session. Participants were required to take a 5min break after filling out the questionnaires and a second 5min break before completing the second task.
Participants were paid a fixed amount of $8 for completing the experimental session, which took approximately 1–1.5 hr. Participants could also earn a bonus between $0 and $3 depending on their performance. At the start of the experimental session, participants were told that their cumulative performance across both versions of the volatility task would be compared to other participants’ performance and a bonus of $3 will be awarded to participants that score in the top 5%, $1 to those in the top 10%, and $0.25 to the top 50%.
Selfreport measures of internalizing symptoms
Request a detailed protocolParticipants in experiment 2 completed the Spielberger StateTrait Anxiety Inventory (STAI form Y; Spielberger et al., 1983), the Beck Depression Inventory (BDI; Beck et al., 1961), the Mood and Anxiety Symptoms Questionnaire (MASQ; Clark and Watson, 1995; Watson and Clark, 1991). Participants showed similar distributions of scores on these measures to participants in experiment 1; see Appendix 1—table 1 and Appendix 1—table 3 for the means and standard deviations of questionnaire scores for participants in experiment 1 and experiment 2, respectively.
Estimation of latent factor scores
Request a detailed protocolFactor scores were calculated using the 80 items from the STAI trait scale, the BDI and the MASQ. Factor scores were calculated using the loadings (i.e. factor structure) estimated in experiment 1 and regressing participants responses from experiment 2 onto those loadings using a weighted leastsquares regression (Anderson and Rubin, 1956). Prior to the calculation of factor scores, item responses were normalized across datasets (across the data from participants in experiment 1, the online UC Berkeley participant dataset used for the confirmatory factor analysis and data from participants in experiment 2). We also used data from the first two of these datasets to compare the factor scores obtained using the STAI, MASQ, and BDI versus the full set of questionnaires.
Online version of the probabilistic decisionmaking under volatility task
Request a detailed protocolThe online versions of the probabilistic decisionmaking under volatility task comprised a reward gain and a reward loss version of the task. The structure of the task was kept as close to the inlab versions as possible. On each trial, participants were presented with two shapes, one on either side of the computer screen, and were instructed to choose one or the other. Each shape was probabilistically associated with the gain or loss of reward; the nature of the outcome (reward gain or reward loss) depending on the version of the task. When making their choice, participants were instructed to consider both the magnitude of the potential gain or loss associated with each shape, which was shown to participants inside each shape and varied across trials, and the probability that each shape would result in reward gain or reward loss. The outcome probability could be learned across trials, using the outcomes received. Participants were instructed that, on each trial, only one of the two shapes would result in the gain (or loss) of reward if chosen, while the other shape would result in no gain (or no loss) if chosen.
In the reward gain version of the task, the magnitude of potential reward gain associated with each shape, on each trial, varied from 1 to 99 reward points that would be added to the participants’ point total if choice of that shape resulted in reward receipt. In the reward loss version of the task, the magnitudes illustrated corresponded to 1 to 99 points that would be subtracted from the participants’ point total if choice of the given shape resulted in reward loss. On trials where participants chose the shape not associated with a gain (or loss), zero points were added (or subtracted) from the participants’ point total. In the reward gain task, participants started with a total of 0 points, while in the reward loss task, participants started with 5000 points. Outcome magnitudes for the two shapes were varied across trials; the sequences of outcome magnitudes for the reward gain task and for the reward loss task were the same as those used for the reward gain task and the aversive task in experiment 1, respectively. At the end of both tasks, the total number of points remaining was summed and used to determine the monetary bonus.
Each task was divided into a stable and volatile block of trials, each 90 trials long. In the stable block, one shape was associated the gain (or loss) of reward 75% of the time and the other shape was associated with the gain (or loss) of reward 25% of the time. In the volatile block, the identity of shape with a higher probability of resulting in a gain or loss (80% probability) switched every 20 trials. The identity of the first block, stable or volatile, was randomized across participants. Participants were not told that the task was divided into two blocks.
The inlab and online versions of the volatility tasks differed in certain details. In the inlab versions, participants were given 8 s to respond, while in the online version, they were given 6 s. Additionally, the time between choice and outcome presentation was shortened from 4 s (inlab) to 1 s (online). The time following outcome presentation and before the next trial was also shortened from 4 s (inlab) to 1 s (online). The shapes used in the two experiments also differed. Inlab, the two shapes were either a circle or a square and textured as Gabor patches. Online, the two shapes were both circles, one red and one yellow.
Computational modeling of task performance
Request a detailed protocolParticipants’ choice data from the two online versions of the probabilistic decisionmaking task were modeled using the bestfitting model (#11) from experiment 1. The same hierarchical Bayesian procedure was used to fit model #11 as described previously (see Materials and methods: Experiment 1). As in experiment 1, participants’ scores on the three internalizing factors were entered into the populationlevel prior distributions for model parameters. Hamiltonian MonteCarlo methods were used to sample from the posterior distributions over the model parameters. The resulting posterior distributions and their 95% highest density intervals (HDI) for the populationlevel parameters were used to test hypotheses regarding the relationship between internalizing symptoms and task performance.
Checking model fit
Request a detailed protocolFor consistency, we sought to use the same model to fit participants’ data from experiment 2 as that used in experiment 1. It was possible that differences in procedure (online versus in lab, timing changes, use of reward loss in place of shock) might impact model fit. Hence, we sought to validate that the bestfitting model from experiment 1 (model #11) was also a good fit to data from online participants in experiment 2. To address this, we fit all 13 models to the online data and compared fits across models. Model #11 had the second lowest PSISLOO (Appendix 3—table 2). Model #12 (which differs from model #11 only in the use of separate learning rates for each shape) had a slightly lower PSISLOO. However, the difference in PSISLOO for models #11 and #12 was within one standard error (difference in PSISLOO = 43; SE = 49). In contrast, model #11’s PSISLOO was more than two standard errors better than model #12 in experiment 1 (difference in PSISLOO = 179, SE = 71). Furthermore, the average PSISLOO across both experiments was lower for model #11 (PSISLOO = 38,543) than for Model #12 (PSISLOO = 38,611). Hence, if we seek to retain one model across both experiments, model #11 is the better choice.
As in experiment 1, we also checked that model #11 could reproduce basic qualitative features of the data. The posterior means for each participant’s model parameter components were used to simulate new choice data. We summed the number of trials on which each simulated participant switched from one choice to the other. Each simulated participant corresponded to an actual online participant whose estimated parameter values were used to generate the data for that simulated participant. As in experiment 1, the actual number and the simulated number of switch trials was highly correlated across participants, see Appendix 7.
Fitting behavioral models using alternate population parameters
Request a detailed protocolTo supplement the main analyses, we repeated the hierarchical modeling of task performance using the same winning model (#11) but replacing the two specific factors’ scores (entered, together with scores on the general factor, at the populationlevel during model estimation) with residual scores on either the PSWQ scale and MASQAD subscale or the MASQAA and MASQAD subscale (residuals obtained after regressing out variance explained by scores on the general factor). These provided alternate indices of anxietyspecific and depressionspecific affect. In particular, the analyses using the MASQAA subscale enable us to ensure we are not missing effects of anxiety linked to variance captured by this subscale that is not captured by the latent anxietyspecific factor extracted from the bifactor analysis (see final section of the results and Appendix 6—figure 1 and Appendix 6—figure 2).
Appendix 1
Basic demographic details
In this section, we provide demographic details for participant samples for experiment 1 (n = 86), see Appendix 1—table 1; the confirmatory factor analysis (n = 199), see Appendix 1—table 2; and experiment 2 (n = 147), see Appendix 1—table 3.
Appendix 2
Additional factor analyses
We compared the fit of the bifactor model, as assessed by confirmatory factor analysis in a novel dataset (n = 199), with that of three other models: a twocorrelated factors models, a general factor only model, and a higher order factor model. Quality of fit was determined by the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). These analyses were conducted using the Lavaan package in R.
As with the bifactor model, we conducted confirmatory factor analyses for each of these alternate models using the CFA dataset (n = 199). As for the bifactor model CFA, this entailed specifying the assignment of items to factors and then reestimating the loadings.
For the twocorrelated factors model, itemtofactor assignments were obtained by first using exploratory factor analysis (EFA) to fit this model to the data from experiment 1. The exploratory fitting procedure was the same as the first stage of the main bifactor analysis, that is polychoric correlations were calculated and an oblique rotation (i.e. oblimin) was applied. Examination of factor loadings revealed that one factor loaded predominantly onto anxietyrelated items and the other factor loaded predominantly onto depressionspecific items. The two factors were strongly correlated (r = 0.5), reflecting the high degree of correlation (r > 0.52) amongst depression and anxietyrelated questionnaires (e.g. STAI, BDI, etc.) in our dataset. Items with loadings < 0.2 on both specific factors in the exploratory analysis were fixed to have loadings equal to zero for confirmatory analysis. Items with loadings > 0.2 were then reestimated during the confirmatory analysis using diagonally weighted least squares estimation. This is same procedure carried out for the main bifactor analysis (see Materials and methods: Confirmatory Bifactor Analysis).
The only difference between the twocorrelated factors model and the bifactor model in the exploratory factor analysis is that the latter involves SchmidLeiman orthogonalization following the oblique rotation in order to partition shared versus unique variance. Because this orthogonalization does not change the flexibility of the model (Reise, 2012), the two models fit equally well in the dataset they were estimated on (i.e. the experiment 1 dataset). Comparing these models using confirmatory factor analysis in the independent online sample (n = 199), showed a better fit for the bifactor model (rootmeansquare error of approximation RMSEA = 0.065) than for the correlated factors model (RMSEA = 0.083).
For the general factor only model, item assignments were taken directly from the main bifactor model (as fit in the EFA conducted on questionnaire data from experiment 1), simply excluding assignments for the specific factors. Conversely, the higher order factor model used the same specificfactor item assignments as the bifactor model (as fit in the EFA) but excluded general factor assignments; here, a higher order factor was allowed to explain correlations between the two specific factors but was not allowed to load directly onto items themselves. In the independent online sample used for the CFA (n = 199), the full bifactor model showed a better fit (RMSEA = 0.065) than either the general factor only model (RMSEA = 0.110), or the higher order three factor model (RMSEA = 0.084).
Appendix 3
Model comparison tables
In this section, we include summaries of the model comparison results for experiment 1 (Appendix 3—table 1) and experiment 2 (Appendix 3—table 2); for a full description of the models compared and the comparison procedure see Materials and methods: Stagewise Model Construction.
Appendix 4
Model parameter recovery results
Using estimated parameter values for the winning model (model #11), we simulated 10 new datasets for each of Experiment 1’s participants (n = 86), see Materials and methods. By fitting the model to these simulated datasets, we can compare the ground truth parameter values with the parameters recovered from the simulated data. The recovered parameters strongly correlated with the ground truth parameter values. For learning rate components, the mean correlation was r = 0.88 (std = 0.13) across all simulated datasets, see Appendix 4—figure 1 for an example dataset. Across all parameters, the mean correlation across simulated datasets was r = 0.76 (std = 0.15), see Appendix 4—figure 2 for an example dataset. These analyses indicates that individual model parameters were recoverable as desired.
We also examined the robustness of our estimates for populationlevel parameters ($\mu ,{\beta}_{g},{\beta}_{a},{\beta}_{d}$) by looking at the variability in their values across the simulated datasets. Variability across datasets reflects sensitivity to noise in participants’ choices, which is estimated by the fitted values for the two inverse temperatures in the model. The populationlevel parameters for each simulated dataset are shown in Appendix 4—figure 3 for learning rate and in Appendix 4—figure 4 for the other model parameters.
As described in the main text, in Experiment 1, we also fit an additional model that parametrized the threeway interaction of block type (volatile, stable), relative outcome value (good, bad), and task version (reward, aversive) for learning rate. We conducted a parameter recovery analysis for this model that paralleled that described above (see also Materials and methods). This confirmed that the three way interaction of block type, relative outcome value, and task type on learning rate could be successfully recovered. The average correlation between ground truth and recovered parameters values for this triple interaction, across 10 simulated datasets, was r = 0.86 (std = 0.10). Parameter recovery results for all learning rate components from an example simulated dataset are given in Appendix 4—figure 5.
Appendix 5
Task performance
In this section, we examine general task performance across experiment 1 and experiment 2. This enables us to compare the general level of performance in the reward gain version of the task across both experiments and to compare the general level of performance in the aversive (shock) version of the task (as conducted in experiment 1) to that in the reward loss version of the task (as conducted in experiment 2). Performance was measured using the average outcome magnitude received across trials, that is, the mean reward gained or lost and the mean level of electrical stimulation (shock) received. Higher average magnitudes indicate better performance in the reward gain version of the task; lower average magnitudes indicate better performance in the reward loss and aversive (shock) versions of the task. If anything, general level of performance was closer between the two punishment versions of the task: aversive (shock), reward loss than between the two reward gain versions of the task.
Appendix 6
Alternate modeling of anxiety and depressionspecific effects upon behavior
As detailed in the main manuscript, the depressionspecific factor was strongly correlated with residual scores on the MASQAD (after regressing out variance explained by scores on the general factor) and the anxietyspecific factor was strongly correlated with residual scores on the PSWQ (created in an equivalent fashion, see Figure 1—figure supplement 2). Using residual scores on these two scales together with scores on the general factor provides a way to approximate the bifactor model that is arguably more interpretable for clinicians who are familiar with these measures. Accordingly, we conducted an alternative hierarchical model of participants’ task performance in experiment 1, entering individual scores on the general factor along with residual scores on the MASQAD and PSWQ into the populationlevel prior distributions for model parameters (i.e. as populationlevel covariates). As expected, this produced extremely similar results to the main bifactor model. The relationship between general factor scores and the effect of volatility and volatility by relative outcome value on learning rate remained statistically credible. No credible effects on learning rate were observed for residual MASQAD or PSWQ scores (Appendix 6—figure 1).
We used a similar approach to investigate if symptoms uniquely linked to the anxious arousal scale of the MASQ (MASQAA) and not captured by the general factor might explain variance in participants’ behavior. Scores on the MASQAA were not strongly correlated with the anxietyspecific factor from the bifactor analysis and nearly half the variance in these scores remained unexplained after regressing out variance explained by the general factor. Hence the proposed analysis enables us to ensure that we are not missing a relationship between anxiety and learning rate that might be linked to symptoms uniquely captured by the MASQAA subscale. We hence fitted a hierarchical model to participants’ task performance with participants scores on the general factor and residual scores on the MASQAA and the MASQAD (after variance explained by scores on the general factor were regressed out) entered as populationlevel covariates. We modeled data from both experiment 1 and experiment 2 in this manner as the MASQ was given in both cases. Both analyses replicated the general factor effects reported above (for experiment 1 see Appendix 6—figure 1, for experiment 2 see Appendix 6—figure 2). Residual scores on the MASQAA subscale did not credibly influence adaptation of learning rate to volatility or the effect on learning rate of the interaction of block type (volatile, stable) by relative outcome value or task version. There was also no relationship between MASQAA scores and differential learning as a function of task type (aversive versus reward gain in experiment 1 and reward loss versus reward gain in experiment 2). In experiment 2, elevated residual scores on the MASQAA scores were associated with greater learning following outcomes of relative positive versus negative value (Appendix 6—figure 2). However, no equivalent relationship was observed in experiment 1 (Appendix 6—figure 1). We hence caution against weight being given to this observation prior to any replication of this result.
Appendix 7
Model reproduction of basic features of choice behavior
Data were simulated using the winning model (#11) to check that it could reproduce basic qualitative features of participants’ actual choice behavior in experiment 1 and experiment 2.
The number of trials on which a participant stays with the same choice or switches to the other choice is a qualitative feature that our model should be able to reproduce even though it was not optimized to do so. Using the datasets simulated for parameter recovery analyses, we summed the number of trials on which each simulated participant switched from choosing one shape to the other (see Materials and methods). Each simulated participant corresponded to an actual participant whose estimated parameter values were used to generate the data for that simulated participant; we next summed actual switch trials. The actual number and the simulated number of switch trials was highly correlated across participants (r(84)>0.88 for all conditions and datasets). We give an example dataset in Appendix 7—figure 1. We repeated this analysis for experiment 2. Here again, the actual number and the simulated number of switch trials was highly correlated across participants (r(145)>0.82 for all conditions and datasets); see Appendix 7—figure 2 for an example dataset.
Data availability
The data and code used to create the figures and fit the hierarchical Bayesian models to the data is available on a Github repository associated with the first author and this repository has been shared on the Open Science Framework (https://osf.io/8mzuj/).

Open Science FrameworkID 8mzuj. Impaired adaptation of learning to contingency volatility in internalizing psychopathology.
References

ConferenceStatistical inference in factor analysisProceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. pp. 111–150.

Altered learning under uncertainty in unmedicated mood and anxiety disordersNature Human Behaviour 3:1116–1123.https://doi.org/10.1038/s4156201906280

The Known Unknowns: Neural Representation of SecondOrder Uncertainty, and AmbiguityJournal of Neuroscience 31:4811–4820.https://doi.org/10.1523/JNEUROSCI.145210.2011

Learning the value of information in an uncertain worldNature Neuroscience 10:1214–1221.https://doi.org/10.1038/nn1954

Intolerance of uncertainty: a common factor in the treatment of emotional disordersJournal of Clinical Psychology 69:630–645.https://doi.org/10.1002/jclp.21965

Opponency revisited: competition and cooperation between dopamine and serotoninNeuropsychopharmacology 36:74–97.https://doi.org/10.1038/npp.2010.151

Increasingly certain about uncertainty: intolerance of uncertainty across anxiety and depressionJournal of Anxiety Disorders 26:468–479.https://doi.org/10.1016/j.janxdis.2012.01.011

Common and specific dimensions of selfreported anxiety and depression: implications for the cognitive and tripartite modelsJournal of Abnormal Psychology 103:645–654.https://doi.org/10.1037/0021843X.103.4.645

Tripartite model of anxiety and depression: Psychometric evidence and taxonomic implicationsJournal of Abnormal Psychology 100:316–336.https://doi.org/10.1037/0021843X.100.3.316

ReportThe Mini Mood and Anxiety Symptom Questionnaire (MiniMASQ)Department of Psychology, University of Iowa.

Generalized anxiety disorder: a preliminary test of a conceptual modelBehaviour Research and Therapy 36:215–226.https://doi.org/10.1016/S00057967(97)000703

Intolerance of uncertainty and worry: investigating specificity in a nonclinical sampleCognitive Therapy and Research 25:551–558.https://doi.org/10.1023/A:1005553414688

Abnormal response to negative feedback in unipolar depression: evidence for a diagnosis specific impairmentJournal of Neurology, Neurosurgery & Psychiatry 63:74–82.https://doi.org/10.1136/jnnp.63.1.74

ReportManual of the Eysenck Personality Questionnaire (Junior and Adult)Hodder and Stoughton.

Factor analysis in the development and refinement of clinical assessment instrumentsPsychological Assessment 7:286–299.https://doi.org/10.1037/10403590.7.3.286

Why do people worry?Personality and Individual Differences 17:791–802.https://doi.org/10.1016/01918869(94)900485

Inference from iterative simulation using multiple sequencesStatistical Science 7:457–472.https://doi.org/10.1214/ss/1177011136

An investigation of the parallel analysis criterion for determining the number of common factorsMultivariate Behavioral Research 10:193–205.https://doi.org/10.1207/s15327906mbr1002_5

Depression: a decisiontheoretic analysisAnnual Review of Neuroscience 38:1–23.https://doi.org/10.1146/annurevneuro071714033928

Validation of decisionmaking models and analysis of decision variables in the rat basal gangliaJournal of Neuroscience 29:9861–9874.https://doi.org/10.1523/JNEUROSCI.615708.2009

Dynamic responsebyresponse models of matching behavior in rhesus monkeysJournal of the Experimental Analysis of Behavior 84:555–579.https://doi.org/10.1901/jeab.2005.11004

Signals in human striatum are appropriate for policy update rather than value predictionJournal of Neuroscience 31:5504–5511.https://doi.org/10.1523/JNEUROSCI.631610.2011

Classical fear conditioning in the anxiety disorders: a metaanalysisBehaviour Research and Therapy 43:1391–1424.https://doi.org/10.1016/j.brat.2004.10.007

Development and validation of the penn state worry questionnaireBehaviour Research and Therapy 28:487–495.https://doi.org/10.1016/00057967(90)901356

Modeling avoidance in mood and anxiety disorders using reinforcement learningBiological Psychiatry 82:532–539.https://doi.org/10.1016/j.biopsych.2017.01.017

Rational regulation of learning dynamics by pupillinked arousal systemsNature Neuroscience 15:1040–1046.https://doi.org/10.1038/nn.3130

Decision Neuroscience291–303, Opponent brain systems for reward and punishment learning: causal evidence from drug and lesion studies in humans, Decision Neuroscience, Academic Press, 10.1016/B9780128053089.000233.

The CESD scale: a selfreport depression scale for research in the general populationApplied Psychological Measurement 1:385–401.https://doi.org/10.1177/014662167700100306

Invited paper: the rediscovery of bifactor measurement modelsMultivariate Behavioral Research 47:667–696.https://doi.org/10.1080/00273171.2012.715555

Learning and choice in mood disorders: searching for the computational parameters of anhedoniaComputational Psychiatry 1:208–233.https://doi.org/10.1162/CPSY_a_00009

Probabilistic programming in Python using PyMC3PeerJ Computer Science 2:e55.https://doi.org/10.7717/peerjcs.55

The development of hierarchical factor solutionsPsychometrika 22:53–61.https://doi.org/10.1007/BF02289209

Parsing the general and specific components of depression and anxiety with bifactor modelingDepression and Anxiety 25:E34–E46.https://doi.org/10.1002/da.20432

Common and specific dimensions of selfreported anxiety and depression: a replicationJournal of Abnormal Psychology 104:542–545.https://doi.org/10.1037/0021843X.104.3.542

Common and specific dimensions of selfreported anxiety and depression: the BDIII versus the BDIIABehaviour Research and Therapy 37:183–190.https://doi.org/10.1016/S00057967(98)000874

Common and specific dimensions of SelfReported anxiety and depression in adolescent outpatientsJournal of Psychopathology and Behavioral Assessment 30:163–170.https://doi.org/10.1007/s1086200790602

Practical bayesian model evaluation using leaveoneout crossvalidation and WAICStatistics and Computing 27:1413–1432.https://doi.org/10.1007/s1122201696964

Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theoryJournal of Machine Learning Research 11:3571–3594.

ReportThe Mood and Anxiety Symptom Questionnaire (MASQ)Department of Psychology, University of Iowa.

Structure of anxiety and the anxiety disorders: a hierarchical modelJournal of Abnormal Psychology 105:181–193.https://doi.org/10.1037/0021843X.105.2.181
Decision letter

Alexander ShackmanReviewing Editor; University of Maryland, United States

Joshua I GoldSenior Editor; University of Pennsylvania, United States

Argyris StringarisReviewer

Samuel J GershmanReviewer; Harvard University, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
This paper examines the relationship between psychopathology and decisionmaking adaptation to changes in contingency volatility in the environment. Previous work from the senior author has shown that individuals with high trait anxiety were less able to appropriately adjust their learning rate to the level of volatility in the context of aversive learning. The current paper extends this previous work in several ways: first, it takes a transdiagnostic approach to symptoms across both anxiety and depression; second, it conducts a very comprehensive model comparison to select the most appropriate behavioral model; third, it includes both aversive and appetitive learning, to test for generality vs. domain specificity of any finding; finally, it also compares the effects of positive and negative prediction errors across the reward and punishment domains. The authors report impaired adaptation of learning rate – in both the reward and punishment domains – as a function on a general factor of internalizing psychopathology, rather than anxiety specifically.
Decision letter after peer review:
Thank you for submitting your article "Impaired adaptation of learning to contingency volatility as a transdiagnostic marker of internalizing psychopathology" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor, Dr Shackman, and Senior Editor, Joshua Gold. The following individuals involved in review of your submission have agreed to reveal their identity: Argyris Stringaris and Samuel J Gershman.
The Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary
This paper examines relations between psychopathology and decisionmaking adaptation to changes in contingency volatility in the environment. Previous work from the senior author has shown that individuals with high trait anxiety were less able to appropriately adjust their learning rate to the level of volatility in the context of aversive learning. The current paper extends this previous work in several ways: first, it takes a transdiagnostic approach to symptoms across both anxiety and depression; second, it conducts a very comprehensive model comparison to select the most appropriate behavioral model; third, it includes both aversive and appetitive learning, to test for generality vs. domain specificity of any finding; finally, it also compares the effects of positive and negative prediction errors across the reward and punishment domains. The authors report impaired adaptation of learning rate – in both the reward and punishment domains – as a function on a general factor of internalizing psychopathology, rather than anxiety specifically.
The reviewers and I found several strengths to the report:
An admirable paper. The topic is of great interest, the care to replicate is refreshing to see, and the modeling is meticulous. We think it could be an important contribution to the field.
Overall, we were impressed with this study. The modeling and analysis was carried out carefully, and is comprehensive almost to the point that I worry the main ideas are obscured by layers of model variations (though I think it's okay as is). The new results add nuance to the emerging literature on computational modeling of anxiety and depression. They highlight the need for a transdiagnostic understanding of these disorders with respect to learning in volatile environments. We also appreciated that the authors made an effort to establish the generality of their findings across multiple populations and task variants, with relatively large sample sizes.
This is a very interesting and well written paper. It has several strengths, including a hierarchical Bayesian approach to the model fitting, careful model comparisons, and simulations to validate the winning model, as well as the use of both reward and punishment tasks, which is quite rare in this kind of studies. Results from a laboratory experiment were also largely replicated in an online study.
This will be a nice addition to the literature on learning under uncertainty and how such learning is affected in psychopathology.
Major/general suggestions
1) Bifactor Modeling. Noted by several reviewers.
a) More needs to be done in terms of ensuring that the modeling of the clinical data is appropriate and to demonstrate that the authors are in a position to draw some of the inferences they are drawing.
b) A lot in this paper hinges on the appropriateness of the bifactor modeling. I will try to explain. The paper is an extension of previous work that the authors have done in anxiety. The main innovation here is the extension to include a) depression; b) a negative affectivity or general factor resulting from the bifactor modeling. The latter, broadly, refers to a model of (typically) clinical symptoms where their common variance is accounted for by a general factor, and other latent factors (in this case anxiety and depression) retain their unique variance (and are thus uncorrelated to each other). Such models have similarities to other hierarchical symptom models. They have enjoyed a recent resurgence for a variety of reasons. The problem with these models is that they are pretty flexible and can easily be misspecified. Given how much in this paper depends on the right choice of model, I would encourage the authors to do more to demonstrate that a bivariate model is indeed appropriate and superior to more parsimonious alternative. In particular, I am concerned about two things: first, the loadings for the general factor are relatively low, which is an indication that it may not really be a superordinate factor at all. Second, I am concerned about the fact that the authors do not seem to be comparing their model to a simpler structure (apologies if I have missed this) or alternative specifications (e.g. a superordinate model). I strongly suggest they do so and use appropriate metrics, such as the NLM. A good overview of the problems with inference and interpretation of such models is by Markon et al., Ann Rev Clin Psych 2019 and references therein, particularly Eid et al., 2017. Editor: See also recent simulation studies by Irwin Waldman's group and the HiTOP consortium (Roman Kotov)
c) Likewise, another reviewer wondered, "Is the bifactor analysis necessary? What happens is a simple factor analysis, or PCA, are used? Are the first three eigenvectors similar to the ones assumed in the bifactor analysis?"
2) Power Analysis. It would be useful if the authors gave an indication of how much power they would have in order to detect meaningful effect sizes for some of the two and particularly threeway interactions in the paper. It may be appropriate to highlight this as a limitation of the study.
3) Model Validation. Given the concerns that have been expressed about the problems with LOO (being unrepresentative and often anti correlated with the training dataset, see Poldrack, 2019 JAMA Psych), the authors should either provide a rationale for selecting LOO or recompute using kfolds.
4) Differences Across Experiments. How do the authors explain the discrepancy between the two experiments with respect to the block type (volatile vs. stable) effect? Is it possible that the populations differed in their factor scores? From eyeballing Figure 1, this explanation seems implausible, but it would be useful to address this question statistically.
https://doi.org/10.7554/eLife.61387.sa1Author response
Major/general suggestions
1) Bifactor Modeling. Noted by several reviewers.
[…] In particular, I am concerned about two things: first, the loadings for the general factor are relatively low, which is an indication that it may not really be a superordinate factor at all.
During the development and refinement of standardized questionnaire measures of psychopathology such as those used here (e.g. STAI, BDI, MASQ, etc.), reliability and validity are typically assessed at the level of the scale or occasionally subscale (see Bieling et al., 1998; additional references added below). Individual item responses tend to be noisy, and aggregated scores at the scalelevel effectively average out much of this noise. Given this, we reasoned that we should not be overly concerned about the nominal magnitude of loadings on individual items, if the factor scores derived from the loadings explain substantial variance in questionnaire (sub)scale summary scores and if the correlations between factor scores and questionnaire (sub)scale summary scores support factor content validity (i.e. point to the general factor capturing variance across depression and anxiety measures and point to the anxiety specific and depression specific factors capturing additional variance unique to anxiety or depression measures, respectively).
To examine this, we correlated the scores on the general and two specific factors extracted from the bifactor analysis with the scale / subscale scores that are typically calculated for each of the questionnaires administered (e.g. scores on the MASQ anhedonia and anxious arousal subscales, etc.). Although the bifactor analysis and these scores are based on the same items, the bifactor analysis was fit without specifying which items belonged to which questionnaires. Therefore, correlations between scores on the factors extracted from the bifactor analysis and scores on the questionnaire (sub)scales can be used to assess content validity of the extracted factors independently and to determine how much variance these factors can explain at the scale / subscale level. The correlations between factor scores and questionnaire scores has been added to Figure 1 as a new panel c.
In Figure 1C, it can be seen the general factor scores correlate strongly with all of the scale and subscale scores (r>0.6). This indicates that although the individual item loadings were moderately sized, the general factor scores derived from them explain a substantial amount of variance at the scalelevel. Moreover, the general factor explains variance across questionnaires that are commonly considered depressionspecific (e.g. BDI, CESD) and those considered anxietyspecific (e.g. the PSWQ and MASQAA, i.e. the anxious arousal subscale). These correlations provide strong evidence that there is indeed a common (superordinate) factor in these data.
A second important observation from Figure 1C is that the two specific factors each correlate strongly with only one or two specific scales or subscales. The depressionspecific factor scores correlate strongly with the MASQ anhedonia subscale (r=0.74; MASQAD), and moderately with the STAI depression subscale (r=0.5, STAIdep). The anxietyspecific factor scores correlate strongly with the PSWQ (r=0.76). This confirms that specific factors are indeed selective to specific symptoms, as intended, and further suggests that they explain a good amount of variance that is not captured by the general factor (given the specific factors are orthogonal to the general factor).
We have added the corresponding text to the manuscript.
“As an additional check of construct validity, participants’ scores on these factors were correlated with summary (scale or subscale) scores for the standardized questionnaires administered, Figure 1C. […] The latter conclusion can be drawn since the specific factors are orthogonal to the general factor and therefore their correlations with scale and subscale scores reflect independently explained variance.”
We also conducted a supplementary analysis in which we regressed participants’ general factor scores against these scale and subscale scores and calculated residual scores for each (sub)scale. We then correlated participants’ residual scores for each scale with their original (nonresidualized) scale scores. The extent to which the correlation drops beneath ‘1’ gives a measure of variance (i.e. the square root of this correlation difference) explained by the general factor (see new Figure 1—figure supplement 2). The two subscales with the highest amount of residual variance are the MASQAD and the PSWQ. We then considered how much of this residual variance can be explained by the specific factors from the bifactor model. It turns out that almost all of it can, as intended, be explained by the depressive and anxiety specific factors, respectively, r=0.95, r=0.96. This shows that much of the important variance not explained by the general factor is captured by the two specific factors.
We have added the corresponding text and figure supplement to the manuscript:
“This can be further demonstrated by regressing variance explained by scores on the general factor out of scale and subscale scores and then examining the relationship between residual scores for each scale with scores on the two specific factors. As shown in Figure 1—figure supplement 2, after removing variance explained by general factor scores, nearly all the remaining variance in PSWQ scores could be captured by the anxiety specific factor and nearly all the remaining variance in MASQAD scores could be captured by scores on the depression specific factor.”
Together, these analyses provide evidence that the general factor and the twospecific factors account for a large amount of shared and unique variance at the scale and subscalelevel, and moreover that they are valid constructs, acting as intended, to separate shared from unique variance in standardized measures of anxiety and depression.
Additional related modeling of task performance.
In the revised version of the manuscript, we include additional modeling of task performance using the residual scores on the MASQAD and either the PSWQ or MASQAA. This is presented in Appendix 6: Alternate modeling of anxiety and depressionspecific effects upon behavior. We explain this additional modeling here.
Given that residual scores on the MASQAD and PSWQ (after regressing out variance explained by scores on the general factor) are so strongly related to the two specific factors, we can use these residual scores in place of the specific factors in the hierarchical model of participants’ behavior on the volatility task. This has the didactic advantage of including wellknown, standardized measures, only with variance explained by a general factor residualized out. In addition, here we are using indices that retain (or explain) 4050% of the variance from the original measures.
As expected, a hierarchical model of task performance including participants’ scores for the general factor and residual scores for the MASQAD and PSWQ entered as populationlevel covariates produced extremely similar results to those described in the main text using the three factors obtained from the bifactor model. In Appendix 6—figure 1, we show that the modulatory effect of general factor scores on effects of volatility, relative outcome value, and volatility x relative outcome value on learning rate, as reported in the main manuscript (here shown by the black data points), were also statistically significant in this new model (as shown by the green data points). Neither residual scores on the MASQAD or PSWQ significantly modulated learning rate , i.e. the 95%HDI contained zero for the main effects and twoway interactions, Appendix 6—figure 1 (green data points). We note this analysis was conducted for experiment 1 only as the PSWQ was not administered to participants in experiment 2.
A similar approach can also be used to examine the relationship between scores on the MASQAA subscale and task performance. Regressing out scores on the general factor also left substantial residual variance in MASQAA scores (see Figure 1—figure supplement 2, panel B). This is interesting, as the MASQAA subscale may capture a different, potentially more somatic, and less cognitive, subset of anxietyspecific symptoms to those picked up by the PSWQ (Bijsterbosch et al., 2014). Notably, the MASQAA subscale was also the only measure included for which residual scores (after accounting for scores on the general factor) did not show positive moderate or high correlations with one of the two specific factors. In light of this, we also fit an alternate hierarchical model to participants’ behavior that included the general factor along with residual scores for the MASQAA and the MASQAD. The learning rate results for this model, when fit to the experiment 1 dataset, are also presented in Appendix 6—figure 1. In addition, we show the results for this model when fit to the experiment 2 dataset in Appendix 6—figure 2. In both cases, modulation by general factor scores of adaptation of learning rate to volatility, especially following outcomes of positive relative value, continue to be observed (blue data points). Residual scores from the MASQAA (and MASQAD in context of these other populationlevel parameters) did not significantly influence adaptation of learning rate to volatility or the effect on learning rate of the interaction of block type (volatile, stable) by relative outcome value or task version. There was also no relationship between MASQAA or MASQAD residual scores and differential learning as a function of task type (aversive versus reward gain in experiment 1 and reward loss versus reward gain in experiment 2). In experiment 2 only, there was an effect of residual scores from the MASQAA on the overall extent of learning following outcomes of positive versus negative relative value.
We hope these additional analyses might be of interest to readers who are used to primarily administering just one or two questionnaire measures.
“Appendix 6: Alternate modeling of anxiety and depressionspecific effects upon behavior.
As detailed in the main manuscript, the depressionspecific factor was strongly correlated with residual scores on the MASQAD (after regressing out variance explained by scores on the general factor) and the anxietyspecific factor was strongly correlated with residual scores on the PSWQ (created in an equivalent fashion), see Figure 1—figure supplement 2). […] However, no equivalent relationship was observed in experiment 1 (Appendix 6—figure 1). We hence caution against weight being given to this observation prior to any replication of this result.”
Second, I am concerned about the fact that the authors do not seem to be comparing their model to a simpler structure (apologies if I have missed this) or alternative specifications (e.g. a superordinate model). I strongly suggest they do so and use appropriate metrics, such as the NLM. A good overview of the problems with inference and interpretation of such models is by Markon et al., Ann Rev Clin Psych 2019 and references therein, particularly Eid et al., 2017. Editor: See also recent simulation studies by Irwin Waldman's group and the HiTOP consortium (Roman Kotov)
We thank you for raising this concern. Here, our response is in two parts—a theoretical consideration of the basis of model selection and the provision of additional models as requested.
The problems raised by Markon, 2019 and Eid et al., 2017, are very important and speak to the broader, but obviously difficult question of “what is the right basis on which to select a model?” Although we don’t think that a definitive answer to this question can be given, we do believe that the answer strongly depends on the scientific question and the theory on which that question is based. Our scientific question is largely based on the literature arguing that anxiety and depression symptoms should be conceptually separated into shared and unique variance. Given this premise, our main criterion for choosing a factor model was whether it could partition variance in a way that aligned with previous theory. In this way, we viewed the selection of a factor model for our experiment akin to the selection of a good set of regressors, in which interpretability is key. Bifactor analyses seems to be the most well supported analytic tool for partitioning the variance in the desired way (Reise, 2012; Markon, 2019), despite its potential limitations (Eid et al., 2017). Nevertheless, the fit to data is obviously important, and indeed we tried to assess this using confirmatory factor analysis on an independent dataset.
With these considerations in mind, we have edited the manuscript text to explain our choice of model better and have also detailed the fit of three alternative factor models within the Supplementary Materials.
Before turning to the additional models, we would like to address this point:
3) I strongly suggest they do so and use appropriate metrics, such as the NLM.
We thank the reviewer for pointing us to Normalized Maximum Likelihood (NML). We also find it very appealing from a theoretical standpoint. However, NML requires summing over all possible datasets and it is not clear how to construct a suitable set of all possible datasets for factor analysis. The references in Markon, 2019, discuss NML on a more theoretical level giving practical implementations for only certain types of models (e.g. logistic regression in Rissanen, 2007), but not factor analysis. We were not able to find applications of NML in a practical case like ours.
Additional models
In line with the request to consider a simpler structure, we fit a “flat” correlated two factor model. Here, one factor loaded predominantly on anxiety items and another loaded predominantly on depression items. The two factors were strongly correlated (r=0.58). This model is actually an intermediate step in the bifactor analysis, which then proceeds to extract a general factor from these two correlated factors by applying the SchmidLeiman transformation. Since the SchmidLeiman transformation shifts the variance partitioning rather than explaining additional variance, both models actually fit identically in the inlab dataset (n=86) on which the factor models were estimated. (Reise, 2012, as cited by Markon, 2019, actually provides a beautiful account of the equivalence of different models, noting how the correlated kfactor model and the general plus kspecific factors bifactor model are reparameterizations of oneanother). Reestimating both models in the confirmatory sample (n=199) showed that the correlated factors model had a slightly worse fit (RMSEA=0.083) than the bifactor model (RMSEA=0.065).
We also consider a superordinate (or higher order) factor model, as requested. This model differs from the bifactor model in that the general factor cannot directly explain correlations in individualitems, but rather explains them indirectly through the correlations between the lower level factors. We fit this model to the confirmatory data set, using the two factors from the bifactor model but removing influences of the general factor directly onto individual items. This model fit less well (RMSEA=0.083) than the bifactor model (RMSEA=0.065).
We also considered a single factor model; here, given the relatively small amount of item variance explained by the specific models, we examined if the general factor alone (as estimated in the exploratory factor analysis) provided as good a fit to the CFA dataset. This model also fit less well when reestimated in the confirmatory sample RMSEA=0.110. Furthermore, this model cannot be used to address our scientific question.
In light of the analyses just discussed, we made the following changes to the manuscript:
We’ve added a new section “Appendix 2: Additional Factor Analyses” describing these additional factor analyses. We also briefly summarize them in the main manuscript:
“As outlined earlier, we applied a bifactor model to itemlevel symptom responses as we sought to tease apart symptom variance common to anxiety and depression versus unique to anxiety or depression. With model selection, both the extent to which a given model can address the aim of the experiment and the fit of the given model to the data are important considerations. In addition to assessing the absolute fit of the bifactor solution in the confirmatory factor analysis (CFA) dataset, we can also consider its fit relative to that of alternate models. The bifactor model reported here showed a better fit to the novel (CFA) dataset than a “flat” correlated twofactor model, a hierarchical three factor model with the higher order factor constrained to act via the lower level factors, and a unifactor model created by retaining only the general factor and neither of the specific factors (see Appendix 2: Additional Factor Analyses for further details). We note that none of these alternate models would enable us to separate symptom variance common to anxiety and depression versus unique to anxiety and depression, as desired.”
c) Likewise, another reviewer wondered, "Is the bifactor analysis necessary? What happens is a simple factor analysis, or PCA, are used? Are the first three eigenvectors similar to the ones assumed in the bifactor analysis?"
We thank the reviewer for this suggestion. We describe the results for a “simple factor analysis” (the 2factor correlated model) above. In further response to this query, we also conducted a PCA on the itemlevel questionnaire data in experiment 1 and examined the top 5 components (PC’s). Scores on the first PC were extremely similar to the general factor scores (r=0.9; see Author response image 1). However, scores on subsequent PCs cannot be used to cleanly disentangle depression and anxiety symptomatology as desired to address the study’s aims. Specifically, scores on the second PC correlated positively with worry (PSWQ r=0.59) and negatively with anhedonia (MASQAD r = 0.3). Scores on the third PC correlated with anxious arousal (MASQAA; r=0.5), but also moderately with two depression measures (BDI r=0.19 and CESD r=0.18). The fourth PC correlated moderately with one depression measure BDI (r=0.25) but not the other CESD (r=0.03), and therefore likely captured something idiosyncratic to that measure. The fifth PC did not correlate strongly with any measure, so we stopped there in extracting principal components.
As mentioned above, this PC structure would not enable us to disentangle variance common to both anxiety and depression versus variance unique to anxiety or depression. However, we can use PC1 to demonstrate the robustness of our general factor findings. Specifically, if we fit a model to experiment 1 task data, which includes participants’ scores on PC1 along with residual scores on the MASQAD and PSWQ after having regressed out variance explicable by PC1 we obtain parallel findings to those obtained using the general factor and either the two specific factors derived from the bifactor analysis or residual scores on the MASQAD and PSWQ obtained after regressing out variance explicable by scores on the general factor. This again provides evidence that the results we obtained did not hinge on the exact specification of the bifactor model.
2) Power Analysis. It would be useful if the authors gave an indication of how much power they would have in order to detect meaningful effect sizes for some of the two and particularly threeway interactions in the paper. It may be appropriate to highlight this as a limitation of the study.
We thank the reviewers for raising this important issue regarding our power to detect twoway and particularly threeway interactions.
There is clear evidence that the study was amply powered to detect twoway interactions. Firstly, we actually observed significant twoway interactions for learning rate (goodbad)x(volatilestable) in both experiment 1 and experiment 2. Secondly, our parameter recovery simulations, which used the same sample size as our real data, were able to recover twoway interactions. Such recovery simulations can be seen as ways of estimating power in the very practical sense of “the ability to detect effects, assumed to be present, if a new dataset were to be collected”. That is, we assume the effects we observed were true, we simulate new data from these effects, and then we see how well the effects can be recovered (redetected). Good recovery then speaks to good power. For parameter recovery of effects representing twoway interactions for learning rate see Appendix 4—figure 1.
To assess our power to detect threeway interactions, we performed two additional analyses.
The first analysis applied the type of parameter recovery procedure just discussed to the model that included the threeway interaction. We’ve included the results of this analysis as a new Appendix 4—figure 5. In the figure, it can be seen that the threeway interaction term (bottom middle panel) has similar parameter recoverability to the other learning rate effects (other panels).
In addition, we conducted a second permutationbased analysis that used a synthesis procedure to show that the absence of a three way interaction was unlikely to be because of a lack of power. To explain: in the manuscript, we report that scores on the general factor modulate the effect of volatility by relative outcome value on learning rate but that this does not vary as a function of task version (a paradigmatic absent threeway interaction). As an additional check to see if we would have been able to determine if scores on the general factor modulated the effect of volatility by relative outcome value by task version on learning rate, we conducted the following analysis. We fit the model with the threeway interaction of volatility (volatilestable) by relative outcome (good – bad) by task version (reward, aversive) included to participants’ data from experiment 1 (see Permutation Analysis in Author response image 2 panel A). This shows the influence of scores on the general factor on the 2way interaction of volatility x relative outcome value and the absence of any influence of general factor scores on the 3way interaction. We then permuted participants’ posterior learning rate values from just the aversive task to randomize performance across conditions and participants for this version of the task. We next used participants’ posterior means for each learning rate component (actual for the reward version of the volatility task, and the permuted values for the aversive version of the volatility task) to generate new data for each participant; all other parameters were kept the same. We then refit the model to this simulated data and reestimated individual and populationlevel parameters. By eradicating between and within subjects effects upon learning rate in the aversive version of the task in this manner, while leaving them intact in the reward version of the task, we should, if sufficiently powered, now observe that general factor scores significantly modulates the effect of volatility by relative outcome value by task version upon learning rate. This is indeed the case, ${\beta}_{g}$=0.16, 95%HDI = [0.32, 0.02], see Permutation Analysis in Author response image 2 panel B (the threeway interaction effect is circled on the far right side of the figure). This shows that if there had been a threeway interaction in our dataset—for example a difference across tasks in the twoway interaction that we actually observed—we would likely have had the power to detect it.
Actual and recovered posterior mean learning rates for each participant in each condition are plotted in panels C to Ff); small dots are individual participants and large dots are grouplevel means. Panel(c) shows the nonpermuted posterior means for learning rate as a function of block type (volatile, stable) and relative outcome value (good, bad) for the reward version of the task. Panel (D shows the permuted posterior means for the aversive version of the task. Panels E and F show the corresponding recovered parameter values. It can be seen that by permuting posterior means for learning rate in the aversive version of the task, we wipe out the boost to learning rate after good outcomes under volatile conditions shown by participants with low scores on the general factor (see Figure 4B). Parameter recovery successfully reveals the presence of this effect in the reward version of the volatility task and its (manipulated) absence in the aversive version of the volatility task.
In view of the complexity and synthetic nature of this result, we elected not to include it in the paper.
We reference the results of the parameter recovery analysis in the Results section.
“We note that a parameter recovery analysis revealed successful recovery of the parameter representing the threeway interaction of block type, relative outcome value and task type on learning rate (Appendix 4—figure 5). This suggests that we did not simply fail to observe a threeway interaction due to lack of experimental power.”
3) Model Validation. Given the concerns that have been expressed about the problems with LOO (being unrepresentative and often anti correlated with the training dataset, see Poldrack, 2019 JAMA Psych), the authors should either provide a rationale for selecting LOO or recompute using kfolds.
We thank the reviewer for pointing out this interesting potential issue with leaveoneout cross validation raised by Poldrack, 2019.
First, we wanted to clarify that we do not calculate exact leaveoneout cross validation. It would have been too computationally expensive to refit all of our models that many times. It would have also been very computationally expensive to do exact kfold cross validation. Instead, we chose to approximately calculate leaveoneout cross validation using Pareto smoothed importance sampling (PSIS; Vehtari et al., 2017). This was likely confusing because we simply abbreviated our approach as LOO, which is often done (as is enshrined in the name of the extremely popular R package for calculating this quantity). We now specify this more clearly in the manuscript and have changed our abbreviation to PSISLOO to avoid confusion.
We chose to use PSISLOO, because it has become the accepted standard for model comparison in hierarchical Bayesian model, which are often too large for exact crossvalidation (Gelman et al., 2015). Vehtari et al., 2017 provides nice simulations showing that PSISLOO approximates the exact methods well in many situations.
We also considered another metric, WAIC, which is also commonly used to compare models in the hierarchical Bayesian setting. This metric penalizes overfitting in a different way than PSISLOO. Importantly, the model rankings obtained by PSISLOO and WAIC were identical.
Finally, we note that since our study was moderately well powered (n=86 in experiment 1, n=199 in experiment 2) we were also less concerned about the issue raised by Poldrack about leaveoneout crossvalidation. His concern relates to the fact that leaving out only a single data point could induce negative correlations between the test and training set. While this could be an issue in some cases, we believe that it is likely to be more of a problem in very small datasets, in which a single observation could have a sizeable influence on its own.
We’ve added the following text to our manuscript to further justify our use of PSISLOO and to mention that our results hold using WAIC.
“We compared the fits of different models using Pareto smoothed importance sampling to approximate leaveoneout cross validation accuracy (PSISLOO; Vehtari et al., 2017). PSISLOO is a popular method with which to estimate out of sample prediction accuracy as it is less computationally expensive than evaluating exact leaveoneout or kfold cross validation accuracy (Gelman et al., 2013). We note that comparing models using WAIC (Watanabe and Opper, 2010), an alternative penalizationbased criterion for hierarchical Bayesian models, resulted in identical model rankings for our dataset.”
4) Differences Across Experiments. How do the authors explain the discrepancy between the two experiments with respect to the block type (volatile vs. stable) effect? Is it possible that the populations differed in their factor scores? From eyeballing Figure 1, this explanation seems implausible, but it would be useful to address this question statistically.
We thank the reviewer for this very interesting point. Neither of the specific factors were associated with differential effects of block type on learning. However, if participants in experiment 2 had higher general factor scores, on average, than participants in experiment 1, then since higher scores on the general factor are associated with smaller effects of volatility on learning, such a difference could indeed potentially explain the presence of a group level effect of block type in experiment 1 but not experiment 2. However, comparing the general factor scores across the two samples, we found that experiment 1’s participants had statistically higher general factor scores than experiment 2’s participants (mean difference 0.38; t=3.0 p=0.002). This is the opposite direction to what would be needed to explain the presence of a grouplevel effect of block type (volatility) in experiment 1 but not in experiment 2.
We do not have a strong alternate explanation for this discrepancy, and hence we note it but do not speculate as to its cause.
References:
Bieling, P. J., Antony, M. M., and Swinson, R. P. (1998). The StateTrait Anxiety Inventory, Trait version: structure and content reexamined. Behaviour research and therapy, 36(78), 777788.
Bijsterbosch, J., Smith, S., Forster, S., John, O. P., and Bishop, S. J. (2014). Resting state correlates of subdimensions of anxious affect. Journal of Cognitive Neuroscience, 26(4), 914926.
Cuijpers, P., Gentili, C., Banos, R. M., GarciaCampayo, J., Botella, C., and Cristea, I. A. (2016). Relative effects of cognitive and behavioral therapies on generalized anxiety disorder, social anxiety disorder and panic disorder: A metaanalysis. Journal of Anxiety Disorders, 43, 7989.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. CRC press.
Knowles, K. A., and Olatunji, B. O. (2020). Specificity of trait anxiety in anxiety and depression: Metaanalysis of the StateTrait Anxiety Inventory. Clinical Psychology Review, 101928.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate behavioral research, 47(5), 667696.
von Glischinski, M., von Brachel, R., and Hirschfeld, G. (2019). How depressed is “depressed”? A systematic review and diagnostic metaanalysis of optimal cut points for the Beck Depression Inventory revised (BDIII). Quality of Life Research, 28(5), 11111118.
https://doi.org/10.7554/eLife.61387.sa2Article and author information
Author details
Funding
European Research Council (GA 260932)
 Sonia J Bishop
National Institute of Mental Health (R01MH091848)
 Sonia J Bishop
Alexander von Humboldt Foundation
 Peter Dayan
Max Planck Society
 Christopher Gagne
 Peter Dayan
National Institutes of Health (R01MH124108)
 Sonia J Bishop
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This research was supported by grants from the European Research Community (GA 260932) and the National Institute of Mental Health (R01MH091848). PD and CG are funded by the Max Planck Society. PD is also funded by the Alexander von Humboldt foundation. The authors thank Andrea Reinecke for assistance with Structured Clinical Diagnostic Interviews and Michael Browning, Sirius Boessenkool and Emily Witt for assistance with recruitment, experimental set up and data collection.
Ethics
Human subjects: Informed consent was obtained for all participants. Procedures for experiment 1 were approved by and complied with the guidelines of the Oxford Central University Research Ethics Committee (protocol numbers: MSDIDRECC2201236 and MSDIDRECC2201220). Procedures for experiment 2 were approved by and complied with the guidelines of the University of CaliforniaBerkeley Committee for the Protection of Human Subjects (protocol ID 2010122638).
Senior Editor
 Joshua I Gold, University of Pennsylvania, United States
Reviewing Editor
 Alexander Shackman, University of Maryland, United States
Reviewers
 Argyris Stringaris
 Samuel J Gershman, Harvard University, United States
Publication history
 Received: July 23, 2020
 Accepted: November 26, 2020
 Version of Record published: December 22, 2020 (version 1)
Copyright
© 2020, Gagne et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 4,070
 Page views

 384
 Downloads

 17
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Neuroscience
Understanding the neural basis for individual differences in the skin conductance response (SCR) during discriminative fear conditioning may inform on our understanding of autonomic regulation in fearrelated psychopathology. Previous regionofinterest (ROI) analyses have implicated the amygdala in regulating conditioned SCR, but whole brain analyses are lacking. This study examined correlations between individual differences in SCR during discriminative fear conditioning to social stimuli and neural activity throughout the brain, by using data from a large functional magnetic resonance imaging study of twins (N = 285 individuals). Results show that conditioned SCR correlates with activity in the dorsal anterior cingulate cortex/anterior midcingulate cortex, anterior insula, bilateral temporoparietal junction, right frontal operculum, bilateral dorsal premotor cortex, right superior parietal lobe, and midbrain. A ROI analysis additionally showed a positive correlation between amygdala activity and conditioned SCR in line with previous reports. We suggest that the observed whole brain correlates of SCR belong to a largescale midcinguloinsular network related to salience detection and autonomicinteroceptive processing. Altered activity within this network may underlie individual differences in conditioned SCR and autonomic aspects of psychopathology.

 Neuroscience
Neurodegenerative diseases are characterized by selective vulnerability of distinct cell populations; however, the cause for this specificity remains elusive. Here we show that entorhinal cortex layer 2 (EC2) neurons are unusually vulnerable to prolonged neuronal inactivity compared with neighboring regions of the temporal lobe, and that reelin+ stellate cells connecting EC with the hippocampus are preferentially susceptible within the EC2 population. We demonstrate that neuronal death after silencing can be elicited through multiple independent means of activity inhibition, and that preventing synaptic release, either alone or in combination with electrical shunting, is sufficient to elicit silencinginduced degeneration. Finally, we discovered that degeneration following synaptic silencing is governed by competition between active and inactive cells, which is a circuit refinement process traditionally thought to end early in postnatal life. Our data suggests that the developmental window for wholesale circuit plasticity may extend into adulthood for specific brain regions. We speculate that this sustained potential for remodeling by entorhinal neurons may support lifelong memory but renders them vulnerable to prolonged activity changes in disease.