Abstract
Motivational deficits are common in several brain disorders, and motivational syndromes like apathy and anhedonia predict worse outcomes. Disrupted effort-based decision-making may represent a neurobiological underpinning of motivational deficits, shared across neuropsychiatric disorders. We measured effort-based decision-making in 994 participants using a gamified online task, combined with computational modelling, and validated in person for test-retest reliability. In two pre-registered studies, we first replicated findings linking impaired effort-based decision-making to neuropsychiatric syndromes, taking both a transdiagnostic and a diagnostic-criteria approach. Next, testing participants with early and late circadian rhythms in the morning and evening, we found that circadian rhythm interacts with time-of-testing to produce parallel effects on effort-based decision-making. Circadian rhythm may be an important variable in computational psychiatry, decreasing reliability or distorting results when left unaccounted for. Disentangling the effects of neuropsychiatric syndromes and circadian rhythm on effort-based decision-making will be essential to understand motivational pathologies and to develop tailored clinical interventions.
1 Introduction
Our circadian rhythm aligns us with our environment, regulating physiological and behavioural processes to follow 24-hour rhythms1. Circadian integrity is pivotal to mental wellbeing and has been bidirectionally linked to numerous psychiatric disorders2–8. Yet little is known about the cognitive or computational mechanisms of circadian dysfunction, or about how these mechanisms align with, or diverge from, those driving neuropsychiatric symptoms.
Inter-individual differences in circadian timing and alignment manifest behaviourally as chronotypes (i.e., diurnal preference)9,10, with individuals commonly categorized as early, late, or intermediate chronotypes11. A disproportionate number of psychiatric patients have a late chronotype, based on self-report12 and genetic analysis9. Within clinical groups, late chronotype has been linked to depression severity and non-remission13, higher rates of psychiatric and general medical comorbidities14, more severe cognitive impairment, and higher symptoms of apathy15,16. Converging evidence on the importance of circadian alignment in psychiatric pathology has led to proposals of a circadian psychiatric phenotype, either within disorders14,17 or cutting across diagnostic categories18,19.
Syndromes of deficient motivational behaviour, such as apathy and anhedonia, are also observed across neuropsychiatric disorders20–23, suggesting transdiagnostic relevance24. Anhedonia and apathy are associated with worse clinical outcomes25,26 and are poorly targeted by current treatments27–29. Empirical work suggests a common underlying neurocognitive mechanism: the integration of costs and benefits during effortful decision-making24. Effort-based decision-making is commonly assessed using effort expenditure tasks: subjects decide whether to pursue actions associated with varying levels of effort and reward30. Computational models applied to effort-based decision-making tasks provide a formal mathematical estimate of a subject’s integration of costs and benefits into a subjective value31,32. Higher costs devalue associated rewards, an effect referred to as effort-discounting33–37. This computational approach enables measurement of inter- and intra-individual differences in distinct aspects of effort-based decision-making.
One key source of individual differences in motivational behaviour and effort-based decision-making is likely dopamine signalling, especially dopaminergic projections from the ventral tegmental area (VTA) to the ventral striatum24. Pre-clinical animal studies show dopamine depletion reduces engagement in effortful behaviour38,39, while dopamine enhancement promotes motivational effort exertion40,41. In humans, dopamine depletion reduces willingness to exert effort for reward42,43, while pharmacological dopamine enhancement increases motivation in effort-based decision-making44,45. Further, naturally occurring variations in dopamine responsivity are correlated with effort-based decision-making: a higher dopamine responsivity (as quantified with positron emission tomography following d-amphetamine administration) is associated with willingness to exert greater effort for larger rewards46.
Bi-directional links between chronobiology and several neurotransmitter systems have been reported, including dopamine47. In animals, dopamine transmission and biosynthesis vary diurnally48,49, and growing evidence suggests a bi-directional regulation between dopamine signalling and circadian rhythm50–52. In human studies, dopamine availability, dopamine transporter genes, and dopamine receptors have been linked to proxies of circadian rhythm53–55 and circadian-regulating gene polymorphisms56. On a behavioural level, sleep deprivation, poor sleep quality, and insomnia have been linked to low motivation in effort-based decision-making57–59, and evening bright-light exposure enhanced effort willingness, possibly by enhancing dopamine through melatonin suppression60. Early chronotype predicted treatment effects on motivational behaviour in a sample of depressed subjects with comorbid insomnia61. Chronotype effects are also reported for other reward decision-making tasks, with late chronotypes showing higher delay discounting62, less rational decision-making63, and lower willingness to take risks for rewards64,65. A circadian effect on decision-making under risk has also been reported, with sensitivity to losses decreasing with time-of-day66. This suggests that chronobiology may contribute to individual differences in effort-based decision-making, potentially in parallel ways with neuropsychiatric syndromes.
Here, we tested the relationship between motivational decision-making and three key neuropsychiatric syndromes: anhedonia, apathy, and depression, taking both a transdiagnostic and a categorical (diagnostic) approach. To do this, we validated a newly developed effort-expenditure task, designed for online testing and gamified to increase engagement. Participants completed the effort-expenditure task online, followed by a series of self-report questionnaires.
Next, we pre-registered a follow-up experiment to directly investigate how circadian preference interacts with time-of-day on motivational decision-making, using the same task and computational modelling approach. While this allows us to test how circadian effects on motivational decision-making compare to neuropsychiatric effects, we do not test for possible interactions between neuropsychiatric symptoms and chronobiology. All analyses were pre-registered (except when labelled as exploratory): see https://osf.io/2x3au and https://osf.io/y4fbe.
2 Results
2.1 Sample characteristics
Nine hundred and ninety-four participants completed all study components (i.e., demographic questions, effort-expenditure task, self-report questionnaires). After exclusion (see Methods 4.1.5), 958 participants were included in our analyses. We used a stratified recruitment approach to ensure our sample was representative of the UK population in age, sex, and history of psychiatric disorder67–69; mean questionnaire-based measures were comparable to previous general population studies (Table 1).
Questionnaire sum scores correlated highly within groupings of questionnaires targeting psychiatric symptoms, chronobiology, and metabolic health. We also found significant correlations between some, but not all, questionnaires (Fig. 1).
2.2 Effort-expenditure task
In this novel, online effort-expenditure task (Fig. 2A-B), subjects were given a series of challenges associated with varying levels of effort and reward. By weighing up efforts against rewards, they decided whether to accept or reject challenges. We first used model-agnostic analyses to replicate effects of effort-discounting (i.e., devaluation of reward with increasing effort). Next, we took a computational modelling approach to fit economic decision-making models to the task data (Fig. 3A-D). The models posit that efforts and rewards are combined into a subjective value (SV), weighted by individual effort sensitivity (βE) and reward sensitivity (βR) parameters. The subjective value is then integrated with an individual acceptance bias parameter (α), reflecting a bias to accept effortful challenges for reward, to guide decision-making. Specifically, this acceptance bias parameter determines the range over which subjective values are translated into acceptance probabilities: the higher the acceptance bias, the higher the acceptance probability for a given subjective value.
2.2.1 Replication of model-agnostic effects
The proportion of accepted trials for each effort-reward combination is plotted in Figure 2C. In line with our pre-registered hypotheses, we found significant main effects of effort (F(1,14367)=4961.07, p<.0001) and reward (F(1,14367)=3037.91, p<.001), and a significant interaction between the two (F(1,14367)=1703.24, p<.001). In post hoc ANOVAs, effort effects remained significant at all reward-levels (all p<.001) and reward effects remained significant at all effort-levels (all p<.001). The development of offered effort and reward levels across trials is shown in Figure 2D: because participants generally tended to accept rather than reject challenges, the implemented staircasing procedure progressed toward higher-effort and lower-reward challenges.
The mean success rate of accepted challenges across participants was high (M=98.7%) and varied little between participants (SD=3.50), indicating feasibility of all effort-levels across participants. Comparing clicking calibration results from pre- to post-task, the maximum clicking capacity decreased by 2.34 clicks on average (SD=14.5). Sixty-two (6.47%) participants reported having deviated from our instructions (i.e., changed the hand and/or finger used to make mouse clicks) during the game. All effects could still be replicated in this subsample: main and interaction effects of effort and reward on the proportion of accepted trials remained significant (all p<.001), and there was no significant difference in the mean percentage of accepted trials between participants who did or did not report finger switching (switching: 79.51%, no switching: 76.62%; p=.149).
Subjects were engaged with the task, shown by a high rate of challenge acceptance (M=76.80%, SD=15.20, range=15.60–100%) and moderate-to-good enjoyment ratings (M=2.56, SD=0.92; on a 0–4 scale). Qualitative data of subjects describing their decision-making process during the task further confirmed high engagement (see Supplement 3).
2.2.2 Computational modelling
A model space of nine models was considered, varying in the implemented parameters and cost function (see Supplement 1.1 for mathematical definitions of all models). Prior to model fitting, parameter recovery confirmed that all models yield meaningful parameter estimates (Supplement 1.2). All models showed good convergence (effective sample size (ESS)>4,223; R-hats<1.002 for all estimates). Model comparison by out-of-sample predictive accuracy identified the model implementing three parameters (acceptance bias α, reward sensitivity βR, and effort sensitivity βE) with a parabolic cost function (subsequently referred to as the full parabolic model) as the winning model (leave-one-out information criterion [LOOIC; lower is better] = 29734.8; expected log posterior density [ELPD; higher is better] = -14867.4; Fig. 3E). This was in line with our pre-registered hypotheses. Predictive validity of the full parabolic model was confirmed with posterior predictive checks, showing excellent accordance between observed and model-predicted choice data (across effort-levels: R2=.95, across reward-levels: R2=.94; Fig. 3F).
2.2.3 Test-retest reliability
We validated the task in a smaller in-person sample (N=30, tested twice ~7 days apart, holding time-of-day at testing constant) to assess test-retest reliability of parameter estimates, showing moderate to excellent reliability for all parameters (i.e., all intraclass correlation coefficients >0.4, all p<.01). Parameter estimates from modelling the data at one session predicted subjects’ choices at the other session better than chance and better than predictions from group-level parameters (all p<.01)72 (full details reported in Supplement 2).
2.3 Transdiagnostic analysis: Questionnaire measures predict effort-based decision-making
We used partial-least-squares (PLS) regression to relate individual-level mean posterior parameter values resulting from the model fitting of the full parabolic model to the questionnaire measures. To explore individual effects post-hoc, we followed up on effects found in the PLS regression using Bayesian generalised linear models (GLMs), controlling for age and gender.
2.3.1 Acceptance bias
The acceptance bias was best predicted by a model with one component, with its highest factor loadings from psychiatric measures (increasing values indicate symptom severity; SHAPS71: -0.665; AES70: -0.588; Dimensional Anhedonia Rating Scale [DARS]73: -0.487). Weaker loadings were found for circadian measures (higher values indicate later chronotype; Morningness-Eveningness Questionnaire [MEQ]11: -0.262; Munich Chronotype Questionnaire [MCTQ]74: -0.117) and metabolic measures (higher values indicate higher metabolic risk; body mass index [BMI]: -0.115; Finnish Type-2 Diabetes Risk Score questionnaire [FINDRISC]75: -0.068). Permutation testing indicated the predictive value of the resulting component (with factor loadings described above) was significant out-of-sample (root-mean-squared error [RMSE]=0.203, p=.001).
Bayesian GLMs confirmed evidence for psychiatric questionnaire measures predicting acceptance bias (SHAPS: M=-0.109; 95% highest density interval (HDI)=[-0.17,-0.04]; AES: M=-0.096; 95%HDI=[-0.15,-0.03]; DARS: M=-0.061; 95%HDI=[-0.13,-0.01]; Fig. 4A). Post-hoc GLMs on DARS sub-scales showed an effect for the sensory subscale (M=-0.050; 95%HDI=[-0.10,-0.01]). This result of neuropsychiatric symptoms predicting a lower acceptance bias is in line with our pre-registered hypothesis. For the MEQ (95%HDI=[-0.09,0.06]), MCTQ (95%HDI=[-0.17,0.05]), BMI (95%HDI=[-0.19,0.01]), and FINDRISC (95%HDI=[-0.09,0.03]) no relationship with acceptance bias was found, consistent with the smaller magnitude of reported component loadings from the PLS regression. This null finding for dimensional measures of circadian rhythm and metabolic health was not in line with our pre-registered hypotheses.
2.3.2 Effort sensitivity
For effort sensitivity, the intercept-only model outperformed models incorporating questionnaire predictors based on RMSE.
2.3.3 Reward sensitivity
For reward sensitivity, the intercept-only model outperformed models incorporating questionnaire predictors based on RMSE. This result was not in line with our pre-registered expectations.
2.3.4 Questionnaire measures predict model agnostic task measures
Both SHAPS (M=-0.07; 95%HDI=[-0.12,-0.03]) and AES (M=-0.05; 95%HDI=[-0.10,-0.002]) sum scores could predict the proportion of accepted trials averaged across effort and reward levels (Fig. S4).
2.4 Diagnostic analysis: Depressed and healthy subjects differ in effort-based decision-making
In an exploratory analysis, we compared a sample of 56 participants who met criteria for current major depressive disorder (MDD) to 56 healthy controls (HC), matched by age (MDD: M=37.07; HC: M=37.09, p=.99) and gender (MDD: 31 female, 23 male, 2 non-binary; HC: 32 female, 22 male, 2 non-binary; p=.98). Effort-discounting effects were confirmed in both groups. For both groups, model fitting and comparison identified the full parabolic model as the best-fitting model. We used age- and gender-controlled Bayesian GLMs to compare individual-level mean posterior parameter values between groups.
2.4.1 Acceptance bias
As in our transdiagnostic analyses of continuous neuropsychiatric measures (Results 2.3), we found evidence for a lower acceptance bias parameter in the MDD group compared to HCs (M=-0.111, 95% HDI=[-0.20,-0.03]) (Fig. 4B). This result confirmed our pre-registered hypothesis.
2.4.2 Effort sensitivity
Unlike our transdiagnostic analyses, we also found evidence for lower effort sensitivity in the MDD group compared to HCs (M=-0.111, 95% HDI=[-0.22,-0.01]) (Fig. 4C).
2.4.3 Reward sensitivity
There was no evidence for a group difference in reward sensitivity (95%HDI=[-0.07,0.11]), as in our transdiagnostic analyses.
2.5 Circadian measures affect effort-based decision-making
Due to our hypothesised interaction between circadian preference and time-of-day, testing was conducted in two specified time windows: morning (08:00–11:59) and evening (18:00–21:59), resulting in a binary time-of-day measure (morning vs. evening testing). A total of 492 participants completed the study in the morning testing window and 458 in the evening testing window. We used the two chronotype questionnaires to identify two established circadian phenotypes: “early” or “late” chronotype (see Methods 4.5), behavioural categories indicating underlying chronobiological differences9,11,74. These classifications result in four sub-sample groups, with 89 early chronotypes (morning testing: n=63; evening testing: n=26) and 75 late chronotypes (morning testing: n=20; evening testing: n=55).
Bayesian GLMs, controlling for age and gender, predicting task parameters by time-of-day and chronotype showed effects of chronotype on reward sensitivity (i.e. those with a late chronotype had a higher reward sensitivity; M=0.325, 95% HDI=[0.19,0.46]) and acceptance bias (higher acceptance bias in early chronotypes; M=-0.248, 95% HDI=[-0.37,-0.11]), as well as an interaction between chronotype and time-of-day on acceptance bias (M=0.309, 95% HDI=[0.15,0.48]).
2.5.1 Additional pre-registered data collection
As these analyses rely on unevenly distributed sub-samples, we conducted an additional, pre-registered data collection to replicate and extend these findings (https://osf.io/y4fbe). We screened participants for their chronotype and then invited early chronotypes to take part in our study in the evening testing window, and late chronotypes in the morning testing window (Methods 4.5.1).
Using our pre-registered Bayesian stopping rule, we tested 13 early chronotype participants and 20 late chronotype participants. The data was then combined with the data from our main data collection, resulting in a full sample of n=197 participants that was used for subsequent chronotype analyses (see Table 2 for sample characteristics and statistical significance of differences).
2.5.2 Acceptance bias
Late chronotypes showed a lower acceptance bias than early chronotypes (M=-0.11, 95% HDI=[-0.22,-0.02])—comparable to effects of transdiagnostic measures of apathy and anhedonia, as well as diagnostic criteria for depression. Crucially, we found acceptance bias was modulated by an interaction between chronotype and time-of-day (M=0.19, 95% HDI=[0.05,0.33]): post-hoc GLMs in each chronotype group showed this was driven by a time-of-day effect within late, rather than early, chronotype participants (M=0.12, 95% HDI=[0.02,0.22], such that late chronotype participants showed a lower acceptance bias in the morning testing sessions, and a higher acceptance bias in the evening testing sessions; early chronotype: 95% HDI=[-0.16,0.04]) (Fig. 5A). These results of a main effect and an interaction effect of chronotype on acceptance bias confirmed our pre-registered hypothesis.
2.5.2.1 Neuropsychiatric symptoms and circadian measures have separable effects on acceptance bias
Exploratory analyses testing for the effects of neuropsychiatric questionnaires on acceptance bias in the subsamples of early and late chronotypes confirmed the predictive value of the SHAPS (M=-0.24, 95% HDI=[-0.42,-0.06]), the DARS (M=-0.16, 95% HDI=[-0.31,-0.01]), and the AES (M=-0.18, 95% HDI=[-0.32,-0.02]) on acceptance bias.
For the SHAPS, we find that when adding the measures of chronotype and time-of-day back into the GLMs, the main effect of the SHAPS (M=-0.26, 95% HDI=[-0.43,-0.07]), the main effect of chronotype (M=-0.11, 95% HDI=[-0.22,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on acceptance bias remain. Model comparison by LOOIC reveals acceptance bias is best predicted by the model including the SHAPS, chronotype and time-of-day as predictors, followed by the model including only the SHAPS. Note that this approach to model comparison penalizes models for increasing complexity.
Repeating these steps with the DARS, the main effect of the DARS is found numerically, but the 95% HDI just includes 0 (M=-0.15, 95% HDI=[-0.30,0.002]). The main effect of chronotype (M=-0.11, 95% HDI=[-0.21,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.18, 95% HDI=[0.05,0.33]) on acceptance bias remain. Model comparison identifies the model including the DARS and circadian measures as the best model, followed by the model including only the DARS.
For the AES, the main effect of the AES is found (M=-0.19, 95% HDI=[-0.35,-0.04]). For the main effect of chronotype, the 95% HDI narrowly includes 0 (M=-0.10, 95% HDI=[-0.21,0.002]), while the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on acceptance bias remains. Model comparison identifies the model including the AES and circadian measures as the best model, followed by the model including only the AES.
2.5.3 Effort sensitivity
We found no evidence for circadian or time-of-day effects on effort sensitivity (chronotype main effect: 95%HDI=[-0.06,0.18], time-of-day main effect: 95%HDI=[-0.08,0.13]).
2.5.4 Reward sensitivity
Participants with an early chronotype had a lower reward sensitivity parameter than those with a late chronotype (M=0.27, 95% HDI=[0.16,0.38]). We found no effect of time-of-day on reward sensitivity (95%HDI=[-0.09,0.11]) (Fig. 5B). These results were in line with our pre-registered hypotheses.
3 Discussion
Various neuropsychiatric disorders are marked by disruptions in circadian rhythm, such as a late chronotype. However, research has rarely investigated how transdiagnostic mechanisms underlying neuropsychiatric conditions may relate to inter-individual differences in circadian rhythm. Here, combining a large-scale online study with computational modelling, we replicate and extend previous work linking anhedonia, apathy, and depression to a lower bias to accept effort for reward. Crucially, we found that participants with a late compared to early chronotype show the same decrease in acceptance bias. Moreover, by testing participants at chronotype-compatible and chronotype-incompatible times of day, we discovered that late chronotypes show a decreased acceptance bias to exert effort for reward when tested in the morning compared to the evening. This reveals that neuropsychiatric symptoms and chronotype (interacting with time-of-testing) show parallel effects on effort-based decision-making. Our results demonstrate a crucial role for circadian rhythm in computational psychiatry, potentially affecting our assessment and treatment of neurocognitive mechanisms.
We replicate and extend effects of aberrant effort-based decision-making in neuropsychiatric syndromes in a large, broadly population-representative sample. Our finding that dimensional measures of apathy and anhedonia predict acceptance bias, a computational parameter describing someone’s tendency to exert effort for reward, aligns with previous reports of impaired effort-based decision-making in psychiatric31,77–85 and neurodegenerative populations86–90, as well as studies linking effort-based decision-making with apathy and anhedonia specifically. A link between effort-based decision-making and apathy and anhedonia has been observed in both patients77,84,91 and healthy controls77,92,93, though some studies did not find this effect85.
Our work supports previous theories that impaired effort-based decision-making represents a common, transdiagnostic mechanism across the psychiatric and neurological syndromes of anhedonia and apathy (respectively). We found corresponding effects of apathy and anhedonia on the same computational parameter, acceptance bias, reinforcing the suggestion of possible shared mechanistic underpinnings of the two motivational syndromes24. Aberrant effort-based decision-making may manifest behaviourally as deficient motivation, a symptom category that cuts across traditional disease boundaries of psychiatric, neurological, and neurodevelopmental disorders32.
Our categorical (diagnostic-criteria based) analysis comparing depressed to healthy subjects likewise found depressed patients showed a lower acceptance bias, echoing our dimensional results in apathy and anhedonia. In addition, our categorical analysis revealed a distinct effect of group on effort sensitivity: depressed subjects had lower effort sensitivity, meaning their decisions were less influenced by effort changes. Possibly, this effect stems from decreased perceived differences in effort levels, as recently reported94, indicating there are both dimensional (transdiagnostic) and potentially some diagnosis-specific effects of mental health on effort-based decision-making.
It is possible that a higher acceptance bias reflects a more optimistic assessment of future task success, in line with work on the optimism bias95; however, because our task intentionally minimized unsuccessful trials by titrating effort and reward, future studies should explore this possibility more directly.
We also found circadian effects on effort-based decision-making, paralleling those of apathy, anhedonia, and depression measures in both the affected neurocomputational parameter and the direction of effect. We observed a difference in acceptance bias between chronotypes, with late chronotypes showing a lower tendency to accept exerting effort for reward. Previous studies have suggested that late chronotypes are also less accepting of delays62 and of risk for reward64,65.
Most importantly, we found an interaction between chronotype and time-of-day consistent with a synchrony effect: early and late chronotypes showed a higher bias to accept effort for reward at their preferred time of day. This effect was driven by the late chronotype group, who showed a markedly lower acceptance bias in the morning but a higher one in the evening. This suggests that chronotype effects on neurocomputational parameters such as acceptance bias depend on time-of-testing. Synchrony effects have previously been observed in other cognitive domains including inhibitory control96, attention97, learning98, and memory99. One interpretation of our cognitive synchrony effects may be that late chronotype participants show a diminished ability to adapt to suboptimal times-of-day due to reduced cognitive resources100.
We also report a distinct effect of chronotype on effort-based decision-making that is not paralleled by effects of neuropsychiatric symptoms, nor dependent on time-of-day. Compared to early chronotypes, late chronotypes were more guided by differences in reward value, indicated by higher reward sensitivity parameters. Previous studies report altered reward functioning in late chronotypes, who show a reduced reactivity to reward in the medial prefrontal cortex, a key component of reward circuitry101–103. Note that this is not incompatible with higher reward sensitivity in our modelling approach, in which higher reward sensitivity does not imply higher reward valuation but rather larger subjective value differences between reward levels. Therefore, reduced reactivity to reward could be compatible with late chronotypes devaluing low reward levels more, which in our models would emerge as an increased reward sensitivity parameter.
It is striking that the effects of neuropsychiatric symptoms on effort-based decision-making are largely paralleled by circadian effects on the same neurocomputational parameter. Exploratory analyses predicting acceptance bias by neuropsychiatric symptoms and circadian measures simultaneously indicate that the effects do not simply recapitulate each other but rather explain separable parts of the variance in acceptance bias. Overall, our results raise the possibility of altered effort-reward processing as a critical mechanism linking neuropsychiatric conditions and circadian rhythm. Previous research demonstrated that depressed patients with an evening chronotype show increased diurnal mood variation104. Our finding of time-of-day differences in acceptance bias among late chronotypes illustrates a potential cognitive underpinning for this diurnal characteristic of depression with late chronotype. Together, these findings support the idea of a circadian psychiatric phenotype14,17, which should be considered in measurement (e.g., design of computational psychiatry studies) and potentially treatment (e.g., administration of motivation-based psychological interventions, which could be timed compatibly with chronotype).
To our surprise, we did not find statistical evidence for a relationship between effort-based decision-making and measures of metabolic health (BMI and risk for type-2 diabetes). Our analyses linking BMI to acceptance bias revealed a numeric effect in line with our hypothesis: a higher BMI relating to a lower acceptance bias. However, the 95% HDI for this effect narrowly included zero (95%HDI=[-0.19,0.01]). Possibly, our general population sample did not have sufficient variance in metabolic health to detect dimensional metabolic effects. A recent study by our group investigated the same neurocomputational parameters of effort-based decision-making in participants with type-2 diabetes and non-diabetic controls matched by age, gender, and physical activity105. There, we report a group effect on the acceptance bias parameter, with type-2 diabetic patients showing a lower tendency to exert effort for reward.
Our study results should be considered in light of a few limitations. First, we used online self-report measures of neuropsychiatric symptoms and depression status. There has been a large shift toward online data collection in psychiatric research, and while online data is undoubtedly noisier, results (including our own, presented in the supplemental material) usually show excellent accordance with lab-based studies106. Similarly, we lack biological measures of circadian rhythm, the gold standard of chronotype assessment. However, this concern might be mitigated by previous reports of high covariance between biological- and questionnaire-based circadian measures107,108, as well as significant chronobiological differences between the questionnaire-determined chronotypes10,109 we use in our key findings. Nevertheless, future work should incorporate biological measures in attempts to replicate circadian effects on effort-based decision-making. This could take the form of identifying chronotypes by DNA analysis or dim-light melatonin onset, or continuous measurements of circadian proxies, such as core body temperature, heart rate, or actigraphy.
Note also that our time-of-day effects are limited by a between-subjects study design (i.e., the same participants were not tested in morning and evening sessions). It will be interesting to explore such diurnal variation in effort-based decision-making within individuals. The newly developed effort-expenditure task we present here may lend itself particularly well to such endeavours. First, it allows remote testing, meaning subjects can complete the task at different times of the day without in-person testing. Second, we demonstrated good test-retest reliability of task measures when time of testing was held constant within participants. This good test-retest reliability of our task contrasts with recent reports of poor test-retest reliability of other tasks and computational modelling parameters110.
Our reported analyses investigating neuropsychiatric and circadian effects on effort-based decision-making simultaneously are exploratory, as our study design was not ideally suited to examine this. Further work is needed to disentangle separable effects of neuropsychiatric and circadian measures on effort-based decision-making. One approach could be a group-based study design enabling the dissociation of the two effects (e.g., examining high-anhedonia participants with early chronotypes and low-anhedonia participants with late chronotypes, as well as the respective other, more common groupings, and testing each group in the morning and evening to examine time-of-day interactions with both anhedonia and chronotype).
Finally, we note that our study is based on a general population sample rather than a clinical one; hence, we cannot speak to transdiagnosticity at the level of multiple diagnostic categories.
Taken together, our results implicate circadian rhythm as an important factor in effort-based decision-making and its relationship to neuropsychiatric conditions. These results have implications for research, clinical interventions, and policy. We demonstrate that neuropsychiatric effects on effort-based decision-making are paralleled by effects of circadian rhythm and time-of-day. Exploratory analyses suggest these effects account for separable parts of the variance in effort-based decision-making. It is therefore unlikely that the neuropsychiatric effects on effort-based decision-making reported here and in previous literature are a spurious result of multicollinearity with chronotype. Yet not accounting for chronotype and time of testing, which is the predominant practice in the field, could affect results, either inflating or deflating effects in the existing literature. On the one hand, reported neuropsychiatric effects may be inflated by systematic circadian differences between participants (i.e., overrepresentation of late chronotypes in patient samples), which could be further amplified by time of testing (often the morning, incompatible with late chronotypes, and producing motivational impairments on neurocognitive measures). On the other hand, true effects may be masked by interactions between chronotype and time-of-day: testing psychiatric subjects with a late chronotype in the evening (e.g., as a consequence of subject-selected testing times) may paint a false picture of group equivalence, as researchers are only observing part of a daily trajectory.
Our growing understanding of the relationship between circadian rhythm and neuropsychiatry may allow for critical advances in improving therapeutic outcomes from treatments111,112. Such advances are particularly called for in the case of symptoms of apathy and anhedonia, as current treatments often fail to improve motivational deficits27–29, but could potentially be coupled with a patient’s chronotype to increase efficacy. At minimum, clinical trials predicting change in motivational measures, such as effort-based decision-making, should assess patients at similar times of day, as variation in testing time could deflate or inflate treatment effects.
Circadian rhythm and neuropsychiatric syndromes may affect motivation via parallel, as well as distinct, mechanisms—but crucially, this overlap is dependent on time of testing. Our work suggests that chronotype and time of testing are essential variables to consider in future effort-based decision-making experiments, particularly those measuring effort-based decision-making in patient groups, such as those with depression, high apathy, or high anhedonia. Beyond experimental work, future interventions should consider the role of chronotype in measurement and modulation of motivation.
4 Methods and materials
4.1 Study protocol
After providing demographics and basic medical history, subjects completed an effort-expenditure task, followed by a battery of self-report questionnaires. The study was coded in JavaScript, using Phaser v.3.50.0 for the task and jsPsych113 for questionnaires.
4.1.1 Ethics
This study was approved by the University of Cambridge Human Biology Research Ethics Committee (HBREC.2020.40). Participants provided informed consent through an online form, complying with the University of Cambridge Human Biology Research Ethics Committee procedures for online studies.
4.1.2 Recruitment
We recruited participants using Prolific114, in September 2022. Data was collected on weekdays, in specified daily time-windows (morning testing: 08:00–11:59; evening testing: 18:00–21:59). To sample participants broadly representative of the UK population in age, sex, and history of psychiatric disorder, we implemented a previously-described procedure67 using Prolific pre-screeners to obtain batches of participants matching target numbers calculated from UK population data.
Nine hundred and ninety-four participants completed all components and were paid a fixed rate of £6. A bonus of £10 was paid to ten participants. Subjects were told they could increase their chances of winning the bonus by engaging well with the study (e.g., reading questions carefully, following task instructions).
4.1.3 Effort-expenditure task
We developed a new effort-expenditure task that allowed us to assess effort-based decision-making in a remote setting; this task was also tested in-person to assess test-retest reliability. To increase engagement, we gamified the task: it takes place in an underwater setting, and each challenge is framed as a race in which an octopus catches a shrimp. The task structure is shown in Figure 2A and the trial-level structure in Figure 2B.
The task began with an individual calibration phase to standardise maximum effort capacity, followed by the main task, which used a semi-adaptive staircase design to maximise the informative value of each choice.
For the calibration, subjects were prompted to collect points by clicking as fast as possible for ten seconds, repeated three times. The second and third repetitions were averaged to serve as the maximum clicking capacity reference for the main task. One calibration trial was repeated at the end of the main task to monitor any notable changes in clicking capacity. Subjects were then familiarized with their individually calibrated effort levels during a practice phase of the task. Effort levels were defined by a required clicking speed (scaled to a given participant’s mean clicking speed from the calibration phase) and the time for which this clicking had to be sustained. We used four effort-levels, corresponding to a clicking speed at 30% of a participant’s maximal capacity for 8 seconds (level 1), 50% for 11 seconds (level 2), 70% for 14 seconds (level 3), and 90% for 17 seconds (level 4). Therefore, in each trial, participants had to complete a certain number of mouse clicks (dependent on their capacity and the effort level) within a specific time (dependent on the effort level). Subjects were instructed to make mouse-clicks with the finger they normally use, and not to change fingers throughout the task (compliance was checked at the end of the main task). In the practice phase, all effort-levels were completed without reward associations, and failed levels were repeated until subjects succeeded at each level. If a subject failed a level twice, the clicking capacity reference was adjusted to the speed reached in the practice. Finally, subjects needed to pass a six-question quiz to ensure task instructions were fully understood. If a subject failed any question on the quiz, they were returned to the instruction screens and re-took the quiz until all questions were answered correctly.
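To illustrate the arithmetic, the number of clicks required at each effort level can be derived from the calibration reference as in the minimal R sketch below; the exact scaling and rounding used in the task code are assumptions here.

```r
# Minimal sketch: derive per-level click targets from the calibration reference.
# The exact scaling and rounding used in the task itself are assumptions.
click_targets <- function(max_clicks_per_10s) {
  clicks_per_sec <- max_clicks_per_10s / 10        # maximum clicking speed
  lvls <- data.frame(
    level    = 1:4,
    pct      = c(0.30, 0.50, 0.70, 0.90),          # proportion of maximum speed
    duration = c(8, 11, 14, 17)                    # seconds clicking must be sustained
  )
  lvls$required_clicks <- round(lvls$pct * clicks_per_sec * lvls$duration)
  lvls
}

click_targets(60)  # e.g., a participant who achieved 60 clicks in the 10-s calibration
```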
The main task used a binary-choice design: in each trial, participants accepted or rejected a challenge associated with one of four effort levels and one of four reward levels. Reward was conceptualized as points (shrimp caught by the octopus) that could be collected in that trial. The points to win per challenge varied between four levels (2, 3, 4, or 5 points). If a subject accepted a given challenge, they needed to achieve the given effort-level to win the associated points. If a subject rejected a given challenge, they waited and received one point, with waiting times matched to the respective effort level to prevent confounding with delay discounting. Participants could infer their clicking progress from the distance between the octopus and the shrimp, and the remaining time was indicated by a time-bar.
Subjects completed 64 trials, split into four blocks of 16 trials. For each subject, trial-by-trial presentation of effort-reward combinations was determined semi-adaptively by 16 randomly interleaved staircases. Each of the 16 possible offers (4 effort-levels × 4 reward-levels) served as the starting point of one of the 16 staircases. Within each staircase, after a subject accepted a challenge, the next trial’s offer on that staircase was adjusted by increasing effort or decreasing reward. After a subject rejected a challenge, the next offer on that staircase was adjusted by decreasing effort or increasing reward. This ensured subjects received each effort-reward combination at least once (as each participant completed all 16 staircases), while individualizing trial presentation to maximize the trials’ informative value. Therefore, even if a subject rejected all offers (so that the staircases always adapted by decreasing effort or increasing reward), the full range of effort-reward combinations would still be presented, because each combination served as a staircase starting point (i.e., before any adaptation took place).
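The staircase logic can be sketched as follows; this is an illustrative reconstruction in R, and which dimension is adjusted first at the boundaries is an assumption not specified above.

```r
# Illustrative sketch of one staircase update (effort and reward levels coded 1-4).
# Whether effort or reward is adjusted first at each boundary is an assumption.
update_staircase <- function(effort, reward, accepted) {
  if (accepted) {
    # Make the next offer on this staircase less attractive:
    # increase effort if possible, otherwise decrease reward.
    if (effort < 4) effort <- effort + 1 else if (reward > 1) reward <- reward - 1
  } else {
    # Make the next offer more attractive:
    # decrease effort if possible, otherwise increase reward.
    if (effort > 1) effort <- effort - 1 else if (reward < 4) reward <- reward + 1
  }
  c(effort = effort, reward = reward)
}

update_staircase(effort = 2, reward = 3, accepted = TRUE)  # next offer: effort 3, reward 3
```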
4.1.4 Self-report questionnaires
Subjects completed a questionnaire battery assessing mental and physical health, presented in a randomised order. We assessed anhedonia using the SHAPS71, as well as the DARS73. Apathy was assessed with the AES70. Additionally, we screened participants for meeting diagnostic criteria for current, past, or recurrent MDD using the M.I.N.I.76. Two questionnaires targeted circadian rhythm: the MEQ11 and the MCTQ74. Metabolic health was assessed by collecting self-reported height and weight, used to calculate BMI. Additionally, the FINDRISC75 was used to calculate individual risk scores for metabolic disease. Finally, the International Physical Activity Questionnaire (IPAQ)116 was included for exploratory investigations of physical activity.
4.1.5 Compliance checks and exclusion criteria
All exclusion criteria were preregistered. Participants were excluded when reporting a severe neurological condition (n=14) or English proficiency below B2 (i.e., good command/working knowledge; n=2).
To check compliance with the questionnaires, four catch questions were presented during questionnaires, including two easy questions (e.g., “Please answer ‘Not at all’.”) and two harder questions (e.g., “In the past week, I (would have) wanted to eat mouldy food.”, expected answer “Disagree” or “Definitely disagree”). Participants failing at least one easy question or both harder questions were excluded (n=12).
As task-based exclusion criteria, subjects rejecting all offers were excluded (n=0). Participants had to have a clicking-calibration score of at least seven, as values below would lead to challenges with just one mouse-click (n=4). Subjects showing a large difference between minimum and maximum clicking speed (i.e., >3 standard deviations (SD)) during calibration trials were excluded, as a misestimation of the calibration reference is likely (n=3). Finally, subjects showing a large change in their clicking capacity (i.e., >3 SD) pre- to post-task were excluded, as it can be assumed the applied calibration was not valid during the task (n=1). We also asked two open-answer questions after completion of the main task to monitor participants’ self-reported task strategies as a way of assessing rule adherence.
4.2 Analyses of effort-expenditure task data
4.2.1 Model-agnostic analyses
Using the proportion of accepted challenges as the dependent variable, we investigated main effects of effort- and reward-levels and their interaction using a repeated-measures analysis of variance (ANOVA). This approach accommodates the unbalanced design resulting from the implemented staircasing procedure.
4.2.2 Model-based analyses
4.2.2.1 Model space
To model effort-based decision-making, we considered a model space of nine models. All models are variations of the economic decision-theory model, consisting of two basic equations. First, a cost function transforms costs and rewards associated with an action into a subjective value (SV):
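(shown here, for example, in its parabolic form; the exact equations for each model variant are given in Supplement 1.1)

SV = βR · R − βE · E²,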
with βR and βE for reward and effort sensitivity, and R and E for reward and effort. Higher effort and reward sensitivity mean the SV is more strongly influenced by changes in effort and reward, respectively (Fig. 3B-C). Conversely, low effort and reward sensitivity mean that the SV, and with it decision-making, is less guided by the effort and reward on offer, approaching random decision-making.
This SV is then transformed to an acceptance probability by a softmax function:
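(a standard logistic formulation consistent with the definitions below)

p(accept) = 1 / (1 + exp(−(α + SV))),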
with p(accept) for the predicted acceptance probability and α for the intercept representing acceptance bias. A high acceptance bias means a subject has a bias, or tendency, to accept rather than reject effortful offers for reward (Fig. 3D).
The models differed in two aspects. First, the inclusion or exclusion of the free parameters reward sensitivity (βR) and acceptance bias (α). Second, the form of the cost function, which used either a linear function (proportional discounting at all effort-levels), a parabolic function (increases at higher effort-levels are discounted over-proportionally), or an exponential function (increases at lower effort-levels are discounted over-proportionally). See Supplement 1.1 for mathematical definitions of all models.
4.2.2.2 Model fitting, checks, and comparisons
We took a hierarchical Bayesian approach to model fitting117, implemented with the CmdStan R interface118, with Stan code adapted from hBayesDM119. Prior to model fitting, effort- and reward-levels were standardized for computational ease. All models were fit using Markov-Chain Monte Carlo (MCMC) sampling, with 2000 warm-up iterations and 6000 sampling iterations, across four chains. Model convergence and chain mixing were checked using numerical diagnostics of ESS and split R-hats, and by visually inspecting trace plots. We conducted parameter recoveries for all models, confirming their ability to meaningfully recover known parameters. Model performance was compared based on out-of-sample predictive accuracy using the LOOIC (lower is better) and ELPD (higher is better). The winning model was validated using posterior predictive checks, comparing model predictions to subject-wise observed choices.
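A minimal sketch of this pipeline in R using cmdstanr is shown below; the Stan file name, data list, and parameter names are placeholders, and the Stan program is assumed to store trial-wise log-likelihoods (log_lik) for LOO.

```r
library(cmdstanr)

# Compile and fit one candidate model (file name and data list are placeholders).
mod <- cmdstan_model("full_parabolic.stan")
fit <- mod$sample(
  data = stan_data,              # assumed list: trial-wise efforts, rewards, choices
  chains = 4,
  iter_warmup = 2000,
  iter_sampling = 6000
)

# Convergence diagnostics: split R-hat and effective sample size (parameter names assumed).
fit$summary(variables = c("mu_alpha", "mu_betaR", "mu_betaE"))

# Out-of-sample predictive accuracy (requires trial-wise log_lik in the Stan model);
# returns ELPD, from which LOOIC = -2 * ELPD.
print(fit$loo())
```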
4.2.2.3 Test-retest reliability
We conducted an in-person study to validate the effort-expenditure task and assess the test-retest reliability of our computational modelling parameters. A sample of N=30 participants was recruited and tested in two sessions, about one week apart. Test-retest reliability of task parameters was assessed by intra-class correlation coefficients, Pearson’s correlation coefficients (estimated both after model fitting and by embedding a correlation matrix into the model fitting procedure), and by testing the predictive accuracy of parameter estimates across sessions. See supplement 2 for full methods and results.
4.3 Linking model parameters to outcome measures
To aid interpretability and comparability of effects, task parameters and questionnaire outcome measures were standardized to be between zero and one. Questionnaire measures resulting from the DARS, AES, and MEQ were additionally transformed to be interpretable with the same directionality within questionnaire groupings (i.e., for all psychiatric measures higher values are interpreted as higher symptom severity, for all circadian measures higher values are interpreted as later chronotype).
To investigate associations between effort-based decision-making and self-report questionnaires, we ran partial least squares (PLS) regressions with questionnaire outcome measures predicting modelling parameters. PLS regression allows joint modelling of questionnaire measures, without issues due to expected multicollinearity between questionnaires. Following the best practice of model validation120, data was split into a training (75%) and a testing (25%) subset. The training data was used to obtain the optimal number of components, based on tenfold-cross validation, and to train the model. The winning model’s predictive performance was tested out-of-sample using the held-out testing data. Statistical significance of obtained effects (i.e., the predictive accuracy of the identified component and factor loadings) was assessed by permutation tests, probing the proportion of root-mean-squared errors (RMSEs) indicating stronger or equally strong predictive accuracy under the null hypothesis.
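A minimal sketch of this procedure using the pls package is shown below; the manuscript does not name the software, and the data frame and variable names are placeholders.

```r
library(pls)

set.seed(1)
# Assumed data frame 'dat': one row per participant, the acceptance-bias estimate plus
# standardized questionnaire scores (SHAPS, AES, DARS, MEQ, MCTQ, BMI, FINDRISC).
train_idx <- sample(nrow(dat), size = round(0.75 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# PLS regression with ten-fold cross-validation on the training subset.
pls_fit <- plsr(acceptance_bias ~ shaps + aes + dars + meq + mctq + bmi + findrisc,
                data = train, validation = "CV", segments = 10)

# Choose the number of components minimising cross-validated RMSEP,
# then evaluate predictive performance on the held-out test subset.
n_comp <- which.min(RMSEP(pls_fit)$val["CV", 1, -1])
RMSEP(pls_fit, newdata = test, ncomp = n_comp)
loadings(pls_fit)[, 1:n_comp]
```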
To follow up on relationships suggested by the PLS regression, we performed Bayesian generalised linear models (GLMs), adjusting for age and gender (male or female, imputing natal sex for non-binary participants, given low numbers).
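For a single questionnaire, such a follow-up model could look as follows; this is a sketch assuming the brms package, which the manuscript does not name, and the data frame and variable names are placeholders.

```r
library(brms)
library(bayestestR)

# Acceptance bias (mean posterior estimate per participant) predicted by SHAPS score,
# controlling for age and gender.
glm_shaps <- brm(
  acceptance_bias ~ shaps + age + gender,
  data = dat, family = gaussian(),
  chains = 4, iter = 4000
)

# 95% highest density intervals for all regression coefficients.
hdi(glm_shaps, ci = 0.95)
```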
4.4 Comparing depressed and healthy subjects
We compared participants meeting criteria for current MDD based on the M.I.N.I.76 to a subset of age- and gender-matched healthy controls (HCs; participants who did not meet criteria for current MDD). For computational efficiency, we only fit the three best-fitting models from the full sample. Models were fit separately to the MDD and HC groups, using the same methods and parameters described above. Bayesian GLMs were used to quantify evidence for associations between individual-level modelling parameters and group status. As we could not be certain whether we would obtain a large enough sample size of subjects meeting criteria for MDD, these analyses were exploratory.
4.5 Investigating circadian effects
We used the two circadian rhythm questionnaires to determine participants’ chronotypes. Early chronotype was defined as meeting criteria for “morning types” on the MEQ (MEQ sum score > 58)11 and having a midpoint of sleep on free days before 02:30. Late chronotype was defined as meeting criteria for “evening types” on the MEQ (MEQ sum score < 42) and having a midpoint of sleep on free days after 05:30. Subjects not falling into either category were categorized as intermediate chronotypes and were not included in these analyses.
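In code, this classification can be expressed as follows (a sketch; the representation of midpoint-of-sleep times as decimal hours is an assumption):

```r
# Classify chronotype from the MEQ sum score and the MCTQ-derived midpoint of sleep
# on free days (msf, in decimal hours; e.g., 2.5 = 02:30).
classify_chronotype <- function(meq, msf) {
  if (meq > 58 && msf < 2.5) {
    "early"
  } else if (meq < 42 && msf > 5.5) {
    "late"
  } else {
    "intermediate"   # not included in the chronotype analyses
  }
}

classify_chronotype(meq = 62, msf = 2.0)   # "early"
classify_chronotype(meq = 38, msf = 6.25)  # "late"
```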
We used Bayesian age- and gender-controlled GLMs to investigate effects of chronotype, time-of-day (morning- vs. evening-testing), and their interaction on subject-wise mean task parameter estimates.
4.5.1 Additional data collection
To improve the precision of estimated circadian effects on task parameters, we increased our sample size by conducting an additional pre-registered data collection (https://osf.io/y4fbe). We implemented a screening study comprising the MEQ11 and MCTQ74. Taking the chronotyping approach described above, subjects with an early or late chronotype were identified. Early chronotypes were invited to take part in our study in the evening, late chronotypes in the morning.
We implemented a Bayesian stopping rule to inform our data collection process, taking the following steps. First, participants were screened in batches of 250, and eligible participants were invited to the study session. Next, data resulting from this additional data collection were joined with data from the main data collection, and the Bayesian GLMs were re-run, as described above. If our precision target (any 95% HDI reaching a maximum width of 0.20) was met, we stopped data collection. If the precision target was not met, we returned to step one and screened another batch of 250 participants. In any case, data collection would be terminated once 200 eligible participants had completed the main study session.
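The stopping check after each batch can be sketched as follows (assuming brms/bayestestR as above; fit_combined and n_eligible_tested are placeholders for the re-fitted GLM on the combined data and the running count of eligible participants, and the exact set of HDIs covered by the precision criterion follows the pre-registration):

```r
library(bayestestR)

# Widths of the 95% HDIs from the re-fitted Bayesian GLM on the combined data.
hdi_tab <- hdi(fit_combined, ci = 0.95)           # fit_combined: placeholder brms fit
widths  <- hdi_tab$CI_high - hdi_tab$CI_low

# Stop if the precision target is met, or once 200 eligible participants are tested.
stop_collection <- all(widths <= 0.20) || n_eligible_tested >= 200
```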
4.5.2 Differentiating between the effects of neuropsychiatric symptoms and circadian measures on acceptance bias
To investigate how the effects of neuropsychiatric symptoms on acceptance bias (Results 2.3.1) relate to effects of chronotype and time-of-day on acceptance bias, we conducted exploratory analyses. In the subsamples of participants with an early or late chronotype (including additionally collected data), we first ran Bayesian GLMs with neuropsychiatric questionnaire scores (SHAPS, DARS, and AES, respectively) predicting acceptance bias, controlling for age and gender. We next added an interaction term of chronotype and time-of-day into the GLMs, testing how this changed the previously observed neuropsychiatric and circadian effects on acceptance bias. Finally, we conducted a model comparison using LOO, comparing acceptance bias predicted by a neuropsychiatric questionnaire, acceptance bias predicted by chronotype and time-of-day, and acceptance bias predicted by a neuropsychiatric questionnaire together with chronotype and time-of-day (for each neuropsychiatric questionnaire, and controlling for age and gender).
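Schematically, the compared models for one questionnaire could be set up as follows (again assuming brms; the data frame name is a placeholder, and age and gender covariates are included as in the other GLMs):

```r
library(brms)

# Candidate GLMs for one questionnaire (SHAPS shown), restricted to the early/late
# chronotype subsample ('chrono_dat' is a placeholder data frame).
m_quest    <- brm(acceptance_bias ~ shaps + age + gender, data = chrono_dat)
m_circ     <- brm(acceptance_bias ~ chronotype * time_of_day + age + gender, data = chrono_dat)
m_combined <- brm(acceptance_bias ~ shaps + chronotype * time_of_day + age + gender,
                  data = chrono_dat)

# Compare out-of-sample predictive accuracy; LOO penalises added complexity.
loo_compare(loo(m_quest), loo(m_circ), loo(m_combined))
```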
Funding:
This study was funded by an AXA Research Fund Fellowship awarded to C.L.N. (G102329) and the Medical Research Council (MC_UU_00030/12). C.L.N. is funded by a Wellcome Career Development Award (226490/Z/22/Z). This research was supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215- 20014).
Additional information:
For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.
Supplementary material
5 Computational modelling
5.1 Mathematical definition of the model space
5.2 Model validation
Parameter recoveries were performed to ensure that parameter estimates obtained from all models are meaningful. For each model, sets of parameter values were sampled from uniform distributions bound to the respective parameter ranges. Task data were then simulated for n=500 agents, using the respective parameters and modelling equations. The resulting simulated data were used for model fitting, and the resulting “recovered” posterior parameter estimates were compared to the underlying parameter values. For all models, underlying parameters correlated highly with the recovered mean parameter estimates (Table S2). For the winning model (full parabolic model), relations between underlying and recovered parameters are additionally visualized in Figure S1A-C. Importantly, the modelling procedure did not introduce any spurious correlations between free parameters (Figure S1D).
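This procedure can be sketched in R as follows; the parameter bounds and offer generation are placeholders, the choice simulation assumes the parabolic subjective-value and logistic choice formulations sketched in the Methods, and the commented re-fitting step stands in for the hierarchical Stan fit described in the main text.

```r
set.seed(1)
n_agents <- 500
n_trials <- 64

# 1) Sample "true" parameters from uniform distributions (bounds are placeholders
#    for the model-specific parameter ranges).
true_pars <- data.frame(
  alpha  = runif(n_agents, -2, 2),   # acceptance bias
  beta_R = runif(n_agents,  0, 2),   # reward sensitivity
  beta_E = runif(n_agents,  0, 2)    # effort sensitivity
)

# 2) Simulate choices per agent (random offers on standardized effort/reward levels;
#    the task's staircasing is omitted for simplicity).
simulate_agent <- function(p) {
  effort <- sample(seq(0, 1, length.out = 4), n_trials, replace = TRUE)
  reward <- sample(seq(0, 1, length.out = 4), n_trials, replace = TRUE)
  sv     <- p["beta_R"] * reward - p["beta_E"] * effort^2
  rbinom(n_trials, 1, plogis(p["alpha"] + sv))
}
sim_choices <- apply(true_pars, 1, simulate_agent)

# 3) Re-fit the model to the simulated data and 4) correlate recovered posterior
#    means with the generating values (fit_model() is a placeholder for the Stan fit).
# recovered <- fit_model(sim_choices)
# diag(cor(true_pars, recovered))    # on-diagonal recovery correlations
```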
5.2.1 Parameter recoveries including inverse temperature
In the process of task and model space development, we also considered models incorporating an inverse temperature parameter. To this end, we conducted parameter recoveries for four models, defined in Table S3.
Parameter recoveries indicated that parameters can be recovered reliably in model 1, which includes only effort sensitivity (βE) and inverse temperature (τ) as free parameters (on-diagonal correlations: .98 > r > .89, off-diagonal correlations: .04 > |r| > .004). However, when a reward sensitivity parameter is added to the model (model 2), parameter recovery is compromised, as parameters are estimated less accurately (on-diagonal correlations: .80 > r > .68) and spurious correlations between parameters emerge (off-diagonal correlations: .40 > |r| > .17). This issue remains when acceptance bias is added to the model (model 4; on-diagonal correlations: .90 > r > .65; off-diagonal correlations: .28 > |r| > .03), but not when inverse temperature is modelled with effort sensitivity and acceptance bias, but without reward sensitivity (model 3; on-diagonal correlations: .96 > r > .73; off-diagonal correlations: .05 > |r| > .003).
As our pre-registered hypotheses related to the reward sensitivity parameter, we opted to include models with the reward sensitivity parameter rather than the inverse temperature parameter in our model space.
5.3 Group and subject-wise parameter estimates
As described in the main manuscript, we took a hierarchical Bayesian approach to model fitting, using Markov-Chain Monte Carlo (MCMC) sampling. Hence, we obtained posterior parameter distributions for each parameter, on both the subject and group level. Analyses presented in the main manuscript are based on the mean of each subject-wise parameter estimate distribution. The resulting parameter estimates, both subject-wise and at the group level, are visualised in Figure S2.
6 Test-retest reliability
For computational modelling parameters to successfully contribute to research advancing our understanding of mechanisms underlying mental health and disorder, as well as to transform such knowledge into personalized treatment approaches, it is pivotal to ensure reliability of the used measures. The reliability with which a measure captures individual characteristics ultimately sets an upper limit on its usefulness in detecting differences between groups, relationships to other measures, and intervention effects. We assessed the test-retest reliability of the novel effort-expenditure task in a separate in-person sample.
6.1 Methods
6.1.1 Sample
Thirty-three participants were recruited from S.O.N.A. and through advertisements in Cambridge Colleges. Three subjects were excluded due to failure during the task calibration. The final sample (N=30) consisted of 17 female and 13 male subjects, with an average age of 48.1 years (SD=21.54). All subjects were native English speakers, none reported neurological conditions, and four reported current and/or past psychiatric disorders (depression, anxiety, obsessive-compulsive disorder, and personality disorder).
6.1.2 Study procedure
All participants completed two testing sessions. In the first session, demographic data were collected, followed by the effort-expenditure task and the battery of self-report questionnaires. In the second session, subjects completed only the effort task. The two testing sessions were between 6 and 9 days apart (M=6.93 days, SD=0.78). Time of day at testing was held as constant as possible across sessions, with an average difference between task starting times of 21.53 minutes (SD=30.44, min=0, max=119).
6.1.3 Analyses
All computational models defined in our model space were fit separately to the task data of each testing session using the same methods and parameters as described for the main sample in the main manuscript. Test-retest analyses were performed on the full parabolic model, given model comparison in our main sample identified this as the winning model.
Intra-class correlation coefficients (ICCs) were used to assess test-retest reliability. ICCs relate inter-individual (between-subject) variability to intra-individual (within-subject) variability, quantifying the proportion of total variance attributable to differences between subjects. We used single-measure, two-way mixed-effects ICCs of the consistency type (reflecting rank order), i.e., ICC(C,1). ICCs below 0.4 indicate that a measure is not reliable, ICCs between 0.4 and 0.75 indicate moderate to good reliability, and ICCs above 0.75 indicate excellent reliability122.
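For illustration, ICC(C,1) corresponds to the ICC3 estimate returned by the pingouin package; a minimal sketch with simulated long-format data (column names are illustrative):

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Toy long-format data: one row per subject x session, holding that session's
# parameter estimate (stand-in for the real posterior means).
n_subj = 30
true = rng.normal(2.0, 1.0, n_subj)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), 2),
    "session": np.tile([1, 2], n_subj),
    "beta_E": np.repeat(true, 2) + rng.normal(0, 0.4, 2 * n_subj),
})

# ICC3 in pingouin is the single-measure, two-way mixed-effects, consistency ICC,
# i.e. ICC(C,1) as reported here.
icc = pg.intraclass_corr(data=df, targets="subject", raters="session", ratings="beta_E")
print(icc.set_index("Type").loc["ICC3", ["ICC", "CI95%", "pval"]])
```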
To assess correlations between model parameters from the two sessions, we first calculated Pearson's correlation coefficients between the mean posterior parameter estimates resulting from the separate model-fitting procedures. Next, we re-fit the model jointly to both testing sessions, embedding a correlation matrix into the model123. In this way, the full posterior distributions of the model parameters, rather than point estimates alone, feed into the calculation of the correlation coefficients. This offers the benefits of Bayesian inference: uncertainty around parameter estimates is incorporated into the correlation estimate, and priors can be set over possible values of the correlation matrix. Note that although this procedure fits the data of both testing sessions simultaneously, separate hyper-parameters are used for each session, so shrinkage towards a shared group mean cannot artificially inflate the reliability estimates.
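The embedded-correlation-matrix approach requires modifying the generative model itself (see ref. 123). As a simplified illustration of the underlying idea, that cross-session correlations can be computed from full posterior distributions rather than point estimates, the sketch below correlates the two sessions' subject-wise values within each posterior draw and summarises the resulting distribution. This is not the embedded-matrix method itself, and the toy arrays stand in for real MCMC output:

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(1)

# Posterior draws of one parameter for each subject and session (draws x subjects);
# toy data standing in for the real MCMC output.
n_draws, n_subj = 4000, 30
latent = rng.normal(size=n_subj)
beta_E_s1 = latent + rng.normal(scale=0.5, size=(n_draws, n_subj))
beta_E_s2 = latent + rng.normal(scale=0.5, size=(n_draws, n_subj))

def corr_per_draw(a, b):
    """Pearson correlation across subjects, computed separately within each draw."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    return (a_c * b_c).sum(axis=1) / np.sqrt((a_c**2).sum(axis=1) * (b_c**2).sum(axis=1))

r_draws = corr_per_draw(beta_E_s1, beta_E_s2)
print("posterior mean r =", round(r_draws.mean(), 3))
print("95% HDI =", az.hdi(r_draws, hdi_prob=0.95))
```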
Finally, we also made use of the predictive property of the computational modelling approach in the assessment of parameter reliability. If parameters are reliable, the estimates resulting from modelling the data of one session should predict subjects' choices in the other session better than chance124. To test this, we calculated the model-predicted choices for each subject and trial and compared these to the observed choices in the respective other session.
Due to our hierarchical Bayesian modelling approach, individual model parameters are subject to shrinkage. To test whether the predictive property of individual parameters is solely due to shrinkage, we repeated the procedure using the group-level hyper-parameters to generate model predictions and compared the resulting predictive accuracy to that of the individual parameters.
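The two preceding steps can be sketched in Python as follows: session-one parameters (subject-wise or group-level) predict session-two choices, trial-wise predictive accuracy is tested against chance (0.5), and subject-wise accuracy is compared with group-level accuracy using Welch's t-test. The choice rule, data arrays and parameter values below are illustrative placeholders, not the study data:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

rng = np.random.default_rng(2)

def trial_accuracy(params, E, R, accept):
    """Accuracy of model-predicted choices under the assumed parabolic value rule
    (predict 'accept' whenever the subjective value of the offer is positive)."""
    beta_E, beta_R, bias = params
    sv = beta_R * R - beta_E * E**2 + bias
    return ((sv > 0).astype(float) == accept).astype(float)

# Toy stand-ins for session-two task data and session-one parameter estimates
E = rng.choice([0.2, 0.4, 0.6, 0.8], size=(30, 64))      # 30 subjects x 64 trials
R = rng.choice([2.0, 3.0, 4.0, 5.0], size=(30, 64))
accept = (rng.random((30, 64)) < 0.7).astype(float)
subject_params = rng.uniform([0.5, 0.2, -1.0], [4.0, 2.0, 1.0], size=(30, 3))
group_params = subject_params.mean(axis=0)               # stand-in for hyper-parameters

acc_subject = np.concatenate([trial_accuracy(p, e, r, a)
                              for p, e, r, a in zip(subject_params, E, R, accept)])
acc_group = np.concatenate([trial_accuracy(group_params, e, r, a)
                            for e, r, a in zip(E, R, accept)])

print(ttest_1samp(acc_subject, popmean=0.5))                   # subject-wise vs. chance
print(ttest_1samp(acc_group, popmean=0.5))                     # group-level vs. chance
print(ttest_ind(acc_subject, acc_group, equal_var=False))      # Welch's t-test
```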
6.2 Results
6.2.1 Descriptive task statistics
Data from both sessions reproduced the expected effect of effort discounting. Repeated-measures mixed-effects analyses of variance (ANOVAs) confirmed significant main effects of effort (session 1: F(1,447)=128.22, p<.001; session 2: F(1,447)=129.42, p<.001) and reward (session 1: F(1,447)=56.49, p<.001; session 2: F(1,447)=46.51, p<.001), as well as an interaction effect (session 1: F(1,447)=52.74, p<.001; session 2: F(1,447)=49.13, p<.001), in both testing sessions. In post hoc ANOVAs, the main effect of effort remained significant at all reward levels (at p<.05) for both sessions, and the main effect of reward remained significant at all effort levels (at p<.05), except at the lowest effort level in the first session and at the two lowest effort levels in the second session.
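A simplified way to reproduce this kind of analysis in Python is a two-way repeated-measures ANOVA with effort and reward as within-subject factors (here via pingouin, on simulated data with illustrative column names). Note that this treats effort and reward as categorical factors and is a stand-in for, not a replication of, the exact mixed-effects specification used here:

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(3)

# Toy long-format data: proportion of accepted offers per subject, effort level and
# reward level (the real analysis uses the task data from each testing session).
df = pd.DataFrame([
    {"subject": s, "effort": e, "reward": r,
     "prop_accepted": float(np.clip(0.9 - 0.15 * e + 0.10 * r + rng.normal(0, 0.1), 0, 1))}
    for s in range(30) for e in range(1, 5) for r in range(1, 5)
])

# Two-way repeated-measures ANOVA: main effects of effort and reward plus their interaction
aov = pg.rm_anova(data=df, dv="prop_accepted", within=["effort", "reward"],
                  subject="subject", detailed=True)
print(aov[["Source", "F", "p-unc"]])
```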
6.2.2 Computational modelling
All models converged well for both testing sessions. For both session one and two, the full parabolic model was the winning model, based on both the leave-one-out information criterion (LOO) and the expected log predictive density (ELPD) (Figure S3A).
6.2.3 Test-retest reliability
Effort sensitivity showed the best test-retest reliability, in the excellent range (ICC(C,1)=0.797, p<.001). Reliability for reward sensitivity (ICC(C,1)=0.459, p=.0047) and acceptance bias (ICC(C,1)=0.463, p=.0043) was moderate (Figure S3B).
Correlations between point estimates of the modelling parameters across testing sessions were very strong for effort sensitivity (r=.803, p<.01) and moderate for reward sensitivity (r=.467, p<.01) and acceptance bias (r=.517, p<.01). Correlations derived from the embedded correlation matrix, which take the full posterior parameter distributions into account, were slightly higher (effort sensitivity: r=.867, 95% highest density interval (HDI)=[.667, .999]; reward sensitivity: r=.550, 95% HDI=[.163, .900]; acceptance bias: r=.585, 95% HDI=[.184, .927]).
Subject-wise parameter estimates reliably predicted individual trial-wise choice data (Figure S3C). Session-one parameter estimates predicted session-two choice data significantly better than chance (t(1919)=46.819, p<.001), as did session-two parameters for session-one choice data (t(1919)=47.106, p<.001). Group-level parameter estimates from both sessions also outperformed chance (session one predicting session two: t(1919)=33.485, p<.001; session two predicting session one: t(1919)=30.291, p<.001). Comparing the predictive accuracy of subject-wise with group-level parameter estimates, the subject level outperformed the group level both for session one predicting session two (t(3760.7)=4.951, p<.001) and for session two predicting session one (t(3674.2)=7.426, p<.001).
7 Model agnostic task measures relating to questionnaires
7.1 Proportion of accepted trials
To explore the relationship between model-agnostic task measures and questionnaire measures of neuropsychiatric symptoms, we conducted Bayesian GLMs predicting the proportion of accepted trials from questionnaire sum scores, controlling for age and gender. The proportion of accepted trials, averaged across effort and reward levels, was predicted by Snaith-Hamilton Pleasure Scale (SHAPS) sum scores (M=-0.07; 95% HDI=[-0.12,-0.03]) and by Apathy Evaluation Scale (AES) sum scores (M=-0.05; 95% HDI=[-0.10,-0.002]). Note that this effect was not driven only by higher effort levels: even when confining the data to the two lowest effort levels, the SHAPS retained predictive value for the proportion of accepted trials (M=-0.05; 95% HDI=[-0.07,-0.02]).
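A minimal sketch of such a Bayesian GLM, using the bambi package with its default weakly-informative priors and illustrative column and file names (the exact priors and software used in the analyses may differ):

```python
import arviz as az
import bambi as bmb
import pandas as pd

# One row per participant: average proportion of accepted offers, SHAPS sum score,
# age and gender (file and column names are illustrative placeholders).
df = pd.read_csv("task_and_questionnaires.csv")

# Bayesian GLM: proportion of accepted trials predicted by SHAPS, controlling for
# age and gender.
model = bmb.Model("prop_accepted ~ shaps + age + gender", df)
idata = model.fit(draws=2000, chains=4)

# Posterior mean and 95% HDI of the SHAPS coefficient
print(az.summary(idata, var_names=["shaps"], hdi_prob=0.95))
```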
A visualisation of model-agnostic task measures in relation to symptoms is given in Fig. S4, comparing subgroups of participants scoring in the highest and lowest quartiles on the SHAPS. Participants with a high SHAPS score (i.e., more pronounced anhedonia) are less likely to accept offers than those with a low SHAPS score (Fig. S4A). Due to the implemented staircasing procedure, group differences are also apparent in the effort-reward combinations offered per trial. While the staircasing procedure converges towards high-effort, low-reward offers in both groups, this is more pronounced in the subgroup of participants with a lower SHAPS score (Fig. S4B).
7.2 Proportion of accepted but failed trials
For each participant, we computed the proportion of trials in which an offer was accepted but the required effort was not then fulfilled (i.e., failed trials). There was no relationship between the average proportion of accepted but failed trials and SHAPS score (controlling for age and gender): M=0.01, 95% HDI=[-0.001,0.02]. However, accepted but failed trials were intentionally rare (M=1.3% of trials failed, SD=3.50), a consequence of several task design features aimed at preventing subjects from failing accepted trials, so as to avoid confounding effort discounting with risk discounting.
7.3 Exertion of “extra effort”
We also explored the extent to which participants went “above and beyond” the target in accepted trials. Specifically, considering only accepted and succeeded trials, we computed the factor by which the required number of clicks was exceeded (e.g., a subject who clicked 15 times when 10 clicks were required exceeded the target by a factor of 1.5), averaged across effort and reward levels. We then conducted Bayesian GLMs to test whether this subject-wise click-exceedance measure can be predicted by apathy or anhedonia, controlling for age and gender. Neither the SHAPS (M=-0.14, 95% HDI=[-0.43,0.17]) nor the AES (M=0.07, 95% HDI=[-0.26,0.41]) predicted the extent to which subjects exerted “extra effort”.
8 Subject-reported decision-making process
After completing the task, subjects were asked whether they used “any strategy to facilitate the game”. While this question was initially included to monitor any self-reported cheating strategies, a large subset of subjects (425 subjects, 44.36%) understood it as a prompt to report their experience of the decision-making process during the effort-expenditure task. Some examples of participants’ reports are given below:
– “Only accepting the challenge if the points were equal to or greater than the effort level.”
– “I decided whether the points were worth the effort.”
– “I tended to reject if the reward was only 2.”
– “Measure of how many points to gain against effort and how tired my hand felt.”
– “As the game progressed I decided what ratio of effort to reward that I would tolerate.”
– “I rejected the higher difficulties which had low points.”
– “Was it worth the effort for the points?”
– “I didn’t do the higher effort challenges.”
References
- 1.Sleep, circadian rhythms and healthInterface Focus 10
- 2.Missing a beat: assessment of circadian rhythm abnormalities in bipolar disorder in the genomic eraPsychiatric Genetics 29
- 3.Rhythms of life: circadian disruption and brain disorders across the lifespanNat Rev Neurosci 20:49–65
- 4.Circadian rhythm disruption and mental healthTransl Psychiatry 10:1–13
- 5.Disrupted Sleep and Circadian Rhythms in Schizophrenia and Their Interaction With Dopamine SignalingFrontiers in Neuroscience 14
- 6.Association of disrupted circadian rhythmicity with mood disorders, subjective wellbeing, and cognitive function: a cross-sectional study of 91 105 participants from the UK BiobankThe Lancet Psychiatry 5:507–514
- 7.Circadian Rhythm Disturbances in Mood Disorders: Insights into the Role of the Suprachiasmatic NucleusNeural Plasticity 2017
- 8.Circadian misalignment in major depressive disorderPsychiatry Research 168:259–261
- 9.Genome-wide association analyses of chronotype in 697,828 individuals provides insights into circadian rhythmsNat Commun 10
- 10.Circadian rhythmicity of cortisol and body temperature: Morningness-eveningness effectsChronobiology International 18:249–261
- 11.A self-assessment questionnaire to determine morningness-eveningness in human circadian rhythmsInternational Journal of Chronobiology 4:97–110
- 12.Chronotype and Psychiatric DisordersCurr Sleep Medicine Rep 4:94–103
- 13.Eveningness and Insomnia: Independent Risk Factors of Nonremission in Major Depressive DisorderSleep 37:911–917
- 14.Evening chronotype as a discrete clinical subphenotype in bipolar disorderJournal of Affective Disorders 266:556–562
- 15.The relationship between chronotype and depressive symptoms: A meta-analysisJournal of Affective Disorders 218:93–104
- 16.Eveningness is associated with greater subjective cognitive impairment in individuals with self-reported symptoms of unipolar depressionJournal of Affective Disorders 256:404–415
- 17.The Emerging Circadian Phenotype of Borderline Personality Disorder: Mechanisms, Opportunities and Future DirectionsCurr Psychiatry Rep 23
- 18.Circadian depression: A mood disorder phenotypeNeuroscience & Biobehavioral Reviews 126:79–101
- 19.Circadian rhythm sleep–wake disturbances and depression in young people: implications for prevention and early interventionThe Lancet Psychiatry 8:813–823
- 20.Reconsidering anhedonia in depression: Lessons from translational neuroscienceNeuroscience & Biobehavioral Reviews 35:537–555
- 21.Anhedonia in Schizophrenia: A Review of Assessment StrategiesSchizophrenia Bulletin 32:259–273
- 22.Effect of aripiprazole on self-reported anhedonia in bipolar depressed patientsPsychiatry Research 165:193–196
- 23.Apathy in Parkinson’s disease: A systematic review and meta-analysisMovement Disorders 30:759–769
- 24.Neuroscience of apathy and anhedonia: a transdiagnostic approachNat Rev Neurosci 19:470–484
- 25.Anhedonia is associated with suicidal ideation independently of depression: A meta-analysisDepression and Anxiety 35:382–392
- 26.Severe anhedonia among adolescents with bipolar disorder is common and associated with increased psychiatric symptom burdenJournal of Psychiatric Research 134:200–207
- 27.Methodological approaches and magnitude of the clinical unmet need associated with amotivation in mood disordersJournal of Affective Disorders 168:439–451
- 28.Anhedonia, but not Irritability, Is Associated with Illness Severity Outcomes in Adolescent Major DepressionJournal of Child and Adolescent Psychopharmacology 25:194–200
- 29.Anhedonia Predicts Poorer Recovery Among Youth With Selective Serotonin Reuptake Inhibitor Treatment–Resistant DepressionJournal of the American Academy of Child & Adolescent Psychiatry 51:404–411
- 30.Worth the ‘EEfRT’? The Effort Expenditure for Rewards Task as an Objective Measure of Motivation and AnhedoniaPLOS ONE 4
- 31.Effort-Based Decision-Making in Major Depressive Disorder: A Translational Model of Motivational AnhedoniaJ Abnorm Psychol 121:553–558
- 32.Why not try harder? Computational approach to motivation deficits in neuro-psychiatric diseasesBrain 141:629–650
- 33.Reformative self-control and discounting of reward value by delay or effortlJapanese Psychological Research 46:1–9
- 34.Parabolic discounting of monetary rewards by physical effortBehavioural Processes 100:192–196
- 35.Behavioral Modeling of Human Choices Reveals Dissociable Effects of Physical Effort and Temporal Delay on Reward DevaluationPLOS Computational Biology 11
- 36.Physical and cognitive effort discounting across different reward magnitudes: Tests of discounting modelsPLOS ONE 12
- 37.Physical and cognitive effort discounting of hypothetical monetary rewardsJapanese Psychological Research 55:329–337
- 38.Altered accumbens neural response to prediction of reward associated with place in dopamine D2 receptor knockout miceProc Natl Acad Sci U S A 99:8986–8991
- 39.Effects of d-amphetamine and apomorphine upon operant behavior and schedule-induced licking in rats with 6-hydroxydopamine-induced lesions of the nucleus accumbensJ Pharmacol Exp Ther 224:662–673
- 40.Mice with Chronically Elevated Dopamine Exhibit Enhanced Motivation, but not Learning, for a Food RewardNeuropsychopharmacol 31:1362–1370
- 41.Enhanced behavioural control by conditioned reinforcers following microinjections of d-amphetamine into the nucleus accumbensPsychopharmacology 84:405–412
- 42.Dopamine and light: dissecting effects on mood and motivational states in women with subsyndromal seasonal affective disorderJ Psychiatry Neurosci 38:388–397
- 43.Acute Phenylalanine/Tyrosine Depletion Reduces Motivation to Smoke Cigarettes Across Stages of AddictionNeuropsychopharmacol 36:2469–2476
- 44.Amping Up Effort: Effects of d-Amphetamine on Human Effort-Based Decision-MakingJournal of Neuroscience 31:16597–16602
- 45.Dose-response effects of d-amphetamine on effort-based decision-making and reinforcement learningNeuropsychopharmacol 46:1078–1085
- 46.Dopaminergic Mechanisms of Individual Differences in Human Effort-Based Decision-MakingJournal of Neuroscience 32:6170–6176
- 47.Circadian Clocks in the Regulation of Neurotransmitter SystemsPharmacopsychiatry 56:108–117
- 48.Circadian rhythms of dopamine, glutamate and GABA in the striatum and nucleus accumbens of the awake rat: modulation by lightJournal of Pineal Research 36:177–185
- 49.Dopamine transporters govern diurnal variation in extracellular dopamine toneProc. Natl. Acad. Sci. U.S.A 111
- 50.Circadian patterns of neurotransmitter related gene expression in motor regions of the rat brainNeuroscience Letters 358:17–20
- 51.Regulation of Monoamine Oxidase A by Circadian-Clock Components Implies Clock Influence on MoodCurrent Biology 18:678–683
- 52.Dopamine receptor-mediated regulation of neuronal “clock” gene expressionNeuroscience 158:537–544
- 53.Dopaminergic Role in Regulating Neurophysiological Markers of Sleep Homeostasis in HumansJournal of Neuroscience 34:566–573
- 54.Genetic polymorphisms of DAT1 and COMT differentially associate with actigraphy-derived sleep–wake cycles in young adultsChronobiology International 31:705–714
- 55.Dopamine D1 and D2 receptors are distinctly associated with rest-activity rhythms and drug rewardJ Clin Invest 131
- 56.Repeat variation in the human PER2 gene as a new genetic marker associated with cocaine addiction and brain dopamine D2 receptor availabilityTransl Psychiatry 2
- 57.Poor sleep quality is significantly associated with effort but not temporal discounting of monetary rewardsMotivation Science 8:70–76
- 58.Motivation and sensitivity to monetary reward in late-life insomnia: moderating role of sex and the inflammatory marker CRPNeuropsychopharmacol 45:1664–1671
- 59.Sleep Deprivation Alters Effort Discounting but not Delay Discounting of Monetary RewardsSleep 36:899–904
- 60.Exposure to bright light biases effort-based decisionsBehavioral Neuroscience 132:183–193
- 61.Preliminary support for the role of reward relevant effort and chronotype in the depression/insomnia comorbidityJournal of Affective Disorders 242:220–223
- 62.Associations between diurnal preference, impulsivity and substance use in a young-adult student sampleChronobiology International 38:79–89
- 63.Circadian rhythms and decision-making: a review and new evidence from electroencephalographyChronobiology International 37:520–541
- 64.Molecular insights into chronotype and time-of-day effects on decision-makingSci Rep 6
- 65.Preliminary evidence that misalignment between sleep and circadian timing alters risk-taking preferencesJournal of Sleep Research 32
- 66.Risk taking for potential losses but not gains increases with time of daySci Rep 13
- 67.A core component of psychological therapy causes adaptive changes in computational learning mechanismsPsychol. Med :1–11https://doi.org/10.1017/S0033291723001587
- 68.McManus, S., Bebbington, P. E., Jenkins, R. & Brugha, T. Mental Health and Wellbeing in England: The Adult Psychiatric Morbidity Survey 2014. https://files.digital.nhs.uk/pdf/q/3/mental_health_and_wellbeing_in_england_full_report.pdf (2016).
- 69.CT0570_2011 Census—Sex by age by IMD2004 by ethnic group
- 70.Reliability and validity of the apathy evaluation scalePsychiatry Research 38:143–162
- 71.A Scale for the Assessment of Hedonic Tone the Snaith-Hamilton Pleasure ScaleThe British Journal of Psychiatry 167:99–103
- 72.Reliability of Decision-Making and Reinforcement Learning Computational ParametersComputational Psychiatry 7:30–46
- 73.Development and validation of the Dimensional Anhedonia Rating Scale (DARS) in a community sample and individuals with major depressionPsychiatry Research 229:109–119
- 74.Life between Clocks: Daily Temporal Patterns of Human ChronotypesJ Biol Rhythms 18:80–90
- 75.The Diabetes Risk ScoreDiabetes Care 26:725–731
- 76.The Mini International Neuropsychiatric Interview (MINI). A short diagnostic structured interview: reliability and validity according to the CIDIEuropean Psychiatry 12:224–231
- 77.Effort, Anhedonia, and Function in Schizophrenia: Reduced Effort Allocation Predicts Amotivation and Functional ImpairmentJ Abnorm Psychol 123:387–397
- 78.Computational Mechanisms of Effort and Reward Decisions in Patients With Depression and Their Association With Relapse After Antidepressant DiscontinuationJAMA Psychiatry 77:513–522
- 79.Why Don’t You Try Harder? An Investigation of Effort Production in Major DepressionPLOS ONE 6
- 80.Incentive motivation deficits in schizophrenia reflect effort computation impairments during cost-benefit decision-makingJournal of Psychiatric Research 47:1590–1596
- 81.Negative Symptoms of Schizophrenia Are Associated with Abnormal Effort-Cost ComputationsBiological Psychiatry 74:130–136
- 82.Diminished effort on a progressive ratio task in both unipolar and bipolar depressionJournal of Affective Disorders 196:97–100
- 83.Amotivation in Schizophrenia: Integrated Assessment With Behavioral, Clinical, and Imaging MeasuresSchizophrenia Bulletin 40:1328–1337
- 84.Motivational deficits in effort-based decision making in individuals with subsyndromal depression, first-episode and remitted depression patientsPsychiatry Research 220:874–882
- 85.Effort–cost computation in a transdiagnostic psychiatric sample: Differences among patients with schizophrenia, bipolar disorder, and major depressive disorderPsyCh Journal 9:210–222
- 86.Dopamine enhances willingness to exert effort for reward in Parkinson’s diseaseCortex 69:40–46
- 87.Dissociation of reward and effort sensitivity in methcathinone-induced ParkinsonismJournal of Neuropsychology 12:291–297
- 88.Computational Dissection of Dopamine Motor and Motivational Functions in HumansJ. Neurosci 36:6623–6633
- 89.Effort avoidance as a core mechanism of apathy in frontotemporal dementiaBrain 146:712–726
- 90.Distinct effects of apathy and dopamine on effort-based decision-making in Parkinson’s diseaseBrain 141:1455–1469
- 91.Physical- and Cognitive- Effort-Based Decision-Making in Depression: Relationships to Symptoms and FunctioningClinical Psychological Science 9:53–67
- 92.Characterization of reward and effort mechanisms in apathyJournal of Physiology-Paris 109:16–26
- 93.Heightened effort discounting is a common feature of both apathy and fatigueSci Rep 11
- 94.Do depressive symptoms “blunt” effort? An analysis of cardiac engagement and withdrawal for an increasingly difficult taskBiological Psychology 118:52–60
- 95.Depression is related to an absence of optimistically biased belief updating about future life eventsPsychological Medicine 44:579–592
- 96.Synchrony Effects in Inhibitory Control Over Thought and ActionJournal of Experimental Psychology: Human Perception and Performance
- 97.Time of day, intellectual performance, and behavioral problems in Morning versus Evening type adolescents: Is there a synchrony effect?Personality and Individual Differences 42:431–440
- 98.Age and synchrony effects in performance on the Rey Auditory Verbal Learning TestInternational Psychogeriatrics 25:657–665
- 99.Time-of-day effects on prospective memoryBehavioural Brain Research 376
- 100.The synchrony effect revisited: chronotype, time of day and cognitive performance in a semantic analogy taskChronobiology International 35:1647–1662
- 101.PER2 rs2304672 Polymorphism Moderates Circadian-Relevant Reward Circuitry Activity in AdolescentsBiological Psychiatry 71:451–457
- 102.An altered neural response to reward may contribute to alcohol problems among late adolescents with an evening chronotypePsychiatry Research: Neuroimaging 214:357–364
- 103.Reward-Related Brain Function and Sleep in Pre/Early Pubertal and Mid/Late Pubertal AdolescentsJournal of Adolescent Health 45:326–334
- 104.Diurnal mood variation symptoms in major depressive disorder associated with evening chronotype: Evidence from a neuroimaging studyJournal of Affective Disorders 298:151–159
- 105.A cognitive signature of metabolic health in effort-based decision-makinghttps://doi.org/10.31234/osf.io/4bkm9
- 106.Taking Psychiatry Research OnlineNeuron 91:19–23
- 107.Comparing the Morningness-Eveningness Questionnaire and Munich ChronoType Questionnaire to the Dim Light Melatonin OnsetJ Biol Rhythms 30:449–453
- 108.Association between the Munich Chronotype Questionnaire and Wrist ActigraphySleep Disorders 2018
- 109.The circadian variation of cardiovascular stress levels and reactivity: Relationships to individual differences in morningness/eveningnessPsychophysiology 33:273–281
- 110.Comparing the test-retest reliability of behavioral, computational and self-reported individual measures of reward and punishment sensitivity in relation to mental health symptomshttps://doi.org/10.31234/osf.io/3u4gp
- 111.Circadian biology to advance therapeutics for mood disordersTrends in Pharmacological Sciences 44:689–704
- 112.Timing matters: Endogenous cortisol mediates benefits from early-day psychotherapyPsychoneuroendocrinology 74:197–202
- 113.jsPsych: A JavaScript library for creating behavioral experiments in a Web browserBehav Res 47:1–12
- 114.Prolific.ac—A subject pool for online experimentsJournal of Behavioral and Experimental Finance 17:22–27
- 115.Chapter 2: Common Mental DisordersMental health and wellbeing in England: Adult Psychiatric Morbidity Survey 2014 :37–68
- 116.International Physical Activity Questionnaire: 12-Country Reliability and ValidityMedicine & Science in Sports & Exercise 35:1381–1395
- 117.A Model-Based fMRI Analysis with Hierarchical Bayesian Parameter EstimationJ Neurosci Psychol Econ 4:95–110
- 118.Stan Modelling Language Users Guide and Reference Manual
- 119.Revealing Neurocomputational Mechanisms of Reinforcement Learning and Decision-Making With the hBayesDM PackageComput Psychiatr 1:24–57
- 120.Evaluating the evidence for biotypes of depression: Methodological replication and extension of Drysdale et al. (2017)NeuroImage: Clinical 22
- 121.Chronotype and Social Jetlag: A (Self-) Critical ReviewBiology 8
- 122.Design and Analysis of Clinical Experiments. John Wiley & Sons
- 123.Learning from the reliability paradox: How theoretically informed generative models can advance the social, behavioral, and brain sciences. https://doi.org/10.31234/osf.io/xr7y3
- 124.Reliability of Decision-Making and Reinforcement Learning Computational Parameters. Computational Psychiatry 7:30–46
Copyright
© 2024, Sara Z. Mehrhof & Camilla L. Nord
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.