1 Introduction

Our circadian rhythm aligns us with our environment, regulating physiological and behavioural processes to follow 24-hour rhythms1. Circadian integrity is pivotal to mental wellbeing and has been bidirectionally linked to numerous psychiatric disorders2–8. Yet little is known about the cognitive or computational mechanisms of circadian dysfunction, or about how these align with or diverge from the mechanisms driving neuropsychiatric symptoms.

Inter-individual differences in circadian timing and alignment manifest behaviourally as chronotypes (i.e., diurnal preference)9,10, with individuals commonly categorized as early, late, or intermediate chronotypes11. A disproportionate number of psychiatric patients have a late chronotype, based on self-report12 and genetic analysis9. Within clinical groups, late chronotype has been linked to depression severity and non-remission13, higher rates of psychiatric and general medical comorbidities14, more severe cognitive impairment, and higher symptoms of apathy15,16. Converging evidence on the importance of circadian alignment in psychiatric pathology has led to proposals of a circadian psychiatric phenotype, either within disorders14,17 or cutting across diagnostic categories18,19.

Syndromes of deficient motivational behaviour, such as apathy and anhedonia, are also observed across neuropsychiatric disorders20–23, suggesting transdiagnostic relevance24. Anhedonia and apathy are associated with worse clinical outcomes25,26 and are poorly targeted by current treatments27–29. Empirical work suggests a common underlying neurocognitive mechanism: the integration of costs and benefits during effortful decision-making24. Effort-based decision-making is commonly assessed using effort expenditure tasks, in which subjects decide whether to pursue actions associated with varying levels of effort and reward30. Computational models applied to effort-based decision-making tasks provide a formal mathematical estimate of a subject’s integration of costs and benefits into a subjective value31,32. Higher costs devalue associated rewards, an effect referred to as effort-discounting33. This computational approach enables measurement of inter- and intra-individual differences in distinct aspects of effort-based decision-making.

One key source of individual differences in motivational behaviour and effort-based decision-making is likely dopamine signalling, especially dopaminergic projections from the ventral tegmental area (VTA) to the ventral striatum24. Pre-clinical animal studies show dopamine depletion reduces engagement in effortful behaviour34,35, while dopamine enhancement promotes motivational effort exertion36,37. In humans, dopamine depletion reduces willingness to exert effort for reward38,39, while pharmacological dopamine enhancement increases motivation in effort-based decision-making40,41. Further, naturally occurring variations in dopamine responsivity are correlated with effort-based decision-making: a higher dopamine responsivity (as quantified with positron emission tomography following d-amphetamine administration) is associated with willingness to exert greater effort for larger rewards42.

Bi-directional links between chronobiology and dopamine signalling suggest circadian rhythm may play an overlapping role in motivational behaviour and decision-making. In animals, dopamine transmission and biosynthesis vary diurnally43,44, and growing evidence suggests a bi-directional regulation between dopamine signalling and circadian rhythm45–47. In human studies, dopamine availability, dopamine transporter genes, and dopamine receptors have been linked to proxies of circadian rhythm48–50 and circadian-regulating gene polymorphisms51. On a behavioural level, sleep deprivation, poor sleep quality, and insomnia were linked to low motivation in effort-based decision-making52–54 and evening bright-light exposure enhanced effort willingness, possibly by enhancing dopamine through melatonin suppression55. Early chronotype predicted treatment effect on motivational behaviour in a sample of depressed subjects with comorbid insomnia56. Chronotype effects are also reported for other reward decision-making tasks, with late chronotypes showing higher delay discounting57, less rational decision-making58, and lower willingness to take risks for rewards59,60. This suggests that chronobiology may contribute to individual differences in effort-based decision-making, potentially in overlapping ways with neuropsychiatric syndromes.

Here, we tested the relationship between motivational decision-making and three key neuropsychiatric syndromes: anhedonia, apathy, and depression, taking both a transdiagnostic and categorical (diagnostic) approach. To do this, we validated a newly developed effort-expenditure task, designed for online testing and gamified to increase engagement. Next, we pre-registered a follow-up experiment to directly investigate how circadian preference interacts with time-of-day on motivational decision-making, using the same task and computational modelling approach. All analyses were pre-registered (except when labelled as exploratory): see https://osf.io/2x3au and https://osf.io/y4ke.

2 Results

2.1 Sample characteristics

Nine hundred and ninety-four participants completed all study components (i.e., demographic questions, effort-expenditure task, self-report questionnaires). After exclusion (see Methods 4.1.5), 958 participants were included in our analyses. We used a stratified recruitment approach to ensure our sample was representative of the UK population in age, sex, and history of psychiatric disorder61–63; mean questionnaire-based measures were comparable to previous general population studies (Table 1).

Table 1. Demographic characteristics and descriptive questionnaire measures in the included sample and excluded participants.

Questionnaire sum scores correlated within groupings of questionnaires targeting psychiatric symptoms (|r|>0.5, p<.001 for all coefficients), chronobiology (|r|=0.73, p<.001), and metabolic health (|r|=0.58, p<.001). Between groupings, we found significant correlations between the Apathy Evaluation Scale (AES)64 and both circadian measures (|r|=0.14-0.26, p<.001), as well as both metabolic measures (|r|=0.14-0.15, p<.001). The Snaith-Hamilton Pleasure Scale (SHAPS)65, which measures anhedonia, correlated significantly with both circadian measures (|r|=0.12-0.14, p<.001) and both metabolic measures (|r|=0.07-0.72, p<.05).

2.2 Effort-expenditure task

In this novel, online effort-expenditure task (Fig. 1A-B), subjects were given a series of challenges associated with varying levels of effort and reward. By weighing up efforts against rewards, they decided whether to accept or reject challenges. We first used model-agnostic analyses to replicate effects of effort-discounting (i.e., devaluation of reward with increasing effort). Next, we took a computational modelling approach, fitting economic decision-making models to the task data. The models posit that efforts and rewards are combined into a subjective value (SV), weighted by individual effort sensitivity (βE) and reward sensitivity (βR) parameters. The subjective value is then integrated with an individual choice bias (α) parameter to guide decision-making.
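To make the roles of the three parameters concrete, the following is a minimal sketch rather than the fitted model itself. It assumes a parabolic effort cost (SV = βR·reward − βE·effort²) and a logistic link between α + SV and the probability of accepting an offer; the exact model definitions are given in Supplement 1.1 and may differ. The function names and parameter values are hypothetical.

```python
import math


def subjective_value(reward, effort, beta_r, beta_e):
    """Subjective value under an ASSUMED parabolic cost: reward is scaled by
    reward sensitivity, and squared effort is scaled by effort sensitivity."""
    return beta_r * reward - beta_e * effort ** 2


def p_accept(reward, effort, alpha, beta_r, beta_e):
    """Probability of accepting an offer, ASSUMING a logistic (sigmoid) link
    applied to choice bias plus subjective value."""
    sv = subjective_value(reward, effort, beta_r, beta_e)
    return 1.0 / (1.0 + math.exp(-(alpha + sv)))


# Hypothetical parameter values, chosen only for illustration.
params = dict(alpha=0.5, beta_r=1.0, beta_e=0.3)

# Acceptance probability for every reward (2-5 points) x effort (1-4) offer.
grid = {
    (reward, effort): p_accept(reward, effort, **params)
    for reward in (2, 3, 4, 5)
    for effort in (1, 2, 3, 4)
}

for (reward, effort), prob in sorted(grid.items()):
    print(f"reward={reward} effort={effort} p(accept)={prob:.3f}")
```

Under these assumptions, a more negative α lowers acceptance uniformly across offers, whereas βR and βE change how steeply acceptance varies with reward and effort.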

Figure 1. Effort-based decision-making task and results.

A: The task can be divided into four phases: a calibration phase to determine individual clicking capacity and calibrate effort levels; practice trials, repeated until participants succeed at every effort level; instructions and a quiz that must be passed; and the main task, consisting of 64 trials split into 4 blocks. B: Each trial consists of an offer with a reward (2, 3, 4, or 5 points) and an effort level (1, 2, 3, or 4) that subjects accept or reject. If accepted, a challenge at the respective effort level must be fulfilled to win the points. If rejected, subjects wait for a matched amount of time and receive one point. C: Proportion of accepted trials, averaged across participants and effort-reward combinations. Error bars indicate standard errors. D: Model comparison based on leave-one-out information criterion (LOOIC; lower is better) and expected log posterior density (ELPD; higher is better). E: Posterior predictive checks for the full parabolic model, comparing observed vs. model-predicted subject-wise acceptance proportions across effort levels (left) and reward levels (right).

2.2.1 Replication of model-agnostic effects

The proportion of accepted trials for each effort-reward combination is plotted in Figure 1C. We found significant main effects for effort (F(1,14367)=4961.07, p<.0001) and reward (F(1,14367)=3037.91, p<.001), and a significant interaction between the two (F(1,14367)=1703.24, p<.001). In post hoc ANOVAs, effort effects remained significant at all reward-levels (all p<.001) and reward effects remained significant at all effort-levels (all p<.001).

The mean success rate of accepted challenges across participants was high (M=98.7%) and varied little between participants (SD=3.50), indicating feasibility of all effort-levels across participants. Comparing clicking calibration results from pre- to post-task, the maximum clicking capacity decreased by 2.34 clicks on average (SD=14.5). Sixty-two (6.47%) participants reported having deviated from our instructions (i.e., changed the hand and/or finger used to make mouse clicks) during the game. All effects were nonetheless replicated in this subsample: main and interaction effects of effort and reward on the proportion of accepted trials remained significant (all p<.001), and the mean percentage of accepted trials did not differ significantly between participants who did and did not report finger switching (switching: 79.51%, no switching: 76.62%; p=.149).

Subjects were engaged with the task, shown by a high rate of challenge acceptance (M=76.80%, SD=15.20, range=15.60-100%) and moderate-to-good enjoyment ratings (M=2.56, SD=0.92; on a 0-4 scale). Qualitative data of subjects describing their decision-making process during the task further confirmed high engagement (see Supplement 3).

2.2.2 Computational modelling

A model space of nine models was considered, varying in the implemented parameters and cost function (see Supplement 1.1 for mathematical definitions of all models). Prior to model fitting, parameter recovery confirmed that all models yield meaningful parameter estimates (Supplement 1.2). All models showed good convergence (effective sample size (ESS)>4,223; R-hats<1.002 for all estimates). Model comparison by out-of-sample predictive accuracy identified the model implementing three parameters (choice bias α, reward sensitivity βR, and effort sensitivity βE), with a parabolic cost function (subsequently referred to as the full parabolic model), as the winning model (leave-one-out information criterion [LOOIC; lower is better] = 29734.8; expected log posterior density [ELPD; higher is better] = -14867.4; Fig. 1D). Predictive validity of the full parabolic model was confirmed with posterior predictive checks, showing excellent accordance between observed and model-predicted choice data (across effort-levels: R²=.95, across reward-levels: R²=.94; Fig. 1E).
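The two model-comparison metrics reported for the winning model are two scalings of the same quantity. Under the standard definition used in leave-one-out cross-validation, LOOIC is the ELPD multiplied by -2, which is why lower LOOIC and higher ELPD point to the same model. The short check below applies that definition to the values reported above; the function name is hypothetical.

```python
def looic_from_elpd(elpd):
    """Convert an expected log posterior density (ELPD) estimate to the
    deviance scale: LOOIC = -2 * ELPD."""
    return -2.0 * elpd


# ELPD reported for the full parabolic model in the text above.
elpd_full_parabolic = -14867.4

looic_full_parabolic = looic_from_elpd(elpd_full_parabolic)
print(round(looic_full_parabolic, 1))  # 29734.8, matching the reported LOOIC
```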

2.2.3 Test-retest reliability

We validated the task in a smaller in-person sample (N=30, tested twice ~7 days apart, holding time-of-day at testing constant) to assess test-retest reliability of parameter estimates, showing moderate to excellent reliability for all parameters (i.e., all intraclass correlation coefficients >0.4, all p<.01). Parameter estimates from modelling the data at one session predicted subjects’ choices at the other session better than chance and better than group-level parameter predictions (all p<0.01)66 (full details reported in Supplement 2).
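As an illustration of how an intraclass correlation coefficient quantifies test-retest reliability, the sketch below computes a two-way, absolute-agreement, single-measure ICC, often written ICC(2,1). The ICC variant used in this study is not stated in this section (see Supplement 2), so this choice is an assumption, and the example scores are hypothetical.

```python
def icc_2_1(scores):
    """Two-way, absolute-agreement, single-measure ICC (ASSUMED variant).

    `scores` is a list of rows, one per subject, each holding that subject's
    parameter estimate at every session.
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for subjects, sessions, and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical estimates for three subjects at two sessions.
identical_sessions = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
shifted_sessions = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]

print(icc_2_1(identical_sessions))
print(icc_2_1(shifted_sessions))
```

Because this variant measures absolute agreement, a constant shift between sessions lowers the coefficient even when the rank order of subjects is perfectly preserved.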

2.3 Transdiagnostic analysis: Questionnaire measures predict effort-based decision-making

We used partial-least-squares (PLS) regression to relate individual-level mean posterior parameter values resulting from the model fitting of the full parabolic model to the questionnaire measures. To explore individual effects post-hoc, we followed up on effects found in the PLS regression using Bayesian generalised linear models (GLMs), controlling for age and gender.

2.3.1 Choice bias

Choice bias was best predicted by a model with one component, with its highest factor loadings from psychiatric measures (higher values indicate greater symptom severity; SHAPS65: -0.665; AES64: -0.588; Dimensional Anhedonia Rating Scale [DARS]67: -0.487). Weaker loadings were found for circadian measures (higher values indicate later chronotype; Morningness-Eveningness Questionnaire [MEQ]11: -0.262; Munich Chronotype Questionnaire [MCTQ]68: -0.117) and metabolic measures (higher values indicate higher metabolic risk; body mass index [BMI]: -0.115; Finnish Type-2 Diabetes Risk Score questionnaire [FINDRISC]69: -0.068). Permutation testing indicated the predictive value was significant out-of-sample (root-mean-squared error [RMSE]=0.203, p=.001).

Bayesian GLMs confirmed evidence for psychiatric questionnaire measures predicting choice bias (SHAPS: M=-0.109; 95% highest density interval (HDI)=[-0.17,-0.04]; AES: M=-0.096; 95%HDI=[-0.15,-0.03]; DARS: M=-0.061; 95%HDI=[-0.13,-0.01]; Fig. 2A). Post-hoc GLMs on DARS sub-scales showed a meaningful effect for the sensory subscale (M=-0.050; 95%HDI=[-0.10,-0.01]). For the MEQ (95%HDI=[-0.09,0.06]), MCTQ (95%HDI=[-0.17,0.05]), BMI (95%HDI=[-0.19,0.01]), and FINDRISC (95%HDI=[-0.09,0.03]) no meaningful relationship with choice bias was found, consistent with the smaller magnitude of reported component loadings from the PLS regression.

Figure 2. Associations between task parameter estimates and psychiatric measures.

A: Visualizations of associations between the choice bias task parameter and the Snaith-Hamilton Pleasure Scale (SHAPS), the Dimensional Anhedonia Rating Scale (DARS)67, and the Apathy Evaluation Scale (AES)64. B-C: Comparison of task parameter choice bias (B) and effort sensitivity (C) between a sample of participants meeting criteria for current major depressive disorder (MDD; purple, upper) on the Mini-International Neuropsychiatric Interview 7.0.1 (M.I.N.I)70 and age- and gender-matched controls (yellow, lower).

2.3.2 Effort sensitivity

For effort sensitivity, the intercept-only model outperformed models incorporating questionnaire predictors based on RMSE.

2.3.3 Reward sensitivity

For reward sensitivity, the intercept-only model outperformed models incorporating questionnaire predictors based on RMSE.

2.4 Diagnostic analysis: Depressed and healthy subjects differ in effort-based decision-making

In an exploratory analysis, we compared a sample of 56 participants who met criteria for current major depressive disorder (MDD) with 56 healthy controls (HC), matched by age (MDD: M=37.07; HC: M=37.09, p=.99) and gender (MDD: 31 female, 23 male, 2 non-binary; HC: 32 female, 22 male, 2 non-binary; p=.98). Effort-discounting effects were confirmed in both groups. For both groups, model fitting and comparison identified the full parabolic model as the best-fitting model. We used age- and gender-controlled Bayesian GLMs to compare individual-level mean posterior parameter values between groups.

2.4.1 Choice bias

As in our transdiagnostic analyses of continuous neuropsychiatric measures (Results 2.3), we found evidence for a more negative choice bias parameter in the MDD group compared to HCs (M=-0.111, 95% HDI=[-0.20,-0.03]) (Fig. 2B).

2.4.2 Effort sensitivity

Unlike our transdiagnostic analyses, we also found evidence for lower effort sensitivity in the MDD group compared to HCs (M=-0.111, 95% HDI=[-0.22,-0.01]) (Fig. 2C).

2.4.3 Reward sensitivity

There was no evidence for a group difference in reward sensitivity (95%HDI=[-0.07,0.11]), as in our transdiagnostic analyses.

2.5 Circadian measures affect effort-based decision-making

Due to our hypothesised interaction between circadian preference and time-of-day, testing was conducted in two specified time windows: morning (08:00-11:59) and evening (18:00-21:59), resulting in a binary time-of-day measure (morning vs. evening testing). A total of 492 participants completed the study in the morning testing window and 458 in the evening testing window. We used the two chronotype questionnaires to identify two established circadian phenotypes: “early” or “late” chronotype (see Methods 4.5), behavioural categories indicating underlying chronobiological differences9,11,68. These classifications result in four sub-sample groups, with 89 early chronotypes (morning testing: n=63; evening testing: n=26) and 75 late chronotypes (morning testing: n=20; evening testing: n=55).

Bayesian GLMs predicting task parameters by time-of-day and chronotype showed effects of chronotype on reward sensitivity (higher in late chronotypes; M=0.325, 95% HDI=[0.19,0.46]) and choice bias (higher in early chronotypes; M=-0.248, 95% HDI=[-0.37,-0.11]), as well as an interaction between chronotype and time-of-day on choice bias (M=0.309, 95% HDI=[0.15,0.48]).

2.5.1 Additional pre-registered data collection

As these analyses rely on unevenly distributed sub-samples, we conducted an additional, pre-registered data collection to replicate and extend these findings (https://osf.io/y4ke). We screened participants for their chronotype and then invited early chronotypes to take part in our study in the evening testing window, and late chronotypes in the morning testing window (Methods 4.5.1).

Using our pre-registered Bayesian stopping rule, we tested 13 early chronotype participants and 20 late chronotype participants. The data was then combined with the data from our main data collection, resulting in a full sample of n=197 participants that was used for subsequent chronotype analyses (see Table 2 for sample characteristics and statistical significance of differences).

Table 2. Demographic characteristics and descriptive questionnaire measures in the early and late chronotype participants.

2.5.2 Choice bias

Late chronotypes showed a more negative choice bias than early chronotypes (M=-0.11, 95% HDI=[-0.22,-0.02]), comparable to effects of transdiagnostic measures of apathy and anhedonia, as well as diagnostic criteria for depression. Crucially, we found choice bias was modulated by an interaction between chronotype and time-of-day (M=0.19, 95% HDI=[0.05,0.33]). Post-hoc GLMs in each chronotype group showed this was driven by a time-of-day effect within late chronotype participants (M=0.12, 95% HDI=[0.02,0.22]), who showed a more negative choice bias in the morning testing sessions and a more positive choice bias in the evening testing sessions. No such effect was found within early chronotype participants (95% HDI=[-0.16,0.04]) (Fig. 3A).
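The logic of the interaction and its post-hoc follow-up can be read as a difference of differences in a 2x2 design (chronotype by time-of-day). The sketch below uses plain cell means with hypothetical values, not the Bayesian GLMs or the data reported here, and the function name is an assumption for illustration.

```python
def time_of_day_contrasts(cell_means):
    """Return the evening-minus-morning difference within each chronotype,
    plus their difference (the interaction contrast) for a 2x2 design."""
    late_effect = cell_means[("late", "evening")] - cell_means[("late", "morning")]
    early_effect = cell_means[("early", "evening")] - cell_means[("early", "morning")]
    return {
        "late": late_effect,
        "early": early_effect,
        "interaction": late_effect - early_effect,
    }


# HYPOTHETICAL mean choice bias per group; not values from this study.
hypothetical_means = {
    ("early", "morning"): 0.2,
    ("early", "evening"): 0.1,
    ("late", "morning"): -0.2,
    ("late", "evening"): 0.1,
}

contrasts = time_of_day_contrasts(hypothetical_means)
for name, value in contrasts.items():
    print(f"{name}: {value:+.2f}")
```

In this toy example the interaction contrast is dominated by the time-of-day difference within the late group, which mirrors the kind of pattern the post-hoc analyses are designed to detect.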

Figure 3. Effects of chronotype and time-of-day on task parameter estimates.

A: Effect of chronotype and time-of-day on reward sensitivity parameter estimates. B: Effect of chronotype and time-of-day on choice bias parameter estimates.

2.5.3 Effort sensitivity

We found no evidence for circadian or time-of-day effects on effort sensitivity (chronotype main effect: 95%HDI=[-0.06,0.18], time-of-day main effect: 95%HDI=[-0.08,0.13]).

2.5.4 Reward sensitivity

Participants with an early chronotype had a lower reward sensitivity parameter than those with a late chronotype (M=0.27, 95% HDI=[0.16,0.38]). We found no effect of time-of-day on reward sensitivity (95%HDI=[-0.09,0.11]) (Fig. 3B).

3 Discussion

Various neuropsychiatric disorders are marked by disruptions in circadian rhythm, such as a late chronotype. Transdiagnostic characteristics in circadian rhythm could contribute to overlapping neurocognitive changes between disorders. However, research investigating the mechanisms underlying neuropsychiatric conditions has only rarely considered inter-individual differences in circadian rhythm. Here, combining a large-scale online study with computational modelling, we replicate and extend previous work linking anhedonia, apathy, and depression to a lower tendency to exert effort for reward. Crucially, we found the same effect, in the same direction, for late chronotype. Moreover, by testing participants at chronotype-compatible and -incompatible times of day, we discovered late chronotypes show a decreased willingness to exert effort for reward when tested in the morning compared to evening. This reveals overlapping effects of neuropsychiatric symptoms and chronotype that depend on time-of-testing. Our results demonstrate circadian rhythm may play a crucial role in computational psychiatry, affecting our assessment and potentially treatment of neurocognitive mechanisms.

We replicate and extend effects of aberrant effort-based decision-making in neuropsychiatric syndromes in a large, broadly population-representative sample. Our finding that dimensional measures of apathy and anhedonia predict choice bias, a computational parameter describing someone’s tendency to exert effort for reward, aligns with previous reports of impaired effort-based decision-making in psychiatric31,71–79 and neurodegenerative populations80–84, as well as studies linking effort-based decision-making with apathy and anhedonia specifically. The link between effort-based decision-making and apathy and anhedonia has been observed in both patients71,78,85 and healthy controls71,86,87, though some studies did not find this effect79.

Our work supports previous theories that impaired effort-based decision-making represents a common, transdiagnostic mechanism across the psychiatric and neurological syndromes of anhedonia and apathy (respectively). We found corresponding effects of apathy and anhedonia on the same computational parameter – choice bias – reinforcing the suggestion of possible shared mechanistic underpinnings of the two motivational syndromes24. Aberrant effort-based decision-making may manifest behaviourally as deficient motivation, a symptom category that cuts across traditional disease boundaries of psychiatric, neurological, and neurodevelopmental disorders32.

Our categorical (diagnostic-criteria based) analysis comparing depressed to healthy subjects likewise found depressed patients showed a lower choice bias, echoing our dimensional results in apathy and anhedonia. In addition, our categorical analysis revealed a distinct effect of group on effort sensitivity: depressed subjects had lower effort sensitivity, meaning their decisions were less influenced by effort changes. Possibly, this effect stems from decreased perceived differences in effort levels, as recently reported88, indicating there are both dimensional (transdiagnostic) and potentially some diagnosis-specific effects of mental health on effort-based decision-making.

We also found overlapping circadian effects on effort-based decision-making. Again, we observed a meaningful difference in choice bias between chronotypes, with late chronotypes showing a lower bias towards accepting effort for reward, paralleling effects of apathy, anhedonia, and depression. Previous studies have suggested late chronotypes were also less accepting of delays57 and risk for reward59,60.

Most importantly, we found an interaction between chronotype and time-of-day in a synchrony effect manner: early and late chronotypes showed a higher bias towards accepting effort for reward at their preferred time of day. This effect was driven by the late chronotype group, who showed a markedly lower choice bias in the morning, but higher choice bias in the evening. This suggests that chronotype effects on neurocomputational parameters such as choice bias (and their overlap, or not, with effects of neuropsychiatric syndromes) depend on time-of-testing.

Synchrony effects have previously been observed in other cognitive domains including inhibitory control89, attention90, learning91, and memory92. One interpretation of our cognitive synchrony effects may be that late chronotype participants show a diminished ability to adapt to suboptimal times-of-day due to reduced cognitive resources93.

We also report a distinct effect of chronotype on effort-based decision-making that does not overlap with effects of neuropsychiatric symptoms and does not depend on time-of-day. Compared to early chronotypes, late chronotypes were more guided by differences in reward value, indicated by higher reward sensitivity parameters. Previous studies report altered reward functioning in late chronotypes, who show a reduced reactivity to reward in the medial prefrontal cortex, a key component of reward circuitry94–96. Note that this is not incompatible with the higher reward sensitivity we observed: in our modelling approach, higher reward sensitivity does not imply higher reward valuation, but rather larger subjective value differences between reward levels. Therefore, reduced reactivity to reward could be compatible with late chronotypes devaluing low reward levels more, which in our models would emerge as a higher reward sensitivity parameter.

It is striking that the effects of neuropsychiatric symptoms on effort-based decision-making largely overlap with circadian effects. The interaction effect of chronotype and time-of-day at testing fosters confidence in the interpretation that circadian effects on effort-based decision-making go beyond recapitulating neuropsychiatric effects due to multi-collinearity. Overall, our results raise the possibility of altered effort-reward processing as a critical mechanism linking neuropsychiatric conditions and circadian rhythm. Previous research demonstrated depressed patients with an evening chronotype show increased diurnal mood variation97. Our finding of time-of-day differences in choice bias among late chronotypes illustrates a potential cognitive underpinning for the observed diurnal characteristic within depression and late chronotype. Together, these findings support the idea of a circadian psychiatric phenotype14,17, which should be considered in measurement (e.g., design of computational psychiatry studies) and potentially treatment (e.g., administration of motivation-based psychological interventions, which could be timed compatibly with chronotype).

Our study results should be considered in light of a few limitations. First, we used online self-report measures of neuropsychiatric symptoms and depression status. There has been a large shift toward online data collection in psychiatric research, and while online data is undoubtedly noisier, results (including our own, presented in the supplemental material) usually show excellent accordance with lab-based studies98. Similarly, we lack biological measures of circadian rhythm, the gold standard of chronotype assessment. However, this concern might be mitigated by previous reports of high covariance between biological- and questionnaire-based circadian measures99,100, as well as significant chronobiological differences between the questionnaire-determined chronotypes10,101 we use in our key findings. Nevertheless, future work should incorporate biological measures in attempts to replicate circadian effects on effort-based decision-making. This could take the form of identifying chronotypes by DNA analysis or dim-light melatonin onset, or continuous measurements of circadian proxies, such as core body temperature, heart rate, or actigraphy.

Note also that our time-of-day effects are limited by a between-subjects study design (i.e., the same participants were not tested in morning and evening sessions). It will be interesting to explore such diurnal variation in effort-based decision-making within individuals. The newly developed effort-expenditure task we present here may lend itself particularly well to such endeavours. First, it allows remote testing, meaning subjects can complete the task at different times of the day without in-person testing. Second, we demonstrated good test-retest reliability of task measures when time of testing was held constant within participants. This good test-retest reliability of our task contrasts with recent reports of poor test-retest reliability of other tasks and computational modelling parameters102.

Finally, our study was not designed to separate neuropsychiatric from circadian effects on effort-based decision-making, owing to multi-collinearity between neuropsychiatric symptoms and chronotype (reflecting the general co-incidence of neuropsychiatric symptoms and late circadian rhythm). Future work is needed to disentangle separable effects of neuropsychiatric and circadian measures on effort-based decision-making. One approach could be a group-based study design enabling the dissociation of the two effects (e.g., examining high-anhedonia participants with early chronotypes and low-anhedonia participants with late chronotypes, as well as the respective other, more common groupings, and testing each group in the morning and evening to examine time-of-day interactions with both anhedonia and chronotype).

Taken together, our results implicate circadian rhythm as an important factor in effort-based decision-making and its relationship to neuropsychiatric conditions. These results have implications for research, clinical interventions, and policy. We demonstrate that neuropsychiatric effects on effort-based decision-making largely overlap with effects of circadian rhythm and time-of-day. Hence, failure to account for chronotype and time of testing, which is the predominant practice in the field, could distort results. This could take the form of either inflating or masking true results in the existing literature. On the one hand, reported neuropsychiatric effects may be inflated when driven by systematic circadian differences between participants (i.e., overrepresentation of late chronotype in patient samples), which could be further amplified by time of testing (often the morning, incompatible with late chronotypes, and producing motivational impairments on neurocognitive measures). On the other hand, true effects may be masked by interactions between chronotype and time-of-day: testing psychiatric subjects with a late chronotype in the evening (e.g., as a consequence of subject-selected testing times) may paint a false picture of group equivalence, as researchers are only observing part of a daily trajectory.

Our growing understanding of the relationship between circadian rhythm and neuropsychiatry may allow for critical advances in improving therapeutic outcomes from treatments103,104. Such advances are particularly called for in the case of symptoms of apathy and anhedonia, as current treatments often fail to improve motivational deficits27–29, but could potentially be coupled with a patient’s chronotype to increase efficacy. At minimum, clinical trials predicting change in motivational measures, such as effort-based decision-making, should assess patients at similar times of day, as differences in time of testing could reduce or inflate apparent treatment effects.

In addition, our results may have potential implications for our school system. Researchers have long argued against early school start times, due to adverse effects on children and adolescents’ physical and mental health105. Our results add to these concerns. Adolescent development is tightly linked to a shift to a later circadian rhythm106, meaning that early school hours might undermine pupils’ motivation at school.

Circadian rhythm and neuropsychiatric syndromes may affect motivation via overlapping, as well as distinct, mechanisms. Crucially, this overlap depends on time of testing. Our work suggests that chronotype and time of testing are essential variables to consider in future effort-based decision-making experiments, particularly those measuring effort-based decision-making in patient groups, such as those with depression, high apathy, or high anhedonia, all of which produce effects overlapping with those seen in late chronotype participants tested in the morning. Beyond experimental work, future interventions, both in treatment and policy, should consider the role of chronotype in measurement and modulation of motivation.

4 Methods and materials

4.1 Study protocol

After providing demographics and basic medical history, subjects completed an effort-expenditure task, followed by a battery of self-report questionnaires. The study was coded in JavaScript, using Phaser v.3.50.0 for the task and jsPsych107 for questionnaires.

4.1.1 Ethics

This study was approved by the University of Cambridge Human Biology Research Ethics Committee (HBREC.2020.40). Participants provided informed consent through an online form, complying with the University of Cambridge Human Biology Research Ethics Committee procedures for online studies.

4.1.2 Recruitment

We recruited participants using Prolific108, in September 2022. Data was collected on weekdays, in specified daily time-windows (morning testing: 08:00-11:59; evening testing: 18:00-21:59). To sample participants broadly representative of the UK population in age, sex, and history of psychiatric disorder, we implemented a previously-described procedure using Prolific pre-screeners to obtain batches of participants aimed to match target numbers calculated based on UK population data.

Nine hundred and ninety-four participants completed all components and were paid a fixed rate of £6. A bonus of £10 was paid to ten participants. Subjects were told they could increase their chances of winning the bonus by engaging well with the study (e.g., reading questions carefully, following task instructions).

4.1.3 Effort-expenditure task

We developed a new effort-expenditure task that allowed us to assess effort-based decision-making in a remote setting; this task was also tested in-person to assess test-retest reliability. To increase engagement, we gamified the task: it took place in an underwater setting, and each challenge was framed as a race in which an octopus catches a shrimp. The task structure is shown in Figure 1A and the trial-level structure in Figure 1B.

The task began with an individual calibration phase to standardise maximum effort capacity, followed by the main task, which used a semi-adaptive staircase design to maximise the informative value of each choice.

For the calibration, subjects were prompted to collect points by clicking as fast as possible for ten seconds, repeated three times. The second and third repetitions were then averaged to serve as the maximum clicking capacity reference for the main task. One calibration trial was repeated at the end of the main task to monitor any notable changes in clicking capacity. Then, subjects were familiarized with their individually calibrated effort levels during a practice phase of the task. Effort levels were scaled to a given participant’s mean clicking speed (based on the calibration phase) and to the duration for which clicking had to be sustained. We used four effort-levels, corresponding to a clicking speed at 30% of a participant’s maximal capacity for 8 seconds (level 1), 50% for 11 seconds (level 2), 70% for 14 seconds (level 3), and 90% for 17 seconds (level 4). Subjects were instructed to make mouse-clicks with the finger they normally use, and to not change fingers throughout the task (compliance was checked at the end of the main task). In the practice phase, all effort-levels were completed without reward associations, and failed levels were repeated until subjects succeeded at each level. If a subject failed a level twice, the clicking capacity reference was adjusted to the speed reached in the practice. Finally, subjects needed to pass a six-question quiz to ensure task instructions were fully understood. If a subject failed any question on the quiz, they were returned to the instruction screens and re-took the quiz until all questions were answered correctly.
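The scaling of effort levels to the calibration reference can be illustrated with a minimal Python sketch. It assumes that the click target of a level is the product of the stated proportion, the calibrated clicks per second, and the level duration; this formula, the function names, and the absence of rounding are illustrative assumptions rather than the task's actual implementation.

```python
# Effort levels from the text: level -> (proportion of maximum speed, duration in s)
EFFORT_LEVELS = {1: (0.30, 8), 2: (0.50, 11), 3: (0.70, 14), 4: (0.90, 17)}
CALIBRATION_SECONDS = 10  # each calibration trial lasted ten seconds


def calibration_reference(clicks_per_trial):
    """Mean click count of the second and third calibration trials."""
    if len(clicks_per_trial) != 3:
        raise ValueError("expected three calibration trials")
    return (clicks_per_trial[1] + clicks_per_trial[2]) / 2


def required_clicks(reference_clicks, level):
    """Assumed click target: proportion x clicks-per-second x duration (unrounded)."""
    proportion, duration = EFFORT_LEVELS[level]
    clicks_per_second = reference_clicks / CALIBRATION_SECONDS
    return proportion * clicks_per_second * duration


reference = calibration_reference([40, 48, 52])  # first trial is discarded
targets = {level: required_clicks(reference, level) for level in EFFORT_LEVELS}
```

Under these assumptions, a participant with a reference of 50 clicks in ten seconds would need about 12 clicks within 8 seconds at level 1 and about 76.5 clicks within 17 seconds at level 4.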

The main task took a binary-choice design: In each trial, participants accepted or rejected a challenge associated with one of the four specific effort levels and rewards. Reward was conceptualized as points (shrimp caught by the octopus) that could be collected in that trial. The points to win per challenge varied between four levels (2, 3, 4, or 5 points). If a subject accepted a given challenge, they needed to achieve the given effort-level to win the associated points. If a subject rejected a given challenge, they waited and received one point, with waiting times matched to the respective effort level to prevent confounding with delay discounting. Participants were able to infer their clicking progress from the distance between the octopus and the shrimp, and the remaining time was indicated by a time-bar.

Subjects completed 64 trials, split into four blocks of 16 trials. Trial-by-trial presentation of effort-reward combinations was determined semi-adaptively by 16 randomly interleaved staircases. Each staircase began with one of the 16 possible offers (4 effort-levels × 4 reward-levels). After a subject accepted a challenge, the next trial’s offer on that staircase was adjusted (by increasing effort or decreasing reward). After a subject rejected a challenge, the next offer was adjusted by decreasing effort or increasing reward. This ensured subjects received each effort-reward combination at least once, while individualizing trial presentation to maximize the trials’ informative value.
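The staircase logic described above can be sketched in Python as follows. How the task chose between the two possible adjustments, and what happened at the edges of the offer grid, is not specified here; the random choice among valid moves and the repetition of an offer when no move is possible are assumptions made for illustration, as are all function names.

```python
import random

EFFORT_LEVELS = (1, 2, 3, 4)
REWARD_LEVELS = (2, 3, 4, 5)


def initial_staircases(rng=random):
    """One staircase per effort-reward combination, in shuffled order."""
    offers = [(e, r) for e in EFFORT_LEVELS for r in REWARD_LEVELS]
    rng.shuffle(offers)
    return offers


def next_offer(effort, reward, accepted, rng=random):
    """Return the next (effort, reward) offer on one staircase."""
    if accepted:
        # Make the offer less attractive: more effort or less reward.
        candidates = [(effort + 1, reward), (effort, reward - 1)]
    else:
        # Make the offer more attractive: less effort or more reward.
        candidates = [(effort - 1, reward), (effort, reward + 1)]
    valid = [(e, r) for e, r in candidates
             if e in EFFORT_LEVELS and r in REWARD_LEVELS]
    # Assumption: pick randomly among valid moves; repeat the offer if none exist.
    return rng.choice(valid) if valid else (effort, reward)
```

Because each of the 16 staircases starts at a distinct offer, every effort-reward combination is presented at least once regardless of how the subsequent adjustments are chosen.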

4.1.4 Self-report questionnaires

Subjects completed a questionnaire battery assessing mental and physical health, presented in a randomised order. We assessed anhedonia using the SHAPS65, as well as the DARS67. Apathy was assessed with the AES64. Additionally, we screened participants for meeting diagnostic criteria for current, past, or recurrent MDD using the M.I.N.I.70. Two questionnaires targeted circadian rhythm: the MEQ11 and the MCTQ68. Metabolic health was assessed by collecting self-reported height and weight, used to calculate BMI. Additionally, the FINDRISC69 was used to calculate individual risk scores for metabolic disease. Finally, the International Physical Activity Questionnaire (IPAQ)110 was included for exploratory investigations of physical activity.

4.1.5 Compliance checks and exclusion criteria

All exclusion criteria were preregistered. Participants were excluded when reporting a severe neurological condition (n=14) or English proficiency below B2 (i.e., good command/working knowledge; n=2).

To check compliance with the questionnaires, four catch questions were presented during questionnaires, including two easy questions (e.g., “Please answer ‘Not at all’.”) and two harder questions (e.g., “In the past week, I (would have) wanted to eat mouldy food.”, expected answer “Disagree” or “Definitely disagree”). Participants failing at least one easy question, or both harder questions, were excluded (n=12).

As task-based exclusion criteria, subjects rejecting all offers were excluded (n=0). Participants had to have a clicking-calibration score of at least seven, as values below would lead to challenges with just one mouse-click (n=4). Subjects showing a large difference between minimum and maximum clicking speed (i.e., >3 standard deviations (SD)) during calibration trials were excluded, as a misestimation of the calibration reference is likely (n=3). Finally, subjects showing a large change in their clicking capacity (i.e., >3 SD) pre- to post-task were excluded, as it can be assumed the applied calibration was not valid during the task (n=1). We also asked two open-answer questions after completion of the main task to monitor participants’ self-reported task strategies as a way of assessing rule adherence.

4.2 Analyses of effort-expenditure task data

4.2.1 Model-agnostic analyses

Using the proportion of accepted challenges as the dependent variable, we investigated main effects of effort- and reward-levels and their interaction, using a mixed-effects repeated-measures analysis of variance (ANOVA). This approach accommodates the unbalanced design resulting from the implemented staircasing procedure.

4.2.2 Model-based analyses

Model space

To model effort-based decision-making, we considered a model space of nine models. All models are variations of the economic decision-theory model, consisting of two basic equations. First, a cost function transforms costs and rewards associated with an action into a subjective value (SV):

with βR and βE for reward and effort sensitivity, and R and E for reward and effort. This SV is then transformed to an acceptance probability by a softmax function:

with p(accept) for the predicted acceptance probability and α for the intercept representing choice bias.
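As a sketch of the two equations described above, assuming a linear cost term and a binary accept/reject choice (so that the softmax reduces to a logistic function), the general model form could be written as follows. The parabolic and exponential variants would replace the effort term; the exact definitions of all models are those given in supplement 1.1, not this illustration.

```latex
\begin{align}
SV &= \beta_R \cdot R - \beta_E \cdot E \\
p(\text{accept}) &= \frac{1}{1 + e^{-(\alpha + SV)}}
\end{align}
```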

The models differed in two aspects. First, inclusion or exclusion of the free parameters reward sensitivity (βR) and choice bias (α). Second, the form of the cost function, which used either a linear function (proportional discounting at all effort-levels), a parabolic function (increases at higher effort-levels are discounted over-proportionally) or an exponential function (increases at lower effort-levels are discounted over-proportionally). See supplement 1.1 for mathematical definitions of all models.

Model fitting, checks, and comparisons

We took a hierarchical Bayesian approach to model fitting111, implemented with the CmdStan R interface112, with Stan code adapted from hBayesDM113. Prior to model fitting, effort- and reward-levels were standardized for computational ease. All models were fit using Markov-Chain Monte Carlo (MCMC), with 2000 warm-up iterations and 6000 sampling iterations, using four chains. Model convergence and chain mixing were checked using numerical diagnostics of effective sample size (ESS) and split R-hats, and by visually inspecting trace plots. We conducted parameter recoveries for all models, confirming their ability to meaningfully recover known parameters. Model performance was compared based on out-of-sample predictive accuracy using the leave-one-out information criterion (LOOIC; lower is better) and the expected log predictive density (ELPD; higher is better). The winning model was validated using posterior predictive checks, comparing model predictions to subject-wise observed choices.

Test-retest reliability

We conducted an in-person study to validate the effort-expenditure task and assess the test-retest reliability of our computational modelling parameters. A sample of N=30 participants was recruited and tested in two sessions, about one week apart. Test-retest reliability of task parameters was assessed by intra-class correlation coefficients, Pearson’s correlation coefficients (estimated both after model fitting and by embedding a correlation matrix into the model fitting procedure), and by testing the predictive accuracy of parameter estimates across sessions. See supplement 2 for full methods and results.

4.3 Linking model parameters to outcome measures

To aid interpretability and comparability of effects, task parameters and questionnaire outcome measures were standardized to be between zero and one. Questionnaire measures resulting from the DARS, AES, and MEQ were additionally transformed to be interpretable with the same directionality within questionnaire groupings (i.e., for all psychiatric measures higher values are interpreted as higher symptom severity, for all circadian measures higher values are interpreted as later chronotype).

To investigate associations between effort-based decision-making and self-report questionnaires, we ran partial least squares (PLS) regressions with questionnaire outcome measures predicting modelling parameters. PLS regression allows joint modelling of questionnaire measures, without issues due to expected multicollinearity between questionnaires. Following the best practice of model validation114, data was split into a training (75%) and a testing (25%) subset. The training data was used to obtain the optimal number of components, based on ten-fold-cross validation, and to train the model. The winning model’s predictive performance was tested out-of-sample using the held-out testing data. Statistical significance of obtained effects was assessed by permutation tests, probing the proportion of root-mean-squared errors (RMSEs) indicating stronger or equally strong predictive accuracy under the null hypothesis.
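The logic of the permutation test can be illustrated with a simplified Python sketch. It assumes that the null distribution is obtained by shuffling the held-out outcomes against fixed model predictions; the analysis pipeline may instead have refit the model on permuted data, so the function names and this shortcut should be read as assumptions.

```python
import math
import random


def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def permutation_p_value(observed, predicted, n_permutations=1000, seed=0):
    """Proportion of permuted RMSEs that are as small as or smaller than the actual RMSE."""
    rng = random.Random(seed)
    actual = rmse(observed, predicted)
    shuffled = list(observed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between outcomes and predictions
        if rmse(shuffled, predicted) <= actual:
            hits += 1
    return hits / n_permutations
```

A small proportion indicates that the observed predictive accuracy would rarely be matched if questionnaire measures carried no information about the modelling parameter.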

To follow up on relationships suggested by the PLS regression, we performed Bayesian general linear models (GLMs), adjusting for age and gender (male or female, imputing natal sex for non-binary participants, given low numbers).

4.4 Comparing depressed and healthy subjects

We compared participants meeting criteria for current MDD based on the M.I.N.I.70, to a subset of age- and gender-matched healthy controls (HCs, participants that did not meet criteria for current MDD). For computational efficiency, we only fit the three best-fitting models from the full sample. Models were fit separately to the MDD and HC groups, using the same methods and parameters described above. Bayesian GLMs were used to quantify evidence for associations between individual-level modelling parameters and group status. As we could not be certain whether we would obtain a large enough sample size of subjects meeting criteria for MDD, these analyses were exploratory.

4.5 Investigating circadian effects

We used the two circadian rhythm questionnaires to determine participants’ chronotypes. Early chronotype was defined as meeting criteria for “morning types” on the MEQ (MEQ sum score > 58)11 and having a midpoint of sleep on free days before 02:30115. Late chronotype was defined as meeting criteria for “evening types” on the MEQ (MEQ sum score < 42) and having a midpoint of sleep on free days after 05:30. Subjects not falling into either category were categorized as intermediate chronotypes and were not included in these analyses.
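The classification rule above can be written as a short Python sketch. The function name and the representation of the midpoint of sleep as decimal hours after midnight (with midpoints before midnight expressed as negative values) are assumptions made for illustration.

```python
def classify_chronotype(meq_score, midsleep_free_days_hours):
    """Classify chronotype from the MEQ sum score and the MCTQ midpoint of sleep.

    midsleep_free_days_hours: midpoint of sleep on free days in decimal hours
    after midnight (e.g., 2.5 for 02:30). Assumption: a midpoint before
    midnight is passed as a negative number (e.g., -0.5 for 23:30).
    """
    if meq_score > 58 and midsleep_free_days_hours < 2.5:
        return "early"
    if meq_score < 42 and midsleep_free_days_hours > 5.5:
        return "late"
    return "intermediate"  # not included in the circadian analyses
```

Both criteria have to be met for an early or late classification, so a participant with a morning-type MEQ score but a midpoint of sleep at 04:00 would be treated as intermediate.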

We used Bayesian age- and gender-controlled GLMs to investigate effects of chronotype, time-of-day (morning- vs. evening-testing), and their interaction on subject-wise mean task parameter estimates.

4.5.1 Additional data collection

To improve the precision of estimated circadian effects on task parameters, we increased our sample size by conducting an additional pre-registered data collection (https://osf.io/y4ke). We implemented a screening study comprising the MEQ11 and MCTQ68. Taking the chronotyping approach described above, subjects with an early or late chronotype were identified. Early chronotypes were invited to take part in our study in the evening, late chronotypes in the morning.

We implemented a Bayesian stopping rule to inform our data collection process, taking the following steps. First, participants were screened in batches of 250, and eligible participants were invited to the study session. Next, data resulting from this additional data collection was joined with data resulting from the main data collection and Bayesian GLMs were re-run, as described above. If our precision target of any 95% HDI reaching a maximum width of 0.20 was met, we stopped data collection. If the precision target was not met, we returned to step one, and another batch of 250 participants was screened. In any case, data collection would be terminated once 200 eligible participants had completed the main study session.
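The stopping rule can be sketched in Python as below. The sketch assumes that posterior samples for the effects of interest are available as plain lists, computes each 95% HDI as the narrowest interval containing 95% of the samples, and follows the wording above in stopping as soon as any HDI is at most 0.20 wide. The function names and the dictionary layout are hypothetical.

```python
import math


def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, math.ceil(mass * n))  # number of samples inside the interval
    start = min(range(n - window + 1),
                key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]


def should_stop(posteriors, n_completed, max_width=0.20, max_participants=200):
    """Stop when the precision target is met or the participant cap is reached."""
    widths = [upper - lower for lower, upper in map(hdi, posteriors.values())]
    # Assumption: "any" HDI meeting the target suffices, as worded in the text.
    precision_met = any(width <= max_width for width in widths)
    return precision_met or n_completed >= max_participants
```

After each batch, the GLMs would be re-run and `should_stop` evaluated on the updated posterior samples; screening continues only while it returns `False`.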


This study was funded by an AXA Research Fund Fellowship awarded to C.L.N. (G102329) and the Medical Research Council (MC_UU_00030/12). C.L.N. is funded by a Wellcome Career Development Award (226490/Z/22/Z). This research was supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014).

Author contributions

Conceptualization: S.Z.M., C.L.N.; Methodology: S.Z.M., C.L.N.; Investigation: S.Z.M., C.L.N.; Project administration: S.Z.M., C.L.N.; Writing (original draft, review & editing): S.Z.M., C.L.N.; Formal analysis: S.Z.M.; Funding Acquisition: C.L.N.; Supervision: C.L.N.

Additional information

For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.

Supplementary material

1 Computational modelling

1.1 Mathematical definition of the model space

Mathematical definition of the models included in our model space.

1.2 Model validation

Parameter recoveries were performed to ensure parameter estimates obtained from all models are meaningful. For each model, sets of parameter values were sampled from uniform distributions bound to the respective parameter ranges. Task data was then simulated for n=500 agents, using the respective parameters and modelling equations. The resulting simulated data was used for model fitting and resulting “recovered” posterior parameter estimates compared to the underlying parameter values. For all models, underlying parameters correlated highly with the recovered mean parameter estimates (Table S2). For the winning model (full parabolic model), relations between underlying and recovered parameters are additionally visualized in Figure S1A-C. Importantly, the modelling procedure did not introduce any spurious correlations between free parameters (Figure S1D).
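The simulation step of such a parameter recovery can be sketched in Python. The sketch assumes a parabolic cost term of the form SV = βR·R − βE·E², hypothetical parameter ranges, unstandardized effort and reward levels, and randomly drawn offers instead of the staircase procedure; the model fitting step is omitted. A Pearson correlation helper is included for comparing underlying and recovered values once estimates are available.

```python
import math
import random

EFFORTS = (1, 2, 3, 4)
REWARDS = (2, 3, 4, 5)
# Hypothetical parameter ranges, chosen only for illustration.
RANGES = {"beta_e": (0.0, 1.0), "beta_r": (0.0, 2.0), "alpha": (-2.0, 2.0)}


def sample_parameters(rng):
    """Draw one agent's parameters from uniform distributions over RANGES."""
    return {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}


def p_accept(effort, reward, beta_e, beta_r, alpha):
    """Acceptance probability under the assumed parabolic model."""
    sv = beta_r * reward - beta_e * effort ** 2
    return 1.0 / (1.0 + math.exp(-(alpha + sv)))


def simulate_agent(params, n_trials=64, rng=None):
    """Simulate accept/reject choices for randomly drawn offers."""
    rng = rng or random.Random()
    trials = []
    for _ in range(n_trials):
        effort, reward = rng.choice(EFFORTS), rng.choice(REWARDS)
        accepted = rng.random() < p_accept(effort, reward, **params)
        trials.append((effort, reward, accepted))
    return trials


def pearson(x, y):
    """Pearson correlation, e.g., between underlying and recovered parameters."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
agents = [sample_parameters(rng) for _ in range(500)]
data = [simulate_agent(params, rng=rng) for params in agents]
```

Fitting each model to `data` and correlating the recovered posterior means with the values in `agents` would then complete the recovery.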

Figure S1. Parameter recovery.

A-C: Comparison between underlying parameters and recovered mean parameter estimates for the three free parameters of the full parabolic model. D: Pearson’s correlations between all underlying and recovered parameters for the full parabolic model.

Table S2. Pearson’s correlations between underlying parameters and recovered mean parameter estimates for all models included in the model space.

2 Test-retest reliability

For computational modelling parameters to successfully contribute to research advancing our understanding of mechanisms underlying mental health and disorder, as well as to transform such knowledge into personalized treatment approaches, it is pivotal to ensure reliability of the used measures. The reliability with which a measure captures individual characteristics ultimately sets an upper limit on its usefulness in detecting differences between groups, relationships to other measures, and intervention effects. We assessed the test-retest reliability of the novel effort-expenditure task in a separate in-person sample.

2.1 Methods

2.1.1 Sample

Thirty-three participants were recruited from S.O.N.A. and through advertisements in Cambridge Colleges. Three subjects had to be excluded due to failure during the task calibration. The final sample (N=30) consisted of 17 female and 13 male subjects with an average age of M=48.1 years (SD=21.54). All subjects were native English speakers, none reported neurological conditions and four subjects reported current and/or past psychiatric disorders (depression, anxiety, obsessive-compulsive disorder, and personality disorder).

2.1.2 Study procedure

All participants completed two testing sessions. In the first session, demographic data was collected, followed by the effort-expenditure task and the battery of self-report questionnaires. In the second session, subjects only completed the effort task. The two testing sessions were at least 6 days apart and 9 days at most, with an average difference of 6.93 days (SD=0.78). Time-of-day at testing was aimed to be held constant between testing sessions, with the average time difference between starting times of the task being 21.53 minutes (SD=30.44, min=0, max=119).

2.1.3 Analyses

All computational models defined in our model space were fit separately to the task data of each testing session using the same methods and parameters as described for the main sample in the main manuscript. Test-retest analyses were performed on the full parabolic model, given model comparison in our main sample identified this as the winning model.

Intra-class correlation coefficients (ICCs) were used to assess test-retest reliability using the ratio of intra-individual to inter-individual variability. We used two-way mixed effects ICCs considering consistency (reflecting rank order). ICCs below 0.4 indicate that a measure is not reliable, ICCs between 0.4 and 0.75 indicate moderate to good reliability and ICCs above 0.75 indicate excellent reliability116.
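For illustration, a two-way mixed effects, consistency, single-measure ICC, commonly denoted ICC(C,1), can be computed from the ANOVA mean squares as (MS_rows − MS_error) / (MS_rows + (k − 1)·MS_error). The Python sketch below implements this standard formula for a subjects × sessions table; the function name is hypothetical and it is not the software used for the reported analyses.

```python
def icc_consistency(scores):
    """ICC(C,1) for a list of rows, one row per subject, one column per session."""
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
```

Because session-level mean differences are removed before the residual is computed, a uniform shift between sessions does not lower this coefficient; it reflects the stability of subjects' rank order rather than absolute agreement.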

To assess correlations between model parameters from the two sessions, we first calculated Pearson’s correlation coefficients between mean posterior parameter estimates resulting from the separate model fitting procedures. Next, we re-fit the model jointly for both testing sessions, while embedding a correlation matrix into the model117. Thereby, the full posterior distributions of the model parameters are fed into the calculation of correlation coefficients, rather than solely point estimates. This offers multiple benefits of Bayesian inference: Uncertainty around parameter estimates can be incorporated into the correlation estimation and Bayesian priors can be set over possible values of the correlation matrix. Note that while this model fitting procedure fits the data of both testing sessions simultaneously, separate hyper-parameters are used for the different sessions. Therefore, shrinkage cannot bias the reliability estimates.

Finally, we also made use of the predictive property of the computational modelling approach in the assessment of parameter reliability. If parameters are reliable, the estimates resulting from modelling the data from one session should be able to predict subjects’ choices in the other session better than chance66. To test this, we calculated the model-predicted choices for each subject and trial and compared this to the observed choices in the respective other session.

Due to our hierarchical Bayesian modelling approach, individual model parameters are subject to shrinkage. To test whether the predictive property of individual parameters is solely due to shrinkage we repeated the procedure using hyper-parameters to make model predictions and then compared the resulting predictive accuracy to that of individual parameters.

2.2 Results

2.2.1 Descriptive task statistics

Data from both sessions reproduced the expected effect of effort discounting. Mixed-effects repeated-measures analyses of variance (ANOVAs) confirmed significant main effects of effort (session 1: F(1,447)=128.22, p<.001; session 2: F(1,447)=129.42, p<.001) and reward (session 1: F(1,447)=56.49, p<.001; session 2: F(1,447)=46.51, p<.001), as well as an interaction effect (session 1: F(1,447)=52.74, p<.001; session 2: F(1,447)=49.13, p<.001) for both testing sessions. In post hoc ANOVAs, the main effect of effort remained significant at all reward levels (at p<.05) for both sessions, and the main effect of reward remained significant at all effort levels (at p<.05), except at the lowest effort level in the first session, and at the two lowest reward levels in the second session.

2.2.2 Computational modelling

All models converged well for both testing sessions. For both session one and two, the full parabolic model was the winning model, based on both the leave-one-out information criterion (LOO) and the expected log predictive density (ELPD) (Figure S2A).

Figure S2. Computational modelling and test-retest reliability.

A: Model comparison for each testing session based on the leave-one-out information criterion (LOO) and expected log predictive density (ELPD). Error bars indicate standard errors. B: Subject-wise parameter estimates compared between testing sessions. C: Predictive accuracy against chance (left) and group-level parameters (right; values >0 indicate better performance of subject-level compared to group-level parameters). Labels s1s2 (and s2s1) indicate session 1 (session 2) parameters predicting session 2 (session 1) data, s1s1 (and s2s2) indicate session 1 (session 2) parameters predicting session 1 (session 2) data.

2.2.3 Test-retest reliability

Effort sensitivity showed the best test-retest reliability (ICC(C,1)=0.797, p<.001). Reliability for reward sensitivity (ICC(C,1)=0.459, p=.0047) and choice bias (ICC(C,1)=0.463, p=.0043) was moderate (Figure S2B).

Correlations between point estimates of the modelling parameters across testing sessions were very strong for effort sensitivity (r=.803, p<.01), and moderate for reward sensitivity (r=.467, p<.01) and choice bias (r=.517, p<.01). Correlations derived from the embedded correlation matrix, which consider the full posterior distribution of the parameter estimates, resulted in slightly higher estimates (effort sensitivity: r=.867, 95% highest density interval (HDI)=[.667, .999]; reward sensitivity: r=.550, 95% HDI=[.163, .900]; choice bias: r=.585, 95% HDI=[.184, .927]).

Subject-wise parameter estimates could reliably predict individual trial-wise choice data across sessions (Figure S2C). Session one parameter estimates predicted session two choice data significantly better than chance (t(1919)=46.819, p<.001), as did session two parameters for session one choice data (t(1919)=47.106, p<.001). When considering predictive accuracy of group-level parameter estimates, both sessions’ parameter estimates outperformed chance (session one predicting session two: t(1919)=33.485, p<.001; session two predicting session one: t(1919)=30.291, p<.001). Comparing predictive accuracy of subject-wise parameter estimates to group-level estimates, the subject-level estimates outperformed the group-level estimates for both session one predicting session two (t(3760.7)=4.951, p<.001), and session two predicting session one (t(3674.2)=7.426, p<.001).

3 Subject-reported decision-making process

After completing the task, subjects were asked whether they used “any strategy to facilitate the game”. While this question was initially included to monitor any self-reported cheating strategies, a large subset of subjects (425 subjects, 44.36%) understood this question as a prompt to report their experience of the decision-making process during the effort-expenditure task. Some examples of participants’ reports are given below:

  • “Only accepting the challenge if the points were equal to or greater than the effort level.”

  • “I decided whether the points were worth the effort.”

  • “I tended to reject if the reward was only 2.”

  • “Measure of how many points to gain against effort and how tired my hand felt.”

  • “As the game progressed I decided what ratio of effort to reward that I would tolerate.”

  • “I rejected the higher difficulties which had low points.”

  • “Was it worth the effort for the points?”

  • “I didn’t do the higher effort challenges.”