1. Introduction

Resting state functional magnetic resonance imaging (fMRI) serves as a tool to map brain function (1, 2). Functional connectivity (FC) at rest, estimated with methods that gauge correlated spontaneous activity between two or more regions, serves as a measure of functional brain integrity (1, 3). Individual differences in FC at rest are associated with differences in various cognitive domains (4-6). Associations of this nature have been reported for several large-scale networks, including the default mode network (DMN) (7-9), the frontoparietal network (10-12), the dorsal attention network (13), and the salience network (14). Moreover, such associations have also been identified in more comprehensive investigations spanning the entire functional repertoire of the brain (15, 16). Most previous reports focused on associations, not predictions, relying on correlations between FC and behavioral phenotypes, which tend to overfit the data and therefore fail to generalize (17). Proper cross-validation, preferably in an independent sample, is therefore important to support reliable population-level inferences (18). Recent machine learning-based predictive frameworks offer powerful tools for assessing the predictability of individual behavioral phenotypes based on brain connectivity (19-23). In particular, deep neural network (DNN) methods have been successfully applied to behavioral and disease prediction (24-26), and were initially expected to outperform other machine learning approaches (27-29). However, this superiority remains debatable, as recent studies have reported comparable performance between DNNs and traditional methods (29, 30). Accordingly, the present study does not aim to benchmark deep learning against traditional machine learning approaches but instead uses a consistent predictive framework to examine how brain state influences the utility of FC for cognitive prediction.

The functional connectome has demonstrated predictive utility regarding trait-like cognitive phenotypes (31-34). The predictive-modeling framework of the functional connectome has been applied to various cognitive domains, including intelligence (35, 36), working memory (WM) (37), visuospatial ability (38), attention (39), and creativity (40), as well as personality traits (41). Understanding the patterns contributing to predictions could offer insights into the functional organization underlying cognitive phenotypes, serving as biomarkers indicating current or prospective health conditions (42-44). Moreover, the whole-brain functional connectome acts as a fingerprint, allowing accurate identification of subjects from a large population (45) not only within the same cognitive state (e.g., rest-to-rest) but also across different states (e.g., rest-to-task). Overall, past research suggests that the functional connectome is relatively robust within individuals, is unique across individuals, and can predict cognitive and personality phenotypes. However, less is known about how the predictive utility of the functional connectome depends on the brain state during which FC is measured.

Despite consensus on the value of resting state functional connectivity for mapping brain function, there is an ongoing debate about whether rest is the optimal brain state for investigating individual differences in neurocognitive function (46-49). A study using data from the Human Connectome Project has shown that resting state fMRI predicts differences in brain activity during various tasks, including social, language, relational, and motor tasks (50). This finding supports the notion that individual differences in neural activation can be predicted from resting state (48). However, results from the same dataset revealed that FC during task outperforms resting state FC in predicting individual differences in fluid intelligence, with FC during task explaining 20% of the variance compared to just 6% explained by rest (51). Consistent with this result, previous studies have investigated trait- as well as state-dependent FC, supporting the utility of an integrative approach (11, 49). More recent studies suggested that naturalistic viewing, such as movie-watching, may serve as a happy medium between unconstrained rest and overly constrained tasks in predicting behavioral differences (52, 53). Despite the presence of similar spatiotemporal activity patterns across individuals during movie-watching (54), notable individual differences in activity and functional connectivity (55) persist alongside these shared patterns. This suggests that tasks which bring individuals’ functional connectomes closer to an optimal level, neither completely unconstrained like rest nor overly synchronized like a task, also render individuals easiest to identify (46). Hence, it is plausible that FC during naturalistic paradigms improves sensitivity to predict behavioral differences (52, 56).

The primary objective of this study is to determine how brain state influences the predictive utility of the functional connectome for cognitive performance, using a deep learning framework. Specifically, we test whether functional connectivity derived from different brain states differentially predicts WM and episodic memory (EM), two core but functionally distinct cognitive domains. WM reflects the capacity to temporarily store and manipulate information and supports higher-order problem-solving, reasoning, and other key components of fluid intelligence (e.g., (57-59)), whereas EM entails the recollection of specific experiences and events (60), which is regarded as an important element of mind-wandering (61) during resting state. We examine whether these domains are differentially predicted by connectomes derived from resting state, movie-watching, and n-back task fMRI.

Past studies have used neuroimaging data, including resting state data, to predict brain age (62-64). These studies show that brain age, based on biological phenotypes, and its deviation from chronological age (known as the brain age gap) could serve as a biomarker for characterizing disease risk (64-66). Importantly, an older-appearing brain has been shown to exacerbate physiological and cognitive aging and the risk of mortality (63). A recent study demonstrated that while brain age derived from MRI can predict chronological age with high accuracy, its utility for predicting cognition is limited (67). Specifically, Tetereva and colleagues (2024) (67) showed that brain age strongly tracks chronological age and that, when predicting cognition, brain age largely overlapped with chronological age, such that controlling for chronological age eliminated the predictive contribution of brain age. This finding suggests that brain-age models may provide little unique explanatory power for cognitive decline beyond what is already captured by chronological age. Building on this observation and extending the concept of a brain age gap to a brain-cognition gap (BCG, defined as the discrepancy between predicted and observed cognitive performance), we propose that BCG may serve as an informative marker of individual differences. If the brain predicts lower performance than is observed (i.e., a negative BCG), the individual may be compensating for underlying issues not yet apparent through cognitive assessments. By this view, individuals with negative BCG should be less healthy than those whose brains predict higher cognitive function than their actual performance (i.e., a positive BCG). Our second aim is to extend the concept of brain-age prediction to cognition by introducing BCG. Considering the significance of lifestyle and cardiovascular risk for maintaining healthy brain function (68, 69), we assess whether BCG captures individual differences beyond chronological age and examine whether individuals with positive versus negative BCG differ in lifestyle factors and cardiovascular risk, which are known contributors to brain health.

The third aim of the current study is to investigate the neurobiological underpinnings of BCG by examining the role of dopaminergic (DA) integrity. We test the hypothesis that lower DA receptor availability is associated with increased blood-oxygen-level-dependent (BOLD) signal variability, reduced functional connectome uniqueness, and larger BCG, consistent with DA’s role in modulating neural signal-to-noise ratio (SNR) and network coherence. DA is a vital neuromodulator with critical implications for motor function, reward-seeking behavior, and various higher-order cognitive functions (70-77). Insufficient DA modulation can affect neurocognitive functions detrimentally (71, 76, 78-80). (81, 82) (83, 84) (85). Pharmacological studies have shown that DA depletion increases the variability of the BOLD signal, subsequently leading to less synchronized connectivity within resting state networks (86). Consequently, we expect individuals with inadequate DA levels to exhibit increased regional signal variability, a less unique functional connectome, and greater BCG.

Using data from the DopamiNe, Age, connectoMe, and Cognition (DyNAMiC) study (n =180, 20-79 years, 50% female) (56), we evaluated the predictive power of the functional connectome during resting state, movie-watching, and n-back tasks for two different cognitive domains: episodic memory and working memory. Based on recent research indicating that movie-watching enhances predictability by highlighting key features of FC (52), we hypothesized that FC during movie-watching would outperform FC during rest, and possibly during task, in predicting both cognitive measures. To achieve this objective, we employed a deep neural network approach, a specific subtype of artificial intelligence (AI), to predict cognitive scores from the functional connectome. Deep learning approaches offer a flexible modeling framework capable of capturing complex linear and non-linear associations in high-dimensional data (30) and have been shown to reliably predict intelligence (23, 87). Considering the importance of individual characteristics, such as age, in predicting behavior from FC (34), we conducted external validation of our model, initially derived from an age-heterogeneous sample, in an age-homogeneous sample (from the Cognition, Brain, and Aging (COBRA) study (88)). We subsequently investigated whether individuals with positive brain-cognition prediction gaps differ from those with negative gaps in terms of lifestyle and cardiovascular disease risk factors. Moreover, we tested the hypothesis of whether individuals with lower striatal dopamine D1-like receptor availability (D1DR), the brain’s most abundant DA receptor subtype, have a less distinctive FC pattern (i.e., more regional variability) and, in turn, a larger BCG. Finally, we conducted an external validation of the link between DA and the prediction gap in an independent cohort with estimates of DA D2-like receptor (D2DR) availability (88).

2. Results

2.1. AI-Driven Predictive Modeling of Cognition Scores from the Functional Connectome

We used fMRI data from the DyNAMiC project, in which each subject underwent scanning during rest, movie-watching, and working memory n-back tasks. These data were parcellated into 273 nodes (264 with 9 additional subcortical nodes) using a previously published whole-brain functional atlas (89). The averaged time series of 273 regions were subsequently correlated to create the FC matrix for each participant and cognitive state (rest, movie-watching, and n-back).
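As an illustration of the step described above, the following minimal sketch builds one FC matrix by correlating parcel-averaged time series. It assumes Pearson correlation and uses random numbers in place of real BOLD data; the function name `functional_connectivity` and the number of time points are hypothetical and not taken from the study.

```python
import numpy as np


def functional_connectivity(timeseries: np.ndarray) -> np.ndarray:
    """Correlate every pair of regional time series.

    timeseries: array of shape (n_timepoints, n_regions), one column per
    parcel-averaged signal. Returns an (n_regions, n_regions) matrix.
    Pearson correlation is an assumption made for this sketch.
    """
    if timeseries.ndim != 2:
        raise ValueError("expected a 2-D array (timepoints x regions)")
    # rowvar=False tells NumPy that columns (regions) are the variables.
    return np.corrcoef(timeseries, rowvar=False)


# Synthetic stand-in for one participant in one condition.
rng = np.random.default_rng(0)
n_timepoints, n_regions = 200, 273  # 273 nodes as in the text; 200 is assumed
ts = rng.standard_normal((n_timepoints, n_regions))
fc = functional_connectivity(ts)
```

In this sketch one such matrix would be computed per participant and per condition (rest, movie-watching, n-back) before being passed to a predictive model.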

We trained a convolutional neural network, DenselyAttention, derived from DenseNet (90) on FC matrices from each condition (resting state, movie-watching, and n-back) to generate cognition-specific prediction models of two memory domains (EM and WM). Model performance was quantified as the correlation between observed and predicted cognitive performance. Each model was then tested on all three conditions to examine the generalizability of each model across cognitive states. For example, the model trained on the resting state data (orange circles in Fig. 1) was used to predict EM scores using the test dataset derived from resting state (Fig. 1a), movie-watching, and n-back. Significance of our main predictions was assessed via linear correlations, and uncorrected p-values are presented in Tables 1-2.
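The evaluation metrics named in this section (correlation between observed and predicted scores, MSE, and MAE) can be sketched as follows. This is an illustrative helper under the assumption that the correlation is Pearson's r; `prediction_metrics` is a hypothetical name and the numbers are toy values, not study data.

```python
import numpy as np


def prediction_metrics(observed, predicted):
    """Return Pearson r, mean squared error, and mean absolute error."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(observed, predicted)[0, 1]
    err = predicted - observed
    return {
        "r": float(r),
        "mse": float(np.mean(err ** 2)),
        "mae": float(np.mean(np.abs(err))),
    }


# Toy example: predictions are offset by a constant 0.5, so the
# correlation is perfect while the error metrics are not zero.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
metrics = prediction_metrics(observed, predicted)
```

The toy case shows why correlation and error are reported together: a model can rank individuals perfectly and still be biased in absolute terms.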

Model trained on functional connectivity maps acquired at rest predicts (a) episodic memory (EM) of the test dataset. The models trained on movie-watching (b) but not n-back (c) datasets predicted EM scores of the test dataset. Table 1 summarizes the p values, correlations, mean square error (MSE), and mean absolute error (MAE) for each model. Test datasets were obtained from the same cohort for rest, movie-watching, and n-back. The winning model trained on EM at rest was evaluated on the external COBRA cohort and yielded a significant prediction of EM scores (d). Bootstrap distributions of correlations between predicted and actual EM scores indicated no significant difference in the predictive power of EM between models trained on resting state and movie-watching data (e). Additionally, the bootstrap distribution revealed that models trained on resting state and movie-watching data yielded higher correlations than those trained on n-back data (e). Visualization of features contributing to the successful prediction of EM at rest (f). A grad-CAM-derived saliency map displays the features that contributed to the model’s predictions. The hot spots overlaid on the FC map demonstrate noticeable cross-correlation contributions in “default mode” (DMN) regions. Another important feature visualized by Grad-CAM includes off-diagonal hot spots reflecting inter-connections of the DMN – “subcortical” node.

Correlation results for episodic memory (EM) score predictions

Correlation results for working memory (WM) score predictions.

2.2. Resting state and movie-watching models outperform n-back in episodic-memory prediction, with resting state offering the best generalizability

We started with all cases in which congruent conditions were used for model building and prediction. Only models derived from the rest and movie-watching datasets yielded significant predictions of EM (Figs. 1a-b and Table 1), with resting state yielding the best-performing model (r =0.50, p <0.0001), followed by movie-watching (r =0.49, p <0.0001) (Table 1). While there was no significant difference between resting state and movie-watching in predicting episodic memory (Δr =0.071, with a 95% confidence interval of [-0.097, 0.261], Fig. 1e), both models yielded a markedly better EM prediction than n-back (rest vs. n-back: Δr =0.333, with a 95% confidence interval of [0.054, 0.572]; movie vs. n-back: Δr =0.316, with a 95% confidence interval of [0.015, 0.619], Fig. 1e). Thus, the two models outperformed the n-back model (p <0.05, bootstrap test), indicating a significant improvement. To test the generalizability of these models, two types of validation analyses were performed: cross-condition and cross-dataset. In the cross-condition analysis, models trained on one condition (e.g., rest) were tested on an incongruent condition (e.g., movie-watching, n-back; Table 1). Interestingly, the model trained on resting state significantly predicted EM when tested on movie-watching (r =0.44, p <0.001) and n-back (r =0.38, p =0.003) conditions. In contrast, models trained on movie-watching or n-back did not generalize to other conditions (p’s >0.1), with one exception: the movie-watching model significantly predicted EM in the rest condition (Table 1).
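A bootstrap comparison of two models' correlations, as reported above via Δr and its 95% confidence interval, might be sketched like this. The exact resampling scheme is not stated in the text; the sketch assumes a paired percentile bootstrap over test subjects, and all names and data are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])


def bootstrap_delta_r(observed, pred_a, pred_b, n_boot=1000, seed=0):
    """Percentile-bootstrap CI for r(observed, pred_a) - r(observed, pred_b).

    Subjects are resampled with replacement, and the same resampled
    subjects are used for both models (a paired design, assumed here).
    """
    observed, pred_a, pred_b = (
        np.asarray(v, dtype=float) for v in (observed, pred_a, pred_b)
    )
    rng = np.random.default_rng(seed)
    n = observed.size
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample subject indices
        deltas[i] = (pearson_r(observed[idx], pred_a[idx])
                     - pearson_r(observed[idx], pred_b[idx]))
    low, high = np.percentile(deltas, [2.5, 97.5])
    point = pearson_r(observed, pred_a) - pearson_r(observed, pred_b)
    return point, (float(low), float(high))


# Synthetic test set: model A tracks the scores, model B is pure noise.
rng = np.random.default_rng(42)
scores = rng.standard_normal(60)
model_a = scores + 0.5 * rng.standard_normal(60)
model_b = rng.standard_normal(60)
point, (ci_low, ci_high) = bootstrap_delta_r(scores, model_a, model_b)
```

Under this scheme, a confidence interval that excludes zero is read as a significant difference in predictive power between the two models.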

In a cross-dataset validation analysis, the best-performing model from the age-heterogeneous DyNAMiC dataset was tested on the corresponding condition in an age-homogeneous cohort from the COBRA dataset. By doing this, we found that the resting state model derived from DyNAMiC significantly predicted EM performance in the COBRA dataset (r =0.24, p <0.0001).

Next, we aimed to delineate the relative contributions of different brain regions to the best-performing model, that is, the model trained on resting state data to predict episodic memory. Using the Grad-CAM algorithm, saliency maps were generated for the 120 FC matrices used during training of the winning model. An averaged and interpolated representation of all saliency maps is depicted in Figure 1f. The saliency map highlights specific edges, especially within the default-mode network, between the default-mode network and subcortical areas, and between the default-mode and cerebellar networks. These edges, indicated by a salience intensity of ≥0.5, exert a strong influence on the model (Fig. 1f).
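The aggregation step described above (averaging per-subject saliency maps and keeping edges with intensity ≥0.5) can be illustrated with a small sketch. It does not compute Grad-CAM itself and omits interpolation; it assumes the maps are already available as arrays and that the average is rescaled to [0, 1] before thresholding, which the text does not specify. Function names and the toy labels are hypothetical.

```python
import numpy as np


def average_saliency(maps):
    """Average (n_subjects, n_nodes, n_nodes) maps and rescale to [0, 1]."""
    mean_map = np.mean(maps, axis=0)
    span = mean_map.max() - mean_map.min()
    if span == 0:
        return np.zeros_like(mean_map)
    return (mean_map - mean_map.min()) / span


def salient_network_pairs(saliency, labels, threshold=0.5):
    """List the network pairs that contain at least one edge >= threshold."""
    rows, cols = np.nonzero(saliency >= threshold)
    pairs = {
        tuple(sorted((str(labels[i]), str(labels[j]))))
        for i, j in zip(rows, cols)
    }
    return sorted(pairs)


# Toy example with four nodes; network labels are illustrative only.
labels = ["DMN", "DMN", "Subcortical", "Visual"]
one_map = np.zeros((4, 4))
one_map[0, 1] = one_map[1, 0] = 1.0   # within-DMN edge
one_map[0, 2] = one_map[2, 0] = 0.8   # DMN-subcortical edge
one_map[3, 3] = 0.2                   # weak visual entry, below threshold
maps = np.stack([one_map] * 3)        # three identical "subjects"
saliency = average_saliency(maps)
pairs = salient_network_pairs(saliency, labels)
```

Summarizing by network pair rather than by individual edge mirrors how the saliency results are described in the text.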

2.3. Movie-watching and n-back models outperform resting state in working-memory predictions, with movie-watching offering the best generalizability

We next investigated whether the superiority of resting state in predicting EM was unique to this domain, considering that previous research demonstrated the advantages of task-based fMRI and naturalistic viewing in predicting fluid intelligence (91). To do so, we compared the predictive power of different states for WM, which is shown to be more directly associated with fluid intelligence compared to EM.

The model derived from the resting state failed to predict WM and did not generalize (p’s >0.10; Fig. 2a and Table 2). By contrast, models trained on movie-watching and n-back yielded significant predictions of WM (Figs. 2b-c and Table 2), with the movie-watching model emerging as the best-performing model (r =0.57, p <0.0001), followed by n-back (r =0.47, p <0.0001). While there was no significant difference between movie-watching and n-back in predicting WM (Δr =0.026, with a 95% confidence interval of [0.001, 0.052], Fig. 2e), these models yielded better WM prediction than resting state (Δr =0.517, with a 95% confidence interval of [0.373, 0.662], Fig. 2e).

Model trained on FC maps acquired at rest did not significantly predict (a) the working memory (WM) in the test dataset. The model trained on the movie-watching dataset yielded the best-performing model in predicting WM (b) while the model trained on the n-back dataset (c) was the second-best model. Table 2 summarizes the p-values, correlation power, MSE, and MAE for each model. (d) Results of cross-dataset validation, where the best-performing model in the DyNAMiC dataset (i.e., movie-watching) was applied to predict WM to the COBRA dataset. However, since COBRA does not include a movie-watching paradigm, we applied the model to the n-back task in COBRA (Table 2). Bootstrap distributions of correlations between predicted and actual WM scores showed no significant difference in predictive power between models trained on movie-watching and n-back data (e). The bootstrap distribution revealed that models trained on movie-watching and n-back data exhibited higher correlations than those trained on resting state data (e). The Grad-CAM-derived saliency map highlights dominant features in the FC maps that contributed to the model’s predictions (f). The hot spots overlaid on an FC map demonstrate noticeable cross-correlation contributions in the “VAN”, “visual”, and to a lesser degree (<0.5) “DMN”. Other important features visualized by Grad-CAM include off-diagonal hot spots reflecting inter-connections of the “DMN” – with “FPN, Fronto-parietal Task Control”, “Subcortical”, and “Cerebellar”; “Cerebellar” – “FPN” node.

In cross-condition validation (Table 2), the movie-watching model yielded a significant WM prediction during both resting state (r =0.42, p <0.0001) and n-back (r =0.46, p =0.004). The model derived from n-back yielded a significant WM prediction during movie-watching (r =0.46, p =0.0002), but not during resting state (r =0.20, p =0.12).

For cross-dataset validation, the best-performing model in the DyNAMiC dataset (i.e., movie-watching) was applied to predict WM in the COBRA dataset. However, since COBRA does not include a movie-watching paradigm, we applied the model to the n-back task in COBRA. This approach revealed that the model from the DyNAMiC movie-watching condition yielded a significant WM prediction in the COBRA n-back task (r =0.47, p <0.0001).

Overall, our results suggest that movie-watching-based WM predictions generalize across different cognitive states and datasets. This finding could be further replicated using a different functional parcellation (Figs. S1-S2 and Tables S1-S2).

The Grad-CAM algorithm generated saliency maps for the 120 FC maps from the DyNAMiC dataset employed for training. Figure 2f depicts an average of all Grad-CAM-generated maps. The saliency map reveals the edges that contributed to the best-performing model derived from the movie-watching dataset: within-network connectivity of task-positive networks, such as the frontoparietal task control, dorsal/ventral attention, visual, and subcortical networks, as well as between-network connectivity, namely FC between the task-positive dorsal and ventral attention networks and between the DMN and the fronto-parietal network. Applying two different parcellation methods (89, 92) to the DyNAMiC data indicated that parcellation resolution does not significantly impact model performance (see Figs. S1-S2 and Tables S1-S2).

2.4. The brain-cognition gap is related to physical activity levels and cardiovascular risk factors

Given the importance of lifestyle and cardiovascular health for maintaining healthy brain function (68, 69, 93), we examined whether individuals with positive versus negative prediction gaps differed in physical activity habits, education, and Framingham cardiovascular disease (CVD) risk score. Our primary focus was on EM predictions derived from resting state data, as this was the common condition across the DyNAMiC and COBRA datasets. We computed the difference between predicted and observed EM scores to generate BCG. A positive BCG indicates that an individual’s brain predicted better-than-observed EM performance, whereas a negative BCG indicates more compromised brains relative to actual performance.
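The sign convention for BCG described above can be made concrete with a short sketch: the gap is predicted minus observed score, and individuals are grouped by its sign. The function names and the toy scores are hypothetical; how the study handled gaps of exactly zero is not stated, so they are simply left out of both groups here.

```python
import numpy as np


def brain_cognition_gap(predicted, observed):
    """BCG = predicted score minus observed score (sign convention from the text)."""
    return np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)


def split_by_gap_sign(gap):
    """Return index arrays for positive-gap and negative-gap individuals."""
    gap = np.asarray(gap)
    return np.flatnonzero(gap > 0), np.flatnonzero(gap < 0)


# Toy EM scores for four individuals (not study data).
predicted = np.array([0.8, -0.2, 0.1, 1.0])
observed = np.array([0.3, 0.4, 0.1, 1.5])
gap = brain_cognition_gap(predicted, observed)
positive_idx, negative_idx = split_by_gap_sign(gap)
```

With the groups defined this way, lifestyle and cardiovascular measures can be compared between the positive and negative BCG groups.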

In the test sample of the DyNAMiC data (n =60) and the entire COBRA sample (n =177), we found that individuals with a negative BCG exhibited lower levels of physical activity and higher CVD risk scores compared to those with positive gaps (Fig. 3). Confirmatory analysis with continuous variables revealed positive relationships between BCG and physical activity (DyNAMiC: r(57) =0.40, p =0.001; COBRA: r(166) =0.17, p =0.03) and negative relationships between BCG and CVD risk score (DyNAMiC: r(57) =–0.27, p =0.03; COBRA: r(172) =–0.10, p =0.40), the latter not reaching significance in COBRA. Moreover, individuals with negative BCG were less educated compared to those with positive BCG in the DyNAMiC dataset.

The plots compare physical activity scores (total hours per week) and CVD risk between groups with positive and negative BCG, as well as between groups with high and low memory performance, in the DyNAMiC (top row) and COBRA (bottom row) datasets. *, **, and *** denote p <0.05, p <0.01, and p <0.001, respectively.

To test whether cognition on its own is related to physical activity and CVD score, we conducted a median split on EM and compared physical activity and cardiovascular risk score across the two groups. In contrast to the findings related to BCG, we found no significant difference in the level of physical activity (t(58) =-0.59, p =0.56) or cardiovascular risk score (t(58) =1.64, p =0.11) between high and low EM performers (Fig. 3). These results suggest that BCG may provide additional information, beyond cognitive measures, regarding factors that contribute to cognitive resilience.

2.5. Dopamine D1 and D2 receptor availability are associated with brain-cognition gaps

Given that BCG may partly reflect variability in neural signal, one plausible neurobiological factor contributing to BCG is dopaminergic integrity. We hypothesized that inadequate DA levels might be related to a reduced neural signal-to-noise ratio, resulting in a less unique functional connectome and, consequently, a greater prediction gap. We therefore first investigated the relationship between DA receptor levels and prediction gaps across different types of DA receptors in the DyNAMiC and COBRA samples.

In the DyNAMiC sample, we found a significant correlation between striatal D1DR and BCG (in those with a positive gap: r = –0.49, p =0.03; negative gap: r =0.40, p =0.01) (Fig. 4a), suggesting that lower D1DR is associated with a larger BCG in either direction.

Relationship between dopamine D1 receptor availability and the gap between predicted and actual EM scores (a). Negative gaps indicate that the predicted EM score was lower than the actual EM score, while positive gaps indicate that the predicted score was higher than the actual EM score. Partial correlation analysis showed a significant correlation between D1 receptor values and both negative and positive gaps. Relationship between dopamine D2 receptor availability and the gap between predicted and actual EM scores (b). Gaps are defined as in (a). Partial correlation analysis showed a significant link of D2 receptor values to both negative and positive gaps.

We replicated our finding in the COBRA sample using D2-like receptor availability (D2DR), revealing a significant relationship between striatal D2DR and the prediction gap (in those with a positive gap: r = –0.49, p =0.001; negative gap: r =0.39, p =0.004) (Fig. 4b). Our findings support the view that lower D1DR/D2DR is associated with larger brain-cognition prediction gaps.

Both D1DR and D2DR availability in the striatum were associated with BCG, such that lower dopamine receptor availability was linked to a greater BCG. However, these associations varied by region. For D1DR, significant correlations with BCG were observed in the caudate (positive gap: r = –0.37, p =0.02; negative gap: r = 0.37, p =0.02) and putamen (positive gap: r = – 0.53, p =0.02; negative gap: r =0.34, p =0.03), but not in the nucleus accumbens (positive gap: r = –0.25, p = 0.31; negative gap: r =0.07, p =0.69) or the DLPFC (positive gap: r = – 0.30, p =0.21; negative gap: r =0.21, p =0.21). For D2DR, both caudate (positive gap: r = – 0.34, p =0.004; negative gap: r =0.36, p =0.0003) and putamen (positive gap: r = – 0.37, p =0.002; negative gap: r =0.22, p =0.03) showed significant associations with BCG.

2.6. Regional variability mediates the direct impact of dopamine on brain-cognition gaps

We showed that dopamine is associated with BCG. To evaluate whether functional variability mediated this relationship, we conducted additional mediation analyses. We computed BOLD signal entropy, which estimates within-region signal variability during resting state (86), and averaged it across striatal regions (left caudate: MNI coordinates ≈ [–12, 12, 6], right caudate: MNI coordinates ≈ [10, 14, 2], left putamen: MNI coordinates ≈ [–24, 6, 4], right putamen: MNI coordinates ≈ [29, 1, 4]), as we expected that reduced DA may primarily impact local functional dynamics.

In the DyNAMiC sample, within the group exhibiting a negative BCG, we observed a negative association between striatal D1DR and entropy (r =–0.33, p =0.04) as well as a negative association between entropy and BCG (r =–0.36, p =0.03). Importantly, we observed a significant indirect effect of D1DR on the gap mediated by entropy, β =2.41, 95% CI [0.89, 4.51], p <0.0001, explaining 56.81% of the total effect of D1DR on the gap. The direct effect of D1DR, however, was not significant, β =1.83, 95% CI [–0.86, 3.79], p =0.19.
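To make the mediation logic concrete, the sketch below estimates an indirect effect as the product of two regression slopes (predictor to mediator, mediator to outcome while controlling for the predictor) with a percentile-bootstrap confidence interval. This is a generic product-of-coefficients approach assumed for illustration; the text does not state which mediation procedure was used, and all names and data here are hypothetical. The helper `proportion_mediated` uses indirect / (indirect + direct), which reproduces the reported share from the rounded estimates above (2.41 / (2.41 + 1.83) ≈ 0.568).

```python
import numpy as np


def ols(design, y):
    """Least-squares coefficients for y ~ design (intercept column supplied by caller)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mediation_effects(x, m, y):
    """Return (indirect, direct) effects of x on y through mediator m."""
    ones = np.ones_like(x)
    a = ols(np.column_stack([ones, x]), m)[1]              # x -> m
    _, direct, b = ols(np.column_stack([ones, x, m]), y)   # x -> y and m -> y
    return float(a * b), float(direct)


def bootstrap_indirect_ci(x, m, y, n_boot=1000, seed=0):
    """95% percentile-bootstrap interval for the indirect effect."""
    rng = np.random.default_rng(seed)
    n = x.size
    draws = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample individuals
        draws[i] = mediation_effects(x[idx], m[idx], y[idx])[0]
    low, high = np.percentile(draws, [2.5, 97.5])
    return float(low), float(high)


def proportion_mediated(indirect, direct):
    return indirect / (indirect + direct)


# Synthetic data: x lowers m, and m alone drives y (full mediation).
rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)                       # stands in for receptor availability
m = -0.8 * x + 0.3 * rng.standard_normal(n)      # stands in for entropy
y = 0.9 * m + 0.3 * rng.standard_normal(n)       # stands in for the gap
indirect, direct = mediation_effects(x, m, y)
ci_low, ci_high = bootstrap_indirect_ci(x, m, y)
share = proportion_mediated(2.41, 1.83)          # rounded estimates from the text
```

In this synthetic case the indirect effect is clearly negative while the direct effect is near zero, the same qualitative pattern of a significant indirect path with a non-significant direct path.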

Similarly, in the group with a positive BCG, entropy was negatively associated with D1DR (r =–0.56, p =0.01) and positively associated with BCG (r =0.47, p =0.04). Importantly, an indirect effect of D1DR on BCG through entropy was observed, β =–6.78, 95% CI [–11.26, –3.05], p <.0001, accounting for 89.41% of the D1DR effect on the gap. The direct effect of D1DR was again non-significant, β =–0.803, 95% CI [–0.48, 2.27], p =0.26. These findings suggest that lower D1DR levels contribute to increased signal variability, which in turn leads to reduced specificity of FC and, consequently, a larger BCG.

In the COBRA sample, within the group with a negative BCG, we observed a negative association between striatal D2DR and entropy (r = –0.22, p =0.03) as well as a negative association between entropy and BCG (r = –0.27, p =0.007). In the group with the positive BCG, entropy was negatively associated with D2DR (r = –0.26, p =0.03) and positively associated with BCG (r =0.25, p =0.03). Moreover, we detected a significant indirect effect of D2DR on both the negative and positive gap groups through entropy. For the negative gap, β =2.18, 95% CI [0.01, 4.25], p =0.04, accounting for 63.43% of the D2DR effect on the gap; and for the positive gap, β =–2.18, 95% CI [–3.87, –0.04], p =0.04, accounting for 61.49% of the D2DR effect on the gap. Similar to the results reported for D1DR, these findings suggest that lower D2DR levels contribute to increased signal variability, which in turn may lead to reduced specificity of FC and, consequently, a larger BCG.

3. Discussion

Using deep learning models, we examined the predictive power of the functional connectome during various states (resting state, movie-watching, and n-back) for two different cognitive domains (EM and WM). Both rest and movie-watching states yielded significant predictions of EM, with the model derived from resting state generalizing across states and datasets. Differences between the DyNAMiC and COBRA datasets make cross-dataset prediction a harder problem, as the age ranges of the samples differ substantially, and prior studies highlight the importance of individual characteristics like age in predicting behavior from FC (34). In line with this, model performance decreased when predicting EM in the COBRA sample, whereas prediction of WM remained largely unchanged. Thus, validation outcomes suggest that the models, particularly those predicting WM, show robustness across datasets, whereas the reduced EM performance highlights potential data-specific influences that limit generalizability. The saliency map generated from the final layer of the deep learning model indicates that certain edges within the DMN, and between the DMN and the subcortical network, contributed significantly to the prediction. Building on a recent finding by Kurkela and Ritchey (94), our findings reveal that a portion of the known memory subnetwork within the DMN, as well as a whole-brain multivariate pattern that notably encompasses interactions of the DMN with other networks, such as the subcortical network, made a substantial contribution to prediction. Importantly, this prediction generalizes across conditions and datasets, suggesting that features derived from resting state FC serve as a relatively stable marker of individual differences in EM, though with reduced strength in COBRA. While such generalization is partly facilitated by the similarity of functional connectivity across states, it is not a trivial outcome. For instance, the model trained on movie-watching data generalized to EM prediction during rest but failed to do so for the n-back condition, even though movie-watching and n-back connectivity patterns are themselves highly correlated. This indicates that successful generalization depends not only on shared variance across states but also on the cognitive processes most relevant to the target behavior.

Our findings are in contrast to recent work suggesting that task paradigms, in general, and movie-watching, in particular, outperform resting state data in predicting cognitive performance (51, 52, 95). While previous studies have often demonstrated a superiority of task and naturalistic viewing over resting state in predicting fluid intelligence or WM (51, 52), there are fewer reports of FC predicting EM (e.g., (96, 97)), and, to our knowledge, no study has compared rest and movie-watching. While we acknowledge that the resting state represents a complex amalgamation of cognitive, emotional, and perceptual processes (98), the good prediction power of the resting state may arise from the presence of mind wandering during rest, which is strongly related to EM (99, 100). EM plays a crucial role in generating mental content during mind wandering, especially episodes characterized by distinct times and locations (61, 101).

In contrast to the EM prediction, both n-back and movie-watching connectomes yielded significant predictions of offline WM performance. Importantly, the models derived from movie-watching and n-back outperformed the resting state in WM prediction. These differences in model accuracy when predicting the same target behavior (i.e., WM) suggest the presence of trait-state interactions. Specifically, movies and n-back enhance individual differences in WM-relevant connections. Indeed, we found that several WM-related networks (102–105), including the fronto-parietal, the salience, and the dorsal/ventral attention networks, contribute to prediction. Additionally, previous research showed that movie-watching alters the propagation of activity across cortical pathways (106), particularly within and between regions involved in audiovisual processing and attention. These alterations lead to a less segregated and more integrated network organization (107). Similarly, the n-back task has been associated with increased integration of task-positive cortico-cortical connectivity (105, 108) and striato-cortical connectivity (103). Our findings also suggest that certain task contexts strike an optimal balance between reducing neural variability and maintaining sufficient richness to capture individual differences. Prior work shows that task states quench neural variability, leading to a more reliable and predictable neural signal (109). In this context, movie-watching may represent such a sweet spot, constraining neural dynamics through shared audiovisual stimulation while simultaneously engaging a broad range of cognitive processes that preserve individual differences. Taken together, our results confirm previous findings that movie-watching is a suitable condition for studying individual differences across various cognitive domains. 
Nonetheless, if a movie-watching paradigm is not feasible or available, resting state still provides a robust means of studying individual differences, particularly in self-referential domains, such as EM.

Our study used a deep neural network architecture that features dense connections and incorporates an attentional mechanism. While our findings demonstrate that a deep learning framework can provide reasonable predictive accuracy, it is important to note that other machine learning approaches (e.g., tree-based models) may offer comparable predictive power, as suggested by prior benchmarking work (29, 30). Our study explicitly compares predictive power across different cognitive states (rest, movie watching, n-back) to identify the states that best capture individual differences across domains. The relative performance of deep learning and other non-linear approaches depends on multiple factors, including sample size, model architecture, feature representation, and domain-specific characteristics of the prediction target. In this context, deep learning was employed as a flexible framework capable of modeling high-dimensional functional connectivity patterns across cognitive states, rather than as a claim of inherent methodological superiority. Thus, our goal was not to propose a universally superior prediction model, but rather to test how brain state influences predictive utility for WM and EM using a deep learning approach.

We found a significant link between BCG, lifestyle, and risk factors for vascular disease, such that individuals with a negative BCG exhibited lower levels of physical activity and higher cardiovascular risk scores compared to those with a positive BCG. This finding was consistent across both age-heterogeneous and older age-homogeneous samples. BCG could serve as a potential biomarker for identifying individuals at risk (e.g., individuals with a negative gap). Previous studies suggest that the brain age prediction gap is associated with cognitive aging (64), some aspects of physiological aging (63), as well as aging in other organs (110) and even mortality in older age (63). However, a recent study revealed that brain age accounts for only a small portion of cognitive decline compared to chronological age (67), suggesting that cognitive prediction might be more informative. Our findings build upon this concept by extending the BCG to behavioral variables, demonstrating that the BCG could provide insights regarding physical activity status, education, and cardiovascular risk – key factors contributing to cognitive reserve (111–114). Note that the association with education was significant only in the DyNAMiC sample and did not reach significance in the COBRA dataset. An important caveat is that BCG can also be conceptualized as an error metric, like mean absolute error or mean square error, reflecting the extent to which models trained in one sample generalize to another. From this perspective, a larger gap may not only indicate individual differences related to resilience factors and dopaminergic function but also reduced model fit or generalizability across datasets. Thus, BCG likely reflects a combination of meaningful biological variability and methodological variance.

Critically, we found that D1DRs and D2DRs were strongly associated with the BCG, such that lower dopamine receptor levels were associated with greater gaps. More specifically, in two independent samples, we discovered greater correspondence (i.e., near zero in BCG) between brain function and cognition in individuals with higher D1DR/D2DR, whereas lower correspondence (i.e., significantly different BCG from zero) was found in individuals with lower D1DR/D2DR availability. Previous computational models proposed that DA modulates neuronal gain, which improves SNR in neural processing, contributing to more coherent activity across large-scale networks (e.g., balanced integration and segregation (85)). Past studies also showed that lower D1DR contributed to more BOLD variability in the subcortical area (83) and less functional segregation of the striatum (115) and the large-scale networks in aging (85), possibly due to increased noise (lower SNR). In support of this notion, we found for the first time that regional variability, estimated using entropy, mediated the impact of DA on BCG. Although the cross-sectional nature of our data warrants caution, this novel finding suggests that lower DA integrity relates to BOLD variability, which in turn is associated with a larger BCG.

An important caveat is that D1DR and D2DR availability do not provide a direct measure of dopamine signaling. Instead, they reflect receptor availability, which interacts with endogenous dopamine in a complex manner. PET measures of D1DR and D2DR availability reflect the density of unoccupied dopamine receptors and the degree to which endogenous dopamine competes with radioligand binding. D2DR binding potential is sensitive to competition from synaptic dopamine, such that higher ambient dopamine generally reduces tracer binding; D1DR binding, however, is less affected by endogenous dopamine under physiological conditions and thus reflects receptor expression levels more directly. Previous studies demonstrated a significant association between D2DR availability and dopamine synthesis capacity measured by FMT (116, 117), suggesting that postsynaptic receptor markers may, under certain conditions, serve as a proxy for dopaminergic signaling. Developmental factors, such as the number of dopamine-producing neurons innervating the striatum, may further influence the structural and functional relationship between pre- and post-synaptic markers. By contrast, smaller studies have reported non-significant (118, 119) or negative (120) associations, although these studies relied on [18F]FDOPA, which is considered a less precise index of dopamine synthesis than FMT. Taken together, these reports indicate that the relationship between pre- and post-synaptic markers is complex and not necessarily linear. Accordingly, our observation that lower receptor availability is associated with greater neural variability should not be interpreted as direct evidence of weaker dopaminergic signaling, but rather as reflecting the interplay between receptor density and endogenous dopamine occupancy, particularly in the case of D2DR.

Finally, we did not directly compare BCG and brain-age gap (BAG). While our focus was to investigate whether the BCG provides information about factors contributing to cognitive resilience, we acknowledge that benchmarking BCG against the brain-age gap in predicting lifestyle and vascular risk factors would be valuable. However, addressing this question lies beyond the scope of the present study, and future work should systematically compare these approaches. We also acknowledge that our main and validation samples are moderate in size for deep learning, which constrains statistical power and generalizability. Although external validation, early stopping, dropout, and regularization help mitigate overfitting, larger samples will be needed in future work to fully establish the robustness of these predictive models.

In summary, our findings reveal that while tasks like movie-watching predict both episodic and working memory, there are features during rest that can effectively predict internally oriented mind-wandering-type tasks, such as EM. Additionally, individuals whose brains predict poorer cognitive performance (i.e., negative gap) exhibit lower physical activity and higher cardiovascular risk compared to those whose brains predict higher cognitive function than their actual performance (i.e., positive gap). This finding suggests that our prediction model offers a potential marker to identify individuals at risk of compromised brain maintenance. Furthermore, individuals with lower DA showed less accurate cognitive prediction (larger BCG) due to increased BOLD variability and less unique and cohesive FC.

4. Materials and methods

4.1. Participants

All participants provided written informed consent, and studies were conducted in accordance with the Declaration of Helsinki and approved by the Regional Ethical Board and the local Radiation Safety Committee (reference numbers: 2012-57-31M; 2017-404-32M).

This study used data from DyNAMiC (56), which is a longitudinal study with a focus on changes in the brain connectome and the D1DR system. At baseline, 180 participants (20–79 years, 50% female) across the adult lifespan underwent all tests between 2017 and 2020 (56) (Fig. 5). Rigorous exclusion criteria were used to recruit a sample without neurological conditions and medical treatments affecting brain functioning and cognition. Exclusion criteria included brain injury or neurological disorder, dementia, neurodevelopmental disorder, psychiatric diagnosis, psychopharmacological treatment, history of severe head trauma, substance abuse or dependence, and illicit drug use. Individuals with other chronic or severe medical conditions (e.g., cancer, diabetes, and Parkinson’s disease) were also excluded. Here, we only use data from the baseline measurement.

Overview of the experimental procedure and the use of datasets.

We used a 3-fold within-sample (DyNAMiC) cross-validation where we trained our model on 120 subjects (8:2; 80% training:20% validation during training) and tested it in a separate sample of 60 subjects. The winning within-sample model was used for between-sample (COBRA) external validation.

We used a separate sample as a testing dataset from the COBRA study (88) (Fig. 5). COBRA is a longitudinal aging study in which 181 healthy individuals (64-68 years, 45% female) underwent baseline assessments of the brain, cognition, health, and lifestyle during 2012–2014 (88). Exclusion criteria at baseline included traumatic brain injury, stroke, dementia, intellectual disability, epilepsy, psychiatric and neurological disorders, diabetes, and cancer medications, severe visual or auditory impairment, claustrophobia, and poor Swedish language skills. In the current study, we used data from 177 subjects from COBRA who underwent both MRI and PET examinations at baseline (80, 88, 102, 103, 121).

4.2. Cognitive Measures

The same cognitive test battery was used in DyNAMiC and COBRA (56, 88) (Fig. S1) and assessed two cognitive domains: episodic memory (EM) and working memory (WM). Each domain was assessed using three separate tests, including letter-, number-, and figure-based material, respectively. Participants completed all tasks on a computer and responded by typing in words or numbers, using the computer mouse, or pressing keys marked by different colors. Each test included initial practice runs, after which testing followed across several trials. For each of the three tests within a domain, scores were summed across the total number of trials to give a measure of overall performance. The three resulting sum scores were z-standardized and averaged to form one composite score for each domain (i.e., episodic memory, working memory). The standardization was carried out independently for the training (DyNAMiC) and test (COBRA) samples.
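As a minimal sketch of the composite-score computation described above, the following standardizes each test within a single sample and averages the resulting z-scores. The function name, the example scores, and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def composite_score(sum_scores):
    """Z-standardize each test within one sample, then average across tests.

    sum_scores: array of shape (n_participants, n_tests), one column per test.
    """
    scores = np.asarray(sum_scores, dtype=float)
    # Standardize each column using the mean and SD of this sample only
    # (ddof=1 is an assumption; the source does not state the estimator).
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    # Average the standardized tests to obtain one composite per participant.
    return z.mean(axis=1)

# Hypothetical sum scores for five participants on the three EM tests
# (word recall, number-word recall, object-location recall).
em_scores = np.array([
    [20, 6, 14],
    [25, 9, 18],
    [15, 4, 10],
    [28, 12, 20],
    [18, 7, 12],
])
em_composite = composite_score(em_scores)
```

Calling the function separately on the DyNAMiC and COBRA score tables would mirror the independent standardization of the training and test samples.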

4.2.1. Episodic Memory (EM)

Tests of EM included word recall, number-word recall, and object-location recall. In word recall, participants viewed 16 Swedish concrete nouns, successively appearing on the computer screen. Each word was presented for 6 s, with an inter-stimulus interval (ISI) of 1 s. Directly following encoding, participants reported as many words as they could using the keyboard. Two trials were completed, with a combined maximum score of 32. In number-word recall, participants encoded pairs of 2-digit numbers and concrete plural nouns (e.g., 46 dogs). During encoding, eight number-word pairs were presented, each for 6 s, with an ISI of 1 s. Directly after encoding, nouns were presented again, in re-arranged order, and participants had to report the 2-digit number associated with each presented noun (e.g., How many dogs?). This test included two trials with a total combined maximum score of 16. The third test assessed object-location memory. Participants viewed a 6 × 6 square grid in which 12 objects were consecutively shown at distinct locations. Each object-position pairing was displayed for 8 s, with an ISI of 1 s. Directly following encoding, all objects were simultaneously displayed next to the grid for the participant to place in their correct position within the grid. If unable to recall the correct position of an object, participants had to guess to the best of their ability. Two trials of this test were completed, making the total combined maximum score 24.

4.2.2. Working Memory (WM)

Working memory was examined with three tests: letter updating, number updating, and spatial updating. These tests differed from the working memory n-back task performed during fMRI scanning. In letter-updating, capital letters (A–D) were consecutively presented on the computer screen, with participants instructed to always keep the three final letters in memory. Each letter was presented for 1 s, with an ISI of 0.5 s. When prompted, at any given moment, participants provided their responses by typing in three letters using the keyboard and provided a guessing-based answer if they failed to remember the correct letters. Four practice trials were followed by 16 test trials consisting of either 7-, 9-, 11-, or 13-letter sequences. The combined maximum number of correct answers across trials was 48 (16 trials × 3 reported letters = 48). The number-updating test followed a columnized 3-back design. Three boxes were presented next to each other on the screen throughout the task, in which a single digit (1–9) was sequentially presented from left to right for 1.5 s with an ISI of 0.5 s. During an ongoing sequence, participants had to judge whether the number currently presented in a specific box matched the last number presented in the same box (i.e., appearing three numbers before). For each presented number, they responded yes/no by pressing assigned keys (“yes” = right index finger; “no” = left index finger). Two practice trials were followed by four test trials, each consisting of 30 numbers. The combined maximum number of correct answers across trials (after discarding responses to the first three numbers in every trial, as these were not preceded by any numbers to be matched with) was 108 (27 numbers × 4 trials). In spatial-updating, participants viewed three 3 × 3 square grids presented next to each other on the computer screen. At the start of each trial, a blue circular object was displayed at a random location within each grid. 
After 4 s, the circular objects disappeared, leaving the grids empty. An arrow then appeared below each grid, indicating that the circular object in the corresponding grid was to be mentally moved one step in the direction of the arrow. The arrows appeared stepwise from left to right in the grids, each presented for 2.5 s (ISI = 0.5 s). Prompted by three new arrows, participants mentally moved the circular objects one more time, resulting in each circular object having moved two steps from its original location at the end of each trial. Participants indicated which square the circular object in each grid had ended up in using the computer mouse. If unsure, they were instructed to guess. The test was performed across 10 test trials, preceded by five practice trials. The combined maximum number of correct location indications was 30.

4.3. Measure of Physical Activity and Cardiovascular Disease Risk

4.3.1. Physical Activity

An extensive battery of self-rating questionnaires was administered in DyNAMiC and COBRA (56, 88). Participants were asked to indicate the frequency (number of hours during a typical summer week; options: 1–14 h in 1-h increments, or 15+ h) and the intensity (how physically demanding, on a scale from 1 = “not at all” to 5 = “extremely”) with which they typically engage in 15 specific activities relevant to life in northern Sweden. For the present study, we focused on activities that are purely physical and in which individuals are sufficiently engaged (i.e., physically demanding ≥2.0). Each of these activities was performed by at least 20% of the participants at least once a week; e.g., walking, bicycling, jogging, strength training, household tasks, and daily work-related activities. Physical activity scores were computed as the summed frequency of these activities (sum h/week).

4.3.2. Cardiovascular Disease Risk

The risk of cardiovascular disease was determined via a multivariable score, according to the algorithm developed in the Framingham Heart Study (122). Variables include age, sex, hypertension diagnosis, systolic blood pressure, body mass index, smoking, and diabetes mellitus. The risk estimates were derived using an algorithm proposed by D’Agostino et al. (122), which employs Cox proportional-hazard regression models to predict the probability of developing any form of cardiovascular disease within 10 years:

$$\hat{p} = 1 - S_0(t)^{\exp\left(\sum_{i=1}^{m} \beta_i X_i - \sum_{i=1}^{m} \beta_i \bar{X}_i\right)} \tag{1}$$

where $S_0(t)$ is the baseline survival at follow-up $t$ ($t = 10$ years), $\beta_i$ the estimated regression coefficient, $X_i$ the log-transformed value of the $i$th risk factor, $\bar{X}_i$ the corresponding mean, and $m$ the number of risk factors included. Baseline survival, means, and regression coefficients were taken from the original algorithm, with DyNAMiC and COBRA participants’ risk variables inserted to compute the final scores. Risk score calculators can be found at framinghamheartstudy.org.
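The Cox-based risk calculation can be sketched in a few lines, assuming the risk-factor values are passed in already transformed. The function name and all numeric values below are placeholders chosen for illustration; they are not the published Framingham coefficients, means, or baseline survival, which should be taken from the original algorithm (122).

```python
import math

def cvd_risk(values, betas, means, baseline_survival):
    """10-year risk: 1 - S0(t) ** exp(sum(beta_i * X_i) - sum(beta_i * mean_i))."""
    linear = sum(b * x for b, x in zip(betas, values))
    linear_mean = sum(b * m for b, m in zip(betas, means))
    return 1.0 - baseline_survival ** math.exp(linear - linear_mean)

# Placeholder numbers for illustration only; NOT the published coefficients.
betas = [2.0, 1.5, 0.5]                      # hypothetical regression weights
means = [math.log(50), math.log(125), 0.3]   # hypothetical sample means
s0 = 0.90                                    # hypothetical baseline survival

# A person exactly at the sample means has risk 1 - S0(t).
average_person = cvd_risk(means, betas, means, s0)

# Raising one risk factor (here the first, e.g. log age) raises the risk.
older_person = cvd_risk([math.log(70), math.log(125), 0.3], betas, means, s0)
```

The first call illustrates a useful sanity check: when every risk factor equals its mean, the exponent is zero and the estimate reduces to one minus the baseline survival.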

4.4. Image Acquisitions

Structural, functional, and neurochemical brain measures were acquired using MRI and PET at Umeå University Hospital in northern Sweden. For both DyNAMiC and COBRA, all MRI data were collected using a 3T Discovery MR750 MRI system (General Electric, Healthcare, Illinois, USA) equipped with a 32-channel phased-array head coil. PET was conducted in 3D mode with a Discovery PET/CT 690 (General Electric, WI, United States) to assess whole-brain D1DR with [11C]SCH23390 and D2DR with [11C]Raclopride at rest in DyNAMiC and COBRA, respectively. Comprehensive descriptions of MRI, PET, and cognitive testing protocols are given elsewhere (56, 88). In this study, we primarily focus on those data directly pertinent to the current investigation.

4.4.1. Functional MRI

For the DyNAMiC dataset, high-resolution anatomical T1-weighted images were collected using a 3-dimensional (3D) fast spoiled gradient-echo sequence with acquisition parameters of 176 sagittal slices, thickness =1 mm, TR =8.2 ms, TE =3.2 ms, flip angle =12°, and a field of view (FOV) =250×250 mm. Whole-brain functional images were acquired during resting state, naturalistic viewing, and an n-back WM task. Functional images were acquired using a T2*-weighted single-shot echo-planar imaging (EPI) sequence, with 330 volumes collected over 12 min. The sequence provided 37 axial slices, slice thickness =3.4 mm, 0.5 mm spacing, TR =2,000 ms, TE =30 ms, flip angle =80°, and FOV =250×250 mm. Ten dummy scans were collected at the start of the sequence. During the resting state, participants were instructed to keep their eyes open and focus on a white fixation cross on a black background displayed on a computer screen through a tilted mirror attached to the head coil. WM was assessed in the scanner (12 min) with a numerical n-back task, which consisted of blocks of 1-back, 2-back, and 3-back (102, 103). During movie-watching, the participants viewed and listened to a 12-minute video consisting of selected and chronologically ordered sections from the Swedish movie Cockpit (123). Participants were instructed to view the movie attentively and answer a short multiple-choice questionnaire about the movie after the scanning session. We did not monitor participants’ responses to the movie, and the chosen clips were selected to be relatively neutral in emotional content. The storyline follows Valle, a recently fired pilot whose marriage has ended, as he struggles to find new employment. In a desperate attempt to secure a job at an airline specifically recruiting a female pilot, he presents himself as a woman.

In COBRA, data were collected using identical scans for resting state and n-back WM. However, the resting state scan was shorter, lasting only 6 minutes (75, 103).

4.5. Functional Connectivity Analysis

Functional data from all conditions (i.e., rest, movie, n-back) were pre-processed using the Statistical Parametric Mapping software package (SPM12). The functional images were first corrected for slice-timing differences and in-scanner motion, followed by registration to anatomical T1 images. Distortion correction was performed using subject-specific and T1 co-registered field maps. The functional time series were subsequently demeaned and detrended, followed by simultaneous nuisance regression and temporal high-pass filtering (threshold at 0.009 Hz), so that nuisance signals were not re-introduced by sequential processing (124). The nuisance regression model included mean cerebrospinal and white-matter signal and their derivatives, 24 motion parameters (125), a binary vector flagging motion-contaminated volumes exceeding framewise displacement (FD) of 0.2 mm (126), in addition to an 18-parameter RETROICOR model (127, 128) of cardiac pulsation (up to third-order harmonics), respiration (up to fourth-order harmonics), and first-order cardio-respiratory interactions estimated using the PhysIO Toolbox v.5.0 (129). Regression models for n-back included an additional set of finite impulse response (FIR) task regressors (130) to avoid false positive connectivity due to task-evoked activations (131). The FIR regression approach involved fitting mean cross-block responses for each time point within a time-locked window of equal duration to each task block (27 blocks of 20 s), extended by an additional 18 s following each block to account for the duration of the hemodynamic response function (HRF). Given that this approach linearly fits a set of binary task regressors for each time point, it is nearly identical to subtracting the mean task response, with the difference being that FIR regression is better able to handle overlapping task responses and differences in the shape of the HRF (131). 
The nuisance-regressed images were subsequently normalized to sample-specific group templates (DARTEL) (132) for each dataset, respectively, followed by spatial smoothing using a 6-mm FWHM Gaussian kernel to mitigate DARTEL-induced aliasing and an affine transformation to stereotactic MNI space (ICBM152NLin2009) (133, 134). Functional images from both DyNAMiC and COBRA were preprocessed according to the steps described above, with the only exception that the RETROICOR parameters were excluded from the regression models in COBRA due to technical issues related to the respiration and cardiac traces.
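To illustrate what simultaneous nuisance regression and high-pass filtering means in practice, the sketch below places an intercept, a set of low-frequency cosine regressors, and the nuisance regressors in one design matrix and removes them in a single least-squares fit. This is not the SPM12 pipeline used in the study: the function names, the discrete cosine basis as the filter, and the random example data are all assumptions for illustration.

```python
import numpy as np

def dct_highpass_basis(n_vols, tr, cutoff_hz):
    """Discrete cosine regressors covering frequencies below cutoff_hz."""
    # Component k of a DCT-II basis has frequency k / (2 * n_vols * tr) Hz.
    n_basis = int(np.floor(2 * n_vols * tr * cutoff_hz))
    t = np.arange(n_vols)
    return np.column_stack([
        np.cos(np.pi * k * (2 * t + 1) / (2 * n_vols))
        for k in range(1, n_basis + 1)
    ])

def clean_timeseries(data, nuisance, tr, cutoff_hz=0.009):
    """Remove nuisance signals and slow drifts in one least-squares fit.

    data: (n_vols, n_voxels); nuisance: (n_vols, n_regressors).
    """
    n_vols = data.shape[0]
    design = np.column_stack([
        np.ones(n_vols),                            # intercept (demeaning)
        dct_highpass_basis(n_vols, tr, cutoff_hz),  # high-pass filter as regressors
        nuisance,                                   # e.g. motion, CSF/WM, physiology
    ])
    beta, *_ = np.linalg.lstsq(design, data, rcond=None)
    return data - design @ beta                     # residuals = cleaned signal

# Random stand-in data: 330 volumes, 4 voxels, 6 hypothetical nuisance regressors.
rng = np.random.default_rng(0)
ts = rng.standard_normal((330, 4))
confounds = rng.standard_normal((330, 6))
cleaned = clean_timeseries(ts, confounds, tr=2.0)
```

Because filter and nuisance terms are fitted together, the residuals are orthogonal to both sets of regressors, which is the property that sequential filtering followed by regression does not guarantee.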

4.5.1. Graph Construction

For all fMRI conditions, functional time series were averaged from 273 cortical and subcortical regions, represented by 5-mm radius spheres, based on a widely employed FC parcellation (89). In addition to the 264 regions reported in the Power parcellation, we added 9 regions, including subcortical regions such as putamen, caudate, and anterior and posterior hippocampus, identified using independent component analysis (8, 135). These regions were categorized into 14 resting state networks according to a consensus partition (89). To mitigate sampling from non-gray matter voxels, each parcel underwent erosion by a permissive gray matter mask (eroding voxels <.1% threshold). The averaged time series were then subjected to Pearson’s correlations, followed by Fisher’s r-to-z transformation, yielding a 273×273 adjacency matrix for each participant, with coefficients along the main diagonal set to zero.
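The graph-construction step can be sketched as follows, assuming the region-averaged time series are already available as a volumes-by-regions array. The function name and the random input are illustrative only.

```python
import numpy as np

def fc_matrix(roi_timeseries):
    """Fisher z-transformed Pearson correlation matrix with a zeroed diagonal.

    roi_timeseries: (n_vols, n_rois) array of region-averaged signals.
    """
    r = np.corrcoef(roi_timeseries, rowvar=False)  # (n_rois, n_rois) Pearson r
    np.fill_diagonal(r, 0.0)   # zero first, so arctanh never sees r = 1
    return np.arctanh(r)       # Fisher r-to-z; arctanh(0) = 0 keeps the diagonal at zero

# Random stand-in for one participant: 330 volumes, 273 regions.
rng = np.random.default_rng(0)
fc = fc_matrix(rng.standard_normal((330, 273)))
```

Zeroing the diagonal before the transformation is a small design choice that avoids infinite values, since the Fisher transform of a perfect self-correlation is undefined.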

To further investigate the impact of network parcellation, we replicated our prediction analysis (Supplementary Material, Figs S1–S2 and Tables S1–S2) using the Schaefer parcellation (92), which comprises 300 cortical and subcortical regions.

4.6. Deep Neural Network Model

Deep learning is a branch of machine learning that uses neural networks with multiple “hidden” layers; the models used here are based on convolutional neural networks. Deep learning methodologies are capable of automatically identifying complex patterns and representations directly from raw data using these multi-layered networks, thus reducing the need for explicit feature engineering or manual intervention (136). The success of the training and learning phases depends on the model’s ability to process high-dimensional input data and extract meaningful features from them. This is done while managing the number of trainable parameters, which are crucial for automated feature learning during the construction of the model.

In this study, the inputs to our deep learning models were subject-specific FC maps with a matrix size of 273×273 (e.g., Fig. 6a). We generated different versions of each FC map by relocating portions of the network. For example, we relocated the DMN toward the last nodes, which were assigned as DAN, cerebellar, subcortical, and uncertain in Figure 6a. This approach allowed us to create a diverse set of FC matrices for each individual, each reflecting a different composition of edges and neighbors while maintaining the linear relationships present in the original data. Consequently, we augmented the dataset by producing a total of 3,600 FC maps from the initial set of FC maps.
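One way to read this augmentation is as a reordering of network blocks, in which the same permutation is applied to the rows and columns of the FC matrix so that every edge value is preserved while its neighbors change. The sketch below implements that reading on a toy matrix; the function name, the six-node example, and the interpretation itself are assumptions, as the source does not give implementation details.

```python
import numpy as np

def reorder_networks(fc, labels, network_order):
    """Permute rows and columns so that networks appear in network_order."""
    labels = np.asarray(labels)
    # Node indices grouped by network, in the requested order.
    perm = np.concatenate(
        [np.flatnonzero(labels == net) for net in network_order]
    )
    # The same permutation on rows and columns keeps every edge value intact.
    return fc[np.ix_(perm, perm)], perm

# Toy example: 6 nodes in three networks (labels are illustrative).
labels = ["DMN", "DMN", "FPN", "FPN", "DAN", "DAN"]
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
fc = (a + a.T) / 2           # symmetric stand-in for an FC matrix
np.fill_diagonal(fc, 0.0)

# Move the DMN block to the end of the matrix.
augmented, perm = reorder_networks(fc, labels, ["FPN", "DAN", "DMN"])
```

Under this reading, the convolutional filters see different local neighborhoods in each augmented copy, although the underlying set of edge weights is unchanged.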

(a) Example of a functional connectivity map across three different cognitive states. SEN hand: SENsory hand; SEN Mouth: SENsory Mouth; CON: Cingulo-Operculum control Network; Aud: Auditory; DMN: Default Mode Network; Memory Ret: Memory Retrieval network; Visual: Visual; FPN: Fronto-Parietal Network; Salience: Salience control network; Subcortical (upper row): subcortical network included in original Power parcellation; VAN: Ventral Attention Network; DAN: Dorsal Attention Network; Cerebellar: Cerebellar network; Subcortical (lower row): additional Subcortical regions, including hippocampus and caudate, added to the original Power Parcellation; Uncertain: Regions with less known network assignment. (b) DenselyAttention architecture, with the Enhanced Residual Block (ERB) and High-Frequency Attention Block (HFAB) incorporated into the Transition Block. Note that each “D.L.” layer in the table corresponds to the sequence BatchNormalization-ReLU-Conv3×3.

Following the DenseNet framework (90), we incorporated the Enhanced Residual Block (ERB) and High-Frequency Attention Block (HFAB) into the dense layers (Fig. 6b), termed DenselyAttention, to facilitate feature reuse in each layer. DenseNet (90) diverges from traditional strategies such as deepening layers or widening the network structure by focusing on feature reuse and bypass settings. This results in fewer parameters than comparably deep networks such as ResNet (137), enhances feature reuse, improves feature propagation, makes training easier, and reduces issues of gradient vanishing and model degradation.

The architecture of DenseNet is characterized by its dense connectivity pattern, which entails direct connections from each layer to all subsequent layers within its dense block (as illustrated in Fig. 6b). This design ensures that every layer has access to the feature maps generated by preceding layers, thereby facilitating a seamless and efficient gradient flow throughout the network. In essence, the knowledge acquired at each layer is propagated forward, enabling the model to effectively capture intricate patterns and dependencies within the data, ultimately enhancing its ability to learn and generalize (90). Additionally, dense blocks provide a regularizing effect, reducing overfitting, particularly on tasks with smaller training datasets (90). This is suitable for this study, which includes relatively small samples. Each sequence combines operations of batch normalization (BN), rectified linear unit (ReLU), and a 3×3 convolution (Conv). Batch normalization can effectively prevent overfitting, as described by equation 2:

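$$y = \alpha \, \frac{x - \mu_\beta}{\sqrt{\sigma_\beta^{2} + \epsilon}} + \beta \tag{2}$$

where $x$ is the input activation, $\mu_\beta$ and $\sigma_\beta$ are the mean and standard deviation of the mini-batch, $\epsilon$ is a small constant added for numerical stability, and $\alpha$ and $\beta$ are parameters learned during training. We utilized the Rectified Linear Unit (ReLU) activation function (138), which activates neurons by directly outputting the input if it is positive, or zero otherwise, as outlined in equation 3:

$$f(x) = \max(0, x) \tag{3}$$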

Where x is the input to a neuron. ReLU benefits the performance of networks with dense layers by decreasing the computation and selectively optimizing parameters. Four dense blocks facilitate a stepwise down-sampling in the network. These blocks are connected with transition layers, which consist of a 1×1 convolutional layer followed by a 2×2 average pooling layer.
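The two operations at the start of each dense-layer sequence, batch normalization followed by ReLU, can be sketched in plain NumPy. This shows training-mode normalization with mini-batch statistics only; the function names, default parameter values, and the toy batch are assumptions for illustration and do not reproduce the Keras layers used in the study.

```python
import numpy as np

def batch_norm(x, alpha=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)    # mini-batch mean per feature
    var = x.var(axis=0)    # mini-batch variance per feature
    return alpha * (x - mu) / np.sqrt(var + eps) + beta

def relu(x):
    """Return x where it is positive and zero otherwise."""
    return np.maximum(0.0, x)

# Toy mini-batch: three samples, two features.
batch = np.array([[1.0, -2.0], [3.0, 0.0], [5.0, 2.0]])
normalized = batch_norm(batch)
activated = relu(normalized)
```

With the default scale of one and shift of zero, each feature of `normalized` has zero mean across the batch, and `activated` keeps only its positive part.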

Additionally, ERB and HFAB pairs are introduced for targeted high-frequency features and residual block enhancement. The ERB-HFAB pairs are stacked sequentially at the beginning of each dense block loop (Fig. 6b), with ERB and HFAB having 16 feature maps.

High-Frequency Attention Block (HFAB)

Our approach to attentional mechanisms, specifically the HFAB, introduces a sequential attention branch inspired by edge detection. This branch rescales each position based on its neighboring pixels, efficiently focusing on high-frequency areas. The HFAB employs a 3×3 convolution to enhance both the receptive field and computational efficiency. Batch normalization is seamlessly integrated into the attention branch, introducing global statistics during inference without additional computational cost.

Enhanced Residual Block (ERB)

We present ERB as an alternative to the traditional residual block. As illustrated in Figure 6b, ERB comprises a re-parameterization block (RepBlock) and a ReLU. In the training stage, the RepBlock utilizes a 1×1 convolution to either expand or contract the number of feature maps, employing a 3×3 convolution to extract features in a higher-dimensional space. Furthermore, two skip connections are integrated to mitigate training complexities. During inference, all linear transformations can be unified, facilitating the conversion of each RepBlock into a singular 3×3 convolution. Essentially, ERB capitalizes on the advantages of residual learning.
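The idea that linear branches can be unified into a single 3×3 convolution at inference can be illustrated with the simplest case: a 3×3 convolution plus an identity skip connection. The sketch below is a single-channel simplification written for illustration; it does not reproduce the full RepBlock, which also merges the 1×1 convolutions, and the helper name and random inputs are assumptions.

```python
import numpy as np

def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    padded = np.pad(image, 1)  # one-pixel border of zeros
    out = np.zeros_like(image, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + image.shape[0],
                                         j:j + image.shape[1]]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

# Training-time form: a 3x3 convolution plus an identity skip connection.
y_train = conv3x3_same(x, k) + x

# Inference-time form: fold the skip connection into the kernel's centre tap.
k_merged = k.copy()
k_merged[1, 1] += 1.0
y_infer = conv3x3_same(x, k_merged)
```

Both forms give the same output up to floating-point error, which is why the merged network can be cheaper at inference without changing its predictions.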

We implemented model training using TensorFlow 2.11.0 (139) and Keras 2.11.0 (140) as programming interfaces and trained on a fifth-generation MacBook Pro (Apple M1 Max chip, 10-core CPU, 24-core GPU, 16-core Neural Engine, 64 GB memory). For regression tasks, selecting an appropriate loss function is crucial for guiding the optimization process and ensuring accurate predictions. In this study, we opted for the mean squared error (MSE) loss function due to its suitability for regression problems. To minimize the loss function, we trained the network using the stochastic gradient descent (SGD) optimizer with a learning rate of 8e-5 and a Nesterov momentum of 0.9 (141). The number of epochs was set to 100, and the batch size was set to 74. We also added dropout (142) with a rate of 0.15 after each convolutional layer except the first.
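For readers unfamiliar with Nesterov momentum, the following sketch applies the update rule to a toy linear regression with MSE loss. It is not the training code of this study: the data, the helper names, and the learning rate of 0.01 (chosen so the toy problem converges quickly, rather than the 8e-5 used above) are assumptions. The update follows the formulation documented for the Keras SGD optimizer with `nesterov=True`.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def nesterov_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD update with Nesterov momentum:
    velocity <- momentum * velocity - lr * grad
    param    <- param + momentum * velocity - lr * grad"""
    velocity = momentum * velocity - lr * grad
    param = param + momentum * velocity - lr * grad
    return param, velocity


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # model parameters
vel_w, vel_b = 0.0, 0.0  # momentum buffers
lr = 0.01                # toy value, not the study's 8e-5
n = len(xs)

initial_loss = mse(ys, [w * x + b for x in xs])
for epoch in range(500):
    preds = [w * x + b for x in xs]
    # Gradients of the MSE with respect to w and b.
    grad_w = 2.0 / n * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    grad_b = 2.0 / n * sum(p - y for p, y in zip(preds, ys))
    w, vel_w = nesterov_step(w, grad_w, vel_w, lr)
    b, vel_b = nesterov_step(b, grad_b, vel_b, lr)
final_loss = mse(ys, [w * x + b for x in xs])
```

In Keras, the optimizer settings described above would typically be expressed as `tf.keras.optimizers.SGD(learning_rate=8e-5, momentum=0.9, nesterov=True)` together with `loss="mse"`. Whether the authors' scripts use exactly this call is not stated in the text.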

We conducted training on distinct, identically sized datasets extracted from DyNAMiC, each comprising FC maps generated from 120 subjects. To prepare each training dataset, we randomly shuffled the data into training and validation patches, allocating 80% for training and 20% for validation. For testing, we utilized all FC maps of 60 subjects from the same sample (the age-heterogeneous DyNAMiC study), enabling us to assess model performance through three-fold cross-validation (Fig. 5). In each cross-validation fold, a new model was trained with a validation set it had not seen before. The model that performed best on the testing dataset was selected and used for all subsequent analyses. This model then underwent external validation in an independent sample from the age-homogeneous COBRA study.
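The partitioning described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: it assumes a pool of 180 subjects (120 + 60), non-overlapping test sets across folds, and splitting at the subject level. The function name `three_fold_splits` and the integer subject IDs are hypothetical.

```python
import random


def three_fold_splits(subject_ids, n_folds=3, val_fraction=0.2, seed=0):
    """Partition subjects into n_folds test sets; for each fold, split the
    remaining subjects into training and validation subsets."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    fold_size = len(ids) // n_folds
    folds = [ids[k * fold_size:(k + 1) * fold_size] for k in range(n_folds)]

    splits = []
    for k in range(n_folds):
        test = folds[k]
        # All subjects outside the held-out fold form the development set.
        development = [s for j, fold in enumerate(folds) if j != k
                       for s in fold]
        rng.shuffle(development)
        n_val = int(round(val_fraction * len(development)))
        splits.append({
            "train": development[n_val:],
            "val": development[:n_val],
            "test": test,
        })
    return splits


# Hypothetical subject IDs 0..179 (assumed pool of 180 subjects).
splits = three_fold_splits(range(180))
fold_sizes = [(len(s["train"]), len(s["val"]), len(s["test"]))
              for s in splits]
```

With 180 subjects, each fold contains 96 training, 24 validation, and 60 test subjects. Splitting by subject rather than by individual FC map keeps all maps from one person in the same subset.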

To explore and visually represent the crucial features of the deep learning models contributing to the prediction of cognitive scores, we used the Grad-CAM (Gradient-weighted Class Activation Mapping) technique (143). This method interprets the model’s decisions by highlighting the regions of the input image with the most significant impact on the model’s output. Grad-CAM conducts a backward pass to compute the gradients of the target class score with respect to the feature maps of the final convolutional layer. We present the average heatmaps calculated for all input data to the model (143). Grad-CAM saliency maps were interpreted qualitatively, with a heuristic threshold (≥ 0.5) applied to highlight regions with relatively higher contribution to the model’s predictions. These values do not reflect statistical significance and should therefore be interpreted descriptively. A further limitation of this study is the absence of ground truth for validating whether highlighted regions truly correspond to the features used by the model during prediction. As such, Grad-CAM provides an approximation of model attention rather than a definitive measure of feature importance. Nevertheless, Grad-CAM remains one of the most widely used and empirically validated interpretability techniques in deep learning, particularly in medical imaging applications. Its integration with established frameworks such as Keras and TensorFlow, together with its ability to generate spatial attributions that align with domain knowledge, makes it a suitable choice for the present study. Future work may incorporate complementary interpretability approaches, including adaptations of the Haufe transformation where applicable to deep learning architectures.
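The core Grad-CAM computation and the heuristic threshold can be sketched in a few lines. The sketch assumes the feature maps of the final convolutional layer and the gradients of the model output with respect to them have already been obtained (in practice from the deep learning framework). For a regression model, the predicted score takes the place of the target class score. The function names and toy values are assumptions.

```python
def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM heatmap.

    feature_maps, gradients: lists of K maps, each an H x W nested list.
    Channel weights are the spatial averages of the gradients; the heatmap
    is the ReLU of the weighted sum of feature maps, scaled to [0, 1]."""
    n_channels = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])

    # Global average pooling of the gradients gives one weight per channel.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]

    # Weighted combination of feature maps followed by ReLU.
    cam = [[max(0.0, sum(weights[c] * feature_maps[c][i][j]
                         for c in range(n_channels)))
            for j in range(w)]
           for i in range(h)]

    # Scale so that the maximum equals 1 (skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def threshold_mask(cam, threshold=0.5):
    """Mark positions whose normalized contribution is >= threshold."""
    return [[v >= threshold for v in row] for row in cam]


# Toy example with two 2x2 channels (hypothetical values).
feature_maps = [[[1.0, 2.0], [3.0, 4.0]],
                [[4.0, 3.0], [2.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-0.5, -0.5], [-0.5, -0.5]]]

cam = grad_cam(feature_maps, gradients)
mask = threshold_mask(cam, threshold=0.5)
```

In this toy case only the bottom row exceeds the 0.5 threshold. As noted above, such a mask is descriptive and carries no statistical meaning.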

4.7. Positron Emission Tomography (PET)

The scanning sessions started with a 5-minute low-dose helical CT scan (20 mA, 120 kV, 0.8 s per revolution), obtained for attenuation correction. During scanning, a thermoplastic mask was attached to the bed surface to minimize head movement.

In DyNAMiC, a 60-minute scan was acquired in list-mode format following an injected dose of approximately 350 MBq (337 ± 27 MBq). Offline re-binning of list-mode data was conducted to achieve a total of 49 frames with increasing length. In COBRA, a 55-min, 18-frame dynamic PET scan was acquired during rest following intravenous bolus injection of approximately 250 MBq 11C-raclopride (264 ± 19 MBq). For both studies, attenuation- and decay-corrected images (47 slices, field of view = 25 cm, 256×256-pixel transaxial images, voxel size = 0.977×0.977×3.27 mm³) were reconstructed with the iterative VUE Point HD-SharpIR algorithm (GE; 6 iterations, 24 subsets, 3.0 mm post filtering; full-width-at-half-maximum (FWHM): 3.2 mm). The estimation of receptor availability or binding potential relative to non-displaceable binding (BPND) was carried out following previously described procedures with the cerebellum as a reference region (76). PET images were motion-corrected and co-registered with the structural T1-weighted images from the corresponding session using Statistical Parametric Mapping software (SPM12, Wellcome Department of Imaging Science, Functional Imaging Laboratory, London, UK). Motion-corrected PET data were resliced to match the spatial dimensions of MR data (1 mm³ isotropic, 256×256×256). The mean of the first five frames was used as a source for co-registration. In DyNAMiC, frame-to-frame head motion correction, with translations ranging from 0.23 to 4.22 mm (mean ± sd = 0.95 ± 0.54 mm), revealed a trend-level difference across age groups (age <40 and age ≥40 years), as determined by Student’s t-test (t = 2.0, p = .047; mean ± sd for younger individuals = 1.07 ± 0.52, mean ± sd for older individuals = 0.90 ± 0.55). Partial-volume-effect (PVE) correction was achieved using the symmetric geometric transfer matrix (SGTM) method for regional correction, implemented in FreeSurfer (144). An estimated point-spread-function of 2.5 mm FWHM was utilized.
Regional estimates of BPND were calculated within the automated FreeSurfer segmentations employing the simplified reference tissue model (SRTM) (145). In the current study, we focused on striatal BPND, calculated as the average of BPND across the caudate and putamen.

4.8. The direct impact of dopamine on BCG through mediation analysis

To evaluate whether functional variability mediates the relationship between D1DR and prediction gap connectivity, we conducted additional mediation analyses. We first computed entropy, which estimates within-region signal variability (86), and then averaged this measure across all striatal regions (left caudate: MNI coordinates ≈ [–12, 12, 6]; right caudate: MNI coordinates ≈ [10, 14, 2]; left putamen: MNI coordinates ≈ [–24, 6, 4]; right putamen: MNI coordinates ≈ [29, 1, 4]). Following the mediation analysis framework proposed by Baron and Kenny (146), our goal was to determine whether the association between D1/D2 receptors and BCG is mediated by regional variability (entropy), or whether the indirect effect exceeds the direct association between D1/D2 receptors and BCG. To assess the statistical significance of this mediation effect, we employed the bootstrapping method outlined by Preacher and Hayes (147). Age was controlled for in all statistical analyses.
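The logic of a bootstrapped mediation test can be sketched as follows. This is a simplified illustration on synthetic data, not the analysis reported here: it omits the age covariate, uses a percentile confidence interval, and all variable names, effect sizes, and sample sizes are assumptions. Path a is the slope of the mediator on the predictor; path b and the direct effect c' come from regressing the outcome on both predictor and mediator. The indirect effect is a × b.

```python
import random


def _centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def mediation_paths(x, m, y):
    """Return (a, b, c_prime): a from M ~ X; b and c' from Y ~ X + M,
    obtained by solving the 2x2 normal equations on centered data."""
    xc, mc, yc = _centered(x), _centered(m), _centered(y)
    sxx, smm, sxm = _dot(xc, xc), _dot(mc, mc), _dot(xc, mc)
    sxy, smy = _dot(xc, yc), _dot(mc, yc)
    a = sxm / sxx
    det = sxx * smm - sxm ** 2
    c_prime = (smm * sxy - sxm * smy) / det
    b = (sxx * smy - sxm * sxy) / det
    return a, b, c_prime


def bootstrap_indirect(x, m, y, n_boot=1000, seed=0):
    """Point estimate and 95% percentile bootstrap CI of the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    a, b, c_prime = mediation_paths(x, m, y)
    estimates = []
    for draw in range(n_boot):
        idx = [rng.randrange(n) for k in range(n)]  # resample with replacement
        a_s, b_s, c_s = mediation_paths([x[i] for i in idx],
                                        [m[i] for i in idx],
                                        [y[i] for i in idx])
        estimates.append(a_s * b_s)
    estimates.sort()
    lower = estimates[int(round(0.025 * n_boot))]
    upper = estimates[int(round(0.975 * n_boot)) - 1]
    return a * b, (lower, upper)


# Synthetic data (hypothetical): receptor -> entropy -> gap.
data_rng = random.Random(42)
receptor = [data_rng.gauss(0, 1) for k in range(200)]
entropy = [0.8 * r + data_rng.gauss(0, 0.5) for r in receptor]
gap = [0.5 * e + 0.1 * r + data_rng.gauss(0, 0.5)
       for e, r in zip(entropy, receptor)]

indirect, (ci_low, ci_high) = bootstrap_indirect(receptor, entropy, gap)
```

An indirect effect is conventionally considered significant when its bootstrap confidence interval excludes zero. Controlling for age would require adding it as a covariate to both regressions.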

4.9. Statistical significance analysis

Statistical analyses were carried out using SPSS (IBM Corp., V24.0.0, Armonk, NY, USA), MATLAB (The MathWorks Inc., V9.13.0 (R2022b), Natick, MA, USA), and GraphPad Prism (GraphPad Software, Inc., V5.01, CA, USA). We performed partial correlations between predicted and actual scores, as well as linear regression analyses. To investigate the relationship between generated gap variables and DA receptor availability, we controlled for age (in DyNAMiC) using partial correlation. The Mann-Whitney U test was used to test for differences in prediction accuracy. The level of statistical significance was set at p ≤ 0.05. For the bootstrap-based comparison of model performance (bootstrap resampling with 1000 iterations), no test statistic with associated degrees of freedom is reported. Instead, statistical inference is based on the bootstrap distribution of the difference in correlation coefficients (Δr) and its 95% confidence interval. Because bootstrap confidence-interval-based inference does not rely on an analytic sampling distribution, degrees of freedom are not defined for this procedure.
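A bootstrap comparison of two models via Δr can be sketched as follows. The sketch assumes paired resampling (the same resampled subjects are used for both models) and a percentile confidence interval; neither detail is specified above. The data are synthetic and all names are hypothetical.

```python
import math
import random


def pearson_r(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def bootstrap_delta_r(observed, pred_a, pred_b, n_boot=1000, seed=0):
    """Observed delta r = r(observed, pred_a) - r(observed, pred_b) and
    its 95% percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(observed)
    delta = pearson_r(observed, pred_a) - pearson_r(observed, pred_b)
    deltas = []
    for draw in range(n_boot):
        idx = [rng.randrange(n) for k in range(n)]  # resample subjects
        obs_s = [observed[i] for i in idx]
        r_a = pearson_r(obs_s, [pred_a[i] for i in idx])
        r_b = pearson_r(obs_s, [pred_b[i] for i in idx])
        deltas.append(r_a - r_b)
    deltas.sort()
    lower = deltas[int(round(0.025 * n_boot))]
    upper = deltas[int(round(0.975 * n_boot)) - 1]
    return delta, (lower, upper)


# Synthetic scores (hypothetical): model A predicts more accurately than B.
data_rng = random.Random(7)
scores = [data_rng.gauss(0, 1) for k in range(150)]
model_a = [s + data_rng.gauss(0, 0.5) for s in scores]
model_b = [s + data_rng.gauss(0, 2.0) for s in scores]

delta_r, (ci_low, ci_high) = bootstrap_delta_r(scores, model_a, model_b)
```

Under this scheme, the two models are taken to differ when the 95% confidence interval of Δr excludes zero, which is why no degrees of freedom enter the inference.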

Out-of-sample predictive performance was quantified using the coefficient of determination (r²) computed via a sum-of-squares formulation (148). Unlike squared correlation coefficients, which capture only linear association, this metric evaluates how well model predictions approximate observed values relative to a baseline model. Specifically, out-of-sample r² was defined as

$$r^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}_{\mathrm{train}}\right)^2}$$

where y_i denotes the observed outcome in the test set, ŷ_i the corresponding model prediction, and ȳ_train the mean of the outcome variable in the training set. Using the training-set mean as the baseline ensures a strictly out-of-sample evaluation and avoids information leakage. Under this formulation, positive r² values indicate performance exceeding the null model (predicting the training mean), whereas negative values indicate worse-than-baseline performance. Because this formulation directly compares prediction error to baseline variance, it provides a more appropriate measure of predictive accuracy than correlation-based metrics, particularly in the presence of scale or offset differences between predicted and observed values (148).
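The sum-of-squares definition described above translates directly into a few lines of code. The following sketch uses toy numbers; the function name `out_of_sample_r2` and all values are assumptions for illustration.

```python
def out_of_sample_r2(y_test, y_pred, y_train_mean):
    """Out-of-sample coefficient of determination.

    Compares the squared prediction error on the test set with the squared
    error of a baseline that always predicts the training-set mean."""
    ss_res = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss_base = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - ss_res / ss_base


# Toy example (hypothetical values).
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train_mean = sum(y_train) / len(y_train)  # 3.0
y_test = [2.0, 4.0, 6.0]

# Reasonable predictions: r2 is positive.
r2_good = out_of_sample_r2(y_test, [2.5, 3.5, 5.0], y_train_mean)

# Predicting the training mean for every subject: r2 is exactly zero.
r2_baseline = out_of_sample_r2(y_test, [y_train_mean] * 3, y_train_mean)

# Predictions perfectly correlated with the outcome (Pearson r = 1) but
# shifted by a constant offset of 10: r2 is strongly negative.
r2_offset = out_of_sample_r2(y_test, [12.0, 14.0, 16.0], y_train_mean)
```

The third case shows why this metric is stricter than a correlation: a constant offset leaves Pearson's r at 1 while the out-of-sample r² falls far below zero.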

Data availability

The scripts used for developing the model are available at https://github.com/MorEsm/AI-based-Prediction-of-Cognitive-Function.

Acknowledgements

This work was funded by the Swedish Research Council (grant number 2021-02558), the Knut and Alice Wallenberg Foundation (Wallenberg Fellow grant to A.S.), the Bank of Sweden (RJ, grant number P20-0515 to A.S.), and a StratNeuro grant at Karolinska Institute (A.S.). Morteza Esmaeili and Erin Bjørkeli were supported by the South-Eastern Norway Regional Health Authority (Helse Sør-Øst RHF, HSØ, grant numbers 2018047 and 2021023, respectively).

Additional information

Author Contributions

M.E. and A.S. (together with the COBRA PIs N.K. and L.N.) conceptualized the study, formulated the research questions, and developed the methodology. M.E., R.P., J.J., E.B.B., and K.N. performed data processing and formal analyses. M.E., A.S., R.B., J.J., K.N., N.K., L.B., and L.N. contributed to the interpretation of the findings. M.E. and A.S. supervised the study and drafted the original manuscript. All authors participated in manuscript writing and editing and approved the final version.

Funding

HOD | Helse Sør-Øst RHF (sorost) (2018047)

  • Morteza Esmaeili

HOD | Helse Sør-Øst RHF (sorost) (2021023)

  • Erin Beate Bjørkeli

University of Gothenburg | Wallenberg Centre for Molecular and Translational Medicine (WCMTM) (P20-0515)

  • Alireza Salami

Karolinska Institute (StratNeuro grant)

  • Alireza Salami

Swedish Research Council (2021-02558)

  • Alireza Salami

Additional files

Supplementary materials