Abstract
Electrophysiological methods, that is M/EEG, provide unique views into brain health. Yet, when building predictive models from brain data, it is often unclear how electrophysiology should be combined with other neuroimaging methods. Information can be redundant, useful common representations of multimodal data may not be obvious and multimodal data collection can be medically contraindicated, which reduces applicability. Here, we propose a multimodal model to robustly combine MEG, MRI and fMRI for prediction. We focus on age prediction as a surrogate biomarker in 674 subjects from the Cam-CAN dataset. Strikingly, MEG, fMRI and MRI showed additive effects supporting distinct brain-behavior associations. Moreover, the contribution of MEG was best explained by cortical power spectra between 8 and 30 Hz. Finally, we demonstrate that the model preserves benefits of stacking when some data is missing. The proposed framework, hence, enables multimodal learning for a wide range of biomarkers from diverse types of brain signals.
eLife digest
How old are you? What about your body, and your brain? People are used to answering this question by counting the years since birth. However, biological age could also be measured by looking at the integrity of the DNA in cells or by measuring the levels of proteins in the blood. Whether one goes by chronological age or biological age, each is simply an indicator of general health – but people with the same chronological age may have different biological ages, and vice versa.
There are different imaging techniques that can be used to study the brain. A method called MRI reveals the brain’s structure and the different types of tissue present, like white and grey matter. Functional MRIs (fMRIs for short) measure activity across different brain regions, while electrophysiology records electrical signals sent between neurons. Distinct features measured by all three techniques – MRI, fMRI and electrophysiology – have been associated with aging. For example, differences between younger and older people have been observed in the proportion of grey to white matter, the communication between certain brain regions, and the intensity of neural activity.
MRIs, with their anatomical detail, remain the go-to method for predicting the biological age of the brain. Patterns of neuronal activity captured by electrophysiology also provide information about how well the brain is working. However, it remains unclear how electrophysiology could be combined with other brain imaging methods, like MRI and fMRI. Can data from these three techniques be combined to better predict brain age?
Engemann et al. designed a computer algorithm stacking electrophysiology data on top of MRI and fMRI imaging to assess the benefit of this three-pronged approach compared to using MRI alone. Brain scans from healthy people between 17 and 90 years old were used to build the computer model. The experiments showed that combining all three methods predicted brain age better. The predictions also correlated with the cognitive fitness of individuals. People whose brains were predicted to be older than their years tended to complain about the quality of their sleep and scored worse on memory and speed-thinking tasks.
Crucially, Engemann et al. tested how the algorithm would hold up if some data were missing. This can happen in clinical practice where some tests are required but not others. Positively, prediction was maintained even with incomplete data, meaning this could be a useful clinical tool for characterizing the brain.
Introduction
Non-invasive electrophysiology assumes a unique role in clinical neuroscience. Magneto- and electroencephalography (M/EEG) have an unparalleled capacity for capturing brain rhythms without penetrating the skull. EEG can be operated in a wide array of peculiar situations, such as surgery (Baker et al., 1975), flying an aircraft (Skov and Simons, 1965) or sleeping (Agnew et al., 1966). Unlike EEG, MEG captures a more selective set of brain sources with greater spectral and spatial definition (Ahlfors et al., 2010; Hari et al., 2000). Yet, neither of them is optimal for isolating anatomical detail. Clinical practice in neurology and psychiatry, therefore, relies on additional neuroimaging modalities with enhanced spatial resolution such as magnetic resonance imaging (MRI), functional MRI (fMRI), or positron emission tomography (PET). Recently, machine learning has received significant interest in clinical neuroscience for its potential to predict from such heterogeneous multimodal brain data (Woo et al., 2017). Unfortunately, the effectiveness of machine learning in psychiatry and neurology is constrained by the lack of large high-quality datasets (Woo et al., 2017; Varoquaux, 2017; Bzdok and Yeo, 2017; Engemann et al., 2018) and a comparably limited understanding of the data-generating mechanisms (Jonas and Kording, 2017). This potentially limits the advantage of complex learning strategies proven successful in purely somatic problems (Esteva et al., 2017; Yoo et al., 2019; Ran et al., 2019).
In clinical neuroscience, prediction can therefore be pragmatically approached with classical machine learning algorithms (Dadi et al., 2019), expert-based feature engineering and an increasing emphasis on surrogate tasks. Such tasks attempt to learn, on abundant high-quality data, an outcome that is not primarily interesting, to then exploit its correlation with the actual outcome of interest in small datasets. This problem is also known as transfer learning (Pan and Yang, 2009) which, in its simplest form, is implemented by reusing predictions from a surrogate-marker model as predictors in the small dataset. Over the past years, predicting the age of a person from their brain data has crystallized as a surrogate-learning paradigm in neurology and psychiatry (Dosenbach et al., 2010). First results suggest that the prediction error of models trained to learn age from brain data of healthy populations provides clinically relevant information (Cole et al., 2018; Ronan et al., 2016; Cole et al., 2015) related to neurodegenerative anomalies and physical and cognitive decline (Kaufmann et al., 2019). For simplicity, this characteristic prediction error is often referred to as the brain age delta Δ (Smith et al., 2019). Can learning of such a surrogate biomarker be enhanced by combining expert features from M/EEG, fMRI and MRI?
Research on aging has suggested important neurological group-level differences between young and elderly people: Studies have found alterations in grey matter density and volume, cortical thickness and fMRI-based functional connectivity, potentially indexing brain atrophy (Kalpouzos et al., 2012) and decline-related compensatory strategies. A drop in peak frequency and power in the alpha band (8–12 Hz), assessed by EEG, has been linked to aging-related slowing of cognitive processes, such as the putative speed of attention (Richard Clark et al., 2004; Babiloni et al., 2006). Increased anteriorization of beta band power (15–30 Hz) has been associated with effortful compensatory mechanisms (Gola et al., 2013) in response to intensified levels of neural noise, that is, decreased temporal autocorrelation of the EEG signal as revealed by flatter 1/f slopes (Voytek et al., 2015). Importantly, age-related variability in fMRI and EEG seems to be independent to a substantial degree (Kumral et al., 2020).
The challenge of predicting at the single-subject level from such heterogeneous neuroimaging modalities, governed by distinct data-generating mechanisms, has recently been addressed with model-stacking techniques (Rahim et al., 2015; Karrer et al., 2019; Liem et al., 2017). Rahim et al., 2015 enhanced classification in Alzheimer’s disease by combining fMRI and PET using a stacking approach (Wolpert, 1992), such that the stacked models used input data from different modalities. Liem et al., 2017 then applied this approach to age prediction and found that combining anatomical MRI with fMRI significantly helped reduce errors while facilitating detection of cognitive impairment. This suggests that stacked prediction might also enable combining MRI with electrophysiology. Yet, this idea faces one important obstacle related to the clinical reality of data collection. It is often not practical to do multimodal assessments for all patients. Scanners may be overbooked, patients may not be in the condition to undergo MRI, and acute demand in intensive care units may dominate priorities. Incomplete and missing data are, therefore, inevitable and have to be handled to unleash the full potential of multimodal predictive models. To tackle this challenge, we set out to build a stacking model for predicting age from electrophysiology and MRI such that any subject was included if some data was available for at least one modality. We, therefore, call it the opportunistic stacking model.
At this point, there are very few multimodal databases providing access to electrophysiology alongside MRI and fMRI. The Leipzig Mind-Brain-Body (LEMON) dataset (Babayan et al., 2019) includes high-quality research EEG with MRI and fMRI for 154 young subjects and 75 elderly subjects. The dataset used in the present study is curated by the Cam-CAN (Shafto et al., 2014; Taylor et al., 2017) and was specifically designed for studying the neural correlates of aging continuously across the lifespan. The Cam-CAN dataset is currently the largest public resource on multimodal imaging with high-resolution electrophysiology in the form of MEG alongside MRI data and rich neuropsychological data for more than 650 healthy subjects between 17 and 90 years. The choice of MEG over EEG may lead to a certain degree of friction with the aging-related literature in electrophysiology, the bulk of which is based on EEG studies. Fortunately, MEG and EEG share the same classes of neural generators, rendering the aging-related EEG literature highly relevant for MEG-based modeling. On the other hand, the distinct biophysics of MEG and EEG make the two modalities complementary. While EEG captures sources of any orientation, MEG preferentially captures tangential but not radial sources. Compared to EEG, MEG benefits from the magnetic transparency of the skull, which facilitates source localization by reducing the risk of errors due to an incorrect head conductivity model, but also by limiting the large-scale mixing of neural sources. This significantly increases the signal-to-noise ratio for MEG in higher frequencies, rendering it a formidable technique for studying cortical oscillatory activity (Lehtelä et al., 1997; Gobbelé et al., 1998). MEG is, therefore, an interesting modality in its own right for developing neurocognitive biomarkers, while its close link with EEG may open the door to translatable electrophysiology markers suitable for massive deployment with clinical EEG.
Our study focuses on the following questions: 1) Can MRI-based prediction of age be enhanced with MEG-based electrophysiology? 2) Do fMRI and MEG carry non-redundant, clinically relevant information? 3) What are the most informative electrophysiological markers of aging? 4) Can potential advantages of multimodal learning be maintained in the presence of missing values?
Results
Opportunistic prediction-stacking approach
We begin by summarizing the proposed method. To build a model for predicting age from electrophysiology, fMRI and anatomical MRI, we employed prediction stacking (Wolpert, 1992). As in Liem et al., 2017, the stacked models, here, referred to different input data instead of alternative models on the same data. We used ridge regression (Hoerl and Kennard, 1970) to linearly predict age from high-dimensional inputs of each modality. Linear predictions were based on distinct features from anatomical MRI, fMRI and MEG that have been commonly associated with aging. For extracting features from MEG, in a first step, we drew inspiration from the EEG literature on aging and considered evoked response latencies, alpha-band peak frequency, and 1/f slope topographies assessed in sensor space. Previous work on neural development and aging (Khan et al., 2018; Gola et al., 2013) and Alzheimer’s disease (Gaubert et al., 2019) has pointed at the importance of spatial alterations in stationary power spectra, which can be exploited using high-dimensional regression techniques (Fruehwirt et al., 2017). In this work, we adapted this reasoning to the more general problem of predicting age while exploiting the advanced source-modeling options supported by the Cam-CAN dataset based on MEG and the individual MRIs. Our principal effort was, therefore, to expose the geometry of stationary power spectra with minimal distortion by using source localization based on the individual head geometry (Sabbagh et al., 2019) and to then perform high-dimensional regression. As a result, we predicted from the spatial distribution of power and bivariate interactions between signals (connectivity) in nine frequency bands (Table 1).
For MRI and fMRI, we followed the method established in Liem et al., 2017 and included cortical thickness, cortical surface area and subcortical volume as well as functional connectivity based on the fMRI time series. For a detailed description of the features, see Table 2 and section Feature extraction in Materials and methods. To correct for the necessarily biased linear models, we then used a non-linear random forest regressor with age predictions from the linear models as lower-dimensional input features.
Thereby, we made sure to use consistent cross-validation splits for all layers and automatically selected central tuning parameters of the linear models and the random forest with nested cross-validation. Our stacked models handle missing values by treating missingness as data, provided there is an opportunity to see at least one modality (Josse et al., 2019). We, therefore, call it the opportunistic stacking model. Concretely, the procedure duplicated all variables and, where data was initially missing, inserted once a small and once a very large value, for which we chose the biologically implausible age values of −1000 and 1000, respectively. For an illustration of the proposed model architecture, see Figure 1 and section Stacked-Prediction Model for Opportunistic Learning in Materials and methods for a detailed description of the model.
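As an illustration, the two-layer stacking and the missing-value coding described above can be sketched with scikit-learn. The simulated layer-I age predictions, the 20% missingness rate and the forest settings are illustrative assumptions, not the exact pipeline used in this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
n = 200
age = rng.uniform(17, 90, n)

# Layer-I age predictions from three hypothetical modality-specific
# ridge models (here simulated as noisy versions of the true age).
preds = np.c_[age + rng.randn(n) * 8,    # e.g. MRI-based prediction
              age + rng.randn(n) * 10,   # e.g. fMRI-based prediction
              age + rng.randn(n) * 12]   # e.g. MEG-based prediction

# Simulate missing modalities for some subjects.
preds[rng.rand(n, 3) < 0.2] = np.nan

def opportunistic_code(X, small=-1000.0, large=1000.0):
    """Duplicate each column; fill missing entries once with a very
    small and once with a very large, biologically implausible value."""
    X_small, X_large = X.copy(), X.copy()
    X_small[np.isnan(X)] = small
    X_large[np.isnan(X)] = large
    return np.c_[X_small, X_large]

X = opportunistic_code(preds)  # shape (n, 6), no NaNs left

# Layer-II non-linear stacking model on the coded predictions.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, age)
mae = np.abs(rf.predict(X) - age).mean()
```

Because trees split on thresholds, the ±1000 coding lets the forest route subjects with a missing modality down dedicated branches rather than discarding them.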
fMRI and MEG non-redundantly enhance anatomy-based prediction
Currently, anatomical MRI is the canonical modality for brain age prediction. However, MRI does not access brain dynamics, whereas MEG and fMRI both capture neuronal activity and, hence, convey additional information at smaller timescales. How would they add to the prediction of brain age when combined with anatomical MRI? Figure 2A depicts a model comparison in which anatomical MRI served as baseline and which tracked changes in performance as fMRI and MEG were both added through stacking (black boxplot). Anatomical MRI scored an expected generalization error of about 6 years ($SD=0.6$, ${P}_{2.5,97.5}=[4.9,7.16]$), whereas expected chance-level prediction was about 15.5 years ($SD=1.17$, ${P}_{2.5,97.5}=[13.26,17.8]$) based on a dummy model proposing as prediction the average age of the training data. MRI performed better than chance-level prediction in every single cross-validation fold. The average improvement over chance-level prediction across folds was at least 9 years ($SD=1.33$, ${P}_{2.5,97.5}=[-12.073,-7.347]$). Relative to MRI, the age-prediction error was reduced by almost 1 year on average when adding either MEG ($Pr_{<MRI}=91\%$, $M=-0.79$, $SD=0.57$, ${P}_{2.5,97.5}=[-1.794,-0.306]$) or fMRI ($Pr_{<MRI}=94\%$, $M=-0.96$, $SD=0.59$, ${P}_{2.5,97.5}=[-1.99,-0.15]$). Finally, the performance gain was greater than 1 year on average ($Pr_{<MRI}=99\%$, $M=-1.32$, $SD=0.672$, ${P}_{2.5,97.5}=[-2.43,-0.16]$) when adding both MEG and fMRI to the model, yielding an expected generalization error of about 4.7 years ($SD=0.55$, ${P}_{2.5,97.5}=[3.77,5.74]$). Note that dependable numerical p-values are hard to obtain for paired model comparisons based on cross-validation on the same dataset: many datasets equivalent to the Cam-CAN would be required.
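The chance-level baseline referenced above, a dummy model that always predicts the mean age of the training fold, can be sketched as follows. The simulated ages and uninformative features are illustrative, so the resulting MAE reflects the simulated age distribution rather than the 15.5 years reported for the Cam-CAN data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 600
age = rng.uniform(17, 90, n)   # roughly the Cam-CAN age range
X = rng.randn(n, 5)            # uninformative placeholder features

# The dummy model predicts the mean age of each training fold,
# giving the chance-level mean absolute error (MAE).
dummy = DummyRegressor(strategy='mean')
scores = -cross_val_score(dummy, X, age, cv=10,
                          scoring='neg_mean_absolute_error')
chance_mae = scores.mean()
```

For a uniform age distribution of width w, the chance-level MAE of a mean predictor is about w/4, here roughly 18 years; any informative model must beat this floor.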
Nevertheless, the uncertainty intervals extracted from the cross-validation distribution suggest that the observed differences in performance were systematic and can be expected to generalize as more data is analyzed. Moreover, the out-of-sample ranking between the different models was stable over cross-validation folds (Figure 2—figure supplement 1), with the full model achieving the first rank 71/100 times and performing better than the MRI + fMRI or the MRI + MEG model at least 80/100 times. This emphasizes that the relative importance of MEG and fMRI for enhancing MRI-based prediction of age can be expected to generalize to future data.
The improved prediction obtained by combining MEG and fMRI suggests that both modalities carry independent information. If MEG and fMRI carried purely redundant information, the random forest algorithm would not have reached better out-of-sample performance. Indeed, comparing the cross-validated prediction errors of MEG-based and fMRI-based models (Figure 2B), errors were only weakly correlated (${r}_{Spearman}=0.139$, ${r}^{2}=0.019$, $p=1.31\times {10}^{-3}$). fMRI sometimes made extreme errors for cases better predicted by MEG in younger people, whereas MEG made errors in distinct cases from young and old age groups. When adding anatomical MRI to each model, the errors became somewhat more dependent, leading to moderate correlation (${r}_{Spearman}=0.45$, ${r}^{2}=0.20$, $p=2.2\times {10}^{-16}$). This additive component also became apparent when considering predictive simulations on how the model actually combined MEG, fMRI and MRI (Figure 2—figure supplement 2) using two-dimensional partial dependence analysis (Karrer et al., 2019; Hastie et al., 2005, chapter 10.13.2). Moreover, exploration of the age-dependent improvements through stacking suggests that stacking predominantly reduced prediction errors uniformly (Figure 2—figure supplement 3) instead of systematically mitigating brain age bias (Le et al., 2018; Smith et al., 2019).
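The error-correlation analysis above can be sketched as follows: given out-of-sample predictions from two models, compute the Spearman correlation of their signed errors. The simulated predictions with independent errors are an illustrative assumption standing in for the real MEG- and fMRI-based predictions:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.RandomState(1)
n = 500
age = rng.uniform(17, 90, n)

# Hypothetical out-of-sample predictions from two modality-specific
# models whose errors are generated independently of each other.
pred_meg = age + rng.randn(n) * 8
pred_fmri = age + rng.randn(n) * 8

# Rank correlation between the signed prediction errors: values near
# zero indicate the two models fail on largely distinct cases.
rho, pval = spearmanr(pred_meg - age, pred_fmri - age)
```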
These findings demonstrate that stacking makes it possible to enhance brain-age prediction by extracting information from MEG, fMRI and MRI while mitigating modality-specific errors. This raises the question of whether this additive information from multiple neuroimaging modalities also implies non-redundant associations with behavior and cognition.
Brain age Δ learnt from MEG and fMRI indexes distinct cognitive functions
The brain age Δ has been interpreted as an indicator of health, where a positive Δ has been linked to reduced fitness or health outcomes (Cole et al., 2015; Cole et al., 2018). Does improved performance through stacking strengthen effect sizes? Can MEG and fMRI help detect complementary associations? Figure 3 summarizes linear correlations between the brain age Δ and the 38 neuropsychological scores after projecting out the effect of age, Equations 6–8 (see Analysis of brain-behavior correlation in Materials and methods for a detailed overview). As effect sizes can be expected to be small in the curated and healthy population of the Cam-CAN dataset, we considered classical hypothesis testing for characterizing associations. Traditional significance testing (Figure 3A) suggests that the best stacking models supported discoveries for between 20% (7) and 25% (9) of the scores. Dominating associations concerned fluid intelligence, depression, sleep quality (PSQI), systolic and diastolic blood pressure (cardiac features 1, 2), cognitive impairment (MMSE) and different types of memory performance (VSTM, Picture Priming, Famous Faces, Emotional Memory). The model coefficients in Figure 3B depict the strength and direction of association. One can see that stacking models not only tended to suggest more discoveries as their performance improved but also led to stronger effect sizes. However, the trend is not strict, as fMRI seemed to support unique discoveries that disappeared when including the other modalities. Similarly, some effect sizes were even slightly stronger in submodels, for example for fluid intelligence in MRI and MEG. A priori, the full model enjoys priority over the submodels, as its expected generalization error estimated with cross-validation was lower. This could imply that some of the discoveries suggested by fMRI may suffer from overfitting, but are finally corrected by the full model. Nevertheless, many of the remaining associations were found by multiple methods (e.g.
fluid intelligence, sleep quality assessed by PSQI), whereas others were uniquely contributed by fMRI (e.g. depression). It is also noteworthy that the directions of the effects were consistent with the predominant interpretation of the brain age Δ as an indicator of mental or physical fitness (note that high PSQI scores indicate sleeping difficulties whereas lower MMSE scores indicate cognitive decline) and directly confirm previous findings (Liem et al., 2017; Smith et al., 2019).
Note that the results were highly similar when performing deconfounding jointly via multiple regression (Equation 9, Figure 3—figure supplement 1) instead of predicting age-residualized neuropsychological scores, and when including additional predictors of non-interest, that is, gender, handedness and head motion (Equation 10, Figure 3—figure supplement 2). More elaborate confound modeling even seemed to improve the SNR, as suggested by an increasing number of discoveries and growing effect sizes.
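A minimal sketch of the residualization-based deconfounding, assuming simulated scores and deltas; the `residualize` helper is hypothetical, not part of the study's code:

```python
import numpy as np

rng = np.random.RandomState(2)
n = 500
age = rng.uniform(17, 90, n)
delta = rng.randn(n) * 5                              # brain age delta
score = 0.5 * age + 2.0 * delta + rng.randn(n) * 10   # hypothetical score

def residualize(y, confound):
    """Project out a confound (plus intercept) from y via least squares."""
    Z = np.c_[np.ones_like(confound), confound]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

# Remove the age effect from the score, then correlate the
# deconfounded score with the brain age delta.
score_res = residualize(score, age)
r = np.corrcoef(score_res, delta)[0, 1]
```

Residualizing first ensures that any remaining score-delta correlation cannot be trivially driven by both variables tracking chronological age.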
These findings suggest that the brain age Δ learnt from fMRI or MEG carries non-redundant information on clinically relevant markers of cognitive health and that combining both fMRI and MEG with anatomy can help detect health-related issues in the first place. This raises the question of which aspect of the MEG signal contributes most.
MEG-based age prediction is explained by source power
Whether MEG- or EEG-based assessment is practical in the clinical context depends on the predictive value of single features, the cost of obtaining predictive features and the potential benefit of improving prediction by combining multiple features. Here, we considered purely MEG-based age prediction to address the following questions: Can the stacking method be helpful to analyze the importance of MEG-specific features? Are certain frequency bands of dominating importance? Is information encoded in the regional power distribution or more related to neuronal interactions between brain regions? Figure 4A compares alternative MEG-based models stacking different combinations of MEG features. We compared models against chance-level prediction as estimated with a mean regressor outputting the average age of the training data as prediction. Again, chance level was distributed around 15.5 years ($SD=1.17$, ${P}_{2.5,97.5}=[13.26,17.80]$). All models performed markedly better. The model based on diverse sensor space features from task and resting state recordings showed the lowest performance, with an MAE around 12 years ($SD=1.04$, ${P}_{2.5,97.5}=[9.80,13.52]$), yet it was systematically better than chance ($Pr_{<Chance}=98.00\%$, $M=-4$, $SD=1.64$, ${P}_{2.5,97.5}=[-7.11,-0.44]$). All models featuring source-level power spectra or connectivity (‘Source Activity’, ‘Source Connectivity’) performed visibly better, with expected errors between 8 and 6.5 years and no overlap with the distribution of chance-level scores. Models based on source-level power spectra (‘Source Activity’, $M=7.40$, $SD=0.82$, ${P}_{2.5,97.5}=[6.01,9.18]$) and connectivity (‘Source Connectivity’, $M=7.58$, $SD=0.90$, ${P}_{2.5,97.5}=[6.05,9.31]$) performed similarly, with a slight advantage for the ‘Source Activity’ model. The best results were obtained when combining power and connectivity features (‘Full’, $M=6.75$, $SD=0.83$, ${P}_{2.5,97.5}=[5.36,8.20]$).
Adding sensor space features did not lead to any visible improvement of ‘Full’ over ‘Combine Source’, with virtually indistinguishable error distributions. The observed average model ranking was highly consistent over cross-validation testing splits (Figure 4—figure supplement 1), suggesting that the relative importance of the different blocks of MEG features was systematic and, hence, can be expected to generalize to future data. The observed ranking between MEG models suggests that regional changes in source-level power spectra contained most information, while source-level connectivity added another portion of independent information which helped improve prediction by at least 0.5 years on average. A similar picture emerged when inspecting the contribution of the Layer-I linear models to the performance of the full model in terms of variable importance (Figure 4B). Sensor space features were least influential, whereas the top contributing features were all related to power and connectivity, which, upon permutation, increased the error by up to 1 year. The most informative inputs to the stacking model were ridge regression models based on either signal power or the Hilbert analytic signal power concatenated across frequency bands ($P_{cat}$, $E_{cat}$). Other noteworthy contributions were related to power envelope covariance (without source leakage correction) as well as source power in the beta (15–30 Hz) and alpha (8–15 Hz) frequency ranges. The results suggest that regional changes in power across different frequency bands are best summarized with a single linear model, but additional non-linear additive effects may exist in specific frequency bands. The observed importance rankings were highly consistent with importance rankings obtained from alternative methods for extracting variable importance (Figure 4—figure supplement 2), emphasizing the robustness of these rankings.
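Permutation-based variable importance of the kind reported in Figure 4B can be sketched as follows; the two simulated layer-I inputs, one informative and one not, are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(3)
n = 400
age = rng.uniform(17, 90, n)

# Two hypothetical layer-I inputs: an informative age prediction
# (e.g. from source power) and pure noise (an uninformative input).
X = np.c_[age + rng.randn(n) * 6, rng.randn(n)]

X_tr, X_te, y_tr, y_te = train_test_split(X, age, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permuting a column and measuring the resulting drop in test
# performance quantifies how much the model relies on that input.
imp = permutation_importance(rf, X_te, y_te, n_repeats=10,
                             scoring='neg_mean_absolute_error',
                             random_state=0)
```

With MAE-based scoring, `imp.importances_mean` is directly interpretable as the average increase in prediction error, in years, caused by destroying each input.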
Moreover, partial dependence analysis (Karrer et al., 2019; Hastie et al., 2005, chapter 10.13.2) suggested that the Layer-II random forest extracted non-linear functions (Figure 4—figure supplement 3). Finally, the best stacked models scored lower errors than the best linear models (Figure 4—figure supplement 4), suggesting that stacking achieved more than mere variable selection by extracting non-redundant information from the inputs.
These findings show that MEG-based prediction of age is predominantly enabled by power spectra, which can be accessed relatively easily in terms of computation and data processing. Moreover, the stacking approach applied to MEG data helped improve beyond linear models by upgrading to non-linear regression.
Advantages of multimodal stacking can be maintained in populations with incomplete data
One important obstacle to combining signals from multiple modalities in clinical settings is that not all modalities are available for all cases. So far, we have restricted the analysis to 536 cases for which all modalities were present. Can the advantage of multimodal stacking be preserved in the absence of complete data, or will missing values mitigate prediction performance? To investigate this question, we trained our stacked model on all 674 cases for which we had the opportunity to extract at least one feature on any modality, hence, opportunistic stacking (see Figure 1 and Table 3 in section Sample in Materials and methods). We first compared the opportunistic model with the restricted model on the cases with complete data (Figure 5A). Across stacking models, performance was virtually identical, even when extending the comparison to the cases available to the submodel with fewer modalities, for example MRI and fMRI. We then scored the fully opportunistic model trained on all cases and all modalities and compared it to different non-opportunistic submodels on restricted cases (Figure 5A, squares). The fully opportunistic model always outperformed the submodel. This raises the question of how the remaining cases, for which fewer modalities were available, would be predicted. Figure 5B shows the performance of the opportunistic model split by subgroups defined by different combinations of available input modalities. As expected, performance degraded considerably on subgroups for which important features (as delineated by the previous results) were not available. See, for example, the subgroup for which only sensor-space MEG was available. This is unsurprising, as prediction has to be based on data and is necessarily compromised if the features important at training time are not available at prediction time.
One can, thus, say that the opportunistic model operates conservatively: The performance on the subgroups reflects the quality of the features available and, hence, enables learning from the entire dataset.
It is important to emphasize that if missing values depend on age, the opportunistic model inevitably captures this information and, hence, bases its predictions on the non-random missing data. This may be desirable or undesirable, depending on the applied context. To diagnose this model behavior, we propose to run the opportunistic random forest model with the observed missing values as input and observations from the input modalities set to zero. In the current setting, the model trained on missing-data indicators performed at chance level ($Pr_{<Chance}=30.00\%$, $M=0.65$, $SD=1.68$, ${P}_{2.5,97.5}=[-2.96,3.60]$), suggesting that the missing values were not informative of age.
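A simplified sketch of this diagnostic, training a regressor on missing-data indicators alone; because the indicators here are generated completely at random, prediction should remain at chance level:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(5)
n = 500
age = rng.uniform(17, 90, n)

# Binary missing-data indicators for three modalities, generated
# independently of age (missing completely at random).
M = (rng.rand(n, 3) < 0.2).astype(float)

# Cross-validated MAE of a forest trained only on the indicators.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
mae_missing = -cross_val_score(rf, M, age, cv=5,
                               scoring='neg_mean_absolute_error').mean()

# Chance-level MAE of always predicting the mean age.
chance_mae = np.abs(age - age.mean()).mean()
```

If `mae_missing` were clearly below `chance_mae`, the missingness pattern itself would be predictive of age, flagging non-random missing data.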
Discussion
We have demonstrated improved learning of surrogate biomarkers by combining electrophysiology, as accessed through MEG, with functional and anatomical MRI. Here, we have focused on the example of age prediction through multimodal modeling on 674 subjects from the Cam-CAN dataset, the currently largest publicly available collection of MEG, fMRI and MRI data. Our results suggest that MEG and fMRI both substantially improved age prediction when combined with anatomical MRI. We have then explored potential implications of the ensuing brain-age Δ as a surrogate biomarker for cognitive and physical health. Our results suggest that MEG and fMRI convey non-redundant information on cognitive functioning and health, for example fluid intelligence, memory, sleep quality, cognitive decline and depression. Moreover, combining all modalities led to lower prediction errors. Inspection of the MEG-based models suggested that unique information on aging is conveyed by the regional distribution of power in the $\alpha$ (8–12 Hz) and $\beta$ (15–30 Hz) frequency bands, in line with the notion of spectral fingerprints (Keitel and Gross, 2016). When applied in clinical settings, multimodal approaches should make it more likely to detect relevant brain-behavior associations. We have, therefore, addressed the issue of missing values, which is an important obstacle for multimodal learning approaches in clinical settings. Our stacking model, trained on the entire data with an opportunistic strategy, performed equivalently to the restricted model on common subsets of the data and helped exploit multimodal information to the extent available. This suggests that the advantages of multimodal prediction can be maintained in practice.
fMRI and MEG reveal complementary information on cognitive aging
Our results have revealed complementary effects of anatomy and neurophysiology in age prediction. When adding either MEG or fMRI to the anatomy-based stacking model, the prediction error markedly dropped (Figure 2A). Both MEG and fMRI helped reduce the error by almost 1 year compared to purely anatomy-based prediction. Considered in isolation, this finding could suggest that both modalities access equivalent information. That would be in line with the literature on the correspondence of MEG with fMRI in resting state networks, highlighting the importance of spatially correlated slow fluctuations in brain oscillations (Hipp and Siegel, 2015; Hipp et al., 2012; Brookes et al., 2011). On the other hand, recent findings suggest that age-related variability in fMRI and EEG is independent to a substantial degree (Kumral et al., 2020; Nentwich et al., 2020). Interestingly, the prediction errors of models with MEG and models with fMRI were rather weakly correlated (Figure 2B, left panel). In some subpopulations, they even seemed anticorrelated, such that predictions from MEG or fMRI, for the same cases, were either accurate or extremely inaccurate. This additional finding suggests that the improvements of MEG and fMRI over anatomical MRI are due to their access to complementary information that helps predict distinct cases. Indeed, as we combined MEG and fMRI in one common stacking model alongside anatomy, performance improved on average by 1.3 years over the purely anatomical model, which is almost half a year more precise than the preceding MRI + MEG and MRI + fMRI models.
These results strongly argue in favor of the presence of an additive component, in line with the common intuition that MEG and fMRI are complementary with regard to spatial and temporal resolution. In this context, our results on performance decomposition in MEG (Figure 4) deliver one potentially interesting hint. Source power, especially in the $\alpha $ (8–15 Hz) and $\beta $ (15–26 Hz) ranges, was the single most contributing type of feature (Figure 4A). However, connectivity features in general, and power-envelope connectivity in particular, contributed substantively (Figure 4B, Table 4). Interestingly, applying orthogonalization (Hipp et al., 2012; Hipp and Siegel, 2015) to remove source leakage did not notably improve performance (Table 4). Against the background of research on MEG-fMRI correspondence highlighting the importance of slow fluctuations of brain rhythms (Hipp and Siegel, 2015; Brookes et al., 2011), this finding suggests that what renders MEG non-redundant with regard to fMRI are regional differences in the balance of fast brain rhythms, in particular in the $\alpha $–$\beta $ range.
While this interpretation may be enticing, an important caveat arises from the fact that fMRI signals depend on neurovascular coupling and are, hence, highly sensitive to events caused by sources other than neuronal activity (Hosford and Gourine, 2019). Recent findings based on the dataset analyzed in the present study have shown that the fMRI signal in elderly populations might predominantly reflect vascular effects rather than neuronal activity (Tsvetanov et al., 2015). The observed complementarity of fMRI and MEG in age prediction might, therefore, be conservatively explained by the age-related increase in the ratio of vascular to neuronal contributions to the fMRI signal, while MEG signals are directly induced by neuronal activity, regardless of aging. Nevertheless, in the context of brain age prediction these mechanisms are less important than the sensitivity of the prediction, for instance, regarding behavioral outcomes.
In sum, our findings suggest that electrophysiology can make a difference in prediction problems in which fast brain rhythms are strongly statistically related to the biomedical outcome of interest.
Brain age Δ as a sensitive index of normative aging
In this study, we have conducted an exploratory analysis of the potential cognitive and health-related implications of our prediction models. Our findings suggest that the brain age Δ shows substantive associations with about 20–25% of all neuropsychological measures included. The overall big picture is congruent with the brain age literature (see discussion in Smith et al., 2019 for an overview) and supports the interpretation of the brain age Δ as an index of decline in physical health, wellbeing and cognitive fitness. In this sample, larger values of the Δ were globally associated with elevated depression scores, higher blood pressure, lower sleep quality, lower fluid intelligence, lower scores in neurological assessment and lower memory performance. Most strikingly, we found that fMRI and MEG contributed additive, if not unique, information (Figure 3). For example, the association with depression appeared first when predicting age from fMRI. Likewise, the association with fluid intelligence and sleep quality visibly intensified when including MEG.
This extends the previous discussion by suggesting that MEG and fMRI are not only complementary for prediction but also with regard to characterizing brain-behavior mappings. In this context, it is worthwhile considering that predicting biomedical outcomes from multiple modalities may reduce susceptibility to ‘modality impurity’ as often observed in modeling of individual differences in cognitive abilities (Friedman and Miyake, 2004; Miyake et al., 2000). In the present study, it was remarkable that cardiac measures were exclusively related to fMRI-based models and vanished as MEG was included. This may not be entirely surprising, as the fMRI signal is a combination of both vascular and neuronal components (Hosford and Gourine, 2019) and aging affects each of them differently, which poses an important challenge to fMRI-based studies of aging (Geerligs et al., 2017; Tsvetanov et al., 2016). It is conceivable that the cardiac measures were no longer associated with brain age estimates from fMRI once combined with the other modalities, as deconfounding of vascular components may have enhanced the SNR of neuronal signals (for extensive discussion on this topic, see Tsvetanov et al., 2019).
Which neuronal components might explain the enhanced brain-behavior links extracted from the multimodal models? It is enticing to speculate that the regional power of fast-paced $\alpha $ and $\beta $ band brain rhythms captures fast-paced components of cognitive processes such as attentional sampling or adaptive attention (Gola et al., 2013; Richard Clark et al., 2004), which, in turn, might explain unique variance in certain cognitive facets, such as fluid intelligence (Ouyang et al., 2020) or visual short-term memory (Tallon-Baudry et al., 2001). On the other hand, functional connectivity between cortical areas and subcortical structures, in particular the hippocampus, may be key for depression and is well captured with fMRI (Stockmeier et al., 2004; Sheline et al., 2009; Rocca et al., 2015). Unfortunately, modeling such mediation effects exceeds the scope of the current work, although it would be worth testing in an independent study with a dedicated design.
Could one argue that the overall effect sizes were too low to be considered practically interesting? Indeed, the strength of linear association was below 0.5 in units of standard deviations of the normalized predictors and the target. On the other hand, it is important to consider that the CamCAN sample consists of healthy individuals only. It thus appears rather striking that systematic and neuropsychologically plausible effects can be detected at all. Our findings, therefore, argue in favor of the brain age Δ being a sensitive marker of normative aging. The effects are expected to be far more pronounced when applying the method in clinical settings, that is, in patients suffering from mild cognitive impairment, depression, neurodevelopmental or neurodegenerative disorders. This suggests that the brain age Δ might be used as a screening tool across a wide array of clinical settings, for which the CamCAN dataset could serve as a normative sample.
Translation to the clinical setting
One critical factor for application of our approach in the clinic is the problem of incomplete availability of medical imaging and physiological measurements. Here, we addressed this issue by applying an opportunistic learning approach, which enables learning from whatever data is available at hand. Our analysis of opportunistic learning applied to age prediction revealed viable practical alternatives to confining the analysis to cases for which all measurements are available. In fact, adding extra cases with incomplete measurements never harmed prediction for the cases with complete data, and the full multimodal stacking always outperformed submodels with fewer modalities (Figure 5A). Moreover, the approach allowed maintaining and extending the performance to new cases with incomplete modalities (Figure 5B). Importantly, performance on such subsets was explained by the performance of a reduced model with the remaining modalities. Put differently, opportunistic stacking performed as well as a model restricted to data with all modalities. In practice, the approach allows one to improve predictions case-wise by including electrophysiology next to MRI, or MRI next to electrophysiology, whenever there is the opportunity to do so.
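To make the opportunistic strategy concrete, one simple coding scheme (a minimal sketch; the function name `opportunistic_coding` and the fill value are illustrative, not taken from the study code) duplicates each column of layer-1 age predictions and replaces missing entries once by a very small and once by a very large value, so that a tree-based layer-2 model can learn to split on missingness itself:

```python
import numpy as np

def opportunistic_coding(preds, fill=1000.0):
    """Duplicate each column of layer-1 predictions; code missing
    entries once as a very small and once as a very large value so
    that a tree-based layer-2 model can split on missingness."""
    preds = np.asarray(preds, dtype=float)
    lo, hi = preds.copy(), preds.copy()
    lo[np.isnan(lo)] = -fill
    hi[np.isnan(hi)] = fill
    return np.hstack([lo, hi])

# Three subjects, two modality-wise age predictions, two values missing
p = np.array([[54.0, np.nan],
              [23.5, 25.0],
              [np.nan, 71.2]])
coded = opportunistic_coding(p)
print(coded.shape)  # (3, 4)
```

The coded matrix contains no missing values and can be passed directly to the random forest in the stacking layer.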
A second critical factor for translating our findings into the clinic is that, most of the time, it is not high-density MEG that is available but low-density EEG. In this context, our finding that source power was the most important feature is of clear practical interest. It suggests that a rather simple statistical object accounts for the bulk of the performance of MEG. Source power can be approximated by the sensor-level topography of power spectra, which can be computed on any multichannel EEG device in a few steps and only yields as many variables per frequency band as there are channels. Moreover, from a statistical standpoint, computing the power spectrum amounts to estimating the marginal expectation of the signal variance, which can be thought of as a main effect. On the other hand, connectivity is often operationalized as a bivariate interaction, which gives rise to a more complex statistical object of higher dimensionality whose precise, reproducible estimation may require far more samples. Moreover, power envelope connectivity estimation involves additional processing steps, each of which may add researcher degrees of freedom (Simmons et al., 2011), such as the choice between Hilbert (Brookes et al., 2011) and wavelet filtering (Hipp et al., 2012), the type of orthogonalization (Baker et al., 2014), and potentially thresholding for topological analysis (Khan et al., 2018). This nourishes the hope that our findings will generalize and that similar performance can be unlocked on simpler EEG devices with fewer channels. While clinical EEG may not resolve functional connectivity well, it may be good enough to resolve changes in the source geometry of the power spectrum (Sabbagh et al., 2020). On the other hand, source localization may be critical in this context, as linear field spread actually results in a non-linear transform when considering the power of a source (Sabbagh et al., 2019).
However, in practice, it may be hard to conduct high-fidelity source localization on the basis of low-density EEG and the frequently absent information on individual anatomy. It will, therefore, be critical to benchmark and improve learning from power topographies in clinical settings.
Finally, it is worthwhile to highlight that, here, we have focused on age in the more specific context of the brain age Δ as a surrogate biomarker. However, the proposed approach is fully compatible with any target of interest and may be applied directly to clinical end points, for example drug dosage, survival or diagnosis. Moreover, the approach presented here can be easily adapted to classification problems, for instance, by substituting logistic regression for ridge regression and by using a random forest classifier in the stacking layer. We have provided all materials from our study in the form of publicly available version-controlled code, with the hope of helping other teams of biomedical researchers adapt our method to their prediction problem.
Limitations
For the present study, we see four principal limitations: availability of data, interpretability, non-exhaustive feature engineering and a potential lack of generalizability due to the focus on MEG.
The CamCAN is a unique resource of multimodal neuroimaging data with sufficient data points to enable machine learning approaches. Yet, from the point of view of machine learning, the CamCAN dataset is a small dataset. This has at least two consequences. If the CamCAN included many more data points, for example beyond 10–100 k subjects, the proposed stacking model might be of limited advantage compared to purely non-linear models, for example random forests, gradient boosting or deep learning methods (Bzdok and Yeo, 2017). At the same time, the fact that the CamCAN has been unique so far hinders generalization testing on equivalent multimodal datasets from other sites based on alternative scanning methodologies, protocols and devices (Engemann et al., 2018). This also renders the computation of numerical hypothesis tests (including p-values) more difficult in the context of predictive modeling: the majority of data points is needed for model fitting, and metrics derived from left-out cross-validation splits, for example predictions of brain age, lack statistical independence. This breaks essential assumptions of inferential statistics to an arbitrary and unknown degree. Our inferences were, therefore, predominantly based on estimated effect sizes, that is, the expected generalization error and its uncertainty assessed through cross-validation.
Second, at this point, statistical modeling faces the dilemma of whether inference or prediction is the priority. Procedures optimizing prediction performance in high dimensions are not yet supported by the in-depth understanding required to guarantee formal statistical inferences, whereas models with well-established procedures for statistical inference lack predictive capability (Bzdok et al., 2018; Bzdok and Ioannidis, 2019). Forcing interpretation out of machine learning models, therefore, often leads to duplicated analysis pipelines and model specifications, which is undesirable in terms of methodological coherence (for example Hoyos-Idrobo et al., 2019; Haufe et al., 2014; Biecek, 2018). In the present work, we refrained from conducting fine-grained inferential analysis beyond the model comparisons presented, in particular inspection of layer-1 weight maps, whose interpretation remains an ongoing research effort. We hope, nevertheless, that the insights from our work will stimulate studies investigating the link between MEG, fMRI and MRI across the lifespan using an inference-oriented framework.
Third, the MEG features used in the present study were non-exhaustive. Based on the wider MEG/EEG literature beyond the neuroscience of aging, many other features could have been included. Instead, feature engineering was based on our aging-specific literature review, constrained by biophysical considerations. In particular, the distinction between sensor-space and source-space features was purely descriptive and not substantive. From an empirical perspective, mirroring all features in sensor space and source space could have yielded more specific inferences, for example regarding the role of source power. On the other hand, biophysical prior knowledge implies that oscillatory peak frequencies and evoked response latencies are not modified by source localization, whereas source localization or data-driven approximations thereof are essential for predicting from M/EEG power spectra (Sabbagh et al., 2019). It is also fair to admit that, in the present paper, our attention was preferentially drawn to source modeling of neural power spectra. However, one could imagine that, with equal investment of resources, more information could have been extracted from the sensor-level features (see Gemein et al., 2020 for approaches to tackle the important methodological issue of unbalanced investment of development time). Relatedly, the current work has strongly benefited from expertise in modeling MEG power spectra under the assumption of stationarity, as captured by global power spectra, covariance or connectivity. Recent findings suggest that non-stationary analyses focusing on transient electrophysiological events may uncover clinically relevant information on cognitive brain dynamics (Barttfeld et al., 2015; Baker et al., 2014; Vidaurre et al., 2018; Van Schependom et al., 2019).
It is, therefore, important to highlight that our proposed framework is open and readily enables the integration of additional low- or high-dimensional inputs related to richer sensor-level features or non-stationary dynamics, beyond MEG as an input modality.
Finally, while MEG and EEG share the same types of neural generators, their specific biophysics render these methods complementary for studying neuronal activity. At this point, unfortunately, there is no public dataset equivalent to the CamCAN including EEG or both EEG and MEG. Such a data resource would have enabled studying the complementarity between MEG and EEG, as well as generalization from stacking models with MRI and MEG to stacking models with MRI and EEG.
We hope that our method will help other scientists to incorporate the multimodal features related to their domain expertise into their applied regression problems.
Materials and methods
Sample
We included MEG (task and rest), fMRI (rest), anatomical MRI and neuropsychological data (cognitive tests, home interview, questionnaires) from the CamCAN dataset (Shafto et al., 2014). Our sample comprised 674 (340 female) healthy individuals between 18 (female = 18) and 88 (female = 87) years of age, with an average of 54.2 (female = 53.7) and a standard deviation of 18.7 (female = 18.8) years. We included data according to availability and did not apply an explicit criterion for exclusion. When automated processing resulted in errors, we considered the data as missing. This induced additional missing data for some cases. A summary of available cases by input modality is reported in Table 3. For technical details regarding the MEG, fMRI, and MRI data acquisition, please consult the CamCAN reference publications (Shafto et al., 2014; Taylor et al., 2017).
Feature extraction
Feature extraction was guided by the perspective of predictive modeling. With the goal of enhancing prediction performance as opposed to statistical inference (Bzdok and Ioannidis, 2019), we emphasized differences between modalities and hence chose modality-specific methods and optimizations, at the risk of sacrificing direct comparability between the features used for MEG, fMRI and MRI. The selection of features was guided by our literature review on the neuroscience of aging presented in the introduction.
For MEG, we analyzed sensor-space features related to timing (Price et al., 2017), peak frequency (Richard Clark et al., 2004) and temporal autocorrelation (Voytek et al., 2015). Source-space features included the power of source-level signals (Sabbagh et al., 2019) as well as envelopes and their bivariate interactions (Khan et al., 2018) in nine frequency bands (see Table 1, adapted from the Human Connectome Project, Larson-Prior et al., 2013). The inclusion of power envelopes was theoretically important, as the slow fluctuations of source power and their bivariate interactions have been repeatedly linked to fMRI resting state networks (Hipp and Siegel, 2015; Brookes et al., 2011). On the other hand, we specifically focused on the unique capacity of MEG to access spatial information induced by fast-paced brain rhythms emerging from regional sources (King and Dehaene, 2014; Stokes et al., 2015).
For extracting features from MRI and fMRI, we adapted the approach established by Liem et al., 2017. For fMRI, we computed bivariate functional connectivity estimates. For MRI, we focused on cortical thickness, cortical surface area and subcortical volumes. An overview of all features used is presented in Table 2. In the remainder of this section, we describe computation details.
MEG features
Peak evoked latency
Sensory processing may slow down in the course of aging (Price et al., 2017). Here, we assessed the evoked response latency during auditory, visual and simultaneous audiovisual stimulation (index 1, Table 2). For each condition, we first computed the evoked response. Then, we computed the root mean square across gradiometers and looked up the time of the maximum. In total, this yielded three latency values per subject.
$\alpha $-band peak frequency
Research suggests that the alpha-band peak frequency may be lower in older people. Here, we computed the resting-state power spectrum using a Welch estimator (index 2, Table 2). Then, we estimated the peak frequency between 6 and 15 Hz on occipito-parietal magnetometers after removing the 1/f trend using a polynomial regression (degree = 15), by computing the maximum power across sensors and looking up the corresponding frequency bin. This yielded one peak value per subject.
1/f slope
Long-range autocorrelation in neural time series gives rise to the characteristic 1/f decay of power on a logarithmic scale. Increases of neural noise during aging are thought to lead to reduced autocorrelation, hence a shallower slope (Voytek et al., 2015). We computed the 1/f slope from the Welch power spectral estimates above on all magnetometers using linear regression (index 3, Table 2). The slope is given by the $\widehat{\beta}$ of the linear fit with the log-frequencies as predictor and the log-power as target. We obtained one estimate for each of the 102 magnetometers, resulting in a 1/f topography. No further reduction was applied.
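The slope estimation amounts to an ordinary least squares fit in log-log coordinates. A minimal sketch with synthetic data (the simulated spectrum and noise level are illustrative, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = np.linspace(2.0, 40.0, 200)                 # frequency bins in Hz
true_slope = -1.0                                   # power ~ f ** slope
psd = freqs ** true_slope * np.exp(rng.normal(0.0, 0.05, freqs.size))

# Linear fit of log-power on log-frequency; the fitted coefficient
# beta[0] is the 1/f slope, one value per sensor in the actual pipeline.
X = np.column_stack([np.log10(freqs), np.ones_like(freqs)])
beta, *_ = np.linalg.lstsq(X, np.log10(psd), rcond=None)
print(f"estimated slope: {beta[0]:.2f}")
```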
Power and connectivity of source-level signals
The cortical generators of the brain rhythms dominating the power spectrum change across the lifespan. To predict from the spatial distribution of MEG power spectra, we relied on source localization to mitigate distortions due to individual head geometry. We adopted the pipeline optimized for high-dimensional regression presented in Sabbagh et al., 2019 and modeled power spectra in the time domain based on covariance estimates after band-pass filtering. We considered nine frequency bands (see Table 1), computed band-pass filtered minimum norm source estimates and then summarized the source time courses ROI-wise by the first principal component, with alignment to the surface normals using the ‘pca_flip’ option provided by MNE-Python (Gramfort et al., 2013). To mitigate the curse of dimensionality, we used a subdivision of the Desikan-Killiany atlas (Desikan et al., 2006) comprising 448 ROIs. This set of ROIs, proposed by Khan et al., 2018 for predictive modeling of neurodevelopmental trajectories, was specifically designed to generate approximately equal ROI sizes, avoiding averaging over inhomogeneous regions with distinct lead-field coverage or over larger regions that may contain multiple sources cancelling each other. Subsequently, we computed the covariance matrix from the concatenated epochs and used its 448 diagonal entries as power estimates (index 4, Table 2). The off-diagonal entries served as connectivity estimates. Covariance matrices live in a non-Euclidean curved space. To avoid model violations at the subsequent linear-modeling stages, we used tangent space projection (Varoquaux et al., 2010) to vectorize the lower triangle of the covariance matrix. This projection allows one to treat entries of the covariance or correlation matrix as regular Euclidean objects, hence avoiding violations of the linear model used for regression (Sabbagh et al., 2019).
This yielded $448\times (448-1)/2=100{,}128$ connectivity values per subject (index 6, Table 2).
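The tangent space projection can be sketched as taking the matrix logarithm of the covariance whitened by a reference matrix (a minimal numpy illustration of the general technique, not the study's exact implementation, which relied on dedicated library code):

```python
import numpy as np

def tangent_embed(cov, ref):
    """Tangent space projection at `ref`:
    logm(ref^{-1/2} @ cov @ ref^{-1/2}), via eigendecompositions."""
    vals, vecs = np.linalg.eigh(ref)
    whit = vecs @ np.diag(vals ** -0.5) @ vecs.T     # ref^{-1/2}
    m = whit @ cov @ whit                            # whitened covariance
    mv, mV = np.linalg.eigh(m)
    return mV @ np.diag(np.log(mv)) @ mV.T           # matrix logarithm

ref = np.array([[2.0, 0.3],
                [0.3, 1.0]])
# Projecting the reference onto itself yields the zero matrix
print(np.allclose(tangent_embed(ref, ref), 0.0))     # True

# Number of unique off-diagonal entries of a 448 x 448 matrix
print(448 * 447 // 2)                                # 100128
```

The lower triangle of the projected matrix then serves as a Euclidean feature vector for the linear models.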
Power and connectivity of source-level envelopes
Brain rhythms are not constant in time but fluctuate in intensity. These slow fluctuations are technically captured by power envelopes and may show characteristic patterns of spatial correlation. To estimate power envelopes, for each frequency band, we computed the analytic signal using the Hilbert transform. For computational efficiency, we calculated the complex-valued analytic signal in sensor space and then source-localized it using the linear minimum norm operator. To preserve linearity, we only extracted the power envelopes, by taking the absolute value of the analytic signal, after having performed averaging inside the ROIs. Once the envelope time series was computed, we applied the same procedure as for source power (paragraph above) to estimate the power of the envelopes (index 5, Table 2) and their connectivity. Power and covariance were computed from concatenated epochs; correlation and orthogonalized correlation were computed epoch-wise. Note that, for systematic reasons, we also included power estimates of the envelope time series, applying the same method as we used for the raw time series. In the MEG literature, envelope correlation is a well-established research topic (Hipp et al., 2012; Brookes et al., 2011). Thus, in addition to the covariance, we computed the commonly used normalized Pearson correlations and orthogonalized Pearson correlations, which are designed to mitigate source leakage (index 7–9, Table 2). However, as a result of orthogonalization, the resulting matrix is no longer positive definite and cannot be projected to the tangent space using Riemannian geometry. Therefore, we used Fisher’s Z transform (Silver and Dunlap, 1987) to convert the correlation matrix into a set of standard-normal variables.
The transform is defined as the inverse hyperbolic tangent of the correlation coefficient: $z=\mathrm{a}\mathrm{r}\mathrm{c}\mathrm{t}\mathrm{a}\mathrm{n}\mathrm{h}(r)=\frac{1}{2}\mathrm{log}\left(\frac{1+r}{1-r}\right)$. This yielded 448 envelope power estimates and 100,128 connectivity values per estimator.
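As a quick numerical check of the transform (illustrative values):

```python
import numpy as np

r = np.array([-0.9, -0.5, 0.0, 0.5, 0.9])
z = np.arctanh(r)   # Fisher's Z: 0.5 * log((1 + r) / (1 - r))
print(np.allclose(z, 0.5 * np.log((1 + r) / (1 - r))))  # True
print(round(float(np.arctanh(0.5)), 4))                 # 0.5493
```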
fMRI features
Functional connectivity
Large-scale neuronal interactions between distinct brain networks have been repeatedly shown to change during healthy aging. Over the past years, functional atlases comprising about 50 to 1000 ROIs have emerged as a fundamental element of fMRI-based predictive modeling, mitigating heterogeneity and reducing dimensionality, especially in small to medium-sized datasets such as the CamCAN with fewer than 1000 observations (Dadi et al., 2019; Abraham et al., 2017). To estimate macroscopic functional connectivity, we deviated from the 197-ROI BASC atlas (Bellec et al., 2010) used in Liem et al., 2017. Instead, we used an atlas with 256 sparse and partially overlapping ROIs obtained from Massive Online Dictionary Learning (MODL) (Mensch et al., 2016). Initial piloting suggested that both methods gave approximately equivalent results on average, with slightly reduced variance for the MODL atlas. Then, we computed bivariate amplitude interactions using Pearson correlations from the ROI-wise average time series (index 10, Table 2). Again, we used tangent space projection (Varoquaux et al., 2010) to vectorize the correlation matrices. This yielded 32,640 connectivity values from the lower triangle of each matrix. No further reduction was applied.
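The correlation and vectorization step can be sketched as follows (synthetic time series; the tangent space projection is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)
n_timepoints, n_rois = 300, 256       # illustrative ROI-averaged time series
ts = rng.normal(size=(n_timepoints, n_rois))

corr = np.corrcoef(ts, rowvar=False)  # (256, 256) Pearson correlation matrix
iu = np.triu_indices(n_rois, k=1)     # unique off-diagonal entries
features = corr[iu]
print(features.size)                  # 32640
```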
MRI features
The extraction of features from MRI followed the previously established strategy presented in Liem et al., 2017, which is based on cortical surface reconstruction using the FreeSurfer software. For scientific references to specific procedures, see the section MRI data processing and the FreeSurfer website http://freesurfer.net/.
Cortical thickness
Aging-related brain atrophy has been related to thinning of the cortical tissue (for example Thambisetty et al., 2010). We extracted cortical thickness, defined as the shortest distance between the white and pial surfaces, from the FreeSurfer (Fischl, 2012) segmentation, using a surface tessellation with 5124 vertices in fsaverage4 space obtained from the FreeSurfer command mris_preproc with default parameters (index 11, Table 2). No further reduction was computed.
Cortical surface area
Aging is also reflected in shrinkage of the cortical surface itself (for example Lemaitre et al., 2012). We extracted vertex-wise cortical surface area estimates, defined as the average of the faces adjacent to a vertex along the white surface, from the FreeSurfer segmentation, using a surface tessellation with 5124 vertices in fsaverage4 space obtained from the FreeSurfer command mris_preproc with default parameters (index 12, Table 2). No further reduction was computed.
Subcortical volumes
The volume of subcortical structures has been linked to aging (for example Murphy et al., 1992). Here, we used the FreeSurfer command asegstats2table, with default parameters, to obtain estimates of the subcortical volumes and global volume, yielding 66 values for each subject with no further reductions (index 13, Table 2).
Stacked-prediction model for opportunistic learning
We used the stacking framework (Wolpert, 1992) to build our predictive model. However, we made the important specification that input models were regularized linear models trained on input data from different modalities, whereas stacking of the linear predictions was achieved by a non-linear regression model. Our model can be intuitively denoted as follows: $y\approx f({X}_{1}{\beta}_{1},\dots ,{X}_{J}{\beta}_{J})$.
Here, each ${X}_{j}{\beta}_{j}$ is the vector of predictions ${\widehat{y}}_{j}={X}_{j}{\beta}_{j}$ of the target vector $y$ from the j-th model, fitted using the input data ${X}_{j}$.
We used ridge regression as the input model and a random forest regressor as a general function approximator $f$ (Hastie et al., 2005, ch. 15.4.3). A visual illustration of the model is presented in Figure 1.
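A minimal sketch of this two-layer architecture with scikit-learn (synthetic data; the feature blocks, sample size and hyperparameters are illustrative and much smaller than in the study):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 88, n)
# Stand-ins for two modality-specific feature blocks (noisy views of age)
X_mri = np.column_stack([age + rng.normal(0, 8, n) for _ in range(50)])
X_meg = np.column_stack([age + rng.normal(0, 10, n) for _ in range(30)])

# Layer 1: one ridge model per modality, predictions from left-out folds
layer1 = np.column_stack([
    cross_val_predict(RidgeCV(alphas=np.logspace(-3, 5, 100)), X, age, cv=10)
    for X in (X_mri, X_meg)
])

# Layer 2: a random forest stacks the linear predictions non-linearly
stacker = RandomForestRegressor(n_estimators=100, random_state=0)
stacker.fit(layer1, age)
print(stacker.predict(layer1[:3]).shape)  # (3,)
```

Using out-of-fold predictions as layer-2 inputs prevents the forest from learning on in-sample ridge fits.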
Layer 1: Ridge regression
Results from statistical decision theory suggest that, for linear models, the expected out-of-sample error increases only linearly with the number of variables included in a prediction problem (Hastie et al., 2005, chapter 2), not exponentially. In practice, biased (or penalized) linear models with Gaussian priors on the coefficients, that is, ridge regression (or logistic regression for classification) with ${\mathrm{\ell}}_{2}$ penalty (squared ${\mathrm{\ell}}_{2}$ norm), are hard to outperform in neuroimaging settings (Dadi et al., 2019). Ridge regression can be seen as an extension of ordinary least squares (OLS) where the solution is biased such that the coefficients estimated from the data are conservatively pushed toward zero: $\widehat{\beta}=\underset{\beta}{\mathrm{arg\,min}}\left({\parallel y-X\beta \parallel}_{2}^{2}+\lambda {\parallel \beta \parallel}_{2}^{2}\right)$.
The estimated coefficients approach zero as the penalty term $\lambda $ grows, and the solution approaches the OLS fit as $\lambda $ gets closer to zero. This shrinkage affects directions of variance with small singular values more strongly than those with large singular values (see eqs. 3.47–3.50 in Hastie et al., 2005, ch. 3.4.1); it can hence be seen as a smooth principal component analysis in which directions of variance are shrunk but no dimension is ever fully discarded. This is the same as assuming that the coefficient vector comes from a Gaussian distribution centered around zero, $\beta \sim \mathcal{N}(0,{\sigma}^{2}I)$, such that increasing shrinkage reduces the variance ${\sigma}^{2}$ of that distribution (Efron and Hastie, 2016, chapter 7.3).
In practice, the optimal strength of this Gaussian prior is often unknown. For predictive modeling, $\lambda $ is commonly chosen in a data-driven fashion such that one improves the expected out-of-sample error, for example tuned using cross-validation. We tuned $\lambda $ using generalized cross-validation (Golub et al., 1979) and considered 100 candidate values on an evenly spaced logarithmic scale between $10^{-3}$ and $10^{5}$. This can be regarded as equivalent to assuming a flat but discrete hyperprior (a prior distribution assumed for the hyperparameters of the model) on the distribution of candidate regularization strengths. Note that this procedure is computationally efficient and, on our problem, returned entire regularization paths within seconds. While this approach is standard practice in applied machine learning and particularly useful with massive and high-dimensional data, many other methods exist for data-driven choice of the prior, which may be more appropriate on smaller datasets and in situations where parameter inference, not prediction, is the priority.
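In scikit-learn, this tuning scheme corresponds to `RidgeCV` with a logarithmic grid of candidate penalties (a minimal sketch on synthetic data; `RidgeCV` uses an efficient leave-one-out scheme closely related to generalized cross-validation):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 100)

# 100 candidate penalties, evenly spaced on a log scale from 1e-3 to 1e5
alphas = np.logspace(-3, 5, 100)
model = RidgeCV(alphas=alphas).fit(X, y)   # efficient leave-one-out tuning
print(model.alpha_ in alphas)              # True
```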
Layer 2: Random forest regression
However, the performance of the ridge model in high dimensions comes at the price of increased bias. The stacking model tries to alleviate this issue by reducing the dimensionality: it creates a derived dataset of linear predictions, which can then be forwarded to a more flexible local regression model. Here, we chose the random forest algorithm (Breiman, 2001), which can be seen as a general function approximator and has been interpreted as an adaptive nearest neighbors algorithm (Hastie et al., 2005, chapter 15.4.3). Random forests can learn a wide range of functions and are capable of automatically detecting non-linear interaction effects with little tuning of hyperparameters. They are based on two principles: regression trees and bagging (bootstrapping and aggregating). Regression trees are non-parametric methods that recursively subdivide the input data by finding combinations of thresholds that relate value ranges of the input variables to the target. The principle is illustrated at the bottom right of Figure 1. For a fully grown tree, each sample falls into one leaf of the tree, defined by its unique path through combinations of input-variable thresholds. However, regression trees tend to overfit easily. This is counteracted by randomly generating alternative trees from bootstrap replica of the dataset and randomly selecting a subset of variables for each tree. Importantly, thresholds are by default optimized with regard to a so-called impurity criterion, for which we used the mean squared error. Predictions are then averaged, which mitigates overfitting and also explains how continuous predictions can be obtained from thresholds.
In practice, it is common to use a generous number of trees, as performance plateaus once a certain number is reached, which may lie between hundreds and thousands. Here, we used 1000 trees. Moreover, limiting the overall depth of the trees can increase bias and mitigate overfitting at the expense of model complexity. An intuitive way of conceptualizing this step is to think of the tree depth in terms of orders of interaction effects. A tree with three nodes enables learning three-way interactions. Here, we tuned the model to choose between depth values of 4, 6, or 8, or the option of not constraining the depth. Finally, the total number of features sampled at each node determines the degree to which the individual trees are independent or correlated. Small numbers of variables decorrelate the trees but make it harder to find important variables as the number of input variables increases. On the other hand, using more variables at once leads to a more exhaustive search for good thresholds, but may increase overfitting. As our stacking models had to deal with different numbers of input variables, we had to tune this parameter and let the model select between $\sqrt{p}$, $\mathrm{log}(p)$ and all $p$ input variables. We implemented selection of tuning parameters by grid search with (nested) 5-fold cross-validation, using the same scoring as for the evaluation of model performance, that is, mean absolute error. Mean absolute error is a natural choice for the study of aging, as the error is directly expressed in the unit of interest (years).
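The tuning just described can be sketched with scikit-learn as follows (toy data; the study used 1000 trees, reduced here to keep the example fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(120, 8)              # stand-in for stacked layer-1 predictions
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.randn(120)

param_grid = {
    'max_depth': [4, 6, 8, None],            # None = unconstrained depth
    'max_features': ['sqrt', 'log2', 1.0],   # sqrt(p), log2(p), or all p
}
# nested 5-fold cross-validation scored by mean absolute error
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid, cv=5, scoring='neg_mean_absolute_error')
grid.fit(X, y)
print(grid.best_params_)
```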
Stacked cross-validation
We used a 10-fold cross-validation scheme. To mitigate bias due to the actual order of the data, we repeated the procedure 10 times while reshuffling the data at each repeat. We then generated age predictions from each layer-1 model on the left-out folds, such that we had one age prediction per case and repeat. We then stored the indices for each fold to make sure the random forest was trained on left-out predictions from the ridge models. This ensured that the input-layer train-test splits were carried forward to layer-2 and that the stacking model was always evaluated on left-out folds, in which the input ages are actual predictions and the targets have not been seen by the model. Here, we customized the stacking procedure to be able to unbox and analyze the input age predictions and to implement opportunistic handling of missing values.
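The core mechanics of this procedure, generating genuine out-of-fold layer-1 predictions on which the layer-2 model is trained, can be sketched as follows (toy data and names; the 10 reshuffled repeats are omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.RandomState(1)
X_meg, X_mri = rng.randn(100, 30), rng.randn(100, 20)  # two modality blocks
y = rng.randn(100) * 10 + 50                            # stand-in for age

cv = KFold(n_splits=10, shuffle=True, random_state=1)
# one left-out age prediction per subject and per modality ...
stack_X = np.column_stack([
    cross_val_predict(Ridge(), X_block, y, cv=cv)
    for X_block in (X_meg, X_mri)])
# ... which forms the training data for the layer-2 random forest
layer2 = RandomForestRegressor(n_estimators=100, random_state=1)
layer2.fit(stack_X, y)
```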
Variable importance
Random forest models and, in general, regression trees are often inspected by estimating the impact of each variable on prediction performance. This is commonly achieved by computing the so-called variable importance. The idea is to track and sum, across all trees, the relative reduction of impurity each time a given variable is used for splitting, hence the name mean decrease in impurity (MDI). The decrease in impurity can be tracked by regular performance metrics. Here we used mean squared error, which is the default option for random forest regression in scikit-learn (Pedregosa et al., 2011). It has been shown that in correlated trees, variable importance can be biased and lead to masking effects, that is, fail to detect important variables (Louppe et al., 2013) or suggest noise variables to be important. One potential remedy is to increase the randomness of the trees, for example by randomly selecting a single variable for splitting and using extremely randomized trees (Geurts et al., 2006; Engemann et al., 2018), as it can be mathematically guaranteed that in fully randomized trees only actually important variables are assigned importance (Louppe et al., 2013). However, such measures may compromise prediction performance or lead to duplicated model specifications (one model for predicting, one for analyzing variable importance). Here, we used the approach from the original random forest paper (Breiman, 2001), which consists in permuting, $k$ times, one variable at a time and measuring the drop in performance in the units of the performance score, that is, mean absolute error in years. We computed permutation importance with $k=1000$ after fitting the random forest to the cross-validated predictions from the layer-1 models.
In-sample permutation importance is computationally convenient but may suffer from an irreducible risk of overfitting, even when taking precautions such as limiting the tree depth. This risk can be avoided by computing the permutations on left-out data, that is, by permuting the variables in the testing set, which can be computationally expensive. However, permutation importance (whether computed on training or testing data) has the known disadvantage that it does not capture conditional dependencies or higher-order interactions between variables. For example, a variable may not be important in itself, but its interaction with other variables may make it an important predictor. Such conditional dependencies between variables can be captured with MDI importance.
To diagnose potential overfitting and to assess the impact of conditional dependencies, we additionally reported out-of-sample permutation importance and MDI importance. We computed out-of-sample permutation importance for each of the 100 splits from our cross-validation procedure with a reduced number of permutations ($k=100$) to avoid excessive computation times. MDI importance was based on the same model fit as the in-sample permutations.
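A minimal sketch of the Breiman-style permutation importance, here computed in-sample on toy data (the out-of-sample variant permutes the testing set instead; $k$ is reduced to keep the example fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(2)
X = rng.randn(200, 3)
y = 2 * X[:, 0] + 0.1 * rng.randn(200)   # only the first input matters

forest = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)
baseline = mean_absolute_error(y, forest.predict(X))

k = 20  # the study used k=1000 in-sample and k=100 out-of-sample
importances = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    drops = []
    for _ in range(k):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        # drop in performance, in the units of the scoring metric
        drops.append(mean_absolute_error(y, forest.predict(X_perm)) - baseline)
    importances[j] = np.mean(drops)
print(importances)  # the first variable should dominate
```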
Opportunistic learning with missing values
An important option of our stacking model concerns the handling of missing values. Here, we implemented the double-coding approach (Josse et al., 2019), which duplicates the features and assigns missing entries once a very small and once a very large number (see also the illustration in Figure 1). As our stacked input data consisted of age predictions from the ridge models, we used the biologically implausible values of −1000 and 1000. This amounts to turning missing values into features and letting the stacking model potentially learn from them, as the reason for a missing value may contain information on the target. For example, an elderly patient may not be in the best condition for an MRI scan, but may nevertheless qualify for electrophysiological assessment.
To implement opportunistic stacking, we considered the full dataset with missing values and kept track of missing data while training layer-1. This yielded the stacking data consisting of the age predictions and missing values. Stacking was then performed after applying the feature-coding of missing values. This procedure made sure that all training and test splits were defined with regard to the full set of cases and, hence, that the stacking model could be applied to all cases after feature-coding of the missing values.
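The double-coding of missing values can be sketched in a few lines of NumPy (the function name is ours; the fill values are those stated above):

```python
import numpy as np

def double_code(X, small=-1000., large=1000.):
    """Duplicate the columns of X, filling NaNs once with `small`
    and once with `large` (biologically implausible ages)."""
    X_small, X_large = X.copy(), X.copy()
    X_small[np.isnan(X)] = small
    X_large[np.isnan(X)] = large
    return np.concatenate([X_small, X_large], axis=1)

# age predictions from two modalities, each missing for one subject
X = np.array([[55.0, np.nan],
              [np.nan, 71.0]])
print(double_code(X))
```

The resulting matrix contains no NaNs, so the random forest can split both on the predicted ages and on the presence/absence pattern itself.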
Statistical inference
Rejecting a null hypothesis regarding differences between two cross-validated models is problematic in the absence of sufficiently large unseen data or independent datasets: cross-validated scores are not statistically independent. Fortunately, cross-validation yields useful empirical estimates of the performance (and its dispersion) that can be expected on unseen data (Hastie et al., 2005, Ch. 7.10). Here, we relied on uncertainty estimates of paired differences based on the stacked cross-validation with 10 folds and 10 repeats. To provide a quantitative summary of the distributions of paired split-wise differences in performance, we extracted the mean, the standard deviation, the 2.5 and 97.5 percentiles (inner 95% of the distribution), as well as the number of splits in which a model was better than a given reference (${Pr}_{<Ref}$). We estimated chance-level prediction using a dummy regressor that predicts the average of the training-set target, using the same cross-validation procedure and identical random seeds to ensure split-wise comparability between non-trivial models. While not readily supporting the computation of p-values, dummy estimates are computationally efficient and yield distributions equivalent to those obtained from label-permutation procedures. For statistical analyses linking external measurements with model-derived quantities, such as the cross-validated age prediction or the brain age Δ, we used classical parametric hypothesis testing. It should be clear, however, that hypothesis testing here provides a quantitative orientation that needs to be contextualized by empirical estimates of effect sizes and their uncertainty to support inference.
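The chance-level estimate can be sketched with scikit-learn's dummy regressor (toy data; the fixed random seed stands for the identical seeding across models described above):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(3)
X = rng.randn(100, 5)
y = rng.uniform(18, 88, size=100)  # stand-in for age

# same splits (identical seed) as used for the non-trivial models,
# ensuring split-wise comparability
cv = KFold(n_splits=10, shuffle=True, random_state=3)
chance_mae = -cross_val_score(DummyRegressor(strategy='mean'), X, y,
                              cv=cv, scoring='neg_mean_absolute_error')
print(chance_mae.mean())  # expected error when predicting the mean age
```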
Analysis of brain-behavior correlation
To explore the cognitive implications of the brain age Δ, we computed correlations with the neurobehavioral scores from the CamCAN dataset. Table 5 lists the scores we considered. The measures fall into three broad classes: neuropsychology, physiology and questionnaires (‘Type’ column in Table 5). Extraction of neuropsychological scores sometimes required additional computation, which followed the description in Shafto et al., 2014 (see also the ‘Variables’ column in Table 5). For some neuropsychological tasks, the CamCAN dataset provided multiple scores, and sometimes the final score of interest as described in Shafto et al., 2014 had yet to be computed. At times, this amounted to computing ratios, averages or differences between different scores. For other scores, it was not obvious how to aggregate multiple interrelated subscores; hence, we extracted the first principal component, which explained between about 50% and 85% of the variance and thus offered reasonable summaries. In total, we included 38 variables. All neuropsychology and physiology scores (up to #17 in Table 5) were the scores available in the ‘cc770scored’ folder from release 001 of the CamCAN dataset. We selected the additional questionnaire scores (#18–23 in Table 5) on theoretical grounds to provide an assessment of clinically relevant individual differences in cognitive functioning. The brain age Δ was defined as the difference between the predicted and the actual age of the person, $\mathrm{\Delta }=\hat{y}-y,$
such that positive values quantify overestimation and negative values underestimation. A common problem in establishing brain-behavior correlations for brain age is spurious correlation due to shared age-related variance in the brain age Δ and the neurobehavioral score (Smith et al., 2019). To mitigate confounding effects of age, we computed the age-residualized scores as ${\mathrm{score}}_{res}=\mathrm{score}-{\mathrm{score}}_{age},$
where $\mathrm{score}$ is the observed neuropsychological score and ${\mathrm{score}}_{age}$ is its prediction from a polynomial regression on age.
The estimated linear association between the residualized score and the brain age $\mathrm{\Delta}$ was given by ${\beta}_{1}$ in ${\mathrm{score}}_{res}={\beta}_{1}\mathrm{\Delta }+\epsilon .$
To obtain comparable coefficients across scores, we standardized both age and the scores. We also included intercept terms in all models, which are omitted here for simplicity.
It has recently been demonstrated that such a two-step procedure can lead to spurious associations (Lindquist et al., 2019). We have, therefore, repeated the analysis with a joint deconfounding model in which the polynomial terms for age enter the regression model alongside the brain age predictor.
Finally, the results may be driven by confounding variables of non-interest. To assess the importance of such confounders, we extended the model (Equation 9) to also include gender, handedness (binarized) and the log Frobenius norm of the variability of the motion parameters (three translations, three rotations) over the 241 acquired images.
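The two deconfounding variants can be sketched with NumPy as follows (toy data and names; the cubic age trend is our assumption for the example, as the text specifies only a polynomial regression):

```python
import numpy as np

rng = np.random.RandomState(4)
n = 300
age = rng.uniform(18, 88, n)
delta = rng.randn(n) * 5                   # toy brain age delta
score = 0.02 * age + 0.3 * delta + rng.randn(n)

def zscore(x):
    return (x - x.mean()) / x.std()

# two-step: residualize the score on (here: cubic) polynomial age terms ...
A = np.column_stack([np.ones(n), zscore(age), zscore(age)**2, zscore(age)**3])
score_age = A @ np.linalg.lstsq(A, score, rcond=None)[0]
resid = score - score_age
# ... then regress the residuals on the standardized brain age delta
B = np.column_stack([np.ones(n), zscore(delta)])
beta1_twostep = np.linalg.lstsq(B, resid, rcond=None)[0][1]

# joint model: age terms and delta entered side by side
C = np.column_stack([A, zscore(delta)])
beta1_joint = np.linalg.lstsq(C, score, rcond=None)[0][-1]
print(beta1_twostep, beta1_joint)
```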
Note that motion correction was already performed during preprocessing of MRI and fMRI. Likewise, MEG source localization took into account individual head geometry as well as potentially confounding environmental noise through whitening with the noise covariance obtained from empty-room recordings. Following the work by Liem et al., 2017, we included total grey matter and total intracranial volume as important features of interest among the MRI features.
MEG data processing
Data acquisition
MEG was recorded at a single site using a 306-channel VectorView system (Elekta Neuromag, Helsinki). This system, equipped with 102 magnetometers and 204 orthogonal planar gradiometers, is placed in a light magnetically shielded room. During acquisition, an online filter between around 0.03 Hz and 1000 Hz was applied, and data were sampled at a frequency of 1000 Hz. To support offline artifact correction, vertical and horizontal electrooculogram (VEOG, HEOG) as well as electrocardiogram (ECG) signals were recorded concomitantly. Four Head-Position Indicator (HPI) coils were used to measure the position of the head. All types of recordings, that is, resting-state, passive stimulation and the active task, lasted about 8 min. For additional details on MEG acquisition, please consider the reference publications of the CamCAN dataset (Taylor et al., 2017; Shafto et al., 2014). The following sections describe the custom data processing conducted in our study.
Artifact removal
Environmental artifacts
To mitigate contamination of the MEG signal with artifacts produced by environmental magnetic sources, we applied temporal signal-space separation (tSSS) (Taulu and Kajola, 2005). The method uses a spherical harmonic decomposition to separate spatial patterns produced by sources inside the head from patterns produced by external sources. We used the default settings with eight components for the harmonic decomposition of the internal sources and three for the external sources, on a ten-second sliding window. We used a correlation threshold of 98% to ignore segments in which inner and outer signal components are poorly distinguishable. We performed no movement compensation, since no continuous head monitoring data were available at the time of our study. The origin of the internal and external multipolar moment space was estimated based on the head digitization. We computed tSSS using the MNE maxwell_filter function (Gramfort et al., 2013) but relied on the SSS processing log files from CamCAN for defining bad channels.
Physiological artifacts
To mitigate signal distortions caused by eye movements and heartbeats, we used signal-space projection (SSP) (Uusitalo and Ilmoniemi, 1997). This method learns principal components on contaminated data segments and then projects the signal into the subspace orthogonal to the artifact. To obtain clean estimates, we excluded bad data segments from the EOG/ECG channels using the ‘global’ option from autoreject (Jas et al., 2017). We then averaged the artifact-evoked signal (see the ‘average’ option in mne.preprocessing.compute_proj_ecg) to enhance subspace estimation and considered only a single projection vector to preserve as much signal as possible.
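The idea behind SSP can be illustrated with a plain NumPy sketch (this is an illustration of the principle, not the MNE implementation; all data are synthetic):

```python
import numpy as np

rng = np.random.RandomState(5)
n_ch, n_times = 20, 1000
# fixed spatial pattern of a strong artifact (e.g., heartbeat field)
artifact_pattern = rng.randn(n_ch)
artifact_pattern /= np.linalg.norm(artifact_pattern)
brain = rng.randn(n_ch, n_times)
data = brain + np.outer(artifact_pattern, 10 * rng.randn(n_times))

# dominant principal component of the contaminated segments
u, _, _ = np.linalg.svd(data, full_matrices=False)
# rank-1 SSP operator: project onto the subspace orthogonal to the artifact
proj = np.eye(n_ch) - np.outer(u[:, 0], u[:, 0])
cleaned = proj @ data
```

Keeping a single projection vector, as in the analysis above, removes the dominant artifact direction while sacrificing only one spatial dimension of signal.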
Rejection of residual artifacts
To avoid contamination with artifacts that were not removed by SSS or SSP, we used the ‘global’ option from autoreject (Jas et al., 2017). This yielded a data-driven selection of the amplitude range above which data segments were excluded from the analysis.
Temporal filtering
To study band-limited brain dynamics, we applied band-pass filtering using the frequency band definitions in Table 1. We used the default filter settings from the MNE software (development version 0.19) with a windowed time-domain design (firwin) and a Hamming taper. Filter length and transition bandwidth were set using the ‘auto’ option and depended on the data.
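A windowed time-domain FIR design with a Hamming taper can be sketched with SciPy as follows (analogous in spirit to MNE's default 'firwin' design; the band edges and filter length here are assumptions for the example, as MNE chooses them adaptively):

```python
import numpy as np
from scipy.signal import firwin, filtfilt

sfreq = 1000.0                      # Hz, as in the CamCAN recordings
l_freq, h_freq = 8.0, 15.0          # example band edges (assumed)
n_taps = 1651                       # odd filter length (assumed)

# windowed time-domain band-pass design with a Hamming taper
h = firwin(n_taps, [l_freq, h_freq], pass_zero=False,
           fs=sfreq, window='hamming')
rng = np.random.RandomState(6)
signal = rng.randn(10 * int(sfreq))  # 10 s of white noise
filtered = filtfilt(h, 1.0, signal)  # zero-phase application
```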
Epoching
For the active and passive tasks, we considered time windows between −200 and 700 ms around stimulus onset and decimated the signal by retaining every eighth time sample.
For resting-state, we considered sliding windows of 5 s duration with no overlap and no baseline correction. To reduce computation time, we retained the first 5 min of the recording and decimated the signal by retaining every fifth time sample. Given the sampling frequency of 1000 Hz, this left the bulk of the features unaffected, only reducing the spectral resolution in the high gamma band to 75–100 Hz (instead of 75–120 Hz in the definition proposed by the Human Connectome Project [Larson-Prior et al., 2013]).
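The resting-state windowing and decimation amount to simple array manipulations, sketched here on fake data (the channel count is reduced from 306 to keep the example light):

```python
import numpy as np

sfreq = 1000                      # Hz
n_ch = 10                         # reduced for the example (306 in reality)
rng = np.random.RandomState(7)
raw = rng.randn(n_ch, 5 * 60 * sfreq)   # first 5 min of fake MEG

n_epochs, win = 60, 5 * sfreq     # sixty non-overlapping 5 s windows
epochs = raw.reshape(n_ch, n_epochs, win).transpose(1, 0, 2)
epochs = epochs[..., ::5]         # keep every fifth sample: new sfreq = 200 Hz
print(epochs.shape)
```

After decimation the effective Nyquist frequency is 100 Hz, which is why the high gamma band is capped at 100 Hz as described above.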
Channel selection
It is important to highlight that after SSS, the magnetometer and gradiometer data are re-projected from a common lower-dimensional SSS coordinate system that typically spans between 64 and 80 dimensions. After SSS, magnetometers and gradiometers are reconstructed from the same basis vectors, which makes them linear combinations of one another (Taulu and Kajola, 2005). As a result, both sensor types contain highly similar information and yield equivalent results in many situations (Garcés et al., 2017). Consequently, after applying SSS, the MNE software treats the data as a single sensor type for source localization and uses the number of underlying SSS dimensions, instead of the number of channels, as the degrees of freedom. Note, however, that after SSS, magnetometers and gradiometers can still yield systematically different results in sensor-space analyses despite being linear combinations of one another. This happens once a nonlinear transform, for example power, is applied in sensor space, which is explained by the fact that SSS is a linear transform and computing power in sensor space breaks linearity. On the other hand, once source localization is correctly performed, taking the SSS solution into account, differences between magnetometers and gradiometers become negligible for both linear and nonlinear transforms. We nevertheless used all 102 magnetometers and 204 gradiometers for source analysis to stick with a familiar configuration. Note that while shortcuts can be achieved by processing only one of the sensor types, they should be avoided when methods other than SSS are used for preprocessing. However, driven by initial visual exploration, for some aspects of feature engineering in sensor space, that is, extraction of alpha peaks or computation of 1/f power spectra, we used the 102 magnetometers. For extraction of evoked response latencies, we used the 204 gradiometers.
Nevertheless, because SSS combines the sensor types into one common representation, all analyses exploited magnetic fields sampled from both magnetometers and gradiometers, even if only one type of sensor was formally included.
Covariance modeling
To control the risk of overfitting in covariance modeling (Engemann and Gramfort, 2015), we used a penalized maximum-likelihood estimator implementing James-Stein shrinkage (James and Stein, 1992) of the form
$${\mathrm{\Sigma}}_{\mathrm{shrunk}}=(1-\alpha )\,\widehat{\mathrm{\Sigma}}+\alpha \,\frac{\mathrm{tr}(\widehat{\mathrm{\Sigma}})}{p}\,I,$$
where $\alpha $ is the regularization strength, $\widehat{\mathrm{\Sigma}}$ is the maximum-likelihood estimator, $p$ is the number of features and $I$ the identity matrix. Intuitively, this amounts to pushing the covariance toward the identity matrix. Here, we used Oracle Approximating Shrinkage (OAS) (Chen et al., 2010) to compute the shrinkage factor $\alpha $ analytically.
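The OAS estimator is available in scikit-learn; the sketch below (toy data) verifies that its output is exactly the convex combination of the empirical covariance and a scaled identity described above:

```python
import numpy as np
from sklearn.covariance import OAS

rng = np.random.RandomState(8)
X = rng.randn(50, 30)               # few samples relative to features

oas = OAS().fit(X)
emp = np.cov(X, rowvar=False, bias=True)   # maximum-likelihood estimate
mu = np.trace(emp) / emp.shape[0]
alpha = oas.shrinkage_                     # analytically computed factor
# same convex combination as in the shrinkage formula
manual = (1 - alpha) * emp + alpha * mu * np.eye(emp.shape[0])
print(np.allclose(manual, oas.covariance_))
```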
Source localization
To estimate the cortical generators of the MEG signal, we employed cortically constrained Minimum-Norm Estimates (MNE) (Hämäläinen and Ilmoniemi, 1994) based on the individual anatomy of the subjects. The resulting projection operator depends exclusively on the subject's anatomy and on whitening based on the noise covariance. Beamforming methods, on the other hand, take the segments of MEG data to be source-localized into account through the data covariance. Methods from the MNE family are therefore also referred to as non-adaptive spatial filters, whereas beamforming methods are referred to as adaptive spatial filters. The MNE operator can be expressed as
$$W={G}^{\top }{\left(G{G}^{\top }+{\lambda }^{2}I\right)}^{-1}.$$
Here, $G\in {\mathbb{R}}^{P\times Q}$, with $P$ sensors and $Q$ sources, denotes the forward model quantifying the spread from sources to M/EEG observations, and $\lambda $ is a regularization parameter that controls the ${\mathrm{\ell}}_{2}$-norm of the activity coefficients. This parameter implicitly controls the spatial complexity of the model, with larger regularization strengths leading to more spatially smeared solutions. The forward model is obtained by numerically solving Maxwell’s equations based on the estimated head geometry, which we obtained from the FreeSurfer brain segmentation. Note that, from a statistical perspective, the MNE solution is a ridge model (see Equations 3–4) predicting the magnetic field at a given sensor from a linear combination of the corresponding entries in the lead fields. The inferred source activity is given by multiplication of the MNE operator with the sensor-level magnetic fields.
We estimated the source amplitudes on a grid of 8196 candidate dipole locations equally spaced along the cortical mantle. We used spatial whitening to approximate the model assumption of Gaussian noise (Engemann and Gramfort, 2015). The whitening operator was based on the empty-room noise covariance and applied to the MEG signal and the forward model. We applied no noise normalization and used the default depth weighting (Lin et al., 2006) as implemented in the MNE software (Gramfort et al., 2014), with a weighting factor of 0.8 and a loose orientation constraint of 0.2. The squared regularization parameter ${\lambda}^{2}$ was expressed with regard to the signal-to-noise ratio and fixed at the default value of $\frac{1}{{\mathrm{SNR}}^{2}}$ with $\mathrm{SNR}=3$ for all subjects. This conservative choice was also motivated by the computational burden of optimizing the regularization parameter: doing so would have required precomputing hundreds of MNE solutions to then perform grid search over the derived source-level outputs. As the goal was prediction from the source-localized signals, not inference on spatial effects, we instead relied on the subsequent data-driven shrinkage through the level-1 ridge model (see Equations 3–4). It may be worthwhile to systematically investigate the interplay between shrinkage at the MNE level and at the ridge level for predictive modeling with MEG in future research.
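On whitened data, the MNE operator reduces to a few lines of linear algebra. The toy sketch below uses a random stand-in for the lead field (depth weighting and the loose orientation constraint are omitted for clarity):

```python
import numpy as np

rng = np.random.RandomState(9)
P, Q = 306, 8196                    # sensors, candidate source locations
G = rng.randn(P, Q) / np.sqrt(Q)    # stand-in for the (whitened) lead field
snr = 3.0
lam2 = 1.0 / snr ** 2               # default regularization, SNR = 3

# MNE operator: W = G^T (G G^T + lambda^2 I)^{-1}
W = G.T @ np.linalg.inv(G @ G.T + lam2 * np.eye(P))
M = rng.randn(P, 100)               # whitened sensor-level data (toy)
source_activity = W @ M             # (Q, n_times) source amplitudes
```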
MRI data processing
Data acquisition
For additional details on data acquisition, please consider the reference publications of the CamCAN dataset (Taylor et al., 2017; Shafto et al., 2014). The following sections describe the custom data processing conducted in our study.
Structural MRI
For preprocessing of structural MRI data, we used the FreeSurfer software (version 6.0; http://surfer.nmr.mgh.harvard.edu/) (Fischl, 2012). Reconstruction included the following steps (adapted from the methods citation recommended by the authors of FreeSurfer, http://freesurfer.net/fswiki/FreeSurferMethodsCitation): motion correction and averaging of multiple volumetric T1-weighted images (Reuter et al., 2010), removal of non-brain tissue (Ségonne et al., 2004), automated Talairach transformation, segmentation of the subcortical white matter and deep gray matter volumetric structures (Fischl et al., 2002; Fischl et al., 2004), intensity normalization (Sled et al., 1998), tessellation of the gray matter/white matter boundary, automated topology correction (Fischl et al., 2001; Ségonne et al., 2004), and surface deformation following intensity gradients (Dale et al., 1999; Fischl and Dale, 2000). Once cortical models were computed, so-called deformable procedures were applied, including surface inflation (Fischl et al., 1999), registration to a spherical atlas (Fischl et al., 1999) and cortical parcellation (Desikan et al., 2006).
fMRI
The available fMRI data were visually inspected. Volumes were excluded from the study if they showed severe imaging artifacts or head movements with an amplitude larger than 2 mm. After rejection of corrupted data, we obtained a subset of 626 subjects for further investigation. The fMRI volumes underwent slice-timing correction and motion correction to the mean volume. Following that, coregistration between anatomical and functional volumes was done for every subject. Finally, brain tissue segmentation was done for every volume, and the output data were morphed to MNI space.
Scientific computation and software
Computing environment
For preprocessing and feature extraction of MEG, MRI and fMRI, we used a high-performance Linux server (72 cores, 376 GB RAM) running Ubuntu Linux 18.04.1 LTS. For subsequent statistical modeling, we used a gold 12″ Apple MacBook (early 2016) running macOS Mojave (8 GB RAM). General-purpose computation was carried out using the Python (3.7.3) language and the scientific Python stack, including NumPy, SciPy, Pandas, and Matplotlib. For embarrassingly parallel processing, we used the joblib library.
MEG processing
For MEG processing, we used the MNE-Python software (https://mne.tools) (Gramfort et al., 2014) (version 0.19). All custom analysis code was scripted in Python and is shared in a dedicated repository including a small library and scripts (see section Code availability).
MRI and fMRI processing
For anatomical reconstruction, we used the shell scripts provided by the FreeSurfer software (version 6.0) (Fischl et al., 2002). We used the pypreprocess package, which reimplements parts of the SPM12 software for the analysis of brain images (The Wellcome Centre for Human Neuroimaging, 2018), complemented by the Python-Matlab interface from Nipype (Gorgolewski et al., 2011). For feature extraction and processing related to predictive modeling with MRI and fMRI, we used the NiLearn package (Abraham et al., 2014).
Statistical modeling
For predictive modeling, we used the scikit-learn package (Pedregosa et al., 2011) (version 0.21). We used the R (3.5.3) language and its graphical ecosystem (R Development Core Team, 2019; Wickham, 2016; Slowikowski, 2019; Clarke and Sherrill-Mix, 2017; Canty and Ripley, 2017) for statistical visualization of data. For computation of ranking statistics, we used the pmr R package (Lee and Yu, 2013).
Code availability
We share all code used for this publication on GitHub: https://github.com/dengemann/megmrisurrogatebiomarkersaging2020 (Engemann, 2020; https://github.com/elifesciencespublications/megmrisurrogatebiomarkersaging2020). Our stacked model architecture can be compactly expressed using the StackingRegressor class in scikit-learn (Pedregosa et al., 2011) as of version 0.22.
References

1
Machine learning for neuroimaging with scikitlearnFrontiers in Neuroinformatics 8:14.https://doi.org/10.3389/fninf.2014.00014
 2

3
The first night effect: an EEG study of sleepPsychophysiology 2:263–266.https://doi.org/10.1111/j.14698986.1966.tb02650.x

4
Sensitivity of MEG and EEG to source orientationBrain Topography 23:227–232.https://doi.org/10.1007/s105480100154x
 5
 6
 7
 8
 9
 10

11
Dalex: explainers for complex predictive models in rThe Journal of Machine Learning Research 19:3245–3249.
 12
 13
 14

15
Exploration, inference, and prediction in neuroscience and biomedicineTrends in Neurosciences 42:251–262.https://doi.org/10.1016/j.tins.2019.02.001
 16

17
Boot: Bootstrap R (SPlus) FunctionsR Package.

18
Shrinkage algorithms for MMSE covariance estimationIEEE Transactions on Signal Processing. pp. 5016–5029.https://doi.org/10.1109/TSP.2010.2053029

19
ggbeeswarm: Categorical Scatter (Violin Point) PlotsR Package.

20
Prediction of brain age suggests accelerated atrophy after traumatic brain injuryAnnals of Neurology 77:571–581.https://doi.org/10.1002/ana.24367
 21
 22
 23
 24
 25

26
Computer Age Statistical InferenceCambridge University Press.https://doi.org/10.1017/CBO9781316576533
 27
 28
 29
 30
 31

32
Automated manifold surgery: constructing geometrically accurate and topologically correct models of the human cerebral cortexIEEE Transactions on Medical Imaging 20:70–80.https://doi.org/10.1109/42.906426
 33

34
Sequenceindependent segmentation of magnetic resonance imagesNeuroImage 23 Suppl 1:S69–S84.https://doi.org/10.1016/j.neuroimage.2004.07.016
 35
 36

37
The relations among inhibition and interference control functions: a LatentVariable analysisJournal of Experimental Psychology: General 133:101–135.https://doi.org/10.1037/00963445.133.1.101
 38
 39
 40
 41
 42
 43

44
Highfrequency (600 hz) SEP activities originating in the subcortical and cortical human somatosensory systemElectroencephalography and Clinical Neurophysiology/Evoked Potentials Section 108:182–189.https://doi.org/10.1016/S01685597(97)001007

45
EEG beta band activity is related to attention and attentional deficits in the visual performance of elderly subjectsInternational Journal of Psychophysiology 89:334–341.https://doi.org/10.1016/j.ijpsycho.2013.05.007
 46

47
Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in PythonFrontiers in Neuroinformatics 5:13.https://doi.org/10.3389/fninf.2011.00013

48
MEG and EEG data analysis with MNEPythonFrontiers in Neuroscience 7:267.https://doi.org/10.3389/fnins.2013.00267
 49

50
Interpreting magnetic fields of the brain: minimum norm estimatesMedical & Biological Engineering & Computing 32:35–42.https://doi.org/10.1007/BF02512476

51
Timing of human cortical functions during cognition: role of MEGTrends in Cognitive Sciences 4:455–462.https://doi.org/10.1016/S13646613(00)015497

52
The elements of statistical learning: data mining, inference and predictionIn: R Tibshirani, J. H Friedman, T Hastie, editors. The Mathematical Intelligencer, 27. Springer. pp. 83–85.https://doi.org/10.1007/9780387848587
 53

54
Largescale cortical correlation structure of spontaneous oscillatory activityNature Neuroscience 15:884–890.https://doi.org/10.1038/nn.3101

55
BOLD fMRI correlation reflects FrequencySpecific neuronal correlationCurrent Biology 25:1368–1374.https://doi.org/10.1016/j.cub.2015.03.049
 56

57
What is the key mediator of the neurovascular coupling response?Neuroscience & Biobehavioral Reviews 96:174–181.https://doi.org/10.1016/j.neubiorev.2018.11.011

58
Recursive nearest agglomeration (ReNA): Fast clustering for approximation of structured signalsIEEE Transactions on Pattern Analysis and Machine Intelligence. pp. 669–681.https://doi.org/10.1109/TPAMI.2018.2815524

59
Estimation with quadratic lossIn: S Kotz, N. L Johnson, editors. Breakthroughs in Statistics. Springer. pp. 443–447.https://doi.org/10.1007/9781461209195_30
 60

61
Could a neuroscientist understand a microprocessor?PLOS Computational Biology 13:e1005268.https://doi.org/10.1371/journal.pcbi.1005268
 62

63
Local brain atrophy accounts for functional activity differences in normal agingNeurobiology of Aging 33:623.e1–62623.https://doi.org/10.1016/j.neurobiolaging.2011.02.021

64
Brainbased ranking of cognitive domains to predict schizophreniaHuman Brain Mapping 40:4487–4507.https://doi.org/10.1002/hbm.24716
 65
 66
 67

68
Characterizing the dynamics of mental representations: the temporal generalization methodTrends in Cognitive Sciences 18:203–210.https://doi.org/10.1016/j.tics.2014.01.002
 69
 70

71
A nonlinear simulation framework supports adjusting for age when analyzing BrainAGEFrontiers in Aging Neuroscience 10:317.https://doi.org/10.3389/fnagi.2018.00317

72
An R package for analyzing and modeling ranking dataBMC Medical Research Methodology 13:65.https://doi.org/10.1186/147122881365

73
Evidence for reactive magnetic 10Hz rhythm in the human auditory cortexNeuroscience Letters 222:111–114.https://doi.org/10.1016/S03043940(97)133614
 74
 75
 76

77
Modular preprocessing pipelines can reintroduce artifacts into fMRI dataHuman Brain Mapping 40:2358–2376.https://doi.org/10.1002/hbm.24528

78
Understanding variable importances in forests of randomized treesAdvances in Neural Information Processing Systems. pp. 431–439.

79
Dictionary learning for massive matrix factorizationProceedings of the 33rd International Conference on Machine Learning, Volume 48 of Proceedings of Machine Learning Research. :. pp. 1737–1746.

84
A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering. pp. 1345–1359. https://doi.org/10.1109/TKDE.2009.191

87
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

88
Integrating multimodal priors in predictive models for the functional characterization of Alzheimer’s disease. In: N Navab, J Hornegger, WM Wells, A Frangi, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham: Springer International Publishing. pp. 207–214. https://doi.org/10.1007/978-3-319-24571-3

91
Spontaneous alpha peak frequency predicts working memory performance across the age span. International Journal of Psychophysiology 53:1–9. https://doi.org/10.1016/j.ijpsycho.2003.12.011

92
Hippocampal-DMN disconnectivity in MS is related to WM lesions and depression. Human Brain Mapping 36:5051–5063. https://doi.org/10.1002/hbm.22992

93
Obesity associated with increased brain age from midlife. Neurobiology of Aging 47:63–70. https://doi.org/10.1016/j.neurobiolaging.2016.07.010

94
Manifold-regression to predict from MEG/EEG brain signals without source modeling. Advances in Neural Information Processing Systems.

99
Averaging correlation coefficients: should Fisher's z transformation be used? Journal of Applied Psychology 72:146–148. https://doi.org/10.1037/0021-9010.72.1.146

101
EEG electrodes for in-flight monitoring. Psychophysiology 2:161–167. https://doi.org/10.1111/j.1469-8986.1965.tb03260.x

102
A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging 17:87–97. https://doi.org/10.1109/42.668698

103
ggrepel: Automatically Position Non-Overlapping Text Labels with ’ggplot2’, version 0.8.1. R Package.

105
Cellular changes in the postmortem hippocampus in major depression. Biological Psychiatry 56:640–650. https://doi.org/10.1016/j.biopsych.2004.08.022

106
Decoding rich spatial information with high temporal resolution. Trends in Cognitive Sciences 19:636–638. https://doi.org/10.1016/j.tics.2015.08.016

108
Presentation of electromagnetic multichannel data: the signal space separation method. Journal of Applied Physics 97:124905. https://doi.org/10.1063/1.1935742

111
SPM: Statistical Parametric Mapping.

115
Signal-space projection method for separating MEG or EEG into components. Medical & Biological Engineering & Computing 35:135–140. https://doi.org/10.1007/BF02534144

116
Altered transient brain dynamics in multiple sclerosis: treatment or pathology? Human Brain Mapping 40:4789–4800. https://doi.org/10.1002/hbm.24737

117
Detection of Brain Functional-Connectivity Difference in Post-stroke Patients Using Group-Level Covariance Modeling. In: T Jiang, N Navab, JPW Pluim, MA Viergever, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2010. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 200–208. https://doi.org/10.1007/978-3-642-15705-9

120
Age-Related changes in 1/f neural electrophysiological noise. Journal of Neuroscience 35:13257–13265. https://doi.org/10.1523/JNEUROSCI.2332-14.2015

121
Ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag. https://doi.org/10.1007/978-0-387-98141-3

123
Building better biomarkers: brain models in translational neuroimaging. Nature Neuroscience 20:365–377. https://doi.org/10.1038/nn.4478
Decision letter

Alexander Shackman, Reviewing Editor; University of Maryland, United States

Floris P de Lange, Senior Editor; Radboud University, Netherlands

Kamen Tsvetanov, Reviewer

Nelson Trujillo-Barreto, Reviewer; The University of Manchester, United Kingdom
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
The authors describe a novel machine learning approach, opportunistic prediction stacking, with the aim of combining multiple neuroimaging modalities into a single predictive model (“biomarker”) in the face of missing data. Leveraging structural MRI, functional MRI, and MEG data from a large public biobank (CamCAN), the authors show that each modality had additive incremental value in predicting age.
The reviewers and I were enthusiastic about the report. This approach has the potential to extract valuable information from multimodal databases and improve diagnostic power based on the surrogate biomarker idea. From a translational neuroscience perspective, the paper addresses (1) whether adding MEG improves age prediction and provides incremental predictive validity over sMRI or sMRI + fMRI alone; (2) which MEG features are most predictive of age. From a methodological perspective, the paper extends the original stacking method to address missing data, a common occurrence.
Decision letter after peer review:
Thank you for submitting your article "Combining electrophysiology with MRI enhances learning of surrogate-biomarkers" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Alex Shackman as the Reviewing Editor and Floris de Lange as the Senior Editor. The following individuals involved in the review of your submission have agreed to reveal their identity: Kamen Tsvetanov (Reviewer #1); Nelson Trujillo-Barreto (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The authors describe a novel machine learning approach, opportunistic prediction stacking, with the aim of combining multiple neuroimaging modalities into a single predictive model (“biomarker”) in the face of missing data. Leveraging structural MRI, functional MRI, and MEG data from a large public biobank (CamCAN), the authors show that each modality had additive incremental value in predicting age.
The reviewers and I were enthusiastic about the report:
• The paper is written and presented well.
• The research idea is interesting and timely and the proposed method has the potential to extract valuable information from multimodal databases and improve diagnostic power based on the surrogate biomarker idea.
• From a translational neuroscience perspective, the paper addresses (1) whether adding MEG improves age prediction and provides incremental predictive validity over sMRI or sMRI + fMRI alone; (2) which MEG features are most predictive of age.
• From a methodological perspective, the paper extends the original stacking method to address missing data, a common occurrence.
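To make the summarized approach concrete, below is a minimal sketch of opportunistic prediction stacking on synthetic data. This is an illustration of the idea, not the authors' actual pipeline: the modality names, noise levels, fill values and missingness rates are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(42)
n = 200
age = rng.uniform(18, 88, n)

# Three synthetic "modalities" noisily encoding age (stand-ins for MRI, fMRI, MEG).
modalities = {name: age[:, None] + rng.randn(n, 50) * sd
              for name, sd in [("mri", 5.0), ("fmri", 8.0), ("meg", 10.0)]}

# Layer I: one ridge model per modality; out-of-fold age predictions become features.
layer1 = np.column_stack([
    cross_val_predict(RidgeCV(alphas=np.logspace(-3, 5, 30)), X, age, cv=10)
    for X in modalities.values()
])

# Opportunistic coding: each prediction column is duplicated, with missing entries
# set once to a very small and once to a very large value, so that tree splits can
# route subjects with an absent modality to either side of a split.
missing = rng.rand(n, layer1.shape[1]) < 0.1   # simulate ~10% missing per modality
low, high = layer1.copy(), layer1.copy()
low[missing], high[missing] = -1000.0, 1000.0
layer2_in = np.hstack([low, high])

# Layer II: a random forest combines the modalities and learns around missingness.
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(layer2_in, age)
```

The duplicated coding of missing predictions means no subject has to be dropped: the Layer-II forest can exploit whichever routing of absent modalities predicts best.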
Challenges and Recommendations:
Nevertheless, our enthusiasm was somewhat tempered by several key limitations of the report. In this section, I briefly summarize the most important comments.
1) Challenge: – The motivation for including MEG seems to be exclusively based on EEG evidence. In the Introduction, all the evidence about the relationship between electrophysiology and ageing (and the complementarity to fMRI in that context) to motivate its inclusion as an additional modality is drawn from the EEG literature, which is odd, given that the paper uses MEG instead.
Recommendation: – The authors should motivate the use of MEG directly or at least link the evidence from EEG to the use of MEG in a more convincing way. This might be as simple as saying that, although a lot of evidence from EEG is available, there is no available EEG data in multimodal databases (which I think would be a fair statement in general), but given the relationship between the two techniques, it is expected that inclusion of MEG would be valuable as well, this link is not clear in the text.
2) Challenge: – The Materials and methods are not entirely clear. Some of the processing/feature extraction steps are not completely explained or appropriately justified. The taxonomy used for the extracted features in MEG is confusing.
Recommendation: – Clarify the Materials and methods.
3) Challenge: – Missing Statistics. – Statistical significance of some of the results is missing: many of the results do not report any statistical significance threshold (or p-value), which makes claims about results being above chance not rigorously justified (e.g. “…All stacking models performed consistently better than chance…”). What does it mean to perform better than chance here? Was a statistical significance test carried out? What was the null hypothesis tested? If so, report the p-value or equivalent used. In some cases, statistical significance is obvious; in others, it is difficult to assess via visual inspection.
Recommendation: – Report statistical significance of any comparisons made (either corrected or not) in the main report and in the supplement e.g. for MAE differences and MAE PE correlations.
4) Challenge: – Generality. – It is important to demonstrate the robustness and reliability of these features to generalize to unseen data.
Recommendation: – The authors could readily address this in the available data by splitting the sample in half (while maintaining the age distribution and data missingness) and testing how similar the loadings of each feature are across data splits. The process could be repeated many times (1000s) to create a distribution, which can be compared to the distribution from permuted data.
5) Challenge: – Non-random Missing Data. – It is important to show that the approach is not susceptible to confounds in the missing data (non-random missingness; e.g. more missing data coming from older individuals, which “helps” the model learn age-related effects).
Recommendation: – This could be easily addressed e.g. by comparing model performance between two scenarios of missingness in the fully available dataset. In one scenario the missing data come from subjects with uniform age distribution and in the other scenario a bias in age selection is introduced, i.e. larger portion of missing data comes from older individuals.
6) Challenge: – Need to better integrate or segregate the main report and the supplement. Extensive text is dedicated to supplementary figures (e.g. Figure 2 and Figure 4 figure supplements).
Recommendation: – If the supplementary material is so important for reaching the conclusions of the paper, perhaps it would be more helpful to include it in the body of the paper. Otherwise, I would suggest that the authors move the relevant explanations to the supplements.
7) Challenge: – The authors demonstrate that their approach can identify variability in the detected signals with behavioral relevance. The authors investigated this question by testing relations between brain age residuals and cognitive measures orthogonalized with respect to chronological age. It is unclear why the authors have adopted this modular process, which has faced some criticism, as opposed to a single-level model that can flexibly accommodate all variables in a single model (Lindquist, Geuter, Wager and Caffo, 2019).
Recommendation: – Authors should at least address this choice in the paper. (Lindquist, et al., 2019)
8) Challenge: – The authors do not seem to have considered the inclusion of variables of no interest, which may lead to spurious correlations.
Recommendation: – Variables such as gender, handedness, head motion, total grey matter and total intracranial volume should be controlled for to ensure the specificity of the findings. One way to address both points would be to evaluate a model with multiple predictors including age, a cognitive variable (e.g. Cattell) and covariates of no interest, where the dependent variable is brain age. Then the statistics on the cognitive variable of interest, which would indicate its unique contribution to predicting brain age over and above chronological age and the covariates (i.e. individual variability), can be reported in the format of Figure 3.
9) Challenge: – The previous comment raised the concern about covariates of no interest and their consideration in the post hoc analysis.
Recommendation: – It would be really worthwhile having the authors' view whether the approach can/should include covariates of no interest in Layer I. This could be particularly relevant if each modality is associated with "unique" covariates of no interest (e.g. head motion in fMRI, or empty room recording SNR in MEG). Though, I assume that the combination of multiple datasets will reduce this bias, which brings me to my next point.
10) Recommendation: – It would be really worthwhile if the authors could consider reporting what part of the data in Level I contributed to the overall model performance, i.e. what is the topography in each modality that matters (e.g. atrophy in frontal regions, connectivity in DMN etc.)?
11) Challenge: – Some of the results or explanations for the results in the Discussion are not intuitive or perhaps adequately explained in the context of the existent literature and the design of the analysis.
Recommendation: – Revise the Discussion to address this concern.
12) Challenge: – Limitations – Limitations of the methodology (i.e. its assumptions) or the choice of the features (i.e. static vs dynamic) are not adequately put in context or discussed.
Recommendation: – Authors need to, at least briefly, discuss about how potential assumption violations might affect the results.
https://doi.org/10.7554/eLife.54055.sa1
Author response
Challenges and Recommendations:
Nevertheless, our enthusiasm was somewhat tempered by several key limitations of the report. In this section, I briefly summarize the most important comments.
1) Challenge: The motivation for including MEG seems to be exclusively based on EEG evidence. In the Introduction, all the evidence about the relationship between electrophysiology and ageing (and the complementarity to fMRI in that context) to motivate its inclusion as an additional modality is drawn from the EEG literature, which is odd, given that the paper uses MEG instead.
Recommendation: The authors should motivate the use of MEG directly or at least link the evidence from EEG to the use of MEG in a more convincing way. This might be as simple as saying that, although a lot of evidence from EEG is available, there is no available EEG data in multimodal databases (which I think would be a fair statement in general), but given the relationship between the two techniques, it is expected that inclusion of MEG would be valuable as well, this link is not clear in the text.
We thank the reviewers for this suggestion. We have now reworked the arguments connecting MEG and EEG to yield a more coherent and stringent motivation and to emphasize their differences more concretely. We have also expanded our review of multimodal MRI/electrophysiology datasets. Note that we have also adjusted the scope of the title to refocus the paper more strongly on MEG.
The paper is now entitled “Combining magnetoencephalography with magnetic resonance imaging enhances learning of surrogate-biomarkers”.
Please find below one new passage from the Introduction that best captures the spirit of this particular revision:
“At this point, there are very few multimodal databases providing access to electrophysiology alongside MRI and fMRI. […] MEG is, therefore, an interesting modality in its own right for developing neurocognitive biomarkers while its close link with EEG may potentially open the door to translatable electrophysiology markers suitable for massive deployment with clinical EEG.”
2) Challenge: The Materials and methods are not entirely clear. Some of the processing/feature extraction steps are not completely explained or appropriately justified. The taxonomy used for the extracted features in MEG is confusing.
Recommendation: Clarify the Materials and methods.
We thank the reviewers for having shared their concerns. We have worked through the list of issues and revised the manuscript.
3) Challenge: Missing Statistics. Statistical significance of some of the results is missing: many of the results do not report any statistical significance threshold (or p-value), which makes claims about results being above chance not rigorously justified (e.g. “…All stacking models performed consistently better than chance…”). What does it mean to perform better than chance here? Was a statistical significance test carried out? What was the null hypothesis tested? If so, report the p-value or equivalent used. In some cases, statistical significance is obvious; in others, it is difficult to assess via visual inspection.
Recommendation: Report statistical significance of any comparisons made (either corrected or not) in the main report and in the supplement e.g. for MAE differences and MAE PE correlations.
We understand the reviewers' desire for a numerical estimate of statistical significance. While p-values can in principle be readily computed in the current setting for the question of whether a model performed better than chance, e.g., by permuting the labels, the situation is less clear for model comparisons targeting differences in performance between two models. Rejecting a null hypothesis that differences between models are due to chance would require many independent datasets. Here, we computed uncertainty estimates of paired differences using repeated 10-fold cross-validation. For reasons of computational tractability, we estimated chance-level prediction using a dummy regressor that predicts the average of the training-set target, using the same cross-validation procedure and identical random seeds to ensure split-wise comparability with non-trivial models.
To provide a more compact summary of the performance distributions that may help the reader assess statistical inference, we extracted additional summary statistics of the distributions of mean absolute error scores and their paired differences with a reference model, where appropriate. These summaries included the mean, the standard deviation, the 2.5th and 97.5th percentiles, and the number of splits in which the model was better than the reference. We have reworked the main text to make the statistical approach explicit and supplemented our report of the main findings in the main text with these numerical uncertainty statistics. The following new section in the Materials and methods section explains our position and our methodological approach:
“Statistical Inference
Rejecting a null hypothesis regarding differences between two cross-validated models is problematic in the absence of sufficiently large unseen data or independent datasets: cross-validated scores are not statistically independent. […] It should be clear, however, that hypothesis testing, here, provides a quantitative orientation that needs to be contextualized by empirical estimates of effect sizes and their uncertainty to support inference.”
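The chance-level estimation described above can be sketched as follows. The data are synthetic and the exact model settings are assumptions, but the dummy regressor, the repeated 10-fold cross-validation with identical seeds, and the split-wise paired differences mirror the procedure described:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold

rng = np.random.RandomState(0)
X = rng.randn(300, 20)
y = X[:, :5].sum(axis=1) + rng.randn(300)  # synthetic target

# Identical splits (same random seed) for the model and the dummy regressor.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
model_mae, dummy_mae = [], []
for train, test in cv.split(X):
    ridge = RidgeCV().fit(X[train], y[train])
    model_mae.append(np.mean(np.abs(ridge.predict(X[test]) - y[test])))
    dummy = DummyRegressor(strategy="mean").fit(X[train], y[train])
    dummy_mae.append(np.mean(np.abs(dummy.predict(X[test]) - y[test])))

# Split-wise paired differences summarize uncertainty without a formal p-value.
diff = np.array(model_mae) - np.array(dummy_mae)
summary = dict(mean=diff.mean(), sd=diff.std(),
               pct=np.percentile(diff, [2.5, 97.5]),
               prop_better=(diff < 0).mean())
```

Because both learners see exactly the same splits, each entry of `diff` is a paired comparison, and the percentiles and proportion-better statistics play the role of the uncertainty summaries reported in the text.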
Finally, in light of the cumulative comments made regarding the analysis of prediction errors across age, we have reworked that analysis to be represented more clearly visually and better integrated into the main text as Figure 2—figure supplement 2. In the process, we have attenuated our conclusions to be more nuanced and attempted to formalize inference by using an ANOVA model with age group, model family and their interactions as factors.
4) Challenge: Generality. It is important to demonstrate the robustness and reliability of these features to generalize to unseen data.
Recommendation: The authors could readily address this in the available data by splitting the sample in half (while maintaining the age distribution and data missingness) and testing how similar the loadings of each feature are across data splits. The process could be repeated many times (1000s) to create a distribution, which can be compared to the distribution from permuted data.
We thank the reviewers for this suggestion. Our principal method across analyses for assessing the robustness and reliability of features was model comparison. This approach allowed us to track changes in performance as semantically and logically related blocks of intercorrelated variables were included or excluded. We used cross-validation to obtain an asymptotically unbiased estimate of the expected generalization error and its uncertainty distribution. Note that cross-validation already implements the resampling procedure suggested in this comment, with the difference that 90% of the data were used for training in each round; duplication of procedures is not necessary, as the cross-validation distribution is sufficient to obtain useful inferences (see point above). However, we strongly agree that, while expectations and their uncertainties are captured by the visualization and the newly added summary statistics, the relative stability of the model rankings may not be obvious from our previous reports. We have therefore extended the reports to visualize and quantify the out-of-sample stability of the model ranking across the testing-set splits, which has given rise to novel supplementary figures.
For the specific inspection of the MEG model, we now report additional results based on two alternative variable importance metrics: 1) out-of-sample permutation importance averaged across cross-validation splits, which may yield an estimate of variable importance that is less prone to overfitting, and 2) MDI importance, potentially sensitive to conditional dependencies between the variables but more prone to overfitting and false negatives/false positives. These additional analyses suggested that the importance ranking was highly consistent across methods, with intercorrelations above 0.9 (Spearman rank correlation).
We have now reported the additional results in new supplementary figures, Figure 2—figure supplement 1 and Figure 4—figure supplements 1 and 2.
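A comparison of the two variable-importance metrics can be sketched on synthetic data (the features, effect sizes and split are hypothetical), using scikit-learn's out-of-sample permutation importance and the forest's impurity-based (MDI) importance:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.randn(400, 10)
weights = np.linspace(2.0, 0.0, 10)   # feature 0 most informative, feature 9 noise
y = X @ weights + rng.randn(400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_tr, y_tr)

mdi = rf.feature_importances_                     # impurity-based (MDI) importance
perm = permutation_importance(rf, X_te, y_te,     # out-of-sample permutations
                              n_repeats=30, random_state=1).importances_mean

rho, _ = spearmanr(mdi, perm)  # agreement between the two importance rankings
```

A high Spearman correlation between the two vectors corresponds to the consistency of rankings reported above; in the paper this comparison is averaged over cross-validation splits rather than a single train/test split.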
5) Challenge: Non-random Missing Data. It is important to show that the approach is not susceptible to confounds in the missing data (non-random missingness; e.g. more missing data coming from older individuals, which “helps” the model learn age-related effects).
Recommendation: This could be easily addressed e.g. by comparing model performance between two scenarios of missingness in the fully available dataset. In one scenario the missing data come from subjects with uniform age distribution and in the other scenario a bias in age selection is introduced, i.e. larger portion of missing data comes from older individuals.
We thank the reviewers for raising this point. We feel that we may not have stated clearly enough that the sensitivity of our method with regard to non-random missingness is not a bug but a feature. Our method, by design, necessarily learns from any non-random missingness, which can be desired or undesired in different contexts. We have now extended the related Results section to make this point explicit and proposed, as a diagnostic instrument for detecting non-random missingness, training the random forest on input data including only zeros or missingness indicators. In our case, the resulting model performance was well aligned with the distribution of chance-level scores, suggesting that missing values were not related to aging:
“It is important to emphasize that if missing values depend on age, the opportunistic model inevitably captures this information and, hence, bases its predictions on the non-random missing data. […] In the current setting, the model trained on missing-data indicators performed at chance level (Pr_{<Chance} = 30.00%, M = 0.65, SD = 1.68, P_{2.5,97.5} = [–2.96, 3.60]), suggesting that the missing values were not informative of age.”
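The proposed diagnostic can be sketched on synthetic data: a random forest trained only on missingness indicators should perform at chance when data are missing at random, but above chance when missingness depends on age. The scenario below (MEG absent for older participants) is a hypothetical illustration, not the CamCAN situation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(2)
n = 500
age = rng.uniform(18, 88, n)

# Indicator matrix: 1 where a modality's Layer-I prediction would be missing.
random_miss = (rng.rand(n, 3) < 0.2).astype(float)   # missing at random
biased_miss = random_miss.copy()
biased_miss[:, 2] = (age > 60).astype(float)         # e.g., no MEG above age 60

rf = RandomForestRegressor(n_estimators=200, random_state=2)
score_random = cross_val_score(rf, random_miss, age, cv=10,
                               scoring="neg_mean_absolute_error").mean()
score_biased = cross_val_score(rf, biased_miss, age, cv=10,
                               scoring="neg_mean_absolute_error").mean()
chance = -np.mean(np.abs(age - age.mean()))  # rough chance level: predict the mean
```

Under random missingness the indicator-only model hovers around chance, whereas age-dependent missingness makes it clearly better than chance, flagging the confound.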
6) Challenge: Need to better integrate or segregate the main report and the supplement – Extensive text is dedicated to supplementary figures (e.g. Figure 2 and Figure 4 figure supplements).
Recommendation: If the supplementary material is so important for reaching the conclusions of the paper, perhaps it would be more helpful to include it in the body of the paper. Otherwise, I would suggest that the authors move the relevant explanations to the supplements.
We have now revised the presentation of the results to yield a clearer division of labor between main text and supplement. Each supplementary analysis is now summarized by one sentence in the main text while providing the detailed contextual discussion in the respective supplementary figure captions:
For Figure 2 supplements:
“This additive component also became apparent when considering predictive simulations of how the model actually combined MEG, fMRI and MRI (Figure 2—figure supplement 2) using two-dimensional partial dependence analysis (Karrer et al., 2019; Hastie et al., 2005, chapter 10.13.2). Moreover, exploration of the age-dependent improvements through stacking suggests that stacking predominantly reduced prediction errors uniformly (Figure 2—figure supplement 3) instead of systematically mitigating brain age bias (Le et al., 2018; Smith et al., 2019).”
For Figure 4 supplements:
“Moreover, partial dependence analysis (Karrer et al., 2019; Hastie et al., 2005, chapter 10.13.2) suggested that the Layer-II random forest extracted nonlinear functions (Figure 4—figure supplement 3). Finally, the best stacked models scored lower errors than the best linear models (Figure 4—figure supplement 4), suggesting that stacking achieved more than mere variable selection by extracting non-redundant information from the inputs.”
7) Challenge: The authors demonstrate that their approach can identify variability in the detected signals with behavioral relevance. The authors investigated this question by testing relations between brain age residuals and cognitive measures orthogonalized with respect to chronological age. It is unclear why the authors have adopted this modular process, which has faced some criticism, as opposed to a single-level model that can flexibly accommodate all variables in a single model (Lindquist, Geuter, Wager and Caffo, 2019).
Recommendation: Authors should at least address this choice in the paper. (Lindquist et al., 2019).
8) Challenge: The authors do not seem to have considered the inclusion of variables of no interest, which may lead to spurious correlations.
Recommendation: Variables such as gender, handedness, head motion, total grey matter and total intracranial volume should be controlled for to ensure the specificity of the findings. One way to address both points would be to evaluate a model with multiple predictors including age, a cognitive variable (e.g. Cattell) and covariates of no interest, where the dependent variable is brain age. Then the statistics on the cognitive variable of interest, which would indicate its unique contribution to predicting brain age over and above chronological age and the covariates (i.e. individual variability), can be reported in the format of Figure 3.
We thank the reviewers for sharing this reference and suggesting extended deconfounding. In fact, our method was based on the discussion in Smith et al., 2019 (https://doi.org/10.1016/j.neuroimage.2019.06.017) and was pragmatically motivated: modular methods may be simpler to explain. To go beyond a modular model, we also fit a joint model with polynomial confounds, such that score = brain age + poly(age, 3) + error, and then extracted the brain age coefficient, this time quantifying effects conditional on the confounders. Moreover, we included the additional confounders gender, handedness and motion parameters in a third model.
Note that motion correction was already performed during fMRI preprocessing, and that MEG source localization took into account individual head geometry as well as potentially confounding environmental noise through whitening with the noise covariance obtained from empty-room recordings. Likewise, following the work by Liem et al., 2017, we included total grey matter and total intracranial volume as important features of interest among the MRI features.
We found that the alternative models did not affect our conclusions and observed that deconfounding even seemed to improve the effect sizes of the models.
We have reworked and extended the Materials and methods description, included the citation regarding modularity, and reported the alternative regression models in the subsection “Analysis of brain-behavior correlation”.
“The brain age Δ was defined as the difference between predicted and actual age of the person, $\text{BrainAge}_{\Delta} = \text{age}_{\text{pred}} - \text{age}$, such that positive values quantify overestimation and negative values underestimation. […] Following the work by Liem et al., 2017, we included total grey matter and total intracranial volume as important features of interest among the MRI features.”
See Figure 3—figure supplement 4 and Figure 3—figure supplement 5.
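The joint model can be sketched with ordinary least squares on synthetic data; the variables, effect sizes and the use of standardized polynomial age terms below are hypothetical illustrations of score ~ brain age Δ + poly(age, 3) + covariates:

```python
import numpy as np

rng = np.random.RandomState(4)
n = 600
age = rng.uniform(18, 88, n)
brain_age_delta = 3.0 * rng.randn(n)          # predicted minus actual age
sex = rng.randint(0, 2, n).astype(float)      # hypothetical covariate

# Simulated cognitive score: nonlinear age effect plus a brain-age delta effect.
score = -0.002 * (age - 50) ** 2 - 0.5 * brain_age_delta + sex + rng.randn(n)

# Joint OLS model: score ~ brain_age_delta + poly(age, 3) + covariates.
age_z = (age - age.mean()) / age.std()        # standardize before taking powers
design = np.column_stack([np.ones(n), brain_age_delta,
                          age_z, age_z ** 2, age_z ** 3, sex])
beta, *_ = np.linalg.lstsq(design, score, rcond=None)
delta_coef = beta[1]  # brain-age effect conditional on age and covariates
```

Because age and the covariates enter the same design matrix, the brain-age coefficient here is conditional on the confounders rather than obtained by a separate orthogonalization step.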
9) Challenge: The previous comment raised the concern about covariates of no interest and their consideration in the post hoc analysis.
Recommendation: It would be really worthwhile having the authors' view whether the approach can/should include covariates of no interest in Layer I. This could be particularly relevant if each modality is associated with "unique" covariates of no interest (e.g. head motion in fMRI, or empty room recording SNR in MEG). Though, I assume that the combination of multiple datasets will reduce this bias, which brings me to my next point.
We thank the reviewers for raising this interesting point. We are somewhat worried that including covariates at the first level may unnecessarily inflate the number of estimated parameters while, at the same time, leading to limited results due to the lack of expressiveness of the ridge model. However, including covariates in the second layer should be more promising, as the number of variables will remain small and the random forest can learn arbitrarily deep interaction effects between covariates and brain-age models. The change in performance and variable importance can then be used to assess the impact of confounds. This question is methodologically interesting and motivates a dedicated study in a more specialized journal. We once more thank the reviewers for stimulating this interesting direction of thinking.
10) Recommendation: It would be really worthwhile if the authors could consider reporting what part of the data in Level I contributed to the overall model performance, i.e. what is the topography in each modality that matters (e.g. atrophy in frontal regions, connectivity in DMN etc.)?
We thank the reviewer for this remark. Interpreting high-dimensional linear models by their parameters is not necessarily an easy task. Collinearity and noise can induce strong weights on features that are not intrinsically important. Showing weight maps in such high-dimensional settings requires dedicated tools (e.g., ReNA, https://doi.org/10.1109/TPAMI.2018.2815524), leading to modifications of the predictive models used in Layer I, which were optimized for prediction rather than interpretability. Adopting interpretability methods would necessarily lead to a parallel method pipeline whose integration would exceed the scope of the paper. We have acknowledged this important limitation in the new dedicated limitations section at the end of the Discussion.
“For the present study, we see four principal limitations: availability of data, interpretability, non-exhaustive feature engineering and potential lack of generalizability due to the focus on MEG. […] We hope, nevertheless, that the insights from our work will stimulate studies investigating the link between MEG, fMRI and MRI across the lifespan using an inference-oriented framework.”
11) Challenge: Some of the results or explanations for the results in the Discussion are not intuitive or perhaps adequately explained in the context of the existent literature and the design of the analysis.
Recommendation: Revise the Discussion to address this concern.
We thank the reviewers for pointing out the issues concerning the interpretation of the results. We have worked through the issues and updated the manuscript accordingly.
12) Challenge: Limitations – Limitations of the methodology (i.e. its assumptions) or the choice of the features (i.e. static vs dynamic) are not adequately put in context or discussed.
Recommendation: Authors need to, at least briefly, discuss how potential assumption violations might affect the results.
We thank the reviewers for this suggestion. We have now included an explicit limitations section at the end of the Discussion.
https://doi.org/10.7554/eLife.54055.sa2
Article and author information
Author details
Funding
H2020 European Research Council (SLAB ERC-StG-676943)
 Alexandre Gramfort
Inria (Médecine Numérique 2018)
 Denis A Engemann
Inserm (Médecine Numérique 2018)
 Denis A Engemann
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was partly supported by a 2018 ‘médecine numérique’ (digital medicine) thesis grant issued by Inserm (French national institute of health and medical research) and Inria (French national research institute for the digital sciences). It was also partly supported by the European Research Council Starting Grant SLAB ERC-StG-676943.
We thank Sheraz Khan for help with the FreeSurfer segmentation and data management of the CamCAN dataset. We thank Mehdi Rahim for advice with the model stacking framework and data management of the CamCAN dataset. We thank Donald Krieger and Timothy Bardouille for help with the MEG coregistration. We thank Danilo Bzdok for feedback on the first version of the preprint.
Ethics
Human subjects: This study is conducted in compliance with the Helsinki Declaration. No experiments on living beings were performed for this study. The data that we used was acquired by the CamCAN consortium and has been approved by the local ethics committee, Cambridgeshire 2 Research Ethics Committee (reference: 10/H0308/50).
Senior Editor
 Floris P de Lange, Radboud University, Netherlands
Reviewing Editor
 Alexander Shackman, University of Maryland, United States
Reviewers
 Kamen Tsvetanov
 Nelson Trujillo-Barreto, The University of Manchester, United Kingdom
Publication history
 Received: November 29, 2019
 Accepted: May 9, 2020
 Accepted Manuscript published: May 19, 2020 (version 1)
 Version of Record published: June 22, 2020 (version 2)
Copyright
© 2020, Engemann et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
 2,343 Page views
 298 Downloads
 5 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.