1. Human Biology and Medicine
  2. Neuroscience

Prognostication of chronic disorders of consciousness using brain functional networks and clinical characteristics

  1. Ming Song
  2. Yi Yang
  3. Jianghong He
  4. Zhengyi Yang
  5. Shan Yu
  6. Qiuyou Xie
  7. Xiaoyu Xia
  8. Yuanyuan Dang
  9. Qiang Zhang
  10. Xinhuai Wu
  11. Yue Cui
  12. Bing Hou
  13. Ronghao Yu
  14. Ruxiang Xu  Is a corresponding author
  15. Tianzi Jiang  Is a corresponding author
  1. Chinese Academy of Sciences, China
  2. PLA Army General Hospital, China
  3. Guangzhou General Hospital of Guangzhou Military Command, China
  4. University of Electronic Science and Technology of China, China
  5. University of Queensland, Australia
Research Article
Cite this article as: eLife 2018;7:e36173 doi: 10.7554/eLife.36173

Abstract

Disorders of consciousness are a heterogeneous mixture of different diseases or injuries. Although some indicators and models have been proposed for prognostication, any single method used alone carries a high risk of false prediction. This study aimed to develop a multidomain prognostic model that combines resting state functional MRI with three clinical characteristics to predict one-year outcomes at the single-subject level. The model discriminated between patients who would later recover consciousness and those who would not with an accuracy of around 88% on three datasets from two medical centers. It was also able to identify the prognostic importance of different predictors, including brain functions and clinical characteristics. To our knowledge, this is the first reported implementation of a multidomain prognostic model based on resting state functional MRI and clinical characteristics in chronic disorders of consciousness, and we suggest that it is accurate, robust, and interpretable.

https://doi.org/10.7554/eLife.36173.001

eLife digest

Severe brain injury can lead to disorders of consciousness (DOC), such as a coma. Some patients regain consciousness after injury, while others do not. Those who do not recover are unable to communicate or move in purposeful ways, and need long-term care. It can be difficult for physicians to predict which patients will recover. Predictions are mainly based on observations of the patient's behavior over time, but such assessments are subjective and vulnerable to error. More accurate and objective methods are needed.

Several studies suggest that the cause of the injury, the age of the person at the time of injury, and how long the person has had a DOC may predict recovery. Recent studies have shown that using a brain-imaging tool called resting state functional magnetic resonance imaging (fMRI) to measure communication between different parts of the brain may help to calculate the likelihood of recovery.

Now, Song, Yang et al. show that combining resting state fMRI with three pieces of clinical information may help to better predict who will improve. Song et al. created a computer model that forecasts recovery from DOC based on fMRI results, the cause of the person's injury, their age at the time of injury, and how long they have had impaired consciousness. The model correctly predicted which patients would regain consciousness 88% of the time for 112 patients from two medical centers. It also identified several patients who got better despite initial predictions from doctors that they would not.

The experiments show that combining multiple types of information can better predict which patients with DOC will recover. Larger studies are needed to confirm that the computer model is reliable. If it proves so, the model may one day help physicians and families to better plan and manage patients' care.

https://doi.org/10.7554/eLife.36173.002

Introduction

Severe brain injury can lead to disorders of consciousness (DOC). Some patients recover consciousness after an acute brain insult, whereas others tragically fall into chronic DOC. The latter cannot communicate functionally or behave purposefully. Most patients remain bedridden, and require laborious care. The medical community is often confronted with an inability to meet the expectations of the chronic DOC patients' families. The social, economic, and ethical consequences are also tremendous (Bernat, 2006). In parallel, although more validations are required, recent pilot studies have proposed new therapeutic interventions, which challenge the existing practice of early treatment discontinuation for a chronic DOC patient (Schiff et al., 2007; Corazzol et al., 2017; Yu et al., 2017). However, before using these novel therapeutic interventions, clinicians first need to determine whether the patient is a suitable candidate. The availability of an accurate and robust prognostication is therefore a fundamental concern in the clinical management of chronic DOC patients, as medical treatment, rehabilitation therapy and even ethical decisions depend on this information.

To date, the prognostication for a DOC patient is based on a physician's observation of the patient's behavior over a period that is sufficient to allow determination of whether there is any evidence of awareness. On the one hand, a patient's motor impairment, sensory deficit, cognitive damage, fluctuation of vigilance and medical complications can give rise to misjudgments; on the other hand, for the assessor, a lack of knowledge regarding DOC, poor training and non-use of adequate behavioral scales are additional elements that may contribute to a high possibility of mistakes. Consequently, careful and repeated behavioral assessments are considered to be particularly important for a precise diagnostic and prognostic judgment (Wannez et al., 2017). Nonetheless, behavioral assessments are inevitably subjective and vulnerable to a variety of personal interferences (Giacino et al., 2009). Physicians and scientists have therefore been seeking accurate and objective markers for diagnosis and prognosis (Demertzi et al., 2017; Noirhomme et al., 2017).

Several pioneering studies have suggested that the etiology, incidence age and duration of DOC are important indicators for prognosis (Multi-Society Task Force on PVS, 1994). Specifically, patients with non-traumatic brain injury are expected to have a worse functional recovery than those with traumatic brain injury, and younger patients are considered more likely to have a favorable outcome than older ones. In recent decades, pilot prognostic models have also been explored that are based on features of the neurological examination (Zandbergen et al., 1998; Booth et al., 2004; Dolce et al., 2008), abnormalities detected with electroencephalography (EEG) and evoked potentials (Steppacher et al., 2013; Kang et al., 2014; Hofmeijer and van Putten, 2016; Chennu et al., 2017), anatomical and functional changes identified with brain computed tomography (CT), positron emission tomography (PET) and magnetic resonance imaging (MRI) (Maas et al., 2007; Sidaros et al., 2008; Neuro Imaging for Coma Emergence and Recovery Consortium et al., 2012; Luyt et al., 2012; Stender et al., 2014; Wu et al., 2015), and physiological and biochemical disturbances at both the brain and body levels (Kaneko et al., 2009; Rundgren et al., 2009). Despite these efforts, however, identifying efficient biomarkers for the early prediction of outcome remains challenging and requires additional research. One reason is that DOC can have many different causes and can be associated with several neuropathological processes and different severities, such that any method used alone carries a risk of false prediction (Bernat, 2016; Rossetti et al., 2016).

Recently, resting state functional MRI (fMRI) has been widely used to investigate the brain functions of DOC patients. Research suggests that these patients demonstrate multiple changes in brain functional networks, including the default mode (Vanhaudenhuyse et al., 2010; Silva et al., 2015), executive control (Demertzi et al., 2014; Wu et al., 2015), salience (Qin et al., 2015; Fischer et al., 2016), sensorimotor (Yao et al., 2015), auditory (Demertzi et al., 2015), visual (Demertzi et al., 2014a) and subcortical networks (He et al., 2015). The within-network and between-network functional connectivity appeared to be useful indicators of functional brain damage and of the likelihood of consciousness recovery (Silva et al., 2015; Di Perri et al., 2016). Taken together, these studies suggest that the brain networks and functional connectivity detected with resting state fMRI could be valuable biomarkers for tracing the level of consciousness and predicting the possibility of recovery.

With advances in medicine, prognostication of a DOC patient has moved toward a multidomain paradigm that combines clinical examination with the application of novel technologies (Gosseries et al., 2014). Multidomain assessment has the potential to improve prediction accuracy. More importantly, it can provide reassurances about the importance of each predictor for prognostication by offering concordant evidence (Stevens and Sutter, 2013; Rossetti et al., 2016). More than 20 years ago, the Multi-Society Task Force on persistent vegetative state (PVS) suggested that the etiology, incidence age and duration of DOC could help to predict the outcome (Multi-Society Task Force on PVS, 1994), and numerous studies have subsequently validated the clinical utility of these features (Jennett, 2005; Bruno et al., 2012; Estraneo et al., 2013; Celesia, 2016). Therefore, it is possible that a multidomain model that combines these clinical characteristics and resting state fMRI data could improve prognostic predictions at an individual level and could lead to the early identification of patients who could recover consciousness.

The present work had two major objectives. The first aim was to develop an approach to predict the prognosis of an individual DOC patient using clinical characteristics and resting state fMRI. The second aim, building on the first, was to further explore the different prognostic effects of these clinical and brain imaging features.

Materials and methods

The study paradigm is illustrated in Figure 1. Resting state fMRI and clinical data from DOC patients were collected at the so-called T0 time point, when the patients' vital signs and consciousness level had stabilized and a diagnosis had been made. Outcomes were assessed at least 12 months after the T0 time point, at a time referred to as the T1 time point. The principal scales were the Coma Recovery Scale Revised (CRS-R) and the Glasgow Outcome Scale (GOS). Rather than predicting the diagnosis, this study used the outcome as the target for regression and classification. Using the resting state fMRI and clinical data from the T0 time point in a training dataset, a regression model was first developed to fit each patient's CRS-R score at the T1 time point, after which the optimal cut-off value for classifying individual patients on the basis of consciousness recovery was calculated. In this way, we set up the prognostic regression and classification model. Two independent testing datasets were then used to validate the model.

Conceptual paradigm of the study.

CRS-R: Coma Recovery Scale Revised scale; GOS: Glasgow Outcome Scale.

https://doi.org/10.7554/eLife.36173.003

Subjects

This study involved three datasets. The datasets referred to as ‘Beijing 750’ and ‘Beijing HDxt’ were both collected in the PLA Army General Hospital in Beijing, and the same medical group diagnosed and managed the patients. However, the MRI scanners and imaging acquiring protocols were different for these two datasets: the ‘Beijing HDxt’ cohort was scanned with a GE signa HDxt 3.0T scanner between May 2012 and December 2013, whereas the ‘Beijing 750’ cohort was scanned with a GE Discovery MR750 3.0T scanner between January 2014 and May 2016. The dataset referred to as ‘Guangzhou HDxt’ was collected from the Guangzhou General Hospital of Guangzhou Military Command in Guangzhou, and the MRI data were obtained with a GE signa HDxt 3.0T scanner between April 2011 and December 2014.

The inclusion criterion was that patients had to be at least 1 month beyond the acute brain insult, so that they met the diagnosis of DOC. Patients were excluded if they had an unstable level of consciousness (continuous improvement or decline within the two weeks before the T0 time point), an uncertain clinical diagnosis (ambiguity or disagreement between examiners), a contraindication for MRI, or large focal brain damage (>30% of total brain volume).

A total of 160 DOC patients were initially enrolled in this study. Eleven patients were excluded due to large local brain lesions or movement artifacts during MRI scanning. Nine patients died during the period of the follow-up, 16 patients were lost to follow-up, and for 12 patients no definite outcome information was collected at the 12-month endpoint of the follow-up. Thus, according to the inclusion and exclusion criteria and the follow-up results, the ‘Beijing 750’ dataset included 46 vegetative state/unresponsive wakefulness syndrome (VS/UWS) patients and 17 minimally conscious state (MCS) patients. The ‘Beijing HDxt’ dataset contained 20 VS/UWS patients and 5 MCS patients, and the ‘Guangzhou HDxt’ dataset contained 16 VS/UWS patients and 8 MCS patients.

The demographic and clinical characteristics of the patients are summarized in Table 1, with additional details provided in Appendix 1—table 1, 2 and 3. The ‘Beijing 750’ dataset also included 30 healthy participants, and the ‘Beijing HDxt’ dataset included 10 healthy participants. All of the healthy participants were free of psychiatric or neurological history. These healthy participants are referred to as ‘normal controls’. See Appendix 1—table 4 and 5 for details.

Table 1
Demographic and clinical characteristics of the patients in the three datasets.
https://doi.org/10.7554/eLife.36173.004
| Characteristic | Beijing_750 (n = 63) | Beijing_HDxt (n = 25) | Guangzhou_HDxt (n = 24) |
| --- | --- | --- | --- |
| Gender, M/F | 36/27 | 18/7 | 14/10 |
| Etiology, trauma/stroke/anoxia | 17/21/25 | 12/6/7 | 8/0/16 |
| Age at T0 (years), mean (SD) | 42.8 (13.8) | 40.7 (15.2) | 39.3 (16.9) |
| Age at T0 (years), range | 18.0 ~ 71.0 | 18.0 ~ 68.0 | 15.0 ~ 78.0 |
| Time to MRI (months), range | 1.0 ~ 77.0 | 1.0 ~ 44.0 | 1.0 ~ 10.0 |
| Time to MRI (months), mean (SD) | 7.4 (12.8) | 5.4 (8.4) | 2.3 (2.4) |
| Time to MRI (months), median | 3.0 | 3.0 | 1.5 |
| Time to MRI in [1,3] months, n | 32 | 13 | 20 |
| Time to MRI in (3,6] months, n | 15 | 8 | 2 |
| Time to MRI in (6,12] months, n | 11 | 3 | 2 |
| Time to MRI >12 months, n | 5 | 1 | 0 |
| Follow-up time (months), range | 12.0 ~ 51.0 | 14.0 ~ 53.0 | 27.0 ~ 78.0 |
| Follow-up time (months), mean (SD) | 21.0 (9.8) | 41.7 (8.4) | 52.2 (14.5) |
| Follow-up time (months), median | 15.0 | 43.0 | 53.0 |
| Follow-up in [12,24] months, n | 38 | 2 | 0 |
| Follow-up in (24,48] months, n | 24 | 20 | 8 |
| Follow-up >48 months, n | 1 | 3 | 16 |
| Diagnosis at T0, MCS/VS | 17/46 | 5/20 | 8/16 |
| CRS-R total score at T0, mean (SD) | 7.3 (2.9) | 6.5 (2.3) | 7.1 (4.1) |
| CRS-R total score at T0, range | 3.0 ~ 18.0 | 3.0 ~ 14.0 | 3.0 ~ 17.0 |
| CRS-R total score at T1, mean (SD) | 9.9 (5.1) | 12.7 (6.4) | N/A |
| CRS-R total score at T1, range | 3.0 ~ 22.0 | 5.0 ~ 23.0 | N/A |
| GOS = 5 at T1, n | 0 | 0 | 0 |
| GOS = 4 at T1, n | 5 | 5 | 1 |
| GOS = 3 at T1, n | 8 | 7 | 5 |
| GOS ≤ 2 at T1, n | 50 | 13 | 18 |

Abbreviations: CRS-R, Coma Recovery Scale–Revised; GOS, Glasgow Outcome Scale; MCS, minimally conscious state; N/A, not available; SD, standard deviation; VS, vegetative state/unresponsive wakefulness syndrome.

As the ‘Beijing 750’ dataset involved more patients than the other two datasets, it was used as the training dataset for model development and internal validation, whereas the ‘Beijing HDxt’ and ‘Guangzhou HDxt’ datasets were only used for external validation. The study was approved by the Ethics Committee of the PLA Army General Hospital (protocol No: 2011–097) and by the Ethics Committee of the Guangzhou General Hospital of Guangzhou Military Command (protocol No: jz20091287). Informed consent to participate in the study was obtained from the legal surrogates of the patients and from the normal controls.

Clinical measurements

Diagnosis and consciousness assessments

The diagnosis of each patient in the three datasets was made by experienced physicians according to the CRS-R scale (Multi-Society Task Force on PVS, 1994; Bernat, 2006; Magrassi et al., 2016). In the ‘Beijing 750’ and ‘Beijing HDxt’ datasets, the patients underwent the evaluations at least twice weekly during the 2 weeks before the MRI scanning (i.e. the T0 time point), and the highest CRS-R score was taken as the diagnosis. The CRS-R includes six subscales addressing auditory, visual, motor, oromotor, communication, and arousal functions, which are summed to yield a total score ranging from 0 to 23.

Outcome assessments

All patients were followed up at least 12 months after MRI scanning, according to the protocols for DOC described in a number of previous studies (Neuro Imaging for Coma Emergence and Recovery Consortium et al., 2012; Luyt et al., 2012; Stender et al., 2014; Pignat et al., 2016). Follow-up interviews were performed in four ways: outpatient visit, assessment by local physicians, home visit, and telephone/video review. Whenever signs of responsiveness were detected or reported, the patient was evaluated either in the unit or at home by the hospital staff. In cases where no change was signaled, patients were examined twice by one hospital physician via telephone/video review at the end of the follow-up process.

For the training dataset, ‘Beijing 750’, two outcome scales were assessed: the GOS and CRS-R. The GOS is one of the most commonly reported global scales for functional outcome in neurology, and provides a measurement of outcome ranging from 1 to 5 (1, dead; 2, vegetative state/minimally conscious state; 3, able to follow commands/unable to live independently; 4, able to live independently/unable to return to work or school; 5, good recovery/able to return to work or school). Although simple to use and highly reliable, the GOS score cannot provide detailed information about individual differences in consciousness level for DOC patients. By contrast, the CRS-R score can assist with prognostic assessment in DOC patients (Giacino and Kalmar, 2006). The six subscales in the CRS-R comprise hierarchically arranged items that are associated with brain stem, subcortical and cortical processes. The lowest item on each subscale represents reflexive activity, whereas the highest items represent cognitively mediated behaviors. In order to simplify the modeling, we hypothesized that the higher the total CRS-R score at the follow-up, the better the outcome for the patient. We developed a regression model to fit each patient's CRS-R score at the T1 time point based on their clinical characteristics and resting state fMRI data, and designed a classification model to predict consciousness recovery or not for each patient. The classification accuracy was assessed by comparing the predicted label and the actual GOS score, that is, ‘consciousness recovery’ (GOS ≥ 3) versus ‘consciousness non-recovery’ (GOS ≤ 2).

The testing dataset ‘Beijing HDxt’ involved both the GOS scores and the CRS-R scores at the T1 time point for each patient. The testing dataset ‘Guangzhou HDxt’ measured the GOS scores, but not the CRS-R scores at the T1 time point.

MRI acquisition

All of the participants in the three datasets were scanned with resting state fMRI and T1-weighted 3D high-resolution imaging. During the MRI scanning, the participants did not take any sedative or anesthetic drugs. The resting state fMRI scan was obtained using a T2*-weighted gradient echo sequence, and a high-resolution T1-weighted anatomical scan was obtained to check whether the patients had large brain distortion or focal brain damage. For the training dataset, ‘Beijing 750’, the resting state fMRI acquisition parameters included TR/TE = 2000/30 ms, flip angle = 90°, axial 39 slices, thickness = 4 mm, no gap, FOV = 240 × 240 mm, matrix = 64 × 64, and 210 volumes (i.e. 7 min). For the testing dataset, ‘Beijing HDxt’, the resting state fMRI acquisition parameters were as follows: axial 33 slices, TR/TE = 2000/30 ms, flip angle = 90°, thickness = 4 mm, no gap, FOV = 220 × 220 mm, matrix = 64 × 64, and 240 volumes (i.e. 8 min). For the testing dataset, ‘Guangzhou HDxt’, the resting state fMRI acquisition parameters included axial 35 slices, TR/TE = 2000/30 ms, flip angle = 90°, thickness = 4 mm, no gap, FOV = 240 × 240 mm, matrix = 64 × 64, and 240 volumes (i.e. 8 min).

Data analysis

The data analysis pipeline is illustrated in Figure 2.

Data analysis pipeline.

All datasets involved in this study included resting state fMRI and clinical data. For the fMRI data in the training dataset, the analysis first encompassed preprocessing and imaging feature selection and extraction. Partial least squares regression was then used to generate the regression model from the selected imaging features and clinical features in the training dataset. In this way, a prediction score depicting the possibility of consciousness recovery was computed for each patient. The optimal cut-off value for classifying an individual patient as responsive or non-responsive was then calculated, and the prognostic classification model was obtained. The two testing datasets were used only to validate the regression and classification model externally.

https://doi.org/10.7554/eLife.36173.005

Imaging preprocessing

Preprocessing and connectivity calculation were performed in the same way for the training dataset and the two testing datasets. All resting-state fMRI scans were preprocessed using SPM8 (SPM, RRID:SCR_007037) and in-house Matlab codes. Specifically, the first five volumes of each subject were discarded. The remaining resting-state fMRI volumes were corrected for slice timing differences and realigned to the first volume to correct for inter-scan movements. The functional images were then spatially smoothed with a Gaussian kernel of 6 × 6 × 6 mm full-width at half maximum. Linear regression was used to remove the influence of head motion, whole brain signals and linear trends. The variables regressed out included 12 motion parameters (roll, pitch, yaw, translation in three dimensions and their first derivatives), the average series of the signals within the brain, and the regressors for linear trends.
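The nuisance-regression step described above can be sketched as follows. This is an illustrative Python/NumPy reimplementation, not the authors' SPM8/Matlab code; the function name and array shapes are assumptions:

```python
import numpy as np

def regress_nuisance(bold, motion):
    """Remove nuisance signals from a T x V BOLD matrix by linear regression.

    bold   : T x V array (time points x voxels)
    motion : T x 6 array of realignment parameters (roll, pitch, yaw
             and translations in three dimensions)
    Regressors: the 6 motion parameters and their first derivatives
    (12 in total), the whole-brain mean signal, and a linear trend,
    mirroring the regressors listed in the text.
    """
    T = bold.shape[0]
    d_motion = np.vstack([np.zeros((1, 6)), np.diff(motion, axis=0)])
    global_sig = bold.mean(axis=1, keepdims=True)       # whole-brain signal
    trend = np.linspace(-1.0, 1.0, T).reshape(-1, 1)    # linear trend
    X = np.column_stack([np.ones(T), motion, d_motion, global_sig, trend])
    beta, *_ = np.linalg.lstsq(X, bold, rcond=None)     # OLS fit
    return bold - X @ beta                              # residuals = denoised data
```

Because an intercept is included, the residual time series are mean-centered and orthogonal to all nuisance regressors.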

Motion artifact is increasingly recognized as an important potential confound in resting state fMRI studies. Any particular movement may produce a wide variety of signal changes in the fMRI data, and may thus introduce complicated shifts into functional connectivity analyses. This problem is particularly serious for DOC patients, as they are unlikely to follow experimental instructions and control their head motion. To balance the demands of noise reduction and data preservation, we censored the volumes that preceded or followed any movement (framewise displacement, FD) greater than 1.5 mm. The FD summarizes the absolute values of the derivatives of the translational and rotational realignment estimates (after converting the rotational estimates to displacement at a 50 mm radius) (Power et al., 2015). The head motion measures were obtained during the realignment step of preprocessing in SPM. To obtain reliable Pearson's correlations for functional connectivity, patients with fewer than 50 volumes of remaining data were excluded. More information about the analysis and validation of the controls for motion-related artifacts is provided in Appendix 4.
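The FD computation and censoring rule can be sketched as follows (a minimal NumPy rendering of the Power-style FD described in the text; the function names are illustrative):

```python
import numpy as np

def framewise_displacement(params, radius=50.0):
    """Power-style FD from a T x 6 realignment-parameter matrix.

    Columns 0-2 are translations in mm; columns 3-5 are rotations in
    radians, converted to displacement on a sphere of the given radius
    (50 mm, as in the text). FD of the first volume is defined as 0."""
    diffs = np.abs(np.diff(params, axis=0))
    diffs[:, 3:] *= radius                     # rotations -> mm of arc
    return np.concatenate([[0.0], diffs.sum(axis=1)])

def censor_mask(fd, threshold=1.5):
    """Keep-mask that drops both volumes bordering any FD above threshold,
    i.e. the volumes that 'preceded or followed' the movement."""
    keep = np.ones(fd.size, dtype=bool)
    bad = np.where(fd > threshold)[0]          # fd[i]: motion between i-1 and i
    keep[bad] = False
    keep[np.clip(bad - 1, 0, None)] = False
    return keep
```

A patient would then be excluded if `keep.sum()` falls below 50 volumes.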

Finally, to reduce low-frequency drift and high-frequency noise, band-pass filtering (0.01–0.08 Hz) was only performed on volumes that survived motion censoring.
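The band-pass step can be sketched with a zero-phase Butterworth filter; the filter family and order are assumptions, as the text specifies only the 0.01-0.08 Hz band, and in the actual pipeline filtering was applied only to the volumes that survived censoring:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(ts, tr=2.0, low=0.01, high=0.08, order=2):
    """Zero-phase band-pass filter along the time axis.

    tr is the repetition time in seconds (TR = 2 s in all three
    datasets), so the sampling rate is 1/tr and Nyquist is 0.5/tr."""
    nyquist = 0.5 / tr
    b, a = butter(order, [low / nyquist, high / nyquist], btype="band")
    return filtfilt(b, a, ts, axis=0)
```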

Definition of networks and regions of interest

As noted in the Introduction, multiple functional brain networks are disrupted in DOC patients. Among these impaired networks, six (the default mode, executive control, salience, sensorimotor, auditory, and visual networks) show system-level damage and significant correlations with behavioral assessments (Demertzi et al., 2014, 2015). We therefore defined a total of 22 regions of interest (ROIs) to probe these six brain networks. The definitions of the 22 ROIs were based on the results of a series of previous brain functional studies (Seeley et al., 2007; Raichle, 2011; Demertzi et al., 2015), and their names and Montreal Neurological Institute (MNI) coordinates are listed in Appendix 2.

The connection templates of the six brain networks were first investigated within the normal control group. In addition to the above-mentioned preprocessing stages, the resting state fMRI scans of the normal controls in the training dataset were transformed into MNI standard space. For each of the six networks, the time series from the voxels contained in the various ROIs were extracted and averaged together. The averaged time series were then used to estimate whole-brain correlation r maps, which were subsequently converted into normally distributed Fisher's z-transformed correlation maps. Group functional connectivity maps for each of the six networks were then created with a one-sample t test (see Appendix 3 for details). Notably, each T map included both positive and negative values. We used the six T maps as the brain connection templates of the corresponding brain networks in the healthy population; these templates were later used to define one type of imaging feature, namely the connection feature of an ROI. More information about the connection features of the ROIs is provided in the following section.
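The template construction can be sketched as follows (an illustrative NumPy/SciPy rendering of the Fisher transform and voxel-wise one-sample t test described above; the helper name is an assumption):

```python
import numpy as np
from scipy import stats

def network_template(subject_maps):
    """Group 'brain connection template' for one network.

    subject_maps : N x V array, one row per normal control, holding the
    Pearson r between the network's averaged ROI time series and every
    brain voxel. The maps are Fisher z-transformed, and a voxel-wise
    one-sample t test against zero yields the T map used as the
    template (which can contain both positive and negative values)."""
    z = np.arctanh(subject_maps)                       # Fisher r-to-z
    t, _ = stats.ttest_1samp(z, popmean=0.0, axis=0)
    return t
```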

Conventional fMRI preprocessing normalizes individual fMRI images into a standard space defined by a specific template image. Our goal was to extend this conventional approach in order to generate a functional connectivity image for each patient in his or her own imaging space. During the preprocessing of each patient's fMRI scans, the 22 ROIs and the six brain connection templates were therefore spatially warped to the individual fMRI space and resampled to the voxel size of the individual fMRI image. We also developed tools to check the registration of each subject visually, some examples of which are provided in Appendix 5 and Supplementary file 1.

Calculation of imaging features

We designed two types of imaging features from the resting state fMRI, one being the functional connectivity between each pair of the 22 ROIs, and the other being the spatial resemblance between the functional connection pattern of each ROI and the corresponding brain connection template across the whole brain. The functional connectivity was based on Pearson's correlation coefficients, while the spatial resemblance was conceptually similar to the template-matching procedure (Greicius et al., 2004; Seeley et al., 2007; Vanhaudenhuyse et al., 2010). The basis of template matching is that the greater the spatial consistency that exists between the template of a brain network and a specific connectivity map (for example, a component in an independent component analysis), the stronger the possibility that the connectivity map belongs to that brain network. Here, for each ROI of an individual DOC patient, we first computed the Pearson's correlation coefficients between the time course of the ROI and that of each voxel within the brain so as to obtain a functional connectivity map, and subsequently converted the functional connectivity map to a normally distributed Fisher's z-transformed correlation map. Next, we calculated the Pearson's correlation coefficient between the Fisher's z-transformed correlation map and the corresponding brain connection template warped to the individual fMRI space, computed over all voxels within the brain. A greater correlation coefficient between the two maps suggests that there is more spatial resemblance between the functional connectivity map of the ROI and the normal brain connection template. Our assumption was that the more spatial consistency that existed between the connectivity map of the ROI in a DOC patient and the brain connection template, the more intact the corresponding brain function of the ROI in this individual. In this way, we defined the connection feature of the ROI as the spatial resemblance.
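The spatial-resemblance (connection) feature can be sketched as follows; this is an illustrative reimplementation with assumed array shapes, not the authors' code:

```python
import numpy as np

def roi_connection_feature(roi_ts, voxel_ts, template_map):
    """Spatial-resemblance ('connection') feature for one ROI.

    roi_ts       : length-T averaged time course of the ROI
    voxel_ts     : T x V matrix of voxel time courses (individual space)
    template_map : length-V connection template warped to the same space
    Returns the Pearson correlation between the patient's Fisher
    z-transformed connectivity map of the ROI and the template; higher
    values suggest more intact function of the ROI."""
    roi = (roi_ts - roi_ts.mean()) / roi_ts.std()
    vox = (voxel_ts - voxel_ts.mean(axis=0)) / voxel_ts.std(axis=0)
    r_map = (roi @ vox) / roi_ts.size                  # voxel-wise Pearson r
    z_map = np.arctanh(np.clip(r_map, -0.999, 0.999))  # Fisher z
    return np.corrcoef(z_map, template_map)[0, 1]
```

The ROI-pair functional connectivity features are simply the Pearson correlations between pairs of such averaged ROI time courses.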

Overall, for each participant in this study, there were 231 (22 × 21/2) functional connectivity features and 22 brain area connection features.

Imaging feature selection

Feature selection techniques have been widely adopted in brain analysis studies, in order to produce a small number of features for efficient classification or regression, and to reduce overfitting and increase the generalization performance of the model (Fan et al., 2007; Dosenbach et al., 2010; Drysdale et al., 2017). Feature ranking and feature subset selection are two typical feature selection methods (Guyon and Elisseeff, 2003). Feature subset selection methods are generally time consuming, and even inapplicable when the number of features is extremely large, whereas ranking-based feature selection methods are subject to local optima. Therefore, these two feature selection methods are usually used jointly. Here, we first used a correlation-based feature selection technique to select an initial set of features, and then adopted a feature subset selection method for further selection.

As a univariate method, correlation-based feature selection is simple to run and to understand, and measures the linear correlation between each feature and the response variable. Here, the imaging features (i.e. functional connectivity features and brain area connection features) that correlated significantly with the CRS-R scores at the T1 time point across the DOC patients in the training dataset were retained for further analysis.
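This univariate filter can be sketched as follows (the significance threshold alpha is an assumption; the text states only that significantly correlated features were retained):

```python
import numpy as np
from scipy import stats

def correlation_filter(features, y, alpha=0.05):
    """Univariate correlation-based feature filter.

    features : N x P matrix of imaging features across patients
    y        : length-N response (CRS-R total score at T1)
    Returns the indices of features whose Pearson correlation with y
    is significant at the given alpha level."""
    keep = []
    for j in range(features.shape[1]):
        _, p = stats.pearsonr(features[:, j], y)
        if p < alpha:
            keep.append(j)
    return keep
```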

Competitive adaptive reweighted sampling coupled with partial least squares regression (CARS-PLSR, http://libpls.net/) was then used for further feature subset selection (Li et al., 2009, 2014). Briefly, CARS-PLSR is a sampling-based feature selection method that selects the key informative variables by optimizing the model's performance. Because it assesses the influence of each variable without having to enumerate combinations of the remaining variables, CARS-PLSR is efficient and fast in carrying out feature selection (Mehmood et al., 2012), and has therefore been used to explore possible biomarkers in medicine (Tan et al., 2010) and for wavelength selection in chemistry (Fan et al., 2011). Using CARS-PLSR, we selected a subset of key informative imaging features.

Notably, both the correlation-based and CARS-PLSR feature selection methods filtered the features from the original feature set without any transformations. This made the prognostic regression model easier to interpret, as the imaging predictors were associated with either brain regions or functional connectivity.

Prognostic modeling and assessments of predictor importance

PLSR is able to handle multicollinearity among the predictors well (Wold et al., 2001; Krishnan et al., 2011). It was therefore used to generate the prognostic regression model in the training dataset ‘Beijing 750’. Given that clinical characteristics, including the etiology, incidence age and duration of DOC, have been verified as useful prognostic indicators, we designated the selected imaging features and the three clinical characteristics at the T0 time point as independent covariates and the CRS-R score at the T1 time point as the dependent variable. Among the three clinical characteristics, the incidence age and duration of DOC were quantitative variables, whereas the etiology was a qualitative variable. In accordance with a previous study (Estraneo et al., 2010), we categorized the etiology into three types: traumatic brain injury, stroke and anoxic brain injury. Thus, two dummy variables for etiology were designed and included in the model. Prior to model training, all of the predictors were centered and normalized (i.e. transformed into Z-scores). The prognostic regression model thus took the imaging and clinical features as input and returned a predicted score as output. In the training dataset ‘Beijing 750’, cross-validation was used to set the number of latent variables for PLSR to three. To evaluate the regression model, the coefficient of determination R2 between the predicted scores and the CRS-R scores at the T1 time point was calculated, and a Bland-Altman plot was used to measure the agreement between them.

Next, receiver operating characteristic (ROC) curves were plotted for the predicted scores. The optimal cut-off value for classifying an individual patient as having recovered consciousness or not was set at the point on the ROC curve with the maximal sum of the true positive and true negative rates (i.e. Youden's index). Individual patients were classified as exhibiting recovery of consciousness if their predicted scores were higher than or equal to the cut-off value, and otherwise as exhibiting non-recovery. The classification accuracy was calculated by comparing the predicted label with the actual GOS score, that is, ‘consciousness recovery’ (GOS ≥ 3) versus ‘consciousness non-recovery’ (GOS ≤ 2).
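A minimal sketch of this cut-off selection, on synthetic predicted scores rather than the study's data (maximising TPR − FPR is equivalent to maximising the sum of true positive and true negative rates):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = np.array([0] * 30 + [1] * 15)           # 1 = recovery (GOS >= 3)
scores = np.concatenate([rng.normal(10, 3, 30),  # predicted scores, non-recovery
                         rng.normal(18, 3, 15)]) # predicted scores, recovery

fpr, tpr, thresholds = roc_curve(y_true, scores)
youden = tpr - fpr                               # equals TPR + TNR - 1
cutoff = thresholds[np.argmax(youden)]           # optimal cut-off on the curve

predicted = (scores >= cutoff).astype(int)       # >= cut-off -> 'recovery'
accuracy = float((predicted == y_true).mean())
print(round(float(cutoff), 1), round(accuracy, 2))
```

In the study the same cut-off, derived once from the training set, is then applied unchanged to the testing datasets.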

As model interpretation is an important task in most applications of PLSR, there has been considerable progress in the search for optimal interpretation methods (Kvalheim and Karstang, 1989; Kvalheim et al., 2014). In this study, we assessed predictor importance in the prognostic regression model using the significant multivariate correlation (sMC) method (Tran et al., 2014). The key idea of sMC is to estimate, for each predictor, the sources of variability arising from the PLSR model (i.e. the regression variance and the residual variance) and to use them to assess the statistical importance of that variable with respect to the regression model. The resulting F-test values (termed the sMC F-values) were used to evaluate the predictors' importance in the prognostic regression model.

Internal validation of model

The prognostic regression model was internally validated using bootstrap sampling (Steyerberg, 2008). Specifically, bootstrap samples were drawn with replacement from the training dataset ‘Beijing 750’ such that each bootstrap sampling set had a number of observations equal to that of the training dataset. Using a bootstrap sampling set, correlation-based feature selection and CARS-PLSR were first used to select the feature subset, after which the PLSR was used to generate a prognostic model. We then applied the model to the bootstrap sampling set and the original training dataset, and calculated the coefficient of determination R2 of each of the two datasets. The difference between the two coefficients of determination was defined as the optimism. This process was repeated 1000 times to obtain a stable estimate of the optimism. Finally, we subtracted the optimism estimate from the coefficient of determination R2 of the ‘Beijing 750’ training dataset to obtain the optimism-corrected performance estimate.

In addition, out-of-bag (OOB) estimation was used as an estimate of model classification performance in the training dataset (James et al., 2013). Specifically, for the original training dataset x with n observations, we left out one sample at a time and denoted the resulting sets by x(–1),..., x(–n). From each leave-one-out set x(–i), 1000 bootstrap learning sets of size n–1 were drawn. On every bootstrap learning set generated from x(–i), we carried out feature selection, built a PLSR regression and classification model, and applied the model to the held-out observation xi. A majority vote across these models then gave a class prediction for observation xi. Finally, we calculated the accuracy over the whole training dataset x.

External model validation

External validation is essential to support the general applicability of a prediction model. We ensured external validity by testing the model in two testing datasets, neither of which included samples that were considered during the development of the model. First, using the prognostic regression model, we calculated one predicted score for each patient in the two testing datasets. As the patients' CRS-R scores at the T1 time point were available for the ‘Beijing HDxt’ dataset, we calculated the coefficient of determination R2 between the predicted scores and the CRS-R scores at this time point; a Bland-Altman plot was also generated. Finally, the patients in the two testing datasets were assessed as achieving consciousness recovery or not on the basis of the cut-off threshold obtained using the training dataset. The performance of the classification, including the accuracy, sensitivity and specificity, was determined.

Comparison between single-domain model and combination model

Using the modeling and validation method described above, we examined the predictability and generalizability in the two testing datasets on the basis of the clinical features alone or the imaging features alone.

In addition, to compare the two types of single-domain models and the combination model, we used bootstrap resampling to obtain the distribution of the prediction accuracies in the two testing datasets based on each of the three types of models. We first resampled with replacement from the training dataset, and built a regression and classification model based on the clinical features alone, the neuroimaging features alone, or the combination of the two-domain features. We then tested the classification accuracy in the two testing datasets using the three types of models. In this way, we obtained the distribution of the prediction accuracies using each of the three types of models. Next, we used repeated measures ANOVA to determine whether or not the performances of the three types of models were the same; we also used Ψ, the root-mean-square standardized effect, to report the effect sizes of the mean differences between them.
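A condensed sketch of this comparison: each bootstrap draw of the training set yields one test-set accuracy per feature set, and the paired accuracy distributions are then compared. Everything here is a stand-in: synthetic data, a logistic classifier instead of the feature-selection-plus-PLSR pipeline, and a paired t-test shown in place of the repeated-measures ANOVA and Ψ effect sizes.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_img, n_clin = 5, 3

def make(n):
    """Synthetic cohort: outcome depends on one imaging and one clinical feature."""
    X = rng.normal(size=(n, n_img + n_clin))
    y = (X[:, 0] + 0.5 * X[:, n_img] + rng.normal(scale=1.0, size=n) > 0).astype(int)
    return X, y

Xtr, ytr = make(80)                              # stands in for the training set
Xte, yte = make(60)                              # stands in for a testing set

cols = {'imaging': slice(0, n_img),
        'clinical': slice(n_img, None),
        'combined': slice(None)}
acc = {name: [] for name in cols}
for _ in range(200):
    idx = rng.integers(0, len(ytr), len(ytr))    # resample the training set
    for name, sl in cols.items():
        clf = LogisticRegression().fit(Xtr[idx][:, sl], ytr[idx])
        acc[name].append(clf.score(Xte[:, sl], yte))

# Paired comparison of two of the accuracy distributions.
t, p = stats.ttest_rel(acc['combined'], acc['clinical'])
print({k: round(float(np.mean(v)), 2) for k, v in acc.items()})
```

Since all three models are scored on the same bootstrap draws, the accuracy samples are paired, which is what licenses the paired tests used in the paper.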

Comparison between the proposed modeling method and linear SVM

We compared the prediction performance of the proposed modeling method with that of a linear SVM. The code for the SVM was downloaded from LIBSVM (LIBSVM, RRID:SCR_010243). The 253 imaging features and the four clinical features were integrated into one feature vector, and no feature selection was adopted in the linear SVM-based classification. Patients with GOS ≥ 3 were labeled as 1 and the others (GOS ≤ 2) as −1.

Similarly, the OOB estimation process was used to estimate the performance of linear SVM in the training dataset ‘Beijing 750’. Next, using all the samples in the training dataset ‘Beijing 750’, we trained a linear SVM-based classifier and then tested the predictive accuracy in the two testing datasets.
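The baseline can be sketched as follows. Here scikit-learn's SVC(kernel='linear'), which wraps libsvm internally, stands in for the LIBSVM command-line tools, and the data are synthetic (257 features matching the 253 imaging + 4 clinical dimensions, with signal planted in only two columns):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n_feat = 257                                     # 253 imaging + 4 clinical features
Xtr = rng.normal(size=(150, n_feat))
ytr = np.where(Xtr[:, 0] + Xtr[:, 1] > 0, 1, -1) # +1: GOS >= 3, -1: GOS <= 2
Xte = rng.normal(size=(60, n_feat))
yte = np.where(Xte[:, 0] + Xte[:, 1] > 0, 1, -1)

# All features in one vector, no feature selection, linear kernel.
clf = SVC(kernel='linear', C=1.0).fit(Xtr, ytr)
print(round(clf.score(Xte, yte), 2))
```

With many uninformative features and no selection step, the linear SVM tends to generalize worse than a pipeline that prunes features first, which is consistent with the comparison reported below.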

Results

Imaging feature selection

Correlation-based feature selection

Using the training dataset, we found that some imaging features correlated significantly with the CRS-R scores at the T1 time point. For example, the connection features of some brain areas, including the anterior medial prefrontal cortex (aMPFC), posterior cingulate cortex/precuneus (PCC) and right lateral parietal cortex in the default mode network, and the dorsal medial prefrontal cortex (DMPFC) and left lateral superior parietal cortex in the executive control network, displayed significant correlations with the CRS-R T1 scores across the DOC patients. We also found numerous significant correlations between functional connectivity and the CRS-R score at the T1 time point, with these functional connectivities being distributed both within and between brain networks. More information about the correlations between the imaging features and the CRS-R scores at the T1 time point is provided in Appendix 6.

CARS-PLSR feature selection

Figure 3 shows the final imaging features selected with CARS-PLSR. Specifically, the brain area connection features included the aMPFC and PCC in the default mode network, and the DMPFC in the executive control network. The functional connectivity features included the connectivity between the aMPFC in the default mode network and the DMPFC in the executive control network, as well as between the middle cingulate cortex in the auditory network and the right lateral primary visual cortex in the visual network. More information about the feature selection by bootstrapping is provided in Appendix 7.

Imaging features involved in the prognostic regression model.

DMN.aMPFC, anterior medial prefrontal cortex in the default mode network; DMN.PCC, posterior cingulate cortex/precuneus in the default mode network; ExecuContr.DMPFC, dorsal medial prefrontal cortex in the executive control network; Auditory.MCC, middle cingulate cortex in the auditory network; Visual.R.V1, right lateral primary visual cortex in the visual network. DMN.aMPFC—ExecuContr.DMPFC: the functional connectivity between DMN.aMPFC and ExecuContr.DMPFC; Auditory.MCC—Visual.R.V1: the functional connectivity between Auditory.MCC and Visual.R.V1.

https://doi.org/10.7554/eLife.36173.006

Prognostic regression model and predictor importance

The prognostic regression model is presented in Figure 4. On the basis of the regression formula, we noted some interesting findings. First, there were both positive and negative weights. In particular, the weights were all positive for the three brain area connection features, whereas the weight for the functional connectivity feature between the aMPFC in the default mode network and the DMPFC in the executive control network was negative. Interestingly, this connection had the maximum sMC F-value as shown in Figure 4B. In addition, the age and the anoxic etiology had negative weights, and the age predictor had the largest sMC F-value among the four clinical features.

Prognostic regression model.

In the three subplots, each color denotes a particular predictor. (A) Regression formula. (B) Predictor importance for each predictor in prognostic regression model. The vertical axis represents the sMC F-test value. The larger the sMC F-value, the more informative the predictor with respect to the regression model. (C) The imaging features in the model are rendered on a 3D surface plot template in medial view.

https://doi.org/10.7554/eLife.36173.007

Prognostic classification model and internal validation

Figure 5A presents the predicted score for each patient in the training dataset. As shown in Figure 5B, there was good agreement between the CRS-R scores at the T1 time point and the predicted scores. The apparent coefficient of determination R2 was 0.65 (permutation test, p=0.001), and the Bland-Altman plot verified the consistency between the predicted and observed scores (one-sample t-test, p=1.0). The prognostic regression model was internally validated using bootstrapping; the optimism-corrected coefficient of determination R2 was 0.28.

The performance of the prediction model on the training dataset.

(A) Individual predicted scores for each DOC patient in the training dataset. The CRS-R score at the T0 time point is shown on the x axis and the predicted score on the y axis. The patients diagnosed as VS/UWS at the T0 time point are shown to the left of the vertical red solid line, whereas the patients diagnosed as MCS at this time point are shown to the right. The purplish red pentagram, imperial purple triangle and blank circle mark the patients with a GOS score ≥ 4, = 3 and ≤ 2, respectively, at the T1 time point. (B) Agreement between the CRS-R scores at the T1 time point and the predicted scores. The left panel shows the correlation between the CRS-R scores at the T1 time point and the predicted scores, and the right panel shows the differences between them using the Bland-Altman plot. (C) Bar chart showing the numbers or proportions of DOC patients in each band of predicted scores. In these two panels, the y axis shows the predicted score. (D) The area under the receiver-operating characteristic (ROC) curve. The star on the curve represents the point with the maximal sum of the true positive and true negative rates, which was chosen as the cut-off threshold for classification. Here, the corresponding predicted score = 13.9.

https://doi.org/10.7554/eLife.36173.008

Figure 5C illustrates the number and proportion of DOC patients in different bands of predicted scores. The higher the predicted score, the higher the proportion of patients who exhibited a favorable ‘consciousness recovery’ outcome. Figure 5D shows the area under the ROC curve (AUC = 0.96, 95% CI = 0.89–0.99). On the basis of the ROC curve for the training dataset, a threshold of 13.9 was selected as the cut-off point for classifying the recovery of individual patients. In other words, if the predicted score for a patient was equal to or larger than 13.9, the classification model assigned the label ‘consciousness recovery’ to this patient, and otherwise ‘consciousness non-recovery’. The classification accuracy was assessed by comparing the predicted and actual outcomes, that is, ‘consciousness recovery’ (GOS ≥ 3) versus ‘consciousness non-recovery’ (GOS ≤ 2). Using this method, the classification accuracy in the training dataset was 92%. Specifically, the sensitivity was 85%, the specificity was 94%, the positive predictive value (PPV) was 79%, the negative predictive value (NPV) was 96%, and the F1 score was 0.81.

OOB estimation provided the mean prediction error for each training sample and an estimate of the generalizability of our method in the training dataset. Using the OOB estimation, we found that the prediction accuracy in the training dataset ‘Beijing 750’ was 89%, and the sensitivity, specificity, PPV and NPV were 69%, 94%, 100%, and 87%, respectively.

External validation of the model

The performance of the prediction model on the two testing datasets is illustrated in Figure 6. As the CRS-R scores at the T1 time point had been assessed for the patients in the ‘Beijing HDxt’ dataset, we calculated the coefficient of determination R2 between these scores and the predicted scores. The R2 was 0.35 (permutation test, p=0.005), with the Bland-Altman plot verifying the consistency between the predicted and actual scores (one-sample t-test, p=0.89). Using the predicted score of 13.9 as the threshold, we then tested the classification accuracy on the two testing datasets. For the ‘Beijing HDxt’ dataset, the prediction accuracy was 88% (sensitivity: 83%, specificity: 92%, PPV: 92%, NPV: 86%, F1 score: 0.87; permutation test, p<0.001), while for the ‘Guangzhou HDxt’ dataset it was likewise 88% (sensitivity: 100%, specificity: 83%, PPV: 67%, NPV: 100%, F1 score: 0.80; permutation test, p<0.001). Notably, our model demonstrated good sensitivity and specificity for both the ‘subacute’ patients (i.e. duration of unconsciousness ≤3 months) and those in the chronic phase (i.e. duration of unconsciousness >3 months), as shown in Figure 7. Interestingly, in the testing dataset ‘Beijing HDxt’, eight DOC patients who were initially diagnosed as VS/UWS subsequently recovered consciousness. The proposed model correctly identified seven of these, with only one false-positive case. That is, for the VS/UWS patients, the model achieved 90.0% accuracy (sensitivity: 87.5%, specificity: 91.7%, PPV: 87.5%, NPV: 91.7%, F1 score: 0.875).

The performance of the prediction model on the two testing datasets.

(A) The individual predicted score (top panel) and agreement between the CRS-R scores at the T1 time point and the predicted scores (bottom panel) for the testing dataset ‘Beijing HDxt’. (B) The individual predicted score for each DOC patient in the testing dataset ‘Guangzhou HDxt’. The legend description is the same as for Figure 5.

https://doi.org/10.7554/eLife.36173.009
The sensitivity and specificity in the ‘subacute’ patients (i.e. duration of unconsciousness at T0 ≤3 months) and those in the chronic phase (i.e. duration of unconsciousness at T0 >3 months), respectively.
https://doi.org/10.7554/eLife.36173.010

To test robustness, we evaluated whether the present prognostic regression model generalized to the healthy subjects scanned in the ‘Beijing 750’ training dataset (n = 30) and to the ‘Beijing HDxt’ testing dataset (n = 10). We found that both the healthy subjects and the ‘consciousness recovery’ patients had significantly higher predicted imaging subscores than the ‘consciousness non-recovery’ patients (two-sample T test, p<0.05). Additional information is provided in Appendix 8.

Comparison of the single-domain and combination models

When only the clinical features were used to build the predictive model, the prediction accuracy for the ‘Beijing HDxt’ dataset was 68% (sensitivity: 58%, specificity: 77%, PPV: 70%, NPV: 67%, F1 score: 0.64), whereas for the ‘Guangzhou HDxt’ dataset, it was 83% (sensitivity: 100%, specificity: 78%, PPV: 60%, NPV: 100%, F1 score: 0.75). When only the imaging features were involved in the model, the prediction accuracy for the ‘Beijing HDxt’ dataset was 80% (sensitivity: 67%, specificity: 92%, PPV: 89%, NPV: 75%, F1 score: 0.76), whereas for the ‘Guangzhou HDxt’ dataset, it was 79% (sensitivity: 100%, specificity: 72%, PPV: 55%, NPV: 100%, F1 score: 0.71).

Using bootstrapping, we obtained the distribution of the prediction accuracies in the two testing datasets with each of the three types of models. In the 'Beijing HDxt' testing dataset, the means ± standard deviations of the prediction-accuracy distributions were 0.815±0.050, 0.811±0.044, and 0.666±0.037 for the combination model, the model using imaging features alone, and the model using clinical features alone, respectively. There were significant differences between the mean classification accuracies of the three types of models (repeated measures ANOVA, p<0.001). In subsequent pairwise comparisons, the combination model differed significantly from the model using the imaging features alone (paired-sample t-test, p=0.001) and from the model using the clinical features alone (paired-sample t-test, p<0.001); the model using the imaging features alone also differed significantly from the model using the clinical features alone (paired-sample t-test, p<0.001). Using effect-size analysis, we found a mean difference of Ψ=0.004 (95% CI = [0.002, 0.007]) between the combination method and the method using only imaging features, and Ψ=0.149 (95% CI = [0.147, 0.152]) between the combination method and the method using only clinical features. We also observed a mean difference of Ψ=0.145 (95% CI = [0.142, 0.147]) between the methods using only imaging features and only clinical features.

In the 'Guangzhou HDxt' testing dataset, the means ± standard deviations of the prediction-accuracy distributions were 0.863±0.051, 0.783±0.044, and 0.829±0.086 for the combination model, the model using imaging features alone, and the model using clinical features alone, respectively. Similarly, there were significant differences between the mean classification accuracies of the three types of models (repeated measures ANOVA, p<0.001), and the combination model differed significantly from the models using imaging features alone (paired-sample t-test, p<0.001) and clinical features alone (paired-sample t-test, p<0.001). Using effect-size analysis, we found a mean difference of Ψ = 0.080 (95% CI = [0.076, 0.084]) between the combination model and the model using the imaging features alone, and Ψ = 0.034 (95% CI = [0.028, 0.040]) between the combination model and the model using only clinical features. We also observed a mean difference of Ψ = –0.046 (95% CI = [–0.053, –0.040]) between the model using imaging features alone and that using only clinical features.

Therefore, in both testing datasets, the combination of imaging and clinical features demonstrated higher accuracy than either single-domain feature set alone. In addition, the imaging features alone had higher predictive power than the clinical features alone in the ‘Beijing HDxt’ dataset, whereas the opposite was observed in the ‘Guangzhou HDxt’ dataset, suggesting that the two testing datasets might be heterogeneous. More information about the single-domain models is provided in Supplementary file 2.

Comparison between the proposed modeling method and linear SVM

Using the OOB estimation, we found that the accuracy of the linear SVM-based classification method in the training dataset ‘Beijing 750’ was 83% (sensitivity: 31%, specificity: 96%, PPV: 100%, NPV: 81%), which was lower than the accuracy of our proposed modeling method (i.e. accuracy: 89%, sensitivity: 69%, specificity: 94%, PPV: 100%, NPV: 87%). On the other hand, the linear SVM-based classification method achieved an accuracy of 76% (sensitivity: 58%, specificity: 92%, PPV: 88%, NPV: 71%) and 88% (sensitivity: 100%, specificity: 83%, PPV: 67%, NPV: 100%) in the ‘Beijing HDxt’ testing dataset and the ‘Guangzhou HDxt’ testing dataset, respectively. That is, the accuracy in the ‘Beijing HDxt’ testing dataset was lower than that in our method, whereas the accuracy in the ‘Guangzhou HDxt’ testing dataset was similar to that of our approach. Therefore, taking together the performance comparisons in both the training dataset and the two testing datasets, we believe that our method based on feature selection and PLSR should have higher prediction accuracy and better generalizability in comparison to linear SVM.

Discussion

In this paper, we describe the development of a prognostic model that combines resting state fMRI with three clinical characteristics to predict one-year outcomes at the single-subject level. The model discriminated between patients who would later recover consciousness and those who would not with an accuracy of around 88% in three datasets from two medical centers. It was also able to identify the prognostic importance of different predictors, including brain functions and clinical characteristics. To our knowledge, this is the first reported implementation of a multidomain prognostic model that is based on resting state fMRI and clinical characteristics in chronic DOC. We therefore suggest that this novel prognostic model is accurate, robust, and interpretable. For research only, we share the prognostic model and its Matlab code at a public download resource (https://github.com/realmsong504/pDOC; copy archived at https://github.com/elifesciences-publications/pDOC).

Brain functions are mediated by the interactions between neurons within different neural circuits and brain regions. Functional imaging can detect the local activity of different brain regions and explore the interactions between them, and has demonstrated potential for informing diagnosis and prognosis in DOC. On the one hand, many studies have focused on one modality of brain functional imaging, such as PET (Phillips et al., 2011), SPECT (Nayak and Mahapatra, 2011), task fMRI (Owen et al., 2006; Coyle et al., 2015), or resting state fMRI (Demertzi et al., 2015; Qin et al., 2015; Wu et al., 2015; Roquet et al., 2016). On the other hand, some cross-modality studies have compared the diagnostic precision of different imaging modalities, for example, comparing PET imaging with task fMRI (Stender et al., 2014) or comparing PET, EEG and resting state fMRI (Golkowski et al., 2017). In our study, by combining brain functional networks detected from resting state fMRI with three clinical characteristics, we built a computational model that allowed us to make predictions regarding the prognosis of DOC patients at an individual level. We compared the models using only the imaging features or only the clinical characteristics and found that the combination of these predictors achieved greater accuracy. This validated the need for accumulating evidence from multiple assessments, each of which has a different sensitivity and specificity in detecting the capacity for recovery of consciousness (Demertzi et al., 2017). Validations in additional, unseen datasets were also undertaken to evaluate the feasibility of the predictive model. Our results showed about 88% average accuracy across the two testing datasets, and good sensitivity and specificity in both the ‘subacute’ patients (i.e. 1 month ≤ duration of unconsciousness ≤ 3 months) and those in the prolonged phase (i.e. duration of unconsciousness >3 months), suggesting the good robustness and generalizability of our model.

Further, the sensitivities of 83% and 100% obtained in the two testing datasets demonstrate a low false-negative rate, which helps to avoid predicting non-recovery in a patient who can actually recover. Our method successfully identified 16 of the total 18 patients who later recovered consciousness in the two testing datasets. In parallel, the specificity in the two testing datasets was 92% and 83%, respectively. Taken together, these results suggest that our method can precisely identify the patients with a high potential for recovery of consciousness while keeping false positives low among low-potential patients. In addition, although it has been widely considered that the prospect of a clinically meaningful recovery is unrealistic for prolonged DOC patients (Wijdicks and Cranford, 2005), our model correctly predicted 9 of the 10 DOC patients whose DOC had lasted three months or longer and who subsequently recovered consciousness, including three patients whose DOC had lasted six months or longer, suggesting that it can also be applied to prolonged DOC patients.

According to their residual level of consciousness, DOC patients can be classified into distinct diagnostic entities, including VS/UWS and MCS. As MCS is often viewed as a state with a potentially more favorable outcome (Luauté et al., 2010), a misdiagnosis of VS/UWS could heavily bias the judgment of the prognosis, the medical treatment options and even the associated ethical decisions. Some influential studies have found that a few VS/UWS patients exhibit near-normal high-level cortical activation in response to certain stimuli or commands (Owen et al., 2006; Monti et al., 2010), and that late recovery of responsiveness and consciousness is not exceptional in patients with VS/UWS (Estraneo et al., 2010). Instead of predicting diagnosis, this study used the one-year outcome as the target for regression and classification. Our method, which is based on the combined use of clinical and neuroimaging data, successfully identified seven of the eight patients in the testing dataset who were initially diagnosed as VS/UWS but subsequently achieved a promising outcome.

By analyzing the sMC F-value for each predictor in the regression model, we investigated the prognostic effects of these predictors. In particular, the sMC F-value of the incidence age was greater than that of the other clinical characteristics, suggesting that incidence age might be the most important predictor among the clinical characteristics. Notably, the sMC F-value for the imaging features as a whole seemed to be greater than that for the clinical features, as shown in Figure 4B. We therefore speculate that the patient's residual consciousness capacity, indicated by brain networks and functional connectivity detected from resting state fMRI, might have a stronger prognostic effect than their clinical characteristics.

Some previous studies have shown that the resting state functional connectivity within the default mode network is decreased in severely brain-damaged patients, in proportion to their degree of consciousness impairment, from locked-in syndrome to minimally conscious, vegetative and coma patients (Vanhaudenhuyse et al., 2010). Moreover, the reduced functional connectivity within the default mode network, specifically between the MPFC and the PCC, may predict the outcome for DOC patients (Silva et al., 2015). Our model also validates that the aMPFC and PCC in the default mode network play important roles in predicting prognosis.

Most importantly, we found that the functional connectivity between the aMPFC and the DMPFC had the maximum sMC F-value in the prognostic regression model. The aMPFC is one of the core brain areas in the default mode network, whereas the DMPFC is located in the executive control network. Previous studies have demonstrated that this functional connectivity is negative in normal healthy populations, with the anti-correlation being proposed as one reflection of the intrinsic functional organization of the human brain (Fox et al., 2005). The default mode network directly contributes to internally generated stimulus-independent thoughts, self-monitoring, and baseline monitoring of the external world, while the executive control network supports active exploration of the external world. Correct communication and coordination between these two intrinsically anti-correlated networks is thought to be very important for optimal information integration and cognitive functioning (Buckner et al., 2008). A recent study reported that negative functional connectivities between the default mode network and the task-positive network were observed only in patients who recovered consciousness and in healthy controls, whereas positive values were obtained in patients with impaired consciousness (Di Perri et al., 2016). Further, our regression model suggests that the anti-correlation between these two diametrically opposed networks (i.e. the default mode network and the executive control network) may be the most crucial imaging feature for predicting the outcomes of DOC patients. We therefore conclude that our prognostic model has good interpretability: it not only verifies the findings of previous studies but also provides a window through which to assess the relative significance of various predictors for the prognosis of DOC patients.

This study involved two testing datasets, which were found to be quite different, as shown in Table 1. First, the distributions of the etiology of the patients were remarkably different in the two datasets. The numbers of patients suffering from trauma/stroke/anoxia were 12/6/7 and 8/0/16 in the ‘Beijing HDxt’ and ‘Guangzhou HDxt’ datasets, respectively. The outcomes were also different. In the ‘Beijing HDxt’ dataset, 12 out of the total 25 patients recovered consciousness, compared with six out of the total 24 patients in the ‘Guangzhou HDxt’ dataset. Given that the characteristics of the two medical centers and their roles in the local health care system are different, we speculate that this could be one of the main reasons why the enrolled patient populations were heterogeneous. As described in the Introduction, DOC can have many causes and can be associated with several neuropathological processes and different severities, leading to the suggestion that information from different domains should be integrated to improve diagnosis and prognostication (Bernat, 2016). Our study demonstrates that the combination of imaging and clinical features can achieve a better performance than the use of single domain features.

However, some caution is warranted. First, although this study involved 112 DOC patients, the number of patients who would later recover consciousness was relatively low (i.e. 31/112), so a much larger cohort will be needed for further validation. Second, the PPVs for the two testing datasets were remarkably different, with that for the ‘Guangzhou HDxt’ dataset being relatively low (67% versus 92%). Although our method predicted that nine patients in this dataset would recover consciousness, only six of them did (i.e. GOS ≥3), with the other three remaining unconscious at the end of the follow-up (i.e. GOS ≤2). Given that many additional factors are associated with the outcome of DOC patients, including medical complications and nutrition, future work will need to integrate more information in order to provide more precise predictions. Third, the signal-to-noise ratio of resting state fMRI can influence functional connectivity analysis, so calibrations will be necessary when applying the predictive model across different sites, including standardizing the MRI acquisition protocols (e.g. scanner hardware, imaging protocols and acquisition sequences) and the patients' management strategies (see Appendix 10 for more information). Finally, a DOC patient’s prognosis can be considered along at least three dimensions: survival/mortality, recovery of consciousness, and functional recovery. This study focused on predicting recovery of consciousness, and the patients who died during the follow-up were excluded. In the future, we will consider more outcome assessments, including survival/mortality and functional recovery.

In summary, we propose a prognostic model that combines resting state fMRI with clinical characteristics to predict the one-year outcome of DOC patients at the individual level. The average accuracy of classifying a patient as ‘consciousness recovery’ or not was around 88% in the training dataset and two unseen testing datasets. Our model is also interpretable, giving physicians and scientists insight into the significance of the different predictors, including brain networks, functional connectivities and clinical characteristics. Together, these advantages could provide an objective prognosis for DOC patients that will optimize their management and deepen our understanding of brain function during unconsciousness.

Appendix 1

Demographic and clinical characteristics of patients and normal controls in this study

The diagnosis in this study was made by experienced physicians according to the CRS-R scale. Patients were diagnosed as MCS when they demonstrated at least one of the following behaviors: (1) following simple commands; (2) yes/no responses (gestural or verbal); (3) intelligible verbalization; (4) purposeful behavior in response to an environmental stimulus; (5) vocalization or gestures in direct response to questions; (6) reaching for objects that demonstrates a clear relationship between the position of the object and the direction of the movement; (7) touching or holding objects; (8) following or staring at an object in direct response to its movement. Emergence from the MCS was signaled by the return of functional communication and/or object use.

In this study, the patients underwent CRS-R assessments at least twice weekly during the two weeks before MRI scanning, so the CRS-R was generally administered 4–5 times per patient. The highest CRS-R score was taken as the diagnosis and is listed in the following tables. T0: the time point of the MRI scanning; T1: the time point of follow-up.

Appendix 1—table 1
Demographic and clinical characteristics of patients in the ‘Beijing_750’ dataset.
https://doi.org/10.7554/eLife.36173.015
Patient alias | Gender | Age (years) | Diagnosis | Etiology | Structural lesions on MRI | Time to MRI (months) | Number of CRS-R assessments | CRS-R score at T0 | CRS-R subscore at T0 | CRS-R score at T1 | CRS-R subscore at T1 | Follow-up (months) | GOS | Predicted score
001M36VS/UWSAnoxiaDiffuse pons damage1670221022244632315418.26
002M29MCSTraumaBilateral-temporo-parietal damage94183551132245622339415.31
003F33VS/UWSTraumaBilateral-frontal lobe damage, atrophy12571022022245532312422.88
004F28MCSTraumaL-frontal-temporal lobe damage14153351032245622319416.58
005M23MCSAnoxiaDiffuse cortical and subcortical atrophy34102321022145522313417.08
006M45MCSStrokeL-temporo-parietal damage9492221021733422312313.94
007M39MCSStrokeBrainstem damage14173451131944512312314.39
008F27MCSTraumaL-basal ganglia damage106123321031834512319316.09
009M23MCSTraumaDiffuse cortical and subcortical atrophy6491321021944422313310.94
010M42MCSStrokeL-basal ganglia damage3771031021941632312314.72
011M53MCSStrokeDiffuse cortical and basal ganglia (caudates) damage75113321021433212314311.55
012F40VS/UWSStrokeDiffuse cortical and basal ganglia damage5671121021433312212314.67
013M22VS/UWSTraumaL-frontal-temporo-parietal lobe damage3471121021533412227315.48
014F64VS/UWSStrokeL-thalamus, basal ganglia lesions147112102112331021728.28
015F42VS/UWSAnoxiaDiffuse anoxic cortical lesions14711210292221021429.02
016M45VS/UWSAnoxiaDiffuse anoxic cortical lesions95500210271121021528.65
017F60VS/UWSAnoxiaDiffuse anoxic cortical lesions44610210261021021327.71
018M42VS/UWSStrokeR-cerebral hemisphere lesions647112102711210214212.44
019M51VS/UWSAnoxiaDiffuse cortical and subcortical atrophy34711210271121022824.28
020F35VS/UWSAnoxiaBilateral-frontal lobe damage, atrophy24711210271121021325.87
021M71VS/UWSTraumaDiffuse cortical and subcortical atrophy66310110041011011324.46
022F30VS/UWSAnoxiaBilateral-basal ganglia damage24400200270221023826.92
023F58VS/UWSTraumaDiffuse cortical and subcortical atrophy24300210040021011425.09
024M23MCSTraumaR-basal ganglia (caudates) damage5571031021122320212214.57
025F66VS/UWSTraumaBilateral-temporo-parietal damage14610210281131023225.71
026F25VS/UWSAnoxiaDiffuse cortical and subcortical atrophy34510200261120023626.75
027M48VS/UWSAnoxiaDiffuse cortical and subcortical atrophy45711210281131022927.83
028F28MCSAnoxiaDiffuse cortical and subcortical atrophy5492221021123310232211.36
029M57VS/UWSAnoxiaDiffuse cortical and subcortical atrophy114610210261021023324.70
030M61MCSStrokeBilateral-temporo-parietal lobe damage24111341021122311212210.34
031M40VS/UWSAnoxiaDiffuse cortical and subcortical atrophy44400110250111022725.70
032M39VS/UWSStrokeR-basal ganglia damage, atrophy34711210271121021228.03
033M41VS/UWSAnoxiaDiffuse cortical and subcortical atrophy24500210250021021326.44
034M26VS/UWSStrokeDiffuse cortical and subcortical atrophy544711210271121023827.28
035F50VS/UWSAnoxiaDiffuse cortical and subcortical atrophy86610210291222021225.77
036F53VS/UWSStrokeBilateral brainstem, midbrain damage34511210071121022828.02
037M67VS/UWSStrokeR- brainstem, cerebellar damage14511210030020011222.04
038M45MCSStrokeDiffuse cortical and subcortical atrophy2591321021022211213210.91
039F35VS/UWSAnoxiaDiffuse cortical and subcortical atrophy346102102811220219210.24
040F46MCSTraumaDiffuse axonal injury777112222121333221251214.76
041M49VS/UWSStrokeBilateral-brainstem, cerebellar damage1047112102711210228210.87
042M45VS/UWSStrokeDiffuse cortical and basal ganglia damage34711210281221021927.59
043M18VS/UWSAnoxiaDiffuse cortical and subcortical atrophy856111102912310212210.85
044M53VS/UWSAnoxiaBilateral-occipital lobe damage, atrophy24300200171121023421.98
045M46VS/UWSTraumaR-temporo-parietal damage44610120261012021327.23
046F29VS/UWSAnoxiaDiffuse cortical and subcortical atrophy284711210291231021228.31
047F47MCSStrokeR-basal ganglia damage4758113102112222121229.66
048M58VS/UWSStrokeBilateral-temporo-parietal lobe damage64711210281131022727.05
049M66VS/UWSAnoxiaL-frontal lobe damage44400200261021023824.79
050M34VS/UWSTraumaDiffuse axonal injury346112101812210214210.28
051F31MCSTraumaL-frontal-temporo-parietal lobe damage3511133202811220215215.56
052M33VS/UWSStrokeL-temporo-parietal lobe damage174610210281131021327.67
053F31VS/UWSAnoxiaDiffuse cortical and basal ganglia (caudates) damage14610210261021022728.36
054F28VS/UWSAnoxiaDiffuse cortical and subcortical atrophy34610210281122023229.23
055F26VS/UWSStrokeL-basal ganglia damage446102102610210212210.96
056M45VS/UWSTraumaDiffuse axonal injury14610210261021022929.05
057F69VS/UWSStrokeDiffuse cortical and subcortical atrophy446102102710220233212.43
058F68VS/UWSTraumaDiffuse axonal injury66711210291321021729.74
059M50VS/UWSStrokeL-frontal-temporo-parietal lobe damage34611110282220022727.01
060M60MCSTraumaBilateral brainstem, midbrain damage74111341021122311230211.69
061M44VS/UWSAnoxiaDiffuse cortical and subcortical atrophy246102102400210113210.48
062F35VS/UWSAnoxiaBilateral-basal ganglia damage35721110292311022729.07
063F43VS/UWSAnoxiaDiffuse cortical and subcortical atrophy247112102820211229210.09
Appendix 1—table 2
Demographic and clinical characteristics of patients in the ‘Beijing_HDxt’ dataset.
https://doi.org/10.7554/eLife.36173.016
Patient alias | Gender | Age (years) | Diagnosis | Etiology | Structural lesions on MRI | Time to MRI (months) | Number of CRS-R assessments | CRS-R score at T0 | CRS-R subscore at T0 | CRS-R score at T1 | CRS-R subscore at T1 | Follow-up (months) | GOS | Predicted score
001M19VS/UWSTraumaL-temporo-parietal lobe damage6471121022245622340420.37
002M26MCSTraumaR-thalamus, basal ganglia lesions36102321022345632347417.12
003F22VS/UWSTraumaL-temporal lobe damage4461021022245622347414.05
004M41VS/UWSStrokeBilateral brainstem, midbrain damage3461121012345632350420.23
005M36MCSStrokeBilateral brainstem damage446003102234563233947.75
006M34VS/UWSAnoxiaDiffuse cortical and subcortical atrophy1461111021432312331317.25
007F18VS/UWSTraumaDiffuse axonal injury3450120021433212341314.86
008M58MCSTraumaR-frontal lobe damage12481131021533312340315.62
009M41MCSTraumaR-frontal-temporo-parietal lobe damage15112330121834422342318.89
010M46VS/UWSStrokeL-brainstem, cerebellar damage7461021021433212353317.05
011M25VS/UWSAnoxiaDiffuse cortical and subcortical atrophy4661021021422412346318.07
012M58VS/UWSTraumaL-brainstem damage1771121021935512340310.75
013M36VS/UWSTraumaL-frontal-temporo-parietal lobe damage647112102102321024429.58
014M58VS/UWSTraumaR-frontal-temporo-parietal lobe damage44610210261021024526.69
015M65VS/UWSStrokeDiffuse cortical and subcortical atrophy34310000251011024324.01
016F24VS/UWSTraumaDiffuse axonal injury4466102102812210244214.03
017F46VS/UWSStrokeL-pons damage247112102711210240212.11
018M53VS/UWSAnoxiaDiffuse cortical and subcortical atrophy34410100261021024125.38
019F32VS/UWSTraumaL-temporo-parietal lobe damage346102102811220223213.76
020M41VS/UWSAnoxiaDiffuse cortical and subcortical atrophy244101002811220240212.06
021F33VS/UWSAnoxiaDiffuse cortical and subcortical atrophy756211002112322024724.55
022M49VS/UWSAnoxiaDiffuse cortical and subcortical atrophy24610210261021021428.97
023F25MCSAnoxiaBilateral thalamus, brainstem damage47144500231024002250212.42
024M63VS/UWSStrokeL-basal ganglia lesions54400110251011024828.22
025M68VS/UWSTraumaL-frontal-temporo-parietal lobe damage24500210260121024729.72
Appendix 1—table 3
Demographic and clinical characteristics of patients in the ‘Guangzhou_HDxt’ dataset.
https://doi.org/10.7554/eLife.36173.017
Patient alias | Gender | Age (years) | Diagnosis | Etiology | Time to MRI (months) | CRS-R score at T0 | CRS-R subscore at T0 | Follow-up (months) | GOS | Predicted score
001F15MCSAnoxia11313511259420.92
002M29MCSTrauma4911401261316.50
003F27MCSTrauma1910510229316.67
004F20MCSTrauma2811310241320.37
005M30MCSTrauma11511622363317.34
006M31MCSTrauma12044522351314.00
007M28VS/UWSAnoxia151020026929.71
008M48MCSTrauma1122341025529.11
009M46VS/UWSTrauma2410200165212.58
010M78VS/UWSAnoxia141001024929.20
011M39VS/UWSAnoxia150021025124.20
012F46VS/UWSAnoxia2400110246214.49
013M39VS/UWSAnoxia251020024327.30
014M16VS/UWSAnoxia2300100271214.17
015M25MCSAnoxia1121351027828.56
016F76VS/UWSAnoxia541001025923.57
017M36VS/UWSTrauma2500120265217.28
018M32VS/UWSAnoxia1061021025625.66
019F49VS/UWSAnoxia161021024924.55
020F52VS/UWSAnoxia130001022725.08
021M62VS/UWSAnoxia2300100228210.26
022F33VS/UWSAnoxia261021022929.09
023F28VS/UWSAnoxia9610210267212.62
024F57VS/UWSAnoxia150021024224.72
Appendix 1—table 4
Demographics of the healthy controls in the ‘Beijing_750’ dataset.
https://doi.org/10.7554/eLife.36173.018
Alias | Gender | Age | Handedness
NC001 | F | 40 | Right
NC002 | M | 50 | Right
NC003 | F | 34 | Right
NC004 | M | 25 | Right
NC005 | M | 28 | Right
NC007 | F | 24 | Right
NC008 | F | 47 | Right
NC009 | F | 22 | Right
NC010 | F | 60 | Right
NC012 | F | 26 | Right
NC013 | M | 21 | Right
NC014 | F | 27 | Right
NC015 | M | 40 | Right
NC016 | M | 44 | Right
NC017 | F | 22 | Right
NC018 | M | 50 | Right
NC019 | M | 27 | Right
NC020 | F | 43 | Right
NC021 | F | 25 | Right
NC022 | M | 54 | Right
NC023 | F | 52 | Right
NC026 | M | 46 | Right
NC027 | F | 52 | Right
NC028 | M | 29 | Right
NC029 | F | 46 | Right
NC030 | M | 44 | Right
NC031 | M | 30 | Right
NC032 | M | 31 | Right
NC033 | M | 32 | Right
NC034 | M | 30 | Right
Appendix 1—table 5
Demographics of the healthy controls in the ‘Beijing_HDxt’ dataset.
https://doi.org/10.7554/eLife.36173.019
Alias | Gender | Age | Handedness
NC001_HDxt | M | 44 | Right
NC002_HDxt | M | 42 | Right
NC003_HDxt | M | 30 | Right
NC004_HDxt | M | 40 | Right
NC005_HDxt | M | 30 | Right
NC006_HDxt | M | 30 | Right
NC007_HDxt | F | 58 | Right
NC008_HDxt | F | 54 | Right
NC009_HDxt | F | 41 | Right
NC010_HDxt | F | 41 | Right

Appendix 2

Brain networks and regions of interest in this study

The six brain networks investigated in this study and the names of their regions of interest (ROIs). Appendix 2—table 1 lists the six brain networks, the names of the ROIs, the peak coordinates in Montreal Neurological Institute (MNI) space, and the corresponding references. Each ROI was defined as a spherical region with a radius of 6 mm centered on the peak coordinates of the ROI.
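To illustrate the 6 mm spherical ROI definition, the sketch below builds a spherical mask on a voxel grid. It is a simplified stand-in: it treats the grid origin as the millimetre origin, whereas real MNI coordinates pass through the image affine; the function name and parameters are ours, not the authors'.

```python
import numpy as np

def spherical_roi_mask(shape, voxel_size_mm, center_mm, radius_mm=6.0):
    """Boolean mask of voxels whose centers lie within `radius_mm` of `center_mm`.

    Simplification: `center_mm` is taken in the grid's own millimetre frame
    (a real pipeline would map MNI coordinates through the image affine).
    """
    # Millimetre coordinates of every voxel center on the grid
    grid = np.indices(shape).reshape(3, -1).T * np.asarray(voxel_size_mm)
    dist = np.linalg.norm(grid - np.asarray(center_mm), axis=1)
    return (dist <= radius_mm).reshape(shape)

# Example: a 6 mm sphere on a 3 mm isotropic grid, centred at (30, 30, 30) mm
mask = spherical_roi_mask((20, 20, 20), (3, 3, 3), (30, 30, 30))
```

Averaging the fMRI time series over the `True` voxels of such a mask would then give the ROI's representative signal.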

Appendix 2—table 1
Brain networks and ROIs in this study.
https://doi.org/10.7554/eLife.36173.021
Brain network | ROI name | ROI abbreviation | Peak MNI coordinates | References
Default mode (Raichle, 2011; Demertzi et al., 2015)
Anterior medial prefrontal cortex | aMPFC | −1 54 27
Posterior cingulate cortex/precuneus | PCC | 0 −52 27
Left lateral parietal cortex | L.LatP | −46 −66 30
Right lateral parietal cortex | R.LatP | 49 −63 33
Executive control (Seeley et al., 2007; Raichle, 2011)
Dorsal medial PFC | DMPFC | 0 27 46
Left anterior prefrontal cortex | L.PFC | −44 45 0
Right anterior prefrontal cortex | R.PFC | 44 45 0
Left superior parietal cortex | L.Parietal | −50 −51 45
Right superior parietal cortex | R.Parietal | 50 −51 45
Salience (Seeley et al., 2007; Raichle, 2011; Demertzi et al., 2015)
Left orbital frontoinsula | L.AIns | −40 18 −12
Right orbital frontoinsula | R.AIns | 42 10 −12
Dorsal anterior cingulate | dACC | 0 18 30
Sensorimotor (Raichle, 2011; Demertzi et al., 2015)
Left primary motor cortex | L.M1 | −39 −26 51
Right primary motor cortex | R.M1 | 38 −26 51
Supplementary motor area | SMA | 0 −21 51
Auditory (Raichle, 2011; Demertzi et al., 2015)
Left primary auditory cortex | L.A1 | −62 −30 12
Right primary auditory cortex | R.A1 | 59 −27 15
Middle cingulate cortex | MCC | 0 −7 43
Visual (Demertzi et al., 2015)
Left primary visual cortex | L.V1 | −13 −85 6
Right primary visual cortex | R.V1 | 8 −82 6
Left associative visual cortex | L.V4 | −30 −89 20
Right associative visual cortex | R.V4 | 30 −89 20

Appendix 3

Brain functional network templates

Although the neurobiological implications of spontaneous neuronal activity are not fully understood, spontaneous fluctuations in the blood oxygenation level-dependent signal have been found to be coherent within a variety of functionally relevant brain regions, which are said to constitute a ‘network’. Moreover, several networks have been found to be spatially consistent across different healthy subjects (Damoiseaux et al., 2006), and researchers have suggested that the brain networks assessed by resting state fMRI may reflect an intrinsic functional architecture of the brain (Raichle, 2011). As mentioned in the manuscript, multiple networks have been reported to be disrupted in DOC patients. Here, the connection templates of the six brain networks were derived from the healthy control group of the ‘Beijing 750’ dataset. This study focused on the cortex, so six functional networks were investigated: the default mode, executive control, salience, sensorimotor, auditory, and visual networks. Group functional connectivity maps for each of the six networks were created with a one-sample t-test, as shown in Appendix 3—figure 1. These templates are shown separately on the brain surface using the SurfStat toolbox (SurfStat, RRID:SCR_007081). The color bar represents the T value.
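The template construction described above (per-subject seed correlation maps combined across subjects with a one-sample t-test) can be sketched on synthetic data. This is an illustrative reconstruction, not the authors' code; the Fisher r-to-z transform before the group test is a common convention that we assume here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_timepoints, n_voxels = 10, 120, 50

# Toy data: voxels 0-9 share a common fluctuation with the seed ("in network")
subject_z = []
for _ in range(n_subjects):
    seed = rng.standard_normal(n_timepoints)
    data = rng.standard_normal((n_timepoints, n_voxels))
    data[:, :10] += 1.5 * seed[:, None]          # coherent with the seed
    # Seed-to-voxel Pearson correlations for this subject
    r = np.corrcoef(seed, data.T)[0, 1:]
    subject_z.append(np.arctanh(r))              # Fisher r-to-z

# Group-level one-sample t-test per voxel (the network "template" is the t-map)
z = np.vstack(subject_z)                          # subjects x voxels
t_map, p_map = stats.ttest_1samp(z, popmean=0.0, axis=0)
```

Thresholding `t_map` (with an appropriate multiple-comparison correction) would yield the kind of group template displayed in Appendix 3—figure 1.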

Appendix 3—figure 1
The six brain functional network templates in this study.
https://doi.org/10.7554/eLife.36173.023

Appendix 4

Quality control for resting state functional connectivity

During MRI scanning, foam pads and headphones were used to reduce head motion and scanner noise. The normal controls were instructed to lie still with their eyes closed, to move as little as possible and not to think about anything in particular. The same instructions were given to the patients, but because of their consciousness and cognitive impairments, we could not fully control for a prolonged eyes-closed yet awake scanning session. Appendix 4—figure 1 shows the cumulative distribution of head motion per volume (framewise displacement) for the normal controls and the patients. Appendix 4—figure 2 shows the quality control results for resting state fMRI in this study. Appendix 4—figure 3 shows the histogram of the number of fMRI volumes remaining after scrubbing.
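Framewise displacement (FD) in the sense of Power et al. is typically computed as the sum of the absolute volume-to-volume changes in the six rigid-body motion parameters, with rotations converted to arc length on a sphere of roughly 50 mm radius. The sketch below follows that common convention (the study's exact convention is not stated here), and also shows a simple scrubbing count; threshold and neighbor-removal rules vary between pipelines.

```python
import numpy as np

def framewise_displacement(motion, head_radius_mm=50.0):
    """Power et al.-style FD from six rigid-body parameters per volume.

    `motion`: array (n_volumes, 6) = 3 translations (mm) + 3 rotations (rad).
    Rotations are converted to arc length on a sphere of `head_radius_mm`.
    """
    motion = np.asarray(motion, dtype=float)
    diffs = np.abs(np.diff(motion, axis=0))
    diffs[:, 3:] *= head_radius_mm               # radians -> mm of arc
    # First volume has no predecessor; FD is defined as 0 there
    return np.concatenate([[0.0], diffs.sum(axis=1)])

# Scrubbing sketch: count volumes surviving an FD threshold (e.g. 0.5 mm).
# Real pipelines often also drop volumes adjacent to a high-FD volume.
motion = np.zeros((5, 6))
motion[2, 0] = 0.6                               # a 0.6 mm jump into volume 2
fd = framewise_displacement(motion)
kept = int(np.sum(fd < 0.5))
```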

Appendix 4—figure 1
Cumulative distribution of head motion per volume (framewise displacement) for normal controls and DOC patients separately in the training dataset ‘Beijing 750’ (A1), the testing dataset ‘Beijing HDxt’ (A2), and the testing dataset ‘Guangzhou HDxt’ (A3).

The normal controls are shown in the left column and the DOC patients in the right column. No healthy control data were available for the Guangzhou centre. In both patients and controls, head position was stable to within 1.5 mm for the vast majority (>95%) of brain volumes.

https://doi.org/10.7554/eLife.36173.025
Appendix 4—figure 2
Correlations between motion artifact and neuroanatomical distance between the ROIs in this study.

Prior studies have shown that motion artifacts tend to vary with the neuroanatomical distance between brain nodes. Here, we conducted quality control analyses as described in a previous study (Power et al., 2015). Specifically, we computed correlations between head motion (mean FD) and each resting state functional connectivity (RSFC) feature and plotted them as a function of neuroanatomical distance (mm) for subjects in the training dataset ‘Beijing 750’ (B1), the testing dataset ‘Beijing HDxt’ (B2), and the testing dataset ‘Guangzhou HDxt’ (B3). Smoothing curves (in red) were plotted using a moving average filter.
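This QC-FC analysis can be sketched as follows. The data here are synthetic and the window length is an arbitrary choice for illustration; only the computation pattern (per-feature motion correlation, then distance-ordered moving average) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_features = 30, 200

mean_fd = rng.uniform(0.05, 0.6, n_subjects)            # per-subject mean FD
rsfc = rng.standard_normal((n_subjects, n_features))    # RSFC features
distance = rng.uniform(10, 150, n_features)             # inter-ROI distance (mm)

# Pearson correlation between head motion and each connectivity feature
fd_c = mean_fd - mean_fd.mean()
rs_c = rsfc - rsfc.mean(axis=0)
qc_fc = (fd_c @ rs_c) / (np.linalg.norm(fd_c) * np.linalg.norm(rs_c, axis=0))

# Moving-average smoothing over features sorted by distance (window = 21)
order = np.argsort(distance)
kernel = np.ones(21) / 21
smooth = np.convolve(qc_fc[order], kernel, mode="same")
```

Plotting `qc_fc[order]` against `distance[order]` with the `smooth` curve overlaid reproduces the style of Appendix 4—figure 2.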

https://doi.org/10.7554/eLife.36173.026
Appendix 4—figure 3
Histogram of the remaining number of fMRI volumes after scrubbing for each population, specifically the ‘Beijing 750’ dataset (C1), the ‘Beijing HDxt’ dataset (C2), and the ‘Guangzhou HDxt’ dataset (C3).
https://doi.org/10.7554/eLife.36173.027

Appendix 5

Warped regions of interest and brain network templates

Conventional fMRI preprocessing normalizes individual fMRI images into a standard space defined by a specific template image. In contrast, this study generated a functional connectivity image for each patient in his/her own fMRI space. During the preprocessing of each patient’s fMRI scans, the 22 ROIs and the six brain network templates were spatially warped into the individual fMRI space and resampled to the voxel size of the individual fMRI image. To verify the registration, we developed tools to visually check the transformed ROIs and brain network templates for each subject in this study.

Supplementary file 1 illustrates examples of the warped ROIs in the default mode network (DMN) for three DOC patients with GOS scores of 2, 3 and 4, respectively. In addition, as a reference, we show these figures for one normal control. The ROIs in the DMN comprise the anterior medial prefrontal cortex (aMPFC), the posterior cingulate cortex/precuneus (PCC), the left lateral parietal cortex (L.LatP), and the right lateral parietal cortex (R.LatP). Details about these four ROIs are listed in Appendix 2, and the brain network template of the DMN is provided in Appendix 3.
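One building block of such warping, mapping millimetre-space coordinates into a target image's voxel grid through its affine, can be sketched as below. This is only the affine part: the study's actual warping of ROIs into individual fMRI space would involve the nonlinear registration estimated by the preprocessing software. The example affine is hypothetical; the DMN peak coordinates are taken from Appendix 2.

```python
import numpy as np

def mm_to_voxel(affine, coords_mm):
    """Map world (mm) coordinates into voxel indices via an image affine."""
    inv = np.linalg.inv(affine)
    homog = np.c_[coords_mm, np.ones(len(coords_mm))]  # homogeneous coords
    return (homog @ inv.T)[:, :3]

# Hypothetical individual-space affine: 3 mm voxels, origin at (-90, -126, -72)
affine = np.array([[3.0, 0.0, 0.0, -90.0],
                   [0.0, 3.0, 0.0, -126.0],
                   [0.0, 0.0, 3.0, -72.0],
                   [0.0, 0.0, 0.0, 1.0]])

# DMN peak coordinates from Appendix 2 (MNI mm)
peaks_mm = np.array([[-1, 54, 27],    # aMPFC
                     [0, -52, 27],    # PCC
                     [-46, -66, 30],  # L.LatP
                     [49, -63, 33]])  # R.LatP
voxels = mm_to_voxel(affine, peaks_mm)
```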

Appendix 6

Correlations between imaging features and CRS-R scores at T1.

Appendix 6—figure 1 shows the brain area connection features and their Pearson's correlations with the CRS-R scores at the T1 time point.

Appendix 6—figure 1
The brain area connection features sorted by their Pearson's correlations to the CRS-R scores at the T1 time point in the training dataset ‘Beijing 750’.
https://doi.org/10.7554/eLife.36173.030

In addition, Appendix 6—figure 2 shows the functional connectivity features that were significantly correlated with the CRS-R scores at the T1 time point.

Appendix 6—figure 2
The functional connectivity features sorted by their Pearson's correlations to the CRS-R scores at the T1 time point across the DOC patients in the training dataset ‘Beijing 750’.
https://doi.org/10.7554/eLife.36173.031

Appendix 6—figure 3 shows the significant functional connectivity features as a Circos plot. The red links represent within-network functional connectivity, while the blue links represent inter-network functional connectivity. The width of each link is proportional to the strength of the functional connectivity.

Appendix 6—figure 3
The Circos map for the functional connectivity features that were significantly correlated to the CRS-R scores at the T1 time point across the DOC patients in the training dataset ‘Beijing 750’.
https://doi.org/10.7554/eLife.36173.032
Appendix 6—table 1
The brain area connection features and their Pearson's correlations to the CRS-R scores at the T1 time point across the DOC patients in the training dataset ‘Beijing 750’.
https://doi.org/10.7554/eLife.36173.033
ROI name | Pearson's correlation coefficient and p value
**DMN.aMPFC | r = 0.514, p=0.000
**ExecuContr.L.Parietal | r = 0.429, p=0.000
**DMN.PCC | r = 0.420, p=0.001
**DMN.R.LatP | r = 0.407, p=0.001
**ExecuContr.DMPFC | r = 0.405, p=0.001
*ExecuContr.R.Parietal | r = 0.363, p=0.003
*Sensorimotor.SMA | r = −0.332, p=0.008
*ExecuContr.R.PFC | r = 0.320, p=0.011
*Auditory.R.A1 | r = 0.315, p=0.012
*DMN.L.LatP | r = 0.298, p=0.018
*ExecuContr.L.PFC | r = 0.291, p=0.021
*Sensorimotor.L.M1 | r = 0.267, p=0.035
Auditory.L.A1 | r = 0.206, p=0.105
Salience.R.AIns | r = −0.187, p=0.142
Sensorimotor.R.M1 | r = 0.167, p=0.191
Visual.L.V4 | r = −0.151, p=0.236
Salience.dACC | r = −0.104, p=0.418
Salience.L.AIns | r = 0.075, p=0.560
Visual.R.V1 | r = 0.065, p=0.611
Auditory.MCC | r = 0.053, p=0.682
Visual.R.V4 | r = −0.031, p=0.809
Visual.L.V1 | r = −0.028, p=0.830
  1. **: p<0.05, FDR corrected; *: p<0.05, uncorrected.

Appendix 6—table 2
Functional connectivity features and their Pearson's correlations to the CRS-R scores at the T1 time point across the DOC patients in the training dataset ‘Beijing 750’.
https://doi.org/10.7554/eLife.36173.034
Functional connectivity | Pearson's correlation coefficient and p-value
**DMN.aMPFC - ExecuContr.DMPFC | r = −0.489, p=0.000
*DMN.L.LatP - Visual.L.V4 | r = −0.421, p=0.001
*Auditory.MCC - Visual.R.V1 | r = 0.375, p=0.002
*ExecuContr.R.PFC - ExecuContr.R.Parietal | r = 0.361, p=0.004
*ExecuContr.DMPFC - Auditory.MCC | r = −0.351, p=0.005
*ExecuContr.L.PFC - Salience.dACC | r = −0.335, p=0.007
*Sensorimotor.R.M1 - Sensorimotor.SMA | r = −0.330, p=0.008
*Sensorimotor.R.M1 - Auditory.L.A1 | r = 0.319, p=0.011
*Salience.dACC - Visual.R.V1 | r = 0.319, p=0.011
*ExecuContr.DMPFC - Sensorimotor.L.M1 | r = −0.310, p=0.013
*DMN.R.LatP - Visual.R.V4 | r = −0.306, p=0.015
*ExecuContr.L.Parietal - Sensorimotor.L.M1 | r = −0.302, p=0.016
*DMN.aMPFC - Salience.dACC | r = −0.292, p=0.020
*DMN.aMPFC - Sensorimotor.L.M1 | r = −0.286, p=0.023
*DMN.aMPFC - DMN.PCC | r = 0.275, p=0.029
*ExecuContr.R.Parietal - Visual.R.V4 | r = −0.270, p=0.033
*DMN.aMPFC - Sensorimotor.R.M1 | r = −0.268, p=0.034
*ExecuContr.R.Parietal - Sensorimotor.R.M1 | r = −0.263, p=0.037
*Sensorimotor.L.M1 - Sensorimotor.SMA | r = −0.261, p=0.039
*DMN.R.LatP - Sensorimotor.R.M1 | r = −0.261, p=0.039
*ExecuContr.R.Parietal - Visual.L.V4 | r = −0.257, p=0.042
*Salience.dACC - Visual.L.V4 | r = 0.256, p=0.043
*ExecuContr.DMPFC - Sensorimotor.R.M1 | r = −0.255, p=0.043
*DMN.aMPFC - Visual.L.V1 | r = 0.251, p=0.047
*Salience.R.AIns - Sensorimotor.L.M1 | r = 0.250, p=0.049
*DMN.L.LatP - Sensorimotor.SMA | r = 0.248, p=0.050
  1. Specifically, the functional connectivity features were the functional connectivity between each pair of ROIs. As there were more than 200 such features, because of space limitations only those significantly correlated with the CRS-R scores at the T1 time point are shown. **: p<0.05, FDR corrected; *: p<0.05, uncorrected.
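The screening used in the two tables above (per-feature Pearson correlation against the T1 CRS-R scores, flagged at an FDR-corrected or uncorrected p<0.05) can be sketched on synthetic data. The Benjamini-Hochberg helper and the toy data are ours; only the computation pattern follows the tables.

```python
import numpy as np
from scipy import stats

def fdr_bh(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    # p_(k) <= alpha * k / m  <=>  p_(k) * m / k <= alpha
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    reject = np.zeros(len(p), dtype=bool)
    below = ranked <= alpha
    if below.any():
        last = int(np.max(np.nonzero(below)))
        reject[order[: last + 1]] = True
    return reject

rng = np.random.default_rng(2)
n_patients, n_features = 63, 22                 # mirrors 63 patients, 22 ROIs
crs_r_t1 = rng.uniform(4, 23, n_patients)       # toy outcome scores
features = rng.standard_normal((n_patients, n_features))
features[:, 0] += 0.15 * crs_r_t1               # one genuinely related feature

r = np.empty(n_features)
p = np.empty(n_features)
for j in range(n_features):
    r[j], p[j] = stats.pearsonr(features[:, j], crs_r_t1)
significant = fdr_bh(p)                          # the "**" entries
```

Sorting features by `|r|`, as in the tables, then ranks the related feature first.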

Appendix 7

Histogram depicting the imaging features included in CARS-PLSR models

We resampled 1000 times with replacement from the training dataset ‘Beijing 750’. In each bootstrap sample, CARS-PLSR was used to select a subset of imaging features. We then counted, for each imaging feature, the number of bootstrap samples in which it was included in the CARS-PLSR model. Appendix 7—figure 1 shows the resulting histogram; the length of each horizontal bar represents this count.
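The bootstrap counting logic can be sketched as below. As a stand-in for CARS-PLSR (implemented in libPLS; Li et al., 2014), this sketch selects the top-k features by absolute correlation in each bootstrap sample; only the resampling-and-counting structure mirrors the procedure above, and all data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n_patients, n_features, n_boot, top_k = 63, 50, 1000, 10

y = rng.standard_normal(n_patients)
X = rng.standard_normal((n_patients, n_features))
X[:, :3] += 0.8 * y[:, None]                     # three informative features

counts = np.zeros(n_features, dtype=int)
for _ in range(n_boot):
    idx = rng.integers(0, n_patients, n_patients)  # resample with replacement
    Xb, yb = X[idx], y[idx]
    # Stand-in selector: top-k features by |correlation| with the outcome
    yc = yb - yb.mean()
    Xc = Xb - Xb.mean(axis=0)
    corr = (yc @ Xc) / (np.linalg.norm(yc) * np.linalg.norm(Xc, axis=0) + 1e-12)
    selected = np.argsort(-np.abs(corr))[:top_k]
    counts[selected] += 1                          # selection frequency
```

Plotting `counts` as horizontal bars gives a figure of the kind shown in Appendix 7—figure 1: stably informative features accumulate counts near `n_boot`.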

Appendix 7—figure 1
Histogram depicting the imaging features included in CARS-PLSR models.
https://doi.org/10.7554/eLife.36173.036

Appendix 8

Validations in healthy controls

To test robustness, we evaluated whether the prognostic regression model generalized to the normal controls (NC) in the training dataset ‘Beijing 750’ (n = 30) and the testing dataset ‘Beijing HDxt’ (n = 10). No normal control data were available from the Guangzhou centre. As clinical characteristics were not available for the NC subjects, we calculated their subscores using the imaging features alone and then compared these subscores with those of the DOC patients. Appendix 8—figure 1 shows the imaging subscores for all of the subjects in the three datasets. We emphasize that the normal controls in the training dataset were used only to establish the brain network templates, and not for any training.

Appendix 8—figure 1
The imaging subscores for all of the subjects in the three datasets.
https://doi.org/10.7554/eLife.36173.038

We found that: (1) in the training dataset ‘Beijing 750’, the NC subjects had significantly larger imaging subscores than both the DOC patients who recovered consciousness and those who did not, and the patients who recovered had significantly larger imaging subscores than those who did not (one-way ANOVA, p<0.05, corrected for multiple comparisons); (2) in the testing dataset ‘Beijing HDxt’, the NC subjects had significantly larger imaging subscores than the DOC patients who did not recover consciousness, and the patients who recovered had significantly larger imaging subscores than those who did not (one-way ANOVA, p<0.05, corrected for multiple comparisons); (3) in the testing dataset ‘Guangzhou HDxt’, the imaging subscores of the DOC patients who recovered consciousness were significantly larger than those of the patients who did not (two-sample t-test, p<0.05).
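This kind of three-group comparison can be sketched as follows. The subscore values are synthetic, and the correction used here (Bonferroni over the pairwise t-tests) is one common choice; the exact post-hoc procedure used in the study is not restated here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical imaging subscores for three groups
nc        = rng.normal(15.0, 2.0, 30)   # normal controls
recovery  = rng.normal(12.0, 2.0, 12)   # DOC, consciousness recovery
non_recov = rng.normal(8.0, 2.0, 13)    # DOC, non-recovery

# Omnibus one-way ANOVA across the three groups
f_stat, p_omnibus = stats.f_oneway(nc, recovery, non_recov)

# Pairwise comparisons, Bonferroni-corrected over the 3 tests
pairs = {"NC vs recovery": (nc, recovery),
         "NC vs non-recovery": (nc, non_recov),
         "recovery vs non-recovery": (recovery, non_recov)}
p_adj = {name: min(1.0, stats.ttest_ind(a, b).pvalue * 3)
         for name, (a, b) in pairs.items()}
```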

Appendix 9

Variations across different sites

To investigate variations across sites, we performed two experiments using the normal control (NC) subjects in this study. First, we tested whether the predicted imaging subscores of the NC subjects differed significantly between the training dataset ‘Beijing 750’ (n = 30) and the testing dataset ‘Beijing HDxt’ (n = 10). We found no significant difference between the two groups (two-sample t-test, p=0.24). The distribution is shown in Appendix 9—figure 1.

Appendix 9—figure 1
The distribution of the predicted imaging subscores of the healthy controls at different sites.
https://doi.org/10.7554/eLife.36173.040

Second, we investigated the relationship between the fMRI signal-to-noise ratio (SNR) and the predicted imaging subscores. Different MRI acquisition protocols (e.g. scanner hardware, imaging protocols and acquisition sequences) can influence the imaging SNR, but estimating the SNR in resting-state fMRI is not trivial because the noise is complex and varies spatially. Here, we calculated the temporal SNR (tSNR) as the ratio between the mean fMRI signal and its temporal standard deviation in each voxel (Welvaert and Rosseel, 2013), and then averaged it across all voxels in each region of interest (ROI) (Gardumi et al., 2016; Hay et al., 2017). As there were 22 ROIs in this study, the median of the 22 ROI tSNR values was used as the measure of the fMRI SNR. We then correlated the median tSNR with the predicted imaging subscores across all of the NC subjects (n = 40) and found a significant correlation between them (Pearson's correlation r = 0.36, p=0.024), as shown in Appendix 9—figure 2.
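The tSNR summary described above (per-voxel mean over temporal standard deviation, averaged within each of 22 ROIs, then reduced to the median) can be sketched on toy data. The ROI labelling and signal levels here are illustrative assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(5)
n_timepoints, n_voxels = 200, 300
roi_labels = np.arange(n_voxels) % 22            # 22 hypothetical ROIs

# Toy fMRI: baseline ~1000 with temporal noise of sd 10, so tSNR is near 100
fmri = 1000.0 + 10.0 * rng.standard_normal((n_timepoints, n_voxels))

tsnr_voxel = fmri.mean(axis=0) / fmri.std(axis=0)      # per-voxel tSNR
roi_tsnr = np.array([tsnr_voxel[roi_labels == k].mean()  # ROI average
                     for k in range(22)])
median_tsnr = np.median(roi_tsnr)                      # per-subject summary
```

`median_tsnr`, computed per subject, is the quantity that was correlated with the predicted imaging subscores.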

Appendix 9—figure 2
The correlation between the fMRI signal-to-noise ratio (SNR) and the predicted imaging subscores in the healthy controls.
https://doi.org/10.7554/eLife.36173.041

From these two experiments, we conclude that (1) the fMRI tSNR could be one of the factors influencing the application of the presented model; and (2) the predicted imaging subscores for NC subjects could be comparable across different sites when the tSNR values are similar. We therefore suggest that our model can be applied to different centers, although calibration might be required. Furthermore, the tSNR in fMRI is associated not only with instrumental noise but is also modulated by subject-related noise, such as physiological fluctuations and motion artifacts (Huettel et al., 2009). Therefore, we suggest that, on the one hand, the quality of image acquisition, including the MRI scanner and the imaging sequence/parameters, needs to be guaranteed; and on the other hand, scanning protocols need to be standardized to reduce subject-related noise during scanning.

References

  1. 1
  2. 2
  3. 3
  4. 4
    Coma and Disorders of Consciousness. London: Springer London
    1. M-A Bruno
    2. D Ledoux
    3. A Vanhaudenhuyse
    4. O Gosseries
    5. A Thibaut
    6. S Laureys
    (2012)
    11–23, Prognosis of Patients with Altered State of Consciousness, Coma and Disorders of Consciousness. London: Springer London, 10.1007/978-1-4471-2440-5_2.
  5. 5
  6. 6
    Vegetative state two decades after the Multi-Society task force (MSTF) Report
    1. GG Celesia
    (2016)
    Brain Function and Responsiveness in Disorders of Consciousness: Springer pp. 171–184.
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
    Audio-visual crossmodal fMRI connectivity differentiates single patients with disorders of consciousness. 
    1. A Demertzi
    2. G Antonopoulos
    3. HU Voss
    4. JS Crone
    5. ND Schiff
    6. M Kronbichler
    (2014)
    Frontiers in Human Neuroscience: Belgian Brain Council 2014 MODULATING the BRAIN: FACTS, FICTION,FUTURE.
  13. 13
  14. 14
    Measuring states of pathological (un)consciousness: research dimensions, clinical applications, and ethics
    1. A Demertzi
    2. JD Sitt
    3. S Sarasso
    4. W Pinxten
    (2017)
    Neuroscience of Consciousness, 2017, 10.1093/nc/nix010, 30042843.
  15. 15
  16. 16
  17. 17
  18. 18
    Resting-state connectivity biomarkers define neurophysiological subtypes of depression
    1. AT Drysdale
    2. L Grosenick
    3. J Downar
    4. K Dunlop
    5. F Mansouri
    6. Y Meng
    7. RN Fetcho
    8. B Zebley
    9. DJ Oathes
    10. A Etkin
    11. AF Schatzberg
    12. K Sudheimer
    13. J Keller
    14. HS Mayberg
    15. FM Gunning
    16. GS Alexopoulos
    17. MD Fox
    18. A Pascual-Leone
    19. HU Voss
    20. BJ Casey
    21. MJ Dubin
    22. C Liston
    (2017)
    Nature Medicine, 23, 10.1038/nm.4246, 27918562.
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
    Coma recovery Scale-Revised
    1. J Giacino
    2. K Kalmar
    (2006)
    Accessed June 2018.
  28. 28
    Behavioral assessment in patients with disorders of consciousness: gold standard or fool's gold?
    1. JT Giacino
    2. C Schnakers
    3. D Rodriguez-Moreno
    4. K Kalmar
    5. N Schiff
    6. J Hirsch
    (2009)
    In: S Laureys, N. D Schiff, A. M Owen, editors. Coma Science: Clinical and Ethical Implications.. Amsterdam: Elsevier Science Bv. pp. 33–48.
    https://doi.org/10.1016/S0079-6123(09)17704-X
  29. 29
  30. 30
  31. 31
  32. 32
    An introduction to variable and feature selection
    1. I Guyon
    2. A Elisseeff
    (2003)
    Journal of Machine Learning Research 3:1157–1182.
  33. 33
  34. 34
  35. 35
  36. 36
    Functional Magnetic Resonance Imaging
    1. SA Huettel
    2. AW Song
    3. G McCarthy
    (2009)
    Sinauer Associates, Functional Magnetic Resonance Imaging, 2 Edn.
  37. 37
    An Introduction to Statistical Learning
    1. G James
    2. D Witten
    3. T Hastie
    4. R Tibshirani
    (2013)
    Springer.
  38. 38
    Boundaries of Consciousness: Neurobiology and Neuropathology
    1. B Jennett
    (2005)
    537–543, Thirty years of the vegetative state: clinical, ethical and legal problems, Boundaries of Consciousness: Neurobiology and Neuropathology, 10.1016/S0079-6123(05)50037-2.
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
    H Li, Q Xu, Y Liang (2014) libPLS: an integrated library for partial least squares regression and discriminant analysis. PeerJ Preprints 2:e190.
    E Steyerberg (2008) Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer Science & Business Media.

Decision letter

  1. Klaas Enno Stephan
    Reviewing Editor; University of Zurich and ETH Zurich, Switzerland
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Prognostication of chronic disorders of consciousness using brain functional networks and clinical characteristics" for consideration by eLife. Your article has been reviewed by Timothy Behrens as the Senior Editor, a Reviewing Editor, and three reviewers. The following individuals involved in review of your submission have agreed to reveal their identity: Klaas Enno Stephan (Reviewer #1); Adrian M Owen (Reviewer #2); Jessica Schrouff (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This is a very interesting and timely study that examines the predictability of clinical outcome of patients with disorders of consciousness (DOC), based on "resting state" functional MRI and clinical data. It reports that both clinical data and neuroimaging data can separately predict individual clinical outcomes, but that the combined usage of both sources of information is superior.

The reviewers agreed that this study represents a potentially important attempt towards establishing practically useful predictors based on data that can be obtained in clinical practice. For DOC, the sample size is large, and the use of two separate test datasets is laudable. Having said this, we also identified several major concerns with regard to the statistical approach and the interpretation of the results. These are detailed below.

Essential revisions:

1) There are several concerns regarding the statistical approach taken. A main issue that featured in the discussion among reviewers is that feature selection and model performance assessment were performed without nested cross-validation. This should be done to obtain an accurate estimate of model performance in the training data set as well. In the present scenario in the paper, performance measures derived from the training set cannot be used as an estimate of the model's generalization ability. At present, the abstract and discussion are slightly misleading in that they report an average accuracy over training and validation sets. Second, the reasons for double feature selection (given the relatively low number of features) and, more generally, for the choice of PLSR are unclear. As the method is put forward in this work, comparison with simpler methods would be useful (e.g. vanilla SVM). Third, it is unclear how the bootstrap was used to perform internal validation. It only performs validation of the training samples. Typically, the trained model would be applied to the subjects not used in that bootstrap sample and the R2 would be computed based on those. Otherwise, the R2 value does not represent generalizability, but only goodness of fit. If this is a misunderstanding, this paragraph should be revised. Fourth, what regularisation was applied in PLSR? Fifth, there is little discussion of the PPV, while assessing the number of false negative is also very important. In all cases, statistical assessment through permutations of the labels is desired. Finally, when evaluating the performance of classification, it would be more convincing if you reported the balanced accuracy. This is because the results for the two different test datasets differ remarkably with respect to sensitivity (subsection “Accuracy comparison between single-domain models and combination model”); this suggests that the outcome groups were differentially balanced across the datasets.

2) The interpretation of classifier weights is problematic (Discussion section): Please see Haufe et al., 2014, Kia et al., 2017 and Weichwald et al., 2015 for the caveats of interpreting the weights of classification models. Especially, as feature selection is performed here, the authors could refrain from looking at the weights (and especially at their sign; for example, finding that duration of DOC has positive impact on outcome does not seem realistic) and instead identify which features were selected.

3) It is not clear that excluding patients who died is the right thing to do. At the very least, whether it was the right thing to do should have been discussed. On the one hand, I appreciate that if they died, one can never know if they would have regained consciousness. But in my experience (and I am sure this is the authors' experience too), these patients mostly die for a reason that is related to their initial injury (directly or indirectly). Plus, whether they die or not is of course part and parcel of the prognostication (Can we predict those who die?). I know the paper's intention was to try to predict recovery of consciousness (rather than who died), but I think an argument can be made for including those who died in the 'non recovery of consciousness' group. How does that affect the model? After all, a model that says, "Assuming you live, these data have an 88% chance of predicting whether you will regain consciousness" is nowhere near as good as a model that says, "These data have an 88% chance of predicting whether you will regain consciousness".

4) Subsection “MRI acquisition”. Supplementary material 1 shows quite a bit of variation in the time after injury at which the MRI was conducted (in the testing dataset alone, the range is 1 month to 77 months). What is noticeable is that almost all those who have a late scan do not recover. On the one hand, this is obvious – if you are still vegetative after 77 months you are very unlikely to recover whether you have a scan or not, and in this sense, the point of trying to 'predict recovery' is less obvious. Plus of course at this point there will have been many brain changes that do not reflect the original injury per se but may be the consequences of languishing in this stage for some years (atrophy, etc.). What would be most interesting would be to see how the model does with patients who, say, are scanned within 3 months of their injury. These are the patients that we really care about because if we could prognosticate among them then it could really change clinical practice. To put it simply, including data from a patient who has been in an MCS for 77 months or VS for 54 months and predicting that they will not recover is not adding much (and may be affecting the model), because the chances of any meaningful recovery at that stage are so low anyway.

5) fMRI analysis: How were brain lesions taken into account? This could influence the functional connectivity measures considerably.

6) fMRI motion correction (subsection “Imaging preprocessing”): The application of band-pass filtering to "scrubbed" fMRI timeseries is not unproblematic given that the removal of scans from the time series can severely affect temporal autocorrelation. Also, were there significant differences in the number of scrubbed volumes between the outcome groups (for unresponsive patients, movement would seem to be less of an issue than for controls or more responsive patients), and could this have affected regression/classification? Did you consider using scan-nulling regressors as an alternative? (see, for example, the discussion in Caballero-Gaudes and Reynolds, 2017) In SM4, please split the graphs by outcome. If more movement was observed for one subtype of patients, this could drive all the observed results.

7) Subsection “Accuracy comparison between single-domain models and combination model”: When only the clinical features are included in the authors model the accuracy in the Guangzhou dataset is 83%. When both clinical and imaging features are included it is 88%. Given the rather small size of the sample, this seems to amount to a difference of one patient which is neither here nor there in terms of clinical prognostication. It's not wildly different for the other dataset either. Can you please comment on this? More generally, it seems overstated how much better the combined features did in the model – this would need a critical discussion but also better quantification. In other words, how much more is gained exactly by using both datasets as opposed to either dataset alone? Please report details of the statistical superiority of the combined feature set (p-value, effect size). Also, when you directly compare neuroimaging data and clinical data, is there any difference in how well they support prediction?

8) It is laudable that the code of the statistical model for predictions is made available. Will the code for the analysis pipeline and the data also be made available?

9) The structure and writing of the paper could be improved. In particular, some of the 'sign-posting' in the article is poor. For example, 'normal controls' are referred to in subsection “Subjects” and subsection “Definition of networks and regions of interest” and so on, but we never really learn much about them until the Results section. This had the unfortunate consequence of making the article feel very vague and lacking important details. They were mostly revealed much later (in the Supplementary material) but better sign posting earlier on would help with that. Likewise, in the Materials and methods section "The two testing datasets were then used to validate the model". This comes out of nowhere, and one has no idea what the authors were referring to at this point. Likewise, in the Materials and methods section "All patients were followed up at least 12 months after the T0 time point" sounds vague and under specified. A sign post to Supplementary material 1 would have really helped.

10) Discussion section. At least two of these three references (the ones that I am an author on) do not support the point being made at all. In neither Owen et al., 2006 nor Monti et al., 2010 was "recovery after many years" observed.

https://doi.org/10.7554/eLife.36173.058

Author response

Essential revisions:

1) There are several concerns regarding the statistical approach taken.

A main issue that featured in the discussion among reviewers is that feature selection and model performance assessment were performed without nested cross-validation. This should be done to obtain an accurate estimate of model performance in the training data set as well. In the present scenario in the paper, performance measures derived from the training set cannot be used as an estimate of the model's generalization ability. At present, the abstract and discussion are slightly misleading in that they report an average accuracy over training and validation sets.

This study designed 257 features (253 imaging features and 4 clinical features), while there were only 63 samples in the training dataset "Beijing_750". That is, the sample size was smaller than the number of features, so traditional k-fold cross-validation could produce a large prediction error. Therefore, we have provided an out-of-bag (OOB) estimate of model performance in the training dataset (James et al., 2013). Specifically, for the original "Beijing_750" training dataset x, we left out one sample at a time and denoted the resulting sets by x(-1), …, x(-n). From each leave-one-out set x(-i), we drew 1000 bootstrap learning sets of size n-1, built a prediction rule on every bootstrap learning set generated from x(-i), and applied the rule to the held-out observation x(i). A majority vote was then made to give a class prediction for observation x(i). Double feature selection and class prediction based on the ROC curve were performed on each bootstrap sample, that is, using only samples whose bootstrap sets did not contain x(i). The OOB procedure gives the mean prediction error on each training sample x(i), so we were able to measure the prediction error of our method in the training dataset. Using the OOB estimate, we found that the prediction accuracy in the training dataset was 89%, and the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were 69%, 94%, 100%, and 87%, respectively.
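Purely as an illustrative sketch of the leave-one-out OOB voting scheme described above (the `train_threshold` toy classifier and the 1-D data are ours, a hypothetical stand-in for the feature-selection + PLSR rule in the paper):

```python
import random

def train_threshold(samples):
    """Toy 1-D classifier: threshold halfway between the class means.
    A hypothetical stand-in for the feature-selection + PLSR rule."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x >= thr else 0

def oob_accuracy(data, n_boot=1000, seed=0):
    """Leave-one-out out-of-bag estimate: for each held-out sample x_i,
    draw bootstrap learning sets from the remaining n-1 samples, train a
    rule on each, and classify x_i by majority vote over the rules."""
    rng = random.Random(seed)
    correct = 0
    for i, (x_i, y_i) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        votes_for_1 = 0
        valid = 0
        for _ in range(n_boot):
            boot = [rng.choice(rest) for _ in rest]  # sample with replacement
            if len({y for _, y in boot}) < 2:        # need both classes to train
                continue
            valid += 1
            votes_for_1 += train_threshold(boot)(x_i)
        pred = 1 if votes_for_1 > valid / 2 else 0
        correct += (pred == y_i)
    return correct / len(data)
```

On well-separated toy data this reproduces the logic only, not the paper's numbers.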

We have modified the relevant text in the Abstract and Discussion section. In addition, we have added some text in subsection “Internal validation of model” in the revised manuscript as follows.

"In addition, out-of-bag (OOB) estimation was used as an estimate of model classification performance in the training dataset (James et al., 2013). Specifically, for the original training dataset x, we left out one sample at a time and denoted the resulting sets by x(-1),…, x(-n). From each leave-one-out set x(-i), 1000 bootstrap learning sets of size n-1 were drawn. On every bootstrap learning set generated from x(-i), we carried out feature selection, built a PLSR regression and classification model, and applied the model on the test observation xi. A majority vote was then made to give a class prediction for observation xi. Finally, we calculated the accuracy for the whole training dataset x."

We have also added some text to subsection “Prognostic classification model and internal validation” in the revised manuscript as follows.

"The OOB was able to provide the mean prediction error on each training sample and estimate the generalizability of our method in the training dataset. Using the OOB estimation, we found that the prediction accuracy in the training dataset "Beijing 750" was 89%, and the sensitivity, specificity, PPV and NPV was 69%, 94%, 100%, and 87% respectively."

Second, the reasons for double feature selection (given the relatively low number of features) and, more generally, for the choice of PLSR are unclear. As the method is put forward in this work, comparison with simpler methods would be useful (e.g. vanilla SVM).

This study designed two types of imaging features from the resting state fMRI data. Thus, for each participant, there were 231 (22×21/2) functional connectivity features and 22 brain area connection features. Although the total number of imaging features (i.e. 253) is low in comparison to the number of image voxels, it is larger than the number of patients in the training dataset (i.e. 63 patients). In addition, we found multicollinearity among these 253 imaging features; the maximum correlation coefficient was 0.7985 (Pearson's correlation, p=4.5174e-15, uncorrected). Therefore, standard linear regression would fail in this study.

PLSR is a powerful and effective method to handle these sorts of problematic data sets. It can predict a response from a set of high-dimensional and collinear predictors. This prediction is achieved by extracting from the predictors a set of orthogonal factors called latent variables which have the best predictive power. PLSR has been widely used in a variety of fields (Wold et al., 2001), including fMRI data analysis (Krishnan et al., 2011). This is the reason that we chose to use PLSR in this study.

In addition, we hoped to use as few imaging features as possible in the predictive model, which could prove helpful to interpret and understand the model. We therefore carried out feature selection. Feature selection techniques have been widely adopted in brain analysis studies, in order to produce a small number of features for efficient classification or regression, and to reduce overfitting and increase the generalization performance of the model. Feature ranking and feature subset selection are two typical feature selection methods. Feature subset selection methods are generally time consuming, and even inapplicable when the number of features is extremely large, whereas ranking-based feature selection methods are subject to local optima. Therefore, these two feature selection methods are usually used jointly. Here, we first used a correlation-based feature ranking to select an initial set of features, and then adopted the CARS-PLSR method for further feature subset selection.
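As a minimal sketch of the first, correlation-ranking stage only (the function name and toy data are ours; the second-stage CARS-PLSR subset selection is not reproduced here):

```python
def rank_features_by_corr(feature_cols, y, k):
    """Correlation-based feature ranking: keep the k features with the
    highest absolute Pearson correlation with the outcome. A sketch of
    the first selection stage; CARS-PLSR subset selection is not shown."""
    n = len(y)
    my = sum(y) / n
    def abs_corr(col):
        mx = sum(col) / n
        num = sum((a - mx) * (b - my) for a, b in zip(col, y))
        den = (sum((a - mx) ** 2 for a in col)
               * sum((b - my) ** 2 for b in y)) ** 0.5
        return abs(num / den) if den else 0.0
    order = sorted(range(len(feature_cols)),
                   key=lambda j: -abs_corr(feature_cols[j]))
    return order[:k]
```

The returned indices would then feed the subset-selection stage.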

We have added a comparison with linear SVM. The code for SVM was downloaded from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). The 253 imaging features and the four clinical features were integrated into one feature vector. No feature selection was adopted in the linear SVM-based classification. The patients who recovered consciousness were labeled as 1 (i.e. GOS score≥3), and those who did not were labeled as -1 (i.e. GOS score≤2).

(1) Performance in the training dataset

As outlined above, the OOB estimation process was used to estimate the performance of the linear SVM in the training dataset "Beijing 750".

Training dataset "Beijing_750":

                  No feature selection + Linear SVM    Feature selection + PLSR
  Accuracy        83%                                  89%
  Sensitivity     31%                                  69%
  Specificity     96%                                  94%
  PPV             100%                                 100%
  NPV             81%                                  87%

(2) Performances in the two testing datasets

Using all of the samples in the training dataset "Beijing 750", we trained a linear SVM-based classifier and then tested the predictive accuracy in the two testing datasets.

                  No feature selection + Linear SVM    Feature selection + PLSR
  Beijing_HDxt
    Accuracy      76%                                  88%
    Sensitivity   58%                                  83%
    Specificity   92%                                  92%
    PPV           88%                                  91%
    NPV           71%                                  86%
  Guangzhou_HDxt
    Accuracy      88%                                  88%
    Sensitivity   100%                                 100%
    Specificity   83%                                  83%
    PPV           67%                                  67%
    NPV           100%                                 100%

Taking together the performance comparisons in both the training dataset and the two testing datasets, we believe that our method, "Feature selection + PLSR", should be superior to "No feature selection + Linear SVM" in this study.

We have added the comparisons with linear SVM in subsection “Comparison between the proposed modeling method and linear SVM” as follows.

“We compared the prediction performances between the proposed modeling method and linear SVM. The code for SVM was downloaded from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). The 253 imaging features and the four clinical features were integrated into one feature vector. No feature selection was adopted in the linear SVM-based classification. The patients with GOS≥3 were labeled as 1, with the others being designated as -1 (i.e., GOS ≤2).

Similarly, the OOB estimation process was used to estimate the performance of linear SVM in the training dataset "Beijing 750". Next, using all the samples in the training dataset "Beijing 750", we trained a linear SVM-based classifier and then tested the predictive accuracy in the two testing datasets."

We have provided these comparison results in subsection “Comparison between the proposed modeling method and linear SVM” as follows.

"Using the OOB estimation, we found that the accuracy of the linear SVM-based classification method in the training dataset "Beijing 750" was 83% (sensitivity: 31%, specificity: 96%, PPV: 100%, NPV: 81%), which was lower than the accuracy of our proposed modeling method (i.e., accuracy: 89%, sensitivity: 69%, specificity: 94%, PPV: 100%, NPV: 87%). On the other hand, the linear SVM-based classification method achieved an accuracy of 76% (sensitivity: 58%, specificity: 92%, PPV: 88%, NPV: 71%) and 88% (sensitivity: 100%, specificity: 83%, PPV: 67%, NPV: 100%) in the "Beijing HDxt" testing dataset and the "Guangzhou HDxt" testing dataset, respectively. That is, the accuracy in the "Beijing HDxt" testing dataset was lower than that in our method, whereas the accuracy in the "Guangzhou HDxt" testing dataset was similar to that of our approach. Therefore, taking together the performance comparisons in both the training dataset and the two testing datasets, we believe that our method based on feature selection and PLSR should have higher prediction accuracy and better generalizability in comparison to linear SVM."

Third, it is unclear how the bootstrap was used to perform internal validation. It only performs validation of the training samples. Typically, the trained model would be applied to the subjects not used in that bootstrap sample and the R2 would be computed based on those. Otherwise, the R2 value does not represent generalizability, but only goodness of fit. If this is a misunderstanding, this paragraph should be revised.

The internal validation in this study was based on the proposal of the Prognosis Research Strategy (PROGRESS) group (Steyerberg et al., 2013), and the procedure was implemented strictly according to Steyerberg (2008).

The main goal of internal validation is to correct for overfitting, which causes optimism about a model's performance in new subjects. In Chapter 5 of Steyerberg (2008), optimism is defined as true performance minus apparent performance, where true performance refers to the performance in the underlying population and apparent performance refers to the estimated performance in the sample. Bootstrapping can be used to quantify the optimism of a prediction model; this optimism is then subtracted from the original estimate to obtain an "optimism-corrected" performance estimate. The formula is as follows.

Optimism-corrected performance = Apparent performance in sample − Optimism,

where

Optimism = Bootstrap performance − Test performance.

The exact steps are as follows:

1) Construct a PLSR model on the training dataset "Beijing_750" and determine the apparent coefficient of determination R² on that dataset;

2) Draw a bootstrap sample (denoted Sample*) with replacement from the original training sample (denoted Sample);

3) Construct a model (Model*) on Sample*, replaying every step that was done in the original Sample, including the double feature selection and PLSR, and determine the bootstrap coefficient of determination R*² as the apparent performance of Model* on Sample*;

4) Apply Model* to the original Sample without any modification to determine the test performance RT²;

5) Calculate the optimism as the difference between the bootstrap performance and the test performance, that is, Optimism = R*² − RT²;

6) Repeat steps 2–5 1000 times to obtain a stable estimate of the optimism;

7) Subtract the optimism estimate (step 6) from the apparent performance R² (step 1) to obtain the optimism-corrected performance estimate.
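The steps above can be sketched in a few lines; purely as an illustration, ordinary least squares on one predictor stands in for the feature-selection + PLSR pipeline (`fit_lin` and the toy data are our own):

```python
import random

def fit_lin(xs, ys):
    """Ordinary least squares with one predictor; a stand-in here for the
    paper's feature-selection + PLSR pipeline (illustration only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def r_squared(model, xs, ys):
    my = sum(ys) / len(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def optimism_corrected_r2(xs, ys, n_boot=1000, seed=0):
    rng = random.Random(seed)
    apparent = r_squared(fit_lin(xs, ys), xs, ys)      # apparent R^2 on the sample
    optimism_sum = 0.0
    n = len(xs)
    for _ in range(n_boot):                            # bootstrap loop
        idx = [rng.randrange(n) for _ in range(n)]     # draw with replacement
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        model = fit_lin(bx, by)                        # Model* on Sample*
        # bootstrap performance minus test performance on the original sample
        optimism_sum += r_squared(model, bx, by) - r_squared(model, xs, ys)
    return apparent - optimism_sum / n_boot            # optimism-corrected estimate
```

The corrected R² is typically slightly below the apparent R², reflecting the subtracted optimism.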

To make the description more precise and clearer, we have modified the text in subsection "Internal validation of model" as follows.

"The prognostic regression model was internally validated using bootstrap (Steyerberg, 2008). Specifically, bootstrap samples were drawn with replacement from the training dataset "Beijing 750" such that each bootstrap sampling set had a number of observations equal to that of the training dataset. Using a bootstrap sampling set, the correlation-based feature selection and CARS-PLSR was first used to select feature subset, after which the PLSR was used to generate a prognostic model. We then applied the model to the bootstrap sampling set and the original training dataset and calculated the coefficient of determination R2 of each of the two datasets. The difference between the two coefficients of determination was defined as the optimism. The process was repeated 1000 times to obtain a stable estimate of the optimism. Finally, we subtracted the optimism estimate from the coefficient of determination R2 on the "Beijing 750" training dataset to obtain the optimism-corrected performance estimate."

Fourth, what regularisation was applied in PLSR?

We did not use any regularisation in PLSR. There is only one hyperparameter in our method, namely the number of latent variables. Given that nine predictors were involved in the PLSR, the rank of the predictor matrix X was less than or equal to nine, which meant that the number of latent variables could range from 1 to 8. We used cross-validation to set the number of latent variables to three, and we found that the classification results were not sensitive to different numbers of latent variables in our study.

Fifth, there is little discussion of the PPV, while assessing the number of false negative is also very important. In all cases, statistical assessment through permutations of the labels is desired.

The PPVs for the two testing datasets were 91% and 67%, respectively; that is, the PPV for the Guangzhou dataset was relatively low. Our method predicted that nine patients in this dataset would recover consciousness, but only six of them did (i.e. GOS score≥3), with the other three patients remaining unconscious at the end of the follow-up. Across the two testing datasets, our method successfully identified 16 of the 18 patients (16/18) who later recovered consciousness; that is, the number of false negatives was two.

We have added some discussion in Discussion section as follows.

"Secondly, the PPVs for the two testing datasets was remarkably different, with that for the "Guangzhou HDxt" dataset being relatively low (67% versus 91%). Although our method predicted that nine patients in this dataset would recover consciousness, only six of them did (i.e. GOS≥3), with the other three remaining unconscious at the end of the follow-up (i.e. GOS≤2). Given that many additional factors are associated with the outcome of DOC patients, including medical complications, nutrition and so on, future work will need to integrate more information in order to provide more precise predictions."

We have also provided the statistical assessment based on the permutation test for the classification accuracies in the two testing datasets. Specifically, we repeated the classification procedure after randomizing the labels of the testing dataset 1000 times. The p-value was then given by the percentage of runs for which the classification accuracy obtained was greater than the classification accuracy obtained in the dataset with the original label. The results were as follows.

1) The "Beijing_HDxt" testing dataset

In the "Beijing HDxt" dataset, the prediction accuracy in the dataset with the original label was 88%. We found that it was significantly greater than the classification accuracy after randomizing the labels (permutation test, p<0.001).

2) The "Guangzhou_HDxt" testing dataset

In the "Guangzhou HDxt" dataset, the prediction accuracy in the dataset with original label was 88%. We found that it was also significantly greater than the classification accuracy after randomizing the labels (permutation test, p<0.001).

Finally, when evaluating the performance of classification, it would be more convincing if you reported the balanced accuracy. This is because the results for the two different test datasets differ remarkably with respect to sensitivity (subsection “Accuracy comparison between single-domain models and combination model”); this suggests that the outcome groups were differentially balanced across the datasets.

We acknowledge that the outcomes for the two testing datasets were remarkably different: in the "Beijing_HDxt" dataset, 12 of the 25 patients recovered consciousness, while six of the 24 patients in the "Guangzhou_HDxt" dataset did so. In fact, the etiologies of the patients also differed markedly between the two datasets: the numbers of patients in the trauma/stroke/anoxia cohorts were 12/6/7 and 8/0/16 in the "Beijing_HDxt" and "Guangzhou_HDxt" datasets, respectively. Given that the characteristics of the two medical centers and their roles in the local health care system differ, we speculate that the enrolled patient populations might differ as well, which could be one reason why the outcome groups were differentially balanced across the two datasets. However, we would like to emphasize that these etiologies are the main sources of DOC, and a good prognostication method should be applicable to different etiologies and to different datasets from different medical centers.

Using the predicted scores of the patients in the two testing datasets, we plotted the ROC curves for each of the two testing datasets as follows.

From the above figure, we can see that the AUC for each of the two testing datasets is larger than 0.9, although there are some differences between them in terms of the ROC.

2) The interpretation of classifier weights is problematic (Discussion section): Please see Haufe et al., 2014, Kia et al., 2017 and Weichwald et al., 2015 for the caveats of interpreting the weights of classification models. Especially, as feature selection is performed here, the authors could refrain from looking at the weights (and especially at their sign; for example, finding that duration of DOC has positive impact on outcome does not seem realistic) and instead identify which features were selected.

We believe that the "classifier weights" mentioned by the reviewer refer to the regression weights in the PLSR.

Following the reviewer's suggestion, we reviewed the studies on interpreting the weights of linear models in neuroimaging analysis. These studies, including Haufe et al. (2014), Kia et al. (2017) and Weichwald et al. (2015), are very instructive: they illustrate the source of the problem and provide general discussions and heuristic rules. In addition, we reviewed the specific literature on interpreting predictor importance in PLSR. Given that model interpretation is an important task in most applications of PLSR, researchers have long sought optimal interpretation methods and have made considerable progress (Kvalheim and Karstang, 1989; Kvalheim et al., 2014; Tran et al., 2014).

In summary, as the reviewer pointed out, directly interpreting the model using parameters such as the weights and loadings in the PLSR is problematic. We have therefore used a recent method, significance Multivariate Correlation (sMC; Tran et al., 2014), to analyze predictor importance in our PLSR model. We have corrected the relevant results and added some discussion to the manuscript. In the following, we first elaborate on the cause of the problem, then introduce the sMC method for interpreting variable importance in PLSR, and finally present our new results.

Cause of problem:

Haufe et al. (2014) demonstrated that linear backward models, i.e. linear decoding models, cannot be interpreted in terms of weights, as a large weight does not necessarily correspond to a feature that picks up the signal and, vice versa, a feature that picks up the signal does not necessarily have a large weight. For example, consider a simple regression case with two predictors, X = [x1(n) x2(n)], and a response y(n). Imagine that y(n) = signal(n), x1(n) = signal(n) − 2*noise(n), and x2(n) = noise(n), where noise(n) is random noise independent of signal(n). Then y(n) = x1(n) + 2*x2(n) exactly, so the weight for x2(n) is larger than the weight for x1(n), even though x2(n) is pure noise and could take any value. In this case, the weights do not reflect variable importance. In fact, the weights are determined not only by the correlation of X with the response y, but also by the covariance structure within X.
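The toy example can be checked numerically; the snippet below (our own illustration, solving the 2×2 normal equations directly) recovers weights of about 1 for x1 and 2 for x2, so the pure-noise predictor gets the larger weight:

```python
import random

# Numeric check of the example above: y = signal, x1 = signal - 2*noise,
# x2 = noise, so exact recovery gives y = 1*x1 + 2*x2 and the pure-noise
# feature x2 receives the LARGER weight despite carrying no signal.
random.seed(0)
n = 1000
signal = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
y = signal
x1 = [s - 2 * e for s, e in zip(signal, noise)]
x2 = noise

# Solve the 2x2 normal equations for (w1, w2) in y ≈ w1*x1 + w2*x2
a11 = sum(v * v for v in x1)
a12 = sum(u * v for u, v in zip(x1, x2))
a22 = sum(v * v for v in x2)
b1 = sum(u * v for u, v in zip(x1, y))
b2 = sum(u * v for u, v in zip(x2, y))
det = a11 * a22 - a12 * a12
w1 = (a22 * b1 - a12 * b2) / det
w2 = (a11 * b2 - a12 * b1) / det
# w1 ≈ 1 and w2 ≈ 2: the weights do not reflect variable importance
```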

Therefore, some researchers suggest that the lack of interpretability of multivariate decoding in neuroimaging could be a direct consequence of low signal-to-noise ratios, the high dimensionality of neuroimaging, high correlations among different dimensions of data, and cross-subject variability (Kia et al., 2017).

Introduction of the sMC method:

The goal of PLSR is to predict Y from X and to describe their common structure (Krishnan et al., 2011). It finds latent variables, stored in a matrix T, that model X and simultaneously predict Y. Formally, this is expressed as a double decomposition of X and the predicted Ŷ:

X = TP^T and Ŷ = TBC^T (Eq. 1)

where P and C are loadings (or weights) and B is a diagonal matrix. The latent variables are ordered according to the amount of variance of Ŷ that they explain. Rewriting Eq. 1 shows that Ŷ can also be expressed as a regression model:

Ŷ = TBC^T = XB_PLS (Eq. 2)

with

B_PLS = (P^T)^+ BC^T (Eq. 3)

where (P^T)^+ is the Moore-Penrose pseudo-inverse of P^T. The matrix B_PLS is equivalent to the regression weights of multiple regression. From these equations, we can see that PLSR finds components from X that are also relevant for Y. Specifically, PLSR searches for a set of components (latent vectors) that performs a simultaneous decomposition of X and Y under the constraint that these components explain as much as possible of the covariance between X and Y. PLSR is therefore more robust to noise than standard linear regression.

However, PLSR constructs latent variables that not only maximize the correlation of X with the response y but also concurrently try to maximize the explained variance in X. When the dominant source of variation in X is not related to y, maximizing the explained X-variance is likely to bring irrelevant information into the PLSR model, which means that interpreting PLSR model parameters, and variable importance based on them, is not a straightforward exercise. This issue is discussed at length in the literature, and multiple approaches have been proposed to tackle it (Kvalheim and Karstang, 1989; Kvalheim et al., 2014; Tran et al., 2014).

Using properties of basic sequence theory, Tran et al. (2014) developed a method called sMC to statistically assess variable importance in PLSR. The key idea of sMC is to estimate, for each variable, the correct sources of variability resulting from the PLSR (i.e. regression variance and residual variance) and to use them to statistically determine the variable's importance with respect to the regression model. In sMC, F-test values (called sMC F-values) are used to evaluate each predictor's importance in the prognostic model. In the paper by Tran et al. (2014), the performance of sMC was demonstrated on both an omics-related simulated dataset and a real dataset of NIR spectra of pharmaceutical tablets.
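As we understand it, the core computation of sMC can be sketched as follows (a simplified illustration of the idea in Tran et al. (2014), not their reference implementation): each centered predictor is split into the part explained by the model-predicted response and a residual, and the ratio of the two variances gives a per-variable F-value. Applied to the two-predictor example above (with an ordinary least-squares fit standing in for PLSR, since the relationship is exactly linear), the informative predictor x1 receives a far larger F-value than the noise predictor x2, whereas the raw weights suggested the opposite:

```python
import numpy as np

def smc_f_values(Xc, y_hat):
    """Sketch of significance Multivariate Correlation (after Tran et al., 2014).

    Xc    : (n_samples, n_features) centered predictors
    y_hat : (n_samples,) model-predicted response
    Splits each predictor into the part explained by y_hat (regression
    variance) and the remainder (residual variance), and returns the
    per-variable F-ratio with 1 and n-2 degrees of freedom.
    """
    n = Xc.shape[0]
    u = y_hat / np.linalg.norm(y_hat)   # unit vector along the prediction
    X_reg = np.outer(u, u @ Xc)         # projection of each column onto y_hat
    X_res = Xc - X_reg
    ss_reg = (X_reg ** 2).sum(axis=0)
    ss_res = (X_res ** 2).sum(axis=0)
    return (ss_reg / 1.0) / (ss_res / (n - 2))

rng = np.random.default_rng(0)
n = 1000
signal = rng.standard_normal(n)
noise = rng.standard_normal(n)
X = np.column_stack([signal - 2 * noise, noise])
y = signal                              # y = x1 + 2*x2 exactly

Xc = X - X.mean(axis=0)
b, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
F = smc_f_values(Xc, Xc @ b)
# F[0] (informative x1) is far larger than F[1] (pure noise x2)
```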

A simulation to test the sMC method:

Let us return to the regression example above. For a clearer explanation, we assign specific values to X, y, and the noise, and then perform some simulations in Matlab. The code is shown in Author response image 4 below.

We specified different amplitudes, i.e. noise_amplitude, to obtain results at different noise levels:

Although this is a simple regression example, the simulation shows that the sMC F-value can robustly discriminate the importance of the two predictors even when the amplitude of the random noise is 50 times the standard deviation of the signal.

Our results:

We used the sMC method to analyze the predictor importance in our study. The results are as follows.

In the above figure, the x-axis represents each predictor and the y-axis is the sMC F-value. The first to fifth predictors are the imaging predictors, i.e. DMN.aMPFC, DMN.PCC, ExecuContr.DMPFC, DMN.aMPFC – ExecuContr.DMPFC, and Auditory.MCC – Visual.R.V1, respectively. The sixth to ninth predictors are the clinical predictors, i.e. Incidence_age, Duration_of_DOC, Etiology_traumatic, and Etiology_anoxic, respectively.

We have corrected Figure 4 and modified the relevant text in the manuscript. We have also deleted the discussion about the sign of regression weights.

Finally, we would like to refer to the paper of Kia et al. (2017), which stated that "Intuitively, the interpretability of a brain decoder refers to the level of information that can be reliably derived by an expert from the resulting maps…". Our prognostic model can be reliably applied to independent testing datasets, and its predictors can largely be explained by previous studies and experimental data. We therefore suggest that the model is interpretable.

3) It is not clear that excluding patients who died is the right thing to do. At the very least, whether it was the right thing to do should have been discussed. On the one hand, I appreciate that if they died, one can never know if they would have regained consciousness. But in my experience (and I am sure this is the authors' experience too), these patients mostly die for a reason that is related to their initial injury (directly or indirectly). Plus, whether they die or not is of course part and parcel of the prognostication (Can we predict those who die?). I know the paper's intention was to try to predict recovery of consciousness (rather than who died), but I think an argument can be made for including those who died in the 'non recovery of consciousness' group. How does that affect the model? After all, a model that says, "Assuming you live, these data have an 88% chance of predicting whether you will regain consciousness" is nowhere near as good as a model that says, "These data have an 88% chance of predicting whether you will regain consciousness".

The reviewer raises an important point. However, if a patient died, we could not assess their CRS-R scores at the end of the follow-up, and the development of the regression model was based on these scores. Given that the number of training samples was relatively low, it is difficult to treat the patients who died as "missing data" in the regression model. Alternatively, we could apply the present model to the patients who died, but we are uncertain whether doing so is meaningful, because the model was developed to predict the presence or absence of consciousness recovery. We acknowledge that mortality is one outcome of DOC, but this study has not addressed that aspect. We have added some discussion to the Discussion section of the revised manuscript to address this:

"Finally, a DOC patient’s prognosis can be considered according to at least three dimensions: survival/mortality, recovery of consciousness, and functional recovery. This study focused on predicting recovery of consciousness, and the patients who died during the follow-up were excluded. In the future, we will consider more outcome assessments, including survival/mortality and functional recovery."

4) Subsection “MRI acquisition”. Supplementary material 1 shows quite a bit of variation in the time after injury at which the MRI was conducted (in the testing dataset alone, the range is 1 month to 77 months). What is noticeable is that almost all those who have a late scan do not recover. On the one hand, this is obvious – if you are still vegetative after 77 months you are very unlikely to recover whether you have a scan or not, and in this sense, the point of trying to 'predict recovery' is less obvious. Plus of course at this point there will have been many brain changes that do not reflect the original injury per se but may be the consequences of languishing in this stage for some years (atrophy, etc.). What would be most interesting would be to see how the model does with patients who, say are scanned within 3 months of their injury. These are the patients that we really care about because if we could prognosticate among them then it could really change clinical practice. To put it simply, including data from a patient who has been in a MCS for 77 months or VS for 54 months and predicting that they will not recover is not adding much (and may be affecting the model), because the chances of any meaningful recovery at that stage is so low anyway.

Although the range of times after injury at which the MRI scanning occurred (i.e., the T0 time point) was relatively wide (1–77 months), the majority of the patients (65 out of 112) were scanned no more than 3 months after the injury, so the median T0 was equal to or less than 3 months, as shown in Table 1 of the manuscript. We emphasize that this is a clinically useful time point, close to the time of injury.

Notably, our model demonstrated good sensitivity and specificity in both the "subacute" patients (i.e. duration of unconsciousness ≤3 months) and in those in the chronic phase (i.e. duration of unconsciousness >3 months), as shown in Figure 7 of the manuscript.

We also found that our model correctly predicted nine out of the ten patients with a DOC duration of three months or longer who subsequently recovered consciousness, including three patients whose duration was six months or longer. This suggests that our model might be feasible for assessing prolonged DOC patients.

As DOC can be acute or chronic, diagnosis and prognosis for prolonged patients is valuable for the patients as well as their families and caregivers (Bernat, 2006). The social, economic, and ethical consequences associated with DOC are tremendous, and neurologists are often confronted with the expectations of prolonged patients' families. Accordingly, several highly influential functional imaging studies have enrolled prolonged DOC patients, for example, patients with 24 years' disease duration published in The Lancet (Stender et al., 2014), and patients with 5 years' (Vanhaudenhuyse et al., 2010) or over 25 years' disease duration (Demertzi et al., 2015) published in Brain. A recent study published in Brain also enrolled prolonged DOC patients (Chennu et al., 2017).

A prolonged state of impaired consciousness is devastating, and it has been widely considered that a clinically meaningful recovery is unrealistic for prolonged DOC patients (Wijdicks and Cranford, 2005). However, recent pilot studies challenge the existing practice of early treatment discontinuation for these patients and motivate further research to develop therapeutic interventions, although more validation is required. For example, one study published in Nature reported the use of deep brain stimulation in an MCS patient with 6 years' disease duration (Schiff et al., 2007), and another study published in Current Biology reported the use of vagus nerve stimulation in a VS/UWS patient with 15 years' disease duration (Corazzol et al., 2017). Recently, we reported the first case of transcutaneous auricular vagus nerve stimulation in DOC (Yu et al., 2017).

However, before using a novel therapeutic intervention, clinicians first need to determine whether the patient is a suitable candidate for the treatment. This requires thorough examination and a review of possible outcomes prior to any intervention. An accurate and robust prognostic test for DOC patients, including prolonged ones, is therefore of significant clinical value, and we believe that our study makes important progress in this direction.

5) fMRI analysis: How were brain lesions taken into account? This could influence the functional connectivity measures considerably.

In line with the clinical characteristics of DOC, this study carefully took the patients' lesions into account during imaging preprocessing.

The etiology of DOC can be subdivided into three types: trauma, stroke, and anoxic brain injury. These causes often differ in injury location and in the severity of damage to brain structure and function. In trauma, brain injury mainly arises from shearing, compression, and impact, so DOC patients with trauma may show large or small brain deformations or an incomplete brain, and the injury can be relatively confined to a focal region or can present as diffuse axonal damage. In stroke, the brain is relatively complete and the injury is generally limited to the affected blood supply territory. In anoxia, the brain is anatomically complete, but the injury is diffuse throughout the cortex.

In this study, we checked the T1-weighted 3D high-resolution image for each patient. According to the size of the visible lesion in the T1 image, we classified lesions as large focal, small focal, or diffuse. Large focal lesions often disrupt the spatial warping of the ROIs, degrade registration performance, and thereby compromise the data analysis. A previous study (Di Perri et al., 2016) used the exclusion criterion of large focal brain damage occupying more than two-thirds of one hemisphere. Referring to this criterion, our study excluded patients with large focal lesions exceeding 30% of the total brain volume. In contrast, small focal lesions and diffuse lesions should have very limited effects on the registration. ROI-based functional connectivity analysis has been widely used in brain disease studies (Zhang and Raichle, 2010), including DOC studies (Demertzi et al., 2015; Qin et al., 2015; Silva et al., 2015). Brain lesions can not only influence the functional connectivity of the local ROI in which the lesion is located but also cause widespread changes throughout the brain (Alstott et al., 2009). Here, we hypothesized that such abnormal manifestations of functional integration arising from small focal and diffuse lesions could be promising markers for the prognosis of DOC patients.

This study developed protocols to segment the lesions and calculate their volumes. Specifically, three authors (MS, YY, and JHH) completed the segmentation of the focal lesions and the volume calculation. First, using ITK-SNAP and in-house software, one author (MS) segmented the focal lesion in the T1-weighted 3D high-resolution image. YY and JHH, who are both experienced neurosurgeons, then visually checked the segmented lesion slice by slice and modified the boundary where necessary. In this way, focal lesions were extracted from the high-resolution T1 images and their volumes were computed. The following figure shows an example ("Beijing_750" dataset, patient alias='013', diagnosis=VS/UWS, GOS=3). The lesion volume within the left hemisphere was about 73,000 mm³, i.e. 7% of the total brain volume (the sum of gray matter and white matter).
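The volume computation itself is simple once a binary lesion mask exists; a minimal sketch (a hypothetical helper in pure NumPy; in practice the mask would be loaded from the ITK-SNAP segmentation, and the ~1,040 ml total brain volume below is just the value implied by the 7% figure in the example above):

```python
import numpy as np

def lesion_volume_mm3(mask, voxel_size_mm):
    """Volume of a binary lesion mask: voxel count times single-voxel volume."""
    return np.count_nonzero(mask) * float(np.prod(voxel_size_mm))

# Toy example: a 5x5x5-voxel lesion at 1 mm isotropic resolution
mask = np.zeros((10, 10, 10), dtype=np.uint8)
mask[:5, :5, :5] = 1
vol = lesion_volume_mm3(mask, (1.0, 1.0, 1.0))   # 125.0 mm^3

# For the example patient: a 73,000 mm^3 (73 ml) lesion against a total
# brain volume of roughly 1,040 ml corresponds to about 7%
fraction = 73_000 / 1_040_000
```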

We also developed a toolkit to visually check the transformed ROIs for each patient. More examples are provided in Appendix 5 and Supplementary file 1.

6) fMRI motion correction (subsection “Imaging preprocessing”): The application of band-pass filtering to "scrubbed" fMRI timeseries is not unproblematic given that the removal of scans from the time series can severely affect temporal autocorrelation. Also, were there significant differences in the number of scrubbed volumes between the outcome groups (for unresponsive patients, movement would seem to be less of an issue than for controls or more responsive patients), and could this have affected regression/classification? Did you consider using scan-nulling regressors as an alternative? (see, for example, the discussion in Caballero-Gaudes and Reynolds, 2017) In SM4, please split the graphs by outcome. If more movement was observed for one subtype of patients, this could drive all the observed results.

Motion artifacts can produce a wide variety of signal changes in fMRI data and thus introduce complicated shifts in functional connectivity analysis. This problem is particularly serious for patients with brain diseases, as they are unlikely to follow experimental instructions well or control their head motion. It is common practice to exclude patients who exhibit excessive head motion, but this would leave only a few patients enrolled in studies such as ours, significantly reducing statistical power. Power et al. (2012, 2015) introduced a method called "scrubbing" that identifies motion-induced spikes in the rs-fcMRI timeseries and excises these data with a temporal mask; adjacent timepoints are then temporally concatenated. Subsequently, Carp (2013) proposed a modification of scrubbing in which data are removed and interpolated prior to band-pass filtering, in order to avoid propagation of the motion artifact during the application of the band-pass filter. Using simulated data, he demonstrated that this modified scrubbing procedure was able to recover the "ground truth" connectivity in the timeseries (Carp, 2013). Scrubbing has proven to be one of the most popular approaches for taking head movement into account, and a great many clinical rs-fMRI studies have adopted it, for example, in depression (Drysdale et al., 2017), Parkinson's disease (Mathys et al., 2016), stroke (Adhikari et al., 2017), traumatic brain injury (Dall'Acqua et al., 2017), and DOC (Qin et al., 2015; Wu et al., 2015). In their paper, Caballero-Gaudes and Reynolds (2017) acknowledged that scrubbing is "a popular approach" and proposed that it is "equivalent to adding scan nulling regressors in the model". We therefore used this method to reduce the motion artifacts in the DOC patients.
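The first step of scrubbing, computing framewise displacement (FD) from the six realignment parameters and building a temporal mask, can be sketched as follows (our illustrative Python version of the Power et al. (2012) definition; the 50 mm head radius and 0.5 mm threshold are commonly used defaults, not necessarily the exact values in our pipeline):

```python
import numpy as np

def framewise_displacement(motion_params, head_radius_mm=50.0):
    """FD per Power et al. (2012): sum of the absolute backward differences of
    the six realignment parameters, with rotations (in radians) converted to
    arc length on a sphere of the given radius."""
    mp = np.asarray(motion_params, dtype=float).copy()  # (n_vols, 6): 3 translations (mm), 3 rotations (rad)
    mp[:, 3:] *= head_radius_mm
    diffs = np.abs(np.diff(mp, axis=0))
    return np.concatenate([[0.0], diffs.sum(axis=1)])

def scrubbing_mask(fd, threshold_mm=0.5):
    """Temporal mask that keeps only volumes below the FD threshold."""
    return fd < threshold_mm

# Toy example: a 1 mm translation spike at volume 2 flags volumes 2 and 3
params = np.zeros((5, 6))
params[2, 0] = 1.0
fd = framewise_displacement(params)
keep = scrubbing_mask(fd)
```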

We did not find any significant differences in the number of volumes remaining after scrubbing between the "consciousness recovery" group (GOS≥3) and the "consciousness nonrecovery" group (GOS≤2) in either the training dataset or the two testing datasets (two-sample Kolmogorov-Smirnov test; p=0.21, 0.51, and 0.14 in the training dataset "Beijing_750" and the testing datasets "Beijing_HDxt" and "Guangzhou_HDxt", respectively). In addition, there were no significant correlations between the number of remaining volumes and the predicted scores in the training dataset or the two testing datasets (Pearson's correlation, p>0.05). The following figures show the relationship between the number of remaining volumes and the predicted scores and predicted imaging subscores in the training dataset. Therefore, we do not think that the number of volumes remaining after scrubbing significantly affected the regression or classification in our study.
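The two checks reported here can be reproduced with standard SciPy calls; a sketch on synthetic stand-in data (the actual volume counts and predicted scores come from the datasets and are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical numbers of remaining volumes after scrubbing, per outcome group
recovery = rng.integers(150, 240, size=12).astype(float)
nonrecovery = rng.integers(150, 240, size=13).astype(float)

# Two-sample Kolmogorov-Smirnov test: do the groups differ in distribution?
ks_stat, ks_p = stats.ks_2samp(recovery, nonrecovery)

# Pearson correlation between remaining volumes and predicted scores
n_volumes = np.concatenate([recovery, nonrecovery])
predicted_scores = rng.normal(size=n_volumes.size)
r, p = stats.pearsonr(n_volumes, predicted_scores)
```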

However, in response to the reviewer's suggestion, we have split the graphs in Appendix 4 based on outcome. Please see the new Appendix 4.

In summary, as the reviewer pointed out, scrubbing can disrupt the temporal structure of the fMRI data. However, to balance the demands of noise reduction and data preservation, and to follow a well-accepted preprocessing method in the literature, our study adopted the "scrubbing" approach as a compromise to reduce motion artifacts.

7) Subsection “Accuracy comparison between single-domain models and combination model”: When only the clinical features are included in the authors model the accuracy in the Guangzhou dataset is 83%. When both clinical and imaging features are included it is 88%. Given the rather small size of the sample, this seems to amount to a difference of one patient which is neither here nor there in terms of clinical prognostication. It's not wildly different for the other dataset either. Can you please comment on this? More generally, it seems overstated how much better the combined features did in the model – this would need a critical discussion but also better quantification. In other words, how much more is gained exactly by using both datasets as opposed to either dataset alone? Please report details of the statistical superiority of the combined feature set (p-value, effect size). Also, when you directly compare neuroimaging data and clinical data, is there any difference in how well they support prediction?

To test whether the combined use of clinical and neuroimaging data is superior to using data from either single domain, we have added new results to the revised manuscript. We resampled with replacement from the training dataset and built regression and classification models based on clinical features alone, neuroimaging features alone, or the combination of both. We then tested the classification accuracy of the three predictive models in the two testing datasets, thereby obtaining the distribution of prediction accuracies for each model type. Next, we used repeated-measures ANOVA to test whether the performances of the three model types were the same, and Ψ, the root-mean-square standardized effect, to report the effect sizes of the mean differences between them. The effect size analysis was carried out using the Matlab toolbox "Measures of Effect Size" (https://github.com/hhentschke/measures-of-effect-size-toolbox).
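The comparison procedure can be sketched schematically as follows (in Python with SciPy rather than Matlab; the accuracy distributions are synthetic stand-ins for the real bootstrap results, and a simple mean difference with a percentile interval stands in for the Ψ statistic of the Matlab toolbox):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_boot = 1000

# Stand-in bootstrap accuracy distributions for the three model types
acc_combined = rng.normal(0.88, 0.03, n_boot)
acc_imaging = rng.normal(0.87, 0.03, n_boot)
acc_clinical = rng.normal(0.73, 0.03, n_boot)

# Pairwise comparisons, paired across bootstrap draws
t_ci, p_ci = stats.ttest_rel(acc_combined, acc_imaging)
t_cc, p_cc = stats.ttest_rel(acc_combined, acc_clinical)

# Effect of combining domains vs. clinical features alone: mean accuracy
# difference with a 95% percentile interval
diff = acc_combined - acc_clinical
ci = np.percentile(diff, [2.5, 97.5])
```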

1) The "Beijing_HDxt" testing dataset

Using repeated-measures ANOVA, we found significant differences between the mean classification accuracies of the three model types (p<0.001). We subsequently conducted pairwise comparisons and found a significant difference between the combination model and the model using imaging features alone (paired-sample t-test, p=0.001) or clinical features alone (paired-sample t-test, p<0.001), as well as between the models using imaging features alone and clinical features alone (paired-sample t-test, p<0.001).

Using effect size analysis, we found a mean difference of Ψ=0.004 (95% confidence interval [0.002, 0.007]) between the combination model and the model using imaging features alone, and Ψ=0.149 (95% confidence interval [0.147, 0.152]) between the combination model and the model using clinical features alone. There was also a mean difference of Ψ=0.145 (95% confidence interval [0.142, 0.147]) between the models using imaging features alone and clinical features alone.

2) The "Guangzhou_HDxt" testing dataset

Using repeated-measures ANOVA, we found significant differences between the mean classification accuracies of the three model types (p<0.001). We subsequently conducted pairwise comparisons and found a significant difference between the combination model and the model using imaging features alone (paired-sample t-test, p<0.001) or clinical features alone (paired-sample t-test, p<0.001), as well as between the models using imaging features alone and clinical features alone (paired-sample t-test, p<0.001).

Using effect size analysis, we found a mean difference of Ψ=0.080 (95% confidence interval [0.076, 0.084]) between the combination model and the model using imaging features alone, and Ψ=0.034 (95% confidence interval [0.028, 0.040]) between the combination model and the model using clinical features alone. There was also a mean difference of Ψ=-0.046 (95% confidence interval [-0.053, -0.040]) between the models using imaging features alone and clinical features alone.

3) Summary

Firstly, in both testing datasets, the combination of imaging and clinical features demonstrated better performance than either imaging features alone or clinical features alone. Secondly, the ranking of the models using neuroimaging features alone and clinical features alone was reversed between the two testing datasets, which suggests that the two datasets are heterogeneous and that integrating information from different domains is necessary to improve prognostication for DOC.

We have provided these statistical results in the subsection "Comparison between single-domain model and combination model”.

In addition, we have added some discussion in the Discussion section as follows.

"This study involved two testing datasets, which were found to be quite different, as shown in Table 1. First, the distributions of the etiology of the patients were remarkably different in the two datasets. The numbers of patients suffering from trauma/stroke/anoxia were 12/6/7 and 8/0/16 in the "Beijing HDxt" and "Guangzhou HDxt" datasets, respectively. The outcomes were also different. In the "Beijing HDxt" dataset, 12 out of the total 25 patients recovered consciousness, compared with six out of the total 24 patients in the "Guangzhou HDxt" dataset. Given that the characteristics of the two medical centers and their roles in the local health care system are different, we speculate that this could be one of the main reasons that the enrolled patient populations were heterogeneous. As described in the Introduction, DOC can have many causes and be associated with several neuropathological processes and different severities, leading to the suggestion that information from different domains should be integrated to improve diagnosis and prognostication (Bernat, 2016). Our study demonstrates that the combination of imaging and clinical features can achieve a better performance than the use of single domain features."

8) It is laudable that the code of the statistical model for predictions is made available. Will the code for the analysis pipeline and the data also be made available?

We have open-sourced the analysis pipeline in the package "pDOC", including fMRI preprocessing, feature calculation and extraction, regression and classification, and results visualization. Anyone is welcome to download it from GitHub (https://github.com/realmsong504/pDOC). Appendix 1 includes anonymized demographic and clinical characteristics for each enrolled patient. However, there is currently no plan to release the patients' imaging data, owing to confidentiality agreements.

9) The structure and writing of the paper could be improved. In particular, some of the 'sign-posting' in the article is poor. For example, 'normal controls' are referred to in subsection “Subjects” and subsection “Definition of networks and regions of interest” and so on, but we never really learn much about them until the Results section. This had the unfortunate consequence of making the article feel very vague and lacking important details. They were mostly revealed much later (in the Supplementary material) but better sign posting earlier on would help with that. Likewise, in the Materials and methods section "The two testing datasets were then used to validate the model". This comes out of nowhere, and one has no idea what the authors were referring to at this point. Likewise, in the Materials and methods section "All patients were followed up at least 12 months after the T0 time point" sounds vague and under specified. A sign post to Supplementary material 1 would have really helped.

We have carefully revised the manuscript in light of these comments (please see the Materials and methods section). In addition, we have carefully proofread the entire manuscript.

10) Discussion section. At least two of these three references (the ones that I am an author on) do not support the point being made at all. In neither Owen et al., 2006 nor Monti et al., 2010 was "recovery after many years" observed.

We thank the reviewer for pointing this out. We have modified the text in the Discussion section as follows.

"Some influential studies have found that a few VS/UWS patients exhibit near-normal high-level cortical activation in response to certain stimuli or command (Owen et al., 2006; Monti et al., 2010), and that late recovery of responsiveness and consciousness is not exceptional in patients with VS/UWS (Estraneo et al., 2010). "

Additional references:

Adhikari MH, Hacker CD, Siegel JS, Griffa A, Hagmann P, Deco G, et al. Decreased integration and information capacity in stroke measured by whole brain models of resting state activity. Brain 2017; 140: 1068-85.

Caballero-Gaudes C, Reynolds RC. Methods for cleaning the BOLD fMRI signal. Neuroimage 2017; 154: 128-49.

Carp J. Optimizing the order of operations for movement scrubbing: Comment on Power et al. Neuroimage 2013; 76(1): 436-8.

Dall'Acqua P, Johannes S, Mica L, Simmen HP, Glaab R, Fandino J, et al. Functional and structural network recovery after mild traumatic brain injury: A 1-year longitudinal study. Frontiers in Human Neuroscience 2017; 11.

Drysdale AT, Grosenick L, Downar J, Dunlop K, Mansouri F, Meng Y, et al. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nature Medicine 2017; 23(1): 28-38.

Haufe S, Meinecke F, Goergen K, Daehne S, Haynes J-D, Blankertz B, et al. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage 2014; 87: 96-110.

Kia SM, Pons SV, Weisz N, Passerini A. Interpretability of multivariate brain maps in linear brain decoding: definition, and heuristic quantification in multivariate analysis of MEG time-locked effects. Frontiers in Neuroscience 2017; 10: 22.

Mathys C, Caspers J, Langner R, Sudmeyer M, Grefkes C, Reetz K, et al. Functional connectivity differences of the subthalamic Nucleus related to Parkinson's disease. Human Brain Mapping 2016; 37(3): 1235-53.

Power JD, Barnes KA, Snyder AZ, Schlaggar BL, Petersen SE. Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. Neuroimage 2012; 59(3): 2142-54.

Weichwald S, Meyer T, Ozdenizci O, Schoelkopf B, Ball T, Grosse-Wentrup M. Causal interpretation rules for encoding and decoding models in neuroimaging. Neuroimage 2015; 110: 48-59.

https://doi.org/10.7554/eLife.36173.059

Article and author information

Author details

  1. Ming Song

    1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    2. Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    Contribution
    Conceptualization, Software, Formal analysis, Validation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Yi Yang
    Competing interests
    No competing interests declared
ORCID iD: 0000-0001-8846-5707
  2. Yi Yang

    Department of Neurosurgery, PLA Army General Hospital, Beijing, China
    Contribution
    Conceptualization, Data curation, Validation, Investigation, Writing—original draft, Writing—review and editing
    Contributed equally with
    Ming Song
    Competing interests
    No competing interests declared
ORCID iD: 0000-0003-3096-0312
  3. Jianghong He

    Department of Neurosurgery, PLA Army General Hospital, Beijing, China
    Contribution
    Resources, Data curation, Investigation, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  4. Zhengyi Yang

    1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    2. Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    Contribution
    Formal analysis, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  5. Shan Yu

    1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    2. Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    Contribution
    Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
ORCID iD: 0000-0002-4385-6306
  6. Qiuyou Xie

    Centre for Hyperbaric Oxygen and Neurorehabilitation, Guangzhou General Hospital of Guangzhou Military Command, Guangzhou, China
    Contribution
    Data curation, Validation
    Competing interests
    No competing interests declared
  7. Xiaoyu Xia

    Department of Neurosurgery, PLA Army General Hospital, Beijing, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  8. Yuanyuan Dang

    Department of Neurosurgery, PLA Army General Hospital, Beijing, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  9. Qiang Zhang

    Department of Neurosurgery, PLA Army General Hospital, Beijing, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  10. Xinhuai Wu

    Department of Radiology, PLA Army General Hospital, Beijing, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  11. Yue Cui

    1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    2. Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    Contribution
    Methodology, Writing—original draft
    Competing interests
    No competing interests declared
  12. Bing Hou

    1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    2. Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    Contribution
    Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  13. Ronghao Yu

    Centre for Hyperbaric Oxygen and Neurorehabilitation, Guangzhou General Hospital of Guangzhou Military Command, Guangzhou, China
    Contribution
    Project administration
    Competing interests
    No competing interests declared
  14. Ruxiang Xu

    Department of Neurosurgery, PLA Army General Hospital, Beijing, China
    Contribution
    Project administration
    For correspondence
    zjxuruxiang@163.com
    Competing interests
    No competing interests declared
  15. Tianzi Jiang

    1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    2. Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
    3. CAS Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Beijing, China
    4. Key Laboratory for Neuroinformation of the Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
    5. Queensland Brain Institute, University of Queensland, Brisbane, Australia
    Contribution
    Funding acquisition, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    jiangtz@nlpr.ia.ac.cn
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-9531-291X

Funding

National Natural Science Foundation of China (81471380)

  • Ming Song

Beijing Municipal Science and Technology Commission (Z161100000216152)

  • Ming Song

Youth Innovation Promotion Association of the Chinese Academy of Sciences

  • Ming Song

Beijing Municipal Science and Technology Commission (Z161100000516165)

  • Yi Yang

National Key R&D Program of China (2017YFA0105203)

  • Shan Yu
  • Tianzi Jiang

National Natural Science Foundation of China (31620103905)

  • Tianzi Jiang

National Natural Science Foundation of China (91432302)

  • Tianzi Jiang

Chinese Academy of Sciences (Science Frontier Program: QYZDJ-SSW-SMC019)

  • Tianzi Jiang

Beijing Municipal Science and Technology Commission (Z161100000216139)

  • Tianzi Jiang

Guangdong Pearl River Talents Plan Innovative and Entrepreneurial Team (2016ZT06S220)

  • Tianzi Jiang

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors appreciate the help of Ms Rowan Tweedale with the use of English in this paper. This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 81471380, 91432302, 31620103905), the Science Frontier Program of the Chinese Academy of Sciences (Grant No. QYZDJ-SSW-SMC019), the National Key R&D Program of China (Grant No. 2017YFA0105203), the Beijing Municipal Science and Technology Commission (Grant Nos. Z161100000216139, Z161100000216152, Z161100000516165), the Guangdong Pearl River Talents Plan Innovative and Entrepreneurial Team (Grant No. 2016ZT06S220), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.

Ethics

Human subjects: The study was approved by the ethics committee of the PLA Army General Hospital (protocol no: 2011-097) and the ethics committee of the Guangzhou General Hospital of Guangzhou Military Command (protocol no: jz20091287). Informed consent to participate in the study was obtained from the legal surrogates of the patients and from the normal controls.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Klaas Enno Stephan, University of Zurich and ETH Zurich, Switzerland

Publication history

  1. Received: February 23, 2018
  2. Accepted: August 3, 2018
  3. Accepted Manuscript published: August 14, 2018 (version 1)
  4. Version of Record published: September 19, 2018 (version 2)

Copyright

© 2018, Song et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


