Introduction

Stroke is the second leading cause of death worldwide, causing approximately 5.5 million deaths annually, and is also the leading cause of disability globally, accounting for 50% of cases [1]. Ischemic stroke accounts for the majority of cases, about 80% [2][3]. Post-stroke epilepsy (PSE) is a significant complication: studies indicate that 3-30% of stroke patients develop epilepsy, which negatively affects patients’ prognosis and quality of life [4]. It can exacerbate cognitive, psychiatric, and physical impairments caused by cerebrovascular disease and comorbidities [5].

Furthermore, the incidence of PSE is highest within the first year after acute stroke, which accounts for about half of the cases [2]. Therefore, early prediction of and intervention for PSE, especially after ischemic stroke, are crucial.

Currently, most studies use clinical data to construct simple statistical models, such as survival analysis with Cox regression [2][6] and multiple linear regression [7]. Last year, Lin et al. developed a radiomics-based model that outperformed the conventional clinical model in predicting PSE related to intracerebral hemorrhage (ICH). They suggested that a combined radiomics-clinical model could better assist clinicians in assessing the individual risk of PSE after a first ICH and facilitate early diagnosis and treatment of PSE [8]. However, subsequent studies have raised doubts regarding the application of radiomics, suggesting the need for further research [9]. Overall, research on PSE prediction remains relatively scarce, and most studies focus on the analysis of specific risk factors [10][11][8][12]. No study has yet established a more comprehensive and accurate prediction model.

Machine learning has emerged in recent years as a promising approach for constructing medical models, as it excels at handling large volumes of data and complex information, and it has been increasingly applied in neuroscience and clinical prediction [13][14][15]. Previous studies have used machine learning to study post-stroke cognitive impairment [16], stroke and myocardial infarction risk prediction in patients with large artery vasculitis [14], post-stroke depression prediction based on liver function test indicators [17], and prediction of hematoma expansion in traumatic brain injury (TBI) [18]. Models constructed with machine learning algorithms can automatically handle linear or complex nonlinear relationships between variables and provide insight into the contribution of each feature to the prediction target, which is challenging for traditional statistical models. However, machine learning methods require a substantial amount of data and are prone to overfitting when trained on small samples. The more valid, high-quality data are provided, the better machine learning algorithms can capture the underlying patterns in the data and the more accurate their predictions become.

This study aims to select important risk factors from multiple features extracted from the clinical records and examination data of ischemic stroke patients and then to develop a prediction model for PSE using machine learning methods. By using relevant early admission features of ischemic stroke patients, we aim to automatically predict the probability of PSE occurrence and further guide clinical decision-making and nursing care.

Research content and method

Research patients

This study retrospectively included all stroke patients admitted to the Chongqing Emergency Center between June 2017 and June 2022 for the development of the prediction model. Subsequently, patient data from three external validation centers, namely, Qianjiang Central Hospital, Bishan District People’s Hospital, and Yubei District Traditional Chinese Medicine Hospital, were collected between July 2022 and July 2023 for external validation and evaluation of the model. The external validation cohort focused more on collecting positive cases to examine the model’s ability to identify positive samples.

Inclusion criteria: (1) Age between 18 and 90 years at admission; (2) Diagnosed with acute ischemic stroke and hospitalized for treatment.

Exclusion criteria: (1) Patients with a history of stroke or transient ischemic attack (TIA); (2) Patients with a history of other conditions such as traumatic brain injury, intracranial tumors, or cerebral vascular malformations that may cause epilepsy; (3) Patients with a history of epilepsy or who have received antiepileptic drugs for the prevention of seizures or for other diseases (such as migraine or psychiatric disorders); (4) Patients who died within 72 hours after stroke onset.

This study collected de-identified data from relevant patients for the construction of a multi-modal database for stroke patients. The study protocol was approved by the Ethics Committees of Chongqing University Center Hospital, Chongqing University Qianjiang Central Hospital, Bishan District People’s Hospital, and Yubei Traditional Chinese Medicine Hospital. The patient selection procedure is shown in Figure 1.

Figure 1. Selection and exclusion procedure of patients

Data collection

  1. General information: gender, age, NIHSS score at admission;

  2. Comorbidities and complications: uremia, previous deep vein thrombosis (DVT), diabetes mellitus, hypertension, coronary atherosclerosis, atrial fibrillation, cerebral hernia, hydrocephalus, hypoproteinemia, hyperuricemia, hyperlipidemia, internal carotid stenosis, common carotid stenosis, etc.;

  3. Cortical and subcortical involvement, assessed on CT or MRI: frontal lobe, parietal lobe, temporal lobe, occipital lobe, insular lobe, basal ganglia, internal capsule, brain stem, cerebellum, periventricular region, centrum semiovale, and thalamus. In addition, the extent of cortical involvement (the frontal, parietal, temporal, occipital, and insular lobes each contributing 1 point) and the extent of subcortical involvement (involvement of any of the basal ganglia, internal capsule, brain stem, periventricular region, thalamus, and cerebellum contributing 1 point) were summarized;

  4. Vascular stenosis or occlusion, assessed on CTA, MRA, or DSA: anterior cerebral artery (ACA), middle cerebral artery (MCA), posterior cerebral artery (PCA), vertebral artery (VA), and basilar artery (BA);

  5. Important laboratory indicators: blood lipids (triglyceride (TG), high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL)), liver function (alanine transaminase (ALT), aspartate aminotransferase (AST), bilirubin, albumin), renal function (urea, blood uric acid (BUA), creatinine), blood gas (lactic acid, anion gap, total carbon dioxide (TCO2)), coagulation-related indicators (international normalized ratio (INR), prothrombin time (PT), activated partial thromboplastin time (APTT), thrombin time (TT), D-dimer, fibrinogen), and myocardial enzymes (creatine kinase (CK), creatine kinase isoenzyme (CK-MB), lactate dehydrogenase (LDH), ischemia-modified albumin (IMA), α-hydroxybutyrate dehydrogenase (HBDH)).
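The cortical and subcortical involvement summaries in item 3 can be illustrated with a short sketch. It assumes that each involved region adds one point to its score; the wording for the subcortical score is ambiguous, so the per-region counting, the region identifiers, and the function name are illustrative assumptions rather than the study's actual code.

```python
# Region identifiers are hypothetical labels for the regions listed in item 3.
CORTICAL = ("frontal", "parietal", "temporal", "occipital", "insular")
SUBCORTICAL = ("basal_ganglia", "internal_capsule", "brain_stem",
               "periventricular", "thalamus", "cerebellum")

def involvement_scores(involved):
    """Return (cortical_score, subcortical_score) for the involved regions.

    Assumption: every involved region contributes exactly one point.
    """
    involved = set(involved)
    cortical = sum(1 for region in CORTICAL if region in involved)
    subcortical = sum(1 for region in SUBCORTICAL if region in involved)
    return cortical, subcortical

# Hypothetical patient with frontal, temporal, and thalamic involvement.
print(involvement_scores({"frontal", "temporal", "thalamus"}))  # (2, 1)
```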

Data processing and model building

(Processing of missing data) We recorded the first value of each laboratory indicator after stroke admission, excluded indicators with more than 10% missing values, and imputed the remaining indicators that had more than 1000 missing cases using a random forest algorithm.
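The exclusion rule for sparse indicators can be sketched as follows. This is a minimal illustration that covers only the 10% filter, not the random forest imputation; the table layout, the function names, and the example values are hypothetical.

```python
def missing_fraction(values):
    """Fraction of entries recorded as None (i.e. missing)."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_indicators(table, max_missing=0.10):
    """Keep only indicators whose missing fraction does not exceed max_missing."""
    return {name: column for name, column in table.items()
            if missing_fraction(column) <= max_missing}

# Hypothetical first-measurement values for ten patients.
labs = {
    "wbc": [7.1, 9.4, None, 6.3, 8.8, 7.7, 10.2, 5.9, 6.6, 8.1],          # 10% missing
    "hormone": [None, None, 1.2, None, 0.9, None, None, 1.1, None, None],  # 70% missing
}
kept = drop_sparse_indicators(labs)
print(sorted(kept))  # ['wbc']
```

Indicators that survive this filter but still contain gaps would then go to the imputation step, which is not shown here.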

(Distribution of characteristics) Univariate analysis was used to examine the distribution of characteristics between the negative group and the positive group. The data were then divided into a training set and a test set at a ratio of 7:3.
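A 7:3 split can be sketched as below. The sketch assumes the split was stratified by outcome, which the text does not state explicitly; the function name and the seed are illustrative.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each outcome class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical cohort: 10 positive and 90 negative patients.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 70 30
```

Splitting per class keeps the proportion of positive cases the same in both sets, which matters when the outcome is rare.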

(Processing of unbalanced data) Considering the low incidence of PSE and the small proportion of positive patients, the positive data in the training set were augmented with the SMOTE (synthetic minority oversampling technique) method.
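The core idea of SMOTE, creating synthetic positive samples by interpolating between a positive sample and one of its nearest positive neighbours, can be sketched in plain Python. This is a simplified illustration under assumed parameters (k, seed, toy points), not the implementation used in the study.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical positive samples with two numeric features.
positives = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote(positives, n_new=6)
print(len(new_points))  # 6
```

Oversampling of this kind is applied to the training set only, as described above, so the test set keeps the original class balance.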

(Processing of categorical data) Categorical data were transformed with one-hot encoding. The LASSO method was then applied to the training set to screen for important features.
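One-hot encoding replaces each categorical value with a vector of 0/1 indicators, one per category. A minimal sketch follows; the function name and the example values are illustrative.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, rows = one_hot(["female", "male", "female"])
print(categories)  # ['female', 'male']
print(rows)        # [[1, 0], [0, 1], [1, 0]]
```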

(Model building) The 20 features with the largest absolute LASSO regression coefficients were selected, and nine common methods were used to build machine learning models: naive Bayes (NB), logistic regression (LR), decision tree (DT), random forest (RF), gradient boosting (GB), multilayer perceptron (MLP), XGBoost (XGB), LightGBM (LGBM), and k-nearest neighbors (KNN). Accuracy, sensitivity, specificity, F1-score, positive predictive value (PPV), and negative predictive value (NPV) were used to evaluate the performance of the models. The area under the receiver operating characteristic (ROC) curve (AUC) was used to measure the discrimination of the models, and the calibration plot and Brier score were used to evaluate their calibration. Decision curve analysis (DCA) was used to evaluate the net benefit of the models for patients.

(External validation of the model) The generalization performance of the model was evaluated using patient data from the external validation cohort.

(Influencing factor analysis) After the best model was selected, it was interpreted with the SHAP (SHapley Additive exPlanations) algorithm to analyze the contribution of different features to the prediction and their clinical significance.
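The threshold-based metrics and the Brier score named above follow directly from the confusion matrix and the predicted probabilities. The sketch below uses the standard definitions; the function names and the toy labels are illustrative, and the code assumes that every denominator is non-zero.

```python
def classification_metrics(y_true, y_pred):
    """Standard confusion-matrix metrics for binary labels (1 = PSE, 0 = no PSE)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

# Hypothetical labels, predictions, and probabilities for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.3, 0.1]
metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["specificity"])  # 0.75 0.8
print(round(brier_score(y_true, probs), 3))         # 0.115
```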

Statistical approach

PostgreSQL v15 (http://www.postgresql.org/) was used to search and extract the data from the local database.

The open-source statistical module "scipy.stats" in Python was used for statistical analysis. The details of the univariate significance analysis for each feature are as follows:

First, the Shapiro-Wilk test was used to check the normality of the distribution for each feature. For features that did not follow a normal distribution, the Mann-Whitney U test was used to assess their significance with respect to the target variable.

For features that exhibited a normal distribution, the Levene test was employed to assess the homogeneity of variances. Features with homogeneous variances were analyzed using Student’s t-test to determine their significance with respect to the target variable, while features with heterogeneous variances were analyzed using Welch’s t-test.
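The test-selection logic described above can be written as a small decision function. The sketch takes p-values as inputs rather than computing them; in practice they could come from `scipy.stats.shapiro` and `scipy.stats.levene`, with the chosen test then run via `scipy.stats.mannwhitneyu` or `scipy.stats.ttest_ind` (using `equal_var=False` for Welch's t-test). Checking normality separately in each group and reusing 0.05 as the cut-off for these checks are assumptions, as is the function name.

```python
def choose_test(shapiro_p_negative, shapiro_p_positive, levene_p, alpha=0.05):
    """Return the two-group test implied by the normality and variance checks."""
    # Assumption: a feature counts as non-normal if either group fails Shapiro-Wilk.
    if shapiro_p_negative < alpha or shapiro_p_positive < alpha:
        return "Mann-Whitney U test"
    # Normal in both groups: Levene's test decides between the two t-tests.
    if levene_p < alpha:
        return "Welch's t-test"
    return "Student's t-test"

# Hypothetical p-values for three features.
print(choose_test(0.01, 0.30, 0.50))  # Mann-Whitney U test
print(choose_test(0.20, 0.30, 0.01))  # Welch's t-test
print(choose_test(0.20, 0.30, 0.50))  # Student's t-test
```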

The confidence intervals for the AUC values and Brier scores were obtained by performing 1000 bootstrap resampling iterations on the corresponding datasets. The binary classification thresholds for the predicted probabilities generated by all models were established using the maximum Youden index derived from the training cohort.
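Both procedures in this paragraph can be sketched in plain Python: a percentile bootstrap interval for the AUC and a threshold chosen by the maximum Youden index (sensitivity + specificity - 1). The percentile method, the rule "predict positive when the score is at or above the threshold", the function names, and the toy data are assumptions; the text does not specify these details.

```python
import random

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # the AUC needs both classes in the resample
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

def youden_threshold(y_true, scores):
    """Threshold that maximises sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(y == 1 and s >= t for y, s in zip(y_true, scores))
        tn = sum(y == 0 and s < t for y, s in zip(y_true, scores))
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t

# Hypothetical training-cohort outcomes and predicted probabilities.
y = [0, 0, 0, 0, 1, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
print(round(auc(y, p), 3))     # 0.933
print(youden_threshold(y, p))  # 0.4
lower, upper = bootstrap_ci(y, p)
```

The threshold is derived from the training cohort only and then reused unchanged on the test and external validation data, as stated above.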

Throughout the study, a two-tailed p-value less than 0.05 was considered statistically significant.

Results

Characteristics of study participants

A total of 21459 patients were included in this study, of which 15021 patients were included in the training set, and the incidence of PSE was 4.3%. The test set contained 6438 patients with a PSE incidence of 4.3%. The external validation cohort consisted of 536 patients at three hospitals. Statistical details of the clinical characteristics of the patients are provided in the table.

Statistical analysis showed that patients with a higher probability of PSE more often had uremia, a history of DVT, atrial fibrillation, hyperuricemia, cerebral hernia, or hydrocephalus. Significant lesion locations included the frontal lobe, parietal lobe, occipital lobe, temporal lobe, cortex, subcortex, basal ganglia, and hypothalamus. Significant general characteristics included age, gender, and NIHSS score; significant laboratory indicators included WBC count, HbA1c, CRP, TG, AST, ALT, bilirubin, urea, BUA, APTT, TT, D-dimer, and CK. The p values of CK-MB, LDH, HBDH, IMA, lactate, and anion gap were also significant. In addition, the p values of fatty liver, coronary heart disease, hyperlipidemia, and HDL were significant, and patients who were negative for these conditions or had low values of these indicators had a high risk of PSE. The results of the univariate significance analysis, the univariable logistic regression, and the multivariable logistic regression are given in Table 1, Table 2, and Table 3, respectively.

Table 1. Univariate significance analysis results

Table 2. Univariable logistic regression results

Table 3. Multivariable logistic regression results

Performance of machine learning models

The performance indicators of the machine learning models are shown in Table 4, and the ROC curves, calibration curves, and DCA are shown in Figure 3. Across all models, the AUCs of tree-based models such as RF, XGBoost, and LightGBM were higher than those of the other models, and random forest had the highest PPV, 0.977, making it the most accurate at identifying positive patients (the most important function of our models). Complex algorithms were superior to traditional logistic regression. The Brier score reached 0.006, and the DCA showed good clinical decision-making benefit, indicating good practical value. In the external validation cohort, the sensitivity was 0.91, the PPV was 0.95, and only 8 patients were predicted incorrectly, demonstrating the good predictive ability of the model.

LASSO Regression Coefficient Paths

Figure 3. ROC of train (A1), ROC of test (B1), calibration curve (CC) of train (A2), CC of test (B2), DCA of train (A3), DCA of test (B3)

Table 4. AUC, accuracy, sensitivity, specificity, F1-score, positive predictive value (PPV), and negative predictive value (NPV) of the ML models in the train | test groups

Analysis of SHAP risk factors

The SHAP analysis, an individual force plot, and the overall decision plot are shown in Figure 4. Among the general characteristics, female patients were more prone to PSE. The higher the NIHSS score, the more likely PSE was to occur; the NIHSS score had the third-largest effect, behind only white blood cell count and D-dimer. Higher WBC, D-dimer, CRP, AST, CK-MB, HbA1c, bilirubin, TCO2, and LDH and lower HBDH, PLT, and APTT at admission were associated with a higher likelihood of developing PSE. However, the involved brain regions, taken as single factors, had little effect overall. Among the complications, only patients with hypertension were more prone to PSE, whereas patients with coronary heart disease, diabetes, hyperlipidemia, fatty liver, and similar conditions were less prone to PSE.

Figure 4. SHAP values (left), force plot (upper right), and decision plot (lower right)

Discussion

Our study utilized comprehensive clinical data, imaging data, laboratory test data, and other data from stroke patients. We employed machine learning algorithms to establish a predictive model, achieving an AUC score of above 0.95, which demonstrated more accurate predictions compared to traditional statistical methods. Our research found that tree-based ensemble models showed superior overall prediction capabilities when dealing with large sample sizes and high-dimensional features.

During the modeling process, due to the extreme imbalance between negative and positive samples, we employed SMOTE oversampling to augment the data, resulting in improved training performance. Through SHAP analysis, we conducted interpretability analysis of the model and determined the importance of different features.

In our study, age and NIHSS score were treated as continuous variables. We found that, overall, female patients, older patients, and those with higher NIHSS scores were more prone to develop PSE, which is consistent with recent articles. High NIHSS scores, indicative of more severe stroke, increased the risk of complications, ranking behind only white blood cell count and D-dimer in our model [5][19][10][20]. However, opinions on the impact of age conflict. Some studies [5][21] suggested that age <65 is a high-risk factor, which aligns with our findings, while others [22] confirmed that advanced age is the determining factor. Yamada et al. [21] also concurred with our study in identifying a higher risk of complications among females, whereas Waafi’s research [10] indicated that male patients are 3.325 times as likely as females to develop complications, which contradicts our findings.

Previous studies have shown that patients with diabetes, dyslipidemia, hypertension, depression, or dementia are at an increased risk of developing vascular epilepsy [12]. In our study, statistical analysis and multiple ML models examined the association between comorbidities and complications and revealed that patients with coronary heart disease, diabetes, fatty liver, hyperlipidemia, or large artery stenosis or plaques (common carotid artery and internal carotid artery) were less likely to develop epilepsy. According to the TOAST classification, ischemic stroke is categorized into five types: large artery atherosclerosis, cardioembolism, small vessel occlusion, other determined etiology, and undetermined etiology. Patients with these comorbidities generally fall into the categories of large artery atherosclerosis and cardioembolism, which are relatively well defined and easier to treat, thus resulting in a lower likelihood of developing epilepsy. Conversely, strokes of undetermined etiology usually have a poor prognosis and are more likely to lead to epilepsy. Among patients with diabetes, higher HbA1c levels indicate poorer blood sugar control and a higher probability of developing complications, which significantly affects certain patients, while those with good control have a lower overall risk of developing complications.

Alain et al. found that cortical infarction was more likely to result in epilepsy in patients hospitalized with anterior circulation ischemic stroke [23]. Lin et al. found that factors such as cortical involvement and intracerebral hemorrhage volume increased the likelihood of PSE, which is consistent with our research findings [8]. Al-Sahli et al. also suggested that cortical brain injury and large-area lesions increased the risk of PSE [5][21]. In our study, statistical analysis showed that involvement of both cortical and subcortical regions increased the probability of PSE, but their effect was smaller than that of the other features, so they were not selected by the LASSO regression.

Previous studies have found that acute infection is a risk factor for ischemic stroke [24]. C-reactive protein (CRP) reflects the level of inflammation and is an independent prognostic factor [25]. In our study, SHAP analysis showed that white blood cell count had the greatest impact among the routine blood test parameters, surpassing the NIHSS score. High white blood cell count may indicate severe inflammation and infection, as well as increased blood viscosity, making patients more susceptible to secondary complications. In general, high red blood cell count and low platelet count also have some influence.

A large-scale study on Chinese individuals found a negative correlation between plasma high-density lipoprotein cholesterol (HDL-C) concentration and the risk of ischemic stroke, a weak positive correlation between plasma triglyceride (TG) concentration and the risk of ischemic stroke, and a strong correlation between plasma low-density lipoprotein cholesterol (LDL-C) concentration and apolipoprotein B [26]. High HDL-C levels are associated with better prognosis [27]. Our study is consistent with previous research, indicating that high LDL-C, low HDL-C, and high TG levels are more likely to lead to PSE. This can be easily understood as high cholesterol and triglyceride levels lead to increased blood viscosity and vascular sclerosis, making it easier for clots to form [12][28][29]. Higher D-dimer levels indicate greater brain tissue damage and a higher likelihood of PSE. Overall, lower activated partial thromboplastin time (aPTT) and fibrinogen levels are associated with an increased risk of PSE. INR, PT, and TT have a smaller impact. Among liver function parameters, aspartate aminotransferase (AST) has the greatest influence on PSE, while high AST levels, low alanine aminotransferase (ALT) levels, and low albumin levels all have a certain degree of impact. Lingling Ding et al. found that liver enzyme subgroups characterized by alanine aminotransferase and aspartate aminotransferase were associated with a high risk of adverse function [30], which is consistent with our research.

Studies have shown that subgroups identified by renal function biomarkers such as urinary microalbumin, cystatin C, and creatinine have significantly higher stroke recurrence and poorer prognosis [30]. In our study, low urea levels and high uric acid levels had a negative impact [31][32][33]. Our research is similar to their conclusions. While elevated uric acid levels at admission are positively associated with PSE, patients previously diagnosed with hyperuricemia are less likely to develop epilepsy. Considering that uric acid functions as a strong reducing agent and has neuroprotective properties [34], patients with normal liver and kidney function and a certain degree of hyperuricemia have stronger resistance to emergencies [35][36]. However, excessively high uric acid levels indicate metabolic disorders and poor liver and kidney function, which are associated with poor prognosis.

When stroke patients are admitted, cardiac enzyme profile tests are often performed to rule out concurrent myocardial ischemia. However, studies have shown that elevated CK-MB in stroke patients may not be related to the heart [37]. Multiple cardiac enzymes are important prognostic indicators [38][39] and have been included in stroke scores [40]. Some studies have shown a higher incidence of abnormal serum cardiac enzyme profiles in the acute phase of stroke. Although the incidence of abnormalities is unrelated to the nature of the stroke, it is associated with the severity of the stroke: patients with consciousness disorders have a significantly higher incidence of abnormal cardiac enzyme profiles than those without [41]. In our study, CK, CK-MB, and IMA in the cardiac enzyme profile had a significant impact and high predictive value, but the specific mechanisms require further research [34].

Limitations and prospects

Although our study incorporates a large amount of information and utilizes almost all available data, including clinical data, imaging data, and laboratory test data, in an attempt to establish more accurate prediction models beyond traditional statistics using machine learning algorithms, there are still several limitations in the modeling process.

The data is not sufficiently representative, and the model’s generalization ability needs further assessment. Firstly, although we collected data from multiple tertiary hospitals, encompassing over 20,000 cases, earlier data was lost due to hospital system upgrades. The collected data mainly represents patients diagnosed in the past five years and is primarily concentrated in the Chongqing region, resulting in a limited temporal and spatial span that may restrict its generalizability to other regions.

Because of the retrospective design, some important predictive indicators are missing. Many potentially meaningful indicators, such as hemorheology, thromboelastography, and hormone levels, had a large proportion of missing values and had to be excluded. If additional features were included, it might be possible to further improve the accuracy of the model.

Features related to patients’ baseline status should be incorporated for more accurate predictions. Secondly, regarding imaging and laboratory examinations, we mainly extracted the results of the first examination upon admission, without fully utilizing the results of subsequent examinations. In the future, the use of recurrent neural networks to comprehensively extract features will be considered. In subsequent research, data standardization should be improved, and the number of cases and important indicators should continue to increase. It is also advisable to explore more advanced methods, such as deep learning, and to fully leverage all available data to make more accurate predictions.

Conclusion

In summary, we developed an interpretable machine learning model to predict the risk of PSE in hospitalized patients with ischemic stroke. Based on a large volume of medical records, our artificial intelligence model demonstrates good predictive ability for PSE. Significant predictors include NIHSS score, D-dimer levels, lactate levels, and white blood cell count, followed by indicators related to liver function and cardiac enzyme profiles. The transparency and interpretability of the model’s predictions can foster trust among clinical practitioners and facilitate decision-making. However, further prospective studies are needed to validate the utility of this tool before its application in clinical settings.

Author Contributions

JL and HH are co-first authors who analysed the data in Python and wrote the first draft of the manuscript. YD and TS are co-corresponding authors who wrote part of the draft and designed the original research. Chongqing University Central Hospital, Chongqing University Qianjiang Hospital, Yubei District hospital and Bishan hospital of Chongqing Medical University provided the database of all patient cases. The other authors collected data and wrote sections of the manuscript. All authors took part in the research, contributed to manuscript revision, and read and approved the submitted version.

Data Availability

The code, models, and analysis results are available to researchers from the corresponding author upon request.

Acknowledgements

The authors would like to thank the colleagues in the information and imaging departments for their hard work contributing to the final research results.

Ethics approval statement

We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

Funding statement

This research was funded by the project "Based on artificial intelligence and multiple omics technology set up a system of auxiliary cardiovascular disease diagnosis and treatment" (2023CDJYGRH-ZD06); by the Emergency Medicine Key Laboratory of Chongqing Joint Fund for Talent Innovation and Development (2024RCCX10); and by the Brain-like Intelligence Research Key Laboratory of the Chongqing Education Commission (BIR2019004).

Conflict of interests

The authors have no relevant conflicts of interest to disclose.

Clinical trial registration

The trial number is RS202406.