Introduction

Stroke is the second leading cause of death worldwide, causing approximately 5.5 million deaths annually, and is also the leading cause of disability globally, accounting for 50% of cases [1]. Ischemic stroke accounts for the majority of stroke cases, about 80% [2][3]. Post-stroke epilepsy (PSE) is a significant complication: studies indicate that 3-30% of stroke patients develop epilepsy, which negatively affects patients’ prognosis and quality of life [4]. It can exacerbate the cognitive, psychiatric, and physical impairments caused by cerebrovascular disease and comorbidities [5]. Furthermore, the incidence of PSE is highest within the first year after acute stroke, which accounts for about half of the cases [2]. Therefore, early prediction of and intervention for PSE, especially after ischemic stroke, are crucial.

Currently, most studies use clinical data to build simple statistical models for the prediction of PSE, such as survival analysis with Cox regression [2][6] and multiple linear regression [7]. Recently, Lin et al. developed a radiomics-based model that outperformed the conventional clinical model in predicting PSE related to intracerebral hemorrhage (ICH). They suggested that a combined radiomics-clinical model could better assist clinicians in assessing the individual risk of PSE after a first ICH and facilitate early diagnosis and treatment of PSE [8]. However, subsequent studies have raised doubts regarding the application of radiomics and suggested the need for further research [9]. Overall, research on PSE prediction remains relatively scarce. Most studies analyze specific risk factors [10][11][8][12] and construct simple models, and few have proposed or established a more comprehensive and accurate prediction model.

Machine learning has emerged in recent years as a promising approach for constructing medical models, as it excels in handling large volumes of data and complex information, and it has been increasingly applied in neuroscience and clinical prediction [13][14][15]. Previous studies have used machine learning to study post-stroke cognitive impairment [16], stroke and myocardial infarction risk prediction in patients with large artery vasculitis [14], post-stroke depression prediction based on liver function test indicators [17], and prediction of hematoma expansion in traumatic brain injury (TBI) [18]. Models constructed with machine learning algorithms can automatically handle linear or complex nonlinear relationships between variables and provide insights into the contribution of each feature to the prediction target, which is challenging for traditional statistical models. However, machine learning methods require a substantial amount of data and are prone to overfitting when trained on small samples. The more valid, high-quality data are provided, the better machine learning algorithms can capture the underlying patterns in the data and the more accurate their predictions become.

This study aims to select important risk factors from multiple features extracted from the clinical records and examination data of ischemic stroke patients and then to develop a prediction model for PSE using machine learning methods. By using relevant features available early after admission, we aim to automatically predict the probability of PSE and to guide clinical decision-making and nursing care.

Research content and method

Research patients

This study retrospectively included all stroke patients admitted to the Chongqing Emergency Center between June 2017 and June 2022 for the development of the prediction model. Subsequently, patient data from three external validation centers, namely, Qianjiang Central Hospital, Bishan District People’s Hospital, and Yubei District Traditional Chinese Medicine Hospital, were collected between July 2022 and July 2023 for external validation and evaluation of the model. The external validation cohort focused more on collecting positive cases to examine the model’s ability to identify positive samples.

Inclusion criteria: (1) Age between 18 and 90 years at admission; (2) Diagnosed with acute ischemic stroke and hospitalized for treatment.

Exclusion criteria: (1) Patients with a history of stroke or transient ischemic attack (TIA); (2) Patients with a history of other conditions such as traumatic brain injury, intracranial tumors, or cerebral vascular malformations that may cause epilepsy; (3) Patients with a history of epilepsy or who have received antiseizure medications for the prevention of seizures or for other diseases (such as migraine or psychiatric disorders); (4) Patients who died within 72 hours after stroke onset.

This study collected de-identified data from the relevant patients to construct a multi-modal database of stroke patients. The study protocol was approved by the Ethics Committees of Chongqing University Central Hospital, Chongqing University Qianjiang Central Hospital, Bishan District People’s Hospital, and Yubei District Traditional Chinese Medicine Hospital.

The selection procedure is shown in Figure 1. The stroke database contained 42079 records. We first excluded records of hemorrhagic stroke (n = 4565), a history of stroke (n = 2154), TIA (n = 3570), stroke of unclear cause (n = 561), and missing important data (n = 6496), which left 24733 patients diagnosed with new-onset ischemic or lacunar stroke. We then excluded patients whose seizures might be attributable to other potential causes, such as brain tumor, intracranial vascular malformation, or traumatic brain injury (n = 865); patients with a history of seizures (n = 152); patients who died in hospital (n = 1444); and patients who were lost to follow-up (no outpatient records and unreachable by phone) or died within 3 months of the stroke (n = 813). In total, 21459 cases were included in this study.
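The counts in the selection procedure can be checked with simple arithmetic. The short Python sketch below assumes that the first five exclusions are applied to the 42079 database records and that the remaining exclusions are applied afterwards; the variable names are illustrative only and are not taken from the study code.

```python
# Patient-flow check: subtract each group of exclusions from the running total.
total_records = 42079

# Exclusions assumed to be applied to the full database
first_stage = {
    "hemorrhagic stroke": 4565,
    "history of stroke": 2154,
    "TIA": 3570,
    "unclear cause": 561,
    "missing important data": 6496,
}

# Exclusions assumed to be applied to the remaining ischemic/lacunar cases
second_stage = {
    "other potential seizure causes": 865,
    "seizure history": 152,
    "died in hospital": 1444,
    "lost to follow-up or died within 3 months": 813,
}

after_first_stage = total_records - sum(first_stage.values())
final_cohort = after_first_stage - sum(second_stage.values())

print(after_first_stage)  # 24733
print(final_cohort)       # 21459
```

Under these assumptions the intermediate and final counts match the 24733 and 21459 reported in the text.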

Selection and exclusion procedure of patients

LASSO Regression Coefficient Paths

Data collection

We extracted all records and other relevant data from the hospital databases. Using PostgreSQL, we wrote Structured Query Language (SQL) queries to organize the data into the following categories:

  1. General information: gender, age, and NIHSS (National Institutes of Health Stroke Scale) score at admission;

  2. Comorbidities and complications: uremia, previous DVT (deep vein thrombosis), diabetes mellitus, hypertension, coronary atherosclerosis, atrial fibrillation, cerebral hernia, hydrocephalus, hypoproteinemia, hyperuricemia, hyperlipidemia, internal carotid stenosis, common carotid stenosis, etc.

  3. According to CT or MRI records, involvement of cortical lobes and subcortical regions was recorded for each patient: frontal lobe, parietal lobe, temporal lobe, occipital lobe, insular lobe, basal ganglia, internal capsule, brain stem, cerebellum, periventricular region, centrum semiovale, and thalamus. In addition, the extent of cortical involvement (frontal, parietal, temporal, occipital, and insular lobes, each contributing 1 point) and the extent of subcortical involvement (basal ganglia, internal capsule, brain stem, periventricular region, thalamus, and cerebellum, any of which contributed 1 point) were summarized.

  4. According to CTA, MRA or DSA records, vascular stenosis or occlusion was recorded for each patient: ACA (anterior cerebral artery), MCA (middle cerebral artery), PCA (posterior cerebral artery), VA (vertebral artery), and BA (basilar artery).

  5. Important laboratory indicators: blood lipids (TG (Triglyceride), HDL (High Density Lipoprotein Cholesterol), LDL (Low Density Lipoprotein Cholesterol)), liver function (ALT (Alanine Transaminase), AST (Aspartate Aminotransferase), Bilirubin, Albumin), renal function (Urea, BUA (Blood Uric Acid), Creatinine), blood gas (Lactate, Anion Gap, TCO2 (Total Carbon Dioxide)), coagulation-related indicators (INR (International Normalized Ratio), PT (Prothrombin Time), APTT (Activated Partial Thromboplastin Time), TT (Thrombin Time), D-Dimer, Fibrinogen) and myocardial enzymes (CK (Creatine Kinase), CK-MB (Creatine Kinase Isoenzyme), LDH (Lactate Dehydrogenase), IMA (Ischemia-Modified Albumin), HBDH (α-Hydroxybutyrate Dehydrogenase)).

Data processing and model building

(Processing of missing data) For each laboratory indicator, we took the first value measured after stroke admission (every patient admitted for stroke underwent routine blood tests, liver and kidney function tests, and so on). Indicators with more than 10% missing values were excluded, and missing values in the remaining indicators were imputed with a random forest algorithm using default parameters. The features were processed in order of increasing missingness, starting with the feature with the fewest missing values, since filling that feature requires the least accurate information. When a feature was being filled, missing values in the other features were temporarily replaced with 0. After each regression prediction, the predicted values were written back into the original feature matrix before the next feature was filled. Once all features had been processed, the data were complete.
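The filling procedure described above can be sketched as follows. This is a minimal illustration and not the study's actual code: it assumes the features are held in a numeric NumPy array with NaN marking missing values, uses scikit-learn's RandomForestRegressor with default parameters for every feature, and the function name rf_impute is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, random_state=0):
    """Fill NaNs column by column, starting with the column with the fewest NaNs."""
    X = np.array(X, dtype=float)          # work on a copy
    missing = np.isnan(X)                 # original missingness mask
    order = np.argsort(missing.sum(axis=0))
    for j in order:
        rows = missing[:, j]              # rows to be filled in column j
        if not rows.any():
            continue
        # Other features, with their remaining missing values replaced by 0
        others = np.delete(X, j, axis=1)
        others = np.where(np.isnan(others), 0.0, others)
        model = RandomForestRegressor(random_state=random_state)
        model.fit(others[~rows], X[~rows, j])
        # Write predictions back so later columns can use the filled values
        X[rows, j] = model.predict(others[rows])
    return X
```

Because predictions are written back into the matrix, columns filled earlier serve as complete predictors for the columns filled later, which is the reason for starting with the least-missing feature.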

(Distribution of characteristics) Univariate analysis was used to examine the distribution of characteristics between the PSE negative group and the positive group. The data were then divided into a training set and a test set at a ratio of 7:3.

(Processing of unbalanced data) Given the low incidence of PSE and the small proportion of positive patients, the positive samples in the training set were augmented with the Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbours, using the SMOTEENN method from the imblearn Python package with default parameters and a random seed of 42 for reproducibility.
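To illustrate the idea behind the over-sampling step, the sketch below shows how SMOTE-style synthetic samples are interpolated between a minority-class point and one of its nearest minority-class neighbours. It is a simplified, standard-library illustration only: it is not the imblearn implementation used in the study, it omits the Edited Nearest Neighbours cleaning step, and the function name smote_samples and its parameters are hypothetical.

```python
import math
import random

def smote_samples(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Each synthetic point lies on the line segment between two real minority samples, so the method enlarges the positive class without duplicating records. Resampling of this kind is applied to the training set only, as stated above, so that the test data keep their original class distribution.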

(Processing of categorical data) Categorical variables were transformed with one-hot encoding. The LASSO method was then applied to the training set to screen for important features.

(Model building) We first used LASSO regression to select the 20 most important features. Next, we employed 9 common machine learning methods: Naive Bayes, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Multi-Layer Perceptron, XGBoost, LightGBM, and K-Nearest Neighbors. We then optimized the hyperparameters of each model through grid search to improve performance. To evaluate the models, we calculated accuracy, sensitivity, specificity, F1-score, positive predictive value, and negative predictive value. We also plotted the ROC curve, calibration curve, and decision curve. Additionally, we used an independent external validation dataset to assess the generalization performance of the selected model. Finally, we used the SHAP algorithm to perform an interpretability analysis of the best-performing model, investigating the contribution of each feature to the model’s predictions and its clinical significance. Through this series of model building, optimization, and analysis steps, we developed a machine learning model with good predictive performance and interpretability that can support clinical decision-making.
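The threshold-based evaluation metrics listed above all derive from the confusion matrix. The following sketch shows one way to compute them for binary labels (1 = PSE, 0 = no PSE); the function name classification_metrics is illustrative and is not taken from the study code.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics from binary labels (1 = PSE, 0 = no PSE)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall of the positive class
    specificity = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)           # positive predictive value (precision)
    npv = ratio(tn, tn + fn)           # negative predictive value
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }
```

With a low-incidence outcome such as PSE, accuracy and NPV are dominated by the negative class, which is why sensitivity, PPV, and F1-score are reported alongside them.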

Statistical approach

PostgreSQL v15 (http://www.postgresql.org/) was used to search and extract the data from the local database.

The open-source statistical module “scipy.stats” in Python was used for statistical analysis. The details of the univariate significance analysis for each feature are as follows:

The Shapiro-Wilk test was used to check the normality of the distribution for each feature. For features that did not follow a normal distribution, the Mann-Whitney U test was used to assess their significance with respect to the target variable.

For features that exhibited a normal distribution, the Levene test was employed to assess the homogeneity of variances. Features with homogeneous variances were analyzed using the Student’s t-test to determine their significance with respect to the target variable, while features with heterogeneous variances were analyzed using the Welch’s t-test.
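The test-selection logic described above can be sketched with scipy.stats as follows. This is an illustration under stated assumptions rather than the study's code: normality is checked within each outcome group, a significance level of 0.05 is used at every step, and the function name compare_groups is hypothetical.

```python
from scipy import stats

def compare_groups(negative, positive, alpha=0.05):
    """Choose and run a two-sided test for one continuous feature.

    Returns (test_name, p_value).
    """
    # Step 1: Shapiro-Wilk normality check in both groups
    normal = all(stats.shapiro(g).pvalue > alpha for g in (negative, positive))
    if not normal:
        result = stats.mannwhitneyu(negative, positive, alternative="two-sided")
        return "Mann-Whitney U", result.pvalue
    # Step 2: Levene test for homogeneity of variances
    equal_var = stats.levene(negative, positive).pvalue > alpha
    # Step 3: Student's t-test if variances are homogeneous, Welch's otherwise
    result = stats.ttest_ind(negative, positive, equal_var=equal_var)
    return ("Student t" if equal_var else "Welch t"), result.pvalue
```

For example, calling compare_groups on the D-dimer values of the PSE-negative and PSE-positive groups would return the name of the chosen test and its p-value.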

The confidence intervals for the AUC values and Brier scores were obtained by performing 1000 bootstrap resampling iterations on the corresponding datasets. The binary classification thresholds for the predicted probabilities generated by all models were established using the maximum Youden index derived from the training cohort.
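As an illustration of these two steps, the sketch below selects a probability cut-off by maximizing the Youden index J = sensitivity + specificity - 1 and computes a percentile bootstrap interval for a metric such as the Brier score. It is a standard-library sketch, not the study's code: the 95% level and the percentile method are assumptions (the text does not state them), and all function names are hypothetical.

```python
import random

def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        # Predict positive when the probability is at least the threshold
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def bootstrap_ci(y_true, y_prob, metric, n_boot=1000, seed=0):
    """Percentile bootstrap interval (assumed 95%) for metric(y_true, y_prob)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(metric([y_true[i] for i in idx], [y_prob[i] for i in idx]))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot) - 1]
```

The threshold is chosen on the training cohort and then applied unchanged to the test and external validation cohorts, so the reported sensitivity and specificity are not tuned on the evaluation data.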

Throughout the study, a two-tailed p-value less than 0.05 was considered statistically significant.

All code is available at https://github.com/conanan/lasso-ml.

Results

Filling of missing data

The following features had missing values, which were filled one feature at a time using a Random Forest (RF) model: Plt, WBC, RBC, HbA1c, CRP, TG, LDL, HDL, AST, ALT, Bilirubin, Albumin, Urea, Creatinine, BUA, PT, APTT, TT, INR, D-dimer, Fibrinogen, CK, CK-MB, LDH, HBDH, IMA, Lactate, Anion_gap, TCO2, NIHSS.

Characteristics of study participants

A total of 21459 patients were included in this study. The training set contained 15021 patients, with a PSE incidence of 4.3%, and the test set contained 6438 patients, also with a PSE incidence of 4.3%. The external validation cohort consisted of 536 patients from three hospitals. Statistical details of the patients’ clinical characteristics are provided in Table 1.

Univariate significance analysis results

Statistical analysis showed that patients with the following complications had a higher likelihood of PSE: uremia, history of DVT, atrial fibrillation, hyperuricemia, cerebral hernia, and hydrocephalus. Significant lesion locations were the frontal lobe, parietal lobe, occipital lobe, temporal lobe, cortex, subcortex, basal ganglia, and hypothalamus. Significant general characteristics were age, gender, and NIHSS score. Significant laboratory indicators were WBC count, HbA1c, CRP, triglycerides, AST, ALT, bilirubin, urea, uric acid, APTT, PT, D-dimer, CK, CK-MB, LDH, HBDH, IMA, lactate, and anion gap. In addition, the p values for fatty liver, coronary heart disease, hyperlipidemia, and HDL were significant, and patients who were negative for these conditions or had low values of these indicators had a higher risk of PSE. The descriptive statistics and the univariable and multivariable regression results are given in Table 1, Table 2, and Table 3, respectively.

Univariable logistic regression results

Multivariable logistic regression results

Performance of machine learning models

The performance indicators of the machine learning models are shown in Table 4, and the ROC curves, calibration curves, and DCA are shown in Figure 3. Across all models, tree-based models such as RF, XGBoost, and LightGBM achieved higher AUCs than the other models, and random forest had the highest PPV (0.864), which is the metric we consider most important for our models. Complex machine learning algorithms were superior to traditional logistic regression. The Brier score of the calibration curve reached 0.006, and the DCA also showed good clinical decision-making benefit, indicating good practical value. In the external validation cohort, we used the RF model for prediction. The sensitivity was 0.91 and the PPV was 0.95, demonstrating the good predictive ability of the model.

ROC curves of the training (A1) and test (B1) sets, calibration curves (CC) of the training (A2) and test (B2) sets, and DCA of the training (A3) and test (B3) sets

The AUC, accuracy, sensitivity, specificity, F1-score, positive predictive value (PPV) and negative predictive value (NPV) of the ML models in the training and test sets

Analysis of SHAP risk factors

Figure 4 shows the SHAP (SHapley Additive exPlanations) values, an individual force plot, and the overall decision plot. Among the general characteristics, females had a higher rate of PSE. Higher NIHSS scores were associated with a higher incidence of PSE. Higher values of WBC count, D-dimer, CRP, AST, CK-MB, HbA1c, bilirubin, TCO2, and LDH at admission were associated with a greater likelihood of developing PSE. Conversely, lower values of HBDH, PLT, and APTT were also linked to a higher probability of the outcome. However, the specific brain regions affected did not have a significant individual effect on the overall outcome. Among the complications, only hypertension was more strongly associated with the development of the outcome. Other conditions, such as coronary heart disease, diabetes, hyperlipidemia, and fatty liver, were less likely to be related to the outcome. The force plot for the first patient shows how individual features influenced that patient’s prediction: a long APTT contributed most toward PSE, followed by the AST level and other features, whereas the low NIHSS score pushed the prediction in the opposite direction. The decision plot aggregates the model’s decisions and shows how the complex model arrives at its predictions.

SHAP values (left), force plot (upper right) and decision plot (lower right)

Discussion

Our study used comprehensive clinical, imaging, and laboratory test data from the stroke patient database and employed machine learning algorithms to establish a predictive model that achieved an AUC above 0.95, yielding more accurate predictions than traditional statistical methods. We found that tree-based ensemble models showed superior overall predictive capability when dealing with large sample sizes and high-dimensional features.

During modeling, because of the extreme imbalance between negative and positive samples, we used the SMOTEENN technique to resample the training data, which improved training performance. Through SHAP analysis, we examined the interpretability of the model and determined the importance of the different features.

In our study, age and NIHSS score were treated as continuous variables. We found that, overall, female patients, older patients, and those with higher NIHSS scores were more prone to develop PSE, which is consistent with recent articles. High NIHSS scores, indicative of more severe stroke, increased the risk of complications and ranked second only to white blood cell count and D-dimer in our model [5][19][10][20]. However, opinions conflict regarding the impact of age. Some studies [5][21] suggested that age <65 is a high-risk factor, which aligns with our findings, while others [22] confirmed that advanced age is the determining factor. Yamada et al. [21] also concurred with our study in identifying a higher risk of complications among females, whereas Waafi et al. [10] indicated that male patients are 3.325 times as likely as females to develop complications, which contradicts our findings.

Previous studies have shown that patients with diabetes, dyslipidemia, hypertension, depression, or dementia are at an increased risk of developing vascular epilepsy [12]. In our study, statistical analysis and multiple ML models examined the association between comorbidities and complications and revealed that patients with coronary heart disease, diabetes, fatty liver, hyperlipidemia, or large artery stenosis or plaques (CCA and ICA) were less likely to develop epilepsy. According to the TOAST classification, ischemic stroke is categorized into five types: large artery atherosclerosis, cardioembolism, small vessel occlusion, other determined etiology, and undetermined etiology. Patients with these comorbidities generally fall into the categories of large artery atherosclerosis and cardioembolism, which are relatively well defined and easier to treat, resulting in a lower likelihood of developing epilepsy. Conversely, strokes of undetermined etiology usually have a poor prognosis and are more likely to lead to epilepsy. Among patients with diabetes, higher HbA1c levels indicate poorer blood sugar control and a higher probability of developing complications, which significantly affects certain patients, while those with good control have a lower overall risk of developing complications.

Alain et al. found that cortical infarction was more likely to result in epilepsy in patients hospitalized with anterior circulation ischemic stroke [23]. Lin et al. found that factors such as cortical involvement and intracerebral hemorrhage volume increased the likelihood of PSE, which is consistent with our findings [8]. Al-Sahli et al. also suggested that cortical brain injury and large-area lesions increased the risk of PSE [5][21]. In our study, statistical analysis showed that involvement of both cortical and subcortical regions increased the likelihood of PSE, but these features had less influence than the other features and were therefore not selected by LASSO regression.

Previous studies have found that acute infection is a risk factor for ischemic stroke [24]. C-reactive protein (CRP) reflects the level of inflammation and is an independent prognostic factor [25]. In our study, regression and SHAP analysis both showed that white blood cell count had a great impact among the routine blood test parameters; in the SHAP analysis it even surpassed the NIHSS score. A high white blood cell count may indicate severe inflammation and infection, as well as increased blood viscosity, making patients more susceptible to secondary complications. In general, a high red blood cell count and a low platelet count also have some influence.

A large-scale study on Chinese individuals found a negative correlation between plasma high-density lipoprotein cholesterol (HDL-C) concentration and the risk of ischemic stroke, a weak positive correlation between plasma triglyceride (TG) concentration and the risk of ischemic stroke, and a strong correlation between plasma low-density lipoprotein cholesterol (LDL-C) concentration and apolipoprotein B [26]. High HDL-C levels are associated with better prognosis [27]. Our study is consistent with previous research, indicating that high LDL-C, low HDL-C, and high TG levels are more likely to lead to PSE. This can be easily understood as high cholesterol and triglyceride levels lead to increased blood viscosity and vascular sclerosis, making it easier for clots to form [12][28][29]. Higher D-dimer levels indicate greater brain tissue damage and a higher likelihood of PSE. Overall, lower activated partial thromboplastin time (APTT) and fibrinogen levels are associated with an increased risk of PSE. INR, PT, and TT have a smaller impact. Among liver function parameters, aspartate aminotransferase (AST) has the greatest influence on PSE, while high AST levels, low alanine aminotransferase (ALT) levels, and low albumin levels all have a certain degree of impact. Lingling Ding et al. found that liver enzyme subgroups characterized by alanine aminotransferase and aspartate aminotransferase were associated with a high risk of adverse function [30], which is consistent with our research.

Studies have shown that subgroups identified by renal function biomarkers such as urinary microalbumin, cystatin C, and creatinine have significantly higher stroke recurrence and poorer prognosis [30]. In our study, low urea levels and high uric acid levels had a negative impact [31][32][33]. Our research is similar to their conclusions. While elevated uric acid levels at admission are positively associated with PSE, patients previously diagnosed with hyperuricemia are less likely to develop epilepsy. Considering that uric acid functions as a strong reducing agent and has neuroprotective properties [34], patients with normal liver and kidney function and a certain degree of hyperuricemia have stronger resistance to emergencies [35][36]. However, excessively high uric acid levels indicate metabolic disorders and poor liver and kidney function, which are associated with poor prognosis.

When stroke patients are admitted, cardiac enzyme profile tests are often performed to rule out concurrent myocardial ischemia. However, studies have shown that elevated CK-MB in stroke patients may not only be related to the heart [37]. Multiple cardiac enzymes are important prognostic indicators [38][39] and have been included in stroke scores [40]. Some studies have shown a higher incidence of abnormal serum cardiac enzyme profiles in the acute phase of stroke. Although the incidence of abnormalities is unrelated to the nature of the stroke, it is associated with the severity of the stroke, with patients with consciousness disorders having a significantly higher incidence of abnormal cardiac enzyme profiles than those without consciousness disorders [41]. In our study, CK, CK-MB, and IMA in the cardiac enzyme profile had a significant impact and high predictive value, but the specific mechanisms require further research [34].

Although our study incorporates a large amount of information and utilizes almost all available data, including clinical data, imaging data, and laboratory test data, in an attempt to establish more accurate prediction models beyond traditional statistics using machine learning algorithms, there are still several limitations in the modeling process.

While the current study provides valuable conclusions, the data sample may not be fully representative, and the model’s generalization ability requires further assessment. Although the data were collected from multiple tertiary hospitals and encompass over 20,000 cases, earlier data were lost due to hospital system upgrades. The collected data mainly represent patients diagnosed in the past five years and are primarily concentrated in the Chongqing region, which may limit the model’s applicability to other geographic areas.

Additionally, the retrospective nature of the research has resulted in the lack of certain important predictive indicators. As this was a retrospective study, many potentially meaningful features, such as hemorheology, thromboelastography, and hormone levels, were significantly missing and had to be excluded. Incorporating these additional features could potentially improve the model’s accuracy.

To enhance the predictive power of the model, it would be beneficial to incorporate more than baseline patient characteristics. For example, the current analysis primarily used the results of the first examination upon admission, without fully leveraging the information from subsequent examinations. In future research, recurrent neural networks could facilitate comprehensive extraction of features from the entire sequence of examinations.

To further strengthen the study, data standardization should be improved, and the number of cases and important indicators should continue to increase. Additionally, it would be advisable to explore more advanced scientific methods, such as deep learning, and fully leverage all available data to make more accurate predictions.

Conclusion

We developed an interpretable machine learning model to predict the risk of post-stroke epilepsy (PSE) in hospitalized patients with ischemic stroke. Leveraging a large volume of medical records, our artificial intelligence model demonstrates good predictive performance for PSE. The key predictors identified by the model include NIHSS score, D-dimer levels, lactate levels, and white blood cell count, followed by indicators related to liver function and cardiac enzyme profiles. The transparency and interpretability of the model’s predictions can foster trust among clinical practitioners and facilitate decision-making. While the results are promising, further prospective studies are needed to validate the clinical utility of this tool before its application in real-world settings.

Data availability statement

The code, models, and analysis results are available at https://github.com/conanan/lasso-ml. The full dataset can be provided to researchers by the corresponding author upon request.

Acknowledgements

The authors would like to thank the colleagues in the information and imaging departments for their hard work contributing to the final research results.

Additional information

Author Contributions

JL and HH are co-first authors who analysed the data in Python and wrote the first draft of the manuscript. YD and TS are co-corresponding authors who designed the original research and wrote part of the draft. Chongqing University Central Hospital, Chongqing University Qianjiang Hospital, Yubei District hospital and Bishan hospital of Chongqing Medical University provided the database of all patient cases. The other authors collected data and wrote sections of the manuscript. All authors took part in the research, contributed to manuscript revision, and read and approved the submitted version.

Ethics approval statement

We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

Funding statement

This research was funded by the project “Based on artificial intelligence and multiple omics technology set up a system of auxiliary cardiovascular disease diagnosis and treatment” (2023CDJYGRH-ZD06); by the Emergency Medicine Key Laboratory of Chongqing Joint Fund for Talent Innovation and Development (2024RCCX10); and by the Brain-like Intelligence Research Key Laboratory of Chongqing Education Commission (BIR2019004).

Conflict of interests

The authors have no relevant conflicts of interest to disclose.

Patient consent statement

This was a retrospective study in which only de-identified patient data were collected, so the requirement for patient informed consent was waived.

Clinical trial registration

The trial number is RS202406.