Early prediction of level-of-care requirements in patients with COVID-19

  1. Boran Hao
  2. Shahabeddin Sotudian
  3. Taiyao Wang
  4. Tingting Xu
  5. Yang Hu
  6. Apostolos Gaitanidis
  7. Kerry Breen
  8. George C Velmahos
  9. Ioannis Ch Paschalidis (corresponding author)
  1. Center for Information and Systems Engineering, Boston University, United States
  2. Division of Trauma, Emergency Services, and Surgical Critical Care, Massachusetts General Hospital, Harvard Medical School, United States

Abstract

This study examined records of 2566 consecutive COVID-19 patients at five Massachusetts hospitals and sought to predict level-of-care requirements based on clinical and laboratory data. Several classification methods were applied and compared against standard pneumonia severity scores. The need for hospitalization, ICU care, and mechanical ventilation were predicted with a validation accuracy of 88%, 87%, and 86%, respectively. Pneumonia severity scores achieve respective accuracies of 73% and 74% for ICU care and ventilation. When predictions are limited to patients with more complex disease, the ICU and ventilation prediction models achieved accuracies of 83% and 82%, respectively. Vital signs, age, BMI, dyspnea, and comorbidities were the most important predictors of hospitalization. Opacities on chest imaging, age, admission vital signs and symptoms, male gender, admission laboratory results, and diabetes were the most important risk factors for ICU admission and mechanical ventilation. The factors identified collectively form a signature of the novel COVID-19 disease.

eLife digest

The new coronavirus (now named SARS-CoV-2) causing the disease pandemic in 2019 (COVID-19) has so far infected over 35 million people worldwide and killed more than 1 million. Most people with COVID-19 have no symptoms or only mild symptoms. But some become seriously ill and need hospitalization. The sickest are admitted to an Intensive Care Unit (ICU) and may need mechanical ventilation to help them breathe. Being able to predict which patients with COVID-19 will become severely ill could help hospitals around the world manage the huge influx of patients caused by the pandemic and save lives.

Now, Hao, Sotudian, Wang, Xu et al. show that computer models using artificial intelligence technology can help predict which COVID-19 patients will be hospitalized, admitted to the ICU, or need mechanical ventilation. Using data of 2,566 COVID-19 patients from five Massachusetts hospitals, Hao et al. created three separate models that can predict hospitalization, ICU admission, and the need for mechanical ventilation with more than 86% accuracy, based on patient characteristics, clinical symptoms, laboratory results and chest x-rays.

Hao et al. found that the patients’ vital signs, age, obesity, difficulty breathing, and underlying diseases like diabetes, were the strongest predictors of the need for hospitalization. Being male, having diabetes, cloudy chest x-rays, and certain laboratory results were the most important risk factors for intensive care treatment and mechanical ventilation. Laboratory results suggesting tissue damage, severe inflammation or oxygen deprivation in the body's tissues were important warning signs of severe disease.

The results provide a more detailed picture of the patients who are likely to suffer from severe forms of COVID-19. Using the predictive models may help physicians identify patients who appear okay but need closer monitoring and more aggressive treatment. The models may also help policy makers decide who needs workplace accommodations such as being allowed to work from home, which individuals may benefit from more frequent testing, and who should be prioritized for vaccination when a vaccine becomes available.

Introduction

As a result of the SARS-CoV-2 pandemic, many hospitals across the world have resorted to drastic measures: canceling elective procedures, switching to remote consultations, designating most beds to COVID-19, expanding Intensive Care Unit (ICU) capacity, and re-purposing doctors and nurses to support COVID-19 care. In the U.S., the CDC estimates more than 310,000 COVID-19 hospitalizations from March 1 to June 13, 2020 (CDC, 2020).

Much of the modeling work related to the pandemic has focused on spread dynamics (Kucharski et al., 2020). Others have described patients who were hospitalized (Richardson et al., 2020) (n = 5700) and (Buckner et al., 2020) (n = 105), became critically ill (Gong et al., 2020) (n = 372), or succumbed to the disease (n = 1625 (Onder et al., 2020), n = 270 [Wu et al., 2020]). In data from New York City, 14.2% required ICU treatment and 12.2% mechanical ventilation (Richardson et al., 2020). With such rates, the logistical and ethical implications of bed allocation and potential rationing of care delivery are immense (White and Lo, 2020). To date, while state- or country-level prognostication has developed to examine resource allocation at a mass scale, there is inadequate evidence based on a large cohort on accurate prediction of the disease progress at the individual patient level. A string of recent studies developed models to predict severe disease or mortality based on clinical and laboratory findings, for example (Yan et al., 2020) (n = 485), (Gong et al., 2020) (n = 372), (Bhargava et al., 2020) (n = 197), (Ji et al., 2020) (n = 208), and (Wang et al., 2020) (n = 296). In these studies, several variables such as Lactate Dehydrogenase (LDH) (Gong et al., 2020; Ji et al., 2020; Yan et al., 2020) and C-reactive protein (CRP) have been identified as important predictors. All of these studies considered relatively small cohorts and, with the exception of Bhargava et al., 2020, considered patients in China. Although it is believed that the virus remains the same around the globe, the physiologic response to the virus and the eventual course of disease depend on multiple other factors, many of them regional (e.g. population characteristics, hospital practices, prevalence of pre-existing conditions) and not applicable universally.
Triage of adult patients with COVID-19 remains challenging with most evidence coming from expert recommendations; evidence-based methods based on larger U.S.-based cohorts have not been reported (Sprung et al., 2020).

Leveraging data from five hospitals of the largest health care system in Massachusetts, we seek to develop personalized, interpretable predictive models of (i) hospitalization, (ii) ICU treatment, and (iii) mechanical ventilation, among SARS-CoV-2 positive patients. To develop these models, we built a pipeline that leverages state-of-the-art Natural Language Processing (NLP) tools to extract information from the clinical reports for each patient, employs statistical feature selection methods to retain the most predictive features for each model, and adapts a host of advanced machine learning-based classification methods to develop parsimonious (hence, easier to use and interpret) predictive models. We found that the more interpretable models can, for the most part, deliver similar predictive performance compared to more complex, ‘black-box’ models involving ensembles of many decision trees. Our results support our initial hypothesis that important clinical outcomes can be predicted with a high degree of accuracy upon the patient’s first presentation to the hospital using a relatively small number of features, which collectively compose a ‘signature’ of the novel COVID-19 disease.

Results

We extracted data for all patients (n = 2566) who had a positive RT-PCR SARS-CoV-2 test between March 4 and April 13, 2020 at five Massachusetts hospitals, included in the same health care system (Massachusetts General Hospital (MGH), Brigham and Women’s Hospital (BWH), Faulkner Hospital (FH), Newton-Wellesley Hospital (NWH), and North Shore Medical Center (NSM)). The study was approved by the pertinent Institutional Review Boards.

Demographics, pre-hospital medications, and comorbidities were extracted for each patient based on the electronic medical record. Patient symptoms, vital signs, radiologic findings, and laboratory results were recorded at their first hospital presentation (either clinic or emergency department) before testing positive for SARS-CoV-2. A total of 164 features were extracted for each patient. ICU admission and mechanical ventilation were determined for each patient. Complete blood count values were considered as absolute counts. Representative statistics comparing hospitalized, ICU admitted, and mechanically ventilated patients are provided in Table A1 (Appendix). Table A2 (Appendix) reports how patients were distributed among the five hospitals.

Among the 2566 patients with a positive test, 930 (36.2%) were hospitalized. Among the hospitalized, 273 (29.4% of the hospitalized) required ICU care of which 217 (79.5%) required mechanical ventilation. The mean age over all patients was 51.9 years (SD: 18.9 years) and 45.6% were male.
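The percentages in this paragraph can be rechecked directly from the reported counts. The short sketch below does only that arithmetic; the helper name `pct` is ours and not part of the study's code.

```python
# Counts reported in the text.
n_positive = 2566      # patients with a positive RT-PCR test
n_hospitalized = 930   # of the positive patients
n_icu = 273            # of the hospitalized patients
n_ventilated = 217     # of the ICU patients


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


hospitalized_pct = pct(n_hospitalized, n_positive)  # share of all positive patients
icu_pct = pct(n_icu, n_hospitalized)                # share of hospitalized patients
ventilated_pct = pct(n_ventilated, n_icu)           # share of ICU patients

print(hospitalized_pct, icu_pct, ventilated_pct)
```

Note that each percentage uses a different denominator: all positive patients, hospitalized patients, and ICU patients, respectively.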

Hospitalization

The mean age of hospitalized patients was 62.3 years (SD: 18 years) and 55.3% were male. We employed linear and non-linear classification methods for predicting hospitalizations. Non-linear methods included random forests (RF) (Breiman, 2001) and XGBoost (Chen and Guestrin, 2016). Linear methods included support vector machines (SVM) (Cortes and Vapnik, 1995) and Logistic Regression (LR); each linear method used either ℓ1- or ℓ2-norm regularization and we report the best-performing flavor of each model.

Results are reported in Table 1. We report the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the Weighted-F1 score, both computed out-of-sample (in a test set not used for training the model). As we detail under Methods, we used two validation strategies. The ‘Random’ strategy randomly split the patients into a training and a test set and was repeated five times; from these five splits we report the average and the standard deviation of the test performance. The ‘BWH’ strategy trained the models on MGH, FH, NWH, and NSM patients, and evaluated performance on BWH patients.
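The two validation strategies described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's code: the 20% test fraction, the random seed, the use of the sample standard deviation, and all function names are assumptions made for the example.

```python
import random
import statistics


def random_splits(patient_ids, n_repeats=5, test_fraction=0.2, seed=0):
    """Yield (train_ids, test_ids) for repeated random splits.

    test_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    ids = list(patient_ids)
    n_test = max(1, round(len(ids) * test_fraction))
    for _ in range(n_repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def hospital_holdout(hospital_by_patient, test_hospital="BWH"):
    """Train on every hospital except test_hospital; test on test_hospital."""
    train = [p for p, h in hospital_by_patient.items() if h != test_hospital]
    test = [p for p, h in hospital_by_patient.items() if h == test_hospital]
    return train, test


def summarize(scores):
    """Mean and (sample) standard deviation of a metric across splits."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy cohort: 50 hypothetical patient ids spread evenly over the five hospitals.
hospitals = ["MGH", "BWH", "FH", "NWH", "NSM"]
hospital_by_patient = {pid: hospitals[pid % 5] for pid in range(50)}

splits = list(random_splits(hospital_by_patient, n_repeats=5))
bwh_train, bwh_test = hospital_holdout(hospital_by_patient)
```

In the ‘Random’ strategy a metric is computed on each of the five test sets and then passed to something like `summarize`; in the ‘BWH’ strategy there is a single test set, which is why the tables report no standard deviation for it.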

Table 1
Hospitalization prediction model (test performance).

The values inside the parentheses refer to the standard deviation of the corresponding metric. Random refers to test set results from the five random training/test splits. BWH refers to training on four other hospitals and testing on data from BWH. SVM-L1 and LR-L1 refer to the ℓ1-norm regularized SVM and LR models. For the parsimonious model, we list the LR coefficients of each variable (Coef), the correlation of the variable with the outcome (Y-corr), the mean of the variable (Y1-mean) in the positive class (hospitalized for this table), and the mean of the variable (Y0-mean) in the negative class (non-hospitalized). Binary Coef denotes the coefficient of the variables in the binarized model. We report the corresponding odds ratio (OR) and the 95% confidence intervals (CI). Thresholds used for the binarized model are provided in Appendix 1—table 5.

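| Feature set | Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|---|
| All 106 features | LR-L2 | 87.0% (1.7%) | 85.9% | 81.6% (1.3%) | 84.2% |
| All 106 features | SVM-L1 | 87.0% (1.6%) | 85.8% | 81.5% (1.5%) | 83.9% |
| All 106 features | XGBoost | 87.8% (1.9%) | 87.7% | 80.9% (1.8%) | 83.3% |
| All 106 features | RF | 88.2% (1.6%) | 88.1% | 81.2% (1.1%) | 83.2% |
| 74 statistically selected features | LR-L2 | 87.1% (1.7%) | 86.0% | 82.0% (1.3%) | 83.9% |
| 74 statistically selected features | SVM-L1 | 87.1% (1.7%) | 85.8% | 82.0% (1.4%) | 84.0% |
| 74 statistically selected features | XGBoost | 87.9% (1.9%) | 87.6% | 81.2% (1.9%) | 84.2% |
| 74 statistically selected features | RF | 88.0% (1.7%) | 88.1% | 80.8% (1.7%) | 83.9% |
| Parsimonious model, 11 features | LR-L2 | 83.4% (1.7%) | 83.7% | 78.7% (0.9%) | 81.0% |
| Parsimonious model, 11 features | SVM-L1 | 83.4% (1.7%) | 83.8% | 78.1% (1.1%) | 79.9% |

Variables for the Parsimonious Model

| Variable | Coef | Y1 mean | Y0 mean | p-value | Y-corr | Coef binary | OR | OR 95% CI |
|---|---|---|---|---|---|---|---|---|
| SpO2 (%) | −11.90 | 95.44 | 97.11 | <0.001 | −0.29 | 1.74 | 5.67 | (3.97, 8.12) |
| Temperature | 10.36 | 37.21 | 37.06 | <0.001 | 0.08 | 0.86 | 2.36 | (1.76, 3.18) |
| Respiratory Rate | 7.20 | 22.82 | 20.83 | <0.001 | 0.18 | −0.13 | 0.88 | (0.69, 1.13) |
| Age | 5.14 | 62.31 | 46.02 | <0.001 | 0.41 | 0.88 | 2.4 | (1.86, 3.11) |
| Pulse | 4.60 | 90.09 | 90.4 | <0.001 | −0.01 | 0.7 | 2.01 | (1.49, 2.71) |
| Diastolic BP | −3.56 | 73.07 | 77.21 | <0.001 | −0.23 | 1.51 | 4.51 | (2.88, 7.06) |
| Adrenal Insufficiency | 3.09 | 0.013 | 0.001 | <0.001 | 0.08 | 2.58 | 13.14 | (1.57, 110.37) |
| BMI | 2.30 | 31.34 | 31.64 | <0.001 | −0.04 | −0.09 | 0.91 | (0.71, 1.17) |
| Transplantation | 1.90 | 0.023 | 0.002 | <0.001 | 0.1 | 1.43 | 4.19 | (1.04, 16.87) |
| Dyspnea | 1.85 | 0.17 | 0.02 | <0.001 | 0.26 | 2 | 7.41 | (4.85, 11.32) |
| CKD | 1.55 | 0.14 | 0.02 | <0.001 | 0.25 | 0.81 | 2.25 | (1.35, 3.74) |
| Intercept | −2.51 | | | | | | | |

SpO2: oxygen saturation; BP: Blood pressure; BMI: Body Mass Index; CKD: Chronic Kidney Disease.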

The hospitalization models used symptoms, pre-existing medications, comorbidities, and patient demographics. Laboratory results and radiologic findings were not considered since these were not available for most non-hospitalized patients. Full models used all (106) variables retained after several pre-processing steps described in Materials and methods. Applying the statistical variable selection procedure described in the Appendix (specifically, eliminating variables with a p-value exceeding 0.05) yields a model with 74 variables. To provide a more parsimonious, highly interpretable, and easier to implement model, we used recursive feature elimination (see Appendix) to select a model with only 11 variables. The best model using the random validation approach has an AUC of 88%, while the best parsimonious (linear) model has an AUC of 83% but is easier to interpret and implement. Validation on the BWH patients yields an AUC of 84% for the parsimonious model.
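The two-stage selection described above (a p-value filter followed by recursive feature elimination) can be sketched generically. The details of the study's procedure are in its Appendix, so everything below is an assumption for illustration: the function names, the toy p-values and weights, the made-up `eye_color` feature, and the choice to drop one feature per round. In practice `fit_importance` would refit a logistic regression on the remaining features and return the absolute value of each coefficient.

```python
def select_by_p_value(p_values, alpha=0.05):
    """Keep features whose p-value does not exceed alpha."""
    return [name for name, p in p_values.items() if p <= alpha]


def recursive_feature_elimination(features, fit_importance, n_keep):
    """Refit repeatedly, dropping the least important feature each round.

    fit_importance(features) must return {feature: importance} for the
    features it was given; a real version would retrain a model here.
    """
    remaining = list(features)
    while len(remaining) > n_keep:
        importance = fit_importance(remaining)
        weakest = min(remaining, key=lambda f: importance[f])
        remaining.remove(weakest)
    return remaining


# Hypothetical inputs, for illustration only.
toy_p_values = {"age": 0.0004, "spo2": 0.0001, "dyspnea": 0.002, "eye_color": 0.61}
toy_weights = {"age": 5.1, "spo2": 11.9, "dyspnea": 1.9}

kept = select_by_p_value(toy_p_values)
parsimonious = recursive_feature_elimination(
    kept,
    fit_importance=lambda fs: {f: toy_weights[f] for f in fs},  # stand-in for a refit
    n_keep=2,
)
print(kept, parsimonious)
```

Because importance is recomputed after every removal, the ranking can change between rounds, which is what distinguishes this approach from simply keeping the top-ranked features of a single fit.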

Table 1 also reports the 11 variables in the parsimonious LR model, including their LR coefficients, and a binarized version of this model as described in Materials and methods. The most important variables associated with hospitalization were: oxygen saturation, temperature, respiratory rate, age, pulse, blood pressure, a comorbidity of adrenal insufficiency, BMI, prior transplantation, dyspnea, and kidney disease.
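For a logistic regression with a binary predictor, the odds ratio is the exponential of the coefficient. The sketch below applies that identity to the ‘Coef binary’ column of Table 1 and compares the result with the reported OR column. The values are copied from the table; the small differences are presumably due to the coefficients being rounded to two decimals.

```python
import math

# (variable, binarized coefficient, reported OR), copied from Table 1.
table1_binarized = [
    ("SpO2", 1.74, 5.67),
    ("Temperature", 0.86, 2.36),
    ("Respiratory Rate", -0.13, 0.88),
    ("Age", 0.88, 2.4),
    ("Pulse", 0.7, 2.01),
    ("Diastolic BP", 1.51, 4.51),
    ("Adrenal Insufficiency", 2.58, 13.14),
    ("BMI", -0.09, 0.91),
    ("Transplantation", 1.43, 4.19),
    ("Dyspnea", 2.0, 7.41),
    ("CKD", 0.81, 2.25),
]


def odds_ratio(coef):
    """Odds ratio implied by a logistic-regression coefficient."""
    return math.exp(coef)


for name, coef, reported in table1_binarized:
    print(f"{name:22s} exp({coef:+.2f}) = {odds_ratio(coef):6.2f}  (reported {reported})")

# Largest relative gap between exp(coef) and the reported OR.
max_rel_dev = max(abs(odds_ratio(c) / r - 1) for _, c, r in table1_binarized)
```

A negative binarized coefficient (respiratory rate, BMI) corresponds to an OR below 1, and in both cases the reported confidence interval includes 1.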

Additionally, we assessed the role of pre-existing ACE inhibitor (ACEI) and angiotensin receptor blocker (ARB) medications by adding these variables into the parsimonious binarized model, while controlling for additional relevant variables (hypertension, diabetes, and arrhythmia comorbidities and other hypertension medications). We found that while ARBs are not a factor, ACEIs are associated with odds of hospitalization that are about 3/4 as high, on average (OR 0.73), controlling for other important factors, such as age, hypertension, and related comorbidities associated with the use of these medications.

ICU admission

The mean age of ICU admitted patients was 63.3 years (SD: 15.1 years) and 63% were male. The ICU and ventilation prediction models used the features considered for the hospitalization, as well as laboratory results and radiologic findings. For these models, we excluded patients who required immediate ICU admission or ventilation (defined as within 4 hr from initial presentation). This was implemented in order to focus on patients where triaging is challenging and risk prediction would be beneficial. There were 2513 and 2525 patients remaining for the ICU and the mechanical ventilation prediction models, respectively.
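The exclusion rule can be sketched as a simple filter. This is an illustration only: the record layout, the key name `hours_to_icu`, and the treatment of an outcome at exactly 4 hr as ‘immediate’ are assumptions. The final two lines compute the number of excluded patients implied by the cohort sizes, assuming both models start from all 2566 positive patients.

```python
def exclude_immediate(patients, hours_key, window_hr=4.0):
    """Drop patients whose outcome occurred within window_hr of presentation.

    Patients who never had the outcome (value None) are kept.
    """
    kept = []
    for p in patients:
        hours = p.get(hours_key)
        if hours is None or hours > window_hr:
            kept.append(p)
    return kept


# Hypothetical records, for illustration only.
toy_patients = [
    {"id": "a", "hours_to_icu": None},  # never admitted to the ICU: kept
    {"id": "b", "hours_to_icu": 1.5},   # immediate ICU admission: excluded
    {"id": "c", "hours_to_icu": 30.0},  # later deterioration: kept
]
icu_cohort = exclude_immediate(toy_patients, "hours_to_icu")
kept_ids = [p["id"] for p in icu_cohort]

# Exclusions implied by the reported cohort sizes (assumes a base of 2566).
excluded_icu = 2566 - 2513
excluded_vent = 2566 - 2525
print(kept_ids, excluded_icu, excluded_vent)
```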

For the model including 2513 patients (Table 2), we first developed a model using all 130 variables retained after pre-processing, then employed statistical variable selection to retain 56 of the variables, and then applied recursive feature elimination with LR to select a parsimonious model which uses only 10 variables. The following variables were included: opacity observed in a chest scan, respiratory rate, age, fever, male gender, albumin, anion gap, oxygen saturation, LDH, and calcium. In addition, we generated a binarized version of the parsimonious model. The parsimonious model for all 2513 patients has an AUC of 86%, almost as high as the model with all 130 features.

Table 2
ICU prediction model (test performance).

Abbreviations are as in Table 1. Thresholds for the binarized model, PSI and CURB-65 scores are in the Appendix.

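ICU prediction results with 2513 patients

| Feature set | Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|---|
| All 130 features | XGBoost | 86.0% (2.8%) | 83.1% | 90.0% (1.7%) | 91.7% |
| All 130 features | SVM-L1 | 85.9% (2.5%) | 80.2% | 89.9% (1.0%) | 89.2% |
| All 130 features | LR-L1 | 84.6% (2.8%) | 76.8% | 89.7% (1.0%) | 89.9% |
| All 130 features | RF | 86.9% (2.4%) | 83.7% | 90.4% (1.1%) | 91.1% |
| 56 statistically selected features | XGBoost | 86.8% (3.1%) | 82.8% | 90.4% (1.4%) | 91.3% |
| 56 statistically selected features | SVM-L1 | 86.2% (2.6%) | 82.6% | 90.6% (1.2%) | 90.8% |
| 56 statistically selected features | LR-L1 | 85.8% (2.9%) | 81.8% | 90.2% (1.3%) | 91.3% |
| 56 statistically selected features | RF | 86.7% (2.0%) | 83.2% | 90.5% (1.7%) | 91.5% |
| Parsimonious model, 10 features | LR-L1 | 85.8% (2.6%) | 83.9% | 90.0% (1.4%) | 89.1% |
| Parsimonious model, 10 features | LR-L1 (binarized model) | 84.2% (2.2%) | 82.5% | 89.8% (1.1%) | 88.1% |
| PSI or CURB-65 score | PSI score | 72.9% (4.9%) | 78.8% | 86.8% (0.7%) | 88.2% |
| PSI or CURB-65 score | CURB-65 score | 67.0% (5.0%) | 75.4% | 87.0% (0.5%) | 88.1% |

Variables for the parsimonious model

| Variable | Coef | Y1 mean | Y0 mean | p-value | Y-corr | Coef binary | OR | OR 97.5% CI |
|---|---|---|---|---|---|---|---|---|
| Radiology Opacities | 0.54 | 0.76 | 0.27 | <0.001 | 0.30 | 1.41 | 4.08 | (2.83, 5.89) |
| Respiratory Rate | 0.46 | 24.61 | 21.37 | <0.001 | 0.16 | 0.50 | 1.66 | (1.14, 2.41) |
| Age | 0.45 | 62.61 | 50.58 | <0.001 | 0.18 | 0.56 | 1.76 | (1.27, 2.43) |
| Fever | 0.40 | 0.64 | 0.33 | <0.001 | 0.18 | 0.61 | 1.83 | (1.32, 2.55) |
| Male | 0.35 | 0.64 | 0.44 | <0.001 | 0.12 | 0.50 | 1.65 | (1.21, 2.26) |
| Albumin | −0.34 | 3.68 | 3.84 | <0.001 | −0.16 | 0.58 | 1.78 | (1.10, 2.90) |
| Anion Gap | 0.33 | 16.40 | 15.35 | <0.001 | 0.13 | −0.05 | 0.95 | (0.46, 1.98) |
| SpO2 (%) | −0.22 | 94.72 | 96.72 | <0.001 | −0.24 | 0.83 | 2.29 | (1.63, 3.21) |
| LDH | 0.22 | 400.40 | 327.48 | <0.001 | 0.15 | 0.96 | 2.62 | (1.74, 3.94) |
| Calcium | −0.21 | 8.84 | 9.01 | <0.001 | −0.10 | 0.55 | 1.73 | (1.21, 2.48) |
| Intercept | −0.93 | | | | | | | |

SpO2: oxygen saturation; LDH: Lactate dehydrogenase.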

For comparison purposes against well-established scoring systems, we implemented two commonly used pneumonia severity scores, CURB-65 (Lim et al., 2003) and the Pneumonia Severity Index (PSI) (Fine et al., 1997). Predictions based on the PSI and CURB-65 scores have AUCs of 73% and 67%, respectively.
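To illustrate how a severity score translates into an AUC, the sketch below computes the AUC as the probability that a randomly chosen positive patient has a higher score than a randomly chosen negative one, counting ties as one half. It assumes the raw score is used directly to rank patients, which may differ from how the scores were applied in the study, and the scores and labels are made up for the example.

```python
def auc_from_scores(scores, labels):
    """AUC as P(positive score > negative score); ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical integer severity scores and ICU outcomes (1 = admitted).
toy_scores = [0, 1, 1, 2, 3, 4]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_auc = auc_from_scores(toy_scores, toy_labels)
print(round(toy_auc, 3))
```

Coarse integer scores such as CURB-65 produce many ties, which this definition handles explicitly. Any strictly increasing transformation of the score leaves the AUC unchanged.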

We also developed a model for a more restrictive set of patients. Specifically, the number of missing lab values for some patients is substantial. Given the importance of LDH and CRP, as revealed by our models, the more restricted patient set contains 669 patients with non-missing LDH and CRP values. After removing patients who required intubation or ICU admission within 4 hr of hospital presentation, we included 628 patients and 635 patients for the restricted ICU admission and ventilation models, respectively.

The best restricted model for the 628 patients (Table 3) is the nonlinear XGBoost model using 29 statistically selected features with an AUC of 83%, with a linear parsimonious LR model close behind (AUC 80%). An RF model using all variables yields an AUC of 77% when tested on BWH data. PSI- and CURB-65 models have AUCs below 59%.

Table 3
Restricted ICU prediction model (test performance).

Abbreviations are as in Table 1. Thresholds for the binarized model, PSI and CURB-65 scores are in the Appendix.

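ICU prediction results with 628 patients

| Feature set | Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|---|
| All 130 features | XGBoost | 82.5% (1.9%) | 67.3% | 81.4% (0.7%) | 72.6% |
| All 130 features | SVM-L1 | 77.8% (3.8%) | 72.8% | 79.7% (1.2%) | 73.6% |
| All 130 features | LR-L1 | 75.9% (3.6%) | 69.7% | 79.2% (2.5%) | 73.7% |
| All 130 features | RF | 80.9% (2.7%) | 76.9% | 78.8% (1.9%) | 73.6% |
| 29 statistically selected features | XGBoost | 82.7% (2.7%) | 76.2% | 80.6% (2.1%) | 72.6% |
| 29 statistically selected features | SVM-L1 | 77.9% (3.7%) | 73.1% | 78.5% (1.4%) | 73.6% |
| 29 statistically selected features | LR-L1 | 78.4% (4.1%) | 71.5% | 79.5% (2.6%) | 74.4% |
| 29 statistically selected features | RF | 82.1% (2.8%) | 74.1% | 79.0% (2.4%) | 75.4% |
| Parsimonious model, 8 features | LR-L1 | 80.1% (2.9%) | 74.2% | 80.9% (2.1%) | 77.2% |
| Parsimonious model, 8 features | LR-L1 (binarized model) | 72.5% (5.4%) | 69.9% | 73.4% (2.8%) | 69.7% |
| PSI or CURB-65 score | PSI score | 58.8% (7.4%) | 68.3% | 66.7% (2.2%) | 65.3% |
| PSI or CURB-65 score | CURB-65 score | 56.8% (4.5%) | 76.9% | 66.2% (1.5%) | 63.8% |

Variables for the parsimonious model

| Variable | Coef | Y1 mean | Y0 mean | p-value | Y-corr | Coef binary | OR | OR 97.5% CI |
|---|---|---|---|---|---|---|---|---|
| LDH | 0.53 | 519.88 | 304.40 | <0.001 | 0.15 | 1.59 | 4.88 | (2.65, 8.99) |
| CRP (mg/L) | 0.47 | 127.17 | 67.43 | <0.001 | 0.35 | 0.76 | 2.13 | (0.70, 6.47) |
| Calcium | −0.35 | 8.83 | 9.01 | <0.001 | −0.13 | 0.71 | 2.03 | (1.25, 3.31) |
| IDDM | 0.30 | 0.25 | 0.12 | 0.003 | 0.15 | 1.00 | 2.73 | (1.62, 4.60) |
| SpO2 (%) | −0.29 | 94.13 | 95.59 | 0.003 | −0.22 | 0.34 | 1.41 | (0.92, 2.16) |
| Radiology Opacities | 0.25 | 0.88 | 0.71 | <0.001 | 0.16 | 0.62 | 1.86 | (1.05, 3.29) |
| Anion Gap | 0.20 | 16.66 | 15.28 | <0.001 | 0.20 | 0.34 | 1.40 | (0.48, 4.12) |
| Sodium | −0.16 | 136.13 | 137.53 | <0.001 | −0.14 | 0.47 | 1.60 | (1.05, 2.43) |
| Intercept | −0.34 | | | | | | | |

LDH: Lactate dehydrogenase; CRP: C-reactive protein; IDDM: Insulin-dependent diabetes mellitus; SpO2: oxygen saturation.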

Mechanical ventilation

The mean age of patients requiring mechanical ventilation was 63.3 years (SD: 14.7 years) and 63.6% were male. Again, we excluded patients who were intubated within 4 hr of their hospital admission.

For the model including 2525 patients (Table 4), we used statistical feature selection to select 55 variables, and recursive feature elimination with LR to select a parsimonious model with only eight variables. The following variables were included: lung opacities, albumin, fever, respiratory rate, glucose, male gender, LDH, and anion gap. In addition, we generated a binarized version of the parsimonious model. The best model for all 2525 patients was a nonlinear RF model using the 55 statistically selected variables and yielding an AUC of 86%. The best linear model was the parsimonious LR model with an AUC of 85%. PSI- and CURB-65 models yield AUCs of 74% and 67%, respectively.

Table 4
Ventilation prediction model (test performance).

Abbreviations are as in Table 1. Thresholds for the binarized model, PSI and CURB-65 scores are in the Appendix.

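Ventilation prediction results with 2525 patients

| Feature set | Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|---|
| All 130 features | XGBoost | 85.8% (4.0%) | 83.8% | 91.0% (0.4%) | 91.6% |
| All 130 features | SVM-L1 | 82.6% (4.9%) | 83.8% | 90.9% (0.8%) | 91.6% |
| All 130 features | LR-L1 | 80.7% (5.4%) | 81.7% | 90.4% (1.2%) | 91.4% |
| All 130 features | RF | 85.7% (3.9%) | 83.7% | 91.2% (0.9%) | 91.8% |
| 55 statistically selected features | XGBoost | 85.7% (3.3%) | 86.3% | 91.1% (0.6%) | 91.6% |
| 55 statistically selected features | SVM-L1 | 83.9% (3.7%) | 84.8% | 90.9% (1.1%) | 91.7% |
| 55 statistically selected features | LR-L1 | 83.3% (4.0%) | 83.9% | 90.8% (1.3%) | 91.4% |
| 55 statistically selected features | RF | 86.4% (3.4%) | 86.7% | 91.4% (1.1%) | 91.3% |
| Parsimonious model, 8 features | LR-L1 | 85.2% (2.3%) | 87.0% | 90.3% (0.3%) | 90.7% |
| Parsimonious model, 8 features | LR-L1 (binarized model) | 81.3% (3.1%) | 82.6% | 90.0% (0.6%) | 90.2% |
| PSI or CURB-65 score | PSI score | 73.6% (4.1%) | 80.7% | 89.4% (0.4%) | 90.3% |
| PSI or CURB-65 score | CURB-65 score | 66.8% (3.1%) | 75.9% | 89.7% (0.1%) | 90.0% |

Variables for the Parsimonious Model

| Variable | Coef | Y1 mean | Y0 mean | p-value | Y-corr | Coef binary | OR | OR 97.5% CI |
|---|---|---|---|---|---|---|---|---|
| Radiology opacities | 0.86 | 0.77 | 0.28 | <0.001 | 0.27 | 1.58 | 4.86 | (3.25, 7.25) |
| Albumin | −0.45 | 3.65 | 3.83 | <0.001 | −0.16 | 1.07 | 2.91 | (1.80, 4.72) |
| Fever | 0.43 | 0.66 | 0.33 | <0.001 | 0.17 | 0.72 | 2.05 | (1.42, 2.95) |
| Respiratory rate | 0.42 | 24.70 | 21.44 | <0.001 | 0.15 | 0.50 | 1.64 | (1.09, 2.47) |
| Glucose | 0.38 | 170.17 | 138.32 | <0.001 | 0.15 | 0.97 | 2.63 | (1.71, 4.06) |
| Male | 0.34 | 0.64 | 0.44 | <0.001 | 0.10 | 0.43 | 1.54 | (1.09, 2.18) |
| LDH | 0.33 | 408.56 | 328.78 | <0.001 | 0.14 | 0.91 | 2.48 | (1.58, 3.89) |
| Anion gap | 0.31 | 16.50 | 15.37 | <0.001 | 0.13 | 0.27 | 1.31 | (0.53, 3.25) |
| Intercept | −1.06 | | | | | | | |

LDH: Lactate dehydrogenase.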

The best model for the restricted case of 635 patients (Table 5) was the linear parsimonious LR model (with just five variables) achieving an AUC of 82%. PSI- and CURB-65 models do not exceed an AUC of 58%.

Table 5
Restricted ventilation prediction model (test performance).

Abbreviations are as in Table 1. Thresholds for the binarized model, PSI and CURB-65 scores are in the Appendix.

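Ventilation prediction results with 635 patients

| Feature set | Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|---|
| All 130 features | XGBoost | 80.6% (1.9%) | 74.7% | 79.4% (2.6%) | 75.7% |
| All 130 features | SVM-L1 | 79.4% (5.2%) | 71.3% | 80.8% (2.0%) | 75.7% |
| All 130 features | LR-L1 | 76.9% (3.9%) | 68.2% | 78.6% (3.2%) | 73.4% |
| All 130 features | RF | 81.0% (3.1%) | 75.8% | 79.8% (4.2%) | 72.7% |
| 29 statistically selected features | XGBoost | 81.6% (3.2%) | 76.9% | 79.0% (2.9%) | 71.7% |
| 29 statistically selected features | SVM-L1 | 79.1% (4.6%) | 69.4% | 80.6% (2.5%) | 75.7% |
| 29 statistically selected features | LR-L1 | 80.9% (3.6%) | 70.9% | 80.4% (2.2%) | 75.7% |
| 29 statistically selected features | RF | 81.3% (2.6%) | 75.4% | 79.2% (1.7%) | 69.6% |
| Parsimonious model, 5 features | LR-L1 | 82.4% (3.7%) | 75.2% | 81.8% (1.7%) | 71.7% |
| Parsimonious model, 5 features | LR-L1 (binarized model) | 71.4% (6.2%) | 65.5% | 76.6% (3.5%) | 68.3% |
| PSI or CURB-65 score | PSI score | 57.6% (4.5%) | 67.4% | 73.2% (1.3%) | 71.2% |
| PSI or CURB-65 score | CURB-65 score | 56.9% (7.1%) | 74.0% | 72.4% (0.2%) | 68.3% |

Variables for the parsimonious model

| Variable | Coef | Y1 mean | Y0 mean | p-value | Y-corr | Coef binary | OR | OR 97.5% CI |
|---|---|---|---|---|---|---|---|---|
| CRP (mg/L) | 0.60 | 134.52 | 69.62 | <0.001 | 0.35 | 0.42 | 1.53 | (0.51, 4.59) |
| LDH | 0.55 | 550.41 | 311.01 | <0.001 | 0.16 | 1.87 | 6.47 | (3.19, 13.10) |
| Calcium | −0.39 | 8.82 | 9.00 | <0.001 | −0.13 | 0.58 | 1.79 | (1.07, 2.98) |
| IDDM | 0.36 | 0.26 | 0.12 | 0.002 | 0.15 | 1.18 | 3.26 | (1.90, 5.58) |
| Anion Gap | 0.29 | 16.81 | 15.32 | <0.001 | 0.19 | 18.66 | 1.27E+08 | (0.00, inf) |
| Intercept | −0.39 | | | | | | | |

CRP: C-reactive protein; LDH: Lactate dehydrogenase; IDDM: Insulin-dependent diabetes mellitus.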

Time period between ICU/ventilation model prediction and corresponding outcomes

Table 6 reports the mean and the median time interval (in hours) between hospital admission time and ICU/ventilation outcomes. Specifically, we report statistics for ICU admission or intubation outcomes from the correct ICU/intubation predictions made by our models trained on four hospitals (MGH, NWH, NSM, FH) and applied to BWH patients (both the models making predictions for all patients and the restricted models). As we have noted earlier, our models use the lab results closest to admission (either on admission date or the following day). We also report the time interval between the last lab result used by the model and the corresponding ICU/intubation outcome.

Table 6
Mean and median hours between reference date/lab results to outcomes in full/restricted ICU and ventilation model prediction.
| Model | From reference date (mean) | From reference date (median) | From lab results (mean) | From lab results (median) |
|---|---|---|---|---|
| Restricted ICU | 38.13 | 28.08 | 22.55 | 9.90 |
| Restricted intubation | 35.36 | 26.40 | 22.37 | 10.39 |
| Full ICU | 22.86 | 17.28 | 15.86 | 12.99 |
| Full intubation | 25.62 | 22.20 | 10.23 | 8.97 |

Discussion

We developed three models to predict need for hospitalization, ICU admission, and mechanical ventilation in patients with COVID-19. The prediction models are not meant to replace clinicians’ judgment for determining level of care. Instead, they are designed to assist clinicians in identifying patients at risk of future decompensation. Patient vital signs were the most important predictors of hospitalization. This is expected as vital signs reflect underlying disease severity, the need for cardiorespiratory resuscitation, and the risk of future decompensation without adequate medical support. Older age and BMI were also important predictors for hospitalization. Age has been recognized as an important factor associated with severe COVID-19 in previous series (Grasselli et al., 2020; Guan et al., 2020; Richardson et al., 2020). However, it is not known whether age itself or the presence of comorbidities place patients at risk for severe disease. Our results demonstrate that age is a stronger predictor of severe COVID-19 than a host of underlying comorbidities.

In terms of patient comorbidities, adrenal insufficiency, prior transplantation, and chronic kidney disease were strongly associated with need for hospitalization. Diabetes mellitus was associated with a need for ICU admission and mechanical ventilation, which might be due to its detrimental effects on immune function.

For the ICU and ventilation prediction models screening all at-risk (COVID-19-positive) patients, opacities observed in a chest scan, age, and male gender emerge as important variables. Males have been found to have worse in-hospital outcomes in other studies as well (Palaiodimos et al., 2020).

We also identified several routine laboratory values that are predictive of ICU admission and mechanical ventilation. Elevated serum LDH, CRP, anion gap, and glucose, as well as decreased serum calcium, sodium, and albumin were strong predictors of ICU admission and mechanical ventilation. LDH is an indicator of tissue damage and has been found to be a marker of severity in P. jirovecii pneumonia (Zaman and White, 1988). Along with CRP, it was among the two most important predictors of ICU admission and ventilation in the parsimonious model among patients who had LDH and CRP measurements on admission. This finding is consistent with previous reports identifying LDH as an important prognostic factor (Gong et al., 2020; Ji et al., 2020; Mo et al., 2020; Yan et al., 2020). In addition, lower serum calcium is associated with cell lysis and tissue destruction, as it is often seen as part of the tumor lysis syndrome. Elevated serum anion gap is a marker of metabolic acidosis and ischemia, suggesting that tissue hypoxia and hypoperfusion may be components of severe disease.

For all three prognostic models we developed, predicting hospitalization, ICU care, and mechanical ventilation, the AUC ranges within 86–88%, which indicates strong predictive power. Interestingly, we can achieve an AUC within 85–86% for ICU and ventilation prediction with a parsimonious linear model utilizing no more than 10 variables. In all cases, we can also develop a parsimonious model with binarized variables using medically suggested normal and abnormal variable thresholds. These binarized models perform similarly to their continuous counterparts. The ICU and ventilation models using all patients are very accurate, but, arguably, make a number of ‘easier’ decisions since more than 60% of the patients are never hospitalized. Many of these patients are younger, healthy, and likely present with mild-to-moderate symptoms. To test the robustness of the models to patients with potentially more ‘complex’ disease, we developed ICU and ventilation models on a restricted set of patients. This is the subset of patients who are hospitalized and for whom the most crucial labs are available (specifically CRP and LDH, which emerged as important from our models). The best AUC for these models drops, but not below 82%, which indicates robustness of the model even when dealing with arguably harder to assess cases. LDH, CRP, calcium, lung opacity, anion gap, SpO2, sodium, and a comorbidity of insulin-controlled diabetes appear as the most significant for these patients. Interestingly, the corresponding binarized models have about 10% lower AUC; apparently, for the more severely ill, clinical variables deviate substantially from normal and knowing the exact values is crucial.

The models have been validated with two different approaches, using random splits of the data into training and testing, as well as training in some hospitals and testing at a different hospital. Performance metrics are relatively consistent with these two approaches. We also compared the models against standard pneumonia severity scores, PSI and CURB-65, establishing that our models are significantly stronger, which highlights the different clinical profile of COVID-19.

We also examined how far in advance of the ICU or ventilation outcomes our models are able to make a prediction. Of course, this is not entirely in our control; it depends on the state in which patients are admitted and how soon their condition deteriorates to require ICU admission and/or ventilation. Table 6 reports the corresponding statistics. For example, the restricted ICU and ventilation models are making a correct prediction upon admission (using the lab results closest to that time) for outcomes that on average occur 38 and 35 hr later, respectively.

To further test the accuracy of the restricted ICU and ventilation models well in advance of the corresponding event, we considered an extended BWH test set (adding 11 more patients) and computed the accuracy of the models when the test set was restricted to patients whose outcome (ICU admission or ventilation) was more than x hours after the admission lab results based on which the prediction was made, with x being 6 hr, or 12 hr, or 18 hr, or 24 hr, or even 48 hr. The ICU model reaches an AUC of 87% and a weighted F1-score of 86% at x = 18 hr. The ventilation model reaches an AUC of 64% and an F1-score of 72% at x = 48 hr. These results demonstrate that the predictive models can indeed make predictions well into the future, when physicians would be less certain about the course of the disease and when there is potentially enough time to intervene and improve outcomes.

A manual review of the predictions by the models indicates that they performed well at predicting future ICU admissions for patients who presented with mild disease several days before ICU admission was necessary. Such patients were hemodynamically stable and had minimal oxygen requirements on the floor, before clinical deterioration necessitated ICU admission. We identified several such patients. A typical case is that of a 51-year-old male with a history of hypertension, obesity, and insulin-dependent type 2 diabetes mellitus, who presented with a 3-day history of dyspnea, cough and myalgias. In the emergency department, he was hemodynamically stable, saturating at 96–97% on 2 L of nasal cannula. The patient was admitted to the floor and did well for 3 days, saturating at 93–96% on room air. On the fourth day of hospitalization, he had increasing oxygen requirements and the decision was made to transfer him to the ICU. He was intubated and ventilated for 30 days. Our prediction models accurately predicted at the time of his presentation that he would eventually require ICU admission and mechanical ventilation. This prediction was based on such variables as an elevated LDH (241 U/L) and the presence of insulin-dependent diabetes mellitus. Another such case is that of a 59-year-old male without a significant prior medical history who presented with 2 days of dyspnea, nausea, and diarrhea. At the emergency department, he was tachycardic at 110 beats per minute and saturating at 96% on room air, and the patient was admitted. For 2 days, the patient was hemodynamically stable, saturating at 94–97% on room air. On the third day of hospitalization, he had increasing oxygen requirements, eventually requiring transfer to the ICU. He was intubated and ventilated for the next 14 days. Our prediction model predicted the patient’s decompensation at his presentation, due to elevations in LDH (348 U/L) and CRP (102.3 mg/L).

We also considered the role of ACEIs and ARBs and their potential association with the outcomes. It has been speculated that ACEIs may worsen COVID-19 outcomes because they upregulate the expression of ACE2, which the virus targets for cell entry. No such evidence has been reported in earlier studies (Kuster et al., 2020; Patel and Verma, 2020). In fact, a smaller study (Zhang et al., 2020; n = 1128 vs. 2566 in our case) reported a beneficial effect, and Rossi et al., 2020 warn of potential harmful effects of discontinuing ACEIs or ARBs due to COVID-19. Our hospitalization model suggests that ACEIs do not increase hospitalization risk and may slightly reduce it (OR 95% CI is (0.52,1.04) with a mean of 0.73). In the ICU and ventilation models, the evidence for these two medications is statistically too weak to establish any meaningful association.

The models we derived can be used for a variety of purposes: (i) guiding patient triage to appropriate inpatient units, (ii) guiding staffing and resource planning logistics, and (iii) understanding patient risk profiles to inform future policy decisions, such as targeted risk-based stay-at-home restrictions, testing, and vaccination prioritization guidelines once a vaccine becomes available.

Calculators implementing the parsimonious models corresponding to each of the Tables 1, 2, 3, 4, 5 have been made available online (Hao et al., 2020).

Materials and methods

Data extraction

Natural Language Processing (NLP) was used to extract patient comorbidities (see Appendix for details), pre-existing medications, admission vital signs, hospitalization course, ICU admission, and mechanical intubation.

Pre-processing

The categorical features were converted to numerical by ‘one-hot’ encoding. Each categorical feature, such as gender and race, was encoded as an indicator variable for each category. Features were standardized by subtracting the mean and dividing by the standard deviation.
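
The two steps above can be illustrated with a minimal Python sketch. The helper names `one_hot` and `standardize` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode a categorical column as one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {c: [1.0 if v == c else 0.0 for v in values] for c in categories}


def standardize(column):
    """Subtract the mean and divide by the standard deviation.

    Assumption: population standard deviation; constant columns map to zeros.
    """
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Illustrative values only, not patient data.
gender_indicators = one_hot(["M", "F", "M"])
age_standardized = standardize([40.0, 50.0, 60.0])
```

After this transformation every feature has zero mean and unit variance, which is what makes the LR coefficients comparable across variables.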

Several pre-processing steps, including variable imputation, outlier elimination, and removal of highly correlated variables were undertaken (see Appendix). After completing these procedures, 106 variables for each patient remained to be used by the hospitalization model. For the ICU and ventilation prediction models, we added laboratory results and radiologic findings. We removed variables with more than 90% missing values out of the roughly 2500 patients retained for these models; the remaining missing values were imputed as described above. These pre-processing steps retained 130 variables for the ICU and ventilation models.

Classification methods

We employed nonlinear ensemble methods including Random forests (RF) (Breiman, 2001) and XGBoost (Chen and Guestrin, 2016). We also employed ‘custom’ linear methods which yield interpretable models; specifically, support vector machines (SVM) (Cortes and Vapnik, 1995) and Logistic Regression (LR). In both cases, the variants we computed were robust to noise and the presence of outliers (Chen and Paschalidis, 2018), using proper regularization. LR, in addition to a prediction, provides the likelihood associated with the predicted outcome, which can be used as a confidence measure in decision making. Further details on these methods are in the Appendix.

For each outcome, we used the statistical feature selection and recursive feature elimination procedures described in the Appendix to develop an LR parsimonious model. The LR coefficients are comparable since the variables are standardized. Hence, a larger absolute coefficient indicates that the corresponding variable is a more significant predictor. Positive (negative) coefficients imply positive (negative) correlation with the outcome. We also developed a version of this model by converting all continuous variables into binary variables, using medically motivated thresholds (see Appendix). We report the coefficients of the ‘binarized’ model and the implied odds ratio (OR), representing how the odds of the outcome are scaled by having a specific variable being abnormal vs. normal, while controlling for all other variables in the model.
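
For a binary (0/1) variable in a logistic regression model, the implied odds ratio is the exponential of its coefficient. The short sketch below shows this relationship in both directions; the function names are hypothetical, and the coefficient derived from an OR of 0.73 is only an arithmetic illustration, not a value reported in the tables.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by the LR coefficient of a 0/1 variable."""
    return math.exp(coefficient)


def coefficient_from_odds_ratio(odds_ratio_value):
    """Inverse mapping: the LR coefficient that yields a given odds ratio."""
    return math.log(odds_ratio_value)


# Illustration: an OR of 0.73 corresponds to a negative coefficient,
# i.e., the variable being "abnormal" scales the odds of the outcome down.
implied_coefficient = coefficient_from_odds_ratio(0.73)
```

A coefficient of zero gives an OR of 1 (no association), positive coefficients give OR > 1, and negative coefficients give OR < 1.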

Outcomes and performance metrics

Model performance metrics included the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the Weighted-F1 score. The ROC plots the true positive rate (a.k.a. recall or sensitivity) against the false positive rate (equal to one minus the specificity). We optimized algorithm parameters to maximize AUC.

The F1 score is the harmonic mean of precision and recall. Precision (or positive predictive value) is defined as the ratio of true positives over true and false positives. The Weighted-F1 score is computed by weighting the F1-score of each class by the number of patients in that class.
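
The Weighted-F1 definition above can be written out as a small Python sketch. The function names are hypothetical, and the convention of returning 0 when precision or recall is undefined is an assumption.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by class size in y_true."""
    n = len(y_true)
    total = 0.0
    for label in set(y_true):
        support = sum(1 for t in y_true if t == label)
        total += (support / n) * f1_for_class(y_true, y_pred, label)
    return total
```

Because each class contributes in proportion to its size, the metric remains informative when the positive class (e.g., ICU admission) is much smaller than the negative class.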

Model validation

The data were split into a training (80%) and a test set (20%). Algorithm parameters were optimized on the training (derivation) set using fivefold cross-validation. Performance metrics were computed on the test set. This process was repeated five times, each time with a random split into training/testing sets. In columns labeled as Random in Tables 1, 2, 3, 4, 5, we report the average (and standard deviation) of the test performance metrics over the five random splits. We also performed a different type of validation. We trained the models on MGH, FH, NWH, and NSM patients, and evaluated performance on BWH patients. These results are reported under the columns BWH in the tables.
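
The repeated random-split protocol can be sketched as follows. This is only an outline under stated assumptions: `evaluate` is a hypothetical callback that trains a model on the training indices (including any cross-validated parameter tuning) and returns a test metric, the seeds are arbitrary, and the sample standard deviation is assumed.

```python
import random
from statistics import mean, stdev


def random_split(n_patients, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for one random 80/20 split."""
    rng = random.Random(seed)
    indices = list(range(n_patients))
    rng.shuffle(indices)
    n_test = round(n_patients * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_evaluation(n_patients, evaluate, repeats=5):
    """Average a test metric (and its standard deviation) over random splits.

    `evaluate(train_indices, test_indices)` is a hypothetical callback that
    fits the model on the training set and scores it on the test set.
    """
    scores = []
    for seed in range(repeats):
        train, test = random_split(n_patients, seed=seed)
        scores.append(evaluate(train, test))
    return mean(scores), stdev(scores)
```

The hospital-based validation replaces `random_split` with a fixed partition: all BWH patients form the test set and patients from the other four hospitals form the training set.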

Appendix 1

1. Representative statistics of patients and variables highly correlated with the outcomes

Characteristics of the 2566 patients who tested positive for SARS-CoV-2 with key statistics for each cohort (hospitalized vs. not, ICU admitted vs. not, and mechanically ventilated vs. not) are provided in Appendix 1—table 1. For each variable we provide a mean value of the variable (or percentage for categorical variables) in each cohort and its complement and a p-value computed using a chi-squared test for categorical variables and a Kolmogorov-Smirnov (KS) test for continuous variables. A low p-value supports rejection of the null hypothesis, implying that the corresponding variable is statistically different in a cohort compared to its complement (e.g., hospitalized vs. not).

Appendix 1—table 2 reports how the entire patient cohort is distributed across the five different hospitals according to the various outcome groups.

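Appendix 1—table 1
Representative patient statistics.

| Variable | Admitted (36.2%): Yes | Admitted: No | Admitted: p-value | ICU (10.6%): Yes | ICU: No | ICU: p-value | Intubated (8.5%): Yes | Intubated: No | Intubated: p-value |
|---|---|---|---|---|---|---|---|---|---|
| Age | 62.3 | 46.0 | <0.001 | 63.3 | 50.6 | <0.001 | 63.3 | 50.9 | <0.001 |
| Gender (male) | 55.3% | 40.1% | <0.001 | 63.0% | 43.5% | <0.001 | 63.6% | 43.9% | <0.001 |
| Asian | 3.7% | 4.0% | 0.97 | 3.7% | 3.9% | 1 | 3.7% | 3.9% | 1 |
| Black/African American | 15.7% | 17.8% | 0.61 | 14.7% | 17.3% | 0.75 | 14.3% | 17.3% | 0.74 |
| Hispanic/Latino | 4.9% | 5.9% | 0.81 | 6.6% | 5.4% | 0.88 | 6.9% | 5.4% | 0.83 |
| White | 45.4% | 43.9% | 0.91 | 39.6% | 45.0% | 0.40 | 39.6% | 44.9% | 0.53 |
| Hypertension | 61.7% | 26.4% | <0.001 | 62.3% | 36.5% | <0.001 | 61.8% | 37.1% | <0.001 |
| Diabetes | 34.2% | 9.7% | <0.001 | 40.7% | 15.9% | <0.001 | 42.9% | 16.3% | <0.001 |
| Alzheimer | 6.7% | 0.6% | <0.001 | 2.6% | 2.8% | 1 | 3.2% | 2.7% | 0.98 |
| Congestive Heart Failure (CHF) | 11.3% | 0.8% | <0.001 | 9.5% | 4.0% | <0.001 | 8.8% | 4.2% | 0.025 |
| Chronic Kidney Disease (CKD) | 14.4% | 1.7% | <0.001 | 12.8% | 5.5% | <0.001 | 11.5% | 5.8% | 0.011 |
| ACE Inhibitors (ACEIs) | 17.5% | 8.4% | <0.001 | 20.5% | 10.7% | <0.001 | 19.8% | 11.0% | 0.002 |
| Acetaminophen (Tylenol) | 39.8% | 17.8% | <0.001 | 31.9% | 25.1% | 0.12 | 30.4% | 25.4% | 0.45 |
| Amiodarone | 1.6% | 0.1% | <0.001 | 1.5% | 0.5% | 0.32 | 0.9% | 0.6% | 0.95 |
| Anticoagulants | 9.4% | 1.7% | <0.001 | 9.9% | 3.8% | <0.001 | 11.1% | 3.8% | <0.001 |
| Anti-depressants | 25.4% | 16.7% | <0.001 | 20.5% | 19.8% | 0.99 | 22.6% | 19.6% | 0.77 |
| Angiotensin Receptor Blockers (ARBs) | 12.0% | 5.2% | <0.001 | 15.4% | 6.8% | <0.001 | 17.1% | 6.8% | <0.001 |
| Aspirin related | 32.3% | 11.6% | <0.001 | 33.7% | 17.4% | <0.001 | 33.2% | 17.8% | <0.001 |
| Beta-Blockers | 28.1% | 10.4% | <0.001 | 25.6% | 15.7% | <0.001 | 25.8% | 16.0% | 0.003 |
| Calcium Channel Blockers (CCBs) | 2.6% | 0.7% | 0.001 | 4.4% | 1.0% | <0.001 | 4.6% | 1.1% | <0.001 |
| Coumadin (warfarin) | 3.5% | 0.7% | <0.001 | 1.8% | 1.7% | 1 | 1.8% | 1.7% | 1 |
| Diuretics | 16.0% | 4.5% | <0.001 | 13.9% | 8.1% | 0.015 | 13.4% | 8.3% | 0.089 |
| Immunosuppressants | 5.3% | 2.6% | 0.005 | 3.7% | 3.5% | 1 | 4.1% | 3.5% | 0.97 |
| Insulin related | 14.6% | 3.5% | <0.001 | 19.0% | 6.2% | <0.001 | 21.2% | 6.3% | <0.001 |
| Metformin related | 19.5% | 8.6% | <0.001 | 23.8% | 11.2% | <0.001 | 24.9% | 11.4% | <0.001 |
| Nonsteroidal anti-inflammatory drugs (NSAIDs) | 21.9% | 21.0% | 0.95 | 19.0% | 21.6% | 0.82 | 18.0% | 21.6% | 0.66 |
| Proton Pump Inhibitors (PPIs) | 26.6% | 15.0% | <0.001 | 24.5% | 18.5% | 0.13 | 25.8% | 18.6% | 0.081 |
| Statins | 45.1% | 17.3% | <0.001 | 47.6% | 24.9% | <0.001 | 45.6% | 25.7% | <0.001 |
| Steroids | 30.5% | 23.0% | <0.001 | 30.8% | 25.2% | 0.26 | 30.4% | 25.3% | 0.44 |
| Cough | 65.6% | 29.6% | <0.001 | 68.1% | 39.6% | <0.001 | 69.1% | 40.2% | <0.001 |
| Dyspnea | 16.6% | 2.2% | <0.001 | 21.6% | 5.7% | <0.001 | 23.5% | 5.9% | <0.001 |
| Chest pain | 21.1% | 5.6% | <0.001 | 22.0% | 9.9% | <0.001 | 24.4% | 10.0% | <0.001 |
| Fever | 57.4% | 23.7% | <0.001 | 61.2% | 32.9% | <0.001 | 63.6% | 33.4% | <0.001 |
| SpO2 | 95.2 | 97.4 | <0.001 | 93.4 | 96.7 | <0.001 | 93.3 | 96.7 | <0.001 |
| Diastolic BP | 72.5 | 78.1 | <0.001 | 72.0 | 75.6 | <0.001 | 70.9 | 75.6 | <0.001 |
| Pulse | 90.6 | 88.3 | <0.001 | 93.3 | 88.8 | 0.003 | 94.1 | 88.9 | 0.01 |
| Respiratory Rate (RR) | 23.1 | 20.3 | <0.001 | 25.6 | 21.2 | <0.001 | 25.9 | 21.3 | <0.001 |
| Temperature (°C) | 37.2 | 37.0 | <0.001 | 37.3 | 37.1 | 0.001 | 37.3 | 37.1 | 0.001 |
| Anion Gap | 15.8 | | | 17.0 | 15.1 | <0.001 | 17.1 | 15.1 | <0.001 |
| Sodium | 137.0 | | | 136.3 | 137.4 | <0.001 | 136.2 | 137.3 | <0.001 |
| Calcium | 9.0 | | | 8.8 | 9.0 | <0.001 | 8.8 | 9.0 | <0.001 |
| Lactic acid | 1.8 | | | 2.1 | 1.6 | <0.001 | 2.1 | 1.6 | <0.001 |
| Glomerular filtration rate (GFR) | 67.0 | | | 64.8 | 72.3 | <0.001 | 64.7 | 71.9 | <0.001 |
| Chloride | 98.1 | | | 97.2 | 98.8 | <0.001 | 97.1 | 98.8 | <0.001 |
| Glucose | 149.6 | | | 171.5 | 135.8 | <0.001 | 173.9 | 137.2 | <0.001 |
| Lactate Dehydrogenase (LDH) | 377.2 | | | 524.6 | 303.9 | <0.001 | 551.8 | 310.6 | <0.001 |
| Albumin | 3.8 | | | 3.6 | 3.9 | <0.001 | 3.6 | 3.9 | <0.001 |
| D-Dimer | 1373.5 | | | 1525.0 | 1223.7 | <0.001 | 1614.5 | 1214.0 | <0.001 |
| C-reactive Protein (CRP) | 89.6 | | | 133.1 | 65.5 | <0.001 | 140.1 | 68.1 | <0.001 |
| Blood Urea Nitrogen (BUN) | 21.4 | | | 24.3 | 18.5 | <0.001 | 23.8 | 18.9 | <0.001 |
| Creatine Kinase (CK) | 385.2 | | | 563.4 | 282.7 | <0.001 | 620.3 | 285.1 | <0.001 |
| Ferritin | 854.2 | | | 1349.5 | 601.6 | <0.001 | 1477.1 | 621.8 | <0.001 |
| Mean Platelet Volume (MPV) | 10.5 | | | 10.6 | 10.5 | <0.001 | 10.6 | 10.5 | <0.001 |
| Atelectasis | 19.0% | 4.6% | <0.001 | 15.8% | 9.2% | 0.008 | 16.6% | 9.2% | 0.007 |
| Consolidation | 5.9% | 0.6% | <0.001 | 10.3% | 1.6% | <0.001 | 11.1% | 1.7% | <0.001 |
| Nodule | 4.9% | 0.6% | <0.001 | 4.4% | 1.9% | 0.072 | 3.7% | 2.0% | 0.47 |
| Opacity | 64.8% | 13.7% | <0.001 | 78.4% | 26.7% | <0.001 | 80.6% | 27.8% | <0.001 |
| Pleural Effusion | 8.8% | 1.1% | <0.001 | 11.7% | 3.0% | <0.001 | 13.8% | 3.0% | <0.001 |

Appendix 1—table 2
Distribution of patients in different hospitals and outcome groups.

| Hospital | Positive | Admitted | ICU | Intubated |
|---|---|---|---|---|
| Brigham and Women's Hospital (BWH) | 648 | 171 | 67 | 56 |
| Newton-Wellesley Hospital (NWH) | 434 | 145 | 33 | 18 |
| Massachusetts General Hospital (MGH) | 1195 | 475 | 144 | 121 |
| North Shore Medical Center (NSM) | 97 | 63 | 16 | 12 |
| Faulkner Hospital (FH) | 192 | 76 | 13 | 10 |
| Total | 2566 | 930 | 273 | 217 |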

2. Natural Language Processing (NLP) of clinical notes

The de-identified data consisted of demographics, lab results, history and physical examination (H and P) notes, progress notes, radiology reports, and discharge notes. We extracted all variables needed for each patient and built a profile using NLP tools. There were two main difficulties. First, many important features such as vitals and medical history (prior conditions, medications) were not in a table format and had to be extracted from the report text using different regular expression templates, with the results post-processed to eliminate errors due to non-uniformity in the reports (e.g., a line break may separate a date from the field indicating the type). Second, negations in the text must be recognized. Simply recognizing a medical term such as ‘cough’ or ‘fever’ is not sufficient since the report may include ‘Patient denies fever or cough’. We applied multiple NLP schemes to overcome these difficulties.

Regular expression matching is the basic strategy we used to extract features such as body temperature values (with or without decimal followed by ‘°C/°F’) and blood pressure values (‘xx(x)/xx(x)’ even if they are mixed up with a date ‘mm/dd/yyyy’ having similar symbols). Extracting pulse and respiratory rates is challenging since it is easy to mismatch the corresponding values; thus, we also matched the indicators ‘RR:’ (respiratory rate) or ‘P’ (pulse rate) in the vicinity of the number.
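
A minimal Python sketch of this kind of pattern matching is shown below. The patterns are illustrative assumptions, not the templates actually used in the study; they show how a blood pressure reading can be separated from a date with similar symbols, and how a nearby ‘RR’ indicator anchors the respiratory rate.

```python
import re

# Blood pressure 'xx(x)/xx(x)'; the lookarounds reject fragments of a
# 'mm/dd/yyyy' date, which also contain digits separated by slashes.
BP_RE = re.compile(r"(?<![\d/])(\d{2,3})/(\d{2,3})(?![\d/])")
# Temperature with or without a decimal part, followed by °C or °F.
TEMP_RE = re.compile(r"(\d{2,3}(?:\.\d+)?)\s*°?\s*([CF])\b")
# Respiratory rate: a number in the vicinity of the 'RR' indicator.
RR_RE = re.compile(r"\bRR:?\s*(\d{1,2})\b")


def extract_vitals(note):
    """Extract a few vitals from free text (hypothetical helper)."""
    vitals = {}
    bp = BP_RE.search(note)
    if bp:
        vitals["systolic"] = int(bp.group(1))
        vitals["diastolic"] = int(bp.group(2))
    temp = TEMP_RE.search(note)
    if temp:
        vitals["temperature"] = float(temp.group(1))
        vitals["temperature_unit"] = temp.group(2)
    rr = RR_RE.search(note)
    if rr:
        vitals["respiratory_rate"] = int(rr.group(1))
    return vitals


# Made-up note text for illustration.
example = extract_vitals("Seen 04/12/2020. BP 128/82, Temp 37.8 °C, RR: 22")
```

Real notes are far less uniform than this example, which is why the extracted values were post-processed and the extraction accuracy was checked manually.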

To extract symptoms in H and P notes and findings in radiology reports, we used two NLP models: a Named Entity Recognition (NER) model, and a Natural Language Inference (NLI) model (Zhu et al., 2018). The first model aims at finding all the symptoms/disease named entities in the report. The key motivation of NER is that it is hard to list all possible disease names and search for them in each sentence; instead, NER models use the context to infer the possible targets, thus, even abbreviations like ‘N/V’ will be recognized. We used the spaCy NER model (Kiperwasser and Goldberg, 2016) trained on the BC5CDR corpus. The NLI model is used to detect negations, by checking if a sentence as a premise supports the hypothesis that the patient truly has the disease/symptoms in it. We applied a fine-tuned RoBERTa model (Liu et al., 2019) to perform NLI.

For medication extraction, we used the Unified Medical Language System (UMLS) (UMLS, 2019), which comprehensively contains medical terms and their relationships. We added a medication to the patient’s prior-to-admission medication list only if the medication or brand name was found in the UMLS ‘Pharmacologic Substance’ or ‘Clinical Drug’ category.

Symptoms, medical history, and prior medications from H and P notes are often described using different terminology or acronyms that imply the same condition or medication (e.g., dyspnea and SOB). We manually mapped these non-unique descriptors to distinct categories. An appropriate classification was also used for comorbidities, prior medications, radiological findings, and laboratories. The entire list of variables extracted and used in the analysis is provided in Appendix 1—table 3.

Appendix 1—table 3
List of 164 features used for hospitalization, ICU, and ventilation models.

| Category | Features |
|---|---|
| Demographics | Marital status, Gender, Race, Age, Language, Tobacco, Alcohol, Height, Weight, BMI |
| Vitals | Systolic BP, Diastolic BP, Temperature, Pulse, Respiratory Rate, SpO2 percentage |
| Symptoms | Fever, Cough, Dyspnea, Fatigue, Diarrhea, Nausea, Vomiting, Abdominal pain, Loss of smell, Loss of taste, Chest pain, Headache, Sore throat, Hemoptysis, Myalgia |
| Pre-existing medications | Steroids, ACEIs, ARBs, NSAIDs, Anti-depressants, CCBs, Diuretics, Digoxin, Statins, Beta-Blockers, Acetaminophen Tylenol, Immunosuppressants, Anticoagulants, Aspirin related, Coumadin warfarin, Amiodarone, Insulin related, Metformin related, PPIs |
| Comorbidities | Hypertension, COPD, Diabetes, CKD, CAD, MI, Asthma, Osteoarthritis arthritis, SLE, HLD, Arrhythmia, Thyroid disease, Stroke, Migraine, Epilepsy, Alzheimer, Parkinson, Nephrolithiasis, Cushing, Adrenal Insufficiency, Diverticulosis, GERD, IBS, IBD, Cholelithiasis, Inguinal hernia, Hepatitis, Cirrhosis, Valvular disease, CHF, PAD, Osteoporosis, Cancer, TB, Cardiomyopathy, AAA, DVT, vWD, Anemia, Transplantation, HIV, Depression, Anxiety |
| Radiology | Opacity, Atelectasis, Consolidation, Pleural Effusion, Pneumothorax, Nodule |
| Labs | RDW, PLT, MCH, HGB, MCHC, HCT, MCV, RBC, WBC, MPV, NRBC (%), GFR (estimated), Creatinine, Potassium, Chloride, Sodium, Anion Gap, BUN, Glucose, Calcium, Carbon Dioxide, Absolute Neutrophil count, Absolute Lymphocyte count, Absolute Monocyte count, Absolute Eosinophil count, Absolute Basophil count, Immature Granulocytes, ALT, Total Protein, Albumin, Globulin, AST, Bilirubin (Total), Alkaline phosphatase, NRBC Auto (#), LDH, Ferritin, CK, Magnesium, CRP, PT, D-Dimer, Lactic acid, Phosphorus, PTT, PCO2 (Venous), pH (Venous), Fibrinogen, Lipase, Bands (manual), PO2 (Venous), Base Deficit (Venous), Iron, Bilirubin (Direct), Myelocytes, HCO3 (unspecified), TIBC, Base Deficit (Arterial), PCO2 (Arterial), Metamyelocytes, Plasma cells (%), PO2 (Arterial), Ionized Calcium, pH (Arterial), Osmolality |

To evaluate the accuracy of the NLP models on our data, we randomly selected 35 H and P notes and manually checked the model output, evaluating the precision, recall, and F1-score for the extracted terms. For the NER+NLI deep learning model, we compared all the symptoms extracted by the models against the manually extracted ground truth. For the general regular expression matching models, we checked the extraction of vitals as a representative task, particularly since vitals have the most complicated format in the original notes. Appendix 1—table 4 provides the results of this manual evaluation.

Appendix 1—table 4
Performance of the NLP models.

| Model | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| NER+NLI model | 93.60 | 87.97 | 90.70 |
| Regular expression matching | 99.01 | 96.15 | 97.56 |

Appendix 1—table 5
Abnormal ranges for laboratory tests and vitals.

| Variable | Abnormal range |
|---|---|
| Albumin | <3.3 |
| Chloride | <95 |
| Lactic acid | ≥2 |
| LDH | ≥250 |
| CRP (mg/L) | ≥10 |
| Calcium | ≤8.5 |
| Anion gap | ≥12 |
| Glucose | ≥110 |
| Total protein | ≤6.5 or ≥8.3 |
| D-Dimer (ng/mL) | ≥500 |
| GFR | ≤60 |
| Sodium | <135 |
| Globulin | ≤2 or ≥4 |
| SpO2 | ≤94 |
| Systolic blood pressure | ≤100 |
| Pulse | ≥100 |
| Respiratory rate | ≥20 |
| Age | ≥65 |
| Diastolic blood pressure | ≤60 |
| BMI | ≥30 |
| Temperature | ≥37.5 °C or ≥98.7 °F |

For both types of models, the F1-score exceeds 90%. Most of the symptoms missing are due to non-obvious abbreviations. Regular expression matching has better performance since potential errors may only come from very rare formats we did not consider.

3. Classification methods

A random forest (RF) (Breiman, 2001) is an ensemble algorithm that achieves high accuracy and generalization performance by combining multiple weak decision tree classifiers. For training, RF uses the bootstrap aggregating (bagging) technique to randomly select a training sample set for each decision tree classifier. It trains multiple decision trees in parallel during the training phase, where each tree is trained using a random sample set from the original training set. In the test phase, RF uses the trained decision tree classifiers to classify a test sample, and then combines all the classifiers by majority voting.

XGBoost (Chen and Guestrin, 2016) generates a series of decision trees in sequential order; each decision tree is fitted to the residual between the prediction of the previous decision tree and the target value, and this is repeated until a predetermined number of trees or a convergence criterion is reached. All decision trees computed are combined with proper weights to produce a final decision. XGBoost uses shrinkage and column subsampling to prevent overfitting and achieves fast training using a number of parallelization approaches.

Both of these nonlinear models are expensive to train compared to the linear models we discuss next. Essentially, each one of them trains an ensemble of many decision trees (could be as many as 500 or more) and a decision is made by combining information from all of these trees.

Among the linear classifiers, we used the support vector machine (SVM) (Cortes and Vapnik, 1995), which computes an optimal hyperplane separating the two classes. To render the method robust to noise and the presence of outliers (Chen and Paschalidis, 2018) we used (ℓ1- or ℓ2-norm) regularized versions of SVM.

We also used logistic regression (LR), a common classification method that uses a linear regression model to approximate the logarithmic odds (logit) of the true classification label. LR, in addition to a prediction, also provides the likelihood of the predicted outcome, which can be used as a confidence measure in decision making. Similar to SVM, we used (ℓ1- or ℓ2-norm) regularized logistic regression to find the optimal subset of features from the initial feature space. In particular, based on the LR model, the predicted probability of the outcome, denoted by $\hat{y}$, is estimated by the formula:

$$\hat{y} = \frac{1}{1 + \exp\left\{-b_0 - \sum_{i=1}^{n} b_i x_i\right\}},$$

where $\exp\{\cdot\}$ denotes the exponential function, $b_0$ is the intercept, $x_1, \ldots, x_n$ are the variables used by the model, and $b_1, \ldots, b_n$ are the corresponding coefficients. Using this formula and the LR coefficients (and intercept) provided in Tables 1, 2, 3, 4, 5, one can obtain an easily computable value for the predicted probability of the corresponding outcome. Comparing that value to a threshold (in the interval [0,1]) yields a prediction. The threshold can be set depending on the desired trade-off between sensitivity and specificity, which is typically specified by the user.
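
As a minimal sketch, the computation described above can be written in a few lines of Python. The function names, the example coefficients, and the default threshold of 0.5 are illustrative assumptions; actual coefficients and intercepts would be taken from the tables of the parsimonious models.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic regression probability: 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    linear_term = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear_term))


def predict_outcome(intercept, coefficients, features, threshold=0.5):
    """Compare the probability to a user-chosen threshold in [0, 1]."""
    probability = predict_probability(intercept, coefficients, features)
    return int(probability >= threshold)


# Made-up coefficients and standardized feature values, for illustration only.
probability = predict_probability(-1.0, [0.8, -0.5], [1.2, 0.3])
```

Lowering the threshold increases sensitivity at the expense of specificity, and raising it does the opposite.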

4. Pre-processing, statistical feature selection and recursive feature elimination

We extracted patients’ laboratory test results at the date of hospital admission (reference date). Since some lab tests may be received several hours after the reference time, we extracted the nearest set of lab results to the reference time. Some tests have multiple Logical Observation Identifiers Names and Codes (LOINC) referring to the same quantity; these were merged. White blood cell (WBC) types (basophils, eosinophils, lymphocytes, monocytes, and neutrophils) were reported both as an absolute count and as a percentage (of WBC). We eliminated the percentages and maintained the absolute counts. We also removed all laboratory tests for which results were available for fewer than 10% of the patients. This retained 65 laboratory variables.

Missing variables were imputed using the mode or, for some key lab variables, by regressing on the non-missing variables of the patient. To mitigate the effect of outliers, values of each variable higher than the 99th percentile or lower than the 1st percentile were replaced with the 99th or 1st percentile, respectively. Finally, to avoid collinearity, we removed one variable from each pair of highly correlated variables (absolute correlation coefficient higher than 0.8).
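
The outlier clipping and correlation filtering steps can be sketched as follows. Several details are assumptions not stated in the text: the percentile is computed with linear interpolation, the Pearson correlation is used, and of two highly correlated variables the one encountered first is kept.

```python
import math


def percentile(sorted_values, q):
    """Percentile q in [0, 100] with linear interpolation (an assumption)."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def clip_outliers(column, low_q=1, high_q=99):
    """Replace values beyond the 1st/99th percentiles with those percentiles."""
    ordered = sorted(column)
    low, high = percentile(ordered, low_q), percentile(ordered, high_q)
    return [min(max(v, low), high) for v in column]


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.8):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Which member of a correlated pair is retained depends on the iteration order in this sketch; a different rule (e.g., keeping the variable with fewer missing values) would be equally compatible with the description above.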

For each model, we used a variety of statistical feature selection approaches. Specifically, we first calculated a p-value for each variable as described earlier and removed all variables with a p-value exceeding 0.05. Further, we used (ℓ1-norm) regularized LR and performed recursive feature elimination as follows. We ran LR and obtained the coefficients of the model. We then eliminated the variable with the smallest absolute coefficient and re-ran LR to obtain a new model. We kept iterating in this fashion, to select a model that maximizes a metric equal to the mean AUC minus its standard deviation in a validation dataset.
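
The elimination loop can be outlined in Python as below. This is a sketch of the control flow only: `fit` and `score` are hypothetical callbacks standing in for the regularized LR training and the validation AUC computation, and keeping the larger model on ties is an assumption.

```python
def recursive_feature_elimination(features, fit, score):
    """Drop the weakest feature repeatedly; keep the best-scoring subset.

    fit(features)   -> dict mapping each feature to its LR coefficient
    score(features) -> (mean_auc, std_auc) on validation data
    Both callbacks are hypothetical placeholders for the actual training code.
    """
    current = list(features)
    best_subset, best_metric = None, float("-inf")
    while current:
        mean_auc, std_auc = score(current)
        metric = mean_auc - std_auc  # selection criterion described in the text
        if metric > best_metric:
            best_subset, best_metric = list(current), metric
        coefficients = fit(current)
        weakest = min(current, key=lambda f: abs(coefficients[f]))
        current.remove(weakest)
    return best_subset, best_metric
```

Subtracting the standard deviation from the mean AUC favors feature subsets whose performance is both high and stable across validation folds.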

5. Thresholds for the binarized models

Thresholds used for generating binarized versions of our parsimonious models are reported in Appendix 1—table 5. In these models, a variable is set to 1 if the corresponding continuous variable is abnormal and to 0 otherwise.
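
A minimal sketch of the binarization step is given below, using a few of the thresholds from Appendix 1—table 5. The dictionary layout and the function name `binarize` are assumptions made for illustration, and only a subset of the variables is included.

```python
# Abnormality tests for a subset of variables in Appendix 1—table 5.
ABNORMAL = {
    "Albumin": lambda v: v < 3.3,
    "LDH": lambda v: v >= 250,
    "CRP": lambda v: v >= 10,   # mg/L
    "SpO2": lambda v: v <= 94,
    "Pulse": lambda v: v >= 100,
    "Age": lambda v: v >= 65,
}


def binarize(patient):
    """Map each known continuous variable to 1 (abnormal) or 0 (normal)."""
    return {
        name: int(ABNORMAL[name](value))
        for name, value in patient.items()
        if name in ABNORMAL
    }


# Illustrative admission values.
flags = binarize({"LDH": 348, "CRP": 102.3, "SpO2": 96, "Pulse": 110, "Age": 59})
```

The resulting 0/1 variables are what the binarized LR models take as input, which is what allows each coefficient to be read as an odds ratio for "abnormal vs. normal".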

6. Standard pneumonia severity scores

For comparison purposes we implemented two commonly used pneumonia severity scores, CURB-65 (Lim et al., 2003) and the Pneumonia Severity Index (PSI) (Fine et al., 1997). CURB-65 uses a mental test assessment, Blood Urea Nitrogen (BUN), respiratory rate, blood pressure, and the indicator of age being 65 or older. PSI uses similar information, a host of laboratory values, and comorbidities. From CURB-65 we did not score for mental status since we did not have such information. From PSI, we did not use mental status and whether the patient was a nursing home resident. Given that laboratory values are used, we computed these scores to predict ICU care and ventilator use. In each case, we computed the corresponding score and then optimized a threshold using cross-validation over the training set in order to make the prediction. We used these thresholds and evaluated performance of each scoring system in the test set.
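
To make the modified CURB-65 computation concrete, the sketch below scores the four components that remain once mental status is omitted. The cut-offs are the commonly cited CURB-65 criteria and are an assumption here, since the text above does not list them; in particular, BUN > 19 mg/dL is used as the usual equivalent of urea > 7 mmol/L.

```python
def curb65_without_mental_status(bun_mg_dl, respiratory_rate,
                                 systolic_bp, diastolic_bp, age):
    """CURB-65 with the confusion item omitted (score ranges from 0 to 4).

    Thresholds follow the commonly cited criteria (an assumption here).
    """
    score = 0
    score += int(bun_mg_dl > 19)                            # Urea
    score += int(respiratory_rate >= 30)                    # Respiratory rate
    score += int(systolic_bp < 90 or diastolic_bp <= 60)    # Blood pressure
    score += int(age >= 65)                                 # Age
    return score


def predict_from_score(score, threshold):
    """Predict the outcome when the score reaches a tuned threshold."""
    return int(score >= threshold)
```

In the comparison described above, the threshold passed to `predict_from_score` would be the one optimized by cross-validation over the training set.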

7. Training/Derivation Model Performance

Performance metrics for the various models on the training/derivation cohorts are reported in Appendix 1—tables 6, 7, 8, 9, 10. These are computed for both the random splitting of the data into training and testing sets (in this case, we provide the mean and standard deviation over the five random splits), as well as for the training dataset formed from patients at MGH, FH, NWH, and NSM (these results are under the column named BWH in Appendix 1—tables 6, 7, 8, 9, 10, simply to match the terminology of Tables 1, 2, 3, 4, 5).

Appendix 1—table 6
Derivation cohort performance for the hospitalization prediction model.

Abbreviations and metrics reported are as in Table 1.

| Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|
| Models using all 106 features | | | | |
| LR-L2 | 88.3% (0.4%) | 88.3% | 82.9% (0.5%) | 82.3% |
| SVM-L1 | 88.2% (0.4%) | 88.2% | 82.8% (0.5%) | 82.1% |
| XGBoost | 91.5% (2.1%) | 90.9% | 85.7% (2.3%) | 85.2% |
| RF | 96.0% (0.7%) | 95.3% | 92.9% (1.2%) | 90.8% |
| Models using 74 statistically selected features | | | | |
| LR-L2 | 87.8% (0.4%) | 87.8% | 82.4% (0.4%) | 81.7% |
| SVM-L1 | 87.8% (0.4%) | 87.7% | 82.5% (0.7%) | 81.7% |
| XGBoost | 91.9% (1.8%) | 91.9% | 86.0% (1.8%) | 86.2% |
| RF | 94.9% (0.9%) | 96.6% | 91.3% (1.3%) | 93.2% |
| Parsimonious Model using 11 features | | | | |
| LR-L2 | 82.6% (0.5%) | 82.4% | 77.6% (0.1%) | 76.9% |
| SVM-L1 | 82.5% (0.5%) | 82.3% | 77.5% (0.3%) | 76.9% |

Appendix 1—table 7
Derivation cohort performance for the ICU prediction model.

Abbreviations and metrics reported are as in Table 1.

ICU prediction results (training performance) with 2513 patients

| Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|
| Models using all 130 features | | | | |
| XGBoost | 94.5% (3.6%) | 96.1% | 94.0% (1.7%) | 94.1% |
| SVM-L1 | 89.7% (0.7%) | 91.4% | 91.5% (0.4%) | 91.9% |
| LR-L1 | 91.3% (0.6%) | 92.9% | 91.5% (0.5%) | 91.9% |
| RF | 93.4% (3.2%) | 97.0% | 94.3% (1.6%) | 95.4% |
| Models using 56 statistically selected features | | | | |
| XGBoost | 94.1% (1.5%) | 95.1% | 93.6% (0.6%) | 93.7% |
| SVM-L1 | 88.5% (0.7%) | 89.7% | 91.2% (0.4%) | 91.4% |
| LR-L1 | 89.3% (0.7%) | 90.4% | 91.2% (0.2%) | 91.4% |
| RF | 91.0% (1.9%) | 94.9% | 93.0% (1.0%) | 94.2% |
| Parsimonious Model using 10 features | | | | |
| LR-L1 | 86.2% (0.6%) | 83.8% | 90.4% (0.4%) | 89.1% |
| LR-L1 (binarized model) | 84.0% (0.6%) | 80.6% | 89.4% (0.1%) | 88.2% |
| Model using PSI or CURB-65 score | | | | |
| PSI score | 74.3% (1.2%) | 72.3% | 87.5% (0.2%) | 87.1% |
| CURB-65 score | 67.9% (1.3%) | 65.3% | 87.3% (0.2%) | 86.8% |

Appendix 1—table 8
Derivation cohort performance for the restricted ICU prediction model.

Abbreviations and metrics reported are as in Table 1.

ICU prediction training performance with 628 patients

| Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|
| Models using all 130 features | | | | |
| XGBoost | 89.6% (4.8%) | 92.5% | 85.4% (5.8%) | 87.6% |
| SVM-L1 | 80.1% (0.6%) | 80.8% | 79.4% (0.5%) | 80.4% |
| LR-L1 | 87.1% (0.8%) | 88.0% | 83.5% (0.5%) | 83.6% |
| RF | 95.6% (2.9%) | 95.7% | 91.0% (3.3%) | 90.2% |
| Models using 29 statistically selected features | | | | |
| XGBoost | 86.3% (1.0%) | 87.4% | 81.9% (0.4%) | 83.8% |
| SVM-L1 | 80.5% (0.9%) | 80.4% | 79.1% (0.5%) | 80.4% |
| LR-L1 | 80.9% (1.0%) | 81.6% | 79.0% (0.3%) | 80.3% |
| RF | 89.8% (2.6%) | 92.8% | 85.0% (1.9%) | 88.2% |
| Parsimonious Model using 8 features | | | | |
| LR-L1 | 80.4% (0.9%) | 81.4% | 79.7% (0.5%) | 80.0% |
| LR-L1 (binarized model) | 75.4% (1.1%) | 77.2% | 75.2% (0.7%) | 77.5% |
| Model using PSI or CURB-65 score | | | | |
| PSI score | 60.5% (1.7%) | 59.0% | 68.6% (0.5%) | 68.7% |
| CURB-65 score | 60.2% (1.2%) | 57.2% | 67.5% (0.4%) | 67.3% |

Appendix 1—table 9
Derivation cohort performance for the ventilation prediction model.

Abbreviations and metrics reported are as in Table 1.

Ventilation prediction training performance with 2525 patients

| Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|
| Models using all 130 features | | | | |
| XGBoost | 97.2% (1.5%) | 95.2% | 95.8% (1.0%) | 94.5% |
| SVM-L1 | 92.3% (0.7%) | 92.8% | 93.1% (0.1%) | 93.4% |
| LR-L1 | 93.8% (0.6%) | 94.3% | 93.3% (0.2%) | 93.2% |
| RF | 95.1% (0.8%) | 94.7% | 95.4% (0.5%) | 94.3% |
| Models using 55 statistically selected features | | | | |
| XGBoost | 96.9% (1.4%) | 98.3% | 95.6% (0.9%) | 96.6% |
| SVM-L1 | 90.8% (0.7%) | 91.3% | 92.7% (0.2%) | 93.0% |
| LR-L1 | 91.4% (0.7%) | 92.0% | 92.6% (0.3%) | 92.8% |
| RF | 94.8% (0.7%) | 94.1% | 95.5% (0.3%) | 94.8% |
| Parsimonious Model using 8 features | | | | |
| LR-L1 | 86.9% (0.5%) | 88.1% | 91.6% (0.2%) | 91.9% |
| LR-L1 (binarized model) | 84.4% (0.7%) | 86.7% | 91.1% (0.2%) | 91.2% |
| Model using PSI or CURB-65 score | | | | |
| PSI score | 74.0% (1.0%) | 71.4% | 89.9% (0.1%) | 89.6% |
| CURB-65 score | 67.6% (0.8%) | 64.7% | 89.7% (0.0%) | 89.6% |

Appendix 1—table 10
Derivation cohort performance for the restricted ventilation prediction model.

Abbreviations and metrics reported are as in Table 1.

Ventilation prediction training performance with 635 patients

| Algorithm | AUC (Random) | AUC (BWH) | F1-weighted (Random) | F1-weighted (BWH) |
|---|---|---|---|---|
| Models using all 130 features | | | | |
| XGBoost | 91.8% (2.2%) | 98.6% | 87.4% (2.0%) | 95.3% |
| SVM-L1 | 81.2% (0.7%) | 83.2% | 82.4% (1.1%) | 83.9% |
| LR-L1 | 89.7% (0.6%) | 89.6% | 86.9% (1.0%) | 85.8% |
| RF | 93.5% (4.2%) | 93.7% | 89.5% (3.8%) | 89.7% |
| Models using 29 statistically selected features | | | | |
| XGBoost | 89.9% (2.3%) | 89.9% | 86.1% (1.6%) | 86.0% |
| SVM-L1 | 81.5% (1.6%) | 84.4% | 82.2% (1.2%) | 83.7% |
| LR-L1 | 82.6% (0.7%) | 84.0% | 83.0% (0.9%) | 83.6% |
| RF | 92.3% (4.8%) | 94.3% | 88.8% (3.7%) | 89.3% |
| Parsimonious Model using 5 features | | | | |
| LR-L1 | 80.3% (1.0%) | 79.0% | 82.1% (0.7%) | 81.7% |
| LR-L1 (binarized model) | 73.1% (1.4%) | 66.5% | 78.3% (0.9%) | 73.5% |
| Model using PSI or CURB-65 score | | | | |
| PSI score | 58.8% (1.0%) | 57.2% | 73.9% (0.3%) | 74.2% |
| CURB-65 score | 58.5% (1.7%) | 55.8% | 73.2% (0.1%) | 73.7% |

8. Performance of the restricted ICU and ventilation models with sufficient distance to the event

Appendix 1—table 11 lists the performance of the restricted ICU and mechanical ventilation parsimonious LR-L1 models provided in Tables 3 and 5 when applied to a test set consisting of the BWH patients and 11 additional patients whose data were collected right after the original dataset was compiled. In these results, we excluded patients whose predicted outcome (ICU or intubation) occurs less than x hours from the time the admission lab results were made available, where x takes values in the set {6 hr, 12 hr, 18 hr, 24 hr, 48 hr}. Thus, the corresponding test set includes only patients with sufficient time difference from the data used to make the prediction, assessing how far into the future the predictive model could reach. We added the additional 11 patients to make sure we have a sufficient number of test patients to perform this study. As the results suggest, ICU admission estimation is fairly accurate and robust, whereas intubation prediction had moderate predictive power.
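
The construction of the restricted test sets can be sketched as a simple filter. The record layout (a dictionary with an `hours_to_event` field that is `None` when the outcome never occurred) is a hypothetical choice, and retaining patients without the event as negatives is an assumption. Patients whose gap equals x exactly are excluded here, following the caption of Appendix 1—table 11.

```python
def restrict_by_time_gap(patients, min_gap_hours):
    """Keep patients whose outcome occurred more than `min_gap_hours` after
    the admission lab results, plus patients who never had the outcome.

    Each patient is a dict with a hypothetical 'hours_to_event' field
    (None when the outcome did not occur).
    """
    kept = []
    for patient in patients:
        gap = patient["hours_to_event"]
        if gap is None or gap > min_gap_hours:
            kept.append(patient)
    return kept


# Made-up records for illustration.
cohort = [
    {"id": 1, "hours_to_event": 4},
    {"id": 2, "hours_to_event": 30},
    {"id": 3, "hours_to_event": None},
]
test_sets = {x: restrict_by_time_gap(cohort, x) for x in (6, 12, 18, 24, 48)}
```

As x grows, fewer positive cases remain in the test set, which is why additional patients were needed to keep the evaluation meaningful at the larger gaps.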

Appendix 1—table 11
AUC and weighted F1-score on an extended BWH test set, where patients with lab-to-outcome time smaller than or equal to certain gaps are excluded.

| Time gap | 6 hr | 12 hr | 18 hr | 24 hr | 48 hr |
|---|---|---|---|---|---|
| Restricted ICU model - AUC | 86.05% | 84.73% | 86.85% | 86.14% | 84.62% |
| Restricted ICU model - weighted-F1 | 83.10% | 82.17% | 86.47% | 86.09% | 86.28% |
| Restricted intubation model - AUC | 68.00% | 64.44% | 63.85% | 63.85% | 64.34% |
| Restricted intubation model - weighted-F1 | 65.75% | 66.59% | 69.81% | 69.81% | 72.33% |

Data availability

Source code for processing patient data is provided together with the submission. Due to HIPAA restrictions and Data Use Agreements we cannot make the original patient data publicly available. Interested parties may submit a request to the authors to obtain access to de-identified data. The authors would request pertinent IRB approval to make available a de-identified version of the data, stripped of any protected health information as specified under HIPAA rules. The IRB of the hospital system approved the study under Protocol #2020P001112 and the Boston University IRB found the study to be Not Human Subject Research under Protocol #5570X (the BU team worked with a de-identified limited dataset).

References

  1. Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794.
  2. Chen R, Paschalidis IC (2018) A robust learning approach for regression models based on distributionally robust optimization. Journal of Machine Learning Research 19:1–48.
  3. Hao B, Sotudian S, Wang T, Xu T, Paschalidis IC (2020) COVID calculators. Network Optimization and Control Lab, Boston University. Accessed October 10, 2020.
  4. Zhu H, Paschalidis IC, Tahmasebi A (2018) Clinical concept extraction with contextual word embedding. NeurIPS Machine Learning for Health Workshop.

Article and author information

Author details

  1. Boran Hao

    Center for Information and Systems Engineering, Boston University, Boston, United States
    Contribution
    Software, Formal analysis, Validation, Methodology, Writing - original draft
    Contributed equally with
    Shahabeddin Sotudian, Taiyao Wang and Tingting Xu
    Competing interests
    No competing interests declared
  2. Shahabeddin Sotudian

    Center for Information and Systems Engineering, Boston University, Boston, United States
    Contribution
    Software, Formal analysis, Validation, Methodology, Writing - original draft
    Contributed equally with
    Boran Hao, Taiyao Wang and Tingting Xu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-5864-6192
  3. Taiyao Wang

    Center for Information and Systems Engineering, Boston University, Boston, United States
    Contribution
    Software, Formal analysis, Validation, Methodology, Writing - original draft
    Contributed equally with
    Boran Hao, Shahabeddin Sotudian and Tingting Xu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-0331-3892
  4. Tingting Xu

    Center for Information and Systems Engineering, Boston University, Boston, United States
    Contribution
    Software, Formal analysis, Validation, Methodology, Writing - original draft
    Contributed equally with
    Boran Hao, Shahabeddin Sotudian and Taiyao Wang
    Competing interests
    No competing interests declared
  5. Yang Hu

    Center for Information and Systems Engineering, Boston University, Boston, United States
    Contribution
    Software, Formal analysis, Validation
    Competing interests
    No competing interests declared
  6. Apostolos Gaitanidis

    Division of Trauma, Emergency Services, and Surgical Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, United States
    Contribution
    Resources, Data curation, Software, Supervision, Validation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  7. Kerry Breen

    Division of Trauma, Emergency Services, and Surgical Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, United States
    Contribution
    Resources, Data curation, Validation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  8. George C Velmahos

    Division of Trauma, Emergency Services, and Surgical Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, United States
    Contribution
    Conceptualization, Resources, Supervision, Validation, Investigation, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
  9. Ioannis Ch Paschalidis

    Center for Information and Systems Engineering, Boston University, Boston, United States
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    yannisp@bu.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-3343-2913

Funding

National Science Foundation (IIS-1914792)

  • Ioannis Ch Paschalidis

National Science Foundation (DMS-1664644)

  • Ioannis Ch Paschalidis

National Science Foundation (CNS-1645681)

  • Ioannis Ch Paschalidis

National Institute of General Medical Sciences (R01 GM135930)

  • Ioannis Ch Paschalidis

Office of Naval Research (N00014-19-1-2571)

  • Ioannis Ch Paschalidis

National Institutes of Health (UL54 TR004130)

  • Ioannis Ch Paschalidis

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research was partially supported by the NSF under grants IIS-1914792, DMS-1664644, and CNS-1645681, by the ONR under MURI grant N00014-19-1-2571, and by the NIH under grant R01 GM135930.

Ethics

Human subjects: The Institutional Review Board of Mass General Brigham reviewed and approved the study under Protocol #2020P001112. The Boston University IRB determined the study to be Not Human Subject Research under Protocol #5570X (the BU team worked with a de-identified limited dataset).

Version history

  1. Received: June 29, 2020
  2. Accepted: October 4, 2020
  3. Accepted Manuscript published: October 12, 2020 (version 1)
  4. Version of Record published: October 29, 2020 (version 2)

Copyright

© 2020, Hao et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Boran Hao
  2. Shahabeddin Sotudian
  3. Taiyao Wang
  4. Tingting Xu
  5. Yang Hu
  6. Apostolos Gaitanidis
  7. Kerry Breen
  8. George C Velmahos
  9. Ioannis Ch Paschalidis
(2020)
Early prediction of level-of-care requirements in patients with COVID-19
eLife 9:e60519.
https://doi.org/10.7554/eLife.60519
