
Augmented curation of clinical notes from a massive EHR system reveals symptoms of impending COVID-19 diagnosis

Short Report
Cite this article as: eLife 2020;9:e58227 doi: 10.7554/eLife.58227
1 figure, 2 tables and 2 additional files


Figure 1 with 2 supplements
Augmented curation of the unstructured clinical notes and comparison of symptoms between COVIDpos vs. COVIDneg patients.

(a) Augmented curation of the unstructured clinical notes from Electronic Health Records (EHRs). (b) COVID-19-related symptom entity recognition, sentiment analysis and grouping of synonyms. (c) Comparison of symptoms extracted from EHR clinical notes of COVIDpos vs. COVIDneg patients.

Figure 1—figure supplement 1
SciBERT Architecture and Training Configuration.
Figure 1—figure supplement 2
Examples of Sentence Classification Used in Training a SciBERT Model for Phenotype/Symptom Sentiment Analysis.


Table 1
Augmented curation of the unstructured clinical notes from the EHR reveals specific clinically confirmed phenotypes that are amplified in COVIDpos patients over COVIDneg patients in the week prior to the SARS-CoV-2 PCR testing date.

The key COVIDpos amplified symptoms in the week preceding PCR testing (i.e. day = −7 to day = −1) are highlighted in gray (p-value<1E-10). The ratio of COVIDpos to COVIDneg proportions represents the fold change amplification of each phenotype in the COVIDpos patient set (symptoms are sorted based on this column).

| Symptom | COVIDpos count (%) (N = 2317) | COVIDneg count (%) (N = 74850) | Relative ratio (COVID+/COVID-) | 95% CI | 2-tailed p-value | BH-corrected p-value |
|---|---|---|---|---|---|---|
| Altered or diminished sense of taste or smell | 145 (6.3%) | 173 (0.2%) | 27.08 | (21.81, 33.62) | <1E-300 | <1E-300 |
| Fever/chills | 750 (32.4%) | 9421 (12.6%) | 2.57 | (2.42, 2.74) | 3.57E-169 | 4.64E-168 |
| Cough | 769 (33.2%) | 11083 (14.8%) | 2.24 | (2.11, 2.38) | 4.60E-129 | 3.99E-128 |
| Respiratory difficulty | 681 (29.4%) | 10082 (13.5%) | 2.18 | (2.04, 2.33) | 3.06E-105 | 1.99E-104 |
| Myalgia/Arthralgia | 288 (12.4%) | 4620 (6.2%) | 2.01 | (1.8, 2.25) | 5.35E-34 | 2.78E-33 |
| Rhinitis | 200 (8.6%) | 2947 (3.9%) | 2.19 | (1.92, 2.52) | 2.25E-29 | 9.75E-29 |
| Headache | 325 (14.0%) | 6124 (8.2%) | 1.71 | (1.55, 1.9) | 1.34E-23 | 4.98E-23 |
| Congestion | 228 (9.8%) | 4261 (5.7%) | 1.73 | (1.53, 1.96) | 4.45E-17 | 1.45E-16 |
| GI upset | 195 (8.4%) | 10670 (14.3%) | 0.59 | (0.52, 0.68) | 1.74E-15 | 5.03E-15 |
| Wheezing | 49 (2.1%) | 3765 (5.0%) | 0.42 | (0.32, 0.56) | 1.82E-10 | 4.73E-10 |
| Dermatitis | 26 (1.1%) | 2519 (3.4%) | 0.33 | (0.23, 0.5) | 2.60E-09 | 6.15E-09 |
| Generalized symptoms | 169 (7.3%) | 8129 (10.9%) | 0.67 | (0.58, 0.78) | 4.82E-08 | 1.04E-07 |
| Respiratory Failure | 73 (3.2%) | 1363 (1.8%) | 1.73 | (1.38, 2.19) | 3.09E-06 | 6.18E-06 |
| Diarrhea | 228 (9.8%) | 5452 (7.3%) | 1.35 | (1.19, 1.53) | 3.47E-06 | 6.44E-06 |
| Pharyngitis | 160 (6.9%) | 3635 (4.9%) | 1.42 | (1.22, 1.66) | 7.05E-06 | 1.22E-05 |
| Chest pain/pressure | 148 (6.4%) | 6122 (8.2%) | 0.78 | (0.67, 0.92) | 1.88E-03 | 3.06E-03 |
| Change in appetite/intake | 95 (4.1%) | 2271 (3.0%) | 1.35 | (1.11, 1.66) | 3.37E-03 | 5.15E-03 |
| Otitis | 13 (0.6%) | 874 (1.2%) | 0.48 | (0.29, 0.85) | 6.98E-03 | 1.01E-02 |
| Cardiac | 95 (4.1%) | 2443 (3.3%) | 1.26 | (1.03, 1.54) | 2.62E-02 | 3.59E-02 |
| Fatigue | 229 (9.9%) | 8268 (11.0%) | 0.89 | (0.79, 1.02) | 7.83E-02 | 1.02E-01 |
| Conjunctivitis | 9 (0.4%) | 167 (0.2%) | 1.74 | (0.95, 3.52) | 1.00E-01 | 1.24E-01 |
| Dry mouth | 5 (0.2%) | 316 (0.4%) | 0.51 | (0.24, 1.3) | 1.28E-01 | 1.51E-01 |
| Hemoptysis | 13 (0.6%) | 283 (0.4%) | 1.48 | (0.89, 2.65) | 1.61E-01 | 1.78E-01 |
| Dysuria | 16 (0.7%) | 732 (1.0%) | 0.71 | (0.45, 1.18) | 1.64E-01 | 1.78E-01 |
| Diaphoresis | 35 (1.5%) | 979 (1.3%) | 1.15 | (0.84, 1.63) | 3.99E-01 | 4.15E-01 |
| Neuro | 150 (6.5%) | 4952 (6.6%) | 0.98 | (0.84, 1.15) | 7.86E-01 | 7.86E-01 |
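
The Table 1 statistics can be illustrated with a minimal Python sketch. It computes the ratio of COVIDpos to COVIDneg proportions from the published counts and applies a Benjamini-Hochberg (BH) adjustment to the published 2-tailed p-values. The function names are hypothetical, and the confidence interval uses the Katz log approximation as an assumption; the source does not state how its intervals were computed, so the sketch may not reproduce them exactly. The `<1E-300` entry is replaced by the placeholder value `1e-300`.

```python
import math

def relative_ratio(pos_count, pos_total, neg_count, neg_total, z=1.96):
    """Ratio of proportions with an approximate 95% CI (Katz log method, assumed)."""
    ratio = (pos_count / pos_total) / (neg_count / neg_total)
    se_log = math.sqrt(1 / pos_count - 1 / pos_total + 1 / neg_count - 1 / neg_total)
    return ratio, ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)

def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Altered or diminished sense of taste or smell: 145/2317 vs. 173/74850.
rr, ci_low, ci_high = relative_ratio(145, 2317, 173, 74850)

# 2-tailed p-values from Table 1, in table order (first entry is a placeholder).
p_values = [
    1e-300, 3.57e-169, 4.60e-129, 3.06e-105, 5.35e-34, 2.25e-29, 1.34e-23,
    4.45e-17, 1.74e-15, 1.82e-10, 2.60e-09, 4.82e-08, 3.09e-06, 3.47e-06,
    7.05e-06, 1.88e-03, 3.37e-03, 6.98e-03, 2.62e-02, 7.83e-02, 1.00e-01,
    1.28e-01, 1.61e-01, 1.64e-01, 3.99e-01, 7.86e-01,
]
adjusted = benjamini_hochberg(p_values)

print(f"relative ratio = {rr:.2f}, approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"BH-adjusted p-value for fever/chills = {adjusted[1]:.2e}")
```

With 26 symptoms tested, the BH adjustment multiplies each p-value by 26 divided by its rank, which is consistent with the BH-corrected column above (for example, 3.57E-169 × 26/2 ≈ 4.64E-168 for fever/chills).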
Table 2
Temporal analysis of the EHR clinical notes for the week preceding PCR testing (i.e. day −7 to day −1), leading up to the day of PCR testing (day 0) in COVIDpos and COVIDneg patients.

Temporal enrichment for each symptom is quantified as the ratio of the COVIDpos patient proportion to the COVIDneg patient proportion on each day. The patient proportions in the rows labeled ‘Positive’ and ‘Negative’ represent the fraction of COVIDpos (n = 2,317) and COVIDneg (n = 74,850) patients with the specified symptom on each day. Symptoms with p-value < 1E-10 are highlighted in green, and those with 1E-10 < p-value < 1E-03 in gray.

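| Symptom | COVID-19 (N = 77167) | Day = −7 | Day = −6 | Day = −5 | Day = −4 | Day = −3 | Day = −2 | Day = −1 |
|---|---|---|---|---|---|---|---|---|
| Altered or diminished sense of taste or smell | Positive (n = 2317) | 4.75E-03 | 3.88E-03 | 3.45E-03 | 2.59E-03 | 1.73E-03 | 0.00E+00 | 4.75E-03 |
| | Negative (n = 74850) | 1.07E-04 | 4.01E-05 | 1.07E-04 | 1.07E-04 | 9.35E-05 | 2.27E-04 | 9.75E-04 |
| | Ratio (Positive/Negative) | 44.42 | 96.91 | 32.30 | 24.23 | 18.46 | 0.00 | 4.87 |
| (symptom label not recoverable) | Ratio (Positive/Negative) | 5.22 | 4.31 | 3.64 | 3.08 | 2.41 | 1.03 | 0.91 |
| (symptom label not recoverable) | Ratio (Positive/Negative) | 2.22 | 1.82 | 1.32 | 1.06 | 1.04 | 0.46 | 0.71 |
| (symptom label not recoverable) | Ratio (Positive/Negative) | 7.12 | 5.88 | 4.98 | 3.81 | 2.90 | 0.96 | 1.06 |
| Respiratory Difficulty | Positive | 2.24E-02 | 2.11E-02 | 1.81E-02 | 1.55E-02 | 1.25E-02 | 8.20E-03 | 5.35E-02 |
| | Ratio (Positive/Negative) | 4.43 | 3.71 | 3.12 | 2.65 | 2.03 | 0.95 | 0.70 |
| Change in appetite/intake | Positive | 1.73E-03 | 1.73E-03 | 1.73E-03 | 5.18E-03 | 4.32E-03 | 5.61E-03 | 1.86E-02 |
| | Ratio (Positive/Negative) | 1.33 | 1.27 | 1.29 | 3.73 | 3.08 | 2.94 | 1.37 |
| (symptom label not recoverable) | Ratio (Positive/Negative) | 3.65 | 2.98 | 2.45 | 2.16 | 1.95 | 0.65 | 1.41 |
| (symptom label not recoverable) | Ratio (Positive/Negative) | 3.54 | 2.72 | 2.62 | 1.46 | 2.38 | 0.73 | 0.74 |
| (symptom label not recoverable) | Ratio (Positive/Negative) | 6.32 | 4. | | | | | |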
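
The per-day enrichment in Table 2 is a simple ratio of proportions, which the following Python sketch illustrates using the published daily proportions for altered or diminished sense of taste or smell. The function name `daily_enrichment` is hypothetical, and returning NaN for a zero denominator is an assumption of this sketch. Because the inputs are the rounded proportions shown in the table, the results match the published ratios only approximately.

```python
DAYS = [-7, -6, -5, -4, -3, -2, -1]

# Daily patient proportions from Table 2 (altered or diminished sense of taste or smell).
positive = [4.75e-3, 3.88e-3, 3.45e-3, 2.59e-3, 1.73e-3, 0.0, 4.75e-3]
negative = [1.07e-4, 4.01e-5, 1.07e-4, 1.07e-4, 9.35e-5, 2.27e-4, 9.75e-4]

def daily_enrichment(pos, neg):
    """COVIDpos proportion divided by COVIDneg proportion for each day."""
    return [p / n if n > 0 else float("nan") for p, n in zip(pos, neg)]

ratios = daily_enrichment(positive, negative)
for day, ratio in zip(DAYS, ratios):
    print(f"day {day}: ratio = {ratio:.2f}")
```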

Data availability

The Mayo Clinic EHR dataset on which augmented curation was conducted was accessed under IRB 20-003278, "Study of COVID-19 patient characteristics with augmented curation of Electronic Health Records (EHR) to inform strategic and operational decisions". The EHR data cannot be shared or released due to HIPAA regulations. Contact the corresponding authors for additional details regarding the IRB, and refer to the Mayo Clinic IRB website for further details on our commitment to patient privacy (https://www.mayo.edu/research/institutional-review-board/overview). The summary statistics derived from the EHRs are enclosed within the manuscript.

Additional files

Supplementary file 1

Enrichment of diagnosis codes.

(A) Enrichment of diagnosis codes amongst COVIDpos patients in the week preceding PCR testing. (B) Enrichment of diagnosis codes amongst COVIDneg patients in the week preceding PCR testing. (C) Symptoms and their synonyms used for the EHR analysis. (D) Pairwise analysis of symptoms in the COVIDpos and COVIDneg cohorts. The pairwise symptom combinations with BH-corrected p-value<0.01 are summarized. (E) Patients with at least one clinical note over time. (F) SciBERT vs. BioClinicalBERT Phenotype Sentiment Model Performance on 18,490 sentences. (G) Model Performance Trained on 18,490 Sentences Containing 250 Different Cardiovascular, Pulmonary, and Metabolic Phenotypes. (H) Model Performance Trained on 21,678 Sentences Containing 250 Different Cardiovascular, Pulmonary, and Metabolic Phenotypes and Expanded to Include 26 COVID-related Symptoms. (I) Synonym classification model performance.

Transparent reporting form
