1 figure, 2 tables and 2 additional files


Figure 1 with 2 supplements
Augmented curation of the unstructured clinical notes and comparison of symptoms between COVIDpos vs. COVIDneg patients.

(a) Augmented curation of the unstructured clinical notes from Electronic Health Records (EHRs). (b) COVID-19-related symptom entity recognition, sentiment analysis and grouping of synonyms. (c) Comparison of symptoms extracted from EHR clinical notes of COVIDpos vs. COVIDneg patients.

Figure 1—figure supplement 1
SciBERT Architecture and Training Configuration.
Figure 1—figure supplement 2
Examples of Sentence Classification Used in Training a SciBERT Model for Phenotype/Symptom Sentiment Analysis.


Table 1
Augmented curation of the unstructured clinical notes from the EHR reveals specific clinically confirmed phenotypes that are amplified in COVIDpos patients over COVIDneg patients in the week prior to the SARS-CoV-2 PCR testing date.

The key COVIDpos amplified symptoms in the week preceding PCR testing (i.e. day = −7 to day = −1) are highlighted in gray (p-value<1E-10). The ratio of COVIDpos to COVIDneg proportions represents the fold change amplification of each phenotype in the COVIDpos patient set (symptoms are sorted based on this column).

| Symptom | COVIDpos count (%) (N = 2317) | COVIDneg count (%) (N = 74850) | Relative ratio (COVID+/COVID-) (95% CI) | 2-tailed p-value | BH-corrected p-value |
| --- | --- | --- | --- | --- | --- |
| Altered or diminished sense of taste or smell | 145 (6.3%) | 173 (0.2%) | 27.08 (21.81, 33.62) | <1E-300 | <1E-300 |
| Fever/chills | 750 (32.4%) | 9421 (12.6%) | 2.57 (2.42, 2.74) | 3.57E-169 | 4.64E-168 |
| Cough | 769 (33.2%) | 11083 (14.8%) | 2.24 (2.11, 2.38) | 4.60E-129 | 3.99E-128 |
| Respiratory difficulty | 681 (29.4%) | 10082 (13.5%) | 2.18 (2.04, 2.33) | 3.06E-105 | 1.99E-104 |
| Myalgia/Arthralgia | 288 (12.4%) | 4620 (6.2%) | 2.01 (1.8, 2.25) | 5.35E-34 | 2.78E-33 |
| Rhinitis | 200 (8.6%) | 2947 (3.9%) | 2.19 (1.92, 2.52) | 2.25E-29 | 9.75E-29 |
| Headache | 325 (14.0%) | 6124 (8.2%) | 1.71 (1.55, 1.9) | 1.34E-23 | 4.98E-23 |
| Congestion | 228 (9.8%) | 4261 (5.7%) | 1.73 (1.53, 1.96) | 4.45E-17 | 1.45E-16 |
| GI upset | 195 (8.4%) | 10670 (14.3%) | 0.59 (0.52, 0.68) | 1.74E-15 | 5.03E-15 |
| Wheezing | 49 (2.1%) | 3765 (5.0%) | 0.42 (0.32, 0.56) | 1.82E-10 | 4.73E-10 |
| Dermatitis | 26 (1.1%) | 2519 (3.4%) | 0.33 (0.23, 0.5) | 2.60E-09 | 6.15E-09 |
| Generalized symptoms | 169 (7.3%) | 8129 (10.9%) | 0.67 (0.58, 0.78) | 4.82E-08 | 1.04E-07 |
| Respiratory Failure | 73 (3.2%) | 1363 (1.8%) | 1.73 (1.38, 2.19) | 3.09E-06 | 6.18E-06 |
| Diarrhea | 228 (9.8%) | 5452 (7.3%) | 1.35 (1.19, 1.53) | 3.47E-06 | 6.44E-06 |
| Pharyngitis | 160 (6.9%) | 3635 (4.9%) | 1.42 (1.22, 1.66) | 7.05E-06 | 1.22E-05 |
| Chest pain/pressure | 148 (6.4%) | 6122 (8.2%) | 0.78 (0.67, 0.92) | 1.88E-03 | 3.06E-03 |
| Change in appetite/intake | 95 (4.1%) | 2271 (3.0%) | 1.35 (1.11, 1.66) | 3.37E-03 | 5.15E-03 |
| Otitis | 13 (0.6%) | 874 (1.2%) | 0.48 (0.29, 0.85) | 6.98E-03 | 1.01E-02 |
| Cardiac | 95 (4.1%) | 2443 (3.3%) | 1.26 (1.03, 1.54) | 2.62E-02 | 3.59E-02 |
| Fatigue | 229 (9.9%) | 8268 (11.0%) | 0.89 (0.79, 1.02) | 7.83E-02 | 1.02E-01 |
| Conjunctivitis | 9 (0.4%) | 167 (0.2%) | 1.74 (0.95, 3.52) | 1.00E-01 | 1.24E-01 |
| Dry mouth | 5 (0.2%) | 316 (0.4%) | 0.51 (0.24, 1.3) | 1.28E-01 | 1.51E-01 |
| Hemoptysis | 13 (0.6%) | 283 (0.4%) | 1.48 (0.89, 2.65) | 1.61E-01 | 1.78E-01 |
| Dysuria | 16 (0.7%) | 732 (1.0%) | 0.71 (0.45, 1.18) | 1.64E-01 | 1.78E-01 |
| Diaphoresis | 35 (1.5%) | 979 (1.3%) | 1.15 (0.84, 1.63) | 3.99E-01 | 4.15E-01 |
| Neuro | 150 (6.5%) | 4952 (6.6%) | 0.98 (0.84, 1.15) | 7.86E-01 | 7.86E-01 |

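
The derived columns of Table 1 can be checked from its counts and 2-tailed p-values. The sketch below is a minimal illustration, not the authors' code. The helper names `relative_ratio` and `bh_adjust` are hypothetical, the log-method confidence interval is an assumption (the table does not state how its intervals were computed), and `0.0` stands in for the censored value `<1E-300`.

```python
import math


def relative_ratio(pos_count, pos_total, neg_count, neg_total, z=1.96):
    """Ratio of two proportions with an approximate 95% CI.

    The interval uses the log method, which is an assumption here.
    """
    ratio = (pos_count / pos_total) / (neg_count / neg_total)
    se_log = math.sqrt(
        1 / pos_count - 1 / pos_total + 1 / neg_count - 1 / neg_total
    )
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# First row of Table 1: altered or diminished sense of taste or smell.
ratio, lower, upper = relative_ratio(145, 2317, 173, 74850)

# 2-tailed p-values of Table 1 in row order (0.0 stands in for "<1E-300").
table1_p = [
    0.0, 3.57e-169, 4.60e-129, 3.06e-105, 5.35e-34, 2.25e-29, 1.34e-23,
    4.45e-17, 1.74e-15, 1.82e-10, 2.60e-09, 4.82e-08, 3.09e-06, 3.47e-06,
    7.05e-06, 1.88e-03, 3.37e-03, 6.98e-03, 2.62e-02, 7.83e-02, 1.00e-01,
    1.28e-01, 1.61e-01, 1.64e-01, 3.99e-01, 7.86e-01,
]
table1_bh = bh_adjust(table1_p)

print(f"ratio={ratio:.2f}, CI=({lower:.2f}, {upper:.2f})")
print(f"BH-adjusted p-value for Fever/chills: {table1_bh[1]:.2e}")
```

Under these assumptions the first row reproduces the published ratio, and the interval lands within a few hundredths of the published one. The monotonicity step in `bh_adjust` is why Hemoptysis and Dysuria share the same corrected value in the table.
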
Table 2
Temporal analysis of the EHR clinical notes for the week preceding PCR testing (i.e. day −7 to day −1), leading up to the day of PCR testing (day 0) in COVIDpos and COVIDneg patients.

Temporal enrichment for each symptom is quantified using the ratio of COVIDpos patient proportion over the COVIDneg patient proportion for each day. The patient proportions in the rows labeled ‘Positive’ and ‘Negative’ represent the fraction of COVIDpos (n = 2,317) and COVIDneg (n = 74,850) patients with the specified symptom on each day. Symptoms with p-value<1E-10 are highlighted in green, and those with 1E-10 < p-value<1E-03 in gray.

| Symptom | COVID-19 (N = 77167) | Day = −7 | Day = −6 | Day = −5 | Day = −4 | Day = −3 | Day = −2 | Day = −1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Altered or diminished sense of taste or smell | Positive (n = 2317) | 4.75E-03 | 3.88E-03 | 3.45E-03 | 2.59E-03 | 1.73E-03 | 0.00E+00 | 4.75E-03 |
| | Negative (n = 74850) | 1.07E-04 | 4.01E-05 | 1.07E-04 | 1.07E-04 | 9.35E-05 | 2.27E-04 | 9.75E-04 |
| | Ratio (Positive/Negative) | 44.42 | 96.91 | 32.30 | 24.23 | 18.46 | 0.00 | 4.87 |
| | Ratio (Positive/Negative) | 5.22 | 4.31 | 3.64 | 3.08 | 2.41 | 1.03 | 0.91 |
| | Ratio (Positive/Negative) | 2.22 | 1.82 | 1.32 | 1.06 | 1.04 | 0.46 | 0.71 |
| | Ratio (Positive/Negative) | 7.12 | 5.88 | 4.98 | 3.81 | 2.90 | 0.96 | 1.06 |
| Respiratory Difficulty | Positive | 2.24E-02 | 2.11E-02 | 1.81E-02 | 1.55E-02 | 1.25E-02 | 8.20E-03 | 5.35E-02 |
| | Ratio (Positive/Negative) | 4.43 | 3.71 | 3.12 | 2.65 | 2.03 | 0.95 | 0.70 |
| Change in appetite/intake | Positive | 1.73E-03 | 1.73E-03 | 1.73E-03 | 5.18E-03 | 4.32E-03 | 5.61E-03 | 1.86E-02 |
| | Ratio (Positive/Negative) | 1.33 | 1.27 | 1.29 | 3.73 | 3.08 | 2.94 | 1.37 |
| | Ratio (Positive/Negative) | 3.65 | 2.98 | 2.45 | 2.16 | 1.95 | 0.65 | 1.41 |
| | Ratio (Positive/Negative) | 3.54 | 2.72 | 2.62 | 1.46 | 2.38 | 0.73 | 0.74 |
| | Ratio (Positive/Negative) | 6.32 | 4. | | | | | |
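
The per-day enrichment described in the Table 2 caption is the COVIDpos proportion divided by the COVIDneg proportion for that day. The following is a small worked check, offered as a sketch only. `daily_enrichment` is a hypothetical helper, and the inputs are the rounded proportions printed in the first symptom's rows, so the results can differ slightly from the published ratios.

```python
def daily_enrichment(pos_props, neg_props):
    """Per-day COVIDpos/COVIDneg ratio; None when the COVIDneg proportion is zero."""
    return [p / n if n > 0 else None for p, n in zip(pos_props, neg_props)]


days = list(range(-7, 0))  # day -7 through day -1

# Rounded proportions for altered or diminished sense of taste or smell.
pos = [4.75e-03, 3.88e-03, 3.45e-03, 2.59e-03, 1.73e-03, 0.00e+00, 4.75e-03]
neg = [1.07e-04, 4.01e-05, 1.07e-04, 1.07e-04, 9.35e-05, 2.27e-04, 9.75e-04]

ratios = daily_enrichment(pos, neg)

# Day with the strongest enrichment, skipping undefined ratios.
peak_index = max(
    range(len(ratios)),
    key=lambda i: ratios[i] if ratios[i] is not None else float("-inf"),
)
peak_day = days[peak_index]

for day, value in zip(days, ratios):
    print(day, "undefined" if value is None else f"{value:.2f}")
```

With these inputs the largest ratio falls on day −6, in line with the published row.
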

Additional files

Supplementary file 1

Enrichment of diagnosis codes.

(A) Enrichment of diagnosis codes amongst COVIDpos patients in the week preceding PCR testing. (B) Enrichment of diagnosis codes amongst COVIDneg patients in the week preceding PCR testing. (C) Symptoms and their synonyms used for the EHR analysis. (D) Pairwise analysis of symptoms in the COVIDpos and COVIDneg cohorts. The pairwise symptom combinations with BH-corrected p-value<0.01 are summarized. (E) Patients with at least one clinical note over time. (F) SciBERT vs. BioClinicalBERT Phenotype Sentiment Model Performance on 18,490 sentences. (G) Model Performance Trained on 18,490 Sentences Containing 250 Different Cardiovascular, Pulmonary, and Metabolic Phenotypes. (H) Model Performance Trained on 21,678 Sentences Containing 250 Different Cardiovascular, Pulmonary, and Metabolic Phenotypes and Expanded to Include 26 COVID-related Symptoms. (I) Synonym classification model performance.

Transparent reporting form

Cite this article

  1. Tyler Wagner
  2. FNU Shweta
  3. Karthik Murugadoss
  4. Samir Awasthi
  5. AJ Venkatakrishnan
  6. Sairam Bade
  7. Arjun Puranik
  8. Martin Kang
  9. Brian W Pickering
  10. John C O'Horo
  11. Philippe R Bauer
  12. Raymund R Razonable
  13. Paschalis Vergidis
  14. Zelalem Temesgen
  15. Stacey Rizza
  16. Maryam Mahmood
  17. Walter R Wilson
  18. Douglas Challener
  19. Praveen Anand
  20. Matt Liebers
  21. Zainab Doctor
  22. Eli Silvert
  23. Hugo Solomon
  24. Akash Anand
  25. Rakesh Barve
  26. Gregory Gores
  27. Amy W Williams
  28. William G Morice II
  29. John Halamka
  30. Andrew Badley
  31. Venky Soundararajan
Augmented curation of clinical notes from a massive EHR system reveals symptoms of impending COVID-19 diagnosis
eLife 9:e58227.