Pancreatic cancer symptom trajectories from Danish registry data and free text in electronic health records

  1. Jessica Xin Hjaltelin
  2. Sif Ingibergsdóttir Novitski
  3. Isabella Friis Jørgensen
  4. Troels Siggaard
  5. Siri Amalie Vulpius
  6. David Westergaard
  7. Julia Sidenius Johansen
  8. Inna M Chen
  9. Lars Juhl Jensen
  10. Søren Brunak (corresponding author)
  1. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  2. Department of Oncology, Copenhagen University Hospital - Herlev and Gentofte, Denmark
  3. Copenhagen University Hospital, Rigshospitalet, Blegdamsvej, Denmark
4 figures, 1 table and 2 additional files

Figures

Figure 1 with 1 supplement
Comparison of pancreatic cancer symptoms in the Danish National Patient Registry (NPR) and electronic health records (EHRs).

(A) Symptoms before the pancreatic cancer diagnosis identified in NPR (N_NPR = 24), by text mining of clinical notes from EHRs (N_notes = 16) or in both data sources (N_both = 57). (B) The top 10 most frequent symptoms that are only found in the clinical notes. (C) The top 10 most frequent symptoms in NPR. (D) The most frequent symptoms from the clinical notes and NPR, from the list of 57 overlapping symptoms. *Some symptom names have been shortened for overview (see Supplementary file 1b).

Figure 1—figure supplement 1
Comparing symptoms found from text mining, registry (NPR) and established well-known symptoms of pancreatic cancer.

Type 2 diabetes/new-onset diabetes was not included in the well-known symptoms list (Supplementary file 1a), since this is coded as a disease in the ICD-10 chapter. This comparison only covers the ICD-10 symptom chapter (R chapter).

Figure 2 with 2 supplements
The most frequent text-mined symptoms from the clinical notes 5 years prior to pancreatic cancer diagnosis.

The most common and significant (p<0.05) symptoms in the text-mined clinical notes are shown with survival information and time to pancreatic cancer diagnosis (Supplementary file 3). The symptoms are extracted over a 5-year period up to the time of the first pancreatic cancer diagnosis. If a symptom is noted more than once in one hospital encounter, the symptom is counted once only. The purple bars indicate patients with survival ≤90 days and the green bars indicate patients with survival >90 days. Symptom names may have been shortened for overview (see Supplementary file 1b). Tumor stage has been calculated using the Danish Cancer Registry and the pancreatic cancer TNM staging classification version 7. Outlier dots have been removed to safeguard patient-sensitive information.
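The once-per-encounter counting rule described above can be illustrated with a minimal sketch. The tuple layout `(patient_id, encounter_id, symptom)` and the function name are assumptions made for illustration; they are not taken from the article's code.

```python
from collections import Counter


def count_symptoms_per_encounter(mentions):
    """Count each symptom at most once per (patient, encounter).

    `mentions` is an iterable of (patient_id, encounter_id, symptom)
    tuples; this layout is a hypothetical one chosen for the example.
    """
    # A set collapses repeated mentions within the same encounter.
    unique_mentions = set(mentions)
    return Counter(symptom for _, _, symptom in unique_mentions)


# Toy data: "nausea" is noted twice in encounter e1 and once in e2.
mentions = [
    ("p1", "e1", "nausea"),
    ("p1", "e1", "nausea"),
    ("p1", "e2", "nausea"),
    ("p2", "e3", "jaundice"),
]
counts = count_symptoms_per_encounter(mentions)
```

Here the duplicate mention in encounter `e1` contributes only one count, so "nausea" is counted twice in total (once for `e1`, once for `e2`).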

Figure 2—figure supplement 1
The most frequent registry-based symptoms from the NPR prior to pancreatic cancer diagnosis.

Symptoms are filtered to those that occur prior to a pancreatic cancer diagnosis and are significant (RR>1 and p-value<0.05). Outlier dots have been removed to safeguard patient-sensitive information.

Figure 2—figure supplement 2
Staging information for pancreatic cancer patients.

(A) Pancreatic cancer cases with at least one clinical note 5 years prior to the cancer diagnosis. (B) Pancreatic cancer patients without any clinical notes 5 years prior to the cancer diagnosis. The information was extracted from the Danish Cancer Registry.

Figure 3 with 2 supplements
Symptom trajectories before and after pancreatic cancer diagnosis.

(A) The registry symptom trajectories consist of significant disease pairs with a Relative Risk (RR) >1 (Supplementary file 1d). Each trajectory has a minimum of 100 patients. (B) Symptom trajectories from clinical notes consisting of significant disease pairs with Odds Ratio (OR) >1 (Supplementary file 1e). Each trajectory is followed by a minimum of 20 patients. The width of the trajectories indicates the number of patients. The purple-coloured trajectories represent patient groups with median survival ≤90 days. The green-coloured trajectories represent patient groups with median survival >90 days. Some symptom names have been shortened for overview.
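For readers less familiar with the two association measures named in this caption, the following sketch computes RR and OR from a plain 2x2 table using the standard textbook definitions. It is only an illustration: the cell names `a`, `b`, `c`, `d` and the toy counts are assumptions, and the article's own estimation procedure (including how comparison groups are formed and significance is tested) is not reproduced here.

```python
def relative_risk(a, b, c, d):
    """Textbook relative risk from a 2x2 table.

    a: first symptom present, second symptom follows
    b: first symptom present, second symptom does not follow
    c: first symptom absent, second symptom occurs
    d: first symptom absent, second symptom does not occur
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Textbook odds ratio from the same 2x2 table."""
    return (a * d) / (b * c)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rr = relative_risk(30, 70, 10, 90)
odds = odds_ratio(30, 70, 10, 90)
keep_pair = rr > 1  # a pair is only a trajectory candidate if the measure exceeds 1
```

With these toy counts the risk is 30% among those with the first symptom and 10% among those without, so the RR is about 3 and the pair would pass the >1 filter.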

Figure 3—figure supplement 1
Registry-based trajectories with both symptoms and diseases.

Significant length 3 and length 4 trajectories for the registry-based analysis using the 18,523 pancreatic cancer cases from the Danish National Patient Registry. The diseases and chapters are from the ICD-10 classification system. The length 4 trajectories have min. 80 patients per trajectory. The length 3 trajectories have min. 400 patients per trajectory.

Figure 3—figure supplement 2
Survival of patients in significant trajectories.

(A) Survival in months for patients in the registry-based disease trajectories for both length 3 and 4 (N=10,542). (B) Survival in months for patients in the text-mined symptom trajectories (N=311). The pancreatic cancer patient cohort in (B) is a subset of the pancreatic cancer patient cohort in (A).

Figure 4
The text mining pipeline.

A dictionary was generated with symptoms and expanded with word endings to capture multiple forms of the same symptom. Afterwards, the dictionary and the corpus (clinical notes) were tokenized to extract spelling errors. The spelling errors were then added to the dictionary. Finally, the program Tagcorpus (Pafilis et al., 2013) was used to tag the symptoms in the corpus. The text mining performance was then evaluated.
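The steps in this caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the helper names, the English toy terms, the ending list, and the use of `difflib.get_close_matches` to pick up spelling errors are all choices made for the example, and plain token lookup stands in for Tagcorpus.

```python
import re
from difflib import get_close_matches


def expand_with_endings(symptoms, endings):
    """Map every surface form (term plus optional ending) to its base symptom."""
    forms = {}
    for symptom in symptoms:
        forms[symptom] = symptom
        for ending in endings:
            forms[symptom + ending] = symptom
    return forms


def tokenize(text):
    """Lowercase and split into alphabetic tokens (Danish letters included)."""
    return re.findall(r"[a-zæøå]+", text.lower())


def add_spelling_variants(forms, corpus_tokens, cutoff=0.85):
    """Add corpus tokens that closely resemble a dictionary form.

    Fuzzy matching is an assumed stand-in for the spelling-error step.
    """
    known = list(forms)
    extended = dict(forms)
    for token in set(corpus_tokens):
        if token in extended:
            continue
        match = get_close_matches(token, known, n=1, cutoff=cutoff)
        if match:
            extended[token] = forms[match[0]]
    return extended


def tag(text, forms):
    """Return (token, base symptom) pairs for every dictionary hit."""
    return [(tok, forms[tok]) for tok in tokenize(text) if tok in forms]


# Toy corpus with one misspelling ("headche").
notes = [
    "Patient reports nausea and headaches.",
    "Persistent headche since last week.",
]
dictionary = expand_with_endings(["nausea", "headache"], ["s"])
corpus_tokens = [tok for note in notes for tok in tokenize(note)]
dictionary = add_spelling_variants(dictionary, corpus_tokens)
tags = [tag(note, dictionary) for note in notes]
```

In this toy run the misspelled token "headche" is added to the dictionary and mapped to "headache", so both notes yield symptom tags. A real evaluation step, as mentioned in the caption, would compare such tags against manually annotated notes.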

Tables

Table 1
Data set and patient characteristics.
General cohort information             The National Patient Registry (NPR)   Electronic Health Records (EHRs)
Data set timeline                      1994–2018                             2006–2016
N pancreatic cancer patients           23,592                                3078
N controls                             6.9 million                           30,780

Pancreatic cancer cohort information
Female                                 9328 (50.4%)                          1506 (48.9%)
Male                                   9195 (49.6%)                          1572 (51.1%)
Mean age at diagnosis (female/male)    73/69                                 72/70

Age distributions (years)
<40                                    139 (0.8%)                            16 (0.52%)
40–50                                  733 (4.0%)                            107 (3.48%)
50–60                                  2523 (13.6%)                          352 (11.4%)
60–70                                  5288 (28.5%)                          941 (30.6%)
70–80                                  6017 (32.5%)                          1037 (33.7%)
>80                                    3821 (20.6%)                          625 (20.3%)

Additional files

Jessica Xin Hjaltelin, Sif Ingibergsdóttir Novitski, Isabella Friis Jørgensen, Troels Siggaard, Siri Amalie Vulpius, David Westergaard, Julia Sidenius Johansen, Inna M Chen, Lars Juhl Jensen, Søren Brunak (2023)
Pancreatic cancer symptom trajectories from Danish registry data and free text in electronic health records
eLife 12:e84919.
https://doi.org/10.7554/eLife.84919