1. Computational and Systems Biology
  2. Epidemiology and Global Health

Linking glycemic dysregulation in diabetes to symptoms, comorbidities, and genetics through EHR data mining

  1. Isa Kristina Kirk
  2. Christian Simon
  3. Karina Banasik
  4. Peter Christoffer Holm
  5. Amalie Dahl Haue
  6. Peter Bjødstrup Jensen
  7. Lars Juhl Jensen
  8. Cristina Leal Rodríguez
  9. Mette Krogh Pedersen
  10. Robert Eriksson
  11. Henrik Ullits Andersen
  12. Thomas Almdal
  13. Jette Bork-Jensen
  14. Niels Grarup
  15. Knut Borch-Johnsen
  16. Oluf Pedersen
  17. Flemming Pociot
  18. Torben Hansen
  19. Regine Bergholdt
  20. Peter Rossing (corresponding author)
  21. Søren Brunak (corresponding author)
  1. University of Copenhagen, Denmark
  2. Odense University Hospital, Denmark
  3. Steno Diabetes Center Copenhagen, Denmark
  4. Rigshospitalet, Denmark
  5. Holbæk Hospital, Denmark
  6. Herlev-Gentofte Hospital, Denmark
  7. Technical University of Denmark, Denmark
Research Article
Cite this article as: eLife 2019;8:e44941 doi: 10.7554/eLife.44941
4 figures and 7 additional files

Figures

Figure 1 with 6 supplements
Comparison of distributions of ICD-10 diagnosis codes with and without text mining.

(A) Percentage of diagnosis codes belonging to the different ICD-10 chapters and the relative increase in diagnosis codes from the different chapters when combining the text-mined and assigned codes. (B) Age distributions of text-mined and assigned ICD-10 diagnosis codes from the SDCC corpus divided into the 21 ICD-10 chapters.

Figure 1—figure supplement 1
Distribution of patients per physiological and biochemical test.

Number of unique patients who have had a given biochemical test. The plot shows that the majority of biochemical tests were performed on only a few individuals. The red lines mark 25%, 50%, and 75% of individuals in the cohort: 26, 41, 64, and 356 biochemical tests were taken in 75%, 50%, 25%, and less than 25% of the cohort, respectively.

Figure 1—figure supplement 2
Physiological and biochemical tests in the SDCC corpus.

Bar plots of the number of unique individuals who have had each test taken (grey bars) and the number of times each individual has had the test taken (red outlined bars).

Figure 1—figure supplement 3
Linear Discriminant Analysis 1 (LDA).

Linear Discriminant Analysis (LDA) was performed on the biochemical tests for the 71 clusters with at least 50 individuals. Linear discriminants (LD) 1 and 2 (A) and 1 and 3 (B) are shown using the biochemical test identifiers. Identifiers in blue contribute most to the variance among clusters for LD1, identifiers in purple contribute most to the variance for LD2 or LD3, and identifiers in green are common across LD1 and LD2 or LD1 and LD3.

Figure 1—figure supplement 4
Linear Discriminant Analysis 2 (LDA).

The three identifiers contributing most to the variance in the LDA (NPU04998, NPU18004, and SDCNOTAT_BTSys) were removed, and a new LDA was performed. (A) and (B) display the relationship between LD1, LD2, and LD3. Identifiers in blue contribute most to the variance between clusters for LD1, identifiers in purple contribute most to the variance for LD2 and LD3, and identifiers in green are common to LD1 and LD2, or LD2 and LD3.

Figure 1—figure supplement 5
Distribution of HbA1c measurements for T1D and T2D patients.

The vertical line corresponds to the HbA1c threshold used when defining dysregulation.

Figure 1—figure supplement 6
Biochemical patterns for the level of glycemic dysregulation.

The groups are based on the number of parameters of glycemic dysregulation. For each biochemical test, a MANOVA test was performed to detect any differences in means among the groups (Bonferroni adj. p-value<=0.01). Significant results are marked with an asterisk (*). Subsequently, a Kolmogorov-Smirnov test was applied to determine whether the distribution of mean biochemical values for each group was significantly higher or lower than that of the other groups (Bonferroni adj. p-value<=0.01). Blue indicates mean distributions that are significantly higher than the other groups, and red indicates significantly lower distributions. The grey, less saturated color indicates that the distribution within the group was not significantly different from the other groups. B = blood, P = plasma, S = serum, U = urine.
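
The multiple-testing correction and distribution comparison mentioned in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names `bonferroni` and `ks_statistic` are hypothetical, the example p-values are made up, and only the two-sample Kolmogorov-Smirnov statistic (not its p-value or a one-sided variant) is computed.

```python
from bisect import bisect_right


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Illustrative values only (not from the study).
adjusted = bonferroni([0.25, 0.75])
significant = [p <= 0.01 for p in adjusted]  # threshold taken from the caption
```

A statistic near 1 means the two groups barely overlap, while a value near 0 means their distributions are almost identical.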

Figure 2 with 5 supplements
Phenotypic clusters found in the SDCC cohort.

The clustering was created with diagnosis vectors of 13,928 patients (with text in the record) comprising both text-mined and assigned ICD-10 codes. A total of 172 clusters were created, capturing 11,208 patients (80.47%); clusters with five or fewer patients were discarded for statistical reasons. (A) Each node represents a patient within the corpus, colored by its assignment to one of the 172 unique clusters. (B) The 71 clusters with at least 50 patients, colored with the same palette as in (A).
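
As a rough illustration of the size filter and the coverage figure quoted in this caption, the sketch below drops small clusters and reports the share of patients retained. The function name `summarize_clusters`, the `min_size` parameter, and the toy assignment are assumptions for illustration; only the counts 11,208 and 13,928 come from the caption.

```python
from collections import Counter


def summarize_clusters(assignments, min_size=6):
    """Keep clusters with at least `min_size` patients and report coverage.

    `assignments` maps a patient id to a cluster label. With min_size=6,
    clusters with five or fewer patients are discarded.
    """
    sizes = Counter(assignments.values())
    kept = {cluster: n for cluster, n in sizes.items() if n >= min_size}
    coverage = sum(kept.values()) / len(assignments)
    return kept, coverage


# Toy data (hypothetical): eight patients in cluster "A", two in cluster "B".
toy = {f"p{i}": "A" for i in range(8)}
toy.update({"p8": "B", "p9": "B"})
kept, coverage = summarize_clusters(toy)

# Coverage reported in the caption: 11,208 of 13,928 patients.
reported_coverage = round(100 * 11208 / 13928, 2)
```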

Figure 2—figure supplement 1
Density of days in contact with SDCC for each cluster.

Density diagram of the number of days an individual has been connected to Steno Diabetes Center Copenhagen (SDCC), shown for each cluster. The black line indicates the mean SDCC connection time for the entire cohort. Some clusters, for example cluster 1, show two peaks, indicating that there are at least two groups of individuals in the cluster: one connected to SDCC for longer than the cohort average, and one connected for less.

Figure 2—figure supplement 2
Distribution of assigned primary diabetes type for each cluster.

Figure 2—figure supplement 3
Distribution of age for each cluster.

Figure 2—figure supplement 4
Distribution of duration of diabetes for each cluster.

The diabetes duration distribution for individuals in each cluster. Each bar corresponds to a bin covering a given interval of diabetes duration. The height of each bar is the percentage of individuals in the cluster who fall in that duration bin. The diabetes duration is calculated as the difference in years between diabetes onset and the date of the latest SDCC data entry.
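
The duration calculation and binning described in this caption could look roughly like the sketch below. The 365.25-day year, the 10-year bin width, and the names `duration_years` and `bin_percentages` are assumptions; the caption does not state the bin intervals used.

```python
from collections import Counter
from datetime import date


def duration_years(onset, latest_entry):
    """Years between diabetes onset and the latest data entry (365.25-day years)."""
    return (latest_entry - onset).days / 365.25


def bin_percentages(durations, bin_width=10):
    """Percentage of individuals per duration bin, keyed by the bin's lower edge."""
    counts = Counter(int(d // bin_width) * bin_width for d in durations)
    total = len(durations)
    return {edge: 100 * n / total for edge, n in sorted(counts.items())}


# Hypothetical example: onset in 2000, latest record in 2010.
example_duration = duration_years(date(2000, 1, 1), date(2010, 1, 1))
example_bins = bin_percentages([1, 5, 12, 25])
```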

Figure 2—figure supplement 5
Clustering robustness analysis.

To assess the robustness of the clustering, various diluted (points in blue) and shuffled (points in red) realizations of the similarity network were used as input for the MCL algorithm, and the resulting clusterings were compared to the reference clustering using the Variation of Information (VI) measure. The two horizontal lines show the value that the VI would take if 10% and 20% of the vertices, respectively, were randomly assigned to different random clusters.
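
For readers unfamiliar with the VI measure, a minimal sketch follows. It uses the standard definition VI = H(A|B) + H(B|A) with natural logarithms; the logarithm base used in the study is not stated in the caption, and the function name is hypothetical.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of Information between two clusterings of the same items.

    Each argument lists the cluster label of every item, in the same order.
    The result is 0 when the two partitions are identical up to relabeling.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Contribution to H(A|B) + H(B|A) from this pair of clusters.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


same = variation_of_information([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Lower values mean a perturbed clustering stays closer to the reference, which is why the randomized baselines in the figure serve as a yardstick.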

Figure 3
Hierarchical clustering based on enriched comorbid ICD-10 diagnoses.

The comorbidities present in a minimum of 10 patients and significantly enriched (adj. p-value<=0.05) in each cluster are shown in the pie charts. The number of significant codes ranges from 1 to 10. Each color corresponds to an ICD-10 code chapter as listed in the legend of Figure 1. The clustering yielded six main groups and one outlier (cluster 70), and the colors of the dendrogram branches indicate the hierarchical group to which each cluster belongs. The size of the pie charts represents the average diabetes duration (years with diabetes) divided into six bins. The 21 clusters where at least 50% of the patients have three or more HbA1c severity parameters are marked with a red line surrounding the pie chart.

Figure 4
Comorbidity patterns within the six symptom groups.

(A) Comorbidity correlations between the combined symptom groups. (B) Asymmetric comorbidity matrix for observing row diagnosis codes before column diagnoses. First, we calculated Bonferroni-corrected p-values for diagnosis pair directionality. Second, we extracted the top 100 unique diagnosis code pairs with the lowest adjusted p-values. Lastly, we calculated a comorbidity score (CS) as the log2 of how much more or less often than expected the pair was observed. The heat-map colors reflect the CS. (C) Comorbidity pairs unique to each of the symptom groups. All interactions are observed significantly more (blue) or less (red) often than expected (adj. p-value<=0.01). Arrows indicate that the diagnoses are observed in the particular order (Fisher’s exact test with Bonferroni correction, p-value<=0.01). Node size indicates the number of symptom groups in which the diagnosis code is observed, ranging from one group (the diagnosis is unique to the group, largest nodes) to six groups (all groups have the code, smallest nodes).
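
A comorbidity score of this kind and a Fisher's exact test can be sketched as below. This is an illustration under stated assumptions, not the authors' method: the expected count is taken to be the count under independence of the two diagnoses, the test shown is a one-sided test on a generic 2x2 table (the caption does not specify how the directionality table is built), and both function names are hypothetical.

```python
from math import comb, log2


def comorbidity_score(n_both, n_a, n_b, n_total):
    """log2 of observed over expected co-occurrence, assuming independence.

    Positive values mean the pair is seen more often than expected,
    negative values less often. Undefined when n_both is 0.
    """
    expected = n_a * n_b / n_total
    return log2(n_both / expected)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of seeing `a` or more in the
    top-left cell given fixed row and column totals.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Illustrative counts only (not from the study).
over_represented = comorbidity_score(20, 40, 50, 200)
under_represented = comorbidity_score(5, 40, 50, 200)
p_value = fisher_exact_greater(3, 0, 0, 3)
```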

Data availability

All data generated or analysed during this study are included in the manuscript and supporting files, except for the raw person-sensitive electronic health record data, which cannot be shared due to confidentiality requirements.
