Research Article

Linking glycemic dysregulation in diabetes to symptoms, comorbidities, and genetics through EHR data mining

University of Copenhagen, Denmark
Odense University Hospital, Denmark
Steno Diabetes Center Copenhagen, Denmark
Rigshospitalet, Denmark
Holbæk Hospital, Denmark
Herlev-Gentofte Hospital, Denmark
Technical University of Denmark, Denmark

Dec 10, 2019

Open access

Abstract

Diabetes is a diverse and complex disease, with considerable variation in phenotypic manifestation and severity. This variation hampers the study of etiological differences and reduces the statistical power of analyses of associations with genetics, treatment outcomes, and complications. We address these issues through deep, fine-grained phenotypic stratification of a diabetes cohort. Text mining the electronic health records of 14,017 patients, we matched two controlled vocabularies (ICD-10 and a custom vocabulary developed at the clinical center Steno Diabetes Center Copenhagen) to clinical narratives spanning a 19-year period. The two matched vocabularies comprise over 20,000 medical terms describing symptoms, other diagnoses, and lifestyle factors. The cohort is genetically homogeneous (Caucasian diabetes patients from Denmark), so the resulting stratification is not driven by ethnic differences but rather by inherently dissimilar progression patterns and lifestyle-related risk factors. Using unsupervised Markov clustering, we defined 71 clusters of at least 50 individuals within the diabetes spectrum. The clusters display both distinct and shared longitudinal glycemic dysregulation patterns, temporal co-occurrences of comorbidities, and associations with single nucleotide polymorphisms in or near genes relevant for diabetes comorbidities.

Introduction

Electronic Health Records (EHRs) contain patient characteristics from different data layers including text narratives, assigned diagnosis codes, biochemical values, and prescription data. These data types display a high degree of complementarity, providing an excellent basis for deep phenotyping and patient stratification. Recent studies have shown how structured data derived from EHRs can be used to assess phenotypic variability of different disease areas (Li et al., 2015; Dahlem et al., 2015; Doshi-Velez et al., 2014; Kho et al., 2011; Kho et al., 2012). While the use of structured EHR data in many instances resembles traditional registry- or biobank-based research, the inclusion of unstructured data such as clinical narratives allows for the definition of even more fine-grained phenotypes, which could lead to novel subgroup stratifications (Li et al., 2015; Roque et al., 2011; Miotto et al., 2016).

A vast amount of information on symptoms, lifestyle, complications, and comorbidities is available from clinical narratives in unstructured EHR data. Text mining that applies natural language processing (NLP) algorithms is one strategy, but simpler approaches have also been shown to be valuable in the context of clinical text; for reviews, see Jensen et al. (2012) and Denny (2012). These methods work across language barriers and have been successfully implemented in, for example, adverse drug reaction detection (Warrer et al., 2012), subgrouping of chronic obstructive pulmonary disease (Fu et al., 2015), cancer subgrouping (Chen et al., 2015), and classification of epileptic children (Pereira et al., 2013). Such studies show the possibilities of using and integrating different parts of EHRs for matching phenotypically similar subgroups to biomarker data, which is key to improved treatment and characterizing etiological differences.

Several large initiatives have been established for utilizing EHRs, including the Electronic Medical Records and Genomics (eMERGE) consortium of DNA biorepositories that links genetic data with electronic medical records (McCarty et al., 2011; Gottesman et al., 2013), and EMR-driven nonnegative restricted Boltzmann machines (eNRBM) which use unsupervised learning for analyzing EHRs (Tran et al., 2015). Furthermore, other studies have used general approaches for finding direct and inverse comorbidities (Doshi-Velez et al., 2014; Roque et al., 2011; Gligorijevic et al., 2016).

Diabetes Mellitus (DM) is a difficult disease to stratify (American Diabetes Association, 2017). DM covers etiologically different metabolic disorders that exhibit the same phenotype, hyperglycemia, due to either insufficient insulin production relative to insulin demand or insulin resistance. Although DM is classified into different major subtypes, it has been hypothesized to represent a disease continuum rather than strictly distinct disease subtypes (American Diabetes Association, 2017; Flannick et al., 2016). One recent data-driven study identified five subgroups of adult-onset diabetes by clustering six parameters from the structured data of the EHR (Ahlqvist et al., 2018). DM is a complex disorder associated with several comorbidities and organ complications. These can be classified as macrovascular complications, that is, cardiovascular disease, and microvascular complications resulting in eye, kidney, and nerve damage. Cardiovascular complications alone are responsible for 50–80% of all-cause mortality in diabetes patients (Laakso, 2001). The severity of complications is affected by glycemic dysregulation, that is, increased or fluctuating blood glucose levels (Stratton et al., 2000; UK Prospective Diabetes Study Group, 1998a; UK Prospective Diabetes Study Group, 1998b; Nathan et al., 1993), and successful reduction and prevention of diabetic complications have been observed when the glycemic dysregulation is reduced or removed (Stratton et al., 2000; UK Prospective Diabetes Study Group, 1998a). Therefore, risk factors for glycemic dysregulation are crucial to diabetes progression (Ahlqvist et al., 2015). Known risk factors for complications include age, diabetes duration, polypharmacy, comorbidities (Juarez et al., 2012), increased levels of circulating triglyceride and LDL-cholesterol, and lower levels of HDL-cholesterol (Saudek et al., 2006; Giannini et al., 2011; Bitzur et al., 2009). Finding new risk factors that can help classify poorly regulated versus well-regulated diabetes, such as other biochemical variables or genetic variants, could improve treatment and reduce diabetic complications.

In this study, we utilized the unstructured data of EHRs and performed a deep phenotypic characterization of a Danish diabetes cohort of 14,017 individuals, aged 18 to 101 at the end of the study, using vocabularies comprising both diagnosis codes and ‘exposome’ related terms. We used text-mined and assigned diagnosis codes to stratify the cohort and described it using both physiological and genetic variation data. The unstructured EHR data enabled us to classify patients based on their level of glycemic dysregulation and to identify potential biochemical and genetic markers associated with dysglycemia.

Results

Text mining the EHR corpus

The general aim of the text mining effort was to obtain a richer phenotypic characterization of each patient. Initially, each patient had 4.9 assigned codes on average. Applying text mining with two vocabularies (ICD-10 and SDC-custom) resulted in an approximately 4-fold increase to 18.6 codes per patient. Moreover, the distribution of codes across ICD-10 chapters changed considerably when adding the text-mined codes, with chapters I, VII, XVIII and XIX showing the largest increases (6-, 15-, 25- and 22-fold, respectively) (Figure 1). This illustrates the difference between the assigned diagnosis codes from the structured data and the much more symptom-rich codes detected by text mining.
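As a rough illustration of how dictionary-based matching can add codes on top of the assigned ones, the sketch below looks up vocabulary terms in free-text notes and merges the resulting codes with a patient's assigned codes. The mini-vocabulary, the example notes, and the function names `match_codes` and `combined_codes` are hypothetical; the actual pipeline (tokenization, negation handling, treatment of Danish text) is not described in this section and is not reproduced here.

```python
import re

# Hypothetical mini-vocabulary: surface term -> ICD-10 level-3 code.
VOCAB = {
    "foot ulcer": "L97",
    "psoriasis": "L40",
    "vertigo": "R42",
    "proteinuria": "R80",
}

def match_codes(note, vocab=VOCAB):
    """Return the set of codes whose term occurs as a whole phrase in the note."""
    text = note.lower()
    found = set()
    for term, code in vocab.items():
        # Word boundaries keep a term from matching inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(code)
    return found

def combined_codes(assigned, notes):
    """Union of assigned codes and codes text-mined from all notes of one patient."""
    mined = set()
    for note in notes:
        mined |= match_codes(note)
    return set(assigned) | mined

codes = combined_codes(
    {"I10"},
    ["Patient reports vertigo.", "Known psoriasis; foot ulcer on left heel."],
)
print(sorted(codes))  # ['I10', 'L40', 'L97', 'R42']
```

Note that this naive lookup does not handle negated mentions ("no vertigo"), which a production text-mining step would need to address.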

Figure 1 (with 6 supplements)

Comparison of distributions of ICD-10 diagnosis codes with and without text mining.

(A) Percentage of diagnosis codes belonging to the different ICD-10 chapters and the relative increase in diagnosis codes from the different chapters when combining the text-mined and assigned codes. (B) Age distributions of text-mined and assigned ICD-10 diagnosis codes from the SDCC corpus divided into the 21 ICD-10 chapters.

Figure 1—source data 1. Diagnosis code breakdown data: https://cdn.elifesciences.org/articles/44941/elife-44941-fig1-data1-v1.txt
Figure 1—source data 2. Age distribution data: https://cdn.elifesciences.org/articles/44941/elife-44941-fig1-data2-v1.txt

Comorbidity clustering based on text-mined and assigned diagnosis codes

For each patient, the assigned and text-mined ICD-10 codes were combined to create a patient-specific diagnosis vector in which the primary diabetes type (E10 or E11) was not included. In contrast to cancer, for example, where the ICD-10 diagnoses are quite reliable and highly detailed, the primary codes in a multi-organ disease like diabetes are used in a fuzzier way, as the knowledge on robust diabetes subtypes and their characteristics in the context of comorbidities is quite limited. We therefore did not want the clustering to be driven by the broad, less etiology-relevant primary codes from the endocrinology chapter, but rather by more objectively observed symptoms, other diseases and lifestyle features. Following code-abundance normalization and BM-25 correction, the vectors were clustered using MCL, producing 172 clusters (mean = 65 patients, min = 11, max = 979, median = 40), in which 11,208 patients (80.47%) were included (Figure 2A). The remaining 2720 patients (19.53%) were in clusters with ten or fewer patients and were therefore omitted from subsequent analyses.
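To make the clustering step more concrete, the following is a minimal pure-Python sketch of Markov clustering (MCL) on a small patient-similarity matrix: alternate expansion (matrix squaring) and inflation (element-wise power followed by column normalization) until the matrix stops changing, then read clusters off the non-zero rows. It is not the authors' implementation; the inflation value of 2.0, the tolerance, the self-loop weight, and the toy similarity matrix are assumptions, and the normalization and BM-25 steps are omitted.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / s
    return out

def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(sim, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster nodes of a non-negative similarity matrix; returns sorted index lists."""
    n = len(sim)
    # Self-loops (assumed weight 1.0) keep every column sum positive.
    m = [[sim[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)                                   # expansion
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])  # inflation
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that retains mass lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Toy example: two groups of three patients with no similarity across groups.
sim = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(sim))  # [[0, 1, 2], [3, 4, 5]]
```

Higher inflation values generally yield more, smaller clusters, so the number of clusters reported above depends on a parameter choice that this sketch does not attempt to reproduce.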

Figure 2 (with 5 supplements)

Phenotypic clusters found in the SDCC cohort.

The clustering was created with diagnosis vectors of 13,928 patients (with text in the record) comprising both text-mined and assigned ICD-10 codes. A total of 172 clusters were created, where 11,208 patients (80.47%) were captured in the clustering (clusters with five or fewer patients were discarded for statistical reasons). (A) Each node represents a patient within the corpus colored by the association to one of the 172 unique clusters. (B) The 71 clusters with at least 50 patients colored with the same palette as in (A).

Even though codes for the primary diabetes type were not part of the diagnosis vectors, specific clusters were significantly enriched for T1D patients (cluster 1: N = 506, adj. p-value=9.3e-51 and cluster 9: N = 101, adj. p-value=1.2e-10). Other clusters had significantly more T2D patients than expected (cluster 3: N = 233, adj. p-value=9.1e-10, cluster 5: N = 170, adj. p-value=3.8e-13 and cluster 6: N = 158, adj. p-value=8.4e-17). In addition, we observed a cluster significantly enriched with the ICD-10 term E13: other diabetes (cluster 25: N = 93, adj. p-value=1.8e-142), which includes diabetes due to genetic defects, post-pancreatectomy diabetes and post-procedural diabetes. Several other clusters had a mix of T1D and T2D patients according to the assigned codes. Further characteristics of the laboratory and prescription data, as well as of the clusters regarding sex, age, observational time, years with diabetes etc., can be found in Supplementary files 1–3, Figure 1—figure supplements 1–4, and Figure 2—figure supplements 1–4. The robustness of the clustering was found to be high (see description in Materials and methods and Figure 2—figure supplement 5). To maintain power in subsequent analyses we focused on clusters with at least 50 patients (71 clusters comprising 8652 patients, Figure 2B).
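One common way to test whether a cluster contains more patients with a given code than expected is a one-sided Fisher's exact test, which is equivalent to a hypergeometric upper tail. The sketch below shows that calculation with a Bonferroni adjustment. The section does not state which test or adjustment produced the adjusted p-values above, so both choices, as well as the function names and the example counts, are assumptions.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) when n patients are drawn from N, of whom K carry the code.

    k: carriers in the cluster, n: cluster size,
    K: carriers in the cohort, N: cohort size.
    """
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the tail.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)

# Hypothetical counts: 40 of 60 cluster members carry a code that
# 300 of 1000 cohort members carry; 71 clusters are tested.
p = enrichment_pvalue(40, 60, 300, 1000)
print(p, bonferroni(p, 71))
```

Exact integer arithmetic keeps the tail sum accurate even for very small probabilities, at the cost of speed on large cohorts.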

Enriched comorbidity and symptom patterns in diabetes patient clusters

The 71 clusters (Figure 2B) were grouped by hierarchical clustering, using distances obtained from cluster specific symptoms from the ICD-10 chapter XVIII (level 1). Six main groups and an outlier (cluster 70) were found containing 5, 8, 21, 11, 7 and 18 of the original clusters, respectively. The symptom groups are illustrated by the branch colors in Figure 3. The nodes represent the 71 clusters each depicted as a pie chart displaying the comorbidities and symptoms that are significantly enriched (adj. p-value≤0.05), see Supplementary file 4 for details on the enrichment and p-values.

Figure 3

Hierarchical clustering based on enriched comorbid ICD-10 diagnoses.

The comorbidities present in a minimum of 10 patients and significantly enriched (adj. p-value<=0.05) in each cluster are shown in the pie charts. The number of significant codes ranges from 1 to 10. Each color corresponds to an ICD-10 code chapter as listed in the legend of Figure 1. Six main groups and an outlier (cluster 70) resulted, and the colors of the dendrogram branches indicate to which hierarchical groups the clusters belong. The size of the pie charts represents the average diabetes duration (years with diabetes) divided into six bins. The 21 clusters where at least 50% of the patients have three or more HbA1c severity parameters are marked with a red line surrounding the pie chart.

The 71 clusters were defined based on the associated comorbidities, excluding DM without complications, and from the pie charts we observed that distinct diagnoses do indeed characterize the clusters. Examples include ICD-10 code N40: Benign prostatic hyperplasia for cluster 56, L40: Psoriasis for cluster 16, F20: Schizophrenia for cluster 47, K29: Functional intestinal disorders for cluster 17, and Z94: Transplanted organ and tissue status for cluster 42. Using Fisher’s exact test, we found that symptoms related to skin and subcutaneous tissue (adj. p-value<0.001) characterized symptom group 5, whereas symptoms related to the digestive system and abdomen; cognition, perception, emotional state and behavior; and general symptoms and signs (adj. p-value<0.001 for all) characterized symptom group 3. These results correspond well to the enriched codes observed in Figure 3, as was the case for the other enriched codes across the 71 clusters within the six symptom groups.

Genomic characterization by SNP association of phenotypically determined clusters

We evaluated the 71 clusters in the six symptom groups, plus the outlier cluster, for SNPs that could characterize the different groups (details on the genetic data can be found in the ‘Genomic characterization’ section under Materials and methods). The five highest association signals (independent) for each group are shown in Supplementary file 5. Only results from analyses with more than 15 cases and a well-calibrated QQ-plot (visual inspection and a lambda inflation factor >0.96) are reported. Accordingly, clusters 1–5, 7–9, 12, 15–18, 21–23, 26, 31, 35, 39, 45, 46, and 66, as well as all aggregated symptom groups, met the criteria. The median coverage of the symptom clusters was 31% [range: 10–67]. SNPs characterizing the symptom groups were found in several instances and association signals to disease-associated genes were also found for several of the clusters (Figure 3). Most frequently found, unsurprisingly, were genes associated by GWAS to diabetes or diabetes-related cardio-metabolic traits (cluster 3: MYO3B, cluster 4: DAPK1, cluster 5: LPIN2, cluster 7: SAMD4A and FHIT, cluster 8: ERG and PLCB1, cluster 12: MYT1L, cluster 15: UBE2WP1, cluster 16: ADARB2, CDKAL1, and CLIP1, cluster 17: C8orf37-AS1, cluster 21: FHOD3 and MCF2L, cluster 24: MTCL1, cluster 26: NTM, cluster 31: PCDH15, CDH4, and DCTD, cluster 31: KLF12, cluster 39: FHOD3, cluster 45: IGF1R, BCAS3, and TENM4, cluster 46: NRXN3). Cluster 8 is characterized by cardiovascular complications, and three of the top ranking genes for this cluster have been associated with LDL peak particle diameter (THBS4; Rudkowska et al., 2015), abdominal aortic aneurysm (ERG; Jones et al., 2017), pulse pressure (ERG; Warren et al., 2017), and diastolic blood pressure (PLCB1; Warren et al., 2017). Cluster 21 is enriched for the ICD-10 diagnosis foot ulcer (L97), and MCF2L, one of the top ranking genes for this cluster, has been associated with both end-stage coagulation (Williams et al., 2013) and prothrombin time (Tang et al., 2012). In total, of the top five association signals that were mapped to genes (n = 103), we found five (CDKAL1, DCDC2C, KLF12, LPIN2, TLE1) to be related to diabetes.
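The lambda inflation factor used as a calibration criterion above is conventionally computed as the median of the observed 1-df chi-square association statistics divided by the median expected under the null (about 0.455). A minimal sketch of that standard calculation from a list of p-values follows; the function name and the evenly spaced example p-values are hypothetical, and the authors' exact procedure is described in Materials and methods rather than here.

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(pvalues):
    """Genomic inflation factor (lambda) from two-sided association p-values.

    Each p-value (0 < p <= 1) is converted back to a 1-df chi-square
    statistic via the squared normal quantile of p / 2.
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN

# Evenly spaced p-values mimic a perfectly calibrated null analysis,
# for which lambda should be very close to 1.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(null_like)
print(round(lam, 3))
```

Values well above 1 usually point to inflation from population structure or relatedness, while values clearly below 1 suggest an underpowered or overly conservative test.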

Comorbidity pairs and patterns within symptom related clusters

We detected codes occurring together significantly more or less often than expected within and across the symptom groups (Fisher’s exact test with Bonferroni adjusted p-values<=0.01), defining distinct comorbidity pairs. If the comorbidity pairs covered more than 100 unique codes (symptom groups 4 and 7), we extracted only the most significant pairs until these pairs consisted of 100 unique codes.

Figure 4A illustrates the comorbidity correlations for the six main symptom groups where each pairwise interaction has a comorbidity score (see Materials and methods). To characterize whether a diagnosis occurred significantly more often before or after another, we repeated this analysis in a temporal manner. Figure 4B illustrates the comparison of the first diagnosis (row) to the second diagnosis (column). We found that especially the diagnoses related to diabetes (E13, O24), diabetes with complications (shortened to E10 and E11), obesity (E66), diseases of the pancreas (K86), poly- and proteinuria (R35 and R80), and to some extent hypertension and ischemic heart disease (I10, I20, I21, I25) are observed before other diagnoses (blue indicates that the row diagnosis is observed prior to the column diagnosis more than expected, and red indicates the opposite). Focusing on the different symptom groups, we detected which comorbidity pairs were unique in the different groups, and Figure 4C displays these unique comorbidity interactions.
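The comorbidity score is described in the Figure 4 legend as the log2 of observing a pair more or less than expected. One plausible reading, sketched below, takes the expected count from independence of the two codes within the group; that expectation model, the function name, and the example counts are assumptions, since the exact definition is given in Materials and methods.

```python
from math import log2

def comorbidity_score(n_both, n_a, n_b, n_total):
    """log2(observed / expected) co-occurrence of codes A and B.

    n_both: patients with both codes, n_a / n_b: patients with each code,
    n_total: patients in the group. The expected count assumes independence.
    """
    expected = n_a * n_b / n_total
    if n_both == 0:
        return float("-inf")  # pair never observed together
    return log2(n_both / expected)

# Observed four times as often as expected -> positive score.
print(comorbidity_score(20, 100, 50, 1000))   # 2.0
# Observed half as often as expected -> negative score.
print(comorbidity_score(5, 100, 100, 1000))   # -1.0
```

Under this reading a score of 1 means the pair is seen twice as often as expected and a score of −1 half as often, which matches the sign convention of the CS values quoted in the text.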

Figure 4

Comorbidity patterns within the six symptom groups.

(A) Comorbidity correlations between the combined symptom groups. (B) Asymmetric comorbidity matrix for observing row diagnosis codes before column diagnoses. First, we calculated Bonferroni corrected p-values for diagnosis pair directionality; second, we extracted the top 100 unique diagnosis code pairs with the lowest adjusted p-values; and lastly, we calculated a comorbidity score (CS) by using the log2 of observing the pair more or less than expected. The heat-map colors reflect the CS quantification. (C) Comorbidity pairs unique for each of the symptom groups. All interactions are observed significantly more (blue) or less (red) than expected (adj. p-value<=0.01). Arrows indicate that the diagnoses are observed in the particular order (Fisher’s exact test with Bonferroni correction p-value<=0.01). Node size indicates in how many symptom groups the diagnosis code is observed, ranging from one group (the diagnosis is unique for the group, largest nodes) to six groups (all groups have the code, smallest nodes).

Figure 4—source data 1. Comorbidity pattern data: https://cdn.elifesciences.org/articles/44941/elife-44941-fig4-data1-v1.txt

In symptom group 2 we found that L84: corns and callosities co-occurs significantly more often than expected with T1D and significantly less often than expected with T2D (CS = 1.24, adj. p-value=4.06e-15 and CS = −1.58, adj. p-value=1.25e-03, respectively). Temporal analysis of diagnosis occurrence showed that T1D is observed before L84 (Figure 4B, mean time difference = 8.3 years, adj. p-value=1.01e-39). Corns or callosities are unproblematic in healthy people, but in diabetes patients they can cause skin defects that increase the risk of additional complications, for example foot ulcers, which can lead to amputations (Apelqvist et al., 2000; Hunt, 2011).

Although the two codes were not observed significantly together within any cluster, the temporal analysis showed that the time between T2D and elevated blood glucose levels (R73) is significantly shorter in symptom group 2 than in groups 4, 5 and 6 (mean time = 2.2 days; adj. p-value=6.45e-04, 3.29e-06 and 2.73e-06, respectively).

In symptom group 5, five of the eleven clusters are enriched with ICD-10 codes from chapter XIII: Diseases of the musculoskeletal system and connective tissue, especially dorsopathies, spondylopathies and soft tissue disorders. Further, these diagnoses are observed exclusively in this group and show unique disease co-occurrence patterns, for example M48-M54 (other spondylopathies and dorsalgia, CS = 1.01, adj. p-value=1.8e-04) and M43-M47 (deforming dorsopathies and spondylosis, CS = 1.54, adj. p-value=1.91e-06). One of the top ranked genetic associations for this cluster (rs76548985, p-value=1.43e-06) is LINC00351, associated with sporadic amyotrophic lateral sclerosis (Xie et al., 2014). It is worth noting that clusters 8, 22, 33, 35, 45 within symptom group 5 are all enriched for drugs from ATC chapter A10B: blood glucose lowering drugs, excluding insulin (adj. p-value<0.05), and all but cluster 8 are associated with glycemic dysregulation.

Within symptom group 7, we observed two diagnosis pairs less than expected: E11-E13 (CS = −1.46, adj. p-value=1.57e-04), and K86-E78 (CS = −1.24, adj. p-value=5.46e-04). Hence, this group contains patients where T2D and other diabetes as well as diseases of the pancreas and disorders of lipoprotein metabolism are not given together. In contrast, I34: nonrheumatic mitral valve disorder is observed more often than expected together with heart failure (I50, CS = 1.83, adj. p-value=0.009) and atrioventricular and left bundle-branch block (I44, CS = 1.53, adj. p-value=0.0018). Interestingly, one of the top genetic signals for symptom group 7 maps to MIR8052 (rs6590490, p-value=3.14e-07), which has been associated with pulse pressure (Warren et al., 2017). Comparably, among the top genetic signals for symptom group 4, a group where a large proportion of the patients are characterized by hypotension (I95) and vertigo (R42), is ANLN, which has been associated with systolic blood pressure (Parmar et al., 2016).

Glycemic dysregulation

We evaluated five different parameters associated with glycemic dysregulation (glycemic dysregulation, hyperglycemia, check-point detection of fluctuating HbA1c levels, HbA1c level at diabetes onset and number of HbA1c observations above diagnosis threshold for T1D and T2D [53 mmol/mol]) and found that 2942 patients did not meet any threshold criterion, 2484 met one, 4647 two, 4057 three, 531 four, and 22 met all five criteria. The distribution of HbA1c measurements for T1D and T2D is shown in Figure 1—figure supplement 5. First, we investigated whether there was any difference in mean values of the 20 different biochemical tests (see Materials and methods) and subsequently we applied a Kolmogorov-Smirnov test to assess how these distributions differed. We found that the means of 14 of the different biochemical tests were differently distributed between the six groups (adj. p-value<0.01) of patients with different numbers of dysregulation parameters, and furthermore observed a distinct difference between the not-or-slightly dysregulated patients (groups 0 to 2) and the middle-or-highly dysregulated patients (groups 3 to 5) (Figure 1—figure supplement 6). The group with five parameters showed no significant difference, due to the low number of patients (N = 22). The group with 3–5 parameters showed higher levels of triglyceride and HbA1c, and lower levels of sodium, urine creatinine, C-peptide, hemoglobin, diastolic blood pressure and height. Elevated levels of HbA1c, triglyceride, LDL-cholesterol and cholesterol and lower levels of HDL-cholesterol are known biochemical values associated with glycemic dysregulation and thus verified our findings.

The detection of higher levels of potassium and plasma creatinine, as well as the lowered sodium, urine creatinine, and hemoglobin levels, indicates that these biochemical tests might be used in future prediction of glycemic dysregulation. Glycemic dysregulation is expected to cause renal problems (identified by elevated plasma creatinine and elevated urine albumin) and hypertension, which is treated with RAS blocking agents (ACE inhibitors and angiotensin II receptor blockers) and diuretic agents, which elevate potassium and lower sodium. The treatment profile of this patient group revealed an enrichment of patients treated with RAS blocking agents in most of the clusters. Based on these observations we considered having at least three of the parameters as the best approximation for a definition of potential glycemic dysregulation.

Using the results from the biochemical analysis, we divided the cohort in two: those with at least three parameters associated with glycemic dysregulation, and those with two or fewer. Of the 71 clusters defined above, 21 had more than 50% patients with at least three parameters (Figure 3, red circles). We found 10 of the 21 clusters in symptom group 3, of which clusters 5, 24, and 47 were enriched for poor compliance when using the SDC-custom dictionary (adj. p-value=5.9e-03, 1.9e-03 and 2.6e-02, respectively). By further investigating the enrichment of SDC-custom terms (adj. p-values≤0.05) we found that the majority of the 21 clusters had terms related to cardiovascular complications (e.g. beta blockers, ischemia, diuretics and bypass), kidney complications (e.g. nephropathy, edema and albuminuria), metabolic complications (hypoglycemia and insulin shock) and neurologic related disorders (e.g. neuropathy and loss of memory). Furthermore, all the patients in cluster 47 have schizophrenia (N = 76, adj. p-value=8e-141), and behavioral features might therefore account for the glycemic dysregulation. The same could be the case for cluster 24, in which all have epilepsy (N = 108, adj. p-value=7.6e-186).
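The dichotomization and cluster flagging described above can be sketched as a simple count per cluster. The thresholds (at least three parameters; more than 50% of a cluster's patients) come from the text, while the data layout, the function name, and the toy records are hypothetical.

```python
from collections import defaultdict

def flag_dysregulated_clusters(patients, min_params=3, min_fraction=0.5):
    """Return ids of clusters where more than `min_fraction` of patients
    meet at least `min_params` glycemic dysregulation parameters.

    patients: iterable of (cluster_id, n_params_met) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cluster_id, n_params in patients:
        totals[cluster_id] += 1
        if n_params >= min_params:
            hits[cluster_id] += 1
    return sorted(c for c in totals if hits[c] / totals[c] > min_fraction)

# Toy records: cluster 1 has 2 of 3 patients with >= 3 parameters,
# cluster 2 has exactly half, which does not exceed the 50% threshold.
records = [(1, 3), (1, 4), (1, 0), (2, 3), (2, 1)]
print(flag_dysregulated_clusters(records))  # [1]
```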

Genetic characterization of dysregulated patients

To assess if glycemic dysregulation is a diabetic complication or evidence of disease etiology, we further tested whether any SNPs were associated with glycemic dysregulation (n = 2,120). The five top associating signals map to NCKAP5, CLNK, PSD3, KPNA5, and LINC00333 (Supplementary file 5), although none reached genome-wide significance. Interestingly, two of the genes have been associated with schizophrenia (LINC00333 [Goes et al., 2015] and NCKAP5 [Draaken et al., 2015]), and PSD3 has also been associated with traits related to urinary and blood metabolite levels, metabolic traits, and triglyceride levels (Raffler et al., 2015; Teslovich et al., 2010; Shin et al., 2014; Rueedi et al., 2014). However, none of the five top ranked genes have been previously linked to glycemic levels or diabetic dysregulation.

Discussion

Previous studies using EHRs in diabetes research have focused on improving clinical decision making (O'Connor et al., 2011), clinical prediction (Miotto et al., 2016), patient management (Cebul et al., 2011), mortality risk (Pantalone et al., 2009; Pantalone et al., 2010), genetic risk factors (Kho et al., 2012), and subgroup identification (Li et al., 2015). Only the study by Miotto et al. (2016) used the different layers of the EHRs, aimed at predictive measures of clinical outcome. A study from the eMERGE consortium extracted phenotypes from EHR narratives by using NLP-based methods (Kho et al., 2011). They used EHR for phenotypic characterization of five main diseases, but a fine-grained analysis of phenotypic characterization within the diseases was not performed. Further, NLP was included only in the phenotypic determination of three of the diseases, not for diabetes determination.

Stratification and subdivision of diabetic cohorts have typically been performed on homogeneous data sets within specific diabetes types such as T1D, T2D, or gestational diabetes (Perry et al., 2012; Ren et al., 2016; Lin et al., 2012; Achenbach et al., 2004). One of the more recent stratification studies of diabetes patients is that of Li et al. (2015), which identified subtypes of T2D in patients of mixed ethnicity using the structured part of EHRs. They detected three distinct subgroups that could be linked to significant SNPs through gene-disease associations in a patient-unspecific manner. Further, elevated HbA1c levels were used to explain one subgroup with microvascular diabetic complications. In contrast to the study by Li et al., we have taken the stratification and characterization several steps further, both by investigating a heterogeneous diabetic cohort almost five times as large and by obtaining the full comorbidity pattern and symptom relatedness through mining of the text narratives using both an ‘exposure-oriented’ and a diagnosis-based dictionary. In addition, we used the biochemical data to produce a severity classification (the five parameters of glycemic dysregulation), and by integrating this with both the text-mined and assigned diagnoses we were able to determine many different, more homogeneous groups of patients with shared symptoms and comorbidities, as well as different levels of glycemic dysregulation.

Another recent diabetes stratification study by Ahlqvist et al. (2018) used a data-driven approach and k-means clustering to subgroup adult-onset diabetes and characterize five subgroups showing differing disease progression and risk of diabetes complications. However, this approach concerned only individuals with type 2 diabetes and a characterization based on six parameters (glutamate decarboxylase antibodies, age at diagnosis, BMI, HbA1c, and homoeostatic model assessment 2 estimates of β-cell function and insulin resistance), and thus clinical narratives, medication, and genetics were not used as we have done in this study.

The text mining approach used in relation to ICD-10 codes was based on level three rather than the more detailed level four, since the latter would tremendously increase the dimensionality of the feature space. While this obviously reflects a less deep phenotyping, for a data set of this size many level four codes would be unique, likely leading to a less stable subsequent clustering and analysis. In fact, our attempt to use the much more fine-grained SNOMED-CT terminology confirmed that a data set needs to be very large for such a fine-grained vocabulary to be useful.

In this work, we deliberately excluded the primary diabetes types without complications, T1D and T2D, and thereby constructed a stratification of the cohort driven solely by comorbidities, complications, other diseases, and symptoms. However, combining different diabetic subtypes can be problematic, since their etiologies differ and disease progression is different across diabetes types, treatment, compliance and lifestyle (Adeghate et al., 2006). Our focus was not to characterize specific comorbidity-related groups within a certain diabetes type, since extensive epidemiological studies of this kind have been done previously. Instead, we focused on the diabetes continuum with the aim of investigating whether it was possible in an unsupervised manner to detect relevant and meaningful diabetic subgroups by comorbidities, symptoms, or level of glycemic dysregulation. Further, we detected novel biochemical and genetic candidates that might relate these to the different cohort subdivisions, such as shared symptom patterns for phenotypically similar patients and the level of glycemic dysregulation. These biochemical and genetic candidates could be potential risk factors for additional complications, especially concerning glycemic dysregulation, that could be verified by further experimental studies. As the cohort is enriched for sicker patients with diabetes mellitus complications, the features and the overall grouping described would not necessarily be the same if another cohort dominated by prediabetes individuals had been analyzed.

Despite our focus on the phenotypic variation among diabetes patients, the stratification is restricted by the limited coverage of the genetic data, which lowers the power considerably. We were able to obtain genetic data for 2337 patients, of whom 2125 remained after quality control and stratification. Hence, only 14% of the patients in our final cohort had descriptive genetic information.

By adding biochemical, prescription, and genetic data we observed that the clusters were significantly different from each other on parameters other than comorbidities. By including the text narratives of the EHRs we were able to capture diagnoses that in another context would be considered as a primary diagnosis, for example epilepsy, schizophrenia and cerebral palsy. These diagnoses are not known comorbidities of diabetes but can influence the treatment and management of the diabetes patient. For instance, we observed that all patients in cluster 47 had schizophrenia, which could influence their compliance since the cluster was associated with glycemic dysregulation. We determined this when assessing the level of glycemic dysregulation and found that this cluster indeed showed a high number of patients with at least three parameters for glycemic dysregulation. However, a more in-depth analysis is required to clarify whether the glycemic dysregulation is due to the behavioral effects of schizophrenia, underlying genetic variants, adverse drug reactions due to polypharmacy, or other variables.

Although our data combine assigned and text-mined diagnoses, misdiagnoses can occur, so we performed a manual inspection of randomly selected EHRs to establish the validity of the data. Furthermore, we observed some patients assigned with different diabetes types, for example first assigned with T1D and later with T2D, and vice versa. Inspecting the biochemical values of GAD65 autoantibodies and comparing them to the primary diagnosis type, we found 182 T2D assigned individuals to have GAD65 levels above 10 IU/ml, possibly indicative of T1D or LADA; however, these individuals were not significantly enriched in any cluster. We also observed 621 individuals with GAD65 levels below 10 IU/ml, which is consistent with known late-term effects of T1D (results not shown). An in-depth temporal analysis of these patients with mixed diabetes types could be interesting, and integrating biochemical as well as genetic variation data could elucidate which, if any, phenotype might be the most accurate.

In this study, we have used data from a unique cohort of 14,017 patients with diabetes, of whom 12,866 had been diagnosed with either T1D or T2D. We integrated the assigned and text-mined ICD-10 and SDC-custom diagnoses and carried out an MCL clustering, which resulted in 172 unique clusters. Of these, 71 had at least 50 patients and were subsequently divided into groups with shared symptoms. Investigating the complication enrichment and comorbidity patterns in the clusters and symptom groups, we detected clusters described by specific disorders such as hypothyroidism, schizophrenia, and functional intestine disorder as well as unique comorbidity interaction patterns both with and without temporal significance. An interesting approach could be to extend the temporal analysis to investigate how disease progression within and between clusters and symptom groups develops for multiple diagnoses. This could be done with a trajectory-based approach such as that of Jensen et al. (2014).

Materials and methods

EHR data

All data originate from the Steno Diabetes Center Copenhagen (SDCC), a specialized diabetes hospital in the Capital Region of Denmark. In Denmark, patients with type 1 diabetes (T1D) are followed in hospital outpatient clinics such as SDCC, and the T1D patients studied comprise 35% of all adult patients with T1D in the Capital Region of Denmark. Patients with type 2 diabetes (T2D) are referred from primary care for treatment optimization, typically for a period of six to twelve months. When treatment goals are reached and they have no diabetic complications, they are referred back to general practice. Patients needing intensive control and treatment because of micro- or macrovascular complications are offered life-long follow-up at the SDCC. At any time, approximately 2000 patients with complicated T2D are followed at the SDCC. Generally, the patients registered in the SDCC electronic patient records are representative of Danish patients with T1D and the 10% most complicated patients with T2D (Jørgensen et al., 2016). Moreover, the patients followed at SDCC are comparable to patients followed in all Danish hospital diabetes outpatient clinics in terms of distribution of age and duration of diabetes. The data comprise all communications and contacts recorded at the hospital over a period of 19 years (1993–2012) for 14,017 patients. This includes primary diagnoses, prescriptions and laboratory tests, 1.2M clinical narrative entries, 420 different types of laboratory tests with 4.15M laboratory measurements, and a total of 440,555 drug prescriptions. On average, each patient had 85 clinical narratives with an average length of 34 words (212 characters). In addition, genetic data from several research projects have been linked to the patients and added to the EHRs.

Text-mining dictionaries, tagging and corpus matches

An in-house developed framework for mining Danish text was used for the analysis (Roque et al., 2011; Eriksson et al., 2013). The algorithm tags words in the text narratives in a named entity recognition (NER) fashion based on supplied dictionaries. In this study, we used two main dictionaries: the International Classification of Diseases, 10th revision (ICD-10), truncated to level 3 (e.g. E10: Type 1 Diabetes), and a complementary ‘exposome-oriented’ dictionary (SDC-custom). The latter holds terms related to diabetes-specific subtypes (e.g. MODY and LADA), complications (e.g. the different severities of neuropathy, retinopathy and nephropathy), treatments and examinations (e.g. gastric bypass, renography, and beta blockers), lifestyle and lifestyle-related disorders (e.g. obesity, exercise level, smoking), and compliance. The SDC-custom dictionary was developed in collaboration with physicians at the SDCC (see Supplementary file 6 for a translated and condensed version). The Danish ICD-10 version currently contains roughly 20,000 unique descriptions of clinical concepts, each with a unique ICD-10 code.

In addition to dictionary matching, the NER performs lemmatization and de-latinization of tagged words, accounts for language negations and subject negations (mentions that refer to someone other than the patient, e.g. ‘the patient’s mother had retinopathy’), and performs fuzzy matching with a Hamming distance of 1 (e.g. ‘diabtes’ is transformed to its correct spelling ‘diabetes’). A thorough explanation of the algorithm is provided elsewhere (Simon et al., 2019, manuscript in preparation). Other details, for example on ‘negation scope’, that is the position of negations relative to the negated term in Danish, have been published previously (Thomas et al., 2014).
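The fuzzy dictionary lookup can be illustrated with a small sketch. It is not the in-house framework: lemmatization, de-latinization and negation handling are left out, and the function names and the two-entry dictionary are hypothetical. Because the quoted example (‘diabtes’ for ‘diabetes’) differs in length from its target, the sketch assumes that a match within one substitution, insertion, or deletion is intended.

```python
def within_one_edit(word, term):
    """True if `word` equals `term` or differs from it by exactly one
    substitution, insertion, or deletion (assumed tolerance)."""
    if word == term:
        return True
    if abs(len(word) - len(term)) > 1:
        return False
    if len(word) == len(term):
        # Same length: allow a single substituted character.
        return sum(a != b for a, b in zip(word, term)) == 1
    shorter, longer = sorted((word, term), key=len)
    # Lengths differ by one: dropping one character from the longer
    # string must reproduce the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def tag(tokens, dictionary):
    """Return (position, matched term, code) for every token that matches
    a dictionary term exactly or within one edit."""
    hits = []
    for pos, token in enumerate(tokens):
        for term, code in dictionary.items():
            if within_one_edit(token.lower(), term):
                hits.append((pos, term, code))
                break  # keep the first matching term only
    return hits
```

For instance, `tag(["retinopaty", "and", "obesity"], {"retinopathy": "H36", "obesity": "E66"})` tags the misspelled first token as well as the exact third one.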

Running the text-mining algorithm (Simon et al., 2019, manuscript in preparation) on the SDCC corpus with the two dictionaries (ICD-10 and SDC-custom) recognized 1,028,593 entities from the dictionaries in 12,504 patients (80.5% of the entire corpus). None of the remaining patients had any non-trivial match between the dictionaries and EHR narratives. The two dictionaries shared some general terms, for example T1D and T2D; these duplicate matches were removed and 941,087 unique code matches remained. Of these, 267,404 were fuzzy matches representing 4181 unique variants. The variants were manually validated, resulting in removal of 10,952 (4.1%) matches. After removal of negated sentences (n = 255,302) 594,600 code-to-text matches in 12,467 patients were left.

Patient phenotype vectors from assigned and text mined codes

The structured ICD-10 codes assigned to patients during their contact with SDCC were extracted from the EHRs, along with all ICD-10 codes captured by mining the text parts of the EHRs. The two ICD-10 lists were combined, but to prevent the primary, assigned diabetes types from dominating the patient stratification, diagnosis codes for diabetes without complications (E10 and E109, in total 3740 codes, and E11 and E119, in total 3624 codes) were removed. Approximately 8% of the assigned codes were removed in this way. The list of codes and their frequencies for each patient were transformed using the BM25 weighting scheme (Robertson and Walker, 1994), which scores a code $c$ in patient $p$ by accounting for the number of patients carrying the code (document frequency), the frequency of the code in the patient (term frequency), and the number of codes in the patient record (document length); see Equation 1.

$$\mathrm{Score}(p, c) = \sum_{i=1}^{n} \mathrm{IDF}(c_{i}) \cdot \frac{f(c_{i}, p) \cdot (k_{1} + 1)}{f(c_{i}, p) + k_{1} \cdot \left(1 - b + b \cdot \frac{|p|}{|p_{ave}|}\right)} \tag{1}$$

Here, $\mathrm{IDF}(c_{i})$ is the inverse document frequency for the code $c_{i}$, computed as

$$\mathrm{IDF}(c_{i}) = \log \frac{N - n(c_{i}) + 0.5}{n(c_{i}) + 0.5}$$

Here, $N$ is the total number of patients, $n(c_{i})$ is the number of patients with a given code $c_{i}$, and $f(c_{i}, p)$ is the frequency of code $c_{i}$ in patient $p$. The number of codes associated with each patient vector, $p$, is given by the length of the vector, $|p|$, and the average number of codes per patient in the entire corpus is $|p_{ave}|$. Finally, $b$ and $k_{1}$ are free parameters that determine to what extent document length is considered ($b$) and how much the scoring equation resembles a normal TF-IDF ($k_{1}$), respectively. The value of $b$ was set to 0.75, which does not fully account for the document length (that would require $b = 1$), and $k_{1}$ was set to 1.2, giving a low resemblance to TF-IDF (which is approached as $k_{1} \to \infty$).
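As an illustration of the weighting, the following minimal sketch computes one BM25 weight per code and patient, that is, the individual summands of the score. It assumes that each patient is represented as a list of (possibly repeated) codes; the function name and input layout are hypothetical and not taken from the original pipeline.

```python
import math
from collections import Counter


def bm25_vectors(patient_codes, k1=1.2, b=0.75):
    """Return one {code: BM25 weight} dict per patient.

    patient_codes: list of lists; each inner list holds the codes recorded
    for one patient, repeated as often as they occur.
    """
    n_patients = len(patient_codes)
    counts = [Counter(codes) for codes in patient_codes]
    # n(c): number of patients whose record contains code c at least once
    doc_freq = Counter(code for freq in counts for code in freq)
    # |p_ave|: average number of codes per patient
    avg_len = sum(len(codes) for codes in patient_codes) / n_patients

    vectors = []
    for codes, freq in zip(patient_codes, counts):
        # Document-length normalization: 1 - b + b * |p| / |p_ave|
        length_norm = 1 - b + b * len(codes) / avg_len
        vec = {}
        for code, f in freq.items():
            n_c = doc_freq[code]
            # IDF is zero or negative for codes seen in half or more
            # of the patients.
            idf = math.log((n_patients - n_c + 0.5) / (n_c + 0.5))
            vec[code] = idf * f * (k1 + 1) / (f + k1 * length_norm)
        vectors.append(vec)
    return vectors
```

With $b = 0$ and a code that occurs once in a patient, the weight reduces to the IDF alone, which is a convenient sanity check of the implementation.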

Clustering patients from cosine similarities

All patients were clustered using their pairwise cosine similarities calculated from the BM25-transformed code vectors. Only patient pairs with a cosine similarity ≥ 0.5 were retained prior to clustering, to limit the number of edges in the subsequent patient network. To increase the variance of the retained cosine similarities, these were rescaled from the interval 0.5–1 to 10–100. Because we wanted a network-based clustering, we used Markov Clustering (MCL) with the inflation parameter set to 1.2 and the remaining parameters left at their defaults (Van Dongen, 2000). Different inflation parameters were tested and evaluated based on the efficiency, mass fraction, and area fraction parameters.
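The construction of the weighted patient network can be sketched as follows. The sketch assumes sparse patient vectors stored as `{code: weight}` dictionaries and a linear mapping from 0.5–1 onto 10–100; the text states only the two intervals, so the linear form is an assumption, and the function names are hypothetical. The clustering step itself is not reproduced here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {code: weight}."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(vectors, cutoff=0.5, lo=10.0, hi=100.0):
    """Return (i, j, weight) for all patient pairs with cosine >= cutoff.

    Similarities in [cutoff, 1] are mapped linearly onto [lo, hi]
    (assumed form of the rescaling).
    """
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        # Clip to 1.0 to guard against floating-point overshoot.
        sim = min(cosine(vectors[i], vectors[j]), 1.0)
        if sim >= cutoff:
            weight = lo + (sim - cutoff) / (1 - cutoff) * (hi - lo)
            edges.append((i, j, weight))
    return edges
```

The resulting triples correspond to a weighted edge list of the kind a network clustering tool takes as input; the all-pairs loop is quadratic in the number of patients and is meant for illustration only.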

Grouping clusters in symptom related groups

We organized the clusters into symptom groups based on the frequency of their symptom codes using ICD-10 chapter XVIII level 1, for example R50-69: General symptoms and signs. We used a Euclidean distance and applied a hierarchical clustering using Ward.D as the agglomeration method since we wanted to expose the hierarchical relationship amongst the clusters. The entire analysis was performed using R (version 3.2.1).

Enrichment analysis of diagnosis codes

The MCL clusters were tested for ICD-10 and SDC-custom codes found more often than expected, using a binomial test while correcting for sex and birth decade. Metadata such as average age, days at SDCC, and diabetes duration (from the date of diabetes diagnosis until the end of the study) were calculated, and p-values for each cluster were obtained using a Wilcoxon test against the remaining clusters. In both analyses, p-values were adjusted using Benjamini-Hochberg correction for multiple testing, and an adjusted p-value ≤ 0.05 was considered significant.
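A minimal sketch of the two statistical building blocks named above, a one-sided binomial test and the Benjamini-Hochberg adjustment, is given below. It does not implement the correction for sex and birth decade, and it assumes that enrichment is tested as an upper tail against a background rate; both function names are hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Intended for small illustrative inputs; very large n can overflow
    when the binomial coefficient is converted to a float.
    """
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For a code carried by `k` of the `n` patients in a cluster and by a fraction `p` of the whole cohort, `binom_upper_tail(k, n, p)` gives the raw p-value; the p-values of all code and cluster combinations would then be passed together to `benjamini_hochberg`.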

Comorbidity patterns for diagnosis pairs

We performed three independent analyses without considering the clusters, applying Fisher’s exact tests to obtain p-values for all diagnosis pairs within the SDCC corpus: 1) p-values for observing the codes together, 2) p-values for observing diagnosis A prior to diagnosis B, and 3) p-values for observing diagnosis B prior to diagnosis A. P-values from the three different sets were adjusted using Bonferroni correction for multiple testing, and the pairs were subsequently ranked based on these values. To detect whether the pairs were observed together more often than expected, we applied a comorbidity score as described in Roque et al. (2011). For the temporal pairs, we also applied an ANOVA test to investigate whether any of these pairs were unique for a symptom group. All p-values were corrected for multiple testing, and an adjusted p-value ≤ 0.05 was considered significant.
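The co-occurrence test for a single diagnosis pair can be sketched as follows. The sidedness of the test is not stated in the text, so the one-sided (over-representation) form used here is an assumption, as are the function names; the temporal variants and the comorbidity score are not covered.

```python
import math


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a = patients with both diagnoses, b = only diagnosis A,
    c = only diagnosis B, d = neither. Returns the probability of at
    least `a` co-occurrences given fixed margins and independence.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = math.comb(total, col1)
    upper = min(row1, col1)
    # Hypergeometric tail: sum over all tables at least as extreme.
    tail = sum(math.comb(row1, x) * math.comb(total - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / denom


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

Because the sums are carried out on exact integers and divided only at the end, the function stays accurate for tables with several thousand patients.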

Robustness of the MCL generated clusters

To assess quantitatively the stability of the clusters generated, we constructed various diluted and shuffled realizations of the similarity network used as input to the MCL algorithm. We used a reference clustering similar to the clustering presented in Figure 2B (either by including the patients in the 71 clusters or all patients). The diluted versions were generated by randomly deleting edges with a probability of α, whereas the shuffled realizations were created by shuffling edges between nodes (patients) as described earlier (Karrer et al., 2008). The latter produces a network where the numbers of edges and vertices are unchanged. An α of zero leaves the reference network unchanged, while a value of 1 leads to a complete randomization of the similarity network. Each of these randomizations of the input was repeated five times for various values of α in the range 0–50% and used as input for the MCL algorithm. The resulting clusterings were then compared to the reference clustering by means of the Variation of Information measure (VI) (Meilă, 2007) and plotted as a function of increasing values of α (see Figure 2—figure supplement 5). The figure includes two horizontal lines corresponding to the value that the VI would take if we were to randomly assign 10% and 20% of the vertices to different random clusters, respectively. This analysis showed that the clustering is stable in relation to removing edges, which is evidence that the cosine metric-based cutoff used does not change the overall structure of the clustering. Shuffling is a more impactful randomization; nevertheless, we can shuffle around 10% of the edges and still retrieve 90% of the patients in the groups of the 71 reference clusters.
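The two ingredients of the robustness analysis that are fully specified above, edge dilution and the Variation of Information, can be sketched as follows. The function names are hypothetical, natural logarithms are assumed, and the edge shuffling and the re-clustering step are not reproduced.

```python
import math
import random
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A) for two clusterings of the same items.

    labels_a[i] and labels_b[i] are the cluster labels of item i.
    """
    n = len(labels_a)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (math.log(p_ab / (size_a[a] / n))
                      + math.log(p_ab / (size_b[b] / n)))
    return vi


def dilute(edges, alpha, rng):
    """Delete each edge independently with probability alpha."""
    return [edge for edge in edges if rng.random() >= alpha]
```

A VI of zero means that the two clusterings are identical up to relabeling, so `variation_of_information([0, 0, 1, 1], [5, 5, 9, 9])` is zero, while two independent two-way splits of four items give `2 * log(2)`.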

Quantitative assessment of glycemic dysregulation

Glycemic dysregulation was assessed for each patient by evaluating five different parameters. The first two parameters were obtained using the SDC-custom code for dysregulation (sdcL03) and the ICD-10 codes for hyperglycemia (R73 and E89). The remaining three were found by analyzing longitudinal measurements for glycated hemoglobin (HbA1c). Due to a large variation in both the number of measurements and their frequency, HbA1c values were pre-processed. We divided the HbA1c measurements for each patient into segments containing a minimum of five values, spanning a time interval of at least three months (equivalent to the functional lifetime of red blood cells). In total 10,112 patients had HbA1c measurements that fulfilled the criteria, and the subsequent analyses were performed on this sub-population.
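One way to read the segmentation rule is sketched below. The text fixes only the two criteria (at least five values, at least three months), so the greedy splitting, the 90-day threshold, and the handling of a trailing incomplete run are all assumptions of this sketch, as is the function name.

```python
from datetime import date


def segment_hba1c(measurements, min_values=5, min_days=90):
    """Greedily split (date, value) pairs into consecutive segments.

    A segment is closed as soon as it holds at least `min_values` values
    and spans at least `min_days` days. A trailing run that never meets
    both criteria is appended to the previous segment, or dropped if no
    segment exists (assumed behavior).
    """
    ordered = sorted(measurements)
    segments, current = [], []
    for day, value in ordered:
        current.append((day, value))
        span = (current[-1][0] - current[0][0]).days
        if len(current) >= min_values and span >= min_days:
            segments.append(current)
            current = []
    if current and segments:
        segments[-1].extend(current)
    return segments
```

Under these assumptions a patient with twelve monthly measurements yields two segments, whereas a patient with only four measurements, or with five measurements taken within one week, yields none and would be excluded from the sub-population.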

We performed three analyses on the longitudinal pre-processed HbA1c data for each patient: 1) a Bayesian analysis of change point detection to find potential peaks of HbA1c values in a patient, 2) analysis of mixed effects models to estimate the HbA1c value at diabetes onset, and, 3) analysis of the frequency of values in different HbA1c bins (e.g. general level for diagnosing T1D or T2D, the critical interval for hyperglycemia etc.) to appoint an HbA1c severity score.

Laboratory test data

The laboratory tests were longitudinal data such as blood pressure measurements and biochemical analyses of blood and urine samples, and each test was assigned a unique identifier using the NPU-terminology, which is the recommended administration and communication measure of laboratory tests in Denmark (Petersen et al., 2012). In our data, several laboratory tests had an SDC identifier, being from local laboratory facilities at SDCC. Both test IDs, NPU and SDC, were analyzed separately, despite sometimes measuring the same biochemical variables.

In total, 420 different physiological tests were performed across 14,847 patients from the entire corpus. Measurements within and between tests were unbalanced with no general system in measurement interval, frequency, or number of patients who had a test taken. Due to this lack of systematic coverage, only tests that were performed on at least 75% of the entire corpus (10,788 patients) were analyzed (26 tests). However, the test for C-peptide (NPU18004) was also included as it was available for 74.9% of the cohort and is widely used to distinguish T1D and T2D. Measurements outside the biological reference interval for a given test, for example HbA1c measurements below 15 mmol/mol and above 184 mmol/mol, were removed, and for each patient the mean, median and standard deviation for each test with continuous values (20 of the 26 tests) were calculated. If the data for a test were not normally distributed, we log-transformed them, and we normalized all values to mean = 0 and SD = 1. All analyses after assigning patients to clusters were performed on the 10,788 patients.
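The filtering and normalization steps can be sketched as below. The sketch assumes that the decision to log-transform is made beforehand and passed in as a flag, since the normality check is not specified; the function name is hypothetical.

```python
import math
import statistics


def standardize(values, lower, upper, log_transform=False):
    """Drop values outside [lower, upper], optionally log-transform,
    and scale the remainder to mean 0 and standard deviation 1."""
    kept = [v for v in values if lower <= v <= upper]
    if log_transform:
        kept = [math.log(v) for v in kept]
    mean = statistics.mean(kept)
    sd = statistics.stdev(kept)  # sample standard deviation
    return [(v - mean) / sd for v in kept]
```

For HbA1c, `standardize(values, 15, 184)` applies the interval quoted above; at least two values with non-zero spread must remain after filtering for the standard deviation to be defined.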

We applied a MANOVA to test whether means among the three different patient groups (clusters, symptom groups or patients being dysregulated) were significantly different, and a Kolmogorov–Smirnov test was applied to investigate whether the distribution of the sample means in the patient groups was significantly higher or lower than the means in the remaining groups. All p-values were adjusted using Bonferroni correction for multiple testing, and an adjusted p-value ≤ 0.05 was considered significant.

Drug prescription data

Prescription data were available for 12,147 patients with a total of 440,555 drug prescriptions. Drug compounds were identified by the ATC classification system, which is divided into groups at five different levels. In this study, we summarized the data using ATC codes at levels three and four, which denote chemical, pharmacological or therapeutic subgroups.

From the initial set of prescriptions, we manually reviewed 104 drugs which did not have an ATC code in the EHR or were mapped to more than one ATC code. In addition to the manual review, pro.medicin (www.pro.medicin.dk, accessed October 2018) was used to map drug names to their corresponding ATC code. The SDCC prescription data and the WHO Collaborating Centre for Drug Statistics Methodology (www.whocc.no, accessed October 2018) were used for crosschecking. We performed Fisher’s exact test to investigate prescription enrichment (3rd level of the ATC classification) in clusters with at least 50 patients. The p-values were adjusted using Benjamini–Hochberg correction for multiple testing, and an adjusted p-value≤0.05 was considered significant.

Genomic characterization

A total of 2290 patients with T2D and 1028 patients with T1D from SDCC were genotyped separately using the HumanOmniExpress (24v1) array from Illumina as previously described (Charmet et al., 2018; Steinthorsdottir et al., 2014). Genotypes were called using GenomeStudio, and imputed separately using the Haplotype Reference Consortium (HRC) imputation panel (McCarthy et al., 2016). Prior to imputation, the two datasets were filtered to retain only high-quality samples/SNPs (sample call rate ≥98%, no mislabeled sex, no ethnic outliers, heterozygosity within 2 SD from the mean, SNP call rate ≥98%, no monomorphic SNPs, no Hardy–Weinberg disequilibrium outliers). After imputation, SNPs with minor allele frequency (MAF) <0.01, more than 20% missingness, or R square less than 0.30, as well as duplicate SNPs, were removed, and the two datasets were merged retaining only variants common to the two sets. After merging, relatedness between all individuals was calculated and close relatives were excluded. Of the 3318 patients, 2337 had EHR information and could be mapped to clusters. In total 2125 patients passed quality control and were taken forward for genomic characterization. Logistic regression was used to test for genetic differences (PLINK 1.90 beta, https://www.cog-genomics.org/1.9) between the different groups of interest (clusters and symptom groups) and linear regression was used to evaluate the SNPs' impact on dysregulation. Cases were defined as all individuals in a given cluster/symptom group, and controls as all individuals not belonging to the respective cluster/symptom group. Glycemic dysregulation was defined as a score ranging from 0 (low) to 5 (high) based on five dysregulation parameters (see section on glycemic dysregulation). All analyses were adjusted for age and sex. The test statistics were adjusted for inflation (population stratification) using the first three principal components estimated using the --pca function in PLINK.

Genetic associations were defined based on data derived from the EBI GWAS catalog version 1.0.1 (http://www.ebi.ac.uk/gwas/) unless otherwise stated. A p-value less than $5 \times 10^{-8}$ was considered genome-wide significant.
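The post-imputation SNP filter described in this section can be summarized in a short sketch. The record layout (dictionaries with `id`, `maf`, `missing` and `r2` keys) is hypothetical, and keeping the first copy of a duplicated SNP is an assumption; the text says only that duplicates were removed.

```python
def filter_snps(snps, min_maf=0.01, max_missing=0.20, min_r2=0.30):
    """Keep SNPs that pass the MAF, missingness and imputation-quality
    thresholds, retaining only the first record per SNP identifier."""
    seen = set()
    kept = []
    for snp in snps:
        if snp["maf"] < min_maf:
            continue  # too rare
        if snp["missing"] > max_missing:
            continue  # too many missing genotypes
        if snp["r2"] < min_r2:
            continue  # poorly imputed
        if snp["id"] in seen:
            continue  # duplicate of a SNP already kept
        seen.add(snp["id"])
        kept.append(snp)
    return kept
```

In practice such thresholds are usually applied by the genotype-processing tool itself; the sketch only makes the order and direction of the comparisons explicit.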

Data availability

All data generated or analysed during this study are included in the manuscript and supporting files, except for the raw person-sensitive electronic health record data, which are withheld due to confidentiality requirements.

References

(2004) Stratification of type 1 diabetes risk on the basis of islet autoantibody characteristics
Diabetes 53:384–392.

https://doi.org/10.2337/diabetes.53.2.384
(2006) An update on the etiology and epidemiology of diabetes mellitus
Annals of the New York Academy of Sciences 1084:1–29.

https://doi.org/10.1196/annals.1372.029
(2015) The genetics of diabetic complications
Nature Reviews Nephrology 11:277–287.

https://doi.org/10.1038/nrneph.2015.37
1. Ahlqvist E
2. Storm P
3. Käräjämäki A
4. Martinell M
5. Dorkhan M
6. Carlsson A
7. Vikman P
8. Prasad RB
9. Aly DM
10. Almgren P
11. Wessman Y
12. Shaat N
13. Spégel P
14. Mulder H
15. Lindholm E
16. Melander O
17. Hansson O
18. Malmqvist U
19. Lernmark Å
20. Lahti K
21. Forsén T
22. Tuomi T
23. Rosengren AH
24. Groop L
(2018) Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables
The Lancet Diabetes & Endocrinology 6:361–369.

https://doi.org/10.1016/S2213-8587(18)30051-2
1. American Diabetes Association
(2017) 2. classification and diagnosis of diabetes
Diabetes Care 40:S11–S24.

https://doi.org/10.2337/dc17-S005
(2000) International consensus and practical guidelines on the management and the prevention of the diabetic foot
Diabetes/Metabolism Research and Reviews 16:S84–S92.

https://doi.org/10.1002/1520-7560(200009/10)16:1+<::AID-DMRR113>3.0.CO;2-S
1. Bitzur R
2. Cohen H
3. Kamari Y
4. Shaish A
5. Harats D
(2009) Triglycerides and HDL cholesterol
Diabetes Care 32:S373–S377.

https://doi.org/10.2337/dc09-S343
1. Cebul RD
2. Love TE
3. Jain AK
4. Hebert CJ
(2011) Electronic health records and quality of diabetes care
New England Journal of Medicine 365:825–833.

https://doi.org/10.1056/NEJMsa1102519
1. Charmet R
2. Duffy S
3. Keshavarzi S
4. Gyorgy B
5. Marre M
6. Rossing P
7. McKnight AJ
8. Maxwell AP
9. Ahluwalia TVS
10. Paterson AD
11. Trégouët DA
12. Hadjadj S
(2018) Novel risk genes identified in a genome-wide association study for coronary artery disease in patients with type 1 diabetes
Cardiovascular Diabetology 17:61.

https://doi.org/10.1186/s12933-018-0705-0
1. Chen Y
2. Li L
3. Xu R
(2015)
Disease comorbidity network guides the detection of molecular evidence for the link between colorectal Cancer and obesity

AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science 2015:201–206.
(2015) Predictability bounds of electronic health records
Scientific Reports 5:11865.

https://doi.org/10.1038/srep11865
1. Denny JC
(2012) Chapter 13: mining electronic health records in the genomics era
PLOS Computational Biology 8:e1002823.

https://doi.org/10.1371/journal.pcbi.1002823
(2014) Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis
Pediatrics 133:e54–e63.

https://doi.org/10.1542/peds.2013-0819
1. Draaken M
2. Knapp M
3. Pennimpede T
4. Schmidt JM
5. Ebert AK
6. Rösch W
7. Stein R
8. Utsch B
9. Hirsch K
10. Boemers TM
11. Mangold E
12. Heilmann S
13. Ludwig KU
14. Jenetzky E
15. Zwink N
16. Moebus S
17. Herrmann BG
18. Mattheisen M
19. Nöthen MM
20. Ludwig M
21. Reutter H
(2015) Genome-wide association study and meta-analysis identify ISL1 as genome-wide significant susceptibility gene for bladder exstrophy
PLOS Genetics 11:e1005024.

https://doi.org/10.1371/journal.pgen.1005024
(2013) Dictionary construction and identification of possible adverse drug events in danish clinical narrative text
Journal of the American Medical Informatics Association 20:947–953.

https://doi.org/10.1136/amiajnl-2013-001708
(2016) Common and rare forms of diabetes mellitus: towards a continuum of diabetes subtypes
Nature Reviews Endocrinology 12:394–406.

https://doi.org/10.1038/nrendo.2016.50
(2015) Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows
Journal of Biomedical Semantics 6:8.

https://doi.org/10.1186/s13326-015-0004-6
1. Giannini C
2. Santoro N
3. Caprio S
4. Kim G
5. Lartaud D
6. Shaw M
7. Pierpont B
8. Weiss R
(2011) The Triglyceride-to-HDL cholesterol ratio
Diabetes Care 34:1869–1874.

https://doi.org/10.2337/dc10-2234
(2016) Large-Scale discovery of Disease-Disease and Disease-Gene associations
Scientific Reports 6:32404.

https://doi.org/10.1038/srep32404
1. Goes FS
2. McGrath J
3. Avramopoulos D
4. Wolyniec P
5. Pirooznia M
6. Ruczinski I
7. Nestadt G
8. Kenny EE
9. Vacic V
10. Peters I
11. Lencz T
12. Darvasi A
13. Mulle JG
14. Warren ST
15. Pulver AE
(2015) Genome-wide association study of schizophrenia in ashkenazi jews
American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 168:649–659.

https://doi.org/10.1002/ajmg.b.32349
1. Gottesman O
2. Kuivaniemi H
3. Tromp G
4. Faucett WA
5. Li R
6. Manolio TA
7. Sanderson SC
8. Kannry J
9. Zinberg R
10. Basford MA
11. Brilliant M
12. Carey DJ
13. Chisholm RL
14. Chute CG
15. Connolly JJ
16. Crosslin D
17. Denny JC
18. Gallego CJ
19. Haines JL
20. Hakonarson H
21. Harley J
22. Jarvik GP
23. Kohane I
24. Kullo IJ
25. Larson EB
26. McCarty C
27. Ritchie MD
28. Roden DM
29. Smith ME
30. Böttinger EP
31. Williams MS
32. eMERGE Network
(2013) The electronic medical records and genomics (eMERGE) Network: past, present, and future
Genetics in Medicine 15:761–771.

https://doi.org/10.1038/gim.2013.72
1. Hunt DL
(2011)
Diabetes: foot ulcers and amputations

BMJ Clinical Evidence 2011:0602.
(2012) Mining electronic health records: towards better research applications and clinical care
Nature Reviews Genetics 13:395–405.

https://doi.org/10.1038/nrg3208
1. Jensen AB
2. Moseley PL
3. Oprea TI
4. Ellesøe SG
5. Eriksson R
6. Schmock H
7. Jensen PB
8. Jensen LJ
9. Brunak S
(2014) Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients
Nature Communications 5:4022.

https://doi.org/10.1038/ncomms5022
1. Jones GT
2. Tromp G
3. Kuivaniemi H
4. Gretarsdottir S
5. Baas AF
6. Giusti B
7. Strauss E
8. Van't Hof FN
9. Webb TR
10. Erdman R
11. Ritchie MD
12. Elmore JR
13. Verma A
14. Pendergrass S
15. Kullo IJ
16. Ye Z
17. Peissig PL
18. Gottesman O
19. Verma SS
20. Malinowski J
21. Rasmussen-Torvik LJ
22. Borthwick KM
23. Smelser DT
24. Crosslin DR
25. de Andrade M
26. Ryer EJ
27. McCarty CA
28. Böttinger EP
29. Pacheco JA
30. Crawford DC
31. Carrell DS
32. Gerhard GS
33. Franklin DP
34. Carey DJ
35. Phillips VL
36. Williams MJ
37. Wei W
38. Blair R
39. Hill AA
40. Vasudevan TM
41. Lewis DR
42. Thomson IA
43. Krysa J
44. Hill GB
45. Roake J
46. Merriman TR
47. Oszkinis G
48. Galora S
49. Saracini C
50. Abbate R
51. Pulli R
52. Pratesi C
53. Saratzis A
54. Verissimo AR
55. Bumpstead S
56. Badger SA
57. Clough RE
58. Cockerill G
59. Hafez H
60. Scott DJ
61. Futers TS
62. Romaine SP
63. Bridge K
64. Griffin KJ
65. Bailey MA
66. Smith A
67. Thompson MM
68. van Bockxmeer FM
69. Matthiasson SE
70. Thorleifsson G
71. Thorsteinsdottir U
72. Blankensteijn JD
73. Teijink JA
74. Wijmenga C
75. de Graaf J
76. Kiemeney LA
77. Lindholt JS
78. Hughes A
79. Bradley DT
80. Stirrups K
81. Golledge J
82. Norman PE
83. Powell JT
84. Humphries SE
85. Hamby SE
86. Goodall AH
87. Nelson CP
88. Sakalihasan N
89. Courtois A
90. Ferrell RE
91. Eriksson P
92. Folkersen L
93. Franco-Cereceda A
94. Eicher JD
95. Johnson AD
96. Betsholtz C
97. Ruusalepp A
98. Franzén O
99. Schadt EE
100. Björkegren JL
101. Lipovich L
102. Drolet AM
103. Verhoeven EL
104. Zeebregts CJ
105. Geelkerken RH
106. van Sambeek MR
107. van Sterkenburg SM
108. de Vries JP
109. Stefansson K
110. Thompson JR
111. de Bakker PI
112. Deloukas P
113. Sayers RD
114. Harrison SC
115. van Rij AM
116. Samani NJ
117. Bown MJ
(2017) Meta-Analysis of Genome-Wide association studies for abdominal aortic aneurysm identifies four new Disease-Specific risk loci
Circulation Research 120:341–353.

https://doi.org/10.1161/CIRCRESAHA.116.308765
(2016) The danish adult diabetes registry
Clinical Epidemiology 8:429–434.

https://doi.org/10.2147/CLEP.S99518
1. Juarez DT
2. Sentell T
3. Tokumaru S
4. Goo R
5. Davis JW
6. Mau MM
(2012) Factors associated with poor glycemic control or wide glycemic variability among diabetes patients in Hawaii, 2006-2009
Preventing Chronic Disease 9:120065.

https://doi.org/10.5888/pcd9.120065
(2008) Robustness of community structure in networks
Physical Review E 77:046119.

https://doi.org/10.1103/PhysRevE.77.046119
1. Kho AN
2. Pacheco JA
3. Peissig PL
4. Rasmussen L
5. Newton KM
6. Weston N
7. Crane PK
8. Pathak J
9. Chute CG
10. Bielinski SJ
11. Kullo IJ
12. Li R
13. Manolio TA
14. Chisholm RL
15. Denny JC
(2011) Electronic medical records for genetic research: results of the eMERGE consortium
Science Translational Medicine 3:79re1.

https://doi.org/10.1126/scitranslmed.3001807
1. Kho AN
2. Hayes MG
3. Rasmussen-Torvik L
4. Pacheco JA
5. Thompson WK
6. Armstrong LL
7. Denny JC
8. Peissig PL
9. Miller AW
10. Wei WQ
11. Bielinski SJ
12. Chute CG
13. Leibson CL
14. Jarvik GP
15. Crosslin DR
16. Carlson CS
17. Newton KM
18. Wolf WA
19. Chisholm RL
20. Lowe WL
(2012) Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study
Journal of the American Medical Informatics Association 19:212–218.

https://doi.org/10.1136/amiajnl-2011-000439
1. Laakso M
(2001) Cardiovascular disease in type 2 diabetes: challenge for treatment and prevention
Journal of Internal Medicine 249:225–235.

https://doi.org/10.1046/j.1365-2796.2001.00789.x
1. Li L
2. Cheng WY
3. Glicksberg BS
4. Gottesman O
5. Tamler R
6. Chen R
7. Bottinger EP
8. Dudley JT
(2015) Identification of type 2 diabetes subgroups through topological analysis of patient similarity
Science Translational Medicine 7:311ra174.

https://doi.org/10.1126/scitranslmed.aaa9364
1. Lin Z
2. Bei J-X
3. Shen M
4. Li Q
5. Liao Z
6. Zhang Y
7. Lv Q
8. Wei Q
9. Low H-Q
10. Guo Y-M
11. Cao S
12. Yang M
13. Hu Z
14. Xu M
15. Wang X
16. Wei Y
17. Li L
18. Li C
19. Li T
20. Huang J
21. Pan Y
22. Jin O
23. Wu Y
24. Wu J
25. Guo Z
26. He P
27. Hu S
28. Wu H
29. Song H
30. Zhan F
31. Liu S
32. Gao G
33. Liu Z
34. Li Y
35. Xiao C
36. Li J
37. Ye Z
38. He W
39. Liu D
40. Shen L
41. Huang A
42. Wu H
43. Tao Y
44. Pan X
45. Yu B
46. Tai ES
47. Zeng Y-X
48. Ren EC
49. Shen Y
50. Liu J
51. Gu J
(2012) A genome-wide association study in Han Chinese identifies new susceptibility loci for ankylosing spondylitis
Nature Genetics 44:73–77.

https://doi.org/10.1038/ng.1005
1. McCarthy S
2. Das S
3. Kretzschmar W
4. Delaneau O
5. Wood AR
6. Teumer A
7. Kang HM
8. Fuchsberger C
9. Danecek P
10. Sharp K
11. Luo Y
12. Sidore C
13. Kwong A
14. Timpson N
15. Koskinen S
16. Vrieze S
17. Scott LJ
18. Zhang H
19. Mahajan A
20. Veldink J
21. Peters U
22. Pato C
23. van Duijn CM
24. Gillies CE
25. Gandin I
26. Mezzavilla M
27. Gilly A
28. Cocca M
29. Traglia M
30. Angius A
31. Barrett JC
32. Boomsma D
33. Branham K
34. Breen G
35. Brummett CM
36. Busonero F
37. Campbell H
38. Chan A
39. Chen S
40. Chew E
41. Collins FS
42. Corbin LJ
43. Smith GD
44. Dedoussis G
45. Dorr M
46. Farmaki AE
47. Ferrucci L
48. Forer L
49. Fraser RM
50. Gabriel S
51. Levy S
52. Groop L
53. Harrison T
54. Hattersley A
55. Holmen OL
56. Hveem K
57. Kretzler M
58. Lee JC
59. McGue M
60. Meitinger T
61. Melzer D
62. Min JL
63. Mohlke KL
64. Vincent JB
65. Nauck M
66. Nickerson D
67. Palotie A
68. Pato M
69. Pirastu N
70. McInnis M
71. Richards JB
72. Sala C
73. Salomaa V
74. Schlessinger D
75. Schoenherr S
76. Slagboom PE
77. Small K
78. Spector T
79. Stambolian D
80. Tuke M
81. Tuomilehto J
82. Van den Berg LH
83. Van Rheenen W
84. Volker U
85. Wijmenga C
86. Toniolo D
87. Zeggini E
88. Gasparini P
89. Sampson MG
90. Wilson JF
91. Frayling T
92. de Bakker PI
93. Swertz MA
94. McCarroll S
95. Kooperberg C
96. Dekker A
97. Altshuler D
98. Willer C
99. Iacono W
100. Ripatti S
101. Soranzo N
102. Walter K
103. Swaroop A
104. Cucca F
105. Anderson CA
106. Myers RM
107. Boehnke M
108. McCarthy MI
109. Durbin R
110. Haplotype Reference Consortium
(2016) A reference panel of 64,976 haplotypes for genotype imputation
Nature Genetics 48:1279–1283.

https://doi.org/10.1038/ng.3643
1. McCarty CA
2. Chisholm RL
3. Chute CG
4. Kullo IJ
5. Jarvik GP
6. Larson EB
7. Li R
8. Masys DR
9. Ritchie MD
10. Roden DM
11. Struewing JP
12. Wolf WA
13. eMERGE Team
(2011) The eMERGE network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies
BMC Medical Genomics 4:13.

https://doi.org/10.1186/1755-8794-4-13
1. Meilă M
(2007) Comparing clusterings—an information based distance
Journal of Multivariate Analysis 98:873–895.

https://doi.org/10.1016/j.jmva.2006.11.013
1. Miotto R
2. Li L
3. Kidd BA
4. Dudley JT
(2016) Deep patient: an unsupervised representation to predict the future of patients from the electronic health records
Scientific Reports 6:26094.

https://doi.org/10.1038/srep26094
(1993) The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus
New England Journal of Medicine 329:977–986.

https://doi.org/10.1056/NEJM199309303291401
(2011) Impact of electronic health record clinical decision support on diabetes care: a randomized trial
The Annals of Family Medicine 9:12–21.

https://doi.org/10.1370/afm.1196
1. Pantalone KM
2. Kattan MW
3. Yu C
4. Wells BJ
5. Arrigain S
6. Jain A
7. Atreja A
8. Zimmerman RS
(2009) The risk of developing coronary artery disease or congestive heart failure, and overall mortality, in type 2 diabetic patients receiving rosiglitazone, pioglitazone, metformin, or sulfonylureas: a retrospective analysis
Acta Diabetologica 46:145–154.

https://doi.org/10.1007/s00592-008-0090-3
1. Pantalone KM
2. Kattan MW
3. Yu C
4. Wells BJ
5. Arrigain S
6. Jain A
7. Atreja A
8. Zimmerman RS
(2010) The risk of overall mortality in patients with type 2 diabetes receiving glipizide, glyburide, or glimepiride monotherapy: a retrospective analysis
Diabetes Care 33:1224–1229.

https://doi.org/10.2337/dc10-0017
1. Parmar PG
2. Taal HR
3. Timpson NJ
4. Thiering E
5. Lehtimäki T
6. Marinelli M
7. Lind PA
8. Howe LD
9. Verwoert G
10. Aalto V
11. Uitterlinden AG
12. Briollais L
13. Evans DM
14. Wright MJ
15. Newnham JP
16. Whitfield JB
17. Lyytikäinen L-P
18. Rivadeneira F
19. Boomsma DI
20. Viikari J
21. Gillman MW
22. St Pourcain B
23. Hottenga J-J
24. Montgomery GW
25. Hofman A
26. Kähönen M
27. Martin NG
28. Tobin MD
29. Raitakari O
30. Vioque J
31. Jaddoe VWV
32. Jarvelin M-R
33. Beilin LJ
34. Heinrich J
35. van Duijn CM
36. Pennell CE
37. Lawlor DA
38. Palmer LJ
(2016) International Genome-Wide Association Study Consortium Identifies Novel Loci Associated With Blood Pressure in Children and Adolescents
Circulation: Cardiovascular Genetics 9:266–278.

https://doi.org/10.1161/CIRCGENETICS.115.001190
1. Pereira L
2. Rijo R
3. Silva C
4. Agostinho M
(2013) ICD9-based text mining approach to children epilepsy classification
Procedia Technology 9:1351–1360.

https://doi.org/10.1016/j.protcy.2013.12.152
1. Perry JR
2. Voight BF
3. Yengo L
4. Amin N
5. Dupuis J
6. Ganser M
7. Grallert H
8. Navarro P
9. Li M
10. Qi L
11. Steinthorsdottir V
12. Scott RA
13. Almgren P
14. Arking DE
15. Aulchenko Y
16. Balkau B
17. Benediktsson R
18. Bergman RN
19. Boerwinkle E
20. Bonnycastle L
21. Burtt NP
22. Campbell H
23. Charpentier G
24. Collins FS
25. Gieger C
26. Green T
27. Hadjadj S
28. Hattersley AT
29. Herder C
30. Hofman A
31. Johnson AD
32. Kottgen A
33. Kraft P
34. Labrune Y
35. Langenberg C
36. Manning AK
37. Mohlke KL
38. Morris AP
39. Oostra B
40. Pankow J
41. Petersen AK
42. Pramstaller PP
43. Prokopenko I
44. Rathmann W
45. Rayner W
46. Roden M
47. Rudan I
48. Rybin D
49. Scott LJ
50. Sigurdsson G
51. Sladek R
52. Thorleifsson G
53. Thorsteinsdottir U
54. Tuomilehto J
55. Uitterlinden AG
56. Vivequin S
57. Weedon MN
58. Wright AF
59. Hu FB
60. Illig T
61. Kao L
62. Meigs JB
63. Wilson JF
64. Stefansson K
65. van Duijn C
66. Altschuler D
67. Morris AD
68. Boehnke M
69. McCarthy MI
70. Froguel P
71. Palmer CN
72. Wareham NJ
73. Groop L
74. Frayling TM
75. Cauchi S
76. MAGIC
77. DIAGRAM Consortium
78. GIANT Consortium
(2012) Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases
PLOS Genetics 8:e1002741.

https://doi.org/10.1371/journal.pgen.1002741
(2012) Properties and units in the clinical laboratory sciences. Part XXIII. The NPU terminology, principles, and implementation: a user’s guide (IUPAC Technical Report)
Pure and Applied Chemistry 84:137–165.

https://doi.org/10.1351/PAC-REP-11-05-03
1. Raffler J
2. Friedrich N
3. Arnold M
4. Kacprowski T
5. Rueedi R
6. Altmaier E
7. Bergmann S
8. Budde K
9. Gieger C
10. Homuth G
11. Pietzner M
12. Römisch-Margl W
13. Strauch K
14. Völzke H
15. Waldenberger M
16. Wallaschofski H
17. Nauck M
18. Völker U
19. Kastenmüller G
20. Suhre K
(2015) Genome-wide association study with targeted and non-targeted NMR metabolomics identifies 15 novel loci of urinary human metabolic individuality
PLOS Genetics 11:e1005487.

https://doi.org/10.1371/journal.pgen.1005487
1. Ren Y
2. Zhang M
3. Zhao J
4. Wang C
5. Luo X
6. Zhang J
7. Zhu T
8. Li X
9. Yin L
10. Pang C
11. Feng T
12. Wang B
13. Zhang L
14. Li L
15. Yang X
16. Zhang H
17. Hu D
(2016) Association of the hypertriglyceridemic waist phenotype and type 2 diabetes mellitus among adults in China
Journal of Diabetes Investigation 7:689–694.

https://doi.org/10.1111/jdi.12489
Conference
1. Robertson SE
2. Walker S
(1994)
Some simple effective approximations to the 2–Poisson Model for Probabilistic Weighted Retrieval

SIGIR ’94 Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 232–241.
1. Roque FS
2. Jensen PB
3. Schmock H
4. Dalgaard M
5. Andreatta M
6. Hansen T
7. Søeby K
8. Bredkjær S
9. Juul A
10. Werge T
11. Jensen LJ
12. Brunak S
(2011) Using electronic patient records to discover disease correlations and stratify patient cohorts
PLOS Computational Biology 7:e1002141.

https://doi.org/10.1371/journal.pcbi.1002141
1. Rudkowska I
2. Pérusse L
3. Bellis C
4. Blangero J
5. Després JP
6. Bouchard C
7. Vohl MC
(2015) Interaction between common genetic variants and total fat intake on low-density lipoprotein peak particle diameter: a genome-wide association study
Journal of Nutrigenetics and Nutrigenomics 8:44–53.

https://doi.org/10.1159/000431151
1. Rueedi R
2. Ledda M
3. Nicholls AW
4. Salek RM
5. Marques-Vidal P
6. Morya E
7. Sameshima K
8. Montoliu I
9. Da Silva L
10. Collino S
11. Martin FP
12. Rezzi S
13. Steinbeck C
14. Waterworth DM
15. Waeber G
16. Vollenweider P
17. Beckmann JS
18. Le Coutre J
19. Mooser V
20. Bergmann S
21. Genick UK
22. Kutalik Z
(2014) Genome-wide association study of metabolic traits reveals novel gene-metabolite-disease links
PLOS Genetics 10:e1004132.

https://doi.org/10.1371/journal.pgen.1004132
(2006) Assessing glycemia in diabetes using self-monitoring blood glucose and hemoglobin A1c
JAMA 295:1688–1697.

https://doi.org/10.1001/jama.295.14.1688
1. Shin SY
2. Fauman EB
3. Petersen AK
4. Krumsiek J
5. Santos R
6. Huang J
7. Arnold M
8. Erte I
9. Forgetta V
10. Yang TP
11. Walter K
12. Menni C
13. Chen L
14. Vasquez L
15. Valdes AM
16. Hyde CL
17. Wang V
18. Ziemek D
19. Roberts P
20. Xi L
21. Grundberg E
22. Waldenberger M
23. Richards JB
24. Mohney RP
25. Milburn MV
26. John SL
27. Trimmer J
28. Theis FJ
29. Overington JP
30. Suhre K
31. Brosnan MJ
32. Gieger C
33. Kastenmüller G
34. Spector TD
35. Soranzo N
36. Multiple Tissue Human Expression Resource (MuTHER) Consortium
(2014) An atlas of genetic influences on human blood metabolites
Nature Genetics 46:543–550.

https://doi.org/10.1038/ng.2982
(2014) Identification of low-frequency and rare sequence variants associated with elevated or reduced risk of type 2 diabetes
Nature Genetics 46:294–298.

https://doi.org/10.1038/ng.2882
1. Stratton IM
2. Adler AI
3. Neil HA
4. Matthews DR
5. Manley SE
6. Cull CA
7. Hadden D
8. Turner RC
9. Holman RR
(2000) Association of glycaemia with macrovascular and microvascular complications of type 2 diabetes (UKPDS 35): prospective observational study
BMJ 321:405–412.

https://doi.org/10.1136/bmj.321.7258.405
1. Tang W
2. Schwienbacher C
3. Lopez LM
4. Ben-Shlomo Y
5. Oudot-Mellakh T
6. Johnson AD
7. Samani NJ
8. Basu S
9. Gögele M
10. Davies G
11. Lowe GD
12. Tregouet DA
13. Tan A
14. Pankow JS
15. Tenesa A
16. Levy D
17. Volpato CB
18. Rumley A
19. Gow AJ
20. Minelli C
21. Yarnell JW
22. Porteous DJ
23. Starr JM
24. Gallacher J
25. Boerwinkle E
26. Visscher PM
27. Pramstaller PP
28. Cushman M
29. Emilsson V
30. Plump AS
31. Matijevic N
32. Morange PE
33. Deary IJ
34. Hicks AA
35. Folsom AR
(2012) Genetic associations for activated partial thromboplastin time and prothrombin time, their gene expression profiles, and risk of coronary artery disease
The American Journal of Human Genetics 91:152–162.

https://doi.org/10.1016/j.ajhg.2012.05.009
1. Teslovich TM
2. Musunuru K
3. Smith AV
4. Edmondson AC
5. Stylianou IM
6. Koseki M
7. Pirruccello JP
8. Ripatti S
9. Chasman DI
10. Willer CJ
11. Johansen CT
12. Fouchier SW
13. Isaacs A
14. Peloso GM
15. Barbalic M
16. Ricketts SL
17. Bis JC
18. Aulchenko YS
19. Thorleifsson G
20. Feitosa MF
21. Chambers J
22. Orho-Melander M
23. Melander O
24. Johnson T
25. Li X
26. Guo X
27. Li M
28. Shin Cho Y
29. Jin Go M
30. Jin Kim Y
31. Lee JY
32. Park T
33. Kim K
34. Sim X
35. Twee-Hee Ong R
36. Croteau-Chonka DC
37. Lange LA
38. Smith JD
39. Song K
40. Hua Zhao J
41. Yuan X
42. Luan J
43. Lamina C
44. Ziegler A
45. Zhang W
46. Zee RY
47. Wright AF
48. Witteman JC
49. Wilson JF
50. Willemsen G
51. Wichmann HE
52. Whitfield JB
53. Waterworth DM
54. Wareham NJ
55. Waeber G
56. Vollenweider P
57. Voight BF
58. Vitart V
59. Uitterlinden AG
60. Uda M
61. Tuomilehto J
62. Thompson JR
63. Tanaka T
64. Surakka I
65. Stringham HM
66. Spector TD
67. Soranzo N
68. Smit JH
69. Sinisalo J
70. Silander K
71. Sijbrands EJ
72. Scuteri A
73. Scott J
74. Schlessinger D
75. Sanna S
76. Salomaa V
77. Saharinen J
78. Sabatti C
79. Ruokonen A
80. Rudan I
81. Rose LM
82. Roberts R
83. Rieder M
84. Psaty BM
85. Pramstaller PP
86. Pichler I
87. Perola M
88. Penninx BW
89. Pedersen NL
90. Pattaro C
91. Parker AN
92. Pare G
93. Oostra BA
94. O'Donnell CJ
95. Nieminen MS
96. Nickerson DA
97. Montgomery GW
98. Meitinger T
99. McPherson R
100. McCarthy MI
101. McArdle W
102. Masson D
103. Martin NG
104. Marroni F
105. Mangino M
106. Magnusson PK
107. Lucas G
108. Luben R
109. Loos RJ
110. Lokki ML
111. Lettre G
112. Langenberg C
113. Launer LJ
114. Lakatta EG
115. Laaksonen R
116. Kyvik KO
117. Kronenberg F
118. König IR
119. Khaw KT
120. Kaprio J
121. Kaplan LM
122. Johansson A
123. Jarvelin MR
124. Janssens AC
125. Ingelsson E
126. Igl W
127. Kees Hovingh G
128. Hottenga JJ
129. Hofman A
130. Hicks AA
131. Hengstenberg C
132. Heid IM
133. Hayward C
134. Havulinna AS
135. Hastie ND
136. Harris TB
137. Haritunians T
138. Hall AS
139. Gyllensten U
140. Guiducci C
141. Groop LC
142. Gonzalez E
143. Gieger C
144. Freimer NB
145. Ferrucci L
146. Erdmann J
147. Elliott P
148. Ejebe KG
149. Döring A
150. Dominiczak AF
151. Demissie S
152. Deloukas P
153. de Geus EJ
154. de Faire U
155. Crawford G
156. Collins FS
157. Chen YD
158. Caulfield MJ
159. Campbell H
160. Burtt NP
161. Bonnycastle LL
162. Boomsma DI
163. Boekholdt SM
164. Bergman RN
165. Barroso I
166. Bandinelli S
167. Ballantyne CM
168. Assimes TL
169. Quertermous T
170. Altshuler D
171. Seielstad M
172. Wong TY
173. Tai ES
174. Feranil AB
175. Kuzawa CW
176. Adair LS
177. Taylor HA
178. Borecki IB
179. Gabriel SB
180. Wilson JG
181. Holm H
182. Thorsteinsdottir U
183. Gudnason V
184. Krauss RM
185. Mohlke KL
186. Ordovas JM
187. Munroe PB
188. Kooner JS
189. Tall AR
190. Hegele RA
191. Kastelein JJ
192. Schadt EE
193. Rotter JI
194. Boerwinkle E
195. Strachan DP
196. Mooser V
197. Stefansson K
198. Reilly MP
199. Samani NJ
200. Schunkert H
201. Cupples LA
202. Sandhu MS
203. Ridker PM
204. Rader DJ
205. van Duijn CM
206. Peltonen L
207. Abecasis GR
208. Boehnke M
209. Kathiresan S
(2010) Biological, clinical and population relevance of 95 loci for blood lipids
Nature 466:707–713.

https://doi.org/10.1038/nature09270
Conference
(2014)
Negation scope and spelling variation for text-mining of Danish electronic patient records

Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis. pp. 64–88.
1. Tran T
2. Nguyen TD
3. Phung D
4. Venkatesh S
(2015) Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM)
Journal of Biomedical Informatics 54:96–105.

https://doi.org/10.1016/j.jbi.2015.01.012
1. UK Prospective Diabetes Study Group
(1998a) Effect of intensive blood-glucose control with metformin on complications in overweight patients with type 2 diabetes (UKPDS 34)
The Lancet 352:854–865.

https://doi.org/10.1016/S0140-6736(98)07037-8
1. UK Prospective Diabetes Study Group
(1998b) Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes (UKPDS 33)
The Lancet 352:837–853.

https://doi.org/10.1016/S0140-6736(98)07019-6
Thesis
1. Van Dongen S
(2000)
Graph Clustering by Flow Simulation

University of Utrecht.
(2017) Genome-wide association analysis identifies novel blood pressure loci and offers biological insights into cardiovascular risk
Nature Genetics 49:403–415.

https://doi.org/10.1038/ng.3768
(2012) Using text-mining techniques in electronic patient records to identify ADRs from medicine use
British Journal of Clinical Pharmacology 73:674–684.

https://doi.org/10.1111/j.1365-2125.2011.04153.x
(2013) Ischemic stroke is associated with the ABO locus: the EuroCLOT study
Annals of Neurology 73:16–31.

https://doi.org/10.1002/ana.23838
1. Xie T
2. Deng L
3. Mei P
4. Zhou Y
5. Wang B
6. Zhang J
7. Lin J
8. Wei Y
9. Zhang X
10. Xu R
(2014) A genome-wide association study combining pathway analysis for typical sporadic amyotrophic lateral sclerosis in Chinese Han populations
Neurobiology of Aging 35:1778.e9–1778.e23.

https://doi.org/10.1016/j.neurobiolaging.2014.01.014

Article and author information

Author details

Isa Kristina Kirk

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing

Contributed equally with
Christian Simon

Competing interests
No competing interests declared
Christian Simon

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing

Contributed equally with
Isa Kristina Kirk

Competing interests
No competing interests declared
Karina Banasik

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Formal analysis, Methodology, Writing—original draft, Project administration, Writing—review and editing

Competing interests
No competing interests declared
Peter Christoffer Holm

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Formal analysis, Methodology, Writing—review and editing

Competing interests
No competing interests declared
Amalie Dahl Haue

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Formal analysis, Methodology, Writing—review and editing

Competing interests
No competing interests declared
Peter Bjødstrup Jensen
1. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
2. Odense Patient Data Explorative Network (OPEN), Odense University Hospital, Odense, Denmark
Contribution
Formal analysis, Supervision, Methodology, Writing—review and editing

Competing interests
No competing interests declared
Lars Juhl Jensen

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Conceptualization, Data curation, Formal analysis, Supervision, Methodology, Writing—original draft, Writing—review and editing

Competing interests
No competing interests declared
Cristina Leal Rodríguez

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Methodology, Writing—review and editing

Competing interests
No competing interests declared
Mette Krogh Pedersen

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Data curation, Methodology

Competing interests
No competing interests declared
Robert Eriksson

Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Data curation, Methodology

Competing interests
No competing interests declared
Henrik Ullits Andersen

Steno Diabetes Center Copenhagen, Gentofte, Denmark

Contribution
Conceptualization, Resources, Data curation, Validation, Writing—review and editing

Competing interests
No competing interests declared
Thomas Almdal
1. Steno Diabetes Center Copenhagen, Gentofte, Denmark
2. Department of Endocrinology, Rigshospitalet, Copenhagen, Denmark
Contribution
Conceptualization, Resources, Data curation, Writing—review and editing

Competing interests
No competing interests declared
Jette Bork-Jensen

Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Data curation, Writing—review and editing

Competing interests
No competing interests declared
Niels Grarup

Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark

Contribution
Data curation, Writing—review and editing

Competing interests
No competing interests declared
Knut Borch-Johnsen

Holbæk Hospital, Holbæk, Denmark

Contribution
Conceptualization, Resources, Supervision, Project administration, Writing—review and editing

Competing interests
No competing interests declared
Oluf Pedersen
1. Steno Diabetes Center Copenhagen, Gentofte, Denmark
2. Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
Contribution
Conceptualization, Resources, Data curation, Writing—review and editing

Competing interests
No competing interests declared
Flemming Pociot
1. Steno Diabetes Center Copenhagen, Gentofte, Denmark
2. Department of Clinical Medicine, Herlev-Gentofte Hospital, Herlev, Denmark
Contribution
Resources, Data curation, Methodology, Writing—review and editing

Competing interests
No competing interests declared
Torben Hansen
1. Steno Diabetes Center Copenhagen, Gentofte, Denmark
2. Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark
Contribution
Conceptualization, Resources, Data curation, Supervision, Writing—review and editing

Competing interests
No competing interests declared
Regine Bergholdt

Steno Diabetes Center Copenhagen, Gentofte, Denmark

Contribution
Conceptualization, Resources, Data curation, Methodology, Project administration, Writing—review and editing

Competing interests
No competing interests declared
Peter Rossing
1. Steno Diabetes Center Copenhagen, Gentofte, Denmark
2. Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark
Contribution
Conceptualization, Resources, Data curation, Supervision, Methodology, Writing—original draft, Project administration, Writing—review and editing

For correspondence
peter.rossing@regionh.dk

Competing interests
No competing interests declared
Søren Brunak
1. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
2. Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
Contribution
Conceptualization, Resources, Data curation, Funding acquisition, Methodology, Writing—original draft, Project administration, Writing—review and editing

For correspondence
soren.brunak@cpr.ku.dk

Competing interests
No competing interests declared

ORCID iD: 0000-0003-0316-5866

Funding

Danish Council for Strategic Research (0603-00321B)

Søren Brunak

Innovation Fund Denmark (5153-00002B)

Søren Brunak

Novo Nordisk Foundation (NNF14CC0001)

Søren Brunak

Novo Nordisk Foundation (NNF17OC0027594)

Søren Brunak

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.