Prediction of type 2 diabetes mellitus onset using logistic regression-based scorecards

  1. Yochai Edlitz
  2. Eran Segal (corresponding author)
  1. Weizmann Institute of Science, Israel


Background: Type 2 diabetes (T2D) accounts for ~90% of all cases of diabetes, resulting in an estimated 6.7 million deaths in 2021, according to the International Diabetes Federation (IDF). Early detection of people at high risk of developing T2D can reduce the incidence of the disease through changes in lifestyle, diet, or medication. Because populations of lower socio-demographic status are more susceptible to T2D and might have limited resources or limited access to sophisticated computational tools, there is a need for prediction models that are both accurate and accessible.

Methods: In this study, we analyzed data from 44,709 non-diabetic UK Biobank participants aged 40-69, predicting the risk of T2D onset within a selected timeframe (mean 7.3 years, standard deviation 2.3 years). We started with 798 features that we identified as potential predictors of T2D onset. We first analyzed the data using gradient boosting decision trees, survival analysis, and logistic regression. We then devised two models: a non-laboratory model accessible to the general population, and a more precise yet still simple model that uses laboratory tests. We simplified both models to an accessible scorecard form, tested them on normoglycemic and prediabetes subcohorts, and compared the results with those of the general cohort. The non-laboratory model uses the following covariates: sex, age, weight, height, waist size, hip circumference, waist-to-hip ratio (WHR), and body mass index (BMI). The laboratory model uses age and sex together with four common blood tests: high-density lipoprotein (HDL), gamma-glutamyl transferase, glycated hemoglobin, and triglycerides. As an external validation dataset, we used the electronic medical record database of Clalit Health Services.
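As a rough illustration of the non-laboratory covariates and of what a scorecard form can look like, the Python sketch below derives BMI and WHR from raw measurements and sums integer points over binned features. The bin edges, the point values, and the names `HYPOTHETICAL_SCORECARD` and `scorecard_points` are invented for illustration; they are not the published scorecard, and only three of the covariates are included to keep the example short.

```python
from bisect import bisect_right


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def waist_to_hip_ratio(waist_cm: float, hip_cm: float) -> float:
    """Waist circumference divided by hip circumference (same units)."""
    return waist_cm / hip_cm


# ASSUMPTION: bin edges and points are made up for this sketch.
# Each entry maps a feature to (bin edges, points per bin); a feature with
# two edges has three bins: below the first edge, between, and at or above
# the second edge.
HYPOTHETICAL_SCORECARD = {
    "age": ([50, 60], [0, 2, 4]),
    "bmi": ([25, 30], [0, 3, 6]),
    "whr": ([0.85, 0.95], [0, 2, 5]),
}


def scorecard_points(features: dict) -> int:
    """Sum the points of the bin that each feature value falls into."""
    total = 0
    for name, (edges, points) in HYPOTHETICAL_SCORECARD.items():
        bin_index = bisect_right(edges, features[name])
        total += points[bin_index]
    return total


if __name__ == "__main__":
    person = {
        "age": 55,
        "bmi": bmi(weight_kg=90, height_cm=180),
        "whr": waist_to_hip_ratio(waist_cm=90, hip_cm=100),
    }
    print(scorecard_points(person))
```

In a scorecard derived from logistic regression, the points per bin would typically be scaled and rounded model coefficients, so that a higher total corresponds to a higher predicted risk; the mapping from total points to a risk group is omitted here.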

Results: The non-laboratory scorecard model achieved an area under the receiver operating characteristic curve (auROC) of 0.81 (95% confidence interval (CI) 0.77-0.84) and an odds ratio (OR) between the upper and fifth prevalence deciles of 17.2 (95% CI 5-66). Using this model, we classified the cohort into three risk groups, with risks of developing T2D of 1% (0.8-1%), 5% (3-6%), and 9% (7-12%). We further analyzed the contribution of laboratory tests and devised a blood-test scorecard model based on age, sex, and the four common blood tests noted above: glycated hemoglobin (HbA1c%), gamma-glutamyl transferase, triglycerides, and HDL cholesterol. This model achieved an auROC of 0.87 (95% CI 0.85-0.90) and a deciles' OR of 48 (95% CI 12-109). Using this model, we classified the cohort into four risk groups with the following risks of developing T2D: 0.5% (0.4%-7%), 3% (2-4%), 10% (8-12%), and a high-risk group with 23% (10-37%). When we applied the blood-test model to the external validation cohort (Clalit), it achieved an auROC of 0.75 (95% CI 0.74-0.75). We also analyzed several more comprehensive models, which included genotyping data and other environmental factors, and found that they did not provide cost-efficient benefits over the four-blood-test model. The commonly used German Diabetes Risk Score (GDRS) and Finnish Diabetes Risk Score (FINDRISC) models, trained on our data, achieved auROCs of 0.73 (0.69-0.76) and 0.66 (0.62-0.70), respectively, inferior to the results of both the four-blood-test model and the anthropometric (non-laboratory) model.
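For readers less familiar with the two metrics reported above, the following Python sketch shows one common way to compute them. It is a generic illustration rather than the authors' analysis code: the function names are hypothetical, the auROC is computed by direct pairwise comparison (suitable only for small samples), and the odds ratio is the plain ratio of odds between two groups, without the confidence intervals quoted in the text.

```python
def auroc(scores, labels):
    """Area under the ROC curve via its rank interpretation.

    Returns the probability that a randomly chosen positive case (label 1)
    receives a higher score than a randomly chosen negative case (label 0),
    counting ties as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def odds_ratio(cases_a, total_a, cases_b, total_b):
    """Odds of the outcome in group A divided by the odds in group B."""
    odds_a = cases_a / (total_a - cases_a)
    odds_b = cases_b / (total_b - cases_b)
    return odds_a / odds_b


if __name__ == "__main__":
    # Toy data, not taken from the study.
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    toy_labels = [0, 0, 1, 1]
    print(auroc(toy_scores, toy_labels))
    # Hypothetical counts: 20 of 100 cases in one decile, 2 of 100 in another.
    print(odds_ratio(20, 100, 2, 100))
```

For the deciles' OR, group A and group B would be two deciles of participants ranked by predicted risk; which deciles are compared, and how the confidence interval is obtained, follows the study's own methods and is not reproduced here.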

Conclusions: The four-blood-test and anthropometric models outperformed the commonly used non-laboratory models, FINDRISC and GDRS. We suggest that our models be used as tools for decision-makers to identify populations at elevated T2D risk and thus improve medical strategies. These models might also serve as a personal catalyst for changes in lifestyle, diet, or medication that lower the risk of T2D onset.

Funding: No Funders. The funders had no role in study design, data collection, interpretation, or the decision to submit the work for publication.

Data availability

All data used to develop the models in this research are available through the UK Biobank database. The external validation cohort is from Clalit Health Services. Both databases can be accessed upon specific request and approval, as described below.

UK Biobank: The UK Biobank data are available from UK Biobank subject to standard procedures. The UK Biobank resource is open to all bona fide researchers at bona fide research institutes to conduct health-related research in the public interest. UK Biobank welcomes applications from academia and commercial institutes.

Clalit: The data that support the findings of the external Clalit cohort originate from Clalit Health Services. Due to restrictions, these data can be accessed only by request to the authors and/or Clalit Health Services. Requests for access to all or parts of the Clalit datasets should be addressed to Clalit Health Services via the Clalit Research Institute. The Clalit Data Access committee will consider requests given the Clalit data-sharing policy.

Source code for analysis is available at

The following previously published data sets were used

Article and author information

Author details

  1. Yochai Edlitz

    Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Yavne, Israel
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0001-7733-3995
  2. Eran Segal

    Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Yavne, Israel
    For correspondence
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-6859-1164


Funding

Feinberg Graduate School, Weizmann Institute of Science

  • Eran Segal

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Nicola Napoli, Campus Bio-Medico University of Rome, Italy

Publication history

  1. Received: July 1, 2021
  2. Accepted: May 26, 2022
  3. Accepted Manuscript published: June 22, 2022 (version 1)


© 2022, Edlitz & Segal

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.




Cite this article

Yochai Edlitz, Eran Segal (2022)
Prediction of type 2 diabetes mellitus onset using logistic regression-based scorecards
eLife 11:e71862.
