Machine learning models for classification and identification of significant attributes to detect type 2 diabetes

Koushik Chandra Howlader, Md Shahriare Satu, Md Abdul Awal, Md Rabiul Islam, Sheikh Mohammed Shariful Islam, Julian M.W. Quinn, Mohammad Ali Moni

Research output: Contribution to journal › Article › peer-review



Type 2 Diabetes (T2D) is a chronic disease characterized by abnormally high blood glucose levels due to insulin resistance and reduced pancreatic insulin production. The challenge of this work is to identify T2D-associated features that can distinguish T2D sub-types for prognosis and treatment purposes. We therefore employed machine learning (ML) techniques to categorize T2D patients using the Pima Indian Diabetes Dataset from the Kaggle ML repository. After data preprocessing, several feature selection techniques were used to extract feature subsets, and a range of classification techniques were applied to these subsets. We then compared the classification results to identify the best classifiers in terms of accuracy, kappa statistic, area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and logarithmic loss (logloss). To evaluate the performance of the different classifiers, we examined summary statistics of their resampling distributions. Generalized Boosted Regression modeling showed the highest accuracy (90.91%), along with a kappa statistic of 78.77% and a specificity of 85.19%. In addition, Sparse Distance Weighted Discrimination, the Generalized Additive Model using LOESS, and Boosted Generalized Additive Models gave the maximum sensitivity (100%), the highest AUROC (95.26%), and the lowest logarithmic loss (30.98%), respectively. Notably, the Generalized Additive Model using LOESS was the top-ranked algorithm according to non-parametric Friedman testing. Of the features identified by these machine learning models, glucose level, body mass index, diabetes pedigree function, and age were most consistently identified as the best outcome predictors. These results indicate the utility of ML methods in constructing improved prediction models for T2D and identify outcome predictors for this Pima Indian population.
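For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how accuracy, Cohen's kappa, sensitivity, specificity, AUROC, and logloss can be computed for a binary classifier. It is not the authors' code: the function names (`confusion_counts`, `auroc`, `evaluate`), the 0.5 decision threshold, and the toy labels and probabilities are all illustrative assumptions, and the sketch assumes both classes are present in the labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def auroc(y_true, y_prob):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the six measures from true labels and predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Logarithmic loss, with probabilities clipped away from 0 and 1.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    logloss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / n

    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auroc": auroc(y_true, y_prob),
        "logloss": logloss,
    }


# Hypothetical toy data (not taken from the Pima dataset or the paper).
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
toy_metrics = evaluate(toy_true, toy_prob)
```

Higher is better for every measure here except logloss, where lower values indicate better-calibrated probabilities.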

Original language: English
Article number: 2
Number of pages: 13
Journal: Health Information Science and Systems
Issue number: 1
Publication status: Published - Feb 2022

