Diabetes has become a rapidly growing public health concern, placing increasing pressure on healthcare systems worldwide. Detecting the disease at an early stage is critical for minimizing long-term complications and improving patient quality of life. This study investigates the application of machine learning techniques for assessing diabetes risk using the R programming environment. Logistic Regression, Decision Tree, and Random Forest classifiers were developed and evaluated using the Pima Indians Diabetes Dataset, a benchmark dataset widely employed in medical analytics research. To enhance predictive reliability, comprehensive data preprocessing steps were applied, including missing value treatment, feature scaling, and variable selection. Model development and evaluation were carried out using established R packages such as caret, tidyverse, ggplot2, and randomForest. Performance was assessed through multiple classification metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. Among the evaluated models, the Random Forest classifier demonstrated the strongest predictive performance, indicating its suitability for diabetes risk assessment tasks. The findings highlight the effectiveness of R-based machine learning frameworks in supporting proactive healthcare monitoring and data-driven clinical decision-making.
Introduction
Diabetes mellitus is a rapidly growing global health concern, affecting an estimated 537 million adults in 2021, a figure projected to reach 643 million by 2030. Early diagnosis is crucial to prevent severe complications such as cardiovascular disease, kidney failure, and nerve damage. Traditional clinical diagnostic methods may not always support early detection, prompting the use of data science and machine learning for predictive modeling. Machine learning enables identification of at-risk individuals using physiological and demographic attributes, supporting preventive healthcare and personalized treatment strategies.
This study investigates diabetes risk prediction using three supervised machine learning algorithms—Logistic Regression, Decision Tree, and Random Forest—implemented in R. The research utilizes the Pima Indians Diabetes Dataset (PIDD) from the UCI repository, containing 768 patient records with eight medical predictor variables and a binary outcome variable.
Methodology
The analytical workflow included:
Data Preprocessing: Median imputation of missing values, normalization, outlier treatment, and an 80/20 train-test split.
Feature Selection: Correlation analysis and recursive feature elimination identified Glucose, BMI, Age, and Diabetes Pedigree Function as key predictors.
Model Development:
Logistic Regression (baseline, interpretable)
Decision Tree (non-linear modeling with pruning)
Random Forest (ensemble method with hyperparameter tuning)
Evaluation Metrics: Accuracy, precision, recall, F1-score, and ROC-AUC, together with confusion matrix analysis. Metrics were computed and visualized using R packages such as caret, randomForest, ggplot2, and pROC.
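The preprocessing and baseline-modeling steps listed above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the data frame is a small synthetic stand-in for the Pima data, zeros are assumed to encode missing measurements, normalization is assumed to be min-max scaling, and outlier treatment is omitted because the text does not specify the method. The split is done before imputation and scaling so that test rows do not influence the training statistics, which is a design choice of this sketch.

```r
# Hypothetical toy data standing in for the Pima dataset (not the real records).
set.seed(42)
n <- 100
pima <- data.frame(
  Glucose = round(rnorm(n, mean = 120, sd = 30)),
  BMI     = round(rnorm(n, mean = 32, sd = 6), 1),
  Age     = sample(21:70, n, replace = TRUE),
  Outcome = rbinom(n, size = 1, prob = 0.35)
)

# Assumption: a zero in Glucose or BMI means "not measured".
pima$Glucose[c(3, 17, 58)] <- 0
pima$BMI[c(8, 44)] <- 0

predictors      <- c("Glucose", "BMI", "Age")
zero_as_missing <- c("Glucose", "BMI")

# 1. Recode implausible zeros as NA.
for (col in zero_as_missing) {
  pima[[col]][pima[[col]] == 0] <- NA
}

# 2. 80/20 train-test split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- pima[train_idx, ]
test  <- pima[-train_idx, ]

# 3. Median imputation, using medians learned from the training set only.
for (col in zero_as_missing) {
  med <- median(train[[col]], na.rm = TRUE)
  train[[col]][is.na(train[[col]])] <- med
  test[[col]][is.na(test[[col]])]   <- med
}

# 4. Min-max normalization with training ranges
#    (test values may fall slightly outside [0, 1]).
for (col in predictors) {
  lo <- min(train[[col]])
  hi <- max(train[[col]])
  train[[col]] <- (train[[col]] - lo) / (hi - lo)
  test[[col]]  <- (test[[col]] - lo) / (hi - lo)
}

# 5. Logistic regression baseline and predicted probabilities on the test set.
fit <- glm(Outcome ~ Glucose + BMI + Age, data = train, family = binomial)
test_prob <- predict(fit, newdata = test, type = "response")
```

With the real dataset, the same steps would apply to all eight predictors, and the Decision Tree and Random Forest models would be fitted on the same `train` and `test` objects so that the comparison is like for like.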
Results
Performance comparison showed:
Logistic Regression: 76% accuracy, AUC ≈ 0.80
Decision Tree: 78% accuracy, AUC ≈ 0.84
Random Forest: 83% accuracy, AUC ≈ 0.89
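For readers who want to see how figures of this kind are derived, the sketch below computes the reported metric types from true labels and predicted probabilities in base R. The function names `classification_metrics` and `roc_auc` and the ten example predictions are hypothetical and chosen for illustration; the study itself reports using caret and pROC. The AUC is computed with the rank-based (Mann-Whitney) formulation, which is equivalent to the area under the ROC curve.

```r
# Threshold-based metrics from true labels (0/1) and predicted probabilities.
classification_metrics <- function(actual, prob, threshold = 0.5) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)  # also called sensitivity
  c(
    accuracy    = (tp + tn) / length(actual),
    precision   = precision,
    recall      = recall,
    specificity = tn / (tn + fp),
    f1          = 2 * precision * recall / (precision + recall)
  )
}

# ROC-AUC as the probability that a random positive case is scored
# higher than a random negative case (ties receive average ranks).
roc_auc <- function(actual, prob) {
  r     <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predictions for ten patients (4 diabetic, 6 non-diabetic).
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob   <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05)

metrics <- classification_metrics(actual, prob)
auc     <- roc_auc(actual, prob)
```

In this toy case the confusion matrix has 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, giving an accuracy of 0.80, precision, recall, and F1 of 0.75 each, and an AUC of 20/24 (about 0.83).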
Random Forest outperformed the other models across all evaluation metrics, which is consistent with the reduced variance and improved generalization expected from ensemble learning. Feature importance analysis confirmed Glucose as the strongest predictor, followed by BMI and Age, in line with established clinical knowledge. Confusion matrix results demonstrated balanced sensitivity and specificity, indicating reliable identification of both diabetic and non-diabetic patients.
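The variance-reduction idea behind ensemble methods can be illustrated with a short bagging sketch. Note the simplification: this is plain bootstrap aggregation of rpart trees, not a full Random Forest, since it omits the random feature subsetting at each split that the randomForest package performs. The data, the coefficients used to simulate the outcome, and the choice of 50 trees are all assumptions made for illustration.

```r
library(rpart)

# Hypothetical synthetic data: outcome risk rises with Glucose and BMI.
set.seed(1)
n   <- 400
toy <- data.frame(
  Glucose = rnorm(n, mean = 120, sd = 30),
  BMI     = rnorm(n, mean = 32, sd = 6)
)
risk <- plogis(0.04 * (toy$Glucose - 120) + 0.08 * (toy$BMI - 32))
toy$Outcome <- factor(rbinom(n, size = 1, prob = risk), levels = c(0, 1))

train <- toy[1:300, ]
test  <- toy[301:400, ]

# Fit one classification tree per bootstrap resample of the training set.
n_trees <- 50
trees <- lapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Outcome ~ Glucose + BMI, data = boot, method = "class")
})

# Average the predicted probability of class "1" across all trees.
prob_matrix <- sapply(trees, function(fit) {
  predict(fit, newdata = test, type = "prob")[, "1"]
})
bagged_prob  <- rowMeans(prob_matrix)
bagged_class <- as.integer(bagged_prob >= 0.5)

# Test-set accuracy of the averaged ensemble.
accuracy <- mean(bagged_class == as.integer(as.character(test$Outcome)))
```

Averaging over many trees smooths out the instability of any single tree, which is the mechanism usually credited for the lower variance of ensemble classifiers.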
Conclusion
This study explored the use of machine learning techniques implemented in R for early diabetes risk prediction. By applying Logistic Regression, Decision Tree, and Random Forest models to a standardized healthcare dataset, the research demonstrated how algorithmic choice and preprocessing strategies influence predictive outcomes. Among the evaluated approaches, the Random Forest model consistently delivered the most reliable performance across multiple evaluation criteria.
Beyond predictive accuracy, this work highlights the importance of reproducible analytical workflows in healthcare research. The integration of data preprocessing, model comparison, and visualization within a single programming environment supports transparency and practical deployment in clinical decision-support systems. The alignment between model-derived feature importance and established medical risk factors further reinforces the validity of the analytical framework.
Future research may extend this work by incorporating larger and more diverse patient datasets, exploring advanced learning architectures, and integrating explainable artificial intelligence techniques to enhance clinical trust. The methodology presented in this study provides a scalable foundation for developing data-driven tools that support early intervention and preventive healthcare strategies.