In machine learning (ML), classification is one of the most widely used prediction tasks. ML is now deployed in almost every real-world domain, including healthcare. In healthcare applications, achieving the highest possible accuracy should be the main goal. A model's accuracy depends on the training dataset and on the algorithm used, and the characteristics of the training dataset contribute significantly to it. Healthcare data are generally numerical, for example test reports that record numerical values. Classification produces categorical outcomes that patients can easily understand, such as whether or not someone has a particular disease. In this work, we evaluate and compare the performance of several classifiers to determine which works best when the training data are exclusively numerical. In our experiments on diabetes prediction, Logistic Regression, Neural Network, and Naive Bayes were the most accurate classifiers for exclusively numerical data.
Introduction
Machine Learning (ML) is increasingly vital in healthcare for making predictive decisions, especially through classification tasks that assign categorical labels—such as diagnosing diseases or predicting treatment effectiveness. Accuracy in healthcare ML models is critical, as errors can have serious consequences. Since healthcare datasets often contain numerical values (e.g., blood pressure, glucose levels), selecting appropriate classifiers capable of handling such data is essential.
This study evaluates several popular classification algorithms—Logistic Regression, Neural Networks, Naive Bayes, Decision Trees, Support Vector Machines (SVM), and k-Nearest Neighbors (kNN)—using the Orange tool, which simplifies model training and evaluation without coding. The focus is on diabetes prediction, using a Kaggle dataset with 768 patient records (268 diabetic, 500 non-diabetic), with all features being numerical.
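As a small worked example of what the class counts above imply, the sketch below computes the share of diabetic records and the accuracy of a trivial model that always predicts the majority class. This baseline is not part of the study; it is offered only as a reference point for reading accuracy figures on an imbalanced dataset. The variable names are illustrative.

```python
# Class counts as reported in the text above.
n_diabetic = 268
n_non_diabetic = 500
n_total = n_diabetic + n_non_diabetic

# Share of positive (diabetic) records in the dataset.
positive_rate = n_diabetic / n_total

# Accuracy of a trivial model that always predicts the larger class.
# This baseline is an illustration, not a result from the study.
majority_baseline = max(n_diabetic, n_non_diabetic) / n_total

print(f"total records:     {n_total}")
print(f"positive rate:     {positive_rate:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

A classifier is only informative on this dataset if its cross-validated accuracy clearly exceeds the majority baseline.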
The literature review highlights prior work on ML applications in healthcare, ethical concerns, data security, and classifier performance comparisons. The study applies cross-validation techniques to assess classifier accuracy, aiming to identify the best-performing model for diabetes prediction with numerical data.
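The comparison described above was carried out in Orange without code. For readers who prefer a scripted equivalent, the following is a minimal sketch using scikit-learn, which is a substitute for Orange and not the tool used in the study. It assumes a synthetic stand-in dataset (768 rows, 8 numerical features, roughly 65% negative class) instead of the Kaggle file, 10 stratified folds, and default hyperparameters; all of these are assumptions, so the printed accuracies are not the paper's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the diabetes data
# (768 rows, 8 numerical features, about 65% negative class).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=1,
    weights=[0.65, 0.35], random_state=0,
)


def scaled(model):
    """Standardize features inside each fold before fitting the model."""
    return make_pipeline(StandardScaler(), model)


# The six classifier families named in the text, with assumed settings.
classifiers = {
    "Logistic Regression": scaled(LogisticRegression(max_iter=1000)),
    "Neural Network": scaled(MLPClassifier(hidden_layer_sizes=(16,),
                                           max_iter=2000, random_state=0)),
    "Naive Bayes": GaussianNB(),
    "Tree": DecisionTreeClassifier(random_state=0),
    "SVM": scaled(SVC()),
    "kNN": scaled(KNeighborsClassifier(n_neighbors=5)),
}

# Assumption: 10 stratified folds; every model sees the same splits.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in classifiers.items()
}

# Print classifiers from highest to lowest mean accuracy.
for name, acc in sorted(mean_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.3f}")
```

Placing the scaler inside the pipeline keeps the standardization statistics from leaking out of the held-out fold, which matters for distance-based and gradient-based models such as kNN, SVM, and the neural network.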
Results suggest that the effectiveness of an algorithm depends on the data and on the methodology. Cross-validation estimates how well each model generalizes to unseen records and helps detect overfitting, although it does not by itself prevent it. Overall, this research underscores the importance of evaluating multiple classifiers to select the most accurate one for critical healthcare applications like diabetes diagnosis.
Conclusion
This research aimed to identify the classification algorithms best suited to exclusively numerical data, using a diabetes prediction dataset. Six of the most widely used classification algorithms were evaluated with the Orange tool, and cross-validation was run with several different fold counts. We observed that, on numerical data, Logistic Regression, Naive Bayes, and Neural Network perform best, ahead of Tree, kNN, and SVM. This observation helps in choosing a method for a given type of dataset in order to achieve higher accuracy. The work can be extended by evaluating the classifiers on other datasets.
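The idea of repeating cross-validation with different fold counts can be sketched as follows. This is an illustration in scikit-learn and not the Orange workflow used in the study. The fold counts (3, 5, 10), the synthetic stand-in data, and the choice of Logistic Regression as the example model are all assumptions, so the printed numbers are not the paper's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the diabetes data
# (768 rows, 8 numerical features, about 65% negative class).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=1,
    weights=[0.65, 0.35], random_state=0,
)

# Example model: standardized features followed by Logistic Regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumption: these fold counts are illustrative, not taken from the study.
fold_counts = (3, 5, 10)

results = {}
for k in fold_counts:
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Store the mean and the spread of per-fold accuracy for this k.
    results[k] = (scores.mean(), scores.std())
    print(f"{k:>2} folds: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")
```

If the mean accuracy stays stable across fold counts, the ranking of classifiers is less likely to be an artifact of one particular split.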
References
[1] Education, Pearson. Machine Learning, 1e. Pearson Education India, 2018.
[2] Rebala, Gopinath, Ajay Ravi, and Sanjay Churiwala. An Introduction to Machine Learning. Springer, 2019.
[3] Pereira, F. C., and S. S. Borysov. "Machine learning fundamentals." Mobility Patterns, Big Data and Transport Analytics. Elsevier, 2019. 9-29.
[4] Char, Danton S., Michael D. Abràmoff, and Chris Feudtner. "Identifying ethical considerations for machine learning healthcare applications." The American Journal of Bioethics 20.11 (2020): 7-17.
[5] Chen, Irene Y., et al. "Ethical machine learning in healthcare." Annual Review of Biomedical Data Science 4 (2021): 123-144.
[6] Qayyum, Adnan, et al. "Secure and robust machine learning for healthcare: A survey." IEEE Reviews in Biomedical Engineering 14 (2020): 156-180.
[7] Waring, Jonathan, Charlotta Lindvall, and Renato Umeton. "Automated machine learning: Review of the state-of-the-art and opportunities for healthcare." Artificial Intelligence in Medicine 104 (2020): 101822.
[8] Ahmed, Zeeshan, et al. "Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine." Database 2020 (2020): baaa010.
[9] Chen, Richard J., et al. "Synthetic data in machine learning for medicine and healthcare." Nature Biomedical Engineering 5.6 (2021): 493-497.
[10] Chen, Po-Hsuan Cameron, Yun Liu, and Lily Peng. "How to develop machine learning models for healthcare." Nature Materials 18.5 (2019): 410-414.
[11] Hasan, Md Kamrul, et al. "Diabetes prediction using ensembling of different machine learning classifiers." IEEE Access 8 (2020): 76516-76531.
[12] Mujumdar, Aishwarya, and V. Vaidehi. "Diabetes prediction using machine learning algorithms." Procedia Computer Science 165 (2019): 292-299.
[13] Jaiswal, Varun, Anjli Negi, and Tarun Pal. "A review on current advances in machine learning based diabetes prediction." Primary Care Diabetes 15.3 (2021): 435-443.
[14] Khanam, Jobeda Jamal, and Simon Y. Foo. "A comparison of machine learning algorithms for diabetes prediction." ICT Express 7.4 (2021): 432-439.
[15] Soni, Mitushi, and Sunita Varma. "Diabetes prediction using machine learning techniques." International Journal of Engineering Research & Technology (IJERT) 9 (2020).
[16] Diabetes dataset. https://www.kaggle.com/datasets/mathchi/diabetes-data-set