Over 1.6 billion people worldwide suffer from anemia, a common hematologic disorder that often goes undiagnosed in low-resource settings because diagnostic tools are lacking. This study presents a machine learning approach to low-cost, non-invasive detection of anemia from complete blood count (CBC) data. Using high-quality CBC data, we compare the predictive performance of three supervised models: Random Forest, Support Vector Machine (SVM), and Logistic Regression. After extensive preprocessing, hyperparameters were tuned with GridSearchCV under stratified cross-validation. Among the models compared, Random Forest achieved the highest accuracy, 99.48%, outperforming Logistic Regression (57.23%) and SVM (23.81%). To enhance model interpretability, SHAP (SHapley Additive exPlanations) was used to identify the features that contribute most to the predictions; the features it ranked highest align well with clinical relevance. Our results show that interpretability and ensemble learning can work well together in a diagnostic support system for the early identification of anemia in clinical settings.
Introduction
Anemia, a widespread global health issue affecting over 1.6 billion people, is characterized by low hemoglobin or red blood cell counts and often remains undiagnosed in low-resource settings due to limited testing infrastructure. Traditional diagnosis relies on Complete Blood Count (CBC) tests, which are resource-intensive and challenging to access in remote areas.
This research explores the use of machine learning (ML) to improve the accuracy and accessibility of anemia diagnosis by analyzing CBC data. The study compares three supervised ML models (Logistic Regression, Support Vector Machine (SVM), and Random Forest (RF)) after preprocessing and hyperparameter tuning. Results show that the Random Forest classifier outperforms the others, achieving 99.48% accuracy, compared with 57.23% for Logistic Regression and 23.81% for SVM. This highlights the importance of model selection and tuning in clinical prediction tasks.
The research contributions include:
A comprehensive preprocessing pipeline for CBC data.
Performance comparison of ML models using accuracy, confusion matrices, and classification metrics.
Demonstration of significant improvement through hyperparameter optimization using Grid Search and cross-validation.
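The contributions listed above (a preprocessing pipeline, a comparison of three models, and Grid Search tuning with cross-validation) can be sketched with scikit-learn [7]. This is a minimal illustration, not the study's code: the synthetic data from make_classification stands in for the CBC table, StandardScaler stands in for the preprocessing steps, and the parameter grids are assumptions chosen only to keep the example small.

```python
# Illustrative sketch only: synthetic data replaces the CBC dataset, and the
# preprocessing step and parameter grids are assumptions, not the study's settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic multi-class data standing in for encoded CBC features and labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Stratified hold-out split keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Each candidate model is paired with a small, hypothetical parameter grid.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "svm": (
        SVC(),
        {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]},
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": accuracy_score(y_test, search.predict(X_test)),
    }

for name, summary in results.items():
    print(name, summary)
```

Placing the scaler inside the pipeline means each cross-validation fold fits it on training data only, which avoids leaking information from the validation fold into the tuning step.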
The literature review confirms that RF and SVM are commonly used for anemia classification, with RF generally providing better performance, especially when properly tuned. Challenges remain in model interpretability and clinical adoption.
The methodology involved cleaning and encoding the CBC dataset, splitting the data into training and testing sets, and optimizing model parameters. Evaluation used accuracy, precision, recall, F1 score, and confusion matrices. SHapley Additive exPlanations (SHAP) was applied to interpret model predictions.
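As a reminder of how the evaluation metrics named above relate to a confusion matrix, the following sketch computes them in plain Python. The function names and the example counts are hypothetical and chosen for illustration; they are not results from the study.

```python
# Rows are true classes, columns are predicted classes.
# The counts used below are made up for illustration only.

def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


def per_class_metrics(confusion):
    """Precision, recall, and F1 for each class, treated one-vs-rest."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp                       # truly k, predicted other
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall)
            else 0.0
        )
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


confusion = [
    [8, 2],  # true class 0: 8 predicted as 0, 2 predicted as 1
    [1, 9],  # true class 1: 1 predicted as 0, 9 predicted as 1
]
print(accuracy(confusion))
for label, scores in enumerate(per_class_metrics(confusion)):
    print(label, scores)
```

The same functions work for more than two classes, since each class is scored against all others; averaging the per-class values gives macro-averaged scores.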
Overall, Random Forest was identified as the most effective ML approach for anemia prediction using non-invasive CBC data, offering a scalable, accurate tool to support early diagnosis, especially in resource-limited healthcare settings.
Conclusion
In this study, Random Forest, Logistic Regression, and Support Vector Machine (SVM) were evaluated for predicting anemia from CBC data. After hyperparameter tuning with Grid Search, Random Forest achieved an accuracy of 99.48%, compared with 57.23% for Logistic Regression and 23.81% for SVM. These findings suggest that, on this dataset, the ensemble method captured complex patterns in the medical data far better than Logistic Regression and SVM.
Random Forest emerged as the most appropriate model because it demonstrated high predictive accuracy, robustness to overfitting, and compatibility with explanation tools such as SHAP. It was the only algorithm in this comparison that combined strong performance with explainability, and explainability is a high priority in clinical settings because it bears directly on clinical decision making.
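For readers unfamiliar with SHAP, the attribution it assigns to a feature is the Shapley value defined in [11]. The notation below follows that reference rather than anything stated in this study: F is the set of all features, S ranges over subsets that exclude feature i, and f_S denotes the model evaluated using only the features in S.

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_S\bigl(x_S\bigr) \Bigr]
```

Each term is the change in the model output when feature i is added to a subset S, weighted by how often that subset occurs across all feature orderings. The attributions for a single prediction sum to the difference between that prediction and the expected model output, which is what makes per-feature contributions directly comparable.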
References
[1] M. R. Aditya, T. Sutanto, H. Budiman, M. R. N. Ridha, U. Syapotro, and N. Azijah, “Machine learning models for classification of anemia from CBC results: Random Forest, SVM, and Logistic Regression,” J. Data Sci., vol. 2024, no. 49, 2024.
[2] B. Kitaw, C. Asefa, F. Legese, et al., “Leveraging machine learning models for anemia severity detection among pregnant women following ANC: Ethiopian context,” BMC Public Health, vol. 24, no. 1, p. 3500, Dec. 2024.
[3] J. Gómez-Gómez, A. Rico, J. R. Guzmán, et al., “Anemia classification system using machine learning,” Informatics, vol. 12, no. 1, p. 19, 2025.
[4] A. S. Awaad, Y. M. Elbarawy, H. Mancy, and N. E. Ghannam, “Exploring CBC data for anemia diagnosis: A machine learning and ontology perspective,” BioMedInformatics, vol. 5, no. 3, p. 35, 2025.
[5] N. Shweta and S. D. Pande, “Prediction of anemia using various ensemble learning and boosting techniques,” EAI Endorsed Transactions on Pervasive Health and Technology, 2023, doi: 10.4108/eetpht.9.4197.
[6] E. Aboelnaga. Anemia Types Classification [Online Kaggle dataset]. 2023. Available: https://www.kaggle.com/datasets/ehababoelnaga/anemia-types-classification (accessed Jun. 21, 2025).
[7] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[8] D. W. Hosmer Jr., S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression, 3rd ed. New York, NY, USA: Wiley, 2013.
[9] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[10] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[11] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Adv. Neural Inf. Process. Syst. 30 (NIPS), 2017, pp. 4765–4774.