Cardiovascular Risk Prediction Using Extreme Gradient Boosting: A Machine Learning Approach

Authors: Ch. Sathwika, K. Priyanka, V. Sirisha, G. Venkatesh, G. Prudhvi

DOI Link: https://doi.org/10.22214/ijraset.2026.81709

Abstract

Cardiovascular mortality remains a major global health issue, and a significant number of deaths could be prevented with timely risk identification. This study presents a machine learning framework called the Heart Attack Prediction System (HAPS), which uses XGBoost as its main predictive engine. The model is trained on a combined set of 12,000 records from two publicly available sources: the UCI Cleveland Heart Disease Dataset and the Kaggle Heart Attack Analysis and Prediction Dataset. In addition to the 13 standard clinical parameters, we engineer eight interaction features based on domain knowledge, which expands the input space to 21 dimensions. Under identical experimental conditions, XGBoost achieves a classification accuracy of 99.71% and an area under the ROC curve of 0.9999. It outperforms the five other algorithms evaluated: Random Forest (99.62%), Logistic Regression (99.67%), SVM (99.62%), KNN (99.58%), and Gradient Boosting (99.67%). Ten-fold stratified cross-validation yields a mean accuracy of 99.71% ± 0.03%, indicating strong generalization. The system is deployed through a Flask-based web interface, allowing clinicians to obtain real-time risk estimates without specialized programming skills.

Introduction

The Heart Attack Prediction System (HAPS) is a machine learning–based healthcare solution designed to identify individuals at risk of cardiovascular disease before severe symptoms appear. Cardiovascular diseases remain the leading cause of premature death globally, causing approximately 17.9 million deaths annually. Traditional diagnostic methods often require specialists, costly equipment, and lengthy clinical procedures, making large-scale screening difficult in resource-constrained regions. To address this challenge, HAPS employs XGBoost (Extreme Gradient Boosting), a powerful supervised machine learning algorithm known for its high accuracy, speed, and ability to model complex feature relationships.

The system was trained on 12,000 clinical records collected from the UCI Cleveland Heart Disease Dataset and the Kaggle Heart Attack Analysis and Prediction Dataset. In addition to the standard 13 clinical heart-disease features, the model incorporates eight engineered interaction features derived from medical knowledge, enabling it to capture more complex cardiovascular risk patterns. HAPS compares the performance of six machine learning algorithms—Random Forest, Logistic Regression, SVM, KNN, Gradient Boosting, and XGBoost—under identical experimental conditions.
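The expansion from 13 to 21 input dimensions can be illustrated with a minimal Python sketch. The 13 base names follow the UCI Cleveland schema, but the eight interaction terms below are hypothetical stand-ins, since the specific engineered features are not listed here. The example patient is also made up.

```python
# The 13 standard attributes of the UCI Cleveland heart-disease schema.
BASE_FEATURES = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# HYPOTHETICAL interaction terms: illustrative only, not the authors' features.
INTERACTIONS = {
    "age_x_chol": lambda r: r["age"] * r["chol"],
    "age_x_trestbps": lambda r: r["age"] * r["trestbps"],
    "chol_x_trestbps": lambda r: r["chol"] * r["trestbps"],
    "thalach_per_age": lambda r: r["thalach"] / r["age"],
    "oldpeak_x_slope": lambda r: r["oldpeak"] * r["slope"],
    "exang_x_oldpeak": lambda r: r["exang"] * r["oldpeak"],
    "cp_x_exang": lambda r: r["cp"] * r["exang"],
    "ca_x_thal": lambda r: r["ca"] * r["thal"],
}


def add_interactions(record):
    """Return a new dict holding the base features plus the interaction terms."""
    missing = [name for name in BASE_FEATURES if name not in record]
    if missing:
        raise KeyError(f"missing base features: {missing}")
    expanded = {name: record[name] for name in BASE_FEATURES}
    for name, fn in INTERACTIONS.items():
        expanded[name] = fn(record)
    return expanded


# Made-up example patient, not taken from either dataset.
patient = {
    "age": 54, "sex": 1, "cp": 2, "trestbps": 130, "chol": 246, "fbs": 0,
    "restecg": 1, "thalach": 150, "exang": 0, "oldpeak": 1.2, "slope": 1,
    "ca": 0, "thal": 2,
}
features = add_interactions(patient)
print(len(features))  # 13 base + 8 interaction = 21
```

Keeping the interaction definitions in one mapping makes it straightforward to apply the same transformation at training time and at prediction time.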

The architecture consists of five stages: data ingestion, preprocessing and feature engineering, model training, evaluation, and web deployment. Clinical data undergo cleaning, missing-value handling, scaling, and transformation before model training. The final trained XGBoost model is deployed through a Flask-based web application, where healthcare workers can enter patient information and instantly receive a color-coded risk prediction without requiring advanced technical expertise.
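As a rough illustration of the preprocessing stage and the color-coded output, the sketch below uses only the Python standard library. The text does not specify which imputation or scaling method is used, so median imputation and z-score standardization are assumptions. The `risk_band` cut-offs (0.30 and 0.70) and color names are likewise hypothetical.

```python
from statistics import mean, median, pstdev


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no signal
    return [(v - mu) / sigma for v in column]


def risk_band(probability, low=0.30, high=0.70):
    """Map a predicted probability to a display color (cut-offs are hypothetical)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability < low:
        return "green"
    if probability < high:
        return "amber"
    return "red"


# Made-up cholesterol column with one missing value.
chol = [200, None, 240, 220]
filled = impute_median(chol)   # None is replaced by the median of 200, 240, 220
scaled = standardize(filled)
print(filled)                  # [200, 220, 240, 220]
print(risk_band(0.85))         # red
```

In a deployed system the imputation and scaling statistics would be computed on the training data only and reused for every incoming patient record.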

Experimental results show that XGBoost outperforms all five competing algorithms, achieving 99.71% accuracy, 99.8% precision, 99.7% recall, a 99.7% F1-score, and a ROC-AUC of 0.9999 on the test set. The model recorded only four false negatives out of 2,400 test cases, which matters most when the goal is to avoid missing high-risk patients. Ten-fold cross-validation further confirmed the model’s stability, with a mean accuracy of 99.71% ± 0.03%, indicating strong generalization and reliability.
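The reported metrics all derive from a single confusion matrix, as the sketch below shows. Only the test-set size (2,400) and the false-negative count (4) are stated in the text. The remaining cell counts are assumptions chosen to be consistent with the reported percentages, and other splits are possible.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# fn=4 and the total of 2,400 come from the text.
# tp, fp and tn are ASSUMED values, picked so the results match the reported figures.
m = classification_metrics(tp=1329, fp=3, fn=4, tn=1064)
print(round(m["accuracy"] * 100, 2))   # 99.71
print(round(m["precision"] * 100, 1))  # 99.8
print(round(m["recall"] * 100, 1))     # 99.7
print(round(m["f1"] * 100, 1))         # 99.7
```

An accuracy of 99.71% on 2,400 cases corresponds to 7 misclassifications in total, so four false negatives would leave three false positives under this reading.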

Feature importance analysis revealed that several engineered interaction features ranked among the most influential predictors, validating the importance of domain-driven feature engineering. Training curves showed rapid convergence and no signs of overfitting. Overall, HAPS provides an accurate, scalable, and user-friendly tool for early cardiovascular risk assessment, demonstrating how machine learning can support preventive healthcare and improve clinical decision-making, particularly in resource-limited settings.
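One model-agnostic way to rank predictors is permutation importance, sketched below in plain Python. This is a stand-in technique: the text does not say how importance was computed, and XGBoost also offers its own built-in importance scores. The toy model and data are hypothetical and exist only to make the example runnable.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild every row with feature j replaced by its shuffled value.
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model: uses feature 0 only and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


# Synthetic data: feature 0 drives the label, feature 1 is irrelevant.
rows = [[i / 39, (i * 7) % 5] for i in range(40)]
labels = [toy_predict(row) for row in rows]
scores = permutation_importance(toy_predict, rows, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranking)
```

A feature the model ignores scores exactly zero, while shuffling a feature the model relies on lowers accuracy. Applying the same procedure to the 21 inputs would show where the engineered terms fall in the ranking.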

Conclusion

This paper presented HAPS, a machine learning framework designed for cardiovascular risk stratification in preventive clinical settings. By augmenting the standard 13-feature clinical schema with eight domain-motivated interaction terms and employing XGBoost as the predictive engine, HAPS achieves a classification accuracy of 99.71% and a ROC-AUC of 0.9999 on a 12,000-record aggregated dataset, outperforming all five competing algorithms under identical experimental conditions. Cross-validation stability (99.71% ± 0.03%) corroborates genuine model generalization. The Flask-based deployment layer operationalizes the model as a clinically accessible tool requiring no programming expertise, bridging the gap between algorithmic performance and practical clinical adoption. HAPS demonstrates that the combination of principled feature engineering, an appropriate ensemble algorithm, and lightweight deployment infrastructure can yield a practical early-warning instrument for preventive cardiology.

Future development directions include: integration with wearable IoT devices for continuous passive monitoring; incorporation of genetic and familial history data for individualized profiling; extension to deep sequential architectures (LSTM, 1D-CNN) applied to raw ECG signals; SHAP and LIME explainability modules for per-patient decision attribution; and federated learning protocols enabling cross-institutional model training without patient-level data centralization.

Copyright

Copyright © 2026 Ch. Sathwika, K. Priyanka, V. Sirisha, G. Venkatesh, G. Prudhvi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET81709

Publish Date : 2026-05-01

ISSN : 2321-9653

Publisher Name : IJRASET
