Heart disease remains one of the leading causes of mortality worldwide, making early prediction essential for effective prevention and timely treatment. This paper presents an advanced machine learning-based system for predicting heart attack risk using a stacked hybrid ensemble approach. The proposed system integrates multiple machine learning algorithms, including Random Forest, Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression, Gradient Boosting, and XGBoost.
These models are combined using a stacking classifier with Logistic Regression as the meta-learner to achieve higher accuracy and reliability.
The system analyzes various patient health parameters such as age, cholesterol levels, blood pressure, heart rate, and other clinical factors from the Cleveland Heart Disease Dataset. Data preprocessing techniques including data cleaning, feature scaling using StandardScaler, feature selection, and stratified train-test splitting are applied to improve robustness. Experimental results demonstrate that the stacked hybrid model achieves a prediction accuracy of approximately 89–90%, outperforming individual base classifiers. The proposed solution offers a non-invasive, cost-effective, and accurate method for early detection of heart disease, thereby supporting healthcare professionals in making informed clinical decisions.
Introduction
This paper proposes an Advanced Heart Attack Risk Prediction System that uses a stacked hybrid machine learning approach to improve the early detection of heart disease, one of the leading causes of death worldwide. Early identification of heart attack risk can significantly reduce mortality and improve patient outcomes. Traditional diagnostic methods rely heavily on manual evaluation by healthcare professionals, which can be time-consuming and susceptible to human error. The integration of Artificial Intelligence (AI) and Machine Learning (ML) offers a faster, more accurate, and data-driven alternative for supporting medical decision-making.
Objectives and Motivation
The primary goal of the study is to develop an intelligent system capable of predicting heart attack risk using patient medical data. The system analyzes important health indicators such as:
Age
Blood pressure
Cholesterol levels
Heart rate
Chest pain type
Blood sugar levels
Electrocardiogram (ECG) results
Other cardiovascular risk factors
To improve prediction accuracy, the study combines multiple machine learning algorithms through a stacking ensemble technique.
Literature Review
Previous research has demonstrated the effectiveness of machine learning algorithms in heart disease prediction. Commonly used models include:
Logistic Regression – Simple and interpretable but less effective for complex data.
Decision Tree – Easy to understand but prone to overfitting.
Random Forest – High accuracy and reduced overfitting through ensemble learning.
K-Nearest Neighbors (KNN) – Effective for small datasets but computationally expensive for larger datasets.
Support Vector Machine (SVM) – Strong performance on high-dimensional data but requires careful tuning.
XGBoost and Gradient Boosting – Advanced boosting algorithms that capture complex patterns and improve predictive performance.
The literature also identifies challenges such as:
Missing and inconsistent medical data.
Imbalanced datasets.
Feature selection difficulties.
Overfitting issues.
Research shows that hybrid and ensemble models often outperform single classifiers by combining the strengths of multiple algorithms.
Proposed Methodology
The proposed system employs a stacked hybrid machine learning model that integrates several classifiers:
Random Forest
Decision Tree
K-Nearest Neighbors (KNN)
Support Vector Machine (SVM)
Logistic Regression
XGBoost
Gradient Boosting
The outputs of the base models are combined using a Logistic Regression meta-learner, which produces the final prediction.
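The stacking arrangement described above can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (generated with make_classification as a stand-in for 13 clinical features), the hyperparameters are arbitrary, and XGBoost is left out so that the sketch depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 13 features and a binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Base learners. Scale-sensitive models get their own StandardScaler.
# Hyperparameters are illustrative, not taken from the paper.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("dt", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("gb", GradientBoostingClassifier(random_state=42)),
    # An XGBoost model (xgboost.XGBClassifier) could be appended here
    # if the xgboost package is installed.
]

# The meta-learner is trained on out-of-fold predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

Because the meta-learner sees only cross-validated predictions from the base models, it is less likely to reward a base model that has simply memorized the training set. The accuracy printed here reflects the synthetic data and says nothing about the results reported in this paper.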
System Workflow
The prediction process consists of five stages:
Data Collection and Preprocessing
Uses the Cleveland Heart Disease Dataset.
Handles missing values and inconsistencies.
Applies standardization (zero mean, unit variance) using StandardScaler.
Feature Selection
Identifies the most relevant attributes such as age, cholesterol, blood pressure, and heart rate.
Removes redundant features to improve efficiency.
Model Training
Uses an 80:20 stratified train-test split.
Applies cross-validation and hyperparameter tuning.
Stacking Ensemble
Combines predictions from the base classifiers, including Random Forest, XGBoost, SVM, and Gradient Boosting.
Uses Logistic Regression as the final meta-model.
Risk Prediction
Classifies patients as either high-risk or low-risk for heart disease.
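The first three stages of the workflow above can be sketched as a single scikit-learn pipeline. The sketch rests on several assumptions that the text does not specify: the data are synthetic, SelectKBest with an ANOVA F-score stands in for the unnamed feature selection method, a single Logistic Regression stands in for the full stacked model, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for a table of 13 clinical features.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# 80:20 stratified split, so both parts keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and selection live inside the pipeline, so their statistics
# are re-estimated on the training folds only during cross-validation.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),  # z = (x - mean) / std per feature
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Hypothetical search grid: number of kept features and regularization.
param_grid = {"select__k": [5, 8, 13], "clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_test, y_test))
```

Keeping the scaler and selector inside the pipeline, rather than fitting them on the full dataset first, avoids leaking information from the validation folds into the preprocessing step.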
Dataset
The system is trained and evaluated using the Cleveland Heart Disease Dataset, which contains 14 attributes (13 clinical features and the diagnosis target):
Age
Sex
Chest pain type
Resting blood pressure
Cholesterol level
Fasting blood sugar
ECG results
Maximum heart rate
Exercise-induced angina
ST depression (Oldpeak)
Slope of ST segment
Number of major vessels
Thal (thallium stress test result, often labelled thalassemia)
Heart disease diagnosis (target variable)
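As a rough illustration of how these attributes might be loaded, the sketch below uses the short column names from the UCI distribution of the dataset, where missing values are marked with "?" and the target "num" ranges from 0 (no disease) to 4. The four rows are invented for illustration and are not real patient records. Median imputation is one simple choice assumed here; the paper does not state how missing values were handled.

```python
import io

import pandas as pd

# Short attribute names as used in the UCI "processed.cleveland.data" file.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# Invented example rows (NOT real records); "?" marks a missing value.
SAMPLE = """\
54,1,4,130,250,0,0,150,1,1.2,2,1,7,2
61,0,3,140,210,0,2,160,0,0.0,1,0,3,0
47,1,2,120,230,0,0,170,0,0.5,1,?,3,0
58,1,4,150,270,1,2,120,1,2.0,2,2,?,3
"""

# With the real file, the StringIO object would be replaced by the file path.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS, na_values="?")

# Assumption: fill each missing entry with the median of its column.
df = df.fillna(df.median(numeric_only=True))

# Collapse the 0-4 diagnosis into a binary target: 0 = no disease, 1 = disease.
df["target"] = (df["num"] > 0).astype(int)
df = df.drop(columns="num")

print(df)
```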
System Design and Implementation
The architecture consists of five modules:
Data Acquisition Module
Collects patient information.
Data Preprocessing Module
Cleans and normalizes data.
Feature Selection Module
Selects significant predictors.
Machine Learning and Stacking Module
Trains multiple classifiers and combines their outputs.
Prediction and Output Module
Generates final risk predictions.
The system is implemented using:
Python
Scikit-learn
XGBoost
Pandas and NumPy
Matplotlib and Seaborn
Streamlit for the web-based user interface
Results and Performance
The stacked hybrid model achieved an overall prediction accuracy of approximately 89–90%, outperforming individual machine learning models.
Key performance benefits include:
Higher accuracy.
Better generalization to unseen data.
Reduced prediction errors.
Prediction times short enough for real-time use.
Performance evaluation using precision, recall, and F1-score confirmed the model’s effectiveness in identifying both high-risk and low-risk patients.
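For reference, the three metrics named above follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them in plain Python on a small invented label vector. The labels are illustrative only and are unrelated to the reported results.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: 4 high-risk (1) and 6 low-risk (0) patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Here TP = 3, FP = 2, FN = 1, so precision = 3/5 and recall = 3/4.
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting, recall on the high-risk class is usually the metric to watch, since a false negative means a high-risk patient is missed.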
Validation
The system was tested with multiple patient scenarios:
High-risk patient: Correctly classified with a confidence score of approximately 75.9%.
Low-risk patient: Correctly classified with a confidence score of approximately 62.6%.
These results suggest that the model is reliable and practically applicable in clinical environments.
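The paper does not state how the confidence score is derived. One plausible reading, assumed here, is that it is the predicted probability of the chosen class, for example the value returned by a scikit-learn model's predict_proba. Under that assumption, and with a hypothetical decision threshold of 0.5, the mapping from probability to label and confidence can be sketched as:

```python
def classify_risk(p_high_risk, threshold=0.5):
    """Map a predicted high-risk probability to a label and a confidence.

    Assumption: confidence is the probability assigned to the predicted class.
    The 0.5 threshold is a hypothetical default, not taken from the paper.
    """
    if not 0.0 <= p_high_risk <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_high_risk >= threshold:
        return "high-risk", p_high_risk
    return "low-risk", 1.0 - p_high_risk


# With a fitted classifier, p_high_risk would typically come from
# model.predict_proba(patient_features)[0, 1].
for p in (0.759, 0.374):
    label, confidence = classify_risk(p)
    print(f"p(high-risk)={p:.3f} -> {label} (confidence {confidence:.1%})")
```

Under this reading, a low-risk prediction with 62.6% confidence corresponds to an estimated high-risk probability of 37.4%, which is fairly close to the decision boundary.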
Conclusion
This paper presented an Advanced Heart Attack Risk Prediction System using a stacked hybrid machine learning approach. The system integrates Random Forest, Decision Tree, KNN, SVM, Logistic Regression, XGBoost, and Gradient Boosting through a stacking classifier with Logistic Regression as the meta-learner. By leveraging the complementary strengths of these algorithms, the stacked hybrid model achieved a prediction accuracy of approximately 89–90%, outperforming the individual base classifiers. The system applied data preprocessing, feature selection, and model training to the Cleveland Heart Disease Dataset. A web interface built with Streamlit lets users enter patient data and obtain prediction results quickly. The system provides a non-invasive, cost-effective method for early detection of heart disease, supporting healthcare professionals in making informed clinical decisions.
However, certain limitations exist. The accuracy of the system depends on the quality and size of the dataset used for training. Only a limited number of health parameters are considered, and the system may not generalize equally well to datasets from different populations. Additionally, the system is not currently integrated with real-time hospital information systems.