Healthcare organizations produce vast amounts of raw information, commonly referred to as big data, which can reveal hidden patterns and insights that support informed decision-making. Data-driven decisions tend to be more reliable than those based on intuition because they draw on large-scale datasets. Exploratory Data Analysis (EDA) plays a key role in this process by helping to identify errors, recognize data characteristics, validate assumptions, and examine relationships between variables. In this context, EDA involves examining data without relying on statistical modeling or drawing formal conclusions. Analysts across various fields use EDA to uncover patterns and make informed forecasts. Recently, data analytics has become more accessible and increasingly important in healthcare, particularly for addressing disease outbreaks and emergencies. EDA serves as a foundational step in data analysis and supports the healthcare sector by improving treatments and promoting preventive care.
1. Introduction
Heart disease remains one of the leading causes of death globally. Early detection and accurate prediction are crucial for improving patient outcomes. Predictive analysis depends on diverse data, including clinical, lifestyle, and demographic factors. Exploratory Data Analysis (EDA) is a key first step for understanding patterns in the data before machine learning is applied.
2. Heart Disease Overview
Definition & Symptoms: Heart disease impairs heart function and impedes blood flow to the organs; common symptoms include fatigue, shortness of breath, and leg swelling.
Diagnosis Challenges: Diagnosis requires advanced tools and expert interpretation, both of which may be limited in availability.
Contributing Factors: Age, gender, lifestyle habits (e.g., smoking, obesity), and existing conditions (e.g., diabetes, high cholesterol).
Specific Conditions:
Atherosclerosis: Plaque buildup in arteries leads to restricted blood flow.
Acute Myocardial Infarction (AMI): Caused by blocked blood supply, leading to tissue damage.
Spasmodic Conditions: Sudden artery spasms may occur without prior atherosclerosis.
Gender Differences: Men are more prone to heart attacks; symptom duration differs between sexes.
Physiological Impact: Affects other organs, such as bone marrow and spleen, causing systemic changes.
3. Exploratory Data Analysis (EDA)
Purpose: Helps detect patterns, outliers, and anomalies using visual methods (e.g., plots) and non-visual methods (e.g., summary statistics).
Types:
Graphical vs. Non-graphical
Univariate, Bivariate, and Multivariate
Importance: Validates assumptions, guides model development, and informs feature selection.
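The non-graphical univariate and bivariate checks described in this section can be sketched as follows. This is a minimal illustration using only Python's standard library. The column names (age, chol, target), the 1.5 x IQR outlier rule, and all values are assumptions made for the example; they are not taken from the study's dataset.

```python
import statistics
from collections import defaultdict

# Hypothetical toy records (invented values, for illustration only).
records = [
    {"age": 63, "chol": 233, "target": 1},
    {"age": 37, "chol": 250, "target": 1},
    {"age": 41, "chol": 204, "target": 1},
    {"age": 56, "chol": 236, "target": 1},
    {"age": 57, "chol": 354, "target": 1},
    {"age": 57, "chol": 192, "target": 0},
    {"age": 56, "chol": 294, "target": 0},
    {"age": 44, "chol": 263, "target": 0},
    {"age": 52, "chol": 199, "target": 0},
    {"age": 67, "chol": 564, "target": 0},
]


def summarize(values):
    """Univariate, non-graphical summary of one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def mean_by_group(rows, value_key, group_key):
    """Bivariate view: mean of one column within each class of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: statistics.mean(vals) for group, vals in groups.items()}


chol = [row["chol"] for row in records]
print(summarize(chol))
print("outliers:", iqr_outliers(chol))
print("mean chol by target:", mean_by_group(records, "chol", "target"))
```

On a real dataset the same checks would normally be run for every numeric column, and the flagged values reviewed by hand rather than dropped automatically.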
4. Literature Survey
Several studies have applied machine learning for heart disease prediction:
Techniques Used: Naive Bayes, Decision Trees, SVM, Random Forest, Neural Networks, KNN, Gradient Boosting, etc.
A correlation matrix helps remove redundant or irrelevant features to optimize the model.
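One way a correlation matrix can be used to drop redundant features is sketched below. This is an illustrative approach, not necessarily the procedure used in the surveyed studies: it computes Pearson correlations and greedily keeps a feature only if its absolute correlation with every feature already kept stays below a threshold. The feature names, the toy values, and the 0.9 threshold are assumptions chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def drop_redundant(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical features: age_months duplicates the information in age_years.
features = {
    "age_years": [40, 50, 60, 70, 80],
    "age_months": [480, 600, 720, 840, 960],
    "chol": [250, 210, 240, 220, 230],
}

print(correlation_matrix(features)["age_years"])
print(drop_redundant(features))
```

The greedy pass depends on column order, so the first feature of a correlated pair is the one retained; a domain expert may prefer to choose which member of each pair to keep.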
D. Model Selection
Compares multiple ML algorithms to choose the best-performing one.
Common models used: Logistic Regression, Decision Trees, SVM, KNN, Random Forest.
E. Training and Testing
Split data into training and testing sets.
Models are trained to learn patterns and evaluated on unseen test data for predictive accuracy.
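The split-then-evaluate procedure can be sketched with the standard library alone. The helper names, the 25% test fraction, and the fixed seed are assumptions made for this illustration; the source does not state which split ratio was used.

```python
import random


def train_test_split(features, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then split them into two parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical rows (age, cholesterol) with invented labels.
rows = [(63, 233), (37, 250), (41, 204), (56, 236),
        (57, 354), (57, 192), (56, 294), (44, 263)]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

x_train, x_test, y_train, y_test = train_test_split(rows, labels)
print(len(x_train), "training rows,", len(x_test), "test rows")
```

Keeping the test rows out of training is what makes the reported accuracy an estimate of performance on unseen patients rather than a measure of memorization.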
Algorithms Explained:
Decision Tree: Uses flowchart-like structures for decision-making; simple and interpretable.
Random Forest: Ensemble of decision trees; improves accuracy via voting.
K-Nearest Neighbors (KNN): Classifies data based on the nearest data points; good for non-linear patterns.
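Two of the ideas above, classification by nearest neighbors and the voting step that a Random Forest uses to combine its trees, can be illustrated in a few lines. This is a simplified sketch under stated assumptions: the points and labels are invented, Euclidean distance and k = 3 are arbitrary choices, and the tree-building part of a Random Forest is omitted entirely.

```python
import math
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the given predictions."""
    return Counter(predictions).most_common(1)[0][0]


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    return majority_vote(label for _, label in by_distance[:k])


# Hypothetical two-feature points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (7, 9)))

# A forest combines per-tree predictions for one sample in the same way.
print(majority_vote([1, 0, 1]))
```

Because KNN relies on distances, features measured on very different scales (for example, age and cholesterol) would usually be standardized before it is applied.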
Conclusion
Among the algorithms compared, Random Forest achieves the highest accuracy at 96%, making it the most effective model for this task. Decision Tree follows with an accuracy of 91%, which is also a strong result. KNN, SVM, and Logistic Regression yield markedly lower accuracies, ranging from 69% to 73%, which suggests that they are less well suited to this dataset than Random Forest and Decision Tree.
The high performance of Random Forest can be attributed to its aggregation of the results of multiple decision trees, which helps reduce overfitting and improve generalization. Random Forest is therefore the recommended algorithm for heart disease prediction on this data, as it gives the most accurate results among the models tested.