Early and accurate disease diagnosis remains a critical challenge, particularly in resource-constrained settings. This project proposes a machine learning-based disease prediction system that uses a Random Forest classifier to predict likely diseases from symptoms supplied by the user. The system is designed as an interactive tool to aid preliminary medical assessment. The model is trained on a dataset of 4,920 records covering 133 symptoms and 41 unique diseases, with each symptom binary-encoded as present or absent. Random Forest was chosen for its accuracy, robustness, and ability to handle large feature spaces. The classifier achieves approximately 97.6% accuracy on unseen test data. Users interact with the system through a Command Line Interface (CLI), entering symptoms and receiving a predicted disease along with a disclaimer that the output is advisory only. The system also provides a feature importance visualization that shows which symptoms most influence the prediction, which improves interpretability and serves as a learning aid for users and researchers.
Built with scikit-learn, pandas, and visualization libraries, the project illustrates how a well-tuned ML pipeline can address real-world healthcare needs. Although the current system is intended for educational and exploratory purposes, it lays the groundwork for broader deployment in mobile apps, hospital triage tools, and telemedicine systems.
Introduction
The project proposes a machine learning-based disease prediction system using the Random Forest algorithm to predict possible diseases from user-input symptoms. It aims to provide quick, preliminary health guidance and support patients in reaching the right medical specialists, especially in situations where healthcare access is limited or delayed. While not a replacement for doctors, the system acts as an early decision-support tool to improve healthcare accessibility and efficiency.
The need for this study arises from delays in diagnosis, inefficient patient referrals, a shortage of specialists in rural areas, and broader gaps in healthcare accessibility. To address these issues, the system has two main objectives: predicting diseases from symptoms with high accuracy, and recommending appropriate doctors based on the predicted condition.
The literature review highlights various machine learning approaches used in disease prediction, especially in heart disease, cancer, and neurological disorders. Techniques like SVM, Decision Trees, KNN, XGBoost, and Random Forest have shown strong performance, with Random Forest often providing higher accuracy and stability. Studies also emphasize feature selection, active learning, and web-based deployment for improving healthcare systems.
Methodologically, the system uses a dataset of 4,920 records and 133 symptom features. The data are preprocessed, binary-encoded, and split into training and test sets. A Random Forest classifier is then trained and tuned, achieving about 97.6% accuracy on the test set. Feature importance analysis is used to identify the symptoms that most influence predictions.
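The workflow described above can be illustrated with a minimal sketch. It is not the project's actual code: the synthetic data, the column names (`symptom_0`, `prognosis`), the 80/20 stratified split, the hyperparameters, and the helper `encode_symptoms` are all assumptions made for illustration, and only standard scikit-learn and pandas calls are used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: binary symptom columns plus a
# disease label. Every name and size below is made up for illustration.
rng = np.random.default_rng(0)
symptoms = [f"symptom_{i}" for i in range(12)]
diseases = ["disease_a", "disease_b", "disease_c"]

X_rows, y_rows = [], []
for k, disease in enumerate(diseases):
    # Each synthetic disease has four characteristic symptoms.
    profile = np.zeros(len(symptoms), dtype=int)
    profile[k * 4:(k + 1) * 4] = 1
    for _ in range(60):
        flip = rng.random(len(symptoms)) < 0.05  # flip about 5% of entries
        X_rows.append(np.where(flip, 1 - profile, profile))
        y_rows.append(disease)

df = pd.DataFrame(np.vstack(X_rows), columns=symptoms)
df["prognosis"] = y_rows

# Split into training and test sets (assumed 80/20, stratified by disease).
X, y = df[symptoms], df["prognosis"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train the Random Forest and score it on the held-out test set.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# Rank symptoms by impurity-based feature importance.
importances = pd.Series(clf.feature_importances_, index=symptoms)
importances = importances.sort_values(ascending=False)


def encode_symptoms(reported, vocabulary):
    """Binary-encode reported symptom names against the training columns."""
    row = [[int(name in reported) for name in vocabulary]]
    return pd.DataFrame(row, columns=vocabulary)


# Predict from a list of symptoms, as a CLI front end might after parsing input.
reported = ["symptom_0", "symptom_1", "symptom_2", "symptom_3"]
prediction = clf.predict(encode_symptoms(reported, symptoms))[0]
print(f"accuracy={accuracy:.3f}, predicted={prediction}")
print(importances.head())
```

Stratifying the split keeps every disease represented in both sets, which matters when there are many classes relative to the number of records.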
Conclusion
The disease prediction model performed strongly across all major evaluation metrics on the test data. It reached an overall accuracy of 97.6%, with per-disease accuracy ranging from 95.2% to 99.8%, which indicates that the model learned the symptom patterns that separate the disease categories in the dataset. Consistency of this kind matters in healthcare settings, where even small errors can have significant consequences. These results suggest that the model can serve as a dependable foundation for assisting clinicians in diagnostic decision-making.
In addition to high accuracy, the model achieved a precision of 96.5%, showing that it produces few false positive predictions, which helps avoid unnecessary treatment and patient anxiety. Its recall of 97.2% shows that it correctly identifies most actual disease cases, so that critical conditions are less likely to be overlooked. The F1 score of 96.8% reflects a well-balanced trade-off between precision and recall (sensitivity). This balance is especially important in healthcare systems, where both overdiagnosis and underdiagnosis can have serious implications for patient outcomes and treatment efficiency.
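For reference, the standard per-class definitions of these metrics are given below, where TP, FP, and FN are the true positive, false positive, and false negative counts for a class. The text does not state how the per-class values were averaged, so the arithmetic check simply applies the F1 formula to the two reported figures.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
\[
F_1 = \frac{2 \times 0.965 \times 0.972}{0.965 + 0.972}
    = \frac{1.87596}{1.937} \approx 0.968
\]
```

The result agrees with the reported F1 score of 96.8%.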
Most notably, the model achieved an average AUC of 0.992, indicating near-perfect class discrimination, which is particularly valuable for complex, multi-class medical datasets. Furthermore, the consistent performance across diseases suggests strong generalization, meaning the model is likely to perform well on new and unseen data, a highly desirable trait in healthcare AI systems. Overall, the proposed model is a scalable, accurate, and reliable approach to early disease detection and has strong potential as a clinical decision support tool.
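A multi-class average AUC can be computed in several ways, and the text does not say which was used. The sketch below assumes a macro-averaged one-vs-rest AUC and runs on synthetic data from scikit-learn's `make_classification`; the data, split, and hyperparameters are illustrative assumptions rather than the project's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real symptom data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one probability column per class

# Macro-averaged one-vs-rest AUC (assumed averaging scheme).
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")

# The same quantity by hand: one binary AUC per class, then the plain mean.
per_class_auc = [
    roc_auc_score((y_test == label).astype(int), proba[:, i])
    for i, label in enumerate(clf.classes_)
]
print(f"macro OvR AUC = {macro_auc:.3f}")
```

Reporting the per-class values alongside the average makes it easier to spot a single disease that the model separates poorly.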
References
[1] S. Aamir et al., “Predicting breast cancer leveraging supervised machine learning techniques,” Computational and Mathematical Methods in Medicine, vol. 2022, 2022.
[2] J. Amin, M. Sharif, M. Yasmin, T. Saba, and M. Raza, “Use of machine intelligence to conduct analysis of human brain data for detection of abnormalities in its cognitive functions,” Multimedia Tools Appl., vol. 79, nos. 15–16, pp. 10955–10973, Apr. 2020.
[3] Z. Du, Y. Yang, J. Zheng, Q. Li, D. Lin, Y. Li, J. Fan, W. Cheng, X.-H. Chen, and Y. Cai, “Accurate prediction of coronary heart disease for patients with hypertension from electronic health records with big data and machine learning methods: Model development and performance evaluation,” JMIR Med. Informat., vol. 8, no. 7, Jul. 2020, Art. no. e17257.
[4] I. M. El-Hasnony et al., “Multi-label active learning-based machine learning model for heart disease prediction,” Sensors, vol. 22, no. 3, 2022, Art. no. 1184.
[5] A. Garg, B. Sharma, and R. Khan, “Heart disease prediction using machine learning techniques,” in Proc. IOP Conf. Ser., Mater. Sci. Eng., vol. 1022, 2021, Art. no. 01204.
[6] J. A. W. Gold, F. B. Ahmad, J. A. Cisewski, L. M. Rossen, A. J. Montero, K. Benedict, B. R. Jackson, and M. Toda, “Increased deaths from fungal infections during the coronavirus disease 2019 pandemic—National vital statistics system, United States, January 2020–December 2021,” Clin. Infectious Diseases, vol. 76, no. 3, pp. e255–e262, Feb. 2023.
[7] M. Humayun et al., “Framework for detecting breast cancer risk presence using deep learning,” Electronics, vol. 12, no. 2, 2023, Art. no. 403.
[8] H. Jindal, S. Agrawal, R. Khera, R. Jain, and P. Nagrath, “Heart disease prediction using machine learning algorithms,” in Proc. IOP Conf. Ser., Mater. Sci. Eng., vol. 1022, 2021, Art. no. 012072.