Liver disease is a major health problem that affects millions of people worldwide, and many cases are serious. The liver performs important functions such as detoxification, digestion support, metabolism regulation, protein synthesis, and storage of nutrients. Damage to the liver can lead to severe health complications and even death if not detected early. Most liver diseases do not show symptoms during the early stages, which makes diagnosis difficult. Traditional diagnostic methods depend on blood tests, imaging systems, and expert medical analysis, which may not always be available in rural and low-resource healthcare environments. Therefore, an intelligent automated system is needed for fast and accurate liver disease prediction.
The proposed system applies several preprocessing techniques, including missing value handling, feature scaling, feature engineering, and class balancing using SMOTETomek [8], [9]. Six machine learning algorithms are compared: Logistic Regression, Decision Tree, K-Nearest Neighbors (KNN), Naive Bayes, Support Vector Machine (SVM), and Random Forest [7]. The performance of all algorithms is evaluated using Accuracy, Precision, Recall, F1-Score, and ROC-AUC Score. Experimental analysis shows that Random Forest achieves the best overall performance, with higher accuracy, recall, F1-Score, and ROC-AUC than the other algorithms [2], [3]. Random Forest likely performs better because it combines many decision trees, which reduces overfitting, tolerates noisy data, and captures the nonlinear patterns that appear in medical datasets. Its prediction aggregates the votes of the individual trees, loosely like a committee [7].
Introduction
This paper presents a machine learning approach for early prediction of liver disease, a problem that matters because of the disease’s high mortality rate and the difficulty of early detection.
Liver diseases often remain asymptomatic in early stages, so patients are diagnosed late when treatment is harder. Traditional diagnostic methods (blood tests, imaging, biopsy) are accurate but expensive, slow, and less accessible in rural areas, creating a need for affordable automated prediction systems.
The study proposes using machine learning models on clinical blood test data (from the Indian Liver Patient Dataset). It compares six algorithms: Logistic Regression, KNN, SVM, Decision Tree, Naive Bayes, and Random Forest. The full pipeline includes data preprocessing, feature engineering (e.g., AST/ALT ratio), handling class imbalance using SMOTETomek, model training, and evaluation.
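The feature engineering step can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's actual code: the field names (`age`, `total_bilirubin`, `ast`, `alt`), the use of `log(1 + x)` for the bilirubin transformation, and the age cut points (30 and 60) are all assumptions made for the example.

```python
import math


def engineer_features(record):
    """Return a copy of `record` with three derived features added.

    Field names are hypothetical; the dataset's real column names may differ.
    """
    out = dict(record)

    # AST/ALT ratio; guard against division by zero.
    if record["alt"] > 0:
        out["ast_alt_ratio"] = record["ast"] / record["alt"]
    else:
        out["ast_alt_ratio"] = float("nan")

    # log(1 + x) compresses large bilirubin values and is defined at x = 0.
    # (Assumed form of the "logarithmic bilirubin transformation".)
    out["log_total_bilirubin"] = math.log1p(record["total_bilirubin"])

    # Age bands; the cut points 30 and 60 are illustrative assumptions.
    age = record["age"]
    if age < 30:
        out["age_group"] = "young"
    elif age < 60:
        out["age_group"] = "middle"
    else:
        out["age_group"] = "senior"

    return out


patient = {"age": 45, "total_bilirubin": 0.0, "ast": 80.0, "alt": 40.0}
features = engineer_features(patient)
```

The derived columns are added alongside the raw values rather than replacing them, so a model can use both.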
Among all models, Random Forest performs the best due to its ability to handle nonlinear relationships, noisy data, and imbalanced datasets. The final system is deployed as a Flask-based web application for practical clinical use.
The literature review supports these findings, showing that:
Liver diseases are hard to detect early and require predictive tools.
Machine learning helps identify patterns in medical data.
Ensemble methods like Random Forest generally outperform simpler models.
Handling feature correlation and class imbalance significantly improves accuracy.
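To make the class imbalance point concrete, the sketch below shows the interpolation idea behind SMOTE [8] in plain Python. It is a simplified illustration only: it is not the imbalanced-learn implementation [9], it omits the Tomek-link cleaning that SMOTETomek applies after oversampling, and the function name and parameters are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours by squared Euclidean distance, excluding `base`.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=5)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies.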
Conclusion
This study presented a machine learning-based liver disease prediction system using the Indian Liver Patient Dataset (ILPD) [1], [12]. A complete prediction framework was developed that included data preprocessing, feature engineering, class imbalance handling using SMOTETomek [8], [9], machine learning model training, evaluation, and deployment. Six machine learning algorithms were compared: Logistic Regression, Decision Tree, K-Nearest Neighbors, Naive Bayes, Support Vector Machine, and Random Forest [2], [3], [4]. Random Forest performed best because it handled nonlinear relationships, reduced overfitting through ensemble learning, and managed noisy and imbalanced medical data effectively [7]. Feature engineering techniques such as the AST/ALT ratio, logarithmic bilirubin transformation, and age group categorization also improved prediction performance by providing medically meaningful information to the model [3]. The proposed system was deployed using Flask as a user-friendly web application that allows users to enter patient clinical values and receive real-time liver disease predictions with a risk probability [11]. This system can help healthcare professionals with early disease detection and support faster medical decision-making, especially in hospitals and healthcare centers with limited expert availability [5], [12]. Overall, the proposed Random Forest framework showed strong predictive performance, dependable results, and practical usability, which makes it suitable for real-world healthcare use [7].
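The evaluation metrics named in the abstract can be computed directly from predictions. The following is a minimal sketch in plain Python, assuming binary labels where 1 marks a liver disease case; the function names and the toy data are hypothetical and do not reproduce the study's results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive case is scored
    above a random negative case (ties count as one half).
    Requires at least one positive and one negative label."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (made-up labels and scores, for illustration only).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
```

Recall is usually the metric to watch most closely in a screening setting, since a false negative means a missed case.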
References
[1] V. Ramana and N. B. Venkateswarlu, “ILPD (Indian Liver Patient Dataset),” UCI Machine Learning Repository, 2012. [Online]. Available: https://archive.ics.uci.edu/dataset/225/ilpd+indian+liver+patient+dataset
[2] M. Ghosh, M. M. S. Raihan, M. Raihan, L. Akter, A. K. Bairagi, S. S. Alshamrani, and M. Masud, “A comparative analysis of machine learning algorithms to predict liver disease,” Intelligent Automation & Soft Computing, vol. 30, no. 3, pp. 917–928, 2021.
[3] S. Tokala, K. Hajarathaiah, S. R. P. Gunda, S. Botla, L. Nalluri, P. Nagamanohar, S. Anamalamudi, and M. K. Enduri, “Liver disease prediction and classification using machine learning techniques,” International Journal of Advanced Computer Science and Applications, vol. 14, no. 2, pp. 871–878, 2023.
[4] Riya and B. Kaur, “Liver disease prediction using machine learning algorithms,” International Journal of Computer Applications, vol. 185, no. 27, pp. 36–44, Aug. 2023.
[5] “Machine learning approaches for liver disease prediction,” Frontiers in Physiology, 2025. [Online]. Available: https://www.frontiersin.org
[6] “Voting classifier ensemble for liver disease prediction,” MDPI Computers, vol. 12, no. 4, 2023.
[7] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[9] G. Lemaître, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning,” Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5, 2017.
[10] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[11] M. Grinberg, Flask Web Development: Developing Web Applications with Python. Sebastopol, CA: O’Reilly Media, 2018.
[12] B. V. Ramana, M. S. Babu Prasad, and N. B. Venkateswarlu, “A critical comparative study of liver patients from USA and INDIA: An exploratory analysis,” International Journal of Computer Science Issues, vol. 9, no. 3, pp. 506–516, 2012.