Using Machine Learning Technique- Logistic Regression and Random Forest to Detect Fraud in Healthcare Insurance Claims Industry

Authors: Shailee Shah, Dr. Jyotindra Dharwa

DOI Link: https://doi.org/10.22214/ijraset.2025.70890

Abstract

Insurance fraud threatens the integrity of insurance systems around the world and can cause large financial losses. The stability and sustainability of insurance markets depend on detecting and preventing fraudulent activity. This study proposes a multipronged strategy to improve insurance fraud detection by using cutting-edge technologies. It begins by examining the current state of insurance fraud, identifying typical fraudulent schemes, and investigating the difficulties insurance firms face in spotting fraudulent activity. It then reviews conventional fraud detection techniques and their shortcomings in dealing with new fraudulent strategies. We investigate supervised machine learning techniques such as decision trees. After data preprocessing and Principal Component Analysis (PCA), Random Forest and Logistic Regression are used to analyse different aspects of the data and classify claims as fraudulent or non-fraudulent, and the results of both models are presented.

Introduction

Insurance is a legal contract between an insurer (insurance company) and an insured (individual or organization) to provide financial protection against specific losses such as accidents, illnesses, or disasters. Health insurance is particularly important as it improves access to medical care and reduces the financial burden of medical expenses.

Logistic regression is a common statistical method used in insurance for binary classification problems, such as predicting whether a claim will be made or whether a person is eligible for insurance. This research evaluates how effectively logistic regression forecasts insurance claims and eligibility, which helps insurers identify high-risk clients early and reduce losses.

The research survey reviews various studies on insurance claims prediction, fraud detection, and machine learning applications. Techniques like Random Forest, logistic regression, artificial neural networks, and data mining have been used with notable success in predicting claims and identifying fraudulent activity.

Data preprocessing steps include cleaning, handling missing values, feature transformation, one-hot encoding, and dimensionality reduction (PCA) to prepare data for modeling.
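Two of the preprocessing steps listed above, handling missing values and one-hot encoding, can be sketched in plain Python. This is only an illustration under stated assumptions: the field names `claim_amount` and `provider_type`, the toy records, and the choice of median imputation are hypothetical and are not taken from the paper. The PCA step is omitted here, since in practice it would normally be delegated to a numerical library.

```python
from statistics import median


def impute_median(rows, column):
    """Replace missing (None) values in a numeric column with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical toy claims; the field names and values are illustrative only.
claims = [
    {"claim_amount": 1200.0, "provider_type": "clinic"},
    {"claim_amount": None, "provider_type": "hospital"},
    {"claim_amount": 800.0, "provider_type": "clinic"},
]

prepared = one_hot(impute_median(claims, "claim_amount"), "provider_type")
```

After these two steps every record is fully numeric, which is the form that a dimensionality reduction step such as PCA and the classifiers expect.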

Two main methods highlighted are:

Random Forest: An ensemble method used to detect fraudulent medical claims by analyzing features like claim amount, frequency, provider reputation, and billing patterns. It builds multiple decision trees and uses majority voting for classification.
Logistic Regression: Used to estimate the probability of fraud based on input features such as claim details and provider behavior. Claims with probabilities above a threshold are flagged for review.
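The two decision rules described above can be illustrated with a minimal sketch: majority voting over the outputs of individual trees, and thresholding a logistic probability. The weights, bias, feature values, and the 0.5 threshold below are hypothetical placeholders, not values fitted or reported in the paper, and the tree predictions are supplied by hand rather than produced by trained trees.

```python
import math
from collections import Counter


def fraud_probability(features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def flag_for_review(features, weights, bias, threshold=0.5):
    """Flag a claim when its estimated fraud probability reaches the threshold."""
    return fraud_probability(features, weights, bias) >= threshold


def majority_vote(tree_predictions):
    """Random Forest style aggregation: return the most common label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical coefficients for two scaled features (illustrative only).
weights = [0.8, -0.5]
bias = -0.2

suspicious = flag_for_review([2.0, 1.0], weights, bias)   # z = 0.9
ordinary = flag_for_review([0.0, 0.0], weights, bias)     # z = -0.2

# Hand-written labels standing in for the outputs of three trained trees.
forest_label = majority_vote(["fraud", "legitimate", "fraud"])
```

Using an odd number of trees avoids ties in a two-class vote. Lowering the threshold in `flag_for_review` sends more claims to review, trading precision for recall.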

Results show that Random Forest outperforms Logistic Regression in accuracy, precision, recall, and F1 score for fraud detection.
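For reference, the four metrics named here can be computed from the counts of a binary confusion matrix, treating "fraudulent" as the positive class. The sketch below uses made-up counts purely to show the arithmetic; they are not the paper's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: fraudulent claims correctly flagged
    fp: legitimate claims wrongly flagged
    fn: fraudulent claims missed
    tn: legitimate claims correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

Because fraudulent claims are usually a small minority, accuracy alone can look good for a model that flags almost nothing, which is why precision, recall, and F1 are reported alongside it.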

Conclusion

The empirical investigation in this paper provides new information about how well different fraud detection models and strategies work. Each technique, from rule-based systems to machine learning algorithms and ensemble methods, has its own benefits and drawbacks in spotting fraudulent activity. Using relevant and informative features from insurance data is crucial, since feature engineering and selection were also identified as key to improving fraud detection accuracy. In this study, the Random Forest model achieved an accuracy of 79% and the Logistic Regression model an accuracy of 58% in detecting fraudulent claims. Future work could apply other supervised and unsupervised learning techniques to study how a particular disease affects whether a claim is classified as legitimate or fraudulent.

Copyright

Copyright © 2025 Shailee Shah, Dr. Jyotindra Dharwa. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET70890

Publish Date : 2025-05-13

ISSN : 2321-9653

Publisher Name : IJRASET
