Detecting fraudulent claims is a significant challenge in the insurance industry, where manual review and rule-based systems often fail to identify complex, evolving fraud patterns. This project presents a data-driven approach to fraud detection using a real-world insurance dataset of 1,000 policyholder records, with features covering customer demographics, claim details, incident types, and vehicle information. Two supervised machine learning algorithms, Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost), are used to classify insurance claims as fraudulent or legitimate. Preprocessing includes handling missing values, encoding categorical features, normalizing features, and oversampling the minority (fraud) class with SMOTE. The models are evaluated using accuracy, precision, recall, F1-score, and the confusion matrix.
Experimental results indicate that XGBoost outperforms SVM on most evaluation metrics, especially in identifying the minority (fraud) class. Feature importance analysis indicates that total claim amount, incident severity, police report availability, and customer occupation are among the strongest predictors of fraud. The study highlights the role of intelligent automation in detecting fraudulent activity while improving operational efficiency in insurance workflows. It demonstrates the practical value of machine learning in fraud prevention and offers a scalable, interpretable solution suitable for integration into real-time decision support systems in the insurance sector.
Introduction
Insurance fraud is a global concern, causing billions in annual losses. Traditional detection systems—based on static rules and manual review—struggle with rising fraud complexity and high false positive rates. This project aims to enhance fraud detection using machine learning (ML), focusing on Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost).
Objectives
Develop a classification model to detect fraudulent insurance claims.
Compare the performance of SVM and XGBoost.
Preprocess data (handle missing values, encode categories, normalize features).
Handle class imbalance using SMOTE (Synthetic Minority Over-sampling Technique).
Evaluate models using accuracy, precision, recall, F1-score, and confusion matrix.
Analyze feature importance for business insights.
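The evaluation metrics named in the objectives all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of that relationship, treating fraud as the positive class. The labels are hypothetical and are not taken from the project's dataset; the function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a binary problem where the positive class is fraud."""
    tp: int  # fraud predicted as fraud
    fp: int  # legitimate predicted as fraud
    fn: int  # fraud predicted as legitimate
    tn: int  # legitimate predicted as legitimate


def confusion_matrix(y_true, y_pred, positive=1):
    """Tally the four cells from two parallel label sequences."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp, fp, fn, tn)


def metrics(cm):
    """Return accuracy plus precision, recall and F1 for the positive class."""
    total = cm.tp + cm.fp + cm.fn + cm.tn
    accuracy = (cm.tp + cm.tn) / total if total else 0.0
    precision = cm.tp / (cm.tp + cm.fp) if (cm.tp + cm.fp) else 0.0
    recall = cm.tp / (cm.tp + cm.fn) if (cm.tp + cm.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels (1 = fraud, 0 = legitimate), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
scores = metrics(cm)
```

On an imbalanced dataset, accuracy alone can look strong even when many fraud cases are missed, which is why precision and recall are reported for the fraud class specifically.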
Literature Insights
Prior research confirms the effectiveness of ML techniques in fraud detection:
SVM and decision trees are commonly used (Ngai et al., 2011).
SMOTE improves performance on imbalanced datasets (Bauder & Khoshgoftaar, 2018).
XGBoost is favored for scalability and accuracy in tabular data (Chen & Guestrin, 2016).
Methodology
A structured ML pipeline was followed:
Data Collection: Insurance claims labeled as fraud or non-fraud.
Preprocessing: Imputation of missing values, categorical encoding (label or one-hot), and normalization of features.
Balancing: Applied SMOTE to address class imbalance.
Model Training: Trained both SVM and XGBoost.
Evaluation: Used metrics like accuracy, precision, recall, F1-score, and confusion matrix.
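The three preprocessing steps in the pipeline above can be sketched as follows. This is an illustration using only the Python standard library so that the mechanics are visible, not the project's actual code. The column names and values are hypothetical, and the choices of median or mode imputation and min-max scaling are assumptions, since the report does not state which variants were used.

```python
from collections import Counter
from statistics import median


def impute(column):
    """Fill None with the median (numeric) or the most common value (categorical)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def one_hot(column):
    """Map a categorical column to one 0/1 column per distinct category."""
    categories = sorted(set(column))
    return {cat: [1 if v == cat else 0 for v in column] for cat in categories}


def min_max(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical toy columns; names are illustrative, not the dataset's schema.
claim_amount = [5000, None, 7000, 9000]
severity = ["Minor", "Major", None, "Major"]

amount_scaled = min_max(impute(claim_amount))
severity_encoded = one_hot(impute(severity))
```

In a real pipeline, the fill values and the scaling range would normally be computed on the training split only and then reused on the test split, so that no information from held-out data leaks into training.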
System Architecture
The system consists of the following layers:
Data Input Layer
Preprocessing Layer
Balancing Layer (SMOTE)
Model Layer (SVM, XGBoost)
Evaluation Layer
Prediction Layer (for future claims with feature-based explanations)
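The balancing layer relies on SMOTE, which creates synthetic minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below shows that core idea in plain Python. The function name, the parameter `k`, and the toy points are assumptions made for illustration; a production system would more likely use an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical, already-normalized fraud samples with two features each.
fraud = [[0.9, 0.8], [0.8, 0.9], [1.0, 1.0]]
new_points = smote(fraud, n_new=5)
```

Because each synthetic point lies on a line segment between two real minority points, oversampling is usually applied to the training split only. Otherwise, near-copies of test cases can end up in the training data and inflate the measured scores.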
Key Algorithms
SVM: Margin-based classifier well suited to binary classification on well-scaled features; sensitive to class imbalance and outliers.
XGBoost: Boosted tree algorithm with regularization, tree pruning, and built-in handling of missing values. Generally more robust to imbalance and noise than SVM.
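To make the idea of boosted trees concrete, the following sketch fits a sequence of one-split trees (stumps), each trained on the residual errors left by the previous ones. It is plain gradient boosting on squared error over a single feature. It is not XGBoost's algorithm, which additionally uses a regularized objective, second-order gradient information, and pruning. The data and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a single feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left value, right value)


def boost(xs, ys, rounds=20, learning_rate=0.3):
    """Fit an additive model of stumps, each trained on the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Shrink each stump's contribution by the learning rate.
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    """Sum the base value and the shrunken contributions of all stumps."""
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        total += model["learning_rate"] * (left_value if x <= threshold else right_value)
    return total


# Hypothetical toy data: one feature, label 1 = fraud.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```

Scores above 0.5 can be read as a fraud prediction in this toy setting. The learning rate controls how much each new stump corrects the running prediction, which is one of the levers boosting libraries expose to limit overfitting.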
SVM Issue: Poor fraud detection due to class imbalance.
XGBoost Results:
Accuracy: 80%
Fraud Recall: 58%
Fraud Precision: 64%
Insight: XGBoost identified fraudulent claims better than SVM, although a fraud recall of 58% means that 42% of fraudulent claims were still missed.
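The F1-score for the fraud class is not listed above, but it follows directly from the reported precision and recall as their harmonic mean. A short sketch of that arithmetic, using only the two reported figures:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported above for the fraud class under XGBoost.
fraud_precision = 0.64
fraud_recall = 0.58

fraud_f1 = f1_score(fraud_precision, fraud_recall)
print(f"Fraud-class F1: {fraud_f1:.2f}")  # prints "Fraud-class F1: 0.61"
```

The resulting value of about 0.61 is noticeably lower than the 80% overall accuracy, which reflects how accuracy is dominated by the majority (legitimate) class.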
Conclusion
This project demonstrates the feasibility of using machine learning algorithms, particularly XGBoost, to detect fraudulent insurance claims. With preprocessing and class imbalance correction applied, XGBoost reached 80% accuracy, with 58% recall and 64% precision on the fraud class. Of the two models, XGBoost outperformed SVM, making it the more robust choice for deployment. The study highlights the importance of automation in fraud detection and the value of interpretability in gaining business insights. With proper integration, such a system could reduce fraud-related losses while supporting claim analysts in decision-making.
References
[1] Ngai, E. W. T., et al. "The application of data mining techniques in financial fraud detection." Decision Support Systems (2011).
[2] Bauder, R. A., & Khoshgoftaar, T. M. "The effect of class imbalance techniques on the performance of fraud detection models." IEEE Transactions (2018).
[3] Patil, S., & Thorat, S. "SVM-Based Approach for Health Insurance Fraud Detection." IJARCCE (2020).
[4] Chen, T., & Guestrin, C. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD (2016).