The rapid growth of online job portals has introduced new challenges for recruitment, including fraudulent job advertisements designed to exploit job seekers. In this paper, we present a machine learning approach to identifying fraudulent job postings that integrates Natural Language Processing (NLP) and ensemble learning. Using a dataset of 5,000 job postings, we developed a binary classification model that combines TF-IDF vectorization with an XGBoost classifier and attains an accuracy of 94.2%. The system incorporates SHAP (SHapley Additive exPlanations) so that stakeholders can interpret individual predictions. We also built an interactive web application with Streamlit that lets users analyze a single job posting or upload files for batch predictions. We propose a comprehensive fraud detection approach that integrates features extracted from the body of a job posting, suspicious keyword lists, and contact number analysis. Our model exceeds a standard logistic regression baseline by 8.3% in F1-score, with the largest gains on sophisticated fraudulent job postings.
Introduction
The rise of digital recruitment platforms such as LinkedIn, Indeed, Glassdoor, and Naukri has increased accessibility to job opportunities but has also enabled the spread of fraudulent job postings. Approximately 14% of online job ads are fake, leading to financial loss, identity theft, emotional distress, and erosion of trust in recruitment systems. Common red flags include vague descriptions, unrealistic salaries, requests for payments, use of free email domains, and pressure-inducing language.
Detecting fake job postings is challenging because scams are increasingly sophisticated and real-world datasets are severely imbalanced, with legitimate postings greatly outnumbering fraudulent ones. This research aims to develop a robust machine learning system that accurately classifies job postings, extracts meaningful textual features, provides interpretable predictions using SHAP, and offers a user-friendly web interface for real-time detection.
Literature Review
Machine learning approaches consistently outperform traditional rule-based fraud detection systems. Prior studies have applied SVMs, LSTMs, and Random Forests, achieving accuracies between 84% and 91%. Recent work emphasizes model interpretability, with SHAP offering transparent insights into fraud predictions. The current study aims to fill gaps by incorporating URL scraping, handling extreme data imbalance, offering a complete user interface, and integrating interpretability tools.
Methodology
The study uses the Kaggle “Fake Job Postings” dataset containing 17,880 listings, of which only 4.8% are fraudulent. A stratified subset of 5,000 samples preserves class distribution. Text fields are aggregated and normalized; missing values are retained as they may carry predictive signals. TF-IDF is used for vectorization, complemented by domain-specific suspicious keyword features, skill-detection features, and email/phone pattern analysis.
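The handcrafted features described above can be illustrated with a minimal sketch. The keyword list, the set of free email domains, the regular expressions, and the function name `handcrafted_features` are all assumptions made for illustration; the paper does not specify the exact lists or patterns used.

```python
import re

# Hypothetical keyword list and domain set, chosen for illustration only.
SUSPICIOUS_KEYWORDS = ("urgent", "earn money", "no experience needed", "wire transfer")
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

# Capture the domain part of an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
# Loose phone pattern: a digit, at least seven digits or separators, a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def handcrafted_features(text):
    """Return simple numeric fraud indicators for one posting."""
    lowered = (text or "").lower()  # a missing field becomes an empty string
    email_domains = EMAIL_RE.findall(lowered)
    return {
        "keyword_hits": sum(lowered.count(k) for k in SUSPICIOUS_KEYWORDS),
        "has_free_email": int(any(d in FREE_EMAIL_DOMAINS for d in email_domains)),
        "has_phone": int(PHONE_RE.search(lowered) is not None),
        # Missing text is kept as its own signal rather than dropped.
        "is_missing": int(not lowered.strip()),
    }
```

Features of this kind would typically be concatenated with the TF-IDF matrix before training, so that the classifier sees both the vocabulary and the contact-pattern signals.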
XGBoost is selected as the primary classifier due to its strength in imbalanced classification, regularization abilities, and high predictive power. Logistic Regression serves as a baseline. SHAP is used for global and local interpretability. The solution is implemented in Python using libraries such as scikit-learn, XGBoost, SHAP, Streamlit, and BeautifulSoup.
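A minimal sketch of the two models, assuming scikit-learn pipelines over raw text, is given below. The hyperparameters, the n-gram range, and the helper names are assumptions for illustration and are not the settings used in this study. Setting `scale_pos_weight` to the ratio of legitimate to fraudulent training samples is one common way to handle class imbalance in XGBoost.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def negatives_per_positive(labels):
    """Ratio of legitimate (0) to fraudulent (1) labels, a common starting
    value for XGBoost's scale_pos_weight on imbalanced data."""
    positives = sum(labels)
    return (len(labels) - positives) / positives


def build_baseline():
    """TF-IDF followed by logistic regression, reweighted for class imbalance."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )


def build_xgboost(train_labels):
    """TF-IDF followed by gradient-boosted trees."""
    # Imported here so the baseline can be used without xgboost installed.
    from xgboost import XGBClassifier

    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        XGBClassifier(
            n_estimators=300,   # assumed values, not taken from the paper
            max_depth=6,
            learning_rate=0.1,
            scale_pos_weight=negatives_per_positive(train_labels),
            eval_metric="logloss",
        ),
    )
```

Both builders return objects with the usual `fit`, `predict`, and `predict_proba` methods, so the baseline and the XGBoost model can be trained and evaluated on the same split with identical code.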
Results
XGBoost significantly outperforms the baseline:
Accuracy: 94.20%
Precision: 89.40%
Recall: 87.60%
ROC-AUC: 0.9587
The confusion matrix shows few false negatives, which is critical for preventing scam exposure, and a manageable false positive rate. Error analysis indicates that brief or poorly written legitimate postings are sometimes flagged as fake, whereas sophisticated scams that imitate the tone and style of real listings are sometimes missed.
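The F1-score is not listed among the metrics above. Since F1 is the harmonic mean of precision and recall, it can be derived from the reported values: assuming the 89.40% precision and 87.60% recall both refer to the fraudulent class, F1 comes to roughly 88.5%. The sketch below shows this calculation together with a helper that derives all four metrics from confusion-matrix counts; the function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_confusion(tp, fp, fn, tn):
    """Derive the standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Precision and recall as reported in the Results section.
print(round(f1_from(0.894, 0.876), 4))  # 0.8849
```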
Feature Analysis
SHAP analysis identifies influential terms often signaling fraud, such as urgency-related or overly generic language. Suspicious keyword features (e.g., “urgent,” “earn money,” “no experience needed”) are strong predictors. Patterns such as the use of free email domains and missing phone numbers also strongly correlate with fraudulent behavior.
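To make the idea behind SHAP attributions concrete, the sketch below computes exact Shapley values for a toy scoring function over three binary signals. This is not the SHAP library and not the trained model: `toy_score` and its weights are hypothetical, and the brute-force enumeration is only feasible for a handful of features. The SHAP library computes or approximates the same quantity efficiently for real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values for a set function value_fn over feature names."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding f.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value_fn(coalition | {f}) - value_fn(coalition))
        result[f] = total
    return result


# Hypothetical fraud score: each present signal adds to the score, and an
# urgency keyword combined with a free email domain adds an interaction bonus.
def toy_score(present):
    score = 0.05  # baseline score with no signals present
    if "urgent_keyword" in present:
        score += 0.30
    if "free_email" in present:
        score += 0.20
    if "missing_phone" in present:
        score += 0.10
    if {"urgent_keyword", "free_email"} <= present:
        score += 0.10
    return score


phi = shapley_values(toy_score, ["urgent_keyword", "free_email", "missing_phone"])
```

In this toy case the interaction bonus is split evenly between the two interacting signals, and the attributions sum to the difference between the full score and the baseline score, which is the additivity property that makes such explanations easy to read.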
Web Application
The Streamlit-based interface supports:
Single job analysis via text input or URL scraping
Batch predictions through CSV uploads
Visualization dashboards with SHAP plots, confidence charts, confusion matrices, and word clouds
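The batch prediction path can be sketched independently of the user interface. The example below assumes the uploaded CSV has a `description` column and that a scoring function returning a fraud probability is available; the column name, the 0.5 threshold, and `demo_score` are placeholders for illustration, not details taken from the application.

```python
import csv
import io


def predict_batch(csv_text, score_fn, text_column="description", threshold=0.5):
    """Score every row of an uploaded CSV and attach a label and probability."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or text_column not in reader.fieldnames:
        raise ValueError(f"CSV must contain a '{text_column}' column")
    results = []
    for row in reader:
        probability = score_fn(row[text_column] or "")
        results.append({
            **row,
            "fraud_probability": round(probability, 3),
            "prediction": "fraudulent" if probability >= threshold else "legitimate",
        })
    return results


# Stand-in scorer for the demo; a real app would call the trained model here.
def demo_score(text):
    return 0.9 if "urgent" in text.lower() else 0.1
```

In a Streamlit app, the text passed to `predict_batch` would typically come from decoding the bytes of the file returned by `st.file_uploader`, and the resulting rows could be displayed as a table.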
Deployment and Scalability
The system can be deployed on local servers or cloud platforms, with scalability enhanced via API-based model serving, database integration, distributed processing, and caching. Security measures include HTTPS, input sanitization, rate limiting, and GDPR-compliant data handling.
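Two of the measures named above, caching and rate limiting, can be sketched in a few lines. The class names, the cache key (a SHA-256 hash of whitespace-normalized, lowercased text), and the sliding-window policy are assumptions for illustration; a production deployment would more likely rely on an external cache and a gateway-level limiter.

```python
import hashlib
import time
from collections import defaultdict, deque


class PredictionCache:
    """Cache scores by a hash of the normalized posting text."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._store = {}
        self.misses = 0

    def score(self, text):
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._score_fn(text)
        return self._store[key]


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)

    def allow(self, client_id):
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The injectable `clock` argument exists only to make the limiter easy to test without waiting in real time.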
Conclusion
This research presents a comprehensive machine learning system for detecting fraudulent job postings, achieving 94.2% accuracy with strong performance across precision (89.4%) and recall (87.6%) metrics. The system's key contributions include:
1) Robust Classification: Demonstrated an 8.3% F1-score improvement over the logistic regression baseline, validating the effectiveness of gradient boosting for this task.
2) Interpretability: Integration of SHAP analysis provides transparent, feature-level explanations, enhancing stakeholder trust and enabling continuous model refinement.
3) Practical Deployment: Development of a user-friendly Streamlit web application with both single-job and batch prediction capabilities, complete with real-time URL scraping and comprehensive visualizations.
4) Domain Knowledge Integration: Incorporation of suspicious keyword detection, skill extraction, and contact information analysis provides multi-faceted fraud assessment beyond pure ML classification.
5) Reproducibility: Complete codebase with modular architecture enables easy replication and extension by other researchers and practitioners.
The experimental results demonstrate that machine learning can effectively identify sophisticated fraudulent patterns that traditional rule-based systems miss. The confusion matrix analysis revealed only 19 false negatives out of 1,000 test samples, indicating strong protection for job seekers while maintaining an acceptable false positive rate.
Looking forward, the integration of transformer-based language models, multi-class fraud categorization, and active learning mechanisms promises to further enhance detection capabilities. The system's modular architecture facilitates easy integration into existing recruitment platforms, providing valuable protection for job seekers in the digital age.