The rapid rise of online recruitment platforms has led to a significant increase in fake job postings that expose job seekers to financial loss, identity theft, and mental distress. This paper presents a detailed study of the automatic detection of fraudulent job advertisements using supervised machine learning methods and a Flask-based web application. The study uses the Employment Scam Aegean Dataset (EMSCAD), which contains 17,880 labelled job postings, to train and test several classification models, including Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and LightGBM, as well as deep learning models such as the Multi-layer Perceptron (MLP), Bidirectional Long Short-Term Memory (Bi-LSTM), and Deep Neural Networks (DNN).
The web application, built with Python, Flask, and scikit-learn, accepts job posting inputs from users across multiple fields and uses a trained machine learning model to classify each posting in real time as either genuine or fraudulent. Text features are extracted using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The experimental results indicate that the LightGBM classifier achieves the most balanced performance, with an accuracy of 98.18% and an ROC-AUC score of 0.91, while the Bi-LSTM model achieves the highest raw accuracy of 98.71%. The integrated web system connects academic model development with practical application, making the fraud detection tool accessible to end users. The results show that natural language processing (NLP) and machine learning can serve as essential elements in developing safer and more reliable digital recruitment environments.
Introduction
Fraudulent job postings on digital employment platforms are an increasing global problem, exploiting job seekers through fake advertisements that request sensitive information or payments. Existing manual moderation and rule-based detection methods are insufficient at scale, motivating the development of an automated machine learning-based detection system deployed as a Flask web application. The system processes structured job posting inputs, combines them into a unified text representation, and classifies them as genuine or fraudulent using a trained model.
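The step of combining structured inputs into a unified text representation can be sketched as follows. This is a minimal illustration, not the application's actual code: the field names in TEXT_FIELDS and the helper combine_fields are assumptions chosen for the example, since the exact form fields are not listed here.

```python
# Hypothetical field names; the real form fields may differ.
TEXT_FIELDS = ("title", "company_profile", "description", "requirements", "benefits")


def combine_fields(posting: dict) -> str:
    """Join the non-empty text fields of one posting into a single string."""
    parts = []
    for field in TEXT_FIELDS:
        value = posting.get(field)
        # Treat None, non-string, and whitespace-only values as missing.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)


posting = {
    "title": "Data Entry Clerk",
    "company_profile": None,          # missing field
    "description": "Work from home.  ",
    "requirements": "",               # empty field
}
print(combine_fields(posting))
```

Skipping missing fields rather than inserting a placeholder token is a design choice made for brevity; a real pipeline might instead keep an explicit marker so that the absence of a field remains visible to the model.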
The literature review shows that fraud detection research spans areas such as review spam, email spam, and fake news detection, with machine learning and deep learning methods (such as SVM, Naive Bayes, Random Forest, Gradient Boosting, and transformers) improving performance. For fraudulent job detection specifically, prior studies highlight the effectiveness of deep learning (e.g., Bi-LSTM), ensemble models, and hybrid NLP features (TF-IDF combined with embeddings), but they also reveal gaps: class imbalance, limited use of the full textual content, a lack of deployable systems, insufficient real-world scalability, and evaluation that rarely goes beyond accuracy.
The study uses the EMSCAD dataset as distributed on Kaggle (17,880 job postings), which is highly imbalanced: 866 postings (about 5%) are fraudulent and the remaining roughly 95% are genuine. It contains both structured and unstructured features, with missing values in some textual fields that can themselves indicate fraudulent behavior. Preprocessing includes handling missing values, text normalization, tokenization, lemmatization, and stop-word removal. All job fields are concatenated into a single text input and transformed using TF-IDF (up to 10,000 features with unigrams and bigrams). Class imbalance is addressed using SMOTE.
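To make the TF-IDF step concrete, the sketch below computes textbook TF-IDF weights over unigrams and bigrams using only the standard library. It is an illustration under stated assumptions, not the pipeline used in the study: the function names ngrams and tfidf are hypothetical, tokenization is a plain whitespace split, and the weighting is the unsmoothed form tf * log(N / df). scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its values will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all contiguous n-grams of length 1..n_max as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def tfidf(docs, n_max=2, max_features=10_000):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    term_lists = [ngrams(doc.lower().split(), n_max) for doc in docs]

    df = Counter()      # number of documents containing each term
    total = Counter()   # corpus-wide term counts, used to cap the vocabulary
    for terms in term_lists:
        df.update(set(terms))
        total.update(terms)

    # Keep only the max_features most frequent terms across the corpus.
    vocab = {term for term, _ in total.most_common(max_features)}

    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(t for t in terms if t in vocab)
        length = len(terms) or 1  # avoid division by zero for empty documents
        vectors.append(
            {t: (c / length) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


docs = ["earn money fast", "software engineer role", "earn money from home"]
for vector in tfidf(docs):
    print(vector)
```

In this toy corpus the bigram "earn money" appears in two of the three documents, so it receives a lower weight than "money fast", which appears in only one.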
Several machine learning models are trained and compared, including Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and XGBoost, using stratified train/validation/test splits. The best-performing model is selected for deployment in a real-time web application that predicts whether a job posting is genuine or fraudulent based on user inputs, aiming to improve accessibility, automation, and practical usability in fraud detection systems.
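A stratified split keeps the proportion of fraudulent postings roughly equal across the training, validation, and test sets, which matters when the minority class is small. The following standard-library sketch shows the idea; the 70/15/15 fractions, the seed, and the function name stratified_split are assumptions for illustration, as the split ratios are not stated here.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels with a 10% minority class (illustrative, not the dataset's ratio).
labels = [0] * 900 + [1] * 100
train, val, test = stratified_split(labels)
for name, split in (("train", train), ("val", val), ("test", test)):
    fraud = sum(labels[i] for i in split)
    print(name, len(split), fraud)
```

Any resampling of the minority class would normally be applied to the training indices only, so that the validation and test sets keep the original class distribution.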
Conclusion
This paper presents a comprehensive study on detecting fraudulent job postings using supervised machine learning techniques, culminating in the deployment of a fully functional Flask-based web application. The study shows that combining natural language processing, specifically TF-IDF vectorization applied to concatenated multi-field job posting text, with ensemble and deep learning classifiers yields effective fraud detection systems.
Among the evaluated models, LightGBM is the optimal choice for deployment, achieving an accuracy of 98.18% and an ROC-AUC score of 0.91 with a recall of 0.83. This represents the best balance between minimizing false positives and detecting fraudulent postings. The Bidirectional LSTM model achieves the highest raw accuracy of 98.71% but has significantly lower recall, highlighting the need for multi-metric evaluation in cases of imbalanced classification.
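The gap between accuracy and recall on imbalanced data can be illustrated with a small worked example. The counts below are toy values chosen for the illustration and are not results from this study; binary_metrics is a hypothetical helper written for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: 95 genuine postings (0) followed by 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5

# A classifier that labels every posting as genuine.
always_genuine = [0] * 100

# A classifier that flags 3 genuine postings by mistake and catches 4 of 5 frauds.
balanced = [1] * 3 + [0] * 92 + [1] * 4 + [0] * 1

print(binary_metrics(y_true, always_genuine))
print(binary_metrics(y_true, balanced))
```

The first classifier reaches 95% accuracy while detecting no fraud at all, whereas the second is only one percentage point more accurate yet recovers four of the five fraudulent postings. Accuracy alone does not distinguish these two behaviours.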
The web application translates academic model performance into a practical tool that users can access. It accepts structured job posting inputs, applies the trained pipeline in real time, and communicates the binary classification result through an intuitive interface. In doing so, the system connects machine learning research with deployable fraud prevention solutions. The architecture of the Flask application shows that integrating scikit-learn models into web-based inference systems is feasible with minimal infrastructure overhead.
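The inference path of such an application can be sketched in a few lines of Flask. This is a hedged outline rather than the deployed system: the /predict route, the JSON payload, the field names in TEXT_FIELDS, and the KeywordModel stand-in are all assumptions made for the example. In a real deployment the stand-in would be replaced by the trained vectorizer and classifier loaded from disk.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical field names; the real form fields may differ.
TEXT_FIELDS = ("title", "company_profile", "description", "requirements", "benefits")


class KeywordModel:
    """Placeholder with a predict() method; NOT the trained model from the study."""

    def predict(self, texts):
        return [1 if "registration fee" in text.lower() else 0 for text in texts]


# Assumption: a real app would load a serialized pipeline here instead.
model = KeywordModel()


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    # Concatenate the submitted fields into one text input, skipping empty ones.
    values = (str(payload.get(field) or "").strip() for field in TEXT_FIELDS)
    text = " ".join(v for v in values if v)
    if not text:
        return jsonify(error="no text supplied"), 400
    label = model.predict([text])[0]
    return jsonify(prediction="fraudulent" if label == 1 else "genuine")


if __name__ == "__main__":
    app.run(debug=False)
```

Loading the model once at import time, rather than inside the request handler, keeps per-request latency low, which is the main reason a single-process setup like this can serve predictions in real time.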
The findings indicate that machine learning methods are a viable alternative to manual or rule-based approaches for detecting fraudulent job postings, offering scalability, consistency, and potential for continuous improvement. Deployment of such systems by online recruitment platforms could substantially reduce employment scams, protect job seekers, and strengthen the integrity of digital recruitment ecosystems.
As fraudulent actors advance their deception strategies, developing adaptive, continually-learning, and multimodal detection systems is a critical area for future research. This work establishes a foundational framework for building such advanced systems.
References
[1] Pillai, A. S. (2023). Detecting Fake Job Postings Using Bidirectional LSTM. arXiv preprint.
[2] Boka, M. (2024). Predicting Fake Job Posts Using Machine Learning Models. SSRN Electronic Journal.
[3] Kumar, S. (2025). A Review of Machine Learning and NLP-Based Detection of Fake Job Posts. Digital Manuscript Pedia.
[4] Naudé, M., & Kumar, S. (2023). A Machine Learning Approach for Detecting Fraudulent Job Postings. SpringerLink.
[5] Gulshan, P., et al. (2024). Fraudulent Online Job Advertisement Detection using Machine Learning Techniques. IJIRSET.
[6] Bhatta, S. (2025). Detecting Fake Job Postings using NLP and Machine Learning. GitHub Repository.
[7] Boka, M., & Gulshan, P. (2024). Fake Job Post Detection using Machine Learning and Deep Learning. IJAEM.
[8] Sasidharan, A. (2023). Detecting Fake Job Postings Using Bidirectional LSTM. ResearchGate / IJRPR.
[9] Kumar, S., et al. (2025). Fake Job Post Detection using Machine Learning and Deep Learning. IJRPR.
[10] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[11] Ke, G., et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems (NeurIPS).
[12] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780.
[13] Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
[14] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
[15] Grinberg, M. (2018). Flask Web Development: Developing Web Applications with Python (2nd ed.). O'Reilly Media.