With the rise of powerful AI language models such as GPT-4 and LLaMA, distinguishing AI-generated from human-written text has become increasingly challenging. This project presents a detection system that uses Natural Language Processing (NLP) and Machine Learning (ML) to identify AI-generated content. It integrates deep BERT embeddings with carefully crafted linguistic features such as perplexity, sentence structure, sentiment, and word usage. These features train two classifiers, XGBoost and a Support Vector Machine (SVM), which are combined into a soft-voting ensemble for enhanced accuracy. Trained on a balanced dataset of AI-generated and human-written texts, the ensemble achieved up to 93% accuracy, while XGBoost and SVM individually attained 84% and 81%, respectively. The system also includes a user-friendly interface for real-time text analysis and generates an HTML report detailing predictions and confidence scores. It provides an effective tool for educators, researchers, and institutions to detect AI-generated text and promote the ethical use of AI technologies.
Introduction
Overview:
With the rise of advanced language models like GPT-4, LLaMA, and Gemini, AI can now generate human-like, fluent, and stylistically consistent text. While this benefits fields like education, media, and legal writing, it also raises concerns about misinformation, plagiarism, and identity fraud. Traditional detection tools struggle due to the fluency and originality of AI-generated text.
To address this, a hybrid detection system is proposed that combines:
BERT-based semantic embeddings
Linguistic feature analysis
XGBoost and SVM classifiers
An ensemble (soft voting) approach
A user-friendly interface with real-time predictions and confidence scores
Key Features of the System:
Linguistic Features:
Sentence length and complexity
Word usage diversity
Sentiment polarity and subjectivity
Informality and use of personal tone
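Most of these surface-level features can be computed with plain Python. A minimal, library-free sketch is below; a real implementation would likely use an NLP library (e.g. spaCy or TextBlob) for sentiment polarity and subjectivity, which are omitted here:

```python
import re

# First/second-person pronouns as a rough signal of personal tone
PERSONAL_PRONOUNS = {"i", "me", "my", "we", "our", "you", "your"}

def linguistic_features(text: str) -> dict:
    """Compute simple stylistic features of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    n_words = len(words) or 1
    return {
        # Average sentence length as a proxy for sentence complexity
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Type-token ratio: vocabulary diversity
        "word_diversity": len(set(words)) / n_words,
        # Share of personal pronouns: informality / personal tone
        "personal_tone": sum(w in PERSONAL_PRONOUNS for w in words) / n_words,
    }

feats = linguistic_features("I think this is great. We really enjoyed it!")
```

Each feature becomes one dimension of the handcrafted feature vector that is later fused with the BERT embedding.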
Semantic Features (BERT):
Uses BERT’s CLS token to capture deep contextual understanding of the text
Helps detect subtle patterns missed by surface-level features
Custom Perplexity Estimation:
Rule-based approach to estimate how predictable the text is
AI texts tend to be more predictable (lower perplexity)
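The project's exact rule-based estimator is not specified; as an illustration only, a simple within-text bigram model with add-one smoothing can stand in, where repetitive (more predictable) text scores lower:

```python
import math
import re
from collections import Counter

def pseudo_perplexity(text: str) -> float:
    """Estimate predictability from the text's own bigram statistics.

    Lower scores mean more predictable (more repetitive) text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 2:
        return 0.0
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    vocab = len(unigrams)
    nll = 0.0
    for w1, w2 in zip(words, words[1:]):
        # Add-one smoothed conditional probability P(w2 | w1)
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
        nll -= math.log(p)
    # Exponentiated average negative log-probability, as in true perplexity
    return math.exp(nll / (len(words) - 1))
```

A highly repetitive passage will score lower than a varied one, matching the intuition that AI text tends toward lower perplexity.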
Model Training:
Dataset: 400 human-written and 400 AI-generated texts across multiple styles
Classifiers: XGBoost and SVM trained with fused semantic + linguistic features
Soft-voting ensemble used to combine model outputs for improved accuracy
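The training-and-fusion step can be sketched with scikit-learn on synthetic data; to keep the sketch dependency-light, sklearn's GradientBoostingClassifier stands in for the project's XGBoost model, and the fused 768-dim + handcrafted feature vector is reduced to 20 toy dimensions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for fused semantic + linguistic features (synthetic)
X = rng.normal(size=(800, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy labels: 0 = human, 1 = AI

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("boost", GradientBoostingClassifier(random_state=0)),
        # probability=True is required for soft voting with SVC
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average the two models' predicted probabilities
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
proba = ensemble.predict_proba(X_te[:1])[0]  # per-class confidence scores
```

`predict_proba` supplies the confidence scores that the interface later reports alongside each prediction.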
Results & Performance:
Model       Accuracy
XGBoost     84%
SVM         81%
Ensemble    93%
The ensemble model outperformed the individual classifiers, demonstrating strong capability in distinguishing AI-generated from human-written text.
Semantic + linguistic features proved highly effective.
System includes HTML-based output reports, debugging tools, and confidence scores for transparency.
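The report step can be as simple as filling an HTML template with the prediction and its confidence score; the layout below is illustrative, not the project's actual template:

```python
import html

def render_report(text: str, label: str, confidence: float) -> str:
    """Render a one-page HTML report for a single prediction."""
    snippet = html.escape(text[:300])  # escape user text to keep the HTML safe
    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Detection Report</title></head>
<body>
  <h1>AI-Text Detection Report</h1>
  <p><strong>Prediction:</strong> {html.escape(label)}</p>
  <p><strong>Confidence:</strong> {confidence:.1%}</p>
  <blockquote>{snippet}</blockquote>
</body></html>"""

report = render_report("Sample input text...", "AI-generated", 0.93)
```

The rendered string can be written to a file and opened in any browser.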
Conclusion
This project detects whether a text is AI-generated or human-written. It combines BERT embeddings, which capture meaning, with linguistic features that capture style, distinguishing the polished consistency typical of AI text from the more varied, emotional character of human writing. A soft-voting ensemble of XGBoost and SVM, trained on 400 AI-generated and 400 human-written texts, delivers accurate predictions and presents results in an easy-to-read HTML report.