Legal Document Authentication and Verification System Leveraging OCR, NLP, and CNN for Comprehensive Document Authentication

Authors: Prof. Moushmee Kuri, Prasad Bokare, Sandipak Dhuri, Atharv Inamdar, Pawas Gavande, Aditya Gupta

DOI Link: https://doi.org/10.22214/ijraset.2025.70968

Abstract

Legal document forgery is a serious threat to the integrity of both institutions and individuals. This paper designs a "Legal Document Authentication and Verification System" based on Optical Character Recognition (OCR), Natural Language Processing (NLP), and Convolutional Neural Networks (CNN). The system targets forgery detection at the textual, image-based, and structural layers of Indian driving license documents, applying machine learning techniques to signature, layout, and text anomaly analysis. It is modular and scalable, with applications in the governance, education, and banking sectors. Benchmarked evaluation showed promising accuracy in detecting forged documents.

Introduction

Problem Statement

Forgery of legal documents has become easier due to advanced digital tools. Manual verification is error-prone, slow, and ineffective against modern forgery techniques. A reliable, automated solution is needed to detect tampering in text, images, and layout structures.

Objectives & Key Deliverables

Real-time document verification system using AI techniques.
Forgery detection through:
- OCR + NLP for text and structural inconsistencies.
- CNN-based image and layout anomaly detection.
Cross-domain scalability with a unified platform.
Use case: Indian Driving Licenses (DLs), especially in fraud-prone sectors of emerging markets.

Literature Review Highlights

Image-Based Forgery Detection:
- Techniques like PCT, LBP, and SURF are used to detect image tampering with accuracy up to 97.5%.
Deep Learning (CNNs):
- CNNs are effective in identifying forged content with high recall (97.3%).
- Enhancements include DCT-based feature extraction for compressed documents.
Text Analysis with OCR & NLP:
- NLP identifies semantic and formatting anomalies (e.g., inconsistent names, fonts, spacing).
- OCR tools like Tesseract are used for text extraction from scanned images.
Challenges:
- High-quality forgeries mimic originals closely.
- Document format variability and noise reduce detection accuracy.
Emerging Technologies:
- GANs for synthetic forgery data generation.
- Blockchain for tamper-proof storage.
- Combined OCR-NLP-CNN approaches improve robustness.

Proposed Methodology

1. Document Preprocessing

Image normalization and noise removal for uniformity.
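As a rough illustration of this step, the sketch below applies a 3x3 median filter for noise removal followed by a min-max contrast stretch for normalization. The paper does not specify which filters it uses, so both choices, the function names, and the list-of-lists grayscale image representation are assumptions made only for this example.

```python
from statistics import median

def median_filter(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(int(median(window)))
        out.append(row)
    return out

def normalize(img):
    """Stretch pixel values linearly so they span 0..255 (integer arithmetic)."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]

def preprocess(img):
    """Assumed order: denoise first, then normalize."""
    return normalize(median_filter(img))
```

Denoising before stretching keeps an isolated speck from dominating the intensity range. A production system would more likely rely on an image library than on plain Python lists.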

2. Forgery Detection Modules

Text Analysis (OCR + NLP):
- Tokenization, semantic anomaly detection, and format consistency checks.
Layout & Signature Analysis (CNN):
- Logo/table detection and signature verification against stored templates.
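The format consistency checks on OCR output could look something like the following sketch. It assumes the OCR stage has already produced a dictionary of fields. The licence number pattern (two-letter state code, two-digit RTO code, four-digit year, seven-digit serial), the field names, and the `min_age` rule are assumptions for illustration, not rules taken from the paper.

```python
import re
from datetime import date

# ASSUMPTION: state code + RTO code + year + 7-digit serial, e.g. "MH12 20150001234".
DL_NUMBER = re.compile(r"^[A-Z]{2}[- ]?\d{2}[- ]?(?:19|20)\d{2}\d{7}$")

def years_between(start, end):
    """Whole years elapsed from start to end."""
    return end.year - start.year - ((end.month, end.day) < (start.month, start.day))

def check_fields(fields, min_age=18):
    """Return a list of anomalies found in OCR-extracted licence fields."""
    anomalies = []
    if not DL_NUMBER.match(fields["dl_number"].strip().upper()):
        anomalies.append("dl_number: unexpected format")
    if not fields["dob"] < fields["issue_date"] < fields["expiry_date"]:
        anomalies.append("dates: expected dob < issue_date < expiry_date")
    elif years_between(fields["dob"], fields["issue_date"]) < min_age:
        # min_age is a hypothetical, configurable rule
        anomalies.append("dates: holder younger than min_age at issue")
    return anomalies
```

An empty list means no rule fired. It does not by itself establish that the document is genuine.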

3. System Architecture

Front-End: Document upload and visual display of tampered areas.
Back-End: Python (TensorFlow for CNN, spaCy for NLP); secure cloud storage.

4. Multi-Modal Analysis

Consolidates outputs from OCR, NLP, and CNN for a unified forgery detection score.
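One simple way to consolidate module outputs is a weighted average, sketched below. The paper does not state its fusion rule, so the weighting scheme, the threshold, and the assumption that every module reports a forgery score in [0, 1] are all hypothetical.

```python
def fuse_scores(scores, weights=None, threshold=0.5):
    """Combine per-module forgery scores (each in [0, 1]) into one score.

    scores:  e.g. {"ocr": 0.2, "nlp": 0.4, "cnn": 0.9}
    weights: optional per-module weights; equal weighting if omitted.
    Returns (fused_score, is_flagged).
    """
    if not scores:
        raise ValueError("at least one module score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = sum(scores[name] * weights[name] for name in scores) / total
    return fused, fused >= threshold
```

In practice the weights and threshold would be tuned on a labelled validation set rather than chosen by hand.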

Data Analysis & Visual Insights

Forgery Prevalence in India:
- 35% forged, 65% genuine (based on sample dataset).
Regional Forgery Rates:
- West India (30%) and North India (25%) lead in tampering incidents.
Image Augmentation for Training:
- Rotation, scaling, and blurring are used in equal proportion (~33% each).
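The sketch below shows how the three augmentations might be drawn with equal probability. The specific transforms are simplified stand-ins chosen for this example: a 90-degree rotation, a 2x nearest-neighbour upscale, and a 3x3 box blur on a list-of-lists grayscale image. The paper does not give its actual augmentation parameters.

```python
import random

def rotate90(img):
    """Rotate a row-major grayscale image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def scale2x(img):
    """Upscale by a factor of 2 using nearest-neighbour replication."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def box_blur(img):
    """3x3 mean blur; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) // len(window))
        out.append(row)
    return out

AUGMENTATIONS = (rotate90, scale2x, box_blur)

def augment(img, rng):
    """Apply one augmentation chosen uniformly, so each is used about a third of the time."""
    return rng.choice(AUGMENTATIONS)(img)
```

Passing a seeded `random.Random` instance as `rng` keeps the augmented training set reproducible.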

System Design & Implementation

Tech Stack:
- Python, TensorFlow, spaCy, React.js.
Deployment:
- Cloud-based with role-based access control.
Evaluation Metrics:
- Achieved 98.2% precision and 96.5% recall on custom datasets.
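For reference, precision and recall are computed from the counts of true positives, false positives, and false negatives, treating "forged" as the positive class. The counts in the usage note are made up to show the arithmetic and are not the paper's data.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    tp: forged documents correctly flagged
    fp: genuine documents wrongly flagged
    fn: forged documents that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With illustrative counts of 90 true positives, 10 false positives, and 30 false negatives, `precision_recall(90, 10, 30)` gives a precision of 0.9 and a recall of 0.75.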

User-Centric Design

Functional Requirements:
- Multi-format support (PDF, JPEG), text & image analysis, real-time feedback.
Non-Functional:
- Secure, scalable, and fast.
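Multi-format support implies deciding how to route an upload before analysis. A minimal sketch, assuming the upload arrives as raw bytes, is to inspect the file signature rather than trust the file extension. The function name and return labels are hypothetical.

```python
def sniff_format(data: bytes) -> str:
    """Classify an upload by its leading bytes."""
    if data.startswith(b"%PDF-"):          # PDF files begin with "%PDF-"
        return "pdf"
    if data.startswith(b"\xff\xd8\xff"):   # JPEG start-of-image marker
        return "jpeg"
    return "unknown"
```

Uploads classified as "unknown" can be rejected early, before any OCR or CNN work is scheduled.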

Stakeholder Feedback

Legal and government professionals validated the need for layout and signature checks.

Proposed Algorithm (Workflow)

1. User login & document upload
2. Text extraction (OCR) + layout check + signature verification
3. Anomaly detection using AI
4. Forgery scoring and result notification
5. User feedback support & error correction
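The workflow above can be sketched as a small orchestration function. The OCR engine and the detectors are passed in as plain callables, so no particular library API is implied. The `Report` class, the function names, and the "flag if any detector is confident" decision rule are assumptions for illustration. Login and upload handling are outside the scope of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    scores: dict = field(default_factory=dict)    # per-detector forgery scores
    forged: bool = False                          # final verdict
    feedback: list = field(default_factory=list)  # user corrections

def verify_document(image, ocr, detectors, threshold=0.5):
    """Run OCR, then every detector, then score the document.

    ocr:       callable(image) -> text
    detectors: name -> callable(image, text) -> score in [0, 1]
    """
    text = ocr(image)                                           # text extraction
    scores = {name: fn(image, text) for name, fn in detectors.items()}
    report = Report(scores=scores)                              # anomaly detection
    report.forged = max(scores.values(), default=0.0) >= threshold  # forgery scoring
    return report

def record_feedback(report, user_says_forged):
    """Store the user's verdict; return True when it contradicts the system."""
    report.feedback.append(user_says_forged)
    return user_says_forged != report.forged
```

Disagreements returned by `record_feedback` could be queued for manual review or used as labelled examples for retraining.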

Expected Impact

Reliable, automated legal document verification.
High accuracy in detecting forgeries in text, layout, and signatures.
Scalable and user-friendly solution for government, legal, and financial institutions.

Conclusion

This paper presents a scalable and efficient system for verifying legal documents using OCR, NLP, and CNN. Because it detects forgery in text, images, and layouts, the system is well suited to sensitive domains such as governance, education, and banking. Future improvements will include blockchain integration for tamper-proof storage and GANs for synthetic forgery simulation, so that the system can keep pace with evolving threats.

Copyright

Copyright © 2025 Prof. Moushmee Kuri, Prasad Bokare, Sandipak Dhuri, Atharv Inamdar, Pawas Gavande, Aditya Gupta. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET70968

Publish Date : 2025-05-14

ISSN : 2321-9653

Publisher Name : IJRASET
