Automated Handwriting Recognition and Digital Document Transformation Using Vision Transformers and TrOCR

Authors: Mohamed Haleem Akmal S, Rohan Karthik R S, Vasanthakumar S, Manojkumar P, Ms. R. Kavitha

DOI Link: https://doi.org/10.22214/ijraset.2026.79652

Abstract

This paper presents an Automated Handwriting Recognition and Digital Document Transformation System designed to convert handwritten documents into editable digital formats. The system accepts scanned PDFs and handwritten images as input and performs image preprocessing and noise reduction to improve text clarity. A Vision Transformer (ViT) encoder is employed to extract visual features from image patches, while a Transformer-based decoder generates character sequences using self-attention and cross-attention mechanisms. The core recognition model utilises TrOCR for end-to-end handwriting recognition without requiring character segmentation. Post-processing algorithms such as token decoding, spell correction, and text formatting are applied to enhance readability and consistency. Finally, the recognised text is exported into digital formats such as TXT, DOCX, and searchable PDF, enabling efficient and accurate digital document transformation.

Introduction

This paper describes the development of an AI-based handwritten document digitization system that converts handwritten documents into editable digital text using transformer-based architectures. Handwritten documents such as historical records, prescriptions, student notes, and government files remain difficult to digitize because traditional Optical Character Recognition (OCR) systems are designed mainly for printed text and rely on character segmentation. These methods perform poorly on cursive or free-form handwriting because of unclear character boundaries, writing style variations, noise, skewed scans, and low-resolution images.

To overcome these limitations, the project uses the Transformer-based OCR model called TrOCR, which treats handwriting recognition as a sequence-to-sequence problem and eliminates the need for explicit character segmentation. The system integrates TrOCR into a complete document processing pipeline capable of handling handwritten JPG/PNG images and multi-page scanned PDFs.
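
The sequence-to-sequence formulation can be illustrated with a minimal inference sketch. It assumes the Hugging Face `transformers` implementation of TrOCR and the public `microsoft/trocr-base-handwritten` checkpoint; neither is confirmed by the paper. It also assumes the input has already been cropped to a single text line. The function names `load_trocr` and `recognize_line` are hypothetical.

```python
def load_trocr(checkpoint="microsoft/trocr-base-handwritten"):
    """Load a TrOCR processor and model. The checkpoint name is an assumption."""
    # Imported lazily so the heavy dependencies are only needed when loading.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    model.eval()
    return processor, model


def recognize_line(image, processor, model, num_beams=4, max_new_tokens=64):
    """Transcribe one text-line image (a PIL RGB image) into a string."""
    # The processor resizes and normalises the image for the ViT encoder.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # The decoder emits tokens autoregressively; beam search keeps the
    # `num_beams` best partial sequences at each step.
    generated_ids = model.generate(
        pixel_values, num_beams=num_beams, max_new_tokens=max_new_tokens
    )
    # No character segmentation is involved: the whole line is decoded at once.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example usage (the file name "line.png" is hypothetical):
#   from PIL import Image
#   processor, model = load_trocr()
#   text = recognize_line(Image.open("line.png").convert("RGB"), processor, model)
```

Passing the processor and model in as arguments keeps the loading cost out of the per-line call, which matters when a multi-page PDF yields many line images.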

The proposed system includes several major components:

A preprocessing module that improves image quality using grayscale conversion, noise removal, skew correction, resizing, and contrast enhancement.
A recognition module based on TrOCR, which combines a Vision Transformer (ViT) encoder and Transformer decoder to recognize handwriting using attention mechanisms.
A post-processing module that performs token decoding, spell correction, text formatting, and merging of multi-page outputs.
A Flask-based web interface that allows users to upload handwritten documents and download outputs in TXT, DOCX, and searchable PDF formats.
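
As an illustration of the post-processing module listed above, the following sketch normalises whitespace, applies a simple dictionary-based spell correction, and merges per-page results. It uses only the Python standard library. The vocabulary, the similarity cutoff, and the function names are hypothetical stand-ins, since the paper does not specify which spell-correction method it uses.

```python
import difflib
import re

# Hypothetical vocabulary; a real deployment would load a full word list.
VOCAB = {"the", "patient", "should", "take", "two", "tablets", "daily"}


def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Replace an alphabetic word with its closest vocabulary entry, if any."""
    if not word.isalpha() or word.lower() in vocab:
        return word  # leave numbers, punctuated tokens, and known words alone
    matches = difflib.get_close_matches(
        word.lower(), sorted(vocab), n=1, cutoff=cutoff
    )
    if not matches:
        return word
    fixed = matches[0]
    # Keep the original capitalisation of the first letter.
    return fixed.capitalize() if word[0].isupper() else fixed


def clean_line(line, vocab=VOCAB):
    """Collapse runs of whitespace and spell-correct each word."""
    line = re.sub(r"\s+", " ", line).strip()
    if not line:
        return ""
    return " ".join(correct_word(w, vocab) for w in line.split(" "))


def merge_pages(pages, vocab=VOCAB):
    """Join recognised lines per page, separating pages with a blank line."""
    cleaned = ["\n".join(clean_line(l, vocab) for l in page) for page in pages]
    return "\n\n".join(cleaned)


print(merge_pages([["The patiant  should"], ["take two tablats daily"]]))
```

A closest-match lookup like this can also miscorrect valid words that are missing from the vocabulary, so the cutoff is a trade-off rather than a fixed recommendation.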

The literature review discusses the evolution of OCR systems, from traditional rule-based approaches and convolutional neural networks to modern transformer-based models like Vision Transformers and TrOCR. Unlike older systems such as Tesseract and CRNN-CTC models, TrOCR achieves better performance by using attention-based sequence modeling and contextual understanding.

The problem statement highlights key limitations of existing OCR systems, including dependency on character segmentation, sensitivity to poor image quality, lack of contextual understanding, limited output formats, and poor handling of multi-page documents. The proposed system addresses these issues through an end-to-end pipeline that automates document digitization from upload to export.

The system architecture consists of four layers:

A frontend web interface for uploading files,
A Flask API layer for request handling,
A preprocessing module for image enhancement,
A recognition and output module using TrOCR and post-processing tools.
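
The interaction between the frontend and the Flask API layer can be sketched as a single upload endpoint. This is an assumption-laden outline rather than the authors' implementation: the `/upload` route, the `file` form field, the allowed extensions, and the `run_pipeline` placeholder are all hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed from the input types the paper mentions (JPG/PNG images and PDFs).
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "pdf"}


def run_pipeline(data: bytes, extension: str) -> str:
    """Hypothetical placeholder for preprocessing, recognition, post-processing."""
    return f"<recognised text for {len(data)} bytes of .{extension}>"


@app.route("/upload", methods=["POST"])  # route name is an assumption
def upload():
    file = request.files.get("file")
    if file is None or file.filename == "":
        return jsonify(error="no file provided"), 400
    name = file.filename
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if extension not in ALLOWED_EXTENSIONS:
        return jsonify(error="unsupported file type"), 415
    # The API layer only validates and dispatches; the lower layers do the work.
    text = run_pipeline(file.read(), extension)
    return jsonify(filename=name, text=text)
```

Keeping validation in the API layer and recognition behind one function call mirrors the layered separation described above, and lets the pipeline be tested without a running web server.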

The preprocessing stage converts images into standardized formats by applying grayscale conversion, Gaussian filtering, skew correction using Hough transforms, and contrast enhancement using CLAHE. The TrOCR model then divides images into patches, extracts visual features through a Vision Transformer encoder, and generates text autoregressively using a Transformer decoder with beam search decoding. Finally, the recognized text is cleaned, spell-checked, merged for multi-page documents, and exported into editable digital formats.

Conclusion

We set out to build a practical, accessible system for converting handwritten documents into editable digital text without requiring technical expertise from the end user. The result is a complete, deployable pipeline that handles the full workflow from raw file upload to multi-format digital output. Our TrOCR-based recognition achieves a character error rate of 10.3% overall and 4.2% on clearly printed handwriting, competitive with the published state of the art on similar benchmarks for a non-fine-tuned deployment. The preprocessing pipeline, multi-format output generation, and Flask-based web interface together produce a system that is immediately usable in educational, healthcare, and administrative contexts. What we found most valuable during development was how much preprocessing quality influenced downstream recognition accuracy. Skew correction and CLAHE contrast enhancement, steps that might seem secondary to the main model, produced measurable improvements in WER. Getting the input right matters as much as the model itself. For other undergraduate teams building NLP and vision systems, we hope the architecture patterns documented here (particularly the segmentation-free TrOCR integration, the multi-format export pipeline, and the page-level parallelisation design for future work) provide a reusable foundation for similar document intelligence projects.
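
For readers reproducing the reported metrics, the sketch below computes character error rate (CER) and word error rate (WER) from the Levenshtein edit distance. It assumes the standard definitions (edit distance divided by reference length), since the paper does not state how its figures were computed. The example strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution, or free match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "take two tablets daily"
hypothesis = "take two tablats daily"
print(f"CER = {cer(reference, hypothesis):.3f}")  # 1 edit / 22 characters
print(f"WER = {wer(reference, hypothesis):.3f}")  # 1 edit / 4 words
```

A single wrong character costs one word in four here but only one character in twenty-two, which is why WER is typically much higher than CER on the same output.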

References

[1] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Blog, 2019.
[3] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in Proc. ICLR, 2021.
[4] M. Li et al., “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models,” in Proc. AAAI, 2023.
[5] R. Smith, “An Overview of the Tesseract OCR Engine,” in Proc. ICDAR, 2007, pp. 629–633.
[6] B. Shi, X. Bai, and C. Yao, “An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, 2017.
[7] U.-V. Marti and H. Bunke, “The IAM-Database: An English Sentence Database for Offline Handwriting Recognition,” Int. J. Doc. Anal. Recognit., vol. 5, no. 1, pp. 39–46, 2002.
[8] JaidedAI, “EasyOCR: Ready-to-use OCR with 80+ Supported Languages,” GitHub, 2020. [Online]. Available: https://github.com/JaidedAI/EasyOCR
[9] T. Yu et al., “Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task,” in Proc. EMNLP, 2018.
[10] S. Schelter et al., “Automating Large-Scale Data Quality Verification,” Proc. VLDB Endow., vol. 11, no. 12, pp. 1781–1794, 2018.

Copyright

Copyright © 2026 Mohamed Haleem Akmal S, Rohan Karthik R S, Vasanthakumar S, Manojkumar P, Ms. R. Kavitha. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET79652

Publish Date : 2026-04-07

ISSN : 2321-9653

Publisher Name : IJRASET