The preservation and systematic digitization of ancient Tamil manuscripts, predominantly inscribed on palm-leaf substrates (Oolai Chuvadi), constitute a critical challenge in computational heritage science. These manuscripts suffer from severe physical deterioration, non-uniform illumination during digitization, and stylistically irregular script formations that render conventional Optical Character Recognition (OCR) frameworks largely ineffective. This paper presents an end-to-end, AI-powered hybrid pipeline that integrates a domain-specific image preprocessing module with a dual-engine recognition strategy to address these compounded difficulties. The preprocessing layer employs HSV colour-space masking for background isolation, Otsu thresholding for automatic histogram-based binarization, circularity-based morphological analysis for artefact elimination, and affine-transform-based deskewing to normalize document geometry. Downstream recognition is governed by a confidence-driven hybrid controller that routes clean printed text through a Tesseract LSTM engine while dispatching degraded or cursive manuscript images to a purpose-built Convolutional Recurrent Neural Network (CRNN) trained with Connectionist Temporal Classification (CTC) loss. The CRNN was trained on a synthetically augmented dataset of 5,000 line-level images generated by rendering Tamil Unicode glyphs over authentic palm-leaf texture patches. Experimental evaluation on Tirukkural manuscript images shows an approximately 85 percent reduction in background noise after preprocessing and a Character Recognition Rate (CRR) of 84.6 percent, against 38.4 percent for the standalone Tesseract baseline. The system is deployed as a full-stack web application, providing scholars and field researchers with an accessible, automated transcription tool for cultural heritage preservation.
Introduction
This work addresses the problem of digitizing and accurately transcribing ancient Tamil palm-leaf manuscripts, which are highly degraded, noisy, and difficult for standard OCR systems to interpret due to ink–background similarity, physical damage, and historical script variations.
To overcome these challenges, this work proposes a hybrid AI-based OCR system that combines a traditional OCR engine (Tesseract LSTM) with a deep learning model (CRNN). A key contribution is a confidence-based routing mechanism that automatically selects Tesseract for cleaner text and the CRNN for degraded manuscripts.
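The routing idea can be illustrated with a minimal sketch. It assumes one plausible reading of the mechanism: Tesseract runs first, and the CRNN is used only when Tesseract's confidence falls below a threshold. The names `OcrResult`, `route`, and the 0.6 threshold are hypothetical, chosen for illustration and not taken from the system itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# An engine takes an image and returns (text, confidence in [0, 1]).
Engine = Callable[[Any], Tuple[str, float]]


@dataclass
class OcrResult:
    text: str
    confidence: float
    engine: str  # which engine produced the text


def route(image: Any, tesseract: Engine, crnn: Engine,
          threshold: float = 0.6) -> OcrResult:
    """Run Tesseract first; fall back to the CRNN on low confidence.

    The threshold of 0.6 is an illustrative assumption, not a reported value.
    """
    text, conf = tesseract(image)
    if conf >= threshold:
        return OcrResult(text, conf, "tesseract")
    text, conf = crnn(image)
    return OcrResult(text, conf, "crnn")
```

Passing the engines in as callables keeps the controller independent of how each engine computes its confidence, so the same logic could be exercised with stub engines in tests.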
The system includes a specialized preprocessing pipeline designed for palm-leaf documents, involving:
HSV-based background removal
Automatic binarization (Otsu thresholding)
Morphological cleaning (removing holes and noise)
Deskewing for alignment correction
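Two of the steps above can be sketched in plain Python. Otsu's method picks the threshold that maximizes between-class variance of the grayscale histogram, and circularity (4πA/P²) is a standard shape measure that approaches 1 for round blobs such as binding holes. The function names are hypothetical, and how the actual pipeline applies the circularity score is an assumption; a production version would more likely use an image library.

```python
import math
from typing import Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 8-bit threshold t that maximizes between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # intensity sum of the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def circularity(area: float, perimeter: float) -> float:
    """Return 4*pi*A / P^2: 1.0 for a perfect circle, near 0 for thin strokes."""
    if perimeter <= 0:
        return 0.0
    return 4.0 * math.pi * area / (perimeter ** 2)
```

Under this sketch, a connected component with circularity close to 1 would be a candidate hole or blot to remove, while elongated ink strokes score much lower and are kept.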
Because real labeled data is scarce, the study also introduces a synthetic dataset generator that creates realistic manuscript images using Tamil fonts, textures, and degradation effects.
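The compositing step of such a generator might look like the sketch below. It assumes a glyph mask has already been rendered from a Tamil font (font rendering is omitted here) and overlays dark ink on a texture patch, randomly dropping ink pixels to imitate faded strokes. The name `composite`, the ink level, and the fade probability are illustrative assumptions, not details of the actual generator.

```python
import random
from typing import List

Gray = List[List[int]]   # rows of 8-bit grayscale values
Mask = List[List[bool]]  # True where the rendered glyph has ink


def composite(texture: Gray, glyph_mask: Mask, ink: int = 40,
              fade_prob: float = 0.1, seed: int = 0) -> Gray:
    """Overlay ink on a texture patch and randomly fade some ink pixels.

    Returns a new image; the input texture is left unchanged.
    """
    rng = random.Random(seed)  # seeded so that samples are reproducible
    out = [row[:] for row in texture]
    for y, mask_row in enumerate(glyph_mask):
        for x, is_ink in enumerate(mask_row):
            # Keep an ink pixel unless it is randomly chosen to fade.
            if is_ink and rng.random() >= fade_prob:
                out[y][x] = min(out[y][x], ink)
    return out
```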
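The CRNN combines a CNN feature extractor, a BiLSTM sequence model, and CTC decoding to recognize whole text lines without prior character segmentation. Pairing it with Tesseract in the hybrid architecture improves robustness on degraded input while reserving the CRNN for images that Tesseract handles poorly.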
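To show how CTC output becomes text without segmentation, here is a minimal sketch of greedy (best-path) CTC decoding: take the most likely label per frame, merge consecutive repeats, then drop blanks. Greedy decoding is an assumption, since the decoder could equally use beam search. The blank index of 0 and the function name are also illustrative.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frames: Sequence[Sequence[float]],
                      alphabet: Sequence[str]) -> str:
    """Decode per-frame scores into text using best-path CTC decoding.

    frames[t][k] is the score of label k at time step t; label k > 0 maps
    to alphabet[k - 1]. Alphabet entries may be multi-code-point units.
    """
    pieces: List[str] = []
    prev = BLANK
    for frame in frames:
        label = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the label is not blank and differs from the last frame.
        if label != BLANK and label != prev:
            pieces.append(alphabet[label - 1])
        prev = label
    return "".join(pieces)
```

A blank between two identical labels is what allows a doubled character to be emitted twice, which is why blanks are removed only after repeats are merged.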
A full-stack web system (React + FastAPI + PyTorch) allows users to upload manuscripts and obtain transcriptions via different modes (automatic, Tesseract-only, or CRNN-only).
Key results:
Tesseract baseline: 38.4% CRR
CRNN alone: 61.2% CRR
Proposed hybrid system: 84.6% CRR
Significant reduction in background noise (~85%)
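For readers who want to reproduce a CRR figure, the sketch below uses one common definition: 100 × (1 − edit distance / reference length), floored at zero. That this is the exact definition behind the numbers above is an assumption. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def crr(reference: str, hypothesis: str) -> float:
    """Character Recognition Rate in percent, under the assumed definition."""
    if not reference:
        raise ValueError("reference must be non-empty")
    errors = levenshtein(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference)) * 100.0
```

This version counts Unicode code points; for Tamil, where one written unit can span several code points, scoring at grapheme-cluster level would give different numbers.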
Conclusion
This paper has presented an AI-powered hybrid pipeline for the automated recognition of ancient Tamil handwriting from palm-leaf manuscript images. By combining a domain-specific preprocessing module (HSV masking, Otsu binarization, morphological artefact elimination, and affine deskewing) with a confidence-driven dual-engine recognition strategy, the system achieves an 84.6 percent Character Recognition Rate on a curated Tirukkural manuscript test set, more than double the 38.4 percent baseline attained by unmodified Tesseract.
The CRNN architecture, trained on a synthetically augmented corpus of 5,000 line-level images, successfully handles the ligature complexity and stroke irregularity that defeat standard OCR approaches on this document class. The full-stack web application deployment lowers the barrier to adoption for non-technical scholars, providing an accessible, automated transcription tool that directly addresses the preservation urgency facing Tamil manuscript repositories.
The proposed system makes a tangible contribution to the intersection of computational linguistics, cultural heritage digitization, and applied deep learning. As the volume of undigitized manuscript material far exceeds the capacity of manual transcription efforts, automated AI-powered recognition systems of this nature represent an essential component of any scalable preservation strategy for classical Tamil literary heritage.