Optical Character Recognition (OCR) of handwritten text is an extremely challenging problem, particularly in multilingual and low-resource environments. Conventional OCR engines like Tesseract work well for printed text but not for handwriting because of wide variations in style, language, and noise. Breakthroughs in multimodal models, especially CLIP (Contrastive Language–Image Pretraining), provide new avenues for bringing language-agnostic visual knowledge to this problem. This paper discusses the possibility of combining CLIP with Tesseract to improve multilingual handwritten OCR, covering current methods, limitations, and future research directions.
I. Introduction
Optical Character Recognition (OCR) is a crucial area of computer vision, widely used in:
Document digitization
Text extraction from images
Handwriting recognition
Modern OCR systems for languages like English and French are highly developed, but for complex or ancient scripts (e.g., Malayalam), challenges remain—particularly with handwritten and curved text.
This study explores integrating traditional OCR tools (like Tesseract) with modern deep learning approaches (like CLIP) to enhance performance, especially in multilingual handwritten recognition.
II. Related Work
Traditional OCR (Tesseract):
Tesseract OCR, powered by an LSTM engine, improved recognition accuracy but still struggles with noise and cursive handwriting.
Deep Learning in OCR:
CNNs and RNNs (e.g., CRNN) enable better feature extraction and sequence modeling.
However, models often require retraining for new languages.
Multilingual OCR Challenges:
Script variations, limited labeled data, and linguistic nuances.
Datasets like IAM and RIMES exist, but multilingual coverage is still inadequate.
Vision-Language Models (CLIP):
OpenAI’s CLIP aligns image and text embeddings, enabling zero-shot learning.
Useful in script identification and contextual reranking of OCR outputs.
Hybrid Methods:
Combine deep models (e.g., GANs, CLIP) for pre/post-processing with Tesseract for efficient OCR.
Examples include CLIP-based language detection and reranking of Tesseract predictions.
Recent Advances:
TrOCR (Microsoft) and MHTR use transformers for better multilingual recognition.
Multimodal models like M4C show promise in text + vision tasks.
III. Methodology: Proposed Hybrid OCR Pipeline
Preprocessing & Segmentation
Grayscale conversion, adaptive thresholding, noise removal (morphological ops/CNN), and line segmentation.
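A minimal sketch of this preprocessing and line-segmentation stage, assuming OpenCV and NumPy; the threshold block size, morphological kernel, and the projection-profile heuristic for line segmentation are illustrative choices, not the paper's exact parameters.

import cv2
import numpy as np

def preprocess(path):
    """Grayscale, adaptively threshold, and denoise a handwriting image."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)        # grayscale conversion
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 31, 15)                               # adaptive thresholding
    kernel = np.ones((2, 2), np.uint8)
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)          # morphological noise removal

def segment_lines(binary, min_ink=1):
    """Split a binarized page into text lines via a horizontal projection profile."""
    profile = (binary > 0).sum(axis=1)          # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                           # a text line begins
        elif ink < min_ink and start is not None:
            lines.append(binary[start:y, :])    # the line ends
            start = None
    if start is not None:
        lines.append(binary[start:, :])
    return lines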
Script Identification with CLIP
Use CLIP to compare image segments with language prompts and identify the script (e.g., Hindi, Arabic).
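A hedged sketch of zero-shot script identification with CLIP through the Hugging Face transformers API; the checkpoint name ("openai/clip-vit-base-patch32"), the prompt wording, and the script list are assumptions rather than the paper's exact configuration.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

SCRIPTS = ["English", "Hindi", "Arabic", "Tamil", "Malayalam", "Urdu"]

def identify_script(image_path):
    """Zero-shot script identification: pick the prompt CLIP finds most similar to the image."""
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a photo of handwritten {s} text" for s in SCRIPTS]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # similarity of the image to each prompt
    return SCRIPTS[int(logits.softmax(dim=-1).argmax())]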
OCR Using Tesseract
Selects the relevant language model based on CLIP output.
Performs OCR using LSTM-based recognition and beam search for hypothesis generation.
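The language-selection step can be sketched with pytesseract as below; the script-to-traineddata mapping is an assumption, the corresponding language packs must be installed, and pytesseract returns only the single best transcription, so multiple hypotheses would in practice come from varying configurations or from Tesseract's confidence output.

import pytesseract
from PIL import Image

# Assumed mapping from CLIP's predicted script to Tesseract traineddata codes.
SCRIPT_TO_LANG = {"English": "eng", "Hindi": "hin", "Arabic": "ara",
                  "Tamil": "tam", "Malayalam": "mal", "Urdu": "urd"}

def recognize(image_path, script):
    lang = SCRIPT_TO_LANG.get(script, "eng")
    config = "--oem 1 --psm 6"   # LSTM engine, assume a single uniform block of text
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)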
Semantic Reranking with CLIP
Embeds each OCR output hypothesis using CLIP’s text encoder.
Selects the best match based on semantic similarity to the original image.
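A sketch of the reranking step under the same assumed CLIP checkpoint: each candidate transcription is embedded with CLIP's text encoder and scored against the image embedding by cosine similarity. The hypothesis list is assumed to come from the OCR stage.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(image_path, hypotheses):
    """Return the OCR hypothesis whose CLIP text embedding best matches the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=hypotheses, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(-1)   # cosine similarity per hypothesis
    return hypotheses[int(scores.argmax())]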
Post-processing
Includes spelling correction, grammar fixes, format restoration, and optional use of transformer-based language models like mBERT or XLM-R.
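As an illustration of the optional transformer-based correction, the following sketch masks one suspicious word and lets mBERT propose a replacement; the checkpoint choice and the single-token masking heuristic are assumptions, and a complete system would also apply dictionary-based spelling and grammar checks.

from transformers import pipeline

# Multilingual masked language model used to re-propose a suspicious word.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def correct_word(sentence, suspicious_word):
    """Mask one low-confidence word and return mBERT's highest-probability completion.
    Assumes suspicious_word occurs in sentence; the mask covers a single subword token,
    so this is only a rough correction heuristic."""
    masked = sentence.replace(suspicious_word, fill_mask.tokenizer.mask_token, 1)
    return fill_mask(masked)[0]["sequence"]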
IV. Implementation Details
Platform: Python
Tools:
Tesseract OCR for recognition
NumPy for numerical array operations
OpenCV for image preprocessing
Tkinter for GUI
Google Translate API for language translation
Process Flow:
Image captured or uploaded → Grayscaling & binarization → Noise removal → OCR via Tesseract → Language selection via Tkinter → Translation → Output in GUI
Languages Supported: Hindi, Kannada, Marathi, Malayalam, Tamil, Telugu, Urdu
Deployment: Implemented as an Android app for user accessibility and mobile compatibility.
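A condensed sketch of the process flow described above, assuming pytesseract and the unofficial googletrans client (whose API may vary by version); the Tkinter GUI and Android layers are omitted, and the language codes are illustrative.

import cv2
import pytesseract
from googletrans import Translator   # unofficial Google Translate client (assumption)

def process_image(path, ocr_lang="hin", target_lang="en"):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)         # grayscaling
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)    # binarization
    denoised = cv2.medianBlur(binary, 3)                              # noise removal
    text = pytesseract.image_to_string(denoised, lang=ocr_lang)       # OCR via Tesseract
    translated = Translator().translate(text, dest=target_lang).text  # translation
    return text, translated                                           # shown in the GUI in the full app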
Conclusion
Tesseract OCR is widely regarded as the strongest open-source OCR engine because of its accuracy and flexibility; its support for many languages and image types makes it suitable for a broad range of OCR tasks. EasyOCR is another Python-based OCR tool with a very simple interface that supports text extraction and recognition through deep-learning methods. MMOCR, developed by OpenMMLab, is an advanced and mature toolbox that bundles recent techniques in text detection, recognition, and layout analysis; it provides pre-trained, state-of-the-art models for many languages and allows users to train and fine-tune their own. A worthwhile follow-up would be a survey comparing five OCR libraries (Tesseract OCR, MMOCR, PaddleOCR, EasyOCR, and Keras-OCR) on languages such as English, Hindi, Arabic, Tamil, and Malayalam. Among these, Tesseract achieves the highest recognition rates: its focus on reducing errors in Malayalam OCR yields roughly 93% accuracy, and it remains the best performer when tested on English, Hindi, Tamil, and Arabic as well. This makes Tesseract OCR an economical and accurate choice for Malayalam text recognition.
[14] Dan Zhang and Yunjie Li Research and Application of Health Code Recognition Based on Paddle OCR under the Background of Epidemic Prevention and Control, Journal of Artificial Intelligence Practice Vol 6, Issue 1, 2023. [15] R. Deepa; S. Gayathri; P. Chitra; J. Jeno Jasmine; R. Renuga Devi; A. Thilagavathy, An Enhanced Machine Learning Technique for Text Detection using Keras Sequential model, 2023 Second International Conference on Electronics and Renewable Systems (ICEARS), 02-04.