This study presents a comprehensive review of research on OCR (optical character recognition), translation, and object detection from a single image. With the fast advancement of deep learning, more powerful tools that can learn semantic, high-level, and deeper features have been proposed to solve the issues that plague traditional systems. The rise of the high-powered desktop computer has aided OCR reading technology by permitting the creation of more sophisticated recognition software that can read a range of common printed typefaces and handwritten texts. However, implementing an OCR system that works in all feasible scenarios and produces extremely accurate results remains difficult. Object detection, the problem of detecting various items in photographs, is likewise challenging. Object identification using deep learning is a popular application of the technology, distinguished by its superior feature learning and representation capabilities compared with standard object detection approaches. The major focus of this review paper is text recognition, object detection, and translation from an image-based input application employing OCR and the YOLO technique.
Introduction
Summary:
The document describes a multifunctional mobile application integrating OCR-based text recognition, object detection, and language translation to overcome language barriers and enhance usability. With advances in mobile cameras and computer vision, the app processes images (e.g., documents, signs) to recognize text and objects, then translates or summarizes content as needed.
Proposed Solution:
The app offers three main features from the home screen:
Text Recognition: Uses Tesseract OCR to extract text from images, followed by translation via Google Translate (a minimal sketch of this pipeline appears after this list).
Object Detection: Employs YOLOv3 to identify objects in photos or live camera feeds, with optional translation of detected object names.
Language Translation: Allows users to input text for direct translation into selected languages.
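The sketch below shows one plausible shape of the text-recognition-plus-translation pipeline described above, assuming the pytesseract and googletrans packages; the file name, function name, and target language are illustrative, not details taken from the paper:

```python
# Hedged sketch: OCR with Tesseract, then translation with googletrans.
import pytesseract                   # Python wrapper around the Tesseract engine
from PIL import Image
from googletrans import Translator   # unofficial client for Google Translate

def recognize_and_translate(image_path: str, dest_lang: str = "fr") -> str:
    """Extract text from an image, then translate it into dest_lang."""
    # Step 1: OCR. Tesseract works best on clean, well-lit printed text.
    text = pytesseract.image_to_string(Image.open(image_path))
    # Step 2: Translation via the googletrans wrapper the paper mentions.
    result = Translator().translate(text, dest=dest_lang)
    return result.text

if __name__ == "__main__":
    print(recognize_and_translate("sign.jpg", dest_lang="fr"))
```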
Methodology:
OCR involves image preprocessing, feature extraction, and post-processing for accuracy.
YOLO divides images into grids to detect and classify objects efficiently in a single forward pass (see the detection sketch after this list).
The Google Translate API (via googletrans library) handles all language translation tasks.
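As a concrete illustration of the YOLO step, the following sketch runs YOLOv3 inference with OpenCV's DNN module; the weight/config file names, the 416x416 input size, and the thresholds follow the standard Darknet release and are assumptions rather than details reported in the paper:

```python
# Hedged sketch: YOLOv3 inference via OpenCV's DNN module.
import cv2
import numpy as np

# Standard Darknet release files (assumed, not specified by the paper).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_layers = net.getUnconnectedOutLayersNames()

image = cv2.imread("scene.jpg")
h, w = image.shape[:2]

# YOLO rescales the image and divides it into a grid; each cell predicts
# bounding boxes and class probabilities in one forward pass.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_layers)

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:  # keep reasonably confident predictions
            cx, cy, bw, bh = detection[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)

# Non-maximum suppression drops overlapping duplicates of the same object.
keep = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold=0.5, nms_threshold=0.4)
print(f"{len(keep)} objects kept after NMS")
```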
Results:
OCR performed well with clear, printed text but less so with noisy, skewed, or handwritten input.
YOLOv3 accurately detected various objects quickly in static images.
Translations maintained semantic meaning but sometimes struggled with idiomatic expressions.
The system responded swiftly (within seconds), suitable for real-time use.
Advantages:
User-friendly, all-in-one platform combining key functions without needing multiple apps.
Helps break language barriers and supports multilingual users.
Modular, lightweight, scalable, and deployable on cloud platforms.
Fast response times enable near-real-time applications.
Limitations:
OCR requires knowing the input language and struggles with handwriting and poor image quality.
Object detection is limited to trained categories.
Translation can produce literal, less nuanced results.
Summarization might omit complex details.
Requires internet access and lacks offline mode, personalization, or data storage.
Conclusion
The developed application can perform text recognition, object identification, and language translation into a chosen language with high accuracy. It could be extended to handle the translation of PDFs and other documents from one language to another. The integrated system was evaluated using a diverse set of inputs, including high-resolution printed documents, handwritten notes, street signs with multilingual content, and real-world scenes containing identifiable objects. The OCR component, powered by Tesseract, performed efficiently on clean and well-lit images of printed text, demonstrating a high degree of accuracy in extracting content. However, when presented with handwritten text or images with significant noise, its performance declined slightly, highlighting the importance of proper preprocessing techniques such as image thresholding and denoising; a minimal preprocessing sketch follows.
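A minimal sketch of that preprocessing, assuming OpenCV; the denoising strength and file names are illustrative values, not ones reported in the paper:

```python
# Hedged sketch: denoise + Otsu threshold before handing an image to Tesseract.
import cv2

def preprocess_for_ocr(image_path: str):
    """Return a binarized image that OCR engines typically read more reliably."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Non-local means denoising removes sensor/compression noise.
    denoised = cv2.fastNlMeansDenoising(gray, h=10)
    # Otsu's method chooses the binarization threshold from the histogram.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

cv2.imwrite("preprocessed.png", preprocess_for_ocr("noisy_scan.jpg"))
```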