Real-Time Sign-to-Text Translation System Using Multi-Modal Feature Fusion and Semantic Correction

Authors: Rathigha S, Pulikanti Rajith Teja, Kishore G, Dr. M. Ezhilarasan

DOI Link: https://doi.org/10.22214/ijraset.2026.79590

Abstract

Communication barriers persist for individuals with hearing and speech impairments because sign language, while expressive, is not widely understood outside the Deaf community. To address this, this paper presents a Real-Time Sign-to-Text Translation System. Traditional vision-based systems often struggle with continuous gesture recognition, lose accuracy under varying conditions, and lack semantic understanding. To overcome these limitations, we propose a three-module architecture. The first module uses MediaPipe Holistic to extract 3D spatial landmarks from the hands, face, and body pose. The second module employs a Long Short-Term Memory (LSTM) network to process the resulting temporal sequences, capturing dynamic motion patterns and stabilizing predictions with confidence gating. The final module combines a Transformer-based Natural Language Processing (NLP) model with deterministic fallback templates to perform semantic correction, converting raw gloss sequences into grammatically coherent English sentences. Experimental results on the LSA64 dataset demonstrate a validation accuracy of 97.2%, with the system running in real time on CPU hardware. The integrated web application delivers low-latency, end-to-end translation, making it a viable assistive technology for inclusive communication in education, healthcare, and public services.

Introduction

Traditional systems mainly recognize static hand gestures or isolated signs, and they often fail to capture continuous motion, facial expressions, and grammatical structure, leading to inaccurate or incomplete translations. To solve this, the proposed system introduces a multi-modal deep learning pipeline for more natural and real-time sign-to-text conversion.

The system uses MediaPipe Holistic to extract hand, face, and body landmarks, which are converted into structured feature vectors. These are passed into an LSTM-based model to capture temporal motion and classify sign sequences (glosses). Finally, a Transformer-based NLP module (T5-small) converts gloss outputs into grammatically correct English sentences, improving readability and communication quality.
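As a rough illustration of how per-frame landmarks could be fused into one structured feature vector, the sketch below concatenates pose, hand, and face landmarks and zero-fills any part that was not detected. The conclusion reports 411-dimensional vectors but does not give the breakdown. The layout used here (33 pose points with x, y, z, visibility; 21 points per hand with x, y, z; and a 51-point face subset with x, y, z) is an assumption chosen only because it sums to 411, and the names `flatten` and `fuse_frame` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]

# ASSUMED layout (not stated in the paper); it is picked so the total is 411.
POSE_POINTS, POSE_DIMS = 33, 4   # x, y, z, visibility per pose landmark
HAND_POINTS, HAND_DIMS = 21, 3   # x, y, z per hand landmark
FACE_POINTS, FACE_DIMS = 51, 3   # hypothetical subset of the face mesh
FEATURE_SIZE = (POSE_POINTS * POSE_DIMS
                + 2 * HAND_POINTS * HAND_DIMS
                + FACE_POINTS * FACE_DIMS)  # 132 + 126 + 153 = 411


def flatten(points: Optional[Sequence[Point]], count: int, dims: int) -> List[float]:
    """Flatten landmarks to count * dims floats; zeros if the part is missing."""
    if points is None:
        return [0.0] * (count * dims)
    if len(points) != count or any(len(p) != dims for p in points):
        raise ValueError(f"expected {count} landmarks with {dims} values each")
    return [float(v) for p in points for v in p]


def fuse_frame(pose: Optional[Sequence[Point]],
               left_hand: Optional[Sequence[Point]],
               right_hand: Optional[Sequence[Point]],
               face: Optional[Sequence[Point]]) -> List[float]:
    """Concatenate all parts in a fixed order so every frame has the same length."""
    return (flatten(pose, POSE_POINTS, POSE_DIMS)
            + flatten(left_hand, HAND_POINTS, HAND_DIMS)
            + flatten(right_hand, HAND_POINTS, HAND_DIMS)
            + flatten(face, FACE_POINTS, FACE_DIMS))


# Usage: a frame where only the pose was detected still yields a full vector.
example_pose = [(0.1, 0.2, 0.3, 0.9)] * POSE_POINTS
example_vector = fuse_frame(example_pose, None, None, None)
```

Zero-filling missing parts keeps the vector length constant, which a sequence model expects. Whether the authors pad, interpolate, or drop such frames is not stated.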

The solution is implemented as a Flask-based real-time web application that streams video through WebSockets and displays translated text as soon as it is available.
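On the server side, a streamed feed has to be cut into fixed-length sequences before classification, and the abstract notes that predictions are stabilized with confidence gating. The sketch below shows one way this could look: a sliding window of per-frame feature vectors, a confidence threshold, and a requirement that the same gloss win several consecutive windows before it is emitted. The class name `GatedSignStream`, the window length, the threshold, and the streak rule are all illustrative assumptions, and `classify` is a stand-in for the trained LSTM rather than the authors' model.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence, Tuple

Prediction = Tuple[str, float]  # (gloss label, confidence in [0, 1])


class GatedSignStream:
    """Buffers frames and emits a gloss only when the prediction is confident and stable."""

    def __init__(self,
                 classify: Callable[[List[List[float]]], Prediction],
                 window: int = 30,          # assumed sequence length
                 threshold: float = 0.8,    # assumed confidence gate
                 stable_count: int = 3):    # assumed consecutive agreements
        self.classify = classify
        self.frames: Deque[List[float]] = deque(maxlen=window)
        self.threshold = threshold
        self.stable_count = stable_count
        self._candidate: Optional[str] = None
        self._streak = 0
        self._last_emitted: Optional[str] = None

    def push(self, features: Sequence[float]) -> Optional[str]:
        """Add one frame; return a gloss if one has just become stable, else None."""
        self.frames.append(list(features))
        if len(self.frames) < self.frames.maxlen:
            return None  # not enough frames for a full sequence yet
        label, confidence = self.classify(list(self.frames))
        if confidence < self.threshold:
            # Gate: discard low-confidence predictions and restart the streak.
            self._candidate, self._streak = None, 0
            return None
        if label == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = label, 1
        if self._streak >= self.stable_count and label != self._last_emitted:
            self._last_emitted = label
            return label
        return None


# Usage with a stub classifier in place of the real model.
stream = GatedSignStream(lambda seq: ("hello", 0.95), window=2, stable_count=2)
emitted = [stream.push([0.0]) for _ in range(4)]
```

In a WebSocket handler, `push` would be called once per received frame and any non-None result sent back to the browser. This simple version never emits the same gloss twice in a row, so a deliberately repeated sign would need an additional reset rule.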

Experiments show 97.2% validation accuracy on LSA64, good generalization on MS-ASL, and an end-to-end latency of around 1.1–1.2 seconds, making the system suitable for real-time use.

Overall, the system improves upon previous methods by combining multi-modal feature extraction, temporal modeling, and semantic correction, enabling more accurate and natural sign language communication.
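The semantic correction step can be pictured as a small function that first tries a learned gloss-to-sentence model and falls back to deterministic templates when the model is unavailable or returns nothing. The sketch below is a minimal illustration under that assumption. The template table, the example glosses, and the function names are hypothetical, and `model` is any callable standing in for the T5-small module; the ordering of model and templates in the actual system is not described in this text.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical templates keyed by the exact cleaned gloss sequence.
TEMPLATES: Dict[Tuple[str, ...], str] = {
    ("THANK-YOU",): "Thank you.",
    ("ME", "HUNGRY"): "I am hungry.",
    ("YOU", "NAME", "WHAT"): "What is your name?",
}


def collapse_repeats(glosses: Sequence[str]) -> List[str]:
    """Drop consecutive duplicates, which a streaming recognizer may produce."""
    result: List[str] = []
    for gloss in glosses:
        if not result or result[-1] != gloss:
            result.append(gloss)
    return result


def glosses_to_sentence(glosses: Sequence[str],
                        model: Optional[Callable[[str], str]] = None) -> str:
    """Turn a raw gloss sequence into a sentence: model, then template, then plain text."""
    cleaned = collapse_repeats([g.strip().upper() for g in glosses if g.strip()])
    if not cleaned:
        return ""
    if model is not None:
        try:
            sentence = model(" ".join(cleaned)).strip()
            if sentence:
                return sentence
        except Exception:
            pass  # any model failure falls through to the deterministic path
    template = TEMPLATES.get(tuple(cleaned))
    if template:
        return template
    # Last resort: lower-case the glosses and format them as a sentence.
    text = " ".join(cleaned).lower()
    return text[0].upper() + text[1:] + "."


# Usage without a model: the template path handles a repeated gloss.
example_sentence = glosses_to_sentence(["me", "me", "hungry"])
```

Keeping the fallback deterministic means the interface always shows some text even if the language model fails, at the cost of less natural output for gloss sequences without a template.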

Conclusion

This project successfully develops a real-time sign-to-text translation platform that effectively bridges the communication gap for individuals relying on sign language. By representing frames as fused 411-dimensional feature vectors, capturing temporal dynamics via an LSTM network, and refining raw predictions through a transformer-based NLP layer, the system provides accurate, grammatically correct English sentences. Achieving a 97.2% validation accuracy on the LSA64 dataset and maintaining robust real-time performance, the system marks a significant step in transforming sign language recognition into a deployable, user-friendly assistive technology.

Copyright

Copyright © 2026 Rathigha S, Pulikanti Rajith Teja, Kishore G, Dr. M. Ezhilarasan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET79590

Publish Date : 2026-04-06

ISSN : 2321-9653

Publisher Name : IJRASET
