Sign-to-Sound Converter: A Low-Latency, Edge-Optimized American Sign Language to Speech Translation System Using 2D-CNN and Redis-Backed Audio Caching

Authors: Mohammad Mustaqeem Ali, MD Sameer, CH Dilip Kumar, Tejaswini R

DOI Link: https://doi.org/10.22214/ijraset.2026.79330

Abstract

Real-time sign language translation systems frequently suffer from high latency and bandwidth bottlenecks caused by continuous video frame transmission and computationally heavy backend processing. This paper introduces Sign Recognition Model, an end-to-end American Sign Language (ASL) to spoken audio translation system designed for low-latency communication. To minimize redundant data transfer, Sign Recognition Model employs an edge-optimized frontend built with React.js and MediaPipe.js, which tracks hand visibility and captures frames at 500 ms intervals only after continuous signing has been detected for over one second. These frames are transmitted via a REST API to a Python FastAPI backend. The system uses a two-dimensional convolutional neural network (2D-CNN) trained on the processed Word-Level American Sign Language (WLASL) dataset, extracting both manual (hand signs) and non-manual (facial expressions) features to decode sequences into text. To further reduce latency, the predicted text is looked up in a Redis cache; on a cache miss, the text is synthesized into natural-sounding audio using Coqui TTS or Suno AI’s Bark model, asynchronously cached, and streamed back to the client for automated playback. By distributing the computational load between client-side landmark detection and a cache-optimized backend, Sign Recognition Model provides a scalable, near real-time auditory communication bridge for the Deaf and Hard of Hearing community.

Introduction

This paper describes the development of Sign Recognition Model, a web-based American Sign Language (ASL) recognition system designed to improve communication between the Deaf and Hard of Hearing (DHH) community and non-signers. Traditional sign language recognition systems often face high computational costs, network latency from continuous video streaming, and unnatural audio generation.

To address these issues, the proposed system uses a client-server architecture that splits work between the browser and the backend. A React.js frontend integrated with MediaPipe.js performs edge-based hand detection and begins capturing frames only after a user’s hands have remained visible for one second. Frames are then transmitted every 500 milliseconds, which avoids sending and processing frames in which no signing occurs.
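
The gating rule described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the paper's frontend runs in React.js with MediaPipe.js, whereas this sketch uses Python for readability, and the names `CaptureGate`, `update`, and `captured_timestamps` are hypothetical. It also assumes the one-second threshold is inclusive and that losing sight of the hands resets the timer, neither of which the paper specifies.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

VISIBLE_HOLD_MS = 1000     # hands must stay visible this long before capture starts
CAPTURE_INTERVAL_MS = 500  # spacing between captured frames while the gate is open


@dataclass
class CaptureGate:
    """Decides, on each detector tick, whether the current frame should be sent."""
    visible_since_ms: Optional[int] = None
    last_capture_ms: Optional[int] = None

    def update(self, hands_visible: bool, now_ms: int) -> bool:
        if not hands_visible:
            # Assumption: losing the hands closes the gate and resets all timers.
            self.visible_since_ms = None
            self.last_capture_ms = None
            return False
        if self.visible_since_ms is None:
            self.visible_since_ms = now_ms
        if now_ms - self.visible_since_ms < VISIBLE_HOLD_MS:
            return False  # still inside the one-second hold period
        if (self.last_capture_ms is None
                or now_ms - self.last_capture_ms >= CAPTURE_INTERVAL_MS):
            self.last_capture_ms = now_ms
            return True
        return False


def captured_timestamps(ticks: Iterable[Tuple[int, bool]]) -> List[int]:
    """Replay (timestamp_ms, hands_visible) ticks and list the capture times."""
    gate = CaptureGate()
    return [t for t, visible in ticks if gate.update(visible, t)]
```

With detector ticks every 100 ms and hands continuously visible from time 0, the gate first opens at 1000 ms and then admits one frame every 500 ms, so most detector ticks never leave the client.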

On the backend, a Python FastAPI server processes the captured image sequences using a two-dimensional convolutional neural network (2D-CNN) trained on the WLASL dataset. The model analyzes both hand gestures and facial expressions to interpret the meaning and emotional context of signs. The translated text is then converted into natural-sounding speech using Text-to-Speech (TTS) models such as Coqui TTS and Bark. A Redis cache stores frequently used audio outputs, so repeated phrases skip synthesis and play back faster.
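
The cache lookup described above follows the common cache-aside pattern, which can be sketched as below. This is a minimal sketch, not the authors' implementation: a plain mapping stands in for Redis, `synthesize` is a hypothetical callable standing in for Coqui TTS or Bark, and the key scheme (a SHA-256 hash of the lowercased, whitespace-normalized text) is an assumption. The paper states that audio is cached asynchronously; the write here is synchronous for simplicity.

```python
import hashlib
from typing import Callable, MutableMapping, Tuple


def cache_key(text: str) -> str:
    """Build a stable key so 'Hello ' and 'hello' map to the same audio entry."""
    normalized = " ".join(text.lower().split())
    return "tts:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def get_audio(
    text: str,
    cache: MutableMapping[str, bytes],   # stand-in for a Redis client
    synthesize: Callable[[str], bytes],  # hypothetical TTS callable
) -> Tuple[bytes, bool]:
    """Return (audio_bytes, was_cache_hit) for the predicted text."""
    key = cache_key(text)
    cached = cache.get(key)
    if cached is not None:
        return cached, True   # hit: skip synthesis entirely
    audio = synthesize(text)  # miss: run the expensive TTS step
    cache[key] = audio        # store for later requests
    return audio, False
```

In this sketch only the first request for a phrase pays the synthesis cost; later requests for the same normalized text return the stored bytes directly.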

The literature review explains the evolution of Sign Language Recognition systems from sensor-based approaches to modern vision-based deep learning methods using CNNs, RNNs, LSTMs, and MediaPipe. However, existing systems often fail to provide efficient real-time communication and realistic speech generation.

The proposed architecture includes four major components: edge-assisted frame capture, distributed backend processing with RabbitMQ, deep learning inference, and Redis-based audio caching. Overall, the system aims to provide accurate, low-latency, and conversational ASL translation suitable for real-world deployment.
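
The Conclusion reports that worker timeouts keep audio delivery within a 5-second threshold. One way to enforce such a budget is sketched below. It is an assumption-laden illustration: it uses `asyncio.wait_for` in a single process rather than RabbitMQ workers, and `synthesize_with_deadline` and the `synthesize` coroutine are hypothetical names. What the real system does when the budget is exceeded is not described in the paper; this sketch simply returns `None`.

```python
import asyncio
from typing import Awaitable, Callable, Optional

SYNTHESIS_BUDGET_S = 5.0  # threshold reported in the Conclusion


async def synthesize_with_deadline(
    synthesize: Callable[[str], Awaitable[bytes]],  # hypothetical async TTS call
    text: str,
    budget_s: float = SYNTHESIS_BUDGET_S,
) -> Optional[bytes]:
    """Run synthesis, giving up if it exceeds the time budget."""
    try:
        return await asyncio.wait_for(synthesize(text), timeout=budget_s)
    except asyncio.TimeoutError:
        # Assumption: the caller decides how to degrade (e.g. show text only).
        return None
```

A caller can treat `None` as a signal to fall back to displaying the predicted text, so a slow synthesis job does not stall the conversation.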

Conclusion

Sign Recognition Model demonstrates an edge-assisted architecture for real-time American Sign Language (ASL) to spoken audio translation. The project addresses three core bottlenecks of traditional web-based sign language recognition: network latency from continuous video streaming, deep learning inference overhead, and the computational cost of Text-to-Speech (TTS) generation. Using MediaPipe as a client-side temporal gating mechanism reduced network payloads by ensuring that only frames captured during active signing were transmitted. On the backend, the transition to a dense-layer pruned ResNet (2D CNN) architecture, coupled with aggressive data augmentation, resolved initial overfitting and yielded an overall predictive accuracy of 97%. A Redis-backed caching layer and strict RabbitMQ worker timeouts kept audio delivery within a 5-second threshold, with cache hits resolving in near real time. These results indicate that dividing computational work between edge detection and distributed backend synthesis can bridge the communication gap for the Deaf and Hard of Hearing community without sacrificing conversational fluidity.

Copyright

Copyright © 2026 Mohammad Mustaqeem Ali, MD Sameer, CH Dilip Kumar, Tejaswini R. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET79330

Publish Date : 2026-04-03

ISSN : 2321-9653

Publisher Name : IJRASET
