Abstract
Sign language serves as a crucial communication bridge between individuals with hearing impairments and the general population; however, approximately 97% of non-signers are unable to understand sign-based communication, resulting in significant interaction barriers. This paper introduces a real-time American Sign Language (ASL) to speech conversion system based on machine learning and computer vision. The system employs a pre-trained Convolutional Neural Network (CNN) model for gesture classification, while the OpenCV and MediaPipe frameworks are used for hand detection, region-of-interest (ROI) extraction, and edge tracking to enhance recognition accuracy. Identified gestures are transformed into textual characters and combined to form meaningful sentences, which are then converted into speech using the pyttsx3 text-to-speech (TTS) engine. Experimental results demonstrate an accuracy range of 96% to 99% for static gesture recognition. The system performs effectively offline, making it suitable for deployment in low-connectivity environments. This work improves accessibility and inclusion for ASL users by bridging the communication gap between signers and non-signers. Future enhancements may include dynamic gesture recognition, expanded datasets for improved generalization, and support for multiple sign languages to facilitate broader real-world applications.
Introduction
Sign language is the primary mode of communication for deaf individuals, but a major communication gap persists because most people do not understand it. With over 7 million deaf people in India and only about 250 certified interpreters, access to sign language education and translation is limited. To address this challenge, the paper proposes a Sign Language to Voice (SL2V) translation system using American Sign Language (ASL) and modern machine learning techniques. Gesture recognition technology enables computers to interpret hand movements as commands, offering an effective communication solution when speech is not possible.
Literature Review
Recent research on sign language translation focuses on computer vision and deep learning methods such as CNNs, RNNs, LSTMs, SSD MobileNetV2, and multimodal frameworks. These models aim to improve accuracy, real-time performance, and robustness. Key observations include:
CNN- and MobileNet-based models achieve high accuracy (97–100%) for static gestures.
Many systems still struggle with dynamic gestures, environmental variations, and limited datasets.
Deep learning approaches increasingly integrate text-to-speech (TTS) and two-way translation (sign-to-voice and voice-to-sign).
Studies highlight the need for more scalable, real-time, multimodal, and multi-language recognition systems.
Proposed Methodology
The SL2V system converts sign gestures into text and speech through four stages (a code sketch of this pipeline follows the list):
Gesture capture using OpenCV and MediaPipe for real-time hand tracking.
Preprocessing (grayscale conversion, filtering, normalization) and extraction of 21 hand landmarks.
Gesture classification using a trained CNN model to identify ASL alphabets from spatial patterns.
Speech synthesis using the pyttsx3 TTS engine, forming words from detected letters and generating spoken output.
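A minimal Python sketch of stages 1–3 is shown below. It assumes a Keras CNN saved as asl_cnn.h5 that takes 64×64 grayscale hand ROIs and predicts 26 alphabet classes; the file name, input shape, padding margin, and label set are illustrative assumptions rather than details taken from this paper.

```python
# Sketch of stages 1-3: capture, hand landmark/ROI extraction, CNN letter prediction.
# "asl_cnn.h5", the 64x64 grayscale input, and the A-Z label set are assumptions.
import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # assumed 26 ASL alphabet classes
model = tf.keras.models.load_model("asl_cnn.h5")          # assumed pre-trained CNN

cap = cv2.VideoCapture(0)
with mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark  # 21 normalized hand landmarks
            h, w, _ = frame.shape
            xs = [int(p.x * w) for p in lm]
            ys = [int(p.y * h) for p in lm]
            # Padded bounding-box ROI around the hand, clipped to the frame.
            x1, x2 = max(min(xs) - 20, 0), min(max(xs) + 20, w)
            y1, y2 = max(min(ys) - 20, 0), min(max(ys) + 20, h)
            roi = frame[y1:y2, x1:x2]
            if roi.size:
                # Preprocessing: grayscale, resize, normalize to [0, 1].
                gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
                x = cv2.resize(gray, (64, 64)).astype("float32") / 255.0
                pred = model.predict(x.reshape(1, 64, 64, 1), verbose=0)
                letter = LABELS[int(np.argmax(pred))]
                cv2.putText(frame, letter, (10, 40),
                            cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
        cv2.imshow("SL2V", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```

In this sketch the 21 MediaPipe landmarks are used only to localize the ROI; a landmark-based feature vector could equally be fed to the classifier.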
Machine Learning Framework
CNNs are used for feature extraction, pattern recognition, pooling, and classification (a minimal model sketch follows this list).
A structured pipeline transforms video frames into filtered, normalized inputs for the CNN.
The system generates high-accuracy predictions and converts them to text and audio in real time.
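For concreteness, the following is a minimal Keras sketch of such a CNN, pairing convolutional feature extraction with max pooling and a softmax classification head. The layer sizes, 64×64 grayscale input, and 26-class output are assumptions for illustration, not the exact architecture used in this work.

```python
# Illustrative CNN: convolutional feature extraction, pooling, and classification.
# Layer sizes and the 64x64 grayscale / 26-class setup are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_asl_cnn(num_classes: int = 26) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(64, 64, 1)),
        layers.Conv2D(32, 3, activation="relu"),          # feature extraction
        layers.MaxPooling2D(),                            # pooling
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),  # classification
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_asl_cnn()
model.summary()
```

Categorical cross-entropy with a softmax output is a standard choice for multi-class alphabet classification.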
Implementation
The final system uses SSD MobileNetV2 for real-time, lightweight gesture detection and pyttsx3 for offline speech output. The dataset includes thousands of preprocessed images with augmentation to improve robustness.
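A minimal sketch of this offline speech stage is given below. The pyttsx3 calls (init, setProperty, say, runAndWait) are the library's standard offline API, while the frame-count debouncing heuristic and helper names (on_letter, on_word_end) are illustrative assumptions about how detected letters might be assembled into words.

```python
# Offline speech stage: buffer detected letters into a word, then speak it.
# The debouncing heuristic and helper names are illustrative assumptions.
import pyttsx3

engine = pyttsx3.init()           # works fully offline
engine.setProperty("rate", 150)   # slightly slower speech for clarity

def speak(text: str) -> None:
    engine.say(text)
    engine.runAndWait()

word = []
stable_letter, stable_count = None, 0

def on_letter(letter: str, frames_required: int = 15) -> None:
    """Append a letter once it has been predicted for enough consecutive frames."""
    global stable_letter, stable_count
    if letter == stable_letter:
        stable_count += 1
    else:
        stable_letter, stable_count = letter, 1
    if stable_count == frames_required:
        word.append(letter)

def on_word_end() -> None:
    """Called when no hand is detected for a while; speak the assembled word."""
    if word:
        speak("".join(word))
        word.clear()
```

Debouncing over several consecutive frames is one simple way to avoid appending a letter repeatedly while the signer holds a static gesture.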
Real-time performance is achieved, though challenges include:
Difficulty recognizing dynamic or fast hand movements
Sensitivity to lighting and background noise
Need for larger and more diverse gesture datasets
Future improvements include adaptive thresholding, advanced noise filtering, multi-angle gesture recognition, and transformer-based models for better sequential gesture understanding.
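As a brief illustration of the first two directions, the sketch below applies standard OpenCV median filtering and adaptive thresholding to a grayscale hand ROI; the parameter values (5×5 median kernel, block size 11, offset 2) are placeholders rather than tuned settings from this work.

```python
# Hedged sketch of adaptive thresholding plus simple noise filtering with OpenCV.
import cv2

def robust_preprocess(gray_roi):
    """Denoise and binarize a grayscale hand ROI under uneven lighting."""
    denoised = cv2.medianBlur(gray_roi, 5)              # suppress salt-and-pepper noise
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,                 # local Gaussian-weighted mean
        cv2.THRESH_BINARY, 11, 2)                       # block size 11, offset 2
    return binary
```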
Applications
The system can be deployed in education, healthcare, workplaces, and any environment requiring accessible communication for hearing-impaired individuals. With continued refinement, it can become a practical and scalable tool for inclusive human–computer interaction.
Conclusion
The Sign Language to Voice Translator takes signs as real-time input and produces both the corresponding text and its speech output. The system identifies American Sign Language alphabets and reads out the word after processing the input letter by letter. Most existing models translate sign language to text, and some translate text to speech; here we combine both stages into a single pipeline to yield better results. This is a significant step toward closing the gap between the deaf and hard-of-hearing community and the wider population and cultivating a more inclusive society.
Looking forward, future work will train the model more rigorously with deep learning so that it can identify and suggest new words independently. Planned improvements include extending the input time limit and increasing processing and execution speed. Carrying out short conversations smoothly, at the pace of natural signing, is the primary goal, alongside improving the dataset and incorporating languages other than English during real-time translation. Implementing the model's converse, i.e., converting audio input into sign language, is also one of the major future directions of this project.