Communication can be a major hurdle for individuals with hearing and speech impairments, particularly when interacting with people who do not understand sign language. Existing approaches, such as human interpreters or written messages, often fall short due to cost, inconvenience, or the lack of real-time interaction. To address this challenge, we developed an intelligent system that recognizes American Sign Language (ASL) fingerspelling gestures and translates them into both text and speech in real time. Using computer vision and a Convolutional Neural Network (CNN), the system processes hand gestures captured via webcam and identifies the corresponding alphabet characters. Hand landmark detection is carried out with the cvzone HandTrackingModule, and recognized letters are displayed in a graphical user interface built with Tkinter. The interface also offers word suggestions with the help of the Enchant dictionary and provides audio feedback through the pyttsx3 library. Additionally, gesture-based commands such as space, clear, and backspace make the system more interactive and user-friendly. This solution aims to support seamless, accessible communication, especially in educational and assistive settings.
Introduction
This project presents a real-time assistive system that translates American Sign Language (ASL) fingerspelling gestures into both written text and spoken words, enhancing communication for individuals with hearing or speech impairments and supporting sign language learners.
Key Features
Uses a webcam to capture hand gestures.
Employs a Convolutional Neural Network (CNN) trained to recognize ASL alphabet gestures.
Utilizes cvzone’s HandTrackingModule for precise hand landmark detection (a minimal capture-and-detection sketch follows this list).
Offers a user-friendly Tkinter interface displaying live video, detected letters, constructed words, and spelling suggestions (via Enchant dictionary).
Integrates text-to-speech (pyttsx3) for audio feedback.
Supports gesture-based commands like space, backspace, and confirmation.
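As a concrete illustration of the capture and detection features above, the sketch below reads webcam frames with OpenCV, localizes a single hand with cvzone's HandTrackingModule, and gives spoken feedback through pyttsx3. It is a minimal sketch, not the project's implementation: the crop padding, window title, quit key, and spoken startup message are illustrative assumptions, and the CNN classification step is deliberately omitted (it is sketched under System Architecture and Methodology).

```python
# Minimal capture-and-detection sketch (assumed setup, not the project's exact code).
import cv2
import pyttsx3
from cvzone.HandTrackingModule import HandDetector

detector = HandDetector(maxHands=1, detectionCon=0.8)  # track a single hand
engine = pyttsx3.init()                                # offline text-to-speech
engine.say("ASL translator ready")                     # example of audio feedback
engine.runAndWait()

cap = cv2.VideoCapture(0)                              # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hands, frame = detector.findHands(frame)           # draws 21 landmarks per hand
    if hands:
        x, y, w, h = hands[0]["bbox"]                  # bounding box around the hand
        # A padded crop like this would be handed to the CNN classifier.
        crop = frame[max(0, y - 20):y + h + 20, max(0, x - 20):x + w + 20]
    cv2.imshow("ASL fingerspelling", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):              # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```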
Background and Related Work
Advances in AI and computer vision have enabled more accurate real-time gesture recognition.
Important prior contributions include CNN models for dynamic gesture recognition [1], improved hand keypoint detection via multiview bootstrapping [2], and efficient hand-tracking frameworks such as Google’s MediaPipe [3].
Previous systems focused mostly on static gesture recognition and faced challenges such as degraded performance under varying capture conditions, limited gesture vocabularies, and high computational demands.
System Architecture
User Layer: Webcam captures real-time hand gestures.
Processing Layer: Frames are preprocessed (resized, normalized, background removed) for consistent quality.
Prediction Layer: CNN classifies the gestures; recognized characters are displayed and used for word construction.
Gesture commands are handled, and text-to-speech converts recognized text into spoken output for accessibility (a prediction-and-command sketch follows this overview).
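A rough sketch of the prediction layer and command handling is given below. It assumes a trained Keras CNN saved as asl_cnn.h5 with one output class per letter A–Z (see the training sketch under Methodology); the 64×64 input size, the command tokens (SPACE, BACKSPACE, CLEAR), and the helper names are hypothetical choices for illustration, not names from the project code.

```python
# Prediction-layer sketch: preprocessing, CNN classification, command handling,
# and Enchant word suggestions. File name, input size, and helpers are assumptions.
import string
import cv2
import enchant
import numpy as np
from tensorflow.keras.models import load_model

LABELS = list(string.ascii_uppercase)      # one class per ASL letter, A-Z
model = load_model("asl_cnn.h5")           # hypothetical trained CNN
dictionary = enchant.Dict("en_US")         # spelling suggestions for word building

def predict_letter(hand_crop_bgr):
    """Resize and normalize a hand crop, then return the most likely letter."""
    img = cv2.resize(hand_crop_bgr, (64, 64)).astype(np.float32) / 255.0
    probs = model.predict(img[np.newaxis, ...], verbose=0)[0]
    return LABELS[int(np.argmax(probs))]

def apply_command(sentence, token):
    """Apply a gesture command; any other token is appended as a letter."""
    if token == "SPACE":
        return sentence + " "
    if token == "BACKSPACE":
        return sentence[:-1]
    if token == "CLEAR":
        return ""
    return sentence + token

def suggest(current_word):
    """Enchant-based suggestions for the word currently being spelled."""
    return dictionary.suggest(current_word)[:4] if current_word else []
```

In the full system, the GUI would call predict_letter on each stable hand crop, route recognized commands through apply_command, show suggest results beside the constructed word, and pass completed text to the text-to-speech engine.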
Methodology
Real-time gesture capture via webcam.
Image preprocessing to enhance input quality.
Dataset preparation for CNN training.
CNN model training and optimization to classify ASL alphabets (a training sketch follows this list).
Real-time prediction with visual and audio output to support seamless communication.
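To make the dataset preparation and training steps concrete, here is a minimal TensorFlow/Keras sketch. It assumes images are arranged one folder per letter under dataset/train and dataset/val; the layer sizes, 64×64 input resolution, and epoch count are illustrative choices, not the project's reported configuration.

```python
# CNN training sketch for ASL alphabet classification (assumed dataset layout).
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE, NUM_CLASSES = 64, 26

def normalize(images, labels):
    # Match the runtime preprocessing: scale pixel values to [0, 1].
    return tf.cast(images, tf.float32) / 255.0, labels

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(IMG_SIZE, IMG_SIZE), batch_size=32).map(normalize)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/val", image_size=(IMG_SIZE, IMG_SIZE), batch_size=32).map(normalize)

model = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),   # one class per letter A-Z
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer folder labels
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=15)
model.save("asl_cnn.h5")                               # loaded by the prediction sketch
```

Because image_dataset_from_directory sorts class folders alphabetically, the integer labels line up with the A–Z label list assumed in the prediction sketch.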
Applications and Impact
Facilitates inclusive communication for people with hearing or speech impairments.
Acts as an educational tool for learning ASL fingerspelling.
Provides an accessible, hardware-light solution for real-world use.
Conclusion
This project presents an effective ASL recognition system that uses deep learning to convert hand gestures into both text and speech. Real-time gesture detection through a webcam and classification via a CNN model ensure accurate alphabet recognition. The system includes a user-friendly Tkinter-based GUI that displays predicted characters, forms sentences, and provides voice output. Word suggestions further enhance sentence construction and communication efficiency. The integration of visual and audio feedback makes it an accessible tool for individuals with hearing or speech impairments. Overall, the system offers a reliable and interactive platform for assistive communication.

The system can be extended to support dynamic gestures and full sign language sentences using advanced models such as RNNs or Transformers. Future versions may also include facial expression analysis to add emotional context to communication. Multilingual output and customizable gesture sets could broaden its usefulness across users. Improving the GUI with gesture previews and responsive controls can strengthen user interaction. Deployment as a mobile or web application with GPU acceleration would improve performance and accessibility. Incorporating evaluation tools such as real-time feedback and confusion matrices can further increase accuracy and reliability.
References
[1] Molchanov, P., Gupta, S., Kim, K., & Kautz, J. (2015). Hand gesture recognition with 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1–7). https://doi.org/10.1109/CVPRW.2015.7301347
[2] Simon, T., Joo, H., Matthews, I., & Sheikh, Y. (2017). Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1145–1153). https://doi.org/10.1109/CVPR.2017.125
[3] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., et al. (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172. https://arxiv.org/abs/1906.08172
[4] Kaur, A., & Singh, M. (2020). Deep learning approach for real-time recognition of American Sign Language alphabets. International Journal of Engineering and Advanced Technology (IJEAT), 9(3), 136–140. https://doi.org/10.35940/ijeat.C4918.029320