Advanced technologies, such as MediaPipe, PyTorch, YOLOv5, Nvidia's GPU-accelerated CNN libraries, and the CUDA toolkit, were used to develop the real-time sign language translator. These technologies were used to build an accurate, lightweight CNN model that achieved a success rate of 95.6%. By accelerating the CNN model's processing with Nvidia's libraries and the CUDA toolkit, sign language in digital video was translated in real time with minimal latency. The solution is packaged as a virtual camera that, through OBS software, can translate sign language into subtitles on any video conferencing platform, making it useful in real-world scenarios where fast, efficient communication between deaf or hard-of-hearing individuals and hearing individuals is needed. Overall, a real-time sign language translator can have a tremendous impact on communication and accessibility for deaf and hard-of-hearing people.
Introduction
Purpose & Motivation:
Communication is fundamental, yet language barriers—especially between hearing and deaf communities—persist. Over 70 million deaf or hard-of-hearing people rely on sign language, but interactions with non-signers are often limited. This project proposes a real-time sign language translation system to bridge this gap using YOLOv5, PyTorch, and Nvidia technologies.
System Overview:
Uses webcams to capture sign language gestures.
Applies YOLOv5 for object detection and PyTorch for model training (see the detection-loop sketch after this list).
Incorporates MediaPipe and OpenCV for real-time hand tracking (also sketched after this list).
OBS software routes the translated video (with subtitles) to platforms like Google Meet and Microsoft Teams.
The model leverages Nvidia CUDA & CNNs for fast, GPU-accelerated inference.
A Flask-based web interface makes the tool accessible online (a minimal streaming sketch follows this list).
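To make the data flow above concrete, the following is a minimal sketch of the capture-and-translate loop: OpenCV reads webcam frames, a custom-trained YOLOv5 model loaded through torch.hub predicts the sign (on the GPU when CUDA is available), the predicted label is drawn as a subtitle, and the frame is pushed to a virtual camera with pyvirtualcam so that a virtual-camera consumer such as Google Meet or Microsoft Teams can pick it up. The weights file sign_yolov5.pt is a hypothetical placeholder; this is an illustrative sketch, not the project's exact implementation.

```python
import cv2
import torch
import pyvirtualcam

# Load a custom-trained YOLOv5 model (the weights path is a placeholder).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='sign_yolov5.pt')
model.to('cuda' if torch.cuda.is_available() else 'cpu')

cap = cv2.VideoCapture(0)          # default webcam
width, height, fps = 640, 480, 30

with pyvirtualcam.Camera(width=width, height=height, fps=fps) as cam:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (width, height))
        # YOLOv5 expects RGB input.
        results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        detections = results.pandas().xyxy[0]
        if len(detections) > 0:
            # Use the highest-confidence detection as the recognised sign.
            label = detections.sort_values('confidence', ascending=False).iloc[0]['name']
            cv2.putText(frame, label, (20, height - 20),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
        # The virtual camera expects RGB frames; OBS or any conferencing app
        # can then consume this stream as an ordinary webcam.
        cam.send(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cam.sleep_until_next_frame()

cap.release()
```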
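Hand tracking with MediaPipe and OpenCV can be sketched as below: each frame is passed to the MediaPipe Hands solution, which returns up to two sets of 21 hand landmarks that are drawn back onto the frame. The confidence thresholds are illustrative defaults rather than values taken from the project.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for landmarks in results.multi_hand_landmarks:
                # Draw the 21 landmarks and their connections on the frame.
                mp_draw.draw_landmarks(frame, landmarks, mp_hands.HAND_CONNECTIONS)
        cv2.imshow('Hand tracking', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()
```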
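The Flask web interface could expose the annotated video as a simple MJPEG stream, roughly as follows. The /video route and the raw-webcam source are assumptions for illustration; in the full system the frames would come from the recognition pipeline with subtitles already drawn.

```python
from flask import Flask, Response
import cv2

app = Flask(__name__)
cap = cv2.VideoCapture(0)

def frames():
    # Yield JPEG-encoded frames in multipart format for live streaming.
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode('.jpg', frame)
        if not ok:
            continue
        yield (b'--frame\r\n'
               b'Content-Type: image/jpeg\r\n\r\n' + buf.tobytes() + b'\r\n')

@app.route('/video')
def video():
    # multipart/x-mixed-replace lets the browser display a live MJPEG stream.
    return Response(frames(),
                    mimetype='multipart/x-mixed-replace; boundary=frame')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

Saved as app.py and run with python app.py, the stream would then be viewable at http://localhost:5000/video.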
Literature Review Highlights:
Several recent works explored sign language recognition using AI:
Hybrid models (DenseNet201 + MediaPipe) show improved gesture recognition (a structural sketch follows this list).
Reported accuracy reached 95% for sign-to-text translation.
Limitations included tracking failures when hands were out of frame or occluded.
GPU usage (vs CPU) significantly improved model responsiveness (see the timing sketch after this list).
Future work aims to improve occlusion handling and expand the gesture dataset for broader coverage.
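One way to read the hybrid DenseNet201 + MediaPipe approach is as a classifier that concatenates pooled DenseNet image features with the flattened MediaPipe hand landmarks. The sketch below illustrates that structure; the class name, layer sizes, and 26-class output are assumptions for illustration, not details from the cited works.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class HybridSignClassifier(nn.Module):
    """Illustrative hybrid: DenseNet201 image features + MediaPipe landmarks."""

    def __init__(self, num_classes=26, num_landmark_features=63):
        super().__init__()
        backbone = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
        self.features = backbone.features      # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)    # global average pooling
        # DenseNet201 yields 1920-dimensional pooled features.
        self.classifier = nn.Sequential(
            nn.Linear(1920 + num_landmark_features, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, image, landmarks):
        x = self.pool(self.features(image)).flatten(1)  # (B, 1920)
        x = torch.cat([x, landmarks], dim=1)            # append 21 x,y,z landmarks
        return self.classifier(x)

# Forward pass with random data, just to show the expected shapes.
model = HybridSignClassifier()
image = torch.rand(1, 3, 224, 224)   # normalised RGB frame
landmarks = torch.rand(1, 63)        # flattened MediaPipe hand landmarks
print(model(image, landmarks).shape)  # torch.Size([1, 26])
```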
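The GPU-versus-CPU responsiveness observation can be checked with a small timing harness such as the one below. It uses the public yolov5s checkpoint as a stand-in for the project's custom model, and the run counts and input size are arbitrary choices for illustration.

```python
import time
import torch

# Public yolov5s checkpoint as a stand-in for the custom sign-language model.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
dummy = torch.rand(1, 3, 640, 640)   # one normalised 640x640 frame

def bench(device, runs=50):
    m = model.to(device)
    x = dummy.to(device)
    with torch.no_grad():
        for _ in range(5):                    # warm-up iterations
            m(x)
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
        if device == 'cuda':
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000   # ms per frame

print(f"CPU: {bench('cpu'):.1f} ms/frame")
if torch.cuda.is_available():
    print(f"GPU: {bench('cuda'):.1f} ms/frame")
```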
Conclusion
Hand gestures also hold immense potential for human-computer interaction. Vision-based hand gesture recognition methods offer a variety of benefits over older, device-based approaches. Yet recognizing hand gestures remains a difficult problem, and this work is only a small step toward the desired outcomes in sign language recognition. In this paper, a vision-based system was introduced that interprets American Sign Language hand gestures and converts them to speech or text and vice versa. The proposed solution was tested under real-time conditions and showed that the classification models detected all trained gestures in a user-independent manner, which is one of the main requirements for this kind of system.

Combined with machine learning algorithms, the selected hand features proved very effective and can be used in any real-time sign language recognition system. In future work, the system will be further improved and experiments will be carried out on complete language datasets. Finally, the proposed solution is a good starting point for developing any vision-based sign language recognition user interface. Because sign language grammar is flexible, the system can be adapted to teach new gestures in new languages.