Real-Time Indian Sign Language Recognition to Text and Speech Using Computer Vision and Deep Learning
Authors: Mr. Kunal Kanchankar, Ms. Utkarsha Gore, Mr. Utkarsh Nilatkar, Mr. Vaibhav Kawde, Mr. Vaibhav Chouragade, Mr. Vaibhav Yerpude, Mr. Vedant Pundkar
Sign language plays a crucial role in the lives of over 63 million people in India who live with hearing impairments. It is not just a method of communication; it is a vibrant, expressive language that allows them to connect with the world around them. However, the biggest challenge these individuals face is the absence of reliable, real-time translation systems designed specifically for Indian Sign Language (ISL). This gap leads to significant isolation in everyday situations, such as classrooms where lessons are delivered verbally, workplaces where meetings rely on spoken discussion, or simple social gatherings where conversations flow without interpretation. Without tools to bridge this divide, hearing-impaired people are often excluded, limiting their opportunities for education, employment, and social integration.
To tackle this pressing issue, this paper introduces an innovative, vision-based system for real-time recognition of ISL. This system transforms live hand gestures captured by an ordinary webcam into readable text and audible speech, all powered by free, open-source software. No expensive hardware or proprietary tools are required, making it accessible to a wide audience. At its core, the system employs MediaPipe, a lightweight library from Google, to detect and track the key landmarks on hands—think of these as the joints and fingertips that form the building blocks of every sign. Once these landmarks are identified, a sophisticated deep learning model combining Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks steps in to analyze and interpret sequences of gestures. This hybrid model excels at understanding not just isolated signs but also the fluid, continuous phrases that make up natural signing.
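To make the landmark-extraction step concrete, the sketch below shows one way a MediaPipe-based extractor could look in Python. It is a minimal sketch, not the paper's exact code: the confidence thresholds, the two-hand zero-padding, and the flattening into a 126-value feature vector are illustrative assumptions.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def extract_landmarks(frame_bgr, hands):
    """Return a flat feature vector of 3D hand landmarks for one webcam frame.

    Two hands x 21 landmarks x (x, y, z) = 126 values; missing hands stay zero.
    """
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    features = np.zeros(2 * 21 * 3, dtype=np.float32)
    if result.multi_hand_landmarks:
        for h, hand in enumerate(result.multi_hand_landmarks[:2]):
            coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
            features[h * 63:(h + 1) * 63] = np.array(coords).ravel()
    return features

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    with mp_hands.Hands(max_num_hands=2,
                        min_detection_confidence=0.5,
                        min_tracking_confidence=0.5) as hands:
        ok, frame = cap.read()
        if ok:
            print(extract_landmarks(frame, hands).shape)  # (126,)
    cap.release()
```

A per-frame vector like this is what the downstream sequence model consumes, one vector per video frame.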
For the speech output, the system integrates user-friendly text-to-speech (TTS) engines such as gTTS (Google Text-to-Speech) for online generation or pyttsx3 for fully offline operation. What sets this apart is its multilingual support: users can choose Hindi, English, or other regional languages, ensuring the spoken output feels natural and relevant to diverse Indian users. To train the model effectively, we built a custom dataset from scratch, collecting over 26,000 high-quality images. These cover the full ISL alphabet (26 letters) and 50 common words such as "hello," "thank you," "mother," "father," "hungry," and "doctor." Images were captured under a variety of real-world conditions: bright indoor lights, dim evening settings, cluttered backgrounds like busy kitchens or parks, and different camera angles to simulate handheld or fixed setups. This diversity ensures the system is not just a lab experiment but something that works in the chaos of daily life.
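As an illustration of the speech stage described above, here is a minimal sketch of the two TTS paths; the language codes, speaking rate, and output file name are assumptions chosen for demonstration.

```python
from gtts import gTTS      # online TTS (supports Hindi via lang="hi")
import pyttsx3             # offline TTS engine

def speak_online(text: str, lang: str = "hi", out_file: str = "output.mp3") -> str:
    """Generate speech with gTTS and save it to an MP3 file (needs internet)."""
    gTTS(text=text, lang=lang).save(out_file)
    return out_file

def speak_offline(text: str) -> None:
    """Speak immediately through the system voice with pyttsx3 (no internet)."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # a moderate speaking rate
    engine.say(text)
    engine.runAndWait()

# Example usage: voice a recognized phrase online in Hindi,
# or fall back to the offline engine when no connection is available.
# speak_online("नमस्ते", lang="hi")
# speak_offline("Hello, how can I help you?")
```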
Performance-wise, the system shines with an impressive 87.2% accuracy when recognizing individual alphabet signs and 79.5% for more complex, continuous phrases—think signing a full sentence like \"I am hungry, please help.\" On standard laptop hardware (something like an Intel i5 with 8GB RAM), it processes and responds in just 418 milliseconds on average, fast enough to keep conversations flowing without awkward pauses. We put it through rigorous environmental tests: in controlled lab spaces, accuracy was near-perfect; in echoey corridors with people walking by, it held steady; and outdoors under harsh sunlight or shade, there was only a modest 6.8% drop. This robustness comes from smart preprocessing techniques that adjust for lighting and noise.
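The paper does not spell out the preprocessing steps, but a typical lighting and noise correction of this kind can be sketched with OpenCV as follows (CLAHE contrast equalization plus light Gaussian denoising); the specific parameters are assumptions, not measured settings from the system.

```python
import cv2

def normalize_frame(frame_bgr):
    """Reduce lighting variation and sensor noise before landmark detection."""
    # Equalize brightness on the L channel only, so colours are preserved.
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    balanced = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    # A mild Gaussian blur suppresses high-frequency noise from cheap webcams.
    return cv2.GaussianBlur(balanced, (3, 3), 0)
```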
Introduction
Communication connects humans, but for India’s 63+ million hearing-impaired individuals, barriers exist due to widespread unfamiliarity with Indian Sign Language (ISL). ISL is a rich, two-handed visual language with regional variations, yet most of India’s hearing population cannot understand it, causing challenges in education, employment, healthcare, and social life. Human interpreters are scarce, especially outside cities.
Technology offers solutions: advances in computer vision and deep learning enable sign recognition. Existing global tools focus on ASL and fail with ISL due to grammar, two-handed signs, and cultural differences. To address this, the project proposes a real-time, offline ISL translator using just a webcam and open-source Python libraries. Key features include:
Hand tracking: MediaPipe detects 21 landmarks per hand in 3D.
Gesture recognition: A CNN-LSTM hybrid captures spatial and temporal patterns, distinguishing letters and phrases (a model sketch follows this list).
Output: Tkinter GUI shows text and confidence; text-to-speech (TTS) delivers voice in Hindi/English or regional accents.
Efficiency: Lightweight (<50MB), offline, works on low-cost devices, culturally localized, and validated on 1,200+ sequences with high accuracy (92% alphabets, 84% phrases).
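The sketch below shows one plausible Keras realization of such a CNN-LSTM hybrid operating on landmark sequences. The layer widths, the 30-frame window, and the class count (26 letters plus 50 words) are assumptions for illustration, not the exact published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN = 30      # assumed number of frames per gesture clip
N_FEATURES = 126  # 2 hands x 21 landmarks x (x, y, z)
N_CLASSES = 76    # 26 letters + 50 common words

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    # 1D convolutions learn local spatial patterns across neighbouring frames.
    layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),
    # LSTM layers model the temporal ordering of the gesture sequence.
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

In a design like this, the convolutional front end compresses each window of landmark vectors before the recurrent layers decide which letter or phrase the whole sequence represents.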
The methodology relies on diverse, crowdsourced datasets (26,000+ images), robust preprocessing, and augmentation. Latency is low (~418 ms end-to-end), and performance holds across PCs, Raspberry Pi, and mobile devices.
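The augmentation stage can be pictured with a small OpenCV/NumPy routine like the one below; the rotation range and brightness shift are assumed values intended to mimic the varied camera angles and lighting in the dataset, not the exact settings reported here.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    """Apply a random rotation and brightness shift to one training image."""
    h, w = image.shape[:2]
    # Random rotation of up to +/- 15 degrees around the image centre.
    angle = np.random.uniform(-15, 15)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    # Random brightness shift to mimic bright, dim, and outdoor lighting.
    shift = int(np.random.randint(-40, 41))
    return cv2.convertScaleAbs(rotated, alpha=1.0, beta=shift)
```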
Comparison with prior work:
Sensor gloves are accurate but expensive and restrictive.
Vision-based CNNs detect static signs but struggle with sequences.
MediaPipe + LSTM improves sequence recognition but ASL-focused models fail for ISL.
Existing TTS/cloud solutions face latency and internet dependency.
This system uniquely combines offline operation, high ISL sequence accuracy, real-world usability, and regional adaptability, providing a practical pipeline for inclusive AI, education, and communication for the deaf community in India.
Conclusion
In wrapping up, this project unveils a transformative real-time ISL translator: a webcam-powered bridge from silent gestures to spoken words, crafted with open-source ingenuity. By harmonizing computer vision (MediaPipe/OpenCV), deep learning (CNN-LSTM), and TTS (gTTS/pyttsx3), it delivers a plug-and-play pipeline that is accurate (87.2% on alphabets), swift (418 ms), and sensitive to India's linguistic mosaic.
Tested across 1,200+ sequences and 20 users, it proves not just feasible but well liked, earning a usability score of 81.4 and calls for everyday adoption. For a nation where 63 million voices go unheard, this tool amplifies them, fostering inclusion in schools (interactive lessons), clinics (clear consultations), and streets (barrier-free chats).
Looking ahead: expand to full dialects (e.g., 200+ words), mobile-first deployment (Android/iOS), and bidirectional translation, with speech-to-sign for hearing allies. Imagine apps where signers and speakers converse fluidly, dissolving divides. With community input and iterative tweaks, we're not just building software; we're constructing equity. India deserves communication without compromise: let's sign, speak, and connect.