MookVaani is an artificial-intelligence-based lip-reading system designed to assist people with speech impairments or those who have difficulty communicating verbally. Most existing speech-recognition tools rely entirely on audio input, which makes them unusable for non-verbal users and unreliable in noisy environments. To bridge this gap, MookVaani captures a person's lip movements through a webcam and converts them into text and spoken audio in real time. The system uses MediaPipe for lip-landmark detection, while deep-learning models such as CNNs and LSTMs recognize visual speech patterns. A Streamlit interface displays the live video, the predicted text, and the generated speech, making the tool easy for anyone to use. The project is motivated by the need for inclusive, accessible technology for people with speech disabilities or hearing impairments, and for anyone who needs silent communication in crowded or sensitive settings. MookVaani aims to provide silent-to-speech communication in an efficient, reliable, and user-friendly way. With further improvements such as multilingual support, mobile deployment, and emotion detection, the system could become considerably more useful in sectors such as education, healthcare, and accessibility.
Introduction
This report presents MookVaani, a visual-only lip-reading system designed to help non-verbal individuals communicate by converting silent lip movements into text and speech. Traditional speech-recognition technologies depend on audio input, which fails for users with speech impairments, in noisy environments, or when speaking is not possible. MookVaani addresses this gap by relying solely on video-based visual speech recognition (VSR), enabling accessible, affordable, and real-time communication using only a webcam.
Problem Context
Non-verbal individuals often struggle to communicate because most recognition systems rely on audio, and human lip reading is extremely difficult. Existing VSR research is limited by controlled environments, heavy equipment requirements, or lack of real-time capability. Thus, there is a need for a robust, low-cost, practical system that works on ordinary video input.
Objectives
The project aims to:
Build an end-to-end deep learning model using CNNs and LSTMs to recognize speech from silent video.
Ensure robustness across lighting changes, speakers, head movements, and accents.
Create a thorough preprocessing pipeline for lip-region extraction and data normalization.
Evaluate system performance using metrics such as word error rate (WER) and sentence accuracy (a metric sketch follows this list).
Explore applications in accessibility, education, healthcare, and quiet/noisy environments.
Document limitations, ethics, and future improvements.
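As a concrete illustration of the evaluation objective above, the following is a minimal, generic sketch of word error rate computed via word-level edit distance; it is standard metric code written for illustration, not code taken from the MookVaani project.

```python
# Minimal sketch: word error rate (WER) via Levenshtein edit distance
# over word tokens. Generic metric code, not taken from the project.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example GRID-style phrases (illustrative only)
print(wer("bin blue at f two", "bin blue at f two"))  # identical -> 0.0
print(wer("bin blue at f two", "bin blue at f too"))  # 1 substitution / 5 words -> 0.2
```

Sentence accuracy can be computed analogously as the fraction of predictions that match the reference transcript exactly.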
Scope
Included:
Development of a visual-only lip-reading deep learning model.
Use of CNNs for spatial lip-feature extraction and LSTMs/Transformers for temporal modeling.
Use of datasets like GRID and LRS.
Full preprocessing workflow (face detection, lip cropping, data augmentation); a preprocessing sketch follows this list.
Comprehensive performance evaluation and research-style documentation.
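As a sketch of the preprocessing workflow listed above, the snippet below crops the lip region from a webcam frame using MediaPipe Face Mesh and OpenCV. The landmark set, the crop margin, and the 96x96 output size are illustrative assumptions rather than the project's exact parameters.

```python
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
# FACEMESH_LIPS is a set of (start, end) landmark-index pairs around the mouth
LIP_IDX = sorted({i for pair in mp_face_mesh.FACEMESH_LIPS for i in pair})

def crop_lips(frame_bgr, face_mesh, margin=10):
    """Return a fixed-size lip-region crop from a BGR frame, or None if no face is found."""
    h, w = frame_bgr.shape[:2]
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    xs = [int(lm[i].x * w) for i in LIP_IDX]   # landmarks are normalized to [0, 1]
    ys = [int(lm[i].y * h) for i in LIP_IDX]
    x1, x2 = max(min(xs) - margin, 0), min(max(xs) + margin, w)
    y1, y2 = max(min(ys) - margin, 0), min(max(ys) + margin, h)
    crop = frame_bgr[y1:y2, x1:x2]
    # resize to a fixed size so frames can be stacked into a model input
    return cv2.resize(crop, (96, 96))

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    with mp_face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True) as fm:
        ok, frame = cap.read()
        if ok:
            lips = crop_lips(frame, fm)
            print(None if lips is None else lips.shape)  # e.g. (96, 96, 3)
    cap.release()
```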
Excluded (Future Work):
Real-time deployment with optimized latency.
Multilingual or continuous long-form lip reading.
Audio-visual combined systems.
Large-scale community testing, emotion detection, identity-invariance, and highly unconstrained environments.
Literature Review Overview
Modern lip-reading has moved from handcrafted features to powerful end-to-end deep learning models. Techniques include 3D-CNN + LSTM architectures, viseme-based representations, multimodal learning, attention mechanisms, and large-scale datasets. Despite progress, challenges persist: viseme ambiguity, speaker variability, environmental noise, dataset scarcity, and major accuracy gaps compared to audio-based speech recognition.
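To make the 3D-CNN + LSTM family discussed above concrete, here is a minimal PyTorch sketch in the spirit of LipNet-style models [2]. Layer sizes, pooling choices, and the 28-token vocabulary are illustrative assumptions, not a finalized design.

```python
import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Toy 3D-CNN + BiLSTM visual speech model (illustrative sizes only)."""

    def __init__(self, vocab_size=28, hidden=256):
        super().__init__()
        # 3D convolutions capture spatio-temporal lip motion
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time dimension, shrink space
        )
        # BiLSTM models the temporal evolution of the per-frame features
        self.lstm = nn.LSTM(64 * 4 * 4, hidden, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, clips):
        # clips: (batch, channels, time, height, width)
        feats = self.frontend(clips)                      # (B, C, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.lstm(feats)                         # (B, T, 2*hidden)
        return self.classifier(out)                       # per-frame logits

# Example: a batch of 2 clips, 16 frames each, 96x96 RGB lip crops
logits = LipReader()(torch.randn(2, 3, 16, 96, 96))
print(logits.shape)  # torch.Size([2, 16, 28])
```

For sentence-level training, such per-frame logits are typically paired with a CTC loss so the network can learn without frame-level alignments.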
Research Gaps
Key needs include better generalization across speakers, resolving phoneme ambiguity, leveraging multimodal learning, enhancing real-world robustness, and addressing privacy and fairness concerns.
In the proposed pipeline, the final stage is text-to-speech conversion, which renders the recognized text as spoken output; a minimal sketch of this stage follows.
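The sketch below uses the offline pyttsx3 library [9]; the example sentence is a placeholder rather than real model output.

```python
# Minimal sketch of the text-to-speech stage using pyttsx3.
import pyttsx3

def speak(text: str, rate: int = 150) -> None:
    """Speak the recognized text with the offline pyttsx3 engine."""
    engine = pyttsx3.init()           # picks a platform TTS backend
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()               # blocks until playback finishes

if __name__ == "__main__":
    speak("bin blue at f two now")    # placeholder GRID-style sentence
```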
Expected Outcomes
Better understanding of how VSR and deep learning enable lip-reading.
Increased awareness of the communication challenges faced by speech-impaired users.
A functional prototype capable of converting silent lip movements into text and speech (a minimal interface sketch follows this list).
Insights into improving prediction accuracy, generalization, system speed, and real-world stability.
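The following is a minimal sketch of how such a prototype interface could be wired together in Streamlit, assuming a hypothetical predict_text() wrapper around the trained recognizer; in practice the loop would buffer a short window of lip crops before each prediction, since the model operates on frame sequences rather than single frames.

```python
# Minimal sketch of a Streamlit front end for the prototype described above.
# predict_text() is a stand-in for the trained CNN+LSTM recognizer and is an
# assumption for illustration, not the project's actual inference code.
import cv2
import streamlit as st

def predict_text(frame) -> str:
    """Hypothetical stand-in for the visual speech recognition model."""
    return "hello world"  # a real model would decode lip movements here

st.title("MookVaani demo (sketch)")
frame_slot = st.empty()   # placeholder for the live webcam frame
text_slot = st.empty()    # placeholder for the predicted text

run = st.checkbox("Start webcam")
cap = cv2.VideoCapture(0)
while run:                # loops until the checkbox is unticked (script rerun)
    ok, frame = cap.read()
    if not ok:
        st.warning("Could not read from the webcam.")
        break
    frame_slot.image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    text_slot.markdown(f"**Predicted:** {predict_text(frame)}")
cap.release()
```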
Conclusion
This project outlines the steps involved in designing and developing a web-based system using modern, readily available technologies such as Replit. By following a structured workflow, from problem analysis through design and implementation to evaluation, the project aims to deliver a simple, practical, and scalable solution.
The proposed system is intended to solve the identified problem efficiently without compromising ease of use, accessibility, or adaptability to future enhancements. Organizing the work into clear phases, as is done in academic research, makes the development process more transparent and easier to manage.
The project will lay a foundation for an application that is stable, reliable, and easy to use. Future extensions, whether intelligent enhancements or performance improvements, can build on this foundation, possibly with the integration of more advanced technology, guided by user requirements and practical feedback.
References
[1] M. Wand, J. Koutník, and J. Schmidhuber, “Lip reading with long short-term memory,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1902–1913, Nov. 2015.
[2] Y. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet: End-to-End Sentence-Level Lipreading,” arXiv preprint arXiv:1611.01599, 2016.
[3] GRID Corpus. [Online]. Available: https://spandh.dcs.shef.ac.uk/gridcorpus/
[4] LRS2 Dataset. [Online]. Available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html
[5] OpenCV Library. [Online]. Available: https://opencv.org/
[6] MediaPipe. [Online]. Available: https://developers.google.com/mediapipe
[7] PyTorch. [Online]. Available: https://pytorch.org/
[8] TensorFlow. [Online]. Available: https://www.tensorflow.org/
[9] Pyttsx3 Python Text-to-Speech Library. [Online]. Available: https://pypi.org/project/pyttsx3/
[10] Replit. [Online]. Available: https://replit.com/
[11] Next.js Documentation. [Online]. Available: https://nextjs.org/docs
[12] J. Brownlee, Deep Learning for Time Series Forecasting: Predict the Future with MLPs, CNNs and LSTMs in Python, Machine Learning Mastery, 2018.
[13] N. Deshmukh, A. Ahire, S. H. Bhandari, A. Mali, and K. Warkari, "Vision based Lip Reading System using Deep Learning," in 2021 International Conference on Computing, Communication and Green Engineering (CCGE), Pune, India, 2021. [Online]. Available: https://www.researchgate.net/publication/362015222_Vision_based_Lip_Reading_System_using_Deep_Learning