Growing synergies among artificial intelligence, natural language processing, and assistive technology are creating unprecedented opportunities to bridge the information gap experienced by visually impaired individuals. This paper offers a structured survey of recent developments in AI-driven video content summarization and Braille translation, examining six key research works alongside the foundational models and frameworks that underpin them. The scope covers automatic speech recognition, multilingual video processing, text-to-Braille conversion, and hardware-based Braille output. Technologies reviewed include OpenAI Whisper for audio transcription, FLAN-T5 for abstractive summarization, MarianMT by Helsinki-NLP for multilingual translation, and embedded systems for tactile Braille rendering. Through thematic analysis and cross-system comparison, recurring design patterns, capability limitations, and underexplored opportunities are brought to light. Findings indicate that while the constituent technologies have individually reached a high level of maturity, a unified, real-time, multilingual pipeline capable of delivering end-to-end Braille-accessible video output represents a largely uncharted but high-impact research direction.
Introduction
Visual impairment remains a major barrier to digital accessibility worldwide, with billions of people experiencing some form of vision loss. Accessing video-based content is particularly challenging for visually impaired users because traditional solutions such as manual Braille transcription and pre-recorded audio descriptions are costly, time-consuming, and difficult to scale. Recent advances in Artificial Intelligence, including Automatic Speech Recognition (ASR), Large Language Models (LLMs), and Neural Machine Translation (NMT), have created opportunities for automated systems that can convert video content into accessible formats such as summarized text, translated content, and Braille output.
This survey reviews AI-based video summarization and Braille translation technologies, focusing on the models, architectures, and deployment strategies used in existing systems. Key technologies include OpenAI Whisper for multilingual speech recognition, FLAN-T5 for abstractive text summarization, and MarianMT for multilingual translation. Together, these technologies can form a complete pipeline that extracts audio from videos, transcribes speech, generates concise summaries, translates content into different languages, and converts the output into Braille.
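The pipeline described above can be sketched as a chain of five interchangeable stages. The Python sketch below is illustrative only: the class `VideoToBraillePipeline`, its field names, and the stub functions are hypothetical. The stubs stand in for the real components (an audio extractor, Whisper, FLAN-T5, MarianMT, and a Braille converter) so that the example runs with the standard library alone. The stage order follows the description in the text and is not taken from any surveyed implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    """Intermediate and final outputs of one pipeline run."""
    transcript: str
    summary: str
    translation: str
    braille: str


@dataclass
class VideoToBraillePipeline:
    """Hypothetical container that wires the five stages together."""
    extract_audio: Callable[[str], str]   # video path -> audio path
    transcribe: Callable[[str], str]      # audio path -> transcript (ASR stage)
    summarize: Callable[[str], str]       # transcript -> summary
    translate: Callable[[str, str], str]  # (text, target language) -> text
    to_braille: Callable[[str], str]      # text -> Braille string

    def run(self, video_path: str, target_lang: str) -> PipelineResult:
        audio_path = self.extract_audio(video_path)
        transcript = self.transcribe(audio_path)
        summary = self.summarize(transcript)
        translation = self.translate(summary, target_lang)
        braille = self.to_braille(translation)
        return PipelineResult(transcript, summary, translation, braille)


# Stub stages (assumptions for illustration; they call no real model).
def stub_extract_audio(video_path: str) -> str:
    return video_path.rsplit(".", 1)[0] + ".wav"


def stub_transcribe(audio_path: str) -> str:
    return "hello world. this is a test."


def stub_summarize(text: str) -> str:
    # Keep only the first sentence as a stand-in for a summary.
    return text.split(".")[0] + "."


def stub_translate(text: str, target_lang: str) -> str:
    # Tag the text with the target language instead of translating it.
    return f"[{target_lang}] {text}"


def stub_to_braille(text: str) -> str:
    return f"<braille:{text}>"


demo = VideoToBraillePipeline(
    extract_audio=stub_extract_audio,
    transcribe=stub_transcribe,
    summarize=stub_summarize,
    translate=stub_translate,
    to_braille=stub_to_braille,
)
result = demo.run("lecture.mp4", "hi")
print(result.summary)
print(result.braille)
```

Because each stage is passed in as a plain callable, a real model can replace a stub without changes to `run`, and the intermediate text outputs remain available alongside the Braille result.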
The paper discusses foundational concepts such as Braille systems, speech recognition, text summarization, machine translation, and software integration frameworks. Braille remains the primary tactile reading system for visually impaired individuals. On the AI side, Whisper provides multilingual speech transcription, FLAN-T5 generates readable summaries from lengthy transcripts, and MarianMT enables translation across numerous language pairs. Python and the Flask web framework are commonly used to integrate these models into accessible applications.
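In software, Braille text is often represented with the Unicode Braille Patterns block, the rendering approach this paper names for the proposed architecture. The following Python sketch shows a minimal uncontracted (Grade 1 style) mapping for the English letters, under stated assumptions: the function names are hypothetical, only a-z, spaces, and a dot-6 capital indicator are handled, and digits, punctuation, contractions, and non-English Braille codes are out of scope. Each cell is computed from its raised dot numbers, since dot n corresponds to bit n-1 above U+2800.

```python
# Raised dot numbers (1-6) for each lowercase English letter.
LETTER_DOTS = {
    "a": (1,), "b": (1, 2), "c": (1, 4), "d": (1, 4, 5), "e": (1, 5),
    "f": (1, 2, 4), "g": (1, 2, 4, 5), "h": (1, 2, 5), "i": (2, 4),
    "j": (2, 4, 5), "k": (1, 3), "l": (1, 2, 3), "m": (1, 3, 4),
    "n": (1, 3, 4, 5), "o": (1, 3, 5), "p": (1, 2, 3, 4),
    "q": (1, 2, 3, 4, 5), "r": (1, 2, 3, 5), "s": (2, 3, 4),
    "t": (2, 3, 4, 5), "u": (1, 3, 6), "v": (1, 2, 3, 6),
    "w": (2, 4, 5, 6), "x": (1, 3, 4, 6), "y": (1, 3, 4, 5, 6),
    "z": (1, 3, 5, 6),
}

BRAILLE_BASE = 0x2800  # blank cell in the Unicode Braille Patterns block


def dots_to_cell(dots):
    """Convert a tuple of dot numbers into one Unicode Braille character."""
    return chr(BRAILLE_BASE + sum(1 << (d - 1) for d in dots))


CAPITAL_SIGN = dots_to_cell((6,))  # dot 6 marks the next letter as a capital
BLANK_CELL = chr(BRAILLE_BASE)     # used here for spaces between words


def text_to_braille(text):
    """Map letters and spaces to Braille cells; pass other characters through."""
    cells = []
    for ch in text:
        lower = ch.lower()
        if ch == " ":
            cells.append(BLANK_CELL)
        elif lower in LETTER_DOTS:
            if ch.isupper():
                cells.append(CAPITAL_SIGN)
            cells.append(dots_to_cell(LETTER_DOTS[lower]))
        else:
            cells.append(ch)  # unsupported character, left unchanged
    return "".join(cells)


print(text_to_braille("abc"))
```

With this mapping, `text_to_braille("abc")` yields the three cells U+2801, U+2803, and U+2809. A refreshable Braille display or embosser would need a further hardware-specific step that this sketch does not cover.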
The literature review examines six related systems: a video-to-text and Braille conversion system [1], a multilingual video translation tool [2], an automated subtitling platform [3], a speech-to-Braille device [4], embedded text-to-Braille hardware [5], and a localized Braille application for Indian languages [6]. While these systems successfully address individual components such as transcription, translation, or Braille rendering, none provides a complete real-time solution that combines all stages into a single end-to-end accessibility pipeline.
A comparative analysis reveals significant limitations in current approaches. Some systems support Braille output but lack multilingual capabilities and AI-driven summarization. Others offer real-time translation and transcription but do not generate Braille. As a result, none of the reviewed systems simultaneously provides video transcription, summarization, multilingual translation, and Braille rendering in real time. The proposed architecture aims to fill this gap by integrating Whisper, FLAN-T5, and MarianMT within a Flask-based framework to produce text, audio, and Braille outputs from video content.
The study identifies several research gaps, including the absence of end-to-end real-time video-to-Braille systems, limited multilingual Braille support, insufficient multimodal video summarization techniques, lack of standardized evaluation benchmarks, inadequate user involvement in system design, and challenges in handling noisy real-world video environments.
Future research should focus on leveraging multimodal large language models that process both video and audio, enabling richer contextual understanding and summarization. Additional priorities include on-device AI inference for offline accessibility, adaptive selection between different Braille grades, development of cross-lingual Braille standards, participatory design involving visually impaired users, and creation of standardized datasets and evaluation metrics. Overall, the survey concludes that integrating advanced AI technologies into a unified video-to-Braille pipeline has significant potential to improve digital accessibility and information access for visually impaired individuals worldwide.
Conclusion
This paper has conducted a thorough survey of AI-driven approaches to video content summarization and Braille translation aimed at supporting visually impaired users.
Examination of six primary research works, together with the underlying AI models, software frameworks, and accessibility technologies they employ (spanning ASR, multilingual video processing, abstractive summarization, and physical Braille output), has yielded a coherent picture of the field’s current state. Component-level capabilities have advanced considerably: Whisper delivers dependable multilingual speech transcription, FLAN-T5 produces readable abstractive summaries, and MarianMT handles neural translation across a broad spectrum of language pairs.
Despite this progress, a meaningful gap persists at the system level: no reviewed work combines all pipeline stages (transcription, summarization, translation, and Braille rendering) into a cohesive, real-time, multilingual solution. The proposed architecture tackles this limitation directly, unifying Whisper-based transcription, FLAN-T5 summarization, MarianMT translation, and Unicode Braille rendering through a Flask web application. Looking ahead, key opportunities include adoption of multimodal large language models, edge deployment for offline scenarios, intelligent Braille grade selection, and co-design with visually impaired communities. As AI, multimodal learning, and assistive technology continue to evolve together, the prospect of making the world’s vast video content fully accessible to all users draws closer to reality.
References
[1] B. Sridhar, G. Saivishnu, V. ManiShanker, D. D. Lakshmi, and S. Hariharan, "Summarization of Video into Text and Text to Braille Script," in Proc. IEEE Int. Conf. Knowledge Engineering and Communication Systems (ICKECS), 2024, pp. 1-6.
[2] D. Rakesh, D. P. R. Sandesh, and V. Nirmalrani, "LinguaFusion: AI-Powered Multilingual Video Voice Translator," in Proc. IEEE 7th Int. Conf. Intelligent Sustainable Systems (ICISS), 2025, pp. 1-5.
[3] A. Prabhakar, U. Agarwal, and S. Bhardwaj, "Automated Video Subtitling and Translation Using Whisper and Helsinki Models," in Proc. IEEE Int. Conf. Engineering, Technology & Management (ICETM), 2025, pp. 1-6.
[4] S. Ponnuru, M. Chandana, M. Ravikumar, K. Rakesh, and K. S. Devatha, "Real-Time Speech-to-Braille Conversion Tablet for the Visually Impaired," in Proc. IEEE Int. Conf. Next Generation Communication & Information Processing (INCIP), 2025, pp. 1-4.
[5] M. Kavitha, V. Meenakshi, M. Pushpavalli, S. Amudha, S. Bharathi, and P. Pavithra, "Communication Device for Converting Text to Braille," in Proc. IEEE Int. Conf. Inventive Computation Technologies (ICICT), 2023, pp. 1-5.
[6] M. Parida, S. Mishra, and R. Nayak, "Enhancing Braille Accessibility: Android App for Indian Braille," in Proc. IEEE Int. Conf. Advanced Learning Technologies (ICALT), 2023, pp. 1-4.
[7] OpenAI, "Whisper: Automatic Speech Recognition Model," OpenAI GitHub Repository, 2024. [Online]. Available: https://github.com/openai/whisper
[8] Google Research, "FLAN-T5: Instruction Fine-Tuned Language Model," Hugging Face Model Hub, 2024. [Online]. Available: https://huggingface.co/google/flan-t5-base
[9] Hugging Face, "MarianMT: Multilingual Translation Models," Helsinki-NLP Model Hub, 2024. [Online]. Available: https://huggingface.co/Helsinki-NLP
[10] Python Software Foundation, "Python 3.10 Documentation," 2024. [Online]. Available: https://docs.python.org/3.10/
[11] Pallets Projects, "Flask: Lightweight Web Framework for Python," Flask Documentation, 2024. [Online]. Available: https://flask.palletsprojects.com/