Video is a major medium for communication, education, and entertainment in the digital age. However, language barriers, hearing impairments, and a lack of accessibility tools limit who can reach and understand this content. This paper presents AudibleSense, a web-based system that addresses some of these challenges by automatically generating subtitles, multilingual translations, and synthesized speech from video content. The system extracts the audio track from uploaded videos in MP4, AVI, MOV, or MKV format and transcribes it with the Whisper speech recognition model. The transcriptions are written in the WebVTT subtitle format so that they can be played in current video players. To serve international viewers, the subtitles are translated into user-selected languages through the Google Translate API. In addition, the subtitles are converted into natural-sounding speech using Google Text-to-Speech (gTTS), which allows voiceovers to be added in multiple languages. The system provides an end-to-end, automated solution that helps video content creators, educational platforms, and broadcasters extend the reach of their content and improve inclusivity. By combining speech recognition, machine translation, and speech synthesis, AudibleSense improves access to communication across linguistic barriers and makes multimedia content available to a wider audience. Its scalability and easy-to-use interface make it suitable for applications ranging from international broadcasting to everyday digital media, making content adaptation and localisation more readily available.
Introduction
This paper describes the development of AudibleSense, an AI-based system designed to improve video accessibility by overcoming language, hearing, and format barriers in multimedia content.
The main problem is that existing video accessibility solutions (transcription, translation, and speech synthesis tools) are fragmented, expensive, and not real-time, making it difficult to provide seamless multilingual and accessible content. Many systems also fail to support emotional expression in speech and require multiple separate tools.
To solve this, AudibleSense integrates three key technologies into a single automated pipeline:
Automatic Speech Recognition (ASR) to generate subtitles from audio
Machine Translation (MT) to convert subtitles into multiple languages
Text-to-Speech (TTS) to generate natural and emotionally expressive speech
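The three stages listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: it assumes the `openai-whisper` and `gTTS` packages, the function names are hypothetical, and the translation step is passed in as a callable (`translate`) because the exact client used for the Google Translate API is not specified. gTTS exposes language and speed options only, so emotional expressiveness is not shown here.

```python
from typing import Any, Callable, Dict, List

# A segment is assumed to carry "start" and "end" (seconds) and "text",
# which matches the entries Whisper returns under result["segments"].
Segment = Dict[str, Any]


def transcribe(audio_path: str, model_name: str = "base") -> List[Segment]:
    """ASR stage: return timed segments using the openai-whisper package."""
    import whisper  # imported lazily so the helpers below work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["segments"]


def format_timestamp(seconds: float) -> str:
    """Render seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def segments_to_webvtt(segments: List[Segment]) -> str:
    """Serialize timed segments as a WebVTT document, one cue per segment."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


def translate_segments(
    segments: List[Segment], translate: Callable[[str], str]
) -> List[Segment]:
    """MT stage: apply a caller-supplied translation function to each cue,
    keeping the original timings and leaving the input list unchanged."""
    return [{**seg, "text": translate(seg["text"])} for seg in segments]


def synthesize(text: str, lang: str, out_path: str) -> None:
    """TTS stage: write an MP3 voiceover using gTTS."""
    from gtts import gTTS  # imported lazily, as above

    gTTS(text=text, lang=lang).save(out_path)
```

Passing the translator in as a function keeps the subtitle timing logic independent of any particular translation service, so the same cues can be re-rendered for each target language.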
The system is designed to be user-friendly, scalable, and compatible with multiple video formats, making it useful for educators, content creators, broadcasters, and accessibility services.
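As an illustration of multi-format support, the sketch below checks an uploaded file against the four container formats named in the abstract (MP4, AVI, MOV, MKV) and builds an audio-extraction command. It assumes the `ffmpeg` command-line tool is used for extraction, which the paper does not state; the function names are hypothetical, and the 16 kHz mono output is chosen because that is the sample rate Whisper operates on.

```python
from pathlib import Path
from typing import List

# Container formats named in the abstract.
SUPPORTED_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def is_supported_video(filename: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def build_extract_command(video_path: str, audio_path: str) -> List[str]:
    """Build an ffmpeg command that writes the audio track as 16 kHz mono."""
    if not is_supported_video(video_path):
        raise ValueError(f"Unsupported video format: {video_path}")
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", video_path,
        "-vn",           # drop the video stream
        "-ac", "1",      # mix down to one channel
        "-ar", "16000",  # resample to 16 kHz
        audio_path,
    ]
```

The command list can be executed with `subprocess.run(build_extract_command("lecture.mp4", "lecture.wav"), check=True)` on a machine where ffmpeg is installed.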
The literature review highlights advancements in ASR (like Whisper), neural machine translation, and emotion-aware TTS, but also points out remaining challenges such as real-time processing, emotional consistency across languages, and lack of fully integrated end-to-end systems.
Conclusion
In conclusion, AudibleSense integrates speech recognition, multilingual translation, and speech synthesis into a single automated system that makes video content more accessible and inclusive. The system demonstrated high accuracy in transcription, multilingual subtitle translation, and speech synthesis, making it a useful tool for content creators, educators, and broadcasters. It addresses the major challenges of language barriers and accessibility by offering a scalable and efficient way to turn video content into a globally accessible format. Although limitations remain, including real-time processing optimization and the emotional quality of the synthesized speech, the system is a good candidate for further development to enhance multimedia accessibility. With its intuitive interface and unobtrusive workflow, AudibleSense has the potential to change how video content is consumed across linguistic and accessibility divides, offering a complete solution to diverse audiences worldwide.