The growth of video content on platforms such as YouTube has created an urgent demand for tools that process and summarize long videos. This paper introduces the YouTube Video Summarizer, a web application that uses AI to automatically transcribe, summarize, and translate YouTube video content. It is built with Streamlit as the frontend and relies on a multi-model AI pipeline: OpenAI Whisper for speech-to-text transcription, a DistilBART model for abstractive text summarization, and Facebook's NLLB-200 for multilingual translation. The app takes a YouTube URL, downloads the audio, and produces a summary that can optionally be translated into various languages, all within a single user-friendly interface. We evaluate the system's functional accuracy, its processing latency, and the quality of its output. Findings from a user study show that the time needed to extract important information from a video is significantly reduced and that the generated summaries are coherent and relevant. The system demonstrates the usefulness of an integrated AI pipeline for automating content digestion, making video information more accessible and actionable. We discuss the system architecture, the issues faced during implementation, and possible improvements toward scalability and multimodal processing.
Introduction
The rapid growth of online video platforms like YouTube has created vast educational and informational resources, but consuming long videos remains time-consuming and inefficient. Students, researchers, and professionals often struggle to extract essential information quickly. This gap motivates the need for automated tools that convert lengthy audio-visual content into brief, readable summaries.
Advances in artificial intelligence—particularly speech recognition, text summarization, and machine translation—enable automation of this process. Technologies such as OpenAI Whisper (speech-to-text), DistilBART (abstractive summarization), and NLLB-200 (multilingual translation) provide the foundation for a unified solution.
The paper introduces the YouTube Video Summarizer, a web-based application built with Streamlit that generates transcripts, summaries, and multilingual translations from a YouTube URL. The system integrates four sequential modules:
Audio extraction using yt-dlp and MoviePy,
Transcription using Whisper,
Chunk-based abstractive summarization using DistilBART, and
Translation into major languages using NLLB-200.
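The four modules above can be pictured as a chain of functions in which each stage consumes the output of the previous one. The following is only a minimal sketch of that idea, not the paper's implementation: the names run_pipeline and SummaryResult are hypothetical, and the fake_* stages are stand-ins for yt-dlp/MoviePy, Whisper, DistilBART, and NLLB-200 so that the example runs without downloading any model. The language code fra_Latn follows the FLORES-200 style of code that NLLB-200 uses for French.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SummaryResult:
    """Outputs of one pipeline run (hypothetical container)."""
    transcript: str
    summary: str
    translation: Optional[str] = None


def run_pipeline(
    url: str,
    extract_audio: Callable[[str], str],
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
    translate: Optional[Callable[[str, str], str]] = None,
    target_lang: Optional[str] = None,
) -> SummaryResult:
    """Run the stages in order; translation is skipped unless requested."""
    audio_path = extract_audio(url)        # stage 1: URL -> local audio file
    transcript = transcribe(audio_path)    # stage 2: audio -> text
    summary = summarize(transcript)        # stage 3: text -> summary
    translation = None
    if translate is not None and target_lang is not None:
        translation = translate(summary, target_lang)  # stage 4 (optional)
    return SummaryResult(transcript, summary, translation)


# Stand-in stages (assumptions for illustration only).
def fake_extract(url: str) -> str:
    return "audio.wav"


def fake_transcribe(path: str) -> str:
    return "first sentence. second sentence."


def fake_summarize(text: str) -> str:
    # Keeps only the first sentence of the transcript.
    return text.split(".")[0] + "."


def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


result = run_pipeline(
    "https://www.youtube.com/watch?v=VIDEO_ID",
    fake_extract, fake_transcribe, fake_summarize,
    fake_translate, "fra_Latn",
)
print(result.summary)      # first sentence.
print(result.translation)  # [fra_Latn] first sentence.
```

Passing the stages in as callables keeps each module replaceable, which is one plausible way to obtain the modularity the paper describes.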
A literature review highlights the rise of video streaming, AI adoption in software development, and the effectiveness of transformer-based models in speech and text tasks.
The problem statement identifies three key challenges: information overload, fragmented tools requiring manual coordination, and language barriers preventing global access to content. Existing systems are not unified, leading to inefficiencies and user burden.
The proposed system architecture offers a complete, end-to-end solution that automates the entire workflow—from video URL to translated summary—through a modular, scalable design.
In the methodology, the authors detail the requirement analysis, system design, model integration, and UI development. Chunking strategies are used to manage model token limits, and Streamlit provides a smooth interactive interface.
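The paper does not spell out its chunking strategy, so the following is only one possible sketch. It assumes that chunks are built from whole sentences and that a word count is an acceptable proxy for the model's token limit; the function name chunk_text and the max_words parameter are hypothetical. Each chunk would then be summarized separately and the partial summaries joined.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    Word count is only a rough stand-in for the tokenizer's token count.
    A single sentence longer than max_words is kept as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue  # skip empty fragments (for example, empty input)
        if current and count + n > max_words:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight nine."
print(chunk_text(demo, max_words=6))
# ['One two three. Four five six.', 'Seven eight nine.']
```

Splitting on sentence boundaries rather than at a fixed offset avoids cutting a sentence in half, but it does not carry context across chunks, which is consistent with the context-loss limitation the study reports.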
Testing with various YouTube videos shows strong performance:
Whisper achieved high transcription accuracy (≈3% WER).
DistilBART produced coherent summaries, reaching 97% accuracy in a hybrid evaluation that combined human analysis with ChatGPT-assisted analysis.
The system processed a 10-minute video in ~90 seconds on average.
Translation outputs were fluent and contextually accurate.
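For readers unfamiliar with the transcription metric, word error rate (WER) is conventionally defined as (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference transcript. A WER of about 3% therefore corresponds to roughly three word errors per hundred reference words. The sketch below computes WER with a word-level edit distance; it is illustrative only, the paper does not state which tool or text normalization it used, and the lowercasing step is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()   # lowercasing is an assumed normalization
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the quick brown fox jumps",
    "the quick brown dog jumps",
)
print(wer)  # 0.2 (one substitution over five reference words)
```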
The study notes limitations such as computational dependency, occasional context loss in chunking, and challenges with highly technical or multi-speaker videos. Nonetheless, the system proves to be accurate, reliable, and user-friendly, offering a powerful AI-driven solution for efficient video content digestion.
Conclusion
This study reports the design, implementation, and evaluation of the YouTube Video Summarizer, an end-to-end AI-powered application for digesting video content. It shows that state-of-the-art AI models, namely OpenAI's Whisper (transcription), DistilBART (summarization), and NLLB-200 (translation), can be combined into a unified, easy-to-use pipeline. Given only a YouTube URL, the application automatically produces an accurate transcript and a coherent, concise summary, and it can also translate the result into multiple languages, addressing the critical problems of information overload and language accessibility.
The empirical results indicate that the system is high-performing and reliable. Evaluated with a hybrid methodology that combines human analysis with machine analysis assisted by ChatGPT, the summarization module achieved an accuracy of 97 percent, showing that it can capture and condense the essential information in long video content.
Moreover, the application showed strong functional performance with acceptable processing latency, and its Streamlit-based interface was well received by users for its simplicity and functionality. The modular design supports scalability and maintainability and provides a good basis for future enhancements.
Despite these merits, the project also revealed limitations, most of them related to the computational requirements of the AI models and to occasional fragmentation of context caused by the text-chunking strategy. These shortcomings do not diminish the system's core achievement, but they do point to clear directions for future work: fine-tuning the models on domain-specific corpora to handle technical material, developing more advanced methods for preserving context between chunks, and investigating real-time processing. In summary, the YouTube Video Summarizer illustrates the strength of combining modern AI components and provides an effective, scalable, and free solution that could change how users consume the vast amount of online video resources.
References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
[2] A. Radford et al., “Robust Speech Recognition via Large-Scale Weak Supervision,” in Proc. Int. Conf. on Machine Learning (ICML), 2023, pp. 28492–28504.
[3] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 7871–7880.
[4] S. Shleifer and A. M. Rush, “Pre-trained Summarization Distillation,” arXiv preprint arXiv:2010.13002, 2020.
[5] NLLB Team, “No Language Left Behind: Scaling Human-Centered Machine Translation,” arXiv preprint arXiv:2207.04672, 2022.
[6] F. Chollet, Deep Learning with Python, 2nd ed. Shelter Island, NY: Manning Publications, 2021.
[7] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. Sebastopol, CA: O’Reilly Media, 2022.
[8] D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2023.
[9] yt-dlp Contributors, “yt-dlp: A youtube-dl fork with additional features and fixes,” GitHub repository, 2023. [Online]. Available: https://github.com/yt-dlp/yt-dlp
[10] Streamlit Inc., “Streamlit: The Fastest Way to Build and Share Data Apps,” 2023. [Online]. Available: https://docs.streamlit.io/
[11] Zulko, “MoviePy: Video Editing with Python,” GitHub repository, 2022. [Online]. Available: https://github.com/Zulko/moviepy
[12] Hugging Face, “DistilBART-CNN-12-6 Model,” Hugging Face Model Hub, 2021. [Online]. Available: https://huggingface.co/sshleifer/distilbart-cnn-12-6
[13] Hugging Face, “Whisper Model,” Hugging Face Model Hub, 2023. [Online]. Available: https://huggingface.co/openai/whisper-base
[14] Hugging Face, “NLLB-200 Model,” Hugging Face Model Hub, 2022. [Online]. Available: https://huggingface.co/facebook/nllb-200-distilled-600M
[15] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. Int. Conf. on Machine Learning (ICML), 2015, pp. 448–456.