The AI-powered video summarizer project, S-Notevid, aims to develop, implement, and evaluate an advanced web application designed to generate structured study notes from educational content on YouTube. The project targets a diverse range of educational videos across disciplines such as science, technology, and the humanities. The system leverages the youtube-transcript library for accurate transcript extraction and Google’s Gemini model for state-of-the-art natural language processing to generate comprehensive notes. These components are integrated into a cohesive system built with a React frontend and an Express.js backend that processes user-submitted YouTube URLs to produce informative notes. The system’s performance is evaluated using both quantitative metrics and qualitative user feedback to ensure relevance and accuracy. Potential applications include integration into online education platforms, content recommendation systems, and accessibility services for learners.
Introduction
S-Notevid is an AI-powered web application designed to address the challenge of efficiently extracting key information from the rapidly growing volume of YouTube video content. The system generates structured, study-friendly notes from educational videos by fetching transcripts using the youtube-transcript library and processing them with Google’s Gemini large language model. Built with a React frontend and an Express.js backend, the platform allows users to simply input a YouTube URL and receive comprehensive notes in a scalable, user-friendly environment.
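To make the transcript-extraction step concrete, the minimal sketch below shows how an Express.js route might accept a YouTube URL, derive the video ID, and fetch the caption text with the youtube-transcript library. The route path, helper function, and error handling are illustrative assumptions rather than the project's exact implementation.

```typescript
import express from 'express';
import { YoutubeTranscript } from 'youtube-transcript';

const app = express();
app.use(express.json());

// Illustrative helper: pull the 11-character video ID out of common YouTube URL formats.
function extractVideoId(url: string): string | null {
  const match = url.match(/(?:v=|youtu\.be\/|embed\/)([A-Za-z0-9_-]{11})/);
  return match ? match[1] : null;
}

// Hypothetical endpoint: accepts { url } in the request body and returns the plain transcript text.
app.post('/api/transcript', async (req, res) => {
  const videoId = extractVideoId(req.body.url ?? '');
  if (!videoId) {
    res.status(400).json({ error: 'Invalid YouTube URL' });
    return;
  }
  try {
    // youtube-transcript resolves to an array of { text, offset, duration } segments.
    const segments = await YoutubeTranscript.fetchTranscript(videoId);
    const transcript = segments.map((s) => s.text).join(' ');
    res.json({ videoId, transcript });
  } catch {
    res.status(502).json({ error: 'Transcript could not be retrieved for this video' });
  }
});

app.listen(3001);
```

Joining the timed segments into one string keeps the downstream prompt simple; an alternative is to retain the offsets so generated notes can link back to timestamps in the video.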
The primary objective of S-Notevid is to enhance learning efficiency by enabling users to grasp essential concepts without watching entire videos. It focuses on accessibility, accuracy, and ease of use, providing a clean interface for URL submission, note viewing, and access to previously processed content. The application has broad use cases across education, research, content curation, news summarization, and learning management systems.
The project’s literature review highlights the evolution of video summarization from traditional visual-feature-based methods to modern deep learning approaches. While existing research emphasizes visual analysis, S-Notevid addresses a key gap by prioritizing transcript-based natural language processing, which is especially critical for educational content. By leveraging an LLM, the system moves beyond basic summaries to produce detailed, structured notes.
S-Notevid follows a modern client–server architecture consisting of a React frontend, an Express.js backend, and a PostgreSQL database. Users authenticate via Google OAuth, submit YouTube links, and receive AI-generated notes that are stored for future access. The methodology emphasizes transcript extraction, LLM-driven note generation, data persistence, and clear presentation, with provisions for future integration of visual frame extraction.
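A minimal sketch of the persistence layer follows, assuming a single PostgreSQL table accessed through the node-postgres (pg) client; the table name, columns, and upsert behaviour are assumptions made for illustration, since the report does not fix an exact schema.

```typescript
import { Pool } from 'pg';

// Connection details are read from the standard PG* environment variables.
const pool = new Pool();

// Assumed schema: one row per generated note, linked to the authenticated Google user.
const CREATE_NOTES_TABLE = `
  CREATE TABLE IF NOT EXISTS notes (
    id         SERIAL PRIMARY KEY,
    user_id    TEXT NOT NULL,            -- Google OAuth subject identifier
    video_id   TEXT NOT NULL,            -- YouTube video ID
    title      TEXT,
    content    TEXT NOT NULL,            -- generated notes (e.g., Markdown)
    created_at TIMESTAMPTZ DEFAULT now(),
    UNIQUE (user_id, video_id)
  );
`;

export async function saveNotes(userId: string, videoId: string, title: string, content: string): Promise<number> {
  await pool.query(CREATE_NOTES_TABLE);
  const result = await pool.query(
    `INSERT INTO notes (user_id, video_id, title, content)
       VALUES ($1, $2, $3, $4)
       ON CONFLICT (user_id, video_id) DO UPDATE SET content = EXCLUDED.content
       RETURNING id`,
    [userId, videoId, title, content],
  );
  return result.rows[0].id;
}

export async function listNotes(userId: string) {
  const result = await pool.query(
    'SELECT id, video_id, title, created_at FROM notes WHERE user_id = $1 ORDER BY created_at DESC',
    [userId],
  );
  return result.rows;
}
```

The UNIQUE (user_id, video_id) constraint lets repeated submissions of the same video refresh the stored notes rather than create duplicates, which supports the "access to previously processed content" feature described above.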
Implementation details include secure API communication, efficient state management, optimized database design, and carefully engineered prompts for Gemini that enforce clarity, hierarchy, and relevance in the generated notes. Performance evaluation shows effective transcript extraction, accurate note generation, typical processing times of 15 to 45 seconds, and positive user feedback highlighting time savings, readability, and accessibility.
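The note-generation step reduces to a single prompt-engineered call to Gemini. The sketch below uses the @google/generative-ai Node SDK with a hypothetical prompt that encodes the clarity, hierarchy, and relevance requirements described above; the actual prompt wording and model variant used by S-Notevid are not specified in the report and are assumed here.

```typescript
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? '');

// Illustrative prompt encoding the clarity, hierarchy, and relevance goals described above.
function buildPrompt(transcript: string): string {
  return [
    'You are generating structured study notes from a video transcript.',
    'Requirements:',
    '- Use Markdown with a short title, section headings, and nested bullet points.',
    '- Preserve key definitions, formulas, and worked examples; omit filler and repetition.',
    '- End with a brief "Key Takeaways" list.',
    '',
    'Transcript:',
    transcript,
  ].join('\n');
}

export async function generateNotes(transcript: string): Promise<string> {
  // The model name is an assumption; any Gemini text model exposed by the SDK is called the same way.
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
  const result = await model.generateContent(buildPrompt(transcript));
  return result.response.text();
}
```

Very long transcripts may need to be truncated or chunked to fit the model's context window; that handling is omitted from this sketch.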
Conclusion
The S-Notevid project successfully demonstrates the development and implementation of a modern web application designed to enhance the educational value of YouTube content. By leveraging a state-of-the-art large language model, the system effectively transforms lengthy video transcripts into structured, comprehensive, and easily digestible study notes. The architecture, built on React, Express.js, and the Google Gemini API, proves to be a robust and scalable solution for on-demand content processing. The primary contribution of this project lies in its NLP-first approach to video content analysis. Unlike traditional summarizers that focus on visual keyframe extraction, S-Notevid prioritizes the semantic richness of the spoken transcript, making it an exceptionally effective tool for lectures, tutorials, and other knowledge-dense videos. This approach not only saves users significant time and effort but also provides a deeper, more organized understanding of the material.
The system’s practical implementation addresses a genuine need in modern education and content consumption. With the exponential growth of video-based learning resources, tools that can efficiently extract and organize information become increasingly valuable. S-Notevid stands as a successful proof-of-concept for a new generation of AI-powered educational tools, effectively showcasing how modern API-driven AI can be applied to build practical, user-centric applications that meet the evolving needs of learners.
[25] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.