In recent years, digital learning has changed rapidly, moving from recorded lectures distributed on DVD to lectures streamed online. Although these videos are easy to access and download, they offer little support for meaningful engagement. Finding a specific topic inside a long video means scrubbing through the recording manually, which is time-consuming and does little to improve learning outcomes.
This project addresses the problem with a Retrieval-Augmented Generation (RAG) based Artificial Intelligence Teaching Assistant that turns lecture videos into an interactive, searchable learning resource. Lectures are first converted into text transcripts using speech-to-text technology. The transcripts are then preprocessed and split into meaningful chunks for further processing.
Introduction
Digital technology and Artificial Intelligence are transforming education, making learning more interactive and accessible through online resources and virtual learning environments. Traditional keyword-based search, however, is inadequate for locating specific information within long educational videos. This project therefore uses Natural Language Processing (NLP), Large Language Models (LLMs), and Retrieval-Augmented Generation (RAG) to improve information retrieval and answer generation.
The proposed AI-powered Teaching Assistant processes lecture videos in three stages: it converts speech into text using tools such as Whisper, organizes the transcripts into meaningful chunks, and generates semantic embeddings that are stored in a vector index such as FAISS or Pinecone. When a student asks a question in natural language, the system retrieves the most relevant transcript sections through semantic similarity search and generates a context-aware answer with an LLM that is constrained to the retrieved content, which reduces hallucinations.
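The chunking and retrieval steps described above can be illustrated with a minimal sketch. It is not the project's implementation. It assumes transcript segments shaped like Whisper's output (dictionaries with "start", "end", and "text" fields), and it replaces the real embedding model and vector index with a toy bag-of-words vector and a linear scan so that it runs with the standard library alone. The function names, the max_words limit, and the sample lecture text are hypothetical.

```python
import math
import re
from collections import Counter


def _merge(segs):
    """Combine consecutive segments into one chunk, keeping the time span."""
    return {
        "start": segs[0]["start"],
        "end": segs[-1]["end"],
        "text": " ".join(s["text"].strip() for s in segs),
    }


def chunk_segments(segments, max_words=20):
    """Group consecutive segments until adding one would exceed max_words."""
    chunks, current, count = [], [], 0
    for seg in segments:
        n = len(seg["text"].split())
        if current and count + n > max_words:
            chunks.append(_merge(current))
            current, count = [], 0
        current.append(seg)
        count += n
    if current:
        chunks.append(_merge(current))
    return chunks


def embed(text):
    """Toy stand-in for a sentence-embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical transcript segments (times in seconds).
segments = [
    {"start": 0.0, "end": 6.0, "text": "Welcome to the lecture on sorting algorithms."},
    {"start": 6.0, "end": 14.0, "text": "Quicksort picks a pivot and partitions the array around it."},
    {"start": 14.0, "end": 22.0, "text": "Merge sort splits the array in half and merges the sorted halves."},
    {"start": 22.0, "end": 30.0, "text": "Gradient descent updates weights using the gradient of the loss."},
]

chunks = chunk_segments(segments)
for score, chunk in retrieve("How does quicksort choose a pivot?", chunks):
    print(f"{score:.3f}  {chunk['start']:.0f}s-{chunk['end']:.0f}s  {chunk['text']}")
```

Because each chunk carries the start and end times of the segments it was built from, any retrieved passage can be traced back to a position in the video. In a full system, embed would call a sentence-embedding model and retrieve would query the vector index rather than scanning every chunk.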
The system also provides timestamps and references to the original lecture material for transparency and verification. A Streamlit-based interface integrates transcription, embedding, retrieval, and generation modules into a privacy-preserving and efficient platform. Overall, the project aims to transform passive lecture videos into an interactive learning system that enhances accessibility, reliability, and student engagement.
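One way to attach timestamps and source references to an answer, and to keep the LLM within the retrieved material, is sketched below. This is an illustrative assumption rather than the system's actual prompt or answer-control logic. The "lecture" field, the min_score threshold of 0.2, the prompt wording, and the function names are all hypothetical, and the call to the LLM itself is omitted.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS, or H:MM:SS past one hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(question, hits, min_score=0.2):
    """Build a grounded prompt from (score, chunk) pairs.

    Returns (prompt, sources). If no chunk reaches min_score, returns
    (None, []) so the caller can decline to answer instead of letting
    the model guess.
    """
    kept = [(score, chunk) for score, chunk in hits if score >= min_score]
    if not kept:
        return None, []

    lines, sources = [], []
    for ref, (_, chunk) in enumerate(kept, start=1):
        span = f"{format_timestamp(chunk['start'])}-{format_timestamp(chunk['end'])}"
        lines.append(f"[{ref}] ({chunk['lecture']}, {span}) {chunk['text']}")
        sources.append({"ref": ref, "lecture": chunk["lecture"], "span": span})

    prompt = (
        "Answer the question using only the numbered lecture excerpts below. "
        "Cite excerpts by number. If they do not contain the answer, say so.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, sources


# Hypothetical retrieval result: one relevant chunk and one weak match.
example_hits = [
    (0.71, {"lecture": "Lecture 3", "start": 360.0, "end": 450.0,
            "text": "Quicksort picks a pivot and partitions the array around it."}),
    (0.05, {"lecture": "Lecture 7", "start": 3725.0, "end": 3790.0,
            "text": "Gradient descent updates weights using the gradient of the loss."}),
]

prompt, sources = build_prompt("How does quicksort choose a pivot?", example_hits)
print(prompt)
print(sources)
```

The sources list can be shown next to the generated answer so that a student can jump to the cited span in the original video and verify the response. Returning no prompt when every score falls below the threshold is one simple way to avoid answering from irrelevant context.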
Conclusion
This paper described a RAG-based Teaching Assistant that makes video-based learning more accessible and interactive. Traditional lecture recording systems usually lack efficient indexing, so learners cannot easily locate specific parts of a recording.
The proposed question-answering system combines several technologies to achieve high-quality performance: speech-to-text (STT), semantic embeddings, similarity-based retrieval, and an answer-regulation mechanism. Together, these components generate accurate, context-aware answers accompanied by relevant source references and the corresponding timestamps.
Experimental results show that the developed system improves the efficiency of video search, reduces the number of hallucinated responses, and provides a better user experience through an interactive question-and-answer learning process. The system offers an effective, domain-specific learning tool that automatically organizes large volumes of video data and turns traditional passive video viewing into active learning through semantic retrieval and current AI techniques.