This paper presents the design and implementation of a Retrieval-Augmented Generation (RAG) based chatbot developed entirely from scratch using Python, without relying on any high-level orchestration framework like LangChain. Our system uses Sentence Transformers (all-MiniLM-L6-v2) for generating 384-dimensional semantic embeddings, ChromaDB as the persistent vector database for storing and searching document chunks, the Groq API (LLaMA 3 70B) for fast and accurate language model inference, and Flask for the web-based user interface. The core motivation was to build a transparent, fully controllable RAG pipeline where every component can be understood, debugged, and optimized independently. The system takes any collection of PDF or text documents, splits them into overlapping chunks, embeds them, and answers user questions by retrieving the most semantically relevant content and grounding the LLM\'s response in that content. Evaluation using the RAGAS framework shows 88% domain-specific accuracy (vs. 58% for a standalone LLaMA 3 baseline), hallucination reduction from ~38% to ~10%, and an overall RAGAS score of 0.86. The system runs fully on standard hardware with CPU-only embedding and Groq API for sub-second LLM inference.
Introduction
The text describes the design and development of a custom Retrieval-Augmented Generation (RAG) chatbot that answers questions from user-provided documents accurately while avoiding hallucinations.
The main problem addressed is that traditional search tools rely on keyword matching and miss semantic meaning, while standard Large Language Models (LLMs) like GPT can hallucinate and lack access to private or recent documents. To solve this, the authors build a fully custom RAG system from scratch (without frameworks like LangChain) using Python, Sentence Transformers, ChromaDB, Groq API, and Flask.
The system works in a pipeline:
Documents (PDF/text) are processed using PyMuPDF and split into overlapping chunks.
Each chunk is converted into semantic embeddings using the all-MiniLM-L6-v2 SentenceTransformer model.
Embeddings are stored in ChromaDB, enabling fast semantic search using cosine similarity.
When a user asks a question, it is also embedded and matched with the most relevant chunks.
These retrieved chunks are sent as context to an LLM (via Groq API with LLaMA 3/Mixtral) with strict instructions to answer only from provided text, reducing hallucinations.
The system is designed to be lightweight, fast, and transparent, with full control over each component. It includes metadata tracking for source traceability, improving trust in answers.
Conclusion
In this paper, we presented a custom RAG chatbot built from scratch using Python, Sentence Transformers, ChromaDB, Groq API, and Flask — without relying on any high-level orchestration framework like LangChain. Building the system this way gave us complete control and a deeper understanding of how each component works.
Our system successfully addresses the two main problems with standard LLMs: inability to work with private documents and hallucination. By using semantic embeddings for retrieval and grounding the LLM\'s responses in retrieved context, we achieved 88% domain-specific accuracy and reduced hallucination to just 10%. The overall RAGAS score of 0.86 shows high quality across all evaluation dimensions.
The Groq API proved to be an excellent choice for LLM inference — it provides state-of-the-art model quality at speeds that make the chatbot feel truly real-time. ChromaDB was easy to set up and performed reliably for our document sizes. And Sentence Transformers gave us high-quality semantic embeddings that ran efficiently on CPU without any GPU requirement.
We believe this work shows that powerful RAG systems can be built without expensive infrastructure or complex frameworks. A motivated student or small team can build a production-quality AI chatbot using entirely open-source and free tools. We hope this paper helps other students and developers who want to build their own RAG systems.
References
[1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, \"Distributed Representations of Words and Phrases and their Compositionality,\" Advances in Neural Information Processing Systems, vol. 26, 2013.
[2] J. Pennington, R. Socher, and C. Manning, \"GloVe: Global Vectors for Word Representation,\" Proc. EMNLP, pp. 1532–1543, 2014.
[3] A. Vaswani, N. Shazeer, N. Parmar et al., \"Attention Is All You Need,\" Advances in Neural Information Processing Systems, vol. 30, 2017.
[4] N. Reimers and I. Gurevych, \"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,\" Proc. EMNLP, pp. 3982–3992, 2019.
[5] J. Anton, \"Chroma: The AI-Native Open-Source Embedding Database,\" GitHub Repository, 2022. [Online]. Available: https://github.com/chroma-core/chroma
[6] P. Lewis, E. Perez, A. Piktus et al., \"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,\" Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[7] Groq Inc., \"Groq API Documentation,\" 2024. [Online]. Available: https://console.groq.com/docs