Abstract
Retrieval-Augmented Generation (RAG) has emerged as an effective framework for enhancing factual accuracy in Large Language Models (LLMs) by grounding generated responses in retrieved document context. This paper presents the design, implementation, and evaluation of a complete RAG pipeline for Document Question Answering (DocQA) using FAISS-based semantic retrieval and the Llama3 model running locally through Ollama. The system processes PDF and text documents, constructs a vector index, retrieves the top-k most relevant chunks via embedding similarity, and generates grounded answers through LangChain’s RetrievalQA chain. A benchmark of ten document-derived questions was used to evaluate performance. Token-level F1 score, exact-match accuracy, and hallucination rate were computed to quantify system reliability. Experimental results show an exact-match accuracy of 30%, a hallucination rate of 20%, and F1 scores ranging from 0.13 to 1.0. The study highlights strengths in retrieval consistency and identifies challenges in generation alignment, providing an empirical baseline for future improvements in RAG-based document reasoning.
Introduction
Large Language Models (LLMs) such as Llama3, GPT-4, and Mistral excel at natural language tasks but are prone to hallucinations, generating plausible yet factually incorrect outputs. This limitation is critical in enterprise and knowledge-intensive applications where accuracy is essential. Retrieval-Augmented Generation (RAG) addresses this by combining an LLM with a retrieval mechanism that fetches relevant information from external documents, grounding responses in verifiable content and reducing hallucinations.
This study implements a local RAG pipeline using:
FAISS for high-performance vector similarity search,
LangChain for orchestration, and
Llama3 (via Ollama) for grounded text generation.
Documents (PDF/text) are preprocessed, split into chunks, embedded, and indexed in FAISS. At query time, the top-k relevant chunks are retrieved and provided to the LLM for answer generation.
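A minimal sketch of this pipeline is shown below, assuming the langchain and langchain-community Python packages and a locally running Ollama server with the llama3 model pulled. The file name, chunk sizes, embedding backend, and k value are illustrative placeholders rather than the exact configuration used in this study.

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load the document and split it into overlapping chunks.
documents = PyPDFLoader("example.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 2. Embed the chunks and build the FAISS vector index.
embeddings = OllamaEmbeddings(model="llama3")
index = FAISS.from_documents(chunks, embeddings)

# 3. Assemble the RetrievalQA chain around local Llama3 via Ollama.
qa_chain = RetrievalQA.from_chain_type(
    llm=Ollama(model="llama3"),
    chain_type="stuff",  # concatenate retrieved chunks into the prompt
    retriever=index.as_retriever(search_kwargs={"k": 4}),
)

# 4. Ask a question; FAISS retrieves the top-k chunks, Llama3 answers.
result = qa_chain.invoke({"query": "What problem does the system address?"})
print(result["result"])

The "stuff" chain type simply packs the retrieved chunks into the model's prompt, which is the grounding behavior described above: the answer is conditioned on retrieved document text rather than on the model's parametric knowledge alone.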
Evaluation
Dataset: 10 document-specific questions with known ground-truth answers.
Metrics: Token-level F1 score, exact-match accuracy, and hallucination rate.
Results:
Exact matches: 3/10 (30% accuracy)
Hallucinations: 2/10 (20% rate)
Token F1 scores ranged from 0.13 to 1.0
Analysis: Retrieval supplied relevant context for all ten questions. Most non-exact matches stemmed from paraphrasing of the ground-truth answer, while the hallucinations occurred when the model added elaboration unsupported by the retrieved context.
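A sketch of the scoring behind these metrics follows, assuming lowercased whitespace tokenization for token-level F1 and normalized string comparison for exact match. The answer pairs shown are hypothetical stand-ins for the ten benchmark items, and hallucination labeling, which requires comparing answers against the retrieved context, is not automated here.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over the
    # multiset overlap of lowercased, whitespace-split token counts.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    # Exact match after trimming whitespace and lowercasing.
    return prediction.strip().lower() == reference.strip().lower()

# Hypothetical (prediction, ground truth) pairs standing in for the
# ten benchmark answers, which are not reproduced here.
pairs = [
    ("Paris", "Paris"),
    ("the capital city is Paris", "Paris"),
]
f1_scores = [token_f1(p, r) for p, r in pairs]
em_rate = sum(exact_match(p, r) for p, r in pairs) / len(pairs)
print(f"mean F1 = {sum(f1_scores) / len(f1_scores):.2f}, "
      f"exact match = {em_rate:.0%}")

Under this definition a paraphrased but correct answer earns partial F1 credit while failing exact match, which is consistent with the gap between the 30% exact-match rate and the wide 0.13 to 1.0 F1 range reported above.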
Conclusion
This paper presented a complete RAG-based Document Question Answering system using FAISS, LangChain, and Llama3. A benchmark evaluation demonstrated moderate exact-match accuracy and a low but non-zero hallucination rate. The study highlights both the strengths and limitations of retrieval-grounded local LLMs. Future work may explore improved reranking techniques, hybrid retrieval architectures, hallucination-aware generation, and larger evaluation benchmarks.