Digital transformation in Indian higher-education institutions is constrained not by the absence of information, but by the difficulty of accessing it across linguistic and structural boundaries. Administrative data such as admission rules, fee structures, examination schedules, and scholarship policies are published primarily in English and distributed across heterogeneous document formats, while students interact using Hindi, regional languages, and mixed Romanized scripts such as Hinglish. This paper presents an optimized Retrieval-Augmented Generation (RAG) architecture designed as a campus-scale natural language information system rather than a simple chatbot. The proposed framework integrates multilingual semantic embeddings, vector-based document retrieval, conversational state management, and grounded response generation into a unified, auditable architecture. A hybrid two-tier backend separates high-frequency user interaction from computationally intensive retrieval and inference, enabling scalable deployment across multiple institutions. Experimental evaluation demonstrates that the architectural design achieves high retrieval accuracy and low latency while preserving factual reliability, making it suitable for real-world administrative decision support in multilingual academic environments.
Introduction
The paper addresses the linguistic and administrative challenges faced by Indian universities, where official information is published in English while students commonly ask questions in Hindi, regional languages, or mixed-script forms like Hinglish. This mismatch makes critical academic information functionally inaccessible, leading to overcrowded administrative offices, delays, and reliance on informal—and often inaccurate—peer networks. Existing digital portals and chatbots are inadequate due to rigid keyword search, lack of multilingual understanding, poor conversational context handling, and the risk of unverified or hallucinated responses.
To solve this, the paper presents CampusMitra, a Retrieval-Augmented Generation (RAG)–based campus information system designed as a structured knowledge infrastructure rather than a simple chatbot. The system introduces script-agnostic intent mapping using multilingual embeddings, conversational context persistence for multi-turn queries, a confidence-aware decision gate to prevent incorrect answers, and auditable, source-attributed responses for high-stakes administrative data.
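The confidence-aware decision gate described above can be sketched as a nearest-intent lookup over embedding similarity, with an abstention fallback when the best match is too weak. This is a minimal illustration in pure Python: the toy vectors stand in for the output of a multilingual encoder, and the threshold value is an assumption, not a parameter reported by the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy intent embeddings standing in for a multilingual encoder's output.
# In the real system these would come from the model; the vectors and
# threshold here are illustrative only.
INTENT_VECTORS = {
    "fee_structure": [0.9, 0.1, 0.0],
    "exam_schedule": [0.1, 0.9, 0.1],
    "scholarship":   [0.0, 0.2, 0.9],
}

CONFIDENCE_THRESHOLD = 0.75  # assumed cut-off; below it the system abstains

def route_query(query_vector):
    """Map a query embedding to its best-matching intent, or abstain
    when the top similarity falls below the confidence threshold."""
    best_intent, best_score = None, -1.0
    for intent, vec in INTENT_VECTORS.items():
        score = cosine(query_vector, vec)
        if score > best_score:
            best_intent, best_score = intent, score
    if best_score < CONFIDENCE_THRESHOLD:
        return ("abstain", best_score)  # safe "I don't know" path
    return (best_intent, best_score)
```

Because the gate compares embeddings rather than surface strings, a Hindi, Hinglish, or English phrasing of the same question maps to the same intent vector, which is what makes the mapping script-agnostic.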
The proposed architecture uses a hybrid two-tier backend (Node.js for interaction and FastAPI for inference), ChromaDB for metadata-isolated vector storage, and LangChain for controlled retrieval and response synthesis. A recursive semantic chunking strategy ensures accurate retrieval from dense administrative documents, while RAG is preferred over fine-tuning to maintain real-time accuracy as rules and notices change frequently.
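The recursive semantic chunking idea can be illustrated with a splitter that tries coarse separators first (paragraph breaks, then lines, then sentences) and only recurses to finer ones for pieces that still exceed the length budget. This is a sketch of the general technique, not the paper's exact implementation; the separator list and length limit are assumed values.

```python
def recursive_chunk(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text on the coarsest available separator, recursing to
    finer separators only for pieces still longer than max_len."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                if part:
                    chunks.extend(recursive_chunk(part, max_len, separators[i:]))
            return chunks
    # No separator left: hard-split as a last resort.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]
```

Splitting at paragraph and sentence boundaries keeps each stored chunk semantically coherent, so a retrieved chunk about, say, a fee deadline is unlikely to be cut off mid-rule.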
Experimental evaluation across English, Hindi, and Hinglish queries achieved high retrieval precision (Top-1: ~92%, Top-3: ~96%) with acceptable latency (~1.4 seconds). The confidence gate successfully prevented hallucinations, and projected results indicate a potential 70% reduction in routine administrative workload. While challenges remain in handling code-switching ambiguity and complex tabular data, the study demonstrates that a grounded, multilingual RAG architecture can significantly improve access, accuracy, and efficiency in campus administrative information systems.
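Top-k retrieval precision figures like those above are computed by checking whether the gold document for each query appears among the top k ranked results. A minimal sketch, using made-up sample data rather than the paper's evaluation set:

```python
def top_k_accuracy(results, k):
    """Fraction of queries whose gold document appears in the top-k
    retrievals. `results` maps each query to (gold_id, ranked_ids)."""
    hits = sum(1 for gold, ranked in results.values() if gold in ranked[:k])
    return hits / len(results)

# Hypothetical ranked outputs for four queries (illustrative only).
sample = {
    "fees in hindi":     ("doc_fees",  ["doc_fees", "doc_exam", "doc_schol"]),
    "exam kab hai":      ("doc_exam",  ["doc_exam", "doc_fees", "doc_schol"]),
    "scholarship form":  ("doc_schol", ["doc_fees", "doc_schol", "doc_exam"]),
    "hostel fee refund": ("doc_fees",  ["doc_fees", "doc_schol", "doc_exam"]),
}

print(top_k_accuracy(sample, 1))  # 0.75: three of four gold docs ranked first
print(top_k_accuracy(sample, 3))  # 1.0: every gold doc within the top three
```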
Conclusion
This research validates that an optimized, local-first RAG stack is superior to general-purpose LLMs for campus administration. By prioritizing intent-based mapping, conversational memory, and source-attributed grounding, the system bridges the linguistic divide and reduces the operational load on administrative staff. Future work will explore the deployment of quantized local models for fully offline institutional operation and expanded omnichannel support for regional messaging platforms.