This paper introduces the Sandbox: Document Generating Engine, a novel web application built with Python and Streamlit that automates document processing and streamlines the conversion of raw data into polished final reports (Achachlouei et al., 2021). The system addresses the common problem of manual, time-consuming, and error-prone data entry.
The core innovation lies in augmenting intelligent document processing (IDP) workflows with contemporary large language models (LLMs). By leveraging the semantic analysis capabilities of Natural Language Processing (NLP), the system moves beyond rigid, manual processes to achieve intelligent template mapping. The platform analyses data uploaded in diverse formats (.csv, .xlsx, and .txt files), automatically identifies critical information, and populates predefined document templates (Adhikari, 2018).
Designed for scalability and security, the application features an authentication system that uses bcrypt for password hashing and PostgreSQL for credential storage. Our results demonstrate that this AI-enhanced approach yields substantial improvements in efficiency and productivity, establishing the system as a valuable scholarly assistance tool for researchers and organisations seeking to accelerate their document creation workflows (Bitzenbauer, 2023).
Introduction
The rapid rise of generative AI, particularly Large Language Models (LLMs), is reshaping various fields by automating tasks, improving efficiency, and enabling smarter data processing. Traditional document creation is often manual, error-prone, and time-consuming, highlighting the need for automated solutions.
The Sandbox: Document Generating Engine is a web application built with Python and Streamlit to address this problem. It automates document processing and data extraction from multiple file types (text, CSV, Excel) and uses AI-powered Intelligent Document Processing (IDP) for smart template mapping. Raw data is semantically analysed and automatically populated into predefined templates, improving workflow efficiency, especially in academic and professional settings.
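The ingestion step described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the function name `load_records`, the row-dictionary output shape, and the "key: value" convention for text files are all assumptions made for the example. The Excel branch additionally assumes the third-party pandas and openpyxl packages are installed.

```python
import csv
import io
from pathlib import Path


def load_records(filename: str, payload: bytes) -> list[dict[str, str]]:
    """Return uploaded data as a list of row dictionaries.

    Hypothetical helper: the name and return shape are illustrative
    assumptions, not the application's real interface.
    """
    suffix = Path(filename).suffix.lower()

    if suffix == ".csv":
        # utf-8-sig tolerates a byte-order mark written by spreadsheet tools.
        text = payload.decode("utf-8-sig")
        reader = csv.DictReader(io.StringIO(text, newline=""))
        return [dict(row) for row in reader]

    if suffix == ".txt":
        # Assumed convention: one "key: value" pair per line, one record per file.
        record: dict[str, str] = {}
        for line in payload.decode("utf-8-sig").splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip():
                record[key.strip()] = value.strip()
        return [record]

    if suffix == ".xlsx":
        # Imported lazily so the CSV and text paths work without pandas.
        import pandas as pd

        frame = pd.read_excel(io.BytesIO(payload), dtype=str).fillna("")
        return frame.to_dict(orient="records")

    raise ValueError(f"Unsupported file type: {suffix or filename}")


if __name__ == "__main__":
    print(load_records("marks.csv", b"name,score\nAda,90\n"))
```

In a Streamlit front end, the object returned by `st.file_uploader` exposes a `name` attribute and a `getvalue()` method, which could supply the two arguments. Normalising every format to the same row-dictionary shape keeps the later mapping stage independent of the upload format.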
The system emphasises security, through bcrypt-hashed passwords stored in a PostgreSQL database, and modularity, which eases the integration of future AI models. Its main goals are to:
Provide a secure, user-friendly platform.
Automate data extraction and processing.
Transform raw data into polished documents.
Maintain flexibility for future AI integration.
A review of the literature highlights:
Early NLP improved information extraction, summarization, and knowledge construction.
Traditional data-to-template mapping methods (placeholder substitution, programmatic mapping) are rigid and prone to errors.
LLMs offer semantic understanding, context-aware text generation, and cross-disciplinary knowledge integration, enabling more intelligent IDP workflows.
Current gaps include a lack of validation, the risk of over-reliance, and limited context-aware integration, which Sandbox addresses through a modular, AI-enhanced design.
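The rigidity of placeholder substitution noted in the review can be illustrated with Python's standard `string.Template`. The template text and column names below are invented for the example; the point is only that exact-name substitution fails as soon as the uploaded data labels its columns differently.

```python
from string import Template

# Hypothetical template and data, chosen only to show the failure mode.
template = Template("Report for $student_name, final grade: $grade")

# Column names match the placeholders exactly, so substitution succeeds.
exact = {"student_name": "Ada", "grade": "A"}
print(template.substitute(exact))

# The same data under slightly different column names breaks the mapping.
renamed = {"Student Name": "Ada", "Grade": "A"}
try:
    template.substitute(renamed)
except KeyError as missing:
    print(f"unmapped placeholder: {missing}")

# safe_substitute avoids the exception but silently leaves gaps in the output.
print(template.safe_substitute(renamed))
```

Both outcomes are undesirable in a reporting workflow: the first stops generation entirely, and the second produces a document that still contains raw placeholders.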
The methodology centres on a modular architecture with five interconnected components:
User Authentication: Secure login with bcrypt and PostgreSQL.
Data Ingestion: Upload and process various file types.
NLP/LLM Layer: Semantic analysis of uploaded data.
Intelligent Mapping Engine: Uses AI to populate templates accurately.
Document Generation: Produces polished, professional reports.
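One way to read the modular design listed above is that the mapping engine depends only on a scoring function, so the component behind that function can be replaced without touching document generation. The sketch below makes that assumption explicit. All names (`Matcher`, `lexical_matcher`, `map_fields`, `generate`) and the 0.6 threshold are hypothetical, and the scorer shown uses plain string similarity from `difflib` as a stand-in; it is not an LLM and is not claimed to be the paper's semantic layer.

```python
import re
from difflib import SequenceMatcher
from string import Template
from typing import Callable

# A matcher scores how well a data column fits a template field (0.0 to 1.0).
Matcher = Callable[[str, str], float]

PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def normalise(name: str) -> str:
    """Lower-case a name and collapse punctuation and underscores to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def lexical_matcher(field: str, column: str) -> float:
    """Stand-in scorer based on string similarity (assumption: not an LLM)."""
    return SequenceMatcher(None, normalise(field), normalise(column)).ratio()


def map_fields(fields, record, matcher: Matcher = lexical_matcher, threshold=0.6):
    """Pick the best-scoring column for each field; skip weak matches."""
    mapping = {}
    for field in fields:
        scored = [(matcher(field, column), column) for column in record]
        if not scored:
            continue
        score, best = max(scored, key=lambda pair: pair[0])
        if score >= threshold:
            mapping[field] = record[best]
    return mapping


def generate(template_text: str, record: dict, matcher: Matcher = lexical_matcher):
    """Fill a template and report any placeholders left unmapped."""
    fields = dict.fromkeys(PLACEHOLDER.findall(template_text))  # ordered, unique
    values = map_fields(fields, record, matcher)
    missing = [field for field in fields if field not in values]
    return Template(template_text).safe_substitute(values), missing


if __name__ == "__main__":
    row = {"Student Name": "Ada", "Final Grade": "A"}
    print(generate("Report for $student_name, grade: $final_grade", row))
```

Returning the list of unmapped placeholders alongside the text gives a simple validation hook, so a weak match is surfaced to the user rather than silently filled. An embedding-based or LLM-based scorer could be passed through the `matcher` argument in place of the lexical one.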
Overall, the Sandbox engine represents an efficient, secure, and flexible solution for modern document automation, leveraging AI to reduce errors, save time, and enhance productivity.
Conclusion
This project successfully developed the "Sandbox: Document Generating Engine", a secure, AI-ready platform that dramatically streamlines the document creation process. By focusing on augmenting intelligent document processing (IDP) workflows with contemporary large language models (LLMs), we have created a powerful solution that tackles the inefficiency of manual data handling.
References
[1] Achachlouei, M. A., Patil, O., Joshi, T., & Nair, V. N. (2021). Document Automation Architectures and Technologies: A Survey. arXiv. https://arxiv.org/abs/2109.02605
[2] Adhikari, P. R. (2018). Understanding of Plagiarism through Information Literacy: A Study among the Students of Higher Education of Nepal. Journal of Business and Social Sciences Research, 3(2), 165–181. https://doi.org/10.3126/jbssr.v3i2.28132
[3] AlAli, R., & Wardat, Y. (2024). Opportunities and Challenges of Integrating Generative Artificial Intelligence in Education. International Journal of Religion, 5(7), 784–793. https://doi.org/10.61707/8y29gv34
[4] Aldosari, S. A. M. (2020). The Future of Higher Education in the Light of Artificial Intelligence Transformations. International Journal of Higher Education, 9(3), 145. https://doi.org/10.5430/ijhe.v9n3p145
[5] Almahasees, Z., Khalil, M., & Aminzadeh, S. (2024). Students’ Perceptions of the Benefits and Challenges of Integrating ChatGPT in Higher Education. Pakistan Journal of Life and Social Sciences (PJLSS), 22(2), 3479–3494. https://doi.org/10.57239/PJLSS-2024-22.2.00256
[6] Archila, P. A., Ortiz, B. T., Truscott de Mejía, A.-M., & Molina, J. (2024). Thinking critically about scientific information generated by ChatGPT. Information and Learning Science. https://doi.org/10.1108/ILS-04-2024-0040
[7] Arora, S., Yang, S., Eyuboglu, B., Narayan, S., Hojel, A., Trummer, A., & E., I. R. (2023). Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proc. VLDB Endow., 17(2), 92–104. https://doi.org/10.14778/3620359.3620366
[8] Athaluri, A. S., Manthena, S. V., K., M. V. S. R., Kesapragada, V., Yarlagadda, T., Dave, & Dudumpudi, R. T. S. (2023). Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus, 15(12). https://doi.org/10.7759/cureus.49964
[9] Bakiri, H., Mbembati, H., & Tinabo, R. (2023). Artificial Intelligence Services at Academic Libraries in Tanzania: Awareness, Adoption and Prospects. University of Dar Es Salaam Library Journal, 18(2). https://doi.org/10.4314/udslj.v18i2.3
[10] Bearman, M., Tai, J., Dawson, P., Boud, D., & Ajjawi, R. (2024). Developing evaluative judgement for a time of generative artificial intelligence. Assessment & Evaluation in Higher Education, 49(6), 893–905. https://doi.org/10.1080/02602938.2024.2335321
[11] Biswas, S., Jain, S., Morariu, R., Gu, V. L., Mathur, J., Wigington, P., Sun, C., & Uehida, T. (2024). DocSynthV2: A Practical Autoregressive Modelling for Document Generation. arXiv. https://arxiv.org/abs/2406.02492
[12] Bitzenbauer, P. (2023). ChatGPT in physics education: A pilot study on easy-to-implement activities. Contemporary Educational Technology, 15(3), ep430. https://doi.org/10.30935/cedtech/13176
[13] Borkovska, I., Kolosova, H., Kozubska, I., & Antonenko, I. (2024). Integration of AI into the Distance Learning Environment: Enhancing Soft Skills. Arab World English Journal, 1(1), 56–72. https://doi.org/10.24093/awej/ChatGPT.3
[14] Bozkurt, A. (2024). Tell Me Your Prompts and I Will Make Them True: The Alchemy of Prompt Engineering and Generative AI. Open Praxis, 16(2), 111–118. https://doi.org/10.55982/openpraxis.16.2.661
[15] Bradley, C. (2013). Information Literacy Articles in Science Pedagogy Journals. Evidence Based Library and Information Practice, 8(4), 78–92. https://doi.org/10.18438/B8JG76
[16] Cain, W. (2024). Prompting Change: Exploring Prompt Engineering in Large Language Model AI and Its Potential to Transform Education. TechTrends, 68(1), 47–57. https://doi.org/10.1007/s11528-023-00896-0
[17] Carroll, A. J., & Borycz, J. (2024). Integrating large language models and generative artificial intelligence tools into information literacy instruction. The Journal of Academic Librarianship, 50(4), 102899. https://doi.org/10.1016/j.acalib.2024.102899
[18] Çayır, A. (2023). A Literature Review on the Effect of Artificial Intelligence on Education. İnsan ve Sosyal Bilimler Dergisi, 6(2), 276–288. https://doi.org/10.53048/johass.1375684
[19] Lin, C.-H., & Cheng, C. P. (2024). Legal Documents Drafting with Fine-Tuned Pre-trained Large Language Model. arXiv. https://arxiv.org/abs/2406.08860
[20] Mohammadi, B., et al. (2024). Creativity Has Left the Chat: The Price of Debiasing Language Models. arXiv. https://arxiv.org/abs/2403.04595
[21] Mridul, M. A., Sloyan, I., Gupta, A., & Seneviratne, O. (2025). AI4Contracts: LLM & RAG-Powered Encoding of Financial Derivative Contracts. arXiv. https://arxiv.org/abs/2506.09633
[22] Nigam, S. K., Patnaik, B. D., Thomas, A. V., Shallum, N., Ghosh, K., & Bhattacharya, A. (2025). Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhiDastavej. International Journal of Law, Technology, and Management. https://doi.org/10.48550/arXiv.2506.09540
[23] Zhao, H., & Li, D. (2024). A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval–Augmented Generation. arXiv. https://arxiv.org/abs/2403.18560
[24] Zhang, Q., Huang, B., Jiang, V., Wang, J., Jiang, Z., He, L., & Zhang, C. (2024). Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction. ResearchGate. https://arxiv.org/abs/2403.11186