Security Report Generation via LLM-RAG Assisted Directory Scanning: An Integrated Framework for Enhanced Software Documentation and Vulnerability Detection

Authors: Utkarsh Rajendra Pingale, Irfan Ajmer Pasha Shaikh

DOI Link: https://doi.org/10.22214/ijraset.2025.71261

Abstract

Thorough, current documentation is essential for software maintenance, security audits, and knowledge transfer. Yet most software projects lack complete or up-to-date documentation. Current tools mostly produce low-level code summaries without incorporating external knowledge, resulting in inefficiencies and increased security risk. This paper introduces an LLM-RAG-augmented automated documentation system that uses Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Information Retrieval (IR) methods. The system reads project directories, extracts metadata, and creates preliminary documentation using LLMs. A RAG module enhances this documentation by pulling in external information pertinent to the task, such as security advisories and bug reports, to provide a complete and actionable documentation framework. Evaluation is based on qualitative user studies and quantitative measurements. This work seeks to enhance documentation quality, increase software maintainability, and streamline security auditing through an AI-powered, explainable, and transparent report generation framework.

I. Introduction

A. Background of Study

Modern software development depends on thorough, up-to-date documentation for efficient processes, security, and knowledge transfer. However, many projects suffer from incomplete or outdated documentation, making it hard to understand code structure or address vulnerabilities. Traditional tools mainly summarize code but lack integration with external sources (e.g., bug trackers, security databases), leading to inefficiencies and increased security risks. Emerging solutions like LLM-RAG (Large Language Model – Retrieval-Augmented Generation) can automate documentation, pull real-time external data, and support better collaboration and security.

B. Problem Statement

Current documentation automation tools suffer from:

Lack of deep security insights, leading to missed risks.
No integration with external sources (e.g., CVE, CWE, GitHub), slowing response to threats.
Stale documentation, failing to keep up with evolving projects.
Knowledge silos that hinder collaboration and increase onboarding difficulty.

These gaps cause increased operational costs, security vulnerabilities, and slower development cycles.

C. Research Objectives

The research investigates how RAG techniques can:

Improve automated documentation by integrating domain-specific, real-time knowledge.
Enhance security reporting through external data (e.g., CVE, CWE).
Assist software maintenance and audits using structured, AI-generated reports.

Main goals:

Develop an LLM-RAG-based documentation system.
Ensure comprehensive, up-to-date documentation.
Evaluate documentation quality via user feedback and accuracy metrics.

D. Significance of Study

The system offers:

Improved security awareness, using external threat intelligence for faster mitigation.
Reduced manual workload, with dynamic updates that maintain accurate documentation without repetitive manual effort.

II. Literature Review

Existing research showcases the power of LLMs for code summarization and vulnerability detection but often lacks:

Integration with external data sources.
Dynamic updates.
Contextual security insights.

RAG emerges as a promising solution, combining retrieval and generation to provide real-time, context-rich outputs. However, many studies fail to address documentation-specific applications or real-time responsiveness.

Identified gaps:

Limited use of real-time external data.
Static documentation that becomes outdated.
Lack of actionable insights.
Absence of unified documentation frameworks.

Relevance of RAG:
RAG addresses these gaps by offering live updates, actionable insights, and integration with external knowledge bases, making documentation more useful and adaptive.

III. Research Methodology

A. Research Design

The study uses:

LLMs to generate initial documentation from code and metadata.
RAG to enhance that documentation with external insights.
Hybrid evaluation (qualitative + quantitative) to assess output quality.

B. Data Collection

Primary data: Source code and metadata from various open-source projects.
Secondary data: External sources like CVE, CWE, GitHub, Stack Overflow, and security advisories.
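
To make the role of the secondary data concrete, the sketch below ranks a few advisory records against a query using bag-of-words cosine similarity. It is only an illustration: the paper does not specify a retrieval method, and the record IDs, texts, and function names (`ADVISORIES`, `retrieve`) are hypothetical placeholders rather than real CVE entries or the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical advisory records (placeholders, not real CVE/CWE entries).
ADVISORIES = [
    {"id": "ADV-0001", "text": "SQL injection in login form allows authentication bypass"},
    {"id": "ADV-0002", "text": "Path traversal in file upload handler exposes server files"},
    {"id": "ADV-0003", "text": "Cross-site scripting in comment rendering template"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, records, top_k=2):
    """Return up to top_k records with non-zero similarity, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(r["text"]))), r) for r in records]
    scored = [(score, r) for score, r in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]


if __name__ == "__main__":
    for hit in retrieve("file upload handler writes user files", ADVISORIES):
        print(hit["id"], hit["text"])
```

A production system would more plausibly use dense embeddings and a vector index; the token-overlap scoring here is swapped in only because it runs without external dependencies.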

C. System Architecture

The system includes:

Directory Scanner: Extracts and analyzes project structure.
LLM-Based Summarizer: Generates draft documentation with basic security insights.
RAG-Powered Retrieval: Fetches real-time external data for contextual relevance.
Report Synthesis Engine: Compiles final documentation in multiple formats (Markdown, HTML, PDF).
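
As a rough illustration of the first component, a directory scanner could be sketched as follows. The metadata fields, the ignore list `SKIP_DIRS`, and the function name `scan_directory` are assumptions made for this example; the paper does not describe which metadata the scanner extracts.

```python
import os
from collections import Counter
from pathlib import Path

# Assumed ignore list; a real scanner would likely make this configurable.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def scan_directory(root):
    """Walk a project tree and collect basic per-file metadata."""
    root = Path(root)
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            files.append({
                "path": path.relative_to(root).as_posix(),
                "extension": path.suffix.lower(),
                "size_bytes": path.stat().st_size,
            })
    extensions = Counter(f["extension"] for f in files)
    return {
        "root": str(root),
        "file_count": len(files),
        "extensions": dict(extensions),
        "files": files,
    }


if __name__ == "__main__":
    summary = scan_directory(".")
    print(summary["file_count"], "files;", summary["extensions"])
```

The resulting dictionary is the kind of structured input that could be passed to the summarizer as prompt context.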

D. Data Analysis

Qualitative: User studies and expert reviews.
Quantitative: BLEU/ROUGE (text quality), Flesch score (readability), precision/recall/F1 (vulnerability detection).
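
Two of the quantitative measures can be stated compactly in code. The sketch below computes precision, recall, and F1 for a set of reported findings against a ground-truth set, and the standard Flesch Reading Ease formula from pre-counted words, sentences, and syllables. The function names and the example finding labels are illustrative assumptions; BLEU and ROUGE are not covered here.

```python
def detection_scores(predicted, actual):
    """Precision, recall, and F1 for reported findings versus ground truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease; higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


if __name__ == "__main__":
    # Hypothetical finding labels, used only to exercise the functions.
    reported = {"sql-injection", "xss", "hardcoded-secret"}
    ground_truth = {"xss", "hardcoded-secret", "path-traversal", "weak-hash"}
    print(detection_scores(reported, ground_truth))
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
```

Counting syllables automatically requires a heuristic or a pronunciation dictionary, so the readability function takes the counts as inputs rather than raw text.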

E. Workflow

Steps:

User initiates analysis with a project directory.
Directory Scanner collects project data.
LLM generates preliminary docs.
RAG fetches external knowledge.
Enhanced content is synthesized and formatted.
Users provide feedback for iterative improvements.
Final output includes embedded references and actionable insights.
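
The steps above can be sketched as a small pipeline. Every function here is a hypothetical stand-in: `scan` returns fixed metadata instead of reading a directory, `generate_draft` replaces the LLM call, and `fetch_external` replaces the RAG retrieval step. The sketch only shows how the stages might hand data to one another, under those assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    project: str
    draft: str
    references: list = field(default_factory=list)

    def to_markdown(self):
        """Render the report with its references embedded at the end."""
        lines = [f"# Documentation report: {self.project}", "", self.draft, "", "## References"]
        lines += [f"- {ref}" for ref in self.references] or ["- (none found)"]
        return "\n".join(lines)


def scan(project_dir):
    # Stand-in for the Directory Scanner: returns fixed, hypothetical metadata.
    return {"project": project_dir, "files": ["app.py", "requirements.txt"]}


def generate_draft(metadata):
    # Stand-in for the LLM summarizer: a real system would call a model here.
    names = ", ".join(metadata["files"])
    return f"Project contains {len(metadata['files'])} files: {names}."


def fetch_external(metadata):
    # Stand-in for RAG retrieval: emits a placeholder for dependency manifests.
    return [f"advisory-placeholder for {name}"
            for name in metadata["files"] if name == "requirements.txt"]


def run_pipeline(project_dir, feedback=None):
    """Scan, draft, retrieve, apply optional reviewer feedback, and build the report."""
    metadata = scan(project_dir)
    draft = generate_draft(metadata)
    references = fetch_external(metadata)
    if feedback:
        draft += f"\n\nReviewer note: {feedback}"
    return Report(metadata["project"], draft, references)


if __name__ == "__main__":
    print(run_pipeline("demo-project", feedback="Clarify setup steps.").to_markdown())
```

Keeping each stage behind its own function means a stub can later be replaced by a real scanner, model call, or retriever without changing `run_pipeline`.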

IV. Expected Outcomes and Impact

Security-Enriched Documentation: Real-time integration with vulnerability databases ensures up-to-date, actionable insights.
Automation and Time Savings: Reduces manual effort in maintaining accurate documentation.
Better Vulnerability Awareness: Developers receive contextual remediation steps, improving response times.
Explainability and Trust: Traceable sources (e.g., CVE links, GitHub issues) build transparency.
Dynamic Updates: The system automatically adapts to code and external changes.
Enhanced Audits and Compliance: Easier for regulated industries to maintain audit-ready documents.
Improved Collaboration: Unified documents bridge silos and promote knowledge sharing.
Scalability Across Domains: The modular design supports other sectors (e.g., construction, legal, healthcare).

V. Timeline

A Gantt chart (not shown here) outlines the phased development and evaluation schedule.

VI. Conclusion

The research proposes an LLM-RAG-powered documentation framework that combines internal project data with external security and development knowledge to generate dynamic, context-rich, and actionable software documentation. It addresses major gaps in existing tools by improving security insights, reducing manual effort, and fostering collaboration, with promising cross-domain applicability.


Copyright

Copyright © 2025 Utkarsh Rajendra Pingale, Irfan Ajmer Pasha Shaikh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET71261

Publish Date : 2025-05-19

ISSN : 2321-9653

Publisher Name : IJRASET
