Abstract
Automated code review is fundamental to software quality assurance, yet existing approaches rely either on deterministic static analysis tools that lack pedagogical depth or on large language models (LLMs) that risk generating unverified or hallucinated feedback. To address these limitations, this paper introduces an agentic AI architecture for Python code review that decomposes the review process into specialized, collaborative agents with distinct reasoning roles. The proposed framework employs a modular multi-agent pipeline consisting of: (i) a static analysis agent responsible for deterministic detection of syntactic errors, code smells, and security vulnerabilities; (ii) an interpretation agent that transforms validated diagnostic outputs into structured, beginner-oriented explanations using a grounded LLM; (iii) a scoring agent that quantitatively evaluates code quality and assigns skill levels; and (iv) a learning agent that generates adaptive practice recommendations based on identified knowledge gaps. By enforcing role separation and constraining generative reasoning to verified analytical outputs, the architecture reduces hallucination risk while preserving the expressiveness and instructional richness of LLM-based feedback. Unlike monolithic LLM-driven systems, the proposed agentic design is built for modularity, interpretability, reproducibility, and extensibility. Experimental evaluation on curated Python programming tasks indicates that the multi-agent collaboration model improves feedback reliability, conceptual clarity, and review efficiency compared to standalone static analysis tools and single-agent LLM baselines. The results highlight the potential of agent-oriented AI architectures for intelligent tutoring systems and next-generation AI-assisted software engineering workflows.
Introduction
This paper proposes an agentic AI system for automated Python code review that improves on traditional static analysis tools and standalone large language models (LLMs) by combining reliability, interpretability, and educational support.
Manual code review can be thorough, but it is slow and varies from reviewer to reviewer. Static analysis tools are fast and deterministic, but they report issues without explaining why they matter or how to fix them. LLM-based code assistants provide helpful natural-language feedback, but hallucinations and run-to-run variation in their outputs make them unreliable.
To address these limitations, this paper introduces a multi-agent architecture that splits the code review process into four specialized roles:
(i) a static analysis agent that detects issues deterministically;
(ii) an LLM-based interpretation agent that explains verified issues in simple language;
(iii) a scoring agent that evaluates code quality and skill level; and
(iv) a learning agent that provides personalized improvement guidance.
The key idea is to ground LLM outputs in verified static analysis results, reducing hallucinations while keeping explanations useful and human-friendly. This makes the system more trustworthy, reproducible, and suitable for both professional and educational use.
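The division of roles and the grounding idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation evaluated in this paper: the rule names, penalty weights, skill thresholds, topic mapping, and the `template_explainer` function standing in for the LLM are all assumptions made for the example. What it is meant to show is that the explainer is invoked only on findings produced by the deterministic analyzer, so it never gets the chance to comment on issues that were not verified.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str      # hypothetical rule identifier
    line: int
    message: str


def static_analysis_agent(source: str) -> list[Finding]:
    """Deterministically detect a few issues with the standard-library ast module."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [Finding("syntax-error", exc.lineno or 0, exc.msg)]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(Finding("bare-except", node.lineno,
                                    "a bare 'except:' hides unexpected errors"))
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
              and node.func.id == "eval"):
            findings.append(Finding("eval-call", node.lineno,
                                    "eval() can execute arbitrary code"))
    # Sort so that the same input always yields the same ordered output.
    return sorted(findings, key=lambda f: (f.line, f.rule))


def template_explainer(finding: Finding) -> str:
    """Stand-in for the LLM call (assumption): explains one verified finding."""
    return f"Line {finding.line}: {finding.message}."


def interpretation_agent(findings, explain=template_explainer):
    """Grounding: `explain` only ever sees findings the analyzer produced."""
    return {f: explain(f) for f in findings}


# Assumed penalty weights and practice topics, chosen only for illustration.
PENALTIES = {"syntax-error": 50, "eval-call": 30, "bare-except": 10}
TOPICS = {
    "syntax-error": "Python syntax basics",
    "eval-call": "safe input evaluation",
    "bare-except": "exception handling",
}


def scoring_agent(findings):
    """Map findings to a 0-100 score and a coarse skill level."""
    score = max(0, 100 - sum(PENALTIES.get(f.rule, 5) for f in findings))
    if score >= 90:
        level = "advanced"
    elif score >= 60:
        level = "intermediate"
    else:
        level = "beginner"
    return score, level


def learning_agent(findings):
    """Recommend practice topics for the rules that fired."""
    return sorted({TOPICS[f.rule] for f in findings if f.rule in TOPICS})


def review(source: str, explain=template_explainer) -> dict:
    """Run the four agents in sequence on one source string."""
    findings = static_analysis_agent(source)
    score, level = scoring_agent(findings)
    return {
        "findings": findings,
        "explanations": interpretation_agent(findings, explain),
        "score": score,
        "level": level,
        "practice": learning_agent(findings),
    }
```

Calling `review(source)` on a string of Python code returns the findings, one explanation per finding, the score and level, and the practice topics. Because every stage after the first consumes only the `Finding` list, swapping the template explainer for a real LLM call would change the wording of the explanations but not which issues are reported or how they are scored.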
The literature review traces the evolution of code review from manual inspection to static analysis tools, then to LLM-based assistants, and most recently to emerging agentic AI systems. It identifies a gap: existing solutions rarely combine deterministic correctness with pedagogical explanation and adaptive learning.
Conclusion
This paper presented PyReview, a multi-agent framework for automated Python code review that integrates deterministic static analysis with grounded large language model (LLM) interpretation, quantitative scoring, and adaptive learning guidance. By decomposing the review process into specialized agents, the proposed architecture balances symbolic reliability with generative expressiveness, reducing hallucination risk while enhancing interpretability.
Experimental results demonstrate consistent diagnostic detection, grounded and beginner-friendly explanations, reproducible scoring, and personalized improvement recommendations. Compared to standalone static analysis tools and monolithic LLM-based systems, the agentic framework improves trustworthiness, traceability, and pedagogical value.
Although limitations remain in static rule coverage and generative uncertainty, the proposed architecture provides a scalable and reliable foundation for AI-assisted code review and intelligent tutoring systems. Future work will focus on expanding semantic analysis capabilities, large-scale empirical validation, and adaptive learning optimization to further enhance system robustness and educational impact.