Abstract
AutoTestGen is a privacy-preserving, multi-language framework that automates unit-test synthesis and iterative repair for Java, Python, and JavaScript projects using local large language models (LLMs). The system addresses three practical limitations of current test generators: single-ecosystem focus, brittle outputs that require significant developer edits, and reliance on remote cloud inference. AutoTestGen combines a language-agnostic orchestration core with per-language adapters (JUnit/pytest/Jest), a deterministic prompt/sanitization layer, and a compile–run–repair feedback loop that analyzes compile-time and runtime diagnostics to refine test generation automatically. The pipeline performs static code inspection to build intent descriptors, issues framework-aware prompts to an on-premise LLaMA-family model, sanitizes and normalizes generated code (imports, signatures, module layout), and repeatedly regenerates until a configurable success criterion is met or retries are exhausted. We evaluate AutoTestGen on representative Java, Python, and JavaScript modules and report improvements in first-pass compilation validity and final pass rates, along with measurable coverage uplift. Results show that iterative repair increases compilation success by approximately 9 percentage points and yields high-quality, assertion-rich tests requiring minimal manual edits. The design emphasizes reproducibility, CI friendliness, and privacy, making AutoTestGen suitable for enterprise and research contexts where code confidentiality is important.
Introduction
Software testing is essential for ensuring reliability, maintainability, and correctness, and unit testing is a critical part of it. Manual unit-test creation is labor-intensive, which has driven the adoption of automated tools such as EvoSuite [1], Randoop [5], Pynguin [7], and Diffblue Cover [10]. While effective at exercising code, these tools struggle with semantic accuracy and often require manual intervention. Recent advances with large language models (LLMs) enable semantic-aware test generation [12], [13], but such systems typically rely on cloud-based infrastructure, raising privacy and cost concerns.
AutoTestGen addresses these limitations by providing a local, privacy-preserving, LLaMA-based framework for multi-language automated unit-test generation. It uses a compile–run–repair feedback loop to iteratively refine tests until all validation checks succeed or a configurable retry budget is exhausted, improving compilation validity, test reliability, and assertion quality. The system extracts code metadata, builds structured test intents, generates framework-specific prompts, synthesizes tests via a local LLM, sanitizes the generated code, executes the tests, and repairs errors automatically, as sketched below.
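To make the feedback loop concrete, the following is a minimal Python sketch of how the pytest adapter could be realized against a local Ollama endpoint (POST /api/generate, with streaming disabled via "stream": false [11]). The helper names (sanitize, compile_run, generate_tests), the prompt wording, and the retry budget are illustrative assumptions, not AutoTestGen's actual API.

import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def generate(prompt: str, model: str = "llama3") -> str:
    # One non-streaming completion; "stream": false returns a single JSON object [11].
    # Temperature 0 supports the deterministic-output goal described above.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False,
                       "options": {"temperature": 0}}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def sanitize(raw: str) -> str:
    # Minimal normalization: drop Markdown fences the model may wrap code in.
    lines = [ln for ln in raw.splitlines() if not ln.strip().startswith("```")]
    return "\n".join(lines) + "\n"

def compile_run(test_path: str) -> tuple[bool, str]:
    # Hypothetical pytest runner: execute the generated file, return (passed, diagnostics).
    proc = subprocess.run(["pytest", test_path, "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def generate_tests(source_code: str, test_path: str, max_retries: int = 3) -> bool:
    # Compile-run-repair loop: regenerate until checks pass or retries are exhausted.
    prompt = "Write pytest unit tests for this code:\n" + source_code
    for _ in range(max_retries):
        test_code = sanitize(generate(prompt))
        with open(test_path, "w") as f:
            f.write(test_code)
        passed, diagnostics = compile_run(test_path)
        if passed:
            return True
        # Feed compile/runtime diagnostics back into the next generation round.
        prompt = ("These pytest tests failed with:\n" + diagnostics +
                  "\nRepair the tests:\n" + test_code)
    return False

In the full system, the per-language adapters would swap the pytest runner for JUnit or Jest runners behind the same compile-run interface, leaving the orchestration loop unchanged.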
Evaluation results show that AutoTestGen outperforms traditional and baseline LLM-based methods, achieving 95% compilation validity, a 91% pass rate, a 9–11% coverage uplift across Java, Python, and JavaScript, and a 60% reduction in manual edits. Its local, deterministic operation ensures reproducibility and suitability for CI/CD pipelines while preserving data privacy.
Conclusion
This paper presented AutoTestGen, a novel LLaMA-based automated test generation framework that supports multiple programming languages while preserving data privacy through local inference. By combining a compile–run–repair feedback loop with language-specific adapters, AutoTestGen achieves high compilation validity, improved coverage, and reproducible results without cloud dependency. Experimental results show significant improvements in both compilation and pass rates compared with conventional and LLM-based baselines. Its modular design and deterministic output make AutoTestGen suitable for enterprise environments, CI/CD integration, and academic research.
References
[1] G. Fraser and A. Arcuri, “EvoSuite: Automatic Test Suite Generation for Object-Oriented Software,” in Proc. ESEC/FSE, 2011.
[2] G. Fraser and A. Arcuri, “Whole Test Suite Generation,” IEEE Trans. Softw. Eng., vol. 39, no. 2, pp. 276–291, 2013.
[3] J. M. Rojas, G. Fraser, and A. Arcuri, “A Detailed Investigation of the Effectiveness of Whole Test Suite Generation,” Empirical Softw. Eng., 2016.
[4] G. Fraser and A. Arcuri, “Achieving Scalable Mutation-Based Generation of Whole Test Suites,” Empirical Softw. Eng., 2014.
[5] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-Directed Random Test Generation,” in Proc. ICSE, 2007.
[6] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Randoop: Feedback-Directed Random Testing for Java,” technical report/extended paper, 2007.
[7] S. Lukasczyk and G. Fraser, “Pynguin: Automated Unit Test Generation for Python,” arXiv:2202.05218, 2022.
[8] M. Harman et al., “Deploying Search Based Software Engineering with Sapienz at Facebook,” deployment/case-study paper, 2018.
[9] Facebook Engineering, “Sapienz: Intelligent Automated Software Testing at Scale,” engineering blog, May 2018.
[10] Diffblue Ltd., “Diffblue Cover: AI for Java Unit Test Generation,” product documentation, 2025.
[11] Ollama, “API Reference: POST /api/generate (streaming can be disabled via "stream": false),” 2025.
[12] Z. Li et al., “An Empirical Study of Unit Test Generation with Large Language Models,” arXiv:2406.18181, 2024.
[13] Y. Liu et al., “Test Intention Guided LLM-Based Unit Test Generation (IntUT),” in Proc. ICSE, 2025.
[14] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” in Proc. NeurIPS, 2020.
[15] Z. Zhang, H. Li, Y. Wang, and Z. Jin, “LLM-Based Unit Test Generation via Property Retrieval,” arXiv:2410.13542, 2024.
[16] Additional EvoSuite studies and environment notes (background reading): “On the Effectiveness of Whole Test Suite Generation,” SSBSE, 2014; “Automated Unit Test Generation for Classes with Environment Dependencies,” ASE, 2014.
[17] Supplementary Sapienz sources (industry context): resource management and large-scale testing at Facebook, engineering notes, 2017–2018.