The deployment of large language models (LLMs) in production systems has exposed a fundamental gap between the rapid advancement of model capabilities and the maturity of the engineering practices used to govern them. Unlike conventional software, LLMs are probabilistic, non-deterministic, and increasingly embedded in safety-critical, multi-objective, and multilingual environments where standard quality assurance techniques are insufficient.
This paper surveys the key engineering challenges in production LLM deployment—covering evaluation methodology, lifecycle discipline, agent-level robustness, and linguistic coverage—and proposes a unified four-stage pipeline that integrates acceptance test-driven development, Pareto-based multi-objective evaluation, and rigorous language-specific benchmarking. We examine how these approaches address complementary failure modes: misalignment between model behaviour and business requirements, collapse of multi-dimensional performance to inadequate scalar metrics, and silent degradation in non-English deployment settings.
The proposed pipeline draws on engineering principles that are well-established in software development but largely absent from current LLM practice, and we demonstrate how their adoption yields more comprehensive safety coverage, more transparent trade-off analysis, and more reliable multilingual quality assurance than benchmark-centric approaches alone.
Introduction
Large language models (LLMs) are increasingly deployed in high-stakes, real-world applications, but conventional software testing and static benchmarks (such as GLUE or BIG-Bench [2]) are insufficient for evaluating probabilistic, generative models. This paper highlights three complementary frameworks that address this gap: ATDLLMD [5] integrates acceptance test-driven development into the LLM lifecycle for business-aligned, continuous evaluation; MO-PTSP [11] provides a multi-objective agent benchmarking environment that captures realistic trade-offs; and CalamanCy [16] demonstrates rigorous multilingual evaluation built on carefully annotated datasets.
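To make the acceptance-test notion concrete, consider a minimal sketch of an executable acceptance test for an LLM feature. This is an illustrative example rather than ATDLLMD's own harness; the generate client, the business requirement, and the thresholds are assumed for exposition only:

# Minimal sketch of an executable acceptance test for an LLM feature.
# `generate` stands in for whatever client the deployment actually uses;
# the requirement, prompt, and thresholds below are hypothetical.

def generate(prompt: str) -> str:
    """Placeholder for the deployed model's generation call."""
    raise NotImplementedError("wire this to the production LLM client")

def test_refund_answer_states_window_and_stays_short():
    # Business requirement (hypothetical): refund questions must state the
    # 30-day window and stay under 100 words so agents can paste the reply.
    answer = generate("A customer asks: how long do I have to return a purchase?")
    assert "30 days" in answer, "acceptance criterion: refund window must be stated"
    assert len(answer.split()) < 100, "acceptance criterion: reply must be concise"

Under ATDLLMD's red-green-refactor discipline, such tests are written before prompt or model changes and re-run continuously, so a regression against a business requirement fails visibly rather than silently.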
Together, these frameworks share key engineering principles: specification before evaluation, multi-dimensional performance metrics, and iterative, continuous refinement. They can be unified into a four-stage deployment pipeline: (1) specification of acceptance tests and annotation guidelines, (2) development through iterative red-green-refactor cycles, (3) multi-objective agent evaluation with Pareto analysis, and (4) continuous post-deployment monitoring. This approach helps ensure that LLM deployments are trustworthy, aligned with stakeholder objectives, robust under conflicting requirements, and linguistically inclusive.
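As an illustrative sketch of stage (3), the core of Pareto analysis reduces to a non-dominated filter over candidate deployment configurations scored on several objectives. The objective names and scores below are hypothetical and are not drawn from MO-PTSP:

# Illustrative sketch only: a non-dominated (Pareto) filter over candidate
# deployment configurations. Objective names and scores are hypothetical;
# all objectives are treated as maximised.

from typing import Dict, List

Candidate = Dict[str, float]  # objective name -> score (higher is better)

def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one."""
    keys = a.keys()
    return all(a[k] >= b[k] for k in keys) and any(a[k] > b[k] for k in keys)

def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

if __name__ == "__main__":
    # Hypothetical configurations traded off on task quality, safety, and latency
    # (latency expressed as a negated cost so that higher is better).
    configs = [
        {"quality": 0.82, "safety": 0.97, "neg_latency": -1.2},
        {"quality": 0.88, "safety": 0.91, "neg_latency": -0.9},
        {"quality": 0.80, "safety": 0.90, "neg_latency": -2.5},  # dominated
    ]
    for c in pareto_frontier(configs):
        print(c)

Reporting the surviving frontier, rather than a single scalar aggregate, is what makes the trade-offs between conflicting objectives visible to deployment teams.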
Practical implications include improved business-aligned outcomes, principled multi-objective optimisation, and detection of language-specific degradation. Limitations include the resource demands of stakeholder engagement and domain-specific calibration, as well as the availability of native-language annotators. Future directions include automating test generation, modelling stochastic environments, and scaling multilingual evaluation to a broader range of NLP tasks.
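The following sketch illustrates how language-specific degradation can be surfaced through disaggregated reporting; the metric values, language codes, and tolerance are hypothetical and are not taken from CalamanCy:

# Illustrative sketch only: disaggregated per-language monitoring.
# Metric values, language codes, and the tolerance are hypothetical.

from typing import Dict, List

def flag_language_regressions(
    baseline: Dict[str, float],   # language code -> baseline metric (e.g. F1)
    current: Dict[str, float],    # language code -> current metric
    tolerance: float = 0.02,      # maximum acceptable absolute drop
) -> List[str]:
    """Return the languages whose metric dropped by more than `tolerance`.
    A language missing from `current` is treated as a regression."""
    return [lang for lang, base in baseline.items()
            if base - current.get(lang, 0.0) > tolerance]

if __name__ == "__main__":
    baseline = {"en": 0.91, "tl": 0.78, "ceb": 0.74}
    current  = {"en": 0.92, "tl": 0.71, "ceb": 0.73}  # Tagalog regresses
    print(flag_language_regressions(baseline, current))  # -> ['tl']

An aggregate score over these three languages would barely move, which is precisely why per-language disaggregation is needed to catch the silent Tagalog regression.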
Conclusion
This paper has examined three complementary contributions to the challenge of trustworthy LLM deployment. ATDLLMD [5] provides a lifecycle methodology that embeds business-aligned acceptance criteria into LLM development through the CPMAI framework and an iterative feedback loop. MO-PTSP [11] supplies a rigorous, physics-based benchmarking environment in which agents must navigate genuine objective conflicts, enabling Pareto-frontier analysis of deployment strategies. CalamanCy [16] demonstrates the annotation discipline and disaggregated reporting required to extend evaluation rigour to linguistically diverse deployment settings.
These frameworks share three foundational engineering principles—specification before evaluation, multi-dimensional performance characterisation, and continuous iterative refinement—that together define a more mature practice of LLM engineering than current benchmark-centric approaches afford. The four-stage unified pipeline synthesised from these contributions offers a practical roadmap for teams seeking to move from capability benchmarking to requirements-driven, Pareto-aware, and multilingual LLM deployment. As LLMs are deployed in increasingly high-stakes and diverse operational settings, such a roadmap represents an engineering necessity rather than a best-effort aspiration.
References
[1] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Proc. NeurIPS, 2020.
[2] A. Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models," arXiv:2206.04615, 2022.
[3] P. Liang et al., "HELM: Holistic Evaluation of Language Models," in Proc. NeurIPS Datasets and Benchmarks Track, 2022.
[4] L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Proc. NeurIPS, 2023.
[5] V. R. Parupally, "ATDLLMD: A Test-Driven Framework for Safe, Reliable, and Business-Centric LLM Development," IET Conf. Proc., vol. 2025, no. 43, pp. 612–618, 2025, doi: 10.1049/icp.2025.4778.
[6] Y. Bai et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.
[7] P. F. Christiano et al., "Deep Reinforcement Learning from Human Preferences," in Proc. NeurIPS, 2017.
[8] L. Gao et al., "A Framework for Few-Shot Language Model Evaluation," Zenodo, 2021, doi: 10.5281/zenodo.5371628.
[9] K. Beck, Test-Driven Development: By Example. Addison-Wesley, 2003.
[10] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList," in Proc. ACL, 2020.
[11] V. R. Parupally, "A Multi-Objective Game Environment for Evaluating AI Agents," in Proc. 2nd Global AI Summit – Int. Conf. on Artificial Intelligence and Emerging Technology (AI Summit), Noida, India, 2025, pp. 692–697, doi: 10.1109/AISummit66170.2025.11410745.
[12] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II," IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, 2002.
[13] Q. Zhang and H. Li, "MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition," IEEE Trans. Evol. Comput., vol. 11, no. 6, pp. 712–731, 2007.
[14] T. Brys, A. Harutyunyan, P. Vrancx, A. Nowé, and M. Taylor, "Multi-Objectivization of Reinforcement Learning Problems by Reward Shaping," in Proc. IJCNN, 2014.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proc. NAACL, 2019.
[16] V. Parupally, "CalamanCy: A Tagalog Natural Language Processing Toolkit," in Proc. IEEE Int. Conf. on Industrial Technology & Computer Engineering (ICITCE), Penang, Malaysia, 2025, pp. 45–51, doi: 10.1109/ICITCE65255.2025.11210765.
[17] A. Conneau et al., "Unsupervised Cross-lingual Representation Learning at Scale," in Proc. ACL, 2020.
[18] Y. Li et al., "AlpacaEval: An Automatic Evaluator of Instruction-Following Models," GitHub, 2023.