This paper presents a structured human-AI collaborative workflow for developing verified scientific software from a published research paper. The large language model Claude (Anthropic) was guided by the author to translate, reimplement, and optimise a nonlinear finite element analysis (FEA) program originally published in 2015.
The original program implemented a Combined Corotational-Total Lagrangian (CR-TL) formulation for large-displacement 2D beam structures in Microsoft Excel VBA. Using a domain-specific skill document and iterative prompt engineering, Claude generated an equivalent Python implementation. It subsequently identified six performance bottlenecks. The optimised implementation replaces dense matrix operations with sparse assembly, eliminates the Python element loop through vectorised NumPy einsum operations, and applies boundary conditions via a large-penalty method. Verification against NAFEMS benchmark tests NLGB2 and NLGB4 shows results matching closed-form solutions to four decimal places, with no parameter tuning or calibration. A 154-fold reduction in computation time at 1000 elements enables models with up to 5000 elements on a standard laptop. Three categories of LLM error are identified and characterised. The work demonstrates that domain-expert-guided AI collaboration can produce verified, production-quality scientific software from published numerical formulations.
Introduction
This study uses a large language model (Anthropic’s Claude) to redevelop and optimise a nonlinear finite element analysis (FEA) program for large-displacement beam structures, originally implemented in Excel VBA, as a high-performance Python library.
Applications such as offshore pipeline installation require accurate geometrically nonlinear modelling, but building such software is complex and demands deep expertise in mechanics, numerical methods, and software engineering. LLMs and tools such as GitHub Copilot are known to speed up general coding, yet their reliability in physics-heavy scientific computing remains uncertain, especially for tasks requiring strict numerical correctness.
The study reimplements a 2015 beam FEA program based on the CR-TL formulation, validates it against the NAFEMS benchmark problems NLGB2 and NLGB4, and uses a structured “skill document” within a human–AI collaboration workflow to guide development. The resulting Python version runs 154 times faster at 1000 elements and scales to about 5000 elements.
The key findings are as follows.
1) A domain-specific “skill document” is crucial for constraining LLM behaviour and encoding both equations and engineering judgement.
2) Human expertise remains essential for decisions the LLM cannot reliably make (e.g., solver choice, rotation handling, parameter tuning).
3) Mandatory benchmark verification is necessary because LLM-generated code can pass superficial tests while failing silently in critical cases.
4) LLMs can significantly accelerate scientific software development but are unreliable without strong expert oversight.
The methodology is structured as an iterative loop: specification via skill document, staged code generation, benchmark-based verification, error categorisation, optimisation guided by profiling, and continuous documentation updates.
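The benchmark-based verification step in this loop can be pictured as a pass/fail gate. The following is a minimal sketch, not the code used in the study. It assumes a hypothetical closed-form reference (a cantilever bent into a circular arc by an end moment, chosen only because its solution is well known, and not claimed to be NLGB2 or NLGB4) and a hypothetical helper `passes_benchmark` that accepts a result only when it agrees with the reference to four decimal places.

```python
import math


def cantilever_end_moment_tip(L, EI, M):
    """Closed-form tip response of a cantilever under a pure end moment M.

    Assumed reference case for illustration: the beam bends into a circular
    arc with total end rotation theta = M * L / EI.
    Returns (u, v, theta): axial displacement, transverse displacement and
    rotation of the tip.
    """
    theta = M * L / EI
    if theta == 0.0:
        return 0.0, 0.0, 0.0
    u = L * (math.sin(theta) / theta - 1.0)   # tip moves back towards the root
    v = L * (1.0 - math.cos(theta)) / theta   # transverse deflection
    return u, v, theta


def passes_benchmark(computed, reference, decimals=4):
    """True only if every value agrees with the reference to `decimals` places."""
    tol = 0.5 * 10.0 ** (-decimals)
    return all(abs(c - r) <= tol for c, r in zip(computed, reference))


# Usage: gate a (hypothetical) solver result against the closed-form values.
reference = cantilever_end_moment_tip(L=1.0, EI=1.0, M=2.0 * math.pi)
solver_result = (-1.00001, 0.00002, 6.28319)   # placeholder numbers, not real output
accepted = passes_benchmark(solver_result, reference)
```

Treating the gate as a hard requirement for every generation stage, rather than an optional check, is what separates it from superficial testing: a result that is wrong in the third decimal place is rejected even if the code runs without error.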
Error analysis shows typical LLM failure modes in scientific computing, including missing physics-critical variables (e.g., moment terms), incorrect rotation handling, and environment-dependent syntax issues that may not appear in basic tests.
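As an illustration of how a rotation-handling error can survive basic tests, consider the rigid-body rotation of an element in a 2D corotational beam. The sketch below is an assumption about what such a failure can look like, not a reproduction of the errors found in this study, and the function names are hypothetical. The chord angle returned by `atan2` jumps by 2π when the element rotates past ±π, so the deformational nodal rotation has to be wrapped back into the principal range. A small-rotation test never crosses that branch cut, which is why the unwrapped version appears correct.

```python
import math


def wrap_angle(a):
    """Map any angle to the equivalent value in [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def local_rotations(x1, y1, th1, x2, y2, th2, beta0=0.0, wrap=True):
    """Deformational end rotations of a 2D corotational beam element.

    (x1, y1), (x2, y2): current nodal positions.
    th1, th2: total nodal rotations accumulated by the solver.
    beta0: chord angle of the undeformed element.
    """
    beta = math.atan2(y2 - y1, x2 - x1)   # current chord angle, limited to [-pi, pi]
    alpha = beta - beta0                  # rigid rotation as seen through atan2
    t1, t2 = th1 - alpha, th2 - alpha
    if wrap:
        t1, t2 = wrap_angle(t1), wrap_angle(t2)
    return t1, t2


# Small rigid rotation: both variants agree, so a basic test passes either way.
small = 0.1
ok_small = local_rotations(0, 0, small, math.cos(small), math.sin(small), small, wrap=False)

# Rigid rotation past pi: the unwrapped variant reports a spurious 2*pi of bending.
big = 3.2
naive = local_rotations(0, 0, big, math.cos(big), math.sin(big), big, wrap=False)
wrapped = local_rotations(0, 0, big, math.cos(big), math.sin(big), big, wrap=True)
```

In the second case the element has only rotated rigidly, so the correct deformational rotations are zero. Only a benchmark with genuinely large rotations exposes the difference.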
Conclusion
This paper has demonstrated a structured human-AI collaborative workflow for producing verified, optimised scientific software from a published numerical formulation. Principal conclusions are as follows.
1) The skill document approach was the most important single methodological element. Encoding domain knowledge, engineering judgement, and verification criteria in a reusable reference document enabled consistent, high-quality LLM output across multiple sessions.
2) The Python implementation correctly reproduced the CR-TL formulation without calibration, matching closed-form solutions to four decimal places on both NLGB2 and NLGB4.
3) Six performance bottlenecks were identified by expert-directed profiling and resolved by the LLM, achieving a 154-fold speedup at 1000 elements through sparse matrices, vectorised einsum assembly, and a penalty method for boundary conditions.
4) Three LLM error categories were characterised: omission, syntax compatibility, and solver strategy. Mandatory benchmark verification is the essential safeguard against all three and cannot be replaced by code review or unit testing alone.
5) The methodology is generalisable to other published numerical methods with clear mathematical specifications and available benchmarks. The results establish that domain-expert-guided AI collaboration is a viable and efficient path to verified, production-quality scientific software: not a shortcut that trades correctness for speed, but a structured workflow that achieves both.
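The three optimisations named in conclusion 3 can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the study: it uses a linear Euler-Bernoulli beam element as a stand-in for the CR-TL tangent stiffness, a simple chain of elements, and hypothetical function names (`assemble_beam_stiffness`, `apply_penalty`) with an assumed penalty scale. It shows the element loop replaced by one `numpy.einsum` call, assembly into a SciPy sparse matrix, and fixed degrees of freedom imposed by a large diagonal penalty.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve


def assemble_beam_stiffness(xy, EA, EI):
    """Sparse global stiffness of a chain of 2D beam elements (linear stand-in)."""
    n_el = xy.shape[0] - 1
    d = xy[1:] - xy[:-1]                       # element chord vectors, (n_el, 2)
    L = np.hypot(d[:, 0], d[:, 1])
    c, s = d[:, 0] / L, d[:, 1] / L

    # Local stiffness of every element at once, DOF order [u1, v1, th1, u2, v2, th2].
    k = np.zeros((n_el, 6, 6))
    a, b = EA / L, EI / L**3
    k[:, 0, 0] = k[:, 3, 3] = a
    k[:, 0, 3] = k[:, 3, 0] = -a
    k[:, 1, 1] = k[:, 4, 4] = 12 * b
    k[:, 1, 4] = k[:, 4, 1] = -12 * b
    k[:, 1, 2] = k[:, 2, 1] = k[:, 1, 5] = k[:, 5, 1] = 6 * b * L
    k[:, 2, 4] = k[:, 4, 2] = k[:, 4, 5] = k[:, 5, 4] = -6 * b * L
    k[:, 2, 2] = k[:, 5, 5] = 4 * b * L**2
    k[:, 2, 5] = k[:, 5, 2] = 2 * b * L**2

    # Global-to-local rotation matrices, built per node block rather than per element.
    T = np.zeros((n_el, 6, 6))
    for o in (0, 3):
        T[:, o, o] = T[:, o + 1, o + 1] = c
        T[:, o, o + 1] = s
        T[:, o + 1, o] = -s
        T[:, o + 2, o + 2] = 1.0

    # k_global = T^T k T for all elements in a single einsum call.
    kg = np.einsum("eji,ejk,ekl->eil", T, k, T)

    # Element e couples the six DOFs of nodes e and e + 1.
    dofs = 3 * np.arange(n_el)[:, None] + np.arange(6)[None, :]
    rows = np.repeat(dofs, 6, axis=1).ravel()
    cols = np.tile(dofs, (1, 6)).ravel()
    n_dof = 3 * xy.shape[0]
    # Duplicate (row, col) entries are summed when the COO matrix is converted.
    return sp.coo_matrix((kg.ravel(), (rows, cols)), shape=(n_dof, n_dof)).tocsr()


def apply_penalty(K, f, fixed_dofs, values=0.0, penalty_scale=1e10):
    """Impose prescribed DOF values by adding a large number to the diagonal."""
    big = penalty_scale * K.diagonal().max()   # assumed scaling, not from the paper
    p = np.zeros(K.shape[0])
    p[fixed_dofs] = big
    f = f.copy()
    f[fixed_dofs] += big * values
    return (K + sp.diags(p)).tocsr(), f


# Usage: 10-element cantilever along x with a unit transverse tip load.
xy = np.column_stack([np.linspace(0.0, 1.0, 11), np.zeros(11)])
K = assemble_beam_stiffness(xy, EA=100.0, EI=1.0)
f = np.zeros(K.shape[0])
f[-2] = 1.0                                    # transverse force at the tip node
Kp, fp = apply_penalty(K, f, fixed_dofs=[0, 1, 2])
u = spsolve(Kp, fp)
```

The penalty approach leaves the matrix size and sparsity pattern unchanged, which keeps the assembly code free of row and column deletion. The price is a small constraint error that shrinks as the penalty grows, so the scale has to be chosen with the solver's floating-point precision in mind.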