The recent rise of Large Language Models (LLMs), which can generate human-like text, has drawn considerable attention to AI and its potential uses. However, most LLMs are limited to a one-dimensional, left-to-right method of decision-making that can impede their performance on tasks requiring accurate foresight and reference to previous decisions. We hypothesize that different types of LLM reasoning agents have distinct strengths and weaknesses that suit them to different strategic use cases. In this research, we aim to determine the specific use cases and strengths of various reasoning agents, which would allow LLMs to be tailored toward particular tasks through the use of such agents. With the help of reasoning agents, such as symbolic, arithmetic, and chain-of-thought reasoning, LLMs gain a deeper understanding of the context given to them and use a multi-step approach to solve problems adequately. Existing challenges in evaluating reasoning agents within LLMs include dataset biases and the potential brittleness of the models. These challenges, combined with ethical concerns surrounding reasoning agents, such as their susceptibility to amplifying biases in a response, make this a rich research area. Using a quantitative analysis of several reasoning agents within a controlled environment, we apply diverse multi-modal and iterative reasoning techniques. Through this analysis, we explore the strengths and weaknesses of these techniques, yielding a better understanding of the reasoning capabilities that can be applied to real-world scenarios and products.
Introduction
Reasoning agents enhance Large Language Models (LLMs) by improving their problem-solving abilities through structured approaches. Notable agents include:
Chain-of-Thought (CoT) prompting, which breaks reasoning into intermediate steps, boosting performance on arithmetic, commonsense, and symbolic tasks (a minimal prompting sketch follows this list).
Zero-shot reasoning, which uses simple cues like “Let’s think step by step” to encourage structured thinking without extra training.
Tree-of-Thought (ToT) prompting, which explores multiple reasoning paths to handle complex problems but at higher computational cost.
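To make the first two prompting styles concrete, the sketch below builds a few-shot CoT prompt and a zero-shot CoT prompt in Python. The `query_model` function is an assumed placeholder for whatever LLM API is used, and the worked example is purely illustrative.

```python
# Minimal sketch of CoT and zero-shot CoT prompting (assumed helper names).

def query_model(prompt: str) -> str:
    """Placeholder for an LLM completion call; swap in the API of your choice."""
    raise NotImplementedError

# Few-shot CoT: prepend a worked example whose answer spells out the steps.
COT_DEMO = (
    "Q: A farmer has 12 eggs and buys 2 cartons of 6 eggs. How many eggs does she have now?\n"
    "A: The cartons hold 2 * 6 = 12 eggs. 12 + 12 = 24. The answer is 24.\n\n"
)

def chain_of_thought(question: str) -> str:
    return query_model(COT_DEMO + f"Q: {question}\nA:")

# Zero-shot CoT: no demonstrations, just the reasoning trigger phrase.
def zero_shot_cot(question: str) -> str:
    return query_model(f"Q: {question}\nA: Let's think step by step.")
```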
This study tested various reasoning agents—Zero-shot CoT, Automatic Chain-of-Thought (Auto-CoT), Self-Consistency, and Retrieval-Augmented Generation (RAG)—on four models (Llama 2, Llama 3, ChatGPT 3.5, and Gemini 1.0 Pro) across datasets including GSM8K (math problems) and CSQA (commonsense questions).
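The evaluation itself reduces to an accuracy computation over each dataset. A sketch of such a harness is shown below; `run_agent` is an assumed dispatcher that applies one reasoning agent to one model, and the answer-extraction heuristic (take the last number in the completion) is a common GSM8K-style convention, not necessarily the exact procedure used in this study.

```python
import re

def extract_number(completion: str):
    """Heuristic: treat the last number in the completion as the predicted answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None

def accuracy(model, agent, dataset, run_agent) -> float:
    """dataset: list of {"question": str, "answer": numeric gold label} records."""
    correct = 0
    for item in dataset:
        completion = run_agent(model, agent, item["question"])
        prediction = extract_number(completion)
        if prediction is not None and float(prediction) == float(item["answer"]):
            correct += 1
    return correct / len(dataset)
```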
Key findings:
CoT significantly improved GPT-3.5 accuracy on GSM8K from 16.8% to 75.2%.
Self-Consistency generally provided the highest accuracy gains across models and datasets (a majority-vote sketch follows this list).
Auto-CoT improved performance by clustering questions and generating tailored reasoning examples (a clustering sketch follows the summary below).
Some reasoning agents (like RAG) decreased accuracy on certain models and datasets.
Performance varied widely depending on the model and reasoning agent, highlighting the importance of matching agents to task types.
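Self-Consistency's gains come from sampling several independent reasoning chains and voting on their final answers. A minimal sketch, assuming `sample_cot` (one stochastic chain-of-thought completion) and `extract_answer` (pulls the final answer out of a chain) as helpers:

```python
from collections import Counter

def self_consistency(question: str, sample_cot, extract_answer, n_samples: int = 10):
    """Sample n reasoning chains at non-zero temperature and majority-vote the answers."""
    answers = []
    for _ in range(n_samples):
        chain = sample_cot(question)        # one sampled chain-of-thought
        answer = extract_answer(chain)      # e.g., the number after "The answer is"
        if answer is not None:
            answers.append(answer)
    # Most common final answer wins; ties resolve to the first one encountered.
    return Counter(answers).most_common(1)[0][0] if answers else None
```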
Overall, reasoning agents substantially boost LLM effectiveness on complex reasoning tasks, with Self-Consistency and CoT showing the most promise for enhancing accuracy and reliability.
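As noted in the findings, Auto-CoT builds its own demonstrations rather than relying on hand-written ones: it clusters the question pool, picks a representative question per cluster, and lets the model generate a zero-shot rationale for each. A rough sketch under assumed helpers (`embed` for sentence embeddings, `zero_shot_cot` for the generated rationale):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_auto_cot_demos(questions, embed, zero_shot_cot, k: int = 4) -> str:
    """Cluster questions, solve one representative per cluster, return a demo prompt."""
    vectors = np.array([embed(q) for q in questions])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    demos = []
    for c in range(k):
        # Representative question = the one closest to the cluster centroid.
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        representative = questions[members[np.argmin(dists)]]
        rationale = zero_shot_cot(representative)   # model-generated reasoning
        demos.append(f"Q: {representative}\nA: {rationale}")
    return "\n\n".join(demos)  # prepend this block to each new question
```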
Conclusion
Across all tested models (Llama-2, Llama-3, Gemini 1.0, Gemini 1.5, and Llama 3 8B), the structured reasoning methods Auto-CoT and Self-Consistency consistently improved accuracy over the base models. Self-Consistency improved the performance of Llama-2, Llama-3, Gemini 1.0, and Llama 3 8B on the GSM8K dataset and improved the performance of Llama-2, Gemini 1.0, and Llama-3 on the CSQA dataset. With significant accuracy gains across both datasets, Self-Consistency demonstrates an aptitude for structured mathematical reasoning as well as commonsense, open-ended reasoning, suggesting strong reasoning capability and adaptability across diverse problem types.
Auto-CoT performed well on the GSM8K dataset: Llama-2 saw a sizable improvement in accuracy, while Gemini 1.0 and Llama 3.1 8B saw large boosts. Only on Llama 3 did Auto-CoT see a dip in performance, indicating that across most LLMs, Auto-CoT improves the model's structured mathematical reasoning. On the CSQA dataset, Auto-CoT improved the accuracy of Llama-3 slightly, improved the accuracy of Llama-2 by a drastic margin, and saw a sizable dip on Gemini 1.0, which indicates that, in general, Auto-CoT boosts an LLM's capacity for open-ended reasoning. Auto-CoT's propensity for improving both mathematical and open-ended reasoning makes it reliable across diverse problem types on most LLMs.
Zero-shot CoT, on the other hand, showed inconsistent performance on the GSM8K dataset depending on the LLM. With a sizable decrease in accuracy on Llama-2, a negligible decrease on Llama 3, and large increases on Gemini 1.0 and Llama 3.1 8B, Zero-shot CoT is inconsistent in its ability to apply structured, step-by-step mathematical reasoning, potentially due to limitations in problem decomposition or numerical manipulation. Conversely, on the CSQA dataset, Zero-shot CoT performed much better, with a negligible dip in performance on Gemini 1.0 and sizable increases in accuracy on Llama-2, Llama-3, and Llama 3.1 8B. This shows that Zero-shot CoT improves commonsense reasoning in most models, though individual models may have inherent limitations.
Lastly, RAG performed poorly regardless of the dataset: on both GSM8K and CSQA, Llama-2 and Llama-3 saw sizable decreases in accuracy. This suggests that RAG's approach is ineffective for both mathematical and commonsense reasoning tasks, potentially due to RAG's lack of generalization.