The introduction of powerful Large Language Models (LLMs) such as ChatGPT promises a future of capable conversational systems in diverse circumstances. As these models evolve to accept an ever broader range of inputs, including text, images, audio, and video, the problem of crafting clear and representative prompts becomes increasingly pressing. This article assesses the quality, coherence, and depth of the answers that ChatGPT generates in a multimodal setting, with visual interpretation of the data carried out in Tableau. The paper examines ChatGPT's responses to real-world technical questions involving image recognition, text manipulation, and emotion detection. The central question of the experiment is whether the adequacy of the model's responses is determined by the accuracy and clarity of the user's prompts or by the nature of the task itself. Experimental evidence indicates that ChatGPT's performance varies with both how questions are formulated and the domain under discussion. The paper further explores visualization tools as a way of understanding model behavior, using Tableau to support a concise, less text-heavy assessment. Overall, this contribution helps to clarify the possibilities and limits of LLMs in real multimodal applications and emphasizes the impact of prompt design on the quality of AI-assisted results.
Introduction
This paper explores the rapid evolution of Conversational Artificial Intelligence (AI), emphasizing its transformation of human-computer interaction through natural language, audio, and visual inputs. Technologies like ChatGPT, AudioGPT, Whisper, and DALL·E are enabling multimodal AI systems that integrate text, speech, and images to perform complex, emotionally aware tasks in fields such as healthcare, education, and personal assistance.
A central focus is on Human-in-the-Loop (HITL) systems, where human input refines AI behavior and ethical decision-making. Prompt engineering is also highlighted as crucial for optimizing large language model (LLM) outputs. Despite progress, challenges remain in emotional intelligence, cultural sensitivity, scalability, and ethical frameworks.
The research aims to evaluate how prompt types (e.g., educational, reflective, visual, or auditory) and response metrics (e.g., clarity, relevance, emotion detection) affect the performance of multimodal models. Using tools like Tableau for visual analytics, the study assesses 100 human-rated prompts across different LLMs to identify patterns in performance and accuracy.
The methodology includes the following steps (a minimal pipeline sketch follows this list):
Categorizing prompts by modality (text, image, audio).
Scoring model outputs manually on metrics like accuracy, emotional recognition, and fidelity.
Analyzing data using Tableau visualizations (e.g., bar charts, box plots, and scatter plots) to understand model behavior.
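To make this pipeline concrete, the following minimal Python sketch tags prompts by modality, attaches manual 0-10 ratings, and exports per-modality aggregates as a CSV that Tableau can consume. The record fields, metric names, and output file name are illustrative assumptions, not the paper's actual schema:

import pandas as pd

# Hypothetical records: each prompt is tagged with a modality and
# scored manually on 0-10 metrics (metric names are illustrative).
records = [
    {"prompt_id": 1, "modality": "text",  "accuracy": 9, "emotion": 7, "fidelity": 8},
    {"prompt_id": 2, "modality": "image", "accuracy": 6, "emotion": 5, "fidelity": 7},
    {"prompt_id": 3, "modality": "audio", "accuracy": 4, "emotion": 6, "fidelity": 5},
]
df = pd.DataFrame(records)

# Mean score per modality and metric; this table is the input a
# Tableau bar chart or box plot would be built on.
summary = df.groupby("modality")[["accuracy", "emotion", "fidelity"]].mean()
summary.to_csv("modality_scores.csv")  # assumed output file name
print(summary)

In the full study, the 100 rated prompts would of course be loaded from a spreadsheet or database rather than an in-line list.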
Key models used (a routing sketch follows this list):
ChatGPT (GPT-4) for text.
Whisper for speech recognition.
DALL·E for image generation.
AudioGPT for emotion detection and multimodal synthesis.
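A minimal routing sketch for these models is given below. It assumes the official OpenAI Python SDK and its hosted endpoints for GPT-4, DALL·E 3, and Whisper; the run_prompt helper and the model identifiers are illustrative choices rather than the paper's actual harness, and AudioGPT, which has no comparable hosted endpoint, is left as a stub:

from typing import Optional
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_prompt(modality: str, prompt: str, audio_path: Optional[str] = None) -> str:
    """Route one prompt to the model matching its modality (sketch only)."""
    if modality == "text":
        # ChatGPT (GPT-4) for text prompts.
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if modality == "image":
        # DALL·E for image generation; returns a URL to the generated image.
        resp = client.images.generate(model="dall-e-3", prompt=prompt)
        return resp.data[0].url
    if modality == "audio":
        # Whisper for speech recognition.
        with open(audio_path, "rb") as f:
            resp = client.audio.transcriptions.create(model="whisper-1", file=f)
        return resp.text
    # AudioGPT (emotion detection and multimodal synthesis) exposes no
    # comparable hosted endpoint, so it is not sketched here.
    raise NotImplementedError(f"unsupported modality: {modality}")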
The study proposes a Multimodal Prompt Evaluation Framework and mathematically models response scores on a scale of 0–10, aiming to improve multimodal AI's transparency, accuracy, and emotional sensitivity in real-world tasks.
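The scoring model itself is not reproduced in this summary; one plausible formalization, assuming each response $i$ is rated by human evaluators on a fixed metric set $M$ (accuracy, emotional recognition, fidelity), takes the per-prompt score as the mean rating

S_i = \frac{1}{|M|} \sum_{m \in M} r_{i,m}, \qquad r_{i,m} \in [0, 10],

which keeps $S_i$ on the same 0-10 scale; per-modality performance can then be compared by averaging $S_i$ over the prompts in each modality.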
Conclusion
This paper has compared five state-of-the-art multimodal AI systems using 100 carefully designed prompts across three domains (text, image, and audio), administered through ChatGPT. The findings indicate that prompt engineering is vital to output quality, especially for complicated tasks that involve emotion recognition or abstract visual analysis. Notably, model performance differed dramatically, particularly on audio-based prompts, revealing issues with response stability. This variability was clearly visible in the box plots and line graphs produced in Tableau, which underlines why modality alignment is critical for consistent and accurate output. These results also underscore the importance of strategic prompt design and considered modality selection. Refining prompt patterns for real-world use could considerably improve the efficacy and reliability of conversational AI systems.