This paper presents an integrated multimodal robotic system that combines state-of-the-art Large Language Models (LLMs) with perception and control mechanisms to enable complex task execution and natural human-robot interaction. Current robotic implementations, which rely predominantly on rigid programming paradigms, adapt poorly to complex real-world scenarios. Our architecture addresses these constraints through a framework that leverages the contextual reasoning capabilities of multimodal LLMs. The system incorporates models such as GPT-4o and Gemini 2.0 Flash for linguistic interpretation and environmental understanding, working in conjunction with object detectors such as YOLO and Grounding DINO to achieve robust situational awareness. Following validation in PyBullet simulation, we deployed the framework on a physical platform built on Raspberry Pi 5 hardware with ROS 2 integration. Experimental evaluations show that the system processes complex directives, navigates challenging environments, and executes manipulation tasks according to user specifications, outperforming conventional rigidly programmed approaches. This work establishes a foundation for next-generation autonomous systems with applications spanning industrial automation, healthcare assistance, and adaptive support technologies.
Introduction
Our goal is to build intelligent robotic systems capable of autonomous, complex interaction in dynamic human environments. Traditional robots rely on fixed programming and lack adaptability; current AI approaches, including Large Language Models (LLMs) and multimodal perception, offer new opportunities but still face challenges in natural language understanding, sensory integration, and real-time operation on resource-limited hardware.
Key Challenges:
Integrating natural language, vision, and other sensors for dynamic task execution.
Operating robustly on low-power platforms like Raspberry Pi 5.
Moving beyond simulation to real-world validation.
Contributions:
Integrated Multimodal Architecture: Combines LLMs (e.g., GPT-4V, Gemini), advanced vision models (YOLO, OWL-ViT, Grounding DINO), and sensors within ROS 2 for flexible robotic tasks.
LLM-Driven Task Execution: LLMs interpret complex commands and generate actionable plans in JSON format for the robot (an illustrative plan is sketched after this list).
Sim-to-Real Validation: Extensive testing in PyBullet simulation followed by deployment on a physical robot powered by Raspberry Pi 5.
User Interaction: Natural language commands via a Streamlit web interface with text-to-speech feedback.
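To make the JSON plan format concrete, the following minimal sketch shows what a decomposed plan for a command such as "bring the red cup to the table" might look like. The field names (action, target, params) and the parameter values are illustrative assumptions, not the system's exact schema.

import json

# Hypothetical LLM output for the command "bring the red cup to the table".
# Field names and parameter values are illustrative assumptions.
llm_plan = """
[
  {"action": "detect_object", "target": "red cup", "params": {"detector": "grounding_dino"}},
  {"action": "navigate_to",   "target": "red cup", "params": {"tolerance_m": 0.15}},
  {"action": "grasp",         "target": "red cup", "params": {"gripper_force": 0.4}},
  {"action": "navigate_to",   "target": "table",   "params": {"tolerance_m": 0.20}},
  {"action": "release",       "target": "red cup", "params": {}}
]
"""

plan = json.loads(llm_plan)  # confirm the LLM produced well-formed JSON
for step in plan:
    print(step["action"], "->", step["target"])

Requiring the LLM to emit a flat list of such steps keeps parsing trivial and lets the task manager validate a plan before any motor command is issued.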
System Architecture:
Inputs include text commands, camera images, LIDAR, ultrasonic sensors, an IMU, and wheel encoders.
The Raspberry Pi 5 processes commands using LLMs and fuses multimodal sensor data for perception and navigation.
A task manager executes action sequences and handles error recovery.
Outputs include motor control for movement and gripper manipulation, plus audible and textual feedback to users; a minimal node-level sketch of this sensing-to-actuation loop follows this list.
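The sketch below gives a minimal rclpy view of that loop. The topic names (/scan, /ultrasonic, /cmd_vel) and the simple stop-on-obstacle rule are assumptions made for illustration; they do not reproduce the system's actual control node.

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import LaserScan, Range
from geometry_msgs.msg import Twist


class PerceptionDriveNode(Node):
    """Fuses LIDAR and ultrasonic readings and publishes velocity commands."""

    def __init__(self):
        super().__init__('perception_drive_node')
        self.min_lidar = float('inf')
        self.min_ultrasonic = float('inf')
        self.create_subscription(LaserScan, '/scan', self.on_scan, 10)
        self.create_subscription(Range, '/ultrasonic', self.on_range, 10)
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
        self.create_timer(0.1, self.control_step)  # 10 Hz control loop

    def on_scan(self, msg: LaserScan):
        valid = [r for r in msg.ranges if msg.range_min < r < msg.range_max]
        self.min_lidar = min(valid, default=float('inf'))

    def on_range(self, msg: Range):
        self.min_ultrasonic = msg.range

    def control_step(self):
        cmd = Twist()
        # Creep forward only while both sensors report a clear path (> 0.3 m).
        if min(self.min_lidar, self.min_ultrasonic) > 0.3:
            cmd.linear.x = 0.15
        self.cmd_pub.publish(cmd)


def main():
    rclpy.init()
    rclpy.spin(PerceptionDriveNode())
    rclpy.shutdown()


if __name__ == '__main__':
    main()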
Methodology:
Tasks are decomposed into JSON commands by LLMs and processed sequentially with continuous sensor feedback for real-time adaptation (see the executor sketch after this list).
Perception uses YOLO for fast known-object detection and vision-language models for zero-shot detection of novel objects.
Navigation and manipulation are handled through ROS 2 nodes integrating sensor data.
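A minimal sketch of such a sequential executor is given below, assuming hypothetical handler names and a simple retry policy; the real task manager's error recovery is more involved.

import json
from typing import Callable, Dict


def make_stub(name: str) -> Callable:
    # Stand-in for a real skill (detection, navigation, grasping); an actual
    # handler would report success or failure based on sensor feedback.
    def run(step: dict) -> bool:
        print(name, "->", step["target"])
        return True
    return run


HANDLERS: Dict[str, Callable] = {
    action: make_stub(action)
    for action in ("detect_object", "navigate_to", "grasp", "release")
}


def execute_plan(plan_json: str, max_retries: int = 2) -> bool:
    """Run each plan step in order; retry failed steps, abort if one keeps failing."""
    for step in json.loads(plan_json):
        handler = HANDLERS.get(step["action"])
        if handler is None:
            print("unknown action, aborting:", step["action"])
            return False
        for attempt in range(1, max_retries + 2):
            if handler(step):  # success, move on to the next step
                break
            print(f"step '{step['action']}' failed (attempt {attempt})")
        else:
            return False  # exhausted retries without success
    return True


if __name__ == "__main__":
    demo = '[{"action": "navigate_to", "target": "table", "params": {}}]'
    print("plan completed:", execute_plan(demo))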
Experimental Results:
Tested 50 tasks combining navigation and vision, both in simulation and on real hardware.
Compared LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.2 Vision, LLaVA) for command generation, navigation, and scene understanding; Gemini 2.0 led in command generation and navigation, GPT-4o excelled in scene understanding.
Evaluated object detection models with Gemini: YOLO excels at fast detection of known objects but struggles with novel items; OWL-ViT and Grounding DINO support zero-shot detection with better generalization but slower inference.
The Gemini LLM dynamically generates descriptive prompts for zero-shot detection, enabling flexible object recognition in unfamiliar scenarios; a sketch of this prompt-to-detector pipeline follows this list.
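The sketch below illustrates that pipeline under two assumptions: the Gemini call is replaced by a placeholder function, and zero-shot detection is run through the Hugging Face zero-shot-object-detection pipeline with an OWL-ViT checkpoint, which is one possible realization rather than the exact experimental setup.

from PIL import Image
from transformers import pipeline


def generate_candidate_labels(user_request: str) -> list[str]:
    # Placeholder for the LLM call that turns a request such as
    # "find me something to drink from" into descriptive detection prompts.
    return ["a ceramic mug", "a drinking glass", "a water bottle"]


# OWL-ViT checkpoint chosen for illustration.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("camera_frame.jpg")  # assumed latest frame from the robot camera
labels = generate_candidate_labels("find me something to drink from")

for det in detector(image, candidate_labels=labels):
    print(f"{det['label']}: score={det['score']:.2f}, box={det['box']}")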
Conclusion
Simulated evaluations in PyBullet demonstrate the potential of integrating multimodal LLMs with robotic systems for complex tasks. Models such as Gemini 2.0 Flash show promise in interpreting commands, generating actions, and handling navigation and scene understanding, which highlights how recent LLM advances transfer to robotics. Our investigation of object identification reveals a trade-off: YOLO excels in speed for known objects, while zero-shot models such as OWL-ViT and Grounding DINO, guided by an LLM, offer superior flexibility for novel objects, which is crucial in dynamic human environments.
This work supports creating more intuitive and adaptable robots. LLM-driven task decomposition, coupled with robust perception, enables robots to handle a wider range of requests. While simulation results are foundational, future work includes sim-to-real transfer and hardware validation. Challenges in real-time performance on constrained hardware, safety, and seamless integration of perception, reasoning, and action remain key research areas.