A Review on Automated Communication Assessment Platform: Combining Body Language Analysis, Speech Metrics, and Topic Relevance Detection

Authors: Aditya S. Deshmukh, Akshay S. Pawar, Anushka S. Mishra, Darshan C. Jaiswal, Akshay A. Wadatkar, Prof. Dipti A. Mirkute

DOI Link: https://doi.org/10.22214/ijraset.2026.79623


Abstract

This research introduces an AI-driven web platform designed to enhance professional communication through real-time, multimodal feedback. Built on the Flask framework, the system integrates computer vision via MediaPipe with Natural Language Processing (NLP) techniques to evaluate performance across five dimensions. The NLP engine uses TF-IDF vectorization and cosine similarity to assess the relevance of speech content, so that users stay focused on a chosen theme, while a filtering module identifies unprofessional language, slang, and excessive filler words. After converting speech to text via speech recognition, the system applies these NLP models to produce timestamped transcripts that highlight off-topic segments. The platform offers a range of simulated scenarios, from job interviews to healthcare presentations, and generates performance reports featuring metrics such as shoulder alignment, words per minute, and linguistic precision. Longitudinal experimental results indicate that consistent use of the platform over a four-week period leads to substantial improvements, including a 40% increase in topic relevance scores and a stabilization of speaking pace at the professional ideal. By combining diverse analytical modalities into a single local-processing interface, this project provides a scalable, privacy-conscious solution for continuous professional development and public speaking mastery.
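
The TF-IDF and cosine-similarity relevance check named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tokenizer, and the smoothed IDF formula are assumptions chosen for illustration, and the code uses only the Python standard library. Each transcript segment is scored against a topic description, and a score near zero suggests an off-topic segment.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_scores(topic, segments):
    """Score each transcript segment against the topic description."""
    vectors = tfidf_vectors([topic] + list(segments))
    topic_vector = vectors[0]
    return [cosine_similarity(topic_vector, vec) for vec in vectors[1:]]
```

A segment that shares no vocabulary with the topic scores exactly zero, which is one reason a production system would likely add stemming or synonym handling on top of this.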

Introduction

This paper describes the development of an Automated Communication Assessment Platform designed to improve professional communication skills through objective, real-time analysis of both verbal and non-verbal behavior. Traditional soft-skills coaching depends heavily on subjective human observation, which is neither scalable nor consistent. The platform addresses this by combining computer vision, speech recognition, and generative AI to provide automated, data-driven communication feedback.

The system is built using a modern full-stack architecture. The frontend, developed with React.js, captures live video and audio while displaying interactive feedback. User authentication is secured through Google OAuth 2.0 via Google Cloud Console, and MongoDB is used to store user profiles, session reports, posture scores, and performance history. The backend is powered by Flask, which coordinates data flow between the frontend, AI modules, and the database.
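
As an illustration of the kind of record the backend might store per session, the sketch below models a session report as a dataclass and converts it to a plain dictionary. The field names and the `SessionReport` and `to_document` names are hypothetical; the paper does not specify a schema. A dictionary of this shape is what a MongoDB driver such as PyMongo accepts in `collection.insert_one(...)`.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now_iso():
    """Timestamp helper: current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class SessionReport:
    """Hypothetical per-session record; every field name is an assumption."""
    user_id: str            # identifier taken from the authenticated profile
    scenario: str           # e.g. "job interview"
    transcript: str         # text produced by speech recognition
    posture_score: float    # 0-100, from the posture analysis
    relevance_score: float  # 0-1, from the topic relevance analysis
    feedback: str           # qualitative coaching text
    created_at: str = field(default_factory=_utc_now_iso)


def to_document(report):
    """Convert a report into a plain dict suitable for a document store."""
    return asdict(report)
```

Storing a timestamp with each record is what makes the historical progress tracking described above possible, since reports can then be sorted and compared over time.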

For non-verbal communication analysis, the platform uses OpenCV and MediaPipe to track body posture, head orientation, and engagement through skeletal landmark detection. Simultaneously, speech recognition converts spoken audio into text transcripts. These transcripts are analyzed using the Gemini API, which evaluates topic relevance, language complexity, semantic clarity, and overall communication quality through prompt-engineered AI feedback. By combining posture analysis and language evaluation, the system delivers a comprehensive assessment of communication performance.
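
One way a posture metric such as shoulder alignment could be derived from skeletal landmarks is sketched below. It assumes landmarks arrive as normalized `(x, y)` pairs indexed as in MediaPipe Pose, where index 11 is the left shoulder and index 12 is the right shoulder. The scoring function and its 20-degree cutoff are illustrative assumptions, not values from the paper.

```python
import math

# MediaPipe Pose landmark indices for the shoulders.
LEFT_SHOULDER = 11
RIGHT_SHOULDER = 12


def shoulder_tilt_degrees(landmarks, frame_width=1.0, frame_height=1.0):
    """Angle of the shoulder line from horizontal, in degrees (0 = level).

    `landmarks` is a sequence of normalized (x, y) pairs. The frame size
    converts them to pixel units so the angle is not distorted on
    non-square frames.
    """
    lx, ly = landmarks[LEFT_SHOULDER]
    rx, ry = landmarks[RIGHT_SHOULDER]
    dx = (lx - rx) * frame_width
    dy = (ly - ry) * frame_height
    angle = abs(math.degrees(math.atan2(dy, dx)))
    # Fold into [0, 90] so the result ignores which shoulder is on the left.
    return min(angle, 180.0 - angle)


def alignment_score(tilt_degrees, max_tilt=20.0):
    """Map tilt to a 0-100 score; `max_tilt` is an assumed cutoff."""
    clipped = min(max(tilt_degrees, 0.0), max_tilt)
    return 100.0 * (1.0 - clipped / max_tilt)
```

In a live session, a function like this would be called once per frame and the per-frame scores averaged, which smooths out momentary movements.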

The implementation follows four major phases: secure user authentication and session setup, real-time video/audio capture with posture tracking, backend AI-based analysis using Gemini, and report generation with visualized feedback. Users receive both quantitative scores and qualitative coaching suggestions immediately after a session. The platform also supports long-term progress tracking through stored historical reports.
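
The quantitative scores in a report could include simple speech metrics such as words per minute, which the abstract lists among the reported measures. The sketch below computes speaking pace and the share of filler words from a transcript. The filler-word list and function names are assumptions made for illustration.

```python
import re

# Assumed filler-word list; a real deployment would tune this per language.
FILLER_WORDS = {"um", "uh", "like", "basically", "actually"}


def words_per_minute(transcript, duration_seconds):
    """Speaking pace: word count scaled to a one-minute window."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    words = re.findall(r"[A-Za-z']+", transcript)
    return len(words) * 60.0 / duration_seconds


def filler_ratio(transcript):
    """Fraction of words in the transcript that are on the filler list."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    fillers = sum(1 for word in words if word in FILLER_WORDS)
    return fillers / len(words)
```

For example, an eight-word transcript spoken in four seconds corresponds to a pace of 120 words per minute.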

Conclusion

The development of the Automated Communication Assessment Platform marks a significant advancement in the integration of full-stack web technologies with multimodal artificial intelligence. By successfully bridging the gap between computer vision and generative linguistics, this project demonstrates that a robust coaching environment can be built without the need for expensive, localized hardware. The core strength of the architecture lies in its hybrid processing model: utilizing the React frontend and MediaPipe for efficient, low-latency posture detection, while offloading complex semantic analysis to the Gemini API via a Flask micro-framework. This ensures that users receive immediate, data-driven feedback on their physical presence and verbal content simultaneously, providing a holistic view of their communication efficacy that traditional, manual evaluation methods often lack.

Furthermore, the implementation of Google OAuth 2.0 and MongoDB ensures that the platform is not only a tool for immediate assessment but also a secure, long-term repository for professional development. By archiving every session's transcript, posture scores, and AI-generated verdicts, the system allows for longitudinal progress tracking, transforming subjective soft-skills practice into a quantifiable and measurable journey. The ability of the Gemini API to provide nuanced, topic-specific feedback based on custom sectors and difficulty levels proves that generative AI can serve as a highly scalable and objective alternative to human coaching. Ultimately, this platform democratizes access to elite communication training, offering a versatile solution for students and professionals to refine their skills in an increasingly digital and competitive global landscape.

As the project evolves, the current framework serves as a scalable foundation for more advanced physiological and emotional analysis. Future iterations could integrate real-time facial action coding to detect micro-expressions, providing deeper insight into a speaker's confidence and emotional state. Additionally, the inclusion of vocal sentiment analysis and pitch modulation tracking would further refine the platform's ability to assess tone and persuasion. By continuing to leverage the synergy between real-time data capture and large language models, this platform is poised to become an essential tool in the future of AI-driven education and professional career preparation.


Copyright

Copyright © 2026 Aditya S. Deshmukh, Akshay S. Pawar, Anushka S. Mishra, Darshan C. Jaiswal, Akshay A. Wadatkar, Prof. Dipti A. Mirkute. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET79623

Publish Date : 2026-04-07

ISSN : 2321-9653

Publisher Name : IJRASET
