Although non-verbal communication critically affects interview success, traditional mock interview platforms predominantly evaluate speech and text in isolation. To overcome this limitation, we introduce Cognify AI, a multimodal system that integrates vision-based behavioral analysis (via DeepFace and OpenCV) with locally executed NLP-based text evaluation within a single real-time processing framework. Unlike single-modality systems that process visual and textual streams in silos, Cognify AI simultaneously correlates eye contact stability, facial emotional patterns, and the presence of key semantic concepts in responses. Our evaluation shows that this integrated feedback loop yields measurable improvements in both technical articulation and non-verbal composure across successive candidate sessions, providing actionable insights that enhance interview preparedness.
Introduction
This paper introduces Cognify AI, a multimodal intelligent interview evaluation system designed to improve automated recruitment by analyzing not only what candidates say but also how they behave during interviews.
A key problem in modern hiring is that most existing systems focus mainly on text-based or speech-based evaluation: they give feedback on answers but ignore non-verbal signals such as eye gaze, facial expressions, posture, and emotional control. These behavioral cues strongly influence human interview decisions, especially in virtual recruitment settings.
To address this, Cognify AI proposes a unified multimodal framework that combines audio, video, and text analysis.
Unlike earlier systems that process each modality separately, Cognify AI fuses all three streams into a single evaluation pipeline, producing more complete and human-like feedback.
The literature survey shows the evolution of interview systems:
Early systems used rule-based or keyword-based text evaluation.
Later systems used machine learning and deep learning for semantic understanding but still focused mainly on text.
Multimodal systems began combining vision, speech, and text, but often lacked true integration or real-time fusion.
LLM-based systems improved contextual evaluation but remained largely text-only.
Speech-only and vision-only systems improved individual analysis but could not provide holistic assessment.
Cognify AI differentiates itself by synchronously integrating all modalities, ensuring that neither verbal nor non-verbal behavior is ignored, and producing a unified score and feedback.
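One way to picture how per-modality results could be combined into a unified score is a weighted average. The sketch below is illustrative only: the `ModalityScores` structure, the `unified_score` function, and the weights are assumptions for this example, since the text does not state how Cognify AI weights its modalities.

```python
from dataclasses import dataclass


@dataclass
class ModalityScores:
    """Per-modality scores on a 0-100 scale (hypothetical structure)."""
    content: float      # semantic evaluation of the spoken answer
    eye_contact: float  # share of frames with eye contact
    emotion: float      # composure score derived from facial emotion


# Hypothetical weights; the paper does not specify a weighting scheme.
WEIGHTS = {"content": 0.5, "eye_contact": 0.3, "emotion": 0.2}


def unified_score(scores, weights=WEIGHTS):
    """Return the weighted average of the modality scores, rounded to 1 decimal."""
    total_weight = sum(weights.values())
    weighted_sum = sum(getattr(scores, name) * w for name, w in weights.items())
    return round(weighted_sum / total_weight, 1)


if __name__ == "__main__":
    print(unified_score(ModalityScores(content=60, eye_contact=80, emotion=50)))
```

Normalizing by the sum of the weights keeps the result on the same 0-100 scale even if the weights are later changed and no longer add up to 1.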
The system architecture is modular and designed for real-time performance. It includes:
A user interface layer, where candidates log in, view dashboards, and start interviews.
A system that allows uploading resumes and job descriptions to personalize evaluation.
A backend pipeline that processes audio, video, and text simultaneously and generates performance feedback.
Conclusion
This paper presented Cognify AI, a multimodal mock interview preparation system that integrates vision-based behavioral analysis with large language model-driven content evaluation within a unified, real-time processing framework. The system was designed to address a fundamental gap in existing automated interview platforms, which predominantly assess verbal content while overlooking non-verbal behavioral cues such as eye contact and facial emotional state.
The proposed architecture processes audio and video streams through parallel, independent pipelines. Speech is transcribed using OpenAI Whisper and evaluated for concept coverage and holistic quality by a locally hosted Ollama/Llama 3 language model. Simultaneously, video frames are analyzed using OpenCV and DeepFace to extract eye contact scores and dominant emotional states. The outputs of both pipelines are synthesized into a structured, actionable feedback report.
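The parallel structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `analyze_speech` and `analyze_video` are hypothetical stand-ins that work on an already transcribed answer and on precomputed per-frame labels, so no Whisper, Ollama, OpenCV, or DeepFace calls are made. Simple substring matching replaces the LLM-based concept check.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def analyze_speech(transcript, concepts):
    """Stand-in for the speech pipeline: report which expected concepts
    appear in the transcript (a real system would query an LLM here)."""
    text = transcript.lower()
    covered = [c for c in concepts if c.lower() in text]
    coverage = 100.0 * len(covered) / len(concepts) if concepts else 0.0
    return {"covered_concepts": covered, "concept_coverage": round(coverage, 1)}


def analyze_video(frames):
    """Stand-in for the vision pipeline: each frame is assumed to be a dict
    with precomputed 'eye_contact' (bool) and 'emotion' (str) fields."""
    if not frames:
        return {"eye_contact_score": 0.0, "dominant_emotion": None}
    eye = 100.0 * sum(f["eye_contact"] for f in frames) / len(frames)
    dominant = Counter(f["emotion"] for f in frames).most_common(1)[0][0]
    return {"eye_contact_score": round(eye, 1), "dominant_emotion": dominant}


def build_report(transcript, concepts, frames):
    """Run both pipelines concurrently and merge their outputs into one report."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        speech = pool.submit(analyze_speech, transcript, concepts)
        video = pool.submit(analyze_video, frames)
        return {**speech.result(), **video.result()}


if __name__ == "__main__":
    frames = [
        {"eye_contact": True, "emotion": "neutral"},
        {"eye_contact": True, "emotion": "happy"},
        {"eye_contact": False, "emotion": "happy"},
        {"eye_contact": True, "emotion": "happy"},
    ]
    report = build_report(
        "A hash map gives average constant time lookup",
        ["hash map", "constant time", "collision"],
        frames,
    )
    print(report)
```

Because the two pipelines share no state, they can be submitted to separate workers and merged only once both have finished, which mirrors the independent pipelines followed by a synthesis step described in the text.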
Evaluation was conducted over five successive mock interview sessions with two candidates under controlled conditions. Candidate 1 demonstrated improvement in concept accuracy from 48% to 62%, eye contact from 60% to 100%, and holistic score from 46% to 77% across sessions. Candidate 2 similarly improved concept accuracy from 40% to 67%, eye contact from 55% to 95%, and holistic score from 42% to 72%. In both cases, the dominant emotional state transitioned from Neutral in early sessions to Happy by Session 3, suggesting increased composure through iterative practice.
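To make the size of these changes easier to compare, the reported first-session and last-session values can be turned into percentage-point gains. The snippet below uses only the figures quoted above; the `results` layout and the `gains` helper are illustrative names, not part of the system.

```python
# First- and last-session values (percent) as reported in the text.
results = {
    "Candidate 1": {
        "concept accuracy": (48, 62),
        "eye contact": (60, 100),
        "holistic score": (46, 77),
    },
    "Candidate 2": {
        "concept accuracy": (40, 67),
        "eye contact": (55, 95),
        "holistic score": (42, 72),
    },
}


def gains(metrics):
    """Return the percentage-point change (last minus first) for each metric."""
    return {name: last - first for name, (first, last) in metrics.items()}


if __name__ == "__main__":
    for candidate, metrics in results.items():
        print(candidate, gains(metrics))
```

This gives gains of 14, 40, and 31 percentage points for Candidate 1 and 27, 40, and 30 percentage points for Candidate 2 in concept accuracy, eye contact, and holistic score respectively.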
These results indicate that multimodal feedback combining verbal and non-verbal signals supports measurable, progressive improvement across repeated sessions — a capability absent in single-modality systems. Qualitative comparison further confirms that Cognify AI extends beyond existing text-only and speech-only platforms by incorporating eye contact analysis, emotion detection, and multimodal fusion within a single evaluation pipeline.
The current evaluation is limited to a small controlled sample, and the accuracy of visual analysis remains sensitive to environmental conditions such as lighting and camera quality. Future work will focus on expanding the evaluation dataset to include diverse candidate profiles, incorporating long-term adaptive personalization based on historical session performance, improving robustness of emotion recognition under varied recording conditions, and extending the system to support domain-specific interview tracks beyond technical roles.
Cognify AI demonstrates that integrating behavioral and semantic signals into a cohesive evaluation framework is both technically feasible and practically effective for structured interview preparation.