Emotional intelligence plays a significant role in today's competitive job market, where it strongly influences how candidates perform during interviews. This paper proposes an AI-based system that assesses a candidate's emotional responses and confidence level during a mock interview. The application consists of two main modules: facial emotion detection and audio emotion classification. For facial emotion detection, the system applies a YOLO model to video feeds to recognize key emotions such as anger, happiness, sadness, fear, and surprise. The audio emotion classification module employs advanced models, including Convolutional Neural Networks (CNN), Recurrent Convolutional Neural Networks (RCNN), and a hybrid of CNNBlock and ConformerBlock, to determine the emotional tone of the candidate's voice, such as anger, fear, or happiness. The web application is built with a Flask backend and an HTML, CSS, and JavaScript frontend, giving users a straightforward interface for uploading video and audio files. Beyond predicting the emotional state, the system attaches a confidence score to each prediction, which candidates can use to gauge their performance. Together, these technologies help candidates regulate their emotions and understand themselves better, ultimately improving their performance in stressful interview situations. This study highlights the potential of AI for mock interview assessment and emotional intelligence training.
Introduction
Human resource management has evolved significantly with the emergence of artificial intelligence, particularly in candidate evaluation processes. Traditional interview methods primarily assess verbal responses and technical skills, often overlooking emotional intelligence and non-verbal cues—factors that strongly influence a candidate’s performance under stress. To address this gap, the proposed project introduces an AI-based mock interview evaluator that analyzes both facial expressions and vocal cues to assess emotional reactions and confidence levels.
The system integrates two key components: facial emotion recognition using YOLO-based deep learning models and audio emotion classification using CNN, RCNN, and hybrid CNN–Conformer architectures. The facial analysis module recognizes emotions such as happiness, anger, sadness, surprise, fear, and neutrality by detecting facial landmarks and expressions in real time. The audio module classifies emotional tone by analyzing speech features like pitch and rhythm through MFCC-based processing and advanced neural networks. Both modules generate confidence scores, providing candidates with measurable insights into their emotional responses and areas for improvement.
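To make the audio branch concrete, the following is a minimal sketch of a hybrid CNN + Conformer classifier, assuming 40 MFCC coefficients per frame and a torchaudio Conformer encoder; the layer sizes, number of emotion classes, and hyperparameters are illustrative assumptions rather than the system's actual configuration:

import torch
import torch.nn as nn
import torchaudio

class CNNConformerEmotionNet(nn.Module):
    """Illustrative hybrid CNNBlock + ConformerBlock audio emotion classifier."""
    def __init__(self, n_mfcc: int = 40, n_classes: int = 6):
        super().__init__()
        # CNN block: captures local time-frequency patterns in the MFCC matrix
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                     # halve the MFCC axis, keep time
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat_dim = 64 * (n_mfcc // 4)                 # channels x remaining MFCC bins
        # Conformer block: models long-range temporal context over the CNN features
        self.conformer = torchaudio.models.Conformer(
            input_dim=feat_dim, num_heads=4, ffn_dim=256,
            num_layers=2, depthwise_conv_kernel_size=31,
        )
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, time)
        x = self.cnn(mfcc.unsqueeze(1))               # -> (batch, 64, n_mfcc//4, time)
        x = x.flatten(1, 2).transpose(1, 2)           # -> (batch, time, feat_dim)
        lengths = torch.full((x.size(0),), x.size(1), dtype=torch.long)
        x, _ = self.conformer(x, lengths)
        logits = self.head(x.mean(dim=1))             # pool over time, then classify
        return logits.softmax(dim=-1)                 # per-emotion confidence scores

In this sketch, the input matrix can be produced with librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=40), which yields a (40, time) array per clip, and the softmax output doubles as the per-emotion confidence score reported back to the candidate.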
A Flask-based web application with an HTML/CSS/JavaScript interface enables users to upload video and audio recordings of mock interviews. The system processes the inputs and delivers real-time feedback on emotional states and confidence, helping users enhance self-awareness and interview performance. The literature review highlights the rapid progress of multimodal emotion recognition using deep learning, attention mechanisms, and transformer-based fusion models, which support the development of this integrated assessment framework.
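As a minimal sketch of how such an upload endpoint could be wired to the facial module, the Flask route below samples frames from an uploaded video with OpenCV and runs an ultralytics YOLO model assumed to be fine-tuned on facial-emotion classes; the file paths, weight file, and frame-sampling rate are illustrative assumptions rather than the actual implementation:

import os
from collections import Counter

import cv2
from flask import Flask, jsonify, request
from ultralytics import YOLO
from werkzeug.utils import secure_filename

app = Flask(__name__)
# Hypothetical weights fine-tuned on facial-emotion classes (anger, happiness, ...)
emotion_model = YOLO("weights/emotion_yolo.pt")

@app.route("/analyze/video", methods=["POST"])
def analyze_video():
    file = request.files["video"]
    os.makedirs("uploads", exist_ok=True)
    path = os.path.join("uploads", secure_filename(file.filename))
    file.save(path)

    votes, confidences = Counter(), []
    cap = cv2.VideoCapture(path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % 15 == 0:                       # sample every 15th frame
            result = emotion_model(frame, verbose=False)[0]
            for box in result.boxes:                  # one box per detected face
                votes[result.names[int(box.cls)]] += 1
                confidences.append(float(box.conf))
        frame_idx += 1
    cap.release()

    if not votes:
        return jsonify({"error": "no face detected"}), 422
    dominant, _ = votes.most_common(1)[0]
    return jsonify({
        "dominant_emotion": dominant,
        "confidence": sum(confidences) / len(confidences),
    })

if __name__ == "__main__":
    app.run(debug=True)

The same pattern could extend to an audio route that passes the uploaded recording to the MFCC-based classifier and returns its predicted emotion and confidence as JSON.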
Conclusion
In this paper, the design of an AI-based system for assessing emotional responses and confidence during mock interviews was presented. The proposed system combines two essential components, facial emotion recognition and audio emotion classification, using recent deep learning models to provide a comprehensive evaluation of a candidate's emotional state during an interview. With YOLO-based facial emotion recognition and audio emotion classification built on a hybrid CNN + ConformerBlock model, the system can analyze both visual and auditory signals with high accuracy.
Combining the two modalities substantially improves the reliability of the emotion predictions and allows a candidate's emotional intelligence to be rated more effectively. The confidence-aware feedback mechanism strengthens the system further by giving candidates real-time, practical guidance for improving their emotional control and interview performance. Because the multimodal system captures both subtle facial expressions and vocal tone, it provides a fuller picture of how the candidate behaves during the interview.
This study demonstrates how AI can transform the conventional recruitment process by evaluating not only verbal responses but also emotional intelligence. Combining real-time assessment of emotional responses with detailed feedback can benefit both interviewees and interviewers by improving interview preparation and self-awareness. Future work may include refining the models further, extending the emotion categories, and deploying the system in a wider range of real-life situations to further validate its effectiveness.