In the contemporary recruitment environment, the evaluation of a candidate's stress and confidence remains highly subjective. This paper proposes an intelligent system, the AI-Powered Interview Stress and Confidence Analyzer, designed to provide objective metrics by analyzing both facial expressions and vocal biomarkers. While the system architecture includes both modalities, the core contribution of this paper is a robust methodology for the vocal analysis component, which addresses the critical limitations of existing approaches that rely on acted emotional data. Our proposed solution for the vocal module centers on creating a novel, context-specific dataset of authentic interview audio. We detail the complete system architecture, in which a Convolutional Neural Network (CNN) interprets facial emotions. For the vocal analysis component, the framework is designed to use Mel-Frequency Cepstral Coefficients (MFCCs), a feature set that has proven effective for classifying stress and emotion in speech. This paper outlines a clear path toward developing a comprehensive, non-intrusive tool that complements traditional interview procedures, providing holistic, data-driven insights for recruiters and candidates.
Introduction
The project focuses on developing an AI-powered system to assess stress and confidence during job interviews using facial expressions and vocal analysis. Traditional interviews often rely on subjective judgment, making it difficult to objectively evaluate candidates’ psychological states.
Facial Expression Analysis:
Uses a Convolutional Neural Network (CNN) with three layers to classify 10 emotions (happy, sad, proud, angry, etc.).
Datasets: FER-2013 (~25,000 images) and a custom web-scraped dataset (~25,000 images).
Preprocessing: resizing, grayscale conversion, normalization, and 80/20 train-test split.
Model outputs are mapped to confidence levels (Confident, Underconfident, Neutral) in real-time video analysis (see the preprocessing and mapping sketch after this list).
Achieved 72% accuracy on test data; most accurate for ‘Happy’ (F1-score 0.55) and ‘Surprise’ (0.46), but struggled with ‘Determined’ and ‘Proud’.
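To make the preprocessing and mapping steps above concrete, the following Python sketch shows one possible implementation. The 48×48 grayscale input size follows the FER-2013 format; the specific emotion-to-confidence mapping is an illustrative assumption, not the exact mapping used by the system.

```python
# Illustrative sketch: preprocess a video frame for a FER-2013-style CNN and
# collapse a predicted emotion label into a confidence level.
import cv2
import numpy as np

EMOTION_TO_CONFIDENCE = {
    # Assumed mapping, for illustration only.
    "happy": "Confident",
    "proud": "Confident",
    "surprise": "Neutral",
    "neutral": "Neutral",
    "sad": "Underconfident",
    "angry": "Underconfident",
    "fear": "Underconfident",
}

def preprocess_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert to grayscale, resize to 48x48, and normalize to [0, 1]."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (48, 48))
    normalized = resized.astype("float32") / 255.0
    # Add batch and channel dimensions expected by a Keras-style CNN.
    return normalized.reshape(1, 48, 48, 1)

def map_emotion_to_confidence(emotion_label: str) -> str:
    """Collapse a fine-grained emotion label into a confidence level."""
    return EMOTION_TO_CONFIDENCE.get(emotion_label.lower(), "Neutral")
```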
Vocal Modulation Analysis:
Psychological stress alters vocal features such as pitch (F0), jitter, shimmer, speech rate, and spectral qualities.
Initial models using LSTM and XGBoost on public emotional speech datasets performed poorly due to exaggerated “acted” emotions.
Proposed approach uses a custom interview dataset with mock interviews to capture authentic stress and confidence markers.
Features are extracted using the eGeMAPS and ComParE feature sets; models (LSTM, XGBoost) are trained and evaluated using accuracy, precision, recall, and F1-score (a feature-extraction sketch follows this list).
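A minimal feature-extraction sketch is shown below, assuming the audEERING `opensmile` Python package for eGeMAPS/ComParE functionals and `librosa` for the MFCCs mentioned in the abstract; the audio path is a placeholder for a recording from the planned mock-interview dataset.

```python
# Sketch of acoustic feature extraction for one utterance.
import librosa
import opensmile

AUDIO_PATH = "mock_interview_response.wav"  # hypothetical recording

# eGeMAPS functionals: 88 utterance-level features per recording.
egemaps = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
egemaps_features = egemaps.process_file(AUDIO_PATH)  # pandas DataFrame, 1 row

# ComParE 2016 functionals: a larger brute-force feature set.
compare = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
compare_features = compare.process_file(AUDIO_PATH)

# 13 MFCCs per frame, mean-pooled over time into a fixed-length vector
# suitable for classical classifiers such as XGBoost.
y, sr = librosa.load(AUDIO_PATH, sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_vector = mfcc.mean(axis=1)
```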
Key Contributions:
Combines facial and vocal cues for objective assessment of stress and confidence.
Provides a data-driven tool to supplement traditional interviews, aiding both recruiters and candidates.
Highlights challenges of using public emotion datasets and the need for context-specific data collection.
Conclusion
This paper detailed the development of an AI-powered analyzer for stress and confidence. As presented in the results, the facial expression recognition module achieved a modest accuracy of 72%. The key finding from its evaluation was the significant overfitting observed during training, which limits the model's ability to generalize to new, unseen data. This establishes a functional but limited baseline for facial analysis.
The primary contribution of this work remains the proposed methodology for the vocal analysis component. Our initial findings confirmed that standard emotional datasets are insufficient for this task, validating the need to create a novel, context-specific dataset from mock interviews.
Future work will focus on collecting a larger dataset for the vocal model and improving its accuracy. The long-term objective is to fuse both modalities, creating a robust and reliable tool that brings data-driven objectivity to the interview process.
Future Work and Expected Outcomes
The immediate next step is to complete the data acquisition phase by conducting the planned mock interviews. Following this, we will execute the manual annotation process to label the collected recordings for stress and confidence. Once the dataset is fully prepared and labeled, the feature extraction process will begin, utilizing the eGeMAPS and ComParE feature sets.
Subsequently, the machine learning models detailed in the methodology, including XGBoost and LSTM, will be trained and validated. Performance will be rigorously measured using standard metrics such as Accuracy, Precision, Recall, and F1-Score to identify the most effective model.
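As a sketch of this training and evaluation step, the snippet below fits an XGBoost classifier and reports the four metrics named above. The feature matrix, labels, and hyperparameters are placeholders (88 columns mirrors the eGeMAPS functional set), not values from the actual experiments.

```python
# Minimal evaluation sketch for the XGBoost branch on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))      # placeholder features (88 = eGeMAPS dims)
y = rng.integers(0, 2, size=200)    # placeholder labels (1 = stressed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}")
```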
We expect that the models trained on our custom dataset will demonstrate significantly better generalization and performance on real-world interview audio compared to models trained on acted data. The primary expected outcome is a validated methodology and a robust classifier capable of providing an objective, data-driven assessment of stress and confidence. For the broader system, future work will also focus on enhancing the facial expression model, including dataset balancing, data augmentation, and transfer learning to further improve its performance.
Furthermore, we recognize the inherent limitations of relying solely on classifying basic emotions to determine complex states like stress. Recent multi-modal research has shown that in controlled stress-inducing scenarios, facial expressions do not always group into the classic emotion categories, and that facial muscle movements alone may not be reliable predictors of physiological stress responses [12]. This underscores the importance of our dual-modal approach. By eventually fusing vocal and facial data, we aim to create a more robust and nuanced classifier that addresses the limitations of a single-modality system, as advocated by current research.
The ultimate long-term goal is to investigate fusion techniques. We plan to explore both feature-level and decision-level fusion of the vocal and facial model outputs. This will allow us to create a more comprehensive and accurate classifier that leverages the strengths of both modalities to provide a single, holistic assessment of a candidate's stress and confidence.
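As a simple illustration of the decision-level variant, the sketch below averages per-class probabilities from the facial and vocal classifiers with a tunable weight. The class set and weights are assumptions for illustration, not tuned system values.

```python
# Illustrative decision-level fusion of two modality posteriors.
import numpy as np

CLASSES = ["Confident", "Neutral", "Underconfident"]

def fuse_decisions(face_probs: np.ndarray,
                   voice_probs: np.ndarray,
                   face_weight: float = 0.5) -> str:
    """Weighted average of the two posteriors; returns the fused label."""
    fused = face_weight * face_probs + (1.0 - face_weight) * voice_probs
    return CLASSES[int(np.argmax(fused))]

# Example: the facial model leans Neutral, the vocal model leans Underconfident.
face_probs = np.array([0.30, 0.45, 0.25])
voice_probs = np.array([0.10, 0.30, 0.60])
print(fuse_decisions(face_probs, voice_probs))  # -> "Underconfident"
```

Feature-level fusion would instead concatenate the facial and vocal feature vectors before a single classifier; both strategies will be compared empirically.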
References
[1] Gupta, S., Gambhir, S., Gambhir, M., Majumdar, R., Shrivastava, A.K., and Pham, H. 2025. A deep learning approach to analyse stress by using voice and body posture. Soft Computing. 29 (2025), 1719-1745.
[2] Zainal, N.A., Asnawi, A.L., Ibrahim, S.N., Azmin, N.F.M., Harum, N., and Zin, N.M. 2025. Utilizing MFCCS and TEO-MFCCS to classify stress in females using SSNNA. IIUM Engineering Journal. 26, 1 (2025), 324-335.
[3] Kaklauskas, A., Vlasenko, A., Seniut, M., and Krutinis, M. 2009. Voice Stress Analyser System for E-Testing. In Proceedings of the 2009 Ninth IEEE International Conference on Advanced Learning Technologies. 693-695.
[4] Sondhi, S., Vijay, R., Khan, M., and Salhan, A.K. 2016. Voice Analysis for Detection of Deception. In Proceedings of the 2016 11th International Conference on Knowledge, Information and Creativity Support Systems (KICSS).
[5] Sandulescu, V., Andrews, S., Ellis, D., Dobrescu, R., and Martinez-Mozos, O. 2015. Mobile App for Stress Monitoring using Voice Features. In Proceedings of the 5th IEEE International Conference on E-Health and Bioengineering (EHB 2015).
[6] Chidaravalli, S., Jayadev, N., Divyashree, P., Yadav, G.A., and Prajwal, B. 2022. Stress and Anxiety Detection through Speech Recognition and Facial Cues using Deep Neural Network. International Journal of Innovative Research in Technology (IJIRT). 9, 2 (2022), 1040-1044.
[8] Almeida, J. and Rodrigues, F. 2021. Facial Expression Recognition System for Stress Detection with Deep Learning. In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021). 1 (2021), 256-263.
[9] Bhagat, D., Vakil, A., Gupta, R.K., and Kumar, A. 2024. Facial Emotion Recognition (FER) using Convolutional Neural Network (CNN). Procedia Computer Science. 235 (2024), 2079-2089.
[10] Ismail, N. 2017. Analysing Qualitative Data Using Facial Expressions in an Educational Scenario. International Journal of Quantitative and Qualitative Research Methods. 5, 3 (2017), 37-50.
[11] Kumar, G.S., Cheriyan, J., Aparna, N., and Swathy, J. 2025. Unleashing Facial Expression Recognition for Stress Detection Using Deep CNN Model. Procedia Computer Science. 259 (2025), 306-315.
[12] Ringgold, V., Burkhardt, F., Abel, L., Kurz, M., Müller, V., Richer, R., Eskofier, B.M., Shields, G.S., and Rohleder, N. 2025. Multimodal stress assessment: Connecting task-related changes in self-reported stress, salivary biomarkers, heart rate, and facial expressions in the context of the stress response to the Trier Social Stress Test. Psychoneuroendocrinology. 180 (2025), 107560.