Abstract
Emotion recognition is important for improving human–computer interaction. In this study, we propose two hybrid deep learning architectures to recognize and classify six emotions (anger, disgust, fear, happy, neutral, and sad) using both audio and video data from the CREMA-D dataset. The first model uses a late-fusion approach that combines ResNet-18 for extracting features from audio spectrograms, a Vision Transformer (ViT-Tiny) for extracting features from video frames, and an LSTM to learn the temporal patterns in video sequences. The second model replaces ResNet-18 and ViT with EfficientNet-B0 for both audio and video feature extraction, followed by an LSTM for temporal learning and feature fusion. The results show that the EfficientNet-based model performs better, achieving 82% accuracy, compared with 79% for the ResNet-18–ViT–LSTM model. Both models recognized the happy emotion very well, whereas fear and sad proved more difficult to classify. Overall, the results demonstrate that combining audio and visual information with temporal modeling can significantly improve emotion recognition performance.
Introduction
Emotions play an important role in human life by influencing thoughts, behavior, and decision-making. Humans naturally recognize emotions through facial expressions, voice, and behavior. Dimensional models of affect describe emotions using two factors: valence (how positive or negative the emotion is) and arousal (its intensity). Other studies identify six basic emotions (happiness, sadness, anger, fear, surprise, and disgust) that are common across cultures. With the growth of human–machine interaction through chatbots and intelligent systems such as ChatGPT, Gemini, Grok, and Claude, recognizing human emotions has become important for improving user experience and enabling more natural communication between humans and machines.
Facial Emotion Recognition (FER) is a key task in computer vision. Earlier methods used traditional machine learning techniques like Support Vector Machines and logistic regression, but modern approaches mainly rely on deep learning models. FER methods are generally categorized based on temporal information (static or dynamic) and data modality (unimodal or multimodal). Static models analyze single images, while dynamic models analyze video sequences using techniques such as convolutional neural networks (CNNs) and recurrent neural networks like LSTM, BiLSTM, and GRU to capture temporal patterns in facial expressions. Multimodal systems combine different types of data such as facial images, speech, and text using fusion techniques to improve accuracy.
The research methodology uses the CREMA-D multimodal dataset, which contains 7,442 video clips from 91 actors expressing six emotions (anger, disgust, fear, happy, neutral, and sad) with different intensity levels. Data processing involves extracting video frames and converting audio signals into Mel spectrograms. Exploratory data analysis is performed using Python libraries such as NumPy, Pandas, and Matplotlib in a Kaggle environment with GPU acceleration.
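The audio-preprocessing step described above can be sketched in plain NumPy. This is a minimal illustration of the Mel-spectrogram transform, not the paper's exact pipeline; parameter values such as `n_fft=512`, a hop of 256 samples, and 64 mel bands are assumptions for the sketch (libraries such as librosa provide equivalent, more robust routines):

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels=64, fmin=0.0, fmax=None):
    """Build a triangular mel filterbank mapping FFT bins to mel bands."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 points evenly spaced on the mel scale, converted to FFT bins
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):            # rising edge of triangle
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of triangle
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=64):
    """Log-mel spectrogram: windowed STFT power -> mel filterbank -> dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return 10.0 * np.log10(np.maximum(mel, 1e-10))  # shape: (frames, n_mels)
```

The resulting two-dimensional log-mel array can be saved or rendered as an image, which is what allows image backbones such as ResNet-18 or EfficientNet-B0 to consume the audio channel.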
Two hybrid deep learning models were proposed for emotion recognition. The first model combines ResNet-18, Vision Transformer (ViT), and LSTM to extract video and audio features and classify emotions. The second model uses EfficientNet-B0 with LSTM for multimodal feature extraction and classification. Both models apply a late-fusion strategy to combine audio and visual features before predicting the final emotion category.
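The late-fusion strategy above can be sketched in PyTorch. This is a simplified sketch of the idea, not the paper's implementation: the ResNet-18 / ViT-Tiny / EfficientNet-B0 backbones are replaced by small stand-in MLPs, and the feature dimensions, hidden size, and sequence length are assumed values. The essential structure — independent audio and video branches, an LSTM over per-frame video features, and concatenation just before the classifier — matches the description:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late fusion for audiovisual emotion recognition (sketch).

    The audio branch encodes a spectrogram feature vector; the video branch
    runs an LSTM over per-frame feature vectors. Both branch outputs are
    concatenated and passed to a linear head over the six emotion classes.
    """
    def __init__(self, audio_dim=128, frame_dim=192, hidden=64, n_classes=6):
        super().__init__()
        # Stand-in for a CNN/ViT audio backbone (assumption, not the paper's)
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # LSTM captures temporal patterns across the video frame sequence
        self.video_lstm = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden * 2, n_classes)

    def forward(self, audio_feat, video_frames):
        a = self.audio_enc(audio_feat)              # (B, hidden)
        _, (h, _) = self.video_lstm(video_frames)   # h: (1, B, hidden)
        v = h[-1]                                   # last hidden state, (B, hidden)
        return self.head(torch.cat([a, v], dim=1))  # (B, n_classes)

model = LateFusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 16, 192))
```

In a full system, `audio_feat` would come from the spectrogram backbone and `video_frames` from the per-frame image backbone; swapping both stand-ins for EfficientNet-B0 yields the structure of the second proposed model.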
Experimental results show that both models successfully detect emotions from audio-visual data. The first model achieved 79% accuracy, while the second model achieved 82% accuracy, demonstrating better performance and generalization. The results indicate that multimodal deep learning architectures, especially those using EfficientNet and LSTM, can significantly improve emotion recognition accuracy and contribute to the development of more intelligent human–machine interaction systems.
Conclusion
This study proposed two hybrid multimodal deep learning architectures for multi-class emotion recognition using audiovisual data. The first model integrates ResNet-18, a Vision Transformer, and LSTM, while the second uses EfficientNet-B0 with LSTM for feature extraction and temporal modeling. Experimental results show that the EfficientNet–LSTM model outperforms the ResNet-18–ViT–LSTM model, achieving 82% accuracy compared to 79%. Future work may focus on incorporating attention mechanisms, transformer-based temporal modeling, and larger multimodal datasets to further enhance the performance of emotion recognition systems.