Emotion recognition from images is a challenging task due to variations in facial expressions, body posture, and environmental context. Facial features alone are often insufficient to accurately identify emotions in real-world scenarios. This paper presents an implementation of a context-based emotion recognition system using deep learning and the EMOTIC dataset. The proposed approach employs a dual-stream convolutional neural network architecture that separately processes human-centric and contextual information. Features extracted from both streams are fused to predict discrete emotion categories and continuous affective dimensions. Experimental results demonstrate that the inclusion of contextual cues significantly improves emotion recognition performance compared to human-only models, validating the effectiveness of the proposed implementation.
Introduction
Emotion recognition is essential for applications such as human–computer interaction, intelligent surveillance, social robotics, and behavioral analysis. Traditional approaches mainly relied on facial expression analysis using handcrafted features or deep convolutional neural networks (CNNs). However, these methods perform poorly in real-world environments where facial cues may be obscured due to occlusion, lighting variations, or non-frontal views.
To address these limitations, context-aware emotion recognition has emerged as a more robust approach. Humans naturally interpret emotions using contextual information such as body posture, surrounding objects, and scene semantics. The EMOTIC (EMOTions In Context) dataset supports this approach by providing approximately 23,000 real-world images annotated with human bounding boxes, 26 discrete emotion categories, and three continuous affective dimensions: Valence, Arousal, and Dominance (VAD).
The proposed system implements a dual-stream deep learning architecture consisting of:
Human Stream Network – Processes cropped human regions to capture body posture, clothing, and facial/gesture cues using a pretrained ResNet-50 model.
Context Stream Network – Processes the full image (with masked human regions) to extract scene semantics and environmental cues, also using ResNet-50.
Features from both streams are concatenated and passed through fully connected layers for fusion. The system performs:
Discrete emotion classification using a Softmax layer
Continuous emotion regression (VAD) using regression layers
Implementation details include resizing images to 224×224, applying data augmentation (horizontal flipping and normalization), and training with the Adam optimizer (learning rate 0.0001, batch size 32, 30 epochs). Cross-entropy loss is used for classification, Mean Squared Error (MSE) for regression, and the total loss is a weighted combination of both.
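The weighted multi-task loss described above can be sketched as follows. The weighting factor `lambda_reg` is an assumed hyperparameter, since the text states that the total loss is a weighted combination of cross-entropy and MSE but does not give the weight's value.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()   # discrete emotion classification
mse_loss = nn.MSELoss()           # VAD regression

# Assumed weighting between the two task losses; not specified in the text.
lambda_reg = 0.5

def total_loss(cat_logits, cat_targets, vad_preds, vad_targets):
    """Weighted combination of classification and regression losses."""
    return ce_loss(cat_logits, cat_targets) + lambda_reg * mse_loss(vad_preds, vad_targets)
```

With the optimizer settings from the text, training would use `torch.optim.Adam(model.parameters(), lr=1e-4)` and minibatches of 32 for 30 epochs.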
Conclusion
This paper presented an implementation of a context-based emotion recognition system using deep learning and the EMOTIC dataset. By combining human-centric and contextual features through a dual-stream architecture, the proposed model achieves improved performance over baseline approaches. Future work will explore attention mechanisms and multi-modal inputs such as audio and text for enhanced emotion understanding.