Multimodal Emotion Recognition: A Comprehensive Survey of Architectures, Fusion Strategies, Datasets, and Future Directions

Authors: Faisal Majeed, Poonam Dhankhar

DOI Link: https://doi.org/10.22214/ijraset.2026.83881

Abstract

Recent advances in computer science have produced striking results, among them emotion recognition, which is considered a foundation of human-computer interaction (HCI). This capability brings artificial systems closer to being empathic and socially aware. Early models recognized emotions from a single cue, such as facial expressions, voice tone, gestures, or eye movements. In practice, human emotions are recognized reliably only when several such cues are considered together: what a person says, how the pitch of their voice changes, their facial gestures, and physiological signals such as heart rate or skin response. Unimodal emotion detection systems process one modality at a time and often fail to capture complex emotional states; multimodal systems address this by combining the modalities into a single joint prediction, and have shown strong results. This survey provides an overview and a deeper understanding of state-of-the-art Multimodal Emotion Recognition systems. It begins with classical methods and proceeds to recent multimodal emotion detection systems, which are predominantly based on Transformer architectures. These systems leverage pretrained models such as Vision Transformer for facial features, Wav2Vec 2.0 for speech, and BERT for text, and fuse the resulting features via cross-attention or multimodal transformers. High-end systems may use the latest large multimodal models, which accept images, audio, and text together. The survey also covers systems built on CNNs, LSTMs, CNN-LSTM hybrids, graph neural networks, autoencoders, capsule networks, and ensemble methods. Older systems relied on traditional CNNs and LSTMs, whereas newer systems increasingly use graph-based approaches and large multimodal foundation models.

Introduction

The text presents a comprehensive overview of Multimodal Emotion Recognition (MER), a field within affective computing that aims to identify human emotions using multiple data sources such as text, speech, facial expressions, physiological signals, and contextual information. Emotion recognition enhances human-computer interaction by enabling systems to understand and respond appropriately to users' emotional states, with applications in healthcare, education, social robotics, driver monitoring, and virtual assistants.

Traditional emotion recognition systems were largely unimodal, relying on a single source of information (e.g., facial expressions or speech) and handcrafted features such as Local Binary Patterns (LBP) and Mel-Frequency Cepstral Coefficients (MFCCs). Although effective in controlled environments, these systems often struggled with real-world emotion recognition because emotions are complex and cannot be accurately inferred from a single modality.

To address these limitations, MER combines information from multiple modalities through fusion techniques. MER improves performance through:

Redundancy, which increases robustness when one modality is noisy or unavailable.
Complementarity, where different modalities contribute unique emotional cues.
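
The redundancy property can be illustrated with a minimal decision-level (late) fusion sketch. This is not the survey's method: the function names `late_fusion` and `predict`, the three-class label set, and the probability values are hypothetical, chosen only to show that a prediction can still be formed when one modality is unavailable.

```python
from typing import Dict, List, Optional

LABELS = ["anger", "happiness", "sadness"]  # hypothetical label set


def late_fusion(probs: Dict[str, Optional[List[float]]]) -> List[float]:
    """Average class probabilities over the modalities that are present.

    A modality mapped to None is treated as missing and skipped.
    """
    available = [p for p in probs.values() if p is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    n_classes = len(available[0])
    if any(len(p) != n_classes for p in available):
        raise ValueError("all modalities must score the same classes")
    return [sum(p[c] for p in available) / len(available) for c in range(n_classes)]


def predict(probs: Dict[str, Optional[List[float]]]) -> str:
    """Return the label with the highest fused probability."""
    fused = late_fusion(probs)
    return LABELS[max(range(len(fused)), key=fused.__getitem__)]


# Illustrative per-modality outputs (made-up numbers, not real model scores).
scores = {
    "text": [0.2, 0.7, 0.1],
    "audio": [0.3, 0.5, 0.2],
    "face": None,  # e.g. the camera feed is unavailable
}
print(predict(scores))  # happiness
```

Plain averaging weights every available modality equally; a real system might instead learn per-modality weights or fuse at the feature level.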

Recent advances in deep learning have significantly improved MER. Models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers, and Multimodal Large Language Models (MLLMs) automatically learn complex spatial, temporal, and cross-modal relationships. Modern MLLMs can not only recognize emotions but also provide explanations for their predictions.

The survey highlights key contributions:

Analysis of feature learning from text, audio, visual, physiological, and contextual data.
Comparison of modern architectures including CNNs, RNNs, Graph Neural Networks, State Space Models, and MLLMs.
Development of a mathematical framework for multimodal fusion, cross-modal attention, and self-supervised learning.
Review of benchmark datasets such as IEMOCAP, MELD, CMU-MOSEI, and DEAP, along with their limitations.
Discussion of future directions including federated learning and edge AI.
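
The cross-modal attention mentioned in the contributions can be sketched as single-head scaled dot-product attention in which one modality supplies the queries and another supplies the keys and values. The sketch below assumes NumPy, random stand-in features, and randomly initialised projection matrices (`w_q`, `w_k`, `w_v` are hypothetical names); it illustrates the general mechanism, not the specific formulation developed in the paper.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Attend from one modality (queries) to another (keys and values).

    query_feats:   (T_q, d_q), e.g. text token features
    context_feats: (T_c, d_c), e.g. audio frame features
    Returns (attended, weights) with shapes (T_q, d) and (T_q, T_c).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot products
    weights = softmax(scores, axis=-1)       # each query's weights sum to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))   # 5 tokens, 16-dim features (random stand-ins)
audio = rng.normal(size=(20, 8))  # 20 frames, 8-dim features (random stand-ins)
d = 4                             # shared attention dimension (assumed)
w_q = rng.normal(size=(16, d))
w_k = rng.normal(size=(8, d))
w_v = rng.normal(size=(8, d))

attended, weights = cross_attention(text, audio, w_q, w_k, w_v)
print(attended.shape, weights.shape)  # (5, 4) (5, 20)
```

Each text token ends up with a summary of the audio frames it attends to most, which is how the two sequences can be aligned even though their lengths differ.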

The text also introduces the theoretical foundations of affective computing and discusses two major emotion representation frameworks:

Categorical models, based on Paul Ekman’s basic emotions (anger, fear, happiness, sadness, etc.).
Dimensional models, particularly James Russell’s Valence–Arousal–Dominance framework, which represents emotions as continuous dimensions.
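
The difference between the two frameworks shows up directly in how a label is stored. The following sketch is illustrative only: the class names, the [-1, 1] axis range, and the example annotation are assumptions, and the enum lists just the four emotions named above.

```python
from dataclasses import dataclass
from enum import Enum


class BasicEmotion(Enum):
    """Categorical label: exactly one discrete class per sample."""
    ANGER = "anger"
    FEAR = "fear"
    HAPPINESS = "happiness"
    SADNESS = "sadness"


@dataclass(frozen=True)
class VAD:
    """Dimensional label: a point in a continuous space.

    Each axis is assumed to be scaled to [-1, 1]; datasets use other ranges too.
    """
    valence: float
    arousal: float
    dominance: float

    def __post_init__(self):
        for name in ("valence", "arousal", "dominance"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [-1, 1], got {value}")

    def quadrant(self) -> str:
        """Coarse position in the valence-arousal plane."""
        valence = "positive" if self.valence >= 0 else "negative"
        arousal = "high" if self.arousal >= 0 else "low"
        return f"{valence} valence, {arousal} arousal"


categorical = BasicEmotion.ANGER
dimensional = VAD(valence=-0.6, arousal=0.8, dominance=0.3)  # made-up annotation
print(categorical.value, "|", dimensional.quadrant())
# anger | negative valence, high arousal
```

A categorical target leads naturally to classification, while a dimensional target leads to regression over continuous values.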

Additionally, the survey explains multimodal learning concepts such as shared latent representations, cross-modal learning, and temporal sequence modeling, emphasizing that emotions evolve over time and require models capable of capturing long-term dependencies.
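
One common way to learn a shared latent representation across modalities is a contrastive objective that pulls paired embeddings together and pushes unpaired ones apart. The summary above does not name a particular loss, so the symmetric InfoNCE form below is a swapped-in, widely used choice; the function names, the temperature of 0.1, and the random embeddings are assumptions for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.1) -> float:
    """Contrastive loss for N paired embeddings a[i] <-> b[i], each of shape (N, d).

    Row i of `a` should be most similar to row i of `b` (and vice versa);
    all other rows in the batch act as negatives.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) cosine similarities / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_a_to_b + loss_b_to_a))


rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 32))                      # random stand-in embeddings
video_aligned = audio + 0.01 * rng.normal(size=(8, 32))  # nearly identical pairs
video_random = rng.normal(size=(8, 32))               # unrelated embeddings

aligned = symmetric_info_nce(audio, video_aligned)
mismatched = symmetric_info_nce(audio, video_random)
print(aligned < mismatched)  # True
```

The loss is low when matching pairs sit close together in the shared space, so minimising it encourages the modality encoders to agree on the same samples.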

Finally, the text discusses text-based emotion recognition, tracing its evolution from traditional NLP methods like TF-IDF to contextual language models such as BERT, RoBERTa, DeBERTa, and large language models such as GPT-4, which provide more accurate understanding of context, semantics, and emotional expressions.
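
To see why TF-IDF struggles with context, consider a minimal sketch of the weighting. It assumes the plain variant with tf = count / document length and idf = ln(N / df) (libraries often use smoothed variants), and the two example sentences are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "i am not happy".split(),
    "i am happy".split(),
]
vectors = tf_idf(docs)
print(vectors[0]["not"])    # 0.25 * ln(2), about 0.173
print(vectors[1]["happy"])  # 0.0, because "happy" occurs in every document
```

The two sentences express opposite emotions, yet their representations differ in a single term and carry no word order, which is the kind of gap contextual models such as BERT are meant to close.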

Conclusion

Multimodal Emotion Recognition represents a rapidly maturing intersection of artificial intelligence, cognitive neuroscience, and human-centered design. The field has moved from classical statistical and machine learning approaches to advanced deep learning architectures such as Graph Neural Networks (GNNs), Transformers, State Space Models like Mamba, and Multimodal Large Language Models (MLLMs). These models have improved the ability of computers to understand and interpret complex emotions. Modern models based on fusion techniques, particularly cross-attention mechanisms and self-supervised contrastive learning, have reduced major challenges such as feature redundancy and semantic misalignment between modalities. Despite these advances, MER systems still face many challenges when moved from the laboratory to real-world scenarios. Most datasets are culturally biased or scripted and are collected under constrained conditions. Current research is trying to address these problems along with others, such as handling missing modalities, reducing cross-cultural bias, and optimising computationally expensive MLLMs for real-time Edge AI applications without reducing accuracy. As MER systems are increasingly integrated into daily life in fields such as mental health monitoring, healthcare, virtual assistants, and social robotics, issues of privacy, fairness, and transparency are becoming critically important. Federated Learning helps protect sensitive user data, while Explainable AI (XAI) methods are essential for maintaining trust and interpretability in automatic emotion recognition systems. By addressing these challenges, MER has the potential to contribute significantly toward the development of intelligent, empathetic, and socially aware artificial systems capable of interacting naturally and effectively with humans.
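
The privacy argument for Federated Learning rests on sharing model parameters instead of raw recordings. A minimal sketch of the aggregation step of federated averaging (FedAvg), one common algorithm, is given below; the conclusion does not say which algorithm is intended, and the parameter names, values, and sample counts here are hypothetical.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def fed_avg(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its local sample count.

    Only parameters and counts reach the server; raw user data stays on the device.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("clients must report a positive total sample count")
    reference = updates[0][0]
    return {
        name: [
            sum(params[name][i] * n for params, n in updates) / total
            for i in range(len(reference[name]))
        ]
        for name in reference
    }


# Two hypothetical clients with locally trained parameters and sample counts.
clients = [
    ({"w": [1.0, 2.0], "b": [0.0]}, 10),
    ({"w": [3.0, 4.0], "b": [1.0]}, 30),
]
global_params = fed_avg(clients)
print(global_params)  # {'w': [2.5, 3.5], 'b': [0.75]}
```

The client with more data pulls the global model further toward its parameters, which is the intended behaviour of sample-weighted averaging.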

Copyright

Copyright © 2026 Faisal Majeed, Poonam Dhankhar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET83881

Publish Date : 2026-06-22

ISSN : 2321-9653

Publisher Name : IJRASET
