Understanding human emotions has a significant influence on user experience in digital applications. Conventional music recommendation relies primarily on users' previous listening behavior, ratings, or playlists, and does not take the user's current emotional state into account. Many existing emotion detection frameworks rely on a single modality, such as text or facial expression. Such systems can be ineffective in practice because human emotions are complex and vary with context and personal behavior.
This work addresses the limitation by using artificial intelligence, machine learning, and deep learning to build an emotion recognition framework that combines facial expressions, voice tone, and text. Analyzing these inputs together gives better insight into a user's emotions and reduces the inaccuracies that arise from single-input recognition methods.
The system is deployed as an interactive web application built with the Flask web framework. Once the user's emotion is detected, the system suggests personalized music that matches their mood. Recommending music according to the user's current emotional state can enhance the user experience.
Introduction
The study presents a multi-modal emotion recognition-based music recommendation system that aims to provide personalized music suggestions according to a user’s emotional state. Music plays an important role in human emotions, and people often choose music that matches their mood. However, traditional music recommender systems mainly depend on user history and preferences, ignoring real-time emotions, which can result in unsuitable recommendations.
To overcome this limitation, the proposed system combines facial expression analysis, speech emotion recognition, and text sentiment analysis to detect emotions more accurately. Since human emotions are complex and dynamic, relying on a single data source is often insufficient. Therefore, the study adopts a multi-modal approach using Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) techniques to improve emotion detection accuracy.
The literature review highlights previous research supporting the effectiveness of emotion recognition technologies. Studies by Kumar et al. and Islam et al. demonstrated the usefulness of text-based sentiment analysis for detecting emotions. Research by El Ayadi et al., Abadi et al., and Gupta et al. confirmed that speech signals such as tone, pitch, and intensity are strongly related to emotions and can be effectively analyzed using deep learning techniques. Poria et al. emphasized that combining multiple input modes—such as text, speech, and facial expressions—significantly improves emotion recognition performance.
The proposed system consists of several layers. The first is the data collection layer, where facial expressions are captured through webcams, speech through microphones, and textual input through typing. This multi-channel strategy improves emotional understanding compared to single-input systems.
Next is the preprocessing layer, where collected data is cleaned and prepared for analysis. Facial images are resized and normalized, speech signals are converted into meaningful features such as spectrograms and Mel Frequency Cepstral Coefficients (MFCCs), and text data is processed using Natural Language Processing (NLP) techniques to remove irrelevant words.
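The image and text steps of this layer can be illustrated with a minimal sketch that uses only the Python standard library. The function names, the stop-word list, and the nearest-neighbour resizing are assumptions made for illustration and are not taken from the implementation described here. Speech feature extraction (spectrograms, MFCCs) is omitted because it normally depends on a dedicated audio library.

```python
import re

# Hypothetical stop-word list; a real system would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "i", "to", "of", "and", "so"}


def clean_text(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def resize_nearest(image, new_height, new_width):
    """Resize a 2-D grayscale image (list of rows) by nearest-neighbour sampling."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[row * old_height // new_height][col * old_width // new_width]
            for col in range(new_width)
        ]
        for row in range(new_height)
    ]


def normalize_pixels(image):
    """Scale 8-bit pixel values (0-255) into the range [0.0, 1.0]."""
    return [[pixel / 255.0 for pixel in row] for row in image]
```

For example, `clean_text("I am SO happy today!")` keeps only the tokens `happy` and `today`, and a face crop would be passed through `resize_nearest` and then `normalize_pixels` before reaching the classifier.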
In the feature extraction and classification layer, AI models analyze the processed data. Facial emotion recognition uses Convolutional Neural Networks (CNNs) such as ResNet and VGG16 trained on datasets like FER2013. Speech emotion recognition employs CNN and CNN-LSTM models to identify emotions from tone, pitch, and speech rate. For text analysis, NLP techniques combined with Support Vector Machine (SVM) classifiers and CNN models are used to classify emotions as positive, negative, or neutral.
The decision-making layer combines outputs from facial, speech, and text analysis to determine the user’s final emotional state, such as happiness, sadness, anger, or calmness. Based on the detected emotion, the music recommendation layer suggests suitable songs—for example, upbeat music for happy moods and soothing music for sadness or anger.
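One common way to combine the three outputs is decision-level (late) fusion by weighted averaging, sketched below. This is an assumption for illustration, since the fusion rule is not specified here. The modality weights, the playlist descriptions, and the premise that every modality reports probabilities over the same four emotion labels are likewise hypothetical.

```python
EMOTIONS = ("happiness", "sadness", "anger", "calmness")

# Hypothetical modality weights; in practice these would be tuned on validation data.
DEFAULT_WEIGHTS = {"face": 0.4, "speech": 0.3, "text": 0.3}

# Hypothetical mapping from detected emotion to a music style.
PLAYLISTS = {
    "happiness": "upbeat music",
    "sadness": "soothing music",
    "anger": "soothing music",
    "calmness": "ambient music",
}


def fuse_predictions(predictions, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-modality probabilities.

    Modalities whose value is None (for example, no microphone input)
    are skipped, and the remaining weights are renormalized.
    """
    available = {name: probs for name, probs in predictions.items() if probs is not None}
    if not available:
        raise ValueError("at least one modality is required")
    total_weight = sum(weights[name] for name in available)
    return {
        emotion: sum(
            weights[name] * probs.get(emotion, 0.0) for name, probs in available.items()
        ) / total_weight
        for emotion in EMOTIONS
    }


def recommend(predictions):
    """Return the most probable fused emotion and the matching music style."""
    fused = fuse_predictions(predictions)
    emotion = max(fused, key=fused.get)
    return emotion, PLAYLISTS[emotion]
```

Averaging probabilities rather than taking a majority vote lets a confident modality outweigh two uncertain ones, and renormalizing the weights keeps the result usable when a user supplies only one or two inputs.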
The system is implemented using a Flask web application framework. Flask handles backend processing, while HTML, CSS, and JavaScript are used for the front-end interface. Users can upload facial images or videos, voice recordings, or text inputs, and the system provides real-time emotion detection and music recommendations.
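A minimal sketch of how such a Flask backend might expose the recommendation step as a JSON endpoint is given below. The route name `/recommend`, the payload field `text`, the track lists, and the keyword-based `detect_emotion_from_text` placeholder are all assumptions; the placeholder stands in for the trained models only so that the route is runnable, and handling of image and audio uploads is left out.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical track descriptions per emotion, for illustration only.
TRACKS = {
    "happiness": ["upbeat track A", "upbeat track B"],
    "sadness": ["soothing track A", "soothing track B"],
    "calmness": ["ambient track A", "ambient track B"],
}


def detect_emotion_from_text(text):
    """Placeholder for the trained models: a simple keyword lookup."""
    lowered = text.lower()
    if any(word in lowered for word in ("happy", "great", "excited")):
        return "happiness"
    if any(word in lowered for word in ("sad", "lonely", "unhappy")):
        return "sadness"
    return "calmness"


@app.route("/recommend", methods=["POST"])
def recommend_route():
    """Accept JSON such as {"text": "..."} and return an emotion with tracks."""
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text.strip():
        return jsonify(error="the 'text' field is required"), 400
    emotion = detect_emotion_from_text(text)
    return jsonify(emotion=emotion, tracks=TRACKS[emotion])
```

The front-end JavaScript would send the user's input to this route and render the returned track list. Returning a 400 status for an empty payload keeps input validation on the server rather than relying on the browser.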
The implementation also includes machine learning techniques such as SVM for text emotion classification. Different kernel functions were tested, with the Radial Basis Function (RBF) kernel providing the best performance. Feature selection using Chi-square tests and dimensionality reduction through Principal Component Analysis (PCA) further improved classification accuracy.
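The text classification steps named above can be chained in a scikit-learn pipeline, as in the following sketch. The TF-IDF features, the order of the steps, the values of `k_features` and `n_components`, and the toy sentences are assumptions for illustration and do not reproduce the configuration or data used in this study.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def to_dense(matrix):
    """Convert the sparse TF-IDF matrix to a dense array for PCA."""
    return matrix.toarray()


def build_text_classifier(k_features=10, n_components=3):
    """TF-IDF -> Chi-square selection -> PCA -> SVM with an RBF kernel."""
    return make_pipeline(
        TfidfVectorizer(),
        SelectKBest(chi2, k=k_features),  # chi2 needs non-negative features
        FunctionTransformer(to_dense, accept_sparse=True),
        PCA(n_components=n_components),
        SVC(kernel="rbf", gamma="scale"),
    )


# Hypothetical toy data, far too small for meaningful accuracy.
texts = [
    "i feel happy and excited today",
    "what a wonderful joyful morning",
    "i am so sad and lonely",
    "this is a terrible gloomy day",
    "the meeting is at noon",
    "the report is on the desk",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = build_text_classifier()
model.fit(texts, labels)
prediction = model.predict(["i feel happy"])
```

Placing the Chi-square selection before PCA is a deliberate choice in this sketch: the Chi-square test requires non-negative inputs, which TF-IDF values satisfy but PCA components generally do not.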
The results demonstrate that the system successfully detects emotions from video, audio, and text inputs and provides fast, personalized music recommendations. Overall, the study shows that integrating multi-modal emotion detection with real-time AI processing can significantly enhance user interaction and satisfaction in music recommendation systems.
Conclusion
This paper presents a multimodal emotion-based music recommendation system that recognizes users' emotions and recommends appropriate music by analyzing their facial expressions, voice patterns, and text inputs. Relying on multiple sources rather than a single one increases the accuracy and effectiveness of emotion recognition.
The system has been developed as a web application with Flask. It recognizes a user's emotion in real time and immediately recommends personalized music tracks that match it. Its simplicity and efficiency make it user-friendly and effective.
References
[1] A. Abdul, J. Chen, H. Y. Liao, and S. H. Chang, “Emotion-Aware Personalized Music Recommendation System Using Convolutional Neural Networks,” Applied Sciences, vol. 8, no. 7, 2018.
[2] W. Deng, “Application of Multimodal Emotion Recognition Technology in Recommendation Systems,” Highlights in Science, Engineering and Technology, 2025.
[3] Y. Wu, Q. Mi, and T. Gao, “A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions,” Biomimetics, 2025.
[4] V. S. G. S. Phaneendra and K. Ragavan, “Emotion-Based Music Recommendation System Integrating Facial Expression Recognition and Lyrics Sentiment Analysis,” IEEE Access, 2025.
[5] D. Ayata, Y. Yaslan, and M. E. Kamasak, “Emotion Recognition from Multimodal Physiological Signals for Emotion-Aware Systems,” Journal of Medical and Biological Engineering, vol. 40, pp. 149–157, 2020.
[6] S. Wang, “Music Emotion Recognition and Modeling Based on Multimodal Signal Fusion,” Traitement du Signal, 2025.
[7] M. Athavle, D. Mudale, U. Shrivastav, and M. Gupta, “Music Recommendation Based on Face Emotion Recognition,” Journal of Informatics Electrical and Electronics Engineering, 2021.
[8] “IoT-Based Approach to Multimodal Music Emotion Recognition,” Alexandria Engineering Journal, 2024.
[9] R. Pillalamarri and U. Shanmugam, “A Review on EEG-Based Multimodal Learning for Emotion Recognition,” Artificial Intelligence Review, 2025.