Music Streaming Platform Based on Speech & Facial Emotion Analysis

Authors: Vegi Kumar, Meduri Madhu Bhargav, Kolapati Loka Naga Surya Narayana Murthy

DOI Link: https://doi.org/10.22214/ijraset.2025.71842

Abstract

Traditional recommendation systems rely mostly on user-initiated historical listening patterns, genre classifying, or collaborative-filtering recommendations, thus totally disregarding the mood of a user. In a very real sense, emotions are fluid; and at any moment, they can go on and dominate someone\'s music choice. This work proposes a music streaming platform that considers real-time mood inputs to make music recommendations based on facial expression and speech analysis. By merging CNNs with facial emotion detection and LSTMs with speech emotion recognition, the system can track every multimodal emotional input with the utmost precision. This paper presents the technical implementation, evaluation criteria, and user feedback of this system thus extending the domain of affective computing and intelligent human-computer interaction.

Introduction

1. Background & Motivation

Music streaming platforms like Spotify or Apple Music rely on user preferences and behavior history for recommendations. However, these systems struggle to adapt to users' real-time emotional states, which fluctuate based on mood, stress, or environment. The research proposes Moodify, an intelligent music streaming application that uses real-time facial and vocal emotion recognition to tailor music recommendations accordingly. This bridges the gap between affective computing and digital media consumption.

2. Problem Statement

Current systems:

Use static or historical data for recommendations.
Lack adaptability to spontaneous emotional changes.
Rely on unimodal emotion detection (either facial or speech), which can fail under practical constraints (noise, lighting, etc.).

Core Problem: The absence of a robust, real-time, multimodal emotion-aware music recommendation system.

3. Objectives and Scope

Moodify aims to:

Use CNN (trained on FER-2013) for facial emotion recognition.
Use LSTM (trained on TESS) for speech emotion recognition.
Fuse both inputs using a weighted method for improved accuracy.
Match recognized emotions to a song library tagged with emotional metadata.
Provide an interactive dashboard for tracking emotions and allowing user overrides.

Scope Includes:

Real-time emotion detection via webcam & mic.
ML models served via Flask APIs.
MERN stack (MongoDB, Express, React, Node.js) web platform.
Emotion-based song recommendation and visual dashboard.

4. Literature Review

Past works explored unimodal systems or offline datasets:

Facial: CNNs (e.g., Zhang et al.) work well but struggle with occlusion.
Speech: LSTMs (e.g., Kumar et al.) handle temporal patterns but are noise-sensitive.
Multimodal: Limited real-world implementations. E.g., MoodPlayer (facial only) and EmoPlayer (speech only) lack dynamic or fused inputs.

Moodify’s Edge: Combines both facial and speech emotion detection in real-time and integrates a working music recommendation engine with playback and feedback features.

5. System Design & Architecture

Components:

Emotion Detection: CNN (face) + LSTM (speech).
Fusion Engine: Default 60% face + 40% speech weighting; dynamically adjusts.
Recommendation Engine: Matches user emotion vectors with song metadata using cosine similarity.
Dashboard: Tracks emotion trends, provides manual controls, and displays playback history.

Datasets:

FER-2013 (facial emotions).
TESS (speech emotions).
Custom Music Dataset (emotion-tagged royalty-free songs).

Tech Stack:

Frontend: React.js, Tailwind CSS, Redux.
Backend: Node.js, Express.js, MongoDB, JWT.
ML: TensorFlow, Keras, Librosa, OpenCV, Flask.
Deployment: Docker, AWS EC2, NGINX, GitHub Actions, Grafana.

6. Implementation

The application was built in modular fashion:

Frontend: Real-time webcam/mic inputs, player controls, and emotion tracking charts.
Backend: API routing, JWT auth, logging, and rate limiting.
ML APIs: Dockerized Flask services for CNN and LSTM models.
Fusion & Recommendation: Logic implemented in backend with dynamic weighting and freshness filtering.
Database: MongoDB Atlas with GridFS for audio storage.
Deployment: CI/CD pipelines, Docker containerization, and system monitoring via Prometheus/Grafana.

7. Evaluation & Results

Accuracy:
- Facial Emotion (CNN): 95.8%
- Speech Emotion (LSTM): 99.2%
Performance:
- APIs respond in under 400ms.
- High F1-scores (~0.97+), especially for Fear, Surprise, Neutral.
- Slight confusion between Sad and Neutral, a common issue in emotion classification.

User Feedback:

Positive reception during beta testing.
Real-time emotion tracking and mood adaptation improved satisfaction.
Visualizations and manual override features enhanced interactivity.

Conclusion

Moodify represents a major development in bridging the gaps between emotional intelligence and music recommendation systems. Moodify advances the state of personalized, context-aware digital experiences by introducing real-time facial and speech-based emotion recognition with a robust backend music-streaming service. It moves away from the usual playlist-driven system by seeing music not as mere content, but as a responsive medium that can respond to the user\'s emotional state. The system at the center comprises two deep learning modules—CNN for facial emotion recognition and LSTM for speech emotion recognition—trained and fine-tuned to achieve high accuracy over a variety of emotional categories. The novel fusion mechanism intelligently combines both modalities to arrive at a dominant emotional state, which is then passed onto a similarity-based recommendation engine that matches a selected music library, annotated emotionally. The results are streamed onto a real-time web interface, creating a gliding experience of emotions and music. On the outside, the intricate technicalities within Moodify demonstrate a philosophy in design centered around the user. Features such as real-time emotion visualizations, interactive dashboards, and a lightweight front end ensure that it does not degenerate into mere functional utility but maintains a psychologically engaging and intuitive user experience. At its very core lies a rigorous evaluation on accuracy, latency, and user satisfaction, guaranteeing the system\'s usefulness in practical scenarios. Here rests a great implication for affective computing research. With the mentioned future extensions in AR/VR, wearable technology, multi-language-interface, transformer-based architecture, etc., it stands as a scalable and adaptable solution. Meanwhile, its applications stretch beyond entertainment, for instance, into mental wellness, education, and immersive digital spaces.

References

[1] P. Ekman and W. V. Friesen, \"Constants across cultures in the face and emotion,\" J. Pers. Soc. Psychol., vol. 17, no. 2, pp. 124-129, 1971. [2] M. Sharma, R. Biswas, and K. K. Dewangan, \"Emotion-aware music retrieval using spectral audio features,\" IEEE Trans. Affective Comput., vol. 10, no. 3, pp. 423-435, 2019. [3] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, \"Joint face detection and alignment using multitask cascaded convolutional networks,\" IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499-1503, Oct. 2016. [4] Y. Huang, Y. Li, and J. Yu, \"Speech emotion recognition with deep learning: A review,\" IEEE Access, vol. 8, pp. 48789-48804, 2020. [5] C. Vondrick, D. Oktay, and A. Torralba, \"Emotion recognition in speech using cross-modal transfer in the wild,\" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5065-5074. [6] [6] M. Panwar and R. Biswas, \"Emotion-aware music recommendation systems: A survey,\" Int. J. Comput. Appl., vol. 184, no. 4, pp. 21-26, 2022. [7] J. Deng et al., \"ImageNet: A large-scale hierarchical image database,\" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248-255. [8] A. Kumar and D. M. Dhote, \"Speech emotion recognition using deep learning techniques,\" in Proc. Int. Conf. Comput. Commun. Autom. (ICCCA), 2020, pp. 1-6. [9] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, \"Facial expression recognition with deep learning: A review,\" IEEE Trans. Affective Comput., vol. 12, no. 4, pp. 1197-1215, Oct.-Dec. 2021. [10] A. Jain, S. Arora, and R. Arora, \"Facial emotion recognition using convolutional neural networks and representational learning,\" Int. J. Comput. Appl., vol. 182, no. 23, pp. 1–6, 2019. [11] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, \"End-to-end multimodal emotion recognition using deep neural networks,\" IEEE J. Sel. Topics Signal Process., vol. 11, no. 8, pp. 1301–1309, Dec. 2017. [12] R. W. Picard, Affective Computing, Cambridge, MA, USA: MIT Press, 1997. [13] M. Soleymani et al., \"A survey of multimodal sentiment analysis,\" Image Vis. Comput., vol. 65, pp. 3–14, Nov. 2017. [14] B. McFee et al., “librosa: Audio and music signal analysis in Python,” in Proc. 14th Python Sci. Conf., 2015, pp. 18–25. [15] TensorFlow, Available online: https://www.tensorflow.org/ [16] FER-2013 Dataset, Available online: https://www.kaggle.com/datasets/msambare/fer2013 [17] TESS Dataset, Available online: https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess [18] Spotify Web API, Available online: https://developer.spotify.com/documentation/web-api/ [19] OpenCV Documentation, Available online: https://docs.opencv.org/ [20] Google Cloud Speech-to-Text API, Available online: https://cloud.google.com/speech-to-text

Copyright

Copyright © 2025 Vegi Kumar, Meduri Madhu Bhargav, Kolapati Loka Naga Surya Narayana Murthy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET71842

Publish Date : 2025-05-30

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here