AI-Based Music Mood Composer Using Deep Learning Techniques

Authors: Sagar T, Harshitha M M

DOI Link: https://doi.org/10.22214/ijraset.2025.74078

Abstract

AI Music Creation mood-based composition using artificial intelligence has becomea novel interdisciplinary field combining signal processing, affective computing, and deep learning. This paper proposes a music-generatingAI based on a user’s emotional context by leveraging a combination of recurrent neural networks (RNNs), transformer architectures, and audio feature embeddings. The system incorporates emotion recognition via audio or text input, followed by real-time music generation aligned with the detected mood (e.g., happy, sad, calm, energetic). A custom The dataset was compiled from open-access sources annotated with emotion labels. The proposed architecture achieves highmood-classification accuracy and generates harmonically rich, emotionally aligned music sequences. This study explores both performance and interpretability using attention heatmaps and feature saliency analysis to enhance transparency and user trust in generative AI systems.

Introduction

Music is a powerful emotional medium. With AI advancements, there's growing interest in generating music that aligns with human emotions. Traditional rule-based systems struggle to capture emotional nuances, but deep learning models—especially RNNs, LSTMs, and Transformers—have significantly improved music generation capabilities.

This study proposes a multimodal AI system that:

Recognizes human emotion via text and audio inputs
Generates MIDI-based music aligned with the detected mood
Evaluates outputs using objective metrics and human feedback

2. Related Work

A. Symbolic Music Generation

Early methods: Rule-based, Markov models — lacked flexibility.
Deep learning shift: Google’s Magenta, OpenAI’s MuseNet used LSTM and Transformer models for better compositions.

B. Mood & Emotion Recognition

Audio Features: MFCC, Chroma, Spectral Contrast
Text Features: BERT-based sentiment analysis
Enables personalized, emotionally relevant content generation.

C. Multimodal Emotion Recognition

Combines audio, text, and facial cues.
Uses CNN-RNN and Transformer models for richer context and higher accuracy.

D. Emotion-to-Music Mapping

Datasets like DEAM, EmoMusic link musical elements (tempo, key, mode) to emotion tags.
Challenges remain due to the subjectivity of emotional interpretation.

E. Transformer Models in Composition

Models like Music Transformer outperform LSTMs in long-term structure and harmonic consistency.
Emotion conditioning is still a developing area.

F. Explainability in Music AI

Tools like attention maps and Grad-CAM help visualize model decisions, increasing user trust and interpretability.

3. Methodology

A. Data Collection & Preprocessing

Audio: IEMOCAP, RAVDESS datasets for emotion-labeled speech
Text: LyricsEmotion, EmpatheticDialogues for emotion-labeled conversations
Music: MIDI files from MAESTRO, DEAM, EmoMusic tagged with 5 emotions: Happy, Sad, Calm, Angry, Energetic
Features extracted: MFCCs, BERT embeddings, note sequences, key/tempo normalization

B. Emotion Classification

Audio Classifier: CNN + BiLSTM hybrid model
Text Classifier: Fine-tuned BERT
Fusion: Soft voting ensemble for combining both modalities

C. Music Generation

Model: Emotion-conditioned Transformer-based generator
Input: MIDI tokens + emotion label embeddings
Training: Categorical Cross-Entropy loss, Adam optimizer, 5-fold validation

D. Evaluation Metrics

Objective: BLEU Score, Tonal Coherence, Polyphonic Score
Subjective: Human ratings on mood accuracy, emotional engagement, and musicality
Explainability: Attention maps and Grad-CAM highlight decision factors

4. Experiments & Results

A. Emotion Detection Accuracy

Model	Modality	Accuracy	F1-Score
CNN-BiLSTM	Audio	89.1%	88.9%
BERT Fine-Tuned	Text	92.7%	92.8%
Multimodal Ensemble	Audio + Text	94.5%	94.7%

B. Music Generation Metrics

Model	BLEU	Tonal Coherence	Polyphonic Score
LSTM (Baseline)	0.62	0.76	0.65
Music Transformer	0.81	0.89	0.82
Ours (Conditional Transformer)	0.84	0.92	0.87

C. Human Evaluation (80 Participants)

Criteria	Avg. Rating (out of 5)
Mood Accuracy	4.6
Emotional Engagement	4.5
Overall Musicality	4.4

Participants favored transformer-generated music for its emotional depth and realism.

5. Comparison with Existing Systems

System	Emotion Conditioning	Explainability	Musical Quality
MuseNet	Limited	No	High
AIVA	Manual/Rule-based	No	Moderate
LSTM-based	Basic	No	Low
Ours	Multimodal & Real-time	Yes (Visualizations)	High

Conclusion

This study presents a novel framework for generating emotionally aligned music based on multimodal user inputs using deep learning. The system effectively integrates: 1) Emotion recognition via audio and text 2) Music generation using Transformer-based models 3) Explainability tools for model transparency The hybrid model achieved high classification accuracy (94.5%) in emotion detection and outperformed traditional methods in music generation tasks, producing compositions that listeners rated highly for mood accuracy and musicality. Future Work 1) Real-Time Generation and Deployment Optimizing model size and inference speed to support on-device or web-based real-time music generation. 2) Multi-Sensory Emotion Detection Incorporating facial expressions and physiological sensors (e.g., heart rate, EEG) for deeper emotional analysis. 3) Adaptive Personalization Fine-tuning music outputs based on user preferences, mood history, and cultural context. 4) Clinical and Therapeutic Applications Exploring integration into music therapy tools for stress reduction, mental health monitoring, and emotional well-being. 5) Interactive Music Interfaces developing user-facing applications with visual dashboards and emotion sliders for controlling musical parameters interactively.

References

[1] H. Huang, A. Vaswani, I. Simon, et al., “Music Transformer: Generating Music with Long-Term Structure,” Proceedings of the International Conference on Learning Representations (ICLR), 2019. [2] OpenAI, “MuseNet: A Generative Model of Music with AI,” OpenAI Research Blog, 2019. [Online]. Available: https://openai.com/research/musenet [3] A. Delbouys, R. Bittner, E. Vincent, et al., “Music Mood Detection Based on Audio and Lyrics with Deep Neural Nets,” Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018. [4] Z. Xia, Y. Zhang, and T. Zhou, “Emotion Recognition from Speech using Deep Neural Network with Spectrogram Augmentation,” IEEE Access, vol. 7, pp. 128123–128133, 2019. doi: 10.1109/ACCESS.2019.2939222. [5] F. Ferreira, A. Oliveira, and D. Oliveira, “DEAM: A Dataset for Emotion Annotation in Music,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 10, no. 3, pp. 1–23, 2020. [6] D. Mohammad, T. Akbari, and M. Soleymani, “EmoMusic: A Dataset for Music Emotion Recognition,” Proceedings of the ACM International Conference on Multimedia, 2019, pp. 627–630. [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Proceedings of the NAACL-HLT, vol. 1, pp. 4171–4186, 2019. [8] A. Ghazi, S. Sharda, and M. Yadav, “A Comparative Study of Transformer-based Models for Symbolic Music Generation,” Journal of Intelligent & Fuzzy Systems, vol. 43, no. 6, pp. 7921–7932, 2022. [9] A. Mohammad, S. Rahman, and N. Jahan, “Multimodal Emotion Recognition Using Deep Fusion of Audio and Text Features,” Multimedia Tools and Applications, vol. 81, no. 10, pp. 14175–14193, 2022. [10] Google Brain Team, “Magenta: Music and Art Generation with Machine Learning,” [Online]. Available: https://magenta.tensorflow.org, 2023.

Copyright

Copyright © 2025 Sagar T, Harshitha M M. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET74078

Publish Date : 2025-09-04

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here