Abstract
Speech Emotion Recognition (SER) aims to automatically detect human emotions from spoken language using computational methods. In this study, we propose a deep learning approach that leverages Mel-Frequency Cepstral Coefficient (MFCC) features extracted from speech signals. A Long Short-Term Memory (LSTM) neural network is trained to classify emotions into seven categories and achieves a validation accuracy of 93.93%. Waveform and spectrogram visualizations reveal clear acoustic distinctions among the emotions, highlighting the potential of MFCC-based SER systems.
Introduction
Speech Emotion Recognition (SER) allows machines to understand human emotions from voice, enhancing human-computer interaction in applications such as virtual assistants, healthcare, and customer service. This research focuses on building a robust SER system using Mel-Frequency Cepstral Coefficients (MFCCs) and a Long Short-Term Memory (LSTM) neural network to classify emotional states from speech.
Key Concepts:
Emotions Conveyed in Speech: Speech carries not only words but also emotional cues like anger, fear, happiness, sadness, disgust, surprise, and neutrality.
SER Importance: Recognizing emotions makes AI systems more human-like, especially in HCI, virtual agents, and assistive technologies.
Methodology:
Dataset:
Toronto Emotional Speech Set (TESS) with 2800 audio files across 7 emotions.
Preprocessing:
Audio loaded with Librosa.
Visualized using waveform and spectrogram plots.
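The snippet below is a minimal sketch of this preprocessing step using librosa and matplotlib; the file path is a hypothetical example, and the plotting calls assume librosa 0.9 or newer (earlier versions use waveplot instead of waveshow).
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

AUDIO_PATH = "TESS/OAF_happy/OAF_back_happy.wav"  # hypothetical example file

# Load the clip; librosa resamples to 22,050 Hz mono by default.
y, sr = librosa.load(AUDIO_PATH, duration=3.0)

# Waveform plot
plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)
plt.title("Waveform")

# Log-amplitude spectrogram of the short-time Fourier transform
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
plt.figure(figsize=(10, 3))
librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()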
Feature Extraction:
40 MFCCs extracted from 3-second audio samples.
Averaged across time to form fixed-length feature vectors.
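A minimal sketch of this feature-extraction step, using librosa defaults apart from n_mfcc=40; the helper name extract_mfcc is illustrative.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40, duration=3.0):
    # Load up to 3 seconds of audio and compute 40 MFCCs per frame.
    y, sr = librosa.load(path, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # Average across time to obtain a fixed-length (40,) feature vector.
    return np.mean(mfcc.T, axis=0)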
Model Architecture:
LSTM Layer (256 units) + Dropout for regularization.
Dense layers (ReLU) and Softmax output for emotion classification.
Optimizer: Adam
Loss: Categorical Crossentropy
Training: 50 epochs, 80:20 train-validation split
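A minimal Keras sketch of this architecture and training setup; the dropout rate, Dense layer sizes, and batch size are assumptions not specified above, and each 40-dimensional MFCC vector is fed to the LSTM as a length-40 sequence of scalars.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

NUM_CLASSES = 7  # angry, disgust, fear, happy, neutral, sad, surprise
N_MFCC = 40

model = Sequential([
    LSTM(256, input_shape=(N_MFCC, 1)),     # 256-unit LSTM layer
    Dropout(0.2),                           # dropout rate is an assumption
    Dense(128, activation="relu"),          # hidden layer sizes are assumptions
    Dropout(0.2),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# X: (num_samples, 40, 1) MFCC features; y: one-hot labels of shape (num_samples, 7)
# history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=64)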
Results:
Performance:
Training Accuracy: ~99.79%
Validation Accuracy: ~93.93%
Loss curves show stable learning.
Slight overfitting was observed; data augmentation could reduce it (see the augmentation sketch below).
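One way to pursue that improvement is waveform-level augmentation; the sketch below shows three common transforms (additive noise, pitch shifting, time stretching) with illustrative parameter values, not settings used in this study.
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    # Add low-level Gaussian noise to the waveform.
    return y + noise_factor * np.random.randn(len(y))

def pitch_shift(y, sr, n_steps=2):
    # Shift the pitch up by two semitones.
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def time_stretch(y, rate=0.9):
    # Slow the clip down slightly without changing pitch.
    return librosa.effects.time_stretch(y=y, rate=rate)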
Confusion Matrix Insights:
High accuracy for angry, sad, and neutral.
Disgust and surprise were often confused due to similar acoustic patterns.
Fear was sometimes misclassified as sadness, consistent with human perception.
Challenges:
Similar-sounding emotions are difficult to separate.
Short speech clips limit context.
Speaker variability (accent, gender, tone).
Background noise affects clarity.
Model Comparison:
Model    Accuracy    Training Time    Memory Usage    Real-Time Performance
LSTM     98.75%      Longer           Higher          Very Good
SVM      85.36%      Shorter          Lower           Good
The LSTM outperformed traditional models such as the SVM by 10–15 percentage points, demonstrating its strength in learning temporal speech patterns.
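For reference, the sketch below shows how an SVM baseline of this kind could be trained on the same averaged MFCC vectors using scikit-learn; the RBF kernel and C value are assumptions, since the text does not specify the SVM configuration.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (num_samples, 40) averaged MFCC vectors; labels: integer emotion IDs
# X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.2, stratify=labels)

svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
# svm_clf.fit(X_train, y_train)
# print("Validation accuracy:", svm_clf.score(X_val, y_val))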
Conclusion
In this research, we developed a Speech Emotion Recognition (SER) system using deep learning techniques, specifically focusing on feature extraction through MFCCs and classification using an LSTM-based model. Our experiments demonstrated that the LSTM model could effectively capture the temporal dynamics of speech signals and classify emotions such as happiness, sadness, anger, fear, disgust, surprise, and neutrality with promising accuracy. The overall performance shows the potential of deep learning approaches for emotion-aware human-computer interaction systems.
Despite the encouraging results, the study also revealed certain challenges. Emotions with overlapping acoustic features, such as fear and sadness or disgust and surprise, were more difficult to distinguish, leading to occasional misclassifications. Additionally, variations in speaker accents, recording quality, and short speech durations influenced model performance. Addressing these challenges will require more diverse datasets, advanced model architectures such as attention mechanisms, and data augmentation strategies to improve generalization and robustness.
In the future, this work can be extended by integrating multimodal emotion recognition that combines speech with facial expressions, gestures, or physiological signals to create a more comprehensive and accurate emotion detection system. Moreover, exploring transformer-based models and transfer learning from large speech pre-trained models could further enhance recognition accuracy. With continuous advancements, Speech Emotion Recognition systems hold the potential to revolutionize applications in virtual assistants, healthcare, customer service, and interactive entertainment.