Abstract
Speech Emotion Recognition (SER) aims to automatically detect human emotions from spoken language using computational methods. In this study, we propose a deep learning approach that leverages Mel-Frequency Cepstral Coefficient (MFCC) features extracted from speech signals. A Long Short-Term Memory (LSTM) neural network is trained to classify emotions into seven categories and achieves a validation accuracy of 93.93%. Waveform and spectrogram visualizations reveal clear acoustic distinctions among the emotions, highlighting the potential of MFCC-based SER systems.
Introduction
Speech Emotion Recognition (SER) allows machines to understand human emotions from voice, enhancing human-computer interaction in applications such as virtual assistants, healthcare, and customer service. This research focuses on building a robust SER system using Mel-Frequency Cepstral Coefficients (MFCCs) and a Long Short-Term Memory (LSTM) neural network to classify emotional states from speech.
Key Concepts:
Emotions Conveyed in Speech: Speech carries not only words but also emotional cues like anger, fear, happiness, sadness, disgust, surprise, and neutrality.
SER Importance: Recognizing emotions makes AI systems more human-like, especially in HCI, virtual agents, and assistive technologies.
Methodology:
Dataset:
Toronto Emotional Speech Set (TESS) with 2800 audio files across 7 emotions.
Preprocessing:
Audio loaded with Librosa.
Visualized using waveform and spectrogram plots.
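The snippet below is a minimal sketch of this preprocessing step using librosa and matplotlib; the file path is a hypothetical example, and the plotting calls assume librosa 0.9 or newer (earlier versions use waveplot instead of waveshow).
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

AUDIO_PATH = "TESS/OAF_happy/OAF_back_happy.wav"  # hypothetical example file

# Load the clip; librosa resamples to 22,050 Hz mono by default.
y, sr = librosa.load(AUDIO_PATH, duration=3.0)

# Waveform plot
plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)
plt.title("Waveform")

# Log-amplitude spectrogram of the short-time Fourier transform
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
plt.figure(figsize=(10, 3))
librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()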
Feature Extraction:
40 MFCCs extracted from 3-second audio samples.
Averaged across time to form fixed-length feature vectors.
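A minimal sketch of this feature-extraction step, using librosa defaults apart from n_mfcc=40; the helper name extract_mfcc is illustrative.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40, duration=3.0):
    # Load up to 3 seconds of audio and compute 40 MFCCs per frame.
    y, sr = librosa.load(path, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # Average across time to obtain a fixed-length (40,) feature vector.
    return np.mean(mfcc.T, axis=0)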
Model Architecture:
LSTM Layer (256 units) + Dropout for regularization.
Dense layers (ReLU) and Softmax output for emotion classification.
Optimizer: Adam
Loss: Categorical Crossentropy
Training: 50 epochs, 80:20 train-validation split
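A minimal Keras sketch of this architecture and training setup; the dropout rate, Dense layer sizes, and batch size are assumptions not specified above, and each 40-dimensional MFCC vector is fed to the LSTM as a length-40 sequence of scalars.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

NUM_CLASSES = 7  # angry, disgust, fear, happy, neutral, sad, surprise
N_MFCC = 40

model = Sequential([
    LSTM(256, input_shape=(N_MFCC, 1)),     # 256-unit LSTM layer
    Dropout(0.2),                           # dropout rate is an assumption
    Dense(128, activation="relu"),          # hidden layer sizes are assumptions
    Dropout(0.2),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# X: (num_samples, 40, 1) MFCC features; y: one-hot labels of shape (num_samples, 7)
# history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=64)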
Results:
Performance:
Training Accuracy: ~99.79%
Validation Accuracy: ~93.93%
Loss curves show stable learning.
Slight overfitting was observed; data augmentation could reduce it (see the augmentation sketch below).
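One way to pursue that improvement is waveform-level augmentation; the sketch below shows three common transforms (additive noise, pitch shifting, time stretching) with illustrative parameter values, not settings used in this study.
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    # Add low-level Gaussian noise to the waveform.
    return y + noise_factor * np.random.randn(len(y))

def pitch_shift(y, sr, n_steps=2):
    # Shift the pitch up by two semitones.
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def time_stretch(y, rate=0.9):
    # Slow the clip down slightly without changing pitch.
    return librosa.effects.time_stretch(y=y, rate=rate)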
Confusion Matrix Insights:
High accuracy for angry, sad, and neutral.
Disgust and surprise were often confused due to similar acoustic patterns.
Fear was sometimes misclassified as sadness, consistent with human perception.
Challenges:
Similar-sounding emotions are difficult to separate.
Short speech clips limit context.
Speaker variability (accent, gender, tone).
Background noise affects clarity.
Model Comparison:
Model    Accuracy    Training Time    Memory Usage    Real-Time Performance
LSTM     98.75%      Longer           Higher          Very Good
SVM      85.36%      Shorter          Lower           Good
The LSTM outperformed traditional models such as the SVM by 10–15 percentage points, demonstrating its strength in learning temporal speech patterns.
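For reference, the sketch below shows how an SVM baseline of this kind could be trained on the same averaged MFCC vectors using scikit-learn; the RBF kernel and C value are assumptions, since the text does not specify the SVM configuration.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (num_samples, 40) averaged MFCC vectors; labels: integer emotion IDs
# X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.2, stratify=labels)

svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
# svm_clf.fit(X_train, y_train)
# print("Validation accuracy:", svm_clf.score(X_val, y_val))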
Conclusion
In this research, we developed a Speech Emotion Recognition (SER) system using deep learning techniques, specifically focusing on feature extraction through MFCCs and classification using an LSTM-based model. Our experiments demonstrated that the LSTM model could effectively capture the temporal dynamics of speech signals and classify emotions such as happiness, sadness, anger, fear, disgust, surprise, and neutrality with promising accuracy. The overall performance shows the potential of deep learning approaches for emotion-aware human-computer interaction systems.
Despite the encouraging results, the study also revealed certain challenges. Emotions with overlapping acoustic features, such as fear and sadness or disgust and surprise, were more difficult to distinguish, leading to occasional misclassifications. Additionally, variations in speaker accents, recording quality, and short speech durations influenced model performance. Addressing these challenges will require more diverse datasets, advanced model architectures such as attention mechanisms, and data augmentation strategies to improve generalization and robustness.
In the future, this work can be extended by integrating multimodal emotion recognition that combines speech with facial expressions, gestures, or physiological signals to create a more comprehensive and accurate emotion detection system. Moreover, exploring transformer-based models and transfer learning from large speech pre-trained models could further enhance recognition accuracy. With continuous advancements, Speech Emotion Recognition systems hold the potential to revolutionize applications in virtual assistants, healthcare, customer service, and interactive entertainment.