Emotion recognition from speech (ERS) automatically identifies human emotions from speech and is a significant application of affective computing. The system follows a straightforward pipeline: preprocessing, acoustic feature extraction, and supervised classification. This paper analyzes a CNN-LSTM model that uses MFCCs, prosodic features, and spectral features extracted from two popular speech emotion datasets, RAVDESS and CREMA-D. We also compare the performance of several machine learning classifiers: Support Vector Machines, Random Forest classifiers, k-Nearest Neighbors, and a Multi-Layer Perceptron (MLP). The results indicate that the MLP classifier outperforms the other classifiers with an accuracy of 85%, suggesting that neural networks can be used for effective speech emotion recognition.
Introduction
This paper is a research overview of Speech Emotion Recognition (SER), an area of affective computing that aims to automatically identify human emotions from speech signals.
It begins by explaining that emotions are naturally expressed through speech features such as pitch, tone, energy, and speaking rate, beyond just the spoken words. SER focuses on analyzing these paralinguistic cues to classify emotions like happiness, sadness, anger, fear, and neutrality. The field has become increasingly important due to applications in mental health monitoring, intelligent tutoring systems, virtual assistants, and human-computer interaction.
Despite progress, SER remains challenging due to speaker differences, background noise, cultural variation, and overlapping emotional expressions. The paper positions itself as a systematic review and experimental study that analyzes architectures, evaluates performance metrics (accuracy, precision, recall, F1-score), and studies the impact of different feature combinations.
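The four metrics named above can be computed directly from true and predicted labels. The following is a minimal sketch rather than the evaluation code used in this study; macro averaging (an unweighted mean over emotion classes) is an assumption, since the averaging scheme is not stated, and the function name `classification_metrics` is hypothetical.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the paper does not state which
    averaging scheme it uses.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        precision = safe_div(tp[label], tp[label] + fp[label])
        recall = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(safe_div(2 * precision * recall, precision + recall))

    n_classes = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n_classes),
        "recall": safe_div(sum(recalls), n_classes),
        "f1": safe_div(sum(f1s), n_classes),
    }


if __name__ == "__main__":
    truth = ["angry", "angry", "happy", "sad"]
    preds = ["angry", "happy", "happy", "sad"]
    print(classification_metrics(truth, preds))
```

Macro averaging weights every emotion equally, which makes weak performance on a rare class visible; a sample-weighted average would instead track overall accuracy more closely.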
The literature review traces SER development from early statistical models such as GMMs and HMMs, which relied on handcrafted prosodic features, to machine learning models like SVMs and Random Forests using MFCC features. It then highlights the shift to deep learning, where CNNs and LSTMs significantly improved performance by automatically learning features from spectrograms and audio signals. More recent models include attention mechanisms and Transformers, achieving state-of-the-art results on datasets like IEMOCAP, RAVDESS, and CREMA-D.
The study uses a multi-corpus dataset approach, combining TESS, RAVDESS, and CREMA-D to improve diversity in speakers and recording conditions. A total of 15,538 audio samples across 129 speakers are standardized into seven emotion categories. Audio preprocessing includes normalization, trimming, and segmentation into fixed 3-second clips.
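The preprocessing steps described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the study's implementation: peak normalization, an amplitude threshold of 0.01 for trimming, and pad-or-truncate to a single 3-second clip are all assumed, because the source does not specify how each step is performed. The function name `preprocess_clip` is hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def preprocess_clip(samples, sample_rate, clip_seconds=3.0, silence_threshold=0.01):
    """Normalize, trim silence, and force a fixed clip length.

    All parameter defaults except the 3-second clip length are assumptions.
    """
    # Peak normalization: scale so the largest absolute amplitude is 1.0.
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > 0:
        samples = [s / peak for s in samples]
    else:
        samples = list(samples)

    # Trim leading and trailing samples below the silence threshold.
    voiced = [i for i, s in enumerate(samples) if abs(s) > silence_threshold]
    samples = samples[voiced[0]:voiced[-1] + 1] if voiced else []

    # Fix the length: truncate long clips, zero-pad short ones.
    target_length = int(round(clip_seconds * sample_rate))
    if len(samples) >= target_length:
        return samples[:target_length]
    return samples + [0.0] * (target_length - len(samples))


if __name__ == "__main__":
    raw = [0.0] * 5 + [0.5, -0.25] + [0.0] * 5
    clip = preprocess_clip(raw, sample_rate=10)
    print(len(clip), clip[:3])
```

A fixed clip length gives every sample the same number of frames, which is what allows all inputs to share one feature-matrix shape.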
Feature extraction is primarily based on MFCCs (Mel-Frequency Cepstral Coefficients), which effectively represent speech characteristics aligned with human auditory perception. Each audio sample is converted into a 130×40 feature representation for model input.
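The 130×40 shape can be checked with simple frame arithmetic. The sketch below assumes a 22,050 Hz sample rate, a hop length of 512 samples, centered framing, and 40 coefficients; of these, only the 3-second clip length and the resulting 130×40 shape come from the text. The other values are common defaults in audio toolkits such as librosa and are used here only because they reproduce the stated shape. The function name `mfcc_matrix_shape` is hypothetical.

```python
def mfcc_matrix_shape(clip_seconds, sample_rate, hop_length, n_mfcc):
    """Return (frames, coefficients) for a clip under centered framing.

    With centered framing, the frame count is 1 + n_samples // hop_length.
    """
    n_samples = int(clip_seconds * sample_rate)
    n_frames = 1 + n_samples // hop_length
    return n_frames, n_mfcc


if __name__ == "__main__":
    # Assumed parameters: 22,050 Hz, hop of 512 samples, 40 coefficients.
    print(mfcc_matrix_shape(3.0, 22050, 512, 40))  # (130, 40)
```

Under these assumptions a 3-second clip holds 66,150 samples, and 1 + 66,150 // 512 = 130 frames, each described by 40 coefficients.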
Conclusion
This paper presents a speaker-independent speech emotion recognition system based on a hybrid CNN-LSTM model, which was trained on a composite corpus that combined the TESS, RAVDESS, and CREMA-D speech datasets. By combining disparate sources of emotional speech and using a speaker-based data splitting strategy, this method aims to assess generalization performance on novel speakers. The emotional features were extracted from the Mel-Frequency Cepstral Coefficients (MFCCs), which are effective at representing the perceptually important spectral and temporal information in speech signals.
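A speaker-based split assigns every recording from a given speaker to exactly one partition, so that test speakers are never heard during training. The following is a minimal sketch of that idea, not the split used in this study: the 20% test fraction, the fixed seed, and the function name `speaker_based_split` are all assumptions.

```python
import random


def speaker_based_split(samples, test_fraction=0.2, seed=0):
    """Split (speaker_id, item) pairs so that no speaker spans both sets.

    The test fraction and seed are illustrative assumptions.
    """
    speakers = sorted({speaker for speaker, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Hold out whole speakers, not individual recordings.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [s for s in samples if s[0] not in test_speakers]
    test = [s for s in samples if s[0] in test_speakers]
    return train, test


if __name__ == "__main__":
    data = [(f"spk{i}", f"clip{i}_{j}.wav") for i in range(10) for j in range(3)]
    train, test = speaker_based_split(data)
    print(len(train), len(test))
```

Splitting by recording instead would let the model see each test speaker's voice during training, which tends to overstate accuracy on novel speakers.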
The analysis concludes that this model can learn discriminative emotional features and achieve good training performance and validation accuracy. The convolutional layers helped to extract local temporal features effectively, and the Long Short-Term Memory (LSTM) layer helped to model the long-term emotional relationships between the speech frames. The model was able to converge and generalize well despite the inter-speaker variability and heterogeneity of the datasets.
However, reduced performance on some emotion classes highlights the challenge of separating emotions that are acoustically similar, as well as the effect of cross-corpus variation. Future research will focus on data augmentation methods, attention models, and transformer models to improve performance. In addition, methods for handling class imbalance and domain adaptation will be investigated.