Abstract
Speech Emotion Recognition (SER) is a crucial domain within speech processing that focuses on detecting and classifying emotional states conveyed through spoken language. Existing systems rely on self-supervised learning to analyze speech signals, but they often suffer from lower prediction accuracy because these models capture temporal dependencies and emotional nuances only to a limited extent. To address this limitation, the Long Short-Term Memory (LSTM) algorithm is proposed as an enhancement. With its ability to retain long-term dependencies and model sequential data effectively, LSTM significantly improves the accuracy of emotion classification. By better capturing the complex patterns in speech, the proposed LSTM-based approach offers more reliable emotion detection and overcomes the drawbacks of the existing self-supervised learning system.
Introduction
Speech Emotion Recognition (SER) aims to detect human emotions from the voice by analyzing acoustic features such as pitch, tone, and rhythm. Traditional methods based on handcrafted features and classical machine learning struggled with variation in language, accent, and noise. Recent approaches use self-supervised learning (SSL) models and deep learning techniques such as Long Short-Term Memory (LSTM) networks, which better capture temporal dependencies and improve emotion detection accuracy in real-world conditions.
Literature Survey
The survey highlights several advancements:
Combining SSL with spectral features via Mixture of Experts improves robustness against domain shifts.
SER can be applied in public safety settings to detect disruptive behavior regardless of speaker identity or gender.
Transforming speech data into 2D formats using Hilbert curves enhances feature extraction and emotion recognition.
Spatial-temporal parallel networks extract richer emotional features without losing continuity.
Pre-trained models like Wav2vec2 and HuBERT provide valuable embeddings for improving SER performance.
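As an illustration of how such pre-trained embeddings can be obtained, the sketch below uses the Hugging Face transformers library together with librosa; it is a generic example rather than the method of any surveyed work, and the checkpoint name facebook/wav2vec2-base and the mean-pooling step are assumptions.

```python
# Sketch: extracting utterance-level embeddings from a pre-trained Wav2vec2 model.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Wav2vec2 expects 16 kHz mono audio.
waveform, sr = librosa.load("sample.wav", sr=16000, mono=True)
inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    frames = model(**inputs).last_hidden_state   # (1, n_frames, 768)

# Mean-pool over time to obtain one fixed-size embedding per utterance,
# which can then feed a downstream SER classifier.
utterance_embedding = frames.mean(dim=1)         # (1, 768)
```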
Proposed System:
Uses MFCC for feature extraction and LSTM for classification, capitalizing on LSTM’s ability to model sequential speech data.
The system architecture preprocesses raw audio into MFCC feature sequences, passes them through LSTM layers, and outputs emotion predictions (a feature-extraction sketch follows this list).
The implementation used Python, TensorFlow/Keras, and Librosa, and was trained on the RAVDESS dataset with its eight emotion classes.
Training involved data normalization, an 80:20 train-test split, the Adam optimizer, categorical cross-entropy loss, and five-fold cross-validation (see the training sketch after this list).
A prototype deployed via Flask supports emotion prediction on real-time or pre-recorded input (a minimal endpoint sketch is given below).
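A minimal sketch of the preprocessing and MFCC feature-extraction step referenced above, assuming Librosa; the 40-coefficient setting, 200-frame padding length, and silence-trimming threshold are illustrative assumptions rather than values reported for the system.

```python
# Sketch: load an utterance, trim silence, normalize, and extract an MFCC sequence.
import numpy as np
import librosa

def extract_mfcc(path, sr=22050, n_mfcc=40, max_frames=200):
    # Load audio at a fixed sampling rate and trim leading/trailing silence.
    audio, _ = librosa.load(path, sr=sr, mono=True)
    audio, _ = librosa.effects.trim(audio, top_db=25)

    # Peak-normalize so amplitude differences between recordings
    # do not dominate the learned features.
    audio = audio / (np.max(np.abs(audio)) + 1e-8)

    # MFCC matrix has shape (n_mfcc, frames); transpose to (frames, n_mfcc)
    # so each time step is a feature vector for the LSTM.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

    # Pad or truncate to a fixed number of frames so utterances can be batched.
    if mfcc.shape[0] < max_frames:
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]
```

Padding to a fixed length keeps the input shape constant across utterances; a masking layer could be used instead so the model ignores padded frames.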
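A hedged Keras sketch of the LSTM classifier and the training configuration (Adam optimizer, categorical cross-entropy loss, 80:20 split); the layer widths, dropout rate, epoch count, and batch size are assumptions, and a single stratified split stands in here for the full five-fold cross-validation.

```python
# Sketch: stacked-LSTM classifier over MFCC sequences for the 8 RAVDESS emotions.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models, utils

NUM_CLASSES = 8          # RAVDESS emotion labels
INPUT_SHAPE = (200, 40)  # (frames, n_mfcc), matching the extraction sketch

def build_model():
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.LSTM(128, return_sequences=True),  # keep the sequence for the next LSTM layer
        layers.Dropout(0.3),
        layers.LSTM(64),                          # final hidden state summarizes the utterance
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train(X, y):
    # X: (n_samples, 200, 40) MFCC sequences; y: integer emotion labels.
    y_onehot = utils.to_categorical(y, NUM_CLASSES)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_onehot, test_size=0.2, stratify=y, random_state=42)  # 80:20 split
    model = build_model()
    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              epochs=50, batch_size=32)
    return model, (X_test, y_test)
```

Setting return_sequences=True on the first LSTM lets the second LSTM see the full frame sequence rather than only a single summary vector.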
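A minimal Flask sketch of the deployed prototype: one endpoint accepts an uploaded recording and returns the predicted emotion. The route name, saved-model path, and label order are assumptions, and extract_mfcc refers to the hypothetical helper from the feature-extraction sketch.

```python
# Sketch: Flask endpoint that predicts an emotion for an uploaded audio file.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

from ser_features import extract_mfcc  # hypothetical module holding the helper sketched earlier

app = Flask(__name__)
model = load_model("ser_lstm.h5")       # assumed path of the trained model

# Assumed label order matching the training-time encoding of the 8 RAVDESS classes.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded recording, extract MFCCs, and add a batch dimension.
    audio_file = request.files["audio"]
    audio_file.save("upload.wav")
    features = extract_mfcc("upload.wav")             # (200, 40)
    probs = model.predict(features[np.newaxis, ...])  # (1, 8)
    return jsonify({"emotion": EMOTIONS[int(np.argmax(probs))]})

if __name__ == "__main__":
    app.run(debug=True)
```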
Results & Discussion:
Data collection focused on diverse, labeled emotional speech samples.
Preprocessing removed noise and normalized audio for better feature extraction.
MFCCs effectively captured speech characteristics vital for emotion classification.
Model evaluation used accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.
Accuracy measures the fraction of all predictions that are correct, while precision measures the fraction of predicted positives that are truly positive; both are important for evaluating SER models (a brief evaluation sketch follows this list).
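A short evaluation sketch using scikit-learn for the metrics listed above, assuming the held-out test split and model from the training sketch; macro averaging across the eight classes is one reasonable choice, not necessarily the one used in the reported results.

```python
# Sketch: accuracy, precision, recall, F1, ROC AUC, and confusion matrix
# computed on the held-out test set.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

def evaluate(model, X_test, y_test_onehot):
    probs = model.predict(X_test)          # (n_samples, 8) class probabilities
    y_true = np.argmax(y_test_onehot, axis=1)
    y_pred = np.argmax(probs, axis=1)

    print("Accuracy :", accuracy_score(y_true, y_pred))                    # correct / total
    print("Precision:", precision_score(y_true, y_pred, average="macro"))  # TP / (TP + FP), macro-averaged
    print("Recall   :", recall_score(y_true, y_pred, average="macro"))
    print("F1-score :", f1_score(y_true, y_pred, average="macro"))
    print("ROC AUC  :", roc_auc_score(y_true, probs, multi_class="ovr"))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```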
Conclusion
The proposed system successfully implements Speech Emotion Recognition (SER) by combining Mel-frequency cepstral coefficients (MFCC) for feature extraction and Long Short-Term Memory (LSTM) for classification.
MFCC effectively reduces the complexity of the raw audio data while preserving critical information about speech characteristics, ensuring that essential emotional cues are captured. The LSTM model, known for its strength in handling sequential data, leverages these features to recognize patterns in speech signals over time, enabling accurate emotion detection. This integration of MFCC and LSTM creates a robust system that can accurately interpret emotional states from speech, contributing to enhanced performance in SER tasks. The approach balances computational efficiency with the preservation of vital emotional information, making it a powerful solution for applications in speech analysis, human-computer interaction, and emotional AI systems. Future work could explore the integration of attention mechanisms with the LSTM model to enhance the system's ability to focus on critical emotional segments in speech. Additionally, incorporating multimodal inputs, such as combining facial expressions with speech, could further improve emotion recognition accuracy. Expanding the dataset to include diverse languages and emotional expressions would also enhance the model's robustness and generalization.
References
[1] Mekruksavanich, S.; Jitpattanakul, A. Sensor-based Complex Human Activity Recognition from Smartwatch Data Using Hybrid Deep Learning Network. In Proceedings of the 36th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), Jeju, Republic of Korea, 27–30 June 2021; pp. 1–4.
[2] Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165.
[3] Latif, S.; Qadir, J.; Qayyum, A.; Usama, M.; Younis, S. Speech technology for healthcare: Opportunities, challenges, and state of the art. IEEE Rev. Biomed. Eng. 2020, 14, 342–356.
[4] Cho, J.; Kim, B. Performance analysis of speech recognition model based on neuromorphic architecture of speech data preprocessing technique. J. Inst. Internet Broadcast Commun. 2022, 22, 69–74.
[5] Lee, S.; Park, H. Deep-learning-based Gender Recognition Using Various Voice Features. In Proceedings of the Symposium of the Korean Institute of Communications and Information Sciences, Seoul, Republic of Korea, 17–19 November 2021; pp. 18–19.
[6] Fonseca, A.H.; Santana, G.M.; Bosque Ortiz, G.M.; Bampi, S.; Dietrich, M.O. Analysis of ultrasonic vocalizations from mice using computer vision and machine learning. Elife 2021, 10, e59161.
[7] Lee, Y.; Lim, S.; Kwak, I.Y. CNN-based acoustic scene classification system. Electronics 2021, 10, 371.
[8] Ma, X.; Wu, Z.; Jia, J.; Xu, M.; Meng, H.; Cai, L. Emotion recognition from variable-length speech segments using deep learning on spectrograms. Proc. Interspeech 2018, 2018, 3683–3687.
[9] Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5.
[10] Zhang, S.; Li, C. Research on feature fusion speech emotion recognition technology for smart teaching. Mob. Inf. Syst. 2022, 2022, 7785929.
[11] Subramanian, R.R.; Sireesha, Y.; Reddy, Y.S.P.K.; Bindamrutha, T.; Harika, M.; Sudharsan, R.R. Audio Emotion Recognition by Deep Neural Networks and Machine Learning Algorithms. In Proceedings of the 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Virtual Conference, 8–9 October 2021; pp. 1–6.
[12] Zheng, L.; Li, Q.; Ban, H.; Liu, S. Speech Emotion Recognition Based on Convolution Neural Network Combined with Random Forest. In Proceedings of the 2018 Chinese Control and Decision Conference (CCDC), Shenyang, China, 9–11 June 2018; pp. 4143–4147.
[13] Li, H.; Zhang, X.; Wang, M.J. Research on speech Emotion Recognition Based on Deep Neural Network. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 22–24 October 2021; pp. 795–799.
[14] Zhang, Y.; Du, J.; Wang, Z.; Zhang, J.; Tu, Y. Attention-based Fully Convolutional Network for Speech Emotion Recognition. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1771–1775.
[15] Carofilis, A.; Alegre, E.; Fidalgo, E.; Fernández-Robles, L. Improvement of accent classification models through grad-transfer from spectrograms and gradient-weighted class activation mapping. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2859–2871.
[16] Xu, J.; Deng, J.; Schuller, B. Attention-based multimodal fusion for emotion recognition using speech and text. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 190–204.
[17] Gong, Y.; Poellabauer, C. Self-supervised representation learning for speech emotion recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022); pp. 7412–7416.
[18] Tzirakis, P.; Zhang, J.; Schuller, B. End-to-end speech emotion recognition using deep neural networks and self-attention mechanisms. Comput. Speech Lang. 2022, 71, 101258.
[19] Latif, S.; Qayyum, A.; Usama, M.; Qadir, J. Speech emotion recognition: Features, classification schemes, and databases. J. Intell. Fuzzy Syst. 2021, 40(3), 1117–1132.
[20] Jaiswal, A.; Mahata, D.; Shah, R.R. Multi-task learning for speech emotion recognition using self and supervised tasks. Knowl.-Based Syst. 2021, 227, 107203.
[21] Gupta, R.; Narayan, S.M. Multilingual speech emotion recognition using MFCC and Bi-LSTM. In Proceedings of the 2022 International Conference on Communication, Control and Intelligent Systems (CCIS); pp. 35–40.
[22] Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using spectrogram patch-based CNN. Comput. Speech Lang. 2021, 66, 101142.
[23] Rani, S.; Pasha, M.A. A novel hybrid model for speech emotion recognition using deep CNN and GRU. Multimed. Tools Appl. 2022, 81, 20409–20429.
[24] Chaudhari, V.; Chauhan, P. Emotion classification from speech using MFCC, Chroma and LSTM. In Proceedings of the 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT); pp. 913–917.
[25] Jiang, K.; Yin, Z.; Ren, F. Adaptive attention network for cross-corpus speech emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1135–1147.
[26] Sun, H.; Li, Y.; Liu, C. Enhancing speech emotion recognition with spectro-temporal attention and CNN-BiLSTM models. Appl. Acoust. 2023, 203, 109248.