Speech Emotion Recognition (SER) enables systems to infer human emotions from speech signals. This paper presents an SER system that uses Mel-Frequency Cepstral Coefficients (MFCCs) as the primary features extracted from audio. The system is trained and evaluated on a combined RAVDESS and TESS dataset comprising 5,252 labeled audio samples across eight emotion categories. A traditional Decision Tree classifier is compared with a deep learning-based One-Dimensional Convolutional Neural Network (1D CNN). The Decision Tree achieves approximately 68% accuracy, whereas the CNN reaches about 85.5% validation accuracy. In addition to model development, a desktop application built with CustomTkinter performs real-time emotion detection from microphone input or audio files. The work highlights the role of feature extraction, model selection, and deployment considerations in building a practical SER system.
Introduction
This paper describes a Speech Emotion Recognition (SER) system that identifies human emotions from speech using audio processing and machine learning techniques. Because speech carries both linguistic and emotional information, such a system is useful in applications like virtual assistants, healthcare, and customer service. Emotion detection remains challenging, however, because of background noise, differences between speakers, and the subjective way emotions are expressed.
The project uses Mel-Frequency Cepstral Coefficient (MFCC) features extracted from audio and compares two models: a traditional Decision Tree and a deep learning-based 1D CNN (Conv1D). Both are trained on a combined RAVDESS and TESS dataset (5,252 samples covering eight emotion classes). The project also includes a real-time desktop application that predicts emotion from live microphone input or audio files.
The system pipeline includes feature extraction, model training, and real-time inference using tools like Librosa, TensorFlow/Keras, Scikit-learn, and CustomTkinter. MFCC features are stored locally using Joblib, and the trained CNN model is saved as an H5 file.
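The storage and classical-training steps of this pipeline might look roughly like the sketch below: features are cached with Joblib, reloaded, and used to fit a Scikit-learn Decision Tree. The random feature matrix, the file name `features.joblib`, the 80/20 split, and the default tree settings are all assumptions for illustration, so the printed accuracy says nothing about the reported results.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder data: 400 random 40-dimensional vectors with 8 class labels.
# Real MFCC vectors from RAVDESS and TESS would take their place.
X = rng.normal(size=(400, 40)).astype(np.float32)
y = rng.integers(0, 8, size=400)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache file name; the paper does not give one.
    cache_path = os.path.join(tmp_dir, "features.joblib")
    joblib.dump({"X": X, "y": y}, cache_path)
    cached = joblib.load(cache_path)

# Assumed 80/20 split, stratified so each class appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    cached["X"], cached["y"], test_size=0.2, random_state=0, stratify=cached["y"]
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"held-out accuracy on placeholder data: {accuracy:.3f}")
```

Caching the extracted features this way means the comparatively slow audio decoding and MFCC computation only has to run once, however many models are trained afterwards.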
Results show that the Decision Tree achieves around 68% accuracy, while the CNN raises this to about 85.5% validation accuracy, which suggests that the convolutional model captures patterns in the MFCC features that the tree misses. The project demonstrates a complete SER pipeline and illustrates the trade-off between classical and deep learning approaches in emotion recognition tasks.
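For orientation, a 1D CNN over 40 MFCC values and eight output classes could be defined in Keras along the following lines. The paper does not list the architecture, so the layer counts, filter sizes, dropout rate, and the choice of the Adam optimizer here are illustrative assumptions rather than the configuration that produced the reported accuracy.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_MFCC = 40     # feature length reported in the paper
NUM_CLASSES = 8   # emotion categories reported in the paper


def build_cnn() -> keras.Model:
    """Small Conv1D classifier; all layer sizes are assumed, not from the paper."""
    model = keras.Sequential([
        # Each MFCC vector is treated as a length-40 sequence with one channel.
        keras.Input(shape=(NUM_MFCC, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_cnn()

# A batch of placeholder MFCC vectors with the channel axis Conv1D expects.
batch = np.random.default_rng(0).normal(size=(4, NUM_MFCC)).astype("float32")
batch = batch[..., np.newaxis]
probabilities = model.predict(batch, verbose=0)
print(probabilities.shape)  # (4, 8)
```

Each output row is a probability distribution over the eight classes, and the predicted emotion is the index of the largest value. Saving the trained model in the H5 format mentioned above corresponds to calling `model.save` with a path ending in `.h5`.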
Conclusion
This project implements a Speech Emotion Recognition system using 40 MFCC features on a combined RAVDESS and TESS dataset. A Decision Tree model and a 1D CNN were trained on the same data for comparison. The Decision Tree achieved around 68% accuracy, while the CNN reached approximately 85.5% validation accuracy, a clear improvement from deep learning on the same feature set. The system also includes a desktop application (LiveMic.py) for real-time prediction from microphone input or WAV files, with waveform visualization and model loading support. Overall, the project delivers a working end-to-end pipeline covering feature extraction, model training, and real-time inference, with the CNN as the better-performing model.
References
[1] S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLOS ONE, vol. 13, no. 5, p. e0196391, 2018.
[2] K. Dupuis and M. K. Pichora-Fuller, “Toronto Emotional Speech Set (TESS),” University of Toronto, 2010. [Online]. Available: https://tspace.library.utoronto.ca/handle/1807/24487
[3] B. McFee et al., “librosa: Audio and music signal analysis in Python,” in Proc. 14th Python in Science Conference (SciPy), 2015, pp. 18–25.
[4] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[5] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. [Online]. Available: https://www.tensorflow.org/
[6] F. Chollet et al., “Keras,” 2015. [Online]. Available: https://keras.io/
[7] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, pp. 357–366, Aug. 1980.
[8] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[9] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. International Conference on Learning Representations (ICLR), 2015.
[10] T. N. Sainath et al., “Convolutional neural networks for LVCSR,” in Proc. IEEE ICASSP, 2013, pp. 8614–8618.