Speech emotion recognition (SER) plays a crucial role in human-computer interaction, enabling systems to interpret and respond to user emotions effectively. In this research, we introduce Audio Aura, a machine learning-based system for classifying emotion in voice signals. To improve classification accuracy and extract rich speech representations, the system uses a transformer-based model, Wav2Vec2. By leveraging Wav2Vec2's self-supervised learning capabilities, Audio Aura effectively captures temporal and contextual features in speech. The Toronto Emotional Speech Set (TESS) dataset is used to train and assess the system, which shows remarkable accuracy in recognizing emotions such as neutrality, anger, sadness, and happiness. Compared to traditional machine learning approaches, transformer-based models demonstrate significant improvements in affective computing, making SER applications more robust in real-world scenarios.
Introduction
Speech Emotion Recognition (SER) is a vital technology for understanding human emotions from speech, enhancing applications like virtual assistants, customer service, and human-computer interaction. Traditional emotion recognition relied on text or facial cues, but speech offers a more natural and unobtrusive method. Recent advances in AI and deep learning, particularly using models like Wav2Vec2, allow automatic extraction of complex speech features such as pitch, tone, and rhythm to accurately classify emotions.
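As a rough illustration of this feature-extraction step (the paper does not publish its code, so the checkpoint name and tensor shapes below are assumptions), a pretrained Wav2Vec2 encoder can be queried for frame-level speech embeddings as follows:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the paper does not name the exact pretrained weights.
CHECKPOINT = "facebook/wav2vec2-base"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
encoder = Wav2Vec2Model.from_pretrained(CHECKPOINT)
encoder.eval()

def extract_embeddings(waveform, sample_rate=16_000):
    """Return frame-level Wav2Vec2 embeddings of shape (frames, hidden_size)."""
    inputs = feature_extractor(waveform, sampling_rate=sample_rate,
                               return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)  # last_hidden_state: (1, frames, 768)
    return outputs.last_hidden_state.squeeze(0)
```

These contextual embeddings, rather than hand-crafted acoustic features, are what downstream emotion classifiers consume.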
This study introduces Audio Aura, a deep learning-based SER system that uses the Wav2Vec2 transformer model for feature extraction and a softmax classifier for emotion classification across seven categories: neutral, happy, sad, angry, fearful, disgusted, and surprised. Audio Aura was trained and tested on benchmark datasets such as TESS, achieving 95% accuracy and a weighted F1-score of 0.93.
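A minimal sketch of the classification stage described above: the 768-dimensional hidden size matches the wav2vec2-base assumption in the previous sketch, and the mean-pooling step is an illustrative choice, not an architectural detail given in the paper.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry",
            "fearful", "disgusted", "surprised"]

class EmotionHead(nn.Module):
    """Mean-pools frame embeddings, then maps to seven emotion probabilities."""
    def __init__(self, hidden_size=768, num_classes=len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, embeddings):
        # embeddings: (frames, hidden_size) -> one utterance-level vector
        pooled = embeddings.mean(dim=0)
        return torch.softmax(self.classifier(pooled), dim=-1)

# Usage with the extract_embeddings sketch above:
#   probs = EmotionHead()(extract_embeddings(waveform))
#   predicted = EMOTIONS[int(probs.argmax())]
```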
The system preprocesses audio by resampling and noise reduction, extracts speech embeddings via Wav2Vec2, and classifies emotions in real-time with high precision. It is robust against accents and some background noise, making it suitable for applications in mental health monitoring, customer feedback analysis, and human-robot interaction.
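The preprocessing stage might look like the following sketch, assuming torchaudio for loading and resampling to the 16 kHz rate Wav2Vec2 expects; the paper does not specify its noise-reduction method, so that step is left as a labeled placeholder.

```python
import torch
import torchaudio
import torchaudio.functional as F

TARGET_SR = 16_000  # Wav2Vec2 models are pretrained on 16 kHz audio

def preprocess(path):
    """Load an audio file, downmix to mono, and resample to 16 kHz."""
    waveform, sr = torchaudio.load(path)  # (channels, samples)
    waveform = waveform.mean(dim=0)       # downmix to mono
    if sr != TARGET_SR:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
    # Placeholder: the paper does not name its noise-reduction method.
    # A spectral-gating library such as noisereduce could be applied here.
    return waveform
```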
Despite strong performance, challenges remain such as sensitivity to noise, confusion between similar emotions (e.g., fear and surprise), and computational demands for real-time processing. Future improvements may involve better noise handling, transfer learning, and optimization for edge devices.
Overall, Audio Aura demonstrates the potential to advance emotion-aware AI systems, enabling more personalized and emotionally intelligent interactions across various industries.
Conclusion
The Speech Emotion Recognition (SER) system presented in this report effectively identifies emotions from speech using Wav2Vec2-based feature extraction and a softmax classifier. Using deep speech embeddings, the system captures phonetic and prosodic variations, enabling accurate emotion identification. The model was trained and tested on benchmark datasets, reaching a high accuracy of 95% and demonstrating reliability and robustness in classifying emotions into neutral, happy, sad, angry, fearful, disgusted, and surprised categories. Through thorough testing and analysis, the system has proven to be an efficient and scalable real-time emotion recognition tool. The use of deep-learning-based methods removes the need for manual feature engineering, making the system versatile and applicable to varied applications such as mental health monitoring, sentiment analysis, and AI-driven virtual assistants. With future expansion, this system can revolutionize human-computer interaction, promoting a more empathetic and adaptable AI-driven experience across sectors.