Abstract
This project explores the classification of animal emotions from their vocalizations using deep learning models. Leveraging the Kaggle dataset "Audio Cats and Dogs" and expanding it to include multiple animal species, the study employs feature extraction, signal processing, and neural network architectures to analyze audio patterns. Clustering techniques and classification models are used to detect and categorize emotional states, enhancing our understanding of animal communication. The project aims to achieve high classification accuracy and to develop a robust model for real-time emotion recognition in animals.
Introduction
This study explores the use of deep learning to recognize emotions from animal vocalizations, addressing limitations of traditional, subjective methods. Inspired by advances in human Speech Emotion Recognition (SER) using CNN and LSTM models, the researchers propose a hybrid CNN-LSTM architecture trained on an expanded multi-species dataset including cats, dogs, frogs, lions, and more. The system extracts Mel Frequency Cepstral Coefficients (MFCCs) from audio, leveraging convolutional layers for spatial feature learning and LSTMs for temporal dynamics. It achieves over 85% test accuracy, supporting real-time emotion prediction useful for wildlife monitoring, veterinary care, and animal welfare.
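To make the feature-extraction step concrete, here is a minimal sketch using librosa. The sample rate, coefficient count, and frame budget (sr=22050, n_mfcc=40, max_frames=216) are illustrative assumptions, not parameters reported by the study.

```python
import librosa
import numpy as np

def extract_mfcc(path, sr=22050, n_mfcc=40, max_frames=216):
    """Load one audio clip and return a fixed-size MFCC matrix."""
    # Resample every clip to a common rate so features are comparable.
    y, _ = librosa.load(path, sr=sr)
    # Compute n_mfcc Mel Frequency Cepstral Coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Zero-pad or truncate the time axis so every clip yields the same
    # shape, which a downstream CNN-LSTM requires.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc  # shape: (n_mfcc, max_frames)
```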
The work builds on previous SER research in humans and bioacoustics, addressing challenges such as limited labeled datasets and interspecies vocal variability. Compared to an existing shallow neural network approach that showed moderate accuracy and lacked temporal modeling, the proposed hybrid model better captures complex patterns in animal sounds.
The dataset is well-structured with labeled audio clips covering multiple species and emotions like aggression, calm, or distress. Performance is evaluated using standard metrics (accuracy, precision, recall, F1-score), with confusion matrices helping analyze class distinctions. A user interface demonstrates practical application by predicting emotions from uploaded animal sounds, visualizing waveforms and spectrograms.
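A minimal sketch of this evaluation step using scikit-learn's standard utilities; the y_true and y_pred arrays below are placeholders standing in for the held-out test labels and the model's predictions, not results from the study.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder labels; in practice these come from the test split and the model.
y_true = ["calm", "distress", "aggression", "calm", "distress"]
y_pred = ["calm", "distress", "calm", "calm", "distress"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-score in one report.
print(classification_report(y_true, y_pred))
# Rows are true classes, columns are predictions: off-diagonal cells reveal
# which emotion classes the model confuses with one another.
print(confusion_matrix(y_true, y_pred, labels=["aggression", "calm", "distress"]))
```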
Conclusion
This study presents a novel approach to Speech Emotion Recognition (SER) for animal vocalizations using a hybrid CNN-LSTM deep learning model. The system leverages a custom multi-species dataset of diverse animal sounds paired with emotional labels, and applies preprocessing techniques such as MFCC extraction to convert raw audio signals into compact time-frequency representations. The CNN layers learn spatial audio features, while the LSTM layers model temporal dependencies across vocal sequences.
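One plausible realization of such a hybrid architecture is sketched below in Keras. The filter counts, pooling sizes, and LSTM width are assumptions chosen for illustration; the paper's exact configuration is not reproduced here.

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(n_mfcc=40, max_frames=216, n_classes=5):
    """Convolutions learn local spectral patterns; the LSTM models
    how those patterns evolve over the duration of a call."""
    inputs = layers.Input(shape=(n_mfcc, max_frames, 1))
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Reorder to (time, frequency, channels) and flatten each time step
    # into a feature vector before the recurrent layer.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((x.shape[1], -1))(x)
    x = layers.LSTM(128)(x)  # temporal dependencies across frames
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```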
Experimental results demonstrate the model's strong classification capability, achieving an accuracy exceeding 85% on unseen test data. The framework also includes a real-time prediction module that enables users to input new animal sounds and receive emotion predictions along with waveform and spectrogram visualizations. This real-time applicability positions the model as a promising tool for wildlife monitoring, veterinary diagnostics, and animal behavior research.
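A sketch of what such a prediction-and-visualization module might look like, reusing extract_mfcc from the earlier sketch. The EMOTIONS list is a placeholder label set, not the study's taxonomy, and must match the trained model's output layer.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder label set; must match the n_classes of the trained model.
EMOTIONS = ["aggression", "calm", "distress", "fear", "neutral"]

def predict_and_plot(model, path):
    """Classify one uploaded clip and display its waveform and spectrogram."""
    mfcc = extract_mfcc(path)  # (n_mfcc, max_frames), from the earlier sketch
    probs = model.predict(mfcc[np.newaxis, ..., np.newaxis])[0]
    print("predicted emotion:", EMOTIONS[int(np.argmax(probs))])

    y, sr = librosa.load(path, sr=22050)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
    librosa.display.waveshow(y, sr=sr, ax=ax1)  # time-domain waveform
    ax1.set_title("Waveform")
    # Log-magnitude STFT spectrogram for the second panel.
    S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    img = librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
    ax2.set_title("Spectrogram")
    fig.colorbar(img, ax=ax2, format="%+2.0f dB")
    plt.tight_layout()
    plt.show()
```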
Moreover, the study confirms the feasibility and importance of applying deep learning models originally designed for human SER to the domain of non-human emotion analysis. By bridging this gap, the work significantly contributes to the fields of bioacoustics, animal welfare, and affective computing.