Sound Event Classification (SEC) concerns the automatic recognition and categorization of environmental sounds such as alarms, animal vocalizations, human activities, and machine noise. SEC has real-world applications in smart surveillance, urban sound monitoring, healthcare support systems, and intelligent multimedia analysis. However, traditional SEC methods based on handcrafted features or fully trained convolutional neural networks tend to generalize poorly, are highly sensitive to noise, and are prone to overfitting because labeled audio datasets are relatively small. To address these limitations, this work proposes a hybrid sound event classification framework that couples pretrained deep acoustic feature extraction with ensemble learning. A Pretrained Audio Neural Network (PANN) with the CNN14 architecture, trained on the large-scale AudioSet corpus, is used as a fixed feature extractor to produce robust 2048-dimensional deep acoustic embeddings from audio signals. These embeddings are then classified by an XGBoost classifier, which handles complex decision boundaries well and remains robust on small datasets. This workflow avoids training deep networks entirely and thereby drastically reduces overfitting. The proposed model is evaluated on the ESC-50 environmental sound dataset, which comprises 2,000 audio samples across 50 sound classes, using the official 5-fold cross-validation setup. Experimental results show a mean classification accuracy of 90.45%, with precision, recall, and F1-scores remaining consistent across classes. These findings confirm that combining pretrained acoustic representations with ensemble learning is an effective and reliable solution for environmental sound event classification.
Introduction
Sound Event Classification (SEC) focuses on identifying and categorizing environmental sounds, which are highly diverse, unstructured, and often unpredictable. SEC is important for real-world applications such as smart surveillance, urban noise monitoring, healthcare and assisted living, multimedia indexing, robotics, and smart homes. However, challenges such as noise, overlapping sound sources, variable durations, and limited labeled data make SEC a complex research problem.
Early SEC approaches relied on handcrafted acoustic features (e.g., MFCCs, spectral and temporal features) combined with traditional machine learning classifiers like SVMs and Random Forests. While effective in controlled settings, these methods struggled with noise and complex acoustic patterns. Deep learning methods, especially CNNs trained on spectrograms, improved performance by learning features automatically but require large datasets and high computational resources, often leading to overfitting on small datasets.
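The kind of handcrafted-feature baseline described above can be illustrated with the following minimal sketch, which assumes librosa for MFCC extraction and scikit-learn for the SVM; train_paths and train_labels are hypothetical placeholders, not the datasets or settings used in this paper.

# Minimal sketch of a handcrafted-feature baseline: mean-pooled MFCCs fed to an SVM.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_features(path, sr=22050, n_mfcc=40):
    # Load the clip and summarize it as the per-coefficient mean of its MFCCs.
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # shape: (n_mfcc,)

# train_paths / train_labels are hypothetical lists of audio files and class ids.
X_train = np.stack([mfcc_features(p) for p in train_paths])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_train, train_labels)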
To address these limitations, the proposed study introduces a hybrid SEC framework that combines deep feature extraction and ensemble learning. Pretrained Audio Neural Networks (PANNs) with the CNN14 architecture are used to extract high-level, 2048-dimensional acoustic embeddings from preprocessed audio. These embeddings are then classified using XGBoost, a gradient boosting ensemble algorithm known for strong performance and resistance to overfitting on medium-sized datasets.
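A minimal sketch of this hybrid pipeline is given below. It assumes the panns_inference package for the pretrained CNN14 backbone and the xgboost package for the classifier; the file list, labels, and hyperparameters are illustrative placeholders rather than the configuration used in this study.

# Sketch of the hybrid pipeline: frozen CNN14 (PANNs) embeddings classified by XGBoost.
import numpy as np
import librosa
from panns_inference import AudioTagging
from xgboost import XGBClassifier

tagger = AudioTagging(checkpoint_path=None, device="cpu")  # loads a pretrained CNN14

def cnn14_embedding(path):
    # Return the 2048-dimensional clip-level embedding from the frozen CNN14 backbone.
    audio, _ = librosa.load(path, sr=32000, mono=True)      # PANNs expect 32 kHz input
    _, embedding = tagger.inference(audio[None, :])         # shape: (1, 2048)
    return embedding[0]

# train_paths / train_labels are hypothetical placeholders for the training split.
X_train = np.stack([cnn14_embedding(p) for p in train_paths])
clf = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6,
                    subsample=0.8, colsample_bytree=0.8)
clf.fit(X_train, train_labels)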
The system is evaluated using the ESC-50 environmental sound dataset, which contains 2,000 balanced audio samples across 50 classes, following an official 5-fold cross-validation protocol. Performance is measured using accuracy, precision, recall, and F1-score. Experimental results show consistent and strong classification performance, achieving an average accuracy of 90.45% across folds, demonstrating the robustness and effectiveness of the proposed hybrid framework for sound event classification, particularly in data-limited scenarios.
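The evaluation protocol can be sketched as follows, assuming the clip-level embeddings X, integer labels y, and per-clip fold assignments from ESC-50's metadata have already been loaded as NumPy arrays; the classifier hyperparameters shown are placeholders, not the reported configuration.

# Sketch of the official ESC-50 5-fold cross-validation with standard metrics.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from xgboost import XGBClassifier

fold_scores = []
for fold in range(1, 6):                                   # ESC-50 defines folds 1..5
    train_idx, test_idx = folds != fold, folds == fold     # folds: per-clip fold ids
    clf = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    acc = accuracy_score(y[test_idx], y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y[test_idx], y_pred, average="macro")
    fold_scores.append((acc, prec, rec, f1))

print("mean accuracy:", np.mean([s[0] for s in fold_scores]))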
Conclusion
Environmental sound event classification is concerned with the automatic detection of sound events in real-world audio signals. Environmental sounds are highly heterogeneous, unstructured, and frequently contaminated by background noise, which makes accurate classification difficult. Recent advances in deep learning have improved performance by learning features automatically from time-frequency representations, but training deep models from scratch demands large labeled datasets and substantial computation. Transfer learning has therefore been adopted as a solution to these challenges, leveraging models pretrained on large-scale audio datasets. This paper applies a Pretrained Audio Neural Network (PANN) with the CNN14 architecture to obtain high-level acoustic embeddings that capture critical temporal and spectral characteristics of sound events. An ensemble-based XGBoost classifier is then used to classify these deep features, improving robustness and generalization.