Abstract
In this paper, PLP coefficients and PLPCC features are investigated as a representation of an acoustic scene using a DNN. We have experimented on the DCASE 2018 Task 1 dataset and the DCASE 2017 dataset. Experiments are carried out for subtasks A and B, with individual feature sets as well as decision-level DNN score fusions of different combinations of feature sets. From the experiments, it was observed that the proposed PLP and PLPCC features give better performance for subtasks A and B. For subtasks A and B, individual PLP features yield improvements of 8.9% and 13.6% respectively, while PLPCC features result in improvements of 8.6% and 12.5%. We achieved significant improvements in accuracy for subtasks A (11.4%) and B (14.4%) after fusion of DNN decision-level scores obtained from PLP, PLPCC, and log mel-band energies, compared to the 2018 baseline system. We have also experimented on the 2017 dataset with 4-fold cross-validation, with individual PLP features yielding an improvement of 5.8% and PLPCC features achieving an improvement of 4.7%. The fusion of DNN decision-level scores obtained from PLP, PLPCC, and log mel-band energies gave an improvement of 6.0% compared to the 2017 baseline system.
Introduction
The research on Acoustic Scene Classification (ASC) has gained attention in signal processing and machine learning due to its wide applications in surveillance, smartphones, robotics, and hearing aids. Early ASC methods relied on spectral, cepstral, and energy-based features with classifiers like SVM and HMM. Later, deep learning approaches such as CNNs, I-vectors, and GANs significantly improved accuracy through feature learning and data augmentation, as seen in DCASE challenges (2016–2018).
This paper proposes a DNN-based ASC system that uses Perceptual Linear Prediction (PLP) and PLP Cepstral Coefficient (PLPCC) features, along with log mel-band energy features, and explores score-level fusion of these features to enhance classification accuracy.
Feature extraction: PLP and PLPCC approximate the human auditory system through Bark-scale frequency warping, equal-loudness pre-emphasis, and intensity-to-loudness conversion; PLPCCs are cepstral coefficients derived from the PLP all-pole model.
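The PLP analysis chain described above can be sketched per frame as follows. This is a minimal illustrative implementation, not the authors' exact configuration: the rectangular Bark bands, the band count, the model order, and the frame settings are all assumptions (true PLP uses a smoother critical-band masking curve).

```python
import numpy as np

def hz_to_bark(f):
    # Bark-scale warping used in PLP analysis (Hermansky, 1990)
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def bark_to_hz(b):
    return 600.0 * np.sinh(np.asarray(b, dtype=float) / 6.0)

def equal_loudness(f):
    # Equal-loudness pre-emphasis curve E(f) from the original PLP paper
    w2 = (2.0 * np.pi * np.asarray(f, dtype=float)) ** 2
    return ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

def levinson(r, order):
    # Levinson-Durbin recursion: autocorrelation -> LP coefficients a[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        e *= 1.0 - k * k
    return a

def plp_frame(frame, sr, order=12, n_bands=21):
    # 1) short-time power spectrum of a Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bark = hz_to_bark(np.fft.rfftfreq(len(frame), 1.0 / sr))
    # 2) integrate the spectrum into Bark-spaced critical bands
    #    (crude rectangular bands here, for illustration only)
    edges = np.linspace(bark[0], bark[-1], n_bands + 1)
    bands = np.array([spec[(bark >= lo) & (bark < hi)].sum() + 1e-12
                      for lo, hi in zip(edges[:-1], edges[1:])])
    centers = bark_to_hz(0.5 * (edges[:-1] + edges[1:]))
    # 3) equal-loudness pre-emphasis, 4) cube-root intensity-to-loudness law
    loudness = (equal_loudness(centers) * bands) ** 0.33
    # 5) all-pole (autoregressive) model of the auditory spectrum
    r = np.fft.irfft(loudness)[:order + 1]
    return levinson(r, order)

def plpcc(a, n_ceps=12):
    # PLPCC: cepstra from the LP polynomial via the LPC-to-cepstrum recursion
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n < len(a) else 0.0
        for k in range(1, n):
            acc += (k / n) * c[k] * (a[n - k] if 0 < n - k < len(a) else 0.0)
        c[n] = -acc
    return c[1:]
```

In practice these per-frame vectors would be stacked over a clip (with deltas, normalization, etc.) before being fed to the classifier.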
Classifier: A fully connected Deep Neural Network (DNN) with three hidden layers (ReLU activation, Adam optimizer, softmax output) is used.
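The classifier architecture can be sketched as a plain NumPy forward pass: three ReLU hidden layers followed by a softmax over scene classes. The layer widths, input dimension, and initialization below are illustrative assumptions (the paper does not fix them here), and training with Adam and cross-entropy is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # numerically stable softmax over the last axis
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

class SceneDNN:
    """Fully connected DNN for ASC: three ReLU hidden layers, softmax output.
    Hidden widths are hypothetical placeholders; training (Adam) is omitted."""

    def __init__(self, n_in, n_classes, hidden=(512, 512, 512), seed=0):
        rng = np.random.default_rng(seed)
        dims = [n_in, *hidden, n_classes]
        # He initialization, appropriate for ReLU layers
        self.W = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
                  for a, b in zip(dims[:-1], dims[1:])]
        self.b = [np.zeros(b) for b in dims[1:]]

    def scores(self, x):
        # x: (batch, n_in) feature vectors (e.g. PLP/PLPCC/log-mel statistics)
        h = x
        for W, b in zip(self.W[:-1], self.b[:-1]):
            h = relu(h @ W + b)
        return softmax(h @ self.W[-1] + self.b[-1])
```

The per-class posteriors returned by `scores` are what the fusion stage below operates on.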
Fusion strategy: Combines DNN scores from different feature sets (PLP, PLPCC, Log-Mel) to capture complementary information.
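Decision-level score fusion can be sketched as combining the per-class posteriors produced by the separate feature-stream DNNs. The weighted mean below is one common choice; the equal weights are an assumption, not necessarily the paper's exact fusion rule.

```python
import numpy as np

def fuse_scores(score_mats, weights=None):
    """Fuse per-class DNN posteriors from several feature streams
    (e.g. PLP, PLPCC, log mel-band energies) by a weighted mean.
    score_mats: list of (n_clips, n_classes) arrays, one per system.
    Returns the fused scores and the predicted class per clip."""
    stack = np.stack(score_mats)          # (n_systems, n_clips, n_classes)
    if weights is None:
        weights = np.full(len(score_mats), 1.0 / len(score_mats))
    fused = np.tensordot(weights, stack, axes=1)   # (n_clips, n_classes)
    return fused, fused.argmax(axis=1)
```

For example, a clip that one stream misclassifies can still be labeled correctly when the other streams carry complementary evidence, which is the motivation for fusing PLP, PLPCC, and log-mel scores.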
Datasets: Experiments were conducted on TUT Acoustic Scenes 2017 and 2018 (DCASE tasks 1A & 1B) datasets, covering recordings from various environments and devices.
Results:
PLP features individually outperform PLPCC and Log-Mel features.
Score-level fusion (P7: PLP + PLPCC + Log-Mel) achieved the highest accuracy across all datasets.
DCASE 2018 Task 1A: 11.4% improvement over the baseline.
DCASE 2018 Task 1B: 14.4% improvement.
DCASE 2017: 6.0% improvement.
The proposed system performed especially well for Bus, Park, Street Traffic, and Shopping Mall classes.
Conclusion:
The proposed DNN-based fusion of PLP, PLPCC, and Log-Mel features provides a robust and discriminative representation for complex environmental sounds. It significantly improves ASC performance over existing CNN and MLP baselines, demonstrating that combining auditory-inspired and spectral features leads to superior scene classification results.
Conclusion
In this paper, an investigation of PLP and PLPCC features with a DNN architecture has been applied to model the ASC task. We experimented with the TUT Acoustic Scenes 2018 datasets of Task 1, including subtasks A and B, and the TUT Acoustic Scenes 2017 dataset. The study demonstrated the capability of individual feature sets and of the fusion of PLP, PLPCC, and log mel-band energies at the DNN decision score level. Individual PLP features yield improvements of 8.9% and 13.6%, and PLPCC features result in improvements of 8.6% and 12.5%, for subtasks A and B of the DCASE 2018 challenge respectively. Significant improvements in accuracy are achieved with DNN decision-level scores obtained from PLP, PLPCC, and log mel-band energies: improvements of 11.4% and 14.4% were achieved in subtasks A and B respectively, compared to the DCASE 2018 ASC baseline system. On the DCASE TUT Acoustic Scenes 2017 dataset, individual PLP features yield an improvement of 5.8% and PLPCC features result in an improvement of 4.7%; an improvement of 6.0% is achieved with the fusion study compared to the DCASE 2017 baseline system. This shows that PLP, PLPCC, and log mel-band energies carry complementary acoustic information. Future work will be dedicated to the investigation of different combinations of features for ASC.