Speaker diarization, the task of segmenting audio recordings by speaker identity, remains critical for conversational speech processing applications. This paper presents a comprehensive experimental evaluation of a modular framework that combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to generate discriminative speaker embeddings, paired with K-means clustering for multi-speaker identification. The six-stage processing pipeline comprises audio preprocessing, speech segmentation with Silero Voice Activity Detection (VAD), mel-spectrogram feature extraction, neural embedding generation through a hybrid CNN-LSTM architecture, unsupervised clustering, and timeline creation with confidence scoring.
Experimental validation on a 7.16-minute conversational audio recording demonstrates high system reliability, with a 99.4% overall success rate across all processing stages. The framework identified 6 distinct speakers across 16 segments, achieving 97.1% preprocessing efficiency, 99.5% segmentation coverage, and 100% success rates for feature extraction, embedding generation, and clustering.
Speaker distribution analysis revealed realistic conversational dynamics, with the two dominant speakers accounting for 50% of total speaking time (SPEAKER_3: 23.7%, SPEAKER_5: 26.3%) and 12 speaker transitions occurring at a rate of 1.68 per minute. The modular architecture enables detailed analysis of each processing component, providing transparency and interpretability advantages over end-to-end black-box systems while maintaining CPU-based processing compatibility. These findings demonstrate the effectiveness of hybrid neural-clustering approaches for practical speaker diarization applications and contribute to understanding modular system design principles in conversational speech analysis.
Introduction
Speaker diarization—the task of determining “who spoke when” in multi-speaker recordings—is essential for applications such as meeting transcription, broadcast analysis, and conversational AI. Traditional approaches rely on modular pipelines with feature extraction (e.g., MFCCs, i-vectors) and clustering (e.g., K-means, AHC), but recent advances favor deep learning, particularly end-to-end neural diarization (EEND) systems and x-vector embeddings, which achieve state-of-the-art performance but often require GPUs and lack interpretability.
The proposed framework integrates traditional signal processing with a hybrid CNN-LSTM neural network in a modular six-stage pipeline:
Audio preprocessing – resampling, normalization, and silence removal.
Speech segmentation – VAD-based segmentation using Silero VAD.
Feature extraction – Mel-spectrogram computation.
Embedding generation – CNN-LSTM hybrid for discriminative speaker embeddings.
Speaker clustering – K-means with standardized embeddings and confidence scoring.
Timeline creation – chronological speaker assignments in RTTM format.
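The clustering and timeline stages above (stages 5 and 6) can be sketched in Python. This is a minimal illustration, not the system's exact implementation: the plain NumPy K-means and the confidence rule (relative margin between the two nearest centroids) are assumptions, while the output lines follow the standard RTTM field layout.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain NumPy K-means (Lloyd's algorithm); returns labels and
    per-sample distances to every centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), dists

def cluster_speakers(embeddings, n_speakers):
    """Standardize embeddings, cluster, and score each segment's confidence
    as the margin between the two nearest centroids (illustrative rule)."""
    scaled = (embeddings - embeddings.mean(0)) / (embeddings.std(0) + 1e-9)
    labels, dists = kmeans(scaled, n_speakers)
    d = np.sort(dists, axis=1)
    confidence = 1.0 - d[:, 0] / (d[:, 1] + 1e-9)
    return labels, confidence

def to_rttm(segments, labels, file_id="audio"):
    """Render (onset, duration) segments with cluster labels as RTTM lines."""
    return [f"SPEAKER {file_id} 1 {onset:.3f} {dur:.3f} <NA> <NA> "
            f"SPEAKER_{lab} <NA> <NA>"
            for (onset, dur), lab in zip(segments, labels)]

# Toy demo: 16 segments whose embeddings form two well-separated clusters.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0.0, 0.1, (8, 16)), rng.normal(3.0, 0.1, (8, 16))])
segments = [(i * 1.5, 1.5) for i in range(16)]
labels, confidence = cluster_speakers(emb, n_speakers=2)
print(to_rttm(segments, labels)[0])
```

Standardizing the embeddings before clustering keeps any single high-variance dimension from dominating the Euclidean distances that K-means minimizes.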
Experimental results show high reliability: overall pipeline success rate of 99.4%, efficient silence removal (97.1%), near-perfect segmentation (99.5%), and 100% success for feature extraction, embedding generation, clustering, and timeline creation. Speaker distribution analysis reveals realistic conversational dynamics, with dominant participants contributing about 50% of speaking time and brief speakers identified with high confidence.
Compared with prior work, the framework offers CPU-based processing, modular transparency, interpretability, and robust handling of natural multi-party conversations, balancing the strengths of neural embeddings with those of traditional clustering.
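For concreteness, the feature-extraction stage (log-mel-spectrogram computation) can be sketched with NumPy alone. The parameter values here (16 kHz sampling, 25 ms windows with a 10 ms hop, 40 mel bands) are illustrative assumptions, not the configuration reported above.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank on the HTK mel scale."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):                 # rising edge of triangle
            fb[i, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):                 # falling edge of triangle
            fb[i, j] = (r - j) / max(r - c, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Frame, window, FFT, apply the mel filterbank; returns log-mel
    features of shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)

# One second of a synthetic 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000
feat = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(feat.shape)
```

Each row of the resulting matrix is the frame-level feature vector that the CNN-LSTM embedding network would consume.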
Conclusion
This paper presented a comprehensive experimental evaluation of a modular CNN-LSTM framework for multi-speaker diarization, demonstrating the effective integration of neural embeddings with traditional clustering techniques. The proposed six-stage processing pipeline achieved an overall success rate of 99.4% across all components, highlighting the robustness and reliability of the system. Key results include 97.1% preprocessing efficiency, 99.5% segmentation coverage, and 100% success rates for feature extraction, embedding generation, clustering, and timeline creation. The framework captured realistic speaker dynamics, identifying six speakers with natural participation patterns, including two dominant speakers occupying 50% of total speaking time and brief contributors identified with high confidence. The CPU-compatible implementation demonstrated practical viability for broad deployment without compromising performance. Finally, the modular architecture enabled independent evaluation and optimization of each component, while the transparent design and extensive visualization support ensured reproducibility and facilitated comparative research.