With the advancement of multimedia technologies, human-computer interfaces are in high demand and constitute a prominent area of research. Vocal representations, facial expressions, and lip movements are used to extract various types of information. In particular, the detection of disfluencies, i.e., interruptions in the normal flow of speech characterized by pauses, repetitions, and sound prolongations, is of interest not only for improving speech recognition systems but also for potentially identifying emotional aspects in audio. Several studies have aimed to define computational methods to identify and classify disfluencies, as well as appropriate evaluation methods in different languages. However, no studies have compiled the findings in the literature on this topic. Such a compilation is important both for summarizing the motivations and applications of the research and for identifying opportunities that could guide new investigations. Our objective is to provide an analysis of the state of the art, the main limitations, and the challenges in this field. Most existing disfluency detection models are trained on American English datasets and perform poorly on Indian English and code-switched speech. This paper provides a comprehensive review of Transformer-based approaches for identifying disfluencies in Indian English and Telugu-English code-switched speech.
Introduction
Human–computer conversational interfaces increasingly rely on vocal, facial, and body cues to support decision-making. Among these cues, speech disfluencies—interruptions such as pauses, repetitions, prolongations, and repairs—play an essential role in understanding natural human communication. Disfluencies occur across languages and contexts and are studied in clinical research, second-language learning, and speech technology. The long-standing DiSS (Disfluency in Spontaneous Speech) conference highlights the interdisciplinary relevance of this topic.
While past reviews focused mainly on stuttering-related disfluencies, especially using machine learning and ASR-based detection methods, this paper uniquely provides a comprehensive, cross-domain review of disfluency detection and correction, relevant to voice assistants, conversational AI, meeting transcription, educational tools, and medical/legal dictation.
The main contributions of the paper include:
A structured classification of disfluencies
An analysis of the challenges in Indian English and code-switched speech
A review of Transformer-based approaches
A review of multimodal fusion architectures
Classification of Disfluencies
Disfluencies are typically described using Shriberg’s (1994) four-part framework (a worked example follows this list):
Reparandum (the portion replaced),
Interruption Point (where the fluent flow breaks off),
Interregnum (fillers or hesitation),
Repair (the corrected segment).
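To make the framework concrete, the sketch below annotates the classic example utterance "I want a flight to Boston, uh, I mean, to Denver" with Shriberg's four parts; the span boundaries and the dictionary representation are illustrative assumptions, not drawn from an annotated corpus.

```python
# Illustrative annotation of Shriberg's four-part structure; the offsets
# and the representation are assumptions for exposition, not corpus data.
utterance = "I want a flight to Boston uh I mean to Denver"

annotation = {
    "reparandum": "to Boston",     # the portion later replaced
    "interruption_point": 25,      # character offset where fluent speech breaks off
    "interregnum": "uh I mean",    # filler plus editing phrase
    "repair": "to Denver",         # the corrected segment
}

# The fluent reading drops the reparandum and interregnum:
fluent_version = "I want a flight to Denver"
```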
Prior studies identify ten disfluency categories. The “interregnum” (e.g., fillers and interjections) appeared most frequently. More complex forms such as repairs require combined syntactic and acoustic analysis, making them challenging for ASR systems. “Stuttering” was the least studied category and is treated as a neurobiological disorder distinct from simple word fragments.
Challenges in Code-Switched Speech
Code-switched speech, which mixes two or more languages within an utterance, poses significant challenges (illustrated in the sketch after this list), including:
Accent diversity
Frequent switching
Mixed grammar and phonology
High disfluency rates
Lack of labeled datasets for Indian English and code-switching contexts
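As a token-level illustration of these challenges, the sketch below tags a romanized Telugu-English utterance with per-token language labels; the sentence, the tag inventory, and the labels are hypothetical and serve only to show how fillers often land on switch boundaries.

```python
# Hypothetical per-token language tags for a romanized Telugu-English
# code-switched utterance ("I will come a little late to the meeting");
# both the sentence and the tag set are illustrative assumptions.
tokens    = ["nenu", "uh",     "meeting", "ki", "konchem", "late", "ga", "vastanu"]
lang_tags = ["TE",   "FILLER", "EN",      "TE", "TE",      "EN",   "TE", "TE"]

# The filler "uh" sits directly on a Telugu-to-English switch boundary,
# the position where hesitations are reported to cluster.
for tok, tag in zip(tokens, lang_tags):
    print(f"{tok:10s} {tag}")
```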
Transformer-Based Approaches
Transformers dominate modern speech and language processing due to their ability to capture long-range context. Models like BERT, RoBERTa, mBERT, MuRIL, Wav2Vec2, HuBERT, and Whisper are widely used.
BERT: Effective for sequence labeling of disfluencies using BIO tags, leveraging bidirectional context to detect fillers and repetitions; a minimal sketch of this setup follows this list.
RoBERTa: An optimized BERT variant with dynamic masking and improved contextual understanding, widely used for classification tasks and better at capturing subtle disfluency cues.
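As an illustration of the BIO sequence-labeling setup referenced above, the following is a minimal sketch using the Hugging Face Transformers library. The tag set and the choice of MuRIL (google/muril-base-cased) as the encoder are assumptions; RoBERTa or multilingual BERT can be substituted by changing the model name. The classification head shown here is untrained, so its outputs are random until fine-tuned on labeled disfluency data.

```python
# Minimal sketch of BIO-style disfluency tagging with a BERT-family
# encoder. Label set and encoder choice are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-FILLER", "I-FILLER", "B-REP", "I-REP"]  # hypothetical tag set
model_name = "google/muril-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)

text = "I want uh I want to book a ticket"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Print one predicted tag per subword token (random until fine-tuned).
for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), pred_ids):
    print(f"{token:12s} {labels[pred]}")
```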
Conclusion
No single Transformer model handles both Indian English and disfluencies perfectly; multimodal models are needed. The Whisper + MuRIL combination is currently the strongest baseline for Hinglish and Indian English disfluency detection. Fine-tuning on prosodic cues is essential, since Indian English disfluencies are heavily prosody-driven. Code-switch boundaries trigger more hesitations, which Transformers handle well when trained on bilingual data. Future work will focus on multimodal Transformer models that combine the acoustic and prosodic features of speech, as sketched below.
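As a rough sketch of the cascade suggested above, the code below transcribes an utterance with Whisper and extracts frame-level pitch and energy contours as a simple prosodic stream; the fusion step is only a placeholder, and the file name, sampling rate, and pitch bounds are assumptions rather than a published recipe.

```python
# Rough sketch of a Whisper -> text-encoder cascade with a simple
# prosodic feature stream; the fusion is a placeholder, and the file
# name, sampling rate, and pitch bounds are illustrative assumptions.
import librosa
import numpy as np
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
transcript = asr("utterance.wav")["text"]                 # 1) acoustic-to-text pass

audio, sr = librosa.load("utterance.wav", sr=16000)
f0, _, _ = librosa.pyin(audio, fmin=60, fmax=400, sr=sr)  # 2) frame-level pitch
energy = librosa.feature.rms(y=audio)[0]                  #    and energy contour

n = min(len(f0), len(energy))
prosody = np.stack([np.nan_to_num(f0[:n]), energy[:n]], axis=-1)  # (frames, 2)

# 3) A multimodal tagger would fuse `prosody` with token embeddings of
#    `transcript` (e.g., from MuRIL); here we only report the outputs.
print(transcript)
print(prosody.shape)
```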