With the rise of remote and global teams, virtual meetings face challenges like language barriers and information overload that hinder effective communication and decision-making. Traditional transcription and translation methods are slow and error-prone, especially for long or multilingual meetings.
To address this, this work proposes an AI-powered platform that automates real-time transcription, analysis, summarization, and translation of meeting conversations. The system uses OpenAI’s Whisper for accurate multilingual speech-to-text conversion, followed by Natural Language Processing (NLP) for sentiment analysis, topic modeling, and dual-stage summarization (an extractive pass followed by an abstractive pass) to produce clear, insightful summaries. A multilingual translation module supports over 100 languages, enabling seamless communication across linguistic boundaries.
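As a concrete illustration of the speech-to-text stage, the sketch below uses the open-source openai-whisper package; the model size and file name are illustrative assumptions, not the configuration reported here.

```python
# Minimal sketch of the Whisper transcription stage. The "medium"
# checkpoint and the file name are assumptions; the paper does not
# specify which model size the platform deploys.
import whisper

model = whisper.load_model("medium")  # multilingual checkpoint

# Transcribe in the original language; Whisper detects it automatically.
result = model.transcribe("meeting_audio.wav")
print(f"Detected language: {result['language']}")
print(result["text"])

# Whisper can also translate supported languages directly into English.
translated = model.transcribe("meeting_audio.wav", task="translate")
print(translated["text"])
```

Note that Whisper's built-in translate task targets English only, so covering over 100 output languages implies a separate machine-translation model downstream. The dual-stage summarization can likewise be sketched as an extractive pass followed by an abstractive model; the TF-IDF scoring, the BART checkpoint, and the example sentences below are assumptions, since the paper does not name its summarization components.

```python
# Sketch of dual-stage summarization: an extractive pass keeps the
# highest-scoring sentences, then an abstractive model rewrites them.
# The scoring scheme and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

def extractive_pass(sentences: list[str], keep: int = 2) -> str:
    """Score sentences by mean TF-IDF weight; keep the top ones in order."""
    scores = TfidfVectorizer().fit_transform(sentences).mean(axis=1).A1
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep])
    return " ".join(sentences[i] for i in top)

transcript_sentences = [  # invented example, not real meeting data
    "The team reviewed last quarter's sales figures in detail.",
    "Marketing proposed a new campaign for the spring product launch.",
    "The budget request was approved by all attendees.",
]
condensed = extractive_pass(transcript_sentences)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(condensed, max_length=60, min_length=10)[0]["summary_text"]
print(summary)
```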
The platform features a modular, scalable architecture accessible through a web interface: users upload or record meetings and receive transcripts, summaries, and translations. It is designed for diverse domains, including corporate, educational, healthcare, and government settings, improving inclusivity, reducing manual effort, and accelerating decision-making.
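The upload-and-process flow might look like the minimal sketch below. Flask is an assumption (the paper does not name its web framework), and run_pipeline is a hypothetical stand-in for the transcription, summarization, and translation stages described above.

```python
# Minimal Flask sketch of the upload endpoint. Flask and run_pipeline
# are assumptions standing in for the platform's unnamed web framework
# and processing pipeline.
from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)

def run_pipeline(audio_path: str, target_lang: str) -> dict:
    # Hypothetical glue code: in the real system this would call the
    # Whisper, summarization, and translation modules in sequence.
    return {"transcript": "...", "summary": "...", "translation": "..."}

@app.route("/meetings", methods=["POST"])
def process_meeting():
    audio = request.files["audio"]           # uploaded or recorded clip
    target = request.form.get("lang", "en")  # desired output language
    path = f"/tmp/{secure_filename(audio.filename)}"
    audio.save(path)
    return jsonify(run_pipeline(path, target))

if __name__ == "__main__":
    app.run(debug=True)
```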
Evaluation showed high transcription accuracy (low word error rates), summarization quality comparable to human-written summaries, and good translation quality, though minor issues remain with idiomatic phrases. The system processes typical meeting audio quickly, and users found it easy to use across devices.
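Word error rate (WER), the metric behind the transcription-accuracy claim, counts substitutions, deletions, and insertions against a reference transcript. The sketch below computes it with the jiwer package; the sentences are invented examples, not data from this evaluation.

```python
# WER = (substitutions + deletions + insertions) / reference word count,
# computed here with the jiwer package on invented example sentences.
import jiwer

reference = "the quarterly budget was approved after a short discussion"
hypothesis = "the quarterly budget was approved after short discussion"

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # 1 deletion / 9 words ≈ 0.111
```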
Conclusion
In this paper, we presented an AI-driven solution for real-time speech-to-text transcription, multilingual translation, and speaker identification, built on OpenAI Whisper and pyannote-audio. The system lets users transcribe speech, translate content across more than 100 languages, and identify multiple speakers in real time, making it applicable to environments such as meetings, interviews, and collaborative discussions. By integrating transcription, translation, and speaker identification, the platform offers a powerful tool for bridging language barriers and improving productivity in multilingual settings, and it establishes a solid foundation for future work on AI-powered, real-time multilingual voice transcription and translation.
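A minimal sketch of how the Whisper and pyannote-audio outputs can be merged into a speaker-attributed transcript is shown below; the model names, the overlap heuristic, and the access-token handling are assumptions, since the exact integration code is not published here.

```python
# Sketch: attribute Whisper's time-stamped segments to the speakers that
# pyannote-audio's diarization pipeline identifies. Model names and the
# maximum-overlap heuristic are illustrative assumptions.
import whisper
from pyannote.audio import Pipeline

audio_path = "meeting_audio.wav"

# 1) Transcribe: Whisper returns segments with start/end timestamps.
asr = whisper.load_model("medium")
segments = asr.transcribe(audio_path)["segments"]

# 2) Diarize: pyannote labels who speaks when (needs a Hugging Face token).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."  # placeholder
)
diarization = diarizer(audio_path)

def speaker_at(start: float, end: float) -> str:
    """Pick the speaker whose diarized turns overlap this segment the most."""
    overlap: dict[str, float] = {}
    for turn, _, spk in diarization.itertracks(yield_label=True):
        dur = min(end, turn.end) - max(start, turn.start)
        if dur > 0:
            overlap[spk] = overlap.get(spk, 0.0) + dur
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"

# 3) Merge into a speaker-attributed transcript.
for seg in segments:
    print(f"[{speaker_at(seg['start'], seg['end'])}] {seg['text'].strip()}")
```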