In multi-user automated ecosystems, extracting actionable intelligence from spoken conversations requires understanding both who is speaking and the emotional context of their words. This paper presents an integrated, decoupled machine learning architecture for real-time speech-to-text transcription, unsupervised speaker diarization, and linguistic sentiment classification. The framework uses an optimized faster-whisper transformer pipeline to transcribe audio into timestamped text segments. In the acoustic domain, Mel-Frequency Cepstral Coefficients (MFCCs) are clustered by a multi-seed consensus K-Means ensemble, which yields stable speaker labels without prior acoustic enrollment. The transcribed segments are then vectorized with Term Frequency-Inverse Document Frequency (TF-IDF) and classified by a calibrated machine learning model. Experimental evaluations indicate that the pipeline runs efficiently on commodity CPU infrastructure, making the framework suitable for enterprise customer intelligence and digital media analytics.
Introduction
The rapid growth of multimedia content such as meetings, customer service calls, lectures, and broadcasts has increased the need for automated systems that can identify who spoke when (speaker diarization) and determine the sentiment of the conversation (sentiment analysis). Traditional end-to-end multimodal deep learning approaches are computationally expensive, sensitive to noise, and often require powerful GPUs.
To address these challenges, this study proposes a CPU-efficient sequential cascade framework that separates speech processing from sentiment analysis. The system first converts speech into text and then applies text-based sentiment analysis, allowing it to leverage mature NLP techniques while maintaining speaker identification through acoustic feature analysis.
Key Contributions
The proposed framework introduces:
A multi-threaded CPU-based deployment model that runs efficiently without requiring GPUs.
A consensus K-Means clustering approach that solves speaker label inconsistency across multiple runs.
A confidence-based sentiment classification strategy that filters neutral speech and focuses on clear emotional expressions.
Related Work and Research Gap
Previous approaches include:
MFCC, DTW, and GMM-based speaker verification systems.
Transformer-based speech recognition models such as Whisper.
Deep multimodal sentiment analysis models.
Acoustic emotion recognition built on self-supervised speech models such as Wav2Vec2 and HuBERT.
However, existing methods often:
Require significant computational resources.
Depend on cloud services.
Are sensitive to environmental noise.
Lack integrated multi-speaker diarization and sentiment analysis.
The proposed framework addresses these limitations by combining efficient transcription, speaker clustering, and text-based sentiment classification into a deployable real-time pipeline.
Proposed Methodology
1. Audio Extraction and Preprocessing
The system:
Accepts audio and video files.
Extracts audio from multimedia content.
Converts audio to a 16 kHz sampling rate in mono channel format.
This standardization ensures compatibility across all processing stages.
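The preprocessing step can be sketched as follows. This is a minimal illustration that assumes the ffmpeg command-line tool is installed and used for extraction, which the paper does not specify; the function names and the output file suffix are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(source: str, target: str) -> list[str]:
    """Build an ffmpeg call that writes 16 kHz mono audio to `target`."""
    return [
        "ffmpeg",
        "-y",            # overwrite the target if it already exists
        "-i", source,    # input audio or video file
        "-vn",           # drop any video stream
        "-ac", "1",      # one channel (mono)
        "-ar", "16000",  # 16 kHz sampling rate
        target,
    ]


def extract_audio(source: str, workdir: str = ".") -> str:
    """Run ffmpeg (assumed to be on PATH) and return the normalized WAV path."""
    target = str(Path(workdir) / (Path(source).stem + "_16k_mono.wav"))
    subprocess.run(build_ffmpeg_command(source, target), check=True)
    return target


# Inspect the command without running ffmpeg.
command = build_ffmpeg_command("call.mp4", "call.wav")
```

Keeping the command construction separate from its execution makes the conversion settings easy to inspect and test without any media files.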
2. Speech-to-Text Transcription
The framework uses faster-whisper, an optimized version of Whisper, to:
Convert speech into text.
Detect speech segments with timestamps.
Operate efficiently on standard CPUs using float32 precision.
Each segment contains:
Start time
End time
Transcribed text
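A transcription call along these lines could produce the segment records described above. It is a sketch, not the authors' code: the model size ("base"), the thread count, and the helper names are assumptions, while the device and float32 settings follow the text.

```python
from types import SimpleNamespace


def to_records(segments):
    """Convert segment objects (with start, end, text) into plain dicts."""
    return [
        {
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "text": seg.text.strip(),
        }
        for seg in segments
    ]


def transcribe(path, model_size="base", threads=4):
    """Transcribe `path` on CPU and return timestamped segment records."""
    # Imported lazily so the helper above works without the dependency.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        model_size,
        device="cpu",
        compute_type="float32",
        cpu_threads=threads,  # assumed value; tune to the host machine
    )
    segments, _info = model.transcribe(path)  # segments is a lazy generator
    return to_records(segments)


# Stand-in segment used to show the record shape without loading a model.
demo_segments = [SimpleNamespace(start=0.0, end=2.48, text=" Hello there.")]
records = to_records(demo_segments)
```

Because faster-whisper yields segments lazily, decoding only happens while `to_records` iterates over them.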
3. Speaker Diarization
For each speech segment:
Feature Extraction
20 Mel-Frequency Cepstral Coefficients (MFCCs) are extracted.
Statistical features (mean and standard deviation) are calculated.
A 40-dimensional feature vector is generated.
Feature Scaling
Standard (z-score) normalization is applied to each feature dimension.
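The feature extraction and scaling steps might look like the sketch below. It assumes librosa for MFCC computation (listed in the implementation stack) and reads the scaling step as zero-mean, unit-variance standardization per dimension; the function names and the small epsilon guard are illustrative.

```python
import numpy as np


def pool_mfcc(coeffs):
    """Collapse an (n_mfcc, frames) matrix into [means, standard deviations]."""
    return np.concatenate([coeffs.mean(axis=1), coeffs.std(axis=1)])


def segment_features(samples, sr=16000, n_mfcc=20):
    """Return a 2 * n_mfcc vector for one speech segment (40-D for 20 MFCCs)."""
    import librosa  # imported lazily; only needed when real audio is processed

    coeffs = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)
    return pool_mfcc(coeffs)


def standardize(features, eps=1e-8):
    """Scale each column of a (segments, dims) matrix to zero mean, unit variance."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)


# Synthetic stand-ins: one fake MFCC matrix and six fake 40-D segment vectors.
rng = np.random.default_rng(0)
vector = pool_mfcc(rng.normal(size=(20, 50)))
scaled = standardize(rng.normal(size=(6, 40)))
```

For a segment with start and end times in seconds, the samples passed to `segment_features` would be the slice `audio[int(start * sr):int(end * sr)]` of the 16 kHz mono signal.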
Consensus K-Means Clustering
Instead of relying on a single K-Means execution, clustering is performed with multiple random seeds.
A voting mechanism determines the final speaker assignment:
Speaker 0
Speaker 1
This ensures stable speaker identification across repeated runs.
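One way to realize the multi-seed vote is sketched below. The paper states only that a voting mechanism is used; the step that aligns each run's labels to the first run before voting is an assumption, added because K-Means cluster indices are arbitrary and can swap between seeds. The seed list and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def align_to_reference(labels, reference, k):
    """Relabel `labels` with the permutation that best agrees with `reference`."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(k)):
        hits = sum(perm[lab] == ref for lab, ref in zip(labels, reference))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return [best_perm[lab] for lab in labels]


def consensus_vote(runs, k=2):
    """Majority vote per segment after aligning every run to the first one."""
    reference = runs[0]
    aligned = [align_to_reference(run, reference, k) for run in runs]
    return [Counter(column).most_common(1)[0][0] for column in zip(*aligned)]


def cluster_with_seeds(features, seeds=(0, 1, 2, 3, 4), k=2):
    """Run K-Means once per seed and return the consensus speaker labels."""
    from sklearn.cluster import KMeans  # imported lazily

    runs = [
        KMeans(n_clusters=k, n_init=10, random_state=seed)
        .fit_predict(features)
        .tolist()
        for seed in seeds
    ]
    return consensus_vote(runs, k)


# Three hand-written runs: the second has swapped labels, the third disagrees once.
example_runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 0]]
speakers = consensus_vote(example_runs)
```

With a fixed seed list the first run is deterministic, so the mapping of "Speaker 0" and "Speaker 1" to clusters stays the same across repeated executions. An odd number of seeds avoids ties in the two-speaker case.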
4. Sentiment Analysis
After transcription:
Text Vectorization
Text is converted into numerical features using TF-IDF.
Classification
A machine learning model (e.g., Logistic Regression) predicts sentiment probabilities.
Fallback Mechanism
If the classifier does not expose probability estimates, its hard label is mapped to a probability:
Positive → 1.0
Negative → 0.0
This prevents runtime failures.
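The vectorization, classification, and fallback steps can be sketched as follows. This assumes a scikit-learn pipeline and a positive class encoded as 1; the function names, the `max_iter` setting, and the stand-in model are hypothetical and not taken from the paper.

```python
def build_sentiment_model():
    """Return an untrained TF-IDF + Logistic Regression pipeline."""
    # Imported lazily so the fallback logic below has no hard dependency.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


def positive_probability(model, texts, positive_label=1):
    """Return the positive-class probability for each text, with a hard-label fallback."""
    if hasattr(model, "predict_proba"):
        column = list(model.classes_).index(positive_label)
        return [float(row[column]) for row in model.predict_proba(texts)]
    # Fallback: no probabilities available, so map Positive to 1.0 and Negative to 0.0.
    return [1.0 if label == positive_label else 0.0 for label in model.predict(texts)]


class HardLabelModel:
    """Stand-in for a fitted classifier that offers predict() but not predict_proba()."""

    def predict(self, texts):
        return [1 if "promising" in text else 0 for text in texts]


scores = positive_probability(
    HardLabelModel(),
    ["The metrics look promising.", "We need to re-evaluate the coefficients."],
)
```

Under this fallback a hard-label model can never produce a Neutral result downstream, because its scores are always exactly 0.0 or 1.0.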
Sentiment Classification Logic
To reduce false emotional labeling, the framework introduces a confidence threshold:
Probability (Ppos) | Sentiment
> 0.65 | Positive
0.35 – 0.65 | Neutral
< 0.35 | Negative
This ensures that objective or conversational statements are not incorrectly classified as positive or negative.
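The thresholding rule above can be written as a small function. The function and parameter names are illustrative, and treating the boundary values 0.35 and 0.65 as Neutral follows from reading the rule's strict inequalities.

```python
def label_sentiment(p_pos, low=0.35, high=0.65):
    """Map a positive-class probability to Positive, Neutral, or Negative."""
    if not 0.0 <= p_pos <= 1.0:
        raise ValueError(f"probability out of range: {p_pos}")
    if p_pos > high:
        return "Positive"
    if p_pos < low:
        return "Negative"
    return "Neutral"  # covers the closed interval [low, high]


labels = [label_sentiment(p) for p in (0.88, 0.21, 0.52)]
```

Exposing `low` and `high` as parameters lets the width of the neutral window be tuned on validation data without changing the classifier itself.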
System Implementation
AI/ML Stack
Python 3.10+
Flask
faster-whisper
librosa
NumPy
scikit-learn
TF-IDF
Logistic Regression
Backend
Spring Boot
REST APIs
PostgreSQL/MongoDB
Functions include:
Media upload
Audio processing
Report generation
Frontend
Angular or React
Interactive file uploads
Real-time transcript visualization
Speaker timelines and sentiment dashboards
Results and Evaluation
The framework successfully performs:
Speech transcription
Speaker identification
Sentiment classification
Example Results
Text | Speaker | Confidence (Ppos) | Sentiment
“The implementation metrics look highly promising.” | Speaker 1 | 0.88 | Positive
“We need to re-evaluate the model coefficients.” | Speaker 0 | 0.21 | Negative
“The system completed the run at fourteen frames per second.” | Speaker 1 | 0.52 | Neutral
Key Findings
Speaker separation is reliable using the consensus clustering method.
Neutral thresholding reduces misclassification of ordinary speech.
CPU-based execution remains efficient, avoiding the heavy resource requirements of deep multimodal systems.
Conclusion
This paper presents a deployable framework for multi-speaker audio transcription, diarization, and sentiment tracking. By combining an optimized faster-whisper transformer with a consensus K-Means clustering approach, the system achieves stable speaker labels across repeated runs together with text-based sentiment modeling. A dedicated neutral classification window (Ppos between 0.35 and 0.65) reduces the tendency to force ordinary statements into a positive or negative class, and the full pipeline runs on commodity CPU setups.