Sentiment Analysis and Speaker Mapping with Machine Learning

Authors: Mayur Ankushrao, Sanket Pawar, Vishal Mule, Aniket Markad, Prof. S. V. Shinde

DOI Link: https://doi.org/10.22214/ijraset.2026.83557

Abstract

In multi-user automated ecosystems, extracting actionable intelligence from spoken conversations requires understanding both who is speaking and the emotional context of their words. This paper presents an integrated, decoupled machine learning architecture for real-time speech-to-text transcription, unsupervised speaker diarization, and linguistic sentiment classification. The framework uses an optimized faster-whisper transformer pipeline to transcribe audio into timestamped text segments. In parallel, Mel-Frequency Cepstral Coefficients (MFCCs) extracted from each segment are clustered by a multi-seed Consensus K-Means ensemble, which yields stable speaker labels without prior acoustic enrollment. The transcribed segments are then vectorized with Term Frequency-Inverse Document Frequency (TF-IDF) and classified by a calibrated machine learning model. Experimental evaluations indicate that the pipeline runs efficiently on commodity CPU infrastructure, making the framework suitable for enterprise customer intelligence and digital media analytics.

Introduction

The rapid growth of multimedia content such as meetings, customer service calls, lectures, and broadcasts has increased the need for automated systems that can identify who spoke when (speaker diarization) and determine the sentiment of the conversation (sentiment analysis). Traditional end-to-end multimodal deep learning approaches are computationally expensive, sensitive to noise, and often require powerful GPUs.

To address these challenges, this study proposes a CPU-efficient sequential cascade framework that separates speech processing from sentiment analysis. The system first converts speech into text and then applies text-based sentiment analysis, allowing it to leverage mature NLP techniques while maintaining speaker identification through acoustic feature analysis.

Key Contributions

The proposed framework introduces:

A multi-threaded CPU-based deployment model that runs efficiently without requiring GPUs.
A consensus K-Means clustering approach that solves speaker label inconsistency across multiple runs.
A confidence-based sentiment classification strategy that labels low-confidence predictions as neutral, so that only clear emotional expressions are tagged as positive or negative.

Related Work and Research Gap

Previous approaches include:

MFCC, DTW, and GMM-based speaker verification systems.
Transformer-based speech recognition models such as Whisper.
Deep multimodal sentiment analysis models.
Acoustic emotion recognition frameworks like Wav2Vec2 and HuBERT.

However, existing methods often:

Require significant computational resources.
Depend on cloud services.
Are sensitive to environmental noise.
Lack integrated multi-speaker diarization and sentiment analysis.

The proposed framework addresses these limitations by combining efficient transcription, speaker clustering, and text-based sentiment classification into a deployable real-time pipeline.

Proposed Methodology

1. Audio Extraction and Preprocessing

The system:

Accepts audio and video files.
Extracts audio from multimedia content.
Converts audio to:
- 16 kHz sampling rate
- Mono channel format

This standardization ensures compatibility across all processing stages.
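
As a minimal sketch of this standardization step, the snippet below builds an ffmpeg command that drops any video stream and writes 16 kHz mono audio. The choice of ffmpeg, the helper names `build_ffmpeg_command` and `extract_audio`, and the file names are assumptions made for illustration; the paper does not state which tool performs the conversion.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000, channels=1):
    """Return an ffmpeg argument list that writes `src` as mono 16 kHz audio."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(src),            # input audio or video file
        "-vn",                     # ignore any video stream
        "-ac", str(channels),      # number of audio channels (1 = mono)
        "-ar", str(sample_rate),   # output sampling rate in Hz
        str(dst),
    ]


def extract_audio(src, dst):
    """Run the conversion; assumes the ffmpeg binary is available on PATH."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst
```

Keeping the command construction separate from its execution makes the target format (16 kHz, one channel) easy to inspect without running ffmpeg.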

2. Speech-to-Text Transcription

The framework uses faster-whisper, an optimized version of Whisper, to:

Convert speech into text.
Detect speech segments with timestamps.
Operate efficiently on standard CPUs using float32 precision.

Each segment contains:

Start time
End time
Transcribed text
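
The transcription step could look roughly like the sketch below, which loads a faster-whisper model on CPU with float32 precision and copies each segment's start time, end time, and text into a small record. The `Segment` dataclass, the helper names, the `"base"` model size, and the thread count are illustrative assumptions, not values reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed text


def to_segments(raw_segments):
    """Copy start, end, and text from faster-whisper segment objects."""
    return [Segment(float(s.start), float(s.end), s.text.strip())
            for s in raw_segments]


def transcribe(path, model_size="base", cpu_threads=4):
    """Transcribe an audio file on CPU using float32 precision."""
    # Imported lazily so the helpers above work without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu",
                         compute_type="float32", cpu_threads=cpu_threads)
    raw_segments, _info = model.transcribe(path)  # segments are yielded lazily
    return to_segments(raw_segments)
```

Because `transcribe` returns plain records, the later diarization and sentiment stages only need the timestamps and text, not the faster-whisper objects themselves.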

3. Speaker Diarization

For each speech segment:

Feature Extraction

20 Mel-Frequency Cepstral Coefficients (MFCCs) are extracted per frame.
The mean and standard deviation of each coefficient are computed over the segment.
The 20 means and 20 standard deviations are concatenated into a 40-dimensional feature vector.

Feature Scaling

Features are standardized to zero mean and unit variance before clustering.
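
The feature extraction and scaling steps can be sketched as follows. The function names are hypothetical, and the scaling is written out with NumPy for clarity; scikit-learn's `StandardScaler` would serve the same purpose. The paper does not specify frame or window settings, so librosa's defaults are assumed here.

```python
import numpy as np


def mfcc_statistics(mfcc):
    """Collapse a (20, n_frames) MFCC matrix into a 40-dimensional vector."""
    mfcc = np.asarray(mfcc, dtype=float)
    # Per-coefficient mean and standard deviation over time, concatenated.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def segment_features(samples, sample_rate=16000, n_mfcc=20):
    """Compute the MFCC statistics for one segment's waveform."""
    import librosa  # imported lazily; only needed when real audio is used

    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc_statistics(mfcc)


def standardize(features, eps=1e-8):
    """Scale each feature column to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)
```

Stacking the 40-dimensional vectors of all segments row by row gives the matrix that `standardize` expects.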

Consensus K-Means Clustering

Instead of relying on a single K-Means execution, clustering is performed with multiple random seeds.

A voting mechanism across the runs determines the final speaker label of each segment:

Speaker 0
Speaker 1

This ensures stable speaker identification across repeated runs.
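
One way to realize such a consensus scheme is sketched below. The paper does not describe how labels from different runs are reconciled, so the details here are assumptions: each run is aligned to the first run by maximizing label overlap (Hungarian assignment), a majority vote is taken per segment, and the final labels are renumbered by first appearance so that the earliest segment is always Speaker 0. The seed values and function names are likewise illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def align_to_reference(labels, reference, k):
    """Relabel `labels` so its cluster ids best match `reference`."""
    overlap = np.zeros((k, k), dtype=int)
    for i, j in zip(labels, reference):
        overlap[int(i), int(j)] += 1
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = {int(r): int(c) for r, c in zip(rows, cols)}
    return np.array([mapping[int(i)] for i in labels])


def consensus_kmeans(features, k=2, seeds=(0, 1, 2, 3, 4)):
    """Cluster with several seeds and return majority-vote labels."""
    runs = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(features)
            for s in seeds]
    aligned = np.stack([align_to_reference(r, runs[0], k) for r in runs])
    # votes[c, n] = number of runs that assigned segment n to cluster c
    votes = np.stack([(aligned == c).sum(axis=0) for c in range(k)])
    final = votes.argmax(axis=0)
    # Renumber by first appearance so the output does not depend on
    # which arbitrary id K-Means gave to each cluster.
    order = {}
    for label in final:
        order.setdefault(int(label), len(order))
    return np.array([order[int(label)] for label in final])
```

The alignment step matters because K-Means cluster ids are arbitrary: two runs can find the same partition yet swap the names of the clusters, which would otherwise corrupt the vote.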

4. Sentiment Analysis

After transcription:

Text Vectorization

Text is converted into numerical features using TF-IDF.

Classification

A machine learning model (e.g., Logistic Regression) predicts sentiment probabilities.
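
A minimal sketch of the vectorization and classification steps, assuming a scikit-learn pipeline of `TfidfVectorizer` and `LogisticRegression` with default settings. The tiny training set is invented purely for illustration and does not reflect the data used in the paper; `positive_probabilities` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented purely for illustration.
train_texts = [
    "great results, very happy with the outcome",
    "this works well and looks promising",
    "excellent progress from the team",
    "terrible delay, very disappointed",
    "this is broken and frustrating",
    "poor quality and bad support",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# TF-IDF features feed directly into the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)


def positive_probabilities(model, texts, positive_label="positive"):
    """Return the positive-class probability for each text."""
    column = list(model.classes_).index(positive_label)
    return model.predict_proba(texts)[:, column]
```

Looking up the column through `classes_` avoids assuming which position the positive class occupies in the probability output.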

Fallback Mechanism

If the classifier does not expose probability estimates, the predicted label is mapped to a probability:

Positive → 1.0
Negative → 0.0

This prevents runtime failures.
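
The fallback can be sketched as a small wrapper that checks whether the model offers `predict_proba` and otherwise maps the hard label to 1.0 or 0.0. The function name, the `"positive"` label string, and the `LabelOnlyModel` stand-in are hypothetical and exist only to demonstrate the fallback path.

```python
def positive_probability(model, text, positive_label="positive"):
    """Return the positive-class probability for one text."""
    if hasattr(model, "predict_proba"):
        classes = list(model.classes_)
        row = model.predict_proba([text])[0]
        return float(row[classes.index(positive_label)])
    # Fallback: no probability estimates, so map the hard label to 1.0 / 0.0.
    label = model.predict([text])[0]
    return 1.0 if label == positive_label else 0.0


class LabelOnlyModel:
    """Hypothetical stand-in for a classifier without predict_proba."""

    def predict(self, texts):
        return ["positive" if "good" in t.lower() else "negative"
                for t in texts]
```

With this mapping, a label-only classifier never produces a value inside the neutral band, so every segment it scores is treated as clearly positive or clearly negative.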

Sentiment Classification Logic

To reduce false emotional labeling, the framework applies two confidence thresholds to the predicted positive-class probability (Ppos):

Probability (Ppos)	Sentiment
> 0.65	Positive
0.35 – 0.65	Neutral
< 0.35	Negative

This ensures that objective or conversational statements are not incorrectly classified as positive or negative.
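
The thresholding rule above can be written as a short function. The name `classify_sentiment` and its parameter names are illustrative; values exactly equal to 0.35 or 0.65 are treated as Neutral, which follows the strict inequalities in the table.

```python
def classify_sentiment(p_pos, lower=0.35, upper=0.65):
    """Map the positive-class probability to a three-way sentiment label."""
    if p_pos > upper:
        return "Positive"
    if p_pos < lower:
        return "Negative"
    return "Neutral"  # lower <= p_pos <= upper
```

For example, probabilities of 0.88, 0.21, and 0.52 map to Positive, Negative, and Neutral respectively.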

System Implementation

AI/ML Stack

Python 3.10+
Flask
faster-whisper
librosa
NumPy
scikit-learn
TF-IDF
Logistic Regression

Backend

Spring Boot
REST APIs
PostgreSQL/MongoDB

Functions include:

Media upload
Audio processing
Report generation

Frontend

Angular or React
Interactive file uploads
Real-time transcript visualization
Speaker timelines and sentiment dashboards

Results and Evaluation

The framework successfully performs:

Speech transcription
Speaker identification
Sentiment classification

Example Results

Text	Speaker	Probability (Ppos)	Sentiment
“The implementation metrics look highly promising.”	Speaker 1	0.88	Positive
“We need to re-evaluate the model coefficients.”	Speaker 0	0.21	Negative
“The system completed the run at fourteen frames per second.”	Speaker 1	0.52	Neutral

Key Findings

Speaker separation is reliable using the consensus clustering method.
Neutral thresholding reduces misclassification of ordinary speech.
CPU-based execution remains efficient, avoiding the heavy resource requirements of deep multimodal systems.

Conclusion

This paper demonstrates an effective, production-ready framework for multi-speaker audio transcription, diarization, and sentiment tracking. By using an optimized faster-whisper transformer alongside a highly reliable Consensus K-Means clustering approach, the system achieves stable speaker tracking and accurate linguistic sentiment modeling. Implementing a dedicated neutral classification window prevents model polarization bias, delivering dependable performance on commodity CPU setups.

References

[1] N. Dhariwal, S. C. Akunuri, and K. Sharmila Banu, “Audio and Text Sentiment Analysis of Radio Broadcasts,” IEEE Access, vol. 11, pp. 145–156, 2023.
[2] Z. Guo, T. Jin, W. Xu, W. Lin, Y. Wu, “Bridging the Gap for Test-Time Multimodal Sentiment Analysis,” in Proc. AAAI Conf. Artificial Intelligence, 2025, pp. 11234–11243.
[3] Y. Mao, Q. Liu, Y. Zhang, “Sentiment Analysis Methods, Applications, and Challenges: A Systematic Review,” Journal of King Saud University – Computer and Information Sciences, vol. 36, no. 2, pp. 1019–1039, 2024.
[4] B. T. Atmaja, A. Sasou, “Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations,” Sensors, vol. 22, no. 14, pp. 5410–5422, 2022.
[5] S. Chen, Y. Wu, J. Wu, M. Zhang, X. Wu, J. Li, “UniSpeech-SAT: Universal Speech Representation Learning with Speaker-Aware Pre-Training,” in Proc. IEEE ICASSP, 2022, pp. 3452–3456.
[6] Y. Jia, X. Chen, J. Yu, L. Wang, Y. Xu, S. Liu, Y. Wang, “Speaker Recognition Based on Characteristic Spectrograms and AC-SOM,” Complex Intelligent Systems, vol. 7, no. 4, pp. 1823–1837, 2021.
[7] Y. H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. P. Morency, R. Salakhutdinov, “Multimodal Transformer for Unaligned Multimodal Language Sequences (MulT),” EMNLP 2019.
[8] S. Maghilnan, M. R. Kumar, “Sentiment Analysis on Speaker Specific Speech Data,” I2C2 2017.

Copyright

Copyright © 2026 Mayur Ankushrao, Sanket Pawar, Vishal Mule, Aniket Markad, Prof. S. V. Shinde. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET83557

Publish Date : 2026-06-10

ISSN : 2321-9653

Publisher Name : IJRASET