In the digital era, online movie reviews have become a key platform for audiences to share their opinions and sentiments. While many sentiment analysis systems focus exclusively on text-based data, they often miss the subtle emotional signals conveyed through speech. This paper introduces a voice-based hybrid sentiment analysis model that combines both acoustic features and textual content to enhance sentiment classification accuracy. The system integrates machine learning algorithms such as Support Vector Machine (SVM), Naïve Bayes, and Linear Regression to create a robust hybrid model. Acoustic data is analyzed to extract prosodic and spectral features like pitch, energy, and Mel Frequency Cepstral Coefficients (MFCCs), while Natural Language Processing (NLP) techniques are employed to process transcribed text. By merging both audio and text features, the model improves sentiment polarity detection accuracy. Experimental results on publicly available datasets show that this hybrid approach outperforms traditional single-modality methods. This research emphasizes the value of multi-modal sentiment analysis and paves the way for more emotionally intelligent human-computer interactions.
Introduction
Objective
Traditional sentiment analysis relies primarily on text-based data and therefore misses the emotional depth conveyed through vocal features such as tone, pitch, and rhythm. This work introduces a hybrid sentiment analysis model that integrates textual and acoustic features from spoken movie reviews to improve both accuracy and emotional understanding.
Key Features of the Proposed System
Hybrid Sentiment Detection:
Combines textual sentiment (from transcribed speech) with emotional cues from audio signals.
Enhances classification accuracy and emotional nuance detection.
Machine Learning Algorithms Used:
Support Vector Machine (SVM)
Naïve Bayes
Linear Regression
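As a rough illustration of how the listed classifiers could be applied, the sketch below trains an SVM and a Naïve Bayes model on TF-IDF features of transcribed reviews using scikit-learn; the tiny dataset and labels are placeholders, not part of the reported experiments.

```python
# Minimal sketch: train the listed classifiers on TF-IDF features of
# transcribed reviews. The tiny dataset below is an illustrative placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = [
    "the plot was gripping and the cast was brilliant",
    "flat acting and a dull, predictable script",
    "a beautiful score and a moving story",
    "two hours I will never get back",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

svm_clf = LinearSVC().fit(X, labels)      # Support Vector Machine
nb_clf = MultinomialNB().fit(X, labels)   # Naive Bayes

# A regression model (e.g. sklearn.linear_model.LinearRegression) could
# similarly be fitted to a continuous polarity score instead of class labels.
test = vectorizer.transform(["the dialogue felt dull"])
print(svm_clf.predict(test), nb_clf.predict(test))
```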
Acoustic Features Extracted:
MFCC (Mel Frequency Cepstral Coefficients)
Pitch
Energy
Extracted using tools like OpenSMILE
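As an illustrative sketch (using librosa here in place of OpenSMILE, purely for brevity), the snippet below extracts MFCC, energy, and pitch contours from a clip and summarises them into a fixed-length acoustic feature vector; the file path is a placeholder.

```python
# Illustrative frame-level feature extraction with librosa; "review.wav"
# is a placeholder path for a recorded spoken review.
import numpy as np
import librosa

y, sr = librosa.load("review.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral shape (MFCCs)
energy = librosa.feature.rms(y=y)                    # frame-wise energy
f0, voiced_flag, _ = librosa.pyin(                   # pitch contour (F0)
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Summarise each feature over time so every clip yields a fixed-length vector.
acoustic_vector = np.hstack([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [energy.mean(), energy.std()],
    [np.nanmean(f0), np.nanstd(f0)],
])
```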
Textual Features Processed:
Speech is converted to text using tools like Google Speech-to-Text, Whisper, or Vosk.
NLP models such as BERT, RoBERTa, or LSTM-based classifiers are used for sentiment classification.
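A minimal sketch of this text branch, assuming the open-source openai-whisper package for transcription and a Hugging Face transformers pipeline (which loads a fine-tuned BERT-family sentiment model by default); the audio path is a placeholder.

```python
# Sketch of the text branch: transcribe speech with Whisper, then classify the
# transcript with a BERT-family sentiment model via the transformers pipeline.
import whisper
from transformers import pipeline

asr_model = whisper.load_model("base")                 # openai-whisper ASR
transcript = asr_model.transcribe("review.wav")["text"]

sentiment = pipeline("sentiment-analysis")             # default fine-tuned BERT-family model
print(transcript, sentiment(transcript))
```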
System Workflow
Voice Input: Users provide spoken reviews.
Speech-to-Text: Audio is transcribed using ASR tools such as Google Speech-to-Text, Whisper, or Vosk.
Text Preprocessing: Cleaning, tokenization, and formatting for analysis.
Text Sentiment Analysis: NLP models classify the sentiment of the transcript.
Acoustic Feature Extraction: Prosodic and spectral features (pitch, energy, MFCCs) are extracted from the original audio.
Feature Fusion and Classification: Text and acoustic features are combined, and the hybrid classifiers produce the final sentiment polarity, as sketched below.
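The fusion step can be sketched as simple feature-level (early) fusion: each review's acoustic summary vector is concatenated with its text feature vector, and a single classifier is trained on the joined representation. The random arrays below merely stand in for the extraction steps described above.

```python
# Sketch of feature-level (early) fusion: concatenate acoustic and text
# features per review, then train one SVM on the joined representation.
# The random features and labels are placeholders for the real pipeline.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_reviews = 40
acoustic_features = rng.normal(size=(n_reviews, 30))   # e.g. MFCC/energy/pitch statistics
text_features = rng.normal(size=(n_reviews, 100))      # e.g. TF-IDF or embedding vectors
labels = rng.integers(0, 2, size=n_reviews)            # 1 = positive, 0 = negative

fused = np.hstack([acoustic_features, text_features])  # early fusion by concatenation
clf = SVC(kernel="rbf").fit(fused, labels)
print(clf.predict(fused[:5]))
```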
Existing Systems vs. Proposed Approach
Existing Systems                        | Proposed Hybrid Model
Text-only sentiment detection           | Multimodal (text + audio) sentiment
Ignores vocal emotion                   | Detects tone, pitch, emotion
Lower accuracy in emotional subtleties  | More precise emotional sentiment analysis
Conclusion
The development of a voice-based hybrid sentiment analysis system for movie reviews demonstrates the potential of combining acoustic and linguistic features to achieve more accurate and insightful sentiment detection. By leveraging both the emotional tone in speech and the textual content obtained through automatic speech recognition, the system offers a richer understanding of user opinions. This approach not only enhances sentiment classification performance but also opens new avenues for applications in voice-driven interfaces, entertainment feedback systems, and personalized user experiences. As technology evolves, integrating advanced models and real-time capabilities will further strengthen the effectiveness and adaptability of such systems.