In the digital era, online movie reviews have become a key platform for audiences to share their opinions and sentiments. While many sentiment analysis systems focus exclusively on text-based data, they often miss the subtle emotional signals conveyed through speech. This paper introduces a voice-based hybrid sentiment analysis model that combines both acoustic features and textual content to enhance sentiment classification accuracy. The system integrates machine learning algorithms such as Support Vector Machine (SVM), Naïve Bayes, and Linear Regression to create a robust hybrid model. Acoustic data is analyzed to extract prosodic and spectral features like pitch, energy, and Mel Frequency Cepstral Coefficients (MFCCs), while Natural Language Processing (NLP) techniques are employed to process transcribed text. By merging both audio and text features, the model improves sentiment polarity detection accuracy. Experimental results on publicly available datasets show that this hybrid approach outperforms traditional single-modality methods. This research emphasizes the value of multi-modal sentiment analysis and paves the way for more emotionally intelligent human-computer interactions.
Introduction
Objective
Traditional sentiment analysis relies primarily on text-based data and therefore misses the emotional depth conveyed through vocal features such as tone, pitch, and rhythm. This work introduces a hybrid sentiment analysis model that integrates textual and acoustic features from spoken movie reviews to improve both accuracy and emotional understanding.
Key Features of the Proposed System
Hybrid Sentiment Detection:
Combines textual sentiment (from transcribed speech) with emotional cues from audio signals.
Enhances classification accuracy and emotional nuance detection.
Machine Learning Algorithms Used:
Support Vector Machine (SVM)
Naïve Bayes
Linear Regression
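As a rough illustration of how the listed classifiers could be applied, the sketch below trains an SVM and a Naïve Bayes model on TF-IDF features of transcribed reviews using scikit-learn; the tiny dataset and labels are placeholders, not part of the reported experiments.

```python
# Minimal sketch: train the listed classifiers on TF-IDF features of
# transcribed reviews. The tiny dataset below is an illustrative placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = [
    "the plot was gripping and the cast was brilliant",
    "flat acting and a dull, predictable script",
    "a beautiful score and a moving story",
    "two hours I will never get back",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

svm_clf = LinearSVC().fit(X, labels)      # Support Vector Machine
nb_clf = MultinomialNB().fit(X, labels)   # Naive Bayes

# A regression model (e.g. sklearn.linear_model.LinearRegression) could
# similarly be fitted to a continuous polarity score instead of class labels.
test = vectorizer.transform(["the dialogue felt dull"])
print(svm_clf.predict(test), nb_clf.predict(test))
```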
Acoustic Features Extracted:
MFCC (Mel Frequency Cepstral Coefficients)
Pitch
Energy
Extracted using tools like OpenSMILE
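As an illustrative sketch (using librosa here in place of OpenSMILE, purely for brevity), the snippet below extracts MFCC, energy, and pitch contours from a clip and summarises them into a fixed-length acoustic feature vector; the file path is a placeholder.

```python
# Illustrative frame-level feature extraction with librosa; "review.wav"
# is a placeholder path for a recorded spoken review.
import numpy as np
import librosa

y, sr = librosa.load("review.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral shape (MFCCs)
energy = librosa.feature.rms(y=y)                    # frame-wise energy
f0, voiced_flag, _ = librosa.pyin(                   # pitch contour (F0)
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Summarise each feature over time so every clip yields a fixed-length vector.
acoustic_vector = np.hstack([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [energy.mean(), energy.std()],
    [np.nanmean(f0), np.nanstd(f0)],
])
```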
Textual Features Processed:
Speech is converted to text using tools like Google Speech-to-Text, Whisper, or Vosk.
NLP models such as BERT, RoBERTa, or LSTM-based classifiers are used for sentiment classification.
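A minimal sketch of this text branch, assuming the open-source openai-whisper package for transcription and a Hugging Face transformers pipeline (which loads a fine-tuned BERT-family sentiment model by default); the audio path is a placeholder.

```python
# Sketch of the text branch: transcribe speech with Whisper, then classify the
# transcript with a BERT-family sentiment model via the transformers pipeline.
import whisper
from transformers import pipeline

asr_model = whisper.load_model("base")                 # openai-whisper ASR
transcript = asr_model.transcribe("review.wav")["text"]

sentiment = pipeline("sentiment-analysis")             # default fine-tuned BERT-family model
print(transcript, sentiment(transcript))
```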
System Workflow
Voice Input: Users provide spoken reviews.
Speech-to-Text: Audio is transcribed using ASR tools such as Google Speech-to-Text, Whisper, or Vosk.
Text Preprocessing: Cleaning, tokenization, and formatting for analysis.
Text Sentiment Analysis: NLP models classify the sentiment of the transcript.
Acoustic Feature Extraction: Prosodic and spectral features (pitch, energy, MFCCs) are extracted from the original audio.
Feature Fusion and Classification: Text and acoustic features are combined, and the hybrid classifiers produce the final sentiment polarity, as sketched below.
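The fusion step can be sketched as simple feature-level (early) fusion: each review's acoustic summary vector is concatenated with its text feature vector, and a single classifier is trained on the joined representation. The random arrays below merely stand in for the extraction steps described above.

```python
# Sketch of feature-level (early) fusion: concatenate acoustic and text
# features per review, then train one SVM on the joined representation.
# The random features and labels are placeholders for the real pipeline.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_reviews = 40
acoustic_features = rng.normal(size=(n_reviews, 30))   # e.g. MFCC/energy/pitch statistics
text_features = rng.normal(size=(n_reviews, 100))      # e.g. TF-IDF or embedding vectors
labels = rng.integers(0, 2, size=n_reviews)            # 1 = positive, 0 = negative

fused = np.hstack([acoustic_features, text_features])  # early fusion by concatenation
clf = SVC(kernel="rbf").fit(fused, labels)
print(clf.predict(fused[:5]))
```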
Existing Systems vs. Proposed Approach
Existing Systems                        | Proposed Hybrid Model
Text-only sentiment detection           | Multimodal (text + audio) sentiment
Ignores vocal emotion                   | Detects tone, pitch, emotion
Lower accuracy in emotional subtleties  | More precise emotional sentiment analysis
Conclusion
The development of a voice-based hybrid sentiment analysis system for movie reviews demonstrates the potential of combining acoustic and linguistic features to achieve more accurate and insightful sentiment detection. By leveraging both the emotional tone in speech and the textual content obtained through automatic speech recognition, the system offers a richer understanding of user opinions. This approach not only enhances sentiment classification performance but also opens new avenues for applications in voice-driven interfaces, entertainment feedback systems, and personalized user experiences. As technology evolves, integrating advanced models and real-time capabilities will further strengthen the effectiveness and adaptability of such systems.