Twitter Insight: A Comprehensive Pre-processing Approach for Twitter Sentiment Analysis

Authors: P. Yashwanth, P. Shashank, M. Prakash, P. Mallikarjun

DOI Link: https://doi.org/10.22214/ijraset.2025.70023

Abstract

The rapid growth of online news requires efficient systems that classify content and determine its underlying sentiment. The proposed integrated system uses Natural Language Processing (NLP) methods to sort news articles into designated categories (political, sports, business, and entertainment) and to analyze their sentiment at the same time. The preprocessing pipeline tokenizes the text, removes stop words, and then applies lemmatization. Because machine learning models require numerical inputs, TF-IDF vectorization converts the text into numerical features. Several classifiers, including Logistic Regression, Decision Trees, and XGBoost, are compared to find the best method for classifying news content. Sentiment assessment relies on lexicon-based methods integrated into the system. A web-based Streamlit application provides an accessible interface to the workflow: users submit news articles and receive immediate feedback on category and sentiment, along with visualizations such as word clouds and sentiment distribution graphs. Standard performance metrics indicate that the system identifies news categories and sentiment reliably. The tool serves both readers who want organized news with sentiment analysis and researchers analyzing media content. Future development includes deep learning models and expansion to multilingual data to improve both classification accuracy and operational scope.

I. Introduction

Overview

The rapid expansion of digital news has led to information overload, making it difficult for readers to extract key insights. This challenge is addressed using Natural Language Processing (NLP) techniques for automated news classification and sentiment analysis, helping both media organizations and the general public understand emotional and thematic content in large volumes of articles. A user-friendly Streamlit-based web interface allows interactive analysis and enhances accessibility for non-technical users.

A. Key Challenges

Manual classification and sentiment interpretation are inefficient and inconsistent for large-scale data.
Human annotators face difficulties due to subjectivity and evolving language patterns like slang and abbreviations.

These challenges highlight the necessity for automated systems to process and interpret news content accurately and efficiently.

B. Importance of Sentiment Analysis

Understanding public sentiment is crucial in various domains:

Politics: Gauging voter opinions for campaign strategies.
Business: Aligning product development with customer sentiment.
Public Health: Crafting effective communication during health crises.

Sentiment analysis empowers decision-makers with real-time public opinion data.

C. Motivation for an Integrated System

Combining classification and sentiment analysis into a single platform enhances content understanding. An interactive web interface allows users to:

Input news text
Receive immediate categorization and emotional feedback

This integrated solution streamlines analysis and supports broader public use.

II. Literature Review

DLCTC (BiLSTM + TextCNN) outperforms traditional models in classifying tweets by combining local and global context features.
CNNs, RNNs, and Transformers are shown to be effective for sentiment analysis, especially when paired with noise reduction.
Traditional ML techniques remain competitive in low-data scenarios.
VADER sentiment analysis effectively tracks public health discourse on platforms like Twitter.

III. Methodology

Key Preprocessing Steps:

Tokenization – Breaks text into words, hashtags, and emojis for effective analysis.
Lowercasing – Normalizes letter case so that variants of the same word match.
Stop-word Removal – Eliminates non-informative words like "the" and "and".
Stemming – Reduces words to a root form by stripping suffixes (e.g., "played" → "play").
Lemmatization – More precise than stemming; uses vocabulary and grammatical information to identify the base word (e.g., "better" → "good").
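
The steps above can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the authors' implementation: the stop-word list, the lemma lookup table, the suffix-stripping rule, and the function names (`tokenize`, `naive_stem`, `preprocess`) are all assumptions made for the example. A production system would typically rely on an established stemmer and lemmatizer instead.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "a", "an", "is", "are", "to", "of", "in", "than"}

# Toy lookup table standing in for a dictionary-based lemmatizer (assumption).
LEMMA_LOOKUP = {"better": "good", "worse": "bad"}

# Matches hashtags/mentions, plain words, or single symbols such as emojis.
TOKEN_PATTERN = re.compile(r"[#@]\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into words, hashtags, mentions, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lowercase, drop stop words, then lemmatize or stem."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    normalized = []
    for t in tokens:
        if t in LEMMA_LOOKUP:
            normalized.append(LEMMA_LOOKUP[t])   # lemma takes priority
        elif t.isalpha():
            normalized.append(naive_stem(t))     # fall back to stemming
        else:
            normalized.append(t)                 # keep hashtags, emojis as-is
    return normalized


print(preprocess("The team played better than expected #WorldCup"))
# → ['team', 'play', 'good', 'expect', '#worldcup']
```

In this sketch the lemma lookup is consulted before stemming, so irregular forms such as "better" are mapped to their base word rather than being stripped mechanically.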

IV. Experiments and Results

A. Word Cloud

Visualizes most frequent terms in the dataset (e.g., “covid,” “death,” “politics”).

B. Category Distribution (Pie Chart)

The data is split into six categories: positive, political, disaster, terror, riot, and protest.
Protest is the least represented (7.5%), revealing class imbalance.
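
A class-share check like the one behind the pie chart can be sketched as follows. The helper names (`class_shares`, `least_represented`) and the toy labels are hypothetical and do not reproduce the paper's dataset; the sketch only shows how per-category percentages expose imbalance.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the dataset as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def least_represented(labels):
    """Return the (label, percentage) pair with the smallest share."""
    shares = class_shares(labels)
    label = min(shares, key=shares.get)
    return label, shares[label]


# Toy labels for illustration only (not the paper's data).
toy_labels = ["political"] * 5 + ["disaster"] * 3 + ["protest"] * 2
print(class_shares(toy_labels))
print(least_represented(toy_labels))
```

When one category holds a much smaller share than the others, plain accuracy can look good even if that category is rarely predicted correctly, which is one reason to report a class-balanced metric as well.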

C. Sentiment Distribution

Negative sentiment dominates, followed by positive; neutral has the fewest samples.

D. Accuracy Comparison

XGBoost and Logistic Regression perform best with high balanced accuracy.
Decision Trees tend to overfit.
K-Nearest Neighbors underperforms, likely because distance-based methods are sensitive to the structure of a high-dimensional feature space.
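
Balanced accuracy is commonly defined as the mean of per-class recall, so every class contributes equally regardless of its size. The following is a minimal sketch under that definition; the function name and the toy labels are assumptions for illustration, and the source does not state how the metric was computed in the experiments.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example: class 1 is rare and only half of its samples are recovered.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
print(balanced_accuracy(y_true, y_pred))  # → 0.75
```

In the toy example plain accuracy is 5/6, while balanced accuracy drops to 0.75 because the minority class is recovered only half of the time.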

E. Confusion Matrix

Training data shows high accuracy.
Test data shows some confusion among classes 0, 2, and 3, indicating room for improvement in generalization.
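
A confusion matrix of this kind can be built and inspected with a few lines of code. The sketch below is illustrative only: the helper names (`confusion_matrix`, `most_confused_pairs`) and the toy predictions are hypothetical, and the class ids are reused from the text purely as labels.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def most_confused_pairs(matrix, labels, top=3):
    """List the largest off-diagonal cells as (count, true, predicted)."""
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)[:top]


# Toy predictions for illustration only (not the paper's results).
labels = [0, 2, 3]
y_true = [0, 0, 2, 2, 3, 3]
y_pred = [0, 2, 2, 3, 3, 2]
cm = confusion_matrix(y_true, y_pred, labels)
print(cm)
print(most_confused_pairs(cm, labels))
```

Comparing the off-diagonal cells of the training and test matrices is one simple way to see which class pairs account for the generalization gap.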

V. Conclusion

This research presents a complete solution for Twitter sentiment analysis that combines a preprocessing pipeline with the XGBoost algorithm. Text normalization (tokenization, lowercasing, stop-word removal, stemming, and lemmatization) allows the model to handle Twitter data, which is noisy, informal, and unstructured. TF-IDF weighting emphasizes informative terms when building features, strengthening the signal and reducing noise in the dataset. With these optimized inputs, the XGBoost classifier outperforms the traditional classifiers Logistic Regression and Random Forest. The system achieved high accuracy and F1-score, which indicates its readiness for brand monitoring, trend prediction, and public sentiment analysis applications. Its efficiency and wide range of applications make the approach practical for large-scale social media analytics.

Several directions are promising for enhancing the existing sentiment analysis tool. Incorporating deep learning models such as LSTM, GRU, and BERT would enable more advanced language analysis that detects subtleties beyond the current system's capabilities, bringing the tool closer to a human reader's understanding of emotion in text. Multilingual support is also essential: interpreting sentiment in texts written in multiple languages would enable global-scale social media monitoring of public opinion and emotional reactions to events and products.

Integrating real-time data streaming tools such as Apache Kafka or Spark Streaming would boost the tool's efficiency. With this modification, the tool could supply immediate sentiment assessments during live events, including elections and crises, providing actionable data in real time.

Copyright

Copyright © 2025 P. Yashwanth, P. Shashank, M. Prakash, P. Mallikarjun. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET70023

Publish Date : 2025-04-30

ISSN : 2321-9653

Publisher Name : IJRASET
