The rapid growth of social media has produced an enormous volume of user opinions expressed in text. In the Indian digital space, these opinions are often written in Hinglish, a code-mixed language that combines Hindi and English words in the Roman script. Conventional Natural Language Processing (NLP) techniques are typically designed for a single language, such as standard English or pure Hindi, and therefore struggle with such mixed-language expressions. This research investigates sentiment identification in Hinglish comments commonly found on online platforms. The study explores the linguistic characteristics of code-mixed text, including inconsistent spelling, informal slang, and mixed grammatical structures. To address these challenges, the research examines machine learning–based approaches for classifying sentiment in Hinglish data collected from social media and online reviews. The proposed methodology includes data preprocessing, text normalization, and feature extraction to transform textual content into machine-interpretable representations. Several supervised learning models are analyzed to evaluate their ability to distinguish between positive, negative, and neutral sentiments. The findings highlight the importance of developing language-aware NLP techniques for multilingual environments and demonstrate how specialized models can improve sentiment analysis for Indian code-mixed communication.
Introduction
Social media platforms such as Twitter and YouTube, along with e-commerce review sections, have made Hinglish a dominant form of online communication in India. Existing sentiment analysis systems work well for single languages but struggle with Hinglish because of code-mixing, inconsistent spelling, slang, emojis, and the lack of large labeled datasets.
The main goal of the study is to build a system that can accurately classify Hinglish text into positive, negative, or neutral sentiment.
Proposed approach:
The system uses a standard machine learning pipeline:
Data collection from YouTube, Twitter, and review platforms
Preprocessing to clean text, normalize spelling variations, and handle slang/code-mixing
Feature extraction using TF-IDF, Word2Vec, and FastText
Model training using Logistic Regression, SVM, Random Forest, and Naïve Bayes
Evaluation using accuracy, precision, recall, and macro F1-score
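The feature extraction step of the pipeline above can be illustrated with a small sketch. This is a minimal TF-IDF computation using only the Python standard library, not the study's implementation. The example comments, the function name tfidf_vectors, whitespace tokenization, and the smoothed IDF variant are all assumptions chosen for illustration; a real system would more likely use an established library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (sorted vocabulary, list of sparse TF-IDF dicts), one dict per document."""
    # Assumption: simple lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so long and short comments are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return sorted(df), vectors


# Made-up Hinglish comments, for illustration only.
comments = [
    "movie bahut acchi thi",
    "movie bilkul bakwas thi",
    "service acchi hai",
]
vocab, vectors = tfidf_vectors(comments)
```

In this toy corpus, "bakwas" appears in only one comment while "movie" appears in two, so "bakwas" receives the larger weight in the second vector. That is the property that lets a classifier focus on rarer, sentiment-bearing words.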
Methodology highlights:
Uses an end-to-end pipeline from raw text to sentiment prediction
Handles challenges like transliteration and inconsistent Hinglish spelling
Compares multiple feature-model combinations to find the best performer
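To make the handling of inconsistent Hinglish spelling concrete, the following is a hedged sketch of a normalization step. The variant lexicon VARIANTS, the function name normalize, and the specific rules (dropping URLs and mentions, collapsing elongated characters, keeping only Latin letters) are hypothetical choices for illustration, not the rules used in the study. In particular, this sketch discards emojis and punctuation, which a real system may want to keep as sentiment cues.

```python
import re

# Hypothetical variant lexicon, for illustration only. A real lexicon would be
# curated by hand or learned from data.
VARIANTS = {
    "acha": "accha",
    "achha": "accha",
    "bhut": "bahut",
    "bohot": "bahut",
    "nhi": "nahi",
    "nahin": "nahi",
}


def normalize(text):
    """Lowercase, strip noise, collapse elongations, and map known spelling variants."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    # Collapse any character repeated three or more times to a single one,
    # e.g. "achaaaa" -> "acha". Legitimate double letters are left alone.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    tokens = re.findall(r"[a-z]+", text)        # keep Latin-letter tokens only
    return " ".join(VARIANTS.get(tok, tok) for tok in tokens)


cleaned = normalize("Movie bhut ACHAAAA thi!!! http://t.co/x @user")
```

With these rules, "bhut" and "ACHAAAA" are mapped to the canonical forms "bahut" and "accha", so different spellings of the same word end up as one feature instead of several.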
Expected outcomes:
SVM with TF-IDF is expected to perform best overall
FastText is expected to handle spelling variations and slang better
The neutral class is expected to be the hardest to classify due to ambiguity and limited data
Model performance depends heavily on preprocessing and feature selection
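The expectation about the neutral class also explains why macro F1-score is reported alongside accuracy. The sketch below computes both from scratch on made-up labels in which every neutral example is misclassified. The label data and the function names per_class_f1 and macro_f1 are assumptions for illustration and do not represent results from the study.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def per_class_f1(y_true, y_pred, labels=LABELS):
    """F1 per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1, so each class counts equally."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Made-up example: the two neutral comments are both misclassified.
y_true = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 2
y_pred = ["positive"] * 4 + ["negative"] * 4 + ["positive", "negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
```

Here accuracy is 0.8, but macro F1 is 16/27 (about 0.59) because the neutral class scores zero. Accuracy alone would hide the failure on the minority class, while macro F1 exposes it.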
Conclusion
The study aims to address the gap left by NLP systems that cannot properly handle Hinglish by building a robust, language-aware sentiment analysis framework. It emphasizes that better preprocessing and feature engineering are key to improving performance on code-mixed Indian social media text.