Cyberbullying Detection in Heterogeneous Data Streams Using Pytesseract OCR and HateBERT: A Cross-Modal Approach for Text, Image, and Emoji Interpretation
Authors: Sivadarshini N, Raaja Manickam V, Shree Varshini R, Dr. Santhi Baskaran
With the exponential growth of social media platforms, cyberbullying has become a pressing issue affecting users worldwide. This paper presents a Smart Cyberbullying Detection and Intervention System that leverages advanced AI models to identify and address harmful online behavior. The proposed system employs the HateBERT model to detect toxic content in both textual inputs (such as comments and tweets) and multimodal content such as memes that combine images and text. A multilingual approach allows detection in various languages, making the system inclusive of diverse user bases. Beyond detecting the presence of cyberbullying, the system classifies the type of bullying (e.g., age-based, racial, gender-based, ethnic), enabling more nuanced responses. It also integrates a chatbot-based counselling module that provides real-time, empathetic support to affected users. Built with a scalable architecture, the solution includes components for content analysis, user interaction, and mental health support. This comprehensive framework both improves the accuracy of cyberbullying detection across media types and offers immediate, human-like intervention, promoting safer and more supportive online communities.
Introduction
This paper presents a Multimodal Cyberbullying Detection and Counselling System that detects cyberbullying across text, images, and emojis using a fine-tuned HateBERT model, which is designed for abusive language detection. The system supports multiple languages and classifies the type of bullying (e.g., racism, gender, ethnicity). It integrates OCR to extract text from images (such as memes) and emoji interpretation to capture the full context of online abuse.
A built-in AI chatbot provides real-time, empathetic counselling to victims, enhancing mental health support immediately after bullying detection. The system combines NLP, sentiment analysis, and AI-driven dialogue in a unified platform.
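The detection flow described above (emoji interpretation, OCR text extraction from images, then classification) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the emoji table `EMOJI_MEANINGS`, the `Detection` record, the `detect` function, the `"none"` label, and the 0.5 threshold are all hypothetical. The OCR engine and the classifier are passed in as plain callables so the sketch stays independent of any particular library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

# Hypothetical emoji-to-phrase table; a real system would need far broader coverage.
EMOJI_MEANINGS = {
    "🤡": "clown",
    "🐷": "pig",
    "💩": "pile of poo",
}


def interpret_emojis(text: str) -> str:
    """Replace each known emoji with a bracketed textual description."""
    for symbol, meaning in EMOJI_MEANINGS.items():
        text = text.replace(symbol, f" [{meaning}] ")
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


@dataclass
class Detection:
    combined_text: str   # message text plus OCR text, with emojis spelled out
    label: str           # predicted bullying type, or "none" (assumed label)
    score: float         # classifier confidence for that label
    is_bullying: bool


def detect(
    message: str,
    image: Optional[Any],
    ocr: Callable[[Any], str],
    classify: Callable[[str], Tuple[str, float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Detection:
    """Merge message text and image text, then classify the combined string."""
    parts = [interpret_emojis(message)]
    if image is not None:
        # Text embedded in a meme is extracted by OCR and treated like message text.
        parts.append(interpret_emojis(ocr(image)))
    combined = " ".join(p for p in parts if p)
    label, score = classify(combined)
    flagged = label != "none" and score >= threshold
    return Detection(combined, label, score, flagged)
```

In a real deployment, `ocr` could wrap `pytesseract.image_to_string` applied to a PIL image, and `classify` could wrap a HateBERT checkpoint fine-tuned for the bullying categories; both wrappers are left out here because the paper does not specify them. A positive `is_bullying` result would be the natural trigger for the counselling chatbot.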
Performance evaluation shows strong results:
Cyberbullying detection accuracy of 93.2% with an F1-score of 0.91.
OCR character and word error rates of approximately 5% and 6.3%, respectively.
Chatbot response time averaging 1.1 seconds with a 97% success rate.
End-to-end system latency of 1.5 seconds, with support for up to 40 concurrent users.
Compared to traditional text-only or keyword-based systems, this multimodal, multilingual approach offers superior detection, classification, and victim support, making it well-suited for deployment in schools, gaming communities, and social media platforms.
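For readers unfamiliar with the metrics reported above, the sketch below shows how character error rate, word error rate, and F1-score are conventionally computed. The paper does not state its exact evaluation procedure, so this assumes the standard definitions: error rate as Levenshtein edit distance divided by reference length, and F1 as the harmonic mean of precision and recall. The function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of OCR output against the ground-truth text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, computed over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As an example, if OCR reads the meme caption "you are a loser" as "you are a looser", one of four words is wrong, so `wer` returns 0.25, while `cer` returns 1/15 because a single character was inserted into a 15-character reference.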
Conclusion
This paper presented a Multimodal Cyberbullying Detection and Counselling System that addresses the rising concern of online abuse across social media platforms. By combining OCR-based image text extraction, emoji interpretation, multilingual text analysis, and a HateBERT-powered classification model, the system accurately detects cyberbullying in various formats, including comments, memes, and emoji-rich messages. It classifies bullying by type (age-based, gender-based, religion-based, and others), and its multilingual capabilities enable it to process code-mixed and non-English content effectively. It also offers immediate emotional support through an extensively trained AI counselling chatbot.
While the current implementation performs well across multiple dimensions, several future enhancements are identified to further strengthen the system:
1) Sarcasm and Coded Language Handling: Integration of models trained on sarcastic and contextually disguised bullying content will enhance sensitivity to implicit abuse.
2) Social Media API Integration: Real-time monitoring via direct platform APIs (e.g., Instagram, Twitter) would enable dynamic detection and user protection.
3) Adaptive Feedback Learning: Incorporating feedback loops from users to retrain the model can improve detection accuracy over time.
4) Cloud-Based Deployment: Hosting the system on scalable cloud infrastructure will allow deployment across institutions, public forums, and organizations.
In conclusion, the proposed system provides a comprehensive framework for intelligent, multilingual, and cross-modal cyberbullying detection with embedded emotional support. Its real-world adaptability positions it as a promising tool for promoting digital safety and psychological well-being in increasingly diverse online communities.